Triangle counting
In graph theory, a triangle is a cycle consisting of exactly three vertices that are mutually connected by three edges. Triangle counting aims to count the number of triangles in a graph.
Explanation:
- If the vertices are AAA, BBB, and CCC, then the edges would be (A,B)(A, B)(A,B), (B,C)(B, C)(B,C), and (C,A)(C, A)(C,A), forming a closed loop.
- Triangles are a specific type of 3-clique, where all three vertices are directly connected to each other.
Example:
In a graph with vertices {0, 1, 2}
and edges {(0, 1), (0, 2), (1, 2)}
:
{0, 1, 2}
forms a triangle.
Triangle counting with binary search
Triangle counting using the binary search list method involves finding triangles in a graph by leveraging adjacency lists and binary search. Here’s how you can implement this approach for a single-threaded execution:
Key Idea:
For each edge (u, v)
, find the common neighbors of u
and v
, as these common neighbors form triangles with (u, v)
.
Algorithm:
- Graph Representation: Represent the graph as an adjacency list where the neighbors of each vertex are sorted. This allows efficient binary search.
- Iterate Over Edges: For each edge
(u, v)
, ensureu < v
(to avoid counting the same triangle multiple times). - Find Common Neighbors: Use the sorted adjacency lists of
u
andv
. Perform a binary search to find common neighbors efficiently. - Count Triangles: The number of common neighbors for an edge
(u, v)
gives the number of triangles that include the edge.
Complexity:
- Preprocessing: Sorting adjacency lists: O(E * log(V)).
- Triangle Counting: O(E * log(V)), assuming binary search is used for finding common neighbors.
CUDA coding
1. Kernel Function (tc_kernel
)
The kernel performs triangle counting by processing edges and finding common neighbors using binary search.
1.1. Initialization and Setup:
1 | uint const bid = blockIdx.x; |
bid
: Identifies the edge to process, as blockIdx.x
corresponds to the current block’s ID.
cnt
: A shared counter for counting triangles found by all threads in the block.
u
, v
: The vertices of the edge being processed, retrieved from d_edges
.
Degree-based optimization: If uuu has a higher degree than vvv, swap them to process the vertex with the smaller degree first. This reduces computational cost.
Synchronization: __syncthreads()
ensures all threads are ready before proceeding.
1.2. Triangle Detection Using Binary Search:
1 | for (uint i = threadIdx.x; i < d_degrees[u]; i += blockDim.x) { |
- Outer loop: Each thread processes one neighbor of
u
at a time.w
: A neighbor ofu
.v_nbr
: Pointer to the adjacency list ofv
.
- Binary search: Looks for www in the sorted adjacency list of
v
. If found, a triangle is identified, and the shared counter is incremented atomically. - Work distribution: Each thread handles a subset of
u's
neighbors.
1.3. Final Triangle Count Update:
1 | __syncthreads(); |
Synchronization: Ensures all threads in the block finish their work before the block-level triangle count (cnt
) is added to the global count (d_cnt
).
Atomic update: Avoids race conditions when updating the global triangle counter.
Code
1 | __global__ auto tc_kernel(int const* d_offset, int const* d_edges, int const* d_degrees, int const* d_neighbors, unsigned long long* d_cnt) -> void { |
- Grid and Block Configuration:
- Grid Size (
g->m
): The number of blocks equals the number of edges. This means each block processes one edge. - Block Size (
1024
): Each block uses 1024 threads to process neighbors of the smaller-degree vertex in parallel.
- Grid Size (
- Thread Work Distribution:
- Each thread in a block handles a subset of the neighbors of the vertex with the smaller degree. This is achieved by assigning neighbors in chunks based on
threadIdx.x
andblockDim.x
.
- Each thread in a block handles a subset of the neighbors of the vertex with the smaller degree. This is achieved by assigning neighbors in chunks based on
- Shared Memory:
- Each block uses shared memory (
cnt
) to locally count triangles for its edge. Shared memory allows fast, intra-block communication among threads.
- Each block uses shared memory (
- Global Memory:
- Graph data (like adjacency lists, degrees, and offsets) is stored in global memory (
d_neighbors
,d_offset
), accessible to all threads. Each thread retrieves its necessary data from global memory.
- Graph data (like adjacency lists, degrees, and offsets) is stored in global memory (
- Atomic Operations:
- Within a block: Threads increment the shared counter (
cnt
) atomically when a triangle is found. - Global Counter Update: The block’s local triangle count is added to the global counter (
d_cnt
) atomically to avoid race conditions.
- Within a block: Threads increment the shared counter (
By leveraging these CUDA properties, the program ensures efficient and parallel processing of all edges in the graph while maintaining correctness.