Triangle Counting


In graph theory, a triangle is a cycle consisting of exactly three vertices that are mutually connected by three edges. Triangle counting aims to count the number of triangles in a graph.

Explanation:

  • If the vertices are A, B, and C, then the edges would be (A, B), (B, C), and (C, A), forming a closed loop.
  • Triangles are a specific type of 3-clique, where all three vertices are directly connected to each other.

Example:

In a graph with vertices {0, 1, 2} and edges {(0, 1), (0, 2), (1, 2)}

  • {0, 1, 2} forms a triangle.


Triangle counting using the binary search list method involves finding triangles in a graph by leveraging adjacency lists and binary search. Here’s how you can implement this approach for a single-threaded execution:

Key Idea:

For each edge (u, v), find the common neighbors of u and v, as these common neighbors form triangles with (u, v).

Algorithm:

  1. Graph Representation: Represent the graph as an adjacency list where the neighbors of each vertex are sorted. This allows efficient binary search.
  2. Iterate Over Edges: For each edge (u, v), ensure u < v (to avoid counting the same triangle multiple times).
  3. Find Common Neighbors: Use the sorted adjacency lists of u and v. Perform a binary search to find common neighbors efficiently.
  4. Count Triangles: The number of common neighbors for an edge (u, v) gives the number of triangles that include that edge. Note that with full adjacency lists each triangle is found once per edge, i.e. three times in total, so either divide the final sum by 3 or restrict common neighbors to w > v.
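The four steps above can be sketched as a single-threaded C++ reference implementation. The function name, the edge-list input format, and the w > v restriction (which makes each triangle count exactly once) are choices made for this sketch, not part of the original description:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Single-threaded triangle counting with sorted adjacency lists and
// binary search. Expects an undirected simple graph on n vertices,
// given as an edge list with each undirected edge listed once.
long long count_triangles(int n, std::vector<std::pair<int, int>> edges) {
    // Step 1: build adjacency lists and sort them for binary search.
    std::vector<std::vector<int>> adj(n);
    for (auto [u, v] : edges) {
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    for (auto& nbrs : adj) std::sort(nbrs.begin(), nbrs.end());

    long long cnt = 0;
    // Step 2: iterate over edges; normalize so that u < v.
    for (auto [u, v] : edges) {
        if (u > v) std::swap(u, v);
        // Steps 3-4: every common neighbor w of u and v closes a
        // triangle; requiring w > v counts each triangle exactly once.
        for (int w : adj[u])
            if (w > v && std::binary_search(adj[v].begin(), adj[v].end(), w))
                ++cnt;
    }
    return cnt;
}
```

Scanning the smaller of the two adjacency lists (as the CUDA kernel does) is a further optimization omitted here for brevity.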

Complexity:

  • Preprocessing: Sorting adjacency lists: O(E * log(V)).
  • Triangle Counting: for each edge (u, v), the smaller adjacency list is scanned and each of its entries is binary-searched in the larger list, costing O(min(deg(u), deg(v)) * log(V)) per edge; summed over all edges this is at most O(E * d_max * log(V)), where d_max is the maximum degree.

CUDA coding

1. Kernel Function (tc_kernel)

The kernel performs triangle counting by processing edges and finding common neighbors using binary search.

1.1. Initialization and Setup:

uint const bid = blockIdx.x;
__shared__ unsigned long long cnt;

// Assuming edges are stored as consecutive (u, v) pairs,
// edge bid starts at index 2 * bid.
int u = d_edges[2 * bid];
int v = d_edges[2 * bid + 1];

// Every thread must perform the swap: u and v are per-thread registers,
// so swapping them only in thread 0 would leave the other threads
// with the unswapped values.
if (d_degrees[u] > d_degrees[v]) {
    int tmp = u;
    u = v;
    v = tmp;
}

if (threadIdx.x == 0) {
    cnt = 0;
}

__syncthreads();

bid: Identifies the edge to process, as blockIdx.x corresponds to the current block’s ID.

cnt: A shared counter for counting triangles found by all threads in the block.

u, v: The vertices of the edge being processed, retrieved from d_edges.

Degree-based optimization: If u has a higher degree than v, swap them to process the vertex with the smaller degree first. This reduces computational cost.

Synchronization: __syncthreads() ensures all threads are ready before proceeding.

1.2. Finding Common Neighbors via Binary Search:
for (uint i = threadIdx.x; i < d_degrees[u]; i += blockDim.x) {
    int const w = d_neighbors[d_offset[u] + i];
    int const* v_nbr = d_neighbors + d_offset[v];

    int l = 0, r = d_degrees[v] - 1;
    while (l <= r) {
        int const mid = (l + r) / 2;
        if (v_nbr[mid] == w) {
            atomicAdd(&cnt, 1);
            break;
        } else if (v_nbr[mid] < w) {
            l = mid + 1;
        } else {
            r = mid - 1;
        }
    }
}
  • Outer loop: Each thread processes one neighbor of u at a time.
    • w: A neighbor of u.
    • v_nbr: Pointer to the adjacency list of v.
  • Binary search: Looks for w in the sorted adjacency list of v. If found, a triangle is identified, and the shared counter is incremented atomically.
  • Work distribution: Each thread handles a subset of u's neighbors.

1.3. Final Triangle Count Update:

__syncthreads();
if (threadIdx.x == 0) {
    atomicAdd(d_cnt, cnt);
}

Synchronization: Ensures all threads in the block finish their work before the block-level triangle count (cnt) is added to the global count (d_cnt).

Atomic update: Avoids race conditions when updating the global triangle counter.

Code

__global__ auto tc_kernel(int const* d_offset, int const* d_edges, int const* d_degrees, int const* d_neighbors, unsigned long long* d_cnt) -> void {
    uint const bid = blockIdx.x;
    __shared__ unsigned long long cnt;

    // Assuming edges are stored as consecutive (u, v) pairs,
    // edge bid starts at index 2 * bid.
    int u = d_edges[2 * bid];
    int v = d_edges[2 * bid + 1];

    // Every thread must perform the swap: u and v are per-thread registers,
    // so swapping them only in thread 0 would leave the other threads
    // with the unswapped values.
    if (d_degrees[u] > d_degrees[v]) {
        int tmp = u;
        u = v;
        v = tmp;
    }

    if (threadIdx.x == 0) {
        cnt = 0;
    }

    __syncthreads();

    for (uint i = threadIdx.x; i < d_degrees[u]; i += blockDim.x) {
        // use binary search to find the common neighbors
        int const w = d_neighbors[d_offset[u] + i];
        int const* v_nbr = d_neighbors + d_offset[v];

        int l = 0, r = d_degrees[v] - 1;
        while (l <= r) {
            int const mid = (l + r) / 2;
            if (v_nbr[mid] == w) {
                atomicAdd(&cnt, 1);
                break;
            } else if (v_nbr[mid] < w) {
                l = mid + 1;
            } else {
                r = mid - 1;
            }
        }
    }

    __syncthreads();
    if (threadIdx.x == 0) {
        atomicAdd(d_cnt, cnt);
    }
}
  1. Grid and Block Configuration:
    • Grid Size (g->m): The number of blocks equals the number of edges. This means each block processes one edge.
    • Block Size (1024): Each block uses 1024 threads to process neighbors of the smaller-degree vertex in parallel.
  2. Thread Work Distribution:
    • Each thread in a block handles a subset of the neighbors of the vertex with the smaller degree. This is achieved by assigning neighbors in chunks based on threadIdx.x and blockDim.x.
  3. Shared Memory:
    • Each block uses shared memory (cnt) to locally count triangles for its edge. Shared memory allows fast, intra-block communication among threads.
  4. Global Memory:
    • Graph data (like adjacency lists, degrees, and offsets) is stored in global memory (d_neighbors, d_offset), accessible to all threads. Each thread retrieves its necessary data from global memory.
  5. Atomic Operations:
    • Within a block: Threads increment the shared counter (cnt) atomically when a triangle is found.
    • Global Counter Update: The block’s local triangle count is added to the global counter (d_cnt) atomically to avoid race conditions.

By leveraging these CUDA properties, the program ensures efficient and parallel processing of all edges in the graph while maintaining correctness.
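For completeness, a hypothetical host-side launch sketch follows, assuming tc_kernel from the Code section is in scope. The Graph struct, its field names (n, m, offset, edges, degrees, neighbors), and the final division by 3 are assumptions made for this sketch; the original only specifies the grid size (g->m) and block size (1024):

```cuda
#include <cuda_runtime.h>

// Hypothetical CSR-style graph container; field names are assumptions.
struct Graph {
    int n;            // number of vertices
    int m;            // number of undirected edges
    int* offset;      // adjacency-list offsets, length n + 1
    int* edges;       // edge list stored as (u, v) pairs, length 2 * m
    int* degrees;     // per-vertex degrees, length n
    int* neighbors;   // concatenated sorted adjacency lists, length 2 * m
};

unsigned long long count_triangles_gpu(Graph const* g) {
    int *d_offset, *d_edges, *d_degrees, *d_neighbors;
    unsigned long long* d_cnt;

    cudaMalloc(&d_offset, (g->n + 1) * sizeof(int));
    cudaMalloc(&d_edges, 2 * g->m * sizeof(int));
    cudaMalloc(&d_degrees, g->n * sizeof(int));
    cudaMalloc(&d_neighbors, 2 * g->m * sizeof(int));
    cudaMalloc(&d_cnt, sizeof(unsigned long long));

    cudaMemcpy(d_offset, g->offset, (g->n + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_edges, g->edges, 2 * g->m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_degrees, g->degrees, g->n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_neighbors, g->neighbors, 2 * g->m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_cnt, 0, sizeof(unsigned long long));

    // One block per edge, 1024 threads per block, as described above.
    tc_kernel<<<g->m, 1024>>>(d_offset, d_edges, d_degrees, d_neighbors, d_cnt);

    unsigned long long cnt = 0;
    cudaMemcpy(&cnt, d_cnt, sizeof(unsigned long long), cudaMemcpyDeviceToHost);

    cudaFree(d_offset); cudaFree(d_edges); cudaFree(d_degrees);
    cudaFree(d_neighbors); cudaFree(d_cnt);

    // With full adjacency lists, each triangle is found once per edge,
    // i.e. three times in total, so the raw count is divided by 3.
    return cnt / 3;
}
```

Error checking on the CUDA API calls is omitted for brevity; production code should verify each return value.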

----- End. Thanks for reading -----