Triangle counting

In graph theory, a triangle is a cycle consisting of exactly three vertices that are mutually connected by three edges. Triangle counting aims to count the number of triangles in a graph.

Explanation:

If the vertices are AAA, BBB, and CCC, then the edges would be (A,B)(A, B)(A,B), (B,C)(B, C)(B,C), and (C,A)(C, A)(C,A), forming a closed loop.
Triangles are a specific type of 3-clique, where all three vertices are directly connected to each other.

Example：

In a graph with vertices {0, 1, 2} and edges {(0, 1), (0, 2), (1, 2)}：

{0, 1, 2} forms a triangle.

Triangle counting with binary search

Triangle counting using the binary search list method involves finding triangles in a graph by leveraging adjacency lists and binary search. Here’s how you can implement this approach for a single-threaded execution:

Key Idea:

For each edge (u, v), find the common neighbors of uand v, as these common neighbors form triangles with (u, v).

Algorithm:

Graph Representation: Represent the graph as an adjacency list where the neighbors of each vertex are sorted. This allows efficient binary search.
Iterate Over Edges: For each edge (u, v), ensure u < v (to avoid counting the same triangle multiple times).
Find Common Neighbors: Use the sorted adjacency lists of u and v. Perform a binary search to find common neighbors efficiently.
Count Triangles: The number of common neighbors for an edge (u, v) gives the number of triangles that include the edge.

Complexity:

Preprocessing: Sorting adjacency lists: O(E * log(V)).
Triangle Counting: O(E * log(V)), assuming binary search is used for finding common neighbors.

CUDA coding

1. Kernel Function (`tc_kernel`)

The kernel performs triangle counting by processing edges and finding common neighbors using binary search.

1.1. Initialization and Setup:

uint const bid = blockIdx.x;
__shared__ unsigned long long cnt;

int u = d_edges[bid];
int v = d_edges[bid + 1];

if (threadIdx.x == 0) {
    cnt = 0;

    if (d_degrees[u] > d_degrees[v]) {
        int tmp = u;
        u = v;
        v = tmp;
    }
}

__syncthreads();

bid: Identifies the edge to process, as blockIdx.x corresponds to the current block’s ID.

cnt: A shared counter for counting triangles found by all threads in the block.

u, v: The vertices of the edge being processed, retrieved from d_edges.

Degree-based optimization: If uuu has a higher degree than vvv, swap them to process the vertex with the smaller degree first. This reduces computational cost.

Synchronization: __syncthreads() ensures all threads are ready before proceeding.

1.2. Triangle Detection Using Binary Search:

for (uint i = threadIdx.x; i < d_degrees[u]; i += blockDim.x) {
    int const w = d_neighbors[d_offset[u] + i];
    int const* v_nbr = d_neighbors + d_offset[v];

    int l = 0, r = d_degrees[v] - 1;
    while (l <= r) {
        int const mid = (l + r) / 2;
        if (v_nbr[mid] == w) {
            atomicAdd(&cnt, 1);
            break;
        } else if (v_nbr[mid] < w) {
            l = mid + 1;
        } else {
            r = mid - 1;
        }
    }
}

Outer loop: Each thread processes one neighbor of u at a time.
- w: A neighbor of u.
- v_nbr: Pointer to the adjacency list of v.
Binary search: Looks for www in the sorted adjacency list of v. If found, a triangle is identified, and the shared counter is incremented atomically.
Work distribution: Each thread handles a subset of u's neighbors.

1.3. Final Triangle Count Update:

__syncthreads();
if (threadIdx.x == 0) {
    atomicAdd(d_cnt, cnt);
}

Synchronization: Ensures all threads in the block finish their work before the block-level triangle count (cnt) is added to the global count (d_cnt).

Atomic update: Avoids race conditions when updating the global triangle counter.

Code

__global__ auto tc_kernel(int const* d_offset, int const* d_edges, int const* d_degrees, int const* d_neighbors, unsigned long long* d_cnt) -> void {

    uint const bid = blockIdx.x;
    __shared__ unsigned long long cnt;

    int u = d_edges[bid];
    int v = d_edges[bid + 1];

    if (threadIdx.x == 0) {
        cnt = 0;

        if (d_degrees[u] > d_degrees[v]) {
            int tmp = u;
            u = v;
            v = tmp;
        }
    }

    __syncthreads();


    for (uint i = threadIdx.x; i < d_degrees[u]; i += blockDim.x) {
        // use binary search to find the common neighbors
        int const w = d_neighbors[d_offset[u] + i];
        int const* v_nbr = d_neighbors + d_offset[v];

        int l = 0, r = d_degrees[v] - 1;
        while (l <= r) {
            int const mid = (l + r) / 2;
            if (v_nbr[mid] == w) {
                atomicAdd(&cnt, 1);
                break;
            } else if (v_nbr[mid] < w) {
                l = mid + 1;
            } else {
                r = mid - 1;
            }
        }
    }

    __syncthreads();
    if (threadIdx.x == 0) {
        atomicAdd(d_cnt, cnt);
    }

}

Grid and Block Configuration:
- Grid Size (g->m): The number of blocks equals the number of edges. This means each block processes one edge.
- Block Size (1024): Each block uses 1024 threads to process neighbors of the smaller-degree vertex in parallel.
Thread Work Distribution:
- Each thread in a block handles a subset of the neighbors of the vertex with the smaller degree. This is achieved by assigning neighbors in chunks based on threadIdx.x and blockDim.x.
Shared Memory:
- Each block uses shared memory (cnt) to locally count triangles for its edge. Shared memory allows fast, intra-block communication among threads.
Global Memory:
- Graph data (like adjacency lists, degrees, and offsets) is stored in global memory (d_neighbors, d_offset), accessible to all threads. Each thread retrieves its necessary data from global memory.
Atomic Operations:
- Within a block: Threads increment the shared counter (cnt) atomically when a triangle is found.
- Global Counter Update: The block’s local triangle count is added to the global counter (d_cnt) atomically to avoid race conditions.

By leveraging these CUDA properties, the program ensures efficient and parallel processing of all edges in the graph while maintaining correctness.