CUDA Programming

Background

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It enables developers to utilize the powerful computing capabilities of NVIDIA GPUs (Graphics Processing Units) for general-purpose computing, not limited to graphics processing. Here are some key features and components of CUDA:

  1. Parallel Computing Model: CUDA provides a programming model based on parallel computing, allowing computation tasks to be executed in parallel on hundreds or even thousands of GPU cores. It supports large-scale parallel data processing and computation.
  2. Programming Language: The CUDA programming model extends programming languages such as C, C++, and Fortran. Developers can write CUDA kernels (functions that run on GPUs) using these languages; a minimal kernel example is sketched right after this list.
  3. Memory Management: CUDA offers various memory management methods, including global memory, shared memory, constant memory, and texture memory. Shared memory is a very fast memory type that allows threads within the same thread block to share data.
  4. Thread Organization: CUDA organizes parallel computing tasks into grids, with each grid containing multiple thread blocks, and each thread block containing multiple threads. This hierarchical structure of grids and thread blocks allows CUDA to efficiently handle large-scale parallel computations.
  5. Development Tools: NVIDIA provides a complete set of development tools, including the CUDA compiler (nvcc), CUDA libraries (such as cuBLAS, cuFFT, etc.), and debugging and performance analysis tools (such as NVIDIA Nsight).
  6. Application Areas: CUDA is widely used in scientific computing, machine learning, image processing, financial modeling, computer vision, and many other fields. It is particularly suitable for tasks requiring large-scale parallel computation.
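
As a quick illustration of point 2 and of the <<<blocks, threads>>> launch syntax, here is a minimal sketch (the file and kernel names are illustrative; compile with something like nvcc hello.cu -o hello):

#include <cstdio>
#include <cuda_runtime.h>

// A trivial kernel: every thread prints its block and thread index.
__global__ void helloKernel() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Launch 2 blocks of 4 threads each (8 threads in total).
    helloKernel<<<2, 4>>>();
    // Wait for the kernel to finish so the device-side printf output is flushed.
    cudaDeviceSynchronize();
    return 0;
}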

Architecture

The GPU (Graphics Processing Unit) architecture differs from traditional CPU architecture and is specifically designed for large-scale parallel computing tasks. Here’s a detailed introduction to the main components from Grid to Thread, including shared memory:

1. Grid

A Grid is the highest-level organizational unit in a CUDA program. It consists of multiple thread blocks (Blocks), all of which execute the same kernel function in parallel. The dimensions of a Grid can be one-dimensional, two-dimensional, or three-dimensional, depending on the requirements of the computational task. Blocks within a Grid are independent of each other and cannot communicate directly.

2. Block

Each Grid is composed of multiple thread blocks. A thread block is the basic unit of GPU execution scheduling, and all threads within a block can share data and coordinate through synchronization primitives. The size and number of thread blocks affect parallelism and performance, as they are distributed to different multiprocessors for execution.

3. Thread

A thread is the smallest execution unit of a GPU. Each Block consists of multiple threads, which share the same memory space (shared memory) within the Block and can collaborate. CUDA allows developers to define the layout of threads within each Block (e.g., one-dimensional, two-dimensional, or three-dimensional) to better adapt to the geometric structure of the problem.
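
For example, a two-dimensional layout is declared with the built-in dim3 type. The sketch below (the kernel name, the doubling operation, and the 16x16 block shape are illustrative) maps one thread to each pixel of a width x height image:

// Hypothetical 2D kernel: each thread handles one element of a width x height image.
__global__ void process2D(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (x < width && y < height) {
        data[y * width + x] *= 2.0f;                // example operation
    }
}

// Host-side launch configuration:
// dim3 blockDim(16, 16);                                  // 256 threads per block, arranged 16x16
// dim3 gridDim((width + 15) / 16, (height + 15) / 16);    // enough blocks to cover the image
// process2D<<<gridDim, blockDim>>>(d_data, width, height);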

4. SM (Streaming Multiprocessor)

A GPU is composed of multiple SMs, each capable of processing multiple thread blocks simultaneously. An SM is responsible for executing the thread blocks assigned to it and running threads in parallel on its multiple cores. Each SM has its own registers, shared memory, and other resources, which limit the number of thread blocks that can be executed simultaneously.

5. Registers

Registers are the storage closest to the processing cores and are very fast. Each thread has its own set of registers used to store temporary variables. The number of registers is limited, and if too many are used, it will cause register spilling to global memory, reducing performance.

6. Shared Memory

Shared memory is a memory area accessible by all threads within a thread block. It is a high-speed cache located between registers and global memory, allowing efficient collaboration and data sharing among threads within a block. Shared memory access is much faster than global memory, but its capacity is limited, so it needs to be managed carefully.
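
As a small illustration, the sketch below (assuming the array length is a multiple of the tile size; the kernel name and tile size are illustrative) stages one block-sized tile in shared memory and writes it back reversed:

#define TILE 256

// Each block reverses its own TILE-element segment using shared memory.
// Assumes n is a multiple of TILE.
// Launched as reverseTile<<<n / TILE, TILE>>>(d_data, n);
__global__ void reverseTile(float* data, int n) {
    __shared__ float tile[TILE];
    int idx = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = data[idx];              // stage the segment in fast shared memory
    __syncthreads();                            // make sure the whole tile is loaded

    data[idx] = tile[TILE - 1 - threadIdx.x];   // write it back in reversed order
}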

7. Global Memory

Global memory is the largest but relatively slow memory on the GPU. All threads can access global memory, but due to its slower access speed, it is typically used for the bulk input and output data and for data that needs to be shared between multiple Blocks.

8. Constant Memory and Texture Memory

Constant memory is a read-only memory area, typically used to store data that does not change during kernel execution, such as constant coefficients. Texture memory is a specialized memory for image processing, optimized for spatially local access and hardware filtering of multidimensional data.
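
As a small sketch (the array name, its size, and the kernel are illustrative), constant memory is declared with the __constant__ qualifier and filled from the host with cudaMemcpyToSymbol:

// Read-only coefficients, visible to all threads and cached for broadcast reads.
__constant__ float d_coeffs[16];

__global__ void applyCoeffs(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= d_coeffs[idx % 16];   // every thread reads the same small table
    }
}

// On the host, the constant array is filled before the kernel launch:
// float h_coeffs[16] = { /* ... */ };
// cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));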

9. Synchronization and Atomic Operations

In CUDA, threads within a thread block can ensure synchronized execution through synchronization primitives (such as __syncthreads()). Additionally, CUDA provides some atomic operations that allow threads to safely modify shared data.
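
For example, atomicAdd lets many threads increment the same counter without losing updates; the histogram sketch below (the bin count and the assumption of non-negative input values are illustrative) relies on it:

#define NUM_BINS 64

// Each thread classifies one value and atomically increments the matching bin,
// so concurrent updates to the same bin do not lose counts.
__global__ void histogram(const int* values, int* bins, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = values[idx] % NUM_BINS;   // assumes non-negative values
        atomicAdd(&bins[bin], 1);
    }
}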

Computational schemes

CUDA computational schemes mainly consist of three types: thread-centric, block-centric, and warp-centric. These schemes define how tasks are allocated and the granularity of computation. Here’s a detailed description of these three schemes:

1. Thread-Centric

In the thread-centric computational scheme, each thread is responsible for processing an independent task. In this scheme, the granularity of tasks is very fine, fully utilizing the large number of GPU cores for parallel computation. Each thread independently executes its part of the task, which is very effective for processing a large number of independent computational tasks, such as element-wise calculations in matrix operations.

Advantages:

  • Maximizes parallelism as each thread works independently.
  • Suitable for situations that require processing a large number of small independent tasks.

Challenges:

  • Requires careful design to ensure uniform load distribution, otherwise some threads may finish faster than others, leading to resource waste.

2. Block-Centric

In the block-centric computational scheme, each thread block is responsible for processing one task. All threads within the block collaborate to complete this task, utilizing shared memory and thread synchronization to achieve efficient computation. In this scheme, the granularity of tasks is larger, suitable for computational tasks that require a large amount of data sharing and collaboration between threads, such as certain graphics algorithms or operations requiring aggregated computations.

Advantages:

  • Utilizes the shared memory and synchronization mechanisms of thread blocks to effectively collaborate on complex computational tasks.
  • Suitable for handling tasks that require extensive data sharing and coordination.

Challenges:

  • Load imbalance within thread blocks can lead to waste of computational resources.
  • Places higher demands on task partitioning, since the workload across thread blocks should be roughly equal.

3. Warp-Centric

In the warp-centric computational scheme, each warp is responsible for processing one task. A warp typically consists of 32 threads that execute the same instruction in lockstep. In this scheme, the granularity of tasks is between thread-centric and block-centric. Since threads within a warp execute in lockstep, task design needs to account for the SIMD (Single Instruction, Multiple Data) nature of warps, ensuring that threads in the same warp follow the same execution path as much as possible to avoid performance loss from warp divergence.

Advantages:

  • Threads within a warp execute in lockstep, making them easy to manage and optimize.
  • Suitable for tasks requiring SIMD-style computation, such as convolution operations in image processing.

Challenges:

  • Warp divergence can severely impact performance: when threads in the same warp take different branches, the GPU executes each path serially, leaving some threads idle.
  • Task design needs to consider the characteristics of warps to ensure maximum utilization.

Examples of the computational schemes

Thread-Centric

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void addArrays(float* a, float* b, float* c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        c[index] = a[index] + b[index];
    }
}

int main() {
    const int arraySize = 10000;
    const int arrayBytes = arraySize * sizeof(float);

    // Allocate memory on the host (CPU)
    float* h_a = (float*)malloc(arrayBytes);
    float* h_b = (float*)malloc(arrayBytes);
    float* h_c = (float*)malloc(arrayBytes);

    // Initialize the input arrays
    for (int i = 0; i < arraySize; ++i) {
        h_a[i] = i;
        h_b[i] = i * 2.0f;
    }

    // Allocate memory on the device (GPU)
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, arrayBytes);
    cudaMalloc((void**)&d_b, arrayBytes);
    cudaMalloc((void**)&d_c, arrayBytes);

    // Copy input data from host to device
    cudaMemcpy(d_a, h_a, arrayBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, arrayBytes, cudaMemcpyHostToDevice);

    // Launch one thread per element
    int blockSize = 256;
    int numBlocks = (arraySize + blockSize - 1) / blockSize;

    addArrays<<<numBlocks, blockSize>>>(d_a, d_b, d_c, arraySize);

    // Copy the result back to the host
    cudaMemcpy(h_c, d_c, arrayBytes, cudaMemcpyDeviceToHost);

    // Print the first few results for verification
    for (int i = 0; i < 10; ++i) {
        std::cout << "h_c[" << i << "] = " << h_c[i] << std::endl;
    }

    // Free device and host memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

blockSize refers to the number of threads in each block. Since we have 10000 data points to compute, we need (arraySize + blockSize - 1) / blockSize blocks.

__global__ declares a kernel function, i.e. a function that runs on the GPU and is launched from the host. The expression threadIdx.x + blockIdx.x * blockDim.x computes each thread's global index into the 10000-element arrays.

The calculation of threadIdx.x + blockIdx.x * blockDim.x can be better explained through the following illustration:

Grid
+--------------------------------------------------+
|  Block 0                 Block 1                 |
|  +--------------------+  +--------------------+  |
|  | Thread 0           |  | Thread 0           |  |
|  | Thread 1           |  | Thread 1           |  |
|  | ...                |  | ...                |  |
|  | Thread blockDim.x-1|  | Thread blockDim.x-1|  |
|  +--------------------+  +--------------------+  |
|                                                  |
|  ...                                             |
|                                                  |
|  Block numBlocks-1                               |
|  +--------------------+                          |
|  | Thread 0           |                          |
|  | Thread 1           |                          |
|  | ...                |                          |
|  | Thread blockDim.x-1|                          |
|  +--------------------+                          |
+--------------------------------------------------+

Calculating the global thread index

The formula threadIdx.x + blockIdx.x * blockDim.x is used to calculate the global index of a thread in the thread grid:

  • threadIdx.x: The index of the current thread within its block.
  • blockIdx.x: The index of the current block in the grid.
  • blockDim.x: The number of threads in each block (block dimension).

Example

Assume:

  • blockDim.x = 4 (4 threads per block)
  • blockIdx.x = 1 (this is the second block, index starts from 0)
  • threadIdx.x = 2 (this is the third thread in the block, index starts from 0)

Then, the steps to calculate the global thread index are:

  1. Calculate the starting position of the block: blockIdx.x * blockDim.x = 1 * 4 = 4
  2. Add the thread offset within the block: 4 + threadIdx.x = 4 + 2 = 6

Finally, the global index is 6, which means this thread is the 7th thread in the entire grid (index starts from 0).
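
The same calculation can be checked with a tiny kernel (a sketch; the 3-block by 4-thread launch mirrors the numbers used above):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void printGlobalIndex() {
    int globalIdx = threadIdx.x + blockIdx.x * blockDim.x;
    // For blockIdx.x = 1, threadIdx.x = 2 and blockDim.x = 4 this prints 6.
    printf("block %d, thread %d -> global index %d\n",
           blockIdx.x, threadIdx.x, globalIdx);
}

int main() {
    printGlobalIndex<<<3, 4>>>();   // 3 blocks of 4 threads, global indices 0..11
    cudaDeviceSynchronize();
    return 0;
}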

Warp-Centric

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define THREADS_PER_BLOCK 64   // two warps per block

// CUDA kernel function: each warp calculates the sum of squares of one
// WARP_SIZE-element segment of the input array.
__global__ void warpSumSquares(float* input, float* output, int n) {
    int warpId = threadIdx.x / WARP_SIZE;   // warp index within the block
    int lane = threadIdx.x % WARP_SIZE;     // lane index within the warp

    // One slot per thread in the block, so the warps in a block
    // do not overwrite each other's partial results.
    __shared__ float sharedMemory[THREADS_PER_BLOCK];

    // Starting element of the segment handled by this warp
    int start = warpId * WARP_SIZE * gridDim.x + blockIdx.x * WARP_SIZE;

    float sum = 0.0f;
    int i = start + lane;                   // each lane handles one element of the segment
    if (i < n) {
        sum = input[i] * input[i];
    }

    // Reduce the per-lane results through shared memory
    sharedMemory[threadIdx.x] = sum;
    __syncthreads();

    if (lane == 0) {
        float warpSum = 0.0f;
        for (int i = 0; i < WARP_SIZE; ++i) {
            warpSum += sharedMemory[warpId * WARP_SIZE + i];
        }
        output[warpId * gridDim.x + blockIdx.x] = warpSum;
    }
}

int main() {
    const int arraySize = 1024;
    const int arrayBytes = arraySize * sizeof(float);
    const int outputSize = arraySize / WARP_SIZE;   // one result per warp

    // Allocate memory on the host (CPU)
    float* h_input = (float*)malloc(arrayBytes);
    float* h_output = (float*)malloc(outputSize * sizeof(float));

    // Initialize the array
    for (int i = 0; i < arraySize; ++i) {
        h_input[i] = i * 1.0f;
    }

    // Allocate memory on the device (GPU)
    float *d_input, *d_output;
    cudaMalloc((void**)&d_input, arrayBytes);
    cudaMalloc((void**)&d_output, outputSize * sizeof(float));

    // Copy data from host to device
    cudaMemcpy(d_input, h_input, arrayBytes, cudaMemcpyHostToDevice);

    // Launch CUDA kernel function: one warp per output element
    int blockSize = THREADS_PER_BLOCK;              // each block contains two warps
    int warpsPerBlock = blockSize / WARP_SIZE;
    int numBlocks = (outputSize + warpsPerBlock - 1) / warpsPerBlock;
    warpSumSquares<<<numBlocks, blockSize>>>(d_input, d_output, arraySize);

    // Copy results from device to host
    cudaMemcpy(h_output, d_output, outputSize * sizeof(float), cudaMemcpyDeviceToHost);

    // Print results for verification
    for (int i = 0; i < outputSize; ++i) {
        std::cout << "Output[" << i << "] = " << h_output[i] << std::endl;
    }

    // Free device and host memory
    cudaFree(d_input);
    cudaFree(d_output);
    free(h_input);
    free(h_output);

    return 0;
}
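
The kernel above reduces each warp's partial results through shared memory. Since CUDA 9, the same per-warp reduction can be written with warp shuffle intrinsics instead, exchanging values directly between registers; the kernel below is a sketch of that variant (it reuses the WARP_SIZE macro and indexing from the example above, and only the reduction step differs):

// Alternative reduction: each lane exchanges its partial sum directly through
// registers with __shfl_down_sync, so no shared memory or __syncthreads is needed.
__global__ void warpSumSquaresShuffle(float* input, float* output, int n) {
    int warpId = threadIdx.x / WARP_SIZE;
    int lane = threadIdx.x % WARP_SIZE;
    int start = warpId * WARP_SIZE * gridDim.x + blockIdx.x * WARP_SIZE;

    int i = start + lane;
    float sum = (i < n) ? input[i] * input[i] : 0.0f;

    // Tree reduction within the warp: offsets 16, 8, 4, 2, 1.
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }

    if (lane == 0) {
        output[warpId * gridDim.x + blockIdx.x] = sum;
    }
}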

Block-Centric

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 1024 // Number of threads per block

// CUDA kernel function: each block calculates the sum of squares for a portion of the data
__global__ void blockSumSquares(float* input, float* partialSum, int n) {
    __shared__ float sharedMemory[BLOCK_SIZE];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;   // global thread index
    int threadId = threadIdx.x;                        // index within the block
    int blockSize = blockDim.x;

    // Each thread calculates its own data
    float sum = 0.0f;
    if (tid < n) {
        sum = input[tid] * input[tid];
    }

    // Store the result in shared memory
    sharedMemory[threadId] = sum;
    __syncthreads();

    // Reduction operation: accumulate results from all threads in the block
    if (threadId == 0) {
        float blockSum = 0.0f;
        for (int i = 0; i < blockSize; ++i) {
            blockSum += sharedMemory[i];
        }
        partialSum[blockIdx.x] = blockSum;
    }
}

int main() {
    const int arraySize = 2024;
    const int arrayBytes = arraySize * sizeof(float);
    const int partialSumSize = (arraySize + BLOCK_SIZE - 1) / BLOCK_SIZE; // Calculate how many partial sums are needed
    const int partialSumBytes = partialSumSize * sizeof(float);

    // Allocate memory on the host (CPU)
    float* h_input = (float*)malloc(arrayBytes);
    float* h_partialSum = (float*)malloc(partialSumBytes);

    // Initialize the array
    for (int i = 0; i < arraySize; ++i) {
        h_input[i] = i * 1.0f;
    }

    // Allocate memory on the device (GPU)
    float *d_input, *d_partialSum;
    cudaMalloc((void**)&d_input, arrayBytes);
    cudaMalloc((void**)&d_partialSum, partialSumBytes);

    // Copy data from host to device
    cudaMemcpy(d_input, h_input, arrayBytes, cudaMemcpyHostToDevice);

    // Launch CUDA kernel function
    int numBlocks = (arraySize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    blockSumSquares<<<numBlocks, BLOCK_SIZE>>>(d_input, d_partialSum, arraySize);

    // Copy results from device to host
    cudaMemcpy(h_partialSum, d_partialSum, partialSumBytes, cudaMemcpyDeviceToHost);

    // Accumulate the partial sums into the total on the host
    float totalSum = 0.0f;
    for (int i = 0; i < numBlocks; ++i) {
        std::cout << "block: " << i << " result: " << h_partialSum[i] << std::endl;
        totalSum += h_partialSum[i];
    }
    std::cout << "total sum of squares: " << totalSum << std::endl;

    cudaFree(d_input);
    cudaFree(d_partialSum);
    free(h_input);
    free(h_partialSum);

    return 0;
}

Shared Memory Declaration

__shared__ float sharedMemory[BLOCK_SIZE];
  • sharedMemory is the shared memory within the block, used to store the results calculated by each thread. BLOCK_SIZE is the number of threads in each block (1024).

Thread ID Calculation

int tid = threadIdx.x + blockIdx.x * blockDim.x;
  • tid is the global index of the thread in the entire grid, used to access input data. threadIdx.x is the index of the thread within the block, blockIdx.x is the index of the block in the grid.

Each Thread Calculates Its Own Data

float sum = 0.0f;
if (tid < n) {
    sum = input[tid] * input[tid];
}
  • Each thread calculates the square sum of the input data and stores the result in sum. if (tid < n) ensures that the thread only processes valid data (prevents out-of-bounds access).

Store Results in Shared Memory

sharedMemory[threadId] = sum;
__syncthreads();
  • Store the calculation results of each thread in shared memory. __syncthreads() is used to synchronize all threads within the block, ensuring that all thread results have been written to shared memory before proceeding with subsequent operations.

Reduction Operation

if (threadId == 0) {
    float blockSum = 0.0f;
    for (int i = 0; i < blockSize; ++i) {
        blockSum += sharedMemory[i];
    }
    partialSum[blockIdx.x] = blockSum;
}
  • if (threadId == 0): Only the first thread in the block (threadId == 0) performs the following reduction operation.
  • Iterate through all thread results in sharedMemory, accumulating them to get the total sum for the current block (blockSum).
  • Store blockSum in the corresponding position of the partialSum array. partialSum[blockIdx.x] is used to store the square sum result calculated by each block.
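
If the final total is wanted on the device rather than the host, one common variant (sketched below, not part of the example above; it assumes a single device float total that has been zero-initialized, e.g. with cudaMemset) is to let thread 0 of each block add its block sum into a global accumulator with atomicAdd:

// Hypothetical variant: each block adds its sum straight into a single total.
__global__ void blockSumSquaresAtomic(float* input, float* total, int n) {
    __shared__ float sharedMemory[BLOCK_SIZE];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    float sum = 0.0f;
    if (tid < n) {
        sum = input[tid] * input[tid];
    }
    sharedMemory[threadIdx.x] = sum;
    __syncthreads();

    if (threadIdx.x == 0) {
        float blockSum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) {
            blockSum += sharedMemory[i];
        }
        atomicAdd(total, blockSum);   // *total must be zero-initialized on the device
    }
}

With this variant, the host copies back a single float instead of an array of partial sums.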