Example CUDA program: Matrix Multiplication

Programming in CUDA consists of writing code for the host (code that runs on the CPU) and for the device (code that runs on the GPU). Both the host and the device programs are written in C. In addition to standard C syntax, the device program (a.k.a. the kernel) uses a handful of CUDA-specific C extensions that make programming the GPU easier.

The following program shows how to launch a kernel that computes the product of two matrices on the GPU. The example comes from the CUDA Programming Guide 1.0.

Host program

The following function is executed on the host. It takes an hA x wA matrix A and a wA x wB matrix B, copies both matrices to the graphics card, and allocates a matrix C on the device to store the result. It then computes the execution configuration (read on for details) based on the size of the input matrices, starts the GPU computation by calling the kernel, reads the result back to host memory, and frees the device buffers.

// Thread block size 
#define BLOCK_SIZE 16 
 
// Forward declaration of the device multiplication function 
__global__ void Muld(float*, float*, int, int, float*); 
 
// Host multiplication function 
// Compute C = A * B 
//   hA is the height of A 
//   wA is the width of A 
//   wB is the width of B 
void Mul(const float* A, const float* B, int hA, int wA, int wB, 
         float* C) 
{ 
    int size; 
 
    // Load A and B to the device 
    float* Ad; 
    size = hA * wA * sizeof(float); 
    cudaMalloc((void**)&Ad, size); 
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice); 
    float* Bd; 
    size = wA * wB * sizeof(float); 
    cudaMalloc((void**)&Bd, size); 
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice); 
 
    // Allocate C on the device 
    float* Cd; 
    size = hA * wB * sizeof(float); 
    cudaMalloc((void**)&Cd, size); 
 
    // Compute the execution configuration assuming 
    // the matrix dimensions are multiples of BLOCK_SIZE 
    
    /********************
    Calculates the execution configuration.
    The kernel <Muld> will be executed by a grid of
    (wB / BLOCK_SIZE) x (hA / BLOCK_SIZE) thread blocks,
    each containing BLOCK_SIZE x BLOCK_SIZE threads,
    i.e. one GPU thread per element of C.
    ************************/
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); 
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y); 
 
    // Launch the device computation 
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd); 
 
    // Read C from the device 
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);  
 
    // Free device memory 
    cudaFree(Ad); 
    cudaFree(Bd); 
    cudaFree(Cd); 
}      
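
The host function above expects the caller to supply the input matrices and assumes their dimensions are multiples of BLOCK_SIZE. A minimal driver along the following lines could fill A and B with random values and invoke Mul. This sketch is not part of the original example, and the dimensions 64, 48 and 32 are arbitrary placeholder values.

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    // Dimensions must be multiples of BLOCK_SIZE (16); these are arbitrary
    const int hA = 64, wA = 48, wB = 32;

    // Allocate the host matrices and fill A and B with random values
    float* A = (float*)malloc(hA * wA * sizeof(float));
    float* B = (float*)malloc(wA * wB * sizeof(float));
    float* C = (float*)malloc(hA * wB * sizeof(float));
    for (int i = 0; i < hA * wA; ++i) A[i] = rand() / (float)RAND_MAX;
    for (int i = 0; i < wA * wB; ++i) B[i] = rand() / (float)RAND_MAX;

    // Run the GPU multiplication defined above
    Mul(A, B, hA, wA, wB, C);

    // Print one element just to show the result came back
    printf("C[0] = %f\n", C[0]);

    free(A); free(B); free(C);
    return 0;
}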

Device (Kernel) Program

The following function is compiled by the NVIDIA CUDA compiler nvcc and is uploaded to and executed on the graphics board. When a thread runs this function, it first determines its block and thread indices (using the CUDA built-in variables blockIdx and threadIdx). These indices select the sub-matrices of A and B that the thread's block works on and the element of C that the thread computes. To compute the product, the threads of a block first copy the current sub-matrices into shared memory to exploit its low access latency, then accumulate the partial products from shared memory. Finally, each thread writes its result to matrix C in global memory for the host program to read back.

// Device multiplication function called by Mul() 
// Compute C = A * B 
//   wA is the width of A 
//   wB is the width of B 
__global__ void Muld(float* A, float* B, int wA, int wB, float* C) 
{ 
    // Block index 
    int bx = blockIdx.x; 
    int by = blockIdx.y; 
 
    // Thread index 
    int tx = threadIdx.x; 
    int ty = threadIdx.y; 
 
    // Index of the first sub-matrix of A processed by the block 
    int aBegin = wA * BLOCK_SIZE * by; 
 
    // Index of the last sub-matrix of A processed by the block 
    int aEnd   = aBegin + wA - 1; 
 
    // Step size used to iterate through the sub-matrices of A 
    int aStep  = BLOCK_SIZE; 
 
    // Index of the first sub-matrix of B processed by the block 
    int bBegin = BLOCK_SIZE * bx; 
 
    // Step size used to iterate through the sub-matrices of B 
    int bStep  = BLOCK_SIZE * wB; 
 
    // The element of the block sub-matrix that is computed 
    // by the thread 
    float Csub = 0; 
 
    // Loop over all the sub-matrices of A and B required to 
    // compute the block sub-matrix 
    for (int a = aBegin, b = bBegin; 
             a <= aEnd; 
             a += aStep, b += bStep) { 
 
        // Shared memory for the sub-matrix of A 
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; 
 
        // Shared memory for the sub-matrix of B 
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE]; 
 
        // Load the matrices from global memory to shared memory; 
        // each thread loads one element of each matrix 
        As[ty][tx] = A[a + wA * ty + tx]; 
        Bs[ty][tx] = B[b + wB * ty + tx]; 
 
        // Synchronize to make sure the matrices are loaded 
        __syncthreads(); 
 
        // Multiply the two matrices together; 
        // each thread computes one element 
        // of the block sub-matrix 
        for (int k = 0; k < BLOCK_SIZE; ++k) 
            Csub += As[ty][k] * Bs[k][tx]; 
 
        // Synchronize to make sure that the preceding 
        // computation is done before loading two new 
        // sub-matrices of A and B in the next iteration 
        __syncthreads(); 
    } 
 
    // Write the block sub-matrix to global memory; 
    // each thread writes one element 
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx; 
    C[c + wB * ty + tx] = Csub; 
}         
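
To verify the kernel's output, one way is to compare it against a straightforward CPU implementation of C = A * B that uses the same row-major layout. The following reference function is a sketch added here for illustration and is not part of the original example.

// Reference CPU multiplication with the same row-major layout:
// C[i][j] = sum over k of A[i][k] * B[k][j]
void MulReference(const float* A, const float* B, int hA, int wA, int wB,
                  float* C)
{
    for (int i = 0; i < hA; ++i)
        for (int j = 0; j < wB; ++j) {
            float sum = 0;
            for (int k = 0; k < wA; ++k)
                sum += A[i * wA + k] * B[k * wB + j];
            C[i * wB + j] = sum;
        }
}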

Detailed walk-through

See Chapter 6 of the CUDA Programming Guide 1.0.

Many more example programs can be found in the CUDA SDK.