CS680 - WHT on GPU - Presentation

Presentations

Overview cs680_0.pdf
Hardware cs680_1.pdf
Execution Model (from CUDA Programming Guide 1.0)
The way a block is split into warps is always the same; each warp contains threads of 
consecutive, increasing thread IDs with the first warp containing thread 0. 
Section 2.2.1 describes how thread IDs relate to thread indices in the block. 
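
The partitioning of a block into warps can be sketched as a small host-side model. This is Python, not CUDA code. It assumes a warp size of 32 threads, which the excerpt does not state, and the thread ID formula x + y*Dx that Section 2.2.1 of the guide gives for a two-dimensional block of size (Dx, Dy). The names WARP_SIZE, thread_id, warp_of and warps_in_block are illustrative only.

```python
WARP_SIZE = 32  # assumption: the excerpt above does not state the warp size


def thread_id(x, y, dim_x):
    """Linear thread ID of thread index (x, y) in a block that is dim_x threads wide."""
    return x + y * dim_x


def warp_of(tid):
    """Index of the warp that holds thread `tid`; warp 0 starts at thread 0."""
    return tid // WARP_SIZE


def warps_in_block(num_threads):
    """Number of warps needed for a block; the last warp may be only partly filled."""
    return (num_threads + WARP_SIZE - 1) // WARP_SIZE
```

Under these assumptions, thread (3, 2) of a 16x16 block has thread ID 35 and therefore belongs to warp 1.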

A block is processed by only one multiprocessor, so that the shared memory space 
resides in the on-chip shared memory leading to very fast memory accesses. The 
multiprocessor's registers are allocated among the threads of the block. If the 
number of registers used per thread multiplied by the number of threads in the 
block is greater than the total number of registers per multiprocessor, the block 
cannot be executed and the corresponding kernel will fail to launch. 
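
The launch condition in this paragraph amounts to a single comparison. The sketch below is a Python model of that check, not an actual CUDA API. The default of 8192 registers per multiprocessor is an assumed figure for first-generation hardware and does not appear in the excerpt; block_fits is a hypothetical helper name.

```python
def block_fits(regs_per_thread, threads_per_block, regs_per_multiprocessor=8192):
    """Return True if one block's register demand fits on a multiprocessor.

    regs_per_multiprocessor defaults to 8192, an assumed value for
    first-generation hardware; pass the figure for the actual device.
    """
    return regs_per_thread * threads_per_block <= regs_per_multiprocessor
```

For example, 512 threads at 10 registers each need 5120 registers and fit, while 512 threads at 32 registers each need 16384 and would fail to launch under this model.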

Several blocks can be processed by the same multiprocessor concurrently by 
allocating the multiprocessor's registers and shared memory among the blocks. 
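
How many blocks can share one multiprocessor can be estimated from the two resources named above. The following Python sketch is a simplified model only. It assumes 8192 registers and 16384 bytes of shared memory per multiprocessor (figures not given in the excerpt), and it ignores any other hardware limits such as a cap on resident blocks or threads. The function name and parameters are hypothetical.

```python
def blocks_per_multiprocessor(regs_per_thread, threads_per_block, smem_per_block,
                              regs_total=8192, smem_total=16384):
    """Estimate how many blocks fit on one multiprocessor at the same time.

    regs_total and smem_total are assumed per-multiprocessor capacities;
    other hardware limits are deliberately left out of this model.
    """
    regs_per_block = regs_per_thread * threads_per_block
    by_regs = regs_total // regs_per_block
    # A block that uses no shared memory is limited by registers alone.
    by_smem = smem_total // smem_per_block if smem_per_block > 0 else by_regs
    return min(by_regs, by_smem)
```

With 256 threads per block, 10 registers per thread and 4096 bytes of shared memory per block, registers allow 3 blocks and shared memory allows 4, so the model reports 3.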

The issue order of the warps within a block is undefined, but their execution can be 
synchronized, as mentioned in Section 2.2.1, to coordinate global or shared memory 
accesses. 
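
The role of synchronization in coordinating shared memory accesses can be illustrated with an analogy on the host. The sketch below uses Python threads and threading.Barrier as a stand-in for a block's threads and the block-wide barrier (__syncthreads() in CUDA). It is an analogy under those assumptions, not a model of how the hardware schedules warps, and rotate_left is a hypothetical example function.

```python
import threading


def rotate_left(values):
    """Each 'thread' i writes values[i] to a shared slot, waits, then reads slot i+1."""
    n = len(values)
    shared = [None] * n              # stands in for on-chip shared memory
    result = [None] * n
    barrier = threading.Barrier(n)   # plays the role of the block-wide barrier

    def worker(i):
        shared[i] = values[i]            # phase 1: write own slot
        barrier.wait()                   # nobody proceeds until all slots are written
        result[i] = shared[(i + 1) % n]  # phase 2: read a neighbour's slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Without the barrier, a thread could read its neighbour's slot before the neighbour has written it, because the order in which the threads run is not defined.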

The issue order of the blocks within a grid of thread blocks is undefined and there is 
no synchronization mechanism between blocks, so threads from two different 
blocks of the same grid cannot safely communicate with each other through global 
memory during the execution of the grid. 

If a non-atomic instruction executed by a warp writes to the same location in global 
or shared memory for more than one of the threads of the warp, the number of 
serialized writes that occur to that location and the order in which they occur is 
undefined, but one of the writes is guaranteed to succeed. If an atomic instruction 
(see Section 4.4.6) executed by a warp reads, modifies, and writes to the same 
location in global memory for more than one of the threads of the warp, each read, 
modify, write to that location occurs and they are all serialized, but the order in 
which they occur is undefined.
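
The difference between the two cases can be made concrete with a small simulation. This Python sketch models the undefined ordering with a random shuffle. It is an illustration only: real hardware may perform fewer serialized writes in the non-atomic case, and the function names are hypothetical.

```python
import random


def warp_store(values, rng):
    """Non-atomic store: every thread of the warp writes its value to one location.

    The order of the serialized writes is undefined; a shuffle models that here.
    """
    order = list(values)
    rng.shuffle(order)
    location = None
    for v in order:
        location = v   # each write overwrites the previous one
    return location    # one thread's value survives, but which one is unspecified


def warp_atomic_add(initial, increments, rng):
    """Atomic add: every read-modify-write takes effect, in an undefined order."""
    order = list(increments)
    rng.shuffle(order)
    location = initial
    for inc in order:
        location += inc  # no update is lost, whatever the order
    return location
```

In this model the non-atomic store ends with some single thread's value, whereas the atomic add always ends with the initial value plus the sum of all increments, regardless of the order chosen.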