Presentations

  • Overview cs680_0.pdf
  • Hardware cs680_1.pdf
  • Execution Model (from the CUDA Programming Guide 1.0)
    The way a block is split into warps is always the same; each warp contains threads of 
    consecutive, increasing thread IDs with the first warp containing thread 0. 
    Section 2.2.1 describes how thread IDs relate to thread indices in the block. 
    
    A block is processed by only one multiprocessor, so that the shared memory space 
    resides in the on-chip shared memory leading to very fast memory accesses. The 
    multiprocessor's registers are allocated among the threads of the block. If the 
    number of registers used per thread multiplied by the number of threads in the 
    block is greater than the total number of registers per multiprocessor, the block 
    cannot be executed and the corresponding kernel will fail to launch. 
    
    Several blocks can be processed by the same multiprocessor concurrently by 
    allocating the multiprocessor's registers and shared memory among the blocks. 
    
    The issue order of the warps within a block is undefined, but their execution can be 
    synchronized, as mentioned in Section 2.2.1, to coordinate global or shared memory 
    accesses. 
    
    The issue order of the blocks within a grid of thread blocks is undefined and there is 
    no synchronization mechanism between blocks, so threads from two different 
    blocks of the same grid cannot safely communicate with each other through global 
    memory during the execution of the grid. 
    
    If a non-atomic instruction executed by a warp writes to the same location in global 
    or shared memory for more than one of the threads of the warp, the number of 
    serialized writes that occur to that location and the order in which they occur is 
    undefined, but one of the writes is guaranteed to succeed. If an atomic instruction 
    (see Section 4.4.6) executed by a warp reads, modifies, and writes to the same 
    location in global memory for more than one of the threads of the warp, each read, 
    modify, write to that location occurs and they are all serialized, but the order in 
    which they occur is undefined.
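
The excerpt's rule for splitting a block into warps can be sketched as host-side arithmetic. This is a minimal illustration, not code from the guide: the helper names are hypothetical, the warp size of 32 is an assumption (it is the value on CUDA hardware to date), and the 2D thread ID formula x + y * Dx is the one Section 2.2.1 of the guide gives for a block of size (Dx, Dy).

```cpp
// Assumption: 32 threads per warp (not stated in the excerpt above).
constexpr int kWarpSize = 32;

// Hypothetical helper: thread ID of the thread with index (x, y) in a
// two-dimensional block that is dim_x threads wide.
constexpr int thread_id_2d(int x, int y, int dim_x) {
    return x + y * dim_x;
}

// Warp that a thread belongs to. Warps hold consecutive, increasing
// thread IDs, and warp 0 starts at thread 0.
constexpr int warp_of(int thread_id) {
    return thread_id / kWarpSize;
}

// Position of a thread inside its warp.
constexpr int lane_of(int thread_id) {
    return thread_id % kWarpSize;
}

// Number of warps a block occupies; the last warp may be only partly full.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}
```

Under these assumptions, thread (3, 2) of a 16 x 16 block has thread ID 35, so it sits in warp 1 at lane 3, and a block of 100 threads occupies 4 warps.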
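
The register and shared memory constraints in the excerpt reduce to a small budget calculation, sketched below. The struct and function names are hypothetical. The result is only an upper bound: the sketch ignores allocation granularity and any other per-multiprocessor limits a real device may impose.

```cpp
// Hypothetical description of one multiprocessor's resources.
struct MultiprocessorLimits {
    int registers;         // total registers on the multiprocessor
    int shared_mem_bytes;  // on-chip shared memory, in bytes
};

// Hypothetical per-block demand of a kernel launch.
// Precondition: threads_per_block > 0 and registers_per_thread > 0.
struct BlockDemand {
    int threads_per_block;
    int registers_per_thread;
    int shared_mem_bytes;  // may be 0 if the kernel uses no shared memory
};

constexpr int registers_per_block(const BlockDemand& b) {
    return b.registers_per_thread * b.threads_per_block;
}

// The launch fails if one block alone needs more registers than the
// multiprocessor has.
constexpr bool block_can_launch(const MultiprocessorLimits& mp,
                                const BlockDemand& b) {
    return registers_per_block(b) <= mp.registers;
}

// Upper bound on the blocks one multiprocessor can hold at once, found
// by dividing its registers and shared memory among the blocks.
constexpr int max_concurrent_blocks(const MultiprocessorLimits& mp,
                                    const BlockDemand& b) {
    if (!block_can_launch(mp, b)) return 0;
    int by_registers = mp.registers / registers_per_block(b);
    if (b.shared_mem_bytes == 0) return by_registers;
    int by_shared = mp.shared_mem_bytes / b.shared_mem_bytes;
    return by_registers < by_shared ? by_registers : by_shared;
}
```

With illustrative limits of 8192 registers and 16384 bytes of shared memory (numbers chosen for the example, not taken from the excerpt), a block of 256 threads using 10 registers per thread and 4096 bytes of shared memory is bounded at 3 concurrent blocks by registers, while a block of 512 threads using 20 registers per thread cannot launch at all.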
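
The difference between the non-atomic and atomic cases in the last paragraph of the excerpt can be modeled on the host. This is a deliberately simplified sketch, not GPU code: a 4-lane array stands in for a warp, the `order` argument stands in for the undefined serialization order, and all names are hypothetical. The model performs every plain write, whereas the hardware may perform fewer; only the final value is meant to match the guarantee.

```cpp
constexpr int kLanes = 4;  // small stand-in for the threads of one warp

// Plain (non-atomic) store: every lane writes its own value to the same
// location. The final value is whichever write happened to come last,
// so it depends on the serialization order.
constexpr int plain_store_result(const int (&values)[kLanes],
                                 const int (&order)[kLanes]) {
    int location = 0;
    for (int i = 0; i < kLanes; ++i) {
        location = values[order[i]];
    }
    return location;
}

// Atomic add: every lane's read-modify-write takes effect, one after
// another, so the final value is the same for any serialization order.
constexpr int atomic_add_result(int initial,
                                const int (&values)[kLanes],
                                const int (&order)[kLanes]) {
    int location = initial;
    for (int i = 0; i < kLanes; ++i) {
        location += values[order[i]];
    }
    return location;
}

// Example data: one value per lane and two possible serialization orders.
constexpr int kValues[kLanes] = {10, 20, 30, 40};
constexpr int kOrderA[kLanes] = {0, 1, 2, 3};
constexpr int kOrderB[kLanes] = {2, 0, 3, 1};
```

In this model the plain store leaves 40 under `kOrderA` and 20 under `kOrderB`, always one of the written values but not a predictable one, while the atomic add starting from 5 yields 105 under either order.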