The first read or write request to global memory is expensive, but subsequent accesses to contiguous addresses are not. Transferring data in bursts hides the memory access latency and improves both bandwidth utilization and the efficiency of the memory controller.
Atomic accesses to global memory should always be avoided unless absolutely required. The load and store functions should be coded so that they always infer burst transactions. This can be done using a memcpy operation, as shown in the vadd.cpp file in the GitHub example, or by creating a tight for loop that accesses all the required values sequentially, as explained in Interfaces in Developing Applications.
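The two coding styles described above can be sketched as follows. This is a minimal illustration, not the vadd.cpp file from the GitHub example: the function name vadd_sketch, the constant BUFFER_SIZE, and the local buffer names are all hypothetical, and interface pragmas are omitted. Whether a burst is actually inferred depends on the tool and should be confirmed in the synthesis report.

```cpp
#include <cstring>

// Hypothetical size of the on-chip staging buffers (an assumption for this sketch).
constexpr int BUFFER_SIZE = 256;

// Hypothetical kernel: out[i] = in1[i] + in2[i], processed one chunk at a time.
// in1, in2, and out stand in for pointers to global memory.
extern "C" void vadd_sketch(const int *in1, const int *in2, int *out, int size) {
    int buf1[BUFFER_SIZE];
    int buf2[BUFFER_SIZE];
    int result[BUFFER_SIZE];

    for (int base = 0; base < size; base += BUFFER_SIZE) {
        // The last chunk may be shorter than BUFFER_SIZE.
        int chunk = (size - base < BUFFER_SIZE) ? (size - base) : BUFFER_SIZE;

        // Load, style 1: memcpy over a contiguous range.
        std::memcpy(buf1, in1 + base, chunk * sizeof(int));

        // Load, style 2: a tight for loop that reads consecutive addresses
        // and does nothing else in its body.
        for (int i = 0; i < chunk; ++i) {
            buf2[i] = in2[base + i];
        }

        // Compute only on the local buffers, so global memory is not
        // touched one element at a time inside the arithmetic.
        for (int i = 0; i < chunk; ++i) {
            result[i] = buf1[i] + buf2[i];
        }

        // Store: write the whole chunk back as one contiguous range.
        std::memcpy(out + base, result, chunk * sizeof(int));
    }
}
```

Keeping the load, compute, and store steps in separate loops means each access to global memory covers a contiguous range, which is the access pattern that burst inference relies on.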