The hardware implementation, running on the U200, reports the kernel throughput. This is based on the kernel execution time only and excludes the host->device and device->host copies of the data. A large data set of over 4 million parameters is used to amortize the initial pipeline latency and fully occupy the device buffers.
[bash]$ ./bsm_test ./xclbin/bsm_kernel.hw.u200.xclbin 4194304
This achieves around 203 million options per second, which corresponds to approximately 8.9 GB/s of data transferred. That is about half of the theoretical DDR bandwidth, but around 80% of the bandwidth achieved by the 'xbutil validate' DDR bandwidth test. The 'xbutil validate' figure is the more realistic target, because it includes the platform overhead that the BSM solver also incurs.
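The relationship between these figures can be checked with some quick arithmetic. The following is a minimal sketch that uses only the numbers quoted above. It assumes that 1 GB means 10^9 bytes and that the command-line argument 4194304 is the number of options priced. The variable names are illustrative and are not taken from the test program. The reference bandwidths are back-calculated from the rounded "80%" and "about half" statements, so they are rough estimates rather than measured values.

```python
# Figures quoted in the text (rounded values).
options_per_second = 203e6       # kernel throughput in options per second
bandwidth_bytes_per_s = 8.9e9    # quoted data rate; assumes 1 GB = 1e9 bytes
num_options = 4194304            # assumed: the argument passed to bsm_test

# Implied data moved per option (inputs read plus results written).
bytes_per_option = bandwidth_bytes_per_s / options_per_second

# Reference bandwidths implied by the "80%" and "about half" statements.
# These are back-calculated estimates, not measurements.
xbutil_bandwidth = bandwidth_bytes_per_s / 0.80
theoretical_bandwidth = bandwidth_bytes_per_s / 0.50

# Kernel execution time implied by the throughput for this data set.
kernel_time_ms = num_options / options_per_second * 1e3

print(f"bytes per option:       {bytes_per_option:.1f}")
print(f"implied xbutil figure:  {xbutil_bandwidth / 1e9:.1f} GB/s")
print(f"implied theoretical:    {theoretical_bandwidth / 1e9:.1f} GB/s")
print(f"implied kernel time:    {kernel_time_ms:.1f} ms")
```

Under these assumptions, each option accounts for roughly 44 bytes of DDR traffic, and the whole data set passes through the kernel in about 21 ms. A run that short helps explain why a large data set is needed for a stable throughput measurement.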