Kernels & Compute Unit: Kernel Execution reports 168 ms. This should be the same as when the Bloom8x kernel was run with ITER=8.
Kernels & Compute Unit: Compute Unit Stalls section confirms that “External Memory” stalls are about 20.045 ms, whereas there were no “External Memory” stalls when a single buffer was used. These stalls result in slower data transfer and kernel compute compared to the single-buffer run.
Host Data Transfer: Host Transfer section shows that the Host to Global Memory WRITE Transfer takes about 207.5 ms and the Host to Global Memory READ Transfer takes about 36.4 ms.
Kernels & Compute Unit: Compute Unit Utilization section shows that CU Utilization is about 71%. This is an important measure: it represents the fraction of the device execution time during which the CU was active.
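As a rough illustration of how this ratio is defined, the sketch below computes CU utilization from two timings. The function name `cu_utilization` and the input values are hypothetical and chosen only to show the arithmetic; they are not read from the profile summary, which reports the percentage directly.

```python
def cu_utilization(cu_active_ms: float, device_exec_ms: float) -> float:
    """Return CU utilization as a percentage of device execution time."""
    if device_exec_ms <= 0:
        raise ValueError("device execution time must be positive")
    return 100.0 * cu_active_ms / device_exec_ms


# Hypothetical timings (assumption): the CU is busy for 71 ms
# out of a 100 ms device execution window.
active_ms = 71.0
device_ms = 100.0
print(f"CU utilization: {cu_utilization(active_ms, device_ms):.1f}%")
```

A value below 100% means the CU sat idle for part of the device execution window, for example while waiting for host transfers to finish.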
In the next lab, you will compare the results for “Host Data Transfer Rates” and “CU Utilization”.