The latency information presents the execution profile of each compute unit (CU) in the binary container. When analyzing this data, it is important to recognize that all values are measured from the CU boundary through the custom logic. In-system latencies associated with data transfers to global memory are not included in these values. In addition, latency numbers are reported only for CUs targeted at the FPGA fabric. The following is an example of the latency report:
Latency Information (clock cycles)
Compute Unit  Kernel Name  Module Name  Start Interval  Best Case  Avg Case  Worst Case
------------  -----------  -----------  --------------  ---------  --------  ----------
mmult_1       mmult        mmult        826 ~ 829       825        827       828
The latency report is divided into the following fields:
- Start interval
- Best case latency
- Average case latency
- Worst case latency
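The fields listed above can be pulled out of a report row with a few lines of script. The following is a minimal sketch, not part of the tool: the `LatencyRow` class and `parse_latency_row` function are hypothetical names, and the sketch assumes that a row is whitespace-separated as in the example report and that the start interval may be printed as a `min ~ max` range.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LatencyRow:
    """One data row of the latency report (hypothetical container)."""
    compute_unit: str
    kernel: str
    module: str
    start_interval: Tuple[int, int]  # (min, max) in clock cycles
    best: int
    avg: int
    worst: int


# Assumed row layout: three names, a start interval that is either a single
# number or "min ~ max", then best, average, and worst case latency.
ROW = re.compile(
    r"^(\S+)\s+(\S+)\s+(\S+)\s+"   # compute unit, kernel name, module name
    r"(\d+)(?:\s*~\s*(\d+))?\s+"   # start interval, optionally a range
    r"(\d+)\s+(\d+)\s+(\d+)\s*$"   # best, average, worst case latency
)


def parse_latency_row(line: str) -> Optional[LatencyRow]:
    """Return the parsed row, or None for headers, rulers, and other lines."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    cu, kernel, module, lo, hi, best, avg, worst = m.groups()
    lo_cycles = int(lo)
    hi_cycles = int(hi) if hi is not None else lo_cycles
    return LatencyRow(cu, kernel, module, (lo_cycles, hi_cycles),
                      int(best), int(avg), int(worst))


if __name__ == "__main__":
    print(parse_latency_row("mmult_1 mmult mmult 826 ~ 829 825 827 828"))
```

Lines that do not contain numeric values in every field, such as the header and the dashed ruler, are skipped by returning `None`.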
The start interval defines the number of clock cycles that must pass between invocations of a CU for a given kernel.
The best, average, and worst case latency numbers refer to how many clock cycles it takes the CU to generate the results of one ND Range data tile for the kernel. For kernels that do not have data-dependent computation loops, the three latency values are the same. Data-dependent execution of loops introduces data-specific latency variation, which is captured by the latency report.
- OpenCL kernels that do not have explicit reqd_work_group_size(x,y,z)
- Kernels that have loops with variable bounds
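The start interval and latency definitions given earlier can be combined into a rough timing estimate. The following sketch is illustrative only: the function names and the 200 MHz kernel clock are assumptions, and the model assumes that a new invocation begins every start interval and that each invocation takes the reported latency. As noted for the report itself, data transfers to global memory are not covered.

```python
def total_cycles(invocations: int, start_interval: int, latency: int) -> int:
    """Cycles until the last of several back-to-back CU invocations finishes.

    Assumed model: invocation i starts at i * start_interval and runs for
    `latency` cycles, so the last one ends at (n - 1) * interval + latency.
    """
    if invocations < 1:
        return 0
    return (invocations - 1) * start_interval + latency


def cycles_to_us(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_mhz


if __name__ == "__main__":
    # Worst-case figures from the example report: interval 829, latency 828.
    cycles = total_cycles(invocations=4, start_interval=829, latency=828)
    # 200 MHz is an assumed clock, not a value taken from the report.
    print(cycles, "cycles,", cycles_to_us(cycles, clock_mhz=200.0), "us")
```

Using the worst case values gives a conservative bound. Substituting the best or average case values shows how much of the spread comes from data-dependent execution.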