Description
You can pipeline a loop to improve latency and maximize kernel throughput and performance.
Although unrolling loops increases concurrency, it does not address the issue of keeping all elements in a kernel data path busy at all times. Even in an unrolled case, loop control dependencies can lead to sequential behavior. The sequential behavior of operations results in idle hardware and a loss of performance.
Xilinx addresses this issue by introducing a vendor extension on top of the OpenCL 2.0 API specification for loop pipelining using the XCL_PIPELINE_LOOP attribute.
By default, the v++
compiler automatically pipelines loops
with a trip count more than 64, or unrolls loops with a trip count less than 64. This should
provide good results. However, you can choose to pipeline loops (instead of the automatic
unrolling) by explicitly specifying the NOUNROLL attribute and XCL_PIPELINE_LOOP attribute
before the loop.
Syntax
Place the attribute in the OpenCL source before the loop definition:
__attribute__((xcl_pipeline_loop(<II_number>)))
Where:
- <II_number>: Specifies the desired initiation interval (II) for the pipeline. The Vitis HLS tool tries to meet this request; however, based on data dependencies, the loop might have a larger initiation interval. When the II is not specified, the default is 1.
Examples
The following example specifies an II target of 3 for the for
loop in the specified function:
__kernel void f(__global int *a) {
__attribute__((xcl_pipeline_loop(3)))
for (unsigned i = 0; i < 64; ++i)
a[i] = i;
}