Understanding Throughput Control Through an Example - 2022.2 English

Vitis Model Composer User Guide (UG1483)

Document ID: UG1483
Release Date: 2023-01-13
Version: 2022.2 English

The following section demonstrates the benefits of the Throughput Control feature using the Optical Flow example design, which is available in the list of examples for the Vitis Model Composer HLS library.

Figure 1. Optical Flow Example

This design uses the following blocks:

  • Data Type Conversion, Subtract, Right Shift, and Product blocks.
  • Window Processing blocks, with Gain, Sum of Elements, and Data Type Conversion.
  • An Import Function block with the calculating_roots function.

Figure 2. Window Processing Kernel

All these blocks follow the element-wise application pattern, and comply with the restrictions previously discussed.

Important: Direct use of the Sum of Elements block in subsystems using Throughput Control is restricted. In this example, the Sum of Elements block is used inside the Window Processing block, not directly in the top-level subsystem.

With the default Throughput Factor of 1, Model Composer generates the code shown below:
void
Lucas_Kanade(hls::stream< uint8_t >& ImageIn, hls::stream< uint8_t >& 
    ImageInDelayed, hls::stream< float >& Vx, hls::stream< float >& Vy)
{
    #pragma HLS INTERFACE axis port=ImageIn
    #pragma HLS INTERFACE axis port=ImageInDelayed
    #pragma HLS INTERFACE axis port=Vx
    #pragma HLS INTERFACE axis port=Vy
    #pragma HLS INTERFACE s_axilite port=return
    #pragma HLS dataflow

The IP reads its inputs, the image and the delayed image, over AXI4-Stream interfaces. Each input stream has a data width of 8 bits (1 pixel). The outputs, Vx and Vy, are likewise streamed over AXI4-Stream interfaces, one element at a time. Because the generated code declares them as float, each output element is 32 bits wide.

If you set the Throughput Factor (TF) to 4, Model Composer generates the code shown below:
void
Lucas_Kanade(hls::stream< xmc::MultiScalar< uint8_t, 4 > >& ImageIn, 
    hls::stream< xmc::MultiScalar< uint8_t, 4 > >& ImageInDelayed, 
    hls::stream< xmc::MultiScalar< float, 4 > >& Vx, 
    hls::stream< xmc::MultiScalar< float, 4 > >& Vy)
{
    #pragma HLS INTERFACE axis port=ImageIn
    #pragma HLS data_pack variable=ImageIn
    #pragma HLS INTERFACE axis port=ImageInDelayed
    #pragma HLS data_pack variable=ImageInDelayed
    #pragma HLS INTERFACE axis port=Vx
    #pragma HLS data_pack variable=Vx
    #pragma HLS INTERFACE axis port=Vy
    #pragma HLS data_pack variable=Vy
    #pragma HLS INTERFACE s_axilite port=return
    #pragma HLS dataflow

This IP receives 4 pixels of the input image and 4 pixels of the delayed input image at the same time, over AXI4-Stream interfaces that each have a data width of 32 bits. Inside the IP, the logic is duplicated so that 4 pixels are processed in parallel. The IP sends 4 output elements at a time on each output AXI4-Stream interface. Because Vx and Vy carry four float values per word in the generated code, each output word is 128 bits wide.
Note: xmc::MultiScalar<T,N> is a template struct, defined in xmcMultiScalar.h, that contains an array of N elements of type T.
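
The packing described in the note can be illustrated with a small stand-alone sketch. This is not the contents of xmcMultiScalar.h: the names MultiScalarSketch, lanes, packed_width_bits, pack, and subtract_lanes are hypothetical and chosen only for illustration, and the real struct may use different member names and layout. The sketch shows how N elements travel in one stream word, how the word width follows from the element type and N, and how an element-wise operation is applied to every lane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for xmc::MultiScalar<T, N>. The real definition is in
// xmcMultiScalar.h and may differ in member names and layout.
template <typename T, std::size_t N>
struct MultiScalarSketch {
    T lanes[N];  // N elements of type T carried together in one stream word
};

// Payload width, in bits, of one packed word (assumes no padding between lanes).
template <typename T, std::size_t N>
constexpr std::size_t packed_width_bits() {
    return 8 * sizeof(T) * N;
}

// Four 8-bit pixels fill a 32-bit word; four floats fill a 128-bit word.
static_assert(packed_width_bits<std::uint8_t, 4>() == 32, "4 x uint8_t");
static_assert(packed_width_bits<float, 4>() == 128, "4 x float");

// Gather N consecutive elements from memory into one packed word.
template <typename T, std::size_t N>
MultiScalarSketch<T, N> pack(const T* src) {
    assert(src != nullptr);
    MultiScalarSketch<T, N> word{};
    for (std::size_t i = 0; i < N; ++i) {
        word.lanes[i] = src[i];
    }
    return word;
}

// Element-wise subtraction over all lanes. In hardware, each loop iteration
// would correspond to one duplicated copy of the subtract logic.
template <std::size_t N>
MultiScalarSketch<std::int16_t, N> subtract_lanes(
    const MultiScalarSketch<std::uint8_t, N>& a,
    const MultiScalarSketch<std::uint8_t, N>& b) {
    MultiScalarSketch<std::int16_t, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out.lanes[i] = static_cast<std::int16_t>(a.lanes[i] - b.lanes[i]);
    }
    return out;
}
```

Calling pack<std::uint8_t, 4> on four consecutive pixels of the current frame and of the delayed frame, then passing both words to subtract_lanes, yields the four per-pixel differences in a single call. This mirrors how a Throughput Factor of 4 handles four pixels per stream transaction.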

The following table shows the Vitis HLS timing and resource estimates for the optical flow design.

Table 1. Optical Flow Design Timing/Resource Utilization Estimates
                            Throughput Factor = 1   Throughput Factor = 4   Throughput Factor = 8
Clock Freq                  300 MHz                 300 MHz                 300 MHz
Latency/II                  41848/41834             10483/10469             5358/5344
BRAM_18k (Utilization %)    5                       2                       4
DSP48E (Utilization %)      2                       9                       19
FF (Utilization %)          8                       30                      59
LUT (Utilization %)         14                      36                      88

The second row of the table shows the latency and the initiation interval (II). At a clock frequency of 300 MHz, Throughput Factors of 4 and 8 reduce the initiation interval by factors of approximately 4 and approximately 8, respectively, compared with a Throughput Factor of 1. This gain comes at the cost of higher resource utilization as the Throughput Factor increases.

For a Throughput Factor of 1, the II is 41,834 clock cycles. The input to this design is a 200x200 pixel image frame, and the II indicates the number of clock cycles needed to process the entire frame. With 40,000 pixels per frame, the design takes slightly more than one clock cycle per pixel. As the Throughput Factor increases, the II per frame decreases, and the design processes more than one pixel per clock cycle.
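
The arithmetic behind these statements can be checked with a short sketch. The II values, frame size, and clock frequency are taken from Table 1 and the text above. The function and constant names are hypothetical, and the frame-rate figure assumes that frames are fed back to back with no idle cycles between them, which the estimates themselves do not guarantee.

```cpp
#include <cassert>
#include <cstdint>

// Values taken from Table 1 and the surrounding text.
constexpr std::uint64_t kFramePixels = 200 * 200;  // 200x200 input frame
constexpr double kClockHz = 300e6;                 // 300 MHz
constexpr std::uint64_t kIiTf1 = 41834;            // II, Throughput Factor = 1
constexpr std::uint64_t kIiTf4 = 10469;            // II, Throughput Factor = 4
constexpr std::uint64_t kIiTf8 = 5344;             // II, Throughput Factor = 8

// Clock cycles spent per pixel when one frame is accepted every `ii` cycles.
double cycles_per_pixel(std::uint64_t ii, std::uint64_t pixels) {
    assert(pixels > 0);
    return static_cast<double>(ii) / static_cast<double>(pixels);
}

// Pixels processed per clock cycle (the reciprocal of cycles_per_pixel).
double pixels_per_cycle(std::uint64_t ii, std::uint64_t pixels) {
    assert(ii > 0);
    return static_cast<double>(pixels) / static_cast<double>(ii);
}

// Ratio by which the II shrinks relative to a baseline configuration.
double ii_speedup(std::uint64_t ii_base, std::uint64_t ii_new) {
    assert(ii_new > 0);
    return static_cast<double>(ii_base) / static_cast<double>(ii_new);
}

// Frames per second, ASSUMING back-to-back frames with no stalls.
double frames_per_second(double clock_hz, std::uint64_t ii) {
    assert(ii > 0);
    return clock_hz / static_cast<double>(ii);
}
```

With these numbers, cycles_per_pixel(kIiTf1, kFramePixels) comes out slightly above 1, while ii_speedup(kIiTf1, kIiTf4) and ii_speedup(kIiTf1, kIiTf8) come out just under 4 and just under 8. The shortfall is consistent with a small per-frame overhead that does not shrink with the Throughput Factor.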