Multi-Kernel FIR Filter Implementation - 2024.1 English

Vitis Tutorials: AI Engine

Version: Vitis 2024.1

In this second part of the tutorial, you will distribute the computation across multiple AI Engines and analyze the performance that can be achieved.

Navigate to the MultiKernel directory to continue.

Designing the Kernel

As in the single-kernel tutorial, this design uses streaming input and output, but the performance must be improved. The limitation can come from two sources:

  • Bandwidth

  • Compute performance

In the single-kernel section of the tutorial, the maximum throughput was 225 Msps, which shows that the streams are starved because compute performance is the limit. The cint16 data type is 32 bits wide, and the maximum bandwidth of the AXI-Stream connection array is one cint16 sample per clock cycle on a single stream. In the single-kernel part, four samples were read in four clock cycles, but the computation for the 32 taps took 16 clock cycles. For the optimal trade-off, the computation should take only four clock cycles for each group of four input samples read from the stream. Because only eight taps can be processed in four clock cycles, the complete filtering operation should be split across four AI Engines.
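The arithmetic behind this split can be checked with a small host-side sketch. It is plain C++, not AI Engine kernel code, and the struct and function names are hypothetical helpers introduced only for illustration. The only inputs are the figures quoted above: 32 taps, four samples read in four cycles, and 16 compute cycles in the single-kernel design.

```cpp
#include <cassert>

// Figures quoted in the paragraph above. The struct and function names are
// illustrative only and are not part of any AI Engine API.
struct FilterBudget {
    int taps;              // total number of filter taps (32)
    int samples_per_iter;  // cint16 samples read per iteration (4)
    int read_cycles;       // cycles needed to read them from one stream (4)
    int compute_cycles;    // cycles the single kernel needs to filter them (16)
};

// Multiply-accumulate operations the single kernel sustains per clock cycle.
int macs_per_cycle(const FilterBudget& b) {
    assert(b.compute_cycles > 0);
    return b.taps * b.samples_per_iter / b.compute_cycles;
}

// Taps one kernel can cover when its compute time equals the stream read time.
int taps_per_kernel(const FilterBudget& b) {
    assert(b.samples_per_iter > 0);
    return macs_per_cycle(b) * b.read_cycles / b.samples_per_iter;
}

// Number of kernels needed to cover all taps, rounded up.
int kernels_needed(const FilterBudget& b) {
    const int per_kernel = taps_per_kernel(b);
    assert(per_kernel > 0);
    return (b.taps + per_kernel - 1) / per_kernel;
}
```

With the tutorial's figures, `FilterBudget{32, 4, 4, 16}`, these helpers give 8 multiply-accumulates per cycle, 8 taps per kernel, and 4 kernels.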

The single-kernel filter can be represented by this convolution:

[Image: single-kernel filter convolution equation]
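In standard FIR notation, a 32-tap convolution can be written as follows. The symbols are assumptions chosen here for illustration (coefficients c_k, input samples x_n, output samples y_n), and the original figure may use a different indexing convention.

```latex
y_n = \sum_{k=0}^{31} c_k \, x_{n-k}
```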

After subdivision into four kernels, each running on a different AI Engine, the filter can be represented by four smaller filters running in parallel on the same data stream, except that some of these kernels discard the beginning of the stream:
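A minimal host-side sketch of this decomposition, in plain C++ with `int` samples standing in for cint16. The function names are hypothetical, and which kernel discards how many samples depends on the indexing convention. This sketch assumes the causal form y[n] = sum over k of c[k] * x[n - k], so the kernel holding the first taps skips the most samples. The partial results are simply added in software; how the tutorial design combines them is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference filter: one kernel applies all T taps.
// Output m is sum_k c[k] * x[m + T - 1 - k], for m = 0 .. x.size() - T.
std::vector<int> fir_single(const std::vector<int>& x, const std::vector<int>& c) {
    assert(!c.empty() && x.size() >= c.size());
    const std::size_t taps = c.size();
    std::vector<int> y(x.size() - taps + 1, 0);
    for (std::size_t m = 0; m < y.size(); ++m)
        for (std::size_t k = 0; k < taps; ++k)
            y[m] += c[k] * x[m + taps - 1 - k];
    return y;
}

// Split filter: num_kernels smaller filters run on the same stream.
// Kernel i owns taps [i*sub, (i+1)*sub) and discards the first
// (num_kernels - 1 - i) * sub samples of the stream.
std::vector<int> fir_split(const std::vector<int>& x, const std::vector<int>& c,
                           std::size_t num_kernels) {
    assert(num_kernels > 0 && !c.empty() && c.size() % num_kernels == 0);
    assert(x.size() >= c.size());
    const std::size_t sub = c.size() / num_kernels;
    std::vector<int> y(x.size() - c.size() + 1, 0);
    for (std::size_t i = 0; i < num_kernels; ++i) {
        const auto tap_begin = static_cast<std::ptrdiff_t>(i * sub);
        const auto tap_end = static_cast<std::ptrdiff_t>((i + 1) * sub);
        const auto skip = static_cast<std::ptrdiff_t>((num_kernels - 1 - i) * sub);
        const std::vector<int> sub_c(c.begin() + tap_begin, c.begin() + tap_end);
        const std::vector<int> sub_x(x.begin() + skip, x.end());
        const std::vector<int> partial = fir_single(sub_x, sub_c);
        // Each partial result is at least as long as y; accumulate the overlap.
        for (std::size_t m = 0; m < y.size(); ++m)
            y[m] += partial[m];
    }
    return y;
}
```

For any input at least as long as the filter, `fir_split(x, c, 4)` with 32 coefficients returns the same samples as `fir_single(x, c)`, which is the property the four-kernel design relies on.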