The AI Engine reduces the overall requirement on the PL and DSPs in a design with a lot of vectorizable compute. For example, the following table shows the resources required for the same FIR filters (64 and 240 taps) implemented in the AI Engine and in the PL with DSPs:
Impl | Filters | Taps | Param | Throughput | LUTs | Flops | DSP | AIE |
---|---|---|---|---|---|---|---|---|
AIE | 1 | 64 | win=2048 | 512.480 MSPS | 189 | 568 | 0 | 2 |
HLS | 1 | 64 | ck_per_sam=1 | 497.364 MSPS | 1888 | 5634 | 64 | 0 |
AIE | 10 | 64 | win=2048 | 5124.80 MSPS | 189 | 568 | 0 | 20 |
HLS | 10 | 64 | ck_per_sam=1 | 4781.55 MSPS | 10532 | 45009 | 640 | 0 |
AIE | 1 | 240 | win=2048 | 116.92 MSPS | 190 | 572 | 0 | 1 |
HLS | 1 | 240 | ck_per_sam=4 | 124.845 MSPS | 2528 | 7217 | 60 | 0 |
AIE | 10 | 240 | win=2048 | 1169.28 MSPS | 190 | 572 | 0 | 10 |
HLS | 10 | 240 | ck_per_sam=4 | 1235.07 MSPS | 16906 | 60872 | 600 | 0 |
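The AI Engine implementation saves significant PL resources, and the savings grow with design size: the AIE designs report the same LUT and flop counts for 1 and 10 filters, whereas the PL usage of the HLS designs grows with the filter count.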
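
To put numbers on the savings, the following minimal sketch computes how many times more LUTs the HLS design uses than the AIE design for each configuration. The values are copied from the table; the `RESOURCES` dictionary and the `lut_ratio` helper are illustrative names, not part of any tool flow.

```python
# Resource figures copied from the table:
# (impl, filters, taps) -> (LUTs, flops, DSPs, AIE tiles)
RESOURCES = {
    ("AIE", 1, 64): (189, 568, 0, 2),
    ("HLS", 1, 64): (1888, 5634, 64, 0),
    ("AIE", 10, 64): (189, 568, 0, 20),
    ("HLS", 10, 64): (10532, 45009, 640, 0),
    ("AIE", 1, 240): (190, 572, 0, 1),
    ("HLS", 1, 240): (2528, 7217, 60, 0),
    ("AIE", 10, 240): (190, 572, 0, 10),
    ("HLS", 10, 240): (16906, 60872, 600, 0),
}


def lut_ratio(filters, taps):
    """Return the HLS LUT count divided by the AIE LUT count (hypothetical helper)."""
    hls_luts = RESOURCES[("HLS", filters, taps)][0]
    aie_luts = RESOURCES[("AIE", filters, taps)][0]
    return hls_luts / aie_luts


for taps in (64, 240):
    for filters in (1, 10):
        ratio = lut_ratio(filters, taps)
        print(f"{taps} taps, {filters} filter(s): HLS uses {ratio:.1f}x the LUTs")
```

The ratio rises sharply from 1 to 10 filters in both tap counts, because the AIE LUT count stays constant while the HLS LUT count scales with the number of filters.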
Note: For the 240-tap FIR filter, the DSP version processes one sample every four clock cycles. This reduces the throughput, but also proportionately reduces the logic and power. If ck_per_sam is set to one, the design provides four times the throughput, but also uses four times the resources and power, leading to an infeasible design from a resources point of view. In any design, targeting any architecture or technology, trade-offs exist and must be understood to get the most efficient solution for your requirements.
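
The ck_per_sam trade-off can be sketched with simple arithmetic. The model below is an assumption inferred from the DSP column of the table (64 DSPs for 64 taps at ck_per_sam=1, 60 DSPs for 240 taps at ck_per_sam=4), namely that each DSP is time-shared across ck_per_sam taps. The `estimate_dsps` helper is hypothetical, and it ignores LUT, flop, and power scaling.

```python
def estimate_dsps(taps, ck_per_sam, filters=1):
    """Estimate DSP usage, assuming one DSP is shared by ck_per_sam taps.

    This is a rough model fitted to the table rows, not a tool report.
    """
    dsps_per_filter = -(-taps // ck_per_sam)  # ceiling division
    return dsps_per_filter * filters


# Configurations that appear in the table
print(estimate_dsps(64, 1))        # 64-tap, 1 filter
print(estimate_dsps(240, 4, 10))   # 240-tap, 10 filters, one sample per 4 clocks

# Hypothetical configuration from the note: 240 taps at one sample per clock
print(estimate_dsps(240, 1, 10))
```

Under this assumption, running ten 240-tap filters at ck_per_sam=1 would need four times the DSPs of the ck_per_sam=4 design in exchange for roughly four times the throughput.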