AI Engine Specific Design Considerations - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

Assigning Multiple AI Engines per Filter

For an HLS implementation, specifying the number of clocks per sample establishes the throughput and is the primary factor in determining how many resources are required. The relationship between the two is roughly linear.
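The roughly linear HLS relationship can be illustrated with a first-order estimate. The sketch below is a simplification under stated assumptions, not a synthesis result: the function name `hls_fir_estimate`, the 500 MHz clock, and the rule of one multiplier per tap divided by the clocks per sample (ignoring coefficient symmetry and other optimizations) are illustrative assumptions and do not come from the tutorial.

```python
import math


def hls_fir_estimate(taps, clocks_per_sample, clock_mhz):
    """Return (throughput in MSPS, approximate multiplier count).

    Assumption: each multiplier is time-shared across `clocks_per_sample`
    taps, so halving the throughput roughly halves the multipliers.
    """
    throughput_msps = clock_mhz / clocks_per_sample
    multipliers = math.ceil(taps / clocks_per_sample)
    return throughput_msps, multipliers


# Hypothetical 500 MHz clock, 129-tap filter.
for cps in (1, 2, 4):
    msps, mults = hls_fir_estimate(taps=129, clocks_per_sample=cps, clock_mhz=500.0)
    print(f"clocks/sample={cps}: {msps:.1f} MSPS, about {mults} multipliers")
```

Under these assumptions, throughput and multiplier count scale together, which is the linear behavior the AI Engine cascade results are contrasted against.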

The AI Engine DSPLib FIR filter kernels provide a parameter called cascade length (CASC_LEN), which assigns multiple AI Engines to a single filter kernel. This increases throughput, but the relationship is not linear. The following graphs and tables show the results for a single 129-tap FIR filter with CASC_LEN values of 1, 2, and 4 and a window size of 256.

Cascade length    Throughput (MSPS)
1                 154.40
2                 267.97
4                 394.98

Image of 129 Tap FIR filter metrics - Throughput vs Casc Length

Cascade length    Dynamic power (W)
1                 0.749
2                 0.896
4                 1.100

Image of 129 Tap FIR filter metrics - Power vs Casc Length

Cascade length    Performance (MSPS/W)
1                 206.915
2                 299.073
4                 358.439

Image of 129 Tap FIR filter metrics - Computational Efficiency vs Casc Length

As the data shows, going from CASC_LEN=1 to CASC_LEN=2 produces a significant improvement in throughput (154.40 to 267.97 MSPS). Going from CASC_LEN=2 to CASC_LEN=4 increases throughput further (to 394.98 MSPS), but with diminishing returns. Because power also increases with each additional AI Engine, the gains in computational efficiency (MSPS/W) shrink as well: the improvement from CASC_LEN=2 to CASC_LEN=4 is smaller than the improvement from CASC_LEN=1 to CASC_LEN=2. In other designs, adding more AI Engines can potentially decrease computational efficiency.
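The diminishing returns can be quantified directly from the measured values. The following is a minimal sketch that assumes CASC_LEN equals the number of AI Engines used by the filter; the function name `scaling_metrics` is hypothetical. The data is copied from the tables above. Ratios recomputed from the rounded table entries can differ slightly from the published MSPS/W figures.

```python
# Measured values for the 129-tap FIR filter, window size 256.
throughput_msps = {1: 154.40, 2: 267.97, 4: 394.98}
dynamic_power_w = {1: 0.749, 2: 0.896, 4: 1.100}


def scaling_metrics(casc_len):
    """Return (speedup vs CASC_LEN=1, MSPS per engine, scaling efficiency, MSPS/W)."""
    speedup = throughput_msps[casc_len] / throughput_msps[1]
    per_engine = throughput_msps[casc_len] / casc_len  # assumes one engine per cascade stage
    scaling_efficiency = speedup / casc_len            # 1.0 would be perfectly linear
    msps_per_watt = throughput_msps[casc_len] / dynamic_power_w[casc_len]
    return speedup, per_engine, scaling_efficiency, msps_per_watt


for casc_len in sorted(throughput_msps):
    s, p, e, w = scaling_metrics(casc_len)
    print(f"CASC_LEN={casc_len}: speedup {s:.2f}x, {p:.1f} MSPS/engine, "
          f"scaling efficiency {e:.0%}, {w:.1f} MSPS/W")
```

With these numbers, the speedup over a single engine is about 1.74x for two engines and about 2.56x for four, so throughput per engine falls from 154.4 MSPS to roughly 98.7 MSPS.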

However, some applications need every bit of available throughput and are not power constrained. Others may find the CASC_LEN=2 option optimal because it gives the best performance while keeping the design within its power constraints. All decisions should be made with the complete application and its requirements in mind.
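One way to frame this trade-off is to pick the highest-throughput configuration that fits a power budget. The sketch below is illustrative only: the function name `best_within_budget` and the budget values are hypothetical, and it considers only the dynamic power of this one filter, not the full application.

```python
# (throughput in MSPS, dynamic power in W) per CASC_LEN, from the tables above.
MEASURED = {1: (154.40, 0.749), 2: (267.97, 0.896), 4: (394.98, 1.100)}


def best_within_budget(power_budget_w):
    """Return the CASC_LEN with the highest throughput whose power fits the budget.

    Returns None when no measured configuration fits.
    """
    feasible = {
        casc_len: msps
        for casc_len, (msps, watts) in MEASURED.items()
        if watts <= power_budget_w
    }
    if not feasible:
        return None
    return max(feasible, key=feasible.get)


# Hypothetical budgets for this single filter.
for budget in (0.5, 0.8, 1.0, 1.2):
    print(f"budget {budget:.1f} W -> CASC_LEN {best_within_budget(budget)}")
```

A design limited to 1.0 W for this filter would land on CASC_LEN=2, while an unconstrained design would take CASC_LEN=4 for maximum throughput.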

The following table provides additional throughput data for various filter sizes implemented on the AI Engines using different cascade lengths and a window size of 256:

Filters  Taps  Throughput (CASC_LEN=1)  Throughput (CASC_LEN=2)  Throughput (CASC_LEN=4)
1        15    970.23 MSPS (*)          970.014 MSPS             Too small to cascade
1        64    278.30 MSPS              427.55 MSPS              534.90 MSPS
1        129   154.40 MSPS              267.97 MSPS              394.98 MSPS
1        240   89.724 MSPS              169.064 MSPS             250.596 MSPS

(*) Note: This result is I/O bound.
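The table can also be read in terms of speedup relative to CASC_LEN=1 for each filter size. The following sketch uses only the values from the table; the names `RESULTS` and `speedups` are hypothetical, and the 15-tap CASC_LEN=4 entry is omitted because that filter is too small to cascade.

```python
# Throughput in MSPS, keyed by tap count, then by CASC_LEN (window size 256).
RESULTS = {
    15: {1: 970.23, 2: 970.014},  # I/O bound; CASC_LEN=4 not applicable
    64: {1: 278.30, 2: 427.55, 4: 534.90},
    129: {1: 154.40, 2: 267.97, 4: 394.98},
    240: {1: 89.724, 2: 169.064, 4: 250.596},
}


def speedups(taps):
    """Return {CASC_LEN: throughput relative to CASC_LEN=1} for one filter size."""
    base = RESULTS[taps][1]
    return {casc_len: msps / base for casc_len, msps in RESULTS[taps].items()}


for taps in sorted(RESULTS):
    row = ", ".join(f"CASC_LEN={c}: {s:.2f}x" for c, s in speedups(taps).items())
    print(f"{taps:>3} taps -> {row}")
```

In this data set, longer filters benefit more from cascading: the 240-tap filter reaches about 2.79x at CASC_LEN=4, the 64-tap filter about 1.92x, and the I/O-bound 15-tap filter gains nothing from a second engine.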

Window Size