Assigning Multiple AI Engines per Filter
For an HLS implementation, the number of clocks per sample sets the throughput and is the primary factor in determining how many resources are required; the relationship is fairly linear.
The AI Engine DSPLib FIR filter kernels provide a parameter called cascade length (CASC_LEN), which assigns multiple AI Engines to a particular filter kernel. This increases throughput, but the relationship is not linear. The following tables show the results for a single 129-tap FIR filter with CASC_LEN values of 1, 2, and 4 and a window size of 256.
Cascade length | Throughput (MSPS) |
---|---|
1 | 154.40 |
2 | 267.97 |
4 | 394.98 |
Cascade length | Dynamic power (W) |
---|---|
1 | 0.749 |
2 | 0.896 |
4 | 1.100 |
Cascade length | Performance (MSPS/W) |
---|---|
1 | 206.915 |
2 | 299.073 |
4 | 358.439 |
As can be seen, going from CASC_LEN=1 to CASC_LEN=2 produces a significant improvement in throughput (about 74%). Going from CASC_LEN=2 to CASC_LEN=4 increases throughput further (about 47%), but with diminishing returns. Because power also rises with each added AI Engine, the gain in computational efficiency (MSPS/W) shrinks as well, from roughly 45% for the first step to roughly 20% for the second, so adding still more AI Engines can eventually decrease computational efficiency.
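The scaling described above can be checked directly from the table values. The following is a minimal Python sketch, not part of the original design flow; the function name `scaling_report` and the dictionary layout are illustrative assumptions. Because the table entries are rounded, the MSPS/W figures it computes may differ slightly from the Performance column above.

```python
# Values copied from the tables above (129-tap FIR, window size 256).
throughput_msps = {1: 154.40, 2: 267.97, 4: 394.98}
dynamic_power_w = {1: 0.749, 2: 0.896, 4: 1.100}


def scaling_report(throughput, power):
    """Return speedup, per-engine throughput, and MSPS/W per cascade length.

    Hypothetical helper for illustration; speedup is relative to CASC_LEN=1.
    """
    base = throughput[1]
    report = {}
    for casc_len in sorted(throughput):
        report[casc_len] = {
            # Throughput gain over a single AI Engine.
            "speedup": throughput[casc_len] / base,
            # Average throughput contributed by each AI Engine.
            "msps_per_engine": throughput[casc_len] / casc_len,
            # Efficiency recomputed from the rounded table entries.
            "msps_per_watt": throughput[casc_len] / power[casc_len],
        }
    return report


if __name__ == "__main__":
    for casc_len, row in scaling_report(throughput_msps, dynamic_power_w).items():
        print(
            f"CASC_LEN={casc_len}: "
            f"speedup {row['speedup']:.2f}x, "
            f"{row['msps_per_engine']:.1f} MSPS per engine, "
            f"{row['msps_per_watt']:.1f} MSPS/W"
        )
```

A speedup below the cascade length, together with a falling per-engine throughput, is what non-linear scaling looks like in these numbers.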
However, some applications may need every bit of available throughput and are not power constrained. Others may find that a cascade length of two is optimal because it gives the best performance while keeping the design within its power constraints. All decisions should be made with the complete application and its requirements in mind.
The following table provides additional throughput data for various filter sizes implemented on the AI Engines using different cascade lengths and a window size of 256:
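The trade-off between throughput and power can be expressed as a simple selection rule. The sketch below is illustrative only: the helper `best_cascade_length` and the power budgets passed to it are hypothetical, and a real design would also weigh latency, AI Engine tile count, and the rest of the application.

```python
# Values copied from the 129-tap FIR tables (window size 256).
THROUGHPUT_MSPS = {1: 154.40, 2: 267.97, 4: 394.98}
DYNAMIC_POWER_W = {1: 0.749, 2: 0.896, 4: 1.100}


def best_cascade_length(throughput, power, power_budget_w):
    """Pick the cascade length with the highest throughput within a power budget.

    Hypothetical helper for illustration. Returns None when no measured
    configuration fits the budget.
    """
    feasible = [c for c in throughput if power[c] <= power_budget_w]
    if not feasible:
        return None
    return max(feasible, key=lambda c: throughput[c])


if __name__ == "__main__":
    # The budgets below are made-up examples, not requirements from the tutorial.
    for budget in (0.5, 1.0, 2.0):
        choice = best_cascade_length(THROUGHPUT_MSPS, DYNAMIC_POWER_W, budget)
        print(f"budget {budget:.1f} W -> CASC_LEN={choice}")
```

With an assumed 1.0 W dynamic power budget, the rule selects a cascade length of two; with no effective power limit, it selects four.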
Filters | Taps | Throughput (CASC_LEN=1) | Throughput (CASC_LEN=2) | Throughput (CASC_LEN=4) |
---|---|---|---|---|
1 | 15 | 970.23 MSPS (*) | 970.014 MSPS | Too small to cascade |
1 | 64 | 278.30 MSPS | 427.55 MSPS | 534.90 MSPS |
1 | 129 | 154.40 MSPS | 267.97 MSPS | 394.98 MSPS |
1 | 240 | 89.724 MSPS | 169.064 MSPS | 250.596 MSPS |
(*) Note: This result is I/O bound.
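To compare how well each filter size scales, the throughput values in the table can be normalized to the CASC_LEN=1 result. The following Python sketch does this; the function name `cascade_speedups` and the use of `None` for the "Too small to cascade" entry are illustrative choices, not part of DSPLib.

```python
# Throughput in MSPS, copied from the table above (window size 256).
# None marks the configuration listed as "Too small to cascade".
THROUGHPUT_BY_TAPS = {
    15: {1: 970.23, 2: 970.014, 4: None},
    64: {1: 278.30, 2: 427.55, 4: 534.90},
    129: {1: 154.40, 2: 267.97, 4: 394.98},
    240: {1: 89.724, 2: 169.064, 4: 250.596},
}


def cascade_speedups(table):
    """Return throughput relative to CASC_LEN=1 for each tap count.

    Hypothetical helper for illustration; missing measurements stay None.
    """
    speedups = {}
    for taps, by_casc in table.items():
        base = by_casc[1]
        speedups[taps] = {
            casc_len: (None if msps is None else msps / base)
            for casc_len, msps in by_casc.items()
        }
    return speedups


if __name__ == "__main__":
    for taps, row in cascade_speedups(THROUGHPUT_BY_TAPS).items():
        cells = ", ".join(
            f"CASC_LEN={c}: " + ("n/a" if s is None else f"{s:.2f}x")
            for c, s in sorted(row.items())
        )
        print(f"{taps:>3} taps -> {cells}")
```

In this data set the longer filters gain more from cascading, while the 15-tap filter, which is already I/O bound, gains nothing from a second AI Engine.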