Based on this system architecture concept, it only remains to identify how many AI Engine resources are required to implement the (i) “row” transforms, (ii) “point-wise twiddle” multiplications, and (iii) “column” transforms. We do some early prototyping of the AI Engine subgraphs to identify the number of instances required. We identify two separate subgraphs to consider:
- One “front-end” subgraph performing a “row” IFFT-256, followed by a “point-wise twiddle” multiplication of the samples on that row, and finally zero-insertion.
- One “back-end” subgraph performing a “column” IFFT-256 followed by zero-insertion.
The zero-insertion simplifies the design of the “memory transpose” in the PL, as outlined in detail below.
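To make the decomposition concrete, the following is a minimal behavioral sketch of the front-end processing for a single row, written as plain C++ rather than AIE API kernel code. The function names (`ifft_ref`, `front_end_row`), the naive O(n²) reference IFFT, the padded length of 260 (which anticipates the final architecture described later in this section), and the exact twiddle sign and index mapping are illustrative assumptions and do not reflect the tutorial's actual single-tile kernel implementation.

```cpp
// Hypothetical behavioral reference for the "front-end" subgraph (not AIE kernel code):
// a 256-point row IFFT, a point-wise twiddle multiply, and zero-insertion.
#include <complex>
#include <vector>
#include <cmath>

static constexpr int    N1  = 256;      // row transform length
static constexpr int    N2  = 256;      // column transform length
static constexpr int    N   = N1 * N2;  // overall 64K-point transform
static constexpr int    PAD = 260;      // padded row length (260 in the final design below)
static constexpr double PI  = 3.14159265358979323846;

// Naive O(n^2) reference IFFT -- written for clarity, not speed.
std::vector<std::complex<double>> ifft_ref(const std::vector<std::complex<double>>& x)
{
    const int n = static_cast<int>(x.size());
    std::vector<std::complex<double>> y(n);
    for (int k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (int m = 0; m < n; ++m)
            acc += x[m] * std::polar(1.0, 2.0 * PI * k * m / n);
        y[k] = acc / static_cast<double>(n);
    }
    return y;
}

// One "front-end" row: IFFT-256, point-wise twiddle, then zero-insertion.
// 'r' is the row index; the twiddle couples the row and column passes of the 2D
// algorithm. The exact sign and index mapping depend on the chosen decomposition.
std::vector<std::complex<double>> front_end_row(const std::vector<std::complex<double>>& row, int r)
{
    std::vector<std::complex<double>> out = ifft_ref(row);   // "row" IFFT-256
    for (int c = 0; c < N1; ++c)
        out[c] *= std::polar(1.0, 2.0 * PI * r * c / N);     // point-wise twiddle
    out.resize(PAD, {0.0, 0.0});                             // zero-insertion
    return out;
}
```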
The throughput of prototypes of these two subgraphs identifies how many instances of each are required to achieve our overall throughput target of 2 Gsps. The figure below shows traces in Vitis Analyzer for the front-end subgraph. This design is hand-coded using the AIE API, combining all three functions into a single-tile design with a small memory footprint. Its execution time per transform is 592 ns, giving a throughput of ~430 Msps. Based on a target throughput of 2 Gsps, we need to include five instances of this subgraph in the overall design.
The figure below shows traces in Vitis Analyzer for the back-end subgraph. This design is also hand-coded using the AIE API and combines the IFFT-256 with zero-padding into a second single-tile design with a small memory footprint. Its execution time per transform is 422.4 ns, giving a throughput of ~600 Msps. Based on a target throughput of 2 Gsps, we will need to include four instances of this subgraph in the overall design.
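As a quick sanity check on the instance counts quoted above, the short sketch below reproduces them from the measured per-transform execution times, assuming 256 samples per transform and the 2 Gsps overall target; the measured times are those reported by Vitis Analyzer, and everything else is illustrative.

```cpp
// Hypothetical back-of-the-envelope check of the instance counts quoted above.
#include <cmath>
#include <cstdio>

int main()
{
    const double target_msps  = 2000.0;            // 2 Gsps overall throughput target
    const double fe_msps      = 256.0 / 0.592;     // 256 samples / 592 ns   ~= 432 Msps
    const double be_msps      = 256.0 / 0.4224;    // 256 samples / 422.4 ns ~= 606 Msps
    const int    fe_instances = static_cast<int>(std::ceil(target_msps / fe_msps));  // 5
    const int    be_instances = static_cast<int>(std::ceil(target_msps / be_msps));  // 4
    std::printf("front-end: %.0f Msps -> %d instances\n", fe_msps, fe_instances);
    std::printf("back-end : %.0f Msps -> %d instances\n", be_msps, be_instances);
    return 0;
}
```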
Based on these prototyping efforts, we arrive at the final design architecture shown in the diagram below. The design uses five instances each of the front-end and back-end subgraphs. We use five instances of the back-end subgraph even though only four are required because this simplifies the overall design architecture. These instances are time-shared over all transform operations required by the 2D algorithm. We require only 256 transforms in each “row” and “column” dimension, however, and 256 is not divisible by 5. Consequently, we zero-pad the 2D data cube by appending four rows at the bottom and four columns at the right to create a 260 x 260 data cube. Each AI Engine tile instance then performs 52 transforms (260 / 5) for both the front-end and back-end subgraphs. This also means the design uses five I/O streams into and out of each subgraph, and the same applies to the “memory transpose” operation in the PL. An important side effect of this zero-padding is that it simplifies the construction of that PL design, which may then be implemented using a 5-bank memory architecture, as outlined in detail below.
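To illustrate how this five-stream structure could be captured, here is a hedged sketch of a top-level ADF graph with five front-end and five back-end kernel instances, each with its own PLIO stream pair to and from the PL “memory transpose”. The kernel function names, source file names, PLIO names, and data files are hypothetical; the actual tutorial design may organize its graph differently.

```cpp
// Hypothetical top-level ADF graph sketch -- names and file paths are illustrative.
// Five front-end and five back-end kernels, each on its own PLIO stream pair;
// the PL "memory transpose" sits between fe_out[] and be_in[], outside this graph.
#include <adf.h>
#include <string>

using namespace adf;

// Hypothetical kernel prototypes; the real kernels are the single-tile AIE API
// designs described above.
void fft_fe_kernel(input_stream<cint16>* in, output_stream<cint16>* out);
void fft_be_kernel(input_stream<cint16>* in, output_stream<cint16>* out);

class ifft64k_graph : public graph {
public:
    static constexpr int NSTREAM = 5;

    kernel      fe[NSTREAM], be[NSTREAM];
    input_plio  fe_in[NSTREAM],  be_in[NSTREAM];
    output_plio fe_out[NSTREAM], be_out[NSTREAM];

    ifft64k_graph() {
        for (int i = 0; i < NSTREAM; ++i) {
            const std::string tag = std::to_string(i);

            fe[i] = kernel::create(fft_fe_kernel);
            be[i] = kernel::create(fft_be_kernel);
            source(fe[i]) = "fft_fe_kernel.cpp";   // hypothetical source files
            source(be[i]) = "fft_be_kernel.cpp";
            runtime<ratio>(fe[i]) = 0.9;
            runtime<ratio>(be[i]) = 0.9;

            fe_in[i]  = input_plio::create("fe_in_"   + tag, plio_64_bits, "data/fe_in_"  + tag + ".txt");
            fe_out[i] = output_plio::create("fe_out_" + tag, plio_64_bits, "data/fe_out_" + tag + ".txt");
            be_in[i]  = input_plio::create("be_in_"   + tag, plio_64_bits, "data/be_in_"  + tag + ".txt");
            be_out[i] = output_plio::create("be_out_" + tag, plio_64_bits, "data/be_out_" + tag + ".txt");

            // Stream connections: input -> front-end -> PL transpose -> back-end -> output.
            connect<stream>(fe_in[i].out[0], fe[i].in[0]);
            connect<stream>(fe[i].out[0],    fe_out[i].in[0]);
            connect<stream>(be_in[i].out[0], be[i].in[0]);
            connect<stream>(be[i].out[0],    be_out[i].in[0]);
        }
    }
};
```

Under this arrangement, each kernel instance would be invoked repeatedly so that its 52 assigned transforms are time-shared over the same tile.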