In this tutorial, the 8-engine design was built quickly by instantiating the single-engine design eight times. This leaves a few optimization opportunities on the table, including the following:
Each engine uses its own IFFT graph instance to transform the radar pulse. This operation is common to all engines and could be done by a single IFFT graph whose output is broadcast to all eight engines. Because each ifft2k_async() instance occupies six tiles, this would save seven instances of six tiles, or 42 tiles. It would also remove seven GMIOs from the design, which would substantially reduce the NoC bandwidth required to deliver the radar pulses from DDR to the AI Engine array.
Constructing an 8-engine design with a single IFFT would require some code restructuring because routing the IFFT graph output to all engines requires a new top-level graph. This complicates the “Stamp and Repeat” approach to placement but should be manageable.
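The savings from sharing one IFFT can be checked with a few lines of arithmetic. The following is a minimal sketch in plain C++, not part of the tutorial sources: the names `IfftCost` and `ifft_cost` are hypothetical, and the only inputs are the figures quoted above (eight engines, six tiles per IFFT instance, one pulse GMIO per IFFT instance).

```cpp
#include <cstddef>

// Figures quoted in the text; everything else here is illustrative.
constexpr std::size_t kNumEngines   = 8;  // engines in the design
constexpr std::size_t kTilesPerIfft = 6;  // tiles used by one IFFT instance

// Hypothetical helper type: resources consumed by the IFFT stage.
struct IfftCost {
    std::size_t tiles;        // AI Engine tiles
    std::size_t pulse_gmios;  // GMIOs delivering radar pulses from DDR
};

// Assumes one pulse GMIO per IFFT instance, as implied by the text.
constexpr IfftCost ifft_cost(std::size_t num_ifft_instances) {
    return IfftCost{num_ifft_instances * kTilesPerIfft, num_ifft_instances};
}

constexpr IfftCost per_engine = ifft_cost(kNumEngines);  // current design
constexpr IfftCost shared     = ifft_cost(1);            // single shared IFFT

constexpr std::size_t tiles_saved = per_engine.tiles - shared.tiles;
constexpr std::size_t gmios_saved = per_engine.pulse_gmios - shared.pulse_gmios;

static_assert(tiles_saved == 42, "seven instances of six tiles");
static_assert(gmios_saved == 7, "seven pulse GMIOs removed");
```

The checks run at compile time, so the snippet can be dropped into any C++11 or later translation unit without affecting the design itself.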
The PL URAM portion of the design can, in principle, be removed by placing the image buffers in DDR instead of the PL. In this case, the radar processing would require eight GMIO pairs, one pair for each engine. The data would flow from DDR over the NoC to the AI Engine array, with each engine receiving its input image segment, updating it, and then streaming the output segment back to DDR over the NoC. This would remove all PL resources from the design, a significant saving and simplification. The DDR buffer layout would need to be optimized to maximize the burst bandwidth available to each engine. AMD is currently exploring this variant of the design.
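The read, update, and write-back flow of the DDR-based variant can be pictured with a toy host-side model. This is only a sketch under stated assumptions, not AI Engine or GMIO code: `process_pulse` and `update_segment` are hypothetical names, the image is a flat `std::vector<float>` split into eight equal segments, and the per-pixel update is a placeholder addition rather than the actual radar processing.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kNumEngines = 8;  // one GMIO pair per engine

// Placeholder for the update an engine applies to its image segment.
inline void update_segment(std::vector<float>& segment, float contribution) {
    for (float& pixel : segment) {
        pixel += contribution;
    }
}

// Models one pulse: each engine reads its segment from the DDR image,
// updates it, and writes it back. Assumes the image size is a multiple
// of kNumEngines.
inline void process_pulse(std::vector<float>& ddr_image, float contribution) {
    const std::size_t seg_len = ddr_image.size() / kNumEngines;
    for (std::size_t e = 0; e < kNumEngines; ++e) {
        auto first = ddr_image.begin() + static_cast<std::ptrdiff_t>(e * seg_len);
        auto last  = first + static_cast<std::ptrdiff_t>(seg_len);
        std::vector<float> segment(first, last);      // input stream: DDR to engine
        update_segment(segment, contribution);        // engine-local update
        std::copy(segment.begin(), segment.end(), first);  // output stream: engine to DDR
    }
}
```

In this model every pixel crosses the NoC twice per pulse, once in each direction, which is why the DDR buffer layout and burst efficiency matter for this variant.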
Copyright © 2025 Advanced Micro Devices, Inc