Understanding the memory hierarchy within the Versal adaptive SoC helps you determine the following:
- Scope of the problem you need to resolve.
- How data is communicated between the different engines and what bandwidth is available.
- How to take advantage of the raw compute available in each AI Engine to deliver the best performance per watt for your application, if applicable.
Different types of applications have different memory hierarchies depending on the environments where they will be deployed. For example, some applications have external DDR memory requirements while others use interfaces like JESD to bring in data from discrete analog-to-digital converters (ADCs). Regardless of where the data originates, it is important to build the correct memory hierarchy to meet the needs of your system.
For example, in systems that require external DDR memory, data can only be retrieved from DDR memory at specified rates, so the maximum bandwidth is fixed for your system. In the Versal adaptive SoC, the maximum LPDDR memory bandwidth is approximately 34 GBps per memory controller to the NoC.
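As a quick illustration, a budget check like the following C++ sketch shows whether a required input rate fits within one memory controller. Only the approximately 34 GBps figure comes from this section; the sample rate and bytes-per-sample values are illustrative assumptions, not device specifications.

```cpp
// Minimal bandwidth-budget sketch: does the application's input rate
// fit within one LPDDR memory controller?
#include <cmath>
#include <cstdio>

int main() {
    const double kCtrlBwGBps     = 34.0;   // approx. LPDDR bandwidth per controller to the NoC
    const double kSampleRateMsps = 2000.0; // assumed application input rate
    const double kBytesPerSample = 4.0;    // assumed: 16-bit I + 16-bit Q

    const double requiredGBps = kSampleRateMsps * 1e6 * kBytesPerSample / 1e9;
    const int controllersNeeded =
        static_cast<int>(std::ceil(requiredGBps / kCtrlBwGBps));

    std::printf("Required: %.1f GBps -> %d memory controller(s)\n",
                requiredGBps, controllersNeeded);
    return 0;
}
```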
Data from DDR memory can be moved throughout the adaptive SoC via the NoC. For the AI Engines, the NoC interface tiles in the AI Engine array interface provide direct access to the AI Engine array. However, the bandwidth of this path is relatively low, so it might not be the best way to deliver data to the AI Engine tiles.
In most cases, because a large amount of on-chip memory is available in Versal devices, it is expected that data is brought via the NoC into a staging memory in the PL (for example, UltraRAM and/or block RAM). Then, because there is much higher bandwidth from the PL into and out of the AI Engine array, the PL interface tiles in the AI Engine array interface are used to transfer data to and from the AI Engine array.
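The sketch below shows one way to size such a staging buffer, assuming a 288 Kb (36 KB) UltraRAM block and ping-pong buffering so the NoC fill and the PL drain can overlap. The frame size is an illustrative assumption.

```cpp
// Sketch of sizing a PL staging buffer in UltraRAM.
#include <cstdio>

int main() {
    const unsigned kUramBytes  = 36 * 1024;  // assumed: one 288 Kb UltraRAM block
    const unsigned kFrameBytes = 512 * 1024; // assumed frame size to stage
    const unsigned kBuffering  = 2;          // ping-pong: fill and drain overlap

    const unsigned totalBytes = kFrameBytes * kBuffering;
    const unsigned urams = (totalBytes + kUramBytes - 1) / kUramBytes; // round up

    std::printf("%u bytes staged -> %u UltraRAM blocks\n", totalBytes, urams);
    return 0;
}
```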
Some applications might require direct DDR memory-to-NoC-to-AI Engine communication. This is possible, but with lower overall bandwidth available. Therefore, for most applications, the NoC is recommended for debug, trace, and control communication via the PS or any other master.
Devices that contain the AI Engine array also have local data memory in each AI Engine tile. Each tile contains eight data memory banks of 4 KB each, totaling 32 KB per tile. Each AI Engine core has direct access to the data memory in its own tile and in three neighboring tiles (North, South, and either East or West), giving each core 128 KB of shared local memory.
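A simple fit check against these figures can flag when a kernel's buffers spill out of one tile's local memory into a neighbor's. The bank and tile sizes below come from this section; the kernel buffer sizes are illustrative assumptions.

```cpp
// Does a kernel's working set fit in AI Engine tile data memory?
#include <cstdio>

int main() {
    const unsigned kLocalBytes  = 8 * 4 * 1024;    // 8 banks x 4 KB = 32 KB per tile
    const unsigned kSharedBytes = 4 * kLocalBytes; // 128 KB including three neighbors

    // Assumed ping-pong input/output buffers plus a coefficient table.
    const unsigned need = 2 * 8192 /*in*/ + 2 * 8192 /*out*/ + 2048 /*coeffs*/;

    std::printf("Need %u B: local %s, shared %s\n", need,
                need <= kLocalBytes  ? "OK" : "exceeds 32 KB",
                need <= kSharedBytes ? "OK" : "exceeds 128 KB");
    return 0;
}
```

In this example the buffers exceed one tile's 32 KB, so part of the data would have to live in a neighboring tile's memory.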
Devices with an AI Engine-ML array include additional rows of memory tiles, 512 KB each, which provide low-latency local memory storage.
The following figure shows the total available memory in the VC1902 device. For other Versal AI Core devices, see the Versal AI Core Series Product Selection Guide (XMP452).
Data communication is key to an efficient design, particularly when using the AI Engine array. As a result, understanding the data bandwidth to and from the AI Engine array, and between AI Engine tiles, is required to partition the design efficiently. For more information, see the Versal Adaptive SoC AI Engine Architecture Manual (AM009).
For some functions, such as symmetric FIR filters, convolutional neural networks (CNNs), or beamforming, there is some data reuse (for example, coefficient and weight sharing). For those functions, you can reduce memory bandwidth by using the streaming broadcast functionality to send the same weights or coefficients to multiple AI Engine tiles, as illustrated in the sketch below. Applications with a large amount of data reuse are well suited to implementation in the AI Engines. Window interfaces are better for data reuse in large filter implementations on a single tile.
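As a rough illustration of the saving, broadcasting one coefficient stream to N tiles replaces N unicast copies, so coefficient bandwidth drops by a factor of N. The tile count and per-tile stream rate below are assumptions, not device figures.

```cpp
// Memory-bandwidth saving from one broadcast stream vs. N unicast copies.
#include <cstdio>

int main() {
    const int    kTiles     = 8;   // assumed tiles sharing the same weights
    const double kCoeffGBps = 1.0; // assumed coefficient stream rate per tile

    const double unicastGBps   = kTiles * kCoeffGBps; // one copy per tile
    const double broadcastGBps = kCoeffGBps;          // single shared stream

    std::printf("Unicast: %.1f GBps, broadcast: %.1f GBps (%.0fx saving)\n",
                unicastGBps, broadcastGBps, unicastGBps / broadcastGBps);
    return 0;
}
```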
The following figure shows an example dataflow through the Versal adaptive SoC into the AI Engine array with some important bandwidth numbers to consider when mapping your data through the device.
Using this example dataflow, there are considerations you must weigh based on the specifics of your application. Moving data from DDR memory into the staging memories in the PL (for example, UltraRAMs) is governed by the amount of data to be transferred, the memory controller throughput, the NoC bandwidth, and the staging memory size.
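For a first-pass estimate, the fill time of the staging memory is bounded by the slowest hop on the path. In the sketch below, only the approximately 34 GBps controller figure comes from this section; the NoC path bandwidth and transfer size are illustrative assumptions.

```cpp
// Estimate the time to fill PL staging memory from DDR memory.
#include <algorithm>
#include <cstdio>

int main() {
    const double kCtrlGBps = 34.0;              // per-controller bandwidth (from this section)
    const double kNocGBps  = 16.0;              // assumed bandwidth of the provisioned NoC path
    const double kBytes    = 4.0 * 1024 * 1024; // assumed 4 MB transfer

    const double effGBps = std::min(kCtrlGBps, kNocGBps); // slowest hop dominates
    std::printf("Fill time: %.1f us\n", kBytes / (effGBps * 1e9) * 1e6);
    return 0;
}
```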
After the data is loaded into the UltraRAMs, a data sort or pre-processing stage might be required in the PL. Assuming this is not the case, the data needs to reach the AI Engines for processing. The key considerations at this phase of the dataflow are the bandwidth through the AI Engine array interface and the bandwidth within the AI Engine array for transferring data to the necessary tiles.
The scope of the problem to solve in the AI Engines is determined by both the AXI4-Stream bandwidth into the array and the data memory size in the AI Engine tile and across the array, as the sketch below illustrates.
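A back-of-the-envelope bound on both limits might look like the following. The 4 GBps-per-stream figure assumes a 32-bit AXI4-Stream running at a 1 GHz AI Engine clock, and the aggregate ingress rate is an illustrative assumption; confirm the actual rates for your device and speed grade in the data sheet.

```cpp
// Bound the per-tile problem by the two limits named in this section:
// AXI4-Stream bandwidth into the array and tile data-memory size.
#include <cmath>
#include <cstdio>

int main() {
    const double   kStreamGBps  = 4.0;       // assumed: 32-bit stream at 1 GHz
    const double   kIngressGBps = 24.0;      // assumed aggregate rate into the array
    const unsigned kTileMemory  = 32 * 1024; // 32 KB data memory per tile

    const int streams = static_cast<int>(std::ceil(kIngressGBps / kStreamGBps));
    std::printf("Need %d input streams; each tile iteration is capped by %u B of local memory\n",
                streams, kTileMemory);
    return 0;
}
```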