System Compute

Versal Adaptive SoC System and Solution Planning Methodology Guide (UG1504)

Document ID: UG1504
Release Date: 2024-11-13
Version: 2024.2 English

The DSP Engines and AI Engines can perform similar types of computation; both are designed for efficient multiply-accumulate (MAC) functions (for example, FIR filters). When partitioning your design, it is important to understand the different capabilities of each block as well as their interaction with the PL. This section focuses on block capabilities with respect to compute and on how some data types might map to one engine better than the other.

Devices in the Versal AI Core and AI Edge series, as well as the Versal Premium VP2502 and VP2802 devices, contain an array of AI Engine tiles or AI Engine-ML (AIE-ML) tiles. Devices with an AIE-ML array include additional rows of 512 KB memory tiles and are optimized for compute-intensive applications such as machine learning inference acceleration.

Following are examples of functions that can be mapped to DSP Engines or can be vectorized and mapped to the AI Engines (a kernel sketch follows the list):

  • Multiply
  • Multiply-accumulate
  • Fast Fourier transforms (FFT)
  • Finite impulse response (FIR) filters
  • Matrix-matrix multiply
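
As an illustration of how such a function vectorizes, the following is a minimal AI Engine kernel sketch using the AIE API. The kernel name vmac16, the 16-lane vector width, and the pointer-based interface are illustrative assumptions, not a prescribed implementation; it assumes 32-byte-aligned buffers whose length is a multiple of 16.

  #include <aie_api/aie.hpp>

  // Illustrative sketch: element-wise multiply-accumulate on 16-lane int16
  // vectors. Each aie::mac call issues 16 MACs on the vector datapath.
  void vmac16(const int16* __restrict a,
              const int16* __restrict b,
              int16* __restrict out,
              unsigned num_samples) {
      aie::accum<acc48, 16> acc;
      acc.from_vector(aie::zeros<int16, 16>());       // clear the accumulator lanes
      for (unsigned i = 0; i < num_samples; i += 16) {
          aie::vector<int16, 16> va = aie::load_v<16>(a + i);
          aie::vector<int16, 16> vb = aie::load_v<16>(b + i);
          acc = aie::mac(acc, va, vb);                // 16 parallel MACs
      }
      // Shift-round-saturate the 16 per-lane partial sums back to int16.
      aie::store_v(out, acc.to_vector<int16>(0));
  }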

DSP and PL computation is sample based. The PL is good at bit-manipulating functions and fast data reordering, which can be important when managing the data for your system.

The AI Engine is a single instruction, multiple data (SIMD) vector processor, so functions that vectorize easily are a good fit for it. For example, linear algebra functions lend themselves well to vectorization.

The AI Engine can perform both sample-based and block-based processing. For sample-based processing, the AI Engines use streaming interfaces, with data blocks sized to align with the vector datapath and register memories. This enables low latency and, in particular, high-throughput designs running at high sample rates (for example, super-sample-rate processing).
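
For example, a minimal sample-based kernel on ADF stream interfaces might look like the following sketch; the kernel name, the gain of 2, and the loop count of 256 samples per invocation are illustrative assumptions.

  #include <adf.h>

  // Illustrative sketch: sample-based processing on AXI4-Stream interfaces.
  // Each iteration consumes one sample and produces one sample, keeping
  // latency low at high sample rates.
  void stream_gain(input_stream<int16>* in, output_stream<int16>* out) {
      for (unsigned i = 0; i < 256; ++i) {
          int16 sample = readincr(in);                     // blocking single-sample read
          writeincr(out, static_cast<int16>(sample * 2));  // single-sample write
      }
  }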

With window interfaces, the AI Engine performs block-based, vectorized processing. Depending on the window size and sample rate, latency and throughput can be affected, especially when the window (block) size is small. This can be a high-bandwidth option when both 256-bit memory read ports and the 256-bit write port are used.
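
A block-based counterpart using window interfaces might look like the following sketch; the 128-sample window and kernel name are assumptions, and in practice the loop body would be vectorized so the 256-bit memory ports are exercised.

  #include <adf.h>

  // Illustrative sketch: block-based processing on window interfaces.
  // The kernel runs once per 128-sample window (block); a vectorized
  // version would read and write multiple lanes per cycle through the
  // 256-bit memory ports.
  void window_gain(input_window<int16>* in, output_window<int16>* out) {
      for (unsigned i = 0; i < 128; ++i) {
          int16 sample = window_readincr(in);                     // read one sample from the block
          window_writeincr(out, static_cast<int16>(sample * 2));  // write one sample to the block
      }
  }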

Because the AI Engine is a vector processor, it can perform more operations in a single clock cycle than the DSP Engine: for example, 256 operations at INT8 versus six operations in the DSP Engine. The following table compares the performance of native data types in the AI Engine with the equivalent performance using DSP Engines. In some cases, these performance advantages make the AI Engine a good option if your system includes a large amount of linear algebra compute targeting these data types. However, you must also weigh the additional considerations described in the rest of this section.

Table 1. Operations per Cycle for Each Versal Adaptive SoC AI Engine

Data Type    DSP Engine    AI Engine
INT8         6             256
INT16        2             64
INT24        2             16
INT32        N/A (1)       16
FP32         2             16
Complex 16   2 (2)         16
Complex 32   N/A (1)       2

  1. Cannot be implemented in a single DSP Engine and requires additional PL resources.
  2. Requires two DSP Engines to implement up to an 18-bit complex multiplier or MACC.

Before deciding which engine to use to implement these types of functions, you must evaluate how much compute you require. As shown in the following table, a small 11-tap FIR filter does not require as much compute power as a much larger 131-tap FIR. Therefore, you might choose to implement an 11-tap FIR using DSP Engines and the PL, whereas the 131-tap FIR is likely more efficient in the AI Engines (see the sizing sketch after the table).

Table 2. Example Compute Requirements for Different FIR Implementations
(16-bit real data and 16-bit real coefficients; available compute for 16-bit data in the AI Engine = 32 MACs per cycle)

                       11-Tap FIR Filter    131-Tap FIR Filter
Required compute       11 MACs              131 MACs
Number of AI Engines   0.35 (1)             4.09 (5)
Resource utilization   0.35                 0.82 per tile
Summary                Low utilization      High utilization
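
The tile counts in the table follow from straightforward arithmetic: divide the required MACs by the 32 MACs per cycle available per AI Engine, round up to a whole number of tiles, and compute utilization against that rounded count. The following standalone sketch reproduces the table's numbers:

  #include <cmath>
  #include <cstdio>

  // Sizing arithmetic behind Table 2 (16-bit data, 32 MACs/cycle per tile).
  int main() {
      const double macs_per_tile = 32.0;
      for (double taps : {11.0, 131.0}) {
          double engines_exact = taps / macs_per_tile;            // e.g., 131/32 = 4.09
          int tiles = static_cast<int>(std::ceil(engines_exact)); // round up: 1 and 5 tiles
          double utilization = taps / (tiles * macs_per_tile);    // e.g., 131/160 = 0.82
          std::printf("%3.0f taps: %.2f engines -> %d tile(s), %.2f utilization per tile\n",
                      taps, engines_exact, tiles, utilization);
      }
      return 0;
  }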

In partitioning your design, there are other factors that might affect your decision. For example, where in the application dataflow does the 11-tap filter occur? Is it part of a much larger filter chain? If so, then it might make architectural sense to implement this small FIR and the rest of the filter chain in the AI Engine for a more efficient overall system design.

For functions that are not natively supported by the AI Engines (for example, INT4 or 24-bit complex), it might be better to implement them in the DSP58 and PL, because doing so simplifies the implementation. Alternatively, these non-native operands can be supported within the AI Engine; if you take this approach, you must manage the data carefully within the vector lanes, which requires additional data-management functionality, likely in the PL.
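
As a sketch of the extra data management this implies, the following hypothetical helper (plain C++, not AI Engine code) unpacks two signed INT4 operands from each byte into one sign-extended INT8 value per vector lane, the kind of lane preparation needed before a native INT8 operation can consume such data.

  #include <cstdint>
  #include <cstddef>

  // Hypothetical helper: expand packed signed INT4 pairs (two per byte)
  // into one sign-extended INT8 value per lane.
  void unpack_int4_to_int8(const uint8_t* packed, int8_t* lanes, size_t num_pairs) {
      for (size_t i = 0; i < num_pairs; ++i) {
          // Low nibble: move it to the top of a byte, then arithmetic-shift
          // back down to sign-extend the 4-bit value.
          lanes[2 * i]     = static_cast<int8_t>(packed[i] << 4) >> 4;
          // High nibble: already in the top 4 bits; sign-extend by shifting.
          lanes[2 * i + 1] = static_cast<int8_t>(packed[i]) >> 4;
      }
  }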

The AI Engine vector processor is efficient at performing a large number of concurrent arithmetic operations. If it makes sense for the application, some non-linear algebra functions can also be implemented in the AI Engine scalar processor. The advantage of this approach is that it eliminates the need to move data out of the AI Engine array and into the PL to perform those functions. However, because of the round-trip latency between the PL and the AI Engine array, it can still be beneficial to perform pre- or post-processing in the PL with DSP58s.