AI Engine Code Vectorization

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2026-03-27
Version
2025.2 English

To realize the advantages of AI Engine processing, you must vectorize the computation. For pixel interpolation, you can restate the calculation as:

\[f(x_q,y_1) = x_{frac}f(x_2,y_1) + f(x_1,y_1) - x_{frac}f(x_1,y_1)\]

and

\[f(x_q,y_2) = x_{frac}f(x_2,y_2) + f(x_1,y_2) - x_{frac}f(x_1,y_2)\]

for the first two interpolations in the x coordinate, and

\[f(x_q,y_q) = y_{frac}f(x_q,y_2) + f(x_q,y_1) - y_{frac}f(x_q,y_1)\]

for the final interpolation in the y coordinate. With the computation reformulated in this way, the first two terms of each equation form a multiply-accumulate (MAC) operation. A follow-on multiply-and-subtract-from-accumulator (MSC) operation then applies the third term to the MAC result to obtain the interpolated value. Each of the three equations needs one MAC and one MSC, so each interpolated pixel requires 3 MAC plus 3 MSC operations.
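The MAC-then-MSC ordering described above can be illustrated with a minimal scalar sketch in plain C++. It is not AI Engine kernel code: the helper names `mac`, `msc`, `lerp_mac_msc`, and `bilinear` are hypothetical and only model the arithmetic of the three equations.

```cpp
#include <cassert>

// Hypothetical scalar models of the two operations; these are NOT AI Engine
// intrinsics, only stand-ins for the arithmetic they perform.

// Multiply-accumulate: acc + a * b.
inline float mac(float acc, float a, float b) { return acc + a * b; }

// Multiply and subtract from accumulator: acc - a * b.
inline float msc(float acc, float a, float b) { return acc - a * b; }

// One 1-D interpolation, f1 + frac * f2 - frac * f1, as one MAC and one MSC.
inline float lerp_mac_msc(float f1, float f2, float frac) {
    assert(frac >= 0.0f && frac <= 1.0f);  // fractional offset inside the cell
    float acc = mac(f1, frac, f2);         // f1 + frac * f2   (first two terms)
    return msc(acc, frac, f1);             // acc - frac * f1  (third term)
}

// Bilinear interpolation: two interpolations in x, then one in y.
// Total cost: 3 MAC + 3 MSC per pixel.
inline float bilinear(float f11, float f21, float f12, float f22,
                      float xfrac, float yfrac) {
    float fq1 = lerp_mac_msc(f11, f21, xfrac);  // f(x_q, y_1)
    float fq2 = lerp_mac_msc(f12, f22, xfrac);  // f(x_q, y_2)
    return lerp_mac_msc(fq1, fq2, yfrac);       // f(x_q, y_q)
}
```

For example, with corner values 10, 20, 30, and 40 and both fractions set to 0.5, `bilinear` returns 25, the average of the four corners.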

This example uses single-precision floating-point for computation. Figure 5 shows the floating-point vector unit of an AI Engine. Observe that the multiply and accumulator units process eight lanes in parallel. SIMD parallelism uses a pixel-per-lane approach, so each vector operation advances eight pixels at once. Because each pixel requires 3 MAC plus 3 MSC operations, each of which can execute in a single clock cycle, six cycles of vector operations produce eight pixels, giving a lower limit on the computation requirement of 6/8 = 0.75 cycles per pixel. Treat this bound as a ballpark estimate of expected performance. It is likely unachievable in practice because of overhead, bandwidth limitations, and pipelining inefficiencies.

Figure 5 - Floating-Point Vector Unit
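The pixel-per-lane approach and the 0.75 cycles-per-pixel bound can be sketched in plain C++ by modeling each vector register as an array of eight floats. This is an illustrative host-side model, not AI Engine kernel code: `Vec`, `vmac`, `vmsc`, `bilinear8`, and `cycles_per_pixel_bound` are hypothetical names, and the one-cycle-per-vector-operation cost is the assumption stated in the text above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 8;          // lanes processed in parallel
constexpr int kVectorOpsPerBatch = 6;      // 3 MAC + 3 MSC per batch of pixels

using Vec = std::array<float, kLanes>;     // one pixel per lane

// Lane-wise multiply-accumulate: acc[i] + a[i] * b[i]. Counts one vector op.
inline Vec vmac(const Vec& acc, const Vec& a, const Vec& b, int& ops) {
    Vec r{};
    for (std::size_t i = 0; i < kLanes; ++i) r[i] = acc[i] + a[i] * b[i];
    ++ops;
    return r;
}

// Lane-wise multiply and subtract: acc[i] - a[i] * b[i]. Counts one vector op.
inline Vec vmsc(const Vec& acc, const Vec& a, const Vec& b, int& ops) {
    Vec r{};
    for (std::size_t i = 0; i < kLanes; ++i) r[i] = acc[i] - a[i] * b[i];
    ++ops;
    return r;
}

// Interpolates eight pixels at once; `ops` accumulates the vector op count.
inline Vec bilinear8(const Vec& f11, const Vec& f21, const Vec& f12,
                     const Vec& f22, const Vec& xfrac, const Vec& yfrac,
                     int& ops) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        assert(xfrac[i] >= 0.0f && xfrac[i] <= 1.0f);
        assert(yfrac[i] >= 0.0f && yfrac[i] <= 1.0f);
    }
    Vec fq1 = vmsc(vmac(f11, xfrac, f21, ops), xfrac, f11, ops);  // f(x_q, y_1)
    Vec fq2 = vmsc(vmac(f12, xfrac, f22, ops), xfrac, f12, ops);  // f(x_q, y_2)
    return vmsc(vmac(fq1, yfrac, fq2, ops), yfrac, fq1, ops);     // f(x_q, y_q)
}

// Compute-only lower bound, assuming one cycle per vector operation.
constexpr double cycles_per_pixel_bound() {
    return static_cast<double>(kVectorOpsPerBatch) / kLanes;
}
```

Calling `bilinear8` once increments `ops` by six while producing eight results, which is where the 6/8 ratio comes from. The model ignores loads, stores, and pipeline latency, so it says nothing about the achievable rate.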