Performance Analysis

Vitis Tutorials: AI Engine Development (XD100)
Document ID: XD100
Release Date: 2025-08-25
Version: 2025.1 English

The following parameters are the same as in the previous section:

  • Number of multiplication-accumulations to perform: 64x64x64

  • Number of parallel multiplication-accumulations in the SIMD vector processor: 256

The visualization of the profiling information for the optimized version of the kernel gives you enough data to compute the vector processor usage efficiency:

Kernel Version     Number of Cycles     Efficiency
32-bit output      1750                 58%
16-bit output      1121                 91%
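
These figures follow directly from the parameters listed above: a 64x64x64 matrix multiply requires 64 x 64 x 64 = 262,144 multiplication-accumulations, and at 256 parallel MACs per cycle the theoretical minimum is 1,024 cycles. The efficiency is that ideal cycle count divided by the measured cycle count. The short standalone C++ sketch below (not part of the tutorial sources) reproduces the two table entries:

// Standalone sketch: reproduce the efficiency column of the table above.
#include <cstdio>

int main() {
    const long total_macs     = 64L * 64L * 64L;             // 262,144 MACs
    const long macs_per_cycle = 256;                          // parallel MACs in the SIMD vector processor
    const long ideal_cycles   = total_macs / macs_per_cycle;  // 1,024 cycles at 100% efficiency

    const long  measured[2] = {1750, 1121};                   // profiled cycle counts
    const char *version[2]  = {"32-bit output", "16-bit output"};

    for (int i = 0; i < 2; ++i)
        std::printf("%s: %ld cycles -> %.1f%% efficiency\n",  // prints 58.5% and 91.3%
                    version[i], measured[i],
                    100.0 * ideal_cycles / measured[i]);
    return 0;
}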

The 32-bit output version still does not use the hardware very efficiently because of the difficulty of scheduling the load, store, and compute operations together.

The 16-bit output version takes a big leap in performance: storing the C sub-matrices is fast, so these stores can be interleaved very easily with the computation in the inner-loop code, as illustrated by the sketch that follows.
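
The store path relies on the SRS (Shift-Round-Saturate) stage: each 32-bit accumulator lane is shifted, rounded, and saturated down to a 16-bit result as it is written out, which is what the VST.SRS.s16.s32 instructions visible in the listing below do. As a rough scalar illustration (an approximation only; the actual AI Engine rounding and saturation modes are configurable), one output lane behaves like this:

// Scalar illustration (assumption: round-to-nearest and signed saturation)
// of how one 32-bit accumulator lane becomes a 16-bit stored value.
#include <cstdint>
#include <algorithm>

int16_t srs_s16_from_s32(int32_t acc, unsigned shift) {
    int64_t v = static_cast<int64_t>(acc);
    if (shift > 0)
        v += int64_t{1} << (shift - 1);                        // round to nearest
    v >>= shift;                                               // scale down
    v = std::clamp<int64_t>(v, INT16_MIN, INT16_MAX);          // saturate to s16 range
    return static_cast<int16_t>(v);
}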

Let’s have a look at this code:

Inner Loop as seen in vitis_analyzer

Some lines (1568, 1584, …) are not fully displayed in the interface, so we need to look at the original assembly code in the compilation directory (aie/Work1/aie/20_0/Release/20_0.lst). Let’s focus on the inner loop delimited by the ZLS/ZLE labels (Zero-overhead Loop Start/End):

.label ZLS_F_Z14ClassicMatMultIas ... EE_208
.loop_nesting 1
.begin_of_loop
         960    VLDA wl0, [p1], #128;         VLDB wh7, [p0, #32];    VSHUFFLE x5, x2, x1, r3;                VMAC cm2, cm3, x5, x0, r0
         972    VLDA wh9, [p4, #32];          VLDB wl7, [p0], #64;    VSHUFFLE x8, x3, x1, r2;                VMAC cm0, cm1, x4, x0, r0
         984    VLDA wl9, [p4], #128;         VLDB wh11, [p0, #32];   VSHUFFLE x4, x3, x1, r3;                VMAC cm3, cm2, x6, x1, r0
         996    VLDA wh8, [p1, #32];          VLDB wl11, [p0], #64;   VSHUFFLE x6, x2, x1, r2;                VMAC cm1, cm0, x8, x1, r0
        1008    VLDA wl8, [p1], #128;         VLDB wh2, [p0, #32];    VSHUFFLE x5, x2, x1, r3;                VMAC cm2, cm3, x5, x0, r0
        1020    VLDA wh1, [p4, #32];          VLDB wl2, [p0], #64;    VSHUFFLE x8, x3, x1, r2;                VMAC cm0, cm1, x4, x0, r0
        1032    VLDA wl1, [p4], #128;         VLDB wh3, [p0, #32];    VSHUFFLE x4, x3, x1, r3;                VMAC cm3, cm2, x6, x1, r0
        1044    VLDA wh0, [p1, #32];          VLDB wl3, [p0], #64;    VMAC cm1, cm0, x8, x1, r0
        1054    VLDA wl0, [p1], #128;         VLDB wh2, [p0, #32];    VSHUFFLE x10, x7, x1, r2;               VMAC cm2, cm3, x5, x0, r0
        1066                                  VLDB wl2, [p0], #64;    VSHUFFLE x7, x7, x1, r3;                VMAC cm0, cm1, x4, x0, r0
        1076    VLDA wh1, [p4, #32];          VLDB wh3, [p0, #32];    VSHUFFLE x10, x11, x1, r2;              VMUL cm4, x10, x9, r0
        1088    VLDA wl1, [p4], #128;         VLDB wl3, [p0], #64;     NOPS;   NOPX;     VSHUFFLE x11, x11, x1, r3;              NOPV
        1104    VLDA wh0, [p1, #32];          VLDB wh2, [p0, #32];     NOPS;   NOPX;     VSHUFFLE x6, x2, x1, r2;                VMUL cm5, x10, x9, r0
        1120    VLDA wl0, [p1], #128;         VLDB wl2, [p0], #64;     VST.SRS.s16.s32 bmh2, s0, [p2, #32];NOPX;  VSHUFFLE x5, x2, x1, r3;   VMAC cm2, cm4, x7, x8, r0
        1136    VLDA wh1, [p4, #32];          VLDB wh3, [p0, #32];     VST.SRS.s16.s32 bml2, s0, [p2], #64;NOPX;  VSHUFFLE x8, x3, x1, r2;   VMAC cm0, cm5, x11, x8, r0
        1152    VLDA wl1, [p4], #128;         VLDB wl3, [p0], #64;     VST.SRS.s16.s32 bmh0, s0, [p2, #32];NOPX;  VSHUFFLE x4, x3, x1, r3;   VMAC cm3, cm2, x6, x1, r0
.label ZLE_F_Z14ClassicMatMultIas ... EE_416
.end_of_loop
        1168    PADDA [p0], #-512;            VLDB wh0, [p1, #32];     VST.SRS.s16.s32 bml0, s0, [p2], #64;NOPX;   VSHUFFLE x6, x2, x1, r2;   VMAC cm1, cm0, x8, x1, r0

In this inner-loop code, 16 of the 17 instruction lines contain a vector compute instruction (VMUL/VMAC), which reveals a highly optimized, pipelined loop implementation. On almost every line, two vector loads are issued in parallel with the vector compute instruction, and data storage takes only about a fourth of the cycles (4 of the 17 lines carry a VST).
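
To re-check these counts on your own build, the zero-overhead loop can be extracted directly from the listing file mentioned above. The sketch below is an illustration only; it assumes the plain-text layout shown here (directives starting with '.', one VLIW bundle per line) and counts the bundles, the VMUL/VMAC slots, and the VST slots between the ZLS label and the bundle that closes the loop:

// Sketch only: recount the instruction mix of the first zero-overhead loop
// found in the generated listing. Adapt the path to your own build directory.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream lst("aie/Work1/aie/20_0/Release/20_0.lst");
    std::string line;
    bool in_loop = false, past_zle = false;
    int bundles = 0, compute = 0, stores = 0;

    while (std::getline(lst, line)) {
        if (!in_loop) {
            if (line.find(".label ZLS_") != std::string::npos) in_loop = true;
            continue;
        }
        if (line.find(".label ZLE_") != std::string::npos) { past_zle = true; continue; }
        std::size_t first = line.find_first_not_of(" \t");
        if (first == std::string::npos || line[first] == '.') continue;  // skip directives and blank lines

        ++bundles;                                                       // one VLIW bundle per line
        if (line.find("VMAC") != std::string::npos ||
            line.find("VMUL") != std::string::npos) ++compute;
        if (line.find("VST") != std::string::npos) ++stores;
        if (past_zle) break;  // the bundle after the ZLE label ends the loop body
    }
    std::cout << bundles << " bundles, " << compute << " with VMUL/VMAC, "
              << stores << " with VST\n";  // expect 17, 16, 4 for the loop shown above
    return 0;
}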