The following parameters are the same as in the previous section:
Number of multiplication-accumulations to perform: 64x64x64
Number of parallel multiplication-accumulations in the SIMD vector processor: 256
The profiling view of the optimized version of the kernel gives you enough data to compute the vector processor usage efficiency:
| Kernel Version | #cycles | Efficiency |
|---|---|---|
| 32 bits output | 1750 | 58% |
| 16 bits output | 1121 | 91% |
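The efficiency column can be reproduced from the parameters listed above. The following is a minimal sketch, assuming that efficiency is defined as the ideal cycle count (total multiply-accumulates divided by the 256 parallel multiply-accumulates per cycle) over the measured cycle count; the function and variable names are illustrative and are not part of the tutorial sources.

```python
# Parameters taken from the text above.
TOTAL_MACS = 64 * 64 * 64      # multiplication-accumulations to perform
MACS_PER_CYCLE = 256           # parallel MACs in the SIMD vector processor


def efficiency(measured_cycles: int) -> float:
    """Ratio of the ideal cycle count to the measured cycle count (assumed definition)."""
    ideal_cycles = TOTAL_MACS / MACS_PER_CYCLE  # 1024 cycles if the vector unit never stalls
    return ideal_cycles / measured_cycles


# Measured cycle counts from the profiling table.
for name, cycles in [("32 bits output", 1750), ("16 bits output", 1121)]:
    print(f"{name}: {efficiency(cycles):.1%}")
```

With the cycle counts from the table, this gives roughly 58.5% and 91.3%, in line with the reported 58% and 91%.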
The 32-bit output version still does not use the hardware very efficiently, because the load, store and compute operations are difficult to schedule together.
The 16-bit output version makes a leap forward in performance: storing the C sub-matrices is fast, so the stores can be interleaved very easily in the inner-loop code.
Let’s take a look at this code:
Some lines (1568, 1584, …) are not fully displayed in the interface, so we need to get the original assembly code from the compilation directory (`aie/Work1/aie/20_0/Release/20_0.lst`). Let’s focus on the inner loop delimited by the ZLS/ZLE labels (Zero Overhead Loop Start/End):
```
.label ZLS_F_Z14ClassicMatMultIas ... EE_208
.loop_nesting 1
.begin_of_loop
960 VLDA wl0, [p1], #128; VLDB wh7, [p0, #32]; VSHUFFLE x5, x2, x1, r3; VMAC cm2, cm3, x5, x0, r0
972 VLDA wh9, [p4, #32]; VLDB wl7, [p0], #64; VSHUFFLE x8, x3, x1, r2; VMAC cm0, cm1, x4, x0, r0
984 VLDA wl9, [p4], #128; VLDB wh11, [p0, #32]; VSHUFFLE x4, x3, x1, r3; VMAC cm3, cm2, x6, x1, r0
996 VLDA wh8, [p1, #32]; VLDB wl11, [p0], #64; VSHUFFLE x6, x2, x1, r2; VMAC cm1, cm0, x8, x1, r0
1008 VLDA wl8, [p1], #128; VLDB wh2, [p0, #32]; VSHUFFLE x5, x2, x1, r3; VMAC cm2, cm3, x5, x0, r0
1020 VLDA wh1, [p4, #32]; VLDB wl2, [p0], #64; VSHUFFLE x8, x3, x1, r2; VMAC cm0, cm1, x4, x0, r0
1032 VLDA wl1, [p4], #128; VLDB wh3, [p0, #32]; VSHUFFLE x4, x3, x1, r3; VMAC cm3, cm2, x6, x1, r0
1044 VLDA wh0, [p1, #32]; VLDB wl3, [p0], #64; VMAC cm1, cm0, x8, x1, r0
1054 VLDA wl0, [p1], #128; VLDB wh2, [p0, #32]; VSHUFFLE x10, x7, x1, r2; VMAC cm2, cm3, x5, x0, r0
1066 VLDB wl2, [p0], #64; VSHUFFLE x7, x7, x1, r3; VMAC cm0, cm1, x4, x0, r0
1076 VLDA wh1, [p4, #32]; VLDB wh3, [p0, #32]; VSHUFFLE x10, x11, x1, r2; VMUL cm4, x10, x9, r0
1088 VLDA wl1, [p4], #128; VLDB wl3, [p0], #64; NOPS; NOPX; VSHUFFLE x11, x11, x1, r3; NOPV
1104 VLDA wh0, [p1, #32]; VLDB wh2, [p0, #32]; NOPS; NOPX; VSHUFFLE x6, x2, x1, r2; VMUL cm5, x10, x9, r0
1120 VLDA wl0, [p1], #128; VLDB wl2, [p0], #64; VST.SRS.s16.s32 bmh2, s0, [p2, #32]; NOPX; VSHUFFLE x5, x2, x1, r3; VMAC cm2, cm4, x7, x8, r0
1136 VLDA wh1, [p4, #32]; VLDB wh3, [p0, #32]; VST.SRS.s16.s32 bml2, s0, [p2], #64; NOPX; VSHUFFLE x8, x3, x1, r2; VMAC cm0, cm5, x11, x8, r0
1152 VLDA wl1, [p4], #128; VLDB wl3, [p0], #64; VST.SRS.s16.s32 bmh0, s0, [p2, #32]; NOPX; VSHUFFLE x4, x3, x1, r3; VMAC cm3, cm2, x6, x1, r0
.label ZLE_F_Z14ClassicMatMultIas ... EE_416
.end_of_loop
1168 PADDA [p0], #-512; VLDB wh0, [p1, #32]; VST.SRS.s16.s32 bml0, s0, [p2], #64; NOPX; VSHUFFLE x6, x2, x1, r2; VMAC cm1, cm0, x8, x1, r0
```
In this inner loop, 16 of the 17 instruction lines contain a vector compute instruction (VMUL/VMAC); only the line at address 1088 carries a NOPV in the vector slot. This reveals a highly optimized, pipelined loop implementation. Almost all lines (15 of 17) issue two loads alongside the vector compute instruction, and data storage (VST) appears on only 4 of the 17 lines, about a fourth of the cycles.
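These counts can be checked mechanically against the listing. The sketch below is a hypothetical helper, not part of the tutorial sources: it embeds the 17 inner-loop lines copied from above (labels and directives dropped), splits each line on `;`, and counts the mnemonics that matter for the analysis.

```python
# Inner-loop instruction lines copied from the listing above (labels/directives dropped).
LISTING = """\
960 VLDA wl0, [p1], #128; VLDB wh7, [p0, #32]; VSHUFFLE x5, x2, x1, r3; VMAC cm2, cm3, x5, x0, r0
972 VLDA wh9, [p4, #32]; VLDB wl7, [p0], #64; VSHUFFLE x8, x3, x1, r2; VMAC cm0, cm1, x4, x0, r0
984 VLDA wl9, [p4], #128; VLDB wh11, [p0, #32]; VSHUFFLE x4, x3, x1, r3; VMAC cm3, cm2, x6, x1, r0
996 VLDA wh8, [p1, #32]; VLDB wl11, [p0], #64; VSHUFFLE x6, x2, x1, r2; VMAC cm1, cm0, x8, x1, r0
1008 VLDA wl8, [p1], #128; VLDB wh2, [p0, #32]; VSHUFFLE x5, x2, x1, r3; VMAC cm2, cm3, x5, x0, r0
1020 VLDA wh1, [p4, #32]; VLDB wl2, [p0], #64; VSHUFFLE x8, x3, x1, r2; VMAC cm0, cm1, x4, x0, r0
1032 VLDA wl1, [p4], #128; VLDB wh3, [p0, #32]; VSHUFFLE x4, x3, x1, r3; VMAC cm3, cm2, x6, x1, r0
1044 VLDA wh0, [p1, #32]; VLDB wl3, [p0], #64; VMAC cm1, cm0, x8, x1, r0
1054 VLDA wl0, [p1], #128; VLDB wh2, [p0, #32]; VSHUFFLE x10, x7, x1, r2; VMAC cm2, cm3, x5, x0, r0
1066 VLDB wl2, [p0], #64; VSHUFFLE x7, x7, x1, r3; VMAC cm0, cm1, x4, x0, r0
1076 VLDA wh1, [p4, #32]; VLDB wh3, [p0, #32]; VSHUFFLE x10, x11, x1, r2; VMUL cm4, x10, x9, r0
1088 VLDA wl1, [p4], #128; VLDB wl3, [p0], #64; NOPS; NOPX; VSHUFFLE x11, x11, x1, r3; NOPV
1104 VLDA wh0, [p1, #32]; VLDB wh2, [p0, #32]; NOPS; NOPX; VSHUFFLE x6, x2, x1, r2; VMUL cm5, x10, x9, r0
1120 VLDA wl0, [p1], #128; VLDB wl2, [p0], #64; VST.SRS.s16.s32 bmh2, s0, [p2, #32];NOPX; VSHUFFLE x5, x2, x1, r3; VMAC cm2, cm4, x7, x8, r0
1136 VLDA wh1, [p4, #32]; VLDB wh3, [p0, #32]; VST.SRS.s16.s32 bml2, s0, [p2], #64;NOPX; VSHUFFLE x8, x3, x1, r2; VMAC cm0, cm5, x11, x8, r0
1152 VLDA wl1, [p4], #128; VLDB wl3, [p0], #64; VST.SRS.s16.s32 bmh0, s0, [p2, #32];NOPX; VSHUFFLE x4, x3, x1, r3; VMAC cm3, cm2, x6, x1, r0
1168 PADDA [p0], #-512; VLDB wh0, [p1, #32]; VST.SRS.s16.s32 bml0, s0, [p2], #64;NOPX; VSHUFFLE x6, x2, x1, r2; VMAC cm1, cm0, x8, x1, r0
"""


def mnemonics(line: str) -> list[str]:
    """Return the instruction mnemonics of one line, ignoring the leading address."""
    _address, rest = line.split(maxsplit=1)
    return [slot.split()[0] for slot in rest.split(";") if slot.strip()]


def loop_stats(listing: str) -> dict[str, int]:
    """Count lines with a vector compute, with two loads, and with a store."""
    rows = [mnemonics(line) for line in listing.splitlines() if line.strip()]
    return {
        "lines": len(rows),
        "compute": sum(any(m in ("VMAC", "VMUL") for m in row) for row in rows),
        "two_loads": sum(sum(m.startswith("VLD") for m in row) == 2 for row in rows),
        "stores": sum(any(m.startswith("VST") for m in row) for row in rows),
    }


stats = loop_stats(LISTING)
print(stats)
```

The same helper could be pointed at other loops in the `.lst` file to compare how well their vector slots are filled, assuming the lines follow the same `address op; op; ...` layout.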