The following parameters are the same as in the previous section:
- Number of multiplication-accumulations to perform: 64x64x64
- Number of parallel multiplication-accumulations in the SIMD vector processor: 256
The profiling view of the optimized kernel gives you enough data to compute the usage efficiency of the vector processor:
| Kernel Version | #cycles | Efficiency |
|---|---|---|
| 32 bits output | 1750 | 58% |
| 16 bits output | 1121 | 91% |
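
The efficiency figures can be reproduced from the parameters listed at the top of this section. The following is a minimal sketch, assuming that efficiency is defined as the ideal cycle count (total multiplication-accumulations divided by the 256 parallel lanes) over the measured cycle count; the variable names are illustrative and do not come from the tutorial sources.

```python
# Ideal cycle count: every cycle issues 256 multiplication-accumulations.
MACS_TOTAL = 64 * 64 * 64          # multiplication-accumulations to perform
MACS_PER_CYCLE = 256               # parallel MACs in the SIMD vector processor
ideal_cycles = MACS_TOTAL // MACS_PER_CYCLE   # 1024

# Measured cycle counts taken from the profiling table.
measured_cycles = {"32 bits output": 1750, "16 bits output": 1121}

# Efficiency (assumed definition): ideal cycles / measured cycles, in percent.
efficiency = {
    version: 100.0 * ideal_cycles / cycles
    for version, cycles in measured_cycles.items()
}

for version, eff in efficiency.items():
    print(f"{version}: {eff:.1f}%")
```

Under this definition the script gives about 58.5% and 91.3%, which match the table values when the fractional part is dropped.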
The 32-bit output version still does not use the hardware efficiently because of the way the load, store, and compute operations have to be scheduled.
The 16-bit output version makes a large performance jump: storing the C sub-matrix is fast, so the stores interleave easily with the inner-loop code.
You can now look at this code:
Some lines (1568, 1584, …) are not fully displayed in the interface, so you must get the original assembly code from the AIE compilation directory (aie/Work1/aie/20_0/Release/20_0.lst). Focus now on the inner loop, which is delimited by the zero overhead loop start (ZLS) and zero overhead loop end (ZLE) labels:
.label ZLS_F_Z14ClassicMatMultIas ... EE_208
.loop_nesting 1
.begin_of_loop
960 VLDA wl0, [p1], #128; VLDB wh7, [p0, #32]; VSHUFFLE x5, x2, x1, r3; VMAC cm2, cm3, x5, x0, r0
972 VLDA wh9, [p4, #32]; VLDB wl7, [p0], #64; VSHUFFLE x8, x3, x1, r2; VMAC cm0, cm1, x4, x0, r0
984 VLDA wl9, [p4], #128; VLDB wh11, [p0, #32]; VSHUFFLE x4, x3, x1, r3; VMAC cm3, cm2, x6, x1, r0
996 VLDA wh8, [p1, #32]; VLDB wl11, [p0], #64; VSHUFFLE x6, x2, x1, r2; VMAC cm1, cm0, x8, x1, r0
1008 VLDA wl8, [p1], #128; VLDB wh2, [p0, #32]; VSHUFFLE x5, x2, x1, r3; VMAC cm2, cm3, x5, x0, r0
1020 VLDA wh1, [p4, #32]; VLDB wl2, [p0], #64; VSHUFFLE x8, x3, x1, r2; VMAC cm0, cm1, x4, x0, r0
1032 VLDA wl1, [p4], #128; VLDB wh3, [p0, #32]; VSHUFFLE x4, x3, x1, r3; VMAC cm3, cm2, x6, x1, r0
1044 VLDA wh0, [p1, #32]; VLDB wl3, [p0], #64; VMAC cm1, cm0, x8, x1, r0
1054 VLDA wl0, [p1], #128; VLDB wh2, [p0, #32]; VSHUFFLE x10, x7, x1, r2; VMAC cm2, cm3, x5, x0, r0
1066 VLDB wl2, [p0], #64; VSHUFFLE x7, x7, x1, r3; VMAC cm0, cm1, x4, x0, r0
1076 VLDA wh1, [p4, #32]; VLDB wh3, [p0, #32]; VSHUFFLE x10, x11, x1, r2; VMUL cm4, x10, x9, r0
1088 VLDA wl1, [p4], #128; VLDB wl3, [p0], #64; NOPS; NOPX; VSHUFFLE x11, x11, x1, r3; NOPV
1104 VLDA wh0, [p1, #32]; VLDB wh2, [p0, #32]; NOPS; NOPX; VSHUFFLE x6, x2, x1, r2; VMUL cm5, x10, x9, r0
1120 VLDA wl0, [p1], #128; VLDB wl2, [p0], #64; VST.SRS.s16.s32 bmh2, s0, [p2, #32];NOPX; VSHUFFLE x5, x2, x1, r3; VMAC cm2, cm4, x7, x8, r0
1136 VLDA wh1, [p4, #32]; VLDB wh3, [p0, #32]; VST.SRS.s16.s32 bml2, s0, [p2], #64;NOPX; VSHUFFLE x8, x3, x1, r2; VMAC cm0, cm5, x11, x8, r0
1152 VLDA wl1, [p4], #128; VLDB wl3, [p0], #64; VST.SRS.s16.s32 bmh0, s0, [p2, #32];NOPX; VSHUFFLE x4, x3, x1, r3; VMAC cm3, cm2, x6, x1, r0
.label ZLE_F_Z14ClassicMatMultIas ... EE_416
.end_of_loop
1168 PADDA [p0], #-512; VLDB wh0, [p1, #32]; VST.SRS.s16.s32 bml0, s0, [p2], #64;NOPX; VSHUFFLE x6, x2, x1, r2; VMAC cm1, cm0, x8, x1, r0
This inner loop spans 17 lines, counting the instruction that follows the ZLE label, and 16 of them contain a vector multiply (VMUL) or vector multiply-accumulate (VMAC) instruction; only one line carries a NOPV in the vector slot. This reveals a highly optimized, pipelined loop implementation. On almost all lines, you execute two loads and one vector compute instruction. Store instructions appear on only 4 of the 17 lines, about one-fourth of the cycles.
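
The counts quoted for the inner loop can be checked mechanically. The following is a small sketch, assuming the mnemonics are transcribed by hand from the listing above with operands dropped and with `VST` standing for `VST.SRS.s16.s32`; the helper `count_bundles_with` is hypothetical and not part of any AIE tool.

```python
# Mnemonics per instruction line of the inner loop, transcribed from the
# listing above (operands dropped, VST stands for VST.SRS.s16.s32).
# The comments give the leading number of the corresponding listing line(s).
FULL = "VLDA VLDB VSHUFFLE VMAC"
STORE = "VLDA VLDB VST NOPX VSHUFFLE VMAC"

bundles = (
    [FULL] * 7                                  # 960 to 1032
    + ["VLDA VLDB VMAC"]                        # 1044
    + [FULL]                                    # 1054
    + ["VLDB VSHUFFLE VMAC"]                    # 1066
    + ["VLDA VLDB VSHUFFLE VMUL"]               # 1076
    + ["VLDA VLDB NOPS NOPX VSHUFFLE NOPV"]     # 1088
    + ["VLDA VLDB NOPS NOPX VSHUFFLE VMUL"]     # 1104
    + [STORE] * 3                               # 1120 to 1152
    + ["PADDA VLDB VST NOPX VSHUFFLE VMAC"]     # 1168, line after the ZLE label
)


def count_bundles_with(*mnemonics):
    """Count lines that contain at least one of the given mnemonics."""
    return sum(any(m in b.split() for m in mnemonics) for b in bundles)


total = len(bundles)
compute = count_bundles_with("VMAC", "VMUL")
stores = count_bundles_with("VST")
# Lines that issue both a VLDA and a VLDB load.
two_loads = sum({"VLDA", "VLDB"} <= set(b.split()) for b in bundles)

print(f"lines: {total}, compute: {compute}, stores: {stores}, two loads: {two_loads}")
print(f"vector slot usage in the loop: {100.0 * compute / total:.1f}%")
```

Counting whole lines rather than individual slots keeps the sketch close to the way the paragraph above reasons about the loop: one line corresponds to one issued instruction bundle.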