Kernel Coding Bounds - 2025.2 English - UG1079

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2025-11-26
Version: 2025.2 English

This example requires a total of 1024 int16 x int16 multiplications to compute the 128 output values. Because the AI Engine can perform 32 16-bit multiplications per cycle, the compute bound for the kernel is as follows.

Compute bound = 1024 / 32 = 32 cycles / invocation
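
The arithmetic behind this bound can be checked with a short C++ sketch. The helper name `compute_bound_cycles` and its parameters are illustrative assumptions, not part of any AI Engine API. The sketch rounds up on the assumption that a partially used cycle still costs a whole cycle.

```cpp
#include <cstdint>

// Hypothetical helper (not an AI Engine API): lower bound on cycles imposed
// by multiplier throughput, rounded up to a whole number of cycles.
constexpr std::uint32_t compute_bound_cycles(std::uint32_t total_mults,
                                             std::uint32_t mults_per_cycle) {
    return (total_mults + mults_per_cycle - 1) / mults_per_cycle;
}

// Values taken from the text: 1024 int16 x int16 multiplications per
// invocation, 32 16-bit multiplications per cycle.
static_assert(compute_bound_cycles(1024, 32) == 32,
              "compute bound should be 32 cycles per invocation");
```

Because the function is `constexpr`, the check runs at compile time and adds nothing to the kernel itself.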

Matrix B occupies only 16 * 16 bits = 256 bits, so it fits in a vector register and does not need to be fetched from the AI Engine data memory or tile interface for each MAC operation. The data “a” needed for the computation amounts to 64*8*2 = 1024 bytes that must be fetched from memory. Because the AI Engine allows two 256-bit (32-byte) loads per cycle, the memory bound for the kernel is as follows.

Memory bound = 1024 / (2*32) = 16 cycles / invocation
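
The same calculation can be written as a minimal C++ sketch. The helper `memory_bound_cycles` and the constant `kBytesOfA` are hypothetical names introduced here for illustration. The sketch assumes that both loads can be issued every cycle and that only the data “a” is fetched, because matrix B stays in a vector register.

```cpp
#include <cstdint>

// Hypothetical helper (not an AI Engine API): lower bound on cycles imposed
// by load bandwidth, rounded up to a whole number of cycles.
constexpr std::uint32_t memory_bound_cycles(std::uint32_t bytes_to_load,
                                            std::uint32_t loads_per_cycle,
                                            std::uint32_t bytes_per_load) {
    return (bytes_to_load + loads_per_cycle * bytes_per_load - 1)
           / (loads_per_cycle * bytes_per_load);
}

// Values taken from the text: 64*8*2 bytes of data "a", and two
// 256-bit (32-byte) loads per cycle.
constexpr std::uint32_t kBytesOfA = 64 * 8 * 2;  // 1024 bytes

static_assert(memory_bound_cycles(kBytesOfA, 2, 32) == 16,
              "memory bound should be 16 cycles per invocation");
```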

The compute bound is larger than the memory bound, so the kernel is compute bound. The goal of vectorization is therefore to approach the theoretical limit of MAC operations in the vector processor.
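
The comparison of the two bounds can be sketched as follows. The `Limiter` enum and both helper functions are hypothetical names, not part of any AI Engine API. The sketch assumes loads and MAC operations can overlap fully, so an invocation cannot finish in fewer cycles than the larger of the two bounds.

```cpp
#include <cstdint>

// Hypothetical classification of which resource limits the kernel.
enum class Limiter { Compute, Memory };

// Returns the resource with the larger cycle bound (ties go to Compute).
constexpr Limiter limiting_resource(std::uint32_t compute_cycles,
                                    std::uint32_t memory_cycles) {
    return compute_cycles >= memory_cycles ? Limiter::Compute : Limiter::Memory;
}

// The best achievable cycle count is no smaller than the larger bound.
constexpr std::uint32_t min_cycles_per_invocation(std::uint32_t compute_cycles,
                                                  std::uint32_t memory_cycles) {
    return compute_cycles > memory_cycles ? compute_cycles : memory_cycles;
}

// Values taken from the text: compute bound 32 cycles, memory bound 16 cycles.
static_assert(limiting_resource(32, 16) == Limiter::Compute,
              "the example kernel is limited by compute");
static_assert(min_cycles_per_invocation(32, 16) == 32,
              "an invocation needs at least 32 cycles");
```

With these numbers, a kernel that reaches 32 cycles per invocation needs only half of the available load bandwidth.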