Vector Processing Unit - 2021.1 English

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID: UG1079
Release Date: 2021-07-19
Version: 2021.1 English

The vector unit contains a fixed-point unit with 128 8-bit fixed-point multipliers and a floating-point unit with eight single-precision floating-point multipliers. The vector registers and permute network are shared between the fixed-point and floating-point multipliers. The peak performance depends on the size of the data types used by the operands. The following table provides the number of MAC operations that can be performed by the vector processor per instruction.

Table 1. AI Engine Vector Precision Support
X Operand (bits)   Z Operand (bits)   Output (bits)    Number of MACs/Clock
8 real             8 real             48 real          128
16 real            8 real             48 real          64
16 real            16 real            48 real          32
16 real            16 complex         48 complex       16
16 complex         16 real            48 complex       16
16 complex         16 complex         48 complex       8
16 real            32 real            48/80 real       16
16 real            32 complex         48/80 complex    8
16 complex         32 real            48/80 complex    8
16 complex         32 complex         48/80 complex    4
32 real            16 real            48/80 real       16
32 real            16 complex         48/80 complex    8
32 complex         16 real            48/80 complex    8
32 complex         16 complex         48/80 complex    4
32 real            32 real            80 real          8
32 real            32 complex         80 complex       4
32 complex         32 real            80 complex       4
32 complex         32 complex         80 complex       2
32 SPFP            32 SPFP            32 SPFP          8

The X operand is 1024 bits wide and the Z operand is 256 bits wide. To see how the components are used, consider the first row of Table 1 (8-bit real X and Z operands). The multiplier operands come from the same 1024-bit and 256-bit input registers, but some values are broadcast to multiple multipliers. All 128 8-bit multipliers are used, and their products are post-added and accumulated into 16 or 8 accumulator lanes of 48 bits each.
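The lane arithmetic described above can be illustrated with a scalar reference model. This is only a counting sketch, not AI Engine API code: the names `kMultipliers`, `kLanes`, `Acc`, and `mac_step` are hypothetical, the 16-lane case is chosen arbitrarily, and assigning eight consecutive multipliers to each lane is an assumption made for illustration. The operand broadcasting done by the permute network is not modeled, so both inputs are given as the values already presented to each multiplier.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical constants for the 8-bit x 8-bit row of Table 1.
constexpr int kMultipliers = 128;                        // 8-bit multipliers
constexpr int kLanes = 16;                               // 16-lane case (text: 16 or 8)
constexpr int kProductsPerLane = kMultipliers / kLanes;  // 8 products post-added per lane

// A 64-bit integer stands in for one 48-bit accumulator lane.
using Acc = std::int64_t;

// One MAC step: multiply element-wise, post-add groups of products,
// then accumulate each group sum into its lane.
// Assumption: multipliers [lane*8, lane*8+7] feed accumulator lane `lane`.
inline void mac_step(const std::array<std::int8_t, kMultipliers>& x,
                     const std::array<std::int8_t, kMultipliers>& z,
                     std::array<Acc, kLanes>& acc) {
    for (int lane = 0; lane < kLanes; ++lane) {
        Acc sum = 0;  // post-adder output for this lane
        for (int k = 0; k < kProductsPerLane; ++k) {
            const int m = lane * kProductsPerLane + k;
            sum += static_cast<Acc>(x[m]) * static_cast<Acc>(z[m]);
        }
        acc[lane] += sum;
        // Check that the running total still fits a signed 48-bit value.
        assert(acc[lane] < (Acc{1} << 47) && acc[lane] >= -(Acc{1} << 47));
    }
}
```

Each call performs 128 multiplications, which matches the 128 MACs/clock entry for 8-bit real operands; with 8 lanes instead of 16, each lane would post-add 16 products.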

To calculate the maximum performance for a given datapath, multiply the number of MACs per instruction by the clock frequency of the AI Engine kernel. For example, with 16-bit real input vectors X and Z, the vector processor can achieve 32 MACs per instruction. At 1 GHz, the clock frequency of the slowest speed grade device, this gives:

32 MACs per instruction * 1 GHz clock frequency = 32 giga MAC operations per second
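The same calculation can be written as a small helper for any row of Table 1. This is a minimal sketch: the function name `peak_gmacs` is hypothetical, and it assumes one vector instruction issues per clock cycle, as the "MACs/Clock" column heading implies.

```cpp
#include <cassert>
#include <cstdint>

// Peak throughput in giga MAC operations per second.
// macs_per_clock: entry from the "Number of MACs/Clock" column of Table 1.
// clock_ghz: AI Engine clock frequency in GHz (1.0 for the slowest
//            speed grade device, per the example in the text).
inline double peak_gmacs(std::uint32_t macs_per_clock, double clock_ghz) {
    assert(clock_ghz > 0.0);  // a non-positive frequency is meaningless
    return static_cast<double>(macs_per_clock) * clock_ghz;
}
```

For example, `peak_gmacs(32, 1.0)` reproduces the 32 giga MAC operations per second computed above, and `peak_gmacs(128, 1.0)` gives the corresponding peak for 8-bit real operands. These are theoretical peaks; the throughput a kernel actually sustains depends on how well it keeps the vector unit fed.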