This table summarizes the IO/compute balance for all supported input data types and matrix sizes. The columns are defined as follows:

- Mat A Type: number of bits of the matrix A data type; bf16 is 16 bits wide.
- Mat B Type: number of bits of the matrix B data type; bf16 is 16 bits wide.
- Compute (MAC/cyc): number of parallel multiply-accumulates that the vector processor of the AI Engine ML can perform per cycle.
- M: number of rows of matrix A.
- K: number of columns of matrix A = number of rows of matrix B.
- N: number of columns of matrix B.
- Mat A Size (B): number of bytes used to encode matrix A.
- Mat B Size (B): number of bytes used to encode matrix B.
- Load Mat A (cyc): number of cycles necessary to load matrix A.
- Load Mat B (cyc): number of cycles necessary to load matrix B.
- Compute (cyc): number of cycles necessary to compute the matrix multiplication.
- Compute (%): compute cycles divided by the maximum of the IO and compute burdens; a value of 1 means the kernel is compute bound (see the sketch after the table).
- IO A (%): matrix A load cycles divided by the maximum of the IO and compute burdens.
- IO B (%): matrix B load cycles divided by the maximum of the IO and compute burdens.
Mat A Type | Mat B Type | Compute (MAC/cyc) | M | K | N | Mat A Size (B) | Mat B Size (B) | Load Mat A (cyc) | Load Mat B (cyc) | Compute (cyc) | Compute (%) | IO A (%) | IO B (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8b | 4b | 512 | 4 | 16 | 8 | 64 | 64 | 2 | 2 | 1 | 0.5 | 1 | 1 |
8b | 4b | 512 | 8 | 16 | 8 | 128 | 64 | 4 | 2 | 2 | 0.5 | 1 | 0.5 |
8b | 4b | 512 | 4 | 32 | 8 | 128 | 128 | 4 | 4 | 2 | 0.5 | 1 | 1 |
8b | 8b | 256 | 4 | 8 | 4 | 32 | 32 | 1 | 1 | 0.5 | 0.5 | 1 | 1 |
8b | 8b | 256 | 4 | 16 | 4 | 64 | 64 | 2 | 2 | 1 | 0.5 | 1 | 1 |
8b | 8b | 256 | 8 | 8 | 4 | 64 | 32 | 2 | 1 | 1 | 0.5 | 1 | 0.5 |
8b | 8b | 256 | 2 | 8 | 8 | 16 | 64 | 0.5 | 2 | 0.5 | 0.25 | 0.25 | 1 |
8b | 8b | 256 | 4 | 8 | 8 | 32 | 64 | 1 | 2 | 1 | 0.5 | 0.5 | 1 |
8b | 8b | 256 | 2 | 16 | 8 | 32 | 128 | 1 | 4 | 1 | 0.25 | 0.25 | 1 |
8b | 8b | 256 | 4 | 16 | 8 | 64 | 128 | 2 | 4 | 2 | 0.5 | 0.5 | 1 |
16b | 8b | 128 | 4 | 4 | 4 | 32 | 16 | 1 | 0.5 | 0.5 | 0.5 | 1 | 0.5 |
16b | 8b | 128 | 8 | 4 | 4 | 64 | 16 | 2 | 0.5 | 1 | 0.5 | 1 | 0.25 |
16b | 8b | 128 | 4 | 8 | 4 | 64 | 32 | 2 | 1 | 1 | 0.5 | 1 | 0.5 |
16b | 8b | 128 | 4 | 4 | 8 | 32 | 32 | 1 | 1 | 1 | 1 | 1 | 1 |
8b | 16b | 128 | 4 | 4 | 8 | 16 | 64 | 0.5 | 2 | 1 | 0.5 | 0.25 | 1 |
8b | 16b | 128 | 4 | 4 | 4 | 16 | 32 | 0.5 | 1 | 0.5 | 0.5 | 0.5 | 1 |
16b | 16b | 64 | 4 | 4 | 4 | 32 | 32 | 1 | 1 | 1 | 1 | 1 | 1 |
16b | 16b | 64 | 2 | 4 | 8 | 16 | 64 | 0.5 | 2 | 1 | 0.5 | 0.25 | 1 |
16b | 16b | 64 | 4 | 4 | 8 | 32 | 64 | 1 | 2 | 2 | 1 | 0.5 | 1 |
16b | 16b | 64 | 4 | 2 | 8 | 16 | 32 | 0.5 | 1 | 1 | 1 | 0.5 | 1 |
32b | 16b | 32 | 2 | 4 | 8 | 32 | 64 | 1 | 2 | 2 | 1 | 0.5 | 1 |
32b | 16b | 32 | 4 | 4 | 4 | 64 | 32 | 2 | 1 | 2 | 1 | 1 | 0.5 |
32b | 16b | 32 | 4 | 2 | 4 | 32 | 16 | 1 | 0.5 | 1 | 1 | 1 | 0.5 |
16b | 32b | 32 | 2 | 4 | 8 | 16 | 128 | 0.5 | 4 | 2 | 0.5 | 0.125 | 1 |
16b | 32b | 32 | 4 | 4 | 4 | 32 | 64 | 1 | 2 | 2 | 1 | 0.5 | 1 |
32b | 32b | 16 | 4 | 2 | 4 | 32 | 32 | 1 | 1 | 2 | 1 | 0.5 | 0.5 |
32b | 32b | 16 | 4 | 4 | 4 | 64 | 64 | 2 | 2 | 4 | 1 | 0.5 | 0.5 |
32b | 32b | 16 | 8 | 2 | 4 | 64 | 32 | 2 | 1 | 4 | 1 | 0.5 | 0.25 |
bf16 | bf16 | 128 | 4 | 8 | 4 | 64 | 64 | 2 | 2 | 1 | 0.5 | 1 | 1 |
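Each percentage in the table is the corresponding cycle count divided by the largest of the three burdens (load A, load B, compute), so a column showing 1 identifies the bottleneck. The snippet below is a minimal sketch that reproduces a row of the table. It assumes a load bandwidth of 32 bytes per cycle per matrix (consistent with the Load (cyc) columns above); the helper `io_compute_balance` is hypothetical and not part of the tutorial code.

```python
def io_compute_balance(bits_a, bits_b, macs_per_cyc, M, K, N, load_bw=32):
    """Recompute one row of the IO/compute balance table.

    load_bw is an assumed bandwidth of 32 bytes/cycle per matrix load,
    chosen to match the Load (cyc) columns above.
    """
    size_a = M * K * bits_a / 8            # Mat A Size (B)
    size_b = K * N * bits_b / 8            # Mat B Size (B)
    load_a = size_a / load_bw              # Load Mat A (cyc)
    load_b = size_b / load_bw              # Load Mat B (cyc)
    compute = M * K * N / macs_per_cyc     # Compute (cyc)
    burden = max(load_a, load_b, compute)  # the bottleneck
    return {
        "Compute (cyc)": compute,
        "Compute (%)": compute / burden,
        "IO A (%)": load_a / burden,
        "IO B (%)": load_b / burden,
    }

# First 8b x 8b row of the table: 256 MAC/cyc, M=4, K=8, N=4
print(io_compute_balance(8, 8, 256, 4, 8, 4))
# {'Compute (cyc)': 0.5, 'Compute (%)': 0.5, 'IO A (%)': 1.0, 'IO B (%)': 1.0}
```

For the bf16 row (128 MAC/cyc, M=4, K=8, N=4), the same arithmetic gives 2 load cycles per matrix against 1 compute cycle, so the kernel is IO bound and compute efficiency drops to 0.5.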
## Support
GitHub issues will be used for tracking requests and bugs. For questions, go to support.xilinx.com.
Copyright © 2023 Advanced Micro Devices, Inc.