The following table gives results for the Matrix Multiply function with a wide variety of supported parameters, which are defined in L2 Matrix Multiply Configuration Parameters.
Note
cycleCountAvg does not include the cycle counts of the additional shuffling/tiling widget kernels, whereas initiationInterval and PROGRAM_MEMORY do include them.
| Library Element | T_DATA_A | T_DATA_B | P_DIM_A | P_DIM_AB | P_DIM_B | P_ADD_TILING_A | P_ADD_TILING_B | P_ADD_DETILING_OUT | P_INPUT_WINDOW_VSIZE_A | P_INPUT_WINDOW_VSIZE_B | P_CASC_LEN | NITER | cycleCountAvg | throughputAvg | initiationInterval | throughputInitIntAvg | NUM_BANKS | NUM_AIE | DATA_MEMORY | PROGRAM_MEMORY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| matrix_mult | cfloat | cfloat | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 858 | 2386 Msa/s | 1015 | 2017 Msa/s | 32 | 9 | 48299 | 3234 3050 3234 3314 3234 3314 3234 3386 3540 |
| matrix_mult | cint16 | cint16 | 8 | 8 | 8 | 1 | 1 | 1 | 64 | 64 | 1 | 16 | 109 | 4697 Msa/s | 2367 | 216 Msa/s | 12 | 3 | 9617 | 4252 2556 3460 |
| matrix_mult | cint16 | cint32 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 222 | 9225 Msa/s | 327 | 6262 Msa/s | 31 | 9 | 37295 | 1940 1812 1940 1820 1940 1254 1820 1940 2034 |
| matrix_mult | cint16 | int16 | 8 | 64 | 4 | 1 | 0 | 1 | 512 | 256 | 4 | 16 | 1067 | 1919 Msa/s | 1887 | 1085 Msa/s | 30 | 7 | 25127 | 3096 1904 3938 1722 1722 3954 1706 |
| matrix_mult | cint16 | int32 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 829 | 2470 Msa/s | 2396 | 854 Msa/s | 29 | 9 | 33199 | 4172 2814 4172 2926 4172 3386 2926 4172 3130 |
| matrix_mult | cint32 | cint16 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 233 | 8789 Msa/s | 338 | 6059 Msa/s | 32 | 9 | 41391 | 2004 1782 2004 1846 2004 1254 1846 2004 2010 |
| matrix_mult | cint32 | cint32 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 379 | 5403 Msa/s | 483 | 4240 Msa/s | 33 | 9 | 45993 | 1972 2092 1972 2176 1972 1254 2176 1972 2376 |
| matrix_mult | cint32 | int16 | 8 | 64 | 4 | 1 | 0 | 1 | 512 | 256 | 4 | 16 | 228 | 8982 Msa/s | 313 | 6543 Msa/s | 26 | 7 | 34087 | 1254 2050 2374 1838 1822 2358 1826 |
| matrix_mult | cint32 | int32 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 233 | 8789 Msa/s | 338 | 6059 Msa/s | 32 | 9 | 41391 | 2004 1782 2004 1846 2004 1254 1846 2004 2010 |
| matrix_mult | float | cfloat | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 406 | 5044 Msa/s | 510 | 4015 Msa/s | 32 | 9 | 39598 | 2644 2514 2644 2584 2644 2584 2644 1254 2702 |
| matrix_mult | float | float | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 545 | 3757 Msa/s | 2397 | 854 Msa/s | 34 | 9 | 35112 | 4172 2214 4172 2246 4172 2246 4172 3096 2438 |
| matrix_mult | int16 | cint16 | 16 | 16 | 16 | 1 | 1 | 1 | 256 | 256 | 1 | 16 | 348 | 11770 Msa/s | 1130 | 3624 Msa/s | 13 | 3 | 16784 | 2328 2484 3736 |
| matrix_mult | int16 | cint32 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 175 | 11702 Msa/s | 281 | 7288 Msa/s | 28 | 9 | 33455 | 2260 1902 2260 1886 2260 1254 1886 2276 2144 |
| matrix_mult | int16 | int16 | 16 | 16 | 16 | 1 | 1 | 1 | 256 | 256 | 1 | 16 | 260 | 15753 Msa/s | 2153 | 1902 Msa/s | 13 | 3 | 12686 | 4114 2096 3484 |
| matrix_mult | int32 | cint16 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 829 | 2470 Msa/s | 2396 | 854 Msa/s | 29 | 9 | 33199 | 4172 2814 4172 2926 4172 3386 2926 4172 3130 |
| matrix_mult | int32 | cint32 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 222 | 9225 Msa/s | 327 | 6262 Msa/s | 31 | 9 | 37295 | 1940 1812 1940 1820 1940 1254 1820 1940 2034 |
| matrix_mult | cint16 | cint16 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 373 | 5490 Msa/s | 2387 | 857 Msa/s | 33 | 9 | 32815 | 4172 1992 4172 2060 4172 2060 4172 3096 2226 |
| matrix_mult | cint16 | cint16 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 1 | 16 | 294 | 6965 Msa/s | 2515 | 814 Msa/s | 12 | 3 | 19345 | 4188 2114 3080 |
| matrix_mult | cint16 | cint16 | 8 | 4 | 64 | 1 | 1 | 1 | 32 | 256 | 1 | 16 | 289 | 7086 Msa/s | 1313 | 1559 Msa/s | 13 | 3 | 19089 | 3246 1952 3484 |
| matrix_mult | cfloat | float | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 316 | 6481 Msa/s | 420 | 4876 Msa/s | 33 | 9 | 43694 | 2084 2334 2084 2334 2084 2334 2084 1254 2558 |
| matrix_mult | cint16 | cint16 | 1024 | 4 | 4 | 0 | 0 | 0 | 4096 | 16 | 1 | 16 | 2334 | 7019 Msa/s | 3996 | 4100 Msa/s | 11 | 1 | 67849 | 1960 |
| matrix_mult | cint16 | cint16 | 1024 | 4 | 4 | 1 | 1 | 1 | 4096 | 16 | 1 | 16 | 2334 | 7019 Msa/s | 4518 | 3626 Msa/s | 16 | 3 | 105105 | 3062 3342 1960 |
| matrix_mult | cint16 | cint16 | 16 | 16 | 16 | 0 | 0 | 0 | 256 | 256 | 1 | 16 | 631 | 6491 Msa/s | 653 | 6272 Msa/s | 7 | 1 | 8329 | 3408 |
| matrix_mult | cint16 | cint16 | 16 | 16 | 16 | 1 | 1 | 1 | 256 | 256 | 1 | 16 | 631 | 6491 Msa/s | 2455 | 1668 Msa/s | 13 | 3 | 18833 | 3736 3408 4604 |
| matrix_mult | cint16 | cint16 | 16 | 256 | 16 | 0 | 0 | 1 | 4096 | 4096 | 1 | 16 | 8315 | 7881 Msa/s | 8184 | 8007 Msa/s | 14 | 2 | 73997 | 3736 3408 |
| matrix_mult | cint16 | cint16 | 20 | 60 | 4 | 1 | 1 | 1 | 1200 | 240 | 1 | 16 | 725 | 6620 Msa/s | 2752 | 1744 Msa/s | 13 | 3 | 30865 | 4204 2220 3342 |
| matrix_mult | cint16 | cint16 | 24 | 4 | 4 | 1 | 1 | 1 | 96 | 16 | 1 | 16 | 85 | 4517 Msa/s | 1224 | 313 Msa/s | 12 | 3 | 9105 | 3062 1944 3342 |
| matrix_mult | int32 | int16 | 8 | 64 | 4 | 1 | 0 | 1 | 512 | 256 | 4 | 16 | 1067 | 1919 Msa/s | 1887 | 1085 Msa/s | 30 | 7 | 25127 | 3096 1904 3938 1722 1722 3954 1706 |
| matrix_mult | cint16 | cint16 | 32 | 32 | 32 | 0 | 0 | 0 | 1024 | 1024 | 1 | 16 | 4334 | 7560 Msa/s | 4182 | 7835 Msa/s | 7 | 1 | 26761 | 5100 |
| matrix_mult | cint16 | cint16 | 32 | 32 | 32 | 1 | 0 | 0 | 1024 | 1024 | 1 | 16 | 4345 | 7541 Msa/s | 4269 | 7675 Msa/s | 9 | 2 | 37133 | 3362 5100 |
| matrix_mult | cint16 | cint16 | 32 | 32 | 64 | 0 | 0 | 0 | 1024 | 2048 | 1 | 16 | 8590 | 7629 Msa/s | 8248 | 7945 Msa/s | 7 | 1 | 43145 | 5100 |
| matrix_mult | cint16 | cint16 | 32 | 64 | 32 | 0 | 0 | 0 | 2048 | 2048 | 1 | 16 | 8430 | 7774 Msa/s | 8097 | 8093 Msa/s | 7 | 1 | 43145 | 5100 |
| matrix_mult | cint16 | cint16 | 64 | 64 | 64 | 0 | 0 | 0 | 4096 | 4096 | 1 | 16 | 33529 | 7818 Msa/s | 31841 | 8232 Msa/s | 13 | 1 | 100489 | 5084 |
| matrix_mult | cint16 | cint16 | 8 | 4 | 4 | 1 | 1 | 1 | 32 | 16 | 1 | 16 | 47 | 2723 Msa/s | 1218 | 105 Msa/s | 12 | 3 | 7569 | 3062 1740 3096 |
| matrix_mult | cint16 | cint16 | 8 | 4 | 512 | 1 | 0 | 1 | 32 | 2048 | 1 | 16 | 2081 | 7873 Msa/s | 3879 | 4223 Msa/s | 13 | 2 | 86541 | 1952 3516 |
| matrix_mult | cint16 | cint16 | 8 | 4 | 512 | 1 | 1 | 1 | 32 | 2048 | 1 | 16 | 2081 | 7873 Msa/s | 3986 | 4110 Msa/s | 16 | 3 | 105105 | 3246 1952 3516 |
| matrix_mult | int32 | int32 | 8 | 64 | 4 | 1 | 1 | 1 | 512 | 256 | 4 | 16 | 373 | 5490 Msa/s | 2387 | 857 Msa/s | 33 | 9 | 32815 | 4172 1992 4172 2060 4172 2060 4172 3096 2226 |
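
The two throughput columns appear to follow directly from the two cycle-count columns. The sketch below is inferred from the table values and is not a formula stated by the library documentation. It assumes a 1 GHz clock, one matrix per input window (P_INPUT_WINDOW_VSIZE_A equals P_DIM_A × P_DIM_AB in every row above), and truncation to a whole number of Msa/s. The function name `throughput_msa_per_s` and the `clock_mhz` parameter are hypothetical and exist only for this illustration.

```python
def throughput_msa_per_s(dim_a: int, dim_ab: int, dim_b: int,
                         cycles: int, clock_mhz: int = 1000) -> int:
    """Estimate throughput in Msa/s from matrix dimensions and a cycle count.

    Assumption (inferred from the table, not from the library docs):
    throughput = P_DIM_A * P_DIM_AB * P_DIM_B / cycles * clock frequency,
    truncated to an integer, with a 1 GHz (1000 MHz) clock by default.
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    work = dim_a * dim_ab * dim_b  # product of the three matrix dimensions
    return work * clock_mhz // cycles  # integer division truncates


# Row with P_DIM_A=8, P_DIM_AB=8, P_DIM_B=8:
# cycleCountAvg=109 and initiationInterval=2367.
print(throughput_msa_per_s(8, 8, 8, 109))   # throughputAvg column: 4697
print(throughput_msa_per_s(8, 8, 8, 2367))  # throughputInitIntAvg column: 216
```

Passing cycleCountAvg reproduces throughputAvg, and passing initiationInterval reproduces throughputInitIntAvg. Because initiationInterval includes the widget kernels, rows where tiling or detiling is enabled can show a much lower throughputInitIntAvg than throughputAvg.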