This benchmark performs matrix-matrix multiplication (A * B = C), where M is the number of rows of matrices A and C, K is the number of columns of matrix A and the number of rows of matrix B, and N is the number of columns of matrices B and C.
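
To make the roles of M, K, and N concrete, the following is a minimal reference sketch of the same product in plain Python. It is illustrative only and is not the FPGA kernel being benchmarked; the function name `gemm_reference` is hypothetical.

```python
def gemm_reference(a, b):
    """Return C = A * B for A of shape M x K and B of shape K x N (lists of rows)."""
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("columns of A must equal rows of B (both are K)")
    n = len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):          # rows of A and C (M)
        for j in range(n):      # columns of B and C (N)
            acc = 0
            for p in range(k):  # shared inner dimension (K)
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


# A is 2 x 3 (M=2, K=3) and B is 3 x 2 (K=3, N=2), so C is 2 x 2 (M x N).
a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
c = gemm_reference(a, b)
```

The benchmark rows below all use square problems (M = N = K), but the same shape rules apply to rectangular inputs.
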
GEMM with OpenCL on U250
M | N | K | Kernel execution time [ms] | API execution time [ms] | Kernel efficiency [%] |
---|---|---|---|---|---|
64 | 64 | 64 | 0.010905 | 1.750123 | 38.802577 |
128 | 128 | 128 | 0.048517 | 13.802416 | 69.772592 |
256 | 256 | 256 | 0.328314 | 14.645931 | 82.485022 |
512 | 512 | 512 | 3.213388 | 18.199255 | 67.420400 |
1024 | 1024 | 1024 | 24.113855 | 45.519852 | 71.875005 |
2048 | 2048 | 2048 | 186.688153 | 264.195138 | 74.270743 |
4096 | 4096 | 4096 | 1469.773731 | 1708.938204 | 75.469945 |
For more details on this benchmark, see:
GEMM with XRT on U250
M | N | K | API execution time [ms] | API efficiency [%] | PerfApiTops |
---|---|---|---|---|---|
256 | 256 | 256 | 2.295277 | 11.798572 | 0.058818 |
512 | 512 | 512 | 7.185994 | 30.148638 | 0.149859 |
1024 | 1024 | 1024 | 33.357721 | 51.957490 | 0.257887 |
2048 | 2048 | 2048 | 218.662946 | 63.410230 | 0.314501 |
4096 | 4096 | 4096 | 1594.648667 | 69.559988 | 0.344877 |
8192 | 8192 | 8192 | 12695.637510 | 69.897233 | 0.346485 |
GEMM with XRT (one CU, streaming kernel) on U250
M | N | K | API execution time [ms] | API efficiency [%] | PerfApiTops |
---|---|---|---|---|---|
256 | 256 | 256 | 1.370527 | 19.127241 | 0.024626 |
512 | 512 | 512 | 4.517989 | 46.417820 | 0.059589 |
1024 | 1024 | 1024 | 29.500145 | 56.871639 | 0.072902 |
2048 | 2048 | 2048 | 217.555482 | 61.693563 | 0.079026 |
4096 | 4096 | 4096 | 1685.337895 | 63.710774 | 0.081580 |
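
As a rough way to relate an execution time to a throughput figure, the sketch below converts a measured time into tera-operations per second. It assumes the conventional count of 2 * M * N * K operations per GEMM (one multiply and one add per inner-product term). The benchmark's own operation count is not stated here and may differ, so the result need not match the PerfApiTops column exactly. The function names are hypothetical.

```python
def gemm_ops(m, n, k):
    # Assumed convention: one multiply and one add per inner-product term.
    return 2 * m * n * k


def throughput_tops(m, n, k, time_ms):
    """Tera-operations per second for one GEMM that took time_ms milliseconds."""
    seconds = time_ms / 1e3
    return gemm_ops(m, n, k) / seconds / 1e12


# Last row of the single-CU streaming table above.
tops = throughput_tops(4096, 4096, 4096, 1685.337895)
```

Under this assumption the 4096 row works out to roughly 0.0815 TOPS, which is close to but not identical to the tabulated value.
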
For more details on the benchmarks, see: