For example, let’s take the first entry of the table, which is 8b x 4b with a matrix size of 4x16x8:
The sub-matrix A is of size 4x16 on 8 bits, which is 512 bits: 2 clock cycles are necessary to load it.
The sub-matrix B is of size 16x8 on 4 bits, which is 512 bits: 2 clock cycles are necessary to load it.
The sub-matrix C is of size 4x8 on 16 or 32 bits, which is 512 or 1024 bits: 2 or 4 clock cycles are necessary to store it.
Finally, 4 x 16 x 8 = 512 MACs must be performed for this matrix, which can be done in 1 clock cycle.
The overall efficiency is therefore 50% (results in 16 bits) or 25% (results in 32 bits): the longest load or store takes 2 or 4 clock cycles, against 1 clock cycle for the compute.
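
The arithmetic above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the 256-bit-per-cycle width for each load and store, and the 512 MACs per cycle for the 8b x 4b mode, are inferred from the cycle counts quoted in this example rather than taken from a datasheet. The function name `efficiency` and its parameters are hypothetical. The sketch also assumes that the two loads, the store and the compute overlap, so the slowest of them sets the pace.

```python
from math import ceil

# Assumptions inferred from the example, not from a hardware specification:
PORT_BITS = 256        # bits moved per clock cycle by each load or store (512 bits -> 2 cycles)
MACS_PER_CYCLE = 512   # MACs per clock cycle in the 8b x 4b mode (512 MACs -> 1 cycle)


def efficiency(m, k, n, a_bits, b_bits, c_bits):
    """Return compute cycles divided by the cycles of the slowest operation.

    The block computes C (m x n) = A (m x k) * B (k x n).
    """
    load_a = ceil(m * k * a_bits / PORT_BITS)   # cycles to load sub-matrix A
    load_b = ceil(k * n * b_bits / PORT_BITS)   # cycles to load sub-matrix B
    store_c = ceil(m * n * c_bits / PORT_BITS)  # cycles to store sub-matrix C
    compute = ceil(m * k * n / MACS_PER_CYCLE)  # cycles for the MACs
    bottleneck = max(load_a, load_b, store_c, compute)
    return compute / bottleneck


print(efficiency(4, 16, 8, a_bits=8, b_bits=4, c_bits=16))  # 0.5
print(efficiency(4, 16, 8, a_bits=8, b_bits=4, c_bits=32))  # 0.25
```

Changing the arguments gives the same estimate for other entries of the table, provided the two constants are adjusted to the mode being examined.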