Supporting a matrix multiply of a given size is one thing; verifying that the two loads, the store, and the compute are equally well optimized is another.
A complete table of matrix multiply efficiency, including matrix loads and vector compute, can be seen here: ePerformance Table
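To make the point about balance concrete, here is a minimal sketch that estimates the cycle cost of each of the four phases (load A, load B, store C, compute) for a single tile and reports which one limits throughput. Everything in it is an assumption for illustration: the function and field names are hypothetical, the bandwidth and MAC-rate figures are made-up inputs rather than measurements from the table, and the model assumes the four phases overlap perfectly so that the slowest one sets the pace.

```python
from dataclasses import dataclass, asdict


@dataclass
class TileCost:
    """Estimated cycles per phase for one (m x k) @ (k x n) tile."""
    load_a: float
    load_b: float
    store_c: float
    compute: float

    def bottleneck(self) -> str:
        # Name of the slowest phase (the first one wins on a tie).
        costs = asdict(self)
        return max(costs, key=costs.get)

    def compute_efficiency(self) -> float:
        # Fraction of time the compute unit is busy, assuming the four
        # phases are fully pipelined and the slowest one sets the pace.
        return self.compute / max(asdict(self).values())


def tile_cost(m, n, k, elem_bytes, load_bytes_per_cycle,
              store_bytes_per_cycle, macs_per_cycle) -> TileCost:
    """Hypothetical cost model; all rates are illustrative inputs."""
    return TileCost(
        load_a=m * k * elem_bytes / load_bytes_per_cycle,
        load_b=k * n * elem_bytes / load_bytes_per_cycle,
        store_c=m * n * elem_bytes / store_bytes_per_cycle,
        compute=m * n * k / macs_per_cycle,
    )


if __name__ == "__main__":
    # A 16x16x16 tile of 4-byte elements with a slow store path.
    cost = tile_cost(16, 16, 16, elem_bytes=4,
                     load_bytes_per_cycle=64,
                     store_bytes_per_cycle=16,
                     macs_per_cycle=256)
    print(cost, cost.bottleneck(), cost.compute_efficiency())
```

In this toy model a design is "equally optimized" when all four numbers come out roughly the same; any phase that is much slower than the others caps the achievable efficiency regardless of how fast the compute unit is.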