To use this version, first call the API functions that allocate the device memory, then write your values into the host memory that is mapped to that device memory. No extra memory copy is needed, and the programming is simpler than with the other two versions. However, you must use the padded sizes when filling in the matrices, and the output matrix is also laid out with the padded sizes rather than the original ones. For more usage information, see the examples.
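
The effect of padding on indexing can be illustrated with a small, library-independent sketch. It does not call the real allocation API: a plain Python list stands in for the mapped host memory, and the padding multiple of 4, the row-major layout, the zero-filled padding, and the helper names `round_up`, `fill_padded`, and `extract_logical` are all assumptions made for illustration. The point it shows is that the row stride must be the padded column count, both when writing inputs and when reading the result.

```python
def round_up(n, multiple):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def fill_padded(values, rows, cols, pad_multiple):
    """Copy a rows x cols matrix into a zeroed, padded, row-major buffer.

    The list `buf` is only a stand-in for the mapped host memory; the
    padding rule (round up to `pad_multiple`) is an assumption.
    """
    padded_rows = round_up(rows, pad_multiple)
    padded_cols = round_up(cols, pad_multiple)
    buf = [0.0] * (padded_rows * padded_cols)
    for r in range(rows):
        for c in range(cols):
            # The row stride is the padded column count, not `cols`.
            buf[r * padded_cols + c] = values[r][c]
    return buf, padded_rows, padded_cols


def extract_logical(buf, rows, cols, padded_cols):
    """Read the original rows x cols region back out of a padded buffer."""
    return [[buf[r * padded_cols + c] for c in range(cols)]
            for r in range(rows)]


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
buf, padded_rows, padded_cols = fill_padded(a, 2, 3, 4)
print(padded_rows, padded_cols)                  # padded dimensions
print(extract_logical(buf, 2, 3, padded_cols))   # original 2 x 3 values
```

Indexing with the original column count (`r * cols + c`) would place the second row at the wrong offset in the padded buffer, which is the mistake the padded-size requirement guards against.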