This version places no limitations on user host memory and is easy to use. The API functions do the padding internally, which leads to an extra memory copy on the host side. The resulting output matrix also has the same size as the input.
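To make the cost of internal padding concrete, the following is a minimal sketch of the pattern, not the library's actual implementation. The names `round_up` and `scale_matrix_padded`, the `align` parameter, and the scaling operation are all hypothetical stand-ins chosen for illustration. The caller passes plain unpadded buffers, and the function allocates a padded work buffer, copies the data in, computes, and copies the result back without the padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: round n up to the next multiple of align. */
static size_t round_up(size_t n, size_t align)
{
    assert(align > 0);
    return (n + align - 1) / align * align;
}

/*
 * Hypothetical API entry point (illustration only, not a real library call).
 * Scales a rows x cols row-major matrix by `factor`. The caller's buffers
 * are unpadded; padding to a multiple of `align` happens internally.
 * Returns 0 on success, -1 if the work buffer cannot be allocated.
 */
int scale_matrix_padded(const float *in, float *out,
                        size_t rows, size_t cols,
                        size_t align, float factor)
{
    if (rows == 0 || cols == 0)
        return 0; /* nothing to do */

    size_t ld = round_up(cols, align); /* padded row length */

    /* Zero-filled work buffer holding the padded layout. */
    float *work = calloc(rows * ld, sizeof *work);
    if (work == NULL)
        return -1;

    /* Extra host-side copy: user layout -> padded layout. */
    for (size_t r = 0; r < rows; ++r)
        memcpy(work + r * ld, in + r * cols, cols * sizeof *in);

    /* Stand-in for the real computation on the padded buffer. */
    for (size_t i = 0; i < rows * ld; ++i)
        work[i] *= factor;

    /* Copy back without the padding, so the output has the same
       dimensions as the input. */
    for (size_t r = 0; r < rows; ++r)
        memcpy(out + r * cols, work + r * ld, cols * sizeof *out);

    free(work);
    return 0;
}
```

In this sketch, a caller with a 2 x 3 matrix passes two 6-element arrays and never sees the padded row length. The price is the temporary allocation plus the copies into and out of the work buffer.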