Column-Major Layout - 5.2 English - 57404

AOCL User Guide (57404)

Document ID: 57404
Release Date: 2025-12-29
Version: 5.2 English

In a column-major layout, the batches are interleaved: the same element of consecutive batches is contiguous in memory, while consecutive elements of a single batch are strided apart.

Consider this 2D representation of k batches (B1 to Bk), each holding N elements (0 to N-1):

 B1      B2      B3    ...   Bk

[ 0 ]   [ 0 ]   [ 0 ]  ...  [ 0 ]

[ 1 ]   [ 1 ]   [ 1 ]  ...  [ 1 ]

[ 2 ]   [ 2 ]   [ 2 ]  ...  [ 2 ]

...

[ N-1 ] [ N-1 ] [ N-1 ] ... [ N-1 ]
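As a rough illustration of the diagram above, the following Python sketch computes where element n of batch b lands in a unit-strided column-major buffer. The function name `column_major_offset` and its parameters are hypothetical and are not part of the AOCL API; offsets are counted in elements, not bytes.

```python
def column_major_offset(element, batch, num_batches):
    """Offset (in elements) of `element` of `batch` when batches are interleaved."""
    elem_stride = num_batches  # consecutive elements of one batch are num_batches apart
    batch_stride = 1           # the same element of consecutive batches is adjacent
    return element * elem_stride + batch * batch_stride


# Lay out 3 batches of 4 elements and list the offsets occupied by batch B2 (index 1).
offsets_b2 = [column_major_offset(n, 1, 3) for n in range(4)]
print(offsets_b2)  # [1, 4, 7, 10]
```

Each row of the diagram occupies one contiguous run of `num_batches` slots, which is why the elemental stride equals the number of batches.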

This layout requires careful stride setting as follows:

  • In-place Problems:
    • For R2C, since batches are contiguous in memory and the input is real, one might expect the elements of a transform in a unit-strided problem to be strided by the number of batches.

      However, the output is complex: each real value is transformed in place into a complex value whose real and imaginary parts are stored consecutively, as illustrated below:

     B1      B2     ...   Bk                   B1          B2     ...     Bk
    
    [ 0 ]   [ 0 ]   ...  [ 0 ]            [ (0r,0i) ] [ (0r,0i) ] ... [ (0r,0i) ]
    
    [ 1 ]   [ 1 ]   ...  [ 1 ]      ->    [ (1r,1i) ] [ (1r,1i) ] ... [ (1r,1i) ]
    
    [ 2 ]   [ 2 ]   ...  [ 2 ]            [ (2r,2i) ] [ (2r,2i) ] ... [ (2r,2i) ]
    
    ...                                   ...
    
    [ N-1 ] [ N-1 ] ... [ N-1 ]           [ (Nr,Ni) ] [ (Nr,Ni) ] ... [ (Nr,Ni) ]
    
    • Therefore, the input elemental stride should be set to (number of batches * 2) to account for both the real and imaginary parts in the output. The output elemental stride can be set to the number of batches, since each output element is already a complex value.

    • The same rule applies when setting the batch strides. In a unit-strided problem, one might expect the input batch stride to be 1, since the batches are contiguous in memory. However, to account for the complex output, the input batch stride must be set to 2, while the output batch stride can remain 1.

    • For C2R, the input and output strides are interchanged.

    • Example:

      For an input problem of 4v50, the correct stride setting for a unit-strided problem would be as follows (the values correspond to 4 batches under the rules above):

      • R2C in-place:

        • dims[0].in_stride = 8, dims[0].out_stride = 4

        • vecs[0].in_stride = 2, vecs[0].out_stride = 1

      • C2R in-place:

        • dims[0].in_stride = 4, dims[0].out_stride = 8

        • vecs[0].in_stride = 1, vecs[0].out_stride = 2
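The in-place rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `inplace_strides` is a hypothetical helper, not an AOCL function, and it assumes real-side strides are counted in real values while complex-side strides are counted in complex values. It reproduces the example values and checks that a real input element and the complex output element with the same indices start at the same memory position.

```python
def inplace_strides(num_batches, kind):
    """Return ((elem_in, elem_out), (batch_in, batch_out)) for a unit-strided problem.

    Real-side strides are counted in real values, complex-side strides in complex values.
    """
    real_side = (2 * num_batches, 2)  # doubled so reals line up with complex slots
    complex_side = (num_batches, 1)
    if kind == "r2c":
        src, dst = real_side, complex_side
    elif kind == "c2r":
        src, dst = complex_side, real_side
    else:
        raise ValueError("kind must be 'r2c' or 'c2r'")
    return (src[0], dst[0]), (src[1], dst[1])


K = 4  # number of batches matching the stride values in the example above
(dim_in, dim_out), (vec_in, vec_out) = inplace_strides(K, "r2c")
print(dim_in, dim_out, vec_in, vec_out)  # 8 4 2 1

# Real input (n, b) and complex output (n, b) should start at the same position
# when both are measured in real-sized slots (one complex value = two reals).
aligned = all(
    n * dim_in + b * vec_in == 2 * (n * dim_out + b * vec_out)
    for n in range(3)
    for b in range(K)
)
print(aligned)  # True
```

Calling `inplace_strides(4, "c2r")` swaps the two sides and yields the C2R values listed above.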

  • Out-of-place Problems: As with the row-major layout, the input and output buffers are separate, so the strides can be set independently based on the actual data layout in memory.
    • For R2C, the input and output elemental strides should each be set to the number of batches multiplied by the corresponding batch stride, and the batch strides should be set to the actual spacing of batches in memory.

    • For C2R, the same settings apply as for R2C.

    • Example:

      For an input problem of 4v50, the correct stride setting for a unit-strided problem would be as follows (the values correspond to 4 batches under the rules above):

      • R2C out-of-place:

        • dims[0].in_stride = 4, dims[0].out_stride = 4

        • vecs[0].in_stride = 1, vecs[0].out_stride = 1

      • C2R out-of-place:

        • dims[0].in_stride = 4, dims[0].out_stride = 4

        • vecs[0].in_stride = 1, vecs[0].out_stride = 1
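For comparison, the out-of-place rule can be sketched the same way. The helper `out_of_place_strides` and its parameter names are hypothetical and shown only to make the arithmetic concrete; per the text above, the same computation is assumed to apply to both R2C and C2R.

```python
def out_of_place_strides(num_batches, batch_stride_in=1, batch_stride_out=1):
    """Return ((elem_in, elem_out), (batch_in, batch_out)) for separate buffers."""
    # Elemental stride = number of batches multiplied by that buffer's batch stride.
    elem_in = num_batches * batch_stride_in
    elem_out = num_batches * batch_stride_out
    return (elem_in, elem_out), (batch_stride_in, batch_stride_out)


# Unit-strided case with 4 batches, matching the example values above.
print(out_of_place_strides(4))  # ((4, 4), (1, 1))

# If batches in the input buffer were spaced 2 apart, only the input side changes.
print(out_of_place_strides(4, batch_stride_in=2))  # ((8, 4), (2, 1))
```

Because the two buffers do not overlap, no factor of 2 is needed to reconcile real and complex element sizes.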