Programmable 4D Data-Mover User Guide - 2023.2 English

Vitis Libraries

Release Date
2023-12-20
Version
2023.2 English

All Data-Mover designs are “codeless”: users create the OpenCL kernels simply by calling the Kernel Generator with a JSON description and, if needed, ROM content as a text file.

The Kernel Generator consists of:

  • Kernel templates (in Jinja2), which are instantiated with configurations from JSON
  • A data converter that transforms the user-provided texture ROM content into a usable initialization file
  • A Python script, L2/scripts/internal/generate_kernels.py, that automates kernel generation from JSON to HLS C++ OpenCL kernels

Attention

Generated kernels are not self-contained source code: they reference low-level block implementation headers in the L1/include folder. Ensure that this folder is passed to the Vitis compiler as a header search path when compiling a project that uses generated PL kernels.

Feature

AIE applications often need to deal with multi-dimensional data. The most common case is that an AIE application needs to select and read/write a regularly distributed sub-set of a multi-dimensional array. Because a multi-dimensional array has to be stored in a linearly addressed memory space, such a sub-set is rarely a contiguous block in memory and instead typically consists of many data segments. It is inconvenient for users to implement logic that calculates the address and size of every segment, and inefficient for the AIE to do so.
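To make the "many data segments" point concrete, the sketch below counts the contiguous runs that a rectangular sub-block occupies inside a densely stored row-major array A[W][Z][Y][X]. The names Shape4, Segments and contiguous_segments are hypothetical helpers written for this illustration and are not part of the library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types for illustration only.
struct Shape4 { std::uint64_t W, Z, Y, X; };
struct Segments { std::uint64_t count; std::uint64_t length; };

// Counts the contiguous segments covered by a sub-block `sub` of a dense,
// row-major array of shape `full`. A dimension extends the current run only
// if every faster-varying dimension is taken in full.
Segments contiguous_segments(const Shape4& full, const Shape4& sub) {
    assert(sub.X >= 1 && sub.X <= full.X && sub.Y >= 1 && sub.Y <= full.Y);
    assert(sub.Z >= 1 && sub.Z <= full.Z && sub.W >= 1 && sub.W <= full.W);
    const std::uint64_t f[4] = {full.X, full.Y, full.Z, full.W};
    const std::uint64_t s[4] = {sub.X, sub.Y, sub.Z, sub.W};
    std::uint64_t length = 1;
    std::uint64_t count = 1;
    bool contiguous = true;
    for (int i = 0; i < 4; ++i) {
        if (contiguous) {
            length *= s[i];                        // still part of one run
            if (s[i] != f[i]) contiguous = false;  // slower dims start new runs
        } else {
            count *= s[i];
        }
    }
    return {count, length};
}
```

For example, a 1x2x4x4 sub-block of a 4x4x8x8 array is made of 8 separate segments of 4 elements each, whereas a 1x2x8x8 sub-block (full rows and full planes) collapses into a single segment of 128 elements.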

Programmable 4D Data-Mover includes:

  • A concise descriptor design that uses 9x64 bits to fully describe an access on 4-dimensional (and lower-dimensional) data.
  • A template kernel design that can read multiple descriptors and carry out the defined access patterns one by one.

Descriptor Design

Descriptor design is the most important part of the Programmable 4D Data-Mover. It defines:

  • How 4-dimensional data is mapped to linear addresses
  • Where to find the sub-set to access
  • What the dimensions of the sub-set are
  • How to serialize the sub-set

To store a 4D array A[W][Z][Y][X] in memory, it has to be mapped into a linear address space, which leads to addressing such as &A[w][z][y][x] = bias + w * (Z*Y*X) + z * (Y*X) + y * (X) + x. Under this mapping, adjacent elements along each dimension of the 4D array are separated by a constant stride. Nine parameters { bias (address of first element), X, X_stride, Y, Y_stride, Z, Z_stride, W, W_stride } are therefore enough to define the 4D array in memory. For a densely stored array the strides are:

  • &A[w][z][y][x+1] – &A[w][z][y][x] = 1 (X_stride)
  • &A[w][z][y+1][x] – &A[w][z][y][x] = X (Y_stride)
  • &A[w][z+1][y][x] – &A[w][z][y][x] = X * Y (Z_stride)
  • &A[w+1][z][y][x] – &A[w][z][y][x] = X * Y * Z (W_stride)
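The stride relations above can be checked with a small host-side sketch. It assumes addresses and strides are counted in elements, and the function names dense_strides and linear_address are illustrative only, not library APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Strides (in elements) of a dense row-major array A[W][Z][Y][X],
// ordered {X_stride, Y_stride, Z_stride, W_stride}.
std::array<std::int64_t, 4> dense_strides(std::int64_t X, std::int64_t Y, std::int64_t Z) {
    return {1, X, X * Y, X * Y * Z};
}

// Linear address of A[w][z][y][x] for the given bias and strides.
std::int64_t linear_address(std::int64_t bias, const std::array<std::int64_t, 4>& stride,
                            std::int64_t w, std::int64_t z, std::int64_t y, std::int64_t x) {
    return bias + w * stride[3] + z * stride[2] + y * stride[1] + x * stride[0];
}
```

With X = 8, Y = 4, Z = 2 the strides are {1, 8, 32, 64}, so with bias = 100 the element A[1][1][2][3] sits at 100 + 64 + 32 + 16 + 3 = 215.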

Since the Programmable 4D Data Mover needs to write to / read from an AXI stream, the descriptor also needs to define how to serialize the 4D array. We define ap_int<64> cfg[9] as the descriptor of one access:

  • cfg[0]: bias (address of 4D array’s first element in memory)
  • cfg[1], cfg[2]: stride of first accessed dimension, size of first accessed dimension
  • cfg[3], cfg[4]: stride of second accessed dimension, size of second accessed dimension
  • cfg[5], cfg[6]: stride of third accessed dimension, size of third accessed dimension
  • cfg[7], cfg[8]: stride of fourth accessed dimension, size of fourth accessed dimension
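As an illustration of the layout listed above, the sketch below packs a descriptor for a 2D tile taken from one plane of a dense A[W][Z][Y][X] array, with the unused dimensions given size 1. It is a host-side model that uses std::int64_t in place of ap_int<64>. The helpers Dim, make_descriptor and tile_descriptor are hypothetical, and setting cfg[0] to the address of the first accessed element is an assumption drawn from how cfg[0] is used in the access loop of this section.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Descriptor = std::array<std::int64_t, 9>;  // stand-in for ap_int<64> cfg[9]

// One accessed dimension: stride between consecutive elements, and element count.
struct Dim { std::int64_t stride; std::int64_t size; };

// Packs bias plus four (stride, size) pairs, first accessed dimension first.
Descriptor make_descriptor(std::int64_t bias, Dim d1, Dim d2, Dim d3, Dim d4) {
    return {bias, d1.stride, d1.size, d2.stride, d2.size,
            d3.stride, d3.size, d4.stride, d4.size};
}

// Descriptor for a yn-by-xn tile starting at A[w0][z0][y0][x0] in a dense
// row-major array, serialized x first, then y. Addresses are in elements.
Descriptor tile_descriptor(std::int64_t bias, std::int64_t X, std::int64_t Y, std::int64_t Z,
                           std::int64_t w0, std::int64_t z0, std::int64_t y0, std::int64_t x0,
                           std::int64_t yn, std::int64_t xn) {
    const std::int64_t start = bias + w0 * (X * Y * Z) + z0 * (X * Y) + y0 * X + x0;
    return make_descriptor(start, {1, xn}, {X, yn}, {X * Y, 1}, {X * Y * Z, 1});
}
```

For X = 8, Y = 4, Z = 2 and bias = 0, a 2x3 tile starting at A[0][1][1][2] yields cfg = {42, 1, 3, 8, 2, 32, 1, 64, 1}. Swapping the first two (stride, size) pairs would stream the same tile column by column instead.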

The Programmable 4D Data Mover loads one or multiple descriptors from a descriptor buffer. The buffer begins with a 64-bit value num that indicates how many descriptors it holds, followed by that many 9x64-bit descriptors, stored compactly. The data mover parses the first descriptor, finishes the access it describes, then parses and finishes the next one, and continues until all descriptors are processed.

With the descriptor above, the Programmable 4D Data Mover serializes each read/write as in the pseudo-code below (taking a read as the example):

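// Sizes are cfg[2], cfg[4], cfg[6], cfg[8]; strides are cfg[1], cfg[3], cfg[5], cfg[7].
for(ap_int<64> d4 = 0; d4 < cfg[8]; d4++) {             // fourth accessed dimension (outermost)
    for(ap_int<64> d3 = 0; d3 < cfg[6]; d3++) {         // third accessed dimension
        for(ap_int<64> d2 = 0; d2 < cfg[4]; d2++) {     // second accessed dimension
            for(ap_int<64> d1 = 0; d1 < cfg[2]; d1++) { // first accessed dimension (innermost)
                elem_address = cfg[0] + d4 * cfg[7] + d3 * cfg[5] + d2 * cfg[3] + d1 * cfg[1];
                data_to_be_read = data[elem_address];   // streamed out in this order
            }
        }
    }
}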
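The descriptor buffer walk and the loop nest above can be combined into a small software model. This is a sketch only, not the kernel implementation: it uses std::int64_t in place of ap_int<64>, assumes addresses are element indices into data, and the function name model_read is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Software model: walks a descriptor buffer laid out as {num, cfg_0[9], cfg_1[9], ...}
// and returns the elements in the order a read access would stream them.
std::vector<std::int64_t> model_read(const std::vector<std::int64_t>& desc_buf,
                                     const std::vector<std::int64_t>& data) {
    std::vector<std::int64_t> stream;
    assert(!desc_buf.empty());
    const std::int64_t num = desc_buf[0];  // number of descriptors that follow
    assert(num >= 0);
    assert(desc_buf.size() >= 1 + 9 * static_cast<std::size_t>(num));
    for (std::int64_t n = 0; n < num; ++n) {
        // Descriptors are stored back to back, 9 words each.
        const std::int64_t* cfg = &desc_buf[1 + 9 * static_cast<std::size_t>(n)];
        for (std::int64_t d4 = 0; d4 < cfg[8]; ++d4)
            for (std::int64_t d3 = 0; d3 < cfg[6]; ++d3)
                for (std::int64_t d2 = 0; d2 < cfg[4]; ++d2)
                    for (std::int64_t d1 = 0; d1 < cfg[2]; ++d1) {
                        const std::int64_t addr = cfg[0] + d4 * cfg[7] + d3 * cfg[5]
                                                + d2 * cfg[3] + d1 * cfg[1];
                        // at() throws if a descriptor addresses outside `data`.
                        stream.push_back(data.at(static_cast<std::size_t>(addr)));
                    }
    }
    return stream;
}
```

With data holding the values 0..15 (a 4x4 array whose value equals its address), the buffer {2, 5,1,2,4,2,0,1,0,1, 0,4,2,1,2,0,1,0,1} describes two accesses: a 2x2 tile starting at address 5 read row by row, then a 2x2 tile starting at address 0 read column by column. The model returns 5, 6, 9, 10, 0, 4, 1, 5.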

Kernel Design

Programmable 4D Data Movers are templated designs that access elements 32 / 64 / 128 / 256 / 512 bits wide. Each has a standalone AXI master port for accessing the descriptor buffer, configured to be 64 bits wide. The other AXI master and AXI stream ports are configured to the same width as the data elements. All AXI master ports share the same “latency”, “outstanding”, and “burst length” settings, which should match the pragma settings in the kernels that wrap the data mover. All variants share the same kernel generator and JSON spec; refer to the example below.