Graph Header File - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

In the graph header file, the fft1k_128_graph class is defined, with all the kernel and buffer objects and connections inside it. As a first step, the ADF library and the kernel header file are included; then six values are defined through #define preprocessor directives (an illustrative sketch of these directives is given after the list):

  • the number of signal instances $\text{N\_INST}=128$;

  • the width of the PLIO channels in terms of samples, which are cint16 and thus 32 bits wide each: $\text{PLIO\_WIDTH} = 2$;

  • the time-interleaving factor $\text{IO\_ILV}=4$;

  • the number of kernels, which is equal to the number of instances divided by the batch factor: $\text{N\_KERS}=\frac{\text{N\_INST}}{\text{REPS}}=64$;

  • the number of I/O channels: $\text{N\_IO}=\frac{\text{N\_INST}}{\text{IO\_ILV}\,\cdot\,\text{PLIO\_WIDTH}}=16$;

  • the maximum number $\text{MAX\_BUF}$ of shared buffers that fit into a memory tile. It can be limited either by the memory occupation of the buffers or by the number of memory-tile interfaces.

    • Note that the formulas suggested in the code are calculated for ping-pong shared buffers and consider input and output buffers packed together (as this spares memory-tile interface resources). $$\text{MAX\_BUF} = \min\left[ \frac{512\,\text{KB}}{\text{BUF\_SIZE}\cdot\text{DATATYPE\_BYTES}\cdot 2\cdot 2}\;;\; \frac{6}{2\cdot \frac{\text{N\_KERS}}{\text{N\_IO}}} \right]$$

    • For this reason, the build still works even if MAX_BUF evaluates to 0 (which happens when the design is port-limited): in that case the buffers are simply left unconstrained.
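
Below is a minimal, illustrative sketch of how these preprocessor directives might be written. The names and values of REPS (instances per kernel invocation), POINTS (FFT length), BUF_SIZE, and the two intermediate bounds are assumptions derived from the formulas above, not code copied from the tutorial header:

// Illustrative sketch only: REPS, POINTS, BUF_SIZE, MAX_BUF_MEM and MAX_BUF_IO
// are assumed names/values consistent with the formulas given above.
#define N_INST      128                              // number of signal instances
#define PLIO_WIDTH  2                                // cint16 samples per 64-bit PLIO beat
#define IO_ILV      4                                // time-interleaving factor
#define REPS        2                                // instances per kernel invocation (assumed)
#define POINTS      1024                             // samples per instance, i.e. the FFT size (assumed)
#define N_KERS      (N_INST / REPS)                  // 128 / 2 = 64 kernels
#define N_IO        (N_INST / (IO_ILV * PLIO_WIDTH)) // 128 / 8 = 16 PLIO channels per direction
#define BUF_SIZE    (PLIO_WIDTH * IO_ILV * POINTS)   // samples held by one shared buffer (assumed)
// Memory-limited bound: 512 KB per mem tile / (buffer bytes * ping-pong * in+out packed);
// 4 is DATATYPE_BYTES for cint16
#define MAX_BUF_MEM (524288 / (BUF_SIZE * 4 * 2 * 2))
// Port-limited bound: 6 ports per direction / (2 * kernels served per buffer)
#define MAX_BUF_IO  (6 / (2 * (N_KERS / N_IO)))
#define MAX_BUF     (MAX_BUF_MEM < MAX_BUF_IO ? MAX_BUF_MEM : MAX_BUF_IO)

With these assumed values, MAX_BUF_IO evaluates to 0 (6 / 8 in integer arithmetic), which is exactly the port-limited case mentioned above in which the buffers are left unconstrained.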

The private section of the class contains only attributes: an array of N_KERS kernels and two arrays of N_IO shared buffers, one for the inputs and one for the outputs. In the public section of the graph class, the attributes are just two arrays of N_IO ports, one for the input and one for the output.

...
class fft1k_128_graph : public graph {
private:
    kernel k_kernel[N_KERS];
    shared_buffer<cint16> in_mem[N_IO], out_mem[N_IO];
public:
    port<input>     din[N_IO];
    port<output>    dout[N_IO];
...

The choice of creating one tensor for each I/O was made to facilitate the job of the aiecompiler placer in binding the memory and routing hardware resources to these constructs. The graph class constructor is where all the kernels, buffers, and connections are created and configured. The constructor code is divided into three for loops.

...
fft1k_128_graph(void)
{
  // LOOP 1
    for(int i=0; i<N_KERS; i++){
        k_kernel[i] = kernel::create_object<fft1k_kernel>();
        source(k_kernel[i]) = "fft1k_single_kernel.cpp";
        runtime<ratio>(k_kernel[i]) = 0.9;
        location<stack>(k_kernel[i]) = location<kernel>(k_kernel[i]);
        location<buffer>(k_kernel[i].in[0]) = location<kernel>(k_kernel[i]);
        location<buffer>(k_kernel[i].out[0]) = location<kernel>(k_kernel[i]);
    }
...

In the first loop, all the kernels are instantiated one by one, specifying the kernel name, the kernel source file, and the expected runtime ratio (i.e., the fraction of the compute tile's cycle budget reserved for the kernel). Moreover, to lighten the job of the aiecompiler placer and to help minimize resource usage, three relative location constraints state that the input buffer, the output buffer, and the stack (which contains the twiddles) must be bound to the same tile on which the computation is carried out.

...
// LOOP 2
for(int i=0; i<N_IO; i++){
    // Creating the input and output shared buffers
    in_mem[i] = shared_buffer<cint16>::create({PLIO_WIDTH,IO_ILV,POINTS}, 1, int(PLIO_WIDTH*IO_ILV/REPS));
    out_mem[i] = shared_buffer<cint16>::create({PLIO_WIDTH,IO_ILV,POINTS}, int(PLIO_WIDTH*IO_ILV/REPS), 1);

    num_buffers(in_mem[i]) = 2;    // Ping-pong in
    num_buffers(out_mem[i]) = 2;   // Ping-pong out

    if(MAX_BUF != 0){
        if(i%MAX_BUF > 0){
            // Stack this buffer pair on the same memory tile as the previous pair
            location<buffer>(in_mem[i])  = location<buffer>(in_mem[i-1])  + relative_offset({.col_offset = 0, .row_offset = 0});
            location<buffer>(out_mem[i]) = location<buffer>(out_mem[i-1]) + relative_offset({.col_offset = 0, .row_offset = 0});
        }
        // Pack each output buffer on the same memory tile as its input buffer
        location<buffer>(out_mem[i]) = location<buffer>(in_mem[i]) + relative_offset({.col_offset = 0, .row_offset = 0});
    }

...

The second loop instantiates the input and output shared buffers. First, each buffer is created as a 3D tensor with a length equal to PLIO_WIDTH, a depth equal to IO_ILV, and a height equal to the number of samples per instance (POINTS). On the side facing the PL, each buffer is configured with a single port connected to one I/O. On the other side, the buffers are configured to connect one kernel for every two instances they contain. Therefore, since each buffer contains $\text{PLIO\_WIDTH} \cdot \text{IO\_ILV}$ instances, the number of ports on the buffer side facing the kernels is equal to 4, meaning that every buffer serves four kernels with two instances each. The following lines require the shared buffers to be instantiated as ping-pong buffers, thus reserving double the memory for each. Moreover, the buffers are location-constrained to pack together as many input and output buffers per memory tile as possible, the limitation being either the number of ports or the memory size.
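
With the values defined above, and assuming $\text{REPS}=2$, the kernel-side port count passed to create() works out as:

$$\frac{\text{PLIO\_WIDTH}\cdot\text{IO\_ILV}}{\text{REPS}} = \frac{2\cdot 4}{2} = 4$$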

...LOOP 2...
write_access(in_mem[i].in[0]) =
    tiling(
        {
            .buffer_dimension = {PLIO_WIDTH, IO_ILV, POINTS},      
            .tiling_dimension = {PLIO_WIDTH, IO_ILV, POINTS},      
            .offset           = {0, 0, 0},
            .tile_traversal   = {
                {.dimension=0, .stride=PLIO_WIDTH, .wrap=1},    
                {.dimension=1, .stride=IO_ILV, .wrap=1},           
                {.dimension=2, .stride=POINTS, .wrap=1}        
            } 
        });
connect(din[i], in_mem[i].in[0]);
...

The second and third code blocks of the second loop regulate the access policies that the PLIOs have with respect to the buffers, and connect them. In particular, the tiling size is set equal to the buffer size, and the tile traversal is set to follow the natural order in which the PLIOs feed the data, thus filling the buffer tensor vertically, slice by slice. This means that the first dimension to be filled is the length, then the depth, and finally the height of the tensor; the lower dimensions are reset every time the traversal moves to the next dimension after reaching the end of a stride (because each .wrap parameter is set to 1). After defining the tiling and the buffer access policy, each buffer is connected to its respective port. Only the input management code is shown above; the output management code is analogous, as sketched below.
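
For reference, the analogous output-side block of the second loop could look like the following sketch, which simply mirrors the input-side access onto the single PL-facing port of out_mem (an assumed reconstruction, not code copied from the tutorial source):

// Assumed output-side counterpart in LOOP 2: the single PL-facing port of
// out_mem is read in the same natural order expected by the output PLIO,
// then connected to the graph output port.
read_access(out_mem[i].out[0]) =
    tiling(
        {
            .buffer_dimension = {PLIO_WIDTH, IO_ILV, POINTS},
            .tiling_dimension = {PLIO_WIDTH, IO_ILV, POINTS},
            .offset           = {0, 0, 0},
            .tile_traversal   = {
                {.dimension=0, .stride=PLIO_WIDTH, .wrap=1},
                {.dimension=1, .stride=IO_ILV,     .wrap=1},
                {.dimension=2, .stride=POINTS,     .wrap=1}
            }
        });
connect(out_mem[i].out[0], dout[i]);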

...
// LOOP 3
for(int i=0; i<N_IO; i++){
    int cur = 0;
    for(int k=0; k<IO_ILV; k++)
        for(int j=0; j<PLIO_WIDTH/REPS; j++){
            // Kernel inputs management
            read_access(in_mem[i].out[cur]) =
                tiling(                                                                      
                {
                    .buffer_dimension = {PLIO_WIDTH, IO_ILV, POINTS},
                    .tiling_dimension = {1, 1, POINTS},              
                    .offset           = {REPS*j, k, 0},                 
                    .tile_traversal   = {
                        {.dimension=2, .stride=POINTS, .wrap=1},        
                        {.dimension=0, .stride=1, .wrap=REPS},         
                        {.dimension=1, .stride=1, .wrap=1}            
                }});
            connect(in_mem[i].out[cur], k_kernel[int(i*(PLIO_WIDTH*IO_ILV/REPS)+cur)].in[0]);
        // Kernel outputs management
            ...
            cur++;
        }
...

The third and last loop configures the interfacing between the shared buffers inside the memory tiles and the kernels inside the AIE-ML tiles. To this end, three nested loops are used: the top one iterates through the shared buffers, and the two inner ones loop through the instances in an ordered manner. The tiling is one-dimensional: each tile spans the third dimension of the 3D shared buffer (the height), which holds the entire set of points of a given instance inside the tensor. After reading one instance, the traversal wraps along the first dimension to select the second half of the PLIO word (i.e., the second instance, paired with the first one in the 64-bit PLIO channel), which is also the second kernel repetition. After that, to account for time interleaving as well, the traversal moves along the second dimension and repeats the same access pattern. This complete pattern is repeated until the shared buffer has been fully traversed. The connections then feed each kernel with contiguous instances from the input shared buffer and, in a mirrored way, fill the output shared buffer keeping the data indexes in the same order as the input one, so that the output PLIOs return data in exactly the same order as the input PLIOs. As for the second loop, only the input management code is shown for the third one; the output management code is analogous, as sketched below.
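
For completeness, a sketch of the analogous output-side block inside the third loop is given below; it mirrors the input-side access with the same {REPS*j, k, 0} offset so that each kernel writes its instances back to the positions they were read from (an assumed reconstruction, not code copied from the tutorial source):

// Assumed output-side counterpart of the read_access/connect pair above.
write_access(out_mem[i].in[cur]) =
    tiling(
    {
        .buffer_dimension = {PLIO_WIDTH, IO_ILV, POINTS},
        .tiling_dimension = {1, 1, POINTS},
        .offset           = {REPS*j, k, 0},
        .tile_traversal   = {
            {.dimension=2, .stride=POINTS, .wrap=1},
            {.dimension=0, .stride=1, .wrap=REPS},
            {.dimension=1, .stride=1, .wrap=1}
    }});
connect(k_kernel[int(i*(PLIO_WIDTH*IO_ILV/REPS)+cur)].out[0], out_mem[i].in[cur]);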