Graph Optimizations - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

Profiling the first 3D buffer implementation showed that the kernels are locked for more than 75% of the time. This suggests exploiting the capacity and features of the memory tiles further: buffer more data and, through data serialization, use only a quarter of the kernels. To do so, a further dimension that represents time can be added to the shared buffer. This added dimension makes the number of kernels less dependent on the input time interleaving, by moving a selectable number of instances from the second dimension (depth) to the new time dimension. The resulting code restructuring is minimal: it consists of reshaping the shared buffer tensor through a new parameter, $\text{KER\_ILV}$, which expresses the number of times that a kernel must run during the acquisition time at steady state. The $\text{KER\_ILV}$ parameter effectively counteracts the $\text{IO\_ILV}$ effect. It decouples the spatial-temporal interleaving done for the I/Os from the one done inside the AIE-ML, which enables serialization.
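The reasoning behind "a quarter of the kernels" can be sketched with a deliberately simplified model. It assumes that a kernel locked for a given share of the acquisition window is busy for the remainder, and that it could run back to back during the idle share with no extra overhead. The function names below are hypothetical and are not part of the tutorial sources. The only constraint taken from the graph code is that $\text{KER\_ILV}$ should divide $\text{IO\_ILV}$, because the quotient is used as a loop bound and as a buffer dimension.

```cpp
#include <cassert>

// Simplified model (an assumption, not taken from the tutorial sources):
// a kernel that is locked for `locked_percent` of the acquisition window
// is busy for the rest, so it could run up to 100 / busy_percent times
// in the same window if no other overhead is added.
inline int max_serialization(int locked_percent) {
    assert(locked_percent >= 0 && locked_percent < 100);
    return 100 / (100 - locked_percent);
}

// KER_ILV should divide IO_ILV, because the graph uses IO_ILV / KER_ILV
// as a loop bound and as a buffer dimension. This returns the largest
// divisor of io_ilv that does not exceed the serialization limit.
inline int pick_ker_ilv(int io_ilv, int locked_percent) {
    assert(io_ilv >= 1);
    const int limit = max_serialization(locked_percent);
    int best = 1;
    for (int d = 1; d <= io_ilv; ++d) {
        if (io_ilv % d == 0 && d <= limit) {
            best = d;
        }
    }
    return best;
}
```

With an illustrative $\text{IO\_ILV}$ of 4 and kernels locked for 75% of the time, this model gives a serialization factor of 4, which corresponds to using one kernel where four were used before. Real designs need margin for lock and DMA overheads, so the profiled result remains the reference.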

The graph header file fft1k_128_new_graph.h contains the modified code. The only changes to the code are the following:

  • The shared buffer now has four dimensions: the time dimension has been added as the third one, and the dimension of the samples (previously the third) is now the fourth.

  • Every time that $\text{IO\_ILV}$ appears in the code, it has to be divided by $\text{KER\_ILV}$.

  • The $\text{MAX\_BUF}$ variable can now be set to three, according to the formulas in the code.

    • The number of shared buffers that can fit in a memory tile is still port-limited. However, since each shared buffer now serves just one kernel instead of four, the number of required active ports drops to one for each ping-pong buffer. This means that three input and three output shared buffers can be packed together in a memory tile.
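The effect of dividing $\text{IO\_ILV}$ by $\text{KER\_ILV}$ on the kernel count can be checked with a small sketch. It uses the same expression as the kernel index in the connection code, PLIO_WIDTH*IO_ILV/(REPS*KER_ILV), and assumes that the port counter advances once per inner loop iteration, which the excerpt in this section does not show. The function names and the numeric values in the usage note are illustrative assumptions, not values read from the tutorial headers.

```cpp
#include <cassert>

// Kernels fed by one input shared buffer: the product of the bounds of the
// two inner loops, k < IO_ILV/KER_ILV and j < PLIO_WIDTH/REPS.
inline int kernels_per_buffer(int plio_width, int io_ilv, int reps, int ker_ilv) {
    assert(reps > 0 && ker_ilv > 0);
    // Both divisions must be exact, otherwise the loop bounds truncate.
    assert(io_ilv % ker_ilv == 0 && plio_width % reps == 0);
    return (io_ilv / ker_ilv) * (plio_width / reps);
}

// Total kernels in the graph: one group of kernels per input shared buffer,
// with n_io standing for N_IO.
inline int total_kernels(int n_io, int plio_width, int io_ilv, int reps, int ker_ilv) {
    return n_io * kernels_per_buffer(plio_width, io_ilv, reps, ker_ilv);
}
```

For example, with hypothetical values N_IO = 32, PLIO_WIDTH = 2, IO_ILV = 4, and REPS = 2, setting KER_ILV = 1 gives four kernels per shared buffer and 128 kernels in total, while KER_ILV = 4 gives one kernel per shared buffer and 32 kernels in total.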

The following code block shows an excerpt of the third loop. Note the differences from the previous design iterations.

...
for(int i=0; i<N_IO; i++){
    int cur = 0;
    for(int k=0; k<IO_ILV/KER_ILV; k++)
        for(int j=0; j<PLIO_WIDTH/REPS; j++){
            read_access(in_mem[i].out[cur]) =
                tiling(
                    {
                        // Dimension 2 (time) is new; samples moved to dimension 3.
                        .buffer_dimension = {PLIO_WIDTH, IO_ILV/KER_ILV,
                                            KER_ILV, POINTS},
                        // Each tile spans all POINTS samples of one entry.
                        .tiling_dimension = {1, 1, 1, POINTS},
                        .offset           = {REPS*j, k, 0, 0},
                        .tile_traversal   = {
                            {.dimension=3, .stride=POINTS, .wrap=1},
                            {.dimension=0, .stride=1, .wrap=REPS},
                            // Step through the KER_ILV time slots.
                            {.dimension=2, .stride=1, .wrap=KER_ILV},
                            {.dimension=1, .stride=1, .wrap=1}
                        }
                    });
            connect(in_mem[i].out[cur], k_kernel[
                int(i*(PLIO_WIDTH*IO_ILV/(REPS*KER_ILV))+cur)].in[0]);
...
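To make the access pattern easier to follow, the sketch below enumerates the tile origins that one output port visits for given loop indices j and k. It is a plain C++ model, not ADF code. It assumes that the tile_traversal entries are ordered from the innermost to the outermost loop, so dimension 0 advances fastest and dimension 2 (time) advances after it. The struct, its field names (lane, depth, time, sample), and the function name are hypothetical labels for the four buffer dimensions.

```cpp
#include <cassert>
#include <vector>

// Origin of one tile in the 4D shared buffer, one field per dimension:
// dimension 0 (lane), dimension 1 (depth), dimension 2 (time),
// dimension 3 (sample).
struct TileOrigin {
    int lane;
    int depth;
    int time;
    int sample;
};

// Lists the tile origins read through one output port, in visiting order.
// Assumption: traversal entries go from innermost to outermost loop.
inline std::vector<TileOrigin> port_tiles(int j, int k, int reps, int ker_ilv) {
    assert(j >= 0 && k >= 0 && reps > 0 && ker_ilv > 0);
    std::vector<TileOrigin> tiles;
    // Dimension 1 wraps once: the depth index stays at the offset k.
    for (int t = 0; t < ker_ilv; ++t) {       // dimension 2, wrap KER_ILV
        for (int r = 0; r < reps; ++r) {      // dimension 0, wrap REPS
            // Dimension 3 wraps once: each tile starts at sample 0 and
            // covers all POINTS samples.
            tiles.push_back({reps * j + r, k, t, 0});
        }
    }
    return tiles;
}
```

Under these assumptions, each port delivers REPS*KER_ILV tiles per buffer iteration. With REPS = 2 and KER_ILV = 4, a port visits lanes 0 and 1 at time slot 0, then the same lanes at time slots 1, 2, and 3, which is the serialization over time that the new dimension introduces.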

To use the new graph, open the fft1k_128_graph.cpp graph source file and replace the first line, #include "fft1k_128_graph.h", with #include "fft1k_128_new_graph.h".