Step 2 - Asynchronous GMIO Transfer for Input and Synchronous GMIO Transfer for Output

Vitis Tutorials: AI Engine (XD100)

Document ID: XD100

Release Date: 2022-12-01

Version: 2022.2 English

The previous step identified the sequential ordering of data transfer and kernel execution as the main bottleneck of the design performance. In this step, the AI Engine kernel is replaced with a vectorized version to reduce kernel execution time. Change the working directory to single_aie_gmio/step2. The vectorized kernel code is in aie/weighted_sum.cc.

In addition to the kernel update, this step uses asynchronous (non-blocking) GMIO transfers for the input while keeping synchronous (blocking) GMIO transfers for the output. Mixing the two allows input data transfer to overlap with kernel execution, which further improves performance.
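The effect of the overlap can be estimated with a simple two-stage pipeline model. The following is a minimal sketch in plain C++, not part of the tutorial sources: the names BlockCost, sequential_time, and overlapped_time are hypothetical, the time units are arbitrary, and the model ignores the output transfer and any buffer back-pressure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical cost model: units are arbitrary and the values are not measurements.
struct BlockCost {
    std::uint64_t transfer;  // time to move one input block from memory to the AI Engine
    std::uint64_t kernel;    // time for one kernel iteration on that block
};

// Blocking input: every block is fully transferred before its iteration starts,
// so the two costs simply add up for each block.
std::uint64_t sequential_time(BlockCost c, std::uint64_t blocks) {
    return blocks * (c.transfer + c.kernel);
}

// Non-blocking input: the transfer of block i+1 proceeds while the kernel
// processes block i, so the slower of the two stages sets the steady-state pace.
std::uint64_t overlapped_time(BlockCost c, std::uint64_t blocks) {
    assert(blocks > 0);
    return c.transfer + (blocks - 1) * std::max(c.transfer, c.kernel) + c.kernel;
}
```

Under this model, four blocks with a transfer cost of 3 and a kernel cost of 5 take 32 units sequentially and 23 units when overlapped. The gain grows with the number of blocks and is largest when the two stages take similar time.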

Examine the main function in aie/graph.cpp. This time ITERATION is four, so the graph runs for four iterations with gr.run(ITERATION). The GMIO transfer from memory to the AI Engine uses the non-blocking GMIO API gr.gmioIn.gm2aie_nb(dinArray,BLOCK_SIZE_in_Bytes);, which returns without waiting for the transfer to complete and therefore does not block the code that follows. The output data still uses the blocking GMIO API.

//pre-processing
...
gr.gmioIn.gm2aie_nb(dinArray,BLOCK_SIZE_in_Bytes);//Non-blocking: start the transfer of all input blocks in one call
gr.run(ITERATION); //ITERATION=4
gr.gmioOut.aie2gm(doutArray,BLOCK_SIZE_in_Bytes);//Blocking: returns after all output blocks are transferred
...
//post-processing

Although a non-blocking GMIO API is used to transfer the input data, no explicit synchronization between data transfer and kernel execution is needed. The window buffer guarantees this synchronization: every iteration of the kernel waits until its block of input data is ready. The output data is synchronized by the blocking GMIO API. After the blocking API returns, the data is guaranteed to be available in DDR memory, and the post-processing sequence can safely start.
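The synchronization pattern described above can be illustrated with a host-only analogy. The sketch below uses standard C++ threads rather than the ADF API, and every name in it (run_overlapped, staged, blocks_ready) is hypothetical. A background thread stands in for the non-blocking input transfer, a counter protected by a mutex stands in for the window buffer handshake, and the final join stands in for the blocking output call. The doubling of each sample is a placeholder computation, not the tutorial's weighted sum kernel.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Plain C++ analogy, not the ADF API.
std::vector<int> run_overlapped(const std::vector<int>& input, std::size_t block_size) {
    assert(block_size > 0 && input.size() % block_size == 0);
    const std::size_t blocks = input.size() / block_size;

    std::vector<int> staged(input.size());  // stands in for the kernel's input buffers
    std::vector<int> output(input.size());
    std::mutex m;
    std::condition_variable cv;
    std::size_t blocks_ready = 0;            // number of input blocks fully transferred

    // "Non-blocking" input transfer: runs in the background, one block at a time.
    std::thread transfer([&] {
        for (std::size_t b = 0; b < blocks; ++b) {
            for (std::size_t i = b * block_size; i < (b + 1) * block_size; ++i)
                staged[i] = input[i];
            {
                std::lock_guard<std::mutex> lock(m);
                ++blocks_ready;              // publish the completed block
            }
            cv.notify_one();
        }
    });

    // Kernel iterations: each one waits only for its own input block,
    // so later blocks can still be in flight while earlier ones are processed.
    for (std::size_t b = 0; b < blocks; ++b) {
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return blocks_ready > b; });
        }
        for (std::size_t i = b * block_size; i < (b + 1) * block_size; ++i)
            output[i] = 2 * staged[i];       // placeholder computation
    }

    transfer.join();  // analogous to the blocking call: results are complete on return
    return output;
}
```

The caller never synchronizes with the transfer thread explicitly. The per-block wait inside the processing loop plays the role that the window buffer plays in the AI Engine design, which is why the host code in this step needs no wait call for the input.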