The input_gmio/output_gmio port attribute can be used to initiate AI Engine–DDR memory read and write transactions in the PS program. This enables data transfer between an AI Engine and the DDR controller through APIs written in the PS program. The following example shows how to use GMIO APIs to send data to an AI Engine for processing and retrieve the processed data back to the DDR through the PS program.
class myGraph: public adf::graph
{
private:
    adf::kernel k_m;
public:
    adf::output_gmio gmioOut;
    adf::input_gmio gmioIn;
    myGraph()
    {
        k_m = adf::kernel::create(weighted_sum_with_margin);
        // Arguments: logical name, burst length in bytes, required bandwidth in MB/s
        gmioOut = adf::output_gmio::create("gmioOut", 64, 1000);
        gmioIn = adf::input_gmio::create("gmioIn", 64, 1000);
        // DDR -> kernel input, kernel output -> DDR
        adf::connect(gmioIn.out[0], k_m.in[0]);
        adf::connect(k_m.out[0], gmioOut.in[0]);
        dimensions(k_m.in[0]) = {256};
        dimensions(k_m.out[0]) = {256};
        adf::source(k_m) = "weighted_sum.cc";
        adf::runtime<adf::ratio>(k_m) = 0.9;
    }
};
graph.cpp
myGraph gr;
int main(int argc, char ** argv)
{
    const int BLOCK_SIZE=256;
    int32 *inputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
    int32 *outputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
    // provide input data to AI Engine in inputArray
    for (int i=0; i<BLOCK_SIZE; i++) {
        inputArray[i] = i;
    }
    gr.init();
    gr.gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
    gr.gmioOut.aie2gm_nb(outputArray, BLOCK_SIZE*sizeof(int32));
    gr.run(8);
    gr.gmioOut.wait();
    // can start to access output data from AI Engine in outputArray
    // ...
    GMIO::free(inputArray);
    GMIO::free(outputArray);
    gr.end();
}
This example declares two I/O objects: gmioIn represents the DDR memory space to be read by the AI Engine, and gmioOut represents the DDR memory space to be written by the AI Engine. Each create() call specifies the logical name of the GMIO, the burst length of the memory-mapped AXI4 transaction (64, 128, or 256 bytes), and the required bandwidth (in MB/s).
gmioOut = adf::output_gmio::create("gmioOut", 64, 1000);
gmioIn = adf::input_gmio::create("gmioIn", 64, 1000);
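The burst-length constraint above can be checked with a little arithmetic before a design is built. The following is a minimal sketch in standard C++, not ADF code: is_supported_burst_length and bursts_needed are hypothetical helper names, and the burst count assumes a transaction is simply divided into bursts of the configured length, which the text does not state.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an ADF function): true when the burst length is
// one of the values the text lists for a GMIO (64, 128, or 256 bytes).
constexpr bool is_supported_burst_length(std::size_t bytes) {
    return bytes == 64 || bytes == 128 || bytes == 256;
}

// Hypothetical helper: number of bursts needed to cover a transaction,
// rounding up when the size is not a multiple of the burst length.
// Assumption: a transaction is split into bursts of exactly burst_bytes.
inline std::size_t bursts_needed(std::size_t transaction_bytes,
                                 std::size_t burst_bytes) {
    assert(is_supported_burst_length(burst_bytes));
    return (transaction_bytes + burst_bytes - 1) / burst_bytes;
}
```

For the 256-element int32 block used in this example, the transaction is 256 * 4 = 1024 bytes, so under the stated assumption bursts_needed(1024, 64) evaluates to 16.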
The application graph (myGraph) has an input port (myGraph::gmioIn) that feeds the processing kernels and an output port (myGraph::gmioOut) that carries the processed data from the kernels. The following code connects the input port of the graph to the kernel and the kernel to the output port of the graph.
adf::connect(gmioIn.out[0], k_m.in[0]);
adf::connect(k_m.out[0], gmioOut.in[0]);
dimensions(k_m.in[0]) = {256};
dimensions(k_m.out[0]) = {256};
Inside the main function, two 256-element int32 arrays are allocated by GMIO::malloc. inputArray points to the memory space to be read by the AI Engine and outputArray points to the memory space to be written by the AI Engine. In Linux, the virtual address passed to GMIO::gm2aie_nb, GMIO::aie2gm_nb, GMIO::gm2aie, and GMIO::aie2gm must be allocated by GMIO::malloc. After the input array is allocated, it can be initialized.
const int BLOCK_SIZE=256;
int32 *inputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
int32 *outputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
gr.gmioIn.gm2aie_nb() is used to initiate memory-mapped AXI4 transactions for the AI Engine to read from DDR memory spaces. The first argument in gr.gmioIn.gm2aie_nb() is the pointer to the start address of the memory space for the transaction. The second argument is the transaction size in bytes. The memory space for the transaction must be within the memory space allocated by GMIO::malloc. Similarly, gr.gmioOut.aie2gm_nb() is used to initiate memory-mapped AXI4 transactions for the AI Engine to write to DDR memory spaces. gr.gmioIn.gm2aie_nb() and gr.gmioOut.aie2gm_nb() are non-blocking functions in the sense that they return immediately once the transaction is issued; they do not wait for the transaction to complete. By contrast, gr.gmioIn.gm2aie() and gr.gmioOut.aie2gm() behave in a blocking manner. This example makes one gr.gmioIn.gm2aie_nb() call to issue a read transaction for eight iterations' worth of data, and one gr.gmioOut.aie2gm_nb() call to issue a write transaction for eight iterations' worth of data.
gr.gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
gr.gmioOut.aie2gm_nb(outputArray, BLOCK_SIZE*sizeof(int32));
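The difference between the non-blocking and blocking calls described above can be illustrated without the ADF headers. The sketch below is a host-only model, assuming a hypothetical MockPort class that is not part of any AMD library: a non-blocking request is only recorded, and the copy happens when wait() is called. Real hardware may finish a transaction earlier; the model deliberately defers it to the latest point so that reading the destination before wait() is visibly unsafe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a GMIO port; NOT the ADF API.
class MockPort {
public:
    // Models the *_nb calls: record the request and return immediately.
    void transfer_nb(void* dst, const void* src, std::size_t bytes) {
        assert(dst != nullptr && src != nullptr);
        pending_.push_back({dst, src, bytes});
    }

    // Models wait(): returns only after every recorded request has completed.
    void wait() {
        for (const Request& r : pending_) {
            std::memcpy(r.dst, r.src, r.bytes);
        }
        pending_.clear();
    }

    // Models the blocking calls: issue the request, then wait for it.
    void transfer(void* dst, const void* src, std::size_t bytes) {
        transfer_nb(dst, src, bytes);
        wait();
    }

    // Number of requests issued but not yet completed.
    std::size_t pending() const { return pending_.size(); }

private:
    struct Request {
        void* dst;
        const void* src;
        std::size_t bytes;
    };
    std::vector<Request> pending_;
};
```

In this model, after transfer_nb() the destination buffer still holds its old contents and pending() is 1; only after wait() does the destination hold the source data. This mirrors why the example calls gr.gmioOut.wait() before the PS reads outputArray.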
gr.run(8) is also a non-blocking call; it runs the graph for eight iterations. To synchronize between the PS and AI Engine for DDR memory read/write access, you can use gr.gmioOut.wait() to block PS execution until the GMIO transaction is complete. In this example, gr.gmioOut.wait() is called to wait for the output data to be written to the outputArray DDR memory space. After that, the PS program can access the data. When the PS has completed processing, the memory space allocated by GMIO::malloc can be released by GMIO::free.
GMIO::free(inputArray);
GMIO::free(outputArray);
The input_gmio/output_gmio APIs can be used in various ways to provide different levels of control over read/write access and synchronization between the AI Engine, PS, and DDR memory. Any of input_gmio::gm2aie, output_gmio::aie2gm, input_gmio::gm2aie_nb, or output_gmio::aie2gm_nb can be called multiple times to associate different memory spaces with the same input_gmio/output_gmio object during different phases of graph execution. Different input_gmio/output_gmio objects can be associated with the same memory space for in-place AI Engine–DDR read/write access. The blocking input_gmio::gm2aie and output_gmio::aie2gm APIs are themselves synchronization points for data transportation and kernel execution. Calling input_gmio::gm2aie (or output_gmio::aie2gm) is equivalent to calling input_gmio::gm2aie_nb (or output_gmio::aie2gm_nb) followed immediately by input_gmio::wait (or output_gmio::wait). The following example shows a combination of these use cases.
myGraph gr;
int main(int argc, char ** argv)
{
    const int BLOCK_SIZE=256;
    // dynamically allocate memory spaces for in-place AI Engine read/write access
    int32* inoutArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
    gr.init();
    for (int k=0; k<4; k++)
    {
        // provide input data to AI Engine in inoutArray
        for (int i=0; i<BLOCK_SIZE; i++) {
            inoutArray[i]=i;
        }
        gr.run(8);
        for (int i=0; i<8; i++)
        {
            gr.gmioIn.gm2aie(inoutArray+i*32, 32*sizeof(int32)); //blocking call to ensure transaction data is read from DDR to AI Engine
            gr.gmioOut.aie2gm_nb(inoutArray+i*32, 32*sizeof(int32));
        }
        gr.gmioOut.wait();
        // can start to access output data from AI Engine in inoutArray
        // ...
    }
    GMIO::free(inoutArray);
    gr.end();
    return 0;
}
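The inner loop above moves the 256-element block in eight pieces of 32 int32 values each. The offset and size arithmetic can be sketched in standard C++ as follows; Chunk and split_block are hypothetical names introduced for illustration, and the sketch assumes the block divides evenly across the iterations, as it does in this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one per-iteration transaction.
struct Chunk {
    std::size_t offset_elems;  // element offset from the start of the block
    std::size_t size_bytes;    // transaction size in bytes
};

// Hypothetical helper: split a block of block_elems elements into one
// equally sized chunk per iteration. Assumes an even split.
inline std::vector<Chunk> split_block(std::size_t block_elems,
                                      std::size_t iterations,
                                      std::size_t elem_bytes) {
    assert(iterations > 0 && block_elems % iterations == 0);
    const std::size_t per_iteration = block_elems / iterations;
    std::vector<Chunk> chunks;
    for (std::size_t i = 0; i < iterations; ++i) {
        chunks.push_back({i * per_iteration, per_iteration * elem_bytes});
    }
    return chunks;
}
```

With split_block(256, 8, sizeof(std::int32_t)), chunk i starts at element i*32 and is 128 bytes long, matching the inoutArray+i*32 and 32*sizeof(int32) arguments in the loop.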
In the previous example, the two GMIO objects gmioIn and gmioOut use the same memory space, pointed to by inoutArray, for in-place read and write access. Because the data flow dependency among the kernels inside the graph is not known, write-after-read ordering for the inoutArray memory space must be enforced explicitly: the blocking version gr.gmioIn.gm2aie() is called to ensure that the transaction data is copied from DDR memory to AI Engine local memory before a write transaction to the same memory space is issued in gr.gmioOut.aie2gm_nb().
gr.gmioIn.gm2aie(inoutArray+i*32, 32*sizeof(int32)); //blocking call to ensure transaction data is read from DDR to AI Engine
gr.gmioOut.aie2gm_nb(inoutArray+i*32, 32*sizeof(int32));
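Why the read has to finish before the write is issued can be shown with a plain C++ analogy. This is a sketch under stated assumptions, not a model of the actual DMA behavior: the "kernel" is a hypothetical reversal, chosen only because each output depends on a different input position, and a local std::vector stands in for AI Engine local memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Unsafe ordering: results are written into the shared region while later
// reads still need the original values, so some inputs are overwritten.
inline void reverse_unstaged(std::int32_t* shared, std::size_t n) {
    assert(shared != nullptr);
    for (std::size_t j = 0; j < n; ++j) {
        shared[j] = shared[n - 1 - j];
    }
}

// Safe ordering: the whole chunk is read into local storage first (the role
// played by the blocking read), and only then written back in place.
inline void reverse_staged(std::int32_t* shared, std::size_t n) {
    assert(shared != nullptr);
    const std::vector<std::int32_t> local(shared, shared + n);
    for (std::size_t j = 0; j < n; ++j) {
        shared[j] = local[n - 1 - j];
    }
}
```

Starting from {1, 2, 3, 4}, reverse_unstaged leaves {4, 3, 3, 4} because two inputs are clobbered before they are read, while reverse_staged produces the intended {4, 3, 2, 1}.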
gr.gmioOut.wait() ensures that the data has been written to DDR memory. After it returns, the PS can access the output data for post-processing.
The graph execution is divided into four phases by the for loop, for (int k=0; k<4; k++). inoutArray can be re-initialized in the for loop with different data to be processed in each phase.