AI Engines can communicate directly through the AXI4-Stream interconnect, without any DMA or memory interaction. Data can be sent from one AI Engine to another, or broadcast, through the streaming interface. A streaming connection carries 32 bits per cycle and provides built-in handshake and backpressure mechanisms.
For streaming input and output interfaces, when performance is limited by the number of streams, the AI Engine can use two streaming inputs or two streaming outputs in parallel instead of one. To use two parallel streams, it is recommended to use the following pairs of macros, where idx1 and idx2 are the two streams. Add the __restrict keyword to the stream ports to ensure they are optimized for parallel processing.
READINCR(SS_rsrc1, idx1) and READINCR(SS_rsrc2, idx2)
READINCRW(WSS_rsrc1, idx1) and READINCRW(WSS_rsrc2, idx2)
WRITEINCR(MS_rsrc1, idx1, val) and WRITEINCR(MS_rsrc2, idx2, val)
WRITEINCRW(WMS_rsrc1, idx1, val) and WRITEINCRW(WMS_rsrc2, idx2, val)
The following sample code uses two parallel input streams to achieve pipelining with an interval of 1. An interval of 1 means that two reads, one write, and one add are performed in every cycle.
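void simple(input_stream_int32 * __restrict data0,
            input_stream_int32 * __restrict data1,
            output_stream_int32 * __restrict out) {
    for (int i = 0; i < 1024; i++)
    chess_prepare_for_pipelining
    {
        // The two reads use different stream resources (SS_rsrc1, SS_rsrc2),
        // so both can be issued in the same cycle.
        int32_t d = READINCR(SS_rsrc1, data0);
        int32_t e = READINCR(SS_rsrc2, data1);
        // Write the sum to the single output stream.
        WRITEINCR(MS_rsrc1, out, d + e);
    }
}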
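As a rough check on why the second input stream matters in the loop above, the following host-side sketch estimates the steady-state cycle count of a stream-bound loop. It is plain C++, not AI Engine code. The helper name estimate_loop_cycles and its parameters are hypothetical, and the model assumes that each stream delivers at most one 32-bit word per cycle, that stream reads are the only bottleneck, and that there are no stalls or pipeline fill and drain cycles.

```cpp
#include <cassert>
#include <cstdint>

// Rough steady-state cycle estimate for a loop limited by stream reads.
// Assumptions (not taken from any AI Engine API):
//   - each stream moves at most one 32-bit word per cycle,
//   - reads are spread evenly over the available parallel streams,
//   - no backpressure stalls and no pipeline prologue/epilogue.
// Hypothetical helper for illustration only.
std::uint64_t estimate_loop_cycles(std::uint64_t iterations,
                                   std::uint64_t reads_per_iteration,
                                   std::uint64_t parallel_streams) {
    assert(parallel_streams > 0);
    // Initiation interval: cycles needed to issue all reads of one iteration
    // (ceiling division).
    std::uint64_t interval =
        (reads_per_iteration + parallel_streams - 1) / parallel_streams;
    if (interval == 0) {
        interval = 1;  // a loop iteration still takes at least one cycle
    }
    return iterations * interval;
}
```

Under these assumptions, estimate_loop_cycles(1024, 2, 2) gives 1024 cycles for the two-stream loop, while estimate_loop_cycles(1024, 2, 1) gives 2048 cycles if both values had to arrive on a single stream.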
A stream connection can be unicast or multicast. Note that in multicast communication, the data is sent to all destination ports at the same time, and only when all destinations are ready to receive it.
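The all-or-nothing behavior of multicast can be illustrated with a small host-side model. This is a minimal sketch in plain C++, not AI Engine code: the Destination type, its bounded FIFO, and the multicast_word function are hypothetical names introduced for illustration, and the FIFO capacity is an arbitrary stand-in for whatever makes a real destination "ready".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model of one destination port with a bounded FIFO.
struct Destination {
    std::size_t capacity;            // assumed buffer depth, illustration only
    std::deque<std::int32_t> fifo;   // words accepted so far
    bool ready() const { return fifo.size() < capacity; }
};

// Try to multicast one 32-bit word.
// The word is delivered to every destination in the same step, or to none:
// if any destination is not ready, the source must hold the word and retry.
bool multicast_word(std::vector<Destination>& dests, std::int32_t word) {
    for (const Destination& d : dests) {
        if (!d.ready()) {
            return false;  // backpressure from one port stalls all ports
        }
    }
    for (Destination& d : dests) {
        d.fifo.push_back(word);
    }
    return true;
}
```

In this model, a single slow consumer stalls the transfer to every other destination, which is the practical consequence of the rule that data is sent only when all destinations are ready.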