The host applications store input data (`i` and `j`) in global memory (DDR) and start the PL HLS kernels (running at 300 MHz) and the AI Engine graph (running at 1 GHz).

Data moves from DDR to the dual-channel HLS datamover kernel `mm2s_mp`. The `i` data goes into one channel and the `j` data goes into the other channel. Here, data movement switches from AXI-MM to AXI-Stream. The read/write bandwidth of DDR is set to the default 0.04 Gbps.

The AI Engine graph performs packet switching on the `input_i` data, so the `i` data needs to be packaged appropriately before being sent to the AI Engine. From the `mm2s_mp` kernel, it is therefore streamed to the HLS `packet_sender` kernel. The `packet_sender` kernel sends a packet header and appropriately asserts `TLAST` before sending packets of `i` data to the 100 `input_i` ports in the AI Engine.

The AI Engine graph expects the `j` data to be streamed directly into the AI Engine kernels, so no additional packaging is needed. The `j` data is streamed directly from the `mm2s_mp` kernel into the AI Engine.

The AI Engine distributes the gravity equation computations onto 100 accelerators (each using 4 AI Engine tiles). The AI Engine graph outputs new `i` data through the 100 `output_i` ports. The `output_i` data is also packet switched and needs to be appropriately managed by the `packet_receiver` kernel.

The `packet_receiver` kernel receives a packet, evaluates the header as 0, 1, 2, or 3, and sends the `output_i` data to the `k0`, `k1`, `k2`, or `k3` stream accordingly.

The `s2mm_mp` quad-channel HLS datamover kernel receives the `output_i` data and writes it to global memory (DDR). Here, data movement switches from AXI-Stream to AXI-MM.

Then, depending on the host application, the new output data is either read and compared against the golden expected data or saved as the next iteration of `i` data, in which case the AI Engine N-Body Simulator runs for another timestep.
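The packet handling described above can be sketched as a plain C++ software model. This is only an illustration, not the tutorial's HLS source: the `Beat` struct, the `make_packet` and `route_packets` functions, and the header layout (packet ID in the low two bits) are assumptions made for this sketch. The real packet header format is defined by the AI Engine tools and is not reproduced here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One beat on a modeled AXI4-Stream link: a 32-bit word plus the TLAST flag.
struct Beat {
    std::uint32_t data;
    bool last;
};

// Assumed header layout for this sketch only: packet ID in the low two bits.
constexpr std::uint32_t kIdMask = 0x3;

// Models the packet_sender side: emit a header word, then the payload,
// asserting TLAST on the final word of the packet.
std::vector<Beat> make_packet(std::uint32_t id,
                              const std::vector<std::uint32_t>& payload) {
    std::vector<Beat> out;
    out.push_back({id & kIdMask, payload.empty()});
    for (std::size_t n = 0; n < payload.size(); ++n) {
        out.push_back({payload[n], n + 1 == payload.size()});
    }
    return out;
}

// Models the packet_receiver side: read the header of each packet, then
// forward the payload words to stream k0, k1, k2, or k3 (index 0..3).
std::array<std::vector<std::uint32_t>, 4>
route_packets(const std::vector<Beat>& in) {
    std::array<std::vector<std::uint32_t>, 4> k{};
    bool expect_header = true;
    std::uint32_t id = 0;
    for (const Beat& b : in) {
        if (expect_header) {
            id = b.data & kIdMask;   // header selects the output stream
        } else {
            k[id].push_back(b.data); // payload goes to the selected stream
        }
        expect_header = b.last;      // TLAST ends the packet; a header follows
    }
    return k;
}
```

In this model, concatenating the result of `make_packet(2, {10, 11, 12})` with `make_packet(0, {7})` and passing it to `route_packets` places `10, 11, 12` in the `k2` slot and `7` in the `k0` slot. The header word itself is consumed by the receiver and never forwarded.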
Note: The entire design is a compute-bound problem, meaning throughput is limited by how fast the AI Engine tiles compute the floating-point gravity equations. This is not a memory-bound design.
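One way to see why the workload leans toward compute is that a direct-sum gravity calculation evaluates every `i` particle against every `j` particle, so the arithmetic grows with the product of the two particle counts while the data moved grows only with their sum. The sketch below is a scalar C++ reference written under assumptions: the `Particle` and `Vec3` types, the softening term `eps2`, and the function names are hypothetical and are not taken from the AI Engine kernels.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Particle { float x, y, z, m; };  // position and mass (assumed layout)
struct Vec3 { float x, y, z; };

// Direct-sum Newtonian acceleration on each i particle from all j particles.
// G and eps2 (softening, added to the squared distance) are illustrative
// parameters. With eps2 > 0, a particle paired with itself contributes zero;
// with eps2 == 0, the caller must avoid coincident particles.
std::vector<Vec3> accelerations(const std::vector<Particle>& pi,
                                const std::vector<Particle>& pj,
                                float G, float eps2) {
    std::vector<Vec3> a(pi.size(), Vec3{0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < pi.size(); ++i) {
        for (std::size_t j = 0; j < pj.size(); ++j) {
            const float dx = pj[j].x - pi[i].x;
            const float dy = pj[j].y - pi[i].y;
            const float dz = pj[j].z - pi[i].z;
            const float r2 = dx * dx + dy * dy + dz * dz + eps2;
            const float s = G * pj[j].m / (r2 * std::sqrt(r2));
            a[i].x += s * dx;
            a[i].y += s * dy;
            a[i].z += s * dz;
        }
    }
    return a;
}

// Number of pairwise interactions evaluated in one timestep.
std::size_t interactions(std::size_t n_i, std::size_t n_j) {
    return n_i * n_j;
}
```

Doubling both particle counts doubles the data read from DDR but quadruples the value returned by `interactions`, which is consistent with the floating-point work, rather than the memory traffic, setting the pace.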