The host applications store the input data (`i` and `j`) in global memory (DDR) and start the PL HLS kernels (running at 300 MHz) and the AI Engine graph (running at 1 GHz).

Data moves from DDR to the dual-channel HLS datamover kernel `mm2s_mp`. The `i` data goes into one channel and the `j` data goes into the other channel. Here, data movement switches from AXI-MM to AXI-Stream.
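As a rough illustration, a dual-channel MM2S datamover in Vitis HLS might look like the sketch below. The kernel name `mm2s_mp` comes from the design, but the argument names, data width, and pragmas here are assumptions, not the design's actual source.

```cpp
// Hypothetical sketch of a dual-channel MM2S datamover: two concurrent
// AXI-MM reads feeding two AXI-Streams.
#include <ap_axi_sdata.h>
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_axiu<32, 0, 0, 0> axis_word;

// Read one buffer from DDR (AXI-MM) and forward it as an AXI-Stream.
static void read_to_stream(const ap_int<32>* mem, hls::stream<axis_word>& s,
                           int size) {
    for (int n = 0; n < size; ++n) {
#pragma HLS PIPELINE II = 1
        axis_word w;
        w.data = mem[n];
        w.keep = -1;  // all bytes valid
        w.last = 0;
        s.write(w);
    }
}

extern "C" void mm2s_mp(const ap_int<32>* mem_i,      // i data in DDR
                        const ap_int<32>* mem_j,      // j data in DDR
                        hls::stream<axis_word>& s_i,  // i data stream out
                        hls::stream<axis_word>& s_j,  // j data stream out
                        int size) {
#pragma HLS INTERFACE m_axi port = mem_i offset = slave bundle = gmem0
#pragma HLS INTERFACE m_axi port = mem_j offset = slave bundle = gmem1
#pragma HLS INTERFACE axis port = s_i
#pragma HLS INTERFACE axis port = s_j
#pragma HLS INTERFACE s_axilite port = size
#pragma HLS INTERFACE s_axilite port = return
#pragma HLS DATAFLOW
    // Under DATAFLOW, both channels move data concurrently.
    read_to_stream(mem_i, s_i, size);
    read_to_stream(mem_j, s_j, size);
}
```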
The AI Engine graph performs packet switching on the `input_i` data, so the `i` data needs to be packetized appropriately before being sent to the AI Engine. From the `mm2s_mp` kernel, the `i` data is therefore streamed to the HLS `packet_sender` kernel. The `packet_sender` kernel sends a packet header and asserts `TLAST` appropriately before sending packets of `i` data to the 100 `input_i` ports in the AI Engine.
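A minimal sketch of this header-and-`TLAST` handling is shown below, assuming the 32-bit AI Engine packet-switching header carries the packet ID in its low bits with an odd-parity bit in bit 31. It shows a single output stream interleaving packets for four destinations; the actual kernel fans the `i` data out to many such streams, and its argument list here is an assumption.

```cpp
// Hypothetical sketch of packet-header generation and TLAST framing.
#include <ap_axi_sdata.h>
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_axiu<32, 0, 0, 0> axis_word;

// Build a packet header: packet ID in the low bits, odd parity in bit 31.
static ap_uint<32> make_header(ap_uint<5> pkt_id) {
    ap_uint<32> h = pkt_id;
    h[31] = !h.xor_reduce();  // make the total number of 1s odd
    return h;
}

extern "C" void packet_sender(hls::stream<axis_word>& in,
                              hls::stream<axis_word>& out,
                              int words_per_packet) {
#pragma HLS INTERFACE axis port = in
#pragma HLS INTERFACE axis port = out
#pragma HLS INTERFACE s_axilite port = words_per_packet
#pragma HLS INTERFACE s_axilite port = return

    // Interleave one packet per destination ID sharing this stream.
    for (int id = 0; id < 4; ++id) {
        axis_word hdr;
        hdr.data = make_header(id);
        hdr.keep = -1;
        hdr.last = 0;  // the header is never the last word
        out.write(hdr);

        for (int n = 0; n < words_per_packet; ++n) {
#pragma HLS PIPELINE II = 1
            axis_word w = in.read();
            w.last = (n == words_per_packet - 1);  // assert TLAST on the final word
            out.write(w);
        }
    }
}
```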
The AI Engine graph expects the `j` data to be streamed directly into the AI Engine kernels, so no additional packaging is needed. The `j` data is streamed from the `mm2s_mp` kernel directly into the AI Engine.

The AI Engine distributes the gravity equation computations onto 100 accelerators (each using 4 AI Engine tiles).
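Each accelerator evaluates the floating-point gravity equations for its share of the particles. For reference, a standard softened form of the Newtonian update is shown below; the softening factor $\varepsilon$ and the exact arrangement of the computation inside the kernels are assumptions, not taken from this section.

$$
\vec{a}_i = G \sum_{j \neq i} \frac{m_j \,(\vec{r}_j - \vec{r}_i)}{\left(\lVert \vec{r}_j - \vec{r}_i \rVert^2 + \varepsilon^2\right)^{3/2}},
\qquad
\vec{v}_i \leftarrow \vec{v}_i + \vec{a}_i\,\Delta t,
\qquad
\vec{r}_i \leftarrow \vec{r}_i + \vec{v}_i\,\Delta t
$$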
The AI Engine graph outputs new `i` data through the 100 `output_i` ports. The `output_i` data is also packet switched and needs to be managed appropriately by the `packet_receiver` kernel.
The `packet_receiver` kernel receives each packet, evaluates its header as 0, 1, 2, or 3, and sends the `output_i` data to the `k0`, `k1`, `k2`, or `k3` stream accordingly.
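A minimal sketch of that routing logic is shown below, assuming the packet ID sits in the low bits of the 32-bit header word and that each packet ends with `TLAST`; the argument list and the `num_packets` parameter are illustrative assumptions.

```cpp
// Hypothetical sketch of packet_receiver: read a header, then route the
// packet's payload words to one of four output streams until TLAST.
#include <ap_axi_sdata.h>
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_axiu<32, 0, 0, 0> axis_word;

extern "C" void packet_receiver(hls::stream<axis_word>& in,
                                hls::stream<axis_word>& k0,
                                hls::stream<axis_word>& k1,
                                hls::stream<axis_word>& k2,
                                hls::stream<axis_word>& k3,
                                int num_packets) {
#pragma HLS INTERFACE axis port = in
#pragma HLS INTERFACE axis port = k0
#pragma HLS INTERFACE axis port = k1
#pragma HLS INTERFACE axis port = k2
#pragma HLS INTERFACE axis port = k3
#pragma HLS INTERFACE s_axilite port = num_packets
#pragma HLS INTERFACE s_axilite port = return

    for (int p = 0; p < num_packets; ++p) {
        // The first word of each packet is the header; extract the ID.
        axis_word hdr = in.read();
        ap_uint<2> id = hdr.data & 0x3;

        // Forward payload words to the matching stream until TLAST.
        axis_word w;
        do {
#pragma HLS PIPELINE II = 1
            w = in.read();
            switch (int(id)) {
                case 0: k0.write(w); break;
                case 1: k1.write(w); break;
                case 2: k2.write(w); break;
                default: k3.write(w); break;
            }
        } while (!w.last);
    }
}
```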
The `s2mm_mp` quad-channel HLS datamover kernel receives the `output_i` data and writes it to global memory (DDR). Here, data movement switches from AXI-Stream back to AXI-MM.

Then, depending on the host application, the new output data is read and compared against the golden expected data, or saved as the next iteration of `i` data, and the AI Engine N-Body Simulator runs for another timestep.
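End to end, the host-side sequence (load the xclbin, start the datamovers, read back the results, then compare or iterate) might look like the following sketch using the XRT native C++ API. The xclbin name, buffer sizes, and kernel argument lists are assumptions; in particular, the real quad-channel `s2mm_mp` would take one buffer per channel.

```cpp
// Hypothetical host-side flow with the XRT native C++ API; only the kernel
// names mm2s_mp and s2mm_mp come from the design.
#include <algorithm>
#include <vector>
#include "xrt/xrt_bo.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"

int main() {
    const int NUM_WORDS = 12800;  // illustrative data size
    auto device = xrt::device(0);
    auto uuid = device.load_xclbin("nbody.xclbin");  // assumed file name

    auto mm2s = xrt::kernel(device, uuid, "mm2s_mp");
    auto s2mm = xrt::kernel(device, uuid, "s2mm_mp");

    // DDR buffers for the i and j input data and the output i data.
    auto bo_i = xrt::bo(device, NUM_WORDS * sizeof(float), mm2s.group_id(0));
    auto bo_j = xrt::bo(device, NUM_WORDS * sizeof(float), mm2s.group_id(1));
    auto bo_out = xrt::bo(device, NUM_WORDS * sizeof(float), s2mm.group_id(0));

    std::vector<float> host_i(NUM_WORDS), host_j(NUM_WORDS), golden(NUM_WORDS);
    // ... fill host_i / host_j with particle state and golden with expected data ...
    bo_i.write(host_i.data());
    bo_i.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    bo_j.write(host_j.data());
    bo_j.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    // Start the datamovers (the AI Engine graph is assumed to be running,
    // e.g. auto-started or controlled through the xrt::graph API).
    auto run_out = s2mm(bo_out, NUM_WORDS);  // single buffer shown for brevity
    auto run_in = mm2s(bo_i, bo_j, NUM_WORDS);
    run_in.wait();
    run_out.wait();

    // Read back the new i data, then compare against the golden data or
    // feed it back in as the next timestep's input.
    bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
    std::vector<float> host_out(NUM_WORDS);
    bo_out.read(host_out.data());
    bool match = std::equal(host_out.begin(), host_out.end(), golden.begin());
    return match ? 0 : 1;  // exact comparison shown for brevity
}
```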
Note: The entire design is a compute-bound problem, meaning we are limited by how fast the AI Engine tiles can compute the floating-point gravity equations. This is not a memory-bound design.