Dataflow - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English
  • The host applications store the input data (i and j) in global memory (DDR), then start the PL HLS kernels (running at 300 MHz) and the AI Engine graph (running at 1 GHz).

  • Data moves from DDR to the dual-channel HLS datamover kernel mm2s_mp. The i data goes into one channel and the j data goes into the other. Here, data movement switches from AXI-MM to AXI-Stream. The read/write bandwidth of DDR is set to the default 0.04 Gbps.

  • The AI Engine graph performs packet switching on the input_i data, so the i data must be packetized before it is sent to the AI Engine. From the mm2s_mp kernel, the i data is therefore streamed to the HLS packet_sender kernel. For each packet, the packet_sender kernel sends a packet header followed by the i data, and asserts TLAST to mark the end of the packet, as it feeds the 100 input_i ports in the AI Engine.

  • The AI Engine graph expects the j data to be streamed directly into the AI Engine kernels, so no additional packaging is needed. The j data is directly streamed from the mm2s_mp kernel into the AI Engine.

  • The AI Engine distributes the gravity equation computations across 100 accelerators (each using 4 AI Engine tiles). The AI Engine graph outputs the new i data through the 100 output_i ports. The output_i data is also packet switched, so it must be unpacked and routed by the packet_receiver kernel.

  • The packet_receiver kernel receives a packet, evaluates the header as 0, 1, 2, or 3, and sends the output_i data to the matching k0, k1, k2, or k3 stream.

  • The s2mm_mp quad-channel HLS datamover kernel receives the output_i data and writes it to global memory (DDR). Here, data movement switches from AXI-Stream to AXI-MM.

  • Then, depending on the host application, the new output data is either read and compared against the golden expected data, or saved as the next iteration of i data so that the AI Engine N-Body Simulator can run for another timestep.
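The packet framing and routing described in the steps above can be sketched in plain C++ as a software model. This is not the tutorial's HLS source: the `Beat` struct, the `send_packet` and `route_packets` functions, and the use of `std::vector` in place of hardware streams are all hypothetical names chosen for illustration. The header is also simplified: the sketch assumes the packet ID occupies the low 5 bits of the header word and leaves every other header field at zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 32-bit stream beat: a data word plus the TLAST sideband flag.
struct Beat {
    std::uint32_t data;
    bool tlast;
};

// Assumption: the packet ID lives in the low 5 bits of the header word.
// All other header fields are left at zero in this simplified model.
constexpr std::uint32_t kIdMask = 0x1F;
constexpr std::size_t kNumOutputs = 4;  // models the k0..k3 streams

// Sender side: emit the header first, then the payload,
// asserting TLAST only on the final payload word.
std::vector<Beat> send_packet(std::uint32_t id,
                              const std::vector<std::uint32_t>& payload) {
    assert(id < kNumOutputs && !payload.empty());
    std::vector<Beat> out;
    out.push_back({id & kIdMask, false});
    for (std::size_t n = 0; n < payload.size(); ++n) {
        out.push_back({payload[n], n + 1 == payload.size()});
    }
    return out;
}

// Receiver side: read a header, decode the ID, then forward payload
// words to the matching output until TLAST closes the packet.
std::array<std::vector<std::uint32_t>, kNumOutputs>
route_packets(const std::vector<Beat>& in) {
    std::array<std::vector<std::uint32_t>, kNumOutputs> k{};
    std::size_t pos = 0;
    while (pos < in.size()) {
        const std::uint32_t id = in[pos++].data & kIdMask;
        assert(id < kNumOutputs);
        bool last = false;
        while (!last && pos < in.size()) {
            k[id].push_back(in[pos].data);
            last = in[pos].tlast;
            ++pos;
        }
    }
    return k;
}
```

Concatenating the output of several `send_packet` calls and passing the result to `route_packets` returns each payload in the output selected by its header ID, which mirrors how packets from a shared stream are separated into k0 through k3.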

Note: The entire design is compute bound, meaning throughput is limited by how fast the AI Engine tiles compute the floating-point gravity equations. This is not a memory-bound design.
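The final dataflow step (compare the output against golden data, or feed it back as the next i data) can be sketched as a small host-side loop. This is only an illustration under stated assumptions: `Particles`, `StepFn`, `run_timesteps`, and `matches_golden` are hypothetical names, the `step` callback stands in for one complete pass through the hardware, the handling of j data is omitted, and the absolute tolerance check is an assumed comparison method rather than the one the tutorial's host code uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Particles = std::vector<float>;

// Hypothetical stand-in for one full pass through the design:
// takes the current i data and returns the new i data.
using StepFn = std::function<Particles(const Particles&)>;

// Feed each timestep's output back in as the next timestep's i data.
Particles run_timesteps(Particles i, int steps, const StepFn& step) {
    assert(steps >= 0);
    for (int t = 0; t < steps; ++t) {
        i = step(i);
    }
    return i;
}

// Element-wise comparison against golden data with an absolute
// tolerance (assumed here). NaN values never match.
bool matches_golden(const Particles& out, const Particles& golden, float tol) {
    if (out.size() != golden.size()) return false;
    for (std::size_t n = 0; n < out.size(); ++n) {
        if (!(std::fabs(out[n] - golden[n]) <= tol)) return false;
    }
    return true;
}
```

A host application that validates a single timestep would call `matches_golden` on one result, while an animation-style host would call `run_timesteps` with a larger `steps` value.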