BP Engine Graph and Kernel Scheduling - 2025.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2025-12-05
Version: 2025.2 English
  • The BP engine uses multi-rate scheduling to control its system-level operation. The scheme is hard-coded yet configurable for processing any number of radar pulses.

  • The design employs a single top-level graph.

  • A single graph iteration corresponds to a full update of the SAR target image for one radar pulse. Using multi-rate scheduling, each AI Engine kernel executes as many times as required to perform its workload for that pulse.

  • The ifft2k_async() graph contains two kernels. The ifft() kernel performs the inverse transform. The lut() kernel computes the slope and offset LUTs required by the downstream interp1() graph. Each kernel must execute once per radar pulse. Consequently, the ifft() and lut() kernels both use a setting of repetition_count=1.

  • The range_gen() kernel generates the \((x,y,z)\) coordinates of the target image just in time, because this data is known in advance and is easily computed. This avoids storing the coordinates and saves considerable memory in the implementation.

  • All other graphs in the BP engine perform computations related to updating the SAR target image. Because the memory footprint for this image exceeds the local tile memory, partial SAR image data is streamed through the design using double-buffering. The design adopts a size of 1024 samples for these I/O kernel buffers. A \(512\times 512\) image holds 262,144 samples, so 262,144 / 1024 = 256 kernel invocations are required per graph iteration to process the full target image; all remaining kernels use repetition_count=256.

  • Both interp1() kernels (which service the real and imaginary components of the phase correction) must reuse their slope and offset LUTs over multiple kernel invocations to process the full target image. Because these LUTs are computed only once per graph iteration, the design must employ asynchronous buffering of these LUTs. Otherwise, the default multi-rate scheduling would insist on a 256-fold replication of these buffers, which is infeasible. For this reason, the interp1() kernel must be hand-coded to manage this asynchronous buffering. The current DSP library func_approx() IP (which could otherwise perform the required linear interpolation) supports only synchronous buffering.

  • The memory footprint of the ifft() and lut() kernels is quite heavy, and each of these kernels executes only once per graph iteration. Their outputs are held constant over the 256 invocations of all remaining kernels. Consequently, the design elects to use single_buffer() designations on the I/O buffers of ifft() and lut(). This has minimal impact on the overall system throughput because most of the DDR input transfers for the next radar pulse can be hidden behind the compute workload of the current radar pulse, as noted earlier.
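
The repetition counts described above follow from simple arithmetic, which the sketch below checks at compile time. It is a minimal illustration in plain C++, not code from the tutorial: the helper names invocations_per_iteration() and rate_ratio() and the k-prefixed constants are hypothetical, and the sketch does not use the AI Engine graph API. The rate ratio is only meant to illustrate the 256-fold buffer replication that the text attributes to default multi-rate scheduling.

```cpp
#include <cstddef>

// Hypothetical helper (not from the tutorial source): number of kernel
// invocations needed per graph iteration to stream a rows x cols image
// through I/O buffers that hold buffer_samples samples each.
constexpr std::size_t invocations_per_iteration(std::size_t rows,
                                                std::size_t cols,
                                                std::size_t buffer_samples) {
    // Round up so a partially filled final buffer still counts as one call.
    return (rows * cols + buffer_samples - 1) / buffer_samples;
}

// Hypothetical helper: ratio between the repetition counts of a consumer
// kernel and the producer kernel that feeds it.
constexpr std::size_t rate_ratio(std::size_t consumer_reps,
                                 std::size_t producer_reps) {
    return consumer_reps / producer_reps;
}

// Values taken from the text: 512 x 512 image, 1024-sample I/O buffers.
constexpr std::size_t kImageRows = 512;
constexpr std::size_t kImageCols = 512;
constexpr std::size_t kBufferSamples = 1024;

// ifft() and lut() run once per radar pulse (repetition_count=1).
constexpr std::size_t kOncePerPulseReps = 1;

// All remaining kernels run once per 1024-sample slice of the image.
constexpr std::size_t kStreamingReps =
    invocations_per_iteration(kImageRows, kImageCols, kBufferSamples);

static_assert(kStreamingReps == 256,
              "512*512 samples in 1024-sample buffers -> 256 invocations");
static_assert(rate_ratio(kStreamingReps, kOncePerPulseReps) == 256,
              "interp1() runs 256 times for each single run of lut()");
```

Changing the image size or buffer size in this sketch shows how the streaming repetition count scales; for example, halving the buffer to 512 samples would double the count to 512.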