Baseline Mixer Design - 2025.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2025-12-05
Version
2025.2 English

The baseline TDM mixer design implements a single kernel using the vectorization outlined above. It processes NSAMP samples of a single channel, then switches to a different carrier frequency and processes NSAMP samples of this next channel, and repeats this until all channels have been processed. The input and output buffers can be read in linear addressing order because the DMA has already been configured to place the samples in their proper destinations.
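The processing order described above can be sketched as a scalar reference model in standard C++. This is not the AI Engine kernel: it uses floating-point `std::complex` instead of cint16, `std::polar` instead of the hardware sincos() generator, and the names `TdmMixerRef`, `phase_inc`, `phase`, and `run` are hypothetical choices made for illustration. It only assumes what the text states, namely that samples arrive channel by channel and are read and written in linear order.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical scalar reference model of a TDM mixer (illustration only).
struct TdmMixerRef {
    std::vector<float> phase_inc;  // per-channel phase increment, radians per sample
    std::vector<float> phase;      // per-channel phase state kept between calls

    explicit TdmMixerRef(std::vector<float> inc)
        : phase_inc(std::move(inc)), phase(phase_inc.size(), 0.0f) {}

    // Mix one block laid out as: nsamp samples of channel 0, then channel 1, ...
    std::vector<cfloat> run(const std::vector<cfloat>& in, std::size_t nsamp) {
        if (in.size() != phase_inc.size() * nsamp) {
            throw std::invalid_argument("input size must be channels * nsamp");
        }
        std::vector<cfloat> out(in.size());
        std::size_t idx = 0;  // linear read/write index
        for (std::size_t ch = 0; ch < phase_inc.size(); ++ch) {
            float curr = phase[ch];  // restore this channel's phase
            for (std::size_t n = 0; n < nsamp; ++n, ++idx) {
                out[idx] = in[idx] * std::polar(1.0f, curr);
                curr += phase_inc[ch];
            }
            phase[ch] = curr;  // save phase for the next invocation
        }
        return out;
    }
};
```

Constructing `TdmMixerRef` with one phase increment per channel and calling `run` repeatedly on successive blocks keeps each channel's tone phase-continuous across calls, which is the role the text assigns to the stored phase state.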

A summary of this baseline design is shown in the figure below. The design supports NSAMP=64 samples and CC=32 channels. It uses a single compute tile, with buffers distributed across three tiles. No attempt was made to optimize the design floor plan. The design uses double-buffered input and output and contains a single phase_inc lookup table that specifies the phase increment for each supported channel. The design was driven with unity-valued inputs so that each channel produces its own tone.

Notice that the compiler has scheduled the inner for loop (Line 57) with an initiation interval (II) of 15 cycles, although the minimum II based on unscheduled hardware operations is only II=2. This indicates that the existing code is quite inefficient and runs about 7X to 8X slower than the theoretical bound because of poor software scheduling. The code will be refactored below to make improvements. Based on the existing II=15, the design achieves a throughput of ~2600 MB/s or ~650 Msps (with each cint16 sample taking 4 bytes).

Note two points about the II reported for Line 57. First, the tool displays this information only if the “verbose” option is enabled for the compiler. Second, the code contains three for-loops, but an II is reported only for the innermost loop. This may be unexpected, but it means the compiler has elected to apply software pipelining only to the innermost loop in this design. In other designs, the II may be reported for multiple loops if they undergo such pipelining.

figure

The figure below provides kernel code for the baseline design.

The top left shows the header file code. The static configuration for the mixer frequencies is provided to the constructor from the phase_inc_i array. An additional phase array holds the state of the mixer between kernel invocations.

The bottom left shows the constructor code. The initial phase of the mixer channels is initialized to zero in this code.

The right shows the actual kernel code. The outer loop (Line 42) runs over all channels supported by the mixer. Lines 45 to 51 compute the fixed vector ramp required by the current channel. The previous phase value for the current channel is restored from memory in Line 54. The inner loop on Line 57 processes all NSAMP samples for the current channel eight at a time using two pipelined operations. The first multiplies the vector ramp by the next value generated by the sincos() generator. The second multiplies the 8-lane vector of the mixer phasor by the 8-lane vector of input samples. The curr variable accumulates the phase for the sincos() generator. Finally, the last phase value is stored to memory in Line 67 for the next kernel invocation.
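The two-multiply structure of the inner loop can be illustrated in standard C++. This is a floating-point sketch under stated assumptions, not the kernel from the figure: `make_ramp` and `mix_channel` are hypothetical names, `std::polar` stands in for the sincos() generator, and the ramp is assumed to hold the phasors for offsets 0 to 7 of the current channel's phase increment. A scalar inner loop over the eight lanes stands in for each 8-lane vector multiply.

```cpp
#include <array>
#include <cmath>
#include <complex>
#include <cstddef>

using cfloat = std::complex<float>;
constexpr std::size_t kLanes = 8;

// Fixed per-channel ramp: ramp[k] = exp(j * k * phase_inc), k = 0..7.
std::array<cfloat, kLanes> make_ramp(float phase_inc) {
    std::array<cfloat, kLanes> ramp{};
    for (std::size_t k = 0; k < kLanes; ++k) {
        ramp[k] = std::polar(1.0f, static_cast<float>(k) * phase_inc);
    }
    return ramp;
}

// Mix nsamp samples of one channel, eight at a time (nsamp is expected to be
// a multiple of 8). Returns the updated phase to store for the next call.
float mix_channel(const cfloat* in, cfloat* out, std::size_t nsamp,
                  float phase_inc, float curr) {
    const std::array<cfloat, kLanes> ramp = make_ramp(phase_inc);
    for (std::size_t n = 0; n < nsamp; n += kLanes) {
        const cfloat base = std::polar(1.0f, curr);  // one sincos() value per block
        for (std::size_t k = 0; k < kLanes && n + k < nsamp; ++k) {
            const cfloat phasor = ramp[k] * base;  // first multiply: ramp * base
            out[n + k] = phasor * in[n + k];       // second multiply: phasor * input
        }
        curr += static_cast<float>(kLanes) * phase_inc;  // advance by one block
    }
    return curr;
}
```

The design choice this illustrates is that only one sincos() evaluation is needed per block of eight samples: the ramp supplies the per-lane rotation, so the product `ramp[k] * base` equals the phasor at phase `curr + k * phase_inc`.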

Note that the computation in Lines 61-62 involves two 8-lane vector multiplications. These are pipelined instructions that take many cycles to complete. Because the compiler must wait for these instructions to complete before scheduling the next iteration of the loop body, the loop throughput is limited (effectively stalled) by the latency of these pipelined instructions. For this reason, only II=15 is achieved even though the instructions themselves allow II=2. The key to improving this situation, as will be illustrated below, lies in filling this inner loop with more compute work.

figure