Hardware Validation - XAPP1391

Automatic Digital Pre-distortion Design Generation for AI Engine (XAPP1391)

Document ID: XAPP1391
Release Date: 2023-05-24
Revision: 1.0 English

In the DPD design, the throughput of each AXI bus is 983.04 MSPS, which is 98.3% of the theoretical limit of 1 GSPS. The accuracy of AI Engine simulation is around 5%, which is sufficient for most designs but not for DPD, which can tolerate no more than 1.7% throughput degradation. To enable quick validation on an AMD VCK190 board, the MATLAB scripts automatically generate a hardware validation design.
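The margin argument above can be checked with a few lines of arithmetic. The following Python sketch uses only the figures quoted in this section; the variable and function names are illustrative and are not part of the generated design.

```python
# Throughput margin for one AXI bus, using the figures quoted in the text.
REQUIRED_MSPS = 983.04   # per-bus DPD throughput
LIMIT_MSPS = 1000.0      # theoretical limit of 1 GSPS
SIM_ACCURACY = 0.05      # approximate AI Engine simulation accuracy

utilization = REQUIRED_MSPS / LIMIT_MSPS   # fraction of the limit in use
margin = 1.0 - utilization                 # tolerable throughput degradation


def simulation_is_conclusive(margin, sim_accuracy):
    """Return True only if the simulation error is smaller than the margin."""
    return sim_accuracy < margin


print(f"utilization: {utilization:.1%}")
print(f"margin:      {margin:.1%}")
print("simulation alone conclusive:",
      simulation_is_conclusive(margin, SIM_ACCURACY))
```

Because the simulation uncertainty is larger than the available margin, a simulated pass cannot by itself confirm that the throughput target is met, which is why the measurement is repeated on hardware.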

The following figure shows a diagram of the DPD validation environment, which is similar to that in Arbitrary Resampling Filter Design (XAPP1373). The AI Engine and PL portions of the DPD design are packaged as kernels, as is the tester, which drives the input ports of the device under test (DUT) using a pre-stored stimulus and checks the output AXI buses against the reference test vector. Throughput and latency are measured by the PL tester and recorded in a set of registers accessible by the processor via the AXI4-Lite interface. At the end of the test, the results are summarized and output through a COM port.

Figure 1. DPD Design Validation Environment

The DPD tester consists of two kernels, tst_din and tst_dout. The tst_din kernel controls the test process via the on/off register and reports the amount of data sent via the TestIterationCounter register. One iteration is 122880 samples for each AXI bus. The tst_dout kernel collects the outputs and compares them with the golden reference generated by MATLAB. All mismatches are recorded by an error counter. Every output AXI bus has one such monitor; the monitors share the same register map, and the AXISel[1:0] field selects which one is accessed.

Figure 2. DPD Tester Kernel Register Map
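The monitor behavior described above can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the kernel source: the class names, the `num_outputs` and `error_count` fields, and the wrap-around of the golden reference are hypothetical. Only the mismatch counting and the two-bit AXISel selection come from the text.

```python
class OutputMonitor:
    """Behavioral model of one tst_dout monitor (hypothetical, not the RTL)."""

    def __init__(self, golden):
        self.golden = list(golden)  # golden reference for one iteration
        self.position = 0           # read pointer into the reference
        self.num_outputs = 0        # outputs received so far
        self.error_count = 0        # outputs that differed from the reference

    def push(self, sample):
        """Compare one output sample with the reference and update counters."""
        if sample != self.golden[self.position]:
            self.error_count += 1
        self.num_outputs += 1
        # ASSUMPTION: the reference repeats once per iteration.
        self.position = (self.position + 1) % len(self.golden)


class TesterRegisters:
    """One shared register view; AXISel picks the monitor being read."""

    def __init__(self, monitors):
        self.monitors = monitors
        self.axi_sel = 0

    def write_axi_sel(self, value):
        self.axi_sel = value & 0b11  # AXISel[1:0] is a two-bit field

    def read_status(self):
        mon = self.monitors[self.axi_sel]
        return {"num_outputs": mon.num_outputs, "error_count": mon.error_count}


golden = [10, 20, 30, 40]                      # toy reference vector
monitors = [OutputMonitor(golden) for _ in range(4)]
for sample in golden * 2:                      # bus 0: two clean iterations
    monitors[0].push(sample)
for sample in [10, 20, 99, 40]:                # bus 2: one corrupted sample
    monitors[2].push(sample)

regs = TesterRegisters(monitors)
regs.write_axi_sel(0)
status_bus0 = regs.read_status()
regs.write_axi_sel(2)
status_bus2 = regs.read_status()
print(status_bus0, status_bus2)
```

Sharing one register view across the monitors keeps the address map small: software writes the selector first and then reads the same status locations for whichever bus it is interested in.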

Debugging with waveform views in a software simulation environment is much easier than debugging directly on hardware, where visibility is limited. As noted in Versal Adaptive SoC System and Solution Planning Methodology Guide (UG1504), V++ supports PS + PL + AI Engine co-simulation and uses AMD Vivado™ XSIM as the GUI to display waveforms, on which the latencies of various signals can be measured. The PL kernels on the data path run under a 250 MHz clock with 128-bit data buses, and those on the control path, such as LUT configuration and register maps, run under a 100 MHz clock with a 32-bit data bus. The following figure shows that it takes 20.4 µs for all the LUTs to be initialized before DPD starts to accept input data. The output is available in 0.6 µs for the first test and 0.8 µs for the second test onward. During each test, LUT switching is performed at run time to make sure that the switching causes no data discontinuity.

Figure 3. PL + PS + AI Engine Co-simulation Waveform for Quad-phase DPD at 3932.16 MSPS
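The clock and bus figures above can be related to the 1 GSPS per-bus limit and to the measured times with a short calculation. The sketch below assumes 32 bits per sample (for example, a 16-bit I and a 16-bit Q component); the text does not state the sample width, so treat that value and the helper names as assumptions.

```python
DATA_CLK_MHZ = 250    # data-path PL kernel clock
DATA_BUS_BITS = 128   # data-path bus width
CTRL_CLK_MHZ = 100    # control-path clock
SAMPLE_BITS = 32      # ASSUMPTION: 16-bit I + 16-bit Q per complex sample


def peak_msps(clk_mhz, bus_bits, sample_bits):
    """Peak sample rate of one bus when every clock cycle carries data."""
    return clk_mhz * (bus_bits // sample_bits)


def us_to_cycles(time_us, clk_mhz):
    """Convert a time in microseconds to a whole number of clock cycles."""
    return round(time_us * clk_mhz)


bus_limit = peak_msps(DATA_CLK_MHZ, DATA_BUS_BITS, SAMPLE_BITS)
lut_init_cycles = us_to_cycles(20.4, CTRL_CLK_MHZ)      # LUT initialization
first_latency_cycles = us_to_cycles(0.6, DATA_CLK_MHZ)  # first test
later_latency_cycles = us_to_cycles(0.8, DATA_CLK_MHZ)  # second test onward

print("bus limit (MSPS):", bus_limit)
print("LUT init (control-clock cycles):", lut_init_cycles)
print("latency (data-clock cycles):", first_latency_cycles, later_latency_cycles)
```

Under the 32-bit assumption, four samples cross the bus per 250 MHz cycle, which reproduces the 1 GSPS theoretical limit quoted at the start of this section.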

After the design passes software verification, longer and more comprehensive tests are performed on the VCK190 evaluation board. By default, VCK190 boards come with VC1902-2MP devices. However, in the test platform, the part number is modified to VC1902-1LP, which is recommended for customers who prioritize power efficiency. The software running on the Arm® processor starts and stops the test ten times, beginning with 123456 iterations (12 billion input samples) in the first test and increasing by 98765 iterations (9.7 billion samples) in each subsequent test. At the end, a short summary like the following is output via the COM port.

------------------------------------------------------------------
--                       DPD TEST SUMMARY                       --
------------------------------------------------------------------
AXI-S  Latency(us)  Throughput   No.Outputs    Mismatch   Result
------------------------------------------------------------------
First Run
------------------------------------------------------------------
 0    0.640- 0.640  985.0Msps    3034546176           0   PASS
 1    0.660- 0.660  985.0Msps    3034546176           0   PASS
 2    0.660- 0.660  985.0Msps    3034546176           0   PASS
 3    0.676- 0.676  985.0Msps    3034546176           0   PASS
------------------------------------------------------------------
Afterwards
------------------------------------------------------------------
 0    0.868- 0.872  985.0Msps  136537128960           0   PASS
 1    0.880- 0.884  985.0Msps  136537128960           0   PASS
 2    0.900- 0.900  985.0Msps  136537128960           0   PASS
 3    0.912- 0.912  985.0Msps  136537128960           0   PASS
------------------------------------------------------------------

PASS!

The test results confirm that all output samples match the reference test vector stored in ROMs, and that the average throughput is higher than 983.04 MSPS on all AXI buses. The latencies measured on the hardware also agree closely with the co-simulation results. The processing delays of the four AXI buses differ by several clock cycles, which can easily be equalized with small FIFOs.
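The pass criteria and the FIFO equalization mentioned above can be sketched from the summary values. The rows below are copied from the report (using the upper end of each latency range); the helper names and the idea of expressing the required delay in 250 MHz data-clock cycles are illustrative assumptions, not part of the generated test software.

```python
CLK_MHZ = 250            # data-path clock
REQUIRED_MSPS = 983.04   # per-bus throughput target

# (bus, latency_us, throughput_msps, mismatches), copied from the summary.
first_run = [(0, 0.640, 985.0, 0), (1, 0.660, 985.0, 0),
             (2, 0.660, 985.0, 0), (3, 0.676, 985.0, 0)]
afterwards = [(0, 0.872, 985.0, 0), (1, 0.884, 985.0, 0),
              (2, 0.900, 985.0, 0), (3, 0.912, 985.0, 0)]


def all_pass(rows):
    """A bus passes if it meets the throughput target with zero mismatches."""
    return all(tput >= REQUIRED_MSPS and errs == 0 for _, _, tput, errs in rows)


def equalizing_delays(rows, clk_mhz):
    """Extra delay, in clock cycles, that aligns each bus to the slowest one."""
    worst = max(lat for _, lat, _, _ in rows)
    return {bus: round((worst - lat) * clk_mhz) for bus, lat, _, _ in rows}


first_delays = equalizing_delays(first_run, CLK_MHZ)
later_delays = equalizing_delays(afterwards, CLK_MHZ)

print("all buses pass:", all_pass(first_run) and all_pass(afterwards))
print("first run delays (cycles):", first_delays)
print("afterwards delays (cycles):", later_delays)
```

In both cases the largest skew is about ten data-clock cycles, so a FIFO of a few entries per bus would be enough to align the four outputs.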