In the DPD design, the throughput of each AXI bus is 983.04 MSPS, which is 98.3% of the theoretical limit of 1 GSPS. The accuracy of AI Engine simulation is around 5%, which is good enough for most designs but not for DPD, which can tolerate no more than 1.7% throughput degradation. The MATLAB scripts therefore automatically generate a hardware validation design to enable quick validation on an AMD VCK190 board.
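The margin argument above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted in the text; the variable and function names are illustrative and are not part of the reference design.

```python
# Throughput budget: required per-bus rate versus the AXI bus limit.
REQUIRED_MSPS = 983.04   # per-bus throughput stated in the text
LIMIT_MSPS = 1000.0      # theoretical limit of 1 GSPS
SIM_ACCURACY = 0.05      # approximate AI Engine simulation accuracy

utilization = REQUIRED_MSPS / LIMIT_MSPS
margin = 1.0 - utilization


def simulation_is_conclusive(margin: float, accuracy: float) -> bool:
    """Simulation can confirm the target only if its error is below the margin."""
    return accuracy < margin


print(f"utilization: {utilization:.1%}")
print(f"margin:      {margin:.1%}")
print("simulation conclusive:", simulation_is_conclusive(margin, SIM_ACCURACY))
```

Because the simulation uncertainty (about 5%) is larger than the available margin (about 1.7%), the sketch reports that simulation alone cannot confirm the throughput target, which is why hardware measurement is used.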
The following figure shows a diagram of the DPD validation environment, which is similar to that in Arbitrary Resampling Filter Design (XAPP1373). The AI Engine and PL portions of the DPD design are packaged as kernels, as is the tester, which drives the input ports of the device under test (DUT) using a pre-stored stimulus and checks the output AXI buses against the reference test vector. Throughput and latency are measured by the PL tester and recorded in a set of registers accessible by the processor via the AXI4-Lite interface. At the end of the test, the results are summarized and output through a COM port.
The DPD tester consists of two kernels, tst_din and tst_dout. The tst_din kernel controls the test process via the on/off register and reports the number of iterations sent via the TestIterationCounter register. One iteration is 122880 samples for each AXI bus. The tst_dout kernel collects the outputs and compares them with the golden reference generated by MATLAB. All mismatches are recorded by an error counter. Every output AXI bus has one such monitor. The monitors share the same register map, and one is selected at a time by the field AXISel[1:0].
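The checking scheme described above can be illustrated with a small behavioral model. This is a sketch only: the class and function names are hypothetical, the real monitors are PL kernels rather than software, and the model assumes the golden reference simply repeats every iteration.

```python
from dataclasses import dataclass


@dataclass
class OutputMonitor:
    """Behavioral model of one per-bus output checker (hypothetical)."""
    golden: list            # reference samples, assumed to repeat each iteration
    error_count: int = 0    # models the mismatch counter
    samples_seen: int = 0   # models the output sample counter

    def check(self, sample) -> None:
        # Compare the incoming sample with the expected reference sample.
        expected = self.golden[self.samples_seen % len(self.golden)]
        if sample != expected:
            self.error_count += 1
        self.samples_seen += 1


def read_error_count(monitors, axi_sel: int) -> int:
    """Select one monitor the way a 2-bit AXISel field would."""
    return monitors[axi_sel & 0b11].error_count


golden = [1, 2, 3, 4]
monitors = [OutputMonitor(golden) for _ in range(4)]

for s in [1, 2, 3, 4, 1, 2, 3, 4]:   # bus 0: two clean passes
    monitors[0].check(s)
for s in [1, 2, 9, 4]:               # bus 1: one corrupted sample
    monitors[1].check(s)

print([read_error_count(monitors, sel) for sel in range(4)])
```

Sharing one register map across four monitors keeps the address space small; the cost is that software must write the select field before each read.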
Debugging with waveform views in a software simulation environment is much easier than debugging directly on hardware, where visibility is limited. As noted in Versal Adaptive SoC System and Solution Planning Methodology Guide (UG1504), V++ supports PS + PL + AI Engine co-simulation and uses AMD Vivado™ XSIM as the GUI to display waveforms, on which the latencies of various signals can be measured. The PL kernels on the data path run under a 250 MHz clock with 128-bit data buses, and those on the control path, such as LUT configuration and register maps, run under a 100 MHz clock with a 32-bit data bus. The following figure shows that it takes 20.4 µs for all the LUTs to be initialized before DPD starts to accept input data. The output is available in 0.6 µs for the first test and 0.8 µs for the second test onward. During each test, LUT switching is performed at run time to make sure that the switching causes no data discontinuity.
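The clock and bus figures above relate directly to the 1 GSPS limit and to latencies expressed in clock cycles. The following is a rough sketch of that arithmetic; the 32-bit sample width (16-bit I plus 16-bit Q) is an assumption that is not stated in the text, and the helper name is illustrative.

```python
DATA_CLK_HZ = 250e6   # data-path PL clock from the text
CTRL_CLK_HZ = 100e6   # control-path PL clock from the text
BUS_BITS = 128        # data-path bus width from the text
SAMPLE_BITS = 32      # ASSUMPTION: 16-bit I + 16-bit Q per complex sample

samples_per_clock = BUS_BITS // SAMPLE_BITS
peak_msps = samples_per_clock * DATA_CLK_HZ / 1e6


def us_to_cycles(t_us: float, clk_hz: float) -> int:
    """Convert a time in microseconds to the nearest whole clock cycle."""
    return round(t_us * 1e-6 * clk_hz)


print("samples per clock:", samples_per_clock)
print("peak rate (MSPS): ", peak_msps)
print("first-test latency (data clocks):", us_to_cycles(0.6, DATA_CLK_HZ))
print("later-test latency (data clocks):", us_to_cycles(0.8, DATA_CLK_HZ))
print("LUT init time (control clocks):  ", us_to_cycles(20.4, CTRL_CLK_HZ))
```

Under the assumed sample width, four samples per clock at 250 MHz gives the 1 GSPS per-bus limit, and the quoted latencies correspond to roughly 150 and 200 data-path clock cycles.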
After the design passes software verification, longer and more comprehensive tests are performed on the VCK190 evaluation board. By default, VCK190 boards come with VC1902-2MP devices. However, in the test platform, the part number is changed to VC1902-1LP, which is recommended for customers who prioritize power efficiency. The software running on the Arm® processor starts and stops the test ten times, beginning with 123456 iterations (12 billion input samples) in the first test and adding 98765 iterations (9.7 billion samples) in each of the following tests. At the end, a short summary like the following is output through the COM port.
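The test lengths described above form a simple arithmetic progression. The sketch below lists the iteration count of each of the ten tests; the function name is illustrative, and the actual control software on the Arm processor is not shown in this document.

```python
FIRST_TEST_ITERATIONS = 123456   # length of the first test, from the text
INCREMENT = 98765                # additional iterations per following test
NUM_TESTS = 10                   # the test is started and stopped ten times


def iteration_schedule(first=FIRST_TEST_ITERATIONS,
                       step=INCREMENT,
                       count=NUM_TESTS):
    """Return the number of iterations run in each test."""
    return [first + k * step for k in range(count)]


schedule = iteration_schedule()
for n, iters in enumerate(schedule, start=1):
    print(f"test {n:2d}: {iters} iterations")
print("total iterations:", sum(schedule))
```

Using lengths that are not multiples of one another means the stop point lands at a different place in the stimulus each time, which exercises the start/stop path more thoroughly than repeating one fixed length.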
------------------------------------------------------------------
-- DPD TEST SUMMARY --
------------------------------------------------------------------
AXI-S Latency(us) Throughput No.Outputs Mismatch Result
------------------------------------------------------------------
First Run
------------------------------------------------------------------
0 0.640- 0.640 985.0Msps 3034546176 0 PASS
1 0.660- 0.660 985.0Msps 3034546176 0 PASS
2 0.660- 0.660 985.0Msps 3034546176 0 PASS
3 0.676- 0.676 985.0Msps 3034546176 0 PASS
------------------------------------------------------------------
Afterwards
------------------------------------------------------------------
0 0.868- 0.872 985.0Msps 136537128960 0 PASS
1 0.880- 0.884 985.0Msps 136537128960 0 PASS
2 0.900- 0.900 985.0Msps 136537128960 0 PASS
3 0.912- 0.912 985.0Msps 136537128960 0 PASS
------------------------------------------------------------------
PASS!
The test results confirm that all the output samples match the reference test vector stored in ROMs and that the average throughput is higher than 983.04 MSPS on all AXI buses. The latencies measured on the hardware also closely match the co-simulation results. The processing delays of the four AXI buses differ by several clock cycles, which can easily be equalized with small FIFOs.
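As an illustration of the equalization mentioned above, the following sketch converts the reported latencies into the number of extra delay cycles each bus would need to line up with the slowest one. It assumes the latencies are measured in the 250 MHz data-path clock domain and takes the upper latency bound for the later runs; the function name is hypothetical.

```python
DATA_CLK_MHZ = 250   # data-path clock; assumed to be the measurement clock

# Latencies in microseconds for AXI buses 0 to 3, taken from the test summary.
first_run_latency_us = [0.640, 0.660, 0.660, 0.676]
later_run_latency_us = [0.872, 0.884, 0.900, 0.912]   # upper bounds


def equalizing_delays(latencies_us, clk_mhz=DATA_CLK_MHZ):
    """Cycles of delay to add per bus so all match the slowest bus."""
    slowest = max(latencies_us)
    # round() absorbs floating-point error in the microsecond values.
    return [round((slowest - t) * clk_mhz) for t in latencies_us]


print("first run :", equalizing_delays(first_run_latency_us))
print("afterwards:", equalizing_delays(later_run_latency_us))
```

In both cases the largest correction is about ten clock cycles, so a FIFO only a few entries deep per bus would be enough under these assumptions.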