For DDR based Alveo card (such as U200 and U250), the single DDR bank provides approximately 19.2 GB/s bandwidth. For 25 Gbps x 4 lane Aurora IP, the unidirectional throughput is about 12.5 GB/s. So when implementing the 25 Gbps lane speed loopback example design, if we make both the strm_dump and strm_issue kernels access the same DDR bank, performance will be degraded by DDR bandwidth limitation. So to achieve the highest performance, we could connect the AXI masters of the two HLS kernels to different DDR bank. We can control the kernel slr and sp assignment in the v++ linking configuration file. As an example, for U200 case, add following lines at the last of krnl_aurora_test.cfg file:
slr=strm_issue_0:SLR1
slr=strm_dump_0:SLR1
slr=krnl_aurora_0:SLR2
sp=strm_issue_0.m_axi_gmem:DDR[1]
sp=strm_dump_0.m_axi_gmem:DDR[2]
Based on the four adjustments mentioned above, we implement a test design with U200 card, and observe about 11.5 GB/s throughput, like below running log:
$ ./host_krnl_aurora_test
------------------------ krnl_aurora loopback test ------------------------
Transfer size: 100 MB
Generate TX data block.
Program running in hardware mode
Load krnl_aurora_test_hw.xclbin
Create kernels
Create TX and RX device buffer
Transfer TX data into device buffer
Check whether startup status of Aurora kernel is ready...
Aurora kernel startup status is GOOD: 1000111111111
[12]channel_up [11]soft_err [10]hard_err [9]mmcm_not_locked_out [8]gt_pll_lock [7:4]line_up [3:0]gt_powergood
Begin data loopback transfer
Data loopback transfer finish
Transfer time = 8.723 ms
Fetch RX data from device buffer and verification
Data loopback transfer throughput = 11463.9 MB/s
Aurora Error Status:
SOFT_ERR: 0
HARD_ERR: 0
Data verification SUCCEED