This step uses the same topologies as the previous step, but with an addressing scheme that generates random addresses within the selected range.
As in the previous step, you can use the following Makefile target to run all the variations of the application. There is no need to rebuild the xclbins.
make all_hbm_rnd_run
The above target generates the output file <Project>/makefile/Run_RandomAddress.perf with the following data.
Addr Pattern Total Size(MB) Transaction Size(B) Throughput Achieved(GB/s)
Random 256 (M0->PC0) 64 4.75379
Random 256 (M0->PC0) 128 9.59893
Random 256 (M0->PC0) 256 12.6164
Random 256 (M0->PC0) 512 13.1338
Random 256 (M0->PC0) 1024 13.155
Random 512 (M0->PC0_1) 64 0.760776
Random 512 (M0->PC0_1) 128 1.49869
Random 512 (M0->PC0_1) 256 2.71119
Random 512 (M0->PC0_1) 512 4.4994
Random 512 (M0->PC0_1) 1024 6.54655
Random 1024 (M0->PC0_3) 64 0.553107
Random 1024 (M0->PC0_3) 128 1.07469
Random 1024 (M0->PC0_3) 256 1.99473
Random 1024 (M0->PC0_3) 512 3.49935
Random 1024 (M0->PC0_3) 1024 5.5307
The top 5 rows show point-to-point accesses, i.e. a single master accessing a 256 MB range (M0->PC0), with varying transaction sizes. The bandwidth drops compared to the corresponding rows in the previous step, where the same accesses with a sequential address pattern achieved around 13 GB/s. You can still achieve decent bandwidth with transaction sizes larger than 64 bytes: when a transaction accesses 128 bytes or more, only the first access is random and the remaining accesses within the transaction are sequential, so the memory is used more efficiently.
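To illustrate where these transactions come from, here is a minimal HLS-style sketch, not the tutorial's actual kernel code; the function and argument names (rnd_read, offsets, beats_per_chunk) are placeholders. Each outer iteration issues one transaction-sized read starting at a random 64-byte-aligned offset, so only the first beat of a transaction lands at a random address and the remaining beats are sequential:

```cpp
#include "ap_int.h"

// Hypothetical sketch: one M_AXI master with a 512-bit (64-byte) data width.
// Only the first beat of each transaction is a "random" access; the remaining
// beats are sequential, which is why larger transactions recover part of the
// bandwidth lost to random addressing.
extern "C" void rnd_read(const ap_uint<512>* in,    // data in HBM (single M_AXI)
                         const unsigned* offsets,   // pre-generated random chunk indices
                         unsigned num_chunks,
                         unsigned beats_per_chunk,  // transaction size / 64 bytes
                         ap_uint<512>* out) {
chunks:
    for (unsigned c = 0; c < num_chunks; ++c) {
        unsigned base = offsets[c] * beats_per_chunk;  // random start of the transaction
    beats:
        for (unsigned b = 0; b < beats_per_chunk; ++b) {
#pragma HLS PIPELINE II = 1
            out[c * beats_per_chunk + b] = in[base + b];  // sequential within the transaction
        }
    }
}
```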
When the master addresses 2 or 4 PCs to cover a larger range, the bandwidth drops significantly. In other words, a single M_AXI interface connected to one PC provides better bandwidth than the same interface connected to multiple PCs.
Let’s use the specific example of Row 13: the transaction size is 256 bytes over 1 GB of randomly accessed data, i.e. utilizing PC0-3, and the measured performance is ~2 GB/s. If this were a real design requirement, it would be advantageous to change the microarchitecture of the design to use 4 M_AXI interfaces, each accessing one individual PC exclusively. The kernel code would check the index/address it wants to access and then use exactly one of the pointer arguments (translating to one of the 4 M_AXI interfaces) for that access. The access range then becomes 256 MB per pointer/M_AXI, which brings us back to the case of one master accessing one PC, exactly the situation in Row 3. As a result, this would provide 12+ GB/s of bandwidth using 4 interfaces, even though only one of them is utilized at a time. You could try to improve the situation further by making 2 parallel accesses across those 4 M_AXI interfaces, but that requires the part of the design providing the indexes/addresses to supply 2 in parallel, which might be a challenge as well.
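A minimal HLS-style sketch of that restructuring is shown below, under the assumption that the four pointer arguments are mapped at link time to four separate M_AXI ports and pseudo-channels; the function and argument names are placeholders, not part of the tutorial sources. The kernel selects exactly one pointer per access based on the index, so every individual transaction stays within a single 256 MB pseudo-channel:

```cpp
#include "ap_int.h"

// Hypothetical sketch: 4 pointer arguments, each assumed to be connected to
// its own M_AXI port and its own HBM pseudo-channel via the linker
// connectivity settings. Each access uses exactly one of the four interfaces.
extern "C" void rnd_read_4m(const ap_uint<512>* in0,  // -> PC0
                            const ap_uint<512>* in1,  // -> PC1
                            const ap_uint<512>* in2,  // -> PC2
                            const ap_uint<512>* in3,  // -> PC3
                            const unsigned* idx,      // random 64-byte word indices over 1 GB
                            unsigned n,
                            ap_uint<512>* out) {
    // 256 MB per PC / 64-byte words = 4M words per pseudo-channel.
    const unsigned WORDS_PER_PC = (256u * 1024u * 1024u) / 64u;
    for (unsigned i = 0; i < n; ++i) {
#pragma HLS PIPELINE II = 1
        unsigned word = idx[i];
        unsigned pc  = word / WORDS_PER_PC;   // which pseudo-channel holds this word
        unsigned off = word % WORDS_PER_PC;   // offset within that pseudo-channel
        ap_uint<512> v;
        switch (pc) {                          // exclusive use of one M_AXI per access
            case 0:  v = in0[off]; break;
            case 1:  v = in1[off]; break;
            case 2:  v = in2[off]; break;
            default: v = in3[off]; break;
        }
        out[i] = v;
    }
}
```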