You are using the same topologies as the previous step, but with an addressing scheme that issues random addresses within the selected range.
To run all the variations, as in the previous step, you can use the following Makefile target to run the application. There is no need to rebuild the xclbins.
make all_hbm_rnd_run
The above target will generate the output file <Project>/makefile/Run_RandomAddress.perf with the following data.
Addr Pattern    Total Size (MB)       Transaction Size (B)    Throughput Achieved (GB/s)
Random          256  (M0->PC0)        64                      4.75379
Random          256  (M0->PC0)        128                     9.59893
Random          256  (M0->PC0)        256                     12.6164
Random          256  (M0->PC0)        512                     13.1338
Random          256  (M0->PC0)        1024                    13.155
Random          512  (M0->PC0_1)      64                      0.760776
Random          512  (M0->PC0_1)      128                     1.49869
Random          512  (M0->PC0_1)      256                     2.71119
Random          512  (M0->PC0_1)      512                     4.4994
Random          512  (M0->PC0_1)      1024                    6.54655
Random          1024 (M0->PC0_3)      64                      0.553107
Random          1024 (M0->PC0_3)      128                     1.07469
Random          1024 (M0->PC0_3)      256                     1.99473
Random          1024 (M0->PC0_3)      512                     3.49935
Random          1024 (M0->PC0_3)      1024                    5.5307
The top five rows show the point-to-point accesses, that is, 256 MB accesses with varying transaction sizes. The bandwidth drops compared to the top five rows of the previous step, where the sequential address pattern delivered around 13 GB/s. You can still achieve reasonable bandwidth for transaction sizes larger than 64 bytes, as the sketch below illustrates: when accessing 128 bytes or more, only the first access is random; the subsequent accesses within the transaction are sequential, so the memory is utilized more efficiently.
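The following small sketch illustrates that access pattern. It is not part of the tutorial sources; the constants and the function name are assumptions chosen only for illustration. Each transaction starts at a randomly chosen, transaction-aligned offset, and the bytes within the transaction are then read back-to-back in sequential 64-byte beats.

```cpp
// Illustrative sketch only (assumed constants and names, not from the tutorial).
// A random, transaction-aligned offset is drawn per transaction; the beats
// inside the transaction are then sequential, which is why larger transaction
// sizes recover part of the lost efficiency.
#include <cstddef>
#include <cstdint>
#include <random>

constexpr std::size_t kRangeBytes = 256UL * 1024 * 1024;  // 256 MB range (one PC)
constexpr std::size_t kTxnBytes   = 256;                  // transaction size
constexpr std::size_t kBeatBytes  = 64;                   // 512-bit AXI beat

uint64_t sum_random_transactions(const uint8_t* base, std::size_t num_txns) {
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<std::size_t> pick(0, kRangeBytes / kTxnBytes - 1);

    uint64_t sum = 0;
    for (std::size_t t = 0; t < num_txns; ++t) {
        std::size_t offset = pick(rng) * kTxnBytes;        // only this address is random
        for (std::size_t b = 0; b < kTxnBytes; b += kBeatBytes) {
            sum += base[offset + b];                       // remaining beats are sequential
        }
    }
    return sum;
}
```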
When the master addresses two or four PCs to access a larger range, the bandwidth drops significantly. It is therefore important to observe that a single M_AXI connected to one PC provides better bandwidth than the same M_AXI connected to multiple PCs.
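For reference, this M_AXI-to-PC connectivity is defined at link time through the v++ `connectivity.sp` option. The snippet below is only a hedged sketch: the kernel instance and argument names are assumptions, not taken from this tutorial; it simply contrasts mapping an M_AXI to a single pseudo-channel with spreading it across four.

```ini
[connectivity]
# One M_AXI mapped to a single HBM pseudo-channel (the higher-bandwidth case)
sp=krnl_hbm_1.in0:HBM[0]
# One M_AXI spread across four pseudo-channels to cover a 1 GB range
sp=krnl_hbm_1.in1:HBM[0:3]
```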
Take the specific example of Row 13: the transaction size is 256 bytes and 1 GB of data is accessed randomly, that is, utilizing PC0-3. You can see that the performance is about 2 GB/s. If this were a real design requirement, it would be advantageous to change the microarchitecture of the design to use four M_AXI interfaces, each accessing one individual PC exclusively. This means the kernel code would have to check the index/address it wants to access and then exclusively use one of the four pointer arguments (each translating to one of the four M_AXI interfaces) to make that memory access. As you may have already understood, the access range is then 256 MB per pointer/M_AXI, which essentially brings you back to the use case of one master accessing one PC, exactly the situation in Row 3. As a result, this approach would provide 12+ GB/s of bandwidth using four interfaces, with only one utilized at a time. You could try to improve the situation further by making two parallel accesses across those four M_AXI interfaces, but this means the part of the design providing the indexes/addresses would need to supply two in parallel, which might be a challenge as well.
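A minimal HLS-style sketch of this four-pointer microarchitecture follows. It is not taken from the tutorial sources: the kernel name, argument names, and bundle names are assumptions, the data type is narrower than what the tutorial kernels actually use, and only the interface-selection logic is shown.

```cpp
// Hypothetical sketch: four pointer arguments, each on its own M_AXI bundle
// (and, at link time, each mapped to its own 256 MB HBM pseudo-channel).
// The index decides which interface serves the access, so every access stays
// within one PC, recovering the single-master/single-PC bandwidth.
#include <cstdint>

// 256 MB per pseudo-channel, expressed in elements (uint64_t used for simplicity).
constexpr uint64_t kElemsPerPC = (256ULL * 1024 * 1024) / sizeof(uint64_t);

extern "C" void gather_4pc(const uint64_t* idx, int n, uint64_t* out,
                           const uint64_t* pc0, const uint64_t* pc1,
                           const uint64_t* pc2, const uint64_t* pc3) {
#pragma HLS INTERFACE m_axi port=idx bundle=gmem_idx
#pragma HLS INTERFACE m_axi port=out bundle=gmem_out
#pragma HLS INTERFACE m_axi port=pc0 bundle=gmem0
#pragma HLS INTERFACE m_axi port=pc1 bundle=gmem1
#pragma HLS INTERFACE m_axi port=pc2 bundle=gmem2
#pragma HLS INTERFACE m_axi port=pc3 bundle=gmem3
    for (int i = 0; i < n; ++i) {
        uint64_t global = idx[i];                     // index into the full 1 GB range
        int      pc     = int(global / kElemsPerPC);  // which pseudo-channel holds it
        uint64_t local  = global % kElemsPerPC;       // offset inside that 256 MB PC
        uint64_t v;
        switch (pc) {                                 // exclusively one M_AXI per access
            case 0:  v = pc0[local]; break;
            case 1:  v = pc1[local]; break;
            case 2:  v = pc2[local]; break;
            default: v = pc3[local]; break;
        }
        out[i] = v;
    }
}
```

At link time, each of the four bundles would then be assigned to its own pseudo-channel (for example with the `connectivity.sp` option sketched earlier), so that every pointer argument only ever addresses its own 256 MB range.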