Random Accesses - 2022.2 English

Vitis Tutorials: Hardware Acceleration (XD099)


This step uses the same topologies as the previous step, but the addressing scheme now issues random addresses within the selected range.

To run all the variations as in the previous step, you can use the following Makefile target. There is no need to rebuild the xclbins.

make all_hbm_rnd_run

The above target generates the output file <Project>/makefile/Run_RandomAddress.perf with the following data.

Addr Pattern   Total Size (MB)           Transaction Size (B)   Throughput Achieved (GB/s)

Random         256 (M0->PC0)             64                     4.75379
Random         256 (M0->PC0)             128                    9.59893
Random         256 (M0->PC0)             256                    12.6164
Random         256 (M0->PC0)             512                    13.1338
Random         256 (M0->PC0)             1024                   13.155

Random         512 (M0->PC0_1)           64                     0.760776
Random         512 (M0->PC0_1)           128                    1.49869
Random         512 (M0->PC0_1)           256                    2.71119
Random         512 (M0->PC0_1)           512                    4.4994
Random         512 (M0->PC0_1)           1024                   6.54655

Random         1024 (M0->PC0_3)          64                     0.553107
Random         1024 (M0->PC0_3)          128                    1.07469
Random         1024 (M0->PC0_3)          256                    1.99473
Random         1024 (M0->PC0_3)          512                    3.49935
Random         1024 (M0->PC0_3)          1024                   5.5307

The top 5 rows show the point-to-point accesses, i.e. accesses within a 256 MB range, with varying transaction size. The bandwidth drops compared to the top 5 rows of the previous step, where sequential accesses achieved about 13 GB/s.

Transaction sizes larger than 64 bytes still deliver good bandwidth. When a transaction is 128 bytes or more, only its first access is random; the remaining accesses in the transaction are sequential, so the memory is used more efficiently.
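The effect of transaction size can be illustrated with a small host-side sketch. It is not taken from the tutorial sources: the function names, the seed parameter, and the 64-byte beat (a 512-bit data path) are assumptions made for illustration only. Each transaction starts at a random, transaction-aligned address, and every following beat is sequential.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Assumption: one beat moves 64 bytes (512-bit data path).
constexpr std::uint64_t kBeatBytes = 64;

// Hypothetical helper: returns the byte address of every beat of one
// transaction. Only the start address is random; the rest follow in order.
std::vector<std::uint64_t> transaction_addresses(std::uint64_t range_bytes,
                                                 std::uint64_t txn_bytes,
                                                 std::uint64_t seed) {
    assert(txn_bytes >= kBeatBytes && txn_bytes % kBeatBytes == 0);
    assert(range_bytes >= txn_bytes && range_bytes % txn_bytes == 0);

    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::uint64_t> pick(0, range_bytes / txn_bytes - 1);
    const std::uint64_t start = pick(rng) * txn_bytes;  // random, aligned start

    std::vector<std::uint64_t> addrs;
    for (std::uint64_t off = 0; off < txn_bytes; off += kBeatBytes) {
        addrs.push_back(start + off);  // sequential beats after the first
    }
    return addrs;
}

// Fraction of beats in a transaction that land on a random address.
constexpr double random_fraction(std::uint64_t txn_bytes) {
    return static_cast<double>(kBeatBytes) / static_cast<double>(txn_bytes);
}
```

Under these assumptions, every beat of a 64-byte transaction is random, while only one beat in sixteen of a 1024-byte transaction is random, which is consistent with the trend in the table.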

When the master addresses 2 or 4 PCs to cover a larger range, the bandwidth drops significantly. The key observation is that a single M_AXI connected to 1 PC provides better bandwidth than the same M_AXI connected to multiple PCs.

Consider Row 13 as a specific example: the transaction size is 256 bytes and 1 GB of data is accessed randomly, i.e. using PC0-3. The performance is ~2 GB/s. If this were a real design requirement, it would be advantageous to change the microarchitecture of the design to use 4 M_AXI interfaces, each accessing one PC exclusively. The kernel code would then have to check the index/address it wants to access and use exactly one of the pointer arguments (each translating to one of the 4 M_AXI interfaces) to make the memory access.

The access range is now 256 MB per pointer/M_AXI, so we fall back to the case of one master accessing one PC, which is exactly the situation in Row 3. As a result, this would provide 12+ GB/s of bandwidth using 4 interfaces, with only one in use at a time. You could try to improve the situation further by making 2 parallel accesses through those 4 M_AXI interfaces, but then the part of the design that generates the indexes/addresses needs to provide 2 of them in parallel, which might be a challenge as well.
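A minimal sketch of this dispatch, written as plain C++ rather than as the tutorial's actual kernel, might look as follows. The `Word` type, the function names, and the `words_per_pc` parameter are hypothetical. In a Vitis HLS kernel, each pointer would normally be given its own `m_axi` bundle with `#pragma HLS INTERFACE` and be connected to a single PC at link time; those details are omitted here so that the sketch compiles with any host compiler.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: each pseudo channel (PC) covers 256 MB, as in the table above.
constexpr std::uint64_t kPcBytes = 256ull * 1024 * 1024;
constexpr unsigned kNumPcs = 4;

// Hypothetical 64-byte data word.
struct Word {
    std::uint8_t bytes[64];
};
constexpr std::uint64_t kWordsPerPc = kPcBytes / sizeof(Word);

// Which of the four PCs (and therefore which pointer) serves a global index.
inline unsigned pc_of(std::uint64_t index, std::uint64_t words_per_pc = kWordsPerPc) {
    assert(index < kNumPcs * words_per_pc);
    return static_cast<unsigned>(index / words_per_pc);
}

// Offset of the word inside the selected PC.
inline std::uint64_t offset_in_pc(std::uint64_t index,
                                  std::uint64_t words_per_pc = kWordsPerPc) {
    return index % words_per_pc;
}

// Reads one word through exactly one of the four pointers, so each
// pointer (one M_AXI each) only ever touches its own 256 MB range.
inline Word read_word(const Word* pc0, const Word* pc1,
                      const Word* pc2, const Word* pc3,
                      std::uint64_t index,
                      std::uint64_t words_per_pc = kWordsPerPc) {
    const std::uint64_t off = offset_in_pc(index, words_per_pc);
    switch (pc_of(index, words_per_pc)) {
        case 0:  return pc0[off];
        case 1:  return pc1[off];
        case 2:  return pc2[off];
        default: return pc3[off];
    }
}
```

The `words_per_pc` parameter exists only so the logic can be exercised with small buffers on a host; in hardware it would be fixed by the PC size. Because only one pointer is selected per access, the four interfaces are never busy at the same time, which matches the "only one utilized at a time" behavior described above.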