Designers implementing a Vitis kernel face various trade-offs when working with the device memories (PLRAM, HBM, and DDR) available on FPGA cards. The following is a checklist of best practices for designing AXI4 memory-mapped interfaces for your application.
With throughput as the chief optimization goal, accelerating the compute part of your application through macro- and micro-architecture optimizations is the first step, but the time taken to transfer data to/from the kernel can also influence the application architecture with respect to throughput goals. Because data transfer carries a high overhead, it is important to overlap the computation with the communication (data movement) present in your application.
For your given application:
- Decompose the kernel algorithm by building a pipeline of
producer-consumer tasks, modeled using a Load, Compute, Store (LCS) coding
pattern
- All external I/O accesses must be in the Load and Store tasks.
- There should be multiple Load or Store tasks if the kernel needs to read or write from different ports in parallel.
- The Compute task(s) should only have scalar, array, stream, or stream-of-blocks arguments.
- Ensure that all three tasks (specified as functions) can be executed in an overlapped fashion (this enables the compiler to apply task-level parallelism).
- Compute tasks can be further split up into smaller compute tasks which may contain further optimizations such as pipelining. The same rules as LCS apply for these smaller compute functions as well.
- Always use local memory to pass data to/from the Compute task (a minimal LCS sketch follows this list).
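To make the pattern concrete, here is a minimal sketch of the LCS structure in Vitis HLS C++. The kernel name `lcs_kernel`, the fixed `SIZE`, and the pass-through computation are hypothetical placeholders; only the Load and Store tasks touch the AXI pointers, and `#pragma HLS dataflow` lets the three tasks execute in an overlapped fashion.

```cpp
#include "ap_int.h"
#include "hls_stream.h"

#define SIZE 1024 // hypothetical transfer size (in 512-bit words)

// Load: the only task reading from global memory (sequential accesses)
static void load(const ap_uint<512>* in, hls::stream<ap_uint<512>>& s) {
    for (int i = 0; i < SIZE; i++)
        s.write(in[i]);
}

// Compute: stream-only arguments, no external I/O
static void compute(hls::stream<ap_uint<512>>& in_s,
                    hls::stream<ap_uint<512>>& out_s) {
    for (int i = 0; i < SIZE; i++)
        out_s.write(in_s.read() + 1); // placeholder computation
}

// Store: the only task writing to global memory (sequential accesses)
static void store(hls::stream<ap_uint<512>>& s, ap_uint<512>* out) {
    for (int i = 0; i < SIZE; i++)
        out[i] = s.read();
}

extern "C" void lcs_kernel(const ap_uint<512>* in, ap_uint<512>* out) {
#pragma HLS INTERFACE m_axi port=in bundle=gmem0
#pragma HLS INTERFACE m_axi port=out bundle=gmem1
#pragma HLS dataflow
    hls::stream<ap_uint<512>> s_in("s_in"), s_out("s_out");
    load(in, s_in);
    compute(s_in, s_out);
    store(s_out, out);
}
```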
- Load and Store blocks are responsible for moving data between
global memory and the Compute blocks as efficiently as possible.
- On one end, they must read or write data through the streaming interface according to the (temporal) sequential order mandated by the Compute task inside the kernel
- On the other end, they must read or write data through the memory-mapped interface according to the (spatial) arrangement order set by the software application
- Changing your mindset about data accesses is key to building a
proper HW design with HLS
- In SW, it is common to think about how the data is “accessed” (the algorithm pulls the data it needs).
- In HW, it is more efficient to think of how data “flows” through the algorithm (the data is pushed to the algorithm)
- In SW, you reason about array indices and “where” data is accessed
- In HW, you reason about streams and “when” data is accessed (the contrast is sketched below)
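The following minimal sketch contrasts the two mindsets; the function names and the doubling operation are hypothetical. The indexed version pulls data from arbitrary addresses, while the streaming version consumes each element exactly once, in arrival order.

```cpp
#include "hls_stream.h"

// SW mindset: pull data by index ("where" the data is)
void sw_style(const int* in, const int* idx, int* out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[idx[i]]; // indirect, random access: bursts cannot be inferred
}

// HW mindset: data flows past the algorithm ("when" the data arrives)
void hw_style(hls::stream<int>& in_s, hls::stream<int>& out_s, int n) {
    for (int i = 0; i < n; i++) {
#pragma HLS pipeline II=1
        out_s.write(in_s.read() * 2); // each element is consumed once, in order
    }
}
```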
- Global memories (DRAM, HBM) have long access times, and their
bandwidth is limited (especially DRAM). To reduce the overhead of accessing global memory,
the interface functions need to
- Access sufficiently large contiguous blocks of data (to benefit from bursting)
- Accessing data sequentially leads to larger bursts (and higher data throughput efficiency) as compared to accessing random and/or out-of-order data (where burst analysis will fail)
- Avoid redundant accesses (to preserve bandwidth)
- Since random accesses into DRAM are very expensive, prefer burst accesses even when that means tolerating some access redundancy (see the burst-copy sketch below)
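As an illustration, here is a sketch of a copy kernel whose contiguous, in-order loop lets the tool infer long bursts; the kernel name and sizes are hypothetical.

```cpp
#include "ap_int.h"

#define N 4096 // hypothetical number of 512-bit words

extern "C" void copy_kernel(const ap_uint<512>* in, ap_uint<512>* out) {
#pragma HLS INTERFACE m_axi port=in bundle=gmem0
#pragma HLS INTERFACE m_axi port=out bundle=gmem1

    // Contiguous, sequential accesses: the tool can coalesce the loop
    // into long bursts.
burst_copy:
    for (int i = 0; i < N; i++) {
#pragma HLS pipeline II=1
        out[i] = in[i];
    }
    // By contrast, a strided access such as in[(i * 17) % N] would defeat
    // burst analysis and issue individual requests.
}
```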
- In many cases, the sequential order of data in and out of the
Compute tasks is different from the arrangement order of data in global memory.
- In this situation, optimizing the interface functions
requires creating internal caching structures
that gather enough data and organize it appropriately to minimize the
overhead of global memory accesses while being able to satisfy the
sequential order expected by the streaming interface
- Example: 2D Convolution
- In order to simplify the data movement logic, the developer can also consider different ways of storing the data in memory. For instance, accessing data in DRAM in a column-major fashion can be very inefficient. Rather than implementing a dedicated data-mover in the kernel, it may be better to transpose the data in SW and store it in row-major order instead, which will greatly simplify the HW access patterns.
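For the 2D convolution example, the internal caching structure is typically a line buffer. The sketch below, with a hypothetical image width `W` and a 3x3 kernel, reads each pixel from the stream exactly once while presenting a full window to the compute stage; border handling is omitted for brevity.

```cpp
#include "hls_stream.h"

#define W 640 // hypothetical image width
#define K 3   // kernel size

void conv_3x3(hls::stream<unsigned char>& in_s,
              hls::stream<unsigned char>& out_s,
              const char coeff[K][K], int height) {
    // Line buffer: caches the two previous rows so every pixel is read
    // from the stream only once, in sequential (raster) order.
    unsigned char lines[K - 1][W];
#pragma HLS array_partition variable=lines complete dim=1
    unsigned char win[K][K];
#pragma HLS array_partition variable=win complete dim=0

    for (int r = 0; r < height; r++) {
        for (int c = 0; c < W; c++) {
#pragma HLS pipeline II=1
            unsigned char px = in_s.read();
            // Shift the window left, then refill its right column from
            // the line buffer plus the incoming pixel.
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K - 1; j++)
                    win[i][j] = win[i][j + 1];
            win[0][K - 1] = lines[0][c];
            win[1][K - 1] = lines[1][c];
            win[2][K - 1] = px;
            // Rotate the line buffer.
            lines[0][c] = lines[1][c];
            lines[1][c] = px;
            int acc = 0;
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++)
                    acc += coeff[i][j] * win[i][j];
            out_s.write((unsigned char)acc); // border rows/columns not handled
        }
    }
}
```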
- Maximize the port width of the interface, i.e., the bit-width of
each AXI port by setting it to 512 bits (64 bytes).
- Use hls::vector or ap_(u)int<512> as the data type of the port to infer maximal burst lengths. Usage of structs in the interface may result in poor burst performance.
- Accessing the global memory is expensive and so accessing larger word sizes is more efficient.
- Imagine the interface ports to be like pipes feeding data to your kernel. The wider the pipe, the more data can be accessed, processed, and sent back (a wide-port sketch follows).
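Here is a sketch of a 512-bit-wide port using hls::vector; the kernel name and the element-wise addition are hypothetical.

```cpp
#include "hls_vector.h"

// 16 x 32-bit elements = 512 bits, matching the maximal AXI data width
using vec_t = hls::vector<int, 16>;

extern "C" void wide_kernel(const vec_t* in, vec_t* out, int n_vec) {
#pragma HLS INTERFACE m_axi port=in bundle=gmem0
#pragma HLS INTERFACE m_axi port=out bundle=gmem1
    for (int i = 0; i < n_vec; i++) {
#pragma HLS pipeline II=1
        out[i] = in[i] + in[i]; // hls::vector supports element-wise operators
    }
}
```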
- Transfer large blocks of data from the global device
memory. One large transfer is more efficient than several smaller
transfers. The bandwidth is limited by the PCIe performance. Run the
DMA test to measure the effective maximum PCIe transfer throughput. It is usually in the range of 10-12
GB/s in each direction (read and write).
- Memory resources include PLRAM (small size but fast access with the lowest latency), HBM (moderate size and access speed with some latency), and DRAM (large size but slow access with high latency).
- Given the asynchronous nature of reads, distributed RAMs are ideal for fast buffers. You can use the read value immediately, rather than waiting for the next clock cycle. You can also use distributed RAM to create small ROMs. However, distributed RAM is not suited for large memories, and you’ll get better performance (and lower power consumption) for memories larger than about 128 bits using block RAM or UltraRAM (a storage-binding sketch follows).
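In Vitis HLS you can steer this choice explicitly with the BIND_STORAGE pragma. A minimal sketch, with hypothetical buffer sizes and a trivial computation:

```cpp
// Bind small, latency-critical buffers to distributed (LUT) RAM and
// large buffers to block RAM or UltraRAM.
void storage_example(const int* in, int* out) {
    int small_lut[64];
#pragma HLS bind_storage variable=small_lut type=ram_2p impl=lutram
    int big_buf[65536];
#pragma HLS bind_storage variable=big_buf type=ram_2p impl=uram

    for (int i = 0; i < 64; i++)
        small_lut[i] = in[i];
    for (int i = 0; i < 65536; i++)
        big_buf[i] = small_lut[i % 64];
    int acc = 0;
    for (int i = 0; i < 64; i++)
        acc += big_buf[i];
    *out = acc;
}
```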
- Decide on the optimal number of concurrent ports, i.e., the
number of concurrent AXI (memory-mapped) ports
- If the Load task needs to get multiple input data sets to feed to the Compute task, it can choose to use multiple interface ports to access this data in parallel.
- However, the data needs to be stored in different memory banks, or the accesses will be sequentialized. There is a maximum of 4 DDR banks on current FPGA cards, while there are 32 HBM channels.
- When multiple processes access the same memory port or memory bank, an arbiter will sequentialize these concurrent accesses (a multi-port sketch follows).
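A sketch of a kernel with independent AXI ports, one per argument; the kernel name is hypothetical, and the comments show how such ports could be mapped to separate banks at link time with the v++ connectivity options.

```cpp
extern "C" void dual_read(const int* a, const int* b, int* out, int n) {
// Separate bundles create independent AXI ports that can issue
// requests concurrently.
#pragma HLS INTERFACE m_axi port=a bundle=gmem0
#pragma HLS INTERFACE m_axi port=b bundle=gmem1
#pragma HLS INTERFACE m_axi port=out bundle=gmem2
    // At link time, map each port to a distinct bank so the transfers
    // are truly parallel, e.g. in a v++ configuration file:
    //   [connectivity]
    //   sp=dual_read_1.a:DDR[0]
    //   sp=dual_read_1.b:DDR[1]
    //   sp=dual_read_1.out:DDR[2]
    for (int i = 0; i < n; i++) {
#pragma HLS pipeline II=1
        out[i] = a[i] + b[i];
    }
}
```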
- Set the right burst length, i.e., the maximum burst access
length (in terms of the number of elements) for each AXI port.
- Set the burst length to match the maximum transfer of 4 KB. For example, using an AXI data width of 512 bits (64 bytes), the burst length should be set to 64.
- Transferring data in bursts hides the memory access latency and improves bandwidth usage and efficiency of the memory controller
- Write application code in such a way that the tool can infer maximal-length bursts for both reads and writes to/from global memory (see the sketch below)
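The burst length can be set directly on the INTERFACE pragma; a minimal sketch with a hypothetical kernel name:

```cpp
#include "ap_int.h"

extern "C" void burst_kernel(const ap_uint<512>* in, ap_uint<512>* out, int n) {
// 64 beats x 64 bytes = 4 KB, the largest transfer a single AXI burst may cover
#pragma HLS INTERFACE m_axi port=in bundle=gmem0 max_read_burst_length=64
#pragma HLS INTERFACE m_axi port=out bundle=gmem1 max_write_burst_length=64
    for (int i = 0; i < n; i++) {
#pragma HLS pipeline II=1
        out[i] = in[i];
    }
}
```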
- Set the number of outstanding memory requests that an AXI
port can sustain before stalling.
- Setting a reasonable number of outstanding requests allows the system to submit multiple memory requests before stalling; this pipelining of requests lets the system hide some of the memory latency at the cost of additional BRAM/URAM resources (see the sketch below).
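The request depth is likewise controlled on the INTERFACE pragma; the kernel name and the depth of 32 below are hypothetical choices:

```cpp
#include "ap_int.h"

extern "C" void deep_kernel(const ap_uint<512>* in, ap_uint<512>* out, int n) {
// Allow up to 32 read/write requests in flight before the port stalls;
// the buffering for the extra depth costs BRAM/URAM.
#pragma HLS INTERFACE m_axi port=in bundle=gmem0 num_read_outstanding=32 max_read_burst_length=64
#pragma HLS INTERFACE m_axi port=out bundle=gmem1 num_write_outstanding=32 max_write_burst_length=64
    for (int i = 0; i < n; i++) {
#pragma HLS pipeline II=1
        out[i] = in[i];
    }
}
```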