Description
In FPGA designs, non-burst accesses to off-chip DDR memory are expensive and can degrade performance. To mitigate this, Vitis HLS supports a cache mechanism in the M_AXI adapter that reduces average access latency by exploiting locality of reference: a location that was just accessed is likely to be accessed again soon (temporal locality), and nearby locations are likely to be accessed as well (spatial locality).
The cache mechanism supports both a single-port cache and a multi-port hierarchy:
- Single-port cache: Direct-mapped, read-only cache attached to a single m_axi port. This reduces repeated DRAM reads when the kernel revisits the same line.
- Multi-port cache: A hierarchical cache with per-port L1 caches and a shared L2 cache. It enables up to N read accesses to the same top-level m_axi pointer in one clock cycle without changing source code, targeting stencil-like and other dynamic access patterns where the compiler cannot infer a window buffer. Compared to a compiler-inferred window buffer, the multi-port cache is more flexible (handles dynamic patterns) but generally uses more resources.
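To make the direct-mapped behavior concrete, the following is a minimal software sketch of a read-only, direct-mapped cache. It is an illustration only, not the hardware that Vitis HLS generates; the class name DirectMappedCache and the helper count_refills are hypothetical names introduced here. It assumes a line is selected by dividing the word address by the line depth and taking the result modulo the number of lines.

```cpp
#include <cstddef>
#include <vector>

// Software model of a direct-mapped, read-only cache (illustration only;
// this is an assumed model, not the RTL that Vitis HLS generates).
class DirectMappedCache {
public:
    // lines: number of cache lines; depth: words per line.
    DirectMappedCache(std::size_t lines, std::size_t depth)
        : lines_(lines), depth_(depth), tags_(lines, 0), valid_(lines, false) {}

    // Records a read of word address `addr`; returns true on a hit.
    bool read(std::size_t addr) {
        const std::size_t line_addr = addr / depth_;   // which memory line
        const std::size_t index = line_addr % lines_;  // which cache slot
        if (valid_[index] && tags_[index] == line_addr) {
            ++hits_;
            return true;
        }
        // Miss: model one burst that refills the whole line.
        valid_[index] = true;
        tags_[index] = line_addr;
        ++misses_;
        return false;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::size_t lines_, depth_;
    std::vector<std::size_t> tags_;
    std::vector<bool> valid_;
    std::size_t hits_ = 0, misses_ = 0;
};

// Replays the access pattern out[i] = in[i] + in[i + 1] and returns the
// number of line refills (misses) seen by the model.
std::size_t count_refills(std::size_t size, std::size_t lines, std::size_t depth) {
    DirectMappedCache cache(lines, depth);
    for (std::size_t i = 0; i < size; ++i) {
        cache.read(i);
        cache.read(i + 1);
    }
    return cache.misses();
}
```

Under this model, 1024 iterations with lines=8 and depth=128 issue 2048 reads but only 9 line refills, because words 0 through 1024 span 9 lines of 128 words each.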
Multi-port Cache Architecture
- L2 cache (shared):
  - One shared, direct-mapped cache across all ports.
  - Loaded from off-chip DRAM via burst transfers.
  - Lines are broadcast to L1 caches that request them, optimizing memory bandwidth by avoiding redundant DRAM loads.
- L1 caches (per-port):
  - One direct-mapped L1 cache per cache port.
  - Each L1 can serve one data item per cycle, allowing up to N read accesses per cycle to the same m_axi pointer.
  - L1 lines are loaded from the shared L2 cache.
Related configuration option: syn.interface.m_axi_cache_impl.
Syntax
syn.directive.cache=<location> port=<name> lines=<value> depth=<value> [ports=<N>] [l2_lines=<M>]
Where:
- <location> - Specifies the function where the specified ports can be found. This is the top function.
- port=<name> - Specifies the read-only port (top-level m_axi pointer) to add cache to.
- lines=<value> - Number of cache lines in each L1 cache (in the single-port case, the one cache). Specify 1 for a single line, or a power of two greater than 1 for multiple lines. Optional; defaults to 1 if not specified.
- depth=<value> - Size of each cache line in words; must be a power of two. A word is one element of the pointer's type (for example, int or float). Optional; defaults to the max burst length. The max burst length defaults to 16, but can be set globally with syn.interface.m_axi_max_read_burst_length or per interface with syn.directive.interface.
  Note: In the multi-port cache, the L1 and L2 caches share the same line depth; depth applies to both levels.
- ports=<N> - Enables the multi-port cache feature. Defines how many read accesses to the corresponding top m_axi pointer can be scheduled in one clock cycle. Applies only to a read-only pointer mapped to its m_axi bundle; only one read-only array can be mapped to that bundle when ports>1 (see Limitations).
- l2_lines=<M> - Explicitly sets the number of lines in the shared L2 cache. Optional; by default, the number of lines in both L1 and L2 caches is chosen by the depth option. The size of each L2 line equals the L1 line size and is controlled by depth. Must be greater than lines: there must be more L2 cache lines than L1 cache lines.
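As an illustration of the options above, a multi-port configuration could be written in a configuration file as syn.directive.cache=stencil3 port=in lines=8 depth=128 ports=3 l2_lines=16. The sketch below shows a matching kernel. The function name stencil3, the bundle names, and the sizes are hypothetical, and it is an assumption that the pragma form accepts the same ports and l2_lines options as the directive syntax.

```cpp
// Hypothetical 3-point stencil kernel (illustrative names and sizes).
// ports=3 requests up to three reads of `in` per clock cycle, and
// l2_lines (16) is greater than lines (8), as the syntax rules require.
// Assumption: the pragma accepts the same ports/l2_lines options as the
// syn.directive.cache syntax.
extern "C" void stencil3(const float *in, float *out, int size) {
// `in` is the only read port on its bundle; `out` uses a separate bundle.
#pragma HLS INTERFACE m_axi port=in bundle=gmem_rd depth=1026
#pragma HLS INTERFACE m_axi port=out bundle=gmem_wr depth=1024
#pragma HLS cache port=in lines=8 depth=128 ports=3 l2_lines=16
    // The caller must provide size + 2 readable elements in `in`.
    for (int i = 0; i < size; i++) {
        out[i] = in[i] + in[i + 1] + in[i + 2];
    }
}
```

A standard C++ compiler ignores the HLS pragmas, so the same function can be exercised in a host-side C simulation before synthesis.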
Behavior and Performance Notes
- The multi-port cache is designed for stencil-like and dynamic access patterns where window buffers are not inferred. It can improve throughput by allowing multiple read requests per cycle while reducing redundant DRAM bursts through the shared L2.
- Both L1 and L2 caches are direct-mapped. Hit/miss behavior is governed by locality; higher lines and appropriate depth can improve hit rate but increase resources.
- The shared L2 cache loads a DRAM line one time per miss and broadcasts the line to any L1 caches that need it, optimizing bandwidth similar to a window buffer but with support for dynamic patterns.
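The effect of the shared L2 can be sketched with a small software model. This is an assumed, simplified model and not the generated hardware: the names Level, MultiPortCache, and stencil_dram_loads are hypothetical, the line counts (8 L1 lines, 16 L2 lines, depth 128) are example values, and the broadcast is approximated by letting each L1 refill from the L2 when it misses.

```cpp
#include <cstddef>
#include <vector>

// One direct-mapped level: tracks which memory line each slot holds.
struct Level {
    std::vector<long> tag;  // -1 means the slot is empty
    explicit Level(std::size_t lines) : tag(lines, -1) {}
    bool holds(long line) const { return tag[line % tag.size()] == line; }
    void fill(long line) { tag[line % tag.size()] = line; }
};

// Illustrative model of N per-port L1 caches in front of one shared L2.
struct MultiPortCache {
    std::vector<Level> l1;
    Level l2;
    std::size_t depth;
    std::size_t dram_loads = 0;

    MultiPortCache(std::size_t ports, std::size_t lines,
                   std::size_t l2_lines, std::size_t depth_words)
        : l1(ports, Level(lines)), l2(l2_lines), depth(depth_words) {}

    void read(std::size_t port, std::size_t addr) {
        const long line = static_cast<long>(addr / depth);
        if (l1[port].holds(line)) return;  // L1 hit
        if (!l2.holds(line)) {             // L2 miss: one DRAM burst
            l2.fill(line);
            ++dram_loads;
        }
        l1[port].fill(line);               // L1 refilled from the shared L2
    }
};

// Replays a stencil in which port p reads in[i + p] on iteration i and
// returns how many lines were fetched from DRAM in total.
std::size_t stencil_dram_loads(std::size_t size, std::size_t ports) {
    MultiPortCache cache(ports, 8, 16, 128);
    for (std::size_t i = 0; i < size; ++i) {
        for (std::size_t p = 0; p < ports; ++p) {
            cache.read(p, i + p);
        }
    }
    return cache.dram_loads;
}
```

In this model, a three-port stencil over 1024 iterations touches words 0 through 1025, which span 9 lines, and each line is fetched from DRAM once even though all three L1 caches end up holding it.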
Limitations
- Cache is only supported for read-only ports. Write ports can share a bundle with a cached read-only pointer because they use independent channels, but inout ports are not supported.
- In multi-port mode (ports>1):
  - The cached top-level m_axi pointer must be the only read port mapped to its bundle.
  - A write port can also be mapped to the same bundle, but not an inout port.
- The cache is direct-mapped (single way) at both L1 and L2 levels.
- l2_lines must be greater than lines when specified.
- Only one read-only array can be mapped to the m_axi bundle of a multi-port cached pointer.
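The numeric rules stated in this document (lines is 1 or a power of two, depth is a power of two, l2_lines is greater than lines) can be checked before synthesis with a small helper. This is a hypothetical utility written for illustration, not part of the tool; the function name check_cache_options and the convention that l2_lines == 0 means "not specified" are assumptions.

```cpp
#include <string>

// True when v is a power of two (1, 2, 4, ...).
static bool is_pow2(unsigned v) { return v != 0 && (v & (v - 1)) == 0; }

// Returns an empty string when the options satisfy the rules listed in this
// document, otherwise a short description of the first violated rule.
// Convention assumed here: l2_lines == 0 means "l2_lines not specified".
std::string check_cache_options(unsigned lines, unsigned depth, unsigned l2_lines) {
    if (!is_pow2(lines)) return "lines must be 1 or a power of two";
    if (!is_pow2(depth)) return "depth must be a power of two";
    if (l2_lines != 0 && l2_lines <= lines) return "l2_lines must be greater than lines";
    return "";
}
```

For example, check_cache_options(8, 128, 16) returns an empty string, while check_cache_options(8, 128, 8) reports that l2_lines must be greater than lines.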
Additional Controls
- syn.interface.m_axi_cache_impl can be used to control cache implementation resources.
- syn.interface.m_axi_max_read_burst_length and syn.directive.interface can control the default depth via the max burst length.
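As a sketch of how the default depth follows the max burst length, the configuration fragment below raises the global read burst length and omits depth on the cache directive. The function name dut and port name in are taken from the example in this document; the value 64 is an arbitrary power of two chosen for illustration.

```
syn.interface.m_axi_max_read_burst_length=64
syn.directive.cache=dut port=in lines=8
```

With depth omitted, each of the 8 cache lines would default to 64 words, according to the depth rule described under Syntax.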
Example
The following example shows a design where overlapping accesses (in[i] and in[i + 1] on consecutive iterations) cause burst inference to fail. Using the CACHE pragma or directive improves the performance of the design.
extern "C" {
void dut(
    const double *in, // Read-only input vector
    double *out,      // Output result
    int size          // Number of elements to process
)
{
#pragma HLS INTERFACE m_axi port=in bundle=aximm depth=1026
#pragma HLS INTERFACE m_axi port=out bundle=aximm depth=1024
#pragma HLS cache port=in lines=8 depth=128
    // in[i + 1] is read again as in[i] on the next iteration.
    for (int i = 0; i < size; i++)
    {
        out[i] = in[i] + in[i + 1];
    }
}
}