Performance Considerations When Implementing RAM

Versal ACAP Hardware, IP, and Platform Development Methodology Guide (UG1387)

Document ID: UG1387
Release Date: 2022-05-25
Version: 2022.1 English

To efficiently infer memory elements, consider these factors affecting performance:

Using Dedicated Blocks or Distributed RAMs
RAMs can be implemented either in dedicated block RAM or in LUTs as distributed RAM. The choice affects not only which resources are used but can also significantly affect performance and power.

In general, the required depth of the RAM is the first criterion. Memory arrays up to 64 words deep are generally implemented in LUTRAMs: depths of 32 words or fewer are mapped two bits per LUT, and depths from 33 to 64 words are mapped one bit per LUT. Deeper RAMs can also be implemented in LUTRAM, depending on available resources and synthesis tool assignment.

Memory arrays deeper than 256 words are generally implemented in block RAM. Xilinx devices can map such structures in different width and depth combinations. Familiarize yourself with these configurations to understand how many block RAMs a larger memory array declaration in the code uses, and how they are structured.
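
The depth guidelines above can be turned into a quick back-of-the-envelope estimate. The following Python sketch is illustrative only: the function names `lutram_luts` and `likely_resource` are hypothetical, depth is counted in addressable locations, and the thresholds are the rules of thumb stated in this section. The actual mapping is decided by the synthesis tool and can differ.

```python
import math


def lutram_luts(depth: int, width: int) -> int:
    """Rough LUT count for a distributed RAM of `depth` locations x `width` bits.

    Assumption: only the two cases described in the text are modeled
    (two bits per LUT up to depth 32, one bit per LUT up to depth 64).
    """
    if depth < 1 or width < 1:
        raise ValueError("depth and width must be positive")
    if depth <= 32:
        return math.ceil(width / 2)  # two data bits share one LUT
    if depth <= 64:
        return width                 # one data bit per LUT
    raise ValueError("deeper arrays: mapping depends on the synthesis tool")


def likely_resource(depth: int) -> str:
    """Classify an array by depth using the rules of thumb in this section."""
    if depth <= 64:
        return "LUTRAM"
    if depth > 256:
        return "block RAM"
    return "tool-dependent"  # between the two thresholds, the tool decides


if __name__ == "__main__":
    print(lutram_luts(32, 8), likely_resource(32))
    print(lutram_luts(64, 8), likely_resource(64))
    print(likely_resource(1024))
```

For example, under these assumptions a 32 x 8 array needs about four LUTs, while a 64 x 8 array needs about eight.
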
Using the Output Pipeline Register
Using an output register is required for high-performance designs and is recommended for all designs, because it improves the clock-to-output timing of the block RAM. A second output register is also beneficial, because slice output registers have faster clock-to-output timing than a block RAM register. With both registers, the total read latency is 3 clock cycles. When inferring these registers, describe them in the same level of hierarchy as the RAM array. This allows the tools to merge the block RAM output register into the primitive.
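
To make the latency figure concrete, the following Python sketch models the read path as a simple shift pipeline: the synchronous read of the RAM array, followed by a configurable number of output registers. It is a behavioral illustration only, not HDL and not a tool-accurate model. The class name `PipelinedRam`, its methods, and the convention of counting latency in clock edges from presenting an address to seeing its data are assumptions made for this example.

```python
class PipelinedRam:
    """Behavioral model of a synchronous-read RAM with output registers."""

    def __init__(self, depth: int, output_regs: int = 2):
        self.mem = [0] * depth
        # Stage 0 is the RAM's own synchronous read; the remaining stages
        # are the output registers (block RAM register, then slice register).
        self.pipe = [None] * (1 + output_regs)

    def preload(self, addr: int, data: int) -> None:
        """Initialize a location directly (write timing is not modeled)."""
        self.mem[addr] = data

    def clock(self, addr: int):
        """Apply one clock edge with `addr` on the address port.

        Returns the value on the final output after this edge, or None
        while the pipeline is still filling.
        """
        # Shift from the last stage backward so every register captures
        # the value its predecessor held before the edge.
        for i in range(len(self.pipe) - 1, 0, -1):
            self.pipe[i] = self.pipe[i - 1]
        self.pipe[0] = self.mem[addr]
        return self.pipe[-1]


if __name__ == "__main__":
    ram = PipelinedRam(depth=16, output_regs=2)
    for a in range(16):
        ram.preload(a, a * 10)
    # Data for address 5 appears on the third clock edge.
    print([ram.clock(a) for a in (5, 6, 7, 8, 9)])
```

In this model, each added output register costs one cycle of latency but does not reduce throughput: once the pipeline is full, one read result is produced per clock.
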
Using the Input Pipeline Register
When RAM arrays are large and mapped across many primitives, they can span a considerable area of the die. This can lead to performance issues on address and control lines. Consider adding an extra register after the generation of these signals and before the RAMs. To further improve timing, use phys_opt_design later in the flow to replicate this register. Registers without logic on the input will replicate more easily.