Performance Considerations When Implementing RAM - 2022.1 English

Versal ACAP Hardware, IP, and Platform Development Methodology Guide (UG1387)

Document ID: UG1387
Release Date: 2022-05-25
Version: 2022.1 English

To efficiently infer memory elements, consider these factors affecting performance:

  • Using Dedicated Blocks or Distributed RAMs

    RAMs can be implemented either in dedicated block RAM or in LUTs as distributed RAM (LUTRAM). The choice affects not only resource usage but can also significantly affect performance and power.

    In general, the required depth of the RAM is the first criterion. Memory arrays up to 64 entries deep are generally implemented in LUTRAM: depths of 32 entries or less are mapped at 2 bits per LUT, and depths up to 64 entries can be mapped at 1 bit per LUT. Deeper RAMs can also be implemented in LUTRAM, depending on available resources and on how the synthesis tool assigns them.

    Memory arrays deeper than 256 entries are generally implemented in block RAM. Xilinx devices can map such structures in different width and depth combinations. Familiarize yourself with these configurations to understand how many block RAMs, and in what arrangement, are used for larger memory array declarations in the code.

  • Using the Output Pipeline Register

    Using an output register is required for high-performance designs and is recommended for all designs, because it improves the clock-to-output timing of the block RAM. A second output register is also beneficial, because slice output registers have faster clock-to-output timing than a block RAM register. With both registers, the total read latency is 3 clock cycles. When inferring these registers, describe them in the same level of hierarchy as the RAM array so that the tools can merge the block RAM output register into the primitive.

  • Using the Input Pipeline Register

    When RAM arrays are large and mapped across many primitives, they can span a considerable area of the die. This can degrade timing on the address and control lines, which must reach every primitive. Consider adding an extra register after these signals are generated and before the RAMs. To further improve timing, use phys_opt_design later in the flow to replicate this register. Registers without logic on their inputs replicate more easily.
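
The depth guidance and the read latency described in the list above can be sketched as a small behavioral model. This is an illustrative Python sketch, not tool behavior. The names suggest_ram_mapping and PipelinedRamModel are hypothetical, the LUT count covers data storage only, and the synthesis tool makes the actual mapping decision based on resources and constraints.

```python
from math import ceil


def suggest_ram_mapping(depth: int, width: int) -> dict:
    """Rough mapping guess from the depth guidance in the list above.

    Hypothetical helper: the real decision belongs to the synthesis tool,
    which also weighs available resources and user constraints.
    """
    if depth <= 0 or width <= 0:
        raise ValueError("depth and width must be positive")
    if depth <= 32:
        # Depths of 32 or less are mapped at 2 bits per LUT.
        return {"style": "distributed", "luts": ceil(width / 2)}
    if depth <= 64:
        # Depths up to 64 are mapped at 1 bit per LUT.
        return {"style": "distributed", "luts": width}
    if depth > 256:
        # Arrays deeper than 256 are generally placed in block RAM.
        return {"style": "block", "luts": None}
    # Between 64 and 256 the result depends on the tool and on resources.
    return {"style": "tool-dependent", "luts": None}


class PipelinedRamModel:
    """Cycle-based model of a synchronous-read RAM with output registers.

    output_regs=2 models the block RAM output register plus one slice
    register, which gives the read latency of 3 quoted above.
    """

    def __init__(self, depth: int, output_regs: int = 2):
        self.mem = [0] * depth
        # Stage 0 is the RAM's own synchronous read; the remaining stages
        # are the output pipeline registers.
        self.stages = [None] * (1 + output_regs)

    @property
    def read_latency(self) -> int:
        return len(self.stages)

    def write(self, addr: int, data: int) -> None:
        self.mem[addr] = data

    def clock(self, read_addr: int):
        """Apply one clock edge; return the output visible after that edge."""
        # Every register captures its predecessor's value on the same edge.
        self.stages = [self.mem[read_addr]] + self.stages[:-1]
        return self.stages[-1]


ram = PipelinedRamModel(depth=16)
for i in range(16):
    ram.write(i, i * 10)

# The value for address 1 appears on the third clock edge.
outputs = [ram.clock(addr) for addr in (1, 2, 3, 4, 5)]
print(outputs)  # [None, None, 10, 20, 30]
print(suggest_ram_mapping(depth=32, width=8))  # {'style': 'distributed', 'luts': 4}
```

In this model, each additional register stage adds one cycle of latency without reducing throughput: after the pipeline fills, one read result is produced per clock. An input pipeline register on the address would add one more cycle in the same way.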