Reducing Clock Delay in UltraScale and UltraScale+ Devices - 2024.1 English

UltraFast Design Methodology Guide for FPGAs and SoCs (UG949)

Document ID: UG949
Release Date: 2024-05-30
Version: 2024.1 English

In UltraScale and AMD UltraScale+™ global clock routing, the clock net is first routed from a global clock buffer via the horizontal and vertical routing tracks to a central location called the clock root. From the clock root, the clock net spans out to drive the clock rows in each clock region via the vertical distribution track. On each row, programmable delays on the BUFCE_ROW route-through sites perform a coarse-grained deskew as the clock spans farther from the clock root.

The following figure shows a clock path from the global clock buffer (BUFG) to the clock root and on to the leaf level. At the clock root, the clock switches from the routing tracks to the vertical distribution track. It then passes through the BUFCE_ROW in each clock region row, which drives the horizontal distribution tracks, and finally reaches the leaf level. The source is shown in green and the destination in red.

Figure 1. Clock Path from BUFG to the Leaf Level via BUFCE_ROW

The row programmable tap delay is largest near the clock root. The delay decreases by one tap per clock region as the clock reaches farther away from the root in the vertical direction, eventually decreasing to zero.
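The one-tap-per-clock-region decrease described above can be sketched with a few lines of plain Tcl. This is an idealized model for illustration only, not the placer's actual algorithm. The proc name `row_tap` and the root row and root tap inputs are hypothetical.

```tcl
# Idealized model: tap value for a clock region row, given the row that
# holds the clock root and the tap value used at the root.
# Both inputs are hypothetical; the placer chooses the real values.
proc row_tap {row root_row root_tap} {
    # Vertical distance from the root, in clock region rows.
    set distance [expr {abs($row - $root_row)}]
    # One tap less per row of distance, never below zero.
    return [expr {max(0, $root_tap - $distance)}]
}

# Example: clock root in row Y4 with a root tap of 3, rows Y9 down to Y0.
for {set row 9} {$row >= 0} {incr row -1} {
    puts [format "row Y%d: tap %d" $row [row_tap $row 4 3]]
}
```

The actual tap values for a placed design come from the Clock Utilization Report, not from a model like this one.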

The following figure shows the topology of the programmable row tap values decreasing from the root. Higher tap values mean higher delays and higher super logic region (SLR) crossing clock skew, because the minimum/maximum delay variation caused by manufacturing process variation adds timing uncertainty with each tap. This makes it more difficult to meet timing near the root, where the programmable tap delay values are higher. Farther from the root in the vertical direction, there is less uncertainty, and it is generally easier to fix hold violations on SLR crossing buses. For SLR crossing buses that are farther from the root in the horizontal direction, the clock row delays increase. This additional delay introduces more minimum/maximum delay variation and reduces the performance of SLR crossings.

Figure 2. Row Programmable Tap Delay Settings Across an UltraScale+ SSI Technology Device

For UltraScale+ SSI technology devices, you can improve SLR crossing speed using either of the following methods:

  • Move the clock root close to the SLR crossings in the horizontal direction
  • Limit the maximum row programmable tap delay value to reduce the uncertainty

Note: Timing paths farther from the root in the vertical direction might become slightly slower due to increased delay from hold-fixing route detours. However, using these methods results in an overall performance gain.
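The text above does not name a mechanism for moving the clock root. One way to do it, shown here as an assumption, is the Vivado USER_CLOCK_ROOT net property. The buffer instance name `clk_bufg_inst` and the clock region `X2Y4` are placeholders, not values from this guide.

```tcl
# Hypothetical buffer instance; substitute the global clock buffer from your design.
set clk_net [get_nets -of [get_pins clk_bufg_inst/O]]

# Request a clock root in a clock region that is horizontally close to the
# SLR crossings of interest. X2Y4 is a placeholder region.
set_property USER_CLOCK_ROOT X2Y4 $clk_net

# After place_design, check which clock root was actually assigned.
puts "Clock root: [get_property CLOCK_ROOT $clk_net]"
```

Like USER_MAX_PROG_DELAY, the property is applied to the net segment driven directly by the global clock buffer.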

You can review the row programmable tap delay settings that the Vivado tool chose for each global clock in your design in the Device Cell Placement Summary for Global Clock sections in the Clock Utilization Report. Following is an example that shows the row programmable tap delay settings for the g13 global clock in the HORIZONTAL PROG DELAY column, which is highlighted in yellow.

Figure 3. Global Clock Row Programmable Tap Delay Settings in the Clock Utilization Report
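As a minimal sketch, the Clock Utilization Report can be generated from the Tcl console on a placed or routed design. The output file name is a placeholder.

```tcl
# Write the Clock Utilization Report to a file (placeholder file name).
# The per-clock placement summary sections contain the
# HORIZONTAL PROG DELAY column described above.
report_clock_utilization -file clock_utilization.rpt
```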

For UltraScale+ SSI technology devices, the placer limits the maximum row programmable tap delay value to reduce the minimum/maximum delay variation and the SLR crossing clock skew near the clock root. The placer also ensures that the clock regions on either side of an SLR crossing have an increasing or decreasing tap delay value, which balances the clock skew on SLR crossing paths farther from the root. To find the maximum row programmable tap delay value used by the placer, query the MAX_PROG_DELAY property of the clock net.
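A query of this property might look like the following sketch, run on a placed design. The buffer instance name `clk_bufg_inst` is hypothetical.

```tcl
# Hypothetical buffer instance; the net of interest is the one driven by its O pin.
set clk_net [get_nets -of [get_pins clk_bufg_inst/O]]

# Read back the maximum row programmable tap delay value the placer used.
puts "MAX_PROG_DELAY: [get_property MAX_PROG_DELAY $clk_net]"
```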

You can also limit the row programmable tap value using the USER_MAX_PROG_DELAY property. The property must be applied to the net segment directly driven by the global clock buffer. If the USER_MAX_PROG_DELAY property is not set, the placer can use the maximum possible tap setting of 7. Following is an example.

set_property USER_MAX_PROG_DELAY <0-7> [get_nets -of [get_pins BUFG/O]]
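As a concrete instance of the command above, the following sketch caps the tap value at 4 for a single clock. The buffer instance name `clk_bufg_inst` is hypothetical, and the sketch assumes that the constraint is applied before placement, for example in an XDC file.

```tcl
# Net segment driven directly by the global clock buffer (hypothetical instance name).
set clk_net [get_nets -of [get_pins clk_bufg_inst/O]]

# Limit the row programmable tap delay to 4 instead of the default maximum of 7.
set_property USER_MAX_PROG_DELAY 4 $clk_net
```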

Following are tips for using the USER_MAX_PROG_DELAY property:

  • The recommended USER_MAX_PROG_DELAY tap value is 3 or 4 for clocks that span most of an UltraScale+ SSI technology device. When clock roots are near GT, PCIe®, or CMAC blocks that are off-center in the device, SLR crossing performance on the opposite side of the device is heavily impacted, because the common node for the launch and capture clocks is farther away from the SLR crossing.
  • For clock groups using the CLOCK_DELAY_GROUP for clock network matching, ensure that all clocks within the clock group use the same USER_MAX_PROG_DELAY value.
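The second tip can be sketched as follows for two matched clocks. The buffer instance names and the group name `grp_ab` are hypothetical, and the value 4 is taken from the recommended range above.

```tcl
# Nets driven directly by the two global clock buffers (hypothetical instance names).
set grp_nets [get_nets -of [get_pins {clk_a_bufg_inst/O clk_b_bufg_inst/O}]]

# Put both clocks in the same delay group for clock network matching.
set_property CLOCK_DELAY_GROUP grp_ab $grp_nets

# Apply the same tap limit to every clock in the group.
set_property USER_MAX_PROG_DELAY 4 $grp_nets
```

Setting the property on the whole collection in one command keeps the values from drifting apart when one of them is changed later.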