Improving Performance in the PS

Versal Adaptive SoC System Integration and Validation Methodology Guide (UG1388)

Document ID: UG1388
Release Date: 2023-11-15
Version: 2023.2 English

In the Versal device processing system (PS), you can control quality of service (QoS) using the following interconnect traffic types: low-latency datapaths and high-throughput datapaths. For more information, see the Versal Adaptive SoC Technical Reference Manual (AM011).

The following figure shows the potential contention points affecting the PS architecture.

Figure 1. PS Architecture Contention Points

When considering the performance of the processing system and memory subsystems, the first things that come to mind are processor performance, clock frequencies, the memory hierarchy, caches, and more generally the capabilities of internal and external memory. While these considerations are important, understanding the processing system architecture is equally important.

The external shared DDR memory subsystem and the on-chip memory (OCM) are the two main memory subsystems in Versal devices. The masters that access these memory subsystems can be in the PL or in the PS. A PL master can greatly impact the performance of PS masters when its traffic traverses the PS to access OCM or external DDR memory.

The PL master traffic routes are:

PL master (soft IP) attached to NoC via a NoC NMU
PL master attached to PS AXI interfaces
PL master attached to PS ACP

The AI Engine and CPM might also need to route traffic through the PS interconnect, so pay close attention to the multiple ways in which traffic can be routed.

The internal PS masters (APU, RPU, DMA, and so on) generate traffic and use the PS interconnect, which connects to the NoC.

Direct PL to PS interfaces (S_AXI_FPD, S_ACE_LITE_FPD, S_AXI_LPD)
- S_AXI_FPD (in FPD) is virtualized and non-coherent
- S_ACE_LITE_FPD (in FPD) is virtualized and coherent
- S_AXI_LPD (in LPD) can be configured as physical, or as virtualized and coherent
NoC to PS interfaces (NoC_FPD_AXI_0, NoC_FPD_AXI_1, NoC_FPD_CCI_0, NoC_FPD_CCI_1, all in FPD)
- NoC_FPD_AXI_0 and NoC_FPD_AXI_1 are virtualized and non-coherent
- NoC_FPD_CCI_0 and NoC_FPD_CCI_1 are virtualized and coherent
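
The interface attributes listed above can be captured in a small lookup table, which makes it easier to shortlist entry points into the PS for a given master. The following Python sketch is illustrative only: the PsInterface class and the select helper are hypothetical names, not part of any AMD tool. S_AXI_LPD is modeled as two entries, one per configurable mode, and treating its physical mode as non-coherent is an assumption. The NoC interface names follow the _0/_1 suffix pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PsInterface:
    name: str
    domain: str        # "FPD" or "LPD"
    source: str        # "PL" (direct PL to PS) or "NoC" (NoC to PS)
    virtualized: bool
    coherent: bool


# Attributes transcribed from the interface list above.
INTERFACES = [
    PsInterface("S_AXI_FPD", "FPD", "PL", True, False),
    PsInterface("S_ACE_LITE_FPD", "FPD", "PL", True, True),
    # S_AXI_LPD is configurable, so each mode gets its own entry.
    # Assumption: the physical mode is non-coherent.
    PsInterface("S_AXI_LPD", "LPD", "PL", False, False),
    PsInterface("S_AXI_LPD", "LPD", "PL", True, True),
    PsInterface("NoC_FPD_AXI_0", "FPD", "NoC", True, False),
    PsInterface("NoC_FPD_AXI_1", "FPD", "NoC", True, False),
    PsInterface("NoC_FPD_CCI_0", "FPD", "NoC", True, True),
    PsInterface("NoC_FPD_CCI_1", "FPD", "NoC", True, True),
]


def select(coherent=None, virtualized=None, source=None):
    """Return the interface entries matching every criterion that is not None."""
    return [
        i for i in INTERFACES
        if (coherent is None or i.coherent == coherent)
        and (virtualized is None or i.virtualized == virtualized)
        and (source is None or i.source == source)
    ]


# Example: non-coherent entry points reachable from the NoC.
print([i.name for i in select(coherent=False, source="NoC")])
```

A table like this only narrows the choice. The final selection still depends on the contention points discussed in this section.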

Understanding and leveraging the multiple ways a master can reach the external DDR memory or OCM is critical to optimizing the routing and avoiding PS interconnect congestion.

For example, routing the traffic of multiple masters through the CCI can be detrimental to performance, because a maximum of four CCI500 AXI4 master ports connect to four NoC NMUs. In the following figure, these ports are M2, M3, M4, and M5.

Figure 2. Master Ports Connected to NoC NMUs

The SMMU and CCI can also impact performance. Each shares internal resources, which can add latency and therefore reduce throughput. If you are optimizing for high performance, and virtualization/isolation and coherency are not required, using the CCI500 and SMMU is not recommended.

Traffic regulation mechanisms are available at the source, which is the NMU in the NoC switches, and at the destination slaves (MC). You can use these mechanisms to apply the desired QoS scheme.

The PS interconnect does not support virtual channels, so physical separation must be used.

In the ingress direction (external masters to PS slaves), each NSU supports a single traffic class, so different classes must enter the PS through different physical ports. Similarly, in the egress direction, the internal PS interconnect network carries different traffic classes over different physical channels into the NMUs that connect to the horizontal NoC. For example, two of the four CCI to NoC channels can carry low-latency (LL) traffic (M2 and M3 in the above figure), while the other two can carry best-effort (BE) traffic (M4 and M5 above).
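
Because each physical port carries a single traffic class, a planned assignment of flows to ports can be checked mechanically before it is entered in the design tools. The following Python sketch is one possible way to do this and is not an AMD API: the mixed_class_ports helper and the flow names are invented for illustration, and the port labels reuse M2 to M5 from the example above.

```python
from collections import defaultdict


def mixed_class_ports(flows):
    """Find ports that would carry more than one traffic class.

    flows: iterable of (flow_name, port, traffic_class) tuples.
    Returns {port: sorted list of classes} for each offending port.
    """
    classes = defaultdict(set)
    for _name, port, traffic_class in flows:
        classes[port].add(traffic_class)
    return {port: sorted(c) for port, c in classes.items() if len(c) > 1}


# Hypothetical plan: M2/M3 carry LL and M4 carries BE, but M5 mixes both.
flows = [
    ("video_in", "M2", "LL"),
    ("control", "M3", "LL"),
    ("bulk_copy", "M4", "BE"),
    ("logging", "M4", "BE"),
    ("sensor", "M5", "LL"),
    ("backup", "M5", "BE"),
]

print(mixed_class_ports(flows))  # {'M5': ['BE', 'LL']}
```

An empty result means the plan respects the physical separation requirement. It does not say anything about whether the bandwidth on each port is sufficient.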

Meeting the ever-growing demand for performance while keeping energy consumption at acceptable levels is a difficult challenge. A complex device such as the Versal device can be fine-tuned to achieve the best performance per watt.

To meet the performance requirements of modern applications, do the following:

Benchmark the PS with the NoC-DDR using industry-standard benchmarks
Run proprietary algorithms while executing performance analysis with generic tools

PS performance can be tuned based on real-time, non-intrusive measurements that collect performance data while the application executes. Based on the statistics gathered, techniques such as frequency scaling and task scheduling can be used to tune the system for lower power consumption without compromising performance.

The performance monitor units (PMUs) available in the APU, RPU, and NoC provide invaluable data for understanding CPU utilization, cache misses, and branch mispredictions, and for finding code execution hot spots.

You can set up to six performance counters in the APU and six counters in the RPU to count the desired events and obtain processor performance information at runtime. In addition to the six performance counters, each processor's PMU provides a dedicated cycle counter.
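
With six event counters per PMU, a profiling run that needs more than six events has to be split into several measurement passes. The following Python sketch shows one simple way to plan those passes and to derive a ratio from the raw counts. It is a planning aid under stated assumptions, not driver code: the function names are hypothetical, and the event labels are used only as example strings.

```python
MAX_EVENT_COUNTERS = 6  # per the text; the cycle counter is separate


def schedule_passes(events, max_counters=MAX_EVENT_COUNTERS):
    """Split events into passes of at most max_counters events each."""
    unique = list(dict.fromkeys(events))  # drop duplicates, keep order
    return [unique[i:i + max_counters]
            for i in range(0, len(unique), max_counters)]


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


# Example event labels (illustrative strings only).
wanted = [
    "INST_RETIRED", "L1D_CACHE", "L1D_CACHE_REFILL", "L1I_CACHE_REFILL",
    "L2D_CACHE", "L2D_CACHE_REFILL", "BR_PRED", "BR_MIS_PRED",
]

for n, events in enumerate(schedule_passes(wanted), start=1):
    print(f"pass {n}: {events}")

# Made-up counts, used only to show the arithmetic.
l1d_miss_rate = ratio(2_500, 100_000)  # refills / accesses
print(f"L1D miss rate: {l1d_miss_rate:.3f}")
```

Events that feed the same ratio, such as accesses and refills for one cache, are best kept in the same pass so that both counts come from the same execution window.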

For the APU, the AMD Linux toolchain, PetaLinux, can be used to create a Linux image that provides access to the PMU counters through the perf profiling and tracing tool.

For the RPU, a standalone application can be implemented to enable its PMU counters. Support in the RPU standalone BSP is planned for a later release.

Finally, AMD ChipScoPy provides Python APIs to access registers (for example, performance counters) through JTAG, along with an example application that shows NoC performance.
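
As an illustration of post-processing perf results, the following Python sketch parses the comma-separated output that perf stat produces with its -x option and derives instructions per cycle. It assumes perf is included in the Linux image and was run as, for example, perf stat -x, -e cycles,instructions,cache-misses,branch-misses followed by the application. The sample text is made up, the parse_perf_csv helper is a hypothetical name, and only the first three fields (value, unit, event) are used because the remaining fields can vary between perf versions.

```python
def parse_perf_csv(text):
    """Map event name to integer count from 'perf stat -x,' output.

    Lines whose value is not an integer (for example '<not counted>'
    or fractional values) are skipped.
    """
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(",")
        if len(fields) < 3:
            continue
        value, event = fields[0], fields[2]
        try:
            counts[event] = int(value)
        except ValueError:
            continue
    return counts


# Made-up sample in the value,unit,event,... layout.
SAMPLE = """\
2000000,,cycles,1000000,100.00,,
3000000,,instructions,1000000,100.00,1.50,insn per cycle
40000,,cache-misses,1000000,100.00,,
5000,,branch-misses,1000000,100.00,,
<not counted>,,bus-cycles,0,0.00,,
"""

counts = parse_perf_csv(SAMPLE)
ipc = counts["instructions"] / counts["cycles"]
print(f"IPC: {ipc:.2f}")
```

Note that perf stat writes its counter report to standard error by default, so the output has to be redirected or written to a file before a script like this can read it.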