Partitioning Validation - 2024.2 English - UG1504

Versal Adaptive SoC System and Solution Planning Methodology Guide (UG1504)

Document ID: UG1504
Release Date: 2024-12-18
Version: 2024.2 English

In this step, you build a 32-tile design based on the solution synthesis proposed above and profile its performance to measure the achieved throughput.

Every kernel must perform three computations: Rho computation, address computation, and histogram update. The following code snippets implement this functionality.

Figure 1. Code Snippet Examples
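As a rough illustration of the first two computations, the following plain C++ sketch assumes the standard Hough-transform line parameterization rho = x cos(theta) + y sin(theta) and a flat histogram indexed by (theta, rho). It is not the AI Engine kernel code from the figure. The names compute_rho, compute_address, kNumTheta, and kRhoMax, the choice of 128 theta bins, and the 216 x 240 image size used to bound rho are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical parameters, chosen only for illustration.
constexpr int kNumTheta = 128;             // number of theta bins (assumed)
constexpr int kRhoMax   = 323;             // ceil(sqrt(216^2 + 240^2)), assumed image size
constexpr int kNumRho   = 2 * kRhoMax + 1; // rho bins covering [-kRhoMax, +kRhoMax]

// Rho computation: rho = x*cos(theta) + y*sin(theta), rounded to the nearest bin.
inline int compute_rho(int x, int y, int theta_idx) {
    assert(theta_idx >= 0 && theta_idx < kNumTheta);
    const double kPi = 3.14159265358979323846;
    const double theta = kPi * theta_idx / kNumTheta; // theta in [0, pi)
    return static_cast<int>(std::lround(x * std::cos(theta) + y * std::sin(theta)));
}

// Address computation: flat histogram index for the (theta_idx, rho) bin.
inline int compute_address(int theta_idx, int rho) {
    assert(theta_idx >= 0 && theta_idx < kNumTheta);
    assert(rho >= -kRhoMax && rho <= kRhoMax);
    return theta_idx * kNumRho + (rho + kRhoMax);
}
```

In this sketch, a pixel at (x, y) yields one address per theta bin, and each address selects the histogram counter that the update step increments.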

Each histogram routine updates four pixels, with four values of theta each, using read-modify-write (RMW) operations, and runs on the AI Engine scalar processor. The following shows the implementation code.

Figure 2. Histogram Code Example
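To make the read-modify-write pattern concrete, here is a minimal plain C++ sketch of one histogram routine that handles four pixels with four theta values each. It assumes the bin addresses were computed beforehand. The function name histogram_update, the addr[pixel][theta] layout, and the 16-bit counter width are hypothetical and are not taken from the kernel code in the figure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kPixels = 4; // pixels handled per call (from the text)
constexpr int kThetas = 4; // theta values per pixel (from the text)

// Increment one histogram bin per (pixel, theta) pair.
// addr[p][t] holds the precomputed bin address (hypothetical layout).
inline void histogram_update(std::uint16_t* hist, std::size_t hist_size,
                             const std::uint32_t (&addr)[kPixels][kThetas]) {
    for (int p = 0; p < kPixels; ++p) {
        for (int t = 0; t < kThetas; ++t) {
            const std::uint32_t a = addr[p][t];
            assert(a < hist_size);
            const std::uint16_t count = hist[a];            // read
            hist[a] = static_cast<std::uint16_t>(count + 1); // modify and write
        }
    }
}
```

Because two addresses in the same call can point to the same bin, the 16 updates are applied one after another here, which is one reason this kind of loop tends to map to a scalar rather than a vector unit.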

After compiling and simulating the design with the --profile option, you can view the total number of cycles needed to process the data, as well as a breakdown of the cycles consumed per function call.

In the profiling snippet below, the total number of cycles to process 216 x 240 pixels is 2,650,436 at a 1.25 GHz clock rate, which translates to ~24.2 megapixels per second (Mpps).
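The conversion from a profiled cycle count to throughput can be sketched as follows. This is plain C++ arithmetic for illustration only; the helper name pixels_per_second is hypothetical and is not part of any tool output.

```cpp
#include <cassert>

// Throughput in pixels per second:
//   pixels / (cycles / clock_hz) = pixels * clock_hz / cycles
inline double pixels_per_second(long pixels, long cycles, double clock_hz) {
    assert(pixels > 0 && cycles > 0 && clock_hz > 0.0);
    return static_cast<double>(pixels) * clock_hz / static_cast<double>(cycles);
}
```

Calling pixels_per_second(216L * 240L, 2650436L, 1.25e9) with the inputs quoted in the text evaluates to roughly 24 Mpps.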

Each histogram update takes on average (1,185,840 + 1,185,840) / (216 x 240 / 8) = 366 cycles.

Note: The assumption was eight cycles per histogram update.

Figure 3. Profiling Code Example

Rho and address computation run on the vector processor and process eight pixels per iteration. Processing 216 x 240 pixels took 277,204 cycles, which translates to 5.3 cycles per pixel, or 42.8 cycles per iteration.

Note: The budget is 45.5 cycles per iteration assuming 8x vectorization.

The following table summarizes the initial assumptions versus the results achieved during the rapid prototyping step.

Table 1. Histogram Assumption vs. Achieved
Budget (in cycles)                                          Assumption   Achieved
Histogram update, RMW                                       8            366
Budget for Rho and address computation (8x vectorization)   45.4         42.8
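The "Achieved" column can be reproduced from the raw profiled cycle counts with a small helper. The sketch below is plain C++ for illustration. The BudgetCheck type and check_budget function are hypothetical names, and it assumes, as the divisor in the text implies, that both rows are measured per iteration of eight pixels.

```cpp
#include <cassert>

// Result of comparing measured cycles per iteration against a planning budget.
struct BudgetCheck {
    double achieved; // measured cycles per iteration
    double budget;   // assumed cycles per iteration
    bool within_budget() const { return achieved <= budget; }
    double ratio() const { return achieved / budget; } // > 1 means over budget
};

// Convert a total cycle count into cycles per iteration and pair it with the budget.
inline BudgetCheck check_budget(long total_cycles, long pixels,
                                int pixels_per_iteration, double budget) {
    assert(pixels > 0 && pixels_per_iteration > 0 && budget > 0.0);
    const double iterations = static_cast<double>(pixels) / pixels_per_iteration;
    return BudgetCheck{static_cast<double>(total_cycles) / iterations, budget};
}
```

With the numbers from the text, check_budget(1185840L + 1185840L, 216L * 240L, 8, 8.0) gives 366 cycles against a budget of 8, a ratio of about 46, while check_budget(277204L, 216L * 240L, 8, 45.4) gives about 42.8 cycles, which fits within the budget.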