Compute Requirements - 2024.1 English

Versal Adaptive SoC System and Solution Planning Methodology Guide (UG1504)

Document ID
UG1504
Release Date
2024-06-19
Version
2024.1 English

The main compute requirements in this algorithm comes from line 37 in the code above. The operands of the multiplication are both int16. The number of real MACs/pixel required is 256 (128 θ steps x 2 real MACs/step) and the algorithm needs to run at 220 Mpps. Assuming a clock rate of 1.25 GHz, this means you need to compute 46 real MACs/cycle. Based on the functional overview in the Versal Adaptive SoC AI Engine Architecture Manual (AM009), a single tile in AIE1 supports 32 real MACs/cycle. Notice how the number of MACs/cycle on the AI Engine is directly related to the data type. For example, if your operands are both int32 instead, you can only perform eight MACs/cycle.

Following rho computation, you need to increment the accumulator where edges in the image exists. This requires read-modify-write (RMW) operation that cannot be vectorized, so assume it is run on the AI Engine scalar processor. Also, assume that RMW operation requires eight cycles.

The overall budget in cycles is based on the number of pixels process, the clock rate, and target throughput. Therefore, total Budget = 216 x 240 x 1.25e9 / 220e6 = 294545 cycles. This translates to a maximum budget of 5.7 cycles/pixel.

If the algorithm cannot be vectorized, the budget to process one pixel is 5.7 cycles.

If the algorithm is vectorized to run with eight pixels in parallel, the budget to process the eight pixels is 5.7 x 8 = 45.5 cycles.

Because the histogram update is expected to be the throughput bottleneck, you can perform a quick calculation based on some assumptions to predict what it will be. If every pixel requires 128 histogram updates and each update requires eight cycles, assuming a clock rate of 1.25 GHz and a 32-tile design, the maximum throughput you can achieve is 1.25e9 x 32 / 128 / 8 = 39.1 Mpixel/sec.

Based on the requirements and assumptions above, you can build the tables and graph below to better visualize the options. You will need to construct your solution to validate these assumptions and revise your proposal if necessary.

Table 1. Input System Parameters
Parameter Value Units Notes
AIE Clock Rate 1.25 GHz Use -2M speed grade
Image # Rows Pixels "R" 216 Pixels
Image # Column Pixels "C" 240 Pixels
Histogram Data Type Size 2 B Assume <int16>
Theta Resolution (# steps) 128 Over 180°
# AIE Tiles 32 tiles Partition over "Theta"
Target throughput 220 Mpixel/sec
# cycles per Histogram update 8 cycles RMW Op (3/2/3 cycles)
Table 2. Output Compute Parameters
Parameter Value Units Notes
Inner loop body scalar iterations 51,840 iter Just R x C
Inner loop body vector iterations 6,480 iter Just R x C / 8
Overall budget 294,545 cycles To meet throughput
Max cycles per pixel 5.7 cycles/Pixel
Min Loop II (scalar) 5.7 cycles
Min Loop II (8X vector) 45.5 cycles
# MAC per pixel per tile 8 MAC/Pixel/tile Based on algorithm
Compute Bound (rho) 5000.0 Mpixel/sec Only consider rho compute
Compute Estimate (histogram) 39.1 Mpixel/sec Only consider histogram

Using a pivot table from Excel, the following figure shows the Hough transform compute estimate.

Figure 1. Hough Transform Compute Estimate

To reach the target throughput requirement from here, you will need to use option 2 or 3 referenced earlier. You can split the histogram update over more tiles to reach the target throughput or instantiate the lower throughput solution multiple times.