Requirements Gathering - 2024.2 English - UG1504

Versal Adaptive SoC System and Solution Planning Methodology Guide (UG1504)

Document ID
UG1504
Release Date
2024-12-18
Version
2024.2 English

The following is a functional model of the algorithm implemented in MATLAB® .

Figure 1. MATLAB Code Example

The following table summarizes the algorithm and system requirements of the example Hough transform to be implemented on AIE1.

Table 1. Hough Transform Parameters
Parameter Value Units Notes
Image # Rows Pixels "R" 216 Pixels  
Image # Column Pixels "C" 240 Pixels  
Histogram Data Type Size 2 B Assume <int16>
Theta Resolution (# steps) 128   Over 180°
Target throughput 220 Mpixel/sec

It is important to reflect on these parameters and understand how they drive your implementation cost. Image size will understandably drive storage cost because it will need to be stored in-part or in-full for processing. Storing the image in the AI Engine local tile memory is costly, so it is better to store the image in PL and stream the pixels into the tiles for computation as needed.

The histogram H(ρ, θ) collecting statistics will also needs to be stored and its size is dependent on:

  • Largest possible value of “ρ,” calculated as ceiling(sqrt(R2+C2))
    • Directly dependent on image size. A tall image is more costly than a square image.
  • θ resolution
  • Data type, chosen as int16 requiring 2 Bytes

Total histogram storage requirement can be calculated as (1 + 2 x largest possible value of “ρ”) x θ resolution x 2 Bytes.

cos_theta and sin_theta used in line 37 of the code above can be precomputed and stored in a look-up table (LUT) as int16 values. The size of the LUT will be directly dependent on the theta_resolution and assumed data type.

Compute cost is driven by throughput requirement, image dimensions, and theta resolution. It is also driven by data types of multiplication operands in line 37 of the MATLAB code above.

For example, computation of rho_i requires 128 x 2 (int16 x int16) MACs/pixel and you will need to run at 220 Mpixel/sec. Assuming AI Engine clock rate of 1.25 GHz, this requires 46 real MACs/cycle. Based on the functional overview in the Versal Adaptive SoC AI Engine Architecture Manual (AM009), a single tile in AIE1 supports 32 real MACs/cycle. This is an important conclusion, as it means the solution will require multiple tiles.

Similarly, bandwidth cost is driven by throughput and image dimensions.

From an image processing perspective, there are a couple of options to partition this problem.

Option 1
Each tile computes a full transform for a portion of the image. This has a linear reduction in compute and reduces the amount of bandwidth required to each tile but has the storage requirements of the full histogram. It also requires an additional compute block to combine the histogram outputs from each tile.
Option 2
Each tile computes a partial transform for the whole input image. This also has a linear reduction in compute, requires higher bandwidth compared to option 1 because the full image must be delivered to each tile but it reduces the storage requirements of the histogram. The histogram outputs from each tile will then need to be collected.
Option 3
A combination of options 1 and 2 in which you build a low-throughput solution using option 2, then instantiate that multiple times to achieve your desired throughput.

The next step is to proceed with analyzing hardware requirements.