The following is a functional model of the algorithm implemented in MATLAB®.
The following table summarizes the algorithm and system requirements of the example Hough transform to be implemented on AIE1.
Parameter | Value | Units | Notes |
---|---|---|---|
Image rows "R" | 216 | Pixels | |
Image columns "C" | 240 | Pixels | |
Histogram data type size | 2 | B | Assume `int16` |
Theta resolution (# steps) | 128 | Over 180° | |
Target throughput | 220 | Mpixel/sec | |
It is important to reflect on these parameters and understand how they drive your implementation cost. Image size understandably drives storage cost because the image must be stored, in part or in full, for processing. Storing the image in AI Engine local tile memory is costly, so it is better to store the image in the PL and stream pixels into the tiles as needed for computation.
The histogram H(ρ, θ) collecting statistics also needs to be stored, and its size depends on:
- The largest possible value of "ρ," calculated as ceiling(sqrt(R² + C²))
  - This is directly dependent on image size; a tall image is more costly than a square image.
- θ resolution
- Data type, chosen as `int16` and requiring 2 bytes

Total histogram storage can be calculated as (1 + 2 × largest possible value of "ρ") × θ resolution × 2 bytes.
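As a quick sanity check, the formula above can be evaluated with the table's parameters (a Python sketch, not part of the MATLAB model):

```python
import math

R, C = 216, 240      # image rows and columns (pixels)
N_THETA = 128        # theta steps over 180 degrees
BYTES = 2            # int16 histogram bins

rho_max = math.ceil(math.sqrt(R**2 + C**2))       # largest possible |rho|
storage = (1 + 2 * rho_max) * N_THETA * BYTES     # total histogram bytes

print(rho_max)       # 323
print(storage)       # 165632 bytes, i.e. about 162 KB
```

This footprint is well beyond the 32 KB of local data memory in a single AIE1 tile, which is one more reason the histogram storage must be partitioned across tiles.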
The `cos_theta` and `sin_theta` values used in line 37 of the code above can be precomputed and stored in a look-up table (LUT) as int16 values. The size of the LUT is directly dependent on `theta_resolution` and the assumed data type.
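For illustration, such LUTs could be generated offline along the following lines (a Python sketch; the Q2.14 fixed-point scale is an assumption here, not something taken from the MATLAB code):

```python
import math

N_THETA = 128            # theta_resolution from the table above
SCALE = 2**14            # assumed fixed-point scale; keeps all values within int16 range

cos_lut = [round(SCALE * math.cos(i * math.pi / N_THETA)) for i in range(N_THETA)]
sin_lut = [round(SCALE * math.sin(i * math.pi / N_THETA)) for i in range(N_THETA)]

# Each table holds 128 int16 entries (256 B), so both together occupy 512 B.
lut_bytes = 2 * N_THETA * 2
```

At 512 bytes total, the LUT cost is negligible next to the histogram storage, which is why precomputation is attractive here.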
Compute cost is driven by throughput requirement, image dimensions, and theta resolution. It is also driven by data types of multiplication operands in line 37 of the MATLAB code above.
For example, computation of `rho_i` requires 128 × 2 (int16 × int16) MACs/pixel, and you will need to run at 220 Mpixel/sec. Assuming an AI Engine clock rate of 1.25 GHz, this requires 46 real MACs/cycle. Based on the functional overview in the Versal Adaptive SoC AI Engine Architecture Manual (AM009), a single tile in AIE1 supports 32 real MACs/cycle. This is an important conclusion, as it means the solution will require multiple tiles.
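The arithmetic behind that conclusion can be checked directly (a Python sketch using the numbers above):

```python
import math

MACS_PER_PIXEL = 128 * 2        # two int16 x int16 multiplies per theta step
THROUGHPUT = 220e6              # target throughput, pixels/sec
CLOCK = 1.25e9                  # assumed AI Engine clock rate, Hz
MACS_PER_TILE = 32              # real int16 MACs/cycle for one AIE1 tile (AM009)

macs_per_cycle = MACS_PER_PIXEL * THROUGHPUT / CLOCK    # 45.056, rounded up to 46
min_tiles = math.ceil(macs_per_cycle / MACS_PER_TILE)   # at least 2 tiles of compute
```

Note that this is a lower bound on compute alone; the actual tile count also depends on how storage and bandwidth are partitioned, which the options below address.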
Similarly, bandwidth cost is driven by throughput and image dimensions.
From an image processing perspective, there are a few options for partitioning this problem.
- Option 1
- Each tile computes a full transform for a portion of the image. This gives a linear reduction in compute and reduces the amount of bandwidth required to each tile, but retains the storage requirements of the full histogram. It also requires an additional compute block to combine the histogram outputs from each tile.
- Option 2
- Each tile computes a partial transform for the whole input image. This also gives a linear reduction in compute and requires higher bandwidth than option 1, because the full image must be delivered to each tile, but it reduces the storage requirements of the histogram. The histogram outputs from each tile then need to be collected.
- Option 3
- A combination of options 1 and 2 in which you build a low-throughput solution using option 2, then instantiate that multiple times to achieve your desired throughput.
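To make option 2 concrete, the following functional sketch (Python, not AIE code; the toy image size and theta count are illustrative assumptions) shows that partial transforms over disjoint theta ranges merge losslessly into the full histogram:

```python
import math

R, C = 8, 8          # toy image for illustration (real case: 216 x 240)
N_THETA = 8          # toy theta steps (real case: 128 over 180 degrees)

def hough(img, theta_idxs):
    """Accumulate H(rho, theta) counts for a subset of theta indices."""
    H = {}
    for t in theta_idxs:
        th = t * math.pi / N_THETA
        c, s = math.cos(th), math.sin(th)
        for y in range(R):
            for x in range(C):
                if img[y][x]:
                    rho = round(x * c + y * s)
                    H[(rho, t)] = H.get((rho, t), 0) + 1
    return H

# Arbitrary test pattern of set pixels
img = [[1 if (x + y) % 3 == 0 else 0 for x in range(C)] for y in range(R)]

full = hough(img, range(N_THETA))
# Option 2: each "tile" sees the whole image but handles half the theta range
tile0 = hough(img, range(0, N_THETA // 2))
tile1 = hough(img, range(N_THETA // 2, N_THETA))
merged = {**tile0, **tile1}   # theta subsets are disjoint, so merging is a plain union

assert merged == full
```

Because the theta subsets are disjoint, combining the per-tile outputs is a simple concatenation rather than an element-wise addition, which is what makes the collection step in option 2 cheap.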
The next step is to proceed with analyzing hardware requirements.