The main compute requirements in this algorithm comes from line 37 in the code above. The operands of the multiplication are both int16. The number of real MACs/pixel required is 256 (128 θ steps x 2 real MACs/step) and the algorithm needs to run at 220 Mpps. Assuming a clock rate of 1.25 GHz, this means you need to compute 46 real MACs/cycle. Based on the functional overview in the Versal Adaptive SoC AI Engine Architecture Manual (AM009), a single tile in AIE1 supports 32 real MACs/cycle. Notice how the number of MACs/cycle on the AI Engine is directly related to the data type. For example, if your operands are both int32 instead, you can only perform eight MACs/cycle.

Following rho computation, you need to increment the accumulator where edges in the image exists. This requires read-modify-write (RMW) operation that cannot be vectorized, so assume it is run on the AI Engine scalar processor. Also, assume that RMW operation requires eight cycles.

The overall budget in cycles is based on the number of pixels process, the clock rate, and target throughput. Therefore, total Budget = 216 x 240 x 1.25e9 / 220e6 = 294545 cycles. This translates to a maximum budget of 5.7 cycles/pixel.

If the algorithm cannot be vectorized, the budget to process one pixel is 5.7 cycles.

If the algorithm is vectorized to run with eight pixels in parallel, the budget to process the eight pixels is 5.7 x 8 = 45.5 cycles.

Because the histogram update is expected to be the throughput bottleneck, you can perform a quick calculation based on some assumptions to predict what it will be. If every pixel requires 128 histogram updates and each update requires eight cycles, assuming a clock rate of 1.25 GHz and a 32-tile design, the maximum throughput you can achieve is 1.25e9 x 32 / 128 / 8 = 39.1 Mpixel/sec.

Based on the requirements and assumptions above, you can build the tables and graph below to better visualize the options. You will need to construct your solution to validate these assumptions and revise your proposal if necessary.

Parameter | Value | Units | Notes |
---|---|---|---|

AIE Clock Rate | 1.25 | GHz | Use -2M speed grade |

Image # Rows Pixels "R" | 216 | Pixels | |

Image # Column Pixels "C" | 240 | Pixels | |

Histogram Data Type Size | 2 | B | Assume <int16> |

Theta Resolution (# steps) | 128 | Over 180° | |

# AIE Tiles | 32 | tiles | Partition over "Theta" |

Target throughput | 220 | Mpixel/sec | |

# cycles per Histogram update | 8 | cycles | RMW Op (3/2/3 cycles) |

Parameter | Value | Units | Notes |
---|---|---|---|

Inner loop body scalar iterations | 51,840 | iter | Just R x C |

Inner loop body vector iterations | 6,480 | iter | Just R x C / 8 |

Overall budget | 294,545 | cycles | To meet throughput |

Max cycles per pixel | 5.7 | cycles/Pixel | |

Min Loop II (scalar) | 5.7 | cycles | |

Min Loop II (8X vector) | 45.5 | cycles | |

# MAC per pixel per tile | 8 | MAC/Pixel/tile | Based on algorithm |

Compute Bound (rho) | 5000.0 | Mpixel/sec | Only consider rho compute |

Compute Estimate (histogram) | 39.1 | Mpixel/sec | Only consider histogram |

Using a pivot table from Excel, the following figure shows the Hough transform compute estimate.

To reach the target throughput requirement from here, you will need to use option 2 or 3 referenced earlier. You can split the histogram update over more tiles to reach the target throughput or instantiate the lower throughput solution multiple times.