Based on the Requirements Gathering section above, here is a list of key takeaways.
- The image pixels are stored in the PL and streamed into the design as they are consumed.
- The histogram will be stored and computed in the AI Engine. The histogram storage requirement is larger than the tile memory available to a single AI Engine tile; therefore, the solution is expected to be a multi-tile design.
- The rho_i computation will be vectorized and run on the AI Engine vector processor. You need to prototype this computation to confirm what vectorization and throughput are achievable.
- The histogram update is expected to be the throughput bottleneck in this design because its read-modify-write (RMW) access pattern cannot be vectorized or pipelined, so it will run on the AI Engine scalar processor. Assume a single RMW requires eight cycles, but you will need to validate this because it is the key assumption (see the sketch after this list).
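The RMW bottleneck is easiest to see in a minimal sketch of the histogram update. This is illustrative C++ rather than the tutorial's kernel code, and the bin count and data types are assumptions; the point it captures is that the bin address is data dependent, so each increment is a serial load-add-store on tile memory that cannot be vectorized or pipelined.

```cpp
#include <cstdint>

// Hypothetical sizes for illustration only; the real design's bin count and
// data types may differ.
constexpr int THETAS_PER_TILE = 4;    // 128 thetas spread across 32 tiles
constexpr int NUM_RHO_BINS    = 768;  // assumed: covers the +/- rho range of a 216 x 240 frame

// Histogram update running on the scalar processor: one increment per theta
// owned by this tile. The bin index is data dependent (it comes from the
// per-pixel rho values), so each update is a serial load-add-store on tile
// memory (assumed ~8 cycles) that cannot be vectorized or software-pipelined.
void update_histogram(int16_t hist[THETAS_PER_TILE][NUM_RHO_BINS],
                      const int16_t rho_bin[THETAS_PER_TILE]) {
    for (int t = 0; t < THETAS_PER_TILE; ++t) {
        hist[t][rho_bin[t]] += 1;   // read-modify-write into local tile memory
    }
}
```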
To validate these assumptions, you need to prototype a multi-tile solution with histogram updates. The exact number of tiles is not critical at this point because the objective is to characterize the cycles consumed by the RMW; the same prototype also gives an accurate estimate of how many tiles are needed to reach the target throughput. Once the assumptions are quantified, you can scale the solution accordingly. Next, proceed by prototyping a 32-tile design.
For a 32-tile design, each tile computes four of the 128 theta values. You propose to use mac16 intrinsics operating on four pixels per cycle. Local tile memory stores the cos(θ) and sin(θ) arrays as well as the partial 2D histogram H.
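A scalar reference for the per-tile work helps make this mapping concrete. The sketch below is not the vectorized kernel: the fixed-point scaling, table format, and bin offset are assumptions, and the real design would perform these multiply-accumulates with mac16 across four pixels per cycle. It only shows the rho arithmetic and how a bin address is formed before the histogram update sketched above.

```cpp
#include <cstdint>

constexpr int THETAS_PER_TILE = 4;    // each of the 32 tiles owns 4 of the 128 thetas
constexpr int NUM_RHO_BINS    = 768;  // assumed rho quantization (matches the earlier sketch)
constexpr int RHO_SHIFT       = 8;    // assumed fixed-point scaling of the cos/sin tables

// Per-pixel reference for one tile: rho_i = x*cos(theta_i) + y*sin(theta_i) for
// the four theta values this tile owns, using fixed-point cos/sin tables held
// in local tile memory. The resulting bin indices feed the scalar histogram
// update; the real kernel vectorizes these MACs with mac16.
void compute_rho_bins(int x, int y,
                      const int16_t cos_lut[THETAS_PER_TILE],
                      const int16_t sin_lut[THETAS_PER_TILE],
                      int16_t rho_bin[THETAS_PER_TILE]) {
    for (int t = 0; t < THETAS_PER_TILE; ++t) {
        int32_t rho = x * cos_lut[t] + y * sin_lut[t];          // fixed-point MAC
        rho_bin[t]  = static_cast<int16_t>((rho >> RHO_SHIFT)   // back to integer rho
                                           + NUM_RHO_BINS / 2); // offset so bins are non-negative
    }
}
```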
To meet the desired throughput of 220 Mpps, you have a budget of 294,545 cycles to process a 216 x 240 pixel frame. This translates to ~5.7 cycles per pixel. If you can vectorize the algorithm to process eight pixels in parallel, the budget for those eight pixels is approximately 45.5 cycles.
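The cycle budget follows from the frame size and the AI Engine clock rate. The figures above imply a clock of roughly 1.25 GHz, which is an assumption worth stating explicitly:

$$\frac{1.25\ \text{GHz}}{220\ \text{Mpps}} \approx 5.68\ \text{cycles per pixel}$$

$$216 \times 240 \times 5.68 \approx 294{,}545\ \text{cycles per frame}, \qquad 8 \times 5.68 \approx 45.5\ \text{cycles per eight pixels}$$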
The expected throughput is ~39 Mpps, bottlenecked by the RMW: assuming eight cycles per histogram update and four updates per pixel per tile, the 32-tile design spends 32 cycles per pixel.
To support the final desired throughput of 220 Mpps, you will need to scale the 32-tile design using either option 2 or option 3 described previously.
The risky areas that need to be developed and quantified are the rho and address compute as well as the histogram update. The next section will focus on de-risking these.