In this step, you will build a 32-tile design based on the proposed solution synthesis above and profile the performance to understand achieved throughput.
Every kernel performs three computations: Rho computation, address computation, and histogram update. Code snippets below implement this functionality.
Each of the histogram routines updates four pixels, with four values of theta each, using read-modify-write (RMW) operations, and runs on the AI Engine scalar processor. The following shows the implementation code.
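As a point of reference, the three computations can be sketched in portable scalar C++. This is a minimal illustration under stated assumptions, not the AI Engine kernel code: the names `compute_rho`, `compute_address`, and `update_histogram`, the bin count `kNumRho`, the `rho_max` scaling, and the theta-major histogram layout are all hypothetical choices made for this sketch.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sizes only (assumptions, not the tutorial's actual values).
constexpr std::size_t kNumTheta = 4;    // theta values handled per update call
constexpr std::size_t kNumRho   = 512;  // hypothetical number of rho bins per theta

struct Pixel { int x; int y; };

// Rho computation: rho = x*cos(theta) + y*sin(theta).
inline float compute_rho(Pixel p, float theta) {
    return p.x * std::cos(theta) + p.y * std::sin(theta);
}

// Address computation: map (theta index, rho) to a flat histogram index.
// Assumes rho lies in [-rho_max, rho_max] and a theta-major layout.
inline std::size_t compute_address(std::size_t theta_idx, float rho, float rho_max) {
    const float scaled = (rho + rho_max) * (kNumRho - 1) / (2.0f * rho_max);
    const int max_bin = static_cast<int>(kNumRho) - 1;
    const int bin = std::min(std::max(static_cast<int>(scaled), 0), max_bin);
    return theta_idx * kNumRho + static_cast<std::size_t>(bin);
}

// Histogram update: one read-modify-write per (pixel, theta) pair,
// that is 4 pixels x 4 theta values = 16 RMW operations per call.
inline void update_histogram(std::vector<std::uint32_t>& hist,
                             const std::array<Pixel, 4>& pixels,
                             const std::array<float, kNumTheta>& thetas,
                             float rho_max) {
    for (const Pixel& p : pixels) {
        for (std::size_t t = 0; t < kNumTheta; ++t) {
            const std::size_t addr = compute_address(t, compute_rho(p, thetas[t]), rho_max);
            assert(addr < hist.size());
            hist[addr] += 1;  // read, increment, write back
        }
    }
}
```

Each `hist[addr] += 1` depends on a computed address, so consecutive updates can touch the same bin and cannot be reordered freely. This data dependency is one reason why the update runs on the scalar processor.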
After compiling and simulating the design with `--profile`, you can get a view of the total number of cycles needed to process the data, as well as the breakdown of the cycles consumed per function call.
In the profiling snippet below, processing 216 x 240 pixels takes a total of 2,650,436 cycles at a 1.25 GHz clock rate, which translates to ~24.4 Mpps (51,840 pixels x 1.25 GHz / 2,650,436 cycles).
Each histogram update takes on average (1185840 + 1185840) / (216 x 240 / 8) = 366 cycles.
Rho and address computation run on the vector processor and handle eight pixels per iteration. They took 277204 cycles to consume 216 x 240 pixels, which translates to 5.3 cycles per pixel, or 42.8 cycles per iteration.
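The arithmetic behind these figures can be reproduced with a small helper. The following is a sketch only: the struct and function names (`ProfileInput`, `ProfileSummary`, `summarize`, `tutorial_figures`) are hypothetical, and the inputs are simply the cycle counts quoted in the text, not new measurements.

```cpp
#include <cassert>
#include <cstdint>

// Inputs are the numbers quoted in the text; nothing here is measured anew.
struct ProfileInput {
    std::uint64_t width;            // pixels per row
    std::uint64_t height;           // number of rows
    std::uint64_t total_cycles;     // cycles to process the whole frame
    double        clock_hz;         // AI Engine clock rate
    std::uint64_t hist_cycles;      // summed cycles of the histogram routines
    std::uint64_t rho_addr_cycles;  // cycles of Rho and address computation
    std::uint64_t pixels_per_iter;  // pixels handled per iteration
};

struct ProfileSummary {
    double mpps;                    // throughput in megapixels per second
    double cycles_per_hist_update;  // average cycles per histogram update
    double cycles_per_pixel;        // Rho + address cycles per pixel
    double cycles_per_iteration;    // Rho + address cycles per iteration
};

inline ProfileSummary summarize(const ProfileInput& in) {
    assert(in.total_cycles > 0 && in.pixels_per_iter > 0);
    const double pixels     = static_cast<double>(in.width * in.height);
    const double iterations = pixels / static_cast<double>(in.pixels_per_iter);

    ProfileSummary out{};
    out.mpps = pixels * in.clock_hz / static_cast<double>(in.total_cycles) / 1e6;
    out.cycles_per_hist_update = static_cast<double>(in.hist_cycles) / iterations;
    out.cycles_per_pixel       = static_cast<double>(in.rho_addr_cycles) / pixels;
    out.cycles_per_iteration   = static_cast<double>(in.rho_addr_cycles) / iterations;
    return out;
}

// Convenience wrapper holding the figures from this section.
inline ProfileInput tutorial_figures() {
    return {216, 240, 2650436, 1.25e9, 1185840 + 1185840, 277204, 8};
}
```

Calling `summarize(tutorial_figures())` from a small `main` and printing the fields is enough to cross-check the per-update and per-iteration numbers against a fresh profiling run.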
The following table compares the initial assumptions with the results achieved during the rapid prototyping step.
Budget (in cycles) | Assumption | Achieved |
---|---|---|
Histogram update, RMW | 8 | 366 |
Budget for Rho and address computation (8x vectorization) | 45.4 | 42.8 |