With an accurate cost model from prototyping, you can explore alternative solutions by scaling the design. Consider combining “Image Partitioning” with “Theta Partitioning” to scale up throughput. In this scheme, the image is divided into portions, and each portion is assigned to a 32-tile cluster. Each cluster computes partial histograms for four \(\theta\) values (as in the prototype). Because each cluster processes its own portion of the image, throughput scales approximately linearly with the number of clusters. The following diagram shows this approach achieving ~220 MP/s with ~275 tiles.
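The linear scaling claim can be sanity-checked with a little arithmetic. The sketch below is only illustrative: it takes the approximate figures quoted above (~220 MP/s, ~275 tiles, 32 tiles per cluster) and assumes ideal linear scaling with no I/O or memory bottleneck. The function names are hypothetical, and the derived per-cluster throughput is a rough estimate implied by those figures, not a measured result.

```python
# Back-of-the-envelope check of linear scaling under image partitioning.
# Assumption: throughput grows in direct proportion to the number of
# 32-tile clusters. All function names here are illustrative only.

TILES_PER_CLUSTER = 32  # cluster size stated in the text


def fractional_clusters(total_tiles: float) -> float:
    """Number of 32-tile clusters represented by a given tile count."""
    return total_tiles / TILES_PER_CLUSTER


def implied_cluster_throughput(total_mpps: float, total_tiles: float) -> float:
    """Per-cluster throughput (MP/s) implied by a total throughput and
    tile count, assuming ideal linear scaling."""
    return total_mpps / fractional_clusters(total_tiles)


def scaled_throughput(num_clusters: int, per_cluster_mpps: float) -> float:
    """Total throughput (MP/s) for a given number of clusters, assuming
    each cluster contributes the same throughput."""
    return num_clusters * per_cluster_mpps


if __name__ == "__main__":
    # Approximate figures quoted in the text; both are given as "~" values.
    per_cluster = implied_cluster_throughput(total_mpps=220.0, total_tiles=275.0)
    print(f"Implied per-cluster throughput: {per_cluster:.1f} MP/s")
    for n in (1, 2, 4, 8):
        print(f"{n} cluster(s): {scaled_throughput(n, per_cluster):.1f} MP/s")
```

Because both quoted figures are approximate, treat the result as an order-of-magnitude check on the cost model rather than a performance prediction.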