System Partitioning of a Hough Transform on AI Engine - 2024.1 English

Vitis Tutorials: AI Engine

Document ID
XD100
Release Date
2024-06-19
Version
2024.1 English



AI Engine Development

See Vitis™ Development Environment on xilinx.com See Vitis™ AI Development Environment on xilinx.com

System Partitioning of a Hough Transform on AI Engine

Version: Vitis 2024.1

Table of Contents

  1. Introduction

  2. Hough Transform Matlab Model

  3. System Partitioning

  4. Conclusions

References

Support

License

Introduction

This tutorial walks through the process of planning the implementation of a well-known image processing algorithm, mapping and partitioning it to the resources available in a Versal Adaptive SoC device. The goal is to partition the compute workloads of the application across the heterogeneous compute domains of the device, namely the Processor Subsystem (PS), the Programmable Logic (PL) and the AI Engine (AIE). We then identify a “data flow” through the device to pass data between storage locations and these compute domains. This requires analysis of the Compute, Storage, and Bandwidth requirements of the system algorithms and aligning those to the available device resources to discover a workable system solution. We call this “System Partitioning”, and we will illustrate this process using the well-known Hough Transform.

What is the Hough Transform?

The Hough Transform is a feature extraction technique for computer vision & image processing. It was invented in 1959 to detect lines in the machine analysis of monochromatic bubble chamber photographs, was patented in 1962, and was popularized & extended in support from lines to other objects by Duda & Hart in 1972 [1]. We only consider the line detection style of Hough Transform in this tutorial.

The Hough Transform detects lines in an image using a parameteric representation by transforming the line in 2D from its normal $(x,y)$ coordinates into a new $(\rho,\theta)$ domain where $\rho$ represents the line drawn from the origin to where it meets the line at a 90 degree angle, and $\theta$ identifies the angle that perpendicular line makes to the x-axis. This is shown in the diagram below. Notice how all points on the red line have the same $(\rho,\theta)$ values. Consequently, if every pixel in the image is associated with a $(\rho,\theta)$ pair, then lines in the original image may be identified in the $(\rho,\theta)$ plane by observing the relative occurance of these pairs – via a histogram of the $(\rho,\theta)$ data. These histogram statistics are collected by the Hough Transform in the $(\rho,\theta)$ plane to identify lines in the original image.

figure

What is System Partitioning?

We use System Partitioning to architect an embedded system for hetergeneous compute by partitioning the application workload across the various compute elements in the device. Once partitioned, a workable “data flow” identifies the path taken between subsystems in the device as data moves between storage and compute resources. Along the way, I/O bandwidth is managed carefully within interface limits. Data storage locations are identified based on suitability of interface bandwidths, storage depths, and compute requirements. In the end, we identify a workable system solution that eliminates risk from the device selection. This is all accomplished using a combination of analysis and prototyping work without implementing the full design. This solution scope is outlined in the diagram below.

figure

System Partitioning Methodology

We define the following methodology to steer our system partitioning analysis activities:

  1. Requirements Gathering – an analysis step to compare system/algorithm requirements against device resources

  2. Solution Synthesis – a conceptual step to identify possible solutions based on specific partitions and their resulting data flows

  3. Partitioning Validation – a feasibility assessment driven by detailed analysis & prototyping to identify shortcomings and reduce risk

  4. Iterate to System Feasibility – revise, rework, and re-envision until a suitable low-risk solution is identified

Some aspects of each phase is outlined in the diagram below.

figure

Hough Transform Matlab Model

A proper algorithm model is required for system partitioning. We start with the Matlab model shown below. A detailed study of this model identifies key aspects of the system design that impact its solution. For example:

  • The overall compute load complexity is driven by the image size through $R$ and $C$ dimensions.

  • The resolution adopted for $\theta$ through the theta_res parameter drives complexity, bandwidth, and histogram storage.

  • The algorithm exhibits a “dual-nested for-loop” character.

  • The algorithm employs lookup tables through cos_theta and sin_theta

  • The algorithm has been quantized to use int16 data types.

  • There are multiple compute workloads: for rho_i, addr and histogram update H

function [H,theta,rho,rho_calc] = hough_model( BW, theta_res )
   if     (nargin == 1) theta_res = 180; 
   elseif (nargin ~= 2) error('hough_model(BW,theta_res)'); end
   [R,C] = size(BW);
   rho_max = ceil(sqrt((R-1)^2 + (C-1)^2));
   fprintf(1,'rho_max: %g\n',rho_max);
   % Initialize:
   tmp = linspace(-90,90,1+theta_res);
   theta = tmp(1:end-1);
   cos_theta = double(fi(cos(theta*pi/180),1,16,15,'RoundingMethod','Round','OverflowAction','Saturate'));
   sin_theta = double(fi(sin(theta*pi/180),1,16,15,'RoundingMethod','Round','OverflowAction','Saturate'));
   rho = [-rho_max:1:rho_max];
   Nt = numel(theta);
   Nr = numel(rho);
   H = zeros(Nr,Nt); 
   rho_calc = zeros(R,C,Nt);
   % Compute transform:
   for yy = 0 : R-1
     for xx = 0 : C-1
       % Compute the 'rho' value:
       rho_i = double(fi(xx*cos_theta + yy*sin_theta,...
                         1,16,0,'RoundingMethod','Round','OverflowAction','Saturate'));
       rho_calc(1+yy,1+xx,:) = rho_i;
       % Add offset to turn onto 'rho' address:
       addr = rho_i + rho_max;
       for ii = 1 : Nt
         H(1+addr(ii),ii) = H(1+addr(ii),ii) + BW(1+yy,1+xx);
     end % yy
   end %xx
end

The Matlab model may be run and its performance compared to the built-in Matlab function hough found in the Image Processing Toolbox. Here, we run with a $216\times 240$ image of the AMD Instinct and show “heat maps” of the 2D Hough Transform output histograms for both the Matlab function and our AMD 16-bit Matlab model.

figure

System Partitioning

This section illustrates the details of system partitioning for the Hough Transform. We consider the ultimate system goals, techniques to parallelize the algorithm over multiple AI Engine tiles, analyzing storage, compute, and bandwidth and their impacts on the partitioning choices. A spreadsheet analysis steers us to some early conclusions on what may be feasible, but some detailed prototyping work is required to refine these estimates to finally identify a more accurate scoping of how many AI Engine resources are required to achieve the design objectives.

Goals

Our goal here for this tutorial is to identify the “best we can do” using only AI Engine resources to implement the Hough Transform. To this end, we set a target throughput requirement of 220 Mpixel/sec (or Mpps) and pose the question: “how many AI Engine tiles are required”? From our Matlab model above, we know the image size and $\theta$ resolution will be key parameters driving compute, bandwidth, and storage. With this in mind, we brainstorm solutions for how we might parallelize the Hough Transform algorithm across multiple AI Engine tiles.

Parallelizing Over “Image Tiles”

One obvious way to increase throughput is to parallelize the image over AI Engine tiles directly such that each tile sees only a small portion of the original image. This is shown in the diagram below. In this way, each tile computes a full Hough transform for the portion of the image that it sees. This yields a linear reduction in its compute workload and a similar reduction in the input bandwidth delivered to each tile. There is no reduction in tile memory for histogram storage. One consequence of this approach is that we must combine the histogram outputs from each tile, resulting in an additional compute workload to combine together all tile results.

figure

Parallelizing Over “Theta”

An alternative strategy partitions the Hough Transform such that each tile sees the full image but computes only a partial transform over a subset of the $\theta$ range. This also leads to a linear reduction in compute, but does not achieve any input bandwidth reduction. This scheme benefits from a linear reduction in the tile histogram storage. We must “collect” the histogram outputs from each tile, but there is no extra compute workload involved with this approach.

figure

Analyzing Storage Requirements

Having identified some possible parallelization schemes, we next dive into the Requirements Gathering phase of System Partitioning to analyze storage requirements. Details of this analysis are outlined in the diagram below. Following the “Parallelizing over Theta” strategy, we find a single full histogram cannot fit into the 32 KB local tile memory. We must partition over multiple tiles to make storage feasible. We find the total histogram storage for our image size is 162 KB, and we require at least 8 tiles to reduce this to manageable levels.

figure

Analyzing Compute Requirements

Next, we use a spreadsheet analysis to assess compute requirements. We load the system input parameters on the left side of the spreadsheet shown below and we analyze compute parameters on the right side. We find it useful to tabulate the numbers of processor cycles required by each loop body in our original Matlab model of the Hough Transform. Based on the AI Engine compute capacity of 32 MACs/cycle for int16 data types, we realize we can process 2 MACs/pixel per $\theta$ value in real time. Based on these vector estimates, the spreadsheet tells us we must process 5.7 cycles per pixel to meet our 220 Mpps throughput objective. This is equivalent to 45 cycles for the vector processor with its 8 lanes SIMD execution. Our compute bound for the vector processor is very high at 5000 Mpps. However, assuming an 8-cycle “read-modify-write” instruction for updating the histogram tables in the third compute workload, we find our throughput limited by the scalar processor to 39 Mpps if using 32 tiles. When projected to more tiles, we fail to reach our 220 Mpps target with even 128 tiles.