The following MATLAB code shows the QRD algorithm adopted in this tutorial. It contains an inner loop and an outer loop. Its loop structure has been modified for more efficient AI Engine implementation. The outer loop updates of R(kk,kk)
and Q(:,kk)
have been moved from after to before the inner loop. Also, the inner loop indices now run from kk+1
to C
instead of from 1
to kk-1
. This admits an easier to implement form of software pipelining in which a single AI Engine tile may be assigned to compute a single iteration of each outer loop body. For example, the first tile computes the R(1,1)
and Q(:,1)
updates and then performs all inner loop computations corresponding to kk=1
. This tile will then pass all values of R
and Q
to the next tile in the chain. The second tile will accept these inputs but only update R(2,2)
and Q(:,2)
and all inner loop computations corresponding to kk=2
. This approach requires a total of \(8\) tiles to process the $128\times 8$ snapshot matrix $\textbf{A}$ for MUSIC.
function [Q,R] = qrd_mgssr_hw_model(A)
[R,C] = size(A);
Q = single(A);
R = single(zeros(C,C));
% Here, 'kk' now represents the tile
% --> Each tile performs its outer loop and then performs the inner loop iterations for all
% columns that follow
for kk = 1 : C
R(kk,kk) = norm(Q(:,kk));
Q(:,kk) = Q(:,kk) * (1.0/R(kk,kk));
for ii = kk+1 : C
R(kk,ii) = transpose(conj(Q(:,kk))) * Q(:,ii);
Q(:,ii) = Q(:,ii) - R(kk,ii) * Q(:,kk);
end
end
end
It turns out this \(8\)-tile solution does not provide sufficient compute capacity to achieve the 1 $\mu s$ throughput objective of the tutorial. Additional throughput can be achieved by partitioning each inner loop body to its own tile. Similarly, each outer loop body may be partitioned to its own tile. The algorithm exhibits \(C\) outer loop iterations, and \(C-k\) inner loop body iterations for each outer loop \(k\). It follows the total # of tiles required is \(C + C(C-1)/2\). For the \(8\) columns here, this equals $8+8\times7/2=36$ tiles.
The following diagram shows the AI Engine floorplan for this \(36\)-tile solution. Here, no attempt has been made to floorplan the design; the tools elect by default to simply use the second row for buffers. These could be co-located in the first row for many of the tiles; indeed this occurs in the final floorplan shown below.
The following diagram shows additional details of the AI Engine QRD \(norm()\) kernel code. The code is partitioned into three separate workloads:
The “Initialization” code accepts the \(R\) and \(Q\) inputs from the previous tile and initializes \(R\) to zero for the first tile.
The “QRD Norm” code computes the \(norm()\) required by the QRD outer loop body, updates the appropriate \(Q\) column.
The “Output” code delivers the updated \(R\) and \(Q\) values to the following tile. The \(Q\) is not returned by the last tile.
Note that kernels with indices \(0,8,15,21,26,30,33,35\) perform the outer loop \(norm()\) operations whereas the remaining tiles compute the inner loop bodies. Only the upper triangular portion of the \(R\) matrix is updated as it is passed through the AI Engine pipeline.
The following diagram shows additional details of the AI Engine QRD \(qr()\) kernel code. The code is partitioned into three separate workloads:
The “Initialization” code accepts the \(R\) and \(Q\) inputs from the previous tile.
The “QR” code computes the dot product between columns
Q(i)
andQ(m)
and then updatesQ(i)
based on the result.The “Output” code delivers the updated \(R\) and \(Q\) values to the following tile.