The following MATLAB code shows the SVD algorithm adopted in this tutorial. It contains three nested loops. The outer-most loop performs identical “iterations.” This tutorial performs a fixed set of \(N_I=4\) iterations per SVD. The inner two loops admit a structure similar to the QRD analyzed above except all compute workloads are in the inner most loop only. here is no workload in the outer of these two loops. The inner loop workload involves computing a \(2\times2\) Jacobi rotation matrix Rot and then applying that matrix to both V and W.
function [U,S,V] = svd_one_sided_jacobi( A, max_iter )
if (nargin == 1) max_iter = 4;
elseif (nargin ~= 2) error('Incorrect signature');
end
% Perform one-sided Jacobi for MUSIC application (no need to compute 'U' in principle):
[m,n] = size(A);
V = single(eye(n));
W = single(A);
for iter = 1 : max_iter
for p = 1 : n-1
for q = p+1 : n
Rot = compute_rotation( W(:,p), W(:,q) );
V(:,[p,q]) = V(:,[p,q]) * Rot;
W(:,[p,q]) = W(:,[p,q]) * Rot;
end
end
end
% Compute singular values and 'U':
U = single(zeros(m,n));
S = single(zeros(n,n));
for ii = 1 : n
S(ii,ii) = sqrt(W(:,ii)'*W(:,ii));
U(:,ii) = W(:,ii)/S(ii,ii);
end
end
For completeness, the following MATLAB code defines the compute workload for the one-sided Jacobi rotation. This renders the two vectors Xv and Yv orthogonal. This workload contains some vectorizable dot product operations along with some \(sqrt()\), \(inv()\), and squaring operations. These can map to the AI Engine vector datapath or can leverage the non-linear hardware accelerator on the scalar datapath.
function [res] = calc_ei_2t(x,y)
R = sqrt(x^2+y^2);
res = complex(x/R,y/R);
end
function [res] = calc_ei_t(sin_2t,cos_2t)
R = sqrt((1+cos_2t)^2 + sin_2t^2);
res = complex((1+cos_2t)/R, sin_2t/R);
end
function [Rot] = calc_rot( eit, c, s )
Rot = [ eit *complex(0,s), eit*complex(c,0);
conj(eit)*complex(-c,0), conj(eit)*complex(0,-s) ];
end
function [Rot] = compute_rotation( Xv, Yv )
Hpp = real(transpose(Xv)*conj(Xv));
Hqq = real(transpose(Yv)*conj(Yv));
tmp = transpose(Xv)*conj(Yv);
Hrr = real(tmp);
Hjj = imag(tmp);
ei_2t1 = calc_ei_2t(Hjj,Hrr);
ei_t1 = calc_ei_t(imag(ei_2t1),real(ei_2t1));
tx = 0.5*(Hqq-Hpp);
ty = Hrr * imag(ei_2t1) + Hjj * real(ei_2t1);
ei_2t2 = calc_ei_2t(tx,ty);
ei_t2 = calc_ei_t(imag(ei_2t2), real(ei_2t2));
Rot = calc_rot( ei_t1, real(ei_t2), imag(ei_t2) );
end
You can parallelize the SVD across multiple AI Engine tiles in a manner similar to the QRD. The most aggressive scheme assigns a single AI Engine tile to each inner-most loop body—in essence the system of three nested loops fully flattens. It turns out this scheme is overkill for the throughput target of 1 \(\mu s\). Instead, it is possible to partition three inner loop body workloads to each AI Engine tile and still meet the requirement. This saves considerable resources; The SVD requires only \(38\) tiles in total.
The following screenshot shows the kernel object creation in the adf::graph of the SVD graph implementation in svd_graph.h. The code comments indicate which indices \((p,q)\) are assigned to each tile. Each tile is assigned three inner loop workloads. The last tile is only assigned two inner loop workloads. However, the last tile also performs the final workload to compute the singular values required by the MUSIC algorithm for identifying the noise subspace basis vectors.
The following diagram shows the AI Engine physical array view for the SVD subgraph. The data flow graph has a linear structure similar to the QRD graph, although requires less memory resources. This is because the \(\textbf{U}\), \(\textbf{S}\), and \(\textbf{V}\) matrices are all \(8\times 8\) in this case.