The focus of Lab B is on computing the summations of exponential functions used to determine the log-likelihood ratio (LLR) for each data bit. These are the equations
$$\displaystyle\sum_{s \in S_0} e^{-\frac{1}{\sigma^2}\left( \left( x - s_x \right)^2 + \left( y - s_y \right)^2 \right)}$$
and
$$\displaystyle\sum_{s \in S_1} e^{-\frac{1}{\sigma^2}\left( \left( x - s_x \right)^2 + \left( y - s_y \right)^2 \right)}$$
which are part of the LLR equation presented previously. For each data bit, the same group of exponential terms is used, but the sets $S_0$ and $S_1$ differ.
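For reference, the two sums combine to form the per-bit LLR along the following lines (a standard formulation shown here for convenience; verify the sign convention against the LLR equation presented previously):
$$\mathrm{LLR} = \ln\frac{\displaystyle\sum_{s \in S_0} e^{-\frac{1}{\sigma^2}\left( \left( x - s_x \right)^2 + \left( y - s_y \right)^2 \right)}}{\displaystyle\sum_{s \in S_1} e^{-\frac{1}{\sigma^2}\left( \left( x - s_x \right)^2 + \left( y - s_y \right)^2 \right)}}$$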
The 4 data bits assigned to each baseband symbol may be converted into a decimal value and used to index each symbol in the constellation diagram presented previously. For example, the symbol that is assigned data bits 0000 is labeled $s_0$, the symbol assigned data bits 0001 is labeled $s_1$, the symbol assigned data bits 0010 is labeled $s_2$, and so on. Each of the terms
$$e^{-\frac{1}{\sigma^2}\left( \left( x - s_x \right)^2 + \left( y - s_y \right)^2 \right)}$$
is computed from the received symbol and one of the reference constellation symbols. Let $e_i$ denote the term computed using constellation symbol $s_i$ for $i \in 0, \ldots, 15$.
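A single term can be computed in scalar form as follows (an illustrative sketch; the names rx, ry, sx, sy, and sigma2 are placeholders rather than the lab's identifiers):
#include <cmath>
// One exponential term e_i for received point (rx, ry), constellation
// symbol s_i = (sx, sy), and noise variance sigma2 (placeholder names).
float exp_term(float rx, float ry, float sx, float sy, float sigma2)
{
    float dx = rx - sx;
    float dy = ry - sy;
    return std::exp(-(dx * dx + dy * dy) / sigma2);
}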
The input to the loop that computes the desired values for each symbol is a float vector of length 16. The organization of the $e_i$ terms in this vector is shown in the following figure.
The output of the loop is a float vector of length 8 containing the summations over sets $S_0$ and $S_1$ for each data bit. This format is shown below.
The constellation points used in the summations for each data bit are shown in the following figure, where green circles denote the symbols comprising set $S_0$ and red circles denote the symbols comprising set $S_1$ for each data bit.
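In scalar form, the eight outputs amount to the following computation (an illustrative sketch, assuming membership in $S_0$ or $S_1$ for data bit $b$ is determined by bit $b$ of the symbol index; confirm the exact bit ordering against the figure):
float e[16];            // the sixteen exponential terms e_0..e_15 (assume loaded)
float sum0[4] = {0.0f}; // S_0 sums for data bits 0..3
float sum1[4] = {0.0f}; // S_1 sums for data bits 0..3
for (int i = 0; i < 16; i++)
    for (int b = 0; b < 4; b++)
    {
        if ((i >> b) & 1)
            sum1[b] += e[i]; // bit b of index i is 1: s_i is in S_1
        else
            sum0[b] += e[i]; // bit b of index i is 0: s_i is in S_0
    }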
The AI Engine API provides functions that appear ideal for this computation. The functions filter_even() and filter_odd() accept an input vector and produce an output vector half its size, extracting elements according to a specified pattern. For the sum of exponentials, the input is a vector of 16 float values and the output is a vector of 8 float values representing the sums over $S_0$ and $S_1$ for each data bit. The other function of interest is reduce_add(), which sums the elements of a vector.
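As a quick illustration of these functions on the 16-term input vector (a minimal sketch; v is a placeholder name for the loaded input):
aie::vector<float,16> v;                          // sixteen terms e_0..e_15 (assume loaded)
aie::vector<float,8> lo = aie::filter_even(v, 8); // keeps elements 0..7
aie::vector<float,8> hi = aie::filter_odd(v, 8);  // keeps elements 8..15
float sum_lo = aie::reduce_add(lo);               // e_0 + e_1 + ... + e_7
float sum_hi = aie::reduce_add(hi);               // e_8 + e_9 + ... + e_15
// Smaller step values interleave the selection: filter_even(v, 4) keeps
// elements 0..3 and 8..11, while filter_even(v, 1) keeps the even-indexed ones.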
Source files for this lab are found in <path to repo>/labB/src/. Open the file softdemod_kernel.cpp in a text editor and locate the loop in section 4A, shown below.
//-------------------------------------------------------------------------
// Begin section 4A
//-------------------------------------------------------------------------
// sum exponential components for 1 and 0 bit values according to
// constellation mapping
//-------------------------------------------------------------------------
auto pwbufC16 = aie::begin_restrict_vector<16>(wbufC); // input exponentials
for (unsigned i = 0; i < BUFSZ; i++)
    chess_prepare_for_pipelining
    chess_loop_count(BUFSZ)
{
    aie::vector<float,8> expsum; // holds exponential sums
    expsum[0] = aie::reduce_add(aie::filter_even(*pwbufC16, 8));
    expsum[1] = aie::reduce_add(aie::filter_even(*pwbufC16, 4));
    expsum[2] = aie::reduce_add(aie::filter_even(*pwbufC16, 2));
    expsum[3] = aie::reduce_add(aie::filter_even(*pwbufC16, 1));
    expsum[4] = aie::reduce_add(aie::filter_odd( *pwbufC16, 8));
    expsum[5] = aie::reduce_add(aie::filter_odd( *pwbufC16, 4));
    expsum[6] = aie::reduce_add(aie::filter_odd( *pwbufC16, 2));
    expsum[7] = aie::reduce_add(aie::filter_odd( *pwbufC16++, 1));
    *pwbufB8++ = expsum;
}
pwbufB8 -= BUFSZ; // reset iterator for next loop
//-------------------------------------------------------------------------
// End section 4A
//-------------------------------------------------------------------------
Take a moment to examine the kernel code. Notice how the AI Engine API functions provide a concise description of a fairly complicated addition process. To verify that the design performs the computation correctly, cd to the directory <path to repo>/labB and enter the following:
$ make x86com
$ make x86sim
$ make x86check
This compiles the design using the x86 compiler and runs a simulation to functionally verify that the computations are performed correctly. The final command executes a Python script that reports how accurately the kernel computes results. You should observe something similar to the following:
Since the design functions correctly, it can be compiled for the AI Engine and analyzed using the commands:
$ make aiecom
$ make loop_ii
After entering these commands, you may observe something similar to the following:
The results are not as expected. There are 8 loops in the kernel, but only 7 are listed in the report. To gain more insight, profile the code and examine the resulting microcode in Vitis Analyzer. This is accomplished with the commands:
$ make profile
$ make analyze
The simulation takes some time to complete; it should finish when the timer displayed in the terminal reaches approximately 270 us. Once Vitis Analyzer opens, select Profile > Profile Details to display the microcode, and locate the sixth loop in the kernel. You should see something similar to the following:
The microcode generated from the loop 6 kernel code spans 332 lines and contains loops nested within the loop. One of these, highlighted by the yellow rectangle, appears to consist primarily of scalar processor operations. A seven-cycle spacing between consecutive VFPMAC operations, highlighted by the red rectangle, indicates the vector processor pipeline is not being utilized efficiently. Although the AI Engine API functions produce concise source code, the implementation generated for the AI Engine is unusable in this case.
Perhaps AI Engine intrinsics can provide a more efficient solution. One issue with the API-based version is the need to perform addition across vector lanes. Can the calculation be organized in a way that avoids this? There are 8 summations required and 8 lanes in the floating-point vector processor. If data can be routed to the lanes as needed, the vector processor can compute all 8 sums in parallel. This can be achieved with the fpadd() intrinsic: given an input vector of 16 exponential values, fpadd() can select 8 of them at a time to accumulate. This process is illustrated in the following figure:
In the diagram, each row shows the exponential values that must be fed into fpadd() on one invocation to compute the sums for the bits assigned to the columns. Open the file softdemod_kernel.cpp found in the <path to repo>/labB/src/ directory. Comment out section 4A of the source code and uncomment section 4B. This should look like:
//-------------------------------------------------------------------------
// Begin section 4B
//-------------------------------------------------------------------------
// sum exponential components for 1 and 0 bit values according to
// constellation mapping
//-------------------------------------------------------------------------
auto pwbufC16 = aie::begin_restrict_vector<16>(wbufC); // input exponentials
for (unsigned i = 0; i < BUFSZ; i++)
    chess_prepare_for_pipelining
    chess_loop_count(BUFSZ)
{
    aie::vector<float,8> expsum = aie::zeros<float,8>(); // accumulator register
    //>>>>>>>>>>>>ENTER CODE HERE<<<<<<<<<<<<
    *pwbufB8++ = expsum;
}
pwbufB8 -= BUFSZ; // reset iterator for next loop
//-------------------------------------------------------------------------
// End section 4B
//-------------------------------------------------------------------------
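As a starting point, the loop body might be structured along these lines (a sketch only, not the provided solution; the OFFS0..OFFS7 lane-offset constants are placeholders that must be derived from the routing figure above, and the usual conversions between aie::vector and the native vector types are assumed):
v16float exps = *pwbufC16++;       // load the sixteen exponential terms
v8float  acc  = expsum;            // start from the zeroed accumulator
acc = fpadd(acc, exps, 0, OFFS0);  // accumulate 1st selection of 8 terms
acc = fpadd(acc, exps, 0, OFFS1);  // accumulate 2nd selection of 8 terms
// ... six more fpadd() calls, one per remaining row of the figure ...
expsum = acc;                      // write back as aie::vector<float,8>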
Using the fpadd() intrinsic, write code at the indicated location to perform the desired computation. A solution is provided in the file <path to repo>/labB/solution/softdemod_kernel_solution.cpp. Once the source code changes are complete, verify functionality by entering the commands:
$ make clean
$ make x86com
$ make x86sim
$ make x86check
Once your code is verified to be performing the computation correctly, build the kernel for the AI Engine with the commands:
$ make aiecom
$ make loop_ii
This should display something similar to:
Notice that the correct number of loops is now listed, with the loop of interest highlighted. It appears the minimum initiation interval (II) is 8 while the achieved II is 23. This suggests there are opportunities for further optimization, so it is worth examining the microcode. Simulate the design to obtain profile data and open Vitis Analyzer with the commands:
$ make profile
$ make analyze
The simulation should run until the timer displayed in the terminal reaches approximately 40 us. Once Vitis Analyzer opens, click Profile > Profile Details and locate the loop of interest in the microcode. It should look similar to:
Notice that the VFPMAC operations are spaced at least two cycles apart. This may reflect the fact that the floating-point vector accumulator has a two-cycle latency. Perhaps alternating between two accumulators would help; a sketch of the idea follows. Feel free to explore further optimizations for this design.
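For example, consecutive fpadd() operations could be made independent by alternating between two accumulators and combining them at the end (an illustrative sketch with placeholder OFFS* constants, where exps holds the sixteen terms as before):
v8float acc0 = null_v8float();
v8float acc1 = null_v8float();
acc0 = fpadd(acc0, exps, 0, OFFS0); // even-numbered selections go to acc0
acc1 = fpadd(acc1, exps, 0, OFFS1); // odd-numbered selections go to acc1
acc0 = fpadd(acc0, exps, 0, OFFS2);
acc1 = fpadd(acc1, exps, 0, OFFS3);
// ... remaining selections alternate in the same way ...
expsum = aie::add(aie::vector<float,8>(acc0), aie::vector<float,8>(acc1));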