Vitis Tutorials: AI Engine

Document ID: XD100
Release Date: 2024-06-19
Version: 2024.1 English

Implementing an IIR Filter on the AI Engine - Part 2a

Version: Vitis 2024.1

Preliminaries

In Part 1a, we focused on vectorizing the calculation for a second-order section of an IIR filter using Emulation-SW mode. Using eqn. (4) in Part 1a, we can calculate eight consecutive outputs by multiplying an 8x12 matrix of constants with a 12x1 vector (composed of eight consecutive inputs and four states).

Fig. 1

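According to Fig. 26 of AM009, the floating-point vector processor can perform eight multiply-accumulate operations on floating-point operands in two cycles, in stages E6 and E7.

Note: The red dashed arrow in the figure indicates the feedback path for the accumulator. The 12 multiply-accumulate operations of the matrix-vector product all accumulate into the same register through this path, so each one must wait for the result of the previous one. Thus, ideally, 12*2=24 cycles would be the minimum required to calculate eight floating-point outputs.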
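The cycle estimate above can be written out as a small compile-time check. This is only a sketch of the arithmetic: the helper name min_serial_cycles and the constants are hypothetical and do not appear in the tutorial sources.

```cpp
// Hypothetical helper (not from the tutorial sources): lower bound on the
// cycle count when every MAC must wait for the previous accumulator result.
constexpr unsigned min_serial_cycles(unsigned num_macs, unsigned cycles_per_mac) {
    return num_macs * cycles_per_mac;
}

// 12 columns of the 8x12 matrix -> 12 MACs, each taking 2 cycles.
constexpr unsigned kMinCyclesPerCall = min_serial_cycles(12, 2);
static_assert(kMinCyclesPerCall == 24, "12 MACs x 2 cycles = 24 cycles");

// Each call produces 8 outputs, so the ideal cost is 3 cycles per sample.
constexpr unsigned kIdealCyclesPerSample = kMinCyclesPerCall / 8;
static_assert(kIdealCyclesPerSample == 3, "24 cycles / 8 samples = 3");
```

This bound ignores loads, stores, and loop overhead, so a measured cycle count can only be larger.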

Fig. 2

In this and succeeding sections, we attempt to minimize latency and maximize throughput while showing the typical steps to analyze and optimize a design.

Notes:

  • The Julia script aie_iir_2a.jl generates the coefficients for the specified IIR filter, and the header file required by the program. It also generates an impulse signal and the filter’s response.

    • The generated header file should be moved to the src directory.

    • The generated *.dat files should be moved to the data directory.

  • The Julia script check.jl calculates the difference between the golden impulse response generated by aie_iir_2a.jl and the output of the AI Engine.

Kernel Code

As a first step, we use the following kernel code:

template<unsigned id>
void SecondOrderSection(
	adf::input_buffer<float> & __restrict idata,	// 8 input samples per iteration
	adf::output_buffer<float> & __restrict odata,	// 8 output samples per iteration
	const float (&C)[96]	// RTP port for coefficient matrix
) {

	static Vector8f state_reg = aie::zeros<float, 8>();	// clear states

	// input/output iterators
	auto inIter = aie::begin_vector<8>(idata);
	auto outIter = aie::begin_vector<8>(odata);

	Vector8f xreg_hi = *inIter++;		// fetch input samples
	Vector16f xreg = aie::concat(state_reg, xreg_hi);	// xreg[4]: ym2; xreg[5]: ym1; xreg[6]: xm2; xreg[7]: xm1; xreg[8:15]: x0:x7
	Vector8f coeff;
	VAcc8f acc = aie::zeros<accfloat, 8>();

	for (auto i = 0; i < 12; i++) {
		coeff = aie::load_v<8>(&C[8 * i]);
		float xval = xreg[i + 4];
		acc = aie::mac(acc, coeff, xval);
	} // end for (auto i = 0; i < 12; i++)

	Vector8f yout = acc;	// transfer accumulator register to vector register to update states

	// update states
	state_reg = xreg_hi;
	state_reg[4] = yout[6];
	state_reg[5] = yout[7];

	*outIter++ = yout;

} // end SecondOrderSection()

The for loop scales each column of the coefficient matrix with an element in xreg and accumulates the result. This performs the matrix and vector multiplication in eqn. (4).
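To see what the loop and the state update compute without the AI Engine API, the following scalar model may help. It is a minimal sketch, assuming the coefficient array stores the 8x12 matrix column by column (as the load from &C[8 * i] suggests) and that the state layout matches the comment in the kernel. The function name sos_block_ref is hypothetical and not part of the tutorial sources.

```cpp
#include <array>
#include <cstddef>

// Scalar reference model of one kernel call (hypothetical helper).
// C[8*i + r] is assumed to hold row r, column i of the 8x12 matrix.
// state[4..7] = {ym2, ym1, xm2, xm1}; state[0..3] are unused.
std::array<float, 8> sos_block_ref(const std::array<float, 96>& C,
                                   std::array<float, 8>& state,
                                   const std::array<float, 8>& x) {
    // Build the 12x1 vector {ym2, ym1, xm2, xm1, x0, ..., x7}.
    std::array<float, 12> v{};
    for (std::size_t i = 0; i < 4; ++i) v[i] = state[i + 4];
    for (std::size_t i = 0; i < 8; ++i) v[i + 4] = x[i];

    // Scale column i by v[i] and accumulate; after 12 columns,
    // y holds the full matrix-vector product.
    std::array<float, 8> y{};
    for (std::size_t i = 0; i < 12; ++i)
        for (std::size_t r = 0; r < 8; ++r)
            y[r] += C[8 * i + r] * v[i];

    // Update states as in the kernel: the last two inputs stay in
    // state[6..7], the last two outputs go to state[4..5].
    state = x;
    state[4] = y[6];
    state[5] = y[7];
    return y;
}
```

Calling this function once per block of eight samples, with the same state array passed each time, mirrors how the kernel carries its static state register from one iteration to the next.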

Testbench Code

#include "kernel.hpp"
#include "graph.hpp"
#include "C1.h"

using namespace std;

using namespace adf;

// specify the dataflow graph (DFG)
the_graph my_graph;

const unsigned num_pts = 256;			// number of sample points in "input.dat"
const unsigned num_iterations = num_pts/8;	// number of iterations to run

// main simulation program
int main() {

	my_graph.init();				// load the DFG into the AI Engine array, establish connectivity, etc.

	my_graph.update(my_graph.cmtx1, C1, 96);	// transfer coefficients

	my_graph.run(num_iterations);	// run the DFG for the specified number of iterations

	my_graph.end();					// terminate AI Engine processing

	return (0);

} // end main()

The testbench

  • initializes the graph.

  • loads the filter coefficients.

  • runs the graph for 32 iterations (256 samples at 8 samples per iteration).

  • terminates all processing.

Analysis

We begin by opening the launch.json file under Settings in the Vitis Components pane. Select Part2a_aiesim_1 to view the AIE Simulator parameters and check the box for Enable Profile. Build, then run the simulation.

Fig. 3

After the simulation completes, the “goodness” of the result can be checked by running:

$ julia check.jl aie

The result is “good” when maximum(abs.(err)) is less than eps(Float32), the single-precision machine epsilon.
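The same pass criterion can be sketched in C++. This is not a translation of check.jl, only an illustration of the comparison it is described as making. The function names max_abs_err and is_good are hypothetical, and reading the *.dat files is left out. Julia's eps(Float32) and std::numeric_limits<float>::epsilon() are the same value, 2^-23.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest absolute difference between two sequences (hypothetical helper).
// Only the common prefix is compared; is_good() rejects length mismatches.
float max_abs_err(const std::vector<float>& golden,
                  const std::vector<float>& actual) {
    const std::size_t n = std::min(golden.size(), actual.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        worst = std::max(worst, std::fabs(golden[i] - actual[i]));
    return worst;
}

// "Good" means same length and maximum error below float machine epsilon.
bool is_good(const std::vector<float>& golden,
             const std::vector<float>& actual) {
    return golden.size() == actual.size() &&
           max_abs_err(golden, actual) < std::numeric_limits<float>::epsilon();
}
```

An absolute threshold this tight is reasonable for an impulse response whose samples are of order one or smaller. For larger signal amplitudes, a relative tolerance would be the more usual choice.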

To view the profiler result, in the FLOW pane, under AIE SIMULATOR / HARDWARE, expand REPORTS (below the Debug icon) and click on Profile.

Fig. 4

In the AIE SIMULATION pane, click on Total Function Time to show the number of cycles consumed by each function.

Fig. 5

Note: The kernel function SecondOrderSection<1> was executed 32 times and ran for a total of 2,313 cycles, so each function call consumed 2,313/32 = 72.28 cycles on average. The minimum function time is 72 cycles and the maximum is 81 cycles. This implies that the first call consumed nine more cycles than each of the other 31 calls (81 + 31 * 72 = 2,313).
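The profiler figures can be cross-checked with a few lines of C++. This is a sketch of the arithmetic only. The struct CallProfile and the helper names are hypothetical, and the numbers are the ones reported for this run.

```cpp
// Per-function profiler figures (hypothetical container, values from this run).
struct CallProfile {
    unsigned calls;         // number of invocations
    unsigned total_cycles;  // Total Function Time
    unsigned min_cycles;    // fastest call
    unsigned max_cycles;    // slowest call
};

// Average cycles per call.
constexpr double avg_cycles(const CallProfile& p) {
    return static_cast<double>(p.total_cycles) / p.calls;
}

// Total expected if exactly one call takes max_cycles and the rest min_cycles.
constexpr unsigned one_slow_call_total(const CallProfile& p) {
    return p.max_cycles + (p.calls - 1) * p.min_cycles;
}

constexpr CallProfile kSosProfile{32, 2313, 72, 81};

// 81 + 31 * 72 = 2,313, so a single slow call accounts for the whole total.
static_assert(one_slow_call_total(kSosProfile) == kSosProfile.total_cycles,
              "one 81-cycle call plus 31 calls of 72 cycles");
```

Since the total matches exactly, every call other than the slowest one must have taken the minimum of 72 cycles.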

Another item of interest is the top-level main function, which calls my_graph.run(), which in turn calls SecondOrderSection<1>. The Total Function + Descendants Time (cycles) column shows the number of cycles consumed by a function, including all routines called within it. For main, this includes setting up the heap and stack, initialization, the actual processing, etc. For this implementation, 4,579 cycles were used to process 256 samples, or 4,579/256 = 17.89 cycles/sample. Assuming that the AI Engine runs with a 1 GHz clock, the throughput would be 1e9 cycles/sec / 17.89 cycles/sample ≈ 55.9 Msamples/sec.

Note: The main processing occurs in SecondOrderSection<1>, which consumes 2,313 cycles. Thus, 4,579 - 2,313 = 2,266 “overhead” cycles are not used for sample processing.
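The throughput and overhead figures follow from three short formulas, sketched below in C++. The function names are hypothetical, and the 1 GHz clock is the same assumption the text makes. Carrying full precision gives about 17.887 cycles/sample and about 55.9 Msamples/sec.

```cpp
// Cycles spent per processed sample (hypothetical helper).
constexpr double cycles_per_sample(unsigned total_cycles, unsigned samples) {
    return static_cast<double>(total_cycles) / samples;
}

// Throughput in Msamples/sec for a given clock rate in Hz.
constexpr double throughput_msps(double clock_hz, double cyc_per_sample) {
    return clock_hz / cyc_per_sample / 1.0e6;
}

// Cycles in main that are not spent inside the kernel function.
constexpr unsigned overhead_cycles(unsigned main_total, unsigned kernel_total) {
    return main_total - kernel_total;
}

// Values from this run: 4,579 cycles in main, 2,313 in the kernel, 256 samples.
constexpr double kCyclesPerSample = cycles_per_sample(4579, 256);
constexpr double kThroughputMsps = throughput_msps(1.0e9, kCyclesPerSample);
constexpr unsigned kOverhead = overhead_cycles(4579, 2313);
static_assert(kOverhead == 2266, "4,579 - 2,313 = 2,266 overhead cycles");
```

With these numbers the overhead is close to half of the total, so the throughput measured over main understates what the kernel alone sustains at 72 cycles per eight samples.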

Click Profile Details to view the generated assembly code.