Kernels that need to retain state from one invocation (iteration) to the next can store that state in global or static variables. However, variables with static storage class (global variables and static local variables) are a common cause of discrepancies between x86 simulation and AI Engine simulation.
The root cause of these discrepancies is that in x86 simulation the source files of all kernels compile into a single executable, whereas in AI Engine simulation each kernel targeting an AI Engine compiles independently. Consequently, if two kernels refer to a variable with static storage class and are mapped to the same AI Engine, both simulations share the variable.
If those kernels are mapped to different AI Engines, however, the variable is still shared in x86 simulation, while in AI Engine simulation each AI Engine has its own copy and there is no sharing. If the kernels both read from and write to the variable, the x86 and AI Engine simulations can produce mismatched results.
The preferred way of modeling state to be carried across kernel iterations is to use a C++ kernel class. See C++ Kernel Class Support in AI Engine Kernel and Graph Programming Guide (UG1079). Using a C++ kernel class prevents issues caused by variables with static storage class.
Alternatively, you can change the storage class of the global or static variable to thread_local, but only for the x86 simulation build. Each instance of the kernel then has its own copy of the variable in x86 simulation. This matches the behavior of AI Engine simulation when the kernels using the variable are mapped to different AI Engines.
In the following example, the kernel carries state across iterations via the global variable delayLine and the static local variable pos. This approach causes mismatches between x86 simulation and AI Engine simulation when multiple kernel instances use this source file. You can avoid the mismatches by changing the storage class of these variables to thread_local for x86 simulation.
Original kernel source code is as follows:
// fir.cpp
#include <adf.h>
using namespace adf;
cint16 delayLine[16] = {};
void fir(input_buffer<cint16> *in1,
output_buffer<cint16> *out1)
{
static int pos = 0;
..
}
Reworked kernel source code is as follows:
// fir.cpp
#include <adf.h>
#ifndef __X86SIM__
cint16 delayLine[16] = {};
#else
thread_local cint16 delayLine[16] = {};
#endif
using namespace adf;
void fir(input_buffer<cint16> *in1,
output_buffer<cint16> *out1)
{
#ifndef __X86SIM__
static int pos = 0;
#else
static thread_local int pos = 0;
#endif
..
}
Another possibility is to use the macro X86SIM_THREAD_LOCAL, which gives each kernel thread its own copy of the variable in x86 simulation. adf.h defines the macro as follows:
#ifdef __X86SIM__
#define X86SIM_THREAD_LOCAL thread_local
#else
#define X86SIM_THREAD_LOCAL
#endif
so the macro expands to thread_local only for x86 simulation and to nothing for AI Engine compilation.
With this macro, the reworked code simplifies as follows:
// fir.cpp
#include <adf.h>
X86SIM_THREAD_LOCAL cint16 delayLine[16] = {};
using namespace adf;
void fir(input_buffer<cint16> *in1,
output_buffer<cint16> *out1)
{
static X86SIM_THREAD_LOCAL int pos = 0;
..
}