Kernels that must retain state from one invocation (iteration) to the next can store this state in global or static variables. However, variables with static storage class, such as global variables and static variables, can cause discrepancies between x86 simulation and AI Engine simulation. The root cause is that for x86 simulation, the source files of all kernels are compiled into a single executable, whereas for AI Engine simulation each kernel targeting an AI Engine is compiled independently. Thus, if a variable with static storage class is referred to by two kernels and these kernels are mapped to the same AI Engine, the variable is shared in both x86 simulation and AI Engine simulation. However, if these kernels are mapped to different AI Engines, the variable is still shared in x86 simulation, but in AI Engine simulation each AI Engine has its own copy and nothing is shared. This leads to mismatches between x86 simulation and AI Engine simulation if the kernels both read and write the variable.
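The following standalone C++ sketch models only the x86 simulation side of this behavior: two hypothetical kernel functions, kernelA and kernelB, are compiled into one program and both update the same global, sharedCount. It does not use the ADF API and does not reproduce either simulator. If the two kernels instead ran on different AI Engines, each would own a private copy of the global and both would observe 1.

```cpp
#include <cassert>
#include <utility>

// Global with static storage duration. In a single executable (the x86
// simulation build) every function that names it shares this one copy.
int sharedCount = 0;

// Hypothetical kernels that both read and write the same global.
int kernelA() { return ++sharedCount; }
int kernelB() { return ++sharedCount; }

// Runs one iteration of each kernel from a fresh state and returns the
// value each kernel observed after its own increment.
std::pair<int, int> runOneIterationEach() {
    sharedCount = 0;
    int a = kernelA();  // reads 0, writes 1
    int b = kernelB();  // reads kernelA's 1, writes 2
    return {a, b};
}
```

Calling runOneIterationEach() returns the pair {1, 2}: kernelB sees the write made by kernelA, which is exactly the sharing that disappears when the kernels are placed on separate AI Engines.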
The preferred way to model state carried across kernel iterations is to use a C++ kernel class (see C++ Kernel Class Support), which avoids the pitfalls of variables with static storage class. Alternatively, the storage class of the global or static variable can be changed to thread_local, but for x86 simulation only. In that case, each instance of the kernel has its own copy of the variable in x86 simulation, which matches the behavior of AI Engine simulation when the kernels using the variable are mapped to different AI Engines. In the following example, the kernel carries state across kernel iterations through the global variable delayLine and the static variable pos. This causes mismatches between x86 simulation and AI Engine simulation if multiple kernel instances use this source file. The mismatch can be avoided by changing the storage class of these variables to thread_local when compiling for x86 simulation.
Original kernel source code:
// fir.cpp
#include <adf.h>
cint16 delayLine[16] = {};
void fir(input_window<cint16> *in1,
output_window<cint16> *out1)
{
static int pos = 0;
// ... (remainder of the kernel body)
}
Reworked kernel source code:
// fir.cpp
#include <adf.h>
#ifndef __X86SIM__
cint16 delayLine[16] = {};
#else
thread_local cint16 delayLine[16] = {};
#endif
void fir(input_window<cint16> *in1,
output_window<cint16> *out1)
{
#ifndef __X86SIM__
static int pos = 0;
#else
static thread_local int pos = 0;
#endif
// ... (remainder of the kernel body)
}
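The standalone sketch below illustrates why thread_local gives each kernel instance its own copy in x86 simulation. It assumes, as the text implies, that each kernel instance executes on its own thread in the x86 simulator; the two std::thread objects stand in for two kernel instances, and globalPos, tlPos, iterate, and runTwoInstances are hypothetical names. It is plain C++ with no ADF dependency and models only the x86 side.

```cpp
#include <cassert>
#include <thread>

// One copy for the whole process, like a plain global or static variable.
int globalPos = 0;
// One copy per thread, like the thread_local variables in the x86 build.
thread_local int tlPos = 0;

struct Observed {
    int global;  // value of globalPos after the increment
    int local;   // value of tlPos after the increment
};

// Hypothetical kernel iteration: advances both counters.
Observed iterate() { return {++globalPos, ++tlPos}; }

// Runs two stand-in kernel instances, each on its own thread. The second
// thread starts only after the first has been joined, so the result is
// deterministic.
void runTwoInstances(int iterations, Observed& last1, Observed& last2) {
    globalPos = 0;
    std::thread first([&] {
        for (int i = 0; i < iterations; ++i) last1 = iterate();
    });
    first.join();
    std::thread second([&] {
        for (int i = 0; i < iterations; ++i) last2 = iterate();
    });
    second.join();
}
```

With three iterations per instance, the second instance sees globalPos continue from 3 up to 6, while its thread_local tlPos starts again at 0 and ends at 3, the same as the first instance. That per-instance behavior is what the reworked fir.cpp relies on to match AI Engine simulation.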