Memory Model - 2024.2 English - UG1076

AI Engine Tools and Flows User Guide (UG1076)

Kernels that must retain state from one invocation (iteration) to the next can store that state in global or static variables. However, variables with static storage class, such as global variables and static variables, are a cause of discrepancies between x86 simulation and AI Engine simulation. The root cause is that for x86 simulation the source files of all kernels are compiled into a single executable, whereas for AI Engine simulation each kernel targeting an AI Engine is compiled independently. Thus, if a variable with static storage class is referred to by two kernels and those kernels are mapped to the same AI Engine, the variable is shared in both x86 simulation and AI Engine simulation. If those kernels are mapped to different AI Engines, however, the variable is still shared in x86 simulation, but in AI Engine simulation each AI Engine has its own copy and nothing is shared. This leads to mismatches between x86 simulation and AI Engine simulation if the variable is both read and written by the kernels.

The preferred way of modeling state to be carried across kernel iterations is to use a C++ kernel class (see C++ Kernel Class Support in AI Engine Kernel and Graph Programming Guide (UG1079)). This avoids the pitfall of variables with static storage class. Alternatively, the storage class of the global or static variable can be changed to thread_local, but only for x86 simulation. In this case, each instance of the kernel has its own copy of the variable in x86 simulation. This matches the behavior of AI Engine simulation if the kernels using the variable are mapped to different AI Engines. In the following example, the kernel carries state across kernel iterations via the global variable delayLine and the static variable pos. This causes mismatches between x86 simulation and AI Engine simulation if multiple kernel instances use this source file. The mismatches can be avoided by changing the storage class of these variables to thread_local.

Original kernel source code:

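// fir.cpp
#include <adf.h>
using namespace adf;

// Global state: a single copy per compiled executable.
cint16 delayLine[16] = {};

void fir(input_buffer<cint16> *in1,
         output_buffer<cint16> *out1)
{
  static int pos = 0;  // static state carried across iterations
  // ... (kernel body not shown)
}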

Reworked kernel source code:

// fir.cpp
#include <adf.h>
using namespace adf;

#ifndef __X86SIM__
cint16 delayLine[16] = {};
#else
thread_local cint16 delayLine[16] = {};
#endif

void fir(input_buffer<cint16> *in1,
         output_buffer<cint16> *out1)
{
#ifndef __X86SIM__
  static int pos = 0;
#else
  static thread_local int pos = 0;
#endif
  // ... (kernel body not shown)
}
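
To illustrate why thread_local changes the sharing behavior, the following is a minimal standalone sketch in plain C++ with no ADF headers. It assumes that each kernel instance runs on its own thread in x86 simulation, which is what would give each instance a private copy of a thread_local variable. The names sharedCalls, perThreadCalls, runIterations, and simulateInstances are hypothetical and exist only for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-ins for kernel state; none of this is ADF API.
// One copy for the whole process (atomic only to keep the demo race-free).
static std::atomic<int> sharedCalls{0};
// One copy per thread, zero-initialized when each thread starts.
static thread_local int perThreadCalls = 0;

// Hypothetical "kernel": bumps both counters once per iteration and
// returns the calling thread's private count.
int runIterations(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        ++sharedCalls;
        ++perThreadCalls;
    }
    return perThreadCalls;
}

struct Counts {
    int shared;                    // final value of the process-wide counter
    std::vector<int> perInstance;  // final private count of each instance
};

// Runs `instances` copies of the kernel, each on its own thread.
Counts simulateInstances(int instances, int iterations) {
    assert(instances >= 0 && iterations >= 0);
    sharedCalls = 0;
    std::vector<int> perInstance(instances, 0);
    std::vector<std::thread> workers;
    for (int k = 0; k < instances; ++k) {
        workers.emplace_back([&perInstance, k, iterations] {
            perInstance[k] = runIterations(iterations);
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return {sharedCalls.load(), perInstance};
}
```

With three instances and four iterations each, the process-wide counter accumulates all twelve calls, while each instance sees only its own four. This mirrors the difference between a shared static variable and per-instance thread_local state.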

Another possibility is to use the macro X86SIM_THREAD_LOCAL to make the global read/write thread safe. The macro is defined in adf.h as follows:

#ifdef __X86SIM__
#define X86SIM_THREAD_LOCAL thread_local
#else
#define X86SIM_THREAD_LOCAL
#endif

The macro therefore expands to thread_local only for x86 simulation and to nothing otherwise. With it, the previous code is simplified as follows:

// fir.cpp
#include <adf.h>
using namespace adf;

X86SIM_THREAD_LOCAL cint16 delayLine[16] = {};

void fir(input_buffer<cint16> *in1,
         output_buffer<cint16> *out1)
{
  static X86SIM_THREAD_LOCAL int pos = 0;
  // ... (kernel body not shown)
}
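
For comparison, the class-based approach recommended at the start of this section keeps the state in data members, so every kernel instance owns its state regardless of how the sources are compiled. The following is a rough sketch in plain C++ only: the class DelayLineKernel, its members, and the use of int samples are hypothetical stand-ins, and a real AI Engine kernel class would use ADF buffer types and cint16 data and be registered as described in UG1079.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical plain-C++ stand-in for a kernel class; not ADF API.
class DelayLineKernel {
public:
    static constexpr std::size_t kTaps = 16;

    // One "iteration": store the new sample in the delay line and return
    // the sum of all stored samples (a placeholder for real filter math).
    int run(int sample) {
        assert(pos_ < kTaps);
        delayLine_[pos_] = sample;
        pos_ = (pos_ + 1) % kTaps;
        int sum = 0;
        for (int v : delayLine_) {
            sum += v;
        }
        return sum;
    }

    // Current write position, exposed only for inspection.
    std::size_t position() const { return pos_; }

private:
    std::array<int, kTaps> delayLine_{};  // per-instance state, zeroed
    std::size_t pos_ = 0;                 // per-instance write index
};
```

Two objects of this class never observe each other's delay line or position, which is the property that global and static variables fail to guarantee across simulators.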