Design Introduction

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2026-03-27
Version: 2025.2 English

This design contains a graph with four AI Engine kernels. Each kernel has one input and one output, so four GMIO inputs and four GMIO outputs connect to the graph.

Change the working directory to perf_profile_aie_gmio. Take a look at the graph code in aie/graph.h.

static const int col[8]={2,6,10,18,26,34,42,46};
static const int NUM=4;

class topgraph: public adf::graph
	{
	public:
		adf::kernel k[NUM];
		adf::input_gmio gmioIn[NUM];	
		adf::output_gmio gmioOut[NUM];
		
		topgraph(){
			for(int i=0;i<NUM;i++){
				k[i] = adf::kernel::create(vec_incr);
				adf::source(k[i]) = "vec_incr.cc";
				adf::runtime<adf::ratio>(k[i])= 1;
				gmioIn[i]=adf::input_gmio::create("gmioIn"+std::to_string(i),/*size_t burst_length*/256,/*size_t bandwidth*/100);
				gmioOut[i]=adf::output_gmio::create("gmioOut"+std::to_string(i),/*size_t burst_length*/256,/*size_t bandwidth*/100);
				adf::connect<>(gmioIn[i].out[0], k[i].in[0]);	
				adf::connect<>(k[i].out[0], gmioOut[i].in[0]);
	
				adf::location<adf::kernel>(k[i])=adf::tile(col[i],0);
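				// qualify the names with adf:: so the code does not rely on a using-directive
				adf::location<adf::GMIO>(gmioIn[i]) = adf::location<adf::kernel>(k[i]) + adf::relative_offset({.col_offset=0});
				adf::location<adf::GMIO>(gmioOut[i]) = adf::location<adf::kernel>(k[i]) + adf::relative_offset({.col_offset=1});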
			}
		}
	};

In the previous code, each kernel has a location constraint (adf::location), and each GMIO input and output has a constraint relative to its kernel. These constraints place the GMIO ports in different columns, so that performance counters remain available when you profile all ports simultaneously with the event API.
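The column arithmetic behind these constraints can be checked with a small host-side sketch. It assumes that a relative offset simply adds col_offset to the kernel's column; the PortColumns struct and both helper functions are hypothetical names used only for illustration and are not part of the ADF API.

```cpp
#include <array>
#include <cassert>
#include <set>

// Kernel columns and port count copied from aie/graph.h.
constexpr std::array<int, 8> col = {2, 6, 10, 18, 26, 34, 42, 46};
constexpr int NUM = 4;

// Hypothetical helper type: columns requested for the GMIO ports of one kernel.
struct PortColumns {
    int in;   // kernel column + col_offset 0
    int out;  // kernel column + col_offset 1
};

// Assumption: the relative offset adds col_offset to the kernel column.
PortColumns gmioColumns(int i) {
    assert(i >= 0 && i < NUM);
    return {col[i] + 0, col[i] + 1};
}

// Returns true when no two GMIO ports request the same column.
bool gmioColumnsAreDistinct() {
    std::set<int> used;
    for (int i = 0; i < NUM; ++i) {
        const PortColumns p = gmioColumns(i);
        if (!used.insert(p.in).second) return false;
        if (!used.insert(p.out).second) return false;
    }
    return true;
}
```

Under that assumption, the four kernels in columns 2, 6, 10, and 18 request GMIO ports in columns 2, 3, 6, 7, 10, 11, 18, and 19, so no column is shared by two ports.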

Next, examine the kernel code aie/vec_incr.cc. It adds one to each of the 256 int32 inputs, then appends the 64-bit AI Engine tile cycle counter to the output as two 32-bit words, which is why the output buffer holds 258 values. You can use this counter later to calculate the system throughput.

using namespace adf;
void vec_incr(input_buffer<int32,extents<256>>& __restrict data,output_buffer<int32,extents<258>>& __restrict out){
	auto inIter=aie::begin_vector<16>(data);
	auto outIter=aie::begin_vector<16>(out);
	aie::vector<int32,16> vec1=aie::broadcast<int32>(1);
	for(int i=0;i<16;i++)
	chess_prepare_for_pipelining
	{
		aie::vector<int32,16> vdata=*inIter++;
		aie::vector<int32,16> vresult=aie::add(vdata,vec1);
		*outIter++=vresult;
	}
	aie::tile tile=aie::tile::current();
	unsigned long long time=tile.cycles();//cycle counter of the AI Engine tile
	//reinterpret the vector iterator as a scalar iterator that points just past the 256 results
	decltype(aie::begin(out)) p=*(decltype(aie::begin(out))*)&outIter;
	*p++=time&0xffffffff;//low 32 bits of the counter
	*p++=(time>>32)&0xffffffff;//high 32 bits of the counter
}
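On the host side, the two counter words at the end of each 258-word output block can be recombined into a 64-bit cycle count. The following is a minimal sketch, not code from the tutorial: cycleStamp and throughputBytesPerSec are hypothetical helper names, and the AI Engine clock frequency passed as clockHz is an assumption that you must replace with the value for your platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t DATA_WORDS = 256;   // incremented samples per iteration
constexpr std::size_t BLOCK_WORDS = 258;  // samples plus two counter words

// Rebuilds the 64-bit cycle count stored at the end of one output block.
std::uint64_t cycleStamp(const std::int32_t* out, std::size_t iteration) {
    assert(out != nullptr);
    const std::int32_t* block = out + iteration * BLOCK_WORDS;
    // Cast through uint32_t so a set top bit is not sign-extended.
    const std::uint64_t lo = static_cast<std::uint32_t>(block[DATA_WORDS]);
    const std::uint64_t hi = static_cast<std::uint32_t>(block[DATA_WORDS + 1]);
    return (hi << 32) | lo;
}

// Bytes per second moved between two cycle stamps.
// clockHz is an assumed AI Engine clock frequency, not a value from the tutorial.
double throughputBytesPerSec(std::uint64_t firstStamp, std::uint64_t lastStamp,
                             std::uint64_t bytes, double clockHz) {
    assert(lastStamp > firstStamp);
    const double cycles = static_cast<double>(lastStamp - firstStamp);
    return static_cast<double>(bytes) * clockHz / cycles;
}
```

The low word is read first because the kernel writes time&0xffffffff before the upper half. The bytes argument should count only the data transferred between the two stamps that you compare.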

Next, examine the host code sw/host.cpp. The concepts introduced in AIE GMIO Programming Model apply here, so this section covers only the new concepts and performance profiling. The code defines the following constants:

const int NUM=4;
int ITERATION=8192;
char* emu_mode = getenv("XCL_EMULATION_MODE");
if (emu_mode != nullptr) {
	ITERATION=4;
}
const int BLOCK_SIZE_in_Bytes=1024*ITERATION;
const int BLOCK_SIZE_out_Bytes=1032*ITERATION;

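The code sets ITERATION to 8192 for the hardware flow and reduces it to 4 when the XCL_EMULATION_MODE environment variable is set. The smaller value lets the AI Engine simulator finish in less time. The block sizes follow from the kernel buffers: each iteration consumes 256 int32 values (1024 bytes) and produces 258 int32 values (1032 bytes).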
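The constants 1024 and 1032 are the kernel buffer sizes in bytes. The sketch below reproduces that arithmetic in standard C++; chooseIterations, blockSizeInBytes, and blockSizeOutBytes are hypothetical helper names, not functions from sw/host.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t IN_WORDS = 256;   // extents of the kernel input buffer
constexpr std::size_t OUT_WORDS = 258;  // extents of the kernel output buffer

// Bytes moved per graph iteration on one GMIO port.
constexpr std::size_t IN_BYTES_PER_ITER = IN_WORDS * sizeof(std::int32_t);
constexpr std::size_t OUT_BYTES_PER_ITER = OUT_WORDS * sizeof(std::int32_t);
static_assert(IN_BYTES_PER_ITER == 1024, "matches BLOCK_SIZE_in_Bytes per iteration");
static_assert(OUT_BYTES_PER_ITER == 1032, "matches BLOCK_SIZE_out_Bytes per iteration");

// Hypothetical helper mirroring the emulation check in the host code:
// pass the result of getenv("XCL_EMULATION_MODE").
int chooseIterations(const char* emuMode) {
    return emuMode != nullptr ? 4 : 8192;
}

// Total bytes sent to one GMIO input for the whole run.
std::size_t blockSizeInBytes(int iterations) {
    assert(iterations > 0);
    return IN_BYTES_PER_ITER * static_cast<std::size_t>(iterations);
}

// Total bytes received from one GMIO output for the whole run.
std::size_t blockSizeOutBytes(int iterations) {
    assert(iterations > 0);
    return OUT_BYTES_PER_ITER * static_cast<std::size_t>(iterations);
}
```

With 8192 iterations, each input port transfers 8,388,608 bytes and each output port transfers 8,454,144 bytes. With 4 iterations, the sizes are 4096 and 4128 bytes.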

In the main function, the PS code profiles NUM GMIO inputs and outputs, where NUM is four. It uses the non-blocking GMIO APIs (GMIO::gm2aie_nb and GMIO::aie2gm_nb) for GMIO transactions, and GMIO::wait to synchronize the output data.

//Pre-processing
......

//start graph and GMIO output ports first
gr.run(ITERATION);
for(int i=0;i<NUM;i++){
	gr.gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
}

//Profile starts here
......

//start GMIO inputs and wait for GMIO outputs to complete
for(int i=0;i<NUM;i++){
	gr.gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
}
for(int i=0;i<NUM;i++){
	gr.gmioOut[i].wait();
}

//Profile ends here
......

//check output correctness 
......
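The host code elides the correctness check. As an illustration only, and not the tutorial's implementation, a check for one port could compare each of the 256 data words per iteration against the input plus one and skip the two counter words. The function name outputIsCorrect is hypothetical, and the sketch assumes that the kernel's addition wraps on overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t IN_WORDS = 256;   // input words per iteration
constexpr std::size_t OUT_WORDS = 258;  // output words per iteration, including two counter words

// Hypothetical check for one GMIO port: every data word must equal input + 1.
// The two trailing counter words of each output block are ignored.
bool outputIsCorrect(const std::int32_t* din, const std::int32_t* dout,
                     std::size_t iterations) {
    assert(din != nullptr && dout != nullptr);
    for (std::size_t it = 0; it < iterations; ++it) {
        const std::int32_t* in = din + it * IN_WORDS;
        const std::int32_t* out = dout + it * OUT_WORDS;
        for (std::size_t j = 0; j < IN_WORDS; ++j) {
            // Unsigned arithmetic avoids signed overflow (assumes wrapping addition).
            const std::uint32_t expected = static_cast<std::uint32_t>(in[j]) + 1u;
            if (static_cast<std::uint32_t>(out[j]) != expected) return false;
        }
    }
    return true;
}
```

In this sketch, din and dout stand for one entry of dinArray and doutArray, and iterations stands for ITERATION.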