Multi-Process and Multi-Thread Control of AI Engine Graphs

AI Engine Tools and Flows User Guide (UG1076)

Document ID: UG1076
Release Date: 2025-11-20
Version: 2025.2 English

Multi-Process Access Control

Xilinx Runtime (XRT) APIs provide multi-process support for controlling the AI Engine array and graphs. XRT supports the following three access modes for the AI Engine array and graphs:

Exclusive Mode
  • Provides full access to the AI Engine array or graph.
  • No other process can access it.
Primary Mode
  • Provides full access to the AI Engine array or graph.
  • Other processes can gain non-destructive access to the AI Engine array or graph.
  • Non-destructive access includes any AIE control functions that do not change the status of the AI Engine, such as asynchronous RTP read.
Shared Mode
  • Provides only non-destructive access to the AI Engine array or graph.
  • Non-destructive access includes any AIE control functions that do not change the status of the AI Engine, such as asynchronous RTP read.

In summary, the following three modes offer different levels of access control for multiple processes:

Exclusive Mode
  • Single process with full control.
Primary Mode
  • Single process with full control; other processes can read without modifying the state.
Shared Mode
  • Multiple processes can read, but none can modify.

In xrt/xrt_aie.h, the XRT C++ API extends the xrt::aie::device class so that a device can be opened with one of the following access modes:

  • xrt::aie::access_mode::exclusive (Exclusive mode)
  • xrt::aie::access_mode::primary (Primary mode)
  • xrt::aie::access_mode::shared (Shared mode)

In xrt/xrt_graph.h, the XRT C++ API extends the xrt::graph class so that a graph can be opened with one of the following access modes:

  • xrt::graph::access_mode::exclusive (Exclusive mode)
  • xrt::graph::access_mode::primary (Primary mode)
  • xrt::graph::access_mode::shared (Shared mode)

You must close the graph before closing the array. You can re-open all AI Engine arrays and graphs after they are closed. Some restrictions apply, as listed below.
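
The close-order requirement can be illustrated with a minimal sketch. It uses hypothetical stand-in types (ArrayHandle, GraphHandle) instead of the XRT classes, and it assumes that a handle releases its resource when it is destroyed, which this section does not state. Under that assumption, declaring the array handle before the graph handle closes the graph first, because C++ destroys local objects in reverse order of declaration.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an AI Engine array handle (not an XRT type).
// It records a log entry when it is destroyed, that is, "closed".
struct ArrayHandle {
    std::vector<std::string>& log;
    explicit ArrayHandle(std::vector<std::string>& l) : log(l) {}
    ~ArrayHandle() { log.push_back("close array"); }
};

// Hypothetical stand-in for a graph handle opened on an array handle.
struct GraphHandle {
    std::vector<std::string>& log;
    explicit GraphHandle(ArrayHandle& array) : log(array.log) {}
    ~GraphHandle() { log.push_back("close graph"); }
};

// Opens the array, then the graph, and lets both go out of scope.
// Returns the order in which the two handles were closed.
std::vector<std::string> open_and_close() {
    std::vector<std::string> log;
    {
        ArrayHandle array{log};    // declared first, destroyed last
        GraphHandle graph{array};  // declared second, destroyed first
    }
    return log;
}
```

In this sketch, open_and_close() records the graph close before the array close without any explicit close call.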

The following restrictions apply for the AI Engine array multi-process support:

  • Only one process can open the AI Engine array in exclusive mode. After opening in exclusive mode, no other process—including the same one—can open it in any mode.
  • Only one process can open the AI Engine array in the primary mode. After opening in primary mode, no process—including the same one—can reopen it in primary or exclusive mode. However, shared mode remains available.

The following restrictions apply for the AI Engine graph multi-process support:

  • A single process cannot open the same graph more than once.
  • If a process opens an AI Engine graph in exclusive mode, no process can reopen it in any mode.
  • Multiple processes can open a graph in shared mode, but only one process can open a graph in primary mode.
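
The restrictions above can be summarized as a small compatibility check. The following is an illustrative model only, not XRT code: the Mode enum and the can_open function are hypothetical names. It also assumes that an exclusive open fails when the resource is already open in any mode, which the text implies but does not state explicitly. The per-process rule that a single process cannot open the same graph twice is not modeled.

```cpp
#include <vector>

// Illustrative stand-in for the three access modes (not the XRT enum).
enum class Mode { Exclusive, Primary, Shared };

// Returns true if a new open request in mode `requested` is compatible
// with the modes in which the resource is already open (`held`).
bool can_open(const std::vector<Mode>& held, Mode requested) {
    for (Mode m : held) {
        // An exclusive holder blocks every other open.
        if (m == Mode::Exclusive) return false;
        // A primary holder blocks further primary or exclusive opens;
        // shared mode remains available.
        if (m == Mode::Primary && requested != Mode::Shared) return false;
    }
    // Assumption: exclusive mode requires that nothing else holds the resource.
    if (requested == Mode::Exclusive && !held.empty()) return false;
    return true;
}
```

For example, can_open({Mode::Primary}, Mode::Shared) is true, while can_open({Mode::Primary}, Mode::Primary) is false.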

Multi-Thread Access Control

Important: The restrictions for multi-process access also apply to multi-threaded access. However, AI Engine device handles and graph handles are sharable between threads, so multiple threads can use the same handle. The host application is responsible for synchronizing the AI Engine array and graph states between threads. Synchronization is particularly crucial when multiple threads have exclusive or primary ownership of the AI Engine array or graph.
Note: AMD recommends that multiple threads use the same model as multiple processes.
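
One possible way for a host application to synchronize threads that share a handle is to guard every state-changing call with a mutex. The sketch below uses a hypothetical stand-in (SharedGraphState) in place of a real graph handle, so that it runs without hardware. The class and function names are assumptions for illustration, not part of the XRT API.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a graph handle shared between threads.
struct SharedGraphState {
    int state = 0;                              // stands in for AI Engine graph state
    void update(int delta) { state += delta; }  // a state-changing operation
};

// Wraps the shared handle so that only one thread touches it at a time.
class GuardedGraph {
public:
    void update(int delta) {
        std::lock_guard<std::mutex> lock(mutex_);
        graph_.update(delta);
    }
    int state() {
        std::lock_guard<std::mutex> lock(mutex_);
        return graph_.state;
    }
private:
    std::mutex mutex_;
    SharedGraphState graph_;
};

// Starts num_threads workers that each apply updates_per_thread updates
// through the same guarded handle, then returns the final state.
int run_threads(int num_threads, int updates_per_thread) {
    GuardedGraph graph;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&graph, updates_per_thread] {
            for (int i = 0; i < updates_per_thread; ++i) graph.update(1);
        });
    }
    for (auto& w : workers) w.join();
    return graph.state();
}
```

With the mutex in place, no update is lost: the final state equals num_threads multiplied by updates_per_thread.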

Use Cases

The following figure shows multiple-process and multiple-thread use cases based on the access mode restrictions:

Figure 1. Multi-Process and Multi-Thread Use Cases

The figure illustrates how different processes can access the AI Engine devices and graphs.

Use Case 1:

  • Process 1 opens xrt::aie::device or xrt::graph in exclusive mode.
  • Process 2 cannot open xrt::aie::device or xrt::graph because exclusive mode grants full control to Process 1. No other process can access the resource.

Use Case 2:

  • Process 1 opens xrt::aie::device or xrt::graph in primary mode.
  • Process 2 can open xrt::aie::device or xrt::graph in shared mode only.
  • In shared mode, Process 2 can only perform non-destructive operations on the AI Engine, such as asynchronous RTP read. Non-destructive operations are those that do not change the state of the AI Engine.

Use Case 3:

  • Thread 1 opens xrt::aie::device or xrt::graph without any mode restrictions.
  • Both Thread 1 and Thread 2 can operate on the AI Engine.
  • The host application must synchronize the two threads to avoid conflicting operations on the AI Engine.

The following sample code demonstrates Use Case 2.

#include <stdlib.h>
#include <exception>
#include <fstream>
#include <iostream>
#include <string>
#include <unistd.h>
#include <sys/wait.h>
#include "adf/adf_api/XRTConfig.h"
#include "xrt/xrt_aie.h"
#include "xrt/xrt_graph.h"
#include "xrt/xrt_kernel.h"

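//8192 matches 32 iterations of graph::run (two runs of 16 iterations each)
//GRAPH_NUM, used in main(), is not defined in this listing; it is assumed
//to be supplied at compile time, for example as a compiler -D option
#define OUTPUT_SIZE 8192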
int value1[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
int value2[16] = {-1,-2,-3,-4,-5,-6,-7,-8,-9,-10,-11,-12,-13,-14,-15,-16};

using namespace adf;

int run(int argc, char* argv[],int id){
	std::cout<<"Child process "<<id<<" start"<<std::endl;
	
	//Expect exactly one argument: the path to the xclbin file
	if(argc != 2) {
		std::cout << "Usage: " << argv[0] <<" <xclbin>" << std::endl;
		return EXIT_FAILURE;
	}
	char* xclbinFilename = argv[1];
	std::string graph_name=std::string("gr[")+std::to_string(id)+"]";
	std::string rtp_inout_name=std::string("gr[")+std::to_string(id)+std::string("].k.inout[0]");
	
	int value_readback[16]={0};
	if(fork()==0){//grandchild process: opens the device and graph in shared mode
		xrt::aie::device device{0, xrt::aie::device::access_mode::shared};
		auto uuid = device.load_xclbin(xclbinFilename);
		xrt::graph graph{device, uuid, graph_name, xrt::graph::access_mode::shared};

		graph.read(rtp_inout_name, value_readback);
		std::cout<<"Add value read back are:";
		for(int i=0;i<16;i++){
			std::cout<<value_readback[i]<<",\t";
		}
		std::cout<<std::endl;
		std::cout<<"child child process exit"<<std::endl;
		exit(0);
	}

	xrt::aie::device device{0};   // default primary context
	auto uuid = device.load_xclbin(xclbinFilename);
	xrt::graph graph{device, uuid, graph_name}; // default primary context

	std::string rtp_in_name=std::string("gr[")+std::to_string(id)+std::string("].k.in[1]");
	graph.update(rtp_in_name, value1);
	graph.run(16); // 16 iterations

	graph.wait(0); // wait 0 => wait till graph is done
	std::cout<<"Graph wait done"<<std::endl;
			
	//second run
	graph.update(rtp_in_name, value2);
	graph.run(16); // 16 iterations

	while(wait(NULL)>0){//Wait for child child process
	}

	graph.wait(0); // wait 0 => wait till graph is done
	std::cout<<"Child process:"<<id<<" done"<<std::endl;
	return 0;
}

int main(int argc, char* argv[])
{
	try {
		for(int i=0;i<GRAPH_NUM;i++){
			if(fork()==0){//child
				auto match = run(argc, argv,i);
				std::cout << "TEST child " <<i<< (match ? " FAILED" : " PASSED") << "\n";
				return (match ? EXIT_FAILURE :  EXIT_SUCCESS);
			}else{
				size_t output_size_in_bytes = OUTPUT_SIZE * sizeof(int);
				//Expect exactly one argument: the path to the xclbin file
				if(argc != 2) {
					std::cout << "Usage: " << argv[0] <<" <xclbin>" << std::endl;
					return EXIT_FAILURE;
				}
				char* xclbinFilename = argv[1];
				
				// Open the device and load the xclbin
				auto device = xrt::device(0); //device index=0
				auto uuid = device.load_xclbin(xclbinFilename);
			
				// s2mm & data_generator kernel handle
				std::string s2mm_kernel_name=std::string("s2mm:{s2mm_")+std::to_string(i+1)+std::string("}");
				xrt::kernel s2mm = xrt::kernel(device, uuid, s2mm_kernel_name.data());
				std::string data_generator_kernel_name=std::string("data_generator:{data_generator_")+std::to_string(i+1)+std::string("}");
				xrt::kernel data_generator = xrt::kernel(device, uuid, data_generator_kernel_name.data());
			
				// output memory
				auto out_bo=xrt::bo(device, output_size_in_bytes,s2mm.group_id(0));
				auto host_out=out_bo.map<int*>();
				auto s2mm_run = s2mm(out_bo, nullptr, OUTPUT_SIZE);//1st run for s2mm has started
				auto data_generator_run = data_generator(nullptr, OUTPUT_SIZE);

				// wait for s2mm done
				std::cout<<"Waiting s2mm to complete"<<std::endl;
				auto state = s2mm_run.wait();
				std::cout << "s2mm "<<" completed with status(" << state << ")"<<std::endl;
				out_bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
				
				int match = 0;
				int counter=0;
				for (int i = 0; i < OUTPUT_SIZE/2/16; i++) {
					for(int j=0;j<16;j++){
						if(host_out[i*16+j]!=counter+value1[j]){
							std::cout<<"ERROR: num="<<i*16+j<<" out="<<host_out[i*16+j]<<std::endl;
							match=1;
							break;
						}
						counter++;
					}
				}
				for(int i=OUTPUT_SIZE/2/16;i<OUTPUT_SIZE/16;i++){
					for(int j=0;j<16;j++){
						if(host_out[i*16+j]!=counter+value2[j]){
							std::cout<<"ERROR: num="<<i*16+j<<" out="<<host_out[i*16+j]<<std::endl;
							match=1;
							break;
						}
						counter++;
					}
				}

				std::cout << "TEST " <<i<< (match ? " FAILED" : " PASSED") << "\n";
				while(wait(NULL)>0){//Wait for all child process
				}
				std::cout<<"all done"<<std::endl;
				return (match ? EXIT_FAILURE :  EXIT_SUCCESS);
			}
		}
	}
	catch (std::exception const& e) {
		std::cout << "Exception: " << e.what() << "\n";
		std::cout << "FAILED TEST\n";
		return 1;
	}
}