The AI Engine driver maintains global device information for the different versions of AI Engine devices. It is possible to compile the driver for a single device version to reduce memory consumption. At runtime, memory is allocated to track resource usage when the driver instance is initialized; after that, no further memory allocation is required.
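As a rough sketch of where that one-time allocation happens, the snippet below follows the initialization pattern of the open source aie-rt driver (XAie_SetupConfig, XAie_InstDeclare, XAie_CfgInitialize, XAie_Finish); the hardware generation macro and all geometry values are placeholders and must match the target device.

```c
/* Minimal sketch of driver instance setup, assuming the aie-rt style
 * configuration API. The generation macro and geometry values below
 * are placeholders; real values depend on the target device. */
#include "xaiengine.h"

#define XAIE_BASE_ADDR          0x20000000000ULL /* placeholder */
#define XAIE_COL_SHIFT          25               /* placeholder */
#define XAIE_ROW_SHIFT          20               /* placeholder */
#define XAIE_NUM_COLS           50               /* placeholder */
#define XAIE_NUM_ROWS           9                /* placeholder */
#define XAIE_SHIM_ROW           0                /* placeholder */
#define XAIE_MEM_TILE_ROW_START 0                /* placeholder */
#define XAIE_MEM_TILE_NUM_ROWS  0                /* placeholder */
#define XAIE_AIE_TILE_ROW_START 1                /* placeholder */
#define XAIE_AIE_TILE_NUM_ROWS  8                /* placeholder */

int init_aie(void)
{
    /* Describe the device geometry for the generation being targeted. */
    XAie_SetupConfig(ConfigPtr, XAIE_DEV_GEN_AIE, XAIE_BASE_ADDR,
                     XAIE_COL_SHIFT, XAIE_ROW_SHIFT, XAIE_NUM_COLS,
                     XAIE_NUM_ROWS, XAIE_SHIM_ROW, XAIE_MEM_TILE_ROW_START,
                     XAIE_MEM_TILE_NUM_ROWS, XAIE_AIE_TILE_ROW_START,
                     XAIE_AIE_TILE_NUM_ROWS);
    XAie_InstDeclare(DevInst, &ConfigPtr);

    /* Resource-usage tracking memory is allocated here, once. */
    if (XAie_CfgInitialize(&DevInst, &ConfigPtr) != XAIE_OK)
        return -1;

    /* ... configure and use the device ... */

    /* Teardown frees the bookkeeping allocated at initialization. */
    XAie_Finish(&DevInst);
    return 0;
}
```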
A lightweight implementation has been introduced for the platform management APIs, together with feature switch macros. These macros optionally enable features so that runtime memory consumption can be reduced for control applications that need to run on small-memory systems.
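To illustrate the feature switch idea only (the macro name below is hypothetical and does not claim to match the driver's actual switches), an optional feature can be compiled out entirely so its code and data never reach a small-memory build:

```c
/* Illustrative feature switch; the macro name is hypothetical. */
#include <stdio.h>

/* Left undefined for a small-memory build; define it to pull the
 * feature back in at compile time. */
/* #define APP_FEATURE_PROFILING_ENABLE */

#ifdef APP_FEATURE_PROFILING_ENABLE
static void report_profile(void)
{
    printf("profiling data ...\n"); /* full implementation compiled in */
}
#else
static void report_profile(void)
{
    /* Stub: keeps call sites valid while the feature costs nothing. */
}
#endif

int main(void)
{
    report_profile();
    return 0;
}
```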
Because a PDI is used to configure the AI Engine, the configuration time depends on the size of the graph and the number of ELFs; configuration time optimization is an active topic within AMD. At runtime, data memory and program memory can be accessed directly from the application with memcpy and memset.
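For example, on a system where the AI Engine array is memory mapped into the application's address space, a tile's data memory can be touched with ordinary memcpy and memset calls; the base address and size below are hypothetical placeholders, not values from any particular device:

```c
/* Sketch only: direct data memory access via memcpy/memset, assuming
 * the AI Engine array is memory mapped into the application. The base
 * address and size are hypothetical placeholders; real values depend
 * on the device generation and the tile being addressed. */
#include <stdint.h>
#include <string.h>

#define TILE_DM_BASE ((uintptr_t)0x20000000000ULL) /* hypothetical */
#define TILE_DM_SIZE 0x8000U                       /* hypothetical */

static void load_tile_data(const void *src, size_t len)
{
    void *dm = (void *)TILE_DM_BASE;

    memset(dm, 0, TILE_DM_SIZE); /* clear the tile's data memory */
    memcpy(dm, src, len);        /* copy the payload in directly */
}
```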
For register access on bare-metal, applications can access registers directly through the driver without a context switch. On Linux, applications access registers through the Linux kernel driver, so register access from Linux carries more overhead. This can be optimized in the future by using a batch mode that sends multiple register accesses with a single syscall. To reduce the userspace-to-kernel context switch and the memory copy incurred for every I/O write, an I/O transaction mode is introduced that flushes buffered I/O commands to the kernel.
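As a minimal sketch, assuming the aie-rt transaction API (XAie_StartTransaction, XAie_SubmitTransaction) and its XAIE_TRANSACTION_DISABLE_AUTO_FLUSH flag, a group of register writes can be buffered in user space and flushed to the kernel in a single request:

```c
/* Sketch only: buffer register writes in a transaction so that, on
 * Linux, they reach the kernel driver in one flush instead of one
 * syscall per write. API names are assumed from the aie-rt driver;
 * error handling is trimmed for brevity. */
#include "xaiengine.h"

AieRC buffered_config(XAie_DevInst *DevInst, XAie_LocType Loc)
{
    /* Start buffering I/O commands instead of issuing them immediately. */
    XAie_StartTransaction(DevInst, XAIE_TRANSACTION_DISABLE_AUTO_FLUSH);

    /* These calls are queued in user space rather than trapping into
     * the kernel one by one. */
    XAie_CoreReset(DevInst, Loc);
    XAie_CoreUnreset(DevInst, Loc);

    /* Flush the buffered commands to the kernel in a single request. */
    return XAie_SubmitTransaction(DevInst, NULL);
}
```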
With the FAL and the AI Engine driver resource management, higher level libraries and tools such as XRT and ADF can reduce the overhead of managing AI Engine resources and reduce the risk of a function or process releasing a resource it has not reserved. Maintaining the state machine of each resource introduces some memory and performance overhead, but it adds the flexibility to add or remove individual resources and ensures that a function does not use or release a resource it has not reserved. Furthermore, the managed resources are not in the data path; they are used for profiling and debugging, so the overhead applies only to initial setup and teardown.
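The per-resource state machine can be pictured as a small reserve/use/release tracker; the sketch below is purely illustrative of that idea and does not reproduce the FAL or driver resource manager API:

```c
/* Illustrative only: a per-resource state machine of the kind the
 * resource manager maintains, so a caller cannot start or release a
 * resource it never reserved. Not the FAL or driver API. */
#include <stdbool.h>

enum rsc_state { RSC_FREE, RSC_RESERVED, RSC_IN_USE };

struct rsc {
    enum rsc_state state;
};

static bool rsc_reserve(struct rsc *r)
{
    if (r->state != RSC_FREE)
        return false;        /* already owned by another user */
    r->state = RSC_RESERVED;
    return true;
}

static bool rsc_start(struct rsc *r)
{
    if (r->state != RSC_RESERVED)
        return false;        /* caller never reserved this resource */
    r->state = RSC_IN_USE;
    return true;
}

static bool rsc_release(struct rsc *r)
{
    if (r->state == RSC_FREE)
        return false;        /* refuse to release an unreserved resource */
    r->state = RSC_FREE;
    return true;
}
```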