Description
This rule checks the stall percentage of AI Engine cores.
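As a rough illustration of what a stall percentage check involves, the sketch below divides per-type stall cycles by the total cycle count and flags anything above a threshold. The function names, the example counts, and the 10% threshold are all hypothetical; the actual rule's inputs and threshold are not stated here.

```python
def stall_percentages(total_cycles, stall_cycles):
    """Return each stall type's share of total core cycles, in percent."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return {name: 100.0 * cycles / total_cycles
            for name, cycles in stall_cycles.items()}


def flag_stalls(percentages, threshold_pct=10.0):
    """List stall types whose percentage exceeds a hypothetical threshold."""
    return sorted(name for name, pct in percentages.items()
                  if pct > threshold_pct)


# Made-up cycle counts for one core, for illustration only.
counts = {
    "MEMORY_STALL": 50,
    "STREAM_STALL": 300,
    "CASCADE_STALL": 0,
    "LOCK_STALL": 120,
}
pcts = stall_percentages(1000, counts)
flagged = flag_stalls(pcts)
print(pcts)
print(flagged)
```

With these example counts, the stream and lock stalls exceed the assumed threshold and would be the first candidates to investigate in the trace.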
Explanation
An AI Engine core can stall for several reasons. The stall types include memory, stream, cascade, and lock.
- MEMORY_STALL: Time the AI Engine spent in a memory stall. Possible causes include multiple memory accesses on the same bank in the same cycle, or multiple kernels accessing multiple memories on the same bank.
- STREAM_STALL: Time the AI Engine spent in a stream stall. Possible causes include streams being read faster than they are written, or streams from the PL being clocked at a slower frequency.
- CASCADE_STALL: Time the AI Engine spent in a cascade stall. Possible causes include cascade streams being read faster than they are written, or streams from the PL being clocked at a slower frequency.
- LOCK_STALL: Time the AI Engine spent in a lock stall. Possible causes include buffers being read faster than they are written, or streams to or from the PL being clocked at a slower frequency.
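The "read faster than written" cause can be pictured with a toy cycle-by-cycle model. This is only a sketch under simplifying assumptions: the reader wants one word every cycle, the writer produces one word every `write_period` cycles, and nothing else interferes. It is not a model of the actual AI Engine stream hardware, and the function name is hypothetical.

```python
def count_reader_stalls(cycles, write_period):
    """Count cycles in which the reader finds no data available.

    The writer pushes one word every `write_period` cycles; the reader
    tries to pop one word on every cycle.
    """
    available = 0
    stalled = 0
    for cycle in range(cycles):
        if cycle % write_period == 0:
            available += 1      # writer produces a word this cycle
        if available > 0:
            available -= 1      # reader consumes a word
        else:
            stalled += 1        # nothing to read: the reader stalls
    return stalled


matched = count_reader_stalls(1000, write_period=1)
half_rate = count_reader_stalls(1000, write_period=2)
print(matched, half_rate)
```

In this model a writer running at half the reader's rate leaves the reader stalled on every other cycle, which is the kind of imbalance a slower-clocked PL source can create.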
Recommendation
See the AI Engine Tools and Flows User Guide (UG1076) for all supported stall types.
- MEMORY_STALL: You can resolve the stall by examining access patterns using trace results and placing the memory on different banks, or by using the aiecompiler "BufferOptLevel" mapper option.
- Distribute memories across different banks (memories include system memory, RTP, window buffers, and data memories).
- If memory banks are exhausted, use profiling and trace to find a better solution.
- Specify the BufferOptLevel option in aiecompiler when building the design.
- STREAM_STALL: You can resolve the stall by examining stream access
patterns using trace results and increasing/balancing the FIFO depth on the stream, or
maximizing the PL bandwidth to the AI Engine.
- Increase FIFO depth.
- Adjust stream read and write instructions in the loop.
- Multiple streams: Insert DMA FIFO or set different FIFO depth for different destination nets.
- PLIO: Maximize the AIE-PL interface bandwidth, for example by using a 64-bit interface, running the PL at its highest frequency (1/2 the AIE frequency), and using channels that have BLI registers.
- CASCADE_STALL: You can resolve the stall by examining stream access
patterns using trace results and adjusting the instructions in the loop to match between
the input/output streams or maximizing the PL bandwidth to the AI Engine.
- Adjust instructions in the loop.
- LOCK_STALL: You can resolve the stall by examining buffer access patterns using trace results and acquiring and releasing buffers on time. Use of local buffers may also resolve the issue. You should also ensure the PL interface throughput matches the AI Engine throughput when the PL interface is either the source or the destination of the stall.
- Use ping-pong buffers (the default).
- Balance throughput between kernels.
- Acquire and release buffers in time. Use local buffers as needed.
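To check whether the PL interface can keep up with the AI Engine, a back-of-the-envelope bandwidth estimate can help. The sketch below assumes one word of the interface width is transferred per PL clock cycle and uses a 1000 MHz AI Engine clock purely as an example; the function name and the frequencies are assumptions, so substitute the values for your device and design.

```python
def plio_bandwidth_gb_per_s(width_bits, pl_freq_mhz):
    """Peak bandwidth in GB/s, assuming one word of `width_bits` per PL clock cycle."""
    bytes_per_cycle = width_bits / 8
    return bytes_per_cycle * pl_freq_mhz * 1e6 / 1e9


# Assumed AI Engine clock for illustration; the real value depends on the device.
AIE_FREQ_MHZ = 1000.0
PL_FREQ_MHZ = AIE_FREQ_MHZ / 2  # highest PL frequency per the PLIO bullet above

narrow = plio_bandwidth_gb_per_s(32, 250.0)       # 32-bit interface, slow PL clock
wide = plio_bandwidth_gb_per_s(64, PL_FREQ_MHZ)   # 64-bit interface, half AIE clock
print(narrow, wide)
```

Under these assumptions the 64-bit interface at half the AI Engine clock offers four times the peak bandwidth of the 32-bit interface at a quarter of it. Compare such an estimate against the rate at which the kernel consumes or produces data to judge whether the PL side is the likely source of the stall.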