When multiple kernels fit in a single AI Engine, communication between two or more consecutive kernels can be established through a common buffer in the local data memory of the AI Engine, or in any of the three neighboring data memory modules to which the AI Engine has direct access. In this case, only a single buffer is needed (rather than ping-pong buffers) because the kernels execute one after another in a round-robin fashion.
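A minimal ADF graph sketch of this arrangement is shown below. The kernel functions (f1, f2), source file names, window sizes, and runtime ratios are assumptions for illustration; the key points are the relative location constraint that places both kernels on the same tile and the single_buffer() constraint on the intermediate connection, which is safe because the two kernels never run concurrently.

```cpp
#include <adf.h>
using namespace adf;

// Kernel function prototypes (assumed to be implemented in f1.cc / f2.cc).
void f1(input_window<int32> *in, output_window<int32> *out);
void f2(input_window<int32> *in, output_window<int32> *out);

class SameTileGraph : public graph {
public:
  input_port  gin;
  output_port gout;
  kernel k1, k2;

  SameTileGraph() {
    k1 = kernel::create(f1);
    k2 = kernel::create(f2);
    source(k1) = "f1.cc";
    source(k2) = "f2.cc";

    connect<window<128>> n0(gin,       k1.in[0]);
    connect<window<128>> n1(k1.out[0], k2.in[0]);  // intermediate buffer
    connect<window<128>> n2(k2.out[0], gout);

    // Place both kernels on the same AI Engine so they share local memory.
    location<kernel>(k2) = location<kernel>(k1);

    // Request a single buffer (no ping-pong pair) for the intermediate
    // connection; valid because k1 and k2 execute sequentially.
    single_buffer(k2.in[0]);

    runtime<ratio>(k1) = 0.4;
    runtime<ratio>(k2) = 0.4;
  }
};
```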
When the kernels are in separate but neighboring AI Engines, communication can be carried out through the data memory module shared between the two neighboring AI Engine tiles, using ping-pong buffers. The two buffers can be placed in separate memory banks so that access conflicts are avoided. Synchronization is handled through locks: the locks associated with the buffers ensure that the input and output buffers are ready before an AI Engine kernel accesses them. This type of communication saves routing resources and eliminates data transfer latency, because neither the DMA nor the AXI4-Stream interconnect is needed.
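The following sketch illustrates one way to express this in an ADF graph. The tile coordinates, bank numbers, kernel names, and window sizes are all assumptions chosen for illustration: the kernels are pinned to adjacent tiles, the tools insert a ping-pong buffer pair for the connection between them by default, and the bank() constraints pin the ping and pong halves to different banks of the shared memory module.

```cpp
#include <adf.h>
using namespace adf;

// Kernel function prototypes (assumed to be implemented in prod.cc / cons.cc).
void prod(input_window<int32> *in, output_window<int32> *out);
void cons(input_window<int32> *in, output_window<int32> *out);

class NeighborGraph : public graph {
public:
  input_port  gin;
  output_port gout;
  kernel kp, kc;

  NeighborGraph() {
    kp = kernel::create(prod);
    kc = kernel::create(cons);
    source(kp) = "prod.cc";
    source(kc) = "cons.cc";

    connect<window<256>> n0(gin,       kp.in[0]);
    connect<window<256>> n1(kp.out[0], kc.in[0]);  // ping-pong by default
    connect<window<256>> n2(kc.out[0], gout);

    // Map the kernels onto horizontally adjacent tiles (coordinates assumed)
    // so the connection can use the shared data memory module between them.
    location<kernel>(kp) = tile(10, 0);
    location<kernel>(kc) = tile(11, 0);

    // Pin the ping and pong buffers to different banks of the shared memory
    // module (bank ids assumed) so the two halves never conflict on access.
    location<buffer>(kc.in[0]) = { bank(10, 0, 2), bank(10, 0, 3) };

    runtime<ratio>(kp) = 0.5;
    runtime<ratio>(kc) = 0.5;
  }
};
```

Because the buffers live in the shared memory module, the lock-based handshake between the producer and consumer replaces any DMA or stream routing, which is what saves the routing resources and transfer latency described above.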