The data comes from all 128 instances at a rate $r = 125\,\mathrm{MSa/s}$. Assuming the data originates in the PL, the PL interface tiles are used to send it into the device. To minimize the PLIO channel usage, the best choice is to apply time interleaving between the instances and run the fabric at the high (but manageable) speed of $f_{PL} = 500\,\mathrm{MHz}$, with an interleaving factor $\theta = 4$. The data is sent through the PLIO channels, which are 64 bits wide and can therefore accommodate two samples per transfer, adding spatial interleaving on top of the temporal one. The instances can be assigned to the channels in a round-robin fashion, as shown in the following table.
Table 2: PLIO Channels Mapping with Temporal and Spatial Interleaving.
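As a quick consistency check on the channel budget (the symbol $N_{\mathrm{PLIO}}$ is introduced here only for this purpose, and the instances are assumed to be spread evenly over the channels), with $\theta = 4$ and two samples packed per 64-bit beat the required number of PLIO channels is

$$
N_{\mathrm{PLIO}} = \frac{128}{\theta \cdot 2} = 16,
\qquad
N_{\mathrm{PLIO}} \cdot 2 \cdot f_{PL} = 16\,\mathrm{GSa/s} = 128 \cdot r,
$$

so the channels exactly sustain the aggregate input rate.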
A non-trivial design challenge raised by time interleaving is the management of the kernels’ buffers: each kernel that computes the FFT expects its input data to be contiguous inside the buffer. If, for instance, we set the batching parameter $\text{REPEAT} = 2$ in the kernels’ header file, the input buffer size grows accordingly, but the data is expected to be ordered with the first $N$ samples coming from one signal instance, followed by $N$ samples from another instance. This ordering cannot be obtained directly, since the data arrives interleaved, as shown in Table 2.
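As an illustration of the constraint, a batched kernel interface might be declared as below; the FFT length, the sample type, and the buffer-port declaration are assumptions made here for the sketch, not taken from the actual kernel code.

```cpp
#include <adf.h>

// Illustrative values: N is the per-instance FFT length, REPEAT the batching
// factor mentioned above. Both are placeholders, not the project's settings.
#define N      1024
#define REPEAT 2

// With REPEAT = 2 the buffer port must hold two complete, contiguous records:
//   expected layout : [ inst_a[0..N-1] | inst_b[0..N-1] ]
//   interleaved feed: [ a0 b0 c0 d0  a1 b1 c1 d1 ... ]   (cannot be used as-is)
void fft_kernel(adf::input_buffer<cint16, adf::extents<REPEAT * N>>  &in,
                adf::output_buffer<cint16, adf::extents<REPEAT * N>> &out);
```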
Within this context, the easiest construct to use would be packet switching, where a single physical channel is used to route data to different kernels. However, the only packet switching mode currently supported for AI Engine kernels is the explicit one, in which a 32-bit header carrying the packet information is sent first, followed by a payload of the desired length. Because of the interleaved nature of the communication, the header overhead would have to be paid for every sample sent, halving the channels’ bandwidth. The second possibility is to use passthrough kernels that simply route the data to the compute kernels, relocating the overhead from the I/O resources to the computational ones. The third option is to use 16 BRAM buffers in the PL and send the data already in order, paying with fabric resources and a large initial latency of about twice the total acquisition time: $2 \cdot T_{acq} \simeq 16.4\,\mu s$. The fourth and last possibility is to use the AIE-ML memory tiles as buffers, exploiting their programmable access patterns to employ them as reordering interfaces.
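For reference, the quoted latency is consistent with an acquisition of $N = 1024$ samples per instance (an assumption inferred here, not stated explicitly in this section):

$$
T_{acq} = \frac{N}{r} = \frac{1024}{125\,\mathrm{MSa/s}} \simeq 8.2\,\mu s
\quad\Longrightarrow\quad
2 \cdot T_{acq} \simeq 16.4\,\mu s.
$$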
The choice falls on this last option, as it manages all the data within the AI Engine array and leaves the bandwidth, the interleaving scheme, and the programmable logic resource usage unchanged, at the cost of a moderately increased latency and of the memory consumed inside the array’s memory tiles.
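A minimal graph-level sketch of this idea is shown below, assuming cint16 samples (two per 64-bit beat), an interleaving factor $\theta = 4$, and $N$-sample records; the buffer and port names, the sizes, and the single placeholder output are illustrative only, and the tiling parameters would have to be adapted to the actual channel mapping of Table 2. The write access stores the samples in arrival (interleaved) order, while the read access traverses one instance at a time, so the downstream kernels receive contiguous $N$-sample records.

```cpp
#include <adf.h>
using namespace adf;

// Illustrative sizes: THETA-way temporal interleaving, N-point records.
static constexpr int THETA = 4;
static constexpr int N     = 1024;

class DeinterleaveGraph : public graph {
public:
    input_plio  in0;
    output_plio out0;                 // placeholder for the downstream path
    shared_buffer<cint16> mtx;        // buffer mapped onto an AIE-ML memory tile

    DeinterleaveGraph() {
        in0  = input_plio::create("plio_in0",  plio_64_bits, "data/in0.txt");
        out0 = output_plio::create("plio_out0", plio_64_bits, "data/out0.txt");

        // 2-D buffer: dimension 0 = instance index, dimension 1 = sample index.
        mtx = shared_buffer<cint16>::create({THETA, N}, 1, 1);

        // Write in arrival order: each tile holds one sample per instance
        // (the interleaved group), stepping along the sample dimension.
        write_access(mtx.in[0]) = tiling({
            .buffer_dimension = {THETA, N},
            .tiling_dimension = {THETA, 1},
            .offset           = {0, 0},
            .tile_traversal   = {{.dimension = 1, .stride = 1, .wrap = N}}
        });

        // Read one instance at a time: a full N-sample column per tile,
        // stepping through the THETA instances, so the output is de-interleaved.
        read_access(mtx.out[0]) = tiling({
            .buffer_dimension = {THETA, N},
            .tiling_dimension = {1, N},
            .offset           = {0, 0},
            .tile_traversal   = {{.dimension = 0, .stride = 1, .wrap = THETA}}
        });

        connect(in0.out[0], mtx.in[0]);
        connect(mtx.out[0], out0.in[0]);   // in the real design this feeds the FFT kernels
    }
};
```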