The packet_receiver is an HLS kernel that performs cross-port packet switching from 4 input AXI Stream ports to 8 output AXI Stream ports, reorganizing data from the AI Engine’s packet-switched TDM FIR outputs back into polyphase streams for the IFFT. It performs two critical functions:
Packet Reception and Buffering: The producer function reads packets from 4 input ports and routes them to 32 dedicated stream-of-blocks buffers based on packet ID. Each input port receives 8 packets sequentially, with packet IDs extracted from the 32-bit AI Engine header (bits [4:0]). The design supports out-of-order packet arrival through a two-level lookup mechanism: extracting the index from header bits [4:0], then looking up the actual packet ID from packet_ids_N arrays to determine the destination buffer.
Data Reorganization and Output: The consumer function reads from the buffered packets and reorganizes samples into 8 output streams with interleaved ordering. Each output port receives 4 packets (strided pattern: output N receives packets N, N+8, N+16, N+24), with samples written in round-robin fashion across the packets to ensure balanced timing.
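The cross-port mapping described above can be sketched in plain C++ as index arithmetic. This is an illustrative model, not the kernel source: all function and constant names are hypothetical, and the round-robin granularity (one sample from each packet per turn) is an assumption, since the text does not state how many samples are taken per turn.

```cpp
#include <cassert>

// Sizes taken from the description above.
constexpr int kNumInputs = 4;
constexpr int kNumOutputs = 8;
constexpr int kNumPackets = 32;
constexpr int kSamplesPerPacket = 128;
constexpr int kPacketsPerInput = kNumPackets / kNumInputs;    // 8
constexpr int kPacketsPerOutput = kNumPackets / kNumOutputs;  // 4

// Packet carried in slot `slot` (0..7) of input port `port`: N*8 .. N*8+7.
constexpr int input_packet(int port, int slot) {
    return port * kPacketsPerInput + slot;
}

// k-th packet (0..3) feeding output port `port`: N, N+8, N+16, N+24.
constexpr int output_packet(int port, int k) {
    return port + k * kNumOutputs;
}

// Input port that delivered a given packet.
constexpr int source_input(int packet) {
    return packet / kPacketsPerInput;
}

// Each output port draws exactly one packet from each input port.
static_assert(source_input(output_packet(5, 0)) == 0, "k = 0 comes from input 0");
static_assert(source_input(output_packet(5, 3)) == 3, "k = 3 comes from input 3");

struct SampleRef {
    int packet;  // buffer to read from
    int index;   // sample position inside that packet
};

// n-th sample emitted on output `port`, rotating across its 4 packets.
// ASSUMPTION: one sample per packet per turn.
inline SampleRef output_sample(int port, int n) {
    assert(port >= 0 && port < kNumOutputs);
    assert(n >= 0 && n < kPacketsPerOutput * kSamplesPerPacket);
    return SampleRef{output_packet(port, n % kPacketsPerOutput),
                     n / kPacketsPerOutput};
}
```

Under this model, the four packets feeding any output port come from four different input ports, which is why the kernel has to buffer whole packets rather than forward them directly.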
Key implementation details:
32 independent stream-of-blocks: One dedicated 128-sample buffer per packet with ping-pong double buffering (depth=2) for concurrent producer/consumer operation
LUTRAM implementation: Zero BRAM usage - all 512 Kb of storage (32 packets × 128 samples × 64 bits × 2 buffers) implemented in distributed RAM
Array partitioning: Cyclic factor=2 partitioning splits each buffer into even/odd memory banks, enabling dual-write optimization for producer II=1
Out-of-order packet support: Header-based routing using packet_ids_N[header[4:0]] lookup ensures correct data flow regardless of packet arrival order from upstream AI Engine processing
Cross-port switching: Input port N receives packets (N×8) to (N×8+7), while output port N receives packets N, N+8, N+16, N+24
Performance: Achieves 2.38 Gsps sustained throughput (95.2% efficiency) at 312.5 MHz with 128-bit interfaces, validated in RTL co-simulation
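The storage and throughput figures in the list above can be checked with a few compile-time constants. The sketch below uses hypothetical names throughout. It also assumes that the 95.2% efficiency is measured against the input-side peak (4 ports, each moving one 128-bit beat, i.e. two 64-bit cint32 samples, per clock); the text does not say which peak is used, so treat that as one plausible reading.

```cpp
#include <cassert>
#include <cstdint>

// Buffer storage: 32 packets x 128 samples x 64 bits x 2 ping-pong buffers.
constexpr std::uint64_t kPackets = 32;
constexpr std::uint64_t kSamplesPerPacket = 128;
constexpr std::uint64_t kBitsPerSample = 64;  // cint32: 32-bit real + 32-bit imaginary
constexpr std::uint64_t kPingPongDepth = 2;

constexpr std::uint64_t kStorageBits =
    kPackets * kSamplesPerPacket * kBitsPerSample * kPingPongDepth;
static_assert(kStorageBits == 512 * 1024, "512 Kb of buffer storage");

// Samples per transform: 32 packets x 128 samples.
constexpr std::uint64_t kTransformSamples = kPackets * kSamplesPerPacket;
static_assert(kTransformSamples == 4096, "4096 samples per transform");

// Peak rate, ASSUMING it is counted on the 4 input ports.
constexpr double kClockMHz = 312.5;
constexpr int kInputPorts = 4;
constexpr int kSamplesPerBeat = 128 / 64;  // two cint32 samples per 128-bit beat
constexpr double kPeakMsps = kClockMHz * kInputPorts * kSamplesPerBeat;  // 2500 Msps

// Fraction of the assumed peak reached by a sustained rate given in Msps.
inline double efficiency(double sustained_msps) {
    assert(sustained_msps >= 0.0);
    return sustained_msps / kPeakMsps;
}
```

With these assumptions, `efficiency(2380.0)` evaluates to roughly 0.952, consistent with the quoted 2.38 Gsps at 95.2%.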
The kernel processes 4096 samples per transform (32 packets × 128 samples), with each packet containing a 32-bit header followed by 128 cint32 samples. The two-level packet ID lookup mechanism (defined in packet_receiver.h using values from packet_ids_c.h) provides flexible mapping between AI Engine packet IDs and internal routing, critical for robust integration with the asynchronous AI Engine fabric.
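A minimal sketch of the two-level lookup follows, written as plain C++ rather than HLS code. The table shape (32 entries per input port), the helper names, and the table contents are all hypothetical; the real values are defined in packet_ids_c.h and are not reproduced here.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kNumBuffers = 32;
constexpr int kPacketsPerInput = 8;

// Level 1: extract the 5-bit index field, header bits [4:0].
constexpr unsigned header_index(std::uint32_t header) {
    return header & 0x1Fu;
}

// HYPOTHETICAL table shape: one entry per possible 5-bit index,
// with -1 marking indices that an input port never sees.
using IdTable = std::array<int, 32>;

// HYPOTHETICAL contents for input port `port`: index i maps to packet port*8 + i.
// The actual mapping comes from packet_ids_c.h.
inline IdTable make_example_table(int port) {
    IdTable table;
    table.fill(-1);
    for (int i = 0; i < kPacketsPerInput; ++i) {
        table[i] = port * kPacketsPerInput + i;
    }
    return table;
}

// Level 2: translate the header index into the destination buffer number.
inline int destination_buffer(std::uint32_t header, const IdTable& table) {
    const int id = table[header_index(header)];
    assert(id >= 0 && id < kNumBuffers);  // index must be valid for this port
    return id;
}
```

Because the destination is derived from each packet's own header rather than from its arrival position, two packets that arrive in swapped order still land in their own buffers. For example, with the hypothetical table for port 1, a header whose low five bits are 3 selects buffer 11 regardless of when it arrives.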