Layer Design Details: conv1d_w1() - 2025.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2025-12-05
Version
2025.2 English

The following figure summarizes the key aspects of the design of the conv1d_w1() layer. The Jupyter Notebook used for validation is gen_vectors.ipynb.

  • An input memory tile pre- and post-pads the input tensor with zeros to satisfy the model's padding="same" requirement. The layer of interest uses kernel_size=7, which requires the incoming 1024 I/Q samples to be pre-padded with three zeros and post-padded with three zeros. To guarantee that the kernel input size is a multiple of 16 bytes, the design instead pre-pads with 4 zeros and post-pads with 4 zeros.

  • The incoming (samples,nodes) dimension order is flipped on the output due to the nature of the computation. The original order is recovered in the max_pool1d_w2 layer.

  • The layer input data fits in the local tile memory, but the output is expanded to 64x1024 bfloat16 samples corresponding to 256KB (assuming double buffering), which is larger than the local tile memory of 64KB. Splitting the output data impacts the nature of processing, as described earlier in Key Design Concepts.

  • For this reason, the kernel is implemented with an output_async_buffer, which allows the output_buffer to be split by a factor of NSPLIT=4.

  • Because this layer has only two input nodes, the mac_elem_64() intrinsic is used, which limits the maximum achievable hardware utilization to 25%.

  • The innermost loop has KERNEL_SIZE=7 iterations and is fully unrolled. The loop enclosing it achieves an initiation interval of II=57 with 7x2 MAC operations.

  • The overall kernel structure employs an outer loop over the nodes dimension, an inner loop over the samples dimension, and an innermost loop over the kernel_size dimension. This is a good fit for the chosen intrinsic.

  • Notice how the tiling parameters of the memory tile are used to pre/post-pad the input samples dimension with 4 zeros.
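
The padding and loop ordering described in the list above can be illustrated with a small pure-Python reference model. This is only a sketch for reasoning about the data flow, not the AI Engine kernel: the function names, the weights[node][in_node][tap] layout, and the one-sample read offset used to compensate for the extra zero are assumptions made for illustration.

```python
# Reference sketch of a "same"-padded 1-D convolution with 4-zero padding.
# Hypothetical names and weight layout; not taken from the tutorial sources.

KERNEL_SIZE = 7
PAD = 4  # zeros added on each side (3 would already satisfy padding="same")


def pad_input(x, pad=PAD):
    """Pre/post-pad a (samples, nodes) input with `pad` all-zero samples."""
    n_in = len(x[0])
    zeros = lambda: [[0.0] * n_in for _ in range(pad)]
    return zeros() + [list(row) for row in x] + zeros()


def conv1d_same(x, weights):
    """x is indexed [sample][in_node]; the result is indexed [node][sample]."""
    xp = pad_input(x)
    n_samples, n_in = len(x), len(x[0])
    offset = PAD - KERNEL_SIZE // 2          # skip the one surplus leading zero
    y = []
    for n in range(len(weights)):            # outer loop: output nodes
        row = []
        for s in range(n_samples):           # inner loop: samples
            acc = 0.0
            for k in range(KERNEL_SIZE):     # innermost loop: kernel taps
                for c in range(n_in):        # the two input nodes (I and Q)
                    acc += weights[n][c][k] * xp[s + offset + k][c]
            row.append(acc)
        y.append(row)
    return y


# Tiny example: 4 samples, 2 input nodes, 2 output nodes.
x = [[1, 10], [2, 20], [3, 30], [4, 40]]
w = [
    [[0, 0, 0, 1, 0, 0, 0], [0] * 7],   # node 0: centre tap on I only
    [[0] * 7, [1] * 7],                 # node 1: sum of Q over the window
]
y = conv1d_same(x, w)
print(y)  # [[1.0, 2.0, 3.0, 4.0], [100.0, 100.0, 100.0, 100.0]]
```

Note that the input is indexed (samples, nodes) while the result is indexed (nodes, samples), which mirrors the dimension flip mentioned in the list.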

[Figure: summary of the conv1d_w1() layer design]
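
The buffer sizes quoted in the list above can be checked with a few lines of arithmetic. The sketch below assumes that input samples are stored as 2-byte bfloat16 values (the text states this only for the output) and that the output splits evenly across NSPLIT parts; the helper names are hypothetical.

```python
# Back-of-the-envelope sizing for the conv1d_w1() buffers.
# Assumption: 2 bytes per sample on both input and output.

BYTES_PER_BF16 = 2
OUT_NODES, SAMPLES, NSPLIT = 64, 1024, 4
KIB = 1024


def output_bytes(double_buffered=True):
    """Size of the 64x1024 bfloat16 output, optionally double buffered."""
    size = OUT_NODES * SAMPLES * BYTES_PER_BF16
    return 2 * size if double_buffered else size


def padded_row_bytes(pad):
    """Bytes in one input node after pre/post-padding with `pad` zeros."""
    return (SAMPLES + 2 * pad) * BYTES_PER_BF16


print(output_bytes() // KIB)                 # 256 (KiB, double buffered)
print(output_bytes(False) // NSPLIT // KIB)  # 32 (KiB per split, single buffer)
print(padded_row_bytes(3) % 16)              # 12: padding by 3 is not 16-byte aligned
print(padded_row_bytes(4) % 16)              # 0: padding by 4 is 16-byte aligned
```

Under these assumptions, padding by 4 rather than 3 is what makes each padded row of 1032 samples (2064 bytes) a multiple of 16 bytes.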