The following figure summarizes the key aspects of the design of the conv1d_w1() layer. The Jupyter Notebook used for validation is gen_vectors.ipynb.
An input memory tile is used to pre- and post-zero-pad the input tensor, satisfying the model's padding="same" requirement. The layer of interest uses kernel_size=7, which requires the incoming 1024 I/Q samples to be pre-padded with three zeros and post-padded with three zeros. To guarantee that the kernel input size is a multiple of 16 bytes, the design instead pre-pads with 4 zeros and post-pads with 4 zeros. The incoming (samples, nodes) dimensions become transposed on the output due to the nature of the compute; this is recovered in the max_pool1d_w2 layer.
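The padding arithmetic can be checked numerically. The sketch below (an illustration, not the production flow) uses float32 as a stand-in for bfloat16 and assumes a (samples, nodes) input layout:

```python
import numpy as np

KERNEL_SIZE = 7          # "same" padding needs (KERNEL_SIZE - 1) // 2 = 3 zeros per side
SAMPLES = 1024
PRE_PAD = POST_PAD = 4   # rounded up from 3 to keep the padded row 16-byte aligned

# Minimal "same" padding: 1030 samples * 2 bytes (bfloat16) is not a
# multiple of 16 bytes, but 1032 samples * 2 bytes = 2064 = 129 * 16 is.
assert ((SAMPLES + 3 + 3) * 2) % 16 != 0
padded = SAMPLES + PRE_PAD + POST_PAD
assert (padded * 2) % 16 == 0

# Toy (samples, nodes) tensor: pad the samples dimension, then note that the
# compute emits (nodes, samples), i.e. the dimensions come out transposed.
x = np.zeros((SAMPLES, 2), dtype=np.float32)     # float32 stand-in for bfloat16
x_pad = np.pad(x, ((PRE_PAD, POST_PAD), (0, 0)))
y = x_pad.T                                      # transposed layout on output
print(x_pad.shape, y.shape)                      # (1032, 2) (2, 1032)
```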
The layer input data fits in the local tile memory, but the output expands to 64x1024 bfloat16 samples, corresponding to 256KB assuming double buffering, which is larger than the 64KB of local tile memory. Splitting the output data impacts the nature of the processing, as described earlier in Key Design Concepts.
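These sizes can be sanity-checked with a short calculation (a sketch; byte counts assume 2-byte bfloat16 elements):

```python
BYTES_PER_BF16 = 2
nodes, samples = 64, 1024

single_buffer = nodes * samples * BYTES_PER_BF16   # one output buffer: 131072 B = 128KB
double_buffered = 2 * single_buffer                # ping-pong buffering: 256KB
tile_memory = 64 * 1024                            # local tile memory: 64KB

assert double_buffered > tile_memory               # whole output cannot fit locally

# With the NSPLIT=4 split used by this kernel, each double-buffered split
# is 256KB / 4 = 64KB, which matches the local tile memory budget.
NSPLIT = 4
per_split = double_buffered // NSPLIT
print(single_buffer // 1024, double_buffered // 1024, per_split // 1024)  # 128 256 64
```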
For this reason, the kernel was implemented using an output_async_buffer, which enables splitting the output_buffer by NSPLIT=4. Because this layer has only two input nodes, the mac_elem_64() intrinsic is used, which limits the maximum achievable hardware utilization to 25%. The innermost loop has KERNEL_SIZE=7 iterations and is fully unrolled. The next inner loop achieves II=57 with 7x2 MAC operations.
The overall kernel structure employs an outer loop over the nodes dimension, an inner loop over the samples dimension, and an innermost loop over the kernel_size dimension. This is a good fit for the chosen intrinsic.
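The loop ordering can be expressed as a scalar reference model (a sketch only: the real kernel is vectorized with mac_elem_64(), and it uses the 4/4 padding rather than the minimal 3 zeros per side assumed here; function and argument names are hypothetical):

```python
import numpy as np

def conv1d_ref(x_pad, weights, bias):
    """Scalar reference: x_pad is (in_nodes, samples + 2*3) with "same" padding,
    weights is (out_nodes, in_nodes, 7); returns (out_nodes, samples)."""
    KERNEL_SIZE = 7
    out_nodes, in_nodes, _ = weights.shape
    samples = x_pad.shape[1] - (KERNEL_SIZE - 1)
    y = np.zeros((out_nodes, samples), dtype=x_pad.dtype)
    for n in range(out_nodes):            # outer loop: nodes dimension
        for s in range(samples):          # inner loop: samples dimension
            acc = bias[n]
            for k in range(KERNEL_SIZE):  # innermost loop: kernel taps (unrolled in HW)
                for c in range(in_nodes): # two input nodes -> 7x2 MACs per sample
                    acc += weights[n, c, k] * x_pad[c, s + k]
            y[n, s] = acc
    return y

# Tiny example: 2 input nodes, 10 padded samples, 3 output nodes, all-ones weights.
x = np.arange(20, dtype=np.float64).reshape(2, 10)
w = np.ones((3, 2, 7))
b = np.zeros(3)
y = conv1d_ref(x, w, b)
print(y.shape)   # (3, 4)
```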
Notice how the tiling parameters of the memory tile are used to pre/post-pad the input samples dimension with 4 zeros.