Layer Design Details: conv2d_w1() - 2025.1 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2025-08-25
Version: 2025.1 English

The figure below summarizes the key aspects of the design of the conv2d_w1() layer. The Jupyter Notebook used for validation is gen_vectors.ipynb.

  • An input memory tile is used to zero-pad the input images from (28,28,1) tensors to (28,32,1) tensors. The padding adds four columns of zeros on the right side of each image so that the number of columns becomes a multiple of 32; only the column dimension requires padding because it forms the inner loop. Because the design uses bfloat16 data and the memory tiles require 32-bit alignment, the input memory tile is designed to take four images at a time (i.e., a (4,28,28,1) tensor), and conv2d_w1_graph is set up as a multi-rate solution with repetition_count=1 on the memory tile and repetition_count=4 on the compute kernel. This is a key principle carried across the full design. A graph-level sketch of this arrangement is given after the list below.

  • Because this layer has only a single input channel, the mac_elem_16_2() intrinsic is used at 50% capacity: it processes two input channels per MAC, and here one of them is zeroed out, which reduces the overall vector efficiency.

  • The inner loop achieves an initiation interval of II=17 to deliver nine MAC operations, which is only 26% efficient. Because the images in this design are small, it is difficult to fill the pipeline with more MAC operations; a larger design could do so easily.

  • The overall loop structure employs an outer loop over the output image rows and an inner loop over the output image columns, which is a good fit for the chosen intrinsic. A kernel-level sketch of this loop structure follows the figure below.

  • Notice how the tiling parameters of the memory tile are used to pad the column dimension of the input images with four additional zero-valued pixels.
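The sketch below illustrates how such a graph could be written with the ADF API: a shared_buffer memory tile whose read-access tiling pads the column dimension with zeros, plus repetition_count settings that let the buffer fire once per four kernel invocations. This is a minimal illustration and not the tutorial's shipped source; the port names, data file names, tiling values, field ordering, and the boundary_dimension-based zero padding shown here are assumptions to be checked against the actual conv2d_w1_graph code.

```cpp
// Illustrative ADF graph sketch only -- values and names are assumptions.
#include <adf.h>

using namespace adf;

void conv2d_w1(input_buffer<bfloat16>&, output_buffer<bfloat16>&); // compute kernel

class conv2d_w1_graph : public graph {
public:
    input_plio  din;
    output_plio dout;
    shared_buffer<bfloat16> mtx_in;   // memory tile holding a 4-image batch
    kernel cnv;

    conv2d_w1_graph() {
        din  = input_plio::create("din",  plio_64_bits, "data/conv2d_w1_in.txt");   // assumed file
        dout = output_plio::create("dout", plio_64_bits, "data/conv2d_w1_out.txt"); // assumed file

        // Buffer for four 28-row images with columns padded from 28 to 32.
        mtx_in = shared_buffer<bfloat16>::create({32, 28, 4}, 1, 1);

        // Write side: store only the 28 valid columns of each image.
        write_access(mtx_in.in[0]) = tiling({
            .buffer_dimension = {32, 28, 4},
            .tiling_dimension = {28, 28, 1},
            .offset           = {0, 0, 0},
            .tile_traversal   = {{.dimension = 2, .stride = 1, .wrap = 4}}});

        // Read side: emit one full 32-column image per kernel invocation;
        // reads beyond the 28-column data boundary return zeros (the padding).
        read_access(mtx_in.out[0]) = tiling({
            .buffer_dimension   = {32, 28, 4},
            .tiling_dimension   = {32, 28, 1},
            .offset             = {0, 0, 0},
            .tile_traversal     = {{.dimension = 2, .stride = 1, .wrap = 4}},
            .boundary_dimension = {28, 28, 4}});

        cnv = kernel::create(conv2d_w1);
        source(cnv) = "conv2d_w1.cc";    // assumed source file name
        runtime<ratio>(cnv) = 0.9;

        // Multi-rate: the memory tile fires once per 4-image batch while the
        // compute kernel runs four times, once per image.
        repetition_count(mtx_in) = 1;
        repetition_count(cnv)    = 4;

        connect(din.out[0],    mtx_in.in[0]);
        connect(mtx_in.out[0], cnv.in[0]);
        connect(cnv.out[0],    dout.in[0]);
    }
};
```

Batching four images per memory-tile transaction keeps the bfloat16 data 32-bit aligned while still letting the compute kernel see one image per invocation, which is the multi-rate principle the layer carries through the rest of the design.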

Figure 13: Key aspects of the conv2d_w1() layer design
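To make the loop structure and intrinsic usage concrete, here is a kernel-level sketch, assuming sixteen output channels mapped to the sixteen accumulator lanes and a channel-minor output layout. The broadcast_pixel() helper, the weight table, and the buffer layouts are hypothetical stand-ins for the tutorial's actual data marshalling; only the overall shape mirrors the description above: an outer loop over output rows, an inner loop over output columns, and nine mac_elem_16_2() calls per output pixel with the second input-channel slot zeroed.

```cpp
// Sketch of the conv2d_w1() kernel structure (not the shipped tutorial source).
#include <adf.h>

static constexpr int ROWS = 28;   // input image rows
static constexpr int COLS = 32;   // input image columns after zero padding
static constexpr int OCH  = 16;   // assumed output channels (one per accumulator lane)

// Hypothetical helper: replicate one input pixel across all 32 operand lanes.
v32bfloat16 broadcast_pixel(bfloat16 p);

// Assumed weight table: for each of the nine 3x3 taps, 16 output channels x
// 2 input-channel slots; the second slot of each pair is zero because this
// layer has a single input channel.
extern const bfloat16 conv2d_w1_weights[9 * 32];

void conv2d_w1(adf::input_buffer<bfloat16>& din,
               adf::output_buffer<bfloat16>& dout)
{
    const bfloat16* __restrict img = din.data();
    bfloat16*       __restrict out = dout.data();

    for (int r = 0; r < ROWS - 2; ++r)               // outer loop: output image rows
    {
        for (int c = 0; c < COLS - 2; ++c)           // inner loop: output image columns
            chess_prepare_for_pipelining
        {
            v16accfloat acc = null_v16accfloat();

            // Nine MACs per output pixel, one per 3x3 filter tap (fully
            // unrolled in practice). Each call updates 16 output channels but
            // only one of its two input-channel slots carries data, so the
            // intrinsic runs at 50% of its capacity.
            for (int k = 0; k < 9; ++k)
            {
                bfloat16    p = img[(r + k / 3) * COLS + (c + k % 3)];
                v32bfloat16 w = *(const v32bfloat16*)(conv2d_w1_weights + 32 * k);
                acc = mac_elem_16_2(broadcast_pixel(p), w, acc);
            }

            // Convert the float accumulator back to bfloat16 and store the 16
            // output-channel results for this pixel (channel-minor layout assumed).
            *(v16bfloat16*)(out + (r * (COLS - 2) + c) * OCH) = to_v16bfloat16(acc);
        }
    }
}
```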