The figure below summarizes the key aspects of the design of the conv2d_w1() layer. The Jupyter Notebook used for validation is gen_vectors.ipynb.
An input memory tile is used to zero-pad the input images from tensors of shape (28,28,1) to tensors of shape (28,32,1). This zero padding adds four columns of zeros on the right side of each image so that the overall number of columns is a multiple of 32. Only the column dimension requires padding because it forms the inner loop. Because the design uses bfloat16 data and the memory tiles require 32-bit alignment, the input memory tile is designed to take four images at a time (i.e., a (4,28,28,1) tensor), and the conv2d_w1_graph is set up as a multi-rate solution with repetition_count=1 on the memory tile and repetition_count=4 on the compute kernel. This is a key principle carried across the full design.
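The shape transformation performed by the memory tile can be mimicked in NumPy. The sketch below is illustrative only and does not reproduce the actual gen_vectors.ipynb code; the pad_cols_to_multiple() helper and the array names are hypothetical.

```python
import numpy as np

def pad_cols_to_multiple(images: np.ndarray, multiple: int = 32) -> np.ndarray:
    """Zero-pad the column dimension of a (batch, rows, cols, channels) tensor
    on the right so that the column count becomes a multiple of `multiple`."""
    cols = images.shape[2]
    pad = (-cols) % multiple                  # 28 columns -> 4 extra zero columns
    return np.pad(images, ((0, 0), (0, 0), (0, pad), (0, 0)))

# Four 28x28x1 images are grouped together, mirroring how the bfloat16 data is
# packed to satisfy the 32-bit alignment of the memory tile (float32 used here).
batch = np.random.rand(4, 28, 28, 1).astype(np.float32)
padded = pad_cols_to_multiple(batch)          # shape (4, 28, 32, 1)

# Multi-rate behavior: the memory tile is filled once (repetition_count=1)
# while the compute kernel consumes one image per invocation (repetition_count=4).
for image in padded:                          # four kernel invocations per fill
    assert image.shape == (28, 32, 1)
```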
Because this layer has only a single input channel, the mac_elem_16_2() intrinsic is used at 50% capacity; it processes two channels by default, and here one channel is zeroed out. This reduces its overall vector efficiency. The inner loop achieves an II of 17 while delivering nine MAC operations, which works out to only about 26% efficiency once the 50% intrinsic utilization is included. Because the images are small, it is difficult to fill the pipeline with more MAC operations; this could easily be done in a larger design.
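The quoted efficiency follows from the two factors above, assuming it is simply their product. A short calculation makes the arithmetic explicit (the 50% channel utilization and the II of 17 come from the text; nothing else is assumed):

```python
# Nine MAC operations are issued per 17-cycle inner-loop iteration, and each
# MAC uses only one of the intrinsic's two input channels (50% utilization).
mac_issue_rate    = 9 / 17      # ~0.53 of inner-loop cycles issue a MAC
channel_util      = 0.5         # one of two mac_elem_16_2() channels is zeroed
vector_efficiency = mac_issue_rate * channel_util
print(f"{vector_efficiency:.0%}")   # ~26%
```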
The overall loop structure employs an outer loop over the output image rows and an inner loop over the output image columns. This is a good fit for the chosen intrinsic.
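A scalar reference model with the same loop ordering is sketched below. It is only meant to show the row-outer / column-inner structure; it is not the kernel code, and the 3x3 kernel size, 16 output channels, and function name are assumptions made for illustration.

```python
import numpy as np

def conv2d_reference(image: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Scalar reference with the kernel's loop ordering: outer loop over output
    rows, inner loop over output columns.
    `image` is (rows, cols, in_ch); `weights` is (kh, kw, in_ch, out_ch)."""
    kh, kw, in_ch, out_ch = weights.shape
    out_rows = image.shape[0] - kh + 1
    out_cols = image.shape[1] - kw + 1
    out = np.zeros((out_rows, out_cols, out_ch), dtype=np.float32)
    for r in range(out_rows):              # outer loop: output image rows
        for c in range(out_cols):          # inner loop: output image columns
            patch = image[r:r+kh, c:c+kw, :]
            out[r, c, :] = np.tensordot(patch, weights,
                                        axes=([0, 1, 2], [0, 1, 2]))
    return out

# Example with the padded single-channel input; shapes are illustrative.
img = np.random.rand(28, 32, 1).astype(np.float32)
w   = np.random.rand(3, 3, 1, 16).astype(np.float32)
y   = conv2d_reference(img, w)             # (26, 30, 16)
```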
Notice how the tiling parameters of the memory tile are used to pad the column dimension of the input images with four additional zero-valued pixels.