The figure below summarizes the key aspects of the design of the `conv2d_w1()` layer. The Jupyter Notebook used for validation is `gen_vectors.ipynb`.
An input memory tile is used to zero-pad the input images from (28,28,1) tensors to (28,32,1) tensors. Only the column dimension requires padding because it forms the inner loop. Because the design uses `bfloat16` data and the memory tiles require 32-bit alignment, the input memory tile is designed to take four images at a time (i.e., a (4,28,28,1) tensor), and the `conv2d_w1_graph` is set up as a multi-rate solution with a `repetition_count=1` on the memory tile and a `repetition_count=4` on the compute kernel. This is a key principle carried across the full design.
Because this layer has only a single input channel, the `mac_elem_16_2()` intrinsic is used at 50% capacity: it processes two input channels by default, and here one channel is zeroed out. This reduces its overall vector efficiency. The inner loop achieves an initiation interval (II) of 17 while delivering nine MAC operations, which is only 26% efficient (nine MACs over 17 cycles at 50% channel utilization: 9/17 × 0.5 ≈ 26%). Because the images here are small, it is difficult to fill the pipeline with more MAC operations; this could easily be done in a larger design.
The overall loop structure employs an outer loop over the output image rows and an inner loop over the output image columns. This is a good fit for the chosen intrinsic.
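To make that loop nest concrete, a hypothetical kernel body is sketched below. The 3x3 receptive field, the buffer sizes, the placeholder weights, and the use of AIE API calls (which the compiler typically lowers onto the same `mac_elem_16_2()` datapath described above) are all assumptions for illustration, not the tutorial's actual kernel code.

```cpp
#include <adf.h>
#include <aie_api/aie.hpp>

void conv2d_w1(adf::input_buffer<bfloat16>& in,    // one padded (28,32,1) image
               adf::output_buffer<bfloat16>& out)  // output feature map
{
    constexpr unsigned IN_W  = 32;           // padded row length: 28 pixels + 4 zeros
    constexpr unsigned IN_H  = 28;
    constexpr unsigned K     = 3;            // assumed 3x3 receptive field
    constexpr unsigned OUT_H = IN_H - K + 1; // 26 output rows
    constexpr unsigned OUT_W = 32;           // output rows padded to two 16-lane blocks

    const bfloat16* pin  = in.data();
    bfloat16*       pout = out.data();

    // Placeholder weights; the real design would use the trained filter taps.
    static const bfloat16 wts[K * K] = {};

    const aie::vector<bfloat16, 32> vzero = aie::zeros<bfloat16, 32>();

    for (unsigned r = 0; r < OUT_H; ++r)            // outer loop: output rows
        for (unsigned cb = 0; cb < OUT_W; cb += 16) // inner loop: 16 output columns
        chess_prepare_for_pipelining
        {
            // The first tap initializes the accumulator...
            auto row0 = aie::load_unaligned_v<32>(pin + r * IN_W);
            auto win0 = aie::shuffle_down_fill(row0, vzero, cb).extract<16>(0);
            aie::accum<accfloat, 16> acc = aie::mul(win0, wts[0]);

            // ...and the remaining eight taps accumulate: nine MACs in total
            // per inner-loop iteration, matching the count quoted above.
            for (unsigned k = 1; k < K * K; ++k) {
                unsigned kr = k / K, kc = k % K;
                auto row = aie::load_unaligned_v<32>(pin + (r + kr) * IN_W);
                auto win = aie::shuffle_down_fill(row, vzero, cb + kc).extract<16>(0);
                acc = aie::mac(acc, win, wts[k]);
            }

            // Columns beyond the valid output width fall in the padded region.
            aie::store_unaligned_v(pout + r * OUT_W + cb,
                                   acc.to_vector<bfloat16>());
        }
}
```

Each padded input row is exactly one 32-lane `bfloat16` vector, so the column windows for the filter taps can be formed in-register, which is one reason padding the column dimension (rather than the row dimension) keeps the inner loop simple.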
Notice how the tiling parameters of the memory tile are used to pad the column dimension of the input images with four additional zero-valued pixels.