The figure below summarizes the key aspects of the design of the conv2d_w5() layer. The Jupyter Notebook used for validation is gen_vectors.ipynb. This fifth layer has a number of unique aspects:
This layer is split over four tiles due to the heavy storage requirement of its weights. Each tile processes one quarter of the total output channels, and each local tile memory stores one quarter of the total weights.
The layer uses input and output memory tiles, with an approach similar to the conv2d_w3() layer, to extract and replace samples in the order required for compute consumption. This leads to efficient utilization of the high-performance compute using max_4x8_8x4() and to perfect software pipelining of the inner loop.
Because computation is partitioned across the four compute tiles, a “collection” function must be performed upon completion of the processing. Output channel results must be passed from each compute tile to the fourth tile for collection and reordering. The design uses I/O streams for this purpose. This requires an additional sample reordering process that is outlined in further detail below.
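The four-way partitioning described above can be sketched in plain Python. This is only an illustrative model, not the kernel code: the count of 128 output channels is taken from the (3,8,128) output shape, while the assignment of contiguous channel blocks to tiles and the helper names `channels_for_tile` and `split_weights` are assumptions made for the example.

```python
NUM_TILES = 4        # the layer is split over four compute tiles
OUT_CHANNELS = 128   # from the (3,8,128) output tensor shape


def channels_for_tile(tile, num_tiles=NUM_TILES, out_channels=OUT_CHANNELS):
    """Return the output channels handled by one tile.

    Assumption: each tile owns one contiguous quarter of the channels.
    The real design may assign channels differently.
    """
    per_tile = out_channels // num_tiles
    start = tile * per_tile
    return range(start, start + per_tile)


def split_weights(weights, num_tiles=NUM_TILES):
    """Split a list of per-output-channel weight slices into one chunk per tile."""
    per_tile = len(weights) // num_tiles
    return [weights[t * per_tile:(t + 1) * per_tile] for t in range(num_tiles)]


# Stand-in weights: one placeholder entry per output channel.
weights = list(range(OUT_CHANNELS))
chunks = split_weights(weights)
```

Under these assumptions each tile holds 32 of the 128 weight slices, which is the "one quarter per local tile memory" arrangement described above.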
The figure below illustrates the output sample collection process required by the conv2d_w5() layer to collect and restore sample ordering in the four sets of output channels computed by its four tiles. An input memory tile extracts input samples with an 8 x 8 tiling pattern, and these samples are broadcast to all four compute tiles. Each tile computes its assigned portion of the output channels. Once collected by the fourth tile, these four data sets must be reshuffled into proper order by that tile prior to extraction via the 4 x 8 tiling pattern used by the output memory tile. The colored blocks in the figure below help to identify the various output image rows computed by each tile. These must be interleaved row-wise to restore the (3,8,128) tensor shape desired at the layer output. The fourth tile performs this using additional compute cycles. Each data set, when collected, is copied into a scratch memory in the local tile. These four scratch areas are then read in proper order to produce the desired shuffling.
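The collect-then-reshuffle step can be modeled with a short Python sketch. It assumes a simple round-robin interleave (row 0 from each tile, then row 1 from each tile, and so on); the actual read order used by the fourth tile is defined by the design and may differ. The function name `collect_and_interleave` and the row labels are hypothetical.

```python
NUM_TILES = 4


def collect_and_interleave(tile_results):
    """Model the collection step performed by the fourth tile.

    tile_results[t] is the list of rows produced by tile t.
    Assumption: rows are interleaved round-robin across tiles.
    """
    # Step 1: copy each collected data set into its own scratch area.
    scratch = [list(rows) for rows in tile_results]
    rows_per_tile = len(scratch[0])

    # Step 2: read the scratch areas in interleaved order.
    out = []
    for r in range(rows_per_tile):
        for t in range(len(scratch)):
            out.append(scratch[t][r])
    return out


# Toy data: tile t produced two rows labelled "t<tile>r<row>".
tiles = [[f"t{t}r{r}" for r in range(2)] for t in range(NUM_TILES)]
result = collect_and_interleave(tiles)
```

With this toy input, `result` starts with row 0 from tiles 0 through 3, followed by row 1 from tiles 0 through 3. The copy into scratch areas is what lets the final read pass pick rows in any order.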