Vectorization of 3x3 Conv2D Layer Processing - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2024-12-06
Version: 2024.2 English

The 2D convolutional layers in the MNIST ConvNet use 3x3 patch processing as outlined in the figure below. Each layer has a number of input channels $C_I$ of images of size $H_I\times W_I$ pixels. Each of these input images is processed by the 3x3 patch to produce a number of output channels $C_O$ of images of size $H_O\times W_O$ pixels, where $H_O=H_I-2$ and $W_O=W_I-2$. An outer border of one pixel width is lost in the output images to allow the 3x3 patch to fully span the input image without exceeding its borders. The 3x3 patch processing involves computing nine point-wise products of the input image pixels with the nine weights of the 3x3 patch and summing the results. A bias term is added to the sum, and this gives the value of the output pixel at the center of the 3x3 patch. The diagram below shows a $5\times 5$ input image and its corresponding $3\times 3$ output image. Each output pixel is computed by moving the patch one pixel to the right across the image, and then repeating this procedure along every row of the output image. Each input/output channel pair has a unique set of nine weights plus one bias for the 3x3 patch used to compute a specific output channel from a specific input channel.

figure9

The AIE-ML data path is optimized for matrix multiplication, and it turns out the 3x3 convolutional processing outlined above can be cast into this form. The diagram below shows the compute performed by the mac_4x8_8x4() intrinsic for the bfloat16 data type in AIE-ML. This intrinsic performs matrix multiplication of the form $M=X\times Y$, where matrix $X$ has size $[A\times B]$, matrix $Y$ has size $[B\times C]$, and matrix $M$ has size $[A\times C]$. In this case, $A\times B\times C = 4\times 8\times 4$: the input matrix $X$ is $4\times 8$, the input matrix $Y$ is $8\times 4$, and the output matrix $M$ is $4\times 4$.

The diagram below shows how the operands of this intrinsic may be loaded to perform 3x3 convolutional layer processing. Input matrix $X$ may be loaded with 8 channels of 4 pixels each, where the pixels are stored in columns for each channel. The AI Engine maps this input matrix in row-major order into its vector lanes. So we can consider matrix $X$ getting mapped into a 32-lane register where the first 8 lanes contain pixel-0 from all 8 channels, the second 8 lanes contain pixel-1 from all 8 channels, and so on.
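This row-major mapping means a given (pixel, channel) pair lands at a fixed vector lane. A one-line sketch of the indexing (the helper name `lane` is an assumption for illustration):

```cpp
// Row-major mapping of the 4x8 X matrix into a 32-lane vector register:
// row p holds pixel p of all 8 channels, so (pixel p, channel c) -> lane 8*p + c.
constexpr int lane(int pixel, int channel) { return 8 * pixel + channel; }
```

For example, pixel-1 of channel-0 lands in lane 8, and pixel-3 of channel-7 lands in the last lane, 31.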

The weights may be loaded as columns of matrix $Y$, where each weight $w(C_i,C_o)$ maps from an input channel to an output channel. Since the matrix has 8 rows and 4 columns, it maps 8 input channels to 4 output channels. The weights are not a function of the pixels; the same weights are used across a specific image.

Based on this vectorization, we can compute in a single cycle 16 output samples (four pixels for each of four output channels) from 32 input samples (four pixels from eight input channels). The trick then for each AI Engine kernel is to load these $4\times 8$ input samples and $8\times 4$ weights efficiently and continuously to keep the compute busy.
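Putting the pieces together, the full 3x3 convolution for one $4\times 4$ output tile can be modeled as nine accumulated $[4\times 8]\times[8\times 4]$ multiplies, one per patch position. This is a reference sketch under the assumptions $C_I=8$ and $C_O=4$ (the name `conv_tile` and its argument layout are illustrative, not the tutorial's kernel code):

```cpp
#include <cassert>
#include <vector>

// Sketch: compute 4 consecutive output pixels for 4 output channels from
// 8 input channels by accumulating nine [4x8]x[8x4] multiplies, one per
// 3x3 patch position (kr, kc).
// in[ci]: a row-major HI x WI image; w[kr*3+kc][ci][co]: weight of patch
// position (kr, kc) mapping input channel ci to output channel co.
void conv_tile(const std::vector<std::vector<float>>& in,
               const float w[9][8][4], const float bias[4],
               int WI, int r, int c0, float M[4][4]) {
    for (int a = 0; a < 4; ++a)                    // start accumulators at bias
        for (int co = 0; co < 4; ++co)
            M[a][co] = bias[co];
    for (int k = 0; k < 9; ++k) {                  // nine patch positions
        int kr = k / 3, kc = k % 3;
        for (int a = 0; a < 4; ++a)                // X row a: pixel (r+kr, c0+a+kc)
            for (int co = 0; co < 4; ++co)
                for (int ci = 0; ci < 8; ++ci)     // reduce over 8 input channels
                    M[a][co] += in[ci][(r + kr) * WI + (c0 + a + kc)] * w[k][ci][co];
    }
}
```

With all-ones inputs and weights and zero bias, each of the 16 outputs is $9\times 8 = 72$: nine patch positions, each reducing over eight input channels.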

figure10