Storage requirements for I/O data for some layers might exceed the available 64 KB in the local tile memory. Therefore, you must split the consumed input or produced output data into chunks. This has a direct impact on the nature of processing.
If the input and output data fit into the memory tile but do not fit into the local tile memory, split processing into NSPLIT chunks such that local tile storage does not exceed 64 KB. You must split both input and output buffers by the same factor in the local tile. Then schedule processing for the layer as a multi-rate solution with a repetition_count=1 on the memory tile and a repetition_count=NSPLIT on the kernel. For the Radio-ML design, this concept is used in the conv1d_w3-w7 layers as well as max_pool1d_w4-w6.

To motivate this, consider conv1d_w3 in more detail. The block has (64,512) bfloat16 samples on the I/Os. Assuming ping-pong buffering and zero insertion, the input requires at least 64 nodes x (512 samples + 3 pre-pad zeros + 3 post-pad zeros) x 2 ping-pong x 2 bytes/sample = 129.5 KB of storage, which is about 2.02x larger than the local tile. The processing must therefore be split over more than two chunks (NSPLIT > 2). For conv1d_w3, choose NSPLIT=8 to fit both I/Os into the local tile. The following diagram illustrates this concept.

The multi-rate solution with buffer splitting outlined above does not work if one of the buffers, input or output, does not require splitting. In this case, the multi-rate scheduling applies only to the buffer that is split. The buffer that is not split must use a repetition_count=1. But AIE kernel multi-rate scheduling forces both input and output scheduling to use the same factor. Instead, solve this using the asynchronous buffer mechanism on the buffer that requires splitting.

Consider the case with output splitting. Implement the kernel using output_async_buffer, which allows the output buffer to be split by a factor of NSPLIT. At the beginning of kernel execution, the kernel acquires the lock for the ping side of the output buffer. Once the kernel produces 1/NSPLIT of the output data, it releases the lock on the ping side and acquires it on the pong side, then continues processing in the same alternating fashion. For more information on asynchronous buffer ports, refer to Asynchronous Buffer Port Access (UG1603).

For the Radio-ML design, the conv1d_w1 layer uses this concept. To motivate this, consider conv1d_w1 in more detail. The block has (2,1024) bfloat16 samples on the input and (64,1024) on the output. The input requires ~8.04 KB of storage (accounting for zero-insertion and ping-pong storage) while the output requires 256 KB. Therefore, the output must be split by at least NSPLIT = 128 KB x 2 / 64 KB = 4. The following diagram shows the dataflow.

For layers with compute, for example conv1d_w1-w13, the I/O data is a 2D matrix represented as (nodes,samples).

Splitting the output:
Splitting the output processing over the samples dimension requires the weights to be read multiple times. The following figure highlights this:
Splitting the output processing over the nodes dimension requires the input samples to be read multiple times.
For conv1d_w1, the design splits the output in the nodes dimension because the input samples easily fit in the local tile and re-reading comes for “free.” It would also have been possible to split the output in the samples dimension, because the weights also fit in the local tile.

For conv1d_w3-w7 layers, the design splits outputs in the samples dimension because the weights fit in the local tile memory (while the input does not) and re-reading comes for “free.”
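The nodes-dimension output split described for conv1d_w1 can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not the design's kernel code: the function names `conv1d_same` and `conv1d_split_output_nodes` are hypothetical, the 7-tap filter length is inferred from the 3 + 3 zero padding mentioned earlier, and the "ping"/"pong" labels merely mimic the alternating buffer sides (no AIE API is used). It shows that producing the output in NSPLIT chunks of output nodes, re-reading the full input each time, gives the same result as the unsplit computation.

```python
def conv1d_same(x, w):
    """Reference 'same' 1D convolution (correlation form, zero padded).

    x: input as a list of nodes, each a list of samples (n_in x n_samp)
    w: weights indexed w[out_node][in_node][tap], with an odd tap count
    Returns the output as a list of nodes (n_out x n_samp).
    """
    n_samp = len(x[0])
    taps = len(w[0][0])
    pad = taps // 2
    # Zero insertion: pad each input node on both sides.
    xp = [[0] * pad + list(row) + [0] * pad for row in x]
    return [
        [
            sum(w_o[i][k] * xp[i][s + k]
                for i in range(len(x)) for k in range(taps))
            for s in range(n_samp)
        ]
        for w_o in w
    ]


def conv1d_split_output_nodes(x, w, nsplit):
    """Produce the output in nsplit chunks of output nodes.

    Every chunk re-reads the complete input x, which stays resident.
    Returns (y, schedule), where schedule lists
    (buffer_side, first_node, one_past_last_node) per chunk.
    """
    n_out = len(w)
    assert n_out % nsplit == 0, "nsplit must divide the number of output nodes"
    per_chunk = n_out // nsplit
    y, schedule = [], []
    for c in range(nsplit):
        lo, hi = c * per_chunk, (c + 1) * per_chunk
        side = "ping" if c % 2 == 0 else "pong"  # alternate buffer sides
        y.extend(conv1d_same(x, w[lo:hi]))       # only this chunk's weights
        schedule.append((side, lo, hi))
    return y, schedule


# Small integer example (hypothetical sizes): 2 input nodes, 8 output nodes,
# 16 samples, 7 taps.
x = [[(3 * n + s) % 5 - 2 for s in range(16)] for n in range(2)]
w = [[[(o + 2 * i + k) % 3 - 1 for k in range(7)] for i in range(2)]
     for o in range(8)]

y_full = conv1d_same(x, w)
y_split, schedule = conv1d_split_output_nodes(x, w, nsplit=4)
print("split output matches unsplit output:", y_split == y_full)
print("chunk schedule:", schedule)
```

Integer data is used so that the comparison is exact. The only cost of the split in this model is that the input is read once per chunk, which matches the observation that re-reading is cheap when the input already sits in the local tile.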
Splitting the input:
Splitting the input over the samples dimension requires explicit state history handling. The following figure highlights this:
Splitting the input over the nodes dimension requires the storage of partial results.
The latter requires additional storage and results in an implementation that does not software pipeline efficiently. For the former, either store the state history samples in the local tile or re-send them from the memory tile as needed. Re-sending the samples from the memory tile results in a slight bandwidth expansion, but this is not an issue because the conv1d_w3-w7 layers are not bandwidth-bound.
Therefore, the choice for conv1d_w3-w7 is to split the input data in the samples dimension and use the memory tiles to send samples with overlap to model state history. The conv1d_w1 input (as well as conv1d_w9-w13) does not need splitting because the local tile storage is sufficient to store all input samples.
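The overlap approach can be sketched for a single node in plain Python. The sketch assumes a 7-tap filter (inferred from the 3 + 3 zero padding quoted for conv1d_w3) and uses hypothetical helper names `conv_valid` and `overlapped_chunks`; it models only the data movement, not the memory tile configuration. Each chunk carries the taps - 1 extra samples it needs as state history, so the chunks can be processed independently.

```python
def conv_valid(xp, taps_w):
    """Valid-mode 1D correlation: one output per full window of len(taps_w)."""
    n_taps = len(taps_w)
    return [
        sum(taps_w[k] * xp[s + k] for k in range(n_taps))
        for s in range(len(xp) - n_taps + 1)
    ]


def overlapped_chunks(xp, n_out, nsplit, n_taps):
    """Cut the zero-padded input xp into nsplit chunks.

    Each chunk yields n_out // nsplit outputs and therefore needs
    n_taps - 1 additional samples that overlap with its neighbor.
    """
    assert n_out % nsplit == 0, "nsplit must divide the number of outputs"
    per_chunk = n_out // nsplit
    return [
        xp[c * per_chunk: c * per_chunk + per_chunk + n_taps - 1]
        for c in range(nsplit)
    ]


# Sizes taken from the conv1d_w3 discussion; data and taps are made up.
n_samp, nsplit, n_taps = 512, 8, 7
pad = n_taps // 2
x = [(7 * s) % 11 - 5 for s in range(n_samp)]
taps_w = [1, -2, 3, 0, -1, 2, 1]
xp = [0] * pad + x + [0] * pad            # zero insertion

y_full = conv_valid(xp, taps_w)

chunks = overlapped_chunks(xp, n_samp, nsplit, n_taps)
y_split = []
for chunk in chunks:                      # one kernel invocation per chunk
    y_split.extend(conv_valid(chunk, taps_w))

sent = sum(len(chunk) for chunk in chunks)
print("split output matches unsplit output:", y_split == y_full)
print("samples sent with overlap:", sent, "versus unsplit:", len(xp))
```

In this model the overlap adds nsplit - 1 repeated groups of taps - 1 samples per node, which is the bandwidth expansion mentioned above.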
One graph invocation of radioml_top is one inference based on 1024 complex I/Q incoming samples. This translates into the following per-layer invocations.

| Layer | Input Tensor Shape | Output Tensor Shape | Input storage req. local tile (KB) | Output storage req. local tile (KB) | Kernel invocation | Memory Tile invocation | Note |
|---|---|---|---|---|---|---|---|
| conv1d_w1 | (2,1024) | (64,1024) | 8 | 256 | 1 | 1 | Split output nodes by NSPLIT=8, handled inside kernel using output_async_buffer |
| max_pool1d_w2 | (64,1024) | (64,512) | 256 | 128 | 8 | 1 | — |
| conv1d_w3 | (64,512) | (64,512) | 176 | 128 | 8 | 1 | Split I/O over samples dimension |
| max_pool1d_w4 | (64,512) | (64,256) | 128 | 64 | 4 | NA | — |
| conv1d_w5 | (64,256) | (64,256) | 88 | 64 | 4 | 1 | Same note as conv1d_w3 |
| max_pool1d_w6 | (64,256) | (64,128) | 64 | 32 | 2 | NA | — |
| conv1d_w7 | (64,128) | (64,128) | 44 | 32 | 2 | 1 | Same note as conv1d_w3 |
| max_pool1d_w8 | (64,128) | (64,64) | 32 | 16 | 1 | NA | — |
| conv1d_w9 | (64,64) | (64,64) | 17.5 | 16 | 1 | 1 | NA |
| max_pool1d_w10 | (64,64) | (64,32) | 16 | 8 | 1 | NA | — |
| conv1d_w11 | (64,32) | (64,32) | 9.5 | 8 | 1 | 1 | NA |
| max_pool1d_w12 | (64,32) | (64,16) | 8 | 4 | 1 | NA | — |
| conv1d_w13 | (64,16) | (64,16) | 5.5 | 4 | 1 | 1 | NA |
| max_pool1d_w14 | (64,16) | (64,8) | 4 | 2 | 1 | NA | — |
| dense_w16 | (512) | (128) | 2 | 0.5 | 1 | NA | — |
| dense_w17 | (128) | (128) | 0.5 | 0.5 | 1 | NA | — |
| dense_w18 | (128) | (24) | 0.5 | 0.1 | 1 | NA | — |
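The storage arithmetic used in this section can be checked with a short Python sketch. It applies the formula quoted for conv1d_w3 (nodes x (samples + pre-pad + post-pad) x 2 ping-pong x 2 bytes/sample) and reproduces the 129.5 KB figure, the 256 KB conv1d_w1 output, and the input entries for the unsplit layers conv1d_w9, conv1d_w11, and conv1d_w13. The helper names are hypothetical. The split search assumes that every chunk carries its own 3 + 3 samples of padding or overlap and ignores weights and any other local storage, so it is a rough feasibility check rather than the design's actual sizing rule. It does not attempt to reproduce the table entries for the split layers.

```python
LOCAL_TILE_BYTES = 64 * 1024  # local tile memory quoted in this section


def buffer_bytes(nodes, samples, pad=0, bytes_per_sample=2, ping_pong=True):
    """Storage for a (nodes, samples) buffer with `pad` zeros on each side."""
    copies = 2 if ping_pong else 1
    return nodes * (samples + 2 * pad) * bytes_per_sample * copies


def kb(n_bytes):
    """Convert bytes to KB (1 KB = 1024 bytes)."""
    return n_bytes / 1024


def smallest_pow2_split(nodes, in_samples, out_samples, pad):
    """Smallest power-of-two NSPLIT for which input + output chunks fit.

    Assumption: both buffers are split over the samples dimension by the
    same factor and each input chunk keeps `pad` extra samples per side.
    """
    nsplit = 1
    while nsplit <= min(in_samples, out_samples):
        in_b = buffer_bytes(nodes, in_samples // nsplit, pad)
        out_b = buffer_bytes(nodes, out_samples // nsplit)
        if in_b + out_b <= LOCAL_TILE_BYTES:
            return nsplit
        nsplit *= 2
    raise ValueError("buffers do not fit for any power-of-two split")


# conv1d_w3 input, unsplit: 64 nodes x (512 + 3 + 3) samples.
w3_in_kb = kb(buffer_bytes(64, 512, pad=3))
print("conv1d_w3 input (KB):", w3_in_kb)
print("ratio to local tile:", round(w3_in_kb / 64, 2))

# conv1d_w1 output, unsplit: 64 nodes x 1024 samples, no padding.
print("conv1d_w1 output (KB):", kb(buffer_bytes(64, 1024)))

# Unsplit conv layers from the table.
for name, samples in [("conv1d_w9", 64), ("conv1d_w11", 32), ("conv1d_w13", 16)]:
    print(name, "input (KB):", kb(buffer_bytes(64, samples, pad=3)))

# Split factor for conv1d_w3 under the assumptions stated above.
print("conv1d_w3 NSPLIT:", smallest_pow2_split(64, 512, 512, pad=3))
```

Under these assumptions a split factor of 4 leaves conv1d_w3 slightly over the 64 KB limit once the output chunk is added, which is consistent with the choice of NSPLIT=8 described earlier.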