Storage requirements for the I/O data of some layers may exceed the 64 KB available in local tile memory. In that case, the consumed input or produced output data must be split into chunks, which has a direct impact on the nature of the processing.
If the input and output data fit into the memory tile but do not fit into local tile memory, processing must be split into NSPLIT chunks such that local tile storage does not exceed 64 KB. Both input and output buffers must be split by the same factor in the local tile. The layer can then be scheduled as a multi-rate solution with repetition_count=1 on the memory tile and repetition_count=NSPLIT on the kernel. In the Radio-ML design, this concept is used in the conv1d_w3-w7 layers as well as in max_pool1d_w4-w6.

To motivate this, consider conv1d_w3 in more detail. The block has (64,512) bfloat16 samples on its I/Os. Assuming ping-pong buffering and zero insertion, the input requires a storage size of at least 64 nodes x (512 samples + 3 pre-pad zeros + 3 post-pad zeros) x 2 ping-pong x 2 bytes/sample = 129.5 KB, just over twice the size of the local tile. This motivates splitting the processing into at least NSPLIT > 2 chunks; for conv1d_w3, NSPLIT = 8 is chosen so that both I/Os fit into the local tile. The following diagram illustrates this concept.

The multi-rate solution with buffer splitting outlined above does not work if one of the buffers, input or output, does not require splitting. In that case, multi-rate scheduling applies only to the buffer that is split, while the buffer that is not split must use a
repetition_count=1. However, AIE kernel multi-rate scheduling forces input and output to use the same factor. Instead, this can be solved using the asynchronous buffer mechanism on the buffer that requires splitting.

Consider the case of output splitting. The kernel can be implemented using an output_async_buffer, enabling the output buffer to be split by a factor of NSPLIT. At the beginning of kernel execution, the locks for the ping side of the output buffer are acquired. Once 1/NSPLIT of the output data has been produced, the kernel releases the lock on the ping side and acquires it on the pong side, then continues processing, and so on. For more information on asynchronous buffer ports, refer to Asynchronous Buffer Port Access (UG1603).

In the Radio-ML design, this concept is used in the conv1d_w1 layer. To motivate this, consider conv1d_w1 in more detail. The block has (2,1024) bfloat16 samples on the input and (64,1024) on the output. The input requires ~8.04 KB of storage (accounting for zero insertion and ping-pong storage), while the output requires 256 KB. Therefore, at least NSPLIT = 128 KB x 2 / 64 KB = 4 is required. The following diagram illustrates the dataflow.

For layers with compute, for example
conv1d_w1-w13, the I/O data is a 2D matrix represented as (nodes, samples).

Splitting the output:
Splitting the output processing over the samples dimension requires the weights to be read multiple times. This is highlighted in the following figure:
Splitting the output processing over the nodes dimension requires the input samples to be read multiple times.
For conv1d_w1, the output was split in the nodes dimension since the input samples easily fit in the local tile and re-reading them comes for "free." It would also have been possible to split the output in the samples dimension, since the weights also fit in the local tile.

For the conv1d_w3-w7 layers, the outputs were split in the samples dimension since the weights fit in local tile memory (while the input does not) and re-reading them comes for "free."
Splitting the input:
Splitting the input over the samples dimension requires explicit state history handling. This is highlighted in the following figure:
Splitting the input over the nodes dimension requires the storage of partial results.
The latter requires additional storage and results in an implementation that does not software-pipeline efficiently. For the former, the state history samples can either be stored in the local tile or re-sent from the memory tile as needed. Re-sending the samples from the memory tile results in a slight bandwidth expansion, but this is not an issue since the conv1d_w3-w7 layers are not bandwidth-bound.

Therefore, the choice for conv1d_w3-w7 is to split the input data in the samples dimension and use the memory tiles to send samples with overlap to model the state history. The conv1d_w1 input (as well as that of conv1d_w9-w13) does not need to be split since local tile storage is sufficient to hold all input samples.
One graph invocation of radioml_top is one inference based on 1024 complex I/Q incoming samples. This translates into the following per-layer invocations.

| Layer | Input Tensor Shape | Output Tensor Shape | Input storage req. in local tile (KB) | Output storage req. in local tile (KB) | Kernel invocations | Memory Tile invocations | Note |
|---|---|---|---|---|---|---|---|
| conv1d_w1 | (2,1024) | (64,1024) | 8 | 256 | 1 | 1 | Split output nodes by NSPLIT=8, handled inside the kernel using output_async_buffer |
| max_pool1d_w2 | (64,1024) | (64,512) | 256 | 128 | 8 | 1 | — |
| conv1d_w3 | (64,512) | (64,512) | 176 | 128 | 8 | 1 | Split I/O over samples dimension |
| max_pool1d_w4 | (64,512) | (64,256) | 128 | 64 | 4 | NA | — |
| conv1d_w5 | (64,256) | (64,256) | 88 | 64 | 4 | 1 | Same note as conv1d_w3 |
| max_pool1d_w6 | (64,256) | (64,128) | 64 | 32 | 2 | NA | — |
| conv1d_w7 | (64,128) | (64,128) | 44 | 32 | 2 | 1 | Same note as conv1d_w3 |
| max_pool1d_w8 | (64,128) | (64,64) | 32 | 16 | 1 | NA | — |
| conv1d_w9 | (64,64) | (64,64) | 17.5 | 16 | 1 | 1 | NA |
| max_pool1d_w10 | (64,64) | (64,32) | 16 | 8 | 1 | NA | — |
| conv1d_w11 | (64,32) | (64,32) | 9.5 | 8 | 1 | 1 | NA |
| max_pool1d_w12 | (64,32) | (64,16) | 8 | 4 | 1 | NA | — |
| conv1d_w13 | (64,16) | (64,16) | 5.5 | 4 | 1 | 1 | NA |
| max_pool1d_w14 | (64,16) | (64,8) | 4 | 2 | 1 | NA | — |
| dense_w16 | (512) | (128) | 2 | 0.5 | 1 | NA | — |
| dense_w17 | (128) | (128) | 0.5 | 0.5 | 1 | NA | — |
| dense_w18 | (128) | (24) | 0.5 | 0.1 | 1 | NA | — |