Key Design Concepts - 2025.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2025-12-05
Version: 2025.2 English
  • Storage requirements for I/O data of some layers may exceed the 64 KB available in the local tile memory. The consumed input or produced output data must therefore be split into chunks, which directly shapes how the processing is structured.

  • If the input and output data fit into the memory tile but do not fit into the local tile memory, processing must be split into NSPLIT chunks such that local tile storage does not exceed 64 KB. Both input and output buffers must be split by the same factor in the local tile. The layer can then be scheduled as a multi-rate solution with repetition_count=1 on the memory tile and repetition_count=NSPLIT on the kernel. For the Radio-ML design, this concept is used in the conv1d_w3-w7 layers as well as in max_pool1d_w4-w6. To motivate this, consider conv1d_w3 in more detail. The block has (64,512) bfloat16 samples on the I/Os. Assuming ping-pong buffering and zero insertion, the input alone requires a storage size of at least 64 nodes x (512 samples + 3 pre-pad zeros + 3 post-pad zeros) x 2 ping-pong x 2 bytes/sample = 129.5 KB, which is 2.03x the local tile capacity. This motivates splitting the processing over NSPLIT > 2 chunks (that is, at least 3). For conv1d_w3, NSPLIT=8 is chosen so that both I/Os fit into the local tile. The following diagram illustrates this concept.

    figure
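    The sizing arithmetic above can be checked with a small host-side sketch (a minimal sketch; `conv_input_bytes` and `min_nsplit` are hypothetical helpers for illustration, not part of the ADF API):

    ```cpp
    #include <cstdio>

    // Local data-memory budget of one AI Engine tile, in bytes.
    constexpr int LOCAL_TILE_BYTES = 64 * 1024;

    // Ping-pong input storage for a conv1d layer, in bytes:
    // nodes x (samples + pre-pad + post-pad) x 2 (ping-pong) x 2 (bfloat16).
    constexpr long conv_input_bytes(int nodes, int samples, int pad) {
        return (long)nodes * (samples + 2 * pad) * 2 * 2;
    }

    // Smallest NSPLIT such that one chunk of the input fits in the local tile.
    int min_nsplit(long total_bytes) {
        int n = 1;
        while (total_bytes / n > LOCAL_TILE_BYTES) ++n;
        return n;
    }

    int main() {
        // conv1d_w3: (64,512) bfloat16 with 3 zeros of pre- and post-padding.
        long bytes = conv_input_bytes(64, 512, 3);
        printf("input storage: %.1f KB\n", bytes / 1024.0);  // 129.5 KB
        printf("minimum NSPLIT: %d\n", min_nsplit(bytes));   // 3; the design uses 8
        return 0;
    }
    ```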

  • The multi-rate solution with buffer splitting outlined above does not work if one of the buffers, input or output, does not require splitting. In that case, multi-rate scheduling applies only to the buffer that is split; the buffer that is not split must use repetition_count=1. However, AIE kernel multi-rate scheduling forces input and output scheduling to use the same factor. Instead, this can be solved by using the asynchronous buffer mechanism on the buffer that requires splitting.

    Consider the case of output splitting. The kernel can be implemented using an output_async_buffer, enabling the output buffer to be split by a factor of NSPLIT. At the beginning of kernel execution, the locks for the ping side of the output buffer are acquired. Once 1/NSPLIT of the output data has been produced, the kernel releases the lock on the ping side, acquires it on the pong side, continues processing, and so on. For more information on asynchronous buffer ports, refer to Asynchronous Buffer Port Access (UG1603). For the Radio-ML design, this concept is used in the conv1d_w1 layer. To motivate this, consider conv1d_w1 in more detail. The block has (2,1024) bfloat16 samples on the input and (64,1024) on the output. The input requires ~8.04 KB of storage (accounting for zero insertion and ping-pong storage), while the output requires 256 KB. This requires NSPLIT = 128 KB x 2 / 64 KB = 4. The following diagram illustrates the dataflow.

    figure
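    The ping/pong handoff described above can be modeled on the host (a minimal sketch; `PingPongBuffer`, `acquire`, and `release` are hypothetical stand-ins for the AIE lock mechanism, not the real output_async_buffer API):

    ```cpp
    #include <cstdio>
    #include <vector>

    // Hypothetical host-side model of an asynchronous output buffer:
    // the kernel fills one half (ping or pong) per chunk and releases it
    // to the downstream consumer before switching to the other half.
    struct PingPongBuffer {
        std::vector<float> ping, pong;
        int released = 0;  // number of chunks handed downstream
        explicit PingPongBuffer(int chunk) : ping(chunk), pong(chunk) {}
        std::vector<float>& acquire(int chunk_idx) {
            return (chunk_idx % 2 == 0) ? ping : pong;  // alternate sides
        }
        void release() { ++released; }  // model of the lock release
    };

    // Produce `total` output samples in `nsplit` chunks through the buffer.
    void kernel(PingPongBuffer& out, int total, int nsplit) {
        const int chunk = total / nsplit;
        for (int c = 0; c < nsplit; ++c) {
            auto& buf = out.acquire(c);           // lock one side
            for (int i = 0; i < chunk; ++i)
                buf[i] = (float)(c * chunk + i);  // compute 1/NSPLIT of output
            out.release();                        // hand the side downstream
        }
    }

    int main() {
        const int NSPLIT = 4;  // 256 KB output vs. 64 KB local tile
        PingPongBuffer out(1024 / NSPLIT);
        kernel(out, 1024, NSPLIT);
        printf("chunks released: %d\n", out.released);  // 4
        return 0;
    }
    ```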

  • For layers with compute, for example conv1d_w1-w13, the I/O data is a 2D matrix represented as (nodes,samples).

    • Splitting the output:

      • Splitting the output processing over the samples dimension requires the weights to be read multiple times. This is highlighted in the following figure:

        figure

      • Splitting the output processing over the nodes dimension requires the input samples to be read multiple times.

        figure

      • For conv1d_w1, the output was split in the nodes dimension since the input samples easily fit in the local tile and re-reading them comes for “free.” Splitting the output in the samples dimension would also have been possible, since the weights also fit in the local tile.

      • For the conv1d_w3-w7 layers, outputs were split in the samples dimension since the weights fit in the local tile memory (while the input does not) and re-reading the weights comes for “free.”

    • Splitting the input:

      • Splitting the input over the samples dimension requires explicit state-history handling. This is highlighted in the following figure:

        figure

      • Splitting the input over the nodes dimension requires the storage of partial results.

        figure

      • The latter requires additional storage and yields an implementation that does not software-pipeline efficiently. For the former, state-history samples can either be stored in the local tile or re-sent from the memory tile as needed. Re-sending samples from the memory tile causes a slight bandwidth expansion, but this is not an issue since the conv1d_w3-w7 layers are not bandwidth-bound.

      • Therefore, the choice for conv1d_w3-w7 is to split the input data in the samples dimension and use the memory tiles to send samples with overlap to model the state history. The conv1d_w1 input (as well as conv1d_w9-w13) does not need to be split, since the local tile storage is sufficient to hold all input samples.
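      Sending chunks with overlap from the memory tile can be sketched as follows (a minimal sketch; `chunk_with_overlap` is a hypothetical helper, and the 7-tap filter length is an assumption inferred from the 3+3 zero padding of conv1d_w3):

      ```cpp
      #include <cstdio>
      #include <utility>

      // For chunk c of nsplit, return the [start, end) range of input samples
      // the memory tile must send so the kernel sees K-1 history samples (the
      // overlap) in front of its own chunk. Chunk 0 has no history; the graph
      // supplies pre-pad zeros there instead.
      std::pair<int, int> chunk_with_overlap(int samples, int nsplit, int K, int c) {
          const int chunk = samples / nsplit;
          const int start = c * chunk - (c > 0 ? (K - 1) : 0);
          const int end = (c + 1) * chunk;
          return {start, end};
      }

      int main() {
          // conv1d_w3: 512 samples split into NSPLIT=8 chunks; assume a
          // 7-tap filter (3 pre-pad + 3 post-pad zeros).
          const int samples = 512, nsplit = 8, K = 7;
          for (int c = 0; c < nsplit; ++c) {
              auto [s, e] = chunk_with_overlap(samples, nsplit, K, c);
              printf("chunk %d: samples [%d, %d) -> %d sent\n", c, s, e, e - s);
          }
          return 0;
      }
      ```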

    • One graph invocation of radioml_top performs one inference on 1024 complex I/Q incoming samples. This translates into the following per-layer invocation counts.

      | Layer | Input Tensor Shape | Output Tensor Shape | Input storage req. local tile (KB) | Output storage req. local tile (KB) | Kernel invocation | Memory Tile invocation | Note |
      | --- | --- | --- | --- | --- | --- | --- | --- |
      | conv1d_w1 | (2,1024) | (64,1024) | 8 | 256 | 1 | 1 | Split output nodes by NSPLIT=8, handled inside kernel using async_output_buffer |
      | max_pool1d_w2 | (64,1024) | (64,512) | 256 | 128 | 8 | 1 | |
      | conv1d_w3 | (64,512) | (64,512) | 176 | 128 | 8 | 1 | Split I/O over samples dimension |
      | max_pool1d_w4 | (64,512) | (64,256) | 128 | 64 | 4 | NA | |
      | conv1d_w5 | (64,256) | (64,256) | 88 | 64 | 4 | 1 | Same note as conv1d_w3 |
      | max_pool1d_w6 | (64,256) | (64,128) | 64 | 32 | 2 | NA | |
      | conv1d_w7 | (64,128) | (64,128) | 44 | 32 | 2 | 1 | Same note as conv1d_w3 |
      | max_pool1d_w8 | (64,128) | (64,64) | 32 | 16 | 1 | NA | |
      | conv1d_w9 | (64,64) | (64,64) | 17.5 | 16 | 1 | 1 | NA |
      | max_pool1d_w10 | (64,64) | (64,32) | 16 | 8 | 1 | NA | |
      | conv1d_w11 | (64,32) | (64,32) | 9.5 | 8 | 1 | 1 | NA |
      | max_pool1d_w12 | (64,32) | (64,16) | 8 | 4 | 1 | NA | |
      | conv1d_w13 | (64,16) | (64,16) | 5.5 | 4 | 1 | 1 | NA |
      | max_pool1d_w14 | (64,16) | (64,8) | 4 | 2 | 1 | NA | |
      | dense_w16 | (512) | (128) | 2 | 0.5 | 1 | NA | |
      | dense_w17 | (128) | (128) | 0.5 | 0.5 | 1 | NA | |
      | dense_w18 | (128) | (24) | 0.5 | 0.1 | 1 | NA | |
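
      The per-layer storage figures follow from the tensor shapes (a minimal sketch; `storage_kb` is a hypothetical helper that assumes bfloat16 samples with ping-pong double buffering, which matches the max_pool rows and the conv output columns; the conv input figures are larger because they also account for zero insertion):

      ```cpp
      #include <cstdio>

      // Ping-pong bfloat16 storage for a (nodes, samples) tensor, in KB:
      // nodes x samples x 2 bytes/sample x 2 (ping-pong) / 1024.
      double storage_kb(int nodes, int samples) {
          return nodes * (double)samples * 2 * 2 / 1024.0;
      }

      int main() {
          // max_pool1d_w2: input (64,1024) -> 256 KB, output (64,512) -> 128 KB,
          // matching the table above.
          printf("max_pool1d_w2 in:  %.0f KB\n", storage_kb(64, 1024));  // 256
          printf("max_pool1d_w2 out: %.0f KB\n", storage_kb(64, 512));   // 128
          return 0;
      }
      ```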