Key Design Concepts

Vitis Tutorials: AI Engine Development (XD100)

Document ID: XD100
Release Date: 2026-03-27
Version: 2025.2 English
  • Storage requirements for I/O data for some layers might exceed the available 64 KB in the local tile memory. Therefore, you must split the consumed input or produced output data into chunks. This has a direct impact on the nature of processing.

  • If the input and output data fit into the memory tile but not into the local tile memory, split processing into NSPLIT chunks such that local tile storage does not exceed 64 KB. You must split both the input and output buffers by the same factor in the local tile. Then schedule processing for the layer as a multi-rate solution with repetition_count=1 on the memory tile and repetition_count=NSPLIT on the kernel. In the Radio-ML design, the conv1d_w3-w7 layers and max_pool1d_w4-w6 use this concept. To motivate this, consider conv1d_w3 in more detail. The block has (64,512) bfloat16 samples on its I/Os. Assuming ping-pong buffering and zero insertion, the input requires at least 64 nodes x (512 samples + 3 pre-pad zeros + 3 post-pad zeros) x 2 ping-pong x 2 bytes/sample = 129.5 KB of storage, which is about twice the size of the local tile memory. Processing must therefore be split over more than two chunks (NSPLIT > 2). For conv1d_w3, choose NSPLIT=8 to fit both I/Os into the local tile. The following diagram illustrates this concept.

    figure
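The sizing argument above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not code from the tutorial. The helper names `pingpong_bytes` and `min_nsplit` are hypothetical, and the sketch assumes 1 KB = 1024 bytes, 2-byte bfloat16 samples, and equal-sized chunks with no per-chunk overlap.

```python
import math

BYTES_PER_SAMPLE = 2          # bfloat16
LOCAL_TILE_BYTES = 64 * 1024  # local tile memory quoted in the text

def pingpong_bytes(nodes, samples, pre_pad=0, post_pad=0):
    """Bytes for a ping-pong (double) buffer holding a (nodes, samples) block."""
    return nodes * (samples + pre_pad + post_pad) * 2 * BYTES_PER_SAMPLE

def min_nsplit(total_bytes, budget=LOCAL_TILE_BYTES):
    """Smallest number of equal chunks so that one chunk fits in the budget."""
    return math.ceil(total_bytes / budget)

# conv1d_w3: (64,512) in and out, 3 + 3 zero-padding samples on the input
in_bytes = pingpong_bytes(64, 512, pre_pad=3, post_pad=3)
out_bytes = pingpong_bytes(64, 512)

print(in_bytes / 1024)                   # 129.5
print(min_nsplit(in_bytes))              # 3
print(min_nsplit(in_bytes + out_bytes))  # 5
```

Under these assumptions the input alone needs at least 3 chunks, and the input and output together need at least 5. The NSPLIT=8 chosen for conv1d_w3 satisfies both bounds.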

  • The multi-rate solution with buffer splitting outlined above does not work if only one of the buffers, input or output, requires splitting. In this case, multi-rate scheduling should apply only to the buffer that is split, while the buffer that is not split must use repetition_count=1. However, AIE kernel multi-rate scheduling forces the input and output to use the same factor. Instead, solve this by using the asynchronous buffer mechanism on the buffer that requires splitting.

    Consider the case with output splitting. Implement the kernel using output_async_buffer, which enables splitting the output buffer by a factor of NSPLIT. At the beginning of kernel execution, the kernel acquires the lock for the ping side of the output buffer. Once the kernel has produced 1/NSPLIT of the output data, it releases the lock on the ping side, acquires the lock on the pong side, and continues processing, alternating in this way until all output is produced. For more information on asynchronous buffer ports, refer to Asynchronous Buffer Port Access (UG1603). In the Radio-ML design, the conv1d_w1 layer uses this concept. To motivate this, consider conv1d_w1 in more detail. The block has (2,1024) bfloat16 samples on the input and (64,1024) on the output. The input requires ~8.04 KB of storage (accounting for zero insertion and ping-pong storage), while the output requires 256 KB. Therefore, the output must be split by a factor of at least NSPLIT = 128 KB x 2 / 64 KB = 4. The following diagram shows the dataflow.

    figure
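The alternating lock sequence described above can be modelled in plain Python. This is only a conceptual sketch: it does not use the ADF API, and the function names `min_output_nsplit` and `async_output_schedule` are hypothetical. It assumes 2-byte bfloat16 samples, a 64 KB local tile, and one acquire/release pair per output chunk.

```python
import math

def min_output_nsplit(nodes, samples, bytes_per_sample=2, tile_bytes=64 * 1024):
    """Lower bound on NSPLIT so that the ping-pong output fits in one tile."""
    pingpong = nodes * samples * bytes_per_sample * 2
    return math.ceil(pingpong / tile_bytes)

def async_output_schedule(nsplit):
    """Lock events of one kernel invocation that emits nsplit output chunks."""
    events = []
    for chunk in range(nsplit):
        side = "ping" if chunk % 2 == 0 else "pong"
        events.append(("acquire", side, chunk))
        # ... the kernel writes 1/nsplit of the output into this side ...
        events.append(("release", side, chunk))
    return events

nsplit = min_output_nsplit(64, 1024)   # conv1d_w1 output shape
for event in async_output_schedule(nsplit):
    print(event)
```

The kernel runs once per graph iteration while the output side sees NSPLIT buffer hand-offs, which is how the input can keep a repetition count of 1.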

  • For layers with compute, for example conv1d_w1-w13, the I/O data is a 2D matrix represented as (nodes,samples).

    • Splitting the output:

      • Splitting the output processing over the samples dimension requires the weights to be read multiple times. The following figure highlights this: figure

      • Splitting the output processing over the nodes dimension requires the input samples to be read multiple times. figure

      • For conv1d_w1, the design splits the output in the nodes dimension because the input samples easily fit in the local tile, so re-reading them comes for “free.” It would also have been possible to split the output in the samples dimension, because the weights also fit in the local tile.

      • For conv1d_w3-w7 layers, the design splits the outputs in the samples dimension because the weights fit in the local tile memory (while the input does not), so re-reading them comes for “free.”
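To illustrate the trade-off when splitting the output over the nodes dimension, the sketch below computes a small 1-D convolution layer in one pass and again in groups of output nodes. It is a toy reference model in plain Python, not the AI Engine kernel. The function names, the tiny tensor sizes, and the zero "same" padding are assumptions made for illustration.

```python
def conv1d_layer(x, w):
    """y[o][n] = sum over i, k of w[o][i][k] * xpad[i][n + k] (zero padding)."""
    taps = len(w[0][0])
    pad = taps // 2
    n_samples = len(x[0])
    xp = [[0] * pad + row + [0] * pad for row in x]
    return [
        [sum(w_o[i][k] * xp[i][n + k]
             for i in range(len(x)) for k in range(taps))
         for n in range(n_samples)]
        for w_o in w
    ]

def split_over_nodes(x, w, nsplit):
    """Produce the output in nsplit groups of output nodes."""
    step = len(w) // nsplit
    y, input_passes = [], 0
    for c in range(nsplit):
        # Each group needs the complete input but only its own weight slice.
        y += conv1d_layer(x, w[c * step:(c + 1) * step])
        input_passes += 1
    return y, input_passes

x = [[1, 2, 3, 4], [5, 6, 7, 8]]   # (2 input nodes, 4 samples)
w = [[[o + i + k for k in range(3)] for i in range(2)] for o in range(4)]

y_split, passes = split_over_nodes(x, w, nsplit=2)
assert y_split == conv1d_layer(x, w)
print(passes)   # 2
```

Every group re-reads all input samples while each weight is read once, which is cheap when the whole input already sits in the local tile.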

    • Splitting the input:

      • Splitting the input over the samples dimension requires explicit state history handling. The following figure highlights this: figure

      • Splitting the input over the nodes dimension requires the storage of partial results. figure

      • The latter requires additional storage and results in an implementation that does not software pipeline efficiently. For the former, either store the state history samples in the local tile or re-send them from the memory tile as needed. Re-sending the samples from the memory tile results in a slight bandwidth expansion, but this is not an issue because the conv1d_w3-w7 layers are not bandwidth-bound.

      • Therefore, the choice for conv1d_w3-w7 is to split the input data in the samples dimension and use the memory tiles to send samples with overlap to model state history. The conv1d_w1 input (as well as conv1d_w9-w13) does not need splitting because the local tile storage is sufficient to store all input samples.
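Splitting the input over the samples dimension with overlap can be checked on a single channel. The sketch below is a plain-Python model under stated assumptions: a 7-tap filter (inferred from the 3 pre-pad and 3 post-pad zeros mentioned earlier), zero padding at both ends, and hypothetical helper names. It is not the memory tile access pattern used by the design.

```python
def conv1d_same(x, w):
    """Reference: one pass over the zero-padded input."""
    pad = len(w) // 2
    xp = [0] * pad + list(x) + [0] * pad
    return [sum(w[k] * xp[n + k] for k in range(len(w))) for n in range(len(x))]

def overlapped_chunks(x, taps, nsplit):
    """Input windows per chunk: chunk samples plus taps - 1 overlap samples."""
    pad = taps // 2
    xp = [0] * pad + list(x) + [0] * pad
    chunk = len(x) // nsplit
    return [xp[c * chunk: c * chunk + chunk + 2 * pad] for c in range(nsplit)]

def conv1d_chunked(x, w, nsplit):
    """Same result, computed independently on each overlapped window."""
    out = []
    for window in overlapped_chunks(x, len(w), nsplit):
        n_out = len(window) - (len(w) - 1)
        out += [sum(w[k] * window[n + k] for k in range(len(w)))
                for n in range(n_out)]
    return out

x = list(range(1, 33))
w = [1, -2, 3, 4, 3, -2, 1]
assert conv1d_chunked(x, w, 4) == conv1d_same(x, w)

# Samples sent per node for a 512-sample input split into 8 chunks
sent = sum(len(c) for c in overlapped_chunks(range(512), 7, 8))
print(sent, sent / 512)   # 560 1.09375
```

In this model each chunk carries 6 extra samples, so the re-sent history adds under 10% to the input traffic, consistent with the "slight bandwidth expansion" described above.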

    • One graph invocation of radioml_top performs one inference on 1024 incoming complex I/Q samples. This translates into the following per-layer invocations.

      | Layer | Input Tensor Shape | Output Tensor Shape | Input storage required in local tile (KB) | Output storage required in local tile (KB) | Kernel invocation | Memory Tile invocation | Note |
      |---|---|---|---|---|---|---|---|
      | conv1d_w1 | (2,1024) | (64,1024) | 8 | 256 | 1 | 1 | Split output nodes by NSPLIT=8, handled inside kernel using output_async_buffer |
      | max_pool1d_w2 | (64,1024) | (64,512) | 256 | 128 | 8 | 1 | |
      | conv1d_w3 | (64,512) | (64,512) | 176 | 128 | 8 | 1 | Split I/O over samples dimension |
      | max_pool1d_w4 | (64,512) | (64,256) | 128 | 64 | 4 | NA | |
      | conv1d_w5 | (64,256) | (64,256) | 88 | 64 | 4 | 1 | Same note as conv1d_w3 |
      | max_pool1d_w6 | (64,256) | (64,128) | 64 | 32 | 2 | NA | |
      | conv1d_w7 | (64,128) | (64,128) | 44 | 32 | 2 | 1 | Same note as conv1d_w3 |
      | max_pool1d_w8 | (64,128) | (64,64) | 32 | 16 | 1 | NA | |
      | conv1d_w9 | (64,64) | (64,64) | 17.5 | 16 | 1 | 1 | NA |
      | max_pool1d_w10 | (64,64) | (64,32) | 16 | 8 | 1 | NA | |
      | conv1d_w11 | (64,32) | (64,32) | 9.5 | 8 | 1 | 1 | NA |
      | max_pool1d_w12 | (64,32) | (64,16) | 8 | 4 | 1 | NA | |
      | conv1d_w13 | (64,16) | (64,16) | 5.5 | 4 | 1 | 1 | NA |
      | max_pool1d_w14 | (64,16) | (64,8) | 4 | 2 | 1 | NA | |
      | dense_w16 | (512) | (128) | 2 | 0.5 | 1 | NA | |
      | dense_w17 | (128) | (128) | 0.5 | 0.5 | 1 | NA | |
      | dense_w18 | (128) | (24) | 0.5 | 0.1 | 1 | NA | |