Large Bitonic Sorting Example - 2024.2 English - XD100

Vitis Tutorials: AI Engine Development (XD100)

Document ID
XD100
Release Date
2024-12-06
Version
2024.2 English

This section reviews the design of a larger Bitonic SIMD sorting example for \(N=1024\) samples. This is more challenging than the previous \(N=16\) example because the entire array to be sorted no longer fits in the available vector register space. Instead we must store the array in local tile memory and work on smaller portions of the array using the vector registers. The existing \(N=16\) code base serves us nicely for this task. It remains to build up the additional stages and memory addressing to manage the computation across the full \(N=1024\) array.

The block diagram below shows the Bitonic Sorter design for \(N=1024\). The design requires a total of \(q(q+1)/2=55\) rounds of processing where $q=\log_2(N)=10\(. Once again, these rounds are collected into \)q$ stages. The first four stages (ie. Stage 0 to Stage 3) are identical to the stages we have seen earlier in the \(N=16\) example. The only difference is these stages operate on the full vector of \(N=1024\) samples; they are processed as 64 groups of 16 samples each, where the processing of each group of 16 samples is identical to the previous example.

The diagram below breaks apart the ten rounds of Stage 9 to illustrate the nature of processing which occurs from Stages 4 to 9. Each round of a given stage may be categorized by three parameters, “GROUP”, “SPAN”, and “ITERATION”:

  • The GROUP indicates the number of sets of identical processing that occurs in each round.

  • The SPAN indicates the width or straddle in samples between the “top” (or “bottom”) of two consecutive butterflies in the round.

  • The ITERATION indicates the number of SIMD comparisons performed per GROUP (where we know from the \(N=16\) example we process 8 samples per SIMD comparison).

figure

Based on these definitions, each round may be characterized. For example, the first round of Stage 9 may be characterized as <GROUP,SPAN,ITER>=<1,1024,64> since the largest span of the first butterfly is 1024 samples (SPAN=1024). There is a single group of butterflies that span the array vertically (GROUP=1). There are 64 SIMD cycles required to process $8\times64=512$ butterflies in the round (ITER=64). The second round of Stage 9 may be characterized as <GROUP,SPAN,ITER>=<2,256,32> since the span of all butterfiles is 256 samples (SPAN=256), there are two groups (GROUP=2), and there are $8\times32=512$ butterflies in the round (ITER=32).

It turns out the first round in Stages 1 to \(q\) always contains these butterflies with “reducing span” dropping from the largest span down to a single sample. All other rounds in the stage consist of butterflies with identical spans. This creates two fundamental types of processing that must be managed. For each type, they only differ in the number of groups processed in each round. This is identified in the figure below where the different types of processing are each identified with a unique color. Notice how all of the rounds from Stage 4 onwards are made up of five different types of processing. Three of these originate from Stage 0 and Stage 1. The other two consists of these forms of “reducing span” butterflies vs. “fixed span” butterflies identified above.

figure

The code block below shows the implementation of the “reducing span” algorithm required for the first round of every stage for Stage 4 and above. Notice how we loop over GROUP*ITER iterations of 8-lane comparison operations. For these “reducing span” comparisons we must reverse the order of the bottom butterfly data to perform the proper comparisons. Afterwards, we may omit the reversal prior to storage since the remaining rounds will reorder each half of the samples properly. This omission saves cycles. After processing ITER sets of SIMD computations for each GROUP, we adjust the butterfly memory pointers and load another 16-lanes of data from memory, storing previously sorted results.

template<unsigned GROUP,unsigned SPAN,unsigned ITER> void __attribute__((noinline)) bitonic_fp1024::stageY(void)
{
  float* __restrict gTop = &data[0];
  float* __restrict gBot = &data[SPAN];
  float* __restrict pTop = gTop;
  float* __restrict pBot = gBot - 8;
  for (unsigned ii=0, jj=0; ii < GROUP*ITER; ii++)
    chess_prepare_for_pipelining
    {
      // ------------------------------------------------------------
      aie::accum<accfloat,8> v_top; v_top.from_vector(aie::load_v<8>(pTop));
      aie::vector<float,8>   v_bot = aie::load_v<8>(pBot);
      aie::vector<float,8>   v_mx = fpmax(v_top,v_bot,0,0x01234567);
      aie::vector<float,8>   v_mn = fpmin(v_top,v_bot,0,0x01234567);
      v_mn.store(pTop);
      v_mx.store(pBot); // Algorithm works whether or not we use aie::reverse() here -- Omit to save cycles
      // ------------------------------------------------------------
      if (jj==ITER-1) {
        // Time to hop to next group:
        pTop = gTop + SPAN;
        pBot = gBot + SPAN - 8;
        gTop = gTop + SPAN;
        gBot = gBot + SPAN;
        jj=0;
      }
      else {
        // Next set of 8 butterflies:
        pTop = pTop + 8;
        pBot = pBot - 8;
        jj++;
      }
    }
}

The code block below shows the implementation of the “fixed span” algorithm required for the remaining rounds of every stage for Stage 4 and above. Here no reversal operations are required since the butterfly assignments follow a regular pattern. Again, the memory management adjusts the top and bottom butterfly pointers according to the different GROUP and ITER to be processed

template<unsigned GROUP,unsigned SPAN,unsigned ITER> void __attribute__((noinline)) bitonic_fp1024::stageB(void)
{
  float* __restrict gTop = &data[0];
  float* __restrict gBot = &data[SPAN];
  float* __restrict pTop = gTop;
  float* __restrict pBot = gBot;
  for (unsigned ii=0, jj=0; ii < GROUP*ITER; ii++)
    chess_prepare_for_pipelining
    {
      // ------------------------------------------------------------
      aie::vector<float,8>  v_top = aie::load_v<8>(pTop);
      aie::vector<float,8>  v_bot = aie::load_v<8>(pBot);
      aie::vector<float,8>  v_mx  = aie::max(v_top,v_bot);
      aie::vector<float,8>  v_mn  = aie::min(v_top,v_bot);
      v_mn.store(pTop);
      v_mx.store(pBot);
      // ------------------------------------------------------------
      if (jj==ITER-1) {
        // Time to hop to next group:
        gTop = gTop + 2*SPAN;
        gBot = gBot + 2*SPAN;
        pTop = gTop;
        pBot = gBot;
        jj=0;
      }
      else {
        // Next set of 8 butterflies:
        pTop = pTop + 8;
        pBot = pBot + 8;
        jj++;
      }
    }
}