The TP_SSR template parameter enables the use of multiple bitonic sort kernels to process larger datasets more efficiently. When TP_SSR > 1, the bitonic sort graph instantiates TP_SSR bitonic sort kernels, each responsible for sorting a subset of the input data. These sorted sublists are then passed through a tree of TP_SSR - 1 merge sort kernels, which combine the sublists into a single sorted output stream.
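
The kernel counts described above can be sketched as a small compile-time helper. This is only an illustration: SsrTopology, ssrTopology and ceilLog2 are hypothetical names, not part of the library, and the merge-tree depth assumes a balanced binary tree, which the text does not state explicitly.

```cpp
#include <cstddef>

// Hypothetical summary of the kernels in a graph with a given TP_SSR.
struct SsrTopology {
    std::size_t sortKernels;   // TP_SSR bitonic sort kernels
    std::size_t mergeKernels;  // TP_SSR - 1 merge sort kernels
    std::size_t mergeLevels;   // tree depth, ASSUMING a balanced binary tree
};

// Smallest number of halvings needed to reduce n sublists to one.
constexpr std::size_t ceilLog2(std::size_t n) {
    return n <= 1 ? 0 : 1 + ceilLog2((n + 1) / 2);
}

// Kernel counts as described in the text; ssr == 0 is treated as "no kernels".
constexpr SsrTopology ssrTopology(std::size_t ssr) {
    return ssr == 0 ? SsrTopology{0, 0, 0}
                    : SsrTopology{ssr, ssr - 1, ceilLog2(ssr)};
}
```

Under these assumptions, ssrTopology(4) describes 4 bitonic sort kernels feeding 3 merge sort kernels arranged in 2 levels, and ssrTopology(1) describes a single sort kernel with no merge stage.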
The performance of a single bitonic sort kernel diminishes as the list size (TP_DIM) increases, because the number of stages in the bitonic sort grows with it. For larger list sizes, it is more efficient to split the workload across multiple bitonic sort kernels, each sorting TP_DIM / TP_SSR samples. This improves performance for larger datasets at the cost of additional resources. For small TP_DIM values, performance is generally better with a single kernel (TP_SSR = 1), as the overhead of merging sublists is avoided.
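
To make the stage growth concrete: a textbook bitonic sorting network for n = 2^k elements has k(k+1)/2 compare-exchange stages. The sketch below applies that formula to the per-kernel list size TP_DIM / TP_SSR. The function names are hypothetical, power-of-two sizes are assumed, and the library's kernel may organize its stages differently, so treat the numbers as a rough model rather than a measured cost.

```cpp
#include <cstddef>

// log2 of n, ASSUMING n is a power of two.
constexpr unsigned log2Exact(std::size_t n) {
    return n <= 1 ? 0u : 1u + log2Exact(n / 2);
}

// Compare-exchange stages in a textbook bitonic network of n = 2^k elements:
// k * (k + 1) / 2.
constexpr unsigned bitonicStages(std::size_t n) {
    return log2Exact(n) * (log2Exact(n) + 1) / 2;
}

// Stages each kernel runs when TP_DIM samples are split across TP_SSR kernels.
// ASSUMES ssr >= 1 and that dim / ssr is a power of two.
constexpr unsigned stagesPerKernel(std::size_t dim, std::size_t ssr) {
    return bitonicStages(dim / ssr);
}
```

In this model, a single kernel sorting 1024 samples runs 55 stages, while each of 4 kernels sorting 256 samples runs 36. The merge tree adds its own cost, which is why small lists tend to favor TP_SSR = 1.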
Using multiple bitonic sort kernels also increases the maximum list size (TP_DIM) that can be sorted. With a single kernel, the maximum TP_DIM is limited by the data memory of an AI Engine tile. By splitting the sort across TP_SSR kernels, the maximum TP_DIM can be increased by a factor of TP_SSR.
A bitonic sort graph configured with TP_SSR > 1 has TP_SSR input IO buffers and a single output stream. Note that TP_SSR > 1 is only supported when the number of frames, TP_NUM_FRAMES, is set to 1.
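
The constraints above could be checked at compile time along the following lines. This is a hedged sketch, not the library's own validation: SortConfig, isSsrConfigValid and numInputBuffers are hypothetical names, and the requirement that TP_DIM divide evenly by TP_SSR is an assumption inferred from each kernel sorting TP_DIM / TP_SSR samples.

```cpp
#include <cstddef>

// Hypothetical bundle of the template parameters discussed in this section.
struct SortConfig {
    std::size_t dim;        // TP_DIM
    std::size_t numFrames;  // TP_NUM_FRAMES
    std::size_t ssr;        // TP_SSR
};

// True when the configuration respects the rules stated in the text:
// TP_SSR > 1 requires TP_NUM_FRAMES == 1.
// ASSUMPTION: TP_DIM must be an exact multiple of TP_SSR.
constexpr bool isSsrConfigValid(SortConfig c) {
    return c.ssr >= 1 && c.dim % c.ssr == 0 &&
           (c.ssr == 1 || c.numFrames == 1);
}

// Number of input IO buffers on the graph: TP_SSR when TP_SSR > 1.
// ASSUMPTION: a single input buffer when TP_SSR == 1.
constexpr std::size_t numInputBuffers(SortConfig c) {
    return c.ssr;
}
```

For example, SortConfig{1024, 1, 4} passes the check and implies 4 input buffers, whereas SortConfig{1024, 2, 4} is rejected because it combines TP_SSR > 1 with more than one frame.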