The design of the primitive includes three modules:
- read: Reads data from the input stream and outputs it as a single stream whose width is lcm(Win, N * Wout) bits. The least common multiple of Win and N * Wout is used as the inner buffer size to reconcile the different input and output widths.
- reduce: Splits the wide word into an array of N elements of Wout bits each.
- distribute: Reads the array of elements and distributes them to output streams that are not yet full.
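The width arithmetic behind the read and reduce steps can be sketched in plain C++. This is a minimal software model, not the library's implementation: the names buffer_width and split_elements are hypothetical, the parameter values are chosen only for illustration, and the least-significant-first element order is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: width in bits of the inner buffer, lcm(Win, N * Wout).
constexpr unsigned buffer_width(unsigned win, unsigned n, unsigned wout) {
    return std::lcm(win, n * wout);
}

// Example: a 24-bit input feeding two 8-bit outputs needs a 48-bit buffer,
// which holds two input words or three groups of N output elements.
static_assert(buffer_width(24, 2, 8) == 48, "lcm(24, 16) is 48");

// Hypothetical helper: split the low total_bits of a word into elements of
// wout bits each. Element 0 is taken from the least significant bits, which
// is an assumption of this sketch.
std::vector<std::uint64_t> split_elements(std::uint64_t word,
                                          unsigned total_bits,
                                          unsigned wout) {
    assert(wout > 0 && wout <= 64 && total_bits <= 64);
    const std::uint64_t mask =
        (wout == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << wout) - 1);
    std::vector<std::uint64_t> elements;
    for (unsigned lo = 0; lo + wout <= total_bits; lo += wout) {
        elements.push_back((word >> lo) & mask);
    }
    return elements;
}
```

In this model, split_elements(0xABCD, 16, 8) yields the two elements 0xCD and 0xAB, mirroring how the reduce step turns one wide word into N narrower elements.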
Attention
The current implementation has the following limitations:
- It uses a wide ap_uint as an internal buffer. The buffer is as wide as the least common multiple (LCM) of the input width and the total output width. The width is limited by AP_INT_MAX_W, which defaults to 1024.
- This library tries to override AP_INT_MAX_W to 4096. Ensure that ap_int.h is not included before the library headers.
- An overly large AP_INT_MAX_W significantly slows down HLS synthesis.
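Because the buffer width is an LCM, widths that share few common factors can exceed the limit quickly. The following sketch shows one way to check a configuration at compile time before synthesis. It is an illustration only: buffer_fits and the two limit constants are hypothetical stand-ins for the AP_INT_MAX_W values mentioned above, and nothing here is part of the library.

```cpp
#include <cassert>
#include <numeric>

// Stand-ins for the limits described above (assumed values, not read from
// ap_int.h): the default AP_INT_MAX_W and the value the library tries to set.
constexpr unsigned kDefaultMaxWidth = 1024;
constexpr unsigned kOverriddenMaxWidth = 4096;

// Hypothetical helper: true if lcm(Win, N * Wout) fits within the limit.
constexpr bool buffer_fits(unsigned win, unsigned n, unsigned wout,
                           unsigned limit) {
    return std::lcm(win, n * wout) <= limit;
}

// 64-bit input, eight 32-bit outputs: lcm(64, 256) = 256 bits, fits easily.
static_assert(buffer_fits(64, 8, 32, kDefaultMaxWidth), "256 <= 1024");

// 96-bit input, seven 40-bit outputs: lcm(96, 280) = 3360 bits. This exceeds
// the default limit and only fits once the limit is raised to 4096.
static_assert(!buffer_fits(96, 7, 40, kDefaultMaxWidth), "3360 > 1024");
static_assert(buffer_fits(96, 7, 40, kOverriddenMaxWidth), "3360 <= 4096");
```

A check of this kind also helps spot configurations that technically fit but sit close to the limit, where synthesis time is likely to suffer.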
Important
The depth of output streams must be no less than four due to an internal delay.