The design of the primitive includes three modules:
- read: Reads data from the input stream, then outputs it on one stream whose width is `lcm(Win, N * Wout)` bits. Here, the least common multiple of `Win` and `N * Wout` is the inner buffer size, which bridges the difference between the input width and the output width.
- reduce: Splits the wide word into an array of `N` elements of `Wout` bits each.
- distribute: Reads the array of elements, and distributes them to output streams that are not yet full.
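The width arithmetic and the split performed by the reduce module can be sketched in plain C++. This is a software model only, not the library's HLS code: the helper names `buffer_width` and `split_word` are hypothetical, the lowest-bits-first element order is an assumption, and the model handles a single chunk of `N * Wout` bits that fits in 64 bits.

```cpp
#include <cstdint>
#include <numeric>  // std::lcm (C++17)
#include <vector>

// Width in bits of the internal buffer: lcm(Win, N * Wout).
// Hypothetical helper, named here for illustration only.
constexpr unsigned buffer_width(unsigned win, unsigned n, unsigned wout) {
    return std::lcm(win, n * wout);
}

// Software model of the "reduce" step: split one packed word into
// n elements of wout bits each. Assumes n * wout <= 64 and that
// element 0 occupies the lowest bits (ordering is an assumption).
std::vector<std::uint64_t> split_word(std::uint64_t word, unsigned n, unsigned wout) {
    const std::uint64_t mask = (wout >= 64) ? ~0ULL : ((1ULL << wout) - 1);
    std::vector<std::uint64_t> elements;
    elements.reserve(n);
    for (unsigned i = 0; i < n; ++i) {
        elements.push_back((word >> (i * wout)) & mask);
    }
    return elements;
}
```

For example, with `Win = 64`, `N = 3`, and `Wout = 16`, the total output width is 48 bits and the buffer is `lcm(64, 48) = 192` bits wide, so three input words fill exactly four groups of output elements.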
Attention
The current implementation has the following limitations:
- It uses a wide `ap_uint` as an internal buffer. The buffer is as wide as the least common multiple (LCM) of the input width and the total output width. The width is limited by `AP_INT_MAX_W`, which defaults to 1024.
- This library tries to override `AP_INT_MAX_W` to 4096. Ensure that `ap_int.h` is not included before the library headers.
- An overly large `AP_INT_MAX_W` significantly slows down HLS synthesis.
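One way to catch a configuration that exceeds the width limit early is a compile-time check in user code. The sketch below is not part of the library. The helper `buffer_fits` is hypothetical, and the default limit of 4096 is taken from the note above, on the assumption that the library's override of `AP_INT_MAX_W` takes effect.

```cpp
#include <numeric>  // std::lcm (C++17)

// Returns true if the lcm(Win, N * Wout) buffer fits within Limit bits.
// Hypothetical helper; Limit defaults to the 4096 the library tries to set.
template <unsigned Win, unsigned N, unsigned Wout, unsigned Limit = 4096>
constexpr bool buffer_fits() {
    return std::lcm(Win, N * Wout) <= Limit;
}

// Example: 512-bit input, 8 outputs of 64 bits -> buffer of 512 bits.
static_assert(buffer_fits<512, 8, 64>(),
              "internal buffer would exceed the assumed AP_INT_MAX_W");
```

Widths that share few common factors grow the buffer quickly. For instance, `Win = 96` with seven 64-bit outputs needs `lcm(96, 448) = 1344` bits, which exceeds the default limit of 1024 but fits under 4096.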
Important
The depth of output streams must be no less than four due to an internal delay.