The design of the primitive includes 3 modules:
- read: Reads data from the input stream and outputs it on a single stream whose width is lcm(Win, N * Wout) bits. The least common multiple of Win and N * Wout is used as the inner buffer size to reconcile the different input and output widths.
- reduce: Splits the wide data into an array of N elements of Wout bits each.
- distribute: Reads the array of elements and distributes them to the output streams that are not yet full.
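The buffer sizing described for the read module can be sketched in plain C++. This is only an illustration of the arithmetic, not the library's code: the names `gcd_u`, `lcm_u`, `BufferPlan`, `plan_buffer` and `print_plan` are hypothetical, and the sketch assumes that one full buffer is filled by whole input words and drained as whole batches of N output elements.

```cpp
#include <cassert>
#include <cstdio>

// Greatest common divisor (Euclid), written as a single-return constexpr
// function so that it also compiles as C++11.
constexpr unsigned gcd_u(unsigned a, unsigned b) {
    return b == 0 ? a : gcd_u(b, a % b);
}

// Least common multiple; divide first to keep the intermediate value small.
constexpr unsigned lcm_u(unsigned a, unsigned b) {
    return a / gcd_u(a, b) * b;
}

// Hypothetical summary of the sizes implied by lcm(Win, N * Wout).
struct BufferPlan {
    unsigned buf_bits;    // inner buffer width in bits
    unsigned in_words;    // input words of Win bits needed to fill the buffer
    unsigned out_batches; // batches of N elements (Wout bits each) per buffer
};

constexpr BufferPlan plan_buffer(unsigned win, unsigned n, unsigned wout) {
    return BufferPlan{lcm_u(win, n * wout),
                      lcm_u(win, n * wout) / win,
                      lcm_u(win, n * wout) / (n * wout)};
}

// Prints the plan for one configuration (all parameters must be non-zero).
void print_plan(unsigned win, unsigned n, unsigned wout) {
    assert(win > 0 && n > 0 && wout > 0);
    const BufferPlan p = plan_buffer(win, n, wout);
    std::printf("Win=%u N=%u Wout=%u: buffer=%u bits, %u input words, %u output batches\n",
                win, n, wout, p.buf_bits, p.in_words, p.out_batches);
}
```

For example, with Win = 64, N = 3 and Wout = 16, the total output width is 48 bits, so `plan_buffer(64, 3, 16)` gives a 192-bit buffer that holds 3 input words and 4 batches of 3 output elements.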
Attention
Current implementation has the following limitations:
- It uses a wide ap_uint as the internal buffer. The buffer is as wide as the least common multiple (LCM) of the input width and the total output width. The width is limited by AP_INT_MAX_W, which defaults to 1024.
- This library will try to override AP_INT_MAX_W to 4096, but the user should ensure that ap_int.h has not been included before the library headers.
- Too large an AP_INT_MAX_W will significantly slow down HLS synthesis.
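Whether a given configuration stays within these limits can be checked before synthesis. The following is a minimal sketch in plain C++, independent of the HLS headers; the names `kDefaultMaxW`, `kLibraryMaxW`, `gcd_w`, `buffer_bits` and `fits` are hypothetical, and the two constants simply mirror the values 1024 and 4096 quoted above.

```cpp
#include <cassert>

// Limits quoted in the text above (assumed here, not read from ap_int.h).
constexpr unsigned kDefaultMaxW = 1024; // default AP_INT_MAX_W
constexpr unsigned kLibraryMaxW = 4096; // value the library tries to set

// Greatest common divisor (Euclid), single-return so it is valid C++11.
constexpr unsigned gcd_w(unsigned a, unsigned b) {
    return b == 0 ? a : gcd_w(b, a % b);
}

// Internal buffer width: LCM of the input width and the total output width.
constexpr unsigned buffer_bits(unsigned win, unsigned n, unsigned wout) {
    return win / gcd_w(win, n * wout) * (n * wout);
}

// True when the internal buffer is no wider than the given limit.
constexpr bool fits(unsigned win, unsigned n, unsigned wout, unsigned max_w) {
    return buffer_bits(win, n, wout) <= max_w;
}

// Win = 512, N = 5, Wout = 64: LCM(512, 320) = 2560 bits.
static_assert(buffer_bits(512, 5, 64) == 2560, "unexpected buffer width");
static_assert(!fits(512, 5, 64, kDefaultMaxW), "exceeds the default limit");
static_assert(fits(512, 5, 64, kLibraryMaxW), "fits once the limit is 4096");

// Win = 512, N = 7, Wout = 96: LCM(512, 672) = 10752 bits, too wide for both.
static_assert(!fits(512, 7, 96, kLibraryMaxW), "exceeds even the raised limit");
```

Widths that share few common factors make the LCM grow quickly, so a configuration such as the second one would need a different choice of widths rather than a larger AP_INT_MAX_W.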
Important
The depth of output streams must be no less than 4 due to internal delay.