The design of the primitive includes 3 modules:
- read: Read data from the input stream, then output it on one stream whose width is lcm(Win, N * Wout) bits. The least common multiple of Win and N * Wout is used as the inner buffer size so that the differing input and output widths can be reconciled.
- reduce: Split the wide data into an array of N elements of Wout bits each.
- distribute: Read the array of elements and distribute them to the output streams that are not yet full.
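The three modules above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the library's HLS code: the function names (`read_stage`, `reduce_stage`, `distribute_stage`), the LSB-first packing order, the rotating search for a non-full output, and the restriction to buffers narrower than 64 bits are all hypothetical choices made so the example runs with plain C++11.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Software model only (assumption): LSB-first packing, buffers under 64 bits.
static unsigned gcd_u(unsigned a, unsigned b) { return b == 0 ? a : gcd_u(b, a % b); }
static unsigned lcm_u(unsigned a, unsigned b) { return a / gcd_u(a, b) * b; }
static uint64_t mask_bits(unsigned w) { return (1ULL << w) - 1; }  // valid for w < 64

// read: pack Win-bit words into buffers of lcm(Win, N * Wout) bits.
// A trailing partial buffer is dropped in this sketch.
std::vector<uint64_t> read_stage(const std::vector<uint64_t>& in, unsigned win, unsigned buf_w) {
    assert(buf_w < 64 && buf_w % win == 0);
    std::vector<uint64_t> bufs;
    uint64_t buf = 0;
    unsigned filled = 0;
    for (uint64_t word : in) {
        buf |= (word & mask_bits(win)) << filled;
        filled += win;
        if (filled == buf_w) {  // buffer complete: emit it and start over
            bufs.push_back(buf);
            buf = 0;
            filled = 0;
        }
    }
    return bufs;
}

// reduce: cut each buffer into arrays of N elements, Wout bits each.
std::vector<std::vector<uint64_t> > reduce_stage(const std::vector<uint64_t>& bufs,
                                                 unsigned buf_w, unsigned n, unsigned wout) {
    std::vector<std::vector<uint64_t> > groups;
    for (uint64_t buf : bufs) {
        for (unsigned base = 0; base < buf_w; base += n * wout) {
            std::vector<uint64_t> group;
            for (unsigned i = 0; i < n; ++i) {
                group.push_back((buf >> (base + i * wout)) & mask_bits(wout));
            }
            groups.push_back(group);
        }
    }
    return groups;
}

// distribute: hand each element to the next output that is not full.
// Returns how many elements were placed; stops early if every output is full.
std::size_t distribute_stage(const std::vector<std::vector<uint64_t> >& groups,
                             std::vector<std::deque<uint64_t> >& outs, std::size_t depth) {
    std::size_t placed = 0, next = 0;
    for (const auto& group : groups) {
        for (uint64_t elem : group) {
            std::size_t tried = 0;
            while (tried < outs.size() && outs[next].size() >= depth) {
                next = (next + 1) % outs.size();  // skip outputs that are full
                ++tried;
            }
            if (tried == outs.size()) return placed;
            outs[next].push_back(elem);
            next = (next + 1) % outs.size();
            ++placed;
        }
    }
    return placed;
}
```

With Win = 12, N = 3 and Wout = 8, the buffer is lcm(12, 24) = 24 bits wide, so two input words fill one buffer and each buffer yields one array of three 8-bit elements. Real hardware would stall rather than return early when all outputs are full; the early return here only keeps the model simple.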
Attention
Current implementation has the following limitations:
- It uses a wide ap_uint as internal buffer. The buffer is as wide as the least common multiple (LCM) of input width and total output width. The width is limited by AP_INT_MAX_W, which defaults to 1024.
- This library will try to override AP_INT_MAX_W to 4096, but the user should ensure that ap_int.h has not been included before the library headers.
- Too large an AP_INT_MAX_W will significantly slow down HLS synthesis.
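To see whether a given configuration stays within these limits, the buffer width can be computed ahead of synthesis. The following is a minimal sketch in plain C++11; the helper names (`buffer_width`, `fits`) and the constants `kDefaultMaxW` and `kOverrideMaxW` are hypothetical and only mirror the two values quoted above, they are not part of the library.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; not part of the library.
constexpr std::size_t gcd_c(std::size_t a, std::size_t b) {
    return b == 0 ? a : gcd_c(b, a % b);
}
constexpr std::size_t lcm_c(std::size_t a, std::size_t b) {
    return a / gcd_c(a, b) * b;
}

// Width in bits of the internal buffer: lcm(Win, N * Wout).
constexpr std::size_t buffer_width(std::size_t win, std::size_t n, std::size_t wout) {
    return lcm_c(win, n * wout);
}

// Limits quoted in the text: the default AP_INT_MAX_W and the overridden value.
constexpr std::size_t kDefaultMaxW = 1024;
constexpr std::size_t kOverrideMaxW = 4096;

// True when the buffer for this configuration is no wider than max_w bits.
constexpr bool fits(std::size_t win, std::size_t n, std::size_t wout, std::size_t max_w) {
    return buffer_width(win, n, wout) <= max_w;
}

// Win = 512, N = 8, Wout = 64: widths match, so the buffer is 512 bits.
static_assert(fits(512, 8, 64, kDefaultMaxW), "fits within the default limit");
// Win = 512, N = 3, Wout = 128: lcm(512, 384) = 1536, which needs the override.
static_assert(!fits(512, 3, 128, kDefaultMaxW), "too wide for the default limit");
static_assert(fits(512, 3, 128, kOverrideMaxW), "fits once the limit is 4096");
```

Widths that share few common factors grow the buffer quickly: Win = 480 with N = 7 and Wout = 64 gives lcm(480, 448) = 6720 bits, which exceeds even the overridden limit.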
Important
The depth of output streams must be no less than 4 due to internal delay.