The Aggregate kernel is another key kernel of the General Query Engine (GQE). It supports both grouping and non-grouping aggregate operations.
The internal structure of this kernel is shown in the figure above. Like the join kernel, it takes an 8-column data buffer, one kernel configuration buffer, and one meta info buffer as input. Because the output data types vary (for example, aggregate max, aggregate min, and raw data), 16 output column buffers are used for the output. As shown in the figure above, each element in each row is evaluated and filtered before it enters the hash group aggregate module, so new elements can be generated and some rows can be discarded. Two cascaded evaluation modules are included to support more complex expressions.
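The evaluate-then-filter front end described above can be modelled in software as follows. This is only an illustrative sketch, not the kernel's HLS implementation: the function `front_end` and the sample expressions `add_sum`, `add_scaled`, and `last_above_10` are hypothetical names, and the order Eval0, Eval1, Filter is assumed from the order of the modules in the configuration table.

```python
from typing import Callable, Iterable, Iterator, List

Row = List[int]


def front_end(rows: Iterable[Row],
              eval0: Callable[[Row], Row],
              eval1: Callable[[Row], Row],
              keep: Callable[[Row], bool]) -> Iterator[Row]:
    """Software model: two cascaded evaluations, then a row filter."""
    for row in rows:
        # Each evaluation step may append newly computed elements to the row.
        row = eval1(eval0(row))
        # Rows rejected by the filter never reach the aggregate stage.
        if keep(row):
            yield row


# Hypothetical expressions, chosen only to show the data flow.
def add_sum(row: Row) -> Row:
    return row + [row[0] + row[1]]      # new element: col0 + col1


def add_scaled(row: Row) -> Row:
    return row + [row[-1] * 2]          # new element: 2 * (col0 + col1)


def last_above_10(row: Row) -> bool:
    return row[-1] > 10


rows = [[1, 2], [3, 4], [5, 6]]
result = list(front_end(rows, add_sum, add_scaled, last_above_10))
```

In this example the first row is discarded by the filter, and the two surviving rows each carry two generated elements in addition to their original columns.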
The core module of the aggregate kernel is a hash group aggregate, a multi-PU implementation shown in the following diagram. Each PU requires two HBM banks and some URAM blocks to buffer the distinct keys and their payloads after aggregation, and an internal loop runs until all input rows have been consumed. All PUs work in parallel to achieve higher performance.
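The idea of splitting a hash group aggregate across several PUs can be sketched in software as below. This is a simplified model under stated assumptions: the PU count `NUM_PU = 4`, the dispatch rule `hash(key) % num_pu`, and the chosen aggregates (sum, min, max, count) are all hypothetical, and the HBM/URAM buffering and the internal loop are not modelled.

```python
from typing import Dict, Iterable, List, Tuple

NUM_PU = 4  # assumed value; the text does not state the number of PUs


def hash_group_aggregate(rows: Iterable[Tuple[int, int]],
                         num_pu: int = NUM_PU) -> Dict[int, Dict[str, int]]:
    """Group (key, value) rows by key, one hash table per modelled PU."""
    # Each PU owns a private table of distinct keys and running aggregates.
    pu_tables: List[Dict[int, Dict[str, int]]] = [{} for _ in range(num_pu)]

    for key, value in rows:
        # Rows with the same key always land in the same PU, so the PUs
        # never share a key and could run independently of one another.
        table = pu_tables[hash(key) % num_pu]
        agg = table.get(key)
        if agg is None:
            table[key] = {"sum": value, "min": value,
                          "max": value, "count": 1}
        else:
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
            agg["count"] += 1

    # Key sets are disjoint across PUs, so merging is a plain union.
    merged: Dict[int, Dict[str, int]] = {}
    for table in pu_tables:
        merged.update(table)
    return merged


groups = hash_group_aggregate([(1, 10), (2, 5), (1, 7), (3, 1), (2, 9)])
```

Because the dispatch depends only on the key, the result is the same for any number of modelled PUs; only the distribution of keys over the tables changes.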
The data structures of the input and output meta info and raw data are the same as in the join kernel. The configuration buffer is composed of 128 32-bit slots. The layout of the configuration buffer is listed in the following table:
Module | Module Config Width | Position |
Scan | 64 bit | config[0]~config[1] |
Eval0 | 289 bit | config[2]~config[11] |
Eval1 | 289 bit | config[12]~config[21] |
Filter | 45*32 bit | config[22]~config[66] |
Shuffle0 | 64 bit | config[67]~config[68] |
Shuffle1 | 64 bit | config[69]~config[70] |
Shuffle2 | 64 bit | config[71]~config[72] |
Shuffle3 | 64 bit | config[73]~config[74] |
Group Aggr | 4*32 bit | config[75]~config[78] |
Column Merge | 64 bit | config[79]~config[80] |
Aggregate | 1 bit | config[81] |
Write | 16 bit | config[82] |
Reserved | - | config[83]~config[127] |
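The slot layout in the table above can be written down as data and checked for consistency. The following is a minimal sketch: the widths and slot ranges are copied from the table, while the names `LAYOUT`, `check_layout`, and `module_slice` are hypothetical helpers and not part of the GQE API.

```python
CONFIG_SLOTS = 128   # number of slots in the configuration buffer
SLOT_BITS = 32       # width of each slot

# (module, config width in bits, first slot, last slot), from the table.
LAYOUT = [
    ("Scan",         64,      0,  1),
    ("Eval0",        289,     2, 11),
    ("Eval1",        289,    12, 21),
    ("Filter",       45 * 32, 22, 66),
    ("Shuffle0",     64,     67, 68),
    ("Shuffle1",     64,     69, 70),
    ("Shuffle2",     64,     71, 72),
    ("Shuffle3",     64,     73, 74),
    ("Group Aggr",   4 * 32, 75, 78),
    ("Column Merge", 64,     79, 80),
    ("Aggregate",    1,      81, 81),
    ("Write",        16,     82, 82),
]


def check_layout(layout):
    """Verify the ranges are contiguous and wide enough; return the
    index of the first reserved slot."""
    next_slot = 0
    for name, width, first, last in layout:
        assert first == next_slot, f"{name}: gap or overlap at slot {first}"
        assert width <= (last - first + 1) * SLOT_BITS, f"{name}: too wide"
        next_slot = last + 1
    assert next_slot <= CONFIG_SLOTS
    return next_slot


def module_slice(config, name):
    """Return the slots of one module from a 128-entry config list."""
    for module, _width, first, last in LAYOUT:
        if module == name:
            return config[first:last + 1]
    raise KeyError(name)


first_reserved = check_layout(LAYOUT)
config = [0] * CONFIG_SLOTS          # an all-zero buffer, for illustration
filter_words = module_slice(config, "Filter")
```

With the values from the table, the module ranges are contiguous from slot 0 and the reserved region starts at slot 83, which matches the last table row.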
The hardware resource utilization of the hash group aggregate is shown in the following table (running at 180 MHz).
Primitive | Quantity | LUT | LUT as memory | LUT as logic | Register | BRAM36 | URAM | DSP |
Scan | 1 | 12209 | 4758 | 7451 | 18974 | 0 | 0 | 2 |
Eval | 8 | 2153 | 426 | 1727 | 2042 | 4 | 0 | 21 |
Filter | 4 | 2168 | 13 | 2155 | 1764 | 0.5 | 0 | 0 |
Group Aggr | 1 | 162202 | 27819 | 134383 | 210926 | 62 | 256 | 0 |
Direct Aggr | 1 | 4349 | 0 | 4349 | 6611 | 0 | 0 | 0 |
Write | 1 | 30938 | 9490 | 21448 | 43579 | 0 | 0 | 0 |
AXI DDR | 1 | 4586 | 1313 | 3273 | 78855 | 18 | 0 | 0 |
AXI HBM | 1 | 20528 | 4456 | 16072 | 45416 | 124 | 0 | 0 |
Total | - | 298470 | 60402 | 238068 | 399737 | 255 | 256 | 2 |