The 3-in-1 GQE kernel combines several previously released GQE kernels and contains a large number of post-bitstream programmable primitives. It can execute hash-join, hash-bloomfilter, and hash-partition individually, and it can also run hash-based bloom filter build + partition or bloom filter probe + partition as a combined flow to minimize intermediate data transfer. To fit the 3-in-1 kernel into the resource-limited AMD Alveo™ U50 with the best benefit-cost ratio, every processing unit (PU) and the output data paths were refactored, and the bypass design was retired in the current kernel. As a result, the Q5-simplified query runs with a 4x performance improvement over the previous separate GQE kernels, with no additional cost on the device.
The internal structure of the 3-in-1 GQE kernel is illustrated in the figure above. Beyond the challenging parts already solved in the previous separate GQE kernels, the key addition in 3-in-1 GQE is a hardware structure that reuses the two AXI-Master ports and the three large internal URAM buffers in each PU. Fitting gqeJoin, gqeBloomfilter, and gqePart together on a single U50 leaves no room for wasted device resources.
This reusable hardware structure is needed because the resources of the Alveo U50 are limited and because each flow uses URAM and HBM in a different order, as shown on the left side of the vertical red line in the figure above:
JOIN build/probe flow: Needs to save the total count for each unique key in URAM first, and then save the key/payload pairs in HBM.
Bloom filter build/probe flow: Needs HBM access for its corresponding hash table.
PART: Needs URAM to buffer the post-partitioned key/payload pairs, so that a partitioned bucket can be flushed out at a reasonable throughput.
While performing the bloom filter probe + PART flow, you do not know in advance when a specific bucket has collected enough key/payload pairs to be flushed out. To avoid implementing a duplicate URAM buffer after the partition module, the bloom filter operation is arranged before the partition, so that the original URAM in the partition module can be fully used to collect the partitioned rows. This is why the reusable hardware structure illustrated on the right of the vertical red line in the figure above is needed: the data follows different paths (marked in red) under different configurations.
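The ordering argument above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the kernel's implementation: the Bloom filter is stood in for by a plain Python set (so it has no false positives), the modulo hash is a placeholder, and the names `probe_then_partition`, `BUCKET_DEPTH`, and `NUM_PARTITIONS` are hypothetical. It shows that when the probe runs first, a single per-bucket buffer (the role played by the partition URAM) is enough to collect rows and decide when to flush.

```python
from collections import defaultdict

BUCKET_DEPTH = 4    # hypothetical per-bucket buffer depth (stands in for the URAM buffer)
NUM_PARTITIONS = 2  # hypothetical partition count


def probe_then_partition(rows, bloom_filter,
                         num_partitions=NUM_PARTITIONS, depth=BUCKET_DEPTH):
    """Filter rows with a Bloom-filter stand-in, then partition the survivors.

    rows: iterable of (key, payload) pairs with integer keys.
    bloom_filter: any container supporting `in`; a set here (assumption).
    Returns a list of (partition_id, flushed_rows) bursts in flush order.
    """
    buffers = defaultdict(list)  # one shared buffer per bucket
    bursts = []
    for key, payload in rows:
        if key not in bloom_filter:
            # Probe first: rejected rows never occupy buffer space.
            continue
        pid = key % num_partitions  # placeholder hash, for illustration only
        buffers[pid].append((key, payload))
        if len(buffers[pid]) == depth:
            # Bucket is full: flush it as one burst and start a fresh buffer.
            bursts.append((pid, buffers[pid]))
            buffers[pid] = []
    # Drain partially filled buckets once the input is exhausted.
    for pid in sorted(buffers):
        if buffers[pid]:
            bursts.append((pid, buffers[pid]))
    return bursts
```

If the partition ran before the probe instead, the rows leaving each bucket would still have to be filtered and then regrouped before being written out, which in this model would mean a second set of per-bucket buffers.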
The hardware resource utilization of the 3-in-1 kernel is shown in the following table (post-placement). The total covers not only the resource utilization of the listed sub-modules but also the interconnect streams that serve the dataflow design in the 3-in-1 GQE kernel.
Caution
In the current release, all columns are expected to have the same number of rows with the same type.
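A host-side sanity check for this restriction could look like the sketch below. It is an illustration only: the function name `check_columns` and the dict-of-lists column layout are assumptions made for this example and are not part of the GQE API.

```python
def check_columns(columns):
    """Raise ValueError unless all columns share one row count and one element type.

    columns: dict mapping a column name to a list of values
             (hypothetical host-side layout, for illustration only).
    """
    if not columns:
        return
    # Every column must have the same number of rows.
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError(f"columns differ in row count: {sorted(lengths)}")
    # Every value, across all columns, must have the same type.
    type_names = {type(v).__name__ for values in columns.values() for v in values}
    if len(type_names) > 1:
        raise ValueError(f"columns mix element types: {sorted(type_names)}")
```

Calling `check_columns({"key": [1, 2], "payload": [10, 20]})` passes silently, while a column with a different length or a different element type raises `ValueError` before any data is sent to the device.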