The implementation is shown in the following figure:
The design uses a cascade of hash aggregation stages. Because the vertex degrees of the graph follow a power-law distribution, most vertices have a small degree. The first stage therefore uses a 64-deep LUT RAM to perform hash aggregation with II = 1. When this 64-entry stage overflows, the data streams into five URAM-based hash aggregation modules, which also perform hash aggregation with II = 1. These five modules consist of four 32K URAMs and one 4K URAM, and they further reduce the number of hash collisions that reach the final stage. Because few collisions should reach the final stage, its lower throughput (II > 3) is acceptable. After all seven aggregation stages are done, the output module collects the results of each stage and writes them to the HBMs.
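The cascade described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the kernel's HLS source: the names `Stage` and `CascadeAggregator`, the multiplicative hash, the per-stage seeds, and the use of `std::unordered_map` for the final stage are all hypothetical, and "32K" and "4K" are read here as entry counts. The model shows only how a colliding key is forwarded from one stage to the next; it does not model II or timing.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// One direct-mapped aggregation stage: each slot holds at most one key.
// A pair whose slot is already owned by a different key is rejected and
// must be forwarded to the next stage.
struct Stage {
    struct Slot {
        bool valid;
        uint32_t key;
        uint64_t sum;
    };

    std::vector<Slot> slots;
    uint32_t seed;  // hypothetical per-stage seed so stages hash differently

    Stage(std::size_t depth, uint32_t s) : slots(depth, Slot{false, 0, 0}), seed(s) {}

    std::size_t index(uint32_t key) const {
        // Illustrative multiplicative hash (assumption; the real hash is not given).
        uint32_t h = (key ^ seed) * 2654435761u;
        return (h >> 7) % slots.size();
    }

    // Returns true if the pair was aggregated in this stage.
    bool try_add(uint32_t key, uint64_t value) {
        Slot& s = slots[index(key)];
        if (!s.valid) {
            s.valid = true;
            s.key = key;
            s.sum = value;
            return true;
        }
        if (s.key == key) {
            s.sum += value;
            return true;
        }
        return false;  // collision with another key: overflow to next stage
    }
};

class CascadeAggregator {
public:
    CascadeAggregator() {
        stages_.emplace_back(std::size_t(64), 0u);  // stage 1: 64-deep LUT RAM
        for (uint32_t i = 1; i <= 4; ++i) {
            stages_.emplace_back(std::size_t(32 * 1024), i);  // four 32K URAM stages
        }
        stages_.emplace_back(std::size_t(4 * 1024), 5u);  // one 4K URAM stage
    }

    void add(uint32_t key, uint64_t value) {
        for (Stage& st : stages_) {
            if (st.try_add(key, value)) return;
        }
        final_[key] += value;  // stage 7: slow fallback that always succeeds
    }

    // Models the output module: gathers the partial results of every stage.
    std::unordered_map<uint32_t, uint64_t> collect() const {
        std::unordered_map<uint32_t, uint64_t> out = final_;
        for (const Stage& st : stages_) {
            for (const Stage::Slot& s : st.slots) {
                if (s.valid) out[s.key] += s.sum;
            }
        }
        return out;
    }

private:
    std::vector<Stage> stages_;
    std::unordered_map<uint32_t, uint64_t> final_;
};
```

In this model a key always maps to the same slot within a stage and slots are never freed, so each key ends up owned by exactly one stage, and `collect()` can merge the stages without double counting. Calling `add(key, value)` for every input pair and then `collect()` yields the per-key sums.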