The implementation is shown in the figure below:
The design uses several cascaded hash aggregation stages. Because graph degrees follow a power-law distribution, most vertices have a small degree, so the first stage uses a 64-deep LUT RAM to perform hash aggregation with II = 1. If the number of entries overflows the 64-deep LUT RAM, the data streams into five URAM hash aggregation modules, which also perform hash aggregation with II = 1. These five modules consist of four 32K URAMs and one 4K URAM, and they further reduce the number of hash collisions that reach the final stage. Because very few collisions should reach the final stage, it can tolerate much lower throughput (II > 3). After all seven aggregation stages are done, the output module collects the result of each stage and writes it to HBM.
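
The data flow through the cascade can be sketched with a small software model. This is only an illustration, not the HLS kernel: it ignores timing (II) entirely. The names `STAGE_SIZES`, `stage_hash`, and `cascade_aggregate` are hypothetical, as is the hash function. The rule that a key spills to the next stage when its slot is held by a different key is an assumption about how overflow is handled.

```python
from collections import Counter

# Table depths taken from the text: one 64-deep LUT RAM stage,
# four 32K URAM stages, and one 4K URAM stage.
STAGE_SIZES = [64, 32 * 1024, 32 * 1024, 32 * 1024, 32 * 1024, 4 * 1024]


def stage_hash(key, stage, size):
    """Illustrative per-stage hash (an assumption, not the design's hash).

    Keys are assumed to be non-negative integers.
    """
    return ((key * 2654435761) >> (4 * stage)) % size


def cascade_aggregate(pairs, stage_sizes=STAGE_SIZES):
    """Sum values per key through cascaded direct-mapped hash tables.

    Returns (result, leftovers), where leftovers is the number of distinct
    keys that fell through every table into the final stage.
    """
    tables = [dict() for _ in stage_sizes]  # per stage: slot -> [key, sum]
    final = Counter()                       # final stage absorbs the rest

    for key, value in pairs:
        for stage, size in enumerate(stage_sizes):
            slot = stage_hash(key, stage, size)
            entry = tables[stage].get(slot)
            if entry is None:
                tables[stage][slot] = [key, value]  # claim an empty slot
                break
            if entry[0] == key:
                entry[1] += value                   # same key: aggregate
                break
            # Slot is held by another key: spill to the next stage.
        else:
            final[key] += value                     # no table accepted it

    # Output module: collect the partial results of every stage.
    result = {}
    for table in tables:
        for key, total in table.values():
            result[key] = total
    result.update(final)
    return result, len(final)
```

Because a slot is never evicted once claimed, each key settles in exactly one stage, so merging the per-stage results at the end needs no further reduction. Shrinking `stage_sizes` (for example to `[2, 2]`) forces most keys into the final stage, which shows why the earlier stages matter for keeping the slow final stage lightly loaded.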