Data Frame Format (on DDR) - 2024.2 English

Vitis Libraries

Release Date
2025-04-14
Version
2024.2 English

Data in the Apache Arrow format is shown on the left side of the following figure. The data is separated into multiple record batches, and each batch consists of multiple columns of the same length.

data frame layout

Note that the length of each record batch is a statistic that is unknown while the record batch data is being read or written. In addition, different data types have different data widths. This is especially true for strings, because the length of each string is variable.

Thus, the Apache Arrow columnar data format cannot be implemented directly in hardware. A straightforward implementation would predefine one fixed-size DDR buffer for each field ID. However, because the number of fields and the data type of each field are unknown in advance, this approach wastes a large amount of DDR space. To fully utilize the DDR memory on the FPGA, the “data-frame” format is defined and used instead. It is shown on the right side of the preceding figure.

The DDR is split into multiple mem blocks. Each block is 4 MB in size with a 64-bit width. The address and linking information of each mem block is recorded in the meta section of the DDR header. In other words, the data of each column/field is stored in a chain of linked 4 MB mem blocks (4M -> 4M -> 4M). The length, size, count, and similar information are also saved in the DDR header.
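The block arithmetic implied by these numbers can be sketched as follows. This is only an illustration: the function name `blocks_for_column` is hypothetical, and the sketch assumes a column is measured in whole 64-bit words, which the text does not state explicitly.

```python
# Sizes taken from the text: 4 MB mem blocks, 64-bit (8-byte) width.
BLOCK_BYTES = 4 * 1024 * 1024
WORD_BYTES = 8
WORDS_PER_BLOCK = BLOCK_BYTES // WORD_BYTES  # 64-bit words held by one block


def blocks_for_column(num_words):
    """Return (block_count, words_in_tail_block) for a column.

    Hypothetical helper: assumes the column length is given in 64-bit words
    and that blocks are filled completely before a new one is linked.
    """
    if num_words < 0:
        raise ValueError("num_words must be non-negative")
    if num_words == 0:
        return 0, 0
    full, rem = divmod(num_words, WORDS_PER_BLOCK)
    if rem == 0:
        # The last block is exactly full.
        return full, WORDS_PER_BLOCK
    return full + 1, rem
```

Under these assumptions, a column only consumes as many 4 MB blocks as it needs, and only the tail block can be partially filled. This is what avoids the waste of one large fixed-size buffer per field.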

Three data types are stored differently from the Apache Arrow columnar format: Null, Boolean, and String. For Null and Boolean, only 1 bit is required per data element, so bitmap[4096][16] and boolbuff[4096][16] (each element 64 bits wide) are used to save the data, respectively. The following figure illustrates the bitmap layout. Each 64-bit word represents 64 input data elements, so the maximum supported number of input data elements is 64 x 4096, and the maximum supported number of fields is 16. Boolbuff uses the same storage layout.

data layout1
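A minimal sketch of this 1-bit packing is given below. The dimensions come from the text; the helper names (`make_bitmap`, `set_bit`, `get_bit`) are hypothetical, and the bit order inside each 64-bit word (least significant bit first) is an assumption, not something the text specifies.

```python
MAX_WORDS = 4096   # first dimension of bitmap[4096][16]
MAX_FIELDS = 16    # second dimension: maximum number of fields
BITS_PER_WORD = 64
MASK64 = (1 << BITS_PER_WORD) - 1


def make_bitmap():
    """Create an all-zero bitmap[4096][16] of 64-bit words."""
    return [[0] * MAX_FIELDS for _ in range(MAX_WORDS)]


def _locate(field, row):
    # Map an input row to (word index, bit index). LSB-first is assumed.
    if not 0 <= field < MAX_FIELDS:
        raise IndexError("field out of range")
    if not 0 <= row < MAX_WORDS * BITS_PER_WORD:
        raise IndexError("row out of range")
    return divmod(row, BITS_PER_WORD)


def set_bit(bitmap, field, row, value):
    """Store a 1-bit value for the given field and input row."""
    word, bit = _locate(field, row)
    if value:
        bitmap[word][field] |= 1 << bit
    else:
        bitmap[word][field] &= ~(1 << bit) & MASK64


def get_bit(bitmap, field, row):
    """Read back the 1-bit value for the given field and input row."""
    word, bit = _locate(field, row)
    return (bitmap[word][field] >> bit) & 1
```

With this layout, each field can hold at most 64 x 4096 = 262,144 one-bit entries, which matches the limit stated above.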

For String data, an input example with four lines is provided. The input data is shown on the left side, and the compact Arrow format storage is shown in the middle, where no bubbles exist in the data buffer. The data-frame string layout is shown on the right side. Each input string consists of one or more 64-bit words, with 8 bits per char. If the string is not 64-bit aligned, bubbles are inserted into its last 64-bit word. Bubbles are introduced into the data-frame storage to ensure that each string starts at a new DDR address, which makes string access faster and avoids timing issues. Similar to the Arrow format, the offset buffer always points to the starting address of each input string.

string layout
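The padded string layout can be sketched as follows. This is an illustration under stated assumptions: the function name `pack_strings` is hypothetical, offsets are expressed in units of 64-bit words, bubbles are zero bytes, and chars are packed into each word in little-endian order. None of these details are confirmed by the text.

```python
def pack_strings(strings):
    """Pack strings into 64-bit words, one or more whole words per string.

    Returns (words, offsets, lengths):
      words   - list of 64-bit integers holding the padded string data
      offsets - starting word index of each string (assumed unit: words)
      lengths - original byte length of each string, without bubbles
    """
    words, offsets, lengths = [], [], []
    for s in strings:
        raw = s.encode("ascii")           # 8 bits per char, as in the text
        offsets.append(len(words))        # each string starts on a new word
        lengths.append(len(raw))
        # Insert bubbles (assumed to be zero bytes) up to the next 8-byte boundary.
        padded = raw + b"\x00" * (-len(raw) % 8)
        for i in range(0, len(padded), 8):
            words.append(int.from_bytes(padded[i:i + 8], "little"))
    # Note: an empty string takes no word in this sketch, so it shares its
    # offset with the next string. The source does not describe that case.
    return words, offsets, lengths
```

For example, `pack_strings(["abc", "abcdefghi"])` yields three words with offsets `[0, 1]`: the 3-char string occupies one word with five bubble bytes, and the 9-char string occupies two words with seven bubble bytes. The compact Arrow layout would instead store the same 12 bytes back to back.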

For the normal 4 MB mem blocks, the f_buff saves the starting and ending node addresses of the mem blocks. The size of the tail mem block is also recorded. The detailed information for each node is provided in the LinkTable buffer.
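One way to picture the relationship between f_buff and LinkTable is the sketch below. The record layout is an assumption made for illustration: `FieldEntry` is a hypothetical stand-in for one f_buff record (head node, tail node, tail size), and `link_table` is modeled as a mapping from a node to the next node in the chain. The actual buffer encoding is not described in the text.

```python
from dataclasses import dataclass

WORDS_PER_BLOCK = (4 * 1024 * 1024) // 8  # 64-bit words per 4 MB mem block


@dataclass
class FieldEntry:
    """Hypothetical f_buff record for one field."""
    head: int        # node id of the first mem block
    tail: int        # node id of the last mem block
    tail_words: int  # number of 64-bit words used in the tail block


def blocks_of_field(entry, link_table):
    """Walk the chain from head to tail and return the node ids in order."""
    nodes = [entry.head]
    while nodes[-1] != entry.tail:
        if len(nodes) > len(link_table):
            # Guard against a broken chain that never reaches the tail.
            raise ValueError("tail node not reachable from head node")
        nodes.append(link_table[nodes[-1]])
    return nodes


def total_words(entry, link_table):
    """Total 64-bit words stored for the field, given full non-tail blocks."""
    count = len(blocks_of_field(entry, link_table))
    return (count - 1) * WORDS_PER_BLOCK + entry.tail_words
```

In this model, reading a column means following the links from the head node until the tail node is reached, then using the recorded tail size to know where the valid data ends.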

In addition to the data, the input data length, size, and similar information are counted and added to the corresponding buffer when the input stream ends.

data layout2