Currently, the hardware JSON parser supports the following six data types: bool, integer (64-bit), string, float, double, and date. The parser does not infer data types, so the user must specify the correct type for each column in the input schema before starting the acceleration process. Note also that, to parse a nested object in the JSON lines, the schema must give the full nested path of each key. For example, given JSON structured like:
{ "nested" : { "object" : { "bool column" : [true, null, false, false] }, "int column" : 503 }, "float column" : 3.1415, "double column" : 0.99999, "date column" : "1970-01-01", "string column" : "hello world" }
Then, for the hardware JSON parser to work correctly, the schema should be specified as:
"nested/object/bool column": 0 // 0 denotes the bool type "nested/int column": 1 // 1 denotes the integer (64-bit) type "float column": 2 // 2 denotes the single-precision floating-point type "double column": 3 // 3 denotes the double-precision floating-point type "date column": 4 // 4 denotes the date (in string) type "string column": 5 // 5 denotes the string type
In the input JSON file, each JSON line should be stored compactly, without any padding, and lines should be separated by a "\n" character. The hardware implementation divides the whole input JSON file evenly and feeds the pieces to the individual processing units (PUs). Besides flat key-value pairs, the hardware implementation also accepts nested objects and arrays on leaf nodes. The hardware parser detects arrays automatically and labels each element with an index, so no additional information is needed in the input schema. Moreover, some keys may be absent from a JSON line; the hardware parser detects each missing key and writes a null flag into the output object stream.
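The behavior described above can be approximated in software to check input files and schemas before running on hardware. The sketch below is only a reference model under stated assumptions, not the hardware implementation: it assumes the file is divided by whole lines (the actual split granularity is not specified here), it uses None to stand in for the null flag, and the names split_evenly, lookup, and parse_line are hypothetical.

```python
import json


def split_evenly(lines, num_pu):
    """Divide the JSON lines into num_pu contiguous chunks of near-equal size."""
    base, extra = divmod(len(lines), num_pu)
    chunks, start = [], 0
    for i in range(num_pu):
        end = start + base + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def lookup(obj, path):
    """Follow a slash-separated schema path; return None if any key is absent."""
    for key in path.split("/"):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def parse_line(line, schema):
    """Return (path, element_index, value) records for one JSON line."""
    obj = json.loads(line)
    records = []
    for path in schema:
        value = lookup(obj, path)
        if isinstance(value, list):
            # Array on a leaf node: label each element with its index.
            records.extend((path, i, v) for i, v in enumerate(value))
        else:
            # Scalar value, or None for an absent key (this model does not
            # distinguish an absent key from an explicit JSON null).
            records.append((path, None, value))
    return records


schema = {"nested/object/bool column": 0, "nested/int column": 1, "string column": 5}

# Compact JSON lines separated by "\n", with some keys missing.
data = ('{"nested":{"object":{"bool column":[true,null]},"int column":503},"string column":"a"}\n'
        '{"nested":{"int column":7}}\n'
        '{"string column":"b"}')

lines = data.split("\n")
chunks = split_evenly(lines, 2)          # two hypothetical PUs
records = [parse_line(l, schema) for l in lines]
```

Every line yields at least one record per schema key, so downstream code can rely on a fixed column order even when keys are missing.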