The Data Analytics Library has the following additions in the 2020.2 release:
- Text Processing APIs. Two major APIs in this family have been included: regular expression match and geo-IP lookup. The former can be used to extract content from unstructured data such as logs, while the latter is often used when processing web logs to annotate records with geographic information based on IP address. A demo tool that converts Apache HTTP server logs in batch into a JSON file is provided with the library.
- DataFrame APIs. The DataFrame is a widely used in-memory data abstraction in the data analytics domain. The DataFrame read and write APIs should make it easier for data analytics kernel developers to store temporal data or to interact with open-source software that uses Apache Arrow DataFrames.
- Tree Ensemble Method. Random forest is extended to support regression. Gradient boost tree, based on the boosting method, is added to support both classification and regression. XGBoost is also supported for classification and regression, exploiting the second-order derivative of the loss function and regularization.
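
To illustrate the kind of workflow the Text Processing APIs target, the following is a minimal host-side sketch in plain Python. It does not use the library's API or its demo tool. It assumes the input follows the Apache Common Log Format, and the names `LOG_PATTERN`, `GEO_TABLE`, `lookup_region`, `log_line_to_record` and `convert_batch` are hypothetical, as are the region labels, which are attached to documentation-reserved IP ranges.

```python
import ipaddress
import json
import re

# Assumed input: Apache Common Log Format,
#   host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical geo-IP table: documentation-reserved networks, made-up labels.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "Region-A"),
    (ipaddress.ip_network("198.51.100.0/24"), "Region-B"),
]


def lookup_region(host):
    """Return the label of the first network containing host, else None."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return None  # host is a name, not an IP address
    for network, region in GEO_TABLE:
        if addr in network:
            return region
    return None


def log_line_to_record(line):
    """Extract fields from one log line; return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; store it as 0.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    record["region"] = lookup_region(record["host"])
    return record


def convert_batch(lines):
    """Convert a batch of log lines to a JSON array, skipping bad lines."""
    records = (log_line_to_record(line) for line in lines)
    return json.dumps([r for r in records if r is not None], indent=2)


SAMPLE = [
    '192.0.2.10 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.7 - alice [10/Oct/2020:13:56:01 +0000] "POST /login HTTP/1.1" 302 -',
    'not a log line',
]

if __name__ == "__main__":
    print(convert_batch(SAMPLE))
```

The sketch performs the regular expression match first and the geo-IP annotation second, mirroring the order described in the Text Processing item. A linear scan over the table is used only for brevity and would not scale to a real geo-IP database.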