The random forest framework is shown in the following figure:
The random forest is separated into a sampling module and a decision tree module. The sampling module provides several implementations of bagging, each with a different sampling method. Instance sampling: without-replacement instance sampling, with a sampling ratio that can be set by the user.
In particular, with-replacement instance sampling is implemented by running multiple without-replacement instance samplings, as sketched below.
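One way to read that decomposition, shown as a hedged host-side sketch with placeholder names (not the library API): run several independent without-replacement (Bernoulli) passes at the configured sampling ratio and accumulate how many passes keep each instance, so every instance ends up with a Binomial copy count that approximates bootstrap (with-replacement) sampling.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: one without-replacement (Bernoulli) pass in which each
// instance is kept independently with probability `ratio`.
static std::vector<uint8_t> bernoulliPass(std::size_t n, double ratio,
                                          std::mt19937_64& rng) {
    std::bernoulli_distribution keep(ratio);
    std::vector<uint8_t> mask(n);
    for (std::size_t i = 0; i < n; ++i) mask[i] = keep(rng) ? 1 : 0;
    return mask;
}

// With-replacement sampling emulated by accumulating several without-replacement
// passes: each instance's copy count is Binomial(passes, ratio), which
// approximates the bootstrap's per-instance copy count when passes * ratio ~= 1.
std::vector<uint32_t> withReplacementCounts(std::size_t n, unsigned passes,
                                            double ratio, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::vector<uint32_t> count(n, 0);
    for (unsigned p = 0; p < passes; ++p) {
        std::vector<uint8_t> mask = bernoulliPass(n, ratio, rng);
        for (std::size_t i = 0; i < n; ++i) count[i] += mask[i];
    }
    return count;
}
```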
Feature quantize: to save bandwidth and memory, each feature is quantized into a fixed-point integer, following the binned dataset approach used by Spark.
- In the quantize method, each feature value is binned into a fixed-point integer. For example, if a feature value is 0.45 and the feature splitting array is [-0.1, 0.25, 0.50, 0.80], the binning rule is:
- < -0.1 -> 0
- -0.1 ~ 0.25 -> 1
- 0.25 ~ 0.50 -> 2
- 0.50 ~ 0.80 -> 3
- > 0.80 -> 4
so after quantizing, the binned feature value is 2 (0.25 < 0.45 < 0.50). The main part of the quantize module is searching for the interval containing each value; Figure 2 shows a detailed implementation that unrolls a binary search tree, and a host-side equivalent is sketched below.
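A minimal host-side sketch of the interval search, assuming only that the splitting array is sorted in ascending order; the kernel implementation in Figure 2 unrolls the same search into a fixed-depth binary tree, but the binning rule is the same.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Bin a raw feature value against a sorted splitting array.
// With splits = {-0.1, 0.25, 0.50, 0.80}, a value of 0.45 lies between
// 0.25 and 0.50, so the returned bin index is 2.
uint16_t quantize(double value, const std::vector<double>& splits) {
    // The bin index equals the number of split points the value exceeds.
    return static_cast<uint16_t>(
        std::upper_bound(splits.begin(), splits.end(), value) - splits.begin());
}
```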
In the decision tree module, feature sampling support is added, so that each tree reserves its own feature subset. When splitting a tree node, the best feature can only be selected from that subset (see the sketch below). In the current version, the decision tree inside the random forest implements the quantize optimization for higher kernel frequency and better performance. You can obtain a decision tree by configuring a random forest with a single tree; a decision tree built this way eases the I/O bound compared with a non-quantized decision tree.
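A small sketch of the per-tree feature sampling, with placeholder names: each tree draws a fixed subset of feature indices once, and node splitting later searches for the best split only within that subset.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Draw a per-tree feature subset without replacement. When a node of this
// tree is split, only the features listed here are candidate split features.
std::vector<int> sampleFeatureSubset(int numFeatures, int subsetSize,
                                     std::uint64_t seed) {
    std::vector<int> all(numFeatures);
    std::iota(all.begin(), all.end(), 0);
    std::mt19937_64 rng(seed);
    std::shuffle(all.begin(), all.end(), rng);
    all.resize(subsetSize);
    std::sort(all.begin(), all.end()); // deterministic per-tree layout
    return all;
}
```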
Figure 3 and Figure 4 show the data layout in the DDR buffer. In Figure 3, the last 288 bits of the data header are reserved for multiple seeds; by setting seedid, the function reads the corresponding seed from the header. After one call is done, the updated seed is written back to the corresponding slot. This trick allows multiple sampling-module kernel calls while the data is written over PCIe® only once.
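A sketch of the seed read/write-back pattern. The read and update actually happen inside the kernel, and the real bit layout of the 288-bit region is defined by the data header format; the region is treated below as an array of 64-bit words purely for illustration.

```cpp
#include <cstdint>

// Illustrative view of the reserved multi-seed region at the tail of the
// data header; the real 288-bit layout may pack the seeds differently.
struct SeedRegion {
    std::uint64_t* words; // points into the reserved header tail
};

// Read the seed selected by seedid before a sampling-kernel call.
inline std::uint64_t readSeed(const SeedRegion& r, int seedid) {
    return r.words[seedid];
}

// After the call finishes, write the advanced seed back so the next call with
// the same seedid continues the random sequence without re-sending the data
// over PCIe.
inline void writeBackSeed(SeedRegion& r, int seedid, std::uint64_t updated) {
    r.words[seedid] = updated;
}
```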
In general, only 1 to 4 individual trees can be placed on the board. Figure 5 shows the host implementation mode of a random forest tree. In this mode, the PCIe read, kernel call, and PCIe write can be overlapped, making the most of the kernel/hardware resources.
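A minimal sketch of one way to express that overlap on the host, assuming an OpenCL out-of-order command queue; the kernel, buffer, and function names are placeholders rather than the library's actual host code. Event dependencies let the input transfer of one tree, the kernel call of another, and the output transfer of a third run concurrently.

```cpp
#include <vector>
#include <CL/cl2.hpp>

// Enqueue numTrees tree-growing calls over a small pool of in-flight buffer
// slots, chaining events so that transfers and kernel calls of different
// trees overlap on an out-of-order queue.
void runOverlapped(cl::CommandQueue& q,              // out-of-order queue
                   cl::Kernel& rfKernel,             // placeholder kernel
                   std::vector<cl::Buffer>& inBuf,   // one slot per in-flight tree
                   std::vector<cl::Buffer>& outBuf,
                   int numTrees) {
    const int slots = static_cast<int>(inBuf.size());
    std::vector<cl::Event> h2d(numTrees), run(numTrees), d2h(numTrees);

    for (int t = 0; t < numTrees; ++t) {
        int s = t % slots;

        // Host-to-device transfer for tree t; wait until its slot is free.
        std::vector<cl::Event> freeDep;
        if (t >= slots) freeDep.push_back(d2h[t - slots]);
        q.enqueueMigrateMemObjects({inBuf[s]}, 0,
                                   freeDep.empty() ? nullptr : &freeDep, &h2d[t]);

        // Kernel call, gated only on its own input transfer.
        rfKernel.setArg(0, inBuf[s]);
        rfKernel.setArg(1, outBuf[s]);
        std::vector<cl::Event> runDep{h2d[t]};
        q.enqueueTask(rfKernel, &runDep, &run[t]);

        // Device-to-host transfer of the grown tree, gated on the kernel call.
        std::vector<cl::Event> outDep{run[t]};
        q.enqueueMigrateMemObjects({outBuf[s]}, CL_MIGRATE_MEM_OBJECT_HOST,
                                   &outDep, &d2h[t]);
    }
    q.finish();
}
```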
Caution
The current random forest decision tree has the same limitations as the original decision tree. With thousands of features, the current framework cannot store all features in an individual tree, so a feature sampling module is implemented for further extension.