The random forest framework is shown in the figure below:
We separate the random forest into a sampling module and a decision tree module. In the sampling module, we provide several implementations of bagging, each with a different sampling method. Instance sampling: without-replacement instance sampling, with a user-configurable sampling ratio.
In particular, with-replacement instance sampling is implemented by combining multiple without-replacement instance samplings, as sketched below.
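For illustration, here is a minimal host-side sketch of the two sampling modes, assuming a shuffle-and-truncate scheme; `sampleWithoutReplacement`, `sampleWithReplacement`, and the way the ratio is split across passes are illustrative choices, not the kernel's actual implementation:

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Without-replacement sampling: shuffle the instance indices and keep
// the first ratio * n of them.
std::vector<int> sampleWithoutReplacement(int n, double ratio, std::mt19937& rng) {
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng);
    idx.resize(static_cast<std::size_t>(ratio * n));
    return idx;
}

// With-replacement sampling approximated by concatenating several
// independent without-replacement passes; duplicates across passes
// emulate drawing with replacement.
std::vector<int> sampleWithReplacement(int n, double ratio, int passes, std::mt19937& rng) {
    std::vector<int> out;
    for (int p = 0; p < passes; ++p) {
        auto part = sampleWithoutReplacement(n, ratio / passes, rng);
        out.insert(out.end(), part.begin(), part.end());
    }
    return out;
}
```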
Feature quantize: to save bandwidth and memory, we follow Spark's binned-dataset approach and quantize each feature into a fixed-point integer.
- In the quantize method, we bin each feature value into a fixed-point integer. For example, if a feature value is 0.45 and the feature splitting array is [-0.1, 0.25, 0.50, 0.80], the binning rule is:
- < -0.1 -> 0
- -0.1 ~ 0.25 -> 1
- 0.25 ~ 0.50 -> 2
- 0.50 ~ 0.80 -> 3
- > 0.80 -> 4
So after quantizing, the binned feature value is 2 (0.25 < 0.45 < 0.50). The main part of the quantize module is searching for the interval containing each value; Figure 2 shows the detailed implementation, which unrolls a binary search tree.
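As a software reference for the binning rule above, the sketch below uses a binary search over the split array (the hardware implementation in Figure 2 unrolls this search into a fixed-depth comparison tree). `quantize` is a hypothetical name, and how values exactly equal to a split are assigned is a convention choice:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// Map a raw feature value to its bin index by binary search over the
// split array. std::upper_bound returns the first split strictly
// greater than the value, so its offset is exactly the bin index.
int quantize(double value, const std::vector<double>& splits) {
    return static_cast<int>(
        std::upper_bound(splits.begin(), splits.end(), value) - splits.begin());
}

int main() {
    std::vector<double> splits = {-0.1, 0.25, 0.50, 0.80};
    std::cout << quantize(0.45, splits) << '\n';  // prints 2
    std::cout << quantize(-0.5, splits) << '\n';  // prints 0
    std::cout << quantize(0.95, splits) << '\n';  // prints 4
}
```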
In the decision tree module, we add feature sampling support, so that each tree node keeps its own feature subset. When splitting a tree node, it can only select the best feature from that subset. In the current version, the decision tree inside the random forest uses the quantize optimization for higher kernel frequency and better performance. A random forest configured with a single tree therefore serves as a decision tree, and a decision tree obtained this way eases the I/O bound compared with a non-quantized decision tree.
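A minimal sketch of per-node feature sampling, assuming a shuffle-and-truncate subset draw; `sampleFeatureSubset` and its parameters are illustrative, not the library's API:

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Each node draws its own feature subset; the split search then only
// considers the features in this subset.
std::vector<int> sampleFeatureSubset(int numFeatures, int subsetSize, std::mt19937& rng) {
    std::vector<int> feats(numFeatures);
    std::iota(feats.begin(), feats.end(), 0);
    std::shuffle(feats.begin(), feats.end(), rng);
    feats.resize(subsetSize);
    return feats;
}
```

Setting `subsetSize` equal to `numFeatures` and the tree count to one recovers a plain decision tree, which is how the single-tree configuration mentioned above works.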
Figure 3 and Figure 4 show the data layout in the DDR buffer. In Figure 3, we reserve the last 288 bits of the data header for multiple seeds; by setting seedid, the function reads the corresponding seed from the header. After one call completes, the updated seed is written back to the same slot. This trick lets multiple sampling-module kernel calls reuse data that is written over PCIe only once.
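A rough sketch of how the multi-seed header region might be read and updated, assuming 32-bit seed slots packed into the 288-bit region and a toy LCG update; the real slot width, header fields, and RNG are kernel-specific:

```cpp
#include <array>
#include <cstdint>

// Hypothetical header layout: the last 288 bits hold the seed slots.
// Assuming 32-bit slots here for illustration (288 / 32 = 9 slots).
constexpr int kSeedBits = 288;
constexpr int kSeedWidth = 32;
constexpr int kNumSeeds = kSeedBits / kSeedWidth;  // 9 slots

struct DataHeader {
    // ... other header fields ...
    std::array<std::uint32_t, kNumSeeds> seeds;  // multi-seed region
};

// Each kernel call reads its seed by seedId, advances it, and writes
// the updated value back, so repeated sampling calls can share a
// buffer that was transferred over PCIe only once.
std::uint32_t nextSeed(DataHeader& hdr, int seedId) {
    std::uint32_t s = hdr.seeds[seedId];
    s = s * 1664525u + 1013904223u;  // toy LCG step, illustrative only
    hdr.seeds[seedId] = s;
    return s;
}
```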