The decision tree training engine (for both classification and regression) is shown in the following figure:
As shown in Figure 1, the decision tree training engine is an iterative process in which each round consists of four main modules: Predict, Dispatch, CIG (Compute Info Gain), and UpdateTree. The stop conditions of the training engine include the maximum depth of the decision tree, the minimum number of instances per node, and the minimum information gain. Within one training round, Predict, Dispatch, and CIG are located in the dataflow region. Samples, parameters, and the final decision tree are saved in 512-bit-wide DDR arrays. The Predict module follows the same design as Decision Tree Predicting in L1; it determines which node a sample falls into, and if that node is not in the current tree layer, the sample is discarded.
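The following is a minimal behavioral sketch (standard C++, not the actual HLS dataflow code) of how such a predict step might walk the partially built tree. The node fields, the struct layout, and the layer test are illustrative assumptions; the real engine packs nodes into 512-bit DDR words.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical node layout, for illustration only.
struct TreeNode {
    bool     isLeaf;      // true once the node has stopped splitting
    uint32_t featureId;   // feature used to split this node
    double   threshold;   // go left if sample[featureId] <= threshold
    int32_t  left, right; // child indices into the node array
    uint32_t depth;       // tree layer this node belongs to
};

// Walk the tree built so far and return the node a sample falls into.
// Returns -1 when the sample ends in a node outside the layer currently
// being split, i.e. the sample is discarded for this round.
int predictNode(const std::vector<TreeNode>& tree,
                const std::vector<double>& sample,
                uint32_t currentLayer) {
    int cur = 0; // start at the root
    while (!tree[cur].isLeaf && tree[cur].depth < currentLayer) {
        cur = (sample[tree[cur].featureId] <= tree[cur].threshold)
                  ? tree[cur].left
                  : tree[cur].right;
    }
    return (tree[cur].depth == currentLayer) ? cur : -1;
}
```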
Detailed implementations are shown at the bottom of Figure 2. The Dispatch module takes raw sample streams as input and outputs a fixed number of streams. The number of output streams equals the length of the array shown in the middle of the detailed Dispatch module. Suppose the \(i\)-th element of the dispatching array is \(n\); then the data of the \(i\)-th output stream comes from the \(n\)-th input stream, as sketched below.
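A behavioral sketch of this dispatch rule, assuming plain C++ queues in place of hls::stream and a hypothetical dispatch array:

```cpp
#include <queue>
#include <vector>

// Behavioral sketch of the dispatch rule: output stream i carries the data
// of input stream dispatchArray[i]. std::queue stands in for hls::stream.
template <typename T>
std::vector<std::queue<T>> dispatch(const std::vector<std::queue<T>>& in,
                                    const std::vector<int>& dispatchArray) {
    std::vector<std::queue<T>> out(dispatchArray.size());
    for (std::size_t i = 0; i < dispatchArray.size(); ++i) {
        out[i] = in[dispatchArray[i]]; // copy the selected input stream
    }
    return out;
}
```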
In the CIG module of decision tree classification, the statistic count and the Gini computation are the main targets of the parallel design. Each Count unit takes its corresponding input and decides whether to perform an add operation on its URAM elements. Each URAM array is then read by a Gini unit to compute the information gain. Finally, all information gains are merged to find the maximum, which is output to the UpdateTree module. For regression, the layout of the URAM array is similar to that of the classification implementation, but the depth of each URAM is changed to max_parallel_node_num * 9, and the data width of each URAM is 288 bits, whose breakdown is shown in Figure 3.
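To make the statistics-to-impurity step concrete, here is a hedged sketch of a Gini-based information gain computed from per-side class counts. The count layout is an assumption for illustration and does not mirror the 288-bit URAM packing used by the engine.

```cpp
#include <cstdint>
#include <vector>

// Gini impurity of a class-count histogram.
double gini(const std::vector<uint64_t>& counts) {
    double total = 0.0;
    for (uint64_t c : counts) total += static_cast<double>(c);
    if (total == 0.0) return 0.0;
    double sumSq = 0.0;
    for (uint64_t c : counts) {
        double p = static_cast<double>(c) / total;
        sumSq += p * p;
    }
    return 1.0 - sumSq;
}

// Information gain of one candidate split, given the class counts that
// fall on its left and right side (the parent counts are their sum).
double infoGain(const std::vector<uint64_t>& leftCounts,
                const std::vector<uint64_t>& rightCounts) {
    std::vector<uint64_t> parent(leftCounts.size());
    double nl = 0.0, nr = 0.0;
    for (std::size_t k = 0; k < parent.size(); ++k) {
        parent[k] = leftCounts[k] + rightCounts[k];
        nl += static_cast<double>(leftCounts[k]);
        nr += static_cast<double>(rightCounts[k]);
    }
    double n = nl + nr;
    if (n == 0.0) return 0.0;
    return gini(parent) - (nl / n) * gini(leftCounts) - (nr / n) * gini(rightCounts);
}
```

The merge stage then simply keeps the split with the largest such gain across all candidate splits of a node.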
Figure 2 shows the parallel design using URAMs.
There are \(max\_split\_num\) URAMs, so the total number of feature splits is limited by this value. For regression only, an extra parameter \(max\_split\_para\) controls the number of feature splits processed in parallel. As a result, \((split\_num + max\_split\_para - 1)/max\_split\_para\) rounds of iteration are required to find the best candidate feature split for each node. Each Count unit maintains a URAM array, and these URAM arrays store the statistics of all \(max\_parallel\_node\_num\) nodes for computing information gain. During training, suppose the number of nodes in the current tree layer is \(N\); after \((N + max\_parallel\_node\_num - 1)/max\_parallel\_node\_num\) rounds of iteration, all nodes in the current layer complete splitting (see the worked example below).
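As a quick check of the two iteration counts above, here is a small sketch using hypothetical parameter values; both counts are simple ceiling divisions.

```cpp
#include <cstdio>

// Ceiling division used for both iteration counts described above.
static unsigned ceilDiv(unsigned a, unsigned b) { return (a + b - 1) / b; }

int main() {
    // Hypothetical values, for illustration only.
    unsigned split_num = 100, max_split_para = 16;   // regression split batching
    unsigned N = 50, max_parallel_node_num = 32;     // nodes in the current layer

    std::printf("split rounds: %u\n", ceilDiv(split_num, max_split_para));  // 7
    std::printf("node rounds : %u\n", ceilDiv(N, max_parallel_node_num));   // 2
    return 0;
}
```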
In the decision tree function, the first row of the data array (the DDR buffer) is reserved. From the second row on, instances are packed closely row by row. Figures 4 and 5 show the config and tree layouts of the decision tree.
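Below is a minimal sketch of this layout, assuming 512-bit rows modeled as eight 64-bit words, one 64-bit word per feature value, and a reserved first row; the exact field packing here is illustrative, not the library's actual format.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One 512-bit DDR row modeled as eight 64-bit words.
using Row512 = std::array<uint64_t, 8>;

// Pack samples into a DDR-style buffer: row 0 is reserved, and samples
// start from row 1, laid out back to back across row boundaries.
std::vector<Row512> packSamples(const std::vector<std::vector<uint64_t>>& samples) {
    std::vector<Row512> buf(1, Row512{}); // reserved first row
    Row512 row{};
    std::size_t pos = 0;
    for (const auto& s : samples) {
        for (uint64_t v : s) {
            row[pos++] = v;
            if (pos == row.size()) {      // row full: flush and start a new one
                buf.push_back(row);
                row = Row512{};
                pos = 0;
            }
        }
    }
    if (pos != 0) buf.push_back(row);     // flush the partially filled last row
    return buf;
}
```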