Decision trees

AOCL API Guide (68552)

Document ID: 68552
Release Date: 2025-12-29
Version: 5.2 English
class aoclda.decision_tree.decision_tree(criterion='gini', seed=-1, max_depth=29, max_features=0, min_samples_split=2, build_order='breadth first', min_impurity_decrease=0.0, min_split_score=0.0, feat_thresh=1.0e-06, check_data=False)#

A decision tree classifier.

Parameters:
  • max_depth (int, optional) – Set the maximum depth of the tree. Default = 29.

  • seed (int, optional) – Set the random seed for the random number generator. If the value is -1, a random seed is automatically generated; in this case the classification results will not be reproducible. Default = -1.

  • max_features (int, optional) – Set the number of features to consider when splitting a node. 0 means take all the features. Default = 0.

  • criterion (str, optional) – Select the scoring function to use. It can take the values ‘cross-entropy’, ‘gini’, or ‘misclassification’. Default = ‘gini’.

  • min_samples_split (int, optional) – The minimum number of samples required to split an internal node. Default = 2.

  • build_order (str, optional) – Select in which order to explore the nodes. It can take the values ‘breadth first’ or ‘depth first’. Default = ‘breadth first’.

  • min_impurity_decrease (float, optional) – Minimum score improvement needed to consider a split from the parent node. Default = 0.0.

  • min_split_score (float, optional) – Minimum score needed for a node to be considered for splitting. Default = 0.0.

  • feat_thresh (float, optional) – Minimum difference in feature value required for splitting. Default = 1.0e-06.

  • precision (str, optional) – Whether to initialize the decision_tree object in double or single precision. It can take the values ‘single’ or ‘double’. Default = ‘double’.

  • detect_categorical_data (bool, optional) – Whether to check which features are categorical in X. Default = False.

  • max_category (int, optional) – Maximum number of categories for a given feature to be considered categorical. Default = 50.

  • category_tolerance (float, optional) – How far data can be from an integer to be considered not categorical. Default = 1.0e-05.

  • category_split_strategy (str, optional) – The strategy to use for splitting categorical features. It can take the values ‘ordered’ or ‘one-vs-all’. Default = ‘ordered’.

  • histogram (bool, optional) – Whether to use histograms for continuous features. Default = False.

  • maximum_bins (int, optional) – Maximum number of bins to use for histograms. Default = 256.

  • predict_proba (bool, optional) – Whether to predict class probabilities. Default = True.

  • check_data (bool, optional) – Whether to check the data for NaNs. Default = False.
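As a rough illustration of the three scoring criteria accepted by the criterion option, the sketch below scores a node's class proportions in plain Python. The helper names are ours, not part of aoclda:

```python
import math

def gini(p):
    # Gini impurity: 1 - sum_k p_k^2
    return 1.0 - sum(pk * pk for pk in p)

def cross_entropy(p):
    # Cross-entropy (log loss): -sum_k p_k * log(p_k), skipping empty classes
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def misclassification(p):
    # Misclassification error: 1 - max_k p_k
    return 1.0 - max(p)

# Class proportions in a node with two equally likely classes:
# all three criteria score this node as maximally impure.
p = [0.5, 0.5]
print(round(gini(p), 4))               # 0.5
print(round(misclassification(p), 4))  # 0.5
```

A pure node such as `[1.0, 0.0]` scores 0 under all three criteria, which is why splits are chosen to drive these scores down.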

property depth#

The depth of the trained tree

Type:

int

fit(X, y, categorical_features=None)#

Computes the decision tree on the feature matrix X and response vector y.

Parameters:
  • X (array-like) – The feature matrix on which to compute the model. Its shape is (n_samples, n_features).

  • y (array-like) – The response vector. Its shape is (n_samples).

  • categorical_features (array-like, optional) – Integer vector. categorical_features[i] should be set to a negative value if feature i is continuous, or to the number of distinct categories if feature i is categorical. If None, all features are considered continuous. Its shape is (n_features).

Returns:

Returns the instance itself.

Return type:

self (object)
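The categorical_features vector described above can be assembled by hand; the helper below is an illustrative sketch (not part of aoclda) following the convention of a negative entry for a continuous feature and a category count otherwise:

```python
def make_categorical_features(columns):
    # Build the categorical_features vector: a negative entry for a
    # continuous feature, or the number of distinct categories for a
    # categorical one. `columns` is a list of (is_categorical, values)
    # pairs (an illustrative input format, not aoclda's).
    flags = []
    for is_categorical, values in columns:
        if is_categorical:
            flags.append(len(set(values)))
        else:
            flags.append(-1)
    return flags

# Two features: one continuous, one categorical with 3 levels
cols = [(False, [0.1, 2.3, 1.7]), (True, [0, 1, 2])]
print(make_categorical_features(cols))  # [-1, 3]
```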

property max_features#

Get the maximum number of features to consider when splitting a node.

property n_features#

The number of features used in the trained tree

Type:

int

property n_leaves#

The number of leaves in the trained tree

Type:

int

property n_nodes#

The number of nodes in the trained tree

Type:

int

property n_obs#

The number of observations used to train the tree

Type:

int

property n_samples#

The number of samples used in the trained tree

Type:

int

predict(X)#

Generate labels using the fitted decision tree on a new set of data X.

Parameters:
  • X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.

Returns:

The prediction vector, where n_samples is the number of rows of X.

Return type:

numpy.ndarray of length n_samples

predict_log_proba(X)#

Generate class log probabilities using the fitted decision tree on a new set of data X.

Parameters:

X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.

Returns:

The prediction vector, where n_samples is the number of rows of X.

Return type:

numpy.ndarray of length n_samples
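Assuming the log probabilities are natural logarithms (as in comparable libraries; this is an assumption, not stated above), the relationship between class probabilities and their logs can be sketched in plain Python:

```python
import math

def log_probabilities(proba_row):
    # Elementwise natural log of one row of class probabilities
    return [math.log(p) for p in proba_row]

proba = [0.25, 0.75]
log_proba = log_probabilities(proba)
# Exponentiating recovers the original probabilities
print([round(math.exp(lp), 2) for lp in log_proba])  # [0.25, 0.75]
```

Working in log space avoids underflow when many small probabilities are multiplied, which is why a separate log-probability entry point is commonly provided.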

predict_proba(X)#

Generate class probabilities using the fitted decision tree on a new set of data X.

Parameters:
  • X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.

Returns:

The prediction vector, where n_samples is the number of rows of X.

Return type:

numpy.ndarray of length n_samples

score(X, y)#

Calculates score (prediction accuracy) by comparing predicted labels and actual labels on a new set of data.

Parameters:
  • X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.

  • y (array-like) – The response vector. It must have shape (n_samples).

Returns:

The mean accuracy of the model on the test data.

Return type:

float
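The mean accuracy returned by score is the proportion of predictions matching the actual labels; a minimal plain-Python sketch (not the library's implementation):

```python
def mean_accuracy(y_pred, y_true):
    # Proportion of observations where the predicted label
    # matches the actual label
    matches = sum(1 for p, t in zip(y_pred, y_true) if p == t)
    return matches / len(y_true)

print(mean_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```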

update_model_info()#

Update the model information (the content of rinfo from the C interface).

da_status da_tree_set_training_data_s(da_handle handle, da_int n_samples, da_int n_features, da_int n_class, const float *X, da_int ldx, const da_int *y, const da_int *categorical_features)#
da_status da_tree_set_training_data_d(da_handle handle, da_int n_samples, da_int n_features, da_int n_class, const double *X, da_int ldx, const da_int *y, const da_int *categorical_features)#

Pass a data matrix and a label array to the da_handle object in preparation for fitting a decision tree.

Parameters:
  • handle[inout] a da_handle object, initialized with type da_handle_decision_tree.

  • n_samples[in] number of observations in X.

  • n_features[in] number of features in X.

  • n_class[in] number of distinct classes in y. Will be computed automatically if n_class is set to 0.

  • X[in] array containing n_samples \(\times\) n_features data matrix. By default, it should be stored in column-major order, unless you have set the storage order option to row-major.

  • ldx[in] leading dimension of X. Constraint: ldx \(\ge\) n_samples if X is stored in column-major order, or ldx \(\ge\) n_features if X is stored in row-major order.

  • y[in] array containing the n_samples labels. The label values are expected to range from 0 to n_class - 1.

  • categorical_features[in] integer array of size n_features specifying if each feature is categorical. If set to NULL, all features are considered continuous. Otherwise, categorical_features[i] is expected to be set to the number of different categories for feature i (or to 0 if feature i is continuous).

Returns:

da_status.
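The ldx constraint above comes from how element (i, j) of the data matrix is addressed in a flat array. A sketch of the two storage orders, written in Python for brevity rather than C (illustrative only):

```python
def get_element(X, i, j, ldx, order="column-major"):
    # Element (i, j) of an n_samples x n_features matrix stored in a
    # flat array X with leading dimension ldx
    if order == "column-major":
        return X[i + j * ldx]   # valid only when ldx >= n_samples
    return X[i * ldx + j]       # row-major: valid only when ldx >= n_features

# The 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored column by column, ldx = 2
X_col = [1, 4, 2, 5, 3, 6]
print(get_element(X_col, 1, 2, 2))  # 6
```

A leading dimension larger than the minimum simply skips padding between columns (or rows), which is what lets a submatrix of a larger array be passed without copying.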

da_status da_tree_fit_s(da_handle handle)#
da_status da_tree_fit_d(da_handle handle)#

Fit the decision tree defined in the handle.

Compute the decision tree parameters given the data passed by da_tree_set_training_data_?. Note that you can customize the model before using the fit function through the use of optional parameters, see this section for a list of available options.

Parameters:

handle[inout] a da_handle object, initialized with type da_handle_decision_tree.

Returns:

da_status.

Post:

After successful execution, da_handle_get_result_? can be queried with the following enum:

  • da_rinfo - return an array of size 5 containing n_features, n_samples, the number of samples the tree was trained on, the value of the random seed used to fit the tree, and the depth of the tree.

da_status da_tree_predict_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, da_int *y_pred)#
da_status da_tree_predict_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, da_int *y_pred)#

Generate labels using fitted decision tree on a new set of data X_test.

After a model has been fitted using da_tree_fit_?, it can be used to generate predicted labels on new data. This function returns the decision tree predictions in the array y_pred.

For each data point i, y_pred[i] will contain the label of the most likely class according to the decision tree. The (i, j) element of X_test should contain feature j for observation i.

Parameters:
  • handle[inout] a da_handle object, initialized with type da_handle_decision_tree.

  • n_samples[in] number of observations in X_test.

  • n_features[in] number of features in X_test.

  • X_test[in] array containing n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.

  • ldx_test[in] leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.

  • y_pred[out] array of size at least n_samples. On output, will contain the predicted class labels.

Returns:

da_status
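The "most likely class" rule above amounts to an argmax over each row of class probabilities; a minimal plain-Python sketch (not the library's implementation):

```python
def predict_labels(proba_rows):
    # For each row of class probabilities, pick the index of the
    # largest entry, i.e. the most likely class label
    return [max(range(len(row)), key=row.__getitem__) for row in proba_rows]

proba = [[0.1, 0.9], [0.8, 0.2]]
print(predict_labels(proba))  # [1, 0]
```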

da_status da_tree_predict_proba_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, float *y_proba, da_int n_class, da_int ldy)#
da_status da_tree_predict_proba_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, double *y_proba, da_int n_class, da_int ldy)#

Generate class probabilities using fitted decision tree on a new set of data X_test.

After a model has been fitted using da_tree_fit_?, it can be used to generate predicted labels on new data. This function returns the decision tree class probabilities in the array y_proba.

For each data point i, and class j, the (i,j) element of y_proba will contain the class probability according to the decision tree, and the (i,j) element of X_test should contain the feature j for observation i.

Parameters:
  • handle[inout] a da_handle object, initialized with type da_handle_decision_tree.

  • n_samples[in] number of observations in X_test.

  • n_features[in] number of features in X_test.

  • X_test[in] array containing n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.

  • ldx_test[in] leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.

  • y_proba[out] array of size at least n_samples \(\times\) n_class. On output, will contain the predicted class probabilities.

  • n_class[in] number of classes in y_proba.

  • ldy[in] leading dimension of y_proba. Constraint: ldy \(\ge\) n_samples if X_test is stored in column-major order, or ldy \(\ge\) n_class if X_test is stored in row-major order.

Returns:

da_status

da_status da_tree_predict_log_proba_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, float *y_log_proba, da_int n_class, da_int ldy)#
da_status da_tree_predict_log_proba_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, double *y_log_proba, da_int n_class, da_int ldy)#

Generate class log probabilities using fitted decision tree on a new set of data X_test.

After a model has been fitted using da_tree_fit_?, it can be used to generate predicted labels on new data. This function returns the decision tree class log probabilities in the array y_log_proba.

For each data point i, and class j, the (i,j) element of y_log_proba will contain the class log probability according to the decision tree, and the (i,j) element of X_test should contain the feature j for observation i.

Parameters:
  • handle[inout] a da_handle object, initialized with type da_handle_decision_tree.

  • n_samples[in] number of observations in X_test.

  • n_features[in] number of features in X_test.

  • X_test[in] array containing n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.

  • ldx_test[in] leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.

  • y_log_proba[out] array of size at least n_samples \(\times\) n_class. On output, will contain the predicted class log probabilities.

  • n_class[in] number of classes in y_log_proba.

  • ldy[in] leading dimension of y_log_proba. Constraint: ldy \(\ge\) n_samples if X_test is stored in column-major order, or ldy \(\ge\) n_class if X_test is stored in row-major order.

Returns:

da_status

da_status da_tree_score_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, const da_int *y_test, float *mean_accuracy)#
da_status da_tree_score_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, const da_int *y_test, double *mean_accuracy)#

Calculate score (prediction accuracy) by comparing predicted labels and actual labels on a new set of data X_test.

To be used after a model has been fitted using da_tree_fit_?.

For each data point i, y_test[i] will contain the label of the test data, and the (i,j) element of X_test should contain the feature j for observation i.

Parameters:
  • handle[inout] a da_handle object, initialized with type da_handle_decision_tree.

  • n_samples[in] number of observations in X_test.

  • n_features[in] number of features in X_test. It must match the number of features from the training data set.

  • X_test[in] array containing n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.

  • ldx_test[in] leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.

  • y_test[in] actual class labels.

  • mean_accuracy[out] proportion of observations where the predicted label matches the actual label.

Returns:

da_status