Decision forests

AOCL API Guide
Document ID: 68552
Release Date: 2025-12-29
Version: 5.2 English
class aoclda.decision_forest.decision_forest(criterion='gini', bootstrap=True, n_trees=100, features_selection='sqrt', max_features=0, seed=-1, max_depth=10, min_samples_split=2, build_order='breadth first', samples_factor=0.8, min_impurity_decrease=0.0, min_split_score=0.0, feat_thresh=1.0e-06, check_data=False)#

Decision forest classifier.

An ensemble classifier based on decision trees.

Parameters:
  • n_trees (int, optional) – Set the number of trees to train. Default = 100.

  • criterion (str, optional) – Select the scoring function to use. It can take the values ‘cross-entropy’, ‘gini’ or ‘misclassification’. Default = ‘gini’.

  • max_depth (int, optional) – Set the maximum depth of the trees. Default = 10.

  • seed (int, optional) – Set the random seed for the random number generator. If the value is -1, a random seed is automatically generated and results will not be reproducible between runs. Default = -1.

  • features_selection (str, optional) – Select how many features to consider for each split. ‘custom’ reads the ‘maximum features’ option, ‘proportion’ reads the ‘proportion features’ option, and ‘all’, ‘sqrt’ and ‘log2’ select, respectively, all features, the square root of the number of features, or the base-2 logarithm of the number of features. Default = ‘sqrt’.

  • max_features (int, optional) – Set the number of features to consider when ‘features selection’ is set to ‘custom’. 0 means take all the features. Default 0.

  • proportion_features (float, optional) – Proportion of features to consider when ‘features selection’ is set to ‘proportion’. Default 0.1.

  • min_samples_split (int, optional) – The minimum number of samples required to split an internal node. Default 2.

  • bootstrap (bool, optional) – Select whether to bootstrap the samples in the trees. Default True.

  • samples_factor (float, optional) – Proportion of samples to draw from the data set to build each tree if ‘bootstrap’ is set to True. Default 0.8.

  • min_impurity_decrease (float, optional) – Minimum score improvement needed to consider a split from the parent node. Default 0.0.

  • min_split_score (float, optional) – Minimum score needed for a node to be considered for splitting. Default 0.0.

  • feat_thresh (float, optional) – Minimum difference in feature value required for splitting. Default 1.0e-06.

  • histogram (bool, optional) – Whether to use histogram-based splitting. Default = False.

  • maximum_bins (int, optional) – Maximum number of bins to use for histogram-based splitting. Default = 256.

  • block_size (int, optional) – Block size for internal parallelism. Default = 256.

  • category_split_strategy (str, optional) – Strategy to use for splitting categorical features. For a given categorical feature, ‘one-vs-all’ tries to split each categorical value from all the others, while ‘ordered’ tries to split the smaller category values from the bigger ones. Can be set to ‘one-vs-all’ or ‘ordered’. Default = ‘ordered’.

  • check_data (bool, optional) – Whether to check the data for NaNs. Default = False.
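The three scoring functions accepted by the criterion option can be illustrated with a small sketch. This is a plain NumPy illustration of the standard impurity formulas, not the library's internal implementation:

```python
import numpy as np

def impurity(labels, criterion="gini"):
    """Illustrative node impurity for the labels falling in one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()              # class proportions in the node
    if criterion == "gini":
        return 1.0 - np.sum(p ** 2)        # Gini index
    if criterion == "cross-entropy":
        return -np.sum(p * np.log(p))      # entropy in nats
    if criterion == "misclassification":
        return 1.0 - p.max()               # misclassification rate
    raise ValueError(f"unknown criterion: {criterion}")

labels = np.array([0, 0, 0, 1])
print(round(impurity(labels, "gini"), 4))               # 1 - (0.75^2 + 0.25^2) = 0.375
print(round(impurity(labels, "misclassification"), 2))  # 1 - 0.75 = 0.25
```

Lower impurity means a purer node; the trees choose splits that decrease these scores.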

fit(X, y, categorical_features=None)#

Computes the decision forest on the feature matrix X and response vector y.

Parameters:
  • X (array-like) – The feature matrix on which to compute the model. Its shape is (n_samples, n_features).

  • y (array-like) – The response vector. Its shape is (n_samples).

  • categorical_features (array-like, optional) – Integer vector. categorical_features[i] should be set to a negative value if feature i is continuous, or to the number of distinct categories if feature i is categorical. If None, all features are considered continuous. Its shape is (n_features).

Returns:

Returns the instance itself.

Return type:

self (object)
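The categorical_features encoding above can be built directly from the data. A minimal sketch with a hypothetical 3-column matrix, where we know in advance that column 1 is categorical:

```python
import numpy as np

# Hypothetical feature matrix: columns 0 and 2 continuous, column 1 categorical.
X = np.array([[1.2, 0, 5.0],
              [3.4, 2, 6.1],
              [0.7, 1, 5.5]])
categorical_cols = {1}  # assumed known from the data's meaning

# Negative value => continuous; otherwise the number of distinct categories.
categorical_features = np.array(
    [len(np.unique(X[:, j])) if j in categorical_cols else -1
     for j in range(X.shape[1])],
    dtype=np.int64)
print(categorical_features.tolist())  # [-1, 3, -1]
```

This vector would then be passed as the categorical_features argument of fit.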

predict(X)#

Generate labels using fitted decision forest on a new set of data X.

Parameters:

X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.

Returns:

The prediction vector, where n_samples is the number of rows of X.

Return type:

numpy.ndarray of length n_samples
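Conceptually, a forest's predicted label for each sample is the class that receives the most votes across its trees. A toy sketch of that aggregation step (hypothetical per-tree predictions, not the library's code):

```python
import numpy as np

# Hypothetical predictions from a 3-tree forest on 4 samples
# (rows: trees, columns: samples).
tree_preds = np.array([[0, 1, 1, 2],
                       [0, 1, 0, 2],
                       [1, 1, 0, 2]])

def majority_vote(preds):
    # For each column (sample), pick the most frequent class label.
    return np.array([np.bincount(col).argmax() for col in preds.T])

print(majority_vote(tree_preds).tolist())  # [0, 1, 0, 2]
```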

predict_log_proba(X)#

Generate class log probabilities using fitted decision forest on a new set of data X.

Parameters:

X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.

Returns:

The matrix of class log probabilities, where n_samples is the number of rows of X.

Return type:

numpy.ndarray of shape (n_samples, n_class)

predict_proba(X)#

Generate class probabilities using fitted decision forest on a new set of data X.

Parameters:

X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.

Returns:

The matrix of class probabilities, where n_samples is the number of rows of X.

Return type:

numpy.ndarray of shape (n_samples, n_class)
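predict_log_proba is the elementwise natural logarithm of predict_proba's output, and each row of the probability matrix is a distribution over the classes. A sketch of that relationship with illustrative values (not actual library output):

```python
import numpy as np

proba = np.array([[0.7, 0.2, 0.1],    # hypothetical class probabilities,
                  [0.1, 0.6, 0.3]])   # one row per sample
assert np.allclose(proba.sum(axis=1), 1.0)  # each row sums to 1

log_proba = np.log(proba)  # what predict_log_proba would return here
print(np.allclose(np.exp(log_proba), proba))  # True
```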

score(X, y)#

Calculates score (prediction accuracy) by comparing predicted labels and actual labels on a new set of data.

Parameters:
  • X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.

  • y (array-like) – The response vector. It must have shape (n_samples).

Returns:

The mean accuracy of the model on the test data.

Return type:

float
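The score is simply the mean accuracy: the fraction of samples whose predicted label matches the actual label. A sketch with hypothetical labels:

```python
import numpy as np

y_pred = np.array([0, 1, 1, 2, 0])  # hypothetical predictions
y_test = np.array([0, 1, 0, 2, 0])  # hypothetical true labels
mean_accuracy = np.mean(y_pred == y_test)
print(mean_accuracy)  # 4 of 5 labels match -> 0.8
```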

da_status da_forest_set_training_data_s(da_handle handle, da_int n_samples, da_int n_features, da_int n_class, const float *X, da_int ldx, const da_int *y, const da_int *categorical_features)#
da_status da_forest_set_training_data_d(da_handle handle, da_int n_samples, da_int n_features, da_int n_class, const double *X, da_int ldx, const da_int *y, const da_int *categorical_features)#

Pass a data matrix and a label array to the da_handle object in preparation for fitting a decision forest.

Parameters:
  • handle[inout] a da_handle object, initialized with type da_handle_decision_forest.

  • n_samples[in] number of observations in X.

  • n_features[in] number of features in X.

  • n_class[in] number of distinct classes in y. Will be computed automatically if n_class is set to 0.

  • X[in] array containing n_samples \(\times\) n_features data matrix. By default, it should be stored in column-major order, unless you have set the storage order option to row-major.

  • ldx[in] leading dimension of X. Constraint: ldx \(\ge\) n_samples if X is stored in column-major order, or ldx \(\ge\) n_features if X is stored in row-major order.

  • y[in] array containing the n_samples labels. The label values are expected to range from 0 to n_class - 1.

  • categorical_features[in] integer array of size n_features specifying if each feature is categorical. If set to NULL, all features are considered continuous. Otherwise, categorical_features[i] is expected to be set to the number of different categories for feature i (or to 0 if feature i is continuous).

Returns:

da_status.
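The storage-order convention above means that element (i, j) of X lives at X[i + j*ldx] in column-major order, or at X[i*ldx + j] in row-major order. A quick indexing sketch for a hypothetical 3 × 2 matrix (plain Python, illustrating the layout rather than calling the C API):

```python
n_samples, n_features = 3, 2
ldx = n_samples  # column-major: ldx >= n_samples

# The matrix [[1, 4], [2, 5], [3, 6]] flattened column by column:
X_col = [1, 2, 3, 4, 5, 6]

def elem_colmajor(X, i, j, ldx):
    """Element (i, j) of a column-major matrix with leading dimension ldx."""
    return X[i + j * ldx]

print(elem_colmajor(X_col, 1, 1, ldx))  # row 1, column 1 -> 5
```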

da_status da_forest_fit_s(da_handle handle)#
da_status da_forest_fit_d(da_handle handle)#

Fit the decision forest defined in the handle.

Compute the decision forest parameters given the data passed by da_forest_set_training_data_?. Note that you can customize the model before calling the fit function through the use of optional parameters; see this section for a list of available options.

Parameters:

handle[inout] a da_handle object, initialized with type da_handle_decision_forest.

Returns:

da_status.

Post:

After successful execution, da_handle_get_result_? can be queried with the following enum:

  • da_rinfo - return an array of size 5 containing n_features, n_samples, the number of samples each tree was trained on, the value of the random seed used by the RNG, and n_tree, the total number of trees in the forest.

da_status da_forest_predict_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, da_int *y_pred)#
da_status da_forest_predict_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, da_int *y_pred)#

Generate labels using fitted decision forest on a new set of data X_test.

After a model has been fitted using da_forest_fit_?, it can be used to generate predicted labels on new data. This function returns the decision forest predictions in the array y_pred.

For each data point i, y_pred[i] will contain the label of the most likely class according to the decision forest; the (i,j) element of X_test should contain the feature j for observation i.

Parameters:
  • handle[inout] a da_handle object, initialized with type da_handle_decision_forest.

  • n_samples[in] - number of observations in X_test.

  • n_features[in] - number of features in X_test.

  • X_test[in] array containing n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.

  • ldx_test[in] leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.

  • y_pred[out] - array of size at least n_samples. On output, will contain the predicted class labels.

Returns:

da_status

da_status da_forest_predict_proba_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, float *y_proba, da_int n_class, da_int ldy)#
da_status da_forest_predict_proba_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, double *y_proba, da_int n_class, da_int ldy)#

Generate class probabilities using fitted decision forest on a new set of data X_test.

After a model has been fitted using da_forest_fit_?, it can be used to generate predicted class probabilities on new data. This function returns the decision forest class probabilities in the array y_proba.

For each data point i, and class j, the (i,j) element of y_proba will contain the class probability according to the decision forest, and the (i,j) element of X_test should contain the feature j for observation i.

Parameters:
  • handle[inout] - a da_handle object, initialized with type da_handle_decision_forest.

  • n_samples[in] - number of observations in X_test.

  • n_features[in] - number of features in X_test.

  • X_test[in] - array containing n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.

  • ldx_test[in] - leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.

  • y_proba[out] - array of size at least n_samples \(\times\) n_class. On output, will contain the predicted class probabilities.

  • n_class[in] - number of classes in y_proba.

  • ldy[in] leading dimension of y_proba. Constraint: ldy \(\ge\) n_samples if X_test is stored in column-major order, or ldy \(\ge\) n_class if X_test is stored in row-major order.

Returns:

da_status

da_status da_forest_predict_log_proba_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, float *y_log_proba, da_int n_class, da_int ldy)#
da_status da_forest_predict_log_proba_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, double *y_log_proba, da_int n_class, da_int ldy)#

Generate class log probabilities using fitted decision forest on a new set of data X_test.

After a model has been fitted using da_forest_fit_?, it can be used to generate predicted class log probabilities on new data. This function returns the decision forest class log probabilities in the array y_log_proba.

For each data point i, and class j, the (i,j) element of y_log_proba will contain the class log probability according to the decision forest, and the (i,j) element of X_test should contain the feature j for observation i.

Parameters:
  • handle[inout] a da_handle object, initialized with type da_handle_decision_forest.

  • n_samples[in] - number of observations in X_test.

  • n_features[in] - number of features in X_test.

  • X_test[in] - array containing n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.

  • ldx_test[in] - leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.

  • y_log_proba[out] - array of size at least n_samples \(\times\) n_class. On output, will contain the predicted class log probabilities.

  • n_class[in] - number of classes in y_log_proba.

  • ldy[in] leading dimension of y_log_proba. Constraint: ldy \(\ge\) n_samples if X_test is stored in column-major order, or ldy \(\ge\) n_class if X_test is stored in row-major order.

Returns:

da_status

da_status da_forest_score_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, const da_int *y_test, float *mean_accuracy)#
da_status da_forest_score_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, const da_int *y_test, double *mean_accuracy)#

Calculate score (prediction accuracy) by comparing predicted labels and actual labels on a new set of data X_test.

To be used after a model has been fitted using da_forest_fit_?.

For each data point i, y_test[i] will contain the label of the test data, and the (i,j) element of X_test should contain the feature j for observation i.

Parameters:
  • handle[inout] - a da_handle object, initialized with type da_handle_decision_forest.

  • n_samples[in] - number of observations in X_test.

  • n_features[in] - number of features in X_test. It must match the number of features from the training data set.

  • X_test[in] - array containing n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.

  • ldx_test[in] - leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.

  • y_test[in] - array of size n_samples containing the actual class labels.

  • mean_accuracy[out] - proportion of observations where predicted label matches actual label.

Returns:

da_status