- class aoclda.decision_forest.decision_forest(criterion='gini', bootstrap=True, n_trees=100, features_selection='sqrt', max_features=0, seed=-1, max_depth=10, min_samples_split=2, build_order='breadth first', samples_factor=0.8, min_impurity_decrease=0.0, min_split_score=0.0, feat_thresh=1.0e-06, check_data=False)
Decision forest classifier.
An ensemble classifier based on decision trees.
- Parameters:
n_trees (int, optional) – Set the number of trees to train. Default = 100.
criterion (str, optional) – Select the scoring function to use. It can take the values ‘cross-entropy’, ‘gini’, or ‘misclassification’. Default = ‘gini’.
max_depth (int, optional) – Set the maximum depth of the trees. Default = 29.
seed (int, optional) – Set the random seed for the random number generator. If the value is -1, a random seed is automatically generated. In this case the resulting classification will create non-reproducible results. Default = -1.
features_selection (str, optional) – Select how many features to use for each split. ‘custom’ reads the max_features parameter and ‘proportion’ reads the proportion_features parameter. ‘all’, ‘sqrt’ and ‘log2’ select, respectively, all features, the square root of the total number of features, or its base-2 logarithm. Default = ‘sqrt’.
max_features (int, optional) – Set the number of features to consider when features_selection is set to ‘custom’. 0 means take all the features. Default 0.
proportion_features (float, optional) – Proportion of features to consider when features_selection is set to ‘proportion’. Default 0.1.
min_samples_split (int, optional) – The minimum number of samples required to split an internal node. Default 2.
bootstrap (bool, optional) – Select whether to bootstrap the samples in the trees. Default True.
samples_factor (float, optional) – Proportion of samples to draw from the data set to build each tree if ‘bootstrap’ was set to True. Default 1.0.
min_impurity_decrease (float, optional) – Minimum score improvement needed to consider a split from the parent node. Default 0.0.
min_split_score (float, optional) – Minimum score needed for a node to be considered for splitting. Default 0.0.
feat_thresh (float, optional) – Minimum difference in feature value required for splitting. Default 1.0e-05.
histogram (bool, optional) – Whether to use histogram-based splitting. Default = False.
maximum_bins (int, optional) – Maximum number of bins to use for histogram-based splitting. Default = 256.
block_size (int, optional) – Block size for internal parallelism. Default = 256.
category_split_strategy (str, optional) – Strategy to use for splitting categorical features. For a given categorical feature, ‘one-vs-all’ tries to split each categorical value from all the others, while ‘ordered’ tries to split the smaller categorical values from the larger ones. Can be set to ‘one-vs-all’ or ‘ordered’. Default = ‘ordered’.
check_data (bool, optional) – Whether to check the data for NaNs. Default = False.
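The criterion and features_selection options above can be illustrated with a minimal pure-Python sketch. It uses the textbook definitions of the three impurity measures and a simple rounding rule for the feature count. The helper names `impurity` and `n_candidate_features`, the use of the natural logarithm, and the round-down-with-minimum-1 rule are assumptions made for illustration; the library's internal formulas and rounding may differ.

```python
import math


def impurity(class_counts, criterion="gini"):
    """Textbook node impurity computed from per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    # Empty classes contribute nothing to any of the three measures.
    probs = [c / total for c in class_counts if c > 0]
    if criterion == "gini":
        return 1.0 - sum(p * p for p in probs)
    if criterion == "cross-entropy":
        # Assumption: natural logarithm; another base only rescales the score.
        return -sum(p * math.log(p) for p in probs)
    if criterion == "misclassification":
        return 1.0 - max(probs)
    raise ValueError(f"unknown criterion: {criterion!r}")


def n_candidate_features(n_features, features_selection="sqrt",
                         max_features=0, proportion_features=0.1):
    """Number of features examined per split (rounding rule is an assumption)."""
    if n_features < 1:
        raise ValueError("n_features must be at least 1")
    if features_selection == "all":
        k = n_features
    elif features_selection == "sqrt":
        k = math.isqrt(n_features)
    elif features_selection == "log2":
        k = int(math.log2(n_features))
    elif features_selection == "custom":
        # 0 means "take all the features", as documented for max_features.
        k = n_features if max_features == 0 else max_features
    elif features_selection == "proportion":
        k = int(proportion_features * n_features)
    else:
        raise ValueError(f"unknown features_selection: {features_selection!r}")
    # Clamp to the valid range [1, n_features].
    return max(1, min(k, n_features))


if __name__ == "__main__":
    print(impurity([5, 5], "gini"))
    print(n_candidate_features(16, "sqrt"))
```

A pure node (all samples in one class) scores 0 under all three measures, which is why a split is only attractive when it lowers the score of the children relative to the parent.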
- fit(X, y, categorical_features=None)
Computes the decision forest on the feature matrix X and response vector y.
- Parameters:
X (array-like) – The feature matrix on which to compute the model. Its shape is (n_samples, n_features).
y (array-like) – The response vector. Its shape is (n_samples).
categorical_features (array-like, optional) – Integer vector. categorical_features[i] should be set to a negative value if feature i is continuous, or to the number of different categories if feature i is categorical. If None, all features are considered continuous. Its shape is (n_features).
- Returns:
Returns the instance itself.
- Return type:
self (object)
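As an illustration of the categorical_features argument of fit, the following sketch builds the integer vector from a mapping of feature index to number of categories. The helper name `build_categorical_features` and the choice of -1 as the negative marker for continuous features are assumptions; the description above only requires some negative value.

```python
def build_categorical_features(n_features, n_categories_by_feature):
    """Return a list of length n_features.

    Entry i is -1 (continuous) unless feature i appears in
    n_categories_by_feature, in which case it is the number of categories.
    """
    categorical = [-1] * n_features
    for feature, n_categories in n_categories_by_feature.items():
        if not 0 <= feature < n_features:
            raise IndexError(f"feature index {feature} out of range")
        if n_categories < 1:
            raise ValueError("a categorical feature needs at least one category")
        categorical[feature] = n_categories
    return categorical


if __name__ == "__main__":
    # Four features: features 1 and 3 are categorical with 3 and 2 categories.
    print(build_categorical_features(4, {1: 3, 3: 2}))
```

The resulting list has shape (n_features) and could be passed as the categorical_features argument; passing None instead treats every feature as continuous.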
- predict(X)
Generate labels using fitted decision forest on a new set of data X.
- Parameters:
X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.
- Returns:
The prediction vector, where n_samples is the number of rows of X.
- Return type:
numpy.ndarray of length n_samples
- predict_log_proba(X)
Generate class log probabilities using fitted decision forest on a new set of data X.
- Parameters:
X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.
- Returns:
The prediction vector, where n_samples is the number of rows of X.
- Return type:
numpy.ndarray of length n_samples
- predict_proba(X)
Generate class probabilities using fitted decision forest on a new set of data X.
- Parameters:
X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.
- Returns:
The prediction vector, where n_samples is the number of rows of X.
- Return type:
numpy.ndarray of length n_samples
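The three prediction methods are related: a label is the most likely class, and a log probability is the logarithm of a class probability. The sketch below shows that relationship on plain Python lists. It assumes one row of class probabilities per sample, labels numbered from 0, and the natural logarithm; none of these details are confirmed by the descriptions above, and the helper names are hypothetical.

```python
import math


def log_proba_from_proba(proba):
    """Element-wise natural log; a probability of 0 maps to -inf."""
    return [
        [math.log(p) if p > 0.0 else float("-inf") for p in row]
        for row in proba
    ]


def labels_from_proba(proba):
    """Index of the largest probability in each row (first index wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in proba]


if __name__ == "__main__":
    proba = [[0.7, 0.3], [0.25, 0.75]]
    print(labels_from_proba(proba))
    print(log_proba_from_proba(proba))
```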
- score(X, y)
Calculates score (prediction accuracy) by comparing predicted labels and actual labels on a new set of data.
- Parameters:
X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.
y (array-like) – The response vector. It must have shape (n_samples).
- Returns:
The mean accuracy of the model on the test data.
- Return type:
float
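The score described above is the proportion of samples whose predicted label equals the actual label. A minimal sketch of that computation, with a hypothetical helper name `mean_accuracy` operating on two label sequences:

```python
def mean_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


if __name__ == "__main__":
    print(mean_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```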
-
da_status da_forest_set_training_data_s(da_handle handle, da_int n_samples, da_int n_features, da_int n_class, const float *X, da_int ldx, const da_int *y, const da_int *categorical_features)
-
da_status da_forest_set_training_data_d(da_handle handle, da_int n_samples, da_int n_features, da_int n_class, const double *X, da_int ldx, const da_int *y, const da_int *categorical_features)
Pass a data matrix and a label array to the da_handle object in preparation for fitting a decision forest.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_decision_forest.
n_samples – [in] number of observations in X.
n_features – [in] number of features in X.
n_class – [in] number of distinct classes in y. Will be computed automatically if n_class is set to 0.
X – [in] array containing the n_samples \(\times\) n_features data matrix. By default, it should be stored in column-major order, unless you have set the storage order option to row-major.
ldx – [in] leading dimension of X. Constraint: ldx \(\ge\) n_samples if X is stored in column-major order, or ldx \(\ge\) n_features if X is stored in row-major order.
y – [in] array containing the n_samples labels. The label values are expected to range from 0 to n_class - 1.
categorical_features – [in] integer array of size n_features specifying if each feature is categorical. If set to NULL, all features are considered continuous. Otherwise, categorical_features[i] is expected to be set to the number of different categories for feature i (or to 0 if feature i is continuous).
- Returns:
da_status. The function returns:
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_invalid_leading_dimension - the constraint on ldx was violated.
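The storage order and the ldx constraint determine where element (i, j) of X lives in the flat array. The following sketch packs a small matrix into a flat buffer under either layout. The helper names and the order strings "column-major" and "row-major" are illustrative assumptions, not the library's option values; padding entries are filled with 0.0 purely for readability.

```python
def flat_index(i, j, ldx, order="column-major"):
    """Position of element (i, j) in a flat array with leading dimension ldx."""
    if order == "column-major":
        return i + j * ldx
    if order == "row-major":
        return i * ldx + j
    raise ValueError(f"unknown order: {order!r}")


def pack_matrix(rows, ldx, order="column-major"):
    """Copy a list of equal-length rows into a flat buffer with stride ldx."""
    n_samples, n_features = len(rows), len(rows[0])
    # Same constraint as documented for ldx.
    minimum = n_samples if order == "column-major" else n_features
    if ldx < minimum:
        raise ValueError(f"ldx must be at least {minimum}")
    size = ldx * n_features if order == "column-major" else n_samples * ldx
    buffer = [0.0] * size
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            buffer[flat_index(i, j, ldx, order)] = float(value)
    return buffer


if __name__ == "__main__":
    rows = [[1, 2], [3, 4], [5, 6]]  # 3 samples, 2 features
    print(pack_matrix(rows, ldx=3, order="column-major"))
    print(pack_matrix(rows, ldx=2, order="row-major"))
```

Choosing ldx larger than the minimum leaves unused padding between columns (or rows), which is how a sub-block of a larger matrix can be passed without copying.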
-
da_status da_forest_fit_d(da_handle handle)
Fit the decision forest defined in the handle.
Compute the decision forest parameters given the data passed by da_forest_set_training_data_?. Note that you can customize the model before calling the fit function by setting optional parameters; see the list of available options.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_decision_forest.
- Returns:
da_status. The function returns:
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_incompatible_options - some of the options set are incompatible with the model defined in handle. You can obtain further information using da_handle_print_error_message.
da_status_memory_error - internal memory allocation encountered a problem.
da_status_internal_error - an unexpected error occurred.
- Post:
After successful execution, da_handle_get_result_? can be queried with the following enum:
da_rinfo - return an array of size 5 containing n_features, n_samples, the number of samples the tree was trained on, the value of the random seed used by the RNG and n_tree, the total number of trees in the forest.
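To make the layout of the da_rinfo array concrete, the sketch below names its five entries in the order listed above. It assumes the array arrives as floating-point values that encode integers; the `ForestInfo` type, its field names (in particular `n_obs` for the number of samples the tree was trained on) and the `unpack_rinfo` helper are hypothetical.

```python
from collections import namedtuple

# Field order follows the da_rinfo description; the names are illustrative.
ForestInfo = namedtuple(
    "ForestInfo", ["n_features", "n_samples", "n_obs", "seed", "n_tree"]
)


def unpack_rinfo(rinfo):
    """Convert the 5-entry result array into a named tuple of integers."""
    if len(rinfo) != 5:
        raise ValueError("da_rinfo is expected to hold exactly 5 values")
    return ForestInfo(*(int(value) for value in rinfo))


if __name__ == "__main__":
    info = unpack_rinfo([4.0, 100.0, 80.0, 42.0, 100.0])
    print(info.n_obs, info.seed)
```

Reading back the seed is useful when the seed option was left at -1, since it allows the same forest to be rebuilt later.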
-
da_status da_forest_predict_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, da_int *y_pred)
-
da_status da_forest_predict_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, da_int *y_pred)
Generate labels using fitted decision forest on a new set of data X_test.
After a model has been fitted using da_forest_fit_?, it can be used to generate predicted labels on new data. This function returns the decision forest predictions in the array y_pred.
For each data point i, y_pred[i] will contain the label of the most likely class according to the decision forest; the (i,j) element of X_test should contain the feature j for observation i.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_decision_forest.
n_samples – [in] - number of observations in X_test.
n_features – [in] - number of features in X_test.
X_test – [in] array containing the n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.
ldx_test – [in] leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.
y_pred – [out] - array of size at least n_samples. On output, will contain the predicted class labels.
- Returns:
da_status
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_out_of_date - the model has not been trained yet.
da_status_invalid_leading_dimension - the constraint on ldx_test was violated.
-
da_status da_forest_predict_proba_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, float *y_proba, da_int n_class, da_int ldy)
-
da_status da_forest_predict_proba_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, double *y_proba, da_int n_class, da_int ldy)
Generate class probabilities using fitted decision forest on a new set of data X_test.
After a model has been fitted using da_forest_fit_?, it can be used to generate class probabilities on new data. This function returns the decision forest class probabilities in the array y_proba.
For each data point i and class j, the (i,j) element of y_proba will contain the class probability according to the decision forest, and the (i,j) element of X_test should contain the feature j for observation i.
- Parameters:
handle – [inout] - a da_handle object, initialized with type da_handle_decision_forest.
n_samples – [in] - number of observations in X_test.
n_features – [in] - number of features in X_test.
X_test – [in] - array containing the n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.
ldx_test – [in] - leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.
y_proba – [out] - array of size at least n_samples \(\times\) n_class. On output, will contain the predicted class probabilities.
n_class – [in] - number of classes in y_proba.
ldy – [in] leading dimension of y_proba. Constraint: ldy \(\ge\) n_samples if X_test is stored in column-major order, or ldy \(\ge\) n_class if X_test is stored in row-major order.
- Returns:
da_status
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_out_of_date - the model has not been trained yet.
da_status_invalid_leading_dimension - one of the constraints on ldx_test or ldy was violated.
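Because y_proba is a flat output buffer with leading dimension ldy, reading the probability of class j for observation i requires the same stride arithmetic as for the input matrix. The sketch below turns such a buffer back into one row per observation. The helper name `unpack_proba` and the order strings are assumptions for illustration; the buffer-length check reflects ordinary strided storage rather than a documented library rule.

```python
def unpack_proba(y_proba, n_samples, n_class, ldy, order="column-major"):
    """Return a list of n_samples rows, each holding n_class probabilities."""
    if order == "column-major":
        if ldy < n_samples:
            raise ValueError("ldy must be at least n_samples")
        needed = (n_class - 1) * ldy + n_samples

        def index(i, j):
            return i + j * ldy
    elif order == "row-major":
        if ldy < n_class:
            raise ValueError("ldy must be at least n_class")
        needed = (n_samples - 1) * ldy + n_class

        def index(i, j):
            return i * ldy + j
    else:
        raise ValueError(f"unknown order: {order!r}")
    if len(y_proba) < needed:
        raise ValueError(f"buffer too short: need at least {needed} entries")
    return [
        [y_proba[index(i, j)] for j in range(n_class)]
        for i in range(n_samples)
    ]


if __name__ == "__main__":
    # 3 observations, 2 classes, column-major with one padding row (ldy = 4).
    buffer = [0.7, 0.2, 0.5, 0.0, 0.3, 0.8, 0.5, 0.0]
    for row in unpack_proba(buffer, n_samples=3, n_class=2, ldy=4):
        print(row)
```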
-
da_status da_forest_predict_log_proba_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, float *y_log_proba, da_int n_class, da_int ldy)
-
da_status da_forest_predict_log_proba_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, double *y_log_proba, da_int n_class, da_int ldy)
Generate class log probabilities using fitted decision forest on a new set of data X_test.
After a model has been fitted using da_forest_fit_?, it can be used to generate class log probabilities on new data. This function returns the decision forest class log probabilities in the array y_log_proba.
For each data point i and class j, the (i,j) element of y_log_proba will contain the class log probability according to the decision forest, and the (i,j) element of X_test should contain the feature j for observation i.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_decision_forest.
n_samples – [in] - number of observations in X_test.
n_features – [in] - number of features in X_test.
X_test – [in] - array containing the n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.
ldx_test – [in] - leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.
y_log_proba – [out] - array of size at least n_samples \(\times\) n_class. On output, will contain the predicted class log probabilities.
n_class – [in] - number of classes in y_log_proba.
ldy – [in] leading dimension of y_log_proba. Constraint: ldy \(\ge\) n_samples if X_test is stored in column-major order, or ldy \(\ge\) n_class if X_test is stored in row-major order.
- Returns:
da_status
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_out_of_date - the model has not been trained yet.
da_status_invalid_leading_dimension - one of the constraints on ldx_test or ldy was violated.
-
da_status da_forest_score_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, const da_int *y_test, float *mean_accuracy)
-
da_status da_forest_score_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, const da_int *y_test, double *mean_accuracy)
Calculate score (prediction accuracy) by comparing predicted labels and actual labels on a new set of data X_test.
To be used after a model has been fitted using da_forest_fit_?.
For each data point i, y_test[i] should contain the actual label of the test data, and the (i,j) element of X_test should contain the feature j for observation i.
- Parameters:
handle – [inout] - a da_handle object, initialized with type da_handle_decision_forest.
n_samples – [in] - number of observations in X_test.
n_features – [in] - number of features in X_test. It must match the number of features from the training data set.
X_test – [in] - array containing the n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.
ldx_test – [in] - leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.
y_test – [in] - actual class labels.
mean_accuracy – [out] - proportion of observations where predicted label matches actual label.
- Returns:
da_status
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_out_of_date - the model has not been trained yet.
da_status_invalid_leading_dimension - the constraint on ldx_test was violated.