- class aoclda.decision_tree.decision_tree(criterion='gini', seed=-1, max_depth=29, max_features=0, min_samples_split=2, build_order='breadth first', min_impurity_decrease=0.0, min_split_score=0.0, feat_thresh=1.0e-06, check_data=False)#
A decision tree classifier.
- Parameters:
max_depth (int, optional) – Set the maximum depth of the tree. Default = 29.
seed (int, optional) – Set the random seed for the random number generator. If the value is -1, a random seed is generated automatically, so the resulting classification is not reproducible between runs. Default = -1.
max_features (int, optional) – Set the number of features to consider when splitting a node. 0 means take all the features. Default 0.
criterion (str, optional) – Select the scoring function to use. It can take the values ‘cross-entropy’, ‘gini’, or ‘misclassification’. Default = ‘gini’.
min_samples_split (int, optional) – The minimum number of samples required to split an internal node. Default = 2.
build_order (str, optional) – Select in which order to explore the nodes. It can take the values ‘breadth first’ or ‘depth first’. Default ‘breadth first’.
min_impurity_decrease (float, optional) – Minimum score improvement needed to consider a split from the parent node. Default = 0.0
min_split_score (float, optional) – Minimum score needed for a node to be considered for splitting. Default 0.0.
feat_thresh (float, optional) – Minimum difference in feature value required for splitting. Default = 1.0e-06.
precision (str, optional) – Whether to initialize the decision_tree object in double or single precision. It can take the values ‘single’ or ‘double’. Default = ‘double’.
detect_categorical_data (bool, optional) – Whether to check which features are categorical in X. Default = False.
max_category (int, optional) – Maximum number of categories for a given feature to be considered categorical. Default = 50.
category_tolerance (float, optional) – How far data can be from an integer to be considered not categorical. Default = 1.0e-05
category_split_strategy (str, optional) – The strategy to use for splitting categorical features. It can take the values ‘ordered’ or ‘one-vs-all’. Default = ‘ordered’.
histogram (bool, optional) – Whether to use histograms for continuous features. Default = False.
maximum_bins (int, optional) – Maximum number of bins to use for histograms. Default = 256.
predict_proba (bool, optional) – Whether to predict class probabilities. Default = True.
check_data (bool, optional) – Whether to check the data for NaNs. Default = False.
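The three values accepted by criterion correspond to standard node impurity measures. The sketch below gives the textbook definitions in plain Python. The helper names are hypothetical, and the exact formulas used inside the library (for example the logarithm base for cross-entropy) are assumptions rather than something stated in this reference.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class present in a node."""
    counts = Counter(labels)
    n = len(labels)
    return [count / n for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def cross_entropy(labels):
    """Cross-entropy: -sum(p_k * log2(p_k)). Base 2 is an assumption."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def misclassification(labels):
    """Misclassification error: 1 - max(p_k)."""
    return 1.0 - max(class_proportions(labels))


# A node holding two samples of class 0 and two samples of class 1.
node = [0, 0, 1, 1]
print(gini(node), cross_entropy(node), misclassification(node))
```

All three measures are zero for a pure node and largest when the classes are evenly mixed, which is why a split is scored by how much it reduces them.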
- property depth#
The depth of the trained tree
- Type:
int
- fit(X, y, categorical_features=None)#
Computes the decision tree on the feature matrix X and response vector y.
- Parameters:
X (array-like) – The feature matrix on which to compute the model. Its shape is (n_samples, n_features).
y (array-like) – The response vector. Its shape is (n_samples).
categorical_features (array-like, optional) – Integer vector. categorical_features[i] should be set to a negative value if feature i is continuous, or to the number of different categories if feature i is categorical. If None, all features are considered continuous. Its shape is (n_features).
- Returns:
Returns the instance itself.
- Return type:
self (object)
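As a sketch of how the categorical_features argument could be assembled, the helper below follows the convention stated for fit: a negative entry for a continuous feature and the category count otherwise. The function name and its dictionary input are hypothetical and not part of aoclda.

```python
def build_categorical_features(n_features, n_categories):
    """Build the integer vector described for fit(..., categorical_features=...).

    n_categories maps a feature index to its number of categories;
    features missing from the mapping are treated as continuous.
    """
    features = [-1] * n_features  # negative value: continuous feature
    for index, count in n_categories.items():
        if not 0 <= index < n_features:
            raise IndexError(f"feature index {index} out of range")
        if count < 1:
            raise ValueError("a categorical feature needs at least one category")
        features[index] = count
    return features


# Four features; feature 1 has 3 categories and feature 3 has 5.
categorical_features = build_categorical_features(4, {1: 3, 3: 5})
print(categorical_features)  # [-1, 3, -1, 5]
```

The resulting list has length n_features, matching the shape required above.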
- property max_features#
Get the maximum number of features to consider when splitting a node.
- property n_features#
The number of features used in the trained tree
- Type:
int
- property n_leaves#
The number of leaves in the trained tree
- Type:
int
- property n_nodes#
The number of nodes in the trained tree
- Type:
int
- property n_obs#
The number of observations used to train the tree
- Type:
int
- property n_samples#
The number of samples used in the trained tree
- Type:
int
- predict(X)#
Generate labels using the fitted decision tree on a new set of data X.
- Parameters:
X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.
- Returns:
The prediction vector, where n_samples is the number of rows of X.
- Return type:
numpy.ndarray of length n_samples
- predict_log_proba(X)#
Generate class log probabilities using the fitted decision tree on a new set of data X.
- Parameters:
X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.
- Returns:
The prediction vector, where n_samples is the number of rows of X.
- Return type:
numpy.ndarray of length n_samples
- predict_proba(X)#
Generate class probabilities using the fitted decision tree on a new set of data X.
- Parameters:
X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.
- Returns:
The prediction vector, where n_samples is the number of rows of X.
- Return type:
numpy.ndarray of length n_samples
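The outputs of predict, predict_proba and predict_log_proba are related in the usual way for tree classifiers. The sketch below assumes, as is common but not stated in this reference, that the class probabilities of a leaf are the class proportions of the training samples that reached it. The function names are illustrative only, and how the library treats a zero probability when taking logarithms is not specified here.

```python
import math


def leaf_probabilities(leaf_labels, n_class):
    """Class proportions among the training labels that ended up in one leaf."""
    counts = [0] * n_class
    for label in leaf_labels:
        counts[label] += 1
    total = len(leaf_labels)
    return [count / total for count in counts]


def log_probabilities(probabilities):
    """Element-wise natural log; a zero probability is mapped to -inf here."""
    return [math.log(p) if p > 0.0 else float("-inf") for p in probabilities]


def predicted_label(probabilities):
    """Index of the most likely class (the first one wins on ties)."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# A leaf that received one sample of class 0 and three samples of class 2.
proba = leaf_probabilities([0, 2, 2, 2], n_class=3)
print(proba)                     # [0.25, 0.0, 0.75]
print(log_probabilities(proba))
print(predicted_label(proba))    # 2
```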
- score(X, y)#
Calculates score (prediction accuracy) by comparing predicted labels and actual labels on a new set of data.
- Parameters:
X (array-like) – The feature matrix to evaluate the model on. It must have n_features columns.
y (array-like) – The response vector. It must have shape (n_samples).
- Returns:
The mean accuracy of the model on the test data.
- Return type:
float
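The mean accuracy returned by score is the proportion of samples whose predicted label matches the actual label. A minimal pure-Python sketch of that quantity, with a hypothetical helper name, is:

```python
def mean_accuracy(y_true, y_pred):
    """Proportion of positions where the predicted label equals the actual label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    matches = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return matches / len(y_true)


# Three of the four predictions are correct.
print(mean_accuracy([0, 1, 1, 2], [0, 1, 2, 2]))  # 0.75
```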
- update_model_info()#
Update the model information (the content of rinfo from the C interface).
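Taken together, the methods of this class support a fit, predict, score workflow. The sketch below is illustrative rather than authoritative: it uses only the constructor arguments, methods and properties listed in this reference, the toy data is invented, and it assumes that NumPy arrays with default floating point and integer dtypes are accepted. The imports are guarded so that the snippet still runs, returning None, when aoclda or numpy is not installed.

```python
def demo():
    """Fit a small tree and return its accuracy, or None if dependencies are missing."""
    try:
        import numpy as np
        from aoclda.decision_tree import decision_tree
    except ImportError:
        return None  # aoclda or numpy is not installed; nothing to demonstrate

    # Toy data (invented): the label is 1 when the first feature exceeds 0.5.
    X_train = np.array([[0.1, 1.0], [0.2, 0.0], [0.8, 1.0], [0.9, 0.0]])
    y_train = np.array([0, 0, 1, 1])
    X_test = np.array([[0.15, 0.5], [0.85, 0.5]])
    y_test = np.array([0, 1])

    # A fixed seed is used so that repeated runs give the same tree.
    tree = decision_tree(criterion="gini", seed=42, max_depth=5)
    tree.fit(X_train, y_train)

    print("predicted labels:", tree.predict(X_test))
    print("tree depth:", tree.depth, "leaves:", tree.n_leaves)
    return tree.score(X_test, y_test)


accuracy = demo()
print("mean accuracy:", accuracy)
```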
-
da_status da_tree_set_training_data_s(da_handle handle, da_int n_samples, da_int n_features, da_int n_class, const float *X, da_int ldx, const da_int *y, const da_int *categorical_features)#
-
da_status da_tree_set_training_data_d(da_handle handle, da_int n_samples, da_int n_features, da_int n_class, const double *X, da_int ldx, const da_int *y, const da_int *categorical_features)#
Pass a data matrix and a label array to the da_handle object in preparation for fitting a decision tree.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_decision_tree.
n_samples – [in] number of observations in X.
n_features – [in] number of features in X.
n_class – [in] number of distinct classes in y. Will be computed automatically if n_class is set to 0.
X – [in] array containing the n_samples \(\times\) n_features data matrix. By default, it should be stored in column-major order, unless you have set the storage order option to row-major.
ldx – [in] leading dimension of X. Constraint: ldx \(\ge\) n_samples if X is stored in column-major order, or ldx \(\ge\) n_features if X is stored in row-major order.
y – [in] array containing the n_samples labels. The label values are expected to range from 0 to n_class - 1.
categorical_features – [in] integer array of size n_features specifying if each feature is categorical. If set to NULL, all features are considered continuous. Otherwise, categorical_features[i] is expected to be set to the number of different categories for feature i (or to 0 if feature i is continuous).
- Returns:
da_status. The function returns:
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_memory_error - internal memory allocation encountered a problem.
da_status_invalid_leading_dimension - the constraint on ldx was violated.
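The role of the leading dimension ldx can be illustrated with the usual flat-array addressing for the two storage orders. The sketch below is written in Python purely for illustration of the C-side memory layout; the helper names are hypothetical, and the addressing formulas are the standard convention for a leading dimension rather than something quoted from the library.

```python
def flat_index(i, j, ldx, order="column-major"):
    """Position of element (i, j) of X inside the flat array."""
    if order == "column-major":
        return i + j * ldx  # consecutive rows of one column are adjacent
    if order == "row-major":
        return i * ldx + j  # consecutive columns of one row are adjacent
    raise ValueError("order must be 'column-major' or 'row-major'")


def check_ldx(n_samples, n_features, ldx, order="column-major"):
    """Check the documented constraint on the leading dimension."""
    minimum = n_samples if order == "column-major" else n_features
    return ldx >= minimum


# A 3 x 2 matrix stored column-major with ldx = 4 (one padding slot per column).
n_samples, n_features, ldx = 3, 2, 4
X = [1.0, 2.0, 3.0, 0.0,   # column 0 followed by padding
     4.0, 5.0, 6.0, 0.0]   # column 1 followed by padding
print(check_ldx(n_samples, n_features, ldx))  # True
print(X[flat_index(2, 1, ldx)])               # 6.0
```

A leading dimension larger than the minimum lets a routine operate on a sub-block of a larger allocation without copying it.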
-
da_status da_tree_fit_s(da_handle handle)#
-
da_status da_tree_fit_d(da_handle handle)#
Fit the decision tree defined in the handle.
Compute the decision tree parameters given the data passed by da_tree_set_training_data_?. Note that you can customize the model before calling the fit function by setting optional parameters; see this section for a list of available options.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_decision_tree.
- Returns:
da_status. The function returns:
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_incompatible_options - some of the options set are incompatible with the model defined in handle. You can obtain further information using da_handle_print_error_message.
da_status_memory_error - internal memory allocation encountered a problem.
da_status_internal_error - an unexpected error occurred.
- Post:
After successful execution, da_handle_get_result_? can be queried with the following enum:
da_rinfo - return an array of size 5 containing n_features, n_samples, the number of samples the tree was trained on, the value of the random seed used to fit the tree and the depth of the tree.
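The five da_rinfo entries are easier to work with once they are given names. The sketch below unpacks such an array in the order listed above. The TreeInfo type, its field names and the sample values are hypothetical; in particular, calling the third entry n_obs is an assumption.

```python
from typing import NamedTuple


class TreeInfo(NamedTuple):
    n_features: int
    n_samples: int
    n_obs: int      # number of samples the tree was trained on
    seed: int       # random seed used to fit the tree
    depth: int


def unpack_rinfo(rinfo):
    """Convert the 5-element rinfo array into named integer fields."""
    if len(rinfo) != 5:
        raise ValueError("rinfo is expected to contain exactly 5 values")
    return TreeInfo(*(int(value) for value in rinfo))


# Illustrative values only; a real array would come from da_handle_get_result_?.
info = unpack_rinfo([4.0, 150.0, 150.0, 42.0, 3.0])
print(info.depth)  # 3
```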
-
da_status da_tree_predict_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, da_int *y_pred)#
-
da_status da_tree_predict_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, da_int *y_pred)#
Generate labels using the fitted decision tree on a new set of data X_test.
After a model has been fitted using da_tree_fit_?, it can be used to generate predicted labels on new data. This function returns the decision tree predictions in the array y_pred.
For each data point i, y_pred[i] will contain the label of the most likely class according to the decision tree, and the (i,j) element of X_test should contain the feature j for observation i.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_decision_tree.
n_samples – [in] number of observations in X_test.
n_features – [in] number of features in X_test.
X_test – [in] array containing the n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.
ldx_test – [in] leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.
y_pred – [out] array of size at least n_samples. On output, will contain the predicted class labels.
- Returns:
da_status
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_out_of_date - the model has not been trained yet.
da_status_invalid_leading_dimension - the constraint on ldx_test was violated.
-
da_status da_tree_predict_proba_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, float *y_proba, da_int n_class, da_int ldy)#
-
da_status da_tree_predict_proba_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, double *y_proba, da_int n_class, da_int ldy)#
Generate class probabilities using the fitted decision tree on a new set of data X_test.
After a model has been fitted using da_tree_fit_?, it can be used to generate class probabilities on new data. This function returns the decision tree class probabilities in the array y_proba.
For each data point i and class j, the (i,j) element of y_proba will contain the class probability according to the decision tree, and the (i,j) element of X_test should contain the feature j for observation i.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_decision_tree.
n_samples – [in] number of observations in X_test.
n_features – [in] number of features in X_test.
X_test – [in] array containing the n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.
ldx_test – [in] leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.
y_proba – [out] array of size at least n_samples \(\times\) n_class. On output, will contain the predicted class probabilities.
n_class – [in] number of classes in y_proba.
ldy – [in] leading dimension of y_proba. Constraint: ldy \(\ge\) n_samples if X_test is stored in column-major order, or ldy \(\ge\) n_class if X_test is stored in row-major order.
- Returns:
da_status
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_out_of_date - the model has not been trained yet.
da_status_invalid_leading_dimension - one of the constraints on ldx_test or ldy was violated.
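Because y_proba is a flat array with its own leading dimension ldy, reading the probabilities for one observation requires the same addressing as for X_test. The sketch below, in Python for illustration only, regroups such a flat buffer into one list of class probabilities per observation. The helper name and the sample values are hypothetical, and the addressing follows the standard leading-dimension convention rather than a formula quoted from the library.

```python
def rows_from_flat(y_proba, n_samples, n_class, ldy, order="column-major"):
    """Regroup a flat probability buffer into one list per observation."""
    if order == "column-major":
        if ldy < n_samples:
            raise ValueError("ldy must be at least n_samples")
        return [[y_proba[i + j * ldy] for j in range(n_class)]
                for i in range(n_samples)]
    if order == "row-major":
        if ldy < n_class:
            raise ValueError("ldy must be at least n_class")
        return [[y_proba[i * ldy + j] for j in range(n_class)]
                for i in range(n_samples)]
    raise ValueError("order must be 'column-major' or 'row-major'")


# 2 observations x 2 classes, column-major, ldy = 2 (illustrative values).
flat = [0.9, 0.2,   # probabilities of class 0 for observations 0 and 1
        0.1, 0.8]   # probabilities of class 1 for observations 0 and 1
rows = rows_from_flat(flat, n_samples=2, n_class=2, ldy=2)
print(rows)  # [[0.9, 0.1], [0.2, 0.8]]
```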
-
da_status da_tree_predict_log_proba_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, float *y_log_proba, da_int n_class, da_int ldy)#
-
da_status da_tree_predict_log_proba_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, double *y_log_proba, da_int n_class, da_int ldy)#
Generate class log probabilities using the fitted decision tree on a new set of data X_test.
After a model has been fitted using da_tree_fit_?, it can be used to generate class log probabilities on new data. This function returns the decision tree class log probabilities in the array y_log_proba.
For each data point i and class j, the (i,j) element of y_log_proba will contain the class log probability according to the decision tree, and the (i,j) element of X_test should contain the feature j for observation i.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_decision_tree.
n_samples – [in] number of observations in X_test.
n_features – [in] number of features in X_test.
X_test – [in] array containing the n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.
ldx_test – [in] leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.
y_log_proba – [out] array of size at least n_samples \(\times\) n_class. On output, will contain the predicted class log probabilities.
n_class – [in] number of classes in y_log_proba.
ldy – [in] leading dimension of y_log_proba. Constraint: ldy \(\ge\) n_samples if X_test is stored in column-major order, or ldy \(\ge\) n_class if X_test is stored in row-major order.
- Returns:
da_status
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_out_of_date - the model has not been trained yet.
da_status_invalid_leading_dimension - one of the constraints on ldx_test or ldy was violated.
-
da_status da_tree_score_s(da_handle handle, da_int n_samples, da_int n_features, const float *X_test, da_int ldx_test, const da_int *y_test, float *mean_accuracy)#
-
da_status da_tree_score_d(da_handle handle, da_int n_samples, da_int n_features, const double *X_test, da_int ldx_test, const da_int *y_test, double *mean_accuracy)#
Calculate score (prediction accuracy) by comparing predicted labels and actual labels on a new set of data X_test.
To be used after a model has been fitted using da_tree_fit_?.
For each data point i, y_test[i] will contain the label of the test data, and the (i,j) element of X_test should contain the feature j for observation i.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_decision_tree.
n_samples – [in] number of observations in X_test.
n_features – [in] number of features in X_test. It must match the number of features from the training data set.
X_test – [in] array containing the n_samples \(\times\) n_features data matrix, in the same storage format used to fit the model.
ldx_test – [in] leading dimension of X_test. Constraint: ldx_test \(\ge\) n_samples if X_test is stored in column-major order, or ldx_test \(\ge\) n_features if X_test is stored in row-major order.
y_test – [in] actual class labels.
mean_accuracy – [out] proportion of observations where the predicted label matches the actual label.
- Returns:
da_status
da_status_success - the operation was successfully completed.
da_status_wrong_type - the floating point precision of the arguments is incompatible with the handle initialization.
da_status_invalid_pointer - the handle has not been correctly initialized.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_out_of_date - the model has not been trained yet.
da_status_invalid_leading_dimension - the constraint on ldx_test was violated.