- class aoclda.clustering.kmeans(n_clusters=1, initialization_method='k-means++', C=None, n_init=10, max_iter=300, seed=-1, algorithm='elkan', tol=1.0e-4, check_data=False)
k-means clustering.
Partition a data matrix into clusters using k-means clustering.
- Parameters:
n_clusters (int, optional) – Number of clusters to form. Default=1.
initialization_method (str, optional) – The method used to find the initial cluster centres. It can take the values ‘k-means++’, ‘random’ (initial clusters are chosen randomly from the sample data points) or ‘random partitions’ (sample points are assigned to a random cluster and the corresponding cluster centres are computed and used as the starting point). Default: ‘k-means++’.
C (array-like, optional) – The matrix of initial cluster centres. It has shape (n_clusters, n_features). If supplied, these centres will be used as the starting point for the first iteration, otherwise the initialization method specified above will be used. Default = None.
n_init (int, optional) – Number of runs with different random seeds (ignored if you specify initial cluster centres). Default=10.
max_iter (int, optional) – Maximum number of iterations performed in each run of the k-means algorithm. Default=300.
seed (int, optional) – Seed for random number generation; set to -1 for non-deterministic results. Default=-1.
algorithm (str, optional) – The algorithm used to compute the clusters. It can take the values ‘elkan’, ‘lloyd’, ‘macqueen’ or ‘hartigan-wong’. Default = ‘elkan’.
tol (float, optional) – The convergence tolerance for the iterations. Default = 1.0e-4.
check_data (bool, optional) – Whether to check the data for NaNs. Default = False.
- property cluster_centres
The coordinates of the cluster centres.
- Type:
numpy.ndarray of shape (n_clusters, n_features)
- fit(A)
Computes k-means clusters for the supplied data matrix, optionally using the supplied centres as the starting point.
- Parameters:
A (array-like) – The data matrix with which to compute the k-means clusters. It has shape (n_samples, n_features).
- Returns:
Returns the instance itself.
- Return type:
self (object)
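A minimal usage sketch of the class described above, using only the constructor arguments, methods and properties listed in this reference. The data values are hypothetical, and it is an assumption that NumPy float arrays are accepted as the array-like inputs. The import is guarded so that the snippet does nothing when aoclda or NumPy is not installed.

```python
def run_kmeans_example():
    """Fit k-means on a tiny dataset; return None if aoclda or NumPy is missing."""
    try:
        import numpy as np
        from aoclda.clustering import kmeans
    except ImportError:
        return None

    # Two well-separated groups of 2-D points (hypothetical data).
    a = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

    # A fixed seed is used instead of the default -1 for repeatable results.
    km = kmeans(n_clusters=2, seed=42)
    km.fit(a)

    # New samples must have the same number of features as the fitted data.
    y = np.array([[1.1, 0.9], [7.9, 8.2]])
    return {
        "centres": km.cluster_centres,   # shape (n_clusters, n_features)
        "labels": km.labels,             # shape (n_samples, )
        "inertia": km.inertia,
        "n_iter": km.n_iter,
        "predicted": km.predict(y),      # shape (k_samples, )
        "distances": km.transform(y),    # shape (m_samples, n_clusters)
    }


result = run_kmeans_example()
if result is not None:
    print(result["labels"])
```

Because fit returns the instance itself, the construction and fitting steps could also be chained as `kmeans(n_clusters=2, seed=42).fit(a)`.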
- property inertia
The inertia (sum of the squared distance of each sample to its closest cluster centre).
- Type:
numpy.ndarray of shape (1, )
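The definition of inertia can be made concrete with a short pure-Python sketch. This is an illustration of the formula only, not the library's implementation; the helper names and the data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def inertia(samples, centres):
    """Sum over all samples of the squared distance to the closest centre."""
    return sum(min(squared_distance(s, c) for c in centres) for s in samples)


samples = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]
centres = [[0.0, 1.0], [10.0, 0.0]]

# The first two samples are each at squared distance 1 from the first centre,
# and the third sample coincides with the second centre.
print(inertia(samples, centres))  # 2.0
```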
- property labels
The label (i.e. which cluster) of each sample point in the data matrix.
- Type:
numpy.ndarray of shape (n_samples, )
- property n_clusters
The number of clusters found.
- Type:
int
- property n_features
The number of features in the data matrix.
- Type:
int
- property n_iter
The number of iterations performed in the k-means computation.
- Type:
int
- property n_samples
The number of samples in the data matrix used.
- Type:
int
- predict(Y)
Predict the cluster each sample in a data matrix belongs to.
For each sample in the data matrix Y, find the closest cluster centre out of the clusters previously computed in kmeans.fit.
- Parameters:
Y (array-like) – The data matrix whose samples are to be assigned to clusters. It has shape (k_samples, k_features). Note that k_features must match n_features, the number of features in the data matrix used in kmeans.fit.
- Returns:
The labels.
- Return type:
numpy.ndarray of shape (k_samples, )
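The nearest-centre rule behind predict can be sketched in pure Python as follows. This is not the library's code: the function name is hypothetical, and both the use of Euclidean distance and the zero-based label numbering are assumptions made for the illustration.

```python
def predict_labels(samples, centres):
    """Return, for each sample, the index of its closest centre."""
    n_features = len(centres[0])
    labels = []
    for s in samples:
        # The number of features must match that of the fitted centres.
        if len(s) != n_features:
            raise ValueError("sample has the wrong number of features")
        dists = [sum((si - ci) ** 2 for si, ci in zip(s, c)) for c in centres]
        labels.append(dists.index(min(dists)))
    return labels


centres = [[0.0, 0.0], [5.0, 5.0]]
new_points = [[1.0, 0.5], [4.0, 6.0], [2.4, 2.4]]
print(predict_labels(new_points, centres))  # [0, 1, 0]
```

Comparing squared distances is enough here, since taking the square root does not change which centre is closest.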
- transform(X)
Transform a data matrix into cluster distance space.
Transforms a data matrix X from the original coordinate system into the new coordinates in which each dimension is the distance to one of the cluster centres previously computed by kmeans.fit.
- Parameters:
X (array-like) – The data matrix to be transformed. It has shape (m_samples, m_features). Note that m_features must match n_features, the number of features in the data matrix originally supplied to kmeans.fit.
- Returns:
The transformed matrix.
- Return type:
numpy.ndarray of shape (m_samples, n_clusters)
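The shape of the transformed matrix follows directly from the definition: one row per sample and one column per cluster centre. A pure-Python sketch is given below, assuming the distance is the ordinary (unsquared) Euclidean distance; the function name and data are hypothetical and this is not the library's implementation.

```python
import math


def to_cluster_distance_space(samples, centres):
    """Map each sample to its list of distances to every centre."""
    return [[math.dist(s, c) for c in centres] for s in samples]


centres = [[0.0, 0.0], [3.0, 4.0]]
x = [[0.0, 0.0], [3.0, 0.0]]

# Two samples and two centres give a 2 x 2 result.
distances = to_cluster_distance_space(x, centres)
print(distances)  # [[0.0, 5.0], [3.0, 4.0]]
```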
- da_status da_kmeans_set_data_s(da_handle handle, da_int n_samples, da_int n_features, const float *A, da_int lda)
- da_status da_kmeans_set_data_d(da_handle handle, da_int n_samples, da_int n_features, const double *A, da_int lda)
Pass a data matrix to the da_handle object in preparation for k-means clustering.
The data itself is not copied; a pointer to the data matrix is stored instead.
After calling this function you may use the option setting APIs to set options.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_kmeans.
n_samples – [in] the number of rows of the data matrix, A. Constraint: n_samples \(\ge\) 1.
n_features – [in] the number of columns of the data matrix, A. Constraint: n_features \(\ge\) 1.
A – [in] the n_samples \(\times\) n_features data matrix. By default, it should be stored in column-major order, unless you have set the storage order option to row-major.
lda – [in] the leading dimension of the data matrix. Constraint: lda \(\ge\) n_samples if A is stored in column-major order, or lda \(\ge\) n_features if A is stored in row-major order.
- Returns:
da_status. The function returns:
da_status_success - the operation was successfully completed.
da_status_wrong_type - the handle may have been initialized with the wrong precision.
da_status_invalid_pointer - the handle has not been initialized, or A is null.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_incompatible_options - if you have already set the number of clusters and it is too high, then it will be reduced accordingly, and this warning returned.
da_status_invalid_leading_dimension - the constraint on lda was violated.
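The constraint on lda is easier to see with the index arithmetic written out. The sketch below, in Python rather than C, shows how element (i, j) of a matrix is located in a flat buffer for each storage order. The helper names are hypothetical and are not part of the library.

```python
def flat_index(i, j, ld, order="column-major"):
    """Position of element (i, j) in a flat buffer with leading dimension ld."""
    if order == "column-major":
        return i + j * ld  # consecutive rows of one column are adjacent
    if order == "row-major":
        return i * ld + j  # consecutive columns of one row are adjacent
    raise ValueError("order must be 'column-major' or 'row-major'")


def leading_dimension_ok(n_samples, n_features, ld, order="column-major"):
    """Check the constraint on the leading dimension stated above."""
    minimum = n_samples if order == "column-major" else n_features
    return ld >= minimum


# A 3 x 2 matrix stored column-major with lda = 4, so each column is
# followed by one unused padding entry (marked -1.0).
n_samples, n_features, lda = 3, 2, 4
buffer = [1.0, 2.0, 3.0, -1.0,
          4.0, 5.0, 6.0, -1.0]

second_row = [buffer[flat_index(1, j, lda)] for j in range(n_features)]
print(second_row)  # [2.0, 5.0]
```

A leading dimension larger than the minimum is what allows a submatrix of a bigger array to be passed without copying.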
- da_status da_kmeans_set_init_centres_d(da_handle handle, const double *C, da_int ldc)
Pass a matrix of initial cluster centres to the da_handle object in preparation for k-means clustering.
The data itself is not copied; a pointer to the data matrix is stored instead.
The matrix of initial clusters is not required if k-means++ or random initialization methods are used (see options).
Note, you must call da_kmeans_set_data_? prior to this function.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_kmeans.
C – [in] the n_clusters \(\times\) n_features matrix of initial centres. By default, it should be stored in column-major order, unless you have set the storage order option to row-major.
ldc – [in] the leading dimension of the data matrix. Constraint: ldc \(\ge\) n_clusters if C is stored in column-major order, or ldc \(\ge\) n_features if C is stored in row-major order. Make sure you set n_clusters using da_options_set_int first.
- Returns:
da_status. The function returns:
da_status_success - the operation was successfully completed.
da_status_no_data - the function da_kmeans_set_data_? has not been called.
da_status_wrong_type - the handle may have been initialized using the wrong precision.
da_status_invalid_pointer - the handle has not been initialized, or C is null.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_invalid_leading_dimension - the constraint on ldc was violated.
- da_status da_kmeans_compute_d(da_handle handle)
Compute k-means clustering.
Computes k-means clustering on the data matrix previously passed into the handle using da_kmeans_set_data_?.
- Parameters:
handle – [inout] a da_handle object, initialized with type da_handle_kmeans and with data passed in via da_kmeans_set_data_?.
- Returns:
da_status. The function returns:
da_status_success - the operation was successfully completed.
da_status_wrong_type - the handle may have been initialized using the wrong precision.
da_status_invalid_pointer - the handle has not been initialized.
da_status_no_data - da_kmeans_set_data_? has not been called prior to this function call, or the required initial cluster centres have not been provided.
da_status_internal_error - this can occur if your data contains undefined values.
da_status_incompatible_options - you can obtain further information using da_handle_print_error_message.
da_status_maxit - the iteration limit was reached without converging. The results may still be usable though.
- Post:
After successful execution, da_handle_get_result_? can be queried with the following enums for floating-point output:
da_kmeans_cluster_centres - return an array of size n_clusters \(\times\) n_features containing the coordinates of the cluster centres, in the same storage format as the input data.
da_rinfo - return an array of size 5 containing n_samples, n_features, n_clusters, n_iter (the number of iterations performed) and inertia (the sum of the squared distance of each sample to its closest cluster centre).
In addition da_handle_get_result_int can be queried with the following enum:
da_kmeans_labels - return an array of size n_samples containing the label (i.e. which cluster it is in) of each sample point.
- da_status da_kmeans_transform_s(da_handle handle, da_int m_samples, da_int m_features, const float *X, da_int ldx, float *X_transform, da_int ldx_transform)
- da_status da_kmeans_transform_d(da_handle handle, da_int m_samples, da_int m_features, const double *X, da_int ldx, double *X_transform, da_int ldx_transform)
Transform a data matrix into the cluster distance space.
Transforms a data matrix X from the original coordinate system into the new coordinates in which each dimension is the distance to one of the cluster centres previously computed in da_kmeans_compute_?.
- Parameters:
handle – [inout] a da_handle object, with k-means clusters previously computed via da_kmeans_compute_?.
m_samples – [in] the number of rows of the data matrix, X. Constraint: m_samples \(\ge\) 1.
m_features – [in] the number of columns of the data matrix, X. Constraint: m_features \(=\) n_features, the number of features in the data matrix originally supplied to da_kmeans_set_data_?.
X – [in] the m_samples \(\times\) m_features data matrix, in the same storage format used to fit the model.
ldx – [in] the leading dimension of the data matrix. Constraint: ldx \(\ge\) m_samples if X is stored in column-major order, or ldx \(\ge\) m_features if X is stored in row-major order.
X_transform – [out] an array of size at least m_samples \(\times\) n_clusters, in which the transformed data will be stored.
ldx_transform – [in] the leading dimension of X_transform. Constraint: ldx_transform \(\ge\) m_samples if X is stored in column-major order, or ldx_transform \(\ge\) n_clusters if X is stored in row-major order.
- Returns:
da_status. The function returns:
da_status_success - the operation was successfully completed.
da_status_wrong_type - the handle may have been initialized using the wrong precision.
da_status_invalid_pointer - the handle has not been initialized, or one of the arrays is null.
da_status_no_data - the k-means clusters have not been computed prior to this function call.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_invalid_leading_dimension - one of the constraints on ldx or ldx_transform was violated.
- da_status da_kmeans_predict_s(da_handle handle, da_int k_samples, da_int k_features, const float *Y, da_int ldy, da_int *Y_labels)
- da_status da_kmeans_predict_d(da_handle handle, da_int k_samples, da_int k_features, const double *Y, da_int ldy, da_int *Y_labels)
Predict the cluster each sample in a data matrix belongs to.
For each sample in the data matrix Y, find the closest cluster centre out of the clusters previously computed in da_kmeans_compute_?.
- Parameters:
handle – [inout] a da_handle object, with k-means clusters previously computed via da_kmeans_compute_?.
k_samples – [in] the number of rows of the data matrix, Y. Constraint: k_samples \(\ge\) 1.
k_features – [in] the number of columns of the data matrix, Y. Constraint: k_features \(=\) n_features, the number of features in the data matrix originally supplied to da_kmeans_set_data_?.
Y – [in] the k_samples \(\times\) k_features data matrix, in the same storage format used to fit the model.
ldy – [in] the leading dimension of the data matrix. Constraint: ldy \(\ge\) k_samples if Y is stored in column-major order, or ldy \(\ge\) k_features if Y is stored in row-major order.
Y_labels – [out] an array of size at least k_samples, in which the labels will be stored.
- Returns:
da_status. The function returns:
da_status_success - the operation was successfully completed.
da_status_wrong_type - the handle may have been initialized using the wrong precision.
da_status_invalid_pointer - the handle has not been initialized, or one of the arrays is null.
da_status_no_data - the k-means clustering has not been computed prior to this function call.
da_status_invalid_input - one of the arguments had an invalid value. You can obtain further information using da_handle_print_error_message.
da_status_invalid_leading_dimension - the constraint on ldy was violated.