k-means - 5.2 English - 68552

AOCL API Guide (68552)

Document ID: 68552
Release Date: 2025-12-29
Version: 5.2 English
class aoclda.clustering.kmeans(n_clusters=1, initialization_method='k-means++', C=None, n_init=10, max_iter=300, seed=-1, algorithm='elkan', tol=1.0e-4, check_data=False)

k-means clustering.

Partition a data matrix into clusters using k-means clustering.

Parameters:
  • n_clusters (int, optional) – Number of clusters to form. Default=1.

  • initialization_method (str, optional) – The method used to find the initial cluster centres. It can take the values ‘k-means++’, ‘random’ (initial cluster centres are chosen randomly from the sample data points) or ‘random partitions’ (each sample point is assigned to a random cluster, and the resulting cluster centres are computed and used as the starting point). Default: ‘k-means++’.

  • C (array-like, optional) – The matrix of initial cluster centres. It has shape (n_clusters, n_features). If supplied, these centres will be used as the starting point for the first iteration, otherwise the initialization method specified above will be used. Default = None.

  • n_init (int, optional) – Number of runs with different random seeds (ignored if you specify initial cluster centres). Default=10.

  • max_iter (int, optional) – Maximum number of iterations to perform in each run. Default=300.

  • seed (int, optional) – Seed for random number generation; set to -1 for non-deterministic results. Default=-1.

  • algorithm (str, optional) – The algorithm used to compute the clusters. It can take the values ‘elkan’, ‘lloyd’, ‘macqueen’ or ‘hartigan-wong’. Default = ‘elkan’.

  • tol (float, optional) – The convergence tolerance for the iterations. Default = 1.0e-4.

  • check_data (bool, optional) – Whether to check the data for NaNs. Default = False.
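
The ‘random’ and ‘random partitions’ initialization methods described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names init_random and init_random_partitions are hypothetical, and the handling of an empty random partition (falling back to a random sample point) is an assumption made for the sketch.

```python
import random


def init_random(points, n_clusters, rng):
    """'random': pick n_clusters distinct sample points as the initial centres."""
    return [list(p) for p in rng.sample(points, n_clusters)]


def init_random_partitions(points, n_clusters, rng):
    """'random partitions': assign each point to a random cluster, then
    use the mean of each cluster as its initial centre."""
    n_features = len(points[0])
    sums = [[0.0] * n_features for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for p in points:
        k = rng.randrange(n_clusters)
        counts[k] += 1
        for j, x in enumerate(p):
            sums[k][j] += x
    centres = []
    for k in range(n_clusters):
        if counts[k] == 0:
            # Assumption: an empty partition falls back to a random sample point.
            centres.append(list(rng.choice(points)))
        else:
            centres.append([s / counts[k] for s in sums[k]])
    return centres


rng = random.Random(0)  # fixed seed so the sketch is reproducible
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centres_random = init_random(points, 2, rng)
centres_partition = init_random_partitions(points, 2, rng)
```

With ‘random’, every initial centre coincides with a sample point; with ‘random partitions’, the initial centres are averages and so tend to start closer to the middle of the data.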

property cluster_centres

The coordinates of the cluster centres.

Type:

numpy.ndarray of shape (n_clusters, n_features)

fit(A)

Computes k-means clusters for the supplied data matrix, optionally using the supplied centres as the starting point.

Parameters:

A (array-like) – The data matrix with which to compute the k-means clusters. It has shape (n_samples, n_features).

Returns:

Returns the instance itself.

Return type:

self (object)
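
To make the roles of max_iter and tol concrete, here is a minimal sketch of textbook Lloyd iterations in plain Python. It is not the aoclda implementation: the function name lloyd is hypothetical, and both the stopping rule (no centre moves by more than tol) and the treatment of empty clusters are assumptions made for illustration.

```python
import math


def lloyd(points, centres, max_iter=300, tol=1.0e-4):
    """Textbook Lloyd iterations; returns (centres, labels, n_iter)."""
    centres = [list(c) for c in centres]
    labels = [0] * len(points)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Assignment step: each sample goes to its closest centre.
        labels = [
            min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its samples.
        new_centres = []
        for k, old in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centres.append(old)  # assumption: keep an empty cluster's centre
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift <= tol:  # assumption: stop once no centre moves more than tol
            break
    return centres, labels, n_iter


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres, labels, n_iter = lloyd(points, [(0.0, 0.0), (10.0, 0.0)])
```

In this small example the centres settle at (0, 1) and (10, 1) after two iterations: the first moves them, the second detects that nothing changed.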

property inertia

The inertia (sum of the squared distance of each sample to its closest cluster centre).

Type:

numpy.ndarray of shape (1, )
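
The definition of inertia above can be written out directly. The following plain-Python sketch (the function name inertia and the sample data are hypothetical) computes the sum of squared Euclidean distances from each sample to its closest centre.

```python
def inertia(points, centres):
    """Sum over samples of the squared distance to the closest centre."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, centre)) for centre in centres
        )
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres = [(0.0, 1.0), (10.0, 1.0)]
value = inertia(points, centres)  # each sample is at squared distance 1.0
```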

property labels

The label (i.e. which cluster) of each sample point in the data matrix.

Type:

numpy.ndarray of shape (n_samples, )

property n_clusters

The number of clusters found.

Type:

int

property n_features

The number of features in the data matrix.

Type:

int

property n_iter

The number of iterations performed in the k-means computation.

Type:

int

property n_samples

The number of samples in the data matrix used.

Type:

int

predict(Y)

Predict the cluster each sample in a data matrix belongs to.

For each sample in the data matrix Y, find the closest cluster centre among those previously computed in kmeans.fit.

Parameters:

Y (array-like) – The data matrix whose samples are to be assigned to clusters. It has shape (k_samples, k_features). Note that k_features must match n_features, the number of features in the data matrix used in kmeans.fit.

Returns:

The labels.

Return type:

numpy.ndarray of shape (k_samples, )
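
The prediction step amounts to a nearest-centre search. Below is a minimal plain-Python sketch, assuming Euclidean distance; the standalone function predict and the example centres are hypothetical and are not the library's code.

```python
import math


def predict(samples, centres):
    """Return, for each sample, the index of its closest centre."""
    n_features = len(centres[0])
    labels = []
    for s in samples:
        if len(s) != n_features:
            # Mirrors the documented constraint k_features == n_features.
            raise ValueError("k_features must match n_features")
        # Assumption: ties are broken in favour of the lowest cluster index.
        labels.append(
            min(range(len(centres)), key=lambda k: math.dist(s, centres[k]))
        )
    return labels


centres = [(0.0, 1.0), (10.0, 1.0)]
Y = [(1.0, 1.0), (9.0, 0.0), (4.0, 1.0)]
Y_labels = predict(Y, centres)
```

The result has one label per row of Y, matching the documented return shape (k_samples, ).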

transform(X)

Transform a data matrix into cluster distance space.

Transforms a data matrix X from the original coordinate system into new coordinates in which each dimension is the distance to one of the cluster centres previously computed by kmeans.fit.

Parameters:

X (array-like) – The data matrix to be transformed. It has shape (m_samples, m_features). Note that m_features must match n_features, the number of features in the data matrix originally supplied to kmeans.fit.

Returns:

The transformed matrix.

Return type:

numpy.ndarray of shape (m_samples, n_clusters)
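
As an illustration of the cluster distance space, the sketch below builds the (m_samples, n_clusters) matrix by hand in plain Python. It assumes Euclidean distance, which the text does not state explicitly, and the standalone function transform is hypothetical.

```python
import math


def transform(samples, centres):
    """Row i, column k holds the distance from sample i to centre k."""
    return [[math.dist(s, c) for c in centres] for s in samples]


centres = [(0.0, 0.0), (3.0, 4.0)]
X = [(0.0, 0.0), (3.0, 0.0)]
X_transform = transform(X, centres)
```

Each row of X_transform has n_clusters entries regardless of how many features the original data had.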

da_status da_kmeans_set_data_s(da_handle handle, da_int n_samples, da_int n_features, const float *A, da_int lda)
da_status da_kmeans_set_data_d(da_handle handle, da_int n_samples, da_int n_features, const double *A, da_int lda)

Pass a data matrix to the da_handle object in preparation for k-means clustering.

The data itself is not copied; a pointer to the data matrix is stored instead.

After calling this function you may use the option setting APIs to set options.

Parameters:
  • handle[inout] a da_handle object, initialized with type da_handle_kmeans.

  • n_samples[in] the number of rows of the data matrix, A. Constraint: n_samples \(\ge\) 1.

  • n_features[in] the number of columns of the data matrix, A. Constraint: n_features \(\ge\) 1.

  • A[in] the n_samples \(\times\) n_features data matrix. By default, it should be stored in column-major order, unless you have set the storage order option to row-major.

  • lda[in] the leading dimension of the data matrix. Constraint: lda \(\ge\) n_samples if A is stored in column-major order, or lda \(\ge\) n_features if A is stored in row-major order.
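
The leading-dimension constraints above follow from how a two-dimensional matrix is laid out in a flat array. The sketch below uses Python lists purely to show the index arithmetic; the helper element_offset is hypothetical, and None marks padding slots that are never read.

```python
def element_offset(i, j, ld, order="column-major"):
    """Offset of element (i, j) in a flat array with leading dimension ld."""
    if order == "column-major":
        return i + j * ld
    if order == "row-major":
        return i * ld + j
    raise ValueError(f"unknown order: {order!r}")


n_samples, n_features = 3, 2

# Column-major storage with lda = 4 > n_samples: one padding slot per column.
A_col = [1.0, 2.0, 3.0, None,   # column 0
         4.0, 5.0, 6.0, None]   # column 1

# Row-major storage of the same matrix with lda = n_features = 2.
A_row = [1.0, 4.0,
         2.0, 5.0,
         3.0, 6.0]

# Both layouts describe the same 3 x 2 matrix.
same = all(
    A_col[element_offset(i, j, 4, "column-major")]
    == A_row[element_offset(i, j, 2, "row-major")]
    for i in range(n_samples)
    for j in range(n_features)
)
```

A leading dimension smaller than the stated minimum would make consecutive columns (or rows) overlap, which is why the constraint depends on the storage order.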

Returns:

da_status. The function returns:

da_status da_kmeans_set_init_centres_s(da_handle handle, const float *C, da_int ldc)
da_status da_kmeans_set_init_centres_d(da_handle handle, const double *C, da_int ldc)

Pass a matrix of initial cluster centres to the da_handle object in preparation for k-means clustering.

The data itself is not copied; a pointer to the data matrix is stored instead.

The matrix of initial cluster centres is not required if the k-means++ or random initialization methods are used (see options).

Note that you must call da_kmeans_set_data_? before calling this function.

Parameters:
  • handle[inout] a da_handle object, initialized with type da_handle_kmeans.

  • C[in] the n_clusters \(\times\) n_features matrix of initial centres. By default, it should be stored in column-major order, unless you have set the storage order option to row-major.

  • ldc[in] the leading dimension of the matrix of initial centres, C. Constraint: ldc \(\ge\) n_clusters if C is stored in column-major order, or ldc \(\ge\) n_features if C is stored in row-major order. Make sure you set n_clusters using da_options_set_int first.

Returns:

da_status. The function returns:

da_status da_kmeans_compute_s(da_handle handle)
da_status da_kmeans_compute_d(da_handle handle)

Compute k-means clustering.

Computes k-means clustering on the data matrix previously passed into the handle using da_kmeans_set_data_?.

Parameters:

handle[inout] a da_handle object, initialized with type da_handle_kmeans and with data passed in via da_kmeans_set_data_?.

Returns:

da_status. The function returns:

Post:

After successful execution, da_handle_get_result_? can be queried with the following enums for floating-point output:

  • da_kmeans_cluster_centres - return an array of size n_clusters \(\times\) n_features containing the coordinates of the cluster centres, in the same storage format as the input data.

  • da_rinfo - return an array of size 5 containing n_samples, n_features, n_clusters, n_iter (the number of iterations performed) and inertia (the sum of the squared distance of each sample to its closest cluster centre).

In addition, da_handle_get_result_int can be queried with the following enum:

  • da_kmeans_labels - return an array of size n_samples containing the label (i.e. which cluster it is in) of each sample point.
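
Because da_rinfo packs five quantities into one floating-point array, it can help to give them names after retrieval. The sketch below is plain Python with made-up values; the KMeansInfo type and unpack_rinfo helper are hypothetical, and it assumes the entries appear in the order listed above.

```python
from typing import NamedTuple


class KMeansInfo(NamedTuple):
    n_samples: int
    n_features: int
    n_clusters: int
    n_iter: int
    inertia: float


def unpack_rinfo(rinfo):
    """Name the five entries of a da_rinfo-style array (assumed order)."""
    if len(rinfo) != 5:
        raise ValueError("expected an array of size 5")
    # The counts arrive as floating-point values, so convert them to int.
    return KMeansInfo(
        int(rinfo[0]), int(rinfo[1]), int(rinfo[2]), int(rinfo[3]), float(rinfo[4])
    )


rinfo = [4.0, 2.0, 2.0, 2.0, 4.0]  # made-up values for illustration
info = unpack_rinfo(rinfo)
```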

da_status da_kmeans_transform_s(da_handle handle, da_int m_samples, da_int m_features, const float *X, da_int ldx, float *X_transform, da_int ldx_transform)
da_status da_kmeans_transform_d(da_handle handle, da_int m_samples, da_int m_features, const double *X, da_int ldx, double *X_transform, da_int ldx_transform)

Transform a data matrix into the cluster distance space.

Transforms a data matrix X from the original coordinate system into new coordinates in which each dimension is the distance to one of the cluster centres previously computed in da_kmeans_compute_?.

Parameters:
  • handle[inout] a da_handle object, with k-means clusters previously computed via da_kmeans_compute_?.

  • m_samples[in] the number of rows of the data matrix, X. Constraint: m_samples \(\ge\) 1.

  • m_features[in] the number of columns of the data matrix, X. Constraint: m_features \(=\) n_features, the number of features in the data matrix originally supplied to da_kmeans_set_data_?.

  • X[in] the m_samples \(\times\) m_features data matrix, in the same storage format used to fit the model.

  • ldx[in] the leading dimension of the data matrix. Constraint: ldx \(\ge\) m_samples if X is stored in column-major order, or ldx \(\ge\) m_features if X is stored in row-major order.

  • X_transform[out] an array of size at least m_samples \(\times\) n_clusters, in which the transformed data will be stored.

  • ldx_transform[in] the leading dimension of X_transform. Constraint: ldx_transform \(\ge\) m_samples if X is stored in column-major order, or ldx_transform \(\ge\) n_clusters if X is stored in row-major order.

Returns:

da_status. The function returns:

da_status da_kmeans_predict_s(da_handle handle, da_int k_samples, da_int k_features, const float *Y, da_int ldy, da_int *Y_labels)
da_status da_kmeans_predict_d(da_handle handle, da_int k_samples, da_int k_features, const double *Y, da_int ldy, da_int *Y_labels)

Predict the cluster each sample in a data matrix belongs to.

For each sample in the data matrix Y, find the closest cluster centre among those previously computed in da_kmeans_compute_?.

Parameters:
  • handle[inout] a da_handle object, with k-means clusters previously computed via da_kmeans_compute_?.

  • k_samples[in] the number of rows of the data matrix, Y. Constraint: k_samples \(\ge\) 1.

  • k_features[in] the number of columns of the data matrix, Y. Constraint: k_features \(=\) n_features, the number of features in the data matrix originally supplied to da_kmeans_set_data_?.

  • Y[in] the k_samples \(\times\) k_features data matrix, in the same storage format used to fit the model.

  • ldy[in] the leading dimension of the data matrix. Constraint: ldy \(\ge\) k_samples if Y is stored in column-major order, or ldy \(\ge\) k_features if Y is stored in row-major order.

  • Y_labels[out] an array of size at least k_samples, in which the labels will be stored.

Returns:

da_status. The function returns: