Clustering: K-means++, KMeans Parameters, and Mini Batch K-Means
1.1 Introduction to k-means

Advantages of k-means:

1. The algorithm is fast and simple;

2. For large data sets, it is efficient and scalable;

3. The time complexity is close to linear, which makes it suitable for mining large-scale data sets. The time complexity of the K-Means clustering algorithm is O(n×k×t), where n is the number of objects in the data set, k is the number of clusters, and t is the number of iterations. In the worst case, the computational complexity is O(n^(k+2/p)), where n is the sample size and p is the number of features.

Note that in practice k-means is very fast and is among the fastest practical clustering algorithms. However, its solution is only a local optimum determined by a particular set of initial values. Therefore, to make the result more accurate and reliable, the algorithm is usually run several times with different initial values, as the sketch below illustrates.
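A minimal sketch of this behavior, assuming the scikit-learn KMeans estimator documented in the next section; make_blobs is only used here to fabricate example data and is not part of the original text. Each single run with a different random initialization can converge to a different local optimum, so the run with the smallest inertia is kept.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

inertias = []
for seed in range(5):
    # n_init=1 so each run uses exactly one random initialization
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed)
    km.fit(X)
    inertias.append(km.inertia_)

print(inertias)       # the runs often differ: each is only a local optimum
print(min(inertias))  # in practice, the best (smallest-inertia) run is kept
```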

Disadvantages of k-means:

1. In the K-means algorithm, K must be given in advance, and choosing this value of K is difficult. In many cases it is not known beforehand how many categories a given data set should be divided into.

2. The K-means algorithm first determines an initial partition based on the initial cluster centers and then optimizes that partition. The choice of initial cluster centers has a great influence on the clustering result: if the initial values are chosen poorly, an effective clustering may never be obtained. This is a major weakness of K-means.

3. From the structure of the K-means algorithm it can be seen that the sample assignments must be adjusted repeatedly and the cluster centers recomputed after each adjustment, so when the data volume is very large the time cost becomes substantial. The time complexity of the algorithm therefore needs to be analyzed and improved to broaden its range of application.

The choice of initial cluster centers can be addressed by **k-means++**.

1.2 KMeans() Parameters

Parameters:

n_clusters: int, default=8. The number of clusters to form, i.e., the number of centroids to generate.

max_iter: int, default=300. The maximum number of iterations of the k-means algorithm for a single run.

n_init: int, default=10. The number of times the algorithm is run with different centroid initializations; the final result is the best output in terms of inertia.

init: one of "k-means++", "random", or an ndarray.

This parameter specifies the initialization method; the default is "k-means++".

(1) "k-means++" selects the initial centroids in a special way that speeds up convergence of the iterative process (this is the k-means++ mentioned above).

(2) "random" chooses the initial centroids at random from the training data.

(3) If an ndarray is passed, it should have shape (n_clusters, n_features) and gives the initial centroids directly.

precompute_distances: "auto", True, or False.

Precomputing distances is faster but uses more memory.

(1) "auto": do not precompute distances if the number of samples multiplied by the number of clusters exceeds 12 million. With double precision this corresponds to roughly 100 MB of overhead per job.

(2) True: always precompute distances.

(3) False: never precompute distances.

tol: float, default=1e-4. The tolerance with respect to inertia used to declare convergence.

n_jobs: int. The number of processes used for the computation; internally, the n_init runs are executed in parallel.

(1) If -1, all CPUs are used. If 1, no parallel code is run, which is convenient for debugging.

(2) If less than -1, the number of CPUs used is (n_cpus + 1 + n_jobs). For example, with n_jobs = -2, all CPUs but one are used.

random_state: int or numpy.RandomState.

The generator used to initialize the centroids. If an integer is given, it fixes the random seed. By default, the global numpy random number generator is used.

copy_x: bool, default=True.

When distances are precomputed, it is numerically more accurate to center the data first. If this parameter is True, the original data are not modified. If False, the original data are modified in place and restored before the function returns, but slight numerical differences from the original data may remain because the data mean is subtracted and added back during the computation.
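The following is a hedged sketch of constructing a KMeans estimator with the parameters described above; the particular values are arbitrary illustrative choices. Note that precompute_distances and n_jobs are only accepted by older scikit-learn releases and have been removed from recent versions, so they are shown commented out.

```python
from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=8,        # number of clusters / centroids to generate
    init="k-means++",    # "k-means++", "random", or an ndarray of initial centroids
    n_init=10,           # number of runs with different centroid initializations
    max_iter=300,        # maximum iterations per run
    tol=1e-4,            # convergence tolerance on inertia
    random_state=42,     # seed for centroid initialization
    copy_x=True,         # keep the input data unmodified
    # precompute_distances="auto",  # removed in newer scikit-learn releases
    # n_jobs=-1,                    # removed in newer scikit-learn releases
)
```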

Attributes:

cluster_centers_: array of shape [n_clusters, n_features], the coordinates of the cluster centers.

labels_: the cluster assignment of each point.

inertia_: float, the sum of squared distances from each point to the centroid of its cluster.

Methods:

fit(X[, y]): compute the K-means clustering.

fit_predict(X[, y]): compute the cluster centroids and predict the cluster index of each sample.

fit_transform(X[, y]): compute the clustering and transform X into the cluster-distance space.

get_params([deep]): get the parameters of the estimator.

predict(X): for each sample, estimate the closest cluster.

score(X[, y]): return the opposite of the K-means objective (the clustering error) on X.

set_params(**params): set the parameters of this estimator manually.

transform(X[, y]): transform X into the cluster-distance space. In the new space, each dimension is the distance to one of the cluster centers. Note that even if X is sparse, the array returned by transform is typically dense.
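A minimal usage sketch exercising the attributes and methods listed above; the synthetic data from make_blobs and the specific query points are illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, n_features=2, random_state=0)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)

labels = km.fit_predict(X)        # fit the model and return each sample's cluster index
print(km.cluster_centers_.shape)  # (n_clusters, n_features)
print(km.labels_[:10])            # cluster assignment of the first ten samples
print(km.inertia_)                # sum of squared distances to the closest centroid

new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
print(km.predict(new_points))     # closest cluster for each new sample
print(km.transform(new_points))   # distances from each new sample to every center
print(km.score(new_points))       # opposite of the K-means objective on these points
```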

The basic idea of the k-means++ algorithm when selecting the initial seeds is that the initial cluster centers should be as far away from each other as possible.

2.1 Algorithm Steps

(1) Randomly select one point from the input data set as the first cluster center.

(2) For each point x in the data set, compute the distance D(x) between x and the nearest of the cluster centers chosen so far.

(3) Select a new data point as the next cluster center, where points with a larger D(x) are more likely to be chosen.

(4) Repeat steps 2 and 3 until k cluster centers have been selected.

(5) Run the standard k-means algorithm using these k initial cluster centers.

As can be seen from the description above, the key to the algorithm is the third step: how to turn D(x) into the probability that a point is selected. One way to implement this is as follows:

(1) First, choose a random point from the data set as the first "seed point".

(2) For each point, compute the distance D(x) to the nearest "seed point", store these values in an array, and add them up to obtain Sum(D(x)).

(3) Then take a random value Random, drawn uniformly from [0, Sum(D(x))), and pick the next "seed point" by weight: walk through the array, repeatedly taking Random -= D(x), until Random becomes non-positive; the point at which this happens is the next seed point.

(4) Repeat steps 2 and 3 until k cluster centers have been selected.

(5) Run the standard k-means algorithm using these k initial cluster centers.

It can be seen that the selection method in step 3 ensures that points with a larger D(x) are more likely to be chosen as cluster centers.
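Below is a plain NumPy sketch of the seeding procedure described above. Note that the standard k-means++ algorithm (and scikit-learn's implementation, which additionally tries several candidates per step) weights points by the squared distance D(x)²; the sketch follows that convention. The helper name kmeans_pp_init is hypothetical and introduced here only for illustration.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Pick k initial centers from X using the D(x)^2-weighted roulette selection."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                 # step 1: one random first seed
    for _ in range(1, k):
        c = np.asarray(centers)                    # centers chosen so far, shape (m, n_features)
        # step 2: squared distance from every point to its nearest chosen center
        d2 = np.min(((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # step 3: roulette selection - draw r in [0, Sum(D(x)^2)) and walk the array,
        # subtracting D(x)^2 until r drops to zero or below; that point is the next seed
        r = rng.random() * d2.sum()
        for i, w in enumerate(d2):
            r -= w
            if r <= 0:
                centers.append(X[i])
                break
    return np.array(centers)

# The returned seeds can then be handed to the standard k-means run, e.g.
# KMeans(n_clusters=k, init=kmeans_pp_init(X, k, rng=0), n_init=1).fit(X)
```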

3.1 Introduction to Mini Batch K-Means

In the standard K-Means algorithm, the distances from all sample points to all centroids must be computed. If the sample size is very large, for example more than 100,000 samples with more than 100 features, the traditional K-Means algorithm is very time-consuming, even with the elkan K-Means optimization. Such scenarios are increasingly common in the era of big data, and this is where Mini Batch K-Means comes in.

As the name implies, Mini Batch K-Means runs traditional K-Means on only a subset of the samples, which avoids the computational burden of a very large sample size and greatly speeds up convergence. The price, of course, is that the clustering accuracy is somewhat reduced, but in general the reduction is within an acceptable range.

In Mini Batch K-Means, an appropriate batch size is chosen, and only that batch of samples is used for the K-Means clustering. How is this batch obtained? Generally by random sampling without replacement.

To improve the accuracy of the algorithm, the Mini Batch K-Means algorithm is usually run several times, each time clustering a different random sample, and the best clustering is chosen.
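A minimal sketch, assuming scikit-learn's MiniBatchKMeans, comparing it with full-batch KMeans on a larger synthetic data set; the data set size and batch_size are arbitrary illustrative choices.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic "large" data set for illustration only
X, _ = make_blobs(n_samples=100_000, centers=10, n_features=20, random_state=0)

full = KMeans(n_clusters=10, n_init=3, random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=10, batch_size=1024, n_init=3, random_state=0).fit(X)

# Mini Batch K-Means trains much faster; its inertia is usually only slightly worse.
print("full-batch inertia:", full.inertia_)
print("mini-batch inertia:", mini.inertia_)
```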