0. Image clustering
0.1 What is image clustering?
Clustering is a widely used exploratory data analysis technique. Intuitively, clustering is the task of grouping objects so that similar objects fall into the same category and dissimilar objects fall into different categories.
When the objects being clustered are images, the task is called image clustering.
For a more detailed introduction, see reference [1] at the end of the article.
Image clustering groups the images in a given set by the similarity of their content, without any prior knowledge, so that the resulting groups have high intra-class similarity and low inter-class similarity. The Chinese saying "birds of a feather flock together" captures the idea.
0.2 Classification of clustering algorithms
As mentioned earlier, clustering gathers objects with high similarity into one category, so how to measure the similarity between objects is the key issue.
According to the similarity scale they use, clustering methods can be divided into three types:
- Distance-based clustering: uses various distance measures to quantify the similarity between data objects
- Density-based clustering: partitions the data according to an appropriate density function
- Connectivity-based clustering: usually built on a graph or hypergraph model, grouping highly connected objects into one category
The rest of this article focuses on the K-means clustering method.
1. K-means clustering algorithm
K-means is a distance-based clustering algorithm, also written as the K-averages method, and is often called Lloyd's algorithm. It iteratively assigns each point in the data set to the cluster whose center is closest to it, where distance means the distance from the data point to the cluster center.
1.1 Basic principles of K-means
The idea of K-means is very simple: given a sample set, divide the samples into K clusters according to the distances between them, so that the points inside each cluster are as close together as possible while the clusters themselves are as far apart as possible.
Steps of K-means:
- Randomly initialize k cluster center coordinates
- Compute the distance from every object in the data set to each of the k cluster centers, and assign each data point to its nearest cluster
- For each cluster, recompute the centroid, i.e. the mean of the coordinates of the points currently in the cluster
- Repeat steps 2 and 3 until convergence
The termination conditions are:
- No data point is reassigned to a different cluster
- The maximum number of iterations is reached
- All cluster centers move less than some threshold
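The steps and termination test above can be sketched in a few lines of NumPy. This is a minimal illustration with random initialization and a hypothetical helper name, not a production implementation:

```python
import numpy as np

def kmeans_simple(X, k, max_iter=100, seed=0):
    """Minimal K-means: random init, then Lloyd iterations until no
    point changes cluster or max_iter is reached."""
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # step 2: assign every point to its nearest center (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: no point changed cluster
        labels = new_labels
        # step 3: move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# two well-separated blobs; K-means recovers them
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
centers, labels = kmeans_simple(X, 2)
```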
Problems with K-means:
- Greedy algorithm: it often gets stuck in a local optimum, so both the choice of the number of cluster centers K and the choice of the initial points matter.
- Sensitivity to noise and outliers: the algorithm cannot tell which points are noise or outliers; every data point is assigned to exactly one category. Outliers therefore pull the sample centroids away, causing misassignments or looser clusters.
- Effect of cluster shape: K-means works well on convex data, which it divides into roughly spherical clusters, but it can do nothing with non-convex shapes such as ring-shaped data. (The original figure, omitted here, compares the clustering results of K-means on both kinds of data.)
1.2 Parameters in K-means
1. The K value
The number of cluster centers K must be given in advance, and its choice directly affects the final clustering result.
Selection methods:
- Elbow method: plot the loss as a function of K and pick the value of K at the inflection point; that is, try several values and take the turning point where the clustering criterion stops improving noticeably.
- Empirical selection: based on experience, first pick a few candidate values of K, initialize the centers randomly several times for each, and keep the value that works best.
In practice K is usually chosen empirically, because the inflection point is often not obvious and scanning many values is too inefficient.
Elbow method
The elbow method uses the trend of the sum of squared errors (SSE) as the indicator for choosing K.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
where C_i is the i-th cluster, x is a sample point in C_i, \mu_i is the centroid of cluster C_i, and SSE is the total clustering error over all samples, which indicates the quality of the clustering.
In the figure from the original article (omitted here), the clustering results for K ranging from 2 to 7 show that K = 5 gives the best result.
2. Initial cluster centers (centroids)
Different initial points may lead K-means to different final partitions. In practice we do not know in advance which labels in the data set we care about, so manually specifying the centroids beforehand is unrealistic.
The usual ways to select the initial centroids are:
- Random selection
- The k-means++ method:
  - Choose the first cluster center at random
  - For every data point x, let D(x) denote its distance to the nearest already-chosen cluster center
  - Choose the next cluster center with probability proportional to D(x)^2
  - Repeat until all K centers have been chosen
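The k-means++ seeding described above can be sketched directly, with a hypothetical helper name; D(x) is the distance to the nearest already-chosen center, and each new center is drawn with probability proportional to D(x)^2:

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding: first center chosen uniformly at random, each
    further center drawn with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]           # first center: uniform
    for _ in range(k - 1):
        # squared distance from every point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                     # P(x) proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.array([[0., 0.], [0., 1.], [9., 9.], [9., 10.]])
init = kmeanspp_init(X, 2)
```

Because already-chosen points have D(x) = 0, a center is never picked twice, and far-away points are strongly favored.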
3. Distance measurement
Distance is the index K-means uses to measure the similarity between sample points within a cluster.
The distance measures most commonly used with K-means are:
- Euclidean distance
- Cosine similarity
- Manhattan distance
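For two feature vectors, these three measures can be computed directly; a quick NumPy illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.linalg.norm(a - b)          # sqrt(1 + 4 + 9)
manhattan = np.sum(np.abs(a - b))          # 1 + 2 + 3 = 6
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# b points in exactly the same direction as a, so the cosine similarity
# is 1.0 even though both distance measures are nonzero
```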
2. K-means with sklearn
The Python library sklearn provides a KMeans implementation in sklearn.cluster, which we can call directly.
For image clustering we first have to extract features that represent the content of each image, so that every image is represented by an n-dimensional feature vector, i.e. one point in the feature space.
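As a concrete (hypothetical) illustration of such features, the sketch below uses a per-channel color histogram as the feature vector, one simple choice among many, and clusters ten random "images" with it:

```python
import numpy as np
from sklearn.cluster import KMeans

def color_histogram(img, bins=8):
    # hypothetical feature extractor: one normalized histogram per RGB
    # channel, concatenated into a (3 * bins)-dimensional feature vector
    feats = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    v = np.concatenate(feats).astype(float)
    return v / v.sum()       # normalize so the image size does not matter

rng = np.random.default_rng(0)
dark = [rng.integers(0, 64, (16, 16, 3), dtype=np.uint8) for _ in range(5)]
light = [rng.integers(192, 256, (16, 16, 3), dtype=np.uint8) for _ in range(5)]

X = np.array([color_histogram(im) for im in dark + light])  # shape (10, 24)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# the five dark and five light images land in two different clusters
```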
Example:
from sklearn.cluster import KMeans
import numpy as np
# six 2-D sample points
X = np.array([[1, 2], [1, 4], [1, 0],
[4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
# array([0, 0, 0, 1, 1, 1], dtype=int32)
kmeans.predict([[0, 0], [4, 4]])
# array([0, 1], dtype=int32)
kmeans.cluster_centers_
# array([[ 1., 2.],
# [ 4., 2.]])
2.1 The KMeans class
class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
Parameters:

| parameter | meaning |
| --- | --- |
| n_clusters | int, optional, default 8. The number of cluster centers, i.e. the number of clusters |
| init | {'k-means++', 'random', or an ndarray}, the method used to initialize the centroids, default 'k-means++'. 'random' picks the initial centroids randomly from the training data; if an ndarray is passed, it should have shape (n_clusters, n_features) and gives the initial centroids |
| n_init | int, default 10. The number of times the algorithm is run with different centroid initializations; the final result is the best of these runs in terms of inertia |
| max_iter | int, default 300. The maximum number of iterations of a single K-means run |
| tol | float, default 0.0001. Relative tolerance with regard to inertia used to declare convergence |
| precompute_distances | {'auto', True, False}. Precompute distance values (faster, but uses more memory). When clustering a data set only once, precomputation is unnecessary |
| verbose | int, default 0. Whether to print the intermediate process; 0 means no printing |
| random_state | int, RandomState instance, or None, optional, default None. If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator itself; if None, the generator is the RandomState instance used by np.random |
| n_jobs | int. The amount of parallelism, achieved by running the n_init initializations in parallel. If -1, all CPUs are used. If 1, no parallel code is used, which is convenient for debugging. If less than -1, (n_cpus + 1 + n_jobs) CPUs are used; for n_jobs = -2, that is n_cpus - 1 |
| algorithm | {'auto', 'full', 'elkan'}. 'full' is the classical K-means algorithm, 'elkan' is the Elkan K-means variant; the default 'auto' chooses between them according to whether the data is sparse, generally 'elkan' for dense data and 'full' otherwise |
Main attributes:

| attribute | meaning |
| --- | --- |
| cluster_centers_ | Array of shape [n_clusters, n_features]: the coordinates of each cluster center |
| labels_ | The cluster label of each data point, starting from 0 |
| inertia_ | float, the sum of squared distances from each data point to the centroid of its cluster, used to evaluate whether the number of clusters is appropriate |
2.2 KMeans class methods
1. fit()
Runs K-means on the data set and determines the clusters.
Definition:
def fit(self, X, y=None):
    random_state = check_random_state(self.random_state)
    X = self._check_fit_data(X)
    self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ = \
        k_means(
            X, n_clusters=self.n_clusters, init=self.init,
            n_init=self.n_init, max_iter=self.max_iter, verbose=self.verbose,
            precompute_distances=self.precompute_distances,
            tol=self.tol, random_state=random_state, copy_x=self.copy_x,
            n_jobs=self.n_jobs, algorithm=self.algorithm,
            return_n_iter=True)
    return self
Internally it calls the k_means function to do the clustering and returns self.
The call to k_means() returns self.cluster_centers_, self.labels_, self.inertia_ and self.n_iter_:
- self.cluster_centers_: the cluster centers, shape (k, n_features)
- self.labels_: int, the cluster index of each sample, shape (n_samples,)
- self.inertia_: the clustering distortion (the sum of squared distances of all observations in the training set)
- self.n_iter_: the number of iterations of the best run, returned only when return_n_iter is set to True
2. predict()
Assigns categories to data according to the clustering result.
def predict(self, X)
- X: {array, sparse matrix} of shape [n_samples, n_features]
Return value:
- labels: array of shape [n_samples,], the index of the cluster each sample belongs to.
3. fit_predict()
def fit_predict(self, X, y=None):
    return self.fit(X).labels_
Return value:
- labels: array of shape [n_samples,], the index of the cluster each sample belongs to.
Computes the clustering and predicts the cluster index of each sample; it is equivalent to calling fit(X) followed by predict(X).
Note: internally, this function simply returns the self.fit(X).labels_ attribute.
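This equivalence can be checked directly on the earlier toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

# one call to fit_predict ...
labels_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# ... gives the same labels as fit followed by predict, given the same seed
labels_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).predict(X)
```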
4. transform()
def transform(self, X)
Transforms X into cluster-distance space.
Return value:
- X_new: array of shape [n_samples, k]
5. fit_transform()
def fit_transform(self, X, y=None)
Performs the clustering and transforms X into cluster-distance space; it is equivalent to calling fit(X) followed by transform(X), but more efficient.
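A small check of what transform returns: each column of the result is the Euclidean distance from a sample to one of the k cluster centers:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

D = km.transform(X)  # shape (n_samples, k); column j = distance to center j
# the same values computed by hand from cluster_centers_
expected = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :],
                          axis=2)
```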
Important: the KMeans implementation in sklearn cannot be given a custom distance measure; it always uses the Euclidean distance. This is baked into the design of the algorithm, because updating the centers involves computing a mean, and the mean minimizes the squared Euclidean distance.
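When cosine similarity is the more natural measure (as it often is for image feature vectors), one common workaround is to L2-normalize the features first: for unit vectors, ||a - b||^2 = 2(1 - cos(a, b)), so Euclidean K-means on normalized data groups by direction rather than magnitude. A minimal sketch of this trick:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# vectors along two distinct directions, with wildly varying magnitudes
dirs = np.vstack([np.tile([1.0, 0.1], (10, 1)),
                  np.tile([0.1, 1.0], (10, 1))])
X = dirs * rng.uniform(1, 100, size=(20, 1))

X_unit = normalize(X)  # each row now has unit L2 norm
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unit)
# clusters now follow direction (i.e. cosine similarity), not magnitude
```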
3. K-means with scipy
The scipy library also implements the K-means algorithm.
In scipy's terminology, a center index or cluster index is called a "code", and the table mapping each code to its center is called the "code book".
3.1 The kmeans function
Clustering with the kmeans function takes two steps:
- Use the kmeans function to generate the codebook and the distortion value
- Use the vq function to assign each observation to a code in the codebook, obtaining the distance from each observation to its nearest center
Example:
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# generate 20 random observations with 4 features each
points = np.random.randn(20, 4)
data = whiten(points)  # normalize each feature to unit variance

# generate the code book (4 centroids) and the distortion value
codebook, variance = kmeans(data, 4)
# use vq to assign each observation to its nearest centroid (its "code")
code, distance = vq(data, codebook)
A sample run (the data is random, so the exact numbers differ from run to run):
>>> codebook  # shape (4, 4)
array([[1.227829 , 0.41256122, 0.1342359 , 0.98257834],
[ 1.01190005, 0.34999089, 0.13180372, 0.06394479],
[ 0.01156929, 0.39212056, 1.86893218, 0.34921357],
[ 0.21946277, 1.36809613, 0.87196001, 0.9213216 ]])
>>> variance
1.221658211170926
>>> code
array([2, 0, 0, 2, 0, 2, 1, 3, 1, 1, 3, 0, 1, 0, 1, 1, 3, 2, 3, 2],
dtype=int32)
>>> distance
array([1.32927696, 0.99594691, 1.38351644, 1.22323281, 1.12605626,
2.04444249, 0.55554746, 2.06947197, 1.44928466, 1.09481098,
1.60957745, 1.07210177, 1.3848659 , 0.6393925 , 0.69392457,
1.06200234, 1.09091552, 0.87726365, 0.76938663, 1.96214695])
kmeans function definition:
def kmeans(obs, k_or_guess, iter=20, thresh=1e-05, check_finite=True)
Parameters:
- obs: ndarray, an M x N array in which each row is an observation vector. The features must first be processed with the whiten function.
- k_or_guess: int or ndarray, the number of centroids to generate. Each centroid is assigned a code, which is also the row index of that centroid in the generated code_book matrix. The initial k centroids are chosen by randomly selecting observations from the observation matrix; alternatively, the initial k centroids can be specified by passing in a k x N array.
- iter: int, optional. The number of times to run the k-means algorithm; the code book with the lowest distortion is returned. This parameter is ignored if initial centroids are specified via the k_or_guess array. It does not represent the number of iterations of the k-means algorithm.
- thresh: float, optional. The k-means algorithm terminates when the change in distortion since the last iteration is less than or equal to this threshold.
- check_finite: bool, optional, default True. Whether to check that the input matrix contains only finite numbers. Disabling this may improve performance, but if the input does contain infinities or NaNs it may cause problems (crashes, non-termination).
Returns:
- codebook: ndarray of shape (k, N), an array of the k centroids
- distortion: float, the mean (non-squared) Euclidean distance between the observations and the generated centroids. Note that the standard definition of distortion in the K-means algorithm is the sum of squared distances.
Note: 1. In the kmeans function, the iter parameter specifies how many times to run the K-means algorithm, not the number of iterations; the termination condition can only be controlled through the thresh parameter. 2. The distance metric is the mean (non-squared) Euclidean distance.
vq function definition:
def vq(obs, code_book, check_finite=True)
Assigns a code from the code book to each observation. Each observation vector in the M x N obs array is compared with the centroids in the code book and assigned the code of the closest centroid.
The features in obs should have unit variance, which can be achieved by passing them through the whiten function. The code book can be created with the kmeans algorithm or another encoding algorithm.
Parameters:
- obs: an M x N array in which each row is an observation. The features must be processed with the whiten function.
- code_book: ndarray, usually generated with the kmeans algorithm; each row is the centroid of one code, and the columns are the features.
- check_finite: bool, optional, default True. Whether to check that the input array contains only finite values. Disabling this may improve performance, but if the input does contain infinities or NaNs it may cause problems (crashes, non-termination).
Return values:
- code: ndarray, an array of length M holding the code-book code of each observation
- dist: ndarray of shape (M,), the distortion of each observation with respect to its nearest code
3.2 The kmeans2 function
This function also implements the K-means algorithm; it attempts to minimize the Euclidean distance between observations and centroids, and offers several initialization methods.
scipy.cluster.vq.kmeans2(data, k, iter=10, thresh=1e-05, minit='random', missing='warn', check_finite=True)
Parameters:
- data: ndarray of shape (M, N), containing M observations with N dimensions.
- k: int or ndarray, the number of clusters. If the minit parameter is 'matrix', or if an ndarray is given, it is interpreted as the initial clusters and used directly.
- iter: int, optional, the number of iterations of the k-means algorithm. Note that this meaning differs from the iter parameter of the kmeans function.
- thresh: float, optional, not used.
- minit: str, optional, the initialization method; one of 'random', 'points', '++' and 'matrix'.
  - 'random': generate k centroids from a Gaussian whose mean and variance are estimated from the data
  - 'points': randomly choose k observations (rows) from the data as the initial centers
  - '++': choose k observations according to the k-means++ method
  - 'matrix': interpret the k parameter as a (k, N) array of initial centroids
- missing: str, optional, the method for handling empty clusters; the available methods are 'warn' and 'raise'.
  - 'warn': issue a warning and continue
  - 'raise': raise a ClusterError and terminate the algorithm
- check_finite: bool, optional, whether to check that the input matrix contains only finite numbers, default True
Return values:
- centroid: ndarray of shape (k, N), the centers produced by the last iteration of the kmeans2 method
- label: ndarray, the code or index of each observation; label[i] is the code or index of the centroid closest to the i-th observation
Example:
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

# generate 20 random observations with 4 features each
points = np.random.randn(20, 4)
data = whiten(points)  # normalize each feature to unit variance

centroid, label = kmeans2(data, 5)
A sample run (the data is random, so the exact numbers differ from run to run):
>>> centroid
array([[ 0.52132816, 0.97577703, 0.30863464, 1.30546523],
[0.27344139, 0.81129939, 0.59560322, 0.47788319],
[ 1.99658961, 0.10701021, 1.09921144, 0.51397034],
[0.37598454, 1.72180727, 0.18127439, 0.58114466],
[ 0.25895367, 0.01881385, 1.25681737, 0.03119893]])
>>> label
array([1, 0, 3, 0, 1, 1, 2, 4, 0, 1, 1, 0, 4, 4, 0, 3, 1, 4, 3, 2],
dtype=int32)
4. References
[1] A preliminary study on various clustering algorithms, Zheng Han (Andrew.Hann)
[2] Feature extraction methods: K-means clustering, Jack_Kuo's blog, CSDN
[3] sklearn.cluster.KMeans, scikit-learn 0.17 documentation
[4] K-means clustering and vector quantization (scipy.cluster.vq), SciPy v1.3.1 Reference Guide