This is a super duper fast implementation of the kmeans clustering algorithm. The code is fully vectorized and extremely succinct. It is much much faster than the Matlab builtin kmeans function. The kmeans++ seeding algorithm is also included (kseeds.m) for good initialization. Therefore, this package is not only for coolness, it is indeed practical.

Please try the demo script in the package.

Detail explanation of this algorithm can be found in following blog post:

http://statinfer.wordpress.com/2011/12/12/efficient-matlab-ii-kmeans-clustering-algorithm/

This function is now a part of the PRML toolbox (http://www.mathworks.com/matlabcentral/fileexchange/55826-pattern-recognition-and-machine-learning-toolbox).

Mo Chen (2021). Kmeans Clustering (https://www.mathworks.com/matlabcentral/fileexchange/24616-kmeans-clustering), MATLAB Central File Exchange. Retrieved .

Created with
R2016b

Compatible with any release

**Inspired by:**
Pattern Recognition and Machine Learning Toolbox

**Inspired:**
Wavelet Based Image Segmentation, k-means++, Kmeans, Kernel Learning Toolbox, Logistic Regression for Classification, Naive Bayes Classifier, Kernel Kmeans

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!Create scripts with code, output, and formatted text in a single executable document.

Faheem Akhtaranybody tells me how and where i import data set in this algorithm? please.....

Alexander Laut'gamrnd' requires Statistics and Machine Learning Toolbox.

Ame ZLError using *

MTIMES (*) is not supported for one sparse argument and one single

argument.

Error in kkmeans (line 17)

mu = X*normalize(sparse(idx,last,1),1); % compute cluster centers

How can I solve this error? Thanks

edgar avalosJon WieseraepoundA little bit of substitution of bsxfun(@minus,x,y) in for the places where this fails in Matlab < R2016b makes this work just fine. And it is still faster or at least on par with the builtin kmeans (without the need for a Stats toolbox license - which we run out of sometimes...).

A little bit of conditional tests could make this work for pretty much every Matlab version, i.e. using an if statement or an anonymous function to perform those expansions that fail on older versions, but if performance/speed is your only metric, then...

I definitely appreciated the kmeans++ initialization.

Thanks!

Mo Chen@Stephane CHO Just like rating a windows software low because not working on linux, especially, it claims windows only.

Stephane CHONot working on R2015a

chiranjib patraI can see visually the clusters are separated ,but how do I get to know the members of the clusters (i.e index of the main dataset )

Mo Chen@John, this is a new syntax of Matlab >=R2016b.

SRSame Problem as John. Running the demo throws an error:

Error using -

Matrix dimensions must agree.

Error in kseeds (line 15)

D = min(D,sum((X-mu(:,i-1)).^2,1));

Error in kmeans_demo (line 9)

y = kmeans(X,kseeds(X,k));

Jonathan ElkinDoes not work in MATLAB 7.12.0 (R2011a).

This tool does: https://www.mathworks.com/matlabcentral/fileexchange/28804-k-means++

JohnI'm having trouble running kmeans_demo. In kmeans_demo, the line y = kmeans(X,kseeds(X,k)); calls kseed with X(2,5000) and k=3. In kseeds, the line D = min(D,sum((X-mu(:,i-1)).^2,1)); the term sum((X-mu(:,i-1)).^2 doesn't evaluated with X as a 2x5000 array and mu a 2x1 array.

Yoshi BlackBruno CostaChathurikaGayatri DuwarahHOW TO CREATE A DATABASE IN MATLAB?

David InouyeThere is a small bug for newer versions of MATLAB because the behavior of UNIQUE has changed so that the indices are always returned as column vectors. See demo below.

Though I'm not a guru on speed, one suggested change would be to add the following line after calls to UNIQUE:

label = label(:)';

Demo code:

clc; lb=[3 1 1 5]; szb=size(lb); [~,~,la] = unique(lb); sza=size(la);

fprintf('<<<<<< %s >>>>>>\nLabels before: %s, Labels after: %s\nSize before:%s, Size after:%s\n', version, mat2str(lb), mat2str(la), mat2str(szb), mat2str(sza));

Output on different versions of MATLAB:

<<<<<< 8.3.0.532 (R2014a) >>>>>>

Labels before: [3 1 1 5], Labels after: [2;1;1;3]

Size before:[1 4], Size after:[4 1]

<<<<<< 7.12.0.635 (R2011a) >>>>>>

Labels before: [3 1 1 5], Labels after: [2 1 1 3]

Size before:[1 4], Size after:[1 4]

selvakumar sHow to view the segmented result! pls help me!

Dipalihow to run k-means for a input as a any dataset file

Ashutosh Kumar UpadhyayAntoni J. CanósThe algorithm randomly fails, probably due to the random initialization.

For instance, the vector of 1D data X=[9.13,2.68,2.33,9.41] with k=2 is sometimes clustered (labeled) as [1 1 1 1], instead of the right values [1 2 2 1] or [2 1 1 2];

LakshmiHi, i'm new to Matlab coding. Pls tell how to run this code...

ChathurikaNavjot Jassihow should i run this program ??

JohnWorks fine, almost always. However, besides the needed transpose mentioned by Matei, I noticed another problem. Suppose I want 9 clusters from a dataset with 100 elements. Quite often I only get 7 or 8 groups. The same thing happens with other cluster numbers, though I haven't seen it for fewer than wanting 5.

mohit raisent file

Brian WangMilkaHey Mate,

did you already implement the Replication Feature? (From judging your code no, but maybe I don't get the stuff correctly yet ;-))

And why the spread function can only get an input matrix X of 2 or 3 rows, while in your kmeans algorithm it can have any size in the first dimension?

Matei Tene@Zoe: I would rather just transpose instead of reshaping, i.e. in line 14

last = label';

BojanCuts down my calculation from 4h down to 19min, so very pleased. Had to reshape label as per suggestion by Zoe below to make it work with 2013. Otherwise very good!

rohithare no. of seeds equal to no.of clusters??

and i need labels for all data points.Whereas this is giving label whose dimension is equal to no. of seeds

Ramaniplease help me in Em estimator for parameter estimation

ZoéSeems to be broken in R2013a due too a column/row vector mix-up. As a quick workaround I have added

label = reshape(label, length(label), 1);

after line 15 in litekmeans.m.

ZoéleilaThank you for sharing. Does the code support 3d data?

Abdullah Alomarinot working ..

RSHi,

I am looking for an implemented kmeans clustering algorithm to segment a full body 3D CT matrix into 4 classes in Matlab. This kmeans output will then be used as input to Potts model segmentation.

Is there anyone who can help med with this or give me some suggestions.

Thanks in advance

Rachelle

Mo ChenIt is simply not for image.

Angie Orozco Sanchezdoes it works for rgb images?

Jonathan C. LanseyTim BenhamGood, fast implementation but it should be possible to pass in the cluster centers. Also arguments are not checked (try k=1) and standard help is not provided.

Mo ChenHi, YIMING

It is normal behavior. This line of code remove empty clusters which does not happen very often when k is small.

YIMINGHey Michael,

I met problem when compiling the following sentence of your code,

[~,~,label] = unique(label);

As is known, the character '~' is not supported for representing non-used variables in recent releases of Matlab. However, if I change the '~' to some other variable names (not to be used), then this sentence seems unnecessary --- label is not changed after executing this sentence.

Any comments on this?

subrajeetEven if we choose the cluster centers initially still there is no guarantee that we will get the same clusters every time.

There is one paperon this and a specific clustering technique to solve this problem But I am unable to code it properly. Can you have a look on this algorithm called Quality Threshold (QT) based clustering.

Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression

% data: Identification and analysis of coexpressed genes. Genome Research

% 9, 1106–1115.

%

% http://genome.cshlp.org/content/9/11/1106.full

% http://genome.cshlp.org/content/9/11/1106/F5.large.jpg

Mo ChenHi, surajeet,

If you want deterministic behavior of kmeans, you can use some deterministic method to initialize the clusters. For example, you can first choose the sample which is fastest from the center of the whole data set as a cluster center, then iteratively choose sample which has the largest sum of distances from the chosen centers.

However, you are not guarantee to obtain a good result on any data set.

subrajeetThanks for the posting Michael. I have a small query is thery any clustering technique by which the cluster results will remain same everytime we run the code.

Please kindly tell.

Babak TaatiThis is a really nice function.

As Fen Xie pointed out, it sometimes returns empty clusters (esp if you set k, the desired number of clusters too high compare to the size of you data).

The solution is easy though. You only need two extra lines of code to get rid of the empty clusters:

% 1- dicsard empty clusters

centers = centers(:, unique(labels));

% 2- re-assign labels

[~,labels] = max(bsxfun(@minus,centers'*X,0.5*sum(centers.^2)'));

Ehsan ArdestaniHi Michael,

Thanks a lot for the nice code. It was very usefult for me.

Mo ChenHi Yen, thanks, I've update the file that fixed the problem.

a.y@n.mHmm, the site submitted an empty comment for me when I left the page. Nice.

Anyways, thank you for the highly optimized code. I can tell that a lot of thought went into minimizing the memory footprint and the number of computations (e.g., sparse matrices, coefficient placement outside of summations, computing xc-0.5c^2 instead of the full Euclidean distance x^2-2xc+c^2). I'd like to point out a bug, though -- litekmeans.m, line 9: your sum command isn't equipped to handle data sets that are 1-by-N. Change it to: 0.5*sum(center.^2,1)' to force summation over rows.

Mo ChenHi, Miao,

I just implement the classical kmeans. there is no improvement in the algorithm aspect. The speed gain is obtained by coding tricks which is vectorization. For the question of whether this one is suitable for large k, I guess you have to try it. You might get a result with the number of clusters less than k, because some clusters might become empty during the iterations.

a.y@n.mmiao chenAnd is it suitable for problems that the number of clusters is large, e.g. k=256?

miao chenHi, Michael, I tried some other kmeans algorithms and found that your method indeed fast in speed.Why you say that it could be faster than other fast kmeans method(such as kd-tree and linear time algorithm), what is the computation complexity of your method, while the classical is O(nk)? Thanks a lot!

Sev BayHey Michael,

Thanks for sharing.

I wonder the complexity of this thing. Do you know?

Mo ChenThe results of kmeans algorithm can be different with different initializations. Actually the kmeans function in matlab is not a standard kmeans algorithm. It tries to get smaller energy by switching data points in different clusters after the standard kmeans procedure converged.

One purpose of the litekmeans is to be simple (only 10 lines of code), therefore I did not add extra code to handle empty cluster. It just discard the cluster if the cluster becomes empty. You can modify the code yourself if you want extra functionality.

Fen Xiethis method produces empty clusters constantly, be careful dealing with these exceptions～

Fen XieSorry, I have compared the results of your program and the embedded program of matlab, the two results doesn't show the same, so what does it mean??

Mo ChenYes, just call the litekmeans.m to get the clustering results. You cannot get a visualization in a simple way for the data whose dimensions are more than 3. The scatterd.m can only handle data of 2d or 3d.

Onur KalabakThank you for the share. I have two questions. Do we have to use both of the functions to cluster? I have 13x7000 matrix which I want to cluster. Should I just simply apply the matrice to litekmeans.m? And how can I plot the result as displayed in the picture?

Thanks

Onur KalabakMo ChenTo Sven:

Sorry for the inconvenience. Here's the answer to your questions.

The function takes two parameters. The first one is a d x n data matrix, of which each column is assumed to be a sample vector of d dimension. The second parameter is the number of clusters. The output is a 1 x n vector, of which each element is the label of the corresponding input sample vector.

The function handles data of arbitrary dimensions.

SvenThis gave a simple implementation to the problem I had.

A couple of notes:

- Reduction of code is good, reduction of comments is not. "litekmeans" is without any header line or comments describing what this (very useful) function does, what kind of input it takes, what it produces etc. Things like the dimension to arrange the input points are necessary to use the function properly, but are not explained in a header. Whether or not the function handles 1D, 2D, 3D, etc data isn't written anywhere.

- Providing data as a data.mat file is useful for functions requiring custom made data, but it tends to obscure the format of the data. It would be simpler and easier to understand by providing equally simple sample code that makes the data.

- Providing brief sample code on the file exchange is fine, but it should be included as a header to the file, so that people can find it when they need it, rather than when they first download it.

Otherwise, thanks! Like I said, it solved my immediate problem.