BUG (#2)? kmeans is sensitive to row (point) order

micholeodon on 12 Mar 2019
Edited: micholeodon on 12 Mar 2019
Dear All,
I have noticed that kmeans gives different results when the same points are supplied in a different order!
This does not make sense to me.
I would expect the row order of the data matrix to have no impact on the centroid locations when the random number generator is set to a fixed seed.
Can anybody explain this?
clear; close all; clc;
nPoints = 100;
nDimensions = 2;
nClusters = 3;
data = rand(nPoints,nDimensions); % points from a uniform distribution
scatter(data(:,1), data(:,2), 'b')
rndGenSeed = 1;
%% cluster data in its original row order
rng(rndGenSeed) % set random generator's seed
[~, clusters] = kmeans(data, nClusters)
hold on
scatter(clusters(:,1), clusters(:,2), 'rv') % red triangles
hold off
%% cluster reordered data (same points, rows sorted)
dataSorted = sortrows(data);
rng(rndGenSeed) % set random generator's seed - same seed
[~, clusters_sh] = kmeans(dataSorted, nClusters)
hold on
scatter(dataSorted(:,1), dataSorted(:,2), 'k*') % control - plot reordered points - they should be in the same spots
scatter(clusters_sh(:,1), clusters_sh(:,2), 'gv') % these points should cover the red triangles
hold off
grid on
  1 Comment
micholeodon on 12 Mar 2019
Edited: micholeodon on 12 Mar 2019
I think I have a clue, but I would very much like somebody from the MathWorks team to verify it.
My reasoning is this:
  1. kmeans needs to choose some initial cluster positions. It can randomly select k INPUT POINTS to start from.
  2. If you call rng(seed) with a constant seed, you will always get the SAME row indices of the data matrix as the starting cluster positions.
  3. If you shuffle the input data (the point locations are the same, only their order in the matrix changes), then even with rng(seed) and a constant seed you will get the SAME row indices, BUT the points at those indices are DIFFERENT!
  4. That means kmeans starts from different initial centroids and can converge to a different solution for the shuffled input data.
This would also explain my puzzle in another question: https://www.mathworks.com/matlabcentral/answers/448832-bug-evalclusters-is-sensitive-to-rows-points-order
What do you think, MathWorks experts? :) Does k-means select input data points as the starting centroid locations?
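The four steps above can be illustrated without calling kmeans at all. The following is only a sketch of the suspected mechanism: it uses randperm as a stand-in for whatever index selection kmeans performs internally (that stand-in is an assumption, not the actual kmeans implementation), and the variable names are made up for this example.

```matlab
% Sketch: the same seed gives the same row indices, but after reordering
% the rows those indices point at different points.
rng(1);
data = rand(100, 2);            % same kind of data as in the question
dataSorted = sortrows(data);    % same points, different row order
k = 3;

% Draw k row indices with a fixed seed (randperm is only a stand-in here)
rng(1);
idx = randperm(size(data, 1), k);
rng(1);
idxSorted = randperm(size(dataSorted, 1), k);

sameIndices = isequal(idx, idxSorted);                        % true: identical index draw
samePoints  = isequal(data(idx, :), dataSorted(idxSorted, :)); % typically false

disp(sameIndices)
disp(samePoints)
```

If samePoints comes out false, the two runs would begin from different initial centroids even though the seed is identical, which is consistent with step 4.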
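One way to test this hypothesis is to take the random initialization out of the picture. As far as I can tell from the kmeans documentation, the 'Start' option accepts a k-by-p matrix of initial centroid locations, and the default setting ('plus', i.e. k-means++) picks its seeds from the rows of the input. The sketch below passes the same explicit start matrix to both runs; the coordinates in startCentroids are arbitrary values chosen for illustration. If the hypothesis is right, the two centroid sets should then agree up to floating-point rounding.

```matlab
% Requires Statistics and Machine Learning Toolbox (for kmeans).
rng(1);
data = rand(100, 2);
dataSorted = sortrows(data);    % same points, different row order
nClusters = 3;

% Arbitrary, illustrative start locations (one row per cluster).
% They do not depend on the row order of the data.
startCentroids = [0.25 0.25; 0.50 0.75; 0.75 0.25];

[~, c1] = kmeans(data,       nClusters, 'Start', startCentroids);
[~, c2] = kmeans(dataSorted, nClusters, 'Start', startCentroids);

% Compare with a tolerance, since summation order can change the last bits
maxDiff = max(abs(c1(:) - c2(:)));
fprintf('Largest centroid difference: %g\n', maxDiff);
```

A difference near machine precision would suggest that the order sensitivity comes from the initialization step rather than from the iterations themselves.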


Answers (0)
