Why do I get this message : Error using kmeans ---X must have more rows than the number of clusters.

Question

Alayt Abraham Issak, 10 April 2019

Commented: Adam Danz, 11 April 2019

clear all; close all
load BRI
rng(0,'twister');
% A_train is 2055 x 88 factors (columns 1-66 and 68-89 of D_num)
A_train = D_num(:,[1:66,68:89]);
all_factors={'recommended_for_research','umbrella','year','donor','Gov''t_funding_agency'...
    'State-owned_funding_company','Other_private_funding_company','implementing_agency_china'...
    'Pipeline: Pledge','Pipeline: Commitment','Implementation','Completion','Suspended','Cancelled'...
    'Debt forgiveness','Export Credits','Grant','Strategic/Supplier Credit','Debt Rescheduling'...
    'Free-standing Technical Assistance','Scholarship/Training in Donor Country','Joint Venture with Recipient'...
    'Loan','Foreign Direct Investment','ODA-like_flow class','OOF-like_flow class','Vague_flow class','Other flow '...
   'Development Intent','Commercial Intent','Representational Intent','Mixed Intent','amount','Cash/physical_money'...
   'USD_currency','CMY_currency','Other currency','usd_defl_2014','usd_current_publish','usd_current_2019','crs_sector_code'...
   'sources_count','cofinancing_agency','Gov''t_recepient_agency','State-owned_recepient_company','Other_private_agency'...
   'recipient_agencies_count','deflators_used','exchange_rates_used','start_actual','start_planned','end_actual','end_planned'...
   'Beginning_date_since_2000','End_date_since_2014','Planned_start > Actual_start','Planned_start < Actual_start','Planned_start = Actual_start'...
  'Planned_end > Actual_end','Planned_end < Actual_end','Planned_end = Actual_end','Planned_duration','Actual_Duration','year_uncertain'...
  '2019 population','GDP(IMF)_of _reipient ','GDP_per_capita','recipient_count','recipient_cow_code','recipient_oecd_code'...
  'recipient_un_code','recipient_imf_code','Africa ','Middle East','Asia','The Pacific','Latin America and the Caribbean','Central and Eastern Europe'...
  'line_of_credit','is_cofinanced','is_ground_truthing','loan_type','interest_rate','maturity','grace_period','grant_element','source_triangulation'...
  'field_completeness'};
% B_train is 2055 x 1 (1 if debt distressed, 0 if not)
B_train = D_num(:,90);
% Deal with missing GDP (IMF)
no_GDP=(A_train(:,66)==0); % logical index of rows missing GDP (showing 0 instead)
avg_GDP=nanmean(A_train(~no_GDP,66)); % average GDP of the rows with one listed
A_train(no_GDP,66)=avg_GDP; % fill in the missing GDP values with the average value
% Deal with missing GDP per capita
no_GDP2=(A_train(:,67)==0); % logical index of rows missing GDP per capita (showing 0 instead)
avg_GDP2=nanmean(A_train(~no_GDP2,67)); % average GDP per capita of the rows with one listed
A_train(no_GDP2,67)=avg_GDP2; % fill in the missing values with the average value

The error occurs here:

k=8; % Number of clusters
dist_type='sqeuclidean'; % Distance metric (others include 'cityblock' (L1), 'cosine', and 'correlation')
[clust,centr]=kmeans(A_train,k,'dist',dist_type); % returns cluster assignments & centroid of each cluster

I have not been able to continue past this point. The rest of the script is:

figure(1) 
colstyle = {'cs','rd','b^','go','k+','d',':bs','-mo'}; %define 8 color/style combos for this plot
attribs=[1 2 3]; %categories for x, y, and z axes
for j=1:k 
    q=find(clust==j); %ID numbers of the items in this cluster
    nsample(j)=length(q); %Sample size in the cluster
    distress_rate(j)=mean(B_train(q)); % Share of debt-distressed cases within this cluster
    plot3(A_train(q,attribs(1)),A_train(q,attribs(2)),A_train(q,attribs(3)),colstyle{j}) % 3-D plot with marker types by cluster
    hold on
end
hold off
legend('Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5','Cluster 6','Cluster 7','Cluster 8');
xlabel(all_factors(attribs(1)));
ylabel(all_factors(attribs(2)));
zlabel(all_factors(attribs(3)));
figure(2);
silhouette(A_train,clust,dist_type)
% Try various numbers of clusters
nn=100; dist_type='sqeuclidean';
for j=2:nn
    [clust,centr,sumd]=kmeans(A_train,j,'dist',dist_type);
    Dtot(j,1)=sum(sumd);
end
figure(3)
plot(2:nn,Dtot(2:nn),'b-');


Answer 1

Adam Danz, 10 April 2019

Edited: Adam Danz, 10 April 2019

Possibility 1

Your variable 'A_train' does not have enough rows. You are requesting 8 clusters (k=8) and, as the error indicates, 'A_train' needs to have at least k+1 rows.

[clust,centr]=kmeans(A_train,k,'dist',dist_type);

To confirm this is the problem, run this line just before the call to kmeans().

size(A_train)
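
A slightly more explicit version of that check is sketched below. The matrix X is a small made-up stand-in for 'A_train', not the asker's data. Since kmeans() treats each row as one observation, a matrix that was accidentally transposed can also trip this check.

```matlab
% Hypothetical stand-in for A_train: 5 observations (rows), 3 variables (columns)
X = rand(5, 3);
k = 8;                          % requested number of clusters

[nRows, nCols] = size(X);       % kmeans clusters the rows of X
tooFewRows = nRows <= k;        % the error message asks for more rows than clusters
if tooFewRows
    fprintf('X has %d rows and %d columns, but k = %d.\n', nRows, nCols, k);
end
```

If the row count turns out to be smaller than expected, it is worth checking how the data were loaded and whether rows and columns were swapped.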

Possibility 2

Too many rows of 'A_train' contain at least one NaN value. kmeans() ignores every row that contains a NaN, so only complete rows count toward the row total. To check how many rows are free of NaN values, run this line:

sum(~any(isnan(A_train), 2))
ans =
     5    % example output: only 5 rows are NaN-free, which is fewer than k (8)
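
To see where the missing values come from, a sketch along the following lines counts the complete rows and the NaNs per column. The matrix X is a small made-up stand-in for 'A_train', and the variable names are only illustrative.

```matlab
% Hypothetical stand-in for A_train: 6 rows, 4 columns, with NaNs scattered in
X = [1   2   NaN 4;
     2   NaN 1   3;
     3   1   2   NaN;
     4   5   6   7;
     NaN 2   3   1;
     5   6   7   8];

completeRows = ~any(isnan(X), 2);   % true where a row contains no NaN at all
nComplete    = sum(completeRows);   % number of rows kmeans can actually use
nanPerColumn = sum(isnan(X), 1);    % number of NaNs contributed by each column
```

Here a single NaN in each of four different rows is enough to leave only two usable rows, which is how a large data set can end up with fewer complete rows than clusters.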

Comments

Alayt Abraham Issak, 11 April 2019

Edited: Alayt Abraham Issak, 11 April 2019

This makes a lot of sense as I do have a lot of NaN values in my data set. This is because there are columns in which I do not know many of the values and so I did not want to tamper with the data by adding values.

I have managed to work around the issue by removing those columns from the data set. However, as they are essential to my analysis, could you recommend another way of using kmeans despite the prevalence of NaN values, or another similarity function?

Adam Danz, 11 April 2019

To answer that, I'd step back from how to implement the analysis and look at the more fundamental problem of classifying missing data. There is no easy solution for this problem.

Sometimes missing data only accounts for a small portion of the dataset and those samples can just be ignored. That doesn't seem to be the case with your data.

If you're classifying a matrix with many variables (columns) and there's just one variable that contains most of the missing data, you could run the analysis without that variable as long as it's not an influential variable.
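
As a rough sketch of that idea, the code below finds the column with the largest share of NaNs and compares the number of complete rows with and without it. The matrix X is a made-up stand-in for the real data, and whether the dropped variable is influential still has to be judged separately.

```matlab
% Hypothetical stand-in: column 3 holds most of the missing values
X = [1 2   NaN 4;
     2 3   NaN 3;
     3 1   NaN 5;
     4 5   6   7;
     5 NaN 3   1;
     6 6   7   8];

nanFrac    = mean(isnan(X), 1);                  % fraction of NaNs in each column
[~, worst] = max(nanFrac);                       % column with the most missing data
Xreduced   = X(:, [1:worst-1, worst+1:end]);     % same data without that column

before = sum(~any(isnan(X), 2));                 % complete rows using all columns
after  = sum(~any(isnan(Xreduced), 2));          % complete rows once it is dropped
```

Comparing before and after shows how many observations are recovered by leaving that one variable out.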

You could determine the number of rows that contain a complete set of data and reduce the number of clusters accordingly, but that's usually a poor solution since the number of clusters should be chosen with intention.

Some sources suggest that you could fill in missing values with means or random values, but such arbitrary decisions are bad practice and can really throw off the results such that they no longer represent the underlying unknown reality.

A simple search on google scholar lists these two papers with >100 citations. They discuss the problem of missing data in classification models and potential solutions.

http://www.jmlr.org/papers/v8/saar-tsechansky07a.html - See the pdf link

https://link.springer.com/chapter/10.1007/BFb0052868
