Why do I get this message: "Error using kmeans: X must have more rows than the number of clusters"?

clear all; close all
load BRI
rng(0,'twister');
% A_train is 2055 x 88 factors (columns 1-89 of D_num, excluding column 67)
A_train = D_num(:,[1:66,68:89]);
all_factors={'recommended_for_research','umbrella','year','donor','Gov''t_funding_agency'...
'State-owned_funding_company','Other_private_funding_company','implementing_agency_china'...
'Pipeline: Pledge','Pipeline: Commitment','Implementation','Completion','Suspended','Cancelled'...
'Debt forgiveness','Export Credits','Grant','Strategic/Supplier Credit','Debt Rescheduling'...
'Free-standing Technical Assistance','Scholarship/Training in Donor Country','Joint Venture with Recipient'...
'Loan','Foreign Direct Investment','ODA-like_flow class','OOF-like_flow class','Vague_flow class','Other flow '...
'Development Intent','Commercial Intent','Representational Intent','Mixed Intent','amount','Cash/physical_money'...
'USD_currency','CMY_currency','Other currency','usd_defl_2014','usd_current_publish','usd_current_2019','crs_sector_code'...
'sources_count','cofinancing_agency','Gov''t_recepient_agency','State-owned_recepient_company','Other_private_agency'...
'recipient_agencies_count','deflators_used','exchange_rates_used','start_actual','start_planned','end_actual','end_planned'...
'Beginning_date_since_2000','End_date_since_2014','Planned_start > Actual_start','Planned_start < Actual_start','Planned_start = Actual_start'...
'Planned_end > Actual_end','Planned_end < Actual_end','Planned_end = Actual_end','Planned_duration','Actual_Duration','year_uncertain'...
'2019 population','GDP(IMF)_of _reipient ','GDP_per_capita','recipient_count','recipient_cow_code','recipient_oecd_code'...
'recipient_un_code','recipient_imf_code','Africa ','Middle East','Asia','The Pacific','Latin America and the Caribbean','Central and Eastern Europe'...
'line_of_credit','is_cofinanced','is_ground_truthing','loan_type','interest_rate','maturity','grace_period','grant_element','source_triangulation'...
'field_completeness'};
% B_train is 2055 x 1 (1 if debt distressed, 0 if not)
B_train = D_num(:,90);
% Deal with missing GDP (IMF)
no_GDP=(A_train(:,66)==0); %logical index of rows missing GDP (recorded as 0)
avg_age=nanmean(A_train(no_GDP==0,66)); % average GDP of the rows that have one listed
A_train(no_GDP==1,66)=avg_age; %fill in the missing GDP values with the average value
% Deal with missing GDP per capita
no_GDP2=(A_train(:,67)==0); %logical index of rows missing GDP per capita (recorded as 0)
avg_age2=nanmean(A_train(no_GDP2==0,67)); % average GDP per capita of the rows that have one listed
A_train(no_GDP2==1,67)=avg_age2; %fill in the missing GDP per capita values with the average value
The error occurs here:
k=8; % Number of clusters
dist_type='sqeuclidean'; % Distance metric (others include 'cityblock' (L1), 'cosine', and 'correlation')
[clust,centr]=kmeans(A_train,k,'dist',dist_type); % returns cluster assignments & centroid of each cluster
And I have not been able to get any further:
figure(1)
colstyle = {'cs','rd','b^','go','k+','d',':bs','-mo'}; %define 8 color/style combos for this plot
attribs=[1 2 3]; %categories for x, y, and z axes
for j=1:k
q=find(clust==j); %ID numbers of the items in this cluster
nsample(j)=length(q); %Sample size in the cluster
survival(j)=mean(B_train(q)); %Fraction flagged as debt distressed within this cluster (B_train is 0/1)
plot3(A_train(q,attribs(1)),A_train(q,attribs(2)),A_train(q,attribs(3)),colstyle{j}) % 3-D plot with marker types by cluster
hold on
end
hold off
legend('Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5','Cluster 6','Cluster 7','Cluster 8');
xlabel(all_factors(attribs(1)));
ylabel(all_factors(attribs(2)));
zlabel(all_factors(attribs(3)));
figure(2);
silhouette(A_train,clust,dist_type)
Try various numbers of clusters
nn=100; dist_type='sqeuclidean';
for j=2:nn
[clust,centr,sumd]=kmeans(A_train,j,'dist',dist_type);
Dtot(j,1)=sum(sumd);
end
figure(3)
plot(2:nn,Dtot(2:nn),'b-');

Accepted Answer

Adam Danz on 10 Apr 2019
Edited: Adam Danz on 10 Apr 2019
Possibility 1
Your variable 'A_train' does not have enough rows. You are requesting 8 clusters (k=8) and, as the error indicates, 'A_train' needs to have at least k+1 rows.
[clust,centr]=kmeans(A_train,k,'dist',dist_type);
To confirm this is the problem, call this just before the kmeans() function.
size(A_train)
Possibility 2
Your variable 'A_train' has too many rows that contain at least one NaN value. kmeans() ignores any row that contains at least one NaN value. To check that you have enough rows without NaN values, run this line:
sum(~any(isnan(A_train), 2))
ans =
5 % e.g. only 5 rows are free of NaN values, which is fewer than k (8)
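Both possibilities can be checked in one pass. The following is only a minimal sketch: the matrix A and its NaN pattern are made-up stand-ins for 'A_train', and the variable names are illustrative rather than taken from the original script.

```matlab
% Hypothetical stand-in for A_train: 10 rows, 3 columns (not the real data)
A = magic(10);
A = A(:, 1:3);
A(1:6, 2) = NaN;                  % make six rows incomplete
k = 8;                            % number of clusters requested

nRows       = size(A, 1);         % Possibility 1: total number of rows
completeRow = ~any(isnan(A), 2);  % true where a row contains no NaN at all
nComplete   = sum(completeRow);   % Possibility 2: rows kmeans can actually use
nanPerCol   = sum(isnan(A), 1);   % shows which columns carry the NaNs

fprintf('%d rows, %d without NaN, k = %d\n', nRows, nComplete, k);
if nComplete <= k
    disp('Too few complete rows for this k.');
end
```

Replacing A with 'A_train' gives the same three numbers for the real data, and 'nanPerCol' points to the columns responsible for most of the missing values.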
  4 Comments
Alayt Abraham Issak on 11 Apr 2019
Edited: Alayt Abraham Issak on 11 Apr 2019
This makes a lot of sense as I do have a lot of NaN values in my data set. This is because there are columns in which I do not know many of the values and so I did not want to tamper with the data by adding values.
However, I have managed to fix the issue by eliminating those columns from the data set. Nonetheless, as they are essential to my analysis, could you recommend another way of using kmeans despite the prevalence of NaN values, or another similarity function?
Adam Danz on 11 Apr 2019
To answer that, I'd step back from thinking about how to implement the analysis and look at the more fundamental problem of classifying data with missing values. There is no easy solution to this problem.
Sometimes missing data only accounts for a small portion of the dataset and those samples can just be ignored. That doesn't seem to be the case with your data.
If you're classifying a matrix with many variables (columns) and there's just one variable that contains most of the missing data, you could run the analysis without that variable as long as it's not an influential variable.
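As a rough illustration of the single-variable case, the sketch below finds the column with the largest share of NaNs, drops it, and compares how many complete rows remain. The matrix A is made up for the example, and whether the dropped variable is influential is a judgment the code cannot make for you.

```matlab
% Hypothetical data (not the real A_train): column 3 holds most of the NaNs
A = reshape(1:40, 10, 4);
A(2:9, 3) = NaN;                         % 8 of 10 values missing in column 3
A(1, 1)   = NaN;                         % one stray NaN elsewhere

nanFrac = mean(isnan(A), 1);             % fraction of missing values per column
[~, worstCol] = max(nanFrac);            % column with the most missing values
keepCols  = setdiff(1:size(A, 2), worstCol);
A_reduced = A(:, keepCols);              % same rows, without that one column

before = sum(~any(isnan(A), 2));         % complete rows using all columns
after  = sum(~any(isnan(A_reduced), 2)); % complete rows after dropping it
fprintf('Complete rows: %d before, %d after dropping column %d\n', ...
    before, after, worstCol);
```

If 'after' is still not comfortably larger than the number of clusters, removing one variable does not solve the problem.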
You could determine the number of rows that contain a complete set of data and reduce the number of clusters accordingly, but that's usually a poor solution since the number of clusters should be chosen with intention.
Some sources suggest that you could fill in missing values with means or random values, but such arbitrary decisions are bad practice and can really throw off the results such that they no longer represent the underlying unknown reality.
A simple search on google scholar lists these two papers with >100 citations. They discuss the problem of missing data in classification models and potential solutions.


More Answers (0)
