How to apply PCA correctly?

Sepp, 12 December 2015
Edited: the cyclist, 12 October 2022
Hello
I'm currently struggling with PCA in MATLAB. Let's say we have a data matrix X and a response y (classification task). X consists of 12 rows and 4 columns. The rows are the data points, the columns are the predictors (features).
Now, I can do PCA with the following command:
[coeff, score] = pca(X);
As I understand from the MATLAB documentation, coeff contains the loadings and score contains the principal components in its columns. That means the first column of score contains the first principal component (associated with the highest variance) and the first column of coeff contains the loadings for the first principal component.
Is this correct?
But if this is correct, why is then X * coeff not equal to score?
1 Comment
DrJ, 11 December 2019
Edited: DrJ, 11 December 2019
@Sepp
Your doubt can be clarified by this tutorial (even though it uses a different program), especially from about 5 minutes in: https://www.youtube.com/watch?v=eJ08Gdl5LH0
@the cyclist
Fabulous and generous explanation.


Accepted Answer

the cyclist, 12 December 2015
Edited: the cyclist, 12 October 2022
==============================================================================
EDIT: I recommend looking at my answer to this other question for a more detailed discussion of topics mentioned here.
==============================================================================
Maybe this script will help.
rng 'default'
M = 7; % Number of observations
N = 5; % Number of variables observed
X = rand(M,N);
% De-mean
X = bsxfun(@minus,X,mean(X));
% Do the PCA
[coeff,score,latent] = pca(X);
% Calculate eigenvalues and eigenvectors of the covariance matrix
covarianceMatrix = cov(X);
[V,D] = eig(covarianceMatrix);
% "coeff" are the principal component vectors.
% These are the eigenvectors of the covariance matrix.
% Compare the columns of coeff and V.
% (Note that the columns are not necessarily in the same *order*,
% and they might be *slightly* different from each other
% due to floating-point error. Corresponding columns may also
% differ in sign, because an eigenvector is only defined up to sign.)
coeff
V
% Multiply the original data by the principal component vectors
% to get the projections of the original data on the
% principal component vector space. This is also the output "score".
% Compare ...
dataInPrincipalComponentSpace = X*coeff
score
% The columns of X*coeff are orthogonal to each other. This is shown with ...
corrcoef(dataInPrincipalComponentSpace)
% The variances of these vectors are the eigenvalues of the covariance matrix,
% and are also the output "latent". Compare these three outputs:
var(dataInPrincipalComponentSpace)'
latent
sort(diag(D),'descend')
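To tie the script back to the original question: X*coeff differs from score when X has not been centered, because pca() subtracts the column means before projecting. The following is a minimal sketch of that check, not part of the original answer. It assumes the sixth output of pca() (the estimated means, called mu here) and implicit expansion (R2016b or later); the names errRaw and errCentered are made up for illustration.

```matlab
rng default
X = rand(7,5);                      % raw data, deliberately NOT de-meaned

% The sixth output of pca() is the row vector of column means
% that pca() subtracts internally before computing the scores.
[coeff,score,~,~,~,mu] = pca(X);

% Projecting the raw data does not reproduce score ...
dRaw = X*coeff - score;
errRaw = max(abs(dRaw(:)));

% ... but projecting the centered data does, up to round-off.
dCentered = (X - mu)*coeff - score; % X - mu uses implicit expansion
errCentered = max(abs(dCentered(:)));

fprintf('max |X*coeff - score|      = %g\n', errRaw);
fprintf('max |(X-mu)*coeff - score| = %g\n', errCentered);
```

The first difference should be clearly nonzero and the second on the order of floating-point round-off.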
15 Comments
Keinan Poradosu, 31 March 2022
Hi, thanks for the explanation. However, for some reason, when I run it on made-up data, X*coeff doesn't equal score. Any insights? Here's the code:
%%
X=[80,90,30;
90,90,70;
95,85,50;
92,92,20];
[coeff,score,latent,~,explained] = pca(X);
Pca_space_Dat=X*coeff;
%%
What I get is:
Pca_space_Dat =
31.6028 61.1270 103.2703
72.2528 67.1491 106.6326
53.0821 74.7708 101.6937
22.5825 73.5387 106.8180
score =
-13.2773 -8.0194 -1.3334
27.3728 -1.9973 2.0290
8.2020 5.6244 -2.9099
-22.2975 4.3923 2.2144
%%
Thanks in advance
the cyclist, 31 March 2022
You skipped the step where the means are subtracted:
%%
X=[80,90,30;
90,90,70;
95,85,50;
92,92,20];
% De-mean
X = bsxfun(@minus,X,mean(X)); % <------ YOU MISSED THIS STEP
[coeff,score,latent,~,explained] = pca(X);
Pca_space_Dat=X*coeff
Pca_space_Dat = 4×3
-13.2773 -8.0194 -1.3334
27.3728 -1.9973 2.0290
8.2020 5.6244 -2.9099
-22.2975 4.3923 2.2144
score
score = 4×3
-13.2773 -8.0194 -1.3334
27.3728 -1.9973 2.0290
8.2020 5.6244 -2.9099
-22.2975 4.3923 2.2144
The reason for this step is mentioned in the comments above. Also, in more recent versions of MATLAB (R2016b and later, which support implicit expansion), you can do
X = X - mean(X);
rather than
X = bsxfun(@minus,X,mean(X));
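The de-meaning step also matters in the other direction: to get back from the scores to the original data, the means have to be added back. A small sketch using the same made-up matrix follows. It is an illustration rather than part of the reply above, and it assumes the sixth output of pca() (named mu here) and implicit expansion; Xrec and reconErr are hypothetical names.

```matlab
X = [80,90,30;
     90,90,70;
     95,85,50;
     92,92,20];

% mu holds the column means that pca() removed internally
[coeff,score,~,~,~,mu] = pca(X);

% score*coeff' recovers the centered data; adding mu restores the original
Xrec = score*coeff' + mu;

reconErr = max(abs(Xrec(:) - X(:)));
fprintf('max reconstruction error = %g\n', reconErr);
```

Because all three components are kept here, the reconstruction should match X up to round-off. Keeping fewer columns of score and coeff would give an approximation instead.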


More Answers (2)

Yaser Khojah, 17 April 2019
Dear the cyclist, thanks for showing this example. I have a question regarding the order of the columns of coeff, since they differ from those of V. Is there any way to see the order of these columns? In other words, which variables does each column correspond to?
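One possible way to see how the columns of V line up with coeff is sketched below; it is an illustration, not taken from the thread. pca() orders its components by decreasing variance, so sorting the eigenvalues in descending order and flipping signs where needed should reproduce coeff. In both matrices, each row corresponds to an original variable (a column of X) and each column to one principal component. The names order, Vsorted, signs, Valigned and alignErr are hypothetical, and the sketch assumes distinct eigenvalues and implicit expansion (R2016b or later).

```matlab
rng default
X = rand(7,5);

coeff = pca(X);
[V,D] = eig(cov(X));

% Sort the eigenvalues from largest to smallest, as pca() does,
% and reorder the eigenvectors in the same way.
[~,order] = sort(diag(D),'descend');
Vsorted = V(:,order);

% An eigenvector is only defined up to sign, so flip each column
% of Vsorted to point the same way as the matching column of coeff.
signs = sign(sum(Vsorted .* coeff, 1));
Valigned = Vsorted .* signs;

alignErr = max(abs(Valigned(:) - coeff(:)));
disp(order')      % which column of V became component 1, 2, ...
fprintf('max |Valigned - coeff| = %g\n', alignErr);
```

The vector order shows which column of V corresponds to each principal component.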
8 Comments
Yuan Luo, 8 November 2020
Why does X need to be de-meaned, since pca centers the data by default?
the cyclist, 26 December 2020
Sorry it took me a while to see this question.
If you do
[coeff,score] = pca(X);
it is true that pca() will internally de-mean the data. So, score is derived from de-meaned data.
But it does not mean that X itself [outside of pca()] has been de-meaned. So, if you are trying to re-create what happens inside pca(), you need to manually de-mean X first.
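The internal centering can also be switched off, which makes the relationship easy to check. The sketch below is an illustration and assumes the 'Centered' name-value option of pca(); coeffU, scoreU and errUncentered are made-up names. Note that uncentered PCA generally gives different components from the default, so this is a way to inspect the behavior rather than a substitute for de-meaning.

```matlab
rng default
X = rand(7,5);                      % raw data, not de-meaned

% With 'Centered' set to false, pca() does not subtract the column means,
% so the scores are the projection of X itself onto the components.
[coeffU,scoreU] = pca(X,'Centered',false);

d = X*coeffU - scoreU;
errUncentered = max(abs(d(:)));
fprintf('max |X*coeffU - scoreU| = %g\n', errUncentered);
```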



Greg Heath, 13 December 2015
Hope this helps.
Thank you for formally accepting my answer
Greg
