MATLAB Answers

PCA on high dimensional data

Ame ZL on 27 Jul 2018
Commented: Anton Semechko on 28 Jul 2018
I have a matrix X with 13952736 rows x 104 columns, of single data type values. I've been trying to run PCA with a simple one-line call that has worked before, but it returns empty arrays.
The code couldn't be simpler: [COEFF, SCORE, LATENT, TSQ, EXPLAINED] = pca(X)
I've also tried requesting fewer outputs, but they also come out empty: [~, ~, LATENT, ~, EXPLAINED] = pca(X)
So I imagine the problem is related to MATLAB memory, but no error message is displayed. I'm running it on Windows 10 Pro, with 16 GB RAM and an i7-8550U processor.
Any suggestions?
Thanks to both Ben and Anton for stopping by and helping!
It seems the problem was not related to memory: one of my columns was filled with NaN from top to bottom.
Performing the SVD step by step, as Anton suggested, allowed me to spot this problem, and after eliminating the faulty column the pca() function worked fine again.
Hope this will help somebody in the future.
Best wishes

  2 Comments

Ben Frankel on 27 Jul 2018
The matrix X should take up about 6 GB of RAM. It depends on what else is in your computer's memory while MATLAB is running, but most likely that should fit in your RAM. Also, I've had MATLAB use too much memory before, but I didn't get unexpected empty results or crashes (my computer's memory just went to swap). The behavior might be different for different functions for all I know, though.
EDIT: Actually, the return values of pca will take up memory, and the temporary variables used in the pca function will take up memory as well. It's definitely possible that you are running out of memory.
Ame ZL on 27 Jul 2018
Hi Ben, thanks very much for your reply.
Yeah I tried calculating the PCA on the first 60 columns (just to try), and it worked perfectly.
So I really believe it's a matter of memory, as my computer goes to 95-100% on CPU, RAM, and disk use.
I wonder if there's any solution to this other than changing my laptop?



Anton Semechko on 27 Jul 2018
Edited: Anton Semechko on 27 Jul 2018
A 13952736-by-104 data matrix (with observations along rows and variables along columns) will take up
13952736*104*8/2^30 = 10.8 GB
of memory when represented in 'double' format (i.e., 8 bytes per element).
Are you able to load this entire matrix into your MATLAB workspace?
If yes, then you can obtain PCA of X by performing singular value decomposition (SVD) of its 104-by-104 covariance matrix.
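A minimal sketch of this step, assuming X contains no NaN or Inf entries. The toy X below stands in for the real data, and the names mu, Xc, and C are illustrative rather than taken from the original answer:

```matlab
% Toy stand-in for the real data: N observations (rows) by p variables (columns).
X = single(randn(1000, 5));

N  = size(X, 1);
mu = mean(X, 1);                  % 1-by-p vector of column means
Xc = bsxfun(@minus, X, mu);       % centered data, same size as X
C  = double(Xc' * Xc) / (N - 1);  % p-by-p sample covariance matrix
[U, D, ~] = svd(C);               % C is symmetric positive semi-definite, so C = U*D*U'
```

Note that Xc is a full-size copy of X. If memory is tight, overwriting X in place (X = bsxfun(@minus, X, mu)) avoids the second copy, at the cost of losing the uncentered data.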
Principal directions will be along the columns of U, and D will contain the singular values of the covariance matrix C corresponding to U. The proportion of variance explained by the first k modes of U will be:
xlabel('# principal modes')
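Only the xlabel call above survives from the plotting snippet. As a sketch (not necessarily the original code), the cumulative proportion of variance could be computed and plotted as follows. The toy X and the names lambda and explained are assumptions for illustration:

```matlab
% Toy data with clearly different variances along each column.
X = randn(500, 4) * diag([4 2 1 0.5]);

C = cov(X);                 % p-by-p covariance matrix (columns are variables)
[~, D, ~] = svd(C);
lambda = diag(D);           % variance along each principal direction, in descending order

% Fraction of total variance captured by the first k modes, for k = 1..p.
explained = cumsum(lambda) / sum(lambda);

figure
plot(1:numel(lambda), 100 * explained, '-o')
xlabel('# principal modes')
ylabel('cumulative variance explained (%)')
```

Here explained(k) is the proportion for the first k modes, so explained(end) equals 1 up to rounding.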
To project X on the first k modes, do:
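One way to do the projection, sketched on toy data. The value of k and the names Xc and T are illustrative assumptions, and the data must be centered with the same column means used for the covariance matrix:

```matlab
% Toy data standing in for the real matrix.
X  = randn(500, 4) * diag([4 2 1 0.5]);

mu = mean(X, 1);
Xc = bsxfun(@minus, X, mu);   % center each column
[U, D, ~] = svd(cov(X));      % principal directions in the columns of U

k = 2;                        % number of modes to keep (chosen arbitrarily here)
T = Xc * U(:, 1:k);           % N-by-k matrix of projections (scores)
```

The columns of T are uncorrelated, and their variances are the first k diagonal entries of D.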
To whiten the loadings, do:
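A sketch of one common reading of this step, in which "whitening" means rescaling each projected coordinate to unit variance by dividing by the square root of the corresponding singular value. This interpretation, the toy data, and the names T and Tw are assumptions:

```matlab
% Toy data and projection onto the first k modes.
X  = randn(500, 4) * diag([4 2 1 0.5]);
Xc = bsxfun(@minus, X, mean(X, 1));
[U, D, ~] = svd(cov(X));
lambda = diag(D);

k = 2;
T = Xc * U(:, 1:k);                            % projections onto the first k modes

% Divide each column of T by the standard deviation along that mode.
Tw = bsxfun(@rdivide, T, sqrt(lambda(1:k))');  % whitened projections
```

After this step cov(Tw) is the k-by-k identity matrix up to rounding. The division becomes unstable if any of the first k singular values is close to zero.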

  4 Comments

Anton Semechko on 27 Jul 2018
The only way you can find out whether it works or not is to try running the code I posted above. Performing SVD of a 104-by-104 matrix should not cause any memory issues.
Ame ZL on 28 Jul 2018
Hi Anton,
I tried this morning and it did work!
However, the most important thing is that getting the results step by step allowed me to notice that my column 74 was filled with NaN.
I deleted that column from my data and ran both your code and pca(); both ran fine and yielded very similar results.
So it seems the problem was that column of NaNs.
Thanks very much for your time and help.
Anton Semechko on 28 Jul 2018
Glad to help, America.
And good job spotting those 'NaN's. In general, it is good practice to make sure the input data matrix does not contain any 'NaN's or 'Inf's prior to any type of data analysis. This can be done in one line of code:
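The original one-liner is not shown here. A sketch of such a check using isfinite, with two extra lines that locate the offending columns; the toy X and the names isClean, badCols, and allNaN are illustrative:

```matlab
% Toy matrix with one all-NaN column and one stray Inf.
X = randn(6, 4);
X(:, 3) = NaN;
X(2, 1) = Inf;

isClean = all(isfinite(X(:)));          % the one-line check: false if any NaN or Inf is present

badCols = find(any(~isfinite(X), 1));   % columns containing at least one NaN or Inf
allNaN  = find(all(isnan(X), 1));       % columns that are NaN from top to bottom
```

On the real data, a column like the all-NaN column 74 mentioned above would show up in both badCols and allNaN.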


More Answers (0)
