How to get linearly independent subset of matrix columns in a high-dimensional matrix

4 ビュー (過去 30 日間)
Dear all:
I have a few sets of fixed effects in my linear model, which has a collinearity problem since the rank does not match the number of columns.
  • After a wide search on the platform, I found that there is a great function called lincols to solve this issue. However, it occurs to me that the function becomes very slow as the dimension of my fixed effects is very large (at least in the scale of thousands).
My questions are:
  • Is there any alternative (faster) function that could handle a high-dimensional collinearity problem?
  • While the lincols function handles matrices of any kind, I am wondering if there is any way to speed up the algorithm since I am only dealing with dummies (fixed effects).
Thank you very much, and I look forward to hearing from you!
Best,
Long

採用された回答

John D'Errico
John D'Errico 2020 年 8 月 23 日
編集済み: John D'Errico 2020 年 8 月 23 日
You have thousands of columns, and therefore probably thousands of rows. Why does it not seem surprising that big problems are computationally intensive, and that bigger problems are more so?
However you choose to solve this, the problem will be O(n^3), that is, it will grow in complexity with the cube of the size of the matrix. This means if you double the size of the matrix, I'd expect to see the complexity of the problem to go up by a factor of 8. One way or another, you will be using a column pivoted matrix factorization. There is no magic here that will solve the problem in the blink of an eye. And that you think of the columns as dummies is irrelevant. In any case, you need to use linear algebra to solve the problem, and linear algebra does not care what the columns mean. Big problems take big time, or at least big computers.
  7 件のコメント
John D'Errico
John D'Errico 2020 年 8 月 24 日
編集済み: John D'Errico 2020 年 8 月 24 日
The problem is, in order to use QR for this purpose, you need to use the THREE output version of QR. The reason QR does the work for you, is in the column pivoting. At each step, it kills off what it has effectively already seem, then it takes the column that is most linearly independent form those it has already seen. This is why it works for your purpose.
It is important that qr does this task on sparse matrices, since your problem is sparse. However, if that cannot be pushed onto the GPU, you just need to use the fastest server you can find. The most available cores would be important.
For example, if I try this not very sparse problem, but large enough that it will make my computer work hard enough to get all 8 cores humming for long enough to see that happen:
X = sprand(5000,20000,.01);
[Q,R,E] = qr(X,0);
Then I see MATLAB is indeed using the full capacity of my computer, all 8 cores. On small problems, only 1 core will wake up. And there is a big difference between 2 cores on a laptop and 8 or 12 or more. (A lot of heat generated too.)
My point is, for the most speed, find something with as many physical cores that you can access as possible. If you can get time on something with 32 or 64 or 128 cores, then do so.
Long Hong
Long Hong 2020 年 8 月 24 日
Thanks @John for your suggestions! My university does have some super computers for this purpose (with many cores). I will try to find a way to use it. :D

サインインしてコメントする。

その他の回答 (0 件)

カテゴリ

Help Center および File ExchangeParallel Computing Fundamentals についてさらに検索

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by