How to get linearly independent subset of matrix columns in a high-dimensional matrix

Question

Long Hong on 23 Aug 2020


Direct link to this question

https://jp.mathworks.com/matlabcentral/answers/583481-how-to-get-linearly-independent-subset-of-matrix-columns-in-a-high-dimensional-matrix

Commented: Long Hong on 24 Aug 2020

Dear all:

I have a few sets of fixed effects in my linear model, which has a collinearity problem since the rank does not match the number of columns.

After a wide search on the platform, I found a great function called lincols that solves this issue. However, I find that the function becomes very slow when the dimension of my fixed effects is very large (on the order of thousands, at least).

My questions are:

1. Is there any alternative (faster) function that could handle a high-dimensional collinearity problem?
2. While the lincols function handles matrices of any kind, I am wondering if there is any way to speed up the algorithm since I am only dealing with dummies (fixed effects).

Thank you very much, and I look forward to hearing from you!

Best,

Long


Answer 1

John D'Errico on 23 Aug 2020


Direct link to this answer

https://jp.mathworks.com/matlabcentral/answers/583481-how-to-get-linearly-independent-subset-of-matrix-columns-in-a-high-dimensional-matrix#answer_483998

Edited: John D'Errico on 23 Aug 2020

You have thousands of columns, and therefore probably thousands of rows. Why does it not seem surprising that big problems are computationally intensive, and that bigger problems are more so?

However you choose to solve this, the problem will be O(n^3), that is, the cost will grow with the cube of the size of the matrix. This means if you double the size of the matrix, I'd expect the cost to go up by a factor of 8. One way or another, you will be using a column pivoted matrix factorization. There is no magic here that will solve the problem in the blink of an eye. And that you think of the columns as dummies is irrelevant. In any case, you need to use linear algebra to solve the problem, and linear algebra does not care what the columns mean. Big problems take big time, or at least big computers.
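As a rough check of that cubic growth, one could time the pivoted QR at two sizes. This is only a sketch: the sizes below are arbitrary assumptions, and the measured ratio will often differ from 8 because of multithreading and cache effects, so no expected output is shown.

```matlab
% Sketch: compare pivoted-QR timings when the matrix size doubles.
% n is an assumed size, chosen only for illustration.
n  = 400;
X1 = rand(n,n/2)*rand(n/2,n);        % n-by-n matrix of rank n/2
X2 = rand(2*n,n)*rand(n,2*n);        % 2n-by-2n matrix of rank n

% The second argument of timeit requests three outputs from qr,
% so that the column-pivoted factorization is the one being timed.
t1 = timeit(@() qr(X1,0), 3);
t2 = timeit(@() qr(X2,0), 3);

ratio = t2/t1;                       % cubic scaling would suggest roughly 8
fprintf('n = %d: %.4f s, n = %d: %.4f s, ratio = %.2f\n', n, t1, 2*n, t2, ratio);
```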


John D'Errico on 24 Aug 2020

Edited: John D'Errico on 24 Aug 2020

I have no idea where you found lincols, so that means I need to find it. It was probably on the file exchange. It always helps if you give a link when you say that you found code.

https://www.mathworks.com/matlabcentral/fileexchange/77437-extract-linearly-independent-subset-of-matrix-columns

But what does lincols do? IT CALLS QR, a column pivoted QR factorization to be exact. And that is what I'd probably use. Look at the code you found.

Is QR the best tool for the job? Probably. A pivoted QR is an excellent tool for this purpose. It is fast, efficient, and numerically stable, and better in all respects than you will get from any other tool you might choose.

The guts of lincols is just a call to that pivoted QR. I won't bother downloading lincols.

function E = testlindep(X)
% Column-pivoted, economy-size QR; E is the column permutation vector.
[Q,R,E] = qr(X,0);
end

Now let me test it on a random rank 400 matrix of size 1000x1000.

X = rand(1000,400)*rand(400,1000);
timeit(@() testlindep(X))
ans =
            0.035278595132

And that is far better than any alternative you will find. So lincols will be fast as blazes. The main alternative I might have tried is rref, but rref is not compiled code, and is hugely slower than qr. That it takes more time for larger problems is, as I said, just a reflection that big problems take big time or big computers.
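For completeness, here is a minimal sketch of how the outputs of the pivoted QR could be turned into an actual set of column indices. The tolerance rule and the variable names (d, tol, r, idx, Xsub) are assumptions made for illustration; they are not taken from the lincols source.

```matlab
% Sketch: pick a linearly independent subset of columns via pivoted QR.
rng(0);                              % reproducible example
X = rand(1000,40)*rand(40,100);      % 1000-by-100 matrix of rank 40

[~,R,E] = qr(X,0);                   % economy-size QR with column pivoting
d   = abs(diag(R));                  % magnitudes are non-increasing with pivoting
tol = max(size(X))*eps(d(1));        % assumed rank tolerance, not from lincols
r   = nnz(d > tol);                  % estimated numerical rank

idx  = sort(E(1:r));                 % indices of the retained columns
Xsub = X(:,idx);                     % linearly independent subset of columns
```

The first r entries of E identify the columns the factorization picked before the diagonal of R dropped to the noise level; sorting them only restores the original column order.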

Make sure that you are not memory limited, as that could cause problems.

While someone will surely tell you to compile the code, that will not help, since qr is already incredibly highly optimized, and it is already compiled. If someone tells you to use the parallel computing toolbox, again, that will be a waste of time, since qr is already going to be using all of your available cores on big problems.

Is there anything you can do? Perhaps. You can gain a little bit, if you really don't care how accurate it is.

Xs = single(X);
timeit(@() testlindep(Xs))
ans =
            0.024235015132

So in this case, converting the array to single precision cut the time of the QR call by roughly 30%. This comes at a considerable cost in numerical accuracy. And you would need to hack the lincols code, since the tolerance it uses is far too small in the context of a single precision array.
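To illustrate the tolerance issue, here is a hedged sketch of a rank test scaled to single precision. The tolerance formula is an assumption for illustration, not what lincols uses. The point is only that eps of a single value is many orders of magnitude larger than eps of a double, so a double-sized tolerance will typically count rounding noise as rank.

```matlab
% Sketch: rank estimate from a single precision pivoted QR.
rng(0);                                  % reproducible example
X  = rand(500,20)*rand(20,60);           % 500-by-60 matrix of rank 20
Xs = single(X);

[~,Rs,Es] = qr(Xs,0);                    % pivoted QR in single precision
ds = abs(diag(Rs));

% Tolerance scaled to single precision (assumed rule): eps of a single
% value returns the spacing of single precision numbers.
tolS = max(size(Xs))*eps(ds(1));
rS   = nnz(ds > tolS);                   % rank estimate with a suitable tolerance

% The same rule evaluated with double precision eps is far too tight here.
tolD = max(size(Xs))*eps(double(ds(1)));
rD   = nnz(ds > tolD);                   % typically overestimates the rank

idxS = sort(Es(1:rS));                   % columns kept under the single tolerance
```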

Another option, if you have the Parallel Computing Toolbox, is to offload the computation onto your GPU. (I don't even know if that can be done, since I don't have that toolbox.)

John D'Errico on 24 Aug 2020

Edited: John D'Errico on 24 Aug 2020

The problem is, in order to use QR for this purpose, you need to use the THREE output version of QR. The reason QR does the work for you is in the column pivoting. At each step, it kills off what it has effectively already seen, then it takes the column that is most linearly independent from those it has already seen. This is why it works for your purpose.
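The greedy selection described above can be illustrated with a small hand-rolled loop. This is a teaching sketch based on simple projections, not how qr is implemented internally; the example matrix, the names Rres and order, and the tolerance are all made up for illustration.

```matlab
% Sketch: greedy column selection in the spirit of column pivoting.
X = [1 1 0 2;
     0 0 1 0;
     1 1 1 2];                       % columns 1, 2 and 4 are parallel

Rres  = X;                           % residual columns, updated at each step
order = zeros(1,0);                  % indices of the columns picked so far
tol   = 1e-10;                       % assumed threshold for "nothing left"

for k = 1:min(size(X))
    norms = sqrt(sum(Rres.^2,1));    % what remains of each column
    [m,j] = max(norms);              % column most independent of those picked
    if m <= tol
        break                        % every remaining column is dependent
    end
    order(end+1) = j;                %#ok<SAGROW>
    q    = Rres(:,j)/m;              % unit vector along the chosen residual
    Rres = Rres - q*(q'*Rres);       % kill off that direction in all columns
end

disp(order)                          % picks column 4 first, then column 3
```

Column 4 has the largest norm, so it is taken first. Removing its direction wipes out columns 1 and 2 as well, leaving column 3 as the only one with anything new to contribute.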

It is important that qr does this task on sparse matrices, since your problem is sparse. However, if that cannot be pushed onto the GPU, you just need to use the fastest server you can find. The most available cores would be important.

For example, take this problem, which is not very sparse but is large enough to keep all 8 cores of my computer humming long enough to see it happen:

X = sprand(5000,20000,.01);
[Q,R,E] = qr(X,0);

Then I see MATLAB is indeed using the full capacity of my computer, all 8 cores. On small problems, only 1 core will wake up. And there is a big difference between 2 cores on a laptop and 8 or 12 or more. (A lot of heat generated too.)
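One way to check how many computational threads MATLAB is prepared to use on a given machine is maxNumCompThreads. This is a small sketch; the reported number depends on the machine and on how MATLAB was started, so no expected output is shown.

```matlab
% Sketch: query the current maximum number of computational threads.
nThreads = maxNumCompThreads;
fprintf('MATLAB may use up to %d computational threads.\n', nThreads);
```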

My point is, for the most speed, find a machine with as many physical cores as you can get access to. If you can get time on something with 32 or 64 or 128 cores, then do so.

Long Hong on 24 Aug 2020

Thanks @John for your suggestions! My university does have some supercomputers for this purpose (with many cores). I will try to find a way to use them. :D
