Regression with tall array (Using datastore, CSV) - Error

Hi

5 Comments

Ive J on 12 Jul 2021
Edited: Ive J on 12 Jul 2021
Do you mean this?
result = fitglm(x, y, 'Distribution', 'binomial', 'Link', 'logit');
I ask because you have an extra ) there (though I'm sure the error is complaining about something else).
Can you confirm you have tall arrays (for x and y)?
istall(x)
ans =
logical
1
Also, are you trying to set the formula? The error says so, but your call to fitglm doesn't show this.
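A side note on this check: istall only reports whether a variable is tall, not whether it is a tall table or a tall numeric array, and fitglm treats those two cases differently. A minimal sketch of a more specific check, assuming a small made-up table in place of the real data (the variable names here are hypothetical):

```matlab
% Hypothetical stand-in data: a small table wrapped in a tall array
x = tall(array2table(magic(4)));

isTall  = istall(x);                    % true for any tall array, table or not
clsName = gather(classUnderlying(x));   % class of the underlying data as a char vector

% 'table' means a tall table; 'double' would mean a tall numeric matrix
disp(clsName)
```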
K.P. on 12 Jul 2021
Yes, your fitglm line is the one I have; the ) was a copy-paste error.
And yes, x and y are both tall arrays.
No, I am not calling a special formula.
Ive J on 12 Jul 2021
Can you share the output of your dependent/independent variables?
x
y
K.P. on 12 Jul 2021
x is a 1000x500 (tall) table. These are the first entries:
7 6 12 12 15 13 12 30 71 6
3 4 4 0 0 1 10 2 6 1
1 0 0 0 0 0 2 0 0 0
1 0 4 0 0 0 0 0 4 0
6 3 5 2 0 0 10 0 3 0
3 26 10 3 0 2 15 7 24 1
17 85 5 4 0 0 29 0 6 0
1 0 1 0 0 2 1 0 0 0
2 0 3 0 0 0 9 0 4 0
5 18 11 2 0 1 6 0 3 0
3 1 0 0 0 2 4 0 0 0
2 0 0 0 0 0 0 0 0 0
2 0 10 0 0 0 0 0 0 0
2 0 1 1 0 3 0 0 3 0
2 16 3 0 0 0 3 2 36 1
y is a 1000x1 (tall) table and the first entries are:
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
dpb on 12 Jul 2021
I just tried to see whether the problem was the combination of tall arrays and fitglm:
>> X=[1:1000].'; X=tall(X);
>> Y=randn(size(X)); % this is interesting sidelight on the way...
Error using randn
Size inputs must be numeric.
>> size(X)
ans =
1×2 tall double row vector
1000 1
>> Y=randn(1000,1); Y=tall(Y); % OK, have to brute-force it
>> fitglm(X,Y,'Distribution',"normal")
Iteration [1]: 0% completed
Iteration [1]: 50% completed
Iteration [1]: 100% completed
Iteration [2]: 0% completed
Iteration [2]: 50% completed
Iteration [2]: 100% completed
Iteration [3]: 0% completed
Iteration [3]: 100% completed
ans =
Compact generalized linear regression model:
y ~ 1 + x1
Distribution = Normal
Estimated Coefficients:
                    Estimate         SE          tStat      pValue
                   __________    __________    ________    _______
    (Intercept)     0.0015036      0.064429    0.023338    0.98139
    x1             1.6177e-05    0.00011151     0.14507    0.88468
1000 observations, 998 error degrees of freedom
Estimated Dispersion: 1.04
F-statistic vs. constant model: 0.021, p-value = 0.885
>>
So, fitglm will accept tall arrays; the problem must lie elsewhere in the syntax, it would seem...
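On the randn sidelight above: size of a tall array returns another tall array (as the "1×2 tall double row vector" output shows), which is why randn rejects it. One way around it, sketched here on the assumption that the size vector is small enough to bring into memory (it always is), is to gather it first:

```matlab
X  = tall((1:1000).');

sz = gather(size(X));   % evaluate the tall size and return an ordinary double vector
Y  = tall(randn(sz));   % randn now receives numeric size inputs
```

This avoids hard-coding the 1000 in randn(1000,1).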


Accepted Answer

Ive J on 13 Jul 2021
Edited: Ive J on 13 Jul 2021


Well, your data is a tall table, and that's what MATLAB complains about: since your first argument is a table, MATLAB interprets the second argument (y) as modelspec. You have two options:
% Option 1: feed fitglm with matrices extracted from the tables
mdl = fitglm(x{:, :}, y{:, :}, 'Link', 'logit', 'Distribution', 'binomial');
% Option 2: merge x and y into a single table
data = [x, y]; % last column is the dependent variable by default
mdl = fitglm(data, 'Link', 'logit', 'Distribution', 'binomial');
Btw, your data is fairly small and (I assume) fits in memory; tall arrays should be avoided for such small datasets.
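For anyone who wants to try both options without the original CSV, here is a rough sketch on made-up data. The table contents, sizes, and variable names are all hypothetical, and option 1 uses table2array rather than the x{:, :} indexing above, on the assumption that the two are equivalent for an all-numeric table:

```matlab
% Hypothetical stand-in for the real data: 3 count predictors and a 0/1 response
rng(0);
n   = 200;
tbl = array2table([randi([0 20], n, 3), double(rand(n, 1) > 0.5)], ...
    'VariableNames', {'x1', 'x2', 'x3', 'y'});

T = tall(tbl);
x = T(:, {'x1', 'x2', 'x3'});   % tall table of predictors
y = T(:, 'y');                  % tall table holding the response

% Option 1: convert the tall tables to tall numeric arrays first
mdl1 = fitglm(table2array(x), table2array(y), ...
    'Distribution', 'binomial', 'Link', 'logit');

% Option 2: pass one tall table; the last variable is taken as the response
data = [x, y];
mdl2 = fitglm(data, 'Distribution', 'binomial', 'Link', 'logit');
```

Both calls fit the same model, so the coefficient estimates should agree; calling fitglm(x, y, ...) with the two tall tables directly is the form that triggers the modelspec error.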

2 Comments

K.P. on 13 Jul 2021
Hi Ive,
I merged the x and y tables and converted the new table before building the tall array with:
ds = transform(ds,@table2array);
Now it works. Thanks for your help!
PS: the file here was only a smaller sample. The "real" one is 320000x30000.
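For later readers, the datastore route described above might look roughly like the sketch below. The CSV file, its name, and its column layout are made up so that the example is self-contained; only the transform(ds, @table2array) step comes from this thread:

```matlab
% Hypothetical sample file: 3 predictors plus a 0/1 response in the last column
rng(0);
nPred = 3;
T = array2table([randi([0 20], 100, nPred), double(rand(100, 1) > 0.5)], ...
    'VariableNames', {'x1', 'x2', 'x3', 'y'});
csvFile = fullfile(tempdir, 'glm_sample.csv');   % made-up file name
writetable(T, csvFile);

ds = tabularTextDatastore(csvFile);
ds = transform(ds, @table2array);   % each read now returns a numeric matrix
tt = tall(ds);                      % tall double instead of a tall table

X = tt(:, 1:nPred);                 % predictors
Y = tt(:, nPred + 1);               % response (last column)
mdl = fitglm(X, Y, 'Distribution', 'binomial', 'Link', 'logit');
```

Because the conversion happens inside the datastore, fitglm only ever sees tall numeric arrays and never tries to read the second argument as a model specification.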
Ive J on 13 Jul 2021
If I were you I would also test with arrays. Processing tables is almost always (based on my experience) slower than arrays.
Good luck!


Asked: 12 Jul 2021

Edited: 1 Aug 2021
