Using unbalanced data with fitlme

Hi,
I am trying to get fitlme to work with unbalanced data, but I always get the error "Fixed Effects design matrix X must be of full column rank." I looked into the code to find the cause: fitlme truncates my data but retains the category names from the input table, which leads to rank deficiency.
My data has complete rows, but also rows with missing values, for example [2012, 'String1', 1, NaN, 44.91, 62.9]. The last column is the response; the rest are predictors. Inside fitlme, my 12042-row input table is truncated to 628 rows, so apparently every row containing at least one NaN is deleted.
Shashank Prasanna talks about unbalanced data in this video, but how exactly does that work? I have tried everything I could think of and don't know how to proceed.

6 Comments

the cyclist on 2 Jan 2023
I'm a bit confused by your use of the word "unbalanced" here. I am used to that being used for categorical response variables, where you have many more observations in one class than the other. (Could you perhaps give us a timestamp of when in the video he talks about this?)
It seems to me that what you are describing here is a pretty serious missing data problem (not imbalance). I guess my first question to you would be, "What do you want to happen with a row like the example you give?" There are several methods that can be used to impute missing data.
It would be helpful if you uploaded your data (or at least a subset that illustrates the problem). You can do that using the paper clip icon in the INSERT section of the toolbar.
Torsten on 2 Jan 2023
"So when I look into the fitlme function it truncates my 12042 rows input table to a 628 rows table, so that apparently every row gets deleted where at least one NaN value is present."
But if you have 3 columns and 628 rows, why don't you have full column rank (3)? It's almost impossible for 3 column vectors of length 628 to be linearly dependent.
Tobias Averbeck on 2 Jan 2023
Edited: Tobias Averbeck on 2 Jan 2023
@the cyclist Regarding the video: at 5:30 he briefly explains unbalanced data, and around 23:35 he shows unbalanced data but unfortunately doesn't use it; he works with his complete dataset instead. I uploaded a sample taken from a bigger dataset that has many more missing data points. In the end I basically want to see which columns or interactions have the most influence on the response, and the video seemed to go in this direction.
@Torsten The rank deficiency comes from the design matrix. I used dummy variables, so it is a 628x203 matrix. The problem is that fitlme creates a dummy variable for every unique string, but after truncation not all of those strings remain in the dataset. Some columns of the design matrix are therefore all zeros, which causes the rank deficiency. I hope this is understandable.
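To make the zero-column effect described above concrete, here is a minimal sketch on made-up data (the variable g and its levels 'A', 'B', 'C' are hypothetical, not taken from the attached table). It builds one indicator column per declared category level by hand, then checks the rank when one level has no rows left.

```matlab
% Toy illustration (hypothetical data, not the poster's table):
% three category levels are declared, but only two still have rows.
g = categorical({'A';'B';'A';'B'}, {'A','B','C'});   % level 'C' has no rows

levels = categories(g);
X = zeros(numel(g), numel(levels));
for k = 1:numel(levels)
    X(:,k) = (g == levels{k});    % one indicator column per declared level
end

% The column for 'C' is all zeros, so X has 3 columns but rank 2.
fprintf('columns = %d, rank = %d\n', size(X,2), rank(X));

% Dropping levels that no longer occur removes the empty category.
g2 = removecats(g);
disp(categories(g2))
```

Whether fitlme builds its design matrix exactly this way is not something the sketch shows; it only illustrates why an unused level would produce a rank-deficient matrix.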
the cyclist on 2 Jan 2023
@Tobias Averbeck, can you also share the code that gives the error you are seeing? It's best if we can replicate what you are doing as closely as possible.
the cyclist on 2 Jan 2023
If you are OK with using only the data where you have complete rows, then you can just remove the incomplete rows yourself, before calling fitlme and the creation of the dummy variables.
If you are not OK with that, then I would again say that you need to solve your missing data problem, not your rank deficiency issue.
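A rough sketch of the first option (removing incomplete rows yourself before fitlme sees the table) could look like the following. The table, its variable names (Group, Site, X1, Response) and the level 'S4' are invented for illustration, and the sketch assumes the string predictors are stored as categorical so that unused levels can be dropped with removecats.

```matlab
rng(0);                                    % reproducible toy data (hypothetical)
n = 60;
Group = categorical(randi(4, n, 1));       % grouping variable for the random intercept
siteNames = repmat({'S1';'S2';'S3'}, n/3, 1);
siteNames(1:5) = {'S4'};                   % a level that occurs only in incomplete rows
Site = categorical(siteNames);
X1 = randn(n, 1);
X1(1:5) = NaN;                             % make the first five rows incomplete
Response = 2*X1 + randn(n, 1);
T = table(Group, Site, X1, Response);

Tc = rmmissing(T);                         % keep complete rows only
Tc.Site  = removecats(Tc.Site);            % drop levels with no remaining rows
Tc.Group = removecats(Tc.Group);

% Default dummy coding is used here instead of 'full'.
lme = fitlme(Tc, 'Response ~ 1 + Site + X1 + (1|Group)', 'FitMethod', 'REML');
disp(lme.Coefficients)
```

The sketch leaves 'DummyVarCoding' at its default because an intercept plus one indicator column per level is linearly dependent on its own (the indicators sum to the intercept column), independent of any missing data.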
Tobias Averbeck on 2 Jan 2023
This is sample data only. If I removed every row containing a NaN from the entire dataset, there would be none left, so I need a way to build a model with missing data points. The database is also as complete as it will ever be; there is no way to get more data points, so I can't solve this any other way.
The following code with the attached table reproduces the error:
Formula = 'Response ~ 1 + Predictor2 + Predictor3 + Predictor4 + Predictor5 + (1|Predictor1)';
lme = fitlme(DataSampleTable,Formula,'DummyVarCoding','full','FitMethod','REML');

Answers (2)

Sulaymon Eshkabilov on 2 Jan 2023

0 votes

Suggestion: if you are not using all columns of your data, it is reasonable to clean up only the columns that are being used by removing the rows where data is missing (NaN). You can use isnan() or ismissing() to clean up your data before processing it with fitlm() or fitlme(). Note that the example data used in the demo video has an excess of data points.
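As a small sketch of this suggestion, the snippet below compares how many rows survive when missing values are checked in every column versus only in the columns that enter the model. The table and its variable names (Predictor, Unused, Response) are made up for illustration.

```matlab
% Hypothetical table: 'Unused' is mostly missing but is not part of the model.
T = table([1;2;NaN;4;5], [NaN;NaN;NaN;NaN;1], [10;20;30;40;50], ...
    'VariableNames', {'Predictor','Unused','Response'});

modelVars = {'Predictor','Response'};               % only the columns in the formula

Tall  = rmmissing(T);                               % complete in every column
Tused = rmmissing(T, 'DataVariables', modelVars);   % complete in the model columns

fprintf('all columns: %d rows, model columns only: %d rows\n', ...
    height(Tall), height(Tused));

% Equivalent manual filter with ismissing:
keep = ~any(ismissing(T(:, modelVars)), 2);
Tmanual = T(keep, :);
```

This only helps when the columns outside the model account for most of the gaps; it does not add information to rows that are incomplete in the model columns themselves.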

1 Comment

Tobias Averbeck on 2 Jan 2023
With the data sample this is possible, but not with my whole dataset. It has many more columns, and there is no single row with a value in every column. So how do you solve the problem when you have such patchy data but want to know how much influence each variable has on the response?

the cyclist on 4 Jan 2023

0 votes

You need to learn about data imputation methods. This is not, at its core, a MATLAB problem.
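For readers who want a starting point, here is a minimal sketch of two very simple single-value imputations using fillmissing on a made-up numeric vector. These are only the most basic options, chosen for illustration; which method is appropriate depends on why the data are missing, and single-value imputation generally understates uncertainty.

```matlab
x = [1; NaN; 3; NaN; 5];                   % hypothetical predictor with gaps

wasMissing = isnan(x);                     % flag imputed entries so they stay identifiable

% Mean imputation: replace each NaN with the mean of the observed values.
xMean = fillmissing(x, 'constant', mean(x, 'omitnan'));

% Linear interpolation: only sensible when the rows have a meaningful order.
xLin = fillmissing(x, 'linear');

disp(table(x, wasMissing, xMean, xLin))
```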

Asked: 2 Jan 2023
Answered: 4 Jan 2023
