Are lassoglm solutions independent of data order?

Ken Johnson, 13 August 2024
Commented: Ken Johnson, 19 August 2024
I thought lassoglm solutions were unique, but I find that the solution from lassoglm depends on the order of the columns in X. Is there a way to avoid this? Here's my example: I have three X variables (three columns) and one Y variable, and I get different solutions with X ordered (var1, var2, var3) versus (var1, var3, var2). With the example code below, the fitted coefficients are:
CONC123 = [21.54, 1.689, 0.726]
CONC132 = [21.94, 2.558, 0]
Which solution is more correct? The deviance for the 123 solution is slightly smaller.
load('YX123') % Y and X(var1, var2, var3)
load('YX132') % Y and X(var1, var3, var2)
lambda = 0.0005; % lambda was optimized at 0.0005 with a training set
reltol = 1e-4;   % default value
alpha = 1;       % alpha = 1 gives pure lasso (no ridge penalty)
[CONC123, FitInfo123] = lassoglm(X123, Y, 'normal', 'Alpha', alpha, 'Lambda', lambda, 'RelTol', reltol);
[CONC132, FitInfo132] = lassoglm(X132, Y, 'normal', 'Alpha', alpha, 'Lambda', lambda, 'RelTol', reltol);

Answers (1)

Ayush, 16 August 2024
Hi Ken,
The issue you're encountering is related to the numerical stability and convergence properties of the optimization algorithm used in the lassoglm function. The order of the columns in the X matrix can sometimes affect the solution because of these numerical properties, even though, in theory, lasso regression should yield the same solution regardless of the column order in X.
Here are several methods I use to mitigate this type of numerical instability:
  1. Standardize the features: Standardizing the features, i.e. scaling them to zero mean and unit variance, can make the optimization more stable and less sensitive to the order of the columns. Here's the code for standardizing the features and running the lasso regression on the standardized features:
% Standardize the features
X123_standardized = zscore(X123);
X132_standardized = zscore(X132);
% Perform Lasso regression on standardized features
[CONC123, FitInfo123] = lassoglm(X123_standardized, Y, 'normal', 'Alpha', alpha, 'Lambda', lambda, 'RelTol', reltol);
[CONC132, FitInfo132] = lassoglm(X132_standardized, Y, 'normal', 'Alpha', alpha, 'Lambda', lambda, 'RelTol', reltol);
  2. Check the deviance: If the deviance for one solution is smaller, that solution is generally more desirable. However, it's essential to confirm that the smaller deviance is not due to overfitting.
deviance123 = FitInfo123.Deviance;
deviance132 = FitInfo132.Deviance;
if deviance123 < deviance132
    % solution 123 is preferred
else
    % solution 132 is preferred
end
Note: One more technique I generally use is cross-validation. It helps ensure that the chosen model generalizes well to unseen data, and it can also help mitigate the sensitivity to feature order.
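The cross-validation step can be sketched with lassoglm's built-in 'CV' option. This is a sketch assuming the X123 and Y variables from the question; the 10-fold choice is arbitrary, and fixing the random seed is only there to make the CV partition reproducible:

```matlab
rng('default') % make the cross-validation partition reproducible
% 10-fold cross-validated lasso over lassoglm's default lambda sequence
[B, FitInfoCV] = lassoglm(X123, Y, 'normal', 'Alpha', 1, 'CV', 10);
% Lambda with the minimum cross-validated deviance
lambdaMin = FitInfoCV.LambdaMinDeviance;
% Coefficients at that lambda
CONC_cv = B(:, FitInfoCV.IndexMinDeviance);
```

If the cross-validated lambda agrees between the two column orderings, the remaining coefficient differences are more likely a numerical-convergence artifact than a real model difference.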
So, by standardizing your features, comparing deviances, and using cross-validation, you can reduce the sensitivity of your lasso regression solutions to the order of the columns in X. The solution with the smaller deviance is generally more correct, but it's crucial to confirm that this is not due to overfitting.
For standardization, I've used the zscore function; you can read more about it in the MATLAB documentation. For more about lasso regularization, refer to the lassoglm documentation.
Hope it helps!
  1 Comment
Ken Johnson, 19 August 2024
Super, thank you.


Release: R2020b
