How to replicate Regression Learner app-based training using a MATLAB script?

47 views (last 30 days)
Quazi Hussain
Quazi Hussain on 15 Aug 2025 at 18:29
Edited: dpb on 18 Aug 2025 at 17:38
I have trained an ML model in the Regression Learner app using an optimizable GPR model with the default settings (5-fold validation, 30 iterations, etc.). Now I am trying to do the same using a MATLAB script, with the following call, where X are the regressors and Y is the response variable.
>> ML_mdl=fitrgp(X,Y,'OptimizeHyperparameters','all','HyperparameterOptimizationOptions',struct('KFold',5))
Are the two resulting models more or less equivalent? I know there will be some difference due to the probabilistic nature of the algorithm. When I test the model on the entire training set, the R-squared value is practically 1.0. Is it overfitting even with k-fold cross-validation? The prediction on the unseen testing set is not that good. Any suggestions?
3 Comments
dpb
dpb on 16 Aug 2025 at 19:29
Edited: 17 Aug 2025 at 16:56
"Is it overfitting even with K-fold cross-correlation? The prediction on unseen testing set is not that good. Any suggestions?"
Possibly. Depends on how much data you've got although the other possibility is that the other dataset simply is different from the dataset used for training.
Without the data to look at, we're shooting in the dark.
As an aside, regarding @Umar's comment "slightly different": a recent thread here in the forum illustrated that the randomized selection of the training dataset occasionally produced a grossly different result from the same overall dataset. That indicated there were subsets of the total dataset with markedly different characteristics than other random subsets. One cannot naively assume that recalculating with a different training subset will always produce similar model estimates; that will be true only if all random subsets of the overall data are similar to each other in their pertinent characteristics.

In particular, different models are sensitive to different things; for example, some may be very susceptible to outliers, in which case a training set that happens to pick up a single outlier may result in a very different model from a training set without any such extreme values. Unfortunately, "it all depends", and about the only way to know with such algorithms is to run a number of times and observe just how stable (or unstable) the results are.
OLS, on the other hand, uses the entire dataset and so is deterministic, although again the results may be affected by the presence of outliers, and just how strongly still depends upon the particular model chosen.
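The "run a number of times and observe stability" suggestion above could be sketched roughly as follows. This is a hypothetical illustration, not code from the thread; it assumes X and Y are the questioner's regressor matrix and response vector, and uses the same fitrgp call from the question with different seeds:

```matlab
% Sketch: refit the optimizable GPR several times under different RNG
% seeds and compare the resulting cross-validation losses. Large spread
% across seeds would indicate the dataset sensitivity discussed above.
seeds  = 1:5;
cvLoss = zeros(numel(seeds),1);
for k = 1:numel(seeds)
    rng(seeds(k));                     % different seed -> different CV partitions
    mdl = fitrgp(X, Y, ...
        'OptimizeHyperparameters','all', ...
        'HyperparameterOptimizationOptions', ...
            struct('KFold',5,'ShowPlots',false,'Verbose',0));
    % k-fold MSE on a fresh partition of the *same* data
    cvLoss(k) = kfoldLoss(crossval(mdl,'KFold',5));
end
fprintf('CV MSE across seeds: min %.4g, max %.4g\n', min(cvLoss), max(cvLoss));
```

A tight min/max range would suggest the fit is stable; a wide one would point to heterogeneous subsets of the kind dpb describes.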
Umar
Umar on 17 Aug 2025 at 17:49

@dpb - You're absolutely right, I oversimplified that. The "slightly different" comment assumes well-behaved data, but as you point out, some datasets can produce dramatically different models depending on the random subset selection.

For @Quazi Hussain's case, this variability could actually explain the overfitting issue. If CV folds are inconsistent due to data heterogeneity, the hyperparameter optimization might be fitting noise rather than signal.

Good suggestion to run multiple times with different seeds to check stability - high variability would indicate the dataset sensitivity you mentioned.

Thanks for the clarification.


Accepted Answer

dpb
dpb on 15 Aug 2025 at 20:50
To replicate the fit, generate the training function from the Learner app.
To produce identical results, set the random number seed before doing the fit calculation in both the Learner app and at the command line.
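At the command line, that would look something like the following (a minimal sketch assuming the X and Y from the question; the seed value 1 is arbitrary, but must match whatever was set before training in the app):

```matlab
% Fix the RNG state immediately before fitting so the command-line run
% starts from the same point in the random stream as the app-based run.
rng(1);
ML_mdl = fitrgp(X, Y, ...
    'OptimizeHyperparameters','all', ...
    'HyperparameterOptimizationOptions', struct('KFold',5));
```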
2 Comments
Quazi Hussain
Quazi Hussain on 18 Aug 2025 at 13:43
In a script, I can set the random number generator to a seed, say 1, by calling rng(1) right before the fit command. How do I do that in Regression Learner? Do I do that in the MATLAB command window prior to invoking the app?
>> rng(1)
>> regressionLearner
or is there somewhere in the app settings I can do that? Thanks.
dpb
dpb on 18 Aug 2025 at 14:28
Edited: 18 Aug 2025 at 17:38
Yes, set it in the MATLAB command window prior to invoking(*) the app; the random number generator stream is global in MATLAB, so it will pick up from the last invocation/reset.
This means, of course, that you can't call anything else that generates another random number between setting the seed and the fit evaluation, or the two runs won't be at the same point in the stream.
It probably would not be a bad enhancement request to ask for there to be a way to set the seed inside the app to facilitate such use.
ADDENDUM:
(*) Actually, you should be able to just go to the command line while in the app and reset the seed...that would be easy enough to check that if set first, then run a fit that if then reset the seed to the same value that can replicate the fit.
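The check described in the addendum could be sketched as below. This is a hypothetical illustration with an arbitrary seed, assuming X and Y as in the question; it verifies that resetting the seed to the same value reproduces the same optimized fit:

```matlab
% Fit twice from the same RNG state and confirm the Bayesian optimization
% lands on the same hyperparameters both times.
opts = struct('KFold',5,'ShowPlots',false,'Verbose',0);
rng(42);
mdl1 = fitrgp(X, Y, 'OptimizeHyperparameters','all', ...
    'HyperparameterOptimizationOptions', opts);
rng(42);                               % reset to the same point in the stream
mdl2 = fitrgp(X, Y, 'OptimizeHyperparameters','all', ...
    'HyperparameterOptimizationOptions', opts);
% Should report identical optimized hyperparameters if replication worked
isequal(mdl1.HyperparameterOptimizationResults.XAtMinObjective, ...
        mdl2.HyperparameterOptimizationResults.XAtMinObjective)
```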


More Answers (0)

Categories

Find more on Support Vector Machine Regression in Help Center and File Exchange

Release

R2023b
