Nonlinear regression + Cross Validation = possible?

Hello. I want to know: is it possible to perform cross-validation on a nonlinear regression model?

Accepted Answer

Star Strider
Star Strider on 16 Jun 2017

0 votes

Cross-validation is used to assess the performance of classifiers.
Nonlinear regression does curve fitting (objective function parameter estimation).
These are two entirely different statistical techniques. What are you doing? How would you use cross-validation with your nonlinear regression?

19 Comments

wesleynotwise
wesleynotwise on 16 Jun 2017
Glad to see you here again. You know I have been developing a model. Now I would like to test/validate my model, for which I thought one could use the cross-validation method. I can, of course, use another set of independent data to test the model; I am just wondering if cross-validation works better?
p/s: I thought I had read somewhere that cross-validation can be used to ensure that the model is not overfitting? Correct me if I am wrong.
Star Strider
Star Strider on 16 Jun 2017
You, too!
If you are doing curve fitting, you simply need to calculate statistics on the fit to see if the model accurately explains your data. The F-statistic and parameter confidence intervals are important here. (The fitnlm function will provide these.) If you have more than one model, deciding which of them best explains your data can be complicated, although it is relatively straightforward if both models have the same number of parameters.
Cross-validation assesses the performance of classifiers. It cannot be used for other purposes.
wesleynotwise
wesleynotwise on 16 Jun 2017
Yeah. The fitnlm function provides the statistical justification. But my goal is to test my model with an independent set of data; I bet the only way to do that is to randomly select data from my total data set and hold them out while the model is being developed?
So, cross-validation is only used in machine learning, right? Now I know.
Also, can I check with you: is there a way to tell the error of estimation in MATLAB? For example, the predicted value is 30 +/- 2, or saying that the model carries 10% error? Something along those lines?
Star Strider
Star Strider on 16 Jun 2017
You are dealing with two different ideas: system identification and parameter estimation. Once you have a mathematical model of your system (or process), you can estimate the parameters. If it describes your system, the statistics will reflect that.
You seem to be describing ‘bootstrap sampling’, implemented in the bootstrp (link) and related functions. I am not certain what you would gain by that, but then I do not know what you are researching.
If you get a parameter estimate that is noted as ‘30±2’, it means that (with the default 95% confidence limits), 95% of the time you run your model with similar data (acquired under the same conditions from the same system), your parameter estimate will be between 28 and 32. It expresses only the uncertainty in the estimate.
A measure close to your ‘10% error’ idea is the coefficient of determination (link), also known as R², described as ‘the proportion of the variance in the dependent variable that is predictable from the independent variable(s)’.
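As a minimal sketch of the bootstrap idea mentioned above (the data and the choice of the mean as the statistic are invented for illustration; bootstrp and prctile are in the Statistics and Machine Learning Toolbox):

```matlab
% Sketch: a percentile bootstrap interval for a simple statistic.
rng(0)                                   % reproducible resampling
data  = 5 + randn(100,1);                % invented sample
bstat = bootstrp(1000, @mean, data);     % 1000 bootstrap means
ci    = prctile(bstat, [2.5 97.5])       % rough 95% percentile interval
```

The same pattern works with any statistic, including regression parameter estimates, by passing a different function handle to bootstrp.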
wesleynotwise
wesleynotwise on 16 Jun 2017
You read my mind very well... I am looking for a bootstrap sampling + iteration (k-fold) kind of function to test my model. Well, my ideas/knowledge are pretty mixed up. Let me sort them out. At the moment, I think the best way to test the validity of the model is to use an independent set of data.
Sorry, I think my example above was not a good one. What I meant was how to tell the error term of a predicted value. Let's say, for a given set of conditions, the estimated value from the model is 30. I thought (though I am not sure) people normally report it as '30 with x% of error'. The R-squared is perhaps not what I was referring to, as it describes the performance of a model, not a predicted value.
Star Strider
Star Strider on 17 Jun 2017
As a clinician (retired), I have significant practice with that!
I have no recent experience with the bootstrap technique, so I must leave that to you to explore. The idea is just as you described what you want to do.
Your statement with respect to '30 with x% of error' is not a way I have ever seen a statistic reported. Parameter estimates are reported as I previously described them, as essentially a range of probable values the parameter may take.
wesleynotwise
wesleynotwise on 17 Jun 2017
Cool. I am very impressed with you and your professionalism!!
I think the '30 with x% of error' is something like a forecasting error measurement, which is used in the finance/business/meteorology fields? E.g.: mean absolute percentage error (MAPE) and mean absolute deviation (MAD).
'A range of probable values the parameter may take' is the kind of alternative description I am looking for. Something like: the estimated value is 30, but it can be in a range of 28 to 32. Do you have any idea how this can be done?
Star Strider
Star Strider on 17 Jun 2017
Thank you!
I am not familiar with the financial forecasting applications. The meteorological probabilities are arrived at (as I recall) by running a model 10 times with slightly different initial conditions, creating a spatial histogram of precipitation probabilities for each area. (I forgot the technical term for this histogram.) That becomes the precipitation probability, always reported in steps of 10%.
‘Do you have any idea how this can be done?’
It is the normal way the parameter estimates are presented in MATLAB, for example in fitnlm using the coefCI function and the full-precision coefficient estimates from:
beta = mdl.Coefficients.Estimate
ci = coefCI(mdl)
The actual calculations can be found in references on regression parameter estimation. You can program them yourself if you need to, or if you just want to understand how they work.
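A minimal end-to-end sketch of the above, with invented data and a made-up two-parameter model (only fitnlm, mdl.Coefficients.Estimate, and coefCI are from the documented API; everything else is illustrative):

```matlab
% Sketch: fit a nonlinear model, then report the full-precision
% coefficient estimates and their default 95% confidence intervals.
rng(0)                                   % reproducible noise
x = (1:20)';
y = 2*exp(-0.1*x) + 0.05*randn(20,1);    % invented data
modelfun = @(b,x) b(1)*exp(b(2)*x);      % made-up two-parameter model
beta0 = [1; -0.5];                       % initial guess
mdl  = fitnlm(x, y, modelfun, beta0);
beta = mdl.Coefficients.Estimate         % full-precision estimates
ci   = coefCI(mdl)                       % default 95% intervals
```

Each row of ci gives the lower and upper bound for the corresponding parameter in beta, which is exactly the 'range of probable values' form discussed above.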
wesleynotwise
wesleynotwise on 17 Jun 2017
Edited: wesleynotwise on 17 Jun 2017
Hmm.. the concept used in obtaining the precipitation probability is interesting, but I don't think I need to go that route. Thanks for the info though.
Sorry for the confusion from my side. I know one can use coefCI for the beta (coefficient) values, but I am after the estimated values (the y).
y = (b1.*x1.^2)./(b2.*log(x2)); % eg: this is my model
b1 = 0.50, b2 = 7.50 % These are obtained from regression
x1 = 2, x2 = 3; % Test the model with a new set of data
y = 5.58 % The output for the given x values
Is there a way to report 'y = 5.58 +/- 0.50' or 'y = 5.58 with 10% error in the estimation'. Something like how the plotSlice function works? https://uk.mathworks.com/help/stats/nonlinearmodel.plotslice.html
Star Strider
Star Strider on 17 Jun 2017
See the documentation on the predict (link) function for NonLinearModel (link) class objects.
Again, the confidence intervals do not report percent error. They report that with the chosen probability (usually 95%), the true value of the estimated parameter or the predicted value of the regression lies within those limits. Please do not cling to the idea that they report ‘percent error’. That is not the correct way to interpret them!
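A minimal sketch of getting interval-bounded predictions for new x values (invented data and model; fitnlm and the predict 'Prediction','observation' option are the documented parts):

```matlab
% Sketch: point predictions with 95% bounds from a NonLinearModel.
rng(0)
x = (1:20)';
y = 2*x.^0.5 + 0.2*randn(20,1);          % invented data
mdl  = fitnlm(x, y, @(b,x) b(1)*x.^b(2), [1; 1]);
xnew = (2:2:10)';                        % new predictor values
[ypred, yci] = predict(mdl, xnew, 'Prediction', 'observation');
% yci(:,1) and yci(:,2) are the lower/upper 95% bounds on each
% prediction, so each value is reported as a range, not a percent error.
```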
wesleynotwise
wesleynotwise on 17 Jun 2017
Yes to predict function. Brilliant!
And, thanks for your advice, I have erased the idea of reporting percentage error from my memory.
Sorry for being selfish here, but if you happen to know how to adjust the transparency of markers (link) in a scatter plot, please help.
Star Strider
Star Strider on 17 Jun 2017
Thank you!
My pleasure. I apologise for being a bit forceful, but in science, details are important and cannot be neglected.
The MarkerFaceAlpha property requires a scalar for each scatter call. So if you want different transparency values, you have to set them in different scatter calls. This applies to the marker edges as well as the faces.
Example
x = rand(1, 10);
y = rand(1, 10);
figure(1)
scatter(x, y, 625, 'pg', 'MarkerFaceColor', [0 1 0], 'MarkerFaceAlpha',0.6)
hold on
scatter(x+0.01, y+0.04, 625, 'pr', 'MarkerFaceColor', [1 0 0], 'MarkerFaceAlpha',0.3)
hold off
grid
If you have several different sets with different alphas that you want to plot on the same axes, you will have to use a for loop. I used only two different calls here (and small offsets), so avoided the loop. I also used the default colour designators as well as the RGB-triplets. Both work.
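For the several-sets case, the loop might look like this (a sketch; the offsets, colours, and alpha values are arbitrary):

```matlab
% Sketch: one scatter call per data set, each with its own alpha,
% since MarkerFaceAlpha takes one scalar per call.
x = rand(1, 10);
y = rand(1, 10);
alphas = [0.8 0.5 0.2];                  % arbitrary transparency values
colors = [0 1 0; 1 0 0; 0 0 1];          % arbitrary RGB triplets
figure
hold on
for k = 1:numel(alphas)
    scatter(x + 0.02*k, y + 0.02*k, 625, 'p', ...
        'MarkerFaceColor', colors(k,:), 'MarkerFaceAlpha', alphas(k))
end
hold off
grid
```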
wesleynotwise
wesleynotwise on 18 Jun 2017
Edited: wesleynotwise on 18 Jun 2017
Thanks for the reply. I'm putting the chart plotting on hold, as I am dealing with other problems. I will come back to this post once the problem is solved. Thanks!
Oh, and don't worry about being forceful; I don't see it that way. All the discussions have been very useful and healthy, and I sincerely appreciate them.
Star Strider
Star Strider on 18 Jun 2017
My pleasure.
wesleynotwise
wesleynotwise on 21 Jun 2017
Paging for Star Strider. Paging for Star Strider!
Star Strider
Star Strider on 21 Jun 2017
?
wesleynotwise
wesleynotwise on 21 Jun 2017
Great god. You are here. Could you please help me with this subplot problem? I need your help here (click the link). Thank you!
Star Strider
Star Strider on 21 Jun 2017
I’m here occasionally these days.
I looked at the subplot problem when you posted it. I would not use subplot in that situation, instead just plotting all the data on one set of axes and using a legend call.
wesleynotwise
wesleynotwise on 21 Jun 2017
Edited: wesleynotwise on 21 Jun 2017
Ah. I still need subplot in my case, because of overlapping data points, and it makes the analysis easier for me. I think I now have an idea of how to crack it. Thanks.

Sign in to comment.

More Answers (1)

Greg Heath
Greg Heath on 22 Jun 2017
Edited: Greg Heath on 22 Jun 2017

1 vote

I am surprised to hear that SS thinks that cross-validation is not used for regression.
Maybe it is just a misunderstanding of terminology, but I have used cross-validation in regression many times.
Typically it is used when there are mounds of data:
1. Randomly divide the data into k subsets.
2. Then design a neural network model with two subsets: one for training and one for validation.
3. Test the net on the remaining k-2 subsets.
4. If the performance of one net is poor, the same data can be used several (say 10) times with different random initial weights. Then choose the best of the 10.
5. Finally, you can choose the best of the k nets, or combine m (<= k) nets.
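The splitting idea in steps 1-3 can be sketched for a generic regression fit. This is plain k-fold cross-validation (training on k-1 folds, testing on 1) rather than the exact 2-train/(k-2)-test split above; the data and model are invented, with fitnlm standing in for the network, and cvpartition/training/test coming from the Statistics and Machine Learning Toolbox:

```matlab
% Sketch: k-fold cross-validation of a regression model.
rng(0)
n = 100;  k = 5;
x = rand(n,1);
y = 3*x.^2 + 0.1*randn(n,1);             % invented data
cv  = cvpartition(n, 'KFold', k);
mse = zeros(k,1);
for i = 1:k
    tr  = training(cv, i);               % logical index of training rows
    te  = test(cv, i);                   % logical index of test rows
    mdl = fitnlm(x(tr), y(tr), @(b,x) b(1)*x.^b(2), [1; 1]);
    mse(i) = mean((y(te) - predict(mdl, x(te))).^2);
end
cvError = mean(mse)                      % cross-validated error estimate
```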
Hope this helps.
Thank you for formally accepting my answer
Greg

4 Comments

Star Strider
Star Strider on 22 Jun 2017
‘wesleynotwise’ is not using neural nets, or doing classification. He’s doing bootstrapping to estimate parameters. That’s completely different.
wesleynotwise
wesleynotwise on 22 Jun 2017
Hi Greg, as pointed out by Star Strider, I was not doing neural networks, so the cross-validation technique I was initially looking for does not apply to my case. It was me who got confused with the terminology and the statistical techniques. Instead, I have used 90% of my data for model building and 10% for validation.
Greg Heath
Greg Heath on 22 Jun 2017
Edited: Greg Heath on 22 Jun 2017
It doesn't matter what your model is; you can still use
1. k-fold cross-validation, where there are k distinct subsets, or
2. k-fold bootstrapping, where there are k non-distinct random subsets.
A driving factor is the ratio of the number of fitting equations to the number of parameters that have to be estimated.
Hope this helps.
Greg
wesleynotwise
wesleynotwise on 22 Jun 2017
Edited: wesleynotwise on 22 Jun 2017
Yes. Star Strider did point out that I was actually looking for bootstrap sampling techniques. My tiny wee brain cannot cope with that at the moment; that's why I used the alternative: data splitting.
Thanks :)

Sign in to comment.
