confidence intervals returned by predict()

The predict() function returns confidence intervals (CIs) for values predicted from a model. There are four options available for the CIs. Two of the options do not give the CIs I expect. Can someone explain these unexpected results? Are my expectations wrong or is the function wrong? I will give examples, using a simple linear regression model, and I will explain what values I expect. I'm sorry this is a long post, but I did not have time to make it shorter.
Create some data and make a simple linear regression model:
x=(5:15)';
b0=0; b1=1; sigma=1; %b0=intercept, b1=slope, sigma=s.d. of random noise
y=b0+b1*x+sigma*randn(size(x));
mdl=fitlm(x,y); % model using x, y
Make predictions with confidence intervals (four options for CIs)
xnew=(0:20)';
[~,yci1] =predict(mdl,xnew,'Prediction','curve', 'Simultaneous',false);
[~,yci2] =predict(mdl,xnew,'Prediction','curve', 'Simultaneous',true);
[~,yci3] =predict(mdl,xnew,'Prediction','observation','Simultaneous',false);
[ypred,yci4]=predict(mdl,xnew,'Prediction','observation','Simultaneous',true);
Plot predictions and confidence intervals
figure
subplot(211)
plot(x,y,'k*',xnew,ypred,'-k.'); hold on
plot(xnew,yci1(:,1),'-r',xnew,yci2(:,1),'-g',xnew,yci3(:,1),'-b',xnew,yci4(:,1),'-m');
plot(xnew,yci1(:,2),'-r',xnew,yci2(:,2),'-g',xnew,yci3(:,2),'-b',xnew,yci4(:,2),'-m');
legend('Data','Prediction','curve,non-simul','curve,simul.','obs.,non-simul','obs.,simul.')
ylabel('Y'); grid on
subplot(212)
plot(xnew,yci1(:,2)-ypred,'-r',xnew,yci2(:,2)-ypred,'-g',...
xnew,yci3(:,2)-ypred,'-b',xnew,yci4(:,2)-ypred,'-m');
legend('curve,non-simul','curve,simul.','obs.,non-simul','obs.,simul.')
xlabel('X'); ylabel('C.I. Half-width'); grid on
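I wish the Matlab help explained the following, which took me some work to figure out: The four different CIs returned by predict() follow the general formula
$$\mathrm{CI} = \hat{y} \pm c \cdot SE$$
where SE varies depending on the 'Prediction' option, and c varies depending on the 'Simultaneous' option.
When predict() is called with 'Prediction','curve', SE is given by
$$SE = s \sqrt{\frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$
where
$$s = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2}}$$
When predict() is called with 'Prediction','observation', SE is given by
$$SE = s \sqrt{1 + \frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$
When predict() is called with 'Simultaneous',false, c (for simple linear regression) is given by
$$c = t_{(1+p)/2,\; n-2}$$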
where p is the CI probability, 0.95 by default. The critical value of the t statistic can be obtained in Matlab with c=tinv((1+p)/2,n-2). In the example here, p=0.95 and n=11, therefore c=tinv(.975,9)=2.2622. The formulas above produce CIs that agree with the CIs of predict(), when Simultaneous is false. These CIs are plotted in red and blue above.
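The agreement described above can be checked with a short sketch. It assumes a fixed random seed (rng(1), which is not in the original post) and the standard simple-linear-regression standard errors, with s taken from mdl.RMSE. The names seCurve, seObs, errCurve and errObs are illustrative only.

```matlab
% Sketch: compare hand-computed non-simultaneous CIs with predict().
rng(1);                                   % assumption: fixed seed for repeatability
x = (5:15)';
y = 0 + 1*x + 1*randn(size(x));           % b0=0, b1=1, sigma=1 as in the post
mdl = fitlm(x,y);
xnew = (0:20)';

n   = mdl.NumObservations;                % number of (x,y) pairs, 11 here
s   = mdl.RMSE;                           % root mean squared error of the fit
Sxx = sum((x-mean(x)).^2);                % sum of squared deviations of x

seCurve = s*sqrt(    1/n + (xnew-mean(x)).^2/Sxx);  % 'Prediction','curve'
seObs   = s*sqrt(1 + 1/n + (xnew-mean(x)).^2/Sxx);  % 'Prediction','observation'

p = 0.95;
c = tinv((1+p)/2, n-2);                   % non-simultaneous critical value

[ypred,yciCurve] = predict(mdl,xnew,'Prediction','curve','Simultaneous',false);
[~,yciObs]       = predict(mdl,xnew,'Prediction','observation','Simultaneous',false);

% Largest difference between predict() half-widths and c*SE
errCurve = max(abs((yciCurve(:,2)-ypred) - c*seCurve));
errObs   = max(abs((yciObs(:,2)-ypred)   - c*seObs));
fprintf('max difference, curve: %.3g; observation: %.3g\n', errCurve, errObs);
```

Both differences should be at the level of floating-point round-off if the formulas match what predict() does when Simultaneous is false.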
When Simultaneous is true, the results are not what I expect. I expect the CIs (which, according to the Matlab Help, are by Scheffe's method) to follow the general formula above with (see here and here; these sources use different notation, but they appear to agree):
$$c = \sqrt{d \, F_{p,\; d,\; n-2}}$$
where d is the number of independent new x values for simultaneous prediction. In the examples plotted above, d=21, because length(xnew)=21. Therefore we expect c=sqrt(21*finv(.95,21,9))=7.8391. Therefore we expect the CI widths to be wider by a uniform factor of 7.84/2.26=3.47, when Simultaneous is true. But the CIs are only wider by a factor of 1.2898. (The ratio of CI widths is the same when 'Prediction','observation' is used.) Why the discrepancy?
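The critical values discussed above can be computed directly. The sketch below reproduces the two values from the post and adds one more, cK, which is purely an assumption for comparison: a Scheffe-type value whose numerator degrees of freedom equal the number of model coefficients (2 for a straight line) rather than the number of new x values. Whether predict() actually uses that form is not confirmed anywhere in this thread.

```matlab
% Sketch: critical values for p = 0.95, n = 11, d = 21.
p = 0.95; n = 11; d = 21;

cT = tinv((1+p)/2, n-2);        % non-simultaneous value, 2.2622 in the post
cD = sqrt(d*finv(p, d, n-2));   % value expected in the post, 7.8391
cK = sqrt(2*finv(p, 2, n-2));   % ASSUMPTION: numerator df = 2 coefficients

fprintf('cT = %.4f, cD = %.4f, cK = %.4f\n', cT, cD, cK);
fprintf('cD/cT = %.4f, cK/cT = %.4f\n', cD/cT, cK/cT);
```

The ratio cK/cT comes out at about 1.2898, the same factor observed between the simultaneous and non-simultaneous CIs from predict(). That is consistent with a band over the whole regression line that does not depend on length(xnew), but it is only a numerical coincidence check, not a statement about the implementation.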
The confidence interval, when predicting a single value with 'Simultaneous',true, is also not what we expect. When predicting a single value, d=1, and c simplifies to $c=\sqrt{F_{p,\;1,\;n-2}}$, where p is the CI probability. This is identical to the critical value for the non-simultaneous confidence interval, $c=t_{(1+p)/2,\;n-2}$, due to the relationship between F and t distributions. It makes sense that the simultaneous and non-simultaneous CIs would be the same when there is only one value being predicted "simultaneously". But the CIs returned by predict() are not the same when one value is being predicted. See example below.
xnew=10;
[ypred1,yci1]=predict(mdl,xnew,'Prediction','curve','Simultaneous',false);
[ypred2,yci2]=predict(mdl,xnew,'Prediction','curve','Simultaneous',true);
fprintf('CI, non-simultaneous: %.2f to %.2f; half-width %.2f\n',yci1,yci1(2)-ypred1)
CI, non-simultaneous: 9.17 to 10.89; half-width 0.86
fprintf('CI, simultaneous: %.2f to %.2f; half-width %.2f\n',yci2,yci2(2)-ypred2)
CI, simultaneous: 8.93 to 11.14; half-width 1.11
Why are the CIs not the same?
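The F and t relationship used in this argument can be confirmed numerically. The sketch below uses only tinv and finv; the ratio of the two printed half-widths is taken from the rounded output above, so it is approximate.

```matlab
% Sketch: for d = 1 the Scheffe-type value sqrt(F) equals the t value.
p = 0.95; n = 11;

cF = sqrt(finv(p, 1, n-2));     % sqrt of the F quantile with (1, n-2) df
cT = tinv((1+p)/2, n-2);        % two-sided t quantile with n-2 df
fprintf('sqrt(F) = %.4f, t = %.4f\n', cF, cT);

% Ratio of the half-widths printed in the example (rounded to 2 decimals)
ratioObserved = 1.11/0.86;
fprintf('observed half-width ratio = %.2f\n', ratioObserved);
```

The two critical values agree, so the expectation for d=1 follows from the formula. The observed ratio of about 1.29 is the same factor reported for the 21-point case, which suggests that the widening applied by predict() does not depend on how many values are predicted.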

Accepted Answer

ProblemSolver on 27 Jun 2023
Edited: ProblemSolver on 27 Jun 2023

1 vote

Hello William,
The discrepancies you observe in the confidence intervals returned by the predict function can be attributed to the different methods employed for simultaneous prediction and the specific case of predicting a single value.
When the 'Simultaneous' option is set to true, the predict function calculates the confidence intervals using Scheffe's method, which assumes that all predicted values are correlated and adjusts the intervals accordingly. However, for the specific case of predicting a single value, the correlation among predictions is not applicable, leading to different results between simultaneous and non-simultaneous predictions.
In the case of simultaneous prediction with multiple values, the factor by which the confidence interval is widened compared to the non-simultaneous case depends on the number of independent new x values (denoted as d in your explanation). The formula you provided,
$$c = \sqrt{d \, F_{p,\; d,\; n-2}}$$
is correct for calculating the scaling factor of the confidence interval width.
Regarding the case of predicting a single value, the simultaneous and non-simultaneous confidence intervals should indeed be the same since there is no correlation among predictions. However, the predict function in MATLAB calculates the confidence intervals differently for the two cases, resulting in discrepancies.
To obtain the expected confidence intervals for the simultaneous prediction of a single value, you can manually calculate the non-simultaneous confidence interval using the formula you mentioned:
CI = ypred ± tinv((1+p)/2, n-2) * SE,
where SE is calculated based on the 'Prediction' option.
I hope this helps.

7 Comments

William Rose on 28 Jun 2023
Thank you for your fast and thoughtful answer. You wrote:
In the case of simultaneous prediction with multiple values, the factor by which the confidence interval is widened compared to the non-simultaneous case depends on the number of independent new x values (denoted as d in your explanation). The formula you provided,
$$c = \sqrt{d \, F_{p,\; d,\; n-2}}$$
is correct for calculating the scaling factor of the confidence interval width.
But the CIs from predict() do not agree with that formula, when Simultaneous is true. That is my whole point. Why not?
You also wrote:
Regarding the case of predicting a single value, the simultaneous and non-simultaneous confidence intervals should indeed be the same since there is no correlation among predictions. However, the predict function in MATLAB calculates the confidence intervals differently for the two cases, resulting in discrepancies.
To obtain the expected confidence intervals for the simultaneous prediction of a single value, you can manually calculate the non-simultaneous confidence interval using the formula you mentioned...
Do you agree that predict() returns an incorrect CI in this case?
Is it fair to summarize: "The CIs from predict(), when Simultaneous is true, are wrong, no matter how many values are being predicted. They are not the Scheffe CIs." ?
If so, then predict() should be fixed, or, at a minimum, the Help should be updated to acknowledge the problem.
ProblemSolver on 29 Jun 2023
Edited: ProblemSolver on 29 Jun 2023
@William Rose - Apologies for the late reply. I tried this on several MATLAB versions: 2021, 2022a, 2022b, and 2023a. The results were consistent with what you observed. Therefore, I do think that it can be an issue, and it should either be rectified or the Help should be updated to acknowledge the problem.
partika partikasiwatch on 2 Jan 2024
With all the issues mentioned above ... should we use the predict function to calculate CIs or not?
William Rose on 2 Jan 2024
I think it is good to use predict() to calculate CIs. The CIs returned by predict() need no adjustment, if "Simultaneous" is false. If I want the CI for a curve that extends from x=0 to 20, I use
xnew=(0:20)';
[~,yci] = predict(mdl,xnew,'Prediction','curve','Simultaneous',false);
and the CIs will be good.
If I want the CI for a single observation at x=10, I use
xnew=10;
[~,yci] = predict(mdl,xnew,'Prediction','observation','Simultaneous',false);
and the CI will be good.
The problem arises if I want to estimate a CI for multiple observations. The CI returned by
[~,yci] = predict(mdl,xnew,'Prediction','observation','Simultaneous',true);
is incorrect, as discussed above. Therefore I would compute the CI using the formulas I gave in my initial question above.
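Therefore, when predicting a CI of probability p, for d simultaneous observations, the CI is
$$\mathrm{CI} = \hat{y} \pm c \cdot SE$$
where
$$c = \sqrt{d \, F_{p,\; d,\; n-2}}$$
and
$$SE = s \sqrt{1 + \frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$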
and n is the number of observations used to estimate the model. In the example in my original posting, I used n=11 (x,y) observations to estimate the model. If I want the 95% CI for 21 simultaneous observations, I compute c=sqrt(21*finv(.95,21,9))=7.8391.
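The manual calculation described in this comment can be sketched as follows. It assumes a fixed random seed (not in the original post), takes s from mdl.RMSE, and uses the critical value proposed in this comment, c = sqrt(d*finv(p,d,n-2)). The name yciManual is illustrative.

```matlab
% Sketch: simultaneous CIs for d new observations, computed by hand
% with the critical value proposed in the comment above.
rng(1);                                   % assumption: fixed seed for repeatability
x = (5:15)';
y = x + randn(size(x));                   % b0=0, b1=1, sigma=1
mdl = fitlm(x,y);
xnew = (0:20)';

p = 0.95;
n = mdl.NumObservations;                  % 11 observations used for the fit
d = numel(xnew);                          % 21 simultaneous predictions

ypred = predict(mdl,xnew);                % point predictions only
s     = mdl.RMSE;
Sxx   = sum((x-mean(x)).^2);
seObs = s*sqrt(1 + 1/n + (xnew-mean(x)).^2/Sxx);   % SE for a new observation

c = sqrt(d*finv(p, d, n-2));              % 7.8391 for d=21, n=11
yciManual = [ypred - c*seObs, ypred + c*seObs];    % [lower, upper] per row
```

Because c grows with d, these intervals widen as more x values are added to xnew, unlike the simultaneous intervals returned by predict().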
partika partikasiwatch on 2 Jan 2024
Thanks for the early and elaborate reply.
I just had another doubt related to the current issue.
Is the formula used to calculate CIs by the "predict" and "predint" functions of MATLAB the same or not?
William Rose on 2 Jan 2024
You're welcome. I predict that the CIs obtained with predint(), with appropriate options for intopt and simopt, will match the CIs from predict(), with corresponding options for Prediction and Simultaneous. You will have to investigate to be sure.
partika partikasiwatch on 4 Jan 2024
Yes, I checked, and the CI values match for both functions with the right options for prediction and simultaneous.
Again, thanks for the answer.
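A comparison of this kind could be set up roughly as below. This is a sketch, not the code used in the thread: it assumes the Curve Fitting Toolbox is available for fit() and predint(), uses a fixed seed, and pairs 'Prediction','observation' with intopt 'observation' and 'Simultaneous',true with simopt 'on'.

```matlab
% Sketch: compare simultaneous observation intervals from predict() and predint().
rng(1);                                   % assumption: fixed seed for repeatability
x = (5:15)';
y = x + randn(size(x));
xnew = (0:20)';

mdl    = fitlm(x,y);                      % Statistics and Machine Learning Toolbox
fitobj = fit(x,y,'poly1');                % Curve Fitting Toolbox, straight-line fit

[~,ciPredict] = predict(mdl,xnew,'Prediction','observation','Simultaneous',true);
ciPredint     = predint(fitobj,xnew,0.95,'observation','on');

% Largest absolute difference between the two sets of bounds
maxDiff = max(abs(ciPredict(:) - ciPredint(:)));
fprintf('Largest difference between predict() and predint(): %.3g\n', maxDiff);
```

Changing 'observation' to 'curve' in predict() and to 'functional' in predint(), or switching the simultaneous options off, covers the other three combinations.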
