How reliable is fitdist with regard to the gamma-distribution?

2 ビュー (過去 30 日間)
Markus Franke
Markus Franke 2020 年 8 月 20 日
コメント済み: Jeff Miller 2020 年 8 月 21 日
Hello everyone,
I stumbled across the following issue concerning the reliability of the fitdist-function with regard to the gamma-distribution and could not find anything in the answers. The code-snippet below can be used by everyone for testing purposes. I generated a virtual dataset Y1 from a gamma-distribution with parameters p and b. The variable scaling simulates the virtual amount of measurements/observations. In order to test the dataset Y1 for gamma-distrubtion using the chi-squared-test, the procedure highlighted here was used.
The issue is that the fitted distribution pd has a very different shape than the original (gamma-distributed) dataset. Consequently, this yields to a rejection of null-hypothesis of the chi-squared test (h=1), although the null-hypothesis (Y1 is gamma-distributed) is correct.
Is that a known issue, or does someone has a hint on how to challenge this problem? As far as I know, fitdist utilizes the Maximum Likelihood estimator. Is there any option to modify the estimator or have an insight on how the parameters of the gamma-distribution are guessed/optimized?
X= 1:1:20;
p = 1;
b = 9;
scaling = 1000;
Y1 = scaling*gampdf(X,p,b);
pd = fitdist(X','Gamma','Frequency',round(y1',0));
expCounts = scaling * pdf(pd,X');
[h,pc,st] = chi2gof(X','Ctrs',X',...
'Frequency',y1', ...
'Expected',expCounts,...
'NParams',2, ...
'Alpha', 0.05);

採用された回答

Jeff Miller
Jeff Miller 2020 年 8 月 21 日
If your overall goal is to check on the accuracy of fitdist, then I think your method is flawed because the data you construct (i.e., a set of X's with frequencies given by y1) do not actually represent the true distribution very well. Consider this example:
% Examine the true distribution
p = 0.7457;
b = 9.129764;
pd = makedist('Gamma', 'a', p, 'b', b);
disp([' True mean & variance are ' num2str(mean(pd)) ' and ' num2str(var(pd))])
% construct the data that supposedly represent that distribution
lower = 1;
upper = 20;
X= lower:1:upper;
scaling = 1000;
y1 = scaling*gampdf(X,p,b);
constructed_mean = sum(X.*y1)/sum(y1);
constructed_EXsqr = sum(X.^2.*y1)/sum(y1);
constructed_var = constructed_EXsqr - constructed_mean^2;
disp(['Constructed mean & variance are ' num2str(constructed_mean) ' and ' num2str(constructed_var)])
Running this code gives
True mean & variance are 6.8081 and 62.156
Constructed mean & variance are 6.0679 and 23.9548
Since the data in y1 have the wrong mean and variance, you can't really expect fitdist to return the true parameter values as estimates.
If you actually want to check fitdist, then a better way to generate data to represent the distribution is to use the cdf, as John said. For example:
% define equally spaced points along the cdf
cdf_grain = 0.01;
cdf_points = cdf_grain/2:cdf_grain:(1-cdf_grain/2);
% define true distribution
p = 0.7457;
b = 9.129764;
pd = makedist('Gamma', 'a', p, 'b', b);
disp([' True parameters ' num2str(pd.a) ' and ' num2str(pd.b)])
% look up the data points
X = icdf(pd,cdf_points);
% use fitdist with those data points
pd = fitdist(X','Gamma');
disp(['Estimate parameters ' num2str(pd.a) ' and ' num2str(pd.b)])
Here you will find that the estimated parameters match the true parameters rather well, and the match improves as cdf_grain is reduced.
(Check the mean and variance of these generate data points if you want; all are equally likely. They will not be perfect, but they will be closer to the true distribution than those obtained with your method.)
FWIW, I would not use chi2gof at all in this situation. As I see it, the only issue is how well the estimated parameters match the known true values.
  2 件のコメント
Markus Franke
Markus Franke 2020 年 8 月 21 日
Thank you very much for your answer, Jeff!
The overall goal is to check for given (large) dataset(s) which distribution fits best. In that process, I stumbled across the outlined issue. To check the goodness of fit, I used the chi-squared test. As a rule of thumb, the amount of bins to use is sqrt(n), with n being the sample size of the data set, limiting the resolution.
You are totally right. The only issue is how well the estimated parameters match the true values. However, it is a bit problematic, if the whole procedure (especially fitdist) is so sensitive with regard to the resolution.
I will use the cdf, as John mentioned, in combination with the Kolmogorov-Smirnov test in the next tries. But this is another topic, I guess.
Therefore, I will mark this topic as answered.
Thanks again Jeff and John.
Jeff Miller
Jeff Miller 2020 年 8 月 21 日
With that overall goal, you might also find it illuminating to:
  1. generate random datasets (of the size you have) from each of your candidate distributions.
  2. fit each dataset with all of the candidate distributions, and then
  3. see how often the known true underlying distribution provides a better fit to the dataset than any of the candidate distributions.
When I have done this, I have always been disappointed in the results. With (say) 4-5 similar candidate distributions, it is rarely the case that the true distribution provides the best fit to more than about half of the datasets (at least with datasets of the size I work with). Bottom line: identifying the true underlying distribution family is really hard, and in some cases it may be an unrealistic goal.

サインインしてコメントする。

その他の回答 (1 件)

Jeff Miller
Jeff Miller 2020 年 8 月 21 日
I think fitdist is doing the best it can, but you gave it a non-gamma sample to fit because you truncated the X range at 20. Note that y1 has lots of predicted scores at 18, 19, 20 but then stops suddenly, which would not be expected in a true random sample from this gamma distribution. Try X=1:1:50. Or, if you really want to consider a truncated distribution, look at MATLAB's truncate command for truncating probability distributions.
  4 件のコメント
John D'Errico
John D'Errico 2020 年 8 月 21 日
編集済み: John D'Errico 2020 年 8 月 21 日
I think part of the problem is you are not truly sending in true gamma samples into fitdist.
You are scaling the values of a pdf. So even the re-scaling is not truly correct, because to be correct, you would need to scale those effective histogram counts by the number of samples that would have fallen into a histogram bin.
X = 1:1:20;
p = 1;
b = 9;
scaling = 1000;
y1 = gamcdf(x + 0.5,p,b) - gamcdf(x - 0.5,p,b);
y1 = round(scaling*y1);
Even at that, this is still effectively censored data. If you were generating gamma data, and then binning it as if from a histogram, then:
>> gamcdf(1,1,9)
ans =
0.10516068318563
>> gamcdf(20,1,9)
ans =
0.891631976778104
we see that you have implicitly discarded all samples representing roughly 20% of your data. So you were effectively using a truncated distribution. I suppose censoring is another way to look at it.
So a better result might be gained if you did this:
y1 = gamcdf(x + 0.5,p,b) - gamcdf(x - 0.5,p,b);
y1(1) = gamcdf(x(1) + .5,p,b) - 0;
y1(end) = 1 - gamcdf(x(end) - .5,p,b);
y1 = round(scaling*y1);
This is NOT the same thing as just using pdf to scale your data into frequencies! A pdf does NOT return a probability!
Jeff's suggestion of extending the range will help some I think, but even there, use of a cdf is correct, not using the pdf.
Markus Franke
Markus Franke 2020 年 8 月 21 日
To me, this feels like a work-around that is not really reliable. If you change p to p = 0.5, the null-hypothesis gets rejected again and the parameters of the fitted distribution are a bit off.
I just found out, that more reliable and better results can be obtained by doing the parameters estimation using the the least-squares approach.
p = 0.7457;
b = 9.129764;
lower = 1;
upper = 20;
X= lower:1:upper;
scaling = 1000;
y1 = scaling*gampdf(X,p,b);
fun = @(x)sseval(x,y1,X, scaling);
x0 = [0.5, 3];
bestx = fminsearch(fun,x0)
pd = makedist('Gamma', 'a', bestx(1), 'b', bestx(2));
expCounts = scaling * pdf(pd,X');
[h,pc,st] = chi2gof(X','Ctrs',X',...
'Frequency',y1', ...
'Expected',expCounts,...
'NParams',2, ...
'Alpha', 0.05);
function sse = sseval(x, ydata, bins, scaling)
p = x(1);
b = x(2);
pd = makedist('Gamma', 'a', p, 'b', b);
expC = scaling* pdf(pd,bins');
sse = sum((ydata' - expC).^2);
end

サインインしてコメントする。

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by