How to check which distribution (normal, exponential, or gamma) best fits a data set? Which function should I use for this?

I have a dataset. I have fitted normal, exponential, and gamma distributions to this dataset. Now I want to know which distribution fits the dataset most accurately. How do I do that? I have also calculated the CDF values using normcdf, expcdf, and gamcdf.

Answers (3)

Mrutyunjaya Hiremath on 7 Aug 2023
To determine which distribution (normal, exponential, or gamma) fits the dataset most accurately, you can use a goodness-of-fit test. MATLAB provides the 'chi2gof' function that can help you perform a chi-square goodness-of-fit test. Here's how you can use it:
  1. Calculate the Empirical CDF: Start by calculating the empirical cumulative distribution function (ECDF) from your dataset.
  2. Calculate the Theoretical CDF: Use the estimated parameters from your fitted distributions (mean and standard deviation for normal, rate parameter for exponential, and shape and scale parameters for gamma) to calculate the theoretical CDF values for each distribution.
  3. Perform Goodness-of-Fit Test: Use the chi2gof function to perform a chi-square goodness-of-fit test. This function groups the data into bins and compares the observed count in each bin with the count expected under the specified distribution.
Here's an example code snippet:
% Load your data and fit the distributions (replace with your code)
data = randn(1000, 1); % Example data
params_normal = fitdist(data, 'Normal');
params_exponential = fitdist(data, 'Exponential');
params_gamma = fitdist(data, 'Gamma');
% Calculate the ECDF values (ecdf_x holds the sorted data points they belong to)
[ecdf_values, ecdf_x] = ecdf(data);
% Calculate the theoretical CDF values for each distribution
cdf_normal = normcdf(data, params_normal.mu, params_normal.sigma);
cdf_exponential = expcdf(data, params_exponential.mu);
cdf_gamma = gamcdf(data, params_gamma.a, params_gamma.b);
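% Perform chi-square goodness-of-fit tests; 'NParams' adjusts the degrees
% of freedom for the number of parameters estimated from the data
[h_normal, p_normal, stats_normal] = chi2gof(data, 'CDF', params_normal, 'NParams', 2);
[h_exponential, p_exponential, stats_exponential] = chi2gof(data, 'CDF', params_exponential, 'NParams', 1);
[h_gamma, p_gamma, stats_gamma] = chi2gof(data, 'CDF', params_gamma, 'NParams', 2);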
% Display the results
fprintf('Normal Distribution:\n');
disp(stats_normal);
fprintf('Exponential Distribution:\n');
disp(stats_exponential);
fprintf('Gamma Distribution:\n');
disp(stats_gamma);
% Compare p-values to determine which distribution fits the data best
if p_normal < p_exponential && p_normal < p_gamma
    fprintf('Normal distribution fits the data best.\n');
elseif p_exponential < p_normal && p_exponential < p_gamma
    fprintf('Exponential distribution fits the data best.\n');
else
    fprintf('Gamma distribution fits the data best.\n');
end
  1 Comment
Jeff Miller on 8 Aug 2023
  • A smaller p indicates a worse fit, so you would pick the distribution with the largest p--not the smallest p--if you were going to choose a distribution on this basis.
  • The code snippet won't run because some generated 'data' values will be negative and fitdist won't fit an exponential or gamma distribution if there are any negative values.
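Taking both points in this comment into account, a corrected version of the snippet might look like the sketch below. It assumes strictly positive example data generated with gamrnd (a stand-in for real data, not part of the original question) and passes 'NParams' to chi2gof so that the degrees of freedom account for the estimated parameters. Picking the largest p-value is only a rough heuristic, not a formal model-selection procedure.

```matlab
% Sketch: compare chi-square goodness-of-fit p-values for three candidate fits.
rng(0);                          % fixed seed so the example is repeatable
data = gamrnd(2, 3, 1000, 1);    % assumed example data: strictly positive

names   = {'Normal', 'Exponential', 'Gamma'};
nparams = [2, 1, 2];             % number of parameters estimated by each fit
p       = zeros(1, numel(names));

for k = 1:numel(names)
    pd = fitdist(data, names{k});
    % 'NParams' adjusts the degrees of freedom for the estimated parameters
    [~, p(k)] = chi2gof(data, 'CDF', pd, 'NParams', nparams(k));
    fprintf('%-12s p = %.4g\n', names{k}, p(k));
end

% A larger p-value means less evidence against the fitted distribution
[~, best] = max(p);
fprintf('%s has the largest p-value.\n', names{best});
```

With real data, replace the gamrnd line with your own vector, and check that it contains no negative values before fitting the exponential or gamma distribution.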


Walter Roberson on 8 Aug 2023
You can use fitdist to fit probability distributions to the data, getting out probability distribution objects.
You should then be able to use the distribution objects to calculate modeled output for each input, and use that to calculate mean squared error. Or I think you can do that... I am not sure of the steps at the moment. And I'm not sure why the distribution objects do not provide a direct method for calculating this.
You would then compare the mean squared errors for the various different distributions, and say the one with the lowest mean squared error was the most likely.
However, in practice, if your results have noise in them (processes are not perfect, sensors are not perfect, transformed readings are not perfect), it is common to find a distribution that appears to fit the data better than the model you have reason to expect. For example, if you generate points at equal intervals according to a known function, add rand()*std to them, and then fit both the known function and a polynomial of degree 11 or so, it is not uncommon for the high-degree polynomial to predict the noisy data better than the known function.
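One possible reading of the mean-squared-error idea above is to compare each fitted CDF with the empirical CDF at the same points. The sketch below takes that interpretation, which is an assumption rather than something the answer spells out. The example data from gamrnd is also an assumed stand-in for real data.

```matlab
% Sketch: mean squared error between the empirical CDF and each fitted CDF.
rng(0);                          % fixed seed so the example is repeatable
data = gamrnd(2, 3, 1000, 1);    % assumed example data: strictly positive

[f_emp, x] = ecdf(data);         % empirical CDF and the points it is evaluated at

names = {'Normal', 'Exponential', 'Gamma'};
mse   = zeros(1, numel(names));

for k = 1:numel(names)
    pd     = fitdist(data, names{k});
    f_fit  = cdf(pd, x);                  % fitted CDF at the same points
    mse(k) = mean((f_fit - f_emp).^2);
    fprintf('%-12s MSE = %.4g\n', names{k}, mse(k));
end

% The smallest MSE marks the fitted CDF closest to the empirical CDF
[~, best] = min(mse);
fprintf('%s has the lowest MSE.\n', names{best});
```

As the answer warns, a lower error on noisy data does not by itself show that the distribution is the right model.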
  1 Comment
Walter Roberson on 8 Aug 2023
Another approach is to use the Curve Fitting Toolbox to fit the data against several model functions, getting a residual for each, and comparing them. This uses a different toolbox than the above, and can get the residual information more directly, but it has exactly the same issue: when the data is not perfect, functions that you know to be irrelevant might have a lower residual than the known function.
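The overfitting issue described in this answer and comment can be illustrated with base MATLAB alone. The sketch below uses an assumed known function, sin(2*pi*x), with Gaussian noise from randn (the answer mentions rand; randn is swapped in here for illustration), and compares the residual of the true function with that of a degree-11 polynomial fitted by polyfit.

```matlab
% Sketch: a high-degree polynomial can match noisy samples more closely
% than the function that actually generated them.
rng(0);                                  % fixed seed so the example is repeatable
x = linspace(0, 1, 30)';                 % equally spaced sample points
f = @(t) sin(2*pi*t);                    % assumed "known" function
noise_std = 0.2;                         % assumed noise level
y = f(x) + noise_std * randn(size(x));   % noisy observations

% Sum of squared residuals of the true function against the noisy samples
sse_true = sum((y - f(x)).^2);

% Degree-11 least-squares polynomial; the third output centers and scales x,
% which improves the numerical conditioning of the fit
[c, ~, mu] = polyfit(x, y, 11);
sse_poly = sum((y - polyval(c, x, [], mu)).^2);

fprintf('SSE, known function:       %.4g\n', sse_true);
fprintf('SSE, degree-11 polynomial: %.4g\n', sse_poly);
```

The polynomial has twelve free coefficients with which to absorb noise, so a lower residual here says nothing about it being the better model.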


Image Analyst on 8 Aug 2023
Those functions all look dramatically different. They might be similar over very short segments where your data is fairly linear, but over the whole shape of the function they are SO different that I doubt they would all be contenders for your fitting function.
I don't think it's a good approach just to try a ton of different functions to see which fits your experimental data the best. That's rather arbitrary. I think it's very much preferred to decide upon the model in advance that best fits your data theoretically. For example if you know that you have a Poisson process, then fit that. If you know you have a Gaussian process, then fit that. Then your model will adhere to the theory behind your physical process. Otherwise if you want the best possible fit, just use a Lagrange Interpolating Polynomial, which will give a perfect fit but be the extreme example of overfitting.
That said, if you still don't care what model you use and just want something that kind of works, while ignoring any possible theory behind your data, then use the Regression Learner on the Apps tab of the tool ribbon if you have the Statistics and Machine Learning Toolbox. You can try out dozens of models all in one shot and then look at statistics such as R-squared, RMSE, or MAE (which the Regression Learner app will provide to you) to determine which model fits best. You can then export the code and implement it in your program.
