Need most reliable normal distribution identifier in MATLAB

I am writing a script for my final project for my course that allows a user to input their data. The script then determines descriptive statistics based on the data, like mean, standard deviation, variance, etc.
Part of my script needs to accurately determine whether the data has a normal distribution or not. Originally I was using the following to determine this:
x_normal = vartest(x(:),x_var);
if x_normal == 1
    fprintf('The data you have selected has a normal distribution.\n')
elseif x_normal == 0
    fprintf('The data you have selected does not have a normal distribution.\n')
end
Here, x is the data set of two columns, x(:) is the data set condensed into one column, and x_var is the variance of x.
I have been using two different data sets to check that this works. I created one data set myself, and it is obviously not normally distributed. I generated the other with random.org, making sure it was normally distributed.
The data is always input into x, and the data is always either one column or two, but of an unknown length. Because I don't know how many columns will be in the data set, I always use x(:) rather than x to put every single element of the array into a single column.
Here are the data sets I have been using. First, the data set I created which is NOT normal:
15 3
5 22
4 54
7 0
10 23
9 42
8 60
4 66
15 98
7 77
1 111
55 13
8 123
35 30
11 29
81 150
And the next one is the data set that IS normal, and from random.org:
4.480e+1 4.830e+1
1.630e+1 2.610e+1
5.200e+1 7.130e+1
4.840e+1 2.260e+1
3.730e+1 3.800e+1
3.520e+1 4.820e+1
5.820e+1 4.950e+1
3.270e+1 3.790e+1
3.900e+1 5.020e+1
4.540e+1 2.310e+1
2.270e+1 4.530e+1
3.120e+1 3.480e+1
6.070e+1 3.960e+1
2.480e+1 3.850e+1
vartest says that both of my data sets are not normal, even though the second data set IS normal, because it was generated to be normal. Here are the MATLAB tests I have already tried, to no avail:
Chi-square goodness-of-fit test, Lilliefors test, z-test, Kolmogorov-Smirnov test
I am only using the chi-square test in my example just for the sake of showing the if statements and such. I have substituted it with these other tests with the appropriate variables. I can't seem to produce accurate results. As soon as one test says the second data set is normal, it will say the first data set is normal as well. I need a test that will work 100% of the time because I don't have access to the data set that will actually be tested with my script. I can't figure out how to change the significance level of any of these functions.
One thing I'm curious about that may be altering the results: whenever the number of elements in the data array is greater than 30, my script produces the POPULATION variance and standard deviation rather than the sample variance and standard deviation. This part of the script is required, so I can't take it out if it is causing the problems. I would appreciate any suggestions on how to work around it if it is indeed causing the normal distribution tests to produce inaccurate results.
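The sample/population switch described here can be sketched as follows. This is only an illustration under stated assumptions: the small matrix x is made-up data, and the variable names w, x_var, and x_std are hypothetical rather than taken from the actual script. It relies on the second argument of var and std, where 0 normalizes by n-1 (sample) and 1 normalizes by n (population).

```matlab
% Hypothetical data standing in for the user's input (one or two columns).
x = [2 4; 4 4; 5 7; 9 5];

v = x(:);          % flatten to a single column, as in the question
n = numel(v);      % total number of elements

% Rule from the question: more than 30 elements -> population statistics.
if n > 30
    w = 1;         % normalize by n   (population)
else
    w = 0;         % normalize by n-1 (sample)
end

x_var = var(v, w);
x_std = std(v, w);

fprintf('n = %d, variance = %.4f, standard deviation = %.4f\n', n, x_var, x_std);
```

Note that a normality test such as lillietest takes only the data as input and estimates the mean and standard deviation itself, so this choice should not feed into its result. By contrast, vartest does take a variance argument, but it tests a hypothesis about the variance under an assumption of normality rather than testing normality itself.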

Answers (2)

the cyclist on 18 Apr 2017
There is no "one best test" for normality. Such is the nature of statistics. According to the Wikipedia page on normality tests, there are at least eight different ones.
But such theoretical considerations aside, your question confuses me. You seem to be saying that, for example, lillietest does not "correctly" identify the random.org sample as being from a normal distribution. But it does.
x = [4.480e+1 4.830e+1
1.630e+1 2.610e+1
5.200e+1 7.130e+1
4.840e+1 2.260e+1
3.730e+1 3.800e+1
3.520e+1 4.820e+1
5.820e+1 4.950e+1
3.270e+1 3.790e+1
3.900e+1 5.020e+1
4.540e+1 2.310e+1
2.270e+1 4.530e+1
3.120e+1 3.480e+1
6.070e+1 3.960e+1
2.480e+1 3.850e+1]
h = lillietest(x(:))
returns h = 0, indicating that the null hypothesis -- that the data is a sample from a normal distribution -- is not rejected.
That same test does reject the null hypothesis for your non-normal data (with a P-value of 0.0013).
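As a rough sketch of how the two results above might be reported by a script, the snippet below runs lillietest on both data sets from the question and prints a message with the correct polarity: h = 1 means the null hypothesis of normality is rejected, and h = 0 means it is not rejected. The variable names (x_notnormal, x_normal, datasets, names) are illustrative only, and the Statistics and Machine Learning Toolbox is assumed to be installed.

```matlab
% Data copied from the question.
x_notnormal = [15 3; 5 22; 4 54; 7 0; 10 23; 9 42; 8 60; 4 66; ...
               15 98; 7 77; 1 111; 55 13; 8 123; 35 30; 11 29; 81 150];
x_normal = [44.8 48.3; 16.3 26.1; 52.0 71.3; 48.4 22.6; 37.3 38.0; ...
            35.2 48.2; 58.2 49.5; 32.7 37.9; 39.0 50.2; 45.4 23.1; ...
            22.7 45.3; 31.2 34.8; 60.7 39.6; 24.8 38.5];

datasets = {x_notnormal, x_normal};
names = {'hand-made set', 'random.org set'};
h = zeros(1, 2);
p = zeros(1, 2);

for k = 1:2
    v = datasets{k};
    % h = 1: normality REJECTED at the 5% level; h = 0: not rejected.
    [h(k), p(k)] = lillietest(v(:));
    if h(k) == 0
        fprintf('%s: normality is not rejected (p = %.4f).\n', names{k}, p(k));
    else
        fprintf('%s: normality is rejected (p = %.4f).\n', names{k}, p(k));
    end
end
```

A result of h = 0 does not prove the data is normal; it only means the test found no significant evidence against normality, which is why the messages are worded as "rejected" and "not rejected".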
So, all seems to be well in the statistical testing world.
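On the question of changing the significance level: the hypothesis tests mentioned in the question (lillietest, kstest, chi2gof, ztest, vartest) accept an 'Alpha' name-value pair. A minimal sketch with lillietest, using the random.org data from the question and an arbitrarily chosen level of 0.01, might look like this:

```matlab
% random.org data from the question.
x = [44.8 48.3; 16.3 26.1; 52.0 71.3; 48.4 22.6; 37.3 38.0; ...
     35.2 48.2; 58.2 49.5; 32.7 37.9; 39.0 50.2; 45.4 23.1; ...
     22.7 45.3; 31.2 34.8; 60.7 39.6; 24.8 38.5];

alpha = 0.01;   % significance level; 0.01 is an arbitrary choice for illustration

% Test at the chosen level instead of the default 0.05.
[h, p] = lillietest(x(:), 'Alpha', alpha);

fprintf('alpha = %.2f, h = %d, p = %.4f\n', alpha, h, p);
```

No choice of alpha makes a test correct 100% of the time. A smaller alpha reduces the chance of wrongly rejecting normal data, but it makes it harder to detect data that is not normal.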
  1 Comment
Matthew Piccolo on 18 Apr 2017
Oh my gosh, I feel so dumb. This whole time I have been thinking when the test results = 1, the data is normal, not the other way around. Thank you for your help!



Simon T on 12 Jan 2020
