How to evaluate the fitted distributions by comparing quantitative statistics?

Question

Frank on 19 May 2023

Direct link to this question

https://nl.mathworks.com/matlabcentral/answers/1967654-how-to-evaluate-the-fitted-distributions-by-comparing-quantitative-statistics

Commented: Ive J on 6 Jun 2023

Hi,

I'm running a matrix of regressions (say, 10*50 regressions, same original dataset, different approaches).

Besides simply comparing the R-squared values of the regressions, I also want to know how the residuals are distributed, since the residuals are important for further investigation. I want the residuals to be convergent and well distributed, which would mean the original dataset and the fitted curve are closely linked (the probability of deviation is minimized).

Since there are lots of them, plotting methods (histogram, qqplot, etc.) are not suitable. I would like some statistic that can evaluate how good the fitted distributions are, so that I can pick the 'good models' and use them as a good approximation to the original dataset.

Judging by eye, I feel the fitted distribution is likely t location-scale distributed. I tried coftest, kstest and adtest. However, they all reject the null hypothesis (even when R-squared is 0.96). I would like to know how I can change the criteria of the null hypothesis. Furthermore, the p-values are miserably low. I'm deeply confused by this.

Here is a comparison of some of the distributions:

As you can see, data1 and data2 are quite well distributed, as they have the shape, while data3 and data4 are not so convergent. I tried to compare log-likelihoods, but even data4 has a log-likelihood on the order of 1.0e+07. What am I supposed to do?

7 Comments

the cyclist on 22 May 2023

There is a lot to unpack in this question. I'm not sure where to begin.

First off, you have shared some info, but perhaps not the most important info. So, in rough order of importance ...

Can you share the dataset? (Am I correct that the dataset has perhaps millions of data points?)
Can you share the code you used to build the models?
If you can't share the code, at least describe what kind of models you are using. Ordinary linear regression? Machine learning (and what type)? Etc.
Do you have an underlying theory on why the model will have a particular functional form?
What is driving your decision to build 500(!) models? (It seems like maybe you are hand-tuning a parameter?)
Are you using all the data to train the models, or are you using separate training/validation/test sets?

Some other maybe-relevant thoughts:

coeftest, kstest, and adtest have some important differences. What null hypothesis are you trying to test here? That the residuals have a normal distribution? For some models, that is an important test, but for others it is not. Regardless, with such a huge dataset, I would expect that null hypothesis to pretty much always be rejected.

The question "Which model is best?" is highly dependent on what the model is going to be used for. That is what should drive the definition of goodness-of-fit, not some (possibly arbitrary) statistic. For example, often times the "best" model is the one you believe will best predict future data (i.e. minimize "generalization error"). There are methods for estimating that.

When you say, "the p-value is low" ... what p-value?

Frank on 23 May 2023

Wow! You have really deep insight and understanding in this area!

Allow me to introduce the background of this model so that it would be easier to comprehend why I am doing all this.

The model aims to use a series of inter-correlated variables to predict interaction behaviors among them. For example, the test I'm currently running contains 8 correlated variables (called Va, Vb, ...). I first process each of them into the difference between its latest value and the average of its previous N lags; doing this for all 8 variables gives a 'data' matrix for each value of N:

data(i,j) = V(i,j) - mean(V(i-N:i-1,j));
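
The lagged-mean difference above can also be computed without an explicit loop. A minimal sketch, assuming V is a T-by-8 matrix with one column per variable (the synthetic V and the value of N here are placeholders, not the original data):

```matlab
rng(0);
V = cumsum(randn(1000, 8), 1);   % placeholder data: T-by-8, one column per variable
N = 5;                           % placeholder lag length

% M(i,:) is the mean of rows i-N+1..i of V (trailing window of N rows)
M = movmean(V, [N-1 0], 1);

% data(i,:) = V(i,:) - mean(V(i-N:i-1,:)), defined only for rows i > N
data = nan(size(V));
data(N+1:end, :) = V(N+1:end, :) - M(N:end-1, :);
```

The first N rows stay NaN because they have no complete N-lag history; they would need to be dropped (for example by using data(N+1:end,:)) before the next step.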

Then I used pca to convert the N-period-based data into 3 axes. For each designated N we have:

[coeff{i,1}, score{i,1}, latent{i,1}, tsquare{i,1}, explained{i,1}, mu{i,1}] = pca(data);

I'm using those 3 axes to regress against each variable because 2 axes couldn't reach an 85% explained rate. It is strange that the explained variance almost always keeps rising with increasing N. (illustrated below) I don't really understand why.

For fear of overfitting, I decided to use the pca axes only as a way to reduce inter-correlations, not for parameter picking. I then regress the differences against the pca axes to form the prediction model. The model looks like:

pred_A = para1*pca1 + para2*pca2 + para3*pca3 + C = w1*A1 + w2*A2 + .. + w8*A8 + C

and its code is:

[b{j,i},~,r{j,i},~,stats{j,i}] = regress(data(:,i),[score{j,1}(:,1:3) ones(size(score{j,1},1),1)]);

Here j indexes the different sets of N-period processed pca axes, while i indexes each of the 8 differences (from their original variables).

My focus is mainly on the residuals r, i.e. r{j,i}. I compared the R-squared values and plotted histograms of the residuals. The outcome is that a higher R-squared doesn't necessarily lead to better-distributed residuals. Thus I decided to use the morphology of the residuals to determine whether a regression is a good/reasonable/explainable model. But as you can see, there are 8 variables and already some 40 values of N, which gives 200+ residual plots (this example is a trial and will expand further if necessary). I cannot examine the plots one by one, so I have to use some statistical indicator to filter the residual distributions. And here we are.

As for the statistics of distribution similarity, I just figured out today where the problem might be. But first, let me answer the questions, in case I didn't explain clearly above:

Yes (but at 58.5MB it cannot be attached). And yes, the dataset contains millions of data points. (the key factor leading to my question)
Please refer to the explanation above.
I think so. It seems to me that I'm using all data to train the model.

I use the 'tlocationscale' model to fit the distribution:

dist_t{j,i} = fitdist(tmp_r,'tlocationscale');
loglikelihood_t{j,i} = sum(log(pdf(dist_t{j,i},tmp_r)));

The log-likelihoods are almost all over 1E+07, and thus not distinguishable. I have to use some other statistic. I tried chi2gof, kstest and adtest, but all return extremely small p-values (0 or almost 0). With such small p-values, changing the null hypothesis hardly helps.
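
One reason the raw log-likelihoods all look alike is that the sum grows with the number of observations. A minimal sketch of a per-observation version, assuming synthetic residuals as a stand-in for r{j,i}. The value still depends on the scale of the residuals, so it may only be meaningful for comparing different distribution fits to the same residual set:

```matlab
rng(0);
tmp_r = 0.3 * trnd(5, 20000, 1);          % synthetic stand-in for the residuals r{j,i}
pd = fitdist(tmp_r, 'tlocationscale');

ll = -negloglik(pd);                      % same quantity as sum(log(pdf(pd, tmp_r)))
meanLL = ll / numel(tmp_r);               % average log-likelihood per observation
```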

tmp_dist = random(dist_t{j,i},size(r{j,i}));
[~,pvalue_Matrix{j,i}] = kstest(tmp_dist,'cdf',dist_t{j,i},'Alpha',0.05);

And here's what I realize now:

Although kstest is more big-sample friendly, my sample size is still too big.
My dataset has a very limited tail (short-tailed, as indicated by qqplot).
The huge sample size will inevitably result in values in the 'fat tail' area. These form a huge difference against the max and min of the actual dataset, which results in a huge difference in the cdf. That difference, projected onto the curve kstest compares, eventually results in a p-value of 0.

qqplot(residual_sample,dist_t{j,i})
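
With samples this large, a p-value close to 0 mostly reflects the sample size, so the test statistic itself may be a more useful number for ranking fits. A minimal sketch, assuming synthetic residuals in place of r{j,i}; the third output of kstest is the largest gap between the empirical and the fitted CDF. Because the distribution parameters are estimated from the same data, the p-values here are not strictly valid and are shown only for illustration:

```matlab
rng(1);
r = trnd(4, 50000, 1);                    % synthetic, heavy-tailed stand-in for residuals

pdT = fitdist(r, 'tlocationscale');       % candidate 1
pdN = fitdist(r, 'normal');               % candidate 2 (deliberately a poorer fit)

[~, pT, DT] = kstest(r, 'CDF', pdT);      % DT, DN: KS statistics (max CDF gap)
[~, pN, DN] = kstest(r, 'CDF', pdN);

fprintf('t location-scale: D = %.4f, p = %.3g\n', DT, pT);
fprintf('normal:           D = %.4f, p = %.3g\n', DN, pN);
```

Unlike the p-value, D does not shrink toward 0 just because the sample grows, so it can be stored per model and sorted.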

However, I still don't understand:

why does the qqplot look like this (shouldn't the quantiles/percentiles change with a smaller range as well?)
is there any way to compare only the 99% middle area of the histogram?
is there any way to fit the distribution only to the 99% middle area of the histogram?

I failed to find tools or code for the latter 2 requirements. Thus I'm reading the related code and papers, trying to limit the range of the cdf comparison to attain a smaller difference, which would result in a seemingly intuitive p-value that can be compared. (This idea may apply only to kstest, and it still does not solve the huge-sample problem.) Do you have any good ideas?

In the end, my purpose is to use some statistic to compare the models and filter out those with bad-looking distributions, since they are not that explainable. There might be other problems in my whole model; please do enlighten me.
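
One possible way to compare only the central 99% is to restrict both the quantile comparison and the CDF comparison to that range. A minimal sketch, assuming synthetic residuals and a fit on the full sample; the resulting numbers are descriptive distances for ranking models, not p-values, and this does not fit the distribution to the central range only:

```matlab
rng(0);
r = 0.5 * trnd(5, 20000, 1);              % synthetic stand-in for the residuals
pd = fitdist(r, 'tlocationscale');        % fitted on the full sample

% Quantile mismatch over the central 99%, in units of the fitted scale
p = (0.005:0.001:0.995)';
qEmp = quantile(r, p);  qEmp = qEmp(:);
qFit = icdf(pd, p);     qFit = qFit(:);
qErr = max(abs(qEmp - qFit)) / pd.sigma;

% KS-type distance evaluated only at points inside the central 99%
rs = sort(r(:));
n = numel(rs);
Femp = (1:n)' / n;                        % empirical CDF at the sorted points
inCentral = rs >= qEmp(1) & rs <= qEmp(end);
Dcentral = max(abs(Femp(inCentral) - cdf(pd, rs(inCentral))));
```

Both qErr and Dcentral ignore the extreme tails, so a handful of far-out points cannot dominate the comparison.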

Thank you very much!

the cyclist on 26 May 2023

Sorry that it took me a while to reply. Part of the reason is my own work, and part is that it took me a long time to understand what you are doing. (But I do think I mostly understand now.)

My impression is that you are effectively re-inventing a modeling method that already exists, known as vector autoregression (VAR). As you can read on that Wikipedia page, it is a method for modeling time-series data where each value is dependent on the past (i.e. lagged) values of all other variables.

You also seem to be trying to solve for the best-fitting model by making up and testing your own statistic. But such goodness-of-fit methods will already exist.

Functionality exists for VAR models in MATLAB. It looks like you will need the Econometrics Toolbox, though. Even if you are not able to use that, I expect that reading its documentation (or just understanding VAR models in general) will help you solve your problem.

Unfortunately, I don't have any experience using those types of models, so I don't think I can be more helpful than that.

Frank on 27 May 2023

Thanks very much for your time!

After reviewing VAR models, I have to admit there are similarities in some aspects. In my understanding, a VAR model is for forecasting (predicting yt based on previous values), but what I'm focusing on here is actually the relationship (distribution) among different variables. But you are right: if we ignore the time-series properties, they are much alike (PCA and (ZZ')^-1 seem like quite similar methodologies?). My statistics are getting rusty; I might need more time to figure out the true difference.

After examining a few tests, the key issue causing the problem is still the huge number of observations. It not only affects the tests, but also influences the fitdist() function in the first place. As far as I have explored, there is no setting in the distributionFitter tool to restrict the fit to the main area only.

I sincerely hope there are some good tools and methodologies out there so I don't have to reinvent the wheel...

Anyway, thanks very much!

Ive J on 6 Jun 2023

I'm not familiar with time-series modeling, but the general idea you've mentioned looks like what partial least-squares (PLS) does, and in that sense you don't need to first get PCs and fit an OLS. Nonetheless, what is your objective with the residuals? Your residuals should be normal; otherwise it's a clear violation of OLS assumptions, and as you've already figured out, qq plots are one way to visually check this assumption. Also, what's wrong with the adjusted R-squared you get from the linear model (check `fitlm`), which is a direct measure (as @the cyclist just mentioned) of goodness-of-fit?
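
For reference, a minimal sketch of reading the adjusted R-squared from `fitlm` when regressing one variable on the first three principal-component scores. The data here is synthetic and only stands in for the differenced variables described earlier in the thread:

```matlab
rng(2);
data = randn(5000, 8) * randn(8, 8);      % synthetic stand-in: 8 correlated variables
[~, score] = pca(data);                   % principal-component scores

mdl = fitlm(score(:, 1:3), data(:, 1));   % fitlm adds the intercept by default
adjR2 = mdl.Rsquared.Adjusted;            % adjusted R-squared
res = mdl.Residuals.Raw;                  % raw residuals, one per observation
```

The same model object also exposes the ordinary R-squared as mdl.Rsquared.Ordinary, so both can be collected in a loop over models.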
