# compare with a plot the distribution of my data to the extreme value (Gumbel) distribution

24 views (last 30 days)
V on 10 Apr 2015
Answered: V on 10 Apr 2015
Hello,
I am trying to make a plot which compare the distribution of my data to a Gumbel distribution. Basically, I am trying to do by hand what the probplot function does. Here is my code:
data=[2.35 2.75 2.75 2.82 2.85 2.9 2.9 3.05 3.2 3.4 3.41 3.9]*10000; %my data
figure
h=probplot('extreme value',data) %this is the plot I want to obtain
[f,x] = ecdf(data);%Empirical cumulative distribution
parmhat = evfit(data)%Extreme value parameter estimates (type 1 = Gumbel)
% fitdist could have been used to
mu=parmhat(1);
sigma=parmhat(2);
figure
plot(x,f,'.b')%I would prefer to plot(data,F(data)) but data and x have different sizes, why?
hold on
plot(data,(data-mu)/sigma,'r') %and this line does not fit the data
evinv(0.01,mu,sigma)%100 year return value
I suspect something is wrong with the (data-mu)/sigma and that I should be plotting something else. Could you please have a look?
Thanks!

Ingrid on 10 Apr 2015
calling the help for ecdf this indicates that the function returns the empirical cumulative distribution function (cdf), f, evaluated at the points in x, using the data in the vector y. In your case data and x have different sizes because your data has points with duplicate values i.e. 2.75 occurs twize and 2.9 also. For a given x there can only be one probability so therefore the size of x is smaller than the size of data and you cannot use plot(data,f) and even with the same size would not make much sense given the explanation given above.
indeed there is a mistake in plotting (data-mu)/sigma, since the mu and sigma are not for a normal distribution in this case but for the Gumbel distribution so you should use the cumulative distribution equation for the Gumbel distribution

V on 10 Apr 2015
Thank you for your quick answer. Your comment about ecdf makes sense. I actually realise I want to obtain something else than what the probplot does. I want -log(-log(F)), with F the cumulative distribution function on the y axis. That is why I plotted (data-mu)/sigma on the y axis. I should then plot -log(log(f))instead.
But (data-mu)/sigma still does not fit the data. Looks like 1+(data-mu)/sigma is better. I do not know why.
I am also not sure ecdf is the most relevant function to use in my case because f(3.9(the highest value in my data))=1 (which is not the case in reality) and then -log(-log(1))=Inf. Maybe I should define f as f=0.05:0.075:0.95; (not sure what should be the boundaries, I put 5% and 95% but is there a way to calculate these values?)
This is my updated code:
data=[2.35 2.75 2.76 2.82 2.85 2.9 2.91 3.05 3.2 3.4 3.41 3.9]*10000;
[f,x] = ecdf(data);
figure
plot(x,-log(log(f)),'.b')%f cumulative distribution function
parmhat = evfit(data)
mu=parmhat(1);
sigma=parmhat(2);
hold on
plot(data,(data-mu)/sigma,'r')
Thanks again!