
%% Modelling Tail Data with the Generalized Pareto Distribution
% This example shows how to fit tail data to the Generalized Pareto
% distribution by maximum likelihood estimation.
%
% Fitting a parametric distribution to data sometimes results in a model
% that agrees well with the data in high density regions, but poorly in
% areas of low density. For unimodal distributions, such as the normal or
% Student's t, these low density regions are known as the "tails" of the
% distribution. One reason why a model might fit poorly in the tails is
% that by definition, there are fewer data in the tails on which to base a
% choice of model, and so models are often chosen based on their ability to
% fit data near the mode. Another reason might be that the distribution of
% real data is often more complicated than the usual parametric models.
%
% However, in many applications, fitting the data in the tail is the main
% concern. The Generalized Pareto distribution (GP) was developed as a
% distribution that can model tails of a wide variety of distributions,
% based on theoretical arguments. One approach to distribution fitting that
% involves the GP is to use a non-parametric fit (the empirical cumulative
% distribution function, for example) in regions where there are many
% observations, and to fit the GP to the tail(s) of the data.

%   Copyright 2004-2014 The MathWorks, Inc.


%% The Generalized Pareto Distribution
% The Generalized Pareto (GP) is a right-skewed distribution, parameterized
% with a shape parameter, k, and a scale parameter, sigma.  k is also known as
% the "tail index" parameter, and can be positive, zero, or negative.
x = linspace(0,10,1000);
plot(x,gppdf(x,-.4,1),'-', x,gppdf(x,0,1),'-', x,gppdf(x,2,1),'-');
xlabel('x / sigma');
ylabel('Probability density');
legend({'k < 0' 'k = 0' 'k > 0'});
%%
% Notice that for k < 0, the GP has zero probability above an upper limit of
% -(sigma/k), that is, -(1/k) in the units of sigma used on the x-axis.  For
% k >= 0, the GP has no upper limit.  Also, the GP is often used in
% conjunction with a third, threshold parameter that shifts the lower limit
% away from zero.  We will not need that generality here.
%
% The GP distribution is a generalization of both the exponential distribution
% (k = 0) and the Pareto distribution (k > 0).  The GP includes those two
% distributions in a larger family so that a continuous range of shapes is
% possible.
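%%
% As a quick numerical check of those special cases (a sketch added here for
% illustration), we can verify that the GP density with k = 0 coincides with
% the exponential density, and that for k < 0 the density really is zero
% beyond the upper limit -sigma/k.
max(abs(gppdf(x,0,1) - exppdf(x,1)))   % essentially zero: GP with k = 0 is exponential
gppdf(2.6,-.4,1)                       % zero: 2.6 is beyond -sigma/k = 2.5 for k = -0.4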


%% Simulating Exceedance Data
% The GP distribution can be defined constructively in terms of exceedances.
% Starting with a probability distribution whose right tail drops off to zero,
% such as the normal, we can sample random values independently from that
% distribution.  If we fix a threshold value, discard all the values that are
% below the threshold, and subtract the threshold from the values that remain,
% the resulting values are known as exceedances.  The distribution of
% the exceedances is approximately a GP.  Similarly, we can set a threshold in
% the left tail of a distribution, and ignore all values above that threshold.
% The threshold must be far enough out in the tail of the original
% distribution for the approximation to be reasonable.
%
% The original distribution determines the shape parameter, k, of the
% resulting GP distribution.  Distributions whose tails fall off as a
% polynomial, such as Student's t, lead to a positive shape parameter.
% Distributions whose tails decrease exponentially, such as the normal,
% correspond to a zero shape parameter.  Distributions with finite tails, such
% as the beta, correspond to a negative shape parameter.
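%%
% As a quick sketch of this construction, we can draw from a standard normal,
% keep the exceedances over a high threshold, and fit a GP to them.  Because
% the normal tail decreases exponentially, the estimated shape parameter
% should come out close to zero.
z = randn(20000,1);            % sample from a standard normal
u = quantile(z,.99);           % a threshold far out in the right tail
gpfit(z(z>u) - u)              % shape estimate should be near zero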

%%
% Real-world applications for the GP distribution include modelling extremes
% of stock market returns, and modelling extreme floods.  For this example,
% we'll use simulated data, generated from a Student's t distribution with 5
% degrees of freedom.  We'll take the largest 5% of 2000 observations from the
% t distribution, and then subtract off the 95% quantile to get exceedances.
rng(3,'twister');      % for reproducibility
x = trnd(5,2000,1);    % simulate from a t distribution with 5 degrees of freedom
q = quantile(x,.95);   % use the 95% quantile as the threshold
y = x(x>q) - q;        % exceedances over the threshold
n = numel(y)


%% Fitting the Distribution Using Maximum Likelihood
% The GP distribution is defined for 0 < sigma, and -Inf < k < Inf.  However,
% interpretation of the results of maximum likelihood estimation is problematic
% when k < -1/2.  Fortunately, those cases correspond to fitting tails from
% distributions like the beta or triangular, and so will not present a problem
% here.
paramEsts = gpfit(y);
kHat      = paramEsts(1)   % Tail index parameter
sigmaHat  = paramEsts(2)   % Scale parameter
%%
% As might be expected, since the simulated data were generated using a t
% distribution, the estimate of k is positive.
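%%
% In fact, extreme value theory predicts the limiting shape parameter for
% exceedances from a Student's t distribution: k = 1/nu, where nu is the
% degrees of freedom.  With nu = 5, that gives k = 0.2, so we can gauge how
% far the estimate is from its theoretical value.
kTheory = 1/5
kHat - kTheory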


%% Checking the Fit Visually
% To visually assess how good the fit is, we'll plot a scaled histogram of
% the tail data, overlaid with the density function of the GP that we've
% estimated.  The histogram is scaled so that the bar heights times their
% width sum to 1.
bins = 0:.25:7;
h = bar(bins,histc(y,bins)/(length(y)*.25),'histc');   % counts scaled to a density
h.FaceColor = [.9 .9 .9];
ygrid = linspace(0,1.1*max(y),100);
line(ygrid,gppdf(ygrid,kHat,sigmaHat));                % overlay the fitted GP density
xlim([0,6]);
xlabel('Exceedance');
ylabel('Probability Density');
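%%
% On releases that provide the |histogram| function (introduced in R2014b),
% the same density-scaled histogram can be produced more directly, with the
% normalization handled automatically; this is an equivalent alternative to
% the |bar| and |histc| code above.
histogram(y,bins,'Normalization','pdf','FaceColor',[.9 .9 .9]);
line(ygrid,gppdf(ygrid,kHat,sigmaHat));
xlim([0,6]);
xlabel('Exceedance');
ylabel('Probability Density');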

%%
% We've used a fairly small bin width, so there is a good deal of noise in
% the histogram.  Even so, the fitted density follows the shape of the
% data, and so the GP model seems to be a good choice.
%
% We can also compare the empirical CDF to the fitted CDF.
[F,yi] = ecdf(y);                        % empirical CDF, evaluated at the data
plot(yi,gpcdf(yi,kHat,sigmaHat),'-');    % fitted GP CDF
hold on;
stairs(yi,F,'r');                        % empirical CDF as a stairstep plot
hold off;
legend('Fitted Generalized Pareto CDF','Empirical CDF','location','southeast');
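%%
% For a simple numerical summary of the agreement, we can compute the largest
% absolute discrepancy between the two curves, in the spirit of the
% Kolmogorov-Smirnov statistic.
max(abs(F - gpcdf(yi,kHat,sigmaHat)))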


%% Computing Standard Errors for the Parameter Estimates
% To quantify the precision of the estimates, we'll use standard errors
% computed from the asymptotic covariance matrix of the maximum likelihood
% estimators.  The function |gplike| computes, as its second output, a
% numerical approximation to that covariance matrix.  Alternatively, we could
% have called |gpfit| with two output arguments, and it would have returned
% confidence intervals for the parameters.
[nll,acov] = gplike(paramEsts, y);   % negative log-likelihood and covariance estimate
stdErr = sqrt(diag(acov))            % standard errors for kHat and sigmaHat

%%
% These standard errors indicate that the relative precision of the estimate
% for k is quite a bit lower than that for sigma -- its standard error is on
% the order of the estimate itself.  Shape parameters are often difficult to
% estimate.  It's important to keep in mind that computation of these standard
% errors assumed that the GP model is correct, and that we have enough data
% for the asymptotic approximation to the covariance matrix to hold.
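%%
% Under that normal approximation, the standard errors translate into rough
% Wald-type 95% confidence intervals for the parameters.  (We'll see below
% that |gpfit| constructs its interval for sigma somewhat differently.)
kCIwald     = kHat     + [-1 1]*1.96*stdErr(1)
sigmaCIwald = sigmaHat + [-1 1]*1.96*stdErr(2)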


%% Checking the Asymptotic Normality Assumption
% Interpretation of the standard errors usually involves assuming that, if the
% same fit could be repeated many times on data that came from the same source,
% the maximum likelihood estimates of the parameters would approximately
% follow a normal distribution.  For example, confidence intervals are often
% based on this assumption.
%
% However, that normal approximation may or may not be a good one.  To assess
% how good it is in this example, we can use a bootstrap simulation.  We will
% generate 1000 replicate datasets by resampling from the data, fit a GP
% distribution to each one, and save all the replicate estimates.
replEsts = bootstrp(1000,@gpfit,y);

%%
% As a rough check on the sampling distribution of the parameter
% estimators, we can look at histograms of the bootstrap replicates.
subplot(2,1,1);
hist(replEsts(:,1));
title('Bootstrap estimates of k');
subplot(2,1,2);
hist(replEsts(:,2));
title('Bootstrap estimates of sigma');
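%%
% The bootstrap replicates also give standard error estimates that do not
% rely on the asymptotic covariance matrix; they should be roughly comparable
% to the values computed from |gplike|.
bootStdErr = std(replEsts)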


%% Using a Parameter Transformation
% The histogram of the bootstrap estimates for k appears to be only a little
% asymmetric, while that for the estimates of sigma definitely appears skewed
% to the right.  A common remedy for that skewness is to estimate the
% parameter and its standard error on the log scale, where a normal
% approximation may be more reasonable.  A Q-Q plot is a better way to assess
% normality than a histogram, because non-normality shows up as points that do
% not approximately follow a straight line.  Let's check that to see if the
% log transform for sigma is appropriate.
subplot(1,2,1);
qqplot(replEsts(:,1));
title('Bootstrap estimates of k');
subplot(1,2,2);
qqplot(log(replEsts(:,2)));
title('Bootstrap estimates of log(sigma)');

%%
% The bootstrap estimates for k and log(sigma) appear acceptably
% close to normality.  A Q-Q plot for the estimates of sigma, on the unlogged
% scale, would confirm the skewness that we've already seen in the histogram.
% Thus, it would be more reasonable to construct a confidence interval for
% sigma by first computing one for log(sigma) under the assumption of
% normality, and then exponentiating to transform that interval back to the
% original scale for sigma.
%
% In fact, that's exactly what the function |gpfit| does behind the scenes.
[paramEsts,paramCI] = gpfit(y);
%%
kHat
kCI  = paramCI(:,1)
%%
sigmaHat
sigmaCI  = paramCI(:,2)

%%
% Notice that while the 95% confidence interval for k is symmetric about the
% maximum likelihood estimate, the confidence interval for sigma is not.
% That's because it was created by transforming a symmetric CI for log(sigma).
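%%
% As a check, we can reconstruct that interval by hand: use the delta method
% to get a standard error for log(sigmaHat), form a symmetric normal-theory
% interval on the log scale, and exponentiate.  The result should agree
% closely with the interval returned by |gpfit|.
seLogSigma = stdErr(2)/sigmaHat;               % delta method: Var(log s) ~ Var(s)/s^2
exp(log(sigmaHat) + [-1 1]*1.96*seLogSigma)    % compare with sigmaCI above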