www.gusucode.com > stats_featured 源码程序 matlab案例代码 > stats_featured/nonparametricCDFdemo.m

    %% Nonparametric Estimates of Cumulative Distribution Functions and Their Inverses
% This example shows how to estimate the cumulative distribution function
% (CDF) from data in a non-parametric or semi-parametric fashion. It also
% illustrates the inversion method for generating random numbers from the
% estimated CDF. The Statistics and Machine Learning Toolbox(TM) includes
% more than two dozen random number generator functions for parametric
% univariate probability distributions. These functions allow you to
% generate random inputs for a wide variety of simulations, however, there
% are situations where it is necessary to generate random values to
% simulate data that are not described by a simple parametric family.
%
% The toolbox also includes the functions |pearsrnd| and |johnsrnd|, for
% generating random values without having to specify a parametric distribution
% from which to draw--those functions allow you to specify a distribution in
% terms of its moments or quantiles, respectively.
%
% However, there are still situations where even more flexibility is needed,
% to generate random values that "imitate" data that you have collected even
% more closely. In this case, you might use a nonparametric estimate of the
% CDF of those data, and use the inversion method to generate random values.  
% The inversion method involves generating uniform random values on the unit 
% interval, and transforming them to a desired distribution using the inverse 
% CDF for that distribution.
%
% From the opposite perspective, it is sometimes desirable to use a
% nonparametric estimate of the CDF to transform observed data onto the unit
% interval, giving them an approximate uniform distribution.
%
% The |ecdf| function computes one type of nonparametric CDF estimate, the
% empirical CDF, which is a stairstep function.  This example illustrates some
% smoother alternatives, which may be more suitable for simulating or
% transforming data from a continuous distribution.

%   Copyright 2007-2014 The MathWorks, Inc.


%%
% For the purpose of illustration, here are some simple simulated data. There
% are only 25 observations, a small number chosen to make the plots in the
% example easier to read.  The data are also sorted to simplify plotting.

rng(19,'twister');
n = 25;
x = evrnd(3,1,n,1); x = sort(x);
hist(x,-2:.5:4.5);
xlabel('x'); ylabel('Frequency');


%% A Piecewise Linear Nonparametric CDF Estimate
% The |ecdf| function provides a simple way to compute and plot a "stairstep"
% empirical CDF for data.  In the simplest cases, this estimate makes discrete
% jumps of 1/n at each data point.
[Fi,xi] = ecdf(x);
stairs(xi,Fi,'r');
xlim([-2.5 5]); xlabel('x'); ylabel('F(x)');

%%
% This estimate is useful for many purposes, including investigating the
% goodness of fit of a parametric model to data.  Its discreteness, however, may
% make it unsuitable for use in empirically transforming continuous data to or
% from the unit interval.
%
% It is simple to modify the empirical CDF to address that problems. Instead
% of taking discrete jumps of 1/n at each data point, define a function that
% is piecewise linear, with breakpoints at the midpoints of those jumps. The
% height at each of the data points is then [1/2n, 3/2n, ..., (n-1/2)/n],
% instead of [(1/n), (2/n), ..., 1].  Use the output of |ecdf| to
% compute those breakpoints, and then "connect the dots" to define the
% piecewise linear function.
xj = xi(2:end);
Fj = (Fi(1:end-1)+Fi(2:end))/2;
hold on
plot(xj,Fj,'b.', xj,Fj,'b-');
hold off
legend({'ECDF' 'Breakpoints' 'Piecewise Linear Estimate'},'location','NW');

%%
% Because |ecdf| deals appropriately with repeated values and censoring, this
% calculation works even in cases with more complicated data than in this
% example.

%%
% Since the smallest data point corresponds to a height of 1/2n, and the
% largest to 1-1/2n, the first and last linear segments must be extended beyond
% the data, to make the function reach 0 and 1.
xj = [xj(1)-Fj(1)*(xj(2)-xj(1))/((Fj(2)-Fj(1)));
      xj;
      xj(n)+(1-Fj(n))*((xj(n)-xj(n-1))/(Fj(n)-Fj(n-1)))];
Fj = [0; Fj; 1];
hold on
plot(xj,Fj,'b-');
hold off
%%
% This piecewise linear function provides a nonparametric estimate of the CDF
% that is continuous and symmetric.  Evaluating it at points other than the
% original data is just a matter of linear interpolation, and it can be
% convenient to define an anonymous function to do that.
F = @(y) interp1(xj,Fj,y,'linear','extrap');
y = linspace(-1,3.75,10);
plot(xj,Fj,'b-',y,F(y),'ko');
xlim([-2.5 5]); xlabel('x'); ylabel('F(x)');


%% A Piecewise Linear Nonparametric Inverse CDF Estimate
% You can use the same calculations to compute a nonparametric estimate of the
% inverse CDF. In fact, the inverse CDF estimate is just the CDF estimate with
% the axes swapped.
stairs(Fi,[xi(2:end); xi(end)],'r');
hold on
plot(Fj,xj,'b-');
hold off
ylim([-2.5 5]); ylabel('x'); xlabel('F(x)');
legend({'ECDF' 'Piecewise Linear Estimate'},'location','NW');

%%
% Evaluating this nonparametric inverse CDF at points other than the original
% breakpoints is again just a matter of linear interpolation.  For example,
% generate uniform random values and use the CDF estimate to transform them
% back to the scale of your original observed data.  This is the inversion
% method.
Finv = @(u) interp1(Fj,xj,u,'linear','extrap');
u = rand(10000,1);
hist(Finv(u),-2:.25:4.5);
xlabel('x'); ylabel('Frequency');

%%
% Notice that this histogram of simulated data is more spread out than the
% histogram of the original data.  This is due, in part, to the much larger
% sample size--the original data consist of only 25 values.   But it is also
% because the piecewise linear CDF estimate, in effect, "spreads out" each of
% the original observations over an interval, and more so in regions where the
% individual observations are well-separated.
%
% For example, the two individual observations to the left of zero correspond
% to a wide, flat region of low density in the simulated data.  In contrast,
% in regions where the data are closely spaced, towards the right tail, for
% example, the piecewise linear CDF estimate "spreads out" the observations to
% a lesser extent.  In that sense, the method performs a simple version of
% what is known as variable bandwidth smoothing.  However, despite the
% smoothing, the simulated data retain most of the idiosyncrasies of the
% original data, i.e., the regions of high and low density.


%% Kernel Estimators for the CDF and Inverse CDF
% Instead of estimating the CDF using a piecewise linear function, you can
% perform kernel estimation using the |ksdensity| function to make a smooth
% nonparametric estimate. Though it is often used to make a nonparametric
% _density_ estimate, |ksdensity| can also estimate other functions.  For
% example, to transform your original data to the unit interval, use it to
% estimate the CDF.
F = ksdensity(x, x, 'function','cdf', 'width',.35);
stairs(xi,Fi,'r');
hold on
plot(x,F,'b.');
hold off
xlim([-2.5 5]); xlabel('x'); ylabel('F(x)');
legend({'ECDF' 'Kernel Estimates'},'location','NW');

%%
% |ksdensity| also provides a convenient way to evaluate the kernel CDF
% estimate at points other than the original data.  For example, plot the
% estimate as a smooth curve.
y = linspace(-2.5,5,1000);
Fy = ksdensity(x, y, 'function','cdf', 'width',.35);
stairs(xi,Fi,'r');
hold on
plot(y,Fy,'b-');
hold off
legend({'ECDF' 'Kernel Estimate'},'location','NW');

%%
% |ksdensity| uses a bandwidth parameter to control the amount of smoothing in
% the estimates it computes, and it is possible to let |ksdensity| choose a
% default value.  The examples here use a fairly small bandwidth to limit the
% amount of smoothing.  Even so, the kernel estimate does not follow the ECDF
% as closely as the piecewise linear estimate does. 

%%
% One way to estimate the inverse CDF using kernel estimation is to compute the
% kernel CDF estimate on a grid of points spanning the range of the original
% data, and then use the same procedure as for the piecewise linear estimate.  For
% example, to plot the inverse CDF kernel estimate as a smooth curve, simply
% swap the axes.
stairs(Fi,[xi(2:end); xi(end)],'r');
hold on
plot(Fy,y,'b-');
hold off
ylim([-2.5 5]); ylabel('x'); xlabel('F(x)');
legend({'ECDF' 'Kernel Estimate'},'location','NW');

%% 
% To transform uniform random values back to the scale of the original data, 
% interpolate using the grid of CDF estimates.
Finv = @(u) interp1(Fy,y,u,'linear','extrap');
hist(Finv(u),-2.5:.25:5);
xlabel('x'); ylabel('Frequency');

%%
% Notice that the simulated data using the kernel CDF estimate has not
% completely "smoothed over" the two individual observations to the left of
% zero present in the original data.  The kernel estimate uses a fixed
% bandwidth.  With the particular bandwidth value used in this example, those
% two observations contribute to two localized areas of density, rather than a
% wide flat region as was the case with the piecewise linear estimate. In
% contrast, the kernel estimate has smoothed the data _more_ in the right tail
% than the piecewise linear estimate.

%%
% Another way to generate simulated data using kernel estimation is to use
% |ksdensity| to compute an estimate of the inverse CDF directly, again using
% the |'function'| parameter.  For example, transform those same uniform values.
r = ksdensity(x, u, 'function','icdf', 'width',.35);

%%
% However, using the latter method can be time-consuming for large amounts of
% data. A simpler, but equivalent, method is to resample with replacement from
% the original data and add some appropriate random noise.
r = datasample(x,100000,'replace',true) + normrnd(0,.35,100000,1);

%%
% If you generate enough random values, a histogram of the result follows the
% kernel density estimate of the original data very closely.
binwidth = .25;
edges = -2.5:binwidth:6;
ctrs = edges(1:end-1) + binwidth./2;
counts = histc(r,edges); counts = counts(1:end-1);
bar(ctrs,counts./(sum(counts).*binwidth),1,'FaceColor',[.9 .9 .9]);
hold on
xgrid = edges(1):.1:edges(end);
fgrid = ksdensity(x, xgrid, 'function','pdf', 'width',.3);
plot(xgrid,fgrid,'k-');
hold off
xlabel('x'); ylabel('f(x)');


%% A Semiparametric CDF Estimate
% A nonparametric CDF estimate requires a good deal of data to achieve
% reasonable precision.  In addition, data only affect the estimate "locally."
% That is, in regions where there is a high density of data, the estimate is
% based on more observations than in regions where there is a low density of
% data.  In particular, nonparametric estimates do not perform well in the
% tails of a distribution, where data are sparse by definition.
%
% Fitting a semiparametric model to your data using the |paretotails| function
% allows the best of both the nonparametric and parametric worlds.  In the
% "center" of the distribution, the model uses the piecewise linear
% nonparametric estimate for the CDF.  In each tail, it uses a generalized
% Pareto distribution.  The generalized Pareto is often used as a model for
% the tail(s) of a dataset, and while it is flexible enough to fit a wide
% variety of distribution tails, it is sufficiently constrained so that it
% requires few data to provide a plausible and smooth fit to tail data.
%
% For example, you might define the "center" of the data as the middle 50%, and
% specify that the transitions between the nonparametric estimate and the Pareto
% fits take place at the .25 and .75 quantiles of your data.  To evaluate the
% CDF of the semiparametric model fit, use the fit's |cdf| method.
semipFit = paretotails(x,.25,.75);
%%
% The warning is due to using so few data -- 6 points in each tail in this
% case -- and indicates that the fitted generalized Pareto distribution in the
% upper tail extends exactly to the smallest observation, and no further.  You
% can see that in the histogram shown below.  In a real application, you would
% usually have more data, and the warning would typically not occur.
[p,q] = boundary(semipFit);
y = linspace(-4,6,1000);
Fy = cdf(semipFit,y);
plot(y,Fy,'b-', q,p,'k+');
xlim([-6.5 5]); xlabel('x'); ylabel('F(x)');
legend({'Semi-parametric Estimate' 'Segment Boundaries'},'location','NW');

%%
% To transform uniform random values back to the scale of your original data,
% use the fit's |icdf| method.
r = icdf(semipFit,u);
hist(r,-6.5:.25:5);
xlabel('x'); ylabel('Frequency');

%%
% This semiparametric estimate has smoothed the tails of the data more than
% the center, because of the parametric model used in the tails.  In that
% sense, the estimate is more similar to the piecewise linear estimate than to
% the kernel estimate. However, it is also possible to use |paretotails| to
% create a semiparametric fit that uses kernel estimation in the center of the
% data.


%% Conclusions
% This example illustrates three methods for computing a non- or semi-parametric
% CDF or inverse CDF estimate from data.  The three methods impose different
% amounts and types of smoothing on the data.  Which method you choose depends
% on how each method captures or fails to capture what you consider the
% important features of your data.