%% Simulating Dependent Random Variables Using Copulas
% This example shows how to use copulas to generate data from multivariate
% distributions when there are complicated relationships among the
% variables, or when the individual variables are from different
% distributions.
%
% MATLAB(R) is an ideal tool for running simulations that incorporate
% random inputs or noise. Statistics and Machine Learning Toolbox(TM)
% provides functions to create sequences of random data according to many
% common univariate distributions. The Toolbox also includes a few
% functions to generate random data from multivariate distributions, such
% as the multivariate normal and multivariate t.  However, there is no
% built-in way to generate multivariate distributions with arbitrary
% marginal distributions, or in cases where the individual variables are
% from different distributions.
%
% Recently, copulas have become popular in simulation models.  Copulas are
% functions that describe dependencies among variables, and provide a way
% to create distributions to model correlated multivariate data.  Using a
% copula, a data analyst can construct a multivariate distribution by
% specifying marginal univariate distributions, and choosing a particular
% copula to provide a correlation structure between variables.  Bivariate
% distributions, as well as distributions in higher dimensions, are
% possible.  In this example, we discuss how to use copulas to generate
% dependent multivariate random data in MATLAB, using
% Statistics and Machine Learning Toolbox.

%   Copyright 2004-2014 The MathWorks, Inc.



%% Dependence Between Simulation Inputs
% One of the design decisions for a Monte-Carlo simulation is a choice of
% probability distributions for the random inputs.  Selecting a
% distribution for each individual variable is often straightforward, but
% deciding what dependencies should exist between the inputs may not be.
% Ideally, input data to a simulation should reflect what is known about
% dependence among the real quantities being modeled.  However, there
% may be little or no information on which to base any dependence in the
% simulation, and in such cases, it is a good idea to experiment with
% different possibilities, in order to determine the model's sensitivity.
%
% However, it can be difficult to actually generate random inputs with
% dependence when they have distributions that are not from a standard
% multivariate distribution.  Further, some of the standard multivariate
% distributions can model only very limited types of dependence.  It's
% always possible to make the inputs independent, and while that is a
% simple choice, it's not always sensible and can lead to the wrong
% conclusions.
%
% For example, a Monte-Carlo simulation of financial risk might have two
% random inputs that represent different sources of insurance losses.
% These inputs might be modeled as lognormal random variables.  A reasonable
% question to ask is how dependence between these two inputs affects the
% results of the simulation.  Indeed, it might be known from real data that
% the same random conditions affect both sources, and ignoring that in
% the simulation could lead to the wrong conclusions.
%
% Simulation of independent lognormal random variables is trivial.  The
% simplest way would be to use the |lognrnd| function.  Here, we'll use the
% |mvnrnd| function to generate n pairs of independent normal random
% variables, and then exponentiate them.  Notice that the covariance matrix
% used here is diagonal, i.e., independence between the columns of Z.
n = 1000;
sigma = .5;
SigmaInd = sigma.^2 .* [1 0; 0 1]
%%
ZInd = mvnrnd([0 0], SigmaInd, n);
XInd = exp(ZInd);
plot(XInd(:,1),XInd(:,2),'.');
axis equal;
axis([0 5 0 5]);
xlabel('X1');
ylabel('X2');
%%
% Dependent bivariate lognormal r.v.'s are also easy to generate, using a
% covariance matrix with non-zero off-diagonal terms.
rho = .7;
SigmaDep = sigma.^2 .* [1 rho; rho 1]
%%
ZDep = mvnrnd([0 0], SigmaDep, n);
XDep = exp(ZDep);
%%
% A second scatter plot illustrates the difference between these two bivariate
% distributions.
plot(XDep(:,1),XDep(:,2),'.');
axis equal;
axis([0 5 0 5]);
xlabel('X1');
ylabel('X2');

%%
% It's clear that there is more of a tendency in the second dataset for
% large values of X1 to be associated with large values of X2, and
% similarly for small values.  This dependence is determined by the
% correlation parameter, rho, of the underlying bivariate normal.  The
% conclusions drawn from the simulation could well depend on whether X1
% and X2 were generated with dependence.
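%%
% As a quick numerical check (a sketch using the |corr| function), we can
% compare the sample linear correlations of the two simulated datasets.
% The off-diagonal entry is near zero for the independent pairs, and
% clearly positive for the dependent pairs (though, as discussed below,
% typically somewhat smaller than rho).
corrInd = corr(XInd)
corrDep = corr(XDep)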

%%
% The bivariate lognormal distribution is a simple solution in this case,
% and of course easily generalizes to higher dimensions and cases where the
% marginal distributions are _different_ lognormals.  Other multivariate
% distributions also exist, for example, the multivariate t and the
% Dirichlet distributions are used to simulate dependent t and beta random
% variables, respectively.  But the list of simple multivariate
% distributions is not long, and they only apply in cases where the
% marginals are all in the same family (or even the exact same
% distributions).  This can be a real limitation in many situations.


%% A More General Method for Constructing Dependent Bivariate Distributions
% Although the above construction that creates a bivariate lognormal is
% simple, it serves to illustrate a method which is more generally
% applicable.  First, we generate pairs of values from a bivariate normal
% distribution.  There is statistical dependence between these two
% variables, and each has a normal marginal distribution. Next, a
% transformation (the exponential function) is applied separately to each
% variable, changing the marginal distributions into lognormals. The
% transformed variables still have a statistical dependence.
%
% If a suitable transformation could be found, this method could be
% generalized to create dependent bivariate random vectors with other
% marginal distributions.  In fact, a general method of constructing such a
% transformation does exist, although it is not as simple as exponentiation.
%
% By definition, applying the normal CDF (denoted here by PHI) to a standard
% normal random variable results in a r.v. that is uniform on the
% interval [0, 1].  To see this, if Z has a standard normal distribution,
% then the CDF of U = PHI(Z) is
%
%    Pr{U <= u0} = Pr{PHI(Z) <= u0} = Pr{Z <= PHI^(-1)(u0)} = u0,
%
% and that is the CDF of a U(0,1) r.v.  Histograms of some simulated normal
% and transformed values demonstrate that fact.
n = 1000;
z = normrnd(0,1,n,1);
hist(z,-3.75:.5:3.75);
xlim([-4 4]);
title('1000 Simulated N(0,1) Random Values');
xlabel('Z');
ylabel('Frequency');
%%
u = normcdf(z);
hist(u,.05:.1:.95);
title('1000 Simulated N(0,1) Values Transformed to U(0,1)');
xlabel('U');
ylabel('Frequency');

%%
% Now, borrowing from the theory of univariate random number generation,
% applying the inverse CDF of any distribution F to a U(0,1) random
% variable results in a r.v. whose distribution is exactly F.  This is known
% as the Inversion Method.  The proof is essentially the reverse of the
% proof above for the forward case.  Another histogram illustrates the
% transformation to a gamma distribution.
x = gaminv(u,2,1);
hist(x,.25:.5:9.75);
title('1000 Simulated N(0,1) Values Transformed to Gamma(2,1)');
xlabel('X');
ylabel('Frequency');

%%
% This two-step transformation can be applied to each variable of a
% standard bivariate normal, creating dependent r.v.'s with arbitrary
% marginal distributions.  Because the transformation works on each
% component separately, the two resulting r.v.'s need not even have the
% same marginal distributions.  The transformation is defined as
%
%    Z = [Z1 Z2] ~ N([0 0],[1 rho; rho 1])
%    U = [PHI(Z1) PHI(Z2)]
%    X = [G1(U1) G2(U2)]
%
% where G1 and G2 are inverse CDFs of two possibly different distributions.
% For example, we can generate random vectors from a bivariate distribution
% with t(5) and Gamma(2,1) marginals.
n = 1000;
rho = .7;
Z = mvnrnd([0 0], [1 rho; rho 1], n);
U = normcdf(Z);
X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];

%%
% This plot has histograms alongside a scatter plot to show both the
% marginal distributions, and the dependence.
[n1,ctr1] = hist(X(:,1),20);
[n2,ctr2] = hist(X(:,2),20);
subplot(2,2,2);
plot(X(:,1),X(:,2),'.');
axis([0 12 -8 8]);
h1 = gca;
title('1000 Simulated Dependent t and Gamma Values');
xlabel('X1 ~ Gamma(2,1)');
ylabel('X2 ~ t(5)');
subplot(2,2,4);
bar(ctr1,-n1,1);
axis([0 12 -max(n1)*1.1 0]);
axis('off');
h2 = gca;
subplot(2,2,1);
barh(ctr2,-n2,1);
axis([-max(n2)*1.1 0 -8 8]);
axis('off');
h3 = gca;
h1.Position = [0.35 0.35 0.55 0.55];
h2.Position = [.35 .1 .55 .15];
h3.Position = [.1 .35 .15 .55];
colormap([.8 .8 1]);


%% Rank Correlation Coefficients
% Dependence between X1 and X2 in this construction is determined by the
% correlation parameter, rho, of the underlying bivariate normal. However,
% it is _not_ true that the linear correlation of X1 and X2 is rho.  For
% example, in the original lognormal case, there is a closed form for that
% correlation:
%
%    cor(X1,X2) = (exp(rho.*sigma.^2) - 1) ./ (exp(sigma.^2) - 1)
%
% which is strictly less than rho unless rho is exactly one.  In more
% general cases, though, such as the Gamma/t construction above, the linear
% correlation between X1 and X2 is difficult or impossible to express in
% terms of rho, but simulations can be used to show that the same effect
% happens.
%
% That's because the linear correlation coefficient expresses the _linear_
% dependence between r.v.'s, and when nonlinear transformations are applied
% to those r.v.'s, linear correlation is not preserved.  Instead, a rank
% correlation coefficient, such as Kendall's tau or Spearman's rho, is more
% appropriate.
%
% Roughly speaking, these rank correlations measure the degree to which
% large or small values of one r.v. associate with large or small values of
% another. However, unlike the linear correlation coefficient, they
% measure the association only in terms of ranks.  As a consequence, the
% rank correlation is preserved under any monotonic transformation.  In
% particular, the transformation method just described preserves the rank
% correlation.  Therefore, knowing the rank correlation of the bivariate
% normal Z exactly determines the rank correlation of the final transformed
% r.v.'s X.  While rho is still needed to parameterize the underlying
% bivariate normal, Kendall's tau or Spearman's rho are more useful in
% describing the dependence between r.v.'s, because they are invariant to the
% choice of marginal distribution.
%
% It turns out that for the bivariate normal, there is a simple 1-1 mapping
% between Kendall's tau or Spearman's rho, and the linear correlation
% coefficient rho:
%
%    tau   = (2/pi)*arcsin(rho)     or   rho = sin(tau*pi/2)
%    rho_s = (6/pi)*arcsin(rho/2)   or   rho = 2*sin(rho_s*pi/6)
subplot(1,1,1);
rho = -1:.01:1;
tau = 2.*asin(rho)./pi;
rho_s = 6.*asin(rho./2)./pi;
plot(rho,tau,'-', rho,rho_s,'-', [-1 1],[-1 1],'k:');
axis([-1 1 -1 1]);
xlabel('rho');
ylabel('Rank correlation coefficient');
legend('Kendall''s tau', 'Spearman''s rho_s', 'location','northwest');
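
%%
% We can check these relationships by simulation.  The following sketch
% revisits the lognormal construction from above: the sample linear
% correlation roughly matches the closed form (and falls short of rho),
% while Kendall's tau is unchanged by the monotonic exp transformation and
% roughly matches (2/pi)*asin(rho).
rho = .7;
sigma = .5;
Z = mvnrnd([0 0], sigma.^2 .* [1 rho; rho 1], 10000);
X = exp(Z);
corrTheory = (exp(rho.*sigma.^2) - 1) ./ (exp(sigma.^2) - 1)
corrSample = corr(X(:,1),X(:,2))
tauTheory = 2.*asin(rho)./pi
tauZ = corr(Z(:,1),Z(:,2), 'type','Kendall')
tauX = corr(X(:,1),X(:,2), 'type','Kendall')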

%%
% Thus, it's easy to create the desired rank correlation between X1 and X2,
% regardless of their marginal distributions, by choosing the correct rho
% parameter value for the linear correlation between Z1 and Z2.
%
% Notice that for the multivariate normal distribution, Spearman's rank
% correlation is almost identical to the linear correlation.  However, this
% is not true once we transform to the final random variables.
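
%%
% For example, suppose the goal is a Spearman rank correlation of 0.5
% between X1 and X2 (an arbitrary target, chosen here for illustration).
% Inverting the mapping above gives the required rho, and a simulation
% confirms the result for the Gamma and t marginals used earlier.
rho_s_target = .5;
rho = 2.*sin(rho_s_target.*pi./6)
Z = mvnrnd([0 0], [1 rho; rho 1], 10000);
U = normcdf(Z);
X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];
rho_s_sample = corr(X(:,1),X(:,2), 'type','Spearman')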


%% Copulas
% The first step of the construction described above defines what is known
% as a copula, specifically, a Gaussian copula.  A bivariate copula is
% simply a probability distribution on two random variables, each of whose
% marginal distributions is uniform.  These two variables may be completely
% independent, deterministically related (e.g., U2 = U1), or anything in
% between.  The family of bivariate Gaussian copulas is parameterized by
% Rho = [1 rho; rho 1], the linear correlation matrix.  U1 and U2 approach
% linear dependence as rho approaches +/- 1, and approach complete
% independence as rho approaches zero.
%
% Scatter plots of some simulated random values for various levels of rho
% illustrate the range of different possibilities for Gaussian copulas:
n = 500;
Z = mvnrnd([0 0], [1 .8; .8 1], n);
U = normcdf(Z,0,1);
subplot(2,2,1);
plot(U(:,1),U(:,2),'.');
title('rho = 0.8');
xlabel('U1');
ylabel('U2');
Z = mvnrnd([0 0], [1 .1; .1 1], n);
U = normcdf(Z,0,1);
subplot(2,2,2);
plot(U(:,1),U(:,2),'.');
title('rho = 0.1');
xlabel('U1');
ylabel('U2');
Z = mvnrnd([0 0], [1 -.1; -.1 1], n);
U = normcdf(Z,0,1);
subplot(2,2,3);
plot(U(:,1),U(:,2),'.');
title('rho = -0.1');
xlabel('U1');
ylabel('U2');
Z = mvnrnd([0 0], [1 -.8; -.8 1], n);
U = normcdf(Z,0,1);
subplot(2,2,4);
plot(U(:,1),U(:,2),'.');
title('rho = -0.8');
xlabel('U1');
ylabel('U2');

%%
% The dependence between U1 and U2 is completely separate from the marginal
% distributions of X1 = G1(U1) and X2 = G2(U2).  X1 and X2 can be given _any_
% marginal distributions, and still have the same rank correlation.
% This is one of the main appeals of copulas -- they allow this separate
% specification of dependence and marginal distribution.
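
%%
% To illustrate that separation, the following sketch applies two
% different (arbitrarily chosen) pairs of inverse CDFs to the _same_
% copula draws.  Kendall's tau is identical for the two datasets, because
% the ranks of the underlying uniform values are unchanged.
U = normcdf(mvnrnd([0 0], [1 .7; .7 1], 1000));
XA = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];
XB = [expinv(U(:,1),1) betainv(U(:,2),2,2)];
tauA = corr(XA(:,1),XA(:,2), 'type','Kendall')
tauB = corr(XB(:,1),XB(:,2), 'type','Kendall')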


%% t Copulas
% A different family of copulas can be constructed by starting from a
% bivariate t distribution, and transforming using the corresponding t CDF.
% The bivariate t distribution is parameterized with Rho, the linear
% correlation matrix, and nu, the degrees of freedom.  Thus, for example,
% we can speak of a t(1) or a t(5) copula, based on the multivariate t with
% one and five degrees of freedom, respectively.
%
% Scatter plots of some simulated random values for various levels of rho
% illustrate the range of different possibilities for t(1) copulas:
n = 500;
nu = 1;
T = mvtrnd([1 .8; .8 1], nu, n);
U = tcdf(T,nu);
subplot(2,2,1);
plot(U(:,1),U(:,2),'.');
title('rho = 0.8');
xlabel('U1');
ylabel('U2');
T = mvtrnd([1 .1; .1 1], nu, n);
U = tcdf(T,nu);
subplot(2,2,2);
plot(U(:,1),U(:,2),'.');
title('rho = 0.1');
xlabel('U1');
ylabel('U2');
T = mvtrnd([1 -.1; -.1 1], nu, n);
U = tcdf(T,nu);
subplot(2,2,3);
plot(U(:,1),U(:,2),'.');
title('rho = -0.1');
xlabel('U1');
ylabel('U2');
T = mvtrnd([1 -.8; -.8 1], nu, n);
U = tcdf(T,nu);
subplot(2,2,4);
plot(U(:,1),U(:,2),'.');
title('rho = -0.8');
xlabel('U1');
ylabel('U2');

%%
% A t copula has uniform marginal distributions for U1 and U2, just as a
% Gaussian copula does.  The rank correlation tau or rho_s between
% components in a t copula is also the same function of rho as for a
% Gaussian.  However, as these plots demonstrate, a t(1) copula differs
% quite a bit from a Gaussian copula, even when their components have the
% same rank correlation.  The difference is in their dependence structure.
% Not surprisingly, as the degrees of freedom parameter nu is made larger,
% a t(nu) copula approaches the corresponding Gaussian copula.
%
% As with a Gaussian copula, any marginal distributions can be imposed over
% a t copula.  For example, using a t copula with 1 degree of freedom, we
% can again generate random vectors from a bivariate distribution with
% Gamma(2,1) and t(5) marginals:
subplot(1,1,1);
n = 1000;
rho = .7;
nu = 1;
T = mvtrnd([1 rho; rho 1], nu, n);
U = tcdf(T,nu);
X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];

[n1,ctr1] = hist(X(:,1),20);
[n2,ctr2] = hist(X(:,2),20);
subplot(2,2,2);
plot(X(:,1),X(:,2),'.');
axis([0 15 -10 10]);
h1 = gca;
title('1000 Simulated Dependent t and Gamma Values');
xlabel('X1 ~ Gamma(2,1)');
ylabel('X2 ~ t(5)');
subplot(2,2,4);
bar(ctr1,-n1,1);
axis([0 15 -max(n1)*1.1 0]);
axis('off');
h2 = gca;
subplot(2,2,1);
barh(ctr2,-n2,1);
axis([-max(n2)*1.1 0 -10 10]);
axis('off');
h3 = gca;
h1.Position = [0.35 0.35 0.55 0.55];
h2.Position = [.35 .1 .55 .15];
h3.Position = [.1 .35 .15 .55];
colormap([.8 .8 1]);

%%
% Compared to the bivariate Gamma/t distribution constructed earlier, which
% was based on a Gaussian copula, the distribution constructed here, based
% on a t(1) copula, has the same marginal distributions and the same rank
% correlation between variables, but a very different dependence structure.
% This illustrates the fact that multivariate distributions are not
% uniquely defined by their marginal distributions, or by their
% correlations.  The choice of a particular copula in an application may be
% based on actual observed data, or different copulas may be used as a way
% of determining the sensitivity of simulation results to the input
% distribution.


%% Higher-Order Copulas
% The Gaussian and t copulas are known as elliptical copulas.  It's easy to
% generalize elliptical copulas to a higher number of dimensions.  For
% example, we can simulate data from a trivariate distribution with Gamma(2,1),
% Beta(2,2), and t(5) marginals using a Gaussian copula as follows.
subplot(1,1,1);
n = 1000;
Rho = [1 .4 .2; .4 1 -.8; .2 -.8 1];
Z = mvnrnd([0 0 0], Rho, n);
U = normcdf(Z,0,1);
X = [gaminv(U(:,1),2,1) betainv(U(:,2),2,2) tinv(U(:,3),5)];
plot3(X(:,1),X(:,2),X(:,3),'.');
grid on;
view([-55, 15]);
xlabel('X1');
ylabel('X2');
zlabel('X3');

%%
% Notice that the relationship between the linear correlation parameter rho
% and, for example, Kendall's tau, holds for each entry in the correlation
% matrix Rho used here.  We can verify that the sample rank correlations of
% the data are approximately equal to the theoretical values.
tauTheoretical = 2.*asin(Rho)./pi
%%
tauSample = corr(X, 'type','Kendall')


%% Copulas and Empirical Marginal Distributions
% To simulate dependent multivariate data using a copula, we have seen that
% we need to specify
%
%    1) the copula family (and any shape parameters),
%    2) the rank correlations among variables, and
%    3) the marginal distributions for each variable.
%
% Suppose we have two sets of stock return data, and we would like to run a
% Monte Carlo simulation with inputs that follow the same distributions as
% our data.
load stockreturns
nobs = size(stocks,1);
subplot(2,1,1);
hist(stocks(:,1),10);
xlabel('X1');
ylabel('Frequency');
subplot(2,1,2);
hist(stocks(:,2),10);
xlabel('X2');
ylabel('Frequency');

%%
% (These two data vectors have the same length, but that is not crucial.)
%
% We could fit a parametric model separately to each dataset, and use those
% estimates as our marginal distributions.  However, a parametric model may
% not be sufficiently flexible.  Instead, we can use an empirical model
% for the marginal distributions.  We only need a way to compute the
% inverse CDF.
%
% The empirical inverse CDF for these datasets is just a stair function,
% with steps at the values 1/nobs, 2/nobs, ... 1.  The step heights are
% simply the sorted data.
invCDF1 = sort(stocks(:,1));
n1 = length(stocks(:,1));
invCDF2 = sort(stocks(:,2));
n2 = length(stocks(:,2));
subplot(1,1,1);
stairs((1:nobs)/nobs, invCDF1,'b');
hold on;
stairs((1:nobs)/nobs, invCDF2,'r');
hold off;
legend('X1','X2');
xlabel('Cumulative Probability');
ylabel('X');

%%
% For the simulation, we might want to experiment with different copulas
% and correlations.  Here, we'll use a bivariate t(5) copula with a fairly
% large negative correlation parameter.
n = 1000;
rho = -.8;
nu = 5;
T = mvtrnd([1 rho; rho 1], nu, n);
U = tcdf(T,nu);
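% Invert the empirical CDFs: ceil(n1*u) maps each u in ((k-1)/n1, k/n1] to
% the k-th smallest value of the first dataset, and similarly for n2.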
X = [invCDF1(ceil(n1*U(:,1))) invCDF2(ceil(n2*U(:,2)))];

[n1,ctr1] = hist(X(:,1),10);
[n2,ctr2] = hist(X(:,2),10);
subplot(2,2,2);
plot(X(:,1),X(:,2),'.');
axis([-3.5 3.5 -3.5 3.5]);
h1 = gca;
title('1000 Simulated Dependent Values');
xlabel('X1');
ylabel('X2');
subplot(2,2,4);
bar(ctr1,-n1,1);
axis([-3.5 3.5 -max(n1)*1.1 0]);
axis('off');
h2 = gca;
subplot(2,2,1);
barh(ctr2,-n2,1);
axis([-max(n2)*1.1 0 -3.5 3.5]);
axis('off');
h3 = gca;
h1.Position = [0.35 0.35 0.55 0.55];
h2.Position = [.35 .1 .55 .15];
h3.Position = [.1 .35 .15 .55];
colormap([.8 .8 1]);

%%
% The marginal histograms of the simulated data closely match those of the
% original data, and would become identical as we simulate more pairs of
% values.  Notice that the values are drawn from the original data, and
% because there are only 100 observations in each dataset, the simulated
% data are somewhat "discrete".  One way to overcome this would be to
% add a small amount of random variation, possibly normally distributed, to
% the final simulated values.  This is equivalent to using a smoothed
% version of the empirical inverse CDF.
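
%%
% A minimal sketch of that idea; the jitter standard deviation below is an
% arbitrary choice, where in practice it might be set by a kernel
% bandwidth rule.
h = .1;                                % assumed smoothing bandwidth
XSmooth = X + normrnd(0, h, size(X));  % draws from a smoothed empirical inverse CDF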