%% Time Series Regression IV: Spurious Regression
%
% This example considers trending variables, spurious regression, and
% methods of accommodation in multiple linear regression models. It is the
% fourth in a series of examples on time series regression, following the
% presentation in previous examples.

% Copyright 2012-2014 The MathWorks, Inc.

%% Introduction
%
% Predictors that trend over time are sometimes viewed with suspicion in
% multiple linear regression (MLR) models. Individually, however, they need
% not affect ordinary least squares (OLS) estimation. In particular, there
% is no need to linearize and detrend each predictor. If response values
% are well-described by a linear combination of the predictors, an MLR
% model is still applicable, and classical linear model (CLM) assumptions
% are not violated.
%
% If, however, a trending predictor is paired with a trending response,
% there is the possibility of _spurious regression_, where $t$-statistics
% and overall measures of fit become misleadingly "significant." That is,
% the statistical significance of relationships in the model does not
% accurately reflect the causal significance of relationships in the
% data-generating process (DGP).
%
% To investigate, we begin by loading relevant data from the previous
% example on "Influential Observations," and continue the analysis of the
% credit default model presented there:

load Data_TSReg3

%% Confounding
%
% One way that mutual trends arise in a predictor and a response is when
% both variables are correlated with a causally prior _confounding
% variable_ outside of the model. The omitted variable (OV) becomes a part
% of the innovations process, and the model becomes implicitly restricted,
% expressing a false relationship that would not exist if the OV were
% included in the specification. Correlation between the OV and model
% predictors violates the CLM assumption of strict exogeneity.
%
% When a model fails to account for a confounding variable, the result is
% _omitted variable bias_, where coefficients of specified predictors
% over-account for the variation in the response, shifting estimated values
% away from those in the DGP. Estimates are also _inconsistent_, since the
% source of the bias does not disappear with increasing sample size.
% Violations of strict exogeneity help model predictors track correlated
% changes in the innovations, producing overoptimistically small confidence
% intervals on the coefficients and a false sense of goodness of fit. A
% short simulation at the end of this section illustrates the effect.
%
% To avoid underspecification, it is tempting to pad out an explanatory
% model with _control variables_ representing a multitude of economic
% factors with only tenuous connections to the response. By this method,
% the likelihood of OV bias would seem to be reduced. However, if
% irrelevant predictors are included in the model, the variance of
% coefficient estimates increases, and so does the chance of false
% inferences about predictor significance. Even if _relevant_ predictors
% are included, if they do not account for all of the OVs, then the bias
% and inefficiency of coefficient estimates may increase or decrease,
% depending, among other things, on correlations between included and
% excluded variables [1]. This last point is usually lost in textbook
% treatments of OV bias, which typically compare an underspecified model to
% a practically unachievable fully-specified model.
%
% Without experimental designs for acquiring data, and the ability to use
% random sampling to minimize the effects of misspecification,
% econometricians must be very careful about choosing model predictors. The
% certainty of underspecification and the uncertain logic of control
% variables make the role of relevant theory especially important in model
% specification. Examples in this series on "Predictor Selection" and
% "Residual Diagnostics" describe the process in terms of cycles of
% diagnostics and respecification. The goal is to converge to an acceptable
% set of coefficient estimates, paired with a series of residuals from
% which all relevant specification information has been distilled.
%
% In the case of the credit default model introduced in the example on
% "Linear Models," confounding variables are certainly possible. The
% candidate predictors are somewhat ad hoc, rather than the result of any
% fundamental accounting of the causes of credit default. Moreover, the
% predictors are proxies, dependent on other series outside of the model.
% Without further analysis of potentially relevant economic factors,
% evidence of confounding must be found in an analysis of model residuals.
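%
% To illustrate the omitted variable bias described above, the following
% short simulation regresses a response on a single specified predictor
% when both are driven by a common confounder that is omitted from the
% model. This is a minimal sketch; the variables |z|, |x|, and |y| are
% purely illustrative and unrelated to the credit default data:

rng default
z = randn(1000,1);           % Confounding variable (omitted from the model)
x = 0.5*z + randn(1000,1);   % Specified predictor, correlated with z
y = 2*z + randn(1000,1);     % Response driven by z, with no direct effect of x
MOV = fitlm(x,y);            % Underspecified model: y regressed on x alone
MOV.Coefficients.Estimate(2) % Biased away from the true direct effect of x, which is 0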

%% Detrending
%
% Detrending is a common preprocessing step in econometrics, with different
% possible goals. Often, economic series are detrended in an attempt to
% isolate a stationary component amenable to ARMA analysis or spectral
% techniques. Just as often, series are detrended so that they can be
% compared on a common scale, as with per capita normalizations to remove
% the effect of population growth. In regression settings, detrending may
% be used to minimize spurious correlations.
%
% A plot of the credit default data (see the example on "Linear Models")
% shows that the predictor |BBB| and the response |IGD| are both trending.
% It might be hoped that trends could be removed by deleting a few atypical
% observations from the data. For example, the trend in the response seems
% mostly due to the single influential observation in 2001:

figure
hold on
plot(dates,y0,'k','LineWidth',2);
plot(dates,y0-detrend(y0),'m.-')
plot(datesd1,yd1-detrend(yd1),'g*-')
hold off
legend(respName0,'Trend','Trend with 2001 deleted','Location','NW')
xlabel('Year') 
ylabel('Response Level') 
title('{\bf Response}')
axis tight
grid on

%%
% Deleting the point reduces the trend, but does not eliminate it.
%
% Alternatively, variable transformations are used to remove trends. This
% may improve the statistical properties of a regression model, but it
% complicates analysis and interpretation. Any transformation alters the
% economic meaning of a variable, favoring the predictive power of a model
% over explanatory simplicity.
%
% The manner of trend-removal depends on the type of trend. One type of
% trend is produced by a _trend-stationary_ (TS) process, which is the sum
% of a deterministic trend and a stationary process. TS variables, once
% identified, are often linearized with a power or log transformation, then
% detrended by regressing on time. The |detrend| function, used above,
% removes the least-squares line from the data. This transformation often
% has the side effect of regularizing influential observations.
%
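% As an illustration, the least-squares detrending performed by |detrend|
% can be reproduced by regressing a series on a constant and a linear time
% trend, then retaining the residuals. This is a minimal sketch; the time
% index |tIdx| is introduced here only for illustration:

tIdx = (1:length(y0))';       % Time index for the response
W = [ones(size(tIdx)),tIdx];  % Constant and linear time trend
y0Trend = W*(W\y0);           % Fitted least-squares line
y0Detrended = y0 - y0Trend;   % Agrees with detrend(y0) up to round-off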
%% Stochastic Trends
%
% Not all trends are TS, however. _Difference stationary_ (DS) processes,
% also known as _integrated_ or _unit root_ processes, may exhibit
% _stochastic trends_, without a TS decomposition. When a DS predictor is
% paired with a DS response, problems of spurious regression appear [2].
% This is true even if the series are generated independently from one
% another, without any confounding. The problem is complicated by the fact
% that not all DS series are trending.
%
% Consider the following regressions between DS random walks with various
% degrees of drift. The coefficient of determination ($R^2$) is computed in
% repeated realizations, and its distribution is displayed. For comparison,
% the distribution for regressions between random vectors (without an
% autoregressive dependence) is also displayed:

T = 100;
numSims = 1000;
drifts = [0 0.1 0.2 0.3];
numModels = length(drifts);
Steps = randn(T,2,numSims);

% Regression between two random walks:
ResRW = zeros(numSims,T,numModels);
RSqRW = zeros(numSims,numModels);
for d = 1:numModels
    
    for s = 1:numSims
        
        Y = zeros(T,2);
        for t = 2:T 
            Y(t,:) = drifts(d) + Y(t-1,:) + Steps(t,:,s);
        end
        
        % The compact regression formulation:
        %
        % MRW = fitlm(Y(:,1),Y(:,2));
        % ResRW(s,:,d) = MRW.Residuals.Raw';
        % RSqRW(s,d) = MRW.Rsquared.Ordinary;
        %
        % is replaced by the following for
        % efficiency in repeated simulation:
        
        X = [ones(size(Y(:,1))),Y(:,1)];
        y = Y(:,2);
        Coeff = X\y;
        yHat = X*Coeff;
        res = y-yHat;
        yBar = mean(y);
        regRes = yHat-yBar;
        SSR = regRes'*regRes;
        SSE = res'*res;
        SST = SSR+SSE;
        RSq = 1-SSE/SST;
        ResRW(s,:,d) = res';
        RSqRW(s,d) = RSq;
        
    end
    
end

% Plot R-squared distributions:
figure

[v(1,:),edges] = histcounts(RSqRW(:,1));
for i=2:size(RSqRW,2)
    v(i,:) = histcounts(RSqRW(:,i),edges); 
end
numBins = size(v,2);

ax = axes;
ticklocs = edges(1:end-1)+diff(edges)/2;
names = cell(1,numBins);
for i = 1:numBins
    names{i} = sprintf('%0.5g-%0.5g',edges(i),edges(i+1));
end
bar(ax,ticklocs,v.');
set(ax,'XTick',ticklocs,'XTickLabel',names,'XTickLabelRotation',30);

fig = gcf;
CMap = fig.Colormap;
Colors = CMap(linspace(1,64,numModels),:);
legend(strcat({'Drift = '},num2str(drifts','%-2.1f')),'Location','North')
xlabel('{\it R}^2')
ylabel('Number of Simulations')
title('{\bf Regression Between Two Independent Random Walks}')

clear RSqRW

% Regression between two random vectors:
RSqR = zeros(numSims,1);
for s = 1:numSims
    
    % The compact regression formulation:
    %
    % MR = fitlm(Steps(:,1,s),Steps(:,2,s));
    % RSqR(s) = MR.Rsquared.Ordinary;
    %
    % is replaced by the following for
    % efficiency in repeated simulation:

    X = [ones(size(Steps(:,1,s))),Steps(:,1,s)];
    y = Steps(:,2,s);
    Coeff = X\y;
    yHat = X*Coeff;
    res = y-yHat;
    yBar = mean(y);
    regRes = yHat-yBar;
    SSR = regRes'*regRes;
    SSE = res'*res;
    SST = SSR+SSE;
    RSq = 1-SSE/SST;
    RSqR(s) = RSq;
    
end

% Plot R-squared distribution:
figure

histogram(RSqR)
ax = gca;
ax.Children.FaceColor = [.8 .8 1];
xlabel('{\it R}^2')
ylabel('Number of Simulations')
title('{\bf Regression Between Two Independent Random Vectors}')

clear RSqR

%%
% The $R^2$ for the random-walk regressions becomes more significant as the
% drift coefficient increases. Even with zero drift, random-walk
% regressions are more significant than regressions between random vectors,
% where $R^2$ values fall almost exclusively below 0.1.
%
% Spurious regressions are often accompanied by signs of autocorrelation in
% the residuals, which can serve as a diagnostic clue. The following shows
% the distribution of autocorrelation functions (ACF) for the residual
% series in each of the random-walk regressions above:

numLags = 20;
ACFResRW = zeros(numSims,numLags+1,numModels);
for s = 1:numSims 
    for d = 1:numModels
        ACFResRW(s,:,d) = autocorr(ResRW(s,:,d));
    end
end
clear ResRW

% Plot ACF distributions:
figure
boxplot(ACFResRW(:,:,1),'PlotStyle','compact','BoxStyle','outline','LabelOrientation','horizontal','Color',Colors(1,:))
ax = gca;
ax.XTickLabel = {''};
hold on
boxplot(ACFResRW(:,:,2),'PlotStyle','compact','BoxStyle','outline','LabelOrientation','horizontal','Widths',0.4,'Color',Colors(2,:))
ax.XTickLabel = {''};
boxplot(ACFResRW(:,:,3),'PlotStyle','compact','BoxStyle','outline','LabelOrientation','horizontal','Widths',0.3,'Color',Colors(3,:))
ax.XTickLabel = {''};
boxplot(ACFResRW(:,:,4),'PlotStyle','compact','BoxStyle','outline','LabelOrientation','horizontal','Widths',0.2,'Color',Colors(4,:),'Labels',0:20)
line([0,21],[0,0],'Color','k')
line([0,21],[2/sqrt(T),2/sqrt(T)],'Color','b')
line([0,21],[-2/sqrt(T),-2/sqrt(T)],'Color','b')
hold off
xlabel('Lag')
ylabel('Sample Autocorrelation')
title('{\bf Residual ACF Distributions}')
grid on

clear ACFResRW

%%
% Colors correspond to drift values in the bar plot above. The plot shows
% extended, significant residual autocorrelation for the majority of
% simulations. Diagnostics related to residual autocorrelation are
% discussed further in the example on "Residual Diagnostics."
%
%% Differencing
%
% The simulations above lead to the conclusion that, trending or not, _all_
% regression variables should be tested for integration. It is then usually
% advised that DS variables be detrended by differencing, rather than
% regressing on time, to achieve a stationary mean.
%
% The distinction between TS and DS series has been widely studied (for
% example, in [3]), particularly the effects of _underdifferencing_
% (treating DS series as TS) and _overdifferencing_ (treating TS series as
% DS). If one trend type is treated as the other, with inappropriate
% preprocessing to achieve stationarity, regression results become
% unreliable, and the resulting models generally have poor forecasting
% ability, regardless of the in-sample fit.
%
% Econometrics Toolbox(TM) has several tests for the presence or absence of
% integration: |adftest|, |pptest|, |kpsstest|, and |lmctest|. For example,
% the augmented Dickey-Fuller test, |adftest|, looks for statistical
% evidence against a null of integration. With a trend-stationary
% alternative specified (and other settings at their defaults), tests on
% both |IGD| and |BBB| fail to reject the null of a unit root:

IGD = y0;
BBB = X0(:,2);

[h1IGD,pValue1IGD] = adftest(IGD,'model','TS')
[h1BBB,pValue1BBB] = adftest(BBB,'model','TS')

%%
% Other tests, like the KPSS test, |kpsstest|, look for statistical
% evidence against a null of trend-stationarity. The results are mixed:

s = warning('off'); % Turn off large/small statistics warnings

[h0IGD,pValue0IGD] = kpsstest(IGD,'trend',true)
[h0BBB,pValue0BBB] = kpsstest(BBB,'trend',true)

%%
% The _p_-values of 0.1 and 0.01 are, respectively, the largest and
% smallest in the table of critical values used by the right-tailed
% |kpsstest|. They are reported when the test statistics are, respectively,
% very small or very large. Thus the evidence against trend-stationarity is
% especially weak in the first test, and especially strong in the second
% test. The |IGD| results are ambiguous, failing to reject
% trend-stationarity even after the Dickey-Fuller test failed to reject
% integration. The results for |BBB| are more consistent, suggesting the
% predictor is integrated.
%
% What is needed for preprocessing is a systematic application of these
% tests to all of the variables in a regression, and their differences. The
% utility function |i10test| automates the required series of tests. The
% following performs paired ADF/KPSS tests on all of the model variables
% and their first differences:

I.names = {'model'};
I.vals = {'TS'};
S.names = {'trend'};
S.vals = {true};

i10test(DataTable,'numDiffs',1,...
        'itest','adf','iparams',I,...
        'stest','kpss','sparams',S);

warning(s) % Restore warning state

%%
% Columns show test results and _p_-values against nulls of integration,
% $I(1)$, and stationarity, $I(0)$. At the given parameter settings, the
% tests suggest that |AGE| is stationary (_integrated of order_ 0), and
% |BBB| and |SPR| are integrated but brought to stationarity by a single
% difference (_integrated of order_ 1). The results are ambiguous for |CPF|
% and |IGD|, but both appear to be stationary after a single difference.
%
% For comparison with the original regression in the example on "Linear
% Models," we replace |BBB|, |SPR|, |CPF|, and |IGD| with their first
% differences, |D1BBB|, |D1SPR|, |D1CPF|, and |D1IGD|. We leave |AGE|
% undifferenced:

%%
D1X0 = diff(X0);
D1X0(:,1) = X0(2:end,1); % Use undifferenced AGE
D1y0 = diff(y0);
predNamesD1 = {'AGE','D1BBB','D1CPF','D1SPR'};
respNameD1 = {'D1IGD'};

%%
% Original regression with undifferenced data:
M0

%%
% Regression with differenced data:
MD1 = fitlm(D1X0,D1y0,'VarNames',[predNamesD1,respNameD1])

%%
% Differencing the data increases the standard errors on all coefficient
% estimates, as well as the overall RMSE. This may be the price of
% correcting a spurious regression. The sign and the size of the
% coefficient estimate for the undifferenced predictor, |AGE|, show little
% change. Even after differencing, |CPF| has pronounced significance among
% the predictors. Accepting the revised model depends on practical
% considerations like explanatory simplicity and forecast performance,
% evaluated in the example on "Forecasting."
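%
% As a quick check of the claim about standard errors, the coefficient
% standard errors and root mean squared errors of the two fits can be
% compared directly. This is a minimal sketch, assuming |M0| is the |fitlm|
% model estimated in the earlier example and loaded above:

SEBoth = [M0.Coefficients.SE,MD1.Coefficients.SE] % Columns: original model, differenced model
RMSEBoth = [M0.RMSE,MD1.RMSE]                     % Root mean squared errors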

%% Summary
%
% Because of the possibility of spurious regression, it is usually advised
% that variables in time series regressions be detrended, as necessary, to
% achieve stationarity before estimation. There are trade-offs, however,
% between working with variables that retain their original economic
% meaning and transformed variables that improve the statistical
% characteristics of OLS estimation. The trade-off may be difficult to
% evaluate, since the degree of "spuriousness" in the original regression
% cannot be measured directly. The methods discussed in this example will
% likely improve the forecasting abilities of resulting models, but may do
% so at the expense of explanatory simplicity.
%
%% References
%
% [1] Clarke, K. A. "The Phantom Menace: Omitted Variable Bias in
% Econometric Research." _Conflict Management and Peace Science_. Vol. 22,
% 2005, pp. 341-352.
%
% [2] Granger, C. W. J., and P. Newbold. "Spurious Regressions in
% Econometrics." _Journal of Econometrics_. Vol. 2, 1974, pp. 111-120.
%
% [3] Nelson, C. R., and C. I. Plosser. "Trends Versus Random Walks in
% Macroeconomic Time Series: Some Evidence and Implications." _Journal of
% Monetary Economics_. Vol. 10, 1982, pp. 139-162.