
%% Time Series Regression III: Influential Observations
%
% This example shows how to detect influential observations in time series
% data and accommodate their effect on multiple linear regression models.
% It is the third in a series of examples on time series regression,
% following the presentation in previous examples.

% Copyright 2012-2014 The MathWorks, Inc.

%% Introduction
%
% When considering the empirical limitations that affect OLS estimates,
% Belsley et al. [1] advise that collinearities be addressed first. A next
% step is to look for influential observations, which, individually or in
% groups, have measurable effects on regression results.
% We distinguish the essentially metric notion of "influential observation"
% from the more subjective notion of "outlier," which may include any data
% that do not follow expected patterns.
%
% We begin by loading relevant data from the previous example on
% "Collinearity & Estimator Variance," and continue the analysis of the
% credit default model presented there:

load Data_TSReg2

%% Influential Observations
%
% Influential observations arise in two fundamentally distinct ways. First,
% they may be the result of measurement or recording errors. In that case,
% they are just bad data, detrimental to model estimation. On the other
% hand, they may reflect the true distribution of the innovations process,
% exhibiting heteroscedasticity, skewness, or leptokurtosis for which the
% model fails to account. Such observations may contain abnormal sample
% information that is nevertheless essential to accurate model estimation.
% Determining the type of influential observation is difficult when looking
% at data alone. The best clues are often found in the data-model
% interactions that produce the residual series. We investigate these
% further in the example on "Residual Diagnostics."
%
% Preprocessing influential observations has three components:
% identification, influence assessment, and accommodation. In econometric
% settings, identification and influence assessment are usually based on
% regression statistics. Accommodation, if there is any, is usually a
% choice between deleting data, which requires making assumptions about the
% DGP, or else implementing a suitably robust estimation procedure, with
% the potential to obscure abnormal, but possibly important, information.
% 
% Time series data differ from cross-sectional data in that deleting
% observations leaves "holes" in the time base of the sample. Standard
% methods for imputing replacement values, such as smoothing, violate the
% CLM assumption of strict exogeneity. If time series data exhibit serial
% correlation, as they often do in economic settings, deleting observations
% will alter estimated autocorrelations. The ability to diagnose departures
% from model specification, through residual analysis, is compromised. As a
% result, the modeling process must cycle between diagnostics and
% respecification until acceptable coefficient estimates produce an
% acceptable series of residuals.
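
%%
% To illustrate the point about autocorrelations, a small simulation,
% separate from the credit default data, deletes a block of observations
% from a stationary AR(1) series and splices the remaining data together.
% The spliced series pairs observations that were never adjacent, so the
% lag 1 sample autocorrelation changes; the effect grows with the number
% of deletions:

rng default                              % reproducibility
eSim = randn(100,1);
ySim = filter(1,[1 -0.7],eSim);          % simulated AR(1) series, phi = 0.7
r1Full = corr(ySim(1:end-1),ySim(2:end)) % lag 1 autocorrelation, full series
yDel = ySim([1:40,61:end]);              % delete observations 41-60, splice
r1Del = corr(yDel(1:end-1),yDel(2:end))  % lag 1 autocorrelation, spliced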

%% Deletion Diagnostics
%
% The function |fitlm| (equivalent to the static method |LinearModel.fit|)
% computes many of the standard regression statistics used to measure the
% influence of individual observations. These are based on a sequence of
% one-at-a-time _row deletions_ of jointly observed predictor and response
% values. Regression statistics are computed for each delete-1 data set and
% compared to the statistics for the full data set.
%
% Significant changes in the coefficient estimates $\hat{\beta}$ after
% deleting an observation are the main concern. The fitted model property
% |Diagnostics.dfBetas| scales these differences by estimates of the
% individual coefficient variances, for comparison:

dfBetas = M0.Diagnostics.dfBetas;

figure
hold on
plot(dates,dfBetas(:,2:end),'LineWidth',2)
plot(dates,dfBetas(:,1),'k','LineWidth',2)
hold off
legend([predNames0,'Intercept'],'Location','Best')
xlabel('Observation Deleted') 
ylabel('Scaled Change in Coefficient Estimate') 
title('{\bf Delete-1 Coefficient Estimate Changes}')
axis tight
grid on
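
%%
% The scaled changes can also be computed directly, which makes clear what
% |dfBetas| measures. The following sketch refits the model with each
% observation deleted in turn, scaling each coefficient change by the
% delete-1 root MSE times the square root of the corresponding diagonal
% element of $(X^T X)^{-1}$, the standard DFBETAS scaling. Up to sign and
% scaling conventions, the result should match |M0.Diagnostics.dfBetas|:

X = [ones(length(y0),1),X0];        % design matrix with intercept
[T0,p] = size(X);
b = X\y0;                           % full-sample coefficient estimates
XtXd = diag(inv(X'*X));             % diagonal of inv(X'X)
dfB = zeros(T0,p);
for i = 1:T0
    keep = true(T0,1); keep(i) = false;
    bi = X(keep,:)\y0(keep);        % delete-1 coefficient estimates
    res = y0(keep)-X(keep,:)*bi;    % delete-1 residuals
    s2i = (res'*res)/(T0-1-p);      % delete-1 MSE
    dfB(i,:) = ((b-bi)./sqrt(s2i*XtXd))'; % scaled coefficient changes
end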

%% 
% Effects of the deletions on component pairs in $\hat{\beta}$ are made
% apparent in a matrix of 2-D scatter plots of the changes:

figure
gplotmatrix(dfBetas,[],[],[],'o',2,[],...
            'variable',['Const',predNames0]);
title('{\bf Delete-1 Coefficient Estimate Changes}')                       

%%
% With sufficient data, these scatters tend to be approximately elliptical
% [2]. Outlying points can be labeled with the name of the corresponding
% deleted observation by typing |gname(dates)| at the command prompt, then
% clicking on a point in the plots.
%
% Alternatively, _Cook's distance_, found in the
% |Diagnostics.CooksDistance| property of the fitted model, is a common
% summary statistic for these plots, with contours forming ellipses
% centered around $\hat{\beta}$ (that is, |dfBeta = 0|). Points far from
% the center in multiple plots have a large Cook's distance, indicating an
% influential observation:

cookD = M0.Diagnostics.CooksDistance;

figure
plot(dateNums,cookD,'m','LineWidth',2)
ax = gca;
ax.XTick = dateNums(1:2:end);
datetick('x','yyyy','keepticks')
recessionplot
xlabel('Observation') 
ylabel('Cook''s Distance') 
title('{\bf Cook''s Distance}')
axis tight
grid on

%% 
% If $\hat{\beta}_{(i)}$ is the estimated coefficient vector with the
% $i^{th}$ observation deleted from the data, then Cook's distance is
% proportional to the squared Euclidean distance between
%
% $$\hat{y}_t = X_t \hat{\beta}$$
%
% and
%
% $$\hat{y}_{t(i)} = X_t \hat{\beta}_{(i)}.$$
%
% As a result, Cook's distance is a direct measure of the influence of an
% observation on fitted response values.
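%
% As a concrete check on this interpretation, the following sketch
% computes the scaled squared distance between the two fitted-value
% vectors for each deletion. The scaling, by the number of coefficients
% times the full-model MSE, is the standard definition, so the result
% should reproduce |M0.Diagnostics.CooksDistance|:

X = [ones(length(y0),1),X0];        % design matrix with intercept
[T0,p] = size(X);
b = X\y0;
yHat = X*b;                         % full-sample fitted values
mse = sum((y0-yHat).^2)/(T0-p);     % full-model MSE
D = zeros(T0,1);
for i = 1:T0
    keep = true(T0,1); keep(i) = false;
    bi = X(keep,:)\y0(keep);        % delete-1 coefficient estimates
    D(i) = sum((yHat-X*bi).^2)/(p*mse);
end

%%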
%
% A related measure of influence is _leverage_, which uses the normal
% equations to write
%
% $$\hat{y}_t = X_t \hat{\beta} = X_t (X_t^T X_t)^{-1} X_t^T y_t = H y_t,$$
%
% where $H$ is the _hat matrix_, computed from the predictor data alone.
% The diagonal elements of $H$ are the leverage values, giving
% componentwise proportions of the observed $y_t$ contributing to the
% corresponding estimates in $\hat{y}_t$. The leverage values, found in the
% |Diagnostics.Leverage| property of the fitted model, emphasize different
% sources of influence:

leverage = M0.Diagnostics.Leverage;

figure
plot(dateNums,leverage,'m','LineWidth',2)
ax = gca;
ax.XTick = dateNums(1:2:end);
datetick('x','yyyy','keepticks')
recessionplot
xlabel('Observation') 
ylabel('Leverage') 
title('{\bf Leverage}')
axis tight
grid on
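
%%
% The hat matrix itself is easy to form from the design matrix (intercept
% included), and its diagonal reproduces the leverage values plotted
% above:

X = [ones(size(X0,1),1),X0];        % design matrix with intercept
H = X*((X'*X)\X');                  % hat matrix
h = diag(H);                        % equals M0.Diagnostics.Leverage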

%%
% Another common measure of influence, the _Mahalanobis distance_, is just
% a scaled version of the leverage. The squared Mahalanobis distances of
% the observations in |X0| can be computed using |d = mahal(X0,X0)|, in
% which case the leverage values are given by |h = d/(T0-1)+(1/T0)|.
%
% Additional diagnostic plots can be created by retrieving other statistics
% from the |Diagnostics| property of a fitted model, or by using the
% |plotDiagnostics| method of the |LinearModel| class.
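
%%
% For example, the relation between the leverage values and the squared
% Mahalanobis distances returned by |mahal| can be checked directly:

d = mahal(X0,X0);                   % squared Mahalanobis distances
T0 = length(d);                     % sample size
hCheck = d/(T0-1) + 1/T0;           % equals M0.Diagnostics.Leverage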

%% Economic Significance
%
% Before deleting data, some kind of economic meaning should be assigned to
% the influential points identified by the various measures. Cook's
% distance, associated with changes in the overall response, shows a sharp
% spike in 2001. Leverage, associated with predictor data alone, shows a
% sharp spike in 1988. It is also noteworthy that after the sudden increase
% in leverage and a period of high default rates, the predictor |BBB| bends
% upward after 1991, and the percentage of lower-grade bonds begins to
% trend. (See a plot of the predictors in the example on "Linear Models.")
% 
% Some clues are found in the economic history of the times. 2001 was a
% period of recession in the U.S. economy (second vertical band in the
% plots above), brought on, in part, by the collapse of the speculative
% Internet bubble and a reduction in business investments. It was also the
% year of the September 11 attacks, which delivered a severe shock to the
% bond markets. Uncertainty, rather than quantifiable risk, characterized
% investment decisions for the rest of that year. The 1980s, on the other
% hand, saw the beginning of a long-term change in the character of the
% bond markets. New issues of high-yield bonds, which came to be known as
% "junk bonds," were used to finance many corporate restructuring projects.
% This segment of the bond market collapsed in 1989. Following a recession
% (first vertical band in the plots above) and an oil price shock in
% 1990-1991, the high-yield market began to grow again, and matured.
%
% The decision to delete data ultimately depends on the purpose of the
% model. If the purpose is mostly explanatory, deleting accurately recorded
% data is inappropriate. If, however, the purpose is forecasting, then it
% must be asked whether deleting points would create a presample that is
% more "typical" of the past, and so of the future. The historical context of the
% data in 2001, for example, may lead to the conclusion that it
% misrepresents historical patterns, and should not be allowed to influence
% a forecasting model. Likewise, the history of the 1980s may lead to the
% conclusion that a structural change occurred in the bond markets, and
% data prior to 1991 should be ignored for forecasts in the new regime.
%
% For reference, we create both of the amended data sets:

% Delete 2001:
d1 = (dates ~= 2001); % Delete 1
datesd1 = dates(d1);
Xd1 = X0(d1,:);
yd1 = y0(d1);

% Delete dates prior to 1991, as well:
dm = (datesd1 >= 1991); % Delete many
datesdm = datesd1(dm);
Xdm = Xd1(dm,:);
ydm = yd1(dm);

%% Summary
%
% The effects of the deletions on model estimation are summarized below.
% Tabular arrays provide a convenient format for comparing the regression
% statistics across models:

Md1 = fitlm(Xd1,yd1);
Mdm = fitlm(Xdm,ydm);

% Model mean squared errors:
MSEs = table(M0.MSE,...
             Md1.MSE,...
             Mdm.MSE,...
             'VariableNames',{'Original','Delete01','Post90'},...              
             'RowNames',{'MSE'})

% Coefficient estimates:
fprintf('\n')
Coeffs = table(M0.Coefficients.Estimate,...
               Md1.Coefficients.Estimate,...
               Mdm.Coefficients.Estimate,...
               'VariableNames',{'Original','Delete01','Post90'},...              
               'RowNames',['Const',predNames0])

% Coefficient standard errors:
fprintf('\n')
StdErrs = table(M0.Coefficients.SE,...
               Md1.Coefficients.SE,...
               Mdm.Coefficients.SE,...
               'VariableNames',{'Original','Delete01','Post90'},...              
               'RowNames',['Const',predNames0])

%%
% The MSE improves with deletion of the point in 2001, and then again with
% deletion of the pre-1991 data. Deleting the point in 2001 also has the
% effect of tightening the standard errors on the coefficient estimates.
% Deleting all of the data prior to 1991, however, severely reduces the
% sample size, and the standard errors of several of the estimates grow
% larger than they were with the original data.

%% References
%
% [1] Belsley, D. A., E. Kuh, and R. E. Welsch. _Regression Diagnostics_.
% Hoboken, NJ: John Wiley & Sons, 1980.
%
% [2] Weisberg, S. _Applied Linear Regression_. Hoboken, NJ: John Wiley &
% Sons, Inc., 2005.