%% Time Series Regression V: Predictor Selection
%
% This example shows how to select a parsimonious set of predictors with
% high statistical significance for multiple linear regression models. It
% is the fifth in a series of examples on time series regression, following
% the presentation in previous examples.

% Copyright 2012-2014 The MathWorks, Inc.

%% Introduction
%
% What are the "best" predictors for a multiple linear regression (MLR)
% model? Without a theoretical basis for answering this question, models
% may, at least initially, include a mix of "potential" predictors that
% degrade the quality of OLS estimates and confuse the identification of
% significant effects.
% 
% Ideally, a predictor set would have the following characteristics:
%
% * Every predictor contributes to the variation in the response (necessity
% and parsimony)
%
% * No additional predictors contribute to the variation in the response
% (sufficiency)
%
% * No additional predictors significantly change the coefficient estimates
% (stability)
%
% The realities of economic modeling, however, make it challenging to find
% such a set. First, there is the inevitability of omitted, significant
% predictors, which lead to models with biased and inefficient coefficient
% estimates. Other examples in this series discuss related challenges, such
% as correlation among predictors, correlation between predictors and
% omitted variables, limited sample variation, atypical data, and so forth,
% all of which pose problems for a purely statistical selection of "best"
% predictors.
%
% Automated selection techniques use statistical significance, despite its
% shortcomings, as a substitute for theoretical significance. These
% techniques usually pick a "best" set of predictors by minimizing some
% measure of forecast error. Optimization constraints are used to indicate
% required or excluded predictors, or to set the size of the final model.
%
% In the previous example on "Spurious Regression," it was suggested that
% certain transformations of predictors may be beneficial in producing a
% more accurate forecasting model. Selecting predictors _before_
% transformation has the advantage of retaining original units, which may
% be important in identifying a subset that is both meaningful and
% statistically significant. Typically, selection and transformation
% techniques are used together, with a modeling goal of achieving a simple,
% but still accurate, forecasting model of the response.
%
% To examine selection techniques, we begin by loading relevant data from
% the previous example on "Spurious Regression":

load Data_TSReg4

%%
% For reference, we display models with a full set of predictors in both
% levels and differences:

M0

%%

MD1

%% Stepwise Regression
%
% Many approaches to predictor selection use _t_-statistics of estimated
% coefficients, and _F_-statistics of groups of coefficients, to measure
% statistical significance. When using these statistics, it must be
% remembered that omitting predictors with insignificant individual
% contributions can hide a significant joint contribution. Also, _t_ and
% _F_ statistics can be unreliable in the presence of collinearity or
% trending variables. As such, data issues should be addressed prior to
% predictor selection.
%
% _Stepwise regression_ is a systematic procedure for adding and removing
% MLR predictors based on _F_ statistics. The procedure begins with an
% initial subset of the potential predictors, including any considered to
% be theoretically significant. At each step, the _p_-value of an
% _F_-statistic (that is, the square of a _t_-statistic with an identical
% _p_-value) is computed to compare models with and without one of the
% potential predictors. If a predictor is not currently in the model, the
% null hypothesis is that it would have a zero coefficient if added to the
% model. If there is sufficient evidence to reject the null hypothesis, the
% predictor is added to the model. Conversely, if a predictor is currently
% in the model, the null hypothesis is that it has a zero coefficient. If
% there is insufficient evidence to reject the null hypothesis, the
% predictor is removed from the model. At any step, the procedure may
% remove predictors that have been added or add predictors that have been
% removed.
%
% Stepwise regression proceeds as follows:
% 
% *1.* Fit the initial model.
% 
% *2.* If any predictors not in the model have _p_-values less than an
% entrance tolerance (that is, if it is unlikely that they would have zero
% coefficient if added to the model), add the one with the smallest
% _p_-value and repeat this step; otherwise, go to step *3*.
% 
% *3.* If any predictors in the model have _p_-values greater than an exit
% tolerance (that is, if it is unlikely that the hypothesis of a zero
% coefficient can be rejected), remove the one with the largest _p_-value
% and go to step *2*; otherwise, end.
% 
% Depending on the initial model and the order in which predictors are
% moved in and out, the procedure may build different models from the same
% set of potential predictors. The procedure terminates when no single step
% improves the model. There is no guarantee, however, that a different
% initial model and a different sequence of steps will not lead to a better
% fit. In this sense, stepwise models are locally optimal, but may not be
% globally optimal. The procedure is nevertheless efficient in avoiding an
% assessment of every possible subset of potential predictors, and often
% produces useful results in practice.
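%%
% As a minimal sketch of a single forward step (step *2* above), assume the
% candidate predictors are the columns of |X0| and the response is |y0|, as
% loaded above. Each candidate not yet in the model is tried in turn, the
% _p_-value of its coefficient in the augmented model is recorded, and the
% most significant candidate is added if it clears the entrance tolerance:

inModel = false(1,size(X0,2)); % start from an empty predictor set
pEnter = 0.05;                 % entrance tolerance
pAdd = NaN(1,size(X0,2));      % p-values of candidate terms
for j = find(~inModel)
    Mtry = fitlm(X0(:,[find(inModel),j]),y0); % augmented model
    pAdd(j) = Mtry.Coefficients.pValue(end);  % p-value of the added term
end
[pMin,jBest] = min(pAdd);
if pMin < pEnter
    inModel(jBest) = true;     % add the most significant candidate
end
addedPredictors = predNames0(inModel)

%%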
%
% The function |stepwiselm| (equivalent to the static method
% |LinearModel.stepwise|) carries out stepwise regression automatically. By
% default, it includes a constant in the model, starts from an empty set of
% predictors, and uses entrance/exit tolerances on the _F_-statistic
% _p_-values of 0.05 / 0.10. The following applies |stepwiselm| to the
% original set of potential predictors, setting an upper bound of |Linear|
% on the model, which limits the procedure to intercept and linear terms,
% excluding squared and interaction terms from consideration:

M0SW = stepwiselm(DataTable,'Upper','Linear')

%%
% The display shows the active predictors at termination. The _F_-tests
% choose two predictors with optimal joint significance, |BBB| and |CPF|.
% These are not the predictors with the most significant individual
% _t_-statistics, |AGE| and |CPF|, in the full model |M0|. The RMSE of the
% reduced model, 0.0808, is comparable to the RMSE of |M0|, 0.0763. The
% slight increase is the price of parsimony.
%
% For comparison, we apply the procedure to the full set of differenced
% predictors (with |AGE| undifferenced) in |MD1|:

MD1SW = stepwiselm(D1X0,D1y0,'Upper','Linear','VarNames',[predNamesD1,respNameD1])

%%
% The RMSE of the reduced model, 0.109, is again comparable to that of
% |MD1|, 0.106. The stepwise procedure pares the model down to a single
% predictor, |D1CPF|, whose _p_-value is significantly smaller than those
% of the other predictors.
%
% The RMSE, of course, is no guarantee of forecast performance, especially
% with small samples. Since there is a theoretical basis for including an
% aging effect in models of credit default [5], we might want to force
% |AGE| into the model. This is done by fixing |D1IGD ~ AGE| as both the
% initial model and a lower bound on all models considered:

MD1SWA = stepwiselm(D1X0,D1y0,'D1IGD~AGE',...
                              'Lower','D1IGD~AGE',...
                              'Upper','Linear',...
                              'VarNames',[predNamesD1,respNameD1])

%%
% The RMSE is slightly reduced, highlighting the local nature of the
% search. For this reason, multiple stepwise searches are recommended,
% moving forward from an empty initial model and backwards from a full
% initial model, while fixing any theoretically important predictors.
% Comparison of local minima, in the context of theory, produces the most
% reliable results.
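%%
% For example, a backward search over the differenced predictors starts from
% the full linear model, so that comparison with the forward results above
% shows whether the local searches agree (a sketch; the |'linear'|
% specification requests an initial model with an intercept and all linear
% terms):

MD1SWB = stepwiselm(D1X0,D1y0,'linear',...
                              'Upper','Linear',...
                              'VarNames',[predNamesD1,respNameD1])

%%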
%
% The stepwise regression procedure can be examined in more detail using
% the function |stepwise|, which allows interaction at each step, and the
% function |Example_StepwiseTrace|, which displays the history of
% coefficient estimates throughout the selection process.

%% Information Criteria
%
% Stepwise regression compares nested models, using _F_-tests that are
% equivalent to likelihood ratio tests. To compare models that are not
% extensions or restrictions of each other, _information criteria_ (IC) are
% often used. There are several common varieties, but all attempt to
% balance a measure of in-sample fit with a penalty for increasing the
% number of model coefficients. The Akaike information criterion (AIC) and
% the Bayes information criterion (BIC) are computed by the
% |ModelCriterion| method of the |LinearModel| class. We compare measures
% using the full set of predictors in both levels and differences:

AIC0 = M0.ModelCriterion.AIC
BIC0 = M0.ModelCriterion.BIC

%%

AICD1 = MD1.ModelCriterion.AIC
BICD1 = MD1.ModelCriterion.BIC

%%
% Because both models estimate the same number of coefficients, AIC and BIC
% favor |M0|, with the lower RMSE.
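%%
% Both criteria penalize the maximized loglikelihood, $\log{L}$, by the
% number of estimated parameters $p$, with $N$ the sample size:
%
% $$AIC = -2\log{L} + 2p, \quad BIC = -2\log{L} + p\log{N}.$$
%
% As a rough check against the reported values (a sketch; the exact
% parameter count used by |ModelCriterion|, such as whether the error
% variance is included, is an assumption here):

p = M0.NumEstimatedCoefficients;
N = M0.NumObservations;
AIC0check = -2*M0.LogLikelihood + 2*p
BIC0check = -2*M0.LogLikelihood + p*log(N)

%%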
%
% We might also want to compare |MD1| to the best reduced model found by
% stepwise regression, |MD1SWA|:

AICD1SWA = MD1SWA.ModelCriterion.AIC
BICD1SWA = MD1SWA.ModelCriterion.BIC

%%
% Both measures are reduced as a result of fewer coefficient estimates, but
% the model still does not make up for the increased RMSE relative to |M0|,
% which resulted from differencing to correct for spurious regression.

%% Cross-Validation
% 
% Another common model comparison technique is _cross-validation_. Like
% information criteria, cross-validation can be used to compare nonnested
% models and to penalize a model for overfitting. The difference is that
% cross-validation evaluates a model in the context of out-of-sample
% forecast performance, rather than in-sample fit.
%
% In standard cross-validation, data is split at random into a _training
% set_ and a _test set_. Model coefficients are estimated with the training
% set, then used to predict response values in the test set. Training and
% test sets are shuffled at random, and the process is carried out
% repeatedly. Small prediction errors, on average, across all of the test
% sets, indicate good forecast performance for the model predictors. There
% is no need to adjust for the number of coefficients, as in information
% criteria, since different data are used for fitting and for evaluation.
% Overfitting becomes apparent in the forecast performance.
%
% Cross-validation is a generalization of "split sample" or "hold out"
% techniques, where only a single subset is used to estimate prediction
% error. There is statistical evidence that cross-validation is a much
% better procedure for small data sets [2]. Asymptotically, minimizing the
% cross-validation error of a linear model is equivalent to minimizing AIC
% or BIC [6], [7].
%
% For time series data, the procedure has some complications. Time series
% data are generally not independent, so random training sets taken from
% anywhere in the time base may be correlated with random test sets.
% Cross-validation can behave erratically in this situation [3]. One
% solution is to test for an $L$ such that observations at time $t_1$ are
% uncorrelated with observations at time $t_2$ for $|t_1 - t_2| > L$ (see
% the example on "Residual Diagnostics"), then select training and test
% sets with sufficient separation. Another solution is to use sufficiently
% many test sets so that correlation effects are washed out by the random
% sampling. The procedure can be repeated using test sets of different
% sizes, and the sensitivity of the results can be evaluated.
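%%
% As an illustration of the first approach (a minimal sketch, with the
% decorrelation length $L$ and the block size chosen arbitrarily), a single
% "blocked" split holds out the last few observations for testing and leaves
% a gap of $L$ observations between the training and test periods:

L = 2;                  % assumed decorrelation length, in observations
hTest = 4;              % size of the hold-out test block
T = length(y0);
iTrain = 1:(T-hTest-L); % training period
iTest = (T-hTest+1):T;  % test period, separated from training by L
bTrain = regress(y0(iTrain),X0(iTrain,:));          % OLS fit on the training block
blockRMSE = sqrt(mean((y0(iTest)-X0(iTest,:)*bTrain).^2))

%%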
%
% Standard cross-validation is carried out by the |crossval| function. By
% default, data is randomly partitioned into 10 subsamples, each of which
% is used once as a test set (10-fold cross-validation). An average MSE is
% then computed across the tests. The following compares |M0| to |MD1SWA|.
% Because the data has ~20 observations (one more for the undifferenced
% data), the default test sets have a size of 2:

% Prediction function: OLS coefficients estimated from the training set,
% applied to the test-set predictors:
yFit = @(XTrain,yTrain,XTest)(XTest*regress(yTrain,XTrain));

cvMSE0 = crossval('MSE',X0,y0,'predfun',yFit);
cvRMSE0 = sqrt(cvMSE0)

%%

cvMSED1SWA = crossval('MSE',D1X0(:,[1 3]),D1y0,'predfun',yFit);
cvRMSED1SWA = sqrt(cvMSED1SWA)

%%
% The RMSEs are slightly higher than those found previously, 0.0763 and
% 0.108, respectively, and again favor the full, original set of
% predictors.

%% Lasso
%
% Finally, we consider the least absolute shrinkage and selection operator,
% or _lasso_ [4], [8]. The lasso is a regularization technique similar to
% ridge regression (discussed in the example on "Collinearity & Estimator
% Variance"), but with an important difference that is useful for predictor
% selection. Consider the following, equivalent formulation of the ridge
% estimator:
%
% $$\hat{\beta}_{ridge} = \min_{\beta} (SSE + k \sum_i{\beta_i^2}),$$
%
% where $SSE$ is the error (residual) sum of squares for the regression.
% Essentially, the ridge estimator minimizes $SSE$ while penalizing for
% large coefficients $\beta_i$. As the ridge parameter $k > 0$ increases,
% the penalty shrinks coefficient estimates toward 0 in an attempt to
% reduce the large variances produced by nearly collinear predictors.
%
% The lasso estimator has a similar formulation:
% 
% $$\hat{\beta}_{lasso} = \min_{\beta} (SSE + k \sum_i{|\beta_i|}).$$
% 
% The change in the penalty looks minor, but it affects the estimator in
% important ways. Like the ridge estimator, $\hat{\beta}_{lasso}$ is biased
% toward zero (giving up the "U" in BLUE). Unlike the ridge estimator,
% however, $\hat{\beta}_{lasso}$ is not linear in the response values $y_t$
% (giving up the "L" in BLUE). This fundamentally changes the nature of the
% estimation procedure. The new geometry allows coefficient estimates to
% shrink to zero for finite values of $k$, effectively selecting a subset
% of the predictors.
%
% Lasso is implemented by the |lasso| function. By default, |lasso|
% estimates the regression for a range of parameters $k$, computing the MSE
% at each value. We set |'CV'| to 10 to compute the MSEs by 10-fold
% cross-validation. The function |lassoPlot| displays traces of the
% coefficient estimates:

[lassoBetas,lassoInfo] = lasso(X0,y0,'CV',10);

[hax,hfig] = lassoPlot(lassoBetas,lassoInfo,'PlotType','Lambda');
hax.XGrid = 'on';
hax.YGrid = 'on';
hax.GridLineStyle = '-';
hax.Title.String = '{\bf Lasso Trace}';
hax.XLabel.String = 'Lasso Parameter';
hlplot = hax.Children;    % line objects in the trace plot
hMSEs = hlplot(5:6);      % vertical lines marking the MSE-based Lambda values
htraces = hlplot(4:-1:1); % coefficient traces, ordered to match predNames0
set(hlplot,'LineWidth',2)
set(hMSEs,'Color','m')
legend(htraces,predNames0,'Location','NW')
hfig.HandleVisibility = 'on';

%%
% Larger values of $k$ appear on the left, with the OLS estimates on the
% right, reversing the direction of a typical ridge trace. The degrees of
% freedom for the model (the number of nonzero coefficient estimates)
% increases from left to right, along the top of the plot. The dashed
% vertical lines show the $k$ values with minimum MSE (on the right), and
% minimum MSE plus one standard error (on the left). In this case, the
% minimum occurs for the OLS estimates, $k = 0$, exactly as for ridge
% regression. The one-standard-error value is often used as a guideline for
% choosing a smaller model with good fit [1].
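%%
% As a quick check (a sketch, assuming the standard fields of the
% cross-validated |lasso| output), the predictors retained at the
% one-standard-error value of the parameter are those with nonzero
% coefficient estimates in the corresponding column of the trace:

idx1SE = lassoInfo.Index1SE;  % index of the one-standard-error value
lasso1SEPreds = predNames0(lassoBetas(:,idx1SE)' ~= 0)

%%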
%
% The plot suggests |AGE| and |CPF| as a possible subset of the original
% predictors. We perform another stepwise regression with these predictors
% forced into the model:

M0SWAC = stepwiselm(X0,y0,'IGD~AGE+CPF',...
                          'Lower','IGD~AGE+CPF',...
                          'Upper','Linear',...
                          'VarNames',[predNames0,respName0])

%%
% The regression also moves |BBB| into the model, with a resulting RMSE
% below the value of 0.0808 found earlier by stepwise regression from an
% empty initial model, |M0SW|, which selected |BBB| and |CPF| alone.
%
% Because including |BBB| increases the number of estimated coefficients,
% we use AIC and BIC to compare the more parsimonious 2-predictor model
% |M0AC| found by the lasso to the expanded 3-predictor model |M0SWAC|:

M0AC = fitlm(DataTable(:,[1 3 5])) % AGE, CPF, and the response IGD

AIC0AC = M0AC.ModelCriterion.AIC
BIC0AC = M0AC.ModelCriterion.BIC

%%

AIC0SWAC = M0SWAC.ModelCriterion.AIC
BIC0SWAC = M0SWAC.ModelCriterion.BIC

%%
% The lower RMSE is enough to compensate for the extra predictor, and both
% criteria choose the 3-predictor model over the 2-predictor model.
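%%
% A cross-validated check of the larger subset is also possible (a sketch,
% assuming |AGE|, |BBB|, and |CPF| occupy the first three columns of |X0|,
% and reusing the prediction function |yFit| defined earlier):

cvMSE0SWAC = crossval('MSE',X0(:,1:3),y0,'predfun',yFit);
cvRMSE0SWAC = sqrt(cvMSE0SWAC)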

%% Comparing Models
%
% The procedures described here suggest a number of reduced models with
% statistical characteristics comparable to the models with the full set of
% original, or differenced, predictors. We summarize the results:
%
% *M0* Model with the original predictors, |AGE|, |BBB|, |CPF|, and |SPR|.
%
% *M0SW* Submodel of |M0| found by stepwise regression, starting from an
% empty model. It includes |BBB| and |CPF|.
%
% *M0SWAC* Submodel of |M0| found by stepwise regression, starting from a
% model that forces in |AGE| and |CPF|. Suggested by lasso. It includes
% |AGE|, |BBB|, and |CPF|.
%
% *MD1* Model with the original predictor |AGE| and the differenced
% predictors |D1BBB|, |D1CPF|, and |D1SPR|. Suggested by integration and
% stationarity testing in the example on "Spurious Regression."
%
% *MD1SW* Submodel of |MD1| found by stepwise regression, starting from an
% empty model. It includes |D1CPF|.
%
% *MD1SWA* Submodel of |MD1| found by stepwise regression, starting from a
% model that forces in |AGE|. Suggested by theory. It includes |AGE| and
% |D1CPF|.

% Compute missing information:
AIC0SW = M0SW.ModelCriterion.AIC;
BIC0SW = M0SW.ModelCriterion.BIC;

AICD1SW = MD1SW.ModelCriterion.AIC;
BICD1SW = MD1SW.ModelCriterion.BIC;

% Create model comparison table:
RMSE = [M0.RMSE;M0SW.RMSE;M0SWAC.RMSE;MD1.RMSE;MD1SW.RMSE;MD1SWA.RMSE];
AIC = [AIC0;AIC0SW;AIC0SWAC;AICD1;AICD1SW;AICD1SWA];
BIC = [BIC0;BIC0SW;BIC0SWAC;BICD1;BICD1SW;BICD1SWA];

Models = table(RMSE,AIC,BIC,...              
               'RowNames',{'M0','M0SW','M0SWAC','MD1','MD1SW','MD1SWA'})
             
%%
% Models involving the original, undifferenced data get generally higher
% marks (lower RMSEs and ICs) than models using differenced data, but the
% possibility of spurious regression, which led to consideration of the
% differenced data in the first place, must be remembered. In each model
% category, the results are mixed. The original models with the most
% predictors (|M0|, |MD1|) have the lowest RMSEs in their category, but
% there are reduced models with lower AICs (|M0SWAC|, |MD1SW|, |MD1SWA|)
% and lower BICs (|M0SW|, |M0SWAC|, |MD1SW|, |MD1SWA|). It is not unusual
% for information criteria to suggest smaller models, or for different
% information criteria to disagree (|M0SW|, |M0SWAC|). Also, there are many
% combinations of the original and differenced predictors that we have not
% included in our analysis. Practitioners must decide how much parsimony is
% enough, in the context of larger modeling goals.

%% Summary
%
% This example compares a number of predictor selection techniques in the
% context of a practical economic forecasting model. Many such techniques
% have been developed for experimental situations where data collection
% leads to a huge number of potential predictors, and statistical
% techniques are the only practical sorting method. In a situation with
% more limited data options, purely statistical techniques can lead to an
% array of potential models with comparable measures of goodness of fit.
% Theoretical considerations, as always, must play a crucial role in
% economic model selection, while statistics are used to select among
% competing proxies for relevant economic factors.

%% References
%
% [1] Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone.
% _Classification and Regression Trees_. Boca Raton, FL: Chapman &
% Hall/CRC, 1984.
% 
% [2] Goutte, C. "Note on Free Lunches and Cross-Validation." _Neural
% Computation_. Vol. 9, 1997, pp. 1211-1215.
%
% [3] Hart, J. D. "Kernel Regression Estimation With Time Series Errors."
% _Journal of the Royal Statistical Society_. Series B, Vol. 53, 1991, pp.
% 173-187.
%
% [4] Hastie, T., R. Tibshirani, and J. Friedman. _The Elements of
% Statistical Learning_. New York: Springer, 2008.
%
% [5] Jonsson, J. G., and M. Fridson. "Forecasting Default Rates on High
% Yield Bonds." _Journal of Fixed Income_. Vol. 6, No. 1, 1996, pp. 69-77.
%
% [6] Shao, J. "An Asymptotic Theory for Linear Model Selection."
% _Statistica Sinica_. Vol. 7, 1997, pp. 221-264.
%
% [7] Stone, M. "An Asymptotic Equivalence of Choice of Model by
% Cross-Validation and Akaike's Criterion." _Journal of the Royal
% Statistical Society_. Series B, Vol. 39, 1977, pp. 44-47.
%
% [8] Tibshirani, R. "Regression Shrinkage and Selection Via the Lasso."
% _Journal of the Royal Statistical Society_. Series B, Vol. 58, No. 1,
% 1996, pp. 267-288.