%% Partial Least Squares Regression and Principal Components Regression
% This example shows how to apply Partial Least Squares Regression (PLSR)
% and Principal Components Regression (PCR), and discusses the
% effectiveness of the two methods.  PLSR and PCR are both methods to model
% a response variable when there are a large number of predictor variables,
% and those predictors are highly correlated or even collinear.  Both
% methods construct new predictor variables, known as components, as linear
% combinations of the original predictor variables, but they construct
% those components in different ways.  PCR creates components to explain
% the observed variability in the predictor variables, without considering
% the response variable at all. On the other hand, PLSR does take the
% response variable into account, and therefore often leads to models that
% are able to fit the response variable with fewer components.  Whether or
% not that ultimately translates into a more parsimonious model, in terms
% of its practical use, depends on the context.
%

%   Copyright 2008 The MathWorks, Inc.



%% Loading the Data
% Load a data set comprising spectral intensities of 60 samples of gasoline at
% 401 wavelengths, and their octane ratings.  These data are described in
% Kalivas, John H., "Two Data Sets of Near Infrared Spectra," Chemometrics and
% Intelligent Laboratory Systems, v.37 (1997) pp.255-259.
load spectra
whos NIR octane

%%
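% Plot the spectra as a function of wavelength index, with the curves
% sorted and colored by octane rating.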
[dummy,h] = sort(octane);
oldorder = get(gcf,'DefaultAxesColorOrder');
set(gcf,'DefaultAxesColorOrder',jet(60));
plot3(repmat(1:401,60,1)',repmat(octane(h),1,401)',NIR(h,:)');
set(gcf,'DefaultAxesColorOrder',oldorder);
xlabel('Wavelength Index'); ylabel('Octane'); axis('tight');
grid on

%% Fitting the Data with Two Components
% Use the |plsregress| function to fit a PLSR model with ten PLS components
% and one response.
X = NIR;
y = octane;
[n,p] = size(X);
[Xloadings,Yloadings,Xscores,Yscores,betaPLS10,PLSPctVar] = plsregress(...
	X,y,10);
%%
% Ten components may be more than are needed to fit the data adequately,
% but diagnostics from this fit can guide the choice of a simpler model
% with fewer components.  For example, one quick way to choose the number
% of components is to plot the percent of variance explained in the
% response variable as a function of the number of components.
plot(1:10,cumsum(100*PLSPctVar(2,:)),'-bo');
xlabel('Number of PLS components');
ylabel('Percent Variance Explained in Y');
%%
% In practice, more care would probably be advisable in choosing the number
% of components.  Cross-validation, for instance, is a widely used method
% that will be illustrated later in this example.  For now, the above plot
% suggests that PLSR with two components explains most of the variance in
% the observed |y|.  Compute the fitted response values for the
% two-component model.
[Xloadings,Yloadings,Xscores,Yscores,betaPLS] = plsregress(X,y,2);
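% betaPLS contains the intercept as its first element, so prepend a
% column of ones to X.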
yfitPLS = [ones(n,1) X]*betaPLS;

%%
% Next, fit a PCR model with two principal components.  The first step is
% to perform Principal Components Analysis on |X|, using the |pca|
% function, and retaining two principal components. PCR is then just a
% linear regression of the response variable on those two components.  It
% often makes sense to normalize each variable first by its standard
% deviation when the variables have very different amounts of variability;
% however, that is not done here.
[PCALoadings,PCAScores,PCAVar] = pca(X,'Economy',false);
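% The PCA scores are centered, so regress the centered response on the
% first two scores with no intercept term.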
betaPCR = regress(y-mean(y), PCAScores(:,1:2));
%%
% To make the PCR results easier to interpret in terms of the original
% spectral data, transform to regression coefficients for the original,
% uncentered variables.
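% The first line maps the score-space coefficients onto the original
% variables; the second restores the intercept implied by the centering
% of |X| and |y|.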
betaPCR = PCALoadings(:,1:2)*betaPCR;
betaPCR = [mean(y) - mean(X)*betaPCR; betaPCR];
yfitPCR = [ones(n,1) X]*betaPCR;

%%
% Plot fitted vs. observed response for the PLSR and PCR fits.
plot(y,yfitPLS,'bo',y,yfitPCR,'r^');
xlabel('Observed Response');
ylabel('Fitted Response');
legend({'PLSR with 2 Components' 'PCR with 2 Components'},  ...
	'location','NW');
%%
% In a sense, the comparison in the plot above is not a fair one -- the
% number of components (two) was chosen by looking at how well a
% two-component PLSR model predicted the response, and there's no reason
% why the PCR model should be restricted to that same number of components.
% With the same number of components, however, PLSR does a much better job
% at fitting |y|.  In fact, looking at the horizontal scatter of fitted
% values in the plot above, PCR with two components is hardly better than
% using a constant model.  The R-squared values from the two regressions
% confirm that.
TSS = sum((y-mean(y)).^2);
RSS_PLS = sum((y-yfitPLS).^2);
rsquaredPLS = 1 - RSS_PLS/TSS
%%
RSS_PCR = sum((y-yfitPCR).^2);
rsquaredPCR = 1 - RSS_PCR/TSS

%%
% Another way to compare the predictive power of the two models is to plot
% the observed response against the two components from each model.
plot3(Xscores(:,1),Xscores(:,2),y-mean(y),'bo');
legend('PLSR');
grid on; view(-30,30);
%%
% It's a little hard to see without being able to interactively rotate the
% figure, but the PLSR plot above shows points closely scattered about a plane.
% On the other hand, the PCR plot below shows a cloud of points with little
% indication of a linear relationship.
plot3(PCAScores(:,1),PCAScores(:,2),y-mean(y),'r^');
legend('PCR');
grid on; view(-30,30);

%%
% Notice that while the two PLS components are much better predictors of
% the observed |y|, the following figure shows that they explain
% somewhat less variance in the observed |X| than the first two principal
% components used in the PCR.
plot(1:10,100*cumsum(PLSPctVar(1,:)),'b-o',1:10,  ...
	100*cumsum(PCAVar(1:10))/sum(PCAVar(1:10)),'r-^');
xlabel('Number of Principal Components');
ylabel('Percent Variance Explained in X');
legend({'PLSR' 'PCR'},'location','SE');
%%
% The fact that the PCR curve is uniformly higher suggests why PCR with two
% components does such a poor job, relative to PLSR, in fitting |y|.  PCR
% constructs components to best explain |X|, and as a result, those first
% two components ignore the information in the data that is important in
% fitting the observed |y|.


%% Fitting with More Components
% As more components are added in PCR, it will necessarily do a better job
% of fitting the original data |y|, simply because at some point most of the
% important predictive information in |X| will be present in the principal
% components.  For example, the following figure shows that the
% difference in residuals for the two methods is much less dramatic when
% using ten components than it was for two components.
yfitPLS10 = [ones(n,1) X]*betaPLS10;
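% Back-transform the PCR coefficients as before, now with ten principal
% components.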
betaPCR10 = regress(y-mean(y), PCAScores(:,1:10));
betaPCR10 = PCALoadings(:,1:10)*betaPCR10;
betaPCR10 = [mean(y) - mean(X)*betaPCR10; betaPCR10];
yfitPCR10 = [ones(n,1) X]*betaPCR10;
plot(y,yfitPLS10,'bo',y,yfitPCR10,'r^');
xlabel('Observed Response');
ylabel('Fitted Response');
legend({'PLSR with 10 components' 'PCR with 10 Components'},  ...
	'location','NW');
%%
% Both models fit |y| fairly accurately, although PLSR still provides a
% slightly more accurate fit.  However, ten components is still an
% arbitrarily chosen number for either model.


%% Choosing the Number of Components with Cross-Validation
% It's often useful to choose the number of components to minimize the
% expected error when predicting the response from future observations on
% the predictor variables.  Simply using a large number of components will
% do a good job in fitting the current observed data, but is a strategy
% that leads to overfitting.  Fitting the current data too well results in
% a model that does not generalize well to other data, and gives an
% overly-optimistic estimate of the expected error.
%
% Cross-validation is a more statistically sound method for choosing the
% number of components in either PLSR or PCR.  It avoids overfitting by
% not reusing the same data both to fit a model and to estimate prediction
% error.  Thus, the estimate of prediction error is not optimistically
% biased downwards.
%
% |plsregress| has an option to estimate the mean squared prediction error
% (MSEP) by cross-validation; here, 10-fold cross-validation is used.
[Xl,Yl,Xs,Ys,beta,pctVar,PLSmsep] = plsregress(X,y,10,'CV',10);
%%
% For PCR, |crossval| combined with a simple function that computes the
% sum of squared errors for PCR can estimate the MSEP, again using
% 10-fold cross-validation.
PCRmsep = sum(crossval(@pcrsse,X,y,'KFold',10),1) / n;
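%%
% The helper |pcrsse| is not shown here; it lives in its own file on the
% MATLAB path and returns, for one cross-validation fold, a row vector of
% test-set sums of squared errors for PCR with 0 through 10 components.
% A minimal sketch of such a function (reconstructed from how |crossval|
% calls it above, not necessarily the shipped implementation) is:
%
%   function sse = pcrsse(Xtrain,ytrain,Xtest,ytest)
%   maxncomp = 10;
%   [Loadings,Scores] = pca(Xtrain,'Economy',false);
%   % PCA scores are orthogonal, so the coefficients for the first k
%   % components do not change as more components are added.
%   betaAll = regress(ytrain-mean(ytrain),Scores(:,1:maxncomp));
%   sse = zeros(1,maxncomp+1);
%   sse(1) = sum((ytest-mean(ytrain)).^2);      % zero components: mean only
%   for k = 1:maxncomp
%       b = Loadings(:,1:k)*betaAll(1:k);
%       b = [mean(ytrain)-mean(Xtrain)*b; b];   % restore the intercept
%       yfit = [ones(size(Xtest,1),1) Xtest]*b;
%       sse(k+1) = sum((ytest-yfit).^2);
%   end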

%%
% The MSEP curve for PLSR indicates that two or three components do about
% as good a job as possible.  On the other hand, PCR needs four components
% to get the same prediction accuracy.
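% (PLSmsep is 2-by-11; its second row holds the cross-validated MSEP for
% the response, with column k+1 corresponding to k components.)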
plot(0:10,PLSmsep(2,:),'b-o',0:10,PCRmsep,'r-^');
xlabel('Number of components');
ylabel('Estimated Mean Squared Prediction Error');
legend({'PLSR' 'PCR'},'location','NE');
%%
% In fact, the second component in PCR _increases_ the prediction error
% of the model, suggesting that the combination of predictor variables
% contained in that component is not strongly correlated with |y|.  Again,
% that's because PCR constructs components to explain variation in |X|, not
% |y|.


%% Model Parsimony
% So if PCR requires four components to get the same prediction accuracy as
% PLSR with three components, is the PLSR model more parsimonious?  That
% depends on what aspect of the model you consider.
%
% The PLS weights are the linear combinations of the original variables
% that define the PLS components, i.e., they describe how strongly each
% component in the PLSR depends on the original variables, and in what
% direction.
[Xl,Yl,Xs,Ys,beta,pctVar,mse,stats] = plsregress(X,y,3);
plot(1:401,stats.W,'-');
xlabel('Variable');
ylabel('PLS Weight');
legend({'1st Component' '2nd Component' '3rd Component'},  ...
	'location','NW');
%%
% Similarly, the PCA loadings describe how strongly each component in the PCR
% depends on the original variables.
plot(1:401,PCALoadings(:,1:4),'-');
xlabel('Variable');
ylabel('PCA Loading');
legend({'1st Component' '2nd Component' '3rd Component'  ...
	'4th Component'},'location','NW');

%%
% For either PLSR or PCR, it may be that each component can be given a
% physically meaningful interpretation by inspecting which variables it
% weights most heavily.  For instance, with these spectral data it may be
% possible to interpret intensity peaks in terms of compounds present in
% the gasoline, and then to observe that weights for a particular component
% pick out a small number of those compounds.  From that perspective, fewer
% components are simpler to interpret, and because PLSR often requires
% fewer components to predict the response adequately, it leads to more
% parsimonious models.
% 
% On the other hand, both PLSR and PCR result in one regression coefficient
% for each of the original predictor variables, plus an intercept.  In that
% sense, neither is more parsimonious, because regardless of how many
% components are used, both models depend on all predictors.  More
% concretely, for these data, both models need 401 spectral intensity
% values in order to make a prediction.
%
% However, the ultimate goal may be to reduce the original set of variables
% a smaller subset still able to predict the response accurately.  For
% example, it may be possible to use the PLS weights or the PCA loadings to
% select only those variables that contribute most to each component.  As
% shown earlier, some components from a PCR model fit may serve
% primarily to describe the variation in the predictor variables, and may
% include large weights for variables that are not strongly correlated with
% the response. Thus, PCR can lead to retaining variables that are
% unnecessary for prediction.
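%
% As a rough illustration of that idea (a hypothetical thresholding rule,
% not part of this example's analysis), one might keep only the variables
% whose first-component PLS weight is large relative to the largest
% absolute weight:
%
%   w1 = stats.W(:,1);                  % weights from the three-component fit
%   keep = abs(w1) > 0.5*max(abs(w1));  % arbitrary 50% cutoff
%   sum(keep)                           % number of variables retained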
%
% For the data used in this example, the difference in the number of
% components needed by PLSR and PCR for accurate prediction is not great,
% and the PLS weights and PCA loadings seem to pick out the same variables.
% That may not be true for other data.