www.gusucode.com > stats_featured 源码程序 matlab案例代码 > stats_featured/orthoregdemo.m

    %% Fitting an Orthogonal Regression Using Principal Components Analysis
% This example shows how to use Principal Components Analysis (PCA) to fit a 
% linear regression. PCA minimizes the perpendicular distances from the data 
% to the fitted model. This is the linear case of what is known as Orthogonal 
% Regression or Total Least Squares, and is appropriate when there is no 
% natural distinction between predictor and response variables, or when all 
% variables are measured with error. This is in contrast to the usual regression 
% assumption that predictor variables are measured exactly, and only the response 
% variable has an error component.
%
% For example, given two data vectors x and y, you can fit a line that
% minimizes the perpendicular distances from each of the points (x(i), y(i))
% to the line.  More generally, with p observed variables, you can fit an
% r-dimensional hyperplane in p-dimensional space (r < p).  The choice of r is
% equivalent to choosing the number of components to retain in PCA. It may be
% based on prediction error, or it may simply be a pragmatic choice to reduce
% data to a manageable number of dimensions.
%
% In this example, we fit a plane and a line through some data on three
% observed variables.  It's easy to do the same thing for any number of
% variables, and for any dimension of model, although visualizing a fit
% in higher dimensions would obviously not be straightforward.

%   Copyright 2005-2011 The MathWorks, Inc.



%% Fitting a Plane to 3-D Data
% First, we generate some trivariate normal data for the example.  Two of
% the variables are fairly strongly correlated.
rng(5,'twister');
X = mvnrnd([0 0 0], [1 .2 .7; .2 1 0; .7 0 1],50);
plot3(X(:,1),X(:,2),X(:,3),'bo');
grid on;
maxlim = max(abs(X(:)))*1.1;
axis([-maxlim maxlim -maxlim maxlim -maxlim maxlim]);
axis square
view(-9,12);

%%
% Next, we fit a plane to the data using PCA.  The coefficients for the first
% two principal components define vectors that form a basis for the plane.
% The third PC is orthogonal to the first two, and its coefficients define the
% normal vector of the plane.
[coeff,score,roots] = pca(X);
basis = coeff(:,1:2)
%%
normal = coeff(:,3)
%%
% That's all there is to the fit.  But let's look closer at the results, and
% plot the fit along with the data.

%%
% Because the first two components explain as much of the variance in the data
% as is possible with two dimensions, the plane is the best 2-D linear
% approximation to the data.  Equivalently, the third component explains the
% least amount of variation in the data, and it is the error term in the
% regression.  The latent roots (or eigenvalues) from the PCA define the
% amount of explained variance for each component.
pctExplained = roots' ./ sum(roots)

%%
% The first two coordinates of the principal component scores give the
% projection of each point onto the plane, in the coordinate system of the
% plane.  To get the coordinates of the fitted points in terms of the original
% coordinate system, we multiply each PC coefficient vector by the
% corresponding score, and add back in the mean of the data.  The residuals
% are simply the original data minus the fitted points.
[n,p] = size(X);
meanX = mean(X,1);
Xfit = repmat(meanX,n,1) + score(:,1:2)*coeff(:,1:2)';
residuals = X - Xfit;

%%
% The equation of the fitted plane, satisfied by each of the fitted points in
% |Xfit|, is |([x1 x2 x3] - meanX)*normal = 0|.  The plane passes through the
% point |meanX|, and its perpendicular distance to the origin is
% |meanX*normal|. The perpendicular distance from each point in |X| to the
% plane, i.e., the norm of the residuals, is the dot product of each centered
% point with the normal to the plane.  The fitted plane minimizes the sum of
% the squared errors.
error = abs((X - repmat(meanX,n,1))*normal);
sse = sum(error.^2)

%%
% To visualize the fit, we can plot the plane, the original data, and their
% projection to the plane.
[xgrid,ygrid] = meshgrid(linspace(min(X(:,1)),max(X(:,1)),5), ...
                         linspace(min(X(:,2)),max(X(:,2)),5));
zgrid = (1/normal(3)) .* (meanX*normal - (xgrid.*normal(1) + ygrid.*normal(2)));
h = mesh(xgrid,ygrid,zgrid,'EdgeColor',[0 0 0],'FaceAlpha',0);

hold on
above = (X-repmat(meanX,n,1))*normal < 0;
below = ~above;
nabove = sum(above);
X1 = [X(above,1) Xfit(above,1) nan*ones(nabove,1)];
X2 = [X(above,2) Xfit(above,2) nan*ones(nabove,1)];
X3 = [X(above,3) Xfit(above,3) nan*ones(nabove,1)];
plot3(X1',X2',X3','-', X(above,1),X(above,2),X(above,3),'o', 'Color',[0 .7 0]);
nbelow = sum(below);
X1 = [X(below,1) Xfit(below,1) nan*ones(nbelow,1)];
X2 = [X(below,2) Xfit(below,2) nan*ones(nbelow,1)];
X3 = [X(below,3) Xfit(below,3) nan*ones(nbelow,1)];
plot3(X1',X2',X3','-', X(below,1),X(below,2),X(below,3),'o', 'Color',[1 0 0]);

hold off
maxlim = max(abs(X(:)))*1.1;
axis([-maxlim maxlim -maxlim maxlim -maxlim maxlim]);
axis square
view(-9,12);
%%
% Green points are above the plane, red points are below.


%% Fitting a Line to 3-D Data
% Fitting a straight line to the data is even simpler, and because of the
% nesting property of PCA, we can use the components that have already been
% computed.  The direction vector that defines the line is given by the
% coefficients for the first principal component.  The second and third PCs
% are orthogonal to the first, and their coefficients define directions
% that are perpendicular to the line.  The simplest equation to describe the
% line is |meanX + t*dirVect|, where |t| parameterizes the position along the
% line.
dirVect = coeff(:,1)

%%
% The first coordinate of the principal component scores gives the
% projection of each point onto the line.  As with the 2-D fit, the PC
% coefficient vectors multiplied by the scores the gives the fitted points
% in the original coordinate system.
Xfit1 = repmat(meanX,n,1) + score(:,1)*coeff(:,1)';

%%
% Plot the line, the original data, and their projection to the line.
t = [min(score(:,1))-.2, max(score(:,1))+.2];
endpts = [meanX + t(1)*dirVect'; meanX + t(2)*dirVect'];
plot3(endpts(:,1),endpts(:,2),endpts(:,3),'k-');

X1 = [X(:,1) Xfit1(:,1) nan*ones(n,1)];
X2 = [X(:,2) Xfit1(:,2) nan*ones(n,1)];
X3 = [X(:,3) Xfit1(:,3) nan*ones(n,1)];
hold on
plot3(X1',X2',X3','b-', X(:,1),X(:,2),X(:,3),'bo');
hold off
maxlim = max(abs(X(:)))*1.1;
axis([-maxlim maxlim -maxlim maxlim -maxlim maxlim]);
axis square
view(-9,12);
grid on

%%
% While it appears that many of the projections in this plot are not
% perpendicular to the line, that's just because we're plotting 3-D data
% in two dimensions.  In a live |MATLAB(R)| figure window, you could
% interactively rotate the plot to different perspectives to verify that
% the projections are indeed perpendicular, and to get a better feel for
% how the line fits the data.