
%% Classical Multidimensional Scaling
% This example shows how to perform "classical" multidimensional scaling,
% using the |cmdscale| function in the Statistics and Machine Learning Toolbox(TM). Classical
% multidimensional scaling, also known as Principal Coordinates Analysis,
% takes a matrix of interpoint distances, and creates a configuration of
% points.  Ideally, those points can be constructed in two or three
% dimensions, and the Euclidean distances between them approximately
% reproduce the original distance matrix.  Thus, a scatter plot of those
% points provides a visual representation of the original distances.
%
% This example illustrates applications of multidimensional scaling to
% dissimilarity measures other than spatial distance, and shows how to
% construct a configuration of points to visualize those dissimilarities.
%
% This example describes "classical" multidimensional scaling. The
% |mdscale| function performs "non-classical" MDS, which is sometimes more
% flexible than the classical method.  Non-classical MDS is described in
% the <docid:stats_examples.example-ex79993300> example.

%   Copyright 2002-2014 The MathWorks, Inc.
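
%%
% The idea is easy to demonstrate with a small sketch.  Start from a known
% set of points (the points below are arbitrary, chosen only for
% illustration), keep only their pairwise distances, and let |cmdscale|
% reconstruct a configuration whose interpoint distances match.
P = [0 0; 1 0; 0 2; 3 1];   % four known points in the plane
Dp = pdist(P);              % keep only the interpoint distances
Prec = cmdscale(Dp);        % reconstruct a configuration from Dp alone
max(abs(Dp - pdist(Prec)))  % the distances are reproduced (near zero)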



%% Reconstructing Spatial Locations from Non-Spatial Distances
% Suppose you have measured the genetic "distance", or dissimilarity,
% between a number of local subpopulations of a single species of animal.
% You also know their geographic locations, and would like to know how
% closely their genetic and spatial distances correspond.  If they do, that
% is evidence that interbreeding between the subpopulations is affected by
% their geographic locations.
%
% Below are the spatial locations of the subpopulations, and the
% upper triangle of the matrix of genetic distances, in the same vector
% format produced by |pdist|.
X = [39.1     18.7;
     40.7     21.2;
     41.5     21.5;
     39.2     21.8;
     38.7     20.6;
     41.7     20.1;
     40.1     22.1;
     39.2     21.6];

D = [4.69 6.79 3.50 3.11 4.46 5.57 3.00 ...
          2.10 2.27 2.65 2.36 1.99 1.74 ...
               3.78 4.53 2.83 2.44 3.79 ...
                    1.98 4.35 2.07 0.53 ...
                         3.80 3.31 1.47 ...
                              4.35 3.82 ...
                                   2.57];

%%
% Although this vector format for |D| is space-efficient, it's often easier
% to see the distance relationships if you reformat the distances to a
% square matrix.
squareform(D)
%%
% |cmdscale| recognizes either of the two formats.
[Y,eigvals] = cmdscale(D);
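
%%
% For example, you can verify that passing the square form of |D| produces
% the same configuration:
Ysq = cmdscale(squareform(D));
max(abs(Y(:) - Ysq(:)))   % at or near zero: the two formats are equivalent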

%%
% |cmdscale|'s first output, |Y|, is a matrix of points created to have
% interpoint distances that reproduce the distances in |D|. With eight
% subpopulations, the points (rows of |Y|) could have as many as eight dimensions
% (columns of |Y|). Visualization of the genetic distances depends on using
% points in only two or three dimensions. Fortunately, |cmdscale|'s second
% output, |eigvals|, is a set of sorted eigenvalues whose relative magnitudes
% indicate how many dimensions you can safely use. If only the first two or
% three eigenvalues are large, then only those coordinates of the points in
% |Y| are needed to accurately reproduce |D|. If more than three eigenvalues
% are large, then it is not possible to find a good low-dimensional
% configuration of points, and it will not be easy to visualize the
% distances.
[eigvals eigvals/max(abs(eigvals))]
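
%%
% One rough way to quantify this is the cumulative proportion of the
% positive eigenvalues accounted for by the leading coordinates; the first
% two should account for nearly all of it.
posEigs = eigvals(eigvals > 0);
cumsum(posEigs)/sum(posEigs)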

%%
% Notice that there are only two large positive eigenvalues, so the
% configuration of points created by |cmdscale| can be plotted in two
% dimensions.  The two negative eigenvalues indicate that the genetic
% distances are not Euclidean, that is, no configuration of points can
% reproduce |D| exactly. Fortunately, the negative eigenvalues are small
% relative to the largest positive ones, and the reduction to the first two
% columns of |Y| should be fairly accurate. You can check this by looking at
% the error in the distances between the two-dimensional configuration and
% the original distances.
maxrelerr = max(abs(D - pdist(Y(:,1:2)))) / max(D)

%%
% Now you can compare the "genetic locations" created by |cmdscale| to the
% actual geographic locations.  Because the configuration returned by
% |cmdscale| is unique only up to translation, rotation, and reflection, the
% genetic locations probably won't match the geographic locations.  They
% will also have the wrong scale.  But you can use the |procrustes| command
% to find the best match between the two sets of points in a least squares
% sense.  (The first output of |procrustes| is its dissimilarity measure;
% giving it its own name keeps the genetic distances in |D| intact.)
[procrustErr,Z] = procrustes(X,Y(:,1:2));
plot(X(:,1),X(:,2),'bo',Z(:,1),Z(:,2),'rd');
labels = num2str((1:8)');
text(X(:,1)+.05,X(:,2),labels,'Color','b');
text(Z(:,1)+.05,Z(:,2),labels,'Color','r');
xlabel('Distance East of Reference Point (Km)');
ylabel('Distance North of Reference Point (Km)');
legend({'Spatial Locations','Constructed Genetic Locations'},'Location','SE');

%%
% This plot shows the best match of the reconstructed points in the same
% coordinates as the actual spatial locations.  Apparently, the genetic
% distances do have a close link to the spatial distances between the
% subpopulations.
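
%%
% One simple way to quantify this correspondence is the correlation between
% the genetic and spatial distance vectors (a Mantel-style statistic; note
% that assessing its significance would require a permutation test, which
% is not shown here).
corr(pdist(X)', D')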


%% Visualizing a Correlation Matrix Using Multidimensional Scaling
% Suppose you have computed the following correlation matrix for a set of
% 10 variables.  It's obvious that these variables are all positively
% correlated, and that there are some very strong pairwise correlations.
% But with this many variables, it's not easy to get a good feel for the
% relationships among all 10.
Rho = ...
  [1       0.3906  0.3746  0.3318  0.4141  0.4279  0.4216  0.4703  0.4362  0.2066;
   0.3906  1       0.3200  0.3629  0.2211  0.9520  0.9811  0.9052  0.4567  0     ;
   0.3746  0.3200  1       0.8993  0.7999  0.3589  0.3460  0.3333  0.8639  0.6527;
   0.3318  0.3629  0.8993  1       0.7125  0.3959  0.3663  0.3394  0.8719  0.5726;
   0.4141  0.2211  0.7999  0.7125  1       0.2374  0.2079  0.2335  0.7050  0.7469;
   0.4279  0.9520  0.3589  0.3959  0.2374  1       0.9657  0.9363  0.4791  0.0254;
   0.4216  0.9811  0.3460  0.3663  0.2079  0.9657  1       0.9123  0.4554  0.0011;
   0.4703  0.9052  0.3333  0.3394  0.2335  0.9363  0.9123  1       0.4418  0.0099;
   0.4362  0.4567  0.8639  0.8719  0.7050  0.4791  0.4554  0.4418  1       0.5272;
   0.2066  0       0.6527  0.5726  0.7469  0.0254  0.0011  0.0099  0.5272  1     ];

%%
% Multidimensional scaling is often thought of as a way to (re)construct
% points using only pairwise distances.  But it can also be used with
% dissimilarity measures that are more general than distance, to spatially
% visualize things that are not "points in space" in the usual sense.  The
% variables described by |Rho| are an example, and you can use |cmdscale| to
% create a visual representation of their interdependencies.
%
% Correlation actually measures similarity, but it is easy to transform it
% to a measure of dissimilarity.  Because all the correlations here are
% positive, you can simply use
D = 1 - Rho;
%%
% although other choices might also make sense.  If |Rho| contained negative
% correlations, you would have to decide whether, for example, a correlation
% of -1 indicated more or less of a dissimilarity than a correlation of 0,
% and choose a transformation accordingly.
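
%%
% For example, if a correlation of -1 indicated as strong an association as
% a correlation of +1, you might map both to zero dissimilarity; if instead
% -1 indicated maximal dissimilarity, you would want a transformation that
% preserves the ordering of |Rho|.  Two illustrative possibilities (the
% variable names here are hypothetical) are
Dabs = 1 - abs(Rho);   % -1 and +1 both treated as maximal similarity
Dord = (1 - Rho)/2;    % -1 treated as maximal dissimilarity, scaled to [0,1]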

%%
% It's important to decide whether visualization of the information in the
% correlation matrix is even possible, that is, whether the number of
% dimensions can be reduced from ten down to two or three.  The eigenvalues
% returned by |cmdscale| give you a way to decide.  In this case, a scree
% plot of those eigenvalues indicates that two dimensions are enough to
% represent the variables.  (Notice that some of the eigenvalues in the
% plot below are negative, but small relative to the first two.)
[Y,eigvals] = cmdscale(D);
plot(1:length(eigvals),eigvals,'bo-');
line([1,length(eigvals)],[0 0],'LineStyle',':','XLimInclude','off',...
     'Color',[.7 .7 .7])
axis([1,length(eigvals),min(eigvals),max(eigvals)*1.1]);
xlabel('Eigenvalue number');
ylabel('Eigenvalue');

%%
% For a less strongly correlated set of variables, more dimensions might be
% needed.  If more than three dimensions are needed, the visualization isn't
% all that useful.
%
% A 2-D plot of the configuration returned by |cmdscale| indicates that there
% are two subsets of variables that are most closely correlated among
% themselves, plus a single variable that is more or less on its own.  One
% of the clusters is tight, while the other is relatively loose.
labels = {' 1',' 2',' 3',' 4',' 5',' 6',' 7',' 8',' 9',' 10'};
plot(Y(:,1),Y(:,2),'bx');
axis(max(max(abs(Y))) * [-1.1,1.1,-1.1,1.1]); axis('square');
text(Y(:,1),Y(:,2),labels,'HorizontalAlignment','left');
line([-1,1],[0 0],'XLimInclude','off','Color',[.7 .7 .7])
line([0 0],[-1,1],'YLimInclude','off','Color',[.7 .7 .7])

%%
% On the other hand, the results from |cmdscale| for the following
% correlation matrix indicate a much different structure:  there are no
% real groups among the variables.  Rather, there is a kind of "circular"
% dependency, where each variable has a pair of "closest neighbors" but is
% less well correlated with the remaining variables.
Rho = ...
  [1       0.7946  0.1760  0.2560  0.7818  0.4496  0.2732  0.3995  0.5305  0.2827;
   0.7946  1       0.1626  0.4227  0.5674  0.6183  0.4004  0.2283  0.3495  0.2777;
   0.1760  0.1626  1       0.2644  0.1864  0.1859  0.4330  0.4656  0.3947  0.8057;
   0.2560  0.4227  0.2644  1       0.1017  0.7426  0.8340  0       0.0499  0.4853;
   0.7818  0.5674  0.1864  0.1017  1       0.2733  0.1484  0.4890  0.6138  0.2025;
   0.4496  0.6183  0.1859  0.7426  0.2733  1       0.6303  0.0648  0.1035  0.3242;
   0.2732  0.4004  0.4330  0.8340  0.1484  0.6303  1       0.1444  0.1357  0.6291;
   0.3995  0.2283  0.4656  0       0.4890  0.0648  0.1444  1       0.8599  0.3948;
   0.5305  0.3495  0.3947  0.0499  0.6138  0.1035  0.1357  0.8599  1       0.3100;
   0.2827  0.2777  0.8057  0.4853  0.2025  0.3242  0.6291  0.3948  0.3100  1     ];

[Y,eigvals] = cmdscale(1-Rho);
[eigvals eigvals./max(abs(eigvals))]
%%
plot(Y(:,1),Y(:,2),'bx');
axis(max(max(abs(Y))) * [-1.1,1.1,-1.1,1.1]); axis('square');
text(Y(:,1),Y(:,2),labels,'HorizontalAlignment','left');
line([0 0],[-1,1],'XLimInclude','off','Color',[.7 .7 .7])
line([-1,1],[0 0],'YLimInclude','off','Color',[.7 .7 .7])


%% A Comparison of Principal Components Analysis and Classical Multidimensional Scaling
% Multidimensional scaling is most often used to visualize data when only
% their distances or dissimilarities are available.  However, when the
% original data are available, multidimensional scaling can also be used as
% a dimension reduction method, by reducing the data to a distance matrix,
% creating a new configuration of points using |cmdscale|, and retaining only
% the first few dimensions of those points.  This application of
% multidimensional scaling is much like Principal Components Analysis, and
% in fact, when you call |cmdscale| using the Euclidean distances between the
% points, the results are identical to those of PCA, up to sign changes in
% the coordinate columns.
n = 10; m = 5;
X = randn(n,m);
D = pdist(X,'Euclidean');

[Y,eigvals] = cmdscale(D);
[PC,Score,latent] = pca(X);

Y
%%
Score
%%
% Even the nonzero eigenvalues are identical, up to a scale factor of n-1.
[eigvals(1:m) (n-1)*latent]
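
%%
% A quick numeric check confirms that the configurations agree up to column
% sign flips.  (Because |D| here is exactly Euclidean, |Y| has |m|
% significant columns.)
max(max(abs(abs(Y(:,1:m)) - abs(Score))))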