%% Cluster Analysis
% This example shows how to examine similarities and dissimilarities of
% observations or objects using cluster analysis in
% Statistics and Machine Learning Toolbox(TM).
% Data often fall naturally into groups, or clusters, of observations,
% where the characteristics of objects in the same cluster are similar and
% the characteristics of objects in different clusters are dissimilar.

%   Copyright 2002-2014 The MathWorks, Inc.


%% K-Means and Hierarchical Clustering
% Statistics and Machine Learning Toolbox includes functions to perform two
% types of cluster analysis: K-means clustering and hierarchical clustering.
%
% K-means clustering is a partitioning method that treats observations in
% your data as objects having locations and distances from each other. It
% partitions the objects into K mutually exclusive clusters, such that
% objects within each cluster are as close to each other as possible, and
% as far from objects in other clusters as possible. Each cluster is
% characterized by its centroid, or center point. Of course, the distances
% used in clustering often do not represent spatial distances.
%
% Hierarchical clustering is a way to investigate grouping in your data,
% simultaneously over a variety of scales of distance, by creating a
% cluster tree. The tree is not a single set of clusters, as in K-Means,
% but rather a multi-level hierarchy, where clusters at one level are
% joined as clusters at the next higher level. This allows you to decide
% what scale or level of clustering is most appropriate in your
% application.
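
%%
% As a minimal sketch of the idea (synthetic data, arbitrary variable names,
% and arbitrary parameters, not part of the iris analysis that follows),
% |kmeans| partitions a set of points into a requested number of clusters and
% returns each point's cluster index along with the cluster centroids:
Xsketch = [randn(20,2); randn(20,2)+4];   % two artificial, well-separated groups
[idxSketch,Csketch] = kmeans(Xsketch,2);  % partition into K = 2 clusters
Csketch                                   % one centroid (row) per cluster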

%%
% Some of the functions used in this example call MATLAB(R) built-in random
% number generation functions. To duplicate the exact results shown in this
% example, you should execute the command below, to set the random number
% generator to a known state. If you do not set the state, your results may
% differ in trivial ways, for example, you may see clusters numbered in a
% different order. There is also a chance that a suboptimal cluster
% solution may result (the example includes a discussion of suboptimal
% solutions, including ways to avoid them).
rng(14,'twister');

%% Fisher's Iris Data
% In the 1920s, botanists collected measurements on the sepal length,
% sepal width, petal length, and petal width of 150 iris specimens, 50 from
% each of three species. The measurements became known as Fisher's iris
% data.
%
% Each observation in these data comes from a known species, and so there
% is already an obvious way to group the data. For the moment, we will
% ignore the species information and cluster the data using only the raw
% measurements. When we are done, we can compare the resulting clusters to
% the actual species, to see if the three types of iris possess distinct
% characteristics.
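
%%
% As a quick look at how the data are stored (a small sketch; the data are
% loaded again in the next section), the measurements form a 150-by-4 matrix
% |meas| and the species labels a 150-by-1 cell array |species|:
load fisheriris
size(meas)
unique(species)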


%% Clustering Fisher's Iris Data Using K-Means Clustering
% The function |kmeans| performs K-Means clustering, using an iterative
% algorithm that assigns objects to clusters so that the sum of distances
% from each object to its cluster centroid, over all clusters, is a
% minimum. Used on Fisher's iris data, it will find the natural groupings
% among iris specimens, based on their sepal and petal measurements. With
% K-means clustering, you must specify the number of clusters that you want
% to create.
%
% First, load the data and call |kmeans| with the desired number of
% clusters set to 2, and using squared Euclidean distance. To get an idea
% of how well-separated the resulting clusters are, you can make a
% silhouette plot. The silhouette plot displays a measure of how close each
% point in one cluster is to points in the neighboring clusters.
load fisheriris
[cidx2,cmeans2] = kmeans(meas,2,'dist','sqeuclidean');
[silh2,h] = silhouette(meas,cidx2,'sqeuclidean');
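
%%
% For reference, the silhouette value of the i-th point is defined as
%
% $$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
%
% where $a(i)$ is the average distance from point $i$ to the other points in
% its own cluster, and $b(i)$ is the smallest average distance from point $i$
% to the points of any other cluster.  Values near 1 indicate points that are
% much closer to their own cluster than to any other.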

%%
% From the silhouette plot, you can see that most points in both clusters
% have a large silhouette value, greater than 0.8, indicating that those
% points are well-separated from neighboring clusters. However, each
% cluster also contains a few points with low silhouette values, indicating
% that they lie close to points from other clusters.
%
% It turns out that the fourth measurement in these data, the petal width,
% is highly correlated with the third measurement, the petal length, and so
% a 3-D plot of the first three measurements gives a good representation of
% the data, without resorting to four dimensions.  If you plot the data,
% using different symbols for each cluster created by |kmeans|, you can
% identify the points with small silhouette values, as those points that
% are close to points from other clusters.
ptsymb = {'bs','r^','md','go','c+'};
for i = 1:2
    clust = find(cidx2==i);
    plot3(meas(clust,1),meas(clust,2),meas(clust,3),ptsymb{i});
    hold on
end
plot3(cmeans2(:,1),cmeans2(:,2),cmeans2(:,3),'ko');
plot3(cmeans2(:,1),cmeans2(:,2),cmeans2(:,3),'kx');
hold off
xlabel('Sepal Length');
ylabel('Sepal Width');
zlabel('Petal Length');
view(-137,10);
grid on
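
%%
% As a quick numerical check of the claim above about petal length and petal
% width (a sketch, not part of the original example), their correlation is
% indeed very high:
corrcoef(meas(:,3),meas(:,4))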

%%
% The centroids of each cluster are plotted using circled X's. Three of the
% points from the lower cluster, plotted with triangles, are very close to
% points from the upper cluster, plotted with squares. But, in fact,
% because the upper cluster is so spread out, those three points are closer
% to the centroid of the lower cluster than to that of the upper cluster,
% even though they are separated from the bulk of the points in their own
% cluster by a gap. Because K-means clustering only considers distances,
% and not densities, this kind of result can occur.
%
% You can increase the number of clusters to see if |kmeans| can find further
% grouping structure in the data. This time, use the optional 'display'
% parameter to print out information about each iteration in the clustering
% algorithm.
[cidx3,cmeans3] = kmeans(meas,3,'display','iter');

%%
% At each iteration, the |kmeans| algorithm reassigns points among clusters
% to decrease the sum of point-to-centroid distances, and then recomputes
% cluster centroids for the new cluster assignments.  Notice that the total
% sum of distances and the number of reassignments decrease at each
% iteration until the algorithm reaches a minimum.  The algorithm used in
% |kmeans| consists of two phases.  In the example here, the second phase of
% the algorithm did not make any reassignments, indicating that the first
% phase reached a minimum after only a few iterations.
%
% By default, |kmeans| begins the clustering process using a randomly
% selected set of initial centroid locations. Just as in many other types
% of numerical minimizations, the solution that |kmeans| reaches sometimes
% depends on the starting points, and it is possible for it to reach a
% local minimum, where reassigning any one point to a new cluster would
% increase the total sum of point-to-centroid distances, but where a better
% solution does exist. However, you can use the optional 'replicates'
% parameter to overcome that problem.  When you specify more than one
% replicate, |kmeans| repeats the clustering process starting from different
% randomly selected centroids for each replicate.
[cidx3,cmeans3,sumd3] = kmeans(meas,3,'replicates',5,'display','final');

%%
% The output shows that, even for this relatively simple problem,
% non-global minima do exist.  Each of these five replicates began from a
% different set of initial centroids.  Depending on where it started from,
% |kmeans| reached one of two different solutions. However, the final
% solution that |kmeans| returns is the one with the lowest total sum of
% distances, over all replicates.  The third output argument contains the
% sum of distances within each cluster for that best solution.
sum(sumd3)

%%
% A silhouette plot for this three-cluster solution indicates that there is
% one cluster that is well-separated, but that the other two clusters are
% not very distinct.
[silh3,h] = silhouette(meas,cidx3,'sqeuclidean');

%%
% Again, you can plot the raw data to see how |kmeans| has assigned the
% points to clusters.
for i = 1:3
    clust = find(cidx3==i);
    plot3(meas(clust,1),meas(clust,2),meas(clust,3),ptsymb{i});
    hold on
end
plot3(cmeans3(:,1),cmeans3(:,2),cmeans3(:,3),'ko');
plot3(cmeans3(:,1),cmeans3(:,2),cmeans3(:,3),'kx');
hold off
xlabel('Sepal Length');
ylabel('Sepal Width');
zlabel('Petal Length');
view(-137,10);
grid on

%%
% You can see that |kmeans| has split the upper cluster from the two-cluster
% solution, and that those two clusters are very close to each other.
% Depending on what you intend to do with these data after clustering them,
% this three-cluster solution may be more or less useful than the previous,
% two-cluster, solution.  The first output argument from |silhouette|
% contains the silhouette values for each point, which you can use to
% compare the two solutions quantitatively.  The average silhouette value
% was larger for the two-cluster solution, indicating that it is a better
% answer purely from the point of view of creating distinct clusters.
[mean(silh2) mean(silh3)]

%%
% You can also cluster these data using a different distance. The cosine
% distance might make sense for these data because it would ignore
% absolute sizes of the measurements, and only consider their relative sizes.
% Thus, two flowers that were different sizes, but which had similarly shaped
% petals and sepals, might not be close with respect to squared Euclidean
% distance, but would be close with respect to cosine distance.
[cidxCos,cmeansCos] = kmeans(meas,3,'dist','cos');
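
%%
% As a small illustration of this point (hypothetical measurements, not taken
% from the iris data): two specimens whose measurements are proportional to
% each other have zero cosine distance even though their Euclidean distance
% is large.
xBig   = [5 3 1.5 0.2];                          % a hypothetical flower
xSmall = xBig/2;                                 % same shape, half the size
[pdist([xBig;xSmall],'euclidean') pdist([xBig;xSmall],'cosine')]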

%%
% From the silhouette plot, these clusters appear to be only slightly
% better separated than those found using squared Euclidean distance.
[silhCos,h] = silhouette(meas,cidxCos,'cos');
[mean(silh2) mean(silh3) mean(silhCos)]

%%
% Notice that the order of the clusters differs from that in the previous
% silhouette plot.  This is because |kmeans| chooses its initial centroid
% locations at random, so the cluster numbering can vary from run to run.
%
% By plotting the raw data, you can see the differences in the cluster
% shapes created using the two different distances. The two solutions are
% similar, but the two upper clusters are elongated in the direction of
% the origin when using cosine distance.
for i = 1:3
    clust = find(cidxCos==i);
    plot3(meas(clust,1),meas(clust,2),meas(clust,3),ptsymb{i});
    hold on
end
hold off
xlabel('Sepal Length');
ylabel('Sepal Width');
zlabel('Petal Length');
view(-137,10);
grid on

%%
% This plot does not include the cluster centroids, because a centroid with
% respect to the cosine distance corresponds to a half-line from the origin
% in the space of the raw data. However, you can make a parallel coordinate
% plot of the normalized data points to visualize the differences between
% cluster centroids.
lnsymb = {'b-','r-','m-'};
names = {'SL','SW','PL','PW'};
meas0 = meas ./ repmat(sqrt(sum(meas.^2,2)),1,4);
ymin = min(min(meas0));
ymax = max(max(meas0));
for i = 1:3
    subplot(1,3,i);
    plot(meas0(cidxCos==i,:)',lnsymb{i});
    hold on;
    plot(cmeansCos(i,:)','k-','LineWidth',2);
    hold off;
    title(sprintf('Cluster %d',i));
    xlim([.9, 4.1]);
    ylim([ymin, ymax]);
    h_gca = gca;
    h_gca.XTick = 1:4;
    h_gca.XTickLabel = names;
end

%%
% It's clear from this plot that specimens from each of the three clusters
% have distinctly different relative sizes of petals and sepals on average.
% The first cluster has petals that are strictly smaller than their sepals.
% In the other two clusters, petal and sepal sizes overlap, and the overlap
% is greater for the third cluster than for the second.  You can also
% see that the second and third clusters include some specimens which are
% very similar to each other.
%
% Because the species of each observation in the data is known, you can
% compare the clusters discovered by |kmeans| to the actual species, to see
% if the three species have discernibly different physical
% characteristics. In fact, as the following plot shows, the clusters
% created using cosine distance differ from the species groups for only
% five of the flowers. Those five points, plotted with stars, are all near
% the boundary of the upper two clusters.
subplot(1,1,1);
for i = 1:3
    clust = find(cidxCos==i);
    plot3(meas(clust,1),meas(clust,2),meas(clust,3),ptsymb{i});
    hold on
end
xlabel('Sepal Length');
ylabel('Sepal Width');
zlabel('Petal Length');
view(-137,10);
grid on
sidx = grp2idx(species);
miss = find(cidxCos ~= sidx);
plot3(meas(miss,1),meas(miss,2),meas(miss,3),'k*');
legend({'setosa','versicolor','virginica'});
hold off


%% Clustering Fisher's Iris Data Using Hierarchical Clustering
% K-Means clustering produced a single partition of the iris data, but
% you might also want to investigate different scales of grouping in
% your data. Hierarchical clustering lets you do just that, by creating a
% hierarchical tree of clusters.
%
% First, create a cluster tree using distances between observations in the
% iris data.  Begin by using Euclidean distance.
eucD = pdist(meas,'euclidean');
clustTreeEuc = linkage(eucD,'average');

%%
% The cophenetic correlation is one way to verify that the cluster tree is
% consistent with the original distances.  Large values indicate that the
% tree fits the distances well, in the sense that pairwise linkages between
% observations correlate with their actual pairwise distances.  This tree
% seems to be a fairly good fit to the distances.
cophenet(clustTreeEuc,eucD)
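
%%
% As a sketch of what this number measures (not part of the original example):
% the cophenetic correlation is the linear correlation between the original
% pairwise distances and the cophenetic distances, that is, the heights at
% which each pair of observations is first joined in the tree.
[c,cophDist] = cophenet(clustTreeEuc,eucD);   % second output: cophenetic distances
corr(eucD(:),cophDist(:))                     % reproduces the value of c above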

%%
% To visualize the hierarchy of clusters, you can plot a dendrogram.
[h,nodes] = dendrogram(clustTreeEuc,0);
h_gca = gca;
h_gca.TickDir = 'out';
h_gca.TickLength = [.002 0];
h_gca.XTickLabel = [];

%%
% The root node in this tree is much higher than the remaining nodes,
% confirming what you saw from K-Means clustering: there are two large,
% distinct groups of observations. Within each of those two groups, you can
% see that lower levels of groups emerge as you consider smaller and
% smaller scales in distance. There are many different levels of groups, of
% different sizes, and at different degrees of distinctness.

%%
% Based on the results from K-Means clustering, cosine might also be a good
% choice of distance measure.  The resulting hierarchical tree is quite
% different, suggesting a very different way to look at group structure in
% the iris data.
cosD = pdist(meas,'cosine');
clustTreeCos = linkage(cosD,'average');
cophenet(clustTreeCos,cosD)
%%
[h,nodes] = dendrogram(clustTreeCos,0);
h_gca = gca;
h_gca.TickDir = 'out';
h_gca.TickLength = [.002 0];
h_gca.XTickLabel = [];

%%
% The highest level of this tree separates iris specimens into two very
% distinct groups.  The dendrogram shows that, with respect to cosine
% distance, the within-group differences are much smaller relative to the
% between-group differences than was the case for Euclidean distance.  This
% is exactly what you would expect for these data, since the cosine
% distance computes a zero pairwise distance for objects that are in the
% same "direction" from the origin.
%
% With 150 observations, the plot is cluttered, but you can make a
% simplified dendrogram that does not display the very lowest levels of the
% tree.
[h,nodes] = dendrogram(clustTreeCos,12);

%%
% The three highest nodes in this tree separate out three equally-sized
% groups, plus a single specimen (labeled as leaf node 5) that is not near
% any others.
[sum(ismember(nodes,[11 12 9 10])) sum(ismember(nodes,[6 7 8])) ...
                  sum(ismember(nodes,[1 2 4 3])) sum(nodes==5)]

%%
% For many purposes, the dendrogram might be a sufficient result.  However,
% you can go one step further, and use the |cluster| function to cut the tree
% and explicitly partition observations into specific clusters, as with
% K-Means. To create clusters from the cosine-distance hierarchy, specify a
% linkage height that cuts the tree below the three highest nodes, producing
% four clusters, and then plot the clustered raw data.
hidx = cluster(clustTreeCos,'criterion','distance','cutoff',.006);
for i = 1:5
    clust = find(hidx==i);
    plot3(meas(clust,1),meas(clust,2),meas(clust,3),ptsymb{i});
    hold on
end
hold off
xlabel('Sepal Length');
ylabel('Sepal Width');
zlabel('Petal Length');
view(-137,10);
grid on
%%
% This plot shows that the results from hierarchical clustering with
% cosine distance are qualitatively similar to results from K-Means, using
% three clusters. However, creating a hierarchical cluster tree allows you
% to visualize, all at once, what would require considerable
% experimentation with different values for K in K-Means clustering.
%
% Hierarchical clustering also allows you to experiment with different
% linkages. For example, clustering the iris data with single linkage,
% which tends to link together objects over larger distances than average
% distance does, gives a very different interpretation of the structure
% in the data.
clustTreeSng = linkage(eucD,'single');
[h,nodes] = dendrogram(clustTreeSng,0);
h_gca = gca;
h_gca.TickDir = 'out';
h_gca.TickLength = [.002 0];
h_gca.XTickLabel = [];