%% Clustering Using Gaussian Mixture Models
%% How Gaussian Mixture Models Cluster Data
% Gaussian mixture models (GMM) are often used for data clustering.
% Usually, fitted GMMs cluster by assigning query data points to the
% multivariate normal components that maximize the component posterior
% probability given the data. That is, given a fitted GMM,
% <docid:stats_ug.brx2uny-1 cluster> assigns query data to the component
% yielding the highest posterior probability. This method of assigning a
% data point to exactly one cluster is called _hard_ clustering. For an
% example showing how to fit a GMM to data, cluster using the fitted
% model, and estimate component posterior probabilities, see
% <docid:stats_ug.bra9fvn Cluster Data from Mixture of Gaussian
% Distributions>.
%%
% However, GMM clustering is more flexible because you can view it as a
% _fuzzy_ or _soft clustering_ method. Soft clustering methods assign a
% score to a data point for each cluster. The value of the score indicates
% the association strength of the data point to the cluster. As opposed to
% hard clustering methods, soft clustering methods are flexible in that
% they can assign a data point to more than one cluster. When clustering
% with GMMs, the score is the posterior probability. For an example of
% soft clustering using GMM, see <docid:stats_ug.buqq83f Cluster Gaussian
% Mixture Data Using Soft Clustering>.
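%%
% As a minimal sketch of the difference between hard and soft GMM
% clustering, fit a GMM to the Fisher iris sepal measurements and compare
% the hard assignments from |cluster| with the posterior probabilities
% from |posterior|. The two-component choice and the variable names here
% are illustrative assumptions only.
load fisheriris;
XSketch = meas(:,1:2);
rng(1); % For reproducibility
gmSketch = fitgmdist(XSketch,2);     % Fit a two-component GMM
hardIdx = cluster(gmSketch,XSketch); % Hard clustering: one label per point
P = posterior(gmSketch,XSketch);     % Soft clustering: one score per component
disp([hardIdx(1:5) P(1:5,:)])        % Labels alongside posterior probabilities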
%%
% Moreover, GMM clustering can accommodate clusters that have different
% sizes and correlation structures within them. Because of this, GMM
% clustering can be more appropriate to use than, e.g., _k_-means
% clustering.
%%
% Like most clustering methods, you must specify the number of desired
% clusters before fitting the model. The number of clusters specifies the
% number of components in the GMM. For GMMs, it is best practice to also
% consider the:
%
% * Component covariance structure. You can specify diagonal or full
% covariance matrices, or whether all components have the same covariance
% matrix.
% * Initial conditions. The Expectation-Maximization (EM) algorithm fits
% the GMM. Like the _k_-means clustering algorithm, EM is sensitive to
% initial conditions and might converge to a local optimum. You can
% specify your own starting values for the parameters, specify initial
% cluster assignments for data points or let them be randomly chosen, or
% specify to use the _k_-means++ algorithm.
% * Regularization parameter. If, for example, you have more predictors
% than data points, then you can regularize for estimation stability.
%
%% Covariance Structure Options
% Load Fisher's iris data set. Consider clustering the sepal measurements.
load fisheriris;
X = meas(:,1:2);
[n,p] = size(X);
rng(3); % For reproducibility

figure;
plot(X(:,1),X(:,2),'.','MarkerSize',15);
title('Fisher''s Iris Data Set');
xlabel('Sepal length (cm)');
ylabel('Sepal width (cm)');
%%
% The number of components, _k_, in a GMM determines the number of
% subpopulations or clusters. In this figure, it is difficult to determine
% whether two, three, or perhaps more components are appropriate. A GMM
% increases in complexity as _k_ increases.
%%
% Each component has a covariance matrix. Geometrically, the covariance
% structure determines the shape of a confidence ellipsoid drawn over a
% subpopulation or cluster. You can specify whether the covariance
% matrices for all components are diagonal or full, or whether all
% components have the same covariance matrix. Each combination of
% specifications determines the shape and orientation of the ellipsoids.
%%
% Fit GMMs to the data and examine the effects of specifying all
% combinations of covariance structure options on the shape of the
% ellipsoids. That is, specify all combinations of the name-value pair
% arguments |'CovarianceType'| and |'SharedCovariance'|. Covariance
% structure specifications apply to all components. For illustration,
% specify that there are three components. To draw the ellipsoids:
%
% # Use the fitted GMM to cluster a grid covering the plane composed of
% the extremes of the measurements.
% # Obtain the score that specifies a 99% probability threshold for each
% confidence region. This specification determines the length of the
% major and minor axes of the ellipsoids.
% # Color each ellipse using a color similar to its cluster.
%
k = 3;
Sigma = {'diagonal','full'};
nSigma = numel(Sigma);
SharedCovariance = {true,false};
SCtext = {'true','false'};
nSC = numel(SharedCovariance);
d = 500;
x1 = linspace(min(X(:,1)) - 2,max(X(:,1)) + 2,d);
x2 = linspace(min(X(:,2)) - 2,max(X(:,2)) + 2,d);
[x1grid,x2grid] = meshgrid(x1,x2);
X0 = [x1grid(:) x2grid(:)];
threshold = sqrt(chi2inv(0.99,2));
options = statset('MaxIter',1000); % Increase number of EM iterations

figure;
c = 1;
for i = 1:nSigma
    for j = 1:nSC
        gmfit = fitgmdist(X,k,'CovarianceType',Sigma{i},...
            'SharedCovariance',SharedCovariance{j},'Options',options);
        clusterX = cluster(gmfit,X);
        mahalDist = mahal(gmfit,X0);
        subplot(2,2,c);
        h1 = gscatter(X(:,1),X(:,2),clusterX);
        hold on;
        for m = 1:k
            idx = mahalDist(:,m) <= threshold;
            Color = h1(m).Color*0.75 - 0.5*(h1(m).Color - 1);
            h2 = plot(X0(idx,1),X0(idx,2),'.','Color',Color,'MarkerSize',1);
            uistack(h2,'bottom');
        end
        plot(gmfit.mu(:,1),gmfit.mu(:,2),'kx','LineWidth',2,'MarkerSize',10)
        title(sprintf('Sigma is %s, SharedCovariance = %s',...
            Sigma{i},SCtext{j}),'FontSize',8)
        legend(h1,{'1','2','3'});
        hold off
        c = c + 1;
    end
end
%%
% While the probability threshold for the confidence region determines the
% length of the major and minor axes, the covariance type determines the
% orientation of the axes.
%
% * Diagonal covariance matrices indicate that the predictors are
% uncorrelated. The major and minor axes of the ellipses are parallel or
% perpendicular to the _x_ and _y_ axes. This specification increases the
% total number of parameters by _p_, the number of predictors, for each
% component, but is more parsimonious than the full covariance
% specification.
% * Full covariance matrices allow for correlated predictors. There is no
% restriction to the orientation of the ellipses relative to the _x_ and
% _y_ axes. Each component increases the total number of parameters by
% $p(p+1)/2$, but captures correlation structure among the predictors.
% This specification can cause overfitting.
% * Shared covariance matrices indicate that all components have the same
% covariance matrix. All ellipses are the same size and have the same
% orientation. This specification is more parsimonious than the unshared
% specification because the total number of parameters only increases by
% the number of covariance parameters for one component.
% * Unshared covariance matrices indicate that all components have their
% own covariance matrix. The size and orientation of all ellipses might
% differ. This specification increases the number of parameters by _k_
% times the number of covariance parameters for a component, but can
% capture covariance differences among components. The sketch after this
% list makes the parameter counts concrete.
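%%
% To make the parameter-count arithmetic in the preceding list concrete,
% the following sketch computes the number of covariance parameters for
% each of the four specifications, using the values _p_ = 2 and _k_ = 3
% from this example. The variable names are illustrative assumptions.
pC = 2; kC = 3;           % Predictors and components in this example
diagParams = pC;          % Diagonal: p variances per covariance matrix
fullParams = pC*(pC+1)/2; % Full: variances plus pairwise covariances
fprintf('Diagonal, shared:   %d covariance parameters\n',diagParams)
fprintf('Diagonal, unshared: %d covariance parameters\n',kC*diagParams)
fprintf('Full, shared:       %d covariance parameters\n',fullParams)
fprintf('Full, unshared:     %d covariance parameters\n',kC*fullParams)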
%%
% The figure also shows that |cluster| does not always preserve cluster
% order. That is, if you cluster several fitted |gmdistribution| models,
% |cluster| might assign different cluster labels for similar components.
%%
% In most applications, the number of components, _k_, and the appropriate
% covariance structure, $\Sigma$, are unknown. One way you can tune a GMM
% is by comparing information criteria. Two popular information criteria
% are Akaike's Information Criterion (AIC) and the Bayesian Information
% Criterion (BIC). Both take the optimized, negative loglikelihood and
% then penalize it with the number of parameters in the model (i.e., the
% model complexity). However, BIC penalizes complexity more severely than
% AIC. Therefore, AIC tends to choose more complex models that might
% overfit, and BIC tends to choose simpler models that might underfit. A
% good practice is to look at both criteria when deciding on a model.
% Lower AIC or BIC values indicate better-fitting models. You should also
% ensure that your choices for _k_ and the covariance matrix structure
% are appropriate for your application. |fitgmdist| stores the AIC and
% BIC of fitted |gmdistribution| model objects in the properties |AIC|
% and |BIC|. You can access them using dot notation. For an example on
% choosing the appropriate parameters, see <docid:stats_ug.bus8_i5-1 Tune
% Gaussian Mixture Models>.
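%%
% As a brief sketch of this model-comparison workflow, fit GMMs with one
% through four components to the sepal measurements and collect the AIC
% and BIC of each fit. The candidate range 1:4 is an illustrative
% assumption; in practice, choose candidates that make sense for your
% application.
kCandidates = 1:4;
AICs = zeros(size(kCandidates));
BICs = zeros(size(kCandidates));
for c0 = kCandidates
    gmTry = fitgmdist(X,c0,'Options',options);
    AICs(c0) = gmTry.AIC; % Access the criteria using dot notation
    BICs(c0) = gmTry.BIC;
end
[~,bestAIC] = min(AICs);
[~,bestBIC] = min(BICs);
fprintf('AIC chooses k = %d; BIC chooses k = %d\n',bestAIC,bestBIC)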
%% Effects of Initial Conditions
% The algorithm that fits a GMM to the data can be sensitive to initial
% conditions. To illustrate this, consider fitting several GMMs and
% specify different initial conditions. Specifically, specify that most
% data points initially belong to the first cluster, specify two sets of
% random initial assignments, and use _k_-means++ to obtain initial
% cluster centers. For all instances, specify three components, unshared
% and full covariance matrices, the same initial mixture proportions, and
% the same initial covariance matrices. For stability when you try
% different sets of initial values, increase the number of EM algorithm
% iterations.
cluster0 = {[ones(n-8,1); [2; 2; 2; 2]; [3; 3; 3; 3]];...
    randsample(1:k,n,true); randsample(1:k,n,true); 'plus'};
converged = nan(4,1);

figure;
for j = 1:4
    gmfit = fitgmdist(X,k,'CovarianceType','full',...
        'SharedCovariance',false,'Start',cluster0{j},...
        'Options',options);
    clusterX = cluster(gmfit,X);
    mahalDist = mahal(gmfit,X0);
    subplot(2,2,j);
    h1 = gscatter(X(:,1),X(:,2),clusterX);
    hold on;
    nK = numel(unique(clusterX));
    for m = 1:nK
        idx = mahalDist(:,m) <= threshold;
        Color = h1(m).Color*0.75 - 0.5*(h1(m).Color - 1);
        h2 = plot(X0(idx,1),X0(idx,2),'.','Color',Color,'MarkerSize',1);
        uistack(h2,'bottom');
    end
    plot(gmfit.mu(:,1),gmfit.mu(:,2),'kx','LineWidth',2,'MarkerSize',10)
    legend(h1,{'1','2','3'});
    hold off
    converged(j) = gmfit.Converged;
end
sum(converged)
%%
% All algorithms converged. Each of the different sets of starting cluster
% assignments for the data points leads to a different, fitted cluster
% assignment. You can specify a positive integer for the name-value pair
% argument |'Replicates'|, which runs the algorithm the specified number
% of times. Subsequently, |fitgmdist| chooses the fit that yields the
% largest likelihood.
%% When to Regularize
% Sometimes, during an EM iteration, a fitted covariance matrix can become
% ill conditioned, meaning that the likelihood escapes to infinity. This
% can happen if:
%
% * There are more predictors than data points.
% * You specify too many components.
% * Variables are highly correlated.
%
% To overcome this problem, you can specify a small, positive number using
% the |'RegularizationValue'| name-value pair argument. |fitgmdist| adds
% the specified small, positive number to the diagonal elements of all
% covariance matrices, which ensures that all matrices are positive
% definite. Regularizing can reduce the maximized likelihood value. For
% more details, see <docid:stats_ug.bt8x9gq |fitgmdist|>.
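%%
% As a minimal sketch of the regularization syntax, refit the
% three-component model with a small regularization value. The value 0.01
% is an illustrative assumption; an appropriate value depends on the
% scale of your data.
gmReg = fitgmdist(X,k,'CovarianceType','full',...
    'RegularizationValue',0.01,'Options',options);
disp(gmReg.Sigma(:,:,1)) % The regularization value is added to each diagonal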