www.gusucode.com > stats 源码程序 matlab案例代码 > stats/ClassifyingRadarReturnsForIonosphereDataWithTreeBaggerExample.m

    %% Classifying Radar Returns for Ionosphere Data Using TreeBagger
% You can also use ensembles of decision trees for classification. For this
% example, use ionosphere data with 351 observations and 34 real-valued
% predictors. The response variable is categorical with two levels:
%
% * 'g' represents good radar returns.
% * 'b' represents bad radar returns.
%
% The goal is to predict good or bad returns using a set of 34 measurements. 
%%
% Fix the initial random seed, grow 50 trees, inspect how the ensemble
% error changes with accumulation of trees, and estimate feature
% importance. For classification, it is best to set the minimal leaf size
% to 1 and select the square root of the total number of features for each
% decision split at random. These settings are defaults for 
% |TreeBagger| used for classification.
load ionosphere
rng(1945,'twister')
b = TreeBagger(50,X,Y,'OOBVarImp','On');
figure
plot(oobError(b))
xlabel('Number of Grown Trees')
ylabel('Out-of-Bag Classification Error')
%%
% The method trains ensembles with few trees on observations that are in
% bag for all trees. For such observations, it is impossible to compute the
% true out-of-bag prediction, and |TreeBagger| returns the most probable
% class for classification and the sample mean for regression. You can
% change the default value returned for in-bag observations using the
% |DefaultYfit| property. If you set the default value to an empty
% character vector for classification, the method excludes in-bag
% observations from computation of the out-of-bag error. In this case, the
% curve is more variable when the number of trees is small, either because
% some observations are never out of bag (and are therefore excluded) or
% because their predictions are based on few trees.
b.DefaultYfit = '';
figure
plot(oobError(b))
xlabel('Number of Grown Trees')
ylabel('Out-of-Bag Error Excluding In-Bag Observations')
%%
% The |OOBIndices| property of |TreeBagger| tracks which observations are out
% of bag for what trees. Using this property, you can monitor the fraction
% of observations in the training data that are in bag for all trees. The
% curve starts at approximately 2/3, which is the fraction of unique observations
% selected by one bootstrap replica, and goes down to 0 at approximately 10
% trees.
finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
    finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
finbag = finbag / size(X,1);
figure
plot(finbag)
xlabel('Number of Grown Trees')
ylabel('Fraction of In-Bag Observations')
%%
% Estimate feature importance.
figure
bar(b.OOBPermutedVarDeltaError)
xlabel('Feature Index')
ylabel('Out-of-Bag Feature Importance')
%%
% Select the features yeilding an importance measure greater than 0.75.
% This threshold is chosen arbitrarily.
idxvar = find(b.OOBPermutedVarDeltaError>0.75)
%%
% Having selected the most important features, grow a larger
% ensemble on the reduced feature set. Save time by not permuting
% out-of-bag observations to obtain new estimates of feature importance for
% the reduced feature set (set |OOBVarImp| to |'off'|). You would still be
% interested in obtaining out-of-bag estimates of classification error (set
% |OOBPred| to |'on'|).
b5v = TreeBagger(100,X(:,idxvar),Y,'OOBVarImp','off','OOBPred','on');
figure
plot(oobError(b5v))
xlabel('Number of Grown Trees')
ylabel('Out-of-Bag Classification Error')
%%
% For classification ensembles, in addition to classification error
% (fraction of misclassified observations), you can also monitor the
% average classification margin. For each observation, the _margin_ is
% defined as the difference between the score for the true class and the
% maximal score for other classes predicted by this tree. The cumulative
% classification margin uses the scores averaged over all trees and the
% mean cumulative classification margin is the cumulative margin averaged
% over all observations. The |oobMeanMargin| method with the |'mode'|
% argument set to |'cumulative'| (default) shows how the mean cumulative
% margin changes as the ensemble grows: every new element in the returned
% array represents the cumulative margin obtained by including a new tree
% in the ensemble. If training is successful, you would expect to see a
% gradual increase in the mean classification margin.
%%
% The method trains ensembles with few trees on observations that are in
% bag for all trees. For such observations, it is impossible to compute the
% true out-of-bag prediction, and |TreeBagger| returns the most probable
% class for classification and the sample mean for regression.
%% 
% For decision trees, a classification score is the probability of
% observing an instance of this class in this tree leaf. For example, if
% the leaf of a grown decision tree has five |'good'| and three |'bad'|
% training observations in it, the scores returned by this decision tree
% for any observation fallen on this leaf are 5/8 for the |'good'| class
% and 3/8 for the |'bad'| class. These probabilities are called |'scores'|
% for consistency with other classifiers that might not have an obvious
% interpretation for numeric values of returned predictions.
figure
plot(oobMeanMargin(b5v));
xlabel('Number of Grown Trees')
ylabel('Out-of-Bag Mean Classification Margin')
%% 
% Compute the matrix of proximities and examine the distribution of
% outlier measures. Unlike regression, outlier measures for classification
% ensembles are computed within each class separately.
b5v = fillProximities(b5v);
figure
histogram(b5v.OutlierMeasure)
xlabel('Outlier Measure')
ylabel('Number of Observations')
%%
% Slightly more than half of the extreme outliers are labeled |'bad'|.
extremeOutliers = b5v.Y(b5v.OutlierMeasure>40)
percentBad = 100*sum(strcmp(extremeOutliers,'b'))/numel(extremeOutliers)
%%
% As for regression, you can plot scaled coordinates, displaying the two
% classes in different colors using the 'Colors' name-value pair argument of |mdsProx|. This
% argument takes a character vector in which every character represents a color.
% The software does not rank class names. Therefore, it is best practice to
% determine the position of the classes in the |ClassNames|
% property of the ensemble.
gPosition = find(strcmp('g',b5v.ClassNames))
%%
% The |'bad'| class is first and the |'good'| class is second. Display
% scaled coordinates using red for the |'bad'| class and blue for the
% |'good'| class observations.
figure
[s,e] = mdsProx(b5v,'Colors','rb');
xlabel('First Scaled Coordinate')
ylabel('Second Scaled Coordinate')
%%
% Plot the first 20 eigenvalues obtained by scaling. The first eigenvalue
% clearly dominates and the first scaled coordinate is most
% important.
figure
bar(e(1:20))
xlabel('Scaled Coordinate Index')
ylabel('Eigenvalue')
%%
% Another way of exploring the performance of a classification ensemble is
% to plot its Receiver Operating Characteristic (ROC) curve or another
% performance curve suitable for the current problem. Obtain
% predictions for out-of-bag observations. For a classification ensemble,
% the |oobPredict| method returns a cell array of classification labels as
% the first output argument and a numeric array of scores as the second
% output argument. The returned array of scores has two columns, one for
% each class. In this case, the first column is for the |'bad'| class and
% the second column is for the |'good'| class. One column in the score matrix
% is redundant because the scores represent class probabilities in tree
% leaves and by definition add up to 1.
[Yfit,Sfit] = oobPredict(b5v);
%%
% Use |perfcurve| to compute a performance curve. By default,
% |perfcurve| returns the standard ROC curve, which is the true positive rate
% versus the false positive rate. |perfcurve| requires true class labels,
% scores, and the positive class label for input. In this case, choose the
% |'good'| class as positive.
[fpr,tpr] = perfcurve(b5v.Y,Sfit(:,gPosition),'g');
figure
plot(fpr,tpr)
xlabel('False Positive Rate')
ylabel('True Positive Rate')
%%
% Instead of the standard ROC curve, you might want to plot, for example,
% ensemble accuracy versus threshold on the score for the |'good'| class. The
% |ycrit| input argument of |perfcurve| lets you specify the criterion for the
% |y|-axis, and the third output argument of |perfcurve| returns an array of
% thresholds for the positive class score. Accuracy is the fraction of
% correctly classified observations, or equivalently, 1 minus the
% classification error.
[fpr,accu,thre] = perfcurve(b5v.Y,Sfit(:,gPosition),'g','YCrit','Accu');
figure(20)
plot(thre,accu)
xlabel('Threshold for ''good'' Returns')
ylabel('Classification Accuracy')
%%
% The curve shows a flat region indicating that any threshold from 0.2 to
% 0.6 is a reasonable choice. By default, the |perfcurve| assigns
% classification labels using 0.5 as the boundary between the two classes.
% You can find exactly what accuracy this corresponds to.
accu(abs(thre-0.5)<eps)
%%
% The maximal accuracy is a little higher than the default one.
[maxaccu,iaccu] = max(accu)
%%
% The optimal threshold is therefore.
thre(iaccu)