
%% Conduct a Cost-Sensitive Comparison of Two Classification Models
% For data sets with imbalanced class representations, or if the
% false-positive and false-negative costs are imbalanced, you can 
% statistically compare the predictive performances of two classification
% models by including a cost matrix in the analysis.
%%
% Load the |arrhythmia| data set.  Determine the class representations in
% the data.

% Copyright 2015 The MathWorks, Inc.

load arrhythmia;
Y = categorical(Y);
tabulate(Y);
%%
% There are 16 classes; however, some are not represented in the data set.
% Most observations are classified as not having arrhythmia (class |1|). To
% summarize, the data set is highly discrete with imbalanced classes.
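%%
% As an optional check (not part of the original example), you can compute
% the proportion of observations in the majority class directly.
propClass1 = mean(Y == '1')         % Fraction of observations without arrhythmia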

%%
% Remove the observations with unknown arrhythmia status (class |16|) from
% the data set. Then combine all observations with arrhythmia (classes 2
% through 15) into one class.
idx = (Y ~= '16');                  % Observations with known arrhythmia status
X = X(idx,:);
Y = Y(idx);
Y(Y ~= '1') = '2';                  % Pool classes 2 through 15 into class 2
Y = removecats(Y);                  % Drop the now-empty categories
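%%
% As a quick verification (not in the original example), confirm that the
% data now contain only the two pooled classes.
tabulate(Y)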

%%
% Create a partition that evenly splits the data into training and testing
% sets. 
rng(1);                             % For reproducibility
CVP = cvpartition(Y,'holdout',0.5);
idxTrain = training(CVP);           % Training-set indices 
idxTest = test(CVP);                % Test-set indices
%%
% |CVP| is a cross-validation partition object that specifies the training
% and test sets.
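%%
% Optionally, you can confirm that the split is even. This check is not
% part of the original example.
[sum(idxTrain) sum(idxTest)]        % Number of training and test observations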
%%
% Create a cost matrix such that misclassifying an arrhythmic patient
% into the "no arrhythmia" class is five times worse than misclassifying a
% patient without arrhythmia into the arrhythmia class. Classifying
% correctly incurs no cost. The rows indicate the true class and the
% columns indicate the predicted class. When conducting a cost-sensitive
% analysis, it is a good practice to specify the order of the classes.
Cost = [0 5;1 0];
ClassNames = categorical([2 1]);
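%%
% To see how the cost matrix aligns with the class order, you can display
% it with labeled rows and columns. This is an illustrative check, not part
% of the original example; the row and column labels are chosen here to
% mirror |ClassNames|.
array2table(Cost,'RowNames',{'True 2','True 1'},...
    'VariableNames',{'Pred2','Pred1'})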
%%
% Train two boosting ensembles of 50 classification trees, one that uses
% AdaBoostM1 and the other that uses LogitBoost.  Because there are missing
% values, specify to use surrogate splits.  Train the models using the
% cost matrix.
t = templateTree('Surrogate','on');
numTrees = 50;
MdlAda = fitensemble(X(idxTrain,:),Y(idxTrain),'AdaBoostM1',numTrees,t,...
    'Cost',Cost,'ClassNames',ClassNames);
MdlLogit = fitensemble(X(idxTrain,:),Y(idxTrain),'LogitBoost',numTrees,t,...
    'Cost',Cost,'ClassNames',ClassNames);
%%
% |MdlAda| and |MdlLogit| are trained |ClassificationEnsemble| models.
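%%
% As an optional diagnostic (not part of the original example), you can
% inspect the resubstitution classification error of each ensemble.
resubLoss(MdlAda)
resubLoss(MdlLogit)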
%% 
% Label the test-set observations using the trained models.
YhatAda = predict(MdlAda,X(idxTest,:));
YhatLogit = predict(MdlLogit,X(idxTest,:));
%%
% |YhatAda| and |YhatLogit| are vectors containing the predicted
% class labels of the respective models.
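%%
% Optionally, compare the test-set confusion matrices of the two models.
% This extra check is not in the original example; the |'Order'| argument
% matches the class order used for the cost matrix.
confusionmat(Y(idxTest),YhatAda,'Order',ClassNames)
confusionmat(Y(idxTest),YhatLogit,'Order',ClassNames)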
%%
% Test whether the AdaBoostM1 ensemble (|MdlAda|) and the LogitBoost
% ensemble (|MdlLogit|) have equal predictive accuracy. Supply the cost
% matrix. Conduct the asymptotic, likelihood ratio, cost-sensitive test
% (the default when you pass in a cost matrix). Request to return the
% _p_-value and the misclassification costs.
[h,p,e1,e2] = testcholdout(YhatAda,YhatLogit,Y(idxTest),'Cost',Cost)
%%
% |h = 0| indicates that the test fails to reject the null hypothesis that
% the two models have equal predictive accuracy.
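%%
% |testcholdout| also supports an asymptotic chi-square, cost-sensitive
% test. As a sketch of that alternative (not part of the original example),
% set |'CostTest'| to |'chisquare'|.
[hChi,pChi] = testcholdout(YhatAda,YhatLogit,Y(idxTest),...
    'Cost',Cost,'CostTest','chisquare')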