%% Conduct a Cost-Sensitive Comparison of Two Classification Models
% For data sets with imbalanced class representations, or if the
% false-positive and false-negative costs are imbalanced, you can
% statistically compare the predictive performances of two classification
% models by including a cost matrix in the analysis.
%%
% Load the |arrhythmia| data set. Determine the class representations in
% the data.

% Copyright 2015 The MathWorks, Inc.

load arrhythmia;
Y = categorical(Y);
tabulate(Y);
%%
% There are 16 classes, but some are not represented in the data set. Most
% observations are classified as not having arrhythmia (class |1|). In
% short, the classes are highly imbalanced.
%%
% Combine all observations with arrhythmia (classes 2 through 15) into one
% class. Remove those observations with unknown arrhythmia status (class
% |16|) from the data set. Apply the same mask to |X| and |Y| before
% relabeling, so that the predictors and labels stay aligned.

idx = Y ~= '16';      % Observations with known arrhythmia status
X = X(idx,:);
Y = Y(idx);
Y(Y ~= '1') = '2';    % Combine all arrhythmia classes into class 2
%%
% Create a partition that evenly splits the data into training and test
% sets.

rng(1); % For reproducibility
CVP = cvpartition(Y,'holdout',0.5);
idxTrain = training(CVP);   % Training-set indices
idxTest = test(CVP);        % Test-set indices
%%
% |CVP| is a cross-validation partition object that specifies the training
% and test sets.
%%
% Create a cost matrix such that misclassifying an arrhythmatic patient
% into the no arrhythmia class is five times worse than misclassifying a
% patient without arrhythmia into the arrhythmia class. Classifying
% correctly incurs no cost. The rows indicate the true class and the
% columns indicate the predicted class. When conducting a cost-sensitive
% analysis, it is good practice to specify the order of the classes.
% With |ClassNames| ordered as |[2 1]|, the element |Cost(1,2) = 5| is the
% cost of predicting class |1| for a true class |2| observation.

Cost = [0 5;1 0];
ClassNames = categorical([2 1]);
%%
% Train two boosting ensembles of 50 classification trees, one that uses
% AdaBoostM1 and the other that uses LogitBoost. Because the data set
% contains missing values, specify to use surrogate splits. Train the
% models using the cost matrix.
t = templateTree('Surrogate','on');
numTrees = 50;
MdlAda = fitensemble(X(idxTrain,:),Y(idxTrain),'AdaBoostM1',numTrees,t,...
    'Cost',Cost,'ClassNames',ClassNames);
MdlLogit = fitensemble(X(idxTrain,:),Y(idxTrain),'LogitBoost',numTrees,t,...
    'Cost',Cost,'ClassNames',ClassNames);
%%
% |MdlAda| and |MdlLogit| are trained |ClassificationEnsemble| models.
%%
% Label the test-set observations using the trained models.

YhatAda = predict(MdlAda,X(idxTest,:));
YhatLogit = predict(MdlLogit,X(idxTest,:));
%%
% |YhatAda| and |YhatLogit| are vectors containing the predicted class
% labels of the respective models.
%%
% Test whether the AdaBoostM1 ensemble (|MdlAda|) and the LogitBoost
% ensemble (|MdlLogit|) have equal predictive accuracy. Supply the cost
% matrix. Conduct the asymptotic, likelihood ratio, cost-sensitive test,
% which is the default when you pass in a cost matrix. Request to return
% the _p_-value and the misclassification cost of each model.

[h,p,e1,e2] = testcholdout(YhatAda,YhatLogit,Y(idxTest),'Cost',Cost)
%%
% |h = 0| indicates not to reject the null hypothesis that the two models
% have equal predictive accuracy.
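%%
% As an optional sanity check (not part of the original example), you can
% recompute each model's average misclassification cost by hand: weight
% the test-set confusion matrix element-wise by the cost matrix and divide
% by the number of test observations. Pass |ClassNames| as the |'Order'|
% argument of |confusionmat| so that its rows and columns line up with
% |Cost|. The resulting values should agree with the |e1| and |e2|
% outputs of |testcholdout|.

n = numel(Y(idxTest));
CAda = confusionmat(Y(idxTest),YhatAda,'Order',ClassNames);
CLogit = confusionmat(Y(idxTest),YhatLogit,'Order',ClassNames);
costAda = sum(sum(CAda.*Cost))/n      % Expected to match e1
costLogit = sum(sum(CLogit.*Cost))/n  % Expected to match e2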