%% Conduct a Cost-Sensitive Comparison of Two Classification Models
% For data sets with imbalanced class representations, or for data sets
% with imbalanced false-positive and false-negative costs, you can
% statistically compare the predictive performances of two classification
% models by including a cost matrix in the analysis.
%%
% Load the |arrhythmia| data set. Determine the class representations in
% the data.

% Copyright 2015 The MathWorks, Inc.

load arrhythmia;
Y = categorical(Y);
tabulate(Y);
%%
% There are 16 classes, but some are not represented in the data set. Most
% observations are classified as not having arrhythmia (class |1|). In
% short, the data set is discrete with highly imbalanced classes.
%%
% Combine all observations with arrhythmia (classes 2 through 15) into one
% class. Remove the observations with unknown arrhythmia status (class
% |16|) from the data set. Filter |X| and |Y| using the same logical index
% so that the predictor rows stay aligned with the labels.
idx = (Y ~= '16');  % Observations with known arrhythmia status
X = X(idx,:);
Y = Y(idx);
Y(Y ~= '1') = '2';  % Merge classes 2 through 15 into class |2|
%%
% Create a partition that evenly splits the data into training and test
% sets.
rng(1); % For reproducibility
CVP = cvpartition(Y,'holdout',0.5);
idxTrain = training(CVP); % Training-set indices
idxTest = test(CVP);      % Test-set indices
%%
% |CVP| is a cross-validation partition object that specifies the training
% and test sets.
%%
% Create a cost matrix such that misclassifying a patient with arrhythmia
% into the "no arrhythmia" class is five times worse than misclassifying a
% patient without arrhythmia into the arrhythmia class. Classifying
% correctly incurs no cost. The rows indicate the true class and the
% columns indicate the predicted class. When conducting a cost-sensitive
% analysis, it is a good practice to specify the order of the classes.
cost = [0 1;5 0];
ClassNames = categorical([1 2]); % Order of the rows and columns of |cost|
%%
% Train two boosting ensembles of 50 classification trees, one that uses
% AdaBoostM1 and the other that uses LogitBoost. Because the data set
% contains missing values, specify to use surrogate splits.
% Train the models using the cost matrix.
t = templateTree('Surrogate','on');
numTrees = 50;
C1 = fitensemble(X(idxTrain,:),Y(idxTrain),'AdaBoostM1',numTrees,t,...
    'Cost',cost);
C2 = fitensemble(X(idxTrain,:),Y(idxTrain),'LogitBoost',numTrees,t,...
    'Cost',cost);
%%
% |C1| and |C2| are trained |ClassificationEnsemble| models.
%%
% Test whether the AdaBoostM1 ensemble (|C1|) and the LogitBoost ensemble
% (|C2|) have equal predictive accuracy. Supply the cost matrix. Conduct
% the asymptotic, likelihood ratio, cost-sensitive test (the default when
% you pass in a cost matrix). Request to return the _p_-value and the
% misclassification costs.
[h,p,e1,e2] = compareHoldout(C1,C2,X(idxTest,:),X(idxTest,:),Y(idxTest),...
    'Cost',cost)
%%
% |h = 0| indicates not to reject the null hypothesis that the two models
% have equal predictive accuracy.
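%%
% As a sanity check, you can recompute an average misclassification cost
% on the test set by hand and compare it against the costs reported by
% |compareHoldout|. This is a sketch, not part of the original example; it
% assumes |C1|, |X|, |Y|, |idxTest|, and |cost| from above are in the
% workspace, and that the rows and columns of |cost| follow the class
% order in |C1.ClassNames|. Each test observation contributes
% |cost(i,j)|, where |i| indexes its true class and |j| its predicted
% class.
Yhat = predict(C1,X(idxTest,:));                   % Predicted labels
[~,iTrue] = ismember(Y(idxTest),C1.ClassNames);    % Row index: true class
[~,jPred] = ismember(Yhat,C1.ClassNames);          % Column index: prediction
e1Manual = mean(cost(sub2ind(size(cost),iTrue,jPred)))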