%% Selecting Features for Classifying High-dimensional Data
% This example shows how to select features for classifying
% high-dimensional data. More specifically, it shows how to perform
% sequential feature selection, which is one of the most popular feature
% selection algorithms. It also shows how to use holdout and
% cross-validation to evaluate the performance of the selected features.
%
% Reducing the number of features (dimensionality) is important in
% statistical learning. For many data sets with a large number of features
% and a limited number of observations, such as bioinformatics data, many
% features are usually not useful for producing the desired learning
% result, and the limited observations may lead the learning algorithm to
% overfit to the noise. Reducing features can also save storage and
% computation time and increase comprehensibility.
%
% There are two main approaches to reducing features: feature selection
% and feature transformation. Feature selection algorithms select a subset
% of features from the original feature set; feature transformation
% methods transform data from the original high-dimensional feature space
% to a new space with reduced dimensionality.

%   Copyright 2007-2014 The MathWorks, Inc.

%% Loading the Data
% Serum proteomic pattern diagnostics can be used to differentiate
% observations from patients with and without disease. Profile patterns
% are generated using surface-enhanced laser desorption and ionization
% (SELDI) protein mass spectrometry. These features are ion intensity
% levels at specific mass/charge values.
%
% The data in this example is from the
% <http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp FDA-NCI
% Clinical Proteomics Program Databank>. This example uses the
% high-resolution ovarian cancer data set that was generated using the
% WCX2 protein array. After some pre-processing steps, similar to those
% shown in the Bioinformatics Toolbox(TM) example
% <http://www.mathworks.com/help/bioinfo/examples/preprocessing-raw-mass-spectrometry-data.html
% Pre-processing Raw Mass Spectrometry Data>, the data set has two
% variables |obs| and |grp|. The |obs| variable consists of 216
% observations with 5000 features. Each element in |grp| defines the group
% to which the corresponding row of |obs| belongs.

load ovariancancer;
whos

%% Dividing Data Into a Training Set and a Test Set
% Some of the functions used in this example call MATLAB(R) built-in
% random number generation functions. To duplicate the exact results shown
% in this example, execute the command below to set the random number
% generator to a known state. Otherwise, your results may differ.

rng(8000,'twister');

%%
% The performance on the training data (resubstitution performance) is not
% a good estimate for a model's performance on an independent test set.
% Resubstitution performance will usually be over-optimistic. To predict
% the performance of a selected model, you need to assess its performance
% on another data set that was not used to build the model. Here, we use
% |cvpartition| to divide the data into a training set of size 160 and a
% test set of size 56. Both the training set and the test set have roughly
% the same group proportions as in |grp|. We select features using the
% training data and judge the performance of the selected features on the
% test data. This is often called holdout validation. Another simple and
% widely-used method for evaluating and selecting a model is
% cross-validation, which will be illustrated later in this example.

holdoutCVP = cvpartition(grp,'holdout',56)
%%
dataTrain = obs(holdoutCVP.training,:);
grpTrain = grp(holdoutCVP.training);
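%%
% As a quick check on the stratification (a small sketch, not part of the
% original analysis), |tabulate| shows that the training set has roughly
% the same group proportions as |grp|:
tabulate(grp)
tabulate(grpTrain)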
%% The Problem of Classifying Data Using All the Features
% Without first reducing the number of features, some classification
% algorithms would fail on the data set used in this example, since the
% number of features is much larger than the number of observations. In
% this example, we use Quadratic Discriminant Analysis (QDA) as the
% classification algorithm. If we apply QDA on the data using all the
% features, as shown in the following, we will get an error because there
% are not enough samples in each group to estimate a covariance matrix.
try
   yhat = classify(obs(test(holdoutCVP),:), dataTrain, grpTrain,'quadratic');
catch ME
   display(ME.message);
end

%% Selecting Features Using a Simple Filter Approach
% Our goal is to reduce the dimension of the data by finding a small set
% of important features which can give good classification performance.
% Feature selection algorithms can be roughly grouped into two categories:
% filter methods and wrapper methods. Filter methods rely on general
% characteristics of the data to evaluate and select feature subsets
% without involving the chosen learning algorithm (QDA in this example).
% Wrapper methods use the performance of the chosen learning algorithm to
% evaluate each candidate feature subset. Wrapper methods search for
% features that better fit the chosen learning algorithm, but they can be
% significantly slower than filter methods if the learning algorithm takes
% a long time to run. The concepts of "filters" and "wrappers" are
% described in Kohavi, R. and John, G.H. (1997) "Wrappers for feature
% subset selection", Artificial Intelligence, Vol. 97, No. 1-2,
% pp. 273-324. This example shows one instance of a filter method and one
% instance of a wrapper method.
%
% Filters are usually used as a pre-processing step since they are simple
% and fast. A widely-used filter method for bioinformatics data is to
% apply a univariate criterion separately on each feature, assuming that
% there is no interaction between features.
%
% For example, we might apply the _t_-test on each feature and compare the
% _p_-value (or the absolute value of the _t_-statistic) for each feature
% as a measure of how effective it is at separating groups.
dataTrainG1 = dataTrain(grp2idx(grpTrain)==1,:);
dataTrainG2 = dataTrain(grp2idx(grpTrain)==2,:);
[h,p,ci,stat] = ttest2(dataTrainG1,dataTrainG2,'Vartype','unequal');

%%
% In order to get a general idea of how well-separated the two groups are
% by each feature, we plot the empirical cumulative distribution function
% (CDF) of the _p_-values:

ecdf(p);
xlabel('P value');
ylabel('CDF value')

%%
% About 35% of the features have _p_-values close to zero, and over 50% of
% the features have _p_-values smaller than 0.05, meaning that more than
% 2500 of the original 5000 features have strong discrimination power. One
% can sort these features according to their _p_-values (or the absolute
% values of the _t_-statistic) and select some features from the sorted
% list. However, it is usually difficult to decide how many features are
% needed unless one has some domain knowledge or the maximum number of
% features that can be considered has been dictated in advance based on
% outside constraints.
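%%
% The second of these fractions can also be checked directly rather than
% read off the CDF plot. This is a quick sanity check, not part of the
% original analysis:
mean(p < 0.05)    % fraction of features with p-values below 0.05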
%%
% One quick way to decide the number of needed features is to plot the MCE
% (misclassification error, i.e., the number of misclassified observations
% divided by the number of observations) on the test set as a function of
% the number of features. Since there are only 160 observations in the
% training set, the largest number of features for applying QDA is
% limited; otherwise, there may not be enough samples in each group to
% estimate a covariance matrix. Actually, for the data used in this
% example, the holdout partition and the sizes of the two groups dictate
% that the largest allowable number of features for applying QDA is about
% 70: QDA estimates a separate covariance matrix for each group, so each
% group needs more training observations than features, and the smaller
% group contributes only about 70 observations to the training set.
%
% Now we compute MCE for various numbers of features between 5 and 70 and
% show the plot of MCE as a function of the number of features. In order
% to reasonably estimate the performance of the selected model, it is
% important to use the 160 training samples to fit the QDA model and to
% compute the MCE on the 56 test observations (blue circular marks in the
% following plot). To illustrate why resubstitution error is not a good
% error estimate of the test error, we also show the resubstitution MCE
% using red triangular marks.

[~,featureIdxSortbyP] = sort(p,2); % sort the features by p-value
testMCE = zeros(1,14);
resubMCE = zeros(1,14);
nfs = 5:5:70;
classf = @(xtrain,ytrain,xtest,ytest) ...
             sum(~strcmp(ytest,classify(xtest,xtrain,ytrain,'quadratic')));
resubCVP = cvpartition(length(grp),'resubstitution')
for i = 1:14
   fs = featureIdxSortbyP(1:nfs(i));
   testMCE(i) = crossval(classf,obs(:,fs),grp,'partition',holdoutCVP)...
       /holdoutCVP.TestSize;
   resubMCE(i) = crossval(classf,obs(:,fs),grp,'partition',resubCVP)/...
       resubCVP.TestSize;
end
plot(nfs, testMCE,'o',nfs,resubMCE,'r^');
xlabel('Number of Features');
ylabel('MCE');
legend({'MCE on the test set' 'Resubstitution MCE'},'location','NW');
title('Simple Filter Feature Selection Method');

%%
% For convenience, |classf| is defined as an anonymous function. It fits
% QDA on the given training set and returns the number of misclassified
% samples for the given test set. If you were developing your own
% classification algorithm, you might want to put it in a separate file,
% as follows:
%
%  function err = classf(xtrain,ytrain,xtest,ytest)
%       yfit = classify(xtest,xtrain,ytrain,'quadratic');
%       err = sum(~strcmp(ytest,yfit));

%%
% The resubstitution MCE is over-optimistic. It consistently decreases
% when more features are used and drops to zero when more than 60 features
% are used. However, if the test error increases while the resubstitution
% error still decreases, then overfitting may have occurred. This simple
% filter feature selection method gets the smallest MCE on the test set
% when 15 features are used. The plot shows overfitting begins to occur
% when 20 or more features are used. The smallest MCE on the test set is
% 12.5%, i.e., 7 of the 56 test observations are misclassified:
testMCE(3)

%%
% These are the first 15 features (those with the smallest _p_-values)
% that achieve the minimum MCE:
featureIdxSortbyP(1:15)

%% Applying Sequential Feature Selection
% The above feature selection algorithm does not consider interaction
% between features; besides, features selected from the list based on
% their individual ranking may also contain redundant information, so that
% not all the features are needed. For example, the linear correlation
% coefficient between the first selected feature (column 2814) and the
% second selected feature (column 2813) is almost 0.95.
corr(dataTrain(:,featureIdxSortbyP(1)),dataTrain(:,featureIdxSortbyP(2)))
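%%
% To see whether this redundancy extends beyond a single pair, the
% pairwise correlations among the top-ranked features can be inspected in
% one call. This is a quick sketch, not part of the original analysis;
% here we look at the first five features on the sorted list:
corr(dataTrain(:,featureIdxSortbyP(1:5)))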
%%
% This kind of simple feature selection procedure is usually used as a
% pre-processing step since it is fast. More advanced feature selection
% algorithms improve the performance. Sequential feature selection is one
% of the most widely used techniques. It selects a subset of features by
% sequentially adding (forward search) or removing (backward search)
% features until certain stopping conditions are satisfied. (This example
% uses the forward search; a backward-search sketch appears at the end.)
%
% In this example, we use forward sequential feature selection in a
% wrapper fashion to find important features. More specifically, since the
% typical goal of classification is to minimize the MCE, the feature
% selection procedure performs a sequential search using the MCE of the
% learning algorithm QDA on each candidate feature subset as the
% performance indicator for that subset. The training set is used to
% select the features and to fit the QDA model, and the test set is used
% to evaluate the performance of the finally selected features. During the
% feature selection procedure, to evaluate and to compare the performance
% of each candidate feature subset, we apply stratified 10-fold
% cross-validation to the training set. We will illustrate later why
% applying cross-validation to the training set is important.
%
% First we generate a stratified 10-fold partition for the training set:
tenfoldCVP = cvpartition(grpTrain,'kfold',10)

%%
% Then we use the filter results from the previous section as a
% pre-processing step to select features. For instance, we select 150
% features here:
fs1 = featureIdxSortbyP(1:150);

%%
% We apply forward sequential feature selection on these 150 features. The
% function |sequentialfs| provides a simple way (the default option) to
% decide how many features are needed. It stops when the first local
% minimum of the cross-validation MCE is found.
fsLocal = sequentialfs(classf,dataTrain(:,fs1),grpTrain,'cv',tenfoldCVP);

%%
% The selected features are the following:
fs1(fsLocal)

%%
% To evaluate the performance of the selected model with these three
% features, we compute the MCE on the 56 test samples.
testMCELocal = crossval(classf,obs(:,fs1(fsLocal)),grp,'partition',...
    holdoutCVP)/holdoutCVP.TestSize

%%
% With only three features selected, the MCE is only a little over half of
% the smallest MCE obtained with the simple filter feature selection
% method.

%%
% The algorithm may have stopped prematurely. Sometimes a smaller MCE is
% achievable by looking for the minimum of the cross-validation MCE over a
% reasonable range of numbers of features. For instance, we draw the plot
% of the cross-validation MCE as a function of the number of features for
% up to 50 features.
[fsCVfor50,historyCV] = sequentialfs(classf,dataTrain(:,fs1),grpTrain,...
    'cv',tenfoldCVP,'Nf',50);
plot(historyCV.Crit,'o');
xlabel('Number of Features');
ylabel('CV MCE');
title('Forward Sequential Feature Selection with cross-validation');

%%
% The cross-validation MCE reaches its minimum value when 10 features are
% used, and the curve stays flat over the range from 10 features to 35
% features. Also, the curve goes up when more than 35 features are used,
% which means overfitting occurs there.
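%%
% The location of that minimum can also be read off programmatically
% rather than from the plot. This is a small convenience sketch, not part
% of the original analysis; |min| returns the criterion value and the
% first index at which it occurs, which corresponds to the 10-feature
% subset here:
[minCV,nBest] = min(historyCV.Crit)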
%%
% It is usually preferable to have fewer features, so here we pick 10
% features:
fsCVfor10 = fs1(historyCV.In(10,:))

%%
% To show these 10 features in the order in which they are selected in the
% sequential forward procedure, we find the row of |historyCV.In| in which
% each feature first becomes true:
[orderlist,~] = find( [historyCV.In(1,:); diff(historyCV.In(1:10,:))]' );
fs1(orderlist)

%%
% To evaluate these 10 features, we compute their MCE for QDA on the test
% set. We get the smallest MCE value so far:
testMCECVfor10 = crossval(classf,obs(:,fsCVfor10),grp,'partition',...
    holdoutCVP)/holdoutCVP.TestSize

%%
% It is interesting to look at the plot of resubstitution MCE values on
% the training set (i.e., without performing cross-validation during the
% feature selection procedure) as a function of the number of features:
[fsResubfor50,historyResub] = sequentialfs(classf,dataTrain(:,fs1),...
    grpTrain,'cv','resubstitution','Nf',50);
plot(1:50, historyCV.Crit,'bo',1:50, historyResub.Crit,'r^');
xlabel('Number of Features');
ylabel('MCE');
legend({'10-fold CV MCE' 'Resubstitution MCE'},'location','NE');

%%
% Again, the resubstitution MCE values are overly optimistic here. Most
% are smaller than the cross-validation MCE values, and the resubstitution
% MCE goes to zero when 16 features are used. We can compute the MCE value
% of these 16 features on the test set to see their real performance:
fsResubfor16 = fs1(historyResub.In(16,:));
testMCEResubfor16 = crossval(classf,obs(:,fsResubfor16),grp,'partition',...
    holdoutCVP)/holdoutCVP.TestSize

%%
% |testMCEResubfor16|, the performance of these 16 features (chosen by
% resubstitution during the feature selection procedure) on the test set,
% is about double that of |testMCECVfor10|, the performance of the 10
% features (chosen by 10-fold cross-validation during the feature
% selection procedure) on the test set. This again indicates that the
% resubstitution error generally is not a good performance estimate for
% evaluating and selecting features. We may want to avoid using
% resubstitution error, not only during the final evaluation step, but
% also during the feature selection procedure.
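%%
% Finally, the backward search mentioned earlier is available through the
% 'direction' option of |sequentialfs|. The following is a minimal sketch
% (not run here, and its results are not discussed in this example). It
% assumes a small pre-filtered starting subset, since QDA could not even
% be fit if the backward search started from all 150 candidate features:
%
%  fsBack = sequentialfs(classf,dataTrain(:,fs1(1:20)),grpTrain,...
%      'cv',tenfoldCVP,'direction','backward');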