%% Cancer Detection
% This example demonstrates using a neural network to detect cancer from
% mass spectrometry data on protein profiles.

% Copyright 2003-2014 The MathWorks, Inc.

%% Introduction
% Serum proteomic pattern diagnostics can be used to differentiate samples
% from patients with and without disease. Profile patterns are generated
% using surface-enhanced laser desorption and ionization (SELDI) protein
% mass spectrometry. This technology has the potential to improve clinical
% diagnostic tests for cancer pathologies.

%% The Problem: Cancer Detection
% The goal is to build a classifier that can distinguish between cancer
% and control patients from the mass spectrometry data.
%
% The methodology followed in this example is to select a reduced set of
% measurements, or "features", that can be used to distinguish between
% cancer and control patients using a classifier.
%
% These features are ion intensity levels at specific mass/charge values.

%% Formatting the Data
% The data in this example is from the FDA-NCI Clinical Proteomics Program
% Databank: http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp
%
% To recreate the data in *ovarian_dataset.mat* used in this example,
% download and uncompress the raw mass-spectrometry data from the FDA-NCI
% web site. Create the data file *OvarianCancerQAQCdataset.mat* by either
% running the script *msseqprocessing* in Bioinformatics Toolbox (TM) or
% by following the steps in the example *biodistcompdemo* (Batch
% processing with parallel computing). The new file contains the
% variables |Y|, |MZ| and |grp|.
%
% Each column in |Y| represents measurements taken from a patient. There
% are |216| columns in |Y| representing |216| patients, of which |121|
% are ovarian cancer patients and |95| are normal patients.
%
% Each row in |Y| represents the ion intensity level at a specific
% mass-charge value indicated in |MZ|.
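%
% As a quick sanity check, and only as a sketch assuming you have
% recreated *OvarianCancerQAQCdataset.mat* as described above, the
% variables can be inspected like this:
%
%   load OvarianCancerQAQCdataset.mat
%   size(Y)      % rows are mass-charge values, columns are patients
%   numel(grp)   % one group label per patient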
% There are |15000| mass-charge values in |MZ|, and each row in |Y|
% represents the ion intensity levels of the patients at that particular
% mass-charge value.
%
% The variable |grp| holds the index information as to which of these
% samples represent cancer patients and which ones represent normal
% patients.
%
% An extensive description of this data set and an excellent introduction
% to this promising technology can be found in [1] and [2].

%% Ranking Key Features
% This is a typical classification problem in which the number of
% features is much larger than the number of observations, but in which
% no single feature achieves a correct classification on its own. We
% therefore need a classifier that learns how to weight multiple features
% appropriately, while at the same time producing a generalized mapping
% that is not over-fitted.
%
% A simple approach for finding significant features is to assume that
% each M/Z value is independent and compute a two-way t-test.
% *rankfeatures* returns an index to the most significant M/Z values, for
% instance 100 indices ranked by the absolute value of the test
% statistic.
%
% To finish recreating the data from *ovarian_dataset.mat*, load
% *OvarianCancerQAQCdataset.mat* and use *rankfeatures* from
% Bioinformatics Toolbox to choose the 100 highest ranked measurements as
% inputs |x|:
%
%   ind = rankfeatures(Y,grp,'CRITERION','ttest','NUMBER',100);
%   x = Y(ind,:);
%
% Define the targets |t| for the two classes as follows:
%
%   t = double(strcmp('Cancer',grp));
%   t = [t; 1-t];
%
% The preprocessing steps from the script and example listed above are
% intended to demonstrate a representative set of possible pre-processing
% and feature selection procedures. Using different steps or parameters
% may lead to different, and possibly improved, results.

[x,t] = ovarian_dataset;
whos

%%
% Each column in |x| represents one of 216 different patients.
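%
% As a sketch of a quick check on the |x| and |t| loaded above:
%
%   size(x)   % 100 ranked mass-charge features by 216 patients
%   size(t)   % 2 target rows by 216 patients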
%
% Each row in |x| represents the ion intensity level at one of the 100
% specific mass-charge values for each patient.
%
% The variable |t| has 2 rows of 216 values, each of which is either
% [1;0], indicating a cancer patient, or [0;1], indicating a normal
% patient.

%% Classification Using a Feed Forward Neural Network
% Now that you have identified some significant features, you can use
% this information to classify the cancer and normal samples.

%%
% Since the neural network is initialized with random initial weights,
% the results after training vary slightly every time the example is run.
% To avoid this randomness, the random seed is set to reproduce the same
% results every time. However, this is not necessary for your own
% applications.

setdemorandstream(672880951)

%%
% A feed forward neural network with one hidden layer of 5 neurons is
% created and trained. The input and target samples are automatically
% divided into training, validation and test sets. The training set is
% used to teach the network. Training continues as long as the network
% continues improving on the validation set. The test set provides a
% completely independent measure of network accuracy.
%
% The input and output have sizes of 0 because the network has not yet
% been configured to match our input and target data. This will happen
% when the network is trained.

net = patternnet(5);
view(net)

%%
% Now the network is ready to be trained.
%
% The NN Training Tool shows the network being trained and the algorithms
% used to train it. It also displays the training state during training,
% and the criterion which stopped training is highlighted in green.
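%
% By default *patternnet* trains with the scaled conjugate gradient
% algorithm and divides the samples randomly. As a sketch only, not
% needed for this example, these defaults could be adjusted before
% calling *train*:
%
%   net.trainFcn = 'trainscg';           % default training algorithm
%   net.divideFcn = 'dividerand';        % default random division
%   net.divideParam.trainRatio = 0.70;   % default set proportions
%   net.divideParam.valRatio = 0.15;
%   net.divideParam.testRatio = 0.15;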
%
% The buttons at the bottom open useful plots which can be viewed during
% and after training. Links next to the algorithm names and plot buttons
% open documentation on those subjects.

[net,tr] = train(net,x,t);

%%
% To see how the network's performance improved during training, either
% click the "Performance" button in the training tool, or call
% PLOTPERFORM.
%
% Performance is measured in terms of mean squared error, and is shown on
% a log scale. It rapidly decreased as the network was trained.
%
% Performance is shown for each of the training, validation and test
% sets. The version of the network that did best on the validation set is
% the one saved after training.

plotperform(tr)

%%
% The trained neural network can now be tested with the testing samples
% that were partitioned from the main dataset. The testing data was not
% used in training in any way and hence provides an "out-of-sample"
% dataset to test the network on. This gives us a sense of how well the
% network will do when tested with data from the real world.
%
% The network outputs will be in the range 0 to 1, so we threshold them
% to get 1's and 0's indicating cancer or normal patients respectively.

testX = x(:,tr.testInd);
testT = t(:,tr.testInd);

testY = net(testX);
testClasses = testY > 0.5

%%
% One measure of how well the neural network has fit the data is the
% confusion plot. Here the confusion matrix is plotted for the test
% samples.
%
% The confusion matrix shows the percentages of correct and incorrect
% classifications. Correct classifications are the green squares on the
% matrix's diagonal. Incorrect classifications form the red squares.
%
% If the network has learned to classify properly, the percentages in the
% red squares should be very small, indicating few misclassifications.
%
% If this is not the case then further training, or training a network
% with more hidden neurons, would be advisable.
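%
% As a sketch of a complementary check, the overall fraction of correct
% test classifications can also be computed directly by converting the
% outputs and targets to class indices with *vec2ind*:
%
%   testIndices = vec2ind(testY);     % 1 = cancer, 2 = normal
%   targetIndices = vec2ind(testT);
%   accuracy = mean(testIndices == targetIndices)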
plotconfusion(testT,testY)

%%
% Here are the overall percentages of correct and incorrect
% classification.

[c,cm] = confusion(testT,testY)

fprintf('Percentage Correct Classification : %f%%\n', 100*(1-c));
fprintf('Percentage Incorrect Classification : %f%%\n', 100*c);

%%
% Another measure of how well the neural network has fit the data is the
% receiver operating characteristic plot. This shows how the false
% positive and true positive rates relate as the thresholding of outputs
% is varied from 0 to 1.
%
% The farther left and up the line is, the fewer false positives need to
% be accepted in order to get a high true positive rate. The best
% classifiers will have a line going from the bottom left corner, to the
% top left corner, to the top right corner, or close to that.
%
% Class 1 indicates cancer patients, class 2 normal patients.

plotroc(testT,testY)

%%
% This example illustrated how neural networks can be used as classifiers
% for cancer detection. One can also experiment with techniques such as
% principal component analysis to reduce the dimensionality of the data
% used for building neural networks, to improve classifier performance.

%% References
% [1] T.P. Conrads, et al., "High-resolution serum proteomic features for
% ovarian cancer detection", Endocrine-Related Cancer, 11, 2004, pp.
% 163-178.
%
% [2] E.F. Petricoin, et al., "Use of proteomic patterns in serum to
% identify ovarian cancer", Lancet, 359(9306), 2002, pp. 572-577.