www.gusucode.com > nnet 案例源码 matlab代码程序 > nnet/yeastdemonnet.m
%% Gene Expression Analysis % This example demonstrates looking for patterns in gene expression % profiles in baker's yeast using neural networks. % Copyright 2003-2014 The MathWorks, Inc. %% The Problem: Analyzing Gene Expressions in Baker's Yeast (Saccharomyces Cerevisiae) % The goal is to gain some understanding of gene expressions in % Saccharomyces cerevisiae, which is commonly known as baker's yeast or % brewer's yeast. It is the fungus that is used to bake bread and ferment % wine from grapes. % % Saccharomyces cerevisiae, when introduced in a medium rich in glucose, % can convert glucose to ethanol. Initially, yeast converts glucose to % ethanol by a metabolic process called "fermentation". However once supply % of glucose is exhausted yeast shifts from anaerobic fermentation of % glucose to aerobic respiraton of ethanol. This process is called diauxic % shift. This process is of considerable interest since it is accompanied % by major changes in gene expression. % % The example uses DNA microarray data to study temporal gene expression of % almost all genes in Saccharomyces cerevisiae during the diauxic shift. % %% % You need Bioinformatics Toolbox(TM) to run this example. if ~nnDependency.bioInfoAvailable errordlg('This example requires Bioinformatics Toolbox.'); return; end %% The Data % This example uses data from DeRisi, JL, Iyer, VR, Brown, PO. "Exploring % the metabolic and genetic control of gene expression on a genomic scale." % Science. 1997 Oct 24;278(5338):680-6. PMID: 9381177 % % The full data set can be downloaded from the Gene Expression Omnibus % website: http://www.yeastgenome.org % % Start by loading the data into MATLAB(R). load yeastdata.mat %% % Gene expression levels were measured at seven time points during the % diauxic shift. The variable |times| contains the times at which the % expression levels were measured in the experiment. The variable |genes| % contains the names of the genes whose expression levels were measured. % The variable |yeastvalues| contains the "VALUE" data or LOG_RAT2N_MEAN, % or log2 of ratio of CH2DN_MEAN and CH1DN_MEAN from the seven time steps % in the experiment. %% % To get an idea of the size of the data you can use *numel(genes)* to show % how many genes there are in the data set. numel(genes) %% % genes is a cell array of the gene names. You can access the entries using % MATLAB cell array indexing: genes{15} %% % This indicates that the 15th row of the variable *yeastvalues* contains % expression levels for the ORF |YAL054C|. You can use the web command to % access information about this ORF in the Saccharomyces Genome Database % (SGD). url = sprintf(... 'http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=%s',... genes{15}); web(url); %% Filtering the Genes % The data set is quite large and a lot of the information corresponds to % genes that do not show any interesting changes during the experiment. To % make it easier to find the interesting genes, the first thing to do is to % reduce the size of the data set by removing genes with expression % profiles that do not show anything of interest. There are 6400 expression % profiles. You can use a number of techniques to reduce this to some % subset that contains the most significant genes. %% % If you look through the gene list you will see several spots marked as % 'EMPTY'. These are empty spots on the array, and while they might have % data associated with them, for the purposes of this example, you can % consider these points to be noise. These points can be found using the % *strcmp* function and removed from the data set with indexing commands. % emptySpots = strcmp('EMPTY',genes); yeastvalues(emptySpots,:) = []; genes(emptySpots) = []; numel(genes) %% % In the yeastvalues data you will also see several places where the % expression level is marked as NaN. This indicates that no data was % collected for this spot at the particular time step. One approach to % dealing with these missing values would be to impute them using the % mean or median of data for the particular gene over time. This example % uses a less rigorous approach of simply throwing away the data for any % genes where one or more expression level was not measured. % % The function *isnan* is used to identify the genes with missing data and % indexing commands are used to remove the genes with missing data. nanIndices = any(isnan(yeastvalues),2); yeastvalues(nanIndices,:) = []; genes(nanIndices) = []; numel(genes) %% % If you were to plot the expression profiles of all the remaining % profiles, you would see that most profiles are flat and not significantly % different from the others. This flat data is obviously of use as it % indicates that the genes associated with these profiles are not % significantly affected by the diauxic shift; however, in this example, % you are interested in the genes with large changes in expression % accompanying the diauxic shift. You can use filtering functions in the % Bioinformatics Toolbox(TM) to remove genes with various types of profiles % that do not provide useful information about genes affected by the % metabolic change. % % You can use the *genevarfilter* function to filter out genes with small % variance over time. The function returns a logical array of the same size % as the variable genes with ones corresponding to rows of yeastvalues with % variance greater than the 10th percentile and zeros corresponding to % those below the threshold. mask = genevarfilter(yeastvalues); % Use the mask as an index into the values to remove the filtered genes. yeastvalues = yeastvalues(mask,:); genes = genes(mask); numel(genes) %% % The function *genelowvalfilter* removes genes that have very low absolute % expression values. Note that the gene filter functions can also % automatically calculate the filtered data and names. [mask, yeastvalues, genes] = ... genelowvalfilter(yeastvalues,genes,'absval',log2(3)); numel(genes) %% % Use *geneentropyfilter* to remove genes whose profiles have low entropy: [mask, yeastvalues, genes] = ... geneentropyfilter(yeastvalues,genes,'prctile',15); numel(genes) %% Principal Component Analysis % Now that you have a manageable list of genes, you can look for % relationships between the profiles. % % Normalizing the standard deviation and mean of data allows the network % to treat each input as equally important over its range of values. % % Principal-component analysis (PCA) is a useful technique that can be used % to reduce the dimensionality of large data sets, such as those from % microarray analysis. This technique isolates the principal components of % the dataset eliminating those components that contribute the least to the % variation in the data set. % % The two settings variables can be used to apply *mapstd* and *processpca* % to other data to consistently when the network is applied to new data. [x,std_settings] = mapstd(yeastvalues'); % Normalize data [x,pca_settings] = processpca(x,0.15); % PCA %% % The input vectors are first normalized, using |mapstd|, so that they have % zero mean and unity variance. |processpca| is the function that % implements the PCA algorithm. The second argument passed to |processpca| % is 0.15. This means that |processpca| eliminates those principal % components that contribute less than 15% to the total variation in the % data set. The variable |pc| now contains the principal components of the % yeastvalues data. %% % The principal components can be visiualized using the *scatter* function. figure scatter(x(1,:),x(2,:)); xlabel('First Principal Component'); ylabel('Second Principal Component'); title('Principal Component Scatter Plot'); %% Cluster Analysis: Self-Organizing Maps % The principal components can be now be clustered using the % Self-Organizing map (SOM) clustering algorithm available in Neural % Network Toolbox software. % % The *selforgmap* function creates a Self-Organizing map network which can % then be trained with the *train* function. % % The input size is 0 because the network has not yet been configured to % match our input data. This will happen when the network is trained. % net = selforgmap([5 3]); view(net) %% % Now the network is ready to be trained. % % The NN Training Tool shows the network being trained and the algorithms % used to train it. It also displays the training state during training % and the criteria which stopped training will be highlighted in green. % % The buttons at the bottom open useful plots which can be opened during % and after training. Links next to the algorithm names and plot buttons % open documentation on those subjects. net = train(net,x); nntraintool %% % Use *plotsompos* to display the network over a scatter plot of the % first two dimensions of the data. figure plotsompos(net,x); %% % You can assign clusters using the SOM by finding the nearest node to each % point in the data set. y = net(x); cluster_indices = vec2ind(y); %% % Use *plotsomhits* to see how many vectors are assigned to each of % the neurons in the map. figure plotsomhits(net,x); %% % You can also use other clustering algorithms like Hierarchical clustering % and K-means, available in the Statistics and Machine Learning Toolbox(TM) % for cluster analysis. %% Glossary % *ORF* - An open reading frame (ORF) is a portion of a geneís sequence % that contains a sequence of bases, uninterrupted by stop sequences, that % could potentially encode a protein.