www.gusucode.com > distcomp 案例源码程序 matlab代码 > distcomp/paralleldemo_hiv_dist.m
%% Distributed Analysis of the Origin of the Human Immunodeficiency Virus % This example shows how the Parallel Computing Toolbox(TM) can be used to perform % pairwise sequence alignment (PWSA). PWSA has multiple % applications in bioinformatics, such as multiple sequence analysis and % phylogenetic tree reconstruction. We look at a PWSA that uses a global dynamic % programming algorithm to align each pair of sequences, and we then calculate % the pairwise distances using the Tajima-Nei metric. This gives us a matrix % of distances between sequences that we use for inferring the phylogeny of HIV % and SIV viruses. PWSA is a computationally expensive task with complexity % O(L*L*N*N), where L is the average length of the sequences and N is the number % of sequences [Durbin, et al. Cambridge University Press, 1998]. % % For details about the computations, % <matlab:edit(fullfile(matlabroot, 'examples', 'distcomp', 'pctdemo_setup_hiv.m')) view the code for pctdemo_setup_hiv>. % % Prerequisites: % % * <docid:distcomp_examples.example-ex53988799 % Customizing the Settings for the Examples in the Parallel Computing Toolbox> % * <docid:distcomp_examples.example-ex17307408 % Dividing MATLAB(R) Computations into Tasks> % % Related examples: % % * <docid:distcomp_examples.example-ex21533229 % Sequential Analysis of the Origin of the Human Immunodeficiency Virus> % Copyright 2007-2013 The MathWorks, Inc. %% Analyze the Sequential Problem % First, we look at how the computations in the sequential example fit into the % model introduced in the <docid:distcomp_examples.example-ex17307408 Dividing MATLAB % Computations into Tasks> example. The call to |seqpdist| is the most % computationally intensive part of the sequential example, and it basically % performs the following parameter sweep: % % for all pairs (s1, s2) formed from the elements of seq % distance(s1, s2) = seqpdist(s1, s2); % end % % The distance is, of course, symmetric, i.e., |seqpdist(s1, s2) = seqpdist(s2, % s1)|. Because |seqpdist| re-normalizes the distances based on the entire input % sequence |seq|, we have to calculate that renormalization factor ourselves in % this example, and we also need to identify the pairs between which to measure % the distance. % % Since we calculate a rather large number of distances, we have each task % calculate several distances. This requires us to write a simple wrapper task % function that invokes |seqpdist|. %% Load the Example Settings and the Data % The example uses the default profile when identifying the cluster to use. % The % <matlab:helpview(fullfile(docroot,'toolbox','distcomp','distcomp_ug.map'),'profiles_help') % profiles documentation> % explains how to create new profiles and how to change the default % profile. See % <docid:distcomp_examples.example-ex53988799 % Customizing the Settings for the Examples in the Parallel Computing Toolbox> % for instructions on how to change the example difficulty level or the number of % tasks created. [difficulty, myCluster, numTasks] = pctdemo_helper_getDefaults(); %% % The |pctdemo_setup_hiv| function retrieves the protein sequence information % from the NCBI GenBank(R) database, and the |difficulty| parameter controls how % many protein sequences we retrieve. % You can % <matlab:edit(fullfile(matlabroot, 'examples', 'distcomp', 'pctdemo_setup_hiv.m')) view the code for pctdemo_setup_hiv> % for full details. [fig, pol, description] = pctdemo_setup_hiv(difficulty); numViruses = length(pol); startTime = clock; %% Divide the Work into Smaller Tasks % We use the Tajima-Nei method to measure the distance between % the POL coding regions. % Tajima-Nei distances are based on the frequency of nucleotides in % the whole group of sequences. When you call |seqpdist| % for only two sequences, the function would compute the % nucleotide count based on only the two sequences and not on the % whole group. Consequently, we calculate the frequency based on % the whole group and pass that information to |seqpdist|. bc = basecount(strcat(pol.Sequence)); bf = [bc.A bc.C bc.G bc.T]/(bc.A + bc.C + bc.G + bc.T); %% % Let's find the parameter space that we want to traverse. The |seqpdist| % documentation gives us useful information in this regard: % % % D = SEQPDIST(SEQS) returns a vector D containing biological distances % % between each pair of sequences stored in the M elements of the cell % % SEQS. D is an (M*(M-1)/2)-by-1 vector, corresponding to the M*(M-1)/2 % % pairs of sequences in SEQS. The output D is arranged in the order of % % ((2,1),(3,1),..., (M,1),(3,2),...(M,2),.....(M,M-1), i.e., the lower % % left triangle of the full M-by-M distance matrix. % % Based on this information, we create two vectors, |Aseq| and |Bseq|, that % contain the sequence pairs between which to calculate the distances. Aseq = struct; Bseq = struct; ind = 1; for j = 1:numViruses for i = j+1:numViruses Aseq(ind).Sequence = pol(j).Sequence; Bseq(ind).Sequence = pol(i).Sequence; ind = ind + 1; end end %% % We want to divide the vectors |Asplit| and |Bsplit| between the tasks. [Asplit, numTasks] = pctdemo_helper_split_vector(Aseq, numTasks); Bsplit = pctdemo_helper_split_vector(Bseq, numTasks); fprintf(['This example will submit a job with %d task(s) ' ... 'to the cluster.\n'], numTasks); %% Create and Submit the Job % We create a job and the tasks in the job. Task |i| calculates % the pairwise distances between the elements in |Asplit{i}| and |Bsplit{i}|. % You can % <matlab:edit(fullfile(matlabroot, 'examples', 'distcomp', 'pctdemo_task_hiv.m')) view the code for pctdemo_task_hiv> % for full details. job = createJob(myCluster); for i = 1:numTasks createTask(job, @pctdemo_task_hiv, 1, {Asplit{i}, Bsplit{i}, bf}); end %% % We can now submit the job and wait for it to finish. submit(job); wait(job); %% Retrieve the Results % When we have obtained and verified all the results, we allow the cluster % to free its resources. |fetchOutputs| will throw an error if the tasks % did not complete successfully, in which case we need to delete the job % before throwing the error. try jobResults = fetchOutputs(job); catch err delete(job); rethrow(err); end pold = cell2mat(jobResults(:, 1))'; %% % We have now finished all the verifications, so we can delete the job. delete(job); %% Measure the Elapsed Time % The time used for the distributed computations should be compared % against the time it takes to perform the same set of calculations % in the % <docid:distcomp_examples.example-ex21533229 Sequential Analysis of the Origin of the Human % Immunodeficiency Virus> example. % The elapsed time varies with the underlying hardware and network infrastructure. elapsedTime = etime(clock, startTime); fprintf('Elapsed time is %2.1f seconds\n', elapsedTime); %% Plot the Results % Now that we have all the distances, we can construct the phylogenetic trees % for the POL proteins. You can % <matlab:edit(fullfile(matlabroot, 'examples', 'distcomp', 'pctdemo_plot_hiv.m')) view the code for pctdemo_plot_hiv> % for full details. pctdemo_plot_hiv(fig, pold, description);