www.gusucode.com > distcomp 案例源码程序 matlab代码 > distcomp/paralleldemo_hiv_dist.m

    %% Distributed Analysis of the Origin of the Human Immunodeficiency Virus
% This example shows how the Parallel Computing Toolbox(TM) can be used to perform
% pairwise sequence alignment (PWSA). PWSA has multiple
% applications in bioinformatics, such as multiple sequence analysis and
% phylogenetic tree reconstruction. We look at a PWSA that uses a global dynamic
% programming algorithm to align each pair of sequences, and we then calculate
% the pairwise distances using the Tajima-Nei metric.  This gives us a matrix
% of distances between sequences that we use for inferring the phylogeny of HIV
% and SIV viruses. PWSA is a computationally expensive task with complexity
% O(L*L*N*N), where L is the average length of the sequences and N is the number
% of sequences [Durbin, et al. Cambridge University Press, 1998].
%
% For details about the computations,  
% <matlab:edit(fullfile(matlabroot, 'examples', 'distcomp', 'pctdemo_setup_hiv.m')) view the code for pctdemo_setup_hiv>.
%
% Prerequisites:
% 
% * <docid:distcomp_examples.example-ex53988799
% Customizing the Settings for the Examples in the Parallel Computing Toolbox>
% * <docid:distcomp_examples.example-ex17307408 
% Dividing MATLAB(R) Computations into Tasks>
%
% Related examples:
% 
% * <docid:distcomp_examples.example-ex21533229 
% Sequential Analysis of the Origin of the Human Immunodeficiency Virus>

%   Copyright 2007-2013 The MathWorks, Inc.

%% Analyze the Sequential Problem
% First, we look at how the computations in the sequential example fit into the
% model introduced in the <docid:distcomp_examples.example-ex17307408 Dividing MATLAB
% Computations into Tasks> example.  The call to |seqpdist| is the most
% computationally intensive part of the sequential example, and it basically
% performs the following parameter sweep:
%
%    for all pairs (s1, s2) formed from the elements of seq
%       distance(s1, s2) = seqpdist(s1, s2);
%    end
%
% The distance is, of course, symmetric, i.e.,  |seqpdist(s1, s2) = seqpdist(s2,
% s1)|.  Because |seqpdist| re-normalizes the distances based on the entire input
% sequence |seq|, we have to calculate that renormalization factor ourselves in 
% this example, and we also need to identify the pairs between which to measure  
% the distance.
%
% Since we calculate a rather large number of distances, we have each task
% calculate several distances.  This requires us to write a simple wrapper task
% function that invokes |seqpdist|.

%% Load the Example Settings and the Data
% The example uses the default profile when identifying the cluster to use.
% The
% <matlab:helpview(fullfile(docroot,'toolbox','distcomp','distcomp_ug.map'),'profiles_help')
% profiles documentation> 
% explains how to create new profiles and how to change the default 
% profile.  See 
% <docid:distcomp_examples.example-ex53988799
% Customizing the Settings for the Examples in the Parallel Computing Toolbox> 
% for instructions on how to change the example difficulty level or the number of
% tasks created.
[difficulty, myCluster, numTasks] = pctdemo_helper_getDefaults();

%%
% The |pctdemo_setup_hiv| function retrieves the protein sequence information
% from the NCBI GenBank(R) database, and the |difficulty| parameter controls how
% many protein sequences we retrieve.
% You can 
% <matlab:edit(fullfile(matlabroot, 'examples', 'distcomp', 'pctdemo_setup_hiv.m')) view the code for pctdemo_setup_hiv> 
% for full details.
[fig, pol, description] = pctdemo_setup_hiv(difficulty);
numViruses = length(pol);
startTime = clock;

%% Divide the Work into Smaller Tasks
% We use the Tajima-Nei method to measure the distance between
% the POL coding regions. 
% Tajima-Nei distances are based on the frequency of nucleotides in
% the whole group of sequences.  When you call |seqpdist|
% for only two sequences, the function would compute the
% nucleotide count based on only the two sequences and not on the
% whole group.  Consequently, we calculate the frequency based on
% the whole group and pass that information to |seqpdist|.
bc = basecount(strcat(pol.Sequence));
bf = [bc.A bc.C bc.G bc.T]/(bc.A + bc.C + bc.G + bc.T);

%%
% Let's find the parameter space that we want to traverse.  The |seqpdist| 
% documentation gives us useful information in this regard:
%
%   % D = SEQPDIST(SEQS) returns a vector D containing biological distances
%   % between each pair of sequences stored in the M elements of the cell
%   % SEQS. D is an (M*(M-1)/2)-by-1 vector, corresponding to the M*(M-1)/2
%   % pairs of sequences in SEQS. The output D is arranged in the order of
%   % ((2,1),(3,1),..., (M,1),(3,2),...(M,2),.....(M,M-1), i.e., the lower
%   % left triangle of the full M-by-M distance matrix.  
% 
% Based on this information, we create two vectors, |Aseq| and |Bseq|, that 
% contain the sequence pairs between which to calculate the distances. 
Aseq = struct;
Bseq = struct;
ind = 1;
for j = 1:numViruses
    for i = j+1:numViruses
       Aseq(ind).Sequence = pol(j).Sequence;
       Bseq(ind).Sequence = pol(i).Sequence;
       ind = ind + 1;
    end
end

%%
% We want to divide the vectors |Asplit| and |Bsplit| between the tasks.
[Asplit, numTasks] = pctdemo_helper_split_vector(Aseq, numTasks);
Bsplit = pctdemo_helper_split_vector(Bseq, numTasks);
fprintf(['This example will submit a job with %d task(s) ' ...
         'to the cluster.\n'], numTasks);

%% Create and Submit the Job
% We create a job and the tasks in the job.  Task |i| calculates 
% the pairwise distances between the elements in |Asplit{i}| and |Bsplit{i}|.
% You can 
% <matlab:edit(fullfile(matlabroot, 'examples', 'distcomp', 'pctdemo_task_hiv.m')) view the code for pctdemo_task_hiv> 
% for full details.
job = createJob(myCluster);
for i = 1:numTasks
   createTask(job, @pctdemo_task_hiv, 1, {Asplit{i}, Bsplit{i}, bf});
end
%%
% We can now submit the job and wait for it to finish.
submit(job);
wait(job);

%% Retrieve the Results
% When we have obtained and verified all the results, we allow the cluster
% to free its resources.  |fetchOutputs| will throw an error if the tasks
% did not complete successfully, in which case we need to delete the job
% before throwing the error.
try
    jobResults = fetchOutputs(job);
catch err
    delete(job);
    rethrow(err);
end

pold = cell2mat(jobResults(:, 1))';

%%
% We have now finished all the verifications, so we can delete the job.
delete(job);

%% Measure the Elapsed Time
% The time used for the distributed computations should be compared
% against the time it takes to perform the same set of calculations
% in the 
% <docid:distcomp_examples.example-ex21533229 Sequential Analysis of the Origin of the Human 
% Immunodeficiency Virus> example.
% The elapsed time varies with the underlying hardware and network infrastructure.
elapsedTime = etime(clock, startTime);
fprintf('Elapsed time is %2.1f seconds\n', elapsedTime);

%% Plot the Results
% Now that we have all the distances, we can construct the phylogenetic trees
% for the POL proteins.  You can 
% <matlab:edit(fullfile(matlabroot, 'examples', 'distcomp', 'pctdemo_plot_hiv.m')) view the code for pctdemo_plot_hiv> 
% for full details.
pctdemo_plot_hiv(fig, pold, description);