
%% Profiling Load Unbalanced Codistributed Arrays
% This example shows how to profile the implicit communication that 
% occurs when using an unevenly distributed array.

%   Copyright 2007-2012 The MathWorks, Inc.

%% 
% *Prerequisites*:
%
% * Interactive Parallel Mode in Parallel Computing Toolbox(TM) (See
% |pmode| in the user's guide.)
% * <docid:distcomp_examples.example-ex31793179 Using the Parallel Profiler in Pmode>

%%
% This example shows how to use the parallel profiler with an
% unevenly distributed array. The easiest way to create a codistributed array 
% is to pass a |codistributor| as an argument, as in |rand(N, codistributor)|. 
% This distributes your N-by-N matrix evenly among your MATLAB(R) workers. 
% To get an unbalanced data distribution, you can instead assign each lab a 
% number of columns that depends on its |labindex|, as in the sketch below. 
%
% The plots in this example are produced from a 12-node MATLAB 
% cluster. Everything else is shown running on a four-node local cluster.
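
%%
% As a concrete illustration of that idea (a hypothetical sketch; the
% variable names are not taken from this example's helper function), the
% following pmode code uses |codistributor1d| to give lab 1 a single
% column and hand the leftover columns to the last lab:

colsPerLab = 512;                         % nominal number of columns per lab
N = colsPerLab*numlabs;                   % global matrix is N-by-N
part = repmat(colsPerLab, 1, numlabs);    % start from an even column partition
part(1) = 1;                              % lab 1 keeps only a single column
part(end) = part(end) + colsPerLab - 1;   % last lab absorbs the leftovers
Dexample = rand(N, codistributor1d(2, part, [N, N]));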

%% The Algorithm
% The algorithm we chose for this codistributed array is relatively simple.  We
% generate a large matrix such that each lab gets an approximately 512-by-512
% submatrix, except for the first lab. The first lab receives only one
% column of the matrix, and its remaining columns are assigned to the last lab.
% Thus, on a four-lab cluster, lab 1 keeps only a 512-by-1 column, labs 2
% and 3 have their allotted partitions, and lab 4 has its allotted partition
% plus the additional columns (left over from lab 1). The end result is an
% unbalanced workload when performing zero-communication element-wise operations
% (such as |sin|) and communication delays with data parallel operations
% (such as |codistributed/mtimes|). We start with a data parallel operation
% (|codistributed/mtimes|). We then perform, in a loop, |sqrt|, |sin|, and inner
% product operations, all of which operate only on individual elements of
% the matrix.
%
% The MATLAB file code for this example can be found in
% |pctdemo_aux_profdistarray|.
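
%%
% As a rough, hypothetical sketch of the kind of work just described (not
% the actual contents of |pctdemo_aux_profdistarray|), the operations look
% like this when applied to the unevenly distributed |Dexample| built above:

E = Dexample*Dexample*Dexample;          % data parallel: implicit communication
for iter = 1:3                           % element-wise: no communication needed
    E = sqrt(sin(Dexample .* Dexample));
end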

%%
% In this example, the size of the matrix differs depending on the number
% of MATLAB workers (|numlabs|). However, it takes approximately the same
% amount of computation time (not including communication) to run this example
% on any cluster, so you can try using a larger cluster without having to
% wait a long time.

labBarrier;          % synchronize all the labs before profiling
mpiprofile reset;    % clear any previously collected profiling data
mpiprofile on;       % start collecting execution and communication data
pctdemo_aux_profdistarray();
mpiprofile viewer;   % stop profiling and open the graphical viewer
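
%%
% As an alternative to the graphical viewer, you can retrieve the raw
% per-lab profiling data programmatically with |mpiprofile('info')| (call
% it before |mpiprofile viewer|, which stops the profiler). This is a
% sketch using the standard profiler info structure fields:
%
%   p = mpiprofile('info');            % profiling data on each lab
%   {p.FunctionTable.FunctionName}     % names of the profiled functions
%   [p.FunctionTable.TotalTime]        % corresponding total times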


%%
% First, browse the Function Summary Report, making sure it is sorted
% by execution time by clicking the *Total Time*
% column. Then follow the link for the top-level function (which should be
% |pctdemo_aux_profdistarray|) to see the Function Detail Report.

%% The Busy Line Table in the Function Detail Report
% Each MATLAB function entry has its own *Busy Line* table, which is useful if
% you want to profile multiple programs or examples at the same time.
%
% * In the *Function Detail Report*, observe the communication information
% for the executed MATLAB code on a line-by-line basis. 
%
% * Compare profiling information using the Busy Line table. Click *Compare
% max vs. min TotalTime*. Observe the Busy Line table and check to see
% which line numbers took the most time by sorting the time field using the
% drop-down list. There are no for-loops in this code and no increasing
% complexity as you saw in the previous <docid:distcomp_examples.example-ex24154837
% Profiling Parallel Work Distribution> example. However, there is still
% a large difference in computation load between the labs. Look at the
% |sqrt( sin( D .* D ) );| line.

%%
% <<../paralleldemo_distarray_proffileopts.png>>

%%
% <<../paralleldemo_distarray_profblt.png>>

%%
% Despite the fact that no communication is required for this element-wise
% operation, the performance is not optimal, because some labs do more work
% than others. In the second row (|D*D*D|), the total time taken is the
% same on both labs. However, the *Data Rec*
% and *Data Sent* columns show a large difference in the amount of data
% sent and received. The time taken for this |mtimes| is similar on all
% labs, because the codistributed array communication implicitly synchronizes 
% all the labs.
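
%%
% You can reproduce both effects with a small, hypothetical timing sketch
% in pmode, reusing the unevenly distributed |Dexample| from earlier.
% |gop| combines the per-lab timings so every lab sees the spread between
% the fastest and slowest lab:

labBarrier;
t = tic;
F = sqrt(sin(Dexample .* Dexample));       % element-wise: per-lab times differ
tElem = toc(t);
labBarrier;
t = tic;
F = Dexample*Dexample*Dexample;            % mtimes: implicitly synchronized
tMtimes = toc(t);
spreadElem = gop(@max, tElem) - gop(@min, tElem)        % large spread
spreadMtimes = gop(@max, tMtimes) - gop(@min, tMtimes)  % small spread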

%%
% In the ninth column (from the left) of the Busy Line table, a
% bar shows the percentage for the selected field (chosen using the *Sort busy
% lines* list box). These bars can also be used to visually compare the *Total
% Time*, *Data Sent*, or *Data Received* of the main and comparison labs. 
%

%% Observing Codistributed Array Operations in Plot View
% If you click the relevant function name to open its *Function Detail
% Report*, you get more specific information about a codistributed array 
% operation.

%%
% * To get the inter-lab communication data, click *Plot All PerLab
% Communication*. In the first figure, you can see lab 1 transferring the
% most data, and the last lab (lab 12) transferring the least.
% * To go back to the *Function Summary Report*, click *Home* and then 
% click the |pctdemo_aux_profdistarray| link to view the Busy Line table
% again.
% 
% Using the comparisons, you can also see the amount of data communicated
% between each lab. This is constant for all labs except the first and
% last. Since the code contains no explicit communication, this pattern
% indicates a distribution problem. In a typical codistributed array |mtimes| 
% operation, labs that have the least amount of data (e.g., lab 1) receive all 
% the required data from their neighboring labs (e.g., lab 2). 

%% The Data Received Plot 
% <<../paralleldemo_distarray_profrimage.png>>

%% 
% In this *Data Received Per Lab* plot, there is a significant decrease
% in the amount of data transferred by the last lab and an increase in the
% amount transferred by the first lab. Observing the Receive Communication
% Time plot (not shown) further illustrates that something different is
% going on in the first lab: it spends the most time in communication.

%%
% As you can see, the uneven distribution of a matrix causes unnecessary
% communication delays when using data parallel codistributed array operations 
% and uneven work distribution with task parallel (no communication) operations.
% In addition, labs that receive more data (like the first lab in this
% example) are the ones that started with the least data prior to the
% codistributed array operation.