
%% Resource Contention in Task Parallel Problems
% This example looks at why it is so hard to give a concrete answer to the
% question "How will my (parallel) application perform on my multi-core
% machine or on my cluster?"  The answer most commonly given is "It depends
% on your application as well as your hardware," and we will try to explain
% why this is all one can say without more information.
%
% This example illustrates the contention for memory access that can occur
% on an ordinary multi-core computer.  It covers the case where all of the
% workers run on the same multi-core computer with a single CPU.
%
% Related examples:
%
% * <docid:distcomp_examples.example-ex58868462 Benchmarking Independent Jobs on the Cluster> 
% * <docid:distcomp_examples.example-ex53144152 Simple Benchmarking of PARFOR Using Blackjack>

%   Copyright 2008-2014 The MathWorks, Inc.

%%
% To simplify the problem at hand, we will benchmark the computer's ability
% to execute task parallel problems that do not involve disk IO.  This
% allows us to ignore several factors that might affect parallel
% applications, such as:
%
% *  Amount of inter-process communication
% *  Network bandwidth and latency for inter-process communication
% *  Process startup and shutdown times
% *  Time to dispatch requests to the processes
% *  Disk IO performance
%
% This leaves us with only:
%
% * Time spent executing task parallel code

%%  Analogy for Resource Contention and Efficiency
% To understand why it is worthwhile to perform such a simple benchmark,
% consider the following example: If one person can fill one bucket of
% water, carry it some distance, empty it, and take it back to refill it in
% one minute, how long will it take two people to go the same round trip
% with one bucket each?  This simple analogy closely reflects the task
% parallel benchmarks in this example.  At first glance, it seems absurd that
% there should be any decrease in efficiency when two people are
% simultaneously doing the same thing as compared to one person.
% 

%% Competing for One Pipe: Resource Contention
% If all things are perfect in our previous example, two people complete
% one loop with a bucket of water each in one minute.  Each person quickly
% fills one bucket, carries the bucket over to the destination, empties it
% and walks back, and they make sure never to interfere or interrupt one
% another.
% 
% However, imagine that they have to fill the buckets from a single, small
% water hose.  If they arrive at the hose at the same time, one would have
% to wait.  This is one example of a *contention for a shared resource*.
% Maybe the two people don't need to simultaneously use the hose, and the
% hose therefore serves their needs; but if you have 10 people transporting
% a bucket each, some might always have to wait.

%%
% In our case, the water hose corresponds to the computer hardware that we
% use, in particular the access to the computer memory.  If multiple
% programs are running simultaneously on one CPU core each, and they all
% need access to data that is stored in the computer's memory, some of the
% programs may have to wait because of limited memory bandwidth.

%% Same Hose, Different Distance, Different Results
% To take our analogy a bit further, imagine that we have contention at the
% hose when two people are carrying one bucket each, but then we change the
% task and ask them to carry the water quite a bit further away from the
% hose.  When performing this modified task, the two people spend a larger
% proportion of their time doing work, i.e., walking with the buckets, and
% a smaller proportion of their time contending over the shared resource,
% the hose.  They are therefore less likely to need the hose at the same
% time, so this modified task has a higher *parallel efficiency* than the
% original one.

%%
% In the case of the benchmarks in this example, this corresponds on the one
% hand to running programs that require lots of access to the computer's
% memory but perform very little work with the data once it has been fetched.
% If, on the other hand, the programs perform lots of computations with the
% data, it becomes almost irrelevant how long it takes to fetch the data,
% because the computation time overshadows the time spent waiting for access
% to the memory.

%%
% The predictability of the memory access of an algorithm also affects how
% contended the memory access will be.  If the memory is accessed in a 
% regular, predictable manner, we will not experience the same amount of
% contention as when memory is accessed in an irregular manner.  This can
% be seen further below, where, for example, singular value decomposition
% calculations result in more contention than matrix multiplication.
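%
% As a simple illustration of the difference in access patterns (this code
% is purely illustrative and not part of the benchmark), summing a vector
% in order touches memory sequentially, whereas indexing it with a random
% permutation touches the same data in an irregular order:
%
%   v = rand(1e6, 1);
%   s1 = sum(v);                       % sequential, predictable access
%   s2 = sum(v(randperm(numel(v))));   % irregular access to the same data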

%%
% The code shown in this example can be found in this function:
function paralleldemo_resource_bench

%% Check the Status of the Parallel Pool
% We will use the parallel pool to call our task parallel test functions,
% so we start by checking whether the pool is open.  Note that if the
% parallel pool is using workers running on multiple computers, you may not
% experience any of the resource contention that we attempt to illustrate.
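%
% If no pool is open, one can be started manually; for example, a pool of
% four workers using the local cluster profile (the profile name and pool
% size below are just examples):
%
%   parpool('local', 4);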
p = gcp;
if isempty(p)
    error('pctexample:resourcebench:poolClosed', ...
        ['This example requires a parallel pool. ' ...
         'Manually start a pool using the parpool command or set ' ...
         'your parallel preferences to automatically start a pool.']);
end
poolSize = p.NumWorkers;
    
%% Set up Benchmarking Problem
% For our calculations, we create an input array large enough that it
% needs to be brought from the computer's memory onto the CPU each time it
% is processed.  That is, we make it so large that we deliberately cause
% resource contention.
sz = 2048;
m = rand(sz*sz, 1);
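% With sz = 2048, this vector occupies sz^2 * 8 bytes, roughly 32 MB, which
% is larger than the last-level cache on most CPUs.  Shown here only for
% reference, its footprint can be printed with:
%
%   fprintf('Input size: %.0f MB\n', numel(m)*8/2^20);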

%% Repeated Summation
% These computations are very simple: Repeatedly sum a single array.  Since
% the computations are very lightweight, we expect to see resource
% contention when running multiple copies of this function simultaneously
% with a large input array.
function sumOp(m)
    s = 0; 
    for itr = 1:100 % Repeat multiple times to get accurate timing
        s = s + sum(m); 
    end
end

%% The Benchmarking Function
% The core of the timing function consists of a simple |spmd| statement.
% Notice that we retain the minimum execution time observed for a given
% level of concurrency, |n|.  As stated at the beginning, we are
% benchmarking task parallel problems, measuring only the actual runtime.
% This means that we are not benchmarking the performance of MATLAB, the
% Parallel Computing Toolbox(TM), or the |spmd| language construct.
% Rather, we are benchmarking the ability of our OS and hardware to
% simultaneously run multiple copies of a program.
function time = timingFcn(fcn, numConcurrent)

    time = zeros(1, length(numConcurrent));
    
    for ind = 1:length(numConcurrent)
        % Invoke the function handle fcn concurrently on n different labs.
        % Store the minimum time it takes all to complete. 
        n = numConcurrent(ind);
        spmd(n)
            tconcurrent = inf;
            for itr = 1:5
                labBarrier;
                tic; % Measure only task parallel runtime.
                    fcn();
                tAllDone = gop(@max, toc);  % Time for all to complete.
                tconcurrent = min(tconcurrent, tAllDone);
            end
        end
        time(ind) = tconcurrent{1};
        clear tconcurrent itr tAllDone;
        if ind == 1
            fprintf('Execution times: %f', time(ind));
        else
            fprintf(', %f', time(ind));
        end
    end
    fprintf('\n');
end

%% Benchmarking the Summation
% We measure how long it takes to simultaneously evaluate |n| copies of a
% summation function for different values of |n|.  Since the summation
% calculations are so simple, we expect to encounter resource contention
% when running multiple copies of this computation simultaneously on a
% multi-core CPU.  Consequently, we expect it to take longer to evaluate
% the summation function when we are performing multiple such evaluations
% concurrently than it takes to execute a single such evaluation on an
% otherwise idle CPU.
tsum = timingFcn(@() sumOp(m), 1:poolSize);

%% Benchmarking FFT
% We now look at a more computationally intensive problem, that of
% calculating the FFT of our vector.  Since FFT is more computationally
% intensive than summation, we expect that we will not see the same
% performance degradations when concurrently evaluating multiple calls to
% the FFT function as we saw with calls to the summation function.
function fftOp(m)
    for itr = 1:10 % Repeat a few times for accurate timing
        fft(m);
    end
end    

tfft = timingFcn(@() fftOp(m), 1:poolSize);

%% Matrix Multiplication
% Finally, we look at an even more computationally intensive problem than
% either FFT or summation.  The memory access in matrix multiplication is
% also very regular, so this problem has the potential to be executed
% quite efficiently in parallel on a multi-core machine.
function multOp(m)
    m*m;  %#ok<VUNUS> % No need to repeat for accurate timing.
end
m = reshape(m, sz, sz);
tmtimes = timingFcn(@() multOp(m), 1:poolSize);
clear m;

%% Investigate the Resource Contention
% We create a simple bar chart showing the speedup achieved by running
% multiple function invocations concurrently.  This kind of graph shows the
% speedup with what is known as *weak scaling*.  Weak scaling is where the
% number of processes/processors varies, and the problem size on each
% process/processor is fixed.  This has the effect of increasing the total
% problem size as we increase the number of processes/processors.  On the
% other hand, *strong scaling* is where the problem size is fixed and the
% number of processes/processors varies.  The effect of this is that as we
% increase the number of processes/processors, the work done by each
% process/processor decreases.
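%
% For contrast, a strong-scaling measurement of the summation would keep
% the total amount of data fixed and split it across the workers.  A
% minimal sketch of such a measurement (illustrative only, not used in this
% example) could look like this:
%
%   v = rand(sz*sz, 1);                              % one fixed-size problem
%   spmd(poolSize)
%       edges = round(linspace(0, numel(v), numlabs + 1));
%       myPart = v(edges(labindex) + 1:edges(labindex + 1));
%       labBarrier;
%       tic;
%       partialSum = sum(myPart);                    % each worker sums its slice
%       tStrong = gop(@max, toc);                    % time for all to finish
%   end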
allTimes = [tsum(:), tfft(:), tmtimes(:)];
% Normalize the execution times: efficiency is the single-process time
% divided by the time with n concurrent processes, and speedup is the
% efficiency multiplied by n.
efficiency = bsxfun(@rdivide, allTimes(1, :), allTimes);
speedup = bsxfun(@times, efficiency, (1:length(tsum))');
fig = figure;
ax = axes('parent', fig);
bar(ax, speedup);
legend(ax, 'vector sum', 'vector fft', 'matrix mult', ...
       'Location', 'NorthWest')
xlabel(ax, 'Number of concurrent processes');
ylabel(ax, 'Speedup')
title(ax, ['Effect of number of concurrent processes on ', ... 
           'resource contention and speedup']);

%% Effect on Real Applications
% Looking at the graph above, we can see that these simple problems scale
% very differently on the same computer.  When we also look at the fact
% that other problems and different computers may show very different
% behavior, it should become clear why it is impossible to give a general
% answer to the question "How will my (parallel) application perform on my
% multi-core machine or on my cluster?"  The answer to that question truly
% depends on the application and the hardware in question.

%% Measure Effect of Data Size on Resource Contention
% The resource contention does not depend only on the function that we are
% executing, but also on the size of the data that we process.  To
% illustrate this, we measure the execution times of various functions
% with various sizes of input data.  As before, we are benchmarking the
% ability of our hardware to perform these computations concurrently, and
% we are not benchmarking MATLAB or its algorithms.  We use more functions
% than before so that we can investigate the effects of different memory
% access patterns as well as the effects of different data sizes.
szs = [128, 256, 512, 1024, 2048];
description = {'vector sum', 'vector fft', 'matrix mult', 'matrix LU', ...
    'matrix SVD', 'matrix eig'};

%%
% We loop through the different data sizes and the functions we want to
% benchmark, and measure the speedup observed.  We compare the sequential
% execution time to the time it takes to execute concurrently as many
% invocations as we have labs in the pool.
speedup = zeros(length(szs), length(description));
for i = 1:length(szs)
    sz = szs(i);
    fprintf('Using matrices of size %d-by-%d.\n', sz, sz);
    j = 1;
    for f = [{@sumOp; sz^2; 1}, {@fftOp; sz^2; 1}, {@multOp; sz; sz}, ...
            {@lu; sz; sz}, {@svd;  sz; sz}, {@eig;  sz; sz}]
        op = f{1};
        nrows = f{2};
        ncols = f{3};
        m = rand(nrows, ncols);
        % Compare sequential execution to concurrent execution on all the
        % workers in the pool.  The weak-scaling speedup is the sequential
        % time divided by the concurrent time, multiplied by the number of
        % concurrent processes.
        tcurr = timingFcn(@() op(m), [1, poolSize]);
        speedup(i, j) = tcurr(1)/tcurr(2)*poolSize;
        j = j + 1;
    end
end

%% View Resource Contention As a Function of Data Size
% When we look at the results, we have to keep in mind how a function
% interacts with the cache on a CPU.  For small data sizes, we are always
% working out of the CPU cache for all these functions.  In that case, we
% expect to see perfect speedup.  When the input data is too large to fit
% into the CPU cache, we start seeing the performance degradation caused by
% contention for memory access.  This happens in several ways, but for this
% example, the following are the most important:
% 
% * The function performs a relatively small amount of computation with the
% data.  The sum of a vector is a good example of that.
% * The function accesses data in small chunks or has irregular data access
% patterns.  The following graph shows this effect for the eigenvalue (eig)
% and singular value decomposition (svd) calculations.
fig = figure;
ax = axes('parent', fig);
plot(ax, speedup);
lines = ax.Children;
set(lines, {'Marker'}, {'+', 'o', '*', 'x', 's', 'd'}');
hold(ax, 'on');
plot(ax, repmat(poolSize, 1, length(szs)), '--');
hold(ax, 'off');
legend(ax, [description, {'ideal speedup'}], 'Location', 'SouthWest');
ax.XTick = 1:length(szs);
ax.XTickLabel = arrayfun(@(x) {sprintf('%d^2', x)}, szs);
yl = ylim(ax);
ylim(ax, [0, max([poolSize + 0.1, yl(2)])]);
xlabel(ax, 'Number of elements per process');
ylabel(ax, 'Speedup')
title(ax, 'Effect of data size on resource contention and speedup');


end