%% Benchmarking A\b
% This example shows how to benchmark solving a linear system on a
% cluster.  The MATLAB(R) code to solve for |x| in |A*x = b| is very
% simple.  Most frequently, one uses matrix left division, also known as
% mldivide or the backslash operator (\), to calculate |x| (that is,
% |x = A\b|).  Benchmarking the performance of matrix left division on
% a cluster, however, is not as straightforward.
%
% One of the most challenging aspects of benchmarking is to avoid falling
% into the trap of looking for a single number that represents the overall
% performance of the system.  We will look at performance curves that
% might help you identify the performance bottlenecks on your cluster, and
% that show how to benchmark your code so that you can draw meaningful
% conclusions from the results.
%
% Related examples:
%
% * <docid:distcomp_examples.example-ex53144152 Simple Benchmarking of
% parfor Using Blackjack>
% * <docid:distcomp_examples.example-ex58868462 Benchmarking Independent
% Jobs on the Cluster>
% * <docid:distcomp_examples.example-ex61407340 Resource Contention in
% Task Parallel Problems>

%   Copyright 2009-2016 The MathWorks, Inc.

%%
% The code shown in this example can be found in this function:
function results = paralleldemo_backslash_bench(memoryPerWorker)
%%
% It is very important to choose the appropriate matrix size for the
% cluster.  We can do this by specifying the amount of system memory in GB
% available to each worker as an input to this example function.  The
% default value is very conservative; you should specify a value that is
% appropriate for your system.
if nargin == 0
    memoryPerWorker = 8.00; % In GB
    warning('pctexample:backslashbench:BackslashBenchUsingDefaultMemory', ...
            ['Amount of system memory available to each worker is ', ...
             'not specified.  Using the conservative default value ', ...
             'of %.2f gigabytes per worker.'], memoryPerWorker);
end

%% Avoiding Overhead
% To get an accurate measure of our capability to solve linear systems, we
% need to remove any possible source of overhead.  This includes getting
% the current parallel pool and temporarily disabling the deadlock
% detection capabilities.
p = gcp;
if isempty(p)
    error('pctexample:backslashbench:poolClosed', ...
        ['This example requires a parallel pool. ' ...
         'Manually start a pool using the parpool command or set ' ...
         'your parallel preferences to automatically start a pool.']);
end
poolSize = p.NumWorkers;
pctRunOnAll 'mpiSettings(''DeadlockDetection'', ''off'');'

%% The Benchmarking Function
% We want to benchmark matrix left division (\), and not the cost of
% entering an |spmd| block, the time it takes to create a matrix, or other
% parameters.  We therefore separate the data generation from the solving
% of the linear system, and measure only the time it takes to do the
% latter.  We generate the input data using the 2-D block-cyclic
% codistributor, as that is the most effective distribution scheme for
% solving a linear system.  Our benchmarking then consists of measuring
% the time it takes all the workers to complete solving the linear system
% |A*x = b|.  Again, we try to remove any possible source of overhead.
    function [A, b] = getData(n)
        fprintf('Creating a matrix of size %d-by-%d.\n', n, n);
        spmd
            % Use the codistributor that usually gives the best
            % performance for solving linear systems.
            codistr = codistributor2dbc(codistributor2dbc.defaultLabGrid, ...
                                        codistributor2dbc.defaultBlockSize, ...
                                        'col');
            A = codistributed.rand(n, n, codistr);
            b = codistributed.rand(n, 1, codistr);
        end
    end
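%%
% As an aside, if you are curious about the distribution scheme the
% workers end up using, a minimal sketch such as the following (not part
% of the benchmark itself, and requiring an open parallel pool) prints
% each worker's process grid, block size, and orientation via the
% |LabGrid|, |BlockSize|, and |Orientation| properties of the
% |codistributor2dbc| object:
%
%   spmd
%       codistr = codistributor2dbc();  % Default lab grid and block size
%       fprintf('Lab grid: %d-by-%d, block size: %d, orientation: %s\n', ...
%               codistr.LabGrid(1), codistr.LabGrid(2), ...
%               codistr.BlockSize, codistr.Orientation);
%   end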
    function time = timeSolve(A, b)
        spmd
            tic;
            x = A\b; %#ok<NASGU> We don't need the value of x.
            time = gop(@max, toc); % Time for all workers to complete.
        end
        time = time{1};
    end

%% Choosing Problem Size
% Just like with a great number of other parallel algorithms, the
% performance of solving a linear system in parallel depends greatly on
% the matrix size.  Our _a priori_ expectations are therefore that the
% computations are:
%
% * Somewhat inefficient for small matrices
% * Quite efficient for large matrices
% * Inefficient if the matrices are too large to fit into system memory
% and the operating systems start swapping memory to disk
%
% It is therefore important to time the computations for a number of
% different matrix sizes to gain an understanding of what "small,"
% "large," and "too large" mean in this context.  Based on previous
% experiments, we expect:
%
% * "Too small" matrices to be of size 1000-by-1000
% * "Large" matrices to occupy slightly less than 45% of the memory
% available to each worker
% * "Too large" matrices to occupy 50% or more of the system memory
% available to each worker
%
% These are heuristics, and the precise values may change between
% releases.  It is therefore important that we use matrix sizes that span
% this entire range and verify the expected performance.
%
% Notice that by changing the problem size according to the number of
% workers, we employ weak scaling.  Other benchmarking examples, such as
% <docid:distcomp_examples.example-ex53144152 Simple Benchmarking of
% parfor Using Blackjack> and <docid:distcomp_examples.example-ex58868462
% Benchmarking Independent Jobs on the Cluster>, also employ weak scaling.
% As those examples benchmark task parallel computations, their weak
% scaling consists of making the number of iterations proportional to the
% number of workers.  This example, however, benchmarks data parallel
% computations, so we relate the upper size limit of the matrices to the
% number of workers.

% Declare the matrix sizes ranging from 1000-by-1000 up to 45% of system
% memory available to each worker.
maxMemUsagePerWorker = 0.45*memoryPerWorker*1024^3; % In bytes.
maxMatSize = round(sqrt(maxMemUsagePerWorker*poolSize/8));
matSize = round(linspace(1000, maxMatSize, 5));
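%%
% As a worked example (using the conservative default of 8 GB per worker
% and a hypothetical pool of 4 workers): 45% of 8 GB is 0.45*8*1024^3, or
% about 3.87e9 bytes per worker.  Across 4 workers that is roughly
% 1.55e10 bytes and, at 8 bytes per double-precision element, about
% 1.93e9 elements, so the largest matrix is approximately sqrt(1.93e9),
% or 43,963-by-43,963:
%
%   0.45*8*1024^3                     % ans = 3.8655e+09 bytes per worker
%   round(sqrt(0.45*8*1024^3*4/8))    % ans = 43963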
%% Comparing Performance: Gigaflops
% We use the number of floating point operations per second as our
% measure of performance because that allows us to compare the
% performance of the algorithm for different matrix sizes and different
% numbers of workers.  If we are successful in testing the performance of
% matrix left division for a sufficiently wide range of matrix sizes, we
% expect the performance graph to look similar to the following:
%
% <<../paralleldemo_backslash_bench_expected.png>>

%%
% By generating graphs such as these, we can answer questions such as:
%
% * Are the smallest matrices so small that we get poor performance?
% * Do we see a performance decrease when the matrix is so large that it
% occupies 45% of total system memory?
% * What is the best performance we can possibly achieve for a given
% number of workers?
% * For which matrix sizes do 16 workers perform better than 8 workers?
% * Is the system memory limiting the peak performance?
%
% Given a matrix size, the benchmarking function creates the matrix |A|
% and the right-hand side |b| once, and then solves |A\b| multiple times
% to get an accurate measure of the time it takes.  We use the floating
% point operation count of the HPC Challenge, so that for an n-by-n
% matrix, we count the floating point operations as
% |2/3*n^3 + 3/2*n^2|.
    function gflops = benchFcn(n)
        numReps = 3;
        [A, b] = getData(n);
        time = inf;
        % We solve the linear system a few times and calculate the
        % Gigaflops based on the best time.
        for itr = 1:numReps
            tcurr = timeSolve(A, b);
            if itr == 1
                fprintf('Execution times: %f', tcurr);
            else
                fprintf(', %f', tcurr);
            end
            time = min(tcurr, time);
        end
        fprintf('\n');
        flop = 2/3*n^3 + 3/2*n^2;
        gflops = flop/time/1e9;
    end
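%%
% For instance (a back-of-the-envelope check, not part of the benchmark),
% a 10,000-by-10,000 system counts as 2/3*1e12 + 3/2*1e8, or roughly
% 6.67e11 floating point operations; if the workers solve it in 10
% seconds, that corresponds to about 66.7 Gigaflops:
%
%   n = 10000;
%   flop = 2/3*n^3 + 3/2*n^2     % flop = 6.6682e+11
%   flop/10/1e9                  % ans  = 66.682 Gigaflops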
%% Executing the Benchmarks
% Having done all the setup, it is straightforward to execute the
% benchmarks.  However, the computations may take a long time to
% complete, so we print some intermediate status information as we
% complete the benchmarking for each matrix size.
fprintf(['Starting benchmarks with %d different matrix sizes ranging\n' ...
         'from %d-by-%d to %d-by-%d.\n'], ...
        length(matSize), matSize(1), matSize(1), matSize(end), ...
        matSize(end));
gflops = zeros(size(matSize));
for i = 1:length(matSize)
    gflops(i) = benchFcn(matSize(i));
    fprintf('Gigaflops: %f\n\n', gflops(i));
end
results.matSize = matSize;
results.gflops = gflops;

%% Plotting the Performance
% We can now plot the results, and compare to the expected graph shown
% above.
fig = figure;
ax = axes('parent', fig);
plot(ax, matSize/1000, gflops);
lines = ax.Children;
lines.Marker = '+';
ylabel(ax, 'Gigaflops')
xlabel(ax, 'Matrix size in thousands')
titleStr = sprintf(['Solving A\\b for different matrix sizes on ' ...
                    '%d workers'], poolSize);
title(ax, titleStr, 'Interpreter', 'none');

%%
% If the benchmark results are not as good as you might expect, here are
% some things to consider:
%
% * The underlying implementation uses ScaLAPACK, which has a proven
% reputation of high performance.  It is therefore very unlikely that the
% algorithm or the library is causing inefficiencies, but rather the way
% in which it is used, as described in the items below.
% * If the matrices are too small or too large for your cluster, the
% resulting performance will be poor.
% * If the network communications are slow, performance will be severely
% impacted.
% * If the CPUs and the network communications are both very fast, but
% the amount of memory is limited, it is possible you are not able to
% benchmark with sufficiently large matrices to fully utilize the
% available CPUs and network bandwidth.
% * For ultimate performance, it is important to use a version of MPI
% that is tailored for your networking setup, and have the workers
% running in such a manner that as much of the communication happens
% through shared memory as possible.  It is, however, beyond the scope of
% this example to explain how to identify and solve those types of
% problems.

%% Comparing Different Numbers of Workers
% We now look at how to compare different numbers of workers by viewing
% data obtained by running this example using different numbers of
% workers.  This data was obtained on a different cluster from the one
% above.
%
% Other examples, such as <docid:distcomp_examples.example-ex58868462
% Benchmarking Independent Jobs on the Cluster>, have explained that when
% benchmarking parallel algorithms for different numbers of workers, one
% usually employs weak scaling.  That is, as we increase the number of
% workers, we increase the problem size proportionally.  In the case of
% matrix left division, we have to take additional care because the
% performance of the division depends greatly on the size of the matrix.
% The following code creates a graph of the performance in Gigaflops for
% all of the matrix sizes that we tested with and all the different
% numbers of workers, as that gives us the most detailed picture of the
% performance characteristics of matrix left division _on this particular
% cluster_.
s = load('pctdemo_data_backslash.mat', 'workers4', 'workers8', ...
         'workers16', 'workers32', 'workers64');
fig = figure;
ax = axes('parent', fig);
plot(ax, s.workers4.matSize./1000, s.workers4.gflops, ...
     s.workers8.matSize./1000, s.workers8.gflops, ...
     s.workers16.matSize./1000, s.workers16.gflops, ...
     s.workers32.matSize./1000, s.workers32.gflops, ...
     s.workers64.matSize./1000, s.workers64.gflops);
lines = ax.Children;
set(lines, {'Marker'}, {'+'; 'o'; 'v'; '.'; '*'});
ylabel(ax, 'Gigaflops')
xlabel(ax, 'Matrix size in thousands')
title(ax, ['Comparison data for solving A\b on different ' ...
           'numbers of workers'], 'Interpreter', 'none');
legend('4 workers', '8 workers', '16 workers', '32 workers', ...
       '64 workers', 'location', 'NorthWest');

%%
% The first thing we notice when looking at the graph above is that 64
% workers allow us to solve much larger linear systems of equations than
% is possible with only 4 workers.  Additionally, we can see that even if
% one could work with a matrix of size 60,000-by-60,000 on 4 workers, we
% would get a performance of approximately only 10 Gigaflops.  Thus, even
% if the 4 workers had sufficient memory to solve such a large problem,
% 64 workers would nevertheless greatly outperform them.
%
% Looking at the slope of the curve for 4 workers, we can see that there
% is only a modest performance increase between the three largest matrix
% sizes.  Comparing this with the earlier graph of the expected
% performance of |A\b| for different matrix sizes, we conclude that we
% are quite close to achieving peak performance for 4 workers with a
% matrix size of 7772-by-7772.
%
% Looking at the curves for 8 and 16 workers, we can see that the
% performance drops for the largest matrix size, indicating that we are
% near or already have exhausted available system memory.  However, we
% see that the performance increase between the second and third largest
% matrix sizes is very modest, indicating stability of some sort.  We
% therefore conjecture that when working with 8 or 16 workers, we would
% most likely not see a significant increase in the Gigaflops if we
% increased the system memory and tested with larger matrix sizes.
%
% Looking at the curves for 32 and 64 workers, we see that there is a
% significant performance increase between the second and third largest
% matrix sizes.  For 64 workers, there is also a significant performance
% increase between the two largest matrix sizes.  We therefore conjecture
% that we ran out of system memory for 32 and 64 workers before we
% reached peak performance.  If that is correct, then adding more memory
% to the computers would both allow us to solve larger problems and
% perform better at those larger matrix sizes.

%% Speedup
% The traditional way of measuring the speedup obtained with linear
% algebra algorithms such as backslash is to compare the peak
% performance.  We therefore calculate the maximum number of Gigaflops
% achieved for each number of workers.
peakPerf = [max(s.workers4.gflops), max(s.workers8.gflops), ...
            max(s.workers16.gflops), max(s.workers32.gflops), ...
            max(s.workers64.gflops)];
disp('Peak performance in Gigaflops for 4-64 workers:')
disp(peakPerf)
disp('Speedup when going from 4 workers to 8, 16, 32 and 64 workers:')
disp(peakPerf(2:end)/peakPerf(1))
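%%
% A related quantity worth deriving from these numbers is the parallel
% efficiency, that is, the measured speedup divided by the ideal speedup.
% The following sketch computes it from |peakPerf| (the variable
% |nWorkers| is introduced here purely for illustration); for example,
% the speedup of roughly 13.5 on 64 workers discussed below corresponds
% to an efficiency of about 13.5/16, or 84%:
%
%   nWorkers   = [4 8 16 32 64];
%   speedup    = peakPerf/peakPerf(1);   % Relative to 4 workers
%   efficiency = speedup./(nWorkers/4)   % 1.0 means perfect scaling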
%%
% We therefore conclude that we get a speedup of approximately 13.5 when
% increasing the number of workers 16-fold, going from 4 workers to 64.
% As we noted above, the performance graph indicates that we might be
% able to increase the performance on 64 workers (and thereby improve the
% speedup even further) by increasing the system memory on the cluster
% computers.

%% The Cluster Used
% This data was generated using 16 dual-processor, octa-core computers,
% each with 64 GB of memory, connected with Gigabit Ethernet.  When using
% 4 workers, they were all on a single computer.  We used 2 computers for
% 8 workers, 4 computers for 16 workers, etc.

%% Re-enabling the Deadlock Detection
% Now that we have concluded our benchmarking, we can safely re-enable
% the deadlock detection in the current parallel pool.
pctRunOnAll 'mpiSettings(''DeadlockDetection'', ''on'');'
end
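%%
% To run the benchmark on your own cluster, call this function with the
% amount of system memory (in GB) available to each of your workers; the
% value of 16 below is just an illustrative choice:
%
%   results = paralleldemo_backslash_bench(16);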