%% Compute Summary Statistics by Group Using MapReduce
% This example shows how to compute summary statistics organized by group
% using |mapreduce|. It demonstrates the use of an anonymous function to
% pass an extra grouping parameter to a parameterized map function. This
% parameterization allows you to quickly recalculate the statistics using a
% different grouping variable.

% Copyright 1984-2014 The MathWorks, Inc.

%% Prepare Data
% Create a datastore using the |airlinesmall.csv| data set. This
% 12-megabyte data set contains 29 columns of flight information for
% several airline carriers, including arrival and departure times. For this
% example, select |Month|, |UniqueCarrier| (airline carrier ID), and
% |ArrDelay| (flight arrival delay) as the variables of interest.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'Month', 'UniqueCarrier', 'ArrDelay'};

%%
% The datastore treats |'NA'| values as missing, and replaces the missing
% values with |NaN| values by default. Additionally, the
% |SelectedVariableNames| property allows you to work with only the
% selected variables of interest, which you can verify using |preview|.
preview(ds)

%% Run MapReduce
% The |mapreduce| function requires a map function and a reduce function as
% inputs. The mapper receives chunks of data and outputs intermediate
% results. The reducer reads the intermediate results and produces a final
% result.
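
%%
% As a rough illustration of what the mapper receives (a sketch, not part
% of the original example), read one chunk from the datastore directly.
% This assumes that |read| returns the next chunk as a table holding only
% the selected variables. The name |chunk| is made up for this sketch, and
% |reset| returns the datastore to its start afterward.
chunk = read(ds);   % one chunk of rows, assumed to arrive as a table
size(chunk)         % number of rows in the chunk and number of variables
reset(ds);          % rewind so that later steps see all of the data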

%% 
% In this example, the mapper computes the grouped statistics for
% each chunk of data and stores the statistics as intermediate key-value
% pairs. Each intermediate key-value pair has a key for the group level and
% a cell array of values with the corresponding statistics.
%
% This map function accepts four input arguments, whereas |mapreduce|
% calls the map function with exactly three input arguments: the chunk of
% data, an info structure, and the intermediate key-value store. The call
% to |mapreduce| (below) shows how to supply the extra parameter.
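
%%
% A minimal sketch of the adapter pattern described above, using
% hypothetical stand-in functions rather than the actual mapper. The
% anonymous function |threeArgs| fixes the fourth argument of |fourArgs|,
% so a caller that supplies only three inputs can still use it.
fourArgs = @(a,b,c,groupName) sprintf('%s: %d', groupName, a + b + c);
threeArgs = @(a,b,c) fourArgs(a,b,c,'UniqueCarrier');  % fourth input fixed
label = threeArgs(1,2,3)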

%%
% Display the map function file.
%
% <include>statsByGroupMapper.m</include>
%

%%
% After the Map phase, |mapreduce| groups the intermediate key-value pairs
% by unique key (in this case, the airline carrier ID), so each call to the
% reduce function works on the values associated with one airline. The
% reducer receives a list of the intermediate statistics for the airline
% specified by the input key (|intermKey|) and combines the statistics into
% separate vectors: |n|, |m|, |v|, |s|, and |k|. Then, the reducer uses
% these vectors to calculate the count, mean, variance, skewness, and
% kurtosis for a single airline. The final key is the airline carrier ID,
% and the associated values are stored in a structure with five fields.
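
%%
% The idea of combining per-chunk statistics can be sketched on two small
% in-memory chunks. This illustration pools only counts, means, and
% population variances. It is not the code in |statsByGroupReducer.m|, and
% the data and the names |x1|, |x2|, |nAll|, |mAll|, and |vAll| are made up
% for this sketch.
x1 = [1 2 3 4];                           % first hypothetical chunk
x2 = [5 6 7 8 9 10];                      % second hypothetical chunk
n = [numel(x1); numel(x2)];               % per-chunk counts
m = [mean(x1); mean(x2)];                 % per-chunk means
v = [var(x1,1); var(x2,1)];               % per-chunk variances (divide by n)
nAll = sum(n);                            % combined count
mAll = sum(n.*m)/nAll;                    % combined mean
vAll = sum(n.*(v + (m - mAll).^2))/nAll;  % combined population variance
% Differences from statistics computed on all the data at once should be
% zero up to floating-point round-off.
[mAll - mean([x1 x2]), vAll - var([x1 x2],1)]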

%%
% Display the reduce function file.
%
% <include>statsByGroupReducer.m</include>
%

%%
% Use |mapreduce| to apply the map and reduce functions to the datastore,
% |ds|. Since the parameterized map function accepts four inputs, use an
% anonymous function to pass in the name of the grouping variable,
% |'UniqueCarrier'|, as the fourth input.
outds1 = mapreduce(ds, ...
    @(data,info,kvs)statsByGroupMapper(data,info,kvs,'UniqueCarrier'), ...
    @statsByGroupReducer);

%%
% |mapreduce| returns an output datastore, |outds1|, whose files are
% written to the current folder.

%%
% Read the final results from the output datastore.
r1 = readall(outds1)

%% Organize Results
% To organize the results better, convert the structure containing the
% statistics into a table and use the carrier IDs as the row names.
% |mapreduce| returns the key-value pairs in the same order as they were
% added by the reduce function, so sort the table by carrier ID.
statsByCarrier = struct2table(cell2mat(r1.Value), 'RowNames', r1.Key);
statsByCarrier = sortrows(statsByCarrier, 'RowNames')

%% Change Grouping Parameter
% The use of an anonymous function to pass in the grouping variable allows
% you to quickly recalculate the statistics with a different grouping.
%
% For this example, recalculate the statistics and group the results by
% |Month|, instead of by the carrier IDs, by passing |'Month'| as the
% grouping variable name in the anonymous function.
outds2 = mapreduce(ds, ...
    @(data,info,kvs)statsByGroupMapper(data,info,kvs,'Month'), ...
    @statsByGroupReducer);

%%
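% Read the final results, sort them by month number, and organize them
% into a table. Assigning the 12 month abbreviations as row names assumes
% that every month appears in the data.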
r2 = readall(outds2);
r2 = sortrows(r2,'Key');
statsByMonth = struct2table(cell2mat(r2.Value));
mon = {'Jan','Feb','Mar','Apr','May','Jun', ...
       'Jul','Aug','Sep','Oct','Nov','Dec'};
statsByMonth.Properties.RowNames = mon
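
%%
% Because only the grouping variable name changes between runs, the
% anonymous function can itself be produced by a small factory. This is a
% sketch: |makeMapper| and |monthMapper| are hypothetical names that are
% not part of the original example, and the final call is left commented
% out so that it does not start another |mapreduce| run.
makeMapper = @(groupVar) ...
    @(data,info,kvs)statsByGroupMapper(data,info,kvs,groupVar);
monthMapper = makeMapper('Month');   % equivalent to the mapper for outds2
% outds3 = mapreduce(ds, monthMapper, @statsByGroupReducer);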