%% Compute Mean by Group Using MapReduce
% This example shows how to compute the mean by group in a data set using
% |mapreduce|. It demonstrates how to do computations on subgroups of data.

% Copyright 1984-2014 The MathWorks, Inc.

%% Prepare Data
% Create a datastore using the |airlinesmall.csv| data set. This
% 12-megabyte data set contains 29 columns of flight information for
% several airline carriers, including arrival and departure times. In this
% example, select |DayOfWeek| and |ArrDelay| (flight arrival delay) as the
% variables of interest.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'ArrDelay', 'DayOfWeek'};

%%
% The datastore treats |'NA'| values as missing, and replaces the missing
% values with |NaN| values by default. Additionally, the
% |SelectedVariableNames| property allows you to work with only the
% selected variables of interest, which you can verify using |preview|.
preview(ds)

%% Run MapReduce
% The |mapreduce| function requires a map function and a reduce function as
% inputs. The mapper receives chunks of data and outputs intermediate
% results. The reducer reads the intermediate results and produces a final
% result.

%% 
% In this example, the mapper computes the count and sum of arrival delays
% for each day of the week in each chunk of data, and then stores the
% results as intermediate key-value pairs. The keys are integers (1 to 7)
% representing the days of the week, and each value is a two-element
% vector holding the count and the sum of the delays for that day.
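
%%
% As a rough sketch of the per-chunk arithmetic (an illustration with
% made-up values, not the shipped |meanArrivalDelayByDayMapper.m| file),
% the following computes one count-and-sum pair per day for a tiny
% hypothetical chunk. The names |chunk|, |valid|, and |countAndSum| are
% assumptions introduced here, and skipping |NaN| delays is assumed.
chunk = table([5; NaN; -3; 10], [1; 1; 2; 2], ...
              'VariableNames', {'ArrDelay', 'DayOfWeek'});
valid  = ~isnan(chunk.ArrDelay);                  % ignore missing delays
[days, ~, idx] = unique(chunk.DayOfWeek(valid));  % keys found in the chunk
counts = accumarray(idx, 1);                      % flights per day
sums   = accumarray(idx, chunk.ArrDelay(valid));  % total delay per day
countAndSum = [counts, sums]                      % row k belongs to days(k)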

%%
% Display the map function file.
%
% <include>meanArrivalDelayByDayMapper.m</include>
%

%%
% After the Map phase, |mapreduce| groups the intermediate key-value pairs
% by unique key (in this case, day of the week). Thus, each call to the
% reducer works on the values associated with one day of the week. The
% reducer receives a list of the intermediate counts and sums of delays for
% the day specified by the input key (|intermKey|) and adds them up into
% the total count, |n|, and the total sum, |s|. The reducer then computes
% the overall mean, |s/n|, and adds one final key-value pair to the
% output. This key-value pair represents the mean flight arrival delay for
% one day of the week.
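
%%
% The reducer's arithmetic can be sketched the same way. The values below
% are hypothetical |[count, sum]| pairs for one day, one row per chunk;
% they are not taken from the airline data. Carrying counts and sums,
% rather than per-chunk means, is what lets chunks of different sizes
% combine into the exact overall mean.
intermVals = [3 30; 2 -4; 5 24];    % hypothetical [count, sum] rows
n = sum(intermVals(:, 1));          % total count across chunks
s = sum(intermVals(:, 2));          % total sum of delays across chunks
overallMean = s / n                 % 50 / 10 = 5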

%%
% Display the reduce function file.
%
% <include>meanArrivalDelayByDayReducer.m</include>
%

%%
% Use |mapreduce| to apply the map and reduce functions to the datastore,
% |ds|.
meanDelayByDay = mapreduce(ds, @meanArrivalDelayByDayMapper, ...
                               @meanArrivalDelayByDayReducer);

%%
% |mapreduce| returns an output datastore, |meanDelayByDay|, whose files
% are written to the current folder by default.

%%
% Read the final result from the output datastore, |meanDelayByDay|.
result = readall(meanDelayByDay)
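
%%
% Because |airlinesmall.csv| is small enough to fit in memory, the result
% can be cross-checked without |mapreduce|. This check is an addition to
% the example; |allData|, |valid|, and |checkMeans| are names introduced
% here, and the sketch assumes that missing delays are excluded and that
% all seven days occur in the data. Row k of |checkMeans| is the mean
% delay for day k.
reset(ds);                                   % start reading from the top
allData = readall(ds);
valid = ~isnan(allData.ArrDelay);            % skip missing delays
checkMeans = accumarray(allData.DayOfWeek(valid), ...
                        allData.ArrDelay(valid), [7 1], @mean)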

%% Organize Results
% The integer keys (1 to 7) represent the days of the week. To make the
% results easier to read, convert the keys to a categorical array, extract
% the numeric values from the single-element cells, and rename the table
% variables.
result.Key = categorical(result.Key, 1:7, ...
               {'Mon','Tue','Wed','Thu','Fri','Sat','Sun'});
result.Value = cell2mat(result.Value);
result.Properties.VariableNames = {'DayOfWeek', 'MeanArrDelay'}
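
%%
% The same three steps can be tried on a small stand-in table. The keys
% and values below are hypothetical and only mimic the assumed layout of
% the |readall| output (numeric keys, scalar values wrapped in cells).
demo = table([7; 1; 5], {4.5; 7.1; 9.3}, ...
             'VariableNames', {'Key', 'Value'});
demo.Key = categorical(demo.Key, 1:7, ...
             {'Mon','Tue','Wed','Thu','Fri','Sat','Sun'});
demo.Value = cell2mat(demo.Value);   % 3-by-1 cell -> 3-by-1 double
demo.Properties.VariableNames = {'DayOfWeek', 'MeanArrDelay'}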

%%
% Sort the rows of the table by mean flight arrival delay. In this data
% set, Saturday has the lowest mean arrival delay, making it the best day
% of the week to travel, whereas Friday has the highest.
result = sortrows(result,'MeanArrDelay')