%% Simple Data Subsetting Using MapReduce
% This example shows how to extract a subset of a large data set.
%
% There are two aspects of subsetting, or performing a query. One is
% selecting a subset of the variables (columns) in the data set.
% The other is selecting a subset of the observations, or rows.
%
% In this example, the selection of variables takes place in the definition
% of the datastore. (The mapper function could perform a further
% sub-selection of variables, but that is not within the scope of this
% example.) In this example, the role of the mapper function is to perform
% the selection of observations. The role of the reducer function is to
% concatenate the subsetted records extracted by each call to the mapper
% function. This approach assumes that the data set can fit in memory after
% the Map phase.

% Copyright 1984-2014 The MathWorks, Inc.

%% Prepare Data
% Create a datastore using the |airlinesmall.csv| data set. This 12
% megabyte data set contains 29 columns of flight information for several
% airline carriers, including arrival and departure times. This example
% uses 15 variables out of the 29 variables available in the data.
ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = ds.VariableNames([1 2 5 9 12 13 15 16 17 ...
    18 20 21 25 26 27]);
ds.SelectedVariableNames

%%
% |tabularTextDatastore| returns a |TabularTextDatastore| object for the data. This
% datastore treats |'NA'| strings as missing, and replaces the missing
% values with |NaN| values by default. Additionally, the
% |SelectedVariableNames| property allows you to work with only the
% specified variables of interest, which you can verify using |preview|.
preview(ds)

%% Run MapReduce
% The |mapreduce| function requires a mapper function and a reducer
% function. The mapper function receives chunks of data and outputs
% intermediate results. The reducer function reads the intermediate results
% and produces a final result.
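%%
% As a rough in-memory illustration of these two roles (a sketch only, not
% the mapper and reducer files that this example uses), the lines below
% apply a row filter to two small hand-made tables, which stand in for the
% chunks a mapper would receive, and then concatenate the filtered pieces,
% as a reducer would. The toy tables, the |keepRows| helper, and the filter
% criterion are assumptions made purely for illustration.
toyChunk1 = table([1990; 1996], [100; 400], [90; 100], ...
    'VariableNames', {'Year', 'ActualElapsedTime', 'CRSElapsedTime'});
toyChunk2 = table([2001; 2005], [130; 700], [120; 200], ...
    'VariableNames', {'Year', 'ActualElapsedTime', 'CRSElapsedTime'});

% Hypothetical row filter: keep flights after 1994 whose actual elapsed
% time is more than 2.5 times the scheduled elapsed time.
keepRows = @(t) t(t.Year > 1994 & ...
    t.ActualElapsedTime > 2.5 * t.CRSElapsedTime, :);

% "Map" step: filter each chunk independently.
toySubset1 = keepRows(toyChunk1);
toySubset2 = keepRows(toyChunk2);

% "Reduce" step: vertically concatenate the filtered chunks.
toyCombined = [toySubset1; toySubset2]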
%%
% In this example, the mapper function receives a table with the variables
% described by the |SelectedVariableNames| property in the datastore. Then,
% the mapper function extracts flights that had a high amount of delay
% after pushback from the gate. Specifically, it identifies flights whose
% actual duration exceeds 2.5 times the scheduled duration. The
% mapper function ignores flights prior to 1995, because some of the
% variables of interest for this example were not collected before that
% year.

%%
% Display the mapper function file.
type subsettingMapper.m

%%
% The reducer function receives the subsetted observations obtained from
% the mapper function and simply concatenates them into a single table. The
% reducer function returns one key (which is relatively meaningless) and
% one value (the concatenated table).

%%
% Display the reducer function file.
type subsettingReducer.m

%%
% Use |mapreduce| to apply the mapper and reducer functions to the
% datastore, |ds|.
result = mapreduce(ds, @subsettingMapper, @subsettingReducer);

%%
% |mapreduce| returns an output datastore, |result|, with files in
% the current folder.

%% Display Results
% Look for patterns in the first 10 variables that were pulled
% from the data set. These variables identify the airline and the origin
% and destination airports, as well as some basic delay information.
r = readall(result);
tbl = r.Value{1};
tbl(:,1:10)

%%
% Looking at the first record, a US Airways flight departed the gate 14
% minutes after its scheduled departure time and arrived 118 minutes late.
% The flight experienced a delay of 104 minutes after pushback from the
% gate, which is the difference between |ActualElapsedTime| and
% |CRSElapsedTime|.
%
% There is one anomalous record. In February of 2006, a JetBlue flight had
% a departure time of 3:24 a.m. and an elapsed flight time of 1650 minutes,
% but an arrival delay of only 415 minutes. This might be a data entry
% error.
%
% Otherwise, there are no clear-cut patterns concerning when and where
% these exceptionally delayed flights occur. No airline, time of year, time
% of day, or single airport dominates. Some intuitive patterns, such as
% O'Hare (ORD) in the winter months, are certainly present.

%% Delay Patterns
% Beginning in 1995, the airline system performance data began including
% measurements of how much delay took place in the taxi phases of a flight.
% Then, in 2003, the data also began to include certain causes of delay.

%%
% Examine the taxi and delay-cause variables in closer detail.
tbl(:,[1,7,8,11:end])

%%
% For these exceptionally delayed flights, the great majority of delay
% occurs during taxi out, on the tarmac. Moreover, the major cause of the
% delay is _NASDelay_. NAS delays are holds imposed by the national
% aviation authorities on departures headed for an airport that is forecast
% to be unable to handle all scheduled arrivals at the time the flight is
% scheduled to arrive. NAS delay programs in effect at any given time are
% posted at http://www.fly.faa.gov/ois/.
%
% Preferably, when NAS delays are imposed, boarding of the aircraft is
% simply delayed. Such a delay would show up as a departure delay. However,
% for most of the flights selected for this example, the delays took place
% largely after departure from the gate, leading to a taxi delay.

%% Rerun MapReduce
% The previous mapper function had the subsetting criteria hard-wired in
% the function file. A new mapper function would have to be written for any
% new query, such as flights departing San Francisco on a given day.
%
% A generic mapper can be more adaptive by separating out the subsetting
% criteria from the mapper function definition and using an anonymous
% function to configure the mapper function for each query. This generic
% mapper function uses a fourth input argument that supplies the desired
% query variable.

%%
% Display the generic mapper function file.
type subsettingMapperGeneric.m

%%
% Create an anonymous function that performs the same selection of rows
% that is hard-coded in |subsettingMapper.m|.
inFlightDelay150percent = @(data) data.Year > 1994 & ...
    (data.ActualElapsedTime - data.CRSElapsedTime) > ...
    1.50 * data.CRSElapsedTime;

%%
% Since the |mapreduce| function requires the mapper and reducer functions
% to accept exactly three inputs, use another anonymous function to specify
% the fourth input to the mapper function, |subsettingMapperGeneric.m|.
% Subsequently, you can use this anonymous function to call
% |subsettingMapperGeneric.m| using only three arguments (the fourth is
% implicit).
configuredMapper = ...
    @(data, info, intermKVStore) subsettingMapperGeneric(...
    data, info, intermKVStore, inFlightDelay150percent);

%%
% Use |mapreduce| to apply the generic mapper function to the input
% datastore.
result2 = mapreduce(ds, configuredMapper, @subsettingReducer);

%%
% |mapreduce| returns an output datastore, |result2|, with files in
% the current folder.

%% Verify Results
% Confirm that the generic mapper gets the same result as with the
% hard-wired subsetting logic.
r2 = readall(result2);
tbl2 = r2.Value{1};

if isequaln(tbl, tbl2)
    disp('Same results with the configurable mapper.')
else
    disp('Oops, back to the drawing board.')
end
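%%
% The same configuration pattern should extend to other queries. As a
% hedged sketch, the lines below define a second, hypothetical query
% (flights departing San Francisco in January 2006) and check it on a small
% hand-made table before wrapping it in a three-argument mapper. The names
% |sfoJan2006|, |toyFlights|, and |configuredMapperSFO| are assumptions for
% illustration, as are the |Month| variable name and the use of |strcmp|,
% which assumes |Origin| is read as a cell array of character vectors. The
% query filters by month rather than by day because the third variable of
% the data set is not among those selected for this example.
sfoJan2006 = @(data) data.Year == 2006 & data.Month == 1 & ...
    strcmp(data.Origin, 'SFO');

% Check the query on a toy table; only the second row should match.
toyFlights = table([1994; 2006; 2006], [1; 1; 1], {'SFO'; 'SFO'; 'ORD'}, ...
    'VariableNames', {'Year', 'Month', 'Origin'});
toyMask = sfoJan2006(toyFlights)

% Bind the query as the fourth argument, mirroring |configuredMapper|.
configuredMapperSFO = ...
    @(data, info, intermKVStore) subsettingMapperGeneric(...
    data, info, intermKVStore, sfoJan2006);

% Uncomment to run the query against the datastore (not run here):
% result3 = mapreduce(ds, configuredMapperSFO, @subsettingReducer);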