%% Create Histograms Using MapReduce
% This example shows how to visualize patterns in a large data set without
% having to load all of the observations into memory simultaneously. It
% demonstrates how to compute lower volume summaries of the data that are
% sufficient to generate a graphic.
%
% Histograms are a common visualization technique that gives an empirical
% estimate of the probability density function (pdf) of a variable.
% Histograms are well-suited to a big data environment, because they can
% reduce the size of raw input data to a vector of counts. Each count is
% the number of observations that fall within one of a set of contiguous
% numeric intervals, or bins.
%
% The |mapreduce| function computes counts separately on multiple chunks of
% the data. Then |mapreduce| sums the counts from all chunks. The map
% function and reduce function are both extremely simple in this example.
% Nevertheless, you can build flexible visualizations with the summary
% information that they collect.

% Copyright 1984-2014 The MathWorks, Inc.

%% Prepare Data
% Create a datastore using the |airlinesmall.csv| data set. This
% 12-megabyte data set contains 29 columns of flight information for
% several airline carriers, including arrival and departure times. In this
% example, select |ArrDelay| (flight arrival delay) as the variable of
% interest.
ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = 'ArrDelay';

%%
% The datastore treats |'NA'| values as missing, and replaces the missing
% values with |NaN| values by default. Additionally, the
% |SelectedVariableNames| property allows you to work with only the
% selected variable of interest, which you can verify using |preview|.
preview(ds)

%% Run MapReduce
% The |mapreduce| function requires a map function and a reduce function as
% inputs. The mapper receives chunks of data and outputs intermediate
% results.
% The reducer reads the intermediate results and produces a final
% result.

%%
% In this example, the mapper collects the counts of flights with various
% amounts of arrival delay by accumulating the arrival delays into bins.
% The bins are defined by the fourth input argument to the map function,
% |edges|.

%%
% Display the map function file.
%
% <include>visualizationMapper.m</include>
%

%%
% The bin size of the histogram is important. Bins that are too wide can
% obscure important details in the data set. Bins that are too narrow can
% lead to a noisy histogram. When working with very large data sets, it is
% best to avoid making multiple passes over the data to try out different
% bin widths. A simple way to avoid making multiple passes is to collect
% counts with bins that are narrow. Then, to get wider bins, you can
% aggregate adjacent bin counts without reprocessing the raw data. The
% flight arrival delays are reported in 1-minute increments, so define
% 1-minute bins from -60 minutes to 599 minutes.
edges = -60:599;

%%
% Create an anonymous function to configure the map function to use the bin
% edges. The anonymous function allows you to specialize the map function
% by specifying a particular value for its fourth input argument. Then, you
% can call the map function via the anonymous function, using only the
% three input arguments that the |mapreduce| function expects.
ourVisualizationMapper = ...
    @(data, info, intermKVstore) visualizationMapper(data, info, intermKVstore, edges);

%%
% Display the reduce function file. The reducer sums the counts stored by
% the mapper.
%
% <include>visualizationReducer.m</include>
%

%%
% Use |mapreduce| to apply the map and reduce functions to the datastore,
% |ds|.
result = mapreduce(ds, ourVisualizationMapper, @visualizationReducer);

%%
% |mapreduce| returns an output datastore, |result|, with files in
% the current folder.

%% Organize Results
% Read the final bin count results from the output datastore.
r = readall(result);
counts = r.Value{1};

%% Visualize Results
% Plot the raw bin counts using the whole range of the data (apart from a
% few outliers excluded by the mapper).
bar(edges, counts, 'hist');
title('Distribution of Flight Delay')
xlabel('Arrival Delay (min)')
ylabel('Flight Counts')

%%
% The histogram has long tails. Look at a restricted bin range to better
% visualize the delay distribution of the majority of flights. Zooming in a
% bit reveals there is a reporting artifact; it is common to round delays
% to 5-minute increments.
xlim([-50,50]);
grid on
grid minor

%%
% Smooth the counts with a moving average filter to remove the 5-minute
% recording artifact.
smoothCounts = filter( (1/5)*ones(1,5), 1, counts);
figure
bar(edges, smoothCounts, 'hist')
xlim([-50,50]);
title('Distribution of Flight Delay')
xlabel('Arrival Delay (min)')
ylabel('Flight Counts')
grid on
grid minor

%%
% To give the graphic a better balance, do not display the top 1% of
% most-delayed flights. You can tailor the visualization in many ways
% without reprocessing the complete data set, assuming that you collected
% the appropriate information during the full pass through the data.
empiricalCDF = cumsum(counts);
empiricalCDF = empiricalCDF / empiricalCDF(end);
quartile99 = find(empiricalCDF>0.99, 1, 'first');
low99 = 1:quartile99;

figure
empiricalPDF = smoothCounts(low99) / sum(smoothCounts);
bar(edges(low99), empiricalPDF, 'hist');

xlim([-60,edges(quartile99)]);
ylim([0, max(empiricalPDF)*1.05]);
title('Distribution of Flight Delay')
xlabel('Arrival Delay (min)')
ylabel('Probability Density')
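
%% Aggregate Counts into Wider Bins
% As noted earlier, wider bins can be obtained by summing adjacent 1-minute
% bin counts, with no further pass over the raw data. The following is a
% minimal sketch and is not part of the original example. The variable
% names |binWidth|, |nGroups|, |trimmed|, |wideCounts|, and |wideEdges| are
% hypothetical. The code assumes that |counts| is a numeric vector with one
% element per entry of |edges|. Any counts left over after the last
% complete group of bins are dropped.
binWidth = 5;                                    % assumed wide-bin width, in minutes
nGroups = floor(numel(counts) / binWidth);       % number of complete groups of bins
trimmed = counts(1:nGroups*binWidth);            % drop the incomplete trailing group
wideCounts = sum(reshape(trimmed, binWidth, nGroups), 1);  % one sum per group
wideEdges = edges(1:binWidth:nGroups*binWidth);  % first 1-minute edge of each group

figure
bar(wideEdges, wideCounts, 'hist')
title('Distribution of Flight Delay (5-Minute Bins)')
xlabel('Arrival Delay (min)')
ylabel('Flight Counts')

%%
% Changing |binWidth| and rerunning only this section gives a different
% bin width from the same stored counts, which is the reason for
% collecting narrow bins during the single pass through the data.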