%% Analyze Big Data in MATLAB Using Tall Arrays
% This example shows how to use tall arrays to work with big data in
% MATLAB®. You can use tall arrays to perform a variety of calculations
% on different types of data that do not fit in memory. These calculations
% include basic operations as well as machine learning algorithms in
% Statistics and Machine Learning Toolbox™.
% 
% This example operates on a small subset of data on a single computer, and
% then scales up to analyze the entire data set. However, this
% analysis technique can scale up even further to work on data sets that
% are so large they cannot be read into memory, or to work on systems like
% Apache Spark™.

%% Introduction to Tall Arrays
% Tall arrays and tall tables are used to work with out-of-memory data that
% has any number of rows. Instead of writing specialized code that takes
% into account the huge size of the data, tall arrays and tables let you
% work with large data sets in a manner similar to in-memory MATLAB®
% arrays. The difference is that |tall| arrays typically remain unevaluated
% until you request that the calculations be performed.
% 
% This deferred evaluation enables MATLAB to combine the queued
% calculations where possible and take the minimum number of passes through
% the data. Since the number of passes through the data greatly affects
% execution time, it is recommended that you request output only when
% necessary.

%% Create |datastore| for Collection of Files
% Creating a |datastore| enables you to access a collection of data. A
% |datastore| can process arbitrarily large amounts of data, and the data
% can even be spread across multiple files in multiple folders. You can
% create a |datastore| for a collection of tabular text files (demonstrated
% here), spreadsheets, images, a SQL database (Database Toolbox™
% required) or Hadoop® sequence files.
% 
% Create a |datastore| for a |.csv| file containing airline data. Treat
% |'NA'| values as missing so that |datastore| replaces them with |NaN|
% values. Select the variables of interest, and specify a categorical data
% type for the |Origin| and |Dest| variables. Preview the contents.
ds = datastore('airlinesmall.csv');
ds.TreatAsMissing = 'NA';
ds.SelectedVariableNames = {'Year','Month','ArrDelay','DepDelay','Origin','Dest'};
ds.SelectedFormats(5:6) = {'%C','%C'};
pre = preview(ds)
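
%%
% As a minimal sketch of how |TreatAsMissing| and |SelectedFormats| behave,
% the following builds a tiny stand-in file instead of |airlinesmall.csv|.
% The file contents and the names |exFile|, |exFid|, |exDs|, and |exPre|
% are illustrative assumptions, not part of the airline data set. The
% |'NA'| entry should be read as |NaN|, and |Origin| as categorical.
exFile = [tempname '.csv'];                 % temporary file, illustrative
exFid = fopen(exFile,'w');
fprintf(exFid,'Year,ArrDelay,Origin\n');
fprintf(exFid,'1987,8,BOS\n1988,-3,LAX\n1989,12,BOS\n1990,NA,JFK\n');
fclose(exFid);
exDs = datastore(exFile,'TreatAsMissing','NA');   % 'NA' counts as missing
exDs.SelectedFormats(3) = {'%C'};           % read Origin as categorical
exPre = preview(exDs)
delete(exFile)                              % clean up the temporary file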

%% Create Tall Array
% Tall arrays are similar to in-memory MATLAB arrays, except that they can
% have any number of rows. Tall arrays can contain data that is numeric,
% logical, datetime, duration, calendarDuration, categorical, or strings.
% Also, you can convert an in-memory array |A| to a tall array with
% |tall(A)|. (The in-memory array |A| must be one of the supported data
% types.)
% 
% The underlying class of a tall array is based on the type of datastore
% that backs it. For example, if the datastore |ds| contains tabular data,
% then |tall(ds)| returns a tall table containing the data.
tt = tall(ds)

%% 
% The display indicates the underlying data type and includes the first
% several rows of data. The size of the table displays as "Mx6" to indicate
% that MATLAB does not yet know how many rows of data there are. 
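
%%
% As a minimal sketch of converting an in-memory array to a tall array
% (the names |exA|, |exTallA|, |exIsTall|, and |exClass| are illustrative
% only), create a tall array from a small matrix and query its underlying
% class. The call to |gather| is included in case the class query is
% itself deferred.
exA = magic(4);                        % 4-by-4 in-memory double matrix
exTallA = tall(exA);                   % tall array backed by in-memory data
exIsTall = istall(exTallA)             % true for tall arrays
exClass = gather(classUnderlying(exTallA))   % class of the stored data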

%% Perform Calculations on Tall Arrays
% You can work with tall arrays and tall tables in much the same way that
% you work with in-memory MATLAB arrays and tables.
% 
% One important aspect of tall arrays is that as you work with them, MATLAB
% does not perform most operations immediately. These operations appear to
% execute quickly, because the actual computation is deferred until you
% specifically request output. This deferred evaluation is important
% because even a simple command like |size(X)| executed on a tall array
% with a billion rows is not a quick calculation.
% 
% As you work with tall arrays, MATLAB keeps track of all of the operations
% to be carried out and optimizes the number of passes through the data.
% Thus, it is normal to work with unevaluated tall arrays and request
% output only when you require it. MATLAB does not know the contents or
% size of unevaluated tall arrays until you request that the array be
% evaluated and displayed.
% 
% Calculate the mean departure delay.
mDep = mean(tt.DepDelay,'omitnan')

%% Gather Results into Workspace
% The benefit of deferred evaluation is that when the time comes for MATLAB
% to perform the calculations, it is often possible to combine the
% operations in such a way that the number of passes through the data is
% minimized. So, even if you perform many operations, MATLAB only makes
% extra passes through the data when absolutely necessary.
% 
% The |gather| function forces evaluation of all queued operations and
% brings the resulting output back into memory. Since |gather| returns the
% _entire_ result in MATLAB, you should make sure that the result will fit
% in memory. For example, use |gather| on tall arrays that are the result
% of a function that reduces the size of the tall array, such as |sum|,
% |min|, |mean|, and so on.
% 
% Use |gather| to calculate the mean departure delay and bring the answer
% into memory. This calculation requires a single pass through the data,
% but other calculations might require several passes through the data.
% MATLAB determines the optimal number of passes for the calculation and
% displays this information at the command line.
mDep = gather(mDep)
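
%%
% As a minimal sketch of requesting several results at once (the names
% |exX|, |exSum|, |exMax|, and |exMean| are illustrative and use a small
% in-memory vector instead of the airline data), queue three reductions
% and pass them to a single |gather| call. Requesting the outputs together
% gives MATLAB the opportunity to share passes through the data instead of
% evaluating each result separately.
exX = tall((1:1000)');        % tall column vector backed by in-memory data
exSum = sum(exX);             % queued, not yet evaluated
exMax = max(exX);             % queued, not yet evaluated
exMean = mean(exX);           % queued, not yet evaluated
[exSum,exMax,exMean] = gather(exSum,exMax,exMean)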

%% Select Subset of Tall Array
% You can extract values from a tall array by subscripting or indexing. You
% can index the array starting from the top or bottom, or by using a
% logical index. The functions |head| and |tail| are useful alternatives to
% indexing, enabling you to explore the first and last portions of a tall
% array. Gather both variables at the same time to avoid extra passes
% through the data.
h = head(tt);
tl = tail(tt);
[h,tl] = gather(h,tl)

%% 
% Use |head| to select a subset of 10,000 rows from the data for
% prototyping code before scaling to the full data set.
ttSubset = head(tt,10000);

%% Select Data by Condition
% You can use typical logical operations on tall arrays, which are useful
% for selecting relevant data or removing outliers with logical indexing.
% The logical expression creates a tall logical vector, which then is used
% to subscript, identifying the rows where the condition is true.
% 
% Select only the flights out of Boston by comparing the
% elements of the categorical variable |Origin| to the value |'BOS'|.
idx = (ttSubset.Origin == 'BOS');
bosflights = ttSubset(idx,:)

%% 
% You can use the same indexing technique to remove rows with missing data
% or NaN values from the tall array.
idx = any(ismissing(ttSubset),2); 
ttSubset(idx,:) = [];
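
%%
% As a minimal sketch of both indexing steps on data small enough to check
% by eye (the table contents and the names |exT|, |exTT|, |exBad|, and
% |exBos| are illustrative assumptions), remove the row with a missing
% delay and then keep only the rows whose origin is |'BOS'|.
exT = table([5;NaN;20;-3],categorical({'BOS';'LAX';'BOS';'JFK'}), ...
    'VariableNames',{'DepDelay','Origin'});
exTT = tall(exT);                      % tall table backed by in-memory data
exBad = any(ismissing(exTT),2);        % tall logical: rows with missing data
exTT(exBad,:) = [];                    % drop the row containing NaN
exBos = exTT(exTT.Origin == 'BOS',:);  % keep flights out of Boston
exBos = gather(exBos)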

%% Determine Largest Delays
% Due to the nature of big data, sorting all of the data using traditional
% methods like |sort| or |sortrows| is inefficient. However, the |topkrows|
% function for tall arrays returns the top |k| rows in sorted order.
% 
% Calculate the top 10 greatest departure delays. 
biggestDelays = topkrows(ttSubset,10,'DepDelay');
biggestDelays = gather(biggestDelays)
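
%%
% As a minimal sketch of how the sort direction affects |topkrows| (the
% data and the names |exDelays|, |exLargest|, and |exSmallest| are
% illustrative only), select the two largest and the two smallest delays
% from a small tall table. The default direction is descending, and
% |'ascend'| returns the smallest values instead.
exDelays = tall(table([12;-5;240;37;0],'VariableNames',{'DepDelay'}));
exLargest = topkrows(exDelays,2,'DepDelay');             % descending
exSmallest = topkrows(exDelays,2,'DepDelay','ascend');   % ascending
[exLargest,exSmallest] = gather(exLargest,exSmallest)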

%% Visualize Data in Tall Arrays 
% Plotting every point in a big data set is not feasible. For that reason,
% visualization of tall arrays involves reducing the number of data points
% using sampling or binning.
%
% Visualize the number of flights per year with a histogram. The
% visualization functions pass through the data and immediately evaluate
% the solution when you call them, so |gather| is not required.
histogram(ttSubset.Year,'BinMethod','integers')
xlabel('Year')
ylabel('Number of Flights')
title('Number of Flights by Year, 1987 - 1989')

%% Scale to Entire Data Set
% Instead of using the smaller data returned from |head|, you can scale up
% to perform the calculations on the entire data set by using the results
% from |tall(ds)|.
tt = tall(ds);
idx = any(ismissing(tt),2); 
tt(idx,:) = [];
mnDelay = mean(tt.DepDelay,'omitnan');
biggestDelays = topkrows(tt,10,'DepDelay'); 
[mnDelay,biggestDelays] = gather(mnDelay,biggestDelays)

%%
histogram(tt.Year,'BinMethod','integers')
xlabel('Year')
ylabel('Number of Flights')
title('Number of Flights by Year, 1987 - 2008')

%%
% Use |histogram2| to further break down the number of flights by month for
% the whole data set. Since the bins for |Month| and |Year| are known ahead
% of time, specify the bin edges to avoid an extra pass through the data.
year_edges = 1986.5:2008.5;
month_edges = 0.5:12.5;
histogram2(tt.Year,tt.Month,year_edges,month_edges,'DisplayStyle','tile')
colorbar
xlabel('Year')
ylabel('Month')
title('Airline Flights by Month and Year, 1987 - 2008')

%% Data Analytics and Machine Learning with Tall Arrays
% You can perform more sophisticated statistical analysis on tall arrays,
% including calculating predictive analytics and performing machine
% learning, using the functions in Statistics and Machine Learning
% Toolbox™.
% 
% For more information, see <docid:stats_ug.bvd_k7b-1>.

%% Scale to Big Data Systems
% A key capability of tall arrays in MATLAB is the connectivity to big data
% platforms, such as computing clusters and Apache Spark™.
% 
% This example only scratches the surface of what is possible with tall
% arrays for big data. See <docid:import_export.bvciqp3-1> for more
% information about using:
% 
% * Statistics and Machine Learning Toolbox™
% * Database Toolbox™
% * Parallel Computing Toolbox™
% * MATLAB® Distributed Computing Server™
% * MATLAB Compiler™