%% Analyze Big Data in MATLAB Using Tall Arrays
% This example shows how to use tall arrays to work with big data in
% MATLAB®. You can use tall arrays to perform a variety of calculations
% on different types of data that does not fit in memory. These include
% basic calculations, as well as machine learning algorithms within
% Statistics and Machine Learning Toolbox™.
%
% This example operates on a small subset of data on a single computer, and
% then scales up to analyze the entire data set. However, this analysis
% technique can scale up even further to work on data sets that are so
% large they cannot be read into memory, or to work on systems like
% Apache Spark™.

%% Introduction to Tall Arrays
% Tall arrays and tall tables are used to work with out-of-memory data that
% has any number of rows. Instead of writing specialized code that takes
% into account the huge size of the data, tall arrays and tables let you
% work with large data sets in a manner similar to in-memory MATLAB®
% arrays. The difference is that |tall| arrays typically remain unevaluated
% until you request that the calculations be performed.
%
% This deferred evaluation enables MATLAB to combine the queued
% calculations where possible and take the minimum number of passes through
% the data. Since the number of passes through the data greatly affects
% execution time, it is recommended that you request output only when
% necessary.

%% Create |datastore| for Collection of Files
% Creating a |datastore| enables you to access a collection of data. A
% |datastore| can process arbitrarily large amounts of data, and the data
% can even be spread across multiple files in multiple folders. You can
% create a |datastore| for a collection of tabular text files (demonstrated
% here), spreadsheets, images, a SQL database (Database Toolbox™
% required), or Hadoop® sequence files.
%
% Create a |datastore| for a |.csv| file containing airline data.
% Treat |'NA'| values as missing so that |datastore| replaces them with
% |NaN| values. Select the variables of interest, and specify a categorical
% data type for the |Origin| and |Dest| variables. Preview the contents.
ds = datastore('airlinesmall.csv');
ds.TreatAsMissing = 'NA';
ds.SelectedVariableNames = {'Year','Month','ArrDelay','DepDelay','Origin','Dest'};
ds.SelectedFormats(5:6) = {'%C','%C'};
pre = preview(ds)

%% Create Tall Array
% Tall arrays are similar to in-memory MATLAB arrays, except that they can
% have any number of rows. Tall arrays can contain data that is numeric,
% logical, datetime, duration, calendarDuration, categorical, or strings.
% Also, you can convert any in-memory array |A| to a tall array with
% |tall(A)|, provided that |A| is one of the supported data types.
%
% The underlying class of a tall array is based on the type of datastore
% that backs it. For example, if the datastore |ds| contains tabular data,
% then |tall(ds)| returns a tall table containing the data.
tt = tall(ds)

%%
% The display indicates the underlying data type and includes the first
% several rows of data. The size of the table displays as "Mx6" to indicate
% that MATLAB does not yet know how many rows of data there are.

%% Perform Calculations on Tall Arrays
% You can work with tall arrays and tall tables in much the same way that
% you work with in-memory MATLAB arrays and tables.
%
% One important aspect of tall arrays is that as you work with them, MATLAB
% does not perform most operations immediately. These operations appear to
% execute quickly, because the actual computation is deferred until you
% specifically request output. This deferred evaluation is important
% because even a simple command like |size(X)| executed on a tall array
% with a billion rows is not a quick calculation.
%
% As you work with tall arrays, MATLAB keeps track of all of the operations
% to be carried out and optimizes the number of passes through the data.
% Thus, it is normal to work with unevaluated tall arrays and request
% output only when you require it. MATLAB does not know the contents or
% size of unevaluated tall arrays until you request that the array be
% evaluated and displayed.
%
% Calculate the mean departure delay.
mDep = mean(tt.DepDelay,'omitnan')

%% Gather Results into Workspace
% The benefit of deferred evaluation is that when the time comes for MATLAB
% to perform the calculations, it is often possible to combine the
% operations in such a way that the number of passes through the data is
% minimized. So, even if you perform many operations, MATLAB only makes
% extra passes through the data when absolutely necessary.
%
% The |gather| function forces evaluation of all queued operations and
% brings the resulting output back into memory. Since |gather| returns the
% _entire_ result in MATLAB, you should make sure that the result will fit
% in memory. For example, use |gather| on tall arrays that are the result
% of a function that reduces the size of the tall array, such as |sum|,
% |min|, |mean|, and so on.
%
% Use |gather| to calculate the mean departure delay and bring the answer
% into memory. This calculation requires a single pass through the data,
% but other calculations might require several passes through the data.
% MATLAB determines the optimal number of passes for the calculation and
% displays this information at the command line.
mDep = gather(mDep)

%% Select Subset of Tall Array
% You can extract values from a tall array by subscripting or indexing. You
% can index the array starting from the top or bottom, or by using a
% logical index. The functions |head| and |tail| are useful alternatives to
% indexing, enabling you to explore the first and last portions of a tall
% array. Gather both variables at the same time to avoid extra passes
% through the data.
h = head(tt);
tl = tail(tt);
[h,tl] = gather(h,tl)

%%
% Use |head| to select a subset of 10,000 rows from the data for
% prototyping code before scaling to the full data set.
ttSubset = head(tt,10000);

%% Select Data by Condition
% You can use typical logical operations on tall arrays, which are useful
% for selecting relevant data or removing outliers with logical indexing.
% The logical expression creates a tall logical vector, which then is used
% to subscript, identifying the rows where the condition is true.
%
% Select only the flights out of Boston by comparing the
% elements of the categorical variable |Origin| to the value |'BOS'|.
idx = (ttSubset.Origin == 'BOS');
bosflights = ttSubset(idx,:)

%%
% You can use the same indexing technique to remove rows with missing data
% or NaN values from the tall array.
idx = any(ismissing(ttSubset),2);
ttSubset(idx,:) = [];

%% Determine Largest Delays
% Due to the nature of big data, sorting all of the data using traditional
% methods like |sort| or |sortrows| is inefficient. However, the |topkrows|
% function for tall arrays returns the top |k| rows in sorted order.
%
% Calculate the top 10 greatest departure delays.
biggestDelays = topkrows(ttSubset,10,'DepDelay');
biggestDelays = gather(biggestDelays)

%% Visualize Data in Tall Arrays
% Plotting every point in a big data set is not feasible. For that reason,
% visualization of tall arrays involves reducing the number of data points
% using sampling or binning.
%
% Visualize the number of flights per year with a histogram. The
% visualization functions pass through the data and immediately evaluate
% the solution when you call them, so |gather| is not required.
histogram(ttSubset.Year,'BinMethod','integers')
xlabel('Year')
ylabel('Number of Flights')
title('Number of Flights by Year, 1987 - 1989')

%% Scale to Entire Data Set
% Instead of using the smaller data returned from |head|, you can scale up
% to perform the calculations on the entire data set by using the results
% from |tall(ds)|.
tt = tall(ds);
idx = any(ismissing(tt),2);
tt(idx,:) = [];
mnDelay = mean(tt.DepDelay,'omitnan');
biggestDelays = topkrows(tt,10,'DepDelay');
[mnDelay,biggestDelays] = gather(mnDelay,biggestDelays)

%%
histogram(tt.Year,'BinMethod','integers')
xlabel('Year')
ylabel('Number of Flights')
title('Number of Flights by Year, 1987 - 2008')

%%
% Use |histogram2| to further break down the number of flights by month for
% the whole data set. Since the bins for |Month| and |Year| are known ahead
% of time, specify the bin edges to avoid an extra pass through the data.
year_edges = 1986.5:2008.5;
month_edges = 0.5:12.5;
histogram2(tt.Year,tt.Month,year_edges,month_edges,'DisplayStyle','tile')
colorbar
xlabel('Year')
ylabel('Month')
title('Airline Flights by Month and Year, 1987 - 2008')

%% Data Analytics and Machine Learning with Tall Arrays
% You can perform more sophisticated statistical analysis on tall arrays,
% including calculating predictive analytics and performing machine
% learning, using the functions in Statistics and Machine Learning
% Toolbox™.
%
% For more information, see <docid:stats_ug.bvd_k7b-1>.

%% Scale to Big Data Systems
% A key capability of tall arrays in MATLAB is the connectivity to big data
% platforms, such as computing clusters and Apache Spark™.
%
% This example only scratches the surface of what is possible with tall
% arrays for big data. See <docid:import_export.bvciqp3-1> for more
% information about using:
%
% * Statistics and Machine Learning Toolbox™
% * Database Toolbox™
% * Parallel Computing Toolbox™
% * MATLAB® Distributed Computing Server™
% * MATLAB Compiler™
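
%% Sketch: Deferred Evaluation on a Small In-Memory Array
% The following is a minimal sketch and not part of the original example.
% It illustrates the deferred evaluation and |gather| workflow described
% earlier, using a small hypothetical in-memory vector |A| (an assumption
% made for illustration) in place of the airline data. The call
% |mapreducer(0)| is assumed here only to keep the calculation in the
% local MATLAB session; on a cluster or on Apache Spark™ you would
% configure a different execution environment instead.
mapreducer(0);                    % run tall calculations in the local session
A = (1:10)';                      % hypothetical in-memory data
tA = tall(A);                     % convert the in-memory array to a tall array
total = sum(tA);                  % queued; remains an unevaluated tall scalar
avg = mean(tA);                   % queued; remains an unevaluated tall scalar
[total,avg] = gather(total,avg)   % evaluate both results in one call

%%
% Gathering |total| and |avg| together lets MATLAB combine the two queued
% reductions instead of reading the data once per result. For this
% 10-element vector the results are 55 and 5.5.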