%% Grouped Statistics Calculations with Tall Arrays
% This example shows how to use the |findgroups| and |splitapply| functions
% to calculate grouped statistics of a tall table containing power outage
% data. |findgroups| and |splitapply| enable you to break up tall table
% variables into groups, use those groups to separate data, and then apply
% a function to each group of data. Alternatively, if you have Statistics
% and Machine Learning Toolbox(TM), then you also can use the
% <docid:stats_ug.bthgdq6> function to calculate grouped statistics.
%
% This example creates a tall table for the power outage data, even though
% the raw data only has about 1500 rows. However, you can use the
% techniques presented here on much larger data sets because no assumptions
% are made about the size of the data in the tall table.

%% Create Datastore and Tall Table
% The sample file, |outages.csv|, contains data representing electric
% utility outages in the United States. The file contains six columns:
% |Region|, |OutageTime|, |Loss|, |Customers|, |RestorationTime|, and
% |Cause|.
%
% Create a datastore for the |outages.csv| file. Use the
% |'TextScanFormats'| option to specify the kind of data each column
% contains: categorical (|'%C'|), floating-point numeric (|'%f'|), or
% datetime (|'%D'|).
data_formats = {'%C','%D','%f','%f','%D','%C'};
ds = datastore('outages.csv','TextscanFormats',data_formats);

%%
% Create a tall table on top of the datastore.
T = tall(ds)

%% Clean Missing Data
% Some of the rows in the tall table have missing data represented by |NaN|
% and |NaT| values. Remove all of the rows that are missing at least one
% piece of data.
idx = ~any(ismissing(T),2);
T = T(idx,:)

%% Mean Power Outage Duration by Region
% Determine the mean power outage duration in each region. The |findgroups|
% function groups the data by the categorical values in |Region|.
% The |splitapply| function applies the specified function to each group of
% data and concatenates the results together.
G = findgroups(T.Region);
times = gather(splitapply(@mean,T.RestorationTime-T.OutageTime,G))

%%
% Change the display format of the duration results to be in days, and put
% the results in a table with the associated regions.
times.Format = 'd';
regions = gather(categories(T.Region));
varnames = {'Regions','MeanOutageDuration'};
meanOutageDurations = table(regions,times,'VariableNames',varnames)

%% Most Common Power Outage Causes by Region
% Determine how often each power outage cause occurs in each region. First,
% group the data by both cause and region. Then use |splitapply| to create
% a cell array containing the number of occurrences of each cause in each
% region.
G2 = findgroups(T.Cause,T.Region);
C = splitapply(@(r,c) {size(r,1),r(1),c(1)},T.Region,T.Cause,G2);
C = gather(C)

%%
% Convert the cell array into a table and unstack the |'Count'| variable,
% using |'Region'| as the indicator variable, so that each region becomes
% its own column of counts. Use |fillmissing| on the in-memory table to
% replace |NaN| values with zeros. The cell array passed to |fillmissing|
% supplies one fill value per table variable: the first entry is for
% |Cause|, and each zero is for one region column.
tmp = cell2table(C, 'VariableNames', {'Count', 'Region', 'Cause'});
RegionCauses = unstack(tmp, 'Count', 'Region');
RegionCauses = fillmissing(RegionCauses,'constant',{'',0,0,0,0,0})
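
%% Check the Grouping Pattern on a Small In-Memory Table
% The following is a minimal sketch, not part of the original example. It
% applies the same |findgroups| and |splitapply| pattern used in the
% sections above to a small in-memory table, so that the group numbers and
% the per-group results are easy to verify by hand. The variable names
% |S|, |Gs|, |regionNames|, |meanLoss|, and |smallSummary|, as well as the
% data values, are hypothetical and chosen only for illustration.
S = table(categorical({'West';'East';'West';'East';'West'}), ...
    [10;20;30;60;50],'VariableNames',{'Region','Loss'});

% Request the second output of findgroups to get the group labels in the
% same order as the group numbers. This keeps labels and results aligned
% without a separate call to categories.
[Gs,regionNames] = findgroups(S.Region);

% Apply the function to each group. The result has one row per group.
meanLoss = splitapply(@mean,S.Loss,Gs);
smallSummary = table(regionNames,meanLoss)

%%
% This sketch uses the second output of |findgroups| for the labels, where
% the sections above use |categories| on the grouping variable. The second
% output lists only the groups that occur in the data, so it may be the
% safer choice if a categorical variable defines categories that do not
% appear in any row.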