%% Grouped Statistics Calculations with Tall Arrays
% This example shows how to use the |findgroups| and |splitapply| functions
% to calculate grouped statistics of a tall table containing power outage
% data. |findgroups| and |splitapply| enable you to break up tall table
% variables into groups, use those groups to separate data, and then apply
% a function to each group of data. Alternatively, if you have Statistics
% and Machine Learning Toolbox(TM), you can also use the |grpstats|
% function to calculate grouped statistics.
%
% This example creates a tall table for the power outage data, even though
% the raw data only has about 1500 rows. However, you can use the
% techniques presented here on much larger data sets because no assumptions
% are made about the size of the data in the tall table.

%% Create Datastore and Tall Table
% The sample file, |outages.csv|, contains data representing electric
% utility outages in the United States. The file contains six columns:
% |Region|, |OutageTime|, |Loss|, |Customers|, |RestorationTime|, and
% |Cause|. 
%
% Create a datastore for the |outages.csv| file. Use the
% |'TextscanFormats'| option to specify the kind of data each column
% contains: categorical (|'%C'|), floating-point numeric (|'%f'|), or
% datetime (|'%D'|).
data_formats = {'%C','%D','%f','%f','%D','%C'};
ds = datastore('outages.csv','TextscanFormats',data_formats);

%%
% Create a tall table on top of the datastore.
T = tall(ds)

%% Clean Missing Data
% Some of the rows in the tall table have missing data represented by |NaN|
% and |NaT| values. Remove all of the rows that are missing at least one
% piece of data.
idx = ~any(ismissing(T),2);
T = T(idx,:)
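
%%
% As a small in-memory sketch of the same filter, consider a three-row
% table. The table |demoS| and its values are made up for illustration and
% are not part of |outages.csv|. |ismissing| returns a logical matrix with
% one column per table variable, and |any(...,2)| collapses it to one flag
% per row.
demoS = table([1;NaN;3],categorical({'a';'b';'c'}), ...
    'VariableNames',{'Loss','Region'});   % hypothetical data
demoKeep = ~any(ismissing(demoS),2);      % true for rows with no missing entries
demoS = demoS(demoKeep,:)                 % drops the second row (Loss is NaN)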

%% Mean Power Outage Duration by Region
% Determine the mean power outage duration in each region. The |findgroups|
% function groups the data by the categorical values in |Region|. The
% |splitapply| function applies the specified function to each group of
% data and concatenates the results together.
G = findgroups(T.Region);
times = gather(splitapply(@mean,T.RestorationTime-T.OutageTime,G))

%%
% Change the display format of the duration results to be in days, and put
% the results in a table with the associated regions.
times.Format = 'd';
regions = gather(categories(T.Region));
varnames = {'Regions','MeanOutageDuration'};
meanOutageDurations = table(regions,times,'VariableNames',varnames)
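
%%
% To see the grouping step in isolation, here is a minimal in-memory
% sketch. The variables |demoRegion| and |demoHours| are hypothetical and
% hold made-up values. |findgroups| assigns each row a group number and can
% also return the group labels; |splitapply| then calls the function once
% per group and concatenates the results in label order.
demoRegion = categorical({'East';'West';'East';'West';'East'});  % hypothetical data
demoHours  = [2;10;4;20;6];                                      % hypothetical data
[demoG,demoLabels] = findgroups(demoRegion);   % one group number per row
demoMeans = splitapply(@mean,demoHours,demoG)  % one mean per group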

%% Most Common Power Outage Causes by Region
% Determine how often each power outage cause occurs in each region. First,
% group the data by both cause and region. Then use |splitapply| to create
% a cell array containing the number of occurrences of each cause in each
% region.
G2 = findgroups(T.Cause,T.Region);
C = splitapply(@(r,c) {size(r,1),r(1),c(1)},T.Region,T.Cause,G2);
C = gather(C)

%%
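% Convert the cell array into a table, then unstack the |'Count'| variable
% using |'Region'| as the indicator variable, so that each region gets its
% own column of counts. Cause and region combinations that never occur
% appear as |NaN|. Use |fillmissing| on the in-memory table to replace
% these |NaN| values with zeros.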
tmp = cell2table(C, 'VariableNames', {'Count', 'Region', 'Cause'});
RegionCauses = unstack(tmp, 'Count', 'Region');
RegionCauses = fillmissing(RegionCauses,'constant',{'',0,0,0,0,0})
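
%%
% The reshaping step can be sketched on a small in-memory table. The table
% |demoLong| and its values are made up for illustration. |unstack| spreads
% the counts into one column per region, leaving |NaN| where a cause never
% occurs in a region. This sketch restricts |fillmissing| to the count
% columns with the |'DataVariables'| option, which is an alternative to
% supplying one fill value per variable.
demoLong = table([3;1;2], ...
    categorical({'East';'West';'East'}), ...
    categorical({'storm';'storm';'wind'}), ...
    'VariableNames',{'Count','Region','Cause'});   % hypothetical data
demoWide = unstack(demoLong,'Count','Region');     % one row per cause
demoWide = fillmissing(demoWide,'constant',0, ...
    'DataVariables',{'East','West'})               % NaN counts become 0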