%% Logistic Regression with Tall Arrays
% This example shows how to use logistic regression and other techniques to
% perform data analysis on tall arrays. Tall arrays represent data that is
% too large to fit into computer memory.

%% Get Data into MATLAB(R)
% Create a datastore that references the folder location with the data. The
% data can be contained in a single file, a collection of files, or an
% entire folder. Treat |'NA'| values as missing data so that |datastore|
% replaces them with |NaN| values. Select a subset of the variables to work
% with, and include the name of the airline (|UniqueCarrier|) as a
% categorical variable. Create a tall table on top of the datastore.
ds = datastore('airlinesmall.csv');
ds.TreatAsMissing = 'NA';
ds.SelectedVariableNames = {'DayOfWeek','UniqueCarrier',...
    'ArrDelay','DepDelay','Distance'};
ds.SelectedFormats{2} = '%C';
tt = tall(ds);
tt.DayOfWeek = categorical(tt.DayOfWeek,1:7,...
    {'Sun','Mon','Tues','Wed','Thu','Fri','Sat'},'Ordinal',true)
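
%%
% Before running any tall computations, it can help to confirm that the
% datastore settings took effect. As a quick check (a sketch, not part of
% the original analysis), |preview| reads only the first few rows of the
% data into memory, so it is inexpensive even for large data. The variable
% name |firstRows| is chosen here for illustration.
firstRows = preview(ds)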

%% Late Flights
% Determine the flights that are late by 20 minutes or more by defining a
% logical variable that is true for a late flight. Add this variable to the
% tall table of data, noting that it is not yet evaluated. A preview of
% this variable includes the first few rows.
tt.LateFlight = tt.ArrDelay>=20

%% 
% Calculate the mean of |LateFlight| to determine the overall proportion of
% late flights. Use |gather| to trigger evaluation of the tall array and
% bring the result into memory.
m = mean(tt.LateFlight)

%%
m = gather(m)
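
%%
% Note that a comparison with |NaN| is false, so a flight with a missing
% |ArrDelay| counts as not late in the mean above. As a rough check (a
% sketch, not part of the original analysis), count the missing arrival
% delays alongside the late flights. Passing both tall results to a single
% |gather| call lets MATLAB evaluate them together. The names |numMissing|
% and |numLate| are chosen here for illustration.
[numMissing,numLate] = gather(sum(ismissing(tt.ArrDelay)),sum(tt.LateFlight))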

%% Late Flights by Carrier
% Examine whether certain types of flights tend to be late. First, check to
% see if certain carriers are more likely to have late flights.
tt.LateFlight = double(tt.LateFlight);
late_by_carrier = gather(grpstats(tt,'UniqueCarrier','mean','DataVar','LateFlight'))

%% 
% Carriers |B6| and |EV| have higher proportions of late flights. Carriers
% |AQ|, |ML(1)|, and |HA| have relatively few flights, but lower
% proportions of them are late.
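
%%
% To make the ranking easier to read, sort the gathered table by the
% proportion of late flights. This is a minimal sketch: it assumes that
% |grpstats| names the output variable |mean_LateFlight|, and
% |sorted_by_carrier| is a name chosen here for illustration. Because
% |late_by_carrier| is already in memory, no further tall evaluation is
% needed.
sorted_by_carrier = sortrows(late_by_carrier,'mean_LateFlight','descend')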

%% Late Flights by Day of Week
% Next, check to see if flights on certain days of the week are more likely
% to be late.
late_by_day = gather(grpstats(tt,'DayOfWeek','mean','DataVar','LateFlight'))

%% 
% Wednesdays and Thursdays have the highest proportion of late flights,
% while Fridays have the lowest proportion.
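
%%
% A bar chart can make the day-to-day differences easier to compare. The
% following is a sketch that assumes the output variable is named
% |mean_LateFlight|. It labels each bar from the |DayOfWeek| column so that
% the labels stay correct regardless of the row order that |grpstats|
% returns.
bar(late_by_day.mean_LateFlight)
xticks(1:height(late_by_day))
xticklabels(cellstr(late_by_day.DayOfWeek))
ylabel('Proportion of late flights')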

%% Late Flights by Distance
% Check to see if longer or shorter flights tend to be late. First, look at
% the density of the flight distance for flights that are late, and compare
% that with flights that are on time.
ksdensity(tt.Distance(tt.LateFlight==1))
hold on
ksdensity(tt.Distance(tt.LateFlight==0))
hold off
legend('Late','On time')

%% 
% Flight distance does not make a dramatic difference in whether a flight
% is on time or late. However, the density appears to be slightly higher for
% on-time flights at distances of about 400 miles. The density is also
% higher for late flights at distances of about 2000 miles. Calculate some
% simple descriptive statistics for the late and on-time flights.
late_by_distance = gather(grpstats(tt,'LateFlight',{'mean' 'std'},'DataVar','Distance'))

%% 
% Late flights are about 60 miles longer on average, although this
% difference is small relative to the standard deviation of the distance
% values.
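
%%
% To put the difference in context, express it as a fraction of the spread
% in distance. This is a rough sketch, assuming the output variables are
% named |mean_Distance| and |std_Distance|. It uses the average of the two
% group standard deviations as the scale, which is a simplification chosen
% here for illustration.
isLate = late_by_distance.LateFlight==1;
meanDiff = late_by_distance.mean_Distance(isLate) - ...
    late_by_distance.mean_Distance(~isLate)
relDiff = meanDiff/mean(late_by_distance.std_Distance)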

%% Logistic Regression Model
% Build a model for the probability of a late flight, using both continuous
% variables (such as |Distance|) and categorical variables (such as
% |DayOfWeek|) to predict the probabilities. This model can help to
% determine if the previous results observed for each predictor
% individually also hold true when you consider them together.
glm = fitglm(tt,'LateFlight~Distance+DayOfWeek','Distribution','binomial')

%% 
% The model confirms that the previously observed conclusions hold true
% here as well:
%
% * The Wednesday and Thursday coefficients are positive, indicating a
% higher probability of a late flight on those days. The Friday coefficient
% is negative, indicating a lower probability. 
% * The Distance coefficient is positive, indicating that longer flights
% have a higher probability of being late. 
%
% All of these coefficients have very small p-values. This is common with
% data sets that have many observations, since one can reliably estimate
% small effects with large amounts of data. In fact, the uncertainty about
% whether the model itself is appropriate is larger than the uncertainty in
% the estimates of its parameters.
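
%%
% Logistic regression coefficients are on the log-odds scale, so
% exponentiating them gives odds ratios, which can be easier to interpret.
% The following sketch (not part of the original analysis) builds a small
% table of odds ratios from the fitted coefficients. A value above 1 means
% the term increases the odds of a late flight, and a value below 1 means
% it decreases them. For |Distance|, the ratio applies per additional mile.
% The name |oddsRatios| is chosen here for illustration.
oddsRatios = table(exp(glm.Coefficients.Estimate),...
    'RowNames',glm.Coefficients.Properties.RowNames,...
    'VariableNames',{'OddsRatio'})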

%% Prediction with Model
% Predict the probability of a late flight for each day of the week, and
% for distances ranging from 0 to 3000 miles. Create a table to hold the
% predictor values by indexing the first 100 rows in the original table
% |tt|.
x = gather(tt(1:100,{'Distance' 'DayOfWeek'}));
x.Distance = linspace(0,3000)';
x.DayOfWeek(:) = 'Sun';
plot(x.Distance,predict(glm,x));

days = {'Sun' 'Mon' 'Tues' 'Wed' 'Thu' 'Fri' 'Sat'};
hold on
for j=2:length(days)
    x.DayOfWeek(:) = days{j};
    plot(x.Distance,predict(glm,x));
end
legend(days)
xlabel('Distance (miles)')
ylabel('Predicted probability of a late flight')
hold off
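
%%
% To read specific values off these curves, query the model directly. The
% following sketch (not part of the original analysis) reuses two rows of
% |x| as a template so that |DayOfWeek| keeps its ordinal categories, and
% then predicts the probability for a 500-mile Wednesday flight and a
% 3000-mile Friday flight. The names |q| and |pq| are chosen here for
% illustration.
q = x(1:2,:);
q.Distance = [500; 3000];
q.DayOfWeek(1) = 'Wed';
q.DayOfWeek(2) = 'Fri';
pq = predict(glm,q)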

%% 
% According to this model, a Wednesday or Thursday flight of 500 miles has
% the same probability of being late, about 18%, as a Friday flight of
% about 3000 miles.
% 
% Since these probabilities are all much less than 50%, the model is
% unlikely to predict that any given flight will be late using this
% information. Investigate the model more by focusing on the flights for
% which the model predicts a probability of 20% or more of being late, and
% compare that to the actual results.
C = gather(crosstab(tt.LateFlight,predict(glm,tt)>.20))

%% 
% Among the flights predicted to have a 20% or higher probability of being
% late, about 20% were late (|1125/(1125 + 4391)|). Among the remainder,
% less than 16% were late (|18394/(18394 + 99613)|).
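
%%
% The same proportions can be computed directly from the table. This sketch
% assumes that |crosstab| orders the rows (actual) and columns (predicted)
% as 0 then 1, so the second row holds the late flights and the second
% column holds the flights flagged by the model. The names |rateFlagged|
% and |rateRest| are chosen here for illustration.
rateFlagged = C(2,2)/sum(C(:,2))
rateRest = C(2,1)/sum(C(:,1))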