%% Logistic Regression with Tall Arrays
% This example shows how to use logistic regression and other techniques to
% perform data analysis on tall arrays. Tall arrays represent data that is
% too large to fit into computer memory.

%% Get Data into MATLAB(R)
% Create a datastore that references the folder location with the data. The
% data can be contained in a single file, a collection of files, or an
% entire folder. Treat |'NA'| values as missing data so that |datastore|
% replaces them with |NaN| values. Select a subset of the variables to work
% with, and include the name of the airline (|UniqueCarrier|) as a
% categorical variable. Create a tall table on top of the datastore.
ds = datastore('airlinesmall.csv');
ds.TreatAsMissing = 'NA';
ds.SelectedVariableNames = {'DayOfWeek','UniqueCarrier',...
    'ArrDelay','DepDelay','Distance'};
ds.SelectedFormats{2} = '%C';
tt = tall(ds);
tt.DayOfWeek = categorical(tt.DayOfWeek,1:7,...
    {'Sun','Mon','Tues','Wed','Thu','Fri','Sat'},'Ordinal',true)

%% Late Flights
% Determine the flights that are late by 20 minutes or more by defining a
% logical variable that is true for a late flight. Add this variable to the
% tall table of data, noting that it is not yet evaluated. A preview of
% this variable includes the first few rows.
tt.LateFlight = tt.ArrDelay>=20

%%
% Calculate the mean of |LateFlight| to determine the overall proportion of
% late flights. Use |gather| to trigger evaluation of the tall array and
% bring the result into memory.
m = mean(tt.LateFlight)

%%
m = gather(m)

%% Late Flights by Carrier
% Examine whether certain types of flights tend to be late. First, check to
% see if certain carriers are more likely to have late flights.
tt.LateFlight = double(tt.LateFlight);
late_by_carrier = gather(grpstats(tt,'UniqueCarrier','mean','DataVar','LateFlight'))

%%
% Carriers |B6| and |EV| have higher proportions of late flights.
% Carriers |AQ|, |ML(1)|, and |HA| have relatively few flights, but lower
% proportions of them are late.

%% Late Flights by Day of Week
% Next, check to see if different days of the week tend to have later
% flights.
late_by_day = gather(grpstats(tt,'DayOfWeek','mean','DataVar','LateFlight'))

%%
% Wednesdays and Thursdays have the highest proportion of late flights,
% while Fridays have the lowest proportion.

%% Late Flights by Distance
% Check to see if longer or shorter flights tend to be late. First, look at
% the density of the flight distance for flights that are late, and compare
% that with flights that are on time.
ksdensity(tt.Distance(tt.LateFlight==1))
hold on
ksdensity(tt.Distance(tt.LateFlight==0))
hold off
legend('Late','On time')

%%
% Flight distance does not make a dramatic difference in whether a flight
% is early or late. However, the density appears to be slightly higher for
% on-time flights at distances of about 400 miles. The density is also
% higher for late flights at distances of about 2000 miles. Calculate some
% simple descriptive statistics for the late and on-time flights.
late_by_distance = gather(grpstats(tt,'LateFlight',{'mean' 'std'},'DataVar','Distance'))

%%
% Late flights are about 60 miles longer on average, although this value
% makes up only a small portion of the standard deviation of the distance
% values.

%% Logistic Regression Model
% Build a model for the probability of a late flight, using both continuous
% variables (such as |Distance|) and categorical variables (such as
% |DayOfWeek|) to predict the probabilities. This model can help to
% determine if the previous results observed for each predictor
% individually also hold true when you consider them together.
glm = fitglm(tt,'LateFlight~Distance+DayOfWeek','Distribution','binomial')

%%
% The model confirms that the previously observed conclusions hold true
% here as well:
%
% * The Wednesday and Thursday coefficients are positive, indicating a
% higher probability of a late flight on those days. The Friday coefficient
% is negative, indicating a lower probability.
% * The Distance coefficient is positive, indicating that longer flights
% have a higher probability of being late.
%
% All of these coefficients have very small p-values. This is common with
% data sets that have many observations, since one can reliably estimate
% small effects with large amounts of data. In fact, the uncertainty in the
% model is larger than the uncertainty in the estimates for the parameters
% in the model.

%% Prediction with Model
% Predict the probability of a late flight for each day of the week, and
% for distances ranging from 0 to 3000 miles. Create a table to hold the
% predictor values by indexing the first 100 rows in the original table
% |tt|.
x = gather(tt(1:100,{'Distance' 'DayOfWeek'}));
x.Distance = linspace(0,3000)';
x.DayOfWeek(:) = 'Sun';
plot(x.Distance,predict(glm,x));
days = {'Sun' 'Mon' 'Tues' 'Wed' 'Thu' 'Fri' 'Sat'};
hold on
for j=2:length(days)
    x.DayOfWeek(:) = days{j};
    plot(x.Distance,predict(glm,x));
end
legend(days)

%%
% According to this model, a Wednesday or Thursday flight of 500 miles has
% the same probability of being late, about 18%, as a Friday flight of
% about 3000 miles.
%
% Since these probabilities are all much less than 50%, the model is
% unlikely to predict that any given flight will be late using this
% information. Investigate the model more by focusing on the flights for
% which the model predicts a probability of 20% or more of being late, and
% compare that to the actual results.
C = gather(crosstab(tt.LateFlight,predict(glm,tt)>.20))

%%
% Among the flights predicted to have a 20% or higher probability of being
% late, about 20% were late (|1125/(1125 + 4391)|). Among the remainder,
% less than 16% were late (|18394/(18394 + 99613)|).
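
%%
% The two proportions above can be recomputed from the four quoted counts.
% This is a minimal sketch, not part of the original example: it hard-codes
% the counts rather than reading them from |C|, and it assumes the layout
% rows = actual |LateFlight| (0, 1), columns = predicted probability above
% 0.20 (false, true). The names |counts| and |late_rate| are illustrative.
counts = [99613 4391; 18394 1125];        % counts quoted in the text above
late_rate = counts(2,:) ./ sum(counts,1)  % fraction late in each predicted group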