%% Using MapReduce to Compute Covariance and Related Quantities
% This example shows how to compute the mean and covariance for several
% variables in a large data set using |mapreduce|. It then uses the
% covariance to perform several follow-up calculations that do not require
% another iteration over the entire data set.

% Copyright 1984-2014 The MathWorks, Inc.

%% Prepare Data
% Create a datastore using the |airlinesmall.csv| data set. This 12
% megabyte data set contains 29 columns of flight information for several
% airline carriers, including arrival and departure times. In this example,
% select |ActualElapsedTime| (total flight time), |Distance| (total flight
% distance), |DepDelay| (flight departure delay), and |ArrDelay| (flight
% arrival delay) as the variables of interest.
ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'ActualElapsedTime', 'Distance', ...
                            'DepDelay', 'ArrDelay'}

%%
% |tabularTextDatastore| returns a |TabularTextDatastore| object for the data. This
% datastore treats |'NA'| strings as missing, and replaces the missing
% values with |NaN| values by default. Additionally, the
% |SelectedVariableNames| property allows you to work with only the
% selected variables of interest, which you can verify using |preview|.
preview(ds)

%% Run MapReduce
% The |mapreduce| function requires a mapper function and a reducer
% function. The mapper function receives chunks of data and outputs
% intermediate results. The reducer function reads the intermediate results
% and produces a final result.

%%
% In this example, the mapper function computes the count, mean, and
% covariance for the variables in each chunk of data in the datastore,
% |ds|. Then, the mapper function stores the computed values for each chunk
% as an intermediate key-value pair consisting of a single key with a cell
% array containing the three computed values.

%%
% Display the mapper function file.
type covarianceMapper

%%
% The reducer function combines the intermediate results for each chunk to
% obtain the count, mean, and covariance for each variable of interest in
% the entire data set. The reducer function stores the final key-value
% pairs for the keys |'count'|, |'mean'|, and |'cov'| with the
% corresponding values for each variable.

%%
% Display the reducer function file.
type covarianceReducer

%%
% Use |mapreduce| to apply the mapper and reducer functions to the
% datastore, |ds|.
outds = mapreduce(ds, @covarianceMapper, @covarianceReducer);

%%
% |mapreduce| returns a datastore, |outds|, with files in the current
% folder.

%%
% View the results of the |mapreduce| call by using the |readall| function
% on the output datastore.
results = readall(outds)
Count = results.Value{1};
MeanVal = results.Value{2};
Covariance = results.Value{3};

%% Compute Correlation Matrix
% The covariance, mean, and count values are useful to perform further
% calculations. Compute a correlation matrix by finding the standard
% deviations and normalizing them to correlation form.
s = sqrt(diag(Covariance));
Correlation = Covariance ./ (s*s')

%%
% The elapsed time (first column) and distance (second column) are highly
% correlated, since |Correlation(2,1) = 0.9666|. The departure delay (third
% column) and arrival delay (fourth column) are also highly correlated,
% since |Correlation(4,3) = 0.8748|.

%% Compute Regression Coefficients
% Compute some regression coefficients to predict the arrival
% delay, |ArrDelay|, using the other three variables as predictors.
slopes = Covariance(1:3,1:3)\Covariance(1:3,4);
intercept = MeanVal(4) - MeanVal(1:3)*slopes;
b = table([intercept; slopes], 'VariableNames', {'Estimate'}, ...
    'RowNames', {'Intercept','ActualElapsedTime','Distance','DepDelay'})

%% Perform PCA
% Use |svd| to perform PCA (principal components analysis). PCA is a
% technique for finding a lower dimensional summary of a data set. The
% following calculation is a simplified version of PCA, but more options
% are available from the |pca| and |pcacov| functions in Statistics and
% Machine Learning Toolbox(TM).
%
% You can carry out PCA using either the covariance or correlation. In this
% case, use the correlation since the difference in scale of the variables
% is large. The first two components capture most of the variance.
[~,latent,pcacoef] = svd(Correlation);
latent = diag(latent)

%%
% Display the coefficient matrix. Each column of the coefficients matrix
% describes how one component is defined as a linear combination of the
% standardized original variables. The first component is mostly an average
% of the first two variables, with some additional contribution from the
% other variables. Similarly, the second component is mostly an average of
% the last two variables.
pcacoef
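
%% Sketch: Merging Per-Chunk Statistics
% The reducer described above has to merge the count, mean, and covariance
% of each chunk into values for the entire data set. The listing of
% |covarianceReducer| is not reproduced in this script, so the following is
% only a minimal in-memory sketch of one standard way to do such a merge (a
% pairwise update of the mean and of the scatter matrix). It is not
% necessarily what the shipped function does. The synthetic data |X|, the
% chunk boundaries |edges|, and all variable names in this section are
% hypothetical and chosen for illustration. The sketch also assumes that
% rows with missing values were already removed from each chunk.
rng(0);                              % reproducible synthetic data
X = randn(1000, 4);                  % stand-in for four numeric variables
edges = [0 300 650 1000];            % hypothetical chunk boundaries

n = 0;                               % running count
mu = zeros(1, size(X, 2));           % running mean (row vector)
M = zeros(size(X, 2));               % running scatter matrix (sum of deviation outer products)
for k = 1:numel(edges) - 1
    chunk = X(edges(k) + 1:edges(k + 1), :);
    nk = size(chunk, 1);             % count for this chunk
    muk = mean(chunk, 1);            % mean for this chunk
    Mk = (nk - 1) * cov(chunk);      % scatter matrix recovered from the chunk covariance
    d = muk - mu;                    % offset between chunk mean and running mean
    nNew = n + nk;
    M = M + Mk + (d' * d) * (n * nk / nNew);   % merge scatter matrices
    mu = mu + d * (nk / nNew);                 % merge means
    n = nNew;
end
mergedCov = M / (n - 1);             % unbiased covariance for all rows

%%
% The merged covariance should agree with a single pass over all rows up to
% floating-point round-off, which is why the follow-up calculations in this
% example need only the three merged quantities and not the raw data.
maxCovError = max(max(abs(mergedCov - cov(X))))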