
%% Preprocessing Data
% This example shows how to preprocess data for analysis.
%% Overview
% Begin a data analysis by loading data into suitable MATLAB(R) container
% variables and sorting out the "good" data from the "bad."
% This preliminary step helps ensure that conclusions drawn in
% subsequent parts of the analysis are meaningful.
%% Loading the Data
% Begin by loading the data in |count.dat|:

% Copyright 2015 The MathWorks, Inc.

load count.dat
%%
% The 24-by-3 array |count| contains hourly traffic counts (the rows)
% at three intersections (the columns) for a single day.
%% Missing Data
% The MATLAB |NaN| (Not a Number) value is normally used to represent missing
% data. |NaN| values allow variables with missing data to maintain their
% structure - in this case, 24-by-1 vectors with consistent indexing across
% all three intersections.
%
% Check the data at the third intersection for |NaN| values using the |isnan| function:
c3 = count(:,3); % Data at intersection 3
c3NaNCount = sum(isnan(c3))
%%
% |isnan| returns a logical vector the same size as |c3|, with entries
% indicating the presence (|1|) or absence (|0|) of |NaN| values for each of the
% 24 elements in the data. In this case, the logical values sum to |0|, so
% there are no |NaN| values in the data.
%
% |NaN| values are introduced into the data in the section on Outliers.
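%%
% As a minimal sketch of the same check on data that does contain missing
% values, the following uses a small made-up vector. The name |demoCounts|
% and its values are hypothetical and are not taken from |count.dat|. The
% same pattern applied to a matrix, as in |sum(isnan(count))|, counts the
% missing values in each column.
demoCounts = [12; NaN; 7; NaN; 30]; % Hypothetical counts with two missing values
demoMask = isnan(demoCounts); % Logical vector, true where a value is missing
demoNaNCount = sum(demoMask) % Number of missing values
demoNaNRows = find(demoMask) % Row indices of the missing values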
%% Outliers
% Outliers are data values that are dramatically different from patterns in
% the rest of the data. They might be due to measurement error, or they
% might represent significant features in the data. Identifying outliers,
% and deciding what to do with them, depends on an understanding of the
% data and its source.
%
% One common method for identifying outliers is to look for values more
% than a certain number of standard deviations $\sigma$ from the mean $\mu$. The
% following code plots a histogram of the data at the third intersection
% together with lines at $\mu$ and $\mu + n\sigma$, for $n = 1, 2$:
bin_counts = hist(c3); % Histogram bin counts
N = max(bin_counts); % Maximum bin count
mu3 = mean(c3); % Data mean
sigma3 = std(c3); % Data standard deviation

hist(c3) % Plot histogram
hold on
plot([mu3 mu3],[0 N],'r','LineWidth',2) % Mean
X = repmat(mu3+(1:2)*sigma3,2,1);
Y = repmat([0;N],1,2);
plot(X,Y,'g','LineWidth',2) % Standard deviations
legend('Data','Mean','Stds')
hold off
%%
% The plot shows that some of the data are more than two standard
% deviations above the mean. If you identify these data as errors (not
% features), replace them with |NaN| values as follows:
outliers = (c3 - mu3) > 2*sigma3;
c3m = c3; % Copy c3 to c3m
c3m(outliers) = NaN; % Add NaN values
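%%
% Once outliers are replaced with |NaN|, functions such as |mean| return
% |NaN| unless told to skip missing values. The following sketch
% illustrates this on a made-up vector (|demoData| and its values are
% hypothetical) and assumes a release that supports the |'omitnan'| flag
% (R2015a or later). It also uses |abs| to flag values on either side of
% the mean, a two-sided variant of the one-sided test above.
demoData = [10; 12; 11; 13; 12; 11; 10; 12; 13; 95]; % Hypothetical data, last value extreme
demoMu = mean(demoData); % Mean of the hypothetical data
demoSigma = std(demoData); % Standard deviation of the hypothetical data
demoOutliers = abs(demoData - demoMu) > 2*demoSigma; % Two-sided outlier test
demoClean = demoData; % Copy before modifying
demoClean(demoOutliers) = NaN; % Mark outliers as missing
meanWithNaN = mean(demoClean) % NaN, because the missing value propagates
meanOmitNaN = mean(demoClean,'omitnan') % Mean of the remaining values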
%% Smoothing and Filtering
% Plot the data at the third intersection as a time series, with the
% outlier replaced by |NaN| as described in Outliers:
plot(c3m,'o-')
hold on
%%
% The |NaN| value at hour 20 appears as a gap in the plot. This handling of
% |NaN| values is typical of MATLAB plotting functions.
%
% Noisy data shows random variations about expected values. You might want
% to smooth the data to reveal its main features before building a model.
% Two basic assumptions underlie smoothing:
%
% - The relationship between the predictor (time) and the response (traffic volume) is smooth.
%
% - The smoothing algorithm results in values that are better estimates of
% expected values because the noise has been reduced.
%
% Apply a simple moving average smoother to the data using the MATLAB
% |convn| function:
span = 3; % Size of the averaging window
window = ones(span,1)/span; % Equal weights that sum to 1
smoothed_c3m = convn(c3m,window,'same'); % Central part, same length as c3m

h = plot(smoothed_c3m,'ro-');
legend('Data','Smoothed Data')
%%
% The extent of the smoothing is controlled with the variable |span|. The
% averaging calculation returns |NaN| values whenever the smoothing window
% includes the |NaN| value in the data, thus increasing the size of the gap
% in the smoothed data.
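%%
% The following sketch shows the gap widening on a made-up series
% (|demoSeries| and its values are hypothetical). It also shows that
% |convn| with |'same'| treats values beyond the ends of the data as
% zeros, which pulls the first and last smoothed values toward zero. As
% one possible alternative, assuming R2016a or later, |movmean| with
% |'omitnan'| averages only the non-missing values in each window. That is
% a different smoother from the one used in this example, not a drop-in
% equivalent.
demoSeries = [6; 9; 12; NaN; 18; 21; 24]; % Hypothetical series with one missing value
demoSpan = 3; % Size of the averaging window
demoWindow = ones(demoSpan,1)/demoSpan; % Equal weights that sum to 1
demoSmoothed = convn(demoSeries,demoWindow,'same') % NaN at rows 3 to 5, low values at both ends
demoOmitNaN = movmean(demoSeries,demoSpan,'omitnan') % No gap, windows shrink at the ends
%%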
%
% The |filter| function is also used for smoothing data:
smoothed2_c3m = filter(window,1,c3m);

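delete(h) % Remove the convn-smoothed line
plot(smoothed2_c3m,'ro-');
legend('Data','Smoothed Data') % Relabel after replacing the smoothed line
hold off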
%%
% The smoothed data are shifted from the previous plot. |convn| with the
% |'same'| parameter returns the central part of the convolution, the same
% length as the data. |filter| returns the initial part of the convolution,
% the same length as the data. Otherwise, the algorithms are identical.
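%%
% The shift can be checked numerically. For an odd window length, the
% |filter| output is the |convn| |'same'| output delayed by half the window
% length, rounded down. The following sketch verifies this on a made-up
% signal (|demoSignal| and the other |demo| names are hypothetical):
demoSignal = [2; 4; 6; 8; 10; 12]; % Hypothetical data without missing values
demoWin = ones(3,1)/3; % Three-point moving average weights
viaConv = convn(demoSignal,demoWin,'same'); % Centered moving average
viaFilter = filter(demoWin,1,demoSignal); % Trailing moving average
demoDelay = (numel(demoWin)-1)/2; % Delay in samples, 1 for a window of 3
maxDifference = max(abs(viaFilter(1+demoDelay:end) - viaConv(1:end-demoDelay))) % Zero up to rounding error
%%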
%
% Smoothing estimates the center of the distribution of response values at
% each value of the predictor. Because each smoothed value is an average of
% neighboring observations, the errors in the smoothed data are correlated.
% This invalidates a basic assumption of many fitting algorithms, namely,
% that _the errors at each value of the predictor are independent_.
% Accordingly, you can use smoothed data to _identify_ a model, but avoid
% using smoothed data to _fit_ a model.