%% Preprocessing Data
% This example shows how to preprocess data for analysis.

%% Overview
% Begin a data analysis by loading data into suitable MATLAB(R) container
% variables and sorting out the "good" data from the "bad."
% This is a preliminary step that assures meaningful conclusions in
% subsequent parts of the analysis.

%% Loading the Data
% Begin by loading the data in |count.dat|:

% Copyright 2015 The MathWorks, Inc.

load count.dat

%%
% The 24-by-3 array |count| contains hourly traffic counts (the rows)
% at three intersections (the columns) for a single day.

%% Missing Data
% The MATLAB |NaN| (Not a Number) value is normally used to represent missing
% data. |NaN| values allow variables with missing data to maintain their
% structure - in this case, 24-by-1 vectors with consistent indexing across
% all three intersections.
%
% Check the data at the third intersection for |NaN| values using the
% |isnan| function:
c3 = count(:,3); % Data at intersection 3
c3NaNCount = sum(isnan(c3))

%%
% |isnan| returns a logical vector the same size as |c3|, with entries
% indicating the presence (|1|) or absence (|0|) of |NaN| values for each of the
% 24 elements in the data. In this case, the logical values sum to |0|, so
% there are no |NaN| values in the data.
%
% |NaN| values are introduced into the data in the section on Outliers.

%% Outliers
% Outliers are data values that are dramatically different from patterns in
% the rest of the data. They might be due to measurement error, or they
% might represent significant features in the data. Identifying outliers,
% and deciding what to do with them, depends on an understanding of the
% data and its source.
%
% One common method for identifying outliers is to look for values more
% than a certain number of standard deviations $\sigma$ from the mean $\mu$.
% The following code plots a histogram of the data at the third
% intersection together with lines at $\mu$ and $\mu + n\sigma$, for
% $n = 1, 2$:
bin_counts = hist(c3); % Histogram bin counts
N = max(bin_counts); % Maximum bin count
mu3 = mean(c3); % Data mean
sigma3 = std(c3); % Data standard deviation

hist(c3) % Plot histogram
hold on
plot([mu3 mu3],[0 N],'r','LineWidth',2) % Mean
X = repmat(mu3+(1:2)*sigma3,2,1);
Y = repmat([0;N],1,2);
plot(X,Y,'g','LineWidth',2) % Standard deviations
legend('Data','Mean','Stds')
hold off

%%
% The plot shows that some of the data are more than two standard
% deviations above the mean. If you identify these data as errors (not
% features), replace them with |NaN| values as follows:
outliers = (c3 - mu3) > 2*sigma3;
c3m = c3; % Copy c3 to c3m
c3m(outliers) = NaN; % Add NaN values

%% Smoothing and Filtering
% A time-series plot of the data at the third intersection (with the
% outlier removed in Outliers) results in the following plot:
plot(c3m,'o-')
hold on

%%
% The |NaN| value at hour 20 appears as a gap in the plot. This handling of
% |NaN| values is typical of MATLAB plotting functions.
%
% Noisy data shows random variations about expected values. You might want
% to smooth the data to reveal its main features before building a model.
% Two basic assumptions underlie smoothing:
%
% - The relationship between the predictor (time) and the response
% (traffic volume) is smooth.
%
% - The smoothing algorithm results in values that are better estimates of
% expected values because the noise has been reduced.
%
% Apply a simple moving average smoother to the data using the MATLAB
% |convn| function:
span = 3; % Size of the averaging window
window = ones(span,1)/span;
smoothed_c3m = convn(c3m,window,'same');

h = plot(smoothed_c3m,'ro-');
legend('Data','Smoothed Data')

%%
% The extent of the smoothing is controlled with the variable |span|.
% The averaging calculation returns |NaN| values whenever the smoothing
% window includes the |NaN| value in the data, thus increasing the size of
% the gap in the smoothed data.
%
% The |filter| function is also used for smoothing data:
smoothed2_c3m = filter(window,1,c3m);

delete(h)
plot(smoothed2_c3m,'ro-');

%%
% The smoothed data are shifted from the previous plot. |convn| with the
% |'same'| parameter returns the central part of the convolution, the same
% length as the data. |filter| returns the initial part of the convolution,
% the same length as the data. Otherwise, the algorithms are identical.
%
% Smoothing estimates the center of the distribution of response values at
% each value of the predictor. It invalidates a basic assumption of many
% fitting algorithms, namely, that _the errors at each value of the
% predictor are independent_. Accordingly, you can use smoothed data to
% _identify_ a model, but avoid using smoothed data to _fit_ a model.
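
%% Sketch: Checking the Shift Between |convn| and |filter|
% The following is a minimal sketch, not part of the original example. It
% uses a made-up ramp signal and illustrative variable names (everything
% ending in |_demo| is hypothetical) to suggest that, for an odd window
% size, the |filter| output lags the |convn(...,'same')| output by
% |(span-1)/2| samples and otherwise holds the same values.
span_demo = 3;                                 % Odd window size (assumed)
window_demo = ones(span_demo,1)/span_demo;     % Moving-average weights
x_demo = (1:10)';                              % Hypothetical test signal
y_conv = convn(x_demo,window_demo,'same');     % Central part of the convolution
y_filt = filter(window_demo,1,x_demo);         % Initial part of the convolution
lag_demo = (span_demo-1)/2;                    % Expected shift in samples

% Compare the overlapping samples after undoing the shift. Under the
% assumptions above, the difference should be zero up to rounding.
max_shifted_diff = max(abs(y_conv(1:end-lag_demo) - y_filt(1+lag_demo:end)))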