%% Outlier Detection Using Quantile Regression
% This example shows how to detect outliers using quantile random forest.
%%
% An outlier is an observation that is located "far enough" from most of
% the other observations in a data set, and so it is considered anomalous.
% Causes of outlying observations include inherent variability or
% measurement error.  Outliers can have a significant impact on estimates
% and inference, so it is important to detect them and decide whether to
% remove them or consider a robust analysis.
%%
% The Statistics and Machine Learning Toolbox(TM) provides several
% functions for detecting or removing outliers.  This list shows a few of
% them; a minimal |zscore| sketch follows the list.
%
% * <docid:stats_ug.btg0kob |zscore|> - Compute _z_ scores of observations.
% * <docid:stats_ug.f3655195 |trimmean|> - Estimate mean of data excluding
% outliers.
% * <docid:stats_ug.bu180jd |boxplot|> - Draw box plot of data.
% * <docid:stats_ug.bu27blg |probplot|> - Draw probability plot.
% * <docid:stats_ug.bu2xi0_ |robustcov|> - Estimate robust covariance of
% multivariate data.
% * <docid:stats_ug.bt9w6j6 |fitcsvm|> - Fit a one-class support vector
% machine (SVM) to determine which observations are located far from the
% decision boundary.
%
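%%
% As a minimal sketch (not part of the original example, with illustrative
% variable names), the |zscore| approach flags observations whose absolute
% _z_ score exceeds a threshold such as 3.
xDemo = [randn(100,1); 8];   % synthetic data with one obvious outlier
zDemo = zscore(xDemo);       % standardize to zero mean and unit variance
find(abs(zDemo) > 3)         % indices flagged by the |z| > 3 rule of thumb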
%%
% To demonstrate outlier detection, this example:
%
% # Generates data from a nonlinear model with heteroscedasticity, and
% simulates a few outliers.
% # Grows a quantile random forest of regression trees.
% # Estimates conditional quartiles ($Q_1$, $Q_2$, and $Q_3$) and the
% interquartile range ($IQR$) within the ranges of the predictor variables.
% # Compares the observations to the quantities $F_1 = Q_1 - 1.5IQR$ and
% $F_2 = Q_3 + 1.5IQR$.  Any observation that is less than $F_1$ or greater
% than $F_2$ is an outlier.
%
%% Generate Data
% Generate 500 observations from the model
%
% $$y_t = 10 + 3t + t\sin(2t) + \varepsilon_t.$$
%
% $t$ is uniformly distributed between 0 and $4\pi$ and $\varepsilon_t\sim
% N(0,t+0.01)$.  Store the data in a table.
n = 500;
rng('default'); % For reproducibility
t = randsample(linspace(0,4*pi,1e6),n,true)';
epsilon = randn(n,1).*sqrt((t+0.01));
y = 10 + 3*t + t.*sin(2*t) + epsilon;

Tbl = table(t,y);
%%
% Move five observations in a random vertical direction by 90% of the value
% of the response.
numOut = 5;
[~,idx] = datasample(Tbl,numOut);
Tbl.y(idx) = Tbl.y(idx) + randsample([-1 1],numOut,true)'.*(0.9*Tbl.y(idx));
%%
% Draw a scatter plot of the data and identify the outliers.
figure;
plot(Tbl.t,Tbl.y,'.');
hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
axis tight;
ylabel('y');
xlabel('t');
title('Scatter Plot of Data');
legend('Data','Simulated outliers','Location','NorthWest');
%% Grow Quantile Random Forest
% Grow a bag of 200 regression trees using <docid:stats_ug.brz_kvr-1
% |TreeBagger|>.
Mdl = TreeBagger(200,Tbl,'y','Method','regression');
%% 
% |Mdl| is a <docid:stats_ug.brz_iy1-1 |TreeBagger|> ensemble.
%% Predict Conditional Quartiles and Interquartile Ranges
% Using quantile regression, estimate the conditional quartiles of the
% response at 50 equally spaced values within the range of |t|.
tau = [0.25 0.5 0.75];
predT = linspace(0,4*pi,50)';
quartiles = quantilePredict(Mdl,predT,'Quantile',tau);

%%
% |quartiles| is a 50-by-3 matrix of conditional quartiles.  Rows
% correspond to the query values in |predT| and columns correspond to the
% probabilities in |tau|.
%%
% On the scatter plot of the data, plot the conditional mean and median
% responses.
meanY = predict(Mdl,predT);

plot(predT,[quartiles(:,2) meanY],'LineWidth',2);
legend('Data','Simulated outliers','Median response','Mean response',...
    'Location','NorthWest');
hold off;
%%
% Although the conditional mean and median curves are close, the mean curve
% tends to be influenced by the simulated outliers.
%%
% Compute the conditional $IQR$, $F_1$, and $F_2$.
iqr = quartiles(:,3) - quartiles(:,1);
k = 1.5;
f1 = quartiles(:,1) - k*iqr;
f2 = quartiles(:,3) + k*iqr;
%%
% |k = 1.5| means that any observation less than |f1| or greater than |f2|
% is considered an outlier, but this value does not distinguish extreme
% outliers.  A value of |k = 3| identifies extreme outliers.
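%%
% As a sketch (not part of the original example), the extreme-outlier
% fences corresponding to |k = 3| can be computed in the same way.
kExt = 3;                           % multiplier for extreme outliers
f1Ext = quartiles(:,1) - kExt*iqr;  % lower extreme fence
f2Ext = quartiles(:,3) + kExt*iqr;  % upper extreme fence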
%% Compare Observations to $F_1$ and $F_2$ 
% Plot the observations, $F_1$, and $F_2$.
figure;
plot(Tbl.t,Tbl.y,'.');
hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
plot(predT,[f1 f2]);
legend('Data','Simulated outliers','F_1','F_2','Location','NorthWest');
axis tight
title('Outlier Detection Using Quantile Regression')
hold off
%%
% Observations that are located below $F_1$ or above $F_2$ are outliers.
% $F_1$ and $F_2$ detect all of the simulated outliers, and some other
% observations fall outside these curves as well.
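%%
% As a sketch (not part of the original example, with illustrative
% variable names), the fences can be interpolated at each observed |t| to
% flag outliers programmatically rather than visually.
F1obs = interp1(predT,f1,Tbl.t);            % lower fence at each observation
F2obs = interp1(predT,f2,Tbl.t);            % upper fence at each observation
isOut = (Tbl.y < F1obs) | (Tbl.y > F2obs);  % flagged observations
sum(isOut)                                  % number of flagged observations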
%%
% Quantile random forest can detect outliers with respect to the
% conditional distribution of $Y$ given the predictors, but this method
% cannot detect outliers in the predictor data.  For outlier detection in
% the predictor data using a bag of decision trees, see the
% |OutlierMeasure| property of a |TreeBagger| model.
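%%
% As a sketch (assuming the |fillprox| method of |TreeBagger|, which fills
% the |Proximity| and |OutlierMeasure| properties), training observations
% can be ranked by their outlier measure in the predictor space.
MdlProx = fillprox(Mdl);                                % compute proximities
[~,rankIdx] = sort(MdlProx.OutlierMeasure,'descend');   % rank by outlier measure
rankIdx(1:5)                                            % five most outlying observations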