%% Outlier Detection Using Quantile Regression
% This example shows how to detect outliers using quantile random forest.
%%
% An outlier is an observation that is located "far enough" from most of
% the other observations in a data set, and so it is considered anomalous.
% Causes of outlying observations include inherent variability or
% measurement error. Outliers can have a significant impact on estimates
% and inference, so it is important to detect them and decide whether to
% remove them or use a robust analysis.
%%
% Statistics and Machine Learning Toolbox(TM) provides several
% functions to detect or remove outliers, including:
%
% * <docid:stats_ug.btg0kob |zscore|> - Compute _z_ scores of observations.
% * <docid:stats_ug.f3655195 |trimmean|> - Estimate the mean of data,
% excluding outliers.
% * <docid:stats_ug.bu180jd |boxplot|> - Draw a box plot of data.
% * <docid:stats_ug.bu27blg |probplot|> - Draw a probability plot.
% * <docid:stats_ug.bu2xi0_ |robustcov|> - Estimate the robust covariance
% of multivariate data.
% * <docid:stats_ug.bt9w6j6 |fitcsvm|> - Fit a one-class support vector
% machine (SVM) to determine which observations are far from the decision
% boundary.
%
%%
% To demonstrate outlier detection, this example:
%
% # Generates data from a nonlinear model with heteroscedasticity, and
% simulates a few outliers.
% # Grows a quantile random forest of regression trees.
% # Estimates conditional quartiles ($Q_1$, $Q_2$, and $Q_3$) and the
% interquartile range ($IQR$) within the ranges of the predictor variables.
% # Compares the observations to the fences $F_1 = Q_1 - 1.5IQR$ and
% $F_2 = Q_3 + 1.5IQR$. Any observation that is less than $F_1$ or greater
% than $F_2$ is an outlier.
%% Generate Data
% Generate 500 observations from the model
%
% $$y_t = 10 + 3t + t\sin(2t) + \varepsilon_t.$$
%
% $t$ is uniformly distributed between 0 and $4\pi$, and $\varepsilon_t\sim
% N(0,t+0.01)$. Store the data in a table.
n = 500;
rng('default'); % For reproducibility
t = randsample(linspace(0,4*pi,1e6),n,true)';
epsilon = randn(n,1).*sqrt(t+0.01);
y = 10 + 3*t + t.*sin(2*t) + epsilon;
Tbl = table(t,y);
%%
% Move five observations in a random vertical direction by 90% of the value
% of the response.
numOut = 5;
[~,idx] = datasample(Tbl,numOut);
Tbl.y(idx) = Tbl.y(idx) + randsample([-1 1],numOut,true)'.*(0.9*Tbl.y(idx));
%%
% Draw a scatter plot of the data and identify the outliers.
figure;
plot(Tbl.t,Tbl.y,'.');
hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
axis tight;
ylabel('y');
xlabel('t');
title('Scatter Plot of Data');
legend('Data','Simulated outliers','Location','NorthWest');
%% Grow Quantile Random Forest
% Grow a bag of 200 regression trees using <docid:stats_ug.brz_kvr-1
% |TreeBagger|>.
Mdl = TreeBagger(200,Tbl,'y','Method','regression');
%%
% |Mdl| is a <docid:stats_ug.brz_iy1-1 |TreeBagger|> ensemble.
%% Predict Conditional Quartiles and Interquartile Ranges
% Using quantile regression, estimate the conditional quartiles at
% 50 equally spaced values within the range of |t|.
tau = [0.25 0.5 0.75];
predT = linspace(0,4*pi,50)';
quartiles = quantilePredict(Mdl,predT,'Quantile',tau);
%%
% |quartiles| is a 50-by-3 matrix of conditional quartiles. Rows
% correspond to the values in |predT| and columns correspond to the
% probabilities in |tau|.
%%
% On the scatter plot of the data, plot the conditional mean and median
% responses.
meanY = predict(Mdl,predT);
plot(predT,[quartiles(:,2) meanY],'LineWidth',2);
legend('Data','Simulated outliers','Median response','Mean response',...
    'Location','NorthWest');
hold off;
%%
% Although the conditional mean and median curves are close, the mean curve
% tends to be influenced by the simulated outliers.
%%
% Compute the conditional $IQR$, $F_1$, and $F_2$.
iqr = quartiles(:,3) - quartiles(:,1);
k = 1.5;
f1 = quartiles(:,1) - k*iqr;
f2 = quartiles(:,3) + k*iqr;
%%
% |k = 1.5| means that all observations less than |f1| or greater than |f2|
% are considered outliers, but this value does not distinguish extreme
% outliers from mild ones. Setting |k| to |3| identifies extreme outliers
% only.
%% Compare Observations to $F_1$ and $F_2$
% Plot the observations, $F_1$, and $F_2$.
figure;
plot(Tbl.t,Tbl.y,'.');
hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
plot(predT,[f1 f2]);
legend('Data','Simulated outliers','F_1','F_2','Location','NorthWest');
axis tight
title('Outlier Detection Using Quantile Regression')
hold off
%%
% Observations that are located below $F_1$ or above $F_2$ are outliers.
% $F_1$ and $F_2$ detect all of the simulated outliers, and some other
% observations fall outside these curves as well.
%%
% A quantile random forest can detect outliers with respect to the
% conditional distribution of $Y$ given the predictors, but it cannot
% detect outliers in the predictor data. For outlier detection in the
% predictor data using a bag of decision trees, see the |OutlierMeasure|
% property of a |TreeBagger| model.
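%%
% The fences are estimated only at the query points |predT|, while the
% observations lie at arbitrary values of |t|. As a sketch (the variable
% names |f1Obs|, |f2Obs|, and |isOut| are illustrative, not part of the
% original example), one way to flag the outliers programmatically is to
% linearly interpolate each fence at the observed |t| values and compare:
f1Obs = interp1(predT,f1,Tbl.t); % F_1 evaluated at each observation
f2Obs = interp1(predT,f2,Tbl.t); % F_2 evaluated at each observation
isOut = Tbl.y < f1Obs | Tbl.y > f2Obs; % logical index of flagged outliers
fprintf('Flagged %d of %d observations as outliers.\n',sum(isOut),n)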