www.gusucode.com > stats_featured 源码程序 matlab案例代码 > stats_featured/glmdemo.m
%% Fitting Data with Generalized Linear Models % This example shows how to fit and evaluate generalized linear models % using |glmfit| and |glmval|. Ordinary linear regression can be used to % fit a straight line, or any function that is linear in its parameters, to % data with normally distributed errors. This is the most commonly used % regression model; however, it is not always a realistic one. Generalized % linear models extend the linear model in two ways. First, assumption of % linearity in the parameters is relaxed, by introducing the link function. % Second, error distributions other than the normal can be modeled % Copyright 2000-2008 The MathWorks, Inc. %% Generalized Linear Models % A regression model defines the distribution of a response variable (often % generically denoted as y) in terms of one or more predictor variables % (often denoted x1, x2, etc.). The most commonly used regression model, % the ordinary linear regression, models y as a normal random variable, whose % mean is linear function of the predictors, b0 + b1*x1 + ... , and whose % variance is constant. In the simplest case of a single predictor x, the % model can be represented as a straight line with Gaussian distributions % about each point. mu = @(x) -1.9+.23*x; x = 5:.1:15; yhat = mu(x); dy = -3.5:.1:3.5; sz = size(dy); k = (length(dy)+1)/2; x1 = 7*ones(sz); y1 = mu(x1)+dy; z1 = normpdf(y1,mu(x1),1); x2 = 10*ones(sz); y2 = mu(x2)+dy; z2 = normpdf(y2,mu(x2),1); x3 = 13*ones(sz); y3 = mu(x3)+dy; z3 = normpdf(y3,mu(x3),1); plot3(x,yhat,zeros(size(x)),'b-', ... x1,y1,z1,'r-', x1([k k]),y1([k k]),[0 z1(k)],'r:', ... x2,y2,z2,'r-', x2([k k]),y2([k k]),[0 z2(k)],'r:', ... x3,y3,z3,'r-', x3([k k]),y3([k k]),[0 z3(k)],'r:'); zlim([0 1]); xlabel('X'); ylabel('Y'); zlabel('Probability density'); grid on; view([-45 45]); %% % In a generalized linear model, the mean of the response is modeled as a % monotonic nonlinear transformation of a linear function of the % predictors, g(b0 + b1*x1 + ...) . The inverse of the transformation g is % known as the "link" function. Examples include the logit (sigmoid) % link and the log link. Also, y may have a non-normal distribution, such % as the binomial or Poisson. For example, a Poisson regression with log % link and a single predictor x can be represented as an exponential curve % with Poisson distributions about each point. mu = @(x) exp(-1.9+.23*x); x = 5:.1:15; yhat = mu(x); x1 = 7*ones(1,5); y1 = 0:4; z1 = poisspdf(y1,mu(x1)); x2 = 10*ones(1,7); y2 = 0:6; z2 = poisspdf(y2,mu(x2)); x3 = 13*ones(1,9); y3 = 0:8; z3 = poisspdf(y3,mu(x3)); plot3(x,yhat,zeros(size(x)),'b-', ... [x1; x1],[y1; y1],[z1; zeros(size(y1))],'r-', x1,y1,z1,'r.', ... [x2; x2],[y2; y2],[z2; zeros(size(y2))],'r-', x2,y2,z2,'r.', ... [x3; x3],[y3; y3],[z3; zeros(size(y3))],'r-', x3,y3,z3,'r.'); zlim([0 1]); xlabel('X'); ylabel('Y'); zlabel('Probability'); grid on; view([-45 45]); %% Fitting a Logistic Regression % This example involves an experiment to help model the proportion of cars % of various weights that fail a mileage test. The data include % observations of weight, number of cars tested, and number failed. % A set of car weights weight = [2100 2300 2500 2700 2900 3100 3300 3500 3700 3900 4100 4300]'; % The number of cars tested at each weight tested = [48 42 31 34 31 21 23 23 21 16 17 21]'; % The number of cars failing the test at each weight failed = [1 2 0 3 8 8 14 17 19 15 17 21]'; % The proportion of cars failing for each weight proportion = failed ./ tested; plot(weight,proportion,'s') xlabel('Weight'); ylabel('Proportion'); %% % This graph is a plot of the proportion of cars failing, as a function of % weight. It's reasonable to assume that the failure counts came from a % binomial distribution, with a probability parameter P that increases with % weight. But how exactly should P depend on weight? % % We can try fitting a straight line to these data. linearCoef = polyfit(weight,proportion,1); linearFit = polyval(linearCoef,weight); plot(weight,proportion,'s', weight,linearFit,'r-', [2000 4500],[0 0],'k:', [2000 4500],[1 1],'k:') xlabel('Weight'); ylabel('Proportion'); %% % There are two problems with this linear fit: % % 1) The line predicts proportions less than 0 and greater than 1. % % 2) The proportions are not normally distributed, since they are % necessarily bounded. This violates one of the assumptions required for % fitting a simple linear regression model. % % Using a higher-order polynomial may appear to help. [cubicCoef,stats,ctr] = polyfit(weight,proportion,3); cubicFit = polyval(cubicCoef,weight,[],ctr); plot(weight,proportion,'s', weight,cubicFit,'r-', [2000 4500],[0 0],'k:', [2000 4500],[1 1],'k:') xlabel('Weight'); ylabel('Proportion'); %% % However, this fit still has similar problems. The graph shows that the % fitted proportion starts to decrease as weight goes above 4000; in fact % it will become negative for larger weight values. And of course, the % assumption of a normal distribution is still violated. % % Instead, a better approach is to use |glmfit| to fit a logistic % regression model. Logistic regression is a special case of a generalized % linear model, and is more appropriate than a linear regression for these % data, for two reasons. First, it uses a fitting method that is % appropriate for the binomial distribution. Second, the logistic link % limits the predicted proportions to the range [0,1]. % % For logistic regression, we specify the predictor matrix, and a matrix % with one column containing the failure counts, and one column containing % the number tested. We also specify the binomial distribution and the % logit link. [logitCoef,dev] = glmfit(weight,[failed tested],'binomial','logit'); logitFit = glmval(logitCoef,weight,'logit'); plot(weight,proportion,'bs', weight,logitFit,'r-'); xlabel('Weight'); ylabel('Proportion'); %% % As this plot indicates, the fitted proportions asymptote to zero and one % as weight becomes small or large. %% Model Diagnostics % The |glmfit| function provides a number of outputs for examining the fit % and testing the model. For example, we can compare the deviance values % for two models to determine if a squared term would improve the fit % significantly. [logitCoef2,dev2] = glmfit([weight weight.^2],[failed tested],'binomial','logit'); pval = 1 - chi2cdf(dev-dev2,1) %% % The large p-value indicates that, for these data, a quadratic term does % not improve the fit significantly. A plot of the two fits shows there is % little difference in the fits. logitFit2 = glmval(logitCoef2,[weight weight.^2],'logit'); plot(weight,proportion,'bs', weight,logitFit,'r-', weight,logitFit2,'g-'); legend('Data','Linear Terms','Linear and Quadratic Terms','Location','northwest'); %% % To check the goodness of fit, we can also look at a probability plot of % the Pearson residuals. These are normalized so that when the model is a % reasonable fit to the data, they have roughly a standard normal % distribution. (Without this standardization, the residuals would have % different variances.) [logitCoef,dev,stats] = glmfit(weight,[failed tested],'binomial','logit'); normplot(stats.residp); %% % The residual plot shows a nice agreement with the normal distribution. %% Evaluating the Model Predictions % Once we are satisfied with the model, we can use it to make predictions, % including computing confidence bounds. Here we predict the expected % number of cars, out of 100 tested, that would fail the mileage test at % each of four weights. weightPred = 2500:500:4000; [failedPred,dlo,dhi] = glmval(logitCoef,weightPred,'logit',stats,.95,100); errorbar(weightPred,failedPred,dlo,dhi,':'); %% Link Functions for Binomial Models % For each of the five distributions that |glmfit| supports, there is a % canonical (default) link function. For the binomial distribution, the % canonical link is the logit. However, there are also three other links % that are sensible for binomial models. All four maintain the mean % response in the interval [0, 1]. eta = -5:.1:5; plot(eta,1 ./ (1 + exp(-eta)),'-', eta,normcdf(eta), '-', ... eta,1 - exp(-exp(eta)),'-', eta,exp(-exp(eta)),'-'); xlabel('Linear function of predictors'); ylabel('Predicted mean response'); legend('logit','probit','complementary log-log','log-log','location','east'); %% % For example, we can compare a fit with the probit link to one with the % logit link. probitCoef = glmfit(weight,[failed tested],'binomial','probit'); probitFit = glmval(probitCoef,weight,'probit'); plot(weight,proportion,'bs', weight,logitFit,'r-', weight,probitFit,'g-'); legend('Data','Logit model','Probit model','Location','northwest'); %% % It's often difficult for the data to distinguish between these four % link functions, and a choice is often made on theoretical grounds.