www.gusucode.com > stats_featured 源码程序 matlab案例代码 > stats_featured/cfitdfitdemo.m

    %% Curve Fitting and Distribution Fitting
% This example shows the difference between fitting a curve to a set
% of points, and fitting a probability distribution to a sample of data.
%
% A common question is, "I have some data and I want to fit a Weibull
% distribution.  What MATLAB(R) functions can I use to do Weibull curve
% fitting?" Before answering that question, we need to figure out what kind
% of data analysis is really appropriate.

%   Copyright 2005-2014 The MathWorks, Inc.



%% Curve Fitting
% Consider an experiment where we measure the concentration of a compound in
% blood samples taken from several subjects at various times after taking an
% experimental medication.
time = [ 0.1   0.1   0.3   0.3   1.3   1.7   2.1   2.6   3.9   3.9 ...
         5.1   5.6   6.2   6.4   7.7   8.1   8.2   8.9   9.0   9.5 ...
         9.6  10.2  10.3  10.8  11.2  11.2  11.2  11.7  12.1  12.3 ...
        12.3  13.1  13.2  13.4  13.7  14.0  14.3  15.4  16.1  16.1 ...
        16.4  16.4  16.7  16.7  17.5  17.6  18.1  18.5  19.3  19.7];
conc = [0.01  0.08  0.13  0.16  0.55  0.90  1.11  1.62  1.79  1.59 ...
        1.83  1.68  2.09  2.17  2.66  2.08  2.26  1.65  1.70  2.39 ...
        2.08  2.02  1.65  1.96  1.91  1.30  1.62  1.57  1.32  1.56 ...
        1.36  1.05  1.29  1.32  1.20  1.10  0.88  0.63  0.69  0.69 ...
        0.49  0.53  0.42  0.48  0.41  0.27  0.36  0.33  0.17  0.20];
plot(time,conc,'o');
xlabel('Time');
ylabel('Blood Concentration');
%%
% Notice that we have one response variable, blood concentration, and one
% predictor variable, time after ingestion.  The predictor data are assumed to
% be measured with little or no error, while the response data are assumed
% to be affected by experimental error.  The main objective in analyzing data
% like these is often to define a model that predicts the response variable.
% That is, we are trying to describe the trend line, or the mean response of y
% (blood concentration), as a function of x (time).  This is curve fitting
% with bivariate data.
%
% Based on theoretical models of absorption into and breakdown in the
% bloodstream, we might, for example, decide that the concentrations ought to
% follow a Weibull curve as a function of time.  The Weibull curve, which
% takes the form
%
% $$y = c * (x/a)^{(b-1)} * exp(-(x/a)^b),$$
%
% is defined with three parameters: the first scales the curve along the
% horizontal axis, the second defines the shape of the curve, and the third
% scales the curve along the vertical axis.  Notice that while this curve has
% almost the same form as the Weibull probability density function, it is not
% a density because it includes the parameter c, which is necessary to allow
% the curve's height to adjust to data.
%
% We can fit the Weibull model using nonlinear least squares.
modelFun =  @(p,x) p(3) .* (x ./ p(1)).^(p(2)-1) .* exp(-(x ./ p(1)).^p(2));
startingVals = [10 2 5];
coefEsts = nlinfit(time, conc, modelFun, startingVals);
xgrid = linspace(0,20,100);
line(xgrid, modelFun(coefEsts, xgrid), 'Color','r');
%%
% This scatterplot suggests that the measurement errors do not have equal
% variance, but rather, that their variance is proportional to the height of
% the mean curve.  One way to address this problem would be to use weighted
% least squares.  However, there is another potential problem with this fit.
%
% The Weibull curve we're using, as with other similar models such as
% Gaussian, gamma, or exponential curves, is often used as a model when the
% response variable is nonnegative.  Ordinary least squares curve fitting is
% appropriate when the experimental errors are additive and can be considered
% as independent draws from a symmetric distribution, with constant variance.
% However, if that were true, then in this example it would be possible to
% measure negative blood concentrations, which is clearly not reasonable.
%
% It might be more realistic to assume multiplicative errors, symmetric on the
% log scale.  We can fit a Weibull curve to the data under that assumption by
% logging both the concentrations and the original Weibull curve itself.  That
% is, we can use nonlinear least squares to fit the curve
%
% $$log(y) = log(c) + (b-1)*log(x/a) - (x/a)^b.$$
coefEsts2 = nlinfit(time, log(conc), @(p,x)log(modelFun(p,x)), startingVals);
line(xgrid, modelFun(coefEsts2, xgrid), 'Color',[0 .5 0]);
legend({'Raw Data' 'Additive Errors Model' 'Multiplicative Errors Model'});
%%
% This model fit should be accompanied by estimates of precision and followed
% by checks on the model's goodness of fit.  For example, we should make
% residual plots on the log scale to check the assumption of constant variance
% for the multiplicative errors.  For simplicity we'll leave that out here.
%
% In this example, using the multiplicative errors model made little
% difference in the model's predictions, but that's not always the case.  An
% example where it does matter is described in the <docid:stats_examples.example-ex54296189> example.


%% Functions for Curve Fitting
% |MATLAB| and several toolboxes contain functions that can used to perform
% curve fitting.  The Statistics and Machine Learning Toolbox(TM) includes the functions |nlinfit|, for
% nonlinear least squares curve fitting, and |glmfit|, for fitting Generalized
% Linear Models.  The Curve Fitting Toolbox(TM) provides command line and
% graphical tools that simplify many of the tasks in curve fitting, including
% automatic choice of starting coefficient values for many models, as well as
% providing robust and nonparametric fitting methods.  Many complicated types
% of curve fitting analyses, including models with constraints on the
% coefficients, can be done using functions in the Optimization Toolbox(TM).  The
% |MATLAB| function |polyfit| fits polynomial models, and |fminsearch| can be
% used in many other kinds of curve fitting.


%% Distribution Fitting
% Now consider an experiment where we've measured the time to failure for 50
% identical electrical components.
life = [ 6.2 16.1 16.3 19.0 12.2  8.1  8.8  5.9  7.3  8.2 ...
        16.1 12.8  9.8 11.3  5.1 10.8  6.7  1.2  8.3  2.3 ...
         4.3  2.9 14.8  4.6  3.1 13.6 14.5  5.2  5.7  6.5 ...
         5.3  6.4  3.5 11.4  9.3 12.4 18.3 15.9  4.0 10.4 ...
         8.7  3.0 12.1  3.9  6.5  3.4  8.5  0.9  9.9  7.9];
%%
% Notice that only one variable has been measured -- the components'
% lifetimes. There is no notion of response and predictor variables; rather,
% each observation consists of just a single measurement.  The objective of an
% analysis for data like these is not to predict the lifetime of a new
% component given a value of some other variable, but rather to describe the
% full distribution of possible lifetimes.  This is distribution fitting with
% univariate data.
%
% One simple way to visualize these data is to make a histogram.
binWidth = 2;
binCtrs = 1:binWidth:19;
hist(life,binCtrs);
xlabel('Time to Failure');
ylabel('Frequency');
ylim([0 10]);
%%
% It might be tempting to think of this histogram as a set of (x,y) values,
% where each x is a bin center and each y is a bin height.
h_gca = gca;
h = h_gca.Children;
h.FaceColor = [.98 .98 .98];
h.EdgeColor = [.94 .94 .94];
counts = hist(life,binCtrs);
hold on
plot(binCtrs,counts,'o');
hold off
%%
% We might then try to fit a curve through those points to model the data.
% Since lifetime data often follow a Weibull distribution, and we might even
% think to use the Weibull curve from the curve fitting example above.
coefEsts = nlinfit(binCtrs,counts,modelFun,startingVals);
%%
% However, there are some potential pitfalls in fitting a curve to a
% histogram, and this simple fit is not appropriate.  First, the bin counts
% are nonnegative, implying that measurement errors cannot be symmetric.
% Furthermore, the bin counts have different variability in the tails than in
% the center of the distribution.  They also have a fixed sum, implying that
% they are not independent measurements.  These all violate basic assumptions
% of least squares fitting.
%
% It's also important to recognize that the histogram really represents a
% scaled version of an empirical probability density function (PDF).  If we
% fit a Weibull curve to the bar heights, we would have to constrain the curve
% to be properly normalized.
%
% These problems might be addressed by using a more appropriate least squares
% fit.  But another concern is that for continuous data, fitting a model
% based on the histogram bin counts rather than the raw data throws away
% information.  In addition, the bar heights in the histogram are very
% dependent on the choice of bin edges and bin widths.  While it is possible
% to fit distributions in this way, it's usually not the best way.

%%
% For many parametric distributions, maximum likelihood is a much better way
% to estimate parameters.  Maximum likelihood does not suffer from any of the
% problems mentioned above, and it is often the most efficient method in the
% sense that results are as precise as is possible for a given amount of data.
%
% For example, the function |wblfit| uses maximum likelihood to fit a Weibull
% distribution to data.  The Weibull PDF takes the form
%
% $$y = (a/b) * (x/a)^{(b-1)} * exp(-(x/a)^b).$$
%
% Notice that this is almost the same functional form as the Weibull curve
% used in the curve fitting example above.  However, there is no separate
% parameter to independently scale the vertical height.  Being a PDF, the
% function always integrates to 1.
%
% We'll fit the data with a Weibull distribution, then plot a histogram of the
% data, scaled to integrate to 1, and superimpose the fitted PDF.
paramEsts = wblfit(life);
n = length(life);
prob = counts / (n * binWidth);
bar(binCtrs,prob,'hist');
h_gca = gca;
h = h_gca.Children;
h.FaceColor = [.9 .9 .9];
xlabel('Time to Failure');
ylabel('Probability Density');
ylim([0 0.1]);
xgrid = linspace(0,20,100);
pdfEst = wblpdf(xgrid,paramEsts(1),paramEsts(2));
line(xgrid,pdfEst)
%%
% Maximum likelihood can, in a sense, be thought of as finding a Weibull PDF
% to best match the histogram.  But it is not done by minimizing the sum of
% squared differences between the PDF and the bar heights.
%
% As with the curve fitting example above, this model fit should be
% accompanied by estimates of precision and followed by checks on the model's
% goodness of fit; for simplicity we'll leave that out here.


%% Functions for Distribution Fitting
% The Statistics and Machine Learning Toolbox includes functions, such as |wblfit|, for fitting
% many different parametric distributions using maximum likelihood, and the
% function |mle| can be used to fit custom distributions for which dedicated
% fitting functions are not explicitly provided.  The function |ksdensity|
% fits a nonparametric distribution model to data.  The Statistics and Machine Learning Toolbox
% also provides the GUI |dfittool|, which simplifies many of the tasks in
% distribution fitting, including generation of various visualizations and
% diagnostic plots.  Many types of complicated distributions, including
% distributions with constraints on the parameters, can be fit using functions
% in the Optimization Toolbox. Finally, the |MATLAB| function |fminsearch| can
% be used in many kinds of maximum likelihood distribution fitting.

%%
% Although fitting a curve to a histogram is usually not optimal, there are
% sensible ways to apply special cases of curve fitting in certain
% distribution fitting contexts.  One method, applied on the cumulative
% probability (CDF) scale instead of the PDF scale, is described in the
% <docid:stats_examples.example-ex48933917> example.


%% Summary
% It's not uncommon to do curve fitting with a model that is a scaled
% version of a common probability density function, such as the Weibull,
% Gaussian, gamma, or exponential.  Curve fitting and distribution fitting can
% be easy to confuse in these cases, but the two are very different kinds of
% data analysis.
%
% * *Curve fitting* involves modelling the trend or mean of a response
% variable as a function of a second predictor variable.  The model usually
% must include a parameter to scale the height of the curve, and may also
% include an intercept term.  The appropriate plot for the data is an x-y
% scatterplot.
%
% * *Distribution fitting* involves modelling the probability distribution of
% a single variable.  The model is a normalized probability density function.
% The appropriate plot for the data is a histogram.