%% Fitting Custom Univariate Distributions, Part 2
% This example shows how to use some more advanced techniques with the
% Statistics and Machine Learning Toolbox(TM) function |mle| to fit custom
% distributions to univariate data. The techniques include fitting models
% to censored data, and the example illustrates some of the numerical
% details of fitting with custom distributions.
%
% See <docid:stats_examples.example-ex48933917> for additional examples of
% fitting custom distributions to univariate data.

% Copyright 2004-2016 The MathWorks, Inc.


%% Fitting Custom Distributions with Censored Data
% The extreme value distribution is often used to model failure times of
% mechanical parts, and experiments in such cases are sometimes only run
% for a fixed length of time. If not all of the experimental units have
% failed within that time, then the data are right-censored; that is, the
% values of some failure times are not known exactly, but are only known
% to be larger than a certain value.

%%
% The Statistics and Machine Learning Toolbox includes the function
% |evfit|, which fits an extreme value distribution to data, including
% data with censoring. However, for the purposes of this example, we will
% ignore |evfit|, and show how to use |mle| and custom distributions to
% fit a model to censored data, using the extreme value distribution.

%%
% Because the values for the censored data are not known exactly, maximum
% likelihood estimation becomes more difficult. In particular, both the
% PDF and the CDF are needed to compute the log-likelihood. Therefore, you
% must provide |mle| with functions for both of those in order to fit
% censored data. The Statistics and Machine Learning Toolbox includes the
% functions |evpdf| and |evcdf|, so for this example, the work of writing
% the code has already been done.

%%
% We'll fit the model to simulated data. The first step is to generate
% some uncensored extreme value data.
rng(0,'twister');

n = 50;
mu = 5;
sigma = 2.5;
x = evrnd(mu,sigma,n,1);

%%
% Next, we censor any values that are larger than a predetermined cutoff,
% by replacing those values with the cutoff value. This is known as Type I
% censoring, and with the cutoff at 7, about 12% of the original data end
% up being censored.
c = (x > 7);
x(c) = 7;
sum(c)/length(c)

%%
% We can plot a histogram of these data, including a stacked bar to
% represent the censored observations.
[uncensCnts,binCtrs] = hist(x(~c));
censCnts = hist(x(c),binCtrs);
bar(binCtrs,[uncensCnts' censCnts'],'stacked');

%%
% Although there is censoring, the fraction of censored observations is
% relatively small, and so the method of moments can provide a reasonable
% starting point for the parameter estimates. We compute the values of mu
% and sigma that correspond to the observed mean and standard deviation of
% the uncensored data.
sigma0 = sqrt(6)*std(x(~c))./pi
mu0 = mean(x(~c))-psi(1).*sigma0
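%%
% Before calling |mle|, it can be instructive to evaluate the censored
% log-likelihood at these starting values by hand (a quick check we've
% added here; it is not part of the original example). Uncensored
% observations contribute the log of their PDF values, and censored
% observations contribute the log of their survival function values,
% log(1-CDF). This is the quantity that |mle| will maximize.
checkLL = sum(log(evpdf(x(~c),mu0,sigma0))) + sum(log(1-evcdf(x(c),mu0,sigma0)))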
%%
% In addition to passing the data, x, and handles to the PDF and CDF
% functions into |mle|, we also use the 'censoring' parameter to pass in
% the censoring vector, c. Because the scale parameter, sigma, must be
% positive, we specify lower parameter bounds. |mle| returns the maximum
% likelihood estimates of the two extreme value distribution parameters,
% mu and sigma, as well as approximate 95% confidence intervals.
[paramEsts,paramCIs] = mle(x, 'censoring',c, 'pdf',@evpdf, 'cdf',@evcdf, ...
                           'start',[mu0 sigma0], 'lower',[-Inf,0])

%% Some Numerical Issues in Fitting Custom Distributions
% Fitting a custom distribution requires an initial guess for the
% parameters, and it's often difficult to know how good or bad a starting
% point is a priori. In the previous example, if we had picked a starting
% point that was farther away from the maximum likelihood estimates, some
% of the observations could have been very far out in the tails of the
% extreme value distribution corresponding to the starting point. One of
% two things might then have happened.
%
% First, one of the PDF values might have become so small that it
% underflowed to zero in double precision arithmetic. Second, one of the
% CDF values might have become so close to 1 that it rounded up in double
% precision. (It's also possible that a CDF value might have become so
% small as to underflow, but that turns out not to be a problem.)
%
% Either of these conditions causes problems when |mle| computes the
% log-likelihood, because they lead to log-likelihood values of -Inf, and
% the optimization algorithm in |mle| cannot normally be expected to step
% out of such regions.

%%
% Knowing what the maximum likelihood estimates are, let's see what
% happens with a different starting point.
start = [1 1];
try
    [paramEsts,paramCIs] = mle(x, 'censoring',c, 'pdf',@evpdf, 'cdf',@evcdf, ...
                               'start',start, 'lower',[-Inf,0])
catch ME
    disp(ME.message)
end

%%
% In this case, the second problem has occurred: some of the CDF values at
% the initial parameter guess are computed as exactly 1, and so the
% log-likelihood is infinite. We could try setting |mle|'s 'FunValCheck'
% control parameter to 'off', which would disable checking for non-finite
% likelihood values, and then hope for the best. But the right way to
% solve this numerical problem is at its root, and in this case that's not
% hard to do.
%
% Notice that the extreme value CDF is of the form
%
%    p = 1 - exp( -exp((x-mu)./sigma) )
%
% The contribution of the censored observations to the log-likelihood is
% the log of their survival function (SF) values, i.e., log(1-CDF). For
% the extreme value distribution, the log of the SF is just
% -exp((x-mu)./sigma). If we could compute the log-likelihood using the
% log SF directly (instead of, in effect, computing
% log(1 - (1-exp(logSF)))), we would avoid the rounding issues with the
% CDF. That's because observations whose CDF values are not
% distinguishable from 1 in double precision have log SF values that are
% still easily representable as nonzero values. For example, a CDF value
% of (1 - 1e-20) rounds to 1 in double precision, because double precision
% |eps| is about 2e-16.
SFval = 1e-20;
CDFval = 1 - SFval

%%
% However, the log of the corresponding SF value, i.e., log(1-CDF), is
% easily represented.
log(SFval)
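%%
% To make the contrast explicit (an illustration added here, not part of
% the original example), compute the log SF naively from the rounded CDF
% value. Because CDFval has already rounded to exactly 1 in double
% precision, the naive computation log(1-CDF) returns -Inf, while the
% direct computation log(SFval) above gave an ordinary finite value of
% about -46.
naiveLogSF = log(1 - CDFval)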
%%
% A similar observation can be made about using the log PDF rather than
% the PDF itself -- the contribution of uncensored observations to the
% log-likelihood is the log of their PDF values. Using the log PDF
% directly (instead of, in effect, computing log(exp(logPDF))) avoids
% underflow problems where the PDF is not distinguishable from zero in
% double precision, but the log PDF is still easily representable as a
% finite negative number. For example, a PDF value of 1e-400 underflows in
% double precision, because double precision |realmin| is about 2e-308.
logPDFval = -921;
PDFval = exp(logPDFval)

%%
% |mle| provides a syntax for specifying a custom distribution using the
% log PDF and the log SF (rather than the PDF and CDF), via the 'logpdf'
% and 'logsf' parameters. Unlike the PDF and CDF, there are no existing
% functions for the extreme value log PDF and log SF, so we'll create
% anonymous functions that compute these values:
evlogpdf = @(x,mu,sigma) ((x - mu) ./ sigma - exp((x - mu) ./ sigma)) - log(sigma);
evlogsf = @(x,mu,sigma) -exp((x-mu)./sigma);

%%
% Using the same starting point, the alternate logPDF/logSF specification
% of the extreme value distribution makes the problem solvable:
start = [1 1];
[paramEsts,paramCIs] = mle(x, 'censoring',c, 'logpdf',evlogpdf, 'logsf',evlogsf, ...
                           'start',start, 'lower',[-Inf,0])

%%
% However, this strategy cannot always mitigate a poor starting point, and
% a careful choice of starting point is always recommended.

%% Supplying a Gradient
% By default, |mle| uses the function |fminsearch| to find parameter
% values that maximize the log-likelihood for the data. |fminsearch| uses
% an optimization algorithm that is derivative-free, and is often a good
% choice.
%
% However, for some problems, choosing an optimization algorithm that uses
% the derivatives of the log-likelihood function can make the difference
% between converging to the maximum likelihood estimates or not,
% especially when the starting point is far away from the final answer.
% Providing the derivatives can also sometimes speed up the convergence.
%
% If your MATLAB(R) installation includes the Optimization Toolbox(TM),
% |mle| allows you to use the function |fmincon|, which includes
% optimization algorithms that can use derivative information. To take
% best advantage of the algorithms in |fmincon|, you can specify a custom
% distribution using a log-likelihood function, written to return not only
% the log-likelihood itself, but its gradient as well. The gradient of the
% log-likelihood function is simply the vector of its partial derivatives
% with respect to its parameters.
%
% This strategy requires extra preparation: we have to write code that
% computes both the log-likelihood and its gradient. For this example,
% we've created the code to do that for the extreme value distribution as
% a separate file, <matlab:edit('evnegloglike.m') |evnegloglike.m|>.
type evnegloglike.m

%%
% Notice that the function |evnegloglike| returns the _negative_ of both
% the log-likelihood values and the gradient values, because |mle|
% _minimizes_ that negative log-likelihood.

%%
% To compute the maximum likelihood estimates using a gradient-based
% optimization algorithm, we use the 'nloglf' parameter, specifying that
% we are providing a handle to a function that computes the negative
% log-likelihood, and the 'optimfun' parameter, specifying |fmincon| as
% the optimization function. |mle| will automatically detect that
% |evnegloglike| can return both the negative log-likelihood and its
% gradient.
start = [1 1];
[paramEsts,paramCIs] = mle(x, 'censoring',c, 'nloglf',@evnegloglike, ...
                           'start',start, 'lower',[-Inf,0], 'optimfun','fmincon')
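%%
% As a final sanity check (added here for illustration; it is not part of
% the original example), we can compare these estimates against |evfit|,
% the built-in extreme value fitting function that we set aside at the
% start. Both approaches maximize the same censored likelihood, so the
% estimates and confidence intervals should agree closely.
[evfitEsts,evfitCIs] = evfit(x,0.05,c)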