%% Selecting a Sample Size
% This example shows how to determine the number of samples or observations
% needed to carry out a statistical test. It illustrates sample size
% calculations for a simple problem, then shows how to use the
% |sampsizepwr| function to compute power and sample size for two more
% realistic problems.  Finally, it illustrates the use of 
% Statistics and Machine Learning Toolbox(TM) functions to compute the required
% sample size for a test that the |sampsizepwr| function does not support.

%   Copyright 2004-2014 The MathWorks, Inc.


%% Testing a Normal Mean with Known Standard Deviation, One-Sided
% Just to introduce some concepts, let's consider an unrealistically simple
% example where we want to test a mean and we know the standard deviation.
% Our data are continuous and can be modeled with the normal distribution.
% We need to determine a sample size N so that we can distinguish between a
% mean of 100 and a mean of 110. We know the standard deviation is 20.
%
% When we carry out a statistical test, we generally test a *null
% hypothesis* against an *alternative hypothesis*.  We find a test
% statistic T, and look at its distribution under the null hypothesis.  If
% we observe an unusual value, say one that has less than 5% chance of
% happening if the null hypothesis is true, then we reject the null
% hypothesis in favor of the alternative.  (The 5% probability is known as
% the *significance level* of the test.)  If the value is not unusual, we do
% not reject the null hypothesis.
%
% In this case the test statistic T is the sample mean.  Under the null
% hypothesis it has a mean of 100 and a standard deviation of 20/sqrt(N).
% First let's look at a fixed sample size, say N=16.  We will reject the
% null hypothesis if T is in the shaded region, which is the upper tail of
% its distribution.  That makes this a one-sided test, since we will not
% reject if T is in the lower tail.  The cutoff for this shaded region is
% about 1.64 standard deviations above the mean.

rng(0,'twister');
mu0 = 100;
sig = 20;
N = 16;
alpha = 0.05;
conf = 1-alpha;
cutoff = norminv(conf, mu0, sig/sqrt(N));
x = [linspace(90,cutoff), linspace(cutoff,127)];
y = normpdf(x,mu0,sig/sqrt(N));
h1 = plot(x,y);
xhi = [cutoff, x(x>=cutoff)];
yhi = [0, y(x>=cutoff)];
patch(xhi,yhi,'b');
title('Distribution of sample mean, N=16');
xlabel('Sample mean');
ylabel('Density');
text(96,.01,sprintf('Reject if mean>%.4g\nProb = 0.05',cutoff),'Color','b');


%%
% This is how T behaves under the null hypothesis, but what about under an
% alternative?  Our alternative distribution has a mean of 110, as
% represented by the red curve.
mu1 = 110;
y2 = normpdf(x,mu1,sig/sqrt(N));
h2 = line(x,y2,'Color','r');
yhi = [0, y2(x>=cutoff)];
patch(xhi,yhi,'r','FaceAlpha',0.25);
P = 1 - normcdf(cutoff,mu1,sig/sqrt(N));
text(115,.06,sprintf('Reject if T>%.4g\nProb = %.2g',cutoff,P),'Color',[1 0 0]);
legend([h1 h2],'Null hypothesis','Alternative hypothesis');

%%
% There is a larger chance of rejecting the null hypothesis if the
% alternative is true.  This is just what we want.  It's easier to
% visualize if we look at the cumulative distribution function (cdf)
% instead of the density (pdf).  We can read probabilities directly from
% this graph, instead of having to compute areas.

ynull = normcdf(x,mu0,sig/sqrt(N));
yalt = normcdf(x,mu1,sig/sqrt(N));
h12 = plot(x,ynull,'b-',x,yalt,'r-');
zval = norminv(conf);
cutoff = mu0 + zval * sig / sqrt(N);
line([90,cutoff,cutoff],[conf, conf, 0],'LineStyle',':');
msg = sprintf(' Cutoff = 100 + %.2g \\times 20 / \\surd{n}',zval);
text(cutoff,.15,msg,'Color','b');
text(min(x),conf,sprintf('   %g%% test',100*alpha),'Color','b',...
                         'verticalalignment','top')
palt = normcdf(cutoff,mu1,sig/sqrt(N));
line([90,cutoff],[palt,palt],'Color','r','LineStyle',':');
text(91,palt+.02,sprintf(' Power is 1-%.2g = %.2g',palt,1-palt),'Color',[1 0 0]);
legend(h12,'Null hypothesis','Alternative hypothesis');

%%
% This graph shows the probability of getting a significant statistic
% (rejecting the null hypothesis) for two different mu values when N=16.
%
% The *power function* is defined as the probability of rejecting the null
% hypothesis when the alternative is true.  It depends on the value of the
% alternative and on the sample size.  We'll graph the power (one minus the
% alternative cdf evaluated at the cutoff) as a function of N, fixing the
% alternative at 110.  We'll select N to achieve a power of 80%.  The
% graph shows that we need about N=25.

DesiredPower = 0.80;
Nvec = 1:30;
cutoff = mu0 + norminv(conf)*sig./sqrt(Nvec);
power = 1 - normcdf(cutoff, mu1, sig./sqrt(Nvec));
plot(Nvec,power,'bo-',[0 30],[DesiredPower DesiredPower],'k:');
xlabel('N = sample size')
ylabel('Power')
title('Power function for the alternative: \mu = 110')


%%
% In this deliberately simple example there is a formula that computes the
% required sample size directly for a power of 80%:
mudiff = (mu1 - mu0) / sig;
N80 = ceil(((norminv(1-DesiredPower)-norminv(conf)) / mudiff)^2)
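
%%
% As a quick check (the names cutoff80 and power80 are introduced here for
% illustration), plugging N80 back into the power calculation used above
% should confirm a power of at least 80%:
cutoff80 = mu0 + norminv(conf)*sig/sqrt(N80);
power80 = 1 - normcdf(cutoff80, mu1, sig/sqrt(N80))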

%%
% To verify that this works, let's do a Monte Carlo simulation and generate
% 400 samples of size 25, both under the null hypothesis with mean 100 and
% under the alternative hypothesis with mean 110.  If we test each sample
% to see if its mean is 100, we should expect about 5% of the first group
% and 80% of the second group to be significant.
nsamples = 400;
samplenum = 1:nsamples;
N = 25;
h0 = zeros(1,nsamples);
h1 = h0;
for j = 1:nsamples
    Z0 = normrnd(mu0,sig,N,1);
    h0(j) = ztest(Z0,mu0,sig,'Alpha',alpha,'Tail','right');
    Z1 = normrnd(mu1,sig,N,1);
    h1(j) = ztest(Z1,mu0,sig,'Alpha',alpha,'Tail','right');
end
p0 = cumsum(h0) ./ samplenum;
p1 = cumsum(h1) ./ samplenum;
plot(samplenum,p0,'b-',samplenum,p1,'r-')
xlabel('Sample number')
ylabel('Proportion significant')
title('Verification of power computation')
legend('Null hypothesis','Alternative hypothesis','Location','East')
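
%%
% The running proportions settle near the nominal values.  Reading off the
% final estimates directly, we expect values near 0.05 under the null and
% 0.80 under the alternative:
[p0(end), p1(end)]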

%% Testing a Normal Mean with Unknown Standard Deviation, Two-Sided
% Now let's suppose we don't know the standard deviation, and we want to
% perform a two-sided test, that is, one that rejects the null hypothesis
% whether the sample mean is too high or too low.
%
% The test statistic is a t statistic, which is the difference between the
% sample mean and the mean being tested, divided by the standard error of
% the mean.  Under the null hypothesis, this has Student's t distribution
% with N-1 degrees of freedom.  Under the alternative hypothesis, the
% distribution is a noncentral t distribution with a noncentrality
% parameter equal to the normalized difference between the true mean and
% the mean being tested.
%
% For this two-sided test we have to allocate the 5% chance of an error
% under the null hypothesis equally to both tails, and reject if the test
% statistic is too extreme in either direction.  We also have to consider
% both tails under any alternative.

N = 16;
df = N-1;
alpha = 0.05;
conf = 1-alpha;
cutoff1 = tinv(alpha/2,df);
cutoff2 = tinv(1-alpha/2,df);
x = [linspace(-5,cutoff1), linspace(cutoff1,cutoff2),linspace(cutoff2,5)];
y = tpdf(x,df);
h1 = plot(x,y);
xlo = [x(x<=cutoff1),cutoff1];
ylo = [y(x<=cutoff1),0];
xhi = [cutoff2,x(x>=cutoff2)];
yhi = [0, y(x>=cutoff2)];
patch(xlo,ylo,'b');
patch(xhi,yhi,'b');
title('Distribution of t statistic, N=16');
xlabel('t');
ylabel('Density');
text(2.5,.05,sprintf('Reject if t>%.4g\nProb = 0.025',cutoff2),'Color','b');
text(-4.5,.05,sprintf('Reject if t<%.4g\nProb = 0.025',cutoff1),'Color','b');
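
%%
% The prose above noted that under the alternative the statistic follows a
% noncentral t distribution.  As an illustration (a sketch; for mu1 = 110
% the noncentrality parameter sqrt(N)*(mu1-mu0)/sig is about 2), we can
% overlay that density on the same axes:
delta = sqrt(N)*(mu1-mu0)/sig;   % noncentrality parameter
yalt = nctpdf(x,df,delta);
halt = line(x,yalt,'Color','r');
legend([h1 halt],'Null: central t','Alternative: noncentral t');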


%%
% Instead of examining the power function just under the null hypothesis
% and a single alternative value of mu, we can look at it as a function of
% mu.  The power increases as mu moves away from the null hypothesis value
% in either direction.  We can use the |sampsizepwr| function to compute
% the power.  For a power calculation we need to specify a value for the
% standard deviation, which we suspect will be roughly 20.  Here's a
% picture of the power function for a sample size N=16.

N = 16;
x = linspace(90,127);
power = sampsizepwr('t',[100 20],x,[],N);
plot(x,power);
xlabel('True mean')
ylabel('Power')
title('Power function for N=16')

%%
% We want a power of 80% when the mean is 110.  According to this graph,
% our power is less than 50% with a sample size of N=16.  What sample size
% will give the power we want?

N = sampsizepwr('t',[100 20],110,0.8)

%%
% We need a sample size of about 34. Compared with the previous example, we
% need to take nine additional observations to allow a two-sided test and
% to make up for not knowing the true standard deviation.
%
% We can make a plot of the power function for various values of N.
Nvec = 2:40;
power = sampsizepwr('t',[100 20],110,[],Nvec);
plot(Nvec,power,'bo-',[0 40],[DesiredPower DesiredPower],'k:');
xlabel('N = sample size')
ylabel('Power')
title('Power function for the alternative: \mu = 110')

%%
% And we can do a simulation similar to the earlier one to verify that we
% get the power we need.

nsamples = 400;
samplenum = 1:nsamples;
N = 34;
h0 = zeros(1,nsamples);
h1 = h0;
for j = 1:nsamples
    Z0 = normrnd(mu0,sig,N,1);
    h0(j) = ttest(Z0,mu0,'Alpha',alpha);
    Z1 = normrnd(mu1,sig,N,1);
    h1(j) = ttest(Z1,mu0,'Alpha',alpha);
end
p0 = cumsum(h0) ./ samplenum;
p1 = cumsum(h1) ./ samplenum;
plot(samplenum,p0,'b-',samplenum,p1,'r-')
xlabel('Sample number')
ylabel('Proportion significant')
title('Verification of power computation')
legend('Null hypothesis','Alternative hypothesis','Location','East')

%%
% Suppose we could afford to take a sample size of 50.  Presumably our
% power for detecting the alternative value mu=110 would be larger than
% 80%.  If we maintain the power at 80%, what alternative could we
% detect?
mu1 = sampsizepwr('t',[100 20],[],.8,50)
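
%%
% As a quick check, plugging the returned alternative back into the power
% calculation should give a power of approximately 0.80:
sampsizepwr('t',[100 20],mu1,[],50)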

%% Testing a Proportion
% Now let's turn to the problem of determining the sample size needed to
% distinguish between two proportions.  Suppose that we are sampling a
% population in which about 30% favor some candidate, and we want to sample
% enough people so we can distinguish this value from 33%.
%
% The idea here is the same as before.  Here we can use the sample count as
% our test statistic.  This count has a binomial distribution.  For any
% sample size N we can compute the cutoff for rejecting the null hypothesis
% P=0.30.  For N=100, for instance, we would reject the null hypothesis if
% the sample count is larger than a cutoff value computed as follows:

N = 100;
alpha = 0.05;
p0 = 0.30;
p1 = 0.33;
cutoff = binoinv(1-alpha, N, p0)

%%
% A complication arises because the binomial distribution is discrete.  The
% probability of exceeding the cutoff value is not exactly 5%:

1 - binocdf(cutoff, N, p0)

%%
% Once again, let's compute the power against the alternative P=0.33 for a
% range of sample sizes.  We'll use a one-sided (right-tailed) test because
% we're interested only in alternative values greater than 30%.

Nvec = 50:50:2000;
power = sampsizepwr('p',p0,p1,[],Nvec,'tail','right');
plot(Nvec,power,'bo-',[0 2000],[DesiredPower DesiredPower],'k:');
xlabel('N = sample size')
ylabel('Power')
title('Power function for the alternative: p = 0.33')

%%
% We can use the |sampsizepwr| function to request the sample size required
% for a power of 80%.

approxN = sampsizepwr('p',p0,p1,0.80,[],'tail','right')

%%
% A warning message informs us that the answer is just approximate.  If we
% look at the power function for different sample sizes, we can see that
% the function is generally increasing, but irregular because the binomial
% distribution is discrete. Let's look at the probability of rejecting the
% null hypothesis for both p=0.30 and p=0.33 in the range of sample sizes
% from 1470 to 1480.

subplot(3,1,1);
Nvec = 1470:1480;
power = sampsizepwr('p',p0,p1,[],Nvec,'tail','right');
plot(Nvec,power,'ro-',[min(Nvec),max(Nvec)],[DesiredPower DesiredPower],'k:');
ylabel(sprintf('Prob[T>cutoff]\nif p=0.33'))
h_gca = gca;
h_gca.XTickLabel = '';
ylim([.78, .82]);

subplot(3,1,2);
alf = sampsizepwr('p',p0,p0,[],Nvec,'tail','right');
plot(Nvec,alf,'bo-',[min(Nvec),max(Nvec)],[alpha alpha],'k:');
ylabel(sprintf('Prob[T>cutoff]\nif p=0.30'))
h_gca = gca;
h_gca.XTickLabel = '';
ylim([.04, .06]);

subplot(3,1,3);
cutoff = binoinv(1-alpha, Nvec, p0);
plot(Nvec,cutoff,'go-');
xlabel('N = sample size')
ylabel('Cutoff')

%%
% This plot reveals that the power function curve (top plot) is not only
% irregular, but also decreases at some sample sizes.  These are the
% sample sizes for which it is necessary to increase the cutoff value
% (bottom plot) in order to keep the significance level (middle plot) no
% larger than 5%.  We can find a smaller sample size within this range
% that gives the desired power of 80%:

min(Nvec(power>=0.80))
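
%%
% As a confirmation (Nsmall is a name introduced here for illustration), we
% can check both the power and the actual significance level achieved at
% this sample size:
Nsmall = min(Nvec(power>=0.80));
sampsizepwr('p',p0,p1,[],Nsmall,'tail','right')   % power at p = 0.33
sampsizepwr('p',p0,p0,[],Nsmall,'tail','right')   % actual alpha at p = 0.30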

%% Testing a Correlation
% In the examples we've considered so far, we were able to figure out the
% cutoff for a test statistic to achieve a certain significance level, and
% to calculate the probability of exceeding that cutoff under an
% alternative hypothesis.  For our final example, let's consider a problem
% where that is not so easy.
%
% Imagine we can take samples from two variables X and Y, and we want to
% know what sample size we would need to test whether they are uncorrelated
% versus the alternative that their correlation is as high as 0.4.
% Although it is possible to work out the distribution of the sample
% correlation by transforming it to a t distribution (we use that
% transformation as a cross-check below), let's use a method that works
% even in problems where the distribution of the test statistic is
% intractable.
%
% For a given sample size, we can use Monte Carlo simulation to determine
% an approximate cutoff value for a test of the correlation.  Let's do a
% large simulation run so we can get this value accurately.  We'll start
% with a sample size of 25.

nsamples = 10000;
N = 25;
alpha = 0.05;
conf = 1-alpha;
r = zeros(1,nsamples);
for j = 1:nsamples
    xy = normrnd(0,1,N,2);
    r(j) = corr(xy(:,1),xy(:,2));
end
cutoff = quantile(r,conf)
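
%%
% Since the sample correlation can be transformed to a t statistic, we can
% cross-check the simulated cutoff against an exact value (a sketch; under
% the null, t = r*sqrt((N-2)/(1-r^2)) has a t distribution with N-2 degrees
% of freedom, and inverting this at the critical t value gives the cutoff
% for r):
tcrit = tinv(conf,N-2);
exactCutoff = tcrit/sqrt(N-2+tcrit^2)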

%%
% Then we can generate samples under the alternative hypothesis, and
% estimate the power of the test.

nsamples = 1000;
mu = [0; 0];
sig = [1 0.4; 0.4 1];
r = zeros(1,nsamples);
for j = 1:nsamples
    xy = mvnrnd(mu,sig,N);
    r(j) = corr(xy(:,1),xy(:,2));
end
[power,powerci] = binofit(sum(r>cutoff),nsamples)

%%
% We estimate the power to be 65%, and we have 95% confidence that the true
% value is between 62% and 68%.  To get a power of 80%, we need a larger
% sample size.  We might try increasing N to 50, estimating the cutoff
% value for this sample size, and repeating the power simulation.

nsamples = 10000;
N = 50;
alpha = 0.05;
conf = 1-alpha;
r = zeros(1,nsamples);
for j = 1:nsamples
    xy = normrnd(0,1,N,2);
    r(j) = corr(xy(:,1),xy(:,2));
end
cutoff = quantile(r,conf)

nsamples = 1000;
mu = [0; 0];
sig = [1 0.4; 0.4 1];
r = zeros(1,nsamples);
for j = 1:nsamples
    xy = mvnrnd(mu,sig,N);
    r(j) = corr(xy(:,1),xy(:,2));
end
[power,powerci] = binofit(sum(r>cutoff),nsamples)

%%
% This sample size gives a power better than our target of 80%.  We could
% continue experimenting this way, trying to find a sample size less than
% 50 that would meet our requirements.
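
%%
% One way to automate that experiment is a simple downward search,
% re-estimating the cutoff and the power at each candidate sample size.
% This is a minimal sketch reusing the simulation steps above; the step
% size of 5 and the variable names are illustrative choices.
for Ntry = 45:-5:30
    % estimate the cutoff under the null for this sample size
    r0 = zeros(1,10000);
    for j = 1:10000
        xy = normrnd(0,1,Ntry,2);
        r0(j) = corr(xy(:,1),xy(:,2));
    end
    cutoffTry = quantile(r0,conf);
    % estimate the power under the alternative correlation of 0.4
    r1 = zeros(1,1000);
    for j = 1:1000
        xy = mvnrnd(mu,sig,Ntry);
        r1(j) = corr(xy(:,1),xy(:,2));
    end
    fprintf('N = %d: estimated power = %.2f\n',Ntry,mean(r1>cutoffTry));
end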

%% Conclusion
% The probability functions in the Statistics and Machine Learning Toolbox
% can be used to determine the sample size required to achieve a desired
% level of power in a hypothesis test.  In some problems the sample size
% can be computed directly; in others it is necessary to search over a range
% of sample sizes until the right value is found. Random number generators
% can help verify that the desired power is met, and can also be used to
% study the power of a specific test under alternative conditions.