Statistics Review 4: Sample Size Calculations. Whitley E, Ball J. Critical Care. 2002; 6: 335-341. Available online from the Baystate Health Sciences Library, or from PubMed at your institution.
I am always pleased when I see articles expounding the importance of sample size and statistical power in clinical research – particularly when they appear in clinical journals. I was happy to see the Whitley and Ball article in Critical Care. Although the article is several years old, the issues discussed are still – and will continue to be – relevant. I just want to reinforce and expand on a couple of the concepts.
First, among some
disciplines, there exists an idea – a myth, really – that sample size and
statistical power are only relevant for randomized clinical trials. I don’t know where this idea got started or
how it continues, but it is completely untrue.
Any time a researcher wishes to describe a comparison with a p-value, he
must also address statistical power. This
applies to all study types – randomized controlled trials, cohort studies,
case-control studies, retrospective chart reviews (this last is actually not a
study type, but rather a method of data collection), etc. If the study uses a data analytic approach
that generates a p-value, then statistical power should be addressed.
The above issue
becomes clear when one considers the reasoning underlying statistical analysis
and sample size. Suppose a researcher
posits a question about some clinical effect – e.g., Does an exposure cause
disease? Does the treatment reduce morbidity?
Does a medication reduce pain?
Since he doesn’t know the correct answer, he designs a study to answer
the question. (If he knew the right
answer, he wouldn’t have to conduct the study, right?) He wants to make a “generalizable” statement
(e.g., the effect is generally real for all patients). However, he isn’t studying all patients – he
is only studying a sample of patients.
So, his conclusion will have some error because he is basing his results
on only one sample. In designing his
study (before he collects any data), there are two possible errors he could
make in his conclusion.
A) He could conclude that there is an effect in his sample,
when one truly doesn’t exist in the population.
This is the α-error (“alpha” or Type 1) and this is what the p-value
reflects.
B) He could conclude from his sample that there is no effect,
when one truly exists in the population.
This is the β-error (“beta” or Type 2) and this is what power reflects –
actually, 1-β = power.
It is important to
note that these errors are “conditional”.
That is, α-error is the probability of rejecting the null hypothesis if
the null hypothesis is true. β-error is the probability of NOT finding an
effect if the null hypothesis is false.
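To put those conditional definitions in symbols (using H0 as my shorthand for the null hypothesis, which the article writes out in words):

    α = P(reject H0 | H0 is true)
    β = P(fail to reject H0 | H0 is false)
    power = 1 − β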
At the start of the study, the investigator doesn’t know the true status
of the null hypothesis, so BOTH errors have to be addressed. Among other things, this is precisely what a
sample size calculation does. It
incorporates estimates of both errors into the computation of sample size.
Once the study is
conducted, the result will fall into one of two states with respect to the null hypothesis:
A) If the results reject the null hypothesis with a p <
0.05, then the investigator has either
found a true effect or made a Type 1 error.
B) If the results fail to reject the null hypothesis, then the
investigator has either demonstrated that there is truly no effect or he has a
Type 2 error.
Notice that because
the results are based on a sample and not the population, there will always be
some error in the interpretation, regardless of what the p-value is. Notice also that nothing in this
discussion applies only to an RCT-type study. The errors and
uncertainty facing the investigator are present regardless of study type.
The second point
about statistical power discussed by the authors is the effect size. I think this is a fundamental issue because
when sample size and statistical power are addressed, the investigator
establishes the clinical context within which study results will be
interpreted. How does he do that with
sample size? Consider the factors that
go into computing a sample size: type 1
error, type 2 error, some measure of variability (e.g., variance or standard deviation)
and the specified difference in the groups or the treatment/exposure effect.
Once these four factors are specified, the sample size can be computed.
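As a rough sketch of how those four ingredients combine, here is a small Python example (mine, not the authors’) based on the standard normal-approximation formula for comparing two means; the α, power, standard deviation, and clinically important difference are hypothetical values chosen purely for illustration.

```python
import math
from scipy.stats import norm

def n_per_group(alpha, power, sd, delta):
    """Approximate sample size per group for a two-sided comparison of two means,
    using the normal-approximation formula:
        n = 2 * (z_{1 - alpha/2} + z_{power})**2 * sd**2 / delta**2
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # quantile for the two-sided Type 1 error
    z_beta = norm.ppf(power)           # quantile for the desired power (1 - beta)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

# Hypothetical inputs: 5% Type 1 error, 80% power, an outcome with a standard
# deviation of 10, and a clinically important difference of 5 between groups.
print(n_per_group(alpha=0.05, power=0.80, sd=10, delta=5))  # roughly 63 per group
```

Note that the difference enters the formula as a square in the denominator, so halving the clinically important difference quadruples the required sample size.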
It is the last component – the specified
difference in the groups or treatment/exposure effect – where the investigator
defines what is clinically important.
After all, isn’t that the goal of clinical research? To comment on clinically important effects? It has been the eternal frustration of
statisticians to have an investigator ask, “How many patients do I need to find
a statistically significant result?”
Asking a question in this manner indicates that the investigator has not
considered his study within a clinical context.
Similarly, a protocol that proposes to review 100 charts without any
sample size justification essentially suffers from the same limitation – the investigator
has not begun to think of his study (or the subsequent results) within a
clinical context.
Luckily, at
Baystate there are people in Academic Affairs who can help clinicians work
through the issues of sample size and statistical power. A discussion with a statistician is a great
way to address sample size. There are a
lot of options and configurations to consider, and clinicians should be aware of
how these options may affect their study.
P.S. (To answer the question, “How many charts
should I review to find statistical significance?” Well, if clinical relevance
is . . . irrelevant . . . , then I think it is safe to say that if one reviews
10,000 charts, something will turn up statistically significant. Better yet, review 20,000 charts just to be
sure.)
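P.P.S. For the skeptical reader, here is a toy simulation (my own, with made-up numbers: 10,000 “charts” and 100 candidate variables that are nothing but noise) of why an unfocused review of enough charts will almost always turn up something “statistically significant” by chance alone.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# 10,000 simulated charts, 100 candidate variables, and an arbitrary two-group
# split. Every variable is pure noise, so any "significant" result is a Type 1 error.
n_charts, n_vars = 10_000, 100
group = rng.integers(0, 2, size=n_charts)
data = rng.normal(size=(n_charts, n_vars))

p_values = [ttest_ind(data[group == 0, j], data[group == 1, j]).pvalue
            for j in range(n_vars)]

# With 100 tests at alpha = 0.05, roughly 100 * 0.05 = 5 spurious "findings"
# are expected even though there is no real effect anywhere.
print(sum(p < 0.05 for p in p_values), "of", n_vars, "noise variables reached p < 0.05")
```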