
Statistics By Jim

Making statistics intuitive

How to Calculate Sample Size Needed for Power

By Jim Frost

Determining a good sample size for a study is always an important issue. After all, using the wrong sample size can doom your study from the start. Fortunately, power analysis can find the answer for you. Power analysis combines statistical analysis, subject-area knowledge, and your requirements to help you derive the optimal sample size for your study.

Statistical power in a hypothesis test is the probability that the test will detect an effect that actually exists. As you’ll see in this post, both under-powered and over-powered studies are problematic. Let’s learn how to find a good sample size for your study! Learn more about Statistical Power .

When you perform hypothesis testing, there is a lot of preplanning you must do before collecting any data. This planning includes identifying the data you will gather, how you will collect it, and how you will measure it, among many other details. A crucial part of the planning is determining how much data you need to collect. I’ll show you how to estimate the sample size for your study.

Before we get to estimating sample size requirements, let’s review the factors that influence statistical significance. This process will help you see the value of formally going through a power and sample size analysis rather than guessing.

Related post : 5 Steps for Conducting Scientific Studies with Statistical Analyses

Factors Involved in Statistical Significance

Look at the chart below and identify which study found a real treatment effect and which one didn’t. Within each study, the difference between the treatment group and the control group is the sample estimate of the effect size.

A bar chart that displays the treatment and control group for two studies. Study A has a larger effect size than study B.

Did either study obtain significant results? The estimated effects in both studies can represent either a real effect or random sample error. You don’t have enough information to make that determination. Hypothesis tests incorporate these considerations to determine whether the results are statistically significant.

  • Effect size : The larger the effect size, the less likely it is to be random error. It’s clear that Study A exhibits a more substantial effect in the sample—but that’s insufficient by itself.
  • Sample size : Larger sample sizes allow hypothesis tests to detect smaller effects. If Study B’s sample size is large enough, its more modest effect can be statistically significant.
  • Variability : When your sample data have greater variability, random sampling error is more likely to produce considerable differences between the experimental groups even when there is no real effect. If the sample data in Study A have sufficient variability, random error might be responsible for the large difference.

Hypothesis testing takes all of this information and uses it to calculate the p-value —which you use to determine statistical significance. The key takeaway is that the statistical significance of any effect depends collectively on the size of the effect, the sample size, and the variability present in the sample data. Consequently, you cannot determine a good sample size in a vacuum because the three factors are intertwined.

Related post : How Hypothesis Tests Work

Statistical Power of a Hypothesis Test

Because we’re talking about determining the sample size for a study that has not been performed yet, you need to learn about a fourth consideration—statistical power. Statistical power is the probability that a hypothesis test correctly detects an effect that actually exists in the population. In other words, it is the probability that the test correctly rejects a false null hypothesis. Consequently, power is the complement of the Type II error rate (β): Power = 1 – β. The power of the test depends on the other three factors.

For example, if your study has 80% power, it has an 80% chance of detecting an effect that exists. Let this point be a reminder that when you work with samples, nothing is guaranteed! When an effect actually exists in the population, your study might not detect it because you are working with a sample. Samples contain sampling error, which can occasionally cause a random sample to misrepresent the population.

Related post : Types of Errors in Hypothesis Testing

Goals of a Power and Sample Size Analysis

Power analysis involves taking these three considerations, adding subject-area knowledge, and managing tradeoffs to settle on a sample size. During this process, you must rely heavily on your expertise to provide reasonable estimates of the input values.

Power analysis helps you manage an essential tradeoff. As you increase the sample size, the hypothesis test gains a greater ability to detect small effects. This situation sounds great. However, larger sample sizes cost more money. And, there is a point where an effect becomes so minuscule that it is meaningless in a practical sense.

You don’t want to collect a large and expensive sample only to be able to detect an effect that is too small to be useful! Nor do you want an underpowered study that has a low probability of detecting an important effect. Your goal is to collect a large enough sample to have sufficient power to detect a meaningful effect—but not too large to be wasteful.

As you’ll see in the upcoming examples, the analyst provides numeric values that correspond to “a good chance” and “meaningful effect.” These values allow you to tailor the analysis to your needs.

All of these details might sound complicated, but a statistical power analysis helps you manage them. In fact, going through this procedure forces you to focus on the relevant information. Typically, you specify three of the four factors discussed above and your statistical software calculates the remaining value. For instance, if you specify the smallest effect size that is practically significant, variability, and power, the software calculates the required sample size.

Let’s work through some examples in different scenarios to bring this to life.

2-Sample t-Test Power Analysis for Sample Size

Suppose we’re conducting a 2-sample t-test to determine which of two materials is stronger. If one type of material is significantly stronger than the other, we’ll use that material in our process. Furthermore, we’ve tested these materials in a pilot study, which provides background knowledge for the estimates.

In a power and sample size analysis, statistical software presents you with a dialog box something like the following:

Power and sample size analysis dialog box for 2-sample t-test.

We’ll go through these fields one-by-one. First off, we will leave Sample sizes blank because we want the software to calculate this value.

Differences

Differences is often a confusing value to enter. Do not enter your guess for the difference between the two types of material. Instead, use your expertise to identify the smallest difference that is still meaningful for your application. In other words, you consider smaller differences to be inconsequential. It would not be worthwhile to expend resources to detect them.

By choosing this value carefully, you tailor the experiment so that it has a reasonable chance of detecting useful differences while allowing smaller, non-useful differences to remain potentially undetected. This value helps prevent us from collecting an unnecessarily large sample.

For our example, we’ll enter 5 because smaller differences are unimportant for our process.

Power values

Power values is where we specify the probability that the statistical hypothesis test detects the difference in the sample if that difference exists in the population. This field is where you define the “reasonable chance” that I mentioned earlier. If you hold the other input values constant and increase the test’s power, the required sample size also increases. The proper value to enter in this field depends on norms in your study area or industry. Common power values are 0.8 and 0.9.

We’ll enter a power of 0.9 so that the 2-sample t-test has a 90% chance of detecting a difference of 5.

Standard deviation

Standard deviation is the field where we enter the data variability. We need to enter an estimate for the standard deviation of material strength. Analysts frequently base these estimates on pilot studies and historical research data. Inputting better variability estimates will produce more reliable power analysis results. Consequently, you should strive to improve these estimates over time as you perform additional studies and testing. Providing good estimates of the standard deviation is often the most difficult part of a power and sample size analysis.

For our example, we’ll assume that the two types of material have a standard deviation of 4 units of strength. After we click OK, we see the results.

Related post : Measures of Variability

Interpreting the Statistical Power Analysis and Sample Size Results

Statistical power and sample size analysis provides both numeric and graphical results, as shown below.

Statistical output for the power and sample size analysis for the 2-sample t-test.

The text output indicates that we need 15 samples per group (total of 30) to have a 90% chance of detecting a difference of 5 units.

The dot on the Power Curve corresponds to the information in the text output. However, by studying the entire graph, we can learn additional information about how statistical power varies by the difference. If we start at the dot and move down the curve to a difference of 2.5, we learn that the test has a power of approximately 0.4 (40%). This power is too low. However, we indicated that differences less than 5 were not practically significant to our process. Consequently, having low power to detect a difference of 2.5 is not problematic.

Conversely, follow the curve up from the dot and notice how power quickly increases to nearly 100% before we reach a difference of 6. This design satisfies the process requirements while using a manageable sample size of 15 per group.
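If you prefer to check this calculation outside of Minitab, the same numbers can be reproduced in Python with the statsmodels package. This is just an illustrative sketch, not the tool used in this article, and it assumes the standardized effect size is the smallest meaningful difference divided by the standard deviation (5 / 4).

```python
# Illustrative sketch using Python's statsmodels (not the tool used in the article).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d = 5 / 4  # standardized effect size: smallest meaningful difference / standard deviation

# Solve for the per-group sample size at 90% power and a two-sided alpha of 0.05.
n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.90,
                                    alternative='two-sided')
print(round(n_per_group))  # about 15 per group, matching the text output above

# Power for the smaller difference of 2.5 with 15 per group.
power_small = analysis.power(effect_size=2.5 / 4, nobs1=15, alpha=0.05)
print(round(power_small, 2))  # roughly 0.4, consistent with the power curve
```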

Other Power Analysis Options

Now, let’s explore a few more options that are available for power analysis. This time we’ll use a one-tailed test and have the software calculate a value other than sample size.

Suppose we are again comparing the strengths of two types of material. However, in this scenario, we are currently using one kind of material and are considering switching to another. We will change to the new material only if it is stronger than our current material. Again, the smallest difference in strength that is meaningful to our process is 5 units. The standard deviation in this study is now 7. Further, let’s assume that our company uses a standard sample size of 20, and we need approval to increase it to 40. Because the standard deviation (7) is larger than the smallest meaningful difference (5), we might need a larger sample.

In this scenario, the test needs to determine only whether the new material is stronger than the current material. Consequently, we can use a one-tailed test. This type of test provides greater statistical power to determine whether the new material is stronger than the old material, but no power to determine if the current material is stronger than the new—which is acceptable given the dictates of the new scenario.

In this analysis, we’ll enter the two potential values for Sample sizes and leave Power values blank. The software will estimate the power of the test for detecting a difference of 5 for designs with both 20 and 40 samples per group.

We fill in the dialog box as follows:

Power and sample size analysis dialog box for a one-sided 2-sample t-test.

And, in Options , we choose the following one-tailed test:

Options for the power and sample size analysis dialog box.

Interpreting the Power and Sample Size Results

Statistical output for the power and sample size analysis for the one-sided 2-sample t-test.

The statistical output indicates that a design with 20 samples per group (a total of 40) has a ~72% chance of detecting a difference of 5. Generally, this power is considered to be too low. However, a design with 40 samples per group (80 total) achieves a power of ~94%, which is almost always acceptable. Hopefully, the power analysis convinces management to approve the larger sample size.

Assess the Power Curve graph to see how the power varies by the difference. For example, the curve for the sample size of 20 indicates that the smaller design does not achieve 90% power until the difference is approximately 6.5. If increasing the sample size is genuinely cost prohibitive, perhaps settling for 90% power at a difference of 6.5, rather than 5, is acceptable. Use your process knowledge to make this type of determination.
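As a quick cross-check, the same one-sided comparison can be sketched in Python with statsmodels (an assumed substitute for the Minitab dialog shown above), using a standardized effect size of 5 / 7.

```python
# Illustrative sketch with statsmodels; the article itself used Minitab.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d = 5 / 7  # smallest meaningful difference / standard deviation

for n in (20, 40):
    p = analysis.power(effect_size=d, nobs1=n, alpha=0.05, alternative='larger')
    print(n, round(p, 2))  # approximately 0.72 for 20 per group and 0.94 for 40 per group
```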

Use Power Analysis for Sample Size Estimation For All Studies

Throughout this post, we’ve been looking at continuous data, and using the 2-sample t-test specifically. For continuous data, you can also use power analysis to assess sample sizes for ANOVA and DOE designs. Additionally, there are hypothesis tests for other types of data , such as proportions tests ( binomial data ) and rates of occurrence (Poisson data). These tests have their own corresponding power and sample analyses.

In general, when you move away from continuous data to these other types of data, your sample size requirements increase. And, there are unique intricacies in each. For instance, in a proportions test, you need a relatively larger sample size to detect a difference when your proportion is closer to 0 or 1 than when it is in the middle (0.5). Many factors can affect the optimal sample size. Power analysis helps you navigate these concerns.

After reading this post, I hope you see how power analysis combines statistical analyses, subject-area knowledge, and your requirements to help you derive the optimal sample size for your specific needs. If you don’t perform this analysis, you risk performing a study that is either likely to miss an important effect or have an exorbitantly large sample size. I’ve written a post about a Mythbusters experiment that had no chance of detecting an effect because they guessed a sample size instead of performing a power analysis.

In this post, I’ve focused on how power affects your test’s ability to detect a real effect. However, low power tests also exaggerate effect sizes !

Finally, experimentation is an iterative process. As you conduct more studies in an area, you’ll develop better estimates to input into power and sample size analyses and gain a clearer picture of how to proceed.


Reader Interactions


July 10, 2024 at 4:22 am

Thank you for this wonderful article, your articles are always very informative & helpful!

I have 2 questions regarding sample size, I’m running an experiment and: 1) I was under the impression that you don’t need to have determined what model you will eventually be using, but rather you determine the sample size beforehand & all is well. However, I do need to assume some things… I have G*Power and am not sure what ‘statistical test’ to choose! All I know is to choose the ‘F-test’. I know I have to determine the effect size, mention the number of groups, but am not sure where to add variance (or what it means exactly), nor what stat test to choose. I am assuming it is important to mention that my dependent variable is an ordinal scale (arguably can be run using OLS); not sure whether I am ‘comparing means between more than 2 groups’ or if there is something else I should assume/consider. Also, I’m not sure the ‘linear’ regression option would work.

2) I already ran a large pilot study, how can I use that to determine the sample size (I’m afraid my pilot study is at/close to the required size anyway…)

Navigating the net for help on power analysis has strangely been quite difficult, so I appreciate your help!


May 11, 2024 at 2:18 am

Thank you Mr. Jim for such a brief explanation on power analysis for sample size. I read several explanations on this topic but none got through to me, but your explanation made it understandable.

Regards, Roopini


April 15, 2024 at 6:56 pm

Jim, Are you able to share what statistical software was used for your examples? Are there equations that can be typed into Excel to determine sample size and power? Is there free, reliable statistical analysis software you can recommend for calculating sample size and power?

Thank you! Suzann


April 16, 2024 at 3:37 pm

I used Minitab statistical software for the examples. Unfortunately, I don’t believe Excel has this feature built into it. However, there is a free Power and Sample Size analysis software that I highly recommend. It’s called G*Power . Click the link to get it for free!


May 24, 2024 at 2:13 pm

Jim, The post is really informative but I want to know how to use power analysis to find a correlation between 2 variables

May 25, 2024 at 5:27 pm

You can use power analysis to determine the sample size you’d need to detect a correlation of a particular strength with a specified power.

I recommend using the free power analysis tool called G*Power. Below I show an example of using it to find the sample size I’d need to detect a correlation of 0.7 with 95% power. The answer is a sample size of 20. See how I set up G*Power to get this answer below.

Power analysis for correlation.

October 24, 2022 at 8:29 pm

Hi again Jim, apologies if this was posted multiple times but I looked into the Bonferroni Correction and saw that this was the equation αnew = αoriginal / n

αoriginal: The original α level. n: The total number of comparisons or tests being performed. Seeing this, would 6000 or 1000 be the n in my case? Would I also have to perform this once or more than once? Second question: after finding this out, when performing the power analysis that you mentioned, do I have to do it multiple times to account for the different combinations of the states that I will match with each other?

October 24, 2022 at 10:23 pm

In this context, n is the number of comparisons between groups. If you want to compare all groups to each other (i.e., all pairwise comparisons), then with 6 groups you’ll have 15 comparisons. So, n = 15. However, you don’t necessarily need to compare all groups. It depends on your research question. If you can avoid all pairwise comparisons, it’s a good thing. Just decide on your comparisons and record it in your plans before proceeding with the project. If you wait until after analyzing the data, you might (even if subconsciously) be tempted to cherry pick the comparisons that give good results.

As an example of an alternative to all pairwise comparisons, you might compare five of the states to one reference state in your sample. That reduces the pairwise comparisons (n) from 15 to 5. That helps because you’re dividing alpha by the number of comparisons. A lower n won’t lower your Bonferroni corrected significance level as much:

0.05/15 = 0.003
0.05/5 = 0.01

You’ll need an extremely low p-value to reach significance with 15 comparisons. For more information, see Using Post Hoc Tests with ANOVA. Of course, you’re not working with ANOVA, but if you need information about what the familywise error rate is and why you need to control it, it’ll be helpful. The same ideas will apply to the multiple comparisons you’re making with the 2 proportions test. In your case, if you go with 15 comparisons (all pairwise for the 6 states), your familywise error rate is 0.54. Over a 50% chance of a false positive!

October 21, 2022 at 8:59 pm

Hello again Jim, I looked on your other page about the margin of error and I had a few extra questions. The approach I would be taking is as you said taking with using 1000 people from each for a comparison with the surveys. I saw the formula that you had so would my confidence level for this instance be 95%? Also as your formula is listed would my bottom number be 1000 as well or would it be 6000, or would I have to complete this one instead Finding the Margin of Error for Other Percentages formula.

October 23, 2022 at 4:27 pm

Typically, surveys don’t delve so deep into statistical differences between groups in the responses. At least not that I’ve seen. Usually, they’ll calculate and report the margin of error. If the margins don’t overlap, you can assume the difference is statistically significant. However, as I point out in the margin of error post, that process is conservative because the difference can be statistically significant even with a little overlap.

What you need to do for your case is perform a power analysis for a two-sample proportions test. That’s beyond what most public opinion surveys do but will get you the answers you need. In your case, one proportion is the proportion of individuals in state A who respond a particular way to a survey item, and the other is the proportion in state B who respond that way to the item.

I didn’t realize that you were performing hypothesis testing with your survey data, or I would’ve mentioned this from the start! Because you’re comparing six states, you’re also facing the problem of multiple comparisons increasing the familywise error rate for that set of comparisons. You’ll need to use something like a Bonferroni correction to appropriately lower the significance level you use, which will affect the numbers you need for a particular power.

I hope that helps!

October 20, 2022 at 4:33 pm

Hello Jim, I am hoping you can offer some guidance for me here. I am currently doing an assignment involving this subject and my professor said this to me: “There’s no rationale for the six thousand surveys. How did you arrive at your sample size? You need to report the power analysis (and numbers you used in that analysis) to arrive at your chosen sample size – like everything else in scientific writing, the sample size needs justification.” My study involves six states and getting specific individuals’ opinions from each state about crime and how it has affected them. Surveys are my choice here, so my question is how would I arrive at a sample size? I had thought 6,000 was a starting point but am unsure if that’s right?

October 21, 2022 at 4:11 pm

With surveys you typically calculate the sample size to produce a specific margin of error . Click the link to learn more about that and how to tell whether there are differences. It’s a little different process than power analysis in other contexts, but it’s related. The big questions are how precise do you want your estimates to be? And if you have groups you want to compare, that can affect the calculations.

For instance, 6,000 would generally be considered a large sample size for survey research. However, if you’re comparing subgroups within your sample, that can affect how many you need. I don’t know if you plan to do this or not, but if you wanted to compare the differences between the six states, that means you’d have about 1,000 per state. That’s still fairly decent but you’ll have a larger margin of error. You’ll need to know whether your primary interest is estimates for the total sample or differences between subgroups. If it’s differences between subgroups, that always increases your required sample size.

That’s not to say that 1000 per state isn’t enough. I don’t know. But you’d do the margin of error calculations to see if it produces sufficient precision for your needs. The process involves a combination of doing the MoE calculations and knowing the required precision (or possibly standards in your subject area).


October 15, 2022 at 2:38 am

So can a “power analysis” be done to get the sample size for a proposed survey instead of calculating for the sample size? In other words, is a “power analysis” the same as calculating for the sample size when doing a research study? Thank you.

October 16, 2022 at 2:01 am

Hi Ronrico,

There’s definitely a related concept. For surveys, you typically need to calculate the margin of error . Click the link to read my post about it!


August 16, 2022 at 5:59 am

Wonderful post!

I was wondering, how would I be able to determine if a sample size is large enough for a paper that I’m reading, assuming they do not give the power calculation? If they do give the power calculation, should it be 80% or over for statistically significant results?

Thank you so much 🙂

August 21, 2022 at 12:28 am

Determining whether a study’s sample size and, hence, its statistical power, are sufficient isn’t quite as straightforward as it might appear. It’s tempting to take the study’s sample size, effect size, and variability and enter them into a power analysis. However, that’s problematic. What happens is that if the study has statistically significant findings the power analysis will always indicate sufficient sample size/power. However, if the study has non-significant results, the power analysis will always indicate that the sample size/power are insufficient.

That’s a problem because it’s possible to obtain significant results with low power studies and insignificant results with high power studies. It’s important to recognize all these cases because significant low power studies will exaggerate the effect sizes and insignificant high power studies are more likely to indicate that the effect does not exist in the population.

What you need to do instead is enter the study’s sample size, use a literature review to obtain reasonable estimates of the variability (if possible), and then enter an effect size that represents either the literature’s collective best estimate of it or a minimum effect size that is still practically meaningful. Note that you are not using the study’s estimates for these calculations for the reasons I indicated earlier!


November 13, 2021 at 1:46 am

Hi Sir Jim!

I’d like to know how I can utilize the G*Power calculator to figure out the sample size for my study. It essentially employed stratified random sampling. I’m hoping you’ll respond! Best wishes!

November 13, 2021 at 11:57 pm

It depends on how you’ve conducted your stratified sampling and what you want to test. Are you comparing the strata within your sample? If so, you’d just select the type of test, such as a t-test, and then enter your values. G*Power uses the default setting that your group sizes are equal. That’s fine if you’re using a disproportionate stratified sampling design and set all your strata to the same size. However, if your strata sizes are unequal, you’ll need to adjust the allocation ratio.


June 16, 2021 at 7:32 am

Hello Jim. I want your help in calculating sample size for my study. I have three groups, first group is control (normal), second is a clinical population group undergoing treatment 1 and third colonics group (same disease as group2) undergoing treatment 2. So here I will compare some parameters between pre-post treatment for group 2 and 3 separately first. Then compare group 2 and 3 before treatment and after treatment and then compare baseline parameters and after treatment parameters across all three groups. I hope I have not confused you. I want to know the sample size for my three groups. My hypothesis is that the two treatments will improve the parameters in group 2 and 3, what I want to check is which treatment (1 or 2) is most effective.. I request you to kindly help me in this regard


April 19, 2021 at 10:49 pm

Dear Jim, I have a question regarding calculating the sample size in this scenario: I’m doing a hospital-based study (chart review study) where I will include all patients who have a specific disease (celiac disease) in the last 5 years. How would I know that the number which I will get is sufficient to answer my research questions, considering that this disease is rare? Suppose for example I ended up with 100 patients, how would I know that I can use this sample for further analysis? Is there a way to calculate ahead the minimum number of patients needed to do my research?


March 8, 2021 at 10:45 pm

I am looking to determine the sample size necessary to detect differences in bird populations (composition and abundance) between forest treatment types. I assume I would use an ANOVA given that I have control units. My data will be bird occurrence data, so I imagine Poisson distribution. I have zero pilot data, though. Do you have any recommendations for reading up on ways to simulate or bootstrap data in this situation for use in making variability estimates?

Thank you!!

March 9, 2021 at 7:20 pm

Hi Lorelle,

Yes, I’d think you’d use something like Poisson regression or negative binomial regression because of the count data. I write a little bit about them in my post about choosing the correct type of regression analysis . You can include categorical variables for forest types.

I don’t have good ideas for developing variability estimates. That can be the most difficult part of a power analysis. I’d recommend reading up on the literature as much as possible. Perhaps others have conducted similar research and you can use their estimates. Unfortunately, if you don’t have any data, you can’t bootstrap or simulate it.

I wish I had some better advice, but the best I can think of is to look through the literature for comparable studies. That’s always a good idea anyway, but here it’ll help you with the power analysis too.


February 17, 2021 at 7:04 am

I am confused in some parts as I am new to this. Let’s assume I have the difference in means, the standard error, and 80% power; I have this information to get a sample size (delta, sd, power). But the question is, how would I know this is the correct sample size to get 80% power? Which type do I need to use: paired, two.sample, or one.sample? After power.t.test I get a sample size of 8.7 for two.sample and 6 for one.sample, and I am not sure which would be the correct one. How do I determine that?

February 18, 2021 at 12:34 am

The correct test depends on the nature of the data you collect. Are you comparing the means of two groups? In that case, you need to use a 2-sample t-test. If you have one group and are comparing its mean to a test value, you need a 1-sample t-test.

You can read about the purposes and interpretations the various t-tests in my post about How to do t-tests in Excel . That should be helpful even if you’re not using Excel. Also, I write more about how t-tests work , which will be helpful in showing you what each test can do.


February 7, 2021 at 6:53 pm

Hey there! What sort of test would be best to determine sample size needed for a study determining a 10% difference between two groups at a power of say 80%? Thanks!

February 7, 2021 at 10:23 pm

Hi Kristin, you’d need to perform a power and sample size analysis for a 2-sample t-test. As I indicate in this post, you’ll need to supply an estimate of the population’s standard deviation, the difference you want to detect, and the power, and the procedure will tell you the sample size per group.


January 30, 2021 at 7:48 pm

I have an essay question if anyone can help me with:

Do a calculation: write down what you think the typical power of a psychological study really is and what percentage of research hypotheses are “good” hypotheses. Assume that journals reserve 10% of their pages for publishing null results. Under these assumptions, what percentage of published psychological research is wrong? Do you agree that this analysis makes sense, or is this the wrong way to think about “right” and “wrong” research?

January 30, 2021 at 8:57 pm

I can’t do your essay for you, but I’ve written two blog posts that should be extremely helpful for your assignment.

Reproducibility in Psychology Experiments
Low power tests exaggerate effect sizes

Those two should give you some good food for thought!


January 26, 2021 at 1:17 pm

Dear Jim, I have a question regarding sample size calculation for a laboratory study. The laboratory evaluation includes evaluation of marginal integrity of 2 dental materials vs a control material. What type of test should I use?

January 26, 2021 at 9:13 pm

Hi Eman, that largely depends on the type of data you’re collecting for your outcome. If marginal integrity is continuous data and you want to compare the means between the control and two treatment groups, one-way ANOVA is a great place to start.


November 22, 2020 at 10:30 am

Hi Jim, what if I want to run mixed model ANOVAs twice (on two different dependent variables) – would I have to then double the sample size that I calculated using G*Power? Thanks, Joanna


November 16, 2020 at 11:35 pm

Hi Jim. What about molecular data? For instance, I sequenced my 6 samples, 3 controls and 3 treatments, but each sample (tank replicate) consists of 500-800 individuals as biological replicates (larvae). The analysis after sequencing shows that there are thousands of genes that may show mean differences between the control and treatment. My concern is, does power analysis still play a fair role here, given that increasing the “sample size” (the number of tank replicates) to the 5 or more suggested by power analysis to get >0.8 power is nearly impossible in a physical setting?


November 5, 2020 at 8:09 pm

I have somewhat of a basic question. I am performing some animal studies and looking at the effect of preservation solution on ischemia reperfusion injury following transplantation. I am comparing 5 different preservation solutions. What should be my sample size for each group? I want to know how exactly I can calculate that.

November 6, 2020 at 8:58 pm

You’ll need to have an estimate of the effect. Or, an estimate of the minimum effect size that is practically meaningful in a real-world sense. If you’re comparing means, you’ll also need an estimate of the variability. The nature of what and how to determine the sample size depends on the type of hypothesis test you’ll be using. That in turn depends on the nature of your outcome variable. Are you comparing means with continuous data or comparing proportions with binary data? But in all cases you’ll need that effect size estimate.

You’ll also need software to calculate that for you. I recommend a freeware program called G*Power . Although, most statistical applications can do these power calculations. I cover examples in this post that should be helpful for you.

If you have 5 solutions and you want to compare their means, you’ll need to perform power and sample size calculations for one-way ANOVA.


September 4, 2020 at 3:48 am

Hi Jim, I’ve calculated that I need 34 pairs for a paired t-test with an alpha=0.05 and beta=0.10 with a standard deviation of 1.945 to detect a 1.0 increase in the difference. If after 5 pairs I run my hypothesis tests and I find that the difference is significant (i.e. I reject the null hypothesis) is there a need to complete the remaining 29 pairs? Thanks, Sam


August 20, 2020 at 12:13 pm

Thank you for the explanation. I am currently using G*Power to determine my sample size. But I am still confused about the effect size. Let’s say I use a medium effect size for a correlation, so the suggested sample size is 138 (example), but then when I use a medium effect size for a t-test to find differences between two independent groups, the suggested sample size is 300 (example). So which sample size should I take? Does the same effect size need to be used for every statistical test, or does each statistical test have a different effect size?


August 15, 2020 at 1:45 pm

I want to calculate the sample size for my animal studies. We have designed a novel neural probe and want to perform experiments to test the functionality of these probes in rat brain. As this is a binary study, i.e., either the probe works or doesn’t work (success or failure), and it’s a new technology, it’s lacking any previous literature. Can anyone please suggest which statistical analysis (test) I should use and what parameters, i.e., effect size, I should use? I am using G*Power and looking for a 95% confidence level.

Thanks in Advance Vishal

August 15, 2020 at 3:11 pm

It sounds like you need to use a 2-sample proportions test. It’s one of the many hypothesis tests that I cover in my new Hypothesis Testing ebook . You’ll find the details about how and why to use it, assumptions, interpretations and examples for it.

As for using G*Power to estimate power and sample size, under the Test family drop-down list, choose Exact . Under the Statistical test drop-down, choose Proportions: Inequality, two independent groups (Fisher’s exact test) . That assumes that your two groups have different probes. From there, you’ll need to enter estimates for your study based on whatever background subject-area research/knowledge you have.

I hope this helps!


August 15, 2020 at 10:00 am

Hi Jim, Is it scientifically appropriate to use G*Power for the sample size calculation of a clinical biomedical research study?

August 15, 2020 at 3:24 pm

Hi, yes, G*Power should be appropriate to use for statistical analyses in any area. Did you have a specific concern about it?



July 12, 2020 at 1:56 am

Thank you, Jim, for the app reference. I am checking it out right now. #TeamNoSleep

July 12, 2020 at 5:40 pm

Hi Jamie, Ah, yes, #TeamNoSleep. I’ve unfortunately been on that team! 🙂


June 17, 2020 at 1:30 am

Hi Jim, What is the name of the software you use?

June 18, 2020 at 5:40 pm

I’m using Minitab statistical software. If you’d like free software to calculate power and sample sizes, I highly recommend G*Power .


June 10, 2020 at 4:05 pm

I would like to calculate power for a poisson regression (my DV consists of count data). Do you have any guidance on how to do so?

June 10, 2020 at 4:29 pm

Hi Veronica,

Unfortunately, I’m not familiar with an application that will calculate power for Poisson regression. If your counts are large enough (lambda greater than 10), Poisson approximates a normal distribution. You might then be able to use power analysis for linear multiple regression, which I have seen in the free application G*Power . That might give you an idea at least. I’m not sure about power analysis specifically for Poisson regression.


June 3, 2020 at 6:24 am

Dear Jim, your post looks very nice. I have just one comment: how could I calculate the sample size and power for an “Equal variances” test comparing more than 2 samples? Is it mandatory as in t-tests? Which test statistic is used in that test? Thanks in advance for your tip

June 3, 2020 at 8:13 pm

Hi Ciro, to be honest, I’ve never seen a power analysis for an equal variances test with more than two samples!

The test statistic depends upon which of several methods you use: the F-test, Levene’s test, or Bartlett’s test.

While it would be nice to estimate power for this type of test, I don’t think it’s a common practice and I haven’t seen it available in the software I have checked.


April 24, 2020 at 12:10 am

Why are the sample sizes here all so small?

April 25, 2020 at 1:37 am

For sample sizes, large and small are relative. Given the parameters entered, which include the effect size you want to detect, the properties of the data, and the desired power, the sample sizes are exactly the correct size! Of course, you’re always working with estimates for these values and there’s a chance your estimates are off. But, the proper sample size depends on the nature of all those properties.

I’m curious, was there some reason why you were expecting larger sample sizes? Sometimes you’ll see big studies, such as medical trials. In some cases with lives on the line, you’ll want very large sample sizes that go beyond just the issue of statistical power. But, for many scientific studies where the stakes aren’t so high, they use the approach described here.


December 1, 2019 at 6:20 pm

Is the formula n equals (z times the standard deviation divided by the margin of error), all squared, already a power analysis? I’m looking for power analysis for just estimating a statistic (descriptive statistics) and not hypothesis testing as in many cases of inferential statistics. Does that formula suffice? Thanks in advance 😊

December 2, 2019 at 2:43 pm

You might not realize it, but you’re asking me a trick question! The answer for how you calculate power for descriptive statistics is that you don’t calculate power for descriptive statistics.

Descriptive statistics simply describe the characteristics of a particular group. You’re not making inferences about a larger population. Consequently, there is no hypothesis testing. Power relates to the probability that a hypothesis test will detect a population effect that actually exists. Consequently, if there is no hypothesis test/inferences about a population, there’s no reason to calculate power.

Relatedly, descriptive statistics do not involve a margin of error based on random sampling. The mean of a group is a specific known value without error (excluding measurement error) because you’re measuring all members of that group.

For more information about this topic, read my post about the differences between descriptive and inferential statistics .


October 22, 2019 at 3:24 am

Just wanted to understand if the confidence interval and power are the same.


September 9, 2019 at 8:25 am

Thanks for your explanation, Jim.

August 21, 2019 at 7:46 am

I would like to design a test for the following problem (under the assumption that the Poisson distribution applies):

Samples from a population can be either defective or not (e.g. some technical component from a production)

Out of a random sample of N, there should be at most k defective occurrences, with a 95% probability (e.g. N = 100’000, k = 30).

I would like to design a test for this (testing this Hypothesis) with a sample size N1 (different from N). What should my limit on k1 (defective occurrences from the sample of N1) be? Such that I can say that with a 95% confidence, there will be at most k occurrences out of N samples.

E.g. N2 = 20’000. k1 = ???

Any hints how to tackle this problem?

Many thanks in advance Tom

August 21, 2019 at 11:46 pm

To me, it sounds like you need to use the binomial distribution rather than the Poisson distribution. You use the binomial distribution when you have binary data and you know the probability of an event and the number of trials. That sounds like your scenario!

In the graph below, I illustrate a binomial distribution where we assume the defect rate is 0.001 and the sample size is 100,000. I had the software shade the upper and lower ~2.5% of the tails. 95% of the outcomes should fall within the middle.

example of binomial distribution

If you have sample data, you can use the Proportions hypothesis test, which is based on the binomial distribution. If you have a single sample, use the Proportions test to determine whether your sample is significantly different from a target probability and to construct a confidence interval.

I hope this helps!


March 17, 2019 at 6:37 pm

Thanks very much for putting together this very helpful and informative page. I just have a quick question about statistical power: it’s been surprisingly difficult for me to locate an answer to it in the literature.

I want to calculate the sample size required in order to reach a certain level of a priori statistical power in my experiment. My question is about what ‘sample size’ means in this type of calculation. Does it mean the number of participants or the number of data points? If there is one data point per participant, then these numbers will obviously be the same. However, I’m using a mixed-effects logistic regression model in which there are multiple data points nested within each participant. (Each participant produces multiple ‘yes/no’ responses.)

It would seem odd if the calculation of a priori statistical power did not differentiate between whether each participant produces one response or multiple responses.


April 8, 2018 at 4:46 am

Thank you so much sir for the lucid explanation. Really appreciate your kind help. Many Thanks!

April 1, 2018 at 4:36 am

Dear sir, When I search online for sample size determination, I predominantly see mention of the margin of error formula for its calculation.

At other places, like your website, I see use of effect size and desired power, etc., for the same calculation.

I’m struggling to reconcile between these 2 approaches. Is there a link between the two?

I wish to determine sample size for testing a hypothesis with sufficient power, say 80% or 90%. Please guide me.

April 2, 2018 at 11:37 am

Hi Khalid, a margin of error (MOE) quantifies the amount of random sampling error in the estimation of a parameter, such as the mean or proportion. MOEs represent the uncertainty about how well the sample estimates from a study represent the true population value and are related to confidence intervals. In a confidence interval, the margin of error is the distance between the sample estimate and each endpoint of the CI.

Margins of error are commonly used for surveys. For example, if a survey result is that 75% of the respondents like the product with a MOE of 3 percent. This result indicates that we can be 95% confident that 75% +/- 3% (or 72-78%) of the population like the product.

If you conduct a study, you can estimate the sample size that you need to achieve a specific margin of error. The narrower the MOE, the more precise the estimate. If you have requirements about the precision of the estimates, then you might need to estimate the margin of error based on different sample sizes. This is simply one form of power and sample size analysis where the focus is on how sample sizes relate to the margin of error.

However, if you need to calculate power to detect an effect, use the methods I describe in this post.

In summary, determine what your requirements are and use the corresponding analysis. Do you need to estimate a sample size that produces a level of precision that you specify for the estimates? Or, do you need to estimate a sample size that produces an amount of power to detect a specific size effect? Of course, these are related questions and it comes down to what you want to input in as your criteria.


March 20, 2018 at 10:42 am

Thank you so much for this very intuitive article on sample size.

Thank you, Ashwini

March 20, 2018 at 10:53 am

Hi Ashwini, you’re very welcome! I’m glad it was helpful!


March 19, 2018 at 1:22 pm

Thank you. This was very helpful.

March 19, 2018 at 1:25 pm

You’re very welcome, Hellen! I’m glad you found it to be helpful!


March 13, 2018 at 4:27 am

Thanks for your answer Jim. I was indeed aware of this tool, which is great for demonstration. I think I’ll stick to it.


March 12, 2018 at 7:53 am

Awaiting your book!

March 12, 2018 at 2:06 pm

Thanks! If all goes well, the first one should be out in September 2018!

March 12, 2018 at 4:18 am

Once again, a nice demonstration. Thanks Jim. I was wondering which software you used in your examples. Is it, perhaps, R or G*Power? And, would you have any suggestions on an (online/offline) tool that can be used in class?

March 12, 2018 at 2:03 pm

Hi George, thank you very much! I’m glad it was helpful! I used Minitab for the examples, but I would imagine that most statistical software have similar features.

I found this interactive tool for displaying how power, alpha, effect size, etc. are related. Perhaps this is what you’re looking for?


March 12, 2018 at 1:02 am

Thanks for the information. Please explain sample size calculation for a case-control study when different studies report different prevalences for different parameters.


March 12, 2018 at 12:26 am

Thanks sir! I want to salute you, but you’re too far away. Sir, please send me some articles on probability distributions.

Most kindness


S.5 Power Analysis

Why Is Power Analysis Important?

Consider a research experiment where the p -value computed from the data was 0.12. As a result, one would fail to reject the null hypothesis because this p -value is larger than \(\alpha\) = 0.05. However, there still exist two possible cases for which we failed to reject the null hypothesis:

  • the null hypothesis is a reasonable conclusion,
  • the sample size is not large enough to either accept or reject the null hypothesis, i.e., additional samples might provide additional evidence.

Power analysis is the procedure that researchers can use to determine if the test contains enough power to make a reasonable conclusion. From another perspective power analysis can also be used to calculate the number of samples required to achieve a specified level of power.

Example S.5.1

Let's take a look at an example that illustrates how to compute the power of the test.

Let X denote the height of randomly selected Penn State students. Assume that X is normally distributed with unknown mean \(\mu\) and a standard deviation of 9. Take a random sample of n = 25 students, so that, after setting the probability of committing a Type I error at \(\alpha = 0.05\), we can test the null hypothesis \(H_0: \mu = 170\) against the alternative hypothesis that \(H_A: \mu > 170\).

What is the power of the hypothesis test if the true population mean were \(\mu = 175\)?

\[\begin{align}z&=\frac{\bar{x}-\mu}{\sigma / \sqrt{n}} \\ \bar{x}&= \mu + z \left(\frac{\sigma}{\sqrt{n}}\right) \\ \bar{x}&=170+1.645\left(\frac{9}{\sqrt{25}}\right) \\ &=172.961\\ \end{align}\]

So we should reject the null hypothesis when the observed sample mean is 172.961 or greater:

\[\begin{align}\text{Power}&=P(\bar{x} \ge 172.961 \text{ when } \mu =175)\\ &=P\left(z \ge \frac{172.961-175}{9/\sqrt{25}} \right)\\ &=P(z \ge -1.133)\\ &= 0.8713\\ \end{align}\]

and illustrated below:

Two overlapping normal distributions with means of 170 and 175. The power of 0.871 is shown on the right curve.

In summary, we have determined that we have an 87.13% chance of rejecting the null hypothesis \(H_0: \mu = 170\) in favor of the alternative hypothesis \(H_A: \mu > 170\) if the true unknown population mean is, in reality, \(\mu = 175\).
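The same calculation can be verified numerically. The sketch below assumes Python with scipy, which is not part of the original course material; it simply reproduces the hand calculation above.

```python
# Numeric check of Example S.5.1 (scipy is an assumed tool, not part of the course page).
import numpy as np
from scipy import stats

sigma, n = 9, 25
mu0, mu_a, alpha = 170, 175, 0.05

# Rejection cutoff for the sample mean under H0: mu = 170
cutoff = mu0 + stats.norm.ppf(1 - alpha) * sigma / np.sqrt(n)

# Power: probability the sample mean exceeds the cutoff when mu = 175
power = 1 - stats.norm.cdf((cutoff - mu_a) / (sigma / np.sqrt(n)))
print(round(cutoff, 3), round(power, 4))  # about 172.961 and 0.871
```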

Calculating Sample Size

If the sample size is fixed, then decreasing Type I error \(\alpha\) will increase Type II error \(\beta\). If one wants both to decrease, then one has to increase the sample size.

To calculate the smallest sample size needed for specified \(\alpha\), \(\beta\), and \(\mu_a\) (where \(\mu_a\) is the likely value of \(\mu\) at which you want to evaluate the power), use the sample size formula shown in Example S.5.2 below.

Let's investigate by returning to our previous example.

Example S.5.2

Let X denote the height of randomly selected Penn State students. Assume that X is normally distributed with unknown mean \(\mu\) and standard deviation 9. We are interested in testing, at the \(\alpha = 0.05\) level, the null hypothesis \(H_0: \mu = 170\) against the alternative hypothesis \(H_A: \mu > 170\).

Find the sample size n that is necessary to achieve 0.90 power at the alternative μ = 175.

\[\begin{align}n&= \dfrac{\sigma^2(Z_{\alpha}+Z_{\beta})^2}{(\mu_0−\mu_a)^2}\\ &=\dfrac{9^2 (1.645 + 1.28)^2}{(170-175)^2}\\ &=27.72\\ n&=28\\ \end{align}\]
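A quick numeric check of this formula (again a sketch assuming Python with scipy, not part of the course page) gives the same answer:

```python
# Sample size for 90% power at mu_a = 175 (scipy assumed for the z quantiles).
import math
from scipy import stats

sigma, mu0, mu_a = 9, 170, 175
alpha, beta = 0.05, 0.10  # beta = 0.10 corresponds to 90% power

z_alpha = stats.norm.ppf(1 - alpha)  # about 1.645
z_beta = stats.norm.ppf(1 - beta)    # about 1.282

n = sigma**2 * (z_alpha + z_beta)**2 / (mu0 - mu_a)**2
print(n, math.ceil(n))  # about 27.7, so round up to 28 students
```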

In summary, you should see how power analysis is very important so that we are able to make the correct decision when the data indicate that one cannot reject the null hypothesis. You should also see how power analysis can also be used to calculate the minimum sample size required to detect a difference that meets the needs of your research.

Guide to Power Analysis and Statistical Power

Power analysis is a process that involves evaluating a test’s statistical power to determine the necessary sample size for a hypothesis test. Learn more.

Anmolika Singh

Statistical power and power analysis are essential tools for any researcher or data scientist . Statistical power measures the likelihood that a hypothesis test will detect a specific effect. Power analysis is a process that researchers use to determine the necessary sample size for a hypothesis test.

Power Analysis Definition

Power analysis is a statistical method that involves calculating the necessary sample size required for a study to detect meaningful results. It ensures that a study isn’t too small, which can result in false negatives, nor too large, which is a waste of resources.

The article explores the factors influencing power, such as sample size, effect size, significance level and data variability. We’ll also examine power analysis, a method ensuring studies have adequate sample sizes to detect meaningful effects. Conducting power analysis before data collection can prevent errors, allow you to allocate resources effectively and design ethically sound studies.

Understanding Statistical Power

Statistical power is a vital concept in hypothesis testing, which is a statistical method for determining whether sample data support a specific claim against the null hypothesis. It measures the likelihood that a test will detect an effect if there truly is one. In other words, it shows how well the test can reject a false null hypothesis.

In a study, a Type I error occurs when a true null hypothesis is mistakenly rejected, leading to a false positive result. This means that the test indicates an effect or difference when none actually exists. Conversely, a Type II error happens when a false null hypothesis is not rejected, resulting in a false negative. This error means the test fails to detect an actual effect or difference, wrongly concluding that no effect exists. 

High statistical power means there’s a lower chance of making a Type II error, which happens when a test fails to spot a real effect. 

Several factors affect a study’s power, including:

  • Sample size: The tally of observations or data points in a study.
  • Effect size: The magnitude of the difference or relationship being scrutinized.
  • Significance level: The threshold probability for dismissing the null hypothesis, often set at 0.05.
  • Data variability: The extent to which data points diverge from each other.

Ensuring sufficient power in a study is important to correctly identify and reject a false null hypothesis, thereby recognizing genuine effects and not missing them.

More on Data Science: An Introduction to the Shapiro-Wilk Test for Normality

What Is Power Analysis?

Power analysis is a process that helps researchers determine the necessary sample size to detect an effect of a given size with a specific level of confidence. This method involves calculating the test’s statistical power for different sample sizes and effect sizes. Researchers use power analysis to design studies that aren’t too small, which might miss significant effects, or too large, which might waste resources. This ensures that they have enough participants to detect meaningful effects while managing resources wisely.

Why Is Power Analysis Important?

Power analysis is essential because it makes sure a study has the right tools to find the effects it aims to uncover. If a study lacks sufficient power, it might miss important effects, leading to false negatives. On the other hand, an overpowered study could waste resources. 

By doing a power analysis before collecting data, researchers can figure out the right sample size, use resources efficiently and boost the reliability of their findings. This step is key to producing trustworthy results in any scientific research.

5 Components of Power Analysis

Several components are essential to power analysis:

  • Effect size: The magnitude of the difference or relationship being studied. Larger effects are easier to detect and require smaller samples, whereas smaller effects require larger samples.
  • Sample size: The number of participants or observations in the study. More participants increase power by providing more data, making true effects easier to detect.
  • Significance level (α): Typically set at 0.05, this is the threshold for rejecting the null hypothesis. Lowering the significance level reduces the risk of Type I errors but requires a larger sample size to maintain power.
  • Power (1-β): The probability of correctly rejecting a false null hypothesis, commonly targeted at 0.80 or higher. Insufficient power increases the likelihood of missing true effects, resulting in false negatives.
  • Variability: The amount of fluctuation in the data. Greater variability reduces power, making true effects harder to detect. Researchers can reduce variability by using precise measurement tools and controlling extraneous variables.

Power Analysis Example

Imagine a team of researchers planning a clinical trial to assess a new drug’s efficacy against a specific disease. They hypothesize that the new medication will reduce symptoms by 30 percent compared to the standard treatment. Before starting the trial, they must determine the sample size required to detect this effect with adequate power.

Conducting a power analysis, the researchers factor in the anticipated effect size, the desired power level, typically 80 percent or higher, and the significance level, usually 0.05. They also account for the variability in treatment response and potential dropouts or losses to follow-up.

Based on the power analysis, the researchers determine that they need to enroll 100 patients in each group, treatment and control, to achieve 80 percent power. This means recruiting a total of 200 patients for the trial.
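To see what such a calculation looks like in practice, here is a minimal sketch using the Python library statsmodels (not the researchers’ own computation). The standardized effect size of 0.4 (Cohen’s d) is an illustrative assumption standing in for the hypothesized 30 percent reduction; with 80 percent power and α = 0.05 it yields roughly 100 patients per group.

# Sketch of a sample size calculation with statsmodels (illustrative values).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.4,          # assumed standardized effect (Cohen's d)
    alpha=0.05,               # significance level
    power=0.80,               # desired power
    alternative="two-sided",
)
print(round(n_per_group))     # roughly 100 per group, 200 in total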

Benefits of Power Analysis

  • Optimal resource allocation: Power analysis ensures that studies utilize appropriate sample sizes, preventing wastage of resources on excessively large samples. This optimization allows for more efficient resource allocation, potentially enabling more studies to be conducted with the same resources.
  • Enhanced study validity: By diminishing the risk of Type II errors, power analysis bolsters the reliability and validity of study outcomes. This confidence in results and conclusions can lead to more impactful research outcomes.
  • Ethical research practices: Power analysis aids in designing ethically sound studies by avoiding unnecessary participant involvement. By determining the minimum sample size required to detect an effect, researchers can minimize participant exposure to experimental conditions without compromising the study’s validity.
  • Informed decision making: Power analysis equips researchers with data to substantiate sample sizes and study designs. This information enables researchers to make informed decisions regarding the feasibility and potential impact of their studies, leading to more successful research outcomes.


Applications of Power Analysis

  • Medical research: Power analysis is crucial in medical research for determining sample sizes in clinical trials. By ensuring studies have sufficient power, researchers can more accurately detect treatment effects and improve patient outcomes.
  • Psychology: In psychology, power analysis is utilized to design experiments that detect behavioral effects. By determining the necessary sample size, researchers can ensure studies are adequately powered to detect meaningful effects, leading to more robust conclusions.
  • Education: Power analysis is employed in education to evaluate the effectiveness of educational interventions. By determining the sample size necessary to detect a desired effect size, researchers can design studies that offer valuable insights into the impact of educational programs.
  • Business: In business, power analysis is used in market research to assess consumer preferences and behaviors. By determining the sample size required to detect differences in consumer behavior, businesses can make informed decisions about marketing strategies and product development.

Frequently Asked Questions

What are the components of power analysis?

The primary components of power analysis are effect size, sample size, significance level (α), power (1-β) and variability. Effect size denotes the magnitude of the difference or relationship being studied, while sample size is the number of observations or participants. The significance level is the probability threshold for rejecting the null hypothesis, and power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. Variability indicates the extent of variation in the data, which can affect the study’s power.

What insights does power analysis offer?

Power analysis provides the necessary sample size for detecting an effect of a specified magnitude with a particular level of confidence. It aids researchers in devising studies that possess adequate power to identify significant effects, thereby diminishing the risk of Type II errors and optimizing resource utilization. Additionally, power analysis enlightens researchers about the probability of detecting true effects in their studies, enriching the validity and dependability of their conclusions.



4. Power analysis

In previous weeks we focussed on testing how likely a given result was to occur due to chance, if the null hypothesis were true (chance of a Type 1 error).

This week we are thinking about the other type of error, Type 2.

Type 2 errors occur when the alternative hypothesis is actually true (e.g., there is a difference in means between groups) but we fail to detect it.

Power is the probability of not making a Type 2 error; that is, the probability of detecting an effect if one is present.

We saw in the lecture that whilst the probability of a Type 1 error is generally fixed at 5% (or whatever alpha value we use) for any sample size, the probability of a Type 2 error is much larger in small samples.

In other words, sometimes our sample is just too small to reliably detect an effect even if there is one.

To assess the sample size needed to detect an effect of a certain size, we conduct a power analysis.

We will cover two examples:

  • power of a correlation (Pearson’s r) analysis
  • power of a t-test (independent and paired samples)

We will see how power analyses can be constructed using ‘home made’ code, and also learn to run them for t-tests and correlations using built-in functions in the Python library statsmodels.
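As a preview, a ‘home made’ power analysis usually amounts to simulating many experiments under an assumed effect and counting how often the test reaches significance. The sketch below does this for an independent-samples t-test; the group size, effect size and number of simulations are arbitrary illustrative choices, and SciPy is assumed available for the t-test itself.

# Simulation-based ("home made") power estimate for an independent-samples t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect_size, alpha, n_sims = 50, 0.5, 0.05, 5000

significant = 0
for _ in range(n_sims):
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    treatment = rng.normal(loc=effect_size, scale=1.0, size=n)
    _, p = stats.ttest_ind(control, treatment)
    significant += p < alpha

print(f"Estimated power: {significant / n_sims:.2f}")   # about 0.70 for these settings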

4.1. Tasks for this week

Conceptual material is covered in the lecture. In addition to the live lecture, you can find lecture videos on Canvas.

Please work through the guided exercises in this section (everything except the page labelled “Tutorial Exercises”) in advance of the computer-based tutorial session.

To complete the guided exercises you will need to either:

  • open the pages in Google Colab (simply click the Colab button on each page), or
  • download them as Jupyter Notebooks to your own computer and work with them locally (e.g. in JupyterLab)

If you find something difficult or have questions, you can discuss with your tutor in the computer-based tutorial session.

This week is particularly heavy on conceptual material, so please do discuss the guided exercises and tutorial exercises with your tutor to make sure you understand.

Power Analysis


Synonyms: probability of a true positive decision; sensitivity.

The power of a statistical hypothesis test is the probability of rejecting the null hypothesis given that the null hypothesis is in fact false.

Description

There are four possible outcomes of a statistical hypothesis test: (1) the null hypothesis is maintained given that it is in fact true (a true negative decision); (2) the null hypothesis is rejected even though it is true (a false positive decision or type I error); (3) the null hypothesis is maintained even though it is false (a false negative decision or type II error); and (4) the null hypothesis is rejected given that it is in fact false (a true positive decision). The probabilities of type I and type II errors are often denoted by the Greek letters α and β, respectively. Accordingly, the power (i.e., the probability of a true positive decision, also referred to as the sensitivity of a test) is (1-β), whereas (1-α) denotes the probability of a true negative decision.


Frequently asked questions

What is a power analysis?

A power analysis is a calculation that helps you determine a minimum sample size for your study. It’s made up of four main components. If you know or have estimates for any three of these, you can calculate the fourth component.

  • Statistical power : the likelihood that a test will detect an effect of a certain size if there is one, usually set at 80% or higher.
  • Sample size : the minimum number of observations needed to observe an effect of a certain size with a given power level.
  • Significance level (alpha) : the maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
  • Expected effect size : a standardized way of expressing the magnitude of the expected result of your study, usually based on similar studies or a pilot study.
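For example, if the sample size, significance level and power are fixed, the remaining component — the smallest detectable effect size — can be solved for. A minimal sketch using statsmodels, with illustrative numbers:

# Solving for the fourth component (here, the detectable effect size).
from statsmodels.stats.power import TTestIndPower

detectable_d = TTestIndPower().solve_power(
    nobs1=64,                  # participants per group (assumed)
    alpha=0.05,                # significance level
    power=0.80,                # desired power
    alternative="two-sided",
)
print(round(detectable_d, 2))  # about 0.5 (Cohen's d)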

Frequently asked questions: Statistics

As the degrees of freedom increase, Student’s t distribution becomes less leptokurtic , meaning that the probability of extreme values decreases. The distribution becomes more and more similar to a standard normal distribution .

The three categories of kurtosis are:

  • Mesokurtosis : An excess kurtosis of 0. Normal distributions are mesokurtic.
  • Platykurtosis : A negative excess kurtosis. Platykurtic distributions are thin-tailed, meaning that they have few outliers .
  • Leptokurtosis : A positive excess kurtosis. Leptokurtic distributions are fat-tailed, meaning that they have many outliers.

Probability distributions belong to two broad categories: discrete probability distributions and continuous probability distributions . Within each category, there are many types of probability distributions.

Probability is the relative frequency over an infinite number of trials.

For example, the probability of a coin landing on heads is .5, meaning that if you flip the coin an infinite number of times, it will land on heads half the time.

Since doing something an infinite number of times is impossible, relative frequency is often used as an estimate of probability. If you flip a coin 1000 times and get 507 heads, the relative frequency, .507, is a good estimate of the probability.

Categorical variables can be described by a frequency distribution. Quantitative variables can also be described by a frequency distribution, but first they need to be grouped into interval classes .

A histogram is an effective way to tell if a frequency distribution appears to have a normal distribution .

Plot a histogram and look at the shape of the bars. If the bars roughly follow a symmetrical bell or hill shape, like the example below, then the distribution is approximately normally distributed.

[Figure: histogram of an approximately normal frequency distribution]

You can use the CHISQ.INV.RT() function to find a chi-square critical value in Excel.

For example, to calculate the chi-square critical value for a test with df = 22 and α = .05, click any blank cell and type:

=CHISQ.INV.RT(0.05,22)

You can use the qchisq() function to find a chi-square critical value in R.

For example, to calculate the chi-square critical value for a test with df = 22 and α = .05:

qchisq(p = .05, df = 22, lower.tail = FALSE)
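The equivalent lookup in Python, assuming SciPy is available, uses the chi-square distribution’s inverse survival function:

from scipy.stats import chi2

chi2.isf(0.05, df=22)   # upper-tail critical value for alpha = .05 and df = 22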

You can use the chisq.test() function to perform a chi-square test of independence in R. Give the contingency table as a matrix for the “x” argument. For example:

m = matrix(data = c(89, 84, 86, 9, 8, 24), nrow = 3, ncol = 2)

chisq.test(x = m)

You can use the CHISQ.TEST() function to perform a chi-square test of independence in Excel. It takes two arguments, CHISQ.TEST(observed_range, expected_range), and returns the p value.

Chi-square goodness of fit tests are often used in genetics. One common application is to check if two genes are linked (i.e., if the assortment is independent). When genes are linked, the allele inherited for one gene affects the allele inherited for another gene.

Suppose that you want to know if the genes for pea texture (R = round, r = wrinkled) and color (Y = yellow, y = green) are linked. You perform a dihybrid cross between two heterozygous (RY/ry) pea plants. The hypotheses you’re testing with your experiment are:

  • Null hypothesis (H0): The offspring have an equal probability of inheriting all possible genotypic combinations. This would suggest that the genes are unlinked.
  • Alternative hypothesis (Ha): The offspring do not have an equal probability of inheriting all possible genotypic combinations. This would suggest that the genes are linked.

You observe 100 peas:

  • 78 round and yellow peas
  • 6 round and green peas
  • 4 wrinkled and yellow peas
  • 12 wrinkled and green peas

Step 1: Calculate the expected frequencies

To calculate the expected values, you can make a Punnett square. If the two genes are unlinked, the probability of each genotypic combination is equal.

        RY      Ry      rY      ry
RY      RRYY    RRYy    RrYY    RrYy
Ry      RRYy    RRyy    RrYy    Rryy
rY      RrYY    RrYy    rrYY    rrYy
ry      RrYy    Rryy    rrYy    rryy

The expected phenotypic ratios are therefore 9 round and yellow: 3 round and green: 3 wrinkled and yellow: 1 wrinkled and green.

From this, you can calculate the expected phenotypic frequencies for 100 peas:

Phenotype             Observed    Expected
Round and yellow      78          100 * (9/16) = 56.25
Round and green       6           100 * (3/16) = 18.75
Wrinkled and yellow   4           100 * (3/16) = 18.75
Wrinkled and green    12          100 * (1/16) = 6.25

Step 2: Calculate chi-square

Phenotype             Observed (O)   Expected (E)   O − E     (O − E)²   (O − E)² / E
Round and yellow      78             56.25          21.75     473.06     8.41
Round and green       6              18.75          −12.75    162.56     8.67
Wrinkled and yellow   4              18.75          −14.75    217.56     11.60
Wrinkled and green    12             6.25           5.75      33.06      5.29

Χ² = 8.41 + 8.67 + 11.60 + 5.29 = 33.97

Step 3: Find the critical chi-square value

Since there are four groups (round and yellow, round and green, wrinkled and yellow, wrinkled and green), there are three degrees of freedom .

For a test of significance at α = .05 and df = 3, the Χ² critical value is 7.82.

Step 4: Compare the chi-square value to the critical value

Χ² = 33.97

Critical value = 7.82

The Χ² value is greater than the critical value.

Step 5: Decide whether to reject the null hypothesis

The Χ² value is greater than the critical value, so we reject the null hypothesis that the population of offspring have an equal probability of inheriting all possible genotypic combinations. There is a significant difference between the observed and expected genotypic frequencies (p < .05).

The data supports the alternative hypothesis that the offspring do not have an equal probability of inheriting all possible genotypic combinations, which suggests that the genes are linked.
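For reference, the same test can be reproduced in Python (assuming SciPy is available) by passing the observed and expected counts to scipy.stats.chisquare:

from scipy.stats import chisquare

observed = [78, 6, 4, 12]
expected = [56.25, 18.75, 18.75, 6.25]   # 9:3:3:1 ratio scaled to 100 peas
result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)   # chi-square of about 33.97, p < .001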

You can use the chisq.test() function to perform a chi-square goodness of fit test in R. Give the observed values in the “x” argument, give the expected values in the “p” argument, and set “rescale.p” to true. For example:

chisq.test(x = c(22,30,23), p = c(25,25,25), rescale.p = TRUE)

You can use the CHISQ.TEST() function to perform a chi-square goodness of fit test in Excel. It takes two arguments, CHISQ.TEST(observed_range, expected_range), and returns the p value .

Both correlations and chi-square tests can test for relationships between two variables. However, a correlation is used when you have two quantitative variables and a chi-square test of independence is used when you have two categorical variables.

Both chi-square tests and t tests can test for differences between two groups. However, a t test is used when you have a dependent quantitative variable and an independent categorical variable (with two groups). A chi-square test of independence is used when you have two categorical variables.

The two main chi-square tests are the chi-square goodness of fit test and the chi-square test of independence .

A chi-square distribution is a continuous probability distribution . The shape of a chi-square distribution depends on its degrees of freedom , k . The mean of a chi-square distribution is equal to its degrees of freedom ( k ) and the variance is 2 k . The range is 0 to ∞.

As the degrees of freedom ( k ) increases, the chi-square distribution goes from a downward curve to a hump shape. As the degrees of freedom increases further, the hump goes from being strongly right-skewed to being approximately normal.

To find the quartiles of a probability distribution, you can use the distribution’s quantile function.

You can use the quantile() function to find quartiles in R. If your data is called “data”, then “quantile(data, probs=c(.25,.5,.75), type=1)” will return the three quartiles.

You can use the QUARTILE() function to find quartiles in Excel. If your data is in column A, then click any blank cell and type “=QUARTILE(A:A,1)” for the first quartile, “=QUARTILE(A:A,2)” for the second quartile, and “=QUARTILE(A:A,3)” for the third quartile.

You can use the PEARSON() function to calculate the Pearson correlation coefficient in Excel. If your variables are in columns A and B, then click any blank cell and type “=PEARSON(A:A,B:B)”.

There is no function to directly test the significance of the correlation.

You can use the cor() function to calculate the Pearson correlation coefficient in R. To test the significance of the correlation, you can use the cor.test() function.
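In Python, a single SciPy call returns both the correlation coefficient and its p value; the data below are placeholders:

from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 5, 6]
r, p = pearsonr(x, y)   # correlation coefficient and its p value
print(r, p)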

You should use the Pearson correlation coefficient when (1) the relationship is linear, (2) both variables are quantitative, (3) both variables are normally distributed, and (4) the data have no outliers.

The Pearson correlation coefficient ( r ) is the most common way of measuring a linear correlation. It is a number between –1 and 1 that measures the strength and direction of the relationship between two variables.

This table summarizes the most important differences between normal distributions and Poisson distributions :

Characteristic      Normal                                 Poisson
Type of variable    Continuous                             Discrete
Parameters          Mean (µ) and standard deviation (σ)    Lambda (λ)
Shape               Bell-shaped                            Depends on λ
Symmetry            Symmetrical                            Asymmetrical (right-skewed); as λ increases, the asymmetry decreases
Range               −∞ to ∞                                0 to ∞

When the mean of a Poisson distribution is large (>10), it can be approximated by a normal distribution.

In the Poisson distribution formula, lambda (λ) is the mean number of events within a given interval of time or space. For example, λ = 0.748 floods per year.

The e in the Poisson distribution formula stands for the number 2.718. This number is called Euler’s number. You can simply substitute e with 2.718 when you’re calculating a Poisson probability. Euler’s number is a very useful number and is especially important in calculus.
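Using the flood-rate example above, the probability of observing exactly k events can be computed from the Poisson distribution; a minimal sketch assuming SciPy is available:

from scipy.stats import poisson

for k in range(3):
    print(k, poisson.pmf(k, mu=0.748))   # P(0 floods), P(1 flood), P(2 floods) in a year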

The three types of skewness are:

  • Right skew (also called positive skew). A right-skewed distribution is longer on the right side of its peak than on its left.
  • Left skew (also called negative skew). A left-skewed distribution is longer on the left side of its peak than on its right.
  • Zero skew. It is symmetrical and its left and right sides are mirror images.


Skewness and kurtosis are both important measures of a distribution’s shape.

  • Skewness measures the asymmetry of a distribution.
  • Kurtosis measures the heaviness of a distribution’s tails relative to a normal distribution .


A research hypothesis is your proposed answer to your research question. The research hypothesis usually includes an explanation (“ x affects y because …”).

A statistical hypothesis, on the other hand, is a mathematical statement about a population parameter. Statistical hypotheses always come in pairs: the null and alternative hypotheses . In a well-designed study , the statistical hypotheses correspond logically to the research hypothesis.

The alternative hypothesis is often abbreviated as H a or H 1 . When the alternative hypothesis is written using mathematical symbols, it always includes an inequality symbol (usually ≠, but sometimes < or >).

The null hypothesis is often abbreviated as H 0 . When the null hypothesis is written using mathematical symbols, it always includes an equality symbol (usually =, but sometimes ≥ or ≤).

The t distribution was first described by statistician William Sealy Gosset under the pseudonym “Student.”

To calculate a confidence interval of a mean using the critical value of t , follow these four steps:

  • Choose the significance level based on your desired confidence level. The most common confidence level is 95%, which corresponds to α = .05 in the two-tailed t table .
  • Find the critical value of t in the two-tailed t table.
  • Multiply the critical value of t by s / √ n .
  • Add this value to the mean to calculate the upper limit of the confidence interval, and subtract this value from the mean to calculate the lower limit.
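The four steps translate directly into code; a minimal sketch assuming NumPy and SciPy, with placeholder sample data:

import numpy as np
from scipy import stats

data = np.array([4.3, 5.1, 4.8, 5.6, 4.9, 5.2])   # placeholder sample
alpha = 0.05
n = len(data)
mean, s = data.mean(), data.std(ddof=1)

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)      # two-tailed critical value of t
margin = t_crit * s / np.sqrt(n)
print(mean - margin, mean + margin)                # lower and upper limits of the 95% CI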

To test a hypothesis using the critical value of t , follow these four steps:

  • Calculate the t value for your sample.
  • Find the critical value of t in the t table .
  • Determine if the (absolute) t value is greater than the critical value of t .
  • Reject the null hypothesis if the sample’s t value is greater than the critical value of t . Otherwise, don’t reject the null hypothesis .

You can use the T.INV() function to find the critical value of t for one-tailed tests in Excel, and you can use the T.INV.2T() function for two-tailed tests.

You can use the qt() function to find the critical value of t in R. The function gives the critical value of t for the one-tailed test. If you want the critical value of t for a two-tailed test, divide the significance level by two.

You can use the RSQ() function to calculate R² in Excel. If your dependent variable is in column A and your independent variable is in column B, then click any blank cell and type “=RSQ(A:A,B:B)”.

You can use the summary() function to view the R²  of a linear model in R. You will see the “R-squared” near the bottom of the output.

There are two formulas you can use to calculate the coefficient of determination (R²) of a simple linear regression:

  • From the correlation coefficient: R² = r²
  • From the sums of squares: R² = 1 − (residual sum of squares / total sum of squares)

The coefficient of determination (R²) is a number between 0 and 1 that measures how well a statistical model predicts an outcome. You can interpret the R² as the proportion of variation in the dependent variable that is predicted by the statistical model.

There are three main types of missing data .

Missing completely at random (MCAR) data are randomly distributed across the variable and unrelated to other variables .

Missing at random (MAR) data are not randomly distributed but they are accounted for by other observed variables.

Missing not at random (MNAR) data systematically differ from the observed values.

To tidy up your missing data , your options usually include accepting, removing, or recreating the missing data.

  • Acceptance: You leave your data as is
  • Listwise or pairwise deletion: You delete all cases (participants) with missing data from analyses
  • Imputation: You use other data to fill in the missing data

Missing data are important because, depending on the type, they can sometimes bias your results. This means your results may not be generalizable outside of your study because your data come from an unrepresentative sample .

Missing data , or missing values, occur when you don’t have data stored for certain variables or participants.

In any dataset, there’s usually some missing data. In quantitative research , missing values appear as blank cells in your spreadsheet.

There are two steps to calculating the geometric mean :

  • Multiply all values together to get their product.
  • Find the n th root of the product ( n is the number of values).

Before calculating the geometric mean, note that:

  • The geometric mean can only be found for positive values.
  • If any value in the data set is zero, the geometric mean is zero.
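The two steps in code, as a small NumPy sketch with illustrative values:

import numpy as np

values = np.array([1.05, 1.10, 0.98, 1.07])             # e.g. yearly growth factors
geometric_mean = np.prod(values) ** (1 / len(values))   # nth root of the product
print(geometric_mean)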

The arithmetic mean is the most commonly used type of mean and is often referred to simply as “the mean.” While the arithmetic mean is based on adding and dividing values, the geometric mean multiplies and finds the root of values.

Even though the geometric mean is a less common measure of central tendency , it’s more accurate than the arithmetic mean for percentage change and positively skewed data. The geometric mean is often reported for financial indices and population growth rates.

The geometric mean is an average that multiplies all values and finds a root of the number. For a dataset with n numbers, you find the n th root of their product.

Outliers are extreme values that differ from most values in the dataset. You find outliers at the extreme ends of your dataset.

It’s best to remove outliers only when you have a sound reason for doing so.

Some outliers represent natural variations in the population , and they should be left as is in your dataset. These are called true outliers.

Other outliers are problematic and should be removed because they represent measurement errors , data entry or processing errors, or poor sampling.

You can choose from four main ways to detect outliers :

  • Sorting your values from low to high and checking minimum and maximum values
  • Visualizing your data with a box plot and looking for outliers
  • Using the interquartile range to create fences for your data
  • Using statistical procedures to identify extreme values
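The interquartile-range approach is easy to sketch in code; the 1.5 multiplier below is the usual convention rather than a fixed rule, and the data are illustrative:

import numpy as np

data = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 40])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # the "fences"
print(data[(data < lower) | (data > upper)])       # values outside the fences (here, 40)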

Outliers can have a big impact on your statistical analyses and skew the results of any hypothesis test if they are inaccurate.

These extreme values can impact your statistical power as well, making it hard to detect a true effect if there is one.

No, the steepness or slope of the line isn’t related to the correlation coefficient value. The correlation coefficient only tells you how closely your data fit on a line, so two datasets with the same correlation coefficient can have very different slopes.

To find the slope of the line, you’ll need to perform a regression analysis .

Correlation coefficients always range between -1 and 1.

The sign of the coefficient tells you the direction of the relationship: a positive value means the variables change together in the same direction, while a negative value means they change together in opposite directions.

The absolute value of a number is equal to the number without its sign. The absolute value of a correlation coefficient tells you the magnitude of the correlation: the greater the absolute value, the stronger the correlation.

These are the assumptions your data must meet if you want to use Pearson’s r :

  • Both variables are on an interval or ratio level of measurement
  • Data from both variables follow normal distributions
  • Your data have no outliers
  • Your data is from a random or representative sample
  • You expect a linear relationship between the two variables

A correlation coefficient is a single number that describes the strength and direction of the relationship between your variables.

Different types of correlation coefficients might be appropriate for your data based on their levels of measurement and distributions . The Pearson product-moment correlation coefficient (Pearson’s r ) is commonly used to assess a linear relationship between two quantitative variables.

There are various ways to improve power:

  • Increase the potential effect size by manipulating your independent variable more strongly,
  • Increase sample size,
  • Increase the significance level (alpha),
  • Reduce measurement error by increasing the precision and accuracy of your measurement devices and procedures,
  • Use a one-tailed test instead of a two-tailed test for t tests and z tests.

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

Statistical analysis is the main method for analyzing quantitative research data . It uses probabilities and models to test predictions about a population from sample data.

The risk of making a Type II error is inversely related to the statistical power of a test. Power is the extent to which a test can correctly detect a real effect when there is one.

To (indirectly) reduce the risk of a Type II error, you can increase the sample size or the significance level to increase statistical power.

The risk of making a Type I error is the significance level (or alpha) that you choose. That’s a value that you set at the beginning of your study to assess the statistical probability of obtaining your results ( p value ).

The significance level is usually set at 0.05 or 5%. This means that your results only have a 5% chance of occurring, or less, if the null hypothesis is actually true.

To reduce the Type I error probability, you can set a lower significance level.

In statistics, a Type I error means rejecting the null hypothesis when it’s actually true, while a Type II error means failing to reject the null hypothesis when it’s actually false.

In statistics, power refers to the likelihood of a hypothesis test detecting a true effect if there is one. A statistically powerful test is less likely to produce a false negative (a Type II error).

If you don’t ensure enough power in your study, you may not be able to detect a statistically significant result even when it has practical significance. Your study might not have the ability to answer your research question.

While statistical significance shows that an effect exists in a study, practical significance shows that the effect is large enough to be meaningful in the real world.

Statistical significance is denoted by p -values whereas practical significance is represented by effect sizes .

There are dozens of measures of effect sizes . The most common effect sizes are Cohen’s d and Pearson’s r . Cohen’s d measures the size of the difference between two groups while Pearson’s r measures the strength of the relationship between two variables .

Effect size tells you how meaningful the relationship between variables or the difference between groups is.

A large effect size means that a research finding has practical significance, while a small effect size indicates limited practical applications.

Using descriptive and inferential statistics , you can make two types of estimates about the population : point estimates and interval estimates.

  • A point estimate is a single value estimate of a parameter . For instance, a sample mean is a point estimate of a population mean.
  • An interval estimate gives you a range of values where the parameter is expected to lie. A confidence interval is the most common type of interval estimate.

Both types of estimates are important for gathering a clear idea of where a parameter is likely to lie.

Standard error and standard deviation are both measures of variability . The standard deviation reflects variability within a sample, while the standard error estimates the variability across samples of a population.

The standard error of the mean , or simply standard error , indicates how different the population mean is likely to be from a sample mean. It tells you how much the sample mean would vary if you were to repeat a study using new samples from within a single population.

To figure out whether a given number is a parameter or a statistic , ask yourself the following:

  • Does the number describe a whole, complete population where every member can be reached for data collection ?
  • Is it possible to collect data for this number from every member of the population in a reasonable time frame?

If the answer is yes to both questions, the number is likely to be a parameter. For small populations, data can be collected from the whole population and summarized in parameters.

If the answer is no to either of the questions, then the number is more likely to be a statistic.

The arithmetic mean is the most commonly used mean. It’s often simply called the mean or the average. But there are some other types of means you can calculate depending on your research purposes:

  • Weighted mean: some values contribute more to the mean than others.
  • Geometric mean : values are multiplied rather than summed up.
  • Harmonic mean: reciprocals of values are used instead of the values themselves.

You can find the mean , or average, of a data set in two simple steps:

  • Find the sum of the values by adding them all up.
  • Divide the sum by the number of values in the data set.

This method is the same whether you are dealing with sample or population data or positive or negative numbers.

The median is the most informative measure of central tendency for skewed distributions or distributions with outliers. For example, the median is often used as a measure of central tendency for income distributions, which are generally highly skewed.

Because the median only uses one or two values, it’s unaffected by extreme outliers or non-symmetric distributions of scores. In contrast, the mean and mode can vary in skewed distributions.

To find the median , first order your data. Then calculate the middle position based on n , the number of values in your data set.

Middle position = (n + 1) / 2

A data set can often have no mode, one mode or more than one mode – it all depends on how many different values repeat most frequently.

Your data can be:

  • without any mode
  • unimodal, with one mode,
  • bimodal, with two modes,
  • trimodal, with three modes, or
  • multimodal, with four or more modes.

To find the mode :

  • If your data is numerical or quantitative, order the values from low to high.
  • If it is categorical, sort the values by group, in any order.

Then you simply need to identify the most frequently occurring value.

The interquartile range is the best measure of variability for skewed distributions or data sets with outliers. Because it’s based on values that come from the middle half of the distribution, it’s unlikely to be influenced by outliers .

The two most common methods for calculating interquartile range are the exclusive and inclusive methods.

The exclusive method excludes the median when identifying Q1 and Q3, while the inclusive method includes the median as a value in the data set in identifying the quartiles.

For each of these methods, you’ll need different procedures for finding the median, Q1 and Q3 depending on whether your sample size is even- or odd-numbered. The exclusive method works best for even-numbered sample sizes, while the inclusive method is often used with odd-numbered sample sizes.

While the range gives you the spread of the whole data set, the interquartile range gives you the spread of the middle half of a data set.

Homoscedasticity, or homogeneity of variances, is an assumption of equal or similar variances in different groups being compared.

This is an important assumption of parametric statistical tests because they are sensitive to any dissimilarities. Uneven variances in samples result in biased and skewed test results.

Statistical tests such as variance tests or the analysis of variance (ANOVA) use sample variance to assess group differences of populations. They use the variances of the samples to assess whether the populations they come from significantly differ from each other.

Variance is the average squared deviations from the mean, while standard deviation is the square root of this number. Both measures reflect variability in a distribution, but their units differ:

  • Standard deviation is expressed in the same units as the original values (e.g., minutes or meters).
  • Variance is expressed in much larger units (e.g., meters squared).

Although the units of variance are harder to intuitively understand, variance is important in statistical tests .

The empirical rule, or the 68-95-99.7 rule, tells you where most of the values lie in a normal distribution :

  • Around 68% of values are within 1 standard deviation of the mean.
  • Around 95% of values are within 2 standard deviations of the mean.
  • Around 99.7% of values are within 3 standard deviations of the mean.

The empirical rule is a quick way to get an overview of your data and check for any outliers or extreme values that don’t follow this pattern.
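You can verify these percentages directly from the normal cumulative distribution function; a quick sketch assuming SciPy:

from scipy.stats import norm

for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))   # about 0.6827, 0.9545, 0.9973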

In a normal distribution , data are symmetrically distributed with no skew. Most values cluster around a central region, with values tapering off as they go further away from the center.

The measures of central tendency (mean, mode, and median) are exactly the same in a normal distribution.


The standard deviation is the average amount of variability in your data set. It tells you, on average, how far each score lies from the mean .

In normal distributions, a high standard deviation means that values are generally far from the mean, while a low standard deviation indicates that values are clustered close to the mean.

No. Because the range formula subtracts the lowest number from the highest number, the range is always zero or a positive number.

In statistics, the range is the spread of your data from the lowest to the highest value in the distribution. It is the simplest measure of variability .

While central tendency tells you where most of your data points lie, variability summarizes how far apart your points lie from each other.

Data sets can have the same central tendency but different levels of variability or vice versa . Together, they give you a complete picture of your data.

Variability is most commonly measured with the following descriptive statistics :

  • Range : the difference between the highest and lowest values
  • Interquartile range : the range of the middle half of a distribution
  • Standard deviation : average distance from the mean
  • Variance : average of squared distances from the mean

Variability tells you how far apart points lie from each other and from the center of a distribution or a data set.

Variability is also referred to as spread, scatter or dispersion.

While interval and ratio data can both be categorized, ranked, and have equal spacing between adjacent values, only ratio scales have a true zero.

For example, temperature in Celsius or Fahrenheit is at an interval scale because zero is not the lowest possible temperature. In the Kelvin scale, a ratio scale, zero represents a total lack of thermal energy.

A critical value is the value of the test statistic which defines the upper and lower bounds of a confidence interval , or which defines the threshold of statistical significance in a statistical test. It describes how far from the mean of the distribution you have to go to cover a certain amount of the total variation in the data (i.e. 90%, 95%, 99%).

If you are constructing a 95% confidence interval and are using a threshold of statistical significance of p = 0.05, then your critical value will be identical in both cases.

The t -distribution gives more probability to observations in the tails of the distribution than the standard normal distribution (a.k.a. the z -distribution).

In this way, the t -distribution is more conservative than the standard normal distribution: to reach the same level of confidence or statistical significance , you will need to include a wider range of the data.

A t -score (a.k.a. a t -value) is equivalent to the number of standard deviations away from the mean of the t -distribution .

The t -score is the test statistic used in t -tests and regression tests. It can also be used to describe how far from the mean an observation is when the data follow a t -distribution.

The t-distribution is a way of describing a set of observations where most observations fall close to the mean, and the rest of the observations make up the tails on either side. It is similar to a normal distribution but is used for smaller sample sizes, where the variance in the data is unknown.

The t -distribution forms a bell curve when plotted on a graph. It can be described mathematically using the mean and the standard deviation .

In statistics, ordinal and nominal variables are both considered categorical variables .

Even though ordinal data can sometimes be numerical, not all mathematical operations can be performed on them.

Ordinal data has two characteristics:

  • The data can be classified into different categories within a variable.
  • The categories have a natural ranked order.

However, unlike with interval data, the distances between the categories are uneven or unknown.

Nominal and ordinal are two of the four levels of measurement . Nominal level data can only be classified, while ordinal level data can be classified and ordered.

Nominal data is data that can be labelled or classified into mutually exclusive categories within a variable. These categories cannot be ordered in a meaningful way.

For example, for the nominal variable of preferred mode of transportation, you may have the categories of car, bus, train, tram or bicycle.

If your confidence interval for a difference between groups includes zero, that means that if you run your experiment again you have a good chance of finding no difference between groups.

If your confidence interval for a correlation or regression includes zero, that means that if you run your experiment again there is a good chance of finding no correlation in your data.

In both of these cases, you will also find a high p -value when you run your statistical test, meaning that your results could have occurred under the null hypothesis of no relationship between variables or no difference between groups.

If you want to calculate a confidence interval around the mean of data that is not normally distributed , you have two choices:

  • Find a distribution that matches the shape of your data and use that distribution to calculate the confidence interval.
  • Perform a transformation on your data to make it fit a normal distribution, and then find the confidence interval for the transformed data.

The standard normal distribution , also called the z -distribution, is a special normal distribution where the mean is 0 and the standard deviation is 1.

Any normal distribution can be converted into the standard normal distribution by turning the individual values into z -scores. In a z -distribution, z -scores tell you how many standard deviations away from the mean each value lies.
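The conversion itself is just subtracting the mean and dividing by the standard deviation; a small NumPy sketch with illustrative values:

import numpy as np

x = np.array([10, 12, 8, 14, 16])
z = (x - x.mean()) / x.std()   # z-scores: distance from the mean in standard deviations
print(z)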

The z -score and t -score (aka z -value and t -value) show how many standard deviations away from the mean of the distribution you are, assuming your data follow a z -distribution or a t -distribution .

These scores are used in statistical tests to show how far from the mean of the predicted distribution your statistical estimate is. If your test produces a z -score of 2.5, this means that your estimate is 2.5 standard deviations from the predicted mean.

The predicted mean and distribution of your estimate are generated by the null hypothesis of the statistical test you are using. The more standard deviations away from the predicted mean your estimate is, the less likely it is that the estimate could have occurred under the null hypothesis .

To calculate the confidence interval , you need to know:

  • The point estimate you are constructing the confidence interval for
  • The critical values for the test statistic
  • The standard deviation of the sample
  • The sample size

Then you can plug these components into the confidence interval formula that corresponds to your data. The formula depends on the type of estimate (e.g. a mean or a proportion) and on the distribution of your data.

The confidence level is the percentage of times you expect to get close to the same estimate if you run your experiment again or resample the population in the same way.

The confidence interval consists of the upper and lower bounds of the estimate you expect to find at a given level of confidence.

For example, if you are estimating a 95% confidence interval around the mean proportion of female babies born every year based on a random sample of babies, you might find an upper bound of 0.56 and a lower bound of 0.48. These are the upper and lower bounds of the confidence interval. The confidence level is 95%.

The mean is the most frequently used measure of central tendency because it uses all values in the data set to give you an average.

For data from skewed distributions, the median is better than the mean because it isn’t influenced by extremely large values.

The mode is the only measure you can use for nominal or categorical data that can’t be ordered.

The measures of central tendency you can use depend on the level of measurement of your data.

  • For a nominal level, you can only use the mode to find the most frequent value.
  • For an ordinal level or ranked data, you can also use the median to find the value in the middle of your data set.
  • For interval or ratio levels, in addition to the mode and median, you can use the mean to find the average value.

Measures of central tendency help you find the middle, or the average, of a data set.

The 3 most common measures of central tendency are the mean, median and mode.

  • The mode is the most frequent value.
  • The median is the middle number in an ordered data set.
  • The mean is the sum of all values divided by the total number of values.

Some variables have fixed levels. For example, gender and ethnicity are always nominal level data because they cannot be ranked.

However, for other variables, you can choose the level of measurement . For example, income is a variable that can be recorded on an ordinal or a ratio scale:

  • At an ordinal level , you could create 5 income groupings and code the incomes that fall within them from 1–5.
  • At a ratio level , you would record exact numbers for income.

If you have a choice, the ratio level is always preferable because you can analyze data in more ways. The higher the level of measurement, the more precise your data is.

The level at which you measure a variable determines how you can analyze your data.

Depending on the level of measurement , you can perform different descriptive statistics to get an overall summary of your data and inferential statistics to see if your results support or refute your hypothesis .

Levels of measurement tell you how precisely variables are recorded. There are 4 levels of measurement, which can be ranked from low to high:

  • Nominal : the data can only be categorized.
  • Ordinal : the data can be categorized and ranked.
  • Interval : the data can be categorized and ranked, and evenly spaced.
  • Ratio : the data can be categorized, ranked, evenly spaced and has a natural zero.

No. The p -value only tells you how likely the data you have observed is to have occurred under the null hypothesis .

If the p -value is below your threshold of significance (typically p < 0.05), then you can reject the null hypothesis, but this does not necessarily mean that your alternative hypothesis is true.

The alpha value, or the threshold for statistical significance , is arbitrary – which value you use depends on your field of study.

In most cases, researchers use an alpha of 0.05, which means that there is a less than 5% chance that the data being tested could have occurred under the null hypothesis.

P -values are usually automatically calculated by the program you use to perform your statistical test. They can also be estimated using p -value tables for the relevant test statistic .

P -values are calculated from the null distribution of the test statistic. They tell you how often a test statistic is expected to occur under the null hypothesis of the statistical test, based on where it falls in the null distribution.

If the test statistic is far from the mean of the null distribution, then the p -value will be small, showing that the test statistic is not likely to have occurred under the null hypothesis.

A p -value , or probability value, is a number describing how likely it is that your data would have occurred under the null hypothesis of your statistical test .

The test statistic you use will be determined by the statistical test.

You can choose the right statistical test by looking at what type of data you have collected and what type of relationship you want to test.

The test statistic will change based on the number of observations in your data, how variable your observations are, and how strong the underlying patterns in the data are.

For example, if one data set has higher variability while another has lower variability, the first data set will produce a test statistic closer to the null hypothesis , even if the true correlation between two variables is the same in either data set.

The formula for the test statistic depends on the statistical test being used.

Generally, the test statistic is calculated as the pattern in your data (i.e. the correlation between variables or difference between groups) divided by the variance in the data (i.e. the standard deviation ).

  • Univariate statistics summarize only one variable  at a time.
  • Bivariate statistics compare two variables .
  • Multivariate statistics compare more than two variables .

The 3 main types of descriptive statistics concern the frequency distribution, central tendency, and variability of a dataset.

  • Distribution refers to the frequencies of different responses.
  • Measures of central tendency give you the average for each response.
  • Measures of variability show you the spread or dispersion of your dataset.

Descriptive statistics summarize the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalizable to the broader population.

In statistics, model selection is a process researchers use to compare the relative value of different statistical models and determine which one is the best fit for the observed data.

The Akaike information criterion is one of the most common methods of model selection. AIC weights the ability of the model to predict the observed data against the number of parameters the model requires to reach that level of precision.

AIC model selection can help researchers find a model that explains the observed variation in their data while avoiding overfitting.

In statistics, a model is the collection of one or more independent variables and their predicted interactions that researchers use to try to explain variation in their dependent variable.

You can test a model using a statistical test . To compare how well different models fit your data, you can use Akaike’s information criterion for model selection.

The Akaike information criterion is calculated from the maximum log-likelihood of the model and the number of parameters (K) used to reach that likelihood. The AIC function is 2K – 2(log-likelihood) .

Lower AIC values indicate a better-fit model, and a model whose AIC is more than 2 units lower than another’s (a delta-AIC greater than 2, where delta-AIC is the difference between the two AIC values being compared) is generally considered a meaningfully better fit.

The Akaike information criterion is a mathematical test used to evaluate how well a model fits the data it is meant to describe. It penalizes models which use more independent variables (parameters) as a way to avoid over-fitting.

AIC is most often used to compare the relative goodness-of-fit among different models under consideration and to then choose the model that best fits the data.
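To make the AIC arithmetic concrete, here is a minimal Python sketch (an illustrative addition, not part of the original explanation) that applies AIC = 2K – 2(log-likelihood) to two hypothetical fitted models and compares them by delta-AIC. The parameter counts and log-likelihoods are assumed values chosen only for illustration.

```python
def aic(k, log_likelihood):
    """Akaike information criterion: 2K - 2 * log-likelihood."""
    return 2 * k - 2 * log_likelihood

# Hypothetical fitted models: number of parameters K and maximum log-likelihood
model_simple = {"k": 3, "log_likelihood": -105.2}   # fewer predictors
model_complex = {"k": 6, "log_likelihood": -103.8}  # more predictors, slightly better raw fit

aic_simple = aic(**model_simple)
aic_complex = aic(**model_complex)

# Lower AIC wins once the parameter penalty is applied; a gap of more than
# about 2 is usually treated as a meaningful difference between models.
print(f"AIC simple:  {aic_simple:.1f}")
print(f"AIC complex: {aic_complex:.1f}")
print(f"Delta AIC (complex - simple): {aic_complex - aic_simple:.1f}")
```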

A factorial ANOVA is any ANOVA that uses more than one categorical independent variable . A two-way ANOVA is a type of factorial ANOVA.

Some examples of factorial ANOVAs include:

  • Testing the combined effects of vaccination (vaccinated or not vaccinated) and health status (healthy or pre-existing condition) on the rate of flu infection in a population.
  • Testing the effects of marital status (married, single, divorced, widowed), job status (employed, self-employed, unemployed, retired), and family history (no family history, some family history) on the incidence of depression in a population.
  • Testing the effects of feed type (type A, B, or C) and barn crowding (not crowded, somewhat crowded, very crowded) on the final weight of chickens in a commercial farming operation.

In ANOVA, the null hypothesis is that there is no difference among group means. If any group differs significantly from the overall group mean, then the ANOVA will report a statistically significant result.

Significant differences among group means are calculated using the F statistic, which is the ratio of the mean sum of squares (the variance explained by the independent variable) to the mean square error (the variance left over).

If the F statistic is higher than the critical value (the value of F that corresponds with your alpha value, usually 0.05), then the difference among groups is deemed statistically significant.
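As a rough illustration of comparing an F statistic to its critical value, the following Python sketch runs a one-way ANOVA with scipy; the three groups of scores are made up for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical outcome scores for three groups (made-up numbers)
group1 = np.array([23, 25, 28, 30, 27])
group2 = np.array([31, 33, 29, 35, 32])
group3 = np.array([22, 24, 26, 23, 25])

# One-way ANOVA: F is the ratio of between-group to within-group variance
f_statistic, p_value = stats.f_oneway(group1, group2, group3)

# Critical value of F at alpha = .05 with (k - 1, N - k) degrees of freedom
k, n_total = 3, 15
f_critical = stats.f.ppf(0.95, dfn=k - 1, dfd=n_total - k)

# The difference among group means is deemed significant if F exceeds the critical value
print(f"F = {f_statistic:.2f}, critical F = {f_critical:.2f}, p = {p_value:.4f}")
```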

The only difference between one-way and two-way ANOVA is the number of independent variables . A one-way ANOVA has one independent variable, while a two-way ANOVA has two.

  • One-way ANOVA : Testing the relationship between shoe brand (Nike, Adidas, Saucony, Hoka) and race finish times in a marathon.
  • Two-way ANOVA : Testing the relationship between shoe brand (Nike, Adidas, Saucony, Hoka), runner age group (junior, senior, master’s), and race finishing times in a marathon.

All ANOVAs are designed to test for differences among three or more groups. If you are only testing for a difference between two groups, use a t-test instead.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.
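The sketch below illustrates the same idea in Python: it fits a straight line by least squares and computes the MSE by hand. The (x, y) values are invented for the example, and numpy's polyfit stands in for whatever fitting routine you actually use.

```python
import numpy as np

# Hypothetical (x, y) observations for a simple linear regression (made-up numbers)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit the line y = intercept + slope * x by ordinary least squares
slope, intercept = np.polyfit(x, y, deg=1)
predicted = intercept + slope * x

# MSE: distance of each observed y from the predicted y, squared, then averaged
mse = np.mean((y - predicted) ** 2)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, MSE = {mse:.3f}")
```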

Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative.

For example, the relationship between temperature and the expansion of mercury in a thermometer can be modeled using a straight line: as temperature increases, the mercury expands. This linear relationship is so certain that we can use mercury thermometers to measure temperature.

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

A t-test should not be used to measure differences among more than two groups, because the error structure for a t-test will underestimate the actual error when many groups are being compared.

If you want to compare the means of several groups at once, it’s best to use another statistical test such as ANOVA or a post-hoc test.

A one-sample t-test is used to compare a single population to a standard value (for example, to determine whether the average lifespan of a specific town is different from the country average).

A paired t-test is used to compare a single population before and after some experimental intervention or at two different points in time (for example, measuring student performance on a test before and after being taught the material).

A t-test measures the difference in group means divided by the pooled standard error of the two group means.

In this way, it calculates a number (the t-value) illustrating the magnitude of the difference between the two group means being compared, and estimates the likelihood that this difference exists purely by chance (p-value).
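For illustration, here is a minimal Python sketch that produces a t-value and p-value for two independent groups using scipy; the scores are made-up numbers, not data from a real study.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two independent groups (made-up numbers)
treatment = np.array([84, 79, 91, 88, 76, 85, 90, 82])
control = np.array([78, 74, 80, 72, 77, 75, 81, 73])

# Two-sample t-test: the t-value reflects the size of the mean difference relative
# to its standard error; the p-value estimates how likely such a difference is
# under the null hypothesis of no difference between group means.
t_value, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_value:.2f}, p = {p_value:.4f}")
```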

Your choice of t-test depends on whether you are studying one group or two groups, and whether you care about the direction of the difference in group means.

If you are studying one group, use a paired t-test to compare the group mean over time or after an intervention, or use a one-sample t-test to compare the group mean to a standard value. If you are studying two groups, use a two-sample t-test .

If you want to know only whether a difference exists, use a two-tailed test . If you want to know if one group mean is greater or less than the other, use a left-tailed or right-tailed one-tailed test .

A t-test is a statistical test that compares the means of two samples . It is used in hypothesis testing , with a null hypothesis that the difference in group means is zero and an alternate hypothesis that the difference in group means is different from zero.

Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test . Significance is usually denoted by a p -value , or probability value.

Statistical significance is arbitrary – it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that the data is likely to occur less than 5% of the time under the null hypothesis .

When the p -value falls below the chosen alpha value, then we say the result of the test is statistically significant.

A test statistic is a number calculated by a  statistical test . It describes how far your observed data is from the  null hypothesis  of no relationship between  variables or no difference among sample groups.

The test statistic tells you how different two or more groups are from the overall population mean , or how different a linear slope is from the slope predicted by a null hypothesis . Different test statistics are used in different statistical tests.

Statistical tests commonly assume that:

  • the data are normally distributed
  • the groups that are being compared have similar variance
  • the data are independent

If your data does not meet these assumptions, you might still be able to use a nonparametric statistical test, which has fewer requirements but also makes weaker inferences.


Statistical Methods and Data Analytics

Sample Power Data Analysis Examples: Power Analysis for a Two-Group Independent Sample t-Test

Example 1. A clinical dietician wants to compare two different diets, A and B, for diabetic patients. She hypothesizes that diet A (Group 1) will be better than diet B (Group 2), in terms of lower blood glucose. She plans to get a random sample of diabetic patients and randomly assign them to one of the two diets. At the end of the experiment, which lasts 6 weeks, a fasting blood glucose test will be conducted on each patient. She also expects that the average difference in blood glucose measure between the two groups will be about 10 mg/dl. Furthermore, she also assumes the standard deviation of blood glucose distribution for diet A to be 15 and the standard deviation for diet B to be 17. The dietician wants to know the number of subjects needed in each group, assuming equal-sized groups.

Example 2. An audiologist wanted to study the effect of gender on the response time to a certain sound frequency. He suspected that men were better at detecting this type of sound than were women. He took a random sample of 20 male and 20 female subjects for this experiment. Each subject was given a button to press when he/she heard the sound. The audiologist then measured the response time – the time between when the sound was emitted and when the button was pressed. Now, he wants to know what the statistical power is, based on his total of 40 subjects, to detect the gender difference.

Prelude to the power analysis

There are two different aspects of power analysis.  One is to calculate the necessary sample size for a specified power as in Example 1.  The other aspect is to calculate the power when given a specific sample size as in Example 2.  Technically, power is the probability of rejecting the null hypothesis when the specific alternative hypothesis is true. We can also think of this as the probability of detecting a true effect.  

For the power analyses below, we are going to focus on Example 1 and calculate the required sample size for a given statistical power when testing the difference in the effect of diet A and diet B. In order to perform the power analysis, the dietician has to make some decisions about the precision and sensitivity of the test and provide some educated guesses about the data distributions. Here is the information we have to know, estimate, or assume in order to perform the power analysis, along with the values given for Example 1:

  • The expected difference in the average blood glucose; in this case it is set to 10.
  • The standard deviations of blood glucose for Group 1 and Group 2; in this case, they are set to 15 and 17 respectively. 
  • The alpha level, or the Type I error rate, which is the probability of rejecting the null hypothesis when it is actually true.  A common practice is to set it at the .05 level.
  • The pre-specified level of statistical power for calculating the sample size; this will be set to .8.
  • The pre-specified number of subjects for calculating the statistical power; this is the situation for Example 2.

Notice that in the first example, the dietician specified the difference in the two means but didn’t specify the means for each group. This is because she is only interested in the difference, and the actual values of the means will not affect her analysis.

Power analysis

In Sample Power, it is fairly straightforward to perform power analysis for comparing means.  Simply begin a new analysis and select ‘t-test for two independent groups with common variance [enter means]’. 

We can then specify the two means, the mean for Group 1 (diet A) and the mean for Group 2 (diet B). Since what really matters is the difference between the two values, we can enter a mean of zero for Group 1 and a mean of 10 for Group 2 so that the difference in means will be 10. Next, we specify the standard deviation for the first population and the standard deviation for the second population. Note that upon entering a value for the second group, a confirmation window will appear, to which ‘yes’ is the appropriate response.


From there, the default significance level (alpha level) is .05.  For this example, we will set the power to be .8 by clicking the ‘Find N for any power’ button.

The calculation results indicate that we would need 42 subjects for diet A and another 42 subjects for diet B in our sample in order to detect the specified difference with the given power. Now, let’s use another pair of means with the same difference. As we have discussed earlier, the results should be the same, and we can see that they are.

A third way to reach the same result would be to begin a new project, designate ‘t-test for two independent groups with common variance [enter difference]’, and enter the difference between the means and the pooled standard deviation, which is the square root of the average of the two variances (squared standard deviations). In this case, it is sqrt((15^2 + 17^2)/2) ≈ 16.0. Despite the different types of inputs, we see an identical outcome.
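Sample Power is a point-and-click program, but the same calculation can be cross-checked in code. The following Python sketch (an illustrative addition using the statsmodels library, not part of the original walkthrough) uses the inputs from Example 1 and should reproduce the result of roughly 42 subjects per group.

```python
import math
from statsmodels.stats.power import TTestIndPower

# Inputs from Example 1
mean_difference = 10.0            # expected difference in blood glucose (mg/dl)
sd_group1, sd_group2 = 15.0, 17.0

# Pooled standard deviation and standardized effect size (Cohen's d)
pooled_sd = math.sqrt((sd_group1**2 + sd_group2**2) / 2)   # about 16.0
effect_size = mean_difference / pooled_sd                  # about 0.62

# Sample size per group for a two-sided test, alpha = .05, power = .80
n_per_group = TTestIndPower().solve_power(effect_size=effect_size,
                                          alpha=0.05, power=0.80,
                                          alternative='two-sided')
print(math.ceil(n_per_group))   # roughly 42 per group, matching the Sample Power result
```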

Now the dietician may feel that a total sample size of 84 subjects is beyond her budget. One way of reducing the sample size is to increase the Type I error rate, or the alpha level. Let’s say instead of using an alpha level of .05 we will use .07, an adjustment accomplished by clicking the alpha value, then the blank value, and entering the new alpha value. As a result, our sample size decreases by 4 for each group as shown below.

Now suppose the dietician can only collect data on 60 subjects with 30 in each group.  What will the statistical power for her t-test be with respect to alpha level of .07?

What if she actually collected her data on 60 subjects but with 40 on diet A and 20 on diet B instead of equal sample sizes in the groups?   Note that the N values will need to be unlinked, like the standard deviations were before.

As you can see, the power goes down from .72 to .69 even though the total number of subjects is the same. This is why we always say that a balanced design is more efficient.
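The balanced-versus-unbalanced comparison can be reproduced the same way. This Python sketch, again using statsmodels as a stand-in for Sample Power, computes the power for 30/30 and 40/20 splits at the .07 alpha level; small discrepancies from the values above are possible because the effect size is rounded.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 10.0 / 16.0   # difference in means over the pooled standard deviation

# Balanced design: 30 subjects per group, alpha = .07
power_balanced = analysis.power(effect_size=effect_size, nobs1=30,
                                alpha=0.07, ratio=1.0)

# Unbalanced design: 40 on diet A, 20 on diet B (ratio = nobs2 / nobs1)
power_unbalanced = analysis.power(effect_size=effect_size, nobs1=40,
                                  alpha=0.07, ratio=20 / 40)

print(f"balanced 30/30:   {power_balanced:.2f}")    # roughly .72
print(f"unbalanced 40/20: {power_unbalanced:.2f}")  # roughly .69
```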

An important technical assumption is the normality assumption.  If the distribution is skewed, then a small sample size may not have the power shown in the results because the power in the results is calculated using the normality assumption.  We have seen that in order to compute the power or the sample size, we have to make a number of assumptions.  These assumptions are used not only for the purpose of power calculations but in the actual t-test itself. So one important side benefit of performing power analysis is to help us to better understand our designs and our hypotheses.

We have seen in the power calculation process that what matters in the two-independent sample t-test is the difference in the means and the standard deviations for the two groups.  This leads to the concept of effect size.  In this case, the effect size will be the difference in means over the pooled standard deviation.  The larger the effect size, the higher the power for a given sample size.  Or the larger the effect size, the smaller the sample size needed to achieve a given level of power.  Given this relationship, we can see that an accurate estimate of effect size is key to a good power analysis.  However, it is not always an easy task to determine the effect size prior to collecting data.  Good estimates of effect size often come from the existing literature or pilot studies.

For more information on power analysis, please visit our Introduction to Power Analysis seminar.


Perspectives on Behavior Science, 42(1), March 2019

Tutorial: Small-N Power Analysis

Elizabeth G. E. Kyonka

Psychology, University of New England, Armidale, NSW 2351 Australia

Power analysis is an overlooked and underreported aspect of study design. A priori power analysis involves estimating the sample size required for a study based on predetermined maximum tolerable Type I and II error rates and the minimum effect size that would be clinically, practically, or theoretically meaningful. Power is more often discussed within the context of large-N group designs, but power analyses can be used in small-N research and within-subjects designs to maximize the probative value of the research. In this tutorial, case studies illustrate how power analysis can be used by behavior analysts to compare two independent groups, behavior in baseline and intervention conditions, and response characteristics across multiple within-subject treatments. After reading this tutorial, the reader will be able to estimate just noticeable differences using means and standard deviations, convert them to standardized effect sizes, and use G*Power to determine the sample size needed to detect an effect with desired power.

Behavior analysts have a longstanding history of skepticism when it comes to the necessity and utility of statistical inference. Sidman ( 1960 , p. 44) described statistical tests as a “curious negation of the professed aims of science,” inferior to techniques that establish experimental control because statistical tests rely on comparisons against an unknown parent distribution. In a similar vein, Michael ( 1974 , p. 650) described statistical inference as a “weak solution to an artificial problem” in single-organism research, arguing that even in applied settings the problem of uncontrolled variance in observations can be eliminated with appropriate experimental controls. Behavior analysts continue to argue that adequate experimental control obviates the need for statistical tests (e.g., Cohen, Feinstein, Masuda, & Vowles, 2014 ; Fisher & Lerman, 2014 ; Perone, 1999 ). In spite of many eloquent arguments against their use, a survey of almost any poster session at a large behavior analysis conference will show that this skepticism does not prevent behavior analysts from relying on statistical tests.

Null hypothesis significance testing (NHST), the orthodox statistical procedure in psychology for most of the 20th century, has been criticized by behavior analysts and others for as long as it has been practiced. It involves rejecting or failing to reject a particular “null” hypothesis based on whether the probability of the obtained result is less than some value (usually 5%) if the hypothesis were true. Branch ( 1999 , 2014 ) outlined the logical fallacy and malignant consequences of NHST for behavior analysts in at least two publications. To wit, p -values reported as outcomes of NHST are generally misinterpreted, and reliance on NHST suppresses genuine scientific advancement. Many social and behavioral scientists who are not behavior analysts also call for an end to the statistical ritual of NHST (e.g., Gigerenzer, 2004 ). Some researchers blame the misuse of NHST for a replicability crisis in psychology (Pashler & Harris, 2012 ). The American Statistical Association recently released a statement on statistical significance and p -values (Wasserstein & Lazar, 2016 ). It clarifies that a p -value is a measure of how incompatible data are with a specified statistical model, not of the probability that a particular hypothesis is true, the size of an effect or the importance of a result. It also provides some recommendations for good statistical practice that emphasize the importance of using an analytic approach that is compatible with study design and caution against relying solely on p -values when drawing scientific conclusions. The American Psychological Association also published guidelines on the use of statistical methods in psychology journals (Wilkinson & the Task Force on Statistical Inference, 1999 ). Among other recommendations, it prohibited some of the language associated with NHST and mandated the inclusion of effect sizes and estimates of reliability (i.e., confidence intervals) when reporting statistical results of any hypothesis test. At least one journal reacted by banning p -values altogether (Trafimow & Marks, 2015 ).

At a time when other fields seem to be abandoning certain types of statistical inference in favor of other ways of evaluating evidence, behavior analysts’ use of inferential statistics is increasing. The number of randomized controlled trials appearing in applied journals has increased over time. A search of the Wiley Online Library on January 22, 2018 for "randomized controlled trial" in the Journal of Applied Behavior Analysis produced 40 results. Two articles were published prior to 2000, 8 between 2000 and 2009, and the remaining 30 were published in 2010 or later. Zimmermann, Watkins, and Poling ( 2015 , p. 209) reported that the proportion of articles published in the Journal of the Experimental Analysis of Behavior that include “an inferential statistic” increases by approximately 8% every five years. A nontrivial portion of the research that behavior analysts do involves group-average comparisons and statistical inferences. Considering statistical power when determining sample sizes can help behavior analysts use these techniques correctly.

Behaviorist Constructions of Statistical Induction

Critiques of NHST are valid, but NHST is not the only form of statistical inference and there is a place for tests of statistical significance in behavioral research (Haig, 2017 ). NHST is a hybrid of Fisherian significance testing and Neyman and Pearson’s hypothesis testing, approaches that are mathematically similar to each other and to NHST, but involve different objectives, procedures, and philosophies of statistics. Although the amalgamation of Fisher’s insights about p -values with Neyman and Pearson’s ideas about error rates embeds a logical fallacy (post hoc ergo propter hoc) in NHST, other perspectives (e.g., Bayesian, Neo-Fisherian, and error-statistical) are logically consistent. Within those approaches, tests of statistical significance can be used in combination with other analyses to answer certain research questions (Haig, 2017 ; Wasserstein & Lazar, 2016 ).

The error-statistical approach (Mayo & Spanos, 2006 ) is related to Neyman-Pearsonian hypothesis testing (Neyman & Pearson, 1928 ) and incorporates statistical methods of testing hypotheses that are based on Neyman and Pearson’s inductive behaviorist philosophy of science. For both techniques, empirical research can contrast null and alternative hypotheses about data generating mechanisms. Hypotheses can be directional (e.g., response rates will be higher in this condition than in that condition), nondirectional (mean scores for this group will be different than for that group), or nil (this independent variable will have no effect on that behavior). The null and alternative hypotheses must exhaust the parameter space, such that one or the other must be correct (e.g., if the null hypothesis is that the response rates in two conditions are equal, the alternative hypothesis that exhausts the parameter space is that they are unequal). Neyman and Pearson ( 1928 , 1933 ) specified an all-or-none procedure whereby the null hypothesis is rejected or accepted based on how the test statistic compares to a critical value. Though it does not immunize the researcher against logical fallacy (Mayo & Spanos, 2006 ), it differs from NHST in that the inference is whether the evidence supports the null or alternative model, not whether the null hypothesis is sufficiently improbable. In the error-statistical approach, the test statistic ( t score, F ratio, etc.) is not compared to a critical value. Instead, it quantifies the discrepancy between the null hypothesis and data. An important aspect of the error statistical approach is that the probative value of the test is tempered by its severity. A statistical test is severe when the data collected provide good evidence for or against the null hypothesis (Mayo & Spanos, 2011 ). Statistical significance is informative because it “[enables] the assessment of how well probed or how severely tested claims are” (Mayo & Spanos, 2006 , p. 328), not the likelihood of the hypothesis.

Power analysis is an important component of Neyman-Pearson hypothesis testing, error-statistical analysis, and related inductive techniques for evaluating evidence with inferential tests. Using a priori power analysis to determine an appropriate sample size for an experiment does not guarantee a severe test of the hypothesis; as quantified by Mayo and Spanos ( 2006 , 2011 ) severity can only be determined after data collection is complete. What it does is ensure that the test is optimally sensitive for detecting the effect size that is of greatest interest to the researcher. Using power analysis to determine sample size is an important step in research design no less because it increases the proportion of studies that yield conclusive results (Simmons, Nelson, & Simonsohn, 2013 ) than because it is required by the American Psychological Association (Wilkinson & the Task Force on Statistical Inference, 1999 ).

Power analysis was identified as a critical step in hypothesis testing 90 years ago (Neyman & Pearson, 1928 ) and relatively plain-language instructions on how to use power analysis in the design of psychological research have been available for more than half a century (Cohen, 1962 ).

Unfortunately, power analyses are not conventionally reported in behavior analytic research, perhaps owing to a skill and knowledge gap among behavior analysts as a group. Inferential statistics and power analyses are not necessarily covered in behavior analysis research methods classes. They are not mentioned in the accreditation standards of the Association for Behavior Analysis International Accreditation Board ( 2017 ) or the Behavior Analyst Certification Board’s ( 2017 ) current task list. Most modern comprehensive statistics textbooks provide detailed treatments of power analysis, but they do not focus on the small-N, within-subject designs preferred by behavior analysts. The mathematical principles of power analysis are more or less the same regardless of design, but behavior analysts may be less likely to conduct power analyses for their own designs because they have not seen examples of power analysis that resemble the type of research that they do.

This tutorial is designed to remove the lack of appropriate models as one of the possible reasons behavior analysts continue to ignore statistical power when designing experiments and reporting results. It describes statistical power and the factors that determine statistical power and illustrates through case studies how behavior analysts can use G*Power (Faul, Erdfelder, Buchner, & Lang, 2009 ; Faul, Erdfelder, Lang, & Buchner, 2007 ) in small-N research. G*Power is a free power calculator that can be used in power analysis for a wide range of research designs, including many of those popular with behavior analysts. (It is available for download from http://www.gpower.hhu.de/en.html .) Of course, not all analyses of behavior involve inferential tests and the power analyses that are possible using G*Power are not applicable to all of the research designs that are used by behavior analysts. Research designs that are amenable to power analyses in G*Power focus on group-average effects. They include both between-group and within-subject comparisons and can be categorical (i.e., involving analysis of variance) or continuous (involving regression). I am not trying to convince anyone who relies on other designs or analytic techniques to start evaluating hypotheses with significance tests. The aim here is to help those behavior analysts who sometimes compare group or condition means to maximize the probative value of their results.

Statistical Power

Statistical power, the ability of a test to detect an effect of a given size, is an important consideration in research design. Failing to detect a meaningful effect when one is present is a Type II error, and falsely detecting a meaningful effect that is not there is a Type I error. In Neyman-Pearson hypothesis testing, the long-run probabilities of Type I and Type II errors are referred to as α and β, respectively. Power is 1 – β, the long-run probability of not making a Type II error, that is, of correctly detecting a meaningful effect when one is present. When testing hypotheses, failure to consider statistical power in the initial planning stages can produce sample sizes that are too small or too large. A priori power analysis involves computing the sample size required to detect an effect of a given size with the desired power. Observed (or post-hoc) power is the power of an already-conducted statistical test to detect as significant (i.e., to lead to a rejection of the null hypothesis in NHST) an effect size equal to the one obtained. The observed power of a test is directly (though nonlinearly) related to its p -value: when the p -value is high, power is always low and vice versa. As such, observed power reports the same information as a p -value, expressed a different way. It is no better or different from the p -value, and there is no reason to include it in a Results section that reports the p -value for the same test. Nevertheless, many statistical analysis tools and applications report observed power, so it is important that researchers do not confuse a priori and observed power.

Underpowered tests have too few observations to identify effects that are clinically, theoretically, or practically meaningful (Peterson, 2009 ). For example, a new drug that offers a subtle improvement over the current treatment may be worth research and development even if the effect is small. Likewise, if the dependent measure is naturally highly variable, detecting a difference with a statistical test requires more evidence than for one that varies little. When an effect is small or a dependent measure highly variable, a test that compares the two treatments in few patients (e.g., 10) is unlikely to produce a statistically significant result. Absence of evidence is not evidence of absence. In those cases, the absence of a statistically significant difference indicates that the amount of evidence is not sufficient to provide compelling support for either hypothesis, not that the two treatments are equally effective. Overpowered tests will detect effects that are trivially small. For instance, a difference of one-tenth of an IQ point is not meaningful in any practical sense of the word even if it is reliable, valid, consistently replicable, and otherwise real. However, a test comparing IQ scores of two groups of people that had a sufficiently large sample size (e.g., 500,000 per group) would detect a difference of 0.1 point as statistically significant and lead to rejection of the hypothesis that the two groups were equally intelligent. An experiment can have too few or too many observations to have genuine probative value.

A priori power is a conditional probability and subject to potential misinterpretation, just as p -values are. A p -value in NHST is the probability of the data given the hypothesis, not the probability of the hypothesis given the data. Likewise, power is the probability of rejecting the null hypothesis, given the alternative hypothesis, not the probability the alternative hypothesis is true, given the null hypothesis was rejected. Designing a study to have high statistical power does not guarantee that the results obtained will accurately reflect the true state of affairs. Conducting a power analysis to determine an appropriate sample size maximizes the probative value of the test by ensuring it is neither underpowered nor overpowered for a selected critical effect size. Another advantage of thinking about power when designing behavior-analytic studies is that it quantifies the benefits of experimental control: better experimental control reduces behavioral variability, making effects easier to detect, so studies with a high degree of experimental control require fewer observations.

Factors that Affect Statistical Power

The factors that determine statistical power mathematically are α, β, effect size, and sample size. Knowing the value of any three factors makes it possible to solve for the fourth. In an a priori power analysis, researchers decide on the largest Type I and II error rates they are willing to tolerate and the smallest effect they would consider to be meaningful, then use those values to solve for the sample size required. The mathematical functions relating α, β, and effect size to sample size are described in detail elsewhere (Cohen, 1988 remains the gold standard, but its coverage of within-subject designs is limited), but researchers with a general conceptual understanding of power and subject-matter expertise can use G*Power effectively without direct manipulation of the mathematical equations.
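As an aside for readers who prefer code to a point-and-click calculator, the statsmodels library in Python exposes the same "solve for the fourth factor" logic for an independent-samples t test. This is an illustrative addition with arbitrary placeholder numbers; G*Power remains the tool used throughout this tutorial.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Given alpha, power, and effect size, solve for the sample size per group
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)

# Given alpha, sample size, and effect size, solve for power
achieved_power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30)

# Given alpha, power, and sample size, solve for the detectable effect size
detectable_d = analysis.solve_power(nobs1=30, alpha=0.05, power=0.80)

print(f"n per group: {n_per_group:.1f}, power at n=30: {achieved_power:.2f}, "
      f"detectable d at n=30: {detectable_d:.2f}")
```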

Type of Test

Power calculations are test-dependent. All else being equal, the number of observations required to detect an effect of a given size is different for a chi-square test than for a t -test, and different for a 2 x 3 analysis of variance (ANOVA) than for a 2 x 4 ANOVA. Whether a test is parametric or nonparametric and whether it involves within-subject or between-group comparisons are also important considerations.

Parametric statistical tests typically assume that the parent distributions from which samples are drawn are all normal distributions with the same standard deviation. Nonparametric tests are not assumption free, but in general they do not require that parent distributions conform to a specific shape. A researcher who initially planned to use a parametric test might report a nonparametric test instead if one or more of their samples was nonnormal or there were large differences in sample standard deviations. A researcher might plan to use a nonparametric test because the dependent measure is ordinal (i.e., ranks rather than scores) or because prior research or quantitative theory suggests the assumptions of the parametric test will be violated. Davison ( 1999 ) recommended the use of nonparametric tests as a default for behavior analysts; however, when there is no reason to expect that their assumptions will be violated, one advantage to planning to use parametric tests is that power calculations are comparatively straightforward, both mathematically and in G*Power.

Within-subject designs are more powerful than comparable between-subject designs (Thompson & Campbell, 2004 ), but there are additional assumptions related to the covariance of repeated measures and the assumption of sphericity that must be addressed in power analysis in within-subject research. Along with potential order effects, concerns about sphericity have led some scholars to suggest that within-subject designs ought to be avoided (Greenwald, 1976 ). Most behavior analysts appreciate the advantages of having each subject serve as its own control (Perone, 1999 ), but may not know when to anticipate violations of sphericity in their experimental designs or how to evaluate sphericity as part of their analyses. In simple (i.e., single-factor) repeated-measures designs, the assumption of sphericity is that the variances (denoted by s²) of the difference scores are all equal. Difference scores are just the differences between scores from each pair of conditions for each subject. For example, in a within-subject design that measures response rate, R, in four conditions, there are six difference scores for each subject and the assumption of sphericity is that s²(R1 – R2) = s²(R1 – R3) = s²(R1 – R4) = s²(R2 – R3) = s²(R2 – R4) = s²(R3 – R4). Sphericity is not a concern when there are only two levels of the independent variable (there is only one difference score, so differences in the variances of difference scores are impossible). Violations of sphericity occur when scores from some conditions are more correlated than scores in other conditions (e.g., response rates from a baseline condition are correlated with response rates in treatment but uncorrelated with response rates in a follow-up).

Mauchly ( 1940 ) developed a test to examine repeated measures for violations of sphericity. For readers interested in learning more about sphericity, Lane’s ( 2016 ) description of the assumption of sphericity and suggestions about what to do when it is violated is succinct, accessible, and also addresses more complex multifactor designs. The output of Mauchly’s test is typically expressed as epsilon (ε), a measure of the degree to which sphericity is violated. Upper and lower bounds of ε are 1 (no violation) to 1/( k -1), where k is the number of measurements in the ANOVA. The sample size required to detect an effect with low ε (sphericity violated) may be higher than the sample size required to detect a smaller effect with ε ≈ 1 (no violation of sphericity). For the purpose of a power analysis, a researcher might estimate ε based on previous research or pilot data. In some circumstances, researchers might be able to eliminate concerns about violations of sphericity through experimental control, which would mean they could power their experiment assuming ε ≈ 1. As an alternative, a researcher might elect to be conservative and power their study assuming the largest possible violation of sphericity (and smallest value of ε).

There are broad differences between test families and more nuanced differences between specific tests within the same family. Tests from different families can be used to achieve the same goal, for example, regression, ANOVA, and t -tests could all be used to test whether difference between two independent samples was significant. Within the same test family, the parametric Student’s t -test and the nonparametric Mann-Whitney U test both compare two independent samples, but they differ in some underlying assumptions. Detailing differences between types of tests and explaining when to use each type of test are best left to statistics textbooks, so I encourage readers seeking more information about selecting tests to consult their preferred statistics textbook.

Tolerable Rates of β and α

Power is 1 – β, the long-run probability of not making a Type II error. Describing β as a factor that affects power (as statistics textbooks sometimes do) is a misnomer because they are two sides of the same coin, just as it would be uninformative to write that the number of incorrectly answered questions on an exam is a factor affecting exam score. By definition, whatever increases β decreases power and vice versa. Larger βs are associated with less statistical power. By contrast, larger αs are associated with greater statistical power. In hypothesis testing, tolerating a higher Type I error rate (e.g., α = .10 instead of α = .05) means that the critical value for the test statistic is lower and the test is less strict overall. The null hypothesis is more likely to be rejected whether it is true or not, so Type II errors are less likely, β is lower, and power is higher. Cohen ( 1992a , 1992b ) noted that adopting an α of .05 is typical in psychological research and recommended that psychologists use a β of .20 (power = .80). For certain types of study designs, there are other, more principled ways of selecting error rates. For example, when the sample size of an experiment cannot be adjusted, Mudge, Baker, Edge, and Houlahan ( 2012 ) describe an approach for optimizing α to minimize the combination of Type I and Type II error rates at a critical effect size. Such techniques can mitigate some of the risks of misinterpreting results of low-power studies, but they are not applicable to sample size estimation.

Setting different error rates is a simple matter in G*Power but deciding what error rates are tolerable is highly dependent on the research question, the selected test and even the researcher’s philosophy of science. Any attempt at prescriptive guidelines about tolerable error rates would be essentially useless. Nevertheless, examples of situations when power is more and less important than in your typical psychology experiment may be instructive. No one with any capacity for human compassion would accept power of 80% in criminal trials for capital offenses—the implication would be that they could tolerate putting to death 20% of the innocent people who happen to wind up on trial. By contrast, if the treatment for a fatal disease was very mild, presumably no one would object to very high rates of Type II errors (giving the treatment to healthy people) because it would minimize or eliminate Type I errors (failing to treat someone who is infected).

Meaningful Effect Sizes

An effect size is a descriptive statistic that estimates the magnitude of an effect. In research that compares means across different groups or conditions, some effect sizes (including Cohen’s d and f ) estimate the standardized difference between means. Others (e.g., eta-squared, η 2 ) estimate the proportion of variance in the dependent variable that can be explained by the different levels of an independent variable. Both types of effect size can be decomposed into the difference between population means and population variance. Larger differences between population means are easier to detect, so statistical tests have more power when differences between condition means are large. Likewise, consistent, reliable differences are easier to detect, so statistical tests have more power when population variance is low.

In psychophysics, the just noticeable difference (JND; Fechner, 1860/ 1912 ) is the smallest difference between two stimuli (e.g. brightness in lumens, volume in decibels, or pitch in Hz) that is perceptible to the subject. The objective of a priori power analysis is to determine the sample size required to detect a meaningful effect with the desired level of power, so in this article, the JND is the smallest effect that the researcher would consider to be practically, clinically, or theoretically significant.

Cohen ( 1988 , 1992b ) provided effect-size conventions that define small, medium, and large effects for several types of statistical tests. However, researchers investigating directly observable behavior and other similarly concrete dependent variables are advised to ignore conventions and consider the minimum absolute difference that would have impact or be meaningful in their research as the “unstandardized” JND to be used in power analysis (Thompson, 2002 ). For example, if five micrograms of lead per deciliter of blood is unsafe for children, medical tests of lead levels need to be able to detect that concentration of lead with high power (presumably >>>.80), even if it is only a tiny fraction of the standard deviation found in children in general and the standardized effect size is small. Conversely, if a child bites classmates on an almost daily basis, an intervention that reduces biting by 50% is probably insufficient to allow the child to return to the classroom even if the effect size is large. The case studies that follow illustrate the process of identifying an appropriate JND, converting it to a standardized effect size for computing the sample size needed to detect the JND with a desired level of power (given a specified α) in G*Power. Readers are encouraged to download G*Power and have it open as they read each case so that they can follow along with the calculations.

Case Studies

Case No. 1: Reducing Employee Absences

A large company suspects that employee absenteeism is having a significant negative impact on their bottom line, though lax record-keeping makes it difficult to know for sure. The records they do have indicate that on average, employees miss 8.0 ( SD = 2.0) days of work per year in addition to annual leave and explained medical absences. Someone has developed a strategy for reducing these unscheduled absences and the company has approved a request to run a month-long test to evaluate the efficacy and economic viability of this strategy.

The plan is to monitor absences in two groups of employees: a treatment group who will pilot the new strategy and a comparison group, with the same sample size in both groups. Based on the projected cost of implementing the new strategy throughout the company, it is estimated that the strategy needs to reduce absences by one day per employee per year to be cost neutral. If the board of directors is convinced that the strategy is likely to be effective enough to recoup the costs of implementation, they are prepared to adopt the new strategy.

Power analysis can determine how many employees to monitor during the test. The question of whether the strategy was effective enough in the pilot to be worth implementing throughout the company can be evaluated with an independent-samples t test. The test is one-tailed, because the company is specifically interested in reducing absences. The action taken if the treatment group has more absences than the comparison group would be the same as if both groups missed the same number of days: the company would not implement the strategy. To estimate the number of employees needed for the test, first determine the JND, convert it to a standardized effect size (Cohen’s d ), and compute an a priori power analysis in G*Power.

  • Determine the JND. The strategy needs to reduce absences by one day per employee per year to break even and the company will not adopt the strategy if it does not break even, so the JND is 1.0 day per year.
  • Convert the JND to a standardized effect size. The standardized effect size that G*Power uses to evaluate differences between two independent means is Cohen’s d. Cohen’s d expresses the difference between two group means in standard deviations. For example, a Cohen’s d of 2.0 indicates that the two means are precisely two standard deviations apart. The standard deviations for the employee absence data that has not yet been collected are unknown, but they can be estimated using the simplifying assumption that they will be similar to the standard deviation of annual unscheduled absences in existing employee records. Those records indicate the distribution of unscheduled absences has a mean of 8 and a standard deviation of 2. The standardized effect size for the JND is Cohen’s d = (break-even reduction in absences)/(standard deviation of absences) = (1 day per year)/(2 days per year) = 0.5. It does not matter that the pilot runs for one month and the effect size was estimated based on annual absences because the standardized effect is the same regardless of the unit of analysis.

Figure 1. Settings and input parameters for Case #1 (comparing means of two independent samples)

Figure 1 also shows the input parameters for this analysis. This is a one-tailed test with an effect size of 0.5. Setting the Type I error rate, α, equal to .05 is conventional in behavioral and social sciences rather than objectively correct. Using an α of .05 means that if the true effect size is zero (the new strategy has absolutely no effect) and the samples are drawn randomly and representative of their respective populations, the probability that the sample means will be significantly different is .05. Setting power to .90 means that the Type II error rate (β) is equal to .10: if the true effect size is exactly equal to the JND and the samples are drawn randomly and representative of their respective populations, the probability that the sample means will be significantly different is .90. Using a higher value for power would decrease the chances that your pilot test fails to detect the “true” reduction in absences (if any) by increasing the sample size. An allocation ratio of 1 means that both groups will have the same sample size.

Clicking the Calculate button reveals that the test requires a total sample size of 140, 70 employees each in the treatment and comparison groups. With this sample size, the independent-samples t test has 90.3% power to detect the JND. Power analysis does not guarantee that if the strategy is effective, the statistical test will be significant. It also does not mean that if the result is statistically significant, the strategy is guaranteed to be cost-effective (Button et al., 2013 ). In this situation, power analyses are an assurance that the results of the inferential test provide useful information about the strength of evidence for the efficacy of the strategy and the appropriate course of action in light of that evidence.
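For readers who want to verify the figure outside G*Power, the following Python sketch (an illustrative addition using statsmodels, not part of the original case study) solves for the per-group sample size with the same inputs and should land at roughly 70 employees per group.

```python
import math
from statsmodels.stats.power import TTestIndPower

# Case #1 inputs: JND of 1 day / SD of 2 days gives Cohen's d = 0.5,
# one-tailed test, alpha = .05, power = .90, equal group sizes
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.90, alternative='larger')
print(math.ceil(n_per_group))   # roughly 70 per group (140 total), as in G*Power
```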

This particular power analysis estimates the number of employees to monitor. It does not provide any insight about the optimal duration of the pilot test. One would assume that the duration of the test should be as long as it needs to be to ensure that results are reliable, but no longer. A pilot test that is too short might not yield a reliable estimate of the rate of absences, but beyond a certain point there are diminishing returns in continuing. The optimal duration depends on the variation in the temporal distribution of absences. If the number of absences per day is relatively stable from one month to the next, it is advisable to run a shorter test than if absences vary dramatically. The JND in this example was estimated from annual records, but the planned test is only one month long, so the implicit assumption is that the effect size (i.e., the difference between the treatment and comparison groups in standard deviations) will be comparable.

Case No. 2: Probability Discounting

Holt, Green, and Myerson ( 2003 ) assessed probability discounting in college students and found that college students who gambled discounted probabilistic rewards less steeply than nongamblers (they were more likely to choose the risky option). One way to replicate and extend these results with older adults is to measure probability discounting in older adults, in particular comparing discounting in senior citizens who belong to a local casino’s loyalty program with an equal number of age-matched participants who reported that they do not gamble. Many of the students in Holt et al.’s sample of gamblers reported only moderate rates of gambling. Problem gambling occurs at higher rates in older adults (Ladd, Molina, Kerins, & Petry, 2003 ), so it would be reasonable to aim to detect group differences that are larger than those reported by Holt et al. Determining the JND, converting it to Cohen’s d , and computing an a priori power analysis in G*Power will estimate the number of senior citizens to recruit for this replication and extension.

Determine the JND

Holt et al. ( 2003 ) reported the area under the curve (AUC) for each participant for probabilistic amounts of $1,000 and $5,000. Area under the curve is a unitless measure that can take any value between 0 and 1. Smaller values indicate steeper discounting and arguably greater risk-aversion. For $1,000, the mean AUC was .23 ( SD = .21) for gamblers and .10 ( SD = .07) for nongamblers. For $5,000, the mean AUC was .17 ( SD = .21) for gamblers and .09 ( SD = .12) for nongamblers. The difference in college students’ probability discounting AUCs was larger for $1,000 than for $5,000. To detect larger differences than those observed in college students, one might select the largest difference in AUC that Holt et al. obtained as the JND. The larger difference was for the $1,000 reward giving a JND of .23 - .10 = .13.

Convert the JND to a standardized effect size using G*Power

Assume the standard deviations in the senior citizen samples will be similar to those that Holt et al. ( 2003 ) obtained for college students. The formula for estimating Cohen’s d based on two independent samples is (M1 – M2) / s_pooled, where M1 and M2 are the sample means and s_pooled is the pooled standard deviation. First, select settings for the type of power analysis you will run (in this case, test family, statistical test, and type of power analysis should be set to t tests, Means: Difference between two independent means (two groups), and a priori, respectively). Next, clicking the “Determine =>” button under input parameters opens an effect-size calculator in G*Power that will compute Cohen’s d based on sample means and standard deviations. Figure 2 shows the input parameters for this example (means and standard deviations from each sample). The standardized effect size for the JND is d = 0.83.

Figure 2. Independent-samples t effect size calculator in G*Power with values set for Case #2, probability discounting
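The same conversion can be done outside G*Power. This short Python sketch (an illustrative addition) computes the pooled standard deviation and Cohen's d from the Holt et al. ( 2003 ) means and standard deviations, assuming equal group sizes so that the pooled SD reduces to the root mean square of the two SDs.

```python
import math

# Means and standard deviations from Holt et al. (2003), $1,000 AUC values
mean_gamblers, sd_gamblers = 0.23, 0.21
mean_nongamblers, sd_nongamblers = 0.10, 0.07

# Pooled standard deviation (equal group sizes), then Cohen's d = (M1 - M2) / s_pooled
s_pooled = math.sqrt((sd_gamblers**2 + sd_nongamblers**2) / 2)
cohens_d = (mean_gamblers - mean_nongamblers) / s_pooled

print(f"pooled SD = {s_pooled:.3f}, d = {cohens_d:.2f}")   # d is about 0.83
```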

Compute an a priori power analysis in G*Power

The settings and many of the input parameters for this example are the same as shown in Figure 1. This is a one-tailed test because the hypothesis is directional: gamblers will discount probabilistic rewards less steeply than nongamblers. The standardized effect size for the JND determined by G*Power’s effect-size calculator based on means and standard deviations from Holt et al. ( 2003 ) is 0.83, a large effect according to Cohen’s ( 1988 , 1992b ) conventions. Setting α and power to .05 and .80 is conventional in behavioral and social sciences. The allocation ratio of 1 means that both groups will have the same sample size.

Clicking the Calculate button reveals that this test requires a total sample size of 38, 19 gamblers and 19 nongamblers. With this sample size, the independent-samples t test has 80.7% power to detect the JND of 0.83 of a standard deviation. Although they did not mention power analysis explicitly, Holt et al. ( 2003 ) happened to include 19 participants in each group. Other research using a similar between-groups design to address similar research questions (e.g., Madden, Petry, & Johnson, 2009 ; Weller, Cook, Avsar, & Cox, 2008 ) has used the same sample size. For example, Madden et al. ( 2009 ) examined discounting in treatment-seeking pathological gamblers and demographically matched controls who did not gamble. They might have reasonably assumed that if differences between gamblers and nongamblers were detectable in Holt et al.’s sample of 38 college students, differences between pathological gamblers and nongamblers would be as large or larger, therefore the same sample size would be adequately powered for the desired comparison.
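As a cross-check on the G*Power output, the following statsmodels sketch (not part of the original case study) solves for the per-group sample size and then computes the achieved power with 19 participants per group; minor differences from the 80.7% figure may arise from rounding the effect size to 0.83.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# One-tailed test, d = 0.83, alpha = .05, power = .80, equal allocation
n_per_group = math.ceil(analysis.solve_power(effect_size=0.83, alpha=0.05,
                                             power=0.80, alternative='larger'))

# Achieved power with 19 participants in each group
achieved = analysis.power(effect_size=0.83, nobs1=19, alpha=0.05,
                          alternative='larger')

print(n_per_group, f"{achieved:.3f}")   # roughly 19 per group and power near .81
```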

To evaluate whether there were differences between gamblers' and nongamblers' rates of discounting, Holt et al. (2003) compared AUCs using a parametric test, but they also reported Mann-Whitney U tests to compare other dependent variables. The Mann-Whitney U test is a nonparametric equivalent of an independent-samples t test. It compares the ranks of scores rather than the scores themselves and does not assume that the distributions take any particular shape (though it does assume that observations are independent and that the variances of the populations sampled are equal), so it can be used to compare ordinal or otherwise nonnormal data. Computing the sample size needed to detect a JND of d = 0.83 with 80% power for a Mann-Whitney U test in G*Power requires some additional details about the shape of the parent distribution (under Input Parameters) and about how the test will be calculated (under Options) that are beyond the scope of this tutorial.
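Although the G*Power procedure for the Mann-Whitney U test is not covered here, a rough check is possible by simulation. The sketch below is an illustration only, not the method Holt et al. used: it assumes normal parent distributions separated by d = 0.83 and estimates the power of a one-tailed Mann-Whitney U test with 19 participants per group, using numpy and scipy.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
n_per_group, d, alpha, n_sims = 19, 0.83, 0.05, 5000

rejections = 0
for _ in range(n_sims):
    gamblers = rng.normal(loc=d, scale=1.0, size=n_per_group)       # shifted by d SDs
    nongamblers = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    # One-tailed test: gamblers are expected to have larger AUCs
    _, p = mannwhitneyu(gamblers, nongamblers, alternative='greater')
    rejections += p < alpha

print(rejections / n_sims)  # roughly .78 to .80 under these assumptions
```

Under these particular (normal-distribution) assumptions the estimated power is slightly below that of the t test, which is consistent with the Mann-Whitney test's modest efficiency cost when parametric assumptions hold.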

Case No. 3: Increasing Physical Activity

Hayes and Van Camp (2015) increased children's physical activity during school recess with an intervention that involved reinforcement, self-monitoring, goal setting, and feedback. One way to extend their research might be to devise an alternative intervention and determine whether it is also effective at increasing physical activity. Like Hayes and Van Camp's experiment, the follow-up might involve recording the number of steps taken by elementary schoolchildren during 20-minute recess periods using accelerometers and evaluating the efficacy of the intervention using a withdrawal design. Data analysis for this experiment is likely to involve visual inspection of single-subject graphs showing number of steps as a function of session rather than a dependent-samples t test. Nonetheless, a power analysis for dependent means can be used to determine a sample size that is neither under- nor overpowered. To estimate the number of children needed as participants, determine the JND, convert it to Cohen's d, and compute an a priori power analysis in G*Power.

Hayes and Van Camp (2015) reported that their intervention successfully increased steps during recess by 47%, or M = 630 steps per student. If the objective were to determine whether the alternative intervention is at least as effective as Hayes and Van Camp's intervention, the unstandardized JND would be 630 steps. However, the objective is to determine whether the intervention is effective, without comparison to prior results. The results of the experiment might indicate that implementing the intervention is worthwhile even if the effect is smaller than the effect Hayes and Van Camp reported. One might reason that 249 additional steps during a single 20-minute recess period would have a negligible effect on distance traveled or minutes of moderate-to-vigorous physical activity, so a difference must be at least 250 steps to be considered meaningful. Of course, this boundary is somewhat arbitrary because the difference between 249 steps and 250 steps is minuscule. The specific boundary of the JND is less important than whether the intended audience (e.g., an institutional review board, thesis committee, grant review panel, journal editors, the researchers themselves) is convinced by the rationale.

Convert the JND to a standardized effect size

Assume the standard deviations will be similar to those obtained in previous research. Hayes and Van Camp (2015) did not report standard deviations, but Table 1 shows mean steps taken during baseline and the intervention for individual subjects from that study (C. M. Van Camp, personal communication, September 22, 2017). Cohen's d for dependent or related samples can be calculated either by dividing the mean of the subjects' difference scores by the standard deviation of those difference scores, or from the means and standard deviations of each condition. Difference scores are X_Tx − X_Bl, the mean steps taken by a subject in the intervention and baseline phases, respectively. From Table 1, the mean difference score for Hayes and Van Camp's six subjects was 630.17 steps (SD = 214.61 steps), giving an effect size of d = 2.94. Replacing the obtained mean difference with the unstandardized JND gives a standardized JND of d = 250/214.61 = 1.16 (see the sketch following Table 1).

Table 1. Step counts for individual subjects from Hayes and Van Camp (2015)

Subject    Baseline steps    Intervention steps    Difference score
Ellen      1309              1640                  331
Summer     1423              1960                  537
Laura      1438              2085                  647
Kate       1184              2177                  993
Fallon     1210              1840                  630
Sara       1392              2035                  643
Mean       1326              1956.17               630.17
SD         109.74            192.38                214.61
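The difference-score calculation described above can be reproduced from Table 1 with a few lines of code. The sketch below is illustrative only and assumes the numpy library.

```python
import numpy as np

baseline = np.array([1309, 1423, 1438, 1184, 1210, 1392])
intervention = np.array([1640, 1960, 2085, 2177, 1840, 2035])

diff = intervention - baseline
mean_diff = diff.mean()            # 630.17 steps
sd_diff = diff.std(ddof=1)         # 214.61 steps (sample SD)

d_obtained = mean_diff / sd_diff   # about 2.94
d_jnd = 250 / sd_diff              # about 1.16, the standardized JND
print(round(d_obtained, 2), round(d_jnd, 2))
```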

The effect-size calculator in G*Power can calculate Cohen's d either "from differences," as above, or "from group parameters." Calculating d from group parameters requires the correlation between baseline and intervention scores (r = .071) in addition to the means and standard deviations of both samples, because the correlation between baseline and intervention scores determines how variable the difference scores are and therefore must be accounted for in the power analysis. Select the settings (in this case, test family, statistical test, and type of power analysis should be set to t tests, Means: Difference between two dependent means (matched pairs), and a priori, respectively). Next, click the "Determine =>" button under Input Parameters to open the effect-size calculator in G*Power. Figure 3 shows the input parameters for this example. The mean for group 2 is the baseline mean plus the unstandardized JND, 1326 + 250 = 1576. Any pair of means that differ by 250 steps will produce the same effect size. Consistent with the calculation from difference scores, the standardized effect size for the JND is d = 1.16.
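Under the "from group parameters" approach, the two condition standard deviations and their correlation are combined to recover the standard deviation of the difference scores. The sketch below shows that arithmetic directly (it is not a G*Power call, and the variable names are illustrative).

```python
import math

sd_baseline = 109.74
sd_intervention = 192.38
r = 0.071                      # correlation between baseline and intervention steps
jnd = 250                      # unstandardized JND in steps

# SD of the difference scores implied by the group parameters
sd_diff = math.sqrt(sd_baseline**2 + sd_intervention**2
                    - 2 * r * sd_baseline * sd_intervention)   # about 214.6

d_jnd = jnd / sd_diff          # about 1.16, matching the difference-score calculation
print(round(sd_diff, 1), round(d_jnd, 2))
```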

Figure 3. G*Power settings, inputs, and outputs for Case #3 (comparing means of two dependent samples)

Figure 3 shows the settings and input parameters to select in G*Power for this analysis. This is a one-tailed test because it has a directional hypothesis (the intervention will increase steps). Setting α to .05 and power to .80 is conventional in the behavioral and social sciences. There is no allocation ratio for this test because it is within-subject: each subject will experience both baseline and intervention. Clicking the Calculate button reveals that, for the specified input parameters, the dependent-samples test requires a total sample size of seven. With this sample size, a comparison of steps taken in baseline versus the intervention has 85.7% power to detect the JND of 1.16 standard deviations.
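The corresponding a priori calculation can also be scripted with statsmodels' one-sample/paired t power routine (again an assumed tool; the tutorial itself uses G*Power).

```python
from statsmodels.stats.power import TTestPower

paired = TTestPower()

# A priori: number of subjects for d = 1.16, one-tailed alpha = .05, power = .80
n = paired.solve_power(effect_size=1.16, alpha=0.05, power=0.80, alternative='larger')
print(n)  # a little over 6, which rounds up to 7 subjects

# Achieved power with 7 subjects
power_7 = paired.power(effect_size=1.16, nobs=7, alpha=0.05, alternative='larger')
print(round(power_7, 3))  # about 0.86 (G*Power reports 85.7%)
```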

Case No. 4: Stimulus Discrimination

In three-key operant conditioning chambers, Kyonka, Rice, and Ward (2017) trained pigeons in a discrimination task that shares some features with slot machines. They compared several characteristics of responding across four trial types. Replicating results in chambers with different equipment (e.g., touch screens) might be a first step to conducting related follow-up experiments. An a priori power analysis in G*Power can be used to estimate the number of pigeons needed to confirm, with 80% power, that differences in responding to each trial type correspond to those reported in Kyonka et al.'s Table 1 (p. 35).

Partial eta squared (η_p²) is the proportion of total variance in a dependent variable that is associated with the different treatments of the independent variable, with the effects of other independent variables and interactions partialed out. It is calculated as SS_treatment/(SS_treatment + SS_error), where SS_treatment is the sum of squared deviations of the treatment means from the grand (overall) mean for each observation and SS_error is the sum of squared deviations of each observation from its treatment mean. In this experiment, the four different trial types are the four treatments. For the main effect of trial type on response proportion, conditional response rate, sample-phase response time, and "collect-phase" response latency, Kyonka et al. (2017) reported η_p² values of .81, .66, .58, and .64, respectively. Trial type had the smallest effect on sample-phase response time, but violations of sphericity were observed for the effects on response proportion and conditional response rate.

Sphericity is an assumption of repeated-measures ANOVA. Epsilon (ε) is a measure of the degree to which sphericity is violated, with an upper bound of 1 (no violation) and a lower bound of 1/(k − 1), where k is the number of measurements in the ANOVA. For an experiment with four trial types, the smallest ε that G*Power will accept is 0.34. Violations of sphericity do not affect the calculation of η_p², but they can increase the sample size needed to achieve a given power. To estimate the sample size needed to replicate an experiment with multiple dependent variables, a researcher might identify one critical dependent variable and power the experiment to detect that particular effect, or they might conduct separate analyses for multiple dependent variables and use the largest sample size indicated. This example estimates the number of pigeons needed to replicate the effect of trial type on conditional response rate, a measure of the relative conditioned reinforcing value of the stimuli presented (Kyonka et al., 2017).

Convert the JND to Cohen’s f

Partial eta squared is already a standardized effect size, but G*Power uses Cohen's f in power analyses for repeated measures. Cohen's f is the standard deviation of the treatment means divided by the common within-treatment standard deviation. It can be derived from partial eta squared as √[η_p²/(1 − η_p²)], either with the effect-size calculator in G*Power or with any other calculator. The effect size for conditional response rate converts from η_p² = .66 to f = √1.94 = 1.39.
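The conversion is simple enough to script. The sketch below (Python, standard library only) applies √[η_p²/(1 − η_p²)] to each of the η_p² values reported above; the function name is illustrative.

```python
import math

def eta_sq_to_f(eta_sq):
    """Cohen's f from partial eta squared: f = sqrt(eta_p^2 / (1 - eta_p^2))."""
    return math.sqrt(eta_sq / (1 - eta_sq))

for eta_sq in (0.81, 0.66, 0.58, 0.64):
    print(eta_sq, round(eta_sq_to_f(eta_sq), 2))
# eta_p^2 = .66 converts to f = 1.39, the value used in this power analysis
```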

Figure 4 shows the initial settings and options to select in G*Power. The statistical test that compares a dependent variable across four different trial types is a repeated-measures ANOVA. Select the test family, F tests, first. The statistical test is ANOVA: Repeated measures, within factors (rather than between factors or interaction). The type of power analysis is a priori.

Figure 4. Settings and options for Case #4 (comparing means from four within-subject conditions)

G*Power 3 provides several options for calculating effect sizes for repeated measures. To change the effect-size calculator, click the Options button, select the radio button for the desired effect-size specification, and click OK. The option "as in SPSS" assumes that the effect size is calculated as SS_treatment/(SS_treatment + SS_error), so it is the appropriate option for this analysis regardless of the program used to calculate η_p². Using this option, G*Power can calculate Cohen's f directly from η_p², or from treatment and error variances, sample sizes, and the number of repeated measures in the within-subject factor, information that generally can be found in ANOVA source tables.

Figure 5 shows the input parameters for this power analysis. Setting α to .05 and power to .80 is conventional in the behavioral and social sciences. This experimental design involves one group of subjects (i.e., there are no between-groups factors) and four measurements. The last input parameter is the nonsphericity correction factor, ε. Figure 5 also shows the output parameters for this power analysis. Clicking the Calculate button yields an estimated sample size of seven. With seven pigeons in the experiment, the ANOVA has 81.7% power to detect a JND of f = 1.39 with the maximum possible violation of sphericity, ε = .34. Repeating the power analysis for different values of ε produces sample size estimates between four and seven.

Figure 5. Settings, input, and output parameters for Case #4 (comparing means from four within-subject conditions)

Author Note

The author thanks Don Hantula for his input on the utility of a tutorial on this topic, Carole Van Camp for providing additional data from a previously published experiment, Jarrett Byrnes for drawing attention to the Mudge et al. ( 2012 ) article on optimizing error rates, and Regina Carroll for valuable feedback on a portion of the manuscript.

1 For two groups with standard deviations SD1 and SD2 and sample sizes N1 and N2, the pooled standard deviation is √{[(N1 − 1)SD1² + (N2 − 1)SD2²]/(N1 + N2 − 2)}, which simplifies to √[(SD1² + SD2²)/2] if the sample sizes are equal.

  • Association for Behavior Analysis International Accreditation Board. Accreditation handbook. Portage, MI: Author; 2017.
  • Behavior Analyst Certification Board. BCBA/BCaBA task list. 5th ed. Littleton, CO: Author; 2017.
  • Branch M. Malignant side effects of null-hypothesis significance testing. Theory & Psychology. 2014;24:256–277. doi: 10.1177/0959354314525282.
  • Branch MN. Statistical inference in behavior analysis: some things significance testing does and does not do. Behavior Analyst. 1999;22:87–92. doi: 10.1007/BF03391984.
  • Button KS, Ioannidis J, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, Munafò MR. Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience. 2013;14:365–376. doi: 10.1038/nrn3475.
  • Cohen J. The statistical power of abnormal-social psychological research: a review. Journal of Abnormal & Social Psychology. 1962;65:145–153. doi: 10.1037/h0045186.
  • Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
  • Cohen J. Statistical power analysis. Current Directions in Psychological Science. 1992;1:98–101. doi: 10.1111/1467-8721.ep10768783.
  • Cohen J. A power primer. Psychological Bulletin. 1992;112:155–159. doi: 10.1037/0033-2909.112.1.155.
  • Cohen LL, Feinstein A, Masuda A, Vowles KE. Single-case research design in pediatric psychology: considerations regarding data analysis. Journal of Pediatric Psychology. 2014;39:124–137. doi: 10.1093/jpepsy/jst065.
  • Davison M. Statistical inference in behavior analysis: having my cake and eating it? Behavior Analyst. 1999;22:99–103. doi: 10.1007/BF03391986.
  • Faul F, Erdfelder E, Buchner A, Lang A-G. Statistical power analyses using G*Power 3.1: tests for correlation and regression analyses. Behavior Research Methods. 2009;41:1149–1160. doi: 10.3758/BRM.41.4.1149.
  • Faul F, Erdfelder E, Lang A-G, Buchner A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods. 2007;39:175–191. doi: 10.3758/BF03193146.
  • Fechner GT. Elements of psychophysics (H. S. Langfeld, Trans.). In: Rand B, editor. The classical psychologists. 1912. pp. 562–572. Retrieved from http://psychclassics.yorku.ca/Fechner/ (Original work published 1860).
  • Fisher WW, Lerman DC. It has been said that, "There are three degrees of falsehoods: lies, damn lies, and statistics." Journal of School Psychology. 2014;52:243–248. doi: 10.1016/j.jsp.2014.01.001.
  • Gigerenzer G. Mindless statistics. Journal of Socio-Economics. 2004;33:587–606. doi: 10.1016/j.socec.2004.09.033.
  • Greenwald AG. Within-subjects designs: to use or not to use? Psychological Bulletin. 1976;83(2):314–320. doi: 10.1037/0033-2909.83.2.314.
  • Haig BD. Tests of statistical significance made sound. Educational & Psychological Measurement. 2017;77:489–506. doi: 10.1177/0013164416667981.
  • Hayes LB, Van Camp CM. Increasing physical activity of children during school recess. Journal of Applied Behavior Analysis. 2015;48:690–695. doi: 10.1002/jaba.222.
  • Holt DD, Green L, Myerson J. Is discounting impulsive? Evidence from temporal and probability discounting in gambling and non-gambling college students. Behavioural Processes. 2003;64:355–367. doi: 10.1016/S0376-6357(03)00141-4.
  • Kyonka EG, Rice N, Ward AA. Categorical discrimination of sequential stimuli: all SΔ are not created equal. Psychological Record. 2017;67:27–41. doi: 10.1007/s40732-016-0203-2.
  • Ladd GT, Molina CA, Kerins GJ, Petry NM. Gambling participation and problems among older adults. Journal of Geriatric Psychiatry & Neurology. 2003;16:172–177. doi: 10.1177/0891988703255692.
  • Lane D. The assumption of sphericity in repeated-measures designs: what it means and what to do when it is violated. Quantitative Methods for Psychology. 2016;12:114–122. doi: 10.20982/tqmp.12.2.p114.
  • Madden GJ, Petry NM, Johnson PS. Pathological gamblers discount probabilistic rewards less steeply than matched controls. Experimental & Clinical Psychopharmacology. 2009;17:283–290. doi: 10.1037/a0016806.
  • Mauchly JW. Significance test for sphericity of a normal n-variate distribution. Annals of Mathematical Statistics. 1940;11:204–209. doi: 10.1214/aoms/1177731915.
  • Mayo DG, Spanos A. Severe testing as a basic concept in a Neyman-Pearson philosophy of induction. British Journal for the Philosophy of Science. 2006;57:323–357. doi: 10.1093/bjps/axl003.
  • Mayo DG, Spanos A. Error statistics. In: Bandyopadhyay PS, Forster MR, editors. Handbook of philosophy of science. Amsterdam, Netherlands: Elsevier; 2011. pp. 153–198.
  • Michael J. Statistical inference for individual organism research: mixed blessing or curse? Journal of Applied Behavior Analysis. 1974;7:647–653. doi: 10.1901/jaba.1974.7-647.
  • Mudge JF, Baker LF, Edge CB, Houlahan JE. Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS ONE. 2012;7(2):e32734. doi: 10.1371/journal.pone.0032734.
  • Neyman J, Pearson ES. On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika. 1928;20A:175–240.
  • Neyman J, Pearson ES. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character. 1933;231(694–706):289–337.
  • Pashler H, Harris CR. Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science. 2012;7:531–536. doi: 10.1177/1745691612463401.
  • Perone M. Statistical inference in behavior analysis: experimental control is better. Behavior Analyst. 1999;22:109–116. doi: 10.1007/BF03391988.
  • Peterson C. Minimally sufficient research. Perspectives on Psychological Science. 2009;4:7–9. doi: 10.1111/j.1745-6924.2009.01089.x.
  • Sidman M. Tactics of scientific research: evaluating experimental data in psychology. New York, NY: Basic Books; 1960.
  • Simmons JP, Nelson LD, Simonsohn U. Life after p-hacking. Meeting of the Society for Personality and Social Psychology, New Orleans, LA, January 17–19, 2013. Available at SSRN: http://ssrn.com/abstract=2205186 or doi: 10.2139/ssrn.2205186.
  • Thompson B. "Statistical," "practical," and "clinical": how many kinds of significance do counselors need to consider? Journal of Counseling & Development. 2002;80:64–71. doi: 10.1002/j.1556-6678.2002.tb00167.x.
  • Thompson VA, Campbell JI. A power struggle: between- vs. within-subjects designs in deductive reasoning research. Psychologia. 2004;47:277–296. doi: 10.2117/psysoc.2004.277.
  • Trafimow D, Marks M. Publishing models and article dates explained. Basic & Applied Social Psychology. 2015;37:1. doi: 10.1080/01973533.2015.1012991.
  • Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. American Statistician. 2016;70:129–133. doi: 10.1080/00031305.2016.1154108.
  • Weller RE, Cook EW, Avsar KB, Cox JE. Obese women show greater delay discounting than healthy-weight women. Appetite. 2008;51:563–569. doi: 10.1016/j.appet.2008.04.010.
  • Wilkinson L, The Task Force on Statistical Inference, American Psychological Association, Science Directorate. Statistical methods in psychology journals: guidelines and explanations. American Psychologist. 1999;54:594–604. doi: 10.1037/0003-066X.54.8.594.
  • Zimmermann ZJ, Watkins EE, Poling A. JEAB research over time: species used, experimental designs, statistical analyses, and sex of subjects. Behavior Analyst. 2015;38:203–218. doi: 10.1007/s40614-015-0034-5.
