Type 1 and Type 2 Errors in Statistics
Saul McLeod, PhD
Editor-in-Chief for Simply Psychology
BSc (Hons) Psychology, MRes, PhD, University of Manchester
Saul McLeod, PhD, is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.
A statistically significant result cannot prove that a research hypothesis is correct (proof would imply 100% certainty). Because a p-value is based on probabilities, there is always a chance of drawing an incorrect conclusion about accepting or rejecting the null hypothesis (H0).
Anytime we make a decision using statistics, there are four possible outcomes, with two representing correct decisions and two representing errors.
The chances of committing these two types of errors are inversely related: all else being equal, decreasing the Type I error rate increases the Type II error rate, and vice versa.
As the significance level (α) increases, it becomes easier to reject the null hypothesis, decreasing the chance of missing a real effect (Type II error, β) but increasing the chance of a false alarm (Type I error). If the significance level (α) goes down, it becomes harder to reject the null hypothesis, increasing the chance of missing a real effect (Type II error) while reducing the risk of falsely finding one (Type I error).
Type I error
A type 1 error is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis. Simply put, it’s a false alarm.
This means that you report that your findings are significant when they have occurred by chance.
The probability of making a type 1 error is represented by your alpha level (α), the p-value threshold below which you reject the null hypothesis.
An alpha level of 0.05 means that you are willing to accept a 5% chance of rejecting the null hypothesis when it is actually true, that is, of declaring significance for a difference that arose by chance.
You can reduce your risk of committing a type 1 error by setting a lower alpha level (like α = 0.01). For example, an alpha level of 0.01 means you accept only a 1% chance of committing a Type I error.
However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a type II error).
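To see this trade-off in numbers, here is a minimal Python simulation (not part of the original article; the group size of 30, the effect size of 0.5, and the number of simulated studies are illustrative assumptions). It repeatedly runs a two-sample t-test when the null hypothesis is true and when a real effect exists, and counts how often each wrong decision occurs at two alpha levels.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sims = 30, 2000          # participants per group, number of simulated studies

def rejection_rate(true_diff, alpha):
    """Fraction of simulated two-sample t-tests that reject H0 at the given alpha."""
    rejections = 0
    for _ in range(sims):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_diff, 1.0, n)
        _, p = stats.ttest_ind(treated, control)
        rejections += p < alpha
    return rejections / sims

for alpha in (0.05, 0.01):
    type1 = rejection_rate(true_diff=0.0, alpha=alpha)  # H0 true: any rejection is a false positive
    power = rejection_rate(true_diff=0.5, alpha=alpha)  # H0 false: rejections are correct detections
    print(f"alpha={alpha}: Type I rate ~ {type1:.3f}, Type II rate ~ {1 - power:.3f}")
```

Lowering alpha from 0.05 to 0.01 pushes the simulated Type I error rate down toward 1% but raises the Type II error rate, which is exactly the trade-off described above.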
Scenario: Drug Efficacy Study
Imagine a pharmaceutical company is testing a new drug, named “MediCure”, to determine if it’s more effective than a placebo at reducing fever. They run an experiment with two groups: one receives MediCure, and the other receives a placebo.
- Null Hypothesis (H0) : MediCure is no more effective at reducing fever than the placebo.
- Alternative Hypothesis (H1) : MediCure is more effective at reducing fever than the placebo.
After conducting the study and analyzing the results, the researchers found a p-value of 0.04.
If they use an alpha (α) level of 0.05, this p-value is considered statistically significant, leading them to reject the null hypothesis and conclude that MediCure is more effective than the placebo.
However, suppose that in reality MediCure has no actual effect, and the observed difference was due to random variation or some other confounding factor. In this case, the researchers have incorrectly rejected a true null hypothesis.
Error : The researchers have made a Type 1 error by concluding that MediCure is more effective when it isn’t.
Implications
Resource Allocation : Making a Type I error can lead to wastage of resources. If a business believes a new strategy is effective when it’s not (based on a Type I error), they might allocate significant financial and human resources toward that ineffective strategy.
Unnecessary Interventions : In medical trials, a Type I error might lead to the belief that a new treatment is effective when it isn’t. As a result, patients might undergo unnecessary treatments, risking potential side effects without any benefit.
Reputation and Credibility : For researchers, making repeated Type I errors can harm their professional reputation. If they frequently claim groundbreaking results that are later refuted, their credibility in the scientific community might diminish.
Type II error
A type 2 error (or false negative) happens when you fail to reject the null hypothesis when it should actually be rejected.
Here, a researcher concludes there is no significant effect when there really is one.
The probability of making a type II error is called Beta (β), which is related to the power of the statistical test (power = 1- β). You can decrease your risk of committing a type II error by ensuring your test has enough power.
You can do this by ensuring your sample size is large enough to detect a practical difference when one truly exists.
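One common way to check this before collecting data is an a priori power analysis. The sketch below is a rough illustration, assuming the statsmodels package and an independent-samples t-test; the effect size (Cohen's d = 0.5), the power target, and the alpha level are placeholder choices rather than values from this article.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a medium effect (Cohen's d = 0.5)
# with 80% power at a two-sided alpha of 0.05.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required sample size per group: {n_per_group:.0f}")   # about 64

# Power actually achieved if only 30 participants per group are available.
achieved = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05)
print(f"Power with n = 30 per group: {achieved:.2f}")          # roughly 0.5, so beta is about 0.5
```

With too few participants the test has roughly a coin-flip chance of missing a real medium-sized effect, which is precisely the type II risk described above.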
Scenario: Efficacy of a New Teaching Method
Educational psychologists are investigating the potential benefits of a new interactive teaching method, named “EduInteract”, which utilizes virtual reality (VR) technology to teach history to middle school students.
They hypothesize that this method will lead to better retention and understanding compared to the traditional textbook-based approach.
- Null Hypothesis (H0) : The EduInteract VR teaching method does not result in significantly better retention and understanding of history content than the traditional textbook method.
- Alternative Hypothesis (H1) : The EduInteract VR teaching method results in significantly better retention and understanding of history content than the traditional textbook method.
The researchers designed an experiment where one group of students learns a history module using the EduInteract VR method, while a control group learns the same module using a traditional textbook.
After a week, the students’ retention and understanding are tested using a standardized assessment.
Upon analyzing the results, the psychologists found a p-value of 0.06. Using an alpha (α) level of 0.05, this p-value isn’t statistically significant.
Therefore, they fail to reject the null hypothesis and conclude that the EduInteract VR method isn’t more effective than the traditional textbook approach.
However, let’s assume that in the real world, the EduInteract VR truly enhances retention and understanding, but the study failed to detect this benefit due to reasons like small sample size, variability in students’ prior knowledge, or perhaps the assessment wasn’t sensitive enough to detect the nuances of VR-based learning.
Error : By concluding that the EduInteract VR method isn’t more effective than the traditional method when it is, the researchers have made a Type 2 error.
This could prevent schools from adopting a potentially superior teaching method that might benefit students’ learning experiences.
Missed Opportunities : A Type II error can lead to missed opportunities for improvement or innovation. For example, in education, if a more effective teaching method is overlooked because of a Type II error, students might miss out on a better learning experience.
Potential Risks : In healthcare, a Type II error might mean overlooking a harmful side effect of a medication because the research didn’t detect its harmful impacts. As a result, patients might continue using a harmful treatment.
Stagnation : In the business world, making a Type II error can result in continued investment in outdated or less efficient methods. This can lead to stagnation and the inability to compete effectively in the marketplace.
How do Type I and Type II errors relate to psychological research and experiments?
Type I errors are like false alarms, while Type II errors are like missed opportunities. Both errors can impact the validity and reliability of psychological findings, so researchers strive to minimize them to draw accurate conclusions from their studies.
How does sample size influence the likelihood of Type I and Type II errors in psychological research?
Sample size in psychological research mainly influences the likelihood of Type II errors. The probability of a Type I error is set by the alpha level the researcher chooses, not by the sample size.
A larger sample size increases the chances of detecting true effects, reducing the likelihood of Type II errors.
Are there any ethical implications associated with Type I and Type II errors in psychological research?
Yes, there are ethical implications associated with Type I and Type II errors in psychological research.
Type I errors may lead to false positive findings, resulting in misleading conclusions and potentially wasting resources on ineffective interventions. This can harm individuals who are falsely diagnosed or receive unnecessary treatments.
Type II errors, on the other hand, may result in missed opportunities to identify important effects or relationships, leading to a lack of appropriate interventions or support. This can also have negative consequences for individuals who genuinely require assistance.
Therefore, minimizing these errors is crucial for ethical research and ensuring the well-being of participants.
Further Information
- Publication manual of the American Psychological Association
- Statistics for Psychology Book Download
Statistical notes for clinical researchers: Type I and type II errors in statistical decision
Hae-Young Kim
Correspondence to Hae-Young Kim, DDS, PhD. Associate Professor, Department of Health Policy and Management, College of Health Science, and Department of Public Health Sciences, Graduate School, Korea University, 145 Anam-ro, Seongbukgu, Seoul, Korea 136-701. TEL, +82-2-3290-5667; FAX, +82-2-940-2879; [email protected]
Issue date 2015 Aug.
Statistical inference is a procedure in which we try to make a decision about a population using information from a sample drawn from it. In modern statistics it is assumed that we never know the population exactly, so there is always a possibility of making errors. Theoretically, a sample statistic may take values in a wide range because we may select many different samples, which is called sampling variation. To obtain practically meaningful inferences, we preset an acceptable level of error. In statistical inference we distinguish two types of error, type I and type II errors.
Null hypothesis and alternative hypothesis
The first step of statistical testing is the setting of hypotheses. When comparing multiple group means we usually set a null hypothesis, for example, "There is no true mean difference," which is a general statement or a default position. The other side is an alternative hypothesis such as "There is a true mean difference." Often the null hypothesis is denoted as H0 and the alternative hypothesis as H1 or Ha. To test a hypothesis, we collect data and measure how much the data support or contradict the null hypothesis. If the measured results are similar to, or only slightly different from, the condition stated by the null hypothesis, we do not reject H0. However, if the dataset shows a large and significant difference from the condition stated by the null hypothesis, we regard this as sufficient evidence that the null hypothesis is not true and reject H0. When a null hypothesis is rejected, the alternative hypothesis is adopted.
Type I and type II errors
As we assume that we never directly know the population, we never know whether a statistical decision is right or wrong. In reality, H0 may be right or wrong, and we decide either to accept or to reject it. In a statistical decision there are therefore four possible situations, as presented in Table 1. Two of them lead to correct conclusions: a true H0 is accepted or a false H0 is rejected. The other two are erroneous: a false H0 is accepted or a true H0 is rejected. A type I error or alpha (α) error refers to an erroneous rejection of a true H0. Conversely, a type II error or beta (β) error refers to an erroneous acceptance of a false H0.
Table 1. Possible results of hypothesis testing.
Making some level of error is unavoidable because fundamental uncertainty lies in any statistical inference procedure. Since errors are harmful, we need to control or limit their maximum level. Which type of error is riskier, type I or type II? Traditionally, committing a type I error has been considered riskier, and thus stricter control of the type I error has been exercised in statistical inference.
When we are interested in the null hypothesis only, we may think about the type I error only. Consider a situation in which someone develops a new method and insists that it is more efficient than conventional methods, but the new method is actually not more efficient. The truth is H0, which says "The effects of the conventional and newly developed methods are equal." Suppose the statistical test results support the efficiency of the new method, an erroneous conclusion in which the true H0 is rejected (type I error). Based on that conclusion, we consider adopting the newly developed method and make the effort to construct a new production system. The erroneous statistical inference with a type I error would result in unnecessary effort and a vain investment in something no better. If instead the statistical conclusion were made correctly, that the conventional and newly developed methods were equal, then we could comfortably stay with the familiar conventional method. Therefore, the type I error has been strictly controlled to avoid such wasted effort on adopting changes that bring no improvement.
In another example, suppose we are interested in a safety issue. Someone develops a new method which is actually safer than the conventional method. In this situation the null hypothesis states that "The degrees of safety of both methods are equal," while the alternative hypothesis, "The new method is safer than the conventional method," is true. Suppose that we erroneously accept the null hypothesis (type II error) as the result of statistical inference. We erroneously conclude equal safety, stay in the less safe conventional environment, and remain continuously exposed to the risk. If the risk is a serious one, we would remain in danger because of the erroneous conclusion with a type II error. Therefore, not only the type I error but also the type II error needs to be controlled.
Schematic example of type I and type II errors
Figure 1 shows a schematic example of the relative sampling distributions under a null hypothesis (H0) and an alternative hypothesis (H1). Suppose they are two sampling distributions of sample means (X̄). H0 states that the sample means are normally distributed with population mean zero. H1 states a different population mean of 3 under the same shape of sampling distribution. For simplicity, assume the standard error of both distributions is one, so the sampling distribution under H0 is the standard normal distribution in this example. In statistical testing on H0 with an alpha level of 0.05, the critical values are set at ±2 (or exactly 1.96). If the observed sample mean from the dataset lies within ±2, we accept H0, because we do not have enough evidence to deny it. If the observed sample mean lies beyond that range, we reject H0 and adopt H1. In this example the probability of an alpha error (two-sided) is set at 0.05, because the area beyond ±2 is 0.05, which is the probability of rejecting a true H0. As seen in Figure 1, values more extreme than 2 in absolute terms can still appear under H0, since the standard normal distribution ranges to infinity. In practice, however, we decide to reject H0 because such extreme values are too different from the assumed mean of zero. Though the decision carries a probability of error of 0.05, we allow that risk because the difference is considered big enough to reach the reasonable conclusion that the null hypothesis is false. As we never know whether the sample dataset we have comes from the H0 or the H1 population, we can make a decision only based on the value we observe from the sample data.
Figure 1. Illustration of type I (α) and type II (β) errors.
The type II error is shown as the area below 2 under the distribution of H1. The amount of type II error can be calculated only when the alternative hypothesis specifies a definite value. In Figure 1, a definite mean value of 3 is used in the alternative hypothesis. The critical value 2 is one standard error (= 1) smaller than the mean of 3, and is standardized to z = (2 − 3)/1 = −1 in the standard normal distribution. The area below z = −1 is 0.16 (yellow area) in the standard normal distribution. Therefore, the amount of type II error is obtained as 0.16 in this example.
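This figure-based calculation can be reproduced directly, for example with SciPy. The sketch below uses the rounded cutoff of 2 from the text; the exact 1.96 cutoff gives a slightly smaller beta of about 0.15.

```python
from scipy.stats import norm

mu1, se = 3.0, 1.0     # mean under H1 and the standard error assumed in the example
critical = 2.0         # rounded two-sided cutoff for alpha = 0.05 (exactly 1.96)

# Type II error: the observed mean falls below the critical value even though
# the data really come from the H1 distribution centred at 3.
beta = norm.cdf((critical - mu1) / se)
print(f"beta  ~ {beta:.2f}")       # ~0.16, matching the worked example
print(f"power ~ {1 - beta:.2f}")   # ~0.84, anticipating the power section below
```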
Relationship between type I and type II errors and factors affecting them
1. Related change of both errors
Type I and type II errors are closely related. If all other conditions are the same, reducing the type I error level increases the type II error level. When we decrease the alpha error level from 0.05 to 0.01, the critical value moves outward to around ±2.58. As a result, the beta level in Figure 1 will increase to around 0.34, if all other conditions are the same. Conversely, moving the critical value inward (to the left) decreases the type II error level and increases the type I error level. Therefore, the determination of an error level should be a procedure that considers both error types simultaneously.
2. Effect of distance between H 0 and H 1
If H1 suggests a larger center, e.g., 4 instead of 3, then the H1 distribution moves to the right. If we fix the alpha level at 0.05, then the beta level becomes smaller than before. If the center value is 4, the z value is −2, and the area below −2 in the standard normal distribution is approximately 0.025. If all other conditions are the same, increasing the distance between H0 and H1 decreases the beta error level.
3. Effect of sample size
Then how do we keep both error levels low? Increasing the sample size is one answer, because a larger sample size reduces the standard error (standard deviation/√sample size) when all other conditions are held constant. A smaller standard error produces more concentrated sampling distributions, with slimmer curves under both the null and the alternative hypothesis, and the resulting overlapping area gets smaller. As the sample size increases, we can achieve satisfactorily low levels of both alpha and beta errors.
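As a rough numerical illustration of this point (not taken from the article), the sketch below assumes a one-sample z-test with a known per-observation standard deviation of 1 and a true mean difference of 0.5, and shows how the standard error and the type II error shrink as the sample size grows while alpha stays fixed at 0.05.

```python
from math import sqrt
from scipy.stats import norm

alpha, mean_diff, sd = 0.05, 0.5, 1.0      # illustrative assumptions, not values from the text
z_crit = norm.ppf(1 - alpha / 2)           # two-sided critical value, ~1.96

for n in (16, 32, 64, 128):
    se = sd / sqrt(n)                              # standard error shrinks with n
    beta = norm.cdf(z_crit - mean_diff / se)       # probability of missing the true effect
    print(f"n = {n:3d}: standard error = {se:.3f}, beta ~ {beta:.3f}")
```

With these illustrative numbers, beta falls from about 0.48 at n = 16 to well under 0.001 at n = 128, while the alpha level remains 0.05 throughout.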
Statistical significance level
The type I error level is often called the significance level. In statistical testing with the alpha level set at 0.05, we reject the null hypothesis when the observed value from the dataset falls in the extreme 5% region, and we conclude that there is evidence of a difference from the null hypothesis. As a difference beyond this level is considered statistically significant, the level is called the significance level. Sometimes the significance level is expressed using a p value, e.g., "Statistical significance was determined as p < 0.05." The p value is defined as the probability of obtaining the observed value or more extreme values when the null hypothesis is true. Figure 2 shows a type I error level of 0.05 and a two-sided p value of 0.02. The observed z value of 2.3 is located in the rejection region, with a p value of 0.02, which is smaller than the significance level of 0.05. A small p value indicates that the probability of observing such a dataset or more extreme cases is very low under the assumed null hypothesis.
Figure 2. Significance level and p value.
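The two-sided p value quoted for Figure 2 can be checked with a short calculation, assuming a standard normal test statistic as in the figure:

```python
from scipy.stats import norm

z_observed = 2.3
p_two_sided = 2 * (1 - norm.cdf(z_observed))   # probability of a result at least this extreme under H0
print(f"p ~ {p_two_sided:.3f}")                # ~0.021, below the 0.05 significance level
```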
Statistical power
Power is the probability of rejecting a false null hypothesis; it is the other side of the type II error. Power is calculated as 1 − type II error (β). In Figure 1, the type II error level is 0.16 and the power is obtained as 0.84. Usually a power level of 0.8 - 0.9 is required in experimental studies. Because of the relationship between type I and type II errors, we need to keep both errors at acceptably low levels. A sufficient sample size is needed to keep the type I error as low as 0.05 or 0.01 and the power as high as 0.8 or 0.9.
6.1 - Type I and Type II Errors
When conducting a hypothesis test there are two possible decisions: reject the null hypothesis or fail to reject the null hypothesis. You should remember though, hypothesis testing uses data from a sample to make an inference about a population. When conducting a hypothesis test we do not know the population parameters. In most cases, we don't know if our inference is correct or incorrect.
When we reject the null hypothesis there are two possibilities. There could really be a difference in the population, in which case we made a correct decision. Or, it is possible that there is not a difference in the population (i.e., \(H_0\) is true) but our sample was different from the hypothesized value due to random sampling variation. In that case we made an error. This is known as a Type I error.
When we fail to reject the null hypothesis there are also two possibilities. If the null hypothesis is really true, and there is not a difference in the population, then we made the correct decision. If there is a difference in the population, and we failed to reject it, then we made a Type II error.
Rejecting \(H_0\) when \(H_0\) is really true, denoted by \(\alpha\) ("alpha") and commonly set at .05
\(\alpha=P(Type\;I\;error)\)
Failing to reject \(H_0\) when \(H_0\) is really false, denoted by \(\beta\) ("beta")
\(\beta=P(Type\;II\;error)\)
Example: Trial
A man goes to trial where he is being tried for the murder of his wife.
We can put it in a hypothesis testing framework. The hypotheses being tested are:
- \(H_0\) : Not Guilty
- \(H_a\) : Guilty
Type I error is committed if we reject \(H_0\) when it is true. In other words, the man did not kill his wife but was found guilty and is punished for a crime he did not really commit.
Type II error is committed if we fail to reject \(H_0\) when it is false. In other words, if the man did kill his wife but was found not guilty and was not punished.
Example: Culinary Arts Study
A group of culinary arts students is comparing two methods for preparing asparagus: traditional steaming and a new frying method. They want to know if patrons of their school restaurant prefer their new frying method over the traditional steaming method. A sample of patrons are given asparagus prepared using each method and asked to select their preference. A statistical analysis is performed to determine if more than 50% of participants prefer the new frying method:
- \(H_{0}: p = .50\)
- \(H_{a}: p>.50\)
Type I error occurs if they reject the null hypothesis and conclude that their new frying method is preferred when in reality it is not. This may occur if, by random sampling error, they happen to get a sample that prefers the new frying method more than the overall population does. If this does occur, the consequence is that the students will have an incorrect belief that their new method of frying asparagus is superior to the traditional method of steaming.
Type II error occurs if they fail to reject the null hypothesis and conclude that their new method is not superior when in reality it is. If this does occur, the consequence is that the students will have an incorrect belief that their new method is not superior to the traditional method when in reality it is.
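A sketch of how such an analysis might be run is shown below; the counts are hypothetical, not taken from the course example, and the test shown is an exact one-sided binomial test of \(H_{0}: p = .50\) against \(H_{a}: p > .50\).

```python
from scipy.stats import binomtest

# Hypothetical data: 34 of 60 sampled patrons prefer the new frying method.
prefer_frying, n_patrons = 34, 60
result = binomtest(prefer_frying, n_patrons, p=0.50, alternative='greater')
print(f"one-sided p-value = {result.pvalue:.3f}")

alpha = 0.05
if result.pvalue < alpha:
    # Rejecting here risks a Type I error if patrons as a whole have no preference.
    print("Reject H0: evidence that more than half of patrons prefer frying.")
else:
    # Failing to reject here risks a Type II error if frying really is preferred.
    print("Fail to reject H0: no convincing evidence of a preference for frying.")
```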
Type I and Type II Errors
- By: John MacInnes | Edited by: Paul Atkinson, Sara Delamont, Alexandru Cernat, Joseph W. Sakshaug & Richard A. Williams
- Publisher: SAGE Publications Ltd
- Publication year: 2020
- Online pub date: September 23, 2020
- DOI: https://doi.org/10.4135/9781526421036946329
- Online ISBN: 9781529750959
In frequentist inferential hypothesis testing, the researcher risks committing one of two errors. They may reject a hypothesis that is in fact true , known as a Type I error, or fail to reject a hypothesis that is in fact false , called a Type II error. The risk of these errors arises because the population described by any hypothesis is unobserved, while the evidence for acceptance or rejection of the hypothesis rests on a random sample drawn from that population. There are, of course, other sources of error in the process of data analysis in general and hypothesis testing in particular. This entry provides an overview of hypothesis testing and errors and then discusses confusion and concerns associated with Type I and Type II errors.
Hypothesis Testing and Errors
Analysis of an unobserved population on the basis of a sample drawn randomly from it often comprises specifying and testing a null hypothesis (H 0 ) about that population. This can take the form of any statement about that population that would allow a sampling distribution to be calculated under the condition that the null is true, and therefore a goodness of fit test made of any data observed to that expected under the null. In practice the most convenient form of the null hypothesis is usually a statement of the form of “there is no difference,” “no effect,” or “no association” since research usually looks for phenomena such as a difference in the mean value of a variable, or distribution of the categories of a variable, between two groups, or an association between two variables. The hypothesis must be sufficiently specific and explicit to be empirically falsifiable with appropriate data.
By definition, any null hypothesis must be either true or false for a given population, but since the population is unobserved, any decision to reject it or not is based on analysis of a sample drawn randomly from that population. In the classic frequentist approach to statistics, the acceptable rate of risk is first established for committing the Type I error of rejecting a true null hypothesis. This is sometimes referred to as the alpha level (α). Conventionally, this has often been set at 5%, but there is little sound statistical reason for such a tradition, which was originally rooted in the convenience of setting a common level when adequate computing power was unavailable to tackle voluminous goodness of fit calculations. The probability of obtaining the data observed in the sample, conditional upon the null hypothesis being true in the population is then calculated, resulting in what is commonly called a “ p value.” If this probability is below the α level (e.g., the conventional p ≤ .05 or 5%) then there is evidence to reject the null hypothesis and the result may be declared “statistically significant” at the α level.
Any such rejection of the null hypothesis runs the risk of a Type I error equal to the p value. No matter how low the probability of observing the data in the sample, given a true null hypothesis, if this probability is greater than zero, then the null hypothesis may in fact be true for the population. If this is the case, the rejection of the null constitutes a Type I error. The probability of a Type I error can, of course, only refer to the hypothesis testing procedure used, and not the individual test. In any individual test, an error either has or has not been made, any falsifiable hypothesis, null or otherwise, must either be true or not. However, without observation of the population, this can never be definitively known.
Conversely, if the probability of obtaining the data observed conditional upon the null being true (the p value) is not equal to or less than the α level threshold, then it may be decided that there is insufficient evidence against the null hypothesis so that the researcher fails to reject it. Again, this decision may be correct or incorrect. No matter how high the p value or how probable the observed data may be, were the null hypothesis to be true, it may be that in the population from which the sample was drawn, the hypothesis is in fact false. Failure to reject a null hypothesis that is in fact false, constitutes a Type II error. While calculating the probability of committing a Type I error is straightforward, the calculation of the probability of committing a Type II error, referred to as beta (β) also depends on knowledge of whatever effect renders the null false, which, by definition, must be unknown. The Type II error probability will vary inversely with the effect size being investigated and the size of the sample. Rather than the Type II error rate (β), researchers usually report the power of a test defined as 1−β to detect a given effect size. A conventional choice of test power is 80%, which implies a risk of a Type II error of 0.2.
Any hypothesis based on analysis of sample data therefore faces four possibilities, as shown in Table 1 .
A Type I error may often be described as a “false positive” result (since rejection of the null typically implies some kind of finding or discovery), while a Type II error may be described as a “false negative” (the failure to obtain evidence for a finding that is there to be discovered).
Confusion and Concerns
Students and novice researchers usually find the double negative wording of hypothesis formulation and testing confusing. Positive results arise from the rejection of a null hypothesis. Conversely, it is often unclear just what status a null hypothesis that has failed rejection possesses. This confusion arises from two sources. One is the constraint upon the kind of hypothesis that can yield a sampling distribution and therefore the calculation of p values. Usually, a researcher looking for evidence of an effect of some kind does not know, or may not wish to make assumptions about, the likely size of such an effect. In the absence of what would be an arbitrary assumption about its size, it is impossible to directly specify a hypothesis about the effect. Conversely, a “null hypothesis” of no effect is precise and permits the straightforward calculation of the probability of observing any data obtained under the assumption that this null hypothesis is true.
Another source of confusion is the status of knowledge produced by a failure to reject a null hypothesis, and this has been the subject of much discussion. The statistician who did most to develop and popularise hypothesis testing, R. A. Fisher, understood falsification of a null hypothesis as signifying probable evidence of an effect that ought then to be investigated by further research. He said less about the meaning of a failure to reject but was clear that this did not in itself provide evidence, let alone proof, of the absence of an effect: Any test might simply be underpowered. In Fisher’s approach, rejecting a null led to further research, while failure to reject might close a line of enquiry but did not mean “accepting” the null since there was always some risk that a hypothesis test was underpowered and the sample size too low to avoid a Type II error for the size of effect lurking in the population.
The work of Jerzy Neyman and Egon Pearson in the 1920s and 1930s introduced the idea of linking Type I and Type II errors to the relative costs and benefits of decisions that would result from each error. For example, in a medical context, it might be highly desirable to avoid false negative results, even at the cost of inflating the risk of false positive ones. Neyman and Pearson also introduced the idea of an alternative hypothesis that researchers would commit to provisionally accepting, were the null rejected. This alternative hypothesis would then be open to falsification and revision as research progressed. While this procedure has been widely adopted in some fields, it has always been the subject of debate, since it is often unclear on what basis any alternative hypothesis is specified, and unless it can be readily formulated as a null hypothesis, it is not always clear how its falsification and revision ought to proceed.
As well as confusion over the definition and interpretation of Type I and II errors, concerns are often expressed about multiple testing. The α level refers to a single hypothesis test, specified in advance, on any individual sample of data. Multiple tests systematically increase the risk of Type I errors. The most conservative correction that can be applied is simply to divide the α level by the number of tests undertaken on the sample data, so that with an original α level of .05, performing 10 tests would require the level to be set at .005. Other procedures, which take account of the fact that not all the tests conducted would be independent of each other, lead to a smaller but still substantial revision downward of the α level. Except for experiment-based research, it is rare in the social sciences for unique hypotheses to be formulated for testing prior to data observation. Far more typical is the “detective work” of data exploration, searching for promising patterns of association in data that is usable for a variety of overlapping research questions. However, novice researchers may confuse the p values for a priori formulated unique hypothesis offered by statistical software with the post-hoc probability of the existence of some effect, a conflation of errors known as “ p -hacking” or “data snooping.”
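As a small illustration of the conservative correction described above, the sketch below applies a Bonferroni adjustment to five hypothetical p values, both by hand and with the statsmodels helper; none of the numbers come from the entry itself.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

alpha = 0.05
p_values = np.array([0.003, 0.012, 0.021, 0.040, 0.260])   # hypothetical results of 5 tests

# Hand-rolled Bonferroni: each test is judged against alpha / number of tests.
per_test_alpha = alpha / len(p_values)
print(f"per-test threshold: {per_test_alpha:.3f}")
print("reject H0?", p_values < per_test_alpha)

# The same correction via statsmodels (also reports Bonferroni-adjusted p values).
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method='bonferroni')
print("reject H0?", reject)
print("adjusted p:", p_adjusted)
```

At the uncorrected 0.05 level four of the five tests would be declared significant; after the correction only one survives.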
Multiple testing also leads many observers to argue that the conventional setting of α levels of 5% in some disciplines may in fact lead to the common reporting of false positive findings because of the “bottom drawer” problem: Journals never publish studies that merely confirm a null hypothesis, so that such studies leave no trace, while the 5% of similar studies that unwittingly commit a Type I error may indeed be published.
It is also important to remember that the α level and statistical significance is not the same as “substantive significance.” A large enough sample size will render even the most trivial effect “significant.” A common critique of the classical inferential procedure is that in a population no null hypothesis is likely to be perfectly true and that all that is ever required to reject a null is a large enough sample.
Finally, it is important to note that probabilities associated with both “errors” refer only to the impact of sample size and variance. Errors may still arise for quite other reasons, especially if the distribution of the variables being measured is different to that assumed or expected, measurements are poor, data are contaminated, models misspecified or the population to which generalisation might be made poorly identified. Statisticians and others have thus proposed a wide range of “Type III” errors, often related to finding a “correct answer to the wrong problem,” but none of them have yet gained widespread acceptance.
Type I & Type II Errors | Differences, Examples, Visualizations
Published on 18 January 2021 by Pritha Bhandari . Revised on 2 February 2023.
In statistics , a Type I error is a false positive conclusion, while a Type II error is a false negative conclusion.
Making a statistical decision always involves uncertainties, so the risks of making these errors are unavoidable in hypothesis testing .
The probability of making a Type I error is the significance level , or alpha (α), while the probability of making a Type II error is beta (β). These risks can be minimized through careful planning in your study design.
For example, suppose you get tested for coronavirus. Two incorrect conclusions are possible:
- Type I error (false positive): the test result says you have coronavirus, but you actually don’t.
- Type II error (false negative): the test result says you don’t have coronavirus, but you actually do.
Using hypothesis testing, you can make decisions about whether your data support or refute your research predictions with null and alternative hypotheses .
Hypothesis testing starts with the assumption of no difference between groups or no relationship between variables in the population—this is the null hypothesis . It’s always paired with an alternative hypothesis , which is your research prediction of an actual difference between groups or a true relationship between variables .
In this case:
- The null hypothesis (H 0 ) is that the new drug has no effect on symptoms of the disease.
- The alternative hypothesis (H 1 ) is that the drug is effective for alleviating symptoms of the disease.
Then, you decide whether the null hypothesis can be rejected based on your data and the results of a statistical test. Since these decisions are based on probabilities, there is always a risk of drawing the wrong conclusion.
- If your results show statistical significance , that means they are very unlikely to occur if the null hypothesis is true. In this case, you would reject your null hypothesis. But sometimes, this may actually be a Type I error.
- If your findings do not show statistical significance, they have a high chance of occurring if the null hypothesis is true. Therefore, you fail to reject your null hypothesis. But sometimes, this may be a Type II error.
A Type I error means rejecting the null hypothesis when it’s actually true. It means concluding that results are statistically significant when, in reality, they came about purely by chance or because of unrelated factors.
The risk of committing this error is the significance level (alpha or α) you choose. That’s a value that you set at the beginning of your study to assess the statistical probability of obtaining your results ( p value).
The significance level is usually set at 0.05 or 5%. This means that your results only have a 5% chance of occurring, or less, if the null hypothesis is actually true.
If the p value of your test is lower than the significance level, it means your results are statistically significant and consistent with the alternative hypothesis. If your p value is higher than the significance level, then your results are considered statistically non-significant.
To reduce the Type I error probability, you can simply set a lower significance level.
Type I error rate
The null hypothesis distribution curve below shows the probabilities of obtaining all possible results if the study were repeated with new samples and the null hypothesis were true in the population .
At the tail end, the shaded area represents alpha. It’s also called a critical region in statistics.
If your results fall in the critical region of this curve, they are considered statistically significant and the null hypothesis is rejected. However, this is a false positive conclusion, because the null hypothesis is actually true in this case!
A Type II error means not rejecting the null hypothesis when it’s actually false. This is not quite the same as “accepting” the null hypothesis, because hypothesis testing can only tell you whether to reject the null hypothesis.
Instead, a Type II error means failing to conclude there was an effect when there actually was. In reality, your study may not have had enough statistical power to detect an effect of a certain size.
Power is the extent to which a test can correctly detect a real effect when there is one. A power level of 80% or higher is usually considered acceptable.
The risk of a Type II error is inversely related to the statistical power of a study. The higher the statistical power, the lower the probability of making a Type II error.
Statistical power is determined by:
- Size of the effect : Larger effects are more easily detected.
- Measurement error : Systematic and random errors in recorded data reduce power.
- Sample size : Larger samples reduce sampling error and increase power.
- Significance level : Increasing the significance level increases power.
To (indirectly) reduce the risk of a Type II error, you can increase the sample size or the significance level.
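These determinants can be made concrete with a power calculation for an independent-samples t-test. The sketch below assumes the statsmodels package, and all effect sizes, group sizes, and alpha levels are illustrative rather than values from this article.

```python
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

# Baseline scenario and three single-factor changes (all values are illustrative).
baseline      = power_calc.solve_power(effect_size=0.5, nobs1=50, alpha=0.05)
larger_sample = power_calc.solve_power(effect_size=0.5, nobs1=100, alpha=0.05)
larger_effect = power_calc.solve_power(effect_size=0.8, nobs1=50, alpha=0.05)
higher_alpha  = power_calc.solve_power(effect_size=0.5, nobs1=50, alpha=0.10)

print(f"baseline (d=0.5, n=50, a=0.05): {baseline:.2f}")       # ~0.70
print(f"larger sample (n=100)         : {larger_sample:.2f}")  # ~0.94
print(f"larger effect (d=0.8)         : {larger_effect:.2f}")  # ~0.98
print(f"higher alpha (0.10)           : {higher_alpha:.2f}")   # ~0.80
```

Each change raises power, and therefore lowers the Type II error rate, exactly as the list above suggests; raising alpha does so only at the cost of a larger Type I error risk.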
Type II error rate
The alternative hypothesis distribution curve below shows the probabilities of obtaining all possible results if the study were repeated with new samples and the alternative hypothesis were true in the population .
The Type II error rate is beta (β), represented by the shaded area on the left side. The remaining area under the curve represents statistical power, which is 1 – β.
Increasing the statistical power of your test directly decreases the risk of making a Type II error.
The Type I and Type II error rates influence each other. That’s because the significance level (the Type I error rate) affects statistical power, which is inversely related to the Type II error rate.
This means there’s an important tradeoff between Type I and Type II errors:
- Setting a lower significance level decreases a Type I error risk, but increases a Type II error risk.
- Increasing the power of a test decreases a Type II error risk, but increases a Type I error risk.
This trade-off is visualized in the graph below. It shows two curves:
- The null hypothesis distribution shows all possible results you’d obtain if the null hypothesis is true. The correct conclusion for any point on this distribution means not rejecting the null hypothesis.
- The alternative hypothesis distribution shows all possible results you’d obtain if the alternative hypothesis is true. The correct conclusion for any point on this distribution means rejecting the null hypothesis.
Type I and Type II errors occur where these two distributions overlap. The blue shaded area represents alpha, the Type I error rate, and the green shaded area represents beta, the Type II error rate.
By setting the Type I error rate, you indirectly influence the size of the Type II error rate as well.
It’s important to strike a balance between the risks of making Type I and Type II errors. Reducing the alpha always comes at the cost of increasing beta, and vice versa .
For statisticians, a Type I error is usually worse. In practical terms, however, either type of error could be worse depending on your research context.
A Type I error means mistakenly going against the main statistical assumption of a null hypothesis. This may lead to new policies, practices or treatments that are inadequate or a waste of resources.
In contrast, a Type II error means failing to reject a null hypothesis. It may only result in missed opportunities to innovate, but these can also have important practical consequences.
In statistics, a Type I error means rejecting the null hypothesis when it’s actually true, while a Type II error means failing to reject the null hypothesis when it’s actually false.
The risk of making a Type I error is the significance level (or alpha) that you choose. That’s a value that you set at the beginning of your study to assess the statistical probability of obtaining your results ( p value ).
To reduce the Type I error probability, you can set a lower significance level.
The risk of making a Type II error is inversely related to the statistical power of a test. Power is the extent to which a test can correctly detect a real effect when there is one.
To (indirectly) reduce the risk of a Type II error, you can increase the sample size or the significance level to increase statistical power.
Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test . Significance is usually denoted by a p -value , or probability value.
Statistical significance is arbitrary – it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that the data is likely to occur less than 5% of the time under the null hypothesis .
When the p -value falls below the chosen alpha value, then we say the result of the test is statistically significant.
In statistics, power refers to the likelihood of a hypothesis test detecting a true effect if there is one. A statistically powerful test is more likely to reject a false negative (a Type II error).
If you don’t ensure enough power in your study, you may not be able to detect a statistically significant result even when it has practical significance. Your study might not have the ability to answer your research question.
Cite this Scribbr article
Bhandari, P. (2023, February 02). Type I & Type II Errors | Differences, Examples, Visualizations. Scribbr. Retrieved 21 October 2024, from https://www.scribbr.co.uk/stats/type-i-and-type-ii-error/
Curbing type I and type II errors
Kenneth J. Rothman
Received 2010 Mar 3; Accepted 2010 Mar 3; Issue date 2010.
The statistical education of scientists emphasizes a flawed approach to data analysis that should have been discarded long ago. This defective method is statistical significance testing. It degrades quantitative findings into a qualitative decision about the data. Its underlying statistic, the P -value, conflates two important but distinct aspects of the data, effect size and precision [ 1 ]. It has produced countless misinterpretations of data that are often amusing for their folly, but also hair-raising in view of the serious consequences.
Significance testing maintains its hold through brilliant marketing tactics—the appeal of having a “significant” result is nearly irresistible—and through a herd mentality. Novices quickly learn that significant findings are the key to publication and promotion, and that statistical significance is the mantra of many senior scientists who will judge their efforts. Stang et al. [ 2 ], in this issue of the journal, liken the grip of statistical significance testing on the biomedical sciences to tyranny, as did Loftus in the social sciences two decades ago [ 3 ]. The tyranny depends on collaborators to maintain its stranglehold. Some collude because they do not know better. Others do so because they lack the backbone to swim against the tide.
Students of significance testing are warned about two types of errors, type I and II, also known as alpha and beta errors. A type I error is a false positive, rejecting a null hypothesis that is correct. A type II error is a false negative, a failure to reject a null hypothesis that is false. A large literature, much of it devoted to the topic of multiple comparisons, subgroup analysis, pre-specification of hypotheses, and related topics, are aimed at reducing type I errors [ 4 ]. This lopsided emphasis on type I errors comes at the expense of type II errors. The type I error, the false positive, is only possible if the null hypothesis is true. If the null hypothesis is false, a type I error is impossible, but a type II error, the false negative, can occur.
Type I and type II errors are the product of forcing the results of a quantitative analysis into the mold of a decision, which is whether to reject or not to reject the null hypothesis. Reducing interpretations to a dichotomy, however, seriously degrades the information. The consequence is often a misinterpretation of study results, stemming from a failure to separate effect size from precision. Both effect size and precision need to be assessed, but they need to be assessed separately, rather than blended into the P -value, which is then degraded into a dichotomous decision about statistical significance.
As an example of what can happen when significance testing is exalted beyond reason, consider the case of the Wall Street Journal investigative reporter who broke the news of a scandal about a medical device maker, Boston Scientific, having supposedly distorted study results [ 5 ]. Boston Scientific reported to the FDA that a new device was better than a competing device. They based their conclusion in part on results from a randomized trial in which the significance test showing the superiority of their device had a P -value of 0.049, just under the criterion of 0.05 that the FDA used to define statistical significance. The reporter found, however, that the P -value was not significant when calculated using 16 other test procedures that he tried. The P -values from those procedures averaged 0.051. According to the news story, that small difference between the reported P -value of 0.049 and the journalist’s recalculated P -value of 0.051 was “the difference between success and failure” [ 5 ]. Regardless of what the “correct” P -value is for the data in question, it should be obvious that it is absurd to classify the success or failure of this new device according to whether or not the P -value falls barely on one side or the other of an arbitrary line, especially when the discussion revolves around the third decimal place of the P -value. No sensible interpretation of the data from the study should be affected by the news in this newspaper report. Unfortunately, the arbitrary standard imposed by regulatory agencies, which fosters that focus on the P -value, reduces the prospects for more sensible evaluations.
In their article, Stang et al. [ 2 ] not only describe the problems with significance testing, but also allude to the solution, which is to rely on estimation using confidence intervals. Sadly, although the use of confidence intervals is increasing, for many readers and authors they are used only as surrogate tests of statistical significance [ 6 ], to note whether the null hypothesis value falls inside the interval or not. This dichotomy is equivalent to the dichotomous interpretation that results from significance testing. When confidence intervals are misused in this way, the entire conclusion can depend on whether the boundary of the interval is located precisely on one side or the other of an artificial criterion point. This is just the kind of mistake that tripped up the Wall Street Journal reporter. Using a confidence interval as a significance test is an opportunity lost.
How should a confidence interval be interpreted? It should be approached in the spirit of a quantitative estimate. A confidence interval allows a measurement of both effect size and precision, the two aspects of study data that are conflated in a P -value. A properly interpreted confidence interval allows these two aspects of the results to be inferred separately and quantitatively. The effect size is measured directly by the point estimate, which, if not given explicitly, can be calculated from the two confidence limits. For a difference measure, the point estimate is the arithmetic mean of the two limits, and for a ratio measure, it is the geometric mean. Precision is measured by the narrowness of the confidence interval. Thus, the two limits of a confidence interval convey information on both effect size and precision. The single number that is the P -value, even without degrading it into categories of “significant” and “not significant”, cannot measure two distinct things. Instead the P -value mixes effect size and precision in a way that by itself reveals little about either.
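As a minimal sketch of this arithmetic (the confidence limits below are hypothetical, not taken from any study discussed here), the point estimate can be recovered from the two limits exactly as described:

```python
from math import sqrt

def point_estimate_from_ci(lower, upper, ratio_measure=False):
    """Recover the point estimate from two confidence limits, as described above."""
    if ratio_measure:
        return sqrt(lower * upper)       # geometric mean for ratio measures (e.g., risk ratios)
    return (lower + upper) / 2           # arithmetic mean for difference measures

# Hypothetical examples, not taken from the article:
print(point_estimate_from_ci(-0.4, 2.0))                      # difference measure -> 0.8
print(point_estimate_from_ci(1.2, 3.0, ratio_measure=True))   # risk ratio -> ~1.90
# The narrowness of the interval (upper minus lower, or their ratio for ratio measures)
# conveys the precision, separately from the effect size.
```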
Scientists who wish to avoid type I or type II errors at all costs may have chosen the wrong profession, because making and correcting mistakes are inherent to science. There is a way, however, to minimize both type I and type II errors. All that is needed is simply to abandon significance testing. If one does not impose an artificial and potentially misleading dichotomous interpretation upon the data, one can reduce all type I and type II errors to zero. Instead of significance testing, one can rely on confidence intervals, interpreted quantitatively, not simply as surrogate significance tests. Only then would the analyses be truly quantitative.
Finally, here is a gratuitous bit of advice for testers and estimators alike: both P -values and confidence intervals are calculated and all too often interpreted as if the study they came from were free of bias. In reality, every study is biased to some extent. Even those who wisely eschew significance testing should keep in mind that if any study were increased in size, its precision would improve and thus all its confidence intervals would shrink, but as they do, they would eventually converge around incorrect values as a result of bias. The final interpretation should measure effect size and precision separately, while considering bias and even correcting for it [ 7 ].
- 1. Lang J, Rothman KJ, Cann CI. That confounded P-value (Editorial). Epidemiology. 1998;9:7–8. doi: 10.1097/00001648-199801000-00004.
- 2. Stang A, Poole C, Kuss O. The ongoing tyranny of statistical significance testing in biomedical research. Eur J Epidemiol. 2010. doi: 10.1007/s10654-010-9440-x.
- 3. Loftus GR. On the tyranny of hypothesis testing in the social sciences. Contemp Psychol. 1991;36:102–105.
- 4. Feise RJ. Do multiple outcome measures require p-value adjustment? BMC Med Res Methodol. 2002;2:8. doi: 10.1186/1471-2288-2-8.
- 5. Winstein KJ. Boston Scientific stent study flawed. Wall Str J. 2008;August 14:B1.
- 6. Poole C. Beyond the confidence interval. Am J Public Health. 1987;77:195–199. doi: 10.2105/AJPH.77.2.195.
- 7. Lash TL, Fox MP, Fink AK. Applying quantitative bias analysis to epidemiologic data. New York: Springer; 2009.
Cochrane Methods Methodology Register
The Cochrane Methodology Register (CMR) is a bibliography of publications that report on methods used in the conduct of controlled trials. It includes journal articles, books, and conference proceedings, and the content is sourced from MEDLINE and hand searches. CMR contains studies of methods used in reviews and more general methodological studies that could be relevant to anyone preparing systematic reviews. CMR records contain the title of the article, information on where it was published (bibliographic details), and, in some cases, a summary of the article. They do not contain the full text of the article.
The CMR was produced by Cochrane UK until 31 May 2012. There are currently no plans to reinstate the CMR, and it is not receiving updates.* If you have any queries, please contact the Cochrane Community Service Team ( [email protected] ).
The Publishers, John Wiley & Sons Ltd, thanks Update Software for the continued use of their data formats in the Cochrane Methodology Register (CMR).
*Last update in January 2019.
"Christmas Offer"
Terms & conditions.
As the Christmas season is upon us, we find ourselves reflecting on the past year and those who we have helped to shape their future. It’s been quite a year for us all! The end of the year brings no greater joy than the opportunity to express to you Christmas greetings and good wishes.
At this special time of year, Research Prospect brings joyful discount of 10% on all its services. May your Christmas and New Year be filled with joy.
We are looking back with appreciation for your loyalty and looking forward to moving into the New Year together.
"Claim this offer"
In unfamiliar and hard times, we have stuck by you. This Christmas, Research Prospect brings you all the joy with exciting discount of 10% on all its services.
Offer valid till 5-1-2024
We love being your partner in success. We know you have been working hard lately, take a break this holiday season to spend time with your loved ones while we make sure you succeed in your academics
Discount code: RP0996Y
Type 1 and Type 2 Errors
Published by Owen Ingram on September 2nd, 2021; revised on August 3, 2023
When testing your hypothesis, it is crucial to establish a null hypothesis. The null hypothesis proposes that there is no statistical or cause-and-effect relationship between the variables in the population. Commonly denoted by the symbol H0, it is the hypothesis that researchers work to reject.
Similarly, they also formulate an alternate hypothesis, which states the relationship they expect their research to support.
Let’s look at an example to illustrate null and alternate hypotheses:
Question: Is the COVID-19 vaccine safe for people with heart conditions?
Null Hypothesis: The COVID-19 vaccine has no effect on the risk of adverse events in people with heart conditions (i.e., it is safe).
Alternate Hypothesis: The COVID-19 vaccine increases the risk of adverse events in people with heart conditions (i.e., it is not safe).
Statistical hypothesis tests rely on probabilities, so even a well-designed study can reach the wrong conclusion. Two kinds of errors can occur when testing your hypothesis: the Type 1 error and the Type 2 error.
Understanding Type 1 Error
Type 1 errors are commonly known as false positives. A type 1 error occurs when the null hypothesis is rejected during hypothesis testing even though it is true. In other words, we conclude that our results are statistically significant when they have in fact occurred by chance.
The probability of making this type of error is set by your alpha level (α), the significance threshold against which the p-value is compared. For example, an alpha level of 0.02 means you accept a 2% chance of rejecting the null hypothesis when it is actually true. You can reduce the probability of committing a type 1 error by choosing a lower alpha level; for example, α = 0.01 corresponds to a 1% chance of committing this error, at the cost of making it harder to detect a true effect.
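To see what the alpha level means in practice, here is a small R sketch (our own illustration, not part of the original article) that repeatedly compares two groups drawn from the same population. Because the null hypothesis is true by construction, every rejection is a type 1 error, and the long-run rejection rate settles near the chosen alpha.

```r
set.seed(1)
alpha <- 0.05
n_sims <- 10000

# Both groups come from the same population, so the null hypothesis is true
# and every "significant" result is a false positive (type 1 error).
p_values <- replicate(n_sims, {
  group_a <- rnorm(30, mean = 100, sd = 15)
  group_b <- rnorm(30, mean = 100, sd = 15)
  t.test(group_a, group_b)$p.value
})

mean(p_values < alpha)  # roughly 0.05: about 5% false positives
mean(p_values < 0.01)   # lowering alpha to 0.01 cuts this to about 1%
```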
Example for Type 1 Error
Let’s say that you’re convinced the Earth is flat and want to prove it to others. Your null hypothesis, in this case, will be: The earth is not flat.
To test this hypothesis, you walk across a plain surface for a few days and notice that it looks and feels flat underfoot, so you conclude that the Earth must be flat and not a sphere.
You therefore reject your null hypothesis and tell everyone that the Earth is, in fact, flat. This is a simple illustration of a Type 1 error: the null hypothesis (the Earth is not flat) is true, but you rejected it on the basis of misleading evidence. Real type 1 errors are a little more complex than this example, but this is what the error looks like.
In such cases, your goal is to minimize the chance of a type 1 error. Here, for instance, you could have reduced that chance by seeking stronger evidence, such as consulting scientific journals about the Earth’s shape.
Understanding Type 2 Errors
Type 2 errors are also called false negatives. A type 2 error occurs when a hypothesis test fails to reject the null hypothesis even though it is false. Beta (β) denotes the probability of making a type 2 error, and beta is related to the power of the statistical test, i.e., power = 1 - β.
A high test power can reduce the probability of committing a type 2 error.
Reducing Type 2 Errors
The way to reduce type 2 errors is to increase the power of the test, since power = 1 - β. So, how do we do that?
- Increasing the sample size: The simplest way to increase the power of a test is to increase the sample size of the analysis. The larger the sample, the greater the chance of detecting a real difference in the statistical test. Running the study for longer and gathering more extensive data can help you reach an accurate decision with your results.
- Increasing the significance level: A higher significance level (α) makes it easier to reject the null hypothesis, which reduces β, but it also raises the chance of rejecting the null hypothesis when it is actually true (a type 1 error). Both effects are illustrated in the short sketch after this list.
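As a rough illustration of both points, the base-R function power.t.test() shows how the power of a two-sample t-test (1 - β) changes with the sample size and the significance level; the effect size and standard deviation below are arbitrary values chosen for the example.

```r
# Power for detecting an assumed mean difference of 0.5 SD, by group size:
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)$power   # about 0.48
power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = 0.05)$power  # about 0.94

# Raising the significance level also raises power, at the cost of more type 1 errors:
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.10)$power   # about 0.61

# Sample size needed per group to reach 80% power at alpha = 0.05:
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)$n # about 64
```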
Examples of a Type 2 Error
Suppose a pharmaceutical company is testing how effective two new vaccines for COVID-19 are in developing antibodies. The null hypothesis states that both the vaccines are equally effective, whereas the alternate hypothesis states that there is varying effectiveness between the two vaccines.
To test this hypothesis, the pharmaceutical company begins a trial to compare the two vaccines. It divides the participants, giving Group A one of the vaccines and Group B the other vaccine.
The beta for the study is calculated to be 0.03 (or 3%). This means that the probability of committing a type 2 error is 3%, and the power of the test is 97%. If the two vaccines are not equally effective, the null hypothesis should be rejected. If the test nevertheless fails to reject the null hypothesis, a type 2 error (a false negative) has occurred.
Which is Worse – Type 1 or Type 2 Error?
The simple answer is – it depends entirely on the statistical situation.
Suppose you design a medical screening test for diabetes. A Type 1 error in this test may make the patient believe they have diabetes when they don’t. However, this may lead to further examinations and ultimately reveal that the patient does not have the illness.
In contrast, a Type 2 error may give the patient the wrong assurance that they do not have diabetes, when in fact, they do. As a result, they may not go for further examinations and treat the illness. This may lead to several problems for the patient.
In this case, if you were to choose which type of error is ‘less dangerous,’ you’d opt for Type 1 error.
Now let’s take the scenario of a courtroom where a person is suspected of committing murder. The null hypothesis is that the suspect is innocent. A type 1 error in this situation would convict an innocent person, whereas a type 2 error would let a guilty person go free. Not very fair, is it? In this scenario, one can argue that a type 1 error causes more harm than a type 2 error.
FAQs About Type 1 and Type 2 Errors
What is the difference between Type 1 and Type 2 errors?
Type 1 errors are false positives and occur when a true null hypothesis is wrongly rejected. Type 2 errors are false negatives and occur when a false null hypothesis is wrongly retained.
How do you minimize the chances of Type 1 errors?
Choosing a lower significance level (α) minimizes the risk of Type 1 errors. The significance level is usually set at 0.05, which means you accept a 5% chance of committing a type 1 error when the null hypothesis is true.
Comparing type 1 and type 2 error rates of different tests for heterogeneous treatment effects
- Original Manuscript
- Open access
- Published: 20 March 2024
- Volume 56, pages 6582–6597 (2024)
- Steffen Nestler & Marie Salditt
Psychologists are increasingly interested in whether treatment effects vary in randomized controlled trials. A number of tests have been proposed in the causal inference literature to test for such heterogeneity, which differ in the sample statistic they use (either using the variance terms of the experimental and control group, their empirical distribution functions, or specific quantiles), and in whether they make distributional assumptions or are based on a Fisher randomization procedure. In this manuscript, we present the results of a simulation study in which we examine the performance of the different tests while varying the amount of treatment effect heterogeneity, the type of underlying distribution, the sample size, and whether an additional covariate is considered. Altogether, our results suggest that researchers should use a randomization test to optimally control for type 1 errors. Furthermore, all tests studied are associated with low power in case of small and moderate samples even when the heterogeneity of the treatment effect is substantial. This suggests that current tests for treatment effect heterogeneity require much larger samples than those collected in current research.
In the last decades, psychologists have conducted a large number of randomized controlled trials and observational studies to test the effectiveness of specific interventions, such as, for example, an educational support program or a psychotherapy (Kravitz et al., 2004 ). In almost all of these studies, the parameter of interest is the average treatment effect, a measure of the overall impact of treatment. However, researchers and practitioners know that treatment effects can be highly heterogeneous across study participants. For example, students with a low social status may benefit more from an educational support program than students with a higher social status, or certain patients may respond more to a specific treatment than other patients.
Knowing the variables that are responsible for the heterogeneity in treatment effects is very interesting from an applied perspective because it would allow researchers to better tailor the results of randomized controlled trials to particular persons (e.g., to identify the right educational support program for a student or the right treatment for a patient; see Deaton & Cartwright, 2018, but also Cook, 2018). Hence, there is a growing interest in statistical approaches that can detect whether – and if so, why – treatment effects vary in randomized controlled trials. Specifically, multiple methods have been suggested for detecting how the treatment effect varies based on variables that are measured prior to treatment. Among these approaches are standard linear methods such as the moderated regression model (Cox, 1984), but also different machine learning methods such as causal forests (Athey et al., 2019), causal boosting (Powers et al., 2018), and various meta-learners (e.g., the S-Learner, the T-Learner, or the X-Learner; see Künzel et al., 2019; Salditt et al., 2023; Wendling et al., 2018).
However, in some settings, the relevant variables may not have been measured, so that these methods cannot be applied. Then, researchers may simply want to assess whether the average treatment effect observed in their randomized controlled trial varies across participants to such a degree that it is of substantive importance. If it does not, this would indicate that the treatment can be applied to all individuals. Conversely, if it does vary, this would indicate the need for further research to investigate which variables are driving the heterogeneity. A number of tests have been proposed to assess the null hypothesis of homogeneity of treatment effects in the causal inference literature to answer this question. Despite the increasing focus on heterogeneity of treatment effects, only some of these tests are known in psychology (see Kaiser et al., 2022 for a meta-analytic application of one of these tests), and to date there are no simulation studies that have examined and compared the performance of all of these tests in different settings.
In this article, we aim to present the different tests suggested in the causality literature for detecting heterogeneous treatment effects. Furthermore, we conducted a simulation study to examine their performance as a function of the amount of treatment effect heterogeneity, the sample size, and the inclusion of an additional covariate when performing the test. In the following, we first introduce the average treatment effect using the potential outcomes framework (Imbens & Rubin, 2015). We then describe the different test procedures for testing the null hypothesis that the average treatment effect is constant across persons. Afterwards, we present the results of the simulation study and conclude with a discussion of the results and questions for future research.
Potential outcome framework and heterogeneous treatment effects
We use the potential outcome framework to define homogeneous and heterogeneous treatment effects (see Hernan & Robins, 2020; Imbens & Rubin, 2015; Rosenbaum, 2010 for introductions). To this end, let \(A_i\) be the binary treatment variable with 0 indicating that person i is in the control group and 1 that she is in the experimental group. The potential outcome corresponding to the outcome person i would have experienced had she received the treatment is denoted by \(Y_{i}(1)\) and the outcome had she been in the control condition is \(Y_{i}(0)\). The causal effect of the treatment for the i th person then is the difference between her two potential outcomes, \(\tau _{i} = Y_{i}(1) - Y_{i}(0)\).
The stable unit treatment value assumption (SUTVA; see Imbens & Rubin, 2015; Rosenbaum, 2002) entails that the observed outcome equals the potential outcome under the condition actually received: \(Y_{i} = A_{i} Y_{i}(1) + (1 - A_{i}) Y_{i}(0)\).
Since a single person i can only be assigned to either the experimental or the control group, only \(Y_{i}(1)\) or \(Y_{i}(0)\) can be observed, such that it is impossible to observe \(\tau _{i}\) (see Holland, 1986). However, under certain additional assumptions, such as the assumption that the potential outcomes are independent from treatment (i.e., \(A_{i} \perp \!\!\!\!\perp \lbrace Y_{i}(0),Y_{i}(1) \rbrace \)), we can estimate the average of each potential outcome using the average of the observed outcome values in the experimental and control group, respectively: \(\mathbb {E}[Y_{i}(1)] = \mathbb {E}[Y_{i}|A_{i} = 1]\) and \(\mathbb {E}[Y_{i}(0)] = \mathbb {E}[Y_{i}|A_{i} = 0]\).
This in turn allows us to estimate the average of the individual treatment effects: \(\tau = \mathbb {E}[\tau _{i}] = \mathbb {E}[Y_{i}|A_{i} = 1] - \mathbb {E}[Y_{i}|A_{i} = 0]\).
When the treatment effect is homogeneous across all persons, that is, \(\tau _{i} = \tau \) for all \(i = 1, \dots , n\), the average treatment effect equals the constant effect, \(\mathbb {E}[\tau _{i}] = \tau \). Thus, the treatment increases the mean difference between the experimental and control group by the amount \(\tau \): \(\mathbb {E}[Y_{i}|A_{i} = 1] = \mathbb {E}[Y_{i}|A_{i} = 0] + \tau \).
Tests for heterogeneous treatment effects
The goal of all test procedures suggested in the causal inference literature is to test the null hypothesis that the treatment effect is constant for all persons: \(H_{0}: \tau _{i} = \tau \) for all \(i = 1, \dots , n\).
The proposed tests differ in the sample statistics they use and in whether they make distributional assumptions for the potential outcomes. Concerning sample statistics, the different tests use estimates of the variance parameters in the control and the experimental group, the empirical distribution functions of the two groups, or they compare a grid of quantiles estimated in the two groups. Concerning the distributional assumptions, some tests presume that the potential outcomes, and therefore the observed outcomes, are normally distributed, while other tests belong to the class of Fisher randomization tests (FRT) that do not rely on any distributional assumptions (Imbens & Rubin, 2015 ; Rosenbaum, 2002 ).
Comparing variance terms
There are several ways to implement a comparison of variance terms for testing treatment effect heterogeneity. All of these tests are based on the observation that when the treatment effect is constant across persons, it does not influence the variance of the potential outcomes: \(\mathbb {V}[Y_{i}(1)] = \mathbb {V}[Y_{i}(0) + \tau ] = \mathbb {V}[Y_{i}(0)]\),
where the first equality follows from Eq. 1 and the second uses the fact that the variance of a constant is zero. Under the assumptions introduced above, \(\mathbb {V}[Y_{i}(1)]\) and \(\mathbb {V}[Y_{i}(0)]\) equal the variance of the observed values in the experimental and the control group, respectively: \(\mathbb {V}[Y_{i}(1)] = \mathbb {V}[Y_{i}|A_{i} = 1]\) and \(\mathbb {V}[Y_{i}(0)] = \mathbb {V}[Y_{i}|A_{i} = 0]\).
Variance ratio
Using Eq. 8, a potential test statistic to test for treatment effect heterogeneity consists of computing the ratio of the two estimates of the aforementioned variance terms: \(T_{V} = \hat{\mathbb {V}}[Y_{i}|A_{i} = 1] / \hat{\mathbb {V}}[Y_{i}|A_{i} = 0]\).
When we assume that each potential outcome is normally distributed, \(T_{V}\) is F -distributed with degrees of freedom \(n_{1} - 1\) and \(n_{0} - 1\) and testing \(T_{V}\) equals the F -test for two variance terms (e.g., Casella & Berger, 2002). Alternatively, one can use a heterogeneous regression model in which the residual variance is modeled as a function of the treatment variable (Bloome & Schrage, 2021; Western & Bloome, 2009): \(y_{i} = \beta _{0} + \beta _{1} A_{i} + \epsilon _{i}\) with \(\log (\sigma ^{2}_{\epsilon , i}) = \gamma _{0} + \gamma _{1} A_{i}\).
Here, the estimate \(\hat{\gamma }_1\) measures the difference between the variance of the two groups on a log-scale and a test statistic \(T_{\gamma }\) is obtained by dividing \(\hat{\gamma }_1\) by its standard error. The resulting test statistic can be used to test the presence of a heterogeneous treatment effect (Bloome & Schrage, 2021 ), and we will simply refer to this test as \(T_{\gamma }\) in the simulation study reported later. Importantly, \(\hat{\gamma }_1\) equals \(T_{V}\) after exponentiation, which is why we only consider \(T_{\gamma }\) and not the F -test for \(T_{V}\) as variance-ratio-based test that assumes normally distributed data, because the two tests yield the same decision concerning the null hypothesis.
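Under the normality assumption, the variance-ratio test described above is simply the classical F-test for two variances, which base R provides as var.test(). The following sketch uses made-up data and is only meant to illustrate the computation of \(T_{V}\):

```r
set.seed(2)
y0 <- rnorm(100, mean = 0, sd = 1)    # outcomes in the control group
y1 <- rnorm(100, mean = 1, sd = 1.5)  # treatment group with a larger variance

# T_V: ratio of the sample variances (experimental over control group)
T_V <- var(y1) / var(y0)
T_V

# Under normality, T_V is F-distributed; var.test() performs the corresponding test.
var.test(y1, y0)
```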
A test that does not rely on such distributional assumptions can be obtained when using \(T_{V}\) in a Fisher randomization test (FRT, Imbens & Rubin, 2015 ; Rosenbaum, 2002 ; 2010 ). The basis of the FRT is that – given the aforementioned null hypothesis – the missing potential outcome values can be imputed assuming that we know the treatment effect. Thus, when person i belongs to the experimental group, her observed value is \(Y_i = Y_i(1)\) and the missing potential outcome is \(Y_i(0) = Y_i(1) - \tau = Y_i - \tau \) . Alternatively, when person i is in the control group, her missing potential outcome is \(Y_i(1) = Y_i + \tau \) .
Using this, the FRT constructs a sampling distribution of the test statistic (\(T_{V}\) in this case) by randomly permuting the observed values of the treatment variable. To illustrate, when the observed treatment values are 0, 1, 1, 0, 0, and 1 for persons one to six, a randomly permuted treatment variable would be 1, 0, 0, 1, 1, and 0. The test statistic is then computed again using the group assignments in the permuted treatment variable. This process is repeated a large number of times (e.g., B = 1000) and the resulting distribution of the test statistic (e.g., the distribution of the \(T_{V}\) values) is used to obtain a p value by computing the relative frequency of the values of the test statistic that are at least as large as the observed test statistic: \(p = \frac{1}{B} \sum _{j=1}^{B} I(T_{j} \ge T_{\textrm{obs}})\),
where \(T_{\textrm{obs}}\) is the observed test statistic (i.e., \(T_{V}\) here) and \(T_j\) is the value of the test statistic in the j th randomization sample. As said above, the basis of the FRT is that given the null hypothesis of a constant treatment effect, the missing potential outcome values can be imputed assuming that one knows the true average treatment effect. In case of \(T_{V}\) the assumption of knowing the true treatment effect is not important, because the test compares two variance parameters that – given the null hypothesis – are not affected by the ‘constant’ \(\tau \) .
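The following base-R sketch implements the Fisher randomization test just described with the variance ratio \(T_{V}\) as the test statistic. It is our own minimal illustration (the authors' own code is linked in the simulation section below), and the simulated data at the end merely assume some heterogeneous effect that inflates the variance in the treated group.

```r
# Fisher randomization test for treatment effect heterogeneity based on the
# variance ratio T_V. 'y' is the observed outcome, 'a' a 0/1 treatment vector.
frt_var_ratio <- function(y, a, B = 2000) {
  t_obs <- var(y[a == 1]) / var(y[a == 0])
  t_perm <- replicate(B, {
    a_perm <- sample(a)  # randomly permute the treatment labels
    var(y[a_perm == 1]) / var(y[a_perm == 0])
  })
  # p value: share of permuted statistics at least as large as the observed one
  mean(t_perm >= t_obs)
}

# Illustrative data with a heterogeneous treatment effect (tau = 1, sigma_tau = 0.5)
set.seed(3)
n <- 100
a <- rep(c(0, 1), each = n)
y0 <- rnorm(2 * n)
y <- ifelse(a == 1, y0 + 1 + 0.5 * y0, y0)
frt_var_ratio(y, a)
```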
Variance difference
\(T_{V}\) and \(T_{\gamma }\) use the ratio of the two variance terms. Another way to test the null hypothesis of a constant treatment effect consists of taking the absolute difference between the two variance terms, \(T_{D} = \vert \hat{\mathbb {V}}[Y_{i}|A_{i} = 1] - \hat{\mathbb {V}}[Y_{i}|A_{i} = 0] \vert \),
and to use \(T_{D}\) in a FRT (we refer to this test as \(T_{D}\) in the simulation study reported later). Alternatively, when one assumes normally distributed potential outcomes, one can implement a test of the variance difference in a random coefficient regression model or in the multiple group structural equation model framework. In a random coefficient regression model, the slope of a predictor variable is allowed to vary across individuals (as in the case of a multilevel model), although only one score is available for a single participant (in econometrics this model is called the Hildreth–Houck model, see Hildreth & Houck, 1968; Muthén et al., 2017). Applied to the present context, the slope of the treatment variable is modeled to differ between persons: \(y_{i} = \beta _{0} + (\beta _{1} + u_{i}) A_{i} + \epsilon _{i}\),
where \(u_i\) is a normally distributed error term with expectation zero and variance \(\sigma _{u}^{2}\). When the (normally distributed) error term is uncorrelated with \(u_i\), the conditional variance of \(y_i\) is given by \(\mathbb {V}[y_{i}|A_{i}] = A_{i} \sigma _{u}^{2} + \sigma _{\epsilon }^{2}\) (using that \(A_{i}^{2} = A_{i}\) for a binary treatment).
Thus, the variance in the control group is \(\mathbb {V}[y_{i}|A_i = 0] = \sigma _{\epsilon }^{2}\) and in the experimental group the variance is \(\mathbb {V}[y_{i}|A_i = 1] = \sigma _{u}^{2} + \sigma _{\epsilon }^{2}\) , showing that the difference between the two is \(\sigma _{u}^{2}\) . When we fit the model to the data (using Mplus, see below and Muthen & Muthen, 1998-2017 ), a Wald test can then be employed to check whether the sample estimate \(\hat{\sigma }_{u}^{2}\) is significantly different from zero. In the simulation study, we call this test \(T_{\beta }\) .
The second way to implement the variance difference test when assuming normally distributed data is to use a multiple group structural equation model (Tucker-Drob, 2011 ). Specifically, one fits a structural equation model to the data of the experimental group and another one to the control group and compares the fit of this unconstrained multiple group model to the fit of a constrained multiple group model in which the variance terms of the outcome variable are restricted to be equal. Formally, the fit of the two models is compared using a chi-squared difference test that is a likelihood-ratio (LR) test. The LR-test is asymptotically equivalent to a Wald-test that compares the difference between the two variance terms (Bollen, 1989 ; Muthén et al., 2017 ). Later, we abbreviate this test with \(T_{LR}\) .
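A sketch of how the multiple-group LR test \(T_{LR}\) can be set up in R with the lavaan package; this is our own minimal example with made-up data, not the authors' implementation, and it assumes a data frame with an outcome Y and a grouping variable A.

```r
library(lavaan)

# Made-up data: the treatment group has a larger outcome variance.
set.seed(4)
dat <- data.frame(
  A = rep(c("control", "treatment"), each = 100),
  Y = c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 1, sd = 1.5))
)

# Unconstrained model: the variance of Y is estimated freely in both groups.
fit_free <- sem("Y ~~ Y", data = dat, group = "A", meanstructure = TRUE)

# Constrained model: the shared label 'v' forces the variances to be equal.
fit_equal <- sem("Y ~~ c(v, v) * Y", data = dat, group = "A", meanstructure = TRUE)

# Chi-squared difference (likelihood-ratio) test of the equal-variance constraint.
anova(fit_equal, fit_free)
```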
Variance ratio and variance differences with rank data
All tests described so far use the observed data to compute the respective test statistic. However, in the case of testing the null hypothesis of no average treatment effect, simulation studies showed that a FRT that uses rank data has good error rates across different simulation conditions (Imbens & Rubin, 2015 ; Rosenbaum, 2002 , 2010 ). To conduct such a rank-based FRT, one first transforms the outcome to ranks and then computes the test statistic on the resulting rank data. To examine whether the good performance of the rank-based FRTs for testing the average treatment effects against zero generalizes to the context of heterogeneous treatment effects, we also consider rank-based versions of the FRTs in the simulation study (referred to as \(T^{R}_{V}\) and \(T^{R}_{D}\) , respectively).
Comparing the cumulative distribution functions
A second group of tests is based on the two-sample Kolmogorov–Smirnov test statistic. The idea is to compare the marginal distribution function in the control group with the marginal distribution function in the experimental group, because under the null hypothesis of a constant treatment effect, the two distribution functions differ only by the average treatment effect \(\tau \). Thus, if \(\tau \) were known, one could use (Ding et al., 2016) the statistic \(T_{\textrm{KS}} = \max _{y} \vert \hat{F}_{1}(y) - \hat{F}_{0}(y - \tau ) \vert \),
where \(\hat{F}_{0}\) and \(\hat{F}_{1}\) denote the empirical cumulative distribution functions of the outcome in the control and experimental group, respectively. \(T_{\textrm{KS}}\) presumes that the true treatment effect \(\tau \) is known. Because this is not the case, an obvious fix would be to use the estimated average treatment effect \(\hat{\tau }\), resulting in the ’shifted’ test statistic \(T_{\textrm{SKS}} = \max _{y} \vert \hat{F}_{1}(y) - \hat{F}_{0}(y - \hat{\tau }) \vert \).
However, Ding et al. (2016) show that the sampling distribution of \(T_{\textrm{SKS}}\) is not equivalent to the sampling distribution of the standard KS test statistic (see Wasserman, 2004) and therefore yields either inflated or deflated false-positive and false-negative error rates. To deal with this problem, they suggest using \(T_{\textrm{SKS}}\) in a FRT. To this end, one first computes the potential outcomes \(Y_{i}(0)\) and \(Y_{i}(1)\) for each participant by plugging in the estimated average treatment effect \(\hat{\tau }\). That is, \(Y_{i}(0) = Y_i\) and \(Y_{i}(1) = Y_i + \hat{\tau }\) for the persons in the control group, and \(Y_{i}(0) = Y_i - \hat{\tau }\) and \(Y_{i}(1) = Y_i\) for the persons in the experimental group. These potential outcomes are then used as the basis for the FRT in which \(T_{\textrm{SKS}}\) is computed in each randomization sample. The resulting distribution of the \(T_{\textrm{SKS}}\) scores is then used to test the null hypothesis (see Eq. 11 ).
Ding et al. ( 2016 ) call this the FRT-Plug in test (henceforth, FRT-PI) and argue that the procedure should yield valid results when the estimated average treatment effect is close to the true average treatment effect (e.g., when the sample size is large). However, as there are no guarantees that this is the case, Ding et al. ( 2016 ) suggest a second FRT in which not only one value for the hypothesized constant treatment effect is plugged in, but rather a range of plausible average treatment effects. Specifically, they suggest to construct a 99.9% confidence interval for the estimated average treatment effect and to use a grid of values covering this interval in the FRT. That is, for each of these treatment effects, the FRT is performed and one then finds the maximum p value over all the resulting randomization tests. When this p value is smaller than the significance level, one rejects the null hypothesis. Following Ding et al. ( 2016 ), we henceforth call this test FRT-CI.
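A rough base-R sketch of the plug-in randomization test (FRT-PI) as described above; it is our own simplified illustration rather than the functions of Ding et al. (2016).

```r
# Shifted Kolmogorov-Smirnov statistic: subtract the estimated average treatment
# effect from the treated outcomes and compare the two empirical distributions.
shifted_ks <- function(y, a) {
  tau_hat <- mean(y[a == 1]) - mean(y[a == 0])
  unname(ks.test(y[a == 1] - tau_hat, y[a == 0])$statistic)
}

frt_pi <- function(y, a, B = 500) {
  tau_hat <- mean(y[a == 1]) - mean(y[a == 0])
  # Impute both potential outcomes under the null of a constant effect tau_hat.
  y0_imp <- ifelse(a == 1, y - tau_hat, y)
  y1_imp <- y0_imp + tau_hat
  t_obs <- shifted_ks(y, a)
  t_perm <- replicate(B, {
    a_perm <- sample(a)                            # re-randomize the assignment
    y_perm <- ifelse(a_perm == 1, y1_imp, y0_imp)  # observed outcomes under it
    shifted_ks(y_perm, a_perm)
  })
  mean(t_perm >= t_obs)                            # randomization p value
}

set.seed(5)
a <- rep(0:1, each = 80)
y <- rnorm(160) + a * (1 + 0.5 * rnorm(160))       # some effect heterogeneity
frt_pi(y, a)
```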
Comparing quantiles
A third test was suggested by Chernozhukov & Fernandez-Val (2005) and is based on a comparison of quantiles instead of the cumulative distribution functions. Specifically, the test is based on the observation that \(F^{-1}_{1}(q) - F^{-1}_{0}(q) = \tau (q)\),
where \(F^{-1}_{0}\) and \(F^{-1}_{1}\) denote the inverse of the cumulative distribution function in the control and experimental group, respectively, q is a certain quantile, and \(\tau (q)\) is the average treatment effect at the q th quantile. When the null hypothesis of a constant treatment effect is true, \(\tau (q)\) is constant across all quantiles q. Therefore, Chernozhukov & Fernandez-Val (2005) suggest testing the null hypothesis with a type of KS-statistic in which one first obtains estimates of the treatment effect at certain quantiles and then takes the largest difference between these values and the average treatment effect: \(T_{\textrm{sub}} = \max _{q} \vert \hat{\tau }(q) - \hat{\tau } \vert \).
In their implementation of the test, Chernozhukov & Fernandez-Val ( 2005 ) use quantile regression (Hao & Naiman, 2007 ), in which the outcome Y is regressed on the treatment indicator A at a quantile q , to obtain an estimate of \(\hat{\tau }(q)\) .
Similar to the shifted KS statistic (see Eq. 16 ), \(T_{\textrm{sub}}\) uses an estimate of the true average treatment effect. Therefore, Chernozhukov & Fernandez-Val ( 2005 ) suggest to use a subsampling approach to obtain a valid test statistic. In subsampling, the sampling distribution of a statistic is obtained by drawing subsamples of a certain size b without replacement from the current dataset. In each subsample, the respective statistic is computed ( \(T_{\textrm{sub}}\) in our case) and the resulting sampling distribution is then used – similar to the bootstrap – for inferences concerning the statistic. Chernozhukov & Fernandez-Val ( 2005 ) show that their subsampling approach is asymptotically correct. They also found their approach to yield valid type 1 error rates in a simulation study.
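The idea behind the quantile-based statistic can be conveyed with plain sample quantiles instead of quantile regression; the following sketch is therefore only an approximation of the procedure of Chernozhukov and Fernandez-Val (2005), and inference would still have to proceed via the subsampling scheme described above.

```r
# Largest deviation of the quantile treatment effects from the average treatment
# effect over a grid of quantiles (sample quantiles used as a simplification).
t_sub_stat <- function(y, a, probs = seq(0.1, 0.9, by = 0.1)) {
  tau_q   <- quantile(y[a == 1], probs) - quantile(y[a == 0], probs)
  tau_hat <- mean(y[a == 1]) - mean(y[a == 0])
  max(abs(tau_q - tau_hat))
}

set.seed(6)
a <- rep(0:1, each = 100)
y <- rnorm(200) + a * (1 + 0.5 * rnorm(200))
t_sub_stat(y, a)
```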
Comparing coefficients of variation
Finally, it was also suggested to compare the coefficients of variation (CV) instead of the variance terms to detect heterogeneous treatment effects (Nakagawa et al., 2015; Volkmann et al., 2020), because the magnitude of the variance depends on the mean (e.g., the larger the mean on a bounded scale, the lower a variable’s reachable variance). The coefficient of variation is the ratio of the standard deviation \(\sigma \) of a variable to its mean \(\mu \), \(CV = \sigma / \mu \),
and a potential test statistic then is to compare the estimated CVs from the two groups
However, a problem with using \(T_{CV}\) is that the test statistic is not only affected by a heterogeneous treatment effect (e.g., \(\hat{\sigma }_{1} \ne \hat{\sigma }_{0}\) ), but also by whether there is a nonzero average treatment effect. For instance, when \(\hat{\sigma }_{1}\) = \(\hat{\sigma }_{0}\) = 1 and \(\hat{\mu }_{1} = \hat{\mu }_{0} + 1\) , then \(T_{CV}\) would be equal to 0.5. Thus, \(T_{CV}\) cannot be used to test the null hypothesis of a constant treatment effect, at least when the average treatment effect is nonzero. Footnote 1 Therefore, we do not consider this test statistic in our simulation.
Considering covariates
So far we assumed that data is available for the treatment variable and the outcome only. However, when a randomized controlled trial is conducted in practice, researchers may also have assessed a number of person-level pretreatment covariates. When testing the average treatment effect against zero, it is well known (e.g., Murnane & Willett, 2011 ) that considering these covariates in the model can increase the precision of the treatment effect estimate and power. In a similar way, considering covariates may help to increase the power of the treatment heterogeneity tests.
In case of the test based on the heterogeneous regression model ( \(T_{\gamma }\) , see Eq. 10 ), the random coefficient model ( \(T_{\beta }\) , see Eq. 13 ), and the structural equation model ( \(T_{LR}\) ), one can consider such covariates by including them as predictors of the outcome variable. Footnote 2 For the other tests, the outcome variable is first regressed on the covariates. The residuals of this regression can then be used in the FRT approaches or in the tests based on the cumulative distribution function. In the latter case, one then computes a ‘regression-adjusted KS statistic’ (cf., Ding et al., 2016 ). For the quantile-based test (i.e., \(T_{\textrm{sub}}\) ), Koenker and Xiao ( 2002 ) suggested to estimate \(\tau (q)\) in a quantile regression of the outcome on the treatment variable and the covariates and to use these estimates when computing \(T_{\textrm{sub}}\) .
The present research
The foregoing discussion showed that a number of tests were proposed in the causal inference literature to test for heterogeneity in treatment effects (see Table 1 for a summary) and simulation studies conducted so far found that some of the proposed tests keep their nominal type 1 error rates while also preserving low type 2 error rates. Bloome and Schrage ( 2021 ), for example, examined the performance of \(T_{\gamma }\) in case of normally distributed potential outcomes and a sample size of 200 or 1000 persons per group. They found that \(T_{\gamma }\) has good type 1 error rates at both sample sizes and that the type 2 error rate gets smaller the larger the size of the two groups. Ding et al. ( 2016 ) compared their proposed FRT approaches (i.e., FRT-PI and FRT-CI) with the subsampling approach of (Chernozhukov & Fernandez-Val, 2005 ) at sample sizes of 100 or 1000 persons in each group. When the treatment effect is homogeneous, all three tests yielded good type 1 error rates. Furthermore, the power of the tests to detect heterogeneous treatment effects was low for all three methods when the size of the samples was small. However, as expected, the power increased with increasing sample sizes and with increasing treatment effect heterogeneity.
The simulation study reported here is aimed at replicating and extending these results. Specifically, we conducted the simulation study with four purposes. First, in our review we described several tests whose suitability for testing treatment effect heterogeneity has not yet been investigated (e.g., the random coefficient model, rank-based FRTs), neither alone nor in comparison to the other tests. We wanted to compare all of the proposed tests in one simulation study, rather than evaluating a subset of the tests only. We believe that this is important because so far psychologists often compare – if at all – the variance terms to test the presence of heterogeneous treatment effects and seldomly use (rank-based) FRT versions. Thus, if the other tests have higher power, applied researchers would currently make a type 2 error unnecessarily often (i.e., they would underestimate the frequency of heterogeneous treatment effects). Second, we wanted to examine the error rates at smaller sample sizes. So far, sample sizes of 100, 200, or 1000 persons per group have been investigated. However, randomized controlled trials sometimes involve fewer persons per condition and it is thus important to evaluate the performance of the tests at such smaller sample sizes. Third, to the best of our knowledge, there is no simulation study that has investigated whether and to what extent the inclusion of covariates affects the power of the various tests. Finally, we also wanted to investigate whether (and how) the error rates are affected when variables are not normally distributed, but stem from distributions in which more extreme values can occur, or which are skewed.
Simulation study
We performed a simulation study to assess the type 1 and type 2 error rates in different sample size conditions and in different conditions of treatment effect heterogeneity. We also varied the size of the average treatment effect and the distribution of the potential outcomes, and we examined whether the consideration of a covariate increases the power of the tests. The R code for the simulation study can be found at https://osf.io/nbrg2/ .
Simulation model and conditions
The population model that we used was inspired by the model used in Chernozhukov and Fernandez-Val (2005) and Ding et al. (2016). Specifically, the model used to generate the data was an additive treatment effect model in which \(Y_{i}(0) = \epsilon _{i}\) and \(Y_{i}(1) = Y_{i}(0) + \tau + \sigma _{\tau } \epsilon _{i}\).
To manipulate the size of the heterogeneous treatment effect, we followed Ding et al. (2016) by setting \(\sigma _{\tau }\) to 0, 0.25, or 0.5, corresponding to a constant treatment effect across participants, a moderate heterogeneous treatment effect, or a large heterogeneous treatment effect. Although no conventions with regard to the size of heterogeneous treatment effects exist, we call the two conditions moderate and large because setting \(\sigma _{\tau }\) to 0.25 (0.50) implies that the variance in the experimental group is about 1.5 (2) times as large as the variance in the control group (i.e., \(\sigma ^2_{0} = 1\) vs. \(\sigma ^2_{1} = 1.56\) and \(\sigma ^2_{1} = 2.25\), respectively). To manipulate the type of underlying distribution, \(\epsilon _{i}\) was generated from a standard normal distribution (i.e., \(\epsilon _{i} \sim \mathcal {N}(0,1)\)), from a t -distribution with 5 degrees of freedom, or from a log-normal distribution with mean 0 and standard deviation 1 (both on a log-scale). The t -distribution is a symmetric distribution that has thicker tails compared to the normal distribution, so the chance that extreme values occur in a random sample is larger. The log-normal distribution, by contrast, is an asymmetric distribution; using it allows us to examine how skewness of the data influences the tests. Furthermore, the average treatment effect \(\tau \) was set to either 0 or 1. The latter manipulation was included to examine whether the performance of the tests is affected by the presence of a nonzero average treatment effect, and we set \(\tau \) to this rather large value (i.e., in the normal case the implied d is 1) so as not to miss any effects that an (incorrect) estimation of the average treatment effect might have.
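For concreteness, the following R sketch generates one data set in a way that is consistent with the description above; it reflects our reading of the design, and the exact data-generating code is available in the authors' OSF repository.

```r
# One simulated data set under the additive treatment effect model
# Y_i(0) = eps_i and Y_i(1) = Y_i(0) + tau + sigma_tau * eps_i, so that
# sigma_tau = 0.25 (0.50) inflates the treated variance by (1 + sigma_tau)^2.
simulate_rct <- function(n_per_group, tau = 0, sigma_tau = 0.25,
                         dist = c("normal", "t", "lognormal")) {
  dist <- match.arg(dist)
  n <- 2 * n_per_group
  eps <- switch(dist,
                normal    = rnorm(n),
                t         = rt(n, df = 5),
                lognormal = rlnorm(n, meanlog = 0, sdlog = 1))
  a  <- rep(c(0, 1), each = n_per_group)
  y0 <- eps
  y1 <- y0 + tau + sigma_tau * eps
  data.frame(A = a, Y = ifelse(a == 1, y1, y0))
}

dat <- simulate_rct(n_per_group = 50, tau = 1, sigma_tau = 0.5, dist = "normal")
```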
The number of participants was set to 25, 50, 100, or 250 in each group. A sample size of 25 per group is a rather small sample size for a randomized controlled trial, and 250 participants per group can be considered a large sample (reflecting, for example, the growing number of multi-center studies). The other two sample sizes are in between and were included to examine the error rates in case of more moderate sample sizes. Finally, to examine the effect of including a covariate on the power of the tests, we generated a standard normal variable \(X_i\) and added it as a linear predictor when generating \(Y_{i}(0)\) in the first step of the respective conditions. In the case where \(\epsilon _{i}\) is drawn from a standard normal distribution, this implies that the correlation between \(X_i\) and each of the potential outcomes is 0.30, which corresponds to a moderate effect.
Tests and dependent variables
For each of the 3 (size of heterogeneity) \(\times \) 3 (type of distribution) \(\times \) 2 (zero versus nonzero average treatment effect) \(\times \) 4 (no. of participants in a group) \(\times \) 2 (zero versus nonzero covariate) = 144 conditions, we generated 1000 simulated data sets. For each condition, the 1000 replications were analyzed by computing the ten test statistics, that is, \(T_{\gamma }\) , \(T_{V}\) , \(T^{R}_{V}\) , \(T_{LR}\) , \(T_{\beta }\) , \(T_{D}\) , \(T^{R}_{D}\) , FRT-PI, FRT-CI, and \(T_{\textrm{sub}}\) . Footnote 3 \(T_{V}\) , \(T^{R}_{V}\) , \(T_{D}\) , and \(T^{R}_{D}\) were each implemented as a FRT. We used B = 2000 randomization samples to obtain the sampling distribution of the statistic in each replication. This distribution was then used to obtain the p value for this replication. When the p value was smaller than \(\alpha = .05\) , the null hypothesis of a constant treatment effect was rejected, otherwise it was accepted.
We used the functions of Ding et al. (2016) to compute the FRT-PI and the FRT-CI, respectively. For both tests, we proceeded the same way as for \(T_{V}\), but with the difference that we followed the suggestions of Ding et al. (2016) and set the number of randomization samples to B = 500. In case of FRT-CI, a grid of 150 values was used to compute the maximum p value (this is the default value in the functions of Ding et al. 2016). \(T_{\gamma }\) was obtained by regressing the outcome variable on the treatment variable in a regression model. Furthermore, the residual variance of this model (see Eq. 10 ) was also modeled as a function of the treatment variable, and the latter coefficient was used to test for heterogeneous treatment effects. Specifically, if the absolute value of the estimate divided by its standard error was greater than 1.96, the null hypothesis of a constant treatment effect was counted as rejected. The heterogeneous regression model was estimated using the remlscore function in the R package statmod (Giner & Smyth, 2016). We used Mplus (Muthén & Muthén, 1998-2017) to compute the random coefficient regression model to obtain \(T_{\beta }\). Mplus implements a maximum likelihood estimator for the model’s parameters (see Muthén et al., 2017). In the simulation, we used the R package MplusAutomation (Hallquist & Wiley, 2018) to control Mplus from R. To obtain \(T_{LR}\), we fitted two multiple group structural equation models with the lavaan package (Rosseel, 2012). The first model was an unconstrained model in which the variance of the outcome variable was allowed to differ between the experimental and the control group, and in the second, constrained model, the two variance terms were restricted to be equal. The LR-test is then obtained by comparing the fit of the two models with a chi-square difference test. Finally, we used the functions of Chernozhukov and Fernandez-Val (2005) with their default settings to obtain \(T_{\textrm{sub}}\) (resulting in a subsampling size of \(b \approx \) 20).
In each replication, we determined whether a test statistic rejected the null hypothesis of a constant treatment effect and used this information to calculate the rejection rate per simulation condition. When the standard deviation \(\sigma _{\tau }\) is 0, this rate measures the type 1 error rate. If \(\sigma _{\tau }\) is greater than 0, the rejection rate is an indicator of the power of a test.
Results
We first discuss the results for the case in which no covariate was considered in the tests. The results for the different tests concerning the two error rates were very similar for an average treatment effect of zero and one. Therefore, we focus on the results for the conditions with a zero average treatment effect here and later discuss some noticeable differences from the conditions in which the treatment effect is one.
Table 2 presents the results concerning type 1 error and power rates when no covariate was considered when performing the tests. Turning to the type 1 error rates first (i.e., the columns with \(\sigma _{\tau } = 0\)), \(T_{V}\), \(T_{V}^{R}\), \(T_{D}\), \(T_{D}^{R}\), \(T_{\gamma }\), \(T_{LR}\), and the FRT-PI yielded exact or almost exact rejection rates when the potential outcomes were normally distributed, irrespective of the size of the two groups. For FRT-CI, \(T_{\textrm{sub}}\), and \(T_{\beta }\) the rates were too conservative, but they approached the nominal \(\alpha \) -level as the sample size increased. When the data followed a t -distribution, all FRT-based tests except FRT-CI (i.e., \(T_{V}\), \(T_{V}^{R}\), \(T_{D}\), \(T_{D}^{R}\), FRT-PI) were still near the nominal value. For FRT-CI, the type 1 error rate again approached the nominal value with an increasing sample size, while the rates for \(T_{\textrm{sub}}\) and \(T_{\beta }\) remained largely too conservative. For \(T_{\gamma }\) and \(T_{LR}\), rejection rates were too liberal. In case of the log-normal distribution, \(T_{\textrm{sub}}\) and \(T_{\beta }\) yielded even higher type 1 error rates. Interestingly, while FRT-PI showed a good size for normally and t -distributed potential outcomes, the test was too liberal when the potential outcomes were log-normally distributed and the sample size was small to moderate. In this case, the FRT-CI yielded more exact rejection rates. Finally, \(T_{V}\), \(T_{V}^{R}\), \(T_{D}\), and \(T_{D}^{R}\) were again near the nominal value, and the rates for \(T_{\textrm{sub}}\) and \(T_{\beta }\) were again largely unaffected by the sample size.
With regard to the power of the tests (see columns with \(\sigma _{\tau } = 0.25\) and \(\sigma _{\tau } = 0.50\) in Table 2 ), all tests were more powerful the larger the size of the effect and the larger the size of the two groups. However, there were notable differences between the tests depending on conditions. Specifically, \(T_{V}\), \(T_{D}\), \(T_{\gamma }\), \(T_{LR}\), and \(T_{\beta }\) were the most powerful tests in case of normally distributed data, irrespective of sample size and level of treatment effect variation. The rank-based FRT versions \(T_{V}^{R}\) and \(T_{D}^{R}\) were always less powerful than their raw-data counterparts \(T_{V}\) and \(T_{D}\), respectively, and \(T_{\textrm{sub}}\) was more powerful than FRT-CI, but not FRT-PI. However, the latter three tests reached power rates above 80% only at 250 participants per group and when treatment effect variation was large. All other tests reached these values at smaller sample sizes than these three tests when treatment effect variation was moderate and large, but even these tests did not reach a sufficient power in the small sample size condition. A similar result pattern occurred in conditions in which the potential outcomes were t -distributed. Again, \(T_{V}\), \(T_{D}\), \(T_{\gamma }\), \(T_{LR}\), and \(T_{\beta }\) were the most powerful tests, although the result for \(T_{\gamma }\) and \(T_{LR}\) has to be considered in light of the large type 1 error rates when there was no effect variation. Interestingly, \(T_{\beta }\), although also based on normal theory, performed similarly to \(T_{V}\) and \(T_{D}\). Furthermore, and in contrast to the normal distribution condition, FRT-PI was more powerful than FRT-CI and \(T_{\textrm{sub}}\), and the rank-based versions of \(T_{V}\) and \(T_{D}\) had the same power rates as their raw-data counterparts. Finally, when the potential outcomes were log-normally distributed, \(T^{R}_{V}\), \(T^{R}_{D}\), \(T_{\gamma }\), \(T_{LR}\), and FRT-PI were the most powerful tests. Again, for \(T_{\gamma }\) and \(T_{LR}\) this result has to be considered in light of the worse performance of the two tests in case of no treatment effect variation. Similarly, the results concerning the rank-based tests should be interpreted in comparison to their performance when the average treatment effect is one, where \(T^{R}_{V}\) and \(T^{R}_{D}\) yielded high rejection rates when the treatment effect was constant (see the discussion below).
The type 1 and type 2 error rates of the tests when considering an additional covariate were, unexpectedly, very similar to the error rates when not considering the covariate (see Table 3 for the specific results). When we examine the small sample size condition only (i.e., 25 participants per condition) and average across all tests, we find that for normally distributed data the mean rejection rate is 0.038 when \(\sigma _{\tau }\) is zero, it is .137 when \(\sigma _{\tau }\) is 0.25, and it is 0.331 when \(\sigma _{\tau }\) is 0.50. These values differ only slightly from the values that we obtain when we consider the tests without covariates, \(\sigma _{\tau }\) = 0: 0.038, \(\sigma _{\tau }\) = 0.25: .133, \(\sigma _{\tau }\) = 0.50: 0.333. Very similar results are obtained when we consider the conditions in which the data was t - or log-normal-distributed, and also when we consider each specific tests. Thus, in the conditions studied here, taking a covariate into account leads, if at all, only to a small improvement in the power of the tests, when the correlation between the covariate and the dependent variable does not exceed 0.3.
Finally, as mentioned above, the results are very similar for the conditions in which the average treatment effect is one (see Tables 6 and 7 in the Appendix for the specific error rates of the considered tests). Exceptions were the results for the rank tests \(T^{R}_{V}\) and \(T^{R}_{D}\) when the data were log-normally distributed. Here, rejection rates were unacceptably high when there was no treatment effect variation (i.e., \(\sigma _{\tau }\) = 0). To better understand these results, we conducted another small simulation study in which we examined the performance of a FRT when testing the null hypotheses that the average treatment effect is zero for log-normal data. The results showed (see Table 4 ) that the rank based test performed better than the raw-data based test when there was no treatment effect heterogeneity (i.e., \(\sigma _{\tau }\) = 0). However, when the treatment effect was zero, we found that rejection rates of the rank test increased the larger the variation in the treatment effect, while the raw-data based test had the correct size. This suggests that nonzero treatment effect heterogeneity results in a difference between the two outcome means in the rank data, which is actually not present, and this increases with higher heterogeneity. This ’bias’ also affects the rank-test of the variances (analogous to the test using the coefficient of variation, as discussed above). Finally, when the average treatment is one, then the actual difference in means is ’added’ to this bias, leading to the exceptional rejection rates.
Discussion
Psychological researchers are increasingly interested in methods that allow them to detect whether a treatment effect investigated in a study is non-constant across participants. Based on the potential outcome framework, a number of procedures have been suggested to test the null hypothesis of a homogeneous treatment effect. These tests can be distinguished according to whether they make distributional assumptions and according to which sample statistics they use. The present study examined the type 1 and type 2 error rates of the suggested tests as a function of the amount of treatment heterogeneity, the presence of an average causal effect, the distribution of the potential outcomes, the sample size, and whether a covariate is considered when conducting the test.
With regard to the type 1 error rate, our results replicate the findings of earlier simulation research (Bloome & Schrage, 2021 ; Ding et al., 2016 ) by showing that the majority of tests were close to the nominal \(\alpha \) -level regardless of sample size when the data was normally distributed. However, extending prior research we also found that the variance tests in the heterogeneous regression model \(T_{\gamma }\) and the LR test implemented with a multiple group structural equation model \(T_{LR}\) rejected the null hypothesis too often in the case of non-normally distributed data. Furthermore, we found that the subsampling procedure \(T_{sub}\) and the test in the random coefficient regression model \(T_{\beta }\) decided too conservatively in these conditions, and that the FRT-CI performed best in case of skewed data. Overall, these results suggest that, regardless of sample size, applied researchers should use a Fisher randomization test with variance ratios or variance differences (i.e., \(T_{V}\) , \(T_{D}\) ) or the empirical distribution function (i.e., FRT-CI), because they protect well against false positive decisions. When the data are normally distributed, they could, again regardless of sample size, use the variance tests in the heterogeneous regression model \(T_{\gamma }\) and the LR test, but should use one of the FRT tests as a sensitivity check if there is any doubt as to whether the normality assumption is met.
Concerning the power of the tests, the results of the simulation – at least for the small sample sizes of 25 and 50 persons per group – are quite sobering, because none of the tests reached satisfactory power levels even when size of treatment heterogeneity was stronger. When samples sizes were larger, our results were consistent with earlier research showing that the variance test of the heterogeneous regression model is characterized by good type 2 error rates in case of normally distributed data (Bloome & Schrage, 2021 ). They also show that the LR-test and the test of the random coefficient regression model achieved good power. Power rates of all three tests were slightly larger than the power rates of the variance ratio and the variance difference tests, but this is to be expected given that the parametric assumptions are met in these conditions. Furthermore, the power of FRT-PI increased with larger sample sizes and with larger treatment effect heterogeneity. The same pattern occurred for FRT-CI, although it performed better (in terms of type 2 and type 1 error rates) than FRT-PI in case of skewed data. For practitioners these results suggest that they should use the FRT variance tests, the variance test in the heterogeneous regression model \(T_{\gamma }\) or the LR test when the data is normally distributed. However, they should keep in mind that the power is only acceptable for sample sizes greater than or equal to 50 participants per group in case of large heterogeneity in treatment effects, or with 250 participants per group in case of moderate heterogeneity. In the case of non-normally distributed data, they should use the FRT variance tests or the FRT-PI test. Once again, however, when interpreting the result, it is important to bear in mind that the power is only sufficiently high for very large samples (250 people per group in case of high heterogeneity).
We also examined the performance of rank-based versions of the variance ratio and the variance difference tests, because in the case of the average treatment effect, simulation studies show that an FRT that uses rank data (i.e., the initial data is transformed into ranks in a first step) performs well across many simulation conditions (Imbens & Rubin, 2015 ; Rosenbaum, 2002 , 2010 ). In our simulations, these tests performed similar (in case of t -distributed data) or worse (in case of normally distributed data) than their raw-data based counterparts. Furthermore, in the case of log-normal distributed data, their performance was unsatisfactory. Thus, we provisionally conclude that the performance advantage that is observed for the average treatment effect does not generalize to the detection of a heterogeneous treatment effect. However, our results have to be replicated in future simulation studies to reach a final conclusion.
Finally, we also investigated whether the power of the tests is increased by considering a relevant covariate. However, our results show only a tiny effect of including a covariate. The obvious explanation is that the influence of the covariate was too small in our simulation to detect an effect on power. We used a moderate effect, because we think this is the most plausible value for the considered setting in which participants are randomly assigned to treatment. Nevertheless, future research should replicate and also extend our results by additionally examining larger sizes of the covariate’s effect. Another point to consider is that covariates are often not measured with perfect reliability, which in turn may also affect the error rates of the tests for treatment effect heterogeneity. In fact, we assumed here that the outcome variable is not subject to measurement error and it remains unclear how a violation of this assumption would affect the different tests. We think that this is also an interesting question to examine in future research.
Furthermore, in our simulation study, we focused on tests that have been proposed in the causal inference literature to test for treatment effect heterogeneity. However, in the context of methods that compare variance terms, the statistical literature suggests a number of further tests that are aimed to be more robust to violations of the normal distribution, such as the Levene test or the Brown and Forsythe test (see Wilcox, 2017b , for an overview) or tests that are based on bootstrapping (see Lim & Loh, 1996 ; Wilcox, 2017a ). We expect these tests to perform similarly to the FRT tests, though this assumption should be validated in a future simulation study. Finally, in the case of the average treatment effect, previous research has found that the meta-analytical integration of the results of many small-sample studies can match a single, large-sample study in terms of power (e.g., Cappelleri et al., 1996 ). This suggests that the pooled results of several heterogeneous treatment effect tests may have a sufficient power, even when the single studies are underpowered due to using small samples. We think that testing this hypothesis in future research is not only interesting from an applied perspective, but also from a methodological point of view: Although meta-analytic methods for the variance ratio statistic exist (see Nakagawa et al., 2015 ; Senior et al., 2020 )), it is currently unclear, at least to our knowledge, how to best pool the results of the FRT tests or the quantile tests (i.e., whether Fisher’s or Stouffer’s method is appropriate, for example, when there is substantial between-study heterogeneity). Footnote 4
To summarize, our results suggest that researchers will often fail to reject the null hypothesis of a constant treatment effect in a single study even when there is actual heterogeneity in the effects present in the data. We think that this result is relevant for applied researchers in several respects. To begin with, sample sizes are currently determined a priori such that a pre-specified average (causal) effect can be detected with a desired level of power. Thus, if future studies extend their focus to treatment effect heterogeneity, much larger sample sizes will have to be collected, at least when the variables responsible for treatment effect heterogeneity are not known before conducting the RCT.
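To make the usual planning step concrete, the following R sketch shows an a priori sample size calculation for the average effect; the assumed standardized mean difference of 0.5 and the 80% power target are illustrative values, not values from our study.

```r
## A priori sample size for detecting an assumed average effect of d = 0.5
## with 80% power in a two-group design.
power.t.test(delta = 0.5, sd = 1, power = 0.80, sig.level = 0.05)
## roughly 64 persons per group; the simulation results above indicate that
## detecting heterogeneity of the effect requires substantially larger samples.
```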
Specifically, in our study we examined the setting in which researchers have not assessed or do not know the heterogeneity-generating variables and therefore perform one of the (global) tests considered here. However, researchers may sometimes know these variables prior to randomization and may then, provided they have measured them and a linear model holds, fit a regression model with an interaction term to test whether the treatment effect varies with such a variable. In this case, the power of the interaction test is larger than the power of the tests considered in the simulation. For instance, when the population model is linear and the interaction term has a moderate effect (\(R^2\) of about .13), the power of the interaction test is higher than the power of the variance ratio and the variance difference tests (see Table 5; Footnote 5). A second implication for applied researchers is therefore that they should consider, already during the planning stage of a study, which variables might explain potential variation in the average treatment effect, because a much higher power can then be achieved with smaller sample sizes.
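The following R sketch illustrates this interaction approach; the data-generating coefficients are our own illustrative assumptions and are not the values used for Table 5.

```r
## Moderated regression when the heterogeneity-generating covariate X is known
## (coefficients chosen for illustration only).
set.seed(5)
n <- 100
a <- rbinom(n, 1, 0.5)                            # treatment indicator, prevalence 0.5
x <- rnorm(n)                                     # known covariate
y <- 0.5 * a + 0.3 * x + 0.7 * a * x + rnorm(n)   # treatment effect varies with x

fit <- lm(y ~ a * x)
summary(fit)$coefficients["a:x", ]                # estimate, SE, t value, and p value of the interaction
```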
A final recommendation for applied researchers concerns the case in which one of the tests studied here leads to a rejection of the null hypothesis of a constant treatment effect. Researchers will then be motivated to identify the person-level variables that explain the differences in the treatment effect. In the introduction we mentioned that a number of statistical approaches (classical ones as well as more modern approaches from machine learning) are available for this task. When performing these (post hoc) analyses, however, it has to be taken into account that one potentially conducts a large number of statistical tests, which may lead to many false-positive results (Schulz & Grimes, 2005; Sun et al., 2014). Thus, follow-up studies are needed to investigate how stable (in terms of replicability) and generalizable (in terms of external validity) the resulting findings are, especially because sample sizes are typically optimized with respect to the statistical test of the average treatment effect.
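One simple safeguard for such post-hoc analyses is to adjust the resulting p-values for multiplicity, as sketched below; the p-values are hypothetical placeholders standing in for a set of exploratory subgroup tests.

```r
## Adjusting hypothetical post-hoc subgroup p-values for multiple testing.
p_raw <- c(0.003, 0.021, 0.048, 0.160, 0.370)
p.adjust(p_raw, method = "holm")   # family-wise error rate control
p.adjust(p_raw, method = "BH")     # false discovery rate control
```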
To conclude, we conducted a simulation study to investigate the type 1 and type 2 error rates of different tests of treatment effect heterogeneity that were suggested in the causal inference literature. The results suggest that a randomization test should be used in order to have good control of the type 1 error. Furthermore, all tests studied are associated with high type 2 error rates when sample sizes are small to moderate. Thus, to detect heterogeneous treatment effects with sufficient power, much larger samples are needed than those typically collected in current studies, or new test procedures must be developed that have higher power even with smaller samples.
We also computed \(T_{CV}\) in selected conditions of our simulation. When the average treatment effect and the variance of the treatment effects were both 0, the rejection rate of \(T_{CV}\) was near the nominal value (i.e., 0.05). However, when the average treatment effect was 1 but the variance of the treatment effects was 0, the rejection rate was 0.935.
The heterogeneous regression model, the random coefficient model, and the structural equation model assume a linear relationship between the covariates and the outcome. When this assumption is suspected to be violated, one could first predict the outcome from the covariates using any machine learning model and then proceed with the residuals from this model.
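A minimal sketch of this two-step idea is given below; the random forest is merely one possible choice of machine learning model, and the data-generating model is an assumption made for illustration only.

```r
## Step 1: predict the outcome from the covariates with a flexible model;
## Step 2: apply a heterogeneity test to the residuals.
## (Requires the randomForest package; data simulated for illustration.)
library(randomForest)

set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
a  <- rbinom(n, 1, 0.5)
y  <- sin(x1) + x2^2 + a * (1 + 0.5 * x1) + rnorm(n)   # nonlinear covariate effects (assumed)

rf  <- randomForest(y ~ x1 + x2, data = data.frame(y, x1, x2))
res <- y - predict(rf)                                 # out-of-bag residuals
var.test(res[a == 1], res[a == 0])                     # variance ratio test on the residuals
```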
Chernozhukov and Fernandez-Val (2005) also suggest a bootstrap variant of their approach, which we included in the simulation as well. However, the performance of this test with regard to the type 1 and type 2 error rates was unsatisfactory (see Ding et al., 2016, for similar findings), so we decided not to report results for it.
In a preliminary simulation study, we examined the power of a pooled variance ratio test depending on the size of treatment effect heterogeneity in the single studies, the number of studies to be pooled, and the amount of between-study heterogeneity. Specifically, to generate the data for a single study, we used the same population model as in the simulation study reported in the main text. The differences were that we generated data for a small, a moderate, and a large heterogeneous treatment effect (i.e., the standard deviation \(\sigma _{\tau }\) of the treatment effects was set to 0.10, 0.25, or 0.50; see Eq. 21), and that we considered the small sample size condition only (i.e., 25 persons each in the experimental and the control group of a single study). In addition, we varied the amount of systematic variation of the heterogeneous treatment effect across studies (i.e., the between-study heterogeneity was either 0 or 0.10) and the number of studies that were meta-analytically integrated (i.e., 10 vs. 30 vs. 50 studies). In each of the 1000 replications generated for a simulation condition, we computed a random-effects meta-analysis (using the metafor package; see Viechtbauer, 2010) and checked whether the pooled variance ratio differed significantly from its null value. The results showed that when the heterogeneous treatment effect was small, the power of the pooled test was adequate only when the number of aggregated studies was 50 and there was no between-study heterogeneity (power = .816). When between-study heterogeneity was 0.10, the power remained below .80 even with 50 studies (power = .746). Furthermore, when the heterogeneous treatment effect was moderate or large, the power of the pooled variance ratio test was already adequate with 10 studies (the power was near 1.00 in almost all conditions), independent of the amount of between-study heterogeneity. The exception was the condition with a moderate heterogeneous treatment effect and a between-study heterogeneity of 0.10; here, the nominal power level was not reached (power = .750). However, we note that these results were obtained under rather favorable conditions (i.e., correct test statistic, distributional assumptions met) and that they have to be replicated in future simulation research.
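To indicate how such a pooling could be implemented in practice, the sketch below meta-analyzes log variance ratios (lnVR; Nakagawa et al., 2015) with a random-effects model in metafor. The per-study summary statistics are simulated for illustration and do not reproduce the preliminary study described above.

```r
## Random-effects meta-analysis of log variance ratios (lnVR) across k small studies
## (summary statistics simulated for illustration only).
library(metafor)

set.seed(8)
k  <- 10                                  # number of studies to pool (assumed)
nE <- rep(25, k); nC <- rep(25, k)        # 25 persons per group in each study
sdE <- sqrt(1 + 0.25^2) * sqrt(rchisq(k, nE - 1) / (nE - 1))  # treatment-group SDs under a moderate heterogeneous effect
sdC <- sqrt(rchisq(k, nC - 1) / (nC - 1))                     # control-group SDs

yi <- log(sdE / sdC) + 1 / (2 * (nE - 1)) - 1 / (2 * (nC - 1))  # bias-corrected lnVR
vi <- 1 / (2 * (nE - 1)) + 1 / (2 * (nC - 1))                   # approximate sampling variance

rma(yi, vi, method = "REML")              # pooled lnVR; tests H0: lnVR = 0 (equal variances)
```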
The model used to generate the data was a linear model with an interaction term (the display equation is not reproduced here), where A is the treatment variable (with prevalence 0.5) and X is a standard normal variable. The residual terms were also standard normally distributed.
Athey, S., Wager, S., & Tibshirani, J. (2019). Generalized random forests. Annals of Statistics, 47 , 1148–1178. https://doi.org/10.1214/18-AOS1709
Bloome, D., & Schrage, D. (2021). Covariance regression models for studying treatment effect heterogeneity across one or more outcomes: Understanding how treatments shape inequality. Sociological Methods & Research, 50 , 1034–1072. https://doi.org/10.1177/0049124119882449
Bollen, K. A. (1989). Structural equations with latent variables . West Sussex: John Wiley & Sons.
Cappelleri, J. C., Ioannidis, J. P. A., Schmid, C. H., de Ferranti, S. D., Aubert, M., Chalmers, T. C., & Lau, J. (1996). Large trials vs meta-analysis of smaller trials: How do their results compare? JAMA, 276, 1332–1338. https://doi.org/10.1001/jama.1996.03540160054033
Casella, G., & Berger, R. L. (2002). Statistical inference . Duxbury Press.
Chernozhukov, V., & Fernandez-Val, I. (2005). Subsampling inference on quantile regression processes. Sankhya: The Indian Journal of Statistics, 67 , 253–276.
Cook, T. D. (2018). Twenty-six assumptions that have to be met if single random assignment experiments are to warrant ‘gold standard’ status: A commentary on Deaton and Cartwright. Social Science & Medicine, 210, 37–40. https://doi.org/10.1016/j.socscimed.2018.04.031
Cox, D. R. (1984). Interaction. International Statistical Review, 52, 1–24. https://doi.org/10.2307/1403235
Deaton, A., & Cartwright, N. (2018). Understanding and misunderstanding randomized controlled trials. Social Science & Medicine, 210 , 2–21. https://doi.org/10.1016/j.socscimed.2017.12.005
Ding, P., Feller, A., & Miratrix, L. (2016). Randomization inference for treatment effect variation. Journal of the Royal Statistical Society: Series B, 78, 655–671.
Giner, G., & Smyth, G. K. (2016). statmod: Probability calculations for the inverse Gaussian distribution. The R Journal, 8, 339–351.
Hallquist, M. N., & Wiley, J. F. (2018). MplusAutomation: An R package for facilitating large-scale latent variable analyses in Mplus. Structural Equation Modeling, 25, 621–638. https://doi.org/10.1080/10705511.2017.1402334
Hao, L., & Naiman, D. Q. (2007). Quantile regression . Thousand Oaks, California: Sage.
Hernan, M., & Robins, J. M. (2020). Causal inference: What if . Boca Raton: Chapman & Hall/CRC.
Hildreth, C., & Houck, J. P. (1968). Some estimators for a linear model with random coefficients. Journal of the American Statistical Association, 63 , 584–595. https://doi.org/10.2307/2284029
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81 , 945–960. https://doi.org/10.2307/2289064
Imbens, G. W., & Rubin, D. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction . Cambridge: Cambridge University Press.
Kaiser, T., Volkmann, C., Volkmann, A., Karyotaki, E., Cuijpers, P., & Brakemeier, E.- L. (2022). Heterogeneity of treatment effects in trials on psychotherapy of depression. Clinical Psychology: Science and Practice. https://doi.org/10.1037/cps0000079
Koenker, R., & Xiao, Z. (2002). Inference on the quantile regression process. Econometrica, 70 , 1583–1612. https://doi.org/10.1111/1468-0262.00342
Kravitz, R. L., Duan, N., & Braslow, J. (2004). Evidence-based medicine, heterogeneity of treatment effects, and the trouble with averages. The Milbank Quarterly, 82, 661–687. https://doi.org/10.1111/j.0887-378X.2004.00327.x
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116, 4156–4165. https://doi.org/10.1073/pnas.1804597116
Lim, T.- S., & Loh, W.- Y. (1996). A comparison of tests of equality of variances. Computational Statistics & Data Analysis, 22 , 287–301. https://doi.org/10.1016/0167-9473(95)00054-2
Murnane, R. J., & Willett, J. B. (2011). Methods matter: Improving causal inference in educational and social science research . Oxford: Oxford University Press.
Muthén, B. O., Muthén, L. K., & Asparouhov, T. (2017). Regression and mediation analyses using Mplus. Los Angeles, CA: Muthén & Muthén.
Nakagawa, S., Poulin, R., Mengersen, K., Reinhold, K., Engqvist, L., Lagisz, M., & Senior, A. M. (2015). Meta-analysis of variation: Ecological and evolutionary applications and beyond. Methods in Ecology and Evolution, 6 , 143–152. https://doi.org/10.1111/2041-210X.12309
Powers, S., Qian, J., Jung, K., et al. (2018). Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine, 37, 1767–1787. https://doi.org/10.1002/sim.7623
Rosenbaum, P. R. (2002). Observational studies . New York: Springer.
Rosenbaum, P. R. (2010). Design of observational studies . New York: Springer.
Rosseel, Y. (2012). Lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48 , 1–36. https://doi.org/10.18637/jss.v048.i02
Salditt, M., Eckes, T., & Nestler, S. (2023). A tutorial introduction to heterogeneous treatment effect estimation with meta-learners. Administration and Policy in Mental Health and Mental Health Services Research . https://doi.org/10.1007/s10488-023-01303-9
Schulz, K. F., & Grimes, D. A. (2005). Multiplicity in randomised trials II: Subgroup and interim analyses. The Lancet, 365, 1657–1661. https://doi.org/10.1016/S0140-6736(05)66516-6
Senior, A. M., Viechtbauer, W., & Nakagawa, S. (2020). Revisiting and expanding the meta-analysis of variation: The log coefficient of variation ratio. Research Synthesis Methods, 11 , 553–567. https://doi.org/10.1002/jrsm.1423
Sun, X., Ioannidis, J. P., Agoritsas, T., Alba, A. C., & Guyatt, G. (2014). How to use a subgroup analysis: Users’ guide to the medical literature. JAMA, 311, 405–411. https://doi.org/10.1001/jama.2013.285063
Tucker-Drob, E. M. (2011). Individual difference methods for randomized experiments. Psychological Methods, 16 , 298–318. https://doi.org/10.1037/a0023349
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software , 36 (3), 1–48. https://doi.org/10.18637/jss.v036.i03
Volkmann, C., Volkmann, A., & Müller, C. A. (2020). On the treatment effect heterogeneity of antidepressants in major depression: A Bayesian meta-analysis and simulation study. PLoS ONE, 15, e0241497. https://doi.org/10.1371/journal.pone.0241497
Wasserman, L. (2004). All of statistics . New York: Springer.
Wendling, T., Jung, K., Callahan, A., Schuler, A., Shah, N. H., & Gallego, B. (2018). Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Statistics in Medicine, 37 , 3309–3324. https://doi.org/10.1002/sim.7820
Western, B., & Bloome, D. (2009). Variance function regressions for studying inequality. Sociological Methodology, 39 , 293–326. https://doi.org/10.1111/j.1467-9531.2009.0122
Wilcox, R. R. (2017). Introduction to robust estimation and hypothesis testing . West Sussex: John Wiley & Sons.
Wilcox, R. R. (2017). Understanding and applying basic statistical methods using R . West Sussex: John Wiley & Sons.
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
University of Münster, Institut für Psychologie, Fliednerstr. 21, 48149, Münster, Germany
Steffen Nestler & Marie Salditt
Corresponding author
Correspondence to Steffen Nestler .
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional materials for this article can be found at https://osf.io/nbrg2/ . The materials on the OSF include the R code for the simulation study and the Mplus code to estimate the random coefficient regression model.
Additional results
Here, we present additional results of the simulation study. Table 6 displays the results for the examined tests when no covariates are considered and the average treatment effect is 1, and Table 7 shows the results for the tests when covariates are considered and the average treatment effect is 1.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
About this article
Nestler, S., & Salditt, M. Comparing type 1 and type 2 error rates of different tests for heterogeneous treatment effects. Behav Res 56, 6582–6597 (2024). https://doi.org/10.3758/s13428-024-02371-x
Accepted: 13 February 2024
Published: 20 March 2024
Issue Date: October 2024
DOI: https://doi.org/10.3758/s13428-024-02371-x
Keywords: Heterogeneous treatment effects, Randomization tests, Heterogeneous regression
The problem of the type II statistical error
Affiliation: Department of Obstetrics and Gynecology, Chicago Lying-in Hospital, Illinois, USA.
PMID: 7566865
DOI: 10.1016/0029-7844(95)00251-L
Objective: To determine if type II statistical errors (also known as beta errors) are a common problem in published clinical research.
Methods: Type II statistical errors occur when sample sizes are too small to show an effect of treatment, even when an effect truly exists. Searching the Medline data base, we identified ten meta-analyses published during 1986-1994 in the American Journal of Obstetrics and Gynecology, Obstetrics and Gynecology, and The Journal of Reproductive Medicine. Meta-analyses were used as sources of component or individual studies for the following reason: When small component studies have negative findings that differ from the overall conclusions of the meta-analysis, the component studies may have type II statistical errors.
Results: We found that only 6.5% (15 of 231) of component studies provided any documentation that power calculations to determine sample sizes had been done a priori (before the research began). Thus, many of these component studies with findings of no treatment effect may have had type II errors because of too-small sample sizes. When stratifying the component studies by year of publication, we found that 7.9% (14 of 178) of studies published in the 1980s and 1990s had documented evidence of a priori power calculations. In the 1960s and 1970s, only one of 53 component studies had documented evidence of power calculations.
Conclusion: To ensure that truly effective treatments are introduced into clinical practice as quickly as possible, we believe that a priori power calculations should always be done in quantitative clinical research.
MeSH terms: Meta-Analysis as Topic; Statistics as Topic