
Understanding Hypothesis Tests: Confidence Intervals and Confidence Levels

Topics: Hypothesis Testing, Data Analysis, Statistics

In this series of posts, I show how hypothesis tests and confidence intervals work by focusing on concepts and graphs rather than equations and numbers.  

Previously, I used graphs to show what statistical significance really means. In this post, I'll explain both confidence intervals and confidence levels, and how they're closely related to P values and significance levels.

How to Correctly Interpret Confidence Intervals and Confidence Levels

A confidence interval is a range of values that is likely to contain an unknown population parameter. If you draw a random sample many times, a certain percentage of the confidence intervals will contain the population mean. This percentage is the confidence level.

Most frequently, you’ll use confidence intervals to bound the mean or standard deviation, but you can also obtain them for regression coefficients, proportions, rates of occurrence (Poisson), and for the differences between populations.

Just as there is a common misconception of how to interpret P values, there's a common misconception of how to interpret confidence intervals. In this case, the confidence level is not the probability that a specific confidence interval contains the population parameter.

The confidence level represents the theoretical ability of the procedure to produce intervals that contain the population parameter, if you could assess many intervals and knew the value of the parameter. For a specific confidence interval from one study, the interval either contains the population value or it does not—there's no room for probabilities other than 0 or 1. And you can't choose between these two possibilities because you don't know the value of the population parameter.

"The parameter is an unknown constant and no probability statement concerning its value may be made."  —Jerzy Neyman, original developer of confidence intervals.

This will be easier to understand after we discuss the graph below.

With this in mind, how do you interpret confidence intervals?

Confidence intervals serve as good estimates of the population parameter because the procedure tends to produce intervals that contain the parameter. They consist of the point estimate (the most likely value) and a margin of error around that point estimate. The margin of error indicates the amount of uncertainty that surrounds the sample estimate of the population parameter.

In this vein, you can use confidence intervals to assess the precision of the sample estimate. For a specific variable, a narrower confidence interval [90, 110] suggests a more precise estimate of the population parameter than a wider confidence interval [50, 150].

Confidence Intervals and the Margin of Error

Let's move on to see how confidence intervals account for that margin of error. To do this, we'll use the same tools that we've been using to understand hypothesis tests. I'll create a sampling distribution using probability distribution plots, the t-distribution, and the variability in our data. We'll base our confidence interval on the energy cost data set that we've been using.

When we looked at significance levels, the graphs displayed a sampling distribution centered on the null hypothesis value, and the outer 5% of the distribution was shaded. For confidence intervals, we need to shift the sampling distribution so that it is centered on the sample mean and shade the middle 95%.

Probability distribution plot that illustrates how a confidence interval works

The shaded area shows the range of sample means that you'd obtain 95% of the time using our sample mean as the point estimate of the population mean. This range [267, 394] is our 95% confidence interval.

Using the graph, it’s easier to understand how a specific confidence interval represents the margin of error, or the amount of uncertainty, around the point estimate. The sample mean is the most likely value for the population mean given the information that we have. However, the graph shows it would not be unusual at all for other random samples drawn from the same population to obtain different sample means within the shaded area. These other likely sample means all suggest different values for the population mean. Hence, the interval represents the inherent uncertainty that comes with using sample data.

You can use these graphs to calculate probabilities for specific values. However, notice that you can’t place the population mean on the graph because that value is unknown. Consequently, you can’t calculate probabilities for the population mean, just as Neyman said!

Why P Values and Confidence Intervals Always Agree About Statistical Significance

You can use either P values or confidence intervals to determine whether your results are statistically significant. If a hypothesis test produces both, these results will agree.

The confidence level is equivalent to 1 – the alpha level. So, if your significance level is 0.05, the corresponding confidence level is 95%.

  • If the P value is less than your significance (alpha) level, the hypothesis test is statistically significant.
  • If the confidence interval does not contain the null hypothesis value, the results are statistically significant.
  • If the P value is less than alpha, the confidence interval will not contain the null hypothesis value.

For our example, the P value (0.031) is less than the significance level (0.05), which indicates that our results are statistically significant. Similarly, our 95% confidence interval [267, 394] does not include the null hypothesis mean of 260, and we draw the same conclusion.

To understand why the results always agree, let’s recall how both the significance level and confidence level work.

  • The significance level defines the distance the sample mean must be from the null hypothesis to be considered statistically significant.
  • The confidence level defines the distance between the confidence limits and the sample mean.

Both the significance level and the confidence level define a distance from a limit to a mean. Guess what? The distances in both cases are exactly the same!

The distance equals the critical t-value multiplied by the standard error of the mean. For our energy cost example data, the distance works out to be $63.57.
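The following minimal sketch makes this concrete. The sample size and standard deviation below are assumptions chosen so that the computed margin, interval, and P value approximately reproduce the article's numbers ($63.57, [267, 394], and 0.031); they are illustrative values, not the article's actual data set.

```python
# Hedged sketch: n and sample_sd are assumed, illustrative values.
import math
from scipy import stats

n, sample_mean, sample_sd = 25, 330.6, 154.0   # assumed sample statistics
null_mean, alpha = 260.0, 0.05

se = sample_sd / math.sqrt(n)                  # standard error of the mean
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # critical t-value
margin = t_crit * se                           # the shared distance, ~63.57

ci = (sample_mean - margin, sample_mean + margin)   # ~[267, 394]
t_stat = (sample_mean - null_mean) / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)     # ~0.031

# The two significance checks always agree:
print(p_value < alpha)                          # True
print(not (ci[0] <= null_mean <= ci[1]))        # True
```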

Imagine this discussion between the null hypothesis mean and the sample mean:

Null hypothesis mean, hypothesis test representative : Hey buddy! I’ve found that you’re statistically significant because you’re more than $63.57 away from me!

Sample mean, confidence interval representative : Actually, I’m significant because you’re more than $63.57 away from me !

Very agreeable, aren't they? And they always will agree as long as you compare the correct pairs of P values and confidence intervals. If you compare the incorrect pair, you can get conflicting results, as shown by common mistake #1 in this post.

Closing Thoughts

In statistical analyses, there tends to be a greater focus on P values and simply detecting a significant effect or difference. However, a statistically significant effect is not necessarily meaningful in the real world. For instance, the effect might be too small to be of any practical value.

It's important to pay attention to both the magnitude and the precision of the estimated effect. That's why I'm rather fond of confidence intervals. They allow you to assess these important characteristics along with the statistical significance. You'd like to see a narrow confidence interval where the entire range represents an effect that is meaningful in the real world.

If you like this post, you might want to read the previous posts in this series that use the same graphical framework:

  • Part One: Why We Need to Use Hypothesis Tests
  • Part Two: Significance Levels (alpha) and P values

For more about confidence intervals, read my post where I compare them to tolerance intervals and prediction intervals.

If you'd like to see how I made the probability distribution plot, please read: How to Create a Graphical Version of the 1-sample t-Test.

Hypothesis Testing

After completing this reading, you should be able to:

  • Construct an appropriate null hypothesis and alternative hypothesis and distinguish between the two.
  • Construct and apply confidence intervals for one-sided and two-sided hypothesis tests, and interpret the results of hypothesis tests with a specific level of confidence.
  • Differentiate between a one-sided and a two-sided test and identify when to use each test.
  • Explain the difference between Type I and Type II errors and how these relate to the size and power of a test.
  • Understand how a hypothesis test and a confidence interval are related.
  • Explain what the p-value of a hypothesis test measures.
  • Identify the steps to test a hypothesis about the difference between two population means.
  • Explain the problem of multiple testing and how it can bias results.

Hypothesis testing is the process of determining whether a hypothesis is consistent with the sample data. It starts by stating the null hypothesis and the alternative hypothesis. The null hypothesis is an assumption about the population parameter, while the alternative hypothesis is the statement we accept if the null hypothesis is rejected. The critical values that separate rejection from non-rejection are determined by the distribution of the test statistic (when the null hypothesis is true) and the size of the test (the probability with which we are willing to reject a true null hypothesis).

Components of a Hypothesis Test

The elements of a hypothesis test include:

  • The null hypothesis.
  • The alternative hypothesis.
  • The test statistic.
  • The size of the hypothesis test and errors.
  • The critical value.
  • The decision rule.

The Null Hypothesis

As stated earlier, the first stage of a hypothesis test is the statement of the null hypothesis. The null hypothesis is a statement about the values of the population parameter. It captures the notion that "there is nothing unusual about the data."

The null hypothesis, denoted \(H_0\), represents the current state of knowledge about the population parameter that's the subject of the test. In other words, it represents the "status quo." For example, the U.S. Food and Drug Administration may walk into a cooking oil manufacturing plant intending to confirm that each 1 kg oil package has, say, 0.15% cholesterol and not more. The inspectors will formulate a hypothesis like:

\(H_0\): Each 1 kg package has 0.15% cholesterol.

A test would then be carried out to confirm or reject the null hypothesis.

Other typical statements of \(H_0\) include:

$$H_0:\mu={\mu}_0$$

$$H_0:\mu \le {\mu}_0$$

where \(\mu\) is the true population mean and \(\mu_0\) is the hypothesized population mean.

The Alternative Hypothesis

The alternative hypothesis, denoted \(H_1\), is the contradiction of the null hypothesis. It determines the values of the population parameter for which the null hypothesis is rejected; thus, rejecting \(H_0\) makes \(H_1\) valid. We accept the alternative hypothesis when the "status quo" is discredited and found to be untrue.

Using our FDA example above, the alternative hypothesis would be:

\(H_1\): Each 1 kg package does not have 0.15% cholesterol.

Typical statements of \(H_1\) include:

$$H_1:\mu \neq {\mu}_0$$

$$H_1:\mu > {\mu}_0$$

Note that each statement of the alternative hypothesis contradicts the corresponding statement of the null hypothesis above.

The Test Statistic

A test statistic is a standardized value computed from sample information when testing hypotheses. It compares the given data with what we would expect under the null hypothesis. Thus, it is a major determinant when deciding whether to reject \(H_0\), the null hypothesis.

We use the test statistic to gauge the degree of agreement between sample data and the null hypothesis. Analysts use the following formula when calculating the test statistic.

$$\text{Test statistic}= \frac{\text{Sample statistic}-\text{Hypothesized value}}{\text{Standard error of the sample statistic}}$$

The test statistic is a random variable that changes from one sample to another. Test statistics assume a variety of distributions. We shall focus on normally distributed test statistics because they are used for hypotheses concerning means, regression coefficients, and other econometric models.

We shall consider the hypothesis test on the mean. Consider a null hypothesis \(H_0:\mu=\mu_0\). Assume that the data are iid and that the sample mean is asymptotically normally distributed:

$$\sqrt{n} (\hat{\mu}-\mu) \sim N(0, {\sigma}^2)$$

where \({\sigma}^2\) is the variance of the iid random variables. The asymptotic distribution leads to the test statistic:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{\hat{\sigma}^2}{n}}}\sim N(0,1)$$

Note this is consistent with our initial definition of the test statistic.

The following table gives a brief outline of the test statistics that are used regularly, based on the distribution the data are assumed to follow:

$$\begin{array}{ll} \textbf{Hypothesis Test} & \textbf{Test Statistic}\\ \text{Z-test} & \text{z-statistic} \\ \text{Chi-square test} & \text{Chi-square statistic}\\ \text{t-test} & \text{t-statistic} \\ \text{ANOVA} & \text{F-statistic}\\ \end{array}$$

We can subdivide the set of values that the test statistic can take into two regions: the non-rejection region, which is consistent with \(H_0\), and the rejection region (critical region), which is inconsistent with \(H_0\). If the test statistic has a value within the critical region, we reject \(H_0\).

Just like with any other statistic, the distribution of the test statistic must be fully specified under \(H_0\) when \(H_0\) is true.

The Size of the Hypothesis Test and the Type I and Type II Errors

While using sample statistics to draw conclusions about the parameters of the population as a whole, there is always the possibility that the sample collected does not accurately represent the population. Consequently, statistical tests carried out using such sample data may yield incorrect results that may lead to erroneous rejection (or lack thereof) of the null hypothesis. We have two types of errors:

Type I Error

A Type I error occurs when we reject a true null hypothesis. For example, a Type I error would manifest in the form of rejecting \(H_0: \mu = 0\) when \(\mu\) is actually zero.

Type II Error

A Type II error occurs when we fail to reject a false null hypothesis. In such a scenario, the test provides insufficient evidence to reject the null hypothesis when it's false.

The level of significance, denoted by \(\alpha\), represents the probability of making a Type I error, i.e., rejecting the null hypothesis when, in fact, it's true. It should not be confused with \(\beta\), the probability of making a Type II error. The ideal but practically impossible statistical test would be one that simultaneously minimizes \(\alpha\) and \(\beta\). We use \(\alpha\) to determine the critical values that subdivide the distribution into the rejection and non-rejection regions.
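To see what \(\alpha\) means operationally, the following minimal simulation (all data simulated, parameters assumed for illustration) repeatedly tests a true null hypothesis at the 5% level; the long-run rejection rate lands near the nominal size of the test.

```python
# Simulate the Type I error rate of a 5%-level one-sample t-test
# when H0 is true. All data here are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, reps = 0.05, 30, 10_000

rejections = 0
for _ in range(reps):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)   # H0: mu = 0 is true
    if stats.ttest_1samp(sample, popmean=0.0).pvalue <= alpha:
        rejections += 1

print(rejections / reps)   # close to 0.05, the size of the test
```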

The Critical Value and the Decision Rule

The decision to reject or not to reject the null hypothesis is based on the distribution assumed by the test statistic. For example, if the test statistic follows a normal distribution, we use the level of significance (\(\alpha\)) of the test to come up with critical values on the standard normal distribution.

The decision rule combines the critical value (denoted by \(C_\alpha\)), the alternative hypothesis, and the test statistic (T). It states whether to reject the null hypothesis in favor of the alternative hypothesis or to fail to reject the null hypothesis.

For the t-test, the decision rule depends on the alternative hypothesis. When testing against a two-sided alternative, reject the null hypothesis if \(|T|>C_\alpha\); that is, reject if the absolute value of the test statistic is greater than the critical value. For one-sided tests, reject the null hypothesis if \(T<-C_\alpha\) when using a one-sided lower alternative and if \(T>C_\alpha\) when using a one-sided upper alternative. When a null hypothesis is rejected at the \(\alpha\) significance level, we say that the result is significant at the \(\alpha\) level.

Note that prior to decision-making, one must decide whether the test should be one-tailed or two-tailed. The following is a brief summary of the decision rules under different scenarios:

Left One-tailed Test

\(H_1\): parameter < X

Decision rule: Reject \(H_0\) if the test statistic is less than the critical value. Otherwise, do not reject \(H_0\).

Right One-tailed Test

\(H_1\): parameter > X

Decision rule: Reject \(H_0\) if the test statistic is greater than the critical value. Otherwise, do not reject \(H_0\).

Two-tailed Test

\(H_1\): parameter ≠ X

Decision rule: Reject \(H_0\) if the test statistic is greater than the upper critical value or less than the lower critical value.

The first graph (not shown here) represents the rejection region for the two-sided alternative:

\(H_0: \mu = \mu_0\) vs. \(H_1: \mu \neq \mu_0\).

The second graph represents the rejection region when the alternative is one-sided upper, with the hypotheses stated as:

\(H_0: \mu \le \mu_0\) vs. \(H_1: \mu > \mu_0\).

For the one-sided lower alternative, the hypotheses are:

\(H_0: \mu \ge \mu_0\) vs. \(H_1: \mu < \mu_0\).

Example: Hypothesis Test on the Mean

Consider the returns of a portfolio \(X=(x_1,x_2,\dots, x_n)\) of \(n = 40\) annual observations, from 1980 through 2020. The estimated mean of the returns is 7.50%, with a standard deviation of 17%. We wish to determine whether the expected return is different from 0 at the 5% significance level.

We start by stating the two-sided hypothesis test:

\(H_0: \mu = 0\) vs. \(H_1: \mu \neq 0\)

The test statistic is:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{\hat{\sigma}^2}{n}}} \sim N(0,1)$$

In this case, we have \(\hat{\mu}=0.075\), \(\hat{\sigma}^2=0.17^2\), and \(n=40\), so:

$$T=\frac{0.075-0}{\sqrt{\frac{0.17^2}{40}}} \approx 2.79$$

At the significance level \(\alpha=5\%\), the critical value is \(\pm 1.96\). Since this is a two-sided test, the rejection regions are \((-\infty,-1.96)\) and \((1.96, \infty)\), as shown in the diagram below. Because the test statistic \(T \approx 2.79\) falls in the upper rejection region (2.79 > 1.96), we reject the null hypothesis and conclude that the expected return is significantly different from zero.

Rejection Regions - Two-Sided Test
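The test can be reproduced with a short script using the stated sample statistics:

```python
# Two-sided z-test for H0: mu = 0 vs. H1: mu != 0.
import math
from scipy import stats

n, mu_hat, sigma_hat, mu_0, alpha = 40, 0.075, 0.17, 0.0, 0.05

T = (mu_hat - mu_0) / (sigma_hat / math.sqrt(n))   # ~2.79
C = stats.norm.ppf(1 - alpha / 2)                  # ~1.96
p = 2 * stats.norm.sf(abs(T))                      # ~0.005

print("reject H0" if abs(T) > C else "fail to reject H0")
```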

The example above is a Z-test (the test most emphasized in this chapter; it follows immediately from the central limit theorem (CLT)). However, we can use the Student's t-distribution if the random variables are iid and normally distributed and the sample size is small (n < 30).

With the Student's t-distribution, we use the unbiased estimator of the variance:

$$s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$$

Therefore, the test statistic for \(H_0: \mu=\mu_0\) is given by:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{s^2}{n}}} \sim t_{n-1}$$

The Type II Error and the Test Power

The power of a test complements the level of significance. While the level of significance gives the probability of rejecting the null hypothesis when it is, in fact, true, the power of a test gives the probability of correctly rejecting the null hypothesis when it is false. In other words, it gives the likelihood of rejecting \(H_0\) when, indeed, it's false. Denoting the probability of a Type II error by \(\beta\), the power of a test is given by:

$$\text{Power of a Test}=1-\beta$$

The power of a test measures the likelihood that a false null hypothesis is rejected. It is influenced by the sample size, the distance between the hypothesized parameter value and the true value, and the size of the test.
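As an illustration, the sketch below computes the power of a one-sided z-test. The true mean, \(\sigma\), and sample size are assumed values (borrowed from the portfolio example above), not quantities given in the text.

```python
# Power of a one-sided z-test of H0: mu = 0 vs. H1: mu > 0
# when the true mean is mu1. All inputs are assumed values.
import math
from scipy import stats

n, sigma, alpha, mu1 = 40, 0.17, 0.05, 0.075

c = stats.norm.ppf(1 - alpha)        # critical value, ~1.645
se = sigma / math.sqrt(n)
power = stats.norm.sf(c - mu1 / se)  # P(reject H0) when mu = mu1
print(power)                         # ~0.87
```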

Confidence Intervals

A confidence interval can be defined as the range of parameter values within which the true parameter lies at a given confidence level. For instance, a 95% confidence interval constitutes the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size. Therefore, a \(1-\alpha\) confidence interval contains the values that cannot be rejected at a test size of \(\alpha\).

It is important to note that the confidence interval depends on the alternative hypothesis statement in the test. Let us start with the two-sided test alternatives.

$$H_0:\mu=0$$

$$H_1:\mu \neq 0$$

Then the \(1-α\) confidence interval is given by:

$$\left[\hat{\mu} -C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} ,\hat{\mu} + C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} \right]$$

where \(C_\alpha\) is the critical value for a test of size \(\alpha\).

Example: Calculating Two-Sided Alternative Confidence Intervals

Consider again the returns of a portfolio \(X=(x_1,x_2,\dots, x_n)\) of \(n=40\) annual observations from 1980 through 2020, with an estimated mean return of 7.50% and a standard deviation of 17%. Calculate the 95% confidence interval for the portfolio return.

The \(1-\alpha\) confidence interval is given by:

$$\begin{align*}&\left[\hat{\mu}-C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} ,\hat{\mu} + C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} \right]\\& =\left[0.0750-1.96 \times \frac{0.17}{\sqrt{40}}, 0.0750+1.96 \times \frac{0.17}{\sqrt{40}} \right]\\&=[0.02232,0.1277]\end{align*}$$

Thus, the confidence interval implies that any null hypothesis value between 2.23% and 12.77% cannot be rejected against the two-sided alternative at the 5% test size.
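The interval can be reproduced in a few lines, using the example's sample statistics:

```python
# Two-sided 95% confidence interval for the mean return.
import math
from scipy import stats

n, mu_hat, sigma_hat, alpha = 40, 0.075, 0.17, 0.05

c = stats.norm.ppf(1 - alpha / 2)          # 1.96
half_width = c * sigma_hat / math.sqrt(n)
print(mu_hat - half_width, mu_hat + half_width)   # ~(0.0223, 0.1277)
```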

One-Sided Alternative

For a one-sided alternative, the confidence interval is also one-sided:

$$\left(-\infty ,\ \hat{\mu} +C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}} \right)$$

for the lower alternative \(H_1: \mu < \mu_0\), and

$$\left( \hat{\mu} - C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}},\ \infty \right)$$

for the upper alternative \(H_1: \mu > \mu_0\), where \(C_\alpha\) is the one-sided critical value (1.645 at the 5% test size).

Example: Calculating the One-Sided Alternative Confidence Interval

Assume that we were conducting the following one-sided test:

\(H_0:\mu \le 0\)

\(H_1:\mu>0\)

Since this is an upper alternative, the 95% confidence interval for the portfolio return is:

$$\begin{align*}&\left(\hat{\mu} - C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}},\ \infty \right)\\&=\left(0.0750-1.645\times \frac{0.17}{\sqrt{40}},\ \infty\right)\\&=(0.0308,\ \infty)\end{align*}$$

On the other hand, if the hypothesis test were:

\(H_0:\mu \ge 0\)

\(H_1:\mu<0\)

the 95% confidence interval would be:

$$\begin{align*}&\left(-\infty ,\ \hat{\mu} +C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}} \right)\\&=\left(-\infty ,\ 0.0750+1.645\times \frac{0.17}{\sqrt{40}}\right)=(-\infty,\ 0.1192)\end{align*}$$

Note that the critical value decreased from 1.96 to 1.645 because the test changed from two-sided to one-sided.
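A companion sketch for the one-sided intervals; note that the critical value is now the 5% quantile (1.645) rather than the 2.5% quantile (1.96):

```python
# One-sided 95% confidence intervals for the same data.
import math
from scipy import stats

n, mu_hat, sigma_hat = 40, 0.075, 0.17

c = stats.norm.ppf(0.95)            # 1.645
se = sigma_hat / math.sqrt(n)
print("upper alternative:", (mu_hat - c * se, math.inf))    # ~(0.0308, inf)
print("lower alternative:", (-math.inf, mu_hat + c * se))   # ~(-inf, 0.1192)
```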

The p-Value

When carrying out a statistical test with a fixed value of the significance level (\(\alpha\)), we merely compare the observed test statistic with some critical value. For example, we might "reject \(H_0\) using a 5% test" or "reject \(H_0\) at the 1% significance level." The problem with this 'classical' approach is that it does not give us details about the strength of the evidence against the null hypothesis.

Determining the p-value gives statisticians a more informative approach to hypothesis testing. The p-value is the lowest significance level at which we can reject \(H_0\). This means that the strength of the evidence against \(H_0\) increases as the p-value becomes smaller. How the p-value is computed depends on the alternative hypothesis.

The p-Value for One-Tailed Test Alternative

For one-tailed tests, the p-value is given by the probability that lies below the calculated test statistic for left-tailed tests. Similarly, the probability that lies above the test statistic in right-tailed tests gives the p-value.

Denoting the test statistic by T, the p-value for the right-tailed alternative \(H_1:\mu>\mu_0\) is given by:

$$P(Z>T)=1-P(Z \le T)=1-\Phi(T)$$

Conversely, for the left-tailed alternative \(H_1:\mu<\mu_0\), the p-value is given by:

$$P(Z \le T)=\Phi(T)$$

where Z is a standard normal random variable.

The p-Value for Two-Tailed Test Alternative

If the test is two-tailed, the p-value is given by the sum of the probabilities in the two tails. We start by determining the probability lying below the negative value of the test statistic, then add the probability lying above the positive value of the test statistic. That is, the p-value for a two-tailed hypothesis test is given by:

$$2\left[1-\Phi(|T|)\right]$$

Example 1: p-Value for One-Sided Alternative

Let θ represent the probability of obtaining a head when a coin is tossed. Suppose we toss the coin 200 times, and heads come up in 85 of the trials. Test the following hypothesis at the 5% level of significance.

H 0 : θ = 0.5

H 1 : θ < 0.5

First, note that repeatedly tossing a coin follows a binomial distribution.

Assuming \(H_0\) is true, the p-value is given by \(P(X \le 85)\), where \(X \sim \text{Binomial}(200, 0.5)\) with mean \(np = 200 \times 0.5 = 100\).

$$\begin{align*}P\left[ Z< \frac{85.5-100}{\sqrt{50}} \right]&=P(Z<-2.05)\\&=1-0.97982=0.02018 \end{align*}$$

Recall that for a binomial distribution, the variance is given by:

$$np(1-p)=200(0.5)(1-0.5)=50$$

(We have applied the Central Limit Theorem, approximating the binomial distribution by a normal distribution and using the continuity correction 85 + 0.5 = 85.5.)

Since the p-value (0.02018) is less than 0.05, the observed result would be very unlikely under \(H_0\), and we have strong evidence against \(H_0\) in favor of \(H_1\). Clearly expressing this result, we could say:

“There is very strong evidence against the hypothesis that the coin is fair. We, therefore, conclude that the coin is biased against heads.”

Remember, failure to reject \(H_0\) does not mean it's true. It means there's insufficient evidence to justify rejecting \(H_0\), given a certain level of significance.
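For comparison, the sketch below runs both the exact binomial test and the normal approximation with continuity correction used above (scipy.stats.binomtest is available in SciPy 1.7 and later):

```python
# Exact and approximate p-values for the coin example.
import math
from scipy import stats

n, heads, p0 = 200, 85, 0.5

# Exact one-sided binomial test (H1: theta < 0.5)
exact_p = stats.binomtest(heads, n, p0, alternative="less").pvalue

# Normal approximation with continuity correction
z = (heads + 0.5 - n * p0) / math.sqrt(n * p0 * (1 - p0))   # ~ -2.05
approx_p = stats.norm.cdf(z)                                # ~0.0202

print(exact_p, approx_p)
```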

Example 2:  p-Value for Two-Sided Alternative

A CFA candidate conducts a statistical test about the mean value of a random variable X.

H 0 : μ = μ 0  vs. H 1 : μ  ≠  μ 0

She obtains a test statistic of 2.2. Given a 5% significance level, determine and interpret the p-value.

$$\text{p-value}=2P(Z>2.2)=2[1-P(Z \le 2.2)]=2 \times 1.39\%=2.78\%$$

(We have multiplied by two since this is a two-tailed test)


The p-value (2.78%) is less than the level of significance (5%). Therefore, we have sufficient evidence to reject \(H_0\). In fact, the evidence is strong enough that we would also reject \(H_0\) at significance levels of 4% and 3%. However, at significance levels of 2% or 1%, we would not reject \(H_0\), since the p-value exceeds these values.

Hypothesis about the Difference between Two Population Means

It’s common for analysts to be interested in establishing whether there exists a significant difference between the means of two different populations. For instance, they might want to know whether the average returns for two subsidiaries of a given company exhibit  significant  differences.

Now, consider a bivariate random variable:

$$W_i=[X_i,Y_i]$$

Assume that the components \(X_i\) and \(Y_i\) are each iid but correlated with each other; that is, \(\text{Corr}(X_i, Y_i) \neq 0\).

Now, suppose that we want to test the hypothesis that:

$$H_0:μ_X=μ_Y$$

$$H_1:μ_X≠μ_Y$$

In other words, we want to test whether the constituent random variables have equal means. Note that the hypothesis statement above can be written as:

$$H_0:μ_X-μ_Y=0$$

$$H_1:μ_X-μ_Y≠0$$

To execute this test, consider the variable:

$$Z_i=X_i-Y_i$$

Therefore, considering the above random variable, if the null hypothesis is correct then,

$$E(Z_i)=E(X_i)-E(Y_i)=μ_X-μ_Y=0$$

Intuitively, this can be considered as a standard hypothesis test of

\(H_0: \mu_Z = 0\) vs. \(H_1: \mu_Z \neq 0\).

The test statistic is given by:

$$T=\frac{\hat{\mu}_z}{\sqrt{\frac{\hat{\sigma}^2_z}{n}}} \sim N(0,1)$$

Note that the test statistic formula accounts for the correlation between \(X_i\) and \(Y_i\). It is easy to see that:

$$V(Z_i)=V(X_i )+V(Y_i)-2COV(X_i, Y_i)$$

Which can be denoted as:

$$\hat{\sigma}^2_z =\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2\hat{\sigma}_{XY}$$

$$\hat{\mu}_z =\hat{\mu}_X-\hat{\mu}_Y$$

And thus the test statistic formula can be written as:

$$T=\frac{\hat{\mu}_X -\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2\hat{\sigma}_{XY}}{n}}}$$

This formula indicates that correlation plays a crucial role in determining the magnitude of the test statistic.

Another special case of the test statistic arises when \(X_i\) and \(Y_i\) are each iid and independent of each other. The test statistic is then given by:

$$T=\frac{\hat{\mu}_X -\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X}{n_X}+\frac{\hat{\sigma}^2_Y}{n_Y}}}$$

where \(n_X\) and \(n_Y\) are the sample sizes of \(X_i\) and \(Y_i\), respectively.

Example: Hypothesis Test on Two Means

An investment analyst wants to test whether there is a significant difference between the means of two portfolios at the 5% significance level. The first portfolio, X, consists of 30 government-issued bonds and has a mean return of 10% and a standard deviation of 2%. The second portfolio, Y, consists of 30 private bonds with a mean return of 14% and a standard deviation of 3%. The correlation between the two portfolios is 0.7. State the null hypothesis, and determine whether it is rejected.

The hypothesis statement is given by:

\(H_0: \mu_X - \mu_Y = 0\) vs. \(H_1: \mu_X - \mu_Y \neq 0\).

Note that this is a two-tailed test. At the 5% test size, the critical value is \(C_\alpha = \pm 1.96\).

Recall that:

$$\text{Cov}(X, Y)={\sigma}_{XY}={\rho}_{XY}{\sigma}_X{\sigma}_Y$$

where \({\rho}_{XY}\) is the correlation coefficient between X and Y.

Now the test statistic is given by:

$$T=\frac{\hat{\mu}_X -\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2{\rho}_{XY}{\sigma}_X{\sigma}_Y}{n}}}$$

$$=\frac{0.10-0.14}{\sqrt{\frac{0.02^2 +0.03^2-2\times 0.7 \times 0.02 \times 0.03}{30}}}=-10.215$$

The test statistic (−10.215) is far less than −1.96, so the null hypothesis is rejected at the 5% significance level: the two portfolio means are significantly different.
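The computation is easy to verify with a few lines, using only the standard library:

```python
# Test statistic for the difference between two correlated means.
import math

n = 30
mu_x, mu_y = 0.10, 0.14
s_x, s_y, rho = 0.02, 0.03, 0.7

var_z = s_x**2 + s_y**2 - 2 * rho * s_x * s_y   # variance of Z = X - Y
T = (mu_x - mu_y) / math.sqrt(var_z / n)        # ~ -10.2

print("reject H0" if abs(T) > 1.96 else "fail to reject H0")
```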

The Problem of Multiple Testing

Multiple testing occurs when multiple hypothesis tests are conducted on the same data set. The reuse of data produces spurious results and unreliable conclusions that do not hold up to scrutiny. The fundamental problem is that the test size (i.e., the probability that a true null is rejected) applies only to a single test. Repeated testing inflates the effective test size well beyond the assumed \(\alpha\) and therefore increases the probability of a Type I error.

Some control methods have been developed to combat multiple testing. These include the Bonferroni correction and procedures that control the False Discovery Rate (FDR) or the Familywise Error Rate (FWER).
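The inflation is easy to see by simulation. With 20 independent tests of true nulls at \(\alpha = 0.05\), the chance of at least one false rejection is about \(1 - 0.95^{20} \approx 64\%\); the minimal sketch below (simulated data) confirms this:

```python
# Familywise error rate under multiple testing, by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n, tests, reps = 0.05, 30, 20, 2_000

any_false_positive = 0
for _ in range(reps):
    pvals = [stats.ttest_1samp(rng.normal(size=n), 0.0).pvalue
             for _ in range(tests)]
    if min(pvals) <= alpha:        # at least one true null rejected
        any_false_positive += 1

print(any_false_positive / reps)   # close to 1 - 0.95**20 ~ 0.64
```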

Practice Question

An experiment was done to find out the number of hours that candidates spend preparing for the FRM Part 1 exam. For a sample of 10 students, the average study time was found to be 312.7 hours, with a standard deviation of 7.2 hours. What is the 95% confidence interval for the mean study time of all candidates?

A. [307.5, 317.9]
B. [310, 317]
C. [300, 317]
D. [307.5, 312.2]

The correct answer is A.

To calculate the 95% confidence interval for the mean study time of all candidates, we use the formula for the confidence interval when the population variance is unknown:

$$\text{Confidence interval} = \bar{X} \pm t_{1-\frac{\alpha}{2}} \times \frac{s}{\sqrt{n}}$$

where:

  • \(\bar{X}\) is the sample mean,
  • \(t_{1-\frac{\alpha}{2}}\) is the t-score corresponding to the desired confidence level and degrees of freedom,
  • \(s\) is the sample standard deviation, and
  • \(n\) is the sample size.

In this case, \(\bar{X} = 312.7\) hours, \(s = 7.2\) hours, and \(n = 10\) students. To find the t-score, we look at the t-table for the 95% confidence level (\(\alpha = 0.05\)) and \(n - 1 = 9\) degrees of freedom. The t-score is 2.262.

The margin of error is:

$$2.262 \times \frac{7.2}{\sqrt{10}} \approx 5.2$$

So the confidence interval is:

$$312.7 \pm 5.2 = [307.5, 317.9]$$

Therefore, the 95% confidence interval for the mean study time of all candidates is [307.5, 317.9] hours.
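The answer can be checked with a short script:

```python
# 95% t-interval for the practice question.
import math
from scipy import stats

n, xbar, s = 10, 312.7, 7.2

t = stats.t.ppf(0.975, df=n - 1)   # 2.262
me = t * s / math.sqrt(n)          # ~5.2
print(xbar - me, xbar + me)        # ~[307.5, 317.9]
```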


Confidence Intervals Explained: Examples, Formula & Interpretation

Saul Mcleod, PhD


The confidence interval (CI) is a range of values that is likely to include a population value with a certain degree of confidence. It is often expressed as a percentage: the population mean is stated to lie between an upper and a lower bound at that confidence level.

95% Confidence Interval Explained

What is a 95% confidence interval?

The 95% confidence interval is a range of values that you can be 95% confident contains the true mean of the population. Due to natural sampling variability, the sample mean (center of the CI) will vary from sample to sample.

The confidence is in the method, not in a particular CI. If we repeated the sampling method many times, approximately 95% of the intervals constructed would capture the true population mean.

Therefore, as the sample size increases, the range of interval values will narrow, meaning that you know that mean with much more accuracy than a smaller sample.

We can visualize this using a normal distribution (see the below graph).

Why is Z 1.96 at 95% confidence?

For example, the probability of the population mean value being between -1.96 and +1.96 standard deviations (z-scores) from the sample mean is 95%.

Accordingly, there is a 5% chance that the population mean lies outside of the upper and lower confidence interval (as illustrated by the 2.5% of outliers on either side of the 1.96 z-scores).

Why use confidence intervals?

It is more or less impossible to study every single person in a population, so researchers select a sample or sub-group of the population.

This means that the researcher can only estimate a population’s parameters (i.e., characteristics), the estimated range being calculated from a given set of sample data.

Therefore, a confidence interval is simply a way to measure how well your sample represents the population you are studying.

The probability that the confidence interval includes the true mean value within a population is called the confidence level of the CI.

You can calculate a CI for any confidence level you like, but the most commonly used value is 95%. A 95% confidence interval is a range of values (upper and lower) that you can be 95% certain contains the true mean of the population.

How to calculate

To calculate the confidence interval, start by computing the mean and standard error of the sample.

Remember, you must calculate an upper and a lower score for the confidence interval using the z-score for the chosen confidence level (1.96 for 95%).

Confidence Interval Formula

$$CI = X \pm Z \times \frac{s}{\sqrt{n}}$$

  • X is the sample mean
  • Z is the chosen Z-value (1.96 for 95%)
  • s is the sample standard deviation
  • n is the sample size

For the lower interval score, divide the standard deviation by the square root of n, multiply the result by the z-score (1.96 for 95%), and subtract this value from the sample mean. For the upper score, add it instead.

An Example:

  • X (mean) = 86
  • Z = 1.960 (for 95%)
  • s (standard deviation) = 6.2
  • n (sample size) = 46

Lower value: 86 − 1.960 × (6.2/√46) = 86 − 1.79 = 84.21

Upper value: 86 + 1.960 × (6.2/√46) = 86 + 1.79 = 87.79

So the population mean is likely to be between 84.21 and 87.79.
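A minimal sketch of the same calculation as a reusable function (the function name and default z-value are ours, for illustration):

```python
# z-based confidence interval for a mean.
import math

def confidence_interval(mean, sd, n, z=1.960):
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

print(confidence_interval(86, 6.2, 46))   # ~(84.21, 87.79)
```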

Population mean and sample mean

How can we be sure that the population mean is similar to the sample mean?

The narrower the interval (upper and lower values), the more precise our estimate is.

As a general rule, as the sample size increases, the confidence interval should become more narrow.

Therefore, with large samples, you can estimate the population mean more precisely than with smaller samples. Hence, the confidence interval is quite narrow when computed from a large sample.

How to report

The APA 6 style manual states (p.117):

“ When reporting confidence intervals, use the format 95% CI [LL, UL] where LL is the lower limit of the confidence interval and UL is the upper limit.”

For example, one might report a 95% CI [5.62, 8.31].

Confidence intervals can also be reported in a table.

Further Information

  • Hypothesis testing and p-values (Khan Academy)
  • Publication manual of the American Psychological Association
  • Statistics for Psychology Book Download

What Does a Confidence Interval Reveal?

A confidence interval gives a range where we think a certain number (like an average) lies for the whole population, based on our sample data. The “confidence level” (like 95%) is how sure we are that this range includes the true value.

So, if we have a 95% confidence interval for the average height of all 16-year-olds as 5’4″ to 5’8″, we’re saying we’re 95% confident that the true average height for all 16-year-olds is somewhere between 5’4″ and 5’8″.

It doesn’t mean all heights are equally likely, just that the true average probably falls in this range. It’s a way to show our uncertainty in estimates.

Is The confidence interval the same as standard deviation?

No, they’re different. The standard deviation shows how much individual measurements in a group vary from the average. Think of it like how much students’ grades differ from the class average.

A confidence interval, on the other hand, is a range that we’re pretty sure (like 95% sure) contains the true average grade for all classes, based on our class. It’s about our certainty in estimating a true average, not about individual differences.

Does a boxplot show confidence intervals?

A standard box plot displays medians and interquartile ranges, not confidence intervals. However, some enhanced box plots can include confidence intervals around the median or mean, represented by notches or error bars.

While not a traditional feature, adding confidence intervals can give more insight into the data’s reliability of central tendency estimates.

Confidence Interval Practice Problems

Questions:

  • A researcher took a sample of 30 students' test scores with an average score of 85 and a standard deviation of 5. What is the 95% confidence interval for the test scores?
  • A study measures the heights of 50 people, finding an average height of 170 cm with a standard deviation of 10 cm. What is the 99% confidence interval for the population's height?
  • In a sample of 40 light bulbs, the mean lifetime is 5000 hours and the standard deviation is 400 hours. Compute a 90% confidence interval for the average lifetime of the bulbs.

Answers:

  • For a 95% confidence interval and a sample size > 30, we typically use a z-score of 1.96. The formula for a confidence interval is (mean − z × (sd/√n), mean + z × (sd/√n)). So the confidence interval is (85 − 1.96 × (5/√30), 85 + 1.96 × (5/√30)) = (83.21, 86.79).
  • For a 99% confidence interval, we typically use a z-score of 2.58. So the confidence interval is (170 − 2.58 × (10/√50), 170 + 2.58 × (10/√50)) = (166.35, 173.65).
  • For a 90% confidence interval, we typically use a z-score of 1.645. So the confidence interval is (5000 − 1.645 × (400/√40), 5000 + 1.645 × (400/√40)) = (4895.96, 5104.04).
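A short script to verify all three intervals:

```python
# Check the three z-based intervals above.
import math

def ci(mean, sd, n, z):
    m = z * sd / math.sqrt(n)
    return round(mean - m, 2), round(mean + m, 2)

print(ci(85, 5, 30, 1.96))       # (83.21, 86.79)
print(ci(170, 10, 50, 2.58))     # (166.35, 173.65)
print(ci(5000, 400, 40, 1.645))  # (4895.96, 5104.04)
```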



Confidence Intervals and p-Values

The importance of precision, with "non-significant" results and with "significant" results.


Confidence intervals are calculated from the same equations that generate p-values, so, not surprisingly, there is a relationship between the two, and confidence intervals for measures of association are often used to address the question of "statistical significance" even if a p-value is not calculated. We already noted that one way of stating the null hypothesis is to state that a risk ratio or an odds ratio is 1.0. We also noted that the point estimate is the most likely value, based on the observed data, and the 95% confidence interval quantifies the random error associated with our estimate, and it can also be interpreted as the range within which the true value is likely to lie with 95% confidence. This means that values outside the 95% confidence interval are unlikely to be the true value. Therefore, if the null value (RR=1.0 or OR=1.0) is not contained within the 95% confidence interval, then the probability that the null is the true value is less than 5%. Conversely, if the null is contained within the 95% confidence interval, then the null is one of the values that is consistent with the observed data, so the null hypothesis cannot be rejected.  

NOTE: Such a usage is unfortunate in my view because it is essentially using a confidence interval to make an accept/reject decision rather than focusing on it as a measure of precision, and it focuses all attention on one side of a two-sided measure (for example, if the upper and lower limits of a confidence interval are .90 and 2.50, there is just as great a chance that the true result is 2.50 as .90).

An easy way to remember the relationship between a 95% confidence interval and a p-value of 0.05 is to think of the confidence interval as arms that "embrace" values that are consistent with the data. If the null value is "embraced," then it is certainly not rejected, i.e., the p-value must be greater than 0.05 (not statistically significant). However, if the 95% CI excludes the null value, then the null hypothesis has been rejected, and the p-value must be < 0.05.

If the null value (meaning no difference) is within the 95% confidence interval (i.e., embraced by it), then the results are not statistically significant.

Video Summary: Confidence Intervals for Risk Ratio, Odds Ratio, and Rate Ratio (8:35)

Link to a transcript of the video

The difference between the perspective provided by the confidence interval and significance testing is particularly clear when considering non-significant results. The image below shows two confidence intervals; neither of them is "statistically significant" using the criterion of P < 0.05, because both of them embrace the null (risk ratio = 1.0). However, one should view these two estimates differently. The estimate with the wide confidence interval was likely obtained with a small sample size and a lot of potential for random error. However, even though it is not statistically significant, the point estimate (i.e., the estimated risk ratio or odds ratio) was somewhere around four, raising the possibility of an important effect. In this case one might want to explore this further by repeating the study with a larger sample size. Repeating the study with a larger sample would certainly not guarantee a statistically significant result, but it would provide a more precise estimate. The other estimate that is depicted is also non-significant, but it is a much narrower, i.e., more precise estimate, and we are confident that the true value is likely to be close to the null value. Even if there were a difference between the groups, it is likely to be a very small difference that may have little if any clinical significance. So, in this case, one would not be inclined to repeat the study.

A narrow confidence interval and a wide confidence interval both include the null value. The greater precision of the larger sample that produced the narrow confidence interval indicates that it is unlikely that there is a clinically important effect, but the study with the wide confidence interval should perhaps be repeated with a larger sample in order to avoid missing a clinically important effect.

For example, even if a huge study were undertaken that indicated a risk ratio of 1.03 with a 95% confidence interval of 1.02 to 1.04, this would indicate an increase in risk of only 2 to 4%. Even if this were true, it would not be important, and it might very well still be the result of biases or residual confounding. Consequently, the narrow confidence interval provides strong evidence that there is little or no association.

The next figure illustrates two study results that are both statistically significant at P < 0.05, because both confidence intervals lie entirely above the null value (RR or OR = 1). The upper result has a point estimate of about two, with a narrow confidence interval extending up to about 3.0; the lower result shows a point estimate of about 6 with a wide confidence interval extending up to about 12. The narrower, more precise estimate enables us to be confident that there is about a two-fold increase in risk among those who have the exposure of interest. In contrast, the study with the wide confidence interval is "statistically significant," but it leaves us uncertain about the magnitude of the effect. Is the increase in risk relatively modest or is it huge? We just don't know.

A narrow and a wide confidence interval. Both lie completely above the null value, which is risk ratio=1

So, regardless of whether a study's results meet the criterion for statistical significance, a more important consideration is the precision of the estimate.

   


Using a confidence interval to decide whether to reject the null hypothesis

Suppose that you do a hypothesis test. Remember that the decision to reject the null hypothesis (H0) or fail to reject it can be based on the p-value and your chosen significance level (also called α). If the p-value is less than or equal to α, you reject H0; if it is greater than α, you fail to reject H0.

  • If the reference value specified in H0 lies outside the interval (that is, it is less than the lower bound or greater than the upper bound), you can reject H0.
  • If the reference value specified in H0 lies within the interval (that is, it is neither less than the lower bound nor greater than the upper bound), you fail to reject H0.
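A minimal sketch of this decision rule as a function; the numbers in the example call are borrowed from the energy-cost example earlier on this page, not from Minitab's documentation:

```python
# Reject H0 when the reference value lies outside the interval.
def reject_h0(reference_value, ci_lower, ci_upper):
    return not (ci_lower <= reference_value <= ci_upper)

print(reject_h0(260, 267, 394))   # True: reject H0
```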

Rejecting the Null Hypothesis Using Confidence Intervals

After a discussion of the two primary methods of statistical inference, viz. hypothesis tests and confidence intervals, it is shown that these two methods are actually equivalent.

In an introductory statistics class, there are three main topics that are taught: descriptive statistics and data visualizations, probability and sampling distributions, and statistical inference. Within statistical inference, there are two key methods that are taught, viz. confidence intervals and hypothesis testing. While these two methods are always taught when learning data science and related fields, it is rare that the relationship between them is properly elucidated.

In this article, we’ll begin by defining and describing each method of statistical inference in turn and along the way, state what statistical inference is, and perhaps more importantly, what it isn’t. Then we’ll describe the relationship between the two. While it is typically the case that confidence intervals are taught before hypothesis testing when learning statistics, we’ll begin with the latter since it will allow us to define statistical significance.

Hypothesis Tests

The purpose of a hypothesis test is to answer whether random chance might be responsible for an observed effect. Hypothesis tests use sample statistics to test a hypothesis about population parameters. The null hypothesis, H0, is a statement that represents the assumed status quo regarding a variable or variables, and it is always about a population characteristic. Some of the ways the null hypothesis is typically glossed are: the population variable is equal to a particular value or there is no difference between the population variables. For example:

  • H0: μ = 69 in (The mean height of the population of American men is 69 inches.)
  • H0: p1 − p2 = 0 (The difference between the population proportion of women who prefer football over baseball and the population proportion of men who prefer football over baseball is 0.)

Note that the null hypothesis always has the equal sign.

The alternative hypothesis, denoted either H1 or Ha, is the statement that is opposed to the null hypothesis (e.g., the population variable is not equal to a particular value, or there is a difference between the population variables):

  • H1: μ > 69 in (The mean height of the population of American men is greater than 69 inches.)
  • H1: p1 − p2 ≠ 0 (The difference between the population proportion of women who prefer football over baseball and the population proportion of men who prefer football over baseball is not 0.)

The alternative hypothesis is typically the claim that the researcher hopes to show and it always contains the strict inequality symbols (‘<’ left-sided or left-tailed, ‘≠’ two-sided or two-tailed, and ‘>’ right-sided or right-tailed).

When carrying out a test of H0 vs. H1, the null hypothesis H0 will be rejected in favor of the alternative hypothesis only if the sample provides convincing evidence that H0 is false. As such, a statistical hypothesis test is only capable of demonstrating strong support for the alternative hypothesis by rejecting the null hypothesis.

When the null hypothesis is not rejected, it does not mean that there is strong support for the null hypothesis (since it was assumed to be true); rather, only that there is not convincing evidence against the null hypothesis. As such, we never use the phrase “accept the null hypothesis.”

In the classical method of performing hypothesis testing, one would have to find what is called the test statistic and use a table to find the corresponding probability. Happily, due to the advancement of technology, one can use Python (as is done in Flatiron's Data Science Bootcamp) and get the required value directly using a Python library like statsmodels. This is the p-value, which is short for probability value.

The p-value is a measure of inconsistency between the hypothesized value for a population characteristic and the observed sample. It is the probability, computed under the assumption that the null hypothesis is true, of obtaining a test statistic value at least as inconsistent with the null hypothesis as the one observed. If the p-value is less than or equal to the probability of a Type I error, then we can reject the null hypothesis, and we have sufficient evidence to support the alternative hypothesis.

Typically the probability of a Type I error, ɑ, more commonly known as the level of significance, is set to 0.05, though it is often prudent to set it to smaller values such as 0.01 or 0.001. Thus, if p-value ≤ ɑ, we reject the null hypothesis and interpret this as saying there is a statistically significant difference between the sample and the population. So if p-value = 0.03 ≤ 0.05 = ɑ, we would reject the null hypothesis and have statistical significance, whereas if p-value = 0.08 > 0.05 = ɑ, we would fail to reject the null hypothesis and there would not be statistical significance.
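As a concrete illustration, here is a minimal sketch of the p-value decision rule using SciPy's one-sample t-test on simulated data (the article mentions statsmodels, which offers an equivalent interface; the data, seed, and effect size below are assumptions for illustration):

```python
# p-value decision rule on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.5, scale=1.0, size=30)   # simulated sample

alpha = 0.05
result = stats.ttest_1samp(sample, popmean=0.0)    # H0: mu = 0
print(result.pvalue)
print("reject H0" if result.pvalue <= alpha else "fail to reject H0")
```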

Confidence Intervals

The other primary form of statistical inference are confidence intervals. While hypothesis tests are concerned with testing a claim, the purpose of a confidence interval is to estimate an unknown population characteristic. A confidence interval is an interval of plausible values for a population characteristic. They are constructed so that we have a chosen level of confidence that the actual value of the population characteristic will be between the upper and lower endpoints of the open interval.

The structure of an individual confidence interval is the sample estimate of the variable of interest ± the margin of error. The margin of error is the product of a multiplier value and the standard error, s.e., which is based on the standard deviation and the sample size. The multiplier is where the probability, or level of confidence, is introduced into the formula.

The confidence level is the success rate of the method used to construct a confidence interval. A confidence interval estimating the proportion of American men who state they are avid fans of the NFL could be (0.40, 0.60) with a 95% level of confidence. The level of confidence is not the probability that the population characteristic is in the confidence interval, but rather refers to the method used to construct the confidence interval.

For example, a 95% confidence interval would be interpreted as follows: if one constructed 100 confidence intervals, then approximately 95 of them would contain the true population characteristic.

Errors and Power

A Type I error, or a false positive, is the error of finding a difference that is not there, so the probability of incorrectly rejecting a true null hypothesis is ɑ, where ɑ is the level of significance. It follows that the probability of correctly failing to reject a true null hypothesis is the complement, viz. 1 − ɑ. For a particular hypothesis test, if ɑ = 0.05, then its complement would be 0.95 or 95%.

While we are not going to expand on these ideas, we note the following two related probabilities. A Type II error, or false negative, is the failure to reject a false null hypothesis; its probability is denoted β. The power of a test is the probability of correctly rejecting a false null hypothesis, where power = 1 − β. In common statistical practice, one typically speaks only of the level of significance and the power.

The following table summarizes these ideas, where the column headers refer to what is actually the case but is unknown. (If the truth or falsity of the null hypothesis were truly known, we wouldn't have to do statistics.)

                      H0 is True                 H0 is False
Reject H0             Type I error (ɑ)           Correct decision (power = 1 − β)
Fail to reject H0     Correct decision (1 − ɑ)   Type II error (β)

Hypothesis Tests and Confidence Intervals

Since hypothesis tests and confidence intervals are both methods of statistical inference, it is reasonable to wonder whether they are equivalent in some way. The answer is yes, which means that we can perform hypothesis testing using confidence intervals.

Returning to the example where we have an estimate of the proportion of American men who are avid fans of the NFL, we had (0.40, 0.60) at a 95% confidence level. As a hypothesis test, we could take the null hypothesis to be H0: p = 0.51 and the alternative to be H1: p ≠ 0.51. Since the null value of 0.51 lies within the confidence interval, we would fail to reject the null hypothesis at ɑ = 0.05.

On the other hand, if the null value were 0.61 (H0: p = 0.61, H1: p ≠ 0.61), then since 0.61 is not in the confidence interval, we can reject the null hypothesis at ɑ = 0.05. Note that the confidence level of 95% and the level of significance ɑ = 0.05 = 5% are complements, corresponding to the "H0 is True" column in the above table.

In general, given a null value and a confidence interval whose confidence level and level of significance are complements, one can reject the null hypothesis for a two-sided test if the null value is not in the confidence interval. One can still perform one-sided tests with a confidence level and null value, but there is an added layer of complexity to the equivalence, and it is best practice to perform two-sided hypothesis tests anyway, since one is then not prejudging the direction of the alternative.
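A minimal sketch of this CI-based test in Python, using the NFL-fan interval from above (the helper function reject_null is ours, written for illustration):

```python
# A minimal sketch of performing a two-sided hypothesis test with a
# confidence interval: reject H0 iff the null value lies outside the CI.
def reject_null(null_value: float, ci: tuple) -> bool:
    lower, upper = ci
    return not (lower < null_value < upper)

ci_95 = (0.40, 0.60)                 # the NFL-fan example interval
print(reject_null(0.51, ci_95))      # False -> fail to reject at alpha = 0.05
print(reject_null(0.61, ci_95))      # True  -> reject at alpha = 0.05
```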

In this discussion of hypothesis testing and confidence intervals, we not only understand when these two methods of statistical inference are equivalent, but also gain a deeper understanding of statistical significance itself and, therefore, of statistical inference.

Learn More About Data Science at Flatiron

The curriculum in our Data Science Bootcamp incorporates the latest technologies, including artificial intelligence (AI) tools. Download the syllabus to see what you can learn, or book a 10-minute call with Admissions to learn about full-time and part-time attendance opportunities.

Disclaimer: The information in this blog is current as of February 28, 2024. Current policies, offerings, procedures, and programs may differ.


About Brendan Patrick Purdy

Brendan is the senior curriculum developer for data science at the Flatiron School. He holds degrees in mathematics, data science, and philosophy, and enjoys modeling neural networks with the Python library TensorFlow.



Replicability-Index

Null-Hypothesis Testing with Confidence Intervals

Statistics is a mess. Statistics education is a mess. Not surprisingly, the understanding of statistics by applied research workers is a mess. This was less of a problem when there was only one way to conduct statistical analyses. Nobody knew what they were doing, but at least everybody was doing the same thing. Now we have a multiverse of statistical approaches, and applied research workers are happy to mix and match statistics to fit their needs. This is making the reporting of results worse and leads to logical contradictions.

For example, the authors of an article in a recent issue of Psychological Science that shall remain anonymous claimed (a) that a Bayesian Hypothesis Test provided evidence for the nil-hypothesis (an effect size of zero) and (b) that their preregistered replication study had high statistical power. This makes no sense, because power is defined as the probability of correctly rejecting the null-hypothesis, which assumes an a priori effect size greater than zero. Power is simply not defined when the hypothesis is that the population effect size is zero.
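To see this numerically, here is a small sketch assuming statsmodels' power routines for a two-sample t-test with 100 subjects per group: as the assumed population effect size shrinks to zero, the "power" calculation collapses to alpha, the Type I error rate, which is why a "high power" claim makes no sense for a zero effect.

```python
# A sketch of how the power calculation degenerates as the assumed
# population effect size d approaches zero: the rejection probability
# falls to alpha (here 0.05), the Type I error rate.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in [0.5, 0.2, 0.05, 0.0]:
    p = analysis.power(effect_size=d, nobs1=100, alpha=0.05)
    print(f"d = {d:4.2f}: rejection probability = {p:.3f}")
```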

Errors like this are avoidable if we realize that Neyman introduced confidence intervals to make hypothesis testing easier and more informative. Here is a brief introduction to thinking clearly about hypothesis testing that should help applied research workers understand what they are doing.

Effect Size and Sampling Error

The most important information that applied research workers should report are (a) an estimate of the effect size and (b) an estimate of sampling error. Every statistics course should start with introducing these two concepts because all other statistics like p-values or Bayes-Factors or confidence intervals are based on effect size and sampling error. They are also the most useful information for meta-analysis.

Information about effect sizes and sampling error can be in the form of unstandardized values (e.g., 5 cm difference in height with SE = 2 cm) or in standardized form (d = .5, SE = .2). This is not relevant for hypothesis testing and I will use standardized effect sizes for my example.

Specifying the Null-Hypothesis

The null-hypothesis is the hypothesis that a researcher believes to be untrue. It is the hypothesis that they want to reject or NULLify. The biggest mistake in statistics is the assumption that this hypothesis is always that there is no effect (effect size of zero). Cohen (1994) called this hypothesis the nil-hypothesis to distinguish it from other null-hypotheses.

For example, in a directional test of the claim that studying harder leads to higher grades, the null-hypothesis specifies all non-positive values (zero and all negative values). When this null-hypothesis is rejected, it automatically implies that the alternative hypothesis is true (given a specific error criterion and a bunch of assumptions, not in a mathematically proven sense). Normally, we go through various steps to reject the null-hypothesis and then affirm the alternative. With confidence intervals, however, we can directly affirm the alternative.

Calculating Confidence Intervals

A confidence interval requires three pieces of information.

1. An ESTIMATE of the effect size. This estimate is provided by the mean difference in a sample. For example, the height difference of 5 cm or d = .5 are estimates of the population mean difference in height.

2. An ESTIMATE of sampling error. In simple designs, sampling error is a function of sample size, but even then we are making assumptions that can be violated and that are difficult or impossible to test in small samples. In more complex designs, sampling error depends on other statistics that are sample dependent. Thus, sampling error is also just an estimate. The main job of statisticians is to find plausible estimates of sampling error for applied research workers. Applied researchers simply use the information that is provided by statistics programs. In our example, sampling error was estimated to be SE = .2.

3. The third piece of information is how confident we want to be in our inferences. All data-based inferences are inductions that can be wrong, but we can specify the probability of being wrong. This quantity is known as the type-I error rate, denoted by the Greek letter alpha. A common value is alpha = .05, which implies a long-run error rate of no more than 5%: if we obtain 100 confidence intervals, the long-run error rate is limited to no more than 5% false inferences in favor of the alternative hypothesis. With alpha = .05, sampling error has to be multiplied by approximately 2 (more precisely, 1.96) to compute a confidence interval.

To use our example, with d = .5 and SE = .2, we can create a confidence interval that ranges from d = .5 − 2 × .2 = .1 to d = .5 + 2 × .2 = .9. We can now state, WITHOUT ANY OTHER INFORMATION that may be relevant (e.g., we already know the alternative is true based on a much larger trustworthy prior study and our study is only a classroom demonstration), that the data support our hypothesis that there is a positive effect, because the confidence interval fits into the predicted interval; that is, the values from .1 to .9 fit into the set of values from 0 to infinity.
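A minimal Python sketch of this arithmetic (using the rounded multiplier of 2 from the text):

```python
# A sketch of the confidence-interval arithmetic from the example above:
# effect size estimate d = .5, sampling error SE = .2, multiplier ~ 2
# (more precisely 1.96 for alpha = .05).
d_hat, se, multiplier = 0.5, 0.2, 2.0

ci = (d_hat - multiplier * se, d_hat + multiplier * se)
print(ci)  # (0.1, 0.9)

# The hypothesis "the effect is positive" is supported because the whole
# interval fits inside the predicted region (0, infinity):
print(ci[0] > 0)  # True
```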

A more common way to express this finding is to state that the confidence interval does not include the largest value of the null-hypothesis, which is zero. However, this leads to the impression that we tested the nil-hypothesis, and rejected it. But that is not true. We also rejected all the values less than 0. Thus, we did not test or reject the nil-hypothesis. We tested and rejected the null-hypothesis of effect sizes ranging from -infinity to 0. But it is also not necessary to state that we rejected this null-hypothesis because this statement is redundant with the statement we actually want to make. We found evidence for our hypothesis that the effect size is positive (i.e., in the range from 0 to infinity excluding 0).

I hope this example makes it clear how hypothesis testing with confidence intervals works. We first specify a range of values that we think are plausible (e.g., all positive values). We then compute a confidence interval of values that are consistent with our data. We then examine whether the confidence interval falls into the hypothesized range of values. When this is the case, we infer that the data support our hypothesis.

Different Outcomes

When we divide the range of possible effect sizes into two mutually exclusive regions, we can distinguish three possible outcomes.

One possible outcome is that the confidence interval falls into a predicted region. In this case, the data provide support for the prediction.

A second possible outcome is that the confidence interval overlaps with the predicted range of values but also falls partly outside it. For example, the data could have produced an effect size estimate of d = .1 and a confidence interval ranging from −.3 to .5. In this case, the data are inconclusive: it is possible that the population effect size is, as predicted, positive, but it could also be negative.

A third possible outcome is that the confidence interval falls entirely outside the predicted range of values (e.g., d = −.5, confidence interval −.9 to −.1). In this case, the data disconfirm the prediction of a positive effect. It follows that it is not even necessary to make a prediction one way or the other: we can simply see which region the confidence interval fits into and infer that the population effect size is in the region that contains the confidence interval.
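These three outcomes are easy to mechanize. Here is a small Python sketch (the compare helper is ours, written for illustration):

```python
# A sketch of the three possible outcomes when comparing a confidence
# interval to a predicted region of effect sizes (region boundaries are a
# hypothetical lower/upper pair, with math.inf allowed).
import math

def compare(ci, region):
    ci_lo, ci_hi = ci
    r_lo, r_hi = region
    if r_lo <= ci_lo and ci_hi <= r_hi:
        return "supports the prediction"     # CI fully inside the region
    if ci_hi < r_lo or ci_lo > r_hi:
        return "disconfirms the prediction"  # CI fully outside the region
    return "inconclusive"                    # CI straddles the boundary

positive = (0.0, math.inf)                   # predicted region: d > 0
print(compare((0.1, 0.9), positive))         # supports the prediction
print(compare((-0.3, 0.5), positive))        # inconclusive
print(compare((-0.9, -0.1), positive))       # disconfirms the prediction
```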

Do We Need A Priori Hypotheses?

Let’s assume that we predicted a positive effect and our hypothesis covers all effect sizes greater than zero and the confidence interval includes values from d = .1 to .9. We said that this finding allows us to accept our hypothesis that the effect size is positive; that is, it is within an interval ranging from 0 to infinity without zero. However, the confidence interval provides a much smaller range of values. A confidence interval ranging from .1 to .9 not only excludes negative values or a value of zero, it also excludes values of 1 or 2. Thus, we are not using all of the information that our data are providing when we simply infer from the data that the effect size is positive, which includes trivial values of 0.0000001 and implausible values of 999.9. The advantage of reporting results with confidence intervals is that we can specify a narrow range of values that are consistent with the data. This is particularly helpful when the confidence interval is close to zero. For example, a confidence interval that ranges from d = 0.001 to d = .801 can be used to claim that the effect size is positive, but it cannot be used to claim that the effect size is theoretically meaningful, unless d = .001 is theoretically meaningful.

Specifying A Minimum Effect Size

To make progress, psychology has to start taking effect sizes more seriously, and this is best achieved by reporting confidence intervals. Confidence intervals ranging from d = .8 to d = 1.2 and from d = .01 to d = .41 are both consistent with the prediction that there is a positive effect, p < .05. However, the two confidence intervals also specify very different ranges of possible effect sizes. Whereas the first confidence interval rejects the hypothesis that effect sizes are small or moderate, the second confidence interval rejects large effect sizes. Traditional hypothesis testing with p-values hides this distinction and makes it look as if these two studies produced identical results. However, the lowest value of the first interval (d = .8) is higher than the highest value of the second interval (d = .41), which actually implies that the results are significantly different from each other. Thus, these two studies produced conflicting results when we consider effect sizes, while giving the same answer about the direction of the effect.

If predictions were made in terms of a minimum effect size that is theoretically or practically relevant, the distinction between the two results would also be visible. For example, a standard criterion for a minimum effect size could be a small effect size of d = .2. Using this criterion, the first study confirms the prediction (i.e., the confidence interval from .8 to 1.2 falls into the region from .2 to infinity), but the second study does not: the interval from d = .01 to .41 lies partially outside the region from .2 to infinity. In this case, the data are inconclusive.

If the population effect size is zero (e.g., the effect of random future events on behavior), confidence intervals will cluster around zero. Because the region below a minimum effect size (e.g., d = −.2 to d = .2) is narrow, it is hard for a confidence interval to fit entirely within it unless sampling error is small. This is why it is empirically difficult to provide evidence for the absence of an effect; reducing the minimum effect size makes it even harder and eventually impossible. However, logically there is nothing special about providing evidence for the absence of an effect. We are again dividing the range of plausible effect sizes into two regions: (a) values below the minimum effect size and (b) values above the minimum effect size. We then decide in favor of the region that fully contains the confidence interval. Of course, we can also do this without an a priori range of effect sizes. For example, if we find a confidence interval ranging from −.15 to +.18, we can infer from this finding that the population effect size is small (less than .2).

But What about Bayesian Statistics?

Bayesian statistics also uses information about effect sizes and sampling error. The main difference is that Bayesians assume that we have prior knowledge that can inform our interpretation of results. For example, if one-hundred studies already tested the same hypothesis, we can use the information of these studies. In this case, it would also be possible to conduct a meta-analysis and to draw inferences on evidence from all 101 studies, rather than just a single study. Bayesians also sometimes incorporate information that is harder to quantify. However, the main logic of hypothesis testing with confidence intervals or Bayesian credibility intervals does not change. Ironically, Bayesians also tend to use alpha = .05 when they report 95% credibility intervals. The only difference is that information that is external to the data (prior distributions) is used, whereas confidence intervals rely exclusively on information from the data.

I hope that this blog post helps researchers to better understand what they are doing. Empirical studies provide estimates of two important statistics: an effect size estimate and a sampling error estimate. This information can be used to create intervals that specify a range of values that are likely to contain the population effect size. Hypothesis testing divides the range of possible values into regions and decides in favor of hypotheses that fully contain the confidence interval. However, hypothesis testing is redundant and less informative, because we can simply decide in favor of the values inside the confidence interval, which is a narrower range than the one specified by a theoretical prediction. The use of confidence intervals makes it possible to identify weak evidence (the confidence interval excludes zero, but not very small values that are not theoretically interesting) and also makes it possible to provide evidence for the absence of an effect (the confidence interval includes only trivial values).

A common criticism of hypothesis testing is that it is difficult to understand and not intuitive. The use of confidence intervals solves this problem. Seeing whether a small object fits into a larger object is an ability acquired at some early developmental stage in Piaget's model, and most applied research workers should be able to carry out these comparisons. Standardized effect sizes also help with evaluating the size of these objects. Thus, confidence intervals provide all of the information that applied research workers need to carry out empirical studies and to draw inferences from them. The main statistical challenge is to obtain estimates of sampling error in complex designs; that is the job of statisticians. The main job of empirical research workers is to collect theoretically or practically important data with small sampling error.


10 thoughts on "Null-Hypothesis Testing with Confidence Intervals"

I’m not a researcher, but I appreciate your site.

Great post. I plan to share this with other research friends. The next step is improving statistical training and literacy in the young generation of researchers like myself.

Thank you for the feedback.

“It is the hypothesis that they want to reject or NULLify”. Perhaps just a poor choice of words, but “want to reject” is at the heart of so many of social psych’s problems.

I see what you are saying, but the only thing you can do with the nil-hypothesis is to reject it. So the nature of the null-hypothesis is the reason everybody wants to reject it.


Hi Ulrich, "Power is simply not defined when the hypothesis is that the population effect size is zero." Maybe I do not understand this sentence correctly, but I think this is not correct. Power is simply defined as the probability of identifying a particular effect different from the effect under study. However, it is not a feature of an underlying distribution, but of a decision-theoretic comparison between two distributions, in the simplest case of two sampling distributions of t (we call it a t-test). Mathematically, the form of the central t distribution (aka Student's t) is not much different from non-central t distributions, except for the non-centrality parameter (which is 0, reflecting the absence of any effect, for the central t distribution). Assume now that we are testing our research hypothesis of, e.g., d ≥ 0.5 as a meaningful non-nil null hypothesis (d_0) (since we are hard-boned falsificationists!). The alternative is H1: d < 0.5. Then the probability of identifying a possibly correct alternative d_1 (which actually is power, right?) will be higher the larger the difference between d_0 and d_1 is, right? Which means that the power to identify an alternative correctly is higher for d_1 = 0 than for larger values of d, e.g., d_1 = 0.3. Which also means that power for the correct identification of a true population effect of d = 0 does not only exist, but is also computable.

Hi Uwe, I follow the common definition of power as a conditional probability: "The power of a binary hypothesis test is the probability that the test rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true." https://en.wikipedia.org/wiki/Power_(statistics)

However, I have no problem using the term also for cases where the null is true and "power" equals alpha.

As long as it is clear what we are talking about, both are important and useful concepts.

But doesn’t my example exactly fit the definition from Wikipedia, provided my null hypothesis (d=0.5, assumed effect) is false and a specific alternative (d=0, no effect) is true?

Maybe I am not understanding your point. If I am testing a null hypothesis whose boundary is d = .5, power would not be defined if the true population parameter is d = .5. No matter how many subjects I have, I will only get significant results at the rate of alpha.



Understanding and interpreting confidence and credible intervals around effect estimates

Luiz Hespanhol

a Masters and Doctoral Programs in Physical Therapy, Universidade Cidade de São Paulo (UNICID), São Paulo, SP, Brazil

b Department of Public and Occupational Health (DPOH), Amsterdam Public Health Research Institute (APH), VU University Medical Center (VUmc), Amsterdam, The Netherlands

c Amsterdam Collaboration on Health and Safety in Sports (ACHSS), Academic Medical Center/VU University Medical Center IOC Research Center, Amsterdam, The Netherlands

Caio Sain Vallio

Lucíola Menezes Costa, Bruno T Saragiotto

  • Confidence intervals (CI) measure the uncertainty around effect estimates.
  • Frequentist 95% CI: we can be 95% confident that the true estimate would lie within the interval.
  • Bayesian 95% CI: there is a 95% probability that the true estimate would lie within the interval.
  • Decision-making should not be made considering only the dichotomized interpretation of CIs.
  • Training and education may enhance knowledge related to understanding and interpreting CIs.

Introduction

Reporting confidence intervals in scientific articles is important and relevant for evidence-based practice. Clinicians should understand confidence intervals in order to determine if they can realistically expect results similar to those presented in research studies when they implement the scientific evidence in clinical practice. The aims of this masterclass are: (1) to discuss confidence intervals around effect estimates; (2) to understand confidence intervals estimation (frequentist and Bayesian approaches); and (3) to interpret such uncertainty measures.

Confidence intervals are measures of uncertainty around effect estimates. Interpretation of the frequentist 95% confidence interval: we can be 95% confident that the true (unknown) estimate would lie within the lower and upper limits of the interval, based on hypothesized repeats of the experiment. Many researchers and health professionals oversimplify the interpretation of the frequentist 95% confidence interval by dichotomizing it in statistically significant or non-statistically significant, hampering a proper discussion on the values, the width (precision) and the practical implications of such interval. Interpretation of the Bayesian 95% confidence interval (which is known as credible interval): there is a 95% probability that the true (unknown) estimate would lie within the interval, given the evidence provided by the observed data.

Conclusions

The use and reporting of confidence intervals should be encouraged in all scientific articles. Clinicians should consider using the interpretation, relevance and applicability of confidence intervals in real-world decision-making. Training and education may enhance knowledge and skills related to estimating, understanding and interpreting uncertainty measures, reducing the barriers for their use under either frequentist or Bayesian approaches.

A paper published within this issue of the Brazilian Journal of Physical Therapy (BJPT) raised a very interesting, important and relevant matter for evidence-based practice: the use of the 95% confidence interval (CI) for reporting the uncertainty around between-group comparisons in randomized controlled trials investigating the effects of physical therapy interventions. 1 Briefly, the study found that: (1) less than one-third of physical therapy trials report CIs; (2) trials with lower risk of bias (i.e., higher quality) are more likely to report CIs; and (3) there has been a consistent increase in reporting CIs over time. 1 The increasing trend in reporting CIs is good news for physical therapy evidence-based practice. Nevertheless, clinicians should understand CIs so they can appropriately interpret the results of trials in order to better implement such evidence in practice. Therefore, this masterclass is aimed at: (1) discussing CIs around effect estimates on continuous (mean and mean difference) and dichotomous (proportion, odds, absolute risk reduction [ARR], relative risk [RR] and odds ratio [OR]) outcomes; (2) understanding CI estimation (frequentist and Bayesian approaches); and (3) interpreting such uncertainty measures. We believe that this initiative might help clinicians to achieve the purpose of better understanding and interpreting uncertainty measures around effect estimates.

What are confidence intervals?

A CI is a measure of the uncertainty around the effect estimate. It is an interval composed of a lower and an upper limit, which indicates that the true (unknown) effect may be somewhere within this interval. The effect presented in the scientific report must always be inside the CI reported, and the width of the interval represents the precision of the effect estimate. Therefore, the narrower the CI, the more precise the effect estimate. The CI width (degree of uncertainty) varies according to two factors: (1) sample size (n); and (2) heterogeneity (standard deviation [SD] or standard error [SE]) contained in the study. The sample size is inversely related to the degree of uncertainty: the larger the sample size, the smaller the CI width, indicating a lower degree of uncertainty. Heterogeneity, in contrast, is directly related to the degree of uncertainty: the lower the heterogeneity, the lower the uncertainty. This means that studies presenting lower SDs or SEs have a lower degree of uncertainty and a narrower CI.
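A small Python sketch of these two drivers of CI width, using a t-based interval for a mean with invented values:

```python
# A sketch of the two drivers of CI width: sample size n and heterogeneity
# (SD). Larger n narrows the interval; larger SD widens it. Values invented.
import math
from scipy import stats

def ci_width(sd, n, level=0.95):
    t = stats.t.ppf((1 + level) / 2, n - 1)   # two-sided critical value
    return 2 * t * sd / math.sqrt(n)          # full width of the interval

for sd, n in [(5, 25), (5, 100), (10, 100)]:
    print(f"SD = {sd:2d}, n = {n:3d}: width = {ci_width(sd, n):.2f}")
```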

The confidence (probability) level (i.e., 95%) of the CI represents the accuracy of the effect estimate. 2 For example, the 99% CI is more accurate than the 95% CI, because it captures a broader spectrum of the data distribution. Thereby, the 99% CI is wider than the 95% CI. However, the trade-off is that the 99% CI is less precise than the 95% CI. The decision of using a certain confidence level should consider a balance between accuracy and precision. In health sciences the 95% confidence level is most often used. Two common approaches to estimate CIs are the frequentist and the Bayesian. In the next sections we will discuss the following topics related to both approaches: how to estimate; how to interpret; advantages; disadvantages; and illustrative examples (with case studies described in Box 1 , Box 2 ).

Case study of a randomized controlled trial (RCT) with a continuous outcome.

Parreira et al. 21 conducted an RCT aimed at investigating the effects of Kinesio Taping applied according to the manuals (n_I = 74) compared to sham applications (n_C = 74) in individuals with chronic nonspecific low back pain. One of the primary outcomes was pain intensity measured with a numeric pain rating scale (NPRS) ranging from 0 (no pain) to 10 (worst possible pain). The table below describes the results for each group at baseline and after four weeks from baseline.

[Table not reproduced here.] SD, standard deviation; CI, frequentist confidence interval; "diff", difference.

Mean difference between groups

The recommended outcome of RCTs investigating continuous variables, as the NPRS, is the between-group difference of the within-group difference. This outcome is usually obtained from the regression coefficient representing the interaction term composed of group and time in linear mixed models. 22 Simplifying, the interaction term can also be estimated using a table like the one above. Therefore, the effect found for pain intensity after four weeks from baseline in this study was −0.4, which means that the intervention group reduced 0.4 more points in the 11-point NPRS compared to the control group.

95% confidence interval (CI)

- Standard error (SE): Eq. (2.1)

 • SE_diff = √( ((n_I − 1)SD_I² + (n_C − 1)SD_C²) / (n_I + n_C − 2) ) × √( (1/n_I) + (1/n_C) )

 • SE_diff = √( ((74 − 1) × 3.1² + (74 − 1) × 2.7²) / (74 + 74 − 2) ) × √( (1/74) + (1/74) ) = 0.478

- t(probability = 0.95; df = 74 + 74 − 2) = 1.976346 ≈ 1.96

- 95% CI = (mean_I − mean_C) ± (t × SE_diff) = (−2.6 − (−2.2)) ± (1.96 × 0.478) = −1.3 to 0.5

The 95% frequentist CI around the effect found for pain intensity after four weeks from baseline in the study of Parreira et al. 21 was −1.3 to 0.5 in the 11-point NPRS. This means that we can be 95% confident that individuals with chronic nonspecific low back pain would present, on average, a mean difference between −1.3 and 0.5 when comparing the intervention with the comparison group, based on hypothesized repeats of the experiment. Since the 95% CI contains the null effect (i.e., zero), which represents the null hypothesis (i.e., no difference between the groups), we cannot be 95% confident that the intervention group would present a reduced pain intensity compared to the comparison group in repeats of the experiment, as suggested by the effect estimate (i.e., −0.4). Therefore, we can conclude that this effect was not statistically significant: the data are compatible with the null hypothesis, and no difference between the groups was detected.
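For readers who want to check the arithmetic, here is a short Python sketch that reproduces the Box 1 numbers (it uses the exact t critical value rather than the rounded 1.96):

```python
# A sketch reproducing the Box 1 calculation (mean difference and its 95% CI)
# with the numbers reported by Parreira et al.: n = 74 per group, within-group
# changes of -2.6 (SD 3.1) and -2.2 (SD 2.7).
import math
from scipy import stats

n_i, n_c = 74, 74
mean_i, mean_c = -2.6, -2.2
sd_i, sd_c = 3.1, 2.7

pooled_var = ((n_i - 1) * sd_i**2 + (n_c - 1) * sd_c**2) / (n_i + n_c - 2)
se_diff = math.sqrt(pooled_var) * math.sqrt(1 / n_i + 1 / n_c)  # ~0.478

t_crit = stats.t.ppf(0.975, n_i + n_c - 2)  # ~1.976
diff = mean_i - mean_c                      # -0.4
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)
print(round(se_diff, 3), [round(x, 1) for x in ci])  # 0.478 [-1.3, 0.5]
```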

Case study of a randomized controlled trial (RCT) with a dichotomous outcome.

Mateus-Vasconcelos et al. 23 conducted an RCT aimed at investigating the effects of vaginal palpation, vaginal palpation associated with posterior pelvic tilt, and intravaginal electrical stimulation in facilitating voluntary contraction of the pelvic floor muscles in women with pelvic floor dysfunctions. This case study considers only the vaginal palpation associated with posterior pelvic tilt as the intervention group (n_I = 33), and verbal instructions to perform pelvic floor muscle exercises at home as the comparison group (n_C = 33). The primary outcome was the number of women who changed in the Modified Oxford Scale (MOS) for pelvic floor muscle strength, ranging from 0 (no contraction) to 5 (strong contraction with lift). The 2 by 2 table below describes the number of participants in each group who changed (improved) their pelvic floor muscle strength from MOS 0 or 1 to ≥2 after eight weeks from baseline (the cell labels A to D are the ones used in the formulas that follow):

                Improved (MOS ≥ 2)   Not improved   Total
Intervention    A = 23               C = 10         33
Comparison      B = 6                D = 27         33

Relative risk (RR) to compare groups

- Risk of intervention group = A/(A + C) = 23/33 = 0.697 or 69.7%

- Risk of comparison group = B/(B + D) = 6/33 = 0.182 or 18.2%

- RR = (A/(A + C))/(B/(B + D)) = 0.697/0.182 = 3.83

- Standard error for RR (SE_ln(RR)): Eq. (5.2)

 • SE_ln(RR) = √( (1/A) − (1/(A + C)) + (1/B) − (1/(B + D)) ) = √( (1/23) − (1/33) + (1/6) − (1/33) ) = 0.387

- 95% CI_RR = e^(ln(RR) ± z × SE_ln(RR)) = e^(1.342865 ± 1.96 × 0.387) = e^0.584345 to e^2.101385 = 1.79 to 8.17

95% confidence interval (CI) for RR

The 95% frequentist CI around the RR found for pelvic floor muscle strength after eight weeks from baseline was 1.79 to 8.17. This means that we can be 95% confident that women with pelvic floor dysfunctions would present, on average, an RR between 1.79 and 8.17 when comparing the intervention with the comparison group, based on hypothesized repeats of the experiment. Since the 95% CI does not contain the null effect (i.e., one), which represents the null hypothesis (i.e., the same risk for both groups), we can conclude that this effect was statistically significant: we can be 95% confident that the intervention would be effective in increasing the chance of women changing the MOS for the better, that is, strengthening the pelvic floor muscles, compared to the comparison group in repeats of the experiment.

Odds ratio (OR) to compare groups

- Odds of intervention group = (A/(A + C))/(C/(A + C)) = A/C = 23/10 = 2.30

- Odds of comparison group = (B/(B + D))/(D/(B + D)) = B/D = 6/27 = 0.22

- OR = (A/C)/(B/D) = 2.30/0.22 = 10.35

- SE_ln(OR): Eq. (6.2)

 • SE_ln(OR) = √( (1/A) + (1/B) + (1/C) + (1/D) ) = √( (1/23) + (1/6) + (1/10) + (1/27) ) = 0.589

- 95% CI_OR = e^(ln(OR) ± z × SE_ln(OR)) = e^(2.336987 ± 1.96 × 0.589) = e^1.182547 to e^3.491427 = 3.26 to 32.84

95% confidence interval (CI) for OR

The 95% frequentist CI around the OR found for pelvic floor muscle strength after eight weeks from baseline was 3.26 to 32.84. This means that we can be 95% confident that women with pelvic floor dysfunctions would present, on average, an OR between 3.26 and 32.84 when comparing the intervention with the comparison group, based on hypothesized repeats of the experiment. Since the 95% CI does not contain the null effect (i.e., one), which represents the null hypothesis (i.e., the same odds for both groups), we can conclude that this effect was statistically significant: we can be 95% confident that the intervention would be effective in increasing the odds of women changing the MOS for the better, that is, strengthening the pelvic floor muscles, compared to the comparison group in repeats of the experiment.
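Similarly, the Box 2 results can be reproduced with a few lines of Python (the 2 by 2 cell counts A to D are taken from the table and formulas above):

```python
# A sketch reproducing the Box 2 calculations: RR and OR with 95% CIs from
# the 2x2 table (A = 23, B = 6, C = 10, D = 27).
import math

A, B, C, D = 23, 6, 10, 27
z = 1.96

rr = (A / (A + C)) / (B / (B + D))                       # ~3.83
se_ln_rr = math.sqrt(1/A - 1/(A + C) + 1/B - 1/(B + D))  # ~0.387
ci_rr = (math.exp(math.log(rr) - z * se_ln_rr),
         math.exp(math.log(rr) + z * se_ln_rr))          # ~(1.79, 8.17)

orr = (A / C) / (B / D)                                  # ~10.35
se_ln_or = math.sqrt(1/A + 1/B + 1/C + 1/D)              # ~0.589
ci_or = (math.exp(math.log(orr) - z * se_ln_or),
         math.exp(math.log(orr) + z * se_ln_or))         # ~(3.26, 32.84)

print(round(rr, 2), [round(x, 2) for x in ci_rr])
print(round(orr, 2), [round(x, 2) for x in ci_or])
```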

Frequentist approach for CIs

The most known and widely used approach for statistical inference is the frequentist approach, also known as the classical (Neyman–Pearson) statistical approach. 3 , 4 The frequentist approach for statistical inference is based on sampling distributions and the Central Limit Theorem (CLT). 3 , 5 This explains the term “long-run frequency” attached to the interpretation of outcomes estimated using this approach (see the section “ Interpreting frequentist 95% CIs ”), and the term “frequentist” to refer to this statistical thinking. The frequentist approach treats the population parameters of interest as fixed values. 2 , 3 , 6

For example, let's assume a population distribution with mean (μ) = 0 and SD (σ) = 5 (Fig. 1A). In reality, we usually do not know the true mean and standard deviation in the population; however, for the sake of the example, we are defining the population distribution in Fig. 1A. Let us say a researcher has collected data from this population, and the sample mean (x̄1) = 0.4 and the sample SD (s1) = 4.8 (Fig. 1B, "Data collected"). The sample mean is considered the best guess of the sampling distribution mean (i.e., the mean of the sample means represented by Fig. 1C). In turn, the sampling distribution (Fig. 1C) is considered a long-run frequency of samples, including the one for which the researcher has collected data (sample 1 in Fig. 1B, "Data collected"), but also a set of hypothetical samples (samples 2–100 represented in Fig. 1B, "Hypothetical samples") that do not exist (i.e., the researcher has not collected data for these hypothetical samples). This has some implications that are discussed in the section "Disadvantages of using frequentist CIs".

Figure 1. Graphical representation of: (A) a population distribution; (B) samples 1 to 100 from the population distribution (n = 100 for each sample); and (C) the sampling distribution. "N", population size. "n", sample size. "μ", population mean. "σ", population standard deviation. "x̄", sample mean. "s", sample standard deviation. "SE", standard error. "CI", confidence interval.

There are several methods for estimating frequentist 95% CIs. In this masterclass we will describe the methods implemented in the Physiotherapy Evidence Database (PEDro) CI calculator, which can be downloaded in English at https://www.pedro.org.au/english/downloads/confidence-interval-calculator/ . 7 The reader can follow the estimations described in the case studies in Box 1 , Box 2 using the PEDro CI calculator.

Estimating frequentist CIs

Eq. (1) describes the CI formula for a mean (x̄): CI = x̄ ± (t × SE). The critical value "t" is based on the t distribution with a particular probability level and degrees of freedom. For a 95% CI, the probability level must be set to 0.95 (or 95%) and the degrees of freedom are determined by subtracting 1 from the sample size (n − 1). The SE of the sample mean can be estimated by Eq. (1.1): SE = s/√n, where s is the sample SD.
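A minimal Python sketch of Eqs. (1) and (1.1), with invented data:

```python
# A sketch of Eq. (1): CI for a mean, x_bar +/- t * SE, with SE = s / sqrt(n)
# from Eq. (1.1). Data values are hypothetical.
import numpy as np
from scipy import stats

y = np.array([2.1, 3.4, 2.8, 3.9, 3.1, 2.5, 3.6, 2.9])  # hypothetical sample
n = len(y)
se = y.std(ddof=1) / np.sqrt(n)                          # Eq. (1.1)

ci = stats.t.interval(0.95, n - 1, loc=y.mean(), scale=se)  # Eq. (1)
print(ci)
```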

Mean difference

Eq. (2) describes the CI calculation for a mean difference (x̄1 − x̄2): CI = (x̄1 − x̄2) ± (t × SE_diff). The critical value "t" is based on the t distribution with a particular probability level and degrees of freedom. For a 95% CI, the probability level must be set to 0.95 (or 95%) and the degrees of freedom are determined by subtracting 2 from the overall sample size (n1 + n2 − 2). "SE_diff" refers to the SE of the difference between the two sample means assuming equal variances (Eq. (2.1)). Box 1 describes a case study using a mean difference and its 95% CI.

Proportion and odds

Eq. (3) describes the Wilson score method 8, 9 to estimate the CI for a proportion (p). The critical value "z" is based on the normal (Gaussian) distribution with a particular probability level. For a 95% CI, the critical value "z" is approximately 1.96. The odds and its 95% CI can be obtained by converting the proportions to odds using Eq. (3.1): odds = p/(1 − p).
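A short sketch of this calculation, assuming the statsmodels implementation of the Wilson method (the counts are borrowed from Box 2 for illustration):

```python
# A sketch of the Wilson score interval for a proportion (Eq. (3)) and the
# proportion-to-odds conversion (Eq. (3.1)), via statsmodels' Wilson method.
from statsmodels.stats.proportion import proportion_confint

count, nobs = 23, 33                       # e.g., 23 events out of 33
lo, hi = proportion_confint(count, nobs, alpha=0.05, method="wilson")
print(round(lo, 3), round(hi, 3))          # 95% CI for the proportion

# Eq. (3.1): odds = p / (1 - p); apply to both limits for the odds CI
print(round(lo / (1 - lo), 2), round(hi / (1 - hi), 2))
```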

Absolute risk reduction (ARR)

Eqs. (4.1), (4.2) describe the Newcombe–Wilson method 9, 10 to estimate the lower (LCI_ARR) and upper (UCI_ARR) limits of the CI for the ARR, respectively. The letters "L" and "U" represent the lower and upper limits of the proportions for groups 1 and 2, which can be estimated using Eq. (3).

Relative risk (RR) and odds ratio (OR)

Eqs. (5), (6) describe the CI calculation for the RR and for the OR, respectively. 11 In Eqs. (5.1), (5.2), (6.1), (6.2), "A" represents the number of individuals with the event in group 1; "B" represents the number of individuals with the event in group 2; "C" represents the number of individuals without the event in group 1; and "D" represents the number of individuals without the event in group 2. These values can be determined in a 2 by 2 table. 11 Box 2 describes a case study using RR, OR and their respective 95% CIs.

Interpreting frequentist CIs

The frequentist CI has a long-run frequency interpretation, that is: random samples from the same target population and with the same sample size would yield CIs that contain the true (unknown) estimate at a frequency (percentage) set by the confidence level. However, we usually do not have several random samples from the same population; instead, we collect data from only one sample of the population of interest and compute the CI for this particular sample. The interpretation of this particular CI would be: we can be XX% confident that the true (unknown) estimate would lie within the lower and upper limits of the CI, based on hypothesized repeats of the experiment.

For the 95% CI, this would imply that if we repeat an experiment 100 times and compute the 95% CI for all 100 experiments (Fig. 1B), then 95 (95%) of these CIs would contain the true (unknown) estimate, while 5 (5%) would not. This true (unknown) estimate is represented in Fig. 1C by the mean of the sampling distribution (i.e., "mean of x̄ 1:100 = 0.0"), which frequentists use as a proxy for the population mean represented by Fig. 1A. But let us suppose we have collected data from only one sample of the target population (which is usually the case), represented by the sample "Data collected" in Fig. 1B. The 95% CI yielded from this particular sample can be interpreted as follows: we can be 95% confident that the true (unknown) estimate would lie within the lower and upper limits of the CI, based on hypothesized repeats of the experiment.

Regarding statistical significance, if the CI does not contain the null hypothesized value, this indicates statistical significance at the particular significance level set by the investigator. For example, in the case of a between-group mean difference in a randomized controlled trial, the null hypothesized value represented by the null hypothesis (H0) is zero (i.e., no difference between the groups: x̄1 − x̄2 = 0). If the 95% CI does not contain zero and the limits are negative (e.g., −4.0 to −1.0; Fig. 2A), we can be 95% confident that the true (unknown) between-group mean difference would, on average, lie within negative values, indicating that we can be 95% confident that the intervention group would present a lower mean compared to the comparison group. Moreover, if the 95% CI does not contain zero and the limits are positive (e.g., 0.5 to 3.5; Fig. 2C), we can be 95% confident that the true (unknown) between-group mean difference would, on average, lie within positive values, indicating that we can be 95% confident that the intervention group would present a higher mean compared to the comparison group. Both scenarios would indicate a statistically significant result at a significance level of 0.05 (1 − 0.95) or 5%, since both CIs do not contain zero. These results would certainly yield a p-value lower than 0.05. However, if the 95% CI contains zero (e.g., −2.0 to 1.0; Fig. 2B), the true (unknown) between-group mean difference could, on average, lie anywhere between a negative and a positive value, indicating that we cannot be 95% confident that the intervention group would present a lower or a higher mean compared to the comparison group. This would indicate a non-statistically significant result, certainly yielding a p-value higher than 0.05 (for another example and interpretation, see Box 1).

Figure 2. Graphical representation of statistically significant (A, C, D, and F) and non-statistically significant (B and E) results for frequentist 95% confidence intervals or Bayesian 95% credible intervals. For simplicity, both frequentist and Bayesian intervals are represented interchangeably in this figure with the acronym "CI". "RR", relative risk. "OR", odds ratio.

In the case of ratios, such as RR and OR, the null hypothesized value represented by the null hypothesis (H0) is 1 (i.e., the same proportion or odds in both groups: p1/p2 = 1). If the 95% CI does not contain 1 and the limits are lower than 1 (e.g., 0.40 to 0.80; Fig. 2D), we can be 95% confident that the true (unknown) ratio would, on average, lie within values lower than 1, indicating that we can be 95% confident that the intervention group would present a lower event proportion compared to the comparison group. Moreover, if the 95% CI does not contain 1 and the limits are higher than 1 (e.g., 2.0 to 3.0; Fig. 2F), we can be 95% confident that the true (unknown) ratio would, on average, lie within values higher than 1, indicating that we can be 95% confident that the intervention group would present a higher event proportion compared to the comparison group. Both scenarios would indicate a statistically significant result at a significance level of 0.05 (1 − 0.95) or 5%, since both CIs do not contain 1. These results would certainly yield a p-value lower than 0.05. However, if the 95% CI contains 1 (e.g., 0.70 to 1.50; Fig. 2E), the true (unknown) ratio could, on average, lie anywhere between a value lower than 1 and a value higher than 1, indicating that we cannot be 95% confident that the intervention group would present a lower or a higher event proportion compared to the comparison group. This would indicate a non-statistically significant result, certainly yielding a p-value higher than 0.05 (for another example and interpretation, see Box 2). The same interpretation approach for RR can also be applied to OR. However, one should note that RR and OR are not the same measure (Box 2).

Advantages of using frequentist CIs rather than p-values

The frequentist approach is well known for performing hypothesis testing. Frequentist hypothesis testing consists of rejecting or failing to reject the null hypothesis (H0) by calculating the famous "p-value". The p-value is defined as the probability of observing the acquired or a more extreme result in a hypothetical series of repeats of the experiment (i.e., the sampling distribution), given that the null hypothesis is true. 3, 4 Health science researchers usually define a significance level of 0.05 (or 5%) for hypothesis testing. Therefore, one rejects the null hypothesis when a p-value is smaller than 0.05, which means that the probability of observing the actual or a more extreme estimate, given that the null hypothesis is true, is very low, supporting the conclusion that the null hypothesis might not be true. On the other hand, one fails to reject the null hypothesis when a p-value is equal to or greater than 0.05, which means that the probability of observing the actual or a more extreme estimate, given that the null hypothesis is true, is moderate to high, so the data are compatible with the null hypothesis. Another simple way of interpreting p-values is the following: the smaller the p-value, the greater the evidence against the null hypothesis and, therefore, the more the results suggest that the alternative hypothesis (H1) might be likely.

However, criticisms have been raised about how researchers and health professionals have been misinterpreting, misusing, and overemphasizing frequentist hypothesis testing, especially the p-value. 4, 12, 13 These criticisms include the following 3, 4, 12, 13, 14:

  • The p-value is not the probability that the null hypothesis is (or is not) true, which would be formally represented as p(H0|y); "H0" represents the null hypothesis and "y" represents the observed data. However, many researchers and health professionals are tempted to interpret the p-value this way, leading to misinterpretations. Actually, the p-value is a measure of the extremeness of the actual result given the null hypothesis, which may be formally represented as p(y|H0). Perhaps due to non-familiarity with these concepts, the p-value interpretation most used in research and in practice is dichotomized, i.e., statistically significant or not statistically significant based on a threshold of 0.05. This may avoid the probability misinterpretation of p-values, but it also oversimplifies the information provided by them.
  • The dichotomized interpretation of p-values, which is widely used in research and in practice, allows for accepting or rejecting the null hypothesis without questioning the effect size or the variability (e.g., uncertainty or precision) of the effect estimate.
  • The p-value seems to have a large sample-to-sample variability, indicating that this measure is probably not reliable for indicating the strength of evidence against the null hypothesis.

The frequentist CI has been suggested as an alternative to p-values. 12, 13 It has the advantage of describing the variability of the estimate, and its width indicates the precision of the estimate. 2 Therefore, researchers have recommended that effect estimates be followed by their CIs (usually with a 95% confidence level) in scientific reports. 1, 15 However, the current use of the frequentist CI has also raised some concerns, which are discussed in the next section (i.e., "Disadvantages of using frequentist CIs").

Disadvantages of using frequentist CIs

We believe that the use of the frequentist CI has two potential disadvantages. Firstly, the long-run frequency interpretation of the frequentist CI is not intuitive, and many researchers and health professionals have misinterpreted it. 15 For the 95% CI, a common misinterpretation is the following: there is a 95% probability that the true (unknown) effect estimate lies within the 95% CI. This interpretation is not accurate for the frequentist CI, since the frequentist approach treats the population parameter as a fixed (unknown) value and, therefore, this fixed value is either inside or outside the interval with 100% (or 0%) probability. 2, 6 Actually, the "probability interpretation" that clinicians usually use in clinical practice refers to the Bayesian interval (see the section "Bayesian approach for CIs"). 3, 15 Thereby, the accurate interpretation of the frequentist 95% CI would be the following: if we repeat an experiment over and over again (graphically represented by Fig. 1B) and we compute the 95% CI for all experiments, then 95% of these CIs would contain the true (unknown) estimate (represented by "mean of x̄ 1:100" in Fig. 1C), while 5% of these CIs would not (Box 1, Box 2). A graphical representation of the frequentist 95% CI can be found in Fig. 1B.

Secondly, many researchers and health professionals oversimplify the interpretation of the frequentist 95% CI by dichotomizing it in statistically significant or non-statistically significant and, therefore, hampering a proper discussion on the values, the width (i.e., precision) and the practical implications of such interval. This would lead to some limitations and criticisms discussed earlier in this masterclass for the use of p -values, ruling out the advantages of using frequentist CIs rather than p -values. Therefore, there is no additional benefit in replacing the use of p -values by an oversimplified (i.e., dichotomized) interpretation of the frequentist CI.

Illustrative example of frequentist CIs

A randomized controlled trial investigated the effectiveness of back school versus McKenzie exercises in individuals with chronic nonspecific low back pain. 16 The primary outcomes were pain intensity (0–10 pain numerical rating scale) and disability (Roland–Morris Disability Questionnaire analyzed as a 0–24 numeric scale) one month after randomization. The between-group difference (adjusted for within-group differences) for pain intensity was 0.66 with a 95% CI of −0.29 to 1.62, meaning that we can be 95% confident that the true (unknown) effect would lie between −0.29 and 1.62, based on hypothesized repeats of the experiment. For disability, the between-group difference (adjusted for within-group differences) was 2.37 in favor of McKenzie with a 95% CI of 0.76 to 3.99, meaning that we can be 95% confident that the true (unknown) effect would lie within this CI, based on hypothesized repeats of the experiment. The null hypothesized effect was zero (i.e., no difference between groups). The 95% CI for pain intensity contained the null effect (i.e., zero), meaning that the result was not statistically significant. For disability, the 95% CI did not contain the null effect, meaning that the result was statistically significant. Up to now, the conclusions would be the same if one had used the dichotomized interpretation of p-values instead of the dichotomized interpretation of CIs. However, despite significance, the effect for disability was considered small, because in this case clinicians could expect that their clinical results for disability would fall approximately within 0.76 to 3.99 points on a 0–24 point measure. This interpretation would not be possible when considering only the p-value (which only measures the extremeness of the result under the null hypothesis) or the dichotomized interpretation of the CI. Therefore, the authors concluded that McKenzie exercises were not superior to back school for improving pain intensity in individuals with chronic nonspecific low back pain, and were only slightly more effective for disability. 16

Bayesian approach for CIs

Bayesian inference is a statistical approach aimed at estimating a certain parameter (e.g., a mean or a proportion) of the population distribution, given the evidence provided by the observed (i.e., collected) data. 3 Therefore, the Bayesian approach for statistical inference is considered a more direct or natural approach to answering a research question, since it estimates the parameter of interest directly from the population distribution (Fig. 1A) instead of from the sampling distribution, as the frequentist approach does (Fig. 1C). The Bayesian approach treats the parameters of interest as random variables, and, therefore, parameters can be described with probability distributions. 3, 17 One of the main characteristics of the Bayesian approach is the combination of prior evidence with the observed data. Prior evidence and the observed data are represented with probability distributions that, in Bayesian terminology, are defined as the prior and likelihood distributions, respectively. The prior distribution is combined with the likelihood distribution in order to update the previous knowledge, resulting in the posterior distribution, which is formally represented as p(θ|y); "θ" represents the parameter of interest and "y" represents the observed data. 3

The outcome of a Bayesian analysis is the posterior distribution. The posterior distribution can be summarized by measures of central tendency (e.g., median, mean or mode) and measures of uncertainty (e.g., variance or standard deviation). One of the most used measures of uncertainty in Bayesian inference is the Bayesian credible interval (CrI), which is analogous to the CI in the frequentist approach.

Estimating Bayesian CrIs

Describing and discussing the computation of posterior distributions is beyond the scope of this masterclass. However, once the posterior distribution that represents the updated knowledge about a parameter of interest is defined, obtaining the CrI is straightforward. There are typically two types of Bayesian CrIs: (1) the equal tail interval; and (2) the highest posterior density (HPD) interval. The following sections are focused on defining, explaining and interpreting such intervals.

Equal tail CrI

The Bayesian equal tail CrI method returns threshold values of the posterior distribution that represent an interval containing the probability of interest (e.g., 95%) of the distribution mass around the center of the distribution (Fig. 3A). In other words, the lower limit of the 95% equal tail CrI is the quantile representing a probability of 0.025 (or the 2.5% percentile) of the posterior distribution, while the upper limit is the quantile representing a probability of 0.975 (or the 97.5% percentile). An advantage of the equal tail Bayesian CrI is that it is easily calculated. However, a common concern is that, when the posterior distribution is not symmetric (i.e., right or left skewed), the equal tail CrI might include values with a lower probability of representing the parameter than some values outside the interval. 3 Graphically, this yields a sloped line connecting the lower and upper limits of the interval (Fig. 3C and E). Since this situation is not desired, another method has been proposed for estimating Bayesian CrIs: the HPD interval, which is discussed in the next section (i.e., "Highest posterior density (HPD) CrI").

Fig. 3. Graphical representation of symmetric (A and B), right (positive) skewed (C and D), and left (negative) skewed (E and F) distributions.

Highest posterior density (HPD) CrI

The Bayesian HPD CrI method returns threshold values of the posterior distribution that enclose the probability of interest (e.g., 95%) of the distribution mass, under the constraint that every value inside the interval has a higher probability of representing the parameter than any value outside it. For example, a 95% HPD CrI contains 95% of the mass of the posterior distribution, and all values inside the interval are more likely to represent the parameter than any value outside it. Graphically, this always yields a level line connecting the lower and upper limits of the interval ( Fig. 3 B, D and F). For symmetric posterior distributions, the HPD CrI is equivalent to the equal tail Bayesian CrI ( Fig. 3 A and B). 3 A disadvantage of the HPD CrI method is that the interval is more complex to compute than the equal tail CrI, since its estimation requires numerical optimization. 3
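As an illustration, here is a hypothetical R sketch that finds the HPD interval from posterior draws by brute force: among all windows containing 95% of the sorted draws, it keeps the narrowest. (Packages such as coda provide ready-made HPD functions; this standalone version is only meant to show the idea.)

    # HPD 95% interval: the narrowest window containing 95% of the draws.
    hpd_interval <- function(draws, prob = 0.95) {
      sorted <- sort(draws)
      n <- length(sorted)
      k <- ceiling(prob * n)                        # draws inside the interval
      widths <- sorted[k:n] - sorted[1:(n - k + 1)] # width of each candidate window
      i <- which.min(widths)                        # index of the narrowest window
      c(lower = sorted[i], upper = sorted[i + k - 1])
    }

    set.seed(1)
    posterior_draws <- rgamma(10000, shape = 2, rate = 0.5)  # skewed posterior
    hpd_interval(posterior_draws)  # narrower than the equal tail CrI when skewed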

Interpreting Bayesian CrIs

Bayesian CrIs have a more natural interpretation than frequentist CIs. 3 This is because the Bayesian CrI estimates the most likely values of the parameter of interest directly from the computed posterior distribution, which, in turn, represents all current knowledge and evidence about the population distribution. The interpretation of a Bayesian 95% CrI is the following: there is a 95% probability that the true (unknown) effect estimate (represented by “ μ ” in Fig. 1 A) lies within the interval, given the evidence provided by the observed data. 3 , 15

The way we judge whether a result is statistically significant when interpreting a Bayesian CrI is similar to the frequentist CI. However, one should note that the interpretation of the Bayesian CrI is rather different. For example, in the case of a between-group mean difference, the null effect is zero (i.e., no difference between the groups: x̄₁ − x̄₂ = 0). Suppose a 95% CrI has the limits −4.0 to −1.0 ( Fig. 2 A). This indicates that there is a 95% probability that the population mean difference lies between −4.0 and −1.0, given the observed data. In other words, the most plausible values (i.e., −4.0 to −1.0) indicate that the mean of the intervention group is lower than that of the comparison group, with at least a 95% probability. Now suppose a 95% CrI has the limits 0.5 to 3.5 ( Fig. 2 C). This indicates a 95% probability that the population mean difference lies between 0.5 and 3.5, given the observed data; the most plausible values (i.e., 0.5 to 3.5) indicate that the mean of the intervention group is higher than that of the comparison group, with at least a 95% probability. Both scenarios indicate a statistically significant result at a significance level of 0.05 (1 − 0.95), or 5%, since neither CrI contains zero. In contrast, a 95% CrI with the limits −2.0 to 1.0 ( Fig. 2 B) indicates a 95% probability that the population mean difference lies between −2.0 and 1.0, given the observed data. Since these most plausible values are compatible with the intervention group mean being either lower or higher than the comparison group mean, this indicates a non-statistically significant result.
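Because the posterior is a full probability distribution, such probability statements can be read directly off posterior draws. A hypothetical R sketch, with normal draws chosen so that the 95% CrI roughly matches the −4.0 to −1.0 example above:

    # Hypothetical posterior draws for a between-group mean difference.
    set.seed(2)
    diff_draws <- rnorm(10000, mean = -2.5, sd = 0.75)

    quantile(diff_draws, probs = c(0.025, 0.975))  # approximately (-4.0, -1.0)
    mean(diff_draws < 0)  # posterior probability that the intervention mean is lower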

In the case of ratios, such as the RR and OR, the null effect is 1 (i.e., the same proportion or odds in both groups: p₁ / p₂ = 1). Suppose a 95% CrI for an RR has the limits 0.40 to 0.80 ( Fig. 2 D). This indicates a 95% probability that the population RR lies between 0.40 and 0.80, given the observed data; the most plausible values (i.e., 0.40 to 0.80) indicate that the event proportion of the intervention group is lower than that of the comparison group, with at least a 95% probability. Now suppose a 95% CrI for an RR has the limits 2.0 to 3.0 ( Fig. 2 F). This indicates a 95% probability that the population RR lies between 2.0 and 3.0, given the observed data; the most plausible values (i.e., 2.0 to 3.0) indicate that the event proportion of the intervention group is higher than that of the comparison group, with at least a 95% probability. Both scenarios indicate a statistically significant result at a significance level of 0.05 (1 − 0.95), or 5%, since neither CrI contains 1. In contrast, a 95% CrI with the limits 0.70 to 1.50 ( Fig. 2 E) indicates a 95% probability that the population RR lies between 0.70 and 1.50, given the observed data. Since these most plausible values are compatible with the event proportion of the intervention group being either lower or higher than that of the comparison group, this indicates a non-statistically significant result. The same interpretation applies to the OR. However, one should note that the RR and OR are not the same measure ( Box 2 ).

Advantages of using Bayesian CrIs

A clear advantage of Bayesian CrIs is their interpretability. For example, consider the frequentist 95% CI for the effectiveness of back school compared to McKenzie exercises on disability discussed earlier in this masterclass (see the section “ Illustrative example of frequentist CIs ”), which is 0.76 to 3.99. As discussed earlier, the interpretation of this frequentist 95% CI is that, over a hypothetical series of repeats of the experiment, we can be 95% confident that the true (unknown) effect estimate (represented by “mean of x̄₁: 100” in Fig. 1 C) would lie between 0.76 and 3.99. Now suppose this interval had been estimated using Bayesian inference. Interpreted as a Bayesian CrI, it would mean that there is a 95% probability that the true (unknown) effect estimate (represented by “ μ ” in Fig. 1 A) lies within 0.76 to 3.99, given the observed data. The Bayesian CrI is considered easier to interpret than the frequentist CI, because:

  • The Bayesian CrI can be interpreted in a probabilistic way, which is how clinicians usually reason in clinical practice, even for frequentist CIs. 3 This suggests that clinicians prefer the probabilistic interpretation.
  • The Bayesian approach reflects a direct estimate from the population distribution ( Fig. 1 A), represented by the actual computed posterior distribution, instead of an estimate from the hypothetical sampling distribution ( Fig. 1 B and C) used in the frequentist approach.

Disadvantages of using Bayesian CrIs

A clear disadvantage of using Bayesian CrIs is the complexity of computing posterior distributions, especially for complex problems/analyses such as those conducted in randomized controlled trials. In the past, this was an important barrier to the use of Bayesian inference. However, given recent advances in computer science and technology, Bayesian inference has become much easier to apply, even in complex situations. Therefore, computational issues should no longer preclude Bayesian analyses. However, the knowledge and skills needed to perform such analyses remain clear barriers that should be addressed in biostatistics education for health scientists and health professionals. This might also generate work opportunities for clinicians, including physical therapists, as suggested by Casals and Finch. 18 , 19

Illustrative example of Bayesian CrIs

A 6-month randomized controlled trial investigated the effectiveness of an online tailored advice package (i.e., TrailS6 ) compared to general advice on preventing running-related injuries (RRI) in trail runners. 20 The main result was an ARR of −13.1% (i.e., the intervention reduced the risk of sustaining an RRI by 13.1 percentage points), with a 95% HPD CrI of −23.3% to −3.1%. The interpretation of this 95% HPD CrI is that there was a 95% probability that the true (unknown) preventive effect lay within −23.3% to −3.1%. 20 In other words, the most plausible values (i.e., −23.3% to −3.1%) indicate that the intervention group had a lower risk of RRIs than the comparison group, with at least a 95% probability. We believe that this interval is more natural and easier to interpret than the frequentist CI. The authors of the trial concluded that the online tailored advice package ( TrailS6 ) was effective in preventing RRIs in trail runners.

We believe that, as recommended by Freire et al., 1 the use and reporting of 95% CIs should be encouraged even when p -values are presented. Decision-making should not rest on the dichotomized interpretation of either p -values or CIs (i.e., statistically significant versus non-statistically significant). Instead, a more in-depth analysis and interpretation of the values and width (i.e., precision) of CIs is recommended, to avoid oversimplifying these rich measures. Frequentist CIs are a preferable alternative to p -values. However, the interpretability of the frequentist approach, which rests on a hypothetical series of repeats of the experiment (i.e., the sampling distribution) given that the null hypothesis is true, opens an opportunity for Bayesian CrIs, which are more naturally and easily interpretable. Training and education may enhance the knowledge and skills needed to estimate, understand, and interpret uncertainty measures, reducing the barriers to their use under either the frequentist or the Bayesian approach.

This masterclass had no funding source of any nature.

Conflicts of interest

The authors declare no conflicts of interest.

Acknowledgements

Luiz Hespanhol was granted with a Young Investigator Grant from the São Paulo Research Foundation (FAPESP), grant 2016/09220-1. Caio Sain Vallio was granted with a PhD scholarship from FAPESP, process number 2017/11665-4. Bruno T Saragiotto was granted with a Young Investigator Grant from FAPESP, grant 2016/24217-7.

What Is a Confidence Interval and How Do You Calculate It?

A confidence interval, in statistics, is a range of values, computed from sample data, that is likely to contain an unknown population parameter; the procedure is designed so that a stated proportion of such intervals (often 95% or 99%) would contain the true value over many repeated samples. Thus, if a statistical model yields a point estimate of 10.00 with a 95% confidence interval of 9.50 to 10.50, the interval was produced by a method that captures the true value in 95% of repeated samples.

Statisticians and other analysts use confidence intervals to understand the statistical significance of their estimations, inferences, or predictions. If a confidence interval contains the value of zero (or some other null hypothesis value), then one cannot satisfactorily claim that a result from data generated by testing or experimentation is attributable to a specific cause rather than to chance.

Key Takeaways

  • A confidence interval is a range of values around an estimate, constructed so that a stated percentage of such intervals would contain the population parameter.
  • Confidence intervals measure the degree of uncertainty or certainty in a sampling method.
  • They are also used in hypothesis testing and regression analysis.
  • Statisticians often use p-values in conjunction with confidence intervals to gauge statistical significance.
  • They are most often constructed using confidence levels of 95% or 99%.

Understanding Confidence Intervals

Confidence intervals measure the degree of uncertainty or certainty in a sampling method. They can be constructed at any confidence level, with the most common being 95% or 99%. Confidence intervals are computed using statistical methods, such as the t-distribution that underlies a  t-test .

Statisticians use confidence intervals to measure uncertainty in an estimate of a population parameter based on a sample. For example, a researcher selects different samples randomly from the same population and computes a confidence interval for each sample to see how it may represent the true value of the population variable. The resulting datasets are all different; some intervals include the true population parameter and others do not.

A confidence interval is a range of values, bounded above and below the sample estimate, that likely would contain an unknown population parameter. Confidence level refers to the percentage of probability, or certainty, that intervals constructed this way would contain the true population parameter when you draw random samples many times.

Or, in the vernacular, "if we drew many samples, about 99% of the resulting intervals (the confidence intervals) would contain the true population parameter."

The biggest misconception regarding confidence intervals is that they represent the percentage of data from a given sample that falls between the upper and lower bounds. For example, one might erroneously interpret a 99% confidence interval of 70 to 78 inches (from the example below) as indicating that 99% of the data in a random sample falls between these numbers.

This is incorrect, though a separate method of statistical analysis exists to make such a determination. Doing so involves identifying the sample's mean and standard deviation and plotting these figures on a bell curve .

Confidence interval and confidence level are interrelated but are not exactly the same.

Calculating Confidence Intervals

Suppose a group of researchers is studying the heights of high school basketball players. The researchers take a random sample from the population and establish a mean height of 74 inches.

The mean of 74 inches is a point estimate of the population mean. A point estimate by itself is of limited usefulness because it does not reveal the uncertainty associated with the estimate; you do not have a good sense of how far away this 74-inch sample mean might be from the population mean. What's missing is the degree of uncertainty in this single sample.

Confidence intervals provide more information than point estimates. By establishing a 95% confidence interval using the sample's mean and standard deviation , and assuming a normal distribution as represented by the bell curve, the researchers arrive at an upper and lower bound via a procedure that captures the true mean in 95% of repeated samples.

Assume the interval is between 72 inches and 76 inches. If the researchers take 100 random samples from the population of high school basketball players as a whole and compute an interval from each, the population mean should be captured by about 95 of those intervals.
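A minimal R sketch of such a t-based interval, using simulated heights as hypothetical stand-ins for the researchers' sample:

    # 95% t-based confidence interval for a mean, from hypothetical data.
    set.seed(42)
    heights <- rnorm(25, mean = 74, sd = 5)  # simulated sample of 25 players

    n <- length(heights)
    xbar <- mean(heights)
    se <- sd(heights) / sqrt(n)      # standard error of the mean
    t_star <- qt(0.975, df = n - 1)  # multiplier for 95% confidence

    c(lower = xbar - t_star * se, upper = xbar + t_star * se)
    # t.test(heights)$conf.int returns the same interval directly.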

If the researchers want even greater confidence, they can expand the interval to 99% confidence. Doing so invariably creates a broader range, as it makes room for a greater number of sample means. If they establish the 99% confidence interval as being between 70 inches and 78 inches, they can expect about 99 of 100 intervals computed this way to contain the population mean.

 A 90% confidence level, on the other hand, implies that you would expect 90% of the interval estimates to include the population parameter, and so forth.
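This long-run coverage property is easy to check by simulation. A hypothetical R sketch, where the true mean is known because we generate the data ourselves:

    # Draw many samples, compute a 95% CI from each, and count how often
    # the interval contains the (known) true mean.
    set.seed(1)
    true_mean <- 74
    covered <- replicate(10000, {
      s <- rnorm(25, mean = true_mean, sd = 5)
      ci <- t.test(s)$conf.int
      ci[1] <= true_mean && true_mean <= ci[2]
    })
    mean(covered)  # close to 0.95, the confidence level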


What Is a Common Misconception About Confidence Intervals?

The biggest misconception regarding confidence intervals is that they represent the percentage of data from a given sample that falls between the upper and lower bounds. In other words, it would be incorrect to assume that a 99% confidence interval means that 99% of the data in a random sample falls between these bounds. What it actually means is that the procedure used to construct the interval captures the population mean 99% of the time.

What Is a T-Test?

Confidence intervals are often computed alongside statistical methods such as a t-test. A t-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups, which may be related to certain features. Calculating a t-test requires three key data values: the difference between the mean values from each data set (called the mean difference), the standard deviation of each group, and the number of data values in each group.

How Do You Interpret P-Values and Confidence Intervals?

A p-value is a statistical measurement used to evaluate a hypothesis against observed data: it measures the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. In general, a p-value below 0.05 is considered statistically significant, in which case the null hypothesis is rejected. This corresponds roughly to the null hypothesis value (which is often zero) falling outside a 95% confidence interval.

Confidence intervals allow analysts to understand the likelihood that the results from statistical analyses are real or due to chance. When trying to make inferences or predictions based on a sample of data, there will be some uncertainty as to whether the results of such an analysis actually correspond with the real-world population being studied. The confidence interval depicts the likely range within which the true value should fall.



7.2: Confidence interval and hypothesis tests for the slope and intercept

Mark Greenwood, Montana State University


Our inference techniques will resemble previous material, with an interest in forming confidence intervals and doing hypothesis testing, although the interpretation of confidence intervals for slope coefficients takes some extra care. Remember that the general form of any parametric confidence interval is

\[\text{estimate} \mp t^*\text{SE}_{estimate},\]

so we need to obtain the appropriate standard error for the regression model coefficients and the degrees of freedom that define the \(t\)-distribution for looking up the \(t^*\) multiplier. We will find \(\text{SE}_{b_0}\) and \(\text{SE}_{b_1}\) in the model summary. The degrees of freedom for the \(t\)-distribution in simple linear regression are \(\mathbf{df = n-2}\). Putting this together, the confidence interval for the true y-intercept, \(\beta_0\), is \(\mathbf{b_0 \mp t^*_{n-2}}\textbf{SE}_{\mathbf{b_0}}\), although this confidence interval is rarely of interest. The confidence interval that is almost always of interest is for the true slope coefficient, \(\beta_1\): \(\mathbf{b_1 \mp t^*_{n-2}}\textbf{SE}_{\mathbf{b_1}}\). The slope confidence interval is used to do two things: (1) inference for the amount of change in the mean of \(y\) for a unit change in \(x\) in the population and (2) potentially hypothesis testing, by checking whether 0 is in the CI or not. The sketch in Figure 7.4 illustrates the role of the CI for the slope in terms of determining where the population slope coefficient, \(\beta_1\), might be – centered at the sample slope coefficient, our best guess for the true slope. This sketch also informs an interpretation of the slope coefficient confidence interval:

Figure 7.4: Graphic illustrating the confidence interval for a slope coefficient for a 1 unit increase in \(x\).

For a 1 [ units of X ] increase in X , we are ___ % confident that the true change in the mean of Y will be between LL and UL [ units of Y ] .

In this interpretation, LL and UL are the calculated lower and upper limits of the confidence interval. This builds on our previous interpretation of the slope coefficient, adding in the information about pinning down the true change (population change) in the mean of the response variable for a difference of 1 unit in the \(x\) -direction. The interpretation of the y-intercept CI is:

For an x of 0 [ units of X ] , we are 95% confident that the true mean of Y will be between LL and UL [ units of Y ] .

This is really only interesting if the value of \(x = 0\) is interesting – we’ll see a method for generating CIs for the true mean at potentially more interesting values of \(x\) in Section 7.7. To trust the results from these confidence intervals, it is critical that any issues with the regression validity conditions are minor.

The only hypothesis test of interest in this situation is for the slope coefficient. To develop the hypotheses of interest in SLR, note the effect of having \(\beta_1 = 0\) in the mean of the regression equation, \(\mu_{y_i} = \beta_0 + \beta_1x_i = \beta_0 + 0x_i = \beta_0\) . This is the “intercept-only” or “mean-only” model that suggests that the mean of \(y\) does not vary with different values of \(x\) as it is always \(\beta_0\) . We saw this model in the ANOVA material as the reduced model when the null hypothesis of no difference in the true means across the groups was true. Here, this is the same as saying that there is no linear relationship between \(x\) and \(y\) , or that \(x\) is of no use in predicting \(y\) , or that we make the same prediction for \(y\) for every value of \(x\) . Thus

\[\boldsymbol{H_0: \beta_1 = 0}\]

is a test for no linear relationship between \(\mathbf{x}\) and \(\mathbf{y}\) in the population . The alternative of \(\boldsymbol{H_A: \beta_1\ne 0}\) , that there is some linear relationship between \(x\) and \(y\) in the population, is our main test of interest in these situations. It is also possible to test greater than or less than alternatives in certain situations.

Test statistics for regression coefficients are developed, if we can trust our assumptions, using the \(t\) -distribution with \(n-2\) degrees of freedom. The \(t\) -test statistic is generally

\[t = \frac{b_i}{\text{SE}_{b_i}}\]

with the main interest in the test for \(\beta_1\) based on \(b_1\) initially. The p-value would be calculated using the two-tailed area from the \(t_{n-2}\) distribution calculated using the pt function. The p-value to test these hypotheses is also provided in the model summary as we will see below.

The greater than or less than alternatives can have interesting interpretations in certain situations. For example, the greater than alternative \(\left(\boldsymbol{H_A: \beta_1 > 0}\right)\) tests an alternative of a positive linear relationship, with the p-value extracted just from the right tail of the same \(t\)-distribution. This could be used when a researcher would only find a result “interesting” if a positive relationship is detected, such as in the study of tree height and tree diameter, where a researcher might be justified in deciding to test only for a positive linear relationship. Similarly, the left-tailed alternative, \(\boldsymbol{H_A: \beta_1 < 0}\), is also possible. To get one-tailed p-values from two-tailed results (the default), first check that the observed test statistic is in the direction of the alternative ( \(t>0\) for \(H_A:\beta_1>0\) or \(t<0\) for \(H_A:\beta_1<0\) ). If this condition is met, the p-value for the one-sided test is found by dividing the reported two-sided p-value by 2. If it is not met, the p-value will be greater than 0.5, and it is easiest to compute it directly with pt, using the tail area in the direction of the alternative.
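As a brief illustration in R (the test statistic and degrees of freedom below are placeholders taken from the Beers/BAC example discussed next):

    # Two-sided and one-sided p-values from a t statistic.
    t_stat <- 7.48
    df <- 14

    2 * pt(abs(t_stat), df = df, lower.tail = FALSE)  # two-sided
    pt(t_stat, df = df, lower.tail = FALSE)           # H_A: beta1 > 0
    pt(t_stat, df = df, lower.tail = TRUE)            # H_A: beta1 < 0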

We can revisit a couple of examples for a last time with these ideas in hand to complete the analyses.

For the Beers, BAC data, the 95% confidence for the true slope coefficient, \(\beta_1\) , is

\[\begin{array}{rl} \boldsymbol{b_1 \mp t^*_{n-2}} \textbf{SE}_{\boldsymbol{b_1}} & \boldsymbol{= 0.01796 \mp 2.144787 * 0.002402} \\ & \boldsymbol{= 0.01796 \mp 0.00515} \\ & \boldsymbol{\rightarrow (0.0128, 0.0231).} \end{array}\]

You can find the components of this calculation in the model summary and from qt(0.975, df = n-2) , which was 2.145 for the \(t^*\)-multiplier. Be careful not to use the \(t\)-value of 7.48 in the model summary to make confidence intervals – that is the test statistic used below. The related calculations are sketched in the following code:
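(The original code output is not reproduced here; the following is a minimal sketch, assuming the Beers/BAC data are in a data frame named BB — a hypothetical name — with \(n = 16\) observations.)

    # Fit the SLR model for the BAC example.
    m1 <- lm(BAC ~ Beers, data = BB)
    summary(m1)  # reports b1 = 0.01796 with SE = 0.002402, as quoted above

    # 95% CI for the slope by hand:
    tstar <- qt(0.975, df = 16 - 2)        # 2.144787
    0.01796 + c(-1, 1) * tstar * 0.002402  # (0.0128, 0.0231)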

We can also get the confidence interval directly from the confint function run on our regression model, saving some calculation effort and providing both the CI for the y-intercept and the slope coefficient.
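Continuing the same hypothetical sketch:

    confint(m1)                # 95% CIs for the y-intercept and the slope
    confint(m1, level = 0.99)  # any other confidence level works the same way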

We interpret the 95% CI for the slope coefficient as follows: For a 1 beer increase in number of beers consumed, we are 95% confident that the true change in the mean BAC will be between 0.0128 and 0.0231 g/dL. While the estimated slope is our best guess of the impacts of an extra beer consumed based on our sample, this CI provides information about the likely range of potential impacts on the mean in the population. It also could be used to test the two-sided hypothesis test and would suggest strong evidence against the null hypothesis since the confidence interval does not contain 0, but its main use is to quantify where we think the true slope coefficient resides.

The width of the CI, interpreted loosely as the precision of the estimated slope, is impacted by the variability of the observations around the estimated regression line, the overall sample size, and the positioning of the \(x\)-observations. Basically all those aspects relate to how “clearly” known the regression line is, and that determines the estimated precision in the slope. For example, the more variability around the line that is present, the more uncertainty there is about the correct line to use (Least Squares (LS) can still find an estimated line, but there are other lines that might be “close” to its optimizing choice). Similarly, more observations help us get a better estimate of the mean – an idea that permeates all statistical methods. Finally, the location of the \(x\)-values can impact the precision of a slope coefficient. We’ll revisit this in the context of multicollinearity in the next chapter, and often we have no control over the \(x\)-values, but just note that different patterns of \(x\)-values can lead to different precision in estimated slope coefficients.

For hypothesis testing, we will almost always stick with two-sided tests in regression modeling as it is a more conservative approach and does not require us to have an expectation of a direction for relationships a priori . In this example, the null hypothesis for the slope coefficient is that there is no linear relationship between Beers and BAC in the population. The alternative hypothesis is that there is some linear relationship between Beers and BAC in the population. The test statistic is \(t = 0.01796/0.002402 = 7.48\) which, if model assumptions hold, follows a \(t(14)\) distribution under the null hypothesis. The model summary provides the calculation of the test statistic and the two-sided test p-value of \(2.97\text{e-6} = 0.00000297\) . So we would just report “p-value < 0.0001”. This suggests that there is very strong evidence against the null hypothesis of no linear relationship between Beers and BAC in the population, so we would conclude that there is a linear relationship between them. Because of the random assignment, we can also say that drinking beers causes changes in BAC but, because the sample was made up of volunteers, we cannot infer that these results would hold in the general population of OSU students or more generally.

There are also results for the y-intercept in the output. The 95% CI is from -0.0398 to 0.0144, meaning that the true mean BAC for a subject consuming 0 beers is estimated to be between -0.0398 and 0.0144. This is not a big surprise, but it is possibly comforting to know that these results show little evidence against the null hypothesis that the true mean BAC for 0 Beers is 0. Finding little evidence of a difference from 0 makes sense and makes the estimated y-intercept of -0.013 less problematic. In other situations, the results for the y-intercept may be more illogical, but this will often be because the y-intercept extrapolates far beyond the scope of the observations. The y-intercept’s main function in regression models is to be at the right level for the slope to “work” in making a line that describes the responses, and it is thus usually of lesser interest even though it plays an important role in the model.

As a second example, we can revisit modeling the Hematocrit of female Australian athletes as a function of body fat % . The sample size is \(n = 99\), so the df are 97 in any \(t\)-distributions. In Chapter 6, the relationship between Hematocrit and body fat % for females appeared to be a weak negative linear association. The 95% confidence interval for the slope is -0.186 to 0.0155. For a 1% increase in body fat %, we are 95% confident that the change in the true mean Hematocrit is between -0.186 and 0.0155% of blood. This suggests that we would find little evidence against the null hypothesis of no linear relationship because this CI contains 0. In fact the p-value is 0.0965, which is larger than 0.05 and so provides a conclusion consistent with the one based on the 95% confidence interval. Either way, we would conclude that there is not strong evidence against the null hypothesis, but there is some evidence against it with a p-value of that size, since results this extreme are still fairly rare if we assume the null is true. If you think p-values around 0.10 provide moderate evidence, you might have a different opinion about the evidence against the null hypothesis here. For this reason, we sometimes interpret this sort of marginal result as providing some or marginal evidence against the null, but we would certainly never say that it presents strong evidence.

One more worked example is provided from the Montana fire data. In this example pay particular attention to how we are handling the units of the response variable, log-hectares, and to the changes to doing inferences with a 99% confidence level CI, and where you can find the needed results in the following output:

  • Based on the estimated regression model, we can say that if the average temperature is 0, we expect that, on average, the log-area burned would be -69.8 log-hectares.
  • From the regression model summary, \(b_1 = 1.39\) with \(\text{SE}_{b_1} = 0.2165\) and \(\mathbf{t = 6.41}\) .
  • There were \(n = 23\) measurements taken, so \(\mathbf{df = n-2 = 23-2 = 21}\) .

\[H_0: \beta_1 = 0\]

  • In words, the true slope coefficient between Temperature and log-area burned is 0 OR there is no linear relationship between Temperature and log-area burned in the population.

\[H_A: \beta_1\ne 0\]

  • In words, the alternative states that the true slope coefficient between Temperature and log-area burned is not 0 OR there is a linear relationship between Temperature and log-area burned in the population.

Test statistic: \(t = 1.39/0.217 = 6.41\)

  • Assuming the null hypothesis to be true (no linear relationship), the \(t\) -statistic follows a \(t\) -distribution with \(n-2 = 23-2 = 21\) degrees of freedom (or simply \(t_{21}\) ).
  • Interpretation: There is less than a 0.01% chance that we would observe a slope coefficient like we did or something more extreme (greater than 1.39 log(hectares)/ \(^\circ F\) ) if there were in fact no linear relationship between temperature ( \(^\circ F\) ) and log-area burned (log-hectares) in the population.

Conclusion: There is very strong evidence against the null hypothesis of no linear relationship, so we would conclude that there is, in fact, a linear relationship between Temperature and log(Hectares) burned.

Scope of Inference: Since we have a time series of results, our inferences pertain to the results we could have observed for these years but not for years we did not observe – so just for the true slope for this sample of years. Because we can’t randomly assign the amount of area burned, we cannot make causal inferences – there are many reasons why both the average temperature and area burned would vary together that would not involve a direct connection between them.

\[\text{99}\% \text{ CI for } \beta_1: \boldsymbol{b_1 \mp t^*_{n-2}}\textbf{SE}_{\boldsymbol{b_1}} \rightarrow 1.39 \mp 2.831\bullet 0.217 \rightarrow (0.78, 2.00)\]

Interpretation of 99% CI for slope coefficient:

  • For a 1 degree F increase in Temperature , we are 99% confident that the change in the true mean log-area burned is between 0.78 and 2.00 log(Hectares).

Another way to interpret this is:

  • For a 1 degree F increase in Temperature , we are 99% confident that the mean Area Burned will change by between 0.78 and 2.00 log(Hectares) in the population .

Also \(R^2\) is 66.2%, which tells us that Temperature explains 66.2% of the variation in log(Hectares) burned . Or that the linear regression model built using Temperature explains 66.2% of the variation in yearly log(Hectares) burned so this model explains quite a bit but not all the variation in the responses.


BSCI 1510L Literature and Stats Guide: 5.4 A test for differences of sample means: 95% Confidence Intervals

Simulation of 95% confidence intervals

The following simulation of 95% confidence intervals is recommended to help you understand the concept.  It uses the same fish sampling example as in the sampling simulation of Section 3.2 :

http://www.zoology.ubc.ca/~whitlock/kingfisher/CIMean.htm

Thanks again to Whitlock and Schluter  for making this great resource available.

In Section 5.3, we talked about differences and P -values in a general way.  Over the rest of this semester and next semester we will learn a number of particular statistical tests that are applicable to various types of experimental designs.  Each of those tests is designed to answer the same general question: "Are the differences I am seeing significant?" and each of those tests will attempt to answer that question by testing whether the value of P falls below the alpha level of 0.05 .  In order to answer this question, each of those tests will calculate the value of some statistical quantity (a "statistic") which can be examined to determine whether the value of P falls below 0.05 or not.  There are two approaches to making the determination.  One is to simply let a computer program calculate and spit out a value of P .  In that case, all that is required is to examine P and see if it falls below 0.05 .  The other approach is to specify a critical value for the statistic that would indicate whether P was less than 0.05 or not.  In this latter approach, the test does not produce an actual numeric value for P . 

This semester we will learn two commonly used tests for determining whether two sample means are significantly different.  The t -test of means (which we will learn about later) generates a value of P , while the test described in this section, 95% confidence intervals , allows us to know whether P < 0.05 without actually generating a value for P .

5.4.1 What is a 95% confidence interval?

The concept of 95% confidence intervals is directly related to the ideas of sampling discussed in Section 3.2.2.  In that section we were interested in describing how close the means of samples of a certain size were likely to fall to the actual mean of the population.  The standard error of the mean (S.E.) describes the standard deviation of sample means for a certain sample size N .  In the example of that section, calculating the S.E. allowed us to predict that 68% of the time when we sampled 16 people from the artificial Homo sapiens population, the sample mean would fall within 1.8 cm of the actual population mean.  The other 32% of the time our sampling is less representative of the population distribution, and our sample means fall more than one standard error from the true mean: more than 1.8 cm above it or more than 1.8 cm below it (i.e. outside the range of 174.3 to 177.9 cm).

Standard error of the mean is mathematically related to the probability of unrepresentative sampling.  4.5% of the time sample means would fall outside of the range of +/- two standard errors (172.5 to 179.7 cm).  0.3% of the time sample means would fall outside the range of +/- three standard errors (170.7 to 181.5 cm).  The wider the range around the actual mean, the less likely it is that sample means would fall outside it.  Since unrepresentative sampling is exactly what P -values assess, it is possible to define a range around a sample mean which can be used to assess whether that sample mean is significantly different from another sample mean.  This range is called the 95% confidence interval (or 95% CI).  The 95% CI is related to the standard error of the mean, and for large sample sizes its limits can be estimated from standard error:

Upper 95% confidence limit (CL) = sample mean + (1.96)(SE)

Lower 95% CL = sample mean – (1.96)(SE)

For smaller N , calculation of the 95% CI is more complex.  However, the 95% CI can be easily calculated by computer, so we will simply use software to determine it.
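For example, here is a hypothetical R sketch of the large-sample formulas above; the mean and standard error echo the height example, and for smaller N the t.test function computes the t-based interval automatically from raw data:

    # Large-sample 95% CI from the standard error (hypothetical values).
    xbar <- 176.1  # sample mean (cm)
    se <- 1.8      # standard error of the mean (cm)

    c(lower = xbar - 1.96 * se, upper = xbar + 1.96 * se)

    # With a vector of raw data x, any sample size: t.test(x)$conf.int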

5.4.2 Testing for significance using 95% confidence intervals


Fig. 9 Application of the 95% confidence interval test for the blood pressure drug trial with small sample size (old drug at top, new drug at bottom). 

Fig. 9 illustrates the situation described in Section 5.2 for the drug trial with very few test subjects (four per treatment).  Each white block represents one of the four test subjects.  In the figure, the sample means are shown as vertical lines, and the 95% confidence intervals are shown as error bars to the left and right of the vertical line.  If we apply the 95% confidence interval test to these results, we see that the difference between the sample means (5.4 mm Hg) cannot be shown to be significant because the upper 95% CL of the old drug exceeds the lower 95% CL of the new drug (the error bars overlap).  We would say that we have failed to reject the null hypothesis, i.e. failed to show that the response to the two drugs is different. 

Does that mean that there is actually no difference in the response to the two drugs?  Not necessarily.  This is a really terrible drug trial because it has so few participants; no real drug trial would have only four participants per treatment group.  The reality is that there is a great deal of uncertainty about our determination of the sample mean, as reflected in the relatively wide 95% confidence intervals.  It is quite possible that the response to the two drugs is actually different (i.e. the alternative hypothesis is true), and that if we improved our estimate of the mean by measuring more patients in our sample, the 95% confidence intervals would shrink to the point where they did not overlap.  On the other hand, it is also possible that the drug does nothing at all (the null hypothesis is true), and we may just have had bad luck in sampling: our few participants were not very representative of the population in general.  The only way we could know which of these two possibilities is true would be to sample more patients.  We will now examine how the situation could change under each of these two scenarios if we had more patients.


Fig. 10 Application of the 95% confidence interval test for the blood pressure drug trial with larger sample size (old drug at top, new drug at bottom) and alternative hypothesis true.

Let us imagine what would have happened under the first scenario (where the alternative hypothesis is true) if we had measured a total of 16 patients per group instead of 4.  Fig. 10 shows the first four measurements of each treatment as gray blocks and an additional 12 measurements as white blocks.  Because we have so many more measurements, the histogram looks much more like a bell-shaped curve.  In this scenario, the first four measurements were actually fairly representative of their populations, so the sample means did not change much with the addition of 12 more patients.  But because with 16 samples we have a much clearer picture of the distribution of measurements in the sample, the 95% CIs have narrowed to the point where they do not overlap.  The increased sample size has allowed us to conclude that the difference between the sample means is significant.


Fig. 11 Application of the 95% confidence interval test for the blood pressure drug trial with larger sample size (old drug at top, new drug at bottom) and null hypothesis true.

Fig. 11 shows what could happen in the second scenario where the null hypothesis was true (the new drug had no different effect than the old drug).  Under this scenario, we can see that we really did have bad luck with our first four measurements.  By chance, one of the first four participants that we sampled in the old drug treatment had an uncharacteristically small response to the drug and the other three had a smaller response than the average of all sixteen patients.  By chance, three of the four participants who received the new drug had higher than average responses to the drug.  But now that we have sampled a greater number of participants, the difference between the two treatment groups has evaporated.  The uncharacteristic measurements have been canceled out by other measurements that vary in the opposite direction. 

Although the confidence interval is narrower with 16 participants than it was with four, that narrowing did not result in a significant difference because at the same time, the estimate of the mean for the two treatments got better as well.  Since under this scenario there was no difference between the two treatments, the better estimates for the sample means converged on the single population mean (about 16 mm Hg). 

In discussing these two scenarios, we supposed that we somehow knew which of the two hypotheses was true.  In reality, we never know this.  But if we have a sufficiently large sample size, we are in a better position to judge which hypothesis is best supported by the statistics.

5.4.3 Statistical Power

By now you have hopefully gotten the picture that the major issue in experimental science is to be able to tell whether differences that we see are real or whether they are caused by chance non-representative sampling.  You have also seen that to a large extent our ability to detect real differences depends on the number of trials or measurements we make.  (Of course, if the differences we observe are not real, then no increase in sample size will be big enough to make the difference significant.) 

Our ability to detect real differences is called statistical power .  Obviously we would like to have the most statistical power possible.  There are several ways we can obtain greater statistical power.  One way is to increase the size of the effect by increasing the size of the experimental factor.  An example would be to try to produce a larger effect in a drug trial by increasing the dosage of the drug.  Another way is to reduce the amount of uncontrolled variation in the results.  For example, standardizing your method of data collection, reducing the number of different observers conducting the experiment, using less variable experimental subjects, and controlling the environment of the experiment as much as possible are all ways to reduce uncontrolled variability.  A third way of increasing statistical power is to change the design of the experiment in a way that allows you to conduct a more powerful test.  For example, having equal numbers of replicates in all of your treatments usually increases the power of the test.  Simplifying the design of the experiment may increase the power of the test.  Using a more appropriate test can also increase statistical power.  Finally, increasing the sample size (or number of replicates) nearly always increases the statistical power.  Obviously the practical economics of time and money place a limit on the number of replicates you can have.  

Theoretically, the outcome of an experiment should be equally interesting regardless of whether the outcome of an experiment shows a factor to have a significant effect or not.  As a practical matter, however, there are far more experiments published showing significant differences than studies showing factors to not be significant.  There is an important practical reason for this.  If an experiment shows differences that are significant, then we assume that is because the factor has a real effect.  However, if an experiment fails to show significant differences, this could be because the factor doesn't really have any effect.  But it could also be that the factor has an effect but the experiment just didn't have enough statistical power to detect it.  The latter possibility has less to do with the biology involved and more to do with the experimenter's possible failure at planning and experimental design - not something that a scientific journal is going to want to publish a paper about!  Generally, in order to publish experiments that do not have significant differences it is necessary to conduct a power test .  A power test is used to show that a test would have been capable of detecting differences of a certain size if those differences had existed.

5.4.4 Calculating 95% Confidence Intervals using Excel

Note: to calculate the descriptive statistical values in this section, you must have enabled the Data Analysis Tools in Excel.  This has already been done on the lab computers but if you are using a computer elsewhere, you may need to enable it.  Go to the Excel Reference homepage for instructions for PC and Mac.

To calculate the 95% CI for a column of data, click on the Data ribbon.  Click on Data Analysis in the Analysis section.  Select Descriptive Statistics, then click OK.  Click on the Input Range selection button, then select the range of cells for the column.  If there is a label for the column, click the "Labels in first row" checkbox and include it when you select the range of cells.  Check the Confidence Level for Mean: checkbox and make sure the level is set for 95%.  (You may also wish to check the Summary statistics checkbox as well if you want to calculate the mean value.)  To put the results on the same sheet as the column of numbers, click on the Output Range radio button then click on the selection button.  Click on the upper left cell of the area of the sheet where you would like for the results to go.  Then press OK.  The value given for Confidence Level is the amount that must be added to the mean to calculate the upper 95% confidence limit and that must be subtracted from the mean to calculate the lower 95% confidence limit.

5.4.5 An example of the application of 95% confidence intervals

Sickbert-Bennett et al. (2005) compared the effect of a large number of different anti-microbial agents on a bacterium and a virus.  They applied Serratia marcescens bacteria to the hands of their test subjects and measured the number of bacteria that they could extract from the hands with and without washing with the antimicrobial agent.  [The distributions of bacterial samples tend not to be normally distributed because there are relatively large numbers of samples with few bacteria, while a few samples have very high bacterial counts.  Taking the logarithm of the counts changes the shape of the distribution to one closer to the normal curve (see Leyden et al. 1991, Fig. 3 for an example) and allows the calculation of typical statistics to be valid.  Thus we see this kind of data transformation being used in all three of the papers.]


Fig. 12 Sample mean log reductions and 95% confidence intervals from Table 3 of Sickbert-Bennett et al. (2005)

For each type of agent, Sickbert-Bennett et al. (2005) calculated the mean decline in the log of the counts as well as the 95% confidence intervals for that mean (Fig. 12).  By examining this table, we can see whether any two cleansing agents produced significantly different declines.  For example, since our experiment will be focused on consumer soap containing triclosan, we would like to know if Sickbert-Bennett et al. (2005) found a difference in the effect of soap with 1% triclosan and the control nonantimicrobial soap.  After the first episode of handwashing, the mean log reduction of the triclosan soap was greater (1.90) than the mean log reduction of the plain soap (1.87).  However, the lower 95% confidence limit of the triclosan mean (1.50) was less than the upper 95% confidence limit of the control soap mean (2.19).  So the measured difference was not significant by the P <0.05 criterion. 

5.4.6 Weaknesses in using 95% confidence intervals as a significance test

The method of checking for significance by looking for overlap in 95% confidence intervals has several weaknesses.  An obvious weakness is that the test does not produce a numeric measure of the degree of significance; it simply indicates whether P is more or less than 0.05 .  Another is that it can be a more conservative test than necessary.  In an experiment with only two treatment groups, if the 95% confidence intervals do not overlap, then it is clear that the two means are significantly different at the P <0.05 level.  However, confidence intervals can actually overlap by a small amount with the difference still being significant.
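A hypothetical numeric sketch of that last point, using made-up group means and standard errors and a normal approximation in R:

    # Two group means whose 95% CIs overlap slightly, yet differ significantly.
    mean_old <- 0.0; mean_new <- 1.0; se <- 0.3  # hypothetical summaries

    c(mean_old - 1.96 * se, mean_old + 1.96 * se)  # (-0.59, 0.59)
    c(mean_new - 1.96 * se, mean_new + 1.96 * se)  # ( 0.41, 1.59): overlap

    z <- (mean_new - mean_old) / sqrt(se^2 + se^2)  # z for the difference, ~2.36
    2 * pnorm(-abs(z))                              # p ~ 0.018, below 0.05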

Another, more subtle, problem occurs when more than two groups are being compared.  The more groups being compared, the more possible pairwise comparisons there are between groups, and the number of possible pairwise comparisons grows faster than linearly with the number of groups.  If the alpha level for each comparison is left at 0.05, more than 5% of the comparisons between groups whose means are actually the same will erroneously be declared different (i.e. the Type I error rate will be greater than 5%).  Thus a simple comparison of 95% confidence intervals cannot be made without adjustments.  In a scathing response, Paulson (2005) points out that Sickbert-Bennett et al. (2005) did not properly adjust their statistics to account for multiple comparisons.  According to Paulson, this mistake (Point 3 in his paper), along with several other statistical and procedural errors, made their conclusions meaningless.  This paper, along with the response of Kampf and Kramer (2005), makes interesting reading, as they show how a paper can be publicly excoriated for poor experimental design and statistical analysis.

Despite these problems, 95% confidence intervals provide a convenient way to illustrate the certainty with which sample means are known, as shown in the following section.

Kampf, G. and A. Kramer. 2005. Efficacy of hand hygiene agents at short application times. American Journal of Infection Control 33:429-431.  http://dx.doi.org/10.1016/j.ajic.2005.03.013

Leyden, J.J., K.J. McGinley, M.S. Kaminer, J. Bakel, S. Nishijima, M.J. Grove, G.L. Grove. 1991. Computerized image analysis of full-hand touch plates: a method for quantification of surface bacteria on hands and the effect of antimicrobial agents. Journal of Hospital Infection 18 (Supplement B):13-22.  http://dx.doi.org/10.1016/0195-6701(91)90258-A

Paulson, D.S. 2005. Response: comparative efficacy of hand hygiene agents.  American Journal of Infection Control 33:431-434.  http://dx.doi.org/10.1016/j.ajic.2005.03.013

Sickbert-Bennett, E.E., D.J. Weber, M.F. Gergen-Teague, M.D. Sobsey, G.P. Samsa, W.A. Rutala. 2005. Comparative efficacy of hand hygiene agents in the reduction of bacteria and viruses.  American Journal of Infection Control 33:67-77.  http://dx.doi.org/10.1016/j.ajic.2004.08.005

  • Registered Report
  • Open access
  • Published: 27 May 2024

Comparing researchers’ degree of dichotomous thinking using frequentist versus Bayesian null hypothesis testing

Jasmine Muradchanian (ORCID: orcid.org/0000-0002-2914-9197) 1, Rink Hoekstra 1, Henk Kiers 1, Dustin Fife 2 & Don van Ravenzwaaij 1

Scientific Reports, volume 14, Article number: 12120 (2024)

  • Human behaviour
  • Neuroscience

Much of the scientific literature in the social and behavioural sciences bases its conclusions on one or more hypothesis tests. As such, it is important to learn more about how researchers in these fields interpret the quantities produced by hypothesis test metrics, such as p-values and Bayes factors. In the present study, we explored the relationship between obtained statistical evidence and the degree of belief or confidence that there is a positive effect in the population of interest. In particular, we were interested in the existence of a so-called cliff effect: a qualitative drop in the degree of belief that there is a positive effect around certain threshold values of statistical evidence (e.g., at p = 0.05). We compared this relationship for p-values to the relationship for corresponding degrees of evidence quantified through Bayes factors, and we examined whether this relationship was affected by two different modes of presentation (in one mode the functional form of the relationship across values was implicit to the participant, whereas in the other mode it was explicit). We found evidence for a higher proportion of cliff effects in p-value conditions than in BF conditions (N = 139), but we did not obtain a clear indication of whether presentation mode affected the proportion of cliff effects.

Protocol registration

The stage 1 protocol for this Registered Report was accepted in principle on 2 June 2023. The protocol, as accepted by the journal, can be found at: https://doi.org/10.17605/OSF.IO/5CW6P .


Introduction

In applied science, researchers typically conduct statistical tests to learn whether an effect of interest differs from zero. Such tests typically quantify evidence by means of p-values (but see e.g., Lakens 1, who warns against such an interpretation of p-values). A Bayesian alternative to the p-value is the Bayes factor (BF), a tool for quantifying statistical evidence in hypothesis testing 2,3. P-values and BFs are related to one another 4, with BFs being used much less frequently. Given two contrasting hypotheses (i.e., a null hypothesis, H0, and an alternative hypothesis, H1), the p-value is the probability of obtaining a result as extreme as or more extreme than the actual observed sample result, given that H0 is true (and given that the assumptions hold). A BF, on the other hand, quantifies the probability of the data given H1 relative to the probability of the data given H0 (denoted BF10 3).
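As a concrete illustration, the sketch below computes both quantities for one simulated experiment of the kind used later in this study (two groups of n = 250). It uses the BayesFactor R package with its default Cauchy prior on effect size; the priors behind the numbers used in our actual materials are specified in the OSF code.

  library(BayesFactor)

  set.seed(1)
  g1 <- rnorm(250, mean = 0.2)   # group 1, with a small positive true effect
  g2 <- rnorm(250, mean = 0.0)   # group 2

  # One-sided p-value: probability of data at least this extreme if H0 is true
  t.test(g1, g2, alternative = "greater")$p.value

  # One-sided BF10: probability of the data under H1 (positive effects only)
  # relative to the probability of the data under H0; the printed output also
  # reports the BF for the complementary (negative-effect) interval
  ttestBF(x = g1, y = g2, nullInterval = c(0, Inf))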

There is ample evidence that researchers often find it difficult to interpret quantities such as p-values 5,6,7. Although awareness of the dangers of misinterpreting p-values has grown, these dangers remain prevalent. One key reason for these misinterpretations is that the concepts are neither simple nor intuitive, and interpreting them correctly requires considerable cognitive effort. Because of this high cognitive demand, academics have been using shortcut interpretations, which are simply wrong 6. An example of such a misinterpretation is that the p-value represents the probability of the null hypothesis being true 6. Research is typically conducted to reduce uncertainty about the existence of an effect in the population of interest, and measures such as p-values and Bayes factors are tools for doing so. Hence, especially given the mistakes researchers make when interpreting such quantities, it is worth studying how these measures affect people's beliefs about the existence of an effect, so that we can see how outcomes like p-values and Bayes factors translate into subjective beliefs in practice.

One of the first studies on how researchers interpret statistical quantities was conducted by Rosenthal and Gaito 8, who specifically studied how researchers interpret p-values of varying magnitude. Nineteen researchers and graduate students at their psychology faculty were asked to indicate their degree of belief or confidence in 14 p-values, varying from 0.001 to 0.90, on a 6-point scale ranging from "5 extreme confidence or belief" to "0 complete absence of confidence or belief" 8, pp. 33–34. These individuals were shown p-values for sample sizes of 10 and 100. The authors stated that they wanted to measure the degree of belief or confidence in research findings as a function of the associated p-values, but stated as such it is not entirely clear what this means; we assume that the authors actually wanted to assess the degree of belief or confidence in the existence of an effect, given the p-value. Their findings suggested that subjects' degree of belief or confidence was a decreasing exponential function of the p-value. Additionally, for any p-value, self-rated confidence was greater for the larger sample size (i.e., n = 100). Furthermore, the authors argued for the existence of a cliff effect around p = 0.05: an abrupt drop in the degree of belief or confidence in a p-value just beyond the 0.05 level 8,9. This finding has been confirmed in several subsequent studies 10,11,12. The studies described so far focused on the average response and did not take individual differences into account.

The cliff effect suggests that p-values invite dichotomous thinking, which according to some authors is a common type of reasoning when interpreting p-values in the context of Null Hypothesis Significance Testing (NHST 13). The outcome of the significance test is usually interpreted dichotomously, as suggested by studies focusing on the cliff effect 8,9,10,11,12,13: one makes a binary choice between rejecting or not rejecting a null hypothesis 14. This practice has drawn some academics away from the main task of finding out the size of the effect of interest and the level of precision with which it has been measured 5. However, Poitevineau and Lecoutre 15 argued that the cliff effect around p = 0.05 is probably overstated. According to them, previous studies paid insufficient attention to individual differences. To demonstrate this, they explored the individual data and found qualitative heterogeneity in the respondents' answers. The authors identified three categories of response functions across 12 p-values: (1) a decreasing exponential curve, (2) a decreasing linear curve, and (3) an all-or-none curve representing a very high degree of confidence when p ≤ 0.05 and quasi-zero confidence otherwise. Out of 18 participants, the responses of 10 followed a decreasing exponential curve, 4 followed a decreasing linear curve, and 4 followed an all-or-none curve. The authors concluded that the cliff effect may be an artifact of averaging, resulting from the fact that a few participants have an all-or-none interpretation of statistical significance 15.

Although NHST is used frequently, it has been argued that it should be replaced by effect sizes, confidence intervals (CIs), and meta-analyses, which may invite a shift from dichotomous thinking to estimation and meta-analytic thinking 14. Lai et al. 13 studied whether using CIs rather than p-values would reduce the cliff effect, and thereby dichotomous thinking. Similar to the classification by Poitevineau and Lecoutre 15, the responses were divided into three classes: decreasing exponential, decreasing linear, or all-or-none. In addition, Lai et al. 13 found patterns in the responses of some participants that corresponded to what they called a "moderate cliff model", which refers to using statistical significance as both a decision-making criterion and a measure of evidence 13.

In contrast to Poitevineau and Lecoutre 15, Lai et al. 13 concluded that the cliff effect is probably not just a byproduct of the all-or-none class, because the cliff models accounted for around 21% of the responses in NHST interpretation and around 33% of the responses in CI interpretation. Furthermore, a notable finding was that the prevalence of the cliff effect in CI interpretations was more than 50% higher than in NHST interpretations 13. Something similar was found in a study by Hoekstra, Johnson, and Kiers 16. They had also predicted that the cliff effect would be stronger for results presented in the NHST format than in the CI format, and, like Lai et al. 13, they actually found more evidence of a cliff effect in the CI format than in the NHST format 16.

The studies discussed so far seem to provide evidence for the existence of a cliff effect around p = 0.05. Table 1 shows an overview of evidence related to the cliff effect. Interestingly, in a recent study, Helske et al. 17 examined how various visualizations can aid in reducing the cliff effect when researchers interpret inferential statistics. They found that, compared to a textual representation of the CI with p-values and a classic CI visualization, adding more complex visual information to the classic CI representation seemed to decrease the cliff effect (i.e., dichotomous interpretations 17).

Although Bayesian methods have become more popular within different scientific fields 18,19, we know of no studies that have examined whether researchers' self-reported degree of belief in the existence of an effect when interpreting BFs shows a cliff effect similar to those obtained for p-values and CIs. Another matter conspicuously absent in previous examinations of the cliff effect is a comparison between the presentation methods used to investigate it. In some cliff effect studies the p-values were presented to the participants on separate pages 15, and in others the p-values were presented on the same page 13. It is possible that the cliff effect manifests itself in (some) researchers without explicit awareness, and that for those researchers presenting p-values/Bayes factors in isolation would lead to a cliff effect, whereas presenting all p-values/Bayes factors at once would lead to a cognitive override: when participants see their own cliff effect, they may think that they should not think dichotomously, and adjust their responses to be more in line with how they believe they should think, thereby removing their cliff effect. To our knowledge, no direct comparison of p-values/Bayes factors presented in isolation and presented all at once has yet been conducted. Therefore, to see whether the method matters, both presentation modes are included in the present study.

All of this gives rise to the following three research questions: (1) What is the relation between obtained statistical evidence and the degree of belief or confidence that there is a positive effect in the population of interest across participants? (2) What is the difference in this relationship when the statistical evidence is quantified through p -values versus Bayes factors? (3) What is the difference in this relationship when the statistical evidence is presented in isolation versus all at once?

In the present study, we will investigate the relationship between method (i.e., p-values and Bayes factors) and the degree of belief or confidence that there is a positive effect in the population of interest, with special attention to the cliff effect. We choose this specific wording ("positive effect in the population of interest") because we believe it is more specific than the phrasings used in previous cliff effect studies. We will examine the relationship between different levels of strength of evidence, using p-values or corresponding Bayes factors, and participants' degree of belief or confidence in two scenarios: (1) one in which values are presented in isolation (such that the functional form of the relationship across values is implicit to the participant) and (2) one in which all values are presented simultaneously (such that the functional form of the relationship across values is explicit to the participant).

In what follows, we will first describe the set-up of the present study. In the results section, we will explore the relationship between obtained statistical evidence and the degree of belief or confidence, and in turn, we will compare this relationship for p -values to the corresponding relationship for BFs. All of this will be done in scenarios in which researchers are either made aware or not made aware of the functional form of the relationship. In the discussion, we will discuss implications for applied researchers using p -values and/or BFs in order to quantify statistical evidence.

Ethics information

Our study protocol has been approved by the ethics committee of the University of Groningen, and our study complies with all relevant ethical regulations of the University of Groningen. Informed consent will be obtained from all participants. As an incentive for participating, we will raffle 10 Amazon vouchers worth 25 USD each among the participants who successfully complete our study.

Sampling plan

Our target population will consist of researchers in the social and behavioural sciences who are at least somewhat familiar with interpreting Bayes factors. We will obtain our prospective sample by collecting the e-mail addresses of (approximately) 2000 corresponding authors from the 20 journals in the social and behavioural sciences with the highest impact factors. Specifically, we will collect the e-mail addresses of 100 researchers per journal who published an article in that journal in 2021, starting with the first issue and continuing until we have 100 e-mail addresses per journal. We will contact the authors by e-mail, mentioning that we are looking for researchers who are familiar with interpreting Bayes factors: those who are familiar with interpreting Bayes factors are asked to participate in our study, and those who are not are asked to ignore our e-mail.

If the currently unknown response rate turns out to be too low to answer our research questions, we will collect additional e-mail addresses of corresponding authors from articles published in 2022 in the same 20 journals. Based on a projected response rate of 10%, we expect a final sample of 200 participants. This should be enough to obtain a BF higher than 10 in favor of an effect if the proportions differ by 0.2 (see section " Planned analyses " for details).

Materials and procedure

The relationship between the different magnitudes of p-values/BFs and the degree of belief or confidence will be examined in a scenario in which values are presented in isolation and in a scenario in which values are presented simultaneously. This results in four conditions: (1) p-value questions in the isolation scenario (isolated p-value), (2) BF questions in the isolation scenario (isolated BF), (3) p-value questions in the simultaneous scenario (all at once p-value), and (4) BF questions in the simultaneous scenario (all at once BF). To reduce boredom, and to avoid making the underlying goals of the study too apparent, each participant will be randomly assigned one of the four scenarios (i.e., all at once p-value, all at once BF, isolated p-value, or isolated BF), so the study has a between-person design.

The participants will receive an e-mail with an anonymous Qualtrics survey link. The first page of the survey will consist of the informed consent. We will ask all participants to indicate their level of familiarity with both Bayes factors and p-values on a 3-point scale ("completely unfamiliar/somewhat familiar/very familiar"), and we will include everyone who is at least somewhat familiar with both. To have a better picture of our sample population, we will include the following demographic variables in the survey: gender, main continent, career stage, and broad research area. We will then randomly assign respondents to one of four conditions (see below for a detailed description). After completing the content part of the survey, all respondents will receive a question asking them to provide their e-mail address if they are interested in (1) being included in the random draw of the Amazon vouchers, or (2) receiving information on our study outcomes.

In the isolated p -value condition, the following fabricated experimental scenario will be presented:

“Suppose you conduct an experiment comparing two independent groups, with n = 250 in each group. The null hypothesis states that the population means of the two groups do not differ. The alternative hypothesis states that the population mean in group 1 is larger than the population mean in group 2. Suppose a two-sample t test was conducted and a one-sided p value calculated.”

Then a set of possible findings of the fabricated experiment will be presented on separate pages. We varied the strength of evidence for the existence of a positive effect with the following ten p-values, in a random order: 0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.065, 0.131, 0.267, and 0.543. A screenshot of part of the isolated p-value questions is presented in S1 in the Supplementary Information.

In the all at once BF condition, a fabricated experimental scenario will be presented identical to that in the isolated p -value condition, except the last part is replaced by:

“Suppose a Bayesian two-sample t test was conducted and a one-sided Bayes factor (BF) calculated, with the alternative hypothesis in the numerator and the null hypothesis in the denominator, denoted BF 10 .”

A set of possible findings of the fabricated experiment will be presented on the same page. These findings vary in the strength of evidence for the existence of a positive effect, quantified with the following ten BF10 values, in the following order: 22.650, 12.008, 6.410, 3.449, 1.873, 1.027, 0.569, 0.317, 0.175, and 0.091. These BF values correspond one-to-one to the p-values presented in the isolated p-value condition (the R code for the findings of the fabricated experiment can be found at https://osf.io/sq3fp). A screenshot of part of the all at once BF questions can be found in S2 in the Supplementary Information.
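One plausible way to compute such a correspondence is sketched below (a simplified reconstruction; the actual script at https://osf.io/sq3fp may use different prior settings, so the numbers may not match the quoted BFs exactly): recover the t statistic implied by each one-sided p-value with df = 2 × 250 − 2 = 498, then convert it to a one-sided BF10.

  library(BayesFactor)

  p <- c(0.001, 0.002, 0.004, 0.008, 0.016,
         0.032, 0.065, 0.131, 0.267, 0.543)
  t_stats <- qt(1 - p, df = 2 * 250 - 2)    # t statistic implied by each p-value

  # One-sided BF10 for each t statistic, using the package default prior scale
  bf10 <- sapply(t_stats, function(tt)
    ttest.tstat(t = tt, n1 = 250, n2 = 250,
                nullInterval = c(0, Inf), simple = TRUE))
  round(bf10, 3)   # compare with the ten BF10 values quoted above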

In both conditions, the respondents will be asked to rate their degree of belief or confidence that there is a positive effect in the population of interest based on these findings on a scale ranging from 0 (completely convinced that there is no effect), through 50 (somewhat convinced that there is a positive effect), to 100 (completely convinced that there is a positive effect).

The other two conditions (i.e., the isolated BF condition and the all at once p-value condition) are analogous to the conditions described above: in the isolated BF condition, the findings of the fabricated experiment for the BF questions will be presented on separate pages in a random order, and in the all at once p-value condition, the findings for the p-value questions will be presented on the same page in a non-random order.

To keep things as simple as possible for the participants, all fictitious scenarios will include a two-sample t test with either a one-tailed p -value or a BF. The total sample size will be large ( n  = 250 in each group) in order to have sufficiently large power to detect even small effects.

Planned analyses

Poitevineau and Lecoutre 15 suggested the following three models for the relationship between the different levels of statistical evidence and researchers' subjective belief that a non-zero effect exists: all-or-none (y = a for p < 0.05, y = b for p ≥ 0.05), linear (y = a + bp), and exponential (y = exp(a + bp)). In addition, Lai et al. 13 suggested the moderate cliff model (a more gradual version of all-or-none), which they did not define more specifically. In the study by Lai et al. 13 (Fig. 4), the panel representing the moderate cliff seems to be a combination of the exponential and the all-or-none functions. In the present study, we will classify responses as moderate cliff if we observe a steep drop in the degree of belief or confidence around a certain p-value/BF, while for the remaining p-values/BFs the decline in confidence is more gradual. So, for example, a combination of the decreasing linear and the all-or-none functions will also be classified as moderate cliff in the present study. Plots of the four models with examples of reasonable parameter choices are presented in Fig. 1 (the R code for Fig. 1 can be found at https://osf.io/j6d8c).

Figure 1. Plots of fictitious outcomes for the four models (all-or-none, linear, exponential, and moderate cliff). The x-axis represents the different p-values (in the two BF conditions, the different BF values). The y-axis represents the degree of belief or confidence that there is a positive effect in the population of interest. Note that these are prototype responses; different variations on these response patterns are possible.
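Prototype curves of this kind can be sketched in a few lines of R; the parameter values below are illustrative choices, not the ones used for Fig. 1.

  p <- seq(0.001, 0.543, length.out = 200)

  all_or_none <- ifelse(p < 0.05, 95, 5)       # y = a for p < 0.05, y = b otherwise
  linear      <- 90 - 120 * p                  # y = a + b*p
  exponential <- 100 * exp(-8 * p)             # y = exp(a + b*p), rescaled to 0-100
  moderate    <- 60 * exp(-3 * p) + 30 * (p < 0.05)  # gradual decline plus a drop

  matplot(p, cbind(all_or_none, linear, exponential, moderate), type = "l",
          lty = 1, xlab = "p-value", ylab = "confidence (0-100)")
  legend("topright", legend = c("all-or-none", "linear", "exponential",
                                "moderate cliff"), col = 1:4, lty = 1)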

We will manually classify each participant's data for each scenario as one of the relationship models. We will do so with the coders blinded to the conditions associated with the data. Specifically, author JM will organize the data from each of the four conditions and remove the p-value or BF labels. Subsequently, authors DvR and RH will classify the data independently from one another. To improve the objectivity of the classification, authors DvR and RH will classify the data according to specific instructions constructed before the data are collected (see Appendix 1). After coding, we will compute Cohen's kappa for these data. For each set of scores per condition per subject on which there is no agreement on classification, authors DvR and RH will try to reach consensus in a discussion of no longer than 5 min. If after this discussion no agreement is reached, author DF will classify these data: if DF chooses the same class as either DvR or RH, the data will be classified accordingly; if DF chooses another class, the data will be placed in a so-called rest category. This rest category will also include data that deviate extremely from the four relationship models, and we will assess such data by running exploratory analyses. Before classifying the real data, we will conduct a small pilot study to give authors DvR and RH the opportunity to practice classifying the data. In the Qualtrics survey, respondents cannot continue to the next question without answering the current one. It is possible, however, that some respondents will quit partway through the survey. The responses of participants who did not answer all questions will be removed from the dataset. This means that we will use complete case analysis to deal with missing data, because we do not expect specific patterns in the missing values.

Our approach to answer Research Question 1 (RQ1; “What is the relation between obtained statistical evidence and the degree of belief or confidence that there is a positive effect in the population of interest across participants?”) will be descriptive in nature. We will explore the results visually, by assessing the four models (i.e., all-or-none, linear, exponential, and moderate cliff) in each of the four conditions (i.e., isolated p -value, all at once p -value, isolated BF, and all at once BF), followed by zooming in on the classification ‘cliff effect’. This means that we will compare the frequency of the four classification models with one another within each of the four conditions.

In order to answer Research Question 2 (RQ2; "What is the difference in this relationship when the statistical evidence is quantified through p-values versus Bayes factors?"), we will first combine categories as follows: the p-value condition will encompass the data from both the isolated and the all at once p-value conditions, and the BF condition will encompass the data from both the isolated and the all at once BF conditions. Furthermore, the cliff condition will encompass the all-or-none and the moderate cliff models, and the non-cliff condition will encompass the linear and the exponential models. This classification ensures that we distinguish between curves that reflect a sudden change in the relationship between the level of statistical evidence and the degree of confidence that a positive effect exists in the population of interest, and those that represent a gradual relationship between the two. We will then compare the proportions of cases with a cliff in the p-value conditions to those in the BF conditions, and we will add inferential information for this comparison by means of a Bayesian chi square test on the 2 × 2 table (p-value/BF × cliff/non-cliff), as specified below.

Finally, in order to answer Research Question 3 (RQ3; "What is the difference in this relationship when the statistical evidence is presented in isolation versus all at once?"), we will again combine categories, as follows: the isolation condition will encompass the data from both the isolated p-value and the isolated BF conditions, and the all at once condition will encompass the data from both the all at once p-value and the all at once BF conditions. The cliff/non-cliff distinction is made analogously to the one employed for RQ2. We will then compare the proportions of cases with a cliff in the isolated conditions to those in the all at once conditions, and we will add inferential information for this comparison by means of a Bayesian chi square test on the 2 × 2 table (all at once/isolated × cliff/non-cliff), as specified below.

For both chi square tests, the null hypothesis states that there is no difference in the proportion of cliff classifications between the two conditions, and the alternative hypothesis states that there is a difference in the proportion of cliff classifications between the two conditions. Under the null hypothesis, we specify a single beta(1,1) prior for the proportion of cliff classifications and under the alternative hypothesis we specify two independent beta(1,1) priors for the proportion of cliff classifications 20 , 21 . A beta(1,1) prior is a flat or uniform prior from 0 to 1. The Bayes factor that will result from both chi square tests gives the relative evidence for the alternative hypothesis over the null hypothesis (BF 10 ) provided by the data. Both tests will be carried out in RStudio 22 (the R code for calculating the Bayes factors can be found on https://osf.io/5xbzt ). Additionally, the posterior of the difference in proportions will be provided (the R code for the posterior of the difference in proportions can be found on https://osf.io/3zhju ).
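For two independent binomial counts, this test has a closed form, because the beta(1,1) integrals can be evaluated analytically. The sketch below is a simplified rendering of that computation (the full script at https://osf.io/5xbzt may differ in detail); the counts in the usage line are hypothetical.

  # BF10 for two independent proportions: under H0 one common rate with a
  # beta(1,1) prior, under H1 two independent rates, each with a beta(1,1) prior
  bf10_2x2 <- function(k1, n1, k2, n2) {
    log_m1 <- lchoose(n1, k1) + lbeta(k1 + 1, n1 - k1 + 1) +
              lchoose(n2, k2) + lbeta(k2 + 1, n2 - k2 + 1)
    log_m0 <- lchoose(n1, k1) + lchoose(n2, k2) +
              lbeta(k1 + k2 + 1, n1 + n2 - k1 - k2 + 1)
    exp(log_m1 - log_m0)
  }

  bf10_2x2(20, 70, 10, 70)   # hypothetical counts: 20/70 vs 10/70 cliffs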

If, after computing results on the obtained sample, we observe that our BFs are neither higher than 10 nor smaller than 0.1, we will expand our sample in the way explained at the end of section "Sampling plan". To see whether this approach is likely to lead to useful results, we conducted a Bayesian power simulation study for the case of population proportions of 0.2 and 0.4 (e.g., a 20% cliff effect in the p-value group and a 40% cliff effect in the BF group) in order to determine the Bayesian power for reaching the BF threshold with a sample size of n = 200. Our results show that for proportions of 0.2 and 0.4 in the two populations, our estimated sample size of 200 participants (a 10% response rate) would reach a BF threshold 96% of the time, suggesting very high power under this alternative hypothesis. We also conducted a Bayesian power simulation study for the case of population proportions of 0.3 in both groups (i.e., a 30% cliff effect in both the p-value group and the BF group) in order to determine how often a threshold would be reached when there is no effect. The results show that for proportions of 0.3 in both populations, our estimated sample size of 200 participants would reach a BF threshold 7% of the time, while under the more optimistic scenario of a 20% response rate, a sample size of 400 participants would reach a BF threshold 70% of the time (the R code for the power simulation can be found at https://osf.io/vzdce). It is well known that it is harder to find strong evidence for the absence of an effect than for its presence 23. In light of this, we deem a 70% chance of reaching a BF threshold under the null hypothesis, given a 20% response rate, acceptable. If, after sampling the first 2000 participants and factoring in the response rate, we have not reached either BF threshold, we will continue sampling participants in increments of 200 (10 per journal) until we reach a BF threshold, or until we have an effective sample size of 400, or until we reach a total of 4000 participants.
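The following sketch shows the kind of simulation described above, reusing the closed-form BF from the previous sketch; it is a simplified reconstruction under the stated proportions (the full simulation code is at https://osf.io/vzdce).

  bf10_2x2 <- function(k1, n1, k2, n2) {
    log_m1 <- lchoose(n1, k1) + lbeta(k1 + 1, n1 - k1 + 1) +
              lchoose(n2, k2) + lbeta(k2 + 1, n2 - k2 + 1)
    log_m0 <- lchoose(n1, k1) + lchoose(n2, k2) +
              lbeta(k1 + k2 + 1, n1 + n2 - k1 - k2 + 1)
    exp(log_m1 - log_m0)
  }

  set.seed(2023)
  n_grp <- 100                       # 200 participants split over two conditions
  bfs <- replicate(5000, {
    k1 <- rbinom(1, n_grp, 0.4)      # 40% cliff responses in one condition
    k2 <- rbinom(1, n_grp, 0.2)      # 20% cliff responses in the other
    bf10_2x2(k1, n_grp, k2, n_grp)
  })
  mean(bfs > 10 | bfs < 0.1)         # share of runs reaching either BF threshold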

In sum, RQ1 is exploratory in nature, so we will descriptively explore the patterns in our data. For RQ2, we will determine what proportion of applied researchers make a binary distinction regarding the existence of a positive effect in the population of interest, and we will test whether this binary distinction is different when research results are expressed in the p -value versus the BF condition. Finally, for RQ3, we will determine whether this binary distinction is different in the isolated versus all at once condition (see Table 2 for a summary of the study design).

Sampling process

We deviated from our preregistered sampling plan in the following way: we collected the e-mail addresses of all corresponding authors who published in the 20 social and behavioural science journals in 2021 and 2022 at the same time. In total, we contacted 3152 academics, and 89 of them completed our survey (i.e., 2.8% of the contacted academics). We computed the BFs based on the responses of these 89 academics; the BF for RQ2 was BF10 = 16.13 and the BF for RQ3 was BF10 = 0.39, so the latter was neither higher than 10 nor smaller than 0.1.

In order to reach at least 4000 potential participants (see “ Planned analyses ” section), we decided to collect additional e-mail addresses of corresponding authors from articles published in 2019 and 2020 in the same 20 journals. In total, we thus reached another 2247 academics (total N = 5399), and 50 of them completed our survey (i.e., 2.2% of the contacted academics, effective N = 139).

In light of the large number of academics we had contacted at this point, we decided to do an ‘interim power analysis’ to calculate the upper and lower bounds of the BF for RQ3 to see if it made sense to continue collecting data up to N = 200. The already collected data of 21 cliffs out of 63 in the isolated conditions and 13 out of 65 in the all-at-once conditions yields a Bayes factor of 0.8 (see “ Results ” section below). We analytically verified that by increasing the number of participants to a total of 200, the strongest possible pro-null evidence we can get given the data we already had would be BF 10  = 0.14, or BF 01  = 6.99 (for 21 cliffs out of 100 in both conditions). In light of this, our judgment was that it was not the best use of human resources to continue collecting data, so we proceeded with a final sample of N = 139.

To summarize our sampling procedure, we contacted 5399 academics in total. Via Qualtrics, 220 participants responded. After removing the responses of participants who did not complete the content part of our survey (i.e., the questions about the p-values or BFs), 181 cases remained. After removing the cases who were completely unfamiliar with p-values, 177 cases remained. After removing the cases who were completely unfamiliar with BFs, 139 cases remained. Note that many people also responded via e-mail to inform us that they were not familiar with interpreting BFs. Since the Qualtrics survey was anonymous, we could not determine the overlap between the people who told us via e-mail and those who indicated in Qualtrics that they were unfamiliar with interpreting BFs.

Of the 5399 academics we contacted, N = 139 filled out the survey completely, i.e., 2.6% of the total sample (note that this percentage reflects both the response rate and our requirement that researchers self-report familiarity with interpreting BFs). Our entire Qualtrics survey can be found at https://osf.io/6gkcj. Five "difficult to classify" pilot plots were created so that authors RH and DvR could practice before classifying the real data. These plots can be found at https://osf.io/ndaw6/ (see folder "Pilot plots"). Authors RH and DvR had a qualitative discussion about these plots; however, no adjustments were made to the classification protocol. We manually classified each participant's data for each scenario as one of the relationship models (i.e., all-or-none, moderate cliff, linear, and exponential). Author JM organized the data from each of the four conditions and removed the p-value or BF labels. Authors RH and DvR classified the data according to the protocol provided in Appendix 1, and the plot for each participant (including the condition each participant was in and the model to which each participant was assigned) can be found in Appendix 2. After coding, Cohen's kappa was computed for these data and was equal to κ = 0.47. Authors RH and DvR independently reached the same conclusion for 113 out of 139 data sets (i.e., 81.3%). For the remaining 26 data sets, RH and DvR were able to reach consensus within 5 min per data set, as laid out in the protocol. In Fig. 2, plots are provided that include the prototype lines as well as the actual responses plotted along with them; this way, all responses can be seen at once, along with how they match up with the prototype response for each category. To have a better picture of our sample population, we included the following demographic variables in the survey: gender, main continent, career stage, and broad research area. The results are presented in Table 3. Based on these results, most of the respondents who filled out our survey were male (71.2%), living in Europe (51.1%), had a faculty position (94.1%), and were working in the field of psychology (56.1%). The total responses (i.e., including those of respondents who quit partway through the survey) were very similar to the responses of the respondents who completed it.
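For reference, Cohen's kappa corrects the observed agreement between two coders for the agreement expected by chance. A minimal sketch (the 4 × 4 coder-by-coder table below is hypothetical, for illustration only):

  cohens_kappa <- function(tab) {
    n  <- sum(tab)
    po <- sum(diag(tab)) / n                      # observed agreement
    pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance-expected agreement
    (po - pe) / (1 - pe)
  }

  # Hypothetical counts of classifications by coder 1 (rows) vs coder 2 (columns)
  tab <- matrix(c(20,  3,  1, 0,
                   4, 45,  6, 1,
                   2,  7, 40, 2,
                   0,  1,  2, 5), nrow = 4, byrow = TRUE)
  cohens_kappa(tab)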

Figure 2. Plots including the prototype lines and the actual responses.

To answer RQ1 (“What is the relation between obtained statistical evidence and the degree of belief or confidence that there is a positive effect in the population of interest across participants?”), we compared the frequency of the four classification models (i.e., all-or-none, moderate cliff, linear, and exponential) with one another within each of the four conditions (i.e., all at once and isolated p -values, and all at once and isolated BFs). The results are presented in Table 4 . In order to enhance the interpretability of the results in Table 4 , we have plotted them in Fig.  3 .

Figure 3. Plotted frequency of classification models within each condition.

We observe that within the all at once p -value condition, the cliff models accounted for a proportion of (0 + 11)/33 = 0.33 of the responses. The non-cliff models accounted for a proportion of (1 + 21)/33 = 0.67 of the responses. Looking at the isolated p -value condition, we can see that the cliff models accounted for a proportion of (1 + 15)/35 = 0.46 of the responses. The non-cliff models accounted for a proportion of (0 + 19)/35 = 0.54 of the responses. In the all at once BF condition, we observe that the cliff models accounted for a proportion of (2 + 0)/32 = 0.06 of the responses. The non-cliff models accounted for a proportion of (0 + 30)/32 = 0.94 of the responses. Finally, we observe that within the isolated BF condition, the cliff models accounted for a proportion of (2 + 3)/28 = 0.18 of the responses. The non-cliff models accounted for a proportion of (0 + 23)/28 = 0.82 of the responses.

Thus, we observed a higher proportion of cliff models in p -value conditions than in BF conditions (27/68 = 0.40 vs 7/60 = 0.12), and we observed a higher proportion of cliff models in isolated conditions than in all-at-once conditions (21/63 = 0.33 vs 13/65 = 0.20). Next, we conducted statistical inference to dive deeper into these observations.

To answer RQ2 ("What is the difference in this relationship when the statistical evidence is quantified through p-values versus Bayes factors?"), we compared the sample proportions mentioned above (27/68 = 0.40 and 7/60 = 0.12, respectively, a difference of 0.40 − 0.12 = 0.28), and we tested whether the proportion of cliff classifications in the p-value conditions differed from that in the BF conditions in the population by means of a Bayesian chi square test. For this test, the null hypothesis was that there is no difference in the proportion of cliff classifications between the two conditions, and the alternative hypothesis was that there is a difference.

The BF that resulted from the chi square test was equal to BF 10  = 140.01 and gives the relative evidence for the alternative hypothesis over the null hypothesis provided by the data. This means that the data are 140.01 times more likely under the alternative hypothesis than under the null hypothesis: we found strong support for the alternative hypothesis that there is a difference in the proportion of cliff classifications between the p -value and BF condition. Inspection of Table 4 or Fig.  3 shows that the proportion of cliff classifications is higher in the p -value conditions.

Additionally, the posterior distribution of the difference in proportions is provided in Fig.  4 , and the 95% credible interval was found to be [0.13, 0.41]. This means that there is a 95% probability that the population parameter for the difference of proportions of cliff classifications between p -value conditions and BF conditions lies within this interval, given the evidence provided by the observed data.

Figure 4. The posterior density of the difference of proportions of cliff models in p-value conditions versus BF conditions.
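This posterior can be reproduced directly by sampling, since beta(1,1) priors combined with the observed counts (27/68 cliffs in the p-value conditions, 7/60 in the BF conditions) yield beta posteriors. A simplified sketch (the full script is at https://osf.io/3zhju):

  set.seed(1)
  theta_p  <- rbeta(1e5, 1 + 27, 1 + 41)   # p-value conditions: 27 cliffs, 41 non-cliffs
  theta_bf <- rbeta(1e5, 1 + 7,  1 + 53)   # BF conditions: 7 cliffs, 53 non-cliffs
  delta <- theta_p - theta_bf
  quantile(delta, c(0.025, 0.975))         # close to the reported [0.13, 0.41]

Plugging the same counts into the closed-form sketch from the planned-analyses section, bf10_2x2(27, 68, 7, 60), gives a value close to the reported BF10 = 140.01.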

To answer RQ3 ("What is the difference in this relationship when the statistical evidence is presented in isolation versus all at once?"), we compared the sample proportions mentioned above (21/63 = 0.33 vs 13/65 = 0.20, respectively, a difference of 0.33 − 0.20 = 0.13), and we tested whether the proportion of cliff classifications in the isolated conditions differed from that in the all at once conditions in the population by means of a Bayesian chi square test analogous to the test above.

The BF that resulted from the chi square test was equal to BF 10 = 0.81 and gives the relative evidence for the alternative hypothesis over the null hypothesis provided by the data. This means that the data are only 0.81 times as likely under the alternative hypothesis as under the null hypothesis: the evidence on whether there is a difference in the proportion of cliff classifications between the isolation and all at once conditions is ambiguous.

Additionally, the posterior distribution of the difference in proportions is provided in Fig.  5 . The 95% credible interval is [− 0.28, 0.02].

Figure 5. The posterior density of the difference of proportions of cliff models in all at once conditions versus isolated conditions.

There were 11 respondents whose responses deviated extremely from the four relationship models; they were placed in the rest category and left out of the analyses. Eight of these were in the isolated BF condition, one was in the isolated p-value condition, one was in the all at once BF condition, and one was in the all at once p-value condition. For five of them, the outcomes showed a roughly decreasing trend with large bumps; for four, there were one or more considerable increases in the plotted outcomes; and for two, the line was flat. All these graphs are available in Appendix 2.

In the present study, we explored the relationship between obtained statistical evidence and the degree of belief or confidence that there is a positive effect in the population of interest. We were particularly interested in the existence of a cliff effect. We compared this relationship for p-values to the relationship for corresponding degrees of evidence quantified through Bayes factors, and we examined whether this relationship was affected by two different modes of presentation. In the isolated presentation mode, the functional form of the relationship across values was not visible to the participants, whereas in the all-at-once presentation mode such a functional form could easily be seen.

The observed proportion of cliff models was substantially higher for the p-values than for the BFs, and the credible interval as well as the high BF test value indicate that a (substantial) difference will also hold more generally at the population level. Based on our literature review (summarized in Table 1), we know of no studies that have compared the prevalence of the cliff effect when interpreting p-values to that when interpreting BFs, so we believe this part is new in the literature. Our findings are, however, consistent with previous literature regarding the presence of a cliff effect when using p-values. Although we observed a higher proportion of cliff models for isolated presentation than for all-at-once presentation, the present results give no clear indication of whether this difference in proportions also holds at the population level. We believe that this comparison between the presentation methods used to investigate the cliff effect is also new: in previous research, the p-values were presented on separate pages in some studies 15, while in other studies they were presented on the same page 13.

We deviated from our preregistered sampling plan by collecting the e-mail addresses of all corresponding authors who published in the 20 social and behavioural science journals in 2021 and 2022 simultaneously, rather than sequentially. We do not believe that this approach created any bias in our study results. Furthermore, we decided that it would not make sense to collect additional data (after approaching 5399 academics who published in 2019, 2020, 2021, and 2022 in the 20 journals) in order to reach an effective sample size of 200. Based on our interim power analysis, the strongest possible pro-null evidence we could obtain by continuing data collection up to an effective sample size of 200, given the data we already had, would be BF 10 = 0.14, or BF 01 = 6.99. We therefore decided that it would be unethical to continue collecting additional data.

There were several limitations to this study. Firstly, the response rate was very low, probably because many of the academics we contacted were not familiar with interpreting Bayes factors. It is important to note that our findings apply only to researchers who are at least somewhat familiar with interpreting Bayes factors, and our sample probably does not represent the average researcher in the social and behavioural sciences. Indeed, it is quite possible that people who are less familiar with Bayes factors (and possibly with statistics in general) would give responses even more strongly in line with cliff models, because we expect that researchers who exhibit a cliff effect will generally have less statistical expertise or understanding: there is nothing special about particular p-value or Bayes factor thresholds that merits a qualitative drop in the perceived strength of evidence. Furthermore, a salient finding was that the proportion of graduate students in our sample was very small. In our sample, 25% of graduate students and 23% of more senior researchers showed a cliff effect. Although we see no clear difference in our sample, we cannot rule out that our findings might be different if the proportion of graduate students were higher.

There were several limitations related to the survey. Some of the participants mentioned via e-mail that insufficient information was provided in the scenarios; for example, we did not provide effect sizes or any information about the research topic. We had decided to leave out this information to make sure that the participants would focus only on the p-values and the Bayes factors. Furthermore, the questions in our survey referred to posterior probabilities, and one respondent noted that, without being able to evaluate the prior plausibility of the rival hypotheses, the questions were difficult to answer. Although this observation is correct, we think that many respondents nevertheless believe they can make such judgments.

The respondents could indicate their degree of belief or confidence that there is a positive effect in the population of interest, based on the fictitious findings, on a scale ranging from 0 (completely convinced that there is no effect), through 50 (somewhat convinced that there is a positive effect), to 100 (completely convinced that there is a positive effect). One respondent mentioned that it is unclear where the midpoint lies between being somewhat convinced that there is no effect and being somewhat convinced that there is a positive effect, which biases the scale towards a positive response. Another respondent mentioned that there was no way to indicate having no confidence in either the null or the alternative hypothesis. Although this is true, we do not think that many participants experienced this as problematic.

In our exploratory analyses we observed that eight of the eleven unclassifiable responses were in the isolated BF condition. In our survey, the all at once and isolated presentation conditions differed not only in the way the pieces of statistical evidence were presented, but also in their order: in the all at once conditions, the different pieces were presented in sequential order, while in the isolated conditions they were presented in a random order. This might explain why the isolated BF condition contained most of the unclassifiable responses: academics may be more familiar with single p-values and can more easily place them along a line of "possible values" even when they are presented out of order.

This study indicates that a substantial proportion of researchers who are at least somewhat familiar with interpreting BFs experience a sharp drop in their confidence that an effect exists around certain p-values, and to a much smaller extent around certain Bayes factor values. But how do people act on these beliefs? In a recent study by Muradchanian et al. 24, it was shown that editors, reviewers, and authors alike are much less likely to accept for publication, endorse, and submit papers with non-significant results than papers with significant results, suggesting that these beliefs about the existence of an effect translate into considering certain findings more publication-worthy.

Allowing for these caveats, our findings showed that cliff models were more prevalent when interpreting p-values than when interpreting BFs, based on a sample of academics who were at least somewhat familiar with interpreting BFs. However, the high prevalence of the non-cliff models (i.e., linear and exponential) implies that p-values do not necessarily entail dichotomous thinking for everyone. Nevertheless, it is important to note that the cliff models still accounted for 37.5% of the responses to p-values, whereas for BFs the cliff models accounted for only 12.3% of the responses.

We note that dichotomous thinking has a place in interpreting scientific evidence, for instance in the context of decision criteria (if the evidence is more compelling than some a priori agreed level, then we bring this new medicine to the market), or in the context of sampling plans (we stop collecting data once the evidence or level of certainty hits some a priori agreed level). However, we claim that it is not rational for someone's subjective belief that some effect is non-zero to make a big jump around, for example, a p-value of 0.05 or a BF of 10, but not at any other point along the range of potential values.

Based on our findings, one might think that replacing p-values with BFs would be sufficient to overcome dichotomous thinking. We think that this is probably too simplistic: rejecting or not rejecting a null hypothesis is probably so deep-seated in academic culture that dichotomous thinking might, over time, become more and more prevalent in the interpretation of BFs as well. In addition to using tools such as p-values or BFs, we agree with Lai et al. 13 that ways to overcome dichotomous thinking include teaching (future) academics to formulate research questions that require quantitative answers (for example, evaluating the extent to which therapy A is superior to therapy B, rather than only evaluating whether therapy A is superior to therapy B) and adopting effect size estimation in addition to statistical hypotheses in both thinking and communication.

In light of these results regarding dichotomous thinking among researchers, future research could focus, for example, on the development of comprehensive teaching methods aimed at cultivating the skills necessary for formulating research questions that require quantitative answers. Pedagogical methods and curricula that encourage adopting effect size estimation, in addition to statistical hypotheses, in both thinking and communication could also be investigated.

Data availability

The raw data are available within the OSF repository: https://osf.io/ndaw6/ .

Code availability

For the generation of the p -values and BFs, the R file “2022-11-04 psbfs.R” can be used; for Fig.  1 , the R file “2021-06-03 ProtoCliffPlots.R” can be used; for the posterior for the difference between the two proportions in RQ2 and RQ3, the R file “2022-02-17 R script posterior for difference between two proportions.R” can be used; for the Bayesian power simulation, the R file “2022-11-04 Bayes Power Sim Cliff.R” can be used; for calculating the Bayes factors in RQ2 and RQ3 the R file “2022-10-21 BFs RQ2 and RQ3.R” can be used; for the calculation of Cohen’s kappa, the R file “2023-07-23 Cohens kappa.R” can be used; for data preparation, the R file “2023-07-23 data preparation.R” can be used; for Fig.  2 , the R file “2024-03-11 data preparation including Fig.  2 .R” can be used; for the interim power analysis, the R file “2024-03-16 Interim power analysis.R” can be used; for Fig.  3 , the R file “2024-03-16 Plot for Table 4 R” can be used. The R codes were written in R version 2022.2.0.443, and are uploaded as part of the supplementary material. These R codes are made available within the OSF repository: https://osf.io/ndaw6/ .

Lakens, D. Why p-Values Should be Interpreted as p-Values and Not as Measures of Evidence [Blog Post] . http://daniellakens.blogspot.com/2021/11/why-p-values-should-be-interpreted-as-p.html . Accessed 20 Nov 2021.

Jeffreys, H. Theory of Probability (Clarendon Press, 1939).


van Ravenzwaaij, D. & Etz, A. Simulation studies as a tool to understand Bayes factors. Adv. Methods Pract. Psychol. Sci. 4 , 1–20. https://doi.org/10.1177/2515245920972624 (2021).


Wetzels, R. et al. Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspect. Psychol. Sci. 6 , 291–298. https://doi.org/10.1177/1745691611406923 (2011).


Dhaliwal, S. & Campbell, M. J. Misinterpreting p -values in research. Austral. Med. J. 1 , 1–2. https://doi.org/10.4066/AMJ.2009.191 (2010).

Greenland, S. et al. Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. Eur. J. Epidemiol. 31 , 337–350. https://doi.org/10.1007/s10654-016-0149-3 (2016).


Wasserstein, R. L. & Lazar, N. A. The ASA statement on p -values: context, process, and purpose. Am. Stat. 70 , 129–133. https://doi.org/10.1080/00031305.2016.1154108 (2016).


Rosenthal, R. & Gaito, J. The interpretation of levels of significance by psychological researchers. J. Psychol. Interdiscipl. Appl. 55 , 33–38. https://doi.org/10.1080/00223980.1963.9916596 (1963).

Rosenthal, R. & Gaito, J. Further evidence for the cliff effect in interpretation of levels of significance. Psychol. Rep. 15 , 570. https://doi.org/10.2466/pr0.1964.15.2.570 (1964).

Beauchamp, K. L. & May, R. B. Replication report: Interpretation of levels of significance by psychological researchers. Psychol. Rep. 14 , 272. https://doi.org/10.2466/pr0.1964.14.1.272 (1964).

Minturn, E. B., Lansky, L. M. & Dember, W. N. The Interpretation of Levels of Significance by Psychologists: A Replication and Extension. Quoted in Nelson, Rosenthal, & Rosnow, 1986. (1972).

Nelson, N., Rosenthal, R. & Rosnow, R. L. Interpretation of significance levels and effect sizes by psychological researchers. Am. Psychol. 41 , 1299–1301. https://doi.org/10.1037/0003-066X.41.11.1299 (1986).

Lai, J., Kalinowski, P., Fidler, F., & Cumming, G. Dichotomous thinking: A problem beyond NHST. in Data and Context in Statistics Education: Towards an Evidence Based Society , 1–4. http://icots.info/8/cd/pdfs/contributed/ICOTS8_C101_LAI.pdf (2010).

Cumming, G. Statistics education in the social and behavioural sciences: From dichotomous thinking to estimation thinking and meta-analytic thinking. in International Association of Statistical Education , 1–4 . https://www.stat.auckland.ac.nz/~iase/publications/icots8/ICOTS8_C111_CUMMING.pdf (2010).

Poitevineau, J. & Lecoutre, B. Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated. Psychon. Bull. Rev. 8 , 847–850. https://doi.org/10.3758/BF03196227 (2001).

Article   CAS   PubMed   Google Scholar  

Hoekstra, R., Johnson, A. & Kiers, H. A. L. Confidence intervals make a difference: Effects of showing confidence intervals on inferential reasoning. Educ. Psychol. Meas. 72 , 1039–1052. https://doi.org/10.1177/0013164412450297 (2012).

Helske, J., Helske, S., Cooper, M., Ynnerman, A. & Besancon, L. Can visualization alleviate dichotomous thinking: Effects of visual representations on the cliff effect. IEEE Trans. Vis. Comput. Graph. 27 , 3379–3409. https://doi.org/10.1109/TVCG.2021.3073466 (2021).

van de Schoot, R., Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M. & Depaoli, S. A systematic review of Bayesian articles in psychology: The last 25 years. Psychol. Methods 22 , 217–239. https://doi.org/10.1037/met0000100 (2017).

Lartillot, N. & Philippe, H. Computing Bayes factors using thermodynamic integration. Syst. Biol. 55 , 195–207. https://doi.org/10.1080/10635150500433722 (2006).

Gunel, E. & Dickey, J. Bayes factors for independence in contingency tables. Biometrika 61 , 545–557. https://doi.org/10.2307/2334738 (1974).

Jamil, T. et al. Default, “Gunel and Dickey” Bayes factors for contingency tables. Behav. Res. Methods 49 , 638–652. https://doi.org/10.3758/s13428-016-0739-8 (2017).

RStudio Team. RStudio: Integrated Development Environment for R . RStudio, PBC. http://www.rstudio.com/ (2022).

van Ravenzwaaij, D. & Wagenmakers, E.-J. Advantages masquerading as “issues” in Bayesian hypothesis testing: A commentary on Tendeiro and Kiers (2019). Psychol. Methods 27 , 451–465. https://doi.org/10.1037/met0000415 (2022).

Muradchanian, J., Hoekstra, R., Kiers, H. & van Ravenzwaaij, D. The role of results in deciding to publish. MetaArXiv. https://doi.org/10.31222/osf.io/dgshk (2023).

Download references

Acknowledgements

We would like to thank Maximilian Linde for writing R code which we could use to collect the e-mail addresses of our potential participants. We would also like to thank Julia Bottesini and an anonymous reviewer for helping us improve the quality of our manuscript.

Author information

Authors and Affiliations

Behavioural and Social Sciences, University of Groningen, Groningen, The Netherlands

Jasmine Muradchanian, Rink Hoekstra, Henk Kiers & Don van Ravenzwaaij

Psychology, Rowan University, Glassboro, USA

Dustin Fife


Contributions

J.M., R.H., H.K., D.F., and D.v.R. meet the following authorship conditions: substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data; or the creation of new software used in the work; or have drafted the work or substantively revised it; and approved the submitted version (and any substantially modified version that involves the author's contribution to the study); and agreed both to be personally accountable for the author's own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated, resolved, and the resolution documented in the literature. J.M. participated in data/statistical analysis, participated in the design of the study, drafted the manuscript and critically revised the manuscript; R.H. participated in data/statistical analysis, participated in the design of the study, and critically revised the manuscript; H.K. participated in the design of the study, and critically revised the manuscript; D.F. participated in the design of the study, and critically revised the manuscript; D.v.R. participated in data/statistical analysis, participated in the design of the study, and critically revised the manuscript.

Corresponding author

Correspondence to Jasmine Muradchanian .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

  • Supplementary Information 1
  • Supplementary Information 2
  • Supplementary Information 3
  • Supplementary Information 4

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Muradchanian, J., Hoekstra, R., Kiers, H. et al. Comparing researchers’ degree of dichotomous thinking using frequentist versus Bayesian null hypothesis testing. Sci Rep 14 , 12120 (2024). https://doi.org/10.1038/s41598-024-62043-w


Received : 07 June 2022

Accepted : 09 May 2024

Published : 27 May 2024

DOI : https://doi.org/10.1038/s41598-024-62043-w


  • Research article
  • Open access
  • Published: 20 May 2024

ABO and Rhesus blood groups and multiple health outcomes: an umbrella review of systematic reviews with meta-analyses of observational studies

  • Fang-Hua Liu 1 , 2   na1 ,
  • Jia-Kai Guo 1 , 3   na1 ,
  • Wei-Yi Xing 1 , 2   na1 ,
  • Xue-Li Bai 1 , 4   na1 ,
  • Yu-Jiao Chang 1 , 2   na1 ,
  • Zhao Lu 1 , 5 ,
  • Miao Yang 1 , 6 ,
  • Ying Yang 1 , 7 ,
  • Wen-Jing Li 1 , 8 ,
  • Xian-Xian Jia 1 , 6 ,
  • Tao Zhang 1 , 5 ,
  • Jing Yang 1 , 9 ,
  • Jun-Tong Chen 10 ,
  • Song Gao 4 ,
  • Lang Wu 11 ,
  • De-Yu Zhang 4 ,
  • Chuan Liu 4 ,
  • Ting-Ting Gong 4 &
  • Qi-Jun Wu   ORCID: orcid.org/0000-0001-9421-5114 1 , 2 , 4 , 12  

BMC Medicine volume 22, Article number: 206 (2024)


Numerous studies have been conducted to investigate the relationship between ABO and Rhesus (Rh) blood groups and various health outcomes. However, a comprehensive evaluation of the robustness of these associations is still lacking.

We searched PubMed, Web of Science, Embase, Scopus, Cochrane, and several regional databases from their inception until Feb 16, 2024, to identify systematic reviews with meta-analyses of observational studies exploring associations between ABO and Rh blood groups and diverse health outcomes. For each association, we calculated the summary effect size with its corresponding 95% confidence interval and 95% prediction interval, and we assessed heterogeneity, small-study effects, and excess significance bias. The evidence was graded on a scale ranging from convincing (Class I) to weak (Class IV). We assessed the certainty of evidence according to the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) criteria. We also evaluated the methodological quality of included studies using A Measurement Tool to Assess Systematic Reviews (AMSTAR), which contains 11 items scored as high (8–11), moderate (4–7), or low (0–3) quality. The protocol was registered in the PROSPERO database (CRD42023409547).

The current umbrella review included 51 systematic review with meta-analysis articles covering 270 associations. We re-calculated each association and found convincing evidence (Class I) for only one: the association between blood group B and type 2 diabetes mellitus risk compared with the non-B blood group. It had a summary odds ratio of 1.28 (95% confidence interval: 1.17, 1.40), was supported by 6870 cases with small heterogeneity ( I 2  = 13%) and a 95% prediction interval excluding the null value, and showed no hints of small-study effects ( P for Egger’s test > 0.10, and the largest study effect was not more conservative than the summary effect size) or excess significance ( P  < 0.10, but the observed number of significant studies was smaller than expected). The corresponding article was of high methodological quality according to AMSTAR (score = 9). Overall, according to AMSTAR, 18, 32, and 1 articles were categorized as high, moderate, and low quality, respectively. Nine statistically significant associations reached moderate quality based on GRADE.

Conclusions

Our findings suggest potential relationships between ABO and Rh blood groups and adverse health outcomes, particularly the association between blood group B and type 2 diabetes mellitus risk.


Blood groups can be categorized according to different systems, such as the ABO blood group system, the Rhesus (Rh) blood group system, and the MN blood group system [ 1 ]. The ABO blood group system is the most frequently applied [ 2 ]. Each individual carries two alleles, each encoding antigen A, antigen B, or neither; together, the two alleles determine the individual’s blood type phenotype: O, A, B, or AB. The Rh blood group system is the most polymorphic of the human blood group systems, comprising numerous antigens, and is second in clinical importance only to ABO. The ABO and Rh blood group systems are extensively utilized in clinical practice and affect host susceptibility to disease [ 3 , 4 ].

A previous study suggested that blood groups are involved in disease mechanisms at the molecular level, mediated either through the blood group antigens themselves or by blood group reactive antibodies [ 5 ]. In addition, Höglund et al. found that 39 plasma proteins were associated with variation at the ABO locus, for example, proteins with functions related to tumorigenesis (CA9, Gal-9, and KLK6) and pro-inflammatory or anti-inflammatory functions (IFN-gamma-R1, IL-18BP, and MARCO) [ 6 ]. Generally, overexpression of these proteins leads to abnormal cell proliferation or growth. Thus, blood group may influence disease development through protein expression levels.

Numerous systematic reviews with meta-analyses have been published exploring correlations between ABO and Rh blood groups and various health outcomes [ 7 , 8 , 9 ]. However, to date, the associations between these blood groups and human health outcomes remain controversial [ 10 , 11 , 12 ]. Most reviews have concentrated on a single disease end-point, lacking a comprehensive evaluation of the aforementioned relationships, and the strength and reliability of the evidence remain unclear. To overcome the inherent limitations of systematic reviews with meta-analyses, a comprehensive overview of the claimed associations of ABO and Rh blood groups with health outcomes, in the form of an umbrella review (UR), is necessary.

A UR synthesizes evidence from various systematic reviews with meta-analyses on a subject, appraising the certainty, precision, and potential bias of the correlations and thus facilitating evidence grading based on well-defined criteria [ 13 ]. We set out to conduct a UR to comprehensively evaluate systematic reviews with meta-analyses of observational studies examining associations of ABO and Rh blood groups with a range of health outcomes. This endeavor aimed to present an overview of the breadth and validity of the aforementioned associations. We thereby hoped to provide both clinicians and policy makers with robust data to identify high-risk groups and inform clinical practice and guidelines.

Protocol registration

The protocol of this UR was registered with the International Prospective Register of Systematic Reviews (PROSPERO; registration number CRD42023409547). The study followed the Preferred Reporting Items for Systematic Reviews and Meta-analyses reporting guideline [ 14 ] (Additional file 1 : Table S1) and the Meta-analysis of Observational Studies in Epidemiology reporting guideline [ 15 ] (Additional file 1 : Table S2).

Search strategy

We systematically searched PubMed, Web of Science, Embase, Scopus, Cochrane Library, and several regional databases (Latin American and Caribbean Health Sciences Literature, Western Pacific Region Index Medicus, Index Medicus for South-East Asia Region, Index Medicus for the Eastern Mediterranean Region, and African Index Medicus) from inception until Feb 16, 2024, to identify systematic reviews with meta-analyses of observational studies evaluating associations between ABO as well as Rh blood groups and diverse health outcomes. We searched using the keywords (“ABO” OR “blood group” OR “blood type” OR “Rh”) AND (“meta-analysis” OR “systematic review” OR “systematic overview”) (Additional file 1 : Table S3). In addition, we hand-checked the reference lists of all identified systematic reviews with meta-analyses.

Eligibility criteria

Articles were selected based on the following PECOS (Population, Exposure, Comparison, Outcome, Study design) strategy:

Population: population with ABO or Rh blood groups;

Exposure: ABO (blood types A, B, O, and AB) and Rh (Rh positive [Rh +] and Rh negative [Rh −]) blood groups (any method used to assess blood type, including genetic tests and forward/reverse agglutination tests, was accepted);

Comparison: different blood groups;

Outcome: any health outcome (e.g., cancer, coronavirus disease 2019 [COVID-19], coronary artery disease), ascertained by self-report, observation (e.g., clinical diagnosis), or objective criteria (e.g., biomarkers, certified mortality); and

Study design: systematic reviews with meta-analyses of observational studies (cohort, case–control, or cross-sectional studies).

The exclusion criteria were established as follows: (1) systematic reviews without quantitative analysis, (2) systematic reviews with meta-analyses without study-level data (e.g., effect sizes, 95% confidence intervals [CIs], the number of cases, and participants/control), (3) studies on genetic polymorphisms, animal studies, laboratory studies, conference abstracts and randomized controlled trials, or (4) systematic reviews with meta-analyses conducted in languages other than English.

Given that a minimum of three original studies is required to calculate 95% prediction intervals (PIs), we incorporated only meta-analyses comprising at least three original studies [ 16 ]. Associations were considered to overlap if they assessed the same research topic and were examined in more than one systematic review with meta-analysis [ 17 ]. Incorporating results from reviews with overlapping associations can count the same primary studies more than once, which biases findings and estimates [ 18 , 19 ]. Therefore, when two or more systematic reviews with meta-analyses overlapped, we selected the one containing the largest number of primary studies; when more than one contained the same number of primary studies, we selected the one with the largest number of participants (see the sketch below).
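To make this selection rule concrete, here is a minimal base-R sketch under stated assumptions: the table and its column names (review, n_primary, n_particip) are hypothetical illustrations, not data or code from the paper.

```r
# Hypothetical overlapping meta-analyses on the same association
mas <- data.frame(
  review     = c("MA1", "MA2", "MA3"),
  n_primary  = c(12, 15, 15),            # number of primary studies
  n_particip = c(80000, 95000, 120000)   # total participants
)

# Keep the review with the most primary studies;
# break ties by the largest number of participants
chosen <- mas[order(-mas$n_primary, -mas$n_particip), ][1, ]
chosen  # MA3: 15 primary studies, 120,000 participants
```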

To ascertain the eligible articles, four experienced investigators (Y-JC, J-KG, J-TC, and YY) worked in pairs and independently screened titles, abstracts, and full texts. We also hand-checked the references of relevant studies to identify any other eligible articles. Any discrepancies were resolved by a third reviewer (Q-JW).

Data extraction

Ten trained investigators (Y-JC, J-KG, YY, X-XJ, W-JL, T-Z, YY, MY, ZL, and X-LB) worked in pairs to extract data independently; discrepancies were settled by a third reviewer (Q-JW) when needed. From every meta-analysis we identified, we extracted the name of the first author, journal, publication year, exposures of interest, outcomes of interest, comparison, meta-analysis metric (RR [risk ratio], OR [odds ratio], or HR [hazard ratio]), and the number of studies considered. From the individual studies included in every meta-analysis, we extracted the name of the first author, publication year, epidemiological study design, number of cases and controls in observational case–control studies or total population in observational cohort studies, maximally adjusted risk estimates, and 95% CIs.

Data analysis

Estimation of summary effect —We utilized a random-effects model for each meta-analysis to calculate the summary effect size and its corresponding 95% CI [ 20 ].
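As an illustration only, a minimal sketch of such a random-effects summary estimate in R; the metafor package and the per-study odds ratios below are assumptions for the example (the paper states only that STATA and R were used, not which packages):

```r
library(metafor)  # assumed package; not named in the paper

# Hypothetical per-study odds ratios with 95% CI bounds
or <- c(1.35, 1.22, 1.41, 1.18, 1.30)
lo <- c(1.05, 0.95, 1.02, 1.01, 1.08)
hi <- c(1.74, 1.57, 1.95, 1.38, 1.57)

yi  <- log(or)                           # analyze on the log-OR scale
sei <- (log(hi) - log(lo)) / (2 * 1.96)  # SEs recovered from CI widths

fit <- rma(yi, sei = sei, method = "REML")  # random-effects model
exp(c(fit$beta, fit$ci.lb, fit$ci.ub))      # summary OR and 95% CI
```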

Estimation of prediction interval —We calculated 95% prediction intervals (PIs) for the summary random-effects sizes, because the PI accounts for between-study heterogeneity and the uncertainty of the effect, indicating the range in which the effect of a future study on the same association is expected to fall [ 21 ].
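Continuing the hypothetical metafor sketch above, the PI can be read off the fitted model (assuming a recent metafor version, where predict() returns pi.lb and pi.ub):

```r
# 95% prediction interval from the same hypothetical fit
pr <- predict(fit)
exp(c(pr$pi.lb, pr$pi.ub))  # PI back-transformed to the odds-ratio scale
```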

Assessment of heterogeneity —We evaluated heterogeneity with the I² metric: an I² value exceeding 50% was judged to indicate large heterogeneity, and a value exceeding 75% very large heterogeneity [ 22 ]. We also computed the τ² statistic to assess heterogeneity.
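Both statistics are also available from the same hypothetical fit:

```r
# Heterogeneity statistics from the hypothetical metafor fit above
fit$I2    # I² (%): > 50 judged large, > 75 very large in this UR
fit$tau2  # τ²: estimated between-study variance of the true effects
```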

Assessment of small-study effects —Using Egger’s regression asymmetry test [ 23 ], we evaluated small-study effects (i.e., whether smaller studies tend to give larger estimates of effect size than larger studies) [ 24 ]. Small-study effects can reflect publication and other reporting biases, genuine heterogeneity, chance, or other conditions [ 24 ]. They were considered present when the effect of the largest study was more conservative than the summary effect size of the meta-analysis and the P value of the regression asymmetry test was < 0.10.
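A sketch of the asymmetry test on the same hypothetical fit, using metafor's regtest() (an assumption; Egger's test can equally be run by regressing standardized effects on precision):

```r
# Egger-type regression test for funnel-plot asymmetry
regtest(fit)                 # mixed-effects version, predictor = SE
regtest(fit, model = "lm")   # classical Egger regression variant
# In this UR, small-study effects were flagged only when p < 0.10 AND
# the largest study's effect was more conservative than the summary.
```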

Evaluation of excess significance —We assessed excess significance bias by testing, with a chi-square test, whether the observed number of studies ( O ) with nominally statistically significant results (“positive” studies, P  < 0.05) was larger than the expected number of studies ( E ) with statistically significant results [ 25 ]. The expected number was computed from the power of each study to detect the effect of the largest study (that is, the study with the smallest standard error), calculated using a noncentral t distribution [ 26 , 27 ]. The excess significance test was judged positive when both O  >  E and P  < 0.10 [ 22 ].
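A minimal sketch of this O-versus-E comparison, reusing the hypothetical studies above; note that per-study power is approximated here with a normal distribution rather than the noncentral t distribution the authors describe, so the numbers would differ slightly:

```r
# Same hypothetical studies as in the sketches above
yi  <- log(c(1.35, 1.22, 1.41, 1.18, 1.30))   # per-study log-ORs
sei <- c(0.129, 0.128, 0.165, 0.080, 0.095)   # per-study standard errors

theta <- yi[which.min(sei)]     # effect of the largest study (smallest SE)
zc    <- qnorm(0.975)           # two-sided 5% critical value

O <- sum(abs(yi / sei) > zc)    # observed number of significant studies

# Power of each study to detect theta (normal approximation here;
# the authors describe using a noncentral t distribution)
pow <- pnorm(theta / sei - zc) + pnorm(-theta / sei - zc)
E   <- sum(pow)                 # expected number of significant studies

n     <- length(yi)
chisq <- (O - E)^2 * (1 / E + 1 / (n - E))          # chi-square for O vs E
p_exc <- pchisq(chisq, df = 1, lower.tail = FALSE)
c(O = O, E = E, p = p_exc)      # judged positive if O > E and p < 0.10
```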

Strength of evidence —Following the established criteria applied in previously published URs [ 13 , 28 , 29 , 30 ] and based on our calculations, significant associations ( P  < 0.05) between ABO and Rh blood groups and health outcomes were divided into four levels of evidence strength (convincing [Class I], highly suggestive [Class II], suggestive [Class III], or weak [Class IV]). The grading was based on statistical significance, number of cases, heterogeneity, the largest study, the 95% PI, small-study effects, and excess significance bias; a P value ≥ 0.05 indicated a statistically non-significant association (Additional file 1 : Table S4). A sketch of this grading logic follows below.
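As a rough illustration, the rule set below follows the standard criteria used in previously published URs, as also reflected in the Results of this paper; the exact criteria of this UR are given in Additional file 1: Table S4, so treat this as an assumption-laden sketch rather than the authors' implementation:

```r
# Four-level evidence grading (sketch; exact rules are in Table S4)
grade_evidence <- function(p, n_cases, i2, pi_excl_null, largest_sig,
                           small_study, excess_sig) {
  if (p >= 0.05)                                  return("not significant")
  if (p < 1e-6 && n_cases > 1000 && i2 < 50 &&
      pi_excl_null && !small_study && !excess_sig) return("convincing (I)")
  if (p < 1e-6 && n_cases > 1000 && largest_sig)  return("highly suggestive (II)")
  if (p < 1e-3 && n_cases > 1000)                 return("suggestive (III)")
  "weak (IV)"
}

# The blood group B / T2DM association from the Results would grade as:
grade_evidence(p = 1e-7, n_cases = 6870, i2 = 13, pi_excl_null = TRUE,
               largest_sig = TRUE, small_study = FALSE, excess_sig = FALSE)
```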

Certainty of the evidence —The credibility of the evidence was qualitatively assessed by two reviewers (W-YX and X-LB) using the GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) approach. As recommended by GRADE, the level of evidence was graded as high, moderate, low, or very low, as determined by risk of bias, inconsistency, indirectness, imprecision, and publication bias.

Sensitivity analyses —To verify the robustness of our findings, we conducted sensitivity analyses to assess the concordance of the summary associations initially graded as convincing (Class I) or highly suggestive (Class II) evidence. We did so by excluding small studies (< 25th percentile of sample size) from meta-analyses with evidence of small-study effects, and by excluding primary studies with low-quality evidence (Newcastle–Ottawa Scale < 6 [ 31 ], Agency for Healthcare Research and Quality < 8 [ 32 ], or an Effective Public Health Practice Project rating of moderate or low rather than strong quality [ 33 ]). A further sensitivity analysis was performed on the meta-analyses excluded from the main analysis because of overlap. All statistical analyses were conducted in STATA version 17 and R version 3.6.2.

Assessment of the methodological quality of meta-analyses

We used A Measurement Tool to Assess Systematic Reviews (AMSTAR), considered a valid and reliable measurement tool [ 34 ], to evaluate the quality of the included systematic reviews and meta-analyses. This instrument contains a total of 11 items; a “yes” scores one point, and any other answer scores 0 points. AMSTAR totals were graded as low (0–3 points), moderate (4–7 points), or high quality (8–11 points) [ 34 ]. Ten trained investigators (Y-CS, Z-PN, W-YX, YY, W-JL, ZL, JY, X-LB, MY, and J-NS) worked in pairs and independently applied AMSTAR to assess the methodological quality of the eligible systematic reviews with meta-analyses. Disagreements were resolved by a third author (Q-JW).
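A trivial sketch of this scoring rule (the function name and input format are ours, for illustration only):

```r
# One point per "yes" on the 11 AMSTAR items, then grade the total
amstar_grade <- function(item_yes) {       # item_yes: 11 logicals
  stopifnot(length(item_yes) == 11)
  score <- sum(item_yes)
  cut(score, breaks = c(-1, 3, 7, 11),
      labels = c("low (0-3)", "moderate (4-7)", "high (8-11)"))
}

amstar_grade(c(rep(TRUE, 9), rep(FALSE, 2)))  # score 9 -> high quality
```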

Literature identification and selection

We retrieved 6474 records from PubMed, Web of Science, Embase, Scopus, Cochrane Library, and several regional databases. After duplicate removal and title and abstract screening, 159 full-text articles were retrieved and checked against the eligibility criteria. No additional eligible articles were found by hand-checking the reference lists of the systematic reviews. Overall, 51 systematic reviews with meta-analyses, corresponding to 270 unique associations, were included [ 7 , 9 , 11 , 12 , 31 , 32 , 33 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 , 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 , 57 , 58 , 59 , 60 , 61 , 62 , 63 , 64 , 65 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 , 74 , 75 , 76 , 77 , 78 ] (Fig. 1 ). The two pairs of four investigators showed high consistency in study selection, with kappa values of 0.893 and 0.926, respectively. The excluded articles and the reasons for their exclusion are provided in Additional file 1 : Table S5. For meta-analyses excluded owing to a lack of data for quantitative synthesis, we summarize the findings in Additional file 1 : Table S6.

Fig. 1 PRISMA flow chart. Flow chart of included and excluded systematic reviews and meta-analyses

Characteristics of included meta-analyses

The 51 systematic reviews with meta-analyses corresponded to 270 unique associations: 105 on cancer outcomes (39%), 91 on infectious outcomes (34%), 25 on cardiovascular outcomes (9%), 22 on oral-related outcomes (8%), 12 on pregnancy-related outcomes (4%), 5 on metabolic disease (2%), and 10 on other outcomes (4%) (Fig. 2 ). The systematic reviews with meta-analyses included in this UR were published from 2007 until 2023. The number of studies per association ranged from 3 to 49. One hundred and ninety-five meta-analyses included ≥  1000 cases (Additional file 1 : Table S7).

Fig. 2 Map of 270 blood group related outcomes: percentage of outcomes per outcome category for all studies

Summary findings

Of the 270 associations included in our UR, 89 (33%) presented a nominally statistically significant effect ( P  < 0.05). Of these, 41 (46%) were significant at P  < 10 −3 , and 24 (27%) reached P  < 10 −6 . The effect of the largest study was statistically significant in 61 (69%) of the 89 associations. The 95% PI contained the null value in 66 (74%). Twenty-three (26%) and 19 (21%) associations had large (50% <  I ² ≤ 75%) and very large ( I ² > 75%) heterogeneity estimates, respectively. Twenty-one (24%) of the 89 associations presented evidence of small-study effects, and 29 (33%) presented evidence of excess significance bias.

Cancer outcomes

We summarized 105 associations between blood group and cancer outcomes. The magnitude of the observed summary random effects estimates ranged from 0.65 to 1.54 (Additional file 2 : Fig. S1). Thirty-one meta-analyses (30%) presented a nominally statistically significant effect ( P  < 0.05). Of these, 14 associations were graded as suggestive or above evidence (Fig. 3 , Additional file 1 : Tables S7–8).

Fig. 3 Forest plot showing studies investigating the association between blood group and health outcomes. CI, confidence interval; CagA, cytotoxin-associated gene A

Esophageal cancer

We found that blood group B was associated with a higher risk of esophageal cancer (OR =  1.20; 95% CI: 1.10, 1.31) compared with blood group non-B; this association was graded as suggestive evidence.

Gastric cancer

Blood group A was associated with a higher risk of gastric cancer, both compared with blood group non-A (OR =  1.11; 95% CI: 1.07, 1.15) and compared with blood group O (OR =  1.19; 95% CI: 1.14, 1.24). Conversely, blood group O was associated with a lower risk of gastric cancer (OR =  0.91; 95% CI: 0.89, 0.94) compared with blood group non-O. All of these associations were graded as highly suggestive evidence.

Pancreatic cancer

Compared with blood group O, blood group A was associated with a higher risk of pancreatic cancer (OR =  1.33; 95% CI: 1.27, 1.40), cytotoxin-associated gene A (CagA)-endemic pancreatic cancer (OR =  1.46; 95% CI: 1.24, 1.50), and CagA-nonendemic pancreatic cancer (OR =  1.43; 95% CI: 1.24, 1.64); blood group B was associated with a higher risk of pancreatic cancer (OR =  1.20; 95% CI: 1.10, 1.31) and CagA-nonendemic pancreatic cancer (OR =  1.42; 95% CI: 1.19, 1.69); and blood group AB was associated with a higher risk of CagA-nonendemic pancreatic cancer (OR =  1.54; 95% CI: 1.26, 1.88). Blood group non-O was also associated with a higher risk of pancreatic cancer (OR =  1.31; 95% CI: 1.22, 1.42). Compared with blood group O, blood group non-O was further associated with higher risks of pancreatic cancer (OR =  1.32; 95% CI: 1.22, 1.42), CagA-endemic pancreatic cancer (OR =  1.20; 95% CI: 1.11, 1.30), and CagA-nonendemic pancreatic cancer (OR =  1.42; 95% CI: 1.28, 1.59). According to the UR criteria, the associations of blood group A with pancreatic cancer, CagA-endemic pancreatic cancer, and CagA-nonendemic pancreatic cancer, and of blood group non-O with pancreatic cancer and CagA-nonendemic pancreatic cancer, were graded as highly suggestive evidence. The remaining associations were graded as suggestive evidence.

Infectious disease outcomes

Ninety-one associations between blood group and infectious disease outcomes were investigated. The magnitude of the observed summary random-effects estimates ranged from 0.50 to 47.85 (Additional file 2 : Fig. S2). Overall, 27 (32%) of 85 associations were statistically significant at P  < 0.05, and ten were supported by suggestive or above evidence (Fig. 3 , Additional file 1 : Tables S7–8).

Coronavirus disease 2019 (COVID-19)

We found that blood group A (OR =  1.25; 95% CI: 1.14, 1.37) and blood group B (OR =  1.15; 95% CI: 1.07, 1.22) were associated with an increased risk of COVID-19 infection compared with blood group O, whereas blood group O was associated with a decreased risk of COVID-19 infection (OR =  0.88; 95% CI: 0.82, 0.94) compared with blood group non-O. The association between blood group O and COVID-19 infection was supported by highly suggestive evidence; the other two associations were supported by suggestive evidence.

Human immunodeficiency virus (HIV)

All four ABO blood groups were associated with an increased risk of HIV infection (blood group A versus non-A: RR =  24.25; 95% CI: 21.60, 27.23; blood group B versus non-B: RR =  21.29; 95% CI: 18.62, 24.36; blood group AB versus non-AB: RR =  5.44; 95% CI: 4.10, 7.22; and blood group O versus non-O: RR =  47.85; 95% CI: 44.01, 52.03). All four associations were supported by highly suggestive evidence.

P. falciparum

Blood group A (OR =  1.68; 95% CI: 1.32, 2.14), blood group B (OR =  1.97; 95% CI: 1.49, 2.59), and blood group non-O (OR =  1.86; 95% CI: 1.49, 2.33) were associated with an increased risk of P. falciparum infection compared with blood group O. All of these associations were supported by suggestive evidence.

Cardiovascular and cerebrovascular outcomes

Twenty-five associations between blood group and cardiovascular and cerebrovascular outcomes were summarized. The magnitude of the observed summary random-effects estimates ranged from 0.58 to 2.55 (Additional file 2 : Fig. S3). Of these, 21 (84%) associations showed a nominally statistically significant effect ( P  < 0.05), and 7 reached suggestive or above evidence (Fig. 3 , Additional file 1 : Tables S7–8).

Myocardial infarction (MI)

Blood group A (OR =  1.29; 95% CI: 1.16, 1.45) and blood group non-O (OR =  1.25; 95% CI: 1.14, 1.37) had an increased risk of MI compared with blood group O. Another association showed that blood group O had an increased risk of MI (OR =  1.28; 95% CI: 1.17, 1.40) compared with blood group non-O. All three associations reached suggestive evidence.

Peripheral vascular disease (PVD)

Compared with blood group O, blood group A (OR =  1.44; 95% CI: 1.19, 1.74) and blood group non-O (OR =  1.45; 95% CI: 1.35, 1.56) had an increased risk of PVD. The associations of blood group A and blood group non-O with PVD risk reached suggestive and highly suggestive evidence, respectively.

Venous thromboembolism (VTE)

Blood group A (OR =  1.63; 95% CI: 1.40, 1.89) and blood group non-O (OR =  2.10; 95% CI: 1.83, 2.40) had an increased risk of VTE compared with blood group O. Both associations reached highly suggestive evidence.

Oral-related outcomes

Twenty-two associations between blood group and oral-related outcomes were summarized. The magnitude of the observed summary random-effects estimates ranged from 0.70 to 1.36 (Additional file 2 : Fig. S4). Only one association showed a nominally statistically significant effect ( P  < 0.05), and no association reached suggestive or above evidence (Additional file 1 : Tables S7–8).

Pregnancy-related outcomes

We summarized twelve associations between blood group and pregnancy-related outcomes. The magnitude of the summary random-effects estimates ranged from 0.90 to 1.49 (Additional file 2 : Fig. S5). Only two associations were statistically significant at P  < 0.05, and no association reached suggestive or above evidence (Additional file 1 : Tables S7–8).

Metabolic disease outcomes

We summarized 5 associations between blood group and metabolic disease outcomes. The magnitude of the observed summary random effects estimates ranged from 0.91 to 1.28 (Additional file 2 : Fig. S6). Only 2 (40%) of 5 associations were nominally statistically significant at a P  < 0.05 level. One association was supported by suggestive or above evidence (Fig. 3 , Additional file 1 : Tables S7–8).

Type 2 diabetes mellitus (T2DM) incidence

Blood group B, compared with blood group non-B, had a greater risk of T2DM (OR =  1.28; 95% CI: 1.17, 1.40), and the association was supported by convincing evidence.

Other outcomes

Ten associations between blood group and other outcomes (such as bleeding complications and decreased ovarian reserve) were summarized. The magnitude of the summary random-effects estimates ranged from 0.84 to 1.33 (Additional file 2 : Fig. S7). Only one association was statistically significant at P  < 0.05, and it was supported by highly suggestive evidence (Fig. 3 , Additional file 1 : Tables S7–8).

Bleeding complication

Blood group O was associated with a higher risk of bleeding complication (OR =  1.33; 95% CI: 1.25, 1.42), compared with blood group non-O, which was supported by highly suggestive evidence.

In summary, we found that the association between blood group B and an increased risk of T2DM incidence (OR =  1.28; 95% CI: 1.17, 1.40), compared with blood group non-B, was rated as convincing evidence: it was based on more than 1000 cases, had a random-effects P value < 10 −6 , showed no large heterogeneity ( I ² < 50%), had a 95% PI excluding the null value, and showed no hints of small-study effects or excess significance bias (Fig. 3 ). Eighteen associations were rated as highly suggestive evidence: they were statistically significant at P  < 10 −6 , included more than 1000 cases, and had a largest-study P value below 0.05. For example, compared with blood group O, both blood group A (OR =  1.63; 95% CI: 1.40, 1.89) and the non-O blood group (OR =  2.10; 95% CI: 1.83, 2.40) increased the risk of VTE incidence. In addition, 14 associations were rated as suggestive evidence, 56 were rated as weak evidence, and the remaining 181 associations were not significant (Additional file 1 : Tables S7–8).

Methodological quality of the meta-analyses

Using the measurement tool AMSTAR, 18 (35%) of the 51 articles were categorized as high quality, 32 (63%) as moderate quality, and only 1 (2%) as low quality (Additional file 1 : Table S9).

Certainty of the evidence

Based on the GRADE approach, no health outcomes reached high credibility criteria. Nine of 89 health outcomes met the moderate certainty criteria. Thirty-three and 47 of 89 health outcomes met the low and very low certainty criteria, respectively (Additional file 1 : Table S10).

Sensitivity analyses

Findings from the sensitivity analyses are reported in Additional file 1 : Tables S11–13. After removal of small studies from the meta-analyses with evidence of small-study effects, the evidence ratings were not modified. When low-quality studies were excluded, the associations of blood group A with gastric cancer, pancreatic cancer, and VTE, and of blood group non-O with pancreatic cancer, retained their highly suggestive evidence ratings. When we examined the associations excluded because of overlap, twelve associations were downgraded because of their random-effects P values.

Main findings

This is the first UR to provide a comprehensive overview of the observational data assessing associations between the ABO and Rh blood groups and multiple health outcomes. We found 89 statistically significant associations. Convincing (Class I) evidence was presented only for the association between blood group B and T2DM risk. Highly suggestive (Class II) evidence was presented for 18 associations, such as those for HIV infection and VTE.

Comparison with previous studies

The positive association between blood group B and the risk of T2DM detected in this UR is supported by a prospective cohort study of 82,104 women in France followed for 18 years, during which 3553 women received a validated diagnosis of T2DM. After adjustment for potential confounders, blood group B increased the risk of T2DM compared with blood group O (HR =  1.21; 95% CI: 1.07, 1.36) [ 79 ]. A comparative cross-sectional study including 326 participants (163 T2DM patients and 163 age- and sex-matched healthy individuals) confirmed the harmful association of blood group B with T2DM risk (OR =  1.96; 95% CI: 1.05, 3.65) compared with the non-B blood group [ 80 ]. A meta-analysis reported that blood group B was associated with a slightly, though not significantly, increased risk of T2DM (RR =  1.05; 95% CI: 0.93, 1.18) compared with the non-B blood group [ 81 ]. Nevertheless, caution is warranted in interpreting the observed association between blood group B and T2DM risk. Although our result is consistent with the findings of the French prospective cohort study, that study exclusively included women; subgroup analyses stratified by sex are needed in the future. In addition, the above studies used different control groups, sample sizes, and study designs. Further well-designed, large-scale prospective studies are needed to clarify the association between blood group B and T2DM.

The association between blood groups and HIV infection has been debated. In our UR, we found that all ABO blood groups were associated with an increased risk of HIV infection, and all of these associations were supported by highly suggestive evidence. A cross-sectional study conducted among 280 female sex workers in Nairobi, Kenya showed that blood group A (OR =  1.56; 95% CI: 1.06, 2.28) was associated with HIV infection compared with blood group O, whereas blood group B (OR =  1.63; 95% CI: 0.94, 2.80) and blood group AB (OR =  1.50; 95% CI: 0.57, 3.93) were not [ 82 ]. A previous cross-sectional study conducted in South Africa investigated the associations between ABO blood groups and HIV infection among blood donors; the results suggested that ABO blood group was not related to HIV infection, although the point estimate of the OR for blood group AB and HIV infection was 1.03 [ 83 ]. Cross-sectional studies cannot be used to infer causality, and potential biases should be considered in observational studies. Further well-designed longitudinal studies that control for different sources of bias are warranted to assess causality.

The harmful associations of blood group A and the non-O blood group with VTE incidence observed in our UR are supported by previous studies. For example, a previous study that included 7830 patients found that blood group A was associated with VTE incidence (OR =  2.16; 95% CI: 1.10, 4.24) compared with blood group O [ 84 ].

Biological plausibility

Multifactorial mechanisms might explain the increased risk of T2DM associated with blood groups. A previous study showed that ABO blood group is associated with plasma levels of soluble intercellular adhesion molecule-1 and tumor necrosis factor receptor-2 [ 85 ], markers identified as contributing to a higher risk of T2DM. Moreover, a study suggested that the ABO blood group, as a gene-determined host factor, modulates the composition of the intestinal microbiota [ 86 ], which plays an important role in metabolism, including glucose metabolism, energy balance, and low-grade inflammation [ 87 ].

Regarding potential mechanisms linking blood group and HIV infection, some studies indicated that HIV can induce the expression of glycosyltransferases and thereby the synthesis of blood group antigens on lymphocyte surfaces [ 88 , 89 ]. Therefore, apart from releasing new virion particles from lymphocytes, HIV could also integrate blood group antigens into its envelope surface [ 89 ]. The presence of these antigens sensitizes the virus to neutralizing antibodies and complement specific to particular blood groups, potentially influencing the virus’s transmission between individuals of different blood groups [ 88 ].

The exact mechanism linking the ABO blood group and VTE is not thoroughly understood. The most likely hypothesis is that ABO influences the degree of glycosylation of von Willebrand factor by modifying glycosyltransferase expression [ 90 ]. The multimeric composition of von Willebrand factor in plasma is regulated by ADAMTS13, and deglycosylation of von Willebrand factor enhances its proteolysis by ADAMTS13 [ 91 ]. In addition, individuals with blood group A1 or B have, on average, approximately 20% higher circulating von Willebrand factor and factor VIII levels than those with blood group O or A2 [ 92 , 93 ], and high plasma levels of von Willebrand factor and factor VIII are associated with increased VTE risk [ 94 , 95 , 96 , 97 ].

Strengths and limitations

To our knowledge, this is the first UR to systematically and comprehensively appraise the hierarchy of evidence relating blood groups to various health outcomes. Beyond summarizing the findings for a series of health end-points, we further inquired into bias and heterogeneity in the observational blood group literature. Compared with an individual systematic review or meta-analysis, this UR summarizes a complicated and vast body of research by comparing and contrasting the results of individual reviews, providing an efficient overview of the findings for a particular problem [ 98 ]. Moreover, we adhered to a systematic methodology involving a search strategy across electronic databases, with study selection and data extraction conducted by two separate researchers. We also used standard approaches to evaluate the methodological quality of the included studies and the strength of the epidemiological evidence.

A UR provides top-tier evidence and important insights, but several limitations should be considered. First, some systematic reviews and meta-analyses could not be assigned a level of evidence because they did not provide the number of cases. Second, we used I ² (an estimate of the proportion of variance reflecting true differences in effect size) and τ² (an estimate of the true variation in the summary estimate) to evaluate statistical heterogeneity. According to the UR criteria, I ² < 50% was one of the requirements for convincing evidence, so the best evidence grade was assigned only to robust associations. Several systematic reviews with meta-analyses examined clinical and methodological heterogeneity by performing subgroup analyses stratified by these characteristics; of note, we extracted this information and analyzed it in the present UR. For example, within the subgroups comprising pancreatic cancer patients classified as CagA-nonendemic or CagA-endemic, and COVID-19 patients with hospitalization, the results from the subgroup analyses were consistent with the main findings. Future studies should explore clinical and methodological heterogeneity further to verify the associations between blood groups and various health outcomes. Third, for the same health outcome (e.g., COVID-19 infection), the comparison groups differed (e.g., A vs B, A vs AB, A vs O, and A vs non-A); therefore, the findings on blood groups and health outcomes in our study should be interpreted with caution. Fourth, we identified studies from published systematic reviews with meta-analyses, which may have omitted individual studies not included in those reviews. However, the systematic reviews with meta-analyses included in the current study were those that contained the largest number of primary studies, so this is unlikely to have affected our results. Fifth, the reliability of a UR depends directly on the incorporated systematic reviews with meta-analyses, and some of those included here carried a risk of bias, which might reduce the robustness of the statistical analyses. The study also could not adjust for confounding factors that might have mediated associations between blood group and outcomes, because adjustment for potential confounders was unavailable in the published systematic reviews with meta-analyses. Sixth, as this UR included only observational data, limitations common to this approach, such as information bias and residual confounding, may influence the results. Only a limited number of systematic reviews with meta-analyses exclusively included prospective study designs, in which information bias is reduced; case–control and cross-sectional designs were more common and are associated with a higher potential for information bias and reverse causation.

This comprehensive UR will help investigators judge the relative priority of health outcomes related to the ABO and Rh blood groups for future research and clinical management of disease. In summary, compared with the non-B blood group, the association between blood group B and an increased risk of T2DM incidence (OR =  1.28; 95% CI: 1.17, 1.40) was supported by convincing evidence. We also found 18 associations, such as blood group A and the risk of VTE incidence (OR =  1.63; 95% CI: 1.40, 1.89) and the non-O blood group and the risk of VTE incidence (OR =  2.10; 95% CI: 1.83, 2.40), that were supported by highly suggestive evidence. To enhance the quality of evidence regarding these associations and allow strong recommendations, future studies should consider several aspects: for example, use the same control group to increase the comparability of results, use standard definitions of exposures and outcomes to reduce clinical heterogeneity, and match the characteristics of cases and controls to reduce the impact of potential confounding. In addition, future studies of the mechanisms linking blood groups with various health outcomes are needed.

Availability of data and materials

Not applicable.

Abbreviations

AMSTAR: A Measurement Tool to Assess Systematic Reviews

CI: Confidence intervals

HIV: Human immunodeficiency virus infection

HR: Hazard ratio

PI: Prediction intervals

PVD: Peripheral vascular disease

T2DM: Type 2 diabetes mellitus

UR: Umbrella review

VTE: Venous thromboembolism

Daniels G. Human blood group systems. In: Practical transfusion medicine. 2017. p. 20–8.


Landsteiner K, Wiener AS. An agglutinable factor in human blood recognized by immune sera for rhesus blood. Proc Soc Exp Biol Med. 1940;43(1):223.


Cooling L. Blood groups in infection and host susceptibility. Clin Microbiol Rev. 2015;28(3):801–70.


Avent ND, Reid ME. The Rh blood group system: a review. Blood. 2000;95(2):375–87.


Bruun-Rasmussen P, Hanefeld Dziegiel M, Banasik K, Johansson PI, Brunak S. Associations of ABO and Rhesus D blood groups with phenome-wide disease incidence: a 41-year retrospective cohort study of 482,914 patients. Elife. 2023;12:e83116.

Hoglund J, Karlsson T, Johansson T, Ek WE, Johansson A. Characterization of the human ABO genotypes and their association to common inflammatory and cardiovascular diseases in the UK Biobank. Am J Hematol. 2021;96(11):1350–62.

Clark P, Wu O. ABO(H) blood groups and pre-eclampsia. A systematic review and meta-analysis. Thromb Haemost. 2008;100(3):469–74.


Iodice S, Maisonneuve P, Botteri E, Sandri MT, Lowenfels AB. ABO blood group and cancer. Eur J Cancer. 2010;46(18):3345–50.

Getawa S, Bayleyegn B, Aynalem M, Worku YB, Adane T. Relationships of ABO and Rhesus blood groups with type 2 diabetes mellitus: a systematic review and meta-analysis. J Int Med Res. 2022;50(10):3000605221129547.

Bawazir WM. Systematic review and meta-analysis of the susceptibility of ABO blood groups to venous thromboembolism in individuals with Factor V Leiden. Diagnostics (Basel). 2022;12(8):1936.

Dentali F, Sironi AP, Ageno W, Turato S, Bonfanti C, Frattini F, et al. Non-O blood type is the commonest genetic risk factor for VTE: results from a meta-analysis of the literature. Semin Thromb Hemost. 2012;38(5):535–48.


Zhang Q, Peng H, Hu L, Ren R, Peng X, Song J. Association between ABO blood group and venous thromboembolism risk in patients with peripherally inserted central catheters: a meta-analysis and systematic review. Front Oncol. 2022;12:906427.


Ioannidis JP. Integration of evidence from multiple meta-analyses: a primer on umbrella reviews, treatment networks and multiple treatments meta-analyses. CMAJ. 2009;181(8):488–93.

Moher D, Liberati A, Tetzlaff J, Altman DG, Group P. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097.

Stroup DF, Berlin JA, Morton SC, Olkin I, Williamson GD, Rennie D, et al. Meta-analysis of observational studies in epidemiology: a proposal for reporting. Meta-analysis Of Observational Studies in Epidemiology (MOOSE) group. JAMA. 2000;283(15):2008–12.

Bellou V, Belbasis L, Tzoulaki I, Evangelou E, Ioannidis JP. Environmental risk factors and Parkinson’s disease: an umbrella review of meta-analyses. Parkinsonism Relat Disord. 2016;23:1–9.

Pieper D, Antoine SL, Mathes T, Neugebauer EA, Eikermann M. Systematic review finds overlapping reviews were not mentioned in every other overview. J Clin Epidemiol. 2014;67(4):368–75.

Senn SJ. Overstating the evidence: double counting in meta-analysis and related problems. BMC Med Res Methodol. 2009;9:10.

Smith V, Devane D, Begley CM, Clarke M. Methodology in conducting a systematic review of systematic reviews of healthcare interventions. BMC Med Res Methodol. 2011;11(1):15.

Deeks JJ, Higgins JP, Altman DG. Chapter 10: Analysing data and undertaking meta-analyses. In: Cochrane handbook for systematic reviews of interventions. 2nd ed. 2019.


Riley RD, Higgins JP, Deeks JJ. Interpretation of random effects meta-analyses. BMJ. 2011;342:d549.

Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557–60.

Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629–34.

Sterne JA, Sutton AJ, Ioannidis JP, Terrin N, Jones DR, Lau J, et al. Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials. BMJ. 2011;343:d4002.

Ioannidis JP, Trikalinos TA. An exploratory test for an excess of significant findings. Clin Trials. 2007;4(3):245–53.

Hayhoe B, Kim D, Aylin PP, Majeed FA, Cowie MR, Bottle A. Adherence to guidelines in management of symptoms suggestive of heart failure in primary care. Heart (British Cardiac Society). 2019;105(9):678–85.


Lubin JH, Gail MH. On power and sample size for studying features of the relative odds of disease. Am J Epidemiol. 1990;131(3):552–66.

Qin X, Chen J, Jia G, Yang Z. Dietary factors and pancreatic cancer risk: an umbrella review of meta-analyses of prospective observational studies. Adv Nutr. 2023;14(3):451–64.

Liu D, Meng X, Tian Q, Cao W, Fan X, Wu L, et al. Vitamin D and multiple health outcomes: an umbrella review of observational studies, randomized controlled trials, and Mendelian randomization studies. Adv Nutr. 2022;13(4):1044–62.

Aromataris E, Fernandez R, Godfrey CM, Holly C, Khalil H, Tungpunkom P. Summarizing systematic reviews: methodological development, conduct and reporting of an umbrella review approach. Int J Evid Based Healthc. 2015;13(3):132–40.

Cui H, Qu Y, Zhang L, Zhang W, Yan P, Yang C, et al. Epidemiological and genetic evidence for the relationship between ABO blood group and human cancer. Int J Cancer. 2023;153(2):320–30.

Noori M, Shokri P, Nejadghaderi SA, Golmohammadi S, Carson-Chahhoud K, Bragazzi NL, et al. ABO blood groups and risk of human immunodeficiency virus infection: a systematic review and meta-analysis. Rev Med Virol. 2022;32(3):e2298.

Degarege A, Gebrezgi MT, Beck-Sague CM, Wahlgren M, de Mattos LC, Madhivanan P. Effect of ABO blood group on asymptomatic, uncomplicated and placental Plasmodium falciparum infection: systematic review and meta-analysis. BMC Infect Dis. 2019;19(1):86.

Shea BJ, Hamel C, Wells GA, Bouter LM, Kristjansson E, Grimshaw J, et al. AMSTAR is a reliable and valid measurement tool to assess the methodological quality of systematic reviews. J Clin Epidemiol. 2009;62(10):1013–20.

Singh A, Purohit BM. ABO blood groups and its association with oral cancer, oral potentially malignant disorders and oral submucous fibrosis- a systematic review and meta-analysis. Asian Pac J Cancer Prev. 2021;22(6):1703–12.

Degarege A, Gebrezgi MT, Ibanez G, Wahlgren M, Madhivanan P. Effect of the ABO blood group on susceptibility to severe malaria: a systematic review and meta-analysis. Blood Rev. 2019;33:53–62.

Panda AK, Panda SK, Sahu AN, Tripathy R, Ravindran B, Das BK. Association of ABO blood group with severe falciparum malaria in adults: case control study and meta-analysis. Malar J. 2011;10:309.

Adegnika AA, Luty AJ, Grobusch MP, Ramharter M, Yazdanbakhsh M, Kremsner PG, et al. ABO blood group and the risk of placental malaria in sub-Saharan Africa. Malar J. 2011;10:101.

Bhattacharjee S, Banerjee M, Pal R. ABO blood groups and severe outcomes in COVID-19: a meta-analysis. Postgrad Med J. 2022;98(e2):e136–7.

Wu BB, Gu DZ, Yu JN, Yang J, Shen WQ. Association between ABO blood groups and COVID-19 infection, severity and demise: a systematic review and meta-analysis. Infect Genet Evol. 2020;84:104485.

Lubkin DT, Van Gent JM, Cotton BA, Brill JB. Mortality and outcomes by blood group in trauma patients: a systematic review and meta-analysis. Vox Sang. 2023;118(6):421–9.

Banchelli F, Negro P, Guido M, D’Amico R, Fittipaldo VA, Grima P, et al. The role of ABO blood type in patients with SARS-CoV-2 infection: a systematic review. J Clin Med. 2022;11(11):3029.

Dentali F, Sironi AP, Ageno W, Bonfanti C, Crestani S, Frattini F, et al. Relationship between ABO blood group and hemorrhage: a systematic literature review and meta-analysis. Semin Thromb Hemost. 2013;39(1):72–82.

Dentali F, Sironi AP, Ageno W, Crestani S, Franchini M. ABO blood group and vascular disease: an update. Semin Thromb Hemost. 2014;40(1):49–59.

Liu F, Li C, Zhu J, Ren L, Qi X. ABO blood type and risk of hepatocellular carcinoma: a meta-analysis. Expert Rev Gastroenterol Hepatol. 2018;12(9):927–33.

Urabe F, Kimura S, Iwatani K, Yasue K, Koike Y, Tashiro K, et al. The impact of ABO blood type on developing venous thromboembolism in cancer patients: systematic review and meta-analysis. J Clin Med. 2021;10(16):3692.

Balaouras G, Eusebi P, Kostoulas P. Systematic review and meta-analysis of the effect of ABO blood group on the risk of SARS-CoV-2 infection. PLoS ONE. 2022;17(7):e0271451.

Yang H, Tan Z, Zhang Y, Sun J, Huang P. ABO blood classification and the risk of lung cancer: a meta-analysis and trial sequential analysis. Oncol Lett. 2022;24(4):340.

Risch HA, Lu L, Wang J, Zhang W, Ni Q, Gao YT, et al. ABO blood group and risk of pancreatic cancer: a study in Shanghai and meta-analysis. Am J Epidemiol. 2012;177(12):1326–37.

Article   Google Scholar  

Risch HA, Lu L, Wang J, Zhang W, Ni Q, Gao YT, et al. ABO blood group and risk of pancreatic cancer: a study in Shanghai and meta-analysis. Am J Epidemiol. 2013;177(12):1326–37.

Takagi H, Umemoto T, All-Literature Investigation of Cardiovascular Evidence G. Meta-analysis of non-O blood group as an independent risk factor for coronary artery disease. Am J Cardiol. 2015;116(5):699–704.

Cheung JLS, Cheung VLS, Athale U. Impact of ABO Blood Group on the Development of Venous Thromboembolism in Children With Cancer: A Systematic Review and Meta-Analysis. J Pediatr Hematol Oncol. 2021;43(6):216–23.

Deng J, Jia M, Cheng X, Yan Z, Fan D, Tian X. ABO blood group and ovarian reserve: a meta-analysis and systematic review. Oncotarget. 2017;8(15):25628–36.

Zhao J, Yao Z, Hao J, Xu B, Wang Y, Li Y. Association of ABO blood groups with ovarian reserve, and outcomes after assisted reproductive technology: systematic review and meta-analyses. Reprod Biol Endocrinol. 2021;19(1):20.

Ai L, Li J, Wang W, Li Y. ABO blood group and risk of malaria during pregnancy: a systematic review and meta-analysis. Epidemiol Infect. 2022;150:e25.

Al-Askar M. Is there an association between periodontal diseases and ABO blood group? Systematic review and meta-analysis. Quintessence Int. 2022;53(5):404–12.

Bahardoust M, Barahman G, Baghaei A, Ghadimi P, Asadi Shahir MH, NajafiKandovan M, et al. The association between ABO blood group and the risk of colorectal cancer: a systematic literature review and meta-analysis. Asian Pac J Cancer Prev. 2023;24(8):2555–63.

Loscertales MP, Owens S, O’Donnell J, Bunn J, Bosch-Capblanch X, Brabin BJ. ABO blood group phenotypes and Plasmodium falciparum malaria: unlocking a pivotal mechanism. Adv Parasitol. 2007;65:1–50.

Gutierrez-Valencia M, Leache L, Librero J, Jerico C, Enguita German M, Garcia-Erce JA. ABO blood group and risk of COVID-19 infection and complications: a systematic review and meta-analysis. Transfusion. 2022;62(2):493–505.

Franchini M, Cruciani M, Mengoli C, Marano G, Candura F, Lopez N, et al. ABO blood group and COVID-19: an updated systematic literature review and meta-analysis. Blood Transfus. 2021;19(4):317–26.

PubMed   PubMed Central   Google Scholar  

He M, Wolpin B, Rexrode K, Manson JE, Rimm E, Hu FB, et al. ABO blood group and risk of coronary heart disease in two prospective cohort studies. Arterioscler Thromb Vasc Biol. 2012;32(9):2314–20.

Liu N, Zhang T, Ma L, Zhang H, Wang H, Wei W, et al. The impact of ABO blood group on COVID-19 infection risk and mortality: a systematic review and meta-analysis. Blood Rev. 2021;48:100785.

Razzaghi N, Seraj H, Heydari K, Azadeh H, Salehi A, Behnamfar M, et al. ABO blood groups associations with ovarian cancer: a systematic review and meta-analysis. Ind J Gynecol Oncol. 2020;18(4). https://doi.org/10.1007/s40944-020-00463-y .

Wu O, Bayoumi N, Vickers MA, Clark P. ABO(H) blood groups and vascular disease: a systematic review and meta-analysis. J Thromb Haemost. 2008;6(1):62–9.

Ding P. ABO blood groups and prognosis of gastric cancer: a meta-analysis. 2019.

Tiongco RE, Paragas NA, Dominguez MJ, Lasta SL, Pandac JK, Pineda-Cortel MR. ABO blood group antigens may be associated with increased susceptibility to schistosomiasis: a systematic review and meta-analysis. J Helminthol. 2018;94:e21.

Jing SW, Xu Q, Zhang XY, Jing ZH, Zhao ZJ, Zhang RH, et al. Are people with blood group O more susceptible to nasopharyngeal carcinoma and have worse survival rates? A systematic review and meta-analysis. Front Oncol. 2021;11:698113.

Miao SY, Zhou W, Chen L, Wang S, Liu XA. Influence of ABO blood group and Rhesus factor on breast cancer risk: a meta-analysis of 9665 breast cancer patients and 244,768 controls. Asia Pac J Clin Oncol. 2014;10(2):101–8.

Itenov TS, Sessler DI, Khanna AK, Ostrowski SR, Johansson PI, Erikstrup C, et al. ABO blood types and sepsis mortality. Ann Intensive Care. 2021;11(1):61.

Li T, Wang Y, Wu L, Ling Z, Li C, Long W, et al. The association between ABO blood group and preeclampsia: a systematic review and meta-analysis. Front Cardiovasc Med. 2021;8:665069.

Nayeri T, Moosazadeh M, Dalimi Asl A, Ghaffarifar F, Sarvi S, Daryani A. Toxoplasma gondii infection and ABO blood groups: a systematic review and meta-analysis. Trans R Soc Trop Med Hyg. 2024;118(4):234–46.

Wang W, Liu L, Wang Z, Lu X, Wei M, Lin T, et al. ABO blood group and esophageal carcinoma risk: from a case-control study in Chinese population to meta-analysis. Cancer Causes Control. 2014;25(10):1369–77.

Jing W, Zhao S, Liu J, Liu M. ABO blood groups and hepatitis B virus infection: a systematic review and meta-analysis. BMJ Open. 2020;10(1):e034114.

Rattanapan Y, Duangchan T, Wangdi K, Mahittikorn A, Kotepui M. Association between Rhesus blood groups and malaria infection: a systematic review and meta-analysis. Trop Med Infect Dis. 2023;8(4):190.

Liao Y, Xue L, Gao J, Wu A, Kou X. ABO blood group-associated susceptibility to norovirus infection: a systematic review and meta-analysis. Infect Genet Evol. 2020;81:104245.

Chakrani Z, Robinson K, Taye B. Association between ABO blood groups and helicobacter pylori infection: a meta-analysis. Sci Rep. 2018;8(1):17604.

Wang Z, Liu L, Ji J, Zhang J, Yan M, Zhang J, et al. ABO blood group system and gastric cancer: a case-control study and meta-analysis. Int J Mol Sci. 2012;13(10):13308–21.

Chen Z, Yang SH, Xu H, Li JJ. ABO blood group system and the coronary artery disease: an updated systematic review and meta-analysis. Sci Rep. 2016;6:23250.

Fagherazzi G, Gusto G, Clavel-Chapelon F, Balkau B, Bonnet F. ABO and Rhesus blood groups and risk of type 2 diabetes: evidence from the large E3N cohort study. Diabetologia. 2015;58(3):519–22.

Walle M, Tesfaye A, Getu F. The association of ABO and Rhesus blood groups with the occurrence of type 2 diabetes mellitus: a comparative cross-sectional study. Medicine (Baltimore). 2023;102(35):e34803.

Cano EA, Esguerra MA, Batausa AM, Baluyut JR, Cadiz R, Docto HF, et al. Association between ABO blood groups and type 2 diabetes mellitus: a meta-analysis. Curr Diabetes Rev. 2023;19(6):e270422204139.

Chanzu NM, Mwanda W, Oyugi J, Anzala O. Mucosal blood group antigen expression profiles and HIV infections: a study among female sex workers in Kenya. PLoS ONE. 2015;10(7):e0133049.

Jacobs G, Van den Berg K, Vermeulen M, Swanevelder R, Custer B, Murphy EL. Association of ABO and RhD blood groups with the risk of HIV infection. PLoS ONE. 2023;18(4):e0284975.

Englisch C, Moik F, Nopp S, Raderer M, Pabinger I, Ay C. ABO blood group type and risk of venous thromboembolism in patients with cancer. Blood Adv. 2022;6(24):6274–81.

Barbalic M, Dupuis J, Dehghan A, Bis JC, Hoogeveen RC, Schnabel RB, et al. Large-scale genomic studies reveal central role of ABO in sP-selectin and sICAM-1 levels. Hum Mol Genet. 2010;19(9):1863–72.

Makivuokko H, Lahtinen SJ, Wacklin P, Tuovinen E, Tenkanen H, Nikkila J, et al. Association between the ABO blood group and the human intestinal microbiota composition. BMC Microbiol. 2012;12:94.

Cani PD, Osto M, Geurts L, Everard A. Involvement of gut microbiota in the development of low-grade inflammation and type 2 diabetes associated with obesity. Gut Microbes. 2012;3(4):279–88.

Arendrup M, Hansen JE, Clausen H, Nielsen C, Mathiesen LR, Nielsen JO. Antibody to histo-blood group A antigen neutralizes HIV produced by lymphocytes from blood group A donors but not from blood group B or O donors. AIDS. 1991;5(4):441–4.

Neil SJ, McKnight A, Gustafsson K, Weiss RA. HIV-1 incorporates ABO histo-blood group antigens that sensitize virions to complement-mediated inactivation. Blood. 2005;105(12):4693–9.

Jenkins PV, O’Donnell JS. ABO blood group determines plasma von Willebrand factor levels: a biologic function after all? Transfusion. 2006;46(10):1836–44.

Schulz KF, Altman DG, Moher D, Group C. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMC Med. 2010;8:18.

Morange PE, Tregouet DA, Frere C, Saut N, Pellegrina L, Alessi MC, et al. Biological and genetic factors influencing plasma factor VIII levels in a healthy family population: results from the Stanislas cohort. Br J Haematol. 2005;128(1):91–9.

O’Donnell J, Laffan MA. The relationship between ABO histo-blood group, factor VIII and von Willebrand factor. Transfus Med. 2001;11(4):343–51.

Koster T, Blann AD, Briet E, Vandenbroucke JP, Rosendaal FR. Role of clotting factor VIII in effect of von Willebrand factor on occurrence of deep-vein thrombosis. Lancet. 1995;345(8943):152–5.

Kraaijenhagen RA, in’t Anker PS, Koopman MM, Reitsma PH, Prins MH, van den Ende A, et al. High plasma concentration of factor VIIIc is a major risk factor for venous thromboembolism. Thromb Haemost. 2000;83(1):5–9.

Rietveld IM, Lijfering WM, le Cessie S, Bos MHA, Rosendaal FR, Reitsma PH, et al. High levels of coagulation factors and venous thrombosis risk: strongest association for factor VIII and von Willebrand factor. J Thromb Haemost. 2019;17(1):99–109.

Souto JC, Almasy L, Muniz-Diaz E, Soria JM, Borrell M, Bayen L, et al. Functional effects of the ABO locus polymorphism on plasma levels of von Willebrand factor, factor VIII, and activated partial thromboplastin time. Arterioscler Thromb Vasc Biol. 2000;20(8):2024–8.

Papatheodorou S. Umbrella reviews: what they are and why we need them. Eur J Epidemiol. 2019;34(6):543–6.

Download references

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2022YFC2704205 to Wu QJ), the Natural Science Foundation of China (No. 82073647 and No. 82373674 to Wu QJ, and No. 82103914 to Gong TT), the Outstanding Scientific Fund of Shengjing Hospital (Wu QJ), and the 345 Talent Project of Shengjing Hospital of China Medical University (Gong TT).

Author information

Fang-Hua Liu, Jia-Kai Guo, Wei-Yi Xing, Xue-Li Bai, and Yu-Jiao Chang contributed equally to this work.

Authors and Affiliations

Department of Clinical Epidemiology, Shengjing Hospital of China Medical University, Shenyang, China

Fang-Hua Liu, Jia-Kai Guo, Wei-Yi Xing, Xue-Li Bai, Yu-Jiao Chang, Zhao Lu, Miao Yang, Ying Yang, Wen-Jing Li, Xian-Xian Jia, Tao Zhang, Jing Yang & Qi-Jun Wu

Clinical Research Center, Shengjing Hospital of China Medical University, Shenyang, China

Fang-Hua Liu, Wei-Yi Xing, Yu-Jiao Chang & Qi-Jun Wu

Hospital Management Office, Shengjing Hospital of China Medical University, Shenyang, China

Jia-Kai Guo

Department of Obstetrics and Gynecology, The Fourth Affiliated Hospital of China Medical University, Shenyang, China

Xue-Li Bai, Song Gao, De-Yu Zhang, Chuan Liu, Ting-Ting Gong & Qi-Jun Wu

Department of Radiology, Shengjing Hospital of China Medical University, Shenyang, China

Zhao Lu & Tao Zhang

Department of Pediatrics, Shengjing Hospital of China Medical University, Shenyang, China

Miao Yang & Xian-Xian Jia

Department of Hematology, Shengjing Hospital of China Medical University, Shenyang, China

Department of Otolaryngology Head and Neck Surgery, Shengjing Hospital of China Medical University, Shenyang, China

Wen-Jing Li

Department of Endocrinology, Shengjing Hospital of China Medical University, Shenyang, China

School of Medicine, Zhejiang University, Hangzhou, China

Jun-Tong Chen

Cancer Epidemiology Division, Population Sciences in the Pacific Program, University of Hawaii Cancer Center, University of Hawaii at Manoa, Honolulu, HI, USA

NHC Key Laboratory of Advanced Reproductive Medicine and Fertility (China Medical University), National Health Commission, Shenyang, China

Contributions

F-HL, J-KG, T-TG, and Q-JW contributed to the study design. J-KG, ZL, MY, X-LB, YY, W-JL, X-XJ, and TZ collected the data. F-HL and W-YX analyzed the data. F-HL, J-KG, Y-JC, YJ, J-TC, SG, LW, D-YZ, CL, T-TG, and Q-JW wrote the first draft of the manuscript and edited the manuscript. All authors read and approved the final manuscript. F-HL, J-KG, W-YX, X-LB, and Y-JC contributed equally to this work.

Corresponding authors

Correspondence to De-Yu Zhang, Chuan Liu, Ting-Ting Gong or Qi-Jun Wu.

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

The authors declare that they have no competing interests.


Supplementary Information

12916_2024_3423_MOESM1_ESM.xlsx

Additional file 1: Tables S1-13. Table S1-PRISMA checklist of items to include when reporting a systematic review or meta-analysis; Table S2-MOOSE checklist for meta-analyses of observational studies; Table S3-Search strategy; Table S4-Criteria for categorizing the credibility of evidence in the umbrella review; Table S5-The list of the excluded records during the process of full-text review; Table S6-The summary results of meta-analyses excluded due to lack of data for quantitative synthesis; Table S7-Description of 270 associations investigating the associations between ABO and Rhesus blood groups and multiple health outcomes; Table S8-Strength assessment of evidence from 270 associations examining associations between ABO and Rhesus blood groups and multiple health outcomes; Table S9-Methodological quality assessment of the included articles with AMSTAR; Table S10-The results of GRADE assessment of the evidence certainty on the associations between ABO and Rhesus blood groups and multiple health outcomes; Table S11-Sensitivity analysis results of omission of small-sized studies (< 25th percentile) from those meta-analyses with evidence of small-study effects; Table S12-Sensitivity analysis results of omission of primary studies with low-quality evidence; Table S13-Sensitivity analysis results of excluded meta-analyses due to overlap.

12916_2024_3423_MOESM2_ESM.docx

Additional file 2: Fig. S1-7. Fig. S1-Summary effects sizes with inverse of the variance of association between blood group and cancer outcomes; Fig. S2-Summary effects sizes with inverse of the variance of association between blood group and infectious disease outcomes; Fig. S3-Summary effects sizes with inverse of the variance of association between blood group and cardiovascular and cerebrovascular outcomes; Fig. S4-Summary effects sizes with inverse of the variance of association between blood group and oral related outcomes; Fig. S5-Summary effects sizes with inverse of the variance of association between blood group and pregnancy related outcomes; Fig. S6-Summary effects sizes with inverse of the variance of association between blood group and metabolic disease outcomes; Fig. S7-Summary effects sizes with inverse of the variance of association between blood group and other outcomes.


About this article

Cite this article

Liu, FH., Guo, JK., Xing, WY. et al. ABO and Rhesus blood groups and multiple health outcomes: an umbrella review of systematic reviews with meta-analyses of observational studies. BMC Med 22, 206 (2024). https://doi.org/10.1186/s12916-024-03423-x

Received: 29 November 2023

Accepted: 9 May 2024

Published: 20 May 2024

DOI: https://doi.org/10.1186/s12916-024-03423-x

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Keywords

  • ABO blood group
  • Meta-analysis
  • Observational study
  • Rhesus blood group


  • Open access
  • Published: 26 May 2024

Intrapartum exposure to synthetic oxytocin, maternal BMI, and neurodevelopmental outcomes in children within the ECHO consortium

  • Lisa Kurth 1 ,
  • T. Michael O’Shea 2 ,
  • Irina Burd 3 ,
  • Anne L. Dunlop 4 ,
  • Lisa Croen 5 ,
  • Greta Wilkening 6 ,
  • Ting-ju Hsu 7 ,
  • Stephan Ehrhardt 7 ,
  • Arvind Palanisamy 8 ,
  • Monica McGrath 7 ,
  • Marie L. Churchill 7 ,
  • Daniel Weinberger 9 , 10 ,
  • Marco Grados 11 , 12 &
  • Dana Dabelea 13  

Journal of Neurodevelopmental Disorders volume 16, Article number: 26 (2024)

Synthetic oxytocin (sOT) is frequently administered during parturition. Studies have raised concerns that fetal exposure to sOT may be associated with altered brain development and risk of neurodevelopmental disorders. In a large and diverse sample of children with data about intrapartum sOT exposure and subsequent diagnoses of two prevalent neurodevelopmental disorders, i.e., attention deficit hyperactivity disorder (ADHD) and autism spectrum disorder (ASD), we tested the following hypotheses: (1) Intrapartum sOT exposure is associated with increased odds of child ADHD or ASD; (2) associations differ across sex; (3) associations between intrapartum sOT exposure and ADHD or ASD are accentuated in offspring of mothers with pre-pregnancy obesity.

The study sample comprised 12,503 participants from 44 cohort sites included in the Environmental Influences on Child Health Outcomes (ECHO) consortium. Mixed-effects logistic regression analyses were used to estimate the association between intrapartum sOT exposure and offspring ADHD or ASD (in separate models). Maternal obesity (pre-pregnancy BMI ≥ 30 kg/m²) and child sex were evaluated for effect modification.

Intrapartum sOT exposure was present in 48% of participants. sOT exposure was not associated with increased odds of ASD (adjusted odds ratio [aOR] 0.86; 95% confidence interval [CI], 0.71–1.03) or ADHD (aOR 0.89; 95% CI, 0.76–1.04). Associations did not differ by child sex. Among mothers with pre-pregnancy obesity, sOT exposure was associated with lower odds of offspring ADHD (aOR 0.72; 95% CI, 0.55–0.96). No association was found among mothers without obesity (aOR 0.97; 95% CI, 0.80–1.18).

Conclusions

In a large, diverse sample, we found no evidence of an association between intrapartum exposure to sOT and odds of ADHD or ASD in either male or female offspring. Contrary to our hypothesis, among mothers with pre-pregnancy obesity, sOT exposure was associated with lower odds of child ADHD diagnosis.

For over 50 years, synthetic oxytocin (sOT), an exogenous neuropeptide and uterine stimulant (trade names Pitocin® and Syntocinon®), typically administered to the pregnant individual by intravenous infusion, has been increasingly used as a first-line approach to induce and/or augment labor by stimulating uterine contractions [1,2,3,4,5,6]. Administration of sOT as a single agent for labor induction and/or augmentation assists in the expulsion of the fetus in the setting of childbirth complications [7] and may minimize the risk of instrumental deliveries [8]. However, despite the increasing frequency with which sOT is administered to pregnant women [9,10,11], only a few large studies have characterized the relationship between intrapartum sOT exposure and child neurodevelopmental outcomes. One of the largest studies (n = 1.5 million), based on a national cohort of Scandinavian children, found an approximately 20% increased risk of attention deficit hyperactivity disorder (ADHD) and autism spectrum disorder (ASD) associated with sOT exposure. However, the authors were reassured regarding clinical use of sOT because confounder adjustment attenuated this association [12].

Child neurodevelopmental outcomes following intrapartum sOT exposure have not been studied in large samples of children born in the United States (US) [13, 14], where obstetric medical practices may differ from those of other countries [15]. Among existing studies, some report associations between sOT exposure and ADHD and/or ASD [13, 14, 16,17,18,19], some report mixed results [20,21,22,23,24,25], and some report no associations [12, 26,27,28,29]. Preclinical models provide evidence of potential neuroprotective effects of endogenous oxytocin; however, if pulsatile uterine contractions are excessively prolonged by treatment with exogenous sOT, uteroplacental perfusion can be reduced to an extent sufficient to alter brain development [30]. Thus, a greater understanding is needed of the relationship between fetal intrapartum exposure to sOT and the risk of adverse child neurodevelopmental outcomes.

ADHD and ASD are among the most prevalent neurodevelopmental disorders with poorly understood etiology. ADHD, a disorder characterized by symptoms of inattention, distractibility, impulsivity, hyperactivity, and behavioral dysregulation [31], affects almost 10% of US children [32, 33]. ASD, characterized by deficits in social interaction and social communication with restricted or repetitive patterns of behavior and interests [34], affects 1 in 36 [35] eight-year-old US children [36]. ADHD and ASD demonstrate high diagnostic comorbidity [37] and represent the two most prevalent developmental disabilities among children aged 3 to 17 years in the US and other high-income countries [38, 39]. In addition, the unique constellation of behavioral characteristics exhibited by children diagnosed with ADHD and/or ASD has long posed significant burdens within familial and educational settings [40,41,42,43]. Importantly, the steadily rising prevalence of both ADHD and ASD impels an urgent need to identify modifiable risk factors [44,45,46,47,48]. The poorly understood etiology, comorbidity, and prevalence of ADHD and ASD prompted our examination of the association between intrapartum sOT exposure and these specific neurodevelopmental conditions.

Because females and males differ with respect to neurodevelopmental vulnerability [17, 49] and males experience increased risk of both ADHD and ASD [50], we evaluated sex differences in the associations between sOT and neurodevelopmental outcomes. In addition, because mothers with obesity exhibit poorer uterine contractility than non-obese mothers, and therefore often require sOT induction to facilitate labor [50–53], we evaluated maternal pre-pregnancy obesity (defined by BMI) as a potential effect measure modifier [51]. Here we tested three hypotheses: (1) intrapartum exposure to sOT is associated with increased odds of child ADHD or ASD; (2) associations differ across sex; (3) associations between intrapartum sOT exposure and ADHD or ASD are accentuated in offspring of mothers with pre-pregnancy obesity.

Data source

We used data from a large consortium, the Environmental influences on Child Health Outcomes (ECHO) program, to evaluate the association between intrapartum sOT and offspring ADHD and ASD. The ECHO program is a consortium of longitudinal cohort studies established by the National Institutes of Health (NIH) to examine the impacts of various exposures – chemical, biological, physical, and social – in relation to child health and development [ 52 ]. Specifically, ECHO research focuses on childbirth and perinatal outcomes, respiratory illness, obesity, neurodevelopment, and overall wellness, relying on a protocol of harmonized derived variables among cohort sites [ 53 , 54 , 55 ]. The study protocol was approved by the cohort-specific and/or the single ECHO Institutional Review Boards. Written informed consent was obtained for ECHO Cohort Data Collection Protocol participation and for participation in specific cohorts.

The study population included 12,503 biological mother/child pairs enrolled in 44 ECHO cohorts. The 44 cohorts included two ASD-enriched studies, six cohorts enrolling children from neonatal intensive care units (NICU), and thirty-six general population cohorts (See Additional File 1 Table S1 and Table S2 ). ASD-enriched studies included children originally enrolled as part of a case-control study of ASD, developmental delays, and typical development as well as a cohort enrolling younger siblings of children with ASD. NICU cohorts enrolled directly from NICUs. General population cohorts consisted of pregnancy and early-childhood studies evaluating other child health outcomes, including birth outcomes, growth and development, asthma, and overall wellbeing. Inclusion criteria for the study were (1) singleton births; (2) data available on child ADHD and ASD diagnoses, and (3) data on maternal administration of sOT during labor or delivery. For families with more than one child enrolled in the ECHO cohort, one sibling was randomly selected to be included in this study. We restricted inclusion to those cohorts with available data on at least 20 mother/child dyads. The decision-logic for inclusion and exclusion of cohorts and participants is displayed in Additional File 1 Fig. S1 . We identified 1073 ADHD cases and 851 ASD cases in our study population.

Synthetic oxytocin administration

Synthetic oxytocin use during childbirth (yes vs. no) was ascertained from either medical record abstraction or self-report by the mother. To identify relevant data for harmonization of extant and new data related to intrapartum sOT use, the ECHO platform was searched using the following terms: sOT, oxytocin, Pitocin, Syntocinon, uterotonic, uterine stimulant, stimulation, induction, induce, augmentation, and augment. The terms used on the ECHO forms were oxytocin and Pitocin. Use of sOT for each mother-child pair was ascertained based on a prioritization of available information, in the following order: (1) documentation of sOT administration during labor and delivery in maternal medical records, (2) documentation of labor induction or augmentation in maternal medical records, (3) documentation of labor induction or augmentation in childbirth medical records, and (4) maternal self-report of having been administered sOT.
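To make the prioritization concrete, here is a minimal R sketch (not the authors' code; all column names and the toy data are invented for illustration) that derives the binary exposure flag by taking the first non-missing source in the stated order:

```r
# Hypothetical sketch of the four-level exposure hierarchy. Each source
# column is TRUE/FALSE where documented and NA where unavailable.
library(dplyr)

dyads <- tibble(
  id                      = 1:3,
  mr_sot_labor_delivery   = c(TRUE, NA, NA),   # (1) maternal labor/delivery record
  mr_induction_augment    = c(NA, NA, NA),     # (2) induction/augmentation, maternal record
  birth_induction_augment = c(NA, FALSE, NA),  # (3) induction/augmentation, childbirth record
  self_report_sot         = c(NA, NA, TRUE)    # (4) maternal self-report
)

# coalesce() keeps the first non-missing value, i.e., the highest-priority
# source available for each mother-child pair.
dyads <- mutate(dyads, sot_exposed = coalesce(
  mr_sot_labor_delivery, mr_induction_augment,
  birth_induction_augment, self_report_sot
))
```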

ADHD and ASD

We defined ADHD and ASD based on caregiver report of physician-diagnosed disorders. Caregivers were asked whether a doctor or other health care provider had ever informed them that their child has or had Attention Deficit Disorder (ADD) or Attention Deficit/Hyperactivity Disorder (ADHD) for an ADHD diagnosis, and/or Autism Spectrum Disorder (ASD), Asperger's Disorder, or Pervasive Developmental Disorder (PDD) for an ASD diagnosis. In some cohorts, ASD diagnosis was obtained from several clinical sources, including established gold-standard diagnostic instruments, such as the Autism Diagnostic Observation Schedule [56], or a diagnosis extracted from medical records.

Self-reported maternal race was categorized as American Indian/Alaskan Native, Asian, Black, Native Hawaiian or Pacific Islander, White, Other Race, and Multiple Races. Mother's highest education was categorized as high school degree or equivalent or less; some college with no degree; and bachelor's degree and above. Child characteristics included caregiver-reported child race, child birth year (< 2005; 2006–2010; 2011–2015; 2016–2022), and child sex assigned at birth (male or female).

Maternal age at the time of delivery was determined from demographic questionnaires and maternal medical records. Preterm birth (yes/no), defined as birth prior to 37 weeks gestation, was based on available reports for gestational age.

Gestational age at birth in completed weeks was obtained through abstraction of maternal or child medical records or through parent report. For medical record abstraction, an accepted hierarchy [57, 58] was employed to ascertain the most accurate measure for estimating the due date: dating based on embryo placement following in vitro fertilization or on artificial insemination; obstetrical estimate from a first-trimester ultrasound; obstetrical estimate from a second-trimester ultrasound with fetal biparietal diameter dating within 2 weeks of a sure last menstrual period (LMP); second-trimester ultrasound with an unsure or no LMP date; a "consensus" estimated date of delivery reported in the obstetrical medical record with no ultrasound documented during the first or second trimester; obstetrical estimate from LMP only; neonatal estimate of gestational age at birth obtained from child medical records; estimate from a cohort research encounter; maternal report; and a cohort-provided estimated date of delivery without further description.

Large for gestational age (LGA) was defined as child birthweight-for-gestational-age and sex > 90th percentile, with percentiles derived from the International Fetal and Newborn Growth Consortium for the 21st Century (INTERGROWTH-21) standard [59]. Pre-pregnancy obesity was defined as a body mass index (BMI) ≥ 30 kg/m² according to accepted definitions [35]. Pre-pregnancy BMI was obtained using measured or self-reported height and weight between 12 months prior to conception and the end of the first trimester. Gestational diabetes mellitus (GDM) was defined as new-onset diabetes during pregnancy based on self-report or as indicated in maternal medical records.
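As a trivial illustration of the BMI-based definition (a sketch only; the actual ECHO handling of measured vs. self-reported height and weight is more involved):

```r
# Pre-pregnancy BMI = weight (kg) / height (m)^2, with the obesity flag
# set at the BMI >= 30 kg/m^2 threshold used in this study.
bmi <- function(weight_kg, height_m) weight_kg / height_m^2

bmi(85, 1.65)        # 31.2
bmi(85, 1.65) >= 30  # TRUE -> classified as pre-pregnancy obesity
```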

Statistical analysis

We compared the distribution of demographic characteristics and medical conditions between women who received sOT during labor and delivery and those who did not, using Pearson chi-square tests. Using mixed-effects logistic models (the "glmer" function from the "lme4" R package), we calculated unadjusted and covariate-adjusted odds ratios (aORs) and corresponding 95% confidence intervals (CIs) to estimate associations between sOT use during childbirth and risk of ADHD or ASD in the offspring. Models were fitted with maximum likelihood estimators. Wald 95% CIs were constructed, and P-values were derived from the Wald z-test. In multivariable analyses, we adjusted for child race, ethnicity, sex, child's birth year, gestational age and LGA status at birth, maternal age at delivery, and highest maternal education level. Maternal obesity prior to pregnancy and GDM were added to the adjusted model as covariates independently and in tandem. Models were fitted with random effects for individual cohorts to account for clustering within cohort. Based on a priori hypotheses that there would be variation by child sex and maternal pre-pregnancy obesity, fully adjusted models for both ADHD and ASD were stratified to examine for differences by strata. We evaluated effect modification by sex and by maternal pre-pregnancy obesity using product terms (sOT × sex and sOT × maternal pre-pregnancy obesity). For all analyses, the criterion for statistical significance was P < 0.05, without adjustment for multiple comparisons.
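A minimal sketch of the adjusted model described above, using lme4::glmer with a cohort-level random intercept (illustrative only: variable names are assumptions carried over from the earlier sketch, and the covariate set is abbreviated relative to the paper's):

```r
library(lme4)

# Mixed-effects logistic model: fixed effects for the exposure and covariates,
# random intercept (1 | cohort) to account for clustering within cohorts.
fit_adhd <- glmer(
  adhd ~ sot_exposed + sex + birth_year + gestational_age + lga +
    maternal_age + maternal_education + (1 | cohort),
  data = dyads, family = binomial(link = "logit")
)

# Wald 95% CIs on the log-odds scale, exponentiated to odds ratios.
exp(cbind(aOR = fixef(fit_adhd),
          confint(fit_adhd, parm = "beta_", method = "Wald")))

# Effect modification can be tested with a product term, e.g.:
# update(fit_adhd, . ~ . + pre_pregnancy_obesity + sot_exposed:pre_pregnancy_obesity)
```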

Imputation was performed for missing data using multiple imputation by chained equations from the “mice” R package [ 60 ]. The results were pooled after 25 imputations with a maximum of 10 iterations. The imputation models included our variables of interest with cohort type (general population, NICU, or ASD-enriched) and individual cohort membership as classification variables. Regression estimates from the imputed datasets were pooled together using Rubin’s rule.
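A sketch of the imputation-and-pooling workflow with the mice package, matching the stated settings (25 imputations, up to 10 iterations); the model inside with() is simplified relative to the paper's mixed-effects specification, and variable names remain assumptions:

```r
library(mice)

# m = 25 imputed datasets, maxit = 10 iterations of chained equations.
imp <- mice(dyads, m = 25, maxit = 10, seed = 2024, printFlag = FALSE)

# Fit the analysis model on each completed dataset ...
fits <- with(imp, glm(adhd ~ sot_exposed + maternal_age + sex,
                      family = binomial))

# ... and pool the 25 sets of estimates with Rubin's rule.
summary(pool(fits), conf.int = TRUE, exponentiate = TRUE)
```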

In a set of sensitivity analyses, we explored potential cohort effects by assessing whether observed associations between the sOT use and odds of ADHD or ASD differed after removing individual cohorts and/or cohort types based on specific enrollment criteria (e.g. ASD-enriched, NICU, and general population cohorts). All analyses were performed using the R statistical software package, version 4.1.0 (R Foundation for Statistical Computing, Vienna, Austria).
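One way to code the leave-one-cohort-out check is a simple refit loop (again a sketch under the same assumed variable names; the real analysis would refit the fully adjusted model):

```r
# Refit with each cohort removed and track the exposure odds ratio;
# a stable range suggests no single cohort drives the pooled estimate.
library(lme4)

loo_or <- sapply(unique(dyads$cohort), function(drop) {
  fit <- glmer(adhd ~ sot_exposed + (1 | cohort),
               data = subset(dyads, cohort != drop),
               family = binomial)
  exp(fixef(fit)["sot_exposed"])   # assumes sot_exposed is coded 0/1
})

range(loo_or)
```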

Associations between participant characteristics and sOT exposure

Forty-eight percent of study participants were exposed to sOT. Table 1 shows socio-demographic characteristics of the sample by sOT exposure status. Maternal age at delivery and child sex assigned at birth were similar in sOT-exposed mothers compared with those not exposed. Mean child age at ADHD diagnosis was 7.10 years in the sOT-exposed group vs. 6.81 years in the non-exposed group; mean child age at ASD diagnosis was 3.00 years in the sOT-exposed group vs. 3.86 years in the non-exposed group. Children exposed to sOT were more likely to be Hispanic (24.5% vs. 20.5%) and less likely to be White (56.7% vs. 60.9%) or born preterm (9.1% vs. 20.2%). Exposed mothers were more likely to have pre-pregnancy obesity (28.8% vs. 26.7%) and GDM (9.0% vs. 7.2%) compared with those not exposed.

Associations between sOT exposure and attention deficit hyperactivity disorder

As shown in Table 2, the adjusted association between sOT exposure and ADHD was not significant in the pooled sample (aOR 0.89; 95% CI, 0.76–1.04). In analyses stratified by child sex, the odds ratios were not statistically significant in either male (aOR 0.89; 95% CI, 0.73–1.07) or female offspring (aOR 0.91; 95% CI, 0.69–1.19) (interaction P = 0.83).

Associations between sOT exposure and autism spectrum disorder

The unadjusted and adjusted ORs for associations between sOT exposure during labor and delivery and ASD diagnosis are shown in Table 3. After adjusting for confounders, the aOR for the association between ASD diagnosis and sOT exposure was 0.86 (95% CI, 0.71–1.03). Odds ratios were similar in male (aOR 0.81; 95% CI, 0.65–1.01) and female offspring (aOR 0.97; 95% CI, 0.68–1.39) (interaction P = 0.42).

Effect modification by maternal obesity status

Participant clusters grouped by maternal pre-pregnancy obesity status are shown in Table 4. In analyses adjusted for potential confounders, the interaction between sOT and maternal pre-pregnancy obesity was statistically significant for ADHD (P = 0.03) but not for ASD (P = 0.37). Forest plots depicting the association of sOT and ADHD, stratified by maternal obesity status, are presented in Fig. 1. Among mothers who were obese prior to pregnancy, sOT was associated with lower odds of ADHD (aOR 0.72; 95% CI, 0.55–0.96); this association was not found among children of mothers who were not obese before pregnancy (aOR 0.97; 95% CI, 0.80–1.18).

Fig. 1 Analysis of the association of sOT and ADHD, stratified by maternal pre-pregnancy obesity. Adjusted associations between sOT exposure and attention deficit hyperactivity disorder (ADHD), stratified by obesity before pregnancy, adjusted for maternal age at delivery, highest maternal education level, child race, ethnicity, and sex, gestational age and large-for-gestational-age status at birth, child birth year, and gestational diabetes mellitus. ASD, autism spectrum disorder; CI, confidence interval; NICU, neonatal intensive care unit; OR, odds ratio; sOT, synthetic oxytocin. ASD-enriched cohorts: n = 828; NICU cohorts: n = 878; other cohorts: n = 10,797

Overall, we did not observe significant heterogeneity in cohort-specific and cohort type-specific effect estimates for the associations between intrapartum sOT exposure and child ADHD and ASD. There was no meaningful change in effect estimates after removing each cohort and after restricting to each cohort type (NICU, ASD-enriched, general population) (Fig.  1 and Additional File 1 Figs. S2 - S4 ).

In a multi-site, diverse cohort, in which 48% of mothers were administered sOT during childbirth, we found no evidence of an association between intrapartum exposure to sOT and odds of ADHD or ASD in either male or female offspring. Contrary to our hypothesis, among mothers with pre-pregnancy obesity, sOT was associated with lower odds of child ADHD diagnosis.

Our finding that intrapartum sOT exposure was not associated with adverse neurodevelopmental outcomes in the offspring is consistent with findings from several prior studies [12, 20,21,22,23,24,25,26,27,28]. In contrast to some of these prior studies and to our current results, preclinical studies suggest that sOT exposure might disrupt fetal neurodevelopment [61, 62] via cellular mechanisms such as epigenetic triggering [2, 63,64,65], oxytocin receptor alterations [6], DNA damage and cellular death [66, 67], complex signaling pathways [19], and transgenerational hormonal imprinting [68, 69]. Biologically plausible mechanisms that could link fetal exposure to intrapartum sOT with ADHD or ASD include excessive uterine contractility leading to decreased uteroplacental perfusion and fetal hypoxemia [18, 70,71,72,73,74,75,76], especially at high cumulative doses [17], and transplacental transfer of sOT [77, 78] resulting in sOT-induced oxytocinergic signaling in the developing brain, the importance of which is suggested by the role that oxytocinergic signaling plays in the development of the social behaviors that are characteristically impaired in ASD [79]. Exogenous sOT differs from the endogenous human oxytocin hormone [6, 80], and rodents exposed to sOT demonstrate altered behavioral presentations consistent with psychiatric phenotypes [81], pervasive developmental conditions [69], and enduring male-specific neuroendocrine impairments, including dysfunctional cortical connectivity [71].

To our knowledge, the interaction of maternal obesity and intrapartum sOT exposure in relation to offspring neurodevelopmental outcomes has not previously been investigated. Recent reports suggest maternal weight gain and pre-pregnancy BMI may contribute to child ASD outcomes [82, 83]. Maternal obesity can lead to poor uterine contractility [84, 85], and can thus impede the progression of labor and increase the likelihood of sOT exposure and of exposure to higher cumulative doses of sOT [86,87,88,89]. Given these reports, we explored a potential joint effect of sOT exposure and maternal pre-pregnancy BMI on offspring neurodevelopmental outcomes in our study. Our finding that sOT was associated with lower odds of ADHD among offspring of mothers with pre-pregnancy obesity might be explained, at least in part, by confounding by indication: mothers with obesity and diminished uterine contractility may have been more likely to be delivered promptly by C-section after an initial, possibly non-productive induction using sOT, thereby mitigating fetal exposure to the intense stress of labor that typically accompanies sOT exposure [90, 91]. This may also explain our observed trend of more frequent sOT childbirth intervention among mothers with pre-pregnancy obesity.

It is also plausible that in obese mothers, sOT augmentation and/or induction of labor may reduce the risk of a prolonged second stage of labor and potentially mitigate the impact of stress on the vulnerable fetal brain. Additionally, this exposure could mechanistically mimic the neuroprotective effect of endogenous oxytocin, as has been reported in preclinical models [92, 93].

Although our study's findings did not confirm an association between intrapartum exposure to sOT and subsequent onset of child ADHD or ASD, the well-documented routinization of sOT utilization during childbirth leaves us curious about the potential influence of this exposure on child neurodevelopmental outcomes. Synthetic oxytocin is in widespread use in the United States and globally [4, 6]. Labor induction and augmentation with sOT is one of the most prevalent clinical interventions in modern obstetric practice [86, 94]. In specific circumstances in which spontaneous labor has not begun, e.g., pregnancies at term with vertex, non-anomalous, singleton fetuses, induction of labor with sOT provides significant maternal (reduced maternal mortality, lower Cesarean delivery rate) and neonatal (reduced rates of neonatal death and meconium aspiration syndrome) benefits compared to expectant management [95,96,97]. Among pharmacologic agents used for labor induction and augmentation, sOT is by far the most frequently used. Furthermore, maternal obesity and GDM are associated with higher doses of sOT during childbirth intervention [98].

For labor induction and/or augmentation, and for the management of the third stage of labor, US professional associations and the WHO recommend sOT as the uterotonic agent of choice [99,100,101]. The agent is administered intravenously via infusion pump to provide a precise infusion rate, which is adjusted based on uterine activity (frequency and strength of contractions), fetal heart rate, and progress of labor [102]. In patients who achieve a desirable labor pattern and progress, there is no consensus about whether sOT should be discontinued or continued, and consequently sOT dosage tends to vary across birthing facilities [102]. Based on medical indication and local practices, the initial sOT dose varies from 0.5 to 6 milliunits/minute and the maximum dose varies between 16 and 64 milliunits/minute. Per this protocol, sOT is administered continuously until uterine contractions are deemed insufficient to reliably expel the fetus and labor is declared a "failure to progress," warranting a Cesarean section [62, 103].

Strengths and limitations

A chief limitation of our study was our lack of information on indications for childbirth intervention with sOT (specifically, the clinical indication for labor induction or augmentation), length of labor, mode of delivery (e.g., vaginal or C-section), and sOT dosage administered to laboring mothers during offspring delivery. We defined sOT exposure as a binary category, so we were unable to assess potential dose-response associations or threshold effects. Findings from a study by Soltys et al. [17] are consistent with the concept that the strength and direction of the relationship between sOT and ASD varies across a range of sOT doses; specifically, low-dose/short-duration sOT exposure was associated with a statistically non-significant decrease in the odds of ASD, moderate dose/duration was associated with a non-significant increase in the odds of ASD, and high-dose/long-duration exposure was associated with an increase in the odds of ASD among male offspring. Our use of a binary exposure limited the opportunity to assess such dose-dependent associations, leaving open the question of a potential dose-response influence on our results.

Given the limitations of the current study, and the fact that the main non-null finding was unexpected, replication of our analyses in other cohorts with clinical data related to indication for and dosage of intrapartum sOT is needed before drawing conclusions about associations between intrapartum sOT exposure and neurodevelopmental outcomes in the offspring.

Another potential limitation of our study is that child diagnoses of ADHD or ASD were based on parent report of physician diagnosis, rather than on rigorous assessment by clinicians with expertise in diagnosing these specific neurodevelopmental conditions, which could have led to misclassification of our outcomes.

Despite these limitations, our study had notable strengths. First, its large, diverse, multi-site cohort allowed us to derive precise estimates of associations, adjust for confounders, and explore effect measure modification by maternal pre-pregnancy obesity. Second, this was, to our knowledge, the first study to assess the interaction between intrapartum sOT exposure and maternal BMI on child neurodevelopmental outcomes.

In a sample from the ECHO cohort, we found no evidence of an association between intrapartum sOT exposure and ADHD or ASD in the offspring. Instead, we observed an unexpected association between intrapartum sOT exposure and decreased odds of child ADHD among women with pre-pregnancy obesity. We observed use of intrapartum sOT in nearly half our sample, and more frequently among mothers with pre-pregnancy obesity. The unknown complexities and under-investigated mechanisms and pathways of intrapartum sOT, weighed against the sensitivity of the still-developing fetal brain, provide a robust opportunity for future exploration of this early exposure.

Data availability

Select de-identified data from the ECHO Program are available through NICHD’s Data and Specimen Hub (DASH). Information on study data not available on DASH, such as some Indigenous datasets, can be found on the ECHO study DASH webpage.

Abbreviations

ADHD: Attention deficit hyperactivity disorder

aOR: Adjusted odds ratio

ASD: Autism spectrum disorder

BMI: Body mass index

CI: Confidence interval

ECHO: Environmental influences on Child Health Outcomes

GDM: Gestational diabetes mellitus

INTERGROWTH-21: International Fetal and Newborn Growth Consortium for the 21st Century

ICD: International Classification of Disease

LGA: Large for gestational age

NICU: Neonatal intensive care unit

sOT: Synthetic oxytocin

Perry RL, Satin AJ, Barth WH, Valtier S, Cody JT, Hankins GD. The pharmacokinetics of oxytocin as they apply to labor induction. Am J Obstet Gynecol. 1996;174(5):1590–93.

Kenkel W, Perkeybile A-M, Yee J, Pournajafi-Nazarloo H, Lillard T, Ferguson E, et al. Behavioral and epigenetic consequences of oxytocin treatment at birth. Sci Adv. 2019;5(5):eaav2244.

Laughon SK, Branch DW, Beaver J, Zhang J. Changes in labor patterns over 50 years. Am J Obstet Gynecol. 2012;206(5):419.e1–9.

Talati C, Carvalho JCA, Luca A, Balki M. The effect of intermittent oxytocin pretreatment on oxytocin-induced contractility of human myometrium in vitro. Anesth Analg. 2019;128(4):671–78.

Oscarsson ME, Amer-Wåhlin I, Rydhstroem H, Källén K. Outcome in obstetric care related to oxytocin use. A population-based study. Acta Obstet Gynecol Scand. 2006;85(9):1094–8.

Carter CS, Kenkel WM, MacLean EL, Wilson SR, Perkeybile AM, Yee JR, et al. Is oxytocin nature’s medicine? Pharmacol Rev. 2020;72(4):829–61.

Clapp MA, James KE, Bates SV, Kaimal AJ. Patient and hospital factors associated with unexpected newborn complications among term neonates in US hospitals. JAMA Netw Open. 2020;3(2):e1919498.

Hinshaw K, Simpson S, Cummings S, Hildreth A, Thornton J. A randomised controlled trial of early versus delayed oxytocin augmentation to treat primary dysfunctional labour in nulliparous women. BJOG. 2008;115(10):1289–95; discussion 1295–6.

Mirzabagi E, Deepak NN, Koski A, Tripathi V. Uterotonic use during childbirth in Uttar Pradesh: accounts from community members and health providers. Midwifery. 2013;29(8):902–10.

Zhang J, Laughon SK, Branch DW. Oxytocin regimen for labor augmentation, labor progression, and perinatal outcomes. Obstet Gynecol. 2012;119(2):381–82.

Harris JC, Carter CS. Therapeutic interventions with oxytocin: current status and concerns. J Am Acad Child Adolesc Psychiatry. 2013;52(10):998–1000.

Stokholm L, Juhl M, Talge NM, Gissler M, Obel C, Strandberg-Larsen K. Obstetric oxytocin exposure and ADHD and ASD among Danish and Finnish children. Int J Epidemiol. 2021;50(2):446–56.

Kurth L, Davalos D. Prenatal exposure to synthetic oxytocin: risk to neurodevelopment? J Prenat Perinat Psychol Health. 2012;27(1):3.

Kurth L, Haussmann R. Perinatal pitocin as an early ADHD biomarker: neurodevelopmental risk? J Atten Disord. 2011;15(5):423–31.

Tikkanen R, Gunja M, Fitzgerald M, Zephyrin L. Maternal mortality and maternity care in the United States compared to 10 other developed countries. Commonwealth Fund; 2020:1–17.

Smallwood M, Sareen A, Baker E, Hannusch R, Kwessi E, Williams T. Increased risk of autism development in children whose mothers experienced birth complications or received labor and delivery drugs. ASN Neuro. 2016;8(4):1759091416659742.

Soltys SM, Scherbel JR, Kurian JR, Diebold T, Wilson T, Hedden L, et al. An association of intrapartum synthetic oxytocin dosing and the odds of developing autism. Autism. 2020;24(6):1400–10.

García-Alcón A, González-Peñas J, Weckx E, Penzol MJ, Gurriarán X, Costas J, et al. Oxytocin exposure in labor and its relationship with cognitive impairment and the genetic architecture of autism. J Autism Dev Disord. 2022;53(1):66–79.

Torres G, Mourad M, Leheste JR. Perspectives of pitocin administration on behavioral outcomes in the pediatric population: recent insights and future implications. Heliyon. 2020;6(5):e04047.

Lønfeldt NN, Verhulst FC, Strandberg-Larsen K, Plessen KJ, Lebowitz ER. Assessing risk of neurodevelopmental disorders after birth with oxytocin: a systematic review and meta-analysis. Psychol Med. 2019;49(6):881–90.

Hertz-Picciotto I, Schmidt RJ, Krakowiak P. Understanding environmental contributions to autism: causal concepts and the state of science. Autism Res. 2018;11(4):554–86.

Monks DT, Palanisamy A. Oxytocin: at birth and beyond. A systematic review of the long-term effects of peripartum oxytocin. Anaesthesia. 2021;76(11):1526–37.

Saade GR, Sibai BM, Silver R. Induction or augmentation of labor and autism. JAMA Pediatr. 2014;168(2):190–1.

Wang C, Geng H, Liu W, Zhang G. Prenatal, perinatal, and postnatal factors associated with autism: a meta-analysis. Med (Baltim). 2017;96(18):e6696.

Gardener H, Spiegelman D, Buka SL. Perinatal and neonatal risk factors for autism: a comprehensive meta-analysis. Pediatrics. 2011;128(2):344–55.

Guastella AJ, Cooper MN, White CR, White MK, Pennell CE, Whitehouse AJ. Does perinatal exposure to exogenous oxytocin influence child behavioural problems and autistic-like behaviours to 20 years of age? J Child Psychol Psychiatry. 2018;59(12):1323–32.

Henriksen L, Wu CS, Secher NJ, Obel C, Juhl M. Medical augmentation of labor and the risk of ADHD in offspring: a population-based study. Pediatrics. 2015;135(3):e672–7.

Oberg AS, D’Onofrio BM, Rickert ME, Hernandez-Diaz S, Ecker JL, Almqvist C, et al. Association of labor induction with offspring risk of autism spectrum disorders. JAMA Pediatr. 2016;170(9):e160965.

Wiggs KK, Rickert ME, Hernandez-Diaz S, Bateman BT, Almqvist C, Larsson H, et al. A family-based study of the association between labor induction and offspring attention-deficit hyperactivity disorder and low academic achievement. Behav Genet. 2017;47(4):383–93.

Miranda A, Sousa N. Maternal hormonal milieu influence on fetal brain development. Brain Behav. 2018;8(2):e00920.

Singh A, Yeh CJ, Verma N, Das AK. Overview of attention deficit hyperactivity disorder in young children. Health Psychol Res. 2015;3(2):2115.

Xu G, Strathearn L, Liu B, Yang B, Bao W. Twenty-year trends in diagnosed attention-deficit/hyperactivity disorder among US children and adolescents, 1997–2016. JAMA Netw Open. 2018;1(4):e181471.

Xu G, Strathearn L, Liu B, O’Brien M, Kopelman TG, Zhu J, et al. Prevalence and treatment patterns of autism spectrum disorder in the United States, 2016. JAMA Pediatr. 2019;173(2):153–59.

Crespi BJ. Autism as a disorder of high intelligence. Front Neurosci. 2016;10:300.

Flegal KM, Kit BK, Orpana H, Graubard BI. Association of all-cause mortality with overweight and obesity using standard body mass index categories: a systematic review and meta-analysis. JAMA. 2013;309(1):71–82.

Maenner MJ, Shaw KA, Bakian AV, Bilder DA, Durkin MS, Esler A, et al. Prevalence and characteristics of autism spectrum disorder among children aged 8 years - autism and developmental disabilities monitoring network, 11 sites, United States, 2018. MMWR Surveill Summ. 2021;70(11):1–16.

Hours C, Recasens C, Baleyte JM. ASD and ADHD comorbidity: what are we talking about? Front Psychiatry. 2022;13:837424.

Dalsgaard S, Thorsteinsson E, Trabjerg BB, Schullehner J, Plana-Ripoll O, Brikell I, et al. Incidence rates and cumulative incidences of the full spectrum of diagnosed mental disorders in childhood and adolescence. JAMA Psychiatry. 2020;77(2):155–64.

Zablotsky B, Black LI, Maenner MJ, Schieve LA, Danielson ML, Bitsko RH et al. Prevalence and trends of developmental disabilities among children in the United States: 2009–2017. Pediatrics. 2019;144(4).

Flenik TMN, Bara TS, Cordeiro ML. Family functioning and emotional aspects of children with autism spectrum disorder in southern Brazil. J Autism Dev Disord. 2022;53(6):2306–13.

Picardi A, Gigantesco A, Tarolla E, Stoppioni V, Cerbo R, Cremonte M, et al. Parental burden and its correlates in families of children with autism spectrum disorder: a multicentre study with two comparison groups. Clin Pract Epidemiol Ment Health. 2018;14:143–76.

Rosello B, Berenguer C, Baixauli I, Colomer C, Miranda A. ADHD symptoms and learning behaviors in children with ASD without intellectual disability. A mediation analysis of executive functions. PLoS ONE. 2018;13(11):e0207286.

Usami M. Functional consequences of attention-deficit hyperactivity disorder on children and their families. Psychiatry Clin Neurosci. 2016;70(8):303–17.

Leitner Y. The co-occurrence of autism and attention deficit hyperactivity disorder in children - what do we know? Front Hum Neurosci. 2014;8:268.

Goldstein RF, Abell SK, Ranasinha S, Misso M, Boyle JA, Black MH, et al. Association of gestational weight gain with maternal and infant outcomes: a systematic review and meta-analysis. JAMA. 2017;317(21):2207–25.

Rao PA, Landa RJ. Association between severity of behavioral phenotype and comorbid attention deficit hyperactivity disorder symptoms in children with autism spectrum disorders. Autism. 2014;18(3):272–80.

Akinbami LJ, Liu X, Pastor PN, Reuben CA. Attention deficit hyperactivity disorder among children aged 5–17 years in the United States, 1998–2009. NCHS Data Brief. 2011(70):1–8.

Baio J, Wiggins L, Christensen DL, Maenner MJ, Daniels J, Warren Z, et al. Prevalence of autism spectrum disorder among children aged 8 years - autism and developmental disabilities monitoring network, 11 sites, United States, 2014. MMWR Surveill Summ. 2018;67(6):1–23.

Weisman O, Agerbo E, Carter CS, Harris JC, Uldbjerg N, Henriksen TB, et al. Oxytocin-augmented labor and risk for autism in males. Behav Brain Res. 2015;284:207–12.

Wang S, Wang B, Drury V, Drake S, Sun N, Alkhairo H, et al. Rare X-linked variants carry predominantly male risk in autism, Tourette syndrome, and ADHD. Nat Commun. 2023;14(1):8077.

Ellis JA, Brown CM, Barger B, Carlson NS. Influence of maternal obesity on labor induction: a systematic review and meta-analysis. J Midwifery Womens Health. 2019;64(1):55–67.

Jacobson LP, Lau B, Catellier D, Parker CB. An Environmental influences on Child Health Outcomes viewpoint of data analysis centers for collaborative study designs. Curr Opin Pediatr. 2018;30(2):269–75.

Hertz-Picciotto I, Korrick SA, Ladd-Acosta C, Karagas MR, Lyall K, Schmidt RJ, et al. Maternal tobacco smoking and offspring autism spectrum disorder or traits in ECHO cohorts. Autism Res. 2022;15(3):551–69.

Schantz SL, Eskenazi B, Buckley JP, Braun JM, Sprowles JN, Bennett DH, et al. A framework for assessing the impact of chemical exposures on neurodevelopment in ECHO: opportunities and challenges. Environ Res. 2020;188:109709.

Volk HE, Perera F, Braun JM, Kingsley SL, Gray K, Buckley J, et al. Prenatal air pollution exposure and neurodevelopment: a review and blueprint for a harmonized approach within ECHO. Environ Res. 2021;196:110320.

Lord C, Risi S, Lambrecht L, Cook EH Jr., Leventhal BL, DiLavore PC, et al. The autism diagnostic observation schedule-generic: a standard measure of social and communication deficits associated with the spectrum of autism. J Autism Dev Disord. 2000;30(3):205–23.

National Center for Health Statistics, US Department of Health and Human Services, Centers for Disease Control and Prevention. Updated 2003. Accessed at: https://www.cdc.gov/nchs/data/dvs/GuidetoCompleteFacilityWks.pdf . Accessed on 2024 January 1.

Committee opinion 700. Methods for estimating the due date. Obstet Gynecol. 2017;129(5):e150–54.

Dighe MK, Frederick IO, Andersen HF, Gravett MG, Abbott SE, Carter AA, et al. Implementation of the intergrowth-21st project in the United States. BJOG. 2013;120(Suppl 2):123–8.

van Buuren S, Groothuis-Oudshoorn K, Mice. Multivariate imputation by chained equations in R. J Stat Softw. 2011;45(3):1–67.

Uvnas-Moberg K, Ekstrom-Bergstrom A, Berg M, Buckley S, Pajalic Z, Hadjigeorgiou E, et al. Maternal plasma levels of oxytocin during physiological childbirth - a systematic review with implications for uterine contractions and central actions of oxytocin. BMC Pregnancy Childbirth. 2019;19(1):285.

Daly D, Minnie KCS, Blignaut A, Blix E, Vika Nilsen AB, Dencker A, et al. How much synthetic oxytocin is infused during labour? A review and analysis of regimens used in 12 countries. PLoS ONE. 2020;15(7):e0227941.

Nigg JT. Toward an emerging paradigm for understanding attention-deficit/hyperactivity disorder and other neurodevelopmental, mental, and behavioral disorders: environmental risks and epigenetic associations. JAMA Pediatr. 2018;172(7):619–21.

Andari E, Nishitani S, Kaundinya G, Caceres GA, Morrier MJ, Ousley O, et al. Epigenetic modification of the oxytocin receptor gene: implications for autism symptom severity and brain functional connectivity. Neuropsychopharmacology. 2020;45(7):1150–58.

Perera F, Herbstman J. Prenatal environmental exposures, epigenetics, and disease. Reprod Toxicol. 2011;31(3):363–73.

Leffa DD, Daumann F, Damiani AP, Afonso AC, Santos MA, Pedro TH, et al. DNA damage after chronic oxytocin administration in rats: a safety yellow light? Metab. Brain Dis. 2017;32(1):51–5.

Hirayama T, Hiraoka Y, Kitamura E, Miyazaki S, Horie K, Fukuda T, et al. Oxytocin induced labor causes region and sex-specific transient oligodendrocyte cell death in neonatal mouse brain. J Obstet Gynaecol Res. 2020;46(1):66–78.

Csaba G. Transgenerational effects of perinatal hormonal imprinting. Transgenerational epigenetics: Elsevier; 2014. pp. 255–67.

Hashemi F, Tekes K, Laufer R, Szegi P, Tóthfalusi L, Csaba G. Effect of a single neonatal oxytocin treatment (hormonal imprinting) on the biogenic amine level of the adult rat brain: could oxytocin-induced labor cause pervasive developmental diseases? Reprod. Sci. 2013;20(10):1255–63.

CAS   Google Scholar  

Sato M, Noguchi J, Mashima M, Tanaka H, Hata T. 3d power doppler ultrasound assessment of placental perfusion during uterine contraction in labor. Placenta. 2016;45:32–6.

Palanisamy A, Giri T, Jiang J, Bice A, Quirk JD, Conyers SB et al. In utero exposure to transient ischemia-hypoxemia promotes long-term neurodevelopmental abnormalities in male rat offspring. JCI Insight. 2020;5(10).

Palanisamy A, Lopez J, Frolova A, Macones G, Cahill AG. Association between uterine tachysystole during the last hour of labor and cord blood lactate in parturients at term gestation. Am J Perinatol. 2019;36(11):1171–78.

Crane JM, Young DC, Butt KD, Bennett KA, Hutchens D. Excessive uterine activity accompanying induced labor. Obstet Gynecol. 2001;97(6):926–31.

CAS   PubMed   Google Scholar  

Heuser CC, Knight S, Esplin MS, Eller AG, Holmgren CM, Manuck TA, et al. Tachysystole in term labor: incidence, risk factors, outcomes, and effect on fetal heart tracings. Am J Obstet Gynecol. 2013;209(1):e321–6.

Kunz MK, Loftus RJ, Nichols AA. Incidence of uterine tachysystole in women induced with oxytocin. J Obstet Gynecol Neonatal Nurs. 2013;42(1):12–8.

Walter MH, Abele H, Plappert CF. The role of oxytocin and the effect of stress during childbirth: neurobiological basics and implications for mother and child. Front Endocrinol (Lausanne). 2021;12:742236.

Malek A, Blann E, Mattison DR. Human placental transport of oxytocin. J Matern Fetal Med. 1996;5(5):245–55.

Nathan NO, Hedegaard M, Karlsson G, Knudsen LE, Mathiesen L. Intrapartum transfer of oxytocin across the human placenta: an ex vivo perfusion experiment. Placenta. 2021;112:105–10.

Froemke RC, Young LJ. Oxytocin, neural plasticity, and social behavior. Annu Rev Neurosci. 2021;44:359–81.

Bell AF, Erickson EN, Carter CS. Beyond labor: the role of natural and synthetic oxytocin in the transition to motherhood. J Midwifery Womens Health. 2014;59(1):35–42. quiz 108.

Palanisamy A, Kannappan R, Xu Z, Martino A, Friese MB, Boyd JD, et al. Oxytocin alters cell fate selection of rat neural progenitor cells in vitro. PLoS ONE. 2018;13(1):e0191160.

Matias SL, Pearl M, Lyall K, Croen LA, Kral TVE, Fallin D, et al. Maternal prepregnancy weight and gestational weight gain in association with autism and developmental disorders in offspring. Obes (Silver Spring). 2021;29(9):1554–64.

Windham GC, Anderson M, Lyall K, Daniels JL, Kral TVE, Croen LA, et al. Maternal pre-pregnancy body mass index and gestational weight gain in relation to autism spectrum disorder and other developmental disorders in offspring. Autism Res. 2019;12(2):316–27.

Maeder AB, Vonderheid SC, Park CG, Bell AF, McFarlin BL, Vincent C, et al. Titration of intravenous oxytocin infusion for postdates induction of labor across body mass index groups. J Obstetric Gynecologic Neonatal Nurs. 2017;46(4):494–507.

Zhang J, Bricker L, Wray S, Quenby S. Poor uterine contractility in obese women. BJOG. 2007;114(3):343–48.

Kernberg A, Caughey AB. Augmentation of labor: a review of oxytocin augmentation and active management of labor. Obstet Gynecol Clin North Am. 2017;44(4):593–600.

Carlson NS, Corwin EJ, Lowe NK. Oxytocin augmentation in spontaneously laboring, nulliparous women: multilevel assessment of maternal BMI and oxytocin dose. Biol Res Nurs. 2017;19(4):382–92.

Lassiter JR, Holliday N, Lewis DF, Mulekar M, Abshire J, Brocato B. Induction of labor with an unfavorable cervix: how does BMI affect success? J Maternal-Fetal Neonatal Med. 2016;29(18):3000–02.

Mackeen AD, Durie D, Lin M, Huls C, Packard R, Sciscione A. Effect of obesity on labor inductions with foley plus oxytocin versus oxytocin alone [37m]. Obstet Gynecol. 2017;129(5):S142.

Alan S, Akca E, Senoglu A, Gozuyesil E, Surucu SG. The use of oxytocin by healthcare professionals during labor. Yonago Acta Med. 2020;63(3):214–22.

Litorp H, Sunny AK, Kc A. Augmentation of labor with oxytocin and its association with delivery outcomes: a large-scale cohort study in 12 public hospitals in Nepal. Acta Obstet Gynecol Scand. 2021;100(4):684–93.

Leuner B, Caponiti JM, Gould E. Oxytocin stimulates adult neurogenesis even under conditions of stress and elevated glucocorticoids. Hippocampus. 2012;22(4):861–8.

Panaitescu AM, Isac S, Pavel B, Ilie AS, Ceanga M, Totan A, et al. Oxytocin reduces seizure burden and hippocampal injury in a rat model of perinatal asphyxia. Acta Endocrinol (Buchar). 2018;14(3):315–19.

Zhang J, Branch DW, Ramirez MM, Laughon SK, Reddy U, Hoffman M, et al. Oxytocin regimen for labor augmentation, labor progression, and perinatal outcomes. Obstet Gynecol. 2011;118(2 Pt 1):249–56.

Darney BG, Snowden JM, Cheng YW, Jacob L, Nicholson JM, Kaimal A, et al. Elective induction of labor at term compared with expectant management: maternal and neonatal outcomes. Obstet Gynecol. 2013;122(4):761–9.

Keulen JK, Bruinsma A, Kortekaas JC, van Dillen J, Bossuyt PM, Oudijk MA, et al. Induction of labour at 41 weeks versus expectant management until 42 weeks (index): Multicentre, randomised non-inferiority trial. BMJ. 2019;364:l344.

Knight HE, Cromwell DA, Gurol-Urganci I, Harron K, van der Meulen JH, Smith GCS. Perinatal mortality associated with induction of labour versus expectant management in nulliparous women aged 35 years or over: an English national cohort study. PLoS Med. 2017;14(11):e1002425.

Reinl EL, Goodwin ZA, Raghuraman N, Lee GY, Jo EY, Gezahegn BM, et al. Novel oxytocin receptor variants in laboring women requiring high doses of oxytocin. Am J Obstet Gynecol. 2017;217(2):214. e1-14 e8.

Article   PubMed Central   Google Scholar  

Practice bulletin no. 183: Postpartum hemorrhage. Obstet Gynecol. 2017;130(4):e168–86.

September ACNMU. 2017. Accessed at: http://www.midwife.org/acnm/files/ACNMLibraryData/UPLOADFILENAME/000000000310/AMTSL-PS-FINAL-10-10-17.pdf . Accessed on 2024 January 1.

World Health Organization. WHO recommendations for the prevention and treatment of postpartum haemorrhage. Geneva; 2012.

Jiang D, Yang Y, Zhang X, Nie X. Continued versus discontinued oxytocin after the active phase of labor: an updated systematic review and meta-analysis. PLoS ONE. 2022;17(5):e0267461.

American College of Obstetricians Gynecologists, Society for Maternal-Fetal Medicine, Caughey AB, Cahill AG, Guise JM, Rouse DJ. Safe prevention of the primary cesarean delivery. Am J Obstet Gynecol. 2014;210(3):179–93.

