
Statistics By Jim

Making statistics intuitive

Null Hypothesis: Definition, Rejecting & Examples

By Jim Frost

What is a Null Hypothesis?

The null hypothesis in statistics states that there is no difference between groups or no relationship between variables. It is one of two mutually exclusive hypotheses about a population in a hypothesis test.


  • Null Hypothesis H 0 : No effect exists in the population.
  • Alternative Hypothesis H A : The effect exists in the population.

In every study or experiment, researchers assess an effect or relationship. This effect can be the effectiveness of a new drug, building material, or other intervention that has benefits. There is a benefit or connection that the researchers hope to identify. Unfortunately, no effect may exist. In statistics, we call this lack of an effect the null hypothesis. Researchers assume that this notion of no effect is correct until they have enough evidence to suggest otherwise, similar to how a trial presumes innocence.

In this context, the analysts don’t necessarily believe the null hypothesis is correct. In fact, they typically want to reject it because that leads to more exciting findings about an effect or relationship. The new vaccine works!

You can think of it as the default theory that requires sufficiently strong evidence to reject. Like a prosecutor, researchers must collect sufficient evidence to overturn the presumption of no effect. Investigators must work hard to set up a study and a data collection system to obtain evidence that can reject the null hypothesis.

Related post : What is an Effect in Statistics?

Null Hypothesis Examples

Null hypotheses start as research questions that the investigator rephrases as a statement indicating there is no effect or relationship.

  • Does the vaccine prevent infections? → The vaccine does not affect the infection rate.
  • Does the new additive increase product strength? → The additive does not affect mean product strength.
  • Does the exercise intervention increase bone mineral density? → The intervention does not affect bone mineral density.
  • As screen time increases, does test performance decrease? → There is no relationship between screen time and test performance.

After reading these examples, you might think they’re a bit boring and pointless. However, the key is to remember that the null hypothesis defines the condition that the researchers need to discredit before suggesting an effect exists.

Let’s see how you reject the null hypothesis and get to those more exciting findings!

When to Reject the Null Hypothesis

So, you want to reject the null hypothesis, but how and when can you do that? To start, you’ll need to perform a statistical test on your data. The following is an overview of performing a study that uses a hypothesis test.

The first step is to devise a research question and the appropriate null hypothesis. After that, the investigators need to formulate an experimental design and data collection procedures that will allow them to gather data that can answer the research question. Then they collect the data. For more information about designing a scientific study that uses statistics, read my post 5 Steps for Conducting Studies with Statistics .

After data collection is complete, statistics and hypothesis testing enter the picture. Hypothesis testing takes your sample data and evaluates how consistent they are with the null hypothesis. The p-value is a crucial part of the statistical results because it quantifies how strongly the sample data contradict the null hypothesis.

When the sample data provide sufficient evidence, you can reject the null hypothesis. In a hypothesis test, this process involves comparing the p-value to your significance level .

Rejecting the Null Hypothesis

Reject the null hypothesis when the p-value is less than or equal to your significance level. Your sample data favor the alternative hypothesis, which suggests that the effect exists in the population. For a mnemonic device, remember—when the p-value is low, the null must go!

When you can reject the null hypothesis, your results are statistically significant. Learn more about Statistical Significance: Definition & Meaning .
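The decision rule itself is mechanical enough to express in a few lines of Python. This is just a sketch of the comparison described above; the 0.05 threshold is the conventional default, not a requirement:

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    # Reject H0 when p <= alpha; otherwise fail to reject (never "accept") it.
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.012))  # p is low: the null must go
print(decide(0.37))   # p is high: the null must fly
```

Note that a p-value exactly equal to the significance level still counts as significant under the usual convention.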

Failing to Reject the Null Hypothesis

Conversely, when the p-value is greater than your significance level, you fail to reject the null hypothesis. The sample data provide insufficient evidence to conclude that the effect exists in the population. When the p-value is high, the null must fly!

Note that failing to reject the null is not the same as proving it. For more information about the difference, read my post about Failing to Reject the Null .

That’s a very general look at the process. But I hope you can see how the path to more exciting findings depends on being able to rule out the less exciting null hypothesis that states there’s nothing to see here!

Let’s move on to learning how to write the null hypothesis for different types of effects, relationships, and tests.

Related posts : How Hypothesis Tests Work and Interpreting P-values

How to Write a Null Hypothesis

The null hypothesis varies by the type of statistic and hypothesis test. Remember that inferential statistics use samples to draw conclusions about populations. Consequently, when you write a null hypothesis, it must make a claim about the relevant population parameter . Further, that claim usually indicates that the effect does not exist in the population. Below are typical examples of writing a null hypothesis for various parameters and hypothesis tests.

Related posts : Descriptive vs. Inferential Statistics and Populations, Parameters, and Samples in Inferential Statistics

Group Means

T-tests and ANOVA assess the differences between group means. For these tests, the null hypothesis states that there is no difference between group means in the population. In other words, the experimental conditions that define the groups do not affect the mean outcome. Mu (µ) is the population parameter for the mean, and you’ll need to include it in the statement for this type of study.

For example, an experiment compares the mean bone density changes for a new osteoporosis medication. The control group does not receive the medicine, while the treatment group does. The null states that the mean bone density changes for the control and treatment groups are equal.

  • Null Hypothesis H 0 : Group means are equal in the population: µ 1 = µ 2 , or µ 1 – µ 2 = 0
  • Alternative Hypothesis H A : Group means are not equal in the population: µ 1 ≠ µ 2 , or µ 1 – µ 2 ≠ 0.
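A test of these hypotheses can be sketched with SciPy. The bone-density changes below are invented for illustration; only the structure of the test matters:

```python
from scipy import stats

# Hypothetical bone-density changes (%) after one year; data are made up.
control   = [-1.2, -0.8, -1.5, -1.0, -0.6, -1.1, -0.9, -1.3]
treatment = [ 0.4,  0.9,  0.1,  0.7,  0.5,  0.8,  0.2,  0.6]

# Two-sample t-test of H0: mu1 = mu2 (equivalently mu1 - mu2 = 0).
# equal_var=False uses Welch's version, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

A small p-value here would lead you to reject the null that the group means are equal in the population.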

Group Proportions

Proportions tests assess the differences between group proportions. For these tests, the null hypothesis states that there is no difference between group proportions. Again, the experimental conditions did not affect the proportion of events in the groups. P is the population proportion parameter that you’ll need to include.

For example, a vaccine experiment compares the infection rate in the treatment group to the control group. The treatment group receives the vaccine, while the control group does not. The null states that the infection rates for the control and treatment groups are equal.

  • Null Hypothesis H 0 : Group proportions are equal in the population: p 1 = p 2 .
  • Alternative Hypothesis H A : Group proportions are not equal in the population: p 1 ≠ p 2 .
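The two-proportion z-test behind these hypotheses is short enough to compute by hand. The infection counts below are invented for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical vaccine trial counts (made up for illustration).
infected_treat, n_treat = 11, 1000   # treatment group
infected_ctrl,  n_ctrl  = 30, 1000   # control group

p1, p2 = infected_treat / n_treat, infected_ctrl / n_ctrl
p_pool = (infected_treat + infected_ctrl) / (n_treat + n_ctrl)

# z statistic for H0: p1 = p2, using the pooled proportion in the standard error.
se = sqrt(p_pool * (1 - p_pool) * (1 / n_treat + 1 / n_ctrl))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))  # two-tailed p-value
print(f"z = {z:.2f}, p = {p_value:.4g}")
```

The pooled proportion is used in the standard error because, under the null, both groups share a single population proportion.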

Correlation and Regression Coefficients

Some studies assess the relationship between two continuous variables rather than differences between groups.

In these studies, analysts often use either correlation or regression analysis . For these tests, the null states that there is no relationship between the variables. Specifically, it says that the correlation or regression coefficient is zero. As one variable increases, there is no tendency for the other variable to increase or decrease. Rho (ρ) is the population correlation parameter and beta (β) is the regression coefficient parameter.

For example, a study assesses the relationship between screen time and test performance. The null states that there is no correlation between this pair of variables. As screen time increases, test performance does not tend to increase or decrease.

  • Null Hypothesis H 0 : The correlation in the population is zero: ρ = 0.
  • Alternative Hypothesis H A : The correlation in the population is not zero: ρ ≠ 0.
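A correlation test of H 0 : ρ = 0 can be sketched with SciPy's `pearsonr`, which returns both the sample correlation and the p-value. The paired observations below are invented for illustration:

```python
from scipy.stats import pearsonr

# Hypothetical paired observations: daily screen time (hours) and test score.
screen_time = [1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0, 6.0]
test_score  = [88,  85,  84,  80,  75,  74,  70,  65]

# Test H0: rho = 0 (no linear relationship in the population).
r, p_value = pearsonr(screen_time, test_score)
print(f"r = {r:.3f}, p = {p_value:.4g}")
```

A small p-value lets you reject the null of no correlation; the sign of r then tells you the direction of the relationship in the sample.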

For all these cases, the analysts define the hypotheses before the study. After collecting the data, they perform a hypothesis test to determine whether they can reject the null hypothesis.

The preceding examples are all for two-tailed hypothesis tests. To learn about one-tailed tests and how to write a null hypothesis for them, read my post One-Tailed vs. Two-Tailed Tests .

Related post : Understanding Correlation




January 10, 2024 at 1:23 pm

Hi Jim, In your comment you state that equivalence test null and alternate hypotheses are reversed. For hypothesis tests of data fits to a probability distribution, the null hypothesis is that the probability distribution fits the data. Is this correct?


January 10, 2024 at 2:15 pm

Those are two separate things, equivalence testing and normality tests. But, yes, you’re correct for both.

Hypotheses are switched for equivalence testing. You need to “work” (i.e., collect a large sample of good quality data) to be able to reject the null that the groups are different to be able to conclude they’re the same.

With typical hypothesis tests, if you have low quality data and a low sample size, you’ll fail to reject the null that they’re the same, concluding they’re equivalent. But that’s more a statement about the low quality and small sample size than anything to do with the groups being equal.

So, equivalence testing makes you work to obtain a finding that the groups are the same (at least within some amount you define as a trivial difference).

For normality testing, and other distribution tests, the null states that the data follow the distribution (normal or whatever). If you reject the null, you have sufficient evidence to conclude that your sample data don’t follow the probability distribution. That’s a rare case where you hope to fail to reject the null. And it suffers from the problem I describe above where you might fail to reject the null simply because you have a small sample size. In that case, you’d conclude the data follow the probability distribution but it’s more that you don’t have enough data for the test to register the deviation. In this scenario, if you had a larger sample size, you’d reject the null and conclude it doesn’t follow that distribution.

I don’t know of any equivalence testing type approach for distribution fit tests where you’d need to work to show the data follow a distribution, although I haven’t looked for one either!
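As a quick illustration of the distribution-fit case described above, a Shapiro-Wilk test has "the data are normal" as its null, so deliberately non-normal data should produce a rejection. A sketch with simulated exponential draws:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# Clearly non-normal data: 200 draws from an exponential distribution.
skewed = rng.exponential(scale=1.0, size=200)

# For Shapiro-Wilk, H0 is "the data come from a normal distribution",
# so this is the rare test where researchers often hope NOT to reject.
stat, p_value = shapiro(skewed)
print(f"W = {stat:.3f}, p = {p_value:.3g}")
```

With a sample this size and this skewed, the p-value is tiny and the null of normality is rejected; with only a handful of points the same test might fail to reject purely for lack of data, which is the caveat raised above.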


February 20, 2022 at 9:26 pm

Is a null hypothesis regularly (always) stated in the negative? “there is no” or “does not”

February 23, 2022 at 9:21 pm

Typically, the null hypothesis includes an equal sign. The null hypothesis states that the population parameter equals a particular value. That value is usually one that represents no effect. In the case of a one-sided hypothesis test, the null still contains an equal sign but it’s “greater than or equal to” or “less than or equal to.” If you wanted to translate the null hypothesis from its native mathematical expression, you could use the expression “there is no effect.” But the mathematical form more specifically states what it’s testing.

It’s the alternative hypothesis that typically contains does not equal.

There are some exceptions. For example, in an equivalence test where the researchers want to show that two things are equal, the null hypothesis states that they’re not equal.

In short, the null hypothesis states the condition that the researchers hope to reject. They need to work hard to set up an experiment and data collection that’ll gather enough evidence to be able to reject the null condition.


February 15, 2022 at 9:32 am

Dear sir, I always read your notes on research methods. Kindly tell me, is there any book available on all of these? Wonderful. Urgent.



2. Common Terms and Equations

In statistical analysis, two hypotheses are used. The null hypothesis , or H 0 , states that there is no statistically significant relationship between two variables. The null is often the commonly accepted position and is what scientists seek to reject through the study. The alternative hypothesis , or H a , states that there is a statistically significant relationship between two variables and is what scientists seek to support through experimentation.

For example, suppose a student scored 75 on a math exam and wants to know whether the class average (µ) differs from that score. Using µ 0 = 75 as the claimed value, the null (H 0 ) and alternative (H a ) hypotheses can be written as:

  • H 0 : µ = 75 (in general form, H 0 : µ = µ 0 )
  • H a : µ ≠ 75 (in general form, H a : µ ≠ µ 0 )

Under the null hypothesis, there is no difference between the class mean (µ) and the claimed value (75). Under the alternative hypothesis, the class mean is significantly different from 75 (either less than or greater than it). A statistical test is then used to either support or reject the null hypothesis. When the test supports the null hypothesis, it indicates that there is not a statistically significant difference between the class mean and the student’s score. If the null hypothesis is rejected, the alternative hypothesis is supported, and we conclude that the class mean differs significantly from the student’s score.
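Assuming some made-up class scores, this one-sample setup can be sketched with SciPy:

```python
from scipy import stats

# Hypothetical class scores (invented for illustration); the student scored 75.
class_scores = [68, 72, 70, 75, 71, 69, 74, 73, 70, 72]

# One-sample t-test of H0: mu = 75 against Ha: mu != 75.
t_stat, p_value = stats.ttest_1samp(class_scores, popmean=75)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Here the sample mean sits below 75, so a small p-value would lead to rejecting H 0 and concluding the class mean differs from the student's score.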

What is The Null Hypothesis & When Do You Reject The Null Hypothesis

By Julia Simkus (BA Psychology, Princeton University); reviewed by Saul McLeod, PhD, and Olivia Guy-Evans, MSc, for Simply Psychology.

A null hypothesis is a statistical concept suggesting no significant difference or relationship between measured variables. It’s the default assumption unless empirical evidence proves otherwise.

The null hypothesis states no relationship exists between the two variables being studied (i.e., one variable does not affect the other).

The null hypothesis is the statement that a researcher or an investigator wants to disprove.

Testing the null hypothesis can tell you whether your results are due to the effects of manipulating the independent variable or due to random chance.

How to Write a Null Hypothesis

Null hypotheses (H0) start as research questions that the investigator rephrases as statements indicating no effect or relationship between the independent and dependent variables.

It is a default position that your research aims to challenge or confirm.

For example, if studying the impact of exercise on weight loss, your null hypothesis might be:

There is no significant difference in weight loss between individuals who exercise daily and those who do not.

Examples of Null Hypotheses

  • Do teenagers use cell phones more than adults? → Teenagers and adults use cell phones the same amount.
  • Do tomato plants exhibit a higher rate of growth when planted in compost rather than in soil? → Tomato plants show no difference in growth rates when planted in compost rather than soil.
  • Does daily meditation decrease the incidence of depression? → Daily meditation does not decrease the incidence of depression.
  • Does daily exercise increase test performance? → There is no relationship between daily exercise time and test performance.
  • Does the new vaccine prevent infections? → The vaccine does not affect the infection rate.
  • Does flossing your teeth affect the number of cavities? → Flossing your teeth has no effect on the number of cavities.

When Do We Reject The Null Hypothesis? 

We reject the null hypothesis when the data provide strong enough evidence to conclude that it is likely incorrect. This often occurs when the p-value (probability of observing the data given the null hypothesis is true) is below a predetermined significance level.

If the collected data are inconsistent with what the null hypothesis predicts, the researcher concludes that the data provide sufficient evidence against the null hypothesis, and the null hypothesis is rejected.

Rejecting the null hypothesis means that a relationship does exist between a set of variables and the effect is statistically significant (p ≤ 0.05).

If the data collected from the random sample are not statistically significant, then the null hypothesis is not rejected, and the researchers conclude that there is no evidence of a relationship between the variables.

You need to perform a statistical test on your data in order to evaluate how consistent it is with the null hypothesis. A p-value is one statistical measurement used to validate a hypothesis against observed data.

Calculating the p-value is a critical part of null-hypothesis significance testing because it quantifies how strongly the sample data contradicts the null hypothesis.

The level of statistical significance is often expressed as a  p  -value between 0 and 1. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.


Usually, a researcher uses a significance level of 0.05 or 0.01 (corresponding to a confidence level of 95% or 99%) as a general guideline to decide whether to reject or keep the null.

When your p-value is less than or equal to your significance level, you reject the null hypothesis.

In other words, smaller p-values are taken as stronger evidence against the null hypothesis. Conversely, when the p-value is greater than your significance level, you fail to reject the null hypothesis.

In this case, the sample data provide insufficient evidence to conclude that the effect exists in the population.

Because you can never know with complete certainty whether there is an effect in the population, your inferences about a population will sometimes be incorrect.

When you incorrectly reject the null hypothesis, it’s called a type I error. When you incorrectly fail to reject it, it’s called a type II error.
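The Type I error rate can be seen directly by simulation: if you run many experiments in which the null hypothesis is actually true, a test at α = 0.05 should wrongly reject in about 5% of them. A sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims = 0.05, 2000

# Simulate experiments where H0 is TRUE (both groups come from the same
# population) and count how often we wrongly reject H0 (a Type I error).
false_rejections = 0
for _ in range(n_sims):
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    _, p = stats.ttest_ind(a, b)
    if p <= alpha:
        false_rejections += 1

type1_rate = false_rejections / n_sims
print(f"Observed Type I error rate: {type1_rate:.3f}")  # close to alpha
```

This is why the significance level is described as the Type I error rate you are willing to tolerate: it is built into the decision rule itself.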

Why Do We Never Accept The Null Hypothesis?

The reason we do not say “accept the null” is because we are always assuming the null hypothesis is true and then conducting a study to see if there is evidence against it. And, even if we don’t find evidence against it, a null hypothesis is not accepted.

A lack of evidence only means that you haven’t proven that something exists. It does not prove that something doesn’t exist. 

It is risky to conclude that the null hypothesis is true merely because we did not find evidence to reject it. It is always possible that researchers elsewhere have disproved the null hypothesis, so we cannot accept it as true, but instead, we state that we failed to reject the null. 

One can either reject the null hypothesis, or fail to reject it, but can never accept it.

Why Do We Use The Null Hypothesis?

We can never prove with 100% certainty that a hypothesis is true; We can only collect evidence that supports a theory. However, testing a hypothesis can set the stage for rejecting or accepting this hypothesis within a certain confidence level.

The null hypothesis is useful because it can tell us whether the results of our study are due to random chance or the manipulation of a variable (with a certain level of confidence).

A null hypothesis is rejected if the observed data are significantly unlikely to have occurred under it, and it is not rejected if the observed outcome is consistent with the position it holds.

Rejecting the null hypothesis sets the stage for further experimentation to see if a relationship between two variables exists. 

Hypothesis testing is a critical part of the scientific method as it helps decide whether the results of a research study support a particular theory about a given population. Hypothesis testing is a systematic way of backing up researchers’ predictions with statistical analysis.

It helps provide sufficient statistical evidence that either favors or rejects a certain hypothesis about the population parameter. 

Purpose of a Null Hypothesis 

  • The primary purpose of the null hypothesis is to disprove an assumption. 
  • Whether rejected or not, the null hypothesis can help further progress a theory in many scientific cases.
  • A null hypothesis can be used to ascertain how consistent the outcomes of multiple studies are.

Do you always need both a Null Hypothesis and an Alternative Hypothesis?

The null (H0) and alternative (Ha or H1) hypotheses are two competing claims that describe the effect of the independent variable on the dependent variable. They are mutually exclusive, which means that only one of the two hypotheses can be true. 

While the null hypothesis states that there is no effect in the population, an alternative hypothesis states that there is statistical significance between two variables. 

The goal of hypothesis testing is to make inferences about a population based on a sample. In order to undertake hypothesis testing, you must express your research hypothesis as a null and alternative hypothesis. Both hypotheses are required to cover every possible outcome of the study. 

What is the difference between a null hypothesis and an alternative hypothesis?

The alternative hypothesis is the complement to the null hypothesis. The null hypothesis states that there is no effect or no relationship between variables, while the alternative hypothesis claims that there is an effect or relationship in the population.

It is the claim that you expect or hope will be true. The null hypothesis and the alternative hypothesis are always mutually exclusive, meaning that only one can be true at a time.

What are some problems with the null hypothesis?

One major problem with the null hypothesis is that researchers often treat failing to reject it as a failure of the experiment. However, either outcome of a hypothesis test is a meaningful result. Even if the null is not refuted, the researchers still learn something new.

Why can a null hypothesis not be accepted?

We can either reject or fail to reject a null hypothesis, but never accept it. If your test fails to detect an effect, this is not proof that the effect doesn’t exist. It just means that your sample did not have enough evidence to conclude that it exists.

We can’t accept a null hypothesis because a lack of evidence does not prove something that does not exist. Instead, we fail to reject it.

Failing to reject the null indicates that the sample did not provide sufficient enough evidence to conclude that an effect exists.

If the p-value is greater than the significance level, then you fail to reject the null hypothesis.

Is a null hypothesis directional or non-directional?

A hypothesis test can either contain an alternative directional hypothesis or a non-directional alternative hypothesis. A directional hypothesis is one that contains the less than (“<“) or greater than (“>”) sign.

A nondirectional hypothesis contains the not equal sign (“≠”).  However, a null hypothesis is neither directional nor non-directional.

A null hypothesis is a prediction that there will be no change, relationship, or difference between two variables.

The directional hypothesis or nondirectional hypothesis would then be considered alternative hypotheses to the null hypothesis.



9.1 Null and Alternative Hypotheses

The actual test begins by considering two hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

H 0 , the null hypothesis: a statement of no difference between sample means or proportions, or no difference between a sample mean or proportion and a population mean or proportion. In other words, the difference equals 0.

H a , the alternative hypothesis: a claim about the population that is contradictory to H 0 and what we conclude when we reject H 0 .

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision. They are reject H 0 if the sample information favors the alternative hypothesis or do not reject H 0 or decline to reject H 0 if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in H 0 and H a :

H 0 : equal (=), greater than or equal to (≥), less than or equal to (≤)
H a : not equal (≠), less than (<), greater than (>)

H 0 always has a symbol with an equal in it. H a never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

Example 9.1

H 0 : No more than 30 percent of the registered voters in Santa Clara County voted in the primary election. p ≤ 0.30
H a : More than 30 percent of the registered voters in Santa Clara County voted in the primary election. p > 0.30
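A one-sided proportion claim like this can be checked with an exact binomial test. The vote counts below are invented for illustration:

```python
from scipy.stats import binomtest

# Hypothetical poll (numbers made up): 170 of 500 registered voters voted.
# Test H0: p <= 0.30 against Ha: p > 0.30 with a right-tailed binomial test.
result = binomtest(k=170, n=500, p=0.30, alternative='greater')

print(f"sample proportion = {170/500:.2f}, p-value = {result.pvalue:.4f}")
```

Because the alternative points in one direction ("greater"), only large sample proportions count as evidence against H 0 .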

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25 percent. State the null and alternative hypotheses.

Example 9.2

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are the following: H 0 : μ = 2.0 H a : μ ≠ 2.0

We want to test whether the mean height of eighth graders is 66 inches. State the null and alternative hypotheses. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ __ 66
  • H a : μ __ 66

Example 9.3

We want to test if college students take fewer than five years to graduate from college, on the average. The null and alternative hypotheses are the following: H 0 : μ ≥ 5 H a : μ < 5
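This left-tailed test can be sketched in SciPy using the `alternative` argument of `ttest_1samp` (available in SciPy 1.6+). The years-to-graduate data below are invented for illustration:

```python
from scipy import stats

# Hypothetical years-to-graduate data (made up for illustration).
years = [4.0, 4.5, 5.0, 3.8, 4.2, 4.7, 4.1, 4.4, 4.9, 4.3]

# Left-tailed test of H0: mu >= 5 against Ha: mu < 5.
t_stat, p_value = stats.ttest_1samp(years, popmean=5, alternative='less')
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.4f}")
```

Only sample means below 5 can produce a small p-value here; a sample mean above 5 would never lead to rejecting this H 0 .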

We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ __ 45
  • H a : μ __ 45

Example 9.4

An article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third of the students pass. The same article stated that 6.6 percent of U.S. students take advanced placement exams and 4.4 percent pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6 percent. State the null and alternative hypotheses. H 0 : p ≤ 0.066 H a : p > 0.066

On a state driver’s test, about 40 percent pass the test on the first try. We want to test if more than 40 percent pass on the first try. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : p __ 0.40
  • H a : p __ 0.40

Collaborative Exercise

Bring to class a newspaper, some news magazines, and some internet articles. In groups, find articles from which your group can write null and alternative hypotheses. Discuss your hypotheses with the rest of the class.


Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute Texas Education Agency (TEA). The original material is available at: https://www.texasgateway.org/book/tea-statistics . Changes were made to the original material, including updates to art, structure, and other content updates.

Access for free at https://openstax.org/books/statistics/pages/1-introduction
  • Authors: Barbara Illowsky, Susan Dean
  • Publisher/website: OpenStax
  • Book title: Statistics
  • Publication date: Mar 27, 2020
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/statistics/pages/1-introduction
  • Section URL: https://openstax.org/books/statistics/pages/9-1-null-and-alternative-hypotheses


Null hypothesis

Null hypothesis n., plural: null hypotheses [nʌl haɪˈpɒθɪsɪs] Definition: a hypothesis that is valid or presumed true until invalidated by a statistical test


Null Hypothesis Definition

The null hypothesis is informally defined as "the commonly accepted fact (such as the sky is blue) that the researcher aims to reject or nullify".

More formally, we can define a null hypothesis as “a statistical theory suggesting that no statistical relationship exists between given observed variables” .

In biology , the null hypothesis is used to nullify or reject a common belief. The researcher carries out the research which is aimed at rejecting the commonly accepted belief.

What Is a Null Hypothesis?

A hypothesis is a theory or an assumption based on inadequate evidence; it requires further experiments and testing for confirmation. Further testing can show a hypothesis to be either true or false (Blackwelder, 1982).

For example, Susie assumes that mineral water helps plants grow and stay nourished better than distilled water. To test this hypothesis, she runs an experiment for about a month, watering some plants with mineral water and some with distilled water.

When a hypothesis states that there is no statistically significant relationship between the two variables, it is called a null hypothesis. The investigator typically tries to disprove such a hypothesis. In the plant example above, the null hypothesis is:

There is no statistical relationship between the form of water given to plants and their growth and nourishment.

Usually, an investigator tries to disprove the null hypothesis and to establish a relationship or association between the two variables.

The opposite of the null hypothesis is known as the alternate hypothesis . In the plant example, the alternate hypothesis is:

There is a statistical relationship between the form of water given to plants and their growth and nourishment.

The example below shows the difference between null vs alternative hypotheses:

Alternate Hypothesis: The world is round.
Null Hypothesis: The world is not round.

Copernicus and many other scientists worked to prove such null hypotheses wrong. Through experiment and observation, they persuaded people that the alternate hypothesis was correct. Had they not shown the null hypothesis to be experimentally untenable, people would never have accepted the alternative hypothesis as true.

The alternative and null hypothesis for Susie’s assumption is:

  • Null Hypothesis: If one plant is watered with distilled water and the other with mineral water, then there is no difference in the growth and nourishment of these two plants.
  • Alternative Hypothesis:  If one plant is watered with distilled water and the other with mineral water, then the plant with mineral water shows better growth and nourishment.

The null hypothesis states that there is no significant statistical relationship. The relationship in question can be within a single set of variables or between two sets of variables.

Most people initially take the null hypothesis to be true. Scientists design experiments and carry out research to disprove, or nullify, it; for this purpose, they formulate an alternate hypothesis that they believe is correct. The symbol for the null hypothesis is H 0 (read as H null or H zero ).

Why is it named the “Null”?

The name "null" signals that scientists are working to prove the hypothesis false, i.e., to nullify it. The term sometimes confuses readers, who take it to mean the statement is empty or blank; it is not. A more descriptive name would be the nullifiable hypothesis.

Why do we need to assess it? Why not just verify an alternate one?

Science follows the scientific method, a series of steps performed so that a hypothesis can be shown to be false or true, and so that any limitation or inadequacy in a new hypothesis can be uncovered. Considering both the null and alternative hypotheses keeps the research sound: omitting the null hypothesis gives the impression that you are not taking the research seriously and simply want your results accepted as correct.

Development of the Null

In statistics, the first task is to formulate the null and alternate hypotheses from the given problem. Splitting the problem into small steps makes the path toward the solution easier and less challenging. So how do you write a null hypothesis?

Writing a null hypothesis consists of two steps:

  • Firstly, initiate by asking a question.
  • Secondly, restate the question in such a way that it seems there are no relationships among the variables.

In other words, assume in such a way that the treatment does not have any effect.

Question → Null Hypothesis
Are adults better at mathematics than teenagers? → Mathematical ability does not depend on age.
Does daily intake of aspirin reduce the risk of a heart attack? → Heart attack risk is not affected by a daily dose of aspirin.
Do teenagers use cell phones to access the internet more than elders? → Age does not affect the usage of cell phones for internet access.
Are cats concerned about the color of their food? → Cats do not prefer food based on color.
Does chewing willow bark relieve pain? → Pain is not relieved by chewing willow bark.

The usual recovery duration after knee surgery is about 8 weeks.

A researcher thinks that recovery may take longer if patients go to a physiotherapist for rehabilitation twice per week instead of three times per week; in other words, recovery time is reduced if the patient goes for rehabilitation three times rather than twice.

Step 1: Identify the hypothesis in the problem. The hypothesis can be a phrase or a full statement. In the above example the hypothesis is:

“The expected recovery period in knee rehabilitation is more than 8 weeks”

Step 2: Turn the hypothesis into a mathematical statement. The average can be represented as μ, so the hypothesis becomes:

H 1 : μ > 8

In the above equation, the hypothesis is denoted H 1 , the average is denoted by μ, and > indicates that the average is greater than eight.

Step 3: State what follows if the hypothesis is not right, i.e., the rehabilitation period does not exceed 8 weeks.

There are two options: the recovery takes either less than 8 weeks or exactly 8 weeks.

H 0 : μ ≤ 8

In the above equation, the null hypothesis is equivalent to H 0 , the average is denoted by μ and ≤ represents that the average is less than or equal to eight.

What will happen if the scientist does not have any knowledge about the outcome?

Problem: An investigator studies the post-operative impact of vigorous exercise on patients who have had knee surgery. The exercise could either improve recovery or make it worse. The usual time for recovery is 8 weeks.

Step 1: Form the null hypothesis, i.e., the exercise has no effect and the recovery time remains about 8 weeks.

H 0 : μ = 8

In the above equation, the null hypothesis is equivalent to H 0 , the average is denoted by μ, and the equal sign (=) shows that the average is equal to eight.

Step 2: Form the alternate hypothesis, which is the reverse of the null hypothesis: specifically, what happens if the treatment (exercise) has an impact?

H 1 : μ ≠ 8

In the above equation, the alternate hypothesis is denoted H 1 , the average is denoted by μ, and the not-equal sign (≠) indicates that the average is not equal to eight.
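The two-sided test above can be sketched in code. The following is a minimal illustration using only the Python standard library; the sample of recovery times is hypothetical, and a large-sample z-test stands in for the exact t-test a small sample would require.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def z_test_two_sided(sample, mu0):
    """Large-sample two-sided z-test of H0: mu = mu0.

    Returns the z statistic and the p-value from the normal
    approximation (reasonable for roughly n >= 30).
    """
    n = len(sample)
    z = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    # Two-sided p-value: chance of a statistic at least this extreme
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical recovery times (weeks) for 35 exercising patients
times = [7.5, 8.2, 6.9, 7.8, 8.0, 7.2, 6.5, 7.9, 8.1, 7.4,
         7.0, 6.8, 7.6, 8.3, 7.1, 6.7, 7.7, 8.4, 7.3, 6.6,
         7.5, 7.9, 7.2, 6.9, 8.0, 7.4, 7.8, 7.1, 6.8, 7.6,
         7.3, 7.0, 7.7, 7.5, 7.2]
z, p = z_test_two_sided(times, mu0=8)
```

If p falls below the chosen significance level (say 0.05), H 0 : μ = 8 is rejected in favor of H 1 : μ ≠ 8.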

Significance Tests

A significance test is performed to obtain a reasonable and probable interpretation of the data. The null hypothesis itself is not data; it is a statement containing numerical claims about the population. The quantity involved can take different forms, such as a mean or a proportion, a difference of means or proportions, or an odds ratio.

The following table explains the symbols used:

P = P-value
p = probability of success
n = size of the sample
H 0 = null hypothesis
H a = alternate hypothesis

The P-value is the chief statistical result of the significance test of the null hypothesis.

  • P-value = Pr(data or data more extreme | H 0 true)
  • | = “given”
  • Pr = probability
  • H 0 = the null hypothesis
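The conditional-probability definition above can be made concrete with a small simulation. The sketch below computes Pr(data or data more extreme | H 0 true) for hypothetical coin-flip data, both exactly and by Monte Carlo; the counts and trial numbers are illustrative assumptions, not from the text.

```python
import random
from math import comb

# Hypothetical data: 60 heads in 100 flips; H0: the coin is fair
n, x = 100, 60

# Exact tail probability under H0: Pr(X >= 60) when X ~ b(100, 0.5)
exact = sum(comb(n, k) * 0.5**n for k in range(x, n + 1))

# Monte Carlo estimate of the same probability: simulate the data
# many times under H0 and count runs at least as extreme as observed
random.seed(1)
trials = 20_000
hits = sum(
    1 for _ in range(trials)
    if sum(random.random() < 0.5 for _ in range(n)) >= x
)
estimate = hits / trials
# exact and estimate agree to within simulation error
```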

The first stage of Null Hypothesis Significance Testing (NHST) is to form the null and alternate hypotheses; these state the research question concisely.

Null Hypothesis = no effect of treatment, no difference, no association
Alternative Hypothesis = effective treatment, difference, association

When to reject the null hypothesis?

Researchers reject the null hypothesis when experimentation shows it to be wrong. Until then, the null hypothesis is taken to be true, while researchers try to strengthen the alternate hypothesis. As an example, a binomial test is performed on a sample, followed by a series of further tests (Frick, 1995).

Step 1: Read the research question carefully and form the null hypothesis. Verify that the sample supports a binomial proportion and determine the hypothesized value of the binomial parameter.

Write the null hypothesis as:

H 0 : p = the value of p if H 0 is true

Then calculate the sample proportion to see how far the observed data depart from the value stated by the null hypothesis.

Step 2: Form the test statistic for the binomial test under the null hypothesis. The test must be based on exact probabilities: write down the probability mass function (pmf) that applies when the null hypothesis is true.

When H 0 is true, X ~ b(n, p)

n = size of the sample

p = assumed value of the proportion if H 0 is true

Step 3: Compute the P-value, the probability of observing data at least as extreme as the data actually observed, where x is the observed number of successes.

For an increase: P-value = Pr(X ≥ x)

For a decrease: P-value = Pr(X ≤ x)

Step 4: Report the findings in a descriptive, detailed way, including:

  • the sample proportion
  • the direction of the difference (increase or decrease)
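Steps 1 through 4 above can be sketched as a short function. This is an illustrative implementation of the exact one-sided binomial test using only the Python standard library; the sample counts and hypothesized proportion are hypothetical.

```python
from math import comb

def binomial_p_value(n, x, p0, direction="greater"):
    """Exact one-sided binomial test of H0: p = p0.

    direction="greater": P-value = Pr(X >= x) under H0 (an increase)
    direction="less":    P-value = Pr(X <= x) under H0 (a decrease)
    """
    def pmf(k):  # Pr(X = k) when X ~ b(n, p0)
        return comb(n, k) * p0**k * (1 - p0)**(n - k)
    if direction == "greater":
        return sum(pmf(k) for k in range(x, n + 1))
    return sum(pmf(k) for k in range(0, x + 1))

# Hypothetical sample: 18 successes in 50 trials, H0: p = 0.25
sample_proportion = 18 / 50   # 0.36, above the hypothesized 0.25
p_val = binomial_p_value(50, 18, 0.25, "greater")
```

Reporting would then state the sample proportion (0.36), the direction of the difference (an increase over 0.25), and the P-value.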

Perceived Problems With the Null Hypothesis

Variable or model selection and, in some cases, limited information are the chief issues affecting tests of the null hypothesis. Statistical tests of the null hypothesis are not particularly strong, and there is arbitrariness in what is declared significant (Gill, 1999). A further criticism is that, at some level, the null hypotheses being tested are essentially all false.

There is another problem, well known but often ignored, with the α-level. The value of the α-level has no theoretical basis, so the conventional values (most commonly 0.1, 0.05, or 0.01) are arbitrary. Using a fixed value of α splits results into two categories (significant and non-significant). This arbitrary rejection or non-rejection is also a problem when the practical question is the strength of the evidence on a scientific matter.

The P-value is of foremost importance in null hypothesis testing, but as an inferential and interpretive tool it is problematic. The P-value is the probability of obtaining a test statistic at least as extreme as the one observed.

The key point of this definition is that the P-value is computed partly from results that were not observed.

Because of these unobserved, more extreme results, the evidence against the null hypothesis tends to be overstated. The P-value is more than a bare statement: it is a precise claim about the evidence from the observed data. On these grounds, some researchers find P-values objectionable and do not favor null hypothesis testing. It is also clear that the P-value depends strictly on the null hypothesis and the assumed sampling distribution. In some tightly controlled experiments the null-hypothesis sampling distribution closely matches the actual one, but this is rarely possible in observational studies.

Some researchers have also pointed out that the P-value depends on the sample size: if the sample is large, the null hypothesis may be rejected even when the true difference is small. This illustrates the difference between biological importance and statistical significance (Killeen, 2005).

Another issue is the fixed α-level, e.g., 0.1: whether the null hypothesis for a large sample is accepted or rejected hinges on this arbitrary threshold. Even with an effectively infinite sample size and a true null hypothesis, the chance of a Type I error remains, which is why this approach is not considered consistent and reliable. A further limitation is that precise information about the size and precision of the estimated effect cannot be conveyed; the remedy is to report the effect size together with its precision.

Null Hypothesis Examples

Here are some examples:

Example 1: Hypotheses with One Sample of One Categorical Variable

About 10% of the human population is left-handed. Suppose a researcher at Penn State claims that students in the College of Arts and Architecture are more likely to be left-handed than the general population. Here there is a single sample, and a known population value is compared with the population proportion underlying the sample.

  • Research Question: Are artists more likely to be left-handed than people in the general population?
  • Response Variable: Classifying each student as left-handed or right-handed.
  • Null Hypothesis: Arts and Architecture students are no more likely to be left-handed than the general population (the proportion of left-handed students in the College of Arts and Architecture is 10%, or p = 0.10).

Example 2: Hypotheses with One Sample of One Measurement Variable

A generic brand of the antihistamine diphenhydramine is sold in capsules with a 50 mg dose. The manufacturer is concerned that the machine has drifted out of calibration and is no longer producing capsules with the appropriate dose.

  • Research Question: Does the data suggest that the population mean dosage differs from 50 mg?
  • Response Variable: A chemical assay used to measure the dose of the active ingredient.
  • Null Hypothesis: The capsules of this brand contain the labeled dose (population mean dosage = 50 mg).

Example 3: Hypotheses with Two Samples of One Categorical Variable

Many people choose vegetarian meals on a regular basis, and a researcher suspects that females choose vegetarian meals more often than males.

  • Research Question: Does the data suggest that females (women) choose vegetarian meals more often than males (men)?
  • Response Variable: Classifying each person as vegetarian or non-vegetarian. Grouping Variable: Gender
  • Null Hypothesis: Gender is not related to choosing vegetarian meals (the population percentage of women who regularly eat vegetarian meals equals that of men, or p women = p men).

Example 4: Hypotheses with Two Samples of One Measurement Variable

Obesity and being overweight are among today's major health issues. A study is performed to test whether a low-carbohydrate diet leads to faster weight loss than a low-fat diet.

  • Research Question: Does the data suggest that a low-carbohydrate diet produces faster weight loss than a low-fat diet?
  • Response Variable: Weight loss (pounds)
  • Explanatory Variable: Type of diet, either low-carbohydrate or low-fat
  • Null Hypothesis: There is no difference in mean weight loss between people on a low-carbohydrate diet and people on a low-fat diet (population mean weight loss on a low-carbohydrate diet = population mean weight loss on a low-fat diet).

Example 5: Hypotheses about the relationship between Two Categorical Variables

A case-control study is performed with stroke patients and controls, matched on occupation and age; all subjects are nonsmokers. Each subject is asked whether someone at home or in their close surroundings smokes.

  • Research Question: Does second-hand smoke increase the risk of stroke?
  • Variables: Two categorical variables: case status (stroke patient or control) and whether the subject lives with a smoker. The claim is that the chance of having a stroke is increased by living with a smoker.
  • Null Hypothesis: There is no relationship between passive smoking and stroke (the odds ratio between stroke and passive smoking is equal to 1).

Example 6: Hypotheses about the relationship between Two Measurement Variables

A financial expert suspects a positive relationship between daily changes in a stock's price and the amount of the stock bought by non-management employees.

  • Response Variable: Daily change in price
  • Explanatory Variable: Stock bought by non-management employees
  • Null Hypothesis: The association between the daily stock price change ($) and the daily stock buying by non-management employees ($) is 0.

Example 7: Hypotheses about comparing the relationship between Two Measurement Variables in Two Samples

  • Research Question: Is the relationship between the bill paid in a restaurant and the tip given to the waiter linear? Does this relationship differ between fine-dining and family restaurants?
  • Explanatory Variable: Total bill amount
  • Response Variable: Amount of tip
  • Null Hypothesis: The relationship between the total bill amount and the tip is the same at family and fine-dining restaurants.


  • Blackwelder, W. C. (1982). “Proving the null hypothesis” in clinical trials. Controlled Clinical Trials , 3(4), 345–353.
  • Frick, R. W. (1995). Accepting the null hypothesis. Memory & Cognition, 23(1), 132–138.
  • Gill, J. (1999). The insignificance of null hypothesis significance testing. Political Research Quarterly , 52(3), 647–674.
  • Killeen, P. R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16(5), 345–353.

©BiologyOnline.com. Content provided and moderated by Biology Online Editors.

Last updated on June 16th, 2022


Null Hypothesis Examples

ThoughtCo / Hilary Allison

  • Ph.D., Biomedical Sciences, University of Tennessee at Knoxville
  • B.A., Physics and Mathematics, Hastings College

In statistical analysis, the null hypothesis assumes there is no meaningful relationship between two variables. Testing the null hypothesis can tell you whether your results are due to the effect of manipulating an independent variable or due to chance. It's often used in conjunction with an alternative hypothesis, which assumes there is, in fact, a relationship between two variables.

The null hypothesis is among the easiest hypotheses to test using statistical analysis, making it perhaps the most valuable hypothesis for the scientific method. By evaluating a null hypothesis in addition to another hypothesis, researchers can support their conclusions with a higher level of confidence. Below are examples of how you might formulate a null hypothesis to fit certain questions.

What Is the Null Hypothesis?

The null hypothesis states there is no relationship between the measured phenomenon (the dependent variable ) and the independent variable , which is the variable an experimenter typically controls or changes. You do not need to believe that the null hypothesis is true to test it. On the contrary, you will likely suspect there is a relationship between a set of variables. One way to prove that this is the case is to reject the null hypothesis. Rejecting a hypothesis does not mean an experiment was "bad" or that it didn't produce results. In fact, it is often one of the first steps toward further inquiry.

To distinguish it from other hypotheses , the null hypothesis is written as ​ H 0  (which is read as “H-nought,” "H-null," or "H-zero"). A significance test is used to determine the likelihood that the results supporting the null hypothesis are not due to chance. A confidence level of 95% or 99% is common. Keep in mind, even if the confidence level is high, there is still a small chance the null hypothesis is not true, perhaps because the experimenter did not account for a critical factor or because of chance. This is one reason why it's important to repeat experiments.

Examples of the Null Hypothesis

To write a null hypothesis, first start by asking a question. Rephrase that question in a form that assumes no relationship between the variables. In other words, assume a treatment has no effect. Write your hypothesis in a way that reflects this.

Are teens better at math than adults? Age has no effect on mathematical ability.
Does taking aspirin every day reduce the chance of having a heart attack? Taking aspirin daily does not affect heart attack risk.
Do teens use cell phones to access the internet more than adults? Age has no effect on how cell phones are used for internet access.
Do cats care about the color of their food? Cats express no food preference based on color.
Does chewing willow bark relieve pain? There is no difference in pain relief after chewing willow bark versus taking a placebo.

Other Types of Hypotheses

In addition to the null hypothesis, the alternative hypothesis is also a staple in traditional significance tests . It's essentially the opposite of the null hypothesis because it assumes the claim in question is true. For the first item in the table above, for example, an alternative hypothesis might be "Age does have an effect on mathematical ability."

Key Takeaways

  • In hypothesis testing, the null hypothesis assumes no relationship between two variables, providing a baseline for statistical analysis.
  • Rejecting the null hypothesis suggests there is evidence of a relationship between variables.
  • By formulating a null hypothesis, researchers can systematically test assumptions and draw more reliable conclusions from their experiments.

Module 9: Hypothesis Testing With One Sample

Null and Alternative Hypotheses

Learning Outcomes

  • Describe hypothesis testing in general and in practice

The actual test begins by considering two  hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

H 0 : The null hypothesis: It is a statement about the population that either is believed to be true or is used to put forth an argument unless it can be shown to be incorrect beyond a reasonable doubt.

H a : The alternative hypothesis : It is a claim about the population that is contradictory to H 0 and what we conclude when we reject H 0 .

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision . They are "reject H 0 " if the sample information favors the alternative hypothesis or "do not reject H 0 " or "decline to reject H 0 " if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in  H 0 and H a :

H 0 : equal (=); H a : not equal (≠), greater than (>), or less than (<)
H 0 : greater than or equal to (≥); H a : less than (<)
H 0 : less than or equal to (≤); H a : more than (>)

H 0 always has a symbol with an equal in it. H a never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers (including one of the co-authors in research work) use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

H 0 : No more than 30% of the registered voters in Santa Clara County voted in the primary election. p ≤ 0.30

H a : More than 30% of the registered voters in Santa Clara County voted in the primary election. p > 0.30

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25%. State the null and alternative hypotheses.

H 0 : The drug reduces cholesterol by 25%. p = 0.25

H a : The drug does not reduce cholesterol by 25%. p ≠ 0.25

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are:

H 0 : μ = 2.0

H a : μ ≠ 2.0

We want to test whether the mean height of eighth graders is 66 inches. State the null and alternative hypotheses. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses. H 0 : μ __ 66 H a : μ __ 66

  • H 0 : μ = 66
  • H a : μ ≠ 66

We want to test if college students take less than five years to graduate from college, on the average. The null and alternative hypotheses are:

H 0 : μ ≥ 5

H a : μ < 5

We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and alternative hypotheses. H 0 : μ __ 45 H a : μ __ 45

  • H 0 : μ ≥ 45
  • H a : μ < 45

In an issue of U.S. News and World Report , an article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third pass. The same article stated that 6.6% of U.S. students take advanced placement exams and 4.4% pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6%. State the null and alternative hypotheses.

H 0 : p ≤ 0.066

H a : p > 0.066
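A hypothesis pair like this one can be tested with a one-proportion z-test. The sketch below uses only the Python standard library; the sample counts (90 of 1,000 students) are hypothetical, chosen only to illustrate the mechanics.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(x, n, p0):
    """One-sided z-test of H0: p <= p0 against Ha: p > p0.

    Uses the normal approximation to the sampling distribution
    of the sample proportion under H0.
    """
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)       # standard error under H0
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value

# Hypothetical sample: 90 of 1,000 U.S. students took an AP exam
z, p_value = one_prop_z_test(90, 1000, 0.066)
```

A small p_value would lead us to reject H 0 : p ≤ 0.066 in favor of H a : p > 0.066.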

On a state driver’s test, about 40% pass the test on the first try. We want to test if more than 40% pass on the first try. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses. H 0 : p __ 0.40 H a : p __ 0.40

  • H 0 : p = 0.40
  • H a : p > 0.40

Concept Review

In a hypothesis test , sample data is evaluated in order to arrive at a decision about some type of claim. If certain conditions about the sample are satisfied, then the claim can be evaluated for a population. In a hypothesis test, we:

  • Evaluate the null hypothesis , typically denoted with H 0 . The null is not rejected unless the hypothesis test shows otherwise. The null statement must always contain some form of equality (=, ≤ or ≥).
  • Always write the alternative hypothesis , typically denoted with H a or H 1 , using not equals, greater than, or less than symbols (≠, >, or <).
  • If we reject the null hypothesis, then we can assume there is enough evidence to support the alternative hypothesis.
  • Never state that a claim is proven true or false. Hypothesis testing is based on probability laws, so we can speak only in terms of non-absolute certainties.

Formula Review

H 0 and H a are contradictory.

  • OpenStax, Statistics, Null and Alternative Hypotheses. Provided by : OpenStax. Located at : http://cnx.org/contents/[email protected]:58/Introductory_Statistics . License : CC BY: Attribution
  • Introductory Statistics . Authored by : Barbara Illowski, Susan Dean. Provided by : Open Stax. Located at : http://cnx.org/contents/[email protected] . License : CC BY: Attribution . License Terms : Download for free at http://cnx.org/contents/[email protected]
  • Simple hypothesis testing | Probability and Statistics | Khan Academy. Authored by : Khan Academy. Located at : https://youtu.be/5D1gV37bKXY . License : All Rights Reserved . License Terms : Standard YouTube License

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • BMC Med Res Methodol

Logo of bmcmrm

The null hypothesis significance test in health sciences research (1995-2006): statistical analysis and interpretation

Luis Carlos Silva-Ayçaguer

1 Centro Nacional de Investigación de Ciencias Médicas, La Habana, Cuba

Patricio Suárez-Gil

2 Unidad de Investigación. Hospital de Cabueñes, Servicio de Salud del Principado de Asturias (SESPA), Gijón, Spain

Ana Fernández-Somoano

3 CIBER Epidemiología y Salud Pública (CIBERESP), Spain and Departamento de Medicina, Unidad de Epidemiología Molecular del Instituto Universitario de Oncología, Universidad de Oviedo, Spain

The null hypothesis significance test (NHST) is the most frequently used statistical method, although its inferential validity has been widely criticized since its introduction. In 1988, the International Committee of Medical Journal Editors (ICMJE) warned against sole reliance on NHST to substantiate study conclusions and suggested supplementary use of confidence intervals (CI). Our objective was to evaluate the extent and quality of the use of NHST and CI in both English- and Spanish-language biomedical publications between 1995 and 2006, taking into account the ICMJE recommendations, with particular focus on the accuracy of the interpretation of statistical significance and the validity of conclusions.

Original articles published in three English and three Spanish biomedical journals in three fields (General Medicine, Clinical Specialties and Epidemiology - Public Health) were considered for this study. Papers published in 1995-1996, 2000-2001, and 2005-2006 were selected through a systematic sampling method. After excluding the purely descriptive and theoretical articles, analytic studies were evaluated for their use of NHST with P-values and/or CI for interpretation of statistical "significance" and "relevance" in study conclusions.

Among 1,043 original papers, 874 were selected for detailed review. The exclusive use of P-values was less frequent in English language publications as well as in Public Health journals; overall such use decreased from 41% in 1995-1996 to 21% in 2005-2006. While the use of CI increased over time, the "significance fallacy" (to equate statistical and substantive significance) appeared very often, mainly in journals devoted to clinical specialties (81%). In papers originally written in English and Spanish, 15% and 10%, respectively, mentioned statistical significance in their conclusions.

Conclusions

Overall, the results of our review show some improvement in the statistical reporting of results, but further efforts by scholars and journal editors are clearly required to bring communication practices in line with ICMJE advice, especially in the clinical setting and, above all, among publications in Spanish.

Null hypothesis statistical testing (NHST) has been the most widely used statistical approach in health research over the past 80 years. Its origins date back to 1279 [ 1 ], although it was in the second decade of the twentieth century that the statistician Ronald Fisher formally introduced the concept of the "null hypothesis" H 0 - which, generally speaking, establishes that certain parameters do not differ from each other. He invented the "P-value", through which the hypothesis could be assessed [ 2 ]. Fisher's P-value is defined as a conditional probability calculated from the results of a study. Specifically, the P-value is the probability of obtaining a result at least as extreme as the one actually observed, assuming that the null hypothesis is true. Fisherian significance testing theory treated the P-value as an index measuring the strength of evidence against the null hypothesis in a single experiment. The father of NHST, however, never endorsed the inflexible application of the ultimately subjective threshold levels almost universally adopted later on (although the 0.05 threshold was also his creation).

A few years later, Jerzy Neyman and Egon Pearson considered the Fisherian approach inefficient, and in 1928 they published an article [ 3 ] that would provide the theoretical basis of what they called hypothesis statistical testing . The Neyman-Pearson approach is based on the notion that one of two choices must be made: accept the null hypothesis based on the information provided, or reject it in favor of an alternative one. Thus, one can incur one of two types of errors: a Type I error, if the null hypothesis is rejected when it is actually true, and a Type II error, if the null hypothesis is accepted when it is actually false. They established a rule to optimize the decision process, using the P-value introduced by Fisher, by setting the maximum frequency of errors that would be admissible.

The null hypothesis statistical testing, as applied today, is a hybrid coming from the amalgamation of the two methods [ 4 ]. As a matter of fact, some 15 years later, both procedures were combined to give rise to the nowadays widespread use of an inferential tool that would satisfy none of the statisticians involved in the original controversy. The present method essentially goes as follows: given a null hypothesis, an estimate of the parameter (or parameters) is obtained and used to create statistics whose distribution, under H 0 , is known. With these data the P-value is computed. Finally, the null hypothesis is rejected when the obtained P-value is smaller than a certain comparative threshold (usually 0.05) and it is not rejected if P is larger than the threshold.
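The hybrid procedure just described can be sketched in a few lines. This is an illustrative example, not taken from the paper: the data are invented, and a permutation scheme is one of several ways to obtain the P-value under H 0.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation P-value for a difference in means.

    Under H0 the group labels are exchangeable, so the P-value is the
    share of label shufflings whose mean difference is at least as
    extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:  # "at least as extreme" as observed
            hits += 1
    return hits / n_perm

# The hybrid decision rule: reject H0 when P falls below the threshold.
treated = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5]
control = [4.2, 4.0, 4.5, 4.8, 4.1, 4.4]
p = permutation_p_value(treated, control)
print(p, "reject H0" if p < 0.05 else "do not reject H0")
```

Note that this sketch outputs exactly the dichotomous verdict the authors go on to criticize: it reports no effect size at all.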

The first reservations about the validity of the method began to appear around 1940, when some statisticians censured the logical roots and practical convenience of Fisher's P-value [ 5 ]. Significance tests and P-values have repeatedly drawn the attention and criticism of many authors over the past 70 years, who have kept questioning their epistemological legitimacy as well as their practical value. What remains in spite of these criticisms is the lasting legacy of researchers' unwillingness to eradicate or reform these methods.

Although there are very comprehensive works on the topic [ 6 ], we list below some of the criticisms most universally accepted by specialists.

• The P-values are used as a tool to make decisions in favor of or against a hypothesis. What really may be relevant, however, is to get an effect size estimate (often the difference between two values) rather than rendering dichotomous true/false verdicts [ 7 - 11 ].

• The P-value is a conditional probability of the data, provided that some assumptions are met, but what really interests the investigator is the inverse probability: what degree of validity can be attributed to each of several competing hypotheses, once that certain data have been observed [ 12 ].

• The two elements that affect the results, namely the sample size and the magnitude of the effect, are inextricably linked in the value of p and we can always get a lower P-value by increasing the sample size. Thus, the conclusions depend on a factor completely unrelated to the reality studied (i.e. the available resources, which in turn determine the sample size) [ 13 , 14 ].

• Those who defend the NHST often assert the objective nature of that test, but the process is actually far from being so. NHST does not ensure objectivity. This is reflected in the fact that we generally operate with thresholds that are ultimately no more than conventions, such as 0.01 or 0.05. What is more, for many years their use has unequivocally demonstrated the inherent subjectivity that goes with the concept of P, regardless of how it will be used later [ 15 - 17 ].

• In practice, the NHST is limited to a binary response sorting hypotheses into "true" and "false" or declaring "rejection" or "no rejection", without demanding a reasonable interpretation of the results, as has been noted time and again for decades. This binary orthodoxy validates categorical thinking, which results in a very simplistic view of scientific activity that induces researchers not to test theories about the magnitude of effect sizes [ 18 - 20 ].
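The third criticism above - that the P-value depends on sample size as much as on the effect - is easy to demonstrate numerically. The sketch below is illustrative and not from the paper: it holds a practically trivial mean difference fixed and lets only the group size grow, using a two-sided z-test with a known standard deviation.

```python
import math

def two_sample_z_p(diff, sd, n):
    """Two-sided P-value for a mean difference `diff` between two
    groups of n observations each, assuming a known common sd."""
    z = abs(diff) / (sd * math.sqrt(2.0 / n))
    # Two-sided tail probability of the standard normal, via erf.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

# The effect (0.05 sd units, negligible in practice) never changes;
# only the sample size does.
p_values = {n: two_sample_z_p(diff=0.05, sd=1.0, n=n)
            for n in (20, 200, 2000, 20000)}
for n, p in p_values.items():
    print(n, round(p, 6))
```

With the identical, negligible effect, the verdict flips from "not significant" to "significant" purely because more data were collected.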

Despite the weaknesses and shortcomings of NHST, it is frequently taught as if it were the key inferential statistical method, or the most appropriate, or even the sole unquestioned one. Statistical textbooks, with only some exceptions, do not even mention the NHST controversy. Instead, the myth is spread that NHST is the "natural" final action of scientific inference and the only procedure for testing hypotheses. However, relevant specialists and important regulators of the scientific world advocate avoiding it.

Taking especially into account that NHST does not offer the most important information (i.e. the magnitude of an effect of interest, and the precision of the estimate of the magnitude of that effect), many experts recommend the reporting of point estimates of effect sizes with confidence intervals as the appropriate representation of the inherent uncertainty linked to empirical studies [ 21 - 25 ]. Since 1988, the International Committee of Medical Journal Editors (ICMJE, known as the Vancouver Group ) incorporates the following recommendation to authors of manuscripts submitted to medical journals: "When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as P-values, which fail to convey important information about effect size" [ 26 ].

As will be shown, the use of confidence intervals (CI), occasionally accompanied by P-values, is recommended as a more appropriate method for reporting results. Some authors noted several shortcomings of CI long ago [ 27 ]. In spite of the fact that calculating CI can indeed be complicated, and that their interpretation is far from simple [ 28 , 29 ], authors are urged to use them because they provide much more information than NHST and are not subject to most of the criticisms leveled at it [ 30 ]. While some have proposed different options (for instance, likelihood-based information theoretic methods [ 31 ], and the Bayesian inferential paradigm [ 32 ]), confidence interval estimation of effect sizes is clearly the most widespread alternative approach.
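A minimal sketch of the recommended alternative - reporting an effect size together with a confidence interval rather than a bare P-value. The counts and the Wald interval are illustrative assumptions, not data from the paper.

```python
import math

def risk_difference_ci(x1, n1, x2, n2, z=1.96):
    """Point estimate and Wald 95% CI for a difference between two
    proportions: the effect size together with its uncertainty."""
    p1, p2 = x1 / n1, x2 / n2
    d = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d, d - z * se, d + z * se

# Hypothetical trial: 30/100 events in one arm, 45/100 in the other.
d, lo, hi = risk_difference_ci(30, 100, 45, 100)
print(f"risk difference {d:.2f}, 95% CI ({lo:.2f} to {hi:.2f})")
```

Unlike "P < 0.05", the interval conveys both the magnitude of the effect and the precision with which it was estimated.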

Although twenty years have passed since the ICMJE began to disseminate such recommendations, systematically ignored by the vast majority of textbooks and hardly incorporated in medical publications [ 33 ], it is interesting to examine the extent to which the NHST is used in articles published in medical journals during recent years, in order to identify what is still lacking in the process of eradicating the widespread ceremonial use that is made of statistics in health research [ 34 ]. Furthermore, it is enlightening in this context to examine whether these patterns differ between English- and Spanish-speaking worlds and, if so, to see if the changes in paradigms are occurring more slowly in Spanish-language publications. In such a case we would offer various suggestions.

In addition to assessing the adherence to the above cited statistical recommendation proposed by ICMJE relative to the use of P-values, we consider it of particular interest to estimate the extent to which the significance fallacy is present, an inertial deficiency that consists of attributing -- explicitly or not -- qualitative importance or practical relevance to the found differences simply because statistical significance was obtained.

Many authors produce misleading statements such as "a significant effect was (or was not) found" when it should be said that "a statistically significant difference was (or was not) found". A detrimental consequence of this equivalence is that some authors believe that finding out whether there is "statistical significance" or not is the aim, so that this term is then mentioned in the conclusions [ 35 ]. This means virtually nothing, except that it indicates that the author is letting a computer do the thinking. Since the real research questions are never statistical ones, the answers cannot be statistical either. Accordingly, the conversion of the dichotomous outcome produced by a NHST into a conclusion is another manifestation of the mentioned fallacy.

The general objective of the present study is to evaluate the extent and quality of use of NHST and CI, both in English- and in Spanish-language biomedical publications, between 1995 and 2006 taking into account the International Committee of Medical Journal Editors recommendations, with particular focus on accuracy regarding interpretation of statistical significance and the validity of conclusions.

We reviewed the original articles from six journals, three in English and three in Spanish, over three disjoint periods sufficiently separated from each other (1995-1996, 2000-2001, 2005-2006) as to properly describe the evolution in prevalence of the target features along the selected periods.

The selection of journals was intended to get representation for each of the following three thematic areas: clinical specialties ( Obstetrics & Gynecology and Revista Española de Cardiología ); Public Health and Epidemiology ( International Journal of Epidemiology and Atención Primaria ); and the area of general and internal medicine ( British Medical Journal and Medicina Clínica ). Five of the selected journals formally endorsed ICMJE guidelines; the remaining one ( Revista Española de Cardiología ) suggests observing ICMJE demands in relation to specific issues. We attempted to capture journal diversity in the sample by selecting general and specialty journals with different degrees of influence, as reflected in their impact factors in 2007, which ranged between 1.337 (MC) and 9.723 (BMJ). No special reasons guided us to choose these specific journals, but we opted for journals with rather large paid circulations. For instance, Revista Española de Cardiología has the largest impact factor among the fourteen Spanish journals devoted to clinical specialties that have an impact factor, and Obstetrics & Gynecology has an outstanding impact factor among the huge number of journals available for selection.

It was decided to take around 60 papers for each biennium and journal, which means a total of around 1,000 papers. As recently suggested [ 36 , 37 ], this number was not established using a conventional method, but by means of a purposive and pragmatic approach in choosing the maximum sample size that was feasible.

Systematic sampling in phases [ 38 ] was used in applying a sampling fraction equal to 60/N, where N is the number of articles, in each of the 18 subgroups defined by crossing the six journals and the three time periods. Table 1 lists the population size and the sample size for each subgroup. While the sample within each subgroup was selected with equal probability, estimates based on other subsets of articles (defined across time periods, areas, or languages) are based on samples with various selection probabilities. Proper weights were used to take into account the stratified nature of the sampling in these cases.

Sizes of the populations (and the samples) for selected journals and periods.

            Clinical               General Medicine       Public Health and Epidemiology
            G&O         REC        BMJ         MC         IJE         AP          Total
1995-1996   623 (62)    125 (60)   346 (62)    238 (61)   315 (60)    169 (60)    1816 (365)
2000-2001   600 (60)    146 (60)   519 (62)    196 (61)   286 (60)    145 (61)    1892 (364)
2005-2006   537 (59)    144 (59)   474 (62)    158 (62)   212 (61)    167 (60)    1692 (363)
Total       1760 (181)  415 (179)  1339 (186)  592 (184)  813 (181)   481 (181)   5400 (1092)

G&O: Obstetrics & Gynecology; REC: Revista Española de Cardiología; BMJ: British Medical Journal; MC: Medicina Clínica; IJE: International Journal of Epidemiology; AP: Atención Primaria .
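The sampling scheme just described (a fraction of roughly 60/N within each subgroup, with inverse-probability weights for estimates that pool subgroups) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' actual procedure or code.

```python
import random

def systematic_sample(n_population, n_target, seed=0):
    """Systematic sampling: pick a random start, then take every
    k-th article, with k chosen so that about n_target items are
    drawn. Returns the selected indices and their common sampling
    weight (the inverse of the selection probability)."""
    k = max(1, n_population // n_target)
    start = random.Random(seed).randrange(k)
    indices = list(range(start, n_population, k))
    weight = n_population / len(indices)  # inverse selection probability
    return indices, weight

# E.g. the 1995-1996 Obstetrics & Gynecology stratum: 623 papers, ~60 sampled.
idx, w = systematic_sample(n_population=623, n_target=60)
print(len(idx), round(w, 2))
```

Because each of the 18 strata has its own sampling fraction, estimates pooled across strata must weight each paper by a quantity proportional to `w`, as the authors describe.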

Forty-nine of the 1,092 selected papers were eliminated because, although the section of the journal to which they were assigned could suggest they were original articles, detailed scrutiny revealed that in some cases they were not. The sample therefore consisted of 1,043 papers. Each was classified into one of three categories: (1) purely descriptive papers, designed to review or characterize the state of affairs as it exists at present; (2) analytical papers; or (3) articles that address theoretical, methodological or conceptual issues. An article was regarded as analytical if it sought to explain the reasons behind a particular occurrence by discovering causal relationships or if, even when self-classified as descriptive, it was carried out to assess cause-effect associations among variables. We classified as theoretical or methodological those articles that do not handle empirical data as such and focus instead on proposing or assessing research methods. We identified 169 papers as purely descriptive or theoretical, which were therefore excluded from the sample. Figure 1 presents a flow chart showing the process for determining eligibility for inclusion in the sample.


Flow chart of the selection process for eligible papers.

To estimate the adherence to ICMJE recommendations, we considered whether the papers used P-values, confidence intervals, and both simultaneously. By "the use of P-values" we mean that the article contains at least one P-value, explicitly mentioned in the text or at the bottom of a table, or that it reports that an effect was considered as statistically significant . It was deemed that an article uses CI if it explicitly contained at least one confidence interval, but not when it only provides information that could allow its computation (usually by presenting both the estimate and the standard error). Probability intervals provided in Bayesian analysis were classified as confidence intervals (although conceptually they are not the same) since what is really of interest here is whether or not the authors quantify the findings and present them with appropriate indicators of the margin of error or uncertainty.

In addition we determined whether the "Results" section of each article attributed the status of "significant" to an effect on the sole basis of the outcome of a NHST (i.e., without clarifying that it is strictly statistical significance). Similarly, we examined whether the term "significant" (applied to a test) was mistakenly used as synonymous with substantive , relevant or important . The use of the term "significant effect" when it is only appropriate as a reference to a "statistically significant difference," can be considered a direct expression of the significance fallacy [ 39 ] and, as such, constitutes one way to detect the problem in a specific paper.

We also assessed whether the "Conclusions," which sometimes appear as a separate section in the paper and otherwise in the last paragraphs of the "Discussion" section, mentioned statistical significance and, if so, whether any of such mentions were no more than an allusion to results.

To perform these analyses we considered both the abstract and the body of the article. To assess the handling of the significance issue, however, only the body of the manuscript was taken into account.

The information was collected by four trained observers. Every paper was assigned to two reviewers. Disagreements were discussed and, if no agreement was reached, a third reviewer was consulted to break the tie and so moderate the effect of subjectivity in the assessment.

In order to assess the reliability of the criteria used for the evaluation of articles and to effect a convergence of criteria among the reviewers, a pilot study of 20 papers from each of three journals ( Clinical Medicine , Primary Care , and International Journal of Epidemiology) was performed. The results of this pilot study were satisfactory. Our results are reported using percentages together with their corresponding confidence intervals. For sampling errors estimations, used to obtain confidence intervals, we weighted the data using the inverse of the probability of selection of each paper, and we took into account the complex nature of the sample design. These analyses were carried out with EPIDAT [ 40 ], a specialized computer program that is readily available.

A total of 1,043 articles were reviewed, of which 874 (84%) were found to be analytic, while the remainder were purely descriptive or of a theoretical and methodological nature. Five of the analytic articles did not employ either P-values or CI. Consequently, the analysis was based on the remaining 869 articles.

Use of NHST and confidence intervals

The percentage of articles that use only P-values, without even mentioning confidence intervals, to report their results declined steadily throughout the period analyzed (Table 2). The percentage decreased from approximately 41% in 1995-1996 to 21% in 2005-2006. However, it does not differ notably between journals of different languages, as shown by the estimates and confidence intervals of the respective percentages. Concerning thematic areas, it is highly surprising that most of the clinical articles ignore the recommendations of the ICMJE, while for general and internal medicine papers such a problem is present in only one in five papers, and in the area of Public Health and Epidemiology it occurs in only one out of six. The use of CI alone (without P-values) increased slightly across the studied periods (from 9% to 13%), but it is five times more prevalent in Public Health and Epidemiology journals than in Clinical ones, where it reached a scanty 3%.

Prevalence of NHST and CI across periods, languages and research areas.

                                  Total    P-values and no CI   CI and P-values      CI and no P-values
                                  papers   n     % (95% CI)     n     % (95% CI)     n    % (95% CI)
Period
  1995-1996                       285      119   41 (35 to 47)  138   49 (43 to 55)  28   10 (6 to 13)
  2000-2001                       278      101   38 (31 to 44)  150   51 (44 to 58)  27   11 (6 to 15)
  2005-2006                       306      65    21 (16 to 26)  198   65 (59 to 71)  43   14 (9 to 17)
Language
  Spanish                         396      156   39 (34 to 43)  211   54 (49 to 59)  29   7 (5 to 10)
  English                         473      129   32 (28 to 36)  275   55 (51 to 60)  69   12 (10 to 15)
Area
  Clinical                        300      166   52 (45 to 58)  125   45 (39 to 51)  9    3 (1 to 6)
  General Medicine                278      69    22 (17 to 27)  170   61 (55 to 67)  39   17 (12 to 22)
  Public Health and Epidemiology  291      50    18 (13 to 23)  191   65 (59 to 71)  50   17 (13 to 22)

CI: Confidence Interval

Ambivalent handling of the significance

While the percentage of articles referring implicitly or explicitly to significance in an ambiguous or incorrect way - that is, incurring the significance fallacy - seems to decline steadily, the prevalence of this problem exceeds 69% even in the most recent period. This percentage was almost the same for articles written in Spanish and in English, but it was notably higher in the Clinical journals (81%) than in the other journals, where the problem occurs in approximately 7 out of 10 papers (Table 3). The kappa coefficient for measuring agreement between observers concerning the presence of the "significance fallacy" was 0.78 (95% CI: 0.62 to 0.93), which is considered acceptable on the scale of Landis and Koch [ 41 ].

Frequency of occurrence of the significance fallacy across periods, languages and research areas.

                                  Papers examined   Papers with the fallacy (n)   % (95% CI)
Period
  1995-1996                       285               224                           80 (75 to 85)
  2000-2001                       278               210                           78 (72 to 83)
  2005-2006                       306               216                           70 (64 to 75)
Language
  Spanish                         396               295                           73 (69 to 78)
  English                         473               355                           76 (73 to 80)
Area
  Clinical                        300               248                           81 (76 to 86)
  General Medicine                278               200                           72 (66 to 77)
  Public Health and Epidemiology  291               202                           71 (66 to 76)
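The inter-rater agreement statistic reported above (kappa = 0.78) can be computed from a square agreement table. The sketch below uses hypothetical counts, not the study's actual ratings.

```python
def cohens_kappa(table):
    """Cohen's kappa for two raters: observed agreement corrected for
    the agreement expected by chance from the marginal proportions.
    table[i][j] = items rater 1 put in category i and rater 2 in j."""
    total = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / total
    # Chance agreement: product of the two raters' marginals, per category.
    expected = sum(
        (sum(table[i]) / total) * (sum(row[i] for row in table) / total)
        for i in range(len(table))
    )
    return (observed - expected) / (1.0 - expected)

# Hypothetical 2x2 counts: fallacy judged present/absent by each reviewer.
table = [[70, 10],
         [5, 15]]
kappa = cohens_kappa(table)
print(round(kappa, 3))  # → 0.571
```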

Reference to numerical results or statistical significance in Conclusions

The percentage of papers mentioning a numerical finding as a conclusion is similar in the three periods analyzed (Table 4). Concerning languages, this percentage is nearly twice as large for Spanish journals as for those published in English (approximately 21% versus 12%). And, again, the highest percentage (16%) corresponded to clinical journals.

Frequency of use of numerical results in conclusions across periods, languages and research areas.

                                  Papers examined   Papers with numerical results
                                                    in conclusions (n)              % (95% CI)
Period
  1995-1996                       285               44                              15 (10 to 19)
  2000-2001                       278               48                              15 (10 to 20)
  2005-2006                       306               45                              12.1 (8 to 16)
Language
  Spanish                         396               85                              21 (17 to 25)
  English                         473               52                              12 (9 to 15)
Area
  Clinical                        300               58                              16 (12 to 21)
  General Medicine                278               39                              13 (9 to 17)
  Public Health and Epidemiology  291               40                              12 (8 to 15)

A similar pattern is observed, although with less pronounced differences, in references to the outcome of the NHST (significant or not) in the conclusions (Table 5). The percentage of articles that introduce the term in the "Conclusions" does not appreciably differ between articles written in Spanish and in English. Again, the area where this insufficiency is most often present (more than 15% of articles) is the Clinical area.

Frequency of presence of the term Significance (or statistical significance) in conclusions across periods, languages and research areas.

                                  Papers examined   Papers mentioning significance
                                                    in conclusions (n)               % (95% CI)
Period
  1995-1996                       285               35                               14 (9 to 19)
  2000-2001                       278               32                               12 (8 to 16)
  2005-2006                       306               41                               14 (9 to 19)
Language
  Spanish                         396               39                               10 (7 to 13)
  English                         473               69                               15 (11 to 18)
Area
  Clinical                        300               44                               16 (11 to 20)
  General Medicine                278               30                               11 (7 to 15)
  Public Health and Epidemiology  291               34                               12 (8 to 16)

There are some previous studies addressing the degree to which researchers have moved beyond the ritualistic use of NHST to assess their hypotheses. This has been examined for areas such as biology [ 42 ], organizational research [ 43 ], and psychology [ 44 - 47 ]. However, to our knowledge, no recent research has explored the pattern of use of P-values and CI in the medical literature and, in any case, no efforts have been made to study this problem in a way that takes into account different languages and specialties.

At first glance it is puzzling that, after decades of questioning and technical warnings, and twenty years after the inception of the ICMJE recommendation to avoid NHST, it continues to be applied ritualistically and mindlessly as the dominant doctrine. Not long ago, when researchers did not observe statistically significant effects, they were unlikely to write them up and report "negative" findings, since they knew there was a high probability that the paper would be rejected. This has changed a bit: editors are more prone to judge all findings as potentially eloquent. This is probably due to the frequent denunciations of the tendency for papers presenting a significant positive result to receive more favorable publication decisions than equally well-conducted ones that report a negative or null result, the so-called publication bias [ 48 - 50 ]. This new openness is consistent with the fact that if the substantive question addressed is really relevant, the answer (whether positive or negative) will also be relevant.

Consequently, even though it was not an aim of our study, we found many examples in which statistical significance was not obtained. However, many of those negative results were reported with a comment of this type: " The results did not show a significant difference between groups; however, with a larger sample size, this difference would have probably proved to be significant ". The problem with this statement is that it is true; more specifically, it will always be true and it is, therefore, sterile. It is not fortuitous that one never encounters the opposite, and equally tautological, statement: " A significant difference between groups has been detected; however, perhaps with a smaller sample size, this difference would have proved to be not significant" . Such a double standard is itself an unequivocal sign of the ritual application of NHST.

Although the declining rates of NHST usage show that, gradually, ICMJE and similar recommendations are having a positive impact, most of the articles in the clinical setting still considered NHST as the final arbiter of the research process. Moreover, it appears that the improvement in the situation is mostly formal, and the percentage of articles that fall into the significance fallacy is huge.

The contradiction between what has been conceptually recommended and common practice is appreciably less acute in the area of Epidemiology and Public Health, but the same pattern was evident everywhere in the mechanical way of applying significance tests. Nevertheless, the clinical journals remain the most unmoved by the recommendations.

The ICMJE recommendations are not cosmetic statements but substantial ones, and the vigorous exhortations made by outstanding authorities [ 51 ] are not mere intellectual exercises due to ingenious and inopportune methodologists, but rather they are very serious epistemological warnings.

In some cases, the role of CI is not as clearly suitable (e.g. when estimating multiple regression coefficients or because effect sizes are not available for some research designs [ 43 , 52 ]), but when it comes to estimating, for example, an odds ratio or a rates difference, the advantage of using CI instead of P values is very clear, since in such cases it is obvious that the goal is to assess what has been called the "effect size."
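For the odds-ratio case mentioned above, the CI-based report is straightforward. A minimal sketch with hypothetical 2x2 counts (not data from the paper), using the standard Wald interval on the log scale:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio from a 2x2 table [[a, b], [c, d]] with a Wald 95% CI
    computed on the log scale and then exponentiated."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

# Hypothetical: 40 exposed cases, 60 exposed controls,
#               20 unexposed cases, 80 unexposed controls.
or_, lo, hi = odds_ratio_ci(40, 60, 20, 80)
print(f"OR {or_:.2f}, 95% CI ({lo:.2f} to {hi:.2f})")
```

An interval that excludes 1 conveys the same verdict as P < 0.05, while also showing how large the association is and how precisely it was estimated.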

The inherent resistance to change old paradigms and practices that have been entrenched for decades is always high. Old habits die hard. The estimates and trends outlined are entirely consistent with Alvan Feinstein's warning 25 years ago: "Because the history of medical research also shows a long tradition of maintaining loyalty to established doctrines long after the doctrines had been discredited, or shown to be valueless, we cannot expect a sudden change in this medical policy merely because it has been denounced by leading connoisseurs of statistics [ 53 ]".

It is possible, however, that the nature of the problem has an external explanation: it is likely that some editors prefer to "avoid troubles" with the authors and vice versa, thus resorting to the most conventional procedures. Many junior researchers believe that it is wise to avoid long back-and-forth discussions with reviewers and editors. In general, researchers who want to appear in print and survive in a publish-or-perish environment are motivated by force, fear, and expedience in their use of NHST [ 54 ]. Furthermore, it is relatively natural that ordinary researchers use NHST when they consider that some of its theoretical objectors have used this statistical analysis in empirical studies published after the appearance of their own critiques [ 55 ].

For example, the Journal of the American Medical Association published a bibliometric study [ 56 ] discussing the impact of statisticians' co-authorship of medical papers on publication decisions by two major high-impact journals: the British Medical Journal and the Annals of Internal Medicine . The data analysis is characterized by methodological orthodoxy: the authors use only chi-square tests without any reference to CI, although NHST had been repeatedly criticized over the years by two of the authors: Douglas Altman, an early promoter of confidence intervals as an alternative [ 57 ], and Steve Goodman, a critic of NHST from a Bayesian perspective [ 58 ]. Individual authors, however, cannot be blamed for broader institutional problems and systemic forces opposed to change.

The present effort is certainly partial in at least two ways: it is limited to six specific journals and to three biennia. It would therefore be highly desirable to extend it by studying the problem in more detail (especially by reviewing more journals with different profiles) and by continuing to track prevailing patterns and trends.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

LCSA designed the study, wrote the paper and supervised the whole process; PSG coordinated the data extraction and carried out statistical analysis, as well as participated in the editing process; AFS extracted the data and participated in the first stage of statistical analysis; all authors contributed to and revised the final manuscript.

Pre-publication history

The pre-publication history for this paper can be accessed here:

http://www.biomedcentral.com/1471-2288/10/44/prepub

Acknowledgements

The authors would like to thank Tania Iglesias-Cabo and Vanesa Alvarez-González for their help with the collection of empirical data and their participation in an earlier version of the paper. The manuscript has benefited greatly from thoughtful, constructive feedback by Carlos Campillo-Artero, Tom Piazza and Ann Séror.

  • Curran-Everett D. Explorations in statistics: hypothesis tests and P values. Adv Physiol Educ. 2009; 33 :81–86. doi: 10.1152/advan.90218.2008. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd; 1925. [ Google Scholar ]
  • Neyman J, Pearson E. On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika. 1928; 20 :175–240. [ Google Scholar ]
  • Silva LC. Los laberintos de la investigación biomédica. En defensa de la racionalidad para la ciencia del siglo XXI. Madrid: Díaz de Santos; 2009. [ Google Scholar ]
  • Berkson J. Test of significance considered as evidence. J Am Stat Assoc. 1942; 37 :325–335. doi: 10.2307/2279000. [ CrossRef ] [ Google Scholar ]
  • Nickerson RS. Null hypothesis significance testing: A review of an old and continuing controversy. Psychol Methods. 2000; 5 :241–301. doi: 10.1037/1082-989X.5.2.241. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rozeboom WW. The fallacy of the null-hypothesis significance test. Psychol Bull. 1960; 57 :418–428. doi: 10.1037/h0042040. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Callahan JL, Reio TG. Making subjective judgments in quantitative studies: The importance of using effect sizes and confidence intervals. HRD Quarterly. 2006; 17 :159–173. [ Google Scholar ]
  • Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev. 2007; 82 :591–605. doi: 10.1111/j.1469-185X.2007.00027.x. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Breaugh JA. Effect size estimation: factors to consider and mistakes to avoid. J Manage. 2003; 29 :79–97. doi: 10.1177/014920630302900106. [ CrossRef ] [ Google Scholar ]
  • Thompson B. What future quantitative social science research could look like: confidence intervals for effect sizes. Educ Res. 2002; 31 :25–32. [ Google Scholar ]
  • Matthews RA. Significance levels for the assessment of anomalous phenomena. Journal of Scientific Exploration. 1999; 13 :1–7. [ Google Scholar ]
  • Savage IR. Nonparametric statistics. J Am Stat Assoc. 1957; 52 :332–333. [ Google Scholar ]
  • Silva LC, Benavides A, Almenara J. El péndulo bayesiano: Crónica de una polémica estadística. Llull. 2002; 25 :109–128. [ PubMed ] [ Google Scholar ]
  • Goodman SN, Royall R. Evidence and scientific research. Am J Public Health. 1988; 78 :1568–1574. doi: 10.2105/AJPH.78.12.1568. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Berger JO, Berry DA. Statistical analysis and the illusion of objectivity. Am Sci. 1988; 76 :159–165. [ Google Scholar ]
  • Hurlbert SH, Lombardi CM. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Ann Zool Fenn. 2009; 46 :311–349. [ Google Scholar ]
  • Fidler F, Thomason N, Cumming G, Finch S, Leeman J. Editors can lead researchers to confidence intervals but they can't make them think: Statistical reform lessons from Medicine. Psychol Sci. 2004; 15 :119–126. doi: 10.1111/j.0963-7214.2004.01502008.x. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Balluerka N, Vergara AI, Arnau J. Calculating the main alternatives to null-hypothesis-significance testing in between-subject experimental designs. Psicothema. 2009; 21 :141–151. [ PubMed ] [ Google Scholar ]
  • Cumming G, Fidler F. Confidence intervals: Better answers to better questions. J Psychol. 2009; 217 :15–26. [ Google Scholar ]
  • Jones LV, Tukey JW. A sensible formulation of the significance test. Psychol Methods. 2000; 5 :411–414. doi: 10.1037/1082-989X.5.4.411. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Dixon P. The p-value fallacy and how to avoid it. Can J Exp Psychol. 2003; 57 :189–202. [ PubMed ] [ Google Scholar ]
  • Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev Camb Philos Soc. 2007; 82 :591–605. doi: 10.1111/j.1469-185X.2007.00027.x. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Brandstaetter E. Confidence intervals as an alternative to significance testing. MPR-Online. 2001; 4 :33–46. [ Google Scholar ]
  • Masson ME, Loftus GR. Using confidence intervals for graphically based data interpretation. Can J Exp Psychol. 2003; 57 :203–220. [ PubMed ] [ Google Scholar ]
  • International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. http://www.icmje.org Update October 2008. Accessed July 11, 2009. [ PubMed ]
  • Feinstein AR. P-Values and Confidence Intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol. 1998; 51 :355–360. doi: 10.1016/S0895-4356(97)00295-3. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Haller H, Kraus S. Misinterpretations of significance: A problem students share with their teachers? MPR-Online. 2002; 7 :1–20. [ Google Scholar ]
  • Gigerenzer G, Krauss S, Vitouch O. In: The Handbook of Methodology for the Social Sciences. Kaplan D, editor. Chapter 21. Thousand Oaks, CA: Sage Publications; 2004. The null ritual: What you always wanted to know about significance testing but were afraid to ask; pp. 391–408. [ Google Scholar ]
  • Curran-Everett D, Taylor S, Kafadar K. Fundamental concepts in statistics: elucidation and illustration. J Appl Physiol. 1998; 85 :775–786. [ PubMed ] [ Google Scholar ]
  • Royall RM. Statistical evidence: a likelihood paradigm. Boca Raton: Chapman & Hall/CRC; 1997. [ Google Scholar ]
  • Goodman SN. Of P values and Bayes: A modest proposal. Epidemiology. 2001; 12 :295–297. doi: 10.1097/00001648-200105000-00006. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sarria M, Silva LC. Tests of statistical significance in three biomedical journals: a critical review. Rev Panam Salud Publica. 2004; 15 :300–306. [ PubMed ] [ Google Scholar ]
  • Silva LC. Una ceremonia estadística para identificar factores de riesgo. Salud Colectiva. 2005; 1 :322–329. [ Google Scholar ]
  • Goodman SN. Toward Evidence-Based Medical Statistics 1: The p Value Fallacy. Ann Intern Med. 1999; 130 :995–1004. [ PubMed ] [ Google Scholar ]
  • Schulz KF, Grimes DA. Sample size calculations in randomised clinical trials: mandatory and mystical. Lancet. 2005; 365 :1348–1353. doi: 10.1016/S0140-6736(05)61034-3. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Bacchetti P. Current sample size conventions: Flaws, harms, and alternatives. BMC Med. 2010; 8 :17. doi: 10.1186/1741-7015-8-17. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Silva LC. Diseño razonado de muestras para la investigación sanitaria. Madrid: Díaz de Santos; 2000. [ Google Scholar ]
  • Barnett ML, Mathisen A. Tyranny of the p-value: The conflict between statistical significance and common sense. J Dent Res. 1997; 76 :534–536. doi: 10.1177/00220345970760010201. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Santiago MI, Hervada X, Naveira G, Silva LC, Fariñas H, Vázquez E, Bacallao J, Mújica OJ. [The Epidat program: uses and perspectives] [letter] Pan Am J Public Health. 2010; 27 :80–82. Spanish. [ PubMed ] [ Google Scholar ]
  • Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977; 33 :159–74. doi: 10.2307/2529310. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Fidler F, Burgman MA, Cumming G, Buttrose R, Thomason N. Impact of criticism of null-hypothesis significance testing on statistical reporting practices in conservation biology. Conserv Biol. 2005; 20 :1539–1544. doi: 10.1111/j.1523-1739.2006.00525.x. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kline RB. Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association; 2004. [ Google Scholar ]
  • Curran-Everett D, Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society: the sequel. Adv Physiol Educ. 2007; 31 :295–298. doi: 10.1152/advan.00022.2007. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Hubbard R, Parsa AR, Luthy MR. The spread of statistical significance testing: The case of the Journal of Applied Psychology. Theor Psychol. 1997; 7 :545–554. doi: 10.1177/0959354397074006. [ CrossRef ] [ Google Scholar ]
  • Vacha-Haase T, Nilsson JE, Reetz DR, Lance TS, Thompson B. Reporting practices and APA editorial policies regarding statistical significance and effect size. Theor Psychol. 2000; 10 :413–425. doi: 10.1177/0959354300103006. [ CrossRef ] [ Google Scholar ]
  • Krueger J. Null hypothesis significance testing: On the survival of a flawed method. Am Psychol. 2001; 56 :16–26. doi: 10.1037/0003-066X.56.1.16. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rising K, Bacchetti P, Bero L. Reporting Bias in Drug Trials Submitted to the Food and Drug Administration: Review of Publication and Presentation. PLoS Med. 2008; 5 :e217. doi: 10.1371/journal.pmed.0050217. doi:10.1371/journal.pmed.0050217. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sridharan L, Greenland L. Editorial policies and publication bias: the importance of negative studies. Arch Intern Med. 2009; 169 :1022–1023. doi: 10.1001/archinternmed.2009.100. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Falagas ME, Alexiou VG. The top-ten in journal impact factor manipulation. Arch Immunol Ther Exp (Warsz) 2008; 56 :223–226. doi: 10.1007/s00005-008-0024-5. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rothman K. Writing for Epidemiology. Epidemiology. 1998; 9 :98–104. doi: 10.1097/00001648-199805000-00019. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Fidler F. The fifth edition of the APA publication manual: Why its statistics recommendations are so controversial. Educ Psychol Meas. 2002; 62 :749–770. doi: 10.1177/001316402236876. [ CrossRef ] [ Google Scholar ]
  • Feinstein AR. Clinical epidemiology: The architecture of clinical research. Philadelphia: W.B. Saunders Company; 1985. [ Google Scholar ]
  • Orlitzky M. Institutionalized dualism: statistical significance testing as myth and ceremony. http://ssrn.com/abstract=1415926 Accessed Feb 8, 2010.
  • Greenwald AG, González R, Harris RJ, Guthrie D. Effect sizes and p-value. What should be reported and what should be replicated? Psychophysiology. 1996; 33 :175–183. doi: 10.1111/j.1469-8986.1996.tb02121.x. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Altman DG, Goodman SN, Schroter S. How statistical expertise is used in medical research. J Am Med Assoc. 2002; 287 :2817–2820. doi: 10.1001/jama.287.21.2817. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Gardner MJ, Altman DJ. Statistics with confidence. Confidence intervals and statistical guidelines. London: BMJ; 1992. [ Google Scholar ]
  • Goodman SN. P Values, Hypothesis Tests and Likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol. 1993; 137 :485–496. [ PubMed ] [ Google Scholar ]


Null and Alternative Hypotheses | Definitions & Examples

Published on 5 October 2022 by Shaun Turney . Revised on 6 December 2022.

The null and alternative hypotheses are two competing claims that researchers weigh evidence for and against using a statistical test :

  • Null hypothesis (H 0 ): There’s no effect in the population .
  • Alternative hypothesis (H A ): There’s an effect in the population.

The effect is usually the effect of the independent variable on the dependent variable .

Table of contents

  • Answering your research question with hypotheses
  • What is a null hypothesis?
  • What is an alternative hypothesis?
  • Differences between null and alternative hypotheses
  • How to write null and alternative hypotheses
  • Frequently asked questions about null and alternative hypotheses

The null and alternative hypotheses offer competing answers to your research question . When the research question asks “Does the independent variable affect the dependent variable?”, the null hypothesis (H 0 ) answers “No, there’s no effect in the population.” On the other hand, the alternative hypothesis (H A ) answers “Yes, there is an effect in the population.”

The null and alternative are always claims about the population. That’s because the goal of hypothesis testing is to make inferences about a population based on a sample . Often, we infer whether there’s an effect in the population by looking at differences between groups or relationships between variables in the sample.

You can use a statistical test to decide whether the evidence favors the null or alternative hypothesis. Each type of statistical test comes with a specific way of phrasing the null and alternative hypothesis. However, the hypotheses can also be phrased in a general way that applies to any test.

The null hypothesis is the claim that there’s no effect in the population.

If the sample provides enough evidence against the claim that there’s no effect in the population ( p ≤ α), then we can reject the null hypothesis . Otherwise, we fail to reject the null hypothesis.

Although “fail to reject” may sound awkward, it’s the only wording that statisticians accept. Be careful not to say you “prove” or “accept” the null hypothesis.
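The reject / fail-to-reject decision can be made concrete with a small simulation. The sketch below is a minimal plain-Python permutation test with invented cavity counts (the flossing example from the table that follows), so the numbers are illustrative only: under the null hypothesis the group labels are exchangeable, so reshuffling them shows how often a mean difference at least as large as the observed one arises by chance alone.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Under H0 the group labels are exchangeable, so we reshuffle them
    and count how often a mean difference at least as extreme as the
    observed one arises by chance alone.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples

alpha = 0.05
# Hypothetical cavity counts (made-up data, illustration only):
flossing = [1, 0, 2, 1, 0, 1, 2, 0]
no_flossing = [3, 2, 4, 2, 3, 5, 2, 4]

p = permutation_test(flossing, no_flossing)
decision = "reject H0" if p <= alpha else "fail to reject H0"
```

Note the asymmetry built into `decision`: a large p-value never lets us "accept" the null hypothesis; it only leaves us unable to reject it.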

Null hypotheses often include phrases such as “no effect”, “no difference”, or “no relationship”. When written in mathematical terms, they always include an equality (usually =, but sometimes ≥ or ≤).

Examples of null hypotheses

The table below gives examples of research questions and null hypotheses. There’s always more than one way to answer a research question, but these null hypotheses can help you get started.

Research question: Does tooth flossing affect the number of cavities?
Null hypothesis (H 0 ): Tooth flossing has no effect on the number of cavities.
Two-sample t test: The mean number of cavities per person does not differ between the flossing group (µ 1 ) and the non-flossing group (µ 2 ) in the population; µ 1 = µ 2 .

Research question: Does the amount of text highlighted in the textbook affect exam scores?
Null hypothesis (H 0 ): The amount of text highlighted in the textbook has no effect on exam scores.
Simple linear regression: There is no relationship between the amount of text highlighted and exam scores in the population; β = 0.

Research question: Does daily meditation decrease the incidence of depression?
Null hypothesis (H 0 ): Daily meditation does not decrease the incidence of depression.*
Two-proportions test: The proportion of people with depression in the daily-meditation group (p 1 ) is greater than or equal to the no-meditation group (p 2 ) in the population; p 1 ≥ p 2 .

*Note that some researchers prefer to always write the null hypothesis in terms of “no effect” and “=”. It would be fine to say that daily meditation has no effect on the incidence of depression and p 1 = p 2 .

The alternative hypothesis (H A ) is the other answer to your research question . It claims that there’s an effect in the population.

Often, your alternative hypothesis is the same as your research hypothesis. In other words, it’s the claim that you expect or hope will be true.

The alternative hypothesis is the complement to the null hypothesis. Null and alternative hypotheses are exhaustive, meaning that together they cover every possible outcome. They are also mutually exclusive, meaning that only one can be true at a time.

Alternative hypotheses often include phrases such as “an effect”, “a difference”, or “a relationship”. When alternative hypotheses are written in mathematical terms, they always include an inequality (usually ≠, but sometimes > or <). As with null hypotheses, there are many acceptable ways to phrase an alternative hypothesis.

Examples of alternative hypotheses

The table below gives examples of research questions and alternative hypotheses to help you get started with formulating your own.

Research question: Does tooth flossing affect the number of cavities?
Alternative hypothesis (H A ): Tooth flossing has an effect on the number of cavities.
Two-sample t test: The mean number of cavities per person differs between the flossing group (µ 1 ) and the non-flossing group (µ 2 ) in the population; µ 1 ≠ µ 2 .

Research question: Does the amount of text highlighted in a textbook affect exam scores?
Alternative hypothesis (H A ): The amount of text highlighted in the textbook has an effect on exam scores.
Simple linear regression: There is a relationship between the amount of text highlighted and exam scores in the population; β ≠ 0.

Research question: Does daily meditation decrease the incidence of depression?
Alternative hypothesis (H A ): Daily meditation decreases the incidence of depression.
Two-proportions test: The proportion of people with depression in the daily-meditation group (p 1 ) is less than the no-meditation group (p 2 ) in the population; p 1 < p 2 .
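The one-sided meditation comparison can be sketched numerically. The counts below are invented for illustration only; the sketch computes the pooled two-proportion z statistic and its lower-tail p-value using only the standard library (`math.erf` gives the standard normal CDF).

```python
import math

def two_proportion_z_lower(x1, n1, x2, n2):
    """One-sided two-proportion z test.

    H0: p1 >= p2 (meditation does not lower the depression rate)
    HA: p1 <  p2 (meditation lowers the depression rate)
    Returns the z statistic and the lower-tail p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # pooled proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Phi(z): standard normal CDF via the error function (lower tail)
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value

# Hypothetical counts: 30 of 200 meditators vs 50 of 200 controls depressed.
z, p = two_proportion_z_lower(30, 200, 50, 200)
```

With these made-up counts z works out to −2.5, so the lower-tail p-value falls below the usual α = 0.05 and the one-sided null p 1 ≥ p 2 would be rejected.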

Null and alternative hypotheses are similar in some ways:

  • They’re both answers to the research question
  • They both make claims about the population
  • They’re both evaluated by statistical tests.

However, there are important differences between the two types of hypotheses, summarized in the following table.

  • Null hypothesis (H 0 ): a claim that there is no effect in the population. Alternative hypothesis (H A ): a claim that there is an effect in the population.
  • H 0 is written with an equality symbol (=, ≥, or ≤); H A is written with an inequality symbol (≠, <, or >).
  • When a test is statistically significant, H 0 is rejected and H A is supported.
  • When a test is not statistically significant, we fail to reject H 0 and H A is not supported.

To help you write your hypotheses, you can use the template sentences below. If you know which statistical test you’re going to use, you can use the test-specific template sentences. Otherwise, you can use the general template sentences.

All you need to know to use these general template sentences is your dependent and independent variables. To write your research question, null hypothesis, and alternative hypothesis, fill in the following sentences with your variables:

Does independent variable affect dependent variable ?

  • Null hypothesis (H 0 ): Independent variable does not affect dependent variable .
  • Alternative hypothesis (H A ): Independent variable affects dependent variable .

Test-specific

Once you know the statistical test you’ll be using, you can write your hypotheses in a more precise and mathematical way specific to the test you chose. The table below provides template sentences for common statistical tests.

  • Two-sample t test (two groups) — H 0 : The mean dependent variable does not differ between group 1 (µ 1 ) and group 2 (µ 2 ) in the population; µ 1 = µ 2 . H A : The mean dependent variable differs between group 1 (µ 1 ) and group 2 (µ 2 ) in the population; µ 1 ≠ µ 2 .
  • One-way ANOVA (three groups) — H 0 : The mean dependent variable does not differ between group 1 (µ 1 ), group 2 (µ 2 ), and group 3 (µ 3 ) in the population; µ 1 = µ 2 = µ 3 . H A : The mean dependent variables of group 1 (µ 1 ), group 2 (µ 2 ), and group 3 (µ 3 ) are not all equal in the population.
  • Correlation test — H 0 : There is no correlation between independent variable and dependent variable in the population; ρ = 0. H A : There is a correlation between independent variable and dependent variable in the population; ρ ≠ 0.
  • Simple linear regression — H 0 : There is no relationship between independent variable and dependent variable in the population; β = 0. H A : There is a relationship between independent variable and dependent variable in the population; β ≠ 0.
  • Two-proportions test — H 0 : The dependent variable expressed as a proportion does not differ between group 1 (p 1 ) and group 2 (p 2 ) in the population; p 1 = p 2 . H A : The dependent variable expressed as a proportion differs between group 1 (p 1 ) and group 2 (p 2 ) in the population; p 1 ≠ p 2 .
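For the correlation template, the null ρ = 0 can be tested without distribution tables by permuting one variable: shuffling y destroys any real association while preserving both marginal distributions. The data below are simulated (y depends linearly on x plus noise), so the strong correlation is by construction; the sketch is illustrative, not a full implementation.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + rng.gauss(0.0, 2.0) for x in xs]  # linear signal + noise

r_obs = pearson_r(xs, ys)

# Permutation test of H0: rho = 0 -- shuffling ys breaks any association.
extreme = 0
n_resamples = 2000
ys_perm = ys[:]
for _ in range(n_resamples):
    rng.shuffle(ys_perm)
    if abs(pearson_r(xs, ys_perm)) >= abs(r_obs):
        extreme += 1
p_value = extreme / n_resamples
```

Because the comparison uses |r|, this sketch is two-tailed, matching the ρ ≠ 0 alternative in the template above.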

Note: The template sentences above assume that you’re performing two-tailed tests . Two-tailed tests are appropriate for most studies.

The null hypothesis is often abbreviated as H 0 . When the null hypothesis is written using mathematical symbols, it always includes an equality symbol (usually =, but sometimes ≥ or ≤).

The alternative hypothesis is often abbreviated as H a or H 1 . When the alternative hypothesis is written using mathematical symbols, it always includes an inequality symbol (usually ≠, but sometimes < or >).

A research hypothesis is your proposed answer to your research question. The research hypothesis usually includes an explanation (‘ x affects y because …’).

A statistical hypothesis, on the other hand, is a mathematical statement about a population parameter. Statistical hypotheses always come in pairs: the null and alternative hypotheses. In a well-designed study , the statistical hypotheses correspond logically to the research hypothesis.



‘Null’ research findings aren’t empty of meaning. Let’s publish them

By Anupam B. Jena Nov. 10, 2017


Every medical researcher dreams of doing studies or conducting clinical trials that generate results so compelling they change how diseases are treated or health policy is written. In reality, we are lucky if the results are even a little bit positive, and often end up with “null” results, meaning that the effect of a policy, drug, or clinical intervention that we tested is no different than that of some alternative.

“Null” comes from the null hypothesis, the bedrock of the scientific method. Say I want to test whether the switch to daylight saving time affects the outcomes of surgery because surgeons may be slightly more fatigued in the days following the transition due to lost sleep. I set up a null hypothesis — surgery-related deaths are no different in the days immediately before the switch to daylight saving time compared to the days immediately after it — and then try to nullify, or disprove, it to show that there was indeed a difference. (Read on to see the answer, though you can probably guess from the headline what it is.) Disproving the null hypothesis is standard operating procedure in science.


Null results are exceedingly common. Yet they aren’t nearly as likely to get published as “positive” results, even though they should be. In an analysis of nearly 30,000 presentations made at scientific conferences, fewer than half were ultimately published in peer-reviewed journals, and negative or null results were far less likely to be published than positive results. Clinical trials with positive findings are published more often and sooner than negative or null trials.

That’s a shame, because publishing null results is an important endeavor. Some null results represent potentially important discoveries, such as finding that paying hospitals for performance based on the quality of their outcomes has no effect on actually improving quality. The majority of research questions, though, don’t fall into this category. Leaving null results unpublished can also result in other researchers conducting the same study, wasting time and resources.


Some unpublished null findings are on important topics, like whether public reporting of physicians’ outcomes leads physicians to “game the system” and alter the care that they provide patients. Others come from explorations of quirkier topics.

Here are a few of each from my own unpublished research.

Daughters and life expectancy. Daughters are more likely than sons to provide care to their ailing parents. Does that mean being blessed with a daughter translates into greater life expectancy? Using data from the U.S. Health and Retirement Study , I compared mortality rates among adults with one daughter versus those with one son. There was no difference. Ditto for families with two daughters versus two sons.

Daylight saving time and surgical mortality. The switch to daylight saving time in the spring has been linked to increased driving accidents immediately after the transition, attributed to fatigue from the hour of lost sleep. I investigated whether this time switch affects the care provided by surgeons by studying operative mortality in the days after the transition. U.S. health insurance claims data from 2002 to 2012 showed no increase in operation-related deaths in the days after the transition to daylight saving time compared to the days just before it.

Tubal ligations and son preference. A preference for sons has been documented in developing countries such as China and India as well as in the United States . When I was a medical student rotating in obstetrics, I heard a patient ask her obstetrician, “Please tie my tubes,” because she had finally had a son. Years later, I investigated whether that observation could be systematically true using health insurance claims data from the U.S. Among women who had recently given birth, there was no difference in later tubal ligation rates between those giving birth to sons versus daughters.

Gaming the reporting of heart surgery deaths. One strategy for improving the delivery of health care is public reporting of doctors’ outcomes. Some evidence suggests that doctors may game the system by choosing healthier patients who are less likely to experience poor outcomes. One important metric is 30-day mortality after coronary artery bypass graft surgery or placement of an artery-opening stent. I wanted to know if heart surgeons were trying to avoid bad scores on 30-day mortality by ordering intensive interventions to keep patients who had experienced one or more complications from the procedure alive beyond the 30-day mark to avoid being dinged in the publicly reported statistics. I hypothesized that in states with public reporting, such as New York, deaths would be higher on post-procedure days 31 to 35 than on days 25 to 29 if doctors chose to keep patients alive by extreme measures. The data didn’t back that up — there was no evidence that cardiac surgeons or cardiologists attempt to game public reporting in this way.


Halloween and hospitalization for high blood sugar. Children consume massive amounts of candy on and immediately after Halloween. Does this onslaught of candy consumption increase the number of episodes of seriously high blood sugar among children with type 1 or type 2 diabetes? I looked at emergency department use and hospitalization for hyperglycemia (high blood sugar) among children between the ages of 5 and 18 years in the eight weeks before Halloween versus the eight weeks after, using as a control group adults aged 35 and older to account for any seasonal trends in hospitalizations. There was no increase in emergency visits for hyperglycemia or hospitalizations for it among either adults or children in the days following Halloween.

The 2008 stock market crash and surgeons’ quality of care. During a three-week period in 2008, the Dow Jones Industrial Average fell 3,000 points, or nearly 25 percent of the Dow’s value. The sharp, massive decline in wealth for many Americans, particularly those with enough money to be heavily invested in stocks, had the potential to create immediate and significant stress. Was this acute, financial stress large enough to throw surgeons off their game? Using U.S. health insurance claims data for 2007 and 2008 that included patient deaths, I analyzed whether weekly 30-day postoperative mortality rates rose in the month following the crash, using 2007 as a control for seasonal trends. There were nearly identical 30-day mortality rates by week in both 2007 and 2008, suggesting that the stock market crash, while stressful, did not distract surgeons from their work.

The bottom line

Not reporting null research findings likely reflects competing priorities of scientific journals and researchers. With limited resources and space, journals prefer to publish positive findings and select only the most important null findings. Many researchers aren’t keen to publish null findings because the effort required to do so may not ultimately be rewarded by acceptance of the research into a scientific journal.

There are a few opportunities for researchers to publish null findings. For example, the Journal of Articles in Support of the Null Hypothesis has been publishing twice a year since 2002, and the Public Library of Science occasionally publishes negative and null results in its Missing Pieces collection. Perhaps a newly announced prize for publishing negative scientific results will spur researchers to pay more attention to this kind of work. The 10,000 Euro prize , initially aimed at neuroscience, is being sponsored by the European College of Neuropsychopharmacology’s Preclinical Data Forum .

For many researchers, though, the effort required to publish articles in these forums may not be worth the lift, particularly since the amount of effort required to write up a positive study is the same as for a null study.

The scientific community could benefit from more reporting of null findings, even if the reports were briefer and had less detail than would be needed for peer review. I’m not sure how we could accomplish that, but would welcome any ideas.


Anupam B. Jena, MD, is an economist, physician, and associate professor of health care policy and medicine at Harvard Medical School. He has received consulting fees from Pfizer, Hill Rom Services, Bristol Myers Squibb, Novartis Pharmaceuticals, Vertex Pharmaceuticals, and Precision Health Economics, a company providing consulting services to the life sciences industry.



13.1 Understanding Null Hypothesis Testing

Learning Objectives

  • Explain the purpose of null hypothesis testing, including the role of sampling error.
  • Describe the basic logic of null hypothesis testing.
  • Describe the role of relationship strength and sample size in determining statistical significance and make reasonable judgments about statistical significance based on these two factors.

  The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables in a sample and computing descriptive statistics for that sample. In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called  parameters . Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 adults with clinical depression and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for adults with clinical depression).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of adults with clinical depression, 6.45 in a second sample, and 9.44 in a third—even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s  r ) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third—again, even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called  sampling error . (Note that the term error  here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)
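The three hypothetical sample means above can be reproduced in spirit with a short simulation. The population parameters and sample size below are invented for illustration; the point is only that repeated random samples from the same population give different sample means.

```python
# A tiny simulation of sampling error: three random samples from the SAME
# population yield three different sample means. Population mean/SD are
# hypothetical values chosen for illustration.
import random
import statistics

random.seed(11)

POP_MEAN, POP_SD = 8.0, 3.0   # hypothetical population of symptom counts

means = []
for _ in range(3):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(50)]
    means.append(round(statistics.mean(sample), 2))

print(means)  # three different estimates of the one population mean
```

None of the three samples is "wrong"; the spread among the means is sampling error itself.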

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s  r  value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is just a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

  • There is a relationship in the population, and the relationship in the sample reflects this.
  • There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing  is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the  null hypothesis  (often symbolized  H 0  and read as “H-naught”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the  alternative hypothesis  (often symbolized as  H 1 ). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: It might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

  • Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.
  • Determine how likely the sample relationship would be if the null hypothesis were true.
  • If the sample relationship would be extremely unlikely, then reject the null hypothesis  in favor of the alternative hypothesis. If it would not be extremely unlikely, then  retain the null hypothesis .

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of  d  = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis—concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favor of the alternative hypothesis—concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the  p value . A low  p  value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A p  value that is not low means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the  p  value be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called  α (alpha)  and is almost always set to .05. If there is a 5% chance or less of a result as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be  statistically significant . If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true—only that there is not currently enough evidence to reject it. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”
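The decision rule just described can be sketched in a few lines of Python. This is an illustration, not a prescribed implementation: the data are simulated, and a normal approximation to the two-sample t-test keeps the sketch dependency-free (with 200 observations per group the approximation is close).

```python
# Sketch of the alpha = .05 decision rule: compute p, compare to alpha,
# then reject or retain H0. Data are simulated for illustration.
import math
import random

random.seed(0)

ALPHA = 0.05

def two_sample_p(a, b):
    """Two-sided p value for a difference in means (normal approximation,
    reasonable here because both groups are large)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))  # P(result at least this extreme | H0)

# Hypothetical data: group A's population really does differ by +0.5 SD.
group_a = [random.gauss(0.5, 1.0) for _ in range(200)]
group_b = [random.gauss(0.0, 1.0) for _ in range(200)]

p = two_sample_p(group_a, group_b)
decision = "reject H0" if p < ALPHA else "retain H0"
print(f"p = {p:.4g} -> {decision}")
```

Because a real effect is built into the simulated data, the p value comes out far below .05 and the null hypothesis is rejected.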

The Misunderstood  p  Value

The  p  value is one of the most misunderstood quantities in psychological research (Cohen, 1994) [1] . Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the  p  value is the probability that the null hypothesis is true—that the sample result occurred by chance. For example, a misguided researcher might say that because the  p  value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect . The  p  value is really the probability of a result at least as extreme as the sample result  if  the null hypothesis  were  true. So a  p  value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the  p  value is not the probability that any particular  hypothesis  is true or false. Instead, it is the probability of obtaining the  sample result  if the null hypothesis were true.
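A short simulation makes the distinction concrete: when the null hypothesis really is true, about 5% of experiments still produce p < .05. The p value describes how often the data would look this extreme under H0, not how likely H0 is. The sample size and number of runs below are arbitrary choices.

```python
# Simulate many experiments in which H0 is exactly true and count how
# often a "significant" p value appears anyway.
import math
import random

random.seed(42)

def one_sample_p(sample):
    """Two-sided p value for H0: population mean = 0 (normal approximation)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = (sum((x - mean) ** 2 for x in sample) / (n - 1)) ** 0.5
    z = mean / (sd / n ** 0.5)
    return math.erfc(abs(z) / math.sqrt(2))

# 2,000 experiments; the population mean really is 0 in every one.
runs = 2000
hits = sum(one_sample_p([random.gauss(0, 1) for _ in range(30)]) < 0.05
           for _ in range(runs))
rate = hits / runs
# The rate hovers near alpha = .05 (slightly above, given the normal
# approximation), no matter how certain we are that H0 is true.
print(f"share of p < .05 with H0 true: {rate:.3f}")
```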

“Null Hypothesis” retrieved from http://imgs.xkcd.com/comics/null_hypothesis.png (CC-BY-NC 2.5)

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the  p  value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the  p  value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s  d  is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s  d  is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.
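The two imagined studies above can be checked with a rough calculation. For two equal groups of size n, the test statistic for a standardized difference d is approximately z = d·√(n/2), so the sketch below needs only the normal tail function; the approximation is crude at n = 3 but shows the direction clearly.

```python
# Rough p values for the two studies described in the text: a strong
# effect in a large sample vs. a weak effect in a tiny sample.
import math

def approx_p(d, n_per_group):
    """Two-sided p for standardized difference d with n per group
    (normal approximation: z = d * sqrt(n / 2))."""
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

p_strong_large = approx_p(0.50, 500)   # strong effect, large sample
p_weak_small = approx_p(0.10, 3)       # weak effect, tiny sample

print(f"d = 0.50, n = 500 per group: p = {p_strong_large:.1e}")
print(f"d = 0.10, n = 3 per group:   p = {p_weak_small:.2f}")
```

The first result would be astronomically unlikely under the null hypothesis (reject), while the second would be entirely unsurprising (retain).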

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word  Yes , then this combination would be statistically significant for both Cohen’s  d  and Pearson’s  r . If it contains the word  No , then it would not be statistically significant for either. There is one cell where the decision for  d  and  r  would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests”

Sample Size           | Weak             | Medium | Strong
Small (N = 20)        | No               | No     | d = Maybe, r = Yes
Medium (N = 50)       | No               | Yes    | Yes
Large (N = 100)       | d = Yes, r = No  | Yes    | Yes
Extra large (N = 500) | Yes              | Yes    | Yes

Although Table 13.1 provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this lesson in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.

Statistical Significance Versus Practical Significance

Table 13.1 illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007) [2] . The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word  significant  can cause people to interpret these differences as strong and important—perhaps even important enough to influence the college courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak—perhaps even “trivial.”

This is why it is important to distinguish between the  statistical  significance of a result and the  practical  significance of that result.  Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.
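The same normal approximation used earlier illustrates the point numerically: an effect most people would call trivial (d = 0.05) becomes statistically significant once the sample is large enough. The numbers are hypothetical.

```python
# Statistical vs. practical significance: a trivially small effect is
# "significant" given a huge sample.
import math

def approx_p(d, n_per_group):
    """Two-sided p for standardized difference d (normal approximation)."""
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

d = 0.05                    # a practically trivial difference
p = approx_p(d, 100_000)    # but a very large sample per group
print(f"d = {d}: p = {p:.1e}")  # statistically significant, practically tiny
```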

“Conditional Risk” retrieved from http://imgs.xkcd.com/comics/conditional_risk.png (CC-BY-NC 2.5)

Key Takeaways

  • Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
  • The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favor of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
  • The probability of obtaining the sample result if the null hypothesis were true (the  p  value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
  • Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.
Exercises

  • Discussion: Imagine a study showing that people who eat more broccoli tend to be happier. Explain for someone who knows nothing about statistics why the researchers would conduct a null hypothesis test.
  • Practice: Use Table 13.1 to decide whether each of the following results is statistically significant:
  • The correlation between two variables is  r  = −.78 based on a sample size of 137.
  • The mean score on a psychological characteristic for women is 25 ( SD  = 5) and the mean score for men is 24 ( SD  = 5). There were 12 women and 10 men in this study.
  • In a memory experiment, the mean number of items recalled by the 40 participants in Condition A was 0.50 standard deviations greater than the mean number recalled by the 40 participants in Condition B.
  • In another memory experiment, the mean scores for participants in Condition A and Condition B came out exactly the same!
  • A student finds a correlation of  r  = .04 between the number of units the students in his research methods class are taking and the students’ level of stress.
References

  • Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
  • Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16, 259–263.

Type 1 and Type 2 Errors Explained - Differences and Examples

Understanding type 1 and type 2 errors is essential. Knowing what and how to manage them can help improve your testing and minimize future mistakes.

Types of Errors in Statistics

In product and web testing, we generally categorize statistical errors into two main types—type 1 and type 2 errors. These are closely related to the ideas of hypothesis testing and significance levels.

Researchers often develop a null (H0) and an alternate hypothesis (H1) when conducting experiments or analyzing data . The null hypothesis usually represents the status quo or the baseline assumption, while the alternative hypothesis represents the claim or effect being investigated.

The goal is to determine whether the observed data provides enough evidence to reject the null hypothesis in favor of the alternative hypothesis.

With this in mind, let’s explore each type and the main differences between type 1 errors vs type 2 errors.

Type 1 Error

A type 1 error occurs when you reject the null hypothesis when it is actually true. In other words, you conclude there is a notable effect or difference when there isn’t one—such as a problem or bug that doesn’t exist.

This error is also known as a “false positive” because you’re falsely detecting something insignificant. Say your testing flags an issue with a feature that’s working correctly—this is a type 1 error.

The problem has not resulted from a bug in your code or product but has come about purely by chance or through unrelated factors. This doesn’t mean your testing was completely incorrect, but there isn’t enough evidence to say confidently that the flagged issue is genuine and significant enough to warrant changes.

Type 1 errors can lead to unnecessary reworks, wasted resources, and delays in your development cycle. You might alter something or add new features that don’t benefit the application.

Type 2 Error

A type 2 error, or “false negative,” happens when you fail to reject the null hypothesis when the alternative hypothesis is actually true. In this case, you’re failing to detect an effect or difference (like a problem or bug) that does exist.

It’s called a “false negative,” as you’re falsely concluding there’s no effect when there is one. For example, if your test suite gives the green light to a broken feature or one not functioning as intended, it’s a type 2 error.

Type 2 errors don’t mean you fully accept the null hypothesis—the testing only indicates whether to reject it. In fact, your testing might not have enough statistical power to detect an effect.

A type 2 error can result in you launching faulty products or features. This can massively harm your user experience and damage your brand’s reputation, ultimately impacting sales and revenue.

Understanding and managing type 1 and type 2 errors means understanding some math, specifically probability and statistics.

Let’s unpack the probabilities associated with each type of error and how they relate to statistical significance and power.

Type 1 Error Probability

The probability of getting a type 1 error is represented by alpha (α).

In testing, researchers typically set a desired significance level (α) in advance to control the risk of type 1 errors. They then compare it against the p value: the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. When comparing the means of two groups, the p value typically comes from a t-test.

Common significance levels (α) are 0.05 (5%) or 0.01 (1%)—this means there’s a 5% or 1% chance of incorrectly rejecting the null hypothesis when it’s true.

If the p value is lower than α, it suggests your results are unlikely to have occurred by chance alone. Therefore, you can reject the null hypothesis and conclude that the alternative hypothesis is supported by your data.

However, the results are not statistically significant if the p value is higher than α. As they could have occurred by chance, you fail to reject the null hypothesis, and there isn’t enough evidence to support the alternative hypothesis.

You can set a lower significance level to reduce the probability of a type 1 error. For example, reducing the level from 0.05 to 0.01 effectively means you’re willing to accept a 1% chance of a type 1 error instead of 5%.
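A quick simulation, with arbitrary sizes, illustrates this: when the null hypothesis is true, the share of "significant" results tracks whatever α you choose.

```python
# Lowering alpha from .05 to .01 cuts the false positive (type 1) rate.
import math
import random

random.seed(7)

def one_sample_p(sample):
    """Two-sided p value for H0: population mean = 0 (normal approximation)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = (sum((x - mean) ** 2 for x in sample) / (n - 1)) ** 0.5
    z = mean / (sd / n ** 0.5)
    return math.erfc(abs(z) / math.sqrt(2))

# Simulate 4,000 experiments in which the null hypothesis is true.
pvals = [one_sample_p([random.gauss(0, 1) for _ in range(50)])
         for _ in range(4000)]

rate_05 = sum(p < 0.05 for p in pvals) / len(pvals)
rate_01 = sum(p < 0.01 for p in pvals) / len(pvals)
print(f"type 1 error rate at alpha = .05: {rate_05:.3f}")
print(f"type 1 error rate at alpha = .01: {rate_01:.3f}")
```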

Type 2 Error Probability

The probability of having a type 2 error is denoted by beta (β). It’s inversely related to the statistical power of the test—this is the extent to which a test can correctly detect a real effect when there is one.

Statistical power is calculated as 1 − β. For example, if your risk of committing a type 2 error is 20%, your power level is 80% (1.0 − 0.2 = 0.8). A higher power indicates a lower probability of a type 2 error, meaning you’re less likely to have a false negative. Levels of 80% or more are generally considered acceptable.

Several factors can influence statistical power, including the sample size, effect size, and the chosen significance level. Increasing the sample size and significance level increases the test's power, indirectly reducing the probability of a type 2 error.
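Power can be estimated by simulation: run many experiments in which a real effect exists and count how often the test detects it at α = .05. The effect size and sample size below are illustrative choices, and a normal approximation stands in for a full t-test.

```python
# Monte Carlo estimate of statistical power (1 - beta).
import math
import random

random.seed(3)

ALPHA = 0.05
EFFECT, N, RUNS = 0.5, 64, 2000   # true effect d = 0.5, n = 64 per experiment

detections = 0
for _ in range(RUNS):
    sample = [random.gauss(EFFECT, 1.0) for _ in range(N)]
    mean = sum(sample) / N
    sd = (sum((x - mean) ** 2 for x in sample) / (N - 1)) ** 0.5
    z = mean / (sd / N ** 0.5)                  # normal approximation
    p = math.erfc(abs(z) / math.sqrt(2))
    detections += p < ALPHA

power = detections / RUNS
beta = 1 - power
print(f"estimated power = {power:.2f}, so beta = {beta:.2f}")
```

With this effect size and sample, the estimated power comes out well above the conventional 80% threshold.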

Balancing Type 1 and Type 2 Errors

There’s often a trade-off between type 1 and type 2 errors. For instance, lowering the significance level (α) reduces the probability of a type 1 error but increases the likelihood of a type 2 error (and vice versa).

Researchers and product teams must carefully consider the relative consequences of each type of error in their specific context.

Take medical testing—a type 1 error (false positive) in this field might lead to unnecessary treatment, while a type 2 error (false negative) could result in a missed diagnosis.

It all depends on your product and context. If the cost of a false positive is high, you might want to set a lower significance level (to lower the probability of type 1 error). However, if the impact of missing a genuine issue is more severe (type 2 error), you might choose a higher level to increase the statistical power of your tests.

Knowing the probabilities associated with type 1 and type 2 errors helps teams make better decisions about their testing processes, balance each type's risks, and ensure their products meet proper quality standards.

To help you better understand type 1 errors or false positives in product software and web testing, here are some examples.

In each case, the Type 1 error could lead to unnecessary actions or investigations based on inaccurate or false positive results despite the absence of an actual issue or effect.

Mistaken A/B test result

Your team runs an A/B test to see if a new feature improves user engagement metrics, such as time spent on the platform or click-through rates.

The results show a statistically significant difference between the control and experiment groups, leading you to conclude the new feature is successful and should be rolled out to all users.

However, after further investigation and analysis, you realize the observed difference was not due to the feature itself but an unrelated factor, such as a marketing campaign or a seasonal trend.

You committed a Type 1 error by incorrectly rejecting the null hypothesis (no difference between the groups) when the new feature had no real effect.

Usability testing false positive

Imagine you’re testing that same new feature for usability. Your testing finds that people are struggling to use it—your team puts this down to a design flaw and decides to redesign the element.

However, after getting the same results, you realize that the users’ difficulty using the feature isn’t due to its design but rather their unfamiliarity with it.

After more exposure, they’re able to navigate the feature more easily. Your misattribution led to unnecessary design efforts and a prolonged launch.

This is a classic example of a Type 1 error, where the usability test incorrectly rejected the null hypothesis (the feature is usable).

Inaccurate performance issue detection

Your team uses performance testing to spot your app’s bottlenecks, slowdowns, or other performance issues.

A routine test reports a performance issue with a specific component, such as slow response times or high resource utilization. You allocate resources and efforts to investigate and confront the problem.

However, after in-depth profiling, load testing, and analysis, you find the issue was a false positive, and the component is working normally.

This is another example of a Type 1 error: testing incorrectly flagged a non-existent performance problem, leading to pointless troubleshooting efforts and potential resource waste.

In each of the following examples, a type 2 error results in missed opportunities for improvement, the release of faulty products or features, or a failure to tackle existing issues.

Missed bug detection

Your team has implemented a new feature in your web application, and you have designed test cases to catch each bug.

However, one of the tests fails to detect a critical bug, leading to the release of a faulty feature with unexpected behavior and functionality issues.

This is a type 2 error—your testing failed to reject the null hypothesis (no bug) when the alternative (bug present) was true.

Overlooked performance issues

Your product relies on a third-party API for data retrieval, and you regularly conduct performance testing to ensure optimal response times.

However, during a particular testing cycle, your team didn’t identify a significant slowdown in the API response times. This results in performance issues and a poor user experience for your customers, with slow page loads or delayed data updates.

As your performance testing failed to spot an existing performance problem, this is a type 2 error.

Undetected security vulnerability

Your security team carries out frequent penetration testing, code reviews, and security audits to highlight potential vulnerabilities in your web application.

However, a critical cross-site scripting (XSS) vulnerability goes undetected, enabling malicious actors to inject client-side scripts and potentially gain access to sensitive data or perform unauthorized actions. This puts your users’ data and security at risk.

It’s also another type 2 error, as your testing didn’t reject the null hypothesis (no vulnerability) when the alternative hypothesis (vulnerability present) was true.

Although it’s impossible to eliminate type 1 and type 2 errors, there are several strategies your product teams can apply to manage and minimize their risks.

Implementing these can improve the accuracy and reliability of your testing process, ultimately leading to you delivering better products and user experiences.

Adjust significance levels

We’ve already discussed adjusting significance levels—this is one of the most straightforward strategies.

Suppose the consequences of getting a false positive (type 1 error) are more severe. In that case, you may wish to set a lower significance level to reduce the probability of rejecting a true null hypothesis.

On the other hand, if overlooking an actual effect (type 2 error) is more costly, you can increase the significance level to improve the statistical power of your tests.

Increase sample size

Increasing the sample size of your tests can help minimize the probability of both type 1 and type 2 errors.

A larger sample size gives you more statistical power, making it easier to spot genuine effects and reducing the likelihood of false positives or negatives.
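A normal-approximation sketch shows the relationship: holding the effect size fixed, power climbs steadily with sample size. The effect size (d = 0.3) and the one-sided approximation are simplifying assumptions, not a general prescription.

```python
# Approximate power of a one-sample test as sample size grows.
import math

def approx_power(d, n, z_crit=1.96):
    """One-sided normal-approximation power at the two-sided alpha = .05 cutoff."""
    ncp = d * math.sqrt(n)                      # expected test statistic
    return 0.5 * math.erfc((z_crit - ncp) / math.sqrt(2))

for n in (10, 25, 50, 100, 200):
    print(f"n = {n:3d}: power = {approx_power(0.3, n):.2f}")
```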

Implement more thorough testing methodologies

Adopting more thorough and accurate testing methods, such as comprehensive test case design, code coverage analysis, and exploratory testing, can help minimize the risk of missed issues or bugs (type 2 errors).

Regularly reviewing and updating your testing suite to meet changing product requirements can also make it more effective.

Use multiple testing techniques

Combining different testing techniques, including unit, integration, performance, and usability tests, can give you a more complete view of your product’s quality. This reduces the chances of overlooking important issues, which could later affect your bottom line.

Continuously monitor and feedback

Continuous monitoring and feedback loops enable you to identify and deal with any issues missed during the initial testing phases.

This might include monitoring your production systems, gathering user feedback, and conducting post-release testing.

Conduct root cause analysis

When errors are flagged, conduct a root cause analysis to find the underlying reasons for the false positive or negative.

This can help you refine your testing process, improve test case design, and prevent similar errors from occurring in the future.

Foster a culture of quality

Promoting a culture of quality within your organization can help ensure that everyone is invested in minimizing errors and delivering high-quality products.

To achieve this, ask your company to offer more training, encourage collaboration, and foster an environment where team members feel empowered to raise concerns or suggest improvements.

Encountering type 1 and type 2 errors can be disheartening for product teams. Here’s where Amplitude Experiment can help.

The A/B testing platform’s features help compensate for and correct type 1 and type 2 errors. By managing and minimizing their risk, you’re able to run product experiments and tests with greater confidence.

Some of Amplitude’s main experimental features include its:

  • Sample size calculator: This helps you determine the minimum sample size needed to detect significant effects.
  • Experiment duration estimator: The platform’s estimator gives you an idea of how long your experiment needs to run to reach statistical significance.
  • Bonferroni correction application: Amplitude uses the Bonferroni correction to adjust the significance level when testing multiple hypotheses.
  • Minimum sample size threshold: The platform sets a minimum threshold that experiments must meet before declaring significance.
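
For context, the Bonferroni correction itself is simple to sketch (this is the generic textbook procedure, not Amplitude's internal implementation):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject only hypotheses whose p-value clears alpha / m, which keeps
    the family-wise (any-false-positive) error rate at or below alpha."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Five metrics tested at once: a raw p of 0.03 would pass alpha = 0.05 on
# its own, but not the corrected per-test bar of 0.05 / 5 = 0.01.
decisions = bonferroni_reject([0.002, 0.03, 0.2, 0.04, 0.009])
# decisions -> [True, False, False, False, True]
```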

Use Amplitude to help you design more robust testing, ensure sufficient statistical power, control for multiple tests, and oversee your results. Get increased confidence in your experiment results and make more informed decisions about product changes and feature releases.

Ready to place more trust in your product testing? Sign up for Amplitude now .


  • Open access
  • Published: 23 August 2024

Inferring histology-associated gene expression gradients in spatial transcriptomic studies

  • Jan Kueckelhaus
  • Simon Frerich (ORCID: 0000-0002-8275-6113)
  • Jasim Kada-Benotmane (ORCID: 0009-0005-2100-913X)
  • Christina Koupourtidou (ORCID: 0000-0002-8352-1498)
  • Jovica Ninkovic
  • Martin Dichgans
  • Juergen Beck (ORCID: 0000-0002-7687-6098)
  • Oliver Schnell
  • Dieter Henrik Heiland (ORCID: 0000-0002-9258-3033)
Nature Communications volume 15, Article number: 7280 (2024)


Subjects: Computational models, Statistical methods

Spatially resolved transcriptomics has revolutionized RNA studies by aligning RNA abundance with tissue structure, enabling direct comparisons between histology and gene expression. Traditional approaches to identifying signature genes often involve preliminary data grouping, which can overlook subtle expression patterns in complex tissues. We present Spatial Gradient Screening, an algorithm which facilitates the supervised detection of histology-associated gene expression patterns without prior data grouping. Utilizing spatial transcriptomic data along with single-cell deconvolution from injured mouse cortex, and TCR-seq data from brain tumors, we compare our methodology to standard differential gene expression analysis. Our findings illustrate both the advantages and limitations of cluster-free detection of gene expression, offering more profound insights into the spatial architecture of transcriptomes. The algorithm is embedded in SPATA2, an open-source framework written in R, which provides a comprehensive set of tools for investigating gene expression within tissue.

Introduction

In recent years, significant advancements have been made in the field of spatial biology, providing essential tools for profiling gene, protein, and metabolic expression in biological tissues 1 . These developments have been crucial in various research domains, such as developmental biology 2 , neuroscience 3 , and cancer microenvironment 4 studies. The discoveries emerging from these studies have greatly enhanced our understanding of spatial organization in different tissues. While healthy tissue typically exhibits a highly ordered structure, diseases can disrupt this order, leading to a complex range of dynamic alterations. The human neocortex, for instance, is generally understood through a well-established model of six cortical layers. This organized structure contrasts sharply with the chaotic and heterogeneous architecture of malignant CNS tumors, a phenomenon encapsulated in the concept of intertumoral heterogeneity 5 , 6 . The heterogeneous complexity of pathologies in general, benign and malignant alike, poses significant challenges in medical care, given that effective treatments rely on recurring biological patterns or functions that can be targeted. In this context, ensuing efforts of the past decades have resulted in the identification of key histomorphological niches that have become crucial factors in contemporary diagnostics and research. In glioblastoma, for instance, necrosis and the border between tumor and healthy tissue are notable examples. Recent advances in spatial transcriptomics have also revealed recurrent patterns of gene expression reflecting responses to inflammatory or metabolic stimuli and different stages of development 4 . The recurring nature of these spatial niches, whether of histomorphological or molecular nature, highlights their significance in understanding these medical conditions.
However, to fully comprehend their roles and dynamics within the microenvironment, sophisticated analysis tools for supervised screening approaches are essential. Conventional approaches, such as clustering followed by differential expression analysis (DEA), encounter substantial limitations when applied to spatial multi-omic studies. The binary nature of clustering and its imposition of artificial boundaries can obscure nuanced expression patterns and fail to capture critical features within intricate tissues. Furthermore, the outcomes are reliant on the selected number of clusters, which is influenced by sample characteristics and algorithmic parameters. This reliance presents challenges in data interpretation, particularly given the continuous nature of gene expression in spatial samples. Consequently, clustering with DEA is suboptimal for addressing questions concerning spatial gene expression patterns, especially in complex and disordered tissues like malignancies. To overcome the limitations of DEA, unbiased computational methods like SpatialDE 7 and SPARKX 8 have been developed. While they are effective in identifying genes based on spatial variability, these algorithms primarily offer a holistic view and do not allow the researcher to incorporate additional information important to the sample and the specific query. Consequently, they may identify genes with statistically significant spatial expression patterns that are, however, not related to the specific areas of the tissue architecture the researcher wants to focus on, Supplementary Fig. 13a–h. In certain scenarios, a more refined approach is necessary, one that can provide insights tailored to the particular characteristics of the tissue sample and the research questions at hand.

In this work we present a flexible, supervised screening approach attuned to detecting spatial subtleties. Furthermore, we aim to capture spatial expression dynamics through gradients rather than group-based log-fold changes, recognizing the inherent continuous nature of expression data in a spatial context 8 . Our efforts have led to the development of two methods falling under the umbrella term spatial gradient screening (SGS). These methods empower users to define spatial locations of interest and use them as reference points while screening for genes and other continuous features with relevant biological meaning. We demonstrate that this dual focus on location-specific screening and spatial gradients seamlessly complements and extends established gene identification approaches in spatial biology, catering to both exploratory and hypothesis-driven inquiries.

DEA is unreliable in predicting gene expression with spatial dependencies

To demonstrate the challenges inherent in analyzing gene expression in relation to defined tissue architecture, we analyzed a glioblastoma Visium dataset with three histologically distinct regions: a tumor core, a transition area, and the infiltrative cortex area, Fig. 1a–c. We hypothesized that genes exhibiting a gradual change of gene expression from the core to the infiltrative regions of glioblastoma are inadequately represented by traditional differential gene expression analysis. We compared classical differential gene expression analysis with our Spatial Trajectory Screening to characterize the histological regions based on their gradient and group-based gene expression profiles. To incorporate the histological classification into our data, we utilized SPATA2’s interactive annotation tool and labeled each barcoded spot depending on the histological region it was located in. This resulted in the division of spots into three groups: (1) Tumor, (2) Transition, and (3) Infiltrated Cortex, as depicted in Fig. 1a–c and Supplementary Fig. 1a, b. To validate our histological classification, we inferred copy number alterations (CNA) using SPATA2’s implementation of the infercnv R package. This showed an inferred gain of chromosome 7 and loss of chromosome 10 in the malignant region on the left, displayed in Fig. 1f, g, consistent with previous studies 5 . Additionally, the integration of histologically defined areas enabled us to quantify CNA across histological groups. We observed the highest abundance of Chr 7 and 10 alterations in the tumor area, partial alterations in the transition zone, and almost no alterations in the infiltrated cortex (ANOVA, Chr 7 p < 2.2 × 10⁻²², Chr 10 p < 2.2 × 10⁻²²), Fig. 1h, i, confirming our histological annotation. To further investigate the role of the transition zone we aimed to determine significantly upregulated genes of the annotated regions.
The copy number aberration (CNA) profile suggests that the transitional region exhibits a significantly reduced tumor proportion, thereby representing the boundary zone towards the normal cortex. A border-like function of this area was also supported by spatial clustering results using the BayesSpace algorithm 9 , which incorporates transcriptional and spatial distance, suggesting two spatially segregated clusters that largely overlapped with our manually annotated transition area, Supplementary Fig. 1c, d. To identify uniquely expressed genes, we performed differential expression analysis (DEA) based on our manual histological annotation. This resulted in the identification of 3489 significant differentially expressed genes (DEGs) using the default thresholds (avg_log2FC > 0.25; p_adj < 0.05). We found 1698 DEGs in the tumor area, 533 DEGs in the transition group, and 1258 DEGs in the infiltrated cortex group. DEA results further supported our categorization of the left area as a tumor region, evident from the presence of marker genes such as EGFR, and the right area as cortex, characterized by neural marker genes like SNAP25. The transition area exhibited elevated expression levels of genes associated with glial cells, such as MBP and MOG. Intriguingly, this region shared 78 differentially expressed genes (DEGs) with the tumor area and 21 genes with the infiltrated cortex, as highlighted in the volcano plot presented in Fig. 1k. Notably, there were no shared DEGs between the tumor area and the infiltrated cortex. Attributing to DEA the capacity to predict gene expression in space, we hypothesized that after removing shared DEGs, the remaining ones should be exclusively expressed within the boundaries of the histological area occupied by the group they were assigned to as marker genes. We refer to DEGs that were not shared between two groups as area-specific DEGs. The top 18 of those in terms of significance (lowest adjusted p-values) are displayed in Fig. 1e.
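
The filtering logic described above — keep DEGs passing the default thresholds, then drop any gene claimed as a marker by more than one group — can be sketched as follows (illustrative Python with a hypothetical table layout; this is not SPATA2's or Seurat's actual API):

```python
def area_specific_degs(dea_rows, lfc_min=0.25, p_max=0.05):
    """Keep rows passing the significance thresholds, then drop genes
    that appear as significant markers in more than one group.
    Hypothetical layout: dicts with 'gene', 'group', 'avg_log2FC', 'p_adj'."""
    significant = [r for r in dea_rows
                   if r["avg_log2FC"] > lfc_min and r["p_adj"] < p_max]
    counts = {}
    for r in significant:
        counts[r["gene"]] = counts.get(r["gene"], 0) + 1
    return [r for r in significant if counts[r["gene"]] == 1]

# Toy table (values invented for illustration):
rows = [
    {"gene": "EGFR",   "group": "Tumor",      "avg_log2FC": 1.8, "p_adj": 1e-50},
    {"gene": "MBP",    "group": "Transition", "avg_log2FC": 0.9, "p_adj": 1e-20},
    {"gene": "MBP",    "group": "Tumor",      "avg_log2FC": 0.4, "p_adj": 1e-4},
    {"gene": "SNAP25", "group": "Cortex",     "avg_log2FC": 1.2, "p_adj": 1e-30},
    {"gene": "WEAK",   "group": "Tumor",      "avg_log2FC": 0.1, "p_adj": 0.3},
]
specific = area_specific_degs(rows)
# MBP is significant in two groups and is dropped; WEAK fails the
# thresholds; EGFR and SNAP25 remain as area-specific markers.
```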

figure 1

a – c SPATA’s manual annotation tool is employed to delineate borders, facilitating the grouping of spots based on histology. d Gene expression-driven UMAP projection of spots. e A dot plot showcases the 18 most statistically significant unique marker genes, ranked by their average log2-fold change, in accordance with histological areas. f and g Surface plots are color-coded to highlight inferred copy number alterations that are characteristic of glioblastoma. h and i Statistical analyses examine copy number alterations across histological areas (two-sided test, no adjustments for multiple comparisons, Tumor ( n  = 1307), Transition ( n  = 478), Infiltrated Cortex ( n  = 1428). The minima represent the smallest and the maxima represent the largest value within 1.5 times the interquartile range (IQR) below or above the first or third quartile (Q1, Q3), respectively. The median is shown as a line inside the box. The box bounds are the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). The whiskers extend from Q1 and Q3 to the minima and maxima. j A heatmap provides a comprehensive view of alterations across all chromosomes in relation to histological areas, corroborating the tumor area’s classification with multiple alterations, the nearly unaltered cortex infiltrated by the tumor, and the transition zone exhibiting intermediate levels of alterations. k A volcano plot from the DEA analysis across histological areas highlights marker genes for the Tumor and Transition areas, characterized by an adjusted p -value of 0 (infinite −log10).

To benchmark whether the remaining area-specific marker genes featured a corresponding spatial expression, we deployed a supervised spatial trajectory to predefine the axis along which gene expression changes are to be examined, Fig. 2a. We used this spatial trajectory to infer the expression gradients of the top 18 group-specific DEGs along it. For a detailed description of the process involved in obtaining an expression gradient using a spatial trajectory please refer to the method section as well as to Supplementary Figs. 7e–i and 8b. While the expression of some genes along the trajectory clearly reflected the corresponding region, as exemplified in Fig. 2b, c, we found that DEA results were unreliable in determining the precise spatial extent of gene expression, even after filtering for area-specific marker genes, showcased in Fig. 2e–h. Inferring the expression gradient showed that the expression of even the most significant unique DEGs from the tumor and the transition group did not always decline abruptly when crossing the boundary. Unique marker genes from the transition zone featured gradually decreasing patterns transgressing both borders to the cortex- and the tumorous area alike, e.g. genes EEF1A1 and MBP. Furthermore, unique marker genes from the tumorous area featured a rather gradual transgression through the borders to the transition area, while marker genes from the cortex-like area declined rather abruptly right before the passage of the transition zone. Giving DEA the benefit of the doubt, we hypothesized that our potentially biased manual annotation could have affected the results, hindering DEA from reliably identifying genes with gene expression confined to the borders of the marked areas. To this end, we employed DEA based on the grouping suggested by BayesSpace clustering, Supplementary Fig. 2a. Example genes whose gradients do correspond to the area of their clusters are displayed in Supplementary Fig. 2b, c.
Still, even among the most significant cluster-specific marker genes (Supplementary Fig.  1e, f ), multiple genes did not feature gene expression patterns confined to the area covered by their corresponding cluster, Supplementary Fig.  2d–h . Our spatial trajectory methodology underscores the difficulty in establishing precise boundaries when analyzing spatial transcriptomic data, emphasizing the importance of acknowledging the gradual changes of gene expression in tissue. Furthermore, it reveals notable shortcomings of differential expression analysis in examining spatial expression patterns, thereby highlighting the necessity for analysis approaches that are independent of predefined groups.
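
The core geometric idea behind a spatial trajectory — projecting each spot onto a user-defined axis and summarizing expression along it — can be sketched as follows (simplified Python for illustration; SPATA2's actual implementation is in R and uses proper smoothing rather than coarse binning):

```python
import math

def project_onto_trajectory(spots, start, end):
    """Scalar position of each (x, y) spot along the line start -> end,
    in the same spatial units as the coordinates."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length
    return [(x - start[0]) * ux + (y - start[1]) * uy for x, y in spots]

def binned_gradient(positions, expression, n_bins=5):
    """Mean expression within equal-width bins along the trajectory —
    a crude stand-in for the smoothing used in the paper."""
    lo, hi = min(positions), max(positions)
    width = (hi - lo) / n_bins or 1.0
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for pos, value in zip(positions, expression):
        i = min(int((pos - lo) / width), n_bins - 1)
        sums[i] += value
        counts[i] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]

# Toy example: 10 spots along a diagonal with expression declining from
# one end to the other yields a monotonically decreasing gradient.
spots = [(i, i) for i in range(10)]
positions = project_onto_trajectory(spots, start=(0, 0), end=(9, 9))
gradient = binned_gradient(positions, expression=[9 - i for i in range(10)])
```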

figure 2

a A surface plot provides a visual representation of the trajectory’s course (indicated by an arrow), complemented by a barplot illustrating the proportion of histological groups along this trajectory. b and c Surface plots offer insights into the gene expression patterns of genes predominantly localized within their respective assigned areas. Error bands of line plots indicate the confidence interval (level: 0.95). d – g Heatmaps present the expression profiles of the top 18 differentially expressed genes (DEGs) per histological group (as referenced in Fig.  1e ). Notably, while these genes have been identified as unique DEGs, those from the tumor and transition regions do not consistently exhibit confined expression patterns to their designated areas, in contrast to DEGs from the (infiltrated) cortex. e – h Line plots and surface plots further elucidate the expression patterns of selected genes, demonstrating how they may traverse area boundaries in various ways. Error bands of line plots indicate the confidence interval (level: 0.95).

Exploring the local environment of histological microstructures independently of grouping

We showed that capturing gene expression in the form of gradients can be valuable for gene pattern identification related to specific spatial structures, but the applicability of linear trajectories as a spatial reference is limited to rectangular areas. While they suit a-to-b architectural scenarios like the border between the tumor and healthy tissue, they do not capture tissue patterns related to spatial niches of a circular nature. To exemplify this limitation, we employed a Visium dataset of two mouse brain sections with stab wounds representing traumatic brain injury 10 . Our objective was to incorporate the location and spatial dimensions of these wounds into our spatial gradient screening process. To achieve this, we utilized SPATA2’s interactive spatial annotation tools, which facilitate the manual definition of spatial reference areas by directly interacting with the image, Supplementary Fig. 6a. Please refer to the method section for an elaboration on the differences between manual annotation of data points, as conducted for the three histological groups in Fig. 1a, and spatial annotations. Figure 3a illustrates the resulting stab wound outlines. Given the initial processing, which unveiled a noticeable cluster around the stab wounds perturbing the otherwise healthy CNS architecture of the mouse cortex, Fig. 3b, we assumed that these injuries likely exerted a substantial impact on their immediate surroundings. To explore gradual changes of gene expression, likely inflicted by the stab wounds on their surroundings, we conducted spatial gradient screening, inferring gene expression gradients as a function of distance to each stab wound, Fig. 3c, d, and screened for multiple patterns, Supplementary Fig. 14d. This approach allowed us to identify genes with descending expression patterns, reflecting their fine-grained biological association with the injury.
Figure  3c–f visualizes inferred gradients of genes up to a distance of 1.5 mm from the injury border. Notably, several genes associated with immune activity (C1qa, C1qb, B2m), immune cell types (Cd68, Aif1), migration (Vim, Tyrobp), proliferation (Cd64, Ccnd1), and wound healing (S100a16, Lamp2) exhibited increased expression closer to the injury, Fig.  3g–j and Supplementary Fig.  3b–e , that faded with increasing distance. Subsequent screening within a shorter distance of 0.75 mm reaffirmed these findings and revealed additional genes that suggest increased immune activity, migration, and proliferation near the injury zone (Jun, Manf, Camk1, Ier5l) displayed in Supplementary Fig.  3f, g . It is worth noting that the genes depicted in Supplementary Fig.  3f, g were not identified as marker genes through DEA we conducted on the clustering (Supplementary Fig.  3a ). This observation underscores the capability of spatial gradient screening to detect even the most subtle and intricate expression patterns. Lastly, the estimation of cell density, obtained through single-cell deconvolution via Tangram 11 , highlighted increased microglia and macrophage density near the injury site decreasing with increasing distance, as suggested by the genes identified by spatial annotation screening, Fig.  3k–m . In summary, our study demonstrates that spatial annotation screening can adapt to complex spatial reference features and infer gene expression gradients related to them. Our findings with spatial annotation screening align with existing knowledge of the central nervous system’s response to injury, highlighting its capabilities to identify spatial gene expression patterns related to spatial areas and paving the way for new insights.
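
The distance-based variant of the screening can be sketched similarly: compute each spot's distance to the annotated outline and test whether expression declines with that distance (illustrative Python only; nearest-vertex distance and a least-squares slope stand in for SPATA2's polygon geometry and its library of pattern models):

```python
import math

def distance_to_annotation(spots, outline):
    """Distance from each spot to the nearest vertex of the annotated
    outline (nearest-vertex is a simplification of true polygon distance)."""
    return [min(math.hypot(x - ox, y - oy) for ox, oy in outline)
            for x, y in spots]

def is_descending(distances, expression):
    """True if expression declines with distance from the annotation,
    judged by the sign of a simple least-squares slope."""
    n = len(distances)
    mx, my = sum(distances) / n, sum(expression) / n
    cov = sum((d - mx) * (e - my) for d, e in zip(distances, expression))
    var = sum((d - mx) ** 2 for d in distances)
    return cov / var < 0

# Toy check: spots at increasing distance from a point annotation, with
# expression decaying away from it, are flagged as annotation-associated.
outline = [(0.0, 0.0)]
spots = [(float(d), 0.0) for d in range(1, 8)]
dists = distance_to_annotation(spots, outline)
descending = is_descending(dists, [math.exp(-d) for d in dists])
ascending = is_descending(dists, [d ** 2 for d in dists])
```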

figure 3

a An H&E image provides an overview of the analyzed sample, with two enlarged windows highlighting the stab wounds inflicted on the brain ( n  = 1 mouse with bilateral injury for Visium, n  = 3 mice for scRNA-seq experiments). b Surface plots reveal clustering results obtained through the Scanpy pipeline, emphasizing significant barcode spot clusters within the injury area. c and d Surface plots and gradient ridge plots illustrate the gene expression patterns of two marker genes for the injury area, Hmox1 and Lcn2. e and f Ridgeplots visualize the expression gradients of genes sharing similar patterns with either of the two example genes, as identified through spatial annotation screening. g – j Surface plots showcase the expression profiles of co-expressed genes identified via spatial annotation screening. k UMAP of the scRNA-seq dataset, from the same mouse model as the Visium sample. l and m Visualization of Tangram results, featuring 2D Density plots that highlight an elevated density of monocytes, macrophages, and microglia in the vicinity of the injury zone, as compared to a control area.

Identification of confounders of gene expression in spatial transcriptomic studies

Glioblastoma presents a unique challenge in spatial transcriptomics due to its heterogeneous and chaotic nature, making it difficult for clustering algorithms to discern optimal cluster numbers and identify marker genes and spatial niches. This complexity sharply contrasts with the well-structured architecture of its healthy counterpart, the human neocortex, which demonstrates clearly defined, spatially segregated layers that can be reliably derived from clustering, as illustrated in Supplementary Figs. 4, 5, 14a–c. One of our examined samples, UKF313T, exemplified this challenge. Initially, we observed a prominent central necrotic region, manually outlined as the “necrotic center” using SPATA2’s image annotation tool. Clustering the sample with BayesSpace, Supplementary Fig. 4d, revealed the absence of a clear elbow point in the clustering assessment, Supplementary Fig. 4b, highlighting the inherent challenges in defining clusters and borders in such samples. Annotating the central necrotic area alongside the clustering results, we noted that clusters proximal to the necrotic outline either coincided spatially with necrosis (B1) or organized themselves in a circular fashion around the necrotic center (B2, B3), akin to the injury cluster observed around the stab wound in Fig. 3b. However, when we mapped the top marker genes of each BayesSpace cluster in space, Supplementary Fig. 4d, marker gene expression did not align with the boundaries of their initially assigned clusters, emphasizing the unsuitability of this sample for traditional cluster analysis. This chaotic gene expression pattern was further evidenced by many marker genes sharing nearly identical average log2 fold changes across multiple clusters, Supplementary Fig. 4c.
Assuming a descending relationship between marker genes of clusters B2 and B3 and the necrotic area, similar to the relationship between the stab wound and its environment, we computed the gene expression of marker genes relative to the distance from the necrotic center. The inferred gradients of CD44, NDRG1, THBS2, and IGFBP3 (predicted by DEA to be strongly expressed in the circular area represented by cluster B4) led to the assumption that a spatial dependency between the expression of these genes and the presence of necrosis exists. We also observed an inverse relationship in genes from clusters distant to necrosis, suggesting their repulsion by the presence of necrosis. This observation was supported by the decline of their expression at ~2–3 mm, Supplementary Fig.  4d , coinciding with the influence zone of another necrotic area at the bottom right of the sample, which was initially not accounted for in our single necrotic center annotation. To accurately account for necrotic areas in this sample, we introduced two additional annotations, referred to as “necrotic edge I” and “necrotic edge II”, Fig.  4a, b . Given the insights acquired from the gradients displayed in Supplementary Fig.  4d and the significant role necrosis plays as a histomorphological correlate of malignancies, we hypothesized that necrosis significantly confounded gene expression in the sample. Subsequently, we conducted comprehensive spatial gradient screening, considering the spatial extent of all three necrotic annotations, Fig.  4b . We employed SPARKX to preselect genes predicted to exhibit spatial significance, resulting in a set of genes ( n  = 11,478, adj. p -value < 0.05) that we subsequently subjected to the screening. We focused on subsets of descending and ascending models, Supplementary Fig.  14d , to specifically identify genes associated with necrosis and those repelled by it. 
Prominent examples of necrosis-associated genes included those involved in hypoxia response (VEGFA), glycolytic metabolism (SLC2A1), and cell cycle arrest (TMEM158), illustrated in Fig.  4c, d . Conversely, genes repelled by necrosis were linked to oxygen-dependent metabolism (NDUFS) and TCR receptor signaling (CD74, HLA-DRB1), displayed in Fig.  4e, f . In summary, our analysis involving spatial gradient screening using spatial annotations illuminated intricate spatial patterns within glioblastoma, shedding light on the association and repulsion of specific genes in response to necrosis. In particular, it shed light on a potential spatial dependency between necrosis, hypoxia and TCR immune response, which we aimed to investigate closely in the following section.

figure 4

a Presents the H&E image of the glioblastoma sample UKF313T, emphasizing key areas within the sample that are detailed in the following figure ( b ). Additionally, a surface plot visualizes the count distribution, supporting our necrosis (dead tissue) annotation. b Provides a visualization of distance values from the necrotic areas, with lines illustrating the assumed gradient direction and orientation based on proximity to necrotic regions. c – e Showcase representative genes exhibiting a pattern reminiscent of association with necrosis, displaying decreasing expression levels with increasing distance from necrotic regions. f – h Feature representative genes showing a pattern resembling the recovery of expression levels with increasing distance from necrotic areas.

Elucidating the spatial dynamics of T cell abundance and hypoxic metabolism in glioblastoma

Building upon prior findings that T cell receptor (TCR) signaling diminishes in the immediate vicinity of necrotic regions but steadily intensifies within one to two millimeters from these necrotic areas in glioblastomas, we hypothesized that the hypoxic metabolic environment found surrounding necrotic regions significantly impacts the immune landscape within these tumors. We examined the distribution of immune cells from hypoxic to non-hypoxic areas by combining spatial annotations from six glioblastomas with identified hypoxic niches. Cell abundance was determined using the cell2location algorithm, and CytoSpace was employed to enhance resolution from spot to single-cell level. Further, we integrated spatially resolved T cell sequencing 12 (SPTCR-seq) to support our hypothesis and understand the distribution of clonal and non-clonal T cells dependent on the presence of hypoxia, Fig. 5a. We began by characterizing hypoxia niches in six glioblastoma samples based on gene expression, Fig. 5b, c and Supplementary Fig. 6c. Horizontal integration of each sample’s results was facilitated by SPATA2’s incorporation of SI units. We then assessed the distribution of cell types from hypoxic to non-hypoxic areas, aggregating data by average cell abundance. Our analysis showed that gene expression related to hypoxia normalized beyond ~1000 µm from hypoxic regions (elbow at 954.3 µm), Fig. 5d. By focusing on lymphoid cells, we found that T cells peak in abundance around 1000 µm from hypoxic cores and are present up to 1500 µm (spanning from 500 to 2000 µm). Beyond 2000 µm, T cell presence diminishes, Fig. 5e. Similarly, bone-derived myeloid cells, abundant in hypoxic areas, decrease in number beyond 2000 µm, mirroring the T cell distribution, Fig. 5f.
This immune response pattern relative to distance from hypoxia was also evident in the varying abundance of mesenchymal-like (higher towards hypoxia) and NPC-like (high towards infiltration regions) malignant cell populations Figures  5f, g . Integration of T cell receptor sequencing (SPTCR-seq) data confirmed the estimated T cell abundance from cell type deconvolution, Fig.  5h, i . We then focused on the distribution of exhausted and cytotoxic CD8 T cells based on their proximity to hypoxic areas. By analyzing gene expression markers for cytotoxic (e.g., GZMA, GZMB) and exhausted (e.g., PDCD1, LAG3) CD8 T cells, we found a higher concentration of exhausted CD8 T cells near hypoxic regions, whereas cytotoxic gene expression was more prevalent in T cells further from these areas, Fig.  5j . In summary, we demonstrated the utility of annotation screening in elucidating the immune response architecture in relation to hypoxic metabolism in glioblastoma. Our findings suggest a robust spatial dependency between hypoxia and T-cell abundance shared across six glioblastoma samples contributing to the quest of understanding the intricate biological architecture of glioblastoma.

figure 5

a Overview of the workflow, encompassing spatial transcriptomic (ST) data, single-cell deconvolution, and spatial T-cell receptor sequencing (SPTCR-seq), showcasing the horizontal integration of six glioblastoma samples featuring prominent hypoxic spatial niches. b A representative Visium glioblastoma sample (UKF260T) with multiple hypoxic areas, annotated using SPATA2’s automatic annotation tool. Lines indicate the screening direction guided by the hypoxic gradient. c Presentation of single-cell deconvolution results for sample UKF260T, highlighting proximity to annotated hypoxic regions. d The inferred gradient of hypoxic gene signatures merged from data obtained from six samples. Error bands of line plots indicate the confidence interval (level: 0.95) e and f Evaluation of T-cell and anti-inflammatory bone-derived macrophage abundance as a function of distance from the hypoxic areas. Error bands of line plots indicate the confidence interval (level: 0.95). f Abundance of cell types described by Neftel et al. (2019), revealing distinct abundance patterns. g Visualization of MES-like and NPC-like cell abundance in sample UKF269T. h An illustration of a glioblastoma example integrated into the comprehensive screening, featuring a solitary hypoxic area and a separate area of T-cell abundance approximately 1 mm distant from the hypoxic region. i Gradient representation of T-cell abundance as a function of distance from hypoxia. j Gradient plot depicting various T-cell subtypes as a function of distance from hypoxia, revealing an inverse correlation between T-cell cytotoxicity and distance to hypoxia, peaking at ~1 mm.

Statistical challenges and benchmarking

Our endeavors to identify gene expression patterns related to spatial reference areas or trajectories introduce a statistical challenge: accurately identifying patterns amid the noise and variability inherent to biological data. To tackle this challenge, we employ LOESS smoothing to model gene expression data along the distance, Supplementary Fig. 8a, b, and quantify the degree of randomness in the emerging patterns using the total variation, Supplementary Fig. 8c. We validated our approach through extensive simulations wherein controlled noise levels were introduced into predefined spatial patterns. This allowed us to evaluate the efficacy of our method in distinguishing between these patterns amidst various noise types and intensities, Supplementary Figs. 9–11. In addition, we compared our method to two existing approaches designed for scRNA-seq pseudotime trajectories, namely tradeSeq 13 and PseudotimeDE 14 , 15 . SPATA2's SAS and these algorithms share a common goal of detecting differential expression patterns along a one-dimensional axis. In our simulated scenario, SAS exhibited a higher correlation with the ground-truth noise ratio, achieving a mean R² of 0.75 across different pre-defined patterns, in contrast to an R² of 0.59 for tradeSeq's Wald statistic and an R² of 0.64 for PseudotimeDE's test statistic, Fig. 6a–c. Moreover, our benchmark illustrates that SPATA2's SAS can rapidly screen 10,000 genes within a few minutes on a personal laptop (32 GB RAM, 10 cores), Supplementary Fig. 14e. This is at least an order of magnitude faster than tradeSeq and PseudotimeDE model fitting alone (run on 256 GB RAM, 48 cores; see the "Methods" section), emphasizing the substantial computational efficiency and applicability of our method for large-scale datasets.
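The core idea of the randomness metric can be sketched in a few lines: smooth the expression profile along the distance axis, then sum the absolute differences between consecutive smoothed values. The sketch below is illustrative only; a simple moving average stands in for the LOESS fit that SPATA2 actually uses, and all function names are hypothetical.

```python
# Hypothetical sketch: quantify how "random" an expression gradient is via
# the total variation of a smoothed profile. A moving average stands in for
# the LOESS smoothing used by SPATA2; names here are illustrative.

def moving_average(values, window=3):
    """Smooth a 1D profile with a centered moving average."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - window // 2)
        hi = min(n, i + window // 2 + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def total_variation(values):
    """Sum of absolute differences between consecutive points."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

# A monotone gradient accumulates little total variation after smoothing,
# whereas an oscillating profile of the same value range accumulates more.
gradient = [i / 9 for i in range(10)]       # clean ascending pattern
noisy = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]      # oscillating "random" profile

tv_clean = total_variation(moving_average(gradient))
tv_noisy = total_variation(moving_average(noisy))
```

A pattern-bearing gradient thus yields a low total variation relative to its range, while a noise-dominated profile yields a high one, which is what the screening exploits to separate signal from randomness.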

figure 6

a–c Scatterplots displaying the relationship between the test statistics used by the different algorithms and the noise ratio of the simulations. Colors indicate the type of simulated pattern hidden by noise (also see Supplementary Figs. 9b, 10b).

Lastly, while the spatial reference features—spatial annotations or spatial trajectories—can be placed in an automated manner, Supplementary Fig.  6b, c , we explored the resilience of our method against human error in case of manual placement. We evaluated the implications for spatial annotation screening (SAS) and spatial trajectory screening (STS) test performance after systematically modifying the positions or orientations of selected spatial annotations and trajectories in our dataset, Supplementary Figs.  6d–g and 12 . Our simulations indicate that both SAS and STS can be susceptible to type II errors with increasing deviation from the original annotation or trajectory placement. However, this susceptibility remained well-controlled within the expected bounds of human error, and notably, type I error rates stayed below 5% across all deviations, ensuring that random patterns are rarely incorrectly identified as non-random, even with large deviations in annotation. In conclusion, our thorough simulations and benchmarking highlight the robustness, adaptability, and computational efficiency of our approach for identifying spatial gene expression patterns along a one-dimensional gradient. This reaffirms our method as an advantageous application for hypothesis-driven spatial biology research.

Discussion

Despite the transformative potential of spatial transcriptomics, the field has encountered limitations with existing analytical methods, which often do not fully capture the continuous and intricate patterns of gene expression across diverse tissues. To address these challenges, we introduce spatial gradient screening (SGS), an algorithm designed to capture gene expression patterns along a spatial continuum. It pursues the hypothesis that specific genes—or other numeric features for that matter—display non-random expression patterns in relation to spatial reference features. When screening for spatially variable genes, we utilize these reference features to incorporate the integration of potential biological forces, such as the direction of tumorous infiltration using spatial trajectories (spatial trajectory screening, STS), Fig.  2a , or the presence of stab wounds and necrotic areas using spatial annotations (spatial annotation screening, SAS) (Figs.  3a, b and 4a, b ). STS captures changes along a-to-b trajectories, while SAS focuses on radial changes from core to periphery. Together they facilitate a comprehensive framework for both exploratory and hypothesis-driven analyses. A particularly notable finding from our work is the identification of a spatial interplay between necrotic and hypoxic zones and T-cell distribution within glioblastomas, hinting at a potentially stratified immune landscape within these tumors.

Thorough benchmarking of the evaluation metrics showed that spatial gradient screening reliably differentiates between existing patterns and randomness. However, despite their utility, both spatial trajectory screening and spatial annotation screening have limitations. The flexibility they offer necessitates a close examination of histological architecture and other potential confounding factors. Interactive placement using SPATA2's tools can introduce human error, impacting the course of a spatial trajectory or the outline of a spatial annotation and, thus, the inferred expression gradients. We conducted a thorough investigation into the sensitivity of spatial gradient screening to such variations, and our findings demonstrate that both spatial gradient screening approaches yield robust results against an expected degree of human-induced variation. Still, for precise outcomes, careful placement of spatial reference features is essential. The inclusion of confounding elements or the omission of crucial ones can significantly affect results, potentially leading to the misinterpretation of gene expression.

Our findings highlight a critical limitation of traditional DEA in capturing genes that exhibit gradual expression shifts in response to microstructural changes or metabolic gradients within complex tissues. It is important to note that, while we have identified these limitations, our intention is not to diminish the significant contributions of these traditional analysis methods. Both clustering-based and manual annotation-based approaches demonstrated their value in numerous spatial transcriptomics studies and will continue to do so. However, for a nuanced analysis of spatial expression patterns and gradients, it is evident that integrating supplementary tools is essential to achieve comprehensive interpretation. Spatial gradient screening represents our contribution to this evolving field, augmenting classical methodologies with enhanced capabilities to interpret complex spatial gene expression data. Although here we focused on brain-derived disease models, spatial gradient screening is a helpful tool in a broad variety of tissues in health and disease conditions in which spatial dependencies and heterogeneity are important players. To streamline the integration of spatial gradient screening into downstream analysis, we have incorporated it into SPATA2, an R-based framework that provides a user-friendly implementation of well-established analytical techniques. This integration encompasses the identification of spatially variable genes, facilitated by SPARKX, interactive manual annotation of barcode spots, and a suite of algorithms for tasks such as inferring copy number alterations, clustering, conducting differential expression analysis, and performing gene set enrichment analysis.

Furthermore, with regard to spatial analysis, SPATA2 fully incorporates the utilization of distance measures in SI units. This feature greatly enhances usability, allowing for the seamless integration of multiple samples. Additionally, it ensures that results remain scalable to different image resolutions, offering an intuitive framework for analysis, interpretation, and visualization. Moreover, while SPATA2's default pre-defined models, as employed in this study to guide further interpretation of identified genes, cover a substantial portion of biologically relevant patterns, Supplementary Fig. 14d, specific research scenarios may necessitate tailored models. SPATA2 offers additional functions to expand the range of models available for screening. Lastly, it is worth noting that while SPATA2 was initially developed with the 10X Visium platform in mind, it has been extended to support virtually all spatial platforms, regardless of the modality and observational unit (e.g., barcoded spots, single cells, beads). SPATA2 offers various functions to ensure compatibility with platforms established in recent years, such as Seurat 16 , 17 , Giotto 18 , Scanpy 19 , and Squidpy 20 . To assist users in adopting SPATA2, we offer user-friendly tutorials on our website. All in all, we believe that SPATA2 and the spatial gradient screening approach will be a valuable tool in the analysis of an exciting and rapidly developing field in spatial biology, that is, spatial transcriptomics.

Ethical statement

The study design, data evaluation, and imaging procedures were given clearance by the ethics committee at the University of Freiburg, as delineated in protocols 100020/09 and 472/15_160880. All methodologies were executed in compliance with the guidelines approved by the committee. Informed consent, in written form, was received from all participating subjects. The Department of Neurosurgery of the Medical Center at the University of Freiburg, Germany, was responsible for securing preoperative informed consent from all patients participating in the study. Mice were housed and handled under the German and European guidelines for the use of animals for research purposes. Experiments that included mice were approved by the institutional animal care committee and the government of Upper Bavaria (ROB-55.2-2532.Vet_02-20-158).

The R-package SPATA2

SPATA2 is an R package that offers an object-oriented programming framework centered around an S4 object, named spata2. This object serves as a container for raw expression data, processed data, and the results of various downstream analyses, such as inferring copy number alterations, clustering using multiple algorithms, differential expression analysis (DEA), gene set enrichment analysis (GSEA), and identification of spatially variable genes and histological microstructures in image analysis. The package’s name is an acronym derived from Spatial Transcriptomic Analysis, highlighting its focus on spatial transcriptomics, specifically the 10X Visium platform. However, the standard analysis pipelines for clustering, DEA, GSEA, etc., can be applied to any type of expression data, including single-cell sequencing. Note that extensive tutorials can be found at our SPATA2 website: https://themilolab.github.io/SPATA2/ .

Architecture of the S4 SPATA2 object

At the core of the SPATA2 package is the S4 SPATA2 object, which serves as a container for both data and analysis results. Its design allows the package to be compatible with a wide range of spatial biology platforms, provided that the data structure adheres to the following criteria: First, the numeric variables under analysis should correspond to molecule counts, such as RNA read counts, metabolite counts, or protein counts. Second, a clearly defined observational unit must exist to which the numeric variables can be mapped. For example, the Visium platform's observational unit consists of barcoded spots, while the SlideSeq platform utilizes barcoded beads as its observational unit. In the case of the Xenium or MERFISH platform, the observational unit is the individual cell. Given the different observational units possible, we refer to them with the umbrella term data points throughout this manuscript. Third, the observations must be equipped with x- and y-coordinates for analysis in two-dimensional space and should be equally distributed over the analyzed tissue. We provide helper functions to initiate analysis via SPATA2 for the standardized data output of the following platforms:

MERFISH: initiateSpataObjectMERFISH()

SlideSeq: initiateSpataObjectSlideSeq()

Visium: initiateSpataObjectVisium()

Xenium: initiateSpataObjectXenium()

Lastly, the flexible function initiateSpataObject() allows the user to initiate analysis from the output data of any other platform, provided that it adheres to the aforementioned requirements. Information about the platform used is stored inside the created SPATA2 object and determines which features of the package can be used (e.g. BayesSpace clustering is only compatible with the ST or Visium technique). The spatial gradient screening algorithms are compatible with all platforms.

Naming convention and families of functions

Most functions of the SPATA2 package start with a verb which indicates the family to which the function belongs. The most important families are:

add*(): Adds additional content, ranging from grouping variables from external clustering algorithms (e.g. addFeatures() ) to manually set up trajectories (e.g. addSpatialTrajectory() ).

create*(): Adds additional content by creating it, which implies either the necessity for interaction or additional computation. (E.g. createSpatialSegmentation() to create a grouping variable based on manually encircling regions and labeling the barcode spots that fall into the circle, createImageAnnotations() to annotate regions and microstructures on the histological image.)

get*(): Extracts results in the form of data.frames, lists, or vectors. (E.g. getCoordsDf() , getDeaResultsDf() , getSpatAnnBorderDf() .)

ggpLayer*(): Creates additional layers of miscellaneous aspects that can be added to corresponding plots of the ggplot2 framework via the + operator. (E.g. ggpLayerSpatAnnOutline() to add the border of previously annotated microstructures and regions to a surface plot, ggpLayerScaleBar() to add a scale bar for visual indication of the physical distance in SI units to a surface plot.)

plot*(): Plots results, usually by outputting objects of class gg from the ggplot2 package. (E.g. plotSurface() to visualize barcode spots that fall on the underlying tissue, usually colored by a variable.)

run*(): Runs algorithms implemented from external packages (e.g. runDEA() to run differential expression analysis using Seurat::FindAllMarkers() , runCNV() to run the pipeline of the infercnv package). The results are stored inside the SPATA2 object and can be extracted with corresponding get*() functions.

set*(): Sets miscellaneous content. Recommended when programming with SPATA2 to prevent bugs in case of changing architecture of the S4 object.

Implemented packages and algorithms

SPATA2 implements a variety of external algorithms and presents them in user-friendly wrapper functions. The architecture of the SPATA2 object allows the results to be conveniently stored in the object, extracted via the corresponding get*() functions (e.g. getDeaResultsDf() ), and plotted via corresponding plot*() functions (e.g. plotDeaVolcano() ).

Dimensionality reduction

Dimensional reduction is implemented in two ways. If a SPATA2 object is created using the Seurat pipeline, the embedding of the dimensional reduction (PCA, TSNE, and UMAP) is inherited from the Seurat object. If not, the dimensional reduction can be conducted from within SPATA2 using runPCA(), which implements irlba::prcomp_irlba(), runTSNE(), which implements tsne::tsne(), and runUMAP(), which implements umap::umap(). The embedding of each dimensional reduction currently stored in the SPATA2 object can be extracted via getPcaDf(), getTsneDf(), or getUmapDf() and plotted via plotPCA(), plotTSNE(), and plotUMAP().

Spatial clustering

BayesSpace clustering is implemented as a wrapper of the pipeline suggested by the R-package BayesSpace. The corresponding function in SPATA2 is called runBayesSpaceClustering() . This function is a wrapper around all functions needed to obtain cluster results based on the BayesSpace algorithm, including BayesSpace::readVisium() or alternatively asSingleCellExperiment() , BayesSpace::spatialPreprocess() , BayesSpace::qTune() , and BayesSpace::spatialCluster() . The resulting grouping variable is stored together with all other variables that do not refer to the expression of single genes in the feature data.frame of the SPATA2 object. The resulting grouping can be obtained via getFeatureDf() and can be used for downstream analysis by referring to the name of the grouping variable (chosen by the user while calling runBayesSpaceClustering() ) in the recurring arguments across and grouping_variable, as in my_spata_obj <- runDEA(object = my_spata_obj, across = "bspace_7").

Differential expression analysis (DEA)

Differential expression analysis (DEA) is implemented via the function runDEA() , which relies on Seurat::FindAllMarkers() . A temporary Seurat object is created using the counts matrix of the SPATA2 object. The Seurat object is processed according to the specifications of the user. Then, the grouping variable based on which the testing is supposed to be conducted is transferred to the meta.data of the Seurat object and to the slot @active.ident. Then the function Seurat::FindAllMarkers() is called, which outputs a data.frame that is stored in the SPATA2 object. The SPATA2 object contains a slot @dea where DEA results are stored according to the grouping variable they are based on as well as the method with which the testing is run (defaults to the default of Seurat, which is Wilcoxon rank sum testing). Using the argument across to specify the grouping variable and method_de to specify the method, results can be extracted via, e.g., getDeaDf() or getDeaGenes() , or they can be plotted via, e.g., plotDeaVolcano() , plotDeaHeatmap() , or plotDeaDotplot() .
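The default Wilcoxon rank sum test mentioned above reduces to a simple statistic. The sketch below computes the Mann-Whitney U statistic (the form of the rank sum test) from scratch; it is illustrative only and is not the implementation used by Seurat or SPATA2, which additionally handle p-value computation and large-sample corrections.

```python
# Minimal sketch of the Wilcoxon rank sum (Mann-Whitney U) statistic that
# underlies Seurat's default DEA test. Illustrative only; Seurat computes
# this internally together with p-values and multiple-testing correction.

def rank_sum_u(group_a, group_b):
    """Return the Mann-Whitney U statistic for group_a vs group_b."""
    combined = sorted(group_a + group_b)
    # Assign average ranks to ties (1-based ranks).
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w_a = sum(ranks[x] for x in group_a)      # rank sum of group A
    return w_a - len(group_a) * (len(group_a) + 1) / 2

# Complete separation between groups yields the extreme values 0 or n_a*n_b,
# which correspond to the strongest possible evidence for a difference.
u_low = rank_sum_u([1, 2, 3], [4, 5, 6])    # group A entirely below group B
u_high = rank_sum_u([4, 5, 6], [1, 2, 3])   # group A entirely above group B
```

Because the statistic depends only on ranks, it is robust to the skewed count distributions typical of expression data, which is why it is a common default for DEA.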

Gene set enrichment analysis (GSEA)

Gene set enrichment analysis (GSEA) is implemented using the hypeR package 21 , which conducts GSEA using hypergeometric testing. GSEA is conducted using the stored DEA results. Therefore, runDEA() with a specific combination of grouping variable and DEA method has to be called beforehand. GSEA results are stored in slot @dea next to the corresponding DEA results. Results can be extracted via getGseaDf() or can be plotted via plotGseaDotPlot() . Extensive tutorials about GSEA in SPATA2 can be found here: https://themilolab.github.io/SPATA2 . Gene sets can be used for gene set enrichment analysis and visualization and are stored in an extra R object, a data.frame called gsdf, which the SPATA2 object carries in slot @used_genesets. The gene set collection can be expanded by user-defined gene signatures using the function addGeneSets() . The gene set data.frame contains a collection of more than eleven thousand gene sets downloaded from https://www.gsea-msigdb.org/gsea/index.jsp . This includes gene ontology gene sets for biological processes (prefixed with BP.GO), cellular components (prefixed with CC.GO), and molecular functions (prefixed with MF.GO), as well as hallmark gene sets (prefixed with HM), BioCarta gene sets (prefixed with BC), and Reactome gene sets (prefixed with RCTM).
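The hypergeometric test behind this enrichment analysis asks: given a signature of k genes drawn from a background of N genes, how likely is an overlap of at least x genes with a gene set of size K by chance? A minimal sketch, with toy numbers rather than real gene sets, and not hypeR's implementation:

```python
# Illustrative sketch of the hypergeometric enrichment test used by hypeR.
# Toy numbers only; hypeR performs this computation internally on real
# gene set collections.
from math import comb

def hypergeom_pval(N, K, k, x):
    """P(overlap >= x) for a signature of size k, gene set of size K,
    drawn from a background of N genes."""
    total = comb(N, k)
    return sum(
        comb(K, i) * comb(N - K, k - i)
        for i in range(x, min(k, K) + 1)
    ) / total

# Background of 20 genes, gene set of 5, signature of 5, perfect overlap:
# the single most extreme configuration, hence the smallest possible p-value.
p_perfect = hypergeom_pval(N=20, K=5, k=5, x=5)   # 1 / comb(20, 5)
```

Smaller p-values thus indicate that the observed overlap between a cluster's marker genes and a curated gene set is unlikely under random sampling.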

Inferring copy number alterations (CNA)

Inferring copy number alterations (CNA) is implemented as a wrapper of the pipeline suggested by the R-package infercnv. The corresponding function in SPATA2 is called runCNV() , which is a wrapper around all functions needed to infer copy number variations. This includes: infercnv::CreateInfercnvObject() , infercnv::require_above_min_mean_expr_cutoff() , infercnv::require_above_min_cells_ref() , infercnv::normalize_counts_by_seq_depth() , infercnv::anscombe_transform() , infercnv::log2xplus1() , infercnv::apply_max_threshold_bounds() , infercnv::smooth_by_chromosome() , infercnv::center_cell_expr_across_chromosome() , infercnv::subtract_ref_expr_from_obs() , infercnv::invert_log2() , infercnv::clear_noise_via_ref_mean_sd() , infercnv::remove_outliers_norm() , infercnv::define_signif_tumor_subclusters() , and infercnv::plot_cnv(). The output of infercnv::plot_cnv() is stored under a specified directory from which the results are read back into the R session and stored together with metadata regarding the analysis in slot @cnv of the SPATA2 object. Results can be extracted via getCnvResults() and visualized via plotCnvLineplot() or plotCnvHeatmap() as well as with common plotting functions like, e.g. plotSurface(…, color_by = 'Chr7') or plotBoxplot(…, variables = 'Chr7', across = 'histology') .

Spatially variable genes

Identification of spatially variable genes using SPARKX is implemented as a wrapper around the function SPARK::sparkx() from the SPARK package. The corresponding function in SPATA2 is called runSPARKX() . The count matrix, as extracted via getCountMtr() , is provided as input for the argument count_in. The coordinates, as extracted via getCoordsMtr() , are provided as input for the argument locus_in. The results are stored in slot @spatial of the SPATA2 object, which is a list with a slot named $sparkx. The original output can be obtained via getSparkxResults() . A shortcut to obtain genes with significant spatial variability is offered by the function getSparkxGenes() , which defaults to a p-value threshold of <0.05.

Manual annotation of data points (spatial segmentation)

Manual annotation of data points is done with createSpatialSegmentation() . In the interface the function provides, the user can interactively annotate data points on the histology image based on their spatial location. While drawing on the image, the cursor's position is captured every five milliseconds, creating a detailed polygon of the annotated region. Each data point within the polygon is then labeled according to the user's designation and stored in a grouping variable in the SPATA2 object's feature data.frame. This allows for statistical testing and data subsetting based on the annotated regions. Results can be obtained via getFeatureDf() and can be used for downstream analysis by referring to the name of the created grouping variable (chosen by the user within the interface of createSpatialSegmentation() ) in the recurring arguments across and grouping_variable, as in my_spata_obj <- runDEA(object = my_spata_obj, across = "histology"). We use the term spatial segmentation to distinguish the labeling of barcoded spots from our spatial annotation approach outlined below.

Distance and area measurements in SPATA2

Distance handling and transformation in SPATA2 is based on a package-built unit system that encompasses both SI units (nanometer (nm), micrometer (um), millimeter (mm), centimeter (cm)) and pixels (px). Every histology image processed by the method is accompanied by a fiducial frame of fixed size; the 10X Visium platform, for example, uses a frame of 8 mm × 8 mm. However, the width and height of the loaded image in R may vary based on the resolution. The inbuilt system relies on the barcoded spots having a fixed center-to-center distance of 100 μm. Using that as the ground truth, the spatial coordinates of the spots, provided in pixel units for alignment with the image section, can be converted into SI units; alternatively, parameter adjustments that relate to distance measures can be converted to pixel units behind the scenes of each SPATA2 function. To simplify the transformation of pixel-based distances to SI units, SPATA2 offers several transformation functions. These functions allow the user to provide distance measures in SI units for precise adjustments of spatial parameters in algorithms such as Spatial Trajectory Screening and Image Annotation Screening, for example via the distance and binwidth parameters. Additionally, the extracted image sections can be adjusted using these parameters. This concept is applied seamlessly to measuring areas in the case of image annotations, too.

Spatial annotations

In spatial experiments, spatial annotations are pivotal for marking regions of interest. Unlike manual annotation of data points (e.g. barcoded spots, beads, or cells) through spatial segmentation, where each annotation is translated into a single label per data point, spatial annotations are distinct. Each annotation consists of at least one detailed polygon, with vertices outlining the area of interest, thereby defining its spatial boundaries. Additional polygons might be needed to outline holes within the annotation (Fig.  5a, b , vivid area). Each annotation is uniquely identified by an ID and can be enriched with tags and metadata. This information, encompassing spatial position, extent, tags, and metadata, is encapsulated in an S4 object of class SpatialAnnotation . We categorize spatial annotations into three types, each characterized by a different method of generating the outlining polygon: image annotations, numeric annotations, and group annotations. Supplementary Fig.  6a–c showcases each concept. Image annotations focus on histomorphological features discerned through visual inspection of histological tissue. Users can manually outline structures of interest using an interactive interface, adding tags and labels as needed. This concept is depicted in Supplementary Fig.  6a and is facilitated by the SPATA2::createImageAnnotations() function. Group annotations are generated automatically. Data points are grouped based on criteria like prior clustering. Then DBSCAN is applied to identify and remove spatial outliers that could disproportionately distort the outline. Lastly, the concaveman algorithm is employed to outline the filtered spots. This method is illustrated in Supplementary Fig.  6b and is implemented via the SPATA2::createGroupAnnotations() function. Numeric annotations are automatically created by binning data points according to expression values, either through k-means clustering or a manually set threshold. DBSCAN is then applied to identify and remove spatial outliers that could disproportionately distort the outline. The concaveman algorithm subsequently outlines the remaining spots, as shown in Supplementary Fig.  6c . This process is executed through the SPATA2::createNumericAnnotations() function. Beyond their use in the spatial annotation screening algorithm, the spatial properties of these annotations, such as area, center, centroid, and outline, can be leveraged for further computational analysis and the development of customized analytical approaches. The user is free to add individually obtained spatial annotations using the function SPATA2::addSpatialAnnotation() , which takes a data.frame of x- and y-coordinates that determine the vertices of the polygon outlining the area of interest.

Spatial trajectories

Spatial trajectories abstract a linear direction along which expression gradients are inferred. They can be interactively created using the function SPATA2::createSpatialTrajectories() , which allows the user to interact with the spatial sample by determining the start and end point of the trajectory via double clicks. Alternatively, the trajectory's start and end points can be programmatically determined using SPATA2::addSpatialTrajectory() . The results are stored in specific S4 objects that can be enriched with metadata. They also carry the ID with which each trajectory is identified.

Simulation of expression patterns related to spatial annotations and trajectories

In our study, we simulated expression patterns related to spatial annotations and trajectories. This process was integral to validating our spatial gradient screening methodology, to benchmarking different evaluation metrics in their capability to quantify randomness, and to benchmarking their sensitivity to deviations and variations caused by human bias. Furthermore, the simulation is repeated with every call to spatial gradient screening if the argument estimate_r2 is set to TRUE, to estimate the reliability of the results. The simulation is structured into three steps. Supplementary Figs.  9 , 10 visualize the steps for spatial annotations and spatial trajectories, respectively. The first step involves the computation of distance values for each data point (in the same way as during the inference of an expression gradient). Following the distance computation, the data points are categorized into bins based on these distance values (Supplementary Figs.  9a , 10a ). The bins are then ordered in ascending fashion according to the mean distance value of each bin and provided with an index. In the second step, expression values are assigned to each member of a distance bin. These values are extracted from a numeric vector, the length of which corresponds to the number of bins. The arrangement of these values, when plotted against their indices in the vector, represents the specific pattern intended for simulation, as displayed in Supplementary Figs.  9c, d , 10c, d . Initially, this approach results in identical expression values for all members within a distance bin. Note that for visualization purposes a higher binwidth was chosen for Supplementary Figs.  9b , 10b . The simulation process we conducted for the actual benchmarking utilized a binwidth equal to the center-to-center distance of the Visium spots, which is 100 μm. Results of this are displayed in Supplementary Figs.  9d–f , 10d–f .
The third and final step introduces noise into the simulated expression data in a controlled and systematic manner. This is achieved by creating a numeric vector drawn from a uniform distribution, equal in length to the number of data points. The range of values in this vector is aligned with the range of expression values assigned during the second step. The integration of noise with the pattern-like expression from the second step was executed in four distinct manners, each representing a different type of noise. Results of this integration are displayed in Supplementary Figs.  9d–f , 10d–f . These types included equally distributed (ED, d), where each data point's simulated and noisy expression values were scaled based on the noise ratio and then combined; equally punctuated (EP, e), where a percentage of randomly selected spots received random expression values; focally punctuated (FP, f), which differed from EP in that the spots receiving random values were not chosen randomly but were instead centered around initially selected data points, creating spatial niches of randomness; and combined, which amalgamated all three previous noise types. To ensure a comprehensive estimation, the simulations used for the benchmarking of our evaluation metrics spanned every possible combination of six pattern variations and four noise types. These were conducted at incremental noise levels, ranging from 0% to 100% with a step size of 2%, resulting in a total of 61,200 simulations. Each simulation was uniquely named following a specific syntax: SE.<pattern>.<noise type>.<noise percentage>.<iteration>. This naming convention facilitated detailed tracking and analysis of each simulated iteration. We placed particular emphasis on the equally distributed and combined noise types, as they closely resemble patterns likely to be encountered in real-world data. While the equally punctuated and focally punctuated noise patterns are not commonly encountered in real-life scenarios, their inclusion was crucial for a thorough evaluation of our algorithm.
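The "equally distributed" integration step can be sketched concisely: the clean pattern and a uniform random vector spanning the same value range are linearly mixed according to the noise ratio. The sketch below is illustrative; the actual simulation runs inside SPATA2 and all names here are hypothetical.

```python
# Sketch of the "equally distributed" (ED) noise integration used in the
# simulations: the clean pattern and uniform noise with a matching value
# range are blended according to the noise ratio. Illustrative only.
import random

def integrate_ed_noise(pattern, noise_ratio, seed=0):
    """Blend a simulated pattern with uniform noise at the given ratio
    (0.0 = pure pattern, 1.0 = pure noise)."""
    rng = random.Random(seed)
    lo, hi = min(pattern), max(pattern)
    # Noise values span the same range as the assigned expression values.
    noise = [rng.uniform(lo, hi) for _ in pattern]
    return [
        (1 - noise_ratio) * p + noise_ratio * n
        for p, n in zip(pattern, noise)
    ]

ascending = [i / 9 for i in range(10)]             # clean ascending gradient
pure_pattern = integrate_ed_noise(ascending, 0.0)  # 0% noise: unchanged
pure_noise = integrate_ed_noise(ascending, 1.0)    # 100% noise: random only
```

Sweeping the noise ratio from 0 to 1 in small steps, as done in the benchmark, yields a family of profiles whose known noise level can then be compared against each evaluation metric.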

Inference of gene expression gradients and screening

SPATA2 introduces two group-independent algorithms that allow the user to identify and visualize genes whose expression stands in meaningful relation to regions or microstructures identified by image analysis. Gene expression gradients can be inferred along spatial trajectories with Spatial Trajectory Screening or in spatial relation to image annotations with Image Annotation Screening. First, this section explains how expression gradients are inferred along spatial trajectories. Second, it explains how expression gradients are inferred as a function of distance to image annotations. Third, it explains how the screening for specific gradients is conducted by fitting inferred expression gradients to predefined models and how their fit is evaluated.

Inferring an expression gradient

By inferring an expression gradient, we mean capturing how the expression levels of a particular gene change in relation to a spatial feature, such as along a trajectory or depending on the distance to a spatial annotation’s outline. This encompasses three substeps, which are displayed in Supplementary Fig. 8a, b for spatial annotations (a) and spatial trajectories (b). (Please refer to Supplementary Fig. 7 for a visual glossary of the terms used throughout the method section with regard to spatial annotations, trajectories, and the screening in general.) First, we calculate the distance of each data point to the relevant spatial feature. In the case of spatial trajectories, all data points within the trajectory frame are projected onto the trajectory: let T be the vector that connects the trajectory’s origin to its endpoint and C the vector that connects the origin to the barcode spot of interest. Then, C is projected onto T such that the projection P corresponds to:
P = ((C · T) / (T · T)) · T
The magnitude of the vector P corresponds to the projection length (PL) and the projection length in turn corresponds to the distance along the trajectory:
PL = |P|
In the case of spatial annotations, each data point is projected to its closest vertex of the polygon forming the outline of the spatial annotation, and the distance is computed 22 . After obtaining the distance values, the gene’s expression levels of each data point are related to the corresponding distance.
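The trajectory projection above is standard vector projection; a minimal sketch (names follow the text, not the SPATA2 implementation):

```python
import numpy as np

def projection_length(origin, endpoint, point):
    """Distance of a data point along a trajectory via vector projection.

    T connects the trajectory's origin to its endpoint; C connects the
    origin to the data point. P is the projection of C onto T, and its
    magnitude (PL) is the distance along the trajectory.
    """
    T = np.asarray(endpoint, float) - np.asarray(origin, float)
    C = np.asarray(point, float) - np.asarray(origin, float)
    P = (np.dot(C, T) / np.dot(T, T)) * T   # projection of C onto T
    return float(np.linalg.norm(P))          # PL = |P|

# a point directly above the midpoint of a horizontal trajectory
pl = projection_length((0, 0), (4, 0), (2, 3))
```

For the point above the midpoint, the perpendicular offset is discarded and only the 2-unit distance along the trajectory remains.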

Then, locally weighted scatterplot smoothing (LOWESS or LOESS) is used to fit a curve that approximates the changes in gene expression along this distance. The α parameter for this LOESS fit, which determines the degree of smoothing, is standardized and calculated as follows:
α = Resolution / (Distance × CF)
Resolution defaults to the average minimal center-to-center distance (CCD) of the data points. In the case of regularly spaced data points, as with Visium’s barcoded spots, this value is given (100 μm). For irregularly scattered data points, as is the case for single-cell-based platforms, this value is computed.

Distance indicates the total distance covered in the screening.

CF is the correction factor, computed as the proportion of available data points relative to the total number of data points required to call the data set complete (see Supplementary Fig. 7c, d).

The resolution should not exceed the CCD. Generally speaking, the higher the resolution, the more reliable the results; however, the resolution cannot be increased infinitely, since this runs into errors with the LOESS fitting. In practice, we found that a resolution between the CCD and half of the CCD provides good results. The reasoning behind using a correction factor is as follows: due to platform limitations, tissue morphology, and data quality, the data necessary for a complete screening is often only partially available. For instance, Supplementary Fig. 7a–d illustrates a screening setup that references the largest of the three necrotic areas and includes the environment up to a distance of 3 mm. Supplementary Fig. 7d demonstrates how both the tissue’s edge and the capture area of the Visium platform can impose constraints on data completeness. Consequently, the dataset represents only a subset of the data points needed to fully address the hypothesis. In this example, only about 42% of the necessary data points are available to consider the dataset complete in the context of our hypothesis. (If two or more annotations are used in the screening, as displayed in Fig. 5b, the correction factor is computed according to the data requirements of all annotations.) A similar problem arises with spatial trajectories and the width of the screening area. If the α parameter of the LOESS fit depended only on the distance screened and the platform resolution, it could run into errors if the number of data points is too small due to incompleteness. Dividing the α parameter by the proportion of available data points results in a higher α proportional to the incompleteness of the data set. If all required points are available, the proportion is 1 and the α parameter stays as is.
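The standardized span computation described above can be sketched as follows. Note that the exact arrangement (resolution divided by the distance screened, then divided by the correction factor CF) is paraphrased from the text; the function name and formula layout are assumptions, not the SPATA2 implementation:

```python
def loess_alpha(resolution, distance, cf):
    """Standardized LOESS smoothing parameter, as described in the text.

    resolution: spatial screening resolution (e.g. 0.1 mm for Visium).
    distance:   total distance covered in the screening (e.g. 3 mm).
    cf:         correction factor, proportion of available data points (0, 1].
    Dividing by cf increases alpha (more smoothing) for incomplete data;
    with cf = 1 the parameter stays as is.
    """
    if not 0 < cf <= 1:
        raise ValueError("CF must lie in (0, 1]")
    return resolution / (distance * cf)

a_complete = loess_alpha(0.1, 3.0, 1.0)    # complete data set
a_partial = loess_alpha(0.1, 3.0, 0.42)    # only 42% of required points
```

With only 42% of the required points available, the span is correspondingly larger, trading spatial detail for a stable fit.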

In summary, as the resolution of the platform increases or the distance screened increases, the fit allows more detail. The more incomplete the data set, the less detail it allows. The more detail it allows, the better the screening can differentiate between random and non-random gradients, as indicated by the R2 between total variation and noise percentage, which is estimated before every screening setup if the argument estimate_R2 is set to TRUE .

With inferred gradient, we refer to a numeric vector of expression estimates capturing the resulting pattern of the fitted curve. To obtain it, we utilize the LOESS model for gene expression estimation along the distance via stats::predict() . The number and position of the expression estimates are computed by averaging the distance values of all data points within predefined distance bins, using a binwidth that corresponds to the spatial screening resolution. For example, if the distance screened is 3 mm and the resolution is 0.1 mm, 30 expression estimates form the gradient, approximately starting at a distance of 0.05 mm and ending at 2.95 mm. Finally, the vector of expression estimates is standardized to a defined range (we use 0–1, equivalent to ‘low’ to ‘high’). When plotted against their corresponding distance values, the expression estimates form a polygonal curve representing the pattern of the inferred gradient.
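The binning and 0–1 standardization step can be sketched as below. For simplicity, a plain bin mean stands in for the stats::predict() LOESS estimate, and all names are illustrative:

```python
import numpy as np

def inferred_gradient(distance, fitted, total_distance, resolution):
    """Bin fitted values at the screening resolution and standardize to 0-1.

    distance:       distance of each data point to the spatial feature.
    fitted:         fitted expression value of each data point.
    total_distance: total distance screened (e.g. 3.0 mm).
    resolution:     spatial screening resolution (e.g. 0.1 mm).
    """
    n_bins = int(round(total_distance / resolution))
    edges = np.linspace(0.0, total_distance, n_bins + 1)
    idx = np.clip(np.digitize(distance, edges) - 1, 0, n_bins - 1)
    # one expression estimate per distance bin
    est = np.array([fitted[idx == b].mean() for b in range(n_bins)])
    # standardize to the 0-1 ('low' to 'high') range
    return (est - est.min()) / (est.max() - est.min())

dist = np.linspace(0.0, 3.0, 300)                    # distances up to 3 mm
grad = inferred_gradient(dist, np.exp(-dist), 3.0, 0.1)
```

With a 3 mm distance and 0.1 mm resolution this yields the 30 expression estimates mentioned in the text; a decaying input produces a smoothly descending gradient.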

Identification of non-random gradients

The second step in spatial gradient screening is focused on the identification of genes whose expression gradients exhibit a pattern that is unlikely due to randomness. Depending on the algorithm used, spatial annotation screening (SAS) or spatial trajectory screening (STS), for every inferred gradient we posit the null hypothesis:

Null Hypothesis (H0): The expression pattern of the tested gene does not show spatial significance in relation to specific spatial references, such as delineated areas or spatial trajectories, and is attributable to random chance rather than being influenced by proximity to defined areas within the tissue.

Correspondingly, for every inferred gradient we formulate the alternative hypothesis:

Alternative Hypothesis (H1): The expression pattern of the tested gene exhibits spatial significance when analyzed in relation to specific spatial references, such as delineated areas or spatial trajectories. It forms a recognizable pattern that statistically distinguishes it from genes whose expression is not influenced by proximity to these areas, suggesting a biologically meaningful connection between the gene and the reference feature.

Representative examples of either hypothesis, as well as of either screening approach, are displayed in Supplementary Fig. 8a, b. To be able to adopt either of these hypotheses, we posit that, if the inferred gradient in question stems from a gene whose expression gradient depends on the spatial trajectory or the spatial annotation, it should not merely consist of randomly scattered expression values. Instead, it should display a discernible degree of gradual expression change forming a recognizable pattern. Thus, the smoother the inferred gradient, the less random it is, and vice versa. Assuming this relationship, we quantify the degree of randomness of a gradient using its total variation (TV). This metric is calculated by considering the absolute differences between adjacent expression estimates, as displayed in Supplementary Fig. 8c. The total variation is calculated according to the formula:

TV = Σ | y i+1 − y i | , summed over i = 1, …, n − 1

where:

TV is the total variation.

n is the number of expression estimates in the gradient.

y i represents the gene expression value at expression estimate i .

| y i+1 − y i | calculates the absolute difference between the gene expression values of adjacent expression estimates.
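The total variation of a gradient, as defined above, is straightforward to compute; a minimal sketch:

```python
import numpy as np

def total_variation(gradient):
    """Sum of absolute differences between adjacent expression estimates."""
    g = np.asarray(gradient, float)
    return float(np.sum(np.abs(np.diff(g))))

# a smooth monotone gradient vs. an alternating, maximally rough one
smooth = total_variation(np.linspace(0.0, 1.0, 30))
rough = total_variation(np.tile([0.0, 1.0], 15))
```

A smooth 0-to-1 gradient has TV = 1, while an alternating gradient of the same length accumulates a far larger TV, illustrating how smoothness maps to low total variation.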

To assess the effectiveness of the total variation in capturing the percentage of noise or randomness introduced into a gradient, we leveraged the simulated dataset discussed above, where a certain degree of noise was introduced per simulation, represented by randomly generated expression values. Each resulting simulation was a combination of pattern-specific expression (Supplementary Figs. 9c, 10c) and randomly generated expression, based on different noise types (Supplementary Figs. 9d–f, 10d–f). We examined the relationship between the total variation and the degree of randomness across different underlying patterns and various types of noise. Our findings consistently demonstrated a strong linear relationship between the total variation and noise ratio across all simulation modalities using a resolution of 100 μm. The high R2 values of 0.77–0.92 and 0.78–0.91 for the equally distributed and combined noise types (which we consider the most realistic manifestations of noise in real-life data) indicate that this metric effectively quantifies the degree of randomness.

Using estimate_r2 = TRUE , the spatial gradient screening algorithm conducts the simulation with the sample and the setup chosen by the user, which estimates the reliability of the evaluation metrics in terms of their capability to account for randomness and noise. We recommend doing this, since we noted that the resulting R 2 varies depending on the distance screened, the resolution, and the number of data points available. Generally speaking, we found that R 2 increased with increasing resolution. However, the resolution cannot be increased infinitely, since the data points available become too sparse for the LOESS to fit a curve to the expression vs. distance plot. We recommend choosing a resolution equal to or lower than the center-to-center distance of the data set, which can be obtained via SPATA2::getCCD() .

Calculation of p -values in spatial gradient screening

In our study, we used total variation (TV) as a crucial metric to determine the randomness in an inferred gradient. We define the p -value for the hypotheses presented in the previous section as the likelihood of observing a TV as low as or lower than that of gradients observed under random conditions. Thus, to calculate a p -value for a given inferred gradient, we first simulate random expression gradients by assigning random expression values to all data points within the screened area and then infer the resulting expression gradient as described above. This simulation is repeated 10,000 times to build a robust distribution of TV scores under random conditions. To further ensure that this distribution is robust, potential outliers are removed by calculating the interquartile range (IQR) and excluding TV scores that lie significantly outside the IQR, above the third or below the first quartile. Using the resulting distribution of randomly generated total variation values, we calculate the p -value according to the following formula:

p -value = (1/n) · Σ I( rTV i ≤ oTV ) , summed over i = 1, …, n

where:

p -value is the p -value.

rTV i represents the total variation for the i th simulated gradient under complete randomness.

oTV is the total variation of the observed inferred gradient from the gene of interest.

n is the number of random simulations (with a default of 10,000), minus the number of outlier TV scores removed by the IQR filter.

I (·) is the indicator function that equals 1 if the condition inside the parentheses is true and 0 otherwise.

Supplementary Fig. 11c, f provides a comprehensive visualization of the relationship between noise levels and p -values across all simulation modalities. In the case of equally distributed noise, we observed a consistent and stable relationship between p -values and noise ratios regardless of the underlying pattern, highlighting that the mentioned differences in TV baseline across simulated patterns did not have an effect. The resulting p -values are adjusted according to the Benjamini–Hochberg approach and returned in a separate column called fdr (false discovery rate). We recommend using the adjusted p -values with a threshold of 0.05.

Identification of biologically relevant patterns

The final step of our methodology is centered on identifying non-random gradients that are indicative of biologically relevant patterns. In this process, each gradient inferred from the data is systematically compared against a series of predefined models. These models are numeric vectors that match the length and range of the inferred gradients and are designed to represent simplified versions of biologically relevant dynamics. Expression patterns can be complex and vary depending on the specific research questions. To address this variability, our approach not only supports the integration of user-defined models but also includes a basic set of models, as shown in Supplementary Fig.  14d . These standard models are developed to simplify complex gene expression patterns into three primary types, providing a practical and comprehensive framework for analyzing diverse biological data. This strategy ensures that our methodology is both versatile and grounded, capable of accommodating different research requirements while offering a solid base for the interpretation of gene expression patterns in relation to spatial features. The three patterns we provide standardized models for are:

Association pattern: Higher gene expression near the annotation, decreasing with distance, indicative of an association. This is exemplified by the hypoxic gene signatures near the necrotic area (descending models, Fig. 4d ).

Recovery pattern: Lower expression near the annotation, increasing with distance, suggesting recovery. An example is the increase in oxygen-based metabolism away from necrotic areas (ascending models, Fig. 4g ).

Layered pattern: Transient increase in expression at a certain distance, forming a layer-like organization (peaking models, Fig. 2b ).
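Toy versions of the three standardized model families (descending for association, ascending for recovery, peaking for layered patterns) can be generated as simple 0–1 vectors matching the gradient length. The shapes below are illustrative assumptions; SPATA2 ships its own model set:

```python
import numpy as np

def descending_model(n):
    """Association pattern: high near the annotation, decreasing with distance."""
    return np.linspace(1.0, 0.0, n)

def ascending_model(n):
    """Recovery pattern: low near the annotation, increasing with distance."""
    return np.linspace(0.0, 1.0, n)

def peaking_model(n, center=0.5, width=0.15):
    """Layered pattern: transient increase at a given relative distance."""
    x = np.linspace(0.0, 1.0, n)
    peak = np.exp(-0.5 * ((x - center) / width) ** 2)   # Gaussian bump
    return (peak - peak.min()) / (peak.max() - peak.min())

asc = ascending_model(30)
desc = descending_model(30)
pk = peaking_model(31)   # odd length so the peak sits exactly at the center
```

Each model is a numeric vector on the 0–1 range, directly comparable to a standardized inferred gradient of the same length.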

Predefined models provide precision in addressing specific research questions, offering an intuitive framework for embedding findings in biological contexts (Supplementary Fig. 8d). The goodness of the fit between each gradient ( G ) and model ( M ) can be evaluated using two metrics: mean absolute error (MAE) and root mean squared error (RMSE). MAE evaluates the average absolute deviation, is robust to outliers, and offers a straightforward interpretation. It is computed according to the formula:

MAE = (1/n) · Σ | G i − M i | , summed over i = 1, …, n

RMSE emphasizes larger errors and is sensitive to outliers. It is computed according to the formula:

RMSE = √( (1/n) · Σ ( G i − M i )² ) , summed over i = 1, …, n
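Both goodness-of-fit metrics between a gradient G and a model M are one-liners; a minimal sketch:

```python
import numpy as np

def mae(g, m):
    """Mean absolute error between gradient g and model m."""
    return float(np.mean(np.abs(np.asarray(g, float) - np.asarray(m, float))))

def rmse(g, m):
    """Root mean squared error; penalizes large deviations more strongly."""
    return float(np.sqrt(np.mean((np.asarray(g, float) - np.asarray(m, float)) ** 2)))

g = np.array([0.0, 0.5, 1.0])   # toy inferred gradient
m = np.array([0.0, 0.0, 1.0])   # toy model vector
```

Since RMSE squares the deviations before averaging, it is always at least as large as MAE for the same gradient-model pair, which is why it reacts more strongly to single large errors.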

Identification of zero-inflated variables

Given the limitations of the spatial gradient screening algorithm in handling variables with a high proportion of zero values, we advise pre-emptively removing such variables. To facilitate this, we have incorporated an option within the algorithm, which can be activated by setting the parameter rm_zero_infl to TRUE. When this option is enabled, each variable considered for screening undergoes an outlier detection process. This process involves calculating the Interquartile Range (IQR) and excluding spots that fall significantly outside 1.5 times the IQR, either above the third quartile or below the first quartile. Should the process result in the retention of only zero values, indicating that all non-zero spots are outliers, the variable is deemed zero-inflated and subsequently removed from consideration. This approach helps mitigate the algorithm’s sensitivity to zero-inflated variables, ensuring more robust and reliable screening outcomes.
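The rm_zero_infl check described above can be sketched as follows. The logic (IQR-based outlier exclusion, then testing whether only zeros remain) follows the text; the function name and details are illustrative, not the SPATA2 code:

```python
import numpy as np

def is_zero_inflated(values, factor=1.5):
    """Flag a variable as zero-inflated per the screening pre-filter.

    Values beyond factor * IQR above Q3 or below Q1 are treated as outliers.
    If only zero values remain after exclusion, all non-zero spots were
    outliers and the variable is deemed zero-inflated.
    """
    v = np.asarray(values, float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    kept = v[(v >= q1 - factor * iqr) & (v <= q3 + factor * iqr)]
    return bool(np.all(kept == 0))

sparse = np.array([0.0] * 97 + [8.0, 9.0, 10.0])   # few extreme non-zeros
dense = np.linspace(0.0, 1.0, 100)                  # smoothly varying variable
```

For the sparse variable the IQR collapses to zero, so every non-zero spot is an outlier and the variable is dropped; the smoothly varying variable passes the filter.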

Sensitivity to human bias in spatial gradient screening

Both spatial reference features, spatial trajectories and spatial annotations, can be generated either computationally or manually through user interaction. While interactive creation enables direct tissue interaction, it introduces the potential for human error, leading to variations in outlining spatial areas in the case of image annotations or in drawing spatial trajectories. To assess susceptibility to human bias in both spatial annotation and spatial trajectory screening, we conducted a comprehensive investigation. In both approaches, we identified potential sources of human error and simulated deviations from originally created annotations or trajectories with increasing degrees of variation. Subsequently, we generated a ground truth dataset of expression variables using our simulated expression dataset, created using either the original trajectory or the original spatial annotation. The positive (non-random) ground truth consisted of a subset of 2400 simulated expression variables with an adjusted p -value (FDR) of 0. Conversely, the negative (random) ground truth was defined as a subset of expression variables with a noise percentage of 100%. Supplementary Figs. 6g, 12a present the resulting distribution of total variation values for each population. These simulations were developed using the original spatial features. We then conducted spatial annotation and spatial trajectory screening with annotations and trajectories that deviated from the original ones in various ways. Subsequently, we compared the genes identified as random and non-random in these runs with the original population and quantified the ratio of false positives and false negatives. False positives were defined as randomly simulated expression variables incorrectly identified as non-random due to the introduced deviations in spatial features. Conversely, false negatives were defined as non-randomly simulated expression variables falsely identified as random due to the introduced deviations in spatial features.

Sensitivity to human error using spatial annotations

To introduce increasing variation into spatial annotations, we systematically added noise to the spatial annotation outlines and progressively rotated them by increasing degrees. The degree of introduced variation was quantified by measuring the deviation from the original outline and assessing their overlap. Supplementary Fig.  6d displays representative examples along with their quantified degree of deviation from the original outline, which is displayed in black, and their respective IDs. The IDs are used to map their screening results in Supplementary Figs.  6e, f . Notably, the percentage of false positives remains consistently low, staying close to 0%. However, the percentage of non-random expression variables missed due to the introduced variation increases linearly with the degree of deviation, with some outliers. Observing outliers along this linear increase suggests that not only the degree of deviation but also the nature and shape of the overlap contribute to decreased test performance measures. Most importantly, however, the increase in false negatives only became apparent when the outline deviation exceeded 20%, highlighting the robustness of spatial annotation screening via image annotations against human-introduced bias. Supplementary Fig.  6e illustrates a representative example with a 13% deviation from the original outline.

Sensitivity to variations using spatial trajectories

The creation of spatial trajectories can introduce variations in terms of their start and endpoint, as well as variations in the angle at which they deviate from the original or optimal placement. To simulate variations in the start and endpoint, we generated trajectories and randomly displaced the start and endpoint of each trajectory. Afterward, we measured the introduced deviation by calculating the resulting length of the new trajectory and subtracting it from the length of the original trajectory. Representative examples are presented in Supplementary Fig. 12b. Supplementary Fig. 12c displays the screening results compared to the introduced deviation in length, demonstrating that the percentage of false positives remains consistently low, approaching 0%, regardless of the introduced variation. It also indicates that variations in the start and endpoint do lead to non-random expression variables being missed (false negatives), with the proportion increasing linearly. However, the increase in false negatives only becomes noticeable when the deviation exceeds 0.75 mm (the original trajectory is approximately 5.75 mm in length), surpassing the 5% threshold. This suggests that there is a sufficient margin for variations in the start and endpoint. To simulate variations in degree, we generated deviating trajectories by displacing the endpoint along a vector perpendicular to the course of the original trajectory, resulting in trajectories with deviations of up to 25°. Supplementary Fig. 6d provides representative examples of these deviations. The test performance under the integrated deviations is displayed in Supplementary Fig. 6e. This figure demonstrates that the number of false positives also remains unaffected by variations introduced in this manner. Additionally, it reveals that the number of false negatives remains consistently low, only beginning to increase when the deviation reaches 15°.
Based on the examples shown in Supplementary Fig. 12d, we conclude that the range of realistic variations introduced by human error does not exceed this threshold and that spatial trajectory screening is robust in this regard. Finally, we assessed the impact of the size of the screening area of spatial trajectories, defined as a rectangle formed by multiplying the trajectory’s length with a parameter referred to as trajectory width. By default, the trajectory width equals the trajectory length, resulting in squares encompassing as many data points as possible. In cases where specific areas within this square might confound the screening process, one can either remove data points from these areas individually or reduce the width (Supplementary Fig. 7e–i). Our investigation revealed that reducing the width had no discernible effect on the number of false positives, which consistently remained low. Furthermore, the impact on the number of false negatives was negligible (Supplementary Fig. 12f, g). In conclusion, we found that spatial trajectory screening remains robust against variations introduced by human error.

Comparison with test statistics derived from scRNA-seq pseudotime methods

Given that pseudotime-dependent gene expression identification methods and spatial gradient screening both aim to detect non-random patterns along a one-dimensional axis (distance or pseudotime), we conducted a comparative analysis of our total variation (TV) test statistic, as illustrated in Supplementary Fig.  8c , against well-known metrics from pseudotime-centric algorithms—specifically, the waldStatistic from tradeSeq and the log-likelihood from PseudotimeDE. To evaluate the correlation between each method’s test statistic and the degree of noise obscuring the ground truth pattern in our simulated expression dataset, we selected a subset of 10,000 simulated genes from a ‘combined’ noise type. We analyzed their gradient in relation to the distance from a ‘necrotic_center’ spatial annotation, considering barcode spots as cells and utilizing distances up to 3 mm—the scenario for which the dataset was simulated.

For this analysis, distances to the annotation, retrieved via SPATA2::getCoordsDfSA(), substituted the pseudotime variable. Spots beyond the scrutinized region, the environment, were excluded. Following guidelines from the tradeSeq ( https://statomics.github.io/tradeSeq/articles/tradeSeq.html ) and PseudotimeDE ( https://htmlpreview.github.io/?https://rpubs.com/dongyuansong/842884 ) tutorials, we fitted a generalized additive model (GAM) for each gene. For tradeSeq, we chose nknots=7 for tradeSeq::fitGAM() based on visual inspection of the Akaike Information Criterion (AIC). Due to computational limitations, a random sample of 1000 genes was analyzed for PseudotimeDE, using n  = 100 subsamples for PseudotimeDE::runPseudotimeDE() , as analyzing 10,000 genes would surpass 24 h on a system with 256 GB RAM and 48 cores. We then calculated the linear correlation between the introduced noise in these simulated genes and the derived metrics from both tradeSeq (waldStat) and PseudotimeDE (test.statistics), alongside our total variation measure from SPATA2::spatialAnnotationScreening() (column tot_var). This correlation was assessed using the square of the base R cor() function, as depicted in Fig. 6b, c, to gauge the linear relationship between the noise level and the test statistics.

Benchmarking computational efficiency

To measure runtime, we used bench::mark() on a subset of sample #UKF275_T_P, annotated for its hypoxic core, Supplementary Fig.  6c . Subsets included all combinations of randomly selected spots (35, 350, 3500) and genes (10, 100, 1000, 10,000), generated using the base R sample function. Each subset underwent 15 iterations. Benchmarks were run on a MacBook Pro with 32 GB RAM and 10 cores. Results are displayed in Fig.  6d .

Cross-platform compatibility

SPATA2 provides converter functions that seamlessly convert S4 objects of class spata2 to S4 objects from platforms such as Giotto or Seurat and vice versa, as well as to the AnnData format for compatibility with platforms relying on Python. These functions come with the prefix as*() , e.g. asGiotto() , asSeurat() , asSingleCellExperiment() , asAnnData() .

Data acquisition and processing of glioblastoma samples

The raw data for both samples, #UKF269 and #UKF313, were obtained from the online database of our 10X Visium platform using the SPATAData R package. SPATAData provides an interface with the SPATAData::launchSpataData() function that allows access to all 10X Visium samples used in previous publications by the Microenvironment and Immunology Research Group Freiburg. Currently, this collection comprises 32 samples, including 25 malignancies of the central nervous system (CNS), such as #UKF269T and #UKF313T (for detailed information on how these datasets were generated, see Ravi et al. 4 ). The raw data sets were downloaded using the SPATAData::downloadRawData() function. Data processing, clustering, and DEA were conducted in the same way for both samples, as described below. From the downloaded data sets, we used the initiateSpataObject_10X() function to create both SPATA2 objects. For data normalization, we used Seurat::SCTransform() with the default parameter setup of the function, as illustrated in the tutorial here: https://satijalab.org/seurat/articles/sctransform_vignette.html . As described previously, the count matrix, as well as the scaled matrix and dimensional reductions, are inherited by the respective SPATA2 object as outputted by initiateSpataObject_10X() , which is a direct implementation of the default pipeline suggested by Stuart and Butler et al. 2019.

Data acquisition and processing of mouse brain sample #MCI_LMU

Mouse experiments

Operations were performed on 9–11-week-old C57Bl6/J male mice, housed and handled under the German and European guidelines for the use of animals for research purposes. Experiments were approved by the institutional animal care committee and the government of Upper Bavaria (ROB-55.2-2532.Vet_02-20-158). The anesthetized animals received bilateral stab wound lesions in the cerebral cortex by inserting a thin knife into the cortical parenchyma using the following coordinates from Bregma: RC: −1.2; ML: 1–1.2; and from Dura: DV: −0.6 mm. To produce stab lesions, the knife was moved 1 mm back and forth along the anteroposterior axis from −1.2 to −2.2 mm. Animals were euthanized 3 days post-injury (dpi) by cervical dislocation 10 .

Visium experiments

A mouse brain was embedded and snap-frozen in an isopentane and liquid nitrogen bath as recommended by 10x Genomics (Protocol: CG000240). During cryosectioning (Thermo Scientific CryoStar NX50), the brain was resected to generate a smaller sample, and two 10 μm-thick coronal sections of the dorsal brain area were collected in one capture area. The tissue was stained using H&E staining and imaged with the Carl Zeiss Axio Imager.M2m microscope using a ×10 objective (Protocol: CG0001600). The sequencing library was prepared with the Visium Spatial Gene Expression Reagent Kit (CG000239) using an 18 min permeabilization time. An Illumina paired-end flow cell was used for sequencing on a HiSeq 1500 following the manufacturer’s protocol, to a sequencing depth of 75,398 mean reads per spot. Sequencing was performed in the Laboratory for Functional Genome Analysis of the LMU in Munich.

scRNA-seq experiments

The lesioned grey matter of the somatosensory cortex from three C57BL/6J mice at 3 dpi was isolated using a biopsy punch ( ∅ 0.25 cm), and the cortical cells were dissociated using the Papain Dissociation System (Worthington, # LK003153) followed by the Dead Cell Removal kit (Miltenyi Biotec # 130-090-101), according to the manufacturer’s instructions. Incubation with the dissociating enzyme was performed for 60 min. Single-cell suspensions were resuspended in 1x PBS with 0.04% BSA and processed using the Single-Cell 3’ Reagent Kit v2 from 10x Genomics according to the manufacturer’s instructions. In brief, this included the generation of single-cell gel beads in emulsion (GEMs), post-GEM-RT cleanup, cDNA amplification, and library construction. Illumina sequencing libraries were sequenced on a HiSeq 4000 following the manufacturer’s protocol, to a mean depth of 30,000 reads per cell. Sequencing was performed in the genome analysis center of the Helmholtz Center Munich.

scRNA-seq data analysis

Read processing was performed using 10X Genomics Cell Ranger (v3.0.2). After barcode assignment and UMI quantification, reads were aligned to the mouse reference genome mm10 (GENCODE vM23/Ensembl 98; 2020A from 10xGenomics). Further processing was performed using Scanpy 1 (v1.9.1). Cells were excluded if they had ≤300 or ≥6000 unique genes, or ≥20% mitochondrial gene counts. The count matrix was normalized ( sc.pp.normalize_total ) and log( x  + 1)-transformed ( sc.pp.log1p ), before proceeding with dimensionality reduction and clustering ( sc.tl.pca , sc.pp.neighbors with n_pcs=20 , sc.tl.umap , sc.tl.leiden with resolution=0.6 ). Cell types were manually annotated using known marker genes (‘ECs’: [‘Cldn5’, ‘Pecam1’], ‘Mural cells’: [‘Vtn’, ‘Pdgfrb’, ‘Acta2’, ‘Myocd’], ‘Fibroblasts’: [‘Dcn’, ‘Col6a1’, ‘Col3a1’], ‘Oligodendrocytes’: [‘Mbp’, ‘Enpp2’], ‘OPCs’: [‘Cspg4’, ‘Pdgfra’], ‘Neurons’: [‘Rbfox3’, ‘Tubb3’], ‘Astrocytes’: [‘Aqp4’, ‘Aldoc’], ‘Microglia’: [‘Aif1’, ‘Tmem119’], ‘Monocytes/Macrophages’: [‘Cd14’, ‘Itgb2’, ‘Cd86’, ‘Adgre1’], ‘B cells’: [‘Cd19’], ‘T/NK cells’: [‘Cd3e’, ‘Il2rb’, ‘Lat’], ‘Neutrophils’: [‘S100a9’]).

Visium data analysis

Read processing was performed using 10x Genomics Space Ranger (v1.2.2). After barcode assignment and UMI quantification, reads were aligned to the mouse reference genome mm10 (GENCODE vM23/Ensembl 98; 2020A from 10xGenomics). Scanpy (v1.9.1) was used for further processing of the Visium dataset. Barcode spots with <400 counts were excluded ( sc.pp.filter_cells ). The count matrix was normalized ( sc.pp.normalize_total ) and log( x  + 1)-transformed ( sc.pp.log1p ), before proceeding with dimensionality reduction and clustering of barcode spots ( sc.tl.pca with n_comps = 40, sc.pp.neighbors , sc.tl.leiden ). Clusters were annotated based on histology and known locations of the injury sites. The full-sized Space Ranger input image (7671×7671 px) was used to segment nuclei using Squidpy and Cellpose 23 via sq.im.segment with method = cellpose_he and flow_threshold = 0.8 (as suggested in https://squidpy.readthedocs.io/en/stable/external_tutorials/tutorial_cellpose_segmentation.html ). Next, Tangram was used to integrate the scRNA-seq and Visium datasets by providing a cell type probability score per barcode spot, based on spatial correlation of genes shared by the datasets 2 . This probability score was used to deconvolve the Visium dataset by assigning each segmented nucleus a most likely cell type (using tg.pp_adatas with 1238 overlapping training genes from the top 125 marker genes of each single-cell cluster, tg.map_cells_to_space , tg.project_cell_annotations , tg.create_segment_cell_df , tg.count_cell_annotations , tg.deconvolve_cell_annotations ; 4,356 nuclei assigned in total; as suggested in https://squidpy.readthedocs.io/en/stable/external_tutorials/tutorial_tangram.html ). Processed h5ad files were imported to SPATA2 by first loading them into R via anndata::read_h5ad() and converting them using asSPATA2() .

Initiation and processing of the SPATA2 objects

Data was read into R using SPATA2::initiateSpataObjectVisium() from the Space Ranger outs folder. The count matrix was processed using the R function Seurat::NormalizeFeatures() . For further analysis, the normalized matrix of slot @layers from the corresponding Seurat object was used. The tissue outline (tissue edge) was identified using SPATA2's inbuilt image processing pipeline. Spatial outlier spots were identified and removed using SPATA2::identifySpatialOutliers() and SPATA2::removeSpatialOutliers() . Afterward, we clustered the barcoded spots using the BayesSpace algorithm as implemented in SPATA2::runBayesSpaceClustering() (number of clusters ranging from n = 3 to n = 15). Next, we identified spatially variable genes using the SPARK-X implementation SPATA2::runSparkx() .
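One simple way to think about spatial outlier detection is flagging spots whose nearest neighbour lies much farther away than the regular Visium grid spacing; SPATA2's identifySpatialOutliers() may use different criteria, so this is purely a conceptual sketch with made-up coordinates:

```python
import math

def spatial_outliers(spots, max_nn_dist):
    """spots: {barcode: (x, y)}; flag spots isolated from the rest of the tissue."""
    out = set()
    for b, (x, y) in spots.items():
        # distance to the nearest other spot
        nn = min(math.hypot(x - x2, y - y2)
                 for b2, (x2, y2) in spots.items() if b2 != b)
        if nn > max_nn_dist:
            out.add(b)
    return out

spots = {"s1": (0, 0), "s2": (1, 0), "s3": (0, 1), "s4": (10, 10)}
print(spatial_outliers(spots, max_nn_dist=2.0))  # {'s4'}
```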

Downstream analysis of sample #UKF269

First, we inferred copy number alterations using the SPATA2 implementation of the infercnv R package via SPATA2::runCNV() . Second, for sample #UKF269 we created a grouping variable called histology, based on the histological architecture of the sample, using the function SPATA2::createSpatialSegmentation() . The results can be obtained via the R command spatial_segmentations$T269 . Differential expression analysis (DEA) was conducted with the defaults of SPATA2::runDEA() , which in turn calls the default of Seurat::FindAllMarkers() , based on both the BayesSpace clustering and the histological segmentation. Identification of spatially variable genes was conducted using SPATA2::runSparkx() . The spatial trajectory was named horizontal_mid and added via SPATA2::addSpatialTrajectory() . The start point was set to the minimum of all x-coordinates and the mean of all y-coordinates; the end point was set to the maximum of all x-coordinates and the mean of all y-coordinates. Spatial trajectory screening (STS) was conducted with the function SPATA2::spatialTrajectoryScreening() . The parameter variables was set to the vector of genes identified as spatially variable by SPARK-X, as obtained by getSparkxGenes(…, threshold_pval = 0.05) . The output of STS is an S4 object of class SpatialTrajectoryScreening containing the results in slot @results : a data.frame in which each row corresponds to a gene-model fit, as indicated by the columns variables (theoretically, all numeric variables can be included in the screening process) and models.
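The trajectory construction above (start at the minimum x and mean y, end at the maximum x and mean y) can be sketched as follows, together with the projection of each spot onto this horizontal axis that makes expression-vs-distance modelling possible (illustrative re-implementation, not SPATA2's code):

```python
from statistics import mean

def horizontal_mid(coords):
    """Start = (min x, mean y), end = (max x, mean y), as described in the text."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), mean(ys)), (max(xs), mean(ys))

def project(coords, start, end):
    """Fraction in [0, 1] of each spot along the purely horizontal trajectory."""
    (x0, _), (x1, _) = start, end
    length = x1 - x0
    return [(x - x0) / length for x, _ in coords]

coords = [(0.0, 2.0), (5.0, 0.0), (10.0, 4.0)]
start, end = horizontal_mid(coords)
print(start, end)                   # (0.0, 2.0) (10.0, 2.0)
print(project(coords, start, end))  # [0.0, 0.5, 1.0]
```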

Downstream analysis of sample #MCI_LMU

Differential gene expression analysis was conducted using the function runDEA() based on the clustering suggested by the Scanpy pipeline (see above). The function SPATA2::createImageAnnotations() was used to manually draw image annotations guided by prior information on injury location and histology. The two image annotations were labelled inj1 (upper section) and inj2 (lower section). Spatial annotation screening was conducted twice. The first run (Main Fig. 3) included the parameter adjustments distance = "1.5 mm" and resolution = "50um" . Models corresponding to the gene expression gradients of Hmox1 and Lcn2 used for the screening were obtained via SPATA2::getSasDf() with parameters equal to the screening, converted to a list via base::as.list() and added with the argument add_models . The second run (Supplementary Fig. 3) included the parameter adjustments distance = "0.75 mm" and resolution = "50um" . In both cases, as well as for all visualizations, the parameter ids was set to c("inj1", "inj2") , including both injury annotations.
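The distance/resolution parameters above translate into binning spots by their distance to an annotation and averaging expression per bin. A toy sketch (for simplicity, distance is taken to the annotation centre, whereas SPATA2 measures distance to the annotation outline; all numbers are made up):

```python
import math

def expression_gradient(spots, center, distance, resolution):
    """Bin spots by distance to `center` in bins of width `resolution`
    up to `distance`; return the mean expression per bin (None if empty)."""
    n_bins = int(distance / resolution)
    bins = [[] for _ in range(n_bins)]
    for (x, y), expr in spots:
        d = math.hypot(x - center[0], y - center[1])
        i = int(d / resolution)
        if i < n_bins:
            bins[i].append(expr)
    return [sum(b) / len(b) if b else None for b in bins]

# Toy spots: ((x, y) in µm, expression value); the screening used 50 µm bins.
spots = [((10, 0), 5.0), ((60, 0), 3.0), ((70, 0), 1.0), ((120, 0), 0.5)]
print(expression_gradient(spots, center=(0, 0), distance=150, resolution=50))
# [5.0, 2.0, 0.5] — expression decays with distance from the annotation
```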

Downstream analysis of sample #UKF313

After visual inspection of the histology image and identification of the necrotic area as well as the pseudopalisades, we created a spatial annotation that captured the spatial extent of the central necrotic area using the function SPATA2::createImageAnnotations() and named it necrotic_center (Supplementary Fig. 7a). The vivid part within the annotated area of necrotic_center was also interactively annotated and labeled vivid. Additionally, the spatial annotations necrotic_edge and necrotic_edge2 were created within the interactive interface of SPATA2::createImageAnnotations() . The spatial annotation necrotic_center was equipped with the spatial annotation vivid as a hole using SPATA2::addInnerHoles() , and the resulting spatial annotation was called necrotic_area. The spatial annotation screening, as visualized in Fig. 5b, was conducted using the function SPATA2::spatialAnnotationScreening() with the parameter ids set to c("necrotic_area", "necrotic_edge", "necrotic_edge2") and distance set to "dte". Genes screened were subsetted according to the list of spatially variable genes as provided by SPATA2::getSparkxGenes(…, threshold_pval = 0.05) . The output of the spatial annotation screening algorithm is an S4 object of the class SpatialAnnotationScreening . Resulting model fits were filtered for genes with an adjusted p-value (FDR) of <0.05 and an RMSE of <0.25 for either the descending models (Fig. 5c) or the ascending models (Fig. 5d). The remaining genes were grouped by model class (descending or ascending) and supplied to hypeR::hypeR() . Gene sets were provided by SPATA2::getGeneSetList() . The results were filtered for gene sets in either group with an FDR <0.05. Gene sets used for Fig. 5e, h were picked as examples for either group. Their original names, as listed in the data.frames of SPATA2, are HM_HYPOXIA and RCTM_CELLULAR_HEXOSE_TRANSPORT (e) as well as HM_OXIDATIVE_PHOSPHORYLATION and RCTM_TCR_SIGNALING (h).
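The filtering by model class and RMSE can be sketched by comparing a 0-1-scaled expression gradient against idealised ascending and descending templates and keeping genes whose best fit has RMSE < 0.25. SPATA2 fits a larger family of models; the linear templates here are a deliberate simplification:

```python
import math

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def classify(gradient, threshold=0.25):
    """Return 'ascending'/'descending' if the better linear template
    fits with RMSE below `threshold`, else None."""
    n = len(gradient)
    ascending = [i / (n - 1) for i in range(n)]
    descending = ascending[::-1]
    fits = {"ascending": rmse(gradient, ascending),
            "descending": rmse(gradient, descending)}
    best = min(fits, key=fits.get)
    return best if fits[best] < threshold else None

print(classify([1.0, 0.8, 0.4, 0.1]))  # descending
print(classify([0.5, 0.1, 0.9, 0.2]))  # None — neither model fits well
```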

Cell2Location

We integrated the cell2location model into our study, using it to bridge the Visium spatial transcriptomics data with the GBMap single-cell dataset of glioblastoma. The single-cell dataset was downsampled to 100,000 cells to accommodate computational demands. Signature estimation from the single-cell dataset was conducted via the cell2location negative binomial regression model, producing the inf_aver_sc.csv file, which served as the foundation for the spatial deconvolution. Shared genes between the signature genes and the spatial dataset were identified before initializing the cell2location model. The model was trained with the recommended hyperparameters, using early stopping criteria based on the ELBO loss. After training, the posterior distribution of cell abundance was quantified and extracted for subsequent analyses. The expected expression for each cell type was computed, and cell type-specific expression was documented.
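The final step above, computing expected expression per cell type, amounts to multiplying the estimated cell abundance at each spot by that cell type's gene signature and summing the contributions. A hedged sketch with illustrative numbers (not cell2location's API):

```python
def expected_expression(abundance, signatures):
    """abundance: {spot: {cell_type: estimated cell count}};
    signatures: {cell_type: {gene: per-cell expression rate}}
    -> {spot: {gene: expected total expression}}."""
    out = {}
    for spot, ab in abundance.items():
        out[spot] = {}
        for cell_type, n_cells in ab.items():
            for gene, rate in signatures[cell_type].items():
                out[spot][gene] = out[spot].get(gene, 0.0) + n_cells * rate
    return out

abundance = {"spot1": {"Astrocytes": 2.0, "Microglia": 1.0}}
signatures = {"Astrocytes": {"Aqp4": 3.0},
              "Microglia": {"Aif1": 4.0, "Aqp4": 0.5}}
print(expected_expression(abundance, signatures))
# {'spot1': {'Aqp4': 6.5, 'Aif1': 4.0}}
```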

For the decomposition of cell types, we utilized the GBMap atlas, which encompasses a dataset of over one million cells. We developed a pipeline for single-cell deconvolution employing CytoSpace in conjunction with SPATA objects, details of which are accessible at our dedicated GitHub repository (github.com/heilandd). The R script named “CytoSpace_from_SPATA.R” provides a detailed workflow for preparing files compatible with the CytoSpace suite, supplemented by a bash script to facilitate the batch processing of SPATA2 objects. The CytoSpace analysis itself runs within a bash environment, and upon its completion, a script is made available for importing the results back into the SPATA2 framework using the CytoSpace2SPATA function.

We utilized the spatial T cell receptor sequencing samples published in Benotmane et al. 2023. Data were downloaded from GEO (accession GSE238071) and processed with the analysis script at https://github.com/heilandd/SPTCR_seq_code .

Horizontal integration of spatial annotation screening

Horizontal integration of the spatial annotation screening was performed with the output of the SPATA2::getSasDf() function, which provides inferred expression estimates at distance intervals as explained in the section "Inferring an expression gradient". The distance parameter was set to 3 mm and the resolution parameter to 100 μm. The data.frames of all six samples, containing their respective expression estimates of all variables displayed in Fig. 5, were merged using base::rbind() .
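The merge step above is a plain row-wise stack of per-sample tables keyed by sample ID (the R pipeline does this with base::rbind() on the data.frames returned by SPATA2::getSasDf() ). A minimal Python sketch with made-up rows:

```python
def integrate(samples):
    """samples: {sample_id: list of (distance_um, variable, expression)}
    -> one table of (sample_id, distance_um, variable, expression) rows."""
    merged = []
    for sample_id, rows in samples.items():
        merged.extend((sample_id, d, var, expr) for d, var, expr in rows)
    return merged

samples = {
    "UKF269": [(100, "HMOX1", 0.8), (200, "HMOX1", 0.5)],
    "UKF313": [(100, "HMOX1", 0.9)],
}
table = integrate(samples)
print(len(table))  # 3 rows across both samples
```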

Statistics and reproducibility

Statistical analysis was conducted with R (version 4.1.2). No statistical method was used to predetermine the sample size. No data was excluded from the analysis. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. SPATA2 is a package that undergoes continuous improvements and adjustments. A stable version that can be used to reproduce the analysis and figures of this manuscript using the source data file can be installed via devtools::install_github(repo = "kueckelj/SPATA2", ref = "0e3eb85") . The latest version of the package can be installed via devtools::install_github(repo = "theMILOlab/SPATA2") . The following packages are required as dependencies for the SPATA2 version with which this study has been conducted: BiocGenerics >= v0.40.0; DT >= v0.23; DelayedArray >= v0.20.0; DelayedMatrixStats >= v1.16.0; EBImage >= v4.36.0; FNN >= v1.1.3.2; Matrix.utils >= v0.9.8; S4Vectors >= v0.32.4; Seurat >= v5.0.2; SingleCellExperiment >= v1.16.0; SummarizedExperiment >= v1.24.0; aplot >= v0.1.6; batchelor >= v1.10.0; broom >= v0.8.0; colorspace >= v2.1-0; concaveman >= v1.1.0; confuns >= v1.0.3; dbscan >= v1.1-10; dplyr >= v1.1.2; ggalt >= v0.4.0; ggforce >= v0.3.3; ggplot2 >= v3.4.3; ggridges >= v0.5.3; ggsci >= v2.9; glue >= v1.7.0; grid >= v4.1.2; keys >= v0.1.1; limma >= v3.50.3; lubridate >= v1.8.0; magick >= v2.7.3; magrittr >= v2.0.3; paletteer >= v1.4.0; pheatmap >= v1.0.12; pracma >= v2.3.8; progress >= v1.2.2; psych >= v2.2.5; purrr >= v1.0.1; readr >= v2.1.2; reticulate >= v1.34.0; rlang >= v1.1.1; scattermore >= v1.2; shiny >= v1.7.1; shinyWidgets >= v0.7.0; shinybusy >= v0.3.1; shinydashboard >= v0.7.2; shinyhelper >= v0.3.2; sp >= v1.5-0; stringi >= v1.7.6; stringr >= v1.5.0; tibble >= v3.2.1; tidyr >= v1.2.0; tidytext >= v0.3.3; umap >= v0.2.8.0; units >= v0.8-0; viridis >= v0.6.2.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The raw stRNA-seq data from human glioblastoma used in this study have been deposited at https://datadryad.org/stash/dataset/doi:10.5061/dryad.h70rxwdmj . The raw scRNA-seq and stRNA-seq data from the injured mouse brain are deposited at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE226211 . Source data and code to reproduce the panels presented in all main and Supplementary Figs. are available in the source data file. Cluster results, interactively created segmentations, and spatial annotations are additionally available as lists in the SPATA2 package. The data can be obtained using the commands SPATA2::clustering , SPATA2::spatial_segmentations , SPATA2::spatial_trajectories and SPATA2::spatial_annotations . Furthermore, processed SPATA2 objects used in this study can be downloaded using SPATA2::downloadFromPublication() . Source data are provided with this paper.

Code availability

The SPATA2 package is available at https://github.com/theMILOlab/SPATA2 . SPATA2 version 3.0.0, which contains the features presented in this manuscript, will be made available within two weeks of publication of this manuscript. Further information and requests for resources, raw data and reagents should be directed to and will be fulfilled by the contact: D. H. Heiland, [email protected].

Moses, L. & Pachter, L. Museum of spatial transcriptomics. Nat. Methods 19 , 534–546 (2022).


Asp, M. et al. A spatiotemporal organ-wide gene expression and cell atlas of the developing human heart. Cell 179 , 1647–1660.e19 (2019).

Maynard, K. R. et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat. Neurosci. 24 , 425–436 (2021).


Ravi, V. M. et al. Spatially resolved multi-omics deciphers bidirectional tumor–host interdependence in glioblastoma. Cancer Cell 40 , 639–655.e13 (2022).

Neftel, C. et al. An integrative model of cellular states, plasticity, and genetics for glioblastoma. Cell 178 , 835–849.e21 (2019).

Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344 , 1396–1401 (2014).


Svensson, V., Teichmann, S. A. & Stegle, O. SpatialDE: identification of spatially variable genes. Nat. Methods 15 , 343–346 (2018).

Zhu, J., Sun, S. & Zhou, X. SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biol. 22 (2021). https://doi.org/10.1186/s13059-021-02404-0

Zhao, E. et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat. Biotechnol. 39 , 1375–1384 (2021).

Koupourtidou, C. et al. Shared inflammatory glial cell signature after stab wound injury, revealed by spatial, temporal, and cell-type-specific profiling of the murine cerebral cortex. Nat. Commun. 15 , 2866 (2024).

Biancalani, T. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat. Methods 18 , 1352–1362 (2021).


Benotmane, J. K. et al. High-sensitive spatially resolved T cell receptor sequencing with SPTCR-seq. Nat. Commun. 14 , 7432 (2023).


Van den Berge, K. et al. Trajectory-based differential expression analysis for single-cell sequencing data. Nat. Commun. 11 , 1201 (2020).

Song, D. & Li, J. J. PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p -values from single-cell RNA sequencing data. Genome Biol. 22 , 124 (2021).

Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184 , 3573–3587.e29 (2021).

Bergenstråhle, J., Larsson, L. & Lundeberg, J. Seamless integration of image and molecular analysis for spatial transcriptomics workflows. BMC Genom. 21 (2020). https://doi.org/10.1186/s12864-020-06832-3

Dries, R. et al. Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome Biol. 22 (2021). https://doi.org/10.1186/s13059-021-02286-2

Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19 , 15 (2018).

Palla, G. et al. Squidpy: a scalable framework for spatial omics analysis. Nat. Methods 19 , 171–178 (2022).

Hildebrandt, F. et al. Spatial transcriptomics to define transcriptional patterns of zonation and structural components in the mouse liver. Nat. Commun. 12 (2021).

Stringer, C., Wang, T., Michaelos, M. & Pachitariu, M. Cellpose: a generalist algorithm for cellular segmentation. Nat. Methods 18 , 100–106 (2021).

Federico, A. & Monti, S. HypeR: an R package for geneset enrichment workflows. Bioinformatics 36 , 1307–1308 (2020).


Acknowledgements

This project was funded by the German Cancer Consortium (DKTK), Else Kröner-Fresenius Foundation (DHH). The work is part of the MEPHISTO project (DHH), funded by the BMBF (German Federal Ministry of Education and Research) (project number: 031L0260B). Funding was received from the German Research Foundation (DFG) as part of the Munich Cluster for Systems Neurology (EXC 2145 SyNergy, ID 390857198, to MD), the CRC 1123 (B3; to MD), DI 722/16-1 (ID: 428668490/40535880, to MD), DI 722/13-1, and DI 722/21-1 (to MD); a grant from the Leducq Foundation (to MD); the European Union's Horizon Europe (European Innovation Council) programme under grant agreement No 101115381 (to MD); ERA-NET Neuron (MatriSVDs, to MD), and the Vascular Dementia Research Foundation. SF was supported by the Joachim Herz Foundation.

Open Access funding enabled and organized by Projekt DEAL.

Author information

These authors contributed equally: Jan Kueckelhaus, Simon Frerich.

Authors and Affiliations

Microenvironment and Immunology Research Laboratory, Medical Center, Faculty of Medicine, Freiburg University, Freiburg, Germany

Jan Kueckelhaus, Jasim Kada-Benotmane & Dieter Henrik Heiland

Department of Neurosurgery, Medical Center, Faculty of Medicine, Erlangen University, Erlangen, Germany

Jan Kueckelhaus, Oliver Schnell & Dieter Henrik Heiland

Institute for Stroke and Dementia Research (ISD), University Hospital, LMU Munich, Munich, Germany

Simon Frerich & Martin Dichgans

Graduate School of Systemic Neurosciences, LMU Munich, Munich, Germany

Simon Frerich

Department of Neurosurgery, Medical Center, Faculty of Medicine, Freiburg University, Freiburg, Germany

Jasim Kada-Benotmane & Juergen Beck

Department of Cell Biology and Anatomy, Biomedical Center (BMC), LMU Munich, Munich, Germany

Christina Koupourtidou & Jovica Ninkovic

Munich Cluster for Systems Neurology (SyNergy), Munich, Germany

Christina Koupourtidou, Jovica Ninkovic & Martin Dichgans

German Center for Neurodegenerative Diseases (DZNE), Munich, Germany

Martin Dichgans

Comprehensive Cancer Center Freiburg (CCCF), Medical Center, University of Freiburg, Freiburg, Germany

Dieter Henrik Heiland

German Cancer Consortium (DKTK) partner site Freiburg, Freiburg, Germany

Department of Neurological Surgery, Lou and Jean Malnati Brain Tumor Institute, Robert H. Lurie Comprehensive Cancer Center, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA


Contributions

The study was designed and coordinated by D.H.H. SPATA2 was developed by J.K., D.H.H. SPATA2 is maintained by J.K. and S.F. Spatial gradient screening was conceived by J.K., D.H.H., S.F. and J.K.B. The statistical and informatic implementation was conducted by J.K. Simulations and benchmarking were conducted by J.K. and S.F. Mouse experiments and generation of mouse cortex Visium and scRNA-seq data were carried out by C.K. and J.N. Visium samples were selected and processed by S.F. Analysis of glioblastoma and human neocortex data sets was conducted by J.K., S.F., and D.H.H. Analysis of mouse cortex data sets was conducted by J.K. and S.F. Main figures were created by J.K. and D.H.H. Supplementary Figs. were created by J.K. and S.F. Main part of the manuscript was written by J.K., D.H.H., and S.F. Method part of the manuscript was written by J.K. The manuscript was edited by O.S., J.B., M.D., and J.K.B.

Corresponding authors

Correspondence to Jan Kueckelhaus or Dieter Henrik Heiland .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Communications thanks Wei Liu, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information, reporting summary, peer review file, source data, rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Kueckelhaus, J., Frerich, S., Kada-Benotmane, J. et al. Inferring histology-associated gene expression gradients in spatial transcriptomic studies. Nat Commun 15 , 7280 (2024). https://doi.org/10.1038/s41467-024-50904-x

Download citation

Received : 29 May 2023

Accepted : 24 July 2024

Published : 23 August 2024

DOI : https://doi.org/10.1038/s41467-024-50904-x

