Testing Equality of Two Percentages

When the measurement of each subject can be represented by 0 or 1 (e.g., subject's condition improves or not, subject buys something or not, subject clicks a link or not, subject passes an exam or not), deciding whether the treatment has an effect is essentially testing the null hypothesis that two percentages are equal—which is the problem this chapter addresses.

Different ways of drawing samples lead to different tests. In one sampling design (the randomization model), the entire collection of subjects is allocated randomly between treatment and control, which makes the samples dependent. Conditioning on the total number of ones in the treatment and control groups leads to Fisher's exact test, which is based on the hypergeometric distribution of the number of ones in the treatment group if the null hypothesis is true. When the sample sizes are large, calculating the rejection region for Fisher's exact test is cumbersome, but the normal approximation to the hypergeometric distribution gives an approximate test—a test whose significance level is approximately what it claims to be.

In a second sampling design (the population model), the two samples are independent random samples with replacement from two populations; conditioning on the total number of ones in the two samples again leads to Fisher's exact test, which can be approximated as before.

There is another approximate approach to testing the null hypothesis in the population model: If the sample sizes are large (but the samples are drawn with replacement or are small compared to the two population sizes), the normal approximation to the distribution of the difference between the two sample percentages tends to be accurate. If the null hypothesis is true, the expected value of the difference between the sample percentages is zero, and the SE of the difference in sample percentages can be estimated by pooling the two samples. That allows one to transform the difference of sample percentages approximately into standard units, and to base an hypothesis test on the normal approximation to the probability distribution of the approximately standardized difference. Surprisingly, the resulting approximate test is essentially the normal approximation to Fisher's exact test, even though the assumptions of the two tests are different.

Fisher's Exact Test for an Effect—Dependent Samples

Suppose we own a start-up company that offers e-tailers a service for targeting their Web advertising. Consumers register with our service by filling out a form indicating their likes and dislikes, gender, age, etc. We store cookies on each consumer's computer to keep track of who he is. When a consumer with one of our cookies visits the Web site of any of our clients, we use the consumer's likes and dislikes to select (from a collection of the client's ads) the advertisement we think he is most likely to respond to. This is called targeted advertising. The targeting service is free to consumers; we charge the e-tailers. We can raise venture capital if we can show that targeting makes e-tailers' advertisements more effective.

We offer our service free to a large e-tailer. The e-tailer has a collection of advertisements that it usually uses in rotation: Each time a consumer arrives at the site, the server selects the next ad in the sequence to show to the consumer; the cycle starts over when all the ads have been shown.

To test whether targeting works, we implement a randomized, controlled , blind experiment by installing our software on the e-tailer's server to work as follows: Each time a consumer arrives at the site, with probability 50% the server shows the consumer the ad our targeting software selects, and with probability 50% the server shows the consumer the next ad in the rotation—the control ad. The decision of whether to show a consumer the targeted ad or the control ad is independent from consumer to consumer. For each consumer, the software records which strategy was used (target or rotation), and whether the consumer buys anything. The consumers who were shown the targeted ad comprise the treatment group; the other consumers comprise the control group. If a consumer visits the site more than once during the trial period, we ignore all of that consumer's visits but the first. Each subject (consumer) is assigned at random either to treatment or to control, and no subject knows which group he is in, so this is a controlled, randomized, blind experiment. There is no subjective element to determining whether a subject purchased something, so the lack of a double blind does not introduce bias.

Suppose that N consumers visit the site during the trial, that n_t of them are assigned to the treatment group, that n_c of them are assigned to the control group, and that G of the consumers buy something. (The mnemonic is that c stands for control, t for treatment, and G for the number of good customers—customers who buy something.) We want to know whether the targeting affects whether subjects buy anything. Only some of the consumers see the targeted ad, and only some see the control ad, so answering this question involves hypothetical counterfactual situations—what would have happened had all the consumers been shown a targeted ad, and what would have happened had all the consumers been shown a control ad. We treat the N consumers as a fixed group, without regard for how they were drawn from the more general population of people who shop online. Any conclusions we draw about the consumers who visited the site might not hold for the general population: We should be wary of extrapolating the results to consumers who were not in the sample unless we know that the randomized group is itself a random sample from the larger population. This set-up, in which the N subjects are a fixed group and the only random element is in allocating some of the subjects to the treatment group and the rest to the control group, is called the randomization model. Later in this chapter we consider a population model, in which the treatment group and the control group are random samples from a much larger population. In the population model, the null hypothesis will be slightly different, but we shall be able to extrapolate the results from the samples to the populations from which they were drawn, because they were drawn at random.

We can think of the experiment in the following way: The i th consumer has a ticket with two numbers on it: The first number, c i , is 1 if the consumer would have bought something if shown the control ad, and 0 if not. The second number, t i , is 1 if the consumer would have bought something if shown the targeted ad, and 0 if not. There are N tickets in all. Under the null hypothesis that targeting has no effect, t i =c i for each i=1, 2,  … , N . That is, each consumer either will buy or will not buy, regardless of the ad he is shown: Whether he will buy is determined before he is assigned to treatment or control.

For the i th consumer, we observe either c i or t i , but not both. The percentage of consumers who would have purchased something if every consumer had been shown the control ads is

p_c = (c_1 + c_2 + … + c_N)/N.

The percentage of consumers who would have bought something if all had been shown the targeted ads is

p_t = (t_1 + t_2 + … + t_N)/N.

Let

μ = p_t − p_c

be the difference between the percentage of consumers who would have bought had all been shown the targeted ad, and the percentage of consumers who would have bought had all been shown the control ad. Under the null hypothesis that targeting does not make a difference, t_i = c_i for all i = 1, 2, …, N. Thus if the null hypothesis is true, μ = 0, but the hypothesis that μ = 0 is weaker than the null hypothesis: If μ ≠ 0, the null hypothesis is false, but the null hypothesis can be false and yet still μ = 0. (That occurs if the number of consumers who would have bought something if all had been shown the targeted ads is equal to the number of consumers who would have bought something if all had been shown the control ads, but the purchases were made by a different subset of consumers.) The alternative hypothesis, that targeting helps, is that μ > 0. We would like to test the null hypothesis at significance level 5%.

Let X t be the number of sales to consumers in the treatment group, the sum of the observed values of t i . If the null hypothesis is true, the same G consumers would have bought whether they were assigned to treatment or to control, and the number of the consumers in the treatment group who bought something is the number of those G in a simple random sample of size n t from the population of N consumers. Thus, for any fixed values of N , G , and n t , X t has an hypergeometric distribution with parameters N , G , and n t .

We cannot calculate the threshold value x_0 until we know N, n_t, and G. Once we observe them, we can find the smallest value x_0 so that the probability that X_t is larger than x_0 if the null hypothesis is true is at most 5%, the chosen significance level. Our rule for testing the null hypothesis then is to reject the null hypothesis if X_t > x_0, and not to reject the null hypothesis otherwise. This is called Fisher's exact test for the equality of two percentages (against a one-sided alternative). The test is called exact because its probability of a Type I error can be computed exactly.
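For concreteness, here is a minimal R sketch of this computation, using made-up values of N, n_t, and G; the counts are hypothetical, not data from the text.

```r
# Hypothetical counts: N consumers, nt assigned to treatment, G purchasers.
N  <- 500
nt <- 260
G  <- 90

# Under the null hypothesis, X_t ~ hypergeometric(N, G, nt).
# qhyper(0.95, m, n, k) returns the smallest x with P(X <= x) >= 0.95,
# i.e., the smallest x0 with P(X_t > x0) <= 5%.
x0 <- qhyper(0.95, m = G, n = N - G, k = nt)
x0
1 - phyper(x0, m = G, n = N - G, k = nt)  # attained level; at most 5%
```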

The Normal Approximation to Fisher's Exact Test

If N is large and n_t is neither close to zero nor close to N, computing the hypergeometric probabilities will be difficult, but the normal approximation to the probability distribution of X_t should be accurate provided G is neither too close to zero nor too close to N. To calculate the normal approximation, we need to convert X_t to standard units, which requires that we know the expected value and SE of X_t. The expected value of X_t is

E(X_t) = n_t×G/N,

and the SE of X_t is

SE(X_t) = f×n_t^½×SD,

where f is the finite population correction

f = (N − n_t)^½/(N − 1)^½,

and SD is the standard deviation of a list of N values of which G equal 1 and (N−G) equal 0:

SD = (G/N × (1 − G/N))^½.

In standard units, X_t is

Z = (X_t − E(X_t))/SE(X_t) = (X_t − n_t×G/N)/(f×n_t^½×SD).

The area under the normal curve to the right of 1.645 is about 5%, so to test at significance level 5% we can take the threshold to be

x_0 = E(X_t) + 1.645×SE(X_t) = n_t×G/N + 1.645×f×n_t^½×SD

= n_t×G/N + 1.645×f×n_t^½×(G/N × (1 − G/N))^½

in the original units. Thus if we reject the null hypothesis when

Z > 1.645,

or equivalently when

X_t > n_t×G/N + 1.645×f×n_t^½×(G/N × (1 − G/N))^½,

the significance level of the test is approximately 5%.

Let z_α denote the α quantile of the normal curve: the number for which the area under the normal curve from minus infinity to z_α equals α. Because the normal curve is symmetric about zero, z_{1−α} = −z_α. Note that the area under the normal curve between z_α and z_{1−α} is 1 − 2α. Combining these two results shows that the area under the normal curve over the interval

[−z_{1−α/2}, z_{1−α/2}]

is 1 − α, and thus the area under the normal curve outside the interval (the area under the normal curve over the complement of the interval) is α. This complement also can be written

all values z such that |z| > z_{1−α/2}.

With this notation for quantiles of the normal curve, it is easier to write down the rejection region of the normal approximation to Fisher's exact test for a general significance level: the significance level of the rule {Reject the null hypothesis if Z > z_{1−α}} is approximately α.
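The sketch below computes the approximate threshold from these formulas and compares it with the exact hypergeometric quantile, reusing the same kind of hypothetical counts as in the earlier sketch.

```r
# Hypothetical counts, as before.
N  <- 500; nt <- 260; G <- 90

EX <- nt * G / N                      # E(X_t)
f  <- sqrt((N - nt) / (N - 1))        # finite population correction
SD <- sqrt((G / N) * (1 - G / N))     # SD of the 0-1 list
SE <- f * sqrt(nt) * SD               # SE(X_t)

x0_approx <- EX + qnorm(0.95) * SE    # z_{0.95} = 1.645
c(approx = x0_approx,
  exact  = qhyper(0.95, m = G, n = N - G, k = nt))
```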

The following exercise checks your ability to use the normal approximation to Fisher's exact test. The exercise is dynamic: The data will tend to change when you reload the page, so you can practice as much as you wish.

Testing the Equality of Two Percentages Using Independent Samples

In the experiment to test the effectiveness of targeted advertising using the randomization model described previously in this chapter, the samples from the populations of control and treatment values are dependent: Individual i has two numbers, c i and t i , and if we observe c i we cannot observe t i , and vice versa. If individual i is in the treatment group, he or she is not in the control group, and vice versa. Under the null hypothesis, the purchasers would have bought whether they were assigned to treatment or to control, and the non-purchasers would not have bought whether they were assigned to treatment or control, so the total number of purchasers does not depend on which consumers were assigned to treatment. That constancy led to an hypergeometric distribution for the number of purchasers in the treatment group under the null hypothesis. In this section, we see that Fisher's exact test allows us to test a slightly weaker null hypothesis when the data are two independent random samples with replacement from separate populations, a control group and a treatment group. This is the population model for comparing two percentages.

The weaker hypothesis is that the population percentage of the treatment group is equal to the population percentage of the control group. We also develop an approximate test for the equality of two percentages based on the sample percentages of independent random samples with replacement from two populations. The approximate test is essentially equivalent to the normal approximation to Fisher's exact test when the sample sizes are large.

Fisher's Exact Test Using Independent Samples

Suppose there are two populations of tickets labeled 0 and 1, a control group and a treatment group, with corresponding population percentages p c and p t . We want to test the null hypothesis that

p_c = p_t;

i.e., that

μ = p_t − p_c = 0.

We draw a random sample of size n c with replacement from the control group, and compute the sample sum X c . X c has a binomial distribution with parameters n c and p c . We draw another random sample of size n t with replacement from the treatment group, and compute the sample sum X t . X t has a binomial distribution with parameters n t and p t . We draw the two random samples independently of each other, so X c and X t are independent random variables. This scenario could correspond to an observational study, to a non-randomized experiment, or to a randomized experiment, depending upon how individuals came to be in the treatment group and the control group. The randomness in the problem at this point is in drawing the samples from the control group and the treatment group, not in assigning subjects to treatment or to control—that assignment occurred before we arrived on the scene. In this population model, we might be able to conclude from the data that the population percentages differ for the treatment and control groups (that p t ≠p c ), but even then we should not conclude that treatment has an effect unless the assignment of subjects to treatment and control was randomized. Otherwise, any real difference between p t and p c could be the result of confounding, rather than the result of the treatment.

In contrast, in the randomization model described earlier in the chapter, we might be able to conclude that the treatment has an effect for the N subjects in the randomization, but even then we should not extrapolate from those N subjects to conclude that treatment has an effect in the larger population from which the subjects were drawn, because we did not know how they were drawn.

Suppose the null hypothesis is true, so that p_c = p_t, and condition on the total number of ones in the two samples combined, G = X_t + X_c. Given G, the N = n_t + n_c observations together comprise G ones and N−G zeros, and, by symmetry, every set of n_t of the N observations is equally likely to comprise the treatment sample. Of the NCn_t equally likely possibilities,

GCx × (N−G)C(n_t−x)

result in x ones and n_t − x zeros among the n_t values drawn from the treatment group, so the chance that X_t = x is

GCx × (N−G)C(n_t−x) / NCn_t:

given the value of G , X t has an hypergeometric distribution with parameters N , G , and n t if the null hypothesis is true. Thus, under the null hypothesis that p c =p t , given the total number G of ones in the sample, the test statistic X t has the same distribution for this sampling design that the test statistic X t did for a population of N subjects assigned randomly to treatment or control. Therefore, the same testing procedure, Fisher's exact test, can be used to test the null hypothesis that p c =p t using independent random samples from two populations. In the previous section, the null hypothesis and the sampling design were different: Each subject i had two values, t i and c i ; the null hypothesis was that

t_i = c_i for all i = 1, 2, …, N;

and each of the N subjects was assigned at random either to treatment or to control.

As noted previously, it is hard to perform the calculations needed to find the rejection region for this test when N is large; the normal approximation to Fisher's exact test described in the previous section is a computationally tractable way to construct the rejection region. The approximation is accurate under the same assumptions.
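R's built-in fisher.test function carries out this test from a 2×2 table of counts; it conditions on the margins of the table, exactly as described above. A minimal sketch with hypothetical counts:

```r
# Rows: treatment, control; columns: bought, did not buy (hypothetical counts).
counts <- matrix(c(76, 450 - 76,
                   52, 430 - 52),
                 nrow = 2, byrow = TRUE)

# One-sided alternative: the treatment group is more likely to buy.
fisher.test(counts, alternative = "greater")
```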

The Z Test for the Equality of Two Percentages using Independent Samples

In this section, we develop another approximate test of the null hypothesis that p t =p c in the population model; it turns out that this test is essentially the same as the normal approximation to Fisher's exact test, although it is motivated quite differently.

Let φ_c be the sample percentage of the random sample from the control group, and let φ_t be the sample percentage of the random sample from the treatment group. Suppose that the two sample sizes n_c and n_t are large (say, over 100 each). Then the normal approximations to the two sample percentages should be accurate (provided neither p_c nor p_t is too close to 0 or to 1). The expected value of the sample percentage of a random sample with replacement is the population percentage, so the expected value of φ_c is p_c, and the expected value of φ_t is p_t. The SE of φ_c is

SE(φ_c) = (p_c×(1−p_c))^½/n_c^½,

and the SE of φ_t is

SE(φ_t) = (p_t×(1−p_t))^½/n_t^½.

Consider the difference of the two sample percentages

φ_{t−c} = φ_t − φ_c.

The difference φ_{t−c} is a random variable. The expected value of φ_{t−c} is

μ = p_t − p_c.

Because the samples from the treatment and control groups are independent of each other, φ_t and φ_c are independent, so the SE of φ_{t−c} is

SE(φ_{t−c}) = (SE²(φ_t) + SE²(φ_c))^½.

If the null hypothesis is true, the two population percentages are equal— p t =p c =p —and the two samples are like one larger sample from a single 0-1 box with a percentage p of tickets labeled "1." Let us call that box the null box . If the null hypothesis is true, the expected value of φ t−c , E(φ t−c ) , is zero, and

SE(φ_{t−c}) = (p×(1−p)/n_t + p×(1−p)/n_c)^½

= (1/n_t + 1/n_c)^½ × (p×(1−p))^½

= (N/(n_t×n_c))^½ × (p×(1−p))^½,

where N = n_t + n_c is the total sample size.

The first factor depends only on the sample sizes n t and n c , which we know. The second factor is the SD of the labels on the tickets in the null box. That factor depends only on p , the percentage of tickets labeled "1" in the null box. We do not know p , so we do not know the SD of the null box. However, we can use the bootstrap estimate of the SD of the null box because the sample size is large: let φ be the pooled sample percentage

φ = (total number of "1"s in both samples)/(total sample size)

= (n_c×φ_c + n_t×φ_t)/N.

The pooled bootstrap estimate of the SD of the null box is the estimate we get by pretending that the percentage of ones in the null box is equal to the percentage of ones in the pooled sample:

s* = (pooled bootstrap estimate of SD of the null box) = (φ×(1−φ))^½.

If the sample sizes are large and the null hypothesis is true, this will tend to be close to the true SD of the null box, and

SE*(φ_{t−c}) = (N/(n_t×n_c))^½ × s*

will tend to be quite close to SE(φ_{t−c}). The normal approximation to the probability distribution of φ_{t−c} tells us that the chance that φ_{t−c} is in a given range is approximately equal to the area under the normal curve for the same range, converted to standard units. Under the null hypothesis, the expected value of φ_{t−c} is zero, and SE(φ_{t−c}) is approximately SE*(φ_{t−c}), so

Z = φ_{t−c}/SE*(φ_{t−c})

is approximately φ_{t−c} in standard units: The chance that Z is in the range of values [a, b] is approximately the area under the normal curve between a and b.

Under the alternative hypothesis that p t >p c , Z will tend to be larger than it would under the null hypothesis. We can test the null hypothesis that p t =p c against the one-sided alternative hypothesis that p t >p c using

Z

as the test statistic. To test at approximate significance level α, reject the null hypothesis if Z > z_{1−α}.

This is called the (one-sided) z test for equality of two percentages using independent samples. The random variable Z is called the Z-statistic, and the observed value of Z is called the z-score. To test the null hypothesis against the other one-sided alternative hypothesis that p_t < p_c at approximate significance level α, reject the null hypothesis if Z < z_α. To test the null hypothesis against the two-sided alternative hypothesis that p_t ≠ p_c at approximate significance level α, reject when |Z| > z_{1−α/2}.
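A minimal sketch of the computation in R, using hypothetical sample counts:

```r
# Hypothetical counts: sample sizes and numbers of "1"s.
nt <- 450; Xt <- 76    # treatment
nc <- 430; Xc <- 52    # control

phi_t <- Xt / nt                       # treatment sample percentage
phi_c <- Xc / nc                       # control sample percentage
phi   <- (Xt + Xc) / (nt + nc)         # pooled sample percentage
s_star <- sqrt(phi * (1 - phi))        # pooled bootstrap estimate of the SD

Z <- (phi_t - phi_c) / (s_star * sqrt(1 / nt + 1 / nc))
Z
pnorm(Z, lower.tail = FALSE)           # approximate one-sided P-value
```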

This test is based on transforming the difference of sample percentages (the test statistic) approximately to standard units, under the assumption that the null hypothesis is true. Because the null hypothesis specifies that the two population percentages are equal, the expected value of the difference between the sample percentages is zero: the expected value of each sample percentage is p. However, the null hypothesis does not specify the value of p, and the SE of the difference of sample percentages depends on p, so we cannot calculate the SE of the test statistic under the null hypothesis; we have to estimate SE(φ_{t−c}) from the data. When the combined sample size N = n_t + n_c is sufficiently large, the pooled bootstrap estimate of the SD of the null box is likely to be quite accurate, and the estimated SE is likely to be very close to the true SE. When the individual sample sizes are large, the probability histogram of the difference of sample percentages can be approximated well by the normal curve.

The following exercise checks your ability to calculate the z test for equality of two percentages from independent samples.

The Normal Approximation to Fisher's Exact Test and the z Test for Equality of Two Percentages

We derived the z test for equality of two percentages using the assumption that the two samples are independent and that their sizes are fixed in advance. We derived Fisher's exact test by conditioning on the total number of tickets in the sample labeled "1," and found that the test could be used in two quite different situations: to test the hypothesis that treatment has no effect when a fixed collection of individuals are randomized into treatment and control groups, so the treatment and control samples are dependent; and to test the hypothesis that two population percentages are equal from independent samples from the two populations.

Somewhat surprisingly, the normal approximation to Fisher's exact test is essentially the z test when the sample sizes are all large. (The difference is just the −1 in the denominator of the finite population correction, which is negligible if the samples are large.) That is, the z score in the normal approximation to Fisher's exact test is almost exactly equal to the z score in the z test for equality of two percentages using independent samples: The two tests reject for essentially the same observed data values.
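A quick numerical check of this near-equivalence, reusing the hypothetical counts from the z test sketch above:

```r
nt <- 450; Xt <- 76; nc <- 430; Xc <- 52   # hypothetical counts
N <- nt + nc; G <- Xt + Xc

f <- sqrt((N - nt) / (N - 1))              # finite population correction
Z_fisher <- (Xt - nt * G / N) /
            (f * sqrt(nt) * sqrt((G / N) * (1 - G / N)))

phi_t <- Xt / nt; phi_c <- Xc / nc; phi <- G / N
Z_ztest <- (phi_t - phi_c) /
           (sqrt(phi * (1 - phi)) * sqrt(1 / nt + 1 / nc))

# The two z-scores differ only by the factor sqrt(N / (N - 1)).
c(Z_fisher = Z_fisher, Z_ztest = Z_ztest, ratio = Z_ztest / Z_fisher)
```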

The following example illustrates the approximate equivalence between the z test and the normal approximation to Fisher's exact test. The example is dynamic: The data will tend to change when you reload the page, to provide more examples of the computations involved.

It is rather surprising that tests derived under different assumptions behave so similarly. Generally, when the assumptions of a test are violated, the nominal significance level will be incorrect and the test should not be used. This is a rare exception.

Suppose two variables, C and T , are defined for a group of N individuals: c i is the value of C for the i th individual, and t i is the value of T for the i th individual, i=1, 2, …, N . Suppose each c i and each t i can equal either 0 or 1, so that

p_c = (c_1 + c_2 + … + c_N)/N

is the population percentage of the values of C, and

p_t = (t_1 + t_2 + … + t_N)/N

is the population percentage of the values of T . A simple random sample of size n t will be taken from the population. The values of t i are observed for the units in the sample; for the N−n t units not in the sample, the values of c i are observed instead. This is the randomization model for evaluating whether a treatment has an effect in an experiment in which a fixed set of N units are assigned at random either to treatment or to control. The response of individual i is t i if he is treated and c i if not. At issue is whether the treatment has an effect. The null hypothesis is that treatment does not matter at all: c i =t i , for every individual i . Let G be the sum of all the observations, the observed values of c i plus the observed values of t i . Let X t be the sum of the observed values of t i .

If the null hypothesis is true, the n_t observed values of t_i are like a random sample from a 0-1 box of N tickets of which G are labeled 1. Thus X_t has an hypergeometric distribution with parameters N, G, and n_t. Fisher's exact test uses X_t as the test statistic, and this hypergeometric distribution to select the rejection region. If the alternative hypothesis is that p_t > p_c, then if the alternative hypothesis is true X_t would tend to be larger than it would be if the null hypothesis is true, so the hypothesis test should be of the form {Reject if X_t > x_0}, with x_0 chosen so that the test has the desired significance level. If the sample sizes are large, it can be difficult to calculate the rejection region for Fisher's exact test; then the normal approximation to the hypergeometric distribution can be used to construct a test with approximately the correct significance level. In the normal approximation to Fisher's exact test, the rejection region for approximate significance level α uses the threshold for rejection

x_0 = n_t×G/N + z_{1−α}×f×n_t^½×(G/N × (1 − G/N))^½,

where f is the finite population correction (N−n_t)^½/(N−1)^½ and z_{1−α} is the 1−α quantile of the normal curve. The α quantile of the normal curve, z_α, is the number for which the area under the normal curve from minus infinity to z_α equals α. For example, z_{0.05} = −1.645, and z_{0.95} = 1.645.

A Z -statistic is a test statistic whose probability histogram can be approximated well by a normal curve if the null hypothesis is true. The observed value of a Z -statistic is called the z -score. In Fisher's exact test,

Z = (X_t − n_t×G/N)/(f×n_t^½×(G/N × (1−G/N))^½)

is a Z-statistic.

Suppose one wants to test the null hypothesis that two population percentages are equal, p t =p c , on the basis of independent random samples with replacement from the two populations. This is the population model for comparing two population percentages. Let n t denote the size of the random sample from the first population; let n c be the size of the sample from the second population; and let N=n t +n c be the total sample size. Let X t denote the sample sum of the first sample; let X c denote the sample sum of the second sample; and let

G = X_t + X_c

denote the sum of the two samples. Conditional on the value of G, the probability distribution of X_t is hypergeometric with parameters N, G, and n_t, so Fisher's exact test can be used to test the null hypothesis. There is a different approximate approach based on the normal approximation to the probability distribution of the sample percentages: Let φ_t denote the sample percentage of the sample from the first population; let φ_c denote the sample percentage of the sample from the second population; and let φ denote the overall sample percentage of the two samples pooled together,

φ = (total number of "1"s in the two samples)/(total sample size) = G/N.

Then, if the null hypothesis is true,

E(φ_t − φ_c) = 0.

If in addition n t and n c are large, SE(φ t −φ c ) is approximately

s*×(1/n_t + 1/n_c)^½,

where

s* = (φ×(1−φ))^½

is the pooled bootstrap estimate of the SD of the null box. Under the null hypothesis, for large sample sizes n t and n c , the probability histogram of

Z = (φ_t − φ_c)/(s*×(1/n_t + 1/n_c)^½)

can be approximated accurately by the normal curve, so Z is a Z-statistic. To test the null hypothesis against the one-sided alternative that p_t < p_c at approximate significance level α, use a one-sided test that rejects the null hypothesis when Z < z_α. To test the null hypothesis against the one-sided alternative that p_t > p_c at approximate significance level α, use a one-sided test that rejects the null hypothesis when Z > z_{1−α}. To test the null hypothesis against the two-sided alternative that p_t ≠ p_c at approximate significance level α, use a two-sided test that rejects the null hypothesis when |Z| ≥ z_{1−α/2}. The z test for the equality of two percentages is essentially equivalent to the normal approximation to Fisher's exact test when the sample sizes are all large, even though the assumptions of the tests differ.

  • alternative hypothesis
  • binomial distribution
  • bootstrap estimate
  • control group
  • expected value
  • finite population correction
  • Fisher's exact test
  • hypergeometric distribution
  • hypothesis testing
  • independent
  • independent random sample
  • normal approximation
  • normal curve
  • null hypothesis
  • pooled bootstrap estimate of the SD
  • population model
  • population percentage
  • probability
  • probability distribution
  • probability histogram
  • quantile of the normal curve
  • random sample
  • random variable
  • randomization model
  • rejection region
  • sample percentage
  • sample size
  • significance level
  • simple random sample
  • standard deviation
  • standard error
  • standard unit
  • test statistic
  • treatment group
  • Z statistic

Two‐sample testing for random graphs


  • adjacency spectral embedding
  • hypothesis testing
  • latent position random graphs


A new maximum mean discrepancy based two-sample test for equal distributions in separable metric spaces

Original Paper. Published: 25 August 2024. Statistics and Computing, Volume 34, article number 172 (2024).


Bu Zhou, Zhi Peng Ong & Jin-Ting Zhang


This paper presents a novel two-sample test for equal distributions in separable metric spaces, utilizing the maximum mean discrepancy (MMD). The test statistic is derived from the decomposition of the total variation of data in the reproducing kernel Hilbert space, and can be regarded as a V-statistic-based estimator of the squared MMD. The paper establishes the asymptotic null and alternative distributions of the test statistic. To approximate the null distribution accurately, a three-cumulant matched chi-squared approximation method is employed. The parameters for this approximation are consistently estimated from the data. Additionally, the paper introduces a new data-adaptive method based on the median absolute deviation to select the kernel width of the Gaussian kernel, and a new permutation test combining two different Gaussian kernel width selection methods, which improve the adaptability of the test to different data sets. Fast implementation of the test using matrix calculation is discussed. Extensive simulation studies and three real data examples are presented to demonstrate the good performance of the proposed test.

Data Availability

Data is provided within the supplementary files.

References

Anderson, N.H., Hall, P., Titterington, D.M.: Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. J. Multivar. Anal. 50 (1), 41–54 (1994)


Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.-P., Schölkopf, B., Smola, A.: Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22 (14), 49–57 (2006)


Biswas, M., Mukhopadhyay, M., Ghosh, A.K.: A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 101 (4), 913–926 (2014)

Einsporn, R.L., Habtzghi, D.: Combining paired and two-sample data using a permutation test. J. Data Sci. 11 , 767–779 (2013)

Gretton, A., Borgwardt, K.M., Rasch, M., Schölkopf, B., Smola, A.J.: A kernel approach to comparing distributions. In: Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07), pp. 1637–1641 (2007a)

Gretton, A., Borgwardt, K.M., Rasch, M., Schölkopf, B., Smola, A.J.: A kernel method for the two-sample-problem. In: Advances in Neural Information Processing Systems 15, pp. 513–520. MIT Press (2007b)

Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13 , 723–773 (2012)


Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B.K.: A fast, consistent kernel two-sample test. In: Advances in Neural Information Processing Systems 22, pp. 673–681. Curran Associates, Inc. (2009)

Gao, H., Shao, X.: Two sample testing in high dimension via maximum mean discrepancy. J. Mach. Learn. Res. 24 , 1–33 (2023)

Golub, T.R., Slonim, D.K., Tamayo, P., Gaasenbeek, C.H.M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (5439), 531–537 (1999)

Hall, P., Keilegom, I.V.: Two-sample tests in functional data analysis starting from discrete data. Stat. Sinica 17 , 1511–1531 (2007)

Jiang, Q., Hušková, M., Meintanis, S.G., Zhu, L.: Asymptotics, finite-sample comparisons and applications for two-sample tests with functional data. J. Multivar. Anal. 170, 202–220 (2019)

Li, J.: Asymptotic normality of interpoint distances for high-dimensional data with applications to the two-sample problem. Biometrika 105 (3), 529–546 (2018)

Pomann, G.-M., Staicu, A.-M., Ghosh, S.: A two-sample distribution-free test for functional data with application to a diffusion tensor imaging study of multiple sclerosis. J. R. Stat. Soc.: Ser. C (Appl. Stat.) 65 (3), 395–414 (2016)

Ramsay, J.O., Silverman, B.W.: Applied Functional Data Analysis: Methods and Case Studies. Springer, Berlin (2002)


Serfling, R.J.: Approximation Theorems of Mathematical Statistics. Wiley, New York (1980)

Sriperumbudur, B.K., Fukumizu, K., Lanckriet, G.R.G.: Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res. 12 , 2389–2410 (2011)

Smola, A., Gretton, A., Song, L., Schölkopf, B.: A Hilbert space embedding for distributions. In: Proceedings of the International Conference on Algorithmic Learning Theory, vol. 4754, pp. 13–31 (2007)

Shinmura, S.: High-Dimensional Microarray Data Analysis. Springer, Berlin (2019)

Székely, G.J., Rizzo, M.L.: Testing for equal distributions in high dimension. InterStat (2004)

Tuddenham, R.D., Snyder, M.M.: Physical growth of California boys and girls from birth to eighteen years. Univ. Calif. Publ. Child Dev. 1 , 183–364 (1954)


Wynne, G., Duncan, A.B.: A kernel two-sample test for functional data. J. Mach. Learn. Res. 23 , 1–51 (2022)

Wei, S., Lee, C., Wichers, L., Marron, J.S.: Direction–projection–permutation for high-dimensional hypothesis tests. J. Comput. Gr. Stat. 25 (2), 549–569 (2016)

Zhou, B., Guo, J.: A note on the unbiased estimator of \(\Sigma ^2\) . Stat. Probab. Lett. 129 , 141–146 (2017)

Zhang, J.-T., Guo, J., Zhou, B.: Testing equality of several distributions in separable metric spaces: a maximum mean discrepancy based approach. J. Econom. 239 , 105286 (2024)

Zhang, J.-T., Guo, J., Zhou, B., Cheng, M.-Y.: A simple two-sample test in high dimensions based on \(L^2\) -norm. J. Am. Stat. Assoc. 115 (530), 1011–1027 (2020)

Zhang, J.-T.: Approximate and asymptotic distributions of chi-squared-type mixtures with applications. J. Am. Stat. Assoc. 100 (469), 273–285 (2005)

Zhang, J.-T., Smaga, Ł.: Two-sample test for equal distributions in separable metric space: new maximum mean discrepancy based approaches. Electron. J. Stat. 16, 4090–4132 (2022)


Acknowledgements

Zhou was supported by the Zhejiang Provincial Natural Science Foundation of China (Grant No. LY21A010007), the National Natural Science Foundation of China (Grants Nos. U23A2064 & 11901520), the Fundamental Research Funds for the Provincial Universities of Zhejiang (No. XRK22007), and Zhejiang Gongshang University “Digital+” Disciplinary Construction Management Project (No. SZJ2022A001). Zhang was financially supported by the National University of Singapore Academic Research Grants (22-5699-A0001 and 23-1046-A0001).

Author information

Authors and Affiliations

School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, 310018, China

Bu Zhou

Department of Statistics and Data Science, National University of Singapore, Singapore, 117546, Singapore

Zhi Peng Ong & Jin-Ting Zhang


Contributions

Conceptualization: J.-T. Zhang; methodology: J.-T. Zhang, B. Zhou; formal analysis and investigation: B. Zhou, Z. P. Ong; writing—original draft preparation: J.-T. Zhang, B. Zhou, Z. P. Ong; writing—review and editing: J.-T. Zhang, B. Zhou.

Corresponding author

Correspondence to Bu Zhou .

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Supplementary information

R code for the real data examples is available in the online supplementary materials.


Appendix A Technical Proofs

For \(\alpha ,\beta =1,2\) , we have

Proof of Lemma 8

Notice that we have the following useful properties: when \(\varvec{y}'=\varvec{y}\) , we have

where \(\varvec{z}\) and \(\varvec{z}'\) are independent copies of \(\varvec{y}\) , and when \(\varvec{y}\) and \(\varvec{y}'\) are independent, we have

By ( A2 ), ( A3 ), and ( 20 ), we have

\(\square \)

Proof of Theorem 1

Under Condition C1, we have \(P_{1}=P_{2}=P\) . Let \(\varvec{y},\varvec{y}'{\mathop {\sim }\limits ^{\text{ i.i.d. }}}P\) . Under Condition C3, we have the Mercer’s expansion ( 21 ). By ( A3 ) and ( 22 ), we have

This, together with ( 22 ), implies that

Set \(z_{r,\alpha i}=\psi _{r}(\varvec{y}_{\alpha i}),\ i=1,\dots ,n_{\alpha };\alpha =1,2,\) . Under Condition C1, we have \(\varvec{y}_{\alpha i}{\mathop {\sim }\limits ^{\text{ i.i.d. }}}P\) . It follows that for a fixed \(r=1,2,\dots \) , \(z_{r,\alpha i},\ i=1,\dots ,n_{\alpha };\alpha =1,2\) are i.i.d. with mean 0 and variance 1. For different r , \(z_{r,\alpha i},\ i=1,\dots ,n_{\alpha };\alpha =1,2\) are uncorrelated. Then by ( 20 ) and ( 21 ), we have

where \({\bar{z}}_{r,\alpha }=n_{\alpha }^{-1}\sum _{i=1}^{n_{\alpha }}z_{r,\alpha i},\ \alpha =1,2;r=1,2,\dots \) . By ( 19 ), we have

where \(A_{n,r}=w_{n,r}^{2}\) with \(w_{n,r}=\sqrt{\frac{n_{1}n_{2}}{n}}({\bar{z}}_{r,1}-{\bar{z}}_{r,2}),\ r=1,2,\dots \) which are uncorrelated.

Let \(\varphi _{x}(t)={\text {E}}(e^{itx})\) denote the characteristic function of a random variable x . Set \(\tilde{T}_{n}^{(q)}=\sum _{r=1}^{q}\lambda _{r}A_{n,r}\) . Notice that by Serfling ( 1980 , p. 197), for any two random variables X and Y with finite second moments, we have \(|{\text {E}}(e^{itX}-e^{itY})|\le {\text {E}}|e^{itX}-e^{itY}|={\text {E}}|e^{it(X-Y)}-1|<{\text {E}}|t(X-Y)|<|t|[{\text {E}}(X-Y)^2]^{1/2}\) where we use the well-known facts that \(|e^{iZ}-1|\le |Z|\) and \({\text {E}}^2(|Z|)\le {\text {E}}(Z^2)\) for any random variable Z with finite second moment. It follows that we have \(|\varphi _{\tilde{T}_{n}}(t)-\varphi _{\tilde{T}_{n}^{(q)}}(t)|\le |t|\big [{\text {E}}(\tilde{T}_{n}-\tilde{T}_{n}^{(q)})^{2}\big ]^{1/2}\) . For any given \(r=1,2,\dots \) , by the central limit theorem, under Conditions C2 and C3, as \(n\rightarrow \infty \) , we have \(\sqrt{n_{\alpha }}{\bar{z}}_{r,\alpha }{\mathop {\longrightarrow }\limits ^\mathcal{L}}\mathcal N (0,1),\ \alpha =1,2,\) and \({\bar{z}}_{r,1}\) and \({\bar{z}}_{r,2}\) are independent. It follows that under Conditions C2 and C3, as \(n\rightarrow \infty \) , \(w_{n,r}{\mathop {\longrightarrow }\limits ^\mathcal{L}}w_{r}\sim \mathcal N (0,1)\) and hence

It follows that as \(n\rightarrow \infty \) , we have \({\text {Var}}(A_{n,r})=2+o(1)\) and \({\text {E}}(A_{n,r})=1+o(1)\) . In addition, for any non-negative random variable sequence \(X_1,X_2,\cdots \) with finite second moments, by Cauchy–Schwarz inequality, we have \({\text {Cov}}(X_{i},X_{j})\le \sqrt{{\text {Var}}(X_{i})}\sqrt{{\text {Var}}(X_{j})}\) . It follows that \( {\text {Var}}\left( \sum _{i=1}^{\infty } X_i\right) =\sum _{i=1}^{\infty }\sum _{j=1}^{\infty } {\text {Cov}}(X_i,X_j)\le \sum _{i=1}^{\infty }\sum _{j=1}^{\infty }\sqrt{{\text {Var}}(X_{i})}\sqrt{{\text {Var}}(X_{j})}=\left[ \sum _{i=1}^{\infty } \sqrt{{\text {Var}}(X_i)}\right] ^2. \) By using this, as \(n\rightarrow \infty \) , we have

It follows that

Let t be fixed. Under Condition C3 and \({\text {E}}(\tilde{T})=\sum _{r=1}^{\infty }\lambda _{r}<\infty \) , as \(q\rightarrow \infty \) , we have \(\sum _{r=q+1}^{\infty }\lambda _{r}\longrightarrow 0\) . Thus, for any given \(\epsilon >0\) , there exist \(N_{1}\) and \(Q_{1}\) , depending on | t | and \(\epsilon \) , such that as \(n>N_{1}\) and \(q>Q_{1}\) , we have

For any fixed \(q>Q_{1}\) , by ( A5 ), as \(n\rightarrow \infty \) , we have \(\tilde{T}_{n}^{(q)}{\mathop {\longrightarrow }\limits ^\mathcal{L}}\tilde{T}^{(q)}{\mathop {=}\limits ^{d}}\sum _{r=1}^{q}\lambda _{r}A_{r},\;A_{r}{\mathop {\sim }\limits ^{\text{ i.i.d. }}}\chi _{1}^{2}\) . Thus, there exists \(N_{2}\) , depending on q and \(\epsilon \) such that as \(n>N_{2}\) , we have

Recall that \(\tilde{T}=\sum _{r=1}^{\infty }\lambda _{r}A_{r},\;A_{r}{\mathop {\sim }\limits ^{\text{ i.i.d. }}}\chi _{1}^{2}\) . Along the same lines as those for proving ( A7 ), we can show that there exists \(Q_{2}\) , depending on | t | and \(\epsilon \) , such that as \(q>Q_{2}\) , we have

It follows from ( A7 )–( A9 ) that for any \(n\ge \max (N_{1},N_{2})\) and \(q\ge \max (Q_{1},Q_{2})\) , we have

The convergence in distribution of \(\tilde{T}_{n}\) to \(\tilde{T}\) follows as we can let \(\epsilon \rightarrow 0\) . \(\square \)

Proof of Theorem 2

Under Condition C1, by Lemma  8 , we have

In addition, under Conditions C1, C2, and C4 (since under Condition C4, we have Condition C3 holds automatically), by Theorem  1 , as \(n\rightarrow \infty \) , we have \(\tilde{T}_{n}{\mathop {\longrightarrow }\limits ^\mathcal{L}}\tilde{T}\) where \(\tilde{T}=\sum _{r=1}^{\infty }\lambda _{r}A_{r},\;A_{r}{\mathop {\sim }\limits ^{\text{ i.i.d. }}}\chi _{1}^{2}\) as given in Theorem  1 . Note \(\mathcal K _{3}(\tilde{T})=\sum _{r=1}^{\infty }\lambda _{r}^{3}\) and by ( 21 ) and some simple algebra, we have

where \(\varvec{y},\varvec{y}',\varvec{y}''{\mathop {\sim }\limits ^{\text{ i.i.d. }}}P\) . So for the last claim of the theorem, we only need to prove that under Conditions C1, C2, and C4, we have

If we can show under Conditions C1, C2, and C4, we have \(\sup _{n}{\text {E}}(|\tilde{T}_{n}|^{4})<\infty \) , then by the uniform integrability, \({\text {E}}(\tilde{T}_{n}^{r})\longrightarrow {\text {E}}(\tilde{T}^{r})\) , \(1\le r\le 3\) , so it follows that ( A10 ) is true. To this end, for \(\alpha =1,2\) , define

so \(\tilde{V}_{\alpha \alpha }=\tilde{V}_{\alpha \alpha (1)}+\tilde{V}_{\alpha \alpha (2)}\) . To show \(\sup _{n}{\text {E}}(|\tilde{T}_{n}|^{4})<\infty \) , by ( 13 ), we only need to show \(\frac{n_{1}^{4}n_{2}^{4}}{n^{4}}{\text {E}}[|\tilde{V}_{\alpha \alpha (1)}|^{4}]\) , \(\frac{n_{1}^{4}n_{2}^{4}}{n^{4}}{\text {E}}[|\tilde{V}_{\alpha \alpha (2)}|^{4}\) ], and \(\frac{n_{1}^{4}n_{2}^{4}}{n^{4}}{\text {E}}[|\tilde{V}_{12}|^{4}]\) are all bounded. Under Condition C4 and by ( 16 ), we have \(|\tilde{K}(\varvec{y},\varvec{y}')|\le \tilde{B}_{K}\) , where \(\tilde{B}_{K}=4B_{K}\) . Then

so \(\frac{n_{1}^{4}n_{2}^{4}}{n^{4}}{\text {E}}[|\tilde{V}_{\alpha \alpha (1)}|^{4}]\le \tilde{B}^{4}\) . Furthermore, it can be shown that \(\frac{n_{1}^{4}n_{2}^{4}}{n^{4}}{\text {E}}[|\tilde{V}_{\alpha \alpha (2)}|^{4}]\le c_{1}\tilde{B}^{4}\) and \(\frac{n_{1}^{4}n_{2}^{4}}{n^{4}}{\text {E}}[|\tilde{V}_{12}|^{4}]\le c_{2}\tilde{B}^{4}\) for some constants \(c_{1}\) and \(c_{2}\) . The theorem is then proved. \(\square \)

Proof of Theorem 3

We only need to show that under Conditions C1, C2, and C4, \(\hat{M}_{i}{\mathop {\longrightarrow }\limits ^{P}}M_{i}\) , \(i=1,2,3,4\) . Firstly, under Conditions C1, C2, and C4, by Theorem 4.1 of Zhang and Smaga ( 2022 ), we have

uniformly for all \(\varvec{y}_{i}\) . So we have \(\hat{M}_{1}=\tilde{M}_{1}+O_{p}(n^{-1/2})\) and \(\hat{M}_{4}=\tilde{M}_{4}+O_{p}(n^{-1/2})\) , where

Note \(|K(\varvec{y},\varvec{y}')|\le B_{K}\) , by the law of large numbers, we have \(\tilde{M}_{1}{\mathop {\longrightarrow }\limits ^{P}}M_{1}\) and \(\tilde{M}_{4}{\mathop {\longrightarrow }\limits ^{P}}M_{4}\) . Therefore, \(\hat{M}_{1}{\mathop {\longrightarrow }\limits ^{P}}M_{1}\) and \(\hat{M}_{4}{\mathop {\longrightarrow }\limits ^{P}}M_{4}\) . Under Conditions C1, C2, and C4, the consistency of \(\hat{M}_{2}\) and \(\hat{M}_{3}\) can be proved similarly. \(\square \)

Proof of Theorem 4

First of all, under Condition C4, we have \(|\tilde{K}(\varvec{y},\varvec{y}')|\le 4B_{K}\) for all \(\varvec{y},\varvec{y}'\in \mathcal Y \) . Then by Lemma  8 , for \(\alpha \ne \beta ,\ \alpha ,\beta =1,2,\) we have

By the Cauchy–Schwarz inequality, we have

Then under Conditions C2 and C4, as \(n\rightarrow \infty \) , we have

where \(\tau \) is defined in Condition C2. It follows that as \(n\rightarrow \infty \) , we have \({\text {E}}(\tilde{T}_{n})/\sqrt{{\text {Var}}(S_{n})}\longrightarrow 0\) and \({\text {Var}}(\tilde{T}_{n})/{\text {Var}}(S_{n})\longrightarrow 0\) . Thus, we have \(\tilde{T}_{n}/\sqrt{{\text {Var}}(S_{n})}{\mathop {\longrightarrow }\limits ^{P}}0\) and hence (a) is proved. To show (b), notice that by the central limit theorem, as \(n_{\alpha }\rightarrow \infty \) , we have

Since \(\bar{\varvec{x}}_{\alpha },\ \alpha =1,2\) are independent and \(S_{n}=\frac{n_{1}n_{2}}{n^{3/2-\Delta }}(u_{1}/\sqrt{n_{1}}-u_{2}/\sqrt{n_{2}})\) , we have \(S_{n}/\sqrt{{\text {Var}}(S_{n})}{\mathop {\longrightarrow }\limits ^\mathcal{L}}\mathcal N (0,1).\) The theorem is then proved. \(\square \)

Proof of Theorem 5

Under Conditions C2 and C4, by Theorem  4 , we have \(|{\text {E}}(\tilde{T}_{n})|\le 4B_{K}\) , \({\text {Var}}(\tilde{T}_{n})\le 32B_{K}^{2}\) , \(\tilde{T}_{n}/\sqrt{{\text {Var}}(S_{n})}{\mathop {\longrightarrow }\limits ^{P}}0\) and \(S_{n}/\sqrt{{\text{ Var }}(S_{n})}{\mathop {\longrightarrow }\limits ^\mathcal {L}}\mathcal N (0,1)\) . It follows from ( 34 ) that

This means (a) is valid. To prove (b), let \(\hat{C}_{\alpha ^{*}}\) denote the estimated upper \(100\alpha ^{*}\) percentile of \(\tilde{T}_{n}\) for the given significance level \(\alpha ^{*}\) . Then as \(n\rightarrow \infty \) , we have

The theorem is proved. \(\square \)

Proof of Proposition 6

Denote \(\hat{\lambda }_{\text {MED},g}\) and \(\hat{\lambda }_{\text {MAD},g}\) as the kernel width selected by the same method as \(\hat{\lambda }_{\text {MED}}\) and \(\hat{\lambda }_{\text {MAD}}\) , respectively, but based on the transformed observations \(g(\varvec{y}_{1}),\dots ,g(\varvec{y}_{n})\) . And let \(d_{g}(\varvec{y}_{i},\varvec{y}_{j})=d(g(\varvec{y}_{i}),g(\varvec{y}_{j}))\) , \(i,j=1,\dots ,n\) . Then from \(d_{g}(\varvec{y}_{i},\varvec{y}_{j})=c\cdot d(\varvec{y}_{i},\varvec{y}_{j})\) , we have

so \(d_{g}^{2}(\varvec{y}_{i},\varvec{y}_{j})/\hat{\lambda }_{\text {MED},g}^{2}=d^{2}(\varvec{y}_{i},\varvec{y}_{j})/\hat{\lambda }_{\text {MED}}^{2}\) , and \(K(g(\varvec{y}_{i}),g(\varvec{y}_{j}))=K(\varvec{y}_{i},\varvec{y}_{j})\) , \(i,j=1,\dots ,n\) .

where \(\tilde{d_{g}^{2}}={\text {median}}_{i\ne j}[d_{g}^{2}(\varvec{y}_{i},\varvec{y}_{j})]=c^{2}\cdot \tilde{d^{2}}\) , so \(d_{g}^{2}(\varvec{y}_{i},\varvec{y}_{j})/\hat{\lambda }_{\text {MAD},g}^{2}=d^{2}(\varvec{y}_{i},\varvec{y}_{j})/\hat{\lambda }_{\text {MAD}}^{2}\) . \(\square \)

Proof of Corollary 7

For \(\hat{\lambda }_{\text {MED}}\) and \(\hat{\lambda }_{\text {MAD}}\) , the transformation \(g(\varvec{y}_{i})=s\cdot U\varvec{y}_{i}+\varvec{c}\) satisfies

where we have used the fact that the unitary operator U is linear and preserves the inner product. So the result is true according to Proposition 6 .

For \(\hat{\lambda }_{\text {SD}}\) , by its translation-invariant and scale-equivariant property, and also note it depends on the observations \(\varvec{y}_{1},\dots ,\varvec{y}_{n}\) only through their inner products (see Eq. (23) and (24) of Zhang et al. 2023 ), we can similarly get the desired result. \(\square \)


About this article

Zhou, B., Ong, Z.P. & Zhang, J.-T. A new maximum mean discrepancy based two-sample test for equal distributions in separable metric spaces. Stat Comput 34, 172 (2024). https://doi.org/10.1007/s11222-024-10483-9


Received: 08 March 2024. Accepted: 08 August 2024. Published: 25 August 2024.

DOI: https://doi.org/10.1007/s11222-024-10483-9


  • Characteristic reproducing kernel
  • Reproducing kernel Hilbert space
  • \(\chi ^{2}\) -Approximation
  • Two-sample test
  • High-dimensional data
  • Functional data


Introduction to Statistics and Data Science

Chapter 15 Hypothesis Testing: Two Sample Tests

15.1 Two Sample t Test

We can also use the t.test command to conduct a hypothesis test on data where we have samples from two populations. To introduce this, let's consider an example from sports analytics. In particular, let us consider the NBA draft and the value of a lottery pick in the draft. Teams which do not make the playoffs are entered into a lottery to determine the order of the top picks in the draft for the following year. These top 14 picks are called lottery picks.

Using historical data, we might want to compare the value of a lottery pick against that of players who were selected outside the lottery.

We can now make a boxplot comparing the career scoring averages between these two pick levels.
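A one-line sketch of the plot call, assuming a hypothetical data frame `nba` with columns `PTS` (career points per game) and `Lottery.Pick`:

```r
# Hypothetical `nba` data frame with columns PTS and Lottery.Pick.
boxplot(PTS ~ Lottery.Pick, data = nba, ylab = "Career PPG")
```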

[Boxplot: career points per game (PPG) for lottery vs. non-lottery picks]

From this boxplot we notice that the lottery picks tend to have a higher points per game (PPG) average. However, we certainly see many exceptions to this rule. We can also compute the averages of the PTS column for these two groups:
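One way to produce a summary like the table below, again assuming the hypothetical `nba` data frame:

```r
library(dplyr)
nba %>%
  group_by(Lottery.Pick) %>%
  summarize(ppg = mean(PTS), NumberPlayers = n())
```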

Lottery.Pick ppg NumberPlayers
Lottery 11.236927 371
Not Lottery 7.107924 366

This table once again demonstrates that the lottery picks tend to average more points. However, we might like to test this trend to see if we have sufficient evidence to conclude that it is real (the difference could also just be a function of sampling error).

15.1.1 Regression analysis

Our first technique for looking for a difference between our two categories is linear regression with a categorical explanatory variable. We fit a regression model of the form: \[PTS=\beta \delta_{\text{not lottery}}+\alpha\] where \(\delta_{\text{not lottery}}\) is equal to one if the draft pick fell outside the lottery and zero otherwise.

To see if this relationship is real we can form a confidence interval for the coefficients.
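A sketch of that fit, using the hypothetical `nba` data frame from above; `lm` codes the categorical `Lottery.Pick` column as a dummy variable automatically:

```r
fit <- lm(PTS ~ Lottery.Pick, data = nba)
confint(fit)  # confidence intervals for the intercept and the dummy coefficient
```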

From this we can see that lottery picks do tend to average more points per game over their careers. The magnitude of this effect is somewhere between 3.5 and 4.7 points more for lottery picks.

15.1.2 Two Sample t test approach

For this we can use the two-sample t-test to compare the means of these two distinct populations.

Here the alternative hypothesis is that the lottery players score more points \[H_A: \mu_L > \mu_{NL}\] thus the null hypothesis is \[H_0: \mu_L \leq \mu_{NL}.\] We can now perform the test in R using the same t.test command as before.
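A sketch of the call, with the same hypothetical `nba` data frame as before:

```r
# Split PTS by pick level using the formula interface.
t.test(PTS ~ Lottery.Pick, data = nba, alternative = "greater")
```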

Notice that I used the magic tilde ~ to split the PTS column into the lottery/non-lottery pick subdivisions. I could also do this manually and get the same answer:
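Both versions might look like this in R (same hypothetical names as above; because the factor level "Lottery" sorts before "Not Lottery", alternative = "greater" tests that the lottery mean is larger):

    # Formula interface: split PTS by the Lottery.Pick factor
    t.test(PTS ~ Lottery.Pick, data = draft, alternative = "greater")

    # Manual equivalent: extract the two groups explicitly
    lottery     <- draft$PTS[draft$Lottery.Pick == "Lottery"]
    not_lottery <- draft$PTS[draft$Lottery.Pick == "Not Lottery"]
    t.test(lottery, not_lottery, alternative = "greater")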

The very small p-value here gives us strong evidence that the population mean of the lottery picks is greater than the population mean of the non-lottery picks.

The 95% confidence interval also tells us that this difference is rather large (at least 3.85 points).

Conditions for using a two-sample t test:

These are roughly the same as the conditions for using a one sample t test, although we now need to assume that BOTH samples satisfy the conditions.

Must be looking for a difference in the population means (averages)

30 or greater samples in both groups (CLT)

  • If you have fewer than 30 in one sample, you can still use the t test, but you must then assume that the population is roughly mound shaped.

At this point you would probably like to know why we would ever want to do a two sample t test instead of a linear regression.

My answer is that a two sample t test is more robust against a difference in variance between the two groups. Recall that one of the assumptions of simple linear regression is that the variance of the residuals does not depend on the explanatory variable(s). By default R does a type of t test (Welch's t test) which does not assume equal variance between the two groups. This is the one advantage of using the t.test command.

15.1.2.1 Paired t test

Let's say we are trying to estimate the effect of a new training regimen on the 40-yard dash times of soccer players. Before implementing the training regimen we measure the 40-yard dash times of the 30 players. First let's read this data set into R.

First, we can compare the mean times before and after the training:

We could also make a side-by-side boxplot of the soccer players' times before and after the training.

[Boxplot: 40-yard dash times before and after the training]

We could do a simple t test to examine whether the mean of the players' times decreased (on average) after the training regimen was implemented. Here we have the alternative hypothesis \(H_a: \mu_b-\mu_a>0\) and thus the null hypothesis \(H_0: \mu_b-\mu_a \leq 0\). Using the two sample t test format in R we have:
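A sketch of that unpaired run, assuming the times were read into a data frame sprint with hypothetical columns before and after:

    sprint <- read.csv("sprint_times.csv")   # hypothetical file name

    # Unpaired: treats the before and after columns as independent samples;
    # alternative = "greater" encodes H_a: mu_b - mu_a > 0
    t.test(sprint$before, sprint$after, alternative = "greater")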

Here we cannot reject the null hypothesis that the training had no effect on the players' sprinting performance. However, we haven't used all of the information available to us in this scenario. The t test we have just run doesn't know that we recorded before and after times for the same players. As far as R knows, the before and after times could come from entirely different players, as if we were comparing one team which received the training against another which didn't. Therefore, R has to be pretty conservative in its conclusions: the differences between the two groups could be due to many reasons other than the training regimen. Maybe the second set of players just started off being a little bit faster, etc.

The data we collected is actually more powerful because we know the performance of the same players before and after the test. This greatly reduces the number of variables which need to be accounted for in our statistical test. Luckily, we can easily let R know that our data points are paired .
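With the same hypothetical sprint data frame, the paired version needs only one extra argument:

    # paired = TRUE tests the mean of the per-player differences
    t.test(sprint$before, sprint$after, alternative = "greater", paired = TRUE)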

Setting the paired keyword to TRUE lets R know that the two columns should be paired together during the test. We can see that running a paired t test gives us a much smaller p-value. Moreover, we can now safely conclude that the new training regimen is effective in at least modestly reducing the 40-yard dash times of the soccer players.

This is our first example from the huge subject of experimental design, the study of methods for creating data sets with more power to distinguish differences between groups. Where possible, it is better to collect data for the same subjects under two conditions, as this allows a more powerful statistical analysis of the data (i.e., a paired t test instead of an ordinary t test).

Whenever the assumptions are met for a paired t test, you will be expected to perform a paired t test in this class.

15.2 Two Sample Proportion Tests

We can also use statistical hypothesis testing to compare proportions between two samples. For example, we might conduct a survey of 100 smokers and 50 non-smokers to see whether they buy organic foods. If we find that 30/100 smokers buy organic and only 11/50 non-smokers buy organic, can we conclude that a larger fraction of smokers buy organic foods than non-smokers? Here \(H_a: p_s > p_n\) and \(H_0: p_s \leq p_n\).
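In R this is a prop.test call; a sketch with the survey counts above:

    # 30 of 100 smokers vs. 11 of 50 non-smokers buying organic;
    # alternative = "greater" encodes H_a: p_s > p_n
    prop.test(x = c(30, 11), n = c(100, 50), alternative = "greater")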

In this case we don’t have sufficient evidence to conclude that a larger fraction of smokers buy organic foods. It is common when analyzing survey data to want to compare proportions between populations.

The key assumptions when performing a two-sample proportion test are that we have at least 5 successes and 5 failures in BOTH samples.

15.3 Extra Example: Birth Weights and Smoking

For this example we are going to use data from a study on the risk factors associated with giving birth to a low-weight baby (sometimes defined as less than 2,500 grams). This data set is another one which is built into R. To load this data for analysis type:
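The birthwt data ships with the MASS package, so loading it looks like this:

    library(MASS)   # the birthwt data frame lives in MASS
    data(birthwt)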

You can view a description of the data by typing ?birthwt once it is loaded. To begin we could look at the raw birth weights of mothers who were smokers versus non-smokers. We can do some EDA on this data using a boxplot:

[Boxplot: birth weight by mother's smoking status]

From the boxplot we can see that the median birth weight of babies whose mothers smoked was smaller. We can test the data for a difference in the means using a t.test command.
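A sketch of that call (in birthwt, smoke is 0 for non-smokers and 1 for smokers, so the formula interface puts the non-smokers first):

    # alternative = "greater" encodes H_a: mu_nonsmoker > mu_smoker
    t.test(bwt ~ smoke, data = birthwt, alternative = "greater")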

Notice we can use the ~ shorthand to split the data into these two groups faster than filtering. Here we get a small p-value, meaning we have sufficient evidence to reject the null hypothesis that the mean weight of babies of women who smoked is greater than or equal to that of non-smokers.

Within this data set we also have a column low which classifies whether the baby's birth weight is considered low using the medical criterion (birth weight less than 2,500 grams):

We can see that smoking gives a higher fraction of low-weight births. However, this could just be due to sampling error so let’s run a proportion test to find out.
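A sketch of that proportion test, building the 2x2 table first:

    tab <- table(birthwt$smoke, birthwt$low)   # rows: smoke 0/1, cols: low 0/1
    tab

    # prop.test wants successes (low = 1) in the first column; with rows
    # ordered non-smokers then smokers, alternative = "less" encodes
    # H_a: p_low(non-smokers) < p_low(smokers)
    prop.test(tab[, c("1", "0")], alternative = "less")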

Once again we find we have sufficient evidence to reject the null hypothesis that smoking does not increase the risk of a low birth weight.

15.4 Homework

15.4.1 Concept Questions

  • What are the assumptions behind using a two sample proportion test? Hint: these will be the same as those for forming a confidence interval for the fraction of a population, except the assumptions need to hold for both samples.
  • What assumptions are required for a two sample t test with small \(N\leq 30\) sample sizes?
  • A paired t test may be used for any two sample experiment (True/False)
  • The power of any statistical test will increase with increasing sample sizes. (True/False)
  • Where possible it is better to collect data on the same individuals when trying to distinguish a difference in the average response to a condition (True/False)
  • The paired t test is a more powerful statistical test than a normal t test (True/False)

15.4.2 Practice Problems

For each of the scenarios below form the null and alternative hypothesis.

  • We have conducted an educational study on two classrooms of 30 students using two different teaching methods. The first method had 50% of students pass a standardized test, and the classroom using the second teaching method had 60% of the students pass.
  • A basketball coach is extremely superstitious and believes that when he wears his lucky tie the team has a greater chance of winning the game. He comes to you because he is looking to design an experiment to test this belief. If the team has 40 games in the upcoming season, design an experiment and the (null and alternative) hypotheses to test the coach's claims.

For the question below, work out the expected number of errors.

  • Before the Olympics all athletes are required to submit a urine sample to be tested for banned substances. This is done by estimating the concentration of certain compounds in the urine and is prone to some degree of laboratory error. In addition, the concentrations of these compounds are known to vary with the individual (genetics, diet, etc.). To weigh the evidence present in a drug test the laboratory conducts a statistical test. To ensure they don't falsely convict athletes of doping they use a significance level of \(\alpha=0.01\). If they test 3000 athletes, all of whom are clean, about how many will be falsely accused of doping? Explain the issue with this procedure.

15.4.3 Advanced Problems

Load the drug_use data set from the fivethirtyeight package. Run a hypothesis test to determine if a larger proportion of 22-23 year olds are using marijuana than 24-25 year olds. Interpret your results statistically and practically.

Import the data set Cavaliers_Home_Away_2016 . Form a hypothesis on whether being home or away for the game had an effect on the proportion of games won by the Cavaliers during the 2016-2017 season, test this hypothesis using a hypothesis test.

Load the data set animal_sleep and compare the average total sleep time (sleep_total column) between carnivores and herbivores, using the vore column to divide the data between the two categories. To begin, make a boxplot to compare the total sleep time between these two categories. Do we have sufficient evidence to conclude the average total sleep time differs between these groups?

Load the HR_Employee_Attrition data set. We wish to investigate whether the daily rate (pay) has anything to do with whether an employee has quit (the attrition column is "Yes"). To begin, make a boxplot of the DailyRate column split into these Attrition categories. Use the boxplot to help form the null hypothesis for your test and decide on an alternative hypothesis. Conduct a statistical hypothesis test to determine if we have sufficient evidence to conclude that those employees who quit tended to be paid less. Report and interpret the p value for your test.

Load the BirdCaptureData data set. Perform a hypothesis test to determine if the proportion of orange-crowned warblers (SpeciesCode==OCWA) caught at the station is truly less than the proportion of Yellow Warblers (SpeciesCode==YWAR). Report your p value and interpret the results statistically and practically.

(All of Statistics Problem) In 1861, 10 essays appeared in the New Orleans Daily Crescent. They were signed “Quintus Curtius Snodgrass” and one hypothesis is that these essays were written by Mark Twain. One way to look for similarity between writing styles is to compare the proportion of three letter words found in two works. For 8 Mark Twain essays we have:

From 10 Snodgrass essays we have that:

  • Perform a two sample t test to examine these two data sets for a difference in the mean values. Report your p value and a 95% confidence interval for the results.
  • What are the issues with using a t-test on this data?

Consider the analysis of the kidiq data set again.

  • Run a regression with kid_score as the response and mom_hs as the explanatory variable and look at the summary() of your results. Notice the p-value which is reported in the last line of the summary. This “F-test” is a hypothesis test with the null hypothesis that the explanatory variable tells us nothing about the value of the response variable.
  • Perform a t test for a difference in means in the kid_score values based on the mom_hs column. What is your conclusion?
  • Repeat the t test again using the command:

dataanalysisclassroom: making data analysis easy

Lesson 98 – The Two-Sample Hypothesis Tests using the Bootstrap

Two-Sample Hypothesis Tests – Part VII

\(H_{0}: P(\theta_{x}>\theta_{y}) = 0.5\)

These days, a peek out of the window is greeted by chilling rain or warm snow. On days when it is not raining or snowing, there is biting cold. So we gaze at our bicycles, waiting for that pleasant month of April when we can joyfully bike — to work, or for pleasure.

Speaking of bikes, since I have nothing much to do today except watch the snow, I decided to explore some data from our favorite “ Open Data for All New Yorkers ” page.

Interestingly, I found data on the bicycle counts for East River Bridges . New York City DOT keeps track of the daily total of bike counts on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge.


I could find the data for April to October during 2016 and 2017. Here is how the data for April 2017 looks.

[Table: daily bike counts on the East River bridges, April 2017]

Being a frequent biker on the Manhattan Bridge, my curiosity got kindled. I wanted to verify how different the total bike counts on the Manhattan Bridge are from the Williamsburg Bridge.

At the same time, I also wanted to share the benefits of the bootstrap method for two-sample hypothesis tests.

To keep it simple and easy for you to follow the bootstrap method’s logical development, I will test how different the total bike counts data on Manhattan Bridge are from that of the Williamsburg Bridge during all the non-holiday weekdays with no precipitation.

Here is the data of the total bike counts on Manhattan Bridge during all the non-holiday weekdays with no precipitation in April of 2017 — essentially, the data from the yellow-highlighted rows in the table for Manhattan Bridge.

5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

And the data of the total bike counts on Williamsburg Bridge during all the non-holiday weekdays with no precipitation in April of 2017.

5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196

Their distributions look like this.

[Histograms: total bike counts, Manhattan Bridge vs. Williamsburg Bridge]

I want answers to the following questions:

  • Is the mean of the total bike counts on the Manhattan Bridge different from that on the Williamsburg Bridge, i.e., can we reject \(\bar{x}_{M}=\bar{x}_{W}\)?
  • Is the same true for the medians and other statistics of the two samples?

What do we know so far?  

We know how to test the difference in means using the t-Test under the proposition that the population variances are equal ( Lesson 94 ) or using Welch’s t-Test when we cannot assume equality of population variances ( Lesson 95 ). We also know how to do this using Wilcoxon’s Rank-sum Test that uses the ranking method to approximate the significance of the differences in means ( Lesson 96 ).

We know how to test the equality of variances using F-distribution ( Lesson 97 ).

We know how to test the difference in proportions using either Fisher’s Exact test ( Lesson 92 ) or using the normal distribution as the null distribution under the large-sample approximation ( Lesson 93 ).

In all these tests, we made critical assumptions on the limiting distributions of the test-statistics.
  • What is the limiting distribution of the test-statistic that computes the difference in medians?
  • What is the limiting distribution of the test-statistic that compares interquartile ranges of two populations?
  • What if we do not want to make any assumptions on data distributions or the limiting forms of the test-statistics?

Enter the Bootstrap

I would urge you to go back to Lesson 79 to get a quick refresher on the bootstrap, and Lesson 90 to recollect how we used it for the one-sample hypothesis tests.


Take the data for Manhattan Bridge. 5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these eight data points is 1/8, we can randomly draw eight numbers from these eight values —  with replacement .

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774), some may appear more than one time, and some may not show up at all in a random sample.

Here is one such bootstrap replicate. 6359, 6359, 6359, 6052, 6774, 6359, 5276, 6359

The value 6359 appeared five times. Some values like 7247, 5054, 6691, and 5311 did not appear at all in this replicate.

Here is another replicate. 6359, 5276, 5276, 5276, 7247, 5311, 6052, 5311

Such bootstrap replicates are representations of the empirical distribution \(\hat{f}\), i.e., the proportion of times each value in the data sample occurs. We can generate all the information contained in the true distribution by creating \(\hat{f}\), the empirical distribution.

Using the Bootstrap for Two-Sample Hypothesis Tests

Since each bootstrap replicate is a possible representation of the population, we can compute the relevant test-statistics from this bootstrap sample. By repeating this, we can have many simulated values of the test-statistics that form the null distribution to test the hypothesis. There is no need to make any assumptions on the distributional nature of the data or the limiting distribution for the test-statistic . As long as we can compute a test-statistic from the bootstrap sample, we can test the hypothesis on any statistic — mean, median, variance, interquartile range, proportion, etc.

Suppose \(\theta_{x}\) is the statistic of interest computed from sample X, and \(\theta_{y}\) is the same statistic computed from sample Y. The null hypothesis is that there is no difference between the statistic of X and that of Y:

\(H_{0}: P(\theta_{x}>\theta_{y}) = 0.5\)

The one-sided alternate hypothesis is that \(\theta_{x}\) tends to exceed \(\theta_{y}\):

\(H_{A}: P(\theta_{x}>\theta_{y}) > 0.5\)

For example, one bootstrap replicate for X (Manhattan Bridge) and Y (Williamsburg Bridge) may give \(\bar{x}^{X}_{boot}<\bar{x}^{Y}_{boot}\), while another bootstrap replicate for X and Y may give the opposite ordering. For each replicate i, record an indicator \(S_{i} \in \{0,1\}\) that equals 1 when the statistic computed from the X replicate exceeds the statistic computed from the Y replicate.

The proportion of times \(S_{i}=1\) in a set of N bootstrap-replicated statistics is the p-value:

\[p\text{-value}=\frac{1}{N}\sum_{i=1}^{N}S_{i}\]

Manhattan Bridge vs. Williamsburg Bridge

The null hypothesis is that there is no difference between the mean total bike counts on the two bridges:

\(H_{0}: P(\bar{x}_{M}>\bar{x}_{W}) = 0.5\)

Let's take a two-sided alternate hypothesis. For each pair of bootstrap replicates we compare the two means (equivalently, we can check whether the ratio \(\frac{\bar{x}_{M}}{\bar{x}_{W}}\) exceeds 1). Can we reject the null hypothesis if we select a 5% rate of error?

The same machinery applies to the medians:

\(H_{0}: P(\tilde{x}_{M}>\tilde{x}_{W}) = 0.5\)
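Here is a minimal R sketch of the whole procedure for the bridge data; the helper name boot_test is an invention for illustration:

    x <- c(5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774)   # Manhattan
    y <- c(5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196)   # Williamsburg

    set.seed(1)
    N <- 10000
    boot_test <- function(stat) {
      # proportion of replicates where the X statistic exceeds the Y statistic
      S <- replicate(N, stat(sample(x, replace = TRUE)) >
                        stat(sample(y, replace = TRUE)))
      mean(S)
    }

    boot_test(mean)     # compare the means
    boot_test(median)   # compare the medians
    # Under H0 these proportions should be near 0.5; for a two-sided test
    # with a 5% rate of error, reject H0 if the proportion falls outside
    # (0.025, 0.975).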

Can you see the bootstrap concept’s flexibility and how widely we can apply it for hypothesis testing? Just remember that the underlying assumption is that the data are independent. 

To summarize, the bootstrap two-sample test poses, for any chosen statistic \(\theta\), the null hypothesis

\(P(\theta_{x}>\theta_{y})=0.5\)

and rejects it when the proportion of bootstrap replicates in which the statistic from one sample exceeds that from the other is far from 0.5.

After seven lessons, we are now equipped with all the theory of the two-sample hypothesis tests. It is time to put them to practice. Dust off your programming machines and get set.

If you find this useful, please like, share and subscribe. You can also follow me on Twitter  @realDevineni  for updates on new lessons.


Teach yourself statistics

Hypothesis Test: Difference Between Means

This lesson explains how to conduct a hypothesis test for the difference between two means. The test procedure, called the two-sample t-test , is appropriate when the following conditions are met:

  • The sampling method for each sample is simple random sampling.
  • The samples are independent.
  • Each population is at least 20 times larger than its respective sample.
  • The sampling distribution of the difference is approximately normal, which holds when any one of the following is true:
      • The population distribution is normal.
      • The population data are symmetric, unimodal, without outliers, and the sample size is 15 or less.
      • The population data are slightly skewed, unimodal, without outliers, and the sample size is 16 to 40.
      • The sample size is greater than 40, without outliers.

This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.

State the Hypotheses

Every hypothesis test requires the analyst to state a null hypothesis and an alternative hypothesis . The hypotheses are stated in such a way that they are mutually exclusive. That is, if one is true, the other must be false; and vice versa.

The table below shows three sets of null and alternative hypotheses. Each makes a statement about the difference d between the mean of one population μ 1 and the mean of another population μ 2 . (In the table, the symbol ≠ means " not equal to ".)

Set Null hypothesis Alternative hypothesis Number of tails
1 μ1 - μ2 = d μ1 - μ2 ≠ d 2
2 μ1 - μ2 ≥ d μ1 - μ2 < d 1
3 μ1 - μ2 ≤ d μ1 - μ2 > d 1

The first set of hypotheses (Set 1) is an example of a two-tailed test , since an extreme value on either side of the sampling distribution would cause a researcher to reject the null hypothesis. The other two sets of hypotheses (Sets 2 and 3) are one-tailed tests , since an extreme value on only one side of the sampling distribution would cause a researcher to reject the null hypothesis.

When the null hypothesis states that there is no difference between the two population means (i.e., d = 0), the null and alternative hypothesis are often stated in the following form.

H0: μ1 = μ2

Ha: μ1 ≠ μ2

Formulate an Analysis Plan

The analysis plan describes how to use sample data to accept or reject the null hypothesis. It should specify the following elements.

  • Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value between 0 and 1 can be used.
  • Test method. Use the two-sample t-test to determine whether the difference between means found in the sample is significantly different from the hypothesized difference between means.

Analyze Sample Data

Using sample data, find the standard error, degrees of freedom, test statistic, and the P-value associated with the test statistic.

\(SE = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}\)

\(DF = \dfrac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}\)

\(t = \dfrac{(\bar{x}_1 - \bar{x}_2) - d}{SE}\)

  • P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a t statistic, use the t Distribution Calculator to assess the probability associated with the t statistic, having the degrees of freedom computed above. (See sample problems at the end of this lesson for examples of how this is done.)

Interpret Results

If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level , and rejecting the null hypothesis when the P-value is less than the significance level.

Test Your Understanding

In this section, two sample problems illustrate how to conduct a hypothesis test of a difference between mean scores. The first problem involves a two-tailed test; the second problem, a one-tailed test.

Problem 1: Two-Tailed Test

Within a school district, students were randomly assigned to one of two Math teachers - Mrs. Smith and Mrs. Jones. After the assignment, Mrs. Smith had 30 students, and Mrs. Jones had 25 students.

At the end of the year, each class took the same standardized test. Mrs. Smith's students had an average test score of 78, with a standard deviation of 10; and Mrs. Jones' students had an average test score of 85, with a standard deviation of 15.

Test the hypothesis that Mrs. Smith and Mrs. Jones are equally effective teachers. Use a 0.10 level of significance. (Assume that student performance is approximately normal.)

Solution: The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:

State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.

Null hypothesis: μ1 - μ2 = 0

Alternative hypothesis: μ1 - μ2 ≠ 0

  • Formulate an analysis plan . For this analysis, the significance level is 0.10. Using sample data, we will conduct a two-sample t-test of the null hypothesis.

\(SE = \sqrt{\dfrac{10^2}{30} + \dfrac{15^2}{25}} = \sqrt{3.33 + 9} = \sqrt{12.33} = 3.51\)

\(DF = \dfrac{(10^2/30 + 15^2/25)^2}{\dfrac{(10^2/30)^2}{29} + \dfrac{(15^2/25)^2}{24}} = \dfrac{152.03}{0.382 + 3.375} = \dfrac{152.03}{3.757} = 40.47\)

\(t = \dfrac{(\bar{x}_1 - \bar{x}_2) - d}{SE} = \dfrac{(78 - 85) - 0}{3.51} = \dfrac{-7}{3.51} = -1.99\)

where \(s_1\) is the standard deviation of sample 1, \(s_2\) is the standard deviation of sample 2, \(n_1\) is the size of sample 1, \(n_2\) is the size of sample 2, \(\bar{x}_1\) is the mean of sample 1, \(\bar{x}_2\) is the mean of sample 2, d is the hypothesized difference between the population means, and SE is the standard error.

Since we have a two-tailed test , the P-value is the probability that a t statistic having 40 degrees of freedom is more extreme than -1.99; that is, less than -1.99 or greater than 1.99.

We use the t Distribution Calculator to find P(t < -1.99) is about 0.027.

  • If you enter 1.99 as the sample mean in the t Distribution Calculator, you will find that P(t ≤ 1.99) is about 0.973. Therefore, P(t > 1.99) is 1 minus 0.973, or 0.027. Thus, the P-value = 0.027 + 0.027 = 0.054.
  • Interpret results . Since the P-value (0.054) is less than the significance level (0.10), we reject the null hypothesis.

Note: If you use this approach on an exam, you may also want to mention why this approach is appropriate. Specifically, the approach is appropriate because the sampling method was simple random sampling, the samples were independent, the sample size was much smaller than the population size, and the samples were drawn from a normal population.
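As a quick check of the arithmetic, these calculations can be reproduced in R (a sketch; pt() plays the role of the t Distribution Calculator):

    s1 <- 10; n1 <- 30; s2 <- 15; n2 <- 25
    se    <- sqrt(s1^2/n1 + s2^2/n2)                        # 3.51
    dof   <- (s1^2/n1 + s2^2/n2)^2 /
             ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))  # 40.47
    tstat <- ((78 - 85) - 0) / se                           # -1.99
    2 * pt(tstat, dof)                                      # two-tailed P ~ 0.054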

Problem 2: One-Tailed Test

The Acme Company has developed a new battery. The engineer in charge claims that the new battery will operate continuously for at least 7 minutes longer than the old battery.

To test the claim, the company selects a simple random sample of 100 new batteries and 100 old batteries. The old batteries run continuously for 190 minutes with a standard deviation of 20 minutes; the new batteries, 200 minutes with a standard deviation of 40 minutes.

Test the engineer's claim that the new batteries run at least 7 minutes longer than the old. Use a 0.05 level of significance. (Assume that there are no outliers in either sample.)

Null hypothesis: μ1 - μ2 ≤ 7

Alternative hypothesis: μ1 - μ2 > 7

where μ1 is battery life for the new battery, and μ2 is battery life for the old battery.

  • Formulate an analysis plan . For this analysis, the significance level is 0.05. Using sample data, we will conduct a two-sample t-test of the null hypothesis.

\(SE = \sqrt{\dfrac{40^2}{100} + \dfrac{20^2}{100}} = \sqrt{16 + 4} = 4.472\)

\(DF = \dfrac{(40^2/100 + 20^2/100)^2}{\dfrac{(40^2/100)^2}{99} + \dfrac{(20^2/100)^2}{99}} = \dfrac{(20)^2}{\dfrac{(16)^2}{99} + \dfrac{(4)^2}{99}} = \dfrac{400}{2.586 + 0.162} = 145.56\)

\(t = \dfrac{(\bar{x}_1 - \bar{x}_2) - d}{SE} = \dfrac{(200 - 190) - 7}{4.472} = \dfrac{3}{4.472} = 0.67\)

where \(s_1\) is the standard deviation of sample 1, \(s_2\) is the standard deviation of sample 2, \(n_1\) is the size of sample 1, \(n_2\) is the size of sample 2, \(\bar{x}_1\) is the mean of sample 1, \(\bar{x}_2\) is the mean of sample 2, d is the hypothesized difference between population means, and SE is the standard error.

Here is the logic of the analysis: Given the alternative hypothesis (μ 1 - μ 2 > 7), we want to know whether the observed difference in sample means is big enough (i.e., sufficiently greater than 7) to cause us to reject the null hypothesis.

Interpret results . Suppose we replicated this study many times with different samples. If the true difference in population means were actually 7, we would expect the observed difference in sample means to be 10 or less in 75% of our samples, and we would expect an observed difference of more than 10 in 25% of our samples. Therefore, the P-value in this analysis is 0.25. Since the P-value (0.25) is greater than the significance level (0.05), we cannot reject the null hypothesis.
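The same kind of R sketch verifies this one-tailed P-value:

    s1 <- 40; n1 <- 100; s2 <- 20; n2 <- 100
    se    <- sqrt(s1^2/n1 + s2^2/n2)                        # 4.472
    dof   <- (s1^2/n1 + s2^2/n2)^2 /
             ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))  # 145.56
    tstat <- ((200 - 190) - 7) / se                         # 0.67
    1 - pt(tstat, dof)                                      # one-tailed P ~ 0.25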


5.5 - Hypothesis Testing for Two-Sample Proportions

We are now going to develop the hypothesis test for the difference of two proportions for independent samples. The hypothesis test follows the same steps as for one group.

These notes are going to go into a little bit of math and formulas to help demonstrate the logic behind hypothesis testing for two groups. If this starts to get a little confusing, just skim over it for a general understanding! Remember we can rely on the software to do the calculations for us, but it is good to have a basic understanding of the logic!

We will use the sampling distribution of \(\hat{p}_1-\hat{p}_2\) as we did for the confidence interval.

For a test for two proportions, we are interested in the difference between two groups. If the difference is zero, then they are not different (i.e., they are equal). Therefore, the null hypothesis will always be:

\(H_0\colon p_1-p_2=0\)

Another way to look at it is \(H_0\colon p_1=p_2\). This is worth stopping to think about. Remember, in hypothesis testing, we assume the null hypothesis is true. In this case, it means that \(p_1\) and \(p_2\) are equal. Under this assumption, then \(\hat{p}_1\) and \(\hat{p}_2\) are both estimating the same proportion. Think of this proportion as \(p^*\).

Therefore, the sampling distribution of both proportions, \(\hat{p}_1\) and \(\hat{p}_2\), will, under certain conditions, be approximately normal centered around \(p^*\), with standard error \(\sqrt{\dfrac{p^*(1-p^*)}{n_i}}\), for \(i=1, 2\).

We take this into account by finding an estimate for this \(p^*\) using the two-sample proportions. We can calculate an estimate of \(p^*\) using the following formula:

\(\hat{p}^*=\dfrac{x_1+x_2}{n_1+n_2}\)

This value is the total number in the desired categories \((x_1+x_2)\) from both samples over the total number of sampling units in the combined sample \((n_1+n_2)\).

Putting everything together, if we assume \(p_1=p_2\), then the sampling distribution of \(\hat{p}_1-\hat{p}_2\) will be approximately normal with mean 0 and standard error of \(\sqrt{p^*(1-p^*)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\), under certain conditions.

\(z^*=\dfrac{(\hat{p}_1-\hat{p}_2)-0}{\sqrt{\hat{p}^*(1-\hat{p}^*)\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}\)

...will follow a standard normal distribution.

Finally, we can develop our hypothesis test for \(p_1-p_2\).

Hypothesis Testing for Two-Sample Proportions

Conditions :

\(n_1\hat{p}_1\), \(n_1(1-\hat{p}_1)\), \(n_2\hat{p}_2\), and \(n_2(1-\hat{p}_2)\) are all greater than five

Test Statistic:

\(z^*=\dfrac{\hat{p}_1-\hat{p}_2-0}{\sqrt{\hat{p}^*(1-\hat{p}^*)\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}\)

...where \(\hat{p}^*=\dfrac{x_1+x_2}{n_1+n_2}\).

The critical values, p-values, and decisions will all follow the same steps as those from a hypothesis test for a one-sample proportion.
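As a sketch of these formulas in R (the function name two_prop_z is an invention for illustration; prop.test with correct = FALSE gives the matching answer, since its X-squared statistic equals the square of z*):

    two_prop_z <- function(x1, n1, x2, n2) {
      p_pool <- (x1 + x2) / (n1 + n2)                       # pooled estimate
      se     <- sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
      z      <- (x1/n1 - x2/n2) / se
      c(z = z, p.two.sided = 2 * pnorm(-abs(z)))
    }

    # Example input: 30 of 100 vs. 11 of 50 successes
    two_prop_z(30, 100, 11, 50)
    prop.test(c(30, 11), c(100, 50), correct = FALSE)       # X-squared = z^2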


Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans . Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H0) and alternate hypothesis (Ha or H1).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.

Table of contents

  • Step 1: State your null and alternate hypothesis
  • Step 2: Collect data
  • Step 3: Perform a statistical test
  • Step 4: Decide whether to reject or fail to reject your null hypothesis
  • Step 5: Present your findings
  • Other interesting articles
  • Frequently asked questions about hypothesis testing

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H o ) and alternate (H a ) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H0: Men are, on average, not taller than women. Ha: Men are, on average, taller than women.


For a statistical test to be valid , it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p -value . This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p -value. This means it is likely that any difference you measure between groups is due to chance.

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data .

For example, a t test comparing the two groups' heights will give you:

  • an estimate of the difference in average height between the two groups.
  • a p-value showing how likely you are to see this difference if the null hypothesis of no difference is true.

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p -value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).

The results of hypothesis testing will be presented in the results and discussion sections of your research paper , dissertation or thesis .

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p -value). In the discussion , you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis . But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis .

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.



Module 10: Hypothesis Testing With Two Samples

Putting It Together: Hypothesis Testing With Two Samples

Let's summarize:

  • The steps for performing a hypothesis test for two population means with unknown standard deviation is generally the same as the steps for conducting a hypothesis test for one population mean with unknown standard deviation, using a t -distribution.
  • Because the population standard deviations are not known, the sample standard deviations are used for calculations.
  • When the sum of the sample sizes is more than 30, a normal distribution can be used to approximate the student’s  t -distribution.
  • The difference of two proportions is approximately normal if there are at least five successes and five failures in each sample.
  • When conducting a hypothesis test for a difference of two proportions, the random samples must be independent and the population must be at least ten times the sample size.
  • When calculating the standard error for the difference in sample proportions, the pooled proportion must be used.
  • When two measurements (samples) are drawn from the same pair of individuals or objects, the differences from the sample are used to conduct the hypothesis test.
  • The distribution that is used to conduct the hypothesis test on the differences is a t -distribution.
  • Provided by : Lumen Learning. License : CC BY: Attribution
  • Introductory Statistics. Authored by : Barbara Illowsky, Susan Dean. Provided by : OpenStax. Located at : https://openstax.org/books/introductory-statistics/pages/1-introduction . License : CC BY: Attribution . License Terms : Access for free at https://openstax.org/books/introductory-statistics/pages/1-introduction


Statistics By Jim

Making statistics intuitive

Hypothesis Testing: Uses, Steps & Example

By Jim Frost

What is Hypothesis Testing?

Hypothesis testing in statistics uses sample data to infer the properties of a whole population . These tests determine whether a random sample provides sufficient evidence to conclude an effect or relationship exists in the population. Researchers use them to help separate genuine population-level effects from false effects that random chance can create in samples. These methods are also known as significance testing.


For example, researchers are testing a new medication to see if it lowers blood pressure. They compare a group taking the drug to a control group taking a placebo. If their hypothesis test results are statistically significant, the medication’s effect of lowering blood pressure likely exists in the broader population, not just the sample studied.

Using Hypothesis Tests

A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement the sample data best supports. These two statements are called the null hypothesis and the alternative hypothesis . The following are typical examples:

  • Null Hypothesis : The effect does not exist in the population.
  • Alternative Hypothesis : The effect does exist in the population.

Hypothesis testing accounts for the inherent uncertainty of using a sample to draw conclusions about a population, which reduces the chances of false discoveries. These procedures determine whether the sample data are sufficiently inconsistent with the null hypothesis that you can reject it. If you can reject the null, your data favor the alternative statement that an effect exists in the population.

Statistical significance in hypothesis testing indicates that an effect you see in sample data also likely exists in the population after accounting for random sampling error , variability, and sample size. Your results are statistically significant when the p-value is less than your significance level or, equivalently, when your confidence interval excludes the null hypothesis value.

Conversely, non-significant results indicate that despite an apparent sample effect, you can’t be sure it exists in the population. It could be chance variation in the sample and not a genuine effect.

Learn more about Failing to Reject the Null .

5 Steps of Significance Testing

Hypothesis testing involves five key steps, each critical to validating a research hypothesis using statistical methods:

  • Formulate the Hypotheses : Write your research hypotheses as a null hypothesis (H 0 ) and an alternative hypothesis (H A ).
  • Data Collection : Gather data specifically aimed at testing the hypothesis.
  • Conduct A Test : Use a suitable statistical test to analyze your data.
  • Make a Decision : Based on the statistical test results, decide whether to reject the null hypothesis or fail to reject it.
  • Report the Results : Summarize and present the outcomes in your report’s results and discussion sections.

While the specifics of these steps can vary depending on the research context and the data type, the fundamental process of hypothesis testing remains consistent across different studies.

Let’s work through these steps in an example!

Hypothesis Testing Example

Researchers want to determine if a new educational program improves student performance on standardized tests. They randomly assign 30 students to a control group , which follows the standard curriculum, and another 30 students to a treatment group, which participates in the new educational program. After a semester, they compare the test scores of both groups.

Download the CSV data file to perform the hypothesis testing yourself: Hypothesis_Testing .

The researchers write their hypotheses. These statements apply to the population, so they use the mu (μ) symbol for the population mean parameter .

  • Null Hypothesis (H 0 ) : The population means of the test scores for the two groups are equal (μ 1 = μ 2 ).
  • Alternative Hypothesis (H A ) : The population means of the test scores for the two groups are unequal (μ 1 ≠ μ 2 ).

Choosing the correct hypothesis test depends on attributes such as data type and number of groups. Because they’re using continuous data and comparing two means, the researchers use a 2-sample t-test .
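In R, such a test could be run as below; the column names Group and Score are hypothetical stand-ins for whatever the downloadable CSV actually uses:

    scores <- read.csv("Hypothesis_Testing.csv")  # assumed file name/format
    t.test(Score ~ Group, data = scores)          # Welch two-sample t-test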

Here are the results.

Hypothesis testing results for the example.

The treatment group’s mean is 58.70, compared to the control group’s mean of 48.12. The mean difference is 10.67 points. Use the test’s p-value and significance level to determine whether this difference is likely a product of random fluctuation in the sample or a genuine population effect.

Because the p-value (0.000) is less than the standard significance level of 0.05, the results are statistically significant, and we can reject the null hypothesis. The sample data provides sufficient evidence to conclude that the new program’s effect exists in the population.

Limitations

Hypothesis testing improves your effectiveness in making data-driven decisions. However, it is not 100% accurate because random samples occasionally produce fluky results. Hypothesis tests have two types of errors, both relating to drawing incorrect conclusions.

  • Type I error: The test rejects a true null hypothesis—a false positive.
  • Type II error: The test fails to reject a false null hypothesis—a false negative.

Learn more about Type I and Type II Errors .

Our exploration of hypothesis testing using a practical example of an educational program reveals its powerful ability to guide decisions based on statistical evidence. Whether you’re a student, researcher, or professional, understanding and applying these procedures can open new doors to discovering insights and making informed decisions. Let this tool empower your analytical endeavors as you navigate through the vast seas of data.

Learn more about the Hypothesis Tests for Various Data Types .


Reader Interactions


June 10, 2024 at 10:51 am

Thank you, Jim, for another helpful article; timely too since I have started reading your new book on hypothesis testing and, now that we are at the end of the school year, my district is asking me to perform a number of evaluations on instructional programs. This is where my question/concern comes in. You mention that hypothesis testing is all about testing samples. However, I use all the students in my district when I make these comparisons. Since I am using the entire “population” in my evaluations (I don’t select a sample of third grade students, for example, but I use all 700 third graders), am I somehow misusing the tests? Or can I rest assured that my district’s student population is only a sample of the universal population of students?


June 10, 2024 at 1:50 pm

I hope you are finding the book helpful!

Yes, the purpose of hypothesis testing is to infer the properties of a population while accounting for random sampling error.

In your case, it comes down to how you want to use the results. Who do you want the results to apply to?

If you’re summarizing the sample, looking for trends and patterns, or evaluating those students and don’t plan to apply those results to other students, you don’t need hypothesis testing because there is no sampling error. They are the population and you can just use descriptive statistics. In this case, you’d only need to focus on the practical significance of the effect sizes.

On the other hand, if you want to apply the results from this group to other students, you’ll need hypothesis testing. However, there is the complicating issue of what population your sample of students represent. I’m sure your district has its own unique characteristics, demographics, etc. Your district’s students probably don’t adequately represent a universal population. At the very least, you’d need to recognize any special attributes of your district and how they could bias the results when trying to apply them outside the district. Or they might apply to similar districts in your region.

However, I’d imagine your 3rd graders probably adequately represent future classes of 3rd graders in your district. You need to be alert to changing demographics. At least in the short run I’d imagine they’d be representative of future classes.

Think about how these results will be used. Do they just apply to the students you measured? Then you don’t need hypothesis tests. However, if the results are being used to infer things about other students outside of the sample, you’ll need hypothesis testing along with considering how well your students represent the other students and how they differ.

I hope that helps!

June 10, 2024 at 3:21 pm

Thank you so much, Jim, for the suggestions in terms of what I need to think about and consider! You are always so clear in your explanations!!!!

June 10, 2024 at 3:22 pm

You’re very welcome! Best of luck with your evaluations!


JMP | Statistical Discovery.™ From SAS.

Statistics Knowledge Portal

A free online introduction to statistics

The Two-Sample t-Test

What is the two-sample t-test?

The two-sample t -test (also known as the independent samples t -test) is a method used to test whether the unknown population means of two groups are equal or not.

Is this the same as an A/B test?

Yes, a two-sample t -test is used to analyze the results from A/B tests.

When can I use the test?

You can use the test when your data values are independent, are randomly sampled from two normal populations and the two independent groups have equal variances.

What if I have more than two groups?

Use a multiple comparison method. Analysis of variance (ANOVA) is one such method. Other multiple comparison methods include the Tukey-Kramer test of all pairwise differences, analysis of means (ANOM) to compare group means to the overall mean or Dunnett’s test to compare each group mean to a control mean.

What if the variances for my two groups are not equal?

You can still use the two-sample t- test. You use a different estimate of the standard deviation. 

What if my data isn’t nearly normally distributed?

If your sample sizes are very small, you might not be able to test for normality. You might need to rely on your understanding of the data. When you cannot safely assume normality, you can perform a nonparametric test that doesn’t assume normality.

See how to perform a two-sample t -test using statistical software

  • Download JMP to follow along using the sample data included with the software.
  • To see more JMP tutorials, visit the JMP Learning Library .

Using the two-sample t -test

The sections below discuss what is needed to perform the test, checking our data, how to perform the test and statistical details.

What do we need?

For the two-sample t -test, we need two variables. One variable defines the two groups. The second variable is the measurement of interest.

We also have an idea, or hypothesis, that the means of the underlying populations for the two groups are different. Here are a couple of examples:

  • We have students who speak English as their first language and students who do not. All students take a reading test. Our two groups are the native English speakers and the non-native speakers. Our measurements are the test scores. Our idea is that the mean test scores for the underlying populations of native and non-native English speakers are not the same. We want to know if the mean score for the population of native English speakers is different from the people who learned English as a second language.
  • We measure the grams of protein in two different brands of energy bars. Our two groups are the two brands. Our measurement is the grams of protein for each energy bar. Our idea is that the mean grams of protein for the underlying populations for the two brands may be different. We want to know if we have evidence that the mean grams of protein for the two brands of energy bars is different or not.

Two-sample t -test assumptions

To conduct a valid test:

  • Data values must be independent. Measurements for one observation do not affect measurements for any other observation.
  • Data in each group must be obtained via a random sample from the population.
  • Data in each group are normally distributed .
  • Data values are continuous.
  • The variances for the two independent groups are equal.

For very small groups of data, it can be hard to test these requirements. Below, we'll discuss how to check the requirements using software and what to do when a requirement isn’t met.

Two-sample t -test example

One way to measure a person’s fitness is to measure their body fat percentage. Average body fat percentages vary by age, but according to some guidelines, the normal range for men is 15-20% body fat, and the normal range for women is 20-25% body fat.

Our sample data is from a group of men and women who did workouts at a gym three times a week for a year. Then, their trainer measured the body fat. The table below shows the data.

Table 1: Body fat percentage data grouped by gender

Group Body fat percentages
Men 13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0
Women 22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0

You can clearly see some overlap in the body fat measurements for the men and women in our sample, but also some differences. Just by looking at the data, it's hard to draw any solid conclusions about whether the underlying populations of men and women at the gym have the same mean body fat. That is the value of statistical tests – they provide a common, statistically valid way to make decisions, so that everyone makes the same decision on the same set of data values.

Checking the data

Let’s start by answering: Is the two-sample t -test an appropriate method to evaluate the difference in body fat between men and women?

  • The data values are independent. The body fat for any one person does not depend on the body fat for another person.
  • We assume the people measured represent a simple random sample from the population of members of the gym.
  • We assume the data are normally distributed, and we can check this assumption.
  • The data values are body fat measurements. The measurements are continuous.
  • We assume the variances for men and women are equal, and we can check this assumption.

Before jumping into analysis, we should always take a quick look at the data. The figure below shows histograms and summary statistics for the men and women.

Histogram and summary statistics for the body fat data

The two histograms are on the same scale. From a quick look, we can see that there are no very unusual points, or outliers . The data look roughly bell-shaped, so our initial idea of a normal distribution seems reasonable.

Examining the summary statistics, we see that the standard deviations are similar. This supports the idea of equal variances. We can also check this using a test for variances.

Based on these observations, the two-sample t -test appears to be an appropriate method to test for a difference in means.

How to perform the two-sample t -test

For each group, we need the average, standard deviation and sample size. These are shown in the table below.

Table 2: Average, standard deviation and sample size statistics grouped by gender

Group n Average Standard deviation
Women 10 22.29 5.32
Men 13 14.95 6.84

Without doing any testing, we can see that the averages for men and women in our samples are not the same. But how different are they? Are the averages “close enough” for us to conclude that mean body fat is the same for the larger population of men and women at the gym? Or are the averages too different for us to make this conclusion?

We'll further explain the principles underlying the two sample t -test in the statistical details section below, but let's first proceed through the steps from beginning to end. We start by calculating our test statistic. This calculation begins with finding the difference between the two averages:

$ 22.29 - 14.95 = 7.34 $

This difference in our samples estimates the difference between the population means for the two groups.

Next, we calculate the pooled standard deviation. This builds a combined estimate of the overall standard deviation. The estimate adjusts for different group sizes. First, we calculate the pooled variance:

$ s_p^2 = \frac{((n_1 - 1)s_1^2) + ((n_2 - 1)s_2^2)} {n_1 + n_2 - 2} $

$ s_p^2 = \frac{((10 - 1)5.32^2) + ((13 - 1)6.84^2)}{(10 + 13 - 2)} $

$ = \frac{(9\times28.30) + (12\times46.82)}{21} $

$ = \frac{(254.7 + 561.85)}{21} $

$ =\frac{816.55}{21} = 38.88 $

(Note: 46.82 is the unrounded sample variance for the men; squaring the rounded standard deviation 6.84 would give 46.79.)

Next, we take the square root of the pooled variance to get the pooled standard deviation. This is:

$ \sqrt{38.88} = 6.24 $

We now have all the pieces for our test statistic: the difference of the averages, the pooled standard deviation, and the sample sizes. We calculate our test statistic as follows:

$ t = \frac{\text{difference of group averages}}{\text{standard error of difference}} = \frac{7.34}{(6.24\times \sqrt{(1/10 + 1/13)})} = \frac{7.34}{2.62} = 2.80 $
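The same arithmetic can be scripted if you want to check it. A minimal sketch continuing the Python example above (again our illustration, not the article's JMP workflow):

```python
import numpy as np

# Table 1 data (as in the first sketch)
men = np.array([13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0,
                16.0, 24.0, 15.0, 1.0, 15.0])
women = np.array([22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0,
                  23.2, 28.0, 23.0])

n1, n2 = len(women), len(men)                    # 10 and 13
diff = women.mean() - men.mean()                 # ~7.34

# Pooled variance: weighted combination of the two sample variances
sp2 = ((n1 - 1) * women.var(ddof=1) +
       (n2 - 1) * men.var(ddof=1)) / (n1 + n2 - 2)   # ~38.88
sp = np.sqrt(sp2)                                # pooled SD, ~6.24

t_stat = diff / (sp * np.sqrt(1 / n1 + 1 / n2))  # ~2.80
print(round(sp2, 2), round(sp, 2), round(t_stat, 2))
```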

To evaluate the difference between the means, we compare the test statistic to a theoretical value from the t- distribution. This involves four steps:

  • We decide on the risk we are willing to take for declaring a significant difference. For the body fat data, we decide that we are willing to take a 5% risk of saying that the unknown population means for men and women are not equal when they really are. In statistics-speak, the significance level, denoted by α, is set to 0.05. It is a good practice to make this decision before collecting the data and before calculating test statistics.
  • We calculate a test statistic. Our test statistic is 2.80.
  • We find the theoretical value from the t- distribution based on our null hypothesis, which states that the means for men and women are equal. Most statistics books have look-up tables for the t- distribution, and tables are also available online, but in practice you will most likely use software (see the sketch after this list). To find this value, we need the significance level (α = 0.05) and the degrees of freedom . The degrees of freedom ( df ) are based on the sample sizes of the two groups. For the body fat data, this is: $ df = n_1 + n_2 - 2 = 10 + 13 - 2 = 21 $ The t value with α = 0.05 and 21 degrees of freedom is 2.080.
  • We compare the value of our statistic (2.80) to the t value. Since 2.80 > 2.080, we reject the null hypothesis that the mean body fat for men and women are equal, and conclude that we have evidence body fat in the population is different between men and women.
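Step 3 is a one-liner in most software. A minimal sketch using scipy (our choice; a printed table or JMP gives the same number) to find the two-sided critical value:

```python
from scipy import stats

alpha, df = 0.05, 21
# Two-sided test: put alpha/2 of the probability in each tail
t_crit = stats.t.ppf(1 - alpha / 2, df)
print(round(t_crit, 3))        # 2.08
print(abs(2.80) > t_crit)      # True -> reject the null hypothesis
```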

Statistical details

Let’s look at the body fat data and the two-sample t -test using statistical terms.

Our null hypothesis is that the underlying population means are the same. The null hypothesis is written as:

$ H_0:  \mu_1 = \mu_2 $

The alternative hypothesis is that the means are not equal. This is written as:

$ H_a:  \mu_1 \neq \mu_2 $

We calculate the average for each group, and then calculate the difference between the two averages. This is written as:

$\overline{x_1} -  \overline{x_2} $

We calculate the pooled standard deviation. This assumes that the underlying population variances are equal. The pooled variance formula is written as:

$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $

The formula shows the sample size for the first group as $n_1$ and the second group as $n_2$. The standard deviations for the two groups are $s_1$ and $s_2$. This estimate allows the two groups to have different numbers of observations. The pooled standard deviation is the square root of the pooled variance and is written as $s_p$.

What if your sample sizes for the two groups are the same? In this situation, the pooled estimate of variance is simply the average of the variances for the two groups:

$ s_p^2 = \frac{(s_1^2 + s_2^2)}{2} $

The test statistic is calculated as:

$ t = \frac{(\overline{x_1} -\overline{x_2})}{s_p\sqrt{1/n_1 + 1/n_2}} $

The numerator of the test statistic is the difference between the two group averages. It estimates the difference between the two unknown population means. The denominator is an estimate of the standard error of the difference between the two unknown population means. 

Technical Detail: For a single mean, the standard error is $ s/\sqrt{n} $. The formula above extends this idea to two groups by using a pooled estimate for s (the standard deviation) and allowing the groups to have different sizes.

We then compare the test statistic to a t value with our chosen alpha value and the degrees of freedom for our data. Using the body fat data as an example, we set α = 0.05. The degrees of freedom ( df ) are based on the group sizes and are calculated as:

$ df = n_1 + n_2 - 2 = 10 + 13 - 2 = 21 $

The formula shows the sample size for the first group as $n_1$ and the second group as $n_2$. Statisticians write the t value with α = 0.05 and 21 degrees of freedom as:

$ t_{0.05,21} $

The t value with α = 0.05 and 21 degrees of freedom is 2.080. There are two possible results from our comparison:

  • The absolute value of the test statistic is less than the t value. You fail to reject the null hypothesis of equal means. You conclude that the data are consistent with men and women having the same average body fat.
  • The absolute value of the test statistic is greater than the t value. You reject the null hypothesis of equal means. You conclude that the evidence supports a difference in average body fat between men and women.

t -Test with unequal variances

When the variances for the two groups are not equal, we cannot use the pooled estimate of standard deviation. Instead, we take the standard error for each group separately. The test statistic is:

$ t = \frac{ (\overline{x_1} -  \overline{x_2})}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} $

The numerator of the test statistic is the same. It is the difference between the averages of the two groups. The denominator is an estimate of the overall standard error of the difference between means. It is based on the separate standard error for each group.

The degrees of freedom calculation for the t value is more complex with unequal variances than equal variances and is usually left up to statistical software packages. The key point to remember is that if you cannot use the pooled estimate of standard deviation, then you cannot use the simple formula for the degrees of freedom.
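As a sketch of what the software does under the hood, the unequal-variance statistic and the Welch–Satterthwaite degrees of freedom (the approximation most packages use; an assumption on our part about what any particular package reports) can be computed directly:

```python
import numpy as np

# Table 1 data (as in the first sketch)
men = np.array([13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0,
                16.0, 24.0, 15.0, 1.0, 15.0])
women = np.array([22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0,
                  23.2, 28.0, 23.0])

v1, v2 = women.var(ddof=1), men.var(ddof=1)
n1, n2 = len(women), len(men)

se2 = v1 / n1 + v2 / n2                          # squared SE of the difference
t_welch = (women.mean() - men.mean()) / np.sqrt(se2)   # ~2.90

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch = se2**2 / ((v1 / n1)**2 / (n1 - 1) + (v2 / n2)**2 / (n2 - 1))
print(round(t_welch, 2), round(df_welch, 4))     # 2.9 20.9888
```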

Testing for normality

The normality assumption matters more when the two groups have small sample sizes than when the samples are large.

Normal distributions are symmetric, which means they are “even” on both sides of the center. Normal distributions do not have extreme values, or outliers. You can check these two features of a normal distribution with graphs. Earlier, we decided that the body fat data was “close enough” to normal to go ahead with the assumption of normality. The figure below shows a normal quantile plot for men and women, and supports our decision.

 Normal quantile plot of the body fat measurements for men and women

You can also perform a formal test for normality using software. The figure above shows results of testing for normality with JMP software. We test each group separately. Both the test for men and the test for women show that we cannot reject the hypothesis of a normal distribution. We can go ahead with the assumption that the body fat data for men and for women are normally distributed.
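If you are not using JMP, the same kind of check is available in most packages. A minimal sketch using the Shapiro–Wilk test from scipy (our substitution for illustration; the test in the figure may differ in detail):

```python
import numpy as np
from scipy import stats

# Table 1 data (as in the first sketch)
men = np.array([13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0,
                16.0, 24.0, 15.0, 1.0, 15.0])
women = np.array([22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0,
                  23.2, 28.0, 23.0])

# Test each group separately; a large p-value means we cannot
# reject the hypothesis of a normal distribution.
for name, x in [("Women", women), ("Men", men)]:
    stat, p = stats.shapiro(x)
    print(name, round(stat, 3), round(p, 3))
```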

Testing for unequal variances

Testing for unequal variances is complex. We won’t show the calculations in detail, but will show the results from JMP software. The figure below shows results of a test for unequal variances for the body fat data.

Test for unequal variances for the body fat data

Without diving into the details of the different types of tests for unequal variances, we will use the F test. Before testing, we decide to accept a 10% risk of concluding that the variances are unequal when they are actually equal. This means we have set α = 0.10.

Like most statistical software, JMP shows the p -value for a test. This is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one obtained; it is difficult to calculate by hand. For the figure above, with an F test statistic of 1.654, the p- value is 0.4561. This is larger than our α value: 0.4561 > 0.10. We fail to reject the hypothesis of equal variances. In practical terms, we can go ahead with the two-sample t -test under the assumption of equal variances for the two groups.
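The classical F test itself is simple to reconstruct: the statistic is the ratio of the two sample variances, larger variance on top, and the two-sided p-value doubles the upper-tail area. A sketch in Python (our construction of the standard F test; JMP's reported version may differ in detail):

```python
import numpy as np
from scipy import stats

# Table 1 data (as in the first sketch)
men = np.array([13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0,
                16.0, 24.0, 15.0, 1.0, 15.0])
women = np.array([22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0,
                  23.2, 28.0, 23.0])

# F statistic: larger sample variance over smaller (men over women here)
F = men.var(ddof=1) / women.var(ddof=1)          # ~1.654
df_num, df_den = len(men) - 1, len(women) - 1    # 12 and 9

# Two-sided p-value: double the upper-tail area (valid here since F > 1)
p = 2 * stats.f.sf(F, df_num, df_den)            # ~0.456
print(round(F, 3), round(p, 4))
```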

Understanding p-values

Using a visual, you can check to see if your test statistic is a more extreme value in the distribution. The figure below shows a t- distribution with 21 degrees of freedom.

t-distribution with 21 degrees of freedom and α = .05

Since our test is two-sided and we have set α = .05, the figure shows that the value of 2.080 “cuts off” 2.5% of the probability in each of the two tails. Only 5% of the distribution lies farther out in the tails than 2.080. Because our test statistic of 2.80 is beyond this cut-off point, we reject the null hypothesis of equal means.
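Equivalently, we could compute the p-value for our statistic directly instead of comparing it against the cut-off. A minimal scipy sketch:

```python
from scipy import stats

# Two-sided p-value: twice the upper-tail area beyond |t| = 2.80, df = 21
p = 2 * stats.t.sf(2.80, 21)
print(round(p, 4))   # ~0.0107, matching the software output in the next section
```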

Putting it all together with software

The figure below shows results for the two-sample t -test for the body fat data from JMP software.

Results for the two-sample t-test from JMP software

The results for the two-sample t -test that assumes equal variances match our hand calculations: the test statistic is 2.79996. The software shows results for a two-sided test and for one-sided tests. The two-sided test is what we want (Prob > |t|). Our null hypothesis is that the mean body fat for men and women is equal; our alternative hypothesis is that the means are not equal. The one-sided tests are for one-sided alternative hypotheses – for example, an alternative hypothesis that mean body fat for men is less than that for women.

We can reject the hypothesis of equal mean body fat for the two groups and conclude that we have evidence body fat in the population differs between men and women. The software shows a p -value of 0.0107, which is below the 5% risk (α = 0.05) we decided on before testing of concluding the means differ when they really do not. It is important to make this decision before doing the statistical test.

The figure also shows the results for the t- test that does not assume equal variances. This test does not use the pooled estimate of the standard deviation. As was mentioned above, this test also has a complex formula for degrees of freedom. You can see that the degrees of freedom are 20.9888. The software shows a p- value of 0.0086. Again, with our decision of a 5% risk, we can reject the null hypothesis of equal mean body fat for men and women.
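Both versions of the test are one call each in most packages. A minimal sketch with scipy (our illustration; it reproduces the JMP numbers described above up to rounding):

```python
import numpy as np
from scipy import stats

# Table 1 data (as in the first sketch)
men = np.array([13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0,
                16.0, 24.0, 15.0, 1.0, 15.0])
women = np.array([22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0,
                  23.2, 28.0, 23.0])

# Pooled (equal-variance) test: t ~2.80, p ~0.0107
print(stats.ttest_ind(women, men))

# Welch (unequal-variance) test: p ~0.0086
print(stats.ttest_ind(women, men, equal_var=False))
```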

Other topics

If you have more than two independent groups, you cannot use the two-sample t- test. Instead, use a method designed for several groups, such as ANOVA (analysis of variance), and follow up with a multiple comparison method where needed. Multiple comparison methods include the Tukey-Kramer test of all pairwise differences, analysis of means (ANOM) to compare group means to the overall mean, and Dunnett’s test to compare each group mean to a control mean.

What if my data are not from normal distributions?

If your sample size is very small, it might be hard to test for normality. In this situation, you might need to rely on your understanding of the measurements. For example, for the body fat data, the trainer knows from experience that body fat is approximately normally distributed in the population. Even for a very small sample, the trainer would likely go ahead with the t -test and assume normality.

What if you know the underlying measurements are not normally distributed? Or what if your sample size is large and the test for normality is rejected? In this situation, you can use nonparametric analyses. These types of analyses do not depend on an assumption that the data values come from a specific distribution. For the two-sample t -test, the Wilcoxon rank sum test is a nonparametric test that could be used; a sketch follows.
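For completeness, here is a minimal sketch of the Wilcoxon rank sum test in scipy (our choice of implementation):

```python
import numpy as np
from scipy import stats

# Table 1 data (as in the first sketch)
men = np.array([13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0,
                16.0, 24.0, 15.0, 1.0, 15.0])
women = np.array([22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0,
                  23.2, 28.0, 23.0])

# Wilcoxon rank sum test: no normality assumption needed
stat, p = stats.ranksums(women, men)
print(round(stat, 3), round(p, 4))
```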
