dataanalysisclassroom

making data analysis easy

Lesson 98 – The Two-Sample Hypothesis Tests using the Bootstrap

Two-sample hypothesis tests – Part VII.

H_{0}: P(\theta_{x}>\theta_{y}) = 0.5

These days, a peek out of the window is greeted by chilling rain or warm snow. On days when it is not raining or snowing, there is biting cold. So we gaze at our bicycles, waiting for that pleasant month of April when we can joyfully bike — to work, or for pleasure.

Speaking of bikes, since I have nothing much to do today except watch the snow, I decided to explore some data from our favorite “Open Data for All New Yorkers” page.

Interestingly, I found data on the bicycle counts for the East River Bridges. New York City DOT keeps track of the daily total of bike counts on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge.


I could find the data for April to October during 2016 and 2017. Here is how the data for April 2017 looks.

[Table: daily bike counts on the East River bridges for April 2017, with the non-holiday, no-precipitation weekdays highlighted in yellow]

As a frequent biker on the Manhattan Bridge, I got curious. I wanted to verify how different the total bike counts on the Manhattan Bridge are from those on the Williamsburg Bridge.

At the same time, I also wanted to share the benefits of the bootstrap method for two-sample hypothesis tests.

To keep it simple and easy for you to follow the bootstrap method’s logical development, I will test how different the total bike counts on the Manhattan Bridge are from those on the Williamsburg Bridge during all the non-holiday weekdays with no precipitation.

Here are the total bike counts on the Manhattan Bridge during all the non-holiday weekdays with no precipitation in April of 2017 (essentially, the data from the yellow-highlighted rows in the table for the Manhattan Bridge).

5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

And here are the corresponding total bike counts on the Williamsburg Bridge for the same days.

5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196

Their distributions look like this.

[Figure: distributions of the bike counts on the two bridges]

I want answers to the following questions.

Is \bar{x}_{M}=\bar{x}_{W}?

Is \tilde{x}_{M}=\tilde{x}_{W}?

What do we know so far?  

We know how to test the difference in means using the t-Test under the proposition that the population variances are equal ( Lesson 94 ) or using Welch’s t-Test when we cannot assume equality of population variances ( Lesson 95 ). We also know how to do this using Wilcoxon’s Rank-sum Test that uses the ranking method to approximate the significance of the differences in means ( Lesson 96 ).

We know how to test the equality of variances using F-distribution ( Lesson 97 ).

We know how to test the difference in proportions using either Fisher’s Exact test ( Lesson 92 ) or using the normal distribution as the null distribution under the large-sample approximation ( Lesson 93 ).

In all these tests, we made critical assumptions on the limiting distributions of the test-statistics.
  • What is the limiting distribution of the test-statistic that computes the difference in medians?
  • What is the limiting distribution of the test-statistic that compares interquartile ranges of two populations?
  • What if we do not want to make any assumptions on data distributions or the limiting forms of the test-statistics?

Enter the Bootstrap

I would urge you to go back to Lesson 79 to get a quick refresher on the bootstrap, and Lesson 90 to recollect how we used it for the one-sample hypothesis tests.

The idea is to approximate the true distribution f with \hat{f}, the empirical distribution of the observed sample.

Take the data for Manhattan Bridge. 5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774

Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these eight data points is 1/8, we can randomly draw eight numbers from these eight values —  with replacement .

Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774); some may appear more than once, and some may not show up at all in a given random sample.

Here is one such bootstrap replicate. 6359, 6359, 6359, 6052, 6774, 6359, 5276, 6359

The value 6359 appeared five times. Some values like 7247, 5054, 6691, and 5311 did not appear at all in this replicate.

Here is another replicate. 6359, 5276, 5276, 5276, 7247, 5311, 6052, 5311
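
If you want to draw such replicates yourself, one line of R does it (my sketch, not part of the original lesson):

    manhattan <- c(5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774)
    sample(manhattan, replace = TRUE)   # one bootstrap replicate of the eight values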

Such bootstrap replicates are representations of the empirical distribution, i.e., the proportion of times each value in the data sample occurs. We can generate all the information contained in the true distribution by creating \hat{f}, the empirical distribution.

Using the Bootstrap for Two-Sample Hypothesis Tests

Since each bootstrap replicate is a possible representation of the population, we can compute the relevant test-statistics from this bootstrap sample. By repeating this, we can have many simulated values of the test-statistics that form the null distribution to test the hypothesis. There is no need to make any assumptions about the distributional nature of the data or the limiting distribution of the test-statistic. As long as we can compute a test-statistic from the bootstrap sample, we can test the hypothesis on any statistic (mean, median, variance, interquartile range, proportion, etc.).

Let \theta_{x} and \theta_{y} be the test-statistics (mean, median, variance, proportion, etc.) computed from bootstrap replicates of the two samples X and Y.

The null hypothesis is that there is no difference between the statistic of X and that of Y.

H_{0}: P(\theta_{x}>\theta_{y}) = 0.5

The alternate hypothesis is

H_{A}: P(\theta_{x}>\theta_{y}) > 0.5, or < 0.5, or \neq 0.5 for a two-sided test

For example, one bootstrap replicate for X (Manhattan Bridge) and Y (Williamsburg Bridge) may give \bar{x}^{X}_{boot}<\bar{x}^{Y}_{boot}, while another bootstrap replicate for X and Y may give \bar{x}^{X}_{boot}>\bar{x}^{Y}_{boot}. For each replicate i, we record an indicator S_{i} \in \{0,1\}: S_{i}=1 if the bootstrap statistic for X exceeds that for Y, and S_{i}=0 otherwise.

The proportion of times S_{i}=1 in a set of N bootstrap-replicated statistics is the p-value.

p-value=\frac{1}{N}\sum_{i=1}^{N}S_{i}

Manhattan Bridge vs. Williamsburg Bridge

H_{0}: P(\bar{x}_{M}>\bar{x}_{W}) = 0.5

Let’s take a two-sided alternate hypothesis.

For each pair of bootstrap replicates, we compare \bar{x}_{M} with \bar{x}_{W} (equivalently, check whether the ratio \frac{\bar{x}_{M}}{\bar{x}_{W}} exceeds 1) and record the indicator S_{i}.

Can we reject the null hypothesis if we select a 5% rate of error?

We can run the same test on the medians:

H_{0}: P(\tilde{x}_{M}>\tilde{x}_{W}) = 0.5
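
Here is a minimal R sketch of the whole test on the April 2017 data above (my illustration, not the author's original code; N = 10,000 replicates is an arbitrary choice):

    set.seed(1)
    manhattan    <- c(5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774)
    williamsburg <- c(5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196)

    N <- 10000
    S.mean <- S.median <- double(N)
    for (i in 1:N) {
        m.star <- sample(manhattan, replace = TRUE)     # bootstrap replicate of X
        w.star <- sample(williamsburg, replace = TRUE)  # bootstrap replicate of Y
        S.mean[i]   <- mean(m.star) > mean(w.star)
        S.median[i] <- median(m.star) > median(w.star)
    }
    # under H0 these proportions should be near 0.5; for a two-sided
    # test at the 5% level, reject if they fall below 0.025 or above 0.975
    mean(S.mean)
    mean(S.median)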

Can you see the bootstrap concept’s flexibility and how widely we can apply it for hypothesis testing? Just remember that the underlying assumption is that the data are independent. 

To summarize: to test

P(\theta_{x}>\theta_{y})=0.5

we generate bootstrap replicates of both samples, compute the statistic of interest from each pair of replicates, and take the proportion of replicates in which the statistic for X exceeds that for Y as the p-value.

After seven lessons, we are now equipped with all the theory of the two-sample hypothesis tests. It is time to put them to practice. Dust off your programming machines and get set.

If you find this useful, please like, share and subscribe. You can also follow me on Twitter  @realDevineni  for updates on new lessons.


Beyond normality: the bootstrap method for hypothesis testing.

Posted on August 17, 2019 on Alejandro Morales' Blog


tl;dr: Parametric bootstrap methods can be used to test hypotheses and calculate p-values while assuming any particular population distribution we may want. Non-parametric bootstrap methods can be used to test hypotheses and calculate p-values without having to assume any particular population, as long as the sample can be assumed to be representative of the population and the data can be transformed adequately to take the null hypothesis into account. The p-values from bootstrap methods may differ from those from classical methods, especially when the assumptions of the classical methods do not hold. The different methods of calculation can push a p-value beyond the 0.05 threshold, which means that statements of statistical significance are sensitive to all the assumptions used in the test.

Introduction

In this article I show how to use parametric and non-parametric bootstrapping to test null hypotheses, with special emphasis on situations where the assumption of normality may not hold. To make it more relevant, I will use real data (from my own research) to illustrate the application of these methods. If you get lost somewhere in this article, you may want to take a look at my previous post, where I introduced the basic concepts behind hypothesis testing and sampling distributions. As in the previous post, the analysis will be done in R, so before we get into the details, it is important to properly set up our R session:
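
The original setup chunk was not preserved in this copy of the post; the sketch below stands in for it, and the specific choices (the seed value, loading ggplot2) are my assumptions rather than the author's actual setup.

    set.seed(2019)      # make the analysis reproducible
    library(ggplot2)    # plotting used later in the post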

The data I will use consist of measurements of individual plant biomass (i.e., the weight of a plant after we have removed all the water) under a control treatment (C), drought (D), high temperature (HT), and high temperature plus drought (HTD). First, let’s take a look at the data:


Stat 3701 Lecture Notes: Bootstrap

Charles J. Geyer, November 23, 2022.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License ( http://creativecommons.org/licenses/by-sa/4.0/ ).

The version of R used to make this document is 4.2.1.

The version of the rmarkdown package used to make this document is 2.17.

The version of the knitr package used to make this document is 1.40.

The version of the bootstrap package used to make this document is 2019.6.

3 Relevant and Irrelevant Simulation

3.1 Irrelevant

Most statisticians think a statistics paper isn’t really a statistics paper or a statistics talk isn’t really a statistics talk if it doesn’t have simulations demonstrating that the methods proposed work great (at least in some toy problems).

IMHO, this is nonsense. Simulations of the kind most statisticians do prove nothing. The toy problems used are often very special and do not stress the methods at all. In fact, they may be (consciously or unconsciously) chosen to make the methods look good.

In scientific experiments, we know how to use randomization, blinding, and other techniques to avoid biasing the results. Analogous things are never AFAIK done with simulations.

When all of the toy problems simulated are very different from the statistical model you intend to use for your data, what could the simulation study possibly tell you that is relevant? Nothing.

Hence, for short, your humble author calls all of these millions of simulation studies statisticians have done irrelevant simulation .

3.2 Relevant

But there is a well-known methodology of relevant simulation , except that it isn’t called that. It is called the bootstrap .

The idea is, for each statistical model and each data set to which it is applied, one should do a simulation study of this model on data of this form.

But there is a problem: the fundamental problem of statistics, that \(\hat{\theta}\) is not \(\theta\) . To be truly relevant we should simulate from the true unknown distribution of the data, but we don’t know what that is. (If we did, we wouldn’t need statistics.)

So as a second best choice we have to simulate from our best estimate of the true unknown distribution, the one corresponding to the parameter value \(\hat{\theta}\) if that is the best estimator we know.

But we know that is the Wrong Thing . So we have to be sophisticated about this. We have to arrange what we do with our simulations to come as close to the Right Thing as possible.

And bootstrap theory and methods are extraordinarily sophisticated with many different methods of coming very close to the Right Thing .

4 R Packages and Textbooks

There are two well known R packages concerned with the bootstrap. They go with two well known textbooks.

R package boot is an R recommended package that is installed by default in every installation of R. As the package description says, it goes with the textbook Davison and Hinkley (1997) .

The CRAN package bootstrap goes with, as its package description says, the textbook Efron and Tibshirani (1993) .

The package description also says that “new projects should preferentially use the recommended package ‘ boot ’”. But I do not agree. The package maintainer is neither Efron nor Tibshirani, and I do not think they would agree. Whatever the politics of the R core team that make the boot package “recommended”, they have nothing to do with the quality of the package or with the quality of the textbook it goes with. If you like Efron and Tibshirani (1993), you should be using the R package bootstrap that goes with it.

These authors range from moderately famous (for a statistician) to very, very famous (for a statistician). Efron is the inventor of the term bootstrap in its statistical meaning.

5 The Bootstrap Analogy

5.1 The Name of the Game

The term “bootstrap” recalls the English idiom “pull oneself up by one’s bootstraps” .

The literal meaning of “bootstrap” in non-technical language is leather loops at the top of boots used to pull them on. So the literal meaning of “pull oneself up by one’s bootstraps” is to reach down, grab your shoes, and lift yourself off the ground — a physical impossibility. But, idiomatically, it doesn’t mean do the physically impossible; it means something like “succeed by one’s own efforts”, especially when this is difficult.

The technical meaning in statistics plays off this idiom. It means to get a good approximation to the sampling distribution of an estimator without using any theory. (At least not using any theory in the computation. A great deal of very technical theory may be used in justifying the bootstrap in certain situations.)

5.2 Introduction

The discussion in this section (all of Section 5) is stolen from Efron and Tibshirani (1993, Figure 8.1 and the surrounding text).

To understand the bootstrap you have to understand a simple analogy. Otherwise it is quite mysterious. I recall being mystified about it when I was a graduate student. I hope the students I teach are much less mystified because of this analogy. This appears to the untutored to be impossible or magical. But it isn’t really. It is sound statistical methodology.

5.3 The Nonparametric Bootstrap

The nonparametric bootstrap (or, to be more precise, Efron’s original nonparametric bootstrap, because others have been proposed in the literature, although no other is widely used AFAIK) is based on a nonparametric estimate of the true unknown distribution of the data.

This nonparametric estimate is just the sample itself, thought of as a finite population to sample from. Let \(P\) denote the true unknown probability distribution that we assume the data are an IID sample from, and let \(\widehat{P}_n\) denote the probability model that samples IID from the original sample, thought of as a finite population to sample from.

As we said above, this is the Wrong Thing with a capital W and a capital T. The sample is not the population. But it will be close for large sample sizes. Thus all justification for the nonparametric bootstrap is asymptotic. It only works for large sample sizes. We emphasize this because many naive users have picked up the opposite impression somewhere. The notion that the bootstrap (any kind of bootstrap) is an exact statistical method seems to be floating around in the memeosphere and impossible to stamp out.

The bootstrap makes an analogy between the real world and a mythical bootstrap world.

The explanation.

In the real world we have the true unknown distribution of the data \(P\) . In the bootstrap world we have the “true” pretend unknown distribution of the data \(\widehat{P}_n\) . Actually the distribution \(\widehat{P}_n\) is known, and that’s a good thing, because it allows us to simulate data from it. But we pretend it is unknown when we are reasoning in the bootstrap world. It is the analog in the bootstrap world of the true unknown distribution \(P\) in the real world.

In the real world we have the true unknown parameter \(\theta\) . It is the aspect of \(P\) that we want to estimate. In the bootstrap world we have the “true” pretend unknown parameter \(\hat{\theta}_n\) . Actually the parameter \(\hat{\theta}_n\) is known, and that’s a good thing, because it allows us to see how close estimators come to it. But we pretend it is unknown when we are reasoning in the bootstrap world. It is the analog in the bootstrap world of the true unknown parameter \(\theta\) in the real world.

\(\hat{\theta}_n\) is the same function of \(\widehat{P}_n\) as \(\theta\) is of \(P\) .

If \(\theta\) is the population mean, then \(\hat{\theta}_n\) is the sample mean.

If \(\theta\) is the population median, then \(\hat{\theta}_n\) is the sample median.

and so forth.

In the real world we have data \(X_1\) , \(\ldots,\) \(X_n\) that are assumed IID from \(P\) , whatever it is. In the bootstrap world we simulate data \(X_1^*\) , \(\ldots,\) \(X_n^*\) that are IID from \(\widehat{P}_n\) .

The way we simulate IID from \(\widehat{P}_n\) is to take samples from the original data considered as a finite population to sample from. These are samples with replacement because that is what IID requires.
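
In R, with the original sample stored in a vector x, one such bootstrap sample is a one-liner (a sketch):

    x.star <- sample(x, replace = TRUE)   # n draws, with replacement, from the sample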

Sometimes the nonparametric bootstrap is called “resampling” because it samples from the sample, called resampling for short. But this terminology misdirects the naive. What is important is that we have the correct analogy on the “data” line of the table.

We have some estimator of \(\theta\) , which must be a statistic , that is some function of the data that does not depend on the unknown parameter. In order to have the correct analogy in the bootstrap world, our estimate there must be the same function of the bootstrap data .

Many procedures require some estimate of standard error of \(\hat{\theta}_n\) . Call that \(\hat{s}_n\) . It too must be a statistic , that is some function of the data that does not depend on the unknown parameter. In order to have the correct analogy in the bootstrap world, our estimate there must be the same function of the bootstrap data .

Many procedures use so-called pivotal quantities , either exact or approximate.

An exact pivotal quantity is a function of the data and the parameter of interest whose distribution does not depend on any parameters . The prototypical example is the \(t\) statistic \[ \frac{\overline{X}_n - \mu}{s_n / \sqrt{n}} \] which has, when the data are assumed to be exactly normal, an exact \(t\) distribution on \(n - 1\) degrees of freedom (which does not depend on the unknown parameters \(\mu\) and \(\sigma\) of the distribution of the data). Note that the pivotal quantity is a function of \(\mu\) but the sampling distribution of the pivotal quantity does not depend on \(\mu\) or \(\sigma\) : the \(t\) distribution with \(n - 1\) degrees of freedom does not have any unknown parameters.

An asymptotic pivotal quantity is a function of the data and the parameter of interest whose asymptotic distribution does not depend on any parameters . The prototypical example is the \(z\) statistic \[ \frac{\overline{X}_n - \mu}{s_n / \sqrt{n}} \] (actually the same function of data and parameters as the \(t\) statistic discussed above), which has, when the data are assumed to have any distribution with finite variance, an asymptotic standard normal distribution (which does not depend on the unknown distribution of the data). Note that the pivotal quantity is a function of \(\mu\) but the sampling distribution of the pivotal quantity does not depend on the unknown distribution of the data : the standard normal distribution does not have any unknown parameters.

An approximate pivotal quantity is a function of the data and the parameter of interest whose sampling distribution does not depend on the unknown distribution of the data, at least not very much. Often such quantities are made by standardizing in a manner similar to the examples above: any time we have some purported standard errors of estimators, we can use them to make approximate pivotal quantities \[ \frac{\hat{\theta}_n - \theta}{\hat{s}_n} \] as in the bottom left cell of the table above.

The importance of pivotal quantities in (frequentist) statistics cannot be overemphasized. They are what allow valid exact or approximate inference. When we invert the pivotal quantity to make confidence intervals, for example, \[ \hat{\theta}_n \pm 1.96 \cdot \hat{s}_n \] this is (exactly or approximately) valid because the sampling distribution does not depend on the true unknown distribution of the data, at least not much . If it did depend strongly on the true distribution of the data, then our coverage could be way off, because our estimated sampling distribution of the pivotal quantity might be far from its correct sampling distribution.

As we shall see, even when we have no \(\hat{s}_n\) available, the bootstrap can find one for us.

5.3.1 Cautions

5.3.1.1 Use the Correct Analogies

In the bottom right cell of the table above there is a strong tendency for naive users to replace \(\hat{\theta}_n\) with \(\theta\) . But this is clearly incorrect. What plays the role of true unknown parameter value in the bootstrap world is \(\hat{\theta}_n\) not \(\theta\) .

5.3.1.2 Hypothesis Tests are Problematic

Any hypothesis test calculates critical values or \(P\) -values using the distribution under the null hypothesis . But the bootstrap does not sample that unless the null hypothesis happens to be correct. Usually, we want to reject the null hypothesis, meaning we hope it is not correct . And in any case, we would not be doing a hypothesis test unless we did not know whether the null hypothesis is correct.

Thus the obvious naive way to calculate a bootstrap \(P\) -value, which has been re-invented time and time again by naive users, is completely bogus. It says, if \(w(X_1, \ldots, X_n)\) is the test statistic of the test, then the naive bootstrap \(P\) -value is the fraction of simulations of bootstrap data in which \(w(X_1^*, \ldots, X_n^*) \ge w(X_1, \ldots, X_n)\) . This test typically has no power. It rejects at level \(\alpha\) with probability \(\alpha\) no matter how far the true unknown distribution of the data is from the null hypothesis. This is because the bootstrap samples (approximately, for large \(n\) ) from the true unknown distribution, not from the null hypothesis.

Of course, there are non-bogus ways of doing bootstrap tests, but one has to be a bit less naive. For example, any valid bootstrap confidence interval also gives a valid bootstrap test. The test rejects \(H_0 : \theta = \theta_0\) (two-tailed) at level \(\alpha\) if and only if a valid confidence interval with coverage probability \(1 - \alpha\) does not cover \(\theta_0\) .

We won’t say any more about bootstrap hypothesis tests. The textbooks cited above each have a chapter on the subject.

5.3.1.3 Regression is Problematic

If we consider our data to be IID pairs \((X_i, Y_i)\) , then the naive bootstrap procedure is to resample pairs \((X_i^*, Y_i^*)\) where each \((X_i^*, Y_i^*) = (X_j, Y_j)\) for some \(j\) . But this mimics the joint distribution of \(X\) and \(Y\) and regression is about the conditional distribution of \(Y\) given \(X\) . So again the naive bootstrap samples the wrong distribution.

A solution to this problem is to resample residuals rather than data. Suppose we are assuming a parametric model for the regression function but are being nonparametric about the error distribution, as in Section 3.4.1 of the course notes about models, Part I . Just for concreteness, assume the regression function is simple \(\alpha + \beta x\) . Then the relation between the bootstrap world and the real world changes as follows.

The table is not quite as neat as before because there is no good way to say that \(\hat{\alpha}_n\) and \(\hat{\beta}_n\) are the same function of the regression data, thought of as a finite population to sample, as \(\alpha\) and \(\beta\) are of the population, and similarly that \(\alpha^*_n\) and \(\beta^*_n\) are the same function of the bootstrap data as \(\hat{\alpha}_n\) and \(\hat{\beta}_n\) are of the original data.

The textbooks cited above each have a chapter on this subject.

Bootstrapping residuals is usually not fully nonparametric because the estimate of the residuals depends on some parametric part of the model (either the mean function is parametric or the error distribution, or both).
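
Here is a minimal sketch of the residual-resampling recipe for the simple regression function \(\alpha + \beta x\) (the vectors x and y below are placeholder data; this illustration is mine, not code from the notes):

    fit <- lm(y ~ x)
    a.hat <- coef(fit)[1]
    b.hat <- coef(fit)[2]
    r <- residuals(fit)

    nboot <- 999
    beta.star <- double(nboot)
    for (iboot in 1:nboot) {
        # keep the predictors fixed; resample residuals to rebuild responses
        y.star <- a.hat + b.hat * x + sample(r, replace = TRUE)
        beta.star[iboot] <- coef(lm(y.star ~ x))[2]
    }
    sd(beta.star)   # bootstrap standard error of the slope estimate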

5.4 The Parametric Bootstrap

The parametric bootstrap was also invented by Efron.

Now we have a parametric model. Let \(P_\theta\) denote the true unknown probability distribution that we assume the data are an IID sample from, where \(\theta\) is an unknown parameter.

We won’t be as detailed in our explanation as above. The main point is that everything is the same as with the nonparametric bootstrap, except that we are using parametric estimates of distributions rather than nonparametric ones.

The same caution about being careful about the analogy applies as with the nonparametric bootstrap. But the other cautions do not apply. Neither hypothesis tests nor regression are problematic with the parametric bootstrap. One simply samples from the correct parametric distribution. For hypothesis tests, one estimates the parameters under the null hypothesis and then simulates that distribution. For regression, one estimates the parameters and then simulates new response data from the estimated conditional distribution of the response given the predictors.
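
As a concrete sketch of the hypothesis-testing recipe, suppose the parametric model is normal and the null hypothesis is \(H_0 : \mu = \mu_0\). We estimate the remaining parameter under the null and simulate from that fitted null distribution (x and mu0 below are placeholders; this is my illustration, not code from the notes):

    n <- length(x)
    t.obs <- (mean(x) - mu0) / (sd(x) / sqrt(n))
    sigma.null <- sqrt(mean((x - mu0)^2))   # MLE of sigma when mu = mu0

    nboot <- 999
    t.star <- double(nboot)
    for (iboot in 1:nboot) {
        x.star <- rnorm(n, mean = mu0, sd = sigma.null)  # simulate under the null
        t.star[iboot] <- (mean(x.star) - mu0) / (sd(x.star) / sqrt(n))
    }
    mean(abs(t.star) >= abs(t.obs))   # two-tailed bootstrap P-value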

6.1 Nonparametric Bootstrap

We will use the following highly skewed data.

Suppose we wish to estimate the population mean using the sample mean as its estimator. We have the asymptotically valid confidence interval \[ \bar{x}_n \pm \text{critical value} \cdot \frac{s_n}{\sqrt{n}} \] where \(s_n\) is the sample standard deviation. We also have the rule of thumb widely promulgated by intro statistics books that this interval is valid when \(n \ge 30\) . That is, according to intro statistics books, \(30 = \infty\) . These data show how dumb that rule of thumb is.

6.1.2 Bootstrap

So let us bootstrap these data. There is an R function boot in the R recommended package of the same name that does bootstrap samples, but we find it so complicated as to be not worth using. We will just use a loop.
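
The data and code chunks were not preserved in this copy of the notes, so the loop below is only a sketch of what they describe, with placeholder skewed data (rexp(30)^2 is my assumption, not the data from the notes):

    set.seed(42)
    x <- rexp(30)^2    # placeholder highly skewed data
    n <- length(x)
    mu.hat <- mean(x)

    nboot <- 999
    theta.star <- double(nboot)
    z.star <- double(nboot)
    for (iboot in 1:nboot) {
        x.star <- sample(x, replace = TRUE)    # nonparametric bootstrap sample
        theta.star[iboot] <- mean(x.star)      # bootstrap replicate of the estimator
        z.star[iboot] <- (theta.star[iboot] - mu.hat) / (sd(x.star) / sqrt(n))
    }
    hist(theta.star)
    abline(v = mu.hat)   # the "true" parameter of the bootstrap world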

As the histogram shows, the sampling distribution of our estimator is also skewed (the vertical line shows \(\hat{\mu}_n\) ).

We want to use the method of pivotal quantities here using the sample standard deviation as the standardizer.

We can see that the distribution of z.star, which is supposed to be standard normal (it would be standard normal when \(n = \infty\)), is actually, for these data, far from standard normal.

6.1.3 Bootstrap Confidence Interval

But since we have the bootstrap estimate of the actual sampling distribution we can use that to determine critical values.

I chose nboot to be 999 (a round number minus one) in order for the following trick to work. Observe that \(n\) values divide the number line into \(n + 1\) parts. It can be shown by statistical theory that each part has the same sampling distribution when stated in terms of the fraction of the population distribution covered. Thus a sound estimate of the \(k / (n + 1)\) quantile of the distribution is z[k], the \(k\)-th smallest value. So we want to arrange the bootstrap sample size so that (nboot + 1) * alpha is an integer, where alpha is the probability for the critical value we want.
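
Continuing the sketch above: with nboot = 999 and alpha = 0.025, (nboot + 1) * alpha = 25, so the critical values are the 25th and 975th order statistics of z.star.

    alpha <- 0.025
    k <- (nboot + 1) * alpha            # 1000 * 0.025 = 25, an integer
    z.sorted <- sort(z.star)
    crit <- c(z.sorted[k], z.sorted[nboot + 1 - k])  # 0.025 and 0.975 quantile estimates
    crit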

Comparing with the R function quantile shows that we are doing (arguably) the right thing, and we don’t have to decide among the 9 different “types” of quantile estimator that quantile offers. The recipe used here is unarguably correct so long as (nboot + 1) * alpha is an integer.

Note that our critical values are very different from the \(\pm 1.96\) that asymptotic (large-sample) theory would have us use.

Our confidence interval is now \[ c_1 < \frac{\bar{x}_n - \mu}{s_n / \sqrt{n}} < c_2 \] where \(c_1\) and \(c_2\) are the critical values. We “solve” these inequalities for \(\mu\) as follows. \[\begin{gather*} c_1 \cdot \frac{s_n}{\sqrt{n}} < \bar{x}_n - \mu < c_2 \cdot \frac{s_n}{\sqrt{n}} \\ c_1 \cdot \frac{s_n}{\sqrt{n}} - \bar{x}_n < - \mu < c_2 \cdot \frac{s_n}{\sqrt{n}} - \bar{x}_n \\ \bar{x}_n - c_2 \cdot \frac{s_n}{\sqrt{n}} < \mu < \bar{x}_n - c_1 \cdot \frac{s_n}{\sqrt{n}} \end{gather*}\] (in going from the second line to the third, multiplying an inequality through by \(- 1\) reverses the inequality).

Now we use the last line of the nonparametric bootstrap analogy table. We suppose that the critical values are the same for both distributions on the bottom line (in the real world and in the bootstrap world).

Thus the bootstrap 95% confidence interval is \(\left( \bar{x}_n - c_2 \cdot s_n / \sqrt{n},\ \bar{x}_n - c_1 \cdot s_n / \sqrt{n} \right)\), which is very different from the asymptotic interval \(\bar{x}_n \pm 1.96 \cdot s_n / \sqrt{n}\).

6.1.4 Using boott

There is an R function boott in the CRAN package bootstrap that does this whole calculation for us.

The weird signature of the sdfun, which must accept arguments nbootsd and theta even when it has no use for them, is required by boott, as help(boott) explains: the function is going to be passed those arguments whether we need them or not.
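
A sketch of the call, continuing the assumptions above (the argument names follow my reading of the package documentation; see help(boott) in the CRAN package bootstrap for the exact interface):

    library(bootstrap)

    # the extra arguments nbootsd and theta must appear in the signature
    # even though this sdfun makes no use of them
    sdfun <- function(x, nbootsd, theta, ...) sd(x) / sqrt(length(x))

    boott(x, theta = mean, sdfun = sdfun, nboott = 999,
          perc = c(0.025, 0.975))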

And what if you cannot think up a useful standardizing function? Then boott can find one for you, using the bootstrap to estimate the standard deviation of the sampling distribution of the estimator. So there is another bootstrap inside the main bootstrap. We call this a double bootstrap.

Pretty cool.

6.1.5 Bootstrapping the Bootstrap

So how much better is the bootstrap confidence interval than the asymptotic confidence interval? We should do a simulation study to find out. But we don’t have any idea what the population distribution is, and anyway, as argued in Section 3 above , simulations are irrelevant unless they are instances of the bootstrap. So we should check using the bootstrap. In order to not have our code too messy, we will use boott .

This is the first thing that actually takes more than a second of computing time, but it is still not very long.

The bootstrap is apparently quite a bit better, but we can’t really say that until we look at MCSE. For this kind of problem where we are looking at a dichotomous result (hit or miss), we know from intro stats how to calculate standard errors. This is the same as the problem of estimating a population proportion. The standard error is \[ \sqrt{ \frac{\hat{p}_n (1 - \hat{p}_n) }{n} } \]

We say “bootstrap estimates” and “bootstrap standard errors” here rather than “Monte Carlo estimates” and “Monte Carlo standard errors” or “simulation estimates” and “simulation standard errors” because, of course, we are not doing the Right Thing (with a capital R and a capital T) which is simulating from the true unknown population distribution because, of course, we don’t know what that is.

6.1.6 A Plethora of Bootstrap Confidence Intervals

The recipe for bootstrap confidence intervals illustrated here is a good one but far from the only good one. There are, in fact, a plethora of bootstrap confidence intervals covered in the textbooks cited above and even more in statistical journals.

Some of these are covered in the course materials for your humble author’s version of STAT 5601 . So a lot more could be said about the bootstrap. But we won’t.

6.1.7 The Moral of the Story

The bootstrap can do even better than theory. Theory needs \(n\) to be large enough for theory to work. The bootstrap needs \(n\) to be large enough for the bootstrap to work. The \(n\) for the latter can be smaller than the \(n\) for the former.

This is well understood theoretically. Good bootstrap confidence intervals like the so-called bootstrap \(t\) intervals illustrated above, have the property called higher-order accuracy or second-order correctness . Asymptotic theory says that the coverage error of the asymptotic interval will be of order \(n^{- 1 / 2}\) . Like everything else in asymptotics it too obeys the square root law. The actual coverage probability of the interval will differ from the nominal coverage probability by an error term that has approximate size \(c / \sqrt{n}\) for some constant \(c\) (which we usually do not know, as it depends on the true unknown population distribution). For a second-order correct bootstrap interval the error will have approximate size \(c / n\) for some different (and unknown) constant \(c\) . The point is that \(1 / n\) is a lot smaller than \(1 / \sqrt{n}\) .

We expect second-order correct bootstrap intervals to do better than asymptotics.

And we don’t need to do any theory ourselves! The computer does it for us!

6.2 Parametric Bootstrap

We are going to use the same data to illustrate the parametric bootstrap.

But we now need a parametric model for these data. It was simulated from a gamma distribution, so we will use that.

6.2.1 The Gamma Distribution

The gamma distribution is a continuous distribution of a strictly positive random variable having PDF \[ f_{\alpha, \beta}(x) = \frac{1}{\beta^\alpha \Gamma(\alpha)} x^{\alpha - 1} e^{- x / \beta}, \qquad 0 < x < \infty, \] where \(\alpha\) and \(\beta\) are unknown parameters that are strictly positive.

It has a lot of appearances in theoretical statistics. The chi-square distribution is a special case. So is the exponential distribution, which is a model for failure times of random thingummies that do not get worse as they age. Also random variables having the \(F\) distribution can be written as a function of independent gamma random variables. In Bayesian statistics, it is the conjugate prior for several well-known families of distributions. But here we are just using it as a statistical model for data.

The function \(\Gamma\) is called the gamma function . It gives the probability distribution its name. If you haven’t heard of it, don’t worry about it. Just think of \(\Gamma(\alpha)\) as a term that has to be what it is to make the PDF integrate to one.

The parameter \(\alpha\) is called the shape parameter because different \(\alpha\) correspond to distributions of different shape. In fact, radically different.

For \(\alpha < 1\) the PDF goes to infinity as \(x \to 0\) .

For \(\alpha > 1\) the PDF goes to zero as \(x \to 0\) .

For \(\alpha = 1\) the PDF goes to \(1 / \beta\) as \(x \to 0\) .

The parameter \(\beta\) is called the scale parameter because it is one. If \(X\) has the gamma distribution with shape parameter \(\alpha\) and scale parameter one, then \(\beta X\) has the gamma distribution with shape parameter \(\alpha\) and scale parameter \(\beta\) . So changing \(\beta\) does not change the shape of the distribution. We could use the same plot for all \(\beta\) if we don’t put numbers on the axes.

The mean and variance are \[\begin{align*} E(X) & = \alpha \beta \\ \mathop{\rm var}(X) & = \alpha \beta^2 \end{align*}\]

6.2.2 Method of Moments Estimators

Solving the last two equations for the parameters gives \[\begin{align*} \alpha & = \frac{E(X)^2}{\mathop{\rm var}(X)} \\ \beta & = \frac{\mathop{\rm var}(X)}{E(X)} \end{align*}\]

This suggests the corresponding sample quantities as reasonable parameter estimates.
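
In R, the method of moments estimates are one line each (a sketch, with the data in a vector x):

    alpha.hat <- mean(x)^2 / var(x)   # estimate of the shape parameter
    beta.hat  <- var(x) / mean(x)     # estimate of the scale parameter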

These are called method of moments estimators because expectations of polynomial functions of data are called moments (mean and variance are special cases).

6.2.3 Maximum Likelihood

Since we got warnings from optim, we redo the fit.

We commented out the checks that the parameter values are strictly positive because optim often goes outside the parameter space but then gets back on track, as it does here. We seem to have gotten the correct answer despite the warnings.
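
The fitting code itself was not preserved in this copy; a sketch of the kind of thing described, started at the method of moments estimates, is:

    mlogl <- function(theta, x) {
        # checks that theta > 0 are deliberately omitted: optim may step
        # outside the parameter space (producing NaN warnings from dgamma)
        # but usually gets back on track
        - sum(dgamma(x, shape = theta[1], scale = theta[2], log = TRUE))
    }
    fit <- optim(c(alpha.hat, beta.hat), mlogl, x = x, hessian = TRUE)
    fit$par   # maximum likelihood estimates of alpha and beta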

So now we are ready to bootstrap. Let us suppose we want a confidence interval for \(\alpha\) .

6.2.4 Bootstrap

The parametric bootstrap simulates from the MLE distribution.

We now follow the same “bootstrap \(t\) ” idea with the parametric bootstrap that we did for the nonparametric.

This tells us the asymptotics are working pretty well at \(n = 30\) . Perhaps the bootstrap is unnecessary. (But we didn’t know that without using the bootstrap to show it.)

Not a lot of difference in the critical values from the standard normal ones.

Since we forgot about the Hessian when estimating the parameters for the real data, we have to get it now.

6.2.5 The Moral of the Story

The moral of the story here is different from the nonparametric story above. Here we didn’t need the bootstrap, and the confidence interval it made wasn’t any better than the interval derived from the usual asymptotics of maximum likelihood.

But we didn’t know that would happen until we did it. If anyone ever asks you “How do you know the sample size is large enough to use asymptotic theory?”, this is the answer.

If the asymptotics agrees with the bootstrap, then both are correct. If the asymptotics does not agree with the bootstrap, use the bootstrap.

Lesson 11: Introduction to Nonparametric Tests and Bootstrap

What are Nonparametric Methods?

Nonparametric methods require very few assumptions about the underlying distribution and can be used when the underlying distribution is unspecified.

In the next section, we will focus on inference for one parameter. There are many methods we will not cover for one sample and also many methods for more than one parameter. We present the Sign Test in some detail because it uses many of the concepts we learned in the course. We leave out the details of the other tests.

  • Determine when to use nonparametric methods.
  • Explain how to conduct the Sign test.
  • Generate a bootstrap sample.
  • Find a confidence interval for any statistic from the bootstrap sample.

11.1 - Inference for the Population Median

Introduction.

So far, the methods we have learned were for the population mean. The mean is a good measure of center when the data are bell-shaped, but it is sensitive to outliers and extreme values. When the data are skewed, a better measure of center is the median. The median, you may recall, is a resistant measure. We present an example below that demonstrates why we might consider an alternative to the methods presented so far. In other words, we may want to consider a test for the median and not the mean.

Example 11.1: Tax Forms

The Internal Revenue Service (IRS) claims that it should typically take about 160 minutes to fill out a 1040 tax form. A researcher believes that the IRS's claim is not correct and that it generally takes people longer to complete the form. He recorded the time (in minutes) it took 30 individuals to complete the form. Download the data set: [ irs.txt ]

How would we approach this using previous methods? We would set the hypotheses as:

\(H_0\colon \mu=160\)

\( H_a\colon \mu>160\)

If we run the analysis in Minitab, we get the following output:

Descriptive Statistics

\(\mu\): mean of time

Alternative hypothesis: \(H_1\colon \mu > 160\)

The output here gives the \(t\) statistic (1.7001), the degrees of freedom (29) and the p-value (0.04991). In this case, the p-value is less than our significance level, \(\alpha=0.05\). Therefore we reject the null hypothesis and conclude that it takes, on average, longer than 160 minutes to complete the 1040 form.

We assumed the time to fill out the form was Normally distributed (or at least symmetric), BUT time is not Normally distributed. It is generally skewed with a long right tail. Let’s take a look at the data. Below is the histogram of the data.

[Histogram of the time (in minutes) to complete the IRS tax forms; the highest bar has a frequency of 10 at 100 minutes.]

As you can see from the histogram, the shape of the data does not support the assumption that time is Normally distributed. There is a clear right tail here. In a skewed distribution, the population median, typically denoted as \(\eta\), is a better typical value than the population mean, \(\mu\).

11.1.1 - The Sign Test

Suppose we are interested in testing the population median. The hypotheses are similar to those we have seen before but use the median, \(\eta\), instead of the mean.

If the hypotheses are based on the median, they would look like the following:

\(H_0\colon \eta=\eta_0\) versus \(H_a\colon \eta>\eta_0\), \(H_a\colon \eta<\eta_0\), or \(H_a\colon \eta\ne\eta_0\)

For the IRS example, the null and alternative are:

\(H_0\colon \eta=160\) vs \(H_a\colon \eta >160\)

Consider the test statistic, \(S^+\), where

\(S^+=\text{the number of observations greater than 160}\)

Under the null hypothesis, \(S^+\) should be about 50% of the observations. Therefore, \(S^+\) should have a binomial distribution with parameters \(n\) and \(p=0.5\). Let’s review and verify that it is a Binomial random variable.

  • The number of trials, \(n\), is fixed and known . Here the number of trials equals the number of observations. Therefore, in this case, \(n\) is fixed and known.
  • The outcomes of each trial can be categorized as either a "success" or a "failure", with the probability of success being \(p\). Observations can either be above the median (a "success") or below the median (a "failure"), with the probability of being above the median being \(p=\frac{1}{2}\).
  • The probability of "success" remains constant from trial to trial. The probability of being above the median remains the same for each observation.
  • The trials are independent. Each of the observations is independent of the next, so we are okay here.

Now, back to our problem. To make a conclusion, we need to find the p-value. It is the probability of seeing what we see or something more extreme given the null hypothesis is true.

In the IRS example, let’s find \(S^+\), or in other words, let's find the number of observations that fall above 160. We find \(S^+=15\).

Using the Binomial distribution function, we can find the p-value as \(P(S^+\ge 15)\):

\begin{align} P(X\ge15)&=\sum_{i=15}^{30} {30\choose i}(0.5)^{30-i}(0.5)^i\\&=\sum_{i=15}^{30} {30\choose i} (0.5)^{30}\\&\approx 0.5722322 \end{align}

If we assume the significance level is 5%, then the p-value\(>0.05\). We would fail to reject the null hypothesis and conclude that there is no evidence in the data to suggest that the median is above 160 minutes.
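
In R, for example, the same p-value can be computed directly (a sketch; the course itself uses Minitab):

    pbinom(14, size = 30, prob = 0.5, lower.tail = FALSE)   # P(S+ >= 15), about 0.5722
    binom.test(15, 30, p = 0.5, alternative = "greater")    # the same test, packaged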

This test is called the Sign Test and \(S^+\) is called the sign statistic . The Sign Test is also known as the Binomial Test.

Let's recap what we found. The research question was to see if it took longer than 160 minutes to complete the 1040 form. The measurement was the time in minutes to complete the form. Here is a summary: the t-test rejected the null hypothesis (p-value ≈ 0.0499), while the Sign Test failed to reject it (p-value ≈ 0.5722).

Here, we have two opposite conclusions from the two tests. Given the shape of the data, which do you think is the valid conclusion?

Minitab®

Minitab Sign Test

We can use Minitab to conduct the Sign test.

  • Click Stat > Nonparametrics > 1-Sample sign
  • Enter your 'variable', 'significance level', and adjust for the alternative.

Example 11-2: Tax Forms (Sign Test)

Conduct the test for the median time for filling out the tax forms using the Sign Test in Minitab. Download the dataset: [ irs.txt ]

Conducting the test in Minitab yields the following output.

1-Sample Sign Test

\(\eta\): median of time

95% Lower bound for \(\eta\)

\(H_1\colon \eta > 160\)

As you can see in the Minitab output, you can also find a confidence interval for the population median based on the sign statistic. As you can imagine, finding the confidence interval by hand is a bit tricky. The interpretation of the confidence interval for the median has the same template interpretation as the confidence interval for the population mean.

We present the details of the Sign Test because it can be found based on the material we covered so far in the course. For the next section, we present another test and how to do it in Minitab but leave out the details.

11.1.2 - One-Sample Wilcoxon

In this section, we briefly present the one-sample Wilcoxon test. This test was developed by Frank Wilcoxon in 1945. It is considered one of the first “nonparametric” tests developed.

The hypotheses are similar to the ones presented previously for the Sign Test:

\(H_0\colon \eta=\eta_0\)

\(H_a\colon \eta>\eta_0\)

\(H_a\colon \eta<\eta_0\)

\(H_a\colon \eta\ne\eta_0\)

The Wilcoxon test needs additional assumptions, however. They are:

  • The random variable of interest is continuous
  • The probability distribution of the population is symmetric.

If we compare the assumptions of the Wilcoxon test to those of the Sign Test, the Wilcoxon test additionally requires the distribution to be symmetric. For example, we should not be making conclusions for the IRS data using the Wilcoxon test because the data are right-skewed.

The test statistic is typically denoted as \(W\). We will not go into details on how this statistic is found as it involves ranks.

One-Sample Wilcoxon Test

Minitab will conduct the one-sample Wilcoxon test.

  • Choose Stat  > Nonparametrics > 1-sample Wilcoxon
  • Enter the 'variable', the 'hypothesized value', and the correct 'alternative'.
  • Choose OK .

Example 11-3: Checkout Time (Wilcoxon Test)

Fresh N Friendly food store advertises that their checkout waiting time is four minutes or less. An angry customer wants to dispute this claim. He takes a random sample of shoppers at the peak time and records their checkout times. Can he dispute their claim at significance level 10%?

Checkout times:

3.8, 5.3, 3.5, 4.5, 7.2, 5.1

Use Minitab to conduct the 1-sample Wilcoxon Test. Compare the conclusion to the one found using the one-sample t-test (see Lesson 6b.4, More Examples).

Wilcoxon Signed Rank Test: time

\(H_1\colon \eta > 4\)

The p-value for this test is 0.086, which is less than our significance level of 0.10. Therefore we reject the null hypothesis. There is enough evidence in the data to suggest the population median time is greater than 4 minutes.
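
For readers working in R rather than Minitab, the same test can be run as follows (a sketch; with tied ranks R falls back on a normal approximation and reports a p-value close to Minitab's 0.086):

    times <- c(3.8, 5.3, 3.5, 4.5, 7.2, 5.1)
    wilcox.test(times, mu = 4, alternative = "greater")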

If we assume the data are normal and perform a test for the mean, the p-value was 0.0798.

At the 10% level, the data suggest that both the mean and the median are greater than 4.

11.1.3 - Other Nonparametric Tests

So far we discussed nonparametric tests for only one parameter. There are many tests for two parameters and for more than two parameters. There are also tests like Fisher’s Exact that will test for the association between two categorical variables.

In the table below, we give some examples of nonparametric tests. If you are interested, explore these tests on your own.

11.2 - Introduction to Bootstrapping

In this section, we will start by reviewing the concept of sampling distributions. Recall, we can find the sampling distribution of any summary statistic. Then, the method of bootstrapping samples to find the approximate sampling distribution of a statistic is introduced.

Review of Sampling Distributions

Before looking at the bootstrapping method, we will need to recall the idea of sampling distributions. More specifically, let's look at the sampling distribution of the sample mean, \(\bar{x}\).

Suppose we are interested in estimating the population mean, \(\mu\). To do this, we find a random sample of size \(n\) and calculate the sample mean, \(\bar{x}\). But how do we know how good of an estimate \(\bar{x}\) is? To answer this question, we need to find the standard deviation of the estimate.

Recall that \(\bar{x}\) is calculated from a random sample and is, therefore, a random variable. Let's call the sample mean from above \(\bar{x}_1\). Now suppose we gather another random sample of size \(n\) and calculate \(\bar{x}\) from that sample and denote it \(\bar{x}_2\). Take another sample, and so on and so on. With many of these samples, we can construct a histogram of the sample means.

With theory and the central limit theorem, we have the following summary:

If the sample satisfies at least one of the following:

  • The distribution of the random variable, \(X\), is Normal
  • The sample size is large; rule of thumb is \(n>30\)

...then the sampling distribution of \(\bar{X}\) is approximately Normal with

  • Mean: \(\mu\)
  • Standard deviation: \(\frac{\sigma}{\sqrt{n}}\)
  • Standard error: \(\frac{s}{\sqrt{n}}\)

Using the above, we can construct confidence intervals and hypothesis tests for the population mean, \(\mu\).

What happens when we do not know the underlying distribution and cannot repeatedly sample from the population? How could we estimate the sampling distribution of a statistic then? This is what we try to answer in the next section.

11.2.1 - Bootstrapping Methods

Point estimates are helpful to estimate unknown parameters but in order to make inference about an unknown parameter, we need interval estimates. Confidence intervals are based on information from the sampling distribution, including the standard error.

What if the underlying distribution is unknown? What if we are interested in a population parameter that is not the mean, such as the median? How can we construct a confidence interval for the population median?

If we have sample data, then we can use bootstrapping methods to construct a bootstrap sampling distribution to construct a confidence interval.

Bootstrapping is a topic that has been studied extensively for many different population parameters and many different situations. There are parametric bootstraps, nonparametric bootstraps, weighted bootstraps, etc. We merely introduce the very basics of the bootstrap method; to cover all of the topics would be an entire class in itself.

Let’s show how to create a bootstrap sample for the median. Let the sample median be denoted as \(M\).

  • Replace the population with the sample
  • Sample with replacement \(B\) times. \(B\) should be large, say 1000.
  • Compute sample medians each time, \(M_i\)
  • Obtain the approximate distribution of the sample median.

If we have the approximate distribution, we can find an estimate of the standard error of the sample median by finding the standard deviation of \(M_1,...,M_B\).

Sampling with replacement is important. If we did not sample with replacement, we would always get the same sample median as the observed value. The sample we get from sampling from the data with replacement is called the bootstrap sample .

Once we find the bootstrap sample, we can create a confidence interval. For a 90% confidence interval, for example, we would find the 5th percentile and the 95th percentile of the bootstrap sample.

You can create a bootstrap sample to find the approximate sampling distribution of any statistic, not just the median. The steps would be the same except you would calculate the appropriate statistic instead of the median.
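
Here is a minimal R sketch of these steps for a generic data vector x (this sketch is mine, not the course's own code):

    B <- 1000
    M.star <- replicate(B, median(sample(x, replace = TRUE)))  # bootstrap medians
    sd(M.star)                       # estimated standard error of the sample median
    quantile(M.star, c(0.05, 0.95))  # 90% bootstrap (percentile) confidence interval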


11.3 - Summary

In this Lesson, we introduced the very basic idea behind nonparametric methods. We use nonparametric methods when the assumptions fail for the tests we've learned so far. We also introduced the bootstrap method and how to create a bootstrap sample.

Smoothed Bootstrap Methods for Hypothesis Testing

Original Article | Open access | Published: 04 March 2024 | Volume 18, article number 16 (2024)

Asamh S. M. Al Luhayb, Tahani Coolen-Maturi (ORCID: orcid.org/0000-0002-0229-2671), Frank P. A. Coolen

This paper demonstrates the application of smoothed bootstrap methods and Efron’s methods for hypothesis testing on real-valued data, right-censored data and bivariate data. The tests include quartile hypothesis tests, two sample medians and Pearson and Kendall correlation tests. Simulation studies indicate that the smoothed bootstrap methods outperform Efron’s methods in most scenarios, particularly for small datasets. The smoothed bootstrap methods provide smaller discrepancies between the actual and nominal error rates, which makes them more reliable for testing hypotheses.


1 Introduction

The bootstrap method, as introduced by Efron [ 13 ], is a nonparametric statistical method proposed to specify the variability of sample estimates. The method has been widely used in the literature for a variety of statistical problems [ 17 ] as it is easy to apply and overall provides good results. When the distribution is unknown, the bootstrap method could be of great practical use [ 10 ].

For univariate real-valued data, Efron [ 13 ] introduced the bootstrap method, which is used in many real-world applications; see Efron and Tibshirani [ 17 ], Davison and Hinkley [ 10 ] and Berrar [ 5 ] for more details. For an original data set of size n , bootstrap samples of size n are created by random sampling with replacement and then computing the function of interest based on each bootstrap sample. The empirical distribution of the results can be used as a proxy for the distribution of the function of interest. In the case of finite support, Banks [ 4 ] presented a smoothed bootstrap method by linear interpolation between consecutive observations. Banks’ bootstrap method starts with ordering the n observations of the original sample, where it is assumed that there are no ties, and taking the \(n+1\) intervals of the partition of the support created by the n ordered observations. Each interval is assigned probability \(\frac{1}{n+1}\) . To generate one Banks’ bootstrap sample, n intervals are resampled, and then one observation is drawn uniformly from each chosen interval. With Banks’ bootstrap method, it is allowed to sample from the whole support, and ties occur with probability 0 in the bootstrap samples. This is contrary to Efron’s method, where the process is restricted to resampling from the original data set [ 13 ]. In the case of underlying distributions with infinite support, Coolen and BinHimd [ 8 ] generalised Banks’ bootstrap method by assuming distribution tail(s) for the first and last interval.

Efron [ 14 ] presented the bootstrap method for right-censored data, which is widely used in survival analysis; see Efron and Tibshirani [ 4 , 16 ]. This bootstrap version is very similar to the method presented for univariate real-valued data, where multiple bootstrap samples of size n are created by resampling from the original sample, and the function of interest is computed based on each bootstrap sample. The empirical distribution of those resulting values can be used as a good proxy for the distribution of the function of interest. Al Luhayb et al. [ 2 ] generalised Banks’ bootstrap method based on the right-censoring \(A_{(n)}\) assumption [ 9 ]. The generalised bootstrap method produced better results; see Al Luhayb [ 1 ] and Al Luhayb et al. [ 2 ] for more details.

Efron and Tibshirani [ 16 ] introduced the bootstrap method for bivariate data, where again, multiple bootstrap samples are generated by resampling from the original data set, and the function of interest is computed based on each bootstrap sample. The empirical distribution of the resulting values can be a good proxy for the distribution of the function of interest. However, Efron’s bootstrap method often produces poor results when working with small data sets. To address this issue, Al Luhayb et al. [ 3 ] proposed three new smoothed bootstrap methods. These methods rely on applying Nonparametric Predictive Inference on the marginals and modelling the dependence using parametric and nonparametric copulas. The new bootstrap methods have been shown to produce more accurate results. For further details, we refer the reader to Al Luhayb [ 1 ] and Al Luhayb et al. [ 3 ].

Classical statistical methods are widely used for testing statistical hypotheses, although their underlying assumptions are not always met, especially with complex data sets. To avoid these issues, Efron’s bootstrap method has been used to test statistical hypotheses [ 16 , 23 , 24 ], which is easy to implement, and it provides good approximation results. However, it may not be suitable for small data sets and may include ties in the bootstrap samples. To overcome these limitations, various smoothed bootstrap methods have been proposed by Banks [ 4 ], Al Luhayb et al. [ 2 ] and Al Luhayb et al. [ 3 ] for real-valued data, right-censored data, and bivariate data, respectively. This paper investigates the use of these bootstrap methods for hypothesis testing and compares their results with those of Efron’s methods.

This paper is organised as follows: Sect.  2 provides an overview of several bootstrap methods for real-valued univariate data, right-censored univariate data, and real-valued bivariate data. To illustrate their application, an example with data from the literature is presented in Sect.  3 , using Efron’s and Banks’ bootstrap methods for hypothesis testing. Section  4 compares the smoothed bootstrap methods and Efron’s bootstrap methods through simulations of various hypothesis tests: quartile tests, two-sample median tests, and Pearson and Kendall correlation tests. Firstly, the smoothed and Efron’s bootstrap methods for real-valued and right-censored univariate data are used to compute the Type I error rates of quartile tests. Secondly, the achieved significance level is used to compute the Type I error rate of two-sample median tests. Lastly, for real-valued bivariate data, the smoothed bootstrap methods and Efron’s bootstrap method are compared in computing the Type I error rates of Pearson and Kendall correlation tests. The final section provides some concluding remarks.

2 Bootstrap Methods for Different Data Types

When it comes to real-world applications, using traditional statistical methods can be challenging due to the mathematical assumptions involved. Bootstrap methods provide a computer-based way of conducting statistical inference that does not require complex formulas. This paper demonstrates the use of different bootstrap methods for hypothesis testing. This section provides an overview of several bootstrap methods that can be applied to real-valued data, right-censored data, and bivariate data.

2.1 Bootstrap Methods for Real-Valued Univariate Data

In this section, we will discuss two bootstrap methods for data that include only real-valued observations, namely Efron’s bootstrap method and Banks’ bootstrap method [ 4 , 13 ]. These methods are used to measure the variability of sample estimates for a given function of interest \(\theta (F)\) , where F is a continuous distribution defined on the interval [ a ,  b ]. Suppose we have n independent and identically distributed random quantities \(X_{1}, X_{2}, \ldots , X_{n}\) from the distribution F and the corresponding observations are \(x_{1}, x_{2}, \ldots , x_{n}\) .

Efron’s bootstrap method [ 13 ] is a nonparametric method proposed to measure the variability of sample estimates. It uses the empirical distribution function of the original sample, where each observation has the same probability of being selected. To create B resamples of size n , we randomly select observations with replacement from the original sample. We then calculate the function of interest \(\hat{\theta }\) for each bootstrap sample to obtain \(\hat{\theta }_{1}, \hat{\theta }_{2}, \ldots , \hat{\theta }_{B}\) . The empirical distribution of these results approximates the sampling distribution of \(\theta (F)\) . Efron’s bootstrap method is commonly used for hypothesis testing and has been shown to provide reliable results [ 17 ].
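As an illustration only (the code is ours, not the paper's), Efron's resampling scheme for a generic statistic can be sketched in a few lines of R; the toy data and the choice of the median as statistic are arbitrary:

```r
# Efron's bootstrap: resample the data with replacement B times
# and evaluate the statistic of interest on each resample.
efron_boot <- function(x, statistic, B = 1000) {
  n <- length(x)
  replicate(B, statistic(sample(x, size = n, replace = TRUE)))
}

set.seed(1)
x <- rexp(20)                        # toy data
theta_star <- efron_boot(x, median)  # B bootstrap medians
quantile(theta_star, c(0.05, 0.95))  # 90% percentile bootstrap interval
```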

Banks’ bootstrap method [ 4 ] is a smoothed bootstrap method for real-valued univariate data. The original data points are ordered as \(x_{(1)}, x_{(2)}, \ldots , x_{(n)}\) , and the sample space [ a ,  b ] is divided into \(n+1\) intervals by the observations, where the end points \(x_{(0)}\) and \(x_{(n+1)}\) are equal to a and b , respectively. Each interval \((x_{(i)}, x_{(i+1)})\) for \(i= 0, 1, 2, \ldots , n\) is assigned a probability of \(\frac{1}{n+1}\) . To create a bootstrap sample, we randomly select n intervals with replacement, and then sample one observation uniformly from each selected interval. Based on the bootstrap sample, we calculate the function of interest and repeat this process B times to obtain \(\hat{\theta }_{1}, \hat{\theta }_{2}, \ldots , \hat{\theta }_{B}\) . The empirical distribution of these values approximates the sampling distribution of \(\theta (F)\) . Banks’ bootstrap method is used for hypothesis testing in this paper and will be compared to Efron’s bootstrap method in Sect.  4 .
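A hedged sketch of Banks' method for a known bounded support [ a ,  b ], under the assumptions above (no ties; each of the \(n+1\) intervals has probability \(\frac{1}{n+1}\)); again the code and toy data are ours:

```r
# Banks' smoothed bootstrap on a bounded support [a, b]:
# resample n of the n+1 inter-observation intervals with replacement,
# then draw one point uniformly from each selected interval.
banks_boot <- function(x, statistic, a, b, B = 1000) {
  n <- length(x)
  breaks <- c(a, sort(x), b)   # n + 2 end points -> n + 1 intervals
  replicate(B, {
    idx <- sample(n + 1, size = n, replace = TRUE)
    xs  <- runif(n, min = breaks[idx], max = breaks[idx + 1])
    statistic(xs)
  })
}

set.seed(1)
x <- runif(20)  # toy data with support [0, 1]
theta_star <- banks_boot(x, median, a = 0, b = 1)
```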

2.2 Bootstrap Methods for Right-Censored Univariate Data

This section presents Efron’s bootstrap method [ 14 ] and the smoothed bootstrap method for right-censored data [ 1 , 2 ]. Let \(T_{1},T_{2},\ldots ,T_{n}\) be independent and identically distributed event time random variables from a distribution F supported on \(\mathbb {R}^{+}\) , and let \(C_{1},C_{2},\ldots ,C_{n}\) be independent and identically distributed censoring time random variables from a distribution G supported on \(\mathbb {R}^{+}\) . Furthermore, let \((X_{1}, D_{1}), (X_{2}, D_{2}), \ldots , (X_{n}, D_{n})\) be the right-censored random variables, where each pair is derived by

\(X_{i}=\min (T_{i},C_{i}), \qquad D_{i}=\mathbb {1}\{T_{i}\le C_{i}\},\)

for \(i= 1, 2, \ldots , n\) , so that \(D_{i}=1\) indicates an observed event and \(D_{i}=0\) a right-censored observation. Let \((x_{1},d_{1}),(x_{2},d_{2}),\ldots ,(x_{n},d_{n})\) be the observations of the corresponding random quantities \((X_{1},D_{1}),(X_{2},D_{2}),\ldots ,(X_{n},D_{n})\) , and let \(\theta (F)\) be the function of interest, estimated by \(\theta (\hat{F})\) .

Efron [ 14 ] proposed a nonparametric bootstrap method for data with right-censored observations, similar to the one he proposed for real-valued data. The empirical distribution function of the original sample is used, so that each observation has equal probability \(\frac{1}{n}\) , regardless of whether it is an event or a censored observation. To apply this method, B bootstrap samples of size n are generated by randomly selecting observations from the original dataset with replacement, and the function of interest is calculated on each bootstrap sample. This yields values \(\hat{\theta }_{1}, \hat{\theta }_{2}, \ldots , \hat{\theta }_{B}\) , whose empirical distribution can be a good estimate of the sampling distribution of \(\theta (F)\) . This bootstrap method is useful for testing the equality of average lifetimes across two populations [ 25 ], and it has been shown to provide good results in multiple statistical inferences; see Efron [ 15 ] and Efron and Tibshirani [ 16 , 17 ] for more details.

Another method for right-censored data is the smoothed bootstrap method, introduced by Al Luhayb [ 1 ] and Al Luhayb et al. [ 2 ]. This method generalises Banks’ bootstrap method for right-censored data, and is based on the generalisation of the A \(_{(n)}\) assumption for data that contains right-censored observations, proposed by Coolen and Yan [ 9 ]. To implement this method, the data support is divided into \(n+1\) intervals by the original data, and the right-censored A \(_{(n)}\) assumption is used to assign specific probabilities to these intervals. For each bootstrap sample, n intervals are resampled with the assignment probabilities, and one observation is sampled from each interval. Performing these steps B times creates B bootstrap samples. Then, the function of interest is computed for each bootstrap sample, resulting in the values \(\hat{\theta }_{1}, \hat{\theta }_{2}, \ldots , \hat{\theta }_{B}\) . The empirical distribution of these values is used to estimate the sampling distribution of \(\theta (F)\) . In this paper, we use the smoothed bootstrap method for hypothesis testing and compare its performance to Efron’s bootstrap method, with the comparison results presented in Sect.  4 .

2.3 Bootstrap Methods for Bivariate Data

In this section, we discuss Efron’s bootstrap method [ 16 ] and three smoothed bootstrap methods for bivariate data [ 1 , 3 ]. Let \((X_i, Y_i) \in \mathbb {R}^{2}\) , for \(i= 1, 2, \ldots , n\) , denote independent and identically distributed random variables with distribution H . The observations corresponding to \((X_i, Y_i)\) are \((x_i, y_i)\) . We are interested in \(\theta {(H)}\) , which is estimated by \(\theta {(\hat{H})}\) . To implement the bootstrap, Efron and Tibshirani [ 16 ] used the empirical distribution: multiple bootstrap samples, say B , of size n are created by resampling with equal probability from the observed data, the function of interest is calculated on each bootstrap sample, and the empirical distribution of the resulting B values is used as a proxy for the distribution of the function of interest. This is the same approach as for univariate data. Several references use this bootstrap method for hypothesis testing; see e.g. Dolker et al. [ 11 ], MacKinnon [ 19 ] and Hesterberg [ 18 ].

In recent work, Al Luhayb [ 1 ] and Al Luhayb et al. [ 3 ] proposed three different smoothed bootstrap methods for estimating the distribution of a function of interest. The first, referred to as SBSP, is based on the semi-parametric predictive method proposed by Muhammad [ 20 ]. The second, referred to as SBNP, is based on the nonparametric predictive method introduced by Muhammad et al. [ 21 ]. These two methods divide the sample space into \((n+1)^2\) squares (blocks hereafter), each assigned a certain probability. The third method, referred to as SEB, is based on uniform kernels, where each data point is surrounded by a block of size \(b_X \times b_Y\) with the observation at the centre of its block, \(b_X\) and \(b_Y\) being the chosen bandwidths for the kernel. To create a bootstrap sample, n blocks are resampled according to the assigned probabilities, and one observation is sampled from each chosen block. This process is repeated multiple times, typically \(B=1000\) times, and the function of interest is calculated on each bootstrap sample. This results in B values, and the empirical distribution of these values is used to estimate the distribution of the function of interest.

3 An Illustrative Example

In this section, we explore an example using data from the literature on the maximum flow rates over a 100-year period at gauging stations on rivers in North Carolina [ 6 ]. The data are presented in Table  1 and show the maximum flow rates in gallons per second. Our goal is to investigate whether the median of the data is equal to 5400 gallons per second, using 90% bootstrap confidence intervals obtained with Efron’s bootstrap method and Banks’ bootstrap method.

To conduct the test, we first generate 1000 bootstrap samples from the original data with each of the two bootstrap methods. We then calculate the median of each bootstrap sample and define the 90% bootstrap confidence interval for the median by taking the 50th and 950th ordered values.

If the value 5400 is included in the confidence interval, we fail to reject the null hypothesis; otherwise, we reject it. Table  2 presents the 90% confidence intervals for the median based on both Efron’s and Banks’ bootstrap methods. As the value 5400 falls within both confidence intervals, we fail to reject the null hypothesis.
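As a hedged sketch of this procedure, reusing the `efron_boot` and `banks_boot` functions sketched in Sect. 2.1 (`flow` stands in for the Table 1 data, which is not reproduced here, and the finite support assumed for Banks' method is an arbitrary choice of ours):

```r
# 90% percentile bootstrap CIs for the median (50th and 950th ordered
# values of 1000 bootstrap medians), then check whether 5400 is inside.
set.seed(1)
meds_efron <- sort(efron_boot(flow, median, B = 1000))
meds_banks <- sort(banks_boot(flow, median, a = 0, b = 2 * max(flow), B = 1000))

ci_efron <- meds_efron[c(50, 950)]
ci_banks <- meds_banks[c(50, 950)]
reject_efron <- !(5400 >= ci_efron[1] && 5400 <= ci_efron[2])
reject_banks <- !(5400 >= ci_banks[1] && 5400 <= ci_banks[2])
```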

4 Comparison of the Bootstrap Methods

Hypothesis tests based on the bootstrap method are a type of computer-based statistical technique. Thanks to advancements in computational power, these tests have become practical for real-world applications. The basic idea behind the bootstrap method is simple to understand and does not rely on complex mathematical assumptions. In this section, we conduct various tests for different types of data using the bootstrap methods explained in Sect.  2 .

4.1 Hypothesis Tests for Quartiles

In this section, we calculate the Type I error rates of quartile hypothesis tests based on the bootstrap methods presented in Sect.  2.2 , which are used when the data contain right-censored observations. To determine how well the bootstrap methods perform, we simulate datasets that include right-censored observations under two different scenarios. In the first scenario, we use the Beta distribution with shape parameters \(\alpha =1.2\) and \(\beta =3.2\) for the event time observations and the Uniform distribution with parameters \(a=0\) and \(b=1.82\) for the right-censored observations. The second scenario is defined as \(T\sim \text {Log-Normal}(\mu =0,\sigma =1)\) and \(C\sim \text {Weibull}(\alpha =3, \beta =3.7)\) , where \(\alpha \) is the shape parameter and \(\beta \) is the scale parameter (see Appendix). In both scenarios, the censoring proportion p in the generated datasets is \(15\%\) , which is achieved by tuning the parameters of the censoring distribution. For more information on how to fix the censoring proportion, we refer the reader to Wan [ 26 ] and Al Luhayb [ 1 ].

To compare Efron’s bootstrap method with the smoothed bootstrap method, we generate \(N=1000\) datasets from each scenario. For each dataset, we apply each method \(B=1000\) times, yielding 1000 bootstrap samples per method. We then compute the quartile of interest for each bootstrap sample and use the resulting values to define the \(100(1-2\alpha )\%\) bootstrap confidence interval for the quartile. We count one if the value of the quartile specified in the null hypothesis is not included in the confidence interval; otherwise, we count zero. We repeat this procedure for all \(N=1000\) generated datasets and compute the proportion of times the null hypothesis was rejected over the 1000 trials. This proportion is the Type I error rate of the quartile hypothesis test with significance level \(2\alpha \) .
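The simulation loop itself is generic; a minimal sketch (ours, with `gen_data` and `boot_quartile` standing in for a simulation scenario and a bootstrap method, and reusing the `efron_boot` sketch from Sect. 2.1 in the toy call) could look like:

```r
# Type I error rate for a quartile test at significance level 2*alpha:
# over N simulated datasets, reject when the hypothesized quartile falls
# outside the 100(1 - 2*alpha)% percentile bootstrap interval.
type1_rate <- function(N, B, alpha, gen_data, boot_quartile, q_true) {
  rejections <- replicate(N, {
    x <- gen_data()
    q_star <- boot_quartile(x, B)                 # B bootstrap quartiles
    ci <- quantile(q_star, c(alpha, 1 - alpha))
    q_true < ci[1] || q_true > ci[2]
  })
  mean(rejections)
}

# Toy call: Efron bootstrap of the median for uncensored Beta(1.2, 3.2) data
set.seed(1)
type1_rate(N = 200, B = 500, alpha = 0.05,
           gen_data = function() rbeta(10, 1.2, 3.2),
           boot_quartile = function(x, B) efron_boot(x, median, B),
           q_true = qbeta(0.5, 1.2, 3.2))
```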

It is important to note that Efron’s bootstrap samples often include censored observations, so we use the Kaplan–Meier (KM) estimator to find their quartiles. Suppose we are interested in the median; we must find a time t such that \(\hat{S}(t)=0.50\) in each bootstrap sample. Unfortunately, in some samples no such time exists, i.e. there is no t with \(\hat{S}^{-1}(0.50)=t\) . In this case, we consider three options. The first is to discard all bootstrap samples whose median cannot be found, so that the \(100(1-2\alpha )\%\) bootstrap confidence interval for the median is based on fewer than 1000 bootstrap samples; this option is referred to as E \(_{(1)}\) . The second is to take the median to be the maximum event time of that bootstrap sample. This is Efron’s suggestion, applied to each bootstrap sample whose median is not found by the KM estimator [ 12 ]; this option is referred to as E \(_{(2)}\) . Finally, we fit an Exponential tail with rate parameter \(\hat{\lambda }^{*}=-\ln (\hat{S}(t_{max}))/t_{max}\) , where \(t_{max}\) is the maximum event time of the bootstrap sample and \(\hat{S}(\cdot )\) is the KM estimator; the corresponding median is then \(X_{med}=-\ln (0.50)/\hat{\lambda }^{*}\) . This suggestion is presented in Brown et al. [ 7 ], and we refer to it as E \(_{(3)}\) . With the last two options, the confidence interval is guaranteed to be based on the medians of all 1000 bootstrap samples.

In the tables, NA denotes the number of Efron’s bootstrap samples in which quartiles cannot be found, while ABS denotes the number of cases in which a bootstrap sample containing only right-censored observations is replaced by another sample that includes at least one event time. Both counts are out of 1,000,000 bootstrap samples (1000 samples for each of the \(N=1000\) datasets).

We consider three different strategies for the smoothed bootstrap method when sampling observations from the \(n+1\) intervals partitioning the sample space. The first strategy is to sample uniformly from all intervals, denoted by SB. The second is to assume an exponential tail for each interval and sample from these tails to create the bootstrap samples, denoted by SB \(_{\text {exp}}\) . The third is to sample uniformly from all intervals except the last one, for which we sample from the exponential tail; we refer to this strategy as SB \(_{\text {Lexp}}\) . By investigating how these sampling strategies affect the results, we can gain insight into the impact of the sampling method on the smoothed bootstrap.

Tables  3 and 4 show the Type I error rates for the quartile hypothesis tests with significance levels 0.10 and 0.05 for data sets simulated under the first scenario. When the sample size is 10, the smoothed bootstrap with its three strategies, SB, SB \(_{\text {exp}}\) and SB \(_{\text {Lexp}}\) , yields smaller discrepancies between actual and nominal error rates for all quartile tests than Efron’s bootstrap with its three options, E \(_{(1)}\) , E \(_{(2)}\) and E \(_{(3)}\) . The superiority of the smoothed bootstrap methods is due not only to the fact that smoothed bootstrap samples always contain event observations, but also to the fact that the KM estimator used for Efron’s bootstrap samples is often unable to find the quartiles, particularly the second and third ones. Out of 1,000,000 bootstrap samples, the first, second and third quartiles cannot be found in 228, 3736 and 32,821 samples, respectively. As the sample size increases to 50, 100 and 500, both methods provide good results, Efron’s method performs better, and the numbers of NA and ABS cases decrease toward zero. These decreases lead to equal results for E \(_{(1)}\) , E \(_{(2)}\) and E \(_{(3)}\) . At these large sample sizes, SB, SB \(_{\text {exp}}\) and SB \(_{\text {Lexp}}\) also provide approximately equal outcomes.

In the second scenario, note that the data space is \((0,\infty )\) , unlike the first scenario, where the support is (0, 1); hence the last interval for the smoothed method is not bounded. In this case, we can only use the smoothed bootstrap strategies SB \(_{\text {exp}}\) and SB \(_{\text {Lexp}}\) , not SB. Tables  5 and 6 present the Type I error rates for the quartile hypothesis tests with significance levels of 0.10 and 0.05, respectively. The SB \(_{\text {exp}}\) and SB \(_{\text {Lexp}}\) methods again outperform Efron’s method in terms of Type I error rates when the sample size is small. As the sample size gets large, both approaches perform well, as observed in Tables  3 and 4 .

In the special case where the data include only failures, with no censored observations, we use Banks’ bootstrap method and Efron’s bootstrap method, presented in Sect.  2.1 , to compute the Type I error rates for the quartile hypothesis tests. In the simulations, we use \(\text {Beta}(\alpha =1.2,\beta =3.2)\) to create the data sets and repeat the same comparison procedure as in the previous simulations. Tables  7 and 8 present the Type I error rates for the quartile hypothesis tests based on Banks’ and Efron’s methods with significance levels of 0.10 and 0.05, respectively. Banks’ bootstrap method performs better, particularly when \(n=10\) and \(2\alpha =0.05\) . As the sample size gets large, both methods perform well.

4.2 The Two-Sample Problem

When conducting a hypothesis test \(H_{0}: \theta _{1}=\theta _{2}\) against \(H_1: \theta _1 \ne \theta _2\) , where \(\theta _{1}\) and \(\theta _{2}\) represent the function of interest in the first and second populations respectively, the achieved significance level ( ASL ) is used to draw a conclusion. ASL is defined as the probability of observing a value at least as large as \(\hat{\theta }=\hat{\theta }_{1}-\hat{\theta }_{2}\) when the null hypothesis is true,

\(ASL=P_{H_{0}}\left( \hat{\theta }^{*}\ge \hat{\theta }\right) .\)

The smaller the value of ASL , the stronger the evidence against \(H_{0}\) . The value \(\hat{\theta }\) is fixed at its observed value, and the quantity \(\hat{\theta }^{*}\) has the null hypothesis distribution, which is the distribution of \(\hat{\theta }\) if \(H_{0}\) is true [ 17 ].

Efron and Tibshirani [ 17 ] used the achieved significance level to test whether two populations have equal means. Suppose we have two samples \({\textbf {z}}=\{z_{1},z_{2},\ldots ,z_{n}\}\) and \({\textbf {y}}=\{y_{1},y_{2},\ldots ,y_{m}\}\) from possibly different probability distributions, and we wish to test the null hypothesis \(H_{0}: \theta _{1}=\theta _{2}\) . Efron’s bootstrap method is used to approximate the ASL value, and \(H_{0}\) is rejected when \(\widehat{ASL}<2\alpha \) . The algorithm to test the null hypothesis based on the bootstrap methods is as follows:

1. Combine the \({\textbf {z}}\) and \({\textbf {y}}\) samples, giving a sample \({\textbf {x}}\) of size \(n+m\) : \({\textbf {x}}=\{z_{1},z_{2},\ldots ,z_{n},y_{1},y_{2},\ldots ,y_{m}\}\) .

2. Draw B bootstrap samples of size \(n+m\) with replacement from \({\textbf {x}}\) , and call the first n observations \({\textbf {z}}^{*b}\) and the remaining m observations \({\textbf {y}}^{*b}\) for \(b=1,2,\ldots ,B\) .

3. For each bootstrap sample, compute the means of \({\textbf {z}}^{*b}\) and \({\textbf {y}}^{*b}\) , then find \(A^{*b}=\overline{{\textbf {z}}}^{*b}-\overline{{\textbf {y}}}^{*b}\) , \(b=1,2,\ldots ,B\) .

4. Approximate the achieved significance level by

\(\widehat{ASL}=\frac{\#\{b: A^{*b}\ge A_{obs}\}}{B},\)

where \(A_{obs}=\overline{{\textbf {z}}}-\overline{{\textbf {y}}}\) , and \(\overline{{\textbf {z}}}\) and \(\overline{{\textbf {y}}}\) are the sample means of the two original samples.
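A compact R sketch of this algorithm with Efron resampling (a sketch of ours; the toy samples and all object names are hypothetical):

```r
# Bootstrap two-sample test of H0: equal means, via the achieved
# significance level (ASL): pool the samples, resample, re-split.
set.seed(1)
z <- rnorm(15, mean = 1)      # toy first sample
y <- rnorm(20, mean = 1)      # toy second sample
n <- length(z); m <- length(y)
x <- c(z, y)                  # pooled sample
A_obs <- mean(z) - mean(y)

B <- 1000
A_star <- replicate(B, {
  xs <- sample(x, size = n + m, replace = TRUE)
  mean(xs[1:n]) - mean(xs[(n + 1):(n + m)])
})
ASL_hat <- mean(A_star >= A_obs)  # one-sided, as in the definition above
```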

We employ this strategy to examine whether two samples have the same median ( \(Q_{2}^{1}=Q_{2}^{2}\) ) or not. To conduct these tests, we use the bootstrap methods presented in Sect.  2.2 and make comparisons through simulations. Specifically, we calculate the Type I error rate for the hypothesis test

\(H_{0}: Q_{2}^{1}=Q_{2}^{2} \quad \text {versus} \quad H_{1}: Q_{2}^{1}\ne Q_{2}^{2}. \qquad (5)\)

To compare the bootstrap methods through simulation, we first generate two datasets of size n using the second scenario of Sect.  4.1 . We compute the medians of these datasets, \(\hat{Q}_{2}^{1}\) and \(\hat{Q}_{2}^{2}\) , and calculate \(A_{obs}=\hat{Q}_{2}^{1}-\hat{Q}_{2}^{2}\) . Next, we combine the two datasets into a new dataset of size 2 n . Then, for each bootstrap method, we draw \(B=1000\) samples of size 2 n , and call the first n observations \({\textbf {z}}^{*b}\) and the remaining n observations \({\textbf {y}}^{*b}\) for \(b=1,2,\ldots ,B\) . We compute \(A^{*b}=\hat{Q}_{2}({\textbf {z}}^{*b})-\hat{Q}_{2}({\textbf {y}}^{*b})\) for each bootstrap sample, resulting in 1000 \(A^{*}\) values. Finally, we calculate the ASL value and reject \(H_{0}\) if \(\widehat{ASL}<2\alpha \) . We repeat this process for \(N=1000\) pairs of generated datasets and count the number of times we reject the null hypothesis. The proportion of rejections out of the 1000 trials is the Type I error rate, and we consider the method whose rate is closest to \(2\alpha \) to be the best. The final results of the simulations are presented in Tables  9 and 10 for two significance levels.

As the sample space of the underlying distribution is \([0,\infty )\) , we only consider SB \(_{\text {exp}}\) and SB \(_{\text {Lexp}}\) for the smoothed bootstrap method. For Efron’s method, we consider E \(_{(2)}\) and E \(_{(3)}\) as they are guaranteed to find the median of each set in each bootstrap sample. Tables  9 and 10 present the Type I error rates of the hypothesis test in Equation ( 5 ) with significance levels of 0.10 and 0.05, respectively. The SB \(_{\text {exp}}\) and SB \(_{\text {Lexp}}\) methods generally provide lower actual Type I error rates compared to E \(_{(2)}\) and E \(_{(3)}\) at different sample sizes. However, E \(_{(2)}\) and E \(_{(3)}\) provide smaller discrepancies between the actual and nominal Type I error levels, especially when the sample size is small. When \(n=500\) , all methods provide almost identical results.

In previous simulations, we created both samples in each run from a single scenario, but now we want to create samples from two different scenarios. In each run, the first sample is created from \(T\sim \text {Log-Normal}(\mu =0,\sigma =1)\) and \(C\sim \text {Weibull}(\alpha =3,\beta =3.7)\) , while the second sample is created from \(T\sim \text {Weibull}(\alpha =1,\beta =1.443)\) and \(C\sim \text {Exponential}(\lambda =0.12)\) , where \(p=0.15\) in both scenarios (see Appendix). We aim to investigate how the bootstrap methods perform when the two samples have different distributions but the same median (which is equal to 1). Tables  11 and 12 show the Type I error rates with significance levels of 0.10 and 0.05, respectively. All methods perform well at different sample sizes, and the results are close to the nominal size \(2\alpha \) , particularly when the sample size is large.

4.3 Pearson Correlation Test

In Sect.  2.3 , we presented smoothed bootstrap methods alongside Efron’s method. We compute the Type I error rate to assess each method, where a method is considered superior if its Type I error rate is closer to the significance level \(2\alpha \) . In this section, we simulate data sets from two different scenarios to compare the methods. In the first scenario, we generate data sets from the Gumbel copula, where the marginals X and Y both follow the standard uniform distribution. The second scenario uses the Clayton copula, where X follows a normal distribution with mean 1 and standard deviation 1, and Y follows a normal distribution with mean 5 and standard deviation 3. For both scenarios, we consider three dependence levels \(\rho \) and three sample sizes, with two significance levels. We also report the dependence parameters of the copulas and the corresponding concordance measure, Kendall’s \(\tau \) . The cumulative distribution functions of the Gumbel copula and the Clayton copula are, respectively, given by [ 22 ]

\(C(u,v)=\exp \left( -\left[ (-\ln u)^{\theta }+(-\ln v)^{\theta }\right] ^{1/\theta }\right) , \quad \theta \ge 1,\)

\(C(u,v)=\left( u^{-\theta }+v^{-\theta }-1\right) ^{-1/\theta }, \quad \theta >0,\)

where all marginals are uniformly distributed on [0, 1].

To compute the Type I error rate for the null hypothesis \(\rho =\rho ^{\star }\) based on a bootstrap method, we create \(N=1000\) data sets of sample size n and dependence level \(\rho =\rho ^{\star }\) from one of the scenarios presented above. For each generated data set, we apply each bootstrap method \(B=1000\) times and compute the Pearson correlation of each bootstrap sample. We order the 1000 bootstrapped Pearson correlations from lowest to highest and obtain the \(100(1-2\alpha )\%\) bootstrap confidence interval. If the null hypothesis value is not included in the confidence interval, we reject \(H_{0}\) and count one; otherwise, we do not reject \(H_{0}\) and count zero. The proportion of times the null hypothesis is rejected over the 1000 trials is the Type I error rate.
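For the Efron variant, the inner loop resamples rows (pairs) of the bivariate data; a minimal sketch of ours with toy data:

```r
# Percentile bootstrap CI for the Pearson correlation by resampling
# rows of a bivariate data set with replacement.
set.seed(1)
dat <- data.frame(x = runif(50), y = runif(50))  # toy bivariate sample
B <- 1000
r_star <- replicate(B, {
  rows <- sample(nrow(dat), replace = TRUE)
  cor(dat$x[rows], dat$y[rows], method = "pearson")
})
ci <- quantile(r_star, c(0.05, 0.95))  # 90% interval; compare H0 value to it
```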

Table  13 presents the Type I error rates based on the bootstrap methods, where the significance level is 0.10. For a small sample size of \(n=10\) , the SBSP and SBNP methods provide error rates closer to the nominal rate of 0.10 compared to Efron’s and the smoothed Efron’s methods. However, the SBNP method is the best when \(\rho = 0.4\) and 0.8. When n increases to 50 and 100, all methods decrease the discrepancies between the actual and nominal error rates, but the SBNP method is the superior one in most cases.

With a significance level of 0.05, the actual Type I error rates based on the bootstrap methods are listed in Table  14 . The SBSP and SBNP methods again provide lower discrepancies between the nominal and actual Type I error rates compared to Efron’s and the smoothed Efron’s methods, especially when \(n=10\) . When the sample size increases to 50 and 100, all methods perform better, but the SBNP method is the best one in most settings.

In the second scenario, we simulate \(N=1000\) data sets with dependence level \(\rho =\rho ^{\star }\) and compute the Type I error rates using the bootstrap methods, as shown in Tables 15 and 16 . For \(n=10\) , the SBSP method provides the closest results to the nominal error rates at most levels of \(\rho \) . As n increases to 50 and 100, its performance worsens for \(H_{0}: \rho =0.8\) because the underlying distribution is not symmetric. At these larger sample sizes, the SBNP, Efron and SEB methods perform better than the SBSP method, particularly the SBNP method. The SBNP method provides the lowest discrepancies between the nominal and actual error rates in most cases, at both significance levels of 0.10 and 0.05; however, when \(n=10\) and \(\rho =0\) or 0.4, the SBNP method produces very small error rates.

4.4 Kendall Correlation Test

In Sect.  4.3 , we computed the Type I error rate for the Pearson correlation test using different sample sizes and dependence levels. In this section, we aim to repeat the same comparisons, but this time, we will use the Kendall correlation test instead. We will use the same scenarios, generating datasets with \(n=10, 50\) and 100, and dependence levels of \(\tau =0, 0.4\) and 0.8, with significance levels of 0.10 and 0.05.

To generate data sets and apply the bootstrap methods, we will use the Gumbel copula, where both marginals follow Uniform(0,1). From Tables 17 and 18 , we can see that the SBSP method performs well when \(\tau =0\) across all different sample sizes. However, it performs poorly as the sample size increases for \(\tau =0.4\) and 0.8. This is in contrast to the results based on SBNP, Efron’s, and smoothed Efron’s methods. These methods provide lower error rates than the nominal levels when the sample size is small at all different dependence levels. As n increases to 50 and 100, the error rates become closer to the nominal level \(2\alpha \) .

Tables 19 and 20 present the Type I error rates for the Kendall correlation test at different dependence levels with significance levels of 0.10 and 0.05, respectively. When \(\tau =0\) and \(n=10\) , the error rate based on the SBNP method is considerably lower than the nominal level \(2\alpha \) , while the results of the other methods are close to the nominal levels. As the sample size increases to 50 and 100, all methods provide good results. If there is a strong relation between the variables, it is recommended to use either Efron’s bootstrap method or the SEB method. Both produce good results because they affect the observations’ ranks, on which the Kendall correlation is based, much less than the SBSP and SBNP methods do.

5 Concluding Remarks

In this paper, we explored how the proposed smoothed bootstrap methods can be used to compute Type I error rates for different hypothesis tests and compared their results to those of Efron’s bootstrap methods through simulations. The smoothed bootstrap methods are applied to real-valued data, right-censored data and bivariate data. For real-valued data and right-censored data, we test the null hypothesis that quartiles are equal to those of the underlying distributions. We also test whether two sample medians are equal, regardless of whether the two samples come from the same underlying distribution. For bivariate data, we compute the Type I error rates for Pearson and Kendall correlation tests.

We found that, for real-valued and right-censored data, the smoothed bootstrap methods perform better when the sample size is small, providing smaller discrepancies between actual and nominal error rates. As the sample size gets larger, all bootstrap methods provide good results, though Efron’s methods mostly perform better for the third quartile. For the two-sample median test, we used the achieved significance level to test whether the two samples have equal medians; all bootstrap methods performed well, with Type I error rates close to the nominal levels.

For the Pearson correlation test, the SBSP and SBNP methods lead to smaller discrepancies between actual and nominal Type I error rates than Efron’s and smoothed Efron’s methods when the sample size is small. For large sample sizes, all methods provide good results, though the SBNP method performs better at most dependence levels. In situations where the data distribution is asymmetric, the SBSP method does not perform well, particularly when \(\tau \) is not close to zero, which results from its Normal copula assumption.

For the Kendall correlation test, it is recommended to use either Efron’s bootstrap method or the SEB method, particularly when the underlying distribution is asymmetric and the Kendall correlation is strong. Their influence on the observations’ ranks is much smaller than that of the SBSP and SBNP methods. When \(\tau =0\) and the sample size is small, all bootstrap methods perform well, and as n gets large, their performance improves and the Type I error rates get closer to the nominal level \(2\alpha \) .

In conclusion, we used the bootstrap methods for real-valued data, right-censored data and bivariate data to compute Type I error rates for different hypothesis tests. Future research could focus on applying these bootstrap methods to compute power or Type II error rates for some hypothesis tests.

References

1. Al Luhayb ASM (2021) Smoothed bootstrap methods for right-censored data and bivariate data. PhD thesis, Durham University. http://etheses.dur.ac.uk/14096

2. Al Luhayb ASM, Coolen FPA, Coolen-Maturi T (2023) Smoothed bootstrap for right-censored data. Commun Stat Theory Methods. https://doi.org/10.1080/03610926.2023.2171708

3. Al Luhayb ASM, Coolen-Maturi T, Coolen FPA (2023) Smoothed bootstrap methods for bivariate data. J Stat Theory Pract 17(3):1–37. https://doi.org/10.1007/s42519-023-00334-7

4. Banks DL (1988) Histospline smoothing the Bayesian bootstrap. Biometrika 75:673–684

5. Berrar D (2019) Introduction to the non-parametric bootstrap. In: Encyclopedia of bioinformatics and computational biology. Academic Press, Oxford, pp 766–773

6. Boos DD (2003) Introduction to the bootstrap world. Stat Sci 18(2):168–174

7. Brown BW, Hollander M, Korwar RM (1974) Nonparametric tests of independence for censored data with applications to heart transplant studies. In: Proschan F, Serfling RJ (eds) Reliability and biometry. SIAM, Philadelphia, pp 327–354

8. Coolen FPA, BinHimd S (2020) Nonparametric predictive inference bootstrap with application to reproducibility of the two-sample Kolmogorov–Smirnov test. J Stat Theory Pract 14:1–13

9. Coolen FPA, Yan KJ (2004) Nonparametric predictive inference with right-censored data. J Stat Plan Inference 126:25–54

10. Davison AC, Hinkley DV (1997) Bootstrap methods and their application. Cambridge University Press, Cambridge

11. Dolker M, Halperin S, Divgi DR (1982) Problems with bootstrapping Pearson correlations in very small bivariate samples. Psychometrika 47(4):529–530

12. Efron B (1967) The two-sample problem with censored data. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol 4. University of California Press, Berkeley, pp 831–853

13. Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7:1–26

14. Efron B (1981) Censored data and the bootstrap. J Am Stat Assoc 76:312–319

15. Efron B (1982) The jackknife, the bootstrap, and other resampling plans, vol 38. Society for Industrial and Applied Mathematics, Philadelphia

16. Efron B, Tibshirani R (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci 1:54–77

17. Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman & Hall, London

18. Hesterberg T (2011) Bootstrap. Wiley Interdiscip Rev Comput Stat 3(6):497–526

19. MacKinnon JG (2009) Bootstrap hypothesis testing. In: Handbook of computational econometrics, pp 183–213

20. Muhammad N (2016) Predictive inference with copulas for bivariate data. PhD thesis, Durham University, UK

21. Muhammad N, Coolen FPA, Coolen-Maturi T (2016) Predictive inference for bivariate data with nonparametric copula. Am Inst Phys AIP Conf Proc 1750(1):0600041–0600048. https://doi.org/10.1063/1.4954609

22. Muhammad N, Coolen-Maturi T, Coolen FPA (2018) Nonparametric predictive inference with parametric copulas for combining bivariate diagnostic tests. Stat Optim Inf Comput 6(3):398–408

23. Rasmussen JL (1987) Estimating correlation coefficients: bootstrap and parametric approaches. Psychol Bull 101(1):136–139

24. Strube MJ (1988) Bootstrap type I error rates for the correlation coefficient: an examination of alternate procedures. Psychol Bull 104(2):290–292

25. Vaman H, Tattar P (2022) Survival analysis. Chemical Rubber Company Press, Boca Raton

26. Wan F (2017) Simulating survival data with predefined censoring rates for proportional hazards models. Stat Med 36:721–880


Acknowledgements

Asamh Al Luhayb was a PhD student at Durham University, supported by a scholarship from the Deanship of Scientific Research at Qassim University. During his studies, he worked under the supervision of Prof. Frank Coolen and Dr. Tahani Coolen-Maturi.

Author information

Authors and Affiliations

Department of Mathematics, College of Science, Qassim University, P.O. Box 6644, Buraydah, 51452, Saudi Arabia

Asamh S. M. Al Luhayb

Department of Mathematical Sciences, Durham University, Durham, DH1 3LE, UK

Tahani Coolen-Maturi & Frank P. A. Coolen


Corresponding author

Correspondence to Tahani Coolen-Maturi.

Ethics declarations

Conflicts of interest

The authors state that there is no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The probability density functions for the distributions used in each scenario to generate right-censored data are given below.

Scenario 1:

Beta distribution for event times:

\(f(t)=\frac{t^{\alpha -1} (1-t)^{\beta -1}}{B(\alpha ,\beta )}; \ t\in [0,1]\) where \(\alpha =1.2\) and \(\beta =3.2\) , and \(B(\alpha ,\beta )\) is the Beta function.

Uniform distribution for censored times:

\(g(c)=\frac{1}{b-a}; \ c\in [a,b]\) where \(a=0\) and \(b=1.82\) .

Scenario 2:

Log-Normal distribution for event times:

\(f(t)=\dfrac{1}{t\sqrt{2\pi }} \exp (-\frac{(\ln (t))^2}{2}); \ t\in (0,\infty )\) .

Weibull distribution for censored times:

\(g(c)=\frac{\alpha }{\beta } (\frac{c}{\beta })^{\alpha -1} \exp (-(\frac{c}{\beta })^{\alpha }); \ c\in [0,\infty )\) where \(\alpha =3\) and \(\beta =3.7\) .

Scenario 3:

Weibull distribution for event times:

\(f(t)=\frac{\alpha }{\beta } (\frac{t}{\beta })^{\alpha -1} \exp (-(\frac{t}{\beta })^{\alpha }); \ t\in [0,\infty )\) where \(\alpha =1\) and \(\beta =1.443\) .

Exponential distribution for censored times:

\(g(c)=\lambda \exp (-\lambda c); \ c\in [0,\infty )\) where \(\lambda =0.12\) .

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Al Luhayb, A.S.M., Coolen-Maturi, T. & Coolen, F.P.A. Smoothed Bootstrap Methods for Hypothesis Testing. J Stat Theory Pract 18 , 16 (2024). https://doi.org/10.1007/s42519-024-00370-x


Accepted : 07 February 2024

Published : 04 March 2024

DOI : https://doi.org/10.1007/s42519-024-00370-x


Keywords

  • Achieved significance level
  • Banks’ bootstrap
  • Bootstrap confidence interval
  • Efron’s bootstrap
  • Smoothed bootstrap

fpgdubost/bstrap

bstrap: a Python package for confidence intervals and hypothesis testing using bootstrapping.

You are an amazing machine learning researcher .

You invented a new super cool method .

You are not sure that it is significantly better than your baseline.

You don't have 3000 GPUs to rerun your experiment and check it out.

Then, what you want to do is bootstrap your results !

The bstrap package allows you to compare two methods and claim that one is better than the other.

Installation

That's all you need, really.

Maybe, though, you can still read the instructions and check out the examples to make sure you get it right...

Bootstrapping is a simple method to compute statistics over your custom metrics, using only one run of the method for each sample in your evaluation set. It has the advantage of being very versatile, and can be used with any metric really.

  • Bootstrapping for computation of confidence interval
  • Bootstrapping for hypothesis testing (claim that one method is better than another for a given metric)
  • Supports metrics that can be computed sample-wise and metrics that cannot.

Keep in mind: non-overlapping confidence intervals mean that there is a statistically significant difference. Overlapping confidence intervals do not mean that there is no statistically significant difference. To verify this further, you will need to run the bootstrap hypothesis test and check the p-value.

Instructions

You will need to implement your metric and provide the data sample-wise as a single Pandas dataframe for each method. That's about it. Is your metric more complex than simply averaging results for each sample? Can it not be computed sample-wise, like AUC or mAP? Then just provide your predictions and ground truths sample-wise, which bstrap also supports.

To use this code, you need to:

  • Implement your own metric: it should take a single pandas dataframe of data as input and return a scalar value.
  • Load your data.
  • Reformat data to a single pandas dataframe per method with standardized column names, and one sample per row.
  • Check that your estimates (confidence interval and p-value) are stable over several runs of the bootstrapping method. If the estimates are not stable, increase nbr_runs

You can find example dataframes under src/bstrap/example_dataframes.

Examples

  • Example 1: mean metric
  • Example 2: F1 score
  • Example 3: AUC
  • Example 4: multiclass mean average precision (mAP)

Reference: Efron, B. and Tibshirani, R.J., 1994. An introduction to the bootstrap. CRC press.



Beyond normality: the bootstrap method for hypothesis testing

tl;dr: Parametric bootstrap methods can be used to test hypotheses and calculate p values while assuming any particular population distribution we may want. Non-parametric bootstrap methods can be used to test hypotheses and calculate p values without having to assume any particular population, as long as the sample can be assumed to be representative of the population and the data can be transformed adequately to take into account the null hypothesis. The p values from bootstrap methods may differ from those from classical methods, especially when the assumptions of the classical methods do not hold. The different methods of calculation can push a p value beyond the 0.05 threshold, which means that statements of statistical significance are sensitive to all the assumptions used in the test.

Introduction

In this article I show how to use parametric and non-parametric bootstrapping to test null hypotheses, with special emphasis on situations when the assumption of normality may not hold. To make it more relevant, I will use real data (from my own research) to illustrate the application of these methods. If you get lost somewhere in this article, you may want to take a look at my previous post, where I introduced the basic concepts behind hypothesis testing and sampling distributions. As in the previous post, the analysis will be done in R, so before we get into the details, it is important to properly set up our R session:
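(The setup chunk was lost from this copy of the post; a plausible reconstruction, assuming the packages used later on — `distr6` for the population models and `furrr`/`future` for `future_map_dbl`:)

```r
library(dplyr)      # data manipulation
library(ggplot2)    # plotting
library(distr6)     # object-oriented probability distributions
library(furrr)      # parallel map functions (future_map_dbl)
library(future)     # parallel backend
plan(multisession)  # run future_map_dbl calls in parallel
```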

The data I will use consist of measurements of individual plant biomass (i.e. the weight of a plant after we have removed all the water) exposed to a control treatment (C), drought (D), high temperature (HT), and high temperature and drought (HTD). First, let's take a look at the data:
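(The original data table is not reproduced in this copy. To keep the remaining snippets self-contained, I use a hypothetical stand-in with the same structure — a `Treatment` factor with levels C, D, HT and HTD and a `Biomass` column:)

```r
# Hypothetical stand-in for the plant biomass data:
# 10 replicates per treatment, log-normally distributed.
set.seed(2019)
plants <- data.frame(
  Treatment = rep(c("C", "D", "HT", "HTD"), each = 10),
  Biomass   = rlnorm(40,
                     meanlog = rep(log(c(3.0, 2.2, 2.5, 1.8)), each = 10),
                     sdlog   = 0.3)
)
head(plants)
```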

I want to answer the classic question of whether the treatments are decreasing the individual plant biomass. There are different ways of addressing this question but, for the purpose of this article, I will focus on hypothesis testing as this is the most common approach in plant research. Remember that in hypothesis testing the skeptical position (i.e. no effect) is taken as the null hypothesis, and the goal is to calculate the probability that, if the null hypothesis were true, we would observe data at least as extreme as those collected in the experiment.

The main issue with these data is that they do not appear to be samples from Normal distributions. To be more formal, we can use a normality test to check this claim, such as the Shapiro-Wilk test:
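(A sketch of the stripped code chunk, using the stand-in `plants` data frame from above:)

```r
# Shapiro-Wilk normality test within each treatment
plants %>%
  group_by(Treatment) %>%
  summarise(p_value = shapiro.test(Biomass)$p.value)
```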

We can see that, at least for the control and high temperature treatment, it is not very likely to get these kinds of data if the variation across replicates was normally distributed (i.e. the assumption of normality does not hold). This actually makes sense as these data (i) cannot be negative (while the Normal distribution allows for negative values) and (ii) the measurements were performed on young plants in the exponential phase of growth, which would tend to produce Log-Normal distributions. A Log-Normal distribution is the distribution of a random variable whose logarithm is normally distributed (in other words, a Log-Normal is the result of applying an exponential function to a Normal distribution). Indeed, if I log-transform the data and run the tests again, the p-values are much higher:
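(Again a sketch of the stripped chunk:)

```r
# Same test after log-transforming the data
plants %>%
  group_by(Treatment) %>%
  summarise(p_value = shapiro.test(log(Biomass))$p.value)
```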

which means that the log-transformed data is compatible with the assumption of normality. At this point the usual path is to either work with the original scale and address the question using a t test (hoping that it is robust enough) or to log-transform the data and apply the t test on the log scale. As will be shown below, whether we test on the original scale or the log scale can lead to differences in the p values.

In this article I will show alternative methods to test differences in means. These methods are known as bootstrapping methods and they come in many flavours. Here I will focus on the parametric bootstrap and the non-parametric bootstrap (when people just say bootstrap, without adjective, they generally mean the non-parametric version). In the final section, the results from these methods will be compared to the t test applied on the original and log scales to illustrate the impact that these choices can have on the results.

Parametric bootstrap

The idea of the parametric bootstrap was discussed in the previous post . The basic idea is that one needs to (i) assume a particular distribution to describe the population from which the data could have been sampled, (ii) estimate the values of the parameters of the distribution using the observed data and null hypothesis and (iii) generate the sampling distribution of any statistic through Monte Carlo simulation.

As discussed above, a reasonable distribution for the individual plant biomass would be a Log-Normal distribution. If a random variable \(x\) follows a Log-Normal distribution, then \(\text{ln}(x)\) follows a Normal distribution. Therefore, the easiest way to implement this distribution (and therefore how it is generally done) is to take the density function of the Normal distribution and substitute \(\text{ln}(x)\) for \(x\) . The problem is that the resulting function is parameterized in terms of the mean and standard deviation of \(\text{ln}(x)\) , but we are interested in the mean of \(x\) . To express the Log-Normal in terms of \(x\) we need the following reparameterization:

\[ \begin{align} \mu_x &= \text{exp}\left(\mu_{lx} + \sigma_{lx}^2/2 \right) \\ \sigma_x^2 &= \left[\text{exp}(\sigma_{lx}^2) - 1 \right]\text{exp}(2\mu_{lx} + \sigma_{lx}^2) \end{align} \]

where the subscript “ \(_{x}\) ” and “ \(_{lx}\) ” refer to the original scale and the log scale of \(x\) , respectively. As discussed in the previous post , different methods may be used to estimate the parameters of a distribution. In this case, the estimators for \(\mu_x\) and \(\sigma_x\) can be derived from the maximum likelihood estimators of \(\mu_{lx}\) and \(\sigma_{lx}\) by just applying the reparameterization, that is:

\[ \begin{align} \hat{\mu_x} &= \text{exp}\left(\hat{\mu_{lx}} + \hat{\sigma_{lx}}^2/2 \right) \\ \hat{\sigma_x}^2 &= \left[\text{exp}(\hat{\sigma_{lx}}^2) - 1 \right]\text{exp}(2\hat{\mu_{lx}} + \hat{\sigma_{lx}}^2) \end{align} \]

and the maximum likelihood estimators on the log scale are

\[ \begin{align} \hat{\mu}_{lx} &= \frac{\sum_k \text{ln}\, x_k}{n}\\ \hat{\sigma}_{lx}^{2} &= \frac{\sum_k \left( \text{ln}\, x_k - \hat{\mu}_{lx}\right)^2}{n} \end{align} \]

The R implementation of these formulae is as follows:
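(A sketch of the stripped chunk; the function names are my own:)

```r
# Maximum likelihood estimators on the log scale
mu_lx_hat    <- function(x) mean(log(x))
sigma_lx_hat <- function(x) sqrt(mean((log(x) - mu_lx_hat(x))^2))

# Estimators for the mean and sd on the original scale of x,
# obtained by applying the reparameterization above
mu_x_hat <- function(x) exp(mu_lx_hat(x) + sigma_lx_hat(x)^2 / 2)
sigma_x_hat <- function(x) {
  sqrt((exp(sigma_lx_hat(x)^2) - 1) *
         exp(2 * mu_lx_hat(x) + sigma_lx_hat(x)^2))
}
```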

Note that this code calculates the maximum likelihood estimators for the mean and standard deviation of the Log-Normal distribution. If we had used another distribution, we would have to use different formulae. You can find these formulae online (Wikipedia often has good information on many distributions) or you can use an optimization approach for more complex cases (but that is for another post).

The null hypothesis specifies that the means of the populations from which the data were sampled should be equal. Since we are comparing each treatment to the control, it makes sense to calculate this mean from the control data (an alternative approach is to use the mean of each pair of treatments being compared). Hence, the estimates of the parameters for the population associated with the control are:
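(Sketch, continuing with the stand-in data:)

```r
# Estimate of the control mean on the original scale; under H0 this
# mean is shared by all populations
biomass_C  <- plants$Biomass[plants$Treatment == "C"]
mu_x_hat_C <- mu_x_hat(biomass_C)
```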

The other populations will have the mean set to mu_x_hat_C , but the standard deviation is calculated from the samples:
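(Sketch:)

```r
# Standard deviations on the original scale, one per treatment
samples <- split(plants$Biomass, plants$Treatment)
sigma_x_hat_trt <- sapply(samples, sigma_x_hat)
```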

We now construct the population models using the distr6 package:
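(Sketch; to stay on safe ground with distr6's constructor, I convert the original-scale mean and standard deviation back to the meanlog/sdlog parameterisation, which simply inverts the reparameterization given above:)

```r
# Log-Normal population model with a given original-scale mean and sd
make_pop <- function(mean_x, sd_x) {
  sdlog2  <- log(1 + sd_x^2 / mean_x^2)  # inverse reparameterization
  meanlog <- log(mean_x) - sdlog2 / 2
  Lognormal$new(meanlog = meanlog, sdlog = sqrt(sdlog2))
}

# Under H0, every population has the control mean but its own sd
pops <- lapply(samples, function(x) make_pop(mu_x_hat_C, sigma_x_hat(x)))
```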

By sampling from each population we can compute the sampling distribution of the difference of sample means (the future_map_dbl is equivalent to a for loop in parallel):
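(Sketch; `n_rep` simulated experiments with 10 plants per treatment, matching the stand-in data:)

```r
# Sampling distribution of the difference in sample means under H0
n_rep <- 1e4
diff_sim <- lapply(c(D = "D", HT = "HT", HTD = "HTD"), function(trt) {
  future_map_dbl(1:n_rep,
                 ~ mean(pops$C$rand(10)) - mean(pops[[trt]]$rand(10)),
                 .options = furrr_options(seed = TRUE))
})
```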

Which are then compared to the differences observed in the experiment:
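(Sketch:)

```r
# Differences in sample means observed in the (stand-in) experiment
means_obs <- sapply(samples, mean)
diff_obs  <- means_obs["C"] - means_obs[c("D", "HT", "HTD")]
```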

And the p values are calculated as the fraction of cases where the difference is more extreme:
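(Sketch; one-sided because the question is whether the treatments decrease biomass:)

```r
# Fraction of simulated differences at least as large as observed
p_param <- mapply(function(sim, obs) mean(sim >= obs), diff_sim, diff_obs)
```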

Non-parametric bootstrap

In non-parametric bootstrap methods one tries to approximate the sampling distribution directly from the data, without making any explicit assumption about the distribution of the population from which the data could have been sampled. The procedure is to create alternative samples of the same size as the one observed by randomly sampling from the current sample with replacement (i.e. every value is allowed to be sampled more than once). This is not a capricious choice: if you do not replace the values that are being sampled, you will just end up with the original sample in a different order!

There is an important implicit assumption in bootstrapping: that the sample is representative of the population in terms of its statistical properties (e.g. mean, variance, kurtosis, etc). This means that bootstrapping is more reliable the larger the sample size, as statistical properties will tend to stabilize as sample size increases.

Bootstrapping is most often used to calculate confidence intervals of parameters (e.g. coefficients of a regression model), but it can also be used for hypothesis testing, provided that we do not introduce into the calculations any more assumptions than the null hypothesis and the assumptions mentioned above. Since the null hypothesis specifies that the data are sampled from populations with the same mean, we need to transform the data so that the bootstrap method produces sampling distributions in agreement with this hypothesis. One way to achieve this is by subtracting from each sample its own mean and adding the pooled mean. That is:
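(Sketch; the pooled mean is computed separately for each control-treatment pair:)

```r
# Shift each sample in a pair so both share the pooled mean (H0)
shift_to_pooled <- function(x, y) {
  pooled <- mean(c(x, y))
  list(x = x - mean(x) + pooled, y = y - mean(y) + pooled)
}

corrected <- lapply(samples[c("D", "HT", "HTD")],
                    function(y) shift_to_pooled(samples$C, y))
```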

Notice that there are now three corrected means for the control as we are doing three tests. The function sample gives a sample with replacement of a given size. In order to compute the three sampling distributions we need to apply bootstrapping six times (one for every corrected sample):
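(Sketch:)

```r
# Bootstrap the difference in means for each pair of corrected samples
n_boot <- 1e4
diff_np <- lapply(corrected, function(pair) {
  future_map_dbl(1:n_boot, ~ {
    mean(sample(pair$x, length(pair$x), replace = TRUE)) -
      mean(sample(pair$y, length(pair$y), replace = TRUE))
  }, .options = furrr_options(seed = TRUE))
})
```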

And the p values are calculated as in the parametric case but using the new samples from the non-parametric bootstrap:
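(Sketch:)

```r
# One-sided p values from the non-parametric null distributions
p_nonparam <- mapply(function(sim, obs) mean(sim >= obs), diff_np, diff_obs)
```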

Comparison with classical methods

The p values for the one-sided Welch two sample t test applied on the original and log scales can be calculated with the t.test function:
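(Sketch:)

```r
# One-sided Welch t tests on the original and log scales
p_t <- sapply(c("D", "HT", "HTD"), function(trt) {
  c(original = t.test(samples$C, samples[[trt]],
                      alternative = "greater")$p.value,
    log      = t.test(log(samples$C), log(samples[[trt]]),
                      alternative = "greater")$p.value)
})
```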

Finally, the p values for the different methods and treatments are:
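(Sketch; this just gathers the objects computed above, so no numbers are shown here:)

```r
# Gather the p values from the four approaches into one table
data.frame(
  treatment      = c("D", "HT", "HTD"),
  parametric     = p_param,
  non_parametric = p_nonparam,
  t_original     = p_t["original", ],
  t_log          = p_t["log", ]
)
```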

As can be seen above, the t test is actually quite robust to deviations from normality, and the results on the original scale are quite close to the results from both bootstrapping methods. This is less the case when the test is applied to log-transformed data, where p values were generally larger, even though we did the transformation to “help” the t test.

Regarding the two bootstrapping methods, the parametric one is usually more flexible in calculating p values as it is always possible to implement the null hypothesis in terms of parameters of the population distribution whereas in the non-parametric case this is more difficult. Also, the non-parametric approach generally requires more data than the parametric approach to produce reliable estimates. Of course, the price we pay is that the parametric method makes more assumptions about the data. If the assumptions are not reasonable, results can be less reliable than for the non-parametric case.

Some remarks about p < 0.05

Notice that in the high temperature + drought treatment (HTD), the choice of test and whether to transform the data or not resulted in p values on different sides of the magical threshold of 0.05. This is one of the many reasons why it is not a great idea to dichotomize your thinking according to whether the p value happens to fall below a threshold or not. The problem is that small technical details (which are just assumptions, so they cannot be proven right or wrong and are often a matter of personal preference) can push the p value across the threshold. Therefore, if you base your scientific conclusions about an effect on whether p < 0.05 or not, your scientific reasoning is not as robust or objective as you may think. Also, p hacking your way into significance becomes very tempting, especially when getting significant results has an impact on your career. So, perhaps, you should stop interpreting your results in terms of whether p < 0.05 or not and also stop using the term statistical significance. That is actually the opinion of the American Statistical Association (see statements from 2016 and 2019 ).

Then, what should we use as an alternative? It is not at all clear, but most suggestions point to the need for a more quantitative and continuous view of statistical inference. One option is to supplement the reported p values with estimates of the effect of the treatment (i.e. by how much biomass was decreased by each treatment) and, very importantly, the uncertainty in these estimates. As with p values, confidence intervals can also be calculated using classical mathematical formulae, parametric bootstrapping and non-parametric bootstrapping. I will show how to do this in a future post, stay tuned!


Introduction to Computational Finance and Financial Econometrics with R

9.8 Using the Bootstrap for Hypothesis Testing

To be completed

  • The duality between hypothesis tests and confidence intervals can be exploited by the bootstrap: use the bootstrap to compute a confidence interval and check whether the hypothesized value lies in the interval. Illustrate with testing the equality of 2 Sharpe ratios (a sketch of this idea follows below).
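A minimal sketch of that idea in R, using the boot package with made-up return series (all names and numbers are placeholders, not data from the book):

```r
library(boot)

# Made-up monthly returns for two assets:
set.seed(123)
returns <- data.frame(r1 = rnorm(120, mean = 0.010, sd = 0.05),
                      r2 = rnorm(120, mean = 0.005, sd = 0.04))

# Statistic of interest: the difference in Sharpe ratios, recomputed on each
# bootstrap sample (rows are resampled jointly to preserve any dependence
# between the two return series).
sharpe_diff <- function(data, idx) {
  d <- data[idx, ]
  mean(d$r1) / sd(d$r1) - mean(d$r2) / sd(d$r2)
}

b <- boot(returns, sharpe_diff, R = 2000)

# 95% percentile confidence interval: reject H0 (equal Sharpe ratios) at the
# 5% level if 0 lies outside the interval.
boot.ci(b, type = "perc")
```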
