In hypothesis testing you must decide between two alternatives: one is called the null hypothesis and the other the alternative hypothesis. As an example, suppose you are asked to decide whether a coin is fair or biased in favor of heads. In this situation the statement that the coin is fair is the null hypothesis, while the statement that the coin is biased in favor of heads is the alternative hypothesis. To make the decision, an experiment is performed. For example, the experiment might consist of tossing the coin 10 times; on the basis of the 10 outcomes you would decide either to accept the null hypothesis or to reject it (and therefore accept the alternative hypothesis). So, in hypothesis testing, acceptance or rejection of the null hypothesis is based on a decision rule. As an example of a decision rule, you might decide to reject the null hypothesis and accept the alternative hypothesis if 8 or more heads occur in 10 tosses of the coin.
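The error rate of such a decision rule can be computed directly from the binomial distribution. Here is a quick sketch in Python, using the rule and numbers from this paragraph:

```python
from math import comb

# Chance that the rule "reject if 8 or more heads in 10 tosses"
# rejects the null even though the coin is fair (p = 0.5)
alpha = sum(comb(10, k) * 0.5**10 for k in range(8, 11))
print(alpha)  # 0.0546875
```

So this particular rule wrongly rejects a fair coin about 5.5% of the time.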

The process of testing hypotheses can be compared to court trials. A person comes into court charged with a crime. A jury must decide whether the person is innocent (null hypothesis) or guilty (alternative hypothesis). Even though the person is charged with the crime, at the beginning of the trial (and until the jury declares otherwise) the accused is assumed to be innocent. Only if overwhelming evidence of the person's guilt can be shown is the jury expected to declare the person guilty--otherwise the person is considered innocent.

In the jury trial there are two types of errors: (1) the person is innocent but the jury finds the person guilty, and (2) the person is guilty but the jury declares the person to be innocent. In our system of justice, the first error is considered more serious than the second error.  These two errors along with the correct decisions are shown in the next table where the jury decision is shown in bold on the left margin and the true state of affairs is shown in bold along the top margin of the table.

                               Truth: Person Innocent   Truth: Person Guilty
Jury Decides Person Innocent   Correct Decision         Type II Error
Jury Decides Person Guilty     Type I Error             Correct Decision

                        In Fact H0 is True   In Fact H0 is False
Test Decides H0 True    Correct Decision     Type II Error
Test Decides H0 False   Type I Error         Correct Decision


Decreasing the probability of a Type II error (beta) without increasing the probability of a Type I error (alpha)

The previous example shows that decreasing the probability of a Type I error leads to an increase in the probability of a Type II error, and vice versa.  How can the probability of a Type I error be held at some (preferably small) level while decreasing the probability of a Type II error?  The next series of graphs shows that this can be done by using a larger n, that is, by increasing the number of coin tosses.  An increase in n can be viewed as increasing the sample size for the experiment.  In the middle graph of the series of five graphs shown above, the probability of a Type I error, alpha, is approximately 0.05.  Suppose the coin was tossed 30 times instead of 10 times.  With 30 tosses you would want the critical value to be some number greater than 15.  Suppose that 20 is used as the critical value; that is, if 20 or more heads occur in the 30 tosses you would reject the null hypothesis that the coin is fair and accept the alternative hypothesis that the coin is biased in favor of heads (in this situation, we are looking at the alternative that the probability of a head is p = 0.7).  The next graph displays the results, with the probability distribution of the number of heads under the assumption that the null hypothesis is true shown in red, and the probability distribution of the number of heads under the assumption that the null hypothesis is false (and the probability of a head is 0.7) shown in blue.
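For concreteness, both error probabilities of this 30-toss rule can be computed exactly from the binomial distribution (a sketch in Python; p = 0.7 is the alternative named above):

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, critical = 30, 20

# Type I error: a fair coin (p = 0.5) still shows 20 or more heads
alpha = sum(binom_pmf(k, n, 0.5) for k in range(critical, n + 1))  # ~0.049

# Type II error: a biased coin (p = 0.7) shows fewer than 20 heads,
# so we fail to reject the null
beta = sum(binom_pmf(k, n, 0.7) for k in range(critical))
```

With n = 30 and a critical value of 20, alpha stays near 0.05 while beta is considerably smaller than it was with only 10 tosses.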

The P-Value Approach to Hypothesis Testing

One and two tail tests, specific hypothesis tests, summary of the p-value method.

  • Determine the null and alternative hypotheses
  • Determine the test statistic
  • Take a random sample of size n and compute the value of the test statistic
  • Determine the probability of the observed value, or something more extreme than the observed value, of the test statistic (where 'more extreme' is based on the null and alternative hypotheses).  This is the p-value.
  • Reject the null hypothesis if the p-value is 'small.'  (Where a significance level is given for the test, 'small' usually means any p-value less than or equal to the significance level.)
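As an illustration of these steps, here is a sketch of a two-sided z-test for a population mean with known standard deviation, using Python's standard library. The sample values and the hypothesized mean are made up for the example:

```python
from statistics import NormalDist, mean

# Hypothetical setup: H0: mu = 50 vs. Ha: mu != 50,
# population standard deviation known to be 4
mu_0, sigma, significance = 50, 4.0, 0.05

sample = [52.1, 49.8, 53.0, 51.5, 50.9, 52.4, 48.7, 51.1]  # made-up data
n = len(sample)

# Step 2-3: the test statistic and its observed value
z = (mean(sample) - mu_0) / (sigma / n**0.5)

# Step 4: probability of a value at least this extreme under H0 (two-sided)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: reject if the p-value is at most the significance level
reject = p_value <= significance
```

For this made-up sample the p-value comes out around 0.40, far above 0.05, so the null hypothesis is not rejected.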

For a population mean with known population standard deviation

  • Assumptions: (1) Sample is random (2) If the sample is small (n<30), the population is normal or close to normal.

For a population mean with unknown population standard deviation

  • Assumptions: (1) Sample is random (2) If the sample is small (n<30), the population is normal.

For a population proportion

  • Assumptions: (1) Sample is random (2) Sample is large (n is 30 or more) (3) x is the number of sample elements that have the characteristic

For a population variance

  • Assumptions: (1) Sample is random (2) Population is normal

Data Science from Scratch (ch7) - Hypothesis and Inference

Connecting probability and statistics to hypothesis testing and inference

Table of contents

  • Central Limit Theorem
  • Hypothesis Testing
  • Confidence Intervals
  • Connecting dots with Python

This is a continuation of my progress through Data Science from Scratch by Joel Grus. We’ll use a classic coin-flipping example in this post because it is simple to illustrate with both concept and code. The goal of this post is to connect the dots between several concepts, including the Central Limit Theorem, hypothesis testing, p-values and confidence intervals, using Python to build our intuition.

Central Limit Theorem

Terms like “null” and “alternative” hypothesis are used quite frequently, so let’s set some context. The “null” is the default position. The “alternative”, alt for short, is something we’re comparing to the default (null).

The classic coin-flipping exercise is to test the fairness of a coin. If a coin is fair, it’ll land on heads 50% of the time (and tails 50% of the time). Let’s translate this into hypothesis-testing language:

Null Hypothesis : Probability of landing on Heads = 0.5.

Alt Hypothesis : Probability of landing on Heads != 0.5.

Each coin flip is a Bernoulli trial, which is an experiment with two outcomes - outcome 1, “success” (probability p), and outcome 0, “fail” (probability 1 - p). A coin flip is a Bernoulli trial because there are only two outcomes (heads or tails).

Here’s the code for a single Bernoulli Trial:

When you sum independent Bernoulli trials, you get a Binomial(n,p) random variable, a variable whose possible values have a probability distribution. The central limit theorem says that as n, the number of independent Bernoulli trials, gets large, the binomial distribution approaches a normal distribution.

Here’s the code for when you sum all the Bernoulli Trials to get a Binomial random variable:

Note : A single ‘success’ in a Bernoulli trial is ‘x’. Summing up all those x’s into X, is a Binomial random variable. Success doesn’t imply desirability, nor does “failure” imply undesirability. They’re just terms to count the cases we’re looking for (i.e., number of heads in multiple coin flips to assess a coin’s fairness).

Given that our null is (p = 0.5) and alt is (p != 0.5), we can run some independent bernoulli trials, then sum them up to get a binomial random variable.

independent_coin_flips

Each bernoulli_trial is an experiment with either 0 or 1 as the outcome. The binomial function sums up n bernoulli(0.5) trials. We ran both twice and got different results. Each Bernoulli experiment can be a success (1) or a fail (0); summing them up into a binomial random variable means we take the probability p (0.5) that a coin flips heads and run the experiment 1,000 times.

The first 1,000 flips gave us 510 heads; the second gave us 495. We can repeat this process many times to get a distribution, and we can plot that distribution to reinforce our understanding. To do this we’ll use the binomial_histogram function. This function picks points from a Binomial(n,p) random variable and plots their histogram.

This plot is then rendered:

binomial_coin_fairness

What we did was sum independent bernoulli_trial (s) of 1,000 coin flips, where the probability of heads is p = 0.5, to create a binomial random variable. We then repeated this a large number of times (N = 10,000) and plotted a histogram of the distribution of all the binomial random variables. Because we did it so many times, the histogram approximates a normal distribution (smooth bell-shaped curve).

Just to demonstrate how this works, we can generate several binomial random variables:

several_binomial

If we do this 10,000 times, we’ll generate the above histogram. You’ll notice that because we are testing whether the coin is fair, the probability of heads (success) should be at 0.5 and, from 1,000 coin flips, the mean ( mu ) should be 500.

We have another function that can help us calculate normal_approximation_to_binomial :
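That helper is straightforward (a sketch close to the book's version):

```python
import math
from typing import Tuple

def normal_approximation_to_binomial(n: int, p: float) -> Tuple[float, float]:
    """Return mu and sigma for the normal approximating Binomial(n, p)."""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

mu, sigma = normal_approximation_to_binomial(1000, 0.5)  # (500.0, ~15.8114)
```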

When calling the function with our parameters, we get a mean mu of 500 (from 1,000 coin flips) and a standard deviation sigma of 15.8114. Which means that 68% of the time, the binomial random variable will be 500 +/- 15.8114 and 95% of the time it’ll be 500 +/- 31.6228 (see 68-95-99.7 rule )

Hypothesis Testing

Now that we have seen the results of our “coin fairness” experiment plotted on a binomial distribution (approximately normal), for the purpose of testing our hypothesis we will be interested in the probability that its realized value (the binomial random variable) lies within or outside a particular interval.

This means we’ll be interested in questions like:

  • What’s the probability that the binomial(n,p) is below a threshold?
  • Above a threshold?
  • Between an interval?
  • Outside an interval?

First, the normal_cdf (normal cumulative distribution function), which we learned in a previous post, is the probability of a variable being below a certain threshold.
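In code, the from-scratch version leans on the error function; the below/above helpers follow directly (a sketch close to the book's):

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """Probability that a Normal(mu, sigma) variable is below x."""
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

# the probability a variable is below a threshold IS normal_cdf
normal_probability_below = normal_cdf

def normal_probability_above(lo: float, mu: float = 0, sigma: float = 1) -> float:
    """Probability that a Normal(mu, sigma) variable is above lo."""
    return 1 - normal_cdf(lo, mu, sigma)

normal_cdf(490, 500, 15.8113)                 # ~0.26
normal_probability_above(490, 500, 15.8113)   # ~0.74
```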

Here, the probability of success (heads for a ‘fair coin’) is 0.5 ( mu = 500, sigma = 15.8113), and we want to find the probability that X falls below 490, which comes out to roughly 26%.

On the other hand, normal_probability_above, the probability that X falls above 490, would be 1 - 0.2635 = 0.7365, or roughly 74%.

To make sense of this we need to recall the binomial distribution that approximates the normal distribution, but now we’ll draw a vertical line at 490.

binomial_vline

We’re asking, given the binomal distribution with mu 500 and sigma at 15.8113, what is the probability that a binomal random variable falls below the threshold (left of the line); the answer is approximately 26% and correspondingly falling above the threshold (right of the line), is approximately 74%.

Between interval

We may also wonder what the probability is of a binomial random variable falling between 490 and 520 :

binomial_2_vline

Here is the function to calculate this probability; it comes out to approximately 63%. Note: bear in mind that the full area under the curve is 1.0, or 100%.

Finally, the area outside of the interval should be 1 - 0.6335 = 0.3665:
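Both interval helpers are thin wrappers over normal_cdf (repeated here so the snippet runs on its own):

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_probability_between(lo, hi, mu=0, sigma=1):
    """Probability a Normal(mu, sigma) variable lands in [lo, hi]."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

def normal_probability_outside(lo, hi, mu=0, sigma=1):
    """Probability it lands outside [lo, hi]."""
    return 1 - normal_probability_between(lo, hi, mu, sigma)

normal_probability_between(490, 520, 500, 15.8113)   # ~0.6335
normal_probability_outside(490, 520, 500, 15.8113)   # ~0.3665
```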

In addition to the above, we may also be interested in finding (symmetric) intervals around the mean that account for a certain level of likelihood , for example, 60% probability centered around the mean.

For this operation we would use the inverse_normal_cdf :

First we’d have to find the cutoffs where the upper and lower tails each contain 20% of the probability. We calculate normal_upper_bound and normal_lower_bound and use those to calculate the normal_two_sided_bounds .

So if we wanted to know what the cutoff points were for a 60% probability around the mean and standard deviation ( mu = 500, sigma = 15.8113), it would be between 486.69 and 513.31 .

Said differently, this means roughly 60% of the time, we can expect the binomial random variable to fall between 486 and 513.

Significance and Power

Now that we have a handle on the binomial distribution and its normal approximation, thresholds (left and right of the mean), and cut-off points, we want to make a decision about significance . Probably the most important part of statistical significance is that it is a decision to be made, not a standard that is externally set.

Significance is a decision about how willing we are to make a type 1 error (false positive), which we explored in a previous post . The convention is to set it to a 5% or 1% willingness to make a type 1 error. Suppose we say 5%.

We would say that out of 1,000 coin flips, 95% of the time, we’d get between 469 and 531 heads on a “fair coin” and 5% of the time, outside of this 469-531 range.

If we recall our hypotheses:

Null Hypothesis : Probability of landing on Heads = 0.5 (fair coin)

Alt Hypothesis : Probability of landing on Heads != 0.5 (biased coin)

For each test consisting of 1,000 Bernoulli trials, if the number of heads falls outside the range of 469-531, we’ll reject the null that the coin is fair. And we’ll be wrong (a false positive) 5% of the time: it’s a false positive when we incorrectly reject the null hypothesis when it’s actually true.

We also want to avoid making a type-2 error (false negative), where we fail to reject the null hypothesis, when it’s actually false.

Note : It’s important to keep in mind that terms like significance and power describe tests , in our case, the test of whether a coin is fair or not. Each test is the sum of 1,000 independent Bernoulli trials.

For a “test” that has a 95% significance, we’ll assume that out of a 1,000 coin flips, it’ll land on heads between 469-531 times and we’ll determine the coin is fair. For the 5% of the time it lands outside of this range, we’ll determine the coin to be “unfair”, but we’ll be wrong because it actually is fair.

To calculate the power of the test, we’ll take the assumed mu and sigma with a 95% bounds (based on the assumption that the probability of the coin landing on heads is 0.5 or 50% - a fair coin). We’ll determine the lower and upper bounds:

And if the coin were actually biased , we should reject the null, but we may fail to. Let’s suppose the actual probability that the coin lands on heads is 55% ( biased towards heads):

Using the same range 469 - 531, where the coin is assumed ‘fair’ with mu at 500 and sigma at 15.8113:

95sig_binomial

If the coin, in fact, had a bias towards heads (p = 0.55), the distribution would shift right, but if the bounds of our 95% test remain the same, we get:

type2_error

The probability of making a type-2 error is 11.345%. This is the probability that we see the coin’s number of heads fall within the previous interval 469-531 and conclude we should accept the null hypothesis (that the coin is fair), when in actuality the distribution has shifted because the coin is biased towards heads.

The other way to arrive at this is to find the probability, under the new mu and sigma (new distribution), that X (number of successes) will fall below 531.
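Numerically, the whole calculation fits in a few lines; here is a sketch using Python's statistics.NormalDist in place of the book's from-scratch helpers:

```python
from statistics import NormalDist

# Null: fair coin -> Binomial(1000, 0.5) ~ Normal(500, 15.8113)
fair = NormalDist(500, 15.8113)
lo, hi = fair.inv_cdf(0.025), fair.inv_cdf(0.975)   # ~ (469, 531)

# Actual: biased coin, p = 0.55 -> Normal(550, ~15.7321)
biased = NormalDist(550, (1000 * 0.55 * 0.45) ** 0.5)

# Type-2 error: the biased coin still lands inside the "fair" acceptance
# region; this is essentially the probability of falling below the upper bound
type_2 = biased.cdf(hi) - biased.cdf(lo)   # ~0.113
power = 1 - type_2                          # ~0.887
```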

So the probability of making a type-2 error or the probability that the new distribution falls below 531 is approximately 11.3%.

The power of the test is 1 minus the probability of a type-2 error (1 - 0.113 = 0.887), or 88.7%.

Finally, we may be interested in increasing the power of the test. Instead of using the normal_two_sided_bounds function to find the cut-off points (i.e., 469 and 531), we could use a one-sided test that rejects the null hypothesis (‘fair coin’) when X (the number of heads) is much larger than 500.

Here’s the code, using normal_upper_bound :

This means shifting the upper bound from 531 to 526, putting more rejection probability in the upper tail. The probability of a type-2 error goes down from 11.3% to 6.3%.

increase_power

And the new (stronger) power is 1.0 - 0.064 = 0.936, or 93.6% (up from 88.7% above).

p-Values represent another way of deciding whether to accept or reject the Null Hypothesis. Instead of choosing bounds, thresholds or cut-off points, we could compute the probability, assuming the Null Hypothesis is true, that we would see a value as extreme as the one we just observed.

Here is the code:
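Something like the book's two_sided_p_value, written here over statistics.NormalDist:

```python
from statistics import NormalDist

def two_sided_p_value(x: float, mu: float = 0, sigma: float = 1) -> float:
    """Probability of a value at least as extreme as x, in either tail."""
    dist = NormalDist(mu, sigma)
    if x >= mu:
        return 2 * (1 - dist.cdf(x))   # twice the upper-tail probability
    else:
        return 2 * dist.cdf(x)         # twice the lower-tail probability

two_sided_p_value(529.5, 500, 15.8113)   # ~0.062
```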

If we wanted to compute, assuming we have a “fair coin” ( mu = 500, sigma = 15.8113), what is the probability of seeing a value like 530? ( note : We use 529.5 instead of 530 below due to continuity correction )

Answer: approximately 6.2%

The p-value, 6.2% is higher than our (hypothetical) 5% significance, so we don’t reject the null. On the other hand, if X was slightly more extreme, 532, the probability of seeing that value would be approximately 4.3%, which is less than 5% significance, so we would reject the null.

For one-sided tests, we would use the normal_probability_above and normal_probability_below functions created above:
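The one-sided p-values are just the tail probabilities (a sketch with statistics.NormalDist):

```python
from statistics import NormalDist

def upper_p_value(x, mu=0, sigma=1):
    """One-sided p-value for the alternative 'greater than'."""
    return 1 - NormalDist(mu, sigma).cdf(x)

def lower_p_value(x, mu=0, sigma=1):
    """One-sided p-value for the alternative 'less than'."""
    return NormalDist(mu, sigma).cdf(x)

upper_p_value(529.5, 500, 15.8113)   # ~0.031
```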

Under the two_sided_p_value test, the extreme value of 529.5 had a probability of 6.2% of showing up, which is not low enough to reject the null hypothesis.

However, with a one-sided test, upper_p_value for the same threshold is now 3.1% and we would reject the null hypothesis.

Confidence Intervals

A third approach to deciding whether to accept or reject the null is to use confidence intervals. We’ll use 530 heads, as we did in the p-value example.
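A sketch of that interval calculation, estimating sigma from the observed proportion as the book does:

```python
import math
from statistics import NormalDist

def two_sided_bounds_95(p_hat: float, n: int):
    """95% confidence interval for a proportion, by normal approximation."""
    sigma = math.sqrt(p_hat * (1 - p_hat) / n)
    dist = NormalDist(p_hat, sigma)
    return dist.inv_cdf(0.025), dist.inv_cdf(0.975)

ci_530 = two_sided_bounds_95(530 / 1000, 1000)   # ~ (0.4991, 0.5609)
ci_540 = two_sided_bounds_95(540 / 1000, 1000)   # ~ (0.5091, 0.5709)
```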

The confidence interval for a coin flipping heads 530 times (out of 1,000) is (0.4991, 0.5609). Since this interval contains p = 0.5 (heads 50% of the time, assuming a fair coin), we do not reject the null.

If the extreme value were more extreme at 540, we would arrive at a different conclusion:

Here we would be 95% confident that the mean of this distribution is contained between 0.5091 and 0.5709, and this does not contain 0.500 (albeit by a slim margin), so we reject the null hypothesis that this is a fair coin.

note : Confidence intervals are about the interval, not the probability p. We interpret the confidence interval as: if you were to repeat the experiment many times, 95% of the time the “true” parameter, in our example p = 0.5, would lie within the observed confidence interval.

Connecting Dots

We used several Python functions to build intuition around statistical hypothesis testing. To highlight the “from scratch” aspect of the book, here is a diagram tying together the various Python functions used in this post:

connecting_dots

This post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus .


For more content on data science, machine learning, R, Python, SQL and more, find me on Twitter .

Paul Apivat


Cryptodata analyst ⛓️.

My interests include data science, machine learning and Python programming.


Hypothesis Testing: The Basics

Say I hand you a coin. How would you tell if it's fair? If you flipped it 100 times and it came up heads 51 times, what would you say? What if it came up heads 5 times, instead?

In the first case you'd be inclined to say the coin was fair and in the second case you'd be inclined to say it was biased towards tails. How certain are you? Or, even more specifically, how likely is it actually that the coin is fair in each case?

Hypothesis Testing

Questions like the ones above fall into a domain called hypothesis testing . Hypothesis testing is a way of systematically quantifying how certain you are of the result of a statistical experiment.

In the coin example the "experiment" was flipping the coin 100 times. There are two questions you can ask. One, assuming the coin was fair, how likely is it that you'd observe the results we did? Two, what is the likelihood that the coin is fair given the results you observed?

Of course, an experiment can be much more complex than coin flipping. Any situation where you're taking a random sample of a population and measuring something about it is an experiment, and for our purposes this includes A/B testing .

Let's focus on the coin flip example to understand the basics.

The Null Hypothesis

The most common type of hypothesis testing involves a null hypothesis . The null hypothesis, denoted H 0 , is a statement about the world which can plausibly account for the data you observe. Don't read anything into the fact that it's called the "null" hypothesis — it's just the hypothesis we're trying to test.

For example, "the coin is fair" is an example of a null hypothesis, as is "the coin is biased." The important part is that the null hypothesis be able to be expressed in simple, mathematical terms. We'll see how to express these statements mathematically in just a bit.

The main goal of hypothesis testing is to tell us whether we have enough evidence to reject the null hypothesis. In our case we want to know whether the coin is biased or not, so our null hypothesis should be "the coin is fair." If we get enough evidence that contradicts this hypothesis, say, by flipping it 100 times and having it come up heads only once, then we can safely reject it.

All of this is perfectly quantifiable, of course. What constitutes "enough" and "safely" are all a matter of statistics.

The Statistics, Intuitively

So, we have a coin. Our null hypothesis is that this coin is fair. We flip it 100 times and it comes up heads 51 times. Do we know whether the coin is biased or not?

Our gut might say the coin is fair, or at least probably fair, but we can't say for sure. The expected number of heads is 50 and 51 is quite close. But what if we flipped the coin 100,000 times and it came up heads 51,000 times? We see 51% heads both times, but in the second instance the coin is more likely to be biased.

Lack of evidence to the contrary is not evidence that the null hypothesis is true. Rather, it means that we don't have sufficient evidence to conclude that the null hypothesis is false. The coin might actually have a 51% bias towards heads, after all.

If instead we saw 1 head for 100 flips that would be another story. Intuitively we know that the chance of seeing this if the null hypothesis were true is so small that we would be comfortable rejecting the null hypothesis and declaring the coin to (probably) be biased.

Let's quantify our intuition.

The Coin Flip

Formally the flip of a coin can be represented by a Bernoulli trial. A Bernoulli trial is a random variable X such that Pr\left(X = 1\right) = 1 - Pr\left(X = 0\right) = 1 - q = p

That is, X takes on the value 1 (representing heads) with probability p , and 0 (representing tails) with probability 1 - p . Of course, 1 can represent either heads or tails, so long as you're consistent and 0 represents the opposite outcome.

Now, let's say we have 100 coin flips. Let X i represent the i th coin flip. Then the random variable Y = \sum_{i=1}^{100} X_i represents the run of 100 coin flips.

The Statistics, Mathematically

Say you have a set of observations O and a null hypothesis H 0 . In the above coin example we were trying to calculate P\left(O \mid H_0\right) i.e., the probability that we observed what we did given the null hypothesis. If that probability is sufficiently small, we're confident concluding the null hypothesis is false. But remember, if that probability is not sufficiently small, that doesn't mean the null hypothesis is true!

We can use whatever level of confidence we want before rejecting the null hypothesis, but most people choose 90%, 95%, or 99%. For example if we choose a 95% confidence level we reject the null hypothesis if P\left(O \mid H_0\right) \le 1 - 0.95 = 0.05

The Central Limit Theorem is the main piece of math here. Briefly, the Central Limit Theorem says that the mean (or appropriately scaled sum) of a large number of independent, identically distributed random variables approximates a normal distribution.

Remember our random variables from before? If we let p = \frac{Y}{N} then p is the proportion of heads in our sample of 100 coin flips. In our case, it is equal to 0.51, or 51%.

But by the central limit theorem we also know that p is approximately normally distributed. This means we can estimate the standard deviation of p as \sigma = \sqrt{\frac{p(1-p)}{N}}

Wrapping It Up

Our null hypothesis is that the coin is fair. Mathematically we're saying H_0 : p_0 = 0.50

Here's the normal curve:

A 95% level of confidence means we reject the null hypothesis if p falls outside 95% of the area of the normal curve. Looking at that chart we see that this corresponds to approximately 1.96 standard deviations.

The so-called "z-score" tells us how many standard deviations away from the mean our sample is, and it's calculated as z = \frac{p-0.50}{\sqrt{\frac{0.50(1-0.50)}{N}}}

The numerator is "p - 0.50" because our null hypothesis is that p = 0.50. This measures how far the sample mean, p, diverges from the expected mean of a fair coin, 0.50.

Let's say we flipped three coins 100 times each and got the following data.

Data for 100 Flips of a Coin

Coin     Flips   Pct. Heads   Z-score
Coin 1   100     51%          0.20
Coin 2   100     60%          2.04
Coin 3   100     75%          5.77
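Those z-scores can be reproduced in a few lines. Note that the table's values plug the sample proportion (not the null value 0.50) into the denominator, matching the σ estimate given earlier:

```python
import math

def z_score(p_sample: float, p_null: float, n: int) -> float:
    # sigma estimated from the sample proportion, which is what
    # the table's z-scores reflect
    sigma = math.sqrt(p_sample * (1 - p_sample) / n)
    return (p_sample - p_null) / sigma

zs = [round(z_score(pct, 0.50, 100), 2) for pct in (0.51, 0.60, 0.75)]
print(zs)  # [0.2, 2.04, 5.77]
```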

Using a 95% confidence level we'd conclude that Coin 2 and Coin 3 are biased using the techniques we've developed so far. Coin 2 is 2.04 standard deviations from the mean and Coin 3 is 5.77 standard deviations.

When your test statistic meets the 95% confidence threshold we call it statistically significant .

This means there's only a 5% chance of observing what you did assuming the null hypothesis was true. Phrased another way, there's only a 5% chance that your observation is due to random variation.

Hypothesis testing is a way of systematically quantifying how certain you are of the result of a statistical experiment. You start by forming a null hypothesis, e.g., "this coin is fair," and then calculate the likelihood that your observations are due to pure chance rather than a real difference in the population.

The confidence level is the level at which you reject the null hypothesis. If there is a 95% chance that there's a real difference in your observations, given the null hypothesis, then you are confident in rejecting it. This also means there is a 5% chance you're wrong and the difference is due to random fluctuations.

The null hypothesis can be any mathematical statement and the test you use depends on both the underlying data and your null hypothesis. In our coin flipping example the underlying data approximated a normal distribution and we wanted to test whether the observed proportion of heads was different enough to be significant. In this case we were measuring the sample mean .

We can measure anything, though: the sample variance, correlation, etc. Different tests need to be used to determine whether these are statistically significant, as we'll see in coming articles.

What's Next?

Now that we understand the innards of hypothesis testing we can apply our knowledge to A/B tests to determine whether new features actually affect user behavior. Until then!



Hypothesis Testing Explained as Simply as Possible

One of the most important concepts for data scientists.

Terence Shin, MSc, MBA


Towards Data Science

Table of Contents

  • Introduction
  • Terminology
  • Reject or Do not Reject?
  • What is the point of Significance Testing?
  • Steps for Hypothesis Testing

If you’ve heard of the terms null hypothesis , p-value, and alpha but don’t really know what they mean or how they’re related then you’ve come to the right place! And if you’ve never heard of these terms, I urge you to read through this article as this is an essential topic to understand.

I’ll start with a simple example:

Imagine that you and your friend play a game. If a coin lands on heads, you win $5 and if it lands on tails he wins $5.

Let’s say the first two coin tosses landed on tails, meaning your friend won $10. Should you be worried that he’s using a rigged coin? Well, the probability of the coin landing on tails two times in a row is 25% (see above) which is not unlikely.


Hypothesis Testing

1   Hypothesis Testing

H 0 : We're just as likely to get heads as tails when we flip the coin.
H a : We're more likely to see either heads or tails when we flip the coin.


  • Describe a null hypothesis and an alternative hypothesis.
  • Specify a significance or alpha level for the hypothesis test. This is the percent of the time that you're willing to be wrong when you reject the null hypothesis.
  • Formulate some assumptions about the distribution of the statistic that's involved in the hypothesis test. In this example we made the assumption that a fair coin follows a binomial distribution with p=0.50.
  • Using the assumptions you made and the alpha level you decided on, construct the rejection region for the test, that is, the values of the statistic for which you'll be willing to reject the null hypothesis. In this example, the rejection region is broken up into two sections: less than 40 heads and more than 60 heads.

2   Determining Power

  • Follow the general hypothesis testing procedure outlined above.
  • Repeatedly generate simulated data which corresponds to an alternative hypothesis that you're interested in. For each set of data that you simulate (sometimes referred to as a trial), calculate the statistic, and decide whether to reject the null hypothesis or not.
  • When you're done with all the trials, count how often you (correctly) rejected the null hypothesis, and divide by the total number of trials to obtain the power of the experiment for that specific alternative distribution.
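The procedure above can be sketched in Python. The rejection region (fewer than 40 or more than 60 heads in 100 flips) is the one from the earlier example; the alternative p = 0.6 is an illustrative choice:

```python
import random

random.seed(42)

def one_trial(n: int, p: float) -> int:
    """Simulate n coin flips with heads-probability p; return number of heads."""
    return sum(random.random() < p for _ in range(n))

def estimate_power(p_alt: float, n_flips: int = 100, trials: int = 10_000) -> float:
    """Fraction of simulated trials in which the null is (correctly) rejected."""
    # rejection region from the example: fewer than 40 or more than 60 heads
    rejections = sum(
        1 for _ in range(trials)
        if not 40 <= one_trial(n_flips, p_alt) <= 60
    )
    return rejections / trials

power_06 = estimate_power(0.6)   # typically around 0.45-0.47 for this alternative
```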

3   Probability Distributions

  • To get an overall idea of how values are distributed, we can examine (usually graphically), the probability density function for a distribution. For example, the familiar bell curve is the density function for a normal distribution. In R, density functions begin with the letter "d".
  • To find out what value corresponds to a particular probability (like 0.025 and 0.975 in the previous example), we can call the quantile function for the distribution. These functions accept an argument between 0 and 1 (i.e. a probability) and return the value for that distribution where the probability that an observation would be smaller than that value is equal to the probability passed to the function. When you perform a statistical test, choose a significance level, and look up the statistic's value in a table, you're actually using the quantile function. Quantile functions in R begin with the letter "q".
  • If you've got a value for an observation that comes from a particular distribution, and you want to know the probability that observations from the distribution would be less than that value, you can pass the value to a probability function (like pbinom ) and it will return the probability. In R, these functions begin with the letter "p".
  • When you want to generate random numbers from a distribution, you can call the appropriate random number generator. Random number generators in R begin with the letter "r".
The distributions available in R (each with its own d, p, q, and r function) include: Beta, Binomial, Cauchy, Chi-squared, Exponential, F, Gamma, Geometric, Hypergeometric, Lognormal, Logistic, Negative Binomial, Normal, Poisson, Signed Rank, Student's t, Uniform, Weibull, and Wilcoxon's Rank Sum.
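The d/p/q/r naming scheme is specific to R, but the same four operations exist in most environments. As one illustration, Python's standard-library `statistics.NormalDist` provides the identical quartet for the normal distribution (a sketch in Python rather than R; the R calls are shown in the comments):

```python
from statistics import NormalDist

std = NormalDist(mu=0, sigma=1)

density = std.pdf(0.0)             # like dnorm(0): height of the bell curve at 0
prob    = std.cdf(1.96)            # like pnorm(1.96): P(Z < 1.96), about 0.975
quant   = std.inv_cdf(0.975)       # like qnorm(0.975): about 1.96
draws   = std.samples(5, seed=42)  # like rnorm(5): five random draws
```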

4   A Note about Random Numbers in R


Hypothesis Testing Basics & One Sample Tests for Proportions

Introduction to hypothesis testing.

Hypothesis testing is a decision-making process by which we analyze a sample in an attempt to distinguish between results that can easily occur and results that are unlikely.

One begins with a claim or statement -- the reason for the study. For example, the claim might be "This coin in my pocket is fair."

Then we design a study to test the claim. In the case of the coin, we might decide to flip the coin 100 times.

Consider what could happen as a result of flipping that coin 100 times:

Suppose we saw that 99 out of 100 times, the flip resulted in "heads". Upon seeing this, no one in their right mind would still believe that the coin was fair. That notion would be completely rejected. A fair coin should come up "heads" roughly 50% of the time. The probability that a fair coin would come up "heads" 99 times out of 100 is so ridiculously small, that for all practical purposes we should never see it happen. The fact that we saw it happen constitutes significant statistical "evidence" that the assumption that the coin is fair is very, very wrong.

On the other hand, if one saw 54 out of 100 flips result in "heads", then -- while this doesn't exactly match our expectation that a fair coin should come up heads 50 out of 100 times -- it is not that far off the mark. It may be that we have a fair coin and the amount we are off is just due to the random nature of coin flips. It may also be that our coin is only slightly unfair -- perhaps coming up heads only 55% of the time. We simply don't know. We have no evidence either way. There is no reason for a person who previously believed the coin was fair to change their mind. There is no significant statistical "evidence" that the coin is not fair.

These two circumstances capture the essence of all hypothesis testing...

We hold some belief of something before we start our experiment or study (e.g., the coin is fair). This belief might be based on our experience or history. It might be the more conservative thing to believe, given two possibilities. It might be categorized as a "no-change-has-happened" belief. Whatever it is -- if we see a highly unusual difference between what is expected under the assumption of that belief and what actually happens as a result of our sampling or experimentation, we consequently reject that belief. Seeing a more common outcome under the assumption of that belief, however, does not result in any rejection of that belief.

Attaching some statistical verbiage to these ideas, the "belief" described in the previous paragraph is called the null hypothesis , $H_0$. The alternative hypothesis , $H_1$, is what one will be forced to conclude is more likely the case after a rejection of the null hypothesis.

These hypotheses are typically stated in terms of values of population parameters, with the null hypothesis stating that the parameter in question "equals" some specific value, while the alternative hypothesis says this parameter is instead either not equal to, greater than, or less than that same specific value, depending on the context.

Importantly, the "claim" (the reason for the study) might sometimes be the null hypothesis, while other times it might be the alternative hypothesis. A common error among students learning statistics for the first time is to assume the claim is always just one of these.

Hypothesis Testing Using $p$-values

Returning to the example concerned with deciding whether a coin is fair or not based on flipping it 100 times, and assuming $p$ is the true proportion of heads that should be seen upon flipping the coin in our pocket, we first write the null and alternative hypotheses for our coin tossing experiment in the following way: $$H_0 : p = 0.50; \quad H_1 : p \neq 0.50$$

Recall that under the assumption of the null hypothesis, and as long as $np \ge 5$ and $nq \ge 5$, sample proportions $\widehat{p}$ should "pile up" in an approximately normal distribution with $$\mu = p = 0.5 \quad \textrm{ and } \quad \sigma = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.5)(0.5)}{100}} = 0.05$$

Suppose as a result of our flipping the coin 100 times, we flipped heads $63$ times.

Then, the $z$-score for the corresponding sample proportion $\widehat{p} = 63/100 = 0.63$ is $$z_{0.63} = \frac{\widehat{p} - \mu}{\sigma} = \frac{0.63 - 0.5}{0.05} = 2.6$$ As a matter of verbiage -- for a hypothesis test involving a single proportion, the $z$-score associated with the sample proportion under consideration is called the test statistic . More generally, a test statistic indicates where on the related distribution the sample statistic falls.

Now we confront the question "Is what we saw unlikely enough that we should reject the null hypothesis? That is to say, does this particular observed $\widehat{p}$ happen so rarely when $p = 0.5$ that seeing it happen provides significant evidence that $p \neq 0.5$?"

Towards this end, we consider the probability of seeing our test statistic, $z_{0.63} = 2.6$ -- or something even more extreme in the sense of the alternative hypothesis, in this standard normal distribution.

These "even more extreme" values (shaded red in the below diagram) certainly include those $z$ scores farther in the tail on the right (i.e., $z > 2.6$), but they also include those $z$-scores at a similar distance from the mean in the left tail of the distribution (i.e., $z < -2.6$). This is due to the fact that our alternative hypothesis simply says $p \neq 0.5$ -- it does not specify that $p$ is higher or lower than $0.5$. Had the alternative hypothesis been different, we might have limited ourselves to those $z$-scores in only one tail.

[Figure: standard normal curve with the two-tailed region beyond $z = \pm 2.6$ shaded red]

We can easily find $P_{\textrm{std norm}}(|z| > 2.6)$ (i.e., the area shaded red above) with a standard normal table, a calculator, or a statistical programming environment like R.

As it turns out, $$P_{\textrm{std norm}}(|z| > 2.6) \doteq 0.00932$$
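In R this area is `2 * pnorm(-2.6)`; an equivalent check with Python's standard library (shown here since the rest of the computation is environment-agnostic):

```python
from statistics import NormalDist

z = (0.63 - 0.5) / 0.05                   # the test statistic from above
p_value = 2 * NormalDist().cdf(-abs(z))   # area in both tails beyond +/- 2.6
print(round(p_value, 5))                  # 0.00932
```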

Thus the probability, under the assumption that $p = 0.5$, that a sample will produce a $\widehat{p}$ as rare as (or rarer than) what we saw in our one sample is only $0.00932$.

The probability just found is known as the p-value for the hypothesis test. More generally, the p-value is the probability of the observed outcome or an outcome at least that unusual (in the sense of the alternative hypothesis), under the assumption that the null hypothesis is true.

In this way, the p-value quantifies just how unusual what we saw actually was.

The question remains, however -- was the p-value we found small enough that we should conclude $p \neq 0.5$ (i.e., reject the null hypothesis)?

Understand that as long as the p-value is not zero, there is some possibility that $p = 0.50$ is actually true, and what we saw was just due to random chance. However, we want to make a decision -- one way or the other -- as to whether we believe this is a fair coin or not. We need to establish a cut-off probability, so that if the p-value is less than this cut-off, we consider the observed outcome unusual enough that it constitutes significant evidence that the null hypothesis should be rejected. This cut-off probability is called the significance level , and is denoted by $\alpha$.

As a standard, when a significance level is not specified at the outset, one typically uses $\alpha = 0.05$. Under this standard, observing something that happens less than 5% of the time is considered unusual enough to reject the null hypothesis.

Certainly, in this case we have $\textrm{p-value } \doteq 0.00932 < 0.05 = \alpha$ and the null hypothesis should be rejected. We indeed have statistically significant evidence that $p \neq 0.50$ and that the coin is consequently not a fair coin.

Hypothesis Testing Using Critical Values

Continuing with the previous example, note that if all we care about is making a decision as to whether or not we believe the coin is fair, then we don't actually need the exact value of the p-value -- we just need to know if it is less than the significance level, $\alpha$.

With this in mind, recall that in a standard normal distribution, by the Empirical Rule, roughly 95% of the distribution falls between $z = -2$ and $z = 2$. A slightly better approximation puts this 95% in the region where $-1.96 \lt z \lt 1.96$. That leaves 5% of the distribution in the tails, where $|z| \gt 1.96$ (i.e., the area outside the blue dashed lines below).

[Figure: standard normal curve with critical values $z = \pm 1.96$ marked by blue dashed lines]

It should be patently obvious that if the test statistic (i.e., $z_{0.63} = 2.6$) falls in this region, the p-value (which agrees with the area of the red region above) must be smaller than 5%. Thus, we can immediately reject the null hypothesis, knowing $\textrm{p-value} \lt 0.05 = \alpha$.

The region of the distribution where it is unlikely for the test statistic to fall if the null hypothesis is indeed true (here, outside of $z = \pm 1.96$) is called the rejection region, and the boundaries of the rejection region are known as critical values.
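A minimal sketch of the critical-value approach, in Python rather than R (the $\pm 1.96$ cutoff is just the $0.975$ quantile of the standard normal):

```python
from statistics import NormalDist

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96: boundary of the rejection region
z_test = 2.6                                  # test statistic from the coin example

reject = abs(z_test) > z_crit  # True: 2.6 falls inside the rejection region
```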

This (more traditional) way of performing a hypothesis test is certainly simpler, especially in the absence of a calculator or software that can help with the calculation of the $p$-value -- although not knowing the $p$-value makes it more difficult to compare how significant different results are relative to one another.

Hypothesis Testing Using a Confidence Interval

There is one more way to perform a hypothesis test. It is only appropriate for two-tailed tests, but is simple to perform:

Simply find a confidence interval with confidence level of $(1-\alpha)$ where $\alpha$ is the significance level of the hypothesis test. Then, reject $H_0$ if the hypothesized population parameter (e.g., a proportion or mean) is not in the confidence interval.

This method of hypothesis testing can sometimes (very rarely) result in a different conclusion than the other two methods, as the confidence interval is built from an approximation of the standard deviation using the sample statistic ($\widehat{p}$, for example) as opposed to using the hypothesized population parameter (e.g. the proportion $p$ in the examples discussed here) to calculate the standard deviation.
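A sketch of the confidence-interval approach for the running example, in Python rather than R; note that the standard error is built from $\widehat{p}$, exactly as the caveat above describes:

```python
import math
from statistics import NormalDist

p_hat, n, alpha = 0.63, 100, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
se = math.sqrt(p_hat * (1 - p_hat) / n)   # SE uses p-hat, not the hypothesized p = 0.5
lo, hi = p_hat - z_crit * se, p_hat + z_crit * se

reject = not (lo <= 0.5 <= hi)  # hypothesized p outside the interval -> reject H0
```

Here the 95% interval is roughly (0.535, 0.725), which excludes 0.5, so the conclusion matches the other two methods.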

Making an Inference

Regardless of whether you use $p$-values, critical values, or confidence intervals to conclude whether or not the null hypothesis should be rejected -- once this is done, it is time to make an inference -- that is to say, it is time to communicate what your conclusion says about the original claim.

When forming an inference, one should try to phrase it in a manner easily digested by someone that doesn't know a lot about statistics. In particular, one should not use words like "null hypothesis", "p-value", "significance level", etc.

If the claim is the null hypothesis, you can start your inference with "There is enough evidence to reject the claim that..." if you rejected the null hypothesis, and with "There is not enough evidence to reject the claim that..." if you failed to reject the null hypothesis.

If the claim is the alternative hypothesis, you can instead start your inference with "There is enough evidence to support the claim that..." if you rejected the null hypothesis, and with "There is not enough evidence to support the claim that..." if you failed to reject the null hypothesis.

Alternatively, you can use phrases like "significantly different", "significantly higher", or "significantly lower" in the statement of your inference.

Very importantly -- one should never use the word "prove" in an inference. No matter the result of the hypothesis test, there is always a possibility of error. For example, at a significance level of $0.05$, even when the null hypothesis is true one should expect that roughly one in twenty experiments/observations will produce a p-value less than $0.05$, which could then be erroneously deemed "statistically significant evidence" -- a point humorously driven home by this xkcd.com comic strip .

Indeed, there are two types of errors we could make when conducting a hypothesis test. We could -- as just described -- mistakenly reject a true null hypothesis (known as a Type I error ), or we could fail to reject a false null hypothesis (known as a Type II error ). Sadly, these firmly entrenched names for the types of errors one might commit are not very descriptive and consequently can easily be confused with one another. As a useful way to keep them straight, one might remember Aesop's fable of "The Boy Who Cried Wolf". In part I of this story, the villagers commit a Type I error by reacting to the presence of a wolf when there is none. In part II of the story, the villagers commit a Type II error by failing to react when there actually is a wolf.

The probability of a Type I error is, of course, the significance level for the test, $\alpha$. The probability of a Type II error is denoted by $\beta$, and is impossible to calculate without knowing the actual value of the population parameter in question.

As one last bit of verbiage, the probability of rejecting a false null hypothesis is consequently $1 - \beta$, and is known as the power of the test . We can increase the power of a test by either increasing the sample size $n$, or the significance level, $\alpha$.


Is my coin fair?

Notes on statistical hypothesis testing.

Paros Kwan

Introduction

For centuries, coins have been used as an unbiased way to choose between two options. But how do you know if a coin is really fair? Maybe the heads side is slightly heavier, so the tails side shows up more often? Who knows? It turns out that it takes a little bit of statistics to answer the question.

Probability

Random variable.

In the simplest terms, a random variable has a set of possible outcomes with a corresponding probability distribution. So if we flip a single coin, the possible outcomes for the random variable X would be:

X = { Head, Tail }

and a probability is associated with each possible outcome. For a fair coin,

  • P( X = Head ) = 0.5
  • P( X = Tail ) = 0.5

What if it is an unfair coin? The probability would look like:

  • P( X = Head ) = r
  • P( X = Tail ) = 1 - r, for 0 ≤ r ≤ 1

In general, a single random experiment with exactly 2 outcomes, such as flipping a coin or answering a yes-no question, is called a Bernoulli trial. A sequence of repeated Bernoulli trials is called a Bernoulli process, and the probability distribution over the possible outcomes of a single trial is called a Bernoulli distribution.

Central Limit Theorem

When independent random variables are added, their normalised sum tends toward a normal distribution even if the original variables themselves are not normally distributed. A normal distribution is a well-studied distribution parameterised by its mean 𝜇 and standard deviation 𝝈.

In our case, we flip the same coin a thousand times and add up the results (taking Head as 1 and Tail as 0). Each trial is assumed to be identical and independent, as it is the same coin and the result of one flip won't affect the others. The resultant random variable Y is:

Y = X₁ + X₂ + X₃ + … +X₁₀₀₀

To be precise, the sum of a series of Bernoulli trials follows a binomial distribution; when the total number of trials n is large enough, it can be approximated by a normal distribution with:

  • 𝜇 = np
  • 𝝈² = np (1 - p )

Statistical hypothesis testing

To test a hypothesis in statistics, the first thing is to come up with a null hypothesis, which is generally believed to be true, for example:

  • a woman with a big belly is pregnant
  • the pill is effective at curing cancer, etc.

and the corresponding alternative hypothesis is the exact opposite:

  • the woman with a big belly is not pregnant
  • the pill is not effective at curing cancer

After the experiment is conducted and data is collected, if the result is sufficiently inconsistent with the null hypothesis, the null hypothesis is rejected and the alternative is accepted.

In principle, the null and alternative hypotheses are interchangeable; it is just general practice in science to conduct experiments that try to disprove the null hypothesis.

Significance and p-value

Assuming the null hypothesis is true, if the experimental result lies within the most extreme 5% of the distribution, we reject the null hypothesis in favour of the alternative, as it is very unlikely that a random experiment would hit such an extreme result. The threshold, 5%, is called the significance level ( α ); it is general practice to set it to 5% or 1%. The probability ( p ) of an outcome at least as extreme as the experimental result, given that the null hypothesis holds, is called the p-value; if p < α, we reject the null hypothesis.

Putting it all together

So for our experiment, our null hypothesis is that the coin is fair, that is:

  • P( HEAD ) is 0.5

And the alternative is that:

  • P( HEAD ) is not 0.5

Assuming the null hypothesis holds, if we flip the coin 1000 times, the total number of heads approximately follows a normal distribution with 𝜇 = 500 and 𝝈 ≈ 15.8,

with the 95% margins lying at 469 and 531. The most extreme 5% of possible outcomes is a number of heads less than 469 or greater than 531. If you flip the coin 1000 times and the result lies between 469 and 531, the null hypothesis is not rejected and the coin is considered fair; otherwise, the null hypothesis is rejected and the alternative hypothesis is favoured.
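The 469 and 531 cutoffs can be reproduced from the normal approximation above; a quick check in Python (the post's own figures, rounded, fall out of the standard library):

```python
import math
from statistics import NormalDist

n, p = 1000, 0.5
mu = n * p                           # 500
sigma = math.sqrt(n * p * (1 - p))   # about 15.81
approx = NormalDist(mu, sigma)

lower = approx.inv_cdf(0.025)  # about 469
upper = approx.inv_cdf(0.975)  # about 531
```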


Written by Paros Kwan

CUHK Physics graduate. Programming and STEM.



Coin Flips and Hypothesis Tests

Here's a problem I thought of that I don't know how to approach:

You have a fair coin that you keep on flipping. After every flip, you perform a hypothesis test based on all coin flips thus far, with significance level $\alpha$, where your null hypothesis is that the coin is fair and your alternative hypothesis is that the coin is not fair. In terms of $\alpha$, what is the expected number of flips before the first time that you reject the null hypothesis?

Edit based on comment below: For what values of $\alpha$ is the answer to the question above finite? For those values for which it is infinite, what is the probability that the null hypothesis will ever be rejected, in terms of $\alpha$?

Edit 2: My post was edited to say "You believe that you have a fair coin." The coin is in fact fair, and you know that. You do the hypothesis tests anyway. Otherwise the problem is unapproachable because you don't know the probability that any particular toss will come up a certain way.

  • probability
  • expectation
  • hypothesis-testing


  • 1 $\begingroup$ If there is a positive probability that you never reject the null hypothesis, then the expectation is infinite. [I don't know whether or not there is such a positive probability.] $\endgroup$ –  paw88789 Commented May 9, 2015 at 12:41

EDIT: This answer was unclear for OP at first, so I tried to make it clearer through a new approach. Apparently it raised another legitimate doubt, so I have now tried to put both answers together and clarify them even more. (Still I might be wrong, but I'll try to express myself better.)

What you look for is the expected number of tosses before we make a Type I error (rejecting $H_0$ when it was true). The probability of that is precisely $\alpha$ (that's another way to define it).

So $P(Type\ I\ error)=\alpha$.

Let $X_n$ be the event of rejecting on the $n^{th}$ test.

Now, $E[X_1]=\alpha$ stands for the expected number of games (a "game" is starting the whole flip-and-test procedure over with a new coin) where $H_0$ was rejected on the first throw. $E[X_1+X_2]=E[X_1]+E[X_2]$ is the expected number of games where $H_0$ is rejected either on the first or the second throw. Note that for most $\alpha$ this will be lower than $1$, so the expectation for a single game is not to have rejected $H_0$ yet.

When do we expect to have rejected $H_0$? Precisely when the number of expected games in which we reject $H_0$ is $1$. Therefore, we look for $n$ such that $$ E[X_1+X_2+...+X_n]=1\\ E[X_1+X_2+...+X_n]=E[nX_1]=nE[X_1]=n\alpha=1\\ n=\frac{1}{\alpha} $$

The other answer goes like this: Let the variable $T$ count the number of tests before rejecting one. We look for $E[T]$.

Also, using previous notation, $P(X_n)=\alpha(1-\alpha)^{n-1}$ (I'm aware this implies independence between the events $X_n$ and $X_{n-1}$ but since I'm looking for the expected value, for the linearity of the Expected Value , it shouldn't be a problem, though I'm aware I'm not being polite with notation).

$$E[T] = \sum_{n=1}^{\infty}nP(X_n) = \sum_{n=1}^{\infty}n\alpha(1-\alpha)^{n-1}= \alpha\sum_{n=1}^{\infty}n(1-\alpha)^{n-1}=\\ \alpha\sum_{n=0}^{\infty}(n+1)(1-\alpha)^{n}= \alpha(\sum_{n=0}^{\infty}n(1-\alpha)^{n}+\sum_{n=0}^{\infty}(1-\alpha)^{n}) = \alpha(\frac{1-\alpha}{\alpha^2}+\frac{1}{\alpha}) \\ E[T]=\frac{1}{\alpha} $$


  • $\begingroup$ Can you explain why we are looking for $n$ such that $E[X_1 + X_2 + \dots + X_n] = 1$? $\endgroup$ –  Eric Neyman Commented May 23, 2015 at 18:28
  • $\begingroup$ I look for when the expected number of rejected $H_0$ is 1. $\endgroup$ –  Masclins Commented May 23, 2015 at 18:32
  • $\begingroup$ Why is this the same as the expected number of flips before the null hypothesis is rejected for the first time? $\endgroup$ –  Eric Neyman Commented May 24, 2015 at 20:28
  • $\begingroup$ It's what you asked. The expected number of flips until a hypothesis test rejects $H_0$, when doing one after each flip. I don't get your doubt, please be more specific about what you don't understand of the reasoning. $\endgroup$ –  Masclins Commented May 24, 2015 at 20:36
  • $\begingroup$ You showed that $\frac{1}{\alpha}$ is the value of $n$ such that the expected number of rejections for the first $n$ tests is equal to $1$. I am asking for the expected number of tests before the first rejection. I may be missing something, but I don't see why those two should be the same. $\endgroup$ –  Eric Neyman Commented May 25, 2015 at 21:13

Testing if a coin is fair using Bayesian statistics

Suppose we have a coin and want to decide whether it's fair. We assume that the a priori probability of a coin being fair is 1/2. However, we can't yet calculate the probability of the outcome of a coin flip, because to do so would require data we do not have. How would we estimate the probability that this coin is fair?

Idea: We can conduct an experiment wherein we flip the coin 1000 times. We count the number of successes (either heads or tails), and calculate the probability that this number of successes occurs with a fair coin*. We'll call this Pfair.

We then estimate the probability of this number of successes occurring in this coin* by assuming that the probability of a single success = the number of successes / 1000. We'll call the resulting binomial probability Pcoin. Then, we can estimate that P(success) = Pfair * P(fair coin) + Pcoin * P(unfair coin) = (Pfair + Pcoin) / 2.

*= Alternatively, we can calculate the probabilities that the number of successes is either less than or greater than the expected/actual number of successes.

Edit: What I'm after is more of a model of the likelihood that a given model is true. To do this, I would need the a priori probability of observing the data. Suppose I don't know that the probability that I pick up a random coin and get heads is 1/2. How can I apply Bayes' rule?


  • $\begingroup$ Whether or not you have data, you have beliefs about coin fairness (e.g., the coin is fair unless data suggest otherwise, or the coin is unfair, giving heads probability P , or there is a distribution of possible degrees of fairness representing how uncertain you are about the (un)fairness of the coin, etc.). $\endgroup$ –  Alexis Commented Aug 26, 2018 at 17:03

3 Answers

Bayesian hypothesis testing is usually done by formulating a model that decomposes the prior into the null and alternative cases, which leads to a particular form for Bayes factor . For an arbitrary prior probability for the null hypothesis we can update to find the posterior probability of the null hypothesis using a simple equation involving Bayes factor. This leads to a simple graph of the prior-to-posterior probability mapping, which is a bit like an AUC curve. Here is a general model form for your problem, plus the more specific uniform-prior model.

General Bayesian model: Suppose we observe coin tosses yielding the indicator variables $x_1, x_2, ..., x_n \sim \text{IID Bern}(\theta)$ where one indicates heads and zero indicates tails. We observe $s = \sum_{i=1}^n x_i$ heads in $n$ coin tosses, and we want to use this data to test the hypotheses:

$$H_0: \theta = \tfrac{1}{2} \quad \quad \quad H_A: \theta \neq \tfrac{1}{2}.$$

Without any loss of generality, let $\delta \equiv \mathbb{P}(H_0)$ be the prior probability of the null hypothesis and let $\pi_A(\theta) = p(\theta|H_A)$ be the conditional prior for $\theta$ under the alternative hypothesis. For this arbitrary prior we can express Bayes factor as:

$$\begin{equation} \begin{aligned} BF(n,s) \equiv \frac{p(\mathbf{x}|H_A) }{p(\mathbf{x}|H_0)} &= \frac{\int_0^1 \theta^s (1-\theta)^{n-s} \pi_A(\theta) d\theta}{(\tfrac{1}{2})^n} \\[6pt] &= 2^n \int_0^1 \theta^s (1-\theta)^{n-s} \pi_A(\theta) d\theta. \\[6pt] \end{aligned} \end{equation}$$

We then have posterior:

$$\begin{equation} \begin{aligned} \mathbb{P}(H_0|\mathbf{x}) = \mathbb{P}(H_0|s) &= \frac{p(\mathbf{x}|H_0) \mathbb{P}(H_0)}{p(\mathbf{x}|H_0) \mathbb{P}(H_0) + p(\mathbf{x}|H_A) \mathbb{P}(H_A)} \\[6pt] &= \frac{\mathbb{P}(H_0)}{\mathbb{P}(H_0) + BF(n,s) \mathbb{P}(H_A)} \\[6pt] &= \frac{\delta}{\delta + (1-\delta) BF(n,s)}. \\[6pt] \end{aligned} \end{equation}$$

Use of a particular conditional prior $\pi_A$ leads to different forms for the Bayes factor.

Testing with uniform prior under alternative: Suppose we take $\pi_A(\theta) = \mathbb{I}(\theta \neq 1/2)$ so that this conditional prior is uniform over the allowable parameter values. (We could take $\pi_A(\theta) = 1$ since changing the density at a single point has no effect; thus, we can be a bit fast-and-loose with the support.) Under this model the Bayes factor is:

$$BF(n,s) = 2^n \int_0^1 \theta^s (1-\theta)^{n-s} d\theta = 2^n \cdot \frac{\Gamma(s+1) \Gamma(n-s+1)}{\Gamma(n+2)}.$$

We can implement this model in R using the code below. In this code we give a function for the Bayes factor, and we generate the prior-posterior plot for some example data.

[Figure: prior-to-posterior probability plot for the example data]
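The R code referred to above did not survive in this copy; a Python sketch of the same computation (the function names are mine), using log-gamma to keep the Beta-function arithmetic stable for large $n$, might look like:

```python
import math

def bayes_factor(n, s):
    """BF(n, s) = 2^n * Gamma(s+1) * Gamma(n-s+1) / Gamma(n+2),
    evaluated on the log scale so large n does not overflow."""
    log_bf = (n * math.log(2)
              + math.lgamma(s + 1) + math.lgamma(n - s + 1)
              - math.lgamma(n + 2))
    return math.exp(log_bf)

def posterior_null(n, s, delta=0.5):
    """P(H0 | s heads in n tosses) = delta / (delta + (1 - delta) * BF)."""
    bf = bayes_factor(n, s)
    return delta / (delta + (1 - delta) * bf)
```

For instance, `bayes_factor(2, 1)` evaluates to $2/3$, so one head in two tosses mildly favours the null; sweeping `delta` over $(0,1)$ traces out the prior-to-posterior mapping described above.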

  • $\begingroup$ It’s worth mentioning, in view of Lindley’s paradox, that your inferences here are going to depend a lot on your prior. The uniform prior makes it really hard to reject when the true probability is close to, but not quite, 1/2, which is probably the case with an actual coin. $\endgroup$ –  guy Commented Aug 26, 2018 at 20:14
  • 1 $\begingroup$ True, but it is inherently difficult to distinguish parameter values that are close together, and this naturally requires a lot of data. So I don't really see that as a drawback of the method; it is just a natural aspect of statistics. $\endgroup$ –  Ben Commented Aug 27, 2018 at 0:37
  • $\begingroup$ It's hard to distinguish points that are close together, sure. But Bayesian inference under a uniform prior diverges drastically from Frequentist inference in this particular case. I can decrease the Bayes factor in favor of the alternative by a factor of 5 just by consider a Uniform(.4,.6) prior under the alternative rather than a Uniform(0,1); this applies in settings where the Uniform(.4,.6) prior and Uniform(0,1) prior result in essentially the same inference about $\pi$ when you don't include a point mass at $1/2$. $\endgroup$ –  guy Commented Aug 27, 2018 at 2:11
  • $\begingroup$ Let me see if I can dumb down your answer a bit: We basically consider not just the probability of experiencing a result for the null hypothesis, we also consider the probability of experiencing the result we got given a whole range of heads probabilities from 0 to 1 $\endgroup$ –  moonman239 Commented Sep 15, 2018 at 2:59
  • $\begingroup$ @moonman: Yes, that is the essence of Bayesian analysis --- we have an unknown probability of heads, represented by a parameter $\theta$, and we give this a prior distribution and then determine the posterior from the data. In hypothesis testing this generally entails giving a distribution under specific values, not just the general alternative hypothesis. $\endgroup$ –  Ben Commented Sep 15, 2018 at 3:39

The common approach for dealing with this kind of problem (in particular, where you have assumed a priori a fair coin) is to use an appropriate concentration inequality.

For this specific case, you would want to use the Hoeffding bound, and apply it to Bernoulli random variables. Wikipedia has an entry just on that. To summarize, basically, one can calculate a bound on the probability of what you observed.

For example, what is the probability, given our assumption of a fair coin, that we will see at least (or at most) $X$ heads over 1000 coin flips? A nice result shown there is that, if we normalize by the number of flips $n$, the probability of the sum divided by $n$ deviating from the true probability by more than $\sqrt{\frac{\ln n}{n}}$ is at most $\frac{2}{n^2}$. Of course, you can plug in different values, trading accuracy against the confidence level.

For completeness, the normalized version of the formula would be $P\left[\left|\frac{H(n)}{n} - p\right| > \epsilon\right] < 2e^{-2\epsilon^2n}$, where $H(n)$ is the sum of the Bernoulli trials, and $p$ is the actual probability for $1$.
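As a concrete check of the $\sqrt{\ln n / n}$ claim, here is a minimal Python sketch of the bound (the values `n = 1000` and `eps = 0.05` are just illustrations, not from the answer):

```python
from math import exp, log, sqrt

def hoeffding_bound(n, eps):
    """Two-sided Hoeffding bound for n independent coin flips:
    P(|H(n)/n - p| > eps) < 2 * exp(-2 * eps**2 * n)."""
    return 2 * exp(-2 * eps**2 * n)

n = 1000
eps = sqrt(log(n) / n)           # the deviation level used above
print(hoeffding_bound(n, eps))   # = 2 / n**2 = 2e-06
print(hoeffding_bound(n, 0.05))  # chance of straying 5% from p: about 0.0135
```

Plugging in $\epsilon = \sqrt{\ln n / n}$ makes the exponent $-2\ln n$, recovering the $2/n^2$ figure quoted above.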


  • 1 $\begingroup$ OP is specifically asking for a Bayesian solution. This answer is based entirely on frequentist considerations. $\endgroup$ –  guy Commented Aug 26, 2018 at 16:33
  • $\begingroup$ In addition to @guy 's comment, even if you go the frequentist route, why not just use the Binomial distribution? $\endgroup$ –  jbowman Commented Aug 26, 2018 at 16:49

To be fair, there isn't a good method of testing if a coin is "fair" in Bayesian methodology. You should either use Fisher's method of maximum likelihood or Pearson and Neyman's Frequentist method, depending on your goal. The method cited by MotiN is a rigorous version of ROPE, or region of practical equivalence. This creates a region around $\pi=\frac{1}{2}$ where no one could distinguish between $H_0:\pi=\frac{1}{2}$ and $H_A:\pi\ne\frac{1}{2}$.

This is not the same as testing to determine if $\pi=\frac{1}{2}$. It is used for that purpose in Bayesian methods, but Bayesian methods are ill-suited for testing sharp null hypotheses. The issue is that the interval $[0,1]$ is uncountable, while $\pi=\frac{1}{2}$ is a single point. As such, under any continuous prior the null hypothesis has Lebesgue measure zero and hence zero probability: $\Pr\left(\pi=\frac{1}{2}\right)=0$.

Likewise, you could use Bayes factors against any single point to get a relative probability, but not a point versus a region, which is what is required here.

If you really need to know if a coin is fair, which you cannot really know, then you should use one of the two classical methods. Use Fisher's method of maximum likelihood if you want the p-value to measure the weight of the evidence against the null, or use the Frequentist method if you want to know how you should behave. If the null is not rejected, you behave as if it is true. If the null is rejected, then you behave as if it is false.

There is no good Bayesian answer, but if you must use a Bayesian method, then ROPE is as close as you can get.


  • 4 $\begingroup$ I don't agree with this answer. Since when can Bayesians not build Bayes factors for testing composite alternatives? Even though [0,1] is uncountable, there is nothing stopping you from putting a point mass at 1/2; this part of your answer makes the presumption that we are limited to continuous priors. $\endgroup$ –  guy Commented Aug 26, 2018 at 16:36
  • $\begingroup$ @guy are you presuming then that the prior has holes in it? What would be your prior? The Lebesgue measure of $\Pr(\pi)$ over $[0,.5)\cup(.5,1]$ is $1$. The Lebesgue measure of $\pi=\frac{1}{2}$ is the complement, zero. The Bayes factor is zero. $\endgroup$ –  Dave Harris Commented Aug 26, 2018 at 17:14
  • 2 $\begingroup$ You are making a massive assumption in stating that the prior on $\pi$ must have a density with respect to Lebesgue measure. Instead, we can have a density with respect to the sum of Lebesgue measure and the Dirac mass at 1/2, which is the completely standard thing to do. $\endgroup$ –  guy Commented Aug 26, 2018 at 19:57
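The standard construction described in the last comment (a prior with a Dirac atom at $1/2$ plus a continuous component) gives a closed-form posterior probability for the point null, since the uniform component's marginal likelihood is exactly $1/(n+1)$. A minimal Python sketch, assuming for illustration a prior mass $\rho = 1/2$ on the null and a Uniform(0, 1) density on the alternative (the example data are invented):

```python
from math import comb

def posterior_prob_fair(k, n, rho=0.5):
    """Posterior probability that theta = 1/2, given k heads in n flips,
    under a prior putting mass rho on the point 1/2 (a Dirac atom) and
    spreading the remaining 1 - rho uniformly over [0, 1]."""
    m0 = comb(n, k) * 0.5**n  # marginal likelihood under the point null
    m1 = 1 / (n + 1)          # integral of C(n,k) t^k (1-t)^(n-k) over [0,1]
    return rho * m0 / (rho * m0 + (1 - rho) * m1)

print(posterior_prob_fair(520, 1000))  # roughly 0.92: mild support for fairness
print(posterior_prob_fair(600, 1000))  # essentially 0: the coin looks biased
```

So the point null receives positive posterior probability after all; the zero-probability objection above only applies to priors that are absolutely continuous with respect to Lebesgue measure.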


