How to Compare Two or More Distributions


A complete guide to comparing distributions, from visualization to statistical tests

Comparing the empirical distribution of a variable across different groups is a common problem in data science. In particular, in causal inference the problem often arises when we have to assess the quality of randomization.

When we want to assess the causal effect of a policy (or UX feature, ad campaign, drug, …), the gold standard in causal inference is the randomized control trial, also known as an A/B test. In practice, we select a sample for the study, randomly split it into a control and a treatment group, and compare the outcomes between the two groups. Randomization ensures that, on average, the only difference between the two groups is the treatment, so that we can attribute outcome differences to the treatment effect.

The problem is that, despite randomization, the two groups are never identical. Sometimes, they are not even “similar”. For example, we might have more males in one group, or older people, etc. (we usually call these characteristics covariates or control variables). When this happens, we can no longer be certain that the difference in the outcome is due only to the treatment, rather than to the imbalanced covariates. Therefore, it is always important, after randomization, to check whether all observed variables are balanced across groups and there are no systematic differences. Another option, to be certain ex-ante that certain covariates are balanced, is stratified sampling.

In this blog post, we are going to see different ways to compare two (or more) distributions and assess the magnitude and significance of their difference. We are going to consider two different approaches, visual and statistical. The two approaches generally trade off intuition against rigor: from plots we can quickly assess and explore differences, but it’s hard to tell whether these differences are systematic or due to noise.

Let’s assume we need to perform an experiment on a group of individuals and we have randomized them into a treatment and a control group. We would like them to be as comparable as possible, in order to attribute any difference between the two groups to the treatment effect alone. We have also divided the treatment group into different arms for testing different treatments (e.g. slight variations of the same drug).

For this example, I have simulated a dataset of 1000 individuals, for whom we observe a set of characteristics. I import the data generating process dgp_rnd_assignment() from src.dgp and some plotting functions and libraries from src.utils.
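
Since src.dgp and src.utils live in the accompanying repository, here is a minimal, self-contained sketch of a comparable dataset. The column names (Gender, Age, Income, Group, Arm) and the data generating process are assumptions for illustration, not the original dgp_rnd_assignment().

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dgp_rnd_assignment(): 1000 individuals with
# gender, age, weekly income, a treatment/control label and a treatment arm.
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "Gender": rng.choice(["male", "female"], size=n),
    "Age": rng.integers(18, 65, size=n),
    "Income": np.round(rng.lognormal(mean=6.0, sigma=0.5, size=n), 2),
    "Group": rng.choice(["control", "treatment"], size=n, p=[0.25, 0.75]),
})
df["Arm"] = np.where(df["Group"] == "treatment",
                     rng.choice(["arm 1", "arm 2", "arm 3", "arm 4"], size=n),
                     "control")
df.head()
```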

We have information on $1000$ individuals, for whom we observe gender, age and weekly income. Each individual is assigned either to the treatment or the control group, and treated individuals are distributed across four treatment arms.

Two Groups - Plots

Let’s start with the simplest setting: we want to compare the distribution of income across the treatment and control group. We first explore visual approaches and then statistical ones. The advantage of the former is intuition, while the advantage of the latter is rigor.

For most visualizations I am going to use Python’s seaborn library.

A first visual approach is the boxplot. The boxplot is a good trade-off between summary statistics and data visualization. The center line of the box represents the median, while the borders represent the first (Q1) and third (Q3) quartiles, respectively. The whiskers extend to the most extreme data points within 1.5 times the interquartile range (Q3 - Q1) from the box. The points that fall outside the whiskers are plotted individually and are usually considered outliers.

Therefore, the boxplot provides both summary statistics (the box and the whiskers) and direct data visualization (the outliers).
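
A minimal seaborn sketch, assuming the df, Group and Income names from the simulated data above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot of income by experimental group.
sns.boxplot(data=df, x="Group", y="Income")
plt.title("Boxplot of income by group")
plt.show()
```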

png

It seems that the income distribution in the treatment group is slightly more dispersed: the orange box is larger and its whiskers cover a wider range. However, the issue with the boxplot is that it hides the shape of the data, telling us some summary statistics but not showing us the actual data distribution.

The most intuitive way to plot a distribution is the histogram . The histogram groups the data into equally wide bins and plots the number of observations within each bin.
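
For example, with seaborn's histplot (same assumed column names as above):

```python
# Overlapping histograms of income, one per group.
sns.histplot(data=df, x="Income", hue="Group", bins=50)
plt.title("Histogram of income by group")
plt.show()
```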

png

There are multiple issues with this plot:

  • Since the two groups have a different number of observations, the two histograms are not comparable
  • The number of bins is arbitrary

We can solve the first issue using the stat option to plot the density instead of the count, and setting the common_norm option to False so that each group's histogram is normalized separately.
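
A sketch of the normalized version, assuming the same column names:

```python
# Plot densities instead of counts; normalize each group separately.
sns.histplot(data=df, x="Income", hue="Group", bins=50,
             stat="density", common_norm=False)
plt.show()
```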

png

Now the two histograms are comparable!

However, an important issue remains: the size of the bins is arbitrary. In one extreme, if we use very narrow bins, we end up with bins containing at most one observation each; in the other, if we use very wide bins, we end up with a single bin. In both cases, the plot loses informativeness. This is a classical bias-variance trade-off.

Kernel Density

One possible solution is to approximate the histogram with a continuous function, using kernel density estimation (KDE).
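
With seaborn this is a one-liner (column names assumed as above):

```python
# Kernel density estimate of income, one curve per group.
sns.kdeplot(data=df, x="Income", hue="Group", common_norm=False)
plt.show()
```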

png

From the plot, it seems that the estimated kernel density of income has “fatter tails” (i.e. higher variance) in the treatment group, while the average seems similar across groups.

The issue with kernel density estimation is that it is a bit of a black-box and might mask relevant features of the data.

Cumulative Distribution

A more transparent representation of the two distributions is their cumulative distribution function. At each point of the x axis (income) we plot the percentage of data points that have an equal or lower value. The main advantages of the cumulative distribution function are that

  • we do not need to make any arbitrary choice (e.g. number of bins)
  • we do not need to perform any approximation (e.g. with KDE), but we represent all data points
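
In recent versions of seaborn, ecdfplot draws the empirical CDFs directly; a sketch with the assumed column names:

```python
# Empirical cumulative distribution function of income, one step curve per group.
sns.ecdfplot(data=df, x="Income", hue="Group")
plt.show()
```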

png

How should we interpret the graph?

Since the two lines cross more or less at 0.5 (on the y axis), the medians of the two groups are similar.

Since the orange line is above the blue line on the left and below the blue line on the right, the distribution of the treatment group has fatter tails.

A related method is the qq-plot , where q stands for quantile. The qq-plot plots the quantiles of the two distributions against each other. If the distributions are the same, we should get a 45 degree line.

There is no native qq-plot function in Python and, while the statsmodels package provides a qqplot function , it is quite cumbersome. Therefore, we will do it by hand.

First, we need to compute the quantiles of the two groups, using the percentile function.

Now we can plot the two quantile distributions against each other, plus the 45-degree line, representing the benchmark perfect fit.
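
A hand-made sketch using numpy's percentile function; income_t and income_c are hypothetical names for the two income samples, with column names from the simulated data above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Income of the two groups.
income_t = df.loc[df["Group"] == "treatment", "Income"]
income_c = df.loc[df["Group"] == "control", "Income"]

# Corresponding percentiles of the two distributions.
percentiles = np.arange(1, 100)
q_t = np.percentile(income_t, percentiles)
q_c = np.percentile(income_c, percentiles)

plt.scatter(q_c, q_t, s=10)
lims = [min(q_c.min(), q_t.min()), max(q_c.max(), q_t.max())]
plt.plot(lims, lims, color="grey", linestyle="--")  # 45-degree benchmark line
plt.xlabel("Control quantiles")
plt.ylabel("Treatment quantiles")
plt.title("QQ-plot of income")
plt.show()
```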

png

The qq-plot delivers an insight very similar to that of the cumulative distribution plot: income in the treatment group has the same median (the dots cross the 45-degree line in the center) but wider tails (the dots are below the line on the left end and above it on the right end).

Two Groups - Tests

So far, we have seen different ways to visualize differences between distributions. The main advantage of visualization is intuition : we can eyeball the differences and intuitively assess them.

However, we might want to be more rigorous and try to assess the statistical significance of the difference between the distributions, i.e. answer the question “ is the observed difference systematic or due to sampling noise? ”.

We are now going to analyze different tests to discern two distributions from each other.

The first and most common test is the Student t-test. T-tests are generally used to compare means. In this case, we want to test whether the mean of the income distribution is the same across the two groups. The test statistic for the two-sample comparison of means is given by:

$$ stat = \frac{|\bar x_1 - \bar x_2|}{s_p \sqrt{1/n_1 + 1/n_2}} $$

Where $\bar x_1$ and $\bar x_2$ are the sample means, $n_1$ and $n_2$ the sample sizes, and $s_p$ is the pooled sample standard deviation. Under mild conditions, the test statistic is asymptotically distributed as a Student's t distribution.

We use the ttest_ind function from scipy to perform the t-test. The function returns both the test statistic and the implied p-value .
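
A minimal sketch, reusing the income_t and income_c samples defined above:

```python
from scipy import stats

# Two-sample t-test on income across treatment and control.
stat, p_value = stats.ttest_ind(income_t, income_c)
print(f"t-statistic = {stat:.4f}, p-value = {p_value:.4f}")
```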

The p-value of the test is $0.12$, therefore we do not reject the null hypothesis of no difference in means across treatment and control groups.

Note: the t-test assumes that the variance is the same in the two samples, so that a single estimate is computed on the joint sample. Welch’s t-test allows for unequal variances in the two samples.

Standardized Mean Difference (SMD)

In general, it is good practice to always perform a test for difference in means on all variables across the treatment and control group, when we are running a randomized control trial or A/B test.

However, since the denominator of the t-test statistic depends on the sample size, the t-test has been criticized for making p-values hard to compare across studies. In fact, we may obtain a significant result in an experiment with a very small difference in magnitude but a large sample size, and a non-significant result in an experiment with a large difference but a small sample size.

One solution that has been proposed is the standardized mean difference (SMD) . As the name suggests, this is not a proper test statistic, but just a standardized difference, which can be computed as:

$$ SMD = \frac{|\bar x_1 - \bar x_2|}{\sqrt{(s^2_1 + s^2_2) / 2}} $$

Usually a value below $0.1$ is considered a “small” difference.
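
A sketch of the SMD computation for income, using the samples defined above:

```python
import numpy as np

# Standardized mean difference between two samples.
def smd(x1, x2):
    pooled_sd = np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2)
    return np.abs(x1.mean() - x2.mean()) / pooled_sd

print(f"SMD (income) = {smd(income_t, income_c):.4f}")
```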

It is good practice to collect the average values of all variables across the treatment and control groups, together with a measure of distance between the two (either the t-test statistic or the SMD), into a table called a balance table. We can use the create_table_one function from the causalml library to generate it. As the name of the function suggests, the balance table should always be the first table you present when performing an A/B test.
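
A sketch of the call, assuming create_table_one sits in causalml.match (as in recent versions) and expects a 0/1 treatment indicator; the column names come from the simulated data above:

```python
from causalml.match import create_table_one

# Balance table: group means plus the SMD for each covariate.
df["treated"] = (df["Group"] == "treatment").astype(int)
table1 = create_table_one(data=df,
                          treatment_col="treated",
                          features=["Age", "Income"])
print(table1)
```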

In the first two columns, we can see the averages of the different variables across the treatment and control groups, with standard errors in parentheses. In the last column, the values of the SMD indicate a standardized difference of more than $0.1$ for all variables, suggesting that the two groups are probably different.

Mann–Whitney U Test

An alternative test is the Mann–Whitney U test . The null hypothesis for this test is that the two groups have the same distribution, while the alternative hypothesis is that one group has larger (or smaller) values than the other.

Differently from the other tests we have seen so far, the Mann–Whitney U test is based on ranks: it is robust to outliers and concentrates on the center of the distribution.

The test procedure is the following:

  • Combine all data points and rank them (in increasing or decreasing order)
  • Compute $U_1 = R_1 - n_1(n_1 + 1)/2$, where $R_1$ is the sum of the ranks of the data points in the first group and $n_1$ is the number of points in the first group
  • Compute $U_2$ analogously for the second group
  • The test statistic is $stat = \min(U_1, U_2)$

Under the null hypothesis of no systematic rank differences between the two distributions (i.e. same median), the test statistic is asymptotically normally distributed with known mean and variance.

The intuition behind the computation of $R$ and $U$ is the following: if the values in the first sample were all smaller than the values in the second sample, then $R_1$ would equal its minimum, $n_1(n_1 + 1)/2$, and $U_1$ would be zero; if they were all larger, $U_1$ would equal its maximum, $n_1 n_2$. If instead the two samples were similar, $U_1$ and $U_2$ would both be close to $n_1 n_2 / 2$, their expected value under the null hypothesis.

We perform the test using the mannwhitneyu function from scipy .
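
A minimal sketch with the samples defined above:

```python
from scipy import stats

# Mann-Whitney U test on income across the two groups.
stat, p_value = stats.mannwhitneyu(income_t, income_c)
print(f"U statistic = {stat:.1f}, p-value = {p_value:.4f}")
```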

We get a p-value of $0.6$, which implies that we do not reject the null hypothesis of no difference between the two distributions.

Note : as for the t-test, there exists a version of the Mann–Whitney U test for unequal variances in the two samples, the Brunner-Munzel test .

Permutation Tests

A non-parametric alternative is permutation testing. The idea is that, under the null hypothesis, the two distributions should be the same, therefore shuffling the group labels should not significantly alter any statistic.

We can choose any statistic and check how its value in the original sample compares with its distribution across group-label permutations. For example, let’s use the difference in sample means between the treatment and control group as a test statistic.
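
A sketch of the procedure under these assumptions (10,000 label permutations, two-sided p-value, names from the simulated data above):

```python
import numpy as np

# Permutation test: shuffle incomes across groups and recompute the difference in means.
rng = np.random.default_rng(0)
sample_stat = income_t.mean() - income_c.mean()
income = df["Income"].to_numpy()
n_treat = int((df["Group"] == "treatment").sum())

perm_stats = np.empty(10_000)
for i in range(10_000):
    shuffled = rng.permutation(income)
    perm_stats[i] = shuffled[:n_treat].mean() - shuffled[n_treat:].mean()

# Share of permuted statistics at least as extreme as the observed one.
p_value = np.mean(np.abs(perm_stats) >= np.abs(sample_stat))
print(f"difference in means = {sample_stat:.2f}, p-value = {p_value:.4f}")
```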

The permutation test gives us a p-value of $0.056$, implying that we do not reject the null hypothesis at the 5% level, but only marginally.

How do we interpret the p-value? It means that the difference in means observed in the data is larger than $1 - 0.056 = 94.4\%$ of the differences in means across the permuted samples.

We can visualize the test by plotting the distribution of the test statistic across permutations against its sample value.

png

As we can see, the sample statistic is quite extreme with respect to the values in the permuted samples, but not excessively.

Chi-Squared Test

The chi-squared test is a very powerful test that is mostly used to test differences in frequencies.

One of the least known applications of the chi-squared test is testing the similarity between two distributions. The idea is to bin the observations of the two groups. If the two distributions were the same, we would expect the same relative frequency of observations in each bin. Importantly, we need enough observations in each bin for the test to be valid.

I generate bins corresponding to deciles of the distribution of income in the control group and then I compute the expected number of observations in each bin in the treatment group, if the two distributions were the same.

We can now perform the test by comparing the expected (E) and observed (O) number of observations in the treatment group, across bins. The test statistic is given by

$$ stat = \sum _{i=1}^{n} \frac{(O_i - E_i)^{2}}{E_i} $$

where the bins are indexed by $i$, $O_i$ is the observed number of data points in bin $i$ and $E_i$ is the expected number of data points in bin $i$. Since we generated the bins using deciles of the distribution of income in the control group, we expect the number of observations per bin in the treatment group to be the same across bins. The test statistic is asymptotically distributed as a chi-squared distribution.

To compute the test statistic and the p-value of the test, we use the chisquare function from scipy .
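
A sketch of the whole procedure: bins from the control deciles, expected counts under equality, and scipy's chisquare (sample names as defined above):

```python
import numpy as np
from scipy import stats

# Bin edges: deciles of income in the control group (open-ended extremes).
edges = np.percentile(income_c, np.arange(0, 101, 10))
edges[0], edges[-1] = -np.inf, np.inf

# Observed counts in the treatment group vs expected counts if the
# distributions were identical (1/10 of the treated sample in each bin).
observed, _ = np.histogram(income_t, bins=edges)
expected = np.full(10, len(income_t) / 10)

stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared statistic = {stat:.2f}, p-value = {p_value:.4f}")
```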

Differently from all other tests so far, the chi-squared test strongly rejects the null hypothesis that the two distributions are the same. Why?

The reason is that the two distributions have a similar center but different tails, and the chi-squared test assesses similarity along the whole distribution, not only in the center, as the previous tests did.

This result tells a cautionary tale : it is very important to understand what you are actually testing before drawing blind conclusions from a p-value!

Kolmogorov-Smirnov Test

The idea of the Kolmogorov-Smirnov test is to compare the cumulative distributions of the two groups. In particular, the Kolmogorov-Smirnov test statistic is the maximum absolute difference between the two cumulative distributions.

$$ stat = \sup _{x} \ \Big| \ F_1(x) - F_2(x) \ \Big| $$

Where $F_1$ and $F_2$ are the two cumulative distribution functions and $x$ are the values of the underlying variable. The asymptotic distribution of the Kolmogorov-Smirnov test statistic is Kolmogorov distributed .

To better understand the test, let’s plot the cumulative distribution functions and the test statistic. First, we compute the cumulative distribution functions.

We now need to find the point where the absolute distance between the cumulative distribution functions is largest.
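
A sketch of both steps, computing the two empirical CDFs on a common support and locating the maximum gap (sample names as defined above):

```python
import numpy as np

# Empirical CDFs of income for the two groups, evaluated on a common support.
support = np.sort(df["Income"].unique())
cdf_t = np.array([np.mean(income_t <= x) for x in support])
cdf_c = np.array([np.mean(income_c <= x) for x in support])

# Point of maximum absolute distance: the KS test statistic.
gaps = np.abs(cdf_t - cdf_c)
idx = np.argmax(gaps)
print(f"KS statistic = {gaps[idx]:.4f} at income = {support[idx]:.0f}")
```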

We can visualize the value of the test statistic, by plotting the two cumulative distribution functions and the value of the test statistic.

png

From the plot, we can see that the value of the test statistic corresponds to the distance between the two cumulative distributions at income ~650. For that value of income , we have the largest imbalance between the two groups.

We can now perform the actual test using the kstest function from scipy .
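
A minimal sketch; ks_2samp is the explicit two-sample version of the test (in recent SciPy, calling kstest with two samples is equivalent):

```python
from scipy import stats

# Two-sample Kolmogorov-Smirnov test on income.
stat, p_value = stats.ks_2samp(income_t, income_c)
print(f"KS statistic = {stat:.4f}, p-value = {p_value:.4f}")
```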

The p-value is below 5%: we reject the null hypothesis that the two distributions are the same at the 5% significance level.

Note 1 : The KS test is too conservative and rejects the null hypothesis too rarely. Lilliefors test corrects this bias using a different distribution for the test statistic, the Lilliefors distribution.
Note 2 : the KS test uses very little information since it only compares the two cumulative distributions at one point: the one of maximum distance. The Anderson-Darling test and the Cramér-von Mises test instead compare the two distributions along the whole domain, by integration (the difference between the two lies in the weighting of the squared distances).

Multiple Groups - Plots

So far we have only considered the case of two groups: treatment and control. But what if we had multiple groups? Some of the methods we have seen above scale well, while others don’t.

As a working example, we are now going to check whether the distribution of income is the same across treatment arms .

The boxplot scales very well when the number of groups is in the single digits, since we can put the different boxes side by side.
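
A sketch, assuming the Arm column from the simulated data above:

```python
# Boxplot of income across treatment arms.
sns.boxplot(data=df[df["Group"] == "treatment"], x="Arm", y="Income")
plt.title("Boxplot of income by treatment arm")
plt.show()
```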

png

From the plot, it looks like the distribution of income is different across treatment arms, with higher numbered arms having a higher average income.

Violin Plot

A very nice extension of the boxplot that combines summary statistics and kernel density estimation is the violin plot. The violin plot draws a separate density for each group along the y axis, so that they don’t overlap. By default, it also adds a miniature boxplot inside.
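
The seaborn call mirrors the boxplot one (same assumed column names):

```python
# Violin plot of income across treatment arms (density plus mini boxplot).
sns.violinplot(data=df[df["Group"] == "treatment"], x="Arm", y="Income")
plt.title("Violin plot of income by treatment arm")
plt.show()
```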

png

As for the boxplot, the violin plot suggests that income is different across treatment arms.

Ridgeline Plot

Lastly, the ridgeline plot stacks multiple kernel density plots along the x axis, making them more intuitive than the violin plot, at the cost of partially overlapping them. Unfortunately, there is no default ridgeline plot in either matplotlib or seaborn, so we need to import it from joypy.
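
A sketch using joypy's joyplot, with the column names assumed above:

```python
import joypy
import matplotlib.pyplot as plt

# Ridgeline plot: one kernel density of income per treatment arm.
fig, axes = joypy.joyplot(df[df["Group"] == "treatment"],
                          by="Arm", column="Income", alpha=0.6)
plt.xlabel("Income")
plt.show()
```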

png

Again, the ridgeline plot suggests that higher numbered treatment arms have higher income. From this plot it is also easier to appreciate the different shapes of the distributions.

Multiple Groups - Tests

Lastly, let’s consider hypothesis tests to compare multiple groups. For simplicity, we will concentrate on the most popular one: the F-test.

The F-test compares the variance of a variable across different groups. This analysis is also called analysis of variance, or ANOVA.

In practice, the F-test statistic is

$$ \text{stat} = \frac{\text{between-group variance}}{\text{within-group variance}} = \frac{\sum_{g} n_g \big( \bar x_g - \bar x \big)^2 / (G-1)}{\sum_{g} \sum_{i \in g} \big( x_i - \bar x_g \big)^2 / (N-G)} $$

Where $G$ is the number of groups, $N$ is the number of observations, $n_g$ is the number of observations in group $g$, $\bar x$ is the overall mean and $\bar x_g$ is the mean within group $g$. Under the null hypothesis of equal means across groups, the test statistic is F-distributed.
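
A minimal sketch using scipy's f_oneway, grouping income by the assumed Arm column:

```python
from scipy import stats

# One-way ANOVA (F-test) of income across treatment arms.
treated = df[df["Group"] == "treatment"]
samples = [g["Income"].to_numpy() for _, g in treated.groupby("Arm")]
stat, p_value = stats.f_oneway(*samples)
print(f"F statistic = {stat:.2f}, p-value = {p_value:.4f}")
```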

The test p-value is basically zero, implying a strong rejection of the null hypothesis of no differences in the income distribution across treatment arms.

In this post we have seen many different ways to compare two or more distributions, both visually and statistically. This is a primary concern in many applications, but especially in causal inference, where we use randomization to make treatment and control groups as comparable as possible.

We have also seen how different methods might be better suited for different situations . Visual methods are great to build intuition, but statistical methods are essential for decision-making, since we need to be able to assess the magnitude and statistical significance of the differences.

[1] Student, The Probable Error of a Mean (1908), Biometrika.

[2] F. Wilcoxon, Individual Comparisons by Ranking Methods (1945), Biometrics Bulletin.

[3] B. L. Welch, The generalization of “Student’s” problem when several different population variances are involved (1947), Biometrika.

[4] H. B. Mann, D. R. Whitney, On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other (1947), The Annals of Mathematical Statistics.

[5] E. Brunner, U. Munzel, The Nonparametric Behrens-Fisher Problem: Asymptotic Theory and a Small-Sample Approximation (2000), Biometrical Journal.

[6] A. N. Kolmogorov, Sulla determinazione empirica di una legge di distribuzione (1933), Giorn. Ist. Ital. Attuar.

[7] H. Cramér, On the composition of elementary errors (1928), Scandinavian Actuarial Journal.

[8] R. von Mises, Wahrscheinlichkeit, Statistik und Wahrheit (1936), Bulletin of the American Mathematical Society.

[9] T. W. Anderson, D. A. Darling, Asymptotic Theory of Certain “Goodness of Fit” Criteria Based on Stochastic Processes (1953), The Annals of Mathematical Statistics.

Related Articles

  • Goodbye Scatterplot, Welcome Binned Scatterplot

You can find the original Jupyter Notebook here:

https://github.com/matteocourthoud/Blog-Posts/blob/main/notebooks/distr.ipynb


I hold a PhD in economics from the University of Zurich. Now I work at the intersection of economics, data science and statistics. I regularly write about causal inference on Medium .

Statistical functions ( scipy.stats ) #

This module contains a large number of probability distributions, summary and frequency statistics, correlation functions and statistical tests, masked statistics, kernel density estimation, quasi-Monte Carlo functionality, and more.

Statistics is a very large area, and there are topics that are out of scope for SciPy and are covered by other packages. Some of the most important ones are:

statsmodels : regression, linear models, time series analysis, extensions to topics also covered by scipy.stats .

Pandas : tabular data, time series functionality, interfaces to other statistical languages.

PyMC : Bayesian statistical modeling, probabilistic machine learning.

scikit-learn : classification, regression, model selection.

Seaborn : statistical data visualization.

rpy2 : Python to R bridge.

Probability distributions #

Each univariate distribution is an instance of a subclass of rv_continuous ( rv_discrete for discrete distributions):

Continuous distributions #

The fit method of the univariate continuous distributions uses maximum likelihood estimation to fit the distribution to a data set. The fit method can accept regular data or censored data . Censored data is represented with instances of the CensoredData class.

Multivariate distributions #

scipy.stats.multivariate_normal methods accept instances of the following class to represent the covariance.

Discrete distributions #

An overview of statistical functions is given below. Many of these functions have a similar version in scipy.stats.mstats which work for masked arrays.

Summary statistics #

Frequency statistics #, hypothesis tests and related functions #.

SciPy has many functions for performing hypothesis tests that return a test statistic and a p-value, and several of them return confidence intervals and/or other related information.

The headings below are based on common uses of the functions within, but due to the wide variety of statistical procedures, any attempt at coarse-grained categorization will be imperfect. Also, note that tests within the same heading are not interchangeable in general (e.g. many have different distributional assumptions).

One Sample Tests / Paired Sample Tests #

One sample tests are typically used to assess whether a single sample was drawn from a specified distribution or a distribution with specified properties (e.g. zero mean).

Paired sample tests are often used to assess whether two samples were drawn from the same distribution; they differ from the independent sample tests below in that each observation in one sample is treated as paired with a closely-related observation in the other sample (e.g. when environmental factors are controlled between observations within a pair but not among pairs). They can also be interpreted or used as one-sample tests (e.g. tests on the mean or median of differences between paired observations).

Association/Correlation Tests #

These tests are often used to assess whether there is a relationship (e.g. linear) between paired observations in multiple samples or among the coordinates of multivariate observations.

These association tests and are to work with samples in the form of contingency tables. Supporting functions are available in scipy.stats.contingency .

Independent Sample Tests #

Independent sample tests are typically used to assess whether multiple samples were independently drawn from the same distribution or different distributions with a shared property (e.g. equal means).

Some tests are specifically for comparing two samples.

Others are generalized to multiple samples.

Resampling and Monte Carlo Methods #

The following functions can reproduce the p-value and confidence interval results of most of the functions above, and often produce accurate results in a wider variety of conditions. They can also be used to perform hypothesis tests and generate confidence intervals for custom statistics. This flexibility comes at the cost of greater computational requirements and stochastic results.

Instances of the following object can be passed into some hypothesis test functions to perform a resampling or Monte Carlo version of the hypothesis test.

Multiple Hypothesis Testing and Meta-Analysis #

These functions are for assessing the results of individual tests as a whole. Functions for performing specific multiple hypothesis tests (e.g. post hoc tests) are listed above.

The following functions are related to the tests above but do not belong in the above categories.

Quasi-Monte Carlo #

  • scipy.stats.qmc.QMCEngine
  • scipy.stats.qmc.Sobol
  • scipy.stats.qmc.Halton
  • scipy.stats.qmc.LatinHypercube
  • scipy.stats.qmc.PoissonDisk
  • scipy.stats.qmc.MultinomialQMC
  • scipy.stats.qmc.MultivariateNormalQMC
  • scipy.stats.qmc.discrepancy
  • scipy.stats.qmc.geometric_discrepancy
  • scipy.stats.qmc.update_discrepancy
  • scipy.stats.qmc.scale

Contingency Tables #

  • chi2_contingency
  • relative_risk
  • association
  • expected_freq

Masked statistics functions #

  • hdquantiles
  • hdquantiles_sd
  • idealfourths
  • plotting_positions
  • find_repeats
  • trimmed_mean
  • trimmed_mean_ci
  • trimmed_std
  • trimmed_var
  • scoreatpercentile
  • pointbiserialr
  • kendalltau_seasonal
  • siegelslopes
  • theilslopes
  • sen_seasonal_slopes
  • ttest_1samp
  • ttest_onesamp
  • mannwhitneyu
  • kruskalwallis
  • friedmanchisquare
  • brunnermunzel
  • kurtosistest
  • obrientransform
  • trimmed_stde
  • argstoarray
  • count_tied_groups
  • compare_medians_ms
  • median_cihs
  • mquantiles_cimj

Other statistical functionality #

Transformations #, statistical distances #.

  • scipy.stats.sampling.NumericalInverseHermite
  • scipy.stats.sampling.NumericalInversePolynomial
  • scipy.stats.sampling.TransformedDensityRejection
  • scipy.stats.sampling.SimpleRatioUniforms
  • scipy.stats.sampling.RatioUniforms
  • scipy.stats.sampling.DiscreteAliasUrn
  • scipy.stats.sampling.DiscreteGuideTable
  • scipy.stats.sampling.UNURANError
  • FastGeneratorInversion
  • scipy.stats.sampling.FastGeneratorInversion.evaluate_error
  • scipy.stats.sampling.FastGeneratorInversion.ppf
  • scipy.stats.sampling.FastGeneratorInversion.qrvs
  • scipy.stats.sampling.FastGeneratorInversion.rvs
  • scipy.stats.sampling.FastGeneratorInversion.support

Random variate generation / CDF Inversion #

Fitting / survival analysis #, directional statistical functions #, sensitivity analysis #, plot-tests #, univariate and multivariate kernel density estimation #, warnings / errors used in scipy.stats #, result classes used in scipy.stats #.

These classes are private, but they are included here because instances of them are returned by other statistical functions. User import and instantiation is not supported.

  • scipy.stats._result_classes.RelativeRiskResult
  • scipy.stats._result_classes.BinomTestResult
  • scipy.stats._result_classes.TukeyHSDResult
  • scipy.stats._result_classes.DunnettResult
  • scipy.stats._result_classes.PearsonRResult
  • scipy.stats._result_classes.FitResult
  • scipy.stats._result_classes.OddsRatioResult
  • scipy.stats._result_classes.TtestResult
  • scipy.stats._result_classes.ECDFResult
  • scipy.stats._result_classes.EmpiricalDistributionFunction

User Preferences

Content preview.

Arcu felis bibendum ut tristique et egestas quis:

  • Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
  • Duis aute irure dolor in reprehenderit in voluptate
  • Excepteur sint occaecat cupidatat non proident

Keyboard Shortcuts

22.1 - the test.

Before we can work on developing a hypothesis test for testing whether an empirical distribution function \(F_n (x)\) fits a hypothesized distribution function \(F (x)\) we better have a good idea of just what is an empirical distribution function \(F_n (x)\). Therefore, let's start with formally defining it.

Given an observed random sample \(X_1 , X_2 , \dots , X_n\), an empirical distribution function \(F_n (x)\) is the fraction of sample observations less than or equal to the value x . More specifically, if \(y_1 < y_2 < \dots < y_n\) are the order statistics of the observed random sample, with no two observations being equal, then the empirical distribution function is defined as:

Such a formal definition is all well and good, but it would probably make even more sense if we took at a look at a simple example.

Example 22-1 Section  

woman swimming in a pool

A random sample of n = 8 people yields the following (ordered) counts of the number of times they swam in the past month:

0 1 2 2 4 6 6 7

Calculate the empirical distribution function \(F_n (x)\).

As reported, the data are ordered, therefore the order statistics are \(y_1 = 0, y_2 = 1, y_3 = 2, y_4 = 2, y_5 = 4, y_6 = 6, y_7 = 6\) and \(y_8 = 7\). Therefore, using the definition of the empirical distribution function, we have:

\(F_n(x)=0 \text{ for } x < 0\)

\(F_n(x)=\frac{1}{8} \text{ for } 0 \le x < 1\) and \(F_n(x)=\frac{2}{8} \text{ for } 1 \le x < 2\)

Now, noting that there are two 2s, we need to jump 2/8 at x = 2:

\(F_n(x)=\frac{2}{8}+\frac{2}{8}=\frac{4}{8} \text{ for } 2 \le x < 4\)

\(F_n(x)=\frac{5}{8} \text{ for } 4 \le x < 6\)

Again, noting that there are two 6s, we need to jump 2/8 at x = 6:

\(F_n(x)=\frac{5}{8}+\frac{2}{8}=\frac{7}{8} \text{ for } 6 \le x < 7\)

And, finally:

\(F_n(x)=\frac{7}{8}+\frac{1}{8}=\frac{8}{8}=1 \text{ for } x \ge 7\)

Plotting the function, it should look something like this then:

Now, with that behind us, let's jump right in and state and justify (not prove!) the Kolmogorov-Smirnov statistic for testing whether an empirical distribution fits a hypothesized distribution well.

\[D_n=sup_x\left[ |F_n(x)-F_0(x)| \right]\]

is used for testing the null hypothesis that the cumulative distribution function \(F (x)\) equals some hypothesized distribution function \(F_0 (x)\), that is, \(H_0 : F(x)=F_0(x)\), against all of the possible alternative hypotheses \(H_A : F(x) \ne F_0(x)\). That is, \(D_n\) is the least upper bound of all pointwise differences \(|F_n(x)-F_0(x)|\).

The bottom line is that the Kolmogorov-Smirnov statistic makes sense, because as the sample size n approaches infinity, the empirical distribution function \(F_n (x)\) converges, with probability 1 and uniformly in x , to the theoretical distribution function \(F (x)\). Therefore, if there is, at any point x , a large difference between the empirical distribution \(F_n (x)\) and the hypothesized distribution \(F_0 (x)\), it would suggest that the empirical distribution \(F_n (x)\) does not equal the hypothesized distribution \(F_0 (x)\). Therefore, we reject the null hypothesis:

\[H_0 : F(x)=F_0(x)\]

if \(D_n\) is too large.

Now, how do we know that \(F_n (x)\) converges, with probability 1 and uniformly in x , to the theoretical distribution function \(F (x)\)? Well, unfortunately, we don't have the tools in this course to officially prove it, but we can at least do a bit of a hand-waving argument.

Let \(X_1 , X_2 , \dots , X_n\) be a random sample of size n from a continuous distribution \(F (x)\). Then, if we consider a fixed x , then \(W= F_n (x)\) can be thought of as a random variable that takes on possible values \(0, 1/n , 2/n , \dots , 1\). Now:

  • nW = 1, if and only if exactly 1 observation is less than or equal to x, and n −1 observations are greater than x
  • nW = 2, if and only if exactly 2 observations are less than or equal to x, and n −2 observations are greater than x
  • and in general...
  • nW = k , if and only if exactly k observations are less than or equal to x, and n − k observations are greater than x

If we treat a success as an observation being less than or equal to x , then the probability of success is:

\(P(X_i ≤ x) = F(x)\)

Do you see where this is going? Well, because \(X_1 , X_2 , \dots , X_n\) are independent random variables, the random variable nW is a binomial random variable with n trials and probability of success p = F (x). Therefore:

\[ P\left(W = \frac{k}{n}\right) = P(nW=k) = \binom{n}{k}[F(x)]^k[1-F(x)]^{n-k}\]

And, the expected value and variance of nW are:

\(E(nW)=np=nF(x)\) and \(Var(nW)=np(1-p)=n[F(x)][1-F(x)]\)

respectively. Therefore, the expected value and variance of W are:

\(E(W)=n\frac{F(x)}{n}=F(x)\) and \(\displaystyle Var(W) =\frac{n[(F(x)][1-F(x)]}{n^2}=\frac{[(F(x)][1-F(x)]}{n}\)

We're very close now. We just now need to recognize that as n approaches infinity, the variance of W , that is, the variance of \(F_n (x)\) approaches 0. That means that as n approaches infinity the empirical distribution \(F_n (x)\) approaches its mean \(F (x)\). And, that's why the argument for rejecting the null hypothesis if there is, at any point x , a large difference between the empirical distribution \(F_n (x)\) and the hypothesized distribution \(F_0 (x)\). Not a mathematically rigorous argument, but an argument nonetheless!

Notice that the Kolmogorov-Smirnov (KS) test statistic is the supremum over all real \(x\)---a very large set of numbers! How then can we possibly hope to compute it? Well, fortunately, we don't have to check it at every real number but only at the sample values, since they are the only points at which the supremum can occur. Here's why:

First the easy case. If \(x\ge y_n\), then \(F_n(x)=1\), and the largest difference between \(F_n(x)\) and \(F_0(x)\) occurs at \(y_n\). Why? Because \(F_0(x)\) can never exceed 1 and will only get closer for larger \(x\) by the monotonicity of distribution functions. So, we can record the value \(F_n(y_n)-F_0(y_n)=1-F_0(y_n)\) and safely know that no other value \(x\ge y_n\) needs to be checked.

The case where \(x<y_1\) is a little trickier. Here, \(F_n(x)=0\), and the largest difference between \(F_n(x)\) and \(F_0(x)\) would occur at the largest possible \(x\) in this range for a reason similar to that above: \(F_0(x)\) can never be negative and only gets farther from 0 for larger \(x\). The trick is that there is no largest \(x\) in this range (since \(x\) is strictly less than \(y_1\) ), and we instead have to consider lefthand limits. Since \(F_0(x)\) is continuous, its limit at \(y_1\) is simply \(F_0(y_1)\). However, the lefthand limit of \(F_n(y_1)\) is 0. So, the value we record is \(F_0(y_1)-0=F_0(y_1)\), and ignore checking any other value \(x<y_1\).

Finally, the general case \(y_{k-1}\le x <y_{k}\) is a combination of the two above. If \(F_0(x)<F_n(x)\), then \(F_0(y_{k-1})\le F_0(x)<F_n(x)=F_n(y_{k-1})\), so that \(F_n(y_{k-1})-F_0(y_{k-1})\) is at least as large as \(F_n(x)-F_0(x)\) (so we don't even have to check those \(x\) values). If, however, \(F_0(x)>F_n(x)\), then the largest difference will occur at the lefthand limits at \(y_{k}\). Again, the continuity of \(F_0\) allows us to use \(F_0(y_{k})\) here, while the lefthand limit of \(F_n(y_{k})\) is actually \(F_n(y_{k-1})\). So, the value to record is \(F_0(y_{k})-F_n(y_{k-1})\), and we may disregard the other \(x\) values.

Whew! That covers all real \(x\) values and leaves us a much smaller set of values to actually check. In fact, if we introduce a value \(y_0\) such that \(F_n(y_0)=0\), then we can summarize all this exposition with the following rule:

For each ordered observation \(y_k\) compute the differences

\(|F_n(y_k)-F_0(y_k)|\) and \(|F_n(y_{k-1})-F_0(y_k)|\).

The largest of these is the KS test statistic.

The easiest way to manage these calculations is with a table, which we now demonstrate with two examples.

Quickonomics

Cumulative Distribution Function

Definition of cumulative distribution function.

The Cumulative Distribution Function (CDF) of a random variable is a function that gives the probability that the variable takes a value less than or equal to a certain value. It is a fundamental concept in probability theory and statistics, providing a complete description of the distribution of a random variable. The CDF is non-decreasing and right-continuous, with limits at minus infinity converging to 0 and at plus infinity converging to 1.

Consider the random variable X that represents the result of rolling a fair six-sided die. The outcomes can be any integer from 1 to 6, each with an equal probability of 1/6. The CDF of X, denoted as F(x), can be described for each value of x as follows:

Why the Cumulative Distribution Function Matters

The cumulative distribution function is crucial in statistics and probability theory for several reasons:

– It provides a concise way to summarize and visualize the entire distribution of a random variable. – It is useful for calculating probabilities of intervals. For example, the probability that a variable falls within a range can be found by taking the difference between the CDF at the upper and lower bounds of the range. – The CDF is fundamental in defining other important statistical measures, such as quantiles and the median. – It serves as a starting point for statistical hypothesis testing and confidence interval estimation, which are central to inferential statistics.

Moreover, the CDF is essential in fields such as economics, engineering, and science, where understanding the distribution of various phenomena allows for better decision-making and predictions.

Frequently Asked Questions (FAQ)

How does the cumulative distribution function differ from the probability density function (pdf).

The Probability Density Function (PDF) of a continuous random variable is a function that represents the density of the variable at a certain point, indicating how likely the variable is to take a value near that point. The CDF, on the other hand, gives the cumulative probability up to a point. The PDF and the CDF are related; the CDF is the integral of the PDF for continuous variables and the sum for discrete variables.

Can the CDF be used for both discrete and continuous random variables?

Yes, the CDF is defined for both discrete and continuous random variables. For discrete variables, the CDF jumps at each value where there is a positive probability. For continuous variables, the CDF is typically a smooth function.

How can the CDF be used to find median and other quantiles?

The median of a distribution is a point where the CDF equals 0.5, meaning there is a 50% chance that the variable takes a value less than or equal to the median. Similarly, other quantiles can be found by looking for values that satisfy the appropriate probabilities. For instance, the first quartile is found where the CDF equals 0.25, indicating that 25% of the distribution lies below this point.

Understanding the cumulative distribution function and its applications is fundamental to mastering concepts in probability and statistics, which provide the groundwork for numerous real-world analyses and decision-making processes.

To provide the best experiences, we and our partners use technologies like cookies to store and/or access device information. Consenting to these technologies will allow us and our partners to process personal data such as browsing behavior or unique IDs on this site and show (non-) personalized ads. Not consenting or withdrawing consent, may adversely affect certain features and functions.

Click below to consent to the above or make granular choices. Your choices will be applied to this site only. You can change your settings at any time, including withdrawing your consent, by using the toggles on the Cookie Policy, or by clicking on the manage consent button at the bottom of the screen.

Statistics 5601 (Geyer, Fall 2013) Kolmogorov-Smirnov and Lilliefors Tests

General instructions, (cumulative) distribution functions, empirical (cumulative) distribution functions, distribution function examples, the uniform law of large numbers (glivenko-cantelli theorem, the asymptotic distribution (brownian bridge), suprema over the brownian bridge, one-sample tests, the corresponding confidence interval, example 11.6 in hollander and wolfe., the corresponding point estimate, two-sample tests, example 5.4 in hollander and wolfe..

The possible values "two.sided" , "less" and "greater" of alternative specify the null hypothesis that the true distribution function of x is equal to, not less than or not greater than the hypothesized distribution function (one-sample case) or the distribution function of y (two-sample case), respectively.

The Lilliefors Test

Valid HTML 4.01 Strict

1014SCG Statistics - Lecture Notes

Chapter 3 week 3/4 - probability distributions and the test of proportion.

Outline: Revision and Basics for Statistical Inference Review – Revision and things for you to look up Types of Inference Notation Probability Distribution Functions Areas Under a Probability Distribution Curve - Probabilities Tchebychef’s Rule – applies for any shaped distribution Probability of a Range of the Variable Cumulative Probability Central Limit Theorem Inference for Counts and Proportions – test of a proportion An Example One- and Two-Tailed Hypotheses The p-value of a Test Statistic Statistical Distributions The Binomial Distribution – for counts and proportions Examples using the Binomial Distribution The Normal Distribution The Normal Approximation to the Binomial Using R Calculating Binomial and Normal probabilities using R. Sorting data sets and running functions separately for different categories Workshop for Week 3 Based on week 1 and week 2 lecture notes. R: Entering data, summarising data, rep(), factor() using R for Goodness of Fit and test of independence. Workshop for Week 4 Based on week 3 lecture notes. Project Requirements for Week 3 Ensure you have generated your randomly selected individual project data; Explore your data using summaries, tables, graphs etc. Project Requirements for Week 4 Project 1 guidelines will be available in week 4. Make sure you read it carefully ! Things YOU must do in Week 3 Revise and summarise the lecture notes for week 2; Read your week 3 & 4 lecture notes before the lecture; Read the workshop on learning@griffith before your workshop; Revise and practice the \(\chi^2\) tests - You have a quiz on them in week 4

3.1 Revision and Basics for Statistical Inference

3.1.1 revision and things for you to look up.

Sample versus population & statistic versus parameter – see diagram in week 1 notes. Sampling variability

3.1.2 Types of Inference

2 basic branches of statistical inference: estimation and hypothesis testing.

eg1: Groundwater monitoring:

  • what is the level of sodium (na) in the groundwater downstream of the landfill (gdf)?
  • is the level of na in the gdf above the set standard for drinking water?
  • what level of na in the gdf can be expected over the next 12 months?

eg2: A new treatment is proposed for protecting pine trees against a disease – does it work? How effective is it? Does it give more than 20% protection?

3.1.3 Notation

population parameter: Greek Letter

sample statistic: Name - Upper case; observed value - lower case

sample statistic: is an ESTIMATOR of population parameter - use of ‘hat’ over the Greek symbols: \(\hat{\theta}\) , \(\hat{\sigma}\) , \(\hat{\phi}\) .

Some estimators as used so often they get a special symbol. E.g.: Sample mean, \(\overline{X} = \hat{\mu}\) , the estimate of the population mean \(\mu\) .

Sometimes use letters eg SE for standard error – the standard deviation of a sample statistic

3.1.4 Probability Distribution Functions: \(f(x)\)

Statistical probability models:

Can be expressed in graphical form – distribution curve

  • possible values of X along x-axis
  • relative frequencies (or probabilities) for each possible value along y-axis
  • total area under curve is 1; representing total of probabilities for all possible values/outcomes.

Shape can also be described by appropriate mathematical formula and/or expressed as a table of possible values and associated probabilities.

If the allowable values of X are discrete: Probability Mass Function (PMF), \(f(x) = Pr(X = x)\) .

If the allowable values of X are continuous: Probability Density Function (PDF) NB \(f(x) \neq Pr(X = x)\) for continuous r.v.s.

3.1.5 Areas Under a Probability Distribution Curve - Probabilities

For continuous variables, the total area under the probability curve will be 1, as this is the totality of the possible values X can take. Similarly, the sum over all allowable values of a discrete random variable will be 1.

3.1.6 Tchebychef’s Rule – applies for any shaped distribution

For any integer \(k > 1\) , at least \(100(1 − \frac{1}{k^2})\%\) of the measurements in a population will be within a distance of \(k\) standard deviations from the population mean:

\[ Pr(\mu - k \sigma \leq X \leq \mu + k \sigma) \geq 1 - \frac{1}{k^2} \] \[\begin{align*} \text{If } k = 2, & \text{ } Pr(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \geq 1 - \frac{1}{2^2} = 0.75 & \text{ } \text{75% within 2 sd of population mean} \\ \text{If } k = 3, & \text{ } Pr(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \geq 1 - \frac{1}{3^2} = 0.89 & \text{ } \text{89% within 3 sd of population mean} \\ \text{If } k = 4, & \text{ } Pr(\mu - 4\sigma \leq X \leq \mu + 4\sigma) \geq 1 - \frac{1}{4^2} = 0.94 & \text{ } \text{94% within 4 sd of population mean} \end{align*}\]

3.1.7 Probability of a Range of the Variable

For continuous random variables, the area under the curve representing f \((x)\) between two points is the probability that \(X\) lies between those two points:

\[Pr(a < X < b) = \int_b^a f(x) dx.\]

Note that \(P(X = a) = 0\) for all \(a \in X\) when \(X\) is continuous. (Why??)

For discrete random variables, \(P(X = a) = f(a)\) , and to find the probability that \(X\) lies in some range of values (e.g.  \(P(a \leq X < b)\) , \(P(X > c)\) , \(P(X < d)\) etc.), we simply sum the probabilities associated with the values specified in the range:

\[ P(X \in A) = \Sigma_{x \in A} f(x). \]

Example: Throwing a Fair Dice

The probability that any of one of the six sides of a fair dice lands uppermost when thrown is 1/6. This can be represented be represented mathematically as:

\[ P(X = x) = \frac{1}{6}, \hspace{0.3 cm} x = 1, 2, \ldots, 6. \]

where \(X\) represents the random variable describing the side that lands uppermost, and \(x\) represents the possible values \(X\) can take. This kind of distribution is known as a Uniform distribution (why?).

How would we represent this probability mass function (pmf) graphically?

3.1.8 Cumulative Probability (CDF): \(F(x) = Pr(X \leq x)\)

The cumulative density function (CDF) of a random variable is the probability that the random variable takes a value less than or equal to a specified value, x:

\[ F(x) = Pr(X \leq x) \]

For the dice example, the cumulative probability distribution can be calculated as follows:

We can also express the CDF for this example mathematically (note that this is not always possible for all random variables, but it is generally possible to create a table as above):

\[ F(x) = Pr(X \leq x) = \frac{x}{6}, \hspace{0.3 cm} x = 1, 2, \ldots, 6. \]

(Check for yourself that the values you get from this formula match those in the table.)

The cumulative probability is found by summing the relevant probabilities, starting from the left hand (smallest) values of the variable and stopping at the specified value; this gives the cumulative probability up to the stopping point and represents the probability that the variable is less than or equal to the specified value.

CDF Dice

What is \(Pr(X < 4)\) ?

3.1.9 Central Limit Theorem

In probability theory, the central limit theorem (CLT) states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed. If we were to take lots of samples from our population of interest, the means of these samples would be normally distributed.

The central limit theorem also requires the random variables to be identically distributed, unless certain conditions are met. The CLT also justifies the approximation of large-sample statistics to the normal distribution in controlled experiments.

3.2 Inference for Counts and Proportions – Test of a Proportion

3.2.1 an example.

Suppose we are concerned that the coin used to decide who will bat first in a cricket match is not unbiased. Note that this scenario is analogous to situations which arise in all sorts of research situations. For example, consider the following claims: the sex ratio for some animal species is 50:50; half of the Australian population have access to the internet at home; fifty percent of Australian children now go on to tertiary studies; there is a 50% chance that in the next six months there will be a better than average rainfall in Queensland; half of the eucalyptus species in Northern New South Wales are suffering from the disease die back.

Research Question:

Is the coin unbiased? That is, if it is tossed, is it just as likely to come down with a head showing as with a tail? Is the probability of seeing a head when the coin is tossed equal to ½?

What sort of experiment:

A single toss will not tell us much – how many tosses will we carry out? Resources are limited so we decide to use only six.

What sort of data:

Success or failure – assume a head is a success. Binary Data – each toss is a Bernoulli trial (an experiment with only two possible outcomes).

What feature from the experiment will have meaning for the question:

The number of heads seen in the sample. If the coin is unbiased we would expect to see three heads and three tails in 6 tosses. number of heads is a Binomial Variable – the sum of a series of independent Bernoulli trials.

Hypotheses:

We want to test the current belief that the probability of seeing a head is 0.5. The null hypothesis always reflects the status quo and assumes the current belief to be true. The alternative hypothesis is the opposite of the null and reflects the reason why the research was conducted.
Null Hypothesis: \(H_0\) : within the population of interest, the probability that a head will be seen is \(\frac{1}{2}\) . \(H_0: Pr(\text{head}) = 0.5\) . Alternative Hypothesis: \(H_1\) : the distribution within the population is not as specified; the probability that a head will be seen is not one half. \(H_1: Pr(\text{head}) \neq 0.5\) .
Selected at random from the population of interest – six random throws.

Test Statistic:

Seems sensible to look at the number of heads in the sample of 6 tosses as the test statistic - how likely are we to get the number of heads seen?

Null Distribution:

The distribution of the random variable, number of heads in a sample of six, IF the null hypothesis is true – that is, if \(Pr(\text{head}) = 0.5\) . See below for derivation and final distribution.
The following table shows the probability distribution function for the number of heads in 6 tosses of unbiased coin (a binomial variable with \(n=6\) and \(p=0.5\) , as derived in the box above). Note that R can be used to get these values.
NOTE THAT THIS TABLE RELIES ON THE FACT THAT EACH OF THE 64 POSSIBLE OUTCOMES IS EQUALLY LIKELY – WHAT HAPPENS IF THE PROBABILITY OF A HEAD IS TWICE THE PROBABILITY OF A TAIL?? What is the probability of getting 5 or 6 heads? What is the probability of getting at least 4 heads? What is the probability of getting no more than 3 heads?

Significance Level, \(\alpha\) :

Traditionally assume 0.05 (5%).

Critical Value, A:

AND NOW A PROBLEM ARISES!!!!!!!

3.2.2 One- and Two-Tailed Hypotheses

We need a value (or values) of the test statistic that will ‘cut off’ a portion of the null distribution representing 95% of the entire area.

Firstly we need to decide where the 5% to be omitted is to be. Will it be in the upper tail as it was for the chi-squared situation? Or, will it be in the lower tail ? Or, will it need to be apportioned across both tails ?

The answer will depend on the alternative hypothesis . Consider the following three possibilities:

the researcher’s belief is such that the test statistic will be larger than that expected if the null hypothesis is true;

the researcher’s belief is such that the test statistic will be smaller than that expected if the null hypothesis is true;

the researcher’s belief is such that the it is not clear whether the test statistic will be larger or smaller, it will just be different from that expected if the null hypothesis is true.

Two-Tailed Hypothesis

In the example, the question simply raises the issue that the coin may not be unbiased. There is no indication as to whether the possible bias will make a head more likely or less likely. The results could be too few heads or too many heads. This is a case 3 situation and is an example of a two-tailed hypothesis .

The critical value can be at either end of the distribution and the value of the stipulated significance, 0.05, must be split between the two ends, 0.025 (or as close as we can get it) going to each tail.

One-Tailed Hypothesis

Suppose instead that the researcher suspects the coin is biased in such a way as to give more heads and this is what is to be assessed (tested). The alternative hypothesis would be that the probability of a head is greater than \(\frac{1}{2}\) : \(H_1: p > 0.5\) – a case 1 situation.

Clearly the opposite situation could also occur if the researcher expected bias towards tails leading to an alternative: \(H_1: p < 0.5\) . This is a case 2 situation.

In both of these cases, the researcher clearly expects that if the null hypothesis is not true it will be false in a specific way . These are examples of a one-tailed hypothesis .

The critical value occurs entirely in the tail containing the extremes anticipated by the researcher. Thus for case 1 the critical value will cut off an upper tail . For case 2 the critical value must cut off a lower tail .

Back to the Example

The example as given is a two-tailed situation thus two critical values are needed, one to cut off the upper 0.025 portion of the null distribution, and the other to cut off the lower 0.025 portion.

To find the actual critical values we look at the distribution as we did for chi-squared.

AND NOW ANOTHER PROBLEM ARISES!!!!!!!

For chi-squared we had a continuous curve and the possible values could be anything, enabling us to find a specific value for any significance level nominated. Here we have discrete data (a count) with only the integer values from zero to six and their associated probabilities. Working with 5% we want the values that will cut off a lower and an upper probability of 0.025 each.

From the table we see:

  • probability of being less than 0 = 0
  • probability of being less than 1 = 0.01562
  • probability of being less than 2 = 0.10938

The closest we can get to 0.025 in the lower tail is 0.01562 for a number of heads of less than 1 (i.e. 0). Similar reasoning gives an upper critical value of greater than 5 (i.e. 6) with a probability of 0.01562.

We cannot find critical values for an exact significance level of 0.05 in this case. The best we can do is to use a significance level of \(0.01562 + 0.01562 = 0.03124\) and the critical values of 1 and 5 – approximately 97% of the values lie between 1 and 5, inclusive.
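
These tail probabilities are easy to check numerically. A minimal sketch using Python's scipy (an assumption; the notes themselves work from the probability table):

```python
from scipy.stats import binom

# Null distribution: T = number of heads in 6 tosses of a fair coin, T ~ Bin(6, 0.5)
n, p = 6, 0.5

print(binom.cdf(0, n, p))   # P(T < 1) = P(T = 0)  ~ 0.01562
print(binom.cdf(1, n, p))   # P(T < 2) = P(T <= 1) ~ 0.10938
print(binom.sf(5, n, p))    # P(T > 5) = P(T = 6)  ~ 0.01562

# Actual significance level for the critical region T < 1 or T > 5
print(binom.cdf(0, n, p) + binom.sf(5, n, p))   # ~ 0.03124
```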

?? What significance level would you be using if you selected the critical values of (less than) 2 and (greater than) 4 ??

Critical Region:

The part of the distribution more extreme than the critical values, A. The critical region for a significance level of 0.03124 will be any value less than 1 or any value greater than 5: \(T < 1\) or \(T > 5\) . Thus, if the sample test statistic (number of heads) is either zero or six, then it lies in the critical region (reject the null hypothesis). Any other value is said to lie in the acceptance region (cannot reject the null hypothesis).
Test Statistic, T:

Calculated using the sample data – the number of heads.

We now need to carry out the experiment

Collect the data:

Complete six independent tosses of the coin. The experimental results are: H T H H H H

Calculate the test statistic:

Test statistic (number of heads) = 5

Compare the test statistic with the null distribution:

Where in the distribution does the value of 5 lie? Two possible outcomes:

  • T lies in the critical region – conclusion: reject \(H_0\) in favour of the alternative hypothesis.
  • T does not lie in the critical region – conclusion: do not reject \(H_0\) (there is insufficient evidence to reject \(H_0\) ).

Here we have a critical region defined as values of \(T < 1\) and values of \(T > 5\) . The test statistic of 5 does NOT lie in the critical region so the null hypothesis \(H_0\) is not rejected.

Interpretation – one of two possibilities

Rejecting \(H_0\) in favour of \(H_1\) – within the possible error defined by the significance level, we believe the alternative hypothesis to be true and the null hypothesis has been falsified.

Failing to reject \(H_0\) – there is no evidence to reject the null hypothesis. Note: this does not prove the null hypothesis is true! It may simply mean that the data are inadequate – e.g. the sample size may be too small (Mythbusters effect…).

For the example, the null has not been rejected and we could give the conclusion as: we cannot reject the null hypothesis. We conclude that, based on this data, there is insufficient evidence to suggest the coin is biased. Note that this does not PROVE that the coin is unbiased; it simply says that given the available data there is no reason to believe that it is biased.

NOTE: Intuitively, getting 5 out of 6 is not particularly informative – a sample of 6 is very small. If the equivalent figure of 15 out of 18 were obtained, what would the decision be?

3.2.3 The \(p\) - value of a test statistic.

An alternative to working with a specific significance level is to use what is known as the \(p\) - value of the test statistic. This is the probability of getting a value as or more extreme than the sample test statistic , assuming the null hypothesis is true. Instead of giving the conclusion conditioned on the possible error as determined by the significance level, the conclusion is given together with the actual \(p\) -value. There are various ways of expressing this and some possible wordings are given with each example below.

In the example above, we had a sample test statistic of 5 heads. What is the probability of observing a value this extreme, or more extreme, if the probability of a head is \(0.5\) (i.e. if the null hypothesis is true)? From the table we want the probability of 5 or 6 heads.

If the coin is unbiased, the probability of getting 5 heads in 6 independent random tosses is \(0.09375\) . The probability of getting 6 heads is \(0.01562\) . Therefore, the probability of observing a value as or more extreme than 5 is: \(0.09375 + 0.01562 = 0.10937\) . Note that this can be read directly from the cumulative column of the table by realising that:

\[ Pr(X \geq 5) = 1 - Pr(X < 5) = 1 - Pr(X \leq 4) = 1 - 0.89062 = 0.10937\]

There is approximately an 11% chance of seeing something as or more extreme than the observed sample data just by pure chance, assuming the probability of a head is 0.5 – this seems like reasonable odds.
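
The same p-value can be read off the cumulative probabilities directly; a short sketch (again assuming scipy is available):

```python
from scipy.stats import binom

# P(X >= 5) for X ~ Bin(6, 0.5): results as or more extreme than the observed 5 heads
p_value = 1 - binom.cdf(4, 6, 0.5)   # equivalently binom.sf(4, 6, 0.5)
print(p_value)                       # ~ 0.109, i.e. about 11%
```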

3.3 THEORETICAL STATISTICAL DISTRIBUTIONS

All forms of statistical inference draw on the concept of some form of theoretical or empirical statistical model that describes the values that are being measured. This model is used to describe the distribution of the chosen test statistic under the null hypothesis (that is, if the null is true).

3.3.1 The Binomial Distribution – Discrete Variable

The data and test statistic used for the coin example were a specific case of the binomial probability distribution.

binary variable - variable can have only two possible values: present-absent, yes-no, success- failure, etc.

Bernoulli trial - process of deciding whether or not each individual has the property of interest; is a success or a failure.

The sum of \(n\) independent Bernoulli trials results in a variable with a binomial distribution (a binomial random variable).

A Binomial random variable measures the number of successes, number of presences, number of yes’s etc. out of the total number of (independent) trials ( \(n\) ) conducted.

The Binomial Setting

  • There are a fixed number of observations (trials), \(n\) .
  • The \(n\) observations (trials) are all independent.
  • Each observation falls into one of just two categories, which for convenience we call “success” and “failure” (but could be any dichotomy).
  • The probability of a “success” is called \(p\) and it is the same for each observation. (Note that this implies the probability of a “failure” is \((1-p)\) , since there are only two (mutually exclusive and exhaustive) categories for each trial outcome.)

Mathematical Jargon: \(X \sim \text{Bin}(n, p)\)

\(X\) is distributed as a Binomial random variable with number of trials \(n\) and probability of “success” \(p\) .

The mathematical model for the probability mass function of a binomial random variable is given by:

\[ Pr(X = x; n, p) = {n\choose x} p^x (1-p)^{(n-x)}, \text{ } x = 0, 1, 2, \ldots, n, \text{ } 0 \leq p \leq 1, \]

  • \(X\) is the name of the binomial variable – the number of successes
  • \(n\) is the sample size – the number of identical, independent observations;
  • \(x\) is the number of successes in the \(n\) observations;
  • \(p\) is the probability of a success.

The mathematical model in words is:

the probability of observing \(x\) successes of the variable, \(X\) , in a sample of \(n\) independent trials, if the probability of a success for any single trial is the same and equal to \(p\) .

(Compare this formula to that discussed in the coin toss example box.)
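
To make the formula concrete, here is a minimal sketch that evaluates the binomial pmf straight from the definition and checks it against a library implementation (the use of Python, math.comb and scipy is an assumption, not part of the notes):

```python
from math import comb
from scipy.stats import binom

def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ Bin(n, p), computed directly from the formula."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# e.g. the coin example: 5 heads in 6 tosses of a fair coin
print(binom_pmf(5, 6, 0.5))   # 0.09375, from the formula
print(binom.pmf(5, 6, 0.5))   # same value from scipy's built-in pmf
```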

Binomial Tables: Binomial probabilities for various (but very limited) values of \(n\) and \(p\) can be found in table form. See the Tables folder on the L@G site. Also note the folder contains a binomial table generator written in java script that will show you probabilities for user-selected \(n\) and \(p\) . R will also calculate Binomial probabilities (see R section in these notes).

3.3.1.1 Examples Using the Binomial Distribution

3.3.1.1.1 Example 1:

Each child born to a particular set of parents has probability \(0.25\) of having blood type O. If these parents have 5 children, what is the probability that exactly 2 of them have type O blood?

Let \(X\) denote the number of children with blood type O. Then \(X \sim \text{Bin}(5, 0.25)\) . We want to find \(P(X = 2)\) . Using the Binomial pmf above:

\[\begin{align*} P(X = 2) &= {5 \choose 2} \times 0.25^2 \times (1-0.25)^{(5 -2)} \\ &= 10 \times 0.25^2 \times 0.75^3 \\ &= 0.2637. \end{align*}\]

The probability that this couple will have 2 children out of 5 with blood type O is 0.2637. Can you find this probability in the Binomial tables?
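
As a quick numerical check of the hand calculation (a sketch, assuming scipy):

```python
from scipy.stats import binom

print(binom.pmf(2, 5, 0.25))   # P(X = 2) for X ~ Bin(5, 0.25), ~ 0.2637
```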

3.3.1.1.2 Example 2:

A couple have 5 children, and two of them have blood type O. Using this data, test the hypothesis that the probability of the couple having a child with type O blood is 0.25.

We are testing the following hypotheses:

\[\begin{align*} H_0: &\text{the probability of the couple having a child with type O blood is 0.25} \\ H_0: &P(\text{Blood Type O}) = 0.25 \end{align*}\]

\[\begin{align*} H_1: &\text{the probability of the couple having a child with type O blood is not 0.25} \\ H_1: &P(\text{Blood Type O}) \neq 0.25 \end{align*}\]

This is very similar to the coin tossing example above. Our test statistic will be the number of children with type O blood, which we are told in the question is T = 2.

The null distribution is the distribution of the test statistic assuming the null hypothesis is true. \(T\) is binomial with \(n=5\) and \(p=0.25\) . This distribution is shown here graphically:

Null Distribution

Significance level \(\alpha = 0.05\) (two-tailed = 0.025 either end). However, because this is a discrete distribution we may not be able to get exactly 0.05.

In the right hand tail using 3 as the critical value gives \(0.0146 + 0.0010 = 0.0156\) (using 2 as the critical value would make the overall \(\alpha\) too large). In the left hand tail the first category is very large but it’s the best we can do, so \(\leq 0\) gives 0.2373.

This means that the overall significance level for this test is \(\alpha = 0.0156 + 0.2373 = 0.2529\) (the small sample size, \(n = 5\) , is a big problem in this case).

Our test statistic \(T = 2\) . This does not lie in the critical region.

Inference: There is insufficient evidence to reject the null hypothesis that the probability of the couple having a child with type O blood is 0.25, at the \(\alpha = 0.2529\) level of significance.
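
The discrete tail probabilities used above can be reproduced as follows (a sketch assuming scipy; the notes round the values slightly):

```python
from scipy.stats import binom

n, p = 5, 0.25

upper = binom.sf(3, n, p)    # P(X > 3) = P(X >= 4) ~ 0.0156
lower = binom.cdf(0, n, p)   # P(X <= 0) = P(X = 0) ~ 0.2373
print(upper, lower, upper + lower)   # overall significance level ~ 0.253
```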

3.3.1.1.3 Example 3.

You have to answer 20 True/False questions. You have done some study so your answers will not be total guesses (i.e. the chance of getting any one question correct will not be 50/50). If the probability of getting a question correct is 0.60, what is the probability that you get exactly 16 of the 20 questions correct?

Let \(X\) denote the number of questions answered correctly. Then \(X \sim \text{Bin}(20, 0.6)\) . We want to find \(P(X = 16)\) . Using the Binomial pmf:

\[\begin{align*} P(X = 16) &= {20 \choose 16} \times 0.6^{16} \times (1-0.6)^{(20 - 16)} \\ &= 4845 \times 0.6^{16} \times 0.4^4 \\ &= 0.0349. \end{align*}\]

The probability of getting 16/20 True/False questions correct (if the probability of a correct answer is 0.60, and assuming your answers to each question are independent) is 0.0349. You need to study harder!

This is the probability of getting exactly 16 correct. What is the probability of getting 16 or less correct? We could sum the probabilities for getting 0, 1, 2, 3….16 correct (tedious!!) or we could note that \(P(X \leq 16) = 1 − P(X > 16)\) :

\[\begin{align*} P(X \leq 16) &= 1 - \sum_{x = 17}^{20} P(X = x) \\ &= 1 - \left( P(X = 17) + P(X = 18) + P(X = 19) + P(X = 20) \right) \\ &= 1 - (0.012 + 0.003 + 0.0005 + 0.00004) \\ &= 0.984 \end{align*}\]

(Make sure you know where these numbers come from).

There is a 98.4% chance that you will answer between 0 and 16 questions correctly. More to the point, there is less than a 2% chance of answering 17 or more questions correctly with the level of study undertaken (that is, a level of study that leads to a 60% chance of answering any one question correctly).
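
Both probabilities can be verified numerically; a minimal sketch assuming scipy:

```python
from scipy.stats import binom

n, p = 20, 0.6

print(binom.pmf(16, n, p))   # P(X = 16)  ~ 0.0349
print(binom.cdf(16, n, p))   # P(X <= 16) ~ 0.984
print(binom.sf(16, n, p))    # P(X >= 17) ~ 0.016
```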

3.3.1.1.4 Example 4.

You decide to do an experiment to test whether studying for 5 hours increases your chance of getting a correct answer in a T/F exam compared to simply guessing (no study). You study for 5 hours, answer 20 T/F questions, and discover you answered 16 correctly. Test your hypothesis using this sample data.

Note: this is a one-tailed (upper) hypothesis test, since your research question asks whether 5 hours of study will increase the chance of successfully answering a question, \(p\) , over and above guessing (ie will \(p\) be greater than 0.5?).

\[\begin{align*} H_0: & p \leq 0.5 \\ H_1: & p > 0.5 \end{align*}\]

With our sample of 20 questions and 16 successes, do we have enough evidence to reject \(H_0\) ?

Null Distribution: the distribution of the test statistic assuming the null hypothesis is true. This is a \(\text{Bin}(20, 0.5)\) and is shown graphically below.

Null Distribution

Significance level \(\alpha = 0.05\) (one tailed, all in the upper tail). Test statistic \(T = 16\) .

To obtain the critical value we need to find the value in the upper tail that has 0.05 of values (or as close as we can get to it) above it. We can sum the probabilities backwards from 20 until we reach approximately 0.05:

\[0.000001 + 0.00001 + 0.0002 + 0.0011 + 0.0046 + 0.0148 + 0.037 = 0.0577\]

So our critical value is 13, and the critical region is any value greater than 13.

The test statistic \(T = 16 > 13\) . Therefore we reject \(H_0\) and conclude that the probability of getting a T/F question correct is significantly greater than \(0.50\) if we study for 5 hours, at the \(\alpha = 0.0577\) level of significance.

If we took the critical value to be 14, our significance level would be 0.02072 and our test statistic would still be significant (i.e. we would still reject the null hypothesis). New conclusion: The test statistic of T = \(16 > 14\) . Therefore we reject \(H_0\) and conclude that the probability of getting a T/F question correct is significantly greater than 0.50 if we study for 5 hours, at the \(\alpha = 0.02072\) level of significance. What is the main effect of reducing the significance level? We have reduced the chance of a Type I error. Make sure you can explain this. Can the significance level be reduced further?
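
A sketch of the tail probabilities behind the two candidate critical values, and of the exact p-value of the observed statistic, again assuming scipy:

```python
from scipy.stats import binom

n, p = 20, 0.5

print(binom.sf(13, n, p))   # P(X >= 14) ~ 0.0577  (critical value 13)
print(binom.sf(14, n, p))   # P(X >= 15) ~ 0.0207  (critical value 14)
print(binom.sf(15, n, p))   # P(X >= 16) ~ 0.0059  (p-value of the observed T = 16)
```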

In example 3 we found that the probability of getting 16/20 True/False questions correct (if the probability of a correct answer is 0.60) is 0.0349. You should perhaps study more.

If the probability of getting a correct answer really is 0.6, how many of the 20 questions would you expect to answer correctly?

\[0.6 \times 20 = 12.\]

What should the probability of correctly answering a question be to make 16 correct answers out of 20 the expected outcome?

\[\begin{align*} p \times 20 &= 16\\ p &= \frac{16}{20} \\ &= 0.8 \end{align*}\]

Back To The Theoretical Binomial Distribution:

The expected (mean) value of the random variable \(X \sim \text{Bin}(n, p)\) is \(np\) . Note that this value does not always work out to be a whole number. You should think of the expected value (or mean) as the long run, average value of the random variable under repeated sampling.

The variance of a \(\text{Bin}(n, p)\) random variable is \(np(1-p)\) . Note that the variance depends on the mean, \(np\) .

The mode of a random variable is the most frequent, or probable, value. For the binomial distribution the mode is either equal to the mean or very close to it (the mode will be a whole number, whereas the mean will not necessarily be so). The mode of any particular \(\text{Bin}(n, p)\) distribution can be found by perusing the binomial tables for that distribution and finding the value with the largest probability, although this becomes prohibitive as \(n\) gets large. (There is a formula to calculate the mode for the binomial distribution; however this goes beyond the scope of this course.)

Binomial Distributions

The probability of success, \(p\) , influences the shape of the binomial distribution. If \(p=0.5\) , the distribution is symmetric around its mode (second figure). If \(p > 0.5\) , the distribution has a left skew (not shown). If \(p < 0.5\) , the distribution has a right skew (first figure). The closer \(p\) gets to either 0 or 1, the more skewed (right or left, respectively) the distribution becomes.

The number of trials, \(n\) , mediates the effects of \(p\) to a certain degree in the sense that the larger \(n\) is, the less skewed the distribution becomes for values of \(p \neq 0.5\) .

3.3.2 The Normal Distribution – A Continuous Variable

“Everybody believes in the Normal Distribution (Normal Approximation), the experimenters because they think it is a Mathematical theorem, the mathematicians because they think it is an experimental fact.”

(G. Lippman, A Nobel Prize winner in 1908, who specialised in Physics and Astronomy and was responsible for making improvements to the seismograph.)

Original mathematical derivation - Abraham De Moivre in 1733.

In the early 1800s, the mathematician and physicist Gauss “rediscovered” it - errors in physical measurements. Often called the Gaussian distribution for this reason. (History is written by the “winners”!)

The normal distribution arises in many situations where the random variable is continuous and can be thought of as an agglomeration of a number of components .

  • A physical feature determined by genetic effects and environmental influences, height, air temperature, yield from a crop, soil permeability.
  • The final grade on an exam, where a number of questions each receive some of the marks.
  • The day to day movements of a stock market index.

The Mathematical Equation of the Normal Distribution: \[ f_X(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}, \text{ } x \in \Re, \mu \in \Re, \sigma > 0, \]

  • \(x\) is a particular value of the random variable, \(X\) , and \(f(x)\) is the associated probability density;
  • \(\sigma^2\) is the population variance of the random variable, \(X\) ;
  • \(\mu\) is the population mean of the random variable \(X\) .

We write: \(X \sim N(\mu, \sigma^2)\) : “X is normally distributed with mean \(\mu\) and variance \(\sigma^2\) .”

Properties of the Normal probability distribution function

  • the shape of the curve is determined by the values of the parameters \(\mu\) and \(\sigma^2\) ;
  • the location of the peak (mode) is determined by the value of \(\mu\) ;
  • the spread, or dispersion, of the curve is determined by the value of \(\sigma^2\) ;
  • it is symmetric about \(\mu\) - thus the mean, median and mode are all equal to \(\mu\) ;
  • the total area under the curve is one – as for all probability distribution functions.

The Standard Normal Distribution

The shape of the normal probability distribution function depends on the population parameters. Separate curves are needed to describe each population. This is a problem because it means we need statistical tables of probabilities for each possible combination of \(\mu\) and \(\sigma\) (and there are infinitely many such combinations)!!

Happily, we can convert any normal distribution into the standard normal distribution via what is known as the Z-transformation formula. This means we only need probability tables for the standard normal distribution: we can work out probabilities for any other normal distribution from this.

We denote a random variable with the standard normal distribution as Z. The standard normal distribution is a normal distribution with mean = 0 and variance (and hence standard deviation) = 1. That is, \(Z \sim N(0,1)\) .

If \(X \sim N(\mu, \sigma^2)\) , we can convert it to a standard normal distribution via the Z-transformation:

\[ Z = \frac{X - \mu}{\sigma} \]

Probability of a Range of the Variable – the continuous case

The area under the graphical model between 2 points is the probability that the variable lies between those 2 points. Tables exist that tabulate some of these probabilities. See the tables folder on the L@G site. Also see the class examples below.

Cumulative Probability as an Area – the continuous case

The area under a graphical model starting from the left hand (smallest) values of the variable and stopping at a specified value is the cumulative probability up to the stopping point; it represents the probability that the variable is less than or equal to the specified value.

Class Examples

BE GUIDED BY THE DIAGRAM AND THE SYMMETRY OF THE CURVE

  • If \(Z \sim N(0, 1)\) find \(Pr(Z > 1.52)\)
  • If \(Z \sim N(0, 1)\) find the probability that \(Z\) lies between 0 and 1.52.
  • Find \(Pr(-1.52 < Z < 1.52)\) where \(Z \sim N(0, 1)\) .
  • Find \(Pr(Z < -1.96)\)
  • Find the value \(Z_i\) for which \(Pr(0 < Z < Z_i) = 0.45\) .

Class Example of Application of the Normal Distribution

Many university students do some part-time work to supplement their allowances. In a study on students’ wages earned from part-time work, it was found that their hourly wages are normally distributed with mean, \(\mu = \$ 6.20\) and standard deviation \(\sigma = \$0.60\) . Find the proportion of students who do part-time work and earn more than $7.00 per hour.

If there are 450 students in a particular Faculty who do part-time work, how many of them would you expect to earn more than $7.00 per hour?
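
One way the calculation could be carried out, using the Z-transformation defined above (a sketch assuming scipy; in class the standard normal tables are used instead):

```python
from scipy.stats import norm

mu, sigma = 6.20, 0.60

# P(X > 7.00) after converting to a standard normal value
z = (7.00 - mu) / sigma       # ~ 1.33
p_over_7 = 1 - norm.cdf(z)    # ~ 0.091
print(z, p_over_7)

# Expected number out of 450 students who do part-time work
print(450 * p_over_7)         # ~ 41 students
```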

Normal Quantile Plots (normal probability plots)

You do not need to do these by hand. The R functions qqnorm() (and qqline() to add a reference line) do these for you. See the example R code in the week 1 lecture notes folder for an example of how to do these graphs. (Boxplots can be used to show similar things.)

QQplot

Some Normal Notes

  • Not all Bell-Shaped Curves are normal.
  • It is a model that has been shown to approximate other models for a large number of cases.
  • It is by far the most commonly used probability distribution function in (traditional) statistical inference.

3.3.3 Normal Approximation to the Binomial

There is a limit to creating binomial tables, especially when the number of trials, \(n\) , becomes large (say 30 or greater). Fortunately, as \(n\) becomes large we can approximate the binomial distribution with a normal distribution as follows:

\[ \text{If } X \sim \text{Bin}(n, p) \text{ then } X \dot\sim N(np, np(1 - p)). \]

This approximation is reasonably good provided: \(np \geq 5\) and \(np(1 - p) \geq 5\) . Note that \((1 - p)\) is sometimes referred to as \(q\) , with adjustments made to the above formulae accordingly.
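
A quick sketch comparing an exact binomial tail probability with its normal approximation (assuming scipy; the continuity correction of 0.5 used below is a standard refinement that the notes do not discuss):

```python
import math
from scipy.stats import binom, norm

n, p = 20, 0.5                              # np = 10 and np(1 - p) = 5, so the rule of thumb is met
mu, sd = n * p, math.sqrt(n * p * (1 - p))

exact = binom.sf(13, n, p)                  # P(X >= 14) exactly, ~ 0.0577
approx = 1 - norm.cdf((13.5 - mu) / sd)     # normal approximation with continuity correction, ~ 0.059
print(exact, approx)
```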

3.3.3.1 Binomial Test of Proportion for Large Samples

We saw earlier in the binomial examples how to test hypotheses about proportions when the number of trials is small (<20, say). When the number of trials is large, we can use the normal approximation to the binomial to test hypotheses about proportions.

\(\text{If } X \sim \text{Bin}(n, p) \text{ then } X \dot\sim N(np, np(1 - p))\) . Using the \(Z\) - transform,

\[ T = \frac{X - np}{\sqrt{np(1-p)}} \dot\sim N(0, 1) \]

The following example will illustrate how to use this formula to test hypotheses about proportions when the sample size (number of trials) is large.

Forestry Example:

A forester wants to know if more than 40% of the eucalyptus trees in a particular state forest are host to a specific epiphyte. She takes a random sample of 150 trees and finds that 65 do support the specified epiphyte.

  • Research Question
  • What sort of experiment? What sort of data? What feature of the data is of interest?
  • Null Hypothesis
  • Alternative Hypothesis
  • One-tailed or two-tailed test?
  • Sample size?
  • Null Distribution?
  • Test Statistic?
  • Significance Level?
  • Critical Value?
  • Compare test statistic to critical value
  • Conclusion

Although 65/150 = 0.43 is greater than 0.4, this on its own is not enough to say that the true population proportion of epiphyte hosts is greater than 0.4. Remember, we are using this sample to infer things about the wider population of host trees. Of course, in this sample the proportion of hosts is greater than 0.4, but this is only one sample of 150 trees. What if we took another sample of 150 trees from the forest and found that the sample proportion was 0.38? Would we then conclude that the true population proportion was in fact less than 0.4? Whenever we sample we introduce uncertainty. It is this uncertainty we are trying to take into account when we do hypothesis testing.
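
A sketch of how the large-sample calculation could be done, using the approximate \(Z\) statistic defined above (Python and scipy are assumptions; the checklist is meant to be worked through by hand as well):

```python
import math
from scipy.stats import norm

n, p0, x = 150, 0.40, 65   # sample size, hypothesised proportion, observed number of hosts

t = (x - n * p0) / math.sqrt(n * p0 * (1 - p0))   # ~ 0.83
p_value = 1 - norm.cdf(t)                         # one-tailed (upper), ~ 0.20
print(t, p_value)
# The p-value is well above 0.05, so H0 would not be rejected at the usual 5% level.
```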

How many host trees would we need to have found in our sample to feel confident that the actual population proportion is > 40%? That is, how many host trees would we need to have found in our 150 tree sample in order to reject \(H_0\) ?

3.4 Using R Week 3/4

3.4.1 Calculating Binomial and Normal Probabilities in R

3.4.1.1 Binomial Probabilities:

R has several functions available for calculating binomial probabilities. The two most useful are dbinom(x, size = n, p) and pbinom(x, size = n, p) .

When \(X \sim \text{Bin}(n, p)\) :

  • dbinom(x, n, p) calculates \(P(X = x)\) (ie the density function); and
  • pbinom(x, n, p) calculates \(P(X \leq x)\) (ie the cumulative distribution function).

Use the example of two heads out of six tosses of an unbiased coin. We want the probability of getting two or fewer heads: \(Pr(X \leq 2)\) , where \(X \sim \text{Bin}(n=6, p=0.5)\) .

What if we want the probability of finding exactly 2 heads: \(Pr(X = 2)\) ?

If we want the probability of seeing an upper extreme set, for example seeing three or more heads, we can use the subtraction from unity approach as indicated in the examples above:

\[Pr(X \geq 3) = 1 - Pr(X \leq 2)\]

Or, we can do each probability in the set individually and add them up (note, this is only really a good option if you don’t have a large number of trials, or if there are not a lot of probabilities to add up):

\[ Pr(X \geq 3) = Pr(X = 3) + Pr(X = 4) + Pr(X = 5) + Pr(X = 6)\]
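
The accompanying R file is not reproduced in these notes. Purely as a cross-reference for readers working in Python rather than R, roughly equivalent calls (an assumption, using scipy) are:

```python
from scipy.stats import binom

# X ~ Bin(6, 0.5): number of heads in six tosses of an unbiased coin
print(binom.pmf(2, 6, 0.5))       # like dbinom(2, 6, 0.5): P(X = 2)
print(binom.cdf(2, 6, 0.5))       # like pbinom(2, 6, 0.5): P(X <= 2)
print(1 - binom.cdf(2, 6, 0.5))   # P(X >= 3) by subtraction from unity
```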

3.4.1.2 Normal Probabilities:

There are similar functions for calculating Normal probabilities. However, note that \(Pr(X = x)\) is a meaningless quantity when a distribution is not discrete (the Binomial is discrete, the Normal is continuous). Therefore the only function we need for normal probability calculations is pnorm(x, mean, sd) .

When \(X \sim N(\mu, \sigma^2)\) :

  • pnorm(x, mean = \(\mu\) , sd = \(\sigma\) ) calculates \(Pr(X \leq x)\) .

For example, suppose \(X \sim N(0, 1)\) . Find \(Pr(X \leq 1.96)\) .

Find \(Pr(0.5 \leq X \leq 1.96)\)

Find \(Pr(X > 1.96)\)

  • \(Pr(-1.96 \leq X \leq 1.96)\) ; and
  • \(Pr(|X| > 1.96)\) .
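
Again purely as a cross-reference, a rough Python equivalent of pnorm (and its inverse) covering the first few class examples (an assumption, using scipy):

```python
from scipy.stats import norm

# Z ~ N(0, 1)
print(1 - norm.cdf(1.52))                  # P(Z > 1.52)
print(norm.cdf(1.52) - norm.cdf(0))        # P(0 < Z < 1.52)
print(norm.cdf(1.52) - norm.cdf(-1.52))    # P(-1.52 < Z < 1.52)
print(norm.cdf(-1.96))                     # P(Z < -1.96)
print(norm.ppf(0.95))                      # Z_i with P(0 < Z < Z_i) = 0.45, since P(Z < Z_i) = 0.95
```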

3.4.2 Sorting Data Sets and running Functions Separately for Different Categories:

When the data contain variables that are categorical, sometimes we would like to sort the data based on those categories. We can do this in R using the order() function. See the accompanying R file in the week 3/4 lecture notes folder – examples will be shown and discussed in lectures.

We might also sometimes want to run separate analyses/summaries for our datasets based on the categories of the factor (categorical) variables. For example, suppose we wanted to know the mean rainfall of each district from the rainfall data in week 1 lectures:

There are always several ways to do the same thing in R. Another way we could find the mean for each district is to use the tapply function: tapply(rain, district, mean) .

Which you use can often just boil down to a personal preference (eg you might prefer the output from using by over the output from tapply ). As an exercise, try adding the tapply version to the end of the code box above and see which you prefer.

Now that we are finished with the rainfall.dat data frame we should detach it:

More examples will be shown in lectures – please see the accompanying R file in the weeks 3/4 lecture notes folder.


One-sample Kolmogorov-Smirnov test

Description

h = kstest( x ) returns a test decision for the null hypothesis that the data in vector x comes from a standard normal distribution, against the alternative that it does not come from such a distribution, using the one-sample Kolmogorov-Smirnov test . The result h is 1 if the test rejects the null hypothesis at the 5% significance level, or 0 otherwise.

h = kstest( x , Name,Value ) returns a test decision for the one-sample Kolmogorov-Smirnov test with additional options specified by one or more name-value pair arguments. For example, you can test for a distribution other than standard normal, change the significance level, or conduct a one-sided test.

[ h , p ] = kstest( ___ ) also returns the p -value p of the hypothesis test, using any of the input arguments from the previous syntaxes.

[ h , p , ksstat , cv ] = kstest( ___ ) also returns the value of the test statistic ksstat and the approximate critical value cv of the test.


Test for Standard Normal Distribution

Perform the one-sample Kolmogorov-Smirnov test by using kstest . Confirm the test decision by visually comparing the empirical cumulative distribution function (cdf) to the standard normal cdf.

Load the examgrades data set. Create a vector containing the first column of the exam grade data.

Test the null hypothesis that the data comes from a normal distribution with a mean of 75 and a standard deviation of 10. Use these parameters to center and scale each element of the data vector, because kstest tests for a standard normal distribution by default.

The returned value of h = 0 indicates that kstest fails to reject the null hypothesis at the default 5% significance level.

Plot the empirical cdf and the standard normal cdf for a visual comparison.

Figure: Empirical cdf of the centered and scaled exam grades compared with the standard normal cdf.

The figure shows the similarity between the empirical cdf of the centered and scaled data vector and the cdf of the standard normal distribution.

Specify the Hypothesized Distribution Using a Two-Column Matrix

Load the sample data. Create a vector containing the first column of the students’ exam grades data.

Specify the hypothesized distribution as a two-column matrix. Column 1 contains the data vector x . Column 2 contains cdf values evaluated at each value in x for a hypothesized Student’s t distribution with a location parameter of 75, a scale parameter of 10, and one degree of freedom.

Test if the data are from the hypothesized distribution.

The returned value of h = 1 indicates that kstest rejects the null hypothesis at the default 5% significance level.

Specify the Hypothesized Distribution Using a Probability Distribution Object

Create a probability distribution object to test if the data comes from a Student’s t distribution with a location parameter of 75, a scale parameter of 10, and one degree of freedom.

Test the null hypothesis that the data comes from the hypothesized distribution.

Test the Hypothesis at Different Significance Levels

Load the sample data. Create a vector containing the first column of the students’ exam grades.

Test the null hypothesis that data comes from the hypothesized distribution at the 1% significance level.

The returned value of h = 1 indicates that kstest rejects the null hypothesis at the 1% significance level.

Conduct a One-Sided Hypothesis Test

Load the sample data. Create a vector containing the third column of the stock return data matrix.

Test the null hypothesis that the data comes from a standard normal distribution, against the alternative hypothesis that the population cdf of the data is larger than the standard normal cdf.

The returned value of h = 1 indicates that kstest rejects the null hypothesis in favor of the alternative hypothesis at the default 5% significance level.

Figure: Empirical cdf of the stock return data compared with the standard normal cdf.

The plot shows the difference between the empirical cdf of the data vector x and the cdf of the standard normal distribution.

Input Arguments

X — sample data vector.

Sample data, specified as a vector.

Data Types: single | double

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN , where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Tail','larger','Alpha',0.01 specifies a test using the alternative hypothesis that the cdf of the population from which the sample data is drawn is greater than the cdf of the hypothesized distribution, conducted at the 1% significance level.

Alpha — Significance level 0.05 (default) | scalar value in the range (0,1)

Significance level of the hypothesis test, specified as the comma-separated pair consisting of 'Alpha' and a scalar value in the range (0,1).

Example: 'Alpha',0.01

CDF — cdf of hypothesized continuous distribution matrix | probability distribution object

cdf of hypothesized continuous distribution, specified as the comma-separated pair consisting of 'CDF' and either a two-column matrix or a continuous probability distribution object. When CDF is a matrix, column 1 contains a set of possible x values, and column 2 contains the corresponding hypothesized cumulative distribution function values G(x). The calculation is most efficient if CDF is specified such that column 1 contains the values in the data vector x . If there are values in x not found in column 1 of CDF , kstest approximates G(x) by interpolation. All values in x must lie in the interval between the smallest and largest values in the first column of CDF . By default, kstest tests for a standard normal distribution.

The one-sample Kolmogorov-Smirnov test is only valid for continuous cumulative distribution functions, and requires CDF to be predetermined. The result is not accurate if CDF is estimated from the data. To test x against the normal, lognormal, extreme value, Weibull, or exponential distribution without specifying distribution parameters, use lillietest instead.

Tail — Type of alternative hypothesis 'unequal' (default) | 'larger' | 'smaller'

Type of alternative hypothesis to evaluate, specified as the comma-separated pair consisting of 'Tail' and one of the following.

If the values in the data vector x tend to be larger than expected from the hypothesized distribution, the empirical distribution function of x tends to be smaller, and vice versa.

Example: 'Tail','larger'

Output Arguments

H — hypothesis test result 1 | 0.

Hypothesis test result, returned as a logical value.

If h = 1 , this indicates the rejection of the null hypothesis at the Alpha significance level.

If h = 0 , this indicates a failure to reject the null hypothesis at the Alpha significance level.

p — p -value scalar value in the range [0,1]

p -value of the test, returned as a scalar value in the range [0,1]. p is the probability of observing a test statistic as extreme as, or more extreme than, the observed value under the null hypothesis. Small values of p cast doubt on the validity of the null hypothesis.

ksstat — Test statistic nonnegative scalar value

Test statistic of the hypothesis test, returned as a nonnegative scalar value.

cv — Critical value nonnegative scalar value

Critical value, returned as a nonnegative scalar value.

One-Sample Kolmogorov-Smirnov Test

The one-sample Kolmogorov-Smirnov test is a nonparametric test of the null hypothesis that the population cdf of the data is equal to the hypothesized cdf.

The two-sided test for “unequal” cdf functions tests the null hypothesis against the alternative that the population cdf of the data is not equal to the hypothesized cdf. The test statistic is the maximum absolute difference between the empirical cdf calculated from x and the hypothesized cdf:

\[ D^* = \max_x \left( \left| \hat{F}(x) - G(x) \right| \right), \]

where \(\hat{F}(x)\) is the empirical cdf and \(G(x)\) is the cdf of the hypothesized distribution.

The one-sided test for a “larger” cdf function tests the null hypothesis against the alternative that the population cdf of the data is greater than the hypothesized cdf. The test statistic is the maximum amount by which the empirical cdf calculated from x exceeds the hypothesized cdf:

\[ D^* = \max_x \left( \hat{F}(x) - G(x) \right). \]

The one-sided test for a “smaller” cdf function tests the null hypothesis against the alternative that the population cdf of the data is less than the hypothesized cdf. The test statistic is the maximum amount by which the hypothesized cdf exceeds the empirical cdf calculated from x :

\[ D^* = \max_x \left( G(x) - \hat{F}(x) \right). \]
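
The same statistic and test are available outside MATLAB; for instance, a minimal sketch with Python's scipy.stats.kstest (an assumption, shown only to illustrate the statistic defined above; one-sided alternatives are selected with its alternative argument):

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(0)
x = rng.normal(size=100)            # sample to test against the standard normal

# Two-sided one-sample KS test against the standard normal cdf
stat, p_value = kstest(x, 'norm')
print(stat, p_value)

# The statistic is (essentially) the largest gap between the empirical cdf and G(x);
# the exact statistic also checks the value just before each jump of the empirical cdf.
xs = np.sort(x)
ecdf = np.arange(1, len(xs) + 1) / len(xs)
print(np.max(np.abs(ecdf - norm.cdf(xs))))
```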

kstest computes the critical value cv using an approximate formula or by interpolation in a table. The formula and table cover the range 0.01 ≤ alpha ≤ 0.2 for two-sided tests and 0.005 ≤ alpha ≤ 0.1 for one-sided tests. cv is returned as NaN if alpha is outside this range.

kstest decides to reject the null hypothesis by comparing the p -value p with the significance level Alpha , not by comparing the test statistic ksstat with the critical value cv . Since cv is approximate, comparing ksstat with cv occasionally leads to a different conclusion than comparing p with Alpha .





3.2: Probability Mass Functions (PMFs) and Cumulative Distribution Functions (CDFs) for Discrete Random Variables


Since random variables simply assign values to outcomes in a sample space and we have defined probability measures on sample spaces, we can also talk about probabilities for random variables. Specifically, we can compute the probability that a discrete random variable equals a specific value ( probability mass function ) and the probability that a random variable is less than or equal to a specific value ( cumulative distribution function ).

Probability Mass Functions (PMFs)

In the following example, we compute the probability that a discrete random variable equals a specific value.

Example 3.2.1

Continuing in the context of Example 3.1.1 , we compute the probability that the random variable \(X\) equals \(1\). There are two outcomes that lead to \(X\) taking the value 1, namely \(ht\) and \(th\). So, the probability that \(X=1\) is given by the probability of the event \(\{ht, th\}\), which is \(0.5\):

$$P(X=1) = P(\{ht, th\}) = \frac{\text{# outcomes in}\ \{ht, th\}}{\text{# outcomes in}\ S} = \frac{2}{4} = 0.5\notag$$

In Example 3.2.1, the probability that the random variable \(X\) equals 1, \(P(X=1)\), is referred to as the  probability mass function  of \(X\) evaluated at 1. In other words, the specific value 1 of the random variable \(X\) is associated with the probability that \(X\) equals that value, which we found to be 0.5. The process of assigning probabilities to specific values of a discrete random variable is what the probability mass function is and the following definition formalizes this.

Definition 3.2.1

The probability mass function ( pmf )  (or frequency function ) of a discrete random variable \(X\) assigns probabilities to the possible values of the random variable. More specifically, if \(x_1, x_2, \ldots\) denote the possible values of a random variable \(X\), then the probability mass function is denoted as \(p\) and we write $$p(x_i) = P(X=x_i) = P(\underbrace{\{s\in S\ |\ X(s) = x_i\}}_{\text{set of outcomes resulting in}\ X=x_i}).\label{pmf}$$

Note that, in Equation \ref{pmf}, \(p(x_i)\) is  shorthand  for \(P(X = x_i)\), which represents the probability of the event that the random variable \(X\) equals \(x_i\).

As we can see in Definition 3.2.1, the probability mass function of a random variable \(X\) depends on the probability measure of the underlying sample space \(S\). Thus, pmf's inherit some properties from the axioms of probability ( Definition 1.2.1 ). In fact, in order for a function to be a valid pmf it must satisfy the following properties.

Properties of Probability Mass Functions

Let \(X\) be a discrete random variable with possible values denoted \(x_1, x_2, \ldots, x_i, \ldots\). The probability mass function of \(X\), denoted \(p\), must satisfy the following:

  • \(\displaystyle{\sum_{x_i} p(x_i)} = p(x_1) + p(x_2) + \cdots = 1\)
  • \(p(x_i) \geq 0\), for all \(x_i\)

Furthermore, if \(A\) is a subset of the possible values of \(X\), then the probability that \(X\) takes a value in \(A\) is given by

$$P(X\in A) = \sum_{x_i\in A} p(x_i).\label{3rdprop}$$

Note that the first property of pmf's stated above follows from the first axiom of probability, namely that the probability of the sample space equals \(1\): \(P(S) = 1\). The second property of pmf's follows from the second axiom of probability, which states that all probabilities are non-negative.

We now apply the formal definition of a pmf and verify the properties in a specific context.

Example 3.2.2

Returning to Example 3.2.1 , now using the notation of Definition 3.2.1 , we found that the pmf for \(X\) at \(1\) is given by $$p(1) = P(X=1) = P(\{ht, th\}) = 0.5.\notag$$ Similarly, we find the pmf for \(X\) at the other possible values of the random variable: \begin{align*} p(0) &= P(X=0) = P(\{tt\}) = 0.25 \\ p(2) &= P(X=2) = P(\{hh\}) = 0.25 \end{align*} Note that all the values of \(p\) are positive (second property of pmf's) and \(p(0) + p(1) + p(2) = 1\) (first property of pmf's). Also, we can demonstrate the third property of pmf's (Equation \ref{3rdprop}) by computing the probability that there is at least one heads, i.e., \(X\geq 1\), which we could represent by setting \(A = \{1,2\}\) so that we want the probability that \(X\) takes a value in \(A\):

$$P(X\geq1) = P(X\in A) = \sum_{x_i\in A}p(x_i) = p(1) + p(2) = 0.5 + 0.25 = 0.75\notag$$
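
The three pmf properties are easy to check mechanically; a tiny Python sketch (an illustration only, not part of the original example):

```python
# pmf of X = number of heads in two tosses of a fair coin
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

print(sum(pmf.values()))                        # first property: the probabilities sum to 1
print(all(prob >= 0 for prob in pmf.values()))  # second property: all probabilities are non-negative

A = {1, 2}                                      # the event "at least one head"
print(sum(pmf[x] for x in A))                   # third property: P(X in A) = 0.75
```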

We can represent probability mass functions numerically with a table, graphically with a histogram , or analytically with a formula. The following example demonstrates the numerical and graphical representations. In the next three sections, we will see examples of pmf's defined analytically with a formula.

Example 3.2.3

We represent the pmf we found in Example 3.2.2 in two ways below, numerically with a table on the left and graphically with a histogram on the right.

Figure 1: Table and histogram representation of the pmf from Example 3.2.2

In the histogram in Figure 1, note that we represent probabilities as areas of  rectangles . More specifically, each rectangle in the histogram has width \(1\) and height equal to the probability of the value of the random variable \(X\) that the rectangle is centered over. For example, the leftmost rectangle in the histogram is centered at \(0\) and has height equal to \(p(0) = 0.25\), which is also the area of the rectangle since the width is equal to \(1\). In this way, histograms provide a visualization of the  distribution  of the probabilities assigned to the possible values of the random variable \(X\). This helps to explain where the common terminology of "probability distribution" comes from when talking about random variables.

Cumulative Distribution Functions (CDFs)

There is one more important function related to random variables that we define next. This function is again related to the probabilities of the random variable equalling specific values. It provides a shortcut for calculating many probabilities at once.

Definition 3.2.2

The cumulative distribution function ( cdf ) of a random variable \(X\) is a function on the real numbers that is denoted as \(F\) and is given by $$F(x) = P(X\leq x),\quad \text{for any}\ x\in\mathbb{R}. \label{cdf}$$

Before looking at an example of a cdf, we note a few things about the definition.

First of all, note that we did not specify the random variable \(X\) to be discrete. CDFs are also defined for continuous random variables (see Chapter 4 ) in exactly the same way.

Second, the cdf of a random variable is defined for all real numbers, unlike the pmf of a discrete random variable, which we only define for the possible values of the random variable. Implicit in the definition of a pmf is the assumption that it equals 0 for all real numbers that are not possible values of the discrete random variable, which should make sense since the random variable will never equal that value. However, cdf's, for both discrete and continuous random variables, are defined for all real numbers. In looking more closely at Equation \ref{cdf}, we see that a cdf \(F\) considers an upper bound, \(x\in\mathbb{R}\), on the random variable \(X\), and assigns that value \(x\) to the probability that the random variable \(X\) is less than or equal to that upper bound \(x\). This type of probability is referred to as a  cumulative probability , since it could be thought of as the probability accumulated by the random variable up to the specified upper bound. With this interpretation, we can represent Equation \ref{cdf} as follows:

$$F: \underbrace{\mathbb{R}}_{\text{upper bounds on RV}\ X} \longrightarrow \underbrace{\mathbb{R}}_{\text{cumulative probabilities}}\label{function}$$

In the case that \(X\) is a discrete random variable, with possible values denoted \(x_1, x_2, \ldots, x_i, \ldots\), the cdf of \(X\) can be calculated using the third property of pmf's (Equation \ref{3rdprop}), since, for a fixed \(x\in\mathbb{R}\), if we let the set \(A\) contain the possible values of \(X\) that are less than or equal to \(x\), i.e., \(A = \{x_i\ |\ x_i\leq x\}\), then the cdf of \(X\) evaluated at \(x\) is given by

$$F(x) = P(X\leq x) = P(X\in A) = \sum_{x_i\leq x} p(x_i).\notag$$

Example 3.2.4

Continuing with Examples 3.2.2 and 3.2.3 , we find the cdf for \(X\). First, we find \(F(x)\) for the possible values of the random variable, \(x=0,1,2\): \begin{align*} F(0) &= P(X\leq0) = P(X=0) = 0.25 \\ F(1) &= P(X\leq1) = P(X=0\ \text{or}\ 1) = p(0) + p(1) = 0.75 \\ F(2) &= P(X\leq2) = P(X=0\ \text{or}\ 1\ \text{or}\ 2) = p(0) + p(1) + p(2) = 1 \end{align*} Now, if \(x<0\), then the cdf \(F(x) = 0\), since the random variable \(X\) will never be negative.

If \(0<x<1\), then the cdf \(F(x) = 0.25\), since the only value of the random variable \(X\) that is less than or equal to such a value \(x\) is \(0\). For example, consider \(x=0.5\). The probability that \(X\) is less than or equal to \(0.5\) is the same as the probability that \(X=0\), since \(0\) is the only possible value of \(X\) less than \(0.5\):

$$F(0.5) = P(X\leq0.5) = P(X=0) = 0.25.\notag$$

Similarly, we have the following: \begin{align*} F(x) &= F(1) = 0.75,\quad\text{for}\ 1<x<2 \\ F(x) &= F(2) = 1,\quad\text{for}\ x>2 \end{align*}

Exercise 3.2.1

For this random variable \(X\), compute the following values of the cdf:

  • \(F(-3) = P(X\leq -3) = 0\)
  • \(F(0.1) = P(X\leq 0.1) = P(X=0) = 0.25\)
  • \(F(0.9)= P(X\leq 0.9) = P(X=0) = 0.25\)
  • \(F(1.4) = P(X\leq 1.4) = \displaystyle{\sum_{x_i\leq1.4}}p(x_i) = p(0) + p(1) = 0.25 + 0.5 = 0.75\)
  • \(F(2.3) = P(X\leq 2.3) = \displaystyle{\sum_{x_i\leq2.3}}p(x_i) = p(0) + p(1) + p(2) = 0.25 + 0.5 + 0.25 = 1\)
  • \(F(18) = P(X\leq18) = P(X\leq 2) = 1\)

To summarize Example 3.2.4, we write the cdf \(F\) as a piecewise function and Figure 2 gives its graph: $$F(x) = \left\{\begin{array}{l l} 0, & \text{for}\ x<0 \\ 0.25 & \text{for}\ 0\leq x <1 \\ 0.75 & \text{for}\ 1\leq x <2 \\ 1 & \text{for}\ x\geq 2. \end{array}\right.\notag$$


Figure 2: Graph of cdf in Example 3.2.4

Note that the cdf we found in Example 3.2.4 is a "step function", since its graph resembles a series of steps. This is the case for all discrete random variables. Additionally, the value of the cdf for a discrete random variable will always "jump" at the possible values of the random variable, and the size of the "jump" is given by the value of the pmf at that possible value of the random variable. For example, the graph in Figure 2 "jumps" from \(0.25\) to \(0.75\) at \(x=1\), so the size of the "jump" is \(0.75-0.25= 0.5\) and note that \(p(1) = P(X=1) = 0.5\). The pmf for any discrete random variable can be obtained from the cdf in this manner.
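
Since the cdf of a discrete random variable is just a running total of the pmf, it can be written as a small function. A Python sketch (an assumption; the example above works by hand) that reproduces the values of Example 3.2.4 and Exercise 3.2.1:

```python
pmf = {0: 0.25, 1: 0.5, 2: 0.25}   # pmf from Example 3.2.2

def F(x):
    """cdf of the discrete random variable: F(x) = P(X <= x)."""
    return sum(prob for value, prob in pmf.items() if value <= x)

for x in [-3, 0.1, 0.9, 1.4, 2.3, 18]:
    print(x, F(x))   # 0, 0.25, 0.25, 0.75, 1, 1
```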

We end this section with a statement of the properties of cdf's.  The reader is encouraged to verify these properties hold for the cdf derived in Example 3.2.4 and to provide an intuitive explanation (or formal explanation using the axioms of probability and the properties of pmf's) for why these properties hold for cdf's in general.

Properties of Cumulative Distribution Functions

Let \(X\) be a random variable with cdf \(F\). Then \(F\) satisfies the following:

  • \(F\) is non-decreasing, i.e., \(F\) may be constant, but otherwise it is increasing.
  • \(\displaystyle{\lim_{x\to-\infty} F(x) = 0}\) and \(\displaystyle{\lim_{x\to\infty} F(x) = 1}\)


An In-Depth Explanation of Cumulative Distribution Function


An essential part of statistics is the cumulative distribution function which helps you find the probability for a random variable in a specific range. This tutorial will teach you the basics of the cumulative distribution function and how to implement it in Python .

What Is the Cumulative Distribution Function?

The cumulative distribution function is used to describe the probability distribution of random variables. It can be used to describe the probability for a discrete, continuous or mixed variable. It is obtained by summing up (for a continuous variable, integrating) the probability density function to get the cumulative probability for a random variable.

The Probability Density Function is a function that gives us the probability distribution of a random variable at any value of it. To get the probability density at a point, you only have to evaluate the probability density function at that point.

The cumulative distribution function of a random variable X calculated at a point x is represented as F_X(x). It is the probability that the random variable X will take a value less than or equal to x.


Consider the diagram shown below. The diagram shows the probability density function f(x), which gives us a rectangle between the points (a, b) when plotted. f(x) has a value of 1/(b-a).                                    


 Figure 1: Probability Density Function

Now consider a point c on the x-axis. This is the point you need to find the cumulative distribution function at. According to the definition, you need to find the total probability density function up to point c. This means that you have to find the area of the rectangle between points a and c.                       


Figure 2: Calculating the CDF

You can do this by multiplying the length and breadth of the rectangle. The breadth is the distance between a and c obtained by subtracting them, and the length is the probability density function. In the end, you get the CDF as:


Figure 3: CDF

Since the cumulative distribution function is the total probability density function up to a certain point x, it can be represented as the probability that the random variable X is less than or equal to x.


Figure 4: CDF representation

As you need to get the total PDF sum between two points, you can also represent the CDF as the integration of PDF between the points it has been calculated at. The formula depicted below shows the cumulative distribution function calculated between points (a, b) for the PDF Fx(x).


      Figure 5: CDF as the integration of PDF
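
For the uniform density in Figure 1, the length-times-breadth argument gives F(c) = (c - a)/(b - a) for a point c between a and b. A tiny sketch checking this against a library implementation (the numbers and the use of scipy are assumptions):

```python
from scipy.stats import uniform

a, b, c = 2.0, 6.0, 3.5

manual = (c - a) / (b - a)                     # area of the rectangle between a and c
builtin = uniform(loc=a, scale=b - a).cdf(c)   # scipy parameterises the uniform by loc and scale = b - a
print(manual, builtin)                         # both 0.375
```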

Understanding the Cumulative Distribution Function With the IRIS Dataset

In this case study, you will be looking at the Iris dataset, which contains information on the sepal length, sepal width, petal length, and petal width of three different species of Iris:

  • Iris Setosa
  • Iris Versicolor 
  • Iris Virginica


Figure 6: Iris Dataset

All the values are in centimeters. The dataset contains 50 data points on each of the different species. You need to find a reliable measure using which we can differentiate the different species from each other.

Now, plot each feature of our dataset as a histogram. You plot the features with different colors for each flower to see how they overlap with each other. This is a way of finding the PDF of the data.


Figure 7: Iris Dataset PDF

The above graphs are as follows from top left to bottom right:

  • Sepal_length: In this graph, we can see that the sepal lengths of all three species have considerable overlap. Hence it becomes tough to set parameters or ranges which you can use to differentiate our flowers.
  • Sepal_width: This graph has even more overlap. It is also not a feature that you can use to differentiate between our flowers correctly.
  • Petal_length: This graph has way less overlap than the other two. The boundaries for Setosa have no overlap with any other species, and Versicolor and Virginica have a slight overlap. You can easily find the different ranges that most of the petal lengths fall into for different species.
  • Petal_width: This graph has significantly less overlap than the sepal parameters, but you can see a little bit of overlap between Setosa, Versicolor and Virginica.

Hence, you can conclude that the petal length is the best parameter for differentiating between iris species.

Next, cumulatively sum the PDF and plot the resulting graph to see the CDF for our iris data. (A short code sketch of this calculation is given after the observations below.)


 Figure 8: Iris Dataset PDF and CDF

From the above graph, you can notice three things:

  • Petal Length < 1.9 is most definitely ‘Setosa’. You can say this as the petal length for setosa falls in this range and does not coincide with the petal length of Versicolor or Virginica.
  • 3.2 < Petal length < 5 has a 95% chance of being Versicolor. Versicolor and Virginica have a slight overlap between 4.7 and 5. Hence, a machine learning model may mistake the two, but the chances of that happening are very low.
  • Petal length > 5 has a 90% chance of being ‘Virginica’. Again, there is a slight overlap between Versicolor and Virginica in the 5-5.2 cm region, which is the cause of the slight error.

Implementing the Cumulative Distribution Function With Python

Now, see how you can implement the cumulative distribution function in Python. Let's start by importing the necessary libraries.

Figure 9: Importing necessary modules
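The code screenshot is not reproduced here; a minimal sketch of the imports, assuming the usual NumPy/pandas/Matplotlib/seaborn stack, could look like this:

```python
# Core libraries used throughout the rest of this implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```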

Next, read in our iris dataset.

Figure 10: Importing Iris dataset
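The original screenshot presumably reads a local iris CSV; as a stand-in, the sketch below loads seaborn's example copy of the dataset, whose columns are sepal_length, sepal_width, petal_length, petal_width and species:

```python
# Load the Iris data. A local file would work just as well,
# e.g. df = pd.read_csv("iris.csv")
df = sns.load_dataset("iris")
print(df.head())
```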

You can find the mean and median of the data and see how they differ according to species.

Figure 11: Finding mean and median
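A sketch of this step, assuming the data frame from the previous snippet:

```python
# Mean and median of every feature, computed separately for each species
print(df.groupby("species").mean())
print(df.groupby("species").median())
```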

As you can see, for each species the mean and median of every feature are close to each other. This suggests that, within each species, the features are roughly symmetrically distributed, without heavy skew or extreme values pulling the mean away from the median.

Now, find the standard deviation for the data.

Figure 12: Finding standard deviation
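A corresponding sketch for the spread of the data:

```python
# Standard deviation of every feature within each species
print(df.groupby("species").std())
```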

The standard deviation of each feature within each species is small, which means the data is not spread out much and there is little variation. Outliers are also few and far between.

Now, plot a violin plot to see how the different features compare to each other.

Figure 13: Plotting violin plots
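The plotting code is not shown in the original; one way to produce a similar grid of violin plots with seaborn is:

```python
# One violin plot per feature, grouped by species
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, feature in zip(axes.ravel(), features):
    sns.violinplot(data=df, x="species", y=feature, ax=ax)
plt.tight_layout()
plt.show()
```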

Figure 14: Violin plots

Violin plots show the range of values of each feature and use the width of the violin to show how spread out the data is. The above graphs show that petal length has the narrowest violins and hence the fewest outliers. Its per-species ranges also intersect the least, as can be seen by comparing the extents of the violins.

If you were to choose two features to compare the flowers on, which ones would they be? This can be found by plotting pair plots of your data. Pair plots plot each feature against every other feature as a scatterplot, so you can see which pair separates the species best.

Figure 15: Plotting pairplots
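A one-line sketch of the pair plot, assuming the same data frame as above:

```python
# Scatter plot of every pair of features, coloured by species;
# the diagonal shows each feature's own distribution
sns.pairplot(df, hue="species", height=2)
plt.show()
```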

You get the following plots on running the above code:

Figure 16: Iris Pair plots

In the grid above, the variable named at the left of each row is the y-axis and the variable named below each column is the x-axis. On the diagonal, where a feature would be plotted against itself, the pair plot instead shows that feature's own distribution (an estimate of its PDF). From these graphs, petal length and petal width look like the best pair of features to use together: their scatter plot has the least overlap between species along both axes, so you can identify fairly distinct ranges for the different flowers.

Now, plot the PDF along with the histogram for the iris data.                     

Figure 17: Plotting PDF
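The original likely uses seaborn's distribution plots; a sketch that overlays a histogram and a smoothed density estimate for each feature and species is:

```python
# Histogram plus kernel density estimate for each feature, per species
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, feature in zip(axes.ravel(), features):
    sns.histplot(data=df, x=feature, hue="species", kde=True,
                 stat="density", common_norm=False, ax=ax)
plt.tight_layout()
plt.show()
```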

Figure 18: PDF plots of Iris dataset

The above plots show that the petal_length feature has the least overlap between the different iris species. The bell-shaped curves you see for the features resemble the normal distribution. The bell curve for petal_length is also smoother than the bell curve for petal_width. All in all, petal length is the best feature for classifying the data.

Now, using the above data, plot the cumulative distribution function. You first split the data into three sets, depending on the flower species.                   

Figure 19: Separating our datasets
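A sketch of the split, using the species labels from seaborn's copy of the data:

```python
# One subset of the data frame per species
setosa = df[df["species"] == "setosa"]
versicolor = df[df["species"] == "versicolor"]
virginica = df[df["species"] == "virginica"]
```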

You then plot the PDF and CDF together on the same graph. To get the PDF, you count the number of data points in each histogram bin and divide each count by the total number of data points. The CDF is then the cumulative sum of the PDF: to get the cumulative distribution function at every bin, you add up the PDF of that bin and all previous bins.

Figure 20: Plotting PDF and CDF of Iris Dataset
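A sketch of this histogram-based computation for petal_length (the choice of 10 bins is arbitrary):

```python
# PDF and CDF of petal_length for each species, from histogram counts
plt.figure(figsize=(8, 5))
for name, subset in df.groupby("species"):
    counts, bin_edges = np.histogram(subset["petal_length"], bins=10)
    pdf = counts / counts.sum()      # share of the species' points in each bin
    cdf = np.cumsum(pdf)             # running total of the PDF
    centers = (bin_edges[1:] + bin_edges[:-1]) / 2
    plt.plot(centers, pdf, linestyle="--", label=f"{name} PDF")
    plt.plot(centers, cdf, label=f"{name} CDF")
plt.xlabel("petal_length (cm)")
plt.legend()
plt.show()
```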

On running the above code, you get the graph shown below. You can see that the Setosa species has its own range of petal_length values, all just below 2 cm. For Virginica and Versicolor there is a bit of overlap, but most flowers of these two species still fall in their own ranges, so the accuracy of prediction is not compromised much.

Figure 21: PDF and CDF of Iris Dataset

From the above diagram, you can say that different petal_length ranges can be determined for different species of Irises. The ranges are:

  • Iris Setosa has petal lengths < 2cm.
  • A flower has a 95% chance of being Iris Versicolor if (3.2 < petal_length < 5).
  • A flower has a 90% chance of being Iris Virginica if its petal length is more than 5.

Conclusion 

In this tutorial on cumulative distribution function, you first understood the concept of CDF and how to calculate it using PDF. You then did a case study on the iris dataset and found ranges to differentiate between different iris species based on their petal lengths. Finally, you used Python to implement the case study and derive its results. 

We hope this article helped you understand how to find the cumulative distribution function of a random variable.

Happy learning!


