
Linear regression hypothesis testing: Concepts, Examples


In machine learning, linear regression is a predictive modeling technique for building models that predict a continuous response variable as a function of a linear combination of explanatory or predictor variables. When training linear regression models, we rely on hypothesis testing to determine the relationship between the response and predictor variables. Two types of hypothesis tests are done in the case of the linear regression model: T-tests and F-tests. In other words, two types of statistics are used to assess whether a linear regression model relating the response and predictor variables exists: t-statistics and f-statistics. As data scientists, it is of utmost importance to determine whether linear regression is the correct choice of model for a particular problem, and this can be done by performing hypothesis testing related to the linear regression response and predictor variables. These concepts are often unclear even to experienced data scientists. In this blog post, we will discuss linear regression and hypothesis testing related to t-statistics and f-statistics. We will also provide an example to help illustrate how these concepts work.


What are linear regression models?

A linear regression model can be defined as the function approximation that represents a continuous response variable as a function of one or more predictor variables. While building a linear regression model, the goal is to identify a linear equation that best predicts or models the relationship between the response or dependent variable and one or more predictor or independent variables.

There are two different kinds of linear regression models. They are as follows:

  • Simple or univariate linear regression models : These are linear regression models used to build a linear relationship between one response or dependent variable and one predictor or independent variable. The equation that represents a simple linear regression model is Y = mX + b, where m is the coefficient of the predictor variable and b is the bias. When considering the linear regression line, m represents the slope and b represents the intercept.
  • Multiple or multivariate linear regression models : These are linear regression models used to build a linear relationship between one response or dependent variable and more than one predictor or independent variable. The equation that represents a multiple linear regression model is Y = b0 + b1X1 + b2X2 + … + bnXn, where bi represents the coefficient of the ith predictor variable. In this type of linear regression model, each predictor variable has its own coefficient that is used to calculate the predicted value of the response variable. A sketch fitting both kinds of model appears below.
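As a quick illustration, both kinds of model can be fit in R with lm(); the sketch below uses R's built-in mtcars data purely as a stand-in for any response/predictor pair.

# Simple (univariate) model: Y = mX + b
simple.lm <- lm(mpg ~ wt, data = mtcars)
coef(simple.lm)        # b (intercept) and m (slope)

# Multiple (multivariate) model: Y = b0 + b1*X1 + b2*X2
multiple.lm <- lm(mpg ~ wt + hp, data = mtcars)
coef(multiple.lm)      # b0, b1, b2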

While training linear regression models, the requirement is to determine the coefficients which result in the best-fitted linear regression line. The learning algorithm used to find the most appropriate coefficients is known as least squares regression . In the least-squares regression method, the coefficients are calculated using the least-squares error function. The main objective of this method is to minimize the sum of squared residuals between the actual and predicted response values. The sum of squared residuals is also called the residual sum of squares (RSS). The outcome of executing the least-squares regression method is the set of coefficients that minimizes the linear regression cost function .

The residual of the ith observation, [latex]e_i[/latex], is the difference between [latex]Y_i[/latex], the observed value of the response variable for the ith observation, and [latex]\hat{Y_i}[/latex], the predicted value for the ith observation:

[latex]e_i = Y_i - \hat{Y_i}[/latex]

The residual sum of squares can be represented as the following:

[latex]RSS = e_1^2 + e_2^2 + e_3^2 + … + e_n^2[/latex]

The least-squares method represents the algorithm that minimizes the above term, RSS.
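For instance, the RSS of any fitted model can be computed directly from its residuals; a minimal sketch in R, using the built-in cars data as a stand-in:

fit <- lm(dist ~ speed, data = cars)
rss <- sum(residuals(fit)^2)   # RSS = e1^2 + e2^2 + ... + en^2
rss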

Once the coefficients are determined, can it be claimed that these coefficients are the most appropriate ones for linear regression? The answer is no. After all, the coefficients are only the estimates and thus, there will be standard errors associated with each of the coefficients.  Recall that the standard error is used to calculate the confidence interval in which the mean value of the population parameter would exist. In other words, it represents the error of estimating a population parameter based on the sample data. The value of the standard error is calculated as the standard deviation of the sample divided by the square root of the sample size. The formula below represents the standard error of a mean.

[latex]SE(\mu) = \frac{\sigma}{\sqrt{N}}[/latex]
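In R, this is simply the sample standard deviation divided by the square root of the sample size; the values below are made up for illustration:

x <- c(4.2, 5.1, 3.8, 6.0, 5.5)       # hypothetical sample
se.mean <- sd(x) / sqrt(length(x))    # SE = sigma / sqrt(N)
se.mean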

Thus, without analyzing aspects such as the standard error associated with each coefficient, it cannot be claimed that the linear regression coefficients are the most suitable ones. This is where hypothesis testing is needed. Before we get into why we need hypothesis testing with the linear regression model, let's briefly review what hypothesis testing is.

Train a Multiple Linear Regression Model using R

Before getting into the hypothesis testing concepts in relation to the linear regression model, let's train a multivariate or multiple linear regression model and print the summary output of the model, which will be referred to in the next section.

The data used for creating the multiple linear regression model is BostonHousing, which can be loaded in RStudio by installing the mlbench package. The code is shown below:

install.packages("mlbench")
library(mlbench)
data("BostonHousing")

Once the data is loaded, the code shown below can be used to create the linear regression model.

attach(BostonHousing)
BostonHousing.lm <- lm(log(medv) ~ crim + chas + rad + lstat)
summary(BostonHousing.lm)

Executing the above commands will result in the creation of a linear regression model with the response variable log(medv) and predictor variables crim, chas, rad, and lstat. The following represents the details of the response and predictor variables:

  • log(medv) : Log of the median value of owner-occupied homes in USD 1000’s
  • crim : Per capita crime rate by town
  • chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • rad : Index of accessibility to radial highways
  • lstat : Percentage of the lower status of the population

The following is the output of the summary command, which prints the details of the model including hypothesis testing details for the coefficients (t-statistics) and for the model as a whole (f-statistics).

[Figure: linear regression model summary output in R]

Hypothesis tests & Linear Regression Models

Hypothesis tests are statistical procedures used to test a claim or assumption about the underlying distribution of a population based on sample data. Here are the key steps of doing hypothesis tests with linear regression models:

  • Hypothesis formulation for T-tests: In the case of linear regression, the claim is made that there exists a relationship between the response and predictor variables, and the claim is represented using non-zero values of the coefficients of the predictor variables in the linear equation or regression model. This is formulated as the alternate hypothesis. Thus, the null hypothesis states that there is no relationship between the response and a given predictor variable, i.e., the coefficient of that predictor variable is equal to zero (0). So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis for each test states that a1 = 0, a2 = 0, a3 = 0, etc. For each predictor variable, an individual hypothesis test is done to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. Thus, if there are, say, 5 features, there will be five hypothesis tests, each with its own null and alternate hypothesis.
  • Hypothesis formulation for F-test : In addition, there is a hypothesis test done around the claim that there is a linear regression model representing the response variable and all the predictor variables. The null hypothesis is that the linear regression model does not exist . This essentially means that the value of all the coefficients is equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0.
  • F-statistics for testing hypothesis for linear regression model : The F-test is used to test the null hypothesis that a linear regression model does not exist, representing the relationship between the response variable y and the predictor variables x1, x2, x3, x4 and x5. The null hypothesis can also be stated as: the coefficients of x1, x2, x3, x4 and x5 are all equal to zero. The f-statistic is calculated as a function of the sum of squared residuals for the restricted regression (a linear regression model with only the intercept or bias, and all coefficient values set to zero) and the sum of squared residuals for the unrestricted regression (the full linear regression model). In the summary output above, note the value of the f-statistic, 15.66, with degrees of freedom 5 and 194.
  • Evaluate t-statistics against the critical value/region : After calculating the t-statistic for each coefficient, it is time to decide whether to reject or fail to reject the null hypothesis. For this decision to be made, one needs to set a significance level, also known as the alpha level. A significance level of 0.05 is usually set for rejecting (or not rejecting) the null hypothesis. If the value of the t-statistic falls in the critical region, the null hypothesis is rejected. Equivalently, if the p-value comes out to be less than 0.05, the null hypothesis is rejected.
  • Evaluate f-statistics against the critical value/region : The value of the f-statistic and its p-value are evaluated for testing the null hypothesis that the linear regression model representing the response and predictor variables does not exist. If the value of the f-statistic is more than the critical value at the 0.05 level of significance, the null hypothesis is rejected. This means that a linear model exists with at least one non-zero coefficient.
  • Draw conclusions : The final step of hypothesis testing is to draw a conclusion by interpreting the results in terms of the original claim or hypothesis. If the null hypothesis for a predictor variable is rejected, the relationship between the response and that predictor variable is statistically significant based on the evidence, i.e., the sample data used for training the model. Similarly, if the f-statistic falls in the critical region and its p-value is less than the alpha value, usually set at 0.05, one can say that there exists a linear regression model.
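These quantities can be read programmatically from the model trained earlier. A minimal sketch, assuming the BostonHousing.lm object from the previous section:

coef.table <- summary(BostonHousing.lm)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
coef.table                                             # one t-test per coefficient

f <- summary(BostonHousing.lm)$fstatistic              # f-statistic value and its degrees of freedom
p.value.f <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
p.value.f                                              # p-value for the overall F-test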

Why hypothesis tests for linear regression models?

The reasons why we need to do hypothesis tests in the case of a linear regression model are the following:

  • By creating the model, we are making new claims about the relationship between the response or dependent variable and one or more predictor or independent variables. To justify these claims, one or more tests are needed. These tests can be termed acts of testing the claims, in other words, hypothesis tests.
  • One kind of test is required to test the relationship between the response and each of the predictor variables (hence, T-tests).
  • Another kind of test is required to test the linear regression model representation as a whole. This is called the F-test.

While training linear regression models, hypothesis testing is done to determine whether the relationship between the response and each of the predictor variables is statistically significant or otherwise. The coefficients related to each of the predictor variables are determined. Then, individual hypothesis tests are done to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. If the null hypothesis for a predictor is rejected, it means that there exists a relationship between the response and that particular predictor variable. T-statistics are used for performing the hypothesis testing because the standard deviation of the sampling distribution is unknown. The value of the t-statistic is compared with the critical value from the t-distribution table in order to decide whether to reject the null hypothesis regarding the relationship between the response and a predictor variable. If the value falls in the critical region, the null hypothesis is rejected, which means that the relationship between the response and that predictor variable is statistically significant. In addition to T-tests, an F-test is performed to test the null hypothesis that the linear regression model does not exist and that the value of all the coefficients is zero (0). Learn more about linear regression and the t-test in this blog – Linear regression t-test: formula, example .



Simple Linear Regression | An Easy Introduction & Examples

Published on February 19, 2020 by Rebecca Bevans. Revised on June 22, 2023.

Simple linear regression is used to estimate the relationship between two quantitative variables . You can use simple linear regression when you want to know:

  • How strong the relationship is between two variables (e.g., the relationship between rainfall and soil erosion).
  • The value of the dependent variable at a certain value of the independent variable (e.g., the amount of soil erosion at a certain level of rainfall).

Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line, while logistic and nonlinear regression models use a curved line. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

If you have more than one independent variable, use multiple linear regression instead.


Simple linear regression is a parametric test , meaning that it makes certain assumptions about the data. These assumptions are:

  • Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.
  • Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among observations.
  • Normality : The data follows a normal distribution .

Linear regression makes one additional assumption:

  • The relationship between the independent and dependent variable is linear : the line of best fit through the data points is a straight line (rather than a curve or some sort of grouping factor).

If your data do not meet the assumptions of homoscedasticity or normality, you may be able to use a nonparametric test instead, such as the Spearman rank test.

If your data violate the assumption of independence of observations (e.g., if observations are repeated over time), you may be able to perform a linear mixed-effects model that accounts for the additional structure in the data.


Simple linear regression formula

The formula for a simple linear regression is:

y = \beta_0 + \beta_1 X + \epsilon

  • y is the predicted value of the dependent variable ( y ) for any given value of the independent variable ( x ).
  • β0 is the intercept , the predicted value of y when x is 0.
  • β1 is the regression coefficient – how much we expect y to change as x increases.
  • x is the independent variable ( the variable we expect is influencing y ).
  • ε is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.

Linear regression finds the line of best fit through your data by searching for the regression coefficient (β1) that minimizes the total error (ε) of the model.

While you can perform a linear regression by hand , this is a tedious process, so most people use statistical programs to help them quickly analyze the data.

Simple linear regression in R

R is a free, powerful, and widely-used statistical program. Download the dataset to try it yourself using our income and happiness example.

Dataset for simple linear regression (.csv)

Load the income.data dataset into your R environment, and then run the following command to generate a linear model describing the relationship between income and happiness:
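The command itself did not survive extraction; a minimal call consistent with the description below (the object name income.happiness.lm is an assumption) would be:

income.happiness.lm <- lm(happiness ~ income, data = income.data)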

This code takes the data you have collected (data = income.data) and calculates the effect that the independent variable income has on the dependent variable happiness, using the lm() function to fit the linear model.

To learn more, follow our full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function in R:
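Assuming the model object created above, the call would be:

summary(income.happiness.lm)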

This function takes the most important parameters from the linear model and puts them into a table, which looks like this:

[Figure: Simple linear regression summary output in R]

This output table first repeats the formula that was used to generate the results (‘Call’), then summarizes the model residuals (‘Residuals’), which give an idea of how well the model fits the real data.

Next is the ‘Coefficients’ table. The first row gives the estimates of the y-intercept, and the second row gives the regression coefficient of the model.

Row 1 of the table is labeled (Intercept) . This is the y-intercept of the regression equation, with a value of 0.20. You can plug this into your regression equation if you want to predict happiness values across the range of income that you have observed:
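The equation itself did not survive extraction; combining this intercept (0.20) with the income coefficient reported below (0.71) gives approximately:

happiness = 0.20 + 0.71 × income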

The next row in the ‘Coefficients’ table is income. This is the row that describes the estimated effect of income on reported happiness:

The Estimate column is the estimated effect , also called the regression coefficient. The number in the table (0.713) tells us that for every one-unit increase in income (where one unit of income = $10,000) there is a corresponding 0.71-unit increase in reported happiness (where happiness is on a scale of 1 to 10).

The Std. Error column displays the standard error of the estimate. This number shows how much variation there is in our estimate of the relationship between income and happiness.

The t value  column displays the test statistic . Unless you specify otherwise, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that our results occurred by chance.

The Pr(>| t |)  column shows the p value . This number tells us how likely we are to see the estimated effect of income on happiness if the null hypothesis of no effect were true.

Because the p value is so low ( p < 0.001),  we can reject the null hypothesis and conclude that income has a statistically significant effect on happiness.

The last three lines of the model summary are statistics about the model as a whole. The most important thing to notice here is the p value of the model. Here it is significant ( p < 0.001), which means that the model as a whole fits the observed data significantly better than a model with no predictors.

When reporting your results, include the estimated effect (i.e. the regression coefficient), standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what your regression coefficient means:

It can also be helpful to include a graph with your results. For a simple linear regression, you can simply plot the observations on the x and y axis and then include the regression line and regression function:

[Figure: Simple linear regression graph]


Can you predict values outside the range of your data?

No! We often say that regression models can be used to predict the value of the dependent variable at certain values of the independent variable. However, this is only true for the range of values where we have actually measured the response.

We can use our income and happiness regression analysis as an example. Between 15,000 and 75,000, we found a regression coefficient of 0.73 ± 0.0193. But what if we did a second survey of people making between 75,000 and 150,000?

[Figure: Extrapolating data in R]

The regression coefficient for the relationship between income and happiness is now 0.21: a 0.21-unit increase in reported happiness for every 10,000 increase in income. While the relationship is still statistically significant (p < 0.001), the slope is much smaller than before.

[Figure: Extrapolating data in R graph]

What if we hadn't measured this group, and instead extrapolated the line from the 15–75k incomes to the 75–150k incomes?

You can see that if we simply extrapolated from the 15–75k income data, we would overestimate the happiness of people in the 75–150k income range.

[Figure: Curved data line]

If we instead fit a curve to the data, it seems to fit the actual pattern much better.

It looks as though happiness actually levels off at higher incomes, so we can’t use the same regression line we calculated from our lower-income data to predict happiness at higher levels of income.

Even when you see a strong pattern in your data, you can’t know for certain whether that pattern continues beyond the range of values you have actually measured. Therefore, it’s important to avoid extrapolating beyond what the data actually tell you.


A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative.

For example, the relationship between temperature and the expansion of mercury in a thermometer can be modeled using a straight line: as temperature increases, the mercury expands. This linear relationship is so certain that we can use mercury thermometers to measure temperature.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of each of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.
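A minimal R sketch of these three steps, again using the built-in cars data as a stand-in:

fit <- lm(dist ~ speed, data = cars)
errors <- cars$dist - predict(fit)   # distance from observed to predicted y at each x
mse <- mean(errors^2)                # mean of the squared distances
mse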


Linear regression - Hypothesis testing

by Marco Taboga, PhD

This lecture discusses how to perform tests of hypotheses about the coefficients of a linear regression model estimated by ordinary least squares (OLS).


The lecture is divided into two parts:

in the first part, we discuss hypothesis testing in the normal linear regression model , in which the OLS estimator of the coefficients has a normal distribution conditional on the matrix of regressors;

in the second part, we show how to carry out hypothesis tests in linear regression analyses where the hypothesis of normality holds only in large samples (i.e., the OLS estimator can be proved to be asymptotically normal).

We conclude by explaining how to choose which test to carry out after estimating a linear regression model.


We now explain how to derive tests about the coefficients of the normal linear regression model.

It can be proved (see the lecture about the normal linear regression model ) that the assumption of conditional normality implies that the OLS estimator is conditionally normal and that suitable standardizations of it have t and F distributions; the tests below are based on these facts.

How the acceptance region is determined depends not only on the desired size of the test, but also on whether the test is:

two-tailed (both smaller and larger values than the one stated in the null hypothesis are possible); or

one-tailed (only one of the two things, i.e., either smaller or larger, is possible).

For more details on how to determine the acceptance region, see the glossary entry on critical values .


The F test is one-tailed .

A critical value in the right tail of the F distribution is chosen so as to achieve the desired size of the test.

Then, the null hypothesis is rejected if the F statistic is larger than the critical value.
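As a concrete sketch (with simulated data, since the lecture's numerical examples are not reproduced here), the F test of all slope coefficients can be carried out in R by comparing the restricted (intercept-only) and unrestricted regressions:

set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 1 + 0.5 * df$x1 + rnorm(100)

restricted   <- lm(y ~ 1, data = df)             # intercept only; all slopes restricted to zero
unrestricted <- lm(y ~ x1 + x2 + x3, data = df)  # full model
anova(restricted, unrestricted)                  # F statistic and its p-value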

In this section we explain how to perform hypothesis tests about the coefficients of a linear regression model when the OLS estimator is asymptotically normal.

As we have shown in the lecture on the properties of the OLS estimator , in several cases (i.e., under different sets of assumptions) it can be proved that the OLS estimator is asymptotically normal and that its asymptotic covariance matrix can be consistently estimated.

These two properties are used to derive the asymptotic distribution of the test statistics used in hypothesis testing.

The test can be either one-tailed or two-tailed. The same comments made for the t-test apply here.


Like the F test, the Chi-square test is usually one-tailed.

The desired size of the test is achieved by appropriately choosing a critical value in the right tail of the Chi-square distribution.

The null is rejected if the Chi-square statistics is larger than the critical value.

Want to learn more about regression analysis? Here are some suggestions:

  • R squared of a linear regression
  • Gauss-Markov theorem
  • Generalized Least Squares
  • Multicollinearity
  • Dummy variables
  • Selection of linear regression models
  • Partitioned regression
  • Ridge regression



Simple linear regression

[Fig. 9: Simple linear regression]

Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)

Fit: the estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to minimize the (training) residual sum of squares (RSS):

\[RSS = \sum_{i=1}^n \left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2\]

Sample code: advertising data
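The notebook's code cell is not reproduced here; a minimal R equivalent, assuming a local copy of the advertising data as Advertising.csv with columns TV and sales (both assumptions), would be:

advertising <- read.csv("Advertising.csv")   # assumed file name and columns
fit <- lm(sales ~ TV, data = advertising)    # regress sales on TV advertising budget
summary(fit)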

Estimates \(\hat\beta_0\) and \(\hat\beta_1\)

A little calculus shows that the minimizers of the RSS are:

\[\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\]

Assessing the accuracy of \(\hat\beta_0\) and \(\hat\beta_1\)

[Fig. 10: How variable is the regression line?]

Based on our model

The standard errors for the parameters are:

\[SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad SE(\hat\beta_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right]\]

95% confidence intervals (approximately):

\[\hat\beta_0 \pm 2\,SE(\hat\beta_0), \qquad \hat\beta_1 \pm 2\,SE(\hat\beta_1)\]

Hypothesis test

Null hypothesis \(H_0\) : There is no relationship between \(X\) and \(Y\) .

Alternative hypothesis \(H_a\) : There is some relationship between \(X\) and \(Y\) .

Based on our model: this translates to

\(H_0\) : \(\beta_1=0\) .

\(H_a\) : \(\beta_1\neq 0\) .

Test statistic:

\[t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}\]

Under the null hypothesis, this has a \(t\) -distribution with \(n-2\) degrees of freedom.
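The corresponding two-sided p-value can be computed from the t distribution; a sketch with made-up numbers:

t.stat <- 2.5                                  # hypothetical observed test statistic
n <- 30                                        # hypothetical sample size
p.value <- 2 * pt(-abs(t.stat), df = n - 2)    # two-sided p-value
p.value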

Sample output: advertising data

Interpreting the hypothesis test

If we reject the null hypothesis, can we assume there is an exact linear relationship?

No. A quadratic relationship may be a better fit, for example. This test assumes the simple linear regression model is correct, which precludes a quadratic relationship.

If we don’t reject the null hypothesis, can we assume there is no relationship between \(X\) and \(Y\) ?

No. This test is based on the model we posited above and is only powerful against certain monotone alternatives. There could be more complex non-linear relationships.


Hypothesis Test for Regression Slope

This lesson describes how to conduct a hypothesis test to determine whether there is a significant linear relationship between an independent variable X and a dependent variable Y .

The test focuses on the slope of the regression line

Y = Β0 + Β1X

where Β0 is a constant, Β1 is the slope (also called the regression coefficient), X is the value of the independent variable, and Y is the value of the dependent variable.

If we find that the slope of the regression line is significantly different from zero, we will conclude that there is a significant relationship between the independent and dependent variables.

Test Requirements

The approach described in this lesson is valid whenever the standard requirements for simple linear regression are met.

  • The dependent variable Y has a linear relationship to the independent variable X .
  • For each value of X, the probability distribution of Y has the same standard deviation σ.
  • The Y values are independent.
  • The Y values are roughly normally distributed (i.e., symmetric and unimodal ). A little skewness is ok if the sample size is large.

The test procedure consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.

State the Hypotheses

If there is a significant linear relationship between the independent variable X and the dependent variable Y , the slope will not equal zero.

Ho: Β1 = 0

Ha: Β1 ≠ 0

The null hypothesis states that the slope is equal to zero, and the alternative hypothesis states that the slope is not equal to zero.

Formulate an Analysis Plan

The analysis plan describes how to use sample data to accept or reject the null hypothesis. The plan should specify the following elements.

  • Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value between 0 and 1 can be used.
  • Test method. Use a linear regression t-test (described in the next section) to determine whether the slope of the regression line differs significantly from zero.

Analyze Sample Data

Using sample data, find the standard error of the slope, the slope of the regression line, the degrees of freedom, the test statistic, and the P-value associated with the test statistic. The approach described in this section is illustrated in the sample problem at the end of this lesson.

Predictor Coef SE Coef T P
Constant 76 30 2.53 0.01
X 35 20 1.75 0.04

  • Standard error. The standard error of the slope is:

SE = sb1 = sqrt[ Σ(yi − ŷi)² / (n − 2) ] / sqrt[ Σ(xi − x̄)² ]

  • Slope. Like the standard error, the slope of the regression line will be provided by most statistics software packages. In the hypothetical output above, the slope is equal to 35.

  • Degrees of freedom. For simple linear regression, DF = n − 2, where n is the number of observations.
  • Test statistic. The test statistic is a t statistic, computed from the slope and its standard error:

t = b1 / SE

  • P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a t statistic, use the t Distribution Calculator to assess the probability associated with the test statistic. Use the degrees of freedom computed above.

Interpret Results

If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level , and rejecting the null hypothesis when the P-value is less than the significance level.

Test Your Understanding

The local utility company surveys 101 randomly selected customers. For each survey participant, the company collects the following: annual electric bill (in dollars) and home size (in square feet). Output from a regression analysis appears below.

Annual bill = 0.55 * Home size + 15

Predictor Coef SE Coef T P
Constant 15 3 5.0 0.00
Home size 0.55 0.24 2.29 0.01

Is there a significant linear relationship between annual bill and home size? Use a 0.05 level of significance.

The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:

H o : The slope of the regression line is equal to zero.

H a : The slope of the regression line is not equal to zero.

  • Formulate an analysis plan . For this analysis, the significance level is 0.05. Using sample data, we will conduct a linear regression t-test to determine whether the slope of the regression line differs significantly from zero.

  • Analyze sample data . We get the slope (b1) and the standard error (SE) from the regression output.

b1 = 0.55       SE = 0.24

We compute the degrees of freedom and the t statistic using the following equations:

DF = n − 2 = 101 − 2 = 99

t = b1/SE = 0.55/0.24 = 2.29

where DF is the degrees of freedom, n is the number of observations in the sample, b1 is the slope of the regression line, and SE is the standard error of the slope.

  • Interpret results . Since the P-value (0.0242) is less than the significance level (0.05), we reject the null hypothesis and conclude that there is a significant linear relationship between annual bill and home size.
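The arithmetic in this example is easy to verify in R:

b1 <- 0.55; SE <- 0.24; n <- 101
t <- b1 / SE                        # 2.29
p <- 2 * pt(-abs(t), df = n - 2)    # about 0.024, matching the reported P-value
c(t = t, p = p)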


4   Linear Regression

A quick review of regression, expectation, variance, and parameter estimation.

Input vector: \(X = (X_1, X_2, ... , X_p)\) .

Output Y is real-valued.

Predict Y from X by f ( X ) so that the expected loss function \(E(L(Y, f(X)))\) is minimized.

Review: Expectation

Intuitively, the expectation of a random variable is its “average” value under its distribution.

Formally, the expectation of a random variable X, denoted E[X], is its Lebesgue integral with respect to its distribution.

If X takes values in some countable numeric set \(\chi\) , then

\[E(X) =\sum_{x \in \chi}xP(X=x)\]

If \(X \in \mathbb{R}^m\) has a density p , then

\[E(X) =\int_{\mathbb{R}^m}x\,p(x)\,dx\]

Expectation is linear: \(E(aX +b)=aE(X) + b\)

Also, \(E(X+Y) = E(X) +E(Y)\)

The expectation is monotone: if X ≥ Y , then E ( X ) ≥ E ( Y )

Review: Variance

The variance of a random variable X is defined as:

\(Var(X) = E[(X-E[X])^2]=E[X^2]-(E[X])^2\)

and the variance obeys the following for \(a, b \in \mathbb{R}\) :

\[Var(aX + b) =a^2Var(X)\]

Review: Frequentist Basics

The data \(x_1, \ldots, x_n\) are generally assumed to be independent and identically distributed (i.i.d.).

We would like to estimate some unknown value θ associated with the distribution from which the data was generated.

In general, our estimate will be a function of the data (i.e., a statistic) \[\hat{\theta} =f(x_1, x_2, ... , x_n)\]

Example: Given the results of n independent flips of a coin, determine the probability p with which it lands on heads.

Review: Parameter Estimation

In practice, we often seek to select a distribution (model) corresponding to our data.

If our model is parameterized by some set of values, then this problem is that of parameter estimation.

How can we obtain estimates in general? One Answer: Maximize the likelihood and the estimate is called the maximum likelihood estimate, MLE.

\[ \begin {align} \hat{\theta} & = argmax_{\theta} \prod_{i=1}^{n}p_{\theta}(x_i) \\ & =argmax_{\theta} \sum_{i=1}^{n}log (p_{\theta}(x_i)) \\ \end {align} \]

Let’s look at the setup for linear regression. We have an input vector: \(X = \left( X _ { 1 } , X _ { 2 } , \dots , X _ { p }\right)\) . This vector is p dimensional.

The output Y is a real value and is ordered.

We want to predict Y from X .

Before we actually do the prediction we have to train the function f ( X ). By the end of the training, I would have a function f ( X ) to map every X into an estimated Y . Then, we need some way to measure how good this predictor function is. This is measured by the expectation of a loss.

Why do we have a loss in the estimation?

Y is actually a random variable given X . For instance, consider predicting someone’s weight based on the person’s height. People can have different weights given the same height. If you think of the weight as Y and the height as X , Y is random given X . We, therefore, cannot have a perfect prediction for every subject because f ( X ) is a fixed function, impossible to be correct all the time. The loss measures how different the true Y is from your prediction.

Why do we have the overall loss expressed as an expectation?

The loss may be different for different subjects. In statistics, a common thing to do is to average the losses over the entire population.

Squared loss:

\[L ( Y , f ( X ) ) = ( Y - f ( X ) ) ^ { 2 }\]

We simply measure the difference between the two variables and square them so that we can handle negative and positive difference symmetrically.

Suppose the distribution of Y given X is known , the optimal predictor is:

\[\begin{array} { l } { f ^ {*} ( X ) = \operatorname { argmin } _ { f ( x ) } E ( Y - f ( x ) ) ^ { 2 } } \\ { = E ( Y | X ) } \end{array}\]

This is the conditional expectation of Y given X . The function E ( Y | X ) is called the regression function .

Example 3-1

We want to predict the number of physicians in a metropolitan area.

Problem : The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by Y , is expected to be related to total population ( X 1 , measured in thousands), land area ( X 2 , measured in square miles), and total personal income ( X 3 , measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table.

X1 (total population, thousands):    9387   7031   7017    233    232    231
X2 (land area, square miles):        1348   4069   3719   1011    813    654
X3 (total personal income, $M):     72100  52737  54542   1337   1589   1148
Y (active physicians):              25627  15389  13326    264    371    140

(Each column corresponds to one SMSA; only six of the 141 areas are shown.)

Our Goal: To predict Y from \(X _ { 1 } , X _ { 2 } , \text { and } X _ { 3 }\) .

This is a typical regression problem.

Upon successful completion of this lesson, you should be able to:

  • Review the linear regression model, focusing on prediction.
  • Use least squares estimation for linear regression.
  • Apply a model developed on training data to an independent test data set.
  • Set the context for more complex supervised prediction methods.

4.1 Linear Methods

The linear regression model:

\[ f(X)=\beta_{0} + \sum_{j=1}^{p}X_{j}\beta_{j}\]

This is just a linear combination of the measurements that are used to make predictions, plus a constant (the intercept term). This is a simple approach. However, the true regression function might be pretty close to a linear function, in which case the model is a good approximation.

What if the model is not true?

  • It still might be a good approximation - the best we can do.
  • Sometimes because of the lack of training data or smarter algorithms, this is the most we can estimate robustly from the data.

Comments on \(X_j\) :

  • We assume that these are quantitative inputs [or dummy indicator variables representing levels of a qualitative input]
  • We can also perform transformations of the quantitative inputs, e.g., log(•), √(•). In this case, the linear regression model is still a linear function in terms of the coefficients to be estimated. However, instead of using the original \(X_{j}\) , we have replaced them or augmented them with the transformed values. Regardless of the transformations performed on \(X_j\) , \(f(x)\) is still a linear function of the unknown parameters.
  • Some basic expansions: \(X _ { 2 } = X _ { 1 } ^ { 2 } , X _ { 3 } = X _ { 1 } ^ { 3 } , X _ { 4 } = X _ { 1 } \cdot X _ { 2 }\) .

Below is a geometric interpretation of a linear regression.

For instance, if we have two variables, \(X_{1}\) and \(X_{2}\) , and we predict Y by a linear combination of \(X_{1}\) and \(X_{2}\) , the predictor function corresponds to a plane (hyperplane) in the three-dimensional space of \(X_{1}\) , \(X_{2}\) , Y . Given a pair of \(X_{1}\) and \(X_{2}\) we could find the corresponding point on the plane to decide Y by drawing a perpendicular line to the hyperplane, starting from the point in the plane spanned by the two predictor variables.

For accurate prediction, hopefully, the data will lie close to this hyperplane, but they won’t lie exactly in the hyperplane (unless perfect prediction is achieved). In the plot above, the red points are the actual data points. They do not lie on the plane but are close to it.

How should we choose this hyperplane?

We choose a plane such that the total squared distance from the red points (real data points) to the corresponding predicted points in the plane is minimized. Graphically, if we add up the squares of the lengths of the line segments drawn from the red points to the hyperplane, the optimal hyperplane should yield the minimum sum of squared lengths.

The issue of finding the regression function \(E ( Y | X )\) is converted to estimating \(\beta _ { j } , j = 0,1 , \dots , p\) .

Remember in earlier discussions we talked about the trade-off between model complexity and accurate prediction on training data. In this case, we start with a linear model, which is relatively simple. The model complexity issue is taken care of by using a simple linear function. In basic linear regression, there is no explicit action taken to restrict model complexity. Although variable selection, which we cover in Lesson 5: Variable Selection, can be considered a way to control model complexity.

With the model complexity under check, the next thing we want to do is to have a predictor that fits the training data well.

Let the training data be:

\[\left\{ \left( x _ { 1 } , y _ { 1 } \right) , \left( x _ { 2 } , y _ { 2 } \right) , \dots , \left( x _ { N } , y _ { N } \right) \right\} , \text { where } x _ { i } = \left( x _ { i 1 } , x _ { i 2 } , \ldots , x _ { i p } \right)\]

Denote \(\beta = \left( \beta _ { 0 } , \beta _ { 1 } , \ldots , \beta _ { p } \right) ^ { T }\) .

Without knowing the true distribution for X and Y , we cannot directly minimize the expected loss.

Instead, the expected loss \(E ( Y - f ( X ) ) ^ { 2 }\) is approximated by the empirical loss \(R S S ( \beta ) / N\) :

\[ \begin {align}RSS(\beta)&=\sum_{i=1}^{N}\left(y_i - f(x_i)\right)^2 \\ &=\sum_{i=1}^{N}\left(y_i - \beta_0 -\sum_{j=1}^{p}x_{ij}\beta_{j}\right)^2 \\ \end {align} \]

This empirical loss is basically the accuracy you computed based on the training data. This is called the residual sum of squares, RSS .

The x ’s are known numbers from the training data.

Here is the input matrix X of dimension N × ( p +1):

\[\begin{pmatrix} 1 & x_{1,1} &x_{1,2} & ... &x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & ... &x_{2,p} \\ ... & ... & ... & ... & ... \\ 1 & x_{N,1} &x_{N,2} &... & x_{N,p} \end{pmatrix}\]

Earlier we mentioned that our training data had N number of points. So, in the example where we were predicting the number of physicians, there were 141 metropolitan areas that were investigated. Therefore, N = 141. Dimension p = 3 in this example. The input matrix is augmented with a column of 1's (for the intercept term). So, above you see the first column contains all 1's. Then if you look at every row, every row corresponds to one sample point and the dimensions go from one to p . Hence, the input matrix X is of dimension N × ( p +1).

Output vector y :

\[ y= \begin{pmatrix} y_{1}\\ y_{2}\\ ...\\ y_{N} \end{pmatrix} \]

Again, this is taken from the training data set.

The estimated \(\beta\) is \(\hat{\beta}\) , and this is also put in a column vector, \(\hat{\beta} = \left( \hat{\beta}_0 , \hat{\beta}_1 , \dots , \hat{\beta}_p \right)^T\) .

The fitted values (not the same as the true values) at the training inputs are

\[\hat{y}_{i}=\hat{\beta}_{0}+\sum_{j=1}^{p}x_{ij}\hat{\beta}_{j}\]

\[ \hat{y}= \begin{pmatrix} \hat{y}_{1}\\ \hat{y}_{2}\\ ...\\ \hat{y}_{N} \end{pmatrix} \]

For instance, if you are talking about sample i , the fitted value for sample i would be to take all the values of the x ’s for sample i , (denoted by \(x_{ij}\) ) and do a linear summation for all of these \(x_{ij}\) ’s with weights \(\hat{\beta}_{j}\) and the intercept term \(\hat{\beta}_{0}\) .

4.2 Point Estimate

  • The least square estimation of \(\hat{\beta}\) is:

\[\hat{\beta} =(X^{T}X)^{-1}X^{T}y \]

  • The fitted value vector is:

\[\hat{y} =X\hat{\beta}=X(X^{T}X)^{-1}X^{T}y \]

  • Hat matrix:

\[H=X(X^{T}X)^{-1}X^{T} \]
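These three formulas translate directly into R matrix code; a sketch using the built-in mtcars data as a stand-in:

X <- cbind(1, as.matrix(mtcars[, c("wt", "hp")]))  # N x (p+1) input matrix with a column of 1's
y <- mtcars$mpg

beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y   # least squares estimate
y.hat <- X %*% beta.hat                        # fitted values
H <- X %*% solve(t(X) %*% X) %*% t(X)          # hat matrix: y.hat equals H %*% y

all.equal(as.vector(beta.hat), unname(coef(lm(mpg ~ wt + hp, data = mtcars))))  # TRUE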

Geometric Interpretation

Each column of X is a vector in an N -dimensional space (not the \(p + 1\) -dimensional feature vector space). Here, we take the columns of the matrix X , and this is why they live in an N -dimensional space. Values for the same variable across all of the samples are put in a vector. I represent this input matrix as the matrix formed by the column vectors:

\[X = \left( x _ { 0 } , x _ { 1 } , \ldots , x _ { p } \right)\]

Here \(x_0\) is the column of 1’s for the intercept term. It turns out that the fitted output vector \(\hat{y}\) is a linear combination of the column vectors \(x _ { j } , j = 0,1 , \dots , p\) . Go back and look at the matrix and you will see this.

This means that \(\hat{y}\) lies in the subspace spanned by \(x _ { j } , j = 0,1 , \dots , p\) .

The dimension of the column vectors is N , the number of samples. Usually, the number of samples is much bigger than the dimension p . The true y can be any point in this N -dimensional space. What we want to find is an approximation constraint in the \(p+1\) dimensional space such that the distance between the true y and the approximation is minimized. It turns out that the residual sum of squares is equal to the square of the Euclidean distance between y and \(\hat{y}\) .

\[RSS(\hat{\beta})=\parallel y - \hat{y}\parallel^2 \]

For the optimal solution, \(y-\hat{y}\) has to be perpendicular to the subspace, i.e., \(\hat{y}\) is the projection of y on the subspace spanned by \(x _ { j } , j = 0,1 , \dots , p\) .

Geometrically speaking let’s look at a really simple example. Take a look at the diagram below. What we want to find is a \(\hat{y}\) that lies in the hyperplane defined or spanned by \(x _ {1}\) and \(x _ {2}\) . You would draw a perpendicular line from y to the plane to find \(\hat{y}\) . This comes from a basic geometric fact. In general, if you want to find some point in a subspace to represent some point in a higher dimensional space, the best you can do is to project that point to your subspace.

The difference between your approximation and the true vector has to be perpendicular to the subspace.

The geometric interpretation is very helpful for understanding coefficient shrinkage and subset selection (covered in Lesson 5 and 6).

4.3 Example Results

Let’s take a look at some results for our earlier example about the number of active physicians in a Standard Metropolitan Statistical Area (SMSA - data). If I do the optimization using the equations, I obtain these values below:

\[\hat{Y}_{i}= -143.89+0.341X_{i1}-0.019X_{i2}+0.254X_{i3} \]

\[RSS(\hat{\beta})=52,942,438 \]

Let’s take a look at some scatter plots. We plot one variable versus another. For instance, in the upper left-hand plot, we plot the pairs of \(x_{1}\) and y . These are two-dimensional plots, each variable plotted individually against any other variable.

STAT 501 on Linear Regression goes deeper into which scatter plots are more helpful than others. These can be indicative of potential problems that exist in your data. For instance, in the plots above you can see that \(x_{3}\) is almost a perfectly linear function of \(x_{1}\) . This might indicate that there might be some problems when you do the optimization. What happens is that if \(x_{3}\) is a perfectly linear function of \(x_{1}\) , then when you solve the linear equation to determine the \(β\) ’s, there is no unique solution. The scatter plots help to discover such potential problems.

In practice, because there is always measurement error, you rarely get a perfect linear relationship. However, you might get something very close. In this case, the matrix, \(X ^ { T } X\) , will be close to singular, causing large numerical errors in computation. Therefore, we would like to have predictor variables that are not so strongly correlated.

4.4 Theoretical Justification

If the linear model is true.

Here is some theoretical justification for why we do parameter estimation using least squares.

If the linear model is true, i.e., if the conditional expectation of Y given X indeed is a linear function of the X j ’s, and Y is the sum of that linear function and an independent Gaussian noise, we have the following properties for least squares estimation.

\[ E(Y|X)=\beta_0+\sum_{j=1}^{p}X_{j}\beta_{j} \]

The least squares estimation of \(\beta\) is unbiased,

\[E(\hat{\beta}_{j}) =\beta_j, j=0,1, ... , p \]

To draw inferences about \(\beta\) , further assume: \(Y = E(Y | X) + \epsilon\) where \(\epsilon \sim N(0,\sigma^2)\) and is independent of X .

\(X_{ij}\) are regarded as fixed, \(Y_i\) are random due to \(\epsilon\) .

The estimation accuracy of \(\hat{\beta}\) , the variance of \(\hat{\beta}\) is given here:

\[Var(\hat{\beta})=(X^{T}X)^{-1}\sigma^2\]

You should see that the higher \(\sigma^2\) is, the variance of \(\hat{\beta}\) will be higher. This is very natural. Basically, if the noise level is high, you’re bound to have a large variance in your estimation. But then, of course, it also depends on \(X^T X\) . This is why in experimental design, methods are developed to choose X so that the variance tends to be small.

Note that \(\hat{\beta}\) is a vector and hence its variance is a covariance matrix of size ( p + 1) × ( p + 1). The covariance matrix not only tells the variance for every individual \(\beta_j\) , but also the covariance for any pair of \(\beta_j\) and \(\beta_k\) , \(j \ne k\) .
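This covariance matrix is what R's vcov() returns for a fitted model, with \(\sigma^2\) replaced by its estimate; a sketch, again on the stand-in mtcars data:

fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)                                # N x (p+1) design matrix
sigma2.hat <- summary(fit)$sigma^2                    # estimated noise variance
all.equal(vcov(fit), sigma2.hat * solve(t(X) %*% X))  # TRUE: Var = (X'X)^{-1} sigma^2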

Gauss-Markov Theorem

This theorem says that the least squares estimator is the best linear unbiased estimator.

Assume that the linear model is true. For any linear combination of the parameters \(\beta_0 , \cdots , \beta_p\) , you get a new parameter denoted by \(\theta = a^{T}\beta\) . Then \(a^{T}\hat{\beta}\) is just a weighted sum of \(\hat{\beta}_0, ..., \hat{\beta}_p\) and is an unbiased estimator since \(\hat{\beta}\) is unbiased.

We want to estimate \(θ\) and the least squares estimate of \(θ\) is:

\[ \begin {align} \hat{\theta} & = a^T\hat{\beta}\\ & = a^T(X^{T}X)^{-1}X^{T}y \\ & \doteq \tilde{a}^{T}y, \\ \end{align} \]

which is linear in y . The Gauss-Markov theorem states that for any other linear unbiased estimator, \(c^Ty\) , the linear estimator obtained from the least squares estimation on \(\theta\) is guaranteed to have a smaller variance than \(c^Ty\) :

\[Var(\tilde{a}^{T}y) \le Var(c^{T}y).\]

Keep in mind that you’re only comparing with linear unbiased estimators. If the estimator is not linear, or is not unbiased, then it is possible to do better in terms of squared loss.

\(\beta_j\) , j = 0, 1, …, p are special cases of \(a^T\beta\) , where \(a^T\) only has one non-zero element that equals 1.

4.5 R Scripts

1. Acquire Data

Diabetes data

The diabetes data set is taken from the UCI machine learning database on Kaggle: Pima Indians Diabetes Database

  • 768 samples in the dataset
  • 8 quantitative variables
  • 2 classes; with or without signs of diabetes

Save the data into your working directory for this course as “diabetes.data.” Then load data into R as follows:
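The loading command is missing from these notes; a minimal sketch, assuming the Kaggle file is comma-separated with no header row (both assumptions):

RawData <- read.table("diabetes.data", header = FALSE, sep = ",")  # assumed format
dim(RawData)   # expect 768 rows, 9 columns (8 predictors + response)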

In RawData , the response variable is its last column; and the remaining columns are the predictor variables.

2. Fitting a Linear Model

In order to fit linear regression models in R, lm can be used for linear models, which are specified symbolically. A typical model takes the form of response~predictors where response is the (numeric) response vector and predictors is a series of predictor variables.

Take the full model and the base model (no predictors used) as examples:
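The code cells are missing; a minimal sketch consistent with the surrounding description (the response is taken as the last column, and the name Y is an assumption):

d <- RawData
names(d)[ncol(d)] <- "Y"                 # the response variable is the last column

full.model <- lm(Y ~ ., data = d)        # all remaining columns as predictors
base.model <- lm(Y ~ 1, data = d)        # no predictors: intercept only

full.model$coefficients                  # least squares estimates for beta-hat
head(full.model$fitted.values)           # fitted values for the response variable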

For the full model, coefficients shows the least square estimation for \(\hat{\beta}\) and fitted.values are the fitted values for the response variable.

The results for the coefficients should be as follows:

The fitted values should start with 0.6517572852.
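Using the hypothetical names from the sketch above:

coef(full.model)                  # the least squares estimates of beta
head(full.model$fitted.values)    # the first value should be about 0.6517573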

Source Code

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • View all journals
  • Explore content
  • About the journal
  • Publish with us
  • Sign up for alerts
  • Published: 04 September 2024

CDK5–cyclin B1 regulates mitotic fidelity

  • Xiao-Feng Zheng   ORCID: orcid.org/0000-0001-8769-4604 1   na1 ,
  • Aniruddha Sarkar   ORCID: orcid.org/0000-0002-9393-1335 1   na1 ,
  • Humphrey Lotana 2 ,
  • Aleem Syed   ORCID: orcid.org/0000-0001-7942-3900 1 ,
  • Huy Nguyen   ORCID: orcid.org/0000-0002-4424-1047 1 ,
  • Richard G. Ivey 3 ,
  • Jacob J. Kennedy 3 ,
  • Jeffrey R. Whiteaker 3 ,
  • Bartłomiej Tomasik   ORCID: orcid.org/0000-0001-5648-345X 1 , 4   nAff7 ,
  • Kaimeng Huang   ORCID: orcid.org/0000-0002-0552-209X 1 , 5 ,
  • Feng Li 1 ,
  • Alan D. D’Andrea   ORCID: orcid.org/0000-0001-6168-6294 1 , 5 ,
  • Amanda G. Paulovich   ORCID: orcid.org/0000-0001-6532-6499 3 ,
  • Kavita Shah 2 ,
  • Alexander Spektor   ORCID: orcid.org/0000-0002-1085-3205 1 , 5 &
  • Dipanjan Chowdhury   ORCID: orcid.org/0000-0001-5645-3752 1 , 5 , 6  

Nature ( 2024 ) Cite this article

40 Altmetric

Metrics details

CDK1 has been known to be the sole cyclin-dependent kinase (CDK) partner of cyclin B1 to drive mitotic progression 1 . Here we demonstrate that CDK5 is active during mitosis and is necessary for maintaining mitotic fidelity. CDK5 is an atypical CDK owing to its high expression in post-mitotic neurons and activation by non-cyclin proteins p35 and p39 2 . Here, using independent chemical genetic approaches, we specifically abrogated CDK5 activity during mitosis, and observed mitotic defects, nuclear atypia and substantial alterations in the mitotic phosphoproteome. Notably, cyclin B1 is a mitotic co-factor of CDK5. Computational modelling, comparison with experimentally derived structures of CDK–cyclin complexes and validation with mutational analysis indicate that CDK5–cyclin B1 can form a functional complex. Disruption of the CDK5–cyclin B1 complex phenocopies CDK5 abrogation in mitosis. Together, our results demonstrate that cyclin B1 partners with both CDK5 and CDK1, and CDK5–cyclin B1 functions as a canonical CDK–cyclin complex to ensure mitotic fidelity.

This is a preview of subscription content, access via your institution

Access options

Access Nature and 54 other Nature Portfolio journals

Get Nature+, our best-value online-access subscription

24,99 € / 30 days

cancel any time

Subscribe to this journal

Receive 51 print issues and online access

185,98 € per year

only 3,65 € per issue

Buy this article

  • Purchase on SpringerLink
  • Instant access to full article PDF

Prices may be subject to local taxes which are calculated during checkout

hypothesis test in simple linear regression

Similar content being viewed by others

hypothesis test in simple linear regression

Core control principles of the eukaryotic cell cycle

hypothesis test in simple linear regression

CDC7-independent G1/S transition revealed by targeted protein degradation

hypothesis test in simple linear regression

Evolution of opposing regulatory interactions underlies the emergence of eukaryotic cell cycle checkpoints

Data availability.

All data supporting the findings of this study are available in the Article and its Supplementary Information . The LC–MS/MS proteomics data have been deposited to the ProteomeXchange Consortium 60 via the PRIDE 61 partner repository under dataset identifier PXD038386 . Correspondence regarding experiments and requests for materials should be addressed to the corresponding authors.

Wieser, S. & Pines, J. The biochemistry of mitosis. Cold Spring Harb. Perspect. Biol. 7 , a015776 (2015).

Article   PubMed   PubMed Central   Google Scholar  

Dhavan, R. & Tsai, L. H. A decade of CDK5. Nat. Rev. Mol. Cell Biol. 2 , 749–759 (2001).

Article   CAS   PubMed   Google Scholar  

Malumbres, M. Cyclin-dependent kinases. Genome Biol. 15 , 122 (2014).

Coverley, D., Laman, H. & Laskey, R. A. Distinct roles for cyclins E and A during DNA replication complex assembly and activation. Nat. Cell Biol. 4 , 523–528 (2002).

Desai, D., Wessling, H. C., Fisher, R. P. & Morgan, D. O. Effects of phosphorylation by CAK on cyclin binding by CDC2 and CDK2. Mol. Cell. Biol. 15 , 345–350 (1995).

Article   CAS   PubMed   PubMed Central   Google Scholar  

Brown, N. R. et al. CDK1 structures reveal conserved and unique features of the essential cell cycle CDK. Nat. Commun. 6 , 6769 (2015).

Article   ADS   CAS   PubMed   PubMed Central   Google Scholar  

Strauss, B. et al. Cyclin B1 is essential for mitosis in mouse embryos, and its nuclear export sets the time for mitosis. J. Cell Biol. 217 , 179–193 (2018).

Gavet, O. & Pines, J. Activation of cyclin B1-Cdk1 synchronizes events in the nucleus and the cytoplasm at mitosis. J. Cell Biol. 189 , 247–259 (2010).

Barbiero, M. et al. Cell cycle-dependent binding between cyclin B1 and Cdk1 revealed by time-resolved fluorescence correlation spectroscopy. Open Biol. 12 , 220057 (2022).

Pines, J. & Hunter, T. Isolation of a human cyclin cDNA: evidence for cyclin mRNA and protein regulation in the cell cycle and for interaction with p34cdc2. Cell 58 , 833–846 (1989).

Clute, P. & Pines, J. Temporal and spatial control of cyclin B1 destruction in metaphase. Nat. Cell Biol. 1 , 82–87 (1999).

Potapova, T. A. et al. The reversibility of mitotic exit in vertebrate cells. Nature 440 , 954–958 (2006).

Basu, S., Greenwood, J., Jones, A. W. & Nurse, P. Core control principles of the eukaryotic cell cycle. Nature 607 , 381–386 (2022).

Santamaria, D. et al. Cdk1 is sufficient to drive the mammalian cell cycle. Nature 448 , 811–815 (2007).

Article   ADS   CAS   PubMed   Google Scholar  

Zheng, X. F. et al. A mitotic CDK5-PP4 phospho-signaling cascade primes 53BP1 for DNA repair in G1. Nat. Commun. 10 , 4252 (2019).

Article   ADS   PubMed   PubMed Central   Google Scholar  

Fagerberg, L. et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol. Cell. Proteom. 13 , 397–406 (2014).

Article   CAS   Google Scholar  

Pozo, K. & Bibb, J. A. The emerging role of Cdk5 in cancer. Trends Cancer 2 , 606–618 (2016).

Sharma, S. & Sicinski, P. A kinase of many talents: non-neuronal functions of CDK5 in development and disease. Open Biol. 10 , 190287 (2020).

Sun, K. H. et al. Novel genetic tools reveal Cdk5’s major role in Golgi fragmentation in Alzheimer’s disease. Mol. Biol. Cell 19 , 3052–3069 (2008).

Sharma, S. et al. Targeting the cyclin-dependent kinase 5 in metastatic melanoma. Proc. Natl Acad. Sci. USA 117 , 8001–8012 (2020).

Nabet, B. et al. The dTAG system for immediate and target-specific protein degradation. Nat. Chem. Biol. 14 , 431–441 (2018).

Simpson, L. M. et al. Target protein localization and its impact on PROTAC-mediated degradation. Cell Chem. Biol. 29 , 1482–1504 e1487 (2022).

Vassilev, L. T. et al. Selective small-molecule inhibitor reveals critical mitotic functions of human CDK1. Proc. Natl Acad. Sci. USA 103 , 10660–10665 (2006).

Janssen, A. F. J., Breusegem, S. Y. & Larrieu, D. Current methods and pipelines for image-based quantitation of nuclear shape and nuclear envelope abnormalities. Cells 11 , 347 (2022).

Thompson, S. L. & Compton, D. A. Chromosome missegregation in human cells arises through specific types of kinetochore-microtubule attachment errors. Proc. Natl Acad. Sci. USA 108 , 17974–17978 (2011).

Kline-Smith, S. L. & Walczak, C. E. Mitotic spindle assembly and chromosome segregation: refocusing on microtubule dynamics. Mol. Cell 15 , 317–327 (2004).

Prosser, S. L. & Pelletier, L. Mitotic spindle assembly in animal cells: a fine balancing act. Nat. Rev. Mol. Cell Biol. 18 , 187–201 (2017).

Zeng, X. et al. Pharmacologic inhibition of the anaphase-promoting complex induces a spindle checkpoint-dependent mitotic arrest in the absence of spindle damage. Cancer Cell 18 , 382–395 (2010).

Warren, J. D., Orr, B. & Compton, D. A. A comparative analysis of methods to measure kinetochore-microtubule attachment stability. Methods Cell. Biol. 158 , 91–116 (2020).

Gregan, J., Polakova, S., Zhang, L., Tolic-Norrelykke, I. M. & Cimini, D. Merotelic kinetochore attachment: causes and effects. Trends Cell Biol 21 , 374–381 (2011).

Etemad, B., Kuijt, T. E. & Kops, G. J. Kinetochore-microtubule attachment is sufficient to satisfy the human spindle assembly checkpoint. Nat. Commun. 6 , 8987 (2015).

Tauchman, E. C., Boehm, F. J. & DeLuca, J. G. Stable kinetochore-microtubule attachment is sufficient to silence the spindle assembly checkpoint in human cells. Nat. Commun. 6 , 10036 (2015).

Mitchison, T. & Kirschner, M. Microtubule assembly nucleated by isolated centrosomes. Nature 312 , 232–237 (1984).

Fourest-Lieuvin, A. et al. Microtubule regulation in mitosis: tubulin phosphorylation by the cyclin-dependent kinase Cdk1. Mol. Biol. Cell 17 , 1041–1050 (2006).

Ubersax, J. A. et al. Targets of the cyclin-dependent kinase Cdk1. Nature 425 , 859–864 (2003).

Yang, C. H., Lambie, E. J. & Snyder, M. NuMA: an unusually long coiled-coil related protein in the mammalian nucleus. J. Cell Biol. 116 , 1303–1317 (1992).

Yang, C. H. & Snyder, M. The nuclear-mitotic apparatus protein is important in the establishment and maintenance of the bipolar mitotic spindle apparatus. Mol. Biol. Cell 3 , 1259–1267 (1992).

Kotak, S., Busso, C. & Gonczy, P. NuMA phosphorylation by CDK1 couples mitotic progression with cortical dynein function. EMBO J. 32 , 2517–2529 (2013).

Kitagawa, M. et al. Cdk1 coordinates timely activation of MKlp2 kinesin with relocation of the chromosome passenger complex for cytokinesis. Cell Rep. 7 , 166–179 (2014).

Schrock, M. S. et al. MKLP2 functions in early mitosis to ensure proper chromosome congression. J. Cell Sci. 135 , jcs259560 (2022).

Sun, M. et al. NuMA regulates mitotic spindle assembly, structural dynamics and function via phase separation. Nat. Commun. 12 , 7157 (2021).

Chen, Q., Zhang, X., Jiang, Q., Clarke, P. R. & Zhang, C. Cyclin B1 is localized to unattached kinetochores and contributes to efficient microtubule attachment and proper chromosome alignment during mitosis. Cell Res. 18 , 268–280 (2008).

Kabeche, L. & Compton, D. A. Cyclin A regulates kinetochore microtubules to promote faithful chromosome segregation. Nature 502 , 110–113 (2013).

Hegarat, N. et al. Cyclin A triggers mitosis either via the Greatwall kinase pathway or cyclin B. EMBO J. 39 , e104419 (2020).

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596 , 583–589 (2021).

Wood, D. J. & Endicott, J. A. Structural insights into the functional diversity of the CDK-cyclin family. Open Biol. 8 , 180112 (2018).

Brown, N. R., Noble, M. E., Endicott, J. A. & Johnson, L. N. The structural basis for specificity of substrate and recruitment peptides for cyclin-dependent kinases. Nat. Cell Biol. 1 , 438–443 (1999).

Tarricone, C. et al. Structure and regulation of the CDK5-p25 nck5a complex. Mol. Cell 8 , 657–669 (2001).

Poon, R. Y., Lew, J. & Hunter, T. Identification of functional domains in the neuronal Cdk5 activator protein. J. Biol. Chem. 272 , 5703–5708 (1997).

Oppermann, F. S. et al. Large-scale proteomics analysis of the human kinome. Mol. Cell. Proteom. 8 , 1751–1764 (2009).

van den Heuvel, S. & Harlow, E. Distinct roles for cyclin-dependent kinases in cell cycle control. Science 262 , 2050–2054 (1993).

Article   ADS   PubMed   Google Scholar  

Nakatani, Y. & Ogryzko, V. Immunoaffinity purification of mammalian protein complexes. Methods Enzymol. 370 , 430–444 (2003).

Tyanova, S., Temu, T. & Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat. Protoc. 11 , 2301–2319 (2016).

Tyanova, S. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 13 , 731–740 (2016).

Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43 , e47 (2015).

R Core Team. R: a language and environment for statistical computing (2021).

Wickham, H. ggplot2: elegant graphics for data analysis (2016).

Slowikowski, K. ggrepel: automatically position non-overlapping text labels with “ggplot2” (2018).

Wu, T. et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation 2 , 100141 (2021).

CAS   PubMed   PubMed Central   Google Scholar  

Deutsch, E. W. et al. The ProteomeXchange consortium in 2020: enabling ‘big data’ approaches in proteomics. Nucleic Acids Res. 48 , D1145–D1152 (2020).

CAS   PubMed   Google Scholar  

Perez-Riverol, Y. et al. The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 47 , D442–D450 (2019).

Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26 , 139–140 (2010).

Nagahara, H. et al. Transduction of full-length TAT fusion proteins into mammalian cells: TAT-p27Kip1 induces cell migration. Nat. Med. 4 , 1449–1452 (1998).

Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19 , 679–682 (2022).

Lu, C. et al. OPLS4: improving force field accuracy on challenging regimes of chemical space. J. Chem. Theory Comput. 17 , 4291–4300 (2021).

Obenauer, J. C., Cantley, L. C. & Yaffe, M. B. Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. 31 , 3635–3641 (2003).

Download references

Acknowledgements

We thank D. Pellman for comments on the manuscript; W. Michowski, S. Sharma, P. Sicinski, B. Nabet and N. Gray for the reagents; J. A. Tainer for providing access to software used for structural analysis; and S. Gerber for sharing unpublished results. D.C. is supported by grants R01 CA208244 and R01 CA264900, DOD Ovarian Cancer Award W81XWH-15-0564/OC140632, Tina’s Wish Foundation, Detect Me If You Can, a V Foundation Award, a Gray Foundation grant and the Claudia Adams Barr Program in Innovative Basic Cancer Research. A. Spektor would like to acknowledge support from K08 CA208008, the Burroughs Wellcome Fund Career Award for Medical Scientists, Saverin Breast Cancer Research Fund and the Claudia Adams Barr Program in Innovative Basic Cancer Research. X.-F.Z. was an American Cancer Society Fellow and is supported by the Breast and Gynecologic Cancer Innovation Award from Susan F. Smith Center for Women’s Cancers at Dana-Farber Cancer Institute. A. Syed is supported by the Claudia Adams Barr Program in Innovative Basic Cancer Research. B.T. was supported by the Polish National Agency for Academic Exchange (grant PPN/WAL/2019/1/00018) and by the Foundation for Polish Science (START Program). A.D.D is supported by NIH grant R01 HL52725. A.G.P. by National Cancer Institute grants U01CA214114 and U01CA271407, as well as a donation from the Aven Foundation; J.R.W. by National Cancer Institute grant R50CA211499; and K.S. by NIH awards 1R01-CA237660 and 1RF1NS124779.

Author information

Bartłomiej Tomasik

Present address: Department of Oncology and Radiotherapy, Medical University of Gdańsk, Faculty of Medicine, Gdańsk, Poland

These authors contributed equally: Xiao-Feng Zheng, Aniruddha Sarkar

Authors and Affiliations

Division of Radiation and Genome Stability, Department of Radiation Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA

Xiao-Feng Zheng, Aniruddha Sarkar, Aleem Syed, Huy Nguyen, Bartłomiej Tomasik, Kaimeng Huang, Feng Li, Alan D. D’Andrea, Alexander Spektor & Dipanjan Chowdhury

Department of Chemistry and Purdue University Center for Cancer Research, Purdue University, West Lafayette, IN, USA

Humphrey Lotana & Kavita Shah

Translational Science and Therapeutics Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA

Richard G. Ivey, Jacob J. Kennedy, Jeffrey R. Whiteaker & Amanda G. Paulovich

Department of Biostatistics and Translational Medicine, Medical University of Łódź, Łódź, Poland

Broad Institute of Harvard and MIT, Cambridge, MA, USA

Kaimeng Huang, Alan D. D’Andrea, Alexander Spektor & Dipanjan Chowdhury

Department of Biological Chemistry & Molecular Pharmacology, Harvard Medical School, Boston, MA, USA

Dipanjan Chowdhury

You can also search for this author in PubMed   Google Scholar

Contributions

X.-F.Z., A. Sarkar., A. Spektor. and D.C. conceived the project and designed the experiments. X.-F.Z. and A. Sarkar performed the majority of experiments and associated analyses except as listed below. H.L. expressed relevant proteins and conducted the kinase activity assays for CDK5–cyclin B1, CDK5–p35 and CDK5(S46) variant complexes under the guidance of K.S.; A. Syed performed structural modelling and analysis. R.G.I., J.J.K. and J.R.W. performed MS and analysis. B.T. and H.N. performed MS data analyses. K.H. provided guidance to screen CDK5(as) knocked-in clones and performed sequence analysis to confirm CDK5(as) knock-in. F.L. and A.D.D. provided reagents and discussion on CDK5 substrates analyses. X.-F.Z., A. Sarkar, A. Spektor and D.C. wrote the manuscript with inputs and edits from all of the other authors.

Corresponding authors

Correspondence to Alexander Spektor or Dipanjan Chowdhury .

Ethics declarations

Competing interests.

A.D.D. reports consulting for AstraZeneca, Bayer AG, Blacksmith/Lightstone Ventures, Bristol Myers Squibb, Cyteir Therapeutics, EMD Serono, Impact Therapeutics, PrimeFour Therapeutics, Pfizer, Tango Therapeutics and Zentalis Pharmaceuticals/Zeno Management; is an advisory board member for Cyteir and Impact Therapeutics; a stockholder in Cedilla Therapeutics, Cyteir, Impact Therapeutics and PrimeFour Therapeutics; and reports receiving commercial research grants from Bristol Myers Squibb, EMD Serono, Moderna and Tango Therapeutics. The other authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Yibing Shan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended data fig. 1 inhibition of cdk5 in analogue-sensitive (cdk5- as ) system..

a , Schematics depicting specific inhibition of the CDK5 analogue-sensitive ( as ) variant. Canonical ATP-analogue inhibitor (In, yellow) targets endogenous CDK5 (dark green) at its ATP-binding catalytic site nonspecifically since multiple kinases share structurally similar catalytic sites (left panel). The analogue-sensitive ( as , light green) phenylalanine-to-glycine (F80G) mutation confers a structural change adjacent to the catalytic site of CDK5 that does not impact its catalysis but accommodates the specific binding of a non-hydrolysable bulky orthogonal inhibitor 1NM-PP1(In*, orange). Introduction of 1NM-PP1 thus selectively inhibits CDK5- as variant (right panel). b , Immunoblots showing two clones (Cl 23 and Cl 50) of RPE-1 cells expressing FLAG-HA-CDK5- as in place of endogenous CDK5. Representative results are shown from three independent repeats. c , Proliferation curve of parental RPE-1 and RPE-1 CDK5- as cells. Data represent mean ± s.d. from three independent repeats. p -value was determined by Mann Whitney U test. d , Immunoblots showing immunoprecipitated CDK1-cyclin B1 complex or CDK5- as -cyclin B1 complex by the indicated antibody-coupled agarose, from nocodazole arrested RPE-1 CDK5- as cells with treated with or without 1NM-PP1 for inhibition of CDK5- as , from three independent replicate experiments. e , In-vitro kinase activity quantification of immunoprecipitated complex shown in d . Data represent mean ± s.d. from three independent experiments. p -values were determined by unpaired, two-tailed student’s t-test. f , Immunoblots of RPE-1 CDK5- as cells treated with either DMSO or 1NM-PP1 for 2 h prior to and upon release from RO-3306 and collected at 60 min following release. Cells were lysed and blotted with anti-bodies against indicated proteins (upper panel). Quantification of the relative intensity of PP4R3β phosphorylation at S840 in 1NM-PP1-treated CDK5- as cells compared to DMSO-treatment (lower panel). g , Experimental scheme for specific and temporal abrogation of CDK5 in RPE-1 CDK5- as cells. Data represent mean ± S.D from quadruplicate repeats. p -value was determined by one sample t and Wilcoxon test. h , Hoechst staining showing primary nuclei and micronuclei of RPE-1 CDK5- as with indicated treatment; scale bar is as indicated (left panel). Right, quantification of the percentage of cells with micronuclei after treatment. Data represent mean ± s.d. of three independent experiments from n = 2174 DMSO, n = 1788 1NM-PP1 where n is the number of cells. p- values were determined by unpaired, two-tailed student’s t-test. Scale bar is as indicated. Uncropped gel images are provided in Supplementary Fig. 1 .

Extended Data Fig. 2 Degradation of CDK5 in degradation tag (CDK5- dTAG ) system.

a , Schematic depicting the dTAG-13-inducible protein degradation system. Compound dTAG-13 links protein fused with FKBP12 F36V domain (dTAG) to CRBN-DDB1-CUL4A E3 ligase complex, leading to CRBN-mediated degradation. b , Immunoblots showing two clones of RPE-1 cells that express dTAG -HA-CDK5 in place of endogenous CDK5 (Cl N1 and Cl N4). Representative results are shown from three independent repeats. c , Proliferation curve of parental RPE-1 and RPE-1 CDK5-dTAG. Data represent mean ± s.d. of three independent repeats. p -value was determined by Mann Whitney U test. d and e , Representative images of RPE-1 CDK5- dTAG clone 1 (N1) ( d ) and RPE-1 CDK5- dTAG clone 4 (N4) ( e ) treated with DMSO or dTAG-13 for 2 h prior to and upon release from G2/M arrest and fixed at 120 min after release (top panel); quantification of CDK5 total intensity per cell (lower panels). Data represent mean ± s.d. of at least two independent experiments from n = 100 cells each condition. p- values were determined by unpaired, two-tailed student’s t-test. f , Immunoblots showing level of indicated proteins in RPE-1 CDK5- dTAG cells. Cells were treated with either DMSO or dTAG-13 for 2 h prior to and upon release from RO-3306 and lysed at 60 min following release (upper panel). Quantification of the relative intensity of PP4R3β phosphorylation at S840 in dTAG13-treated CDK5- dTAG cells compared to DMSO-treatment (lower panel). Data represent mean ± s.d. of four independent experiments. p -value was determined by one sample t and Wilcoxon test. g , Experimental scheme for specific and temporal abrogation of CDK5 in RPE-1 CDK5- dTAG cells. h , Hoechst staining showing primary nuclei and micronuclei of RPE-1 CDK5- dTAG with indicated treatment; scale bar is as indicated (left panel). Right, quantification of the percentage of cells with micronuclei after treatment. Data represent mean ± s.d. of three independent experiments from n = 2094 DMSO and n = 2095 dTAG-13, where n is the number of cells. p- values were determined by unpaired, two-tailed student’s t-test. Scale bar is as indicated. Uncropped gel images are provided in Supplementary Fig. 1 .

Extended Data Fig. 3 CDK5 abrogation render chromosome alignment and segregation defect despite intact spindle assembly checkpoint and timely mitotic duration.

a and b , Live-cell imaging snapshots of RPE-1 CDK5- as cells ( a ) and RPE-1 CDK5- dTAG cells ( b ) expressing mCherry-H2B and GFP-α-tubulin, abrogated of CDK5 by treatment with 1NM-PP1 or dTAG-13, respectively. Imaging commenced in prophase following release from RO-3306 into fresh media containing indicated chemicals (left); quantification of the percentage of cells with abnormal nuclear morphology (right). c and d , Representative snapshots of the final frame prior to metaphase-to-anaphase transition from a live-cell imaging experiment detailing chromosome alignment at the metaphase plate of RPE- CDK5- as (c) and RPE-1 CDK5- dTAG ( d ) expressing mCherry-H2B, and GFP-α-tubulin (left); quantification of the percentage of cells displaying abnormal chromosome alignment following indicated treatments (top right). e , Representative images showing the range of depolymerization outcomes (low polymers, high polymers and spindle-like) in DMSO- and 1NM-PP1-treated cells, as shown in Fig. 2e , from n = 50 for each condition, where n is number of metaphase cells . f , Quantifications of mitotic duration from nuclear envelope breakdown (NEBD) to anaphase onset of RPE-1 CDK5- as (left ) and RPE-1 CDK5- dTAG (right) cells, following the indicated treatments. Live-cell imaging of RPE-1 CDK5- as and RPE-1 CDK5- dTAG cells expressing mCherry-H2B and GFP-BAF commenced following release from RO-3306 arrest into fresh media containing DMSO or 1NM-PP1 or dTAG-13. g , Quantifications of the percentage of RPE-1 CDK5- as (left) and RPE-1 CDK5- dTAG (right) cells that were arrested in mitosis following the indicated treatments. Imaging commenced in prophase cells as described in a , following release from RO-3306 into fresh media in the presence or absence nocodazole as indicated. The data in a, c , and g represent mean ± s.d. of at least two independent experiments from n = 85 DMSO and n = 78 1NM-PP1 in a and c ; from n = 40 cells for each treatment condition in g . The data in b , d , and f represent mean ± s.d. of three independent experiments from n = 57 DMSO and n = 64 dTAG-13 in b and d ; from n = 78 DMSO and n = 64 1NM-PP1; n = 59 DMSO and n = 60 dTAG-13, in f , where n is the number of cells. p- values were determined by unpaired, two-tailed student’s t-test. Scale bar is as indicated.

Extended Data Fig. 4 CDK5 and CDK1 regulate tubulin dynamics.

a, b , Immunostaining of RPE-1 cells with antibodies against CDK1 and α-tubulin ( a ); and CDK5 and α-tubulin ( b ) at indicated stages of mitosis. c, d , Manders’ overlap coefficient M1 (CDK1 versus CDK5 on α-tubulin) ( c ); and M2 (α-tubulin on CDK1 versus CDK5) ( d ) at indicated phases of mitosis in cells shown in a and b . The data represent mean ± s.d. of at least two independent experiments from n = 25 cells in each mitotic stage. p- values were determined by unpaired, two-tailed student’s t-test.

Extended Data Fig. 5 Phosphoprotoemics analysis to identify mitotic CDK5 substrates.

a , Scheme of cell synchronization for phosphoproteomics: RPE-1 CDK5- as cells were arrested at G2/M by treatment with RO-3306 for 16 h. The cells were treated with 1NM-PP1 to initiate CDK5 inhibition. 2 h post-treatment, cells were released from G2/M arrest into fresh media with or without 1NM-PP1 to proceed through mitosis with or without continuing inhibition of CDK5. Cells were collected at 60 min post-release from RO-3306 for lysis. b , Schematic for phosphoproteomics-based identification of putative CDK5 substrates. c , Gene ontology analysis of proteins harbouring CDK5 inhibition-induced up-regulated phosphosites. d , Table indicating phospho-site of proteins that are down-regulated as result of CDK5 inhibition. e , Table indicating the likely kinases to phosphorylate the indicated phosphosites of the protein, as predicted by Scansite 4 66 . Divergent score denotes the extent by which phosphosite diverge from known kinase substrate recognition motif, hence higher divergent score indicating the corresponding kinase is less likely the kinase to phosphorylate the phosphosite.

Extended Data Fig. 6 Cyclin B1 is a mitotic co-factor of CDK5 and of CDK1.

a , Endogenous CDK5 was immunoprecipitated from RPE-1 cells collected at time points corresponding to the indicated cell cycle stage. Cell lysate input and elution of immunoprecipitation were immunoblotted by antibodies against the indicated proteins. RPE-1 cells were synchronized to G2 by RO-3306 treatment for 16 h and to prometaphase (M) by nocodazole treatment for 6 h. Asynch: Asynchronous. Uncropped gel images are provided in Supplementary Fig. 1 . b , Immunostaining of RPE-1 cells with antibodies against the indicated proteins at indicated mitotic stages (upper panels). Manders’ overlap coefficient M1 (Cyclin B1 on CDK1) and M2 (CDK1 on Cyclin B1) at indicated mitotic stages for in cells shown in b (lower panels). The data represent mean ± s.d. of at least two independent experiments from n = 25 mitotic cells in each mitotic stage. p- values were determined by unpaired, two-tailed student’s t-test. c , Table listing common proteins as putative targets of CDK5, uncovered from the phosphoproteomics anlaysis of down-regulated phosphoproteins upon CDK5 inhibition (Fig. 3 and Supplementary Table 1 ), and those of cyclin B1, uncovered from phosphoproteomics analysis of down-regulated phospho-proteins upon cyclin B1 degradation (Fig. 6 and Table EV2 in Hegarat et al. EMBO J. 2020). Proteins relevant to mitotic functions are highlighted in red.

Extended Data Fig. 7 Structural prediction and analyses of the CDK5-cyclin B1 complex.

a , Predicted alignment error (PAE) plots of the top five AlphaFold2 (AF2)-predicted models of CDK5-cyclin B1 (top row) and CDK1-cyclin B1 (bottom row) complexes, ranked by interface-predicted template (iPTM) scores. b , AlphaFold2-Multimer-predicted structure of the CDK5-cyclin B1 complex. c , Structural comparison of CDK-cyclin complexes. Left most panel: Structural-overlay of AF2 model of CDK5-cyclin B1 and crystal structure of phospho-CDK2-cyclin A3-substrate complex (PDB ID: 1QMZ ). The zoomed-in view of the activation loops of CDK5 and CDK2 is shown in the inset. V163 (in CDK5), V164 (in CDK2) and Proline at +1 position in the substrates are indicated with arrows. Middle panel: Structural-overlay of AF2 model of CDK5-cyclin B1 and crystal structure of CDK1-cyclin B1-Cks2 complex (PDB ID: 4YC3 ). The zoomed-in view of the activation loops of CDK5 and CDK1 is shown in the inset. Cks2 has been removed from the structure for clarity. Right most panel: structural-overlay of AF2 models of CDK5-cyclin B1 and CDK1-cyclin B1 complex. The zoomed view of the activation loops of CDK5 and CDK1 is shown in the inset. d , Secondary structure elements of CDK5, cyclin B1 and p25. The protein sequences, labelled based on the structural models, are generated by PSPript for CDK5 (AF2 model) ( i ), cyclin B1 (AF2 model) ( ii ) and p25 (PDB ID: 3O0G ) ( iii ). Structural elements ( α , β , η ) are defined by default settings in the program. Key loops highlighted in Fig. 4d are mapped onto the corresponding sequence.

Extended Data Fig. 8 Phosphorylation of CDK5 S159 is required for kinase activity and mitotic fidelity.

a , Structure of the CDK5-p25 complex (PDB ID: 1h41 ). CDK5 (blue) interacts with p25 (yellow). Serine 159 (S159, magenta) is in the T-loop. b , Sequence alignment of CDK5 and CDK1 shows that S159 in CDK5 is the analogous phosphosite as that of T161 in CDK1 for T-loop activation. Sequence alignment was performed by CLC Sequence Viewer ( https://www.qiagenbioinformatics.com/products/clc-sequence-viewer/ ). c , Immunoblots of indicated proteins in nocodazole-arrested mitotic (M) and asynchronous (Asy) HeLa cell lysate. d , Myc-His-tagged CDK5 S159 variants expressed in RPE-1 CDK5- as cells were immunoprecipitated from nocodazole-arrested mitotic lysate by Myc-agarose. Input from cell lysate and elution from immunoprecipitation were immunoblotted with antibodies against indicated protein. EV= empty vector. In vitro kinase activity assay of the indicated immunoprecipitated complex shown on the right panel. Data represent mean ± s.d. of four independent experiments. p -values were determined by unpaired two-tailed student’s t-test. e , Immunoblots showing RPE-1 FLAG-CDK5- as cells stably expressing Myc-His-tagged CDK5 WT and S159A, which were used in live-cell imaging and immunofluorescence experiments to characterize chromosome alignment and spindle architecture during mitosis, following inhibition of CDK5- as by 1NM-PP1, such that only the Myc-His-tagged CDK5 WT and S159A are not inhibited. Representative results are shown from three independent repeats. f , Hoechst staining showing nuclear morphology of RPE-1 CDK5- as cells expressing indicated CDK5 S159 variants following treatment with either DMSO or 1NMP-PP1 and fixation at 120 min post-release from RO-3306-induced arrest (upper panel); quantification of nuclear circularity and solidity (lower panels) g , Snapshots of live-cell imaging RPE-1 CDK5- as cells expressing indicated CDK5 S159 variant, mCherry-H2B, and GFP-α-tubulin, after release from RO-3306-induced arrest at G2/M, treated with 1NM-PP1 2 h prior to and upon after release from G2/M arrest (upper panel); quantification of cells displaying abnormal chromosome alignment in (lower panel). Representative images are shown from two independent experiments, n = 30 cells each cell line. h , Representative images of RPE-1 CDK5- as cells expressing indicated CDK5 S159 variants in metaphase, treated with DMSO or 1NM-PP1 for 2 h prior to and upon release from RO-3306-induced arrest, and then released into media containing 20 µM proTAME for 2 h, fixed and stained with tubulin and DAPI (upper panel); metaphase plate width and spindle length measurements for these representative cells were shown in the table on right; quantification of metaphase plate width and spindle length following the indicated treatments (lower panel). Data in f and h represent mean ± s.d. of at least two independent experiments from n = 486 WT, n = 561 S159A, and n = 401 EV, where n is the number of cells in f ; from n = 65 WT, n = 64 S159A, and n = 67 EV, where n is the number of cells in h . Scale bar is as indicated. Uncropped gel images are provided in Supplementary Fig. 1 .

Extended Data Fig. 9 The CDK5 co-factor-binding helix regulates CDK5 kinase activity.

a , Structure of the CDK5-p25 complex (PDB ID: 1h41 ). CDK5 (blue) interacts with p25 (yellow) at the PSSALRE helix (green). Serine 46 (S46, red) is in the PSSALRE helix. Serine 159 (S159, magenta) is in the T-loop. b , Sequence alignment of CDK5 and CDK1 shows that S46 is conserved in CDK1 and CDK5. Sequence alignment was performed by CLC Sequence Viewer ( https://www.qiagenbioinformatics.com/products/clc-sequence-viewer/ ). c , Immunoblots of CDK5 immunoprecipitation from lysate of E. coli BL21 (DE3) expressing His-tagged human CDK5 WT or CDK5 S46D, mixed with lysate of E. coli BL21 (DE3) expressing His-tagged human cyclin B1. Immunoprecipitated CDK5 alone or in the indicated complex were used in kinase activity assay, shown in Fig. 5b . Representative results are shown from three independent repeats. d , Immunoblots showing RPE-1 FLAG-CDK5- as cells stably expressing Myc-His-tagged CDK5 S46 phospho-variants, which were used in live-cell imaging and immunofluorescence experiments to characterize chromosome alignment and spindle architecture during mitosis, following inhibition of CDK5- as by 1NM-PP1, such that only the Myc-His-tagged CDK5 S46 phospho-variants are not inhibited. Representative results are shown from three independent repeats. e , Immunostaining of RPE-1 CDK5- as cells expressing Myc-His-tagged CDK5 WT or S46D with anti-PP4R3β S840 (pS840) antibody following indicated treatment (DMSO vs 1NM-PP1). Scale bar is as indicated (left). Normalized intensity level of PP4R3β S840 phosphorylation (right). Data represent mean ± s.d. of at least two independent experiments from n = 40 WT and n = 55 S46D, where n is the number of metaphase cells. p- values were determined by unpaired two-tailed student’s t-test. f , Immunoblots showing level of indicated proteins in RPE-1 CDK5- as cells expressing Myc-His-tagged CDK5 WT or S46D. Cells were treated with either DMSO or 1NM-PP1 for 2 h prior to and upon release from RO-3306 and collected and lysed at 60 min following release (left). Quantification of the intensity of PP4R3β phosphorylation at S840 (right). Data represent mean ± s.d. of four independent experiments. p -values were determined by two-tailed one sample t and Wilcoxon test. g , Representative snapshots of live-cell imaging of RPE-1 CDK5- as cells harbouring indicated CDK5 S46 variants expressing mCherry-H2B and GFP-α-tubulin, treated with 1NM-PP1, as shown in Fig. 5d , from n = 35 cells. Imaging commenced in prophase following release from RO-3306 into fresh media containing indicated chemicals. Uncropped gel images are provided in Supplementary Fig. 1 .

Extended Data Fig. 10 Localization of CDK5 S46 phospho-variants.

Immunostaining of RPE-1 CDK5- as cells stably expressing Myc-His CDK5-WT ( a ), S46A ( b ), and S46D ( c ) with antibodies against indicated protein in prophase, prometaphase, and metaphase. Data represent at least two independent experiments from n = 25 cells of each condition in each mitotic stage.

Extended Data Fig. 11 RPE-1 harbouring CDK5- as introduced by CRISPR-mediated knock-in recapitulates chromosome mis-segregation defects observed in RPE-1 overexpressing CDK5- as upon inhibition of CDK5- as by 1NM-PP1 treatment.

a , Chromatogram showing RPE-1 that harbours the homozygous CDK5- as mutation F80G introduced by CRISPR-mediated knock-in (lower panel), replacing endogenous WT CDK5 (upper panel). b , Immunoblots showing level of CDK5 expressed in parental RPE-1 and RPE-1 that harbours CDK5- as F80G mutation in place of endogenous CDK5. c , Representative images of CDK5- as knocked-in RPE-1 cells exhibiting lagging chromosomes following indicated treatments. d , Quantification of percentage of cells exhibiting lagging chromosomes following indicated treatments shown in (c). Data represent mean ± s.d. of three independent experiments from n = 252 DMSO, n = 220 1NM-PP1, where n is the number of cells. p -value was determined by two-tailed Mann Whitney U test.

Extended Data Fig. 12 CDK5 is highly expressed in post-mitotic neurons and overexpressed in cancers.

a , CDK5 RNAseq expression in tumours (left) with matched normal tissues (right). The data are analysed using 22 TCGA projects. Note that CDK5 expression is higher in many cancers compared to the matched normal tissues. BLCA, urothelial bladder carcinoma; BRCA, breast invasive carcinoma; CESC cervical squamous cell carcinoma and endocervical adenocarcinoma; CHOL, cholangiocarcinoma; COAD, colon adenocarcinoma; ESCA, esophageal carcinoma; HNSC, head and neck squamous cell carcinoma; KICH, kidney chromophobe; KIRC, kidney renal clear cell carcinoma; KIRP, kidney renal papillary cell carcinoma; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; PAAD, pancreatic adenocarcinoma; PCPG, pheochromocytoma and paraganglioma; PRAD, prostate adenocarcinoma; READ, rectum adenocarcinoma; SARC, sarcoma; STAD, stomach adenocarcinoma; THCA, thyroid carcinoma; THYM, thymoma; and UCEC, uterine corpus endometrial carcinoma. p -value was determined by two-sided Student’s t-test. ****: p <= 0.0001; ***: p <= 0.001; **: p <= 0.01; *: p <= 0.05; ns: not significant, p  > 0.05. b , Scatter plots showing cells of indicated cancer types that are more dependent on CDK5 and less dependent on CDK1. Each dot represents a cancer cell line. The RNAi dependency data (in DEMETER2) for CDK5 and CDK1 were obtained from the Dependency Map ( depmap.org ). The slope line represents a simple linear regression analysis for the indicated cancer type. The four indicated cancer types (Head/Neck, Ovary, CNS/Brain, and Bowel) showed a trend of more negative CDK5 RNAi effect scores (indicative of more dependency) with increasing CDK1 RNAi effect scores (indicative of less dependency). The p value represents the significance of the correlation computed from a simple linear regression analysis of the data. Red circle highlights the subset of the cells that are relatively less dependent on CDK1 but more dependent on CDK5. c , Scatter plots showing bowel cancer cells that expresses CDK5 while being less dependent on CDK1. Each dot represents a cancer cell line. The data on gene effect of CDK1 CRISPR and CDK5 mRNA level were obtained from the Dependency Map ( depmap.org ). The slope line represents a simple linear regression analysis. Red circle highlights the subset of cells that are relatively less dependent on CDK1 but expresses higher level of CDK5. For b and c , solid line represents the best-fit line from simple linear regression using GraphPad Prism. Dashed lines represent 95% confidence bands of the best-fit line. p -value is determined by the F test testing the null hypothesis that the slope is zero. d , Scatter plots showing rapidly dividing cells of indicated cancer types that are more dependent on CDK5 and less dependent on CDK1. Each dots represents a cancer cell line. The doubling time data on the x-axis were obtained from the Cell Model Passports ( cellmodelpassports.sanger.ac.uk ). The RNAi dependency data (in DEMETER2) for CDK5, or CDK1, on the y-axis were obtained from the Dependency Map ( depmap.org ). Only cell lines with doubling time of less than 72 h are displayed and included for analysis. Each slope line represents a simple linear regression analysis for each cancer type. The indicated three cancer types were analysed and displayed because they showed a trend of faster proliferation rate (lower doubling time) with more negative CDK5 RNAi effect (more dependency) but increasing CDK1 RNAi effect (less dependency) scores. 
The p value represents the significance of the association of the three cancer types combined, computed from a multiple linear regression analysis of the combined data, using cancer type as a covariate. Red circle depicts subset of fast dividing cells that are relatively more dependent on CDK5 (left) and less dependent on CDK1 (right). Solid lines represent the best-fit lines from individual simple linear regressions using GraphPad Prism. p -value is for the test with the null hypothesis that the effect of the doubling time is zero from the multiple linear regression RNAi ~ Intercept + Doubling Time (hours) + Lineage.

Supplementary information

Supplementary figure 1.

Full scanned images of all western blots.

Reporting Summary

Peer review file, supplementary table 1.

Phosphosite changes in 1NM-PP1-treated cells versus DMSO-treated controls as measured by LC–MS/MS.

Supplementary Table 2

Global protein changes in 1NM-PP1-treated cells versus DMSO-treated controls as measured by LC–MS/MS.

Supplementary Video 1

RPE-1 CDK5(as) cell after DMSO treatment, ×100 imaging.

Supplementary Video 2

RPE-1 CDK5(as) cell after 1NM-PP1 treatment (example 1), ×100 imaging.

Supplementary Video 3

RPE-1 CDK5(as) cell after 1NM-PP1 treatment (example 2), ×100 imaging.

Supplementary Video 4

RPE-1 CDK5(dTAG) cell after DMSO treatment, ×100 imaging.

Supplementary Video 5

RPE-1 CDK5(dTAG) cell after dTAG-13 treatment (example 1), ×100 imaging.

Supplementary Video 6

RPE-1 CDK5(dTAG) cell after dTAG-13 treatment (example 2) ×100 imaging.

Supplementary Video 7

RPE-1 CDK5(as) cells expressing MYC-CDK5(WT) after 1NM-PP1 treatment, ×20 imaging.

Supplementary Video 8

RPE-1 CDK5(as) cells expressing MYC-EV after 1NM-PP1 treatment, ×20 imaging.

Supplementary Video 9

RPE-1 CDK5(as) cells expressing MYC-CDK5(S159A) after 1NM-PP1 treatment (example 1), ×20 imaging.

Supplementary Video 10

RPE-1 CDK5(as) cells expressing MYC-CDK5(S159A) after 1NM-PP1 treatment (example 2), ×20 imaging.

Supplementary Video 11

RPE-1 CDK5(as) cells expressing MYC-CDK5(WT) after 1NM-PP1 treatment, ×100 imaging.

Supplementary Video 12

RPE-1 CDK5(as) cells expressing MYC-CDK5(S46A) after 1NM-PP1 treatment (example 1), ×100 imaging.

Supplementary Video 13

RPE-1 CDK5(as) cells expressing MYC-CDK5(S46A) after 1NM-PP1 treatment (example 2), ×100 imaging.

Supplementary Video 14

RPE-1 CDK5(as) cells expressing MYC-CDK5(S46D) after 1NM-PP1 treatment (example 1), ×100 imaging.

Supplementary Video 15

RPE-1 CDK5(as) cells expressing MYC-CDK5(S46D) after 1NM-PP1 treatment (example 2), ×100 imaging.

Supplementary Video 16

RPE-1 CDK5(as) cells expressing MYC-EV after 1NM-PP1 treatment,×100 imaging.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article.

Zheng, XF., Sarkar, A., Lotana, H. et al. CDK5–cyclin B1 regulates mitotic fidelity. Nature (2024). https://doi.org/10.1038/s41586-024-07888-x

Download citation

Received : 24 March 2023

Accepted : 30 July 2024

Published : 04 September 2024

DOI : https://doi.org/10.1038/s41586-024-07888-x

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

By submitting a comment you agree to abide by our Terms and Community Guidelines . If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

hypothesis test in simple linear regression

A comprehensive comparison of goodness-of-fit tests for logistic regression models

  • Original Paper
  • Published: 30 August 2024
  • Volume 34 , article number  175 , ( 2024 )

Cite this article

hypothesis test in simple linear regression

  • Huiling Liu 1 ,
  • Xinmin Li 2 ,
  • Feifei Chen 3 ,
  • Wolfgang Härdle 4 , 5 , 6 &
  • Hua Liang 7  

We introduce a projection-based test for assessing logistic regression models using the empirical residual marked empirical process and suggest a model-based bootstrap procedure to calculate critical values. We comprehensively compare this test and Stute and Zhu’s test with several commonly used goodness-of-fit (GoF) tests: the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, Osius–Rojek test, and Stukel test for logistic regression models in terms of type I error control and power performance in small ( \(n=50\) ), moderate ( \(n=100\) ), and large ( \(n=500\) ) sample sizes. We assess the power performance for two commonly encountered situations: nonlinear and interaction departures from the null hypothesis. All tests except the modified Hosmer–Lemeshow test and Osius–Rojek test have the correct size in all sample sizes. The power performance of the projection based test consistently outperforms its competitors. We apply these tests to analyze an AIDS dataset and a cancer dataset. For the former, all tests except the projection-based test do not reject a simple linear function in the logit, which has been illustrated to be deficient in the literature. For the latter dataset, the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, and Osius–Rojek test fail to detect the quadratic form in the logit, which was detected by the Stukel test, Stute and Zhu’s test, and the projection-based test.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save.

  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime

Price includes VAT (Russian Federation)

Instant access to the full article PDF.

Rent this article via DeepDyve

Institutional subscriptions

hypothesis test in simple linear regression

Similar content being viewed by others

hypothesis test in simple linear regression

A generalized Hosmer–Lemeshow goodness-of-fit test for a family of generalized linear models

hypothesis test in simple linear regression

Fifty Years with the Cox Proportional Hazards Regression Model

hypothesis test in simple linear regression

CPMCGLM: an R package for p -value adjustment when looking for an optimal transformation of a single explanatory variable in generalized linear models

Explore related subjects.

  • Artificial Intelligence

Data availibility

No datasets were generated or analysed during the current study.

Chen, K., Hu, I., Ying, Z.: Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. Ann. Stat. 27 (4), 1155–1163 (1999)

Article   MathSciNet   Google Scholar  

Dardis, C.: LogisticDx: diagnostic tests and plots for logistic regression models. R package version 0.3 (2022)

Dikta, G., Kvesic, M., Schmidt, C.: Bootstrap approximations in model checks for binary data. J. Am. Stat. Assoc. 101 , 521–530 (2006)

Ekanem, I.A., Parkin, D.M.: Five year cancer incidence in Calabar, Nigeria (2009–2013). Cancer Epidemiol. 42 , 167–172 (2016)

Article   Google Scholar  

Escanciano, J.C.: A consistent diagnostic test for regression models using projections. Economet. Theor. 22 , 1030–1051 (2006)

Härdle, W., Mammen, E., Müller, M.: Testing parametric versus semiparametric modeling in generalized linear models. J. Am. Stat. Assoc. 93 , 1461–1474 (1998)

MathSciNet   Google Scholar  

Harrell, F.E.: rms: Regression modeling strategies. R package version 6.3-0 (2022)

Hosmer, D.W., Hjort, N.L.: Goodness-of-fit processes for logistic regression: simulation results. Stat. Med. 21 (18), 2723–2738 (2002)

Hosmer, D.W., Lemesbow, S.: Goodness of fit tests for the multiple logistic regression model. Commun Stat Theory Methods 9 , 1043–1069 (1980)

Hosmer, D.W., Hosmer, T., Le Cessie, S., Lemeshow, S.: A comparison of goodness-of-fit tests for the logistic regression model. Stat. Med. 16 (9), 965–980 (1997)

Hosmer, D., Lemeshow, S., Sturdivant, R.: Applied Logistic Regression. Wiley Series in Probability and Statistics, Wiley, New York (2013)

Book   Google Scholar  

Jones, L.K.: On a conjecture of Huber concerning the convergence of projection pursuit regression. Ann. Stat. 15 , 880–882 (1987)

Kohl, M.: MKmisc: miscellaneous functions from M. Kohl. R package version, vol. 1, p. 8 (2021)

Kosorok, M.R.: Introduction to Empirical Processes and Semiparametric Inference, vol. 61. Springer, New York (2008)

Lee, S.-M., Tran, P.-L., Li, C.-S.: Goodness-of-fit tests for a logistic regression model with missing covariates. Stat. Methods Med. Res. 31 , 1031–1050 (2022)

Lindsey, J.K.: Applying Generalized Linear Models. Springer, Berlin (2000)

McCullagh, P., Nelder, J.A.: Generalized Linear Models, vol. 37. Chapman and Hall (1989)

Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. J. R. Stat. Soc. Ser. A 135 , 370–384 (1972)

Oguntunde, P.E., Adejumo, A.O., Okagbue, H.I.: Breast cancer patients in Nigeria: data exploration approach. Data Brief 15 , 47 (2017)

Osius, G., Rojek, D.: Normal goodness-of-fit tests for multinomial models with large degrees of freedom. J. Am. Stat. Assoc. 87 (420), 1145–1152 (1992)

Rady, E.-H.A., Abonazel, M.R., Metawe’e, M.H.: A comparison study of goodness of fit tests of logistic regression in R: simulation and application to breast cancer data. Appl. Math. Sci. 7 , 50–59 (2021)

Google Scholar  

Stukel, T.A.: Generalized logistic models. J. Am. Stat. Assoc. 83 (402), 426–431 (1988)

Stute, W., Zhu, L.-X.: Model checks for generalized linear models. Scand. J. Stat. Theory Appl. 29 , 535–545 (2002)

van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer (1996)

van Heel, M., Dikta, G., Braekers, R.: Bootstrap based goodness-of-fit tests for binary multivariate regression models. J. Korean Stat. Soc. 51 (1), 308–335 (2022)

Yin, C., Zhao, L., Wei, C.: Asymptotic normality and strong consistency of maximum quasi-likelihood estimates in generalized linear models. Sci. China Ser. A Math. 49 , 145–157 (2006)

Download references

Acknowledgements

Li’s research was partially supported by NNSFC grant 11871294. Härdle gratefully acknowledges support through the European Cooperation in Science & Technology COST Action grant CA19130 - Fintech and Artificial Intelligence in Finance - Towards a transparent financial industry; the project “IDA Institute of Digital Assets”, CF166/15.11.2022, contract number CN760046/ 23.05.2024 financed under the Romanias National Recovery and Resilience Plan, Apel nr. PNRR-III-C9-2022-I8; and the Marie Skłodowska-Curie Actions under the European Union’s Horizon Europe research and innovation program for the Industrial Doctoral Network on Digital Finance, acronym DIGITAL, Project No. 101119635

Author information

Authors and affiliations.

Department of Statistics, South China University of Technology, Guangzhou, China

Huiling Liu

School of Mathematics and Statistics, Qingdao University, Shandong, 266071, China

Center for Statistics and Data Science, Beijing Normal University, Zhuhai, 519087, China

Feifei Chen

BRC Blockchain Research Center, Humboldt-Universität zu Berlin, 10178, Berlin, Germany

Wolfgang Härdle

Dept Information Management and Finance, National Yang Ming Chiao Tung U, Hsinchu, Taiwan

IDA Institute Digital Assets, Bucharest University of Economic Studies, Bucharest, Romania

Department of Statistics, George Washington University, Washington, DC, 20052, USA

You can also search for this author in PubMed   Google Scholar

Contributions

LHL, LXM and LH wrote the main manuscript text, LHL and CFF program, HW commented on the methodological section. All authors reviewed the manuscript.

Corresponding author

Correspondence to Hua Liang .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Liu, H., Li, X., Chen, F. et al. A comprehensive comparison of goodness-of-fit tests for logistic regression models. Stat Comput 34 , 175 (2024). https://doi.org/10.1007/s11222-024-10487-5

Download citation

Received : 02 December 2023

Accepted : 19 August 2024

Published : 30 August 2024

DOI : https://doi.org/10.1007/s11222-024-10487-5

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Consistent test
  • Model based bootstrap (MBB)
  • Residual marked empirical process (RMEP)
  • Find a journal
  • Publish with us
  • Track your research

IMAGES

  1. 12. Simple Linear Regression Analysis

    hypothesis test in simple linear regression

  2. Hypothesis Test for Simple Linear Regession

    hypothesis test in simple linear regression

  3. Hypothesis Testing On Linear Regression

    hypothesis test in simple linear regression

  4. PPT

    hypothesis test in simple linear regression

  5. 11.6. Simple Linear Regression: Hypothesis Testing

    hypothesis test in simple linear regression

  6. Simple Linier Regression

    hypothesis test in simple linear regression

VIDEO

  1. Lecture 5. Hypothesis Testing In Simple Linear Regression Model

  2. Simple linear regression hypothesis testing

  3. Hypothesis Testing in Simple Linear Regression

  4. Introduction to Hypothesis Testing Part 2

  5. Hypothesis testing: Linear & Multiple Regression in R

  6. Hypothesis Test for Linear Regression

COMMENTS

  1. 12.2.1: Hypothesis Test for Linear Regression

    The two test statistic formulas are algebraically equal; however, the formulas are different and we use a different parameter in the hypotheses. The formula for the t-test statistic is t = b1 (MSE SSxx)√ t = b 1 ( M S E S S x x) Use the t-distribution with degrees of freedom equal to n − p − 1 n − p − 1.

  2. 3.3.4: Hypothesis Test for Simple Linear Regression

    3.3: Correlation and Linear Regression 3.3.4: Hypothesis Test for Simple Linear Regression Expand/collapse global location

  3. PDF Chapter 9 Simple Linear Regression

    9.2 Statistical hypotheses. For simple linear regression, the chief null hypothesis is \(H_0: \beta_1 = 0\), and the corresponding alternative hypothesis is \(H_1: \beta_1 \neq 0\). If this null hypothesis is true, then, from \(E(Y) = \beta_0 + \beta_1 x\) we can see that the population mean of \(Y\) is \(\beta_0\) for every \(x\) value ...

  4. 6.4

    For the simple linear regression model, there is only one slope parameter about which one can perform hypothesis tests. For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are: hypothesis test for testing that all of the slope parameters are 0 ...

  5. Linear regression hypothesis testing: Concepts, Examples

    This essentially means that the value of all the coefficients is equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0. Determine the test statistics: The next step is to determine the test statistics and calculate the value.

  6. Simple Linear Regression

    Regression allows you to estimate how a dependent variable changes as the independent variable (s) change. Simple linear regression example. You are a social researcher interested in the relationship between income and happiness. You survey 500 people whose incomes range from 15k to 75k and ask them to rank their happiness on a scale from 1 to ...

  7. 17.1: Simple linear regression

    Regression and correlation both test linear hypotheses: we state that the relationship between two variables is linear (the alternate hypothesis) ... change the object name yourself. The example shown in Figure \(\PageIndex{1}\) is a simple linear regression, with Body.Mass as the Y variable and Matings the X variable. No other information need ...

  8. Linear regression

    The lecture is divided in two parts: in the first part, we discuss hypothesis testing in the normal linear regression model, in which the OLS estimator of the coefficients has a normal distribution conditional on the matrix of regressors; in the second part, we show how to carry out hypothesis tests in linear regression analyses where the ...

  9. Lesson 1: Simple Linear Regression

    Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables. ... Let's perform the hypothesis test on the husband's age and wife's age data in which the sample correlation based on n = 170 couples is r = 0.939. To test \(H_{0} \colon \rho = 0\) against the ...

  10. Lesson 1: Simple Linear Regression

    1.9 - Hypothesis Test for the Population Correlation Coefficient; 1.10 - Further Examples; Software Help 1. Minitab Help 1: Simple Linear Regression; R Help 1: Simple Linear Regression; Lesson 2: SLR Model Evaluation. 2.1 - Inference for the Population Intercept and Slope; 2.2 - Another Example of Slope Inference; 2.3 - Sums of Squares; 2.4 ...

  11. Simple linear regression

    Interpreting the hypothesis test: If we reject the null hypothesis, can we assume there is an exact linear relationship? No. A quadratic relationship may be a better fit, for example. This test assumes the simple linear regression model is correct, which precludes a quadratic relationship.

  12. Hypothesis Test for Regression Slope

    Hypothesis Test for Regression Slope. This lesson describes how to conduct a hypothesis test to determine whether there is a significant linear relationship between an independent variable X and a dependent variable Y. The test focuses on the slope of the regression line \(Y = \beta_0 + \beta_1 X\), where \(\beta_0\) is a constant, \(\beta_1\) is the slope (also called the regression coefficient), and X is the value of ...

  13. PDF Lecture 15. Hypothesis testing in the linear model

    15. Hypothesis testing in the linear model; 15.7. Simple linear regression. We assume that \(Y_i = a_0 + b(x_i - \bar{x}) + \varepsilon_i\), \(i = 1, \ldots, n\), where \(\bar{x} = \sum x_i / n\) and the \(\varepsilon_i\), \(i = 1, \ldots, n\), are iid \(N(0, \sigma^2)\). Suppose we want to test the hypothesis \(H_0: b = 0\), i.e. no linear relationship. From Lecture 14 we have seen how to construct a confidence ...

  14. PDF Chapter 6 Hypothesis Testing

    Simple Linear Regression, independent variable (x). The output of a regression is a function that predicts the dependent variable based upon values of the independent variables. ... For hypothesis testing, research questions are stated as statements: this is the null hypothesis ...

  15. 17.5: Testing regression coefficients

    Testing whether a linear regression coefficient is statistically significant, for one or two slopes. ... then this is a Simple Linear Regression; if more than one predictor is entered, then this is a Multiple Linear Regression. ... note that the regression test in the ANOVA is the same (has the same probability vs. the null hypothesis) as the ...

  16. Understanding the Null Hypothesis for Linear Regression

    x: The value of the predictor variable. Simple linear regression uses the following null and alternative hypotheses: H0: β1 = 0. HA: β1 ≠ 0. The null hypothesis states that the coefficient β1 is equal to zero. In other words, there is no statistically significant relationship between the predictor variable, x, and the response variable, y.

  17. Simple Linear Regression, hypothesis tests

    A 12 minute video introducing the default hypothesis tests of the intercept and slope in simple linear regression.

  18. Understanding the t-Test in Linear Regression

    Whenever we perform linear regression, we want to know if there is a statistically significant relationship between the predictor variable and the response variable. We test for significance by performing a t-test for the regression slope. We use the following null and alternative hypotheses for this t-test: \(H_0: \beta_1 = 0\) (the slope is equal to ...

  19. 4 Linear Regression

    Fitting a Linear Model. In order to fit linear regression models in R, lm can be used for linear models, which are specified symbolically. A typical model takes the form of `response ~ predictors`, where `response` is the (numeric) response vector and `predictors` is a series of predictor variables (the second sketch after this list shows a Python analogue of this formula syntax).

  20. How to Simplify Hypothesis Testing for Linear Regression in Python

    A Quick Reminder Regarding Linear Regression. Before I share the 4 assumptions that should be met in order to run a linear regression hypothesis test, there is one important point to keep in mind regarding linear regression. Linear regression can be thought of as a dual purpose tool: To predict future values for the y variable

  21. Hypothesis Testing On Linear Regression

    We will use Hypothesis Testing on β₁ for the same. Steps to Perform Hypothesis testing: Step 1: We start by saying that β₁ is not significant, i.e., there is no relationship between x and y ...

  22. 15.5: Hypothesis Tests for Regression Models

    Formally, our "null model" corresponds to the fairly trivial "regression" model in which we include 0 predictors, and only include the intercept term b 0. H 0:Y i =b 0 +ϵ i. If our regression model has K predictors, the "alternative model" is described using the usual formula for a multiple regression model: H1: Yi = (∑K k=1 ...

  23. PDF Lecture 5 Hypothesis Testing in Multiple Linear Regression

    As in simple linear regression, under the null hypothesis \(t_0 = \hat{\beta}_j / \widehat{se}(\hat{\beta}_j) \sim t_{n-p-1}\). We reject \(H_0\) if \(|t_0| > t_{n-p-1,\,1-\alpha/2}\). This is a partial test because \(\hat{\beta}_j\) depends on all of the other predictors \(x_i\), \(i \neq j\), that are in the model. Thus, this is a test of the contribution of \(x_j\) given the other predictors in the model.

  24. CDK5-cyclin B1 regulates mitotic fidelity

    The slope line represents a simple linear regression analysis for the indicated cancer type. ... p-value is for the test with the null hypothesis that the effect of the doubling time is zero from ...

  25. 7.2: Confidence interval and hypothesis tests for the slope and

    In Chapter 6, the relationship between Hematocrit and body fat % for females appeared to be a weak negative linear association. The 95% confidence interval for the slope is -0.186 to 0.0155. For a 1% increase in body fat %, we are 95% confident that the change in the true mean Hematocrit is between -0.186 and 0.0155% of blood.

  26. A comprehensive comparison of goodness-of-fit tests for logistic regression models

    We introduce a projection-based test for assessing logistic regression models using the empirical residual marked empirical process and suggest a model-based bootstrap procedure to calculate critical values. We comprehensively compare this test and Stute and Zhu's test with several commonly used goodness-of-fit (GoF) tests: the Hosmer-Lemeshow test, modified Hosmer-Lemeshow test, Osius ... (Liu, H., Li, X., Chen, F. et al. Stat Comput 34, 175 (2024). https://doi.org/10.1007/s11222-024-10487-5)

  27. Testing the assumptions underlying multiple regression.

    The F test was used to test the hypothesis of linearity of regression. The hypothesis was rejected at the 5% level for only one out of 24 correlation tables used. The chi square test for goodness of fit employed to test the normality of the marginal distributions rejected the null hypothesis in 25 of the 65 tests at the 5% level.

  28. 17: Linear Regression

    Introduction. Regression is a toolkit for developing models of cause and effect between one ratio-scale dependent response variable and one (simple linear regression) or more (multiple linear regression) ratio-scale independent predictor variables. By convention the dependent variable(s) is denoted by \(Y\) and the independent variable(s) by \(X_{1}, X_{2}, \ldots\)

  29. 14.4: Hypothesis Test for Simple Linear Regression

    This page titled 14.4: Hypothesis Test for Simple Linear Regression is shared under a CC BY-SA 4.0 license and was authored, remixed, and/or curated by Maurice A. Geraghty via source content that was edited to the style and standards of the LibreTexts platform.

  30. 17.8: Assumptions and model diagnostics for simple linear regression

    Consider a simple linear regression first. If \(H_{0}: b = 0\) is not rejected, then the slope of the regression equation is taken to not differ from zero. We would conclude that if repeated samples were drawn from the population, on average, the regression equation would not fit the data well (lots of scatter) and it would not yield useful ...
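
Several of the excerpts above (items 1, 13, 18, 23, and 25) state the same slope t-test in different notations: \(t = b_1 / \sqrt{MSE/SS_{xx}}\) with \(n - p - 1\) degrees of freedom (here \(p = 1\), so \(n - 2\)). The first sketch below works through that arithmetic in Python on synthetic data; the seed, sample size, and true coefficients are illustrative assumptions, not values taken from any of the quoted sources.

```python
# Minimal sketch of the slope t-test for simple linear regression.
# The synthetic data (seed, n, true intercept/slope) are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)  # true slope = 0.5

# Least-squares estimates: b1 = SSxy / SSxx, b0 = ybar - b1 * xbar
x_bar, y_bar = x.mean(), y.mean()
ss_xx = np.sum((x - x_bar) ** 2)
ss_xy = np.sum((x - x_bar) * (y - y_bar))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Mean squared error with n - p - 1 = n - 2 residual degrees of freedom
resid = y - (b0 + b1 * x)
dof = n - 2
mse = np.sum(resid ** 2) / dof

# t-statistic for H0: beta1 = 0, two-sided p-value, and a 95% CI
se_b1 = np.sqrt(mse / ss_xx)            # se(b1) = sqrt(MSE / SSxx)
t0 = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t0), dof)
t_crit = stats.t.ppf(0.975, dof)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(f"b1 = {b1:.3f}, se(b1) = {se_b1:.3f}, t = {t0:.2f}, p = {p_value:.4g}")
print(f"95% CI for beta1: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Rejecting \(H_0\) when \(|t_0| > t_{n-2,\,1-\alpha/2}\) (item 23's rule with a single predictor) gives the same decision as comparing the p-value to \(\alpha\), and the confidence interval mirrors the slope interval quoted in item 25.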
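
Item 19 describes R's symbolic `response ~ predictors` interface, and item 22 frames the overall F-test as a comparison between an intercept-only null model and the full model. This second sketch uses the statsmodels formula API in Python, which mimics R's lm syntax; the column names (x1, x2, y) and the simulated data are assumptions for illustration, not part of the quoted sources.

```python
# Sketch of an R-style formula fit and the overall F-test in Python.
# Column names and simulated data are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
data = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
data["y"] = 1.0 + 0.8 * data["x1"] - 0.3 * data["x2"] + rng.normal(size=50)

# Full model, specified symbolically as in R's lm(y ~ x1 + x2)
full = smf.ols("y ~ x1 + x2", data=data).fit()

# Per-coefficient partial t-tests (item 23) appear in the summary table;
# the overall F compares the full model against the intercept-only
# null model H0: Y_i = b0 + e_i (item 22).
print(full.summary())
print(f"F = {full.fvalue:.2f}, p = {full.f_pvalue:.4g}")
```

A significant overall F says that at least one slope is nonzero; the individual t columns then ask, predictor by predictor, whether each slope contributes given the others already in the model.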