
Linear regression hypothesis testing: Concepts, Examples

Simple linear regression model

In machine learning, linear regression is a predictive modeling technique used to build models that predict a continuous response variable as a linear combination of explanatory or predictor variables. When training linear regression models, we rely on hypothesis testing to determine whether a relationship between the response and predictor variables actually exists. Two types of hypothesis tests are performed for a linear regression model: T-tests and F-tests. In other words, two types of statistics are used to assess whether a linear regression model relating the response and predictor variables is supported by the data: the t-statistic and the f-statistic. As data scientists, it is of utmost importance to determine whether linear regression is the correct choice of model for a particular problem, and this can be done by performing hypothesis testing on the regression coefficients. These concepts are often not very clear to many data scientists. In this blog post, we will discuss linear regression and hypothesis testing related to t-statistics and f-statistics. We will also provide examples to help illustrate how these concepts work.


What are linear regression models?

A linear regression model can be defined as the function approximation that represents a continuous response variable as a function of one or more predictor variables. While building a linear regression model, the goal is to identify a linear equation that best predicts or models the relationship between the response or dependent variable and one or more predictor or independent variables.

There are two different kinds of linear regression models. They are as follows:

  • Simple or univariate linear regression models : These are linear regression models used to establish a linear relationship between one response or dependent variable and one predictor or independent variable. The equation for a simple linear regression model is Y = mX + b, where m is the coefficient of the predictor variable and b is the bias. On the regression line, m represents the slope and b represents the intercept.
  • Multiple or multivariate linear regression models : These are linear regression models used to establish a linear relationship between one response or dependent variable and more than one predictor or independent variable. The equation for a multiple linear regression model is Y = b0 + b1X1 + b2X2 + … + bnXn, where bi represents the coefficient of the ith predictor variable. In this type of linear regression model, each predictor variable has its own coefficient that is used to calculate the predicted value of the response variable.

While training linear regression models, the goal is to determine the coefficients that result in the best-fitting linear regression line. The learning algorithm used to find the most appropriate coefficients is known as least squares regression. In the least-squares regression method, the coefficients are chosen to minimize the least-squares error function, i.e., to minimize the sum of squared residuals between the actual and predicted response values. The sum of squared residuals is also called the residual sum of squares (RSS). The outcome of executing the least-squares regression method is the set of coefficients that minimizes the linear regression cost function.

The residual e_i of the ith observation is defined as follows, where [latex]Y_i[/latex] is the observed value of the response variable for the ith observation and [latex]\hat{Y_i}[/latex] is the predicted value of the response variable for the ith observation.

[latex]e_i = Y_i - \hat{Y_i}[/latex]

The residual sum of squares can be represented as the following:

[latex]RSS = e_1^2 + e_2^2 + e_3^2 + \ldots + e_n^2[/latex]

The least-squares method represents the algorithm that minimizes the above term, RSS.
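To make this concrete, here is a minimal R sketch (it uses the built-in mtcars data rather than the model trained later in this post): the coefficients returned by lm() are exactly the ones that minimize RSS, and any other choice of intercept and slope yields a larger residual sum of squares.

# Minimal R sketch using the built-in mtcars data (not part of this article's example):
# the coefficients returned by lm() minimize the residual sum of squares (RSS).
fit <- lm(mpg ~ wt, data = mtcars)
rss <- sum(residuals(fit)^2)            # RSS = e1^2 + e2^2 + ... + en^2

# Any other intercept/slope pair gives a larger RSS
rss_other <- sum((mtcars$mpg - (40 - 6 * mtcars$wt))^2)
c(least_squares = rss, other_line = rss_other)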

Once the coefficients are determined, can it be claimed that these coefficients are the most appropriate ones for linear regression? The answer is no. After all, the coefficients are only estimates, and thus there will be a standard error associated with each of them. Recall that the standard error is used to calculate the confidence interval within which the population parameter is expected to lie. In other words, it represents the error of estimating a population parameter based on sample data. The standard error of a mean is calculated as the standard deviation of the sample divided by the square root of the sample size. The formula below represents the standard error of a mean.

[latex]SE(\mu) = \frac{\sigma}{\sqrt{N}}[/latex]
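As a quick illustration (a small R sketch with simulated data, not part of the original example), the standard error of a sample mean can be computed directly:

# Standard error of a sample mean: sd(sample) / sqrt(sample size)
set.seed(7)
x <- rnorm(100, mean = 10, sd = 2)   # simulated sample
sd(x) / sqrt(length(x))              # roughly 2 / sqrt(100) = 0.2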

Thus, without analyzing aspects such as the standard error associated with each coefficient, it cannot be claimed that the estimated coefficients are suitable. This is where hypothesis testing is needed. Before we get into why hypothesis testing is needed for the linear regression model, let's briefly review what hypothesis testing is.

Train a Multiple Linear Regression Model using R

Before getting into the hypothesis testing concepts for the linear regression model, let's train a multivariate or multiple linear regression model and print the summary output of the model, which will be referred to in the next section.

The data used for creating the multiple linear regression model is BostonHousing, which can be loaded in RStudio by installing the mlbench package. The code is shown below:

install.packages("mlbench")
library(mlbench)
data("BostonHousing")

Once the data is loaded, the code shown below can be used to create the linear regression model.

attach(BostonHousing)
BostonHousing.lm <- lm(log(medv) ~ crim + chas + rad + lstat)
summary(BostonHousing.lm)

Executing the above commands creates a linear regression model with log(medv) as the response variable and crim, chas, rad, and lstat as the predictor variables. The following are the details of the response and predictor variables:

  • log(medv) : Log of the median value of owner-occupied homes in USD 1000’s
  • crim : Per capita crime rate by town
  • chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • rad : Index of accessibility to radial highways
  • lstat : Percentage of the lower status of the population

The following is the output of the summary command, which prints the details of the model, including the hypothesis testing details for the coefficients (t-statistics) and for the model as a whole (F-statistic).

[Figure: linear regression model summary output in R]
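If you prefer to pull these numbers out of the model programmatically rather than reading them off the printed summary, the coefficient table (estimates, standard errors, t-values and p-values) and the overall F-statistic can be extracted from the summary object. A short sketch continuing the R code above:

# Pull the same numbers out of the fitted model programmatically
s <- summary(BostonHousing.lm)

coef(s)         # Estimate, Std. Error, t value and Pr(>|t|) for each coefficient
s$fstatistic    # overall F-statistic with numerator and denominator degrees of freedom

# p-value of the overall F-test
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)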

Hypothesis tests & Linear Regression Models

A hypothesis test is a statistical procedure used to test a claim or assumption about the underlying distribution of a population based on sample data. Here are the key steps of performing hypothesis tests with linear regression models:

  • Hypothesis formulation for T-tests: In the case of linear regression, the claim is that there exists a relationship between the response and a predictor variable, and the claim is represented by a non-zero coefficient for that predictor variable in the regression model. This is formulated as the alternate hypothesis. Thus, the null hypothesis states that there is no relationship between the response and the predictor variable, i.e., the coefficient of that predictor variable is equal to zero (0). So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis for each test states that a1 = 0, a2 = 0, a3 = 0, etc. For each predictor variable, an individual hypothesis test is performed to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. Thus, if there are, say, 5 features, there will be five hypothesis tests, each with its own null and alternate hypothesis.
  • Hypothesis formulation for the F-test: In addition, a hypothesis test is performed for the claim that a linear regression model relating the response variable to the predictor variables exists at all. The null hypothesis is that the linear regression model does not exist, which essentially means that the values of all the slope coefficients are equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0.
  • F-statistic for testing the hypothesis for the linear regression model: The F-test is used to test the null hypothesis that a linear regression model does not exist, i.e., that there is no relationship between the response variable y and the predictor variables x1, x2, x3, x4 and x5. This null hypothesis can be written as a1 = a2 = a3 = a4 = a5 = 0, where ai is the coefficient of xi. The F-statistic is calculated as a function of the residual sum of squares for the restricted regression (a model with only the intercept or bias, with all slope coefficients set to zero) and the residual sum of squares for the unrestricted regression (the full linear regression model). In the summary output shown above, note the value of the F-statistic (15.66) and its degrees of freedom (5 and 194).
  • Evaluate t-statistics against the critical value/region: After calculating the t-statistic for each coefficient, it is time to decide whether to reject or fail to reject the null hypothesis. In order for this decision to be made, one needs to set a significance level, also known as the alpha level; a significance level of 0.05 is commonly used. If the value of the t-statistic falls in the critical region, or equivalently if the p-value comes out to be less than 0.05, the null hypothesis is rejected.
  • Evaluate the F-statistic against the critical value/region: The value of the F-statistic and its p-value are evaluated to test the null hypothesis that a linear regression model relating the response and predictor variables does not exist. If the value of the F-statistic is greater than the critical value at the 0.05 significance level, the null hypothesis is rejected. This means that a linear model exists with at least one non-zero coefficient.
  • Draw conclusions: The final step of hypothesis testing is to draw a conclusion by interpreting the results in terms of the original claim or hypothesis. If the null hypothesis for a predictor variable is rejected, the relationship between the response and that predictor variable is statistically significant based on the evidence, i.e., the sample data used for training the model; if it is not rejected, the relationship is not statistically significant. Similarly, if the F-statistic lies in the critical region and its p-value is less than the alpha level (usually 0.05), one can conclude that a linear regression model exists. The R sketch after this list reproduces these checks for the model trained above.
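A hedged sketch of these checks in R, reusing the BostonHousing model fitted earlier and assuming the 0.05 significance level mentioned above:

# Sketch: reproduce the t-test and F-test decisions for the model fitted above
s <- summary(BostonHousing.lm)
alpha <- 0.05

# T-tests: t = estimate / standard error, compared with the t critical value
coefs   <- coef(s)
t_stats <- coefs[, "Estimate"] / coefs[, "Std. Error"]   # matches coefs[, "t value"]
df_res  <- df.residual(BostonHousing.lm)
t_crit  <- qt(1 - alpha / 2, df_res)
reject_t <- abs(t_stats) > t_crit        # TRUE -> reject H0: coefficient = 0

# F-test: compare the F-statistic against the critical value of the F distribution
f <- s$fstatistic
f_crit <- qf(1 - alpha, f[2], f[3])
reject_f <- f[1] > f_crit                # TRUE -> reject H0: all slope coefficients = 0

list(reject_t = reject_t, reject_f = unname(reject_f))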

Why hypothesis tests for linear regression models?

The reasons why we need hypothesis tests in the case of a linear regression model are as follows:

  • By creating the model, we are making a claim (a new "truth") about the relationship between the response or dependent variable and one or more predictor or independent variables. To justify this claim, one or more tests are needed. These tests are acts of testing the claim, in other words, hypothesis tests.
  • One kind of test is required to test the relationship between the response and each of the predictor variables (hence, T-tests).
  • Another kind of test is required to test the linear regression model representation as a whole. This is called the F-test.

While training linear regression models, hypothesis testing is done to determine whether the relationship between the response and each of the predictor variables is statistically significant or not. The coefficient of each predictor variable is estimated, and then an individual hypothesis test is performed to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. If the null hypothesis for a predictor variable is rejected, it means there is evidence of a relationship between the response and that particular predictor variable. T-statistics are used for performing these hypothesis tests because the standard deviation of the sampling distribution is unknown. The value of the t-statistic is compared with the critical value from the t-distribution table in order to decide whether to reject or fail to reject the null hypothesis regarding the relationship between the response and a predictor variable. If the value falls in the critical region, the null hypothesis is rejected, which means that there is a statistically significant relationship between the response and that predictor variable. In addition to the T-tests, an F-test is performed to test the null hypothesis that the linear regression model does not exist and that the values of all the slope coefficients are zero (0). Learn more about linear regression and the t-test in this blog – Linear regression t-test: formula, example.


Linear regression - Hypothesis testing

by Marco Taboga, PhD

This lecture discusses how to perform tests of hypotheses about the coefficients of a linear regression model estimated by ordinary least squares (OLS).

Table of contents

  • The linear regression model
  • Matrix notation
  • Tests of hypothesis in the normal linear regression model
  • Test of a restriction on a single coefficient (t test)
  • Test of a set of linear restrictions (F test)
  • Tests based on maximum likelihood procedures (Wald, Lagrange multiplier, likelihood ratio)
  • Tests of hypothesis when the OLS estimator is asymptotically normal
  • Test of a restriction on a single coefficient (z test)
  • Test of a set of linear restrictions (Chi-square test)
  • Learn more about regression analysis

Normal vs non-normal model

The lecture is divided in two parts:

in the first part, we discuss hypothesis testing in the normal linear regression model , in which the OLS estimator of the coefficients has a normal distribution conditional on the matrix of regressors;

in the second part, we show how to carry out hypothesis tests in linear regression analyses where the hypothesis of normality holds only in large samples (i.e., the OLS estimator can be proved to be asymptotically normal).

How to choose which test to carry out after estimating a linear regression model.


We now explain how to derive tests about the coefficients of the normal linear regression model.

It can be proved (see the lecture about the normal linear regression model) that the assumption of conditional normality implies that the OLS estimator of the coefficients has a normal distribution conditional on the matrix of regressors.

How the acceptance region is determined depends not only on the desired size of the test, but also on whether the test is:

two-tailed (both smaller and larger values than the one stated under the null hypothesis are possible); or

one-tailed (only one of the two things, i.e., either smaller or larger, is possible).

For more details on how to determine the acceptance region, see the glossary entry on critical values .


The F test is one-tailed .

A critical value in the right tail of the F distribution is chosen so as to achieve the desired size of the test.

Then, the null hypothesis is rejected if the F statistic is larger than the critical value.

In this section we explain how to perform hypothesis tests about the coefficients of a linear regression model when the OLS estimator is asymptotically normal.

As we have shown in the lecture on the properties of the OLS estimator , in several cases (i.e., under different sets of assumptions) it can be proved that:

These two properties are used to derive the asymptotic distribution of the test statistics used in hypothesis testing.

The test can be either one-tailed or two-tailed . The same comments made for the t-test apply here.


Like the F test, the Chi-square test is usually one-tailed.

The desired size of the test is achieved by appropriately choosing a critical value in the right tail of the Chi-square distribution.

The null is rejected if the Chi-square statistic is larger than the critical value.

Want to learn more about regression analysis? Here are some suggestions:

  • R squared of a linear regression
  • Gauss-Markov theorem
  • Generalized Least Squares
  • Multicollinearity
  • Dummy variables
  • Selection of linear regression models
  • Partitioned regression
  • Ridge regression



Understanding the Null Hypothesis for Linear Regression

Linear regression is a technique we can use to understand the relationship between one or more predictor variables and a response variable .

If we only have one predictor variable and one response variable, we can use simple linear regression , which uses the following formula to estimate the relationship between the variables:

ŷ = β0 + β1x

  • ŷ: The estimated response value.
  • β0: The average value of y when x is zero.
  • β1: The average change in y associated with a one-unit increase in x.
  • x: The value of the predictor variable.

Simple linear regression uses the following null and alternative hypotheses:

  • H0: β1 = 0
  • HA: β1 ≠ 0

The null hypothesis states that the coefficient β1 is equal to zero. In other words, there is no statistically significant relationship between the predictor variable, x, and the response variable, y.

The alternative hypothesis states that β1 is not equal to zero. In other words, there is a statistically significant relationship between x and y.

If we have multiple predictor variables and one response variable, we can use multiple linear regression , which uses the following formula to estimate the relationship between the variables:

ŷ = β0 + β1x1 + β2x2 + … + βkxk

  • β0: The average value of y when all predictor variables are equal to zero.
  • βi: The average change in y associated with a one-unit increase in xi.
  • xi: The value of the predictor variable xi.

Multiple linear regression uses the following null and alternative hypotheses:

  • H0: β1 = β2 = … = βk = 0
  • HA: At least one βi ≠ 0

The null hypothesis states that all coefficients in the model are equal to zero. In other words, none of the predictor variables have a statistically significant relationship with the response variable, y.

The alternative hypothesis states that not every coefficient is simultaneously equal to zero.

The following examples show how to decide to reject or fail to reject the null hypothesis in both simple linear regression and multiple linear regression models.

Example 1: Simple Linear Regression

Suppose a professor would like to use the number of hours studied to predict the exam score that students will receive in his class. He collects data for 20 students and fits a simple linear regression model.

The following screenshot shows the output of the regression model:

[Figure: Output of simple linear regression in Excel]

The fitted simple linear regression model is:

Exam Score = 67.1617 + 5.2503*(hours studied)

To determine if there is a statistically significant relationship between hours studied and exam score, we need to analyze the overall F value of the model and the corresponding p-value:

  • Overall F-Value:  47.9952
  • P-value:  0.000

Since this p-value is less than .05, we can reject the null hypothesis. In other words, there is a statistically significant relationship between hours studied and exam score received.
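The professor's raw data is not reproduced here, so the following R sketch simulates a hypothetical sample of 20 students simply to illustrate how the same overall F-test is read off a fitted model; its numbers will not match the Excel output above.

# Hedged sketch with simulated (hypothetical) data for 20 students
set.seed(123)
hours <- runif(20, 0, 10)
score <- 67 + 5.3 * hours + rnorm(20, sd = 5)   # hypothetical relationship
exam_fit <- lm(score ~ hours)

summary(exam_fit)$fstatistic   # overall F-statistic with its degrees of freedom
anova(exam_fit)                # F-test (and p-value) for the single predictor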

Example 2: Multiple Linear Regression

Suppose a professor would like to use the number of hours studied and the number of prep exams taken to predict the exam score that students will receive in his class. He collects data for 20 students and fits a multiple linear regression model.

[Figure: Multiple linear regression output in Excel]

The fitted multiple linear regression model is:

Exam Score = 67.67 + 5.56*(hours studied) – 0.60*(prep exams taken)

To determine if there is a jointly statistically significant relationship between the two predictor variables and the response variable, we need to analyze the overall F value of the model and the corresponding p-value:

  • Overall F-Value:  23.46
  • P-value:  0.00

Since this p-value is less than .05, we can reject the null hypothesis. In other words, hours studied and prep exams taken have a jointly statistically significant relationship with exam score.

Note: Although prep exams taken is not individually significant (p = 0.52), prep exams taken combined with hours studied has a jointly significant relationship with exam score.

Additional Resources

  • Understanding the F-Test of Overall Significance in Regression
  • How to Read and Interpret a Regression Table
  • How to Report Regression Results
  • How to Perform Simple Linear Regression in Excel
  • How to Perform Multiple Linear Regression in Excel



6.4 - The Hypothesis Tests for the Slopes

At the beginning of this lesson, we translated three different research questions pertaining to heart attacks in rabbits ( Cool Hearts dataset ) into three sets of hypotheses we can test using the general linear F -statistic. The research questions and their corresponding hypotheses are:

Hypotheses 1

Is the regression model containing at least one predictor useful in predicting the size of the infarct?

  • \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3} = 0\)
  • \(H_{A} \colon\) At least one \(\beta_{j} ≠ 0\) (for j = 1, 2, 3)

Hypotheses 2

Is the size of the infarct significantly (linearly) related to the area of the region at risk?

  • \(H_{0} \colon \beta_{1} = 0 \)
  • \(H_{A} \colon \beta_{1} \ne 0 \)

Hypotheses 3

(Primary research question) Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?

  • \(H_{0} \colon \beta_{2} = \beta_{3} = 0\)
  • \(H_{A} \colon \) At least one \(\beta_{j} ≠ 0\) (for j = 2, 3)

Let's test each of the hypotheses now using the general linear F -statistic:

\(F^*=\left(\dfrac{SSE(R)-SSE(F)}{df_R-df_F}\right) \div \left(\dfrac{SSE(F)}{df_F}\right)\)

To calculate the F -statistic for each test, we first determine the error sum of squares for the reduced and full models — SSE ( R ) and SSE ( F ), respectively. The number of error degrees of freedom associated with the reduced and full models — \(df_{R}\) and \(df_{F}\), respectively — is the number of observations, n , minus the number of parameters, p , in the model. That is, in general, the number of error degrees of freedom is n - p . We use statistical software, such as Minitab's F -distribution probability calculator, to determine the P -value for each test.
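In R, the same general linear F-statistic can be obtained by fitting the full and reduced models and comparing them with anova(). The sketch below uses a simulated stand-in data frame, since the Cool Hearts data is not reproduced here; the names y, x1, x2 and x3 simply mirror the variables used in this lesson.

# Sketch of the general linear F-test in R, on simulated stand-in data
# (y = infarct size, x1 = area at risk, x2/x3 = treatment indicators)
set.seed(1)
treat   <- rep(1:3, length.out = 32)                 # three hypothetical treatment groups
rabbits <- data.frame(x1 = runif(32),
                      x2 = as.numeric(treat == 2),
                      x3 = as.numeric(treat == 3))
rabbits$y <- 0.1 + 0.6 * rabbits$x1 - 0.2 * rabbits$x2 + rnorm(32, sd = 0.15)

full    <- lm(y ~ x1 + x2 + x3, data = rabbits)      # full model
reduced <- lm(y ~ 1, data = rabbits)                 # reduced model under H0

# anova() computes ((SSE(R) - SSE(F)) / (df_R - df_F)) / (SSE(F) / df_F)
anova(reduced, full)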

Testing all slope parameters equal 0

To answer the research question "Is the regression model containing at least one predictor useful in predicting the size of the infarct?", we test the hypotheses:

  • \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3} = 0 \)
  • \(H_{A} \colon\) At least one \(\beta_{j} \ne 0 \) (for j = 1, 2, 3)

The full model

The full model is the largest possible model — that is, the model containing all of the possible predictors. In this case, the full model is:

\(y_i=(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)

The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE , that appears in the analysis of variance table. Because there are 4 parameters in the full model, the number of error degrees of freedom associated with the full model is \(df_{F} = n - 4\).

The reduced model

The reduced model is the model that the null hypothesis describes. Because the null hypothesis sets each of the slope parameters in the full model equal to 0, the reduced model is:

\(y_i=\beta_0+\epsilon_i\)

The reduced model suggests that none of the variations in the response y is explained by any of the predictors. Therefore, the error sum of squares for the reduced model, SSE ( R ), is just the total sum of squares, SSTO , that appears in the analysis of variance table. Because there is only one parameter in the reduced model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 1 \).

Upon plugging in the above quantities, the general linear F -statistic:

\(F^*=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div \dfrac{SSE(F)}{df_F}\)

becomes the usual " overall F -test ":

\(F^*=\dfrac{SSR}{3} \div \dfrac{SSE}{n-4}=\dfrac{MSR}{MSE}\)

That is, to test \(H_{0}\) : \(\beta_{1} = \beta_{2} = \beta_{3} = 0 \), we just use the overall F -test and P -value reported in the analysis of variance table:

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 3 0.95927 0.31976 16.43 0.000
Area 1 0.63742 0.63742 32.75 0.000
X2 1 0.29733 0.29733 15.28 0.001
X3 1 0.01981 0.01981 1.02 0.322
Error 28 0.54491 0.01946
Total 31 1.50418

Regression Equation

Inf = - 0.135 + 0.613 Area - 0.2435 X2 - 0.0657 X3

There is sufficient evidence ( F = 16.43, P < 0.001) to conclude that at least one of the slope parameters is not equal to 0.

In general, to test that all of the slope parameters in a multiple linear regression model are 0, we use the overall F -test reported in the analysis of variance table.

Testing one slope parameter is 0

Now let's answer the second research question: "Is the size of the infarct significantly (linearly) related to the area of the region at risk?" To do so, we test the hypotheses:

  • \(H_{0} \colon \beta_{1} = 0 \)
  • \(H_{A} \colon \beta_{1} \ne 0 \)

Again, the full model is the model containing all of the possible predictors:

\(y_i=(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)

The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE . Alternatively, because the three predictors in the model are \(x_{1}\), \(x_{2}\), and \(x_{3}\), we can denote the error sum of squares as SSE (\(x_{1}\), \(x_{2}\), \(x_{3}\)). Again, because there are 4 parameters in the model, the number of error degrees of freedom associated with the full model is \(df_{F} = n - 4 \).

Because the null hypothesis sets the first slope parameter, \(\beta_{1}\), equal to 0, the reduced model is:

\(y_i=(\beta_0+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)

Because the two predictors in the model are \(x_{2}\) and \(x_{3}\), we denote the error sum of squares as SSE (\(x_{2}\), \(x_{3}\)). Because there are 3 parameters in the model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 3\).

The general linear F-statistic:

\(F^*=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div \dfrac{SSE(F)}{df_F}\)

simplifies to:

\(F^*=\dfrac{SSR(x_1|x_2, x_3)}{1}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}=\dfrac{MSR(x_1|x_2, x_3)}{MSE(x_1,x_2, x_3)}\)

Getting the numbers from the Minitab output above, we determine that the value of the F-statistic is:

\(F^* = \dfrac{SSR(x_1 \vert x_2, x_3)}{1} \div \dfrac{SSE(x_1, x_2, x_3)}{28} = \dfrac{0.63742}{0.01946}=32.7554\)

The P -value is the probability — if the null hypothesis were true — that we would get an F -statistic larger than 32.7554. Comparing our F -statistic to an F -distribution with 1 numerator degree of freedom and 28 denominator degrees of freedom, Minitab tells us that the probability is close to 1 that we would observe an F -statistic smaller than 32.7554:

F distribution with 1 DF in Numerator and 28 DF in denominator

x P ( X ≤x )
32.7554 1.00000

Therefore, the probability that we would get an F -statistic larger than 32.7554 is close to 0. That is, the P -value is < 0.001. There is sufficient evidence ( F = 32.8, P < 0.001) to conclude that the size of the infarct is significantly related to the size of the area at risk after the other predictors x2 and x3 have been taken into account.

But wait a second! Have you been wondering why we couldn't just use the slope's t -statistic to test that the slope parameter, \(\beta_{1}\), is 0? We can! Notice that the P -value ( P < 0.001) for the t -test ( t * = 5.72):

Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant -0.135 0.104 -1.29 0.206  
Area 0.613 0.107 5.72 0.000 1.14
X2 -0.2435 0.0623 -3.91 0.001 1.44
X3 -0.0657 0.0651 -1.01 0.322 1.57

is the same as the P -value we obtained for the F -test. This will always be the case when we test that only one slope parameter is 0. That's because of the well-known relationship between a t -statistic and an F -statistic that has one numerator degree of freedom:

\(t_{(n-p)}^{2}=F_{(1, n-p)}\)

For our example, the square of the t -statistic, 5.72, equals our F -statistic (within rounding error). That is:

\(t^{*2}=5.72^2=32.72=F^*\)
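This equivalence is easy to check numerically; the following short R sketch uses the values reported above (t* = 5.72 with 28 error degrees of freedom).

# Numerical check of the t / F equivalence using the values reported above
t_star <- 5.72
t_star^2                                  # 32.7184, i.e. F* up to rounding
pf(t_star^2, 1, 28, lower.tail = FALSE)   # p-value from the F(1, 28) distribution
2 * pt(-abs(t_star), 28)                  # identical two-sided p-value from t(28)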

So what have we learned in all of this discussion about the equivalence of the F -test and the t -test? In short:

Compare the output obtained when \(x_{1}\) = Area is entered into the model last :

Term Coef SE Coef T-Value P-Value VIF
Constant -0.135 0.104 -1.29 0.206  
X2 -0.2435 0.0623 -3.91 0.001 1.44
X3 -0.0657 0.0651 -1.01 0.322 1.57
Area 0.613 0.107 5.72 0.000 1.14

Inf = - 0.135 - 0.2435 X2 - 0.0657 X3 + 0.613 Area

to the output obtained when \(x_{1}\) = Area is entered into the model first :

The t -statistic and P -value are the same regardless of the order in which \(x_{1}\) = Area is entered into the model. That's because — by its equivalence to the F -test — the t -test for one slope parameter adjusts for all of the other predictors included in the model.

  • We can use either the F -test or the t -test to test that only one slope parameter is 0. Because the t -test results can be read right off of the Minitab output, it makes sense that it would be the test that we'll use most often.
  • But, we have to be careful with our interpretations! The equivalence of the t -test to the F -test has taught us something new about the t -test. The t -test is a test for the marginal significance of the \(x_{1}\) predictor after the other predictors \(x_{2}\) and \(x_{3}\) have been taken into account. It does not test for the significance of the relationship between the response y and the predictor \(x_{1}\) alone.

Testing a subset of slope parameters is 0

Finally, let's answer the third — and primary — research question: "Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?" To do so, we test the hypotheses:

  • \(H_{0} \colon \beta_{2} = \beta_{3} = 0 \)
  • \(H_{A} \colon\) At least one \(\beta_{j} \ne 0 \) (for j = 2, 3)

Because the null hypothesis sets the second and third slope parameters, \(\beta_{2}\) and \(\beta_{3}\), equal to 0, the reduced model is:

\(y_i=(\beta_0+\beta_1x_{i1})+\epsilon_i\)

The ANOVA table for the reduced model is:

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 0.6249 0.62492 21.32 0.000
Area 1 0.6249 0.62492 21.32 0.000
Error 30 0.8793 0.02931
Total 31 1.5042

Because the only predictor in the model is \(x_{1}\), we denote the error sum of squares as SSE (\(x_{1}\)) = 0.8793. Because there are 2 parameters in the model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 2 = 32 – 2 = 30\).

\begin{align} F^*&=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div\dfrac{SSE(F)}{df_F}\\&=\dfrac{0.8793-0.54491}{30-28} \div\dfrac{0.54491}{28}\\&= \dfrac{0.33439}{2} \div 0.01946\\&=8.59.\end{align}

Alternatively, we can calculate the F-statistic using a partial F-test :

\begin{align}F^*&=\dfrac{SSR(x_2, x_3|x_1)}{2}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}\\&=\dfrac{MSR(x_2, x_3|x_1)}{MSE(x_1,x_2, x_3)}.\end{align}

To conduct the test, we regress y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and \(x_{3 }\)— in order (and with "Sequential sums of squares" selected under "Options"):

Source DF Seq SS Seq MS F-Value P-Value
Regression 3 0.95927 0.31976 16.43 0.000
Area 1 0.62492 0.62492 32.11 0.000
X2 1 0.31453 0.31453 16.16 0.001
X3 1 0.01981 0.01981 1.02 0.322
Error 28 0.54491 0.01946
Total 31 1.50418

Inf = - 0.135 + 0.613 Area - 0.2435 X2 - 0.0657 X3

yielding SSR (\(x_{2}\) | \(x_{1}\)) = 0.31453, SSR (\(x_{3}\) | \(x_{1}\), \(x_{2}\)) = 0.01981, and MSE = 0.54491/28 = 0.01946. Therefore, the value of the partial F -statistic is:

\begin{align} F^*&=\dfrac{SSR(x_2, x_3|x_1)}{2}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}\\&=\dfrac{0.31453+0.01981}{2}\div\dfrac{0.54491}{28}\\&= \dfrac{0.33434}{2} \div 0.01946\\&=8.59,\end{align}

which is identical (within round-off error) to the general F-statistic above. The P -value is the probability — if the null hypothesis were true — that we would observe a partial F -statistic more extreme than 8.59. The following Minitab output:

F distribution with 2 DF in Numerator and 28 DF in denominator

x P ( X ≤ x )
8.59 0.998767

tells us that the probability of observing such an F -statistic that is smaller than 8.59 is 0.9988. Therefore, the probability of observing such an F -statistic that is larger than 8.59 is 1 - 0.9988 = 0.0012. The P -value is very small. There is sufficient evidence ( F = 8.59, P = 0.0012) to conclude that the type of cooling is significantly related to the extent of damage that occurs — after taking into account the size of the region at risk.
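The reported P-value can also be reproduced directly from the F-distribution, for example with a one-line R computation:

# P-value for the partial F-test: P(F(2, 28) > 8.59)
pf(8.59, df1 = 2, df2 = 28, lower.tail = FALSE)   # approximately 0.0012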

Summary of MLR Testing

For the simple linear regression model, there is only one slope parameter about which one can perform hypothesis tests. For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are:

  • Hypothesis test for testing that all of the slope parameters are 0.
  • Hypothesis test for testing that a subset — more than one, but not all — of the slope parameters are 0.
  • Hypothesis test for testing that one slope parameter is 0.

We have learned how to perform each of the above three hypothesis tests. Along the way, we also took two detours — one to learn about the " general linear F-test " and one to learn about " sequential sums of squares. " As you now know, knowledge about both is necessary for performing the three hypothesis tests.

The F -statistic and associated p -value in the ANOVA table is used for testing whether all of the slope parameters are 0. In most applications, this p -value will be small enough to reject the null hypothesis and conclude that at least one predictor is useful in the model. For example, for the rabbit heart attacks study, the F -statistic is (0.95927/(4–1)) / (0.54491/(32–4)) = 16.43 with p -value 0.000.

To test whether a subset — more than one, but not all — of the slope parameters are 0, there are two equivalent ways to calculate the F-statistic:

  • Use the general linear F-test formula by fitting the full model to find SSE(F) and fitting the reduced model to find SSE(R) . Then the numerator of the F-statistic is (SSE(R) – SSE(F)) / ( \(df_{R}\) – \(df_{F}\)) .
  • Alternatively, use the partial F-test formula by fitting only the full model but making sure the relevant predictors are fitted last and "sequential sums of squares" have been selected. Then the numerator of the F-statistic is the sum of the relevant sequential sums of squares divided by the sum of the degrees of freedom for these sequential sums of squares. The denominator of the F -statistic is the mean squared error in the ANOVA table.

For example, for the rabbit heart attacks study, the general linear F-statistic is ((0.8793 – 0.54491) / (30 – 28)) / (0.54491 / 28) = 8.59 with p -value 0.0012. Alternatively, the partial F -statistic for testing the slope parameters for predictors \(x_{2}\) and \(x_{3}\) using sequential sums of squares is ((0.31453 + 0.01981) / 2) / (0.54491 / 28) = 8.59.

To test whether one slope parameter is 0, we can use an F -test as just described. Alternatively, we can use a t -test, which will have an identical p -value since in this case, the square of the t -statistic is equal to the F -statistic. For example, for the rabbit heart attacks study, the F -statistic for testing the slope parameter for the Area predictor is (0.63742/1) / (0.54491/(32–4)) = 32.75 with p -value 0.000. Alternatively, the t -statistic for testing the slope parameter for the Area predictor is 0.613 / 0.107 = 5.72 with p -value 0.000, and \(5.72^{2} = 32.72\).

Incidentally, you may be wondering why we can't just do a series of individual t-tests to test whether a subset of the slope parameters is 0. For example, for the rabbit heart attacks study, we could have done the following:

  • Fit the model of y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and \(x_{3}\) and use an individual t-test for \(x_{3}\).
  • If the test results indicate that we can drop \(x_{3}\) then fit the model of y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and use an individual t-test for \(x_{2}\).

The problem with this approach is we're using two individual t-tests instead of one F-test, which means our chance of drawing an incorrect conclusion in our testing procedure is higher. Every time we do a hypothesis test, we can draw an incorrect conclusion by:

  • rejecting a true null hypothesis, i.e., make a type I error by concluding the tested predictor(s) should be retained in the model when in truth it/they should be dropped; or
  • failing to reject a false null hypothesis, i.e., make a type II error by concluding the tested predictor(s) should be dropped from the model when in truth it/they should be retained.

Thus, in general, the fewer tests we perform the better. In this case, this means that wherever possible using one F-test in place of multiple individual t-tests is preferable.

Hypothesis tests for the slope parameters

The problems in this section are designed to review the hypothesis tests for the slope parameters, as well as to give you some practice on models with a three-group qualitative variable (which we'll cover in more detail in Lesson 8). We consider tests for:

  • whether one slope parameter is 0 (for example, \(H_{0} \colon \beta_{1} = 0 \))
  • whether a subset (more than one but less than all) of the slope parameters are 0 (for example, \(H_{0} \colon \beta_{2} = \beta_{3} = 0 \) against the alternative \(H_{A} \colon \beta_{2} \ne 0 \) or \(\beta_{3} \ne 0 \) or both ≠ 0)
  • whether all of the slope parameters are 0 (for example, \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3}\) = 0 against the alternative \(H_{A} \colon \) at least one of the \(\beta_{i}\) is not 0)

(Note the correct specification of the alternative hypotheses for the last two situations.)

Sugar beets study

A group of researchers was interested in studying the effects of three different growth regulators ( treat , denoted 1, 2, and 3) on the yield of sugar beets (y = yield , in pounds). They planned to plant the beets in 30 different plots and then randomly treat 10 plots with the first growth regulator, 10 plots with the second growth regulator, and 10 plots with the third growth regulator. One problem, though, is that the amount of available nitrogen in the 30 different plots varies naturally, thereby giving a potentially unfair advantage to plots with higher levels of available nitrogen. Therefore, the researchers also measured and recorded the available nitrogen (\(x_{1}\) = nit , in pounds/acre) in each plot. They are interested in comparing the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen. The Sugar Beets dataset contains the data from the researcher's experiment.

Preliminary Work

The plot shows a similar positive linear trend within each treatment category, which suggests that it is reasonable to formulate a multiple regression model that would place three parallel lines through the data.

Because the qualitative variable treat distinguishes between the three treatment groups (1, 2, and 3), we need to create two indicator variables, \(x_{2}\) and \(x_{3}\), say, to fit a linear regression model to these data. The new indicator variables should be defined as follows:

treat \(x_2\) \(x_3\)
1 1 0
2 0 1
3 0 0

Use Minitab's Calc >> Make Indicator Variables command to create the new indicator variables in your worksheet

Minitab creates an indicator variable for each treatment group but we can only use two, for treatment groups 1 and 2 in this case (treatment group 3 is the reference level in this case).

Then, if we assume the trend in the data can be summarized by this regression model:

\(y_{i} = \beta_{0}\) + \(\beta_{1}\)\(x_{1}\) + \(\beta_{2}\)\(x_{2}\) + \(\beta_{3}\)\(x_{3}\) + \(\epsilon_{i}\)

where \(x_{1}\) = nit and \(x_{2}\) and \(x_{3}\) are defined as above, what is the mean response function for plots receiving treatment 3? for plots receiving treatment 1? for plots receiving treatment 2? Are the three regression lines that arise from our formulated model parallel? What does the parameter \(\beta_{2}\) quantify? And, what does the parameter \(\beta_{3}\) quantify?

The fitted equation from Minitab is Yield = 84.99 + 1.3088 Nit - 2.43 \(x_{2}\) - 2.35 \(x_{3}\), which means that the equations for each treatment group are:

  • Group 1: Yield = 84.99 + 1.3088 Nit - 2.43(1) = 82.56 + 1.3088 Nit
  • Group 2: Yield = 84.99 + 1.3088 Nit - 2.35(1) = 82.64 + 1.3088 Nit
  • Group 3: Yield = 84.99 + 1.3088 Nit

The three estimated regression lines are parallel since they have the same slope, 1.3088.

The regression parameter for \(x_{2}\) represents the difference between the estimated intercept for treatment 1 and the estimated intercept for reference treatment 3.

The regression parameter for \(x_{3}\) represents the difference between the estimated intercept for treatment 2 and the estimated intercept for reference treatment 3.
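The same indicator-variable coding can be set up in R as well. The sketch below uses a simulated stand-in data frame, since the Sugar Beets data itself is not reproduced here; as in the Minitab analysis, treatment 3 is the reference level.

# Sketch only: beets is a simulated stand-in with columns yield, nit and treat (1, 2, 3)
set.seed(2)
beets <- data.frame(nit = runif(30, 0, 25), treat = rep(1:3, each = 10))
beets$yield <- 85 + 1.3 * beets$nit - 2.4 * (beets$treat == 1) -
  2.3 * (beets$treat == 2) + rnorm(30, sd = 6)

# Hand-coded indicator variables, with treatment 3 as the reference level
beets$x2 <- as.numeric(beets$treat == 1)
beets$x3 <- as.numeric(beets$treat == 2)
lm(yield ~ nit + x2 + x3, data = beets)

# Equivalent: let R build the indicators from a factor with level 3 as reference
beets$treatf <- relevel(factor(beets$treat), ref = "3")
lm(yield ~ nit + treatf, data = beets)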

Testing whether all of the slope parameters are 0

\(H_0 \colon \beta_1 = \beta_2 = \beta_3 = 0\) against the alternative \(H_A \colon \) at least one of the \(\beta_i\) is not 0.

\(F=\dfrac{SSR(X_1,X_2,X_3)\div3}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_1,X_2,X_3)}{MSE(X_1,X_2,X_3)}\)

\(F = \dfrac{\frac{16039.5}{3}}{\frac{1078.0}{30-4}} = \dfrac{5346.5}{41.46} = 128.95\)

Since the p -value for this F -statistic is reported as 0.000, we reject \(H_{0}\) in favor of \(H_{A}\) and conclude that at least one of the slope parameters is not zero, i.e., the regression model containing at least one predictor is useful in predicting the size of sugar beet yield.

Tests for whether one slope parameter is 0

\(H_0 \colon \beta_1= 0\) against the alternative \(H_A \colon \beta_1 \ne 0\)

t -statistic = 19.60, p -value = 0.000, so we reject \(H_{0}\) in favor of \(H_{A}\) and conclude that the slope parameter for \(x_{1}\) = nit is not zero, i.e., sugar beet yield is significantly linearly related to the available nitrogen (controlling for treatment).

\(F=\dfrac{SSR(X_1|X_2,X_3)\div1}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_1|X_2,X_3)}{MSE(X_1,X_2,X_3)}\)

Use the Minitab output to calculate the value of this F statistic. Does the value you obtain equal \(t^{2}\), the square of the t -statistic as we might expect?

\(F^*= \dfrac{\frac{15934.5}{1}}{\frac{1078.0}{30-4}} = \dfrac{15934.5}{41.46} = 384.32\), which is the same as \(19.60^{2}\).

Because \(t^{2}\) will equal the partial F -statistic whenever you test for whether one slope parameter is 0, it makes sense to just use the t -statistic and P -value that Minitab displays as a default. But, note that we've just learned something new about the meaning of the t -test in the multiple regression setting. It tests for the ("marginal") significance of the \(x_{1}\) predictor after \(x_{2}\) and \(x_{3}\) have already been taken into account.

Tests for whether a subset of the slope parameters is 0

\(H_0 \colon \beta_2=\beta_3= 0\) against the alternative \(H_A \colon \beta_2 \ne 0\) or \(\beta_3 \ne 0\) or both \(\ne 0\).

\(F=\dfrac{SSR(X_2,X_3|X_1)\div2}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_2,X_3|X_1)}{MSE(X_1,X_2,X_3)}\)

\(F = \dfrac{\frac{10.4+27.5}{2}}{\frac{1078.0}{30-4}} = \dfrac{18.95}{41.46} = 0.46\).

F distribution with 2 DF in Numerator and 26 DF in denominator

x P ( X ≤ x )
0.46 0.363677

p-value \(= 1-0.363677 = 0.636\), so we fail to reject \(H_{0}\) in favor of \(H_{A}\) and conclude that we cannot rule out \(\beta_2 = \beta_3 = 0\), i.e., there is no significant difference in the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen.

Note that the sequential mean square due to regression, MSR(\(X_{2}\),\(X_{3}\)|\(X_{1}\)), is obtained by dividing the sequential sum of square by its degrees of freedom (2, in this case, since two additional predictors \(X_{2}\) and \(X_{3}\) are considered). Use the Minitab output to calculate the value of this F statistic, and use Minitab to get the associated P -value. Answer the researcher's question at the \(\alpha= 0.05\) level.


Hypothesis Test for Regression Slope

This lesson describes how to conduct a hypothesis test to determine whether there is a significant linear relationship between an independent variable X and a dependent variable Y .

The test focuses on the slope of the regression line

Y = Β0 + Β1X

where Β0 is a constant, Β1 is the slope (also called the regression coefficient), X is the value of the independent variable, and Y is the value of the dependent variable.

If we find that the slope of the regression line is significantly different from zero, we will conclude that there is a significant relationship between the independent and dependent variables.

Test Requirements

The approach described in this lesson is valid whenever the standard requirements for simple linear regression are met.

  • The dependent variable Y has a linear relationship to the independent variable X .
  • For each value of X, the probability distribution of Y has the same standard deviation σ.
  • The Y values are independent.
  • The Y values are roughly normally distributed (i.e., symmetric and unimodal ). A little skewness is ok if the sample size is large.

The test procedure consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.

State the Hypotheses

If there is a significant linear relationship between the independent variable X and the dependent variable Y , the slope will not equal zero.

Ho: Β1 = 0

Ha: Β1 ≠ 0

The null hypothesis states that the slope is equal to zero, and the alternative hypothesis states that the slope is not equal to zero.

Formulate an Analysis Plan

The analysis plan describes how to use sample data to accept or reject the null hypothesis. The plan should specify the following elements.

  • Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value between 0 and 1 can be used.
  • Test method. Use a linear regression t-test (described in the next section) to determine whether the slope of the regression line differs significantly from zero.

Analyze Sample Data

Using sample data, find the standard error of the slope, the slope of the regression line, the degrees of freedom, the test statistic, and the P-value associated with the test statistic. The approach described in this section is illustrated in the sample problem at the end of this lesson.

Predictor Coef SE Coef T P
Constant 76 30 2.53 0.01
X 35 20 1.75 0.04

SE = sb1 = sqrt [ Σ(yi - ŷi)² / (n - 2) ] / sqrt [ Σ(xi - x̄)² ]

  • Slope. Like the standard error, the slope of the regression line will be provided by most statistics software packages. In the hypothetical output above, the slope is equal to 35.

t = b1 / SE

  • P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a t statistic, use the t Distribution Calculator to assess the probability associated with the test statistic. Use the degrees of freedom computed above.

Interpret Results

If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level , and rejecting the null hypothesis when the P-value is less than the significance level.

Test Your Understanding

The local utility company surveys 101 randomly selected customers. For each survey participant, the company collects the following: annual electric bill (in dollars) and home size (in square feet). Output from a regression analysis appears below.

Annual bill = 0.55 * Home size + 15

Predictor Coef SE Coef T P
Constant 15 3 5.0 0.00
Home size 0.55 0.24 2.29 0.01

Is there a significant linear relationship between annual bill and home size? Use a 0.05 level of significance.

The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:

Ho: The slope of the regression line is equal to zero.

Ha: The slope of the regression line is not equal to zero.

  • Formulate an analysis plan . For this analysis, the significance level is 0.05. Using sample data, we will conduct a linear regression t-test to determine whether the slope of the regression line differs significantly from zero.

We get the slope (b 1 ) and the standard error (SE) from the regression output.

b1 = 0.55       SE = 0.24

We compute the degrees of freedom and the t statistic, using the following equations.

DF = n - 2 = 101 - 2 = 99

t = b1/SE = 0.55/0.24 = 2.29

where DF is the degrees of freedom, n is the number of observations in the sample, b 1 is the slope of the regression line, and SE is the standard error of the slope.

  • Interpret results. Since the P-value (0.0242) is less than the significance level (0.05), we reject the null hypothesis and conclude that there is a significant linear relationship between annual bill and home size. A quick R check of these figures follows below.
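As noted above, the degrees of freedom, t statistic and P-value can be verified with a few lines of R using the figures reported in the regression output:

# Check of the home-size example using the figures from the output above
b1 <- 0.55; se <- 0.24; n <- 101
t_stat <- b1 / se              # about 2.29
df <- n - 2                    # 99
2 * pt(-abs(t_stat), df)       # two-tailed P-value, about 0.024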


How to Simplify Hypothesis Testing for Linear Regression in Python

What is homoscedasticity again?

Andreas Martinson, Towards Data Science

I find myself coming back to the basics to refresh my statistical knowledge over and over again. Most people’s first introduction to statistics begins by learning hypothesis testing, which is followed soon after by t-tests and linear regression. This article is a refresher of how to use linear regression for hypothesis testing along with the assumptions that have to be satisfied in order to trust the results of your linear regression statistical test. I also want to share a Python function I made to quickly check the 5 statistical assumptions that need to be satisfied for hypothesis testing using linear regression.

A Quick Reminder Regarding Linear Regression

Before I share the 4 assumptions that should be met in order to run a linear regression hypothesis test, there is one important point to keep in mind regarding linear regression. Linear regression can be thought of as a dual purpose tool:

  • To predict future values for the y variable
  • To infer if the trend is statistically significant

This is important to remember because it means that your data does not have to meet the…



Hypothesis Testing On Linear Regression

Ankita Banerji, Nerd For Tech

When we build a multiple linear regression model, we may have a few potential predictor/independent variables. Therefore, it is extremely important to select the variables which are really significant and influence the experiment strongly. To get the optimal model, we can try all the possible combinations of independent variables and see which model fits best. But this method is time-consuming and infeasible. Hence, we need another method to get a decent model. We can do the same either by manual feature elimination or by using any automated approach (RFE, Regularization, etc.).

In manual feature elimination, we can:

  • Build a model with all the features,
  • Drop the features that are least helpful in prediction (high p-value),
  • Drop the features that are redundant (using correlations and VIF),
  • Rebuild the model and repeat.

It is generally recommended that we follow a balanced approach, i.e., use a combination of automated (coarse tuning) + manual (fine tuning) selection in order to get an optimal model. In this blog we will discuss the second step of manual feature elimination, i.e., dropping the features that are least helpful in prediction (insignificant features).

The first question that arises is: "What do we mean by a significant variable?". Let us understand it in simple linear regression first.

When we fit a straight line through the data, we get two parameters i.e., the intercept (β₀) and the slope (β₁).

Now, β₀ is not of much importance right now, but there are a few aspects around β₁ which needs to be checked and verified. Suppose we have a dataset for which the scatter plot looks like the following:

When we run a linear regression on this dataset in Python, Python will fit a line on the data which looks like the following:

We can clearly see that the data is randomly scattered and doesn't seem to follow a linear trend. Python will anyway fit a line through the data using the least squares method. We can see that the fitted line is of no use in this case. Hence, every time we perform linear regression, we need to test whether the fitted line is a significant one (in other terms, test whether β₁ is significant). We will use hypothesis testing on β₁ for the same.

Steps to Perform Hypothesis testing:

  • Set the Hypothesis
  • Set the Significance Level, Criteria for a decision
  • Compute the test statistics
  • Make a decision

Step 1: We start by setting the null hypothesis that β₁ is not significant, i.e., there is no relationship between x and y, therefore slope β₁ = 0. The alternate hypothesis is that β₁ is significant, i.e., β₁ ≠ 0.

Step 2: Typically, we set the Significance level at 10%, 5%, or 1%.

Step 3: After formulating the null and alternate hypotheses, next step to follow in order to make a decision using the p-value method are as follows:

  1. Calculate the value of the t-score for the mean on the distribution:

t = (x̄ - μ) / (s / √n)

where μ is the population mean and s is the sample standard deviation, which, when divided by √n, is also known as the standard error.

  2. Calculate the p-value from the cumulative probability for the given t-score using the t-table.

  3. Make the decision on the basis of the p-value with respect to the given value of the significance level.

Step 4: Making Decision

If the p-value < 0.05, we reject the null hypothesis.

If the p-value > 0.05, we fail to reject the null hypothesis.

If we fail to reject the null hypothesis, we have no evidence that β₁ differs from zero (in other words, β₁ is not statistically significant), so the variable is of little use in the model. Similarly, if we reject the null hypothesis, it means that β₁ is significantly different from zero and the fitted line is a significant one.

NOTE: The above steps are performed automatically by Python libraries such as statsmodels.

Similarly, in multiple linear regression, we perform the same steps as in simple linear regression, except that the null and alternate hypotheses are different. For the multiple regression model, the null hypothesis is that all the slope coefficients are zero (β₁ = β₂ = … = βₙ = 0), and the alternate hypothesis is that at least one of them is non-zero.

Example in Python

Let us take a housing dataset which contains the prices of properties in the Delhi region. We wish to use this data to optimise the sale prices of the properties based on important factors such as area, bedrooms, parking, etc.

Top five rows of dataset look something like this:

After preparing, cleaning and analysing the data, we build a linear regression model using all the variables (fit a regression line through the data using statsmodels).
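The exact code and dataset are not reproduced here, but a minimal statsmodels sketch of this step might look like the following; the file name and column names are assumptions for illustration, not the article's actual variables.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names for the Delhi housing dataset described above
housing = pd.read_csv("Housing.csv")
X = housing[["area", "bedrooms", "parking"]]   # assumed subset of predictors
y = housing["price"]

X = sm.add_constant(X)          # add the intercept term
model = sm.OLS(y, X).fit()      # ordinary least squares fit
print(model.summary())          # coefficient table with coef, std err, t and P>|t|
```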

We get the following output:

Looking at the p-values (P>|t|), some of the variables, like bedrooms and semi-furnished, aren't really significant (p > 0.05). We could simply drop the variable with the highest non-significant p-value.

Conclusion: Generally we use two main parameters to judge the insignificant variables, the p-values and the VIFs (variance inflation factor).


Everything You Need to Know About Hypothesis Testing in Machine Learning


What is Hypothesis Testing?


Hypothesis testing is done to confirm our observation about the population using sample data, within the desired error level. Through hypothesis testing, we can determine whether we have enough statistical evidence to conclude if the hypothesis about the population is true or not.

How to perform hypothesis testing in machine learning?

To trust your model and make predictions, we utilize hypothesis testing. When we use sample data to train our model, we make assumptions about our population. By performing hypothesis testing, we validate these assumptions for a desired significance level.


Let’s take the case of regression models: When we fit a straight line through a linear regression model, we get the slope and intercept for the line. Hypothesis testing is used to confirm if our beta coefficients are significant in a linear regression model. Every time we run the linear regression model, we test if the line is significant or not by checking if the coefficient is significant. I have shared details on how you can check these values in Python towards the end of this blog.

Key steps to perform hypothesis test are as follows:

  • Formulate a Hypothesis
  • Determine the significance level
  • Determine the type of test
  • Calculate the Test Statistic values and the p values
  • Make Decision

Now let’s look into the steps in detail:

Formulating the hypothesis

One of the key steps to do this is to formulate the below two hypotheses:

The null hypothesis represented as H₀ is the initial claim that is based on the prevailing belief about the population. The alternate hypothesis represented as H₁ is the challenge to the null hypothesis. It is the claim which we would like to prove as True

One of the main points which we should consider while formulating the null and alternative hypothesis is that the null hypothesis always looks at confirming the existing notion. Hence, the null hypothesis carries the signs =, ≥ or ≤, while the alternate hypothesis carries the signs >, < or ≠.

Determine the significance level also known as alpha or α for Hypothesis Testing

The significance level is the probability of rejecting the null hypothesis when it is actually true. It is usually set at 5% or 0.05, which means that there is a 5% chance that we would accept the alternate hypothesis even when our null hypothesis is true.

Based on the criticality of the requirement, we can choose a lower significance level of 1% as well.

Determine the Test Statistic and calculate its value for Hypothesis Testing

Hypothesis testing uses a test statistic, which is a numerical summary of a dataset that reduces the data to one value that can be used to perform the hypothesis test.

Select the type of Hypothesis test

We choose the type of test statistic based on the predictor variable – quantitative or categorical. Below are a few of the commonly used test statistics:

  • Quantitative variable, normal distribution: Z-test
  • Quantitative variable, t distribution: T-test
  • Quantitative variable, positively skewed distribution: F-test
  • Quantitative variable, negatively skewed distribution: NA (transform the data first)
  • Categorical variable: Chi-square test

Z-statistic – Z Test

Z-statistic is used when the sample follows a normal distribution. It is calculated based on population parameters such as the mean and standard deviation. A one-sample Z-test is used when we want to compare a sample mean with a population mean. A two-sample Z-test is used when we want to compare the means of two samples.

T-statistic – T-Test

T-statistic is used when the sample follows a t distribution and the population parameters are unknown. The t distribution is similar to a normal distribution, but it is shorter and has heavier (fatter) tails.
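As a quick illustration of the T-test described above, here is a minimal one-sample t-test in Python with scipy; the sample values and the hypothesized mean of 50 are made up for the example.

```python
import numpy as np
from scipy import stats

# Illustrative (hypothetical) sample: weekly sales of a product
sample = np.array([52.1, 48.3, 50.9, 47.5, 53.2, 49.8, 51.4, 46.9])

# One-sample t-test of H0: population mean = 50, population sigma unknown
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # fail to reject H0 if p > 0.05
```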

F-statistic – F test

For samples involving three or more groups, we prefer the F-test. Performing a T-test on multiple groups increases the chance of a Type-1 error, so ANOVA is used in such cases.

Analysis of variance (ANOVA) can determine whether the means of three or more groups are different. ANOVA uses F-tests to statistically test the equality of means.

F-statistic is used when the data is positively skewed and follows an F distribution. F distributions are always positive and skewed right.

F = Variation between the sample means/variation within the samples

For negatively skewed data, we would need to perform a feature transformation first.
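A minimal one-way ANOVA sketch with scipy illustrates the F-test across groups described above; the three groups and their values are made up for the example.

```python
from scipy import stats

# Hypothetical measurements from three independent groups
group_a = [23, 25, 27, 22, 26]
group_b = [30, 29, 31, 28, 32]
group_c = [24, 26, 25, 27, 23]

# One-way ANOVA: F = variation between the sample means / variation within the samples
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # small p-value -> at least one mean differs
```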

Chi-Square Test 

For categorical variables, we would be performing a chi-Square test.

Following are the two types of chi-squared tests:

  • Chi-squared test of independence – We use the Chi-Square test to determine whether or not there is a significant relationship between two categorical variables.
  • Chi-squared Goodness of fit helps us determine if the sample data correctly represents the population.
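A minimal sketch of the chi-squared test of independence with scipy; the 2x2 contingency table is made up for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table of counts for two categorical variables
table = np.array([[30, 10],
                  [20, 40]])

# Chi-squared test of independence between the two categorical variables
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```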

The decision about your model

The test statistic is then used to calculate the P-value. A P-value measures the strength of evidence against the null hypothesis. If the P-value is less than the significance level, we reject the null hypothesis.

If the p-value < α, then we have statistically significant evidence against the null hypothesis, so we reject the null hypothesis and accept the alternate hypothesis.

If the p-value > α, then we do not have statistically significant evidence against the null hypothesis, so we fail to reject the null hypothesis.

As we make decisions, it is important to understand the errors that can happen while testing.

Errors while making decisions

There are two possible types of error we could commit while performing hypothesis testing.


1) Type 1 Error – This occurs when the null hypothesis is true but we reject it. The probability of a Type 1 error is denoted by alpha (α) and is also known as the level of significance of the hypothesis test.

2) Type 2 Error – This occurs when the null hypothesis is false but we fail to reject it. The probability of a Type 2 error is denoted by beta (β).

Hypothesis testing in Python

The statsmodels library has the ability to perform and summarize the outcomes of hypothesis tests on your model. Based on your feature variables, you can determine which test value is relevant for your model and make decisions accordingly.

To create a fitted model, I have used ordinary least squares (OLS).

Once we have trained the model, we can see the summary of the tests using the summary() method.

The model summary will look something like below.


From a hypothesis testing standpoint, you need to pay attention to the following values to decide whether you need to refine your model:

  • Prob (F-statistic) – The F-statistic tells us the overall goodness of fit of the regression. You want the probability of the F-statistic to be as low as possible to reject the null hypothesis.
  • P-value, given in the column P>|t| – As mentioned above, for a good model, we want this value to be less than the significance level (see the sketch below for how to pull these values from the fitted model).
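As a rough sketch of how to read these values programmatically, assuming a fitted statsmodels OLS results object named model (as in the earlier housing example):

```python
# Assuming `model` is a fitted statsmodels OLS results object
print(model.fvalue, model.f_pvalue)    # F-statistic and Prob (F-statistic)
print(model.pvalues)                   # P>|t| column for the intercept and each predictor

# Predictors whose p-value exceeds the 0.05 significance level are candidates to drop
print(model.pvalues[model.pvalues > 0.05])
```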

This is all about hypothesis testing in this article.


Simple linear regression

Fig. 9 Simple linear regression

Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)

Fit: the estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to minimize the (training) residual sum of squares (RSS):

\(\mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2\)

Sample code: advertising data
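The original sample code is not reproduced here; a minimal statsmodels sketch for an advertising-style dataset might look like the following, where the file name Advertising.csv and the columns TV and sales are assumptions for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical Advertising.csv with columns TV, radio, newspaper, sales
ads = pd.read_csv("Advertising.csv")

# Simple linear regression: sales = beta0 + beta1 * TV + error
fit = smf.ols("sales ~ TV", data=ads).fit()
print(fit.params)                 # estimates of beta0 (Intercept) and beta1 (TV)
print(fit.bse)                    # standard errors of the estimates
print(fit.conf_int(alpha=0.05))   # 95% confidence intervals
print(fit.summary())              # t statistics and p-values for the test below
```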

Estimates \(\hat\beta_0\) and \(\hat\beta_1\)

A little calculus shows that the minimizers of the RSS are:

\(\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x\)

Assessing the accuracy of \(\hat \beta_0\) and \(\hat\beta_1\)

Fig. 10 How variable is the regression line?

Based on our model

The Standard Errors for the parameters are:

\(SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}, \qquad SE(\hat\beta_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}\right]\)

95% confidence intervals:

\(\hat\beta_1 \pm 2\cdot SE(\hat\beta_1), \qquad \hat\beta_0 \pm 2\cdot SE(\hat\beta_0)\)

Hypothesis test

Null hypothesis \(H_0\) : There is no relationship between \(X\) and \(Y\) .

Alternative hypothesis \(H_a\) : There is some relationship between \(X\) and \(Y\) .

Based on our model: this translates to

\(H_0\) : \(\beta_1=0\) .

\(H_a\) : \(\beta_1\neq 0\) .

Test statistic:

\(t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}\)

Under the null hypothesis, this has a \(t\)-distribution with \(n-2\) degrees of freedom.

Sample output: advertising data

Interpreting the hypothesis test

If we reject the null hypothesis, can we assume there is an exact linear relationship?

No. A quadratic relationship may be a better fit, for example. This test assumes the simple linear regression model is correct which precludes a quadratic relationship.

If we don’t reject the null hypothesis, can we assume there is no relationship between \(X\) and \(Y\) ?

No. This test is based on the model we posited above and is only powerful against certain monotone alternatives. There could be more complex non-linear relationships.


SAS

Introduction to Statistical Analysis: Hypothesis Testing

This course is part of SAS Statistical Business Analyst Professional Certificate


Skills you'll gain

  • Multivariate Time Series Analysis
  • Multivariate Analysis
  • Multivariate Statistics
  • Predictive Modelling


There are 4 modules in this course

This introductory course is for SAS software users who perform statistical analyses using SAS/STAT software. The focus is on t tests, ANOVA, and linear regression, and includes a brief introduction to logistic regression.

Course Overview and Data Setup

In this module you learn about the course and the data you analyze in this course. Then you set up the data you need to do the practices in the course.

What's included

2 videos 5 readings

2 videos • Total 12 minutes

  • Welcome and Meet the Instructor • 1 minute • Preview module
  • Demo: Exploring Ames Housing Data • 10 minutes

5 readings • Total 52 minutes

  • Learner Prerequisites • 1 minute
  • Access SAS Software for this Course • 10 minutes
  • Follow These Instructions to Set Up Data for This Course (REQUIRED) • 30 minutes
  • Completing Demos and Practices • 10 minutes
  • Using Forums and Getting Help • 1 minute

Introduction and Review of Concepts

In this module you learn about the models required to analyze different types of data and the difference between explanatory vs predictive modeling. Then you review fundamental statistical concepts, such as the sampling distribution of a mean, hypothesis testing, p-values, and confidence intervals. After reviewing these concepts, you apply one-sample and two-sample t tests to data to confirm or reject preconceived hypotheses.

17 videos 2 readings 9 quizzes

17 videos • Total 41 minutes

  • Overview • 1 minute • Preview module
  • Statistical Modeling: Types of Variables • 1 minute
  • Overview of Models • 3 minutes
  • Explanatory versus Predictive Modeling • 1 minute
  • Population Parameters and Sample Statistics • 1 minute
  • Normal (Gaussian) Distribution • 2 minutes
  • Standard Error of the Mean • 0 minutes
  • Confidence Intervals • 2 minutes
  • Statistical Hypothesis Test • 4 minutes
  • p-Value: Effect Size and Sample Size Influence • 3 minutes
  • Scenario • 0 minutes
  • Performing a t Test • 4 minutes
  • Demo: Performing a One-Sample t Test Using PROC TTEST • 3 minutes
  • Scenario • 1 minute
  • Assumptions for the Two-Sample t Test • 2 minutes
  • Testing for Equal and Unequal Variances • 2 minutes
  • Demo: Performing a Two-Sample t Test Using PROC TTEST • 4 minutes

2 readings • Total 20 minutes

  • Parameters and Statistics • 10 minutes
  • Normal Distribution • 10 minutes

9 quizzes • Total 100 minutes

  • Introduction and Review of Concepts • 30 minutes
  • Question 1.01 • 5 minutes
  • Question 1.02 • 5 minutes
  • Question 1.03 • 5 minutes
  • Question 1.04 • 5 minutes
  • Question 1.05 • 5 minutes
  • Practice - Using PROC TTEST to Perform a One-Sample t Test • 20 minutes
  • Question 1.06 • 5 minutes
  • Practice - Using PROC TTEST to Compare Groups • 20 minutes

ANOVA and Regression

In this module you learn to use graphical tools that can help determine which predictors are likely or unlikely to be useful. Then you learn to augment these graphical explorations with correlation analyses that describe linear relationships between potential predictors and our response variable. After you determine potential predictors, tools like ANOVA and regression help you assess the quality of the relationship between the response and predictors.

29 videos 2 readings 14 quizzes

29 videos • Total 69 minutes

  • Identifying Associations in ANOVA with Box Plots • 1 minute
  • Demo: Exploring Associations Using PROC SGPLOT • 1 minute
  • Identifying Associations in Linear Regression with Scatter Plots • 1 minute
  • Demo: Exploring Associations Using PROC SGSCATTER • 2 minutes
  • The ANOVA Hypothesis • 1 minute
  • Partitioning Variability in ANOVA • 2 minutes
  • Coefficient of Determination • 1 minute
  • F Statistic and Critical Values • 1 minute
  • The ANOVA Model • 2 minutes
  • Demo: Performing a One-Way ANOVA Using PROC GLM • 6 minutes
  • Multiple Comparison Methods • 2 minutes
  • Tukey's and Dunnett's Multiple Comparison Methods • 1 minute
  • Diffograms and Control Plots • 1 minute
  • Demo: Performing a Post Hoc Pairwise Comparison Using PROC GLM • 6 minutes
  • Using Correlation to Measure Relationships between Continuous Variables • 1 minute
  • Hypothesis Testing for a Correlation • 1 minute
  • Avoiding Common Errors When Interpreting Correlations • 5 minutes
  • Demo: Producing Correlation Statistics and Scatter Plots Using PROC CORR • 6 minutes
  • The Simple Linear Regression Model • 1 minute
  • How SAS Performs Simple Linear Regression • 1 minute
  • Comparing the Regression Model to a Baseline Model • 2 minutes
  • Hypothesis Testing and Assumptions for Linear Regression • 1 minute
  • Demo: Performing Simple Linear Regression Using PROC REG • 7 minutes
  • What Does a CLASS Statement Do? • 10 minutes
  • Correlation Analysis and Model Building • 10 minutes

14 quizzes • Total 155 minutes

  • ANOVA and Regression • 30 minutes
  • Question 2.01 • 5 minutes
  • Question 2.02 • 5 minutes
  • Question 2.03 • 5 minutes
  • Question 2.04 • 5 minutes
  • Practice - Performing a One-Way ANOVA • 20 minutes
  • Question 2.05 • 5 minutes
  • Question 2.06 • 5 minutes
  • Practice - Using PROC GLM to Perform Post Hoc Pairwise Comparisons • 20 minutes
  • Question 2.07 • 5 minutes
  • Question 2.08 • 5 minutes
  • Practice - Describing the Relationship between Continuous Variables • 20 minutes
  • Question 2.09 • 5 minutes
  • Practice - Using PROC REG to Fit a Simple Linear Regression Model • 20 minutes

More Complex Linear Models

In this module you expand the one-way ANOVA model to a two-factor analysis of variance and then extend simple linear regression to multiple regression with two predictors. After you understand the concepts of two-way ANOVA and multiple linear regression with two predictors, you'll have the skills to fit and interpret models with many variables.

13 videos 1 reading 5 quizzes

13 videos • Total 43 minutes

  • Applying the Two-Way ANOVA Model • 3 minutes
  • Demo: Performing a Two-Way ANOVA Using PROC GLM • 7 minutes
  • Interactions • 3 minutes
  • Demo: Performing a Two-Way ANOVA With an Interaction Using PROC GLM • 5 minutes
  • Demo: Performing Post-Processing Analysis Using PROC PLM • 4 minutes
  • The Multiple Linear Regression Model • 2 minutes
  • Hypothesis Testing for Multiple Regression • 1 minute
  • Multiple Linear Regression versus Simple Linear Regression • 2 minutes
  • Adjusted R-Square • 1 minute
  • Demo: Fitting a Multiple Linear Regression Model Using PROC REG • 7 minutes

1 reading • Total 10 minutes

  • The STORE Statement • 10 minutes

5 quizzes • Total 80 minutes

  • More Complex Linear Models • 30 minutes
  • Question 3.01 • 5 minutes
  • Practice - Performing a Two-Way ANOVA Using PROC GLM • 20 minutes
  • Question 3.02 • 5 minutes
  • Practice - Performing Multiple Regression Using PROC REG • 20 minutes

Instructor: Jordan Bakerman



Linear hypothesis test on linear regression model coefficients

Description

p = coefTest(mdl) computes the p-value for an F-test that all coefficient estimates in mdl, except for the intercept term, are zero.

p = coefTest(mdl,H) performs an F-test that H × B = 0, where B represents the coefficient vector. Use H to specify the coefficients to include in the F-test.

p = coefTest(mdl,H,C) performs an F-test that H × B = C.

[p,F] = coefTest(___) also returns the F-test statistic F using any of the input argument combinations in previous syntaxes.

[p,F,r] = coefTest(___) also returns the numerator degrees of freedom r for the test.


Test Significance of Linear Regression Model

Fit a linear regression model and test the coefficients of the fitted model to see if they are zero.

Load the carsmall data set and create a table in which the Model_Year predictor is categorical.

Fit a linear regression model of mileage as a function of the weight, weight squared, and model year.

The last line of the model display shows the F -statistic value of the regression model and the corresponding p -value. The small p -value indicates that the model fits significantly better than a degenerate model consisting of only an intercept term. You can return these two values by using coefTest .

Test Significance of Linear Model Coefficient

Fit a linear regression model and test the significance of a specified coefficient in the fitted model by using coefTest . You can also use anova to test the significance of each predictor in the model.

The model display includes the p -value for the t -statistic for each coefficient to test the null hypothesis that the corresponding coefficient is zero.

You can examine the significance of the coefficient using coefTest . For example, test the significance of the Acceleration coefficient. According to the model display, Acceleration is the second predictor. Specify the coefficient by using a numeric index vector.

p_Acceleration is the p -value corresponding to the F -statistic value F_Acceleration , and r_Acceleration is the numerator degrees of freedom for the F -test. The returned p -value indicates that Acceleration is not statistically significant in the fitted model. Note that p_Acceleration is equal to the p -value of t -statistic ( tStat ) in the model display, and F_Acceleration is the square of tStat .

Test the significance of the categorical predictor Model_Year . Instead of testing Model_Year_76 and Model_Year_82 separately, you can perform a single test for the categorical predictor Model_Year . Specify Model_Year_76 and Model_Year_82 by using a numeric index matrix.

The returned p -value indicates that Model_Year is statistically significant in the fitted model.

You can also return these values by using anova .

Input Arguments

mdl — Linear regression model object, LinearModel object | CompactLinearModel object

Linear regression model object, specified as a LinearModel object created by using fitlm or stepwiselm , or a CompactLinearModel object created by using compact .

H — Hypothesis matrix numeric index matrix

Hypothesis matrix, specified as a full-rank numeric index matrix of size r -by- s , where r is the number of linear combinations of coefficients being tested, and s is the total number of coefficients.

If you specify H , then the output p is the p -value for an F -test that H × B = 0 , where B represents the coefficient vector.

If you specify H and C , then the output p is the p -value for an F -test that H × B = C .

Example: [1 0 0 0 0] tests the first coefficient among five coefficients.

Data Types: single | double

C — Hypothesized value numeric vector

Hypothesized value for testing the null hypothesis, specified as a numeric vector with the same number of rows as H .

If you specify H and C , then the output p is the p -value for an F -test that H × B = C , where B represents the coefficient vector.

Output Arguments

p — p-value for F-test, numeric value in the range [0,1]

p -value for the F -test, returned as a numeric value in the range [0,1].

F — Value of test statistic for F-test, numeric value

Value of the test statistic for the F -test, returned as a numeric value.

r — Numerator degrees of freedom for F-test, positive integer

Numerator degrees of freedom for the F -test, returned as a positive integer. The F -statistic has r degrees of freedom in the numerator and mdl.DFE degrees of freedom in the denominator.

The p -value, F -statistic, and numerator degrees of freedom are valid under these assumptions:

The data comes from a model represented by the formula in the Formula property of the fitted model.

The observations are independent, conditional on the predictor values.

Under these assumptions, let β represent the (unknown) coefficient vector of the linear regression. Suppose H is a full-rank numeric index matrix of size r -by- s , where r is the number of linear combinations of coefficients being tested, and s is the total number of coefficients. Let c be a column vector with r rows. The following is a test statistic for the hypothesis that Hβ  =  c :

\(F = \frac{(H\hat{\beta} - c)'\,(HVH')^{-1}\,(H\hat{\beta} - c)}{r}.\)

Here β ^ is the estimate of the coefficient vector β , stored in the Coefficients property, and V is the estimated covariance of the coefficient estimates, stored in the CoefficientCovariance property. When the hypothesis is true, the test statistic F has an F Distribution with r and u degrees of freedom, where u is the degrees of freedom for error, stored in the DFE property.

Alternative Functionality

The values of commonly used test statistics are available in the Coefficients property of a fitted model.

anova provides tests for each model predictor and groups of predictors.

Extended Capabilities

GPU arrays accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox) .

Version History

Introduced in R2012a

anova | CompactLinearModel | LinearModel | linhyptest | coefCI | dwtest

  • F-statistic and t-statistic
  • Interpret Linear Regression Results
  • Linear Regression Workflow
  • Linear Regression


How to Check Linear Regression Assumptions (and What to Do If They Fail)

Linear regression is a powerful statistical tool used to model the relationship between a dependent variable and one or more independent variables. However, the validity and reliability of linear regression analysis hinge on several key assumptions. If these assumptions are violated, the results of the analysis can be misleading or even invalid. In this comprehensive guide, we will delve into the essential assumptions of linear regression, explore how to check them, and provide practical solutions for addressing potential violations.  

Introduction

Linear regression is a cornerstone of statistical modeling, widely employed in various fields, from economics and finance to social sciences and engineering. Its simplicity and interpretability make it a popular choice for understanding the relationships between variables. However, like any statistical method, linear regression relies on a set of assumptions to ensure the accuracy and meaningfulness of its results.

When these assumptions are met, linear regression provides unbiased and efficient estimates of the model parameters. However, when these assumptions are violated, the results can be biased, inefficient, or even completely invalid. Therefore, it’s crucial to understand these assumptions, assess whether they hold in your data, and take appropriate corrective measures if they don’t.

The Key Assumptions of Linear Regression

Before we dive into the specifics of checking and addressing assumption violations, let’s first outline the key assumptions underlying linear regression:

  • Linearity: The relationship between the independent variables and the dependent variable is linear.
  • Independence: The errors (residuals) are independent of each other.
  • Homoscedasticity: The variance of the errors is constant across all levels of the independent variables.
  • Normality: The errors are normally distributed.

Checking the Assumptions

Now that we understand the key assumptions, let’s explore how to check whether they hold in your data.

1. Checking for Linearity

The linearity assumption states that the relationship between the independent variables and the dependent variable is linear. This means that a straight line can adequately represent the relationship.

How to check:

  • Scatterplots: The most straightforward way to check for linearity is to create scatterplots of the dependent variable against each independent variable. If the relationship appears to be linear, the points should roughly form a straight line.
  • Residual plots: Another useful tool is a residual plot, which plots the residuals (the differences between the actual and predicted values) against the fitted values (the predicted values from the regression model). If the linearity assumption holds, the residuals should be randomly scattered around zero, with no discernible pattern.
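A minimal sketch of both diagnostic plots, assuming a DataFrame with a predictor x and a response y (the file and column names are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Hypothetical data with one predictor x and response y
df = pd.read_csv("data.csv")
fit = smf.ols("y ~ x", data=df).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(df["x"], df["y"])                # scatterplot: look for a roughly straight line
ax1.set(xlabel="x", ylabel="y", title="Scatterplot")

ax2.scatter(fit.fittedvalues, fit.resid)     # residuals vs fitted: look for no pattern
ax2.axhline(0, linestyle="--")
ax2.set(xlabel="Fitted values", ylabel="Residuals", title="Residual plot")
plt.tight_layout()
plt.show()
```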


What to do if the linearity assumption fails:

  • Transformations: If the relationship appears to be non-linear, you can try transforming the independent or dependent variables. Common transformations include log transformations, square root transformations, and polynomial transformations.
  • Non-linear regression: If transformations don’t work, you might need to consider using a non-linear regression model.

2. Checking for Independence

The independence assumption states that the errors (residuals) are independent of each other. This means that the error in one observation should not be related to the error in another observation.

  • Durbin-Watson test: The Durbin-Watson test is a statistical test that checks for autocorrelation (correlation between errors at different time points) in the residuals. The test statistic ranges from 0 to 4, with a value of 2 indicating no autocorrelation. Values significantly less than 2 suggest positive autocorrelation, while values significantly greater than 2 suggest negative autocorrelation.
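The Durbin-Watson statistic described above is available in statsmodels; a minimal sketch, assuming a fitted OLS results object named fit as in the earlier sketches:

```python
from statsmodels.stats.stattools import durbin_watson

# Assuming `fit` is a fitted statsmodels OLS results object
dw = durbin_watson(fit.resid)
print(f"Durbin-Watson statistic: {dw:.2f}")  # ~2 means little autocorrelation;
                                             # well below 2 suggests positive autocorrelation
```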

What to do if the independence assumption fails:

  • Time series models: If autocorrelation is present (especially in time series data), you might need to consider using time series models that explicitly account for the dependence between observations.
  • Generalized least squares: In some cases, you can use generalized least squares (GLS) regression, which allows for correlated errors.

3. Checking for Homoscedasticity

The homoscedasticity assumption states that the variance of the errors is constant across all levels of the independent variables. This means that the spread of the residuals should be roughly the same across the range of fitted values.  

  • Residual plots: The residual plot mentioned earlier can also be used to check for homoscedasticity. If the homoscedasticity assumption holds, the residuals should be evenly scattered around zero, with no fanning out or funneling in pattern.
  • Breusch-Pagan test: The Breusch-Pagan test is a statistical test that checks for heteroscedasticity (non-constant variance of errors). The null hypothesis is homoscedasticity. If the p-value is significant (typically less than 0.05), it suggests heteroscedasticity.
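The Breusch-Pagan test is also available in statsmodels; a minimal sketch, again assuming a fitted OLS results object named fit:

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# Assuming `fit` is a fitted statsmodels OLS results object
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, fit.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")  # p < 0.05 suggests heteroscedasticity
```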

What to do if the homoscedasticity assumption fails:

  • Transformations: Transforming the dependent variable can sometimes help stabilize the variance.
  • Weighted least squares: In weighted least squares regression, you assign weights to the observations based on their estimated variance. This gives more weight to observations with lower variance and less weight to observations with higher variance.

4. Checking for Normality

The normality assumption states that the errors are normally distributed. This means that if you were to plot a histogram of the residuals, it should roughly resemble a bell-shaped curve.

  • Histogram of residuals: A histogram of the residuals can provide a visual check for normality.
  • Normal probability plot (Q-Q plot): A Q-Q plot compares the quantiles of the residuals to the quantiles of a normal distribution. If the normality assumption holds, the points should roughly fall along a straight line.
  • Shapiro-Wilk test: The Shapiro-Wilk test is a statistical test that checks for normality. The null hypothesis is normality. If the p-value is significant (typically less than 0.05), it suggests non-normality.
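A minimal sketch of the normality checks above (Shapiro-Wilk test plus a Q-Q plot), assuming a fitted OLS results object named fit:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

# Assuming `fit` is a fitted statsmodels OLS results object
resid = fit.resid

stat, p_value = stats.shapiro(resid)           # Shapiro-Wilk; H0: residuals are normal
print(f"Shapiro-Wilk p-value: {p_value:.4f}")  # p < 0.05 suggests non-normality

sm.qqplot(resid, line="s")                     # Q-Q plot: points should hug the line
plt.show()
```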

What to do if the normality assumption fails:

  • Transformations: Transforming the dependent variable can sometimes help normalize the residuals.
  • Robust regression: Robust regression methods are less sensitive to outliers and deviations from normality.

5. Checking for Multicollinearity (in Multiple Linear Regression)

The no multicollinearity assumption states that the independent variables are not highly correlated with each other. Multicollinearity can make it difficult to interpret the individual effects of the independent variables and can lead to unstable estimates.  

  • Correlation matrix: Calculate the correlation matrix between the independent variables. High correlations (typically above 0.7 or 0.8) can indicate multicollinearity.
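A minimal sketch of the correlation-matrix check, together with variance inflation factors (the other multicollinearity diagnostic mentioned earlier in this document); the DataFrame and column names are placeholders, and the thresholds in the comments are common rules of thumb rather than hard cutoffs:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical DataFrame holding only the independent variables
X = pd.read_csv("data.csv")[["x1", "x2", "x3"]]

print(X.corr())   # pairwise correlations; values above ~0.7-0.8 are a warning sign

# Variance inflation factors (VIFs) for each predictor
Xc = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif)        # VIFs above roughly 5-10 usually indicate problematic multicollinearity
```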


Hypothesis Testing for Differentially Private Linear Regression

Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track

Daniel Alabi, Salil Vadhan

In this work, we design differentially private hypothesis tests for the following problems in the general linear model: testing a linear relationship and testing for the presence of mixtures. The majority of our hypothesis tests are based on differentially private versions of the $F$-statistic for the general linear model framework, which are uniformly most powerful unbiased in the non-private setting. We also present another test for testing mixtures, based on the differentially private nonparametric tests of Couch, Kazan, Shi, Bray, and Groce (CCS 2019), which is especially suited for the small dataset regime. We show that the differentially private $F$-statistic converges to the asymptotic distribution of its non-private counterpart. As a corollary, the statistical power of the differentially private $F$-statistic converges to the statistical power of the non-private $F$-statistic. Through a suite of Monte Carlo based experiments, we show that our tests achieve desired significance levels and have a high power that approaches the power of the non-private tests as we increase sample sizes or the privacy-loss parameter. We also show when our tests outperform existing methods in the literature.



Regression-based joint orthogonality tests of balance can over-reject: so what should you do?

David McKenzie


One of the shortest posts I wrote for the blog was on a joint test of orthogonality when testing for balance between treatment and control groups. Given a set of k covariates X1, X2, X3, …., Xk, this involves running the regression:

Treatment = a + b1X1 + b2X2 + b3X3 + … + bkXk + u

And then testing the joint hypothesis b1=b2=b3=…=bk=0. This could be done by running the equation as a linear regression and using an F-test, or running it as a probit and using a chi-squared test. If the experiment is stratified, you might want to do this conditioning on randomization strata , especially if the probability of assignment to treatment varies across strata , and if the experiment is clustered, then the standard errors should be clustered. There are questions about whether it is desirable at all to do such tests when you know for sure the experiment was correctly randomized, but let’s assume you want to do such a test, perhaps to show the sample is still balanced after attrition, or that a randomization done in the field was done correctly.
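For concreteness, here is a minimal sketch of this regression-based joint test in Python with statsmodels; the file and covariate names are placeholders, and note that the rest of this post argues that this conventional version (with robust standard errors) can over-reject when k is large relative to n, with randomization inference as the suggested fix.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical baseline data: a 0/1 treatment dummy and covariates x1..x3
df = pd.read_csv("baseline.csv")

# Regress treatment on the covariates with HC1 robust standard errors
fit = smf.ols("treatment ~ x1 + x2 + x3", data=df).fit(cov_type="HC1")

# Joint F-test that all covariate coefficients are zero (intercept excluded)
print(fit.f_test("x1 = 0, x2 = 0, x3 = 0"))
```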

One of the folk wisdoms is that researchers sometimes are surprised to find this test rejecting the null hypothesis of joint orthogonality, especially when they have a lot of variables in their balance table, or when they have multiple treatments and estimate a multinomial logit. A new paper by Jason Kerwin, Nada Rostom and Olivier Sterck shows this via simulations, and offers a solution.

Joint orthogonality tests based on standard robust standard errors over-reject the null, especially when k is large relative to n

Kerwin et al. look at both joint orthogonality tests, as well as the practice of doing pairwise t-tests (or group F-tests with multiple treatments) and doing some sort of “vote counting” where e.g. researchers look to see whether more than 10 percent of the tests reject the null at the 10% level. They run simulations for two data generating processes they specify (one using individual level randomization, and one clustered), and with data from two published experiments (one with k=33 and n=698 and individual level randomization, and one with k=10 and clustered randomization with 1016 units in 148 clusters).

They find that standard joint orthogonality tests with “robust” standard errors (HC1, HC2, or HC3) over-reject the null in their simulations:

  • When n=500 and k=50, in one data generating process the test rejects the null at the 10% level approximately 50% of the time! That is, in half the cases researchers would conclude that a truly randomized experiment resulted in imbalance between treatment and control.

  • Things look a lot better if n is large relative to k. With n=5000, size is around the correct 10% even for k=50 or 60; when k=10, size looks pretty good for n=500 or more.

  • The issue is not surprisingly worse in clustered experiments, where the effective degrees of freedom are lower.

What is the problem?

The problem is that standard Eicker-White robust standard error asymptotics do not hold when the number of covariates are large relative to the sample size . Cattaneo et al. (2018) provide discussion and proofs, and suggest that the HC3 estimator can be conservative and used for inference – although Kerwin et al. still find overrejection using HC3 in their simulations. In addition to the number of covariates, leverage matters a lot – and having a lot of covariates and small sample can increase leverage.

So what are the solutions?

The solution Kerwin et al. propose is to use omnibus tests with randomization inference instead of regression standard errors. They show this gives the correct size in their simulations, works with clustering, and also works with multiple treatments. They show this makes a difference in practice to the published papers they relook at: in one, the F-test p-value from HC1 clustered standard errors is p=0.088, whereas it would be 0.278 using RI standard errors; and similarly a regression clustered standard error p-value of 0.068 becomes 0.186 using RI standard errors – so using randomization inference makes the published papers claim of balanced randomization more credible (for once a methods paper that strengthens existing results!).

My other suggestion is for researchers to also think carefully about how many variables they are putting in their balance tables in the first place. We are most concerned about imbalances in variables that will be highly correlated with outcomes of interest – but also often like to use this balance table/Table 1 to provide some summary statistics that help provide context and details of the sample. The latter is a reason for more controls, but keeping to 10-20 controls rather than 30-50 seems plenty to me in most cases – and also will help with journals having restrictions on how many rows your tables can have. Pre-registering which variables will go into this test then helps guard against selective reporting. There are also some parallels to the use of methods such as pdslasso to choose controls – I have a new working paper coming out soon on using this method with field experiments, and one of the lessons there is putting in too many variables can result in a higher chance of not selecting the ones that matter.

Another practical note

Another practical note with these tests is that it can be common to have a few missing values for some baseline covariates – e.g. age might be missing for 3 cases, gender for one, education for a few others, etc. This does not present such a problem for pairwise t-tests (where you are then testing treatment and control are balanced for the subsample that have data on a particular variable). But for a joint orthogonality F-test, the regression would only then be estimated for the subsample with no missing data, which could be a lot lower than n. Researchers then need to think about dummying out the missing values before running this test – but then this can result in a whole lot more (often highly correlated) covariates in the form of dummy variables for these missing values. Another reason to be judicious on which variables go into the omnibus test and focusing on a subset of variables without many missing values.



Statistics LibreTexts

11.1: Testing the Hypothesis that β = 0



The correlation coefficient, \(r\), tells us about the strength and direction of the linear relationship between \(x\) and \(y\). However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient \(r\) and the sample size \(n\), together. We perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

The sample data are used to compute \(r\), the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But because we have only sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, \(r\), is our estimate of the unknown population correlation coefficient.

  • The symbol for the population correlation coefficient is \(\rho\), the Greek letter "rho."
  • \(\rho =\) population correlation coefficient (unknown)
  • \(r =\) sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient \(\rho\) is "close to zero" or "significantly different from zero". We decide this based on the sample correlation coefficient \(r\) and the sample size \(n\).

If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is "significant."

  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is significantly different from zero.
  • What the conclusion means: There is a significant linear relationship between \(x\) and \(y\). We can use the regression line to model the linear relationship between \(x\) and \(y\) in the population.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is "not significant".

  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is not significantly different from zero."
  • What the conclusion means: There is not a significant linear relationship between \(x\) and \(y\). Therefore, we CANNOT use the regression line to model a linear relationship between \(x\) and \(y\) in the population.
  • If \(r\) is significant and the scatter plot shows a linear trend, the line can be used to predict the value of \(y\) for values of \(x\) that are within the domain of observed \(x\) values.
  • If \(r\) is not significant OR if the scatter plot does not show a linear trend, the line should not be used for prediction.
  • If \(r\) is significant and if the scatter plot shows a linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed \(x\) values in the data.

PERFORMING THE HYPOTHESIS TEST

  • Null Hypothesis: \(H_{0}: \rho = 0\)
  • Alternate Hypothesis: \(H_{a}: \rho \neq 0\)

WHAT THE HYPOTHESES MEAN IN WORDS:

  • Null Hypothesis \(H_{0}\): The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between \(x\) and \(y\) in the population.
  • Alternate Hypothesis \(H_{a}\): The population correlation coefficient IS significantly DIFFERENT FROM zero. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between \(x\) and \(y\) in the population.

DRAWING A CONCLUSION: There are two methods of making the decision. The two methods are equivalent and give the same result.

  • Method 1: Using the \(p\text{-value}\)
  • Method 2: Using a table of critical values

In this chapter of this textbook, we will always use a significance level of 5%, \(\alpha = 0.05\).

Using the \(p\text{-value}\) method, you could choose any appropriate significance level you want; you are not limited to using \(\alpha = 0.05\). But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, \(\alpha = 0.05\). (If we wanted to use a different significance level than 5% with the critical value method, we would need different tables of critical values that are not provided in this textbook.)

METHOD 1: Using a \(p\text{-value}\) to make a decision

To calculate the \(p\text{-value}\) using LinRegTTEST:

On the LinRegTTEST input screen, on the line prompt for \(\beta\) or \(\rho\), highlight "\(\neq 0\)"

The output screen shows the \(p\text{-value}\) on the line that reads "\(p =\)".

(Most computer statistical software can calculate the \(p\text{-value}\).)

If the \(p\text{-value}\) is less than the significance level ( \(\alpha = 0.05\) ):

  • Decision: Reject the null hypothesis.
  • Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is significantly different from zero."

If the \(p\text{-value}\) is NOT less than the significance level (\(\alpha = 0.05\)):

  • Decision: DO NOT REJECT the null hypothesis.
  • Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between \(x\) and \(y\) because the correlation coefficient is NOT significantly different from zero."

Calculation Notes:

  • You will use technology to calculate the \(p\text{-value}\). The following describes the calculations to compute the test statistic and the \(p\text{-value}\):
  • The \(p\text{-value}\) is calculated using a \(t\)-distribution with \(n - 2\) degrees of freedom.
  • The formula for the test statistic is \(t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}\). The value of the test statistic, \(t\), is shown in the computer or calculator output along with the \(p\text{-value}\). The test statistic \(t\) has the same sign as the correlation coefficient \(r\).
  • The \(p\text{-value}\) is the combined area in both tails.

An alternative way to calculate the \(p\text{-value}\) ( \(p\) ) given by LinRegTTest is the command 2*tcdf(abs(t),10^99, n-2) in 2nd DISTR.
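For readers working in software rather than on a calculator, the same calculation can be sketched in Python with SciPy. This is a minimal illustration, not part of the LinRegTTest output; the helper name `r_significance_test` is ours.

```python
# A minimal sketch of the t statistic and two-tailed p-value for H0: rho = 0.
# The helper name r_significance_test is illustrative only.
import math
from scipy.stats import t as t_dist

def r_significance_test(r, n):
    """Return (t, p) for testing H0: rho = 0 against Ha: rho != 0."""
    df = n - 2                                          # degrees of freedom
    t_stat = r * math.sqrt(df) / math.sqrt(1 - r**2)    # t = r*sqrt(n-2)/sqrt(1-r^2)
    p_value = 2 * t_dist.sf(abs(t_stat), df)            # combined area in both tails
    return t_stat, p_value
```

As with the calculator output, the sign of \(t\) matches the sign of \(r\), and the \(p\)-value is the two-tailed area under a \(t\)-distribution with \(n - 2\) degrees of freedom.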

THIRD-EXAM vs FINAL-EXAM EXAMPLE: \(p\text{-value}\) method

  • Consider the third exam/final exam example.
  • The line of best fit is: \(\hat{y} = -173.51 + 4.83x\) with \(r = 0.6631\) and there are \(n = 11\) data points.
  • Can the regression line be used for prediction? Given a third exam score ( \(x\) value), can we use the line to predict the final exam score (predicted \(y\) value)?
  • \(H_{0}: \rho = 0\)
  • \(H_{a}: \rho \neq 0\)
  • \(\alpha = 0.05\)
  • The \(p\text{-value}\) is 0.026 (from LinRegTTest on your calculator or from computer software).
  • The \(p\text{-value}\), 0.026, is less than the significance level of \(\alpha = 0.05\).
  • Decision: Reject the Null Hypothesis \(H_{0}\)
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score (\(x\)) and the final exam score (\(y\)) because the correlation coefficient is significantly different from zero.

Because \(r\) is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.
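As a quick numerical check of the values quoted above, the formula can be evaluated directly. This sketch assumes only \(r = 0.6631\) and \(n = 11\):

```python
# Reproducing the third-exam/final-exam p-value from r and n alone (illustrative).
import math
from scipy.stats import t as t_dist

r, n = 0.6631, 11
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)    # about 2.66
p_value = 2 * t_dist.sf(abs(t_stat), n - 2)            # about 0.026
print(round(t_stat, 2), round(p_value, 3))
```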

METHOD 2: Using a table of Critical Values to make a decision

The 95% Critical Values of the Sample Correlation Coefficient Table can be used to give you a good idea of whether the computed value of \(r\) is significant or not. Compare \(r\) to the appropriate critical value in the table. If \(r\) is not between the positive and negative critical values, then the correlation coefficient is significant. If \(r\) is significant, then you may want to use the line for prediction.
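The table entries can themselves be reproduced (at least approximately) from the \(t\)-distribution: solving the test-statistic formula for \(r\) gives a critical value \(r_{crit} = t_{crit}/\sqrt{t_{crit}^{2} + df}\), where \(t_{crit}\) is the two-tailed critical \(t\) at \(\alpha = 0.05\). A short sketch in Python; the helper name `r_critical` is ours:

```python
# Reproducing the 95% critical values of r from the t distribution (illustrative).
import math
from scipy.stats import t as t_dist

def r_critical(n, alpha=0.05):
    df = n - 2
    t_crit = t_dist.ppf(1 - alpha / 2, df)       # two-tailed critical t
    return t_crit / math.sqrt(t_crit**2 + df)    # corresponding critical r

print(round(r_critical(10), 3))   # about 0.632, the df = 8 entry used below
```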

Example \(\PageIndex{1}\)

Suppose you computed \(r = 0.801\) using \(n = 10\) data points. \(df = n - 2 = 10 - 2 = 8\). The critical values associated with \(df = 8\) are \(-0.632\) and \(+0.632\). If \(r <\) negative critical value or \(r >\) positive critical value, then \(r\) is significant. Since \(r = 0.801\) and \(0.801 > 0.632\), \(r\) is significant and the line may be used for prediction. If you view this example on a number line, it will help you.

Exercise \(\PageIndex{1}\)

For a given line of best fit, you computed that \(r = 0.6501\) using \(n = 12\) data points and the critical value is 0.576. Can the line be used for prediction? Why or why not?

If the scatter plot looks linear, then yes, the line can be used for prediction, because \(r >\) the positive critical value.

Example \(\PageIndex{2}\)

Suppose you computed \(r = -0.624\) with 14 data points. \(df = 14 - 2 = 12\). The critical values are \(-0.532\) and \(0.532\). Since \(-0.624 < -0.532\), \(r\) is significant and the line can be used for prediction.

Exercise \(\PageIndex{2}\)

For a given line of best fit, you compute that \(r = 0.5204\) using \(n = 9\) data points, and the critical value is \(0.666\). Can the line be used for prediction? Why or why not?

No, the line cannot be used for prediction, because \(r <\) the positive critical value.

Example \(\PageIndex{3}\)

Suppose you computed \(r = 0.776\) and \(n = 6\). \(df = 6 - 2 = 4\). The critical values are \(-0.811\) and \(0.811\). Since \(-0.811 < 0.776 < 0.811\), \(r\) is not significant, and the line should not be used for prediction.

Exercise \(\PageIndex{3}\)

For a given line of best fit, you compute that \(r = -0.7204\) using \(n = 8\) data points, and the critical value is \(0.707\). Can the line be used for prediction? Why or why not?

Yes, the line can be used for prediction, because \(r <\) the negative critical value.

THIRD-EXAM vs FINAL-EXAM EXAMPLE: critical value method

Consider the third exam/final exam example. The line of best fit is: \(\hat{y} = -173.51 + 4.83x\) with \(r = 0.6631\) and there are \(n = 11\) data points. Can the regression line be used for prediction? Given a third-exam score ( \(x\) value), can we use the line to predict the final exam score (predicted \(y\) value)?

  • Use the "95% Critical Value" table for \(r\) with \(df = n - 2 = 11 - 2 = 9\).
  • The critical values are \(-0.602\) and \(+0.602\)
  • Since \(0.6631 > 0.602\), \(r\) is significant.
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score (\(x\)) and the final exam score (\(y\)) because the correlation coefficient is significantly different from zero.

Example \(\PageIndex{4}\)

Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine whether \(r\) is significant and whether the line of best fit associated with each \(r\) can be used to predict a \(y\) value. If it helps, draw a number line.

  • \(r = –0.567\) and the sample size, \(n\), is \(19\). The \(df = n - 2 = 17\). The critical value is \(-0.456\). \(-0.567 < -0.456\) so \(r\) is significant.
  • \(r = 0.708\) and the sample size, \(n\), is \(9\). The \(df = n - 2 = 7\). The critical value is \(0.666\). \(0.708 > 0.666\) so \(r\) is significant.
  • \(r = 0.134\) and the sample size, \(n\), is \(14\). The \(df = 14 - 2 = 12\). The critical value is \(0.532\). \(0.134\) is between \(-0.532\) and \(0.532\) so \(r\) is not significant.
  • \(r = 0\) and the sample size, \(n\), is five. No matter what the \(df\) is, \(r = 0\) lies between the two critical values, so \(r\) is not significant.

Exercise \(\PageIndex{4}\)

For a given line of best fit, you compute that \(r = 0\) using \(n = 100\) data points. Can the line be used for prediction? Why or why not?

No, the line cannot be used for prediction no matter what the sample size is.

Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between \(x\) and \(y\) in the sample data provides strong enough evidence so that we can conclude that there is a linear relationship between \(x\) and \(y\) in the population.

The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population. Examining the scatter plot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.

The assumptions underlying the test of significance are:

  • There is a linear relationship in the population that models the average value of \(y\) for varying values of \(x\). In other words, the expected value of \(y\) for each particular value of \(x\) lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
  • The \(y\) values for any particular \(x\) value are normally distributed about the line. This implies that there are more \(y\) values scattered closer to the line than are scattered farther away. Assumption (1) implies that these normal distributions are centered on the line: the means of these normal distributions of \(y\) values lie on the line.
  • The standard deviations of the population \(y\) values about the line are equal for each value of \(x\). In other words, each of these normal distributions of \(y\) values has the same shape and spread about the line.
  • The residual errors are mutually independent (no pattern).
  • The data are produced from a well-designed, random sample or randomized experiment.

Linear regression is a procedure for fitting a straight line of the form \(\hat{y} = a + bx\) to data. The conditions for regression are listed below; an informal way to check several of them in software is sketched after the list.

  • Linear In the population, there is a linear relationship that models the average value of \(y\) for different values of \(x\).
  • Independent The residuals are assumed to be independent.
  • Normal The \(y\) values are distributed normally for any value of \(x\).
  • Equal variance The standard deviation of the \(y\) values is equal for each \(x\) value.
  • Random The data are produced from a well-designed random sample or randomized experiment.
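One informal way to examine these conditions from sample data is to fit the line and inspect the residuals. The sketch below uses hypothetical, randomly generated data purely for illustration; it is a rough diagnostic, not a formal verification of the assumptions.

```python
# Informal checks of the Linear / Independent / Normal / Equal-variance conditions.
# The data here are hypothetical, generated only for illustration.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(60, 80, size=30)                  # hypothetical predictor values
y = 5 * x - 180 + rng.normal(0, 10, size=30)      # hypothetical responses

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)   # observed minus predicted

# Normal: a formal normality check of the residuals (interpret cautiously for small n)
print(stats.shapiro(residuals))

# Linear / Equal variance / Independent: plot residuals against x and look for
# curvature, funnel shapes, or systematic patterns
plt.scatter(x, residuals)
plt.axhline(0, color="gray")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```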

The slope \(b\) and intercept \(a\) of the least-squares line estimate the slope \(\beta\) and intercept \(\alpha\) of the population (true) regression line. To estimate the population standard deviation of \(y\) about the line, \(\sigma\), use the standard deviation of the residuals, \(s = \sqrt{\frac{SSE}{n-2}}\). The variable \(\rho\) (rho) is the population correlation coefficient. To test the null hypothesis \(H_{0}: \rho =\) hypothesized value, use a linear regression t-test. The most common null hypothesis is \(H_{0}: \rho = 0\), which indicates there is no linear relationship between \(x\) and \(y\) in the population. The TI-83, 83+, 84, and 84+ calculator function LinRegTTest can perform this test (STAT > TESTS > LinRegTTest).
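Outside the TI calculators, most statistical software performs the same test. For example, scipy.stats.linregress reports the two-sided \(p\)-value for the null hypothesis that the slope is zero, which in simple linear regression is equivalent to \(H_{0}: \rho = 0\). The data below are hypothetical, for illustration only:

```python
# A LinRegTTest-style calculation via SciPy (hypothetical data, for illustration only).
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.9]

fit = stats.linregress(x, y)
print(fit.slope, fit.intercept)   # b and a, estimates of the population line
print(fit.rvalue)                 # sample correlation coefficient r
print(fit.pvalue)                 # two-tailed p-value for H0: slope = 0 (equivalently rho = 0)
```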

Formula Review

Least Squares Line or Line of Best Fit:

\[\hat{y} = a + bx\]

\[a = y\text{-intercept}\]

\[b = \text{slope}\]

Standard deviation of the residuals:

\[s = \sqrt{\frac{SSE}{n-2}}\]

\[SSE = \text{sum of squared errors}\]

\[n = \text{the number of data points}\]
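A short sketch showing how these quantities fit together, using hypothetical data (illustrative only):

```python
# Computing SSE and the standard deviation of the residuals, s = sqrt(SSE / (n - 2)).
# The data are hypothetical, for illustration only.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

fit = stats.linregress(x, y)                 # a = fit.intercept, b = fit.slope
y_hat = fit.intercept + fit.slope * x        # predicted values from y-hat = a + bx
SSE = np.sum((y - y_hat) ** 2)               # sum of squared errors
s = np.sqrt(SSE / (len(x) - 2))              # standard deviation of the residuals
print(round(SSE, 4), round(s, 4))
```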


