Multiple Linear Regression | A Quick Guide (Examples)

Published on February 20, 2020 by Rebecca Bevans. Revised on June 22, 2023.

Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

  • How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
  • The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Multiple linear regression makes all of the same assumptions as simple linear regression:

Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r² > ~0.6), then only one of them should be used in the regression model.

Normality: the data follow a normal distribution. Strictly, this assumption applies to the model's residuals rather than to the raw variables.

Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.

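Multiple linear regression formula

The formula for a multiple linear regression is:

\(y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \epsilon\)

  • \(y\) = the value of the dependent variable
  • \(\beta_0\) = the y-intercept (the value of y when all independent variables are 0)
  • \(\beta_1 X_1\) = the regression coefficient (\(\beta_1\)) of the first independent variable (\(X_1\))
  • … = do the same for however many independent variables you are testing
  • \(\beta_n X_n\) = the regression coefficient of the last independent variable
  • \(\epsilon\) = the model error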

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

  • The regression coefficients that lead to the smallest overall model error.
  • The F statistic of the overall model.
  • The associated p value (how likely it is that the F statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself.

Dataset for multiple linear regression (.csv)

Load the heart.data dataset into your R environment and run the following code:
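The listing is not reproduced in this copy. As a minimal sketch of what such a call could look like, assume the data frame has columns named `heart.disease`, `biking`, and `smoking` (names inferred from the description, not confirmed by the source). Because the CSV is not bundled here, the sketch simulates a stand-in `heart.data` with arbitrary coefficients; with the real file you would load it instead, for example with `read.csv()`.

```r
# Stand-in for the downloadable dataset: simulated values, NOT the real heart.data.
# Column names (biking, smoking, heart.disease) are assumed from the text.
set.seed(1)
n <- 200
heart.data <- data.frame(
  biking  = runif(n, 1, 75),   # % of people biking to work (simulated)
  smoking = runif(n, 0.5, 30)  # % of people smoking (simulated)
)
heart.data$heart.disease <- 15 - 0.2 * heart.data$biking +
  0.18 * heart.data$smoking + rnorm(n, sd = 0.65)

# Fit the multiple linear regression: one response, two predictors.
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)

# Coefficient table, residual summary, and overall F test.
summary(heart.disease.lm)
```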

This code takes the data set heart.data and calculates the effect that the independent variables biking and smoking have on the dependent variable heart disease using the equation for the linear model: lm().

Learn more by following the full step-by-step guide to linear regression in R.

To view the results of the model, you can use the summary() function:

This function takes the most important parameters from the linear model and puts them into a table that looks like this:

R multiple linear regression summary output

The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’). If the residuals are roughly centered around zero and with similar spread on either side, as these do (median 0.03, and min and max around -2 and 2), then the model probably fits the assumption of homoscedasticity.

Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It’s helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable:
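In symbols, the prediction equation for this example has the following general form, with the estimates from the Coefficients table substituted for the unknown parameters. The subscript names are illustrative, and no numeric values are assumed here:

```latex
\widehat{\text{heart disease}}
  = \hat\beta_0
  + \hat\beta_{\text{biking}} \cdot \text{biking}
  + \hat\beta_{\text{smoking}} \cdot \text{smoking}
```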

The most important things to note in this output table are the next two rows – the estimates for the independent variables.

The Estimate column is the estimated effect, also called the regression coefficient. The estimates in the table tell us that for every one percent increase in biking to work there is an associated 0.2 percent decrease in heart disease, and that for every one percent increase in smoking there is an associated 0.17 percent increase in heart disease.

The Std. Error column displays the standard error of the estimate. This number shows how much variation there is around the estimates of the regression coefficient.

The t value column displays the test statistic. Unless otherwise specified, the test statistic used in linear regression is the t value from a two-sided t test. The larger the absolute value of the test statistic, the less likely it is that the results occurred by chance.

The Pr(>|t|) column shows the p value. This shows how likely it is that the calculated t value would have occurred by chance if the null hypothesis of no effect of the parameter were true.

Because these values are so low (p < 0.001 in both cases), we can reject the null hypothesis and conclude that biking to work and smoking both likely influence rates of heart disease.

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.

Multiple regression in R graph

Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work.

To include the effect of smoking on the dependent variable, we calculated these predicted values while holding smoking constant at the minimum, mean, and maximum observed rates of smoking.
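One way to build such a plot is sketched below. It is an illustration rather than the article's own plotting code: the data are simulated under assumed column names (`biking`, `smoking`, `heart.disease`), and base R graphics are used for simplicity.

```r
# Simulated stand-in for heart.data (assumed column names; not the real dataset).
set.seed(2)
heart.data <- data.frame(biking = runif(150, 1, 75), smoking = runif(150, 0.5, 30))
heart.data$heart.disease <- 15 - 0.2 * heart.data$biking +
  0.18 * heart.data$smoking + rnorm(150, sd = 0.65)
fit <- lm(heart.disease ~ biking + smoking, data = heart.data)

# Grid: full observed range of biking, crossed with three smoking levels.
plot.data <- expand.grid(
  biking  = seq(min(heart.data$biking), max(heart.data$biking), length.out = 30),
  smoking = c(min(heart.data$smoking), mean(heart.data$smoking), max(heart.data$smoking))
)
plot.data$predicted <- predict(fit, newdata = plot.data)

# One fitted line per smoking level, drawn over the observed points.
plot(heart.disease ~ biking, data = heart.data, col = "grey60")
for (s in unique(plot.data$smoking)) {
  lines(predicted ~ biking, data = plot.data[plot.data$smoking == s, ])
}
```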

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of the squared distances.

Linear regression fits a line to the data by finding the regression coefficients that result in the smallest MSE.
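The three steps can be written out directly. The sketch below uses the `mtcars` data that ships with R as a stand-in and also checks that shifting one coefficient away from the fitted value makes the MSE larger:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)    # mtcars ships with R
distances <- mtcars$mpg - fitted(fit)      # observed minus predicted y
mse <- mean(distances^2)                   # square, then average

# Any other coefficient vector gives a larger MSE on the same data.
X <- cbind(1, mtcars$wt, mtcars$hp)
other <- coef(fit) + c(0.5, 0, 0)          # shift the intercept by 0.5
mse_other <- mean((mtcars$mpg - X %*% other)^2)
c(fitted = mse, shifted = mse_other)
```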

Cite this Scribbr article

Bevans, R. (2023, June 22). Multiple Linear Regression | A Quick Guide (Examples). Scribbr. Retrieved August 8, 2024, from https://www.scribbr.com/statistics/multiple-linear-regression/

Writing hypothesis for linear multiple regression models

I struggle with writing hypotheses because I get very confused by reference groups in the context of regression models.

For my example I'm using the mtcars dataset. The predictors are wt (weight), cyl (number of cylinders), and gear (number of gears), and the outcome variable is mpg (miles per gallon).

Say all your friends think you should buy a 6 cylinder car, but before you make up your mind you want to know how 6 cylinder cars perform miles-per-gallon-wise compared to 4 cylinder cars because you think there might be a difference.

Would this be a fair null hypothesis (since 4 cylinder cars is the reference group)? There is no difference between 6 cylinder car miles-per-gallon performance and 4 cylinder car miles-per-gallon performance.

Would this be a fair model interpretation? 6 cylinder vehicles travel fewer miles per gallon (p = 0.010, β = -4.00, CI -6.95 to -1.04) as compared to 4 cylinder vehicles when adjusting for all other predictors, thus rejecting the null hypothesis.

Sorry for troubling, and thanks in advance for any feedback!

  • multiple-regression
  • linear-model
  • interpretation

LuizZ's user avatar

Yes, you already got the right answer to both of your questions.

  • Your null hypothesis is completely fair. You did it the right way. When you have a factor variable as a predictor, you omit one of the levels as a reference category (the default is usually the first one, but you can also change that). Then all your other levels’ coefficients are tested for a significant difference compared to the omitted category. Just like you did.

If you would like to compare 6-cylinder cars with 8-cylinder cars, then you would have to change the reference category. In your hypothesis you could just have added at the end (or as a footnote): "when adjusting for weight and gear", but it is fine the way you did it.

  • Your model interpretation is correct: it is perfect the way you did it. You could even have said: "the best estimate is that 6 cylinder vehicles travel 4 miles per gallon less than 4 cylinder vehicles (p-value: 0.010; CI: -6.95, -1.04), when adjusting for weight and gear, thus rejecting the null hypothesis".

Let's assume that your hypothesis was related to gears, and you were comparing 4-gear vehicles with 3-gear vehicles. Then your result would be β: 0.65; p-value: 0.67; CI: -2.5, 3.8. You would say that: "There is no statistically significant difference between three and four gear cars in fuel consumption, when adjusting for weight and number of cylinders, thus failing to reject the null hypothesis".
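A short sketch of how the reference category works in R, assuming the model is fit with `cyl` and `gear` converted to factors (the question does not show its code, so the exact formula is an assumption). `relevel()` changes which level is omitted:

```r
cars <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))
levels(cars$cyl)   # "4" "6" "8": the first level, 4, is the reference

fit <- lm(mpg ~ wt + cyl + gear, data = cars)
coef(summary(fit))["cyl6", ]   # 6- vs 4-cylinder difference, adjusted for wt and gear
confint(fit)["cyl6", ]

# To compare 8- with 6-cylinder cars, make 6 the reference level and refit.
cars$cyl <- relevel(cars$cyl, ref = "6")
fit6 <- lm(mpg ~ wt + cyl + gear, data = cars)
coef(fit6)[c("cyl4", "cyl8")]
```

Only the coefficients and their tests change between the two fits; the fitted values are the same.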

Multiple linear regression

Fig. 11 Multiple linear regression

Model: \(y_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p} + \varepsilon_i\)

Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)

Fit: the estimates \(\hat\beta_0, \dots, \hat\beta_p\) are chosen to minimize the residual sum of squares (RSS):

$$ \text{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 $$

Matrix notation: with \(\beta=(\beta_0,\dots,\beta_p)\) and \({X}\) our usual data matrix with an extra column of ones on the left to account for the intercept, we can write

$$ Y = X\beta + \varepsilon. $$

Multiple linear regression answers several questions

Is at least one of the variables \(X_i\) useful for predicting the outcome \(Y\)?

Which subset of the predictors is most important?

How good is a linear model for these data?

Given a set of predictor values, what is a likely value for \(Y\), and how accurate is this prediction?

The estimates \(\hat\beta\)

Our goal again is to minimize the RSS:

$$ \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} $$

One can show that this is minimized by the vector \(\hat\beta\):

$$ \hat\beta = ({X}^T{X})^{-1}{X}^T{Y}. $$

We usually write \(RSS=RSS(\hat{\beta})\) for the minimized RSS.
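As a quick numerical check of the closed form, the sketch below solves the normal equations on `mtcars` (a stand-in dataset that ships with R) and compares the result with `lm()`. Note that `lm()` itself uses a QR decomposition rather than inverting \(X^TX\).

```r
X <- cbind(1, as.matrix(mtcars[, c("wt", "hp")]))  # column of ones for the intercept
Y <- mtcars$mpg

beta_hat <- solve(t(X) %*% X, t(X) %*% Y)   # solves (X^T X) beta = X^T Y
rss <- sum((Y - X %*% beta_hat)^2)          # minimized RSS

fit <- lm(mpg ~ wt + hp, data = mtcars)
cbind(normal_eq = drop(beta_hat), lm = coef(fit))
```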

Which variables are important?

Consider the hypothesis: \(H_0:\) the last \(q\) predictors have no relation with \(Y\).

Based on our model: \(H_0:\beta_{p-q+1}=\beta_{p-q+2}=\dots=\beta_p=0.\)

Let \(\text{RSS}_0\) be the minimized residual sum of squares for the model which excludes these variables.

The \(F\)-statistic is defined by:

$$ F = \frac{(\text{RSS}_0-\text{RSS})/q}{\text{RSS}/(n-p-1)}. $$

Under the null hypothesis (of our model), this has an \(F\)-distribution with \(q\) and \(n-p-1\) degrees of freedom.

Example: If \(q=p\), we test whether any of the variables is important.

$$ \text{RSS}_0 = \sum_{i=1}^n(y_i-\overline y)^2 $$

A anova: 2 × 6

Res.Df   RSS        Df      Sum of Sq   F          Pr(>F)
<dbl>    <dbl>      <dbl>   <dbl>       <dbl>      <dbl>
494      11336.29   NA      NA          NA         NA
492      11078.78   2       257.5076    5.717853   0.003509036

The \(t\)-statistic associated to the \(i\)th predictor is, up to sign, the square root of the \(F\)-statistic for the null hypothesis which sets only \(\beta_i=0\).

A low \(p\)-value indicates that the predictor is important.

Warning: If there are many predictors, some of the \(t\)-tests will have low p-values even when the model has no explanatory power.
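The \(F\)-statistic and its link to the \(t\)-statistic can be reproduced by hand. This sketch uses `mtcars` as a stand-in dataset, with an arbitrary choice of \(p=4\) predictors and the last \(q=2\) under test:

```r
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)  # p = 4 predictors
reduced <- lm(mpg ~ wt + hp, data = mtcars)                # drops the last q = 2
n <- nrow(mtcars); p <- 4; q <- 2

rss  <- sum(resid(full)^2)
rss0 <- sum(resid(reduced)^2)
F_stat  <- ((rss0 - rss) / q) / (rss / (n - p - 1))
p_value <- pf(F_stat, q, n - p - 1, lower.tail = FALSE)
anova(reduced, full)   # the same F and Pr(>F), computed by R

# Dropping a single predictor: the squared t statistic equals that F statistic.
t_drat <- coef(summary(full))["drat", "t value"]
F_drat <- anova(update(full, . ~ . - drat), full)$F[2]
c(t_squared = t_drat^2, F = F_drat)
```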

How many variables are important?

When we select a subset of the predictors, we have \(2^p\) choices.

A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.

Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step.

Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.

Mixed selection: Starting from some model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.
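Forward selection as described above can be sketched in a few lines. The helper `forward_by_rss()` is hypothetical (it is not a function from R or from these notes), and it only builds the sequence of nested models; choosing one model along that path is a separate step.

```r
forward_by_rss <- function(response, candidates, data) {
  chosen <- character(0)
  path <- list()
  while (length(candidates) > 0) {
    # RSS of the current model plus each remaining candidate, one at a time.
    rss <- sapply(candidates, function(v) {
      f <- reformulate(c(chosen, v), response = response)
      sum(resid(lm(f, data = data))^2)
    })
    best <- names(which.min(rss))          # candidate giving the smallest RSS
    chosen <- c(chosen, best)
    candidates <- setdiff(candidates, best)
    path[[length(chosen)]] <- list(vars = chosen, rss = min(rss))
  }
  path
}

path <- forward_by_rss("mpg", c("wt", "hp", "qsec", "drat"), mtcars)
sapply(path, function(s) s$rss)                           # RSS after each step
sapply(path, function(s) paste(s$vars, collapse = " + ")) # variables at each step
```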

Choosing one model in the range produced is a form of tuning. This tuning can invalidate some of our methods like hypothesis tests and confidence intervals…

How good are the predictions?

The function predict in R outputs predictions and confidence intervals from a linear model:

A matrix: 3 × 3 of type dbl

fit         lwr         upr
9.409426    8.722696    10.09616
14.163090   13.708423   14.61776
18.916754   18.206189   19.62732

Prediction intervals reflect uncertainty on \(\hat\beta\) and the irreducible error \(\varepsilon\) as well.

A matrix: 3 × 3 of type dbl

fit         lwr         upr
9.409426    2.946709    15.87214
14.163090   7.720898    20.60528
18.916754   12.451461   25.38205
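The calls that produce such tables are not shown in this copy. A sketch with `mtcars` as a stand-in dataset (the predictor values in `new` are arbitrary) shows the two kinds of interval side by side:

```r
fit <- lm(mpg ~ wt, data = mtcars)
new <- data.frame(wt = c(2.5, 3.0, 3.5))   # predictor values to predict at

# Confidence interval: uncertainty about the mean response only.
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
# Prediction interval: also includes the irreducible error of a new observation.
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

conf_int
pred_int
```

Both calls return the same `fit` column; only `lwr` and `upr` differ, and the prediction intervals are wider.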

These functions rely on our linear regression model

$$ Y = X\beta + \epsilon. $$

Dealing with categorical or qualitative predictors

For each qualitative predictor, e.g. Region:

Choose a baseline category, e.g. East

For every other category, define a new predictor:

\(X_\text{South}\) is 1 if the person is from the South region and 0 otherwise

\(X_\text{West}\) is 1 if the person is from the West region and 0 otherwise.

The model will be:

$$ Y = \beta_0 + \beta_1 X_1 +\dots +\beta_7 X_7 + \color{Red}{\beta_\text{South}} X_\text{South} + \beta_\text{West} X_\text{West} +\varepsilon. $$

The parameter \(\color{Red}{\beta_\text{South}}\) is the relative effect on Balance (our \(Y\)) for being from the South compared to the baseline category (East).

The model fit and predictions are independent of the choice of the baseline category.

However, hypothesis tests derived from these variables are affected by the choice.

Solution: To check whether region is important, use an \(F\) -test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not depend on the coding.

Note that there are other ways to encode qualitative predictors that produce the same fit \(\hat f\), but the coefficients have different interpretations.
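These points can be checked in R, which builds the indicator variables automatically for factors. The sketch uses `gear` from `mtcars` as a stand-in for Region, since the data behind the Balance example are not bundled with R:

```r
d <- transform(mtcars, gear = factor(gear))   # levels "3", "4", "5"; baseline "3"
head(model.matrix(~ gear, data = d), 3)       # columns: (Intercept), gear4, gear5

fit_a <- lm(mpg ~ wt + gear, data = d)
d$gear_b <- relevel(d$gear, ref = "5")        # same factor, different baseline
fit_b <- lm(mpg ~ wt + gear_b, data = d)

all.equal(fitted(fit_a), fitted(fit_b))       # the fit does not depend on the baseline
coef(summary(fit_a)); coef(summary(fit_b))    # but the coefficient tests do

# F test for the factor as a whole; it does not depend on the coding.
base <- lm(mpg ~ wt, data = d)
anova(base, fit_a)
```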

So far, we have:

Defined Multiple Linear Regression

Discussed how to test the importance of variables.

Described one approach to choose a subset of variables.

Explained how to code qualitative variables.

Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?

How good is the fit?

To assess the fit, we focus on the residuals

$$ e = Y - \hat{Y} $$

The RSS always decreases as we add more variables.

The residual standard error (RSE) corrects this:

$$ \text{RSE} = \sqrt{\frac{1}{n-p-1}\text{RSS}}. $$
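A small sketch of both statements, using `mtcars` as a stand-in. The `noise` column is pure simulated noise, added only to show that the RSS cannot go up when a variable is added:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); p <- 2

rse <- sqrt(sum(resid(fit)^2) / (n - p - 1))
c(by_hand = rse, from_lm = sigma(fit))     # sigma() returns the RSE of an lm fit

# Adding a predictor unrelated to the response still lowers the RSS.
set.seed(1)
d <- transform(mtcars, noise = rnorm(nrow(mtcars)))
noise_fit <- lm(mpg ~ wt + hp + noise, data = d)
c(rss = sum(resid(fit)^2), rss_with_noise = sum(resid(noise_fit)^2))
```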

Fig. 12 Residuals

Visualizing the residuals can reveal phenomena that are not accounted for by the model, e.g. synergies or interactions:

Potential issues in linear regression

Interactions between predictors

Non-linear relationships

Correlation of error terms

Non-constant variance of error (heteroskedasticity)

High leverage points

Collinearity

Interactions between predictors

Linear regression has an additive assumption:

$$ \mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\varepsilon $$

i.e. an increase of 100 USD in TV ads causes a fixed increase of \(100 \beta_1\) USD in sales on average, regardless of how much you spend on radio ads.

We saw that in Fig 3.5 above. If we visualize the fit and the observed points, we see they are not evenly scattered around the plane. This could be caused by an interaction.

One way to deal with this is to include multiplicative variables in the model:

$$ \mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio} + \beta_3\times(\mathtt{tv}\cdot\mathtt{radio}) + \varepsilon $$

The interaction variable tv \(\cdot\) radio is high when both tv and radio are high.

R makes it easy to include interaction variables in the model:
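The original call is not shown in this copy. In R's formula syntax, `a * b` expands to both main effects plus their product `a:b`. The sketch uses `wt` and `hp` from `mtcars` as stand-ins for tv and radio, since the advertising data are not bundled with R:

```r
additive   <- lm(mpg ~ wt + hp, data = mtcars)
with_inter <- lm(mpg ~ wt * hp, data = mtcars)   # expands to wt + hp + wt:hp

names(coef(with_inter))
anova(additive, with_inter)   # does the product term improve the fit?
```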

Non-linearities

Fig. 13 A nonlinear fit might be better here.

Example: Auto dataset.

A scatterplot between a predictor and the response may reveal a non-linear relationship.

Solution: include polynomial terms in the model.

Could use other functions besides polynomials…
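A sketch of adding a polynomial term, with `mtcars` standing in for the Auto data (which is not part of base R) and `hp` playing the role of horsepower:

```r
linear    <- lm(mpg ~ hp, data = mtcars)
quadratic <- lm(mpg ~ hp + I(hp^2), data = mtcars)   # adds a squared term
anova(linear, quadratic)                             # F test for the extra term

# poly() builds an orthogonal polynomial basis: same fitted values,
# different (and better conditioned) coefficients.
quad_orth <- lm(mpg ~ poly(hp, 2), data = mtcars)
```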

Fig. 14 Residuals for Auto data

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?

Correlation of error terms

We assumed that the errors for each sample are independent:

$$ \varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.} $$

What if this breaks down?

The main effect is that this invalidates any assertions about Standard Errors, confidence intervals, and hypothesis tests…

Example: Suppose that by accident, we duplicate the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of \(\sqrt{2}\).
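The duplication example is easy to reproduce, here on `mtcars` as a stand-in. In a finite sample the factor is \(\sqrt{(2n-p-1)/(n-p-1)}\), which approaches \(\sqrt{2}\) as \(n\) grows:

```r
fit  <- lm(mpg ~ wt, data = mtcars)
dup  <- rbind(mtcars, mtcars)            # every sample used twice
fit2 <- lm(mpg ~ wt, data = dup)

se  <- coef(summary(fit))[, "Std. Error"]
se2 <- coef(summary(fit2))[, "Std. Error"]

rbind(original = coef(fit), duplicated = coef(fit2))  # estimates are unchanged
se / se2                                              # standard errors shrink
```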

When could this happen in real life:

Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.

Spatial data: Each sample corresponds to a different location in space.

Grouped data: Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family, their shared environment could make them deviate from \(f(x)\) in similar ways.

Correlated errors

Simulations of time series with increasing correlations between \(\varepsilon_i\)

Non-constant variance of error (heteroskedasticity)

The variance of the error depends on some characteristics of the input features.

To diagnose this, we can plot residuals vs. fitted values:

If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.
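A sketch of this diagnostic on simulated data (the data-generating model below is an assumption chosen so that the noise grows with the mean, not something from the notes):

```r
set.seed(3)
x <- runif(200, 1, 10)
y <- exp(0.3 * x + rnorm(200, sd = 0.3))   # multiplicative noise (simulated)

raw    <- lm(y ~ x)
logged <- lm(log(y) ~ x)                   # log-transform the response

# Residuals vs. fitted values, before and after the transformation.
op <- par(mfrow = c(1, 2))
plot(fitted(raw), resid(raw), main = "Response y")
plot(fitted(logged), resid(logged), main = "Response log(y)")
par(op)
```

With data generated this way, the left panel typically fans out as the fitted values grow, while the right panel shows a roughly constant spread.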

Outliers from a model are points with very high errors.

While they may not affect the fit, they might affect our assessment of model quality.

Possible solutions:

If we believe an outlier is due to an error in data collection, we can remove it.

An outlier might be evidence of a missing predictor, or the need to specify a more complex model.

High leverage points

Some samples with extreme inputs have an outsized effect on \(\hat \beta\).

This can be measured with the leverage statistic or self influence:

$$ h_{ii} = \frac{\partial \hat y_i}{\partial y_i} = \left[ X (X^T X)^{-1} X^T \right]_{ii} $$

Studentized residuals

The residual \(\hat \epsilon_i = y_i - \hat y_i\) is an estimate for the noise \(\epsilon_i\).

The standard error of \(\hat \epsilon_i\) is \(\sigma \sqrt{1-h_{ii}}\).

A studentized residual is \(\hat \epsilon_i\) divided by its standard error (with appropriate estimate of \(\sigma\)).

When the model is correct, it follows a Student-t distribution with \(n-p-2\) degrees of freedom.
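Both quantities can be computed from the hat matrix. The sketch below uses `mtcars` as a stand-in and takes the "appropriate estimate of \(\sigma\)" to be the leave-one-out estimate, which is what gives the \(n-p-2\) degrees of freedom:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix
h <- diag(H)                               # leverage h_ii of each sample
e <- resid(fit)
n <- nrow(X); p <- ncol(X) - 1

# Leave-one-out estimate of sigma for each sample.
sigma_i <- sqrt((sum(e^2) - e^2 / (1 - h)) / (n - p - 2))
t_i <- e / (sigma_i * sqrt(1 - h))         # studentized residuals

head(cbind(leverage = h, studentized = t_i))
```

R's built-in `hatvalues()` and `rstudent()` return the same two quantities.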

Collinearity

Two predictors are collinear if one explains the other well:

Problem: The coefficients become unidentifiable.

Consider the extreme case of using two identical predictors limit:

$$ \begin{aligned} \mathtt{balance} &= \beta_0 + \beta_1\times\mathtt{limit} + \beta_2\times\mathtt{limit} + \epsilon \\ & = \beta_0 + (\beta_1+100)\times\mathtt{limit} + (\beta_2-100)\times\mathtt{limit} + \epsilon \end{aligned} $$

For every \((\beta_0,\beta_1,\beta_2)\) the fit at \((\beta_0,\beta_1,\beta_2)\) is just as good as at \((\beta_0,\beta_1+100,\beta_2-100)\) .

If 2 variables are collinear, we can easily diagnose this using their correlation.

A group of \(q\) variables is multilinear if these variables “contain less information” than \(q\) independent variables.

Pairwise correlations may not reveal multilinear variables.

For each predictor, the Variance Inflation Factor (VIF) measures how predictable that predictor is from the other variables, a proxy for how necessary it is:

$$ \text{VIF}(\hat\beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}} $$

Above, \(R^2_{X_j|X_{-j}}\) is the \(R^2\) statistic for Multiple Linear regression of the predictor \(X_j\) onto the remaining predictors.
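The definition translates directly into code. `vif_by_hand()` below is a hypothetical helper written for this sketch, applied to `mtcars` as a stand-in dataset:

```r
vif_by_hand <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # predictors without the intercept
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared   # R^2 of X_j on the rest
    1 / (1 - r2)
  })
}

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif_by_hand(fit)
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values indicate that it is increasingly predictable from them.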

Getting started with Multivariate Multiple Regression

Multivariate Multiple Regression is a method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score. You may be thinking, "why not just run separate regressions for each dependent variable?" That's actually a good idea! And in fact that's pretty much what multivariate multiple regression does. It regresses each dependent variable separately on the predictors. However, because we have multiple responses, we have to modify our hypothesis tests for regression parameters and our confidence intervals for predictions.

To get started, let's read in some data from the book Applied Multivariate Statistical Analysis (6th ed.) by Richard Johnson and Dean Wichern. These data come from exercise 7.25 and involve 17 overdoses of the drug amitriptyline (Rudorfer, 1982). There are two responses we want to model: TOT and AMI. TOT is total TCAD plasma level and AMI is the amount of amitriptyline present in the TCAD plasma level. The predictors are as follows:

  • GEN, gender (male = 0, female = 1)
  • AMT, amount of drug taken at time of overdose
  • PR, PR wave measurement
  • DIAP, diastolic blood pressure
  • QRS, QRS wave measurement

We'll use the R statistical computing environment to demonstrate multivariate multiple regression. The following code reads the data into R and names the columns.

Before going further you may wish to explore the data using the summary() and pairs() functions.

Performing multivariate multiple regression in R requires wrapping the multiple responses in the cbind() function. cbind() takes two vectors, or columns, and "binds" them together into two columns of data. We insert that on the left side of the formula operator: ~. On the other side we add our predictors. The + signs do not mean addition but rather inclusion. Taken together the formula cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS translates to "model TOT and AMI as a function of GEN, AMT, PR, DIAP and QRS." To fit this model we use the workhorse lm() function and save it to an object we name "mlm1". Finally we view the results with summary().

Notice the summary shows the results of two regressions: one for TOT and one for AMI. These are exactly the same results we would get if we modeled each separately. You can verify this for yourself by running the following code and comparing the summaries to what we got above. They're identical.
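The code is not reproduced in this copy. As a sketch of the workflow, the block below simulates a stand-in data frame with the column names used in the text (the object name `ami_data` and all simulated values are assumptions, not the amitriptyline data), fits the multivariate model, and then fits each response separately for comparison:

```r
# Simulated stand-in with the column names from the text; NOT the real data.
set.seed(4)
n <- 17
ami_data <- data.frame(
  GEN  = rep(0:1, length.out = n),
  AMT  = runif(n, 500, 7500),
  PR   = runif(n, 150, 240),
  DIAP = runif(n, 0, 90),
  QRS  = runif(n, 60, 170)
)
ami_data$TOT <- 500 + 600 * ami_data$GEN + 0.3 * ami_data$AMT + rnorm(n, sd = 300)
ami_data$AMI <- 0.8 * ami_data$TOT + rnorm(n, sd = 100)

mlm1 <- lm(cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
summary(mlm1)    # one coefficient table per response

# Separate fits give the same coefficients, response by response.
m_tot <- lm(TOT ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
m_ami <- lm(AMI ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
cbind(coef(m_tot), coef(m_ami))
```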

The same diagnostics we check for models with one predictor should be checked for these as well. For a review of some basic but essential diagnostics see our post Understanding Diagnostic Plots for Linear Regression Analysis.

We can use R's extractor functions with our mlm1 object, except we'll get double the output. For example, instead of one set of residuals, we get two:

Instead of one set of fitted values, we get two:

Instead of one set of coefficients, we get two:

Instead of one residual standard error, we get two:

Again these are all identical to what we get by running separate models for each response. The similarity ends, however, with the variance-covariance matrix of the model coefficients. We don't reproduce the output here because of the size, but we encourage you to view it for yourself:

The main takeaway is that the coefficients from both models covary. That covariance needs to be taken into account when determining if a predictor is jointly contributing to both models. For example, the effects of PR and DIAP seem borderline. They appear significant for TOT but less so for AMI. But it's not enough to eyeball the results from the two separate regressions. We should formally test for their inclusion. And that test involves the covariances between the coefficients in both models.

Determining whether or not to include predictors in a multivariate multiple regression requires the use of multivariate test statistics. These are often taught in the context of MANOVA, or multivariate analysis of variance. Again the term "multivariate" here refers to multiple responses or dependent variables. This means we use modified hypothesis tests to determine whether a predictor contributes to a model.

The easiest way to do this is to use the Anova() or Manova() functions in the car package (Fox and Weisberg, 2011), like so:

The results are titled "Type II MANOVA Tests". The Anova() function automatically detects that mlm1 is a multivariate multiple regression object. "Type II" refers to the type of sum-of-squares. This basically says that predictors are tested assuming all other predictors are already in the model. This is usually what we want. Notice that PR and DIAP appear to be jointly insignificant for the two models despite what we were led to believe by examining each model separately.

Based on these results we may want to see if a model with just GEN and AMT fits as well as a model with all five predictors. One way we can do this is to fit a smaller model and then compare the smaller model to the larger model using the anova() function, (notice the little "a"; this is different from the Anova() function in the car package). For example, below we create a new model using the update() function that only includes GEN and AMT. The expression . ~ . - PR - DIAP - QRS says "keep the same responses and predictors except PR, DIAP and QRS."

The large p-value provides good evidence that the model with two predictors fits as well as the model with five predictors. Notice the test statistic is "Pillai", which is one of the four common multivariate test statistics.

The car package provides another way to conduct the same test using the linearHypothesis() function. The beauty of this function is that it allows us to run the test without fitting a separate model. It also returns all four multivariate test statistics. The first argument to the function is our model. The second argument is our null hypothesis. The linearHypothesis() function conveniently allows us to enter this hypothesis as character phrases. The null entered below is that the coefficients for PR, DIAP and QRS are all 0.

The Pillai result is the same as we got using the anova() function above. The Wilks, Hotelling-Lawley, and Roy results are different versions of the same test. The consensus is that the coefficients for PR, DIAP and QRS do not seem to be statistically different from 0. There is some discrepancy in the test results. The Roy test in particular is significant, but this is likely due to the small sample size (n = 17).

Also included in the output are two sum of squares and products matrices, one for the hypothesis and the other for the error. These matrices are used to calculate the four test statistics. These matrices are stored in the lh.out object as SSPH (hypothesis) and SSPE (error). We can use these to manually calculate the test statistics. For example, let SSPH = H and SSPE = E. The formula for the Wilks test statistic is $$ \frac{\begin{vmatrix}\bf{E}\end{vmatrix}}{\begin{vmatrix}\bf{E} + \bf{H}\end{vmatrix}} $$

In R we can calculate that as follows:

Likewise the formula for Pillai is $$ tr[\bf{H}(\bf{H} + \bf{E})^{-1}] $$ tr means trace. That's the sum of the diagonal elements of a matrix. In R we can calculate as follows:

The formula for Hotelling-Lawley is $$ tr[\bf{H}\bf{E}^{-1}] $$ In R:

And finally the Roy statistic is the largest eigenvalue of \(\bf{H}\bf{E}^{-1}\). In R code:
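The four statistics can be cross-checked on a small example. The sketch below is in Python rather than R and uses made-up 2×2 matrices H and E (illustrative assumptions, not the SSPH and SSPE values stored in lh.out). The helper functions are hypothetical names, written out by hand so the block needs no external libraries.

```python
import math

# Hand-written helpers for 2x2 matrices stored as nested lists.
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def add2(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def mul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace2(m):
    return m[0][0] + m[1][1]

def max_eig2(m):
    # Largest root of the characteristic polynomial x^2 - tr*x + det = 0.
    tr, d = trace2(m), det2(m)
    return (tr + math.sqrt(tr * tr - 4 * d)) / 2

# Made-up hypothesis (H) and error (E) sum-of-squares-and-products matrices.
H = [[2.0, 1.0], [1.0, 3.0]]
E = [[10.0, 2.0], [2.0, 8.0]]

wilks = det2(E) / det2(add2(E, H))            # |E| / |E + H|
pillai = trace2(mul2(H, inv2(add2(H, E))))    # tr[H (H + E)^-1]
hotelling_lawley = trace2(mul2(H, inv2(E)))   # tr[H E^-1]
roy = max_eig2(mul2(H, inv2(E)))              # largest eigenvalue of H E^-1

print(wilks, pillai, hotelling_lawley, roy)
```

All four values are functions of the eigenvalues of HE^-1, which is why they are described as different versions of the same test.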

Given these test results, we may decide to drop PR, DIAP and QRS from our model. In fact this is model mlm2 that we fit above. Here is the summary:

Now let's say we wanted to use this model to estimate mean TOT and AMI values for GEN = 1 (female) and AMT = 1200. We can use the predict() function for this. First we need to put our new data into a data frame with column names that match our original data.

This predicts two values, one for each response. Now this is just a prediction and has uncertainty. We usually quantify uncertainty with confidence intervals to give us some idea of a lower and upper bound on our estimate. But in this case we have two predictions from a multivariate model with two sets of coefficients that covary! This means calculating a confidence interval is more difficult. In fact we don't calculate an interval but rather an ellipse to capture the uncertainty in two dimensions.

Unfortunately at the time of this writing there doesn't appear to be a function in R for creating uncertainty ellipses for multivariate multiple regression models with two responses. However, we have written one below you can use called confidenceEllipse() . The details of the function go beyond a "getting started" blog post but it should be easy enough to use. Simply submit the code in the console to create the function. Then use the function with any multivariate multiple regression model object that has two responses. The newdata argument works the same as the newdata argument for predict. Use the level argument to specify a confidence level between 0 and 1. The default is 0.95. Set ggplot to FALSE to create the plot using base R graphics.

Here's a demonstration of the function.

Plot of predicted value for TOT and AMI for model mlm2 with a 95% confidence ellipse.

The dot in the center marks our predicted values for TOT and AMI. The ellipse represents the uncertainty in this prediction. We're 95% confident the true mean values of TOT and AMI when GEN = 1 and AMT = 1200 are within the area of the ellipse. Notice also that TOT and AMI seem to be positively correlated. Predicting higher values of TOT means predicting higher values of AMI, and vice versa.

  • Fox, J and Weisberg, S (2011). An {R} Companion to Applied Regression, Second Edition . Thousand Oaks CA: Sage. URL: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion
  • Johnson, R and Wichern, D (2007). Applied Multivariate Statistical Analysis, Sixth Edition . Prentice-Hall.
  • Rudorfer, MV "Cardiovascular Changes and Plasma Drug Levels after Amitriptyline Overdose." Journal of Toxicology-Clinical Toxicology , 19 (1982), 67-71.

Clay Ford Statistical Research Consultant University of Virginia Library October 27, 2017 Updated May 26, 2023 Update February 20, 2024 (changed function name)


Linear regression hypothesis testing: Concepts, Examples


In machine learning, linear regression is a predictive modeling technique that predicts a continuous response variable as a linear combination of explanatory or predictor variables. When training linear regression models, we rely on hypothesis testing to determine whether there is a relationship between the response and the predictor variables. Two types of hypothesis tests are used with linear regression models: T-tests and F-tests . The corresponding statistics, t-statistics and f-statistics, are used to assess whether a linear relationship exists between the response and the predictor variables. As data scientists, it is important to determine whether linear regression is the correct choice of model for a particular problem, and hypothesis testing on the response and predictor variables helps with this. These concepts are often unclear to many data scientists. In this blog post, we will discuss linear regression and hypothesis testing related to t-statistics and f-statistics . We will also provide an example to help illustrate how these concepts work.


What are linear regression models?

A linear regression model can be defined as the function approximation that represents a continuous response variable as a function of one or more predictor variables. While building a linear regression model, the goal is to identify a linear equation that best predicts or models the relationship between the response or dependent variable and one or more predictor or independent variables.

There are two different kinds of linear regression models. They are as follows:

  • Simple or Univariate linear regression models : These are linear regression models that describe a linear relationship between one response or dependent variable and one predictor or independent variable. The form of the equation that represents a simple linear regression model is Y=mX+b, where m is the coefficient of the predictor variable and b is the bias. When considering the linear regression line, m represents the slope and b represents the intercept.
  • Multiple or Multi-variate linear regression models : These are linear regression models that describe a linear relationship between one response or dependent variable and more than one predictor or independent variable. The form of the equation that represents a multiple linear regression model is Y=b0+b1X1+b2X2+…+bnXn, where b0 is the intercept and bi is the coefficient of the ith predictor variable. In this type of linear regression model, each predictor variable has its own coefficient that is used to calculate the predicted value of the response variable.

While training linear regression models, the requirement is to determine the coefficients which can result in the best-fitted linear regression line. The learning algorithm used to find the most appropriate coefficients is known as least squares regression . In the least-squares regression method, the coefficients are calculated using the least-squares error function. The main objective of this method is to minimize or reduce the sum of squared residuals between actual and predicted response values. The sum of squared residuals is also called the residual sum of squares (RSS). The outcome of executing the least-squares regression method is coefficients that minimize the linear regression cost function .

The residual e of the ith observation is defined as follows, where [latex]Y_i[/latex] is the observed value of the response variable for the ith observation and [latex]\hat{Y_i}[/latex] is the model's prediction for the ith observation.

[latex]e_i = Y_i - \hat{Y_i}[/latex]

The residual sum of squares can be represented as the following:

[latex]RSS = e_1^2 + e_2^2 + e_3^2 + \dots + e_n^2[/latex]

The least-squares method represents the algorithm that minimizes the above term, RSS.
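As a rough illustration of what minimizing RSS means, here is a minimal Python sketch (the post's own examples use R) for the simple one-predictor case. The data values and the function names fit_line and rss are made up for illustration. The slope and intercept come from the standard closed-form least-squares solution.

```python
# Hypothetical toy data (not from the BostonHousing dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def rss(xs, ys, slope, intercept):
    """Residual sum of squares: sum of (observed - predicted)^2."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(xs, ys))

slope, intercept = fit_line(x, y)
best_rss = rss(x, y, slope, intercept)

# Any other line gives a larger RSS than the least-squares line.
worse_rss = rss(x, y, slope + 0.1, intercept)
print(slope, intercept, best_rss, worse_rss)
```

Nudging the slope away from the fitted value increases RSS, which is the sense in which the fitted coefficients are "best".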

Once the coefficients are determined, can it be claimed that these coefficients are the most appropriate ones for linear regression? The answer is no. After all, the coefficients are only estimates, and thus there will be a standard error associated with each of them. Recall that the standard error is used to calculate the confidence interval within which the population parameter is expected to lie. In other words, it represents the error of estimating a population parameter based on the sample data. For a sample mean, the standard error is calculated as the standard deviation of the sample divided by the square root of the sample size. The formula below represents the standard error of a mean.

[latex]SE(\mu) = \frac{\sigma}{\sqrt{N}}[/latex]
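A quick numeric illustration of this formula, sketched in Python with a made-up sample (the sample standard deviation stands in for σ, which is the usual practice when the population value is unknown):

```python
import math
import statistics

# Hypothetical sample of five measurements.
sample = [4.0, 8.0, 6.0, 5.0, 7.0]

s = statistics.stdev(sample)         # sample standard deviation (n - 1 in the denominator)
se_mean = s / math.sqrt(len(sample)) # standard error of the sample mean

print(s, se_mean)
```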

Thus, without analyzing aspects such as the standard error associated with the coefficients, it cannot be claimed that the estimated linear regression coefficients are the most suitable ones. This is where hypothesis testing is needed . Before we get into why we need hypothesis testing with the linear regression model, let's briefly learn what hypothesis testing is.

Train a Multiple Linear Regression Model using R

Before getting into understanding the hypothesis testing concepts in relation to the linear regression model, let’s train a multi-variate or multiple linear regression model and print the summary output of the model which will be referred to, in the next section. 

The data used for creating the multiple linear regression model is BostonHousing, which can be loaded in RStudio by installing the mlbench package. The code is shown below:

install.packages("mlbench")  # one-time install of the package that ships the data
library(mlbench)
data("BostonHousing")

Once the data is loaded, the code shown below can be used to create the linear regression model.

# Fit log(medv) on four predictors; data = avoids attaching the data frame
BostonHousing.lm <- lm(log(medv) ~ crim + chas + rad + lstat, data = BostonHousing)
summary(BostonHousing.lm)

Executing the above commands creates a linear regression model with log(medv) as the response variable and crim, chas, rad, and lstat as the predictor variables. The following represents the details related to the response and predictor variables:

  • log(medv) : Log of the median value of owner-occupied homes in USD 1000’s
  • crim : Per capita crime rate by town
  • chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • rad : Index of accessibility to radial highways
  • lstat : Percentage of the lower status of the population

The following is the output of the summary command, which prints the details of the model, including hypothesis testing details for the coefficients (t-statistics) and the model as a whole (f-statistics).

[Figure: summary table of the linear regression model in R]

Hypothesis tests & Linear Regression Models

Hypothesis tests are statistical procedures used to test a claim or assumption about the underlying distribution of a population based on sample data. Here are the key steps of doing hypothesis tests with linear regression models:

  • Hypothesis formulation for T-tests : In the case of linear regression, the claim is made that there exists a relationship between the response and predictor variables, and the claim is represented by non-zero values of the coefficients of the predictor variables in the regression model. This is formulated as the alternate hypothesis. Thus, the null hypothesis is that there is no relationship between the response and the predictor variables . Hence, the coefficient of each predictor variable is equal to zero (0). So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then there is one null hypothesis per coefficient: a1 = 0, a2 = 0, and a3 = 0. An individual hypothesis test is done for each predictor variable to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. Thus, if there are, say, 5 features, there will be five hypothesis tests, and each will have an associated null and alternate hypothesis.
  • Hypothesis formulation for F-test : In addition, there is a hypothesis test done around the claim that there is a linear regression model representing the response variable and all the predictor variables. The null hypothesis is that the linear regression model does not exist . This essentially means that all the coefficients are equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0.
  • F-statistics for testing hypothesis for linear regression model : The F-test is used to test the null hypothesis that a linear regression model does not exist, representing the relationship between the response variable y and the predictor variables x1, x2, x3, x4 and x5. The null hypothesis can also be represented as a1 = a2 = a3 = a4 = a5 = 0, where ai is the coefficient of xi. The F-statistic is calculated as a function of the sum of squared residuals for the restricted regression (the linear regression model with only the intercept or bias, and all the coefficients set to zero) and the sum of squared residuals for the unrestricted regression (the full linear regression model). In the above diagram, note the value of f-statistics as 15.66 against the degrees of freedom as 5 and 194.
  • Evaluate t-statistics against the critical value/region : After calculating the value of the t-statistic for each coefficient, it is time to decide whether to reject or fail to reject the null hypothesis. To make this decision, one needs to set a significance level, which is also known as the alpha level. A significance level of 0.05 is commonly used. If the value of the t-statistic falls in the critical region, the null hypothesis is rejected. Equivalently, if the p-value is less than 0.05, the null hypothesis is rejected.
  • Evaluate f-statistics against the critical value/region : The value of the F-statistic and its p-value are evaluated to test the null hypothesis that the linear regression model representing the response and predictor variables does not exist. If the value of the F-statistic is greater than the critical value at the 0.05 level of significance, the null hypothesis is rejected. This means that the linear model exists with at least one non-zero coefficient.
  • Draw conclusions : The final step of hypothesis testing is to draw a conclusion by interpreting the results in terms of the original claim or hypothesis. If the null hypothesis for a predictor variable is rejected, the relationship between the response and that predictor variable is statistically significant based on the evidence, that is, the sample data we used for training the model. Similarly, if the value of the F-statistic lies in the critical region and the p-value is less than the alpha value, usually set as 0.05, one can say that there exists a linear regression model.
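The steps above can be sketched for the simplest case of one predictor. This Python example (the post itself uses R) is a sketch under stated assumptions: the data are made up, the variable names are hypothetical, and the critical value 3.182 is the two-sided 5% value of the t-distribution with 3 degrees of freedom, taken from a standard t-table rather than computed.

```python
import math

# Hypothetical toy data (not from the BostonHousing dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares estimates for Y = a0 + a1 * x.
a1 = s_xy / s_xx
a0 = y_bar - a1 * x_bar

# Unrestricted model RSS and restricted (intercept-only) RSS.
rss_unrestricted = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
rss_restricted = sum((yi - y_bar) ** 2 for yi in y)

df_resid = n - 2  # two estimated parameters: a0 and a1

# t-statistic for H0: a1 = 0.
se_a1 = math.sqrt(rss_unrestricted / df_resid / s_xx)
t_stat = a1 / se_a1

# F-statistic for H0: all slope coefficients are 0 (here, just a1).
f_stat = ((rss_restricted - rss_unrestricted) / 1) / (rss_unrestricted / df_resid)

T_CRITICAL = 3.182  # two-sided, alpha = 0.05, 3 degrees of freedom (t-table value)
reject_null = abs(t_stat) > T_CRITICAL

print(t_stat, f_stat, reject_null)
```

With a single predictor the F-statistic equals the square of the t-statistic, so the two tests lead to the same decision.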

Why hypothesis tests for linear regression models?

The reasons why we need to do hypothesis tests in the case of a linear regression model are as follows:

  • By creating the model, we are establishing a new truth (claim) about the relationship between the response or dependent variable and one or more predictor or independent variables. To justify that claim, one or more tests are needed. These tests can be termed an act of testing the claim (or new truth), or in other words, hypothesis tests.
  • One kind of test is required to test the relationship between the response and each of the predictor variables (hence, T-tests).
  • Another kind of test is required to test the linear regression model representation as a whole. This is called the F-test.

While training linear regression models, hypothesis testing is done to determine whether the relationship between the response and each of the predictor variables is statistically significant. The coefficient of each predictor variable is estimated. Then, individual hypothesis tests are done to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. If the null hypothesis for a predictor variable is rejected, there is evidence of a relationship between the response and that particular predictor variable. The t-statistic is used for these tests because the population standard deviation of the errors is unknown and has to be estimated from the sample. The value of the t-statistic is compared with the critical value from the t-distribution table in order to decide whether to reject or fail to reject the null hypothesis regarding the relationship between the response and the predictor variable. If the value falls in the critical region, then the null hypothesis is rejected, which means that the relationship between the response and that predictor variable is statistically significant. In addition to T-tests, the F-test is performed to test the null hypothesis that the linear regression model does not exist and that the value of all the coefficients is zero (0). Learn more about the linear regression and t-test in this blog – Linear regression t-test: formula, example .


Null and Alternative hypothesis for multiple linear regression

I have 1 dependent variable and 3 independent variables.

I run multiple regression, and find that the p value for one of the independent variables is higher than 0.05 (95% is my confidence level).

I take that variable out and run it again. Both remaining independent variables have $p$-value less than 0.05 so I conclude I have my model.

Am I correct in thinking that initially, my null hypothesis is

$$H_0= β_1=β_2 = \dots =β_{k-1} = 0$$

and that the alternative hypothesis is

$$H_1=\textrm{At least one } β \neq 0 \textrm{ whilst } p<0.05$$

And that after the first regression, I do not reject, as one variable does not meet my confidence level needs...

So I run it again, and then reject the null as all $p$-values are significant?

Is what I have written accurate?

Edit: Thanks to Bob Jansen for improving this aesthetics of this post.

2 Answers

The hypothesis $H_0: β_1=β_2=\dots =β_{k−1}=0$ is normally tested by the $F$-test for the regression.

You are carrying out 3 independent tests of your coefficients (Do you also have a constant in the regression or is the constant one of your three variables?) If you do three independent tests at a 5% level you have a probability of over 14% of finding at least one of the coefficients significant at the 5% level even if all coefficients are truly zero (the null hypothesis). This is often ignored but be careful. Even so, if the coefficient is close to significant I would think about the underlying theory before coming to a decision.

If you add dummies you will have a beta for each dummy
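The "over 14%" figure follows from 1 − (1 − 0.05)³, assuming the three tests really are independent. A quick Python check (the function name is just for illustration):

```python
def familywise_error_rate(alpha, num_tests):
    """Probability of at least one false rejection across independent tests."""
    return 1 - (1 - alpha) ** num_tests

fwer = familywise_error_rate(0.05, 3)
print(fwer)
```

With three tests at the 5% level this comes to roughly 0.1426.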

  • Thanks for your response. I don't have a constant, all of my p-values are very significant (the least is a dummy variable at 0.039). What would my null hypothesis be? My knowledge is that I'm seeking p-values because that'd give me my model. I don't understand the technicalities of it and want to learn it :) – Harry, Jan 7, 2015 at 22:36
  • I think you meant to say 14% of committing a type one error (probability of 0.14 of finding at least one of the coefficients significant when their true value is actually the null hypothesis value) – Kamster, Jan 8, 2015 at 0:36
  • @Kamster Thanks. You are correct and I have amended my answer. – user1483, Jan 21, 2015 at 21:26

These are independent variables so the hypothesis applies to each parameter independently.

  • +1: Yes, you are right - but the rest of it should be fine – vonjd, Jan 2, 2015 at 21:18
  • sorry, could you clarify? How do I change the equation so it applies to each parameter independently? And also, what is the effect of adding 3 dummy variables. Is it simply 2 more betas? Or do they require their own symbol – Harry, Jan 4, 2015 at 0:32
  • It just means that you have an H_0 and an H_1 for every parameter. – vonjd, Jan 4, 2015 at 11:33
  • Ok I see. Do you know the procedure for dummy variables? Are they just additional beta? Or is it more accurate to refer to them as delta? – Harry, Jan 4, 2015 at 11:43
  • Maybe I have this wrong but isn't it true if you keep your individual significance levels at 0.05 that the probability of type one error (ie the probability that reject null hypothesis when it is actually true; significance level) will be greater than or equal 0.14 – Kamster, Jan 8, 2015 at 0:43


6.4 - The Hypothesis Tests for the Slopes

At the beginning of this lesson, we translated three different research questions pertaining to heart attacks in rabbits ( Cool Hearts dataset ) into three sets of hypotheses we can test using the general linear F -statistic. The research questions and their corresponding hypotheses are:

Hypotheses 1

Is the regression model containing at least one predictor useful in predicting the size of the infarct?

  • \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3} = 0\)
  • \(H_{A} \colon\) At least one \(\beta_{j} ≠ 0\) (for j = 1, 2, 3)

Hypotheses 2

Is the size of the infarct significantly (linearly) related to the area of the region at risk?

  • \(H_{0} \colon \beta_{1} = 0 \)
  • \(H_{A} \colon \beta_{1} \ne 0 \)

Hypotheses 3

(Primary research question) Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?

  • \(H_{0} \colon \beta_{2} = \beta_{3} = 0\)
  • \(H_{A} \colon \) At least one \(\beta_{j} ≠ 0\) (for j = 2, 3)

Let's test each of the hypotheses now using the general linear F -statistic:

\(F^*=\left(\dfrac{SSE(R)-SSE(F)}{df_R-df_F}\right) \div \left(\dfrac{SSE(F)}{df_F}\right)\)

To calculate the F -statistic for each test, we first determine the error sum of squares for the reduced and full models — SSE ( R ) and SSE ( F ), respectively. The number of error degrees of freedom associated with the reduced and full models — \(df_{R}\) and \(df_{F}\), respectively — is the number of observations, n , minus the number of parameters, p , in the model. That is, in general, the number of error degrees of freedom is n - p . We use statistical software, such as Minitab's F -distribution probability calculator, to determine the P -value for each test.

Testing all slope parameters equal 0

First, let's answer the research question: "Is the regression model containing at least one predictor useful in predicting the size of the infarct?" To do so, we test the hypotheses:

  • \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3} = 0 \)
  • \(H_{A} \colon\) At least one \(\beta_{j} \ne 0 \) (for j = 1, 2, 3)

The full model

The full model is the largest possible model — that is, the model containing all of the possible predictors. In this case, the full model is:

\(y_i=(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)

The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE , that appears in the analysis of variance table. Because there are 4 parameters in the full model, the number of error degrees of freedom associated with the full model is \(df_{F} = n - 4\).

The reduced model

The reduced model is the model that the null hypothesis describes. Because the null hypothesis sets each of the slope parameters in the full model equal to 0, the reduced model is:

\(y_i=\beta_0+\epsilon_i\)

The reduced model suggests that none of the variations in the response y is explained by any of the predictors. Therefore, the error sum of squares for the reduced model, SSE ( R ), is just the total sum of squares, SSTO , that appears in the analysis of variance table. Because there is only one parameter in the reduced model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 1 \).

Upon plugging in the above quantities, the general linear F -statistic:

\(F^*=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div \dfrac{SSE(F)}{df_F}\)

becomes the usual " overall F -test ":

\(F^*=\dfrac{SSR}{3} \div \dfrac{SSE}{n-4}=\dfrac{MSR}{MSE}\)

That is, to test \(H_{0}\) : \(\beta_{1} = \beta_{2} = \beta_{3} = 0 \), we just use the overall F -test and P -value reported in the analysis of variance table:

Analysis of Variance

Source       DF   Adj SS    Adj MS   F-Value  P-Value
Regression    3  0.95927   0.31976     16.43    0.000
  Area        1  0.63742   0.63742     32.75    0.000
  X2          1  0.29733   0.29733     15.28    0.001
  X3          1  0.01981   0.01981      1.02    0.322
Error        28  0.54491   0.01946
Total        31  1.50418

Regression Equation

Inf = - 0.135 + 0.613 Area - 0.2435 X2 - 0.0657 X3

There is sufficient evidence ( F = 16.43, P < 0.001) to conclude that at least one of the slope parameters is not equal to 0.
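As a check on the arithmetic, the general linear F-statistic can be computed directly from the analysis of variance table. The sketch below is in Python rather than Minitab; the function name general_linear_f is hypothetical, and n = 32 is inferred from the 31 total degrees of freedom.

```python
def general_linear_f(sse_reduced, sse_full, df_reduced, df_full):
    """F* = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator

n = 32            # inferred: total DF of 31 plus 1
ssto = 1.50418    # SSE(R): the intercept-only model leaves all variation unexplained
sse = 0.54491     # SSE(F): error sum of squares of the full model

f_overall = general_linear_f(ssto, sse, n - 1, n - 4)
print(round(f_overall, 2))  # 16.43, matching the F-Value for Regression
```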

In general, to test that all of the slope parameters in a multiple linear regression model are 0, we use the overall F -test reported in the analysis of variance table.

Testing one slope parameter is 0

Now let's answer the second research question: "Is the size of the infarct significantly (linearly) related to the area of the region at risk?" To do so, we test the hypotheses:

  • \(H_{0} \colon \beta_{1} = 0 \)
  • \(H_{A} \colon \beta_{1} \ne 0 \)

Again, the full model is the model containing all of the possible predictors:

\(y_i=(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)

The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE . Alternatively, because the three predictors in the model are \(x_{1}\), \(x_{2}\), and \(x_{3}\), we can denote the error sum of squares as SSE (\(x_{1}\), \(x_{2}\), \(x_{3}\)). Again, because there are 4 parameters in the model, the number of error degrees of freedom associated with the full model is \(df_{F} = n - 4 \).

Because the null hypothesis sets the first slope parameter, \(\beta_{1}\), equal to 0, the reduced model is:

\(y_i=(\beta_0+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)

Because the two predictors in the model are \(x_{2}\) and \(x_{3}\), we denote the error sum of squares as SSE (\(x_{2}\), \(x_{3}\)). Because there are 3 parameters in the model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 3\).

The general linear statistic:

\(F^*=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div \dfrac{SSE(F)}{df_F}\)

simplifies to:

\(F^*=\dfrac{SSR(x_1|x_2, x_3)}{1}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}=\dfrac{MSR(x_1|x_2, x_3)}{MSE(x_1,x_2, x_3)}\)

Getting the numbers from the Minitab analysis of variance output shown earlier, we determine that the value of the F -statistic is:

\(F^* = \dfrac{SSR(x_1 \vert x_2, x_3)}{1} \div \dfrac{SSE(x_1, x_2, x_3)}{28} = \dfrac{0.63742}{0.01946}=32.7554\)

The P -value is the probability — if the null hypothesis were true — that we would get an F -statistic larger than 32.7554. Comparing our F -statistic to an F -distribution with 1 numerator degree of freedom and 28 denominator degrees of freedom, Minitab tells us that the probability is close to 1 that we would observe an F -statistic smaller than 32.7554:

F distribution with 1 DF in Numerator and 28 DF in denominator

x P ( X ≤x )
32.7554 1.00000

Therefore, the probability that we would get an F-statistic larger than 32.7554 is close to 0. That is, the P-value is < 0.001. There is sufficient evidence (F = 32.8, P < 0.001) to conclude that the size of the infarct is significantly related to the size of the area at risk after the other predictors \(x_{2}\) and \(x_{3}\) have been taken into account.
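The arithmetic behind this test is short enough to check directly. The following Python sketch recomputes the partial F-statistic from the two sums of squares quoted above. The function name `partial_f` is a hypothetical helper chosen for illustration, not part of Minitab or any library. Because it divides by the unrounded mean squared error rather than 0.01946, the result agrees with 32.7554 only to about two decimal places.

```python
def partial_f(ssr_extra, df_extra, sse_full, df_full):
    """Partial F-statistic: extra regression mean square over the full-model MSE."""
    msr_extra = ssr_extra / df_extra   # MSR(x1 | x2, x3)
    mse_full = sse_full / df_full      # MSE(x1, x2, x3)
    return msr_extra / mse_full


# Values quoted in the text for the rabbit heart attacks study.
ssr_x1_given_x2_x3 = 0.63742   # SSR(x1 | x2, x3)
sse_full = 0.54491             # SSE(x1, x2, x3)
n, n_params = 32, 4            # observations and parameters in the full model

f_star = partial_f(ssr_x1_given_x2_x3, 1, sse_full, n - n_params)
print(round(f_star, 2))
```

The denominator degrees of freedom, n - 4 = 28, come from the full model, which is why the full-model SSE appears no matter which single slope is being tested.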

But wait a second! Have you been wondering why we couldn't just use the slope's t -statistic to test that the slope parameter, \(\beta_{1}\), is 0? We can! Notice that the P -value ( P < 0.001) for the t -test ( t * = 5.72):

Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant -0.135 0.104 -1.29 0.206  
Area 0.613 0.107 5.72 0.000 1.14
X2 -0.2435 0.0623 -3.91 0.001 1.44
X3 -0.0657 0.0651 -1.01 0.322 1.57

is the same as the P -value we obtained for the F -test. This will always be the case when we test that only one slope parameter is 0. That's because of the well-known relationship between a t -statistic and an F -statistic that has one numerator degree of freedom:

\(t_{(n-p)}^{2}=F_{(1, n-p)}\)

For our example, the square of the t -statistic, 5.72, equals our F -statistic (within rounding error). That is:

\(t^{*2}=5.72^2=32.72\approx 32.76=F^*\)
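A quick numeric check makes the "within rounding error" remark concrete. This sketch, which uses only the rounded values printed in the Coefficients table, squares the t-statistic two ways and compares each with the F-statistic computed earlier. The variable names are illustrative.

```python
coef_area = 0.613      # Coef for Area, as printed
se_area = 0.107        # SE Coef for Area, as printed
t_reported = 5.72      # T-Value for Area, as printed
f_reported = 32.7554   # F* computed earlier in the text

# Recomputing t from the rounded coefficient and standard error
# gives a slightly different value than the printed T-Value.
t_from_table = coef_area / se_area

# Relative gap between the squared printed t-statistic and F*.
rel_gap = abs(t_reported ** 2 - f_reported) / f_reported

print(round(t_from_table ** 2, 2), round(t_reported ** 2, 2), f_reported)
print(round(rel_gap, 4))
```

All three numbers agree to within about one percent, which is what we would expect when every input has been rounded to three significant digits.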

So what have we learned in all of this discussion about the equivalence of the F-test and the t-test? In short:

  • We can use either the F-test or the t-test to test that only one slope parameter is 0. Because the t-test results can be read right off of the Minitab output, it makes sense that it would be the test that we'll use most often.
  • But, we have to be careful with our interpretations! The equivalence of the t-test to the F-test has taught us something new about the t-test. The t-test is a test for the marginal significance of the \(x_{1}\) predictor after the other predictors \(x_{2}\) and \(x_{3}\) have been taken into account. It does not test for the significance of the relationship between the response y and the predictor \(x_{1}\) alone.

To see that the t-test adjusts for the other predictors, compare the output obtained when \(x_{1}\) = Area is entered into the model last:

Term Coef SE Coef T-Value P-Value VIF
Constant -0.135 0.104 -1.29 0.206  
X2 -0.2435 0.0623 -3.91 0.001 1.44
X3 -0.0657 0.0651 -1.01 0.322 1.57
Area 0.613 0.107 5.72 0.000 1.14

Inf = - 0.135 - 0.2435 X2 - 0.0657 X3 + 0.613 Area

to the Coefficients output shown earlier in this section, in which \(x_{1}\) = Area is entered into the model first.

The t-statistic and P-value are the same regardless of the order in which \(x_{1}\) = Area is entered into the model. That's because, by its equivalence to the F-test, the t-test for one slope parameter adjusts for all of the other predictors included in the model.

Testing a subset of slope parameters is 0

Finally, let's answer the third — and primary — research question: "Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?" To do so, we test the hypotheses:

  • \(H_{0} \colon \beta_{2} = \beta_{3} = 0 \)
  • \(H_{A} \colon\) At least one \(\beta_{j} \ne 0 \) (for j = 2, 3)

Because the null hypothesis sets the second and third slope parameters, \(\beta_{2}\) and \(\beta_{3}\), equal to 0, the reduced model is:

\(y_i=(\beta_0+\beta_1x_{i1})+\epsilon_i\)

The ANOVA table for the reduced model is:

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 0.6249 0.62492 21.32 0.000
Area 1 0.6249 0.62492 21.32 0.000
Error 30 0.8793 0.02931    
Total 31 1.5042      

Because the only predictor in the model is \(x_{1}\), we denote the error sum of squares as SSE (\(x_{1}\)) = 0.8793. Because there are 2 parameters in the model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 2 = 32 - 2 = 30\). With SSE ( F ) = 0.54491 and \(df_{F} = 28\) for the full model, the general linear F-statistic is:

\begin{align} F^*&=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div\dfrac{SSE(F)}{df_F}\\&=\dfrac{0.8793-0.54491}{30-28} \div\dfrac{0.54491}{28}\\&= \dfrac{0.33439}{2} \div 0.01946\\&=8.59.\end{align}
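The general linear F-statistic depends on only four numbers, so it is easy to wrap in a function. The sketch below is a minimal illustration. `general_linear_f` is a hypothetical name, and the inputs are the error sums of squares and degrees of freedom quoted in the text.

```python
def general_linear_f(sse_reduced, df_reduced, sse_full, df_full):
    """General linear F-statistic comparing a reduced model to a full model."""
    # Drop in error sum of squares per parameter removed by the null hypothesis.
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    # Mean squared error of the full model.
    denominator = sse_full / df_full
    return numerator / denominator


# Rabbit heart attacks study: reduced model uses Area only,
# full model uses Area, X2, and X3.
f_subset = general_linear_f(sse_reduced=0.8793, df_reduced=30,
                            sse_full=0.54491, df_full=28)
print(round(f_subset, 2))
```

The same function covers the other two tests as well: only the reduced model changes, while the full-model SSE and its degrees of freedom stay fixed.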

Alternatively, we can calculate the F-statistic using a partial F-test :

\begin{align}F^*&=\dfrac{SSR(x_2, x_3|x_1)}{2}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}\\&=\dfrac{MSR(x_2, x_3|x_1)}{MSE(x_1,x_2, x_3)}.\end{align}

To conduct the test, we regress y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and \(x_{3}\), in order (and with "Sequential sums of squares" selected under "Options"):

Source DF Seq SS Seq MS F-Value P-Value
Regression 3 0.95927 0.31976 16.43 0.000
Area 1 0.62492 0.62492 32.11 0.000
X2 1 0.31453 0.31453 16.16 0.001
X3 1 0.01981 0.01981 1.02 0.322
Error 28 0.54491 0.01946    
Total 31 1.50418      

Inf = - 0.135 + 0.613 Area - 0.2435 X2 - 0.0657 X3

yielding SSR (\(x_{2}\) | \(x_{1}\)) = 0.31453, SSR (\(x_{3}\) | \(x_{1}\), \(x_{2}\)) = 0.01981, and MSE = 0.54491/28 = 0.01946. Therefore, the value of the partial F -statistic is:

\begin{align} F^*&=\dfrac{SSR(x_2, x_3|x_1)}{2}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}\\&=\dfrac{0.31453+0.01981}{2}\div\dfrac{0.54491}{28}\\&= \dfrac{0.33434}{2} \div 0.01946\\&=8.59,\end{align}

which is identical (within round-off error) to the general F-statistic above. The P -value is the probability — if the null hypothesis were true — that we would observe a partial F -statistic more extreme than 8.59. The following Minitab output:

F distribution with 2 DF in Numerator and 28 DF in denominator

x P ( X ≤ x )
8.59 0.998767

tells us that the probability of observing an F-statistic smaller than 8.59 is 0.9988. Therefore, the probability of observing an F-statistic larger than 8.59 is 1 - 0.9988 = 0.0012. The P-value is very small. There is sufficient evidence (F = 8.59, P = 0.0012) to conclude that the type of cooling is significantly related to the extent of damage that occurs, after taking into account the size of the region at risk.
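The partial F-test and its P-value can be reproduced without statistical software in this particular case. The sketch below sums the sequential sums of squares for X2 and X3 and then uses a closed form that holds only when the numerator has exactly 2 degrees of freedom: for an F-distribution with 2 and d degrees of freedom, P(X > x) = (1 + 2x/d)^(-d/2). For any other numerator degrees of freedom a statistics library would be needed. The function and variable names are illustrative.

```python
def f_upper_tail_2df(x, df_denominator):
    """P(X > x) for an F-distribution with 2 numerator degrees of freedom.

    This closed form is specific to 2 numerator degrees of freedom.
    """
    return (1 + 2 * x / df_denominator) ** (-df_denominator / 2)


# Sequential sums of squares with Area entered first, then X2, then X3.
seq_ss = {"Area": 0.62492, "X2": 0.31453, "X3": 0.01981}
sse_full, df_full = 0.54491, 28

tested = ["X2", "X3"]                              # slopes set to 0 under H0
ssr_extra = sum(seq_ss[name] for name in tested)   # SSR(x2, x3 | x1)
f_partial = (ssr_extra / len(tested)) / (sse_full / df_full)
p_value = f_upper_tail_2df(f_partial, df_full)

print(round(f_partial, 2), round(p_value, 4))
```

Note that the predictors being tested must be the last ones entered for their sequential sums of squares to add up to SSR(\(x_{2}\), \(x_{3}\) | \(x_{1}\)).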

Summary of MLR Testing

For the simple linear regression model, there is only one slope parameter about which one can perform hypothesis tests. For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are:

  • Hypothesis test for testing that all of the slope parameters are 0.
  • Hypothesis test for testing that a subset — more than one, but not all — of the slope parameters are 0.
  • Hypothesis test for testing that one slope parameter is 0.

We have learned how to perform each of the above three hypothesis tests. Along the way, we also took two detours — one to learn about the " general linear F-test " and one to learn about " sequential sums of squares. " As you now know, knowledge about both is necessary for performing the three hypothesis tests.

The F-statistic and associated p-value in the ANOVA table are used for testing whether all of the slope parameters are 0. In most applications, this p-value will be small enough to reject the null hypothesis and conclude that at least one predictor is useful in the model. For example, for the rabbit heart attacks study, the F-statistic is (0.95927/(4 - 1)) / (0.54491/(32 - 4)) = 16.43 with p-value 0.000.

To test whether a subset — more than one, but not all — of the slope parameters are 0, there are two equivalent ways to calculate the F-statistic:

  • Use the general linear F-test formula by fitting the full model to find SSE(F) and fitting the reduced model to find SSE(R). Then the numerator of the F-statistic is (SSE(R) – SSE(F)) / (\(df_{R}\) – \(df_{F}\)), and the denominator is SSE(F) / \(df_{F}\).
  • Alternatively, use the partial F-test formula by fitting only the full model but making sure the relevant predictors are fitted last and "sequential sums of squares" have been selected. Then the numerator of the F-statistic is the sum of the relevant sequential sums of squares divided by the sum of the degrees of freedom for these sequential sums of squares. The denominator of the F -statistic is the mean squared error in the ANOVA table.

For example, for the rabbit heart attacks study, the general linear F-statistic is ((0.8793 – 0.54491) / (30 – 28)) / (0.54491 / 28) = 8.59 with p -value 0.0012. Alternatively, the partial F -statistic for testing the slope parameters for predictors \(x_{2}\) and \(x_{3}\) using sequential sums of squares is ((0.31453 + 0.01981) / 2) / (0.54491 / 28) = 8.59.

To test whether one slope parameter is 0, we can use an F -test as just described. Alternatively, we can use a t -test, which will have an identical p -value since in this case, the square of the t -statistic is equal to the F -statistic. For example, for the rabbit heart attacks study, the F -statistic for testing the slope parameter for the Area predictor is (0.63742/1) / (0.54491/(32–4)) = 32.75 with p -value 0.000. Alternatively, the t -statistic for testing the slope parameter for the Area predictor is 0.613 / 0.107 = 5.72 with p -value 0.000, and \(5.72^{2} = 32.72\).

Incidentally, you may be wondering why we can't just do a series of individual t-tests to test whether a subset of the slope parameters is 0. For example, for the rabbit heart attacks study, we could have done the following:

  • Fit the model of y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and \(x_{3}\) and use an individual t-test for \(x_{3}\).
  • If the test results indicate that we can drop \(x_{3}\) then fit the model of y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and use an individual t-test for \(x_{2}\).

The problem with this approach is we're using two individual t-tests instead of one F-test, which means our chance of drawing an incorrect conclusion in our testing procedure is higher. Every time we do a hypothesis test, we can draw an incorrect conclusion by:

  • rejecting a true null hypothesis, i.e., make a type I error by concluding the tested predictor(s) should be retained in the model when in truth it/they should be dropped; or
  • failing to reject a false null hypothesis, i.e., make a type II error by concluding the tested predictor(s) should be dropped from the model when in truth it/they should be retained.

Thus, in general, the fewer tests we perform the better. In this case, this means that wherever possible using one F-test in place of multiple individual t-tests is preferable.
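To get a feel for how quickly errors accumulate, consider a simplified setting in which k tests are statistically independent and each is run at level α. The chance of at least one type I error is then 1 - (1 - α)^k. The sequential t-tests described above are not independent, so this sketch is only a rough illustration of the direction of the effect, and α = 0.05 is an assumed level.

```python
def familywise_error(alpha, k):
    """Chance of at least one type I error across k independent level-alpha tests."""
    return 1 - (1 - alpha) ** k


# Assumed significance level of 0.05 for each individual test.
for k in (1, 2, 5):
    print(k, round(familywise_error(0.05, k), 4))
```

With two tests the error chance is already close to twice the nominal level, which is the motivation for replacing several t-tests with a single F-test.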

Hypothesis tests for the slope parameters

The problems in this section are designed to review the hypothesis tests for the slope parameters, as well as to give you some practice on models with a three-group qualitative variable (which we'll cover in more detail in Lesson 8). We consider tests for:

  • whether one slope parameter is 0 (for example, \(H_{0} \colon \beta_{1} = 0 \))
  • whether a subset (more than one but less than all) of the slope parameters are 0 (for example, \(H_{0} \colon \beta_{2} = \beta_{3} = 0 \) against the alternative \(H_{A} \colon \beta_{2} \ne 0 \) or \(\beta_{3} \ne 0 \) or both ≠ 0)
  • whether all of the slope parameters are 0 (for example, \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3}\) = 0 against the alternative \(H_{A} \colon \) at least one of the \(\beta_{i}\) is not 0)

(Note the correct specification of the alternative hypotheses for the last two situations.)

Sugar beets study

A group of researchers was interested in studying the effects of three different growth regulators (treat, denoted 1, 2, and 3) on the yield of sugar beets (y = yield, in pounds). They planned to plant the beets in 30 different plots and then randomly treat 10 plots with the first growth regulator, 10 plots with the second growth regulator, and 10 plots with the third growth regulator. One problem, though, is that the amount of available nitrogen in the 30 different plots varies naturally, thereby giving a potentially unfair advantage to plots with higher levels of available nitrogen. Therefore, the researchers also measured and recorded the available nitrogen (\(x_{1}\) = nit, in pounds/acre) in each plot. They are interested in comparing the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen. The Sugar Beets dataset contains the data from the researchers' experiment.

Preliminary Work

The plot shows a similar positive linear trend within each treatment category, which suggests that it is reasonable to formulate a multiple regression model that would place three parallel lines through the data.

Because the qualitative variable treat distinguishes between the three treatment groups (1, 2, and 3), we need to create two indicator variables, \(x_{2}\) and \(x_{3}\), say, to fit a linear regression model to these data. The new indicator variables should be defined as follows:

treat \(x_2\) \(x_3\)
1 1 0
2 0 1
3 0 0

Use Minitab's Calc >> Make Indicator Variables command to create the new indicator variables in your worksheet.

Minitab creates an indicator variable for each treatment group, but we can only use two of them, for treatment groups 1 and 2 in this case (treatment group 3 is the reference level).
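Outside Minitab, the same coding can be produced in a few lines. The sketch below is a plain-Python illustration of the table above: `make_indicators` is a hypothetical helper, and the short `treat` list is made-up input rather than the Sugar Beets data.

```python
def make_indicators(treat, levels=(1, 2, 3), reference=3):
    """Return one row [x2, x3, ...] per observation, omitting the reference level."""
    kept = [level for level in levels if level != reference]
    # Each kept level gets its own 0/1 column; the reference level is all zeros.
    return [[1 if value == level else 0 for level in kept] for value in treat]


treat = [1, 2, 3, 1]   # illustrative treatment labels, not the real dataset
for value, row in zip(treat, make_indicators(treat)):
    print(value, row)
```

Dropping the reference level is what keeps the indicator columns from summing to the intercept column, which would otherwise make the model parameters impossible to estimate uniquely.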

Then, if we assume the trend in the data can be summarized by this regression model:

\(y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \beta_{3}x_{i3} + \epsilon_{i}\)

where \(x_{1}\) = nit and \(x_{2}\) and \(x_{3}\) are defined as above, what is the mean response function for plots receiving treatment 3? for plots receiving treatment 1? for plots receiving treatment 2? Are the three regression lines that arise from our formulated model parallel? What does the parameter \(\beta_{2}\) quantify? And, what does the parameter \(\beta_{3}\) quantify?

The fitted equation from Minitab is Yield = 84.99 + 1.3088 Nit - 2.43 \(x_{2}\) - 2.35 \(x_{3}\), which means that the equations for each treatment group are:

  • Group 1: Yield = 84.99 + 1.3088 Nit - 2.43(1) = 82.56 + 1.3088 Nit
  • Group 2: Yield = 84.99 + 1.3088 Nit - 2.35(1) = 82.64 + 1.3088 Nit
  • Group 3: Yield = 84.99 + 1.3088 Nit

The three estimated regression lines are parallel since they have the same slope, 1.3088.

The regression parameter for \(x_{2}\) represents the difference between the estimated intercept for treatment 1 and the estimated intercept for reference treatment 3.

The regression parameter for \(x_{3}\) represents the difference between the estimated intercept for treatment 2 and the estimated intercept for reference treatment 3.
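The three parallel lines can be generated from the single fitted equation. This sketch hard-codes the coefficients reported above and evaluates the fitted yield for each treatment group. `fitted_yield` is an illustrative name, and the nitrogen values used are arbitrary inputs chosen only to expose the intercepts and the common slope.

```python
# Fitted coefficients reported in the text.
B0, B_NIT, B_X2, B_X3 = 84.99, 1.3088, -2.43, -2.35


def fitted_yield(nit, treat):
    """Fitted yield for a plot with the given nitrogen level and treatment group."""
    x2 = 1 if treat == 1 else 0   # indicator for treatment 1
    x3 = 1 if treat == 2 else 0   # indicator for treatment 2
    return B0 + B_NIT * nit + B_X2 * x2 + B_X3 * x3


# Intercept of each group's line (fitted value at nit = 0).
intercepts = {group: round(fitted_yield(0, group), 2) for group in (1, 2, 3)}
# Slope of each group's line (change in fitted yield per unit of nitrogen).
slopes = {group: round(fitted_yield(11, group) - fitted_yield(10, group), 4)
          for group in (1, 2, 3)}

print(intercepts)
print(slopes)
```

Because the indicators only shift the intercept, every group shares the nitrogen slope. A model with different slopes per group would need interaction terms, which this model does not include.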

Testing whether all of the slope parameters are 0

\(H_0 \colon \beta_1 = \beta_2 = \beta_3 = 0\) against the alternative \(H_A \colon \) at least one of the \(\beta_i\) is not 0.

\(F=\dfrac{SSR(X_1,X_2,X_3)\div3}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_1,X_2,X_3)}{MSE(X_1,X_2,X_3)}\)

\(F = \dfrac{\frac{16039.5}{3}}{\frac{1078.0}{30-4}} = \dfrac{5346.5}{41.46} = 128.95\)

Since the p-value for this F-statistic is reported as 0.000, we reject \(H_{0}\) in favor of \(H_{A}\) and conclude that at least one of the slope parameters is not zero, i.e., the regression model containing at least one predictor is useful in predicting sugar beet yield.

Tests for whether one slope parameter is 0

\(H_0 \colon \beta_1= 0\) against the alternative \(H_A \colon \beta_1 \ne 0\)

t -statistic = 19.60, p -value = 0.000, so we reject \(H_{0}\) in favor of \(H_{A}\) and conclude that the slope parameter for \(x_{1}\) = nit is not zero, i.e., sugar beet yield is significantly linearly related to the available nitrogen (controlling for treatment).

\(F=\dfrac{SSR(X_1|X_2,X_3)\div1}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_1|X_2,X_3)}{MSE(X_1,X_2,X_3)}\)

Use the Minitab output to calculate the value of this F statistic. Does the value you obtain equal \(t^{2}\), the square of the t -statistic as we might expect?

F-statistic \(= \dfrac{\frac{15934.5}{1}}{\frac{1078.0}{30-4}} = \dfrac{15934.5}{41.46} = 384.32\), which equals \(19.60^{2} = 384.16\) within rounding error.

Because \(t^{2}\) will equal the partial F -statistic whenever you test for whether one slope parameter is 0, it makes sense to just use the t -statistic and P -value that Minitab displays as a default. But, note that we've just learned something new about the meaning of the t -test in the multiple regression setting. It tests for the ("marginal") significance of the \(x_{1}\) predictor after \(x_{2}\) and \(x_{3}\) have already been taken into account.

Tests for whether a subset of the slope parameters is 0

\(H_0 \colon \beta_2=\beta_3= 0\) against the alternative \(H_A \colon \beta_2 \ne 0\) or \(\beta_3 \ne 0\) or both \(\ne 0\).

\(F=\dfrac{SSR(X_2,X_3|X_1)\div2}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_2,X_3|X_1)}{MSE(X_1,X_2,X_3)}\)

\(F = \dfrac{\frac{10.4+27.5}{2}}{\frac{1078.0}{30-4}} = \dfrac{18.95}{41.46} = 0.46\).

F distribution with 2 DF in Numerator and 26 DF in denominator

x P ( X ≤ x )
0.46 0.363677

p-value \(= 1-0.363677 = 0.636\), so we fail to reject \(H_{0}\). We cannot rule out \(\beta_2 = \beta_3 = 0\), i.e., there is no significant difference in the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen.

Note that the sequential mean square due to regression, MSR(\(X_{2}\),\(X_{3}\)|\(X_{1}\)), is obtained by dividing the sequential sum of square by its degrees of freedom (2, in this case, since two additional predictors \(X_{2}\) and \(X_{3}\) are considered). Use the Minitab output to calculate the value of this F statistic, and use Minitab to get the associated P -value. Answer the researcher's question at the \(\alpha= 0.05\) level.
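As a cross-check on the sugar beets calculations, the three F-statistics can be recomputed from the sums of squares quoted above. The sketch below also evaluates the P-value for the subset test using the closed form that applies only when the numerator has 2 degrees of freedom, P(X > x) = (1 + 2x/d)^(-d/2), and compares it with α = 0.05. All names are illustrative, and because the unrounded F-statistic is used, the P-value differs slightly from the 0.636 obtained with F rounded to 0.46.

```python
ALPHA = 0.05
n, n_params = 30, 4
df_error = n - n_params            # 26 error degrees of freedom
mse_full = 1078.0 / df_error       # MSE(X1, X2, X3)

# Sums of squares quoted in the text.
ssr_all = 16039.5                  # SSR(X1, X2, X3), 3 degrees of freedom
ssr_x1_last = 15934.5              # SSR(X1 | X2, X3), 1 degree of freedom
ssr_x2_x3_last = 10.4 + 27.5       # SSR(X2, X3 | X1), 2 degrees of freedom

f_all = (ssr_all / 3) / mse_full
f_one = (ssr_x1_last / 1) / mse_full
f_subset = (ssr_x2_x3_last / 2) / mse_full

# Upper-tail probability, valid only for 2 numerator degrees of freedom.
p_subset = (1 + 2 * f_subset / df_error) ** (-df_error / 2)
reject_subset = p_subset < ALPHA

print(round(f_all, 2), round(f_one, 2), round(f_subset, 2))
print(round(p_subset, 3), reject_subset)
```

The subset test is the one that answers the researchers' question, and its large P-value means the treatment indicators add little once nitrogen is in the model.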


Linear vs. Multiple Regression: What's the Difference?

Thomas J Catalano is a CFP and Registered Investment Adviser with the state of South Carolina, where he launched his own financial advisory firm in 2018. Thomas' experience gives him expertise in a variety of areas including investments, retirement, insurance, and financial planning.


Linear Regression vs. Multiple Regression: An Overview

Linear regression (also called simple regression) is one of the most common techniques of regression analysis. Multiple regression is a broader class of regression analysis, which encompasses both linear and nonlinear regressions with multiple explanatory variables.

Regression analysis is a statistical method used in finance and investing . Regression analysis pools data together to help people and companies make informed decisions. There are different variables at play in this type of statistical analysis, including a dependent variable—the main variable that you're trying to understand—and an independent variable(s)—factors that may have an impact on the dependent variable.

There are several main reasons people use regression analysis:

  • To predict future economic conditions, trends, or values.
  • To determine the relationship between two or more variables.
  • To understand how one variable changes when another changes.

While there are many different kinds of regression analysis, this article will examine two different types: linear regression and multiple regression.

Key Takeaways

  • Regression analysis is a common statistical method used in finance and investing.
  • Linear regression (also called simple regression) is one of the most common techniques of regression analysis; in linear regression, there are only two variables: the independent variable and the dependent variable.
  • Whereas linear regression only has one independent variable, multiple regression encompasses both linear and nonlinear regressions and incorporates multiple independent variables.
  • Each independent variable in multiple regression has its own coefficient to ensure each variable is weighted appropriately.

Linear Regression

Also called simple regression, linear regression establishes the relationship between two variables. Linear regression is graphically depicted using a straight line; the slope defines how the change in one variable impacts a change in the other. The y-intercept of a linear regression relationship represents the value of one variable, when the value of the other is 0.

In linear regression, every dependent value has a single corresponding independent variable that drives its value. For example, in the linear regression formula of y = 3x + 7, there is only one possible outcome of "y" if "x" is defined as 2.

If the relationship between two variables does not follow a straight line, nonlinear regression may be used instead. Linear and nonlinear regression are similar in that both track a particular response from a set of variables. As the relationship between the variables becomes more complex, nonlinear models have greater flexibility and capability of depicting the non-constant slope.

Multiple Regression

For complex connections between data, the relationship might be explained by more than one variable. In this case, an analyst uses multiple regression; multiple regression attempts to explain a dependent variable using more than one independent variable.

There are two main uses for multiple regression analysis. The first is to determine the dependent variable based on multiple independent variables. For example, you may be interested in determining what a crop yield will be based on temperature, rainfall, and other independent variables. The second is to determine how strong the relationship is between each variable. For example, you may be interested in knowing how a crop yield will change if rainfall increases—or the temperature decreases.

Multiple regression assumes there is not a strong relationship among the independent variables themselves (that is, little multicollinearity). It also assumes that each independent variable is related to the single dependent variable. Each independent variable is given its own regression coefficient, so that more impactful variables contribute more to the predicted value of the dependent variable.

A company can not only use regression analysis to understand certain situations, like why customer service calls are dropping, but also to make forward-looking predictions, like sales figures in the future.

Linear Regression vs. Multiple Regression Example

Consider an analyst who wishes to establish a relationship between the daily change in a company's stock prices and the daily change in trading volume . Using linear regression, the analyst can attempt to determine the relationship between the two variables:

Daily Change in Stock Price = (Coefficient)(Daily Change in Trading Volume) + (y-intercept)

If the stock price increases $0.10 before any trades occur and increases $0.01 for every share sold, the linear regression outcome is:

Daily Change in Stock Price = ($0.01)(Daily Change in Trading Volume) + $0.10
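Written as code, the fitted line is a one-line function. This sketch simply encodes the coefficient and y-intercept from the example above; `price_change` is a hypothetical name and the volume figures passed to it are arbitrary inputs.

```python
def price_change(volume_change, slope=0.01, intercept=0.10):
    """Predicted daily change in stock price, in dollars, from the fitted line."""
    return slope * volume_change + intercept


# With no trades, the prediction is the y-intercept alone.
print(round(price_change(0), 2))
# Each additional share traded adds the slope to the prediction.
print(round(price_change(100), 2))
```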

However, the analyst realizes there are several other factors to consider, including the company's P/E ratio, dividends, and the prevailing inflation rate. The analyst can perform multiple regression to determine which of these variables impact the stock price, and how strongly:

Daily Change in Stock Price = (Coefficient)(Daily Change in Trading Volume) + (Coefficient)(Company's P/E Ratio) + (Coefficient)(Dividend) + (Coefficient)(Inflation Rate) + (y-intercept)
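The coefficients in such a formula are typically estimated by least squares. The sketch below shows one way that could be done in plain Python by solving the normal equations. It is an illustration only: the data are synthetic and noise-free, generated from a known equation with two generic predictors rather than from market data, and the helper names `solve_linear_system` and `fit_ols` are hypothetical. It also assumes the predictors are not perfectly collinear, since otherwise the system has no unique solution.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, slope_1, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]   # leading 1 estimates the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free data generated from y = 1 + 2*x1 - 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (3, 2)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

coefs = fit_ols(rows, y)
print([round(c, 6) for c in coefs])
```

Because the data contain no noise, the fit recovers the generating intercept and slopes. With real data the estimates would carry sampling error, and a statistics package would normally be used to obtain standard errors and tests alongside them.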

Is Multiple Linear Regression Better Than Simple Linear Regression?

Multiple linear regression is a more general calculation than simple linear regression because it allows more than one independent variable. For straightforward relationships, simple linear regression may easily capture the relationship between the two variables. For more complex relationships requiring more consideration, multiple linear regression is often better.

When Should You Use Multiple Linear Regression?

Multiple linear regression should be used when multiple independent variables determine the outcome of a single dependent variable. This is often the case when forecasting more complex relationships.

How Do You Interpret Multiple Regression?

A multiple regression formula has multiple slopes (one for each variable) and one y-intercept. It is interpreted the same as a simple linear regression formula, except that each slope describes the effect of its own variable while the other variables are held constant.

The Bottom Line

Regression analysis is a statistical method. There are many different types of regression analysis, including linear regression and multiple regression (among others). Linear regression captures the relationship between two variables—for example, the relationship between the daily change in a company's stock prices and the daily change in trading volume. Multiple linear regression is a more specific (and complex) calculation than simple linear regression. It incorporates multiple independent variables. For example, multiple regression could capture how the daily change in a company's stock price is impacted by the company's P/E ratio, dividends, the prevailing inflation rate, and the daily change in trading volume.

hypothesis multiple regression example

  • Terms of Service
  • Editorial Policy
  • Privacy Policy

IMAGES

  1. Multiple Linear Regression and Correlation Analysis Chapter 14

    hypothesis multiple regression example

  2. multiple regression

    hypothesis multiple regression example

  3. Multiple Regression Analysis Interpretation : SPSS Multiple Regression

    hypothesis multiple regression example

  4. What Is And How To Use A Multiple Regression Equation Model Example

    hypothesis multiple regression example

  5. Hypothesis Tests in Multiple Linear Regression, Part 2

    hypothesis multiple regression example

  6. Multiple Regression Assumptions & Diagnostics

    hypothesis multiple regression example

VIDEO

  1. Multiple Regression

  2. Multiple Regression

  3. Hypothesis Tests in Multiple Linear Regression, Part 2

  4. Multiple Regression, Clearly Explained!!!

  5. Hypothesis Tests in Multiple Linear Regression, Part 1

  6. How to perform Multiple Regression?/Interpretation and Hypothesis test using SPSS. Part 1

COMMENTS

  1. Multiple Linear Regression

    The formula for a multiple linear regression is: = the predicted value of the dependent variable. = the y-intercept (value of y when all other parameters are set to 0) = the regression coefficient () of the first independent variable () (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value ...

  2. PDF Lecture 5 Hypothesis Testing in Multiple Linear Regression

    Hypothesis Testing in Multiple Linear Regression BIOST 515 January 20, 2004. 1 Types of tests • Overall test • Test for addition of a single variable ... Consider the regression model with p predictors y = Xβ + . We would like to determine if some subset of r < p predictors

  3. PDF Hypothesis Testing in the Multiple regression model

    Hypothesis Testing in the Multiple regression model • Testing that individual coefficients take a specific value such as zero or some other value is done in exactly the same way as with the simple two variable regression model. • Now suppose we wish to test that a number of coefficients or combinations of coefficients take some particular ...

  4. Understanding the Null Hypothesis for Linear Regression

    xi: The value of the predictor variable xi. Multiple linear regression uses the following null and alternative hypotheses: H0: β1 = β2 = … = βk = 0. HA: β1 = β2 = … = βk ≠ 0. The null hypothesis states that all coefficients in the model are equal to zero. In other words, none of the predictor variables have a statistically ...

  5. Hypothesis Tests and Confidence Intervals in Multiple Regression

    Confidence Intervals for a Single Coefficient. The confidence interval for a regression coefficient in multiple regression is calculated and interpreted the same way as it is in simple linear regression. The t-statistic has n - k - 1 degrees of freedom where k = number of independents. Supposing that an interval contains the true value of ...

  6. Writing hypothesis for linear multiple regression models

    2. I struggle writing hypothesis because I get very much confused by reference groups in the context of regression models. For my example I'm using the mtcars dataset. The predictors are wt (weight), cyl (number of cylinders), and gear (number of gears), and the outcome variable is mpg (miles per gallon). Say all your friends think you should ...

  7. 5.3

    A population model for a multiple linear regression model that relates a y -variable to p -1 x -variables is written as. y i = β 0 + β 1 x i, 1 + β 2 x i, 2 + … + β p − 1 x i, p − 1 + ϵ i. We assume that the ϵ i have a normal distribution with mean 0 and constant variance σ 2. These are the same assumptions that we used in simple ...

  8. Lesson 5: Multiple Linear Regression (MLR) Model & Evaluation

    a hypothesis test for testing that a subset — more than one, but not all — of the slope parameters are 0. In this lesson, we also learn how to perform each of the above three hypothesis tests. Key Learning Goals for this Lesson: Be able to interpret the coefficients of a multiple regression model. Understand what the scope of the model is ...

  9. Multiple linear regression

    The model fit and predictions are independent of the choice of the baseline category. However, hypothesis tests derived from these variables are affected by the choice. Solution: To check whether region is important, use an \(F\)-test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not ...

  10. Multiple Regression

    This video provides a clear explanation of the nature of hypothesis testing around a multiple regression model. It builds on the example introduced in Multip...

  11. 12.2.1: Hypothesis Test for Linear Regression

    The hypotheses are: Find the critical value using dfE = n − p − 1 = 13 for a two-tailed test α = 0.05 inverse t-distribution to get the critical values ± 2.160. Draw the sampling distribution and label the critical values, as shown in Figure 12-14. Figure 12-14: Graph of t-distribution with labeled critical values.

  12. Introduction to Multiple Linear Regression

    Assumptions of Multiple Linear Regression. There are four key assumptions that multiple linear regression makes about the data: 1. Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y. 2. Independence: The residuals are independent.

  13. PDF Multiple Regression

    Note on TerminologyWhen we have two or more predictors and fit a linear model by least squares, we are formally said to fit a least squares linear m. ltiple re-gression. Most folks just call it "multiple regression."You may also see the abbreviation OLS used with thi. kind of analy-sis. It stands for "Ordina.

  14. Multiple Linear Regression by Hand (Step-by-Step)

    This tutorial explains how to perform multiple linear regression by hand. Example: Multiple Linear Regression by Hand. Suppose we have the following dataset with one response variable y and two predictor variables X 1 and X 2: Use the following steps to fit a multiple linear regression model to this dataset. Step 1: Calculate X 1 2, X 2 2, X 1 ...

  15. Lesson 5: Multiple Linear Regression

    The only real difference is that whereas in simple linear regression we think of the distribution of errors at a fixed value of the single predictor, with multiple linear regression we have to think of the distribution of errors at a fixed set of values for all the predictors. All of the model-checking procedures we learned earlier are useful ...

  16. Multiple regression

    Use multiple regression when you have three or more measurement variables. One of the measurement variables is the dependent ( Y) variable. The rest of the variables are the independent ( X) variables; you think they may have an effect on the dependent variable. The purpose of a multiple regression is to find an equation that best predicts the ...

  17. PDF 13 Multiple Linear Regression

    In contrast, the simple regression slope is called the marginal (or unadjusted) coefficient. The multiple regression model can be written in matrix form. To estimate the parameters b0, b1, ..., bp using the principle of least squares, form the sum of squared deviations of the observed yj's from the regression line:
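The excerpt stops before the expression itself. As a sketch in standard textbook notation (the double-indexed x_{ij}, the design matrix X, and the symbol Q are assumptions rather than the source's notation), the least squares criterion and its matrix-form solution are commonly written as:

```latex
% Sum of squared deviations over n observations and p predictors
Q(b_0, b_1, \dots, b_p)
  = \sum_{j=1}^{n} \left( y_j - b_0 - b_1 x_{1j} - \dots - b_p x_{pj} \right)^2

% Matrix form: y is n x 1, X is n x (p+1) with a leading column of ones
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e},
\qquad
\hat{\mathbf{b}} = \left( \mathbf{X}^{\top}\mathbf{X} \right)^{-1} \mathbf{X}^{\top}\mathbf{y}
```

The closed-form solution requires that the matrix X'X be invertible, which fails when one predictor is an exact linear combination of the others.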

  18. Getting started with Multivariate Multiple Regression

    Multivariate Multiple Regression is a method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score.

  19. Linear regression hypothesis testing: Concepts, Examples

    The form of the equation that represents a multiple linear regression model is Y = b0 + b1X1 + b2X2 + … + bnXn, where bi represents the coefficient of the ith predictor variable. In this type of linear regression model, each predictor variable has its own coefficient that is used to calculate the predicted value of the response variable.
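As a minimal illustration of how the equation above turns coefficients into a predicted value, the sketch below evaluates it for one observation. The function name and the numeric coefficients are hypothetical, chosen only for the example.

```python
def predict(intercept, coefficients, predictors):
    """Evaluate Y = b0 + b1*X1 + ... + bn*Xn for a single observation."""
    if len(coefficients) != len(predictors):
        raise ValueError("need exactly one coefficient per predictor")
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))

# Hypothetical model: b0 = 1.0, b1 = 2.0, b2 = 3.0, evaluated at X1 = 4, X2 = 3.
y_hat = predict(1.0, [2.0, 3.0], [4, 3])
print(y_hat)  # 1 + 2*4 + 3*3 = 18.0
```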

  20. Null and Alternative hypothesis for multiple linear regression

    I have 1 dependent variable and 3 independent variables. I run multiple regression, and find that the p value for one of the independent variables is higher than 0.05 (95% is my confidence level).

  21. 6.4

    For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are: Hypothesis test for testing that all of the slope parameters are 0. Hypothesis test for testing that a subset — more than one, but not all — of the slope parameters are 0.
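The two tests named in this excerpt can both be framed as a comparison between a full model and a reduced model that omits the slopes under test. A sketch of the general linear F-statistic, with the subscripts F (full) and R (reduced) as assumed notation:

```latex
% H_0: the tested slope parameters are all 0
% H_A: at least one of the tested slope parameters is not 0
F^{*} = \frac{\left( SSE_R - SSE_F \right) / \left( df_R - df_F \right)}{SSE_F / df_F}
```

Here SSE is the error sum of squares and df the error degrees of freedom of each model. When all slopes are tested at once, the reduced model contains only the intercept.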

  22. Linear vs. Multiple Regression: What's the Difference?

    Linear Regression vs. Multiple Regression Example. ... P-value is the level of marginal significance within a statistical hypothesis test, representing the probability of the occurrence of a ...

  23. Probabilistic classification of gene-by-treatment ...

    Despite its effectiveness in identifying a large number of associations, the response QTL mapping relies on hypothesis testing, which does not provide classification of different types of GxT interactions. Herein, we propose a use of Bayesian model selection to classify the GxT types of response QTLs and provide a software package for this method.

  24. Dendritic spine head diameter

    ... was used to select the penalty factor for coefficient shrinkage during LASSO regression. Nested model leave-one-out cross-validation was then performed using the validation set. Multiple linear models containing different combinations of variables (age, pathology scores, sex, and spine density, head diameter, length, and volume) ...