
Multiple Linear Regression | A Quick Guide (Examples)

Published on February 20, 2020 by Rebecca Bevans. Revised on June 22, 2023.

Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

  • How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
  • The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Table of contents

  • Assumptions of multiple linear regression
  • How to perform a multiple linear regression
  • Interpreting the results
  • Presenting the results
  • Other interesting articles
  • Frequently asked questions about multiple linear regression

Multiple linear regression makes all of the same assumptions as simple linear regression:

Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variables.

Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

Normality: The data follows a normal distribution.

Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.


Multiple linear regression formula.

The formula for a multiple linear regression is:

\(y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \epsilon\)

  • \(y\) = the predicted value of the dependent variable
  • \(\beta_0\) = the y-intercept (the value of y when all independent variables are 0)
  • \(\beta_1 X_1\) = the regression coefficient (\(\beta_1\)) of the first independent variable (\(X_1\))
  • … = do the same for however many independent variables you are testing
  • \(\beta_n X_n\) = the regression coefficient of the last independent variable
  • \(\epsilon\) = model error (how much variation there is in our estimate of y)

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

  • The regression coefficients that lead to the smallest overall model error.
  • The t statistic of the overall model.
  • The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself.

Dataset for multiple linear regression (.csv)

Load the heart.data dataset into your R environment and run the following code:
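A minimal sketch of that call, assuming the CSV has been saved locally and its columns are named heart.disease, biking, and smoking:

```r
heart.data <- read.csv("heart.data.csv")  # adjust the path to where you saved the file
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
```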

This code takes the data set heart.data and calculates the effect that the independent variables biking and smoking have on the dependent variable heart disease using the equation for the linear model: lm() .

Learn more by following the full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function:
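Continuing the sketch above, with the fitted model object named heart.disease.lm:

```r
summary(heart.disease.lm)
```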

This function takes the most important parameters from the linear model and puts them into a table that looks like this:

R multiple linear regression summary output

The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’). If the residuals are roughly centered around zero and with similar spread on either side, as these do (median 0.03, and min and max around -2 and 2), then the model probably fits the assumption of homoscedasticity.

Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It’s helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable.

The most important things to note in this output table are the next two rows – the estimates for the independent variables.

The Estimate column is the estimated effect, also called the regression coefficient. The estimates in the table tell us that for every one percent increase in biking to work there is an associated 0.2 percent decrease in heart disease, and that for every one percent increase in smoking there is an associated 0.17 percent increase in heart disease.

The Std.error column displays the standard error of the estimate. This number shows how much variation there is around the estimates of the regression coefficient.

The t value column displays the test statistic . Unless otherwise specified, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that the results occurred by chance.

The Pr( > | t | ) column shows the p value . This shows how likely the calculated t value would have occurred by chance if the null hypothesis of no effect of the parameter were true.

Because these values are so low (p < 0.001 in both cases), we can reject the null hypothesis and conclude that both biking to work and smoking likely influence rates of heart disease.

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.

Multiple regression in R graph

Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work.

To include the effect of smoking on the dependent variable, we calculated these predicted values while holding smoking constant at the minimum, mean, and maximum observed rates of smoking.


If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

Statistics

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis

Methodology

  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of each of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.
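A minimal sketch of that calculation in R (the observed and predicted vectors here are made up):

```r
observed  <- c(3.1, 4.5, 5.0, 6.2)
predicted <- c(2.9, 4.8, 5.1, 5.9)
mse <- mean((observed - predicted)^2)  # mean of the squared distances
mse
```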


Multiple linear regression

Fig. 11 Multiple linear regression

Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)

Fit: the estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to minimize the residual sum of squares (RSS): \(\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2\).

Matrix notation: with \(\beta=(\beta_0,\dots,\beta_p)\) and \({X}\) our usual data matrix with an extra column of ones on the left to account for the intercept, we can write \(\text{RSS}(\beta)=\|Y-X\beta\|^2_2\).

Multiple linear regression answers several questions

Is at least one of the variables \(X_i\) useful for predicting the outcome \(Y\) ?

Which subset of the predictors is most important?

How good is a linear model for these data?

Given a set of predictor values, what is a likely value for \(Y\) , and how accurate is this prediction?

The estimates \(\hat\beta\)

Our goal again is to minimize the RSS:

\[
\begin{aligned}
\text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\
&= \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\
&= \|Y-X\beta\|^2_2
\end{aligned}
\]

One can show that this is minimized by the vector \(\hat\beta = ({X}^T{X})^{-1}{X}^T{y}.\)

We usually write \(RSS=RSS(\hat{\beta})\) for the minimized RSS.
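The closed-form estimator is easy to check against lm(); a small sketch with simulated data (all names are hypothetical):

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X        <- cbind(1, x1, x2)               # data matrix with a leading column of ones
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T y
cbind(beta.hat, coef(lm(y ~ x1 + x2)))     # the two columns agree
```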

Which variables are important?

Consider the hypothesis: \(H_0:\) the last \(q\) predictors have no relation with \(Y\) .

Based on our model: \(H_0:\beta_{p-q+1}=\beta_{p-q+2}=\dots=\beta_p=0.\)

Let \(\text{RSS}_0\) be the minimized residual sum of squares for the model which excludes these variables.

The \(F\)-statistic is defined by: \(F = \frac{(\text{RSS}_0-\text{RSS})/q}{\text{RSS}/(n-p-1)}.\)

Under the null hypothesis (of our model), this has an \(F\) -distribution.

Example: If \(q=p\), we test whether any of the variables is important. In this case, \(\text{RSS}_0 = \sum_{i=1}^n(y_i-\overline y)^2.\)
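The output below is the kind of table R prints when comparing nested models with anova(); a sketch of such a call (object and variable names are illustrative):

```r
fit0 <- lm(y ~ 1, data = dat)        # intercept-only (null) model
fit  <- lm(y ~ x1 + x2, data = dat)  # full model with all predictors
anova(fit0, fit)                     # F-test of H0: all slopes equal zero
```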

Analysis of variance table (2 × 6):

  Res.Df      RSS  Df  Sum of Sq        F      Pr(>F)
     494 11336.29  NA         NA       NA          NA
     492 11078.78   2   257.5076 5.717853 0.003509036

The \(t\) -statistic associated to the \(i\) th predictor is the square root of the \(F\) -statistic for the null hypothesis which sets only \(\beta_i=0\) .

A low \(p\) -value indicates that the predictor is important.

Warning: If there are many predictors, even under the null hypothesis, some of the \(t\) -tests will have low p-values even when the model has no explanatory power.

How many variables are important?

When we select a subset of the predictors, we have \(2^p\) choices.

A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.

Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step.

Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.

Mixed selection: Starting from some model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.

Choosing one model in the range produced is a form of tuning . This tuning can invalidate some of our methods like hypothesis tests and confidence intervals…

How good are the predictions?

The function predict in R outputs predictions and confidence intervals from a linear model:
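A sketch of the call that yields output like the matrix below (fit and new.data are assumed names):

```r
predict(fit, newdata = new.data, interval = "confidence")
```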

A 3 × 3 matrix:

        fit       lwr      upr
   9.409426  8.722696 10.09616
  14.163090 13.708423 14.61776
  18.916754 18.206189 19.62732

Prediction intervals reflect uncertainty on \(\hat\beta\) and the irreducible error \(\varepsilon\) as well.
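The corresponding call (same assumed names):

```r
predict(fit, newdata = new.data, interval = "prediction")
```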

A 3 × 3 matrix:

        fit       lwr      upr
   9.409426  2.946709 15.87214
  14.163090  7.720898 20.60528
  18.916754 12.451461 25.38205

These functions rely on our linear regression model \(Y = X\beta + \epsilon.\)

Dealing with categorical or qualitative predictors

For each qualitative predictor, e.g. Region :

Choose a baseline category, e.g. East

For every other category, define a new predictor:

\(X_\text{South}\) is 1 if the person is from the South region and 0 otherwise

\(X_\text{West}\) is 1 if the person is from the West region and 0 otherwise.

The model will be: \(Y = \beta_0 + \beta_1 X_1 +\dots +\beta_7 X_7 + \color{Red}{\beta_\text{South}} X_\text{South} + \beta_\text{West} X_\text{West} +\varepsilon.\)

The parameter \(\color{Red}{\beta_\text{South}}\) is the relative effect on Balance (our \(Y\) ) for being from the South compared to the baseline category (East).

The model fit and predictions are independent of the choice of the baseline category.

However, hypothesis tests derived from these variables are affected by the choice.

Solution: To check whether region is important, use an \(F\) -test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not depend on the coding.
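A sketch of that comparison in R, assuming a data frame named credit with a Balance response and a Region factor (names are illustrative):

```r
fit.full    <- lm(Balance ~ ., data = credit)           # model including Region
fit.reduced <- lm(Balance ~ . - Region, data = credit)  # same model with Region dropped
anova(fit.reduced, fit.full)                            # F-test of beta_South = beta_West = 0
```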

Note that there are other ways to encode qualitative predictors that produce the same fit \(\hat f\), but the coefficients have different interpretations.

So far, we have:

Defined Multiple Linear Regression

Discussed how to test the importance of variables.

Described one approach to choose a subset of variables.

Explained how to code qualitative variables.

Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?

How good is the fit?

To assess the fit, we focus on the residuals \(e = Y - \hat{Y}\).

The RSS always decreases as we add more variables.

The residual standard error (RSE) corrects this: \(\text{RSE} = \sqrt{\frac{1}{n-p-1}\text{RSS}}.\)

Fig. 12 Residuals

Visualizing the residuals can reveal phenomena that are not accounted for by the model, e.g. synergies or interactions.

Potential issues in linear regression

Interactions between predictors

Non-linear relationships

Correlation of error terms

Non-constant variance of error (heteroskedasticity)

High leverage points

Collinearity

Interactions between predictors

Linear regression has an additive assumption: \(\mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\varepsilon\)

i.e. an increase of 100 USD in TV ads causes a fixed increase of \(100 \beta_1\) USD in sales on average, regardless of how much you spend on radio ads.

We saw that in Fig 3.5 above. If we visualize the fit and the observed points, we see they are not evenly scattered around the plane. This could be caused by an interaction.

One way to deal with this is to include a multiplicative (interaction) term in the model: \(\mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv} + \beta_2\times\mathtt{radio} + \beta_3\times(\mathtt{tv}\cdot\mathtt{radio}) + \varepsilon.\)

The interaction variable tv \(\cdot\) radio is high when both tv and radio are high.

R makes it easy to include interaction variables in the model:
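A sketch of the call, assuming a data frame advertising with columns sales, tv, and radio:

```r
# tv * radio expands to tv + radio + tv:radio (the interaction term)
fit.int <- lm(sales ~ tv * radio, data = advertising)
summary(fit.int)
```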

Non-linearities

Fig. 13 A nonlinear fit might be better here.

Example: Auto dataset.

A scatterplot between a predictor and the response may reveal a non-linear relationship.

Solution: include polynomial terms in the model.

Could use other functions besides polynomials…
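For instance, a quadratic term for horsepower in the Auto data (assuming the ISLR package, where this dataset is available):

```r
library(ISLR)
fit.quad <- lm(mpg ~ poly(horsepower, 2), data = Auto)  # orthogonal polynomial of degree 2
summary(fit.quad)
```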

Fig. 14 Residuals for Auto data

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?

Correlation of error terms

We assumed that the errors for each sample are independent: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)

What if this breaks down?

The main effect is that this invalidates any assertions about Standard Errors, confidence intervals, and hypothesis tests…

Example : Suppose that by accident, we duplicate the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of \(\sqrt{2}\) .

When could this happen in real life:

Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.

Spatial data: Each sample corresponds to a different location in space.

Grouped data: Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family, their shared environment could make them deviate from \(f(x)\) in similar ways.

Correlated errors

Simulations of time series with increasing correlations between \(\varepsilon_i\)

Non-constant variance of error (heteroskedasticity)

The variance of the error depends on some characteristics of the input features.

To diagnose this, we can plot residuals vs. fitted values:

If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.
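A sketch of both steps for a fitted model fit (the object and variable names are assumptions):

```r
# Residuals vs. fitted values: a funnel shape suggests heteroskedasticity
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Refit with a log-transformed response to stabilize the variance
fit.log <- lm(log(y) ~ x1 + x2, data = dat)
```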

Outliers from a model are points with very high errors.

While they may not affect the fit, they might affect our assessment of model quality.

Possible solutions:

If we believe an outlier is due to an error in data collection, we can remove it.

An outlier might be evidence of a missing predictor, or the need to specify a more complex model.

High leverage points

Some samples with extreme inputs have an outsized effect on \(\hat \beta\) .

This can be measured with the leverage statistic or self influence: \(h_{ii} = \big[X(X^TX)^{-1}X^T\big]_{ii}.\)

Studentized residuals

The residual \(e_i = y_i - \hat y_i\) is an estimate for the noise \(\epsilon_i\) .

The standard error of \(\hat \epsilon_i\) is \(\sigma \sqrt{1-h_{ii}}\) .

A studentized residual is \(\hat \epsilon_i\) divided by its standard error (with appropriate estimate of \(\sigma\) )

When the model is correct, it follows a Student-t distribution with \(n-p-2\) degrees of freedom.
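Both quantities are available directly from a fitted lm object in R (fit is an assumed model name):

```r
h <- hatvalues(fit)     # leverage h_ii for each observation
r <- rstudent(fit)      # externally studentized residuals
which(h > 2 * mean(h))  # a common rule of thumb for flagging high-leverage points
```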

Collinearity

Two predictors are collinear if one explains the other well:

Problem: The coefficients become unidentifiable .

Consider the extreme case of using two identical predictors, limit:

\[
\begin{aligned}
\mathtt{balance} &= \beta_0 + \beta_1\times\mathtt{limit} + \beta_2\times\mathtt{limit} + \epsilon \\
& = \beta_0 + (\beta_1+100)\times\mathtt{limit} + (\beta_2-100)\times\mathtt{limit} + \epsilon
\end{aligned}
\]

For every \((\beta_0,\beta_1,\beta_2)\) the fit at \((\beta_0,\beta_1,\beta_2)\) is just as good as at \((\beta_0,\beta_1+100,\beta_2-100)\) .

If 2 variables are collinear, we can easily diagnose this using their correlation.

A group of \(q\) variables is multicollinear if these variables “contain less information” than \(q\) independent variables.

Pairwise correlations may not reveal multicollinear variables.

The Variance Inflation Factor (VIF) measures how predictable each predictor is given the other variables, a proxy for how necessary a variable is: \(\text{VIF}(\hat\beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}.\)

Above, \(R^2_{X_j|X_{-j}}\) is the \(R^2\) statistic for Multiple Linear regression of the predictor \(X_j\) onto the remaining predictors.
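In R, the VIF is commonly computed with the car package (fit is an assumed lm object):

```r
library(car)
vif(fit)  # values well above 5-10 are usually taken as a sign of problematic collinearity
```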


Multiple Linear Regression in SPSS

Discover Multiple Linear Regression in SPSS! Learn how to perform it, understand the SPSS output, and report results in APA style. Check out this simple, easy-to-follow guide below for a quick read!


Introduction

Welcome to our comprehensive guide on Multiple Linear Regression in SPSS . In the dynamic world of statistics, understanding the nuances of Multiple Linear Regression is key for researchers and analysts seeking a deeper understanding of relationships within their data. This blog post is your roadmap to mastering Multiple Linear Regression using the Statistical Package for the Social Sciences (SPSS).

From unraveling the fundamentals to providing practical insights through examples, this guide aims to demystify the complexities, making Multiple Linear Regression accessible to both beginners and seasoned data enthusiasts.

Definition: Multiple Linear Regression

Multiple Linear Regression expands upon the principles of Simple Linear Regression by accommodating multiple independent variables. In essence, it assesses the linear relationship between the dependent variable and two or more predictors. The model’s flexibility allows for a more realistic representation of real-world scenarios where outcomes are influenced by multiple factors. By incorporating multiple predictors, this technique offers a nuanced understanding of how each variable contributes to the variation in the dependent variable. This section serves as a gateway to the intricacies of Multiple Linear Regression, setting the stage for a detailed exploration of its components and applications in subsequent sections.

Linear Regression Methods

Multiple Linear Regression encompasses various methods for building and refining models to predict a dependent variable based on multiple independent variables. These methods help researchers and analysts tailor regression models to the specific characteristics of their data and research questions. Here are some key methods in Multiple Linear Regression:

Ordinary Least Squares (OLS)

OLS is the most common method used in Multiple Linear Regression . It minimizes the sum of squared differences between observed and predicted values, aiming to find the coefficients that best fit the data. OLS provides unbiased estimates if the assumptions of the regression model are met.

Stepwise Regression

In stepwise regression , the model-building process involves adding or removing predictor variables at each step based on statistical criteria. The algorithm evaluates variables and decides whether to include or exclude them in a stepwise manner. It can be forward (adding variables) or backward (removing variables) stepwise regression.

Backward Regression

Backward regression begins with a model that includes all predictor variables and then systematically removes the least significant variables based on statistical tests. This process continues until the model only contains statistically significant predictors. It’s a simplification approach aimed at retaining only the most influential variables.

Forward Regression

Forward regression starts with an empty model and incrementally adds the most significant predictor variables based on statistical tests. This iterative process continues until the addition of more variables does not significantly improve the model. Forward regression helps identify the most relevant predictors contributing to the model’s explanatory power.

Hierarchical Regression

In hierarchical regression, predictor variables are entered into the model in a pre-defined sequence or hierarchy. This method allows researchers to examine the impact of different sets of variables on the dependent variable, taking into account their hierarchical or logical order. The most common approach involves entering blocks of variables at different steps, and assessing how each set contributes to the overall predictive power of the model.

Understanding these multiple linear regression types is crucial for selecting the most appropriate model-building strategy based on the specific goals of your analysis and the characteristics of your dataset. Each approach has its advantages and considerations, influencing the interpretability and complexity of the final regression model.

Regression Equation

The Multiple Regression Equation in Multiple Linear Regression takes the form of

Y = b0 + b1X1 + b2X2 + … + bnXn , where

  • Y is the predicted value of the dependent variable,
  • b0 is the intercept,
  • b1, b2, …, bn are the regression coefficients for each independent variable (X1, X2, …, Xn).

The regression coefficients represent the change in the dependent variable for a one-unit change in the corresponding independent variable, while the intercept is the predicted value when all independent variables are zero. Understanding the interplay between these components is essential for deciphering the impact of each predictor on the overall model. In the upcoming sections, we’ll delve deeper into specific aspects of Multiple Linear Regression, such as the role of dummy variables and the critical assumptions that underpin this statistical method.

What are Dummy Variables?

In the realm of Multiple Linear Regression , dummy variables are pivotal when dealing with categorical predictors. These variables allow us to include categorical data, like gender or region, in our regression model. Consider a binary categorical variable, such as gender (Male/Female). We represent this in our equation using a dummy variable, where one category is assigned 0 and the other 1. For instance, if Male is our reference category, the dummy variable would be 1 for Female and 0 for Male. This inclusion of categorical information enhances the model’s flexibility, capturing the nuanced impact of different categories on the dependent variable. As we explore Multiple Linear Regression further, understanding the role of dummy variables becomes paramount for robust and accurate analyses.

Assumption of Multiple Linear Regression

Before diving into Multiple Linear Regression analysis, it’s crucial to be aware of the underlying assumptions that bolster the reliability of the results.

  • Linearity : Assumes a linear relationship between the dependent variable and all independent variables. The model assumes that changes in the dependent variable are proportional to changes in the independent variables.
  • Independence of Residuals : Assumes that the residuals (the differences between observed and predicted values) are independent of each other. The independence assumption is crucial to avoid issues of autocorrelation and ensure the reliability of the model.
  • Homoscedasticity : Assumes that the variability of the residuals remains constant across all levels of the independent variables. Homoscedasticity ensures that the spread of residuals is consistent, indicating that the model’s predictions are equally accurate across the range of predictor values.
  • Normality of Residuals : Assumes that the residuals follow a normal distribution. Normality is essential for making valid statistical inferences and hypothesis testing. Deviations from normality may impact the accuracy of confidence intervals and p-values.
  • No Perfect Multicollinearity : Assumes that there is no perfect linear relationship among the independent variables. Perfect multicollinearity can lead to unstable estimates of regression coefficients, making it challenging to discern the individual impact of each predictor.

These assumptions collectively form the foundation of Multiple Linear Regression analysis . Ensuring that these conditions are met enhances the validity and reliability of the statistical inferences drawn from the model. In the subsequent sections, we will delve into hypothesis testing in Multiple Linear Regression, provide practical examples, and guide you through the step-by-step process of performing and interpreting Multiple Linear Regression analyses using SPSS.

Hypothesis of Multiple Linear Regression

The hypothesis in Multiple Linear Regression revolves around the significance of the regression coefficients. Each coefficient corresponds to a specific predictor variable, and the hypothesis tests whether each predictor has a significant impact on the dependent variable.

  • Null Hypothesis (H0): The regression coefficients for all independent variables are simultaneously equal to zero.
  • Alternative Hypothesis (H1): At least one regression coefficient for an independent variable is not equal to zero.

The hypothesis testing in Multiple Linear Regression revolves around assessing whether the collective set of independent variables has a statistically significant impact on the dependent variable. The null hypothesis suggests no overall effect, while the alternative hypothesis asserts the presence of at least one significant relationship. This testing framework guides the evaluation of the model’s overall significance, providing valuable insights into the joint contribution of the predictor variables.

Example of Simple Multiple Regression

To illustrate the concepts of Multiple Linear Regression, let’s consider an example. Imagine you are studying the factors influencing house prices, with predictors such as square footage, number of bedrooms, and distance to the city centre. By applying Multiple Linear Regression, you can model how these factors collectively influence house prices.

The regression equation would look like:

Price = b0 + b1(square footage) + b2(number of bedrooms) + b3(distance to city centre).

Through this example, you’ll gain practical insights into how Multiple Linear Regression can untangle complex relationships and offer a comprehensive understanding of the factors affecting the dependent variable.

How to Perform Multiple Linear Regression using SPSS Statistics


Step by Step: Running Regression Analysis in SPSS Statistics

Now, let’s delve into the step-by-step process of conducting the Multiple Linear Regression using SPSS Statistics .  Here’s a step-by-step guide on how to perform a Multiple Linear Regression in SPSS :

  • STEP: Load Data into SPSS

Commence by launching SPSS and loading your dataset, which should encompass the variables of interest: the dependent variable and the independent variables. If your data is not already in SPSS format, you can import it by navigating to File > Open > Data and selecting your data file.

  • STEP: Access the Analyze Menu

In the top menu, locate and click on “Analyze.” Within the “Analyze” menu, navigate to “Regression” and choose “Linear”: Analyze > Regression > Linear.

  • STEP: Choose Variables

A dialogue box will appear. Move the dependent variable (the one you want to predict) to the “Dependent” box and the independent variables to the “Independent” box.

  • STEP: Generate SPSS Output

Once you have specified your variables and chosen options, click the “OK” button to perform the analysis. SPSS will generate a comprehensive output, including the Model Summary, ANOVA, and Coefficients tables for your regression.

Executing these steps initiates the Multiple Linear Regression in SPSS, allowing researchers to assess the impact of the independent variables on the dependent variable. In the next section, we will delve into the interpretation of SPSS output for Multiple Linear Regression.

Conducting a Multiple Linear Regression in SPSS provides a robust foundation for understanding the key features of your data. Always ensure that you consult the documentation corresponding to your SPSS version, as steps might slightly differ based on the software version in use. This guide is tailored for SPSS version 25 , and for any variations, it’s recommended to refer to the software’s documentation for accurate and updated instructions.

SPSS Output for Multiple Regression Analysis


How to Interpret SPSS Output of Multiple Regression

Deciphering the SPSS output of Multiple Linear Regression is a crucial skill for extracting meaningful insights. Let’s focus on three tables in SPSS output;

Model Summary Table

  • R (Multiple Correlation Coefficient): This value ranges from 0 to 1 and indicates the strength of the linear relationship between the observed values of the dependent variable and the values predicted by the model. Higher values signify a stronger relationship.
  • R-Square (Coefficient of Determination): Represents the proportion of variance in the dependent variable explained by the independent variables. Higher values indicate a better fit of the model.
  • Adjusted R Square : Adjusts the R-squared value for the number of predictors in the model, providing a more accurate measure of goodness of fit.

ANOVA Table

  • F (ANOVA Statistic): Indicates whether the overall regression model is statistically significant. A significant F-value suggests that the model is better than a model with no predictors.
  • df (Degrees of Freedom): Represents the degrees of freedom associated with the F-test.
  • P values : The probability of obtaining the observed F-statistic by random chance. A low p-value (typically < 0.05) indicates the model’s significance.

Coefficient Table

  • Unstandardized Coefficients (B): Provides the individual regression coefficients for each predictor variable.
  • Standardized Coefficients (Beta): Standardizes the coefficients, allowing for a comparison of the relative importance of each predictor.
  • t-values : Indicate how many standard errors the coefficients are from zero. Higher absolute t-values suggest greater significance.
  • P values : Test the null hypothesis that the corresponding coefficient is equal to zero. A low p-value suggests that the predictors are significantly related to the dependent variable.

Understanding these tables in the SPSS output is crucial for drawing meaningful conclusions about the strength, significance, and direction of the relationship between variables in a Multiple Linear Regression analysis.

  How to Report Results of Multiple Linear Regression in APA

Effectively communicating the results of Multiple Linear Regression in compliance with the American Psychological Association (APA) guidelines is crucial for scholarly and professional writing.

  • Introduction : Begin the report with a concise introduction summarizing the purpose of the analysis and the relationship being investigated between the variables.
  • Assumption Checks: If relevant, briefly mention the checks for assumptions such as linearity, independence, homoscedasticity, and normality of residuals to ensure the robustness of the analysis.
  • Significance of the Model : Comment on the overall significance of the model based on the ANOVA table. For example, “The overall regression model was statistically significant (F = [value], p = [value]), suggesting that the predictors collectively contributed to the prediction of the dependent variable.”
  • Regression Equation : Present the Multiple Regression equation, highlighting the intercept and regression coefficients for each predictor variable.
  • Interpretation of Coefficients : Interpret the coefficients, focusing on the slope (b1..bn) to explain the strength and direction of the relationship. Discuss how a one-unit change in the independent variable corresponds to a change in the dependent variable.
  • R-squared Value: Include the R-squared value to highlight the proportion of variance in the dependent variable explained by the independent variables. For instance, “The R-squared value of [value] indicates that [percentage]% of the variability in [dependent variable] can be explained by the linear relationship with [independent variables].”
  • Conclusion : Conclude the report by summarizing the key findings and their implications. Discuss any practical significance of the results in the context of your study.




Statistics By Jim

Making statistics intuitive

Linear Regression Explained with Examples

By Jim Frost

What is Linear Regression?

Linear regression models the relationships between at least one explanatory variable and an outcome variable. This flexible analysis allows you to separate the effects of complicated research questions, allowing you to isolate each variable’s role. Additionally, linear models can fit curvature and interaction effects.

Statisticians refer to the explanatory variables in linear regression as independent variables (IV) and the outcome as the dependent variable (DV). When a linear model has one IV, the procedure is known as simple linear regression. When there is more than one IV, statisticians refer to it as multiple regression. These models assume that the average value of the dependent variable depends on a linear function of the independent variables.

Linear regression has two primary purposes—understanding the relationships between variables and prediction.

  • The coefficients represent the estimated magnitude and direction (positive/negative) of the relationship between each independent variable and the dependent variable.
  • The equation allows you to predict the mean value of the dependent variable given the values of the independent variables that you specify.

Linear regression finds the constant and coefficient values for the IVs for a line that best fits your sample data. The graph below shows the best linear fit for the height and weight data points, revealing the mathematical relationship between them. Additionally, you can use the line’s equation to predict future values of the weight given a person’s height.

Linear regression was one of the earliest types of regression analysis to be rigorously studied and widely applied in real-world scenarios. This popularity stems from the relative ease of fitting linear models to data and the straightforward nature of analyzing the statistical properties of these models. Unlike more complex models that relate to their parameters in a non-linear way, linear models simplify both the estimation and the interpretation of data.

In this post, you’ll learn how to interpret linear regression with an example, about the linear formula, how it finds the coefficient estimates, and its assumptions.

Learn more about when you should use regression analysis  and independent and dependent variables .

Linear Regression Example

Suppose we use linear regression to model how the outside temperature in Celsius and Insulation thickness in centimeters, our two independent variables, relate to air conditioning costs in dollars (dependent variable).

Let’s interpret the results for the following multiple linear regression equation:

Air Conditioning Costs$ = 2 * Temperature C – 1.5 * Insulation CM

The coefficient sign for Temperature is positive (+2), which indicates a positive relationship between Temperature and Costs. As the temperature increases, so do air conditioning costs. More specifically, the coefficient value of 2 indicates that for every 1 C increase, the average air conditioning cost increases by two dollars.

On the other hand, the negative coefficient for insulation (–1.5) represents a negative relationship between insulation and air conditioning costs. As insulation thickness increases, air conditioning costs decrease. For every 1 CM increase, the average air conditioning cost drops by $1.50.

We can also enter values for temperature and insulation into this linear regression equation to predict the mean air conditioning cost.
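For example, at 25 C with 10 cm of insulation, the equation predicts a mean cost of 2(25) - 1.5(10) = 50 - 15 = 35 dollars.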

Learn more about interpreting regression coefficients and using regression to make predictions .

Linear Regression Formula

Linear regression refers to the form of the regression equations these models use. These models follow a particular formula arrangement that requires all terms to be one of the following:

  • The constant
  • A parameter multiplied by an independent variable (IV)

Then, you build the linear regression formula by adding the terms together. These rules limit the form to just one type:

Dependent variable = constant + parameter * IV + … + parameter * IV

Linear model equation.

This formula is linear in the parameters. However, despite the name linear regression, it can model curvature. While the formula must be linear in the parameters, you can raise an independent variable by an exponent to model curvature . For example, if you square an independent variable, linear regression can fit a U-shaped curve.

Specifying the correct linear model requires balancing subject-area knowledge, statistical results, and satisfying the assumptions.

Learn more about the difference between linear and nonlinear models and specifying the correct regression model .

How to Find the Linear Regression Line

Linear regression can use various estimation methods to find the best-fitting line. However, analysts use the least squares most frequently because it is the most precise prediction method that doesn’t systematically overestimate or underestimate the correct values when you can satisfy all its assumptions.

The beauty of the least squares method is its simplicity and efficiency. The calculations required to find the best-fitting line are straightforward, making it accessible even for beginners and widely used in various statistical applications. Here’s how it works:

  • Objective : Minimize the differences between the observed and the linear regression model’s predicted values . These differences are known as “ residuals ” and represent the errors in the model values.
  • Minimizing Errors : This method focuses on making the sum of these squared differences as small as possible.
  • Best-Fitting Line : By finding the values of the model parameters that achieve this minimum sum, the least squares method effectively determines the best-fitting line through the data points. 

By employing the least squares method in linear regression and checking the assumptions in the next section, you can ensure that your model is as precise and unbiased as possible. This method’s ability to minimize errors and find the best-fitting line is a valuable asset in statistical analysis.
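A small sketch of that idea with simulated data: the coefficients that lm() returns are the same ones a direct numerical minimization of the sum of squared residuals finds (all names and numbers here are made up):

```r
set.seed(42)
height <- rnorm(50, mean = 170, sd = 10)
weight <- -50 + 0.7 * height + rnorm(50, sd = 5)

sse <- function(b) sum((weight - (b[1] + b[2] * height))^2)  # sum of squared residuals
optim(c(0, 0), sse, method = "BFGS")$par  # numerical minimization of the SSE...
coef(lm(weight ~ height))                 # ...matches the least squares fit (up to tolerance)
```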

Assumptions

Linear regression using the least squares method has the following assumptions:

  • A linear model satisfactorily fits the relationship.
  • The residuals follow a normal distribution.
  • The residuals have a constant scatter.
  • Independent observations.
  • The IVs are not perfectly correlated.

Residuals are the difference between the observed value and the mean value that the model predicts for that observation. If you fail to satisfy the assumptions, the results might not be valid.

Learn more about the assumptions for ordinary least squares and How to Assess Residual Plots .



September 10, 2024 at 5:16 am

I managed to figure it out myself! Apparently the difference comes from using type 1 anova instead of type 2 (in R default anova function is type 1 anova, whereas the function in python is type 2).

As I understand it, in type 1 it’s done sequentially and the order of the variables in the model changes the results in this case, whereas in type 2 anova it is done marginally.


September 11, 2024 at 2:04 pm

That’s great! I assume you’re referring to the sum of squares, which would actually be Adjusted Type 3 (the default in statistics) and Sequential Type I.

As you say, Type 1 depends on the order that the variables are entered into a model. That’s not usually used because a truly unimportant variable can look more important simply by being added to the model first.

Type 3 gives the results for each variable when all the other variables are already in the model. That puts them all on an even playing field and gives you the results for the unique variance each variables explains that the other variables do not.

You say Type 2. There is a Type 2 sum of squares, but it’s much less common than Types 1 and 3. Type 2 considers each main effect as being added after all the other main effects but before the interaction terms.

I don’t know why Python uses Type 2. Generally speaking, you should use Type 3 unless you have very strong theoretical/subject-area knowledge indicating that a different type is better. But that’s very rare. Almost always use Type 3.

September 7, 2024 at 7:42 am

Thank you for your blog, it saved me so many hair being pulled in frustration! 😀

I am reading your book on Regression (big thumbs up, recommend to everyone!) and trying to recreate results on Income ~ Major+Experience example on p 68.

I tried it in Python and got the same results as yours, then tried R (all seems the same: one factor, one numerical independent variable), and got different results:

Analysis of Variance Table

Response: Income
            Df     Sum Sq    Mean Sq F value  Pr(>F)
Major        2 2.5165e+09 1258246677  2.7701 0.08117
Experience   1 2.2523e+09 2252342774  4.9587 0.03483
Residuals   26 1.1810e+10  454222144

Meaning that the Major isn’t significant!

The coefficient table is exactly the same as yours.

Can you please help me to understand what is going on?

I am going mad trying to solve it.

Thanks, Alex


May 9, 2024 at 9:10 am

Why not perform centering or standardization with all linear regression to arrive at a better estimate of the y-intercept?

May 9, 2024 at 4:48 pm

I talk about centering elsewhere. This article just covers the basics of what linear regression does.

A little statistical niggle on centering creating a “better estimate” of the y-intercept. In statistics, there’s a specific meaning to “better estimate,” relating to precision and a lack of bias. Centering (or standardizing) doesn’t create a better estimate in that sense. It can create a more interpretable value in some situations, which is better in common usage.


August 16, 2023 at 5:10 pm

Hi Jim, I’m trying to understand why the Beta and significance changes in a linear regression, when I add another independent variable to the model. I am currently working on a mediation analysis, and as you know the linear regression is part of that. A simple linear regression between the IV (X) and the DV (Y) returns a statistically significant result. But when I add another IV (M), X becomes insignificant. Can you explain this? Seeking some clarity, Peta.

August 16, 2023 at 11:12 pm

This is a common occurrence in linear regression and is crucial for mediation analysis.

By adding M (mediator), it might be capturing some of the variance that was initially attributed to X. If M is a mediator, it means the effect of X on Y is being channeled through M. So when M is included in the model, it’s possible that the direct effect of X on Y becomes weaker or even insignificant, while the indirect effect (through M) becomes significant.

If X and M share variance in predicting Y, when both are in the model, they might “compete” for explaining the variance in Y. This can lead to a situation where the significance of X drops when M is added.

I hope that helps!



July 30, 2022 at 2:49 pm

Jim, Hi! I am working on an interpretation of multiple linear regression. I am having a bit of trouble getting help. is there a way to post the table so that I may initiate a coherent discussion on my interpretation?


April 28, 2022 at 3:24 pm

Is it possible that we get significant correlations but no significant prediction in a multiple regression analysis? I am seeing that with my data and I am so confused. Could mediation be a factor (i.e IVs are not predicting the outcome variables because the relationship is made possible through mediators)?

April 29, 2022 at 4:37 pm

I’m not sure what you mean by “significant prediction.” Typically, the predictions you obtain from regression analysis will be a fitted value (the prediction) and a prediction interval that indicates the precision of the prediction (how close is it likely to be to the correct value). We don’t usually refer to “significance” when talking about predictions. Can you explain what you mean? Thanks!


March 25, 2022 at 7:19 am

I want to do a multiple regression analysis in SPSS (creating a predictive model), where IQ is my dependent variable and my independent variables consist of different cognitive domains. The IQ scores are already scaled for age. How can I control my independent variables for age, without doing it again for the IQ scores? I can’t add age as an independent variable in the model.

I hope that you can give me some advice, thank you so much!

March 28, 2022 at 9:27 pm

If you include age as an independent variable, the model controls for it while calculating the effects of the other IVs. And don’t worry, including age as an IV won’t double count it for IQ because that is your DV.


March 2, 2022 at 8:23 am

Hi Jim, Is there a reason you would want your covariates to be associated with your independent variable before including them in the model? So in deciding which covariates to include in the model, it was specified that covariates associated with both the dependent variable and independent variable at p<0.10 will be included in the model.

My question is why would you want the covariates to be associated with the independent variable?

March 2, 2022 at 4:38 pm

In some cases, it’s absolutely crucial to include covariates that correlate with other independent variables, although it’s not a sufficient reason by itself. When you have a potential independent variable that correlates with other IVs and it also correlates with the dependent variable, it becomes a confounding variable and omitting it from the model can cause a bias in the variables that you do include. In this scenario, the degree of bias depends on the strengths of the correlations involved. Observational studies are particularly susceptible to this type of omitted variable bias. However, when you’re performing a true, randomized experiment, this type of bias becomes a non-issue.

I’ve never heard of a formalized rule such as the one that you mention. Personally, I wouldn’t use p-values to make this determination. You can have low p-values for weak correlation in some cases. Instead, I’d look at the strength of the correlations between IVs. However, it’s not as simple as a single criterion like that. The strength of the correlation between the potential IV and the DV also plays a role.

I’ve written an article that discusses these issues in more detail; read Confounding Variables Can Bias Your Results.


February 28, 2022 at 8:19 am

Jim, as if by serendipity: having been on your mailing list for years, I looked up your information on multiple regression this weekend for a grad school advanced statistics case study. I’m a fan of your admirable gift to make complicated topics approachable and digestible. Specifically, I was looking for information on how pronounced the triangular/funnel shape must be–and in what directions it may point–to suggest heteroscedasticity in a regression scatterplot of standardized residuals vs standardized predicted values. It seemed to me that my resulting plot of a 5 predictor variable regression model featured an obtuse triangular left point that violated homoscedasticity; my professors disagreed, stating the triangular “funnel” aspect would be more prominent and overt. Thus, should you be looking for a new future discussion point, my query to you then might be some pearls on the nature of a qualifying heteroscedastic funnel shape: How severe must it be? Is there a quantifiable magnitude to said severity, and if so, how would one quantify this and/or what numeric outputs in common statistical software would best support or deny a suspicion based on graphical interpretation? What directions can the funnel point; are only some directions suggestive, whereby others are not? Thanks for entertaining my comment, and, as always, thanks for doing what you do.



Writing hypothesis for linear multiple regression models

I struggle writing hypothesis because I get very much confused by reference groups in the context of regression models.

For my example I'm using the mtcars dataset. The predictors are wt (weight), cyl (number of cylinders), and gear (number of gears), and the outcome variable is mpg (miles per gallon).

Say all your friends think you should buy a 6 cylinder car, but before you make up your mind you want to know how 6 cylinder cars perform miles-per-gallon-wise compared to 4 cylinder cars because you think there might be a difference.

Would this be a fair null hypothesis (since 4 cylinder cars is the reference group)?: There is no difference between 6 cylinder car miles-per-gallon performance and 4 cylinder car miles-per-gallon performance.

Would this be a fair model interpretation?: 6 cylinder vehicles travel fewer miles per gallon (β = -4.00, p = 0.010, CI -6.95 to -1.04) as compared to 4 cylinder vehicles when adjusting for all other predictors, thus rejecting the null hypothesis.

Sorry for troubling, and thanks in advance for any feedback!
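For reference, a sketch of the kind of model being described, with cyl and gear treated as factors so that 4 cylinders and 3 gears are the reference levels:

```r
fit <- lm(mpg ~ wt + factor(cyl) + factor(gear), data = mtcars)
summary(fit)
# To compare 6- vs 8-cylinder cars instead, change the reference level:
# mtcars$cyl <- relevel(factor(mtcars$cyl), ref = "8")
```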


  • multiple-regression
  • linear-model
  • interpretation


Yes, you already got the right answer to both of your questions.

  • Your null hypothesis is completely fair. You did it the right way. When you have a factor variable as predictor, you omit one of the levels as a reference category (the default is usually the first one, but you also can change that). Then all your other levels’ coefficients are tested for a significant difference compared to the omitted category. Just like you did.

If you would like to compare 6-cylinder cars with 8-cylinder cars, then you would have to change the reference category. In your hypothesis you could just have added at the end (or as a footnote): "when adjusting for weight and gear", but it is fine the way you did it.

  • Your model interpretation is correct: It is perfect the way you did it. You could even have said: "the best estimate is that 6 cylinder vehicles travel 4 miles per gallon less than 4 cylinder vehicles (p-value: 0.010; CI: -6.95, -1.04), when adjusting for weight and gear, thus rejecting the null hypothesis".

Let's assume that your hypothesis was related to gears, and you were comparing 4-gear vehicles with 3-gear vehicles. Then your result would be β: 0.65; p-value: 0.67; CI: -2.5, 3.8. You would say that: "There is no statistically significant difference between three and four gear cars in fuel consumption, when adjusting for weight and engine power, thus failing to reject the null hypothesis".



The Multiple Regression Analysis


In the previous chapter, we saw that usually one independent variable is not sufficient to adequately describe the dependent variable. Normally, several factors have an influence on the dependent variable. Hence, typically we need multiple regression analysis, also known as multivariate regression analysis, to describe an issue.



About this chapter

Kronthaler, F. (2024). The Multiple Regression Analysis. In: Statistics Applied with the R Commander. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-69107-6_20


Multiple Linear Regression (MLR) Definition, Formula, and Example


What Is Multiple Linear Regression (MLR)?

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of MLR is to model the linear relationship between the explanatory (independent) variables and response (dependent) variables. In essence, multiple regression is the extension of ordinary least-squares (OLS) regression because it involves more than one explanatory variable.

Key Takeaways

  • Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.
  • It is also known simply as multiple regression.
  • Multiple regression is an extension of linear (OLS) regression, which uses just one explanatory variable.
  • MLR is used extensively in econometrics and financial inference.
  • Multiple regressions are used to make forecasts, explain relationships between financial variables, and test existing theories.

Formula and Calculation of Multiple Linear Regression (MLR)

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon \]

where, for i = 1, …, n observations:

  • y_i = dependent variable
  • x_{i1}, …, x_{ip} = explanatory variables
  • β_0 = y-intercept (constant term)
  • β_1, …, β_p = slope coefficients for each explanatory variable
  • ϵ = the model's error term (also known as the residuals)

What Multiple Linear Regression (MLR) Can Tell You

Simple linear regression is a function that allows an analyst or statistician to make predictions about one variable based on the information that is known about another variable. It can only be used when one has two continuous variables—an independent variable and a dependent variable. The independent variable is the parameter that is used to calculate the dependent variable or outcome. A multiple regression model extends this to several explanatory variables.

The MLR model is based on the following assumptions:

  • There is a linear relationship between the dependent variable and the independent variables
  • The independent variables are not too highly correlated with each other
  • The y_i observations are selected independently and randomly from the population
  • Residuals should be normally distributed with a mean of 0 and constant variance σ²

MLR assumes there is a linear relationship between the dependent and independent variables, that the independent variables are not highly correlated, and that the variance of the residuals is constant.

The coefficient of determination (R²) is a statistical metric that measures how much of the variation in the outcome can be explained by the variation in the independent variables. R² always increases as more predictors are added to the MLR model, even if the added predictors are unrelated to the outcome variable.

Thus, R² by itself can't be used to identify which predictors should be included in a model and which should be excluded. R² can only be between 0 and 1, where 0 indicates that the outcome cannot be predicted by any of the independent variables and 1 indicates that the outcome can be predicted without error from the independent variables.
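To see this behavior concretely, here is a small R sketch (R and the built-in mtcars data are used purely for illustration, and the noise predictor is a made-up stand-in, not part of this article's example): R² never decreases when a predictor is added, even a useless one, while adjusted R² can.

```r
set.seed(1)
mtcars$noise <- rnorm(nrow(mtcars))   # hypothetical predictor with no real relationship to mpg

fit1 <- lm(mpg ~ wt + hp,         data = mtcars)
fit2 <- lm(mpg ~ wt + hp + noise, data = mtcars)

summary(fit1)$r.squared;     summary(fit2)$r.squared      # R-squared creeps upward anyway
summary(fit1)$adj.r.squared; summary(fit2)$adj.r.squared  # adjusted R-squared penalizes the extra term
```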

When interpreting the results of multiple regression, beta coefficients are valid while holding all other variables constant ("all else equal"). The output from a multiple regression can be displayed horizontally as an equation, or vertically in table form.

Example of How to Use Multiple Linear Regression (MLR)

As an example, an analyst may want to know how the movement of the market affects the price of ExxonMobil (XOM). In this case, the linear equation will have the value of the S&P 500 index as the independent variable, or predictor, and the price of XOM as the dependent variable.

In reality, multiple factors predict the outcome of an event. The price movement of ExxonMobil, for example, depends on more than just the performance of the overall market. Other predictors such as the price of oil, interest rates, and the price movement of oil futures can affect the price of ExxonMobil ( XOM ) and the stock prices of other oil companies. To understand a relationship in which more than two variables are present, MLR is used.

MLR is used to determine a mathematical relationship among several random variables. In other terms, MLR examines how multiple independent variables are related to one dependent variable. Once each of the independent factors has been determined to predict the dependent variable, the information on the multiple variables can be used to create an accurate prediction on the level of effect they have on the outcome variable. The model creates a relationship in the form of a straight line (linear) that best approximates all the individual data points.

Referring to the MLR equation above, in our example:

  • y i = dependent variable—the price of XOM
  • x i1 = interest rates
  • x i2 = oil price
  • x i3 = value of S&P 500 index
  • x i4 = price of oil futures
  • β 0 = y-intercept (constant term)
  • β 1 = regression coefficient that measures a unit change in the dependent variable when x i1 changes—the change in XOM price when interest rates change
  • β 2 = coefficient value that measures a unit change in the dependent variable when x i2 changes—the change in XOM price when oil prices change

The least-squares estimates β 0 , β 1 , β 2 , …, β p are usually computed by statistical software. Many variables can be included in the regression model, with each independent variable distinguished by a number: 1, 2, 3, ..., p.

Multiple regression can also be non-linear, in which case the dependent and independent variables would not follow a straight line.

The multiple regression model allows an analyst to predict an outcome based on information provided on multiple explanatory variables. Still, the model is not always perfectly accurate as each data point can differ slightly from the outcome predicted by the model. The residual value, E, which is the difference between the actual outcome and the predicted outcome, is included in the model to account for such slight variations.

We ran our XOM price regression model through statistical software, which returned the following output:

[Regression output table omitted; see the interpretation below.]

An analyst would interpret this output to mean if other variables are held constant, the price of XOM will increase by 7.8% if the price of oil in the markets increases by 1%. The model also shows that the price of XOM will decrease by 1.5% following a 1% rise in interest rates. R 2 indicates that 86.5% of the variations in the stock price of Exxon Mobil can be explained by changes in the interest rate, oil price, oil futures, and S&P 500 index.
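As a hedged illustration of how a model like this could be fit in R: the data below are simulated stand-ins, and the variable names (xom, rates, oil, sp500, oil_futures) and coefficients are hypothetical rather than taken from Investopedia's actual dataset or output.

```r
set.seed(42)
n <- 250
xom_data <- data.frame(
  rates       = rnorm(n, mean = 2,    sd = 0.5),   # hypothetical interest-rate series
  oil         = rnorm(n, mean = 70,   sd = 10),    # hypothetical oil price
  sp500       = rnorm(n, mean = 4000, sd = 200),   # hypothetical index level
  oil_futures = rnorm(n, mean = 72,   sd = 10)     # hypothetical futures price
)
# Simulated response so the example runs end to end; the coefficients are made up.
xom_data$xom <- with(xom_data,
  50 - 1.5 * rates + 0.5 * oil + 0.005 * sp500 + 0.2 * oil_futures + rnorm(n, sd = 2))

fit <- lm(xom ~ rates + oil + sp500 + oil_futures, data = xom_data)
summary(fit)   # one beta per explanatory variable, plus the overall R-squared
```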

The Difference Between Linear and Multiple Regression

Ordinary least squares (OLS) regression compares the response of a dependent variable given a change in some explanatory variables. However, a dependent variable is rarely explained by only one variable. In this case, an analyst uses multiple regression, which attempts to explain a dependent variable using more than one independent variable.

Multiple regressions can be linear and nonlinear. MLRs are based on the assumption that there is a linear relationship between both the dependent and independent variables. It also assumes no major correlation between the independent variables.

What Makes a Multiple Regression Multiple?

A multiple regression considers the effect of more than one explanatory variable on some outcome of interest. It evaluates the relative effect of these explanatory, or independent, variables on the dependent variable when holding all the other variables in the model constant.

Why Would One Use a Multiple Regression Over a Simple OLS Regression?

A dependent variable is rarely explained by only one variable. In such cases, an analyst uses multiple regression, which attempts to explain a dependent variable using more than one independent variable. The model, however, assumes that there are no major correlations between the independent variables.

Can I Do a Multiple Regression by Hand?

It's unlikely, as multiple regression models are complex and become even more so when more variables are included in the model or when the amount of data to analyze grows. To run a multiple regression you will likely need to use specialized statistical software or functions within programs like Excel.

What Does It Mean for a Multiple Regression to Be Linear?

In multiple linear regression, the model calculates the line of best fit that minimizes the variances of each of the variables included as it relates to the dependent variable. Because it fits a line, it is a linear model. There are also non-linear regression models involving multiple variables, such as logistic regression, quadratic regression, and probit models.

How Are Multiple Regression Models Used in Finance?

Any econometric model that looks at more than one variable may be a multiple regression. Factor models compare two or more factors to analyze relationships between variables and the resulting performance. The Fama and French Three-Factor Model is such a model; it expands on the capital asset pricing model (CAPM) by adding size risk and value risk factors to the market risk factor in CAPM (which is itself a regression model). By including these two additional factors, the model adjusts for the outperforming tendency of small-cap and value stocks, which is thought to make it a better tool for evaluating manager performance.

The Bottom Line

MLR is a statistical tool used to predict the outcome of a variable based on two or more explanatory variables. If just one variable affects the dependent variable, a simple linear regression model is sufficient. If, on the other hand, more than one thing affects that variable, MLR is needed.

A classic example would be the drivers of a company’s valuation on the stock market. Usually, a company’s share price is influenced by a variety of factors. In this case, the dependent variable would be the share price, which is the thing we are trying to predict, while the independent, explanatory variables would be the factors that affect it.

Yale University. " Multiple Linear Regression ."

CFA Institute. " Basics of Multiple Regression and Underlying Assumptions ."

Boston University Medical Campus-School of Public Health. " Multiple Linear Regression ."




Research Article

Price prediction of polyester yarn based on multiple linear regression model


China’s polyester textile industry is one of the notable contributors to the national economy. This paper takes polyester yarn, the core raw material in the polyester textile industry chain, as its research object and explores its price indicators and risk hedging mechanisms through multiple linear regression models and Holt-Winters approaches. With the continuous development of digital technology, the digital transformation of production lines and warehouses has become an important feature across industries. This study follows that trend and innovatively incorporates upstream and downstream production line start-up rates into the price prediction model. Through this initiative, we can more comprehensively consider the impact of supply and demand changes on the price of polyester yarn, making the prediction results reflect the actual market situation more closely. This quantitative analysis method provides new ideas for enterprises seeking to grasp market dynamics in the digital era.

Citation: Qiu W, Mao Q, Liu C (2024) Price prediction of polyester yarn based on multiple linear regression model. PLoS ONE 19(9): e0310355. https://doi.org/10.1371/journal.pone.0310355

Editor: Pawel Klosowski, Gdańsk University of Technology: Politechnika Gdanska, POLAND

Received: January 23, 2024; Accepted: August 30, 2024; Published: September 12, 2024

Copyright: © 2024 Qiu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting Information files.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

Warp knitting is an important weaving process in which warp yarns are knitted into fabric. Its upstream industry is polyester chemical fiber, and its downstream industries include clothing, home textiles, etc. Before the 1970s, the warp knitting industry was mainly located in Europe and the United States. Chen pointed out that the great development of China’s warp knitting industry began in the 1970s, benefiting from the development of China’s chemical fiber industry [ 1 ]. Presently, China is the largest base for the warp knitting industry in the world, and its market share is still increasing. At the same time, regional integration is pronounced. Ge et al. noted that more than 90% of the enterprises in the Zhejiang Haining Warp Knitting Industrial Park were engaged in the warp knitting industry, with output value accounting for more than 90% of the total output value of the whole district [ 2 ].

With the increasing number of market participants, the role of market mechanisms has become increasingly prominent, and we have observed growing sensitivity among participants in the industry chain towards price movements. Bruce et al. pointed out that the supply chain in the textile industry is complex: it is relatively long, with a number of parties involved, so careful management of the supply chain is required in order to reduce lead times and achieve quick response [ 3 ]. Changes in supply and demand directly or indirectly affect price trends, making price changes more complicated. Dai analyzed the main factors affecting the operational performance of the polyester industry chain from the perspectives of the value chain, supply chain, enterprise partnerships, and spatial agglomeration mode, and proposed that risk management is an important tool for enterprise operation [ 4 ].

For example, from June 10, 2022 to July 15, 2022, the price of PTA, the main raw material of polyester yarn, fell from 7,562 yuan/ton to 5,280 yuan/ton, a drop of nearly 30% in little more than a month. Correspondingly, the price of the mainstream specification of polyester yarn fell from 6,015 yuan/ton to 5,080 yuan/ton, nearly 1,000 yuan per ton. The plummeting prices of PTA and polyester yarn had a huge impact on the stability of the supply chain.

Facing violent fluctuations in raw material prices, the traditional manufacturing industry lacks sufficient risk management capabilities. Fischl et al. mentioned that risks related to the purchase prices of industrial consumption factors (raw materials, semi-finished/finished goods, auxiliary materials, and operating materials) exert an increasing influence on manufacturing companies' business continuity and economic sustainability [ 5 ]. During the period of sharp price declines, companies purchased raw materials when the price of polyester filament was at a high point, but prices had dropped by the time they sold their products, compressing their profits. Some companies even saw their sales prices fall below their cost prices.

The price of polyester yarn is affected by the macroeconomic environment and its supply and demand. Chen et al. pointed out that the operation demand of the textile industry supply chain came from various information supports. The quality of information such as the market demand and price prediction of final products, the yield and price prediction of raw materials affects the effective operation of the supply chain [ 6 ]. Das and Chakrabarti proposed a Multilayer Perceptron (MLP) approach, developed efficient forecasting models using it for the Wholesale Price Index (WPI) of all the twenty-five individual items of the manufacture of the textiles group of India [ 7 ]. Lorente-Leyva et al. focused on the demand forecasting for textile products by comparing a set of classic methods such as ARIMA, STL Decomposition, Holt-Winters and machine learning, Artificial Neural Networks, Bayesian Networks, Random Forest, Support Vector Machine [ 8 ].

However, most price forecasting studies only consider the historical prices of related products. Due to the lack of data sources for key data such as industry start-up rates, it is difficult to quantitatively incorporate changes in supply and demand into analytical models. Yıldız and Møller stated that the complexity of manufacturing systems, on-going production and existing constraints on the shop floor remained among the main challenges for the analysis, design and development of the models in product, process and factory domains [ 9 ]. With the development of the industry, more and more companies are beginning to carry out digital construction to support complex manufacturing systems and continuous production. We have observed that in the industrial clusters, some leading companies with years of in-depth understanding and knowledge of the industry have begun to actively explore and innovate, with a particular focus on digitalization and the construction of virtual factories. According to Li’s research, the implementation of enterprise digitalization and the construction of industrial internet platforms can achieve rapid interaction of industrial data, promoting the integrated development of industry chains, value chains, innovation chains, and capital chains [ 10 ]. Up to now, a large amount of production and operation data from the warp knitting textile industry chain has been connected to the cloud, providing support for studying price influencing factors.

Therefore, this paper innovatively considers the capacity utilization rates of upstream and downstream industries in the price forecasting model, quantitatively incorporating changes in supply and demand into the analytical model. Leveraging the data accumulated through industrial digitization and integrating it with the public data from China’s Commodity Exchanges, it has established a solid foundation for studying the price transmission mechanism of polyester yarn and identifying its key price indicators. This holds significant importance for comprehensively grasping market price fluctuations and stabilizing the supply chain.

2. Literature review

Recent literature provides various perspectives on dynamic analysis of commodity price distribution and its correlated factors. Zhang et al. utilized bibliometrics to trace the development of research on commodity prices, and conducted statistical and co-citation analyses. It was found that the research hotspots in this field are concentrated on four aspects: factors influencing commodity prices, the impact of price fluctuations on the macroeconomy, forecasts of commodity prices, and the financialization of commodities [ 11 ]. Li and Chavas investigated the role of futures markets and their dynamic effects on the stability of commodity prices based on a quantile vector autoregression (QVAR) model of the marginal distributions of futures and spot prices, and a copula of their joint distribution. The paper finds evidence of nonlinear price dynamics that depend on the maturity of the futures contract and documents how marginal price distributions and associated moments evolve over time [ 12 ]. Le et al. examined the dynamic effect of oil prices on other energy prices based on asymmetric cointegration and dynamic multipliers in a nonlinear ARDL framework. The paper identifies positive relationships between oil price and the prices of other energy commodities [ 13 ]. Landajo and Presno addressed the problem of testing for persistence in the effects of the shocks affecting the prices of renewable commodities based on stationarity testing conditional on the number of changes detected and the detection of change points, and finds non-linear features that often coincide with well-known political and economic episodes [ 14 ].

Pani et al. examined the price discovery function of the bullion, metal, and energy commodity futures and spot prices through the Granger causality and Johansen–Juselius cointegration tests. The findings of the study suggest the market participants for implementing hedging and arbitrage strategies [ 15 ]. Ubilava conducted a comparison of multistep commodity price forecasts using direct and iterated smooth transition autoregressive methods (STAR), and finds that the STAR models are in most instances inferior to the basic autoregressive framework for multistep commodity price forecasting [ 16 ]. Chatnani analyzed the long hedge strategy using the Multi Commodity Exchange (MCX) of India listed lead contracts to identify the advantages and disadvantages of hedging with futures contracts, and examine how hedging replaces price risk with basis risk [ 17 ]. Koziol and Treuter analyzed the impact of speculative trading in agricultural commodity markets on major economic quantities. It identifies crucial variables determining whether speculative trading is beneficial or dangerous, including the correlation between the speculators’ portfolio and the commodity prices, the risk premium of the forward, and the producer’s gains [ 18 ].

The abovementioned literature review provides pivotal information on the methodology of commodity price forecasting and the impact of related hedging and speculation activities. The polyester textile industry chain is very long, so there are many factors affecting its price: macro factors such as world macroeconomic changes, exchange rate changes, and unexpected political events, as well as micro factors such as government macro-control, industrial policy, tariff adjustments, the chemical fiber industry cycle, business operating costs, crude oil price fluctuations, market demand, and trade disputes. It is precisely because of this great number of influencing factors and the huge price volatility that many financial institutions participate in the trading of PTA and MEG futures contracts and conduct speculative operations.

Thus, it is of great significance to find the factors of significant correlation and identify the price transmission mechanism. In this way, it is achievable to grasp the market price trend and guide the entity enterprises to effectively hedge the risk of price fluctuations.

The multiple linear regression model is statistically well grounded and is widely used in management disciplines and economics. Multiple regression analysis refers to the use of regression equations to quantitatively explain the linear dependence between a dependent variable and two or more independent variables. It is used to find the mathematical expression that best represents the relationship between the independent variables and the dependent variable [ 19 – 21 ]. The analysis process of multiple regression generally includes correlation analysis, significance analysis, regression diagnostics, etc.

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where β0 is the regression constant, β1, β2, …, βk are the regression coefficients, and ε is the random error term.

The purpose of this paper is to investigate the key factors affecting the price trend of polyester yarn, and to build a multiple linear regression model to predict the future price trend.

Thus, this paper selects the daily average price of one mainstream specification of polyester yarn, 50D/24F FDY (Fully Drawn Yarn), as the dependent variable. The data are drawn from data services purchased from www.ccf.com.cn and cover January 29, 2018 to March 4, 2022.

The factors affecting the price of polyester yarn are complex. In order to reduce the prediction bias that may be caused by omission of independent variables, combined with the existing research literature, this paper collects industry data from multiple sources as the independent variables of the prediction model.

The data on daily main contract settlement price of PTA is drawn from Zhengzhou Commodity Exchange. Considering that MEG futures was not listed by Dalian Commodity Exchange before December 10, 2018, the data on daily main contract settlement price of MEG is from two sources, including Dalian Commodity Exchange and Huaxicun Commodity Contracts Exchange. Data on monthly average production load of polyester factory and weekly average operating rate of looms in Jiangsu and Zhejiang provinces are from data services purchased from www.ccf.com.cn . Daily settlement price of Brent crude oil is generated from Sina. The dataset used for the analysis is presented in Table Raw Data in S1 File .

As the direct raw materials for producing polyester yarn, the prices of PTA and MEG reflect the cost of producing polyester yarn. Monthly average production load of polyester factory represents the production capacity of polyester yarn. Weekly average operating rate of looms in Jiangsu and Zhejiang provinces represents the demand market of the downstream industry. Meanwhile, since polyester yarn is a petroleum product, the fluctuation of Brent crude oil price is transmitted through the polyester textile industry chain. It affects the price trend of polyester yarn from multiple dimensions such as raw material cost and market sentiment.

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ε

Y represents the daily average price of 50D/24F FDY. X 1 is daily main contract settlement price of PTA. X 2 means daily main contract settlement price of MEG. X 3 represents monthly average production load of polyester factory. X 4 is weekly average operating rate of looms in Jiangsu and Zhejiang provinces. X 5 stands for daily settlement price of Brent crude oil.
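A minimal sketch of how this specification could be fit in R follows. The data frame `yarn` and its values are simulated placeholders, not the authors' purchased dataset; only the variable roles mirror the description above, and the coefficients used to generate Y are made up.

```r
set.seed(2018)
n <- 1000
yarn <- data.frame(X1 = rnorm(n),   # PTA settlement price (standardized placeholder)
                   X2 = rnorm(n),   # MEG settlement price
                   X3 = rnorm(n),   # polyester factory production load
                   X4 = rnorm(n),   # loom operating rate
                   X5 = rnorm(n))   # Brent crude oil price
# Hypothetical coefficients so the example runs end to end.
yarn$Y <- with(yarn, 0.5*X1 + 0.3*X2 - 0.1*X3 + 0.1*X4 + 0.2*X5 + rnorm(n, sd = 0.3))

fit_full <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data = yarn)
summary(fit_full)   # coefficient t-tests and the overall F-test
```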

4. Analysis

Fig 1 describes the fluctuation of the dependent variable and each independent variable from January 29, 2018 to March 4, 2022. FDY in the figures represents the daily average price of 50D/24F FDY. TA means the daily main contract settlement price of PTA. EG stands for the daily main contract settlement price of MEG. PLOAD represents the monthly average production load of polyester factories. RATIO is the weekly average operating rate of looms in Jiangsu and Zhejiang provinces. BRENT means the daily settlement price of Brent crude oil.

[Fig 1: https://doi.org/10.1371/journal.pone.0310355.g001]

Since polyester textile production in China is mainly concentrated in Jiangsu and Zhejiang provinces, RATIO selects the loom operating rates in these two provinces. Additionally, as most polyester textile enterprises suspend operations during the Chinese New Year holiday, some of the time-point values in the RATIO data are close to zero.

4.1 Intuitive analysis

Fig 2 demonstrates that the price of polyester yarn has a relatively significant correlation with the prices of PTA, MEG and Brent crude oil. The production load of polyester factory and the operating rate of looms, which represent the upstream and downstream supply and demand, have some degree of impact on the fluctuation of polyester yarn prices.

[Fig 2: https://doi.org/10.1371/journal.pone.0310355.g002]


Fig 3 demonstrates that the standardized data have the same trend as the initial data. Thus, it is reliable to use standardized data in the prediction model.

[Fig 3: https://doi.org/10.1371/journal.pone.0310355.g003]

4.2 Linear relationship test

Fig 4 shows that there are significant linear relationships between the polyester yarn price and the PTA price, MEG price, and crude oil price. There is some degree of linear relationship between the polyester yarn price and the production load of polyester factories, or the operating rate of looms. This requires further testing.

[Fig 4: https://doi.org/10.1371/journal.pone.0310355.g004]

4.3 Stationary test

This paper uses the Phillips-Perron unit root test to test whether the dependent variable and independent variables are stationary.

Null Hypothesis: The time series data has a unit root and is non-stationary.

Alternative Hypothesis: The time series data is stationary and does not have a unit root.

Table 1 indicates that most of the dependent and independent variables are non-stationary. Thus, it is necessary to test whether the variables are cointegrated and whether the regression residuals are stationary.

[Table 1: https://doi.org/10.1371/journal.pone.0310355.t001]
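As a sketch of how such a unit-root check could be run in R (on a simulated random-walk series rather than the paper's data; base R's PP.test() is used here, with tseries::pp.test() as a common alternative):

```r
set.seed(1)
x <- cumsum(rnorm(500))   # simulated random walk: non-stationary by construction

PP.test(x)        # Phillips-Perron test; a large p-value means we cannot reject the unit root
PP.test(diff(x))  # the first difference is stationary; a small p-value rejects the unit root
```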

4.4 Cointegration test

This paper uses the Johansen procedure to test whether the dependent variable and independent variables are cointegrated.

Null Hypothesis: There are no cointegrating vectors.

Alternative Hypothesis: There exists at least one cointegration relationship in the system.

Table 2 shows that it is valid to reject the null hypothesis at the 1% significance level, since 120.34 is greater than 104.20. Thus, the dependent variable and independent variables are cointegrated.

[Table 2: https://doi.org/10.1371/journal.pone.0310355.t002]
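A sketch of the Johansen procedure in R, assuming the `urca` package is installed; the matrix here is built from two simulated cointegrated random walks rather than the paper's price series.

```r
library(urca)
set.seed(1)
n  <- 500
w  <- cumsum(rnorm(n))               # common stochastic trend
y1 <- w + rnorm(n, sd = 0.5)         # two series sharing that trend,
y2 <- 0.8 * w + rnorm(n, sd = 0.5)   # hence cointegrated by construction

jo <- ca.jo(cbind(y1, y2), type = "trace", ecdet = "const", K = 2)
summary(jo)   # compare the trace statistic with the critical values, as in Table 2
```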

4.5 Regression analysis

The empirical model is based on the previous analysis. Let sFDY represent the standardized daily average price of 50D/24F FDY, sTA the standardized daily main contract settlement price of PTA, sEG the standardized daily main contract settlement price of MEG, sPLOAD the standardized monthly average production load of polyester factories, sRATIO the standardized weekly average operating rate of looms in Jiangsu and Zhejiang provinces, and sBRENT the standardized daily settlement price of Brent crude oil.

Full model: sFDY = β0 + β1·sTA + β2·sEG + β3·sPLOAD + β4·sRATIO + β5·sBRENT + ε

Since the linear relationship between the polyester yarn price and the production load of polyester factories, or the operating rate of looms, is not very significant, this paper sets up another model leaving out these two independent variables and compares the results from the two models.

Reduced model: sFDY = β0 + β1·sTA + β2·sEG + β3·sBRENT + ε

Since the p-values are less than 0.01, Table 3 indicates that, in addition to the prices of PTA, MEG and Brent crude oil, the production load of polyester factories and the operating rate of looms also have a significant impact on the price of polyester yarn at the 1% significance level.

Null Hypothesis: The regression coefficient is equal to zero and is not statistically significant.

Alternative Hypothesis: The regression coefficient is not equal to zero and is statistically significant.

[Table 3: https://doi.org/10.1371/journal.pone.0310355.t003]

In addition, the AIC (Akaike Information Criterion) comparison ( Table 4 ) also shows that it is necessary to include these two independent variables in the model. Thus, this paper uses Model (4) as the regression function.

Null Hypothesis: All candidate models possess equal explanatory power and predictive performance.

Alternative Hypothesis: Among the models being compared, at least one model outperforms the others in terms of explaining the data or predicting future observations.

[Table 4: https://doi.org/10.1371/journal.pone.0310355.t004]
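Continuing the simulated `yarn` sketch introduced before Section 4, the full and reduced specifications could be compared in R with an AIC comparison and a partial F-test; again, simulated data only, not the authors' results.

```r
# Assumes the simulated `yarn` data frame from the earlier sketch is in the workspace.
fit_full    <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data = yarn)  # with the start-up-rate variables
fit_reduced <- lm(Y ~ X1 + X2 + X5,           data = yarn)  # without X3 and X4

AIC(fit_reduced, fit_full)    # the lower AIC is preferred
anova(fit_reduced, fit_full)  # general linear F-test for H0: beta3 = beta4 = 0
```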

4.6 Stationary residual test

This paper uses the Phillips-Perron unit root test to test the stationarity of the residuals. The test result is:

Dickey-Fuller = -4.9743, Truncation lag parameter = 7, p-value = 0.01.

Since the p-value is less than 0.05, it is reliable to reject the null hypothesis at the 95% confidence level. Thus, the residuals are stationary.

Because the dependent variable and independent variables are cointegrated and the residuals are stationary, the result from regression model (4) is reliable.

4.7 Multicollinearity test

This paper uses the VIF (variance inflation factor) test to check for multicollinearity in the regression model.

Table 5 indicates that there is no serious multicollinearity in the regression model, since all VIF values are less than 10.

Null Hypothesis: There is no multicollinearity among the independent variables.

Alternative Hypothesis: There is multicollinearity among the independent variables.

[Table 5: https://doi.org/10.1371/journal.pone.0310355.t005]
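Variance inflation factors can be computed in R with the `car` package; a sketch, again on the simulated `yarn` data rather than the paper's series:

```r
library(car)
# Assumes the simulated `yarn` data frame from the earlier sketch is in the workspace.
fit_full <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data = yarn)
vif(fit_full)   # values well below 10 suggest multicollinearity is not a serious concern
```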

4.8 Model fitness test

In Fig 5 , the red line represents the actual historical values, while the blue line represents the fitted values obtained using the regression model in this study. The figure visually demonstrates that the overall trend of the blue fitted values is consistent with the red actual values, with similar time points for both upward and downward movements, and a relatively small numerical difference. Therefore, through the fitting test of historical actual values, it can be concluded that the regression model used in this study fits well.

[Fig 5: https://doi.org/10.1371/journal.pone.0310355.g005]

4.9 Forecast

This paper uses the Holt-Winters model to predict the value of each independent variable in the next 30 days.

Fig 6 indicates that the fit of the Cumulative Triple Exponential Smoothing with Additive Model (as shown in Fig 6B, 6D, 6F, 6H, and 6J ) is better than that of the Cumulative Triple Exponential Smoothing with Multiplicative Model (as shown in Fig 6A, 6C, 6E, 6G, and 6I ). Therefore, the Cumulative Triple Exponential Smoothing with Additive Model is selected.

[Fig 6: https://doi.org/10.1371/journal.pone.0310355.g006]
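Base R's HoltWinters() can fit both variants. Here is a sketch on the built-in monthly co2 series as a stand-in for the standardized predictor series: additive and multiplicative seasonality are compared by in-sample SSE, followed by a 30-step-ahead forecast.

```r
fit_add  <- HoltWinters(co2, seasonal = "additive")
fit_mult <- HoltWinters(co2, seasonal = "multiplicative")

c(additive = fit_add$SSE, multiplicative = fit_mult$SSE)  # smaller SSE indicates the better in-sample fit
predict(fit_add, n.ahead = 30)                            # forecast the next 30 periods
```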

Fig 7 displays the prediction results of the values of independent variables for the next 30 days using the Holt-Winters model, specifically the Cumulative Triple Exponential Smoothing with Additive Model. Fig 7(A)–7(E) respectively represent the predicted values of standardized daily main contract settlement price of PTA, standardized daily main contract settlement price of MEG, standardized monthly average production load of polyester factory, standardized weekly average operating rate of looms in Jiangsu and Zhejiang provinces, and standardized daily settlement price of Brent crude oil for the next 30 days.

[Fig 7: https://doi.org/10.1371/journal.pone.0310355.g007]

In this paper, model (4) is used to predict the standardized daily average price of 50D/24F FDY in the next 30 days, with the predicted values of the independent variables for the future 30 days set as prediction results obtained from the Holt-Winters model as shown in Fig 7 .

Fig 8 and Table 6 describe the prediction results.

[Fig 8: https://doi.org/10.1371/journal.pone.0310355.g008]

[Table 6: https://doi.org/10.1371/journal.pone.0310355.t006]

Converting the standardized values back into absolute values of the daily average price of 50D/24F FDY, Fig 9 shows the forecast of polyester yarn prices for the next 30 days.

[Fig 9: https://doi.org/10.1371/journal.pone.0310355.g009]

Since the model is used to predict price fluctuations over a period of time after a certain date, unexpected events during that period can easily lead to consistent errors in absolute values, while the impact on the trend is minor. Therefore, the focus of the model is on capturing the general direction of price movements rather than the precise numerical values. Table 7 presents the predicted and actual values after standardization, while Fig 10 compares the fluctuation trends of the predicted and actual values. As shown in Fig 10 , the predicted price shows a trend of first rising, then stabilizing for about three working days, and facing a decline afterwards. After that, an upward trend is expected. It is evident that the overall trend of price fluctuations is consistent between actual and predicted value.

[Fig 10: https://doi.org/10.1371/journal.pone.0310355.g010]

[Table 7: https://doi.org/10.1371/journal.pone.0310355.t007]

Therefore, textile enterprises can view the short-term rise in raw material prices more rationally, wait for prices to fall, and optimize the timing of raw material procurement. For traders holding polyester yarn inventory, the price rising period might be a good opportunity to sell. It is advisable for traders to consider appropriate promotions to reduce inventory, and then restock when prices fall.

5. Conclusion

In conclusion, the price of polyester yarn is significantly related to PTA price, MEG price, production load of polyester factory, operating rate of looms, and Brent crude oil price.

This conclusion is basically consistent with the theoretical analysis results. As the raw materials of polyester yarn, the increase of PTA price and MEG price will push up the price of polyester yarn. Production load of polyester factory represents the production capacity of polyester yarn. Under the condition that demand remains unchanged, higher production capacity will lead to a decrease in the price of polyester yarn. Operating rate of looms represents the demand market. Under the condition of constant supply, higher demand will lead to an increase in the price of polyester yarn.

Mastering this model is helpful for relevant enterprises to avoid price risk and reduce production costs. However, in the midst of market volatility, quantitative model analysis may intensify panic, which can easily trigger speculation.

In addition, when employing quantitative models, special emphasis should be placed on data ethics principles. The rights of data producers regarding the storage, deletion, use, and dissemination of data should be fully respected. In this paper, manufacturing enterprises, as producers of data, are the primary community that the model should serve.

Supporting information

https://doi.org/10.1371/journal.pone.0310355.s001

  • 4. Dai H. X. Research on the Performance Evaluation System of China’s Polyester Industry Chain Operation. China, Harbin: Harbin University of Science and Technology, Master Thesis, 2021, 67p.
  • 8. Lorente-Leyva L.L., Alemany M.M., Peluffo-Ordóñez D.H., Herrera-Granda I.D. A Comparison of Machine Learning and Classical Demand Forecasting Methods: A Case Study of Ecuadorian Textile Industry. International Conference on Machine Learning, Optimization, and Data Science, 19–23 July 2020, Siena, Italy. pp. 131–142.
5.7 - MLR Parameter Tests

Earlier in this lesson, we translated three different research questions pertaining to the heart attacks in rabbits study ( coolhearts.txt ) into three sets of hypotheses we can test using the general linear F -statistic. The research questions and their corresponding hypotheses are:

1. Is the regression model containing at least one predictor useful in predicting the size of the infarct?

  • H 0 : β 1 = β 2 = β 3 = 0
  • H A : At least one β j ≠ 0 (for j = 1, 2, 3)

2. Is the size of the infarct significantly (linearly) related to the area of the region at risk?

  • H 0 : β 1 = 0
  • H A : β 1 ≠ 0

3. (Primary research question) Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?

  • H 0 : β 2 = β 3 = 0
  • H A : At least one β j ≠ 0 (for j = 2, 3)

Let's test each of the hypotheses now using the general linear F -statistic:

\[F^*=\left(\frac{SSE(R)-SSE(F)}{df_R-df_F}\right) \div \left(\frac{SSE(F)}{df_F}\right)\]

To calculate the F -statistic for each test, we first determine the error sum of squares for the reduced and full models — SSE ( R ) and SSE ( F ), respectively. The number of error degrees of freedom associated with the reduced and full models — df R and df F , respectively — is the number of observations, n , minus the number of parameters, k +1 , in the model. That is, in general, the number of error degrees of freedom is n – ( k +1) . We use statistical software to determine the P -value for each test.

Testing all slope parameters equal 0

To answer the research question: "Is the regression model containing at least one predictor useful in predicting the size of the infarct?," we test the hypotheses:

The full model. The full model is the largest possible model — that is, the model containing all of the possible predictors. In this case, the full model is:

\[y_i=(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\]

The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE , that appears in the analysis of variance table. Because there are k +1 = 3+1 = 4 parameters in the full model, the number of error degrees of freedom associated with the full model is df F = n – 4.

The reduced model. The reduced model is the model that the null hypothesis describes. Because the null hypothesis sets each of the slope parameters in the full model equal to 0, the reduced model is:

\[y_i=\beta_0+\epsilon_i\]

The reduced model basically suggests that none of the variation in the response y is explained by any of the predictors. Therefore, the error sum of squares for the reduced model, SSE ( R ), is just the total sum of squares, SSTO , that appears in the analysis of variance table. Because there is only one parameter in the reduced model, the number of error degrees of freedom associated with the reduced model is df R = n – 1.

The test. Upon plugging in the above quantities, the general linear F -statistic:

\[F^*=\frac{SSE(R)-SSE(F)}{df_R-df_F} \div \frac{SSE(F)}{df_F}\]

becomes the usual " overall F -test ":

\[F^*=\frac{SSR}{3} \div \frac{SSE}{n-4}=\frac{MSR}{MSE}.\]

That is, to test H 0 : β 1 = β 2 = β 3 = 0, we just use the overall F -test and P -value reported in the analysis of variance table:

[Minitab output]

\[F^*=\frac{0.95927}{3} \div \frac{0.54491}{28}=\frac{0.31976}{0.01946}=16.43.\]

There is sufficient evidence ( F = 16.43, P < 0.001) to conclude that at least one of the slope parameters is not equal to 0.

In general, to test that all of the slope parameters in a multiple linear regression model are 0, we use the overall F -test reported in the analysis of variance table.
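In R, this overall F-test is printed at the bottom of summary(lm(...)). A sketch using the built-in mtcars data (the coolhearts data are not loaded here, so this is an illustration of the test, not of the rabbit study):

```r
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(fit)                             # the "F-statistic" line tests H0: all slope parameters are 0
anova(lm(mpg ~ 1, data = mtcars), fit)   # the same test, written as reduced (intercept-only) vs. full model
```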

Testing one slope parameter is 0

Now let's answer the second research question: "Is the size of the infarct significantly (linearly) related to the area of the region at risk?" To do so, we test the hypotheses:

The full model. Again, the full model is the model containing all of the possible predictors:

The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE . Alternatively, because the three predictors in the model are x 1 , x 2 , and x 3 , we can denote the error sum of squares as SSE ( x 1 , x 2 , x 3 ). Again, because there are 4 parameters in the model, the number of error degrees of freedom associated with the full model is df F = n – 4.

The reduced model. Because the null hypothesis sets the first slope parameter, β 1 , equal to 0, the reduced model is:

\[y_i=(\beta_0+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\]

Because the two predictors in the model are x 2 and x 3 , we denote the error sum of squares as SSE ( x 2 , x 3 ). Because there are 3 parameters in the model, the number of error degrees of freedom associated with the reduced model is df R = n – 3.

The test. The general linear statistic:

simplifies to:

\[F^*=\frac{SSR(x_1|x_2, x_3)}{1}\div \frac{SSE(x_1,x_2, x_3)}{n-4}=\frac{MSR(x_1|x_2, x_3)}{MSE(x_1,x_2, x_3)}\]

Getting the numbers from the following output:

[Minitab output]

we determine that value of the F -statistic is:

\[F^*=\frac{SSR(x_1|x_2, x_3)}{1}\div MSE=\frac{0.63742}{0.01946}=32.7554.\]

The P -value is the probability — if the null hypothesis were true — that we would get an F -statistic larger than 32.7554. Comparing our F -statistic to an F -distribution with 1 numerator degree of freedom and 28 denominator degrees of freedom, the probability is close to 1 that we would observe an F -statistic smaller than 32.7554:

[Minitab output]

Therefore, the probability that we would get an F -statistic larger than 32.7554 is close to 0. That is, the P -value is < 0.001. There is sufficient evidence ( F = 32.8, P < 0.001) to conclude that the size of the infarct is significantly related to the size of the area at risk.

But wait a second! Have you been wondering why we couldn't just use the slope's t -statistic to test that the slope parameter, β 1 , is 0? We can! Notice that the P -value ( P < 0.001) for the t -test ( t * = 5.72):

[Minitab output]

is the same as the P -value we obtained for the F -test. This will always be the case when we test that only one slope parameter is 0. That's because of the well-known relationship between a t -statistic and an F -statistic that has one numerator degree of freedom:

\[t_{(n-(k+1))}^{2}=F_{(1, n-(k+1))}\]

For our example, the square of the t -statistic, 5.72, equals our F -statistic (within rounding error). That is:

\[t^{*2}=5.72^2=32.72=F^*\]

So what have we learned in all of this discussion about the equivalence of the F -test when testing only one slope parameter and the t -test? In short:

  • We can use either the F -test or the t -test to test that only one slope parameter is 0. Because the t -test results can be read directly from the software output, it makes sense that it would be the test that we'll use most often.
  • But, we have to be careful with our interpretations! The equivalence of the t -test to the F -test when testing only one slope parameter has taught us something new about the t -test. The t -test is a test for the marginal significance of the x 1 predictor after the other predictors x 2 and x 3 have been taken into account. It does not test for the significance of the relationship between the response y and the predictor x 1 alone.
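The numerical equivalence is easy to verify in R; a sketch with mtcars (not the rabbit data) follows: drop1() gives the partial F-test for each single term, and squaring the matching t-statistic from summary() reproduces it.

```r
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(fit)$coefficients["hp", "t value"]^2   # square of the t-statistic for hp
drop1(fit, test = "F")                         # partial F-test for each predictor: same value, same p-value
```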

Testing a subset of slope parameters is 0

Finally, let's answer the third — and primary — research question: "Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?" To do so, we test the hypotheses:

The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE = 0.54491 from the output above. Alternatively, because the three predictors in the model are x 1 , x 2 , and x 3 , we can denote the error sum of squares as SSE ( x 1 , x 2 , x 3 ). Again, because there are 4 parameters in the model, the number of error degrees of freedom associated with the full model is df F = n – 4 = 32 – 4 = 28.

The reduced model. Because the null hypothesis sets the second and third slope parameters, β 2 and β 3 , equal to 0, the reduced model is:

\[y_i=(\beta_0+\beta_1x_{i1})+\epsilon_i\]

The ANOVA table for the reduced model is:

[Minitab output for the reduced model]

Because the only predictor in the model is x 1 , we denote the error sum of squares as SSE ( x 1 ) = 0.8793. Because there are 2 parameters in the model, the number of error degrees of freedom associated with the reduced model is df R = n – 2 = 32 – 2 = 30.

The test. The general linear statistic is:

\[F^*=\frac{SSE(R)-SSE(F)}{df_R-df_F} \div\frac{SSE(F)}{df_F}=\frac{0.8793-0.54491}{30-28} \div\frac{0.54491}{28}= \frac{0.33439}{2} \div 0.01946=8.59.\]

The P -value is the probability — if the null hypothesis were true — that we would observe an F -statistic more extreme than 8.59. The following output:

[Minitab output]

tells us that the probability of observing such an F -statistic that is smaller than 8.59 is 0.9988. Therefore, the probability of observing such an F -statistic that is larger than 8.59 is 1 – 0.9988 = 0.0012. The P -value is very small. There is sufficient evidence ( F = 8.59, P = 0.0012) to conclude that the type of cooling is significantly related to the extent of damage that occurs — after taking into account the size of the region at risk.
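The same reduced-versus-full comparison is one call to anova() in R. A hedged sketch using mtcars as a stand-in, since the coolhearts data are not loaded here:

```r
full    <- lm(mpg ~ disp + wt + hp, data = mtcars)   # full model: all three predictors
reduced <- lm(mpg ~ disp,           data = mtcars)   # reduced model under H0: coefficients of wt and hp are 0
anova(reduced, full)   # F = [(SSE(R) - SSE(F)) / (df_R - df_F)] / MSE(F), with its p-value
```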

Summary of MLR Testing

For the simple linear regression model, there is only one slope parameter about which one can perform hypothesis tests. For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are:

  • Hypothesis test for testing that all of the slope parameters are 0.
  • Hypothesis test for testing that a subset — more than one, but not all — of the slope parameters are 0.
  • Hypothesis test for testing that one slope parameter is 0.

We have learned how to perform each of the above three hypothesis tests.

The F -statistic and associated p -value in the ANOVA table are used for testing whether all of the slope parameters are 0. In most applications this p -value will be small enough to reject the null hypothesis and conclude that at least one predictor is useful in the model. For example, for the rabbit heart attacks study, the F -statistic is (0.95927/3) / (0.54491/(32–4)) = 16.43 with p -value 0.000.

To test whether a subset — more than one, but not all — of the slope parameters are 0, use the general linear F -test formula by fitting the full model to find SSE(F) and fitting the reduced model to find SSE(R). Then the numerator of the F -statistic is (SSE(R) – SSE(F)) / (df R – df F ). The denominator of the F -statistic is the mean squared error in the ANOVA table. For example, for the rabbit heart attacks study, the general linear F -statistic is [(0.8793 – 0.54491) / (30 – 28)] / (0.54491 / 28) = 8.59 with p -value 0.0012.

To test whether one slope parameter is 0, we can use an F -test as just described. Alternatively, we can use a t -test, which will have an identical p -value since in this case the square of the t -statistic is equal to the F -statistic. For example, for the rabbit heart attacks study, the F -statistic for testing the slope parameter for the Area predictor is (0.63742/1) / (0.54491/(32–4)) = 32.75 with p -value 0.000. Alternatively, the t -statistic for testing the slope parameter for the Area predictor is 0.613 / 0.107 = 5.72 with p -value 0.000, and 5.72 2 = 32.72.

Incidentally, you may be wondering why we can't just do a series of individual t-tests to test whether a subset of the slope parameters are 0. For example, for the rabbit heart attacks study, we could have done the following:

  • Fit the model of y = InfSize on x 1 = Area and x 2 and x 3 and use an individual t-test for x 3 .
  • If the test results indicate that we can drop x 3 then fit the model of y = InfSize on x 1 = Area and x 2 and use an individual t-test for x 2 .

The problem with this approach is that we're using two individual t-tests instead of one F-test, which means our chance of drawing an incorrect conclusion in our testing procedure is higher. Every time we do a hypothesis test, we can draw an incorrect conclusion by:

  • rejecting a true null hypothesis, i.e., make a type 1 error by concluding the tested predictor(s) should be retained in the model, when in truth it/they should be dropped; or
  • failing to reject a false null hypothesis, i.e., make a type 2 error by concluding the tested predictor(s) should be dropped from the model, when in truth it/they should be retained.

Thus, in general, the fewer tests we perform the better. Here, that means using one F-test in place of multiple individual t-tests wherever possible.

The problems in this section are designed to review the hypothesis tests for the slope parameters, as well as to give you some practice on models with a three-group qualitative variable (which we'll cover in more detail in Lesson 8). We consider tests for:

  • a single slope parameter (e.g., \(H_0: \beta_1 = 0\))
  • a subset of the slope parameters (e.g., \(H_0: \beta_2 = \beta_3 = 0\) against the alternative \(H_A: \beta_2 \ne 0\) or \(\beta_3 \ne 0\) or both \(\ne 0\))
  • all of the slope parameters (i.e., \(H_0: \beta_1 = \beta_2 = \beta_3 = 0\) against the alternative \(H_A:\) at least one of the \(\beta_i\) is not 0)

(Note the correct specification of the alternative hypotheses for the last two situations.)

A group of researchers were interested in studying the effects of three different growth regulators (treatments, denoted 1, 2, and 3) on the yield of sugar beets (y = yield, in pounds). They planned to plant the beets in 30 different plots and then to randomly treat 10 plots with the first growth regulator, 10 plots with the second growth regulator, and 10 plots with the third growth regulator. One problem, though, is that the amount of available nitrogen in the 30 different plots varies naturally, thereby giving a potentially unfair advantage to plots with higher levels of available nitrogen. Therefore, the researchers also measured and recorded the available nitrogen (\(x_1\) = nit, in pounds/acre) in each plot. They are interested in comparing the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen. The data set contains the data from the researchers' experiment.

Create a scatterplot of the data with y = yield on the y-axis and \(x_1\) = nit on the x-axis. In doing so, use the qualitative ("grouping") variable to denote whether each plot received the first, second, or third growth regulator. Does the plot suggest that it is reasonable to formulate a multiple regression model that would place three parallel lines through the data?

Because the qualitative variable distinguishes between the three treatment groups (1, 2, and 3), we need to create two indicator variables, \(x_2\) and \(x_3\) say, in order to fit a linear regression model to these data. The new indicator variables should be defined as follows: \(x_2 = 1\) if a plot received the first growth regulator and 0 otherwise, and \(x_3 = 1\) if a plot received the second growth regulator and 0 otherwise, so that plots receiving the third growth regulator serve as the reference group.

Use Minitab's Calc >> Make Indicator Variables command to create the new indicator variables in your worksheet.
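If you are working in R rather than Minitab, the two indicator variables can be created directly; this is a minimal sketch assuming the data are in a data frame called `sugarbeets` with the treatment labels in a column `treat` (both names are assumptions):

```r
# x2 = 1 for plots receiving treatment 1, x3 = 1 for plots receiving treatment 2;
# treatment 3 is the reference level (both indicators equal 0)
sugarbeets$x2 <- ifelse(sugarbeets$treat == 1, 1, 0)
sugarbeets$x3 <- ifelse(sugarbeets$treat == 2, 1, 0)
```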

If we formulate the multiple linear regression model

\[y_i=\beta_0+\beta_1x_{i,1}+\beta_2x_{i,2}+\beta_3x_{i,3}+\epsilon_i\]

where \(x_1\) = nit and \(x_2\) and \(x_3\) are defined as above, what is the mean response function for plots receiving treatment 3? For plots receiving treatment 1? For plots receiving treatment 2? Are the three regression lines that arise from our formulated model parallel? What does the parameter \(\beta_2\) quantify? And what does the parameter \(\beta_3\) quantify?

Fit a multiple linear regression model with y = yield as the response and \(x_1\) = nit, \(x_2\), and \(x_3\) as predictors. To test \(H_0:\beta_1=\beta_2=\beta_3=0\), we can use the "overall F" test, constructed as:

\[F=\frac{SSR(X_1,X_2,X_3)\div3}{SSE(X_1,X_2,X_3)\div(n-4)}=\frac{MSR(X_1,X_2,X_3)}{MSE(X_1,X_2,X_3)}\]


Use the F-test and associated p-value reported in the analysis of variance table to make a decision for the researchers at the α = 0.05 level.

Now fit the linear regression model with y = yield as the response and, in order, \(x_2\), \(x_3\), and \(x_1\) = nit as predictors. (In Minitab click "Model" and use the arrows to re-order the "Terms in the model." Also click "Options" and select "Sequential (Type I)" for "Sum of squares for tests.") To test \(H_0:\beta_1=0\), we know that we can use the t-test that Minitab displays as a default. What is the value of the t-statistic and its associated p-value? What does this p-value tell the scientists about nitrogen?

Alternatively, we can use the "partial F" test, constructed as:

\[F=\frac{SSR(X_1|X_2,X_3)\div1}{SSE(X_1,X_2,X_3)\div(n-4)}=\frac{MSR(X_1|X_2,X_3)}{MSE(X_1,X_2,X_3)}\]

Use the Minitab output to calculate the value of this statistic. Does the value you obtain equal \(t^2\), the square of the t-statistic, as we might expect?

Because \(t^2\) will equal the partial F-statistic whenever you test whether a single slope parameter is 0, it makes sense to just use the t-statistic and p-value that Minitab displays as a default. But note that we've just learned something new about the meaning of the t-test in the multiple regression setting: it tests the ("marginal") significance of the \(x_1\) predictor after \(x_2\) and \(x_3\) have already been taken into account.

Now fit the linear regression model with y = yield as the response and, in order, \(x_1\) = nit, \(x_2\), and \(x_3\) as predictors. To test \(H_0:\beta_2=\beta_3=0\), we can use a "partial F" test, constructed as:

\[F=\frac{SSR(X_2,X_3|X_1)\div2}{SSE(X_1,X_2,X_3)\div(n-4)}=\frac{MSR(X_2,X_3|X_1)}{MSE(X_1,X_2,X_3)}\]

Note that the sequential mean square due to regression, MSR(\(X_2, X_3 | X_1\)), is obtained by dividing the sequential sum of squares by its degrees of freedom (2, in this case, since two additional predictors \(X_2\) and \(X_3\) are considered). Use the Minitab output to calculate the value of this statistic, and use Minitab to get the associated p-value. Answer the researchers' question at the α = 0.05 level.
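For those working in R rather than Minitab, here is a minimal sketch of how the three tests in this exercise could be reproduced, continuing with the assumed `sugarbeets` data frame and the indicator variables `x2` and `x3` created above; the column names `yield` and `nit` are likewise assumptions:

```r
# Full model: yield on nitrogen and the two treatment indicators
beet_full <- lm(yield ~ nit + x2 + x3, data = sugarbeets)

# (1) Overall F-test of H0: beta1 = beta2 = beta3 = 0
#     (last line of the summary output)
summary(beet_full)

# (2) Test of H0: beta1 = 0 -- either read the t-statistic for nit from
#     summary(beet_full), or use the equivalent partial F-test:
no_nit <- lm(yield ~ x2 + x3, data = sugarbeets)
anova(no_nit, beet_full)

# (3) Partial F-test of H0: beta2 = beta3 = 0, i.e., no treatment effect
#     after accounting for nitrogen
nit_only <- lm(yield ~ nit, data = sugarbeets)
anova(nit_only, beet_full)
```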

Answers

The plot shows a similar positive linear trend within each treatment category, which suggests that it is reasonable to formulate a multiple regression model that would place three parallel lines through the data.

Minitab creates an indicator variable for each treatment group, but we can use only two of them: the indicators for treatment groups 1 and 2 (treatment group 3 serves as the reference level).

The fitted equation from Minitab is Yield = 84.99 + 1.3088 Nit - 2.43 \(x_2\) - 2.35 \(x_3\), which means that the equations for each treatment group are:

Group 1: Yield = 84.99 + 1.3088 Nit - 2.43(1) = 82.56 + 1.3088 Nit
Group 2: Yield = 84.99 + 1.3088 Nit - 2.35(1) = 82.64 + 1.3088 Nit
Group 3: Yield = 84.99 + 1.3088 Nit

The three estimated regression lines are parallel since they have the same slope, 1.3088.

The regression parameter for \(x_2\) represents the difference between the estimated intercept for treatment 1 and the estimated intercept for the reference treatment 3.

The regression parameter for \(x_3\) represents the difference between the estimated intercept for treatment 2 and the estimated intercept for the reference treatment 3.

\(H_0 : \beta_1=\beta_2=\beta_3 = 0\) against the alternative \(H_A\) : at least one of the \(\beta_i\) is not 0.

\(F = (16039.5/3) / (1078.0/(30-4)) = 5346.5 / 41.46 = 128.95\).

Since the p-value for this F-statistic is reported as 0.000, we reject \(H_0\) in favor of \(H_A\) and conclude that at least one of the slope parameters is not zero, i.e., the regression model containing at least one predictor is useful in predicting sugar beet yield.

\(H_0:\beta_1= 0\) against the alternative \(H_A:\beta_1 \ne 0\)

t -statistic = 19.60, p -value = 0.000, so we reject H 0 in favor of H A and conclude that the slope parameter for x 1 = nit is not zero, i.e., sugar beet yield is significantly linearly related to the available nitrogen (controlling for treatment).

F-statistic \(= (15934.5/1) / (1078.0/(30-4)) = 15934.5 / 41.46 = 384.32\), which equals \(19.60^2\) up to rounding.

\(H_0:\beta_2=\beta_3= 0\) against the alternative \(H_A:\beta_2 \ne 0\) or \(\beta_3 \ne 0\) or both \(\ne 0\).

\(F = ((10.4+27.5)/2) / (1078.0/(30-4)) = 18.95 / 41.46 = 0.46\).

F distribution with 2 DF in numerator and 26 DF in denominator:

x = 0.46, P(X ≤ x) = 0.363677

p-value \(= 1-0.363677 = 0.636\), so we fail to reject \(H_0\) and conclude that we cannot rule out \(\beta_2 = \beta_3 = 0\), i.e., there is no significant difference in the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen.
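The p-value in this last step can be checked directly with R's F-distribution function; a one-line sketch using the F-statistic of 0.46 with 2 and 26 degrees of freedom from above:

```r
# Upper-tail probability of F = 0.46 with 2 and 26 degrees of freedom
pf(0.46, df1 = 2, df2 = 26, lower.tail = FALSE)   # approximately 0.636
```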


