Regression Analysis – Methods, Types and Examples

Regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’).

Regression Analysis Methodology

Here is a general methodology for performing regression analysis:

  • Define the research question: Clearly state the research question or hypothesis you want to investigate. Identify the dependent variable (also called the response variable or outcome variable) and the independent variables (also called predictor variables or explanatory variables) that you believe are related to the dependent variable.
  • Collect data: Gather the data for the dependent variable and independent variables. Ensure that the data is relevant, accurate, and representative of the population or phenomenon you are studying.
  • Explore the data: Perform exploratory data analysis to understand the characteristics of the data, identify any missing values or outliers, and assess the relationships between variables through scatter plots, histograms, or summary statistics.
  • Choose the regression model: Select an appropriate regression model based on the nature of the variables and the research question. Common regression models include linear regression, multiple regression, logistic regression, polynomial regression, and time series regression, among others.
  • Assess assumptions: Check the assumptions of the regression model. Some common assumptions include linearity (the relationship between variables is linear), independence of errors, homoscedasticity (constant variance of errors), and normality of errors. Violation of these assumptions may require additional steps or alternative models.
  • Estimate the model: Use a suitable method to estimate the parameters of the regression model. The most common method is ordinary least squares (OLS), which minimizes the sum of squared differences between the observed and predicted values of the dependent variable.
  • Interpret the results: Analyze the estimated coefficients, p-values, confidence intervals, and goodness-of-fit measures (e.g., R-squared) to interpret the results. Determine the significance and direction of the relationships between the independent variables and the dependent variable.
  • Evaluate model performance: Assess the overall performance of the regression model using appropriate measures, such as R-squared, adjusted R-squared, and root mean squared error (RMSE). These measures indicate how well the model fits the data and how much of the variation in the dependent variable is explained by the independent variables.
  • Test assumptions and diagnose problems: Check the residuals (the differences between observed and predicted values) for any patterns or deviations from assumptions. Conduct diagnostic tests, such as examining residual plots, testing for multicollinearity among independent variables, and assessing heteroscedasticity or autocorrelation, if applicable.
  • Make predictions and draw conclusions: Once you have a satisfactory model, use it to make predictions on new or unseen data. Draw conclusions based on the results of the analysis, considering the limitations and potential implications of the findings.
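As a concrete illustration of this workflow, here is a minimal sketch in Python using the statsmodels library. The variables, data, and coefficients are synthetic, invented purely for demonstration.

```python
# Minimal sketch of the regression workflow using Python and statsmodels.
# The dataset here is synthetic; in practice, substitute your own variables.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
advertising = rng.uniform(0, 100, n)          # independent variable X1
price = rng.uniform(5, 15, n)                 # independent variable X2
sales = 50 + 2.0 * advertising - 3.0 * price + rng.normal(0, 10, n)  # dependent Y

X = sm.add_constant(np.column_stack([advertising, price]))  # adds the intercept term
model = sm.OLS(sales, X).fit()                # estimate parameters by ordinary least squares

print(model.summary())                        # coefficients, p-values, CIs, R-squared
print(model.rsquared_adj)                     # adjusted R-squared
residuals = model.resid                       # inspect residuals to check assumptions
```

The summary table printed here contains most of the quantities discussed above: estimated coefficients, standard errors, p-values, confidence intervals, and goodness-of-fit measures.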

Types of Regression Analysis

Types of Regression Analysis are as follows:

Linear Regression

Linear regression is the most basic and widely used form of regression analysis. It models the linear relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting line that minimizes the sum of squared differences between observed and predicted values.

Multiple Regression

Multiple regression extends linear regression by incorporating two or more independent variables to predict the dependent variable. It allows for examining the simultaneous effects of multiple predictors on the outcome variable.

Polynomial Regression

Polynomial regression models non-linear relationships between variables by adding polynomial terms (e.g., squared or cubic terms) to the regression equation. It can capture curved or nonlinear patterns in the data.
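As a quick sketch, the example below fits a quadratic (degree-2) polynomial with NumPy; the curved data are simulated purely for illustration.

```python
# Sketch: fitting a degree-2 polynomial regression with NumPy.
# The data are synthetic, generated from a curved relationship plus noise.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 1, x.size)

coeffs = np.polyfit(x, y, deg=2)      # returns [beta_2, beta_1, beta_0]
y_hat = np.polyval(coeffs, x)         # predicted values along the fitted curve
print(coeffs)
```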

Logistic Regression

Logistic regression is used when the dependent variable is binary or categorical. It models the probability of the occurrence of a certain event or outcome based on the independent variables. Logistic regression estimates the coefficients using the logistic function, which transforms the linear combination of predictors into a probability.

Ridge Regression and Lasso Regression

Ridge regression and Lasso regression are techniques used for addressing multicollinearity (high correlation between independent variables) and variable selection. Both methods introduce a penalty term to the regression equation to shrink or eliminate less important variables. Ridge regression uses L2 regularization, while Lasso regression uses L1 regularization.
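A minimal sketch of both techniques, using scikit-learn on synthetic data (the penalty strengths, alpha, are arbitrary choices for demonstration):

```python
# Sketch: ridge (L2) and lasso (L1) regression with scikit-learn.
# X and y are synthetic; alpha controls the strength of the penalty.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(0, 0.5, 100)

ridge = Ridge(alpha=1.0).fit(X, y)    # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)    # can set some coefficients exactly to zero

print(ridge.coef_)                    # all coefficients shrunk, none exactly zero
print(lasso.coef_)                    # irrelevant predictors driven to zero
```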

Time Series Regression

Time series regression analyzes the relationship between a dependent variable and independent variables when the data is collected over time. It accounts for autocorrelation and trends in the data and is used in forecasting and studying temporal relationships.

Nonlinear Regression

Nonlinear regression models are used when the relationship between the dependent variable and independent variables is not linear. These models can take various functional forms and require estimation techniques different from those used in linear regression.
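As one hedged example, the sketch below fits an exponential-decay model with SciPy's curve_fit; the functional form and the data are assumptions chosen for illustration, not the only possibility.

```python
# Sketch: nonlinear regression with SciPy, fitting an exponential-decay model.
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, b):
    return a * np.exp(-b * x)         # a chosen nonlinear functional form

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 50)
y = 4.0 * np.exp(-1.3 * x) + rng.normal(0, 0.1, x.size)

params, cov = curve_fit(decay, x, y, p0=[1.0, 1.0])  # nonlinear least squares
print(params)                         # estimates of a and b
```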

Poisson Regression

Poisson regression is employed when the dependent variable represents count data. It models the relationship between the independent variables and the expected count, assuming a Poisson distribution for the dependent variable.

Generalized Linear Models (GLM)

GLMs are a flexible class of regression models that extend the linear regression framework to handle different types of dependent variables, including binary, count, and continuous variables. GLMs incorporate various probability distributions and link functions.
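The sketch below illustrates the idea with a Poisson GLM in statsmodels, fitted to simulated count data; swapping the family argument (e.g., Binomial or Gaussian) yields other members of the GLM class.

```python
# Sketch: Poisson regression fitted through statsmodels' GLM interface.
# Counts are simulated; the log link is the canonical link for Poisson models.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 2, 200)
counts = rng.poisson(lam=np.exp(0.3 + 0.8 * x))   # count-valued dependent variable

X = sm.add_constant(x)
poisson_model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(poisson_model.summary())        # coefficients are on the log scale
```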

Regression Analysis Formulas

Regression analysis involves estimating the parameters of a regression model to describe the relationship between the dependent variable (Y) and one or more independent variables (X). Here are the basic formulas for linear regression, multiple regression, and logistic regression:

Linear Regression:

Simple Linear Regression Model: Y = β0 + β1X + ε

Multiple Linear Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

In both formulas:

  • Y represents the dependent variable (response variable).
  • X represents the independent variable(s) (predictor variable(s)).
  • β0, β1, β2, …, βn are the regression coefficients or parameters that need to be estimated.
  • ε represents the error term or residual (the difference between the observed and predicted values).

Multiple Regression:

Multiple regression extends the concept of simple linear regression by including multiple independent variables.

Multiple Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

The formulas are similar to those in linear regression, with the addition of more independent variables.

Logistic Regression:

Logistic regression is used when the dependent variable is binary or categorical. The logistic regression model applies a logistic or sigmoid function to the linear combination of the independent variables.

Logistic Regression Model: p = 1 / (1 + e^-(β0 + β1X1 + β2X2 + … + βnXn))

In the formula:

  • p represents the probability of the event occurring (e.g., the probability of success or belonging to a certain category).
  • X1, X2, …, Xn represent the independent variables.
  • e is the base of the natural logarithm.

The logistic function ensures that the predicted probabilities lie between 0 and 1, allowing for binary classification.
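To make the formula concrete, here is a small sketch that implements the logistic (sigmoid) function and fits a logistic regression with statsmodels; the binary data are simulated for demonstration.

```python
# Sketch: the logistic (sigmoid) transformation and a logistic regression fit.
import numpy as np
import statsmodels.api as sm

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real number into (0, 1)

rng = np.random.default_rng(5)
x = rng.normal(size=300)
p_true = sigmoid(-0.5 + 2.0 * x)      # true event probabilities
y = rng.binomial(1, p_true)           # observed binary outcomes

X = sm.add_constant(x)
logit_model = sm.Logit(y, X).fit()
print(logit_model.params)             # estimates of beta_0 and beta_1
```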

Regression Analysis Examples

Regression Analysis Examples are as follows:

  • Stock Market Prediction: Regression analysis can be used to predict stock prices based on various factors such as historical prices, trading volume, news sentiment, and economic indicators. Traders and investors can use this analysis to make informed decisions about buying or selling stocks.
  • Demand Forecasting: In retail and e-commerce, regression analysis can help forecast demand for products. By analyzing historical sales data along with real-time data such as website traffic, promotional activities, and market trends, businesses can adjust their inventory levels and production schedules to meet customer demand more effectively.
  • Energy Load Forecasting: Utility companies often use real-time regression analysis to forecast electricity demand. By analyzing historical energy consumption data, weather conditions, and other relevant factors, they can predict future energy loads. This information helps them optimize power generation and distribution, ensuring a stable and efficient energy supply.
  • Online Advertising Performance: Regression analysis can be used to assess the performance of online advertising campaigns. By analyzing real-time data on ad impressions, click-through rates, conversion rates, and other metrics, advertisers can adjust their targeting, messaging, and ad placement strategies to maximize their return on investment.
  • Predictive Maintenance: Regression analysis can be applied to predict equipment failures or maintenance needs. By continuously monitoring sensor data from machines or vehicles, regression models can identify patterns or anomalies that indicate potential failures. This enables proactive maintenance, reducing downtime and optimizing maintenance schedules.
  • Financial Risk Assessment: Real-time regression analysis can help financial institutions assess the risk associated with lending or investment decisions. By analyzing real-time data on factors such as borrower financials, market conditions, and macroeconomic indicators, regression models can estimate the likelihood of default or assess the risk-return tradeoff for investment portfolios.

Importance of Regression Analysis

Importance of Regression Analysis is as follows:

  • Relationship Identification: Regression analysis helps in identifying and quantifying the relationship between a dependent variable and one or more independent variables. It allows us to determine how changes in independent variables impact the dependent variable. This information is crucial for decision-making, planning, and forecasting.
  • Prediction and Forecasting: Regression analysis enables us to make predictions and forecasts based on the relationships identified. By estimating the values of the dependent variable using known values of independent variables, regression models can provide valuable insights into future outcomes. This is particularly useful in business, economics, finance, and other fields where forecasting is vital for planning and strategy development.
  • Causality Assessment: While correlation does not imply causation, regression analysis provides a framework for assessing causality by considering the direction and strength of the relationship between variables. It allows researchers to control for other factors and assess the impact of a specific independent variable on the dependent variable. This helps in determining the causal effect and identifying significant factors that influence outcomes.
  • Model Building and Variable Selection: Regression analysis aids in model building by determining the most appropriate functional form of the relationship between variables. It helps researchers select relevant independent variables and eliminate irrelevant ones, reducing complexity and improving model accuracy. This process is crucial for creating robust and interpretable models.
  • Hypothesis Testing: Regression analysis provides a statistical framework for hypothesis testing. Researchers can test the significance of individual coefficients, assess the overall model fit, and determine if the relationship between variables is statistically significant. This allows for rigorous analysis and validation of research hypotheses.
  • Policy Evaluation and Decision-Making: Regression analysis plays a vital role in policy evaluation and decision-making processes. By analyzing historical data, researchers can evaluate the effectiveness of policy interventions and identify the key factors contributing to certain outcomes. This information helps policymakers make informed decisions, allocate resources effectively, and optimize policy implementation.
  • Risk Assessment and Control: Regression analysis can be used for risk assessment and control purposes. By analyzing historical data, organizations can identify risk factors and develop models that predict the likelihood of certain outcomes, such as defaults, accidents, or failures. This enables proactive risk management, allowing organizations to take preventive measures and mitigate potential risks.

When to Use Regression Analysis

  • Prediction: Regression analysis is often employed to predict the value of the dependent variable based on the values of independent variables. For example, you might use regression to predict sales based on advertising expenditure, or to predict a student’s academic performance based on variables like study time, attendance, and previous grades.
  • Relationship analysis: Regression can help determine the strength and direction of the relationship between variables. It can be used to examine whether there is a linear association between variables, identify which independent variables have a significant impact on the dependent variable, and quantify the magnitude of those effects.
  • Causal inference: Regression analysis can be used to explore cause-and-effect relationships by controlling for other variables. For example, in a medical study, you might use regression to determine the impact of a specific treatment while accounting for other factors like age, gender, and lifestyle.
  • Forecasting: Regression models can be utilized to forecast future trends or outcomes. By fitting a regression model to historical data, you can make predictions about future values of the dependent variable based on changes in the independent variables.
  • Model evaluation: Regression analysis can be used to evaluate the performance of a model or test the significance of variables. You can assess how well the model fits the data, determine if additional variables improve the model’s predictive power, or test the statistical significance of coefficients.
  • Data exploration: Regression analysis can help uncover patterns and insights in the data. By examining the relationships between variables, you can gain a deeper understanding of the data set and identify potential patterns, outliers, or influential observations.

Applications of Regression Analysis

Here are some common applications of regression analysis:

  • Economic Forecasting: Regression analysis is frequently employed in economics to forecast variables such as GDP growth, inflation rates, or stock market performance. By analyzing historical data and identifying the underlying relationships, economists can make predictions about future economic conditions.
  • Financial Analysis: Regression analysis plays a crucial role in financial analysis, such as predicting stock prices or evaluating the impact of financial factors on company performance. It helps analysts understand how variables like interest rates, company earnings, or market indices influence financial outcomes.
  • Marketing Research: Regression analysis helps marketers understand consumer behavior and make data-driven decisions. It can be used to predict sales based on advertising expenditures, pricing strategies, or demographic variables. Regression models provide insights into which marketing efforts are most effective and help optimize marketing campaigns.
  • Health Sciences: Regression analysis is extensively used in medical research and public health studies. It helps examine the relationship between risk factors and health outcomes, such as the impact of smoking on lung cancer or the relationship between diet and heart disease. Regression analysis also helps in predicting health outcomes based on various factors like age, genetic markers, or lifestyle choices.
  • Social Sciences: Regression analysis is widely used in social sciences like sociology, psychology, and education research. Researchers can investigate the impact of variables like income, education level, or social factors on various outcomes such as crime rates, academic performance, or job satisfaction.
  • Operations Research: Regression analysis is applied in operations research to optimize processes and improve efficiency. For example, it can be used to predict demand based on historical sales data, determine the factors influencing production output, or optimize supply chain logistics.
  • Environmental Studies: Regression analysis helps in understanding and predicting environmental phenomena. It can be used to analyze the impact of factors like temperature, pollution levels, or land use patterns on phenomena such as species diversity, water quality, or climate change.
  • Sports Analytics: Regression analysis is increasingly used in sports analytics to gain insights into player performance, team strategies, and game outcomes. It helps analyze the relationship between various factors like player statistics, coaching strategies, or environmental conditions and their impact on game outcomes.

A Refresher on Regression Analysis


Understanding one of the most important types of data analysis.

You probably know by now that whenever possible you should be making data-driven decisions at work. But do you know how to parse through all the data available to you? The good news is that you probably don’t need to do the number crunching yourself (hallelujah!) but you do need to correctly understand and interpret the analysis created by your colleagues. One of the most important types of data analysis is called regression analysis.


The Complete Guide To Simple Regression Analysis


Learn what simple regression analysis means, why it’s useful for analyzing data, and how to interpret the results.

In This Article

  • What Is Simple Linear Regression Analysis?
  • Linear Regression Equation
  • How To Perform Linear Regression
  • Linear Regression Assumptions
  • How Do You Find the Regression Line?
  • How To Interpret the Results of Simple Regression

What is the relationship between parental income and educational attainment, or hours spent on social media and anxiety levels? Regression is a versatile statistical tool that can help you answer these types of questions. It’s a tool that lets you model the relationship between two or more variables.

The applications of regression are endless. You can use it as a machine learning algorithm to make predictions. You can use it to establish correlations, and in some cases, you can use it to uncover causal links in your data.

In this article, we’ll tell you everything you need to know about the most basic form of regression analysis: the simple linear regression model.

What Is Simple Linear Regression Analysis?

Simple linear regression is a statistical tool you can use to evaluate correlations between a single independent variable (X) and a single dependent variable (Y). The model fits a straight line to data collected for each variable, and using this line, you can estimate the correlation between X and Y and predict values of Y using values of X.

As a quick example, imagine you want to explore the relationship between weight (X) and height (Y). You collect data from ten randomly selected individuals, and you plot your data on a scatterplot like the one below.

[Scatterplot: weight (X) plotted against height (Y) with a fitted regression line]

In the scatterplot, each point represents data collected for one of the individuals in your sample. The blue line is your regression line. It models the relationship between weight and height using observed data. Not surprisingly, we see that the regression line is upward-sloping, indicating a positive correlation between weight and height. Taller people tend to be heavier than shorter people.

Once you have this line, you can measure how strong the correlation is between height and weight. You can estimate the height of somebody not in your sample by plugging their weight into the regression equation.

Linear Regression Equation

The equation for a simple linear regression is:

Y = β0 + β1X + ε

where:

  • X is your independent variable
  • Y is an estimate of your dependent variable
  • β0 is the constant or intercept of the regression line, which is the value of Y when X is equal to zero
  • β1 is the regression coefficient, which is the slope of the regression line and your estimate for the change in Y given a 1-unit change in X
  • ε is the error term of the regression

You may notice that the formula for a regression looks very similar to the equation of a line (y = mx + b). That’s because a linear regression is a line! It’s a line fitted to data that you can use to estimate the values of one variable using the value of a correlated variable.

How To Perform Linear Regression

You can build a simple linear regression model in 5 steps.

1. Collect data

Collect data for two variables (X and Y). Y is your dependent variable, which is the variable you want to estimate using the regression. X is your independent variable—the variable you use as an input in your regression.

2. Plot the data on a scatter plot

Plot the values of X and Y on a scatter plot with values of X plotted along the horizontal x-axis and values of Y plotted on the vertical y-axis.

3. Calculate a correlation coefficient

Calculate a correlation coefficient to determine the strength of the linear relationship between your two variables.

4. Fit a regression to the data

Find the regression line using the ordinary least-squares method. (You can do this by hand, but it’s much easier to use statistical software like Desmos, Excel, R, or Stata.)

5. Assess the regression line

Once you have the regression line, assess how well your model performs by checking to see how well the model predicts values of Y.

Linear Regression Assumptions

The key assumptions we make when using a simple linear regression model are:

Linearity

The relationship between X and Y (if it exists) is linear.

Independence

The residuals of your model are independent.

Homoscedasticity

The variance of the residuals is constant across values of the independent variable.

Normality

The residuals are normally distributed.

You should not use a simple linear regression unless it’s reasonable to make these assumptions.

How Do You Find the Regression Line?

Simple linear regression involves fitting a straight line to your dataset. We call this line the line of best fit or the regression line. The most common method for finding this line is OLS (the ordinary least squares method).

In OLS, we find the regression line by minimizing the sum of squared residuals, also called squared errors. Anytime you draw a straight line through your data, there will be a vertical distance between each point on your scatter plot and the regression line. These vertical distances are called residuals (or errors).

They represent the difference between the actual values of your dependent variable, Yi, and the predicted values, Ŷi. The regression line you find with OLS is the line that minimizes the sum of squared residuals.

[Graph: residuals shown as vertical distances between the data points and the regression line]

You can calculate the OLS regression line by hand, but it’s much easier to do so using statistical software like Excel, Desmos, R, or Stata.
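If you do want to see the arithmetic, the sketch below applies the closed-form OLS solution with NumPy to a made-up weight/height sample like the one described above; all numbers are invented.

```python
# Sketch: computing the simple OLS regression line by hand with NumPy.
# The closed-form solution minimizes the sum of squared residuals.
import numpy as np

weight = np.array([62, 70, 81, 59, 75, 88, 66, 93, 72, 68])            # X (made-up)
height = np.array([165, 172, 178, 160, 175, 182, 169, 185, 173, 170])  # Y (made-up)

b1 = (np.sum((weight - weight.mean()) * (height - height.mean()))
      / np.sum((weight - weight.mean()) ** 2))    # slope
b0 = height.mean() - b1 * weight.mean()           # intercept

residuals = height - (b0 + b1 * weight)           # vertical distances to the line
print(b0, b1, np.sum(residuals ** 2))             # intercept, slope, SSR
```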

How To Interpret the Results of Simple Regression

Depending on the software you use, the results of your regression analysis may look different. In general, however, your software will display output tables summarizing the main characteristics of your regression.

The values you should be looking for in these output tables fall under three categories:

  • Coefficients (the intercept and the regression coefficient)
  • Regression statistics
  • The p-value of the model

Intercept

This is the β0 value in your regression equation. It is the y-intercept of your regression line, and it is the estimate of Y when X is equal to zero.

Next to your intercept, you’ll see columns in the table showing additional information about the intercept. These include a standard error, p-value, T-stat, and confidence interval. You can use these values to test whether the estimate of your intercept is statistically significant.

Regression coefficient

This is the β1 of your regression equation. It’s the slope of the regression line, and it tells you how much Y should change in response to a 1-unit change in X.

Similar to the intercept, the regression coefficient will have columns to the right of it showing a standard error, p-value, T-stat, and confidence interval. Use these values to test whether your parameter estimate of β1 is statistically significant.

Regression Statistics

Correlation coefficient (or Multiple R)

This is the Pearson Correlation coefficient. It measures the strength of the correlation between X and Y.

R-squared (or the coefficient of determination)

We calculate this value by squaring the correlation coefficient. It tells you how much of the variance in your dependent variable can be explained by the independent variable. You can convert R-squared into a percentage by multiplying it by 100.

Standard error of the residuals

The standard error of the residuals is the average value of the errors in your model. It is the average vertical distance between each point on your scatter plot and the regression line. We measure this value in the same units as your dependent variable.

Degrees of freedom

In simple linear regression, the degrees of freedom equal the number of data points you used minus the two estimated parameters. The parameters are the intercept and regression coefficient.

Some software will also output a 5-number summary of your residuals. It’ll show the minimum, first quartile, median, third quartile, and maximum values of your residuals.

P-value (or Significance F)

This is the p-value of your regression model. It reports the result of a hypothesis test in which the null hypothesis is that no relationship exists between X and Y, and the alternative hypothesis is that a linear relationship exists between X and Y.

If you are using a significance level (or alpha level) of 0.05, you would reject the null hypothesis if the p-value is less than or equal to 0.05, and fail to reject it if the p-value is greater than 0.05.

What are correlations?

A correlation is a measure of the relationship between two variables.

Positive Correlations - If two variables, X and Y, have a positive linear correlation, Y tends to increase as X increases, and Y tends to decrease as X decreases. In other words, the two variables tend to move together in the same direction.

Negative Correlations - Two variables, X and Y, have a negative correlation if Y tends to increase as X decreases and Y tends to decrease as X increases. (i.e., The values of the two variables tend to move in opposite directions).

What’s the difference between the dependent and independent variables in a regression?

A simple linear regression involves two variables: X, the input or independent variable, and Y, the output or dependent variable. The dependent variable is the variable you want to estimate using the regression. Its estimated value “depends” on the parameters and other variables of the model.

The independent variable—also called the predictor variable—is an input in the model. Its value does not depend on the other elements of the model.

Is the correlation coefficient the same as the regression coefficient?

The correlation coefficient and the regression coefficient will both have the same sign (positive or negative), but they are not the same. The only case where these two values will be equal is when the values of X and Y have been standardized to the same scale.

What is a correlation coefficient?

A correlation coefficient—or Pearson’s correlation coefficient—measures the strength of the linear relationship between X and Y. It’s a number ranging between -1 and 1. The closer a correlation coefficient is to 0, the weaker the correlation between X and Y.

The closer the correlation coefficient is to 1 or -1, the stronger the correlation. Points on a scatter plot will be more dispersed around the regression line when the correlation between X and Y is weak, and the points will be more tightly clustered around the regression line when the correlation is strong.

What is the regression coefficient?

The regression coefficient, β1, is the slope of the regression line. It provides you with an estimate of how much the dependent variable, Y, will change in response to a 1-unit increase in the independent variable, X.

The regression coefficient can be any number from −∞ to ∞. A positive regression coefficient implies a positive correlation between X and Y, and a negative regression coefficient implies a negative correlation.

Can I use linear regression in Excel?

Yes. The easiest way to add a simple linear regression line in Excel is to install and use Excel’s “Analysis Toolpak” add-in. To do this, go to Tools > Excel Add-ins and select the “Analysis Toolpak.”

Next, follow these steps:

  • In your spreadsheet, enter your data for X and Y in two columns.
  • Navigate to the “Data” tab and click on the “Data Analysis” icon.
  • From the list of analysis tools, select “Regression” and click “OK”.
  • Select the data for Y and X respectively where it says “Input Y Range” and “Input X Range”.
  • If you’ve labeled your columns with the names of your X and Y variables, click on the “Labels” checkbox.
  • You can further customize where you want your regression to appear in your workbook and what additional information you would like Excel to display.
  • Once you’ve finished customizing, click “OK”.
  • Your regression results will display next to your data or in a new sheet.

Is linear regression used to establish causal relationships?

Correlation is not equivalent to causation. If two variables are correlated, you cannot immediately conclude that one causes the other to change. A linear regression will only tell you whether two variables correlate; to draw conclusions about causal relationships, you’ll need to include more variables in your model and combine the regression with causal theory.

What are some other types of regression analysis?

Simple linear regression is the most basic form of regression analysis. It involves one independent variable and one dependent variable. Once you get a handle on this model, you can move on to more sophisticated forms of regression analysis. These include multiple linear regression and nonlinear regression.

Multiple linear regression is a model that estimates the linear relationship between variables using one dependent variable and multiple predictor variables. Nonlinear regression is a method used to estimate nonlinear relationships between variables.



Regression Analysis

Regression analysis is a quantitative research method which is used when the study involves modelling and analysing several variables, where the relationship includes a dependent variable and one or more independent variables. In simple terms, regression analysis is a quantitative method used to test the nature of relationships between a dependent variable and one or more independent variables.

The basic form of regression models includes unknown parameters (β), independent variables (X), and the dependent variable (Y).

A regression model, basically, specifies the relation of the dependent variable (Y) to a function of the independent variables (X) and the unknown parameters (β):

Y ≈ f(X, β)

The regression equation can be used to predict the values of ‘y’ if the value of ‘x’ is given, where ‘y’ and ‘x’ are two sets of measures from a sample of size ‘n’. For simple linear regression, the fitted line is ŷ = a + bx, with the slope and intercept estimated as:

b = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²)

a = (Σy − bΣx) / n

Do not be intimidated by the visual complexity of the correlation and regression formulae above. You don’t have to apply the formulae manually; correlation and regression analyses can be run with popular analytical software such as Microsoft Excel, Microsoft Access, SPSS and others.

Linear regression analysis is based on the following set of assumptions:

1. Assumption of linearity. There is a linear relationship between the dependent and independent variables.

2. Assumption of homoscedasticity. The variance of the errors is constant across all values of the independent variables.

3. Assumption of absence of collinearity or multicollinearity. There is no high correlation between two or more independent variables.

4. Assumption of normal distribution. The residuals of the model are normally distributed.


The clinician’s guide to interpreting a regression analysis

Sofia Bzovsky, Mark R. Phillips, Robyn H. Guymer, Charles C. Wykoff, Lehana Thabane, Mohit Bhandari & Varun Chaudhary, on behalf of the R.E.T.I.N.A. study group

Eye, volume 36, pages 1715–1717 (2022). Published: 31 January 2022

Introduction

When researchers are conducting clinical studies to investigate factors associated with, or treatments for disease and conditions to improve patient care and clinical practice, statistical evaluation of the data is often necessary. Regression analysis is an important statistical method that is commonly used to determine the relationship between several factors and disease outcomes or to identify relevant prognostic factors for diseases [ 1 ].

This editorial will acquaint readers with the basic principles of and an approach to interpreting results from two types of regression analyses widely used in ophthalmology: linear, and logistic regression.

Linear regression analysis

Linear regression is used to quantify a linear relationship or association between a continuous response/outcome variable or dependent variable with at least one independent or explanatory variable by fitting a linear equation to observed data [ 1 ]. The variable that the equation solves for, which is the outcome or response of interest, is called the dependent variable [ 1 ]. The variable that is used to explain the value of the dependent variable is called the predictor, explanatory, or independent variable [ 1 ].

In a linear regression model, the dependent variable must be continuous (e.g. intraocular pressure or visual acuity), whereas, the independent variable may be either continuous (e.g. age), binary (e.g. sex), categorical (e.g. age-related macular degeneration stage or diabetic retinopathy severity scale score), or a combination of these [ 1 ].

When investigating the effect or association of a single independent variable on a continuous dependent variable, this type of analysis is called a simple linear regression [ 2 ]. In many circumstances though, a single independent variable may not be enough to adequately explain the dependent variable. Often it is necessary to control for confounders and in these situations, one can perform a multivariable linear regression to study the effect or association with multiple independent variables on the dependent variable [ 1 , 2 ]. When incorporating numerous independent variables, the regression model estimates the effect or contribution of each independent variable while holding the values of all other independent variables constant [ 3 ].

When interpreting the results of a linear regression, there are a few key outputs for each independent variable included in the model:

Estimated regression coefficient—The estimated regression coefficient indicates the direction and strength of the relationship or association between the independent and dependent variables [ 4 ]. Specifically, the regression coefficient describes the change in the dependent variable for each one-unit change in the independent variable, if continuous [ 4 ]. For instance, if examining the relationship between a continuous predictor variable and intra-ocular pressure (dependent variable), a regression coefficient of 2 means that for every one-unit increase in the predictor, there is a two-unit increase in intra-ocular pressure. If the independent variable is binary or categorical, then the one-unit change represents switching from one category to the reference category [ 4 ]. For instance, if examining the relationship between a binary predictor variable, such as sex, where ‘female’ is set as the reference category, and intra-ocular pressure (dependent variable), a regression coefficient of 2 means that, on average, males have an intra-ocular pressure that is 2 mm Hg higher than females.

Confidence Interval (CI)—The CI, typically set at 95%, is a measure of the precision of the coefficient estimate of the independent variable [ 4 ]. A large CI indicates a low level of precision, whereas a small CI indicates a higher precision [ 5 ].

P value—The p value for the regression coefficient indicates whether the relationship between the independent and dependent variables is statistically significant [ 6 ].

Logistic regression analysis

As with linear regression, logistic regression is used to estimate the association between one or more independent variables with a dependent variable [ 7 ]. However, the distinguishing feature in logistic regression is that the dependent variable (outcome) must be binary (or dichotomous), meaning that the variable can only take two different values or levels, such as ‘1 versus 0’ or ‘yes versus no’ [ 2 , 7 ]. The effect size of predictor variables on the dependent variable is best explained using an odds ratio (OR) [ 2 ]. ORs are used to compare the relative odds of the occurrence of the outcome of interest, given exposure to the variable of interest [ 5 ]. An OR equal to 1 means that the odds of the event in one group are the same as the odds of the event in another group; there is no difference [ 8 ]. An OR > 1 implies that one group has a higher odds of having the event compared with the reference group, whereas an OR < 1 means that one group has a lower odds of having an event compared with the reference group [ 8 ]. When interpreting the results of a logistic regression, the key outputs include the OR, CI, and p-value for each independent variable included in the model.
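As an illustration of how ORs fall out of a fitted model, here is a minimal sketch on simulated data using Python's statsmodels; exponentiating each coefficient and its confidence bounds yields the OR and its 95% CI. The variables are hypothetical and not drawn from any study.

```python
# Sketch: turning logistic-regression coefficients into odds ratios (ORs).
# Synthetic data stand in for a clinical dataset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
age = rng.uniform(40, 90, 500)
outcome = rng.binomial(1, 1 / (1 + np.exp(-(4.0 - 0.06 * age))))  # binary outcome

X = sm.add_constant(age)
fit = sm.Logit(outcome, X).fit()

odds_ratios = np.exp(fit.params)        # OR per one-unit (one-year) increase
or_ci = np.exp(fit.conf_int())          # 95% CI for each OR
print(odds_ratios, or_ci, fit.pvalues)
```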

Clinical example

Sen et al. investigated the association between several factors (independent variables) and visual acuity outcomes (dependent variable) in patients receiving anti-vascular endothelial growth factor therapy for macular oedema by means of both linear and logistic regression [ 9 ]. Multivariable linear regression demonstrated that age (estimate −0.33, 95% CI −0.48 to −0.19, p < 0.001) was significantly associated with best-corrected visual acuity (BCVA) at 100 weeks at the alpha = 0.05 significance level [ 9 ]. The regression coefficient of −0.33 means that BCVA at 100 weeks decreases by 0.33 with each additional year of age.

Multivariable logistic regression also demonstrated that age and ellipsoid zone status were statistically significantly associated with achieving a BCVA letter score >70 letters at 100 weeks at the alpha = 0.05 significance level. Patients ≥75 years of age had decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those <50 years of age, since the OR is less than 1 (OR 0.96, 95% CI 0.94 to 0.98, p = 0.001) [ 9 ]. Similarly, patients between the ages of 50–74 years also had decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those <50 years of age, since the OR is less than 1 (OR 0.15, 95% CI 0.04 to 0.48, p = 0.001) [ 9 ]. As well, those with a non-intact ellipsoid zone had decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those with an intact ellipsoid zone (OR 0.20, 95% CI 0.07 to 0.56; p = 0.002). On the other hand, patients with an ungradable/questionable ellipsoid zone had increased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those with an intact ellipsoid zone, since the OR is greater than 1 (OR 2.26, 95% CI 1.14 to 4.48; p = 0.02) [ 9 ].

The narrower the CI, the more precise the estimate is; and the smaller the p value (relative to alpha = 0.05), the greater the evidence against the null hypothesis of no effect or association.

Simply put, linear and logistic regression are useful tools for appreciating the relationship between predictor/explanatory and outcome variables for continuous and dichotomous outcomes, respectively, that can be applied in clinical practice, such as to gain an understanding of risk factors associated with a disease of interest.

References

1. Schneider A, Hommel G, Blettner M. Linear regression analysis. Dtsch Ärztebl Int. 2010;107:776–82.

2. Bender R. Introduction to the use of regression models in epidemiology. In: Verma M, editor. Cancer epidemiology. Methods in molecular biology. Humana Press; 2009. p. 179–95.

3. Schober P, Vetter TR. Confounding in observational research. Anesth Analg. 2020;130:635.

4. Schober P, Vetter TR. Linear regression in medical research. Anesth Analg. 2021;132:108–9.

5. Szumilas M. Explaining odds ratios. J Can Acad Child Adolesc Psychiatry. 2010;19:227–9.

6. Thiese MS, Ronna B, Ott U. P value interpretations and considerations. J Thorac Dis. 2016;8:E928–31.

7. Schober P, Vetter TR. Logistic regression in medical research. Anesth Analg. 2021;132:365–6.

8. Zabor EC, Reddy CA, Tendulkar RD, Patil S. Logistic regression in clinical studies. Int J Radiat Oncol Biol Phys. 2022;112:271–7.

9. Sen P, Gurudas S, Ramu J, Patrao N, Chandra S, Rasheed R, et al. Predictors of visual acuity outcomes after anti-vascular endothelial growth factor treatment for macular edema secondary to central retinal vein occlusion. Ophthalmol Retin. 2021;5:1115–24.

The Complete Guide to Regression Analysis

What is regression analysis and why is it useful? While most of us have heard the term, understanding regression analysis in detail may be something you need to brush up on. Here’s what you need to know about this popular method of analysis.

When you rely on data to drive and guide business decisions, as well as predict market trends, just gathering and analyzing what you find isn’t enough — you need to ensure it’s relevant and valuable.

The challenge, however, is that so many variables can influence business data: market conditions, economic disruption, even the weather! As such, it’s essential you know which variables are affecting your data and forecasts, and what data you can discard.

And one of the most effective ways to determine data value and monitor trends (and the relationships between them) is to use regression analysis, a set of statistical methods used for the estimation of relationships between independent and dependent variables.

In this guide, we’ll cover the fundamentals of regression analysis, from what it is and how it works to its benefits and practical applications.


What is regression analysis?

Regression analysis is a statistical method. It’s used for analyzing different factors that might influence an objective – such as the success of a product launch, business growth, a new marketing campaign – and determining which factors are important and which ones can be ignored.

Regression analysis can also help leaders understand how different variables impact each other and what the outcomes are. For example, when forecasting financial performance, regression analysis can help leaders determine how changes in the business can influence revenue or expenses in the future.

Running an analysis of this kind, you might find that there’s a high correlation between the number of marketers employed by the company, the leads generated, and the opportunities closed.

This seems to suggest that a high number of marketers and a high number of leads generated influence sales success. But do you need both factors to close those sales? By analyzing the effects of these variables on your outcome, you might learn that when leads increase but the number of marketers employed stays constant, there is no impact on the number of opportunities closed, but if the number of marketers increases, leads and closed opportunities both rise.

Regression analysis can help you tease out these complex relationships so you can determine which areas you need to focus on in order to get your desired results, and avoid wasting time with those that have little or no impact. In this example, that might mean hiring more marketers rather than trying to increase leads generated.

How does regression analysis work?

Regression analysis starts with variables that are categorized into two types: dependent and independent variables. The variables you select depend on the outcomes you’re analyzing.

Understanding variables:

1. Dependent variable

This is the main variable that you want to analyze and predict. For example, operational (O) data such as your quarterly or annual sales, or experience (X) data such as your net promoter score (NPS) or customer satisfaction score (CSAT).

These variables are also called response variables, outcome variables, or left-hand-side variables (because they appear on the left-hand side of a regression equation).

There are three easy ways to identify them:

  • Is the variable measured as an outcome of the study?
  • Does the variable depend on another in the study?
  • Do you measure the variable only after other variables are altered?

2. Independent variable

Independent variables are the factors that could affect your dependent variables. For example, a price rise in the second quarter could make an impact on your sales figures.

You can identify independent variables with the following list of questions:

  • Is the variable manipulated, controlled, or used as a subject grouping method by the researcher?
  • Does this variable come before the other variable in time?
  • Are you trying to understand whether or how this variable affects another?

Independent variables are often referred to differently in regression depending on the purpose of the analysis. You might hear them called:

Explanatory variables

Explanatory variables are those which explain an event or an outcome in your study. For example, explaining why your sales dropped or increased.

Predictor variables

Predictor variables are used to predict the value of the dependent variable. For example, predicting how much sales will increase when new product features are rolled out.

Experimental variables

These are variables that can be manipulated or changed directly by researchers to assess the impact. For example, assessing how different product pricing ($10 vs $15 vs $20) will impact the likelihood to purchase.

Subject variables (also called fixed effects)

Subject variables can’t be changed directly, but vary across the sample. For example, age, gender, or income of consumers.

Unlike experimental variables, you can’t randomly assign or change subject variables, but you can design your regression analysis to determine the different outcomes of groups of participants with the same characteristics. For example, ‘how do price rises impact sales based on income?’

Carrying out regression analysis


So regression is about the relationships between dependent and independent variables. But how exactly do you do it?

Assuming you have already collected your data, the first thing to do is plot your results on a graph. Doing this makes interpreting regression analysis results much easier, as you can clearly see the correlations between dependent and independent variables.

Let’s say you want to carry out a regression analysis to understand the relationship between the number of ads placed and revenue generated.

On the Y-axis, you place the revenue generated. On the X-axis, the number of digital ads. By plotting the information on the graph, and drawing a line (called the regression line) through the middle of the data, you can see the relationship between the number of digital ads placed and revenue generated.


This regression line is the line that provides the best description of the relationship between your independent variables and your dependent variable. In this example, we’ve used a simple linear regression model.


Statistical analysis software can calculate the regression line precisely and draw it for you. The software also provides a formula for the line’s slope and intercept, adding further context to the relationship between your dependent and independent variables.
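To make this concrete, here is a minimal sketch in R using made-up advertising data; all variable names and numbers are purely illustrative:

```r
# Illustrative data only: weekly digital ad counts and the revenue generated
set.seed(42)
ads <- sample(20:80, 30, replace = TRUE)
revenue <- 500 + 120 * ads + rnorm(30, sd = 800)

fit <- lm(revenue ~ ads)  # fit the simple linear regression
summary(fit)              # slope, intercept, and R-squared

plot(ads, revenue, xlab = "Number of digital ads", ylab = "Revenue")
abline(fit)               # draw the fitted regression line through the data
```

The slope reported by `summary(fit)` is the estimated change in revenue for each additional ad placed.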

Simple linear regression analysis

A simple linear model uses a single straight line to determine the relationship between a single independent variable and a dependent variable.

This regression model is mostly used when you want to determine the relationship between two variables (like price increases and sales) or the value of the dependent variable at certain points of the independent variable (for example the sales levels at a certain price rise).

While linear regression is useful, it does require you to make some assumptions.

For example, it requires you to assume that:

  • the data was collected using a statistically valid sampling method, so that the sample is representative of the target population
  • the observed relationship between the variables can’t be explained by a ‘hidden’ third variable – in other words, there are no spurious correlations
  • the relationship between the independent variable and dependent variable is linear – meaning that the best fit through the data points is a straight line, not a curved one

Multiple regression analysis

As the name suggests, multiple regression analysis is a type of regression that uses multiple variables. It uses multiple independent variables to predict the outcome of a single dependent variable. Of the various kinds of multiple regression, multiple linear regression is one of the best-known.

Multiple linear regression is a close relative of the simple linear regression model, except that it looks at the impact of several independent variables on one dependent variable. However, like simple linear regression, multiple regression analysis also requires you to make some basic assumptions.

For example, you will be assuming that:

  • there is a linear relationship between the dependent and independent variables (it creates a straight line and not a curve through the data points)
  • the independent variables aren’t highly correlated with one another (in other words, there is no multicollinearity)

An example of multiple linear regression would be an analysis of how marketing spend, revenue growth, and general market sentiment affect the share price of a company.

With multiple linear regression models you can estimate how these variables will influence the share price, and to what extent.
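As a hedged illustration, a model like the share-price example could be fitted as below; the data are simulated and every effect size is invented:

```r
# Simulated quarterly data; all numbers are invented for illustration
set.seed(1)
n <- 40
marketing_spend <- runif(n, 1, 10)   # $ millions
revenue_growth  <- rnorm(n, 5, 2)    # percent
sentiment       <- rnorm(n)          # market sentiment index
share_price <- 20 + 1.5 * marketing_spend + 2 * revenue_growth +
  3 * sentiment + rnorm(n, sd = 2)

multi_fit <- lm(share_price ~ marketing_spend + revenue_growth + sentiment)
summary(multi_fit)  # each coefficient: the estimated effect of that variable,
                    # holding the other predictors constant
```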

Multivariate linear regression

Multivariate linear regression involves more than one dependent variable as well as multiple independent variables, making it more complicated than linear or multiple linear regressions. However, this also makes it much more powerful and capable of making predictions about complex real-world situations.

For example, if an organization wants to establish or estimate how the COVID-19 pandemic has affected employees in its different markets, it can use multivariate linear regression, with the different geographical regions as dependent variables and the different facets of the pandemic as independent variables (such as mental health self-rating scores, proportion of employees working at home, lockdown durations and employee sick days).

Through multivariate linear regression, you can look at relationships between variables in a holistic way and quantify the relationships between them. As you can clearly visualize those relationships, you can make adjustments to dependent and independent variables to see which conditions influence them. Overall, multivariate linear regression provides a more realistic picture than looking at a single variable.

However, because multivariate techniques are complex, they involve higher-level mathematics and generally require statistical software to analyze the data.
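In R, for example, `lm()` can fit several dependent variables at once when they are combined with `cbind()`. The sketch below shows the general technique (one set of independent variables, several outcomes) on simulated data, rather than reconstructing the pandemic example:

```r
# Simulated employee data; values are invented for illustration
set.seed(2)
n <- 100
home_working <- runif(n)          # share of time working from home
lockdown_wks <- rpois(n, 8)       # weeks of lockdown experienced
mental_health <- 70 + 5 * home_working - 1.2 * lockdown_wks + rnorm(n, sd = 4)
sick_days     <- 4 - 2 * home_working + 0.3 * lockdown_wks + rnorm(n, sd = 1)

# cbind() on the left-hand side fits one equation per outcome variable
mv_fit <- lm(cbind(mental_health, sick_days) ~ home_working + lockdown_wks)
summary(mv_fit)
```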

Logistic regression

Logistic regression models the probability of a binary outcome based on independent variables.

So, what is a binary outcome? It’s when there are only two possible scenarios: either the event happens (1) or it doesn’t (0), e.g. yes/no outcomes, pass/fail outcomes, and so on. In other words, the outcome can be described as falling into one of two categories.

Logistic regression makes predictions based on independent variables that are assumed or known to have an influence on the outcome. For example, the probability of a sports team winning their game might be affected by independent variables like weather, day of the week, whether they are playing at home or away and how they fared in previous matches.
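A minimal sketch of the sports example in R, with simulated match data (the variables and effect sizes are invented):

```r
# Simulated match data; all effects are invented for illustration
set.seed(3)
n <- 200
home <- rbinom(n, 1, 0.5)        # 1 = playing at home
prev_wins <- rbinom(n, 5, 0.5)   # wins in the last five matches
p <- plogis(-1 + 0.8 * home + 0.4 * prev_wins)  # true win probability
win <- rbinom(n, 1, p)           # binary outcome: 1 = win, 0 = not

logit_fit <- glm(win ~ home + prev_wins, family = binomial)
summary(logit_fit)                          # coefficients are on the log-odds scale
head(predict(logit_fit, type = "response")) # predicted win probabilities
```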

What are some common mistakes with regression analysis?

Across the globe, businesses are increasingly relying on quality data and insights to drive decision-making — but to make accurate decisions, it’s important that the data collected and statistical methods used to analyze it are reliable and accurate.

Using the wrong data or the wrong assumptions can result in poor decision-making, lead to missed opportunities to improve efficiency and savings, and — ultimately — damage your business long term.

  • Assumptions

When running regression analysis, be it a simple linear or multiple regression, it’s really important to check that the assumptions your chosen method requires have been met. If your data points don’t conform to a straight line of best fit, for example, you need to apply additional statistical transformations to accommodate the non-linear data. For example, income data is typically right-skewed (roughly log-normally distributed), so you might take the natural log of income as your variable and then transform the model’s outputs back to the original scale afterwards.

  • Correlation vs. causation

It’s a well-worn phrase that bears repeating – correlation does not equal causation. While variables that are linked by causality will typically show correlation, the reverse is not always true. Moreover, no statistic can determine causality by itself (although the design of your study overall can).

If you observe a correlation in your results, such as in the first example we gave in this article where there was a correlation between leads and sales, you can’t assume that one thing has influenced the other. Instead, you should use it as a starting point for investigating the relationship between the variables in more depth.

  • Choosing the wrong variables to analyze

Before you use any kind of statistical method, it’s important to understand the subject you’re researching in detail. Doing so means you’re making informed choices of variables and you’re not overlooking something important that might have a significant bearing on your dependent variable.

  • Model building

The variables you include in your analysis are just as important as the variables you choose to exclude. That’s because the strength of each independent variable is influenced by the other variables in the model. Other techniques, such as Key Drivers Analysis, are able to account for these variable interdependencies.

Benefits of using regression analysis

There are several benefits to using regression analysis to judge how changing variables will affect your business and to ensure you focus on the right things when forecasting.

Here are just a few of those benefits:

Make accurate predictions

Regression analysis is commonly used when forecasting and forward planning for a business. For example, when predicting sales for the year ahead, a number of different variables will come into play to determine the eventual result.

Regression analysis can help you determine which of these variables are likely to have the biggest impact based on previous events and help you make more accurate forecasts and predictions.

Identify inefficiencies

Using a regression equation, a business can identify areas for improvement in efficiency, whether in terms of people, processes, or equipment.

For example, regression analysis can help a car manufacturer determine order numbers based on external factors like the economy or environment.

They can then use that regression equation to determine how many members of staff and how much equipment they need to meet those orders.

Drive better decisions

Improving processes or business outcomes is always on the minds of owners and business leaders, but without actionable data, they’re simply relying on instinct, and this doesn’t always work out.

This is particularly true when it comes to issues of price. For example, to what extent will raising the price (and to what level) affect next quarter’s sales?

There’s no way to know this without data analysis. Regression analysis can help provide insights into the correlation between price rises and sales based on historical data.

How do businesses use regression? A real-life example

Marketing and advertising spending are common topics for regression analysis. Companies use regression when trying to assess the value of ad spend and marketing spend on revenue.

A typical example is using a regression equation to assess the correlation between ad costs and conversions of new customers. In this instance,

  • our dependent variable (the factor we’re trying to assess the outcomes of) will be our conversions
  • the independent variable (the factor we’ll change to assess how it changes the outcome) will be the daily ad spend
  • the regression equation will try to determine whether an increase in ad spend has a direct correlation with the number of conversions we have

The analysis is relatively straightforward — using historical data from an ad account, we can use daily data to judge ad spend vs conversions and how changes to the spend alter the conversions.

By assessing this data over time, we can make predictions not only on whether increasing ad spend will lead to increased conversions but also what level of spending will lead to what increase in conversions. This can help to optimize campaign spend and ensure marketing delivers good ROI.

This is an example of a simple linear model. To carry out a more complex regression, we could also factor in other independent variables such as seasonality, GDP, and the current reach of our chosen advertising networks.

By increasing the number of independent variables, we can get a better understanding of whether ad spend is resulting in an increase in conversions, whether it’s exerting an influence in combination with another set of variables, or if we’re dealing with a correlation with no causal impact – which might be useful for predictions anyway, but isn’t a lever we can use to increase sales.

Using the estimated effect of each independent variable, we can more accurately predict how changes in spend will change the conversion rate of our advertising.
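As a sketch of how such predictions might be produced, the R snippet below fits the simple spend-versus-conversions model on made-up daily data and predicts conversions at candidate spend levels; every number is illustrative:

```r
# Made-up daily ad-account data
set.seed(4)
spend <- runif(90, 100, 1000)                   # daily ad spend ($)
conversions <- 5 + 0.04 * spend + rnorm(90, sd = 6)

ad_fit <- lm(conversions ~ spend)

# Expected conversions (with prediction intervals) at candidate spend levels
newdata <- data.frame(spend = c(250, 500, 750))
predict(ad_fit, newdata, interval = "prediction")
```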

Regression analysis tools

Regression analysis is an important tool when it comes to better decision-making and improved business outcomes. To get the best out of it, you need to invest in the right kind of statistical analysis software.

The best option is likely to be one that sits at the intersection of powerful statistical analysis and intuitive ease of use, as this will empower everyone from beginners to expert analysts to uncover meaning from data, identify hidden trends and produce predictive models without statistical training being required.

Stats iQ in action

To help prevent costly errors, choose a tool that automatically runs the right statistical tests and visualizations and then translates the results into simple language that anyone can put into action.

With software that’s both powerful and user-friendly, you can isolate key experience drivers, understand what influences the business, apply the most appropriate regression methods, identify data issues, and much more.


With Qualtrics’ Stats iQ™, you don’t have to worry about the regression equation because our statistical software will run the appropriate equation for you automatically based on the variable type you want to monitor. You can also use several equations, including linear regression and logistic regression, to gain deeper insights into business outcomes and make more accurate, data-driven decisions.


A short intro to linear regression analysis using survey data


Many of Pew Research Center’s survey analyses show relationships between two variables. For example, our reports may explore how attitudes about one thing — such as views of the economy — are associated with attitudes about another thing — such as views of the president’s job performance. Or they might look at how different demographic groups respond to the same survey question.

But analysts are sometimes interested in understanding how multiple factors might contribute simultaneously to the same outcome. One useful tool to help us make sense of these kinds of problems is regression. Regression is a statistical method that allows us to look at the relationship between two variables, while holding other factors equal.

This post will show how to estimate and interpret linear regression models with survey data using R. We’ll use data taken from a Pew Research Center 2016 post-election survey, and you can download the dataset for your own use here. We’ll discuss both bivariate regression, which has one outcome variable and one explanatory variable, and multiple regression, which has one outcome variable and multiple explanatory variables.

This post is meant as a brief introduction to how to estimate a regression model in R. It also offers a brief explanation of some of the aspects that need to be accounted for in the process.

Bivariate regression models with survey data

In the Center’s 2016 post-election survey, respondents were asked to rate then President-elect Donald Trump on a 0–100 “feeling thermometer.” Respondents were told, “a rating of zero degrees means you feel as cold and negative as possible. A rating of 100 degrees means you feel as warm and positive as possible. You would rate the person at 50 degrees if you don’t feel particularly positive or negative toward the person.”

We can use R’s plot function to take a look at the answers people gave. The plot below shows the distribution of the ratings of Trump. Round numbers and increments of 5 typically received more responses than other numbers. For example, 50 had a larger number of responses than 49.

In most survey research we also want to represent a population (in this case, the adult population in the U.S.), which requires weighting the data to known national statistics. Weights are used to correct for under- and overrepresentation among different demographic groups in our sample (like age, gender, region, education, race). When working with weighted survey data, we need to account for these weights correctly. Otherwise, population estimates, standard errors and significance tests will be incorrect.


One option for working with survey data in R is to use the “survey” package. For an introduction on working with survey data in R, see our earlier blog post.

The first step involves creating a survey design object with our weights variable. Below, we define the “d_design” object with the corresponding weight from the WEIGHT_W23 variable. We can use this survey object to perform a wide variety of analyses included in the `survey` package. In this case, we’ll use it to calculate averages and run a regression.
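The original post shows this step as a screenshot. A minimal reconstruction is sketched below, assuming the survey responses have been read into a data frame called `d`; the design options are assumptions, not a copy of the authors’ code:

```r
library(survey)

# Attach the survey weights to create the design object
# (d is an assumed name for the data frame of survey responses)
d_design <- svydesign(ids = ~1, data = d, weights = ~WEIGHT_W23)
```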

The `svymean()` function lets us calculate Trump’s average thermometer rating and its standard error. Overall, the average rating of Trump among those who gave him a rating in this data is 43, but we know from existing research that public views of Trump differ substantially by race, among other things. We can see this by tabulating the average Trump thermometer score by the race/ethnicity variable in the dataset (“F_RACETHN_RECRUITMENT”). The `svyby()` function lets us do that separately for each race category:
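The post’s code appears as images; calls along the following lines would produce those tables, where `TRUMP_THERM` is an assumed name for the 0–100 thermometer variable:

```r
# Weighted mean and standard error of the thermometer rating
# (TRUMP_THERM is an assumed column name)
svymean(~TRUMP_THERM, d_design, na.rm = TRUE)

# The same weighted mean, computed separately for each race/ethnicity group
svyby(~TRUMP_THERM, ~F_RACETHN_RECRUITMENT, d_design, svymean, na.rm = TRUE)
```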

We can see that there is a large difference between whites, blacks and Hispanics, with whites rating Trump at least 23 points higher than the other racial/ethnic groups do. (The “other” and “don’t know/refused” categories account for about 7% of the public.) However, since we know that there are large racial and ethnic differences in party identification, it may be that the racial divide in Trump ratings is a function of partisanship. This is where regression comes in.

By using the regression function `svyglm()` in R, we can conduct a regression analysis that includes party differences in the same model as race. Using `svyglm()` from the survey package (rather than `lm()` or `glm()`) is important because it accounts for the survey weights while estimating the model. The output from our `svyglm()` function will allow us to see whether a racial gap persists even after accounting for differences in partisanship between racial groups.

First, we can look at the results when we only include race in the regression:
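A sketch of that call, again using the assumed `TRUMP_THERM` variable name:

```r
# Bivariate (weighted) regression: thermometer rating on race/ethnicity only
race_model <- svyglm(TRUMP_THERM ~ F_RACETHN_RECRUITMENT, design = d_design)
summary(race_model)
```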

When interpreting regression output, we want to examine the coefficients of the independent variables. These are given by the values in the “Estimate” column.

Notice that the estimate and standard error for the “(Intercept)” are identical to the values we calculated earlier for white non-Hispanics. By default, R treats the first category in an independent variable as the reference category. The coefficients for the other racial groups show how each group differs from whites in terms of the Trump thermometer score. Notice that the coefficients for blacks, Hispanics and those who identify with other racial groups are all negative. This means that, on average, the ratings of Trump are lower across each of these groups compared to whites. For example, the coefficient for blacks is -23.7. This can be interpreted as meaning that, on average, Trump’s thermometer rating is 23.7 points lower for blacks than for whites. If we think back to the overall averages, this makes sense because all the nonwhite racial/ethnic groups rated Trump lower than whites did. And, in fact, if you combine the intercept estimate with the estimate for non-Hispanic blacks, you get 49.3 − 23.7 = 25.6, exactly what we saw in the simple tabulation above.

Multiple regression models with survey data

Regression becomes a more useful tool when researchers want to look at multiple factors simultaneously. If we want to know whether the racial divide persists even after accounting for differences in party identification, we can enter partisanship into the regression equation. Note that the only difference here is one added explanatory variable (F_PARTYSUM_FINAL) which contains responses to questions about which political party the respondents identify with or lean toward. Since we have two independent variables now, the reference categories are now the group of people who are in the first level for the F_RACETHN_RECRUITMENT and F_PARTYSUM_FINAL variables. In this case, that means that the intercept is the expected average thermometer score among non-Hispanic whites who also identify as or lean Republican.
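The corresponding call would look something like this (again with the assumed `TRUMP_THERM` name):

```r
# Multiple (weighted) regression: race/ethnicity plus party identification
full_model <- svyglm(TRUMP_THERM ~ F_RACETHN_RECRUITMENT + F_PARTYSUM_FINAL,
                     design = d_design)
summary(full_model)
```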

After including a new variable for partisanship, the racial and ethnic differences almost entirely disappear. The coefficients are quite small (none exceed 5) and are not statistically significant at p < 0.05. For blacks, we can interpret the coefficient of -2.1 as meaning that if we hold party constant, race does not explain differences in Trump’s rating. We would expect both black and white Republicans to give similar ratings of Trump. Likewise, we would expect only small differences between white and black Democrats. In contrast, party matters a lot: Democrats rate Trump about 51 points lower than Republicans on average. Those who don’t lean toward either party rate Trump about 39 points lower than Republicans.

Further analysis could be conducted to explore how other factors might account for variance in Trump thermometer ratings. Perhaps there are significant interactions that we haven’t accounted for (e.g., it might be the case that there is some kind of interaction between race and partisanship that isn’t accounted for in the simple additive model that we looked at above), and it is always important to remember that standard regression analysis of the kind presented in this post is not sufficient to show causal relationships. Regression allows us to sort out the relationships between many variables simultaneously, but we can’t say that just because a significant relationship was found between two variables, one caused the other. Regression is a useful tool for summarizing descriptive relationships, but it is not a silver bullet (see this post for more on where regression can go wrong).


Dtsch Arztebl Int. 2010 Nov; 107(44)

Linear Regression Analysis

Astrid Schneider, Gerhard Hommel, Maria Blettner

Department of Medical Biometrics, Epidemiology, and Computer Sciences, Johannes Gutenberg University, Mainz, Germany

Regression analysis is an important statistical method for the analysis of medical data. It enables the identification and characterization of relationships among multiple factors. It also enables the identification of prognostically relevant risk factors and the calculation of risk scores for individual prognostication.

This article is based on selected textbooks of statistics, a selective review of the literature, and our own experience.

After a brief introduction to the uni- and multivariable regression models, illustrative examples are given to explain what the important considerations are before a regression analysis is performed, and how the results should be interpreted. The reader should then be able to judge whether the method has been used correctly and interpret the results appropriately.

The performance and interpretation of linear regression analysis are subject to a variety of pitfalls, which are discussed here in detail. The reader is made aware of common errors of interpretation through practical examples. Both the opportunities for applying linear regression analysis and its limitations are presented.

The purpose of statistical evaluation of medical data is often to describe relationships between two variables or among several variables. For example, one would like to know not just whether patients have high blood pressure, but also whether the likelihood of having high blood pressure is influenced by factors such as age and weight. The variable to be explained (blood pressure) is called the dependent variable, or, alternatively, the response variable; the variables that explain it (age, weight) are called independent variables or predictor variables. Measures of association provide an initial impression of the extent of statistical dependence between variables. If the dependent and independent variables are continuous, as is the case for blood pressure and weight, then a correlation coefficient can be calculated as a measure of the strength of the relationship between them (Box 1).

Interpretation of the correlation coefficient (r)

Spearman’s coefficient:

Describes a monotone relationship

A monotone relationship is one in which the dependent variable consistently rises or consistently falls as the independent variable rises.

Pearson’s correlation coefficient:

Describes a linear relationship

Interpretation/meaning:

Correlation coefficients provide information about the strength and direction of a relationship between two continuous variables. No distinction between the explaining variable and the variable to be explained is necessary:

  • r = ± 1: perfect linear and monotone relationship. The closer r is to 1 or –1, the stronger the relationship.
  • r = 0: no linear or monotone relationship
  • r < 0: negative, inverse relationship (high values of one variable tend to occur together with low values of the other variable)
  • r > 0: positive relationship (high values of one variable tend to occur together with high values of the other variable)

Graphical representation of a linear relationship:

Scatter plot with regression line

A negative relationship is represented by a falling regression line (regression coefficient b < 0), a positive one by a rising regression line (b > 0).
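In R, both coefficients are available through `cor()`; the sketch below uses simulated blood-pressure data in the spirit of the article's running example:

```r
# Simulated data: blood pressure rising roughly linearly with age
set.seed(9)
age <- runif(80, 20, 70)
bp  <- 90 + 0.6 * age + rnorm(80, sd = 8)

cor(age, bp, method = "pearson")   # strength of the linear relationship
cor(age, bp, method = "spearman")  # strength of the monotone relationship
```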

Regression analysis is a type of statistical evaluation that enables three things:

  • Description: Relationships among the dependent variables and the independent variables can be statistically described by means of regression analysis.
  • Estimation: The values of the dependent variables can be estimated from the observed values of the independent variables.
  • Prognostication: Risk factors that influence the outcome can be identified, and individual prognoses can be determined.

Regression analysis employs a model that describes the relationships between the dependent variables and the independent variables in a simplified mathematical form. There may be biological reasons to expect a priori that a certain type of mathematical function will best describe such a relationship, or simple assumptions have to be made that this is the case (e.g., that blood pressure rises linearly with age). The best-known types of regression analysis are the following (Table 1):

  • Linear regression,
  • Logistic regression, and
  • Cox regression.

The goal of this article is to introduce the reader to linear regression. The theory is briefly explained, and the interpretation of statistical parameters is illustrated with examples. The methods of regression analysis are comprehensively discussed in many standard textbooks (1–3).

Cox regression will be discussed in a later article in this journal.

Linear regression is used to study the linear relationship between a dependent variable Y (blood pressure) and one or more independent variables X (age, weight, sex).

The dependent variable Y must be continuous, while the independent variables may be either continuous (age), binary (sex), or categorical (social status). The initial judgment of a possible relationship between two continuous variables should always be made on the basis of a scatter plot (scatter graph). This type of plot will show whether the relationship is linear (Figure 1) or nonlinear (Figure 2).

Figure 1: A scatter plot showing a linear relationship

Figure 2: A scatter plot showing an exponential relationship. In this case, it would not be appropriate to compute a coefficient of determination or a regression line

Performing a linear regression makes sense only if the relationship is linear. Other methods must be used to study nonlinear relationships. The variable transformations and other, more complex techniques that can be used for this purpose will not be discussed in this article.

Univariable linear regression

Univariable linear regression studies the linear relationship between the dependent variable Y and a single independent variable X. The linear regression model describes the dependent variable with a straight line that is defined by the equation Y = a + b × X, where a is the y-intercept of the line, and b is its slope. First, the parameters a and b of the regression line are estimated from the values of the dependent variable Y and the independent variable X with the aid of statistical methods. The regression line enables one to predict the value of the dependent variable Y from that of the independent variable X. Thus, for example, after a linear regression has been performed, one would be able to estimate a person’s weight (dependent variable) from his or her height (independent variable) (Figure 3).

Figure 3: A scatter plot and the corresponding regression line and regression equation for the relationship between the dependent variable body weight (kg) and the independent variable height (m). r = Pearson’s correlation coefficient; R-squared = coefficient of determination

The slope b of the regression line is called the regression coefficient. It provides a measure of the contribution of the independent variable X toward explaining the dependent variable Y. If the independent variable is continuous (e.g., body height in centimeters), then the regression coefficient represents the change in the dependent variable (body weight in kilograms) per unit of change in the independent variable (body height in centimeters). The proper interpretation of the regression coefficient thus requires attention to the units of measurement. The following example should make this relationship clear:

In a fictitious study, data were obtained from 135 women and men aged 18 to 27. Their height ranged from 1.59 to 1.93 meters. The relationship between height and weight was studied: weight in kilograms was the dependent variable that was to be estimated from the independent variable, height in centimeters. On the basis of the data, the following regression line was determined: Y = –133.18 + 1.16 × X, where X is height in centimeters and Y is weight in kilograms. The y-intercept a = –133.18 is the value of the dependent variable when X = 0, but X cannot possibly take on the value 0 in this study (one obviously cannot expect a person of height 0 centimeters to weigh negative 133.18 kilograms). Therefore, interpretation of the constant is often not useful. In general, only values within the range of observations of the independent variables should be used in a linear regression model; prediction of the value of the dependent variable becomes increasingly inaccurate the further one goes outside this range.

The regression coefficient of 1.16 means that, in this model, a person’s weight increases by 1.16 kg with each additional centimeter of height. If height had been measured in meters, rather than in centimeters, the regression coefficient b would have been 115.91 instead. The constant a, in contrast, is independent of the unit chosen to express the independent variables. Proper interpretation thus requires that the regression coefficient should be considered together with the units of all of the involved variables. Special attention to this issue is needed when publications from different countries use different units to express the same variables (e.g., feet and inches vs. centimeters, or pounds vs. kilograms).

Figure 3 shows the regression line that represents the linear relationship between height and weight.

For a person whose height is 1.74 m, the predicted weight is 68.50 kg (Y = –133.18 + 115.91 × 1.74, using the version of the equation in which height is expressed in meters). The data set contains 6 persons whose height is 1.74 m, and their weights vary from 63 to 75 kg.

Linear regression can be used to estimate the weight of any persons whose height lies within the observed range (1.59 m to 1.93 m). The data set need not include any person with this precise height. Mathematically it is possible to estimate the weight of a person whose height is outside the range of values observed in the study. However, such an extrapolation is generally not useful.
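The effect of the measurement unit on the coefficient can be reproduced with a few lines of R on simulated data; the numbers below mimic the fictitious study but are not its actual data:

```r
# Simulated data in the spirit of the fictitious study (not the real data)
set.seed(5)
height_m  <- runif(135, 1.59, 1.93)
weight_kg <- -133 + 116 * height_m + rnorm(135, sd = 4)
height_cm <- height_m * 100

coef(lm(weight_kg ~ height_cm))  # slope of about 1.16 kg per centimeter
coef(lm(weight_kg ~ height_m))   # slope of about 116 kg per meter:
                                 # same model, rescaled by the unit change
```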

If the independent variables are categorical or binary, then the regression coefficient must be interpreted in reference to the numerical encoding of these variables. Binary variables should generally be encoded with two consecutive whole numbers (usually 0/1 or 1/2). In interpreting the regression coefficient, one should recall which category of the independent variable is represented by the higher number (e.g., 2, when the encoding is 1/2). The regression coefficient reflects the change in the dependent variable that corresponds to a change in the independent variable from 1 to 2.

For example, if one studies the relationship between sex and weight, one obtains the regression line Y = 47.64 + 14.93 × X, where X = sex (1 = female, 2 = male). The regression coefficient of 14.93 reflects the fact that men are an average of 14.93 kg heavier than women.

When categorical variables are used, the reference category should be defined first, and all other categories are to be considered in relation to this category.

The coefficient of determination, r², is a measure of how well the regression model describes the observed data (Box 2). In univariable regression analysis, r² is simply the square of Pearson’s correlation coefficient. In the particular fictitious case described above, the coefficient of determination for the relationship between height and weight is 0.785. This means that 78.5% of the variance in weight is explained by height. The remaining 21.5% is due to individual variation and might be explained by other factors that were not taken into account in the analysis, such as eating habits, exercise, sex, or age.

Coefficient of determination (R-squared)

Definition:

  • n is the number of observations (e.g., subjects in the study)
  • ŷᵢ is the estimated value of the dependent variable for the i-th observation, as computed with the regression equation
  • yᵢ is the observed value of the dependent variable for the i-th observation
  • ȳ is the mean of all n observations of the dependent variable

The coefficient of determination is then defined as follows:

r² = Σ (ŷᵢ − ȳ)² / Σ (yᵢ − ȳ)²

where both sums run over i = 1, …, n.

In formal terms, the null hypothesis, which is the hypothesis that b = 0 (no relationship between variables, the regression coefficient is therefore 0), can be tested with a t-test. One can also compute the 95% confidence interval for the regression coefficient (4).

Multivariable linear regression

In many cases, the contribution of a single independent variable does not alone suffice to explain the dependent variable Y. If this is so, one can perform a multivariable linear regression to study the effect of multiple variables on the dependent variable.

In the multivariable regression model, the dependent variable is described as a linear function of the independent variables Xᵢ, as follows: Y = a + b₁ × X₁ + b₂ × X₂ + … + bₙ × Xₙ. The model permits the computation of a regression coefficient bᵢ for each independent variable Xᵢ (Box 3).

Regression line for a multivariable regression:

Y = a + b₁ × X₁ + b₂ × X₂ + … + bₙ × Xₙ

where:

Y = dependent variable
Xᵢ = independent variables
a = constant (y-intercept)
bᵢ = regression coefficient of the variable Xᵢ

Example: regression line for a multivariable regression:

Y = –120.07 + 100.81 × X₁ + 0.38 × X₂ + 3.41 × X₃

X₁ = height (meters)
X₂ = age (years)
X₃ = sex (1 = female, 2 = male)
Y = the weight to be estimated (kg)

Just as in univariable regression, the coefficient of determination describes the overall relationship between the independent variables Xᵢ (weight, age, body-mass index) and the dependent variable Y (blood pressure). It corresponds to the square of the multiple correlation coefficient, which is the correlation between Y and b₁ × X₁ + … + bₙ × Xₙ.

It is better practice, however, to give the corrected coefficient of determination, as discussed in Box 2. Each of the coefficients bᵢ reflects the effect of the corresponding individual independent variable Xᵢ on Y, where the potential influences of the remaining independent variables on Xᵢ have been taken into account, i.e., eliminated by an additional computation. Thus, in a multiple regression analysis with age and sex as independent variables and weight as the dependent variable, the adjusted regression coefficient for sex represents the amount of variation in weight that is due to sex alone, after age has been taken into account. This is done by a computation that adjusts for age, so that the effect of sex is not confounded by a simultaneously operative age effect (Box 4).

Two important terms

  • Confounder (in non-randomized studies): an independent variable that is associated, not only with the dependent variable, but also with other independent variables. The presence of confounders can distort the effect of the other independent variables. Age and sex are frequent confounders.
  • Adjustment: a statistical technique to eliminate the influence of one or more confounders on the treatment effect. Example: Suppose that age is a confounding variable in a study of the effect of treatment on a certain dependent variable. Adjustment for age involves a computational procedure to mimic a situation in which the men and women in the data set were of the same age. This computation eliminates the influence of age on the treatment effect.

In this way, multivariable regression analysis permits the study of multiple independent variables at the same time, with adjustment of their regression coefficients for possible confounding effects between variables.
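The following R sketch illustrates adjustment on simulated data in which men happen to be older than women, so that age confounds the unadjusted sex effect; all numbers are invented:

```r
# Simulated data where age confounds the effect of sex on weight
set.seed(6)
n <- 200
sex <- rbinom(n, 1, 0.5) + 1                    # 1 = female, 2 = male
age <- 30 + 10 * (sex == 2) + rnorm(n, sd = 5)  # men are older on average here
weight <- 50 + 10 * (sex == 2) + 0.5 * age + rnorm(n, sd = 4)

coef(lm(weight ~ sex))        # unadjusted: mixes the sex and age effects
coef(lm(weight ~ sex + age))  # adjusted: sex effect with age held constant
```

On data like these, the unadjusted sex coefficient comes out around 15 kg, while the adjusted coefficient falls to roughly 10 kg, the part genuinely attributable to sex.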

Multivariable analysis does more than describe a statistical relationship; it also permits individual prognostication and the evaluation of the state of health of a given patient. A linear regression model can be used, for instance, to determine the optimal values for respiratory function tests depending on a person’s age, body-mass index (BMI), and sex. Comparing a patient’s measured respiratory function with these computed optimal values yields a measure of his or her state of health.

Medical questions often involve the effect of a very large number of factors (independent variables). The goal of statistical analysis is to find out which of these factors truly have an effect on the dependent variable. The art of statistical evaluation lies in finding the variables that best explain the dependent variable.

One way to carry out a multivariable regression is to include all potentially relevant independent variables in the model (complete model). The problem with this method is that the number of observations that can practically be made is often less than the model requires. In general, the number of observations should be at least 20 times greater than the number of variables under study.

Moreover, if too many irrelevant variables are included in the model, overadjustment is likely to result: that is, some of the irrelevant independent variables will be found to have an apparent effect, purely by chance. The inclusion of irrelevant independent variables in the model will indeed allow a better fit with the data set under study, but, because of random effects, the findings will not generally be applicable outside of this data set (1). The inclusion of irrelevant independent variables also strongly distorts the coefficient of determination, so that it no longer provides a useful index of the quality of fit between the model and the data (Box 2).

In the following sections, we will discuss how these problems can be circumvented.

The selection of variables

For the regression model to be robust and to explain Y as well as possible, it should include only independent variables that explain a large portion of the variance in Y. Variable selection can be performed so that only such independent variables are included (1).

Variable selection should be carried out on the basis of medical expert knowledge and a good understanding of biometrics. This is optimally done as a collaborative effort of the physician-researcher and the statistician. There are various methods of selecting variables:

Forward selection

Forward selection is a stepwise procedure that includes variables in the model as long as they make an additional contribution toward explaining Y. This is done iteratively until there are no variables left that make any appreciable contribution to Y.

Backward selection

Backward selection, on the other hand, starts with a model that contains all potentially relevant independent variables. The variable whose removal worsens the prediction of the dependent variable to the least extent is then removed from the model. This procedure is iterated until no independent variables are left that can be removed without markedly worsening the prediction of the dependent variable.

Stepwise selection

Stepwise selection combines certain aspects of forward and backward selection. Like forward selection, it begins with a null model, adds the single independent variable that makes the greatest contribution toward explaining the dependent variable, and then iterates the process. Additionally, a check is performed after each such step to see whether one of the variables has now become irrelevant because of its relationship to the other variables. If so, this variable is removed.

Block inclusion

There are often variables that should be included in the model in any case—for example, the effect of a certain form of treatment, or independent variables that have already been found to be relevant in prior studies. One way of taking such variables into account is their block inclusion into the model. In this way, one can combine the forced inclusion of some variables with the selective inclusion of further independent variables that turn out to be relevant to the explanation of variation in the dependent variable.

The evaluation of a regression model requires the performance of both forward and backward selection of variables. If these two procedures result in the selection of the same set of variables, then the model can be considered robust. If not, a statistician should be consulted for further advice.
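Base R's `step()` function offers one common implementation of forward and backward selection (based on the AIC criterion rather than the significance tests described above); a self-contained sketch on simulated data:

```r
# Simulated data: x1 and x2 matter, x3 is irrelevant
set.seed(7)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 2 * df$x1 + 0.5 * df$x2 + rnorm(100)

full <- lm(y ~ x1 + x2 + x3, data = df)  # complete model
null <- lm(y ~ 1, data = df)             # intercept-only model

fwd <- step(null, scope = formula(full), direction = "forward")
bwd <- step(full, direction = "backward")

# If both procedures keep the same variables, the model can be
# considered robust, as recommended above
```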

The study of relationships between variables and the generation of risk scores are very important elements of medical research. The proper performance of regression analysis requires that a number of important factors should be considered and tested:

1. Causality

Before a regression analysis is performed, the causal relationships among the variables to be considered must be examined from the point of view of their content and/or temporal relationship. The fact that an independent variable turns out to be significant says nothing about causality. This is an especially relevant point with respect to observational studies (5).

2. Planning of sample size

The number of cases needed for a regression analysis depends on the number of independent variables and on their expected effects (strength of relationships). If the sample is too small, only very strong relationships will be demonstrable. The sample size can be planned in the light of the researchers’ expectations regarding the coefficient of determination (r²) and the regression coefficient (b). Furthermore, at least 20 times as many observations should be made as there are independent variables to be studied; thus, if one wants to study 2 independent variables, one should make at least 40 observations.

3. Missing values

Missing values are a common problem in medical data. Whenever the value of either a dependent or an independent variable is missing, this particular observation has to be excluded from the regression analysis. If many values are missing from the dataset, the effective sample size will be appreciably diminished, and the sample may then turn out to be too small to yield significant findings, despite seemingly adequate advance planning. If this happens, real relationships can be overlooked, and the study findings may not be generally applicable. Moreover, selection effects can be expected in such cases. There are a number of ways to deal with the problem of missing values (6).

4. The data sample

A further important point to be considered is the composition of the study population. If there are subpopulations within it that behave differently with respect to the independent variables in question, then a real effect (or the lack of an effect) may be masked from the analysis and remain undetected. Suppose, for instance, that one wishes to study the effect of sex on weight, in a study population consisting half of children under age 8 and half of adults. Linear regression analysis over the entire population reveals an effect of sex on weight. If, however, a subgroup analysis is performed in which children and adults are considered separately, an effect of sex on weight is seen only in adults, and not in children. Subgroup analysis should only be performed if the subgroups have been predefined, and the questions already formulated, before the data analysis begins; furthermore, multiple testing should be taken into account (7, 8).

5. The selection of variables

If multiple independent variables are considered in a multivariable regression, some of these may turn out to be interdependent. An independent variable that would be found to have a strong effect in a univariable regression model might not turn out to have any appreciable effect in a multivariable regression with variable selection. This will happen if this particular variable itself depends so strongly on the other independent variables that it makes no additional contribution toward explaining the dependent variable. For related reasons, when the independent variables are mutually dependent, different independent variables might end up being included in the model depending on the particular technique that is used for variable selection.

Linear regression is an important tool for statistical analysis. Its broad spectrum of uses includes relationship description, estimation, and prognostication. The technique has many applications, but it also has prerequisites and limitations that must always be considered in the interpretation of findings (Box 5).

What special points require attention in the interpretation of a regression analysis?

  • How big is the study sample?
  • Is causality demonstrable or plausible, in view of the content or temporal relationship of the variables?
  • Has there been adjustment for potential confounding effects?
  • Is the inclusion of the independent variables that were used justified, in view of their content?
  • What is the corrected coefficient of determination (R-squared)?
  • Is the study sample homogeneous?
  • In what units were the potentially relevant independent variables reported?
  • Was a selection of the independent variables (potentially relevant independent variables) performed, and, if so, what kind of selection?
  • If a selection of variables was performed, was its result confirmed by a second selection of variables that was performed by a different procedure?
  • Are predictions of the dependent variable made on the basis of extrapolated data?

Box 2 (continued):

r² = Σ (ŷᵢ − ȳ)² / Σ (yᵢ − ȳ)²

r² is the fraction of the overall variance that is explained. The closer the regression model’s estimated values ŷᵢ lie to the observed values yᵢ, the nearer the coefficient of determination is to 1 and the more accurate the regression model is.

Meaning: In practice, the coefficient of determination is often taken as a measure of the validity of a regression model or a regression estimate. It reflects the fraction of variation in the Y-values that is explained by the regression line.

Problem: The coefficient of determination can easily be made artificially high by including a large number of independent variables in the model. The more independent variables one includes, the higher the coefficient of determination becomes. This, however, lowers the precision of the estimate (estimation of the regression coefficients bᵢ).

Solution: Instead of the raw (uncorrected) coefficient of determination, the corrected coefficient of determination should be given: the latter takes the number of explanatory variables in the model into account. Unlike the uncorrected coefficient of determination, the corrected one is high only if the independent variables have a sufficiently large effect.

Acknowledgments

Translated from the original German by Ethan Taub, MD

Conflict of interest statement

The authors declare that they have no conflict of interest as defined by the guidelines of the International Committee of Medical Journal Editors.


What Is Regression Analysis in Business Analytics?


14 Dec 2021

Countless factors impact every facet of business. How can you consider those factors and know their true impact?

Imagine you seek to understand the factors that influence people’s decision to buy your company’s product. They range from customers’ physical locations to satisfaction levels among sales representatives to your competitors' Black Friday sales.

Understanding the relationships between each factor and product sales can enable you to pinpoint areas for improvement, helping you drive more sales.

To learn how each factor influences sales, you need to use a statistical analysis method called regression analysis.

If you aren’t a business or data analyst, you may not run regressions yourself, but knowing how analysis works can provide important insight into which factors impact product sales and, thus, which are worth improving.


Foundational Concepts for Regression Analysis

Before diving into regression analysis, you need to build foundational knowledge of statistical concepts and relationships.

Independent and Dependent Variables

Start with the basics. What relationship are you aiming to explore? Try formatting your answer like this: “I want to understand the impact of [the independent variable] on [the dependent variable].”

The independent variable is the factor that could impact the dependent variable . For example, “I want to understand the impact of employee satisfaction on product sales.”

In this case, employee satisfaction is the independent variable, and product sales is the dependent variable. Identifying the dependent and independent variables is the first step toward regression analysis.

Correlation vs. Causation

One of the cardinal rules of statistically exploring relationships is to never assume correlation implies causation. In other words, just because two variables move in the same direction doesn’t mean one caused the other to occur.

If two or more variables are correlated , their directional movements are related. If two variables are positively correlated , it means that as one goes up or down, so does the other. Alternatively, if two variables are negatively correlated , one goes up while the other goes down.

A correlation’s strength can be quantified by calculating the correlation coefficient , sometimes represented by r . The correlation coefficient falls between negative one and positive one.

r = -1 indicates a perfect negative correlation.

r = 1 indicates a perfect positive correlation.

r = 0 indicates no correlation.
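As a quick illustration, the correlation coefficient can be computed in one line with most numerical libraries. A minimal sketch in Python using NumPy (the paired values are invented):

```python
import numpy as np

# Invented paired observations, e.g., satisfaction scores and sales figures
x = np.array([3.1, 4.0, 4.5, 5.2, 6.0, 6.8])
y = np.array([10.2, 11.5, 12.1, 13.0, 14.2, 15.1])

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(round(r, 3))  # close to +1: a strong positive correlation
```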

Causation means that one variable caused the other to occur. Proving a causal relationship between variables requires a true experiment with a control group (which doesn’t receive the independent variable) and an experimental group (which receives the independent variable).

While regression analysis provides insights into relationships between variables, it doesn’t prove causation. It can be tempting to assume that one variable caused the other—especially if you want it to be true—which is why you need to keep this in mind any time you run regressions or analyze relationships between variables.

With the basics under your belt, here’s a deeper explanation of regression analysis so you can leverage it to drive strategic planning and decision-making.


What Is Regression Analysis?

Regression analysis is the statistical method used to determine the structure of a relationship between two variables (single variable linear regression) or three or more variables (multiple regression).

According to the Harvard Business School Online course Business Analytics , regression is used for two primary purposes:

  • To study the magnitude and structure of the relationship between variables
  • To forecast a variable based on its relationship with another variable

Both of these insights can inform strategic business decisions.

“Regression allows us to gain insights into the structure of that relationship and provides measures of how well the data fit that relationship,” says HBS Professor Jan Hammond, who teaches Business Analytics, one of three courses that comprise the Credential of Readiness (CORe) program . “Such insights can prove extremely valuable for analyzing historical trends and developing forecasts.”

One way to think of regression is by visualizing a scatter plot of your data with the independent variable on the X-axis and the dependent variable on the Y-axis. The regression line is the line that best fits the scatter plot data. The regression equation represents the line’s slope and the relationship between the two variables, along with an estimation of error.

Physically creating this scatter plot can be a natural starting point for parsing out the relationships between variables.
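If you want to try this yourself, a scatter plot with a best-fit line takes only a few lines of code. A minimal sketch in Python with NumPy and Matplotlib (the data are randomly generated stand-ins):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)                # independent variable (X-axis)
y = 2.0 + 1.5 * x + rng.normal(0, 2, 50)  # dependent variable with noise

beta, alpha = np.polyfit(x, y, deg=1)     # slope and intercept of the best fit

plt.scatter(x, y, label="observations")
plt.plot(np.sort(x), alpha + beta * np.sort(x), color="red", label="regression line")
plt.xlabel("independent variable")
plt.ylabel("dependent variable")
plt.legend()
plt.show()
```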


Types of Regression Analysis

There are two types of regression analysis: single variable linear regression and multiple regression.

Single variable linear regression is used to determine the relationship between two variables: the independent and dependent. The equation for a single variable linear regression looks like this:

ŷ = α + βx

In the equation:

  • ŷ is the expected value of Y (the dependent variable) for a given value of X (the independent variable).
  • x is the independent variable.
  • α is the Y-intercept, the point at which the regression line intersects with the vertical axis.
  • β is the slope of the regression line, or the average change in the dependent variable as the independent variable increases by one.
  • ε is the error term, equal to Y – ŷ, or the difference between the actual value of the dependent variable and its expected value.

Multiple regression , on the other hand, is used to determine the relationship between three or more variables: the dependent variable and at least two independent variables. The multiple regression equation looks complex but is similar to the single variable linear regression equation:

ŷ = α + β₁x₁ + β₂x₂ + … + βₖxₖ

Each component of this equation represents the same thing as in the previous equation, with the addition of the subscript k, which is the total number of independent variables being examined. For each independent variable you include in the regression, multiply the slope of the regression line by the value of the independent variable, and add it to the rest of the equation.

How to Run Regressions

You can use a host of statistical programs—such as Microsoft Excel, SPSS, and STATA—to run both single variable linear and multiple regressions. If you’re interested in hands-on practice with this skill, Business Analytics teaches learners how to create scatter plots and run regressions in Microsoft Excel, as well as make sense of the output and use it to drive business decisions.
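For readers working outside of Excel, the same regressions can be run in a few lines of Python with the statsmodels library. This is a hedged sketch, not the course's material; the data frame and column names are invented:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented data: product sales with two candidate predictors
df = pd.DataFrame({
    "sales":        [120, 135, 150, 160, 172, 185],
    "satisfaction": [6.1, 6.5, 7.0, 7.2, 7.8, 8.3],
    "ad_spend":     [10, 12, 11, 14, 15, 16],
})

single = smf.ols("sales ~ satisfaction", data=df).fit()               # one predictor
multiple = smf.ols("sales ~ satisfaction + ad_spend", data=df).fit()  # two predictors

# summary() reports coefficients, p-values, confidence intervals, and R-squared
print(multiple.summary())
```

The summary output also includes the confidence and significance measures discussed in the next section.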

Calculating Confidence and Accounting for Error

It’s important to note: this overview of regression analysis is introductory and doesn’t delve into the calculations of confidence level, significance, variance, and error. When working in a statistical program, these calculations may be provided automatically or may require you to apply a function. When conducting regression analysis, these metrics are important for gauging how significant your results are and how much weight to place on them.


Why Use Regression Analysis?

Once you’ve generated a regression equation for a set of variables, you effectively have a roadmap for the relationship between your independent and dependent variables. If you input a specific X value into the equation, you can see the expected Y value.

This can be critical for predicting the outcome of potential changes, allowing you to ask, “What would happen if this factor changed by a specific amount?”

Returning to the earlier example, running a regression analysis could allow you to find the equation representing the relationship between employee satisfaction and product sales. You could input a higher level of employee satisfaction and see how sales might change accordingly. This information could lead to improved working conditions for employees, backed by data that shows the tie between high employee satisfaction and sales.

Whether predicting future outcomes, determining areas for improvement, or identifying relationships between seemingly unconnected variables, understanding regression analysis can enable you to craft data-driven strategies and determine the best course of action with all factors in mind.



When to Use Regression Analysis (With Examples)

Regression analysis can be used to:

  • estimate the effect of an exposure on a given outcome
  • predict an outcome using known factors
  • balance dissimilar groups
  • model and replace missing data
  • detect unusual records

In the text below, we will go through these points in greater detail and provide a real-world example of each.

1. Estimate the effect of an exposure on a given outcome

Regression can model linear and non-linear associations between an exposure (or treatment) and an outcome of interest. It can also simultaneously model the relationships between more than one exposure and an outcome, even when these exposures interact with each other.

Example: Exploring the relationship between Body Mass Index (BMI) and all-cause mortality

De Gonzales et al. used a Cox regression model to estimate the association between BMI and mortality among 1.46 million white adults.

As expected, they found that the risk of mortality increases with progressively higher than normal levels of BMI.

The takeaway message is that regression analysis enabled them to quantify that association while adjusting for smoking, alcohol consumption, physical activity, educational level and marital status — all potential confounders of the relationship between BMI and mortality.
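The study's model was a Cox proportional-hazards regression. As a rough illustration of how such a model is fitted and adjusted for confounders, here is a minimal sketch in Python using the lifelines library; the data frame, column names, and values are invented stand-ins, not the study's data:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Invented survival data: follow-up time, death indicator, exposure, a confounder
df = pd.DataFrame({
    "years":  [5.2, 8.1, 3.4, 9.9, 7.0, 6.5, 4.8, 10.2],
    "died":   [1, 0, 1, 0, 0, 1, 1, 0],
    "bmi":    [31.0, 24.5, 35.2, 22.8, 29.1, 26.0, 33.4, 23.5],
    "smoker": [1, 0, 1, 0, 1, 0, 1, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years", event_col="died")  # adjusts for all other columns
cph.print_summary()  # hazard ratios quantify the BMI-mortality association
```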

2. Predict an outcome using known factors

A regression model can also be used to predict things like stock prices, weather conditions, the risk of getting a disease, mortality, etc. based on a set of known predictors (also called independent variables).

Example: Predicting malaria in South Africa using seasonal climate data

Kim et al. used Poisson regression to develop a malaria prediction model using climate data such as temperature and precipitation in South Africa.

The model performed best with short-term predictions.

The important thing to notice here is the degree of complexity that a regression model can handle. In this example, for instance, the model had to be flexible enough to account for non-linear and delayed associations between malaria transmission and climate factors.

This is a recurrent theme with predictive models: we start with a simple model, then keep adding complexity until we get a satisfactory result; this is why we call it model building .
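For intuition, here is what the starting point of such a model-building exercise might look like: a Poisson regression of case counts on climate predictors, sketched in Python with statsmodels. The variables and values are invented, and a real model would add lags and non-linear terms as described above:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Invented monthly data: malaria case counts and climate predictors
df = pd.DataFrame({
    "cases":    [12, 30, 45, 80, 55, 20],
    "temp":     [18, 22, 25, 28, 26, 20],    # mean temperature
    "rainfall": [30, 60, 90, 140, 100, 40],  # total precipitation
})

model = smf.glm("cases ~ temp + rainfall", data=df,
                family=sm.families.Poisson()).fit()
print(model.summary())  # coefficients are on the log-rate scale
```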

3. Balance dissimilar groups

Proving that a relationship exists between some independent variable X and an outcome Y does not mean much if this result cannot be generalized beyond your sample.

In order for your results to generalize well, the sample you’re working with has to resemble the population from which it was drawn. If it doesn’t, you can use regression to balance some important characteristics in the sample to make it representative of the population of interest.

Another case where you would want to balance dissimilar groups is in a randomized controlled trial, where the objective is to compare the outcome between the group that received the intervention and another that serves as the control or reference. But in order for the comparison to make sense, the two groups must have similar characteristics.

Example: Evaluating how sleep quality is affected by sleep hygiene education and behavioral therapy

Nishinoue et al. conducted a randomized controlled trial to compare sleep quality between 2 groups of participants:

  • The treatment group: Participants received sleep hygiene education and behavioral therapy
  • The control group: Participants received sleep hygiene education only

A generalized linear model (a generalized form of linear regression) was used to:

  • Evaluate how sleep quality changed between groups
  • Adjust for age, gender, job title, smoking and drinking habits, body-mass index, and mental health to make the groups more comparable
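In practice, such an adjustment amounts to including the group indicator and the covariates in the same model, so that the group coefficient reflects the difference after balancing. A minimal sketch in Python, with ordinary least squares standing in for the study's generalized linear model (the data and column names are invented):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented trial data: one row per participant
df = pd.DataFrame({
    "sleep_score": [6.5, 7.2, 5.8, 8.0, 6.9, 7.5, 5.5, 7.8],
    "group":       [0, 1, 0, 1, 0, 1, 0, 1],  # 1 = education + behavioral therapy
    "age":         [35, 42, 51, 38, 45, 40, 55, 33],
    "bmi":         [24.1, 27.3, 25.0, 22.8, 26.5, 23.9, 28.2, 24.7],
})

model = smf.ols("sleep_score ~ group + age + bmi", data=df).fit()

# The `group` coefficient is the between-group difference in sleep quality
# after adjusting for the listed characteristics.
print(model.params["group"], model.pvalues["group"])
```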

4. Model and replace missing data

Modeling missing data is an important part of data analysis, especially in cases where you have high non-response rates (so a high number of missing values) like in telephone surveys.

Before jumping into imputing missing data, first you must determine:

  • How important the variables that have missing values are in your analysis
  • The percentage of missing values
  • If these values were missing at random or not

Based on this analysis, you can then choose to:

  • Delete observations with missing values
  • Replace missing data with the column’s mean or median
  • Use a regression model to replace missing data

Example: Using multiple imputation to replace missing data in a medical study

Beynon et al. studied the prognostic role of alcohol and smoking at diagnosis of head and neck cancer.

But before they built their statistical model, they noticed that 11 variables (including smoking status and alcohol intake and other covariates) had missing values, so they used a technique called MICE (Multiple Imputation by Chained Equations) which runs regression models under the hood to replace missing values.
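MICE is usually run in R, but the same chained-regression idea is available in Python through scikit-learn's IterativeImputer, which models each incomplete column as a regression on the others. A minimal sketch with an invented covariate matrix:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Invented covariate matrix with missing values (np.nan)
X = np.array([
    [25.0, 1.0, np.nan],
    [40.0, np.nan, 3.0],
    [33.0, 0.0, 2.0],
    [np.nan, 1.0, 4.0],
    [51.0, 0.0, 1.0],
])

# Each column with missing values is regressed on the other columns,
# cycling until the imputed values stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_complete = imputer.fit_transform(X)
print(X_complete)
```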

5. Detect unusual records

Regression models, alongside other statistical techniques, can be used to model what “normal” data should look like, the purpose being to detect values that deviate from this norm. These are referred to as “anomalies” or “outliers” in the data.

Most applications of anomaly detection are outside the healthcare domain: it is typically used to detect financial fraud, atypical online behavior of website visitors, anomalies in machine performance in a factory, and so on.

Example: Detecting critical cases of patients undergoing heart surgery

Presbitero et al. used a time-varying autoregressive model (along with other statistical measures) to flag abnormal cases of patients undergoing heart surgery using data on their blood measurements.

Their ultimate goal is to prevent patient deaths by enabling early intervention through this early-warning detection algorithm.
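The authors' time-varying autoregressive model is beyond a short example, but the core idea (fit a model of "normal" behavior and flag large deviations) can be sketched with a simple regression in Python on invented data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3.0 + 0.8 * x + rng.normal(0, 0.5, 200)
y[::50] += 5.0  # inject a few abnormal records

slope, intercept = np.polyfit(x, y, deg=1)  # model of "normal" behavior
residuals = y - (intercept + slope * x)

# Flag records whose residuals exceed 3 standard deviations
threshold = 3 * residuals.std()
anomalies = np.where(np.abs(residuals) > threshold)[0]
print(anomalies)  # indices of the injected outliers
```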

Further reading

  • Variables to Include in a Regression Model
  • Understand Linear Regression Assumptions
  • 7 Tricks to Get Statistically Significant p-Values
  • How to Handle Missing Data in Practice: Guide for Beginners


Regression: Definition, Analysis, Calculation, and Example


Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).

Also called simple regression or ordinary least squares (OLS), linear regression is the most common form of this technique. Linear regression establishes the linear relationship between two variables based on a line of best fit . Linear regression is thus graphically depicted using a straight line with the slope defining how the change in one variable impacts a change in the other. The y-intercept of a linear regression relationship represents the value of one variable when the value of the other is zero. Nonlinear regression models also exist, but are far more complex.

Regression analysis is a powerful tool for uncovering the associations between variables observed in data, but cannot easily indicate causation. It is used in several contexts in business, finance, and economics. For instance, it is used to help investment managers value assets and understand the relationships between factors such as commodity prices and the stocks of businesses dealing in those commodities.

Regression as a statistical technique should not be confused with the concept of regression to the mean ( mean reversion ).

Key Takeaways

  • A regression is a statistical technique that relates a dependent variable to one or more independent (explanatory) variables.
  • A regression model is able to show whether changes observed in the dependent variable are associated with changes in one or more of the explanatory variables.
  • It does this by essentially fitting a best-fit line and seeing how the data is dispersed around this line.
  • Regression helps economists and financial analysts in things ranging from asset valuation to making predictions.
  • For regression results to be properly interpreted, several assumptions about the data and the model itself must hold.


Regression captures the correlation between variables observed in a data set and quantifies whether those correlations are statistically significant or not.

The two basic types of regression are simple linear regression and  multiple linear regression , although there are nonlinear regression methods for more complicated data and analysis. Simple linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y, while multiple linear regression uses two or more independent variables to predict the outcome (while holding all others constant). Analysts can use stepwise regression to examine each independent variable contained in the linear regression model.

Regression can help finance and investment professionals as well as professionals in other businesses. Regression can also help predict sales for a company based on weather, previous sales, gross domestic product (GDP) growth, or other types of conditions. The capital asset pricing model (CAPM) is an often-used regression model in finance for pricing assets and discovering the costs of capital.

Regression and Econometrics

Econometrics is a set of statistical techniques used to analyze data in finance and economics. An example of the application of econometrics is to study the income effect using observable data. An economist may, for example, hypothesize that as a person increases their income , their spending will also increase.

If the data show that such an association is present, a regression analysis can then be conducted to understand the strength of the relationship between income and consumption and whether or not that relationship is statistically significant—that is, it appears to be unlikely that it is due to chance alone.

Note that you can have several explanatory variables in your analysis—for example, changes to GDP and inflation in addition to unemployment in explaining stock market prices. When more than one explanatory variable is used, it is referred to as  multiple linear regression . This is the most commonly used tool in econometrics.

Econometrics is sometimes criticized for relying too heavily on the interpretation of regression output without linking it to economic theory or looking for causal mechanisms. It is crucial that the findings revealed in the data are able to be adequately explained by a theory, even if that means developing your own theory of the underlying processes.

Linear regression models often use a least-squares approach to determine the line of best fit: the technique finds the coefficient values that minimize the sum of squared residuals, where each residual is the vertical distance between a data point and the fitted regression line.

Once this process has been completed (usually done today with software), a regression model is constructed. The general form of each type of regression model is:

Simple linear regression:

Y = a + bX + u

Multiple linear regression:

Y = a + b₁X₁ + b₂X₂ + b₃X₃ + … + bₜXₜ + u

where:

Y = the dependent variable you are trying to predict or explain
X = the explanatory (independent) variable(s) you are using to predict or associate with Y
a = the y-intercept
b = the beta coefficient(s), i.e., the slope(s) of the explanatory variable(s)
u = the regression residual or error term
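Under the hood, the software solves for the coefficients that minimize the sum of squared residuals. A minimal sketch of the same computation in Python via NumPy's least-squares solver (the observations are invented):

```python
import numpy as np

# Invented observations: Y and two explanatory variables X1, X2
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y = np.array([4.1, 5.0, 8.9, 9.2, 12.8])

# Design matrix with a leading column of ones for the intercept a
A = np.column_stack([np.ones_like(X1), X1, X2])

coef, *_ = np.linalg.lstsq(A, Y, rcond=None)  # minimizes the sum of squared residuals
a, b1, b2 = coef
print(a, b1, b2)
```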

Example of How Regression Analysis Is Used in Finance

Regression is often used to determine how many specific factors such as the price of a commodity, interest rates, particular industries, or sectors influence the price movement of an asset. The aforementioned CAPM is based on regression, and is utilized to project the expected returns for stocks and to generate costs of capital. A stock’s returns are regressed against the returns of a broader index, such as the S&P 500, to generate a beta for the particular stock.

Beta is the stock’s risk in relation to the market or index and is reflected as the slope in the CAPM. The return for the stock in question would be the dependent variable Y, while the independent variable X would be the market risk premium.

Additional variables such as the market capitalization of a stock, valuation ratios, and recent returns can be added to the CAPM to get better estimates for returns. These additional factors are known as the Fama-French factors, named after the professors who developed the multiple linear regression model to better explain asset returns.
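Because beta is just the slope from regressing a stock's returns on the market's returns, it can be estimated in one line. A minimal sketch in Python (the return series are invented, not real market data):

```python
import numpy as np

# Invented monthly excess returns for a stock and a market index
market = np.array([0.02, -0.01, 0.03, 0.015, -0.02, 0.025])
stock = np.array([0.03, -0.02, 0.04, 0.020, -0.03, 0.035])

# Beta is the OLS slope: cov(stock, market) / var(market)
beta = np.cov(stock, market, ddof=1)[0, 1] / np.var(market, ddof=1)
print(round(beta, 2))  # > 1 means the stock amplifies market moves
```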

Why Is It Called Regression?

Although there is some debate about the origins of the name, the statistical technique described above most likely was termed “regression” by Sir Francis Galton in the 19th century to describe the statistical feature of biological data (such as heights of people in a population) to regress to some mean level. In other words, while there are shorter and taller people, only outliers are very tall or short, and most people cluster somewhere around (or “regress” to) the average.

What Is the Purpose of Regression?

In statistical analysis, regression is used to identify the associations between variables occurring in some data. It can show the magnitude of such an association and determine its statistical significance (i.e., whether or not the association is likely due to chance). Regression is a powerful tool for statistical inference and has been used to try to predict future outcomes based on past observations.

How Do You Interpret a Regression Model?

A regression model output may be in the form of Y = 1.0 + 3.2X₁ − 2.0X₂ + 0.21.

Here we have a multiple linear regression that relates some variable Y to two explanatory variables X₁ and X₂. We would interpret the model as saying that Y changes by 3.2 units for every one-unit increase in X₁ (if X₁ goes up by 2, Y goes up by 6.4, etc.), holding all else constant. That is, controlling for X₂, X₁ has this observed relationship. Likewise, holding X₁ constant, every one-unit increase in X₂ is associated with a 2.0-unit decrease in Y. We can also note the y-intercept of 1.0, meaning that Y = 1 when X₁ and X₂ are both zero. The term 0.21 is the regression's residual or error term.

What Are the Assumptions That Must Hold for Regression Models?

To properly interpret the output of a regression model, the following main assumptions about the underlying data process of what you are analyzing must hold:

  • The relationship between the variables is linear.
  • There must be homoskedasticity : the variance of the error term must remain constant across values of the explanatory variables.
  • The explanatory variables are independent of one another (no multicollinearity).
  • The error term is normally distributed .
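These assumptions are usually checked from the fitted model's residuals and design matrix. A minimal sketch in Python with statsmodels, on simulated data; a real diagnosis would also include residual plots and formal tests:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1.0 + 2.0 * df.x1 - 1.0 * df.x2 + rng.normal(size=100)

model = smf.ols("y ~ x1 + x2", data=df).fit()

# Independence of explanatory variables: VIF near 1 means little multicollinearity
X = df[["x1", "x2"]].assign(const=1.0)
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))

# Homoskedasticity (rough check): residual spread should not track fitted values
print(np.corrcoef(model.fittedvalues, np.abs(model.resid))[0, 1])
```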

The Bottom Line

Regression is a statistical method that tries to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables). It is used in finance, investing, and other disciplines.

Regression analysis uncovers the associations between variables observed in data, but cannot easily indicate causation.

Margo Bergman. “ Quantitative Analysis for Business: 12. Simple Linear Regression and Correlation .” University of Washington Pressbooks, 2022.

Margo Bergman. “ Quantitative Analysis for Business: 13. Multiple Linear Regression .” University of Washington Pressbooks, 2022.

Eugene F. Fama and Kenneth R. French, via Wiley Online Library. “ The Cross-Section of Expected Stock Returns .” The Journal of Finance , Vol. 47, No. 2 (June 1992), Pages 427–465.

Jeffrey M. Stanton, via Taylor & Francis Online. “ Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors .” Journal of Statistics Education , Vol. 9, No. 3, 2001.

CFA Institute. “ Basics of Multiple Regression and Underlying Assumptions .”


Multiple Regression Analysis Example with Conceptual Framework

Data analysis using multiple regression analysis is a fairly common tool in statistics. Many graduate students find it too complicated to understand, but it is not that difficult to carry out, especially now that computers are everyday household items. You can quickly analyze more than just two sets of variables in your research using multiple regression analysis.

How is multiple regression analysis done? This article explains this handy statistical test when dealing with many variables, then walks through an example of a research study that used multiple regression analysis to show how it works.

Multiple regression is often confused with multivariate regression. Multivariate regression, while also using several variables, deals with more than one dependent variable . Karen Grace-Martin clearly explains the difference in her post on the difference between the Multiple Regression Model and Multivariate Regression Model .


Statistical Software Applications Used in Computing Multiple Regression Analysis

Multiple regression analysis is a powerful statistical test used to find the relationship between a given dependent variable and a set of independent variables .

Using multiple regression analysis requires dedicated statistical software like the popular Statistical Package for the Social Sciences (SPSS) , Statistica, Microstat, and open-source applications like SOFA Statistics and JASP, among other sophisticated statistical packages.

Two decades ago, it would have been nearly impossible to do these calculations with a simple calculator, a device that smartphones have since made obsolete.

However, a standard spreadsheet application like Microsoft Excel can help you compute and model the relationship between the dependent variable and a set of predictor or independent variables. But you cannot do this without first activating the statistical add-in that ships with MS Excel.

Activating MS Excel

To activate the add-in for multiple regression analysis in MS Excel, you may view the two-minute YouTube tutorial below. If you already have this installed on your computer, you may proceed to the next section.

Multiple Regression Analysis Example

I will illustrate the use of multiple regression analysis by citing the actual research activity that my graduate students undertook two years ago.

The study pertains to identifying the factors predicting a current problem among high school students: the long hours they spend online for a variety of reasons. The purpose is to address many parents’ concerns about the difficulty of weaning their children away from the lures of online gaming, social networking, and other engaging virtual activities.

Review of Literature on Internet Use and Its Effect on Children

Upon reviewing the literature, the graduate students discovered that very few studies had been conducted on the subject. Studies on problems associated with internet use are still in their infancy, as the Internet has only recently begun to influence everyone’s life.

Hence, with my guidance, the group of six graduate students comprising school administrators, heads of elementary and high schools, and faculty members proceeded with the study.

Given the need to use a computer to analyze multi-variable data, a principal nearing retirement was “forced” to buy a laptop, as she had none. She was nonetheless very open-minded and performed the class activities requiring data analysis with much enthusiasm.

The Research on High School Students’ Use of the Internet

This brief research study using multiple regression analysis examined the underlying factors that significantly relate to the number of hours high school students devote to the Internet. The analysis is broad in the sense that it considers only the total number of hours students spend online, without breaking that time down by activity.

They correlated the time high school students spent online with their profile. The students’ profile comprised more than two independent variables, hence the term “multiple.” The independent variables are age, gender, relationship with the mother, and relationship with the father.

The statement of the problem in this study is:

“Is there a significant relationship between the total number of hours spent online and the students’ age, gender, relationship with their mother, and relationship with their father?”

The students’ relationships with their parents were gauged on a scale of 1 to 10, with 1 being a poor relationship and 10 being the best. The figure below shows the paradigm of the study.

[Figure: paradigm of the multiple regression study]

Notice that in research using multiple regression studies such as this, there is only one dependent variable involved. That is the total number of hours spent by high school students online.

Although many studies have identified factors that influence the use of the internet, it is standard practice to include the respondents’ profile among the set of predictor or independent variables. Hence, the standard variables age and gender are included in the multiple regression analysis.

Also, among the set of variables that may influence internet use, only the relationship between children and their parents was tested. The intention of this research using multiple regression analysis is to determine if parents spend quality time establishing strong emotional bonds between them and their children.
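To make the setup concrete, here is a hedged sketch of how the same model could be fitted in Python with statsmodels. The survey rows are invented stand-ins for the students' data, with gender coded 0/1 and the parent-relationship scores on the 1-to-10 scale described above:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented survey rows: weekly hours online plus the four predictors
df = pd.DataFrame({
    "hours_online": [28, 15, 35, 10, 22, 18, 30, 12],
    "age":          [14, 15, 16, 14, 15, 16, 17, 15],
    "gender":       [1, 0, 1, 0, 0, 1, 1, 0],   # 1 = male, 0 = female
    "mother_rel":   [4, 8, 3, 9, 6, 7, 4, 9],   # 1-10 relationship scale
    "father_rel":   [5, 7, 4, 8, 6, 6, 5, 8],
})

model = smf.ols(
    "hours_online ~ age + gender + mother_rel + father_rel", data=df
).fit()
print(model.summary())  # look for a significant negative mother_rel coefficient
```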


Findings of the Research Using Multiple Regression Analysis

What are the findings of this exploratory study? This quick example of a research study using multiple regression analysis revealed an interesting finding.

The number of hours spent online relates significantly to the number of hours spent by a parent, specifically the mother, with her child. These two factors are inversely or negatively correlated.

The relationship means that the greater the number of hours the mother spends with her child to establish a closer emotional bond, the fewer hours her child spends on the internet. In other words, the number of hours children spend online relates significantly, and inversely, to the number of hours their mothers spend interacting with them.

While this example of a research study using multiple regression analysis yielded a significant finding, the mother-child bond accounts for only a small percentage of the variance in the total hours a child spends online. This observation means that other factors need to be addressed to explain children's long waking hours online and their neglect of serious study.

But establishing a close bond between mother and child is a good start. Undertaking more investigations along this research concern will help strengthen the findings of this study.

The above example of a research using multiple regression analysis shows that the statistical tool is useful in predicting dependent variables’ behavior. In the above case, this is the number of hours spent by students online.

The identification of significant predictors can help determine the correct intervention to resolve the problem. Using multiple regression approaches prevents unnecessary costs for remedies that do not address an issue or a question.

Thus, this example of a research using multiple regression analysis streamlines solutions and focuses on those influential factors that must be given attention.

Once you become an expert in using multiple regression in analyzing data, you can try your hands on multivariate regression where you will deal with more than one dependent variable.

©2012 November 11 Patrick Regoniel Updated: 14 November 2020

About the Author

Dr. Regoniel, a faculty member of the graduate school, served as consultant to various environmental research and development projects covering issues and concerns on climate change, coral reef resources and management, economic valuation of environmental and natural resources, mining, and waste management and pollution. He has extensive experience in applied statistics and systems modelling and analysis, is an avid practitioner of LaTeX, and is a multidisciplinary web developer. He leverages pioneering AI-powered content creation tools to produce unique and comprehensive articles on this website.



Stroke and frailty index: a two-sample Mendelian randomisation study

  • Original Article
  • Open access
  • Published: 22 May 2024
  • Volume 36, article number 114 (2024)


  • Jiangnan Wei 1 ,
  • Jiaxian Wang 1 ,
  • Jiayin Chen 1 ,
  • Kezhou Yang 1 &
  • Ning Liu 2  


Introduction

Previous observational studies have found an increased risk of frailty in patients with stroke. However, evidence of a causal relationship between stroke and frailty is scarce. The aim of this study was to investigate the potential causal relationship between stroke and frailty index (FI).

Pooled data on stroke and frailty were obtained from genome-wide association studies (GWAS). The MEGASTROKE Consortium provided data on stroke (N = 40,585), ischemic stroke (IS, N = 34,217), large-vessel atherosclerotic stroke (LAS, N = 4,373), and cardioembolic stroke (CES, N = 7,193). Summary statistics for the FI were obtained from the most recent GWAS meta-analysis of UK Biobank participants and Swedish TwinGene participants of European ancestry (N = 175,226). Two-sample Mendelian randomization (MR) analyses were performed by inverse variance weighting (IVW), the weighted median method, MR-Egger regression, the simple mode, and the weighted mode, and heterogeneity and horizontal pleiotropy were assessed using Cochran’s Q test and the MR-Egger regression intercept test.

The results of the current MR study showed a significant association between genetically predicted stroke and FI (odds ratio 1.104, 95% confidence interval 1.064–1.144, P < 0.001). Among stroke subtypes, IS (odds ratio 1.081, 95% confidence interval 1.044–1.120, P < 0.001) and LAS (odds ratio 1.037, 95% confidence interval 1.012–1.062, P = 0.005) were also significantly associated with FI. There was no causal relationship between genetically predicted CES and FI. Horizontal pleiotropy was not detected by the MR-Egger regression intercept test (P > 0.05), nor was heterogeneity detected (P > 0.05).

Conclusions

This study provides evidence for a causal relationship between stroke and FI and offers new insights into the genetic study of FI.


Stroke is a cerebrovascular lesion caused by sudden cerebrovascular injury with high morbidity, disability, mortality and recurrence rates [ 1 ]. Stroke is the second leading cause of death worldwide, with an annual death rate of about 5.5 million [ 2 ]. Studies have shown that the incidence of stroke increases dramatically with age, with about three-quarters of all strokes occurring in people over the age of 65 [ 3 ]. The United Nations Population Prospects predicts that the number of people aged 60 years and older will reach 2.1 billion globally by 2050 and 3.2 billion by 2100 [ 4 ]. Elderly stroke patients suffer from long-term sequelae such as incapacitation, emotional deficits, and cognitive disorders [ 5 ], and because of the long duration of illness, high medical costs, and poor adherence to treatment, elderly stroke patients impose a heavy burden on society, families, and patients [ 6 ].

Frailty, characterized by age-related multisystem dysfunction, is a major public health problem in older adults [ 7 , 8 ]. Studies show that frailty is common in stroke, with at least a quarter of stroke victims being physically frail [ 9 ]. Understanding the potential association between frailty and senescence-related diseases, and the underlying mechanisms, may facilitate the individualized management and early intervention of frail patients. The FI reflects the accumulation of physiological deficits in the various systems of the body and assesses frailty by counting deficits in health variables, taking into account the effects of physical, psychological, and social factors on the human body [ 10 ]. Studies have shown that frailty can be assessed by physicians or researchers in the early stages of stroke by means of a scale survey, and the applicability of the FI in stroke patients has been confirmed [ 11 ].

Randomized controlled trials (RCTs) are the gold standard of clinical evidence and are widely used to infer causality, mainly because randomized grouping eliminates confounding bias. However, RCTs require a great deal of time, money, and human resources, and are therefore difficult to perform in many medical settings. Mendelian randomization (MR) is a genetic-epidemiology study design that explores causal relationships between exposures and outcomes by using genetic variation as an instrumental variable (IV) [ 12 ]. Because it can overcome the effects of potential confounding and reverse causation, MR has been used increasingly in observational studies in recent years [ 13 ]. MR research has been facilitated by the discovery of a large number of genetic variants that are strongly associated with specific traits, and by the public release of pooled data on the associations of exposures and diseases with genetic variants from many large-sample genome-wide association studies (GWAS), which allow researchers to estimate genetic associations in large samples. By using genetic variables, MR avoids reverse causality and minimizes the interference of environmental factors in a manner similar to an RCT [ 14 ].

In this study, we aimed to investigate the causal relationship between stroke and frailty index using MR methods.

Materials and methods

Mendelian randomization assumptions

The MR method is an instrumental variables analysis that uses genetic variants as proxies for exposure. As in Fig. 1 , the MR analysis relies on 3 important assumptions: (1) instrumental variables are closely related to exposure factors; (2) instrumental variables are independent of confounding factors; and (3) instrumental variables affect outcome only through exposure and not through other means [ 15 ].

[Figure 1: Design and main assumptions of our Mendelian randomization study]

Data sources and SNP selection for stroke

The exposure factor for this study was defined as stroke, and data were obtained from the MEGASTROKE consortium, covering 446,696 individuals of European ancestry (406,111 non-stroke controls and 40,585 stroke cases). The total number of ischemic stroke cases was 34,217, including 4,373 large-vessel atherosclerotic stroke (LAS) cases, 5,386 small-vessel stroke cases, and 7,193 cardioembolic stroke (CES) cases.

Data sources and SNP selection for frailty index

Frailty is commonly defined using the Frailty Index (FI) [ 16 ]. In this MR study, frailty was measured according to the FI, which was calculated from the accumulation of 44–49 self-reported health deficits over the life course. Summary statistics for the FI were obtained from the most recent GWAS meta-analysis of UK Biobank participants and Swedish TwinGene participants of European ancestry (N = 175,226).

Genetic instrumental variable selection

To meet the above assumptions, the specific instrumental-variable screening criteria were: (1) the association of each single-nucleotide polymorphism (SNP) with the exposure was genome-wide significant (P < 5 × 10⁻⁸); (2) linkage disequilibrium (LD) between SNPs was calculated using the European-ancestry panel of the 1000 Genomes Project as the reference, and SNPs with r² < 0.001 within a physical distance of 10,000 kb were retained; (3) SNPs with minor allele frequencies < 0.01 were removed; (4) SNPs with F values < 10 were excluded to avoid weak-instrument bias, with the F statistic calculated as F = R²(N − 2)/(1 − R²), where R² = 2 × EAF × (1 − EAF) × β² [ 17 ]; the F values were all > 10, indicating that our IVs were not biased by weak instruments; (5) MR-Steiger filtering [ 18 ] was applied to assess the causal direction of each SNP on exposure and outcome, and SNPs showing reverse causality were excluded.
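Criterion (4) can be computed directly from the formulas above. A minimal sketch in Python (the SNP values are invented for illustration):

```python
def instrument_f_statistic(eaf: float, beta: float, n: int) -> float:
    """Approximate F statistic for a single SNP instrument.

    R^2 = 2 * EAF * (1 - EAF) * beta^2  (variance explained by the SNP)
    F   = R^2 * (N - 2) / (1 - R^2)
    """
    r2 = 2 * eaf * (1 - eaf) * beta**2
    return r2 * (n - 2) / (1 - r2)

# Invented SNP: effect-allele frequency 0.30, effect size 0.05, N = 446,696
print(instrument_f_statistic(0.30, 0.05, 446_696))  # well above the threshold of 10
```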

Statistical analysis

Two-sample Mendelian randomization analysis

We performed two-sample MR analyses using the inverse-variance-weighted (IVW) method as the primary approach. Four other MR methods based on different model assumptions were also used: the weighted median method [ 19 ], MR-Egger regression [ 20 ], the weighted mode, and the simple mode. In MR-Egger regression, an intercept different from the origin can be used to assess potential pleiotropic effects [ 21 ]. Together, these MR methods test the stability and reliability of the association under different assumptions.
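For intuition about what the primary method computes: the fixed-effect IVW estimate is an inverse-variance-weighted average of the per-SNP Wald ratios (SNP-outcome effect divided by SNP-exposure effect). A minimal sketch in Python with invented summary statistics (the actual analyses were run with the TwoSampleMR package in R):

```python
import numpy as np

# Invented per-SNP summary statistics
beta_exposure = np.array([0.10, 0.08, 0.12, 0.09])     # SNP-stroke effects
beta_outcome = np.array([0.011, 0.007, 0.014, 0.010])  # SNP-FI effects
se_outcome = np.array([0.003, 0.004, 0.003, 0.005])

# Wald ratio per SNP and its first-order standard error
ratio = beta_outcome / beta_exposure
ratio_se = se_outcome / np.abs(beta_exposure)

# IVW: weight each ratio by the inverse of its variance
w = 1.0 / ratio_se**2
beta_ivw = np.sum(w * ratio) / np.sum(w)
se_ivw = np.sqrt(1.0 / np.sum(w))
print(beta_ivw, se_ivw)  # exponentiating beta_ivw gives the odds-ratio scale
```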

Sensitivity and power analysis

In this study, the MR-Egger intercept was used to detect horizontal pleiotropy: if the intercept term differs significantly from 0, the study exhibits horizontal pleiotropy [ 22 ]. Cochran’s Q test was applied to determine the heterogeneity of SNPs [ 23 ]; if the Cochran’s Q statistic is statistically significant, indicating significant heterogeneity, the results of the random-effects IVW method are emphasized. A leave-one-out sensitivity test was used for sensitivity analysis [ 24 ]: each SNP is removed in turn, and the MR results are considered robust if the estimates from the remaining SNPs do not differ significantly from the overall results. The above methods were implemented using the TwoSampleMR package in R 4.2.3, with a significance level of α = 0.05.

Instrumental variable

According to the screening criteria for instrumental variables in this study, 17 SNPs were retained for all stroke, 18 for ischemic stroke, 4 for large-artery atherosclerotic stroke, and 4 for cardioembolic stroke. Table 1 shows the 18 SNPs significantly associated with ischemic stroke. The MR-Egger regression intercepts were b(stroke) = 0.0179 (P = 0.8888), b(IS) = 0.1234 (P = 0.2288), b(LAS) = −0.0121 (P = 0.8912), and b(CES) = 0.0306 (P = 0.3773). That is, there was no genetic pleiotropy between the screened SNPs and the outcome frailty index, so the Mendelian randomization method was a valid method for causal inference in this study.

Mendelian randomization analysis

The results of the IVW method showed causal relationships between FI and stroke (OR = 1.104, 95% CI 1.064–1.144, p < 0.001), IS (OR = 1.081, 95% CI 1.044–1.120, p < 0.001), and LAS (OR = 1.037, 95% CI 1.012–1.062, p = 0.005). In the weighted median analysis, stroke (OR = 1.085, 95% CI 1.037–1.135, p < 0.001), IS (OR = 1.081, 95% CI 1.034–1.129, p < 0.001), and LAS (OR = 1.035, 95% CI 1.005–1.066, p = 0.026) were also causally related to FI. However, there was no evidence to support a causal relationship between CES and FI. MR estimates and power analyses for stroke and FI are shown in Table 2 , and scatter plots of the MR analyses for the 5 methods are shown in Fig. 2 .

[Figure 2: Scatter plots of the results of 5 MR methods: A stroke; B IS; C LAS; D CES]

In this study, Cochran’s Q test was used to assess the heterogeneity of the results under both the IVW and MR-Egger analyses. The Cochran’s Q test for the IVW method gave Q(stroke) = 22.193, Q(IS) = 26.370, Q(LAS) = 1.813, and Q(CES) = 1.208; for MR-Egger it gave Q(stroke) = 21.590, Q(IS) = 26.016, Q(LAS) = 1.421, and Q(CES) = 1.084. Horizontal pleiotropy was assessed using the MR-Egger intercept term. The p-values for heterogeneity and horizontal pleiotropy were all greater than 0.05, so neither was present in this study, and the results therefore carry a weak risk of bias and high reliability. In addition, we performed leave-one-out analyses; after removing each SNP in turn, the b-values of the remaining SNPs were (0.089–0.105), (0.694–0.841), (0.012–0.150), and (0.012–0.027), respectively, with p < 0.01. The b-values were all > 0 and in the same direction, indicating that removing any single SNP had little effect on the results, and a positive causal association between stroke and the frailty index was still observed. Sensitivity analyses and forest plots of the association between genetically predicted stroke and FI are shown in Fig. 3 .

[Figure 3: Sensitivity analysis of the association between genetically predicted stroke and FI. A: leave-one-out sensitivity analysis; B: forest plot. IS, ischaemic stroke; LAS, large-artery atherosclerosis stroke; CES, cardioembolic stroke]

In this study, based on large-scale GWAS pooled data, a two-sample MR method was used to analyze the causal association between stroke, IS, LAS, CES and FI. The results showed that stroke, IS, LAS, and CES were risk factors for FI. Further sensitivity analyses showed the consistency and reliability of these results.

Our finding that stroke may cause frailty provides support for earlier observational studies. Previous studies have reported an association between stroke and frailty [ 25 ]. One study found that 12.8% of ischemic stroke patients and 10.3% of hemorrhagic stroke patients were already frail before their stroke, and that the degree of frailty worsens after stroke [ 26 ]; stroke may thus accelerate the onset and progression of physical frailty. In a large-scale assessment of frailty, Hanlon et al. found that frailty was common among stroke survivors. Rodriguez et al., surveying eight urban and four rural areas in eight countries (Cuba, the Dominican Republic, Puerto Rico, Venezuela, Peru, Mexico, China, and India), found an overall prevalence of frailty of 15.2% in older adults and a positive correlation between frailty and stroke. Palmer et al. [ 27 ] found that the incidence of frailty was twice as high in stroke patients as in those who had not had a stroke [ 28 ]. Rowan et al. found frailty in about a quarter of acute stroke patients through a cross-sectional survey [ 11 ]. Stroke increases the risk of frailty, and the prevalence of frailty varies widely by region and country, while frailty imposes a serious burden on stroke patients and reduces their quality of life.

Exploring the causal relationship between these two conditions may be difficult because they share risk factors such as hyperglycemia and dyslipidemia. In addition, Evans et al. showed that neurological deficits after stroke may exacerbate the phenotypic features of frailty, that hemodynamic changes in the central and peripheral vasculature occur with age, and that frailty is associated with impaired cerebral autoregulation; a history of previous stroke is an important factor in the transition from robust to frail status and in the worsening of the frailty trajectory [ 29 ]. Hanotier et al. pointed out that prolonged malnutrition can easily lead to electrolyte disorders in the elderly; once a stroke occurs, insufficient nutritional intake worsens the electrolyte disorder and body mass decreases sharply, ultimately leading to frailty [ 30 ]. Stroke patients have varying degrees of neurological dysfunction, which can reduce skeletal muscle mass to a certain extent [ 31 ], leading to sarcopenia, reduced limb muscle strength and grip strength, and an increased risk of physical frailty.

Zhu et al. [ 8 ] conducted a bidirectional Mendelian randomisation study of FI and stroke and found that FI was significantly associated with stroke in both directions, but stroke subgroup analyses were not performed; this partly supports our findings. Liu et al. [ 32 ] found a suggestive association between FI and any stroke, and FI was associated with a higher risk of LAS, but there was no causal association between FI and IS or small-vessel stroke. However, our study found that stroke, IS, and LAS were all causally related to FI, which may be because our study examined the opposite causal direction from theirs.

The main strengths of this study include the use of MR for causal inference and analyzing a large sample group. MR methods can effectively avoid the drawbacks of uncertain residual confounding and reverse causality that exist in traditional observational study methods [ 21 ]. Data on stroke and frailty are derived from existing large GWAS, which allows for more precise assessment of effect sizes than individual-level data or results from studies with limited sample sizes. Inevitably, there are some limitations. First, the two-sample Mendelian randomization method assumes a correlation between the exposure factor and the outcome, and the MR method is not applicable if the relationship is nonlinear. Second, the results of the analysis in this study are based only on populations of European origin, so further research and validation are needed for generalization to other populations. Finally, database statistics are difficult to analyze stratified by sex or age, which may lead to biased findings.

In summary, we found a causal relationship between stroke and its subtypes and debility by two-sample MR analysis. Further studies are needed to elucidate the potential mechanisms underlying the various causal relationships between stroke subtypes and debility.

Data availability

All data are publicly available. Detailed information for these datasets is summarized in supplementary material.

Collaborators GBDS (2021) Global, regional, and national burden of stroke and its risk factors, 1990–2019: a systematic analysis for the global burden of disease study 2019. Lancet Neurol 20:795–820


Paul S, Candelario-Jalil E (2021) Emerging neuroprotective strategies for the treatment of ischemic stroke: an overview of clinical and preclinical studies. Exp Neurol 335:113518


Myint PK, Sinha S, Luben RN et al (2008) Risk factors for first-ever stroke in the EPIC-Norfolk prospective population-based study. Eur J Cardiovasc Prev Rehabil 15:663–9


Sanuade OA, Dodoo FN, Koram K et al (2019) Prevalence and correlates of stroke among older adults in Ghana: evidence from the study on global AGEing and adult health (SAGE). PLoS ONE 14:e0212623


Nguyen TV, Le D, Tran KD et al (2019) Frailty in older patients with acute coronary syndrome in Vietnam. Clin Interv Aging 14:2213–22


Blanco S, Ferrieres J, Bongard V et al (2017) Prognosis impact of frailty assessed by the edmonton frail scale in the setting of acute coronary syndrome in the elderly. Can J Cardiol 33:933–9

Proietti M, Cesari M (2020) Frailty: what is it? Adv Exp Med Biol 1216:1–7

Zhu J, Zhou D, Wang J et al (2022) Frailty and cardiometabolic diseases: a bidirectional Mendelian randomisation study. Age Ageing 51:afac256


Burton JK, Stewart J, Blair M et al (2022) Prevalence and implications of frailty in acute stroke: systematic review & meta-analysis. Age Ageing 51:afac064

Sonny A, Kurz A, Skolaris LA et al (2020) Deficit accumulation and phenotype assessments of frailty both poorly predict duration of hospitalization and serious complications after noncardiac surgery. Anesthesiology 132:82–94

Taylor-Rowan M, Cuthbertson G, Keir R et al (2019) The prevalence of frailty among acute stroke patients, and evaluation of method of assessment. Clin Rehabil 33:1688–96

Davies NM, Holmes MV, Davey Smith G (2018) Reading Mendelian randomisation studies: a guide, glossary, and checklist for clinicians. BMJ 362:k601

Burgess S, Timpson NJ, Ebrahim S et al (2015) Mendelian randomization: where are we now and where are we going? Int J Epidemiol 44:379–88

Larsson SC, Traylor M, Markus HS (2019) Homocysteine and small vessel stroke: a mendelian randomization analysis. Ann Neurol 85:495–501

Zheng J, Baird D, Borges MC et al (2017) Recent developments in Mendelian randomization studies. Curr Epidemiol Rep 4:330–45

Atkins JL, Jylhava J, Pedersen NL et al (2021) A genome-wide association study of the frailty index highlights brain pathways in ageing. Aging Cell 20:e13459

Gao L, Di X, Gao L et al (2023) The Frailty Index and colon cancer: a 2-sample Mendelian-randomization study. J Gastrointest Oncol 14:798–805

Hemani G, Tilling K, Davey Smith G (2017) Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS Genet 13:e1007081

Bowden J, Davey Smith G, Haycock PC et al (2016) Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genet Epidemiol 40:304–14

Bowden J, Davey Smith G, Burgess S (2015) Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int J Epidemiol 44:512–25

Zhao H, Zhu J, Ju L et al (2022) Osteoarthritis & stroke: a bidirectional mendelian randomization study. Osteoarthr Cartil 30:1390–7


Martin S, Tyrrell J, Thomas EL et al (2022) Disease consequences of higher adiposity uncoupled from its adverse metabolic effects using Mendelian randomisation. Elife 11:e72452

Bowden J, Spiller W, Del Greco MF et al (2018) Improving the visualization, interpretation and analysis of two-sample summary data Mendelian randomization via the radial plot and radial regression. Int J Epidemiol 47:2100

Gronau QF, Wagenmakers EJ (2019) Limitations of bayesian leave-one-out cross-validation for model selection. Comput Brain Behav 2:1–11

Calado LB, Ferriolli E, Moriguti JC et al (2016) Frailty syndrome in an independent urban population in Brazil (FIBRA study): a cross-sectional populational study. Sao Paulo Med J 134:385–392

Article   PubMed Central   Google Scholar  

Kanai M, Noguchi M, Kubo H et al (2020) Pre-stroke frailty and stroke severity in elderly patients with acute stroke. J Stroke Cerebrovasc Dis 29:105346

Llibre Rodriguez JJ, Prina AM, Acosta D et al (2018) The prevalence and correlates of frailty in urban and rural populations in latin America, China, and India: a 10/66 population-based survey. J Am Med Dir Assoc 19:287–95.e4

Palmer K, Vetrano DL, Padua L et al (2019) Frailty syndromes in persons with cerebrovascular disease: a systematic review and meta-analysis. Front Neurol 10:1255

Evans NR, Todd OM, Minhas JS et al (2022) Frailty and cerebrovascular disease: concepts and clinical implications for stroke medicine. Int J Stroke 17:251–9

Hanotier P (2015) Hyponatremia in the elderly: its role in frailty. Rev Med Brux 36:475–84

CAS   PubMed   Google Scholar  

Jung H, Kim M, Lee Y et al (2020) Prevalence of physical frailty and its multidimensional risk factors in Korean community-dwelling older adults: findings from Korean frailty and aging cohort study. Int J Environ Res Public Health 17:7883

Liu W, Zhang L, Fang H et al (2022) Genetically predicted frailty index and risk of stroke and Alzheimer’s disease. Eur J Neurol 29:1913–21

Download references

Funding

This study was supported by the National Natural Science Foundation of China (Project No. 82260281); by the Science and Technology Fund Project of Guizhou Provincial Health Commission (gzwkj2023-588), a scientific study on the construction of a shared medical model and optimization mechanism for patients with post-stroke depression in low-resource areas; and by the Guizhou Science and Technology Plan Project, Guizhou Science and Technology Cooperation (Qiankehe) Foundation (No. ZK [2024] Key Project 069).

Author information

Authors and Affiliations

Department of Nursing, Zhuhai Campus of Zunyi Medical University, No. 368 Jinwan Road, Zhuhai, Guangdong, China

Jiangnan Wei, Jiaxian Wang, Jiayin Chen & Kezhou Yang

Department of Fundamentals, Department of Basic Teaching and Research in General Medicine, Zunyi Medical University Zhuhai Campus, Zhuhai, Guangdong, China

Ning Liu


Contributions

Jiangnan Wei wrote the main manuscript text; Jiaxian Wang prepared the figures and tables; Jiayin Chen performed the data collection; Kezhou Yang organized the information; and Ning Liu revised the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Ning Liu.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

This study used publicly available summary-level data from GWAS databases; the contributing studies had been approved by their respective ethical review boards.

Informed consent

Informed consent was obtained from all participants.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (DOCX 432 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Wei, J., Wang, J., Chen, J. et al. Stroke and frailty index: a two-sample Mendelian randomisation study. Aging Clin Exp Res 36, 114 (2024). https://doi.org/10.1007/s40520-024-02777-9


Received: 29 November 2023

Accepted: 15 May 2024

Published: 22 May 2024

DOI: https://doi.org/10.1007/s40520-024-02777-9


Keywords

  • Mendelian randomization
  • Genetic analyses

IMAGES

  1. Regression analysis: What it means and how to interpret the outcome

  2. What is regression analysis?

  3. Linear Regression model sample illustration

  4. What Is And How To Use A Multiple Regression Equation Model Example

  5. Building a Regression Model

  6. PPT

VIDEO

  1. Regression Analysis

  2. Regression Analysis: An introduction to Linear and Logistic Regression

  3. An Introduction to Linear Regression Analysis

  4. Video 1: Introduction to Simple Linear Regression

  5. Regression analysis

  6. Regression Analysis with Multiple Dependent Variables

COMMENTS

  1. Regression Analysis

    Regression analysis also helps in predicting health outcomes based on various factors like age, genetic markers, or lifestyle choices. Social Sciences: Regression analysis is widely used in social sciences like sociology, psychology, and education research. Researchers can investigate the impact of variables like income, education level, or ...

  2. Simple Linear Regression

    Regression allows you to estimate how a dependent variable changes as the independent variable(s) change. Simple linear regression example. You are a social researcher interested in the relationship between income and happiness. You survey 500 people whose incomes range from 15k to 75k and ask them to rank their happiness on a scale from 1 to ...
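To make the survey example above concrete, here is a minimal Python sketch. The snippet describes the data only in prose, so the simulated incomes, the happiness formula, and the noise level below are illustrative assumptions, not the study's actual values:

```python
# Illustrative simple linear regression: happiness ~ income.
# All data are simulated; the "true" intercept and slope are arbitrary choices.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
income = rng.uniform(15_000, 75_000, size=500)                      # 500 respondents
happiness = 1.0 + 0.00005 * income + rng.normal(0, 0.7, size=500)   # hypothetical relationship

X = sm.add_constant(income)         # adds the intercept column
model = sm.OLS(happiness, X).fit()  # ordinary least squares fit
print(model.params)                 # [intercept, slope]
print(model.summary())              # coefficients, p-values, R-squared
```

The fitted slope estimates how much predicted happiness changes per additional unit of income, which is exactly the quantity the snippet says regression lets you estimate.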

  3. Regression Tutorial with Analysis Examples

    My tutorial helps you go through the regression content in a systematic and logical order. This tutorial covers many facets of regression analysis including selecting the correct type of regression analysis, specifying the best model, interpreting the results, assessing the fit of the model, generating predictions, and checking the assumptions.

  4. A Refresher on Regression Analysis

    A Refresher on Regression Analysis. Understanding one of the most important types of data analysis. By Amy Gallo, November 04, 2015. You probably know by now that ...

  5. Multiple Linear Regression

    The formula for a multiple linear regression is \( \hat{y} = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + \epsilon \), where \( \hat{y} \) is the predicted value of the dependent variable, \( \beta_0 \) is the y-intercept (the value of y when all other parameters are set to 0), and \( \beta_1 \) is the regression coefficient of the first independent variable \( X_1 \) (i.e., the effect that increasing the value of that independent variable has on the predicted y value) ...

  6. The Complete Guide To Simple Regression Analysis

    The easiest way to add a simple linear regression line in Excel is to install and use Excel's "Analysis Toolpak" add-in. To do this, go to Tools > Excel Add-ins and select the "Analysis Toolpak". Next, follow these steps: In your spreadsheet, enter your data for X and Y in two columns. Navigate to the "Data" tab and click on the ...

  7. Regression Analysis

    Regression analysis is a quantitative research method which is used when the study involves modelling and analysing several variables, where the relationship includes a dependent variable and one or more independent variables. In simple terms, regression analysis is a quantitative method used to test the nature of relationships between a dependent variable and one or more independent variables.

  8. Regression Analysis

    Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables. It can be utilized to assess the strength of the relationship between variables and for modeling the future relationship between them. Regression analysis includes several variations ...

  9. The clinician's guide to interpreting a regression analysis

    Regression analysis is an important statistical method that is commonly used to determine the relationship ... Clinical example. ... Schober P, Vetter TR. Logistic regression in medical research ...

  10. Regression Analysis: Step by Step Articles, Videos, Simple Definitions

    Step 1: Type your data into two columns in Minitab. Step 2: Click "Stat", then click "Regression", and then click "Fitted Line Plot". Step 3: Click a variable name for the dependent value in the left-hand window.

  11. The Complete Guide to Linear Regression Analysis

    With a simple calculation, we can find the values of β0 and β1 that minimize the RSS. With the statsmodels library in Python, we can find the coefficients (Table 1: simple regression of sales on TV). The values for β0 and β1 are 7.03 and 0.047, respectively, so the relation becomes: Sales = 7.03 + 0.047 × TV.
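A minimal sketch of that statsmodels workflow follows; the file name advertising.csv and the column names sales and TV are assumptions for illustration, and the printed coefficients will match 7.03 and 0.047 only on the original advertising dataset:

```python
# Minimal statsmodels sketch: regress sales on TV advertising spend.
# 'advertising.csv' and its column names are assumed for illustration.
import pandas as pd
import statsmodels.formula.api as smf

ads = pd.read_csv("advertising.csv")           # expects columns 'sales' and 'TV'
model = smf.ols("sales ~ TV", data=ads).fit()  # OLS: sales = b0 + b1 * TV
print(model.params)                            # intercept and slope estimates
```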

  12. Regression Analysis: The Complete Guide

    When running regression analysis, be it a simple linear or multiple regression, it's really important to check that the assumptions your chosen method requires have been met. If your data points don't conform to a straight line of best fit, for example, you need to apply additional statistical modifications to accommodate the non-linear data.

  13. Regression Analysis

    The aim of linear regression analysis is to estimate the coefficients of the regression equation \(b_0\) and \(b_k\) (\(k \in K\)) so that the sum of the squared residuals (i.e., the sum over all squared differences between the observed values \(y_i\) and the corresponding predicted values \(\hat{y}_i\)) is minimized. The lower part of Fig. 1 illustrates this approach, which is ...
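Written out with the snippet's own symbols (the predictor notation \(x_{ik}\) is added here for completeness), the least-squares criterion it describes is:

\[
\min_{b_0,\, b_k} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad \hat{y}_i = b_0 + \sum_{k \in K} b_k x_{ik}
\]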

  14. A short intro to linear regression analysis using survey data

    Bivariate regression models with survey data. In the Center's 2016 post-election survey, respondents were asked to rate then President-elect Donald Trump on a 0–100 "feeling thermometer." Respondents were told, "a rating of zero degrees means you feel as cold and negative as possible. A rating of 100 degrees means you feel as warm ...

  15. Linear Regression in R

    Table of contents. Getting started in R. Step 1: Load the data into R. Step 2: Make sure your data meet the assumptions. Step 3: Perform the linear regression analysis. Step 4: Check for homoscedasticity. Step 5: Visualize the results with a graph. Step 6: Report your results. Other interesting articles.
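The tutorial itself is written in R; as a rough Python analog of the same steps (the file heart.csv and its column names are assumptions for illustration), one might write:

```python
# Python analog of the tutorial's workflow: load data, fit a linear model,
# check homoscedasticity, and visualize. Dataset and columns are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

df = pd.read_csv("heart.csv")  # assumed columns: heart_disease, biking, smoking
fit = smf.ols("heart_disease ~ biking + smoking", data=df).fit()
print(fit.summary())           # coefficients, p-values, R-squared

# Residuals vs. fitted values: a fan shape suggests heteroscedasticity.
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```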

  16. Linear Regression Analysis

    The theory is briefly explained, and the interpretation of statistical parameters is illustrated with examples. The methods of regression analysis are comprehensively discussed in many standard textbooks (1–3). ... The study of relationships between variables and the generation of risk scores are very important elements of medical research ...

  17. Understanding and interpreting regression analysis

    Example. Building on her research interest mentioned at the beginning, let us consider a study by Ali and Naylor [4]. They were interested in identifying the academic and non-academic factors which predict the academic success of nursing diploma students. This purpose is consistent with one of the above-mentioned purposes of regression analysis (i.e., prediction).

  18. Regression analysis

    In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one or more independent variables (often called 'predictors', 'covariates', 'explanatory variables ...

  19. What Is Regression Analysis in Business Analytics?

    Regression analysis is the statistical method used to determine the structure of a relationship between two variables (single linear regression) or three or more variables (multiple regression). According to the Harvard Business School Online course Business Analytics, regression is used for two primary purposes: To study the magnitude and ...

  20. When to Use Regression Analysis (With Examples)

    The takeaway message is that regression analysis enabled them to quantify that association while adjusting for smoking, alcohol consumption, physical activity, educational level and marital status — all potential confounders of the relationship between BMI and mortality. 2. Predict an outcome using known factors.

  21. Regression: Definition, Analysis, Calculation, and Example

    Regression is a statistical measure used in finance, investing and other disciplines that attempts to determine the strength of the relationship between one dependent variable (usually denoted by ...

  22. (PDF) Regression Analysis

    7.1 Introduction. Regression analysis is one of the most frequently used tools in market research. In its simplest form, regression analysis allows market researchers to analyze relationships ...

  23. Research Using Multiple Regression Analysis: 1 Example with Conceptual

    This brief research example uses multiple regression analysis to study the underlying factors that significantly relate to the number of hours high school students devote to using the Internet. The regression analysis is broad because it focuses only on the total number of hours devoted by high school students to ...

  24. Multiple Linear Regression

    Abstract. Multiple linear regression generalizes straight line regression to allow multiple explanatory (or predictor) variables, in this chapter under the normal errors assumption. The focus may ...

  25. What is a Case Study? Definition & Examples

    A case study is an in-depth investigation of a single person, group, event, or community. This research method involves intensively analyzing a subject to understand its complexity and context. The richness of a case study comes from its ability to capture detailed, qualitative data that can offer insights into a process or subject matter that ...

  26. Stroke and frailty index: a two-sample Mendelian ...

    Statistical analysis: two-sample Mendelian randomization analysis. We performed two-sample MR analyses using the inverse-variance-weighted (IVW) method as the primary approach. Four other MR methods based on different model assumptions were also used: the weighted median method, MR-Egger regression, the weighted mode, and the simple mode.
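For intuition, here is a minimal, illustrative sketch of the IVW estimator; the summary statistics below are made-up numbers, not the study's data. Under the standard MR assumptions, IVW is a weighted average of per-variant Wald ratios, weighted by the inverse of each ratio's first-order variance:

```python
# Illustrative inverse-variance-weighted (IVW) Mendelian randomization estimate.
# beta_exp: SNP-exposure effects; beta_out: SNP-outcome effects;
# se_out: standard errors of the SNP-outcome effects. All values are made up.
import numpy as np

beta_exp = np.array([0.08, 0.11, 0.05, 0.09])
beta_out = np.array([0.020, 0.031, 0.012, 0.024])
se_out   = np.array([0.010, 0.012, 0.008, 0.011])

ratio = beta_out / beta_exp               # per-variant Wald ratio estimates
weights = (beta_exp / se_out) ** 2        # inverse first-order variance of each ratio
ivw = np.sum(weights * ratio) / np.sum(weights)
se_ivw = 1.0 / np.sqrt(np.sum(weights))   # fixed-effect IVW standard error
print(f"IVW estimate: {ivw:.3f} (SE {se_ivw:.3f})")
```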

  27. JMSE

    Afterward, the regression analysis results are interpreted, considering both statistical and practical significance. The regression model coefficients are interpreted in the context of the research question, and any model assumptions are checked and justified. ... Thus, in a future research line, comparing the sample from this paper to another ...