
Regression Analysis

Regression analysis is a quantitative research method used when a study involves modelling and analysing several variables, where the relationship includes a dependent variable and one or more independent variables. In simple terms, it is a quantitative method for testing the nature of the relationship between a dependent variable and one or more independent variables.

The basic form of regression models includes unknown parameters (β), independent variables (X), and the dependent variable (Y).

A regression model, basically, specifies the relationship of the dependent variable (Y) to a function of the independent variables (X) and the unknown parameters (β):

                                    Y  ≈  f (X, β)   

A regression equation can be used to predict the value of ‘y’ for a given value of ‘x’, where ‘y’ and ‘x’ are two sets of measures from a sample of size ‘n’. For a simple linear regression equation of the form ŷ = a + bx, the least-squares formulae are:

    b = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)

    a = (Σy − b Σx) / n

Do not be intimidated by the visual complexity of the regression formulae above. You do not have to apply them manually; correlation and regression analyses can be run with popular analytical software such as Microsoft Excel, Microsoft Access, SPSS and others.

Linear regression analysis is based on the following set of assumptions:

1. Assumption of linearity. There is a linear relationship between the dependent and independent variables.

2. Assumption of homoscedasticity. The variance of the error term is constant across all values of the independent variables.

3. Assumption of absence of collinearity or multicollinearity. There is no strong correlation between two or more independent variables.

4. Assumption of normal distribution. The errors (residuals) are normally distributed.


Regression Analysis – Methods, Types and Examples

Regression Analysis

Regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’).

Regression Analysis Methodology

Here is a general methodology for performing regression analysis:

  • Define the research question: Clearly state the research question or hypothesis you want to investigate. Identify the dependent variable (also called the response variable or outcome variable) and the independent variables (also called predictor variables or explanatory variables) that you believe are related to the dependent variable.
  • Collect data: Gather the data for the dependent variable and independent variables. Ensure that the data is relevant, accurate, and representative of the population or phenomenon you are studying.
  • Explore the data: Perform exploratory data analysis to understand the characteristics of the data, identify any missing values or outliers, and assess the relationships between variables through scatter plots, histograms, or summary statistics.
  • Choose the regression model: Select an appropriate regression model based on the nature of the variables and the research question. Common regression models include linear regression, multiple regression, logistic regression, polynomial regression, and time series regression, among others.
  • Assess assumptions: Check the assumptions of the regression model. Some common assumptions include linearity (the relationship between variables is linear), independence of errors, homoscedasticity (constant variance of errors), and normality of errors. Violation of these assumptions may require additional steps or alternative models.
  • Estimate the model: Use a suitable method to estimate the parameters of the regression model. The most common method is ordinary least squares (OLS), which minimizes the sum of squared differences between the observed and predicted values of the dependent variable.
  • Interpret the results: Analyze the estimated coefficients, p-values, confidence intervals, and goodness-of-fit measures (e.g., R-squared) to interpret the results. Determine the significance and direction of the relationships between the independent variables and the dependent variable (see the code sketch after this list).
  • Evaluate model performance: Assess the overall performance of the regression model using appropriate measures, such as R-squared, adjusted R-squared, and root mean squared error (RMSE). These measures indicate how well the model fits the data and how much of the variation in the dependent variable is explained by the independent variables.
  • Test assumptions and diagnose problems: Check the residuals (the differences between observed and predicted values) for any patterns or deviations from assumptions. Conduct diagnostic tests, such as examining residual plots, testing for multicollinearity among independent variables, and assessing heteroscedasticity or autocorrelation, if applicable.
  • Make predictions and draw conclusions: Once you have a satisfactory model, use it to make predictions on new or unseen data. Draw conclusions based on the results of the analysis, considering the limitations and potential implications of the findings.
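To make the workflow concrete, here is a minimal sketch in Python of the estimation, interpretation, and diagnostic steps above. It assumes the NumPy and statsmodels libraries are available, and the data are invented purely for illustration:

    import numpy as np
    import statsmodels.api as sm

    # Illustrative data: advertising spend (independent) vs. sales (dependent)
    rng = np.random.default_rng(0)
    advertising = rng.uniform(10, 100, size=50)
    sales = 20 + 1.5 * advertising + rng.normal(0, 10, size=50)

    X = sm.add_constant(advertising)   # adds the intercept term (beta-0)
    model = sm.OLS(sales, X).fit()     # ordinary least squares estimation

    print(model.summary())             # coefficients, p-values, R-squared
    residuals = model.resid            # inspect for patterns when checking assumptions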

Types of Regression Analysis

Types of Regression Analysis are as follows:

Linear Regression

Linear regression is the most basic and widely used form of regression analysis. It models the linear relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting line that minimizes the sum of squared differences between observed and predicted values.

Multiple Regression

Multiple regression extends linear regression by incorporating two or more independent variables to predict the dependent variable. It allows for examining the simultaneous effects of multiple predictors on the outcome variable.

Polynomial Regression

Polynomial regression models non-linear relationships between variables by adding polynomial terms (e.g., squared or cubic terms) to the regression equation. It can capture curved or nonlinear patterns in the data.

Logistic Regression

Logistic regression is used when the dependent variable is binary or categorical. It models the probability of the occurrence of a certain event or outcome based on the independent variables. Logistic regression estimates the coefficients using the logistic function, which transforms the linear combination of predictors into a probability.

Ridge Regression and Lasso Regression

Ridge regression and Lasso regression are techniques used for addressing multicollinearity (high correlation between independent variables) and variable selection. Both methods introduce a penalty term to the regression equation to shrink or eliminate less important variables. Ridge regression uses L2 regularization, while Lasso regression uses L1 regularization.
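As a rough sketch of how the two penalties behave in practice, the snippet below fits both with scikit-learn on synthetic, deliberately correlated predictors; the alpha values are arbitrary illustrations, not recommendations:

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=100)
    x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)   # highly correlated with x1
    X = np.column_stack([x1, x2])
    y = 3 * x1 + rng.normal(size=100)

    ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks correlated coefficients
    lasso = Lasso(alpha=0.1).fit(X, y)   # L1: can set coefficients exactly to zero

    print("ridge coefficients:", ridge.coef_)
    print("lasso coefficients:", lasso.coef_)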

Time Series Regression

Time series regression analyzes the relationship between a dependent variable and independent variables when the data is collected over time. It accounts for autocorrelation and trends in the data and is used in forecasting and studying temporal relationships.

Nonlinear Regression

Nonlinear regression models are used when the relationship between the dependent variable and independent variables is not linear. These models can take various functional forms and require estimation techniques different from those used in linear regression.

Poisson Regression

Poisson regression is employed when the dependent variable represents count data. It models the relationship between the independent variables and the expected count, assuming a Poisson distribution for the dependent variable.

Generalized Linear Models (GLM)

GLMs are a flexible class of regression models that extend the linear regression framework to handle different types of dependent variables, including binary, count, and continuous variables. GLMs incorporate various probability distributions and link functions.
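A brief sketch of a GLM for count data (Poisson regression with its usual log link), using statsmodels on made-up data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    exposure = rng.uniform(1, 10, size=200)             # illustrative predictor
    counts = rng.poisson(np.exp(0.2 + 0.3 * exposure))  # log(E[Y]) is linear

    X = sm.add_constant(exposure)
    glm = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
    print(glm.params)   # estimated intercept and slope on the log scale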

Regression Analysis Formulas

Regression analysis involves estimating the parameters of a regression model to describe the relationship between the dependent variable (Y) and one or more independent variables (X). Here are the basic formulas for linear regression, multiple regression, and logistic regression:

Linear Regression:

Simple Linear Regression Model: Y = β0 + β1X + ε

Multiple Linear Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

In both formulas:

  • Y represents the dependent variable (response variable).
  • X represents the independent variable(s) (predictor variable(s)).
  • β0, β1, β2, …, βn are the regression coefficients or parameters that need to be estimated.
  • ε represents the error term or residual (the difference between the observed and predicted values).
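As a worked illustration of the simple model above, β0 and β1 can be estimated by ordinary least squares with plain NumPy; the numbers are invented:

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

    # Least-squares estimates of the slope (beta-1) and intercept (beta-0)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()

    residuals = Y - (b0 + b1 * X)   # the error term for each observation
    print(b0, b1)                   # about 0.27 and 1.93 for these numbers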

Multiple Regression:

Multiple regression extends the concept of simple linear regression by including multiple independent variables.

Multiple Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

The formulas are similar to those in linear regression, with the addition of more independent variables.

Logistic Regression:

Logistic regression is used when the dependent variable is binary or categorical. The logistic regression model applies a logistic or sigmoid function to the linear combination of the independent variables.

Logistic Regression Model: p = 1 / (1 + e^-(β0 + β1X1 + β2X2 + … + βnXn))

In the formula:

  • p represents the probability of the event occurring (e.g., the probability of success or belonging to a certain category).
  • X1, X2, …, Xn represent the independent variables.
  • e is the base of the natural logarithm.

The logistic function ensures that the predicted probabilities lie between 0 and 1, allowing for binary classification.
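In code, the logistic model amounts to passing the linear combination of predictors through the sigmoid function. A minimal NumPy sketch with made-up coefficients:

    import numpy as np

    def predicted_probability(X, betas, intercept):
        linear_part = intercept + X @ betas        # β0 + β1X1 + ... + βnXn
        return 1.0 / (1.0 + np.exp(-linear_part))  # logistic (sigmoid) function

    X = np.array([[2.0, 1.0],
                  [0.5, 3.0]])                     # two observations, two predictors
    p = predicted_probability(X, betas=np.array([0.8, -0.4]), intercept=-1.0)
    print(p)   # each probability lies strictly between 0 and 1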

Regression Analysis Examples

Regression Analysis Examples are as follows:

  • Stock Market Prediction: Regression analysis can be used to predict stock prices based on various factors such as historical prices, trading volume, news sentiment, and economic indicators. Traders and investors can use this analysis to make informed decisions about buying or selling stocks.
  • Demand Forecasting: In retail and e-commerce, regression analysis can help forecast demand for products. By analyzing historical sales data along with real-time data such as website traffic, promotional activities, and market trends, businesses can adjust their inventory levels and production schedules to meet customer demand more effectively.
  • Energy Load Forecasting: Utility companies often use real-time regression analysis to forecast electricity demand. By analyzing historical energy consumption data, weather conditions, and other relevant factors, they can predict future energy loads. This information helps them optimize power generation and distribution, ensuring a stable and efficient energy supply.
  • Online Advertising Performance: Regression analysis can be used to assess the performance of online advertising campaigns. By analyzing real-time data on ad impressions, click-through rates, conversion rates, and other metrics, advertisers can adjust their targeting, messaging, and ad placement strategies to maximize their return on investment.
  • Predictive Maintenance: Regression analysis can be applied to predict equipment failures or maintenance needs. By continuously monitoring sensor data from machines or vehicles, regression models can identify patterns or anomalies that indicate potential failures. This enables proactive maintenance, reducing downtime and optimizing maintenance schedules.
  • Financial Risk Assessment: Real-time regression analysis can help financial institutions assess the risk associated with lending or investment decisions. By analyzing real-time data on factors such as borrower financials, market conditions, and macroeconomic indicators, regression models can estimate the likelihood of default or assess the risk-return tradeoff for investment portfolios.

Importance of Regression Analysis

Importance of Regression Analysis is as follows:

  • Relationship Identification: Regression analysis helps in identifying and quantifying the relationship between a dependent variable and one or more independent variables. It allows us to determine how changes in independent variables impact the dependent variable. This information is crucial for decision-making, planning, and forecasting.
  • Prediction and Forecasting: Regression analysis enables us to make predictions and forecasts based on the relationships identified. By estimating the values of the dependent variable using known values of independent variables, regression models can provide valuable insights into future outcomes. This is particularly useful in business, economics, finance, and other fields where forecasting is vital for planning and strategy development.
  • Causality Assessment: While correlation does not imply causation, regression analysis provides a framework for assessing causality by considering the direction and strength of the relationship between variables. It allows researchers to control for other factors and assess the impact of a specific independent variable on the dependent variable. This helps in determining the causal effect and identifying significant factors that influence outcomes.
  • Model Building and Variable Selection: Regression analysis aids in model building by determining the most appropriate functional form of the relationship between variables. It helps researchers select relevant independent variables and eliminate irrelevant ones, reducing complexity and improving model accuracy. This process is crucial for creating robust and interpretable models.
  • Hypothesis Testing: Regression analysis provides a statistical framework for hypothesis testing. Researchers can test the significance of individual coefficients, assess the overall model fit, and determine if the relationship between variables is statistically significant. This allows for rigorous analysis and validation of research hypotheses.
  • Policy Evaluation and Decision-Making: Regression analysis plays a vital role in policy evaluation and decision-making processes. By analyzing historical data, researchers can evaluate the effectiveness of policy interventions and identify the key factors contributing to certain outcomes. This information helps policymakers make informed decisions, allocate resources effectively, and optimize policy implementation.
  • Risk Assessment and Control: Regression analysis can be used for risk assessment and control purposes. By analyzing historical data, organizations can identify risk factors and develop models that predict the likelihood of certain outcomes, such as defaults, accidents, or failures. This enables proactive risk management, allowing organizations to take preventive measures and mitigate potential risks.

When to Use Regression Analysis

  • Prediction: Regression analysis is often employed to predict the value of the dependent variable based on the values of independent variables. For example, you might use regression to predict sales based on advertising expenditure, or to predict a student’s academic performance based on variables like study time, attendance, and previous grades.
  • Relationship analysis: Regression can help determine the strength and direction of the relationship between variables. It can be used to examine whether there is a linear association between variables, identify which independent variables have a significant impact on the dependent variable, and quantify the magnitude of those effects.
  • Causal inference: Regression analysis can be used to explore cause-and-effect relationships by controlling for other variables. For example, in a medical study, you might use regression to determine the impact of a specific treatment while accounting for other factors like age, gender, and lifestyle.
  • Forecasting: Regression models can be utilized to forecast future trends or outcomes. By fitting a regression model to historical data, you can make predictions about future values of the dependent variable based on changes in the independent variables.
  • Model evaluation: Regression analysis can be used to evaluate the performance of a model or test the significance of variables. You can assess how well the model fits the data, determine if additional variables improve the model’s predictive power, or test the statistical significance of coefficients.
  • Data exploration: Regression analysis can help uncover patterns and insights in the data. By examining the relationships between variables, you can gain a deeper understanding of the data set and identify potential patterns, outliers, or influential observations.

Applications of Regression Analysis

Here are some common applications of regression analysis:

  • Economic Forecasting: Regression analysis is frequently employed in economics to forecast variables such as GDP growth, inflation rates, or stock market performance. By analyzing historical data and identifying the underlying relationships, economists can make predictions about future economic conditions.
  • Financial Analysis: Regression analysis plays a crucial role in financial analysis, such as predicting stock prices or evaluating the impact of financial factors on company performance. It helps analysts understand how variables like interest rates, company earnings, or market indices influence financial outcomes.
  • Marketing Research: Regression analysis helps marketers understand consumer behavior and make data-driven decisions. It can be used to predict sales based on advertising expenditures, pricing strategies, or demographic variables. Regression models provide insights into which marketing efforts are most effective and help optimize marketing campaigns.
  • Health Sciences: Regression analysis is extensively used in medical research and public health studies. It helps examine the relationship between risk factors and health outcomes, such as the impact of smoking on lung cancer or the relationship between diet and heart disease. Regression analysis also helps in predicting health outcomes based on various factors like age, genetic markers, or lifestyle choices.
  • Social Sciences: Regression analysis is widely used in social sciences like sociology, psychology, and education research. Researchers can investigate the impact of variables like income, education level, or social factors on various outcomes such as crime rates, academic performance, or job satisfaction.
  • Operations Research: Regression analysis is applied in operations research to optimize processes and improve efficiency. For example, it can be used to predict demand based on historical sales data, determine the factors influencing production output, or optimize supply chain logistics.
  • Environmental Studies: Regression analysis helps in understanding and predicting environmental phenomena. It can be used to analyze the impact of factors like temperature, pollution levels, or land use patterns on phenomena such as species diversity, water quality, or climate change.
  • Sports Analytics: Regression analysis is increasingly used in sports analytics to gain insights into player performance, team strategies, and game outcomes. It helps analyze the relationship between various factors like player statistics, coaching strategies, or environmental conditions and their impact on game outcomes.

Advantages and Disadvantages of Regression Analysis

Advantages of Regression Analysis

  • Provides a quantitative measure of the relationship between variables
  • Helps in predicting and forecasting outcomes based on historical data
  • Identifies and measures the significance of independent variables on the dependent variable
  • Provides estimates of the coefficients that represent the strength and direction of the relationship between variables
  • Allows for hypothesis testing to determine the statistical significance of the relationship
  • Can handle both continuous and categorical variables
  • Offers a visual representation of the relationship through the use of scatter plots and regression lines
  • Provides insights into the marginal effects of independent variables on the dependent variable

Disadvantages of Regression Analysis

  • Assumes a linear relationship between variables, which may not always hold true
  • Requires a large sample size to produce reliable results
  • Assumes no multicollinearity, meaning that independent variables should not be highly correlated with each other
  • Assumes the absence of outliers or influential data points
  • Can be sensitive to the inclusion or exclusion of certain variables, leading to different results
  • Assumes the independence of observations, which may not hold true in some cases
  • May not capture complex non-linear relationships between variables without appropriate transformations
  • Requires the assumption of homoscedasticity, meaning that the variance of errors is constant across all levels of the independent variables

A Refresher on Regression Analysis


Understanding one of the most important types of data analysis.

You probably know by now that whenever possible you should be making data-driven decisions at work. But do you know how to parse through all the data available to you? The good news is that you probably don’t need to do the number crunching yourself (hallelujah!), but you do need to correctly understand and interpret the analysis created by your colleagues. One of the most important types of data analysis is called regression analysis.


Regression analysis is a widely used set of statistical analysis methods for gauging the true impact of various factors on specific facets of a business. These methods help data analysts better understand relationships between variables, make predictions, and decipher intricate patterns within data. Regression analysis enables better predictions and more informed decision-making by tapping into historical data to forecast future outcomes. It informs the highest levels of strategic decision-making at the world’s leading enterprises, enabling them to achieve successful outcomes at scale in virtually all domains and industries. In this article, we delve into the essence of regression analysis, exploring its mechanics, applications, various types, and the benefits it brings to the table for enterprises that invest in it.

What is Regression Analysis?

Enterprises have long sought the proverbial “secret sauce” to increasing revenue. While a definitive formula for boosting sales has yet to be discovered, powerful advances in statistics and data science have made it easier to grasp relationships between potentially influential factors and reported sales results and earnings.

In the world of data analytics and statistical modeling, regression analysis stands out for its versatility and predictive power. At its core, it involves modeling the relationship between one or more independent variables and a dependent variable—in essence, asking how changes in one correspond to changes in the other.

How Does Regression Analysis Work?

Regression analysis works by constructing a mathematical model that represents the relationships among the variables in question. This model is expressed as an equation that captures the expected influence of each independent variable on the dependent variable.

End-to-end, the regression analysis process consists of data collection and preparation, model selection, parameter estimation, and model evaluation.

Step 1: Data Collection and Preparation

The first step in regression analysis involves gathering and preparing the data. As with any data analytics, data quality is imperative—in this context, preparation includes identifying all dependent and independent variables, cleaning the data, handling missing values, and transforming variables as needed.

Step 2: Model Selection

In this step, the appropriate regression model is selected based on the nature of the data and the research question. For example, a simple linear regression is suitable when exploring a single predictor, while multiple linear regression is better for use cases with multiple predictors. Polynomial regression, logistic regression, and other specialized forms can be employed for various other use cases.

Step 3: Parameter Estimation

The next step is to estimate the model parameters. For linear regression, this involves finding the coefficients (slopes and intercepts) that best fit the data. This is most often accomplished using techniques like the least squares method, which minimizes the sum of squared differences between observed and predicted values.

Step 4: Model Evaluation

Model evaluation is critical for determining the model’s goodness of fit and predictive accuracy. This process involves assessing such metrics as the coefficient of determination (R-squared), mean squared error (MSE), and others. Visualization tools—scatter plots and residual plots, for example—can aid in understanding how well the model captures the data’s patterns.
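The fit metrics named above are straightforward to compute by hand. A small Python sketch with illustrative observed and predicted values:

    import numpy as np

    observed = np.array([3.0, 5.0, 7.0, 9.0])
    predicted = np.array([2.8, 5.3, 6.9, 9.2])

    residuals = observed - predicted
    mse = np.mean(residuals ** 2)                      # mean squared error
    rmse = np.sqrt(mse)                                # root mean squared error
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    r_squared = 1 - ss_res / ss_tot                    # coefficient of determination

    print(mse, rmse, r_squared)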

Interpreting the Results of Regression Analysis

In order to be actionable, data must be transformed into information. In a similar sense, once the regression analysis has yielded results, they must be interpreted. This includes interpreting coefficients and significance, determining goodness of fit, and performing residual analysis.

Interpreting Coefficients and Significance

Interpreting regression coefficients is crucial for understanding the relationships between variables. A positive coefficient suggests a positive relationship; a negative coefficient suggests a negative relationship.

The significance of coefficients is determined through hypothesis testing—a common statistical method to determine if sample data contains sufficient evidence to draw conclusions—and represented by the p-value. The smaller the p-value, the more significant the relationship.

Determining Goodness of Fit

The coefficient of determination—denoted as R-squared—indicates the proportion of the variance in the dependent variable explained by the independent variables. A higher R-squared value suggests a better fit, but correlation doesn’t necessarily equal causation (i.e., a high R-squared doesn’t imply causation).

Performing Residual Analysis

Analyzing residuals helps validate the assumptions of regression analysis. In a well-fitting model, residuals are randomly scattered around zero. Patterns in residuals could indicate violations of assumptions or omitted variables that should be included in the model.

Key Assumptions of Regression Analysis

To yield reliable and meaningful results, regression analysis relies on assumptions of linearity, independence, homoscedasticity, normality, and no multicollinearity; these assumptions underpin both the interpretation and the validation of models.

  • Linearity. The relationship between independent and dependent variables is assumed to be linear. This means that the change in the dependent variable is directly proportional to changes in the independent variable(s).
  • Independence. The residuals—differences between observed and predicted values—should be independent of each other. In other words, the value of the residual for one data point should not provide information about the residual for another data point.
  • Homoscedasticity. The variance of residuals should remain consistent across all levels of the independent variables. If the variance of residuals changes systematically, it indicates heteroscedasticity and an unreliable regression model.
  • Normality. Residuals should follow a normal distribution. While this assumption is more crucial for smaller sample sizes, violations can impact the reliability of statistical inference and hypothesis testing in many scenarios.
  • No multicollinearity. Multicollinearity—a statistical phenomenon where several independent variables in a model are correlated—makes interpreting individual variable contributions difficult and may result in unreliable coefficient estimates. In multiple linear regression, independent variables should not be highly correlated.
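One common way to screen for the multicollinearity problem above is the variance inflation factor (VIF). A sketch using statsmodels on synthetic, nearly collinear predictors:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(3)
    x1 = rng.normal(size=100)
    x2 = 0.95 * x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
    X = sm.add_constant(np.column_stack([x1, x2]))

    for i in range(1, X.shape[1]):                     # skip the constant column
        print(f"VIF for predictor {i}:", variance_inflation_factor(X, i))
    # As a rule of thumb, VIF values well above 10 signal troublesome collinearity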

Types of Regression Analysis

There are many regression analysis techniques available for different use cases. Simple linear regression and logistic regression cover many scenarios, but the following are the most commonly used approaches:

  • Simple linear regression: studies the relationship between two variables (predictor and outcome).
  • Multiple linear regression: captures the combined impact of all variables.
  • Polynomial regression: finds and represents complex patterns and non-linear relationships.
  • Logistic regression: estimates probability based on predictor variables.
  • Ridge regression: used in cases with high correlation between variables; can also be used as a regularization method for accuracy.
  • Lasso regression: used to minimize the effect of correlated variables on predictions.

Simple Linear Regression

Useful for exploring the relationship between two continuous variables in straightforward cause-and-effect investigations, simple linear regression is the most basic form of regression analysis. It involves studying the relationship between two variables: an independent variable (the predictor) and a dependent variable (the outcome).

[Figure: simple linear regression line fit by least squares. Source: https://upload.wikimedia.org/wikipedia/commons/b/b0/Linear_least_squares_example2.svg]

Multiple Linear Regression (MLR)

Multiple linear regression extends the concept of simple linear regression by capturing the combined impact of all factors, allowing for a more comprehensive analysis of how several factors collectively influence the outcome.

[Figure: multiple linear regression. Source: https://cdn.corporatefinanceinstitute.com/assets/multiple-linear-regression.png]

Polynomial Regression

For non-linear relationships, polynomial regression accommodates curves and enables accurate representation of complex patterns. This method involves fitting a polynomial equation to the data, allowing for more flexible modeling of complex relationships. For example, a second order polynomial regression—also known as a quadratic regression—can be used to capture a U-shaped or inverted U-shaped pattern in the data.

[Figure: polynomial regression fits. Source: https://en.wikipedia.org/wiki/Polynomial_regression#/media/File:Polyreg_scheffe.svg]

Logistic Regression

Logistic regression estimates the probability of an event occurring based on one or more predictor variables. In contrast to linear regression, logistic regression is designed to predict categorical outcomes, which are typically binary in nature—for example, yes/no or 0/1.

[Figure: logistic curve modeling exam pass probability. Source: https://en.m.wikipedia.org/wiki/File:Exam_pass_logistic_curve.svg]

Ridge Regression

Ridge regression is typically employed when there is a high correlation between the independent variables. This regression method yields models that are less susceptible to overfitting, and it can also be used as a regularization method that reduces the impact of correlated variables on model accuracy.

[Figure: ridge regression example. Source: https://www.statology.org/ridge-regression-in-r/]

Lasso Regression

Like ridge regression, lasso regression—short for least absolute shrinkage and selection operator—works by minimizing the effect that correlated variables have on a model’s predictive capabilities.

[Figure: lasso regression example. Source: https://www.statology.org/lasso-regression-in-r/]

Regression Analysis Benefits and Use Cases

Because it taps into historical data to forecast future outcomes, regression analysis enables better predictions and more informed decision-making, giving it tremendous value for enterprises in all fields. It’s used at the highest levels of the world’s leading enterprises in fields from finance to marketing to help achieve successful outcomes at scale.

For example, regression analysis plays a crucial role in the optimization of transportation and logistics operations. By predicting demand patterns, it allows enterprises to adjust inventory levels and optimize their supply chain management efforts. It can also help optimize routes by identifying factors that influence travel times and delivery delays, ultimately leading to more accurate scheduling and resource allocation, and assists in fleet management by predicting maintenance needs.

Here are more examples of how various industries use regression analysis:

  • Economics and finance. Regression models help economists understand the interplay of variables such as interest rates, inflation, and consumer spending, guiding monetary strategy and policy decisions and economic forecasts.
  • Healthcare. Medical researchers employ regression analysis to determine how factors like age, lifestyle choices, genetics, and environmental factors contribute to health outcomes to aid in the design of personalized treatment plans and mechanisms for predicting disease risks.
  • Marketing and business. Enterprises use regression analysis to understand consumer behavior, optimize pricing strategies, and evaluate marketing campaign effectiveness.

Challenges and Limitations

Despite its power, regression analysis is not without challenges and limitations. For example, overfitting occurs when a model is too complex and fits the noise in the data rather than the underlying patterns, and multicollinearity can lead to unstable coefficient estimates.

To deal with these issues, methods such as correlation analysis, variance inflation factor (VIF), and principal component analysis (PCA) can be used to identify and remove redundant variables. Regularization methods using additional regression techniques—ridge regression, lasso regression, and elastic net regression, for example—can help to reduce the impact of correlated variables on the model’s accuracy.

Inherently, regression analysis methods assume that relationships are constant across all levels of the independent variables. But this assumption might not hold true in all cases. For example, modeling the relationship between an app’s ease-of-use and subscription renewal rate may not be well-represented by a linear model, as subscription renewals may increase exponentially or logarithmically with the level of usability.

Bottom Line: Regression Analysis for Enterprise Use

Regression analysis is an indispensable tool in the arsenal of data analysts and researchers. It allows for the decoding of hidden relationships, more accurate outcome predictions, and revelations hidden inside intricate data dynamics that can aid in strategic decision-making.

While it has limitations, many of them can be minimized with the use of other analytical methods. With a solid understanding of its mechanisms, types, and applications, enterprises across nearly all domains can harness its potential to extract valuable information.

Doing so requires investment—not just in the right data analytics and visualization tools and expertise, but in a commitment to collect and prepare high quality data and train staff to incorporate it into decision-making processes. Regression analysis should be just one of the arrows in a business’s data analytics and data management quiver.



Businesses collect data to make better decisions.

But when you count on data for building strategies, simplifying processes, and improving customer experience, more than collecting it, you need to understand and analyze it to be able to draw valuable insights. Analyzing data helps you study what’s already happened and predict what may happen in the future. 

Data analysis has many components, and while some can be easy to understand and perform, others are rather complex. The good news is that many statistical analysis software products offer meaningful insights from data in a few steps.

You have to understand the fundamentals before relying on a statistical program for accurate results, because even though generating results is easy, interpreting them is another ballgame.

While interpreting data, considering the factors that affect the data becomes essential. Regression analysis helps you do just that. With the assistance of this statistical analysis method, you can find the most important and least important factors in any data set and understand how they relate.

This guide covers the fundamentals of regression analysis, its process, benefits, and applications.

What is regression analysis? 

Regression analysis is a statistical process that helps assess the relationships between a dependent variable and one or more independent variables.

The primary purpose of regression analysis is to describe the relationship between variables, but it can also be used to:

  • Estimate the value of one variable using the known values of other variables.
  • Predict results and shifts in a variable based on its relationship with other variables. 
  • Control the influence of variables while exploring the relationship between variables.  

To understand regression analysis comprehensively, you must build foundational knowledge of the statistical concepts.

Regression analysis helps identify the factors that impact data insights. You can use it to understand which factors play a role in creating an outcome and how significant they are. These factors are called variables.

You need to grasp two main types of variables.

  • The main factor you're focusing on is the dependent variable . This variable is often measured as an outcome of analyses and depends on one or more other variables.
  • The factors or variables that impact your dependent variable are called independent variables . Variables like these are often altered for analysis. They’re also called explanatory variables or predictor variables.

Correlation vs. causation 

Causation indicates that one variable is the result of the occurrence of the other variable. Correlation suggests a connection between variables. Correlation and causation can coexist, but correlation does not imply causation. 

Overfitting

Overfitting is a statistical modeling error that occurs when a function lines up with a limited set of data points and makes predictions based on those instead of exploring new data points. As a result, the model can only be used as a reference to its initial data set and not to any other data sets.


How does regression analysis work?

For a minute, let's imagine that you own an ice cream stand. In this case, we can consider “revenue” and “temperature” to be the two factors under analysis. The first step toward conducting a successful regression statistical analysis is gathering data on the variables. 

You collect all your monthly sales numbers for the past two years and any data on the independent variables or explanatory variables you’re analyzing. In this case, it’s the average monthly temperature for the past two years.

To begin to understand whether there’s a relationship between these two variables, you need to plot these data points on a graph that looks like the following theoretical example of a scatter plot:

[Figure: scatter plot of average monthly temperature vs. sales]

The amount of sales is represented on the y-axis (vertical axis), and temperature is represented on the x-axis (horizontal axis). The dots represent one month's data – the average temperature and sales in that same month.

Observing this data shows that sales are higher on days when the temperature increases. But by how much? If the temperature goes higher, how much do you sell? And what if the temperature drops? 

Drawing a regression line roughly in the middle of all the data points helps you figure out how much you typically sell when it’s a specific temperature. Let’s use a theoretical scatter plot to depict a regression line: 

[Figure: the same scatter plot with a regression line drawn through the points]

The regression line explains the relationship between the predicted values and dependent variables. It can be created using statistical analysis software or Microsoft Excel. 

Your regression analysis tool must also display a formula that defines the slope of the line. For example: 

y = 100 + 2x + error term

On observing the formula, you can conclude that when x is zero, y equals 100: when the temperature is very low, you can make an average of 100 sales. Provided the other variables remain constant, you can use this to predict future sales. For every one-degree rise in temperature, you make an average of two more sales.

A regression line always has an error term because an independent variable cannot be a perfect predictor of a dependent variable. Deciding whether this variable is worth your attention depends on the error term – the larger the error term, the less certain the regression line. 
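The ice-cream example can be reproduced in a few lines of Python. The data below are invented to mimic the y = 100 + 2x line described above, and NumPy's polyfit recovers the slope and intercept:

    import numpy as np

    rng = np.random.default_rng(4)
    temperature = rng.uniform(10, 35, size=24)            # 24 months of averages
    sales = 100 + 2 * temperature + rng.normal(0, 5, 24)  # includes an error term

    slope, intercept = np.polyfit(temperature, sales, deg=1)
    print(intercept, slope)   # approximately 100 and 2 for this synthetic data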

Types of regression analysis 

Various types of regression analysis are at your disposal, but the five mentioned below are the most commonly used.

Linear regression

A linear regression model is defined as a straight line that attempts to predict the relationship between variables. It’s mainly classified into two types: simple and multiple linear regression. 

We’ll discuss those in a moment, but let’s first cover the five fundamental assumptions made in the linear regression model. 

  • The dependent and independent variables display a linear relationship.
  • The expected value of the residuals is zero.
  • The residuals are uncorrelated across observations.
  • The residuals are normally distributed.
  • The residuals are homoscedastic – they have a constant variance.

Simple linear regression analysis 

Linear regression analysis helps predict a variable's value (dependent variable) based on the known value of one other variable (independent variable).

Linear regression fits a straight line, so a simple linear model attempts to define the relationship between two variables by estimating the coefficients of the linear equation.

Simple linear regression equation:

Y = a + bX + ϵ

Where:

  • Y – dependent variable (response variable)
  • X – independent variable (predictor variable)
  • a – intercept (y-intercept)
  • b – slope
  • ϵ – residual (error)

In such a linear regression model, a response variable has a single corresponding predictor variable that impacts its value. For example, consider the linear regression formula:

  y = 5x + 4  

If the value of x is defined as 3, only one outcome of y is possible: y = 5(3) + 4 = 19.

Multiple linear regression analysis

In most cases, simple linear regression analysis can't explain the connections between data. As the connection becomes more complex, the relationship between data is better explained using more than one variable. 

Multiple regression analysis describes a response variable using more than one predictor variable. It is used when several, possibly correlated, independent variables each have the ability to affect the dependent variable.

Multiple linear regression equation: 

Y = a + bX1 + cX2 + dX3 + ϵ

Where:

  • Y – dependent variable
  • X1, X2, X3 – independent variables
  • a – intercept (y-intercept)
  • b, c, d – slopes
  • ϵ – residual (error)

Ordinary least squares

Ordinary least squares (OLS) regression estimates the unknown parameters in a model. It estimates the coefficients of a linear regression equation by minimizing the sum of squared errors between the actual values and the values predicted by a straight line.

Polynomial regression

A linear regression algorithm only works when the relationship between the data is linear. What if the data distribution was more complex, as shown in the figure below?  

[Figure: a straight line fit to clearly nonlinear data]

As seen above, the data is nonlinear. A linear model can't be used to fit nonlinear data because it can't sufficiently define the patterns in the data.

Polynomial regression is a type of multiple linear regression used when data points are present in a nonlinear manner. It can determine the curvilinear relationship between independent and dependent variables having a nonlinear relationship.

[Figure: a polynomial curve fit to the same nonlinear data]

Polynomial regression equation: 

y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n
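A quadratic (second-degree) fit is the simplest case. A sketch with NumPy on synthetic data with a curved pattern:

    import numpy as np

    rng = np.random.default_rng(5)
    x = np.linspace(-3, 3, 60)
    y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0, 1, size=60)  # inverted-U pattern

    b2, b1, b0 = np.polyfit(x, y, deg=2)   # coefficients, highest degree first
    y_hat = b0 + b1 * x + b2 * x**2        # fitted curve
    print(b0, b1, b2)                      # close to 1.0, 0.5 and -2.0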

Logistic regression

Logistic regression models the probability of a dependent variable as a function of independent variables. The dependent variable takes one of a limited set of binary values (0 and 1), while the model's output is a probability.

Logistic regression is often used when binary data (yes or no; pass or fail) needs to be analyzed. In other words, using the logistic regression method to analyze your data is recommended if your dependent variable can have either one of two binary values.

Let’s say you need to determine whether an email is spam. We need to set up a threshold based on which the classification can be done. Using logistic regression here makes sense as the outcome is strictly bound to 0 (spam) or 1 (not spam) values.  

Bayesian linear regression

In other regression methods, the output is derived from one or more attributes. But what if those attributes are unavailable? 

The Bayesian regression method is used when the dataset that needs to be analyzed has little or poorly distributed data, because its output is derived from a probability distribution instead of point estimates. When data is scarce, you can place a prior on the regression coefficients to substitute for the data. As we add more data points, the accuracy of the regression model improves.

Imagine a company launches a new product and wants to predict its sales. Due to the lack of available data, we can’t use a simple regression analysis model. But Bayesian regression analysis lets you set up a prior and calculate future projections.

Additionally, once new data from the new product sales come in, the prior is immediately updated. As a result, the forecast for the future is influenced by the latest and previous data. 

The Bayesian technique is mathematically robust. Because of this, it doesn’t require you to have any prior knowledge of the dataset during usage. However, its complexity means it takes time to draw inferences from the model, and using it doesn't make sense when you have too much data.
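A sketch of this idea using scikit-learn's BayesianRidge, which places priors on the coefficients and returns uncertainty along with each prediction; the tiny dataset is invented:

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    rng = np.random.default_rng(6)
    X = rng.uniform(0, 10, size=(15, 1))                # deliberately few data points
    y = 4.0 + 1.2 * X[:, 0] + rng.normal(0, 1, size=15)

    model = BayesianRidge().fit(X, y)
    mean_pred, std_pred = model.predict([[7.0]], return_std=True)
    print(mean_pred, std_pred)   # point forecast plus its uncertainty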

Quantile regression analysis

The linear regression method estimates a variable's mean based on the values of other predictor variables. But we don’t always need to calculate the conditional mean. In most situations, we only need the median, the 0.25 quantile, and so on. In cases like this, we can use quantile regression. 

Quantile regression defines the relationship between one or more predictor variables and specific percentiles or quantiles of a response variable. It resists the influence of outlying observations. No assumptions about the distribution of the dependent variable are made in quantile regression, so you can use it when linear regression doesn’t satisfy its assumptions. 

Let's consider two students who have taken an Olympiad exam open for all age groups. Student A scored 650, while student B scored 425. This data shows that student A has performed better than student B. 

But quantile regression helps remind us that since the exam was open for all age groups, we have to factor in the students' ages to determine the correct outcome in their individual conditional quantile spaces.

We know the variable causing such a difference in the data distribution. As a result, the scores of the students are compared for the same age groups.
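A sketch of quantile regression with statsmodels' QuantReg, estimating the 0.25 quantile and the median of a synthetic score variable rather than its mean:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    age = rng.uniform(10, 60, size=300)
    score = 300 + 5 * age + rng.normal(0, 40, size=300)   # invented exam scores

    X = sm.add_constant(age)
    for q in (0.25, 0.5):
        fit = sm.QuantReg(score, X).fit(q=q)
        print(f"quantile {q}:", fit.params)   # intercept and slope at each quantile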

What is regularization? 

Regularization is a technique that prevents a regression model from overfitting by including extra information. It's implemented by adding a penalty term to the model, which keeps the same number of features while shrinking the magnitudes of the coefficients toward zero.

The two types of regularization techniques are L1 and L2. A regression model using the L1 regularization technique is known as Lasso regression, and the one using the L2 regularization technique is called Ridge regression.

Ridge regression

Ridge regression is a regularization technique you would use to reduce the impact of correlations between independent variables (multicollinearity) or when the number of independent variables in a set exceeds the number of observations.

Ridge regression performs L2 regularization. The formula used to make predictions is the same as in ordinary least squares, but a penalty on the square of the magnitude of the regression coefficients is added. The penalty shrinks the coefficients so that no single feature has an outsized effect on the outcome.

Lasso regression

Lasso stands for Least Absolute Shrinkage and Selection Operator. 

Lasso regression is a regularized linear regression that uses an L1 penalty that pushes some regression coefficient values to become closer to zero. By setting features to zero, it automatically chooses the required feature and avoids overfitting.

So if the dataset has high correlation or high levels of multicollinearity, or when tasks such as variable selection or parameter elimination need to be automated, you can use lasso regression.


When is regression analysis used?

Regression analysis is a powerful tool used to derive statistical inferences for the future using observations from the past. It identifies the connections between variables occurring in a dataset and determines the magnitude of these associations and their significance on outcomes.

Across industries, it’s a useful statistical analysis tool because it provides exceptional flexibility. So the next time someone at work proposes a plan that depends on multiple factors, perform a regression analysis to predict an accurate outcome. 

Benefits of regression analysis

In the real world, various factors determine how a business grows. Often these factors are interrelated, and a change in one can positively or negatively affect the other.

Using regression analysis to judge how changing variables will affect your business has two primary benefits.

  • Making data-driven decisions: Businesses use regression analysis when planning for the future because it helps determine which variables have the most significant impact on the outcome according to previous results. Companies can better focus on the right things when forecasting and making data-backed predictions.
  • Recognizing opportunities to improve: Since regression analysis shows the relations between two variables, businesses can use it to identify areas of improvement in terms of people, strategies, or tools by observing their interactions. For example, increasing the number of people on a project might positively impact revenue growth.

Applications of regression analysis

Both small and large industries are loaded with an enormous amount of data. To make better decisions and eliminate guesswork, many are now adopting regression analysis because it offers a scientific approach to management.

Using regression analysis, professionals can observe and evaluate the relationship between various variables and subsequently predict this relationship's future characteristics. 

Companies can utilize regression analysis in numerous forms. Some of them include:

  • Many finance professionals use regression analysis to forecast future opportunities and risks. The capital asset pricing model (CAPM), which describes the relationship between an asset's expected return and the associated market risk premium, is an often-used regression model in finance for pricing assets and discovering capital costs. Regression analysis is also used to calculate beta (β), the volatility of a stock's returns relative to the overall market.
  • Insurance firms use regression analysis to forecast the creditworthiness of a policyholder. It can also help estimate the number of claims that may be raised in a specific period.
  • Sales forecasting uses regression analysis to predict sales based on past performance. It can give you a sense of what has worked before, what kind of impact it has created, and what can improve to provide more accurate and beneficial future results.
  • Another critical use of regression models is the optimization of business processes. Today, managers consider regression an indispensable tool for highlighting the areas that have the maximum impact on operational efficiency and revenues, deriving new insights, and correcting process errors.

Businesses with a data-driven culture use regression analysis to draw actionable insights from large datasets. For many leading industries with extensive data catalogs, it proves to be a valuable asset. As data sizes increase, more executives lean into regression analysis to make informed business decisions with statistical significance.

Top statistical analysis software

While Microsoft Excel remains a popular tool for conducting fundamental regression data analysis, many more advanced statistical tools today drive more accurate and faster results.

To be included in this category, the regression analysis software product must be able to:

  • Execute a simple linear regression or a complex multiple regression analysis for various data sets.
  • Provide graphical tools to study model estimation, multicollinearity, model fits, line of best fit, and other aspects typical of the type of regression.
  • Possess a clean, intuitive, and user-friendly user interface (UI) design.

*Below are the top 5 leading statistical analysis software solutions from G2’s Winter 2023 Grid® Report. Some reviews may be edited for clarity.

1. IBM SPSS Statistics

IBM SPSS Statistics allows you to predict the outcomes and apply various nonlinear regression procedures that can be used for business and analysis projects where standard regression techniques are limiting or inappropriate. With IBM SPSS Statistics, you can specify multiple regression models in a single command to observe the correlation between independent and dependent variables and expand regression analysis capabilities on a dataset.

What users like best :

"I have used a couple of different statistical softwares. IBM SPSS is an amazing software, a one-stop shop for all statistics-related analysis. The graphical user interface is elegantly built for ease. I was quickly able to learn and use it"

- IBM SPSS Statistics Review , Haince Denis P.

What users dislike:

"Some of the interfaces could be more intuitive. Thankfully much information is available from various sources online to help the user learn how to set up tests."

- IBM SPSS Statistics Review , David I.

2. Posit

To make data science more intuitive and collaborative, Posit provides users across key industries with R and Python-based tools, enabling them to leverage powerful analytics and gather valuable insights.

What users like best:

"Straightforward syntax, excellent built-in functions, and powerful libraries for everything else. Building anything from simple mathematical functions to complicated machine learning models is a breeze."

- Posit Review , Brodie G.

"Its GUI could be more intuitive and user-friendly. One needs a lot of time to understand and implement it. Including a package manager would be a good idea, as it has become common in many modern IDEs. There must be an option to save console commands, which is currently unavailable."

- Posit Review , Tanishq G.

3. JMP

JMP is a data analysis software that helps make sense of your data using cutting-edge and modern statistical methods. Its products are intuitively interactive, visually compelling, and statistically profound.

"The instructional videos on the website are great; I had no clue what I was doing before I watched them. The videos make the application very user-friendly."

- JMP Review , Ashanti B.

"Help function can be brief in terms of what the functionality entails, and that's disappointing because the way the software is set up to communicate data visually and intuitively suggests the presence of a logical and explainable scientific thought process, including an explanation of the "why.” The graph builder could also use more intuitive means to change layout features."

- JMP Review , Zeban K.

4. Minitab statistical software

Minitab Statistical Software is a data and statistical analysis tool used to help businesses understand their data and make better decisions. It allows companies to tap into the power of regression analysis by analyzing new and old data to discover trends, predict patterns, uncover hidden relationships between variables, and create stunning visualizations. 

"The greatest program for learning and analyzing as it allows you to improve the settings with incredibly accurate graphs and regression charts. This platform allows you to analyze the outcomes or data with their ideal values."

- Minitab Statistical Software Review , Pratibha M.

"The software price is steep, and licensing is troublesome. You are required to be online or connected to the company VPN for licensing, especially for corporate use. So without an internet connection, you cannot use it at all. Also, if you are in the middle of doing an analysis and happen to lose your internet connection, you will risk losing the project or the study you are working on."

- Minitab Statistical Software Review , Siew Kheong W.

5. EViews

EViews offers user-friendly tools to perform data modeling and forecasting. It operates with an innovative, easy-to-use object-oriented interface used by researchers, financial institutions, government agencies, and educators.

"As an economist, this software is handy as it assists me in conducting advanced research, analyzing data, and interpreting results for policy recommendations. I just cannot do without EViews. I like its recent updates that have also enhanced the UI."

- EViews Review , Thomas M.

"In my experience, importing data from Excel is not easy using EViews compared to other statistical software. One needs to develop expertise while importing data into EViews from different formats. Moreover, the price of the software is very high."

 - EViews Review , Md. Zahid H.


Collecting data gathers no moss.

Data collection has become easy in the modern world, but more than just gathering is required. Businesses must know how to get the most value from this data. Analysis helps companies to understand the available information, derive actionable insights, and make informed decisions. Businesses should thoroughly know the data analysis process inside and out to refine operations, improve customer service, and track performance. 

Learn more about the various stages of data analysis and implement them to drive success.

Devyani Mehta

Devyani Mehta is a content marketing specialist at G2. She has worked with several SaaS startups in India, which has helped her gain diverse industry experience. At G2, she shares her insights on complex cybersecurity concepts like web application firewalls, RASP, and SSPM. Outside work, she enjoys traveling, cafe hopping, and volunteering in the education sector. Connect with her on LinkedIn.

Regression Analysis

The estimation of relationships between a dependent variable and one or more independent variables

Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables . It can be utilized to assess the strength of the relationship between variables and for modeling the future relationship between them.

Regression Analysis - Types of Regression Analysis

Regression analysis includes several variations, such as linear, multiple linear, and nonlinear. The most common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for more complicated data sets in which the dependent and independent variables show a nonlinear relationship.

Regression analysis offers numerous applications in various disciplines, including finance .

Regression Analysis – Linear Model Assumptions

Linear regression analysis is based on six fundamental assumptions:

  • The dependent and independent variables show a linear relationship between the slope and the intercept.
  • The independent variable is not random.
  • The value of the residual (error) is zero.
  • The value of the residual (error) is constant across all observations.
  • The value of the residual (error) is not correlated across all observations.
  • The residual (error) values follow the normal distribution.

Regression Analysis – Simple Linear Regression

Simple linear regression is a model that assesses the relationship between a dependent variable and an independent variable. The simple linear model is expressed using the following equation:

Y = a + bX + ϵ

  • Y – Dependent variable
  • X – Independent (explanatory) variable
  • a – Intercept
  • b – Slope
  • ϵ – Residual (error)
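To make the equation concrete, here is a minimal sketch of fitting Y = a + bX + ϵ in Python with NumPy; the data points below are invented purely for illustration.

```python
import numpy as np

# Invented sample data: X is the independent variable, Y the dependent variable
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# polyfit with deg=1 returns the coefficients [b (slope), a (intercept)]
b, a = np.polyfit(X, Y, deg=1)
residuals = Y - (a + b * X)  # the ϵ term for each observation

print(f"a (intercept) = {a:.3f}, b (slope) = {b:.3f}")
print("prediction at X = 6:", a + b * 6)
```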


Regression Analysis – Multiple Linear Regression

Multiple linear regression analysis is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model. The mathematical representation of multiple linear regression is:

Y = a + bX1 + cX2 + dX3 + ϵ

  • X1, X2, X3 – Independent (explanatory) variables
  • b, c, d – Slopes

Multiple linear regression follows the same conditions as the simple linear model. However, since there are several independent variables in multiple linear analysis, there is another mandatory condition for the model:

  • Non-collinearity: Independent variables should show a minimum correlation with each other. If the independent variables are highly correlated with each other, it will be difficult to assess the true relationships between the dependent and independent variables. One way to check this condition is sketched below.
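The sketch below, using statsmodels on synthetic data, fits a three-variable model of this form and then computes variance inflation factors (VIFs), one common diagnostic for collinearity; the text above does not prescribe a specific check, so treat this as one reasonable option.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
X1, X2, X3 = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 0.5 * X2 + 0.8 * X3 + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([X1, X2, X3]))  # prepend intercept column
model = sm.OLS(Y, X).fit()
print("estimated a, b, c, d:", model.params)

# VIF near 1 suggests little collinearity; values above roughly 5-10 are a warning sign
for i in range(1, X.shape[1]):
    print(f"VIF for X{i}:", variance_inflation_factor(X, i))
```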

Regression Analysis in Finance

Regression analysis comes with several applications in finance. For example, the statistical method is fundamental to the Capital Asset Pricing Model (CAPM) . Essentially, the CAPM equation is a model that determines the relationship between the expected return of an asset and the market risk premium.

The analysis is also used to forecast the returns of securities, based on different factors, or to forecast the performance of a business. Learn more forecasting methods in CFI’s Budgeting and Forecasting Course!

1. Beta and CAPM

In finance, regression analysis is used to calculate the Beta (volatility of returns relative to the overall market) for a stock. It can be done in Excel using the Slope function .
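As a rough Python equivalent of the SLOPE approach (not CFI's template itself), the sketch below regresses made-up stock returns on market returns; the fitted slope is the beta.

```python
import numpy as np
from scipy import stats

# Fabricated periodic returns for the market and for one stock
market_returns = np.array([0.010, -0.020, 0.015, 0.030, -0.010, 0.020])
stock_returns = np.array([0.012, -0.025, 0.020, 0.040, -0.015, 0.030])

# Regress stock returns on market returns; the slope is beta
result = stats.linregress(market_returns, stock_returns)
print("beta (slope):", round(result.slope, 3))
print("alpha (intercept):", round(result.intercept, 4))
```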


2. Forecasting Revenues and Expenses

When forecasting financial statements for a company, it may be useful to do a multiple regression analysis to determine how changes in certain assumptions or drivers of the business will impact revenue or expenses in the future. For example, there may be a very high correlation between the number of salespeople employed by a company, the number of stores they operate, and the revenue the business generates.

For example, the FORECAST function in Excel can be used to calculate a company’s revenue based on the number of ads it runs.


Regression Tools

Excel remains a popular tool for conducting basic regression analysis in finance; however, there are many more advanced statistical tools that can be used.

Python and R are both powerful coding languages that have become popular for all types of financial modeling, including regression. These techniques form a core part of data science and machine learning, where models are trained to detect these relationships in data.

Learn more about regression analysis, Python, and Machine Learning in CFI’s Business Intelligence & Data Analysis certification.

Additional Resources

To learn more about related topics, check out the following free CFI resources:

  • Cost Behavior Analysis
  • Forecasting Methods
  • Joseph Effect
  • Variance Inflation Factor (VIF)
  • High Low Method vs. Regression Analysis
  • See all data science resources



Regression Analysis: Definition, Types, Usage & Advantages


Regression analysis is perhaps one of the most widely used statistical methods for investigating or estimating the relationship between a set of independent and dependent variables. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities.

It is also used as a blanket term for various data analysis techniques utilized in quantitative research for modeling and analyzing numerous variables. In the regression method, the independent variable is a predictor or an explanatory element, and the dependent variable is the outcome or a response to a specific query.


Content Index

  • Definition of Regression Analysis
  • Types of Regression Analysis
  • Regression Analysis Usage in Market Research
  • How Regression Analysis Derives Insights From Surveys
  • Advantages of Using Regression Analysis in an Online Survey

Definition of Regression Analysis

Regression analysis is often used to model or analyze data. Most survey analysts use it to understand the relationship between the variables, which can be further utilized to predict the precise outcome.

For Example – Suppose a soft drink company wants to expand its manufacturing unit to a newer location. Before moving forward, the company wants to analyze its revenue generation model and the various factors that might impact it. Hence, the company conducts an online survey with a specific questionnaire.

After using regression analysis, it becomes easier for the company to analyze the survey results and understand the relationship between different variables like electricity and revenue – here, revenue is the dependent variable.


In addition, understanding the relationship between different independent variables like pricing, number of workers, and logistics with the revenue helps the company estimate the impact of varied factors on sales and profits.

Survey researchers often use this technique to examine and find a correlation between different variables of interest. It provides an opportunity to gauge the influence of different independent variables on a dependent variable.

Overall, regression analysis saves the survey researchers’ additional efforts in arranging several independent variables in tables and testing or calculating their effect on a dependent variable. Different types of analytical research methods are widely used to evaluate new business ideas and make informed decisions.


Types of Regression Analysis

Researchers usually start by learning linear and logistic regression first. Due to the widespread knowledge of these two methods and their ease of application, many analysts think there are only two types of models. Each model has its own specialty and ability to perform if specific conditions are met.

This blog explains seven commonly used types of regression analysis methods that can be used to interpret enumerated data in various formats.

01. Linear Regression Analysis

It is one of the most widely known modeling techniques, as it is among the first regression analysis methods people pick up when learning predictive modeling. Here, the dependent variable is continuous, and the independent variable is more often continuous or discrete, with a linear regression line.

Please note that multiple linear regression differs from simple linear regression in having more than one independent variable. Thus, linear regression is best used only when there is a linear relationship between the independent and dependent variables.

A business can use linear regression to measure the effectiveness of the marketing campaigns, pricing, and promotions on sales of a product. Suppose a company selling sports equipment wants to understand if the funds they have invested in the marketing and branding of their products have given them substantial returns or not.

Linear regression is the best statistical method to interpret the results. The best thing about linear regression is that it also helps in isolating the impact of each individual marketing and branding activity while controlling for the others' potential effect on sales.

If the company is running two or more advertising campaigns simultaneously, one on television and two on radio, then linear regression can easily analyze the independent and combined influence of running these advertisements together.


02. Logistic Regression Analysis

Logistic regression is commonly used to determine the probability of event success and event failure. Logistic regression is used whenever the dependent variable is binary, like 0/1, True/False, or Yes/No. Thus, it can be said that logistic regression is used to analyze either the close-ended questions in a survey or the questions demanding numeric responses in a survey.

Please note that, unlike linear regression, logistic regression does not require a linear relationship between the dependent and independent variables. Logistic regression applies a non-linear log transformation to predict the odds ratio; therefore, it easily handles various types of relationships between a dependent and an independent variable.

Logistic regression is widely used to analyze categorical data, particularly for binary response data in business data modeling. More often, logistic regression is used when the dependent variable is categorical, like to predict whether the health claim made by a person is real(1) or fraudulent, to understand if the tumor is malignant(1) or not.

Businesses use logistic regression to predict whether the consumers in a particular demographic will purchase their product or will buy from the competitors based on age, income, gender, race, state of residence, previous purchase, etc.
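A minimal sketch of this use case with scikit-learn, trained on a handful of fabricated customer records (age and income in thousands), so the numbers mean nothing beyond illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fabricated records: [age, income in thousands]; 1 = purchased, 0 = did not
X = np.array([[25, 30.0], [34, 52.0], [45, 64.0],
              [23, 28.0], [52, 90.0], [41, 61.0]])
y = np.array([0, 0, 1, 0, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probability of purchase for a hypothetical 38-year-old earning 58k
print("P(purchase):", clf.predict_proba([[38, 58.0]])[0, 1])
```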

03. Polynomial Regression Analysis

Polynomial regression is commonly used to analyze curvilinear data when an independent variable’s power is more than 1. In this regression analysis method, the best-fit line is never a ‘straight line’ but always a ‘curve line’ fitting into the data points.

Please note that polynomial regression is the better choice when some of the variables carry exponents and others do not.

Additionally, it can model non-linearly separable data offering the liberty to choose the exact exponent for each variable, and that too with full control over the modeling features available.

When combined with response surface analysis, polynomial regression is considered one of the sophisticated statistical methods commonly used in multisource feedback research. Polynomial regression is used mostly in finance and insurance-related industries where the relationship between dependent and independent variables is curvilinear.

Suppose a person wants to budget expense planning by determining how long it would take to earn a definitive sum. Polynomial regression, by taking into account his/her income and predicting expenses, can easily determine the precise time he/she needs to work to earn that specific sum amount.
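A small sketch of the idea with NumPy: fitting a degree-2 curve, rather than a straight line, to synthetic data points.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.0, 4.8, 9.5, 16.7, 26.2])  # roughly quadratic, invented

coeffs = np.polyfit(x, y, deg=2)  # [c2, c1, c0] for c2*x^2 + c1*x + c0
poly = np.poly1d(coeffs)
print("fitted curve:\n", poly)
print("prediction at x = 6:", poly(6.0))
```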

04. Stepwise Regression Analysis

This is a semi-automated process in which a statistical model is built by adding or removing independent variables based on the t-statistics of their estimated coefficients.

If used properly, stepwise regression will put more powerful information at your fingertips than almost any other method. It works well when you are working with a large number of independent variables, fine-tuning the model by iteratively adding and removing variables.

Stepwise regression analysis is recommended to be used when there are multiple independent variables, wherein the selection of independent variables is done automatically without human intervention.

Please note, in stepwise regression modeling, the variable is added or subtracted from the set of explanatory variables. The set of added or removed variables is chosen depending on the test statistics of the estimated coefficient.

Suppose you have a set of independent variables like age, weight, body surface area, duration of hypertension, basal pulse, and stress index based on which you want to analyze its impact on the blood pressure.

In stepwise regression, the best subset of the independent variable is automatically chosen; it either starts by choosing no variable to proceed further (as it adds one variable at a time) or starts with all variables in the model and proceeds backward (removes one variable at a time).

Thus, using regression analysis, you can calculate the impact of each or a group of variables on blood pressure.
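Classic stepwise selection scores variables by the t-statistics of their coefficients; scikit-learn's SequentialFeatureSelector applies the same add-one (forward) or remove-one (backward) idea but scores candidate subsets by cross-validation instead. A sketch on randomly generated data, with feature names that mirror the blood-pressure example:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 100
names = ["age", "weight", "bsa", "hypertension_yrs", "basal_pulse", "stress"]
X = rng.normal(size=(n, len(names)))
# Simulated blood pressure, driven mostly by age, weight, and hypertension years
bp = 120 + 3 * X[:, 0] + 5 * X[:, 1] + 2 * X[:, 3] + rng.normal(size=n)

selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward"
).fit(X, bp)
print("selected:", [names[i] for i in np.flatnonzero(selector.get_support())])
```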

05. Ridge Regression Analysis

Ridge regression is based on the ordinary least squares method and is used to analyze multicollinear data (data where independent variables are highly correlated). Collinearity can be explained as a near-linear relationship between variables.

Whenever there is multicollinearity, the least-squares estimates remain unbiased, but their variances become large, so they may lie far from the true value. Ridge regression reduces the standard errors by adding some degree of bias to the regression estimates, with the aim of producing more reliable estimates.


Please note that the assumptions of ridge regression are similar to those of least squares regression, except that normality is not assumed. Although the coefficient values are shrunk in ridge regression, they never reach zero, which means ridge regression cannot perform variable selection.

Suppose you are crazy about two guitarists performing live at an event near you, and you go to watch their performance with a motive to find out who is a better guitarist. But when the performance starts, you notice that both are playing black-and-blue notes at the same time.

Is it possible to find out the best guitarist having the biggest impact on sound among them when they are both playing loud and fast? As both of them are playing different notes, it is substantially difficult to differentiate them, making it the best case of multicollinearity, which tends to increase the standard errors of the coefficients.

Ridge regression addresses multicollinearity in cases like these and includes bias or a shrinkage estimation to derive results.
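A compact sketch of that situation: two nearly collinear predictors (the two guitarists) make the ordinary least squares coefficients unstable, while ridge's shrinkage pulls them back toward plausible values. All data is simulated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 50
g1 = rng.normal(size=n)
g2 = g1 + rng.normal(scale=0.01, size=n)  # almost a copy of g1: multicollinearity
sound = 2.0 * g1 + 2.0 * g2 + rng.normal(scale=0.1, size=n)
X = np.column_stack([g1, g2])

print("OLS coefficients:  ", LinearRegression().fit(X, sound).coef_)  # unstable
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, sound).coef_)    # shrunk, stabler
```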

06. Lasso Regression Analysis

Lasso (Least Absolute Shrinkage and Selection Operator) is similar to ridge regression; however, it uses an absolute value bias instead of the square bias used in ridge regression.

It was developed as an alternative to the traditional least-squares estimate, with the intention of reducing the many problems related to overfitting that arise when the data has a large number of independent variables.

Lasso has the capability to perform both tasks, selecting variables and regularizing them, along with a soft threshold. Applying lasso regression makes it easier to derive a subset of predictors that minimizes prediction errors while analyzing a quantitative response.

Please note that regression coefficients reaching zero value after shrinkage are excluded from the lasso model. On the contrary, regression coefficients having more value than zero are strongly associated with the response variables, wherein the explanatory variables can be either quantitative, categorical, or both.

Suppose an automobile company wants to perform a research analysis on average fuel consumption by cars in the US. For samples, they chose 32 models of car and 10 features of automobile design – Number of cylinders, Displacement, Gross horsepower, Rear axle ratio, Weight, ¼ mile time, v/s engine, transmission, number of gears, and number of carburetors.

The response variable mpg (miles per gallon) turns out to be strongly correlated with some of these variables, such as weight, displacement, number of cylinders, and horsepower. The problem can be analyzed by using the glmnet package in R, applying lasso regression for feature selection.
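The article points to R's glmnet; a rough Python counterpart uses scikit-learn's Lasso. The ten feature names below echo the car-design example, but the values are randomly generated here, so the selected subset is illustrative only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 32
names = ["cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"]
X = rng.normal(size=(n, len(names)))
mpg = 20 - 3 * X[:, 4] - 1.5 * X[:, 2] + rng.normal(size=n)  # driven by wt and hp

Xs = StandardScaler().fit_transform(X)  # lasso is sensitive to feature scale
model = Lasso(alpha=0.5).fit(Xs, mpg)

# Coefficients shrunk exactly to zero are dropped from the model
kept = [(nm, round(c, 2)) for nm, c in zip(names, model.coef_) if c != 0.0]
print("non-zero coefficients:", kept)
```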

07. Elastic Net Regression Analysis

It is a mixture of ridge and lasso regression models trained with L1 and L2 norms. The elastic net brings about a grouping effect wherein strongly correlated predictors tend to be in/out of the model together. Using the elastic net regression model is recommended when the number of predictors is far greater than the number of observations.

Please note that the elastic net regression model emerged as an alternative to the lasso regression model, whose variable selection was too dependent on the data and therefore unstable. By combining the penalties of ridge and lasso regression, elastic net gets the best out of both models.

A clinical research team having access to a microarray data set on leukemia (LEU) was interested in constructing a diagnostic rule based on the expression level of presented gene samples for predicting the type of leukemia. The data set they had, consisted of a large number of genes and a few samples.

Apart from that, they were given a specific set of samples to be used as training samples, out of which some were infected with type 1 leukemia (acute lymphoblastic leukemia) and some with type 2 leukemia (acute myeloid leukemia).

Model fitting and tuning parameter selection by tenfold CV were carried out on the training data. Then they compared the performance of those methods by computing their prediction mean-squared error on the test data to get the necessary results.
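A hedged sketch of an elastic net fit in the many-predictors, few-samples regime the example describes; everything below is simulated, so no real leukemia data is involved. The l1_ratio parameter mixes the lasso (L1) and ridge (L2) penalties, and cv=10 mirrors the tenfold cross-validation mentioned above.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
n_samples, n_genes = 40, 500  # far more predictors than observations
X = rng.normal(size=(n_samples, n_genes))
y = X[:, 0] + X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n_samples)

# Tune the L1/L2 mix and penalty strength by tenfold cross-validation
model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=10, random_state=0).fit(X, y)
print("chosen l1_ratio:", model.l1_ratio_, "| alpha:", round(model.alpha_, 4))
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```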

Regression Analysis Usage in Market Research

A market research survey focuses on three major metrics: Customer Satisfaction, Customer Loyalty, and Customer Advocacy. Remember, although these metrics tell us about customer health and intentions, they fail to tell us ways of improving the position. Therefore, an in-depth survey questionnaire intended to ask consumers the reason behind their dissatisfaction is definitely a way to gain practical insights.

However, it has been found that people often struggle to put forth their motivation or demotivation or describe their satisfaction or dissatisfaction. In addition to that, people always give undue importance to some rational factors, such as price, packaging, etc. Overall, it acts as a predictive analytic and forecasting tool in market research.

When used as a forecasting tool, regression analysis can determine an organization’s sales figures by taking into account external market data. A multinational company conducts a market research survey to understand the impact of various factors such as GDP (Gross Domestic Product), CPI (Consumer Price Index), and other similar factors on its revenue generation model.

Regression analysis that takes forecasted marketing indicators into account can then be used to predict the tentative revenue that will be generated in future quarters and even in future years. However, the further into the future you go, the more unreliable the data becomes, leaving a wider margin of error.

Case study of using regression analysis

A water purifier company wanted to understand the factors leading to brand favorability. The survey was the best medium for reaching out to existing and prospective customers. A large-scale consumer survey was planned, and a discreet questionnaire was prepared using the best survey tool .

A number of questions related to the brand, favorability, satisfaction, and probable dissatisfaction were effectively asked in the survey. After getting optimum responses to the survey, regression analysis was used to narrow down the top ten factors responsible for driving brand favorability.

All ten derived attributes, in one way or another, highlighted their importance in impacting the favorability of that specific water purifier brand.

How Regression Analysis Derives Insights From Surveys

It is easy to run a regression analysis using Excel or SPSS, but while doing so, the importance of four numbers in interpreting the data must be understood.

The first two numbers out of the four numbers directly relate to the regression model itself.

  • F-Value: It measures the overall statistical significance of the survey model. Remember, a significance (p-value) of the F-statistic below 0.05 is considered meaningful; it ensures the survey analysis output is not due to chance.
  • R-Squared: The proportion of the dependent variable's movement that the independent variables explain. If the R-Squared value is 0.7, the tested independent variables explain 70% of the dependent variable's movement, meaning the survey analysis output is highly predictive in nature and can be considered accurate.

The other two numbers relate to each of the independent variables while interpreting regression analysis.

  • P-Value: Like the F-Value, the P-Value measures statistical significance, here indicating how relevant and statistically significant each independent variable's effect is. Once again, we are looking for a value of less than 0.05.
  • Interpretation (coefficient): The fourth number is the coefficient obtained after measuring the impact of each variable. It tells us by what value the dependent variable is expected to increase when the independent variable under consideration increases by one, while all other independent variables are held constant.

In a few cases, the simple coefficient is replaced by a standardized coefficient demonstrating the contribution from each independent variable to move or bring about a change in the dependent variable.
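To show where these four numbers appear in practice, here is a short sketch using statsmodels on simulated survey-style data; the variable names are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 120
ads = rng.normal(size=n)    # e.g. advertising spend (simulated)
price = rng.normal(size=n)  # e.g. relative price (simulated)
revenue = 10 + 2.5 * ads - 1.2 * price + rng.normal(size=n)

X = sm.add_constant(np.column_stack([ads, price]))
fit = sm.OLS(revenue, X).fit()

print("F-statistic:", fit.fvalue, "| its p-value:", fit.f_pvalue)  # model significance
print("R-squared:", round(fit.rsquared, 3))  # share of movement explained
print("coefficients:", fit.params)           # impact per one-unit change
print("p-values:", fit.pvalues)              # per-variable significance
```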

Advantages of Using Regression Analysis in an Online Survey

01. Get access to predictive analytics

Do you know utilizing regression analysis to understand the outcome of a business survey is like having the power to unveil future opportunities and risks?

For example, after evaluating a particular television advertisement slot, businesses can predict the sales it is likely to generate and use that data to estimate a maximum bid for the slot. The finance and insurance industry as a whole depends a lot on regression analysis of survey data to identify trends and opportunities for more accurate planning and decision-making.

02. Enhance operational efficiency

Do you know businesses use regression analysis to optimize their business processes?

For example, before launching a new product line, businesses conduct consumer surveys to better understand the impact of various factors on the product’s production, packaging, distribution, and consumption.

A data-driven foresight helps eliminate the guesswork, hypothesis, and internal politics from decision-making. A deeper understanding of the areas impacting operational efficiencies and revenues leads to better business optimization.

03. Quantitative support for decision-making

Business surveys today generate a lot of data related to finance, revenue, operation, purchases, etc., and business owners are heavily dependent on various data analysis models to make informed business decisions.

For example, regression analysis helps enterprises to make informed strategic workforce decisions. Conducting and interpreting the outcome of employee surveys like Employee Engagement Surveys, Employee Satisfaction Surveys, Employer Improvement Surveys, Employee Exit Surveys, etc., boosts the understanding of the relationship between employees and the enterprise.

It also helps get a fair idea of certain issues impacting the organization’s working culture, working environment, and productivity. Furthermore, intelligent business-oriented interpretations reduce the huge pile of raw data into actionable information to make a more informed decision.

04. Prevent mistakes from happening due to intuitions

By knowing how to use regression analysis for interpreting survey results, one can easily provide factual support to management for making informed decisions; but do you know that it also helps in keeping faults out of judgment?

For example, a mall manager may think that extending the mall's closing time will result in more sales. Regression analysis may contradict this belief, showing that the predicted increase in revenue from additional sales would not cover the increased operating expenses arising from longer working hours.

Regression analysis is a useful statistical method for modeling and comprehending the relationships between variables. It provides numerous advantages to various data types and interactions. Researchers and analysts may gain useful insights into the factors influencing a dependent variable and use the results to make informed decisions. 

With QuestionPro Research, you can improve the efficiency and accuracy of regression analysis by streamlining the data gathering, analysis, and reporting processes. The platform’s user-friendly interface and wide range of features make it a valuable tool for researchers and analysts conducting regression analysis as part of their research projects.

Sign up for the free trial today and let your research dreams fly!


Regression: Definition, Analysis, Calculation, and Example


What Is Regression?

Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between a dependent variable and one or more independent variables.

Linear regression is the most common form of this technique. Also called simple regression or ordinary least squares (OLS), linear regression establishes the linear relationship between two variables.

Linear regression is graphically depicted using a straight line of best fit with the slope defining how the change in one variable impacts a change in the other. The y-intercept of a linear regression relationship represents the value of the dependent variable when the value of the independent variable is zero. Nonlinear regression models also exist, but are far more complex.

Key Takeaways

  • Regression is a statistical technique that relates a dependent variable to one or more independent variables.
  • A regression model is able to show whether changes observed in the dependent variable are associated with changes in one or more of the independent variables.
  • It does this by essentially determining a best-fit line and seeing how the data is dispersed around this line.
  • Regression helps economists and financial analysts in things ranging from asset valuation to making predictions.
  • For regression results to be properly interpreted, several assumptions about the data and the model itself must hold.

In economics, regression is used to help investment managers value assets and understand the relationships between factors such as commodity prices and the stocks of businesses dealing in those commodities.

While a powerful tool for uncovering the associations between variables observed in data, it cannot easily indicate causation. Regression as a statistical technique should not be confused with the concept of regression to the mean, also known as mean reversion .


Understanding Regression

Regression captures the correlation between variables observed in a data set and quantifies whether those correlations are statistically significant or not.

The two basic types of regression are simple linear regression and  multiple linear regression , although there are nonlinear regression methods for more complicated data and analysis. Simple linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y, while multiple linear regression uses two or more independent variables to predict the outcome. Analysts can use stepwise regression to examine each independent variable contained in the linear regression model.

Regression can help finance and investment professionals. For instance, a company might use it to predict sales based on weather, previous sales, gross domestic product (GDP) growth, or other types of conditions. The capital asset pricing model (CAPM) is an often-used regression model in finance for pricing assets and discovering the costs of capital.

Regression and Econometrics

Econometrics is a set of statistical techniques used to analyze data in finance and economics. An example of the application of econometrics is to study the income effect using observable data. An economist may, for example, hypothesize that as a person increases their income , their spending will also increase.

If the data show that such an association is present, a regression analysis can then be conducted to understand the strength of the relationship between income and consumption and whether or not that relationship is statistically significant.

Note that you can have several independent variables in an analysis—for example, changes to GDP and inflation in addition to unemployment in explaining stock market prices. When more than one independent variable is used, it is referred to as  multiple linear regression . This is the most commonly used tool in econometrics.

Econometrics is sometimes criticized for relying too heavily on the interpretation of regression output without linking it to economic theory or looking for causal mechanisms. It is crucial that the findings revealed in the data are able to be adequately explained by a theory.

Calculating Regression

Linear regression models often use a least-squares approach to determine the line of best fit. The least-squares technique is determined by minimizing the sum of squares created by a mathematical function. A square is, in turn, determined by squaring the distance between a data point and the regression line or mean value of the data set.

Once this process has been completed (usually done today with software), a regression model is constructed. The general form of each type of regression model is:

Simple linear regression:

Y = a + bX + u

Multiple linear regression:

Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

where:

  • Y = the dependent variable you are trying to predict or explain
  • X = the explanatory (independent) variable(s) you are using to predict or associate with Y
  • a = the y-intercept
  • b = (beta coefficient) the slope of the explanatory variable(s)
  • u = the regression residual or error term

Example of How Regression Analysis Is Used in Finance

Regression is often used to determine how specific factors—such as the price of a commodity, interest rates, particular industries, or sectors—influence the price movement of an asset. The aforementioned CAPM is based on regression, and it's utilized to project the expected returns for stocks and to generate costs of capital. A stock’s returns are regressed against the returns of a broader index, such as the S&P 500, to generate a beta for the particular stock.

Beta is the stock’s risk in relation to the market or index and is reflected as the slope in the CAPM. The return for the stock in question would be the dependent variable Y, while the independent variable X would be the market risk premium.

Additional variables such as the market capitalization of a stock, valuation ratios, and recent returns can be added to the CAPM to get better estimates for returns. These additional factors are known as the Fama-French factors, named after the professors who developed the multiple linear regression model to better explain asset returns.

Why Is It Called Regression?

Although there is some debate about the origins of the name, the statistical technique described above most likely was termed “regression” by Sir Francis Galton in the 19th century to describe the statistical feature of biological data (such as heights of people in a population) to regress to some mean level. In other words, while there are shorter and taller people, only outliers are very tall or short, and most people cluster somewhere around (or “regress” to) the average.

What Is the Purpose of Regression?

In statistical analysis, regression is used to identify the associations between variables occurring in some data. It can show the magnitude of such an association and determine its statistical significance. Regression is a powerful tool for statistical inference and has been used to try to predict future outcomes based on past observations.

How Do You Interpret a Regression Model?

A regression model output may be in the form of Y = 1.0 + 3.2X1 − 2.0X2 + 0.21.

Here we have a multiple linear regression that relates some variable Y with two explanatory variables X1 and X2. We would interpret the model as the value of Y changing by 3.2 units for every one-unit change in X1 (if X1 goes up by 2, Y goes up by 6.4, etc.), holding all else constant. That means controlling for X2, X1 has this observed relationship. Likewise, holding X1 constant, every one-unit increase in X2 is associated with a 2.0-unit decrease in Y. We can also note the y-intercept of 1.0, meaning that Y = 1 when X1 and X2 are both zero. The error term (residual) is 0.21.

What Are the Assumptions That Must Hold for Regression Models?

To properly interpret the output of a regression model, the following main assumptions about the underlying data process of what you are analyzing must hold:

  • The relationship between variables is linear;
  • There must be homoskedasticity , or the variance of the variables and error term must remain constant;
  • All explanatory variables are independent of one another;
  • All variables are normally distributed .

The Bottom Line

Regression is a statistical method that tries to determine the strength and character of the relationship between one dependent variable and a series of other variables. It is used in finance, investing, and other disciplines.

Regression analysis uncovers the associations between variables observed in data, but cannot easily indicate causation.

Margo Bergman. “ Quantitative Analysis for Business: 12. Simple Linear Regression and Correlation .” University of Washington Pressbooks, 2022.

Margo Bergman. “ Quantitative Analysis for Business: 13. Multiple Linear Regression .” University of Washington Pressbooks, 2022.

Fama, Eugene F., and Kenneth R. French, via Wiley Online Library. “ The Cross-Section of Expected Stock Returns .” The Journal of Finance , vol. 47, no. 2, June 1992, pp. 427–465.

Stanton, Jeffrey M., via Taylor & Francis Online. “ Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors .” Journal of Statistics Education , vol. 9, no. 3, 2001.

CFA Institute. “ Basics of Multiple Regression and Underlying Assumptions .”

What is Regression Analysis in Data Science?


Understanding regression analysis in data science

Regression analysis stands as a central statistical technique in data science.

Regression analysis delves into the intricate relationship between dependent and independent variables, facilitating predictions of future scenarios and discerning variable impacts.


Core concepts of regression analysis

At its essence, regression analysis entails the positioning of a line or curve within a set of data points.

This line signifies the relationship between variables, assuming a linear bond. The main constituents include:

  • Dependent variable : The prediction or explanation subject.
  • Independent variables : Variables that influence the dependent variable.
  • Coefficients : Representing the intercepts and slopes of the equation.
  • Error term : Known as the residual, it’s the gap between actual and predicted data.

This technique illuminates the interplay between variables, vital in sectors like economics where discerning these relationships guides decision-making.

However, the foundational assumption of linearity is crucial, and overlooking elements like outliers can skew outcomes.

Regression’s significance in data science


Regression analysis is a powerful statistical technique that quantifies relationships between variables, identifies significant predictors, and bases predictions on discerned patterns.

By employing regression, data scientists can pinpoint determinants of specific outcomes, imperative in areas like marketing where capturing variables’ impact on consumer behavior is essential.

Additionally, the technique aids in quantifying uncertainty, helping distinguish genuine associations from random occurrences.

Types of regression analysis

  • Linear regression : This basic form hypothesizes a linear relationship between variables. It’s routinely employed across various sectors to gauge how independent variables influence the dependent counterpart.
  • Logistic regression : Suited for binary outcomes, logistic regression predicts probabilities based on independent variables. It’s paramount in areas where outcomes have two categories.
  • Polynomial regression : Venturing beyond linearity, polynomial regression embraces non-linear associations by integrating polynomial terms, offering more nuanced curve fits.

Undertaking regression analysis

  • Data collection and refinement : Collect relevant data and prepare it for analysis. This involves cleaning the data, handling missing values, and transforming variables if necessary.
  • Model choice and application : The appropriate regression model must be selected once the data is ready. This involves considering the type of relationship, the distribution of variables, and the assumptions of the chosen model. The model is then fitted to the data using statistical algorithms.
  • Decoding results : Understanding the results is pivotal post-application—from variable significance to evaluating the goodness of fit and practical implications. A compact sketch of these three steps appears below.
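The sketch below ties the three steps together with pandas and statsmodels; the data frame is generated on the spot, standing in for whatever refined data set step one would produce, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# 1. Data collection and refinement (simulated here; real data would be
#    loaded, cleaned of missing values, and transformed at this step)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "exposure": rng.normal(size=80),
    "age": rng.uniform(20, 70, size=80),
})
df["outcome"] = 1.5 * df["exposure"] + 0.05 * df["age"] + rng.normal(size=80)
df = df.dropna()

# 2. Model choice and application: a linear model, assuming linearity is plausible
model = smf.ols("outcome ~ exposure + age", data=df).fit()

# 3. Decoding results: coefficients, p-values, and goodness of fit in one view
print(model.summary())
```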

Relevance in predictive modeling


Regression analysis is the backbone of predictive modeling.

By pinpointing key influencing elements, it augments the precision and reliability of predictions.

Furthermore, regression assists in estimating prospective trends in forecasting, enabling informed decision-making.

Hurdles and restrictions

Though potent, regression analysis is not without its challenges:

  • Assumptions and pitfalls : Regression relies heavily on several assumptions. Any deviation can lead to skewed interpretations. Being vigilant about these assumptions is paramount.
  • Addressing hurdles : To navigate these challenges, various strategies are employed. Transforming variables or using sophisticated regression types can mitigate issues.

In data science, regression analysis is a powerful tool that paves the way for insightful predictions and informed decision-making.

For data scientists and researchers, understanding and correctly applying regression analysis remains indispensable in their analytical toolkit.

Considering a future in data science?

The Institute of Data offers a comprehensive curriculum designed to equip you with in-demand skills.

Ready to position yourself at the forefront of the rapidly evolving arena? Contact our local team for a free career consultation .


Published: 31 January 2022

The clinician’s guide to interpreting a regression analysis

Sofia Bzovsky, Mark R. Phillips, Robyn H. Guymer, Charles C. Wykoff, Lehana Thabane, Mohit Bhandari & Varun Chaudhary, on behalf of the R.E.T.I.N.A. study group

Eye, volume 36, pages 1715–1717 (2022)


Introduction

When researchers are conducting clinical studies to investigate factors associated with, or treatments for disease and conditions to improve patient care and clinical practice, statistical evaluation of the data is often necessary. Regression analysis is an important statistical method that is commonly used to determine the relationship between several factors and disease outcomes or to identify relevant prognostic factors for diseases [ 1 ].

This editorial will acquaint readers with the basic principles of and an approach to interpreting results from two types of regression analyses widely used in ophthalmology: linear, and logistic regression.

Linear regression analysis

Linear regression is used to quantify a linear relationship or association between a continuous response/outcome variable or dependent variable with at least one independent or explanatory variable by fitting a linear equation to observed data [ 1 ]. The variable that the equation solves for, which is the outcome or response of interest, is called the dependent variable [ 1 ]. The variable that is used to explain the value of the dependent variable is called the predictor, explanatory, or independent variable [ 1 ].

In a linear regression model, the dependent variable must be continuous (e.g. intraocular pressure or visual acuity), whereas, the independent variable may be either continuous (e.g. age), binary (e.g. sex), categorical (e.g. age-related macular degeneration stage or diabetic retinopathy severity scale score), or a combination of these [ 1 ].

When investigating the effect or association of a single independent variable on a continuous dependent variable, this type of analysis is called a simple linear regression [ 2 ]. In many circumstances though, a single independent variable may not be enough to adequately explain the dependent variable. Often it is necessary to control for confounders and in these situations, one can perform a multivariable linear regression to study the effect or association with multiple independent variables on the dependent variable [ 1 , 2 ]. When incorporating numerous independent variables, the regression model estimates the effect or contribution of each independent variable while holding the values of all other independent variables constant [ 3 ].

When interpreting the results of a linear regression, there are a few key outputs for each independent variable included in the model:

Estimated regression coefficient—The estimated regression coefficient indicates the direction and strength of the relationship or association between the independent and dependent variables [ 4 ]. Specifically, the regression coefficient describes the change in the dependent variable for each one-unit change in the independent variable, if continuous [ 4 ]. For instance, if examining the relationship between a continuous predictor variable and intra-ocular pressure (dependent variable), a regression coefficient of 2 means that for every one-unit increase in the predictor, there is a two-unit increase in intra-ocular pressure. If the independent variable is binary or categorical, then the one-unit change represents switching from one category to the reference category [ 4 ]. For instance, if examining the relationship between a binary predictor variable, such as sex, where ‘female’ is set as the reference category, and intra-ocular pressure (dependent variable), a regression coefficient of 2 means that, on average, males have an intra-ocular pressure that is 2 mm Hg higher than females.

Confidence Interval (CI)—The CI, typically set at 95%, is a measure of the precision of the coefficient estimate of the independent variable [ 4 ]. A large CI indicates a low level of precision, whereas a small CI indicates a higher precision [ 5 ].

P value—The p value for the regression coefficient indicates whether the relationship between the independent and dependent variables is statistically significant [ 6 ].
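To see where these three outputs live in a fitted model, here is a hedged sketch in Python with statsmodels; the data is simulated to stand in for, say, age and intra-ocular pressure, and comes from no real study.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 150
age = rng.uniform(40, 80, size=n)
iop = 10 + 0.08 * age + rng.normal(scale=1.5, size=n)  # simulated intra-ocular pressure

X = sm.add_constant(age)
fit = sm.OLS(iop, X).fit()

print("coefficient:", fit.params[1])  # change in IOP per one year of age
print("95% CI:", fit.conf_int()[1])   # precision of that estimate
print("p-value:", fit.pvalues[1])     # statistical significance
```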

Logistic regression analysis

As with linear regression, logistic regression is used to estimate the association between one or more independent variables with a dependent variable [ 7 ]. However, the distinguishing feature in logistic regression is that the dependent variable (outcome) must be binary (or dichotomous), meaning that the variable can only take two different values or levels, such as ‘1 versus 0’ or ‘yes versus no’ [ 2 , 7 ]. The effect size of predictor variables on the dependent variable is best explained using an odds ratio (OR) [ 2 ]. ORs are used to compare the relative odds of the occurrence of the outcome of interest, given exposure to the variable of interest [ 5 ]. An OR equal to 1 means that the odds of the event in one group are the same as the odds of the event in another group; there is no difference [ 8 ]. An OR > 1 implies that one group has a higher odds of having the event compared with the reference group, whereas an OR < 1 means that one group has a lower odds of having an event compared with the reference group [ 8 ]. When interpreting the results of a logistic regression, the key outputs include the OR, CI, and p-value for each independent variable included in the model.
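A matching sketch for logistic regression: fit a model on a simulated binary outcome, then exponentiate the coefficient and its confidence interval to read off the OR.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
exposure = rng.integers(0, 2, size=n)                  # binary predictor
logit_p = -0.5 + 0.9 * exposure                        # true log-odds (simulated)
outcome = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))  # binary outcome

X = sm.add_constant(exposure)
fit = sm.Logit(outcome, X).fit(disp=False)

print("odds ratio:", np.exp(fit.params[1]))         # OR > 1 means higher odds
print("95% CI for OR:", np.exp(fit.conf_int()[1]))  # exponentiated CI bounds
```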

Clinical example

Sen et al. investigated the association between several factors (independent variables) and visual acuity outcomes (dependent variable) in patients receiving anti-vascular endothelial growth factor therapy for macular oedema secondary to central retinal vein occlusion, by means of both linear and logistic regression [9]. Multivariable linear regression demonstrated that age (estimate −0.33, 95% CI −0.48 to −0.19, p < 0.001) was significantly associated with best-corrected visual acuity (BCVA) at 100 weeks at the alpha = 0.05 significance level [9]. The regression coefficient of −0.33 means that BCVA at 100 weeks decreases by 0.33 with each additional year of age.

Multivariable logistic regression also demonstrated that age and ellipsoid zone status were significantly associated with achieving a BCVA letter score >70 letters at 100 weeks at the alpha = 0.05 significance level. Patients ≥75 years of age were at decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those <50 years of age, since the OR is less than 1 (OR 0.96, 95% CI 0.94 to 0.98, p = 0.001) [9]. Similarly, patients between the ages of 50–74 years were also at decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those <50 years of age, since the OR is less than 1 (OR 0.15, 95% CI 0.04 to 0.48, p = 0.001) [9]. Likewise, those with a non-intact ellipsoid zone were at decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those with an intact ellipsoid zone (OR 0.20, 95% CI 0.07 to 0.56; p = 0.002). On the other hand, patients with an ungradable/questionable ellipsoid zone were at increased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those with an intact ellipsoid zone, since the OR is greater than 1 (OR 2.26, 95% CI 1.14 to 4.48; p = 0.02) [9].

The narrower the CI, the more precise the estimate is; and the smaller the p value (relative to alpha = 0.05), the greater the evidence against the null hypothesis of no effect or association.

Simply put, linear and logistic regression are useful tools for appreciating the relationship between predictor/explanatory and outcome variables for continuous and dichotomous outcomes, respectively, that can be applied in clinical practice, such as to gain an understanding of risk factors associated with a disease of interest.

References

1. Schneider A, Hommel G, Blettner M. Linear regression analysis. Dtsch Arztebl Int. 2010;107:776–82.

2. Bender R. Introduction to the use of regression models in epidemiology. In: Verma M, editor. Cancer epidemiology. Methods in molecular biology. Humana Press; 2009:179–95.

3. Schober P, Vetter TR. Confounding in observational research. Anesth Analg. 2020;130:635.

4. Schober P, Vetter TR. Linear regression in medical research. Anesth Analg. 2021;132:108–9.

5. Szumilas M. Explaining odds ratios. J Can Acad Child Adolesc Psychiatry. 2010;19:227–9.

6. Thiese MS, Ronna B, Ott U. P value interpretations and considerations. J Thorac Dis. 2016;8:E928–31.

7. Schober P, Vetter TR. Logistic regression in medical research. Anesth Analg. 2021;132:365–6.

8. Zabor EC, Reddy CA, Tendulkar RD, Patil S. Logistic regression in clinical studies. Int J Radiat Oncol Biol Phys. 2022;112:271–7.

9. Sen P, Gurudas S, Ramu J, Patrao N, Chandra S, Rasheed R, et al. Predictors of visual acuity outcomes after anti-vascular endothelial growth factor treatment for macular edema secondary to central retinal vein occlusion. Ophthalmol Retin. 2021;5:1115–24.


Bzovsky, S., Phillips, M.R., Guymer, R.H. et al. The clinician’s guide to interpreting a regression analysis. Eye 36, 1715–1717 (2022). https://doi.org/10.1038/s41433-022-01949-z




The complete guide to regression analysis

What is regression analysis and why is it useful? While most of us have heard the term, understanding regression analysis in detail may be something you need to brush up on. Here’s what you need to know about this popular method of analysis.

When you rely on data to drive and guide business decisions, as well as predict market trends, just gathering and analyzing what you find isn’t enough — you need to ensure it’s relevant and valuable.

The challenge, however, is that so many variables can influence business data: market conditions, economic disruption, even the weather! As such, it’s essential you know which variables are affecting your data and forecasts, and what data you can discard.

And one of the most effective ways to determine data value and monitor trends (and the relationships between them) is to use regression analysis, a set of statistical methods used for the estimation of relationships between independent and dependent variables.

In this guide, we’ll cover the fundamentals of regression analysis, from what it is and how it works to its benefits and practical applications.


What is regression analysis?

Regression analysis is a statistical method. It’s used for analyzing different factors that might influence an objective – such as the success of a product launch, business growth, a new marketing campaign – and determining which factors are important and which ones can be ignored.

Regression analysis can also help leaders understand how different variables impact each other and what the outcomes are. For example, when forecasting financial performance, regression analysis can help leaders determine how changes in the business can influence revenue or expenses in the future.

Running an analysis of this kind, you might find that there’s a high correlation between the number of marketers employed by the company, the leads generated, and the opportunities closed.

This seems to suggest that a high number of marketers and a high number of leads generated influence sales success. But do you need both factors to close those sales? By analyzing the effects of these variables on your outcome, you might learn that when leads increase but the number of marketers employed stays constant, there is no impact on the number of opportunities closed, but if the number of marketers increases, leads and closed opportunities both rise.

Regression analysis can help you tease out these complex relationships so you can determine which areas you need to focus on in order to get your desired results, and avoid wasting time with those that have little or no impact. In this example, that might mean hiring more marketers rather than trying to increase leads generated.

How does regression analysis work?

Regression analysis starts with variables that are categorized into two types: dependent and independent variables. The variables you select depend on the outcomes you’re analyzing.

Understanding variables:

1. Dependent variable

This is the main variable that you want to analyze and predict. For example, operational (O) data such as your quarterly or annual sales, or experience (X) data such as your net promoter score (NPS) or customer satisfaction score (CSAT).

These variables are also called response variables, outcome variables, or left-hand-side variables (because they appear on the left-hand side of a regression equation).

There are three easy ways to identify them:

  • Is the variable measured as an outcome of the study?
  • Does the variable depend on another in the study?
  • Do you measure the variable only after other variables are altered?

2. Independent variable

Independent variables are the factors that could affect your dependent variables. For example, a price rise in the second quarter could make an impact on your sales figures.

You can identify independent variables with the following list of questions:

  • Is the variable manipulated, controlled, or used as a subject grouping method by the researcher?
  • Does this variable come before the other variable in time?
  • Are you trying to understand whether or how this variable affects another?

Independent variables are often referred to differently in regression depending on the purpose of the analysis. You might hear them called:

Explanatory variables

Explanatory variables are those which explain an event or an outcome in your study. For example, explaining why your sales dropped or increased.

Predictor variables

Predictor variables are used to predict the value of the dependent variable. For example, predicting how much sales will increase when new product features are rolled out.

Experimental variables

These are variables that can be manipulated or changed directly by researchers to assess the impact. For example, assessing how different product pricing ($10 vs $15 vs $20) will impact the likelihood to purchase.

Subject variables (also called fixed effects)

Subject variables can’t be changed directly, but vary across the sample. For example, age, gender, or income of consumers.

Unlike experimental variables, you can’t randomly assign or change subject variables, but you can design your regression analysis to determine the different outcomes of groups of participants with the same characteristics. For example, ‘how do price rises impact sales based on income?’

Carrying out regression analysis


So regression is about the relationships between dependent and independent variables. But how exactly do you do it?

Assuming you have your data collection done already, the first and foremost thing you need to do is plot your results on a graph. Doing this makes interpreting regression analysis results much easier as you can clearly see the correlations between dependent and independent variables.

Let’s say you want to carry out a regression analysis to understand the relationship between the number of ads placed and revenue generated.

On the Y-axis, you place the revenue generated. On the X-axis, the number of digital ads. By plotting the information on the graph, and drawing a line (called the regression line) through the middle of the data, you can see the relationship between the number of digital ads placed and revenue generated.


This regression line is the line that provides the best description of the relationship between your independent variables and your dependent variable. In this example, we’ve used a simple linear regression model.


Statistical analysis software can draw this line for you and precisely calculate the regression line. The software then provides a formula for the slope of the line, adding further context to the relationship between your dependent and independent variables.
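For instance, here is a minimal sketch of the ads-versus-revenue example with made-up numbers; any statistics package performs the same fit, shown here with NumPy and Matplotlib.

```python
# A minimal sketch of the ads-vs-revenue example with invented numbers:
# plot the data points and draw the least-squares regression line.
import numpy as np
import matplotlib.pyplot as plt

ads = np.array([10, 15, 20, 25, 30, 35, 40])       # number of digital ads (X-axis)
revenue = np.array([12, 18, 21, 26, 33, 36, 42])   # revenue, e.g. in $000s (Y-axis)

slope, intercept = np.polyfit(ads, revenue, deg=1) # least-squares straight-line fit
plt.scatter(ads, revenue)
plt.plot(ads, slope * ads + intercept, label=f"y = {slope:.2f}x + {intercept:.2f}")
plt.xlabel("Number of digital ads")
plt.ylabel("Revenue generated")
plt.legend()
plt.show()
```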

Simple linear regression analysis

A simple linear model uses a single straight line to determine the relationship between a single independent variable and a dependent variable.

This regression model is mostly used when you want to determine the relationship between two variables (like price increases and sales) or the value of the dependent variable at certain points of the independent variable (for example the sales levels at a certain price rise).

While linear regression is useful, it does require you to make some assumptions.

For example, it requires you to assume that:

  • the data was collected using a statistically valid sample collection method that is representative of the target population
  • the observed relationship between the variables can’t be explained by a ‘hidden’ third variable – in other words, there are no spurious correlations
  • the relationship between the independent variable and dependent variable is linear – meaning that the best fit along the data points is a straight line and not a curved one

Multiple regression analysis

As the name suggests, multiple regression analysis is a type of regression that uses multiple variables. It uses multiple independent variables to predict the outcome of a single dependent variable. Of the various kinds of multiple regression, multiple linear regression is one of the best-known.

Multiple linear regression is a close relative of the simple linear regression model in that it looks at the impact of several independent variables on one dependent variable. However, like simple linear regression, multiple regression analysis also requires you to make some basic assumptions.

For example, you will be assuming that:

  • there is a linear relationship between the dependent and independent variables (it creates a straight line and not a curve through the data points)
  • the independent variables aren’t highly correlated in their own right

An example of multiple linear regression would be an analysis of how marketing spend, revenue growth, and general market sentiment affect the share price of a company.

With multiple linear regression models you can estimate how these variables will influence the share price, and to what extent.
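A minimal sketch of that share-price example follows; the column names and figures are invented purely for illustration.

```python
# A minimal sketch of the share-price example with invented figures:
# several independent variables, one dependent variable.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "share_price":      [102, 98, 110, 115, 120, 108, 125, 130],
    "marketing_spend":  [1.0, 0.8, 1.2, 1.3, 1.5, 1.1, 1.6, 1.7],   # $m
    "revenue_growth":   [2.1, 1.5, 2.8, 3.0, 3.4, 2.2, 3.9, 4.1],   # %
    "market_sentiment": [0.2, -0.1, 0.4, 0.5, 0.6, 0.1, 0.7, 0.8],  # index
})

fit = smf.ols("share_price ~ marketing_spend + revenue_growth + market_sentiment",
              data=df).fit()
# Each coefficient estimates that variable's effect on share price
# while holding the other two constant.
print(fit.params)
```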

Multivariate linear regression

Multivariate linear regression involves more than one dependent variable as well as multiple independent variables, making it more complicated than linear or multiple linear regressions. However, this also makes it much more powerful and capable of making predictions about complex real-world situations.

For example, if an organization wants to establish or estimate how the COVID-19 pandemic has affected employees in its different markets, it can use multivariate linear regression, with the different geographical regions as dependent variables and the different facets of the pandemic as independent variables (such as mental health self-rating scores, proportion of employees working at home, lockdown durations and employee sick days).

Through multivariate linear regression, you can look at relationships between variables in a holistic way and quantify the relationships between them. As you can clearly visualize those relationships, you can make adjustments to dependent and independent variables to see which conditions influence them. Overall, multivariate linear regression provides a more realistic picture than looking at a single variable.

However, because multivariate techniques are complex, they involve high-level mathematics that require a statistical program to analyze the data.

Logistic regression

Logistic regression models the probability of a binary outcome based on independent variables.

So, what is a binary outcome? It’s when there are only two possible scenarios, either the event happens (1) or it doesn’t (0). e.g. yes/no outcomes, pass/fail outcomes, and so on. In other words, if the outcome can be described as being in either one of two categories.

Logistic regression makes predictions based on independent variables that are assumed or known to have an influence on the outcome. For example, the probability of a sports team winning their game might be affected by independent variables like weather, day of the week, whether they are playing at home or away and how they fared in previous matches.
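Here is a minimal sketch of the sports example; the match records and variable names are entirely hypothetical.

```python
# A minimal sketch of the sports example with hypothetical match records:
# logistic regression models the probability of a win (1) vs a loss (0).
import pandas as pd
import statsmodels.formula.api as smf

games = pd.DataFrame({
    "won":       [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1],
    "home_game": [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1],
    "rest_days": [3, 1, 4, 2, 2, 3, 1, 3, 4, 1, 2, 2],
})

fit = smf.logit("won ~ home_game + rest_days", data=games).fit(disp=0)
new_game = pd.DataFrame({"home_game": [1], "rest_days": [3]})
print(fit.predict(new_game))  # predicted probability of winning the next home game
```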

What are some common mistakes with regression analysis?

Across the globe, businesses are increasingly relying on quality data and insights to drive decision-making — but to make accurate decisions, it’s important that the data collected and statistical methods used to analyze it are reliable and accurate.

Using the wrong data or the wrong assumptions can result in poor decision-making, lead to missed opportunities to improve efficiency and savings, and — ultimately — damage your business long term.

  • Assumptions

When running regression analysis, be it a simple linear or multiple regression, it’s really important to check that the assumptions your chosen method requires have been met. If your data points don’t conform to a straight line of best fit, for example, you need to apply additional statistical modifications to accommodate the non-linear data. For example, if you are looking at income data, which tends to be strongly right-skewed, you could take the natural log of income as your variable and then back-transform the results after the model is fitted (see the sketch after this list).

  • Correlation vs. causation

It’s a well-worn phrase that bears repeating – correlation does not equal causation. While variables that are linked by causality will always show correlation, the reverse is not always true. Moreover, there is no statistic that can determine causality (although the design of your study overall can).

If you observe a correlation in your results, such as in the first example we gave in this article where there was a correlation between leads and sales, you can’t assume that one thing has influenced the other. Instead, you should use it as a starting point for investigating the relationship between the variables in more depth.

  • Choosing the wrong variables to analyze

Before you use any kind of statistical method, it’s important to understand the subject you’re researching in detail. Doing so means you’re making informed choices of variables and you’re not overlooking something important that might have a significant bearing on your dependent variable.

  • Model building

The variables you include in your analysis are just as important as the variables you choose to exclude. That’s because the strength of each independent variable is influenced by the other variables in the model. Other techniques, such as Key Drivers Analysis, are able to account for these variable interdependencies.
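Picking up the income example from the assumptions point above, here is a minimal sketch of the log transform with simulated data.

```python
# A minimal sketch of the log-transform idea from the assumptions point
# above; the income data are simulated, not real.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
years = rng.uniform(0, 30, 300)                               # years of experience
income = np.exp(10 + 0.05 * years + rng.normal(0, 0.3, 300))  # right-skewed incomes

X = sm.add_constant(years)
fit = sm.OLS(np.log(income), X).fit()  # model log(income) so the relationship is linear
print(fit.params)  # slope ≈ 0.05: each extra year multiplies income by about exp(0.05) ≈ 1.05
```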

Benefits of using regression analysis

There are several benefits to using regression analysis to judge how changing variables will affect your business and to ensure you focus on the right things when forecasting.

Here are just a few of those benefits:

Make accurate predictions

Regression analysis is commonly used when forecasting and forward planning for a business. For example, when predicting sales for the year ahead, a number of different variables will come into play to determine the eventual result.

Regression analysis can help you determine which of these variables are likely to have the biggest impact based on previous events and help you make more accurate forecasts and predictions.

Identify inefficiencies

Using a regression equation, a business can identify areas for improvement when it comes to efficiency, either in terms of people, processes, or equipment.

For example, regression analysis can help a car manufacturer determine order numbers based on external factors like the economy or environment.

They can then use the regression equation to determine how many members of staff and how much equipment they need to meet orders.

Drive better decisions

Improving processes or business outcomes is always on the minds of owners and business leaders, but without actionable data, they’re simply relying on instinct, and this doesn’t always work out.

This is particularly true when it comes to issues of price. For example, to what extent will raising the price (and to what level) affect next quarter’s sales?

There’s no way to know this without data analysis. Regression analysis can help provide insights into the correlation between price rises and sales based on historical data.

How do businesses use regression? A real-life example

Marketing and advertising spending are common topics for regression analysis. Companies use regression when trying to assess the value of ad spend and marketing spend on revenue.

A typical example is using a regression equation to assess the correlation between ad costs and conversions of new customers. In this instance,

  • our dependent variable (the factor we’re trying to assess the outcomes of) will be our conversions
  • the independent variable (the factor we’ll change to assess how it changes the outcome) will be the daily ad spend
  • the regression equation will try to determine whether an increase in ad spend has a direct correlation with the number of conversions we have

The analysis is relatively straightforward — using historical data from an ad account, we can use daily data to judge ad spend vs conversions and how changes to the spend alter the conversions.

By assessing this data over time, we can make predictions not only on whether increasing ad spend will lead to increased conversions but also what level of spending will lead to what increase in conversions. This can help to optimize campaign spend and ensure marketing delivers good ROI.

This is an example of a simple linear model. If we wanted to carry out a more complex regression equation, we could also factor in other independent variables such as seasonality, GDP, and the current reach of our chosen advertising networks.

By increasing the number of independent variables, we can get a better understanding of whether ad spend is resulting in an increase in conversions, whether it’s exerting an influence in combination with another set of variables, or if we’re dealing with a correlation with no causal impact – which might be useful for predictions anyway, but isn’t a lever we can use to increase sales.

Using this predicted value of each independent variable, we can more accurately predict how spend will change the conversion rate of advertising.
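A minimal sketch of this workflow is below; the daily figures are invented, and in practice they would come from your ad account’s historical data.

```python
# A minimal sketch of the ad-spend example with invented daily figures:
# fit spend vs conversions, then predict conversions at new spend levels.
import numpy as np
import statsmodels.api as sm

ad_spend = np.array([100, 150, 200, 250, 300, 350, 400], dtype=float)  # daily spend ($)
conversions = np.array([12, 17, 24, 28, 35, 38, 45], dtype=float)

fit = sm.OLS(conversions, sm.add_constant(ad_spend)).fit()
new_spend = sm.add_constant(np.array([500.0, 600.0]))
print(fit.predict(new_spend))  # predicted conversions at $500 and $600 per day
```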

Regression analysis tools

Regression analysis is an important tool when it comes to better decision-making and improved business outcomes. To get the best out of it, you need to invest in the right kind of statistical analysis software.

The best option is likely to be one that sits at the intersection of powerful statistical analysis and intuitive ease of use, as this will empower everyone from beginners to expert analysts to uncover meaning from data, identify hidden trends and produce predictive models without statistical training being required.


To help prevent costly errors, choose a tool that automatically runs the right statistical tests and visualizations and then translates the results into simple language that anyone can put into action.

With software that’s both powerful and user-friendly, you can isolate key experience drivers, understand what influences the business, apply the most appropriate regression methods, identify data issues, and much more.


With Qualtrics’ Stats iQ™, you don’t have to worry about the regression equation because our statistical software will run the appropriate equation for you automatically based on the variable type you want to monitor. You can also use several equations, including linear regression and logistic regression, to gain deeper insights into business outcomes and make more accurate, data-driven decisions.



When Should I Use Regression Analysis?

By Jim Frost

Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.

As a statistician, I should probably tell you that I love all statistical analyses equally—like parents with their kids. But, shhh, I have a secret! Regression analysis is my favorite because it provides tremendous flexibility, which makes it useful in so many different circumstances. In fact, I’ve described regression analysis as taking correlation to the next level!

In this blog post, I explain the capabilities of regression analysis, the types of relationships it can assess, how it controls the variables, and generally why I love it! You’ll learn when you should consider using regression analysis.

Related post: What are Independent and Dependent Variables?

Use Regression to Analyze a Wide Variety of Relationships

Regression analysis can handle many kinds of relationships. For example, you can use it to:

  • Model multiple independent variables
  • Include continuous and categorical variables
  • Use polynomial terms to model curvature
  • Assess interaction terms to determine whether the effect of one independent variable depends on the value of another variable

These capabilities are all cool, but they don’t include an almost magical ability. Regression analysis can unscramble very intricate problems where the variables are entangled like spaghetti. For example, imagine you’re a researcher studying any of the following:

  • Do socio-economic status and race affect educational achievement?
  • Do education and IQ affect earnings?
  • Do exercise habits and diet affect weight?
  • Are drinking coffee and smoking cigarettes related to mortality risk?
  • Does a particular exercise intervention have an impact on bone density that is a distinct effect from other physical activities?

More on the last two examples later!

All these research questions have entwined independent variables that can influence the dependent variables. How do you untangle a web of related variables? Which variables are statistically significant and what role does each one play? Regression comes to the rescue because you can use it for all of these scenarios!

Use Regression Analysis to Control the Independent Variables

As I mentioned, regression analysis describes how the changes in each independent variable are related to changes in the dependent variable. Crucially, regression also statistically controls every variable in your model.

What does controlling for a variable mean?

When you perform regression analysis, you need to isolate the role of each variable. For example, I participated in an exercise intervention study where our goal was to determine whether the intervention increased the subjects’ bone mineral density. We needed to isolate the role of the exercise intervention from everything else that can impact bone mineral density, which ranges from diet to other physical activity.

To accomplish this goal, you must minimize the effect of confounding variables. Regression analysis does this by estimating the effect that changing one independent variable has on the dependent variable while holding all the other independent variables constant. This process allows you to learn the role of each independent variable without worrying about the other variables in the model. Again, you want to isolate the effect of each variable.

Regression models help you prevent spurious correlations from confusing your results by controlling for confounders.

How do you control the other variables in regression?

A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model! Let’s look at this in action with an example.

A recent study analyzed the effect of coffee consumption on mortality. The first results indicated that higher coffee intake is related to a higher risk of death. However, coffee drinkers frequently smoke, and the researchers did not include smoking in their initial model. After they included smoking in the model, the regression results indicated that coffee intake lowers the risk of mortality while smoking increases it. This model isolates the role of each variable while holding the other variable constant. You can assess the effect of coffee intake while controlling for smoking. Conveniently, you’re also controlling for coffee intake when looking at the effect of smoking.

Note that the study also illustrates how excluding a relevant variable can produce misleading results. Omitting an important variable causes it to be uncontrolled, and it can bias the results for the variables that you do include in the model. This warning is particularly applicable for observational studies where the effects of omitted variables might be unbalanced. On the other hand, the randomization process in a true experiment tends to distribute the effects of these variables equally, which lessens omitted variable bias.
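The coffee-and-smoking story is easy to reproduce with simulated data. In the hedged sketch below, omitting the confounder flips the apparent sign of the coffee effect; every number is made up for illustration.

```python
# A hedged simulation of omitted variable bias: the true coffee effect is
# negative, but leaving smoking out of the model makes it look positive.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5000
smoking = rng.integers(0, 2, n)
coffee = 2 + 1.5 * smoking + rng.normal(0, 1, n)               # smokers drink more coffee
risk = 5 + 2.0 * smoking - 0.3 * coffee + rng.normal(0, 1, n)  # mortality risk score

naive = sm.OLS(risk, sm.add_constant(coffee)).fit()
full = sm.OLS(risk, sm.add_constant(np.column_stack([coffee, smoking]))).fit()
print(naive.params[1])  # biased upward (≈ +0.2 here) because smoking is omitted
print(full.params[1:])  # close to the true effects: -0.3 for coffee, +2.0 for smoking
```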

Related post: Confounding Variables and Omitted Variable Bias

How to Interpret Regression Output

To answer questions using regression analysis, you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values. When you have a low p-value (typically < 0.05), the independent variable is statistically significant. The coefficients represent the average change in the dependent variable given a one-unit change in the independent variable (IV) while controlling the other IVs.

For instance, if your dependent variable is income and your IVs include IQ and education (among other relevant variables), you might see output like this:

The low p-values indicate that both education and IQ are statistically significant. The coefficient for IQ indicates that each additional IQ point increases your income by an average of approximately $4.80 while controlling everything else in the model. Furthermore, an additional unit of education increases average earnings by $24.22 while holding the other variables constant.
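That output is easy to mimic. The sketch below simulates income data whose true IQ and education effects match the $4.80 and $24.22 figures above, then recovers them with regression; everything here is invented for illustration.

```python
# A minimal sketch mirroring the output described above: simulate income
# with true effects of $4.80 per IQ point and $24.22 per unit of
# education, then recover them with regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
iq = rng.normal(100, 15, n)
education = rng.uniform(8, 20, n)
income = 200 + 4.80 * iq + 24.22 * education + rng.normal(0, 50, n)

X = sm.add_constant(np.column_stack([iq, education]))
fit = sm.OLS(income, X).fit()
print(fit.params)   # coefficients near 4.80 (IQ) and 24.22 (education)
print(fit.pvalues)  # low p values -> statistically significant
```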

Regression analysis is a form of inferential statistics. The p-values help determine whether the relationships that you observe in your sample also exist in the larger population. I’ve written an entire blog post about how to interpret regression coefficients and their p-values, which I highly recommend.

Obtaining Trustworthy Regression Results

With the vast power of using regression comes great responsibility. Sorry, but that’s the way it must be. To obtain regression results that you can trust, you need to do the following:

  • Specify the correct model. As we saw, if you fail to include all the important variables in your model, the results can be biased.
  • Check your residual plots. Be sure that your model fits the data adequately.
  • Watch for excessive multicollinearity. Correlation between the independent variables is called multicollinearity. As we saw, some multicollinearity is OK, but excessive multicollinearity can be a problem (a quick check is sketched below).
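For the multicollinearity check, one common diagnostic is the variance inflation factor (VIF). Here is a minimal sketch with simulated, deliberately correlated predictors.

```python
# A minimal sketch of a multicollinearity check: variance inflation
# factors (VIFs) well above ~5-10 flag a problem. Data are simulated.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=500)  # deliberately correlated with x1
X = sm.add_constant(np.column_stack([x1, x2]))

for i in (1, 2):  # column 0 is the constant
    print(f"VIF for x{i}: {variance_inflation_factor(X, i):.1f}")
```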

Using regression analysis gives you the ability to separate the effects of complicated research questions. You can disentangle the spaghetti noodles by modeling and controlling all relevant variables, and then assess the role that each one plays.

There are many different regression analysis procedures. Read my post to determine which type of regression is correct for your data .

If you’re learning regression and like the approach I use in my blog, check out my eBook!


Reader Interactions


July 12, 2023 at 1:42 pm

Jim, I am trying to predict a categorical variable (college major category, where there are 5 different categories).

I have 3 different continuous variables: CAREER INTERESTS (which has 6 different subscales), PERSONALITY (which is the Big Five), and MORAL PREFERENCES (which uses the MFQ30 questionnaire, which has 5 subscales).

I am confused about what type of regression (hierarchical, etc.) I could use in this study. What are your thoughts?


July 17, 2023 at 12:18 am

Because your dependent variable is a categorical variable consider using Nominal Logistic Regression, also known as Multinomial Logistic Regression or Polytomous Logistic Regression. These terms are used interchangeably to describe a statistical method used for predicting the outcome of a categorical dependent variable based on one or more predictor variables.


January 9, 2023 at 12:03 am

First of all, many thanks for this fantastic website that makes statistics seem a little bit simpler and more clear. It’s a fantastic resource. I have a dataset from an experiment. It has a dependent variable, choice reaction time (CRT), and an independent variable, visual task. (This visual task includes two types of task: cognitively involved questions and minimized-cognition questions. These questions come in three types by number of choices/options (2, 4, 8), i.e. bits (1, 2, 3): questions with only two options to choose one answer, 4-option questions, and 8-option questions.) First I used linear regression to check the fit of the model (Hick’s law) in SPSS. But unfortunately the value of r-squared was very, very low. Now my professor is pushing me to make a new model using that dataset. Please suggest some steps and hints so I can start working on it.


December 14, 2022 at 3:59 am

Following are my research objectives: a. To identify youth’s competencies in entrepreneurship in the area. b. To identify the factors of youth involvement in agricultural entrepreneurship in the area.

I have used opinion-based questions designed on a 5-point Likert scale, except for the demographic questions at the beginning of my survey. The questionnaire contains simple opinion-based questions; there are no dependent and independent items in the questionnaire. My question is: which analysis is suitable for my research? Regression analysis, descriptive analysis, or both?

December 14, 2022 at 5:57 pm

The question of whether there is a dependent variable and one or more independent variables is separate from the question of whether you need to use inferential or descriptive statistics. And regression analysis can be either a descriptive or inferential procedure. Although, it is almost always an inferential procedure. Let’s go through these issues.

If you just want to describe a sample and you’re not generalizing from the sample to a population, you’re performing descriptive statistics. In this case, you don’t need to use hypothesis testing and confidence intervals.

However, if you have a representative sample and you want to infer the properties of an entire population, then you need to perform hypothesis testing and look at confidence intervals. Read my post about the Difference between Descriptive and Inferential Statistics for more information.

Regression analysis can apply to either of these cases. You perform the same analysis but if you’re only describing the sample, you can ignore the p-values and confidence intervals. Instead, you’ll focus on using the coefficients to describe the relationships between the variables within the sample. There’s less to worry about but you only know what is happening within that sample and can’t apply the results to a larger population. Conversely, if you do want to generalize to a population, then you must consider the p-values and confidence intervals and determine whether the coefficients are statistically significant. Most analysts performing regression analysis do want to generalize to the population, making it an inferential procedure.

However, regression analysis does specify independent and dependent variables. If you don’t need to specify those types of variables, then just use a correlation. Likert data is ordinal data. And for that data type, you need to use Spearman’s correlation. And, like regression analysis, correlation can be either a descriptive or inferential procedure. You either pay attention to the p-values (inferential) or not (descriptive). In both cases, you are interested in the correlation coefficients. You’ll see the relationships between the variables without needing to specify independent and dependent variables. You could calculate medians or modes for each item but not the mean because that’s not appropriate for ordinal data.
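If it helps, here’s a minimal sketch with made-up Likert responses using SciPy:

```python
# A minimal sketch with made-up 5-point Likert responses: Spearman's
# rank correlation is appropriate for ordinal data like this.
from scipy.stats import spearmanr

item_a = [4, 5, 3, 2, 4, 5, 1, 3, 4, 2]
item_b = [5, 4, 3, 2, 5, 4, 2, 3, 4, 1]

rho, p = spearmanr(item_a, item_b)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```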

I hope that helps!


December 12, 2022 at 9:18 am

Hi Jim, Supposing I’m interested in establishing an explanatory relationship between two variables, profits and average age of employees, using regression analysis, and I have access to data from the entire population of interest, e.g. all the 30 firms in a particular industry, do I still need to perform statistical inference? What would be the meaning of p-values, F tests etc., given that I am not intending to generalize the results for firms outside the industry? Do I still need to perform power analysis given that I have access to the entire population of 30 firms? Is the population of 30 firms too small for reliable statistical deductions? Thanks in advance Jim.

December 13, 2022 at 5:11 pm

Hi Patrick,

If you are truly interested in only those 30 companies and have access to data for all their employees, then you don’t need to perform inferential statistics. You’ve got the entire population. Hence, you know the population parameters. Hypothesis tests account for sampling error. But when you measure the entire population, there is zero sampling error and, hence, zero need to perform a hypothesis test.

However, if your average ages are based on only a sample of the employees in the 30 firms, then you’re still working with samples. To generalize from the sample to the population of all employees at the 30 firms, you’d need to use hypothesis testing in that case.

So, you just need to determine whether you really have access to the data for the entire population.


December 8, 2022 at 1:52 am

Hi, following are my research objectives: a. To investigate the effectiveness of asynchronous and synchronous modes of online education. b. To identify challenges that both teachers and students encounter in synchronous and asynchronous modes of online education. I have used Pearson correlation to find the relationship of the effectiveness of the synchronous mode with the asynchronous mode and the challenges of the online mode, and vice versa. I have used opinion-based questions designed on a 5-point Likert scale. The questionnaire contains simple opinion-based questions; there are no dependent and independent items in the questionnaire. My question is whether correlation is sufficient or whether I have to run other tests to prove my hypothesis.

December 10, 2022 at 8:28 pm

Because you have Likert scale data, you should use Spearman’s correlation because that is more appropriate for ordinal data.

Another possibility would be to use a nonparametric test and evaluate the median difference between the asynchronous and synchronous modes of education for each item.


November 21, 2022 at 3:45 am

A scientist determined the intensity of solar radiation and temperature of plantains every hour throughout the day. He used correlation to describe the association between the two variables. A friend said he would get more information using regression. What are your views?

November 22, 2022 at 4:15 pm

Yes, I’d agree that regression provides more information than correlation. But it’s also important to understand how correlation and regression present effect sizes differently, because in some cases you might want to use correlation even though it provides less information.

Correlation gives you a standardized effect size (i.e., the correlation coefficient). Standardized effect sizes don’t provide information using the natural units of the data. In other words, you can’t relate a correlation coefficient to what’s going on with the natural data units. However, it does allow you to compare correlations between dissimilar variables.

Conversely, regression gives you unstandardized effect sizes in the coefficients. They tell you exactly what’s going on between an independent variable and dependent variable using the DV’s natural data units. But it’s harder to compare results between regression models with dissimilar DV units. Regression does have its own standardized measure of the overall strength of the model in R-squared, but that applies to the whole model, not the individual variables. Additionally, in regression, you can standardize the regression coefficients, which facilitates comparisons within a regression model but not between models.

In some cases, while correlation gives you less information, you might want to use it to facilitate comparisons between studies.

Regression allows you to predict the mean outcome. It also gives you the tools to understand the amount of error between the predicted and observed values. Additionally, you can model a variety of different types of relationships (curved relationships and interactions). Correlation doesn’t provide those.

So, yes, in general, regression provides more information, but it also provides a different take on the nature of the relationships.


February 1, 2022 at 6:39 am

First, congrats and many thanks on this wonderful website, which makes statistics look a bit easier and more understandable. It’s a great resource, both for students and professionals. Thanks again.

A request for a bit of help, if you’d be kind enough to comment. I’m doing some research on the pharmaceutical industry, regulations, and their effects. I am looking at a) the probable effects (if any) of drug price increases on other consumption categories (like food and travel), and b) the effects of pricing regulations on drug shortages. For ‘a’, I’ve got inflation data and average consumption expense by quintiles. For ‘b’, I’ve got the last 6 years of data on drug shortages, mainly due to government-administered pricing. However, I’d need to show statistical significance (and, additionally, whether it could predict anything statistically significant about drug shortages in the future).

What kind of stat methodology would be appropriate in terms of ‘a’ and ‘b’? Would appreciate your help.


December 11, 2021 at 7:39 pm

Thank you so much Sir.


August 7, 2021 at 7:01 am

Hello Mr. Jim,

Thank you very much for your opinion. Much helpful.

I have another case with 2 DVs and multiple IDVs, and the scope is to determine the validity of the data. So for this case, can I run MANOVA as the regression analysis and look at the significance values and null hypotheses for a validity test?

Hoping to hear from you soon.

Kind Regards, A.Kaur

August 6, 2021 at 12:17 pm

Thank you for your reply, Mr. Jim. My goal is to predict which approach best predicts the CRI measure.

CRI-I: Disaster Management Cycle (DMC) based approach (Variable: PP, RS, RC, MP-contain all indices according to its phases) CRI- II: Sustainability based approach (Physical, Economy, Social-contain all indices according to its phases) CRI-III: Overall indices of data (24 indices from all the listed variable)

I’ve chosen PP and MP as my DVs, and RS and RC as my IDVs, since my goal focuses on DMC.

Hope I’m clear now. And hoping to hear from you soon Mr. Jim. Thank you.

August 7, 2021 at 12:04 am

One approach would be to fit a regression model for each approach and the DV. Then assess the goodness-of-fit measures. You’d be particularly interested in the standard error of the regression. This measure tells you how wrong the model typically is. You’d be looking for the model that produces the lowest value because it indicates it’s less wrong at predicting the outcome.

July 31, 2021 at 1:31 pm

Good day Mr. Jim,

I’ve decided to run regression analysis after the correlation test. My research is about the reliability and validity of a dataset for 3 approaches to a community resilience index (CRI): DMC-based, sustainability-based, and an overall-indices approach. So now, I’m confused about how to interpret the data with regression analysis. Can I use OLS and GLM to interpret the data?

3 approaches: 1:PP,RS,RC,MP {DMC} 2: PY,EC,SC {Sustainability} 3: Overall indices {24 indices}

For your information, all those approaches are proposed in 1 dataset that contains 24 indices. Also, I previously conducted a Likert questionnaire (5-point scale) to collect my data.

I hope my question is clear. Hoping to hear from you soon.

August 4, 2021 at 4:38 pm

I’m sorry but I don’t completely understand what your goal is for your analysis. Are you trying to determine which approach best predicts sustainability? What are your IVs and DV. It wasn’t totally clear from your description. Thanks!


July 16, 2021 at 1:56 am

Going through your blog gave me a good understanding of when to use regression analysis. Honestly, it’s an amazing blog.

July 19, 2021 at 10:22 pm

Thanks so much, Robin!


May 18, 2021 at 7:02 pm

Hey Jim, thanks for all the information. I would like to ask: are there any limitations to the multiple regression method? Is there another method in mathematics that can be more accurate than regression?

Sincerely, Mythili

May 20, 2021 at 1:46 am

Hi Mythili,

There are definitely limitations for regression! That’s a broad question that could be answered with a book. But a good place to start is to consider the assumptions for least squares regression. Click the link to learn more. You can think of those as limitations because if you violate the assumptions, you can’t necessarily trust the results! In fact, when you violate an assumption, you might need to switch to a different analysis or perform it a different way.

Additionally, the Gauss-Markov theorem states that least squares regression is the most efficient regression, but only when you satisfy those assumptions!


May 15, 2021 at 4:19 pm

Hi Sir, In regression analysis specifically multiple linear regression, should all variables (dependent and independent variables) be normally distributed?

Thank you, Helena

May 15, 2021 at 11:08 pm

In least squares regression analysis, you don’t assess the normality of the variables. Instead, you assess the normality of the residuals. However, there is some correlation because if you have a dependent variable that follows a very non-normal distribution, it can be harder to obtain normal residuals. But it’s really the residuals that you need to focus on. I discuss that in my article about the least squares (OLS) regression assumptions.


April 18, 2021 at 11:12 pm

Hi Sir, I’m currently a senior high school student and currently struggling with my quantitative research. As a statistician, what statistical treatment would you recommend for identifying an impact? To answer the question “What is the impact of the development of an educational brochure in minimizing cyberbullying in terms of: 3.1 Mental health, 3.2 Self-esteem”?

Waiting for your reply, desperate for answers lol Jane


April 16, 2021 at 7:21 am

Hi Jim, thank you

So would you advise an ordinal regression or another? I have a survey identifying if they use the new social media, which will place them into 2 groups. Then compare the 2 groups (1: use the new social media, 2: don’t use it) with a control (FB use) to compare their happiness scores (obtained from a survey as well; higher score = happier). The conclusions I can draw: would they be causal? Or more an indication that, for example, the new users have lower happiness?

Also, is there a graph that can be drawn after a regression?

On a side note, when would it be advisable to do correlations? For example, have both groups complete the happiness score and conduct correlations for this, plus a regression to control for covariates? Or is this not statistically advisable?

April 16, 2021 at 3:46 pm

I highly recommend you get my book about regression analysis because I think it would be really helpful with these nuts-and-bolts types of questions. You can find it in My Web Store.

As for the type of regression, as I mentioned, that depends largely on what you use for your dependent variable. If it’s a single Likert item, then you’d use ordinal logistic regression. If it’s the sum or average of multiple Likert items, you can often use the regular least squares regression. But, I don’t have a good handle on exactly how you’re defining your dependent variable.

There are graphs you can create afterwards to illustrate the results. I cover those in my book. I don’t have a good post to refer you to that shows them. Fitted line plots are good when you have simple regression (just one independent variable), but when you have more there are other types.

You can do correlations but be aware that they don’t control for other variables. If there are confounders, your correlations might exhibit omitted variable bias and differ from the relationships you’ll find in the regression model. Personally, I would just stick to the regression results because they control for confounders that you include in the model.

April 15, 2021 at 4:46 pm

Hi, sorry, as you can tell I’m a little confused about what’s best to do. Is it advisable to have 2 groups: users of the new social media and non-users of that new social media? Then do a t-test to compare their happiness scores. Then have participants answer a Facebook use questionnaire and control for this by conducting a hierarchical regression where I enter it in, to identify how much of the variance is explained by Facebook use?

Many thanks

April 15, 2021 at 10:28 pm

Hi Sam, you wouldn’t be able to do all of that with t-tests. I think regression is a better bet. You can still include an indicator variable to identify the two groups you mention AND include the controlling variables in that model. That way you can determine whether the difference between those two groups is statistically significant while controlling for the other IVs. All in one regression model!

April 15, 2021 at 8:26 am

Hi, I wanted to ask if regression is the best test for me. I am looking at happiness scores and time spent on a new social media site. As other social media sites have a relationship with happiness, and people don’t use just one social media site, I was going to control for this ‘other social media’ use. My 1st group would be users of the new social media site plus Facebook, and the 2nd group would be Facebook users only. They would do a happiness questionnaire and a questionnaire about their time/use. Any advice is really appreciated.

I have read around and found partial correlations. Do you advise that? So instead, participants would complete a questionnaire on their use of this new social media, then also do a questionnaire on their Facebook use and a happiness questionnaire. I would do a partial correlation between the new social media app use and happiness score, while controlling for Facebook use.

April 15, 2021 at 10:22 pm

This case sounds like a good time to use regression analysis. The type of regression depends largely on the nature of the dependent variable, which here comes from a survey. Perhaps it's a Likert scale item? If it's a single item, that's an ordinal scale and you'd need to use ordinal logistic regression. If you're summing multiple items for the DV, you might be able to use regular linear regression. Ordinal independent variables are a bit problematic; you'd need to use them as either continuous or categorical variables. You'd include the questions about FB use to control for that.
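To make that concrete, here's a minimal Python sketch of the two cases, with synthetic data standing in for the survey (the column names and numbers are invented for illustration, not taken from the question):

```python
# Hedged sketch: the DV's measurement level drives the model choice.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'new_sm_hours': rng.uniform(0, 5, 300),   # time on the new site
    'fb_hours': rng.uniform(0, 5, 300),       # Facebook use (control)
})
df['happiness'] = rng.integers(1, 6, 300)       # single 1-5 Likert item
df['happiness_mean'] = rng.uniform(1, 5, 300)   # average of several items

# Single Likert item -> ordinal logistic regression
happiness = df['happiness'].astype(pd.CategoricalDtype(ordered=True))
ord_res = OrderedModel(happiness, df[['new_sm_hours', 'fb_hours']],
                       distr='logit').fit(method='bfgs', disp=False)
print(ord_res.summary())

# Sum/average of several items -> ordinary least squares
X = sm.add_constant(df[['new_sm_hours', 'fb_hours']])
print(sm.OLS(df['happiness_mean'], X).fit().summary())
```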


April 13, 2021 at 5:10 am

Thank you very much for your answer,

I understand your point of view. However, that data set consists of companies investing the largest sums in R&D, not companies that also have the best results; some of them even show a loss of operating profit. Is that still a factor biasing my results?

Have a nice day, Natasha


April 12, 2021 at 11:36 am

thank you it was very useful

April 12, 2021 at 11:24 am

I am working on my thesis, which is about evaluating the motivation of firms to invest in R&D for new products. I am specifically interested in the automotive sector. I have data on the R&D ranking of the world's top 2500 companies (by industry), which consists of their R&D expenses (also R&D one-year growth), net sales (also net sales one-year growth), R&D intensity, capex, operating profit (also one-year growth), profitability, employees (also one-year growth), and market cap (also one-year growth).

My question is: which type of analysis would you recommend to fulfill the topic requirements?

April 13, 2021 at 12:29 am

Hi Natasha,

You could certainly use regression analysis to see which variables relate to R&D spending.

However, be aware that by using that list of companies, you are potentially biasing your results. For one thing, it's a list of top R&D companies, and you'd certainly want more of a mix of companies across the full range of R&D. You can learn from those who weren't so good at R&D too. Also, by using a list of the top R&D companies, you'll introduce some survival bias into the results because these are companies that made it and made it big (presumably). Again, you'd want a mix of companies that had varying degrees of success, and even some failures! If you limit your data to top companies, and particularly top companies in R&D, you'll limit how much you can learn. You might still be able to learn something, but just be aware that you're potentially biasing your results.


April 8, 2021 at 8:05 pm

Hi Mr. Jim! Thank you so much for your response. Well appreciated!

April 8, 2021 at 11:07 pm

You’re very welcome, Violetta!

April 8, 2021 at 2:08 am

Hi! I'm currently doing my research paper, and I am confused about whether I can use regression analysis, since my title is "New Normal Workplace Setting towards Employee's Engagement with their Workloads." For the moment I have used a correlational approach, since it deals with the relationship of two variables, but I'm still confused about what would be best for my research. Hope I can get a response soon. Thank you so much!

April 8, 2021 at 3:56 pm

Hi Violetta,

If you're working with just two variables, you have a choice. You can use either correlation or regression. You can even use both together! It depends on the goals of your research. Correlation coefficients are standardized measures of an effect size, while regression coefficients are unstandardized effect sizes. I write about the difference between standardized and unstandardized effect sizes. Click the link to read about that. I discuss both correlation and regression coefficients in that context. It should help you decide what is best for your research goals.
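To make the standardized/unstandardized distinction concrete, here's a small Python sketch with synthetic data: Pearson's r is exactly the regression slope once both variables are rescaled to standard-deviation units.

```python
# Sketch: r equals the slope after standardizing both variables.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]      # standardized effect size
b = np.polyfit(x, y, 1)[0]       # unstandardized slope (depends on units)
print(r, b * x.std() / y.std())  # the two printed values match
```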


March 3, 2021 at 9:34 am

Hi Jim, I am undertaking an MSc dissertation and would like to ask some questions about analysis, please. The research is health-related and I am looking at determinants of outcome. I have 5 continuous independent variables and I would like to know if they have an association with the outcome of a treatment. They involve age, temperature, and blood test values. The dependent variable is binary, that is, the treatment was either successful or not. I am looking to do a logistic regression analysis. Questions I have: 1. Do I first need to do tests to find out if there is statistical significance for each variable before I do the regression analysis, or can I go straight in? 2. If so, will I need to carry out tests to find out if I have skewed data, in order to know whether I need to do parametric or non-parametric tests? Thank you.

March 3, 2021 at 6:01 pm

You should go in with a bunch of theory and background knowledge about which independent variables to include. Look to other research studies for guides. When you have a set of IVs identified, it's usually OK to include them all and see what's significant. An important caveat is that if you have a small number of observations, you don't want to overfit your model. However, statistical significance shouldn't be your only guide for which variables to include and exclude.

To learn more about model specification, read my post about specifying your regression model. I write about it in the context of linear regression rather than binary logistic regression, but the ideas are the same.

In terms of the distribution of your data, typically, you assess the residuals rather than the data itself. Usually, you can assess the residual plots.
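For the linear-regression case, here's a minimal sketch of checking the residual plots in Python; the data are synthetic and the variable names are invented, and a binary DV like the one in the question would instead call for logistic regression with its own diagnostics.

```python
# Hedged sketch: assess the residuals, not the raw variables.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
df = pd.DataFrame({'age': rng.uniform(20, 80, 200),
                   'temperature': rng.normal(37.0, 0.5, 200)})
df['outcome_score'] = 2.0 * df['age'] + rng.normal(0, 10, 200)

X = sm.add_constant(df[['age', 'temperature']])
res = sm.OLS(df['outcome_score'], X).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(res.fittedvalues, res.resid, s=10)   # look for patterns/funnels
ax1.axhline(0, color='grey')
ax1.set(xlabel='Fitted values', ylabel='Residuals')
sm.qqplot(res.resid, line='45', fit=True, ax=ax2)  # check normality
plt.tight_layout()
plt.show()
```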


January 4, 2021 at 12:01 pm

Looks like treating both ordinal variables as continuous seems to solve my problems with non-mutually exclusive levels of the variables if I enter the variables as categorical. My main concern is to look at the variable as a whole, not by its levels, so it might be what I need; the measurement ranges were based on an established rating system and do not carry any weight in my analysis. Though I'll have to look more into it, as well as the residual plots, etc., before deciding. Thank you for highlighting this option!

Is it correct if I assign numerical values to the levels like this, 1 to 5, from lowest to highest?

Spacing: 1 = less than 60mm, 2 = 60-200mm, 3 = 200-600mm, 4 = 0.6-2m, 5 = more than 2m

Length: 1 = less than 1m, 2 = 1-3m, 3 = 3-10m, 4 = 10-20m, 5 = more than 20m

As for the data repetition, what I mean was say data for Site A is:

Set 1 (quantity: 25): SP3, PER5; Set 2 (quantity: 30): SP4, PER6; Set 3 (quantity: 56): SP2, PER3

So in the data input I entered set 1's data 25 times, set 2's data 30 times, and set 3's data 56 times. From what I have gathered from a fellow student and my lecturer, it is correct, but I'd like confirmation from a statistician. Thanks again!

December 31, 2020 at 5:44 am

I'm sorry, again the levels disappeared, maybe because I used (>) and (<), so it's messing up the coding of the comment.

spacing levels:

SP1: less than 60mm, SP2: 60-200mm, SP3: 200-600mm, SP4: 0.6-2m, SP5: more than 2m

length level:

PER1: more than 20m, PER2: 10-20m, PER3: 3-10m, PER4: 1-3m, PER5: less than 1m

Spacing and length were recoded as ranges since they were estimated and not measured individually, as it'd take too much time to measure each one (1 set of cracks may have at least 10 cracks, some can reach 50 or more, and the measurements are not exactly the same between cracks belonging to the same set).

I've input the dummies as in my previous reply when running the model, though the resulting equation I've provided does not include the length. Can ordinal variables be converted to/treated as continuous variables?

Also, since each set has its own quantity, I repeated the data in the input according to that quantity. Is that the right way of doing it?

January 2, 2021 at 7:10 pm

Technically those are ordinal variables. I write about this in more detail in my book about regression analysis, but you can enter these variables as either continuous variables (if you assign a numeric value to the groups) or as categorical variables. If you go the categorical route, you'll need to use the indicator variable scheme and the leave-out-a-reference-level approach as we discussed. The approach you should use depends on a combination of your analysis goals, the nature of your data, and the ability to adequately fit the model (i.e., properties of the residual plots).

I don’t exactly know what you mean by “repeated the data in the input.” However, you have levels for each categorical variable. Let’s use the lowest level for each variable as the reference level. Here’s how you’d use indicator variables to include both categorical variables in your model (some statistical software will do that for you behind the scenes).

Spacing variable: Leave out SP1. It's the reference. Include an indicator variable for each of: SP2, SP3, SP4, SP5.

Length variable: Leave out PER5 as the reference. Include indicator variables for: PER1, PER2, PER3, PER4.

And just code each indicator variable appropriately based on the presence or absence of the corresponding characteristic. All zeros in a set of indicator variables for a categorical variable represents the reference level for that categorical variable.

As you can see, you'll need to include many indicator variables (8), which is a drawback of entering them as categorical variables. You can quickly get into overfitting your model.
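As a hedged Python sketch of that coding scheme (synthetic data, invented column names): pandas drops the alphabetically first level of each variable as its reference, so reorder the categories if, as above, you want PER5 rather than PER1 left out.

```python
# Sketch: two categorical predictors -> 4 + 4 = 8 indicator columns.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
df = pd.DataFrame({
    'spacing': rng.choice(['SP1', 'SP2', 'SP3', 'SP4', 'SP5'], 111),
    'length': rng.choice(['PER1', 'PER2', 'PER3', 'PER4', 'PER5'], 111),
    'rock_size': rng.normal(1800, 300, 111),
})

dummies = pd.get_dummies(df[['spacing', 'length']], drop_first=True)
X = sm.add_constant(dummies.astype(float))
res = sm.OLS(df['rock_size'], X).fit()
print(res.params)  # each coefficient: that level's shift from its reference
```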

December 30, 2020 at 12:32 am

I’m sorry I had just noticed that the levels are missing

December 28, 2020 at 11:48 am

For my case, I'm studying the crack sets on a rock face and I have two independent categorical variables (spacing and length) that have 5 levels of measurement ranges each. The dependent variable is the blasted rock size, i.e., I want to know how the spacing and length of the existing cracks on a rock face would affect the size of blasted rocks.

E.g., for Spacing: SP1 = less than 60mm up to SP5 = more than 2m

I’ve coded the levels to run the regression model into:

      SP1  SP2  SP3  SP4
SP1    1    0    0    0
SP2    0    1    0    0
SP3    0    0    1    0
SP4    0    0    0    1
SP5    0    0    0    0

From the coding (leaving SP5 out as the reference level) above, after running the model, I have obtained the equation:

Blasted rock size (mm) = 1849.146 + 332.224SP1 + 137.624SP2 – 115.268SP3 – 103.604SP4

1 rock slope could consist of 2 or more crack sets, hence the situation where more than 1 level of spacing and length can be observed. As an example, rock face A consists of 3 crack sets, with set #1 having SP1, set #2 SP3, and set #3 SP4. To predict the blasted rock size for rock face A using the equation, I'll have to insert "1" for SP1, SP3, and SP4, which is actually the wrong way of doing it since they are not mutually exclusive? Or can I calculate each crack set separately using the same equation and then average the blasted rock size for these 3 crack sets?

From the method in your explanation, does this mean that I’ll have to separate each level into 10 different variables and code them as 1=yes and 0=no? If so, for spacing, will the coding be

      SP1  SP2  SP3  SP4  SP5
SP1    1    0    0    0    0
SP2    0    1    0    0    0
SP3    0    0    1    0    0
SP4    0    0    0    1    0
SP5    0    0    0    0    1

in the input table, which would be similar to the initial one except with SP5 included? But if I were to include all levels when running the model, SPSS would automatically exclude 1 level, since I ran several rock faces (belonging to a single location) in a model, so all levels of spacing and length are present in the data set.

The other way that I can think of is to create interactions for all possible combinations and dummy-code them, but wouldn't that end up with a super long equation?

I’m sorry for imposing like this but I couldn’t grasp this problem on my own. Your help is very much appreciated.

December 31, 2020 at 12:51 am

Ah, ok, it sounds like you have two separate categorical variables. In that case, for each observation, you can have one level for each variable. Additionally, for each categorical variable, you’ll leave out one level for its own reference level.

I do have a question. Spacing and length sound like continuous measurements. Why are you including them as categorical variables? There might be a good reason, but it almost seems like you could include them as continuous predictors. Perhaps you don't have the raw measurements but instead they're in groups? In which case, they might actually be ordinal variables. You can include ordinal variables as categorical variables. But sometimes they'll still work as continuous variables.

December 26, 2020 at 12:12 am

I see, sorry I couldn't fully understand your previous reply before this; thanks for the clarification. However, I am dealing with a situation where 2 or more levels of a variable could be observed simultaneously. Is it theoretically right to use dummies, or is there another method around it?

December 27, 2020 at 2:30 am

That sounds like you're dealing with more than one variable rather than one categorical variable. Within an individual categorical variable, the levels of the variable are mutually exclusive. In your case, you need to sort out which categorical variables you have and be sure that the levels are mutually exclusive. If you're looking at the presence and absence of certain characteristics, you can use a series of indicator variables. If these characteristics are not the mutually exclusive levels of a single categorical variable, you don't use the rule about leaving one out.

For example, in a medical setting, you might include characteristics of a patient using a series of indicator variables: gender (1 = female 0 = male), high blood pressure (1 = Yes, 0 = No), On medication, etc. These are separate characteristics (not part of one larger categorical variable) and you can just include an indicator variable to indicate the presence or absence of that characteristic.

Perhaps that is what you need? But be aware that what you describe, with multiple levels possible, does not work for a single categorical variable. But the method I describe might be what you need if you're talking about separate characteristics.


December 24, 2020 at 2:03 am

Thank you, sir

December 18, 2020 at 12:54 am

Thanks for the answer Jim,

Does that mean the predicted value for when both L4 and L1 are observed, and for when only L1 is observed without L4, is the same (Y = 133)?

thanks again!

December 18, 2020 at 1:03 am

The groups must be mutually exclusive. Hence, an observation could not be in both L1 and L4.

December 16, 2020 at 4:58 am

I have a question regarding categorical variables dummy coding, I can’t seem to find any post about this topic. Hope you don’t mind me asking here.

I ran a regression model with a categorical variable containing 4 levels, using the 4th level as the reference group. Meaning, in the equation there will only be levels 1 to 3, since level 4 is the reference. Say the equation is Y = 120 + 13L1 - 6L2 + 15L3; to predict Y with L4, I'll have Y = 120, right?

My question is: what if I want to predict Y when there is L1 but no L4? If I calculate Y = 120 + 13L1, would that mean I am including L4 in the equation, or am I wrong about this?

Thank you in advance.

December 17, 2020 at 11:28 pm

I cover how this works in my book about regression analysis. If you're using regression for a project, you might consider it.

It sounds like your approach is correct. You always leave one level out for the reference group. And, yes, given your equation, the predicted value for level 4 is 120.

For observations where the subject/item belongs to group 1, your equation stays the same, but you enter a 1 for L1 and 0s for L2 and L3. Hence, the predicted value is 133. In other words, you don't change the equation given the level; you change the X values in the equation. When an observation belongs to group 4, you'll enter 0s for L1, L2, and L3, which is why the predicted Y is 120. For a given categorical variable, you'll only enter a single 1 for observations that belong to a non-reference group, and all 0s for observations belonging to the reference group. But the equation stays the same in all cases. I hope that makes sense!
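Worked out in a few lines of Python with the equation from the question:

```python
# The equation never changes; only the 0/1 indicator values do.
def predict(l1, l2, l3):
    return 120 + 13 * l1 - 6 * l2 + 15 * l3

print(predict(1, 0, 0))  # group 1: 133
print(predict(0, 1, 0))  # group 2: 114
print(predict(0, 0, 1))  # group 3: 135
print(predict(0, 0, 0))  # group 4, the reference: 120
```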


December 14, 2020 at 5:35 am

May I just ask if there is a difference between a true and simple linear regression model? I can only think that their difference is the presence of a random error. Thanks a lot!

December 14, 2020 at 8:48 pm

Hi Anthony,

I've never heard the dichotomy stated as true vs. simple linear regression. I take true models to refer to the model that is correctly specified for the population. A simple regression model is just one that has a single predictor, whereas multiple regression has more than one predictor. The true model has as many terms as are required, which includes predictors and other terms that fit curvature and interactions as needed.


December 13, 2020 at 3:04 pm

Hi Jim, I find your explanations to questions very good and so important. Thanks for that. Please, I need your help with my thesis work. My question is: if, for example, I want to measure, say, the level of resilience capacity in a company's safety management system, what tool would you advise? Regression, or which other one? Thanks, Kwame

December 14, 2020 at 9:01 pm

The type of analysis you use depends on the data you collect as well as a variety of other factors. The answer is entirely specific to your research question, field of study, data, etc. After you make those determinations, you can begin to figure out which type of analysis to use. I recommend researching your study area to answer all of those questions, including which type of analysis to use. If you need help after you start developing the answers to the preliminary questions, I'd be able to provide more input.

Also, I really recommend reading my post about designing a study that includes statistical analyses. That'll help you understand what type of information you need to collect and the questions you need to answer.


November 12, 2020 at 11:12 pm

Thank you so much for your answer, Jim!

November 12, 2020 at 11:53 am

Hello Jim, I have a question. I have one independent variable and two dependent variables; I will explain the case before asking. I obtain the data for the independent variable using a questionnaire, and one of my dependent variables also comes from a questionnaire. But for my other dependent variable, the second one, the data comes from an official website, which is secondary data, different from the other variables. So my question is: is it okay if I use regression analysis to analyze these three variables? Or do I have to use another statistical analysis that suits these variables best? Thanks in advance.

November 12, 2020 at 4:37 pm

Most forms of regression analysis allow you to use one dependent variable and multiple independent variables. Because you have two dependent variables, you’ll need to fit two regression models, one for each dependent variable.

In regression, you need to be able to tie together all corresponding values of an observation for the dependent variable and the independent variables. We'll use an example with people. To fit a regression model, for each person, you'll need to know their values for the dependent variable and all the independent variables in the model. In your case, it sounds like you're mixing data from an official website and a survey. If those data sources contain the same people and you can link their values as described, that can work. However, if those data sources have different people, or you can't link their scores, you won't be able to perform regression analysis.


November 6, 2020 at 9:55 am

Hi Jim, if you’ve got three predictors and one dependent variable, is it ever worth doing linear regression on each individual predictor beforehand or should you just dive into the multiple regression? Thanks a lot!

November 6, 2020 at 8:48 pm

Hi Kristian,

You should probably just dive right into multiple regression. There’s a risk of being misled by starting out with regressions with individual predictors. It’s possible that omitted variable bias can increase or decrease the observed effect. By leaving out the other predictors, the model can’t control for them, which can cause that bias.

However, that said, it’s often a good idea to graph the relationship between pairs of variables using scatterplots to get an idea of the nature of each relationship. That’s a great place to start. Those plots not only reveal the direction of the relationship but also whether you need to model curvature.

I’d start with graphs and then try modeling with all the variables. You can always remove insignificant variables.
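Here's a small simulation sketch of the omitted variable bias mentioned above; the data are synthetic, so the numbers are only illustrative.

```python
# Leaving out a correlated predictor biases the remaining coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)            # x2 correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)  # true effects: 1.0 and 2.0

simple = sm.OLS(y, sm.add_constant(x1)).fit()
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(simple.params[1])  # about 2.4: x1 absorbs credit for omitted x2
print(full.params[1:])   # about [1.0, 2.0]: near the true coefficients
```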


October 2, 2020 at 1:00 pm

Hi Jim, do you think it is correct to estimate a regression model based on historical data as Y=aX+b and then use the model for the forecast as Y=aX? Would this be biased?

If the variables involved are growth rates, would it be preferable to directly estimate the model without the intercept?

Thank you in advance Stefania

October 4, 2020 at 12:56 am

Hi Stefania,

The answer to that question depends on a very close understanding of the subject area. However, there are very few cases where fitting a model without a constant is advisable. Bias would be very likely. Read my article about the y-intercept, where I discuss this issue specifically.


September 30, 2020 at 3:22 am

Nice article. Thank you for sharing.


August 19, 2020 at 12:13 pm

If your outcome variable is a pass or fail, then it is binomial logistic. My undergrad thesis was on this topic. Maybe I can offer some help, as this topic is of interest to me. Azad ( [email protected] )


August 6, 2020 at 2:36 am

Sir, what is Cox regression analysis?


August 6, 2020 at 12:52 am

A friend recommended your help with a stats question for my dissertation. I am currently looking at data regarding pass rate and student characteristics. I have collected multiple data points. One example is student pass rate (pass or fail) and observation hours (a continuous variable, 0-1000). Would this be a binomial logistic regression? Can that be performed in Excel?

Additionally, I am looking at pass rate in relation to faculty characteristics. Another example is pass rate (a percentage out of 100, maybe continuous data 0-100) and categorical data (level of degree: bachelor's, master's, doctorate). Additionally, pass rate (percentage out of 100) and the ratio of faculty to students within a classroom (continuous data). Which test would be appropriate for this type of data comparison? Linear regression?

Thanks for your guidance!


July 24, 2020 at 7:14 am

Hi Jim. Concepts were well explained. Thank you so much for making this content available.

I have the data of mortgage loan customers who are currently in default. There are various parameters for why default would have happened, but predominantly there are two factors where we could have gone wrong while sanctioning the loan: underwriting the loan (credit risk) and/or property valuation (technical risk). I have data on the sub-parameters coming under credit and technical risk at the point of sanction.

Now I want to arrive at an output showing where, predominantly, I went wrong: technical risk, credit risk, or both. Which model of regression analysis can help in solving this?

July 3, 2020 at 3:40 am

Dear sir, I'm currently a final-year undergraduate in a BSc Radiography degree, so I chose risk estimation of cardiovascular diseases using several risk factors from regression analysis as my undergraduate research. I want to predict a percentage value for my cardiovascular risk estimation as a dependent variable using regression analysis. How can I do that, sir? I'm very pleased to have your answer. Thank you very much.

July 3, 2020 at 3:41 pm

Hi, it sounds like you might need to use binary logistic regression. If your dependent variable indicates the presence or absence (i.e., a binary outcome measure) of a cardiovascular condition, binary logistic regression will predict the probability of having that condition given the values of your independent variables.
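A hedged sketch of that in Python, with entirely synthetic data and invented risk-factor names:

```python
# Binary logistic regression: predicted probabilities, reportable as percents.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
df = pd.DataFrame({'age': rng.uniform(30, 80, 500),
                   'systolic_bp': rng.normal(130, 15, 500)})
true_logit = -12 + 0.08 * df['age'] + 0.05 * df['systolic_bp']
df['cvd'] = (rng.random(500) < 1 / (1 + np.exp(-true_logit))).astype(int)

X = sm.add_constant(df[['age', 'systolic_bp']])
res = sm.Logit(df['cvd'], X).fit(disp=False)

new_patient = pd.DataFrame({'const': [1.0], 'age': [55.0],
                            'systolic_bp': [140.0]})
print(res.predict(new_patient) * 100)  # estimated risk as a percentage
```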


June 26, 2020 at 8:35 pm

Thank you for all the information on your page. I am currently beginning to get into statistics and wanted to ask your advice about something.

I am a business analyst with MI skills, building dashboards etc. and using sales data and KPIs.

I am wondering, for regression, would a good independent variable be a salesperson's sales performance as a share of the team's total sales performance, or am I on the wrong track with that?


June 11, 2020 at 2:18 pm

Dear Jim… I am a first-year MBA student with very little exposure to research. Please have patience and explain to me whether I can use regression to determine the impact of a variable on a 'construct'?


June 7, 2020 at 6:49 pm

Which criteria does an independent variable need to meet in order to use it in a regression analysis? How do you deal with data that does not meet these requirements?

June 8, 2020 at 3:13 pm

I recommend you read my post about specifying the correct regression model. That deals directly with which variables to include in the model. If you have further questions on the specifics, please post them in the comments section there.


June 5, 2020 at 7:15 am

How should we interpret a factor A that becomes non-significant when fitted together with factor B in a model? Can I conclude that factor B incorporates factor A and just ignore the effect of factor A?


May 28, 2020 at 2:17 am

Hello Mr.Jim and friends,

I have one dependent variable Y and six independent variables X1…X6. I have to find the effect of all the independent variables on Y, specifically X6, to check whether it is effective or not. 1) Can I use OLS regression? 2) Which other tests do I need to do before or after the regression analysis?

May 29, 2020 at 4:16 pm

If your dependent variable is continuous, then OLS is a good place to start. You’ll need to check the OLS assumptions for your model.


April 29, 2020 at 8:06 am

Good, very explicit processes.


April 10, 2020 at 4:53 pm

I hope this comment reaches you in good health as we are living in some pretty tough times right now. Also, thank you for building this website as it is an excellent resource for novice statisticians such as myself. My question has to do with the first paragraph of this post. In it you state,

“Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.”

Is it possible to use regression analysis to produce a regression equation when you have two independent variables and two dependent variables? Also, while I hopefully have your attention, would I need to do regression analysis twice (once for each dependent variable versus the independent variables)?

April 10, 2020 at 7:07 pm

Typically, you would fit separate regression models for each dependent variable. There are a few exceptions. For example, if you use multivariate ANOVA (MANOVA), you can include multiple dependent variables. If those DVs are correlated, using MANOVA provides some benefits. You can include covariates in the MANOVA model. For more information, read my post about MANOVA.
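For reference, a minimal statsmodels sketch of the MANOVA route (synthetic data, invented names):

```python
# Two correlated DVs modeled jointly; covariates go on the right-hand side.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(8)
df = pd.DataFrame({'iv1': rng.normal(size=150), 'iv2': rng.normal(size=150)})
df['dv1'] = df['iv1'] + rng.normal(size=150)
df['dv2'] = 0.5 * df['dv1'] + rng.normal(size=150)  # DVs correlated

mv = MANOVA.from_formula('dv1 + dv2 ~ iv1 + iv2', data=df)
print(mv.mv_test())  # Wilks' lambda, Pillai's trace, etc., per term
```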


April 1, 2020 at 7:00 pm

In my study, I intervened with an instructional practice. My intervention has 4 independent variables (A, B, C, and D). In the literature, each subskill can be graded alone and we can get one whole score. In the literature, the effect of the intervention is holistic (A, B, and C together predict the performance on D).

So, I conducted a multiple regression (enter method) before and after the intervention, where the individual scores of A, B, and C were added as predictors of D.

I added Group (Experimental vs. Control) to account for any difference at baseline between the experimental and control groups. No significant effect was noticed except for the individual scores of A and C on D. The model had a weak fit.

However, after the intervention, I repeated the same regression. The group (Experimental vs. Control) was the best predictor. No significant effect of A was noticed, but significant effects of B and C were. How do you think I can interpret the change in the significance value of A? It is relevant in the literature, but after the intervention it was not significant. Does the significance have to do with the increase in the significance of the Group?


January 26, 2020 at 2:51 pm

I’d like to ask a question that builds on your example of income regressed on IQ and education. In the dataset I am sure there would be a range of incomes. Let’s say you want to find ways to bring up the low income earners based on the data from this regression.

Can I use the coefficients from the regression to guide ideas on how to improve the lower income earners as an estimate of how much improvement would be expected? For example, if I take the lowest earner and find that he is also below average in IQ and education, could I suggest that he gets another degree and try to improve IQ test results to potentially gain $X (n*IQ + m*Edu) in income?

This example may not be strictly usable because I imagine there are many other factors for income. Assuming that we are confident that we’ve captured most of the variables that affect income, can the numbers be used in this way?

If this is not an appropriate application, how would one go about this? Thanks.


October 22, 2019 at 7:45 am

Hello, I am completing a reflection paper for Math 221. I work in a call center; can I use a regression analysis for this type of work?


October 20, 2019 at 4:48 am

I am a total novice when it comes to Statistics. My challenge is, I am working on the relationship between population growth of a town and class size of secondary schools in that same town (about 10 schools) over a period of years (2008-2018). Having gathered my data, I don’t know what to use in analyzing my data to show this relationship.


October 16, 2019 at 8:48 pm

Hi Jim! I'm just a student who's trying to finish her science investigation 🙂 but I have a question. What is linear regression, and how do we know if this method is appropriate for our data?

October 18, 2019 at 1:23 pm

Hi Marlene,

I think this blog post describes pretty well when to use regression analysis generally. Linear regression analysis is a specific form of regression. Linear refers to the form of the model, not whether it can fit curvature. I talk about this in my post about the differences between linear and nonlinear regression. I always suggest that you start with linear regression because it's an easier analysis to use. However, sometimes linear regression can't fit your data. It can fit curvature in your data, but it can't fit all types of curves. Nonlinear regression is more flexible in the types of curves it can fit.

As for determining whether linear regression is appropriate for your data, you need to see if it can provide an adequate fit to your data. To make that determination, please read my posts about residual plots because that’s how you can tell.

Best of luck with your research!! 🙂


August 27, 2019 at 4:50 pm

Hello Jim, thank you for this wonderful page. It has enlightened me about when to use regression analysis. However, I am a complete beginner at using SPSS (and statistics at that), so I am hoping you can help me with my specific problem.

I intend to use a linear regression analysis. My dependent variable is continuous and I would think it’s ordinal (data was obtained through a 5-point Likert scale). I have two independent variables (also obtained through 5-point Likert scales). However, I also intend to use 7 control variables and this is where my problem lies. My control variables are all (I think) nominal (or is that called categorical in statistics?). They are as follows:

Age: 4 categories; Gender: 2 categories; Marital status: 4 categories; Education level: 11 categories; Household income: 4 categories; Nationality: 4 categories; Country of origin: 9 categories

Do I input these control variables as they are? Or do I have to do something beforehand? I have heard about creating dummy variables. However, if I try creating dummy variables for each control variable, won't I end up with many variables?

Please give me some advice regarding this. I have been really stuck in this process for a while now. I look forward to hearing from you, thanks.

August 27, 2019 at 11:43 pm

There are several issues to address in your questions. I'll provide some information; however, my regression ebook goes into the details much further, so I highly recommend you get that.

In terms of the dependent variable, the answer is clear. If your Likert scale data are the actual values of 1, 2, 3, 4, and 5, they are ordinal data and are not considered continuous. You'll need to use ordinal logistic regression. If the DV is an average of multiple Likert score items for each individual, so an individual might have a 3.4, that is continuous data and you can try using linear least squares regression.

Categorical data and nominal data are the same. There are different naming conventions, but those are synonyms.

For categorical data, it's true that you need to recode them as indicator variables. However, most software should do that automatically behind the scenes. And as you noticed, the recoding (even if your software does it for you) can involve creating many indicator variables (dummy variables), particularly when you have many categorical variables and/or many levels within a categorical variable. That can use up your degrees of freedom! My ebook covers this in more detail.

For Likert IVs: again, if it's an average of multiple Likert items, you can probably include it as a continuous variable. However, if it's the actual Likert values of 1, 2, 3, 4, and 5, then you'll need to decide whether to include it as a continuous or categorical variable. There are pros and cons to both approaches. The best answer depends on both your data and your goals. My ebook describes this in more detail.

Yes, as a general rule, you want to include your control variables and IVs that you are specifically testing. Control variables are just more IVs, but they’re usually not your main focus of study. You include them so that you can account for them while testing your main variables of interest. Excluding relevant IVs that are significant can bias the estimates for the variables you’re interested in. However, if you include control variables and find they’re not significant, you can consider removing them from the model.

So, those are some pointers to start with!


June 22, 2019 at 1:02 am

Hi Jim and everyone! I'm starting some statistical analysis and this has been really useful. I have a question regarding variables and samples. I need to see if there is any relationship between days of the week and number of robberies. I already have the data, but I wonder: if my variables (# of robberies on each day of the week (independent) and # of total robberies (dependent)) come from the same data sample, can it be a problem?


June 7, 2019 at 2:56 am

Thank you Jim this was really helpful

I have a question. How do you interpret an independent variable, let's say AGE, with categories that are insignificant? For example, I ran the regression analysis for the variable age with categories; age as a whole was found to be significant, but there appears to be insignificance within categories. It was as follows: Age = 0.002; <30 years = 0.201; 30-44 years = 0.161; 45+ (ref cat).

I had another scenario: occupation = 0.000; peasant farmers = 0.061; petty businessmen = 0.003; other occupation (ref cat).

My research question was: "What are the effects of socio-demographic characteristics on men's attendance at education classes?"

I failed to interpret them; kindly help.

June 7, 2019 at 10:07 am

For categorical variables, the linear regression procedure uses two tests of significance. It uses an F-test to determine the overall significance of the categorical variable across all its levels jointly. And, it uses separate t-tests to determine whether each individual level is different from the reference level. If you change the reference level, it can change the significance of t-tests because that changes the levels that the procedure directly compares. However, changing the reference level won’t change the F-test for the variable as a whole.

In your case, I’m guessing that the mean for <30 is on one side (high or low) compared to the reference category of 45+ while the mean of 30-44 is on the other side of 45+. These two categories are not far enough from 45+ to be significant. However, given the very low p-value for age, I'd guess that if you change the reference level from 45+ to one of the other two groups, you'll see significant p-values for at least one of the t-tests. The very low p-value for Age indicates that the means for the different levels are not all equal. However, given the reference level, you can't tell which means are different. Using a different reference level might provide more meaningful information.

For occupation, the low p-value for the F-test indicates that not all the means for the different types of occupations are equal. The t-test results indicate that the difference in means between petty businessmen and other (reference level) is statistically significant. The difference between peasant farmers and the reference category is not quite significant.

You don't include the coefficients, but those would indicate how those means differ.
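Here's a hedged sketch of how that might look in Python for the linear case, with invented data; Treatment(reference=...) controls which level gets left out, while the whole-variable F-test is unaffected by that choice (a binary attendance outcome would use logistic regression instead).

```python
# Changing the reference level changes the t-tests, not the overall F-test.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
groups = ['<30 years', '30-44 years', '45+']
df = pd.DataFrame({'age_group': rng.choice(groups, 300)})
df['attendance'] = rng.normal(10, 2, 300) + (df['age_group'] == '45+') * 1.5

res = smf.ols("attendance ~ C(age_group, Treatment(reference='45+'))",
              data=df).fit()
print(res.summary())                  # t-tests: each level vs. 45+
print(sm.stats.anova_lm(res, typ=2))  # F-test for age_group as a whole
```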

Because you're using regression analysis, you should consider getting my regression ebook. I cover this topic, and others, in more detail in the book.

Best of luck with your analysis!


May 11, 2019 at 12:51 pm

Hi Jim, I have followed your discussion and I want to know if I can apply this analysis in a case study.


April 26, 2019 at 4:01 pm

Hi Jim, I really appreciate your excellence in regression analysis. Please would you help with the steps to draw a single fitted line for several, say five, IVs against a single DV?

April 26, 2019 at 4:18 pm

It sounds like you're dealing with multiple regression because you have more than one IV. Each IV requires an axis (or dimension) on a graph. So, for a two-dimensional graph, you can use the X-axis (horizontal) for the IV and the Y-axis for the DV. If you have two IVs, you could theoretically show them as a hologram in three dimensions. Two dimensions for the IVs and one for the DV. However, when you get to three or more IVs, there's just no way to graph them! You'd need four or more dimensions. So, what can you do?

You can view residual plots to see how the model with all 5 IVs fits the data. And, you can predict specific values by plugging numbers into the equation. But you can’t graph all 5 IVs against the DV at the same time.

You could graph them individually. Each IV by itself against the DV. However, that approach doesn’t control for the other variables in the model and can produce biased results.

The best thing you can do to show the relationship between an individual IV and a DV while controlling for all the variables in a model is to use main effects plots and interaction plots. You can see interaction plots here. Unfortunately, I don't have a blog post about main effects plots, but I do write about them in my ebook, which I highly recommend you get to understand regression! Learn more about my ebook!

I hope this helps!


March 16, 2019 at 1:31 pm

Many thanks. I appreciate it.

March 15, 2019 at 10:47 am

I stumbled across your website in hopes of finding an answer to a couple of questions regarding the methodology of my political science paper. If you could help, I would be very grateful.

My research question is “Why do North-South regional trade agreements tend to generate economic convergence while South-South agreements sooner cause economic divergence?”. North = OECD developed countries and South = non-OECD developing countries.

This is my lineup of variables and hypotheses: DV: Economic convergence between country members in a regional trade agreement IV1: Complementarity (differentness) of relative factor abundance IV2: Market size of region IV3: Economic policy coordination (Harmonization of Foreign Direct Investment (FDI) policy)

H1: The higher the factor endowment difference between countries, the greater the convergence H2: The larger the market size, the greater the convergence H3: The greater the harmonization of FDI policies, the greater the convergence

I am not sure what the best methodological approach is. I will have to take North-South and South-South groups of countries and assign values for the groupings. I want to show the relationship between the IVs and DV, so I thought to use a regression. But there are at least two issues:

1. I feel the variables are not appropriate for a time series, which is usually used to show relationships. This is because e.g. the market size of a region will not be changing with time. Can I not do a time series and still have meaningful results?

2. The IVs are not completely independent of one another. How can I work with that?

Also, what kind of regression would be most appropriate in your view?

Many sincere thanks in advance. Irina

March 15, 2019 at 5:23 pm

I’m not an expert in that specific field, so I can’t give you concrete advice, but here are somethings to consider.

The question about whether you need to include time related information in the model depends on the nature of your data and whether you expect temporal effects to exist. If your data are essentially collected at the same time and refer to the same time period, you probably don’t need to account for time effects. If theory suggests that the outcome does not change over time, you probably don’t need to include variables for time effects.

However, if your data are collected at, or otherwise describe, different points in time, and you suspect that the relationships between the IVs and DV change over time, or there is an overall shift over time, then yes, you'd need to account for the time effects in your model. In that case, failure to account for the effects of time can bias your other coefficients; basically, there's the potential for omitted variable bias.

I don’t know the subject area well enough to be able to answer those questions, but that’s what I’d think about.

You mention that the IVs are potentially correlated (multicollinearity). That might or might not be a problem. It depends on the degree of the correlation. Some correlation is OK and might not be a problem. I'd perform the analysis and check the VIFs, which measure multicollinearity. Read my post about multicollinearity, which discusses how to detect it, how to determine whether it's a problem, and some corrective measures.

I’d start with linear regression. Move away from that only if you have specific reason to do so.


March 10, 2019 at 3:59 am

I was wondering if you could help. I'm currently doing a lab report on numerical cognition in human and non-human primates, where we are looking at whether size, quantity, and visibility of food affect choice. We have tested humans so far and are going to test chimps in the future. My IV is condition (visible and opaque containers) and my DV is the number of correct responses. So far I have compared the means of the number of correct responses for both conditions using a one-way repeated measures ANOVA, but I don't think this is correct. After having a look at your website, should I look to run a regression analysis instead? Sorry for the confusion, I'm really a rookie at this. Hope you can help!

March 11, 2019 at 11:26 am

Linear regression analysis and ANOVA are really the same type of analysis: linear models. They both use the same math "underneath the hood." They each have their own historical traditions and terminology, but they're really the same thing. In general, ANOVA tends to focus on categorical (nominal) independent variables while regression tends to focus on continuous IVs. However, you can add continuous variables into an ANOVA model and categorical variables into a regression model. If you fit the same model in ANOVA as regression, you'll get the same results.

So, for your study, you can use either ANOVA or regression. However, because you have only one categorical IV, I’d normally suggest using one-way ANOVA. In fact, if you have only those two groups (visible vs opaque), you can use a 2-sample t-test.

Although you mention repeated measures, you can use that if you in fact do have pre-test and post-test conditions. You could even use a paired t-test if you have only the two groups and you have pre- and post-tests.

There is one potential complication. You mention that the DV is a count of correct responses. Counts often do not follow the normal distribution but can follow other distributions, such as the Poisson and negative binomial distributions, although counts can approximate the normal distribution when the mean is high enough (>~20). However, if you have two groups and each group has more than 15 observations, the analyses are robust to departures from the normal distribution.
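If the counts did turn out to be low and skewed, a Poisson model is one option; here's a minimal sketch with synthetic data and invented names:

```python
# Poisson regression for a count DV with a two-level condition factor.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
df = pd.DataFrame({'condition': rng.choice(['visible', 'opaque'], 200)})
df['correct_count'] = rng.poisson(np.where(df['condition'] == 'visible', 6, 4))

res = smf.glm('correct_count ~ condition', data=df,
              family=sm.families.Poisson()).fit()
print(res.summary())
```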

I hope this helps! Best of luck with your analysis!


February 9, 2019 at 8:20 am

Thank you so much for the reply, I appreciate it. I finally worked it out and got a good mark on the lab report, which was good :). I appreciate your time replying; you explain things very clearly, so thank you.

January 17, 2019 at 9:49 am

Hi there. I am currently doing a lab report and have not done stats in years, so I'm hoping someone can help, as it's due tomorrow. When I do a bivariate correlation test, it shows the correlation is not significant between a personality trait and a particular cognitive task. Yet when I conduct a simple t-test, it shows a significant p-value and gives the 95% conf interval. If I was to test whether higher scores on one trait tend to mean higher scores on a particular cognitive task, then should I be doing a regression? We were told basic correlations, so I did the bivariate option and just stated that the Pearson's r is not significant, r = .., n = , p = .84 for example. Yet if I do a regression analysis for each, it is significant. Why could this be?

January 18, 2019 at 9:45 am

There aren't quite enough details to know for sure what is happening, but here are some ideas.

Be aware that a series of pairwise correlations is not equivalent to performing regression analysis with multiple predictors. Suppose you have your outcome variable and two predictors (Y X1 X2). When you perform the pairwise correlations (X1 and Y, X2 and Y), each correlation does not account for the other X. However, when you include both X1 and X2 in a regression model, it estimates the relationship between each X and Y while accounting for the other X.

If the correlation and regression model results differ as you describe, you might well have a confounding variable, which biases your correlation results. I write about this in my post about omitted variable bias. You'd favor the regression results in this situation.

As for the difference between the 2-sample t-test and correlation, that’s not surprising because they are doing two entirely different things. The 2-sample t-test requires a continuous outcome variable and a categorical grouping variable and it tests the mean difference between the two groups. Correlations measure the linear association between two continuous variables. It’s not surprising the results can differ.

It sounds like you should probably use regression analysis and include your multiple continuous variables in the model along with your categorical grouping variables as independent variables to model your outcome variable.


January 17, 2019 at 12:39 am

This is Kathlene, and I am a Grade 12 student. I am currently doing my research. It's quantitative research. I am having a little trouble with how I will approach my statistical treatment. My research is entitled "Emotional Quotient and Academic Performance Among Senior High School Students in Tarlac National High School: Basis to a Guidance Program." I was debating what to use to determine the relationship between the variables in my study. I'm thinking of using the chi-square method, but a friend said it would be more accurate to use the regression analysis method. Math is not really my field of study, so I badly need your opinion regarding this.

I’m hoping you could lend me a helping hand.

January 17, 2019 at 9:27 am

Hi Kathlene,

It sounds like you’re in a great program! I wish more 12th grade students were conducting studies and analyzing their results! 🙂

To determine how to model the relationships between your variables, it depends on the type of variables you have. It sounds like your outcome variable is academic performance. If that’s a continuous variable, like GPA, then I’d agree with your friend that regression analysis would be a good place to start!

Chi-square assesses the relationship between categorical variables.


December 13, 2018 at 1:57 am

Hi Mr Jim, I am using an orthogonal design with 7 factors at three levels. I have done regression analysis in the Minitab software, but I don't know how to explain or interpret the results. I need your help in this regard.

December 13, 2018 at 9:13 am

I have a lot of content throughout my blog that will help you, including how to interpret the results. For a complete list for regression analysis, check out my regression tutorial.

Also, early next year I’ll be publishing a book about regression analysis as well that contains even more information.

If you have a more specific question after reading my other posts, you can ask them in the comments for the appropriate blog post.

Best of luck!


December 9, 2018 at 12:08 pm

By the way, my gun laws vs. VCR question is part of a regression model. Any help you can give, I'd greatly appreciate.

December 9, 2018 at 12:07 pm

Mr. Jim, I have a problem. I'm working on a research design on gun laws vs. homicides, with my dependent variable being violent crime rate. My sig is .308. The constant's (VCR) standard error is 24.712, and my n for violent crime rate is 430.44. I really need help ASAP. I don't know how to interpret this well. Please help!!!

December 11, 2018 at 10:03 am

There's not enough information for me to know how to interpret the results. How are you measuring gun laws? Also, VCR is your dependent variable, not the constant as you state. You don't usually interpret the constant. All I can really say is that based on your p-value, it appears your independent variable is not statistically significant. You have insufficient evidence to conclude that there is a relationship between gun laws and homicides (or is it VCR?).


December 4, 2018 at 12:49 am

Your blog has been very useful. I have a query: if I am conducting a multiple regression, is it okay to have an outcome variable which is normally distributed (I winsorized an outlier to achieve this) and two other predictor variables which are not normally distributed? (The normality test scores were significant.)

I have read in many places that you have to transform your data to achieve normality for the entire data set to conduct a multiple regression, but doing so has not helped me at all. Please advise.

December 4, 2018 at 10:42 am

I'm dubious about the winsorizing process in general. Winsorizing reduces the effect of outliers. However, this process is fairly indiscriminate in terms of identifying outliers. It simply defines outliers as being more extreme than an upper and lower percentile and changes those extreme values to equal the specified percentiles. Identifying outliers should be a point-by-point investigation. Simply changing unusual values is not a good process. It might improve the fit of your data, but it is an artificial improvement that overstates the true precision of the study area. If that point is truly an outlier, it might be better to remove it altogether, but make sure you have a good explanation for why it's an outlier.

For regression analysis, the distributions of your predictors and response don’t necessarily need to be normally distributed. However, it’s helpful, and generally sought, to have residuals that are normally distributed. So, check your residual plots! For more information, read my post about OLS assumptions so you know what you need to check!

If your residuals are nonnormally distributed, sometimes transforming the response can help. There are many transformations you can try. It's a bit of trial and error. I suggest you look into the Box-Cox and Johnson transformations. Both methods assess families of transformations and pick one that works best for your data. However, it sounds like your outcome is already normally distributed, so you might not need to do that.
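As a quick sketch, SciPy's Box-Cox helper searches that family of power transformations for you (the response must be strictly positive; the data here are synthetic):

```python
# Box-Cox picks the lambda that makes the data most normal-looking.
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(11)
y = rng.lognormal(mean=1.0, sigma=0.6, size=300)  # skewed, positive response

y_transformed, lam = boxcox(y)
print(lam)  # lambda near 0 suggests a log transform; near 1, little change
```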

Also, see what other researchers in your field have done with similar data. There’s little general advice I can offer other than to check the residuals and make sure they look good. If there are patterns in the residuals, make sure you’re fitting curvature that might be present. You can graph the various predictors by the residuals to find where the problem lies. You can also try transforming the variables as I describe earlier. While the variables don’t need to follow the normal distribution, if they’re very nonnormally distributed, it can cause problems in the residuals.


December 3, 2018 at 10:05 pm

Hi, I am confused about the assumption of independent observations in multiple linear regression. Here's the case. I have heart rate data at five-minute intervals for a day for 14 people. The dependent variable is the heart rate. During the day, the workers worked for 8 hours (8 am to 5 pm), so basically I have 90 data points per worker for a day. That makes 1260 data points (90 times 14) to be included in the model. Is it valid to use multiple linear regression for this type of data?

December 4, 2018 at 10:47 am

It sounds like your model is more of a time series model. You can model those using regression analysis as well, but there are special concerns that you need to address. Your data are not independent. If someone has a high heart rate during one measurement, it's very likely it'll also be high 5 minutes later. The residuals are likely to be serially correlated, which violates one of the OLS assumptions.

You'll likely need to include other variables in your model that capture this time-dependent information, such as lagged variables. There are various considerations you'll need to address that go beyond the scope of these comments. You'll need to do some additional research into using regression analysis for time series data.


November 8, 2018 at 10:38 am

Ok.Thank you so much.

November 8, 2018 at 10:21 am

Thank you so much for your time! Actually, I don't have authentic data about property values (the dependent variable), nor do the concerned institutions have this data. Can I ask the property owner for the property value directly through a walk interview?

November 8, 2018 at 10:31 am

You really need to have valid data. Using a self-reported valuation might be better than no data. However, be aware there might be differences between what the property owner says and the true market value. Your model would describe self-valuation rather than market valuation. Typically, I’ve seen studies like yours use actual sales prices.

November 8, 2018 at 12:20 am

Hello Sir! Is it necessary for the dependent variable in a multiple regression model to have values? I have a number of independent variables (age of property, stories in building, location close to park) and a single dependent variable (property values). Some independent variables decrease the value of the dependent variable, while some independent variables increase it. Can I put the values of my single dependent variable as (a. <200000, b. <300000, c. d. 500000)?

November 8, 2018 at 9:39 am

Why would can’t you enter the actual property values? Ideally, that’s what you would do. If you are missing a value for a particular observation, you typically need to exclude the entire observation from the analysis. However, there are some ways to estimate missing values. For example, SPSS has advanced methods for imputing missing values. But, you should use those only to estimate a few missing values. Your plan should be to obtain the property values. If you can’t do that, it will be difficult to perform regression analysis.

There are some cases where you can’t record the exact values and it’s usually related to the observation time. This is known as censored data. A common example is in reliability analysis where you record failure times for a product. You run the experiment for a certain amount of time and you obtain some failures and know their failure times. However, some products don’t fail and you only know that their failure time is greater than the test time. There are censored regression models you can use in situations like that. However, I don’t think that applies to your subject-area, at least as far as I can tell.


November 5, 2018 at 5:52 pm

thank you so much Jim! this is really helpful 🙂

November 5, 2018 at 10:03 pm

You’re very welcome! Best of luck with your analysis!

November 5, 2018 at 5:16 pm

The variances (SDs) for the 3 groups are 0.45, 0.7 and 1. Would you say that they vary by a lot? Another follow-up question: does a narrower CI equal a better estimate?

November 5, 2018 at 5:26 pm

Yes, that’s definitely it!

I would suggest using Welch’s one-way ANOVA to analyze it and potentially use that analysis to calculate the CI. You’re essentially performing a one-way ANOVA. And, in ANOVA, there is the assumption of equal variances between groups, which your data do not satisfy. In regression, we’d refer to it as heteroscedasticity. In Welch’s ANOVA, you don’t need to satisfy that assumption. That makes it a simple solution for your case.

In terms of CIs, yes, narrower CIs indicate that the estimate is more precise than if you had a wider CI. Think of the CI as a margin of error around the estimate and it’s good to have a smaller margin of error. With a narrower CI, you can expect the actual mean to fall closer to the fitted value.
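For readers who want to try this, here is a minimal sketch using the anova_oneway function available in recent versions of statsmodels, which performs Welch's ANOVA when use_var="unequal"; the three groups here are made-up numbers.

```python
# Sketch: Welch's one-way ANOVA (no equal-variance assumption).
# Assumes statsmodels >= 0.12; group values are invented illustrations.
import numpy as np
from statsmodels.stats.oneway import anova_oneway

g1 = np.array([2.1, 2.4, 2.2, 2.6, 2.3])
g2 = np.array([2.9, 3.3, 2.7, 3.1, 3.0])
g3 = np.array([3.6, 4.2, 3.1, 4.5, 3.8])

res = anova_oneway([g1, g2, g3], use_var="unequal")  # Welch's ANOVA
print(res.statistic, res.pvalue)
```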

November 5, 2018 at 4:21 pm

Thank you so much for the quick response! I checked the residual plots; they give me a pretty flat trend line at y=0, and my R-squared = 0.87. However, the CI it gives me by using all 15 points (regression inference) is a little wider (2.012 – 3.655) than if I just use those 5 points (2.245 – 3.355). In this case, would you still prefer using all 15 points?

November 5, 2018 at 4:38 pm

That’s tricky. I hate to throw out data, but it does seem warranted. At least you have a good rationale for not using the data!

CIs of the mean for a point at the end of a data range in a regression model do tend to be wider than in the middle of the range. Still, I’m not sure why it would be wider. Are the variances of the groups roughly equal? If not, that might well be the reason.

November 5, 2018 at 2:36 pm

Suppose I have a total of 15 data points at x=0, x=40, and x=80 (5 data points at each x value). I can use regression to estimate y when x=60. But what if I want to estimate the average when x=0? Should I just use those 5 data points at x=0, or use the intercept from the regression line? Which gives the best estimate for a 95% CI of the average y value when x=0?

Thank you 🙂

November 5, 2018 at 3:52 pm

Assuming that model provides a good fit to the data (check the residual plots), I’d use all the data to come up with the CI for the fitted value that corresponds to X = 0. That approach uses more data to calculate the estimate. Your CI might even be more precise (narrower) using all the data.
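As an illustration, here is a minimal Python sketch of that approach with statsmodels: fit the model on all 15 points and request the confidence interval for the mean response at x = 0 (and at x = 60). The data values are hypothetical.

```python
# Sketch: CI for the mean response at x = 0 using all 15 observations.
import numpy as np
import statsmodels.api as sm

x = np.repeat([0, 40, 80], 5)                 # 5 hypothetical points per x value
y = np.array([2.6, 2.9, 2.7, 3.1, 2.8,
              4.0, 4.3, 3.9, 4.4, 4.1,
              5.3, 5.6, 5.2, 5.8, 5.5])

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Rows are [intercept, x]; ask for the mean response at x = 0 and x = 60
pred = fit.get_prediction([[1, 0], [1, 60]])
print(pred.conf_int(alpha=0.05))              # 95% CIs for the fitted means
```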


October 26, 2018 at 5:27 am

Hi, what makes us use linear regression instead of other types of regression? In other words, what is the motivation for selecting a linear model?

October 26, 2018 at 10:48 am

Typically, try linear regression first. If your data contain curvature, you might still be able to use linear regression. Linear regression is generally easier to use and includes some useful statistics that nonlinear regression can’t provide, such as p-values for the coefficients and R-squared.

However, if you can’t adequately fit the curvature in your data, it might be time to try nonlinear regression. While both types allow you to fit curvature, nonlinear regression is more flexible because it allows your model to fit more types of curvature.

I’ve written a post about how to choose between linear and nonlinear regression that you should read. Within that post are various related links that talk about how to fit curves using both types of regression, along with additional information about both types.


October 26, 2018 at 1:17 am

Thank you so much for your reply. I am really eager to know much more about this. I shall keep sending emails seeking your reply, which I hope you will not mind.

October 25, 2018 at 5:02 am

I have been unfortunate not to get your reply to my comment of 18/09/2018.

October 25, 2018 at 9:29 am

Sorry about the delay. As you can no doubt imagine, my schedule gets busy and things can fall through the cracks.

I replied under your original comment.


October 23, 2018 at 2:28 pm

Your blog has been really helpful! 🙂 I am currently completing my Masters Thesis, and my primary outcome is to assess the relationship between diabetes distress and blood glucose control. I am a newbie to SPSS, and I am at a loss as to how best to analyse my small data set (not normally distributed, pre- and post-data transformation).

I have been advised that regression analysis may be appropriate and better than correlations. However, my data does not appear to be linear. My diabetes distress variables consist of a score of 1-6 based on a Likert scale and are also categorical (low, moderate, high distress), and my blood glucose consists of continuous data and also a categorical variable of poorly controlled versus well controlled blood glucose.

At the moment I am struggling to complete this analysis. Any help would be greatly appreciated 🙂


October 21, 2018 at 5:06 pm

Dear Jim, thank you very much for this post! Could you please explain the following.

You are writing: “you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values”

What if I have a small R-squared, but the coefficients are statistically significant with small p-values?


October 15, 2018 at 5:37 am

Hi Jim, thanks for your enlightening explanations. However, I want to engage you a bit. Under how to interpret regression results, you indicated that a small p-value indicates that the “independent variable is statistically significant”. I tend not to agree. Note that since the null hypothesis is that the coefficient of the independent variable is equal to zero, its rejection, as evidenced by a low p-value, should imply that it is the coefficient which is significantly different from zero, and not the variable. almadi

October 15, 2018 at 9:56 am

Yes, you’re correct that the p-value tests whether the coefficient estimate is significantly different from zero. If it is, you can say that the coefficient is statistically significant. Alternatively, statisticians often say that the independent variable is statistically significant. In this context, these are two different ways of saying the same thing because the coefficient is a property of the variable itself.

September 18, 2018 at 5:44 am

As you must be well aware, the government releases price indices, and these are broadly used to determine the effect of base prices during a given period of time.

The construction industry normally uses these price indices, running over a period of time, to redetermine prices based on the movement between the base date and the current date, which is called price adjustment.

After a few years, the government releases a new series of price indices, and we may not have index data for the old series. This necessitates using the new indices with a conversion factor to arrive at the equivalent value of the base price.

Where do you feel that regression analysis could be of help when we have to determine the current value of the base price using the new indices?

It is a bit amusing that someone was suggesting to me.

V.G.Subramanian

October 25, 2018 at 9:27 am

I agree that switching price indices can be a problem. If the indices overlap, you can perform regression analysis where the old index is the independent variable and the new index is the dependent variable. However, that is problematic if you don’t have both indices. If you had both indices, I suppose it wouldn’t be a problem to begin with!

Ideally, you’d understand the differences behind how the government calculates both indices, and you could use that to estimate the value of the other index.

I’m not particularly familiar with this practice, so I don’t have a whole lot of insight into it. I hope this helps somewhat!


September 2, 2018 at 8:15 am

Thank you for this, Jim. I’ve always felt a common sense explanation minus all the impressive math formulas is what is needed in statistics for data science. This is a big part of the basics I’ve been missing. I’m looking forward to your Logistic Regression Tutorial. How is that coming along for you?

September 2, 2018 at 2:59 pm

Hi Antonio,

Thanks so much for your kind words! They mean a lot to me! Yes, I totally agree, explanations should focus on being intuitive and helping people grasp the concepts.

I have written a post on binary logistic regression . Unfortunately, it’ll be a while before I have a chance to write a more in-depth article–just too many subjects to write about!


July 19, 2018 at 2:55 am

Dear sir, I have a few questions about when to use ANOVA and when to use regression analysis. In my study, I conducted an experiment considering temperature, pH, and weight of a compound as independent variables and extraction as the dependent variable (I have mentioned this very generally, but I have some specific independent and dependent variables along with these). I did the statistical analysis using one-way ANOVA with Tukey’s test, and I used a grouping method (using letters a, b, c, …) to show significance based on the p-value. My question is: for this type of data, can I use regression analysis? And what is the main difference between Tukey’s test and regression analysis?

July 19, 2018 at 11:14 am

Both regression analysis and ANOVA are linear models. As linear models, both types of analyses have the same math “under the hood.” You can even use them interchangeably and get the same results. Traditionally, you use ANOVA when you have only, or mainly, categorical factors–although you can add in covariates (continuous variables). On the other hand, you tend to use regression when you have only, or mainly, continuous variables–although you can add in categorical variables.

Because ANOVA focuses on categorical factors and comparing multiple group means, statisticians have developed additional post hoc analyses to work with ANOVA, such as Tukey’s test. Typically, you’ll perform the ANOVA first and then the post hoc test. Suppose you perform a one-way ANOVA and obtain significant results. This significance tells you that not all of the group means are equal. However, it does not tell you which differences are statistically significant.

That point is where post hoc tests come in. These tests do two things. They’ll tell you which differences are statistically significant. They also control the familywise error rate for the group of comparisons. When you compare multiple differences like that, you increase the risk of a Type I error–which is when you say there is a difference but there really isn’t. When you compare multiple means, the Type I error rate will be higher than your significance level (alpha). These post hoc tests (other than Fisher’s) maintain the Type I error rate so it continues to equal alpha, which is what you would expect.

So, use an ANOVA first. If you obtain significant results for a categorical factor, you can use post hoc tests like Tukey’s to explore the differences between the various factor levels.
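As an illustration of that workflow, here is a minimal Python sketch: run the one-way ANOVA first, and only if it is significant, follow up with Tukey's HSD. The data and column names are made up.

```python
# Sketch: one-way ANOVA followed by Tukey's HSD post hoc test.
# Data and column names are hypothetical.
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.DataFrame({
    "extraction": [8.1, 8.4, 7.9, 9.2, 9.5, 9.1, 10.3, 10.6, 10.1],
    "temperature": ["low"] * 3 + ["medium"] * 3 + ["high"] * 3,
})

groups = [g["extraction"].values for _, g in df.groupby("temperature")]
f_stat, p_value = stats.f_oneway(*groups)        # overall ANOVA first
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:                               # then: which means differ?
    print(pairwise_tukeyhsd(df["extraction"], df["temperature"]))
```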

I really need to write a blog post about this! I will soon!

In the meantime, I hope this helps!


May 28, 2018 at 5:28 am

Is it necessary to conduct correlation analysis before regression analysis?

May 30, 2018 at 11:02 am

Hi Kaushal,

No it’s not absolutely required. I actually prefer producing a series of scatterplots (or a matrix plot) so I can see the nature of the different relationships. That helps give me a better feel for the data along with the types of relationships. However, if you have a good theory and a solid background knowledge on which variables should be included in the model, you can go straight to modeling. I think it depends a lot on your existing level of knowledge.

That all said, I personally like knowing the correlation structure between all of the variables. It gives me a better feel for the data.
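For example, a quick sketch of a scatterplot matrix plus the correlation matrix in Python; the file and column names are hypothetical.

```python
# Sketch: scatterplot matrix and correlation structure before modeling.
# File and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")
cols = ["y", "x1", "x2", "x3"]
pd.plotting.scatter_matrix(df[cols], figsize=(8, 8))
plt.show()

print(df[cols].corr())   # pairwise correlations between all variables
```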


April 28, 2018 at 11:45 pm

Thank you Jim!

I really appreciate it!

April 28, 2018 at 7:38 am

Hi Jim, I hope you are having good time!

I would like to ask you a question, please!

I have 24 observations to perform a regression analysis on (let’s say Zones), and I have many independent variables (IVs). I would like to know the minimum number of observations I should have to build a reasonable linear regression model. I would also like to hear from you about how to test many regression models with different IVs, since I cannot use many IVs in a model where I have few observations (24).

Thank you in advance!

April 28, 2018 at 2:26 pm

Hi Patrik, great to hear from you again!

Those are great questions. For 24 observations, I’d say that you usually wouldn’t want more than 2 IVs. I write an entire post about how many variables you can include in a regression model . Including too many IVs (and other terms such as interactions and polynomials) is known as overfitting the model. Check that post out because it’ll provide guidance and show you the dangers of including too many.

There’s another issue at play too, because you want to compare a number of different regression models to each other. If you compare many models, it’s a form of data mining. The risk here is that if you compare enough models, you will uncover chance correlations. These chance correlations look like the real thing but only appear in your sample and not the population. I’ve written a post about how using this type of data mining to choose a regression model causes problems . This concern is particularly problematic with a small sample size like yours. Data mining can find “patterns” in randomly generated data.

So, there are really two issues for you to watch out for–overfitting and chance correlations found through data mining!

Hope this helps!

April 5, 2018 at 5:51 am

Many Thanks Jim!!! You have no idea about how much you helped me.

Very well clarified!!!

God bless you always!!!

April 4, 2018 at 1:33 am

Hi Jim, I am everywhere in your post!

I am starting to love statistics; that’s why I am not quiet.

I have some questions for you:

To use OLS regression, one of the assumptions is that the dependent variable is normally distributed. What should I do with my data to meet this requirement? Should I check the normality of my dependent variable, for example using the Shapiro test (etc.)? If I conclude that my dependent variable does not follow the normal distribution, should I start looking at data transformations? Another way I have seen people assess normality is by plotting the dependent variable against the independent variable; if the relationship doesn’t follow a linear trend, they go to data transformation (which one do you recommend?). Or should I perform the regression using my original data, and then the residuals will show me non-normality if it exists?

When should I transform my independent variables, and what is the consequence of transforming them?

Sorry, I use to ask many questions in a single comment, but I think this is the way to understand the full picture of my doubt,

You are being so useful to me,

Thank you again!

April 4, 2018 at 11:11 am

Hi Patrik, I’m so happy to hear that you’re starting to love statistics! It’s a great field that is exciting. The thrill of discovery combined with getting the most value out of your data. I’m not sure if you’ve read my post about The Importance of Statistics , but if you haven’t, I recommend it. It explains why the field of statistics is more important than ever!

In OLS regression, the dependent variable does not have to be normally distributed. Instead, you need to assess the distribution of the residuals using residual plots . If your residuals are not normally distributed, there are a variety of possible reasons and different ways to resolve that issue. I always recommend that transforming your data is the last resort. For example, the residuals might be nonnormal because the model is specified incorrectly. Maybe there is curvature in the data that you aren’t modeling correctly? If so, transforming the data might mask the problem. You really want to specify the best possible model. However, if all else fails, you might need to transform the data. When you transform the data, you’ll need to back transform the results to make sense of the results because everything applies to the transformed data. Most statistical software should do this for you.

Be aware that you can’t trust R-squared and the standard error of the regression when you transform your dependent variable because they apply to the transformed data rather than the raw data (backtransformation won’t help there).

In terms of testing the normality of the residuals, I recommend using normal probability plots. You can usually tell at a glance whether they are normally distributed. If you need a test, I generally use the Anderson-Darling test–which you can see in action in my post about identifying the distribution of your data . By the way, as a case in point, the data in that post are not normal, but I use it as the dependent variable in OLS regression in this post about using regression to make predictions . The residuals are normally distributed even though the dependent variable is not.
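Here is a minimal sketch of both checks in Python: a normal probability plot of the residuals plus the Anderson-Darling test. The data are invented just so the example runs end to end.

```python
# Sketch: residual normality checks on a fitted OLS model.
# Data are invented; substitute your own model's residuals.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 + 3 * x + rng.normal(0, 1, 100)                # hypothetical data
model = sm.OLS(y, sm.add_constant(x)).fit()

stats.probplot(model.resid, dist="norm", plot=plt)   # points near the line = roughly normal
plt.show()

ad = stats.anderson(model.resid, dist="norm")        # Anderson-Darling test
print(ad.statistic, ad.critical_values)              # statistic below a critical value: no evidence of nonnormality
```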


March 29, 2018 at 2:27 am

In the coffee intake and smoking example, the first result showed that higher coffee intake leads to higher mortality, but after including smoking, coffee intake leads to lower or no mortality? Smoking was revealed to cause the mortality, but how did coffee intake now result in the opposite? Was a separate test taken for this result? Please let me know. S. CHATTERJEE

March 29, 2018 at 10:36 am

Hi, that’s a great question. It turns out that coffee and smoking are correlated. The negative effects of smoking on mortality are well documented. However, for some reason, the researchers did not originally include smoking in their model. Because drinking coffee and smoking are correlated, the variable for coffee consumption took on some of smoking’s effect on mortality.

Put another way, because smoking was not included in the model, it was not being controlled (held constant). So, as you increased coffee consumption, smoking also tended to increase because it is both positively correlated with coffee consumption and not in the model. Therefore, it appeared as though increased coffee consumption is correlated with higher mortality rates but only because smoking was not included in the model.

Presumably, the researchers had already collected data about smoking. So, all they had to do was include the smoking variable in their regression model. Voila, the model now controls for smoking and the new output displays the new estimate of the effect that coffee has on mortality.

This point illustrates a potential problem. If the researchers had not collected the smoking information, they would have really been stuck. Before conducting any study researchers need to do a lot of background research to be sure that they are collecting the correct data!
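A small simulation makes the mechanism easy to see. In the sketch below, all the numbers are invented: coffee has no true effect on mortality, smoking does, and the two are positively correlated. Leaving smoking out of the model shifts its effect onto the coffee coefficient.

```python
# Sketch: simulated omitted-variable bias. All numbers are invented:
# coffee has no true effect on mortality; smoking does, and the two correlate.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
coffee = rng.normal(3, 1, n)
smoking = 0.8 * coffee + rng.normal(0, 1, n)       # positively correlated with coffee
mortality = 2.0 * smoking + rng.normal(0, 1, n)    # only smoking matters

short = sm.OLS(mortality, sm.add_constant(coffee)).fit()
both = sm.OLS(mortality, sm.add_constant(np.column_stack([coffee, smoking]))).fit()

print(short.params[1])    # "coffee effect" absorbs smoking: roughly 1.6
print(both.params[1:])    # coffee near 0, smoking near 2.0
```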


March 20, 2018 at 12:25 pm

Hi Jim Hope all thing is well,

I have faced a problem with plotting the relationship between the dependent variable (response) and the independent variables. When I do the main effect plots, I get a straight increasing line, y = x. To change this linear trend, I need to make y the square root of time.

I’m stuck on this and couldn’t find a solution for it.


March 5, 2018 at 9:37 am

I was wondering if you can help me? I am doing my dissertation and I have 1 within-subjects IV, and 3 between-subjects IVs.. most of my variables are categorical, but one is not categorical, it is a questionnaire which I am using to determine sleep quality, with both Likert scales and own answers to amount of sleep (hours), amount of times woke in the night etc. Can I use a regression when making use of both categorical data and other? I also have multiple DVs (angry/sad Likert ratings).. but I *could* combine those into one overall ’emotion’ DV. Any help would be much appreciated!

March 5, 2018 at 10:11 am

Hi Cara, because your DVs use the Likert scale, you really should be using Ordinal Logistic Regression. This type of regression is designed for ordinal dependent variables like yours. As for the IVs, it can be tricky using ordinal variables. They’re not quite either continuous or categorical. My suggestion is to give them a try as continuous variables and check the residual plots to see how they look. If they look good, then it’s probably ok. However, if they don’t look good, you can try refitting the model using them as categorical variables and then rechecking the residual plots. If the residuals still don’t look good, you can then try using the chi-square test of independence for ordinal data.

As for combining the data, that would seem to be a subject-area specific decision, and I don’t know that area well enough to make an informed recommendation.
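As a starting point, here is a minimal sketch of an ordinal logistic regression using the OrderedModel class available in recent versions of statsmodels; the file and column names are hypothetical.

```python
# Sketch: ordinal logistic regression for a Likert-scale dependent variable.
# Assumes statsmodels >= 0.12; file and column names are hypothetical.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("sleep_study.csv")
df["emotion"] = df["emotion"].astype(pd.CategoricalDtype(ordered=True))  # ordered Likert DV

model = OrderedModel(df["emotion"],
                     df[["sleep_hours", "night_wakings"]],
                     distr="logit")
res = model.fit(method="bfgs")
print(res.summary())
```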


February 25, 2018 at 5:36 pm

Yes. But it may be that you missed my point. I argue that a proper and sound experiment will allow you to test for causality, regardless of whether you deploy e.g. Pearson’s r or regression. With no experimental design, neither Pearson’s r nor a regression will test for an effect relationship between the variables. Randomisation makes a better case for controlling for variables that you are unaware of than picking a few and then proclaiming that your study found that x will cause an increase in y or that x has an effect on y. You may as well argue that you don’t need to control for any variables and argue that any correlational study tests for effect relationships.

February 25, 2018 at 8:14 pm

Hi Martin, yes, that is exactly what I’m saying. Whether you can draw causal conclusion depends on whether you used a randomized experiment to collect your data. If it’s an observational study, you can’t assume it’s anything other than correlation. What you write in your comment agrees with what I’m saying.

The controlling for other variables that I mention in this post is a different matter. Yes, if you include a variable in a regression model, it is held constant while estimating the effects of the other variables. That doesn’t mean you can assume causality though.

February 25, 2018 at 5:04 pm

No statistical tool or method turns a survey or correlation study into an experiment; i.e., regression does not test or imply a cause-and-effect relationship. A positive relationship between smoking and cancer in a regression analysis does not mean that smoking causes cancer. You have not controlled for what you are unaware of.

February 25, 2018 at 5:22 pm

Hi Martin, you are 100% correct about the fact that correlation doesn’t imply causation. This issue is one that I plan to cover in future posts.

There are two issues at play here. The type of study under which the data were collected and the statistical findings.

Being able to determine causation comes down to the difference between an observational study versus a randomized experiment. You actually use the same analyses to assess both types of designs. In an observational study, you can only establish correlation and not causality. However, in a randomized experiment, the same patterns and correlations in the data can suggest causality. So, regression analysis can help establish causality, but only when it’s performed on data that were collected through a randomized experiment.


February 6, 2018 at 7:11 am

Very nicely explained. Thank you.

February 6, 2018 at 10:04 am

Thank you, Hari!


December 1, 2017 at 2:40 am

Thanks for your reply and for the guidance.

I read your posts which are very helpful. After reading them, I concluded that only the independent variables which have a well-established association with the dependent variable should be included. Hence, in my case, variable Z should not be included given that the association of Z with dependent variable is not well-established.

Furthermore, suppose there is another variable (A) and literature suggests that it, in general, has an association with dependent variable. However, assume that A does not affect any independent variables so there is no omitted variable bias. In this case, if there is no data available for A (due to the study being conducted in different environment/context) then what statistical techniques can be deployed to address any problems caused due to the exclusion of A?

I look forward to your reply and I will be grateful for your reply.

Kind regards.

November 30, 2017 at 9:10 am

Thanks for the reply. I apologise if I am taking a considerable time out of your schedule.

Based on the literature, there isn’t any conclusive evidence that z is a determinant of y. So, that is why I intend to remove z. Some studies include it while some do not and some find significant association (between y and z) while some find the association insignificant. Hence, I think I can safely remove it.

Moreover, I will be grateful if you can answer another query. From a statistical viewpoint, is it fine if I use the Generalized Method of Moments (GMM) for a binary dependent variable?

November 30, 2017 at 2:24 pm

While I can’t offer you a concrete statement about whether you should include or exclude the variable (clearly there is disagreement in your own field), I do suggest that you read my article about specifying the correct regression model . I include a number of tips and considerations.

Unfortunately, I don’t know enough about GMM to make a recommendation. All of the examples I have seen personally are for continuous data, but I don’t know about binary data.

November 29, 2017 at 11:12 am

Thanks for your reply. I really appreciate it. Could you please also provide an answer to my query mentioned below for further clarification?

November 29, 2017 at 11:01 am

Further clarification on my above post. From the internet, I found that if a variable (z) is related to y but unrelated to x, then the inclusion of z will reduce the standard errors of x. So, if z is excluded but the F-stat and adjusted R-squared are fine, do high standard errors create problems? I look forward to your reply.

November 29, 2017 at 11:50 am

Yes, what you read is correct. Typically, if Z is statistically significant, you should include it in your model. If you exclude it, the precision of your coefficient estimates will be lower (higher standard errors). You also risk a biased model because you are not including important information in the model–check the residual plots. The F-test of overall significance and adjusted R-squared depend on the other IVs in your model. If Z is by far the best variable, it’s possible that removing it will cause the F-test to not be significant and adjusted R-square might drop noticeably. Again, that depends on how the explanatory power of Z compares to the other IVs. Why do you want to remove a significant variable?

November 29, 2017 at 10:29 am

Thanks for the reply. Jim.

I am unable to understand “Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable”. Are you stating that the other independent variables will be fine but R-squared will become low? I will be grateful if you can explain this.

Kind regards

November 29, 2017 at 11:04 am

Hi, you indicated that the removed independent variable is related to the dependent variable, but it is not correlated with the other independent variables. Consequently, removing that independent variable should reduce R-squared. For one thing, that’s the typical result of removing variables, even when they’re not statistically significant. In this case, because it is not correlated to the other independent variables, you know that the removed variable is supplying unique information. Taking that variable out means that information is no longer included in the model. R-squared will definitely go down, possibly dramatically.

R-squared measures the strength of the relationship between the entire set of IVs and the DV. Read my post about R-squared for more information.

November 29, 2017 at 9:15 am

Hello, Jim.

What is the impact* on the independent variables in the model if I omit a variable that is a determinant of dependent variable but is not related to any of the independent variables?

*Here impact relates to the independent variables’ p-values and the coefficients.

November 29, 2017 at 10:13 am

If the independent variable is not correlated with the other independent variables, it’s likely that there would be a minimal effect on the other independent variables. Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable. You should also check the residual plots to be sure that by removing the variable you’re not introducing bias.


October 26, 2017 at 12:33 am

Why do we usually use a 5% level of significance for comparing instead of 1% or some other level?

October 26, 2017 at 12:49 am

Hi, I actually write about this topic in a post about hypothesis testing . It’s basically a tradeoff between several different error rates–and a dash of tradition. Read that post and see if it answers your questions.

October 24, 2017 at 11:30 pm

Sir, usually we take a 5% level of significance for comparing. Why 0?

October 24, 2017 at 11:35 pm

Hi Ghulam, yes, the significance level is usually 0.05. I’m not sure what you’re asking about in regards to zero? The p-values in the example output are all listed as 0.000, which is less than the significance level of 0.05, so they are statistically significant.


October 23, 2017 at 9:08 am

In my model, I use different independent variables. Now my question is: before using regression, do I need to check the distribution of the data? If yes, then please write the names of the tests. My title is Education and Productivity Nexus: evidence from the pharmaceutical sector in Bangladesh.

October 23, 2017 at 11:22 am

Hi Shamsun, typically you test the distribution of the residuals after you fit a model. I’ve written a blog post about checking your residual plots that you should read.

I hope this helps! Jim


October 22, 2017 at 4:24 am

Thank you Mr. Jim

October 22, 2017 at 11:15 am

You’re very welcome!


October 22, 2017 at 2:31 am

In linear regression, can we use categorical variables as independent variables? If yes, what should be the minimum or maximum number of categories in an independent variable?

October 22, 2017 at 10:44 pm

Hi, yes you can use categorical variables as independent variables! The number of groups really depends on what makes sense for your study area. Of course, the minimum is two. There really is no maximum in theory. It depends on what makes sense for your study. However, in practice, having more groups requires a larger total sample size, which can become expensive. If you have 2-9 groups, you should have at least 15 in each group. For 10-12 groups, you should have 20. These numbers are based on simulation studies for ANOVA, but they also apply to categorical variables in regression. In a nutshell, figure out what makes sense for your study and then be sure to collect enough data!

I hope this helps! Jim
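For anyone wondering what this looks like in practice, here is a minimal Python sketch using formula notation, where C() dummy-codes a categorical independent variable; the file and column names are hypothetical.

```python
# Sketch: a categorical independent variable in linear regression.
# C() dummy-codes the groups; file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study.csv")
model = smf.ols("y ~ x1 + C(group)", data=df).fit()
print(model.summary())   # one coefficient per non-reference group level
```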


What is Regression Analysis and Why Should I Use It?


Alchemer is an incredibly robust online survey software platform. It’s continually voted one of the best survey tools available on G2, FinancesOnline, and others. To make it even easier, we’ve created a series of blogs to help you better understand how to get the most from your Alchemer account.

Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest. 

While there are many types of regression analysis, at their core they all examine the influence of one or more independent variables on a dependent variable. 

Regression analysis provides detailed insight that can be applied to further improve products and services.

Here at Alchemer, we offer hands-on application training events during which customers  learn how to become super users of our software. 

In order to understand the value being delivered at these training events, we distribute follow-up surveys to attendees with the goals of learning what they enjoyed, what they didn’t, and what we can improve on for future sessions. 

The data collected from these feedback surveys allows us to measure the levels of satisfaction that our attendees associate with our events, and what variables influence those levels of satisfaction. 

Could it be the topics covered in the individual sessions of the event? The length of the sessions? The food or catering services provided? The cost to attend? Any of these variables have the potential to impact an attendee’s level of satisfaction.

By performing a regression analysis on this survey data, we can determine whether or not these variables have impacted overall attendee satisfaction, and if so, to what extent. 

This information then informs us about which elements of the sessions are being well received, and where we need to focus attention so that attendees are more satisfied in the future.

What is regression analysis and what does it mean to perform a regression?

Regression analysis is a reliable method of identifying which variables have impact on a topic of interest. The process of performing a regression allows you to confidently determine which factors matter most, which factors can be ignored, and how these factors influence each other.

In order to understand regression analysis fully, it’s essential to comprehend the following terms:

  • Dependent Variable: This is the main factor that you’re trying to understand or predict. 
  • Independent Variables: These are the factors that you hypothesize have an impact on your dependent variable.

In our application training example above, attendees’ satisfaction with the event is our dependent variable. The topics covered, length of sessions, food provided, and the cost of a ticket are our independent variables.

How does regression analysis work?

In order to conduct a regression analysis, you’ll need to define a dependent variable that you hypothesize is being influenced by one or several independent variables.

You’ll then need to establish a comprehensive dataset to work with. Administering surveys to your audiences of interest is a terrific way to establish this dataset. Your survey should include questions addressing all of the independent variables that you are interested in.

Let’s continue using our application training example. In this case, we’d want to measure the historical levels of satisfaction with the events from the past three years or so (or however far back you deem sufficient), as well as any information possible in regards to the independent variables.

Perhaps we’re particularly curious about how the price of a ticket to the event has impacted levels of satisfaction. 

To begin investigating whether or not there is a relationship between these two variables, we would begin by plotting these data points on a chart, which would look like the following theoretical example.


(Plotting your data is the first step in figuring out if there is a relationship between your independent and dependent variables)

Our dependent variable (in this case, the level of event satisfaction) should be plotted on the y-axis, while our independent variable (the price of the event ticket) should be plotted on the x-axis.

Once your data is plotted, you may begin to see correlations. If the theoretical chart above did indeed represent the impact of ticket prices on event satisfaction, then we’d be able to confidently say that the higher the ticket price, the higher the levels of event satisfaction. 

But how can we tell the degree to which ticket price affects event satisfaction?

To begin answering this question, draw a line through the middle of all of the data points on the chart. This line is referred to as your regression line, and it can be precisely calculated using a standard statistics program like Excel.

We’ll use a theoretical chart once more to depict what a regression line should look like.

The regression line summarizes the relationship between X and Y.

The regression line represents the relationship between your independent variable and your dependent variable. 

Excel will even provide a formula for the slope of the line, which adds further context to the relationship between your independent and dependent variables. 

The formula for a regression line might look something like Y = 100 + 7X + error term .

This tells you that if there is no “X”, then Y = 100. If X is our increase in ticket price, this informs us that if there is no increase in ticket price, event satisfaction will still be at 100 points. The 7 tells you that each additional unit of X (each increase in ticket price) is associated with a 7-point rise in event satisfaction.

You’ll notice that the slope formula calculated by Excel includes an error term. Regression lines always consider an error term because in reality, independent variables are never precisely perfect predictors of dependent variables. This makes sense while looking at the impact of  ticket prices on event satisfaction — there are clearly other variables that are contributing to event satisfaction outside of price.

Your regression line is simply an estimate based on the data available to you. So, the larger your error term, the less definitively certain your regression line is.
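If you’d rather compute the line programmatically than read it off Excel’s chart, here is a minimal Python sketch with made-up ticket prices and satisfaction scores; it recovers an intercept and slope close to the Y = 100 + 7X example above.

```python
# Sketch: computing the regression line's intercept and slope.
# Ticket prices and satisfaction scores are made-up illustrations.
import numpy as np

price = np.array([50, 75, 100, 125, 150, 175])
satisfaction = np.array([460, 630, 790, 980, 1150, 1320])

slope, intercept = np.polyfit(price, satisfaction, deg=1)  # least-squares line
print(f"satisfaction ≈ {intercept:.0f} + {slope:.1f} × price")
```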

Why should your organization use regression analysis?

Regression analysis is a helpful statistical method that can be leveraged across an organization to determine the degree to which particular independent variables are influencing dependent variables.

The possible scenarios for conducting regression analysis to yield valuable, actionable business insights are endless.

The next time someone in your business is proposing a hypothesis that states that one factor, whether you can control that factor or not, is impacting a portion of the business, suggest performing a regression analysis to determine just how confident you should be in that hypothesis! This will allow you to make more informed business decisions, allocate resources more efficiently, and ultimately boost your bottom line.




What Is Regression Analysis in Business Analytics?


  • 14 Dec 2021

Countless factors impact every facet of business. How can you consider those factors and know their true impact?

Imagine you seek to understand the factors that influence people’s decision to buy your company’s product. They range from customers’ physical locations to satisfaction levels among sales representatives to your competitors' Black Friday sales.

Understanding the relationships between each factor and product sales can enable you to pinpoint areas for improvement, helping you drive more sales.

To learn how each factor influences sales, you need to use a statistical analysis method called regression analysis .

If you aren’t a business or data analyst, you may not run regressions yourself, but knowing how analysis works can provide important insight into which factors impact product sales and, thus, which are worth improving.


Foundational Concepts for Regression Analysis

Before diving into regression analysis, you need to build foundational knowledge of statistical concepts and relationships.

Independent and Dependent Variables

Start with the basics. What relationship are you aiming to explore? Try formatting your answer like this: “I want to understand the impact of [the independent variable] on [the dependent variable].”

The independent variable is the factor that could impact the dependent variable . For example, “I want to understand the impact of employee satisfaction on product sales.”

In this case, employee satisfaction is the independent variable, and product sales is the dependent variable. Identifying the dependent and independent variables is the first step toward regression analysis.

Correlation vs. Causation

One of the cardinal rules of statistically exploring relationships is to never assume correlation implies causation. In other words, just because two variables move in the same direction doesn’t mean one caused the other to occur.

If two or more variables are correlated , their directional movements are related. If two variables are positively correlated , it means that as one goes up or down, so does the other. Alternatively, if two variables are negatively correlated , one goes up while the other goes down.

A correlation’s strength can be quantified by calculating the correlation coefficient , sometimes represented by r . The correlation coefficient falls between negative one and positive one.

r = -1 indicates a perfect negative correlation.

r = 1 indicates a perfect positive correlation.

r = 0 indicates no correlation.

Causation means that one variable caused the other to occur. Proving a causal relationship between variables requires a true experiment with a control group (which doesn’t receive the independent variable) and an experimental group (which receives the independent variable).

While regression analysis provides insights into relationships between variables, it doesn’t prove causation. It can be tempting to assume that one variable caused the other—especially if you want it to be true—which is why you need to keep this in mind any time you run regressions or analyze relationships between variables.

With the basics under your belt, here’s a deeper explanation of regression analysis so you can leverage it to drive strategic planning and decision-making.


What Is Regression Analysis?

Regression analysis is the statistical method used to determine the structure of a relationship between two variables (single linear regression) or three or more variables (multiple regression).

According to the Harvard Business School Online course Business Analytics , regression is used for two primary purposes:

  • To study the magnitude and structure of the relationship between variables
  • To forecast a variable based on its relationship with another variable

Both of these insights can inform strategic business decisions.

“Regression allows us to gain insights into the structure of that relationship and provides measures of how well the data fit that relationship,” says HBS Professor Jan Hammond, who teaches Business Analytics, one of three courses that comprise the Credential of Readiness (CORe) program . “Such insights can prove extremely valuable for analyzing historical trends and developing forecasts.”

One way to think of regression is by visualizing a scatter plot of your data with the independent variable on the X-axis and the dependent variable on the Y-axis. The regression line is the line that best fits the scatter plot data. The regression equation represents the line’s slope and the relationship between the two variables, along with an estimation of error.

Physically creating this scatter plot can be a natural starting point for parsing out the relationships between variables.


Types of Regression Analysis

There are two types of regression analysis: single variable linear regression and multiple regression.

Single variable linear regression is used to determine the relationship between two variables: the independent and dependent. The equation for a single variable linear regression looks like this:

y = α + βx + ε

In the equation:

  • ŷ is the expected value of Y (the dependent variable) for a given value of X (the independent variable).
  • x is the independent variable.
  • α is the Y-intercept, the point at which the regression line intersects with the vertical axis.
  • β is the slope of the regression line, or the average change in the dependent variable as the independent variable increases by one.
  • ε is the error term, equal to Y – ŷ, or the difference between the actual value of the dependent variable and its expected value.

Multiple regression , on the other hand, is used to determine the relationship between three or more variables: the dependent variable and at least two independent variables. The multiple regression equation looks complex but is similar to the single variable linear regression equation:

y = α + β₁x₁ + β₂x₂ + … + βₖxₖ + ε

Each component of this equation represents the same thing as in the previous equation, with the addition of the subscript k, which is the total number of independent variables being examined. For each independent variable you include in the regression, multiply the slope of the regression line by the value of the independent variable, and add it to the rest of the equation.
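As an illustration, here is a minimal Python sketch of estimating the intercept (α) and the slopes (β) of a multiple regression; the file and column names are hypothetical.

```python
# Sketch: multiple regression with one intercept (alpha) and a slope (beta) per IV.
# File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("sales.csv")
model = smf.ols("sales ~ employee_satisfaction + ad_spend + price", data=df).fit()
print(model.params)   # Intercept plus one coefficient per independent variable
```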

How to Run Regressions

You can use a host of statistical programs—such as Microsoft Excel, SPSS, and STATA—to run both single variable linear and multiple regressions. If you’re interested in hands-on practice with this skill, Business Analytics teaches learners how to create scatter plots and run regressions in Microsoft Excel, as well as make sense of the output and use it to drive business decisions.

Calculating Confidence and Accounting for Error

It’s important to note: This overview of regression analysis is introductory and doesn’t delve into calculations of confidence level, significance, variance, and error. When working in a statistical program, these calculations may be provided or require that you implement a function. When conducting regression analysis, these metrics are important for gauging how significant your results are and how much importance to place on them.


Why Use Regression Analysis?

Once you’ve generated a regression equation for a set of variables, you effectively have a roadmap for the relationship between your independent and dependent variables. If you input a specific X value into the equation, you can see the expected Y value.

This can be critical for predicting the outcome of potential changes, allowing you to ask, “What would happen if this factor changed by a specific amount?”

Returning to the earlier example, running a regression analysis could allow you to find the equation representing the relationship between employee satisfaction and product sales. You could input a higher level of employee satisfaction and see how sales might change accordingly. This information could lead to improved working conditions for employees, backed by data that shows the tie between high employee satisfaction and sales.

Whether predicting future outcomes, determining areas for improvement, or identifying relationships between seemingly unconnected variables, understanding regression analysis can enable you to craft data-driven strategies and determine the best course of action with all factors in mind.

Do you want to become a data-driven professional? Explore our eight-week Business Analytics course and our three-course Credential of Readiness (CORe) program to deepen your analytical skills and apply them to real-world business problems.


What is Regression Analysis? Definition, Types, and Examples


Kate Williams

Last Updated: 22 January 2024




If you want to find data trends or predict sales based on certain variables, then regression analysis is the way to go.

In this article, we will learn about regression analysis, types of regression analysis, business applications, and its use cases. Feel free to jump to a section that’s relevant to you.

  • What is the definition of regression analysis?
  • Regression analysis: FAQs
  • Why is regression analysis important?
  • Types of regression analysis and when to use them
  • How is regression analysis used by businesses
  • Use cases of regression analysis

What is Regression Analysis?

Need a quick regression definition? In simple terms, regression analysis identifies the variables that have an impact on another variable .

The regression model is primarily used in finance, investing, and other areas to determine the strength and character of the relationship between one dependent variable and a series of other variables.

Regression Analysis: FAQs

Let us look at some of the most commonly asked questions about regression analysis before we head deep into understanding everything about the regression method.

1. What is the meaning of multiple regression analysis?

Multiple regression analysis is a statistical method that is used to predict the value of a dependent variable based on the values of two or more independent variables.

2. In regression analysis, what is the predictor variable called?

The predictor variable is the name given to an independent variable that we use in regression analysis.

The predictor variable provides information about an associated dependent variable regarding a certain outcome. At their core, predictor variables are those that are linked with particular outcomes.

3. What is a residual plot in a regression analysis?

A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis.

Moreover, the residual plot shows how far each data point falls (vertically) from the regression model’s prediction equation. If the residuals are scattered randomly above and below zero with no visible pattern, the model is considered to fit the data well.

4. What is linear regression analysis?

Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable that you want to predict is referred to as the dependent variable. The variable that you are using to predict the other value is called the independent variable.


Why is Regression Analysis Important?

There are many business applications of regression analysis.

  • For any machine learning problem which involves continuous numbers , regression analysis is essential. Some of those instances could be:
  • Testing automobiles
  • Weather analysis, and prediction
  • Sales and promotions forecasting
  • Financial forecasting
  • Time series forecasting
  • Regression analysis data also helps you understand whether the relationship between two different variables can give way to potential business opportunities .
  • For example, if you change one variable (say delivery speed), regression analysis will tell you the kind of effect that it has on other variables (such as customer satisfaction, small value orders, etc).
  • One of the best ways to solve regression issues in machine learning using a data model is through regression analysis. Plotting points on a chart, and running the best fit line , helps predict the possibility of errors.
  • The insights from these patterns help businesses to see the kind of difference that it makes to their bottom line .

5 Types of Regression Analysis and When to Use Them

1. Linear Regression Analysis

  • This type of regression analysis is one of the most basic types of regression and is used extensively in machine learning .
  • Linear regression has a predictor variable and a dependent variable which are related to each other linearly.
  • Moreover, linear regression is used in cases where the relationship between the variables is related in a linear fashion.

Let’s say you are looking to measure the impact of email marketing on your sales. The linear analysis can be wrong as there will be aberrations. So, you should not use big data sets for linear regression.

2. Logistic Regression Analysis

  • If your dependent variable has discrete values , that is, if it can take only one of two values, then logistic regression is the way to go.
  • The two values could be either 0 or 1, black or white, true or false, proceed or not proceed, and so on.
  • To show the relationship between the target and independent variables, logistic regression uses a sigmoid curve.

This type of regression is best used when there are large data sets that have a chance of equal occurrence of values in target variables. There should not be a huge correlation between the independent variables in the dataset.

3. Lasso Regression Analysis

  • Lasso regression is a regularization technique that reduces the model’s complexity.
  • How does it do that? By limiting the absolute size of the regression coefficient .
  • When doing so, some coefficient values are driven all the way to exactly zero. This does not happen with ridge regression.

Lasso regression is advantageous because it performs feature selection: it lets you select a subset of features from the dataset to build your model. Since it uses only the required features, lasso regression manages to avoid overfitting, as sketched below.
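A minimal sketch with the glmnet package (alpha = 1 selects the lasso penalty; the mtcars data stand in for a real business dataset):

```r
library(glmnet)                      # penalised regression (install from CRAN)

x <- as.matrix(mtcars[, -1])         # predictor matrix
y <- mtcars$mpg                      # response
cvfit <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 -> lasso; cross-validates lambda
coef(cvfit, s = "lambda.min")        # some coefficients are shrunk exactly to 0
```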

4. Ridge Regression Analysis

  • If there is a high correlation between independent variables , ridge regression is the recommended tool.
  • It is also a regularization technique that reduces the complexity of the model .

Ridge regression makes the model less prone to overfitting by introducing a small amount of bias, known as the ridge penalty: a bias matrix (a multiple of the identity matrix) is added to the normal equations, shrinking all coefficients toward zero.
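In matrix form (a standard result, not specific to any one source), the ridge estimator makes the "bias matrix" explicit: the penalty λ adds a multiple of the identity matrix I to XᵀX before inversion, and λ = 0 recovers ordinary least squares:

```latex
\hat{\beta}_{\text{ridge}} = (X^{\top}X + \lambda I)^{-1} X^{\top} y
```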

5. Polynomial Regression Analysis

  • Polynomial regression models a non-linear dataset with the help of a linear model .
  • It works in much the same way as multiple linear regression, but fits a non-linear curve and is mainly employed when the data points follow a non-linear pattern.
  • It transforms the data points into polynomial features of a given degree and manages to model them in the form of a linear model.

Polynomial regression involves fitting the data points using a polynomial curve. Since this model is susceptible to overfitting, businesses are advised to examine the fitted curve near the ends of the data range, where polynomials behave most erratically, to make sure the results are accurate.
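A minimal sketch in R with simulated non-linear data; poly() builds the polynomial features, and the fitted model is still linear in those features:

```r
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(50)  # hypothetical curved relationship

fit <- lm(y ~ poly(x, 3))                 # degree-3 polynomial regression
plot(x, y)
lines(x, predict(fit))                    # inspect the curve, especially the ends
```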

While there are many more regression analysis techniques, these are the most popular ones.


How is regression analysis used by businesses?

Regression stats help businesses understand what their data points represent and how to use them with the help of business analytics techniques.

Using this regression model, you will understand how the typical value of the dependent variable changes when one independent variable is varied while the other independent variables are held fixed.

Data professionals use this incredibly powerful statistical tool to remove unwanted variables and select the ones that are more important for the business.

Here are some uses of regression analysis:

1. Business Optimization

  • The whole objective of regression analysis is to make use of the collected data and turn it into actionable insights .
  • With the help of regression analysis, there won’t be any guesswork or hunches based on which decisions need to be made.
  • Data-driven decision-making improves the output that the organization provides.
  • Regression charts also help organizations experiment with inputs that might not have been considered earlier; because such experiments are backed by data, the chances of success are much higher.
  • When there is a lot of data available, the accuracy of the insights will also be high.

2. Predictive Analytics

  • For businesses that want to stay ahead of the competition, they need to be able to predict future trends. Organizations use regression analysis to understand what the future holds for them.
  • To forecast trends, the data analysts predict how the dependent variables change based on the specific values given to them.
  • You can use multivariate linear regression for tasks such as charting growth plans, forecasting sales volumes, predicting inventory required, and so on.
  • Find out more about the area so that you can gather data from different sources
  • Collect the data required for the relevant variables
  • Specify and estimate your regression model
  • If you have a model that fits the data, use it to come up with predictions, as in the sketch below
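A minimal sketch of that workflow in R; the sales history and predictor names below are hypothetical:

```r
# Hypothetical history: sales alongside two candidate drivers
history <- data.frame(
  sales     = c(120, 135, 150, 160, 180, 200),
  ad_spend  = c(10, 12, 15, 15, 18, 20),
  headcount = c(5, 5, 6, 6, 7, 8)
)
fit <- lm(sales ~ ad_spend + headcount, data = history)  # specify the model
summary(fit)                                             # check how well it fits
predict(fit,                                             # forecast a new period
        newdata = data.frame(ad_spend = 22, headcount = 8),
        interval = "prediction")                         # with uncertainty bounds
```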

3. Decision-making

  • For businesses to run effectively, they need to make better decisions and be aware of how each of their decisions will affect them. Without understanding the consequences of their decisions, smooth functioning becomes difficult.
  • Businesses need to collect information about each of their departments – sales, operations, marketing, finance, HR, expenditures, budgetary allocation, and so on. Using relevant parameters and analyzing them helps businesses improve their outcomes.
  • Regression analysis helps businesses understand their data and gain insights into their operations . Business analysts use regression analysis extensively to make strategic business decisions.

4. Understanding failures

  • One of the most important things many businesses fail to do is reflect on their failures.
  • Without contemplating why they met with failure for a marketing campaign or why their churn rate increased in the last two years, they will never find ways to make it right.
  • Regression analysis provides quantitative support to enable this kind of decision-making.

5. Predicting Success

  • You can use regression analysis to predict the probability of success of an organization in various aspects.
  • Additionally, regression analyses various sales data points, including current sales data, to understand and predict the success rate in the future.

6. Risk Analysis

  • When analyzing data, data analysts sometimes make the mistake of treating correlation and causation as the same thing. However, businesses should know that correlation is not causation.
  • Financial organizations use regression data to assess their risk and guide them to make sound business decisions.

7. Provides New Insights

  • Looking at a huge set of data will help you get new insights. But data, without analysis, is meaningless.
  • With the help of regression analysis, you can find the relationship between a variety of variables to uncover patterns.
  • For example, regression models might indicate that there are more returns from a particular seller. The eCommerce company can then get in touch with the seller to understand how they ship their products.

Each of these issues has its own solution. Without regression analysis, it might have been difficult to understand exactly what the issue was in the first place.

8. Analyze marketing effectiveness

  • When the company wants to know if the funds they have invested in marketing campaigns for a particular brand will give them enough ROI, then regression analysis is the way to go.
  • It is possible to check the isolated impact of each of the campaigns by controlling the factors that will have an impact on the sales.
  • Businesses invest in a number of marketing channels – email marketing, paid ads, Instagram influencers, etc. Regression statistics can capture the isolated ROI as well as the combined ROI of each of these channels.


7 Use Cases of Regression Analysis

1. Credit Card

  • Credit card companies use regression analysis to understand various user factors such as the consumer’s future behavior, prediction of credit balance, risk of customer’s credit default, etc.
  • All of these data points help the company implement specific EMI options based on the results.
  • This will help credit card companies take note of the risky customers.
2. Finance

  • Simple linear regression (also called Ordinary Least Squares (OLS)) gives an overall rationale for placing the line of best fit among the data points.
  • One of the most common applications of this statistical model is the Capital Asset Pricing Model (CAPM), which describes the relationship between the returns and risks of investing in a security.

3. Pharmaceuticals

  • Pharmaceutical companies use regression analysis on quantitative stability data to estimate the shelf life of a product, since it reveals the nature of the relationship between an attribute and time.
  • Medical researchers use regression analysis to understand whether changes in drug dosage will have an impact on the blood pressure of patients.

For example, researchers will administer different dosages of a certain drug to patients and observe changes in their blood pressure. They will fit a simple regression model where they use dosage as the predictor variable and blood pressure as the response variable.

4. Text Editing

  • Logistic regression is a popular choice for a number of natural language processing (NLP) tasks.
  • After the text has been preprocessed into numerical features, you can use logistic regression to make claims about each text fragment.
  • Email sorting, toxic speech detection, topic classification for questions, etc, are some of the areas where logistic regression shows great results.

5. Hospitality

  • You can use regression analysis to predict users' intentions and recognize them. For example: where do the customers want to go? What are they planning to do?
  • It can even predict what a customer is looking for before they finish typing in the search bar, based on how they started.
  • It is not possible to build such a huge and complex system from scratch. There are already several machine learning algorithms that have accumulated data and have simple models that make such predictions possible.

6. Professional sports

  • Data scientists working with professional sports teams use regression analysis to understand the effect that training regimens will have on the performance of players.
  • They will find out how different types of exercises, like weightlifting sessions or Zumba sessions, affect the number of points a player scores for their team (let's say basketball).
  • Using Zumba and weightlifting as the predictor variables, and the total points scored as the response variable, they will fit the regression model.

Depending on the final values, the analysts will recommend that a player participates in more or less weightlifting or Zumba sessions to maximize their performance.

7. Agriculture

  • Agricultural scientists use regression analysis to understand how different fertilizers affect the yield of crops.
  • For example, the analysts might apply different types of fertilizer and amounts of water to fields to understand whether there is an impact on the crop's yield.
  • Based on the final results, the agriculture analysts will adjust the amounts of fertilizer and water to maximize the crop output.

Wrapping Up

Using regression analysis helps you isolate the effects of individual variables in complicated research questions. It will allow you to make informed decisions, guide your resource allocation, and, used effectively, can meaningfully improve your bottom line.

If you are looking for an online survey tool to gather data for your regression analysis, SurveySparrow is one of the best choices. SurveySparrow has a host of features that lets you do as much as possible with a survey tool. Get on a call with us to understand how we can help you.


Cardiopulm Phys Ther J. 2009 Sep; 20(3)

Regression Analysis for Prediction: Understanding the Process

Phillip B. Palmer

1 Hardin-Simmons University, Department of Physical Therapy, Abilene, TX

Dennis G. O'Connell

2 Hardin-Simmons University, Department of Physical Therapy, Abilene, TX

Research related to cardiorespiratory fitness often uses regression analysis in order to predict cardiorespiratory status or future outcomes. Reading these studies can be tedious and difficult unless the reader has a thorough understanding of the processes used in the analysis. This feature seeks to “simplify” the process of regression analysis for prediction in order to help readers understand this type of study more easily. Examples of the use of this statistical technique are provided in order to facilitate better understanding.

INTRODUCTION

Graded, maximal exercise tests that directly measure maximum oxygen consumption (VO 2 max) are impractical in most physical therapy clinics because they require expensive equipment and personnel trained to administer the tests. Performing these tests in the clinic may also require medical supervision; as a result researchers have sought to develop exercise and non-exercise models that would allow clinicians to predict VO 2 max without having to perform direct measurement of oxygen uptake. In most cases, the investigators utilize regression analysis to develop their prediction models.

Regression analysis is a statistical technique for determining the relationship between a single dependent (criterion) variable and one or more independent (predictor) variables. The analysis yields a predicted value for the criterion resulting from a linear combination of the predictors. According to Pedhazur, 15 regression analysis has 2 uses in scientific literature: prediction, including classification, and explanation. The following provides a brief review of the use of regression analysis for prediction. Specific emphasis is given to the selection of the predictor variables (assessing model efficiency and accuracy) and cross-validation (assessing model stability). The discussion is not intended to be exhaustive. For a more thorough explanation of regression analysis, the reader is encouraged to consult one of many books written about this statistical technique (eg, Fox; 5 Kleinbaum, Kupper, & Muller; 12 Pedhazur; 15 and Weisberg 16 ). Examples of the use of regression analysis for prediction are drawn from a study by Bradshaw et al. 3 In this study, the researchers' stated purpose was to develop an equation for prediction of cardiorespiratory fitness (CRF) based on non-exercise (N-EX) data.

SELECTING THE CRITERION (OUTCOME MEASURE)

The first step in regression analysis is to determine the criterion variable. Pedhazur 15 suggests that the criterion have acceptable measurement qualities (ie, reliability and validity). Bradshaw et al 3 used VO 2 max as the criterion of choice for their model and measured it using a maximum graded exercise test (GXT) developed by George. 6 George 6 indicated that his protocol for testing compared favorably with the Bruce protocol in terms of predictive ability and had good test-retest reliability (ICC = .98–.99). The American College of Sports Medicine indicates that measurement of VO 2 max is the “gold standard” for measuring cardiorespiratory fitness. 1 These facts support that the criterion selected by Bradshaw et al 3 was appropriate and meets the requirements for acceptable reliability and validity.

SELECTING THE PREDICTORS: MODEL EFFICIENCY

Once the criterion has been selected, predictor variables should be identified (model selection). The aim of model selection is to minimize the number of predictors which account for the maximum variance in the criterion. 15 In other words, the most efficient model maximizes the value of the coefficient of determination ( R 2 ). This coefficient estimates the amount of variance in the criterion score accounted for by a linear combination of the predictor variables. The higher the value is for R 2 , the less error or unexplained variance and, therefore, the better prediction. R 2 is dependent on the multiple correlation coefficient ( R ), which describes the relationship between the observed and predicted criterion scores. If there is no difference between the predicted and observed scores, R equals 1.00. This represents a perfect prediction with no error and no unexplained variance ( R 2 = 1.00). When R equals 0.00, there is no relationship between the predictor(s) and the criterion and no variance in scores has been explained ( R 2 = 0.00). The chosen variables cannot predict the criterion. The goal of model selection is, as stated previously, to develop a model that results in the highest estimated value for R 2 .

According to Pedhazur, 15 the value of R is often overestimated. The reasons for this are beyond the scope of this discussion; however, the degree of overestimation is affected by sample size. The larger the ratio is between the number of predictors and subjects, the larger the overestimation. To account for this, sample sizes should be large and there should be 15 to 30 subjects per predictor. 11 , 15 Of course, the most effective way to determine optimal sample size is through statistical power analysis. 11 , 15

Another method of determining the best model for prediction is to test the significance of adding one or more variables to the model using the partial F-test . This process, which is further discussed by Kleinbaum, Kupper, and Muller, 12 allows for exclusion of predictors that do not contribute significantly to the prediction, allowing determination of the most efficient model of prediction. In general, the partial F-test is similar to the F-test used in analysis of variance. It assesses the statistical significance of the difference between values for R 2 derived from 2 or more prediction models using a subset of the variables from the original equation. For example, Bradshaw et al 3 indicated that all variables contributed significantly to their prediction. Though the researchers do not detail the procedure used, it is highly likely that different models were tested, excluding one or more variables, and the resulting values for R 2 assessed for statistical difference.
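In R, one plausible way to carry out such a comparison (the variable and data names here are placeholders, not the authors' code) is to fit the full and reduced models and pass them to anova(), which performs the partial F-test for nested linear models:

```r
# Placeholder names: 'dat' holds the criterion vo2max and the five predictors
full    <- lm(vo2max ~ gender + age + bmi + pfa + par, data = dat)
reduced <- lm(vo2max ~ gender + age + bmi + par, data = dat)  # PFA removed
anova(reduced, full)  # a significant F indicates PFA contributes to prediction
```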

Although the techniques discussed above are useful in determining the most efficient model for prediction, theory must be considered in choosing the appropriate variables. Previous research should be examined and predictors selected for which a relationship between the criterion and predictors has been established. 12 , 15

It is clear that Bradshaw et al 3 relied on theory and previous research to determine the variables to use in their prediction equation. The 5 variables they chose for inclusion–gender, age, body mass index (BMI), perceived functional ability (PFA), and physical activity rating (PAR)–had been shown in previous studies to contribute to the prediction of VO 2 max (eg, Heil et al; 8 George, Stone, & Burkett 7 ). These 5 predictors accounted for 87% ( R = .93, R 2 = .87 ) of the variance in the predicted values for VO 2 max. Based on a ratio of 1:20 (predictor:sample size), this estimate of R , and thus R 2 , is not likely to be overestimated. The researchers used changes in the value of R 2 to determine whether to include or exclude these or other variables. They reported that removal of perceived functional ability (PFA) as a variable resulted in a decrease in R from .93 to .89. Without this variable, the remaining 4 predictors would account for only 79% of the variance in VO 2 max. The investigators did note that each predictor variable contributed significantly ( p < .05 ) to the prediction of VO 2 max (see above discussion related to the partial F-test).

ASSESSING ACCURACY OF THE PREDICTION

Assessing accuracy of the model is best accomplished by analyzing the standard error of estimate ( SEE ) and the percentage that the SEE represents of the predicted mean ( SEE % ). The SEE represents the degree to which the predicted scores vary from the observed scores on the criterion measure, similar to the standard deviation used in other statistical procedures. According to Jackson, 10 lower values of the SEE indicate greater accuracy in prediction. Comparison of the SEE for different models using the same sample allows for determination of the most accurate model to use for prediction. SEE % is calculated by dividing the SEE by the mean of the criterion ( SEE /mean criterion) and can be used to compare different models derived from different samples.
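For a model fitted with lm() in R, both quantities can be computed directly (a sketch; 'fit' stands for any fitted linear model):

```r
see     <- sigma(fit)  # residual standard error = SEE, i.e. sqrt(SSE / (n - k - 1))
y       <- model.response(model.frame(fit))
see_pct <- 100 * see / mean(y)  # SEE as a percentage of the criterion mean (SEE%)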

Bradshaw et al 3 report a SEE of 3.44 mL·kg −1 ·min −1 (approximately 1 MET) using all 5 variables in the equation (gender, age, BMI, PFA, PA-R). When the PFA variable is removed from the model, leaving only 4 variables for the prediction (gender, age, BMI, PA-R), the SEE increases to 4.20 mL·kg −1 ·min −1 . The increase in the error term indicates that the model excluding PFA is less accurate in predicting VO 2 max. This is confirmed by the decrease in the value for R (see discussion above). The researchers compare their model of prediction with that of George, Stone, and Burkett, 7 indicating that their model is as accurate. It is not advisable to compare models based on the SEE if the data were collected from different samples as they were in these 2 studies. That type of comparison should be made using SEE %. Bradshaw and colleagues 3 report SEE % for their model (8.62%), but do not report values from other models in making comparisons.

Some advocate the use of statistics derived from the predicted residual sum of squares ( PRESS ) as a means of selecting predictors. 2 , 4 , 16 These statistics are used more often in cross-validation of models and will be discussed in greater detail later.

ASSESSING STABILITY OF THE MODEL FOR PREDICTION

Once the most efficient and accurate model for prediction has been determined, it is prudent that the model be assessed for stability. A model, or equation, is said to be “stable” if it can be applied to different samples from the same population without losing the accuracy of the prediction. This is accomplished through cross-validation of the model. Cross-validation determines how well the prediction model developed using one sample performs in another sample from the same population. Several methods can be employed for cross-validation, including the use of 2 independent samples, split samples, and PRESS -related statistics developed from the same sample.

Using 2 independent samples involves random selection of 2 groups from the same population. One group becomes the “training” or “exploratory” group used for establishing the model of prediction. 5 The second group, the “confirmatory” or “validatory” group is used to assess the model for stability. The researcher compares R 2 values from the 2 groups and assessment of “shrinkage,” the difference between the two values for R 2 , is used as an indicator of model stability. There is no rule of thumb for interpreting the differences, but Kleinbaum, Kupper, and Muller 12 suggest that “shrinkage” values of less than 0.10 indicate a stable model. While preferable, the use of independent samples is rarely used due to cost considerations.

A similar technique of cross-validation uses split samples. Once the sample has been selected from the population, it is randomly divided into 2 subgroups. One subgroup becomes the “exploratory” group and the other is used as the “validatory” group. Again, values for R 2 are compared and model stability is assessed by calculating “shrinkage.”

Holiday, Ballard, and McKeown 9 advocate the use of PRESS-related statistics for cross-validation of regression models as a means of dealing with the problems of data-splitting. The PRESS method is a jackknife analysis that is used to address the issue of estimate bias associated with the use of small sample sizes. 13 In general, a jackknife analysis calculates the desired test statistic multiple times with individual cases omitted from the calculations. In the case of the PRESS method, residuals, or the differences between the actual values of the criterion for each individual and the predicted value using the formula derived with the individual's data removed from the prediction, are calculated. The PRESS statistic is the sum of the squares of the residuals derived from these calculations and is similar to the sum of squares for the error (SS error ) used in analysis of variance (ANOVA). Myers 14 discusses the use of the PRESS statistic and describes in detail how it is calculated. The reader is referred to this text and the article by Holiday, Ballard, and McKeown 9 for additional information.

Once determined, the PRESS statistic can be used to calculate a modified form of R 2 and the SEE. R 2 PRESS is calculated using the following formula: R 2 PRESS = 1 − [PRESS / SS total ], where SS total equals the sum of squares for the original regression equation. 14 The standard error of the estimate for PRESS ( SEE PRESS ) is calculated as SEE PRESS = √(PRESS / n), where n equals the number of individual cases. 14 The smaller the difference between the 2 values for R 2 and SEE , the more stable the model for prediction. Bradshaw et al 3 used this technique in their investigation. They reported a value for R 2 PRESS of .83, a decrease of .04 from R 2 for their prediction model. Using the standard set by Kleinbaum, Kupper, and Muller, 12 the model developed by these researchers would appear to have stability, meaning it could be used for prediction in samples from the same population. This is further supported by the small difference between the SEE and the SEE PRESS , 3.44 and 3.63 mL·kg −1 ·min −1 , respectively.
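For ordinary linear models there is a convenient identity that avoids refitting the model n times: each leave-one-out residual equals the ordinary residual divided by (1 − h_ii), where h_ii is the leverage. A sketch in R ('fit' again being any fitted lm object):

```r
e_loo     <- resid(fit) / (1 - hatvalues(fit))  # leave-one-out (PRESS) residuals
press     <- sum(e_loo^2)                       # PRESS statistic
y         <- model.response(model.frame(fit))
ss_total  <- sum((y - mean(y))^2)
r2_press  <- 1 - press / ss_total               # compare with ordinary R^2
see_press <- sqrt(press / nobs(fit))            # per the formula above
```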

COMPARING TWO DIFFERENT PREDICTION MODELS

A comparison of 2 different models for prediction may help to clarify the use of regression analysis in prediction. Table 1 presents data from 2 studies and will be used in the following discussion.

Table 1. Comparison of Two Non-exercise Models for Predicting CRF

| Variables | Heil et al (n = 374) | Bradshaw et al (n = 100) |
| --- | --- | --- |
| Intercept | 36.580 | 48.073 |
| Gender (male = 1, female = 0) | 3.706 | 6.178 |
| Age (years) | 0.558 | −0.246 |
| Age² | −7.81E−3 | — |
| Percent body fat | −0.541 | — |
| Body mass index (kg·m⁻²) | — | −0.619 |
| Activity code (0–7) | 1.347 | — |
| Physical activity rating (0–10) | — | 0.671 |
| Perceived functional ability | — | 0.712 |
| R (R²) | .88 (.77) | .93 (.87) |
| SEE | 4.90 mL·kg⁻¹·min⁻¹ | 3.44 mL·kg⁻¹·min⁻¹ |
| SEE% | 12.7% | 8.6% |

As noted above, the first step is to select an appropriate criterion, or outcome measure. Bradshaw et al 3 selected VO 2 max as their criterion for measuring cardiorespiratory fitness. Heil et al 8 used VO 2 peak. These 2 measures are often considered to be the same, however, VO 2 peak assumes that conditions for measuring maximum oxygen consumption were not met. 17 It would be optimal to compare models based on the same criterion, but that is not essential, especially since both criteria measure cardiorespiratory fitness in much the same way.

The second step involves selection of variables for prediction. As can be seen in Table 1, both groups of investigators selected 5 variables to use in their model. The 5 variables selected by Bradshaw et al 3 provide a better prediction based on the values for R 2 (.87 and .77), indicating that their model accounts for more variance (87% versus 77%) in the prediction than the model of Heil et al. 8 It should also be noted that the SEE calculated in the Bradshaw 3 model (3.44 mL·kg −1 ·min −1 ) is less than that reported by Heil et al 8 (4.90 mL·kg −1 ·min −1 ). Remember, however, that comparison of the SEE should only be made when both models are developed using samples from the same population. Comparing predictions developed from different populations can be accomplished using the SEE% . Review of values for the SEE% in Table 1 would seem to indicate that the model developed by Bradshaw et al 3 is more accurate because the percentage of the mean value for VO 2 max represented by error is less than that reported by Heil et al. 8 In summary, the Bradshaw 3 model would appear to be more efficient, accounting for more variance in the prediction using the same number of variables. It would also appear to be more accurate based on comparison of the SEE% .

The 2 models cannot be compared based on stability of the models. Each set of researchers used different methods for cross-validation. Both models, however, appear to be relatively stable based on the data presented. A clinician can assume that either model would perform fairly well when applied to samples from the same populations as those used by the investigators.

The purpose of this brief review has been to demystify regression analysis for prediction by explaining it in simple terms and to demonstrate its use. When reviewing research articles in which regression analysis has been used for prediction, physical therapists should ensure that the: (1) criterion chosen for the study is appropriate and meets the standards for reliability and validity, (2) processes used by the investigators to assess both model efficiency and accuracy are appropriate, (3) predictors selected for use in the model are reasonable based on theory or previous research, and (4) investigators assessed model stability through a process of cross-validation, providing the opportunity for others to utilize the prediction model in different samples drawn from the same population.

  • Open access
  • Published: 24 August 2024

Mixed effects models but not t-tests or linear regression detect progression of apathy in Parkinson’s disease over seven years in a cohort: a comparative analysis

  • Anne-Marie Hanff 1 , 2 , 3 , 4 ,
  • Rejko Krüger 1 , 2 , 5 ,
  • Christopher McCrum 4 ,
  • Christophe Ley 6 on behalf of the NCER-PD Consortium

BMC Medical Research Methodology volume  24 , Article number:  183 ( 2024 ) Cite this article


Introduction

While there is an interest in defining longitudinal change in people with chronic illness like Parkinson’s disease (PD), statistical analysis of longitudinal data is not straightforward for clinical researchers. Here, we aim to demonstrate how the choice of statistical method may influence research outcomes, (e.g., progression in apathy), specifically the size of longitudinal effect estimates, in a cohort.

In this retrospective longitudinal analysis of 802 people with typical Parkinson’s disease in the Luxembourg Parkinson's study, we compared the mean apathy scores at visit 1 and visit 8 by means of the paired two-sided t-test. Additionally, we analysed the relationship between the visit numbers and the apathy score using linear regression and longitudinal two-level mixed effects models.

Mixed effects models were the only method able to detect progression of apathy over time. While the effects estimated for the group comparison and the linear regression were smaller with high p -values (+1.016/7 years, p  = 0.107 and −0.056/7 years, p  = 0.897, respectively), effect estimates for the mixed effects models were positive with a very small p -value, indicating a significant increase in apathy symptoms by +2.345/7 years ( p  < 0.001).

The inappropriate use of paired t-tests and linear regression to analyse longitudinal data can lead to underpowered analyses and an underestimation of longitudinal change. While mixed effects models are not without limitations and need to be altered to model the time sequence between the exposure and the outcome, they are worth considering for longitudinal data analyses. In case this is not possible, limitations of the analytical approach need to be discussed and taken into account in the interpretation.


In longitudinal studies: “an outcome is repeatedly measured, i.e., the outcome variable is measured in the same subject on several occasions.” [ 1 ]. When assessing the same individuals over time, the different data points are likely to be more similar to each other than measurements taken from other individuals. Consequently, the application of special statistical techniques is required, which take into account the fact that the repeated observations of each subject are correlated [ 1 ]. Parkinson’s disease (PD) is a heterogeneous neurodegenerative disorder resulting in a wide variety of motor and non-motor symptoms including apathy, defined as a disorder of motivation, characterised by reduced goal-directed behaviour and cognitive activity and blunted affect [ 2 ]. Apathy increases over time in people with PD [ 3 ]. Specifically, apathy has been associated with the progressive denervation of ascending dopaminergic pathways in PD [ 4 , 5 ] leading to dysfunctions of circuits implicated in reward-related learning [ 5 ].

T-tests are often misused to analyse changes over time [ 6 ]. Consequently, we aim to demonstrate how the choice of statistical method may influence research outcomes, specifically the size and interpretation of longitudinal effect estimates in a cohort. Thus, the findings are intended for illustrative and educational purposes related to the statistical methodology. In a retrospective analysis of data from the Luxembourg Parkinson's study, a nation-wide, monocentric, observational, longitudinal-prospective dynamic cohort [ 7 , 8 ], we assess change in apathy using three different statistical approaches (paired t-test, linear regression, mixed effects model). We defined the following target estimand: In people diagnosed with PD, what is the change in the apathy score from visit 1 to visit 8? To estimate this change, we formulated the corresponding statistical hypotheses (null hypothesis: no change in the apathy score from visit 1 to visit 8; alternative hypothesis: a change in the apathy score).

While apathy was the dependent variable, we included the visit number as an independent variable (linear regression, mixed effects model) and as a grouping variable (paired t-test). The outcome apathy was measured by the discrete score from the Starkstein apathy scale (0 – 42, higher = worse) [ 9 ], a scale recommended by the Movement Disorders Society [ 10 ]. This data was obtained from the National Centre of Excellence in Research on Parkinson's disease (NCER-PD). The establishment of data collection standards, completion of the questionnaires at home at the participants’ convenience, mobile recruitment team for follow-up visits or standardized telephone questionnaire with a reduced assessment were part of the efforts in the primary study to address potential sources of bias [ 7 , 8 ]. Ethical approval was provided by the National Ethics Board (CNER Ref: 201,407/13). We used data from up to eight visits, which were performed annually between 2015 and 2023. Among the participants are people with typical PD and PD dementia (PDD), living mostly at home in Luxembourg and the Greater Region (geographically close areas of the surrounding countries Belgium, France, and Germany). People with atypical PD were excluded. The sample at the date of data export (2023.06.22) consisted of 802 individuals of which 269 (33.5%) were female. The average number of observations was 3.0. Fig. S1 reports the numbers of individuals at each visit while the characteristics of the participants are described in Table  1 .

As illustrated in the flow diagram (Fig.  1 ), the sample analysed from the paired t-test is highly selective: from the 802 participants at visit 1, the t-test only included 63 participants with data from visit 8. This arises from the fact that, first, we analyse the dataset from a dynamic cohort, i.e., the data at visit 1 were not collected at the same time point. Thus, 568 of the 802 participants joined the study less than eight years before, leading to only 234 participants eligible for the eighth yearly visit. Second, after excluding non-participants at visit 8 due to death ( n  = 41) and other reasons ( n  = 130), only 63 participants at visit 8 were left. To discuss the selective study population of a paired t-test, we compared the characteristics (age, education, age at diagnosis, apathy at visit 1) of the remaining 63 participants at visit 8 (included in the paired t-test) and the 127 non-participants at visit 8 (excluded from the paired t-test) [ 12 ].

Figure 1. Flow diagram of patient recruitment

The paired two-sided t-test compared the mean apathy score at visit 1 with the mean apathy score at visit 8. We attract the reader's attention to the fact that this implies a rather small sample size as it includes only those people with data from the first and 8th visit. The linear regression analysed the relationship between the visit number and the apathy score (using the “stats” package [ 13 ]), while we performed longitudinal two-level mixed effects models analysis with a random intercept on subject level, a random slope for visit number and the visit number as fixed effect (using the “lmer”-function of the “lme4”-package [ 14 ]). The latter two approaches use all available data from all visits while the paired t-test does not. We illustrated the analyses in plots with the function “plot_model” of the R package sjPlot [ 15 ]. We conducted data analysis using R version 3.6.3 [ 13 ] and the R syntax for all analyses is provided on the OSF project page ( https://doi.org/10.17605/OSF.IO/NF4YB ).
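The essence of the three analyses can be sketched as follows ('pd' is an assumed long-format data frame with one row per subject and visit; this is an illustration, not the authors' published code):

```r
library(lme4)

# Paired t-test: only subjects observed at both visit 1 and visit 8
wide <- reshape(pd[pd$visit %in% c(1, 8), c("subject", "visit", "apathy")],
                idvar = "subject", timevar = "visit", direction = "wide")
t.test(wide$apathy.8, wide$apathy.1, paired = TRUE)

# Linear regression: uses all visits but ignores within-subject correlation
lm(apathy ~ visit, data = pd)

# Mixed effects model: random intercept and random slope per subject
lmer(apathy ~ visit + (visit | subject), data = pd)
```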

Panel A in Fig.  2 illustrates the means and standard deviations of apathy for all participants at each visit, while the flow-chart (Fig. S1 ) illustrates the number of participants at each stage. On average, we see lower apathy scores at visit 8 compared to visit 1 (higher score = worse). By definition, the paired t-test analyses pairs, and in this case, only participants with complete apathy scores at visit 1 and visit 8 are included, reducing the total analysed sample to 63 pairs of observations. Consequently, the t-test compares mean apathy scores in a subgroup of participants with data at both visits leading to different observations from Panel A, as illustrated and described in Panel B: the apathy score has increased at visit 8, hence symptoms of apathy have worsened. The outcome of the t-test along with the code is given in Table  2 . Interestingly, the effect estimates for the increase in apathy were not statistically significant (+ 1.016 points, 95%CI: -0.225, 2.257, p  = 0.107). A possible reason for this non-significance is a loss of statistical power due to a small sample size included in the paired t-test. To visualise the loss of information between visit 1 and visit 8, we illustrated the complex individual trajectories of the participants in Fig.  3 . Moreover, as described in Table S1 in the supplement, the participants at visit 8 (63/190) analysed in the t-test were inherently significantly different compared to the non-participants at visit 8 (127/190): they were younger, had better education, and most importantly their apathy scores at visit 1 were lower. Consequently, those with the better overall situation kept coming back while this was not the case for those with a worse outcome at visit 1, which explains the observed (non-significant) increase. This may result in a biased estimation of change in apathy when analysed by the compared statistical methods.

Figure 2. Bar charts illustrating apathy scores (means and standard deviations) per visit (Panel A: all participants; Panel B: subgroup analysed in the t-test). The red line indicates the mean apathy at visit 1.

Figure 3. Scatterplot illustrating the individual trajectories. The red line indicates the regression line.

From the results in Table  2 , we see that the linear regression coefficient, representing change in apathy symptoms per year, is not significantly different from zero, indicating no change over time. One possible explanation is the violation of the assumption of independent observations for linear regressions. On the contrary, the effect estimates for the linear mixed effects models indicated a significant increase in apathy symptoms from visit 1 to visit 8 by + 2.680 points (95%CI: 1.880, 3.472, p  < 0.001). Consequently, mixed effects models were the only method able to detect an increase in apathy symptoms over time and choosing mixed effect models for the analysis of longitudinal data reduces the risk of false negative results. The differences in the effect sizes are also reflected in the regression lines in Panel A and B of Fig.  4 .

Figure 4. Scatterplot illustrating the relationship between visit number and apathy. Apathy is measured on a whole-number interval scale; jitter is applied on the x- and y-axes to illustrate the data points (Panel A: linear regression, Panel B: linear mixed effects model). The red line indicates the regression line.

The effect sizes differed depending on the choice of the statistical method. Thus, the paired t-test and the linear regression resulted in an output that would lead to different interpretations than the mixed effects models. More specifically, compared to the t-test and linear regression (which indicated non-significant changes in apathy of only +1.016 and −0.064 points from visit 1 to visit 8, respectively), the linear mixed effects models found an increase of +2.680 points from visit 1 to visit 8 on the apathy scale. This increase is more than twice as high as indicated by the t-test and suggests that linear mixed models are a more sensitive approach for detecting meaningful changes perceived by people with PD over time.

Mixed effects models are a valuable tool in longitudinal data analysis as these models expand upon linear regression models by considering the correlation among repeated measurements within the same individuals through the estimation of a random intercept [ 1 , 16 , 17 ]. Specifically, to account for correlation between observations, linear mixed effects models use random effects to explicitly model the correlation structure, thus removing correlation from the error term. A random slope in addition to a random intercept allows both the rate of change and the mean value to vary by participant, capturing individual differences. This distinguishes them from group comparisons or standard linear regressions, in which such explicit modelling of correlation is not possible. Thus, the linear regression not considering correlation among the repeated observations leads to an underestimation of longitudinal change, explaining the smaller effect sizes and insignificant results of the regression. By including random effects, linear mixed effects models can better capture the variability within the data.

Another common challenge in longitudinal studies is missing data. Compared to the paired t-test and regression, the mixed effects models can also include participants with missing data at single visits and account for the individual trajectories of each participant as illustrated in Fig.  2 [ 18 ]. Although multiple imputation could increase the sample size, those results need to be interpreted with caution in case the data is not missing at random [ 18 , 19 ]. Note that we do not further elaborate here on this topic since this is a separate issue to statistical method comparison. Finally, assumptions of the different statistical methods need to be respected. The paired t-test assumes a normal distribution, homogeneity of variance and pairs of the same individuals in both groups [ 20 , 21 ]. While mixed effects models don't rely on independent observations as is the case for linear regression, all other assumptions for standard linear regression analysis (e.g., linearity, homoscedasticity, no multicollinearity) also hold for mixed effects model analyses. Thus, additional steps, e.g., checks for linearity of the relationships or data transformations, are required before the analysis of clinical research questions [ 17 ].

While mixed effects models are not without limitations and need to be altered to model the time sequence between the exposure and the outcome [ 1 ], they are worth considering for longitudinal data analyses. Thus, assuming an increase of apathy over time [ 3 ], mixed effects models were the only method able to detect statistically significant changes in the defined estimand, i.e., the change in apathy from visit 1 to visit 8. Possible reasons are a loss of statistical power due to the small sample size included in the paired t-test and the violation of the assumption of independent observations for linear regressions. Specifically, the effects estimated for the group comparison and the linear regression were smaller with high p -values, indicating a statistically insignificant change in apathy over time. The effect estimates for the mixed effects models were positive with a very small p -value, indicating a statistically significant increase in apathy symptoms from visit 1 to visit 8 in line with clinical expectations. Mixed effects models can be used to estimate different types of longitudinal effects, while an inappropriate use of paired t-tests and linear regression to analyse longitudinal data can lead to underpowered analyses and an underestimation of longitudinal change and thus clinical significance. Therefore, researchers should more often consider mixed effects models for longitudinal analyses. In case this is not possible, limitations of the analytical approach need to be discussed and taken into account in the interpretation.

Availability of data and materials

The LUXPARK database used in this study was obtained from the National Centre of Excellence in Research on Parkinson's disease (NCER-PD). The NCER-PD database is not publicly available as it is linked to the Luxembourg Parkinson's study and its internal regulations. The NCER-PD Consortium is willing to share its available data. Its access policy was devised based on the study ethics documents, including the informed consent form approved by the national ethics committee. Requests for access to datasets should be directed to the Data and Sample Access Committee by email at [email protected].

The code is available on OSF ( https://doi.org/10.17605/OSF.IO/NF4YB )

Abbreviations

PD: Parkinson's disease

H0: Null hypothesis

H1: Alternative hypothesis

PDD: Parkinson's disease dementia

NCER-PD: National Centre of Excellence in Research on Parkinson's disease

OSF: Open Science Framework

CI: Confidence Interval

1. Twisk JWR. Applied Longitudinal Data Analysis for Epidemiology: A Practical Guide. Cambridge: Cambridge University Press; 2013.

2. Levy R, Dubois B. Apathy and the functional anatomy of the prefrontal cortex-basal ganglia circuits. Cereb Cortex. 2006;16(7):916–28.

3. Poewe W, Seppi K, Tanner CM, Halliday GM, Brundin P, Volkmann J, et al. Parkinson disease. Nat Rev Dis Primers. 2017;3:17013.

4. Pagonabarraga J, Kulisevsky J, Strafella AP, Krack P. Apathy in Parkinson's disease: clinical features, neural substrates, diagnosis, and treatment. Lancet Neurol. 2015;14(5):518–31.

5. Drui G, Carnicella S, Carcenac C, Favier M, Bertrand A, Boulet S, Savasta M. Loss of dopaminergic nigrostriatal neurons accounts for the motivational and affective deficits in Parkinson's disease. Mol Psychiatry. 2014;19(3):358–67.

6. Liang G, Fu W, Wang K. Analysis of t-test misuses and SPSS operations in medical research papers. Burns Trauma. 2019;7:31.

7. Hipp G, Vaillant M, Diederich NJ, Roomp K, Satagopam VP, Banda P, et al. The Luxembourg Parkinson's Study: a comprehensive approach for stratification and early diagnosis. Front Aging Neurosci. 2018;10:326.

8. Pavelka L, Rawal R, Ghosh S, Pauly C, Pauly L, Hanff A-M, et al. Luxembourg Parkinson's study - comprehensive baseline analysis of Parkinson's disease and atypical parkinsonism. Front Neurol. 2023;14:1330321.

9. Starkstein SE, Mayberg HS, Preziosi TJ, Andrezejewski P, Leiguarda R, Robinson RG. Reliability, validity, and clinical correlates of apathy in Parkinson's disease. J Neuropsychiatry Clin Neurosci. 1992;4(2):134–9.

10. Leentjens AF, Dujardin K, Marsh L, Martinez-Martin P, Richard IH, Starkstein SE, et al. Apathy and anhedonia rating scales in Parkinson's disease: critique and recommendations. Mov Disord. 2008;23(14):2004–14.

11. Goetz CG, Tilley BC, Shaftman SR, Stebbins GT, Fahn S, Martinez-Martin P, et al. Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): scale presentation and clinimetric testing results. Mov Disord. 2008;23(15):2129–70.

12. Little RJA. A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 1988;83(404):1198–202.

13. R Core Team. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2023. Available from: https://www.R-project.org/ .

14. Bates D, Maechler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. J Stat Softw. 2015;67:1–48.

15. Lüdecke D. sjPlot: Data Visualization for Statistics in Social Science. 2022. R package version 2.8.11. Available from: https://CRAN.R-project.org/package=sjPlot .

16. Twisk JWR. Applied Multilevel Analysis: A Practical Guide for Medical Researchers. Cambridge: Cambridge University Press; 2006.

17. Twisk JWR. Applied Mixed Model Analysis: A Practical Guide. Cambridge: Cambridge University Press; 2019.

18. Long JD. Longitudinal Data Analysis for the Behavioral Sciences Using R. United States of America: SAGE; 2012.

19. Twisk JWR, de Boer M, de Vente W, Heymans M. Multiple imputation of missing values was not necessary before performing a longitudinal mixed-model analysis. J Clin Epidemiol. 2013;66(9):1022–8.

20. Student. The probable error of a mean. Biometrika. 1908;6(1):1–25.

21. Polit DF. Statistics and Data Analysis for Nursing Research. England: Pearson; 2014.


Acknowledgements

We would like to thank all participants of the Luxembourg Parkinson’s Study for their important support of our research. Furthermore, we acknowledge the joint effort of the National Centre of Excellence in Research on Parkinson’s Disease (NCER-PD) Consortium members from the partner institutions Luxembourg Centre for Systems Biomedicine, Luxembourg Institute of Health, Centre Hospitalier de Luxembourg, and Laboratoire National de Santé generally contributing to the Luxembourg Parkinson’s Study as listed below:

Geeta ACHARYA 2, Gloria AGUAYO 2, Myriam ALEXANDRE 2, Muhammad ALI 1, Wim AMMERLANN 2, Giuseppe ARENA 1, Michele BASSIS 1, Roxane BATUTU 3, Katy BEAUMONT 2, Sibylle BÉCHET 3, Guy BERCHEM 3, Alexandre BISDORFF 5, Ibrahim BOUSSAAD 1, David BOUVIER 4, Lorieza CASTILLO 2, Gessica CONTESOTTO 2, Nancy DE BREMAEKER 3, Brian DEWITT 2, Nico DIEDERICH 3, Rene DONDELINGER 5, Nancy E. RAMIA 1, Angelo Ferrari 2, Katrin FRAUENKNECHT 4, Joëlle FRITZ 2, Carlos GAMIO 2, Manon GANTENBEIN 2, Piotr GAWRON 1, Laura Georges 2, Soumyabrata GHOSH 1, Marijus GIRAITIS 2,3, Enrico GLAAB 1, Martine GOERGEN 3, Elisa GÓMEZ DE LOPE 1, Jérôme GRAAS 2, Mariella GRAZIANO 7, Valentin GROUES 1, Anne GRÜNEWALD 1, Gaël HAMMOT 2, Anne-Marie HANFF 2, 10, 11, Linda HANSEN 3, Michael HENEKA 1, Estelle HENRY 2, Margaux Henry 2, Sylvia HERBRINK 3, Sascha HERZINGER 1, Alexander HUNDT 2, Nadine JACOBY 8, Sonja JÓNSDÓTTIR 2,3, Jochen KLUCKEN 1,2,3, Olga KOFANOVA 2, Rejko KRÜGER 1,2,3, Pauline LAMBERT 2, Zied LANDOULSI 1, Roseline LENTZ 6, Laura LONGHINO 3, Ana Festas Lopes 2, Victoria LORENTZ 2, Tainá M. MARQUES 2, Guilherme MARQUES 2, Patricia MARTINS CONDE 1, Patrick MAY 1, Deborah MCINTYRE 2, Chouaib MEDIOUNI 2, Francoise MEISCH 1, Alexia MENDIBIDE 2, Myriam MENSTER 2, Maura MINELLI 2, Michel MITTELBRONN 1, 2, 4, 10, 12, 13, Saïda MTIMET 2, Maeva Munsch 2, Romain NATI 3, Ulf NEHRBASS 2, Sarah NICKELS 1, Beatrice NICOLAI 3, Jean-Paul NICOLAY 9, Fozia NOOR 2, Clarissa P. C. GOMES 1, Sinthuja PACHCHEK 1, Claire PAULY 2,3, Laure PAULY 2, 10, Lukas PAVELKA 2,3, Magali PERQUIN 2, Achilleas PEXARAS 2, Armin RAUSCHENBERGER 1, Rajesh RAWAL 1, Dheeraj REDDY BOBBILI 1, Lucie REMARK 2, Ilsé Richard 2, Olivia ROLAND 2, Kirsten ROOMP 1, Eduardo ROSALES 2, Stefano SAPIENZA 1, Venkata SATAGOPAM 1, Sabine SCHMITZ 1, Reinhard SCHNEIDER 1, Jens SCHWAMBORN 1, Raquel SEVERINO 2, Amir SHARIFY 2, Ruxandra SOARE 1, Ekaterina SOBOLEVA 1,3, Kate SOKOLOWSKA 2, Maud Theresine 2, Hermann THIEN 2, Elodie THIRY 3, Rebecca TING JIIN LOO 1, Johanna TROUET 2, Olena TSURKALENKO 2, Michel VAILLANT 2, Carlos VEGA 2, Liliana VILAS BOAS 3, Paul WILMES 1, Evi WOLLSCHEID-LENGELING 1, Gelani ZELIMKHANOV 2,3

1 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg

2 Luxembourg Institute of Health, Strassen, Luxembourg

3 Centre Hospitalier de Luxembourg, Strassen, Luxembourg

4 Laboratoire National de Santé, Dudelange, Luxembourg

5 Centre Hospitalier Emile Mayrisch, Esch-sur-Alzette, Luxembourg

6 Parkinson Luxembourg Association, Leudelange, Luxembourg

7 Association of Physiotherapists in Parkinson's Disease Europe, Esch-sur-Alzette, Luxembourg

8 Private practice, Ettelbruck, Luxembourg

9 Private practice, Luxembourg, Luxembourg

10 Faculty of Science, Technology and Medicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg

11 Department of Epidemiology, CAPHRI School for Public Health and Primary Care, Maastricht University Medical Centre+, Maastricht, the Netherlands

12 Luxembourg Center of Neuropathology, Dudelange, Luxembourg

13 Department of Life Sciences and Medicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg

Funding

This work was supported by grants from the Luxembourg National Research Fund (FNR) within the National Centre of Excellence in Research on Parkinson's disease [NCERPD(FNR/NCER13/BM/11264123)]. The funding body played no role in the design of the study and collection, analysis, interpretation of data, and in writing the manuscript.

Author information

Authors and Affiliations

Transversal Translational Medicine, Luxembourg Institute of Health, Strassen, Luxembourg

Anne-Marie Hanff & Rejko Krüger

Translational Neurosciences, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-Sur-Alzette, Luxembourg

Department of Epidemiology, CAPHRI Care and Public Health Research Institute, Maastricht University Medical Centre+, Maastricht, The Netherlands

Anne-Marie Hanff

Department of Nutrition and Movement Sciences, NUTRIM School of Nutrition and Translational Research in Metabolism, Maastricht University Medical Centre+, Maastricht, The Netherlands

Anne-Marie Hanff & Christopher McCrum

Parkinson Research Clinic, Centre Hospitalier du Luxembourg, Luxembourg, Luxembourg

Rejko Krüger

Department of Mathematics, University of Luxembourg, Esch-Sur-Alzette, Luxembourg

Christophe Ley



Contributions

A-MH: Conceptualization, Methodology, Formal analysis, Investigation, Visualization, Project administration, Writing – original draft, Writing – review & editing. RK: Conceptualization, Methodology, Funding, Resources, Supervision, Project administration, Writing – review & editing. CMC: Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review & editing. CL: Conceptualization, Methodology, Writing – original draft, Writing – review & editing.

Corresponding author

Correspondence to Anne-Marie Hanff .

Ethics declarations

Ethics approval and consent to participate.

The study involved human participants and was reviewed and approved by the National Ethics Board, Comité National d’Ethique de Recherche (CNER Ref: 201407/13). The study was performed in accordance with the Declaration of Helsinki, and patients/participants provided their written informed consent to participate. We confirm that we have read the Journal’s position on issues involved in ethical publication and affirm that this work is consistent with those guidelines.

Consent for publication

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Hanff, AM., Krüger, R., McCrum, C. et al. Mixed effects models but not t-tests or linear regression detect progression of apathy in Parkinson’s disease over seven years in a cohort: a comparative analysis. BMC Med Res Methodol 24 , 183 (2024). https://doi.org/10.1186/s12874-024-02301-7


Received : 21 March 2024

Accepted : 01 August 2024

Published : 24 August 2024

DOI : https://doi.org/10.1186/s12874-024-02301-7


  • Cohort studies
  • Epidemiology
  • Disease progression
  • Lost to follow-up
  • Statistical model


Investigations on machine learning, deep learning, and longitudinal regression methods for global greenhouse gases predictions

  • Original Paper
  • Published: 30 August 2024


  • S. D. Yazd 1 ,
  • N. Gharib 1 &
  • J. F. Derakhshandeh   ORCID: orcid.org/0000-0002-6812-9148 1  

Combating climate change is one of the key challenges our community currently faces. Greenhouse gas emissions have been rising gradually for decades, and researchers have attempted to find a lasting solution to this challenge. In this paper, different machine learning and deep learning models are applied to evaluate their effectiveness and accuracy in predicting greenhouse gas emissions. To increase the accuracy of the assessment, data for 101 countries over a period of 31 years (1991–2021), drawn from official World Bank sources, are considered. A range of metrics is analyzed for each model, including Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, p value, and correlation coefficient. The results demonstrate that the machine learning models typically outperform the deep learning models, with the support vector regression polynomial model performing best. In addition, the statistical findings of the longitudinal regression analysis reveal that increasing cereal yield and permanent cropland area significantly increases greenhouse gas emissions (p value = 0.000 and p value = 0.06, respectively), whereas increasing renewable energy consumption and forest area leads to decreasing greenhouse gas emissions (p value = 0.000 and p value = 0.07, respectively).
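Purely as an illustrative sketch (not the authors' code), the evaluation loop the abstract describes might look like the following: a polynomial-kernel support vector regression fitted on synthetic data and scored with MSE, RMSE, MAE, and the correlation coefficient. All data, feature names, and hyperparameters below are hypothetical placeholders.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.stats import pearsonr

# Synthetic stand-in for the panel data (features could be, e.g., cereal yield,
# cropland area, renewable energy share, forest area; target: emissions).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)

# Polynomial-kernel support vector regression, the model family the abstract
# reports as performing best; degree and C are arbitrary here.
model = SVR(kernel="poly", degree=3, C=10.0)
model.fit(scaler.transform(X_train), y_train)
pred = model.predict(scaler.transform(X_test))

mse = mean_squared_error(y_test, pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_test, pred))
r, p_value = pearsonr(y_test, pred)   # correlation coefficient and its p value
print("r   :", r, " p:", p_value)
```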




Funding

There is no funding source.

Author information

Authors and Affiliations

College of Engineering and Technology, American University of the Middle East, Egaila, 54200, Kuwait

S. D. Yazd, N. Gharib & J. F. Derakhshandeh


Contributions

Sahar Yazd: Methodology, Simulations and Data collection, Validation, Preparing figures, and Writing and Editing Original Manuscript. Nima Gharib: Methodology, Simulations and Data collection, Validation, Preparing figures and Writing and Editing Original Manuscript. Javad Farrokhi Derakhshandeh: Methodology, Validation, Resources, Writing and Editing Original Manuscript.

Corresponding author

Correspondence to J. F. Derakhshandeh .

Ethics declarations

Conflict of interest.

The authors declare that they have no conflict of interest.

Additional information

Editorial responsibility: Samareh Mirkia.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Yazd, S.D., Gharib, N. & Derakhshandeh, J.F. Investigations on machine learning, deep learning, and longitudinal regression methods for global greenhouse gases predictions. Int. J. Environ. Sci. Technol. (2024). https://doi.org/10.1007/s13762-024-06014-8


Received : 26 July 2023

Revised : 26 June 2024

Accepted : 19 August 2024

Published : 30 August 2024

DOI : https://doi.org/10.1007/s13762-024-06014-8


  • Artificial intelligence
  • Deep learning
  • Greenhouse gases
  • Open access
  • Published: 31 August 2024

Incidence of post-extubation dysphagia among critical care patients undergoing orotracheal intubation: a systematic review and meta-analysis

  • Weixia Yu 1 ,
  • Limi Dan 1 ,
  • Jianzheng Cai 1 ,
  • Yuyu Wang 1 ,
  • Qingling Wang 1 ,
  • Yingying Zhang 1 &
  • Xin Wang 1

European Journal of Medical Research volume 29, Article number: 444 (2024)

Post-extubation dysphagia (PED) emerges as a frequent complication following endotracheal intubation within the intensive care unit (ICU). PED has been strongly linked to adverse outcomes, including aspiration, pneumonia, malnutrition, heightened mortality rates, and prolonged hospitalization, resulting in escalated healthcare expenditures. Nevertheless, the reported incidence of PED varies substantially across the existing body of literature. Therefore, the principal objective of this review was to provide a comprehensive estimate of PED incidence in ICU patients undergoing orotracheal intubation.

We searched Embase, PubMed, Web of Science, Cochrane Library, China National Knowledge Infrastructure (CNKI), Wanfang Database, China Science and Technology Journal Database (VIP), and SinoMed from inception to August 2023. Two reviewers independently screened studies and extracted data. Subsequently, a random-effects model was employed for meta-analysis using the "metaprop" command in Stata SE version 15.0 to ascertain the incidence of PED. In addition, we performed subgroup analyses and meta-regression to elucidate potential sources of heterogeneity among the included studies.

Of 4144 studies, 30 were included in this review. The overall pooled incidence of PED was 36% (95% confidence interval [CI] 29–44%). Subgroup analyses revealed that the pooled incidence of PED, stratified by assessment time (≤ 3 h, 4–6 h, ≤ 24 h, and ≤ 48 h), was 31.0% (95% CI 8.0–59.0%), 28% (95% CI 22.0–35.0%), 41% (95% CI 33.0–49.0%), and 49.0% (95% CI 34.0–63.0%), respectively. When the sample size was 100 < N ≤ 300, the PED incidence was closer to the overall PED incidence. Meta-regression analysis highlighted that sample size, assessment time, and mean intubation time constituted the sources of heterogeneity among the included studies.

The incidence of PED was high among ICU patients who underwent orotracheal intubation. ICU professionals should raise awareness about PED. In the meantime, it is important to develop guidelines or consensus on the most appropriate PED assessment time and assessment tools to accurately assess the incidence of PED.


Introduction

Mechanical ventilation is the most common form of technological support in the ICU, required by 20–40% of adult patients [ 1 ]. Orotracheal intubation is the primary route for mechanical ventilation in the ICU and can increase the risk of post-extubation dysphagia (PED) [ 2 , 3 ]. PED is any form of swallowing dysfunction that arises after extubation following endotracheal intubation, affecting the passage of food from the mouth to the stomach. The occurrence rate of PED within the ICU setting varies considerably among countries [ 4 ]: 13.3–61.8% in the United States [ 5 , 6 ], 25.3–43.5% in France, and 23.2–56% in China [ 7 , 8 ], with overall reports ranging from 7 to 80% [ 9 , 10 ]. PED stands out as a prominent complication in this context. For instance, See et al. have shown that patients with PED face an 11-fold higher risk of aspiration compared to those without PED [ 11 ]. McIntyre et al. have underscored that patients with PED endure double the ICU length of stay and overall hospitalization period compared to patients without PED [ 10 ]. Furthermore, PED has emerged as an independent predictor of 28-day and 90-day mortality [ 12 ]. This high incidence of PED places an immense burden not only on patients but also on the broader healthcare system. Therefore, a systematic review and meta-analysis is necessary to explore the incidence of PED in ICU patients. A systematic review and meta-analysis conducted by McIntyre et al. reported a PED incidence of 41%, but the main outcome of some of their included studies was aspiration [ 12 ]. Although aspiration and PED are closely related, not all aspiration is caused by dysphagia; the reported incidence of aspiration in the ICU ranges from 8.80% to 88.00% [ 13 , 14 ], so the incidence of PED in that study may have been overestimated. Moreover, the literature on PED in ICU patients has been growing, and a new systematic review and meta-analysis is needed to obtain a more precise estimate of its incidence.

The incidence of PED may indeed vary depending on various covariates, including assessment time, mean intubation time, age, and other relevant factors. First, there is no standard time for swallowing function assessment, which spans a range of intervals, including 3 h [ 6 , 9 , 12 ], 4–6 h [ 15 , 16 ], 24 h [ 17 , 18 , 19 ], 48 h [ 20 ], 7 days [ 21 ], and discharge [ 22 ], with corresponding PED incidences of 80% [ 9 ], 22.62% [ 15 ], 56.06% [ 18 ], 35.91% [ 20 ], 22.06% [ 21 ], and 28.78% [ 22 ], respectively. Second, PED is closely tied to the duration of orotracheal intubation. Skoretz et al. have demonstrated that the overall incidence of PED in the ICU ranges from 3 to 4%; however, upon re-analysis of patients subjected to orotracheal intubation for more than 48 h, the PED incidence surged as high as 51% [ 23 ]. Third, the choice of assessment tool used to evaluate PED in ICU patients plays a pivotal role. These tools include the Video-fluoroscopic Swallowing Study (VFSS), Fiberoptic Endoscopic Evaluation of Swallowing (FEES), Standardized Swallowing Assessment (SSA), Bedside Swallowing Evaluation (BSE), Gugging Swallowing Screen (GUSS), Post-Extubation Dysphagia Screening Tool (PEDS), Water Swallowing Test (WST), and others. FEES and VFSS are considered the gold standards, with a detection rate of approximately 80% [ 9 ]; SSA and BSE exhibit detection rates of 22% and 62%, respectively [ 5 , 15 ]. Finally, age-related changes in laryngeal sensory and motor functions also influence PED risk [ 24 ]. Notably, there may be no significant difference in the incidence of PED between elderly and young patients within the initial 48 h post-extubation, but elderly patients exhibit a significantly slower rate of PED recovery than their younger counterparts over time (5.0 days vs 3.0 days; p = 0.006) [ 5 ]. Therefore, it is necessary to explore such covariates as potential sources of heterogeneity in the incidence of PED in ICU patients.

The purpose of this study was to estimate the incidence of PED among ICU patients who underwent orotracheal intubation and investigate potential sources of heterogeneity through the application of subgroup analyses and meta-regression.

This systematic review and meta-analysis was conducted adhering to the guidelines outlined in the Joanna Briggs Institute (JBI) Reviewers’ Manual and followed the principles of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses 2020 statement (PRISMA 2020) [ 25 ] (see Additional file 1: Table S1). In addition, it was registered with PROSPERO under the registration number CRD42022373300.

Eligibility criteria

The study’s eligibility criteria were established in accordance with the PICOS principle. Inclusion criteria were as follows: population (P): adult patients (≥ 18 years old) admitted to the ICU who underwent orotracheal intubation; exposure (E): undergoing orotracheal intubation; outcome (O): PED; study design (S): observational study (cohort, case–control, or cross-sectional study). In studies where multiple articles were derived from the same sample, only the article providing the most detailed data was included. Patients at high risk of dysphagia (such as those with head and neck cancer, prior head and neck surgery, palliative care, esophageal dysfunction, stroke, esophageal cancer, or Parkinson’s disease) were excluded. Studies were excluded if they exhibited incomplete original data or data that could not be extracted. Studies were also excluded if their sample sizes fell below 30 participants or the full text was inaccessible.

Data sources and search strategy

We comprehensively searched multiple databases, including Embase, PubMed, Web of Science, Cochrane Library, China National Knowledge Infrastructure (CNKI), Wanfang, China Science and Technology Journal Database (VIP), and SinoMed, from inception to August 18, 2023. Searches were limited to Chinese- and English-language publications. The limited number of studies retrieved initially, primarily attributable to the inclusion of the qualifier “ICU” in the initial search, prompted us to broaden the scope of our literature search; consequently, we refined the search strategy by reducing the emphasis on “ICU”. After a series of preliminary searches, we finalized a strategy that combined subject headings and free-text terms, employing Boolean operators to enhance search precision. In addition, a manual hand-search of the reference lists of selected articles was carried out to identify any supplementary studies not originally identified through the electronic search. For a detailed presentation of our complete search strategies across all databases, please refer to Additional file 1: Table S2.
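Purely as an illustration of combining subject headings, free-text terms, and Boolean operators (the review's actual registered strategies are the ones in Additional file 1: Table S2), a PubMed-style query might be assembled as follows:

```python
# Illustrative only: a combined MeSH + free-text query of the kind described.
# The exact terms below are assumptions, not the review's registered strategy.
query = (
    '("deglutition disorders"[MeSH Terms] OR dysphagia[tiab] '
    'OR "swallowing disorder*"[tiab]) '
    'AND ("airway extubation"[MeSH Terms] OR extubation[tiab] '
    'OR "post-extubation"[tiab])'
)
print(query)
```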

Quality evaluation

The risk of bias within the included studies was evaluated by two trained investigators. Cross-sectional studies were evaluated with the Agency for Healthcare Research and Quality (AHRQ) tool [ 26 ], which consists of 11 items, giving a maximum score of 11; scores of 0–3, 4–7, and 8–11 correspond to studies of poor, moderate, and high quality, respectively. Cohort studies were evaluated with the Newcastle–Ottawa Scale (NOS) [ 27 ], which comprises three dimensions and eight items, allowing a star rating from 2 to 9 stars; here, 0–4, 5–6, and 7–9 stars indicate studies of poor, moderate, and high quality, respectively. Any discrepancies between the investigators were resolved through discussion or, when necessary, consultation with a third expert specializing in evidence-based practice methodology.
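To make the cut-offs explicit, a small helper (an illustration, not part of the review's methodology) could encode the quality bands stated above:

```python
def quality_band(score: int, tool: str) -> str:
    """Map a numeric quality score to poor/moderate/high, per the stated cut-offs."""
    if tool == "AHRQ":          # cross-sectional studies, 11 items (0-11 points)
        cuts = [(3, "poor"), (7, "moderate"), (11, "high")]
    elif tool == "NOS":         # cohort studies, rated in stars
        cuts = [(4, "poor"), (6, "moderate"), (9, "high")]
    else:
        raise ValueError(f"unknown tool: {tool}")
    for upper, band in cuts:
        if score <= upper:
            return band
    raise ValueError("score out of range")

print(quality_band(8, "AHRQ"))  # -> "high"
print(quality_band(5, "NOS"))   # -> "moderate"
```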

Study selection and data extraction

Bibliographic records were systematically exported into the NoteExpress database to facilitate the screening process and the removal of duplicate citations. Initial screening, based on titles and abstracts, was conducted by two reviewers who possessed specialized training in evidence-based knowledge. To ascertain whether the studies satisfied the predefined inclusion and exclusion criteria, the full texts of potentially relevant articles were acquired. In the event of disagreements between the two reviewers, resolution was achieved through discussion or, when necessary, by enlisting the input of a third reviewer for arbitration.

After confirming the included studies, the two authors independently extracted data from each paper, including the first author, year of publication, country, study design, ICU type, mean patient age, mean intubation time, assessment time, assessment tool, evaluator, sample size, and PED events. Any disparities during data extraction were addressed through thorough discussion and consensus-building among the reviewers.

The outcomes of this review were as follows: (1) incidence of PED in patients with orotracheal intubation in the ICU; (2) sources of heterogeneity of PED in patients with orotracheal intubation in ICU.

Statistical analyses

Meta-analysis was conducted using the ‘metaprop’ command in Stata/SE (version 15.0, StataCorp, TX, USA). To approximate a normal distribution of the data, incidence estimates were transformed using the Freeman-Tukey double arcsine transformation. Heterogeneity was assessed using the I² statistic; pooled analyses of PED were executed with a random-effects model in the presence of significant heterogeneity (I² ≥ 50%), and fixed-effects models were utilized when heterogeneity was non-significant. A significance level of p < 0.05 was established for all analyses.
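As a rough illustration of this pooling approach (a simplified Python sketch, not the authors' Stata workflow; the study counts below are invented), the Freeman-Tukey transform, the I² statistic, and a DerSimonian-Laird random-effects pool can be computed as follows:

```python
import numpy as np

events = np.array([30, 68, 12, 98])    # hypothetical PED events per study
n = np.array([100, 150, 40, 200])      # hypothetical sample sizes

# Freeman-Tukey double arcsine transform of each incidence, with its
# approximate variance 1/(4n + 2).
t = 0.5 * (np.arcsin(np.sqrt(events / (n + 1))) +
           np.arcsin(np.sqrt((events + 1) / (n + 1))))
v = 1.0 / (4.0 * n + 2.0)

# Fixed-effect pool, Cochran's Q, and DerSimonian-Laird tau^2.
w = 1.0 / v
t_fixed = np.sum(w * t) / np.sum(w)
Q = np.sum(w * (t - t_fixed) ** 2)
k = len(t)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
I2 = max(0.0, (Q - (k - 1)) / Q) * 100   # heterogeneity, as a percentage

# Random-effects pool (used when I^2 >= 50%, per the text).
w_star = 1.0 / (v + tau2)
t_pooled = np.sum(w_star * t) / np.sum(w_star)

# Simple sin^2 back-transform; the exact Freeman-Tukey inversion additionally
# uses the harmonic mean of the sample sizes.
pooled_incidence = np.sin(t_pooled) ** 2
print(f"I^2 = {I2:.1f}%, pooled incidence ~ {pooled_incidence:.3f}")
```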

Subgroup analyses were undertaken to investigate the potential impact of various factors on the pooled estimate, including assessment tool (gold standard, SSA, GUSS, BSE, PEDS, WST, and other assessment tools), year of publication (2000–2010, 2011–2015, 2016–2020, 2021–2023), study design (cross-sectional study and cohort study), study quality (moderate quality and high quality), assessment time (≤ 3 h, 4–6 h, ≤ 24 h, ≤ 48 h, and after 48 h post-extubation), mean intubation time (≤ 24 h, 48–168 h, and > 168 h), mean patient age (≤ 44 years, 45–59 years, 60–74 years), evaluator (nurses, speech-language pathologists), ICU type (trauma ICU, cardiac surgery ICU, mixed medical and surgical ICU), and sample size (N ≤ 100, 100 < N ≤ 200, 200 < N ≤ 300, N > 300). In instances where no source of heterogeneity was identified in the subgroup analyses, we conducted meta-regression to further pinpoint the origins of heterogeneity, focusing on assessment time, mean intubation time, mean age, assessment tool, sample size, evaluator, ICU type, study design, study quality, and year of publication. Sensitivity analysis by the “leave-one-out” method (sketched below) was employed to evaluate the stability of the pooled incidence of PED under the random-effects model. Publication bias was assessed by funnel plot and the trim-and-fill method.
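A leave-one-out sensitivity analysis can be sketched in a few lines; the incidences, sample sizes, and the simplified inverse-variance pool below are illustrative assumptions, not the review's data:

```python
import numpy as np

incidence = np.array([0.30, 0.45, 0.22, 0.49, 0.36])  # hypothetical per-study incidences
n = np.array([100, 150, 40, 200, 120])                 # hypothetical sample sizes

def pooled(p, n):
    """Inverse-variance weighted pool on the raw proportion scale (simplified)."""
    v = p * (1 - p) / n
    w = 1 / v
    return np.sum(w * p) / np.sum(w)

# Re-pool k times, omitting one study each time, and check how far each
# re-pooled estimate moves from the full pooled value.
full = pooled(incidence, n)
for i in range(len(incidence)):
    mask = np.arange(len(incidence)) != i
    loo = pooled(incidence[mask], n[mask])
    print(f"omit study {i}: pooled = {loo:.3f} (shift {loo - full:+.3f})")
```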

Certainty of the evidence

The level of evidence was assessed using the Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) [ 28 ]. This tool classifies the certainty of evidence into four levels: very low, low, moderate, and high. “High quality” suggests that the actual effect is approximate to the estimate of the effect. On the other hand, “Very low quality” suggests that there is very little confidence in the effect estimate and the reported estimate may be substantially different from what was measured. Two reviewers judged the following aspects: risk of bias, inconsistency, imprecision, indirect evidence, and publication bias. Disagreements were resolved by consensus with the third reviewer.

Study selection

Out of the 4144 studies initially identified, 1280 duplicates were removed, and an additional 2864 studies deemed irrelevant were excluded based on title and abstract screening. Subsequently, a thorough examination of the full text was conducted for the remaining 122 studies, and a manual hand-search of the reference lists of selected articles yielded 5 additional studies. Finally, 30 studies met the predetermined inclusion criteria for this systematic review and meta-analysis. The study selection flowchart is shown in Fig.  1 .

figure 1

Flowchart of study selection

General characteristics of the included studies

The characteristics of the included studies are shown in Table  1 . The total sample size across these studies amounted to 6,228 participants. The earliest study in this review was conducted in 2003 [ 29 ] and the most recent in 2023 [ 15 ], with 14 studies published after 2020. The study with the largest sample size was conducted by Schefold et al. [ 12 ], comprising 933 participants, while the study with the smallest sample size was carried out by Yılmaz et al. [ 19 ], including 40 participants. The methods employed to assess the incidence of PED varied among the studies. Specifically, one study employed VFSS [ 30 ], four studies relied on FEES [ 9 , 29 , 31 , 32 ], and seven studies utilized the SSA [ 7 , 15 , 16 , 33 , 34 , 35 , 36 ]. Furthermore, six studies utilized BSE [ 5 , 10 , 17 , 37 , 38 , 39 ], two studies employed WST [ 12 , 40 ], two studies adopted PEDS [ 8 , 18 ], two studies utilized GUSS [ 19 , 41 ], and six studies employed other assessment tools [ 6 , 20 , 21 , 22 , 42 , 43 ] such as ASHA, FOIS, SSQ200, NPS-PED, MASA, and YSP.

Among all the studies, 23 recorded the assessment time for PED. Specifically, three studies assessed PED within ≤ 3 h post-extubation [ 6 , 9 , 12 ], four studies conducted assessments at 4–6 h post-extubation [ 15 , 16 , 33 , 36 ], nine studies assessed PED within ≤ 24 h post-extubation [ 7 , 8 , 17 , 18 , 19 , 31 , 34 , 40 , 41 ], three studies assessed PED within ≤ 48 h post-extubation [ 5 , 20 , 37 ], and four studies evaluated PED at > 24 h post-extubation [ 21 , 22 , 29 , 38 ]. In terms of study quality, eight of the included studies were categorized as high quality, while the remainder were deemed of moderate quality (see Additional file 1: Tables S3, S4).

Meta-analysis results

Utilizing the random-effects model, the pooled incidence of PED was estimated to be 36% (95% CI 29.0–44.0%, I 2  = 97.06%, p  < 0.001; Fig.  2 ), indicating a substantial degree of heterogeneity. Despite additional subgroup analyses, the source of this high heterogeneity remained elusive. However, the meta-regression analysis revealed that sample size ( p  < 0.001), assessment time ( p  = 0.027), and mean intubation time ( p  = 0.045) emerged as significant factors contributing to the heterogeneity.

figure 2

Overall pooled incidence of PED in ICU

Subgroup analysis of incidence

The subgroup analyses yielded the following incidence rates of PED based on assessment time post-extubation: 31% (95% CI 8.0–59.0) within 3 h, 28% (95% CI 22.0–35.0, I 2  = 78.56%, p  < 0.001) at 4–6 h, 41% (95% CI 33.0–49.0, I 2  = 88.99%, p  < 0.001) within 24 h, and 49% within 48 h; beyond 24 h post-extubation, the incidence was 37% (95% CI 23.0–52.0, I 2  = 91.73%, p  < 0.001) (Additional file 1: Fig. S1). Furthermore, when analyzing studies based on sample size ( N ), the incidence of PED was 51% (95% CI 39.0–63.0, I 2  = 87.11%, p  < 0.001) for studies with N  ≤ 100 participants, 37% (95% CI 31.0–43.0, I 2  = 84.74%, p  < 0.001) for 100 <  N  ≤ 200, 32% (95% CI 20.0–46.0, I 2  = 97.16%, p  < 0.001) for 200 <  N  ≤ 300, and 16% (95% CI 8.0–26.0, I 2  = 97.07%, p  < 0.001) for N  > 300 (see Additional file 1: Fig. S2). In addition, further analyses were conducted based on assessment tool, mean intubation time, mean age, ICU type, evaluator, publication year, study design, and study quality (see Additional file 1: Figs. S3–S11).

Results of meta-regression analysis

In the meta-regression analysis, we examined PED assessment time, sample size, assessment tool, mean intubation time, mean age, ICU type, evaluator, publication year, study design, and study quality as potential covariates to identify the source of heterogeneity (Table  2 ). The univariate meta-regression revealed a statistically significant correlation between incidence and sample size, assessment time, and mean intubation time. Bubble plots of the meta-regressions for each covariate are shown in Additional file 1 (Figs. S12–S22).

Sensitivity analysis

Sensitivity analysis showed that the incidence of PED ranged from 29 to 44% (see Additional file 1: Fig. S23). The variance between these results and the pooled incidence was minimal, suggesting that the pooled incidence estimate is stable and reliable.

Publication bias

In our study, publication bias was assessed using a funnel plot (see Additional file 1: Fig. S24). The trim-and-fill analysis showed that the adjusted effect size was similar to the original effect size ( p  < 0.01) (see Additional file 1: Fig. S25).

The certainty of evidence was very low for all comparisons performed according to the GRADE rating [ 28 ]. Thus, it can be considered that the certainty of the evidence regarding the incidence of PED in this review is very low (Table  3 ).

This systematic review and meta-analysis aimed to estimate the incidence of PED in ICU patients. The study revealed an overall incidence of PED of 36.0% among ICU patients who underwent orotracheal intubation. This rate is comparable to the incidence of dysphagia resulting from stroke (36.30%) [ 45 ] and aligns with the incidence of PED observed in ICU patients (36%) [ 46 ], though it is slightly lower than the 41% reported in the meta-analysis conducted by McIntyre et al. [ 4 ]. Given this high incidence, ICU medical professionals, especially nurses, should raise awareness about PED. Moreover, the diversity and heterogeneity of the included studies in assessment time and assessment tools signal the need for consensus on a range of issues, including the assessment times and assessment tools appropriate for the ICU.

Sample size

This review identified sample size as a significant source of heterogeneity ( p  < 0.001). Notably, the incidence of PED demonstrated a gradual decrease as the sample size of the studies increased. In larger scale studies, such as those conducted by McIntyre et al. and Schefold et al., simpler assessment tools are employed, allowing for quick completion [ 10 , 12 ]. However, the reliability and validity of some of these tools remain unverified. Conversely, certain studies are conducted by highly trained professionals using the gold standard for PED assessment [ 9 , 29 , 31 ], which, while more accurate, is also time-consuming and costly [ 47 ]. In addition, some ICU patients, due to their unstable conditions, are unable to complete the gold standard assessment, resulting in relatively smaller sample sizes for these studies.

In statistics, sample size is intricately linked to result stability. The confidence intervals for subgroups with N  ≤ 100 in this study exhibited a wider range, which might diminish precision and lead to larger deviations from the true value. However, as the sample size increased to 100 <  N  ≤ 300, the confidence intervals narrowed in comparison to other subgroups; consequently, when the sample size was 100 <  N  ≤ 300, the PED incidence rates were closer to the overall PED rate. According to the central limit theorem, if the sampling method remains consistent, results obtained from larger samples are more stable and closer to the true value [ 48 , 49 ]. It is worth noting that the confidence intervals for the subgroup with N  > 300 in this study were wider and diverged more from the total PED incidence. Therefore, in future studies, careful consideration of the sample size, based on the detection rate of the assessment tool used, is advisable to ensure both the stability and reliability of the results.
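As a quick numerical illustration of this point (hypothetical numbers, normal-approximation interval, not study data): a 95% confidence interval for a fixed incidence of 0.36 narrows steadily as the sample size grows, so larger studies tend to sit closer to the true value.

```python
import numpy as np

# Half-width of a 95% normal-approximation CI for a proportion p at sample size n:
# 1.96 * sqrt(p * (1 - p) / n).
p = 0.36
for n in (50, 100, 300, 1000):
    half_width = 1.96 * np.sqrt(p * (1 - p) / n)
    print(f"n={n:5d}  95% CI: {p - half_width:.3f} to {p + half_width:.3f}")
```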

Mean intubation time

This review identified mean intubation time as a significant source of heterogeneity ( p  = 0.045). Variation in mean intubation time among ICU patients undergoing orotracheal intubation can lead to differing degrees of mucosal damage in the oropharynx and larynx [ 2 , 50 ], thereby resulting in varying incidence rates of PED. For instance, Malandraki et al. have reported that prolonged intubation is associated with a more than 12-fold increased risk of moderate/severe dysphagia compared to shorter intubation durations, an effect particularly pronounced among elderly patients [ 51 ]. Moreover, studies have demonstrated that ICU patients with extended orotracheal intubation periods leading to PED also exhibit diminished tongue and lip strength, protracted oral food transport, slower swallowing, and weakness in swallowing-related muscles [ 24 , 46 ]. In view of these findings, ICU medical professionals should routinely evaluate the need for orotracheal intubation and strive to minimize the duration of mechanical ventilation.

PED assessment time

This review identified assessment time as a significant source of heterogeneity ( p  = 0.027). It is important to note that there are currently no established guidelines recommending the optimal timing for the initial assessment of PED in ICU patients who have undergone orotracheal intubation. Consequently, the assessment time varies widely across studies, resulting in PED incidence rates ranging from 28 to 49% among subgroups. Interestingly, the incidence of PED assessed within ≤ 3 h post-extubation appeared lower than that assessed within ≤ 24 h and ≤ 48 h post-extubation. This difference may be attributed to the study by Schefold et al., which featured a shorter intubation duration [ 12 ]; the incidence of PED assessed within ≤ 3 h post-extubation in ICU patients with orotracheal intubation may therefore be underestimated. Moreover, some ICU patients, particularly those with severe illness and extended intubation times, may struggle to comply with post-extubation instructions provided by healthcare personnel; paradoxically, this group of patients is at higher risk of developing PED and, subsequently, of post-extubation pneumonia [ 11 ]. ICU professionals should evaluate swallowing function post-extubation and identify patients at risk of PED early to reduce complications. If PED is identified, nurses should conduct follow-up assessments at multiple time points to obtain a thorough understanding of the recovery trajectory among PED patients, which can serve as a foundation for accurately timing clinical interventions.

PED assessment tools

Although the subgroup analyses and meta-regression results indicated that PED assessment tools did not contribute to the observed heterogeneity, it is important to acknowledge the wide array of assessment tools employed across the included studies. The findings revealed that the GUSS and BSE results aligned most closely with the gold-standard screening results, whereas the PEDS results tended to be higher than those derived from the gold standard. Furthermore, other assessment tools generally yielded lower incidence rates of PED, possibly attributable to variations in specificity or sensitivity. FEES and VFSS are recognized for their meticulous scrutiny of patients’ swallowing processes, including the detection of food residue and aspiration, which other assessment methods may not address as comprehensively [ 51 ]. Assessment tools such as BSE, SSA, GUSS, WST, and other clinical methods do not provide direct visualization of the swallowing process; instead, assessors rely on observing overt clinical symptoms during the patient’s initial food or water intake to judge the presence of PED. These methods may overlook occult aspiration, potentially resulting in an underestimation of PED incidence. In contrast, PEDS, which assesses patients primarily on their medical history and presenting symptoms without a drinking or swallowing screen, may overestimate PED incidence. Considering the varying strengths and limitations of existing assessment tools, ICU professionals should select an appropriate PED assessment tool based on the characteristics of the critically ill patient. Early and rapid identification of PED, before the use of more complex and expensive assessment tools, minimizes the occurrence of complications in patients.
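The direction of these biases can be illustrated with a simple calculation; the sensitivity and specificity values below are hypothetical, not estimates for any specific tool:

```python
# Apparent (observed) incidence given an imperfect screening tool:
# observed = sensitivity * true_p + (1 - specificity) * (1 - true_p)
def observed_incidence(true_p: float, sensitivity: float, specificity: float) -> float:
    return sensitivity * true_p + (1.0 - specificity) * (1.0 - true_p)

true_p = 0.36  # the pooled incidence reported above, taken as the "true" value

# A less sensitive bedside screen under-detects; a history-based screen with
# lower specificity over-detects (values are illustrative assumptions).
print(observed_incidence(true_p, sensitivity=0.70, specificity=0.95))  # ~0.284
print(observed_incidence(true_p, sensitivity=0.95, specificity=0.70))  # ~0.534
```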

Strengths and weaknesses

In this study, we conducted a comprehensive analysis of the incidence of PED in ICU patients who underwent orotracheal intubation across various subgroups, revealing a notable degree of heterogeneity among the included studies. We expanded the search as much as possible and included a total of 30 papers after screening, half of which were published after 2020. Several limitations should be considered when interpreting the results of this meta-analysis. First, there was substantial methodological heterogeneity across the included studies and variation in prevalence estimates, which may call into question the appropriateness of calculating pooled prevalence estimates; to address this, we applied a random-effects model and conducted subgroup analyses and meta-regression, identifying three sources of heterogeneity. Second, the overall quality of evidence for the incidence of PED was rated as very low according to GRADE, so higher-quality original studies on the incidence of PED should be performed in the future. As a result, the findings should be interpreted with caution.

In conclusion, our systematic review and meta-analysis revealed a high incidence of PED among ICU patients who underwent orotracheal intubation; it is also worth noting that the incidence of PED in the ICU may be underestimated. These findings are expected to increase awareness of the issue of PED among those caring for ICU patients. It will be important to develop guidelines or consensus on the most appropriate PED assessment time and assessment tools to accurately assess the incidence of PED.

Relevance to clinical practice

Each year, a substantial number of critically ill patients, ranging from 13 to 20 million, require endotracheal intubation to sustain their lives. Patients undergoing orotracheal intubation are at heightened risk of developing PED, which has been linked to prolonged hospital and ICU lengths of stay, increased rates of pneumonia, and all-cause mortality. Early identification of high-risk patients by clinical nurses is critical for reducing patient burden and adverse outcomes.

Early and repeated assessment: PED should be assessed early in clinical practice, especially within 6 h post-extubation. Furthermore, we suggest follow-up assessments at multiple time points to obtain a thorough understanding of PED incidence and the recovery trajectory among ICU patients who have undergone orotracheal intubation.

Assessment tool: Considering the varying strengths and limitations of existing assessment tools, ICU professionals should carefully evaluate the characteristics of critically ill patients and select appropriate assessment tools, before the use of more complex and expensive assessment tools.

Routinely evaluate the need for orotracheal intubation: Healthcare professionals should routinely evaluate the need for orotracheal intubation and strive to minimize the duration of mechanical ventilation.

Availability of data and materials

All data related to the present systematic review and meta-analysis are available from the original study corresponding author on reasonable request.

Abbreviations

CI: Confidence interval

ICU: Intensive care unit

PED: Post-extubation dysphagia

SSQ200: Sydney Swallow Questionnaire 200

WST: Water swallowing test

PEDS: Post-Extubation Dysphagia Screening Tool

BSE: Bedside swallow evaluation

YSP: The Yale Swallow Protocol

MASA: Mann Assessment of Swallowing Ability

ASHA: American Speech-Language-Hearing Association

VFSS: Video Fluoroscopic Swallowing Study

FEES: Fiberoptic endoscopic evaluation of swallowing

GUSS: Gugging Swallowing Screen

SSA: Standardized Swallowing Assessment

FOIS: Functional Oral Intake Scale

NPS-PED: Nurse-performed screening for post-extubation dysphagia

SLP: Speech-language pathologists

Events of PED

PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-analyses

PROSPERO: International Prospective Register of Systematic Reviews

Wunsch H, Wagner J, Herlim M, Chong DH, Kramer AA, Halpern SD. ICU occupancy and mechanical ventilator use in the United States. Crit Care Med. 2013;41(12):2712–9.


Brodsky MB, Akst LM, Jedlanek E, Pandian V, Blackford B, Price C, Cole G, Mendez-Tellez PA, Hillel AT, Best SR, et al. Laryngeal injury and upper airway symptoms after endotracheal intubation during surgery: a systematic review and meta-analysis. Anesth Analg. 2021;132(4):1023–32.


Brodsky MB, Chilukuri K, De I, Huang M, Needham DM. Coordination of pharyngeal and laryngeal swallowing events during single liquid swallows after oral endotracheal intubation. Am J Respir Crit Care Med. 2017;195:768–77.


McIntyre M, Doeltgen S, Dalton N, Koppa M, Chimunda T. Post-extubation dysphagia incidence in critically ill patients: a systematic review and meta-analysis. Aust Crit Care. 2021;34(1):67–75.

Tsai MH, Ku SC, Wang TG, Hsiao TY, Lee JJ, Chan DC, Huang GH, Chen C. Swallowing dysfunction following endotracheal intubation age matters. Medicine. 2016;95(24):e3871.

Leder SB, Warner HL, Suiter DM, Young NO, Bhattacharya B, Siner JM, Davis KA, Maerz LL, Rosenbaum SH, Marshall PS, et al. Evaluation of swallow function post-extubation: is it necessary to wait 24 hours? Ann Otol Rhinol Laryngol. 2019;128(7):619–24.

Zeng L, Song Y, Dong Y, Wu Q, Zhang L, Yu L, Gao L, Shi Y. Risk score for predicting dysphagia in patients after neurosurgery: a prospective observational trial. Front Neurol. 2021;12:605687.

Dan L, Yunfang C, Chengfen Y, Li T. Reliability and validity of the Chinese version of postextubation dysphagia screening tool for patients with mechanical ventilation. Tianjin J Nurs. 2022;30(2):161–5.

Troll C, Trapl-Grundschober M, Teuschl Y, Cerrito A, Compte MG, Siegemund M. A bedside swallowing screen for the identification of post-extubation dysphagia on the intensive care unit—validation of the Gugging Swallowing Screen (GUSS)—ICU. BMC Anesthesiol. 2023;23(1):122.

McInytre M, Doeltgen S, Shao C, Chimunda T. The incidence and clinical outcomes of postextubation dysphagia in a regional critical care setting. Aust Crit Care. 2022;35(2):107–12.

See KC, Peng SY, Phua J, Sum CL, Concepcion J. Nurse-performed screening for postextubation dysphagia: a retrospective cohort study in critically ill medical patients. Crit Care. 2016;20(1):326.

Schefold JC, Berger D, Zurcher P, Lensch M, Perren A, Jakob SM, Parviainen I, Takala J. Dysphagia in mechanically ventilated ICU patients (DYnAMICS): a prospective observational trial. Crit Care Med. 2017;45(12):2061–9.

Byun SE, Shon HC, Kim JW, Kim HK, Sim Y. Risk factors and prognostic implications of aspiration pneumonia in older hip fracture patients: a multicenter retrospective analysis. Geriatr Gerontol Int. 2019;19(2):119–23.

Jaillette E, Martin-Loeches I, Artigas A, Nseir S. Optimal care and design of the tracheal cuff in the critically ill patient. Ann Intensive Care. 2014;4(1):7.

Tang JY, Feng XQ, Huang XX, Zhang YP, Guo ZT, Chen L, Chen HT, Ying XX. Development and validation of a predictive model for patients with post-extubation dysphagia. World J Emerg Med. 2023;14(1):49–55.

Xia C, Ji J. The characteristics and predicators of post-extubation dysphagia in ICU patients with endotracheal intubation. Dysphagia. 2022;38:253.

Beduneau G, Souday V, Richard JC, Hamel JF, Carpentier D, Chretien JM, Bouchetemble P, Laccoureye L, Astier A, Tanguy V, et al. Persistent swallowing disorders after extubation in mechanically ventilated patients in ICU: a two-center prospective study. Ann Intensive Care. 2020;10(1):1–7.


Johnson KL, Speirs L, Mitchell A, Przybyl H, Anderson D, Manos B, Schaenzer AT, Winchester K. Validation of a postextubation dysphagia screening tool for patients after prolonged endotracheal intubation. Am J Crit Care. 2018;27(2):89–96.

Yılmaz D, Mengi T, Sarı S. Post-extubation dysphagia and COVID-2019. Turkish J Neurol. 2021;27:21–5.

Oliveira A, Friche A, Salomão MS, Bougo GC, Vicente L. Predictive factors for oropharyngeal dysphagia after prolonged orotracheal intubation. Brazil J Otorhinolaryngol. 2018;84(6):722–8.

Yamada T, Ochiai R, Kotake Y. Changes in maximum tongue pressure and postoperative dysphagia in mechanically ventilated patients after cardiovascular surgery. Indian J Crit Care Med. 2022;26(12):1253–8.

Brodsky MB, Huang M, Shanholtz C, Mendez-Tellez PA, Palmer JB, Colantuoni E, Needham DM. Recovery from dysphagia symptoms after oral endotracheal intubation in acute respiratory distress syndrome survivors. A 5-year longitudinal study. Ann Am Thorac Soc. 2017;14(3):376–83.

Skoretz SA, Yau TM, Ivanov J, Granton JT, Martino R. Dysphagia and associated risk factors following extubation in cardiovascular surgical patients. Dysphagia. 2014;29(6):647–54.

Park HS, Koo JH, Song SH. Association of post-extubation dysphagia with tongue weakness and somatosensory disturbance in non-neurologic critically ill patients. Ann Rehabil Med Arm. 2017;41(6):961–8.

Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Rev Esp Cardiol (Engl Ed). 2021;74(9):790–9.

Higgins JP, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD, Savovic J, Schulz KF, Weeks L, Sterne JA. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ Br Med J. 2011;343: d5928.

Lo CK, Mertz D, Loeb M. Newcastle-Ottawa Scale: comparing reviewers’ to authors’ assessments. BMC Med Res Methodol. 2014;14:45.

Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, Schünemann HJ. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ-Br Med J. 2008;336(7650):924–6.

El SA, Okada M, Bhat A, Pietrantoni C. Swallowing disorders post orotracheal intubation in the elderly. Intensive Care Med. 2003;29(9):1451–5.

Yang WJ, Park E, Min YS, Huh JW, Kim AR, Oh HM, Nam TW, Jung TD. Association between clinical risk factors and severity of dysphagia after extubation based on a videofluoroscopic swallowing study. Korean J Intern Med. 2020;35(1):79.

Megarbane B, Hong TB, Kania R, Herman P, Baud FJ. Early laryngeal injury and complications because of endotracheal intubation in acutely poisoned patients: a prospective observational study. Clin Toxicol. 2010;48(4):331–6.

Scheel R, Pisegna JM, McNally E, Noordzij JP, Langmore SE. Endoscopic assessment of swallowing after prolonged intubation in the ICU setting. Ann Otol Rhinol Laryngol. 2016;125(1):43–52.

Fan GUO, Mingming WANG, Shengqiang ZOU. Analysis of risk factors and establishment of prediction model for post-extubation swallowing dysfunction in ICU patients with endotracheal intubation. Chin Nurs Res. 2020;34(19):3424–8.

Yaqian W: Localization and evaluation of reliability and validity of GuSS-ICU bedside swallowing screening tool. Master: Huzhou University; 2020.

Yun D, Yuan Z, Yanli Y. Risk factors and nursing strategies of the occurrences of acquired swallowing disorders after ICU patients treated with oral tracheal intubation and extubation. Med Equip. 2021;34(1):20–2.

JinTian Y. Study on the recovery of swallowing function and the real experience of patients with acquired swallowing disorder after cardiac surgery. Master: Nanjing University; 2020.

de Medeiros GC, Sassi FC, Mangilli LD, Zilberstein B, de Andrade C. Clinical dysphagia risk predictors after prolonged orotracheal intubation. Clinics. 2014;69(1):8–14.

Kwok AM, Davis JW, Cagle KM, Sue LP, Kaups KL. Post-extubation dysphagia in trauma patients: it’s hard to swallow. Am J Surg. 2013;206(6):924–7 (discussion 927–8).

Barker J, Martino R, Reichardt B, Hickey EJ, Ralph-Edwards A. Incidence and impact of dysphagia in patients receiving prolonged endotracheal intubation after cardiac surgery. Can J Surg. 2009;52(2):119–24.


Bordon A, Bokhari R, Sperry J, Testa D, Feinstein A, Ghaemmaghami V. Swallowing dysfunction after prolonged intubation: analysis of risk factors in trauma patients. Am J Surg. 2011;202(6):679–82.

Limin Z. The application of gugging swallowing screenin post-extubation swallowing dysfunction assessment after long-term intubation. Master. Tianjin Medical University; 2016.

Omura K, Komine A, Yanagigawa M, Chiba N, Osada M. Frequency and outcome of post-extubation dysphagia using nurse-performed swallowing screening protocol. Nurs Crit Care. 2019;24(2):70–5.

Regala M, Marvin S, Ehlenbach WJ. Association between postextubation dysphagia and long-term mortality among critically ill older adults. J Am Geriatr Soc. 2019;67(9):1895–901.

Meng PP, Zhang SC, Han C, Wang Q, Bai GT, Yue SW. The occurrence rate of swallowing disorders after stroke patients in Asia: a PRISMA-compliant systematic review and meta-analysis. J Stroke Cerebrovasc Dis Off J Nat Stroke Assoc. 2020;29(10): 105113.

Yingli H, Mengxin C, Donglei S. Incidence and influencing factors of post-extubation dysphagia among patients with mechanical ventilation: a meta-analysis. Chin J Modern Nurs. 2019;25(17):2158–63.

Spronk PE, Spronk LEJ, Egerod I, McGaughey J, McRae J, Rose L, Brodsky MB, Brodsky MB, Rose L, Lut J, et al. Dysphagia in intensive care evaluation (DICE): an international cross-sectional survey. Dysphagia. 2022;37(6):1451–60.

Pourhoseingholi MA, Vahedi M, Rahimzadeh M. Sample size calculation in medical studies. Gastroenterol Hepatol Bed Bench. 2013;6(1):14–7.

Faber J, Fonseca LM. How sample size influences research outcomes. Dental Press J Orthod. 2014;19(4):27–9.

Zuercher P, Moret CS, Dziewas R, Schefold JC. Dysphagia in the intensive care unit: epidemiology, mechanisms, and clinical management. Crit Care. 2019;23(1):103.

Malandraki GA, Markaki V, Georgopoulos VC, Psychogios L, Nanas S. Postextubation dysphagia in critical patients: a first report from the largest step-down intensive care unit in Greece. Am J Speech Lang Pathol. 2016;25(2):150–6.

Ambika RS, Datta B, Manjula BV, Warawantkar UV, Thomas AM. Fiberoptic endoscopic evaluation of swallow (FEES) in intensive care unit patients post extubation. Indian J Otolaryngol Head Neck Surg. 2019;71(2):266–70.

Download references

Funding

No funding.

Author information

Weixia Yu and Limi Dan contributed as the co-first authors.

Authors and Affiliations

Department of Nursing, the First Affiliated Hospital of Soochow University, Suzhou, 215006, China

Weixia Yu, Limi Dan, Jianzheng Cai, Yuyu Wang, Qingling Wang, Yingying Zhang & Xin Wang


Contributions

Weixia Yu, Limi Dan, Jianzheng Cai, and Yuyu Wang developed the original concept of this systematic review and meta-analysis. Weixia Yu, Limi Dan, Jianzheng Cai and Yuyu Wang contributed to the screening of eligible studies, data extraction, and data synthesis. Weixia Yu, Limi Dan, Jianzheng Cai, Yuyu Wang and Qingling Wang drafted the first version of the manuscript. Yingying Zhang, Qingling Wang and Xin Wang prepared the tables and figures. All the authors edited and contributed to the intellectual content. All the authors read and approved the final manuscript and take public responsibility for it.

Corresponding authors

Correspondence to Jianzheng Cai or Yuyu Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information


Additional file 1: Table S1. PRISMA 2020 Checklist. Table S2. Search strategy. Table S3. Quality evaluation results of the cohort studies. Table S4. Quality evaluation results of the cross-sectional study. Fig. S1. Subgroup analysis of the incidence of PED by assessment time. Fig. S2. Subgroup analysis of the incidence of PED by sample size. Fig. S3. Incidence of PED by assessment tool. Fig. S4. Incidence of PED by mean intubation time. Fig. S5. Incidence of PED by mean age. Fig. S6. Incidence of PED by ICU type. Fig. S7. Incidence of PED by evaluator. Fig. S8. Incidence of PED by year of publication. Fig. S9. Incidence of PED by study design. Fig. S10. Incidence of PED by quality of cohort study. Fig. S11. Incidence of PED by quality of cross-sectional study. Fig. S12. Bubble plot of meta-regression result for evaluation time as a covariate. Fig. S13. Bubble plot of meta-regression result for sample size as a covariate. Fig. S14. Bubble plot of meta-regression result for assessment tool as a covariate. Fig. S15. Bubble plot of meta-regression result for mean intubation time as a covariate. Fig. S16. Bubble plot of meta-regression result for mean age as a covariate. Fig. S17. Bubble plot of meta-regression result for ICU type as a covariate. Fig. S18. Bubble plot of meta-regression result for evaluator as a covariate. Fig. S19. Bubble plot of meta-regression result for year of publication as a covariate. Fig. S20. Bubble plot of meta-regression result for study design as a covariate. Fig. S21. Bubble plot of meta-regression result for quality of cohort study as a covariate. Fig. S22. Bubble plot of meta-regression result for quality of cross-sectional study as a covariate. Fig. S23. Sensitivity analysis of PED. Fig. S24. Publication bias assessment plot. Fig. S25. Publication bias assessment plot (“trim and fill” method).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article

Yu, W., Dan, L., Cai, J. et al. Incidence of post-extubation dysphagia among critical care patients undergoing orotracheal intubation: a systematic review and meta-analysis. Eur J Med Res 29, 444 (2024). https://doi.org/10.1186/s40001-024-02024-x


Received : 19 December 2023

Accepted : 12 August 2024

Published : 31 August 2024

DOI : https://doi.org/10.1186/s40001-024-02024-x


Keywords

  • Orotracheal intubation
  • Post-extubation
  • Systematic review
  • Meta-analysis




Motivations for urban front gardening: A quantitative analysis


Murtagh, N and Frost, R (2023) Motivations for urban front gardening: A quantitative analysis. Landscape and Urban Planning, 238. ISSN 0169-2046

Private gardens in urban settings offer multiple benefits for the environment and society. In addition to benefits to people's health and well-being, planting in front gardens in particular can mitigate local flooding and urban heat islands. To encourage more front garden planting, greater understanding of householders’ motivations for front gardening is needed. Addressing research gaps on gardening for reasons other than food production and on motivations for gardening in front gardens, a large-scale online survey (n = 1,000) was conducted with urban/suburban dwellers in England. Exploratory factor analysis identified three factors of motivation: enjoyment, meaning and benefit (intrinsic), creating something beautiful (aesthetic) and functional outcomes (utilitarian). A multiple regression model incorporating the three factors and sociodemographic variables explained 11% of variance of time spent front gardening, with intrinsic motivations the strongest predictor. Intrinsic motivations were stronger for women than for men. The study provides a quantitative categorisation of motivational factors as a basis for comparative research and design of interventions and policy to increase front gardening.
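As a concrete illustration of the modelling approach summarised above, here is a minimal Python sketch (statsmodels) of a multiple regression of gardening time on three motivation factors plus sociodemographic covariates. The variable names and data are synthetic stand-ins, not the study's; the model's R² corresponds to the "variance explained" figure the abstract reports.

```python
# Hypothetical sketch, not the authors' code: OLS regression of time spent
# front gardening on motivation factors and sociodemographics.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000  # the survey had n = 1,000 respondents
df = pd.DataFrame({
    "intrinsic": rng.normal(size=n),    # enjoyment, meaning and benefit
    "aesthetic": rng.normal(size=n),    # creating something beautiful
    "utilitarian": rng.normal(size=n),  # functional outcomes
    "age": rng.integers(18, 80, size=n),
    "female": rng.integers(0, 2, size=n),
})
# Synthetic outcome in which intrinsic motivation is the strongest predictor.
df["gardening_hours"] = (0.5 * df["intrinsic"] + 0.2 * df["aesthetic"]
                         + 0.1 * df["utilitarian"]
                         + rng.normal(scale=2.0, size=n))

model = smf.ols("gardening_hours ~ intrinsic + aesthetic + utilitarian"
                " + age + female", data=df).fit()
print(model.rsquared)  # proportion of variance explained (11% in the paper)
print(model.params)    # the 'intrinsic' coefficient dominates by construction
```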

Item Type: Article
Uncontrolled Keywords: 05 Environmental Sciences; 09 Engineering; 12 Built Environment and Design; Urban & Regional Planning
Publisher: Elsevier
Date Deposited: 30 Aug 2024 15:34
Last Modified: 30 Aug 2024 15:45

Open access | Published: 30 August 2024

Medical students in distress: a mixed methods approach to understanding the impact of debt on well-being

Adrienne Yang, Simone Langness, Lara Chehab, Nikhil Rajapuram, Li Zhang & Amanda Sammann

BMC Medical Education volume 24, Article number: 947 (2024)


Background

Nearly three in four U.S. medical students graduate with debt, often in six-figure amounts, which impairs students emotionally and academically and impacts their career choices and lives long after graduation. Schools have yet to develop systems-level solutions to address the impact of debt on students’ well-being. The objectives of this study were to identify students at highest risk for debt-related stress, define the impact on medical students’ well-being, and identify opportunities for intervention.

Methods

This was a mixed methods, cross-sectional study that used quantitative survey analysis and human-centered design (HCD). We performed a secondary analysis on a national multi-institutional survey on medical student wellbeing, including univariate and multivariate logistic regression, a comparison of logistic regression models with interaction terms, and analysis of free text responses. We also conducted semi-structured interviews with a sample of medical student respondents and non-student stakeholders to develop insights and design opportunities.

Results

Independent risk factors for high debt-related stress included pre-clinical year (OR 1.75), underrepresented minority status (OR 1.40), debt of $20–100K (OR 4.85), debt of >$100K (OR 13.22), private school (OR 1.45), West Coast region (OR 1.57), and consideration of a leave of absence for wellbeing (OR 1.48). Mental health resource utilization (p = 0.968) and meeting with counselors (p = 0.640) were not protective factors against debt-related stress. HCD analysis produced six key insights providing additional context to the quantitative findings, and associated opportunities for intervention.

Conclusions

We used an innovative combination of quantitative survey analysis and in-depth HCD exploration to develop a multi-dimensional understanding of debt-related stress among medical students. This approach allowed us to identify significant risk factors impacting medical students experiencing debt-related stress, while providing context through stakeholder voices to identify opportunities for system-level solutions.


Introduction

Over the past few decades, it has become increasingly costly for aspiring physicians to attend medical school and pursue a career in medicine. The most recent data show that 73% of medical students graduate with debt often amounting to six figures [1] – an amount that is steadily increasing every year [2]. In 2020, the median cost of a four-year medical education in the United States (U.S.) was $250,222 for public and $330,180 for private school students [1] – a price that excludes collateral costs such as living, food, and lifestyle expenses. To meet these varied costs, students typically rely on financial support from their families, personal means, scholarships, or loans. Students are thereby graduating with more debt than ever before and staying indebted for longer, taking 10 to 20 years to repay their student loans regardless of specialty choice or residency length [1].

Unsurprisingly, higher debt burden has been negatively correlated with generalized severe distress among medical students [3, 4], in turn jeopardizing their academic performance and potentially impacting their career choices [5]. Studies have found that medical students with higher debt relative to their peers were more likely to choose a specialty with a higher average annual income [5], less likely to plan to practice in underserved locations, and less likely to choose primary care specialties [4]. However, a survey of 2019 graduating medical students from 142 medical schools found that, when asked to rank factors that influenced their specialty choice, students ranked economic factors, including debt and income, at the bottom of the list. With this inconsistency in the literature, authors Youngclaus and Fresne declare that further studies and analysis are required to better understand this important relationship [1].

Unfortunately, debt and its negative effects disproportionately impact underrepresented minority (URM) students, including African Americans, Hispanic Americans, American Indians, Native Hawaiians, and Alaska Natives [6], who generally have more debt than students who are White or Asian American [1]. In 2019, among medical school graduates who identified as Black, 91% reported having education debt, in comparison to the 73% reported by all graduates [1]. Additionally, Black medical school graduates experience a higher median education debt amount relative to other groups of students, with a median debt of $230,000 [1]. This inequitable distribution of debt disproportionately places financial-related stress on URM students [7], discouraging students from pursuing a medical education [8]. These deterring factors can lead to a physician workforce that lacks diversity and compromises health equity outcomes [9].

Limited literature exists to identify the impact of moderating variables on the relationship between debt and debt-related stress. Financial knowledge is a strong predictor of self-efficacy and confidence in students’ financial management, leading to financial optimism and potentially alleviating debt stress [10, 11, 12]. Numerous studies list mindfulness practices, exercise, and connecting with loved ones as activities that promote well-being and reduce generalized stress among students [13, 14, 15]. However, to date, no studies have examined whether these types of stress-reducing activities, by alleviating generalized stress, reduce debt-related stress. Nor have studies examined whether resources such as physician role models may act as a protective factor against debt-related stress.

Despite the growing recognition that debt burdens medical students emotionally and academically, we have yet to develop systemic solutions that target students’ unmet needs in this space. We performed the first multi-institutional national study on generalized stress among medical students, and found that debt burden was one of several risk factors for generalized stress [3]. The goal of this study is to build upon those findings by using a mixed methods approach combining rigorous survey analysis and human-centered design to develop an in-depth understanding of the impact that education debt has on medical students’ emotional and academic well-being and to identify opportunities for intervention.

Methods

We conducted a mixed methods, cross-sectional study that explored the impact of debt-related stress on US medical students’ well-being and professional development. This study was conducted at the University of California, San Francisco (UCSF). All activities were approved by the UCSF institutional review board, and informed consent was obtained verbally from participants prior to interviews. We performed a secondary analysis of the quantitative and qualitative results of the Medical Student Wellbeing Survey (MSWS), a national multi-institutional survey on medical student wellbeing administered between 2019 and 2020, to determine risk factors and moderating variables of debt-related stress. To further explore these variables, we used human-centered design (HCD), an approach to problem-solving that places users at the center of the research process in order to determine key pain points and unmet needs, and co-design solutions tailored to their unique context [16]. In this study, we performed in-depth, semi-structured interviews with a purposefully sampled cohort of medical students and a convenience sample of non-student stakeholders to determine key insights representing students’ unmet needs, and identified opportunities to ameliorate the impact of debt-related stress on medical students.

Quantitative data: the medical student wellbeing survey

The MSWS is a survey to assess medical student wellbeing that was administered from September 2019 to February 2020 to medical students actively enrolled in accredited US or Caribbean medical schools [3]. Respondents of the MSWS represent a national cohort of >3,000 medical students from >100 unique medical school programs. The MSWS utilizes a combination of validated survey questions, such as the Medical Student Wellbeing Index (MS-WBI), and questions based on foundations established from previously validated wellbeing survey methods [3]. Questions generally focused on student demographics, sources of stress during medical school, specialty consideration, and frequency of participation in activities that promote wellbeing. Some questions asked students to rate physical, emotional, and social domains of wellbeing using a five-point Likert scale. Questions of interest from the MSWS included debt-related stress, generalized stress, intended specialty choice, and utilization of well-being resources and counselors. An additional variable investigated was average school tuition, which was determined by a review of publicly available data for each student’s listed medical school [17]. All data from the MSWS were de-identified for research purposes.

Stress: debt-related and generalized stress

Debt stress was assessed by the question, “How does financial debt affect your stress level?” Students responded using a five-point Likert scale from −2 to 2: significant increase in stress (−2), mild increase (−1), no change (0), mild decrease (1), or significant decrease (2). Responses for this question were evaluated as a binary index of ‘high debt stress,’ defined as a response of −2, versus ‘low debt stress,’ defined as a response of −1 or 0. In addition, generalized stress from the MSWS was assessed by questions from the embedded MS-WBI, which produced a score. Previous studies have shown that the score can be used to create a binary index of distress: a score ≥ 4 has been associated with severe distress, and a score < 4 has been associated with no severe distress [18].
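A minimal sketch of this dichotomisation, assuming a pandas data frame with a hypothetical column name (the MSWS codebook is not reproduced in this article):

```python
# Hedged sketch: dichotomising the five-point debt-stress item as described
# above. The column name is hypothetical.
import pandas as pd

survey = pd.DataFrame({"debt_stress_likert": [-2, -1, 0, 1, 2, -2, 0]})

# Keep only the responses used in the analysis (-2, -1, 0), then flag
# 'high debt stress' (-2) versus 'low debt stress' (-1 or 0).
analytic = survey[survey["debt_stress_likert"].isin([-2, -1, 0])].copy()
analytic["high_debt_stress"] = (analytic["debt_stress_likert"] == -2).astype(int)
print(analytic)
```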

Intended specialty

We categorized students’ responses to intended specialty choice by competitiveness, using the 2018 National Resident Match Program data [19]. ‘High’ and ‘low’ competitiveness were defined as an average United States Medical Licensing Examination (USMLE) Step 1 score of >240 or ≤230, respectively, or if >18% or <4% of applicants were unmatched, respectively. ‘Moderate’ competitiveness was defined as any specialty not meeting criteria for either ‘high’ or ‘low’ competitiveness.
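Expressed as code, the classification rule might look like the sketch below; where the Step 1 and unmatched-rate criteria disagree, this sketch lets 'high' take precedence, which is our assumption rather than a rule stated in the paper.

```python
# Illustrative implementation of the competitiveness categories described above.
def competitiveness(mean_step1: float, pct_unmatched: float) -> str:
    if mean_step1 > 240 or pct_unmatched > 18:   # 'high' criteria
        return "high"
    if mean_step1 <= 230 or pct_unmatched < 4:   # 'low' criteria
        return "low"
    return "moderate"                            # everything else

print(competitiveness(248, 20))  # -> "high"
print(competitiveness(235, 10))  # -> "moderate"
```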

Resource utilization

The MSWS assessed the utilization of well-being resources by the question, “At your institution, which of the following well-being resources have you utilized? (Select all that apply)” Students responded by selecting each of the resource(s) they used: Mental Health and Counseling Services, Peer Mentorship, Self-Care Education, Mindfulness/Meditation Classes, Community Building Events, and Other. The number of choices that the student selected was calculated, allowing for placement into a category depending on the amount of resource utilization: 0–20%, 20–40%, 40–60%, 60–80%, 80–100%. Responses for this question were evaluated as a binary index of ‘high resource utilization,’ defined as a response of 80–100% resource utilization, versus ‘low resource utilization,’ defined as a response of < 80% resource utilization. The co-authors collaboratively decided upon this “top-box score approach,” [ 20 ] which is the sum of percentages for the most favorable top one, two or three highest categories on a scale, to assess if the most extreme users (80–100%) of these supportive resources experienced a decrease in debt-related stress. Additionally, use of a counselor for mental health support was assessed by the question, “Which of the following activities do you use to cope with difficult situations (or a difficult day on clinical rotation)? (Select all that apply).” Students responded by selecting the activities that they use from a list (e.g., listen to music, mindfulness practice, meet with a counselor, exercise). Responses for this question were evaluated as a binary index of ‘Meeting with a Counselor,’ defined by selection of that option, versus ‘Not Meeting with a Counselor,’ defined as not selecting that option.
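The "top-box" recoding described above might be computed as in the following sketch; the column names paraphrase the survey's resource options and are not its exact items.

```python
# Hedged sketch of the top-box recoding of resource utilization.
import pandas as pd

resources = ["mental_health", "peer_mentorship", "self_care_education",
             "mindfulness_classes", "community_events", "other"]
# Each row is one respondent; 1 = resource used, 0 = not used.
df = pd.DataFrame([[1, 1, 1, 1, 1, 0],
                   [1, 0, 0, 0, 0, 0]], columns=resources)

share_used = df[resources].mean(axis=1)  # fraction of listed resources used
# Top box: only respondents using 80-100% of the resources count as 'high'.
df["high_resource_utilization"] = (share_used >= 0.8).astype(int)
print(df["high_resource_utilization"])  # -> 1 for the first row, 0 for the second
```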

Quantitative data analysis

We performed a secondary analysis of quantitative data from the MSWS to calculate frequencies and odds ratios for the five quantitative variables described above (debt-related stress, generalized stress, intended specialty, resource utilization, and school tuition). Tests performed are summarized in Table 1 (“Secondary Analysis Tests Performed”). Univariate analysis and multivariate logistic regression were performed among students in the high debt stress (−2) and low debt stress (−1 or 0) groups for select variables, such as clinical phase, URM, debt burden, specialty competitiveness, and average school tuition, to identify risk factors for high debt stress. To determine if ‘high resource utilization’ or ‘meeting with a counselor’ were moderating variables on the relationship between debt burden and debt stress, we applied logistic regression with the interaction term of ‘debt’ and ‘resource utilization’ (high vs. low). Then, we performed a similar analysis but replaced the interaction term with ‘debt’ and ‘meeting with a counselor’ (yes vs. no). We also performed Chi-squared tests to determine the degree to which severe distress increases as debt burden increases, whether specialty competitiveness varied by debt stress, and whether the proportion of students who identified as URM, in comparison to non-URM, differed by debt level. All statistical tests were two-sided and p < 0.05 was considered significant. Statistical analyses were performed using SAS version 9.4 and R version 4.0.5.
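The authors ran these models in SAS 9.4 and R 4.0.5; the following Python/statsmodels sketch, built on synthetic data with hypothetical variable names, is included only to illustrate the interaction-term approach, not to reproduce their analysis.

```python
# Illustrative logistic regression with a debt x resource-utilization
# interaction term, fitted on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 800
df = pd.DataFrame({
    "debt": rng.choice(["<20K", "20-100K", ">100K"], size=n),
    "high_resource_use": rng.integers(0, 2, size=n),
    "urm": rng.integers(0, 2, size=n),
    "preclinical": rng.integers(0, 2, size=n),
})
# Synthetic outcome: higher debt raises the odds of high debt stress.
p = 1 / (1 + np.exp(-(-1.0 + 1.5 * (df["debt"] == ">100K"))))
df["high_debt_stress"] = rng.binomial(1, p)

logit_fit = smf.logit(
    "high_debt_stress ~ C(debt) * high_resource_use + urm + preclinical",
    data=df,
).fit()
# The p-values on the C(debt):high_resource_use terms test moderation.
print(logit_fit.summary())
```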

Qualitative data: interviews and MSWS free text responses

Free-text entries

At the conclusion of the 2019–2020 MSWS, respondents had unlimited text space to provide comments to two prompts. The first prompt read, “What well-being resource(s), if offered at your school, do you feel would be most useful?” The second prompt read “If you have any further comments to share, please write them below.” Answers to either prompt that pertained to debt, cost of medical school, or finances were extracted for the purpose of this study and analyzed with the other qualitative data subsequently described.

Interview selection & purposive sampling

Interview participants were identified from a repository of respondents to the MSWS who had attached their email address and expressed willingness at the time of the survey to be contacted for an interview [ 3 ]. Our recruitment period was between April 19, 2021 to July 2, 2021. The recruitment process involved sending invitations to all of the email addresses in the list to participate in a 45-minute interview on the topic of student debt and wellbeing. The invitation included a brief screening questionnaire asking students to report updates to questions that were previously asked in the MSWS (i.e.: clinical training year, marital status, dependents). Additional novel questions included primary financial support system, estimate of financial support systems’ household income in the last year, estimate of educational financial debt at conclusion of medical school, student’s plan for paying off debt, and degree of stress (using a Likert scale from 0 to 10) over current and future education debt.

Purposeful sampling of medical student stakeholders for interviews allowed us to maximize heterogeneity. We utilized the students’ responses to the brief screening questionnaire with their corresponding responses to demographic questions from the MSWS to select interviewees that varied by gender, race, presence of severe distress, type of medical school (public vs. private), region of school, and tuition level of school. The sampling ensured a diverse representation, in accordance with HCD methodology [ 21 ]. Brief descriptions of participant experiences are listed in Table  2 (“Interviewee Descriptors”). Students who were selected for interviews were sent a confirmation email to participate. Interviews were to be conducted until thematic saturation was reached. In addition, to include representation from the entire ecosystem, we interviewed a financial aid counselor at a medical school and a pre-medical student, chosen through convenience sampling. We directly contacted those two individuals for interviews.

Semi-structured interviews

All interviews were conducted between April 2021 and July 2021 over Zoom. A single researcher conducted interviews over an average of 45 min. Informed consent was obtained verbally from participants prior to interviews; interviews and their recordings only proceeded following verbal consent. The interview guide (S1 File) included open-ended questions about students’ experience of debt-related stress and their reflections on its consequences. The audio recordings were transcribed using Otter.ai, a secure online transcription service that converts audio files to searchable text files. Interview responses were redacted to preserve anonymity of respondent identity.

Qualitative data analysis

Interview data was analyzed using a general inductive approach to thematic analysis. Specifically, two researchers (SL and AY) independently inductively analyzed transcripts from the first three semi-structured interviews to come up with themes relating to the experiences and consequences of debt-related stress. They reconciled discrepancies in themes through discussion to create the codebook (S2 File), which included 18 themes. SL and AY independently coded each subsequent interview transcript as well as the free text responses from the survey, meeting to reach a consensus on representative quotes for applicable themes.

Following the HCD methodology, two researchers met with the core team to discuss the themes from the interviews and translate them into “insight statements”, which reflect key tensions and challenges experienced by stakeholders. Insight statements carefully articulate stakeholders’ unique perspectives and motivations in a way that is actionable for solution development [ 22 ]. As such, these insight statements are reframed into design opportunities, which suggest that multiple solutions are possible [ 23 , 24 ]. For example, discussion about themes 1a and 1b (“Questionable Job Security” and “Disappointing MD salary and Satisfaction Payoff”) revealed that they were related in the way that they led students to wonder whether the investment in medical school would be offset by the salary payoff. This led to the identification of the tension for low-income students in particular, who have to weigh this tradeoff earlier in their medical school journey than other students who are less financially-constrained (insight: “Medical school is a risky investment for low-income students”.) The design opportunity logically translates into a call to action for brainstorming and solution development: “Support low-income students to make values-based tradeoffs when considering a career in medicine.”

Results

MSWS respondents and quantitative analysis

A total of 3,162 students responded to the MSWS and their sociodemographic characteristics have been described previously [3]. A total of 2,771 respondents (87.6%) responded to our study’s variables of interest, including a response for ‘high debt stress’ (−2) or ‘low debt stress’ (−1 or 0). Table 3 lists the distribution of debt-related stress across different variables for all respondents.

Risk factors for debt-related stress

Factors that were independently associated with higher debt-related stress included being in pre-clinical year (OR 1.75, 95% CI 1.30–2.36, p < 0.001), identifying as URM (OR 1.40, 95% CI 1.03–1.88, p = 0.029), having debt of $20–100K (OR 4.85, 95% CI 3.32–7.30, p < 0.001), having debt of >$100K (OR 13.22, 95% CI 9.05–19.90, p < 0.001), attending a private medical school (OR 1.45, 95% CI 1.06–1.98, p = 0.019), attending medical school on the West Coast (OR 1.57, 95% CI 1.17–2.13, p = 0.003), and having considered taking a leave of absence for wellbeing (OR 1.48, 95% CI 1.13–1.93, p = 0.004) (Table 4, S1 Table).
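For readers who want to see the mechanics, odds ratios and confidence intervals like those above are obtained by exponentiating a fitted logistic model's coefficients and interval bounds. The snippet continues the illustrative `logit_fit` sketch from the Methods section; it is not the authors' code.

```python
# Odds ratios and 95% CIs from the illustrative logit fit defined earlier.
import numpy as np
import pandas as pd

odds_ratios = np.exp(logit_fit.params)   # coefficients -> odds ratios
conf_int = np.exp(logit_fit.conf_int())  # CI bounds on the OR scale
print(pd.concat([odds_ratios, conf_int], axis=1))
```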

Severe distress by debt amount

Levels of generalized severe distress differed across debt burden groups. As debt level increased, the percentage of individuals with “severe” distress increased (p < 0.001).
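A chi-squared test of this kind can be sketched as follows; the counts are invented purely to show the mechanics and are not the study's data.

```python
# Hypothetical contingency table: debt level (rows) by severe distress (columns).
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[400,  60],    # debt < $20K: [no severe distress, severe]
                  [300, 120],    # debt $20-100K
                  [250, 200]])   # debt > $100K
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")  # small p: distress varies with debt
```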

Debt and career decisions

There were significant differences between the high debt stress and low debt stress groups in plans to pursue highly vs. moderately vs. minimally competitive specialties (p = 0.027) (Fig. 1). A greater percentage of low debt stress students were pursuing a highly competitive specialty or a minimally competitive specialty, whereas a greater percentage of high debt stress students were pursuing a moderately competitive specialty. As shown in Table 4, there were no differences in debt-associated stress between students who chose different specialty types, such as medical versus surgical versus mixed (medical/surgical).

Fig. 1. Debt stress by specialty competitiveness

URM students’ experience of debt

URM identity was an independent risk factor for higher debt-related stress (Table 4). In addition, debt levels varied between those who identified as URM versus non-URM (p < 0.001). Students identifying as URM tended to have higher debt than those who did not. Although the percentage of non-URM students was higher than that of URM students within the lowest debt burden category (<$20K), among all higher debt burden categories, including $20–100K, $100–300K, and >$300K, the percentage of URM students was higher than the percentage of non-URM students.

Moderating factors on the relationship between debt and debt stress

Hypothesized protective factors, such as a high degree of mental health resource utilization and meeting with a counselor, did not reduce the impact of debt burden on debt stress. Among students who reported a high degree of mental health resource utilization, there was no impact on the relationship between debt and debt stress (p = 0.968). Similarly, meeting with a counselor had no impact on the relationship between debt and debt stress (p = 0.640).

Interview respondents and qualitative analysis

We conducted in-depth, semi-structured interviews with 11 medical students, who are briefly described in Table 2. We reached thematic saturation with 11 interviews, the point at which we found recurring themes; therefore, no further interviews were needed. Among the medical student interviewees, there was representation from all regions, including the Northeast (n = 3), West Coast (n = 5), Midwest (n = 2), and South (n = 1). Students were also from all clinical phases, including pre-clinical (n = 3), clinical (n = 4), gap year/other (n = 2), and post-clinical (n = 2). Most interviewees were female (n = 8) and 5 of the interviewees identified as URM. Financial support systems were diverse, including self (n = 3), spouse/partner (n = 3), and parents/other (n = 5). Most interviewees reported low debt stress (n = 8), as opposed to high debt stress (n = 3). 55% of interviewees planned to pursue specialties that pay <$300K (n = 6), with some pursuing specialties that pay $300–400K (n = 2) and >$400K (n = 3).

Among the MSWS free-text responses to the prompt, “What well-being resource(s), if offered at your school, do you feel would be most useful?”, 20 of 118 respondents (16.9%) provided free-text responses that pertained to debt, cost of medical school, or finances. To the prompt “If you have any further comments to share, please write them below,” 11 of 342 students (3.2%) provided relevant free-text responses. Analysis of the free-text responses and semi-structured interviews revealed 6 distinct insights (Table 5), with each insight translated into an actionable design opportunity.

Medical school is a risky investment for low-income students.

Description

The personal and financial sacrifices required for low-income students to attend medical school and pursue a career in medicine outweigh the benefits of becoming a physician. When considering a career in medicine, students feel discouraged by questionable job security (theme 1a) and reduced financial compensation (theme 1b) – a combination that jeopardizes immediate and long-term job satisfaction. Some students feel hopeful that their decision to pursue medicine will be personally rewarding (1b.6) and their salaries will stabilize (1a.1, 1a.5), but many low-income students experience doubt about whether they made the right career choice (1b.2, 1b.4, 1b.6), and feel stressed that they will be in debt for longer than they expected (1a.3, 1a.4, 1b.1, 1b.5).

Design opportunity 1

Support low-income students to make values-based tradeoffs when considering a career in medicine.

Medical schools lack the adaptive infrastructure to be welcoming to low-income students.

Students face financial challenges from the moment they apply to medical school (theme 2a), a costly process that limits admissions options for low-income students due to their inability to pay for numerous application fees (2a.1) and expensive test preparation courses (2a.2, 2a.3). Once students begin medical school, they feel unsupported in their varied responsibilities towards their families (theme 2b) and additional financial needs (theme 2c), requiring them to make tradeoffs with their education and personal lives (2b.2, 2c.1).

Design opportunity 2

Develop flexible systems that can recognize and accommodate students’ complex financial needs during medical school.

Students worry about the impact that their medical school debt has on their present and future families, which compounds feelings of guilt and anxiety.

For students who need to take loans, the decision to pursue a career in medicine is a collective investment with their families. Students feel guilty about the sacrifices their families have to make for the sake of their career (theme 3a) and feel pressure to continue to provide financially for their family while having debt (theme 3b). Students are stressed about acquiring more debt throughout their training (3a.1) and the impact that has on loved ones who are dependent on them (3a.4, 3a.5, 3b.2), especially with respect to ensuring their financial security in the future (3b.4).

Design opportunity 3

Create an environment that acknowledges and accounts for the burden of responsibility that students face towards their families.

Without the appropriate education about loans, the stress of debt is exponentially worse.

Students feel the greatest fear around loans when they do not understand them, including the process of securing loans and paying off debt (theme 4a). Students are overwhelmed by their loan amounts (4a.5) and lack the knowledge or resources to manage their debt (4a.1, 4a.2), making them uncertain about how they will become debt-free in the future (4a.3, 4a.4). Students reported that various resources helped to alleviate those burgeoning fears (theme 4b), including financial aid counselors (4b.2, 4b.3) and physician role models (4b.5, 4b.6) that generally increase knowledge and skills related to debt management (4b.1).

Design opportunity 4

Empower students to become experts in managing their debt by making loan-related resources more available and accessible.

The small, daily expenses are the most burdensome and cause the greatest amount of stress.

Students with educational debt are mentally unprepared for the burden of managing their daily living expenses (theme 5a), causing them to make significant lifestyle adjustments in the hope of easing their resulting anxiety (theme 5b). These costs are immediate and tangible, compared with tuition costs, which are more distant and require less frequent management (5a.3). Students learn to temper their expectations for living beyond a bare minimum during medical school (5a.1, 5b.2, 5b.4) and develop strategies to ensure that their necessary expenses are as low as possible (5b.1, 5b.2, 5b.3, 5b.4).

Design opportunity 5

Develop and distribute resources to support both short- and long-term financial costs for medical students.

Students view debt as a dark cloud that constrains their mental health and dictates their career trajectory.

The constant burden of educational debt constrains students’ abilities to control their mental health (theme 6a) and pursue their desired career path in medicine (themes 6b & 6c). Students feel controlled by their debt (6a.3) and concerned that it will impact their [ability] to live a personally fulfilling life (6a.1, 6a.2, 6c.6), especially with respect to pursuing their desired medical specialties (6b.1, 6c.3, 6c.5, 6c.6). Students with scholarships, as opposed to loans, felt more able to choose specialties that prioritized their values rather than their finances (6c.1, 6c.2), an affordance that impacts long-term career growth and satisfaction.

Design opportunity 6

Create a culture of confidence for managing debt and debt-stress among medical students.

Discussion

This is the first multi-institutional national study to explore the impact of debt-related stress on medical students’ well-being in the United States. We used an innovative, mixed methods approach to better understand the factors that significantly affect debt-related stress, and propose opportunities for improving medical student well-being.

URM students

Analysis of survey results found that students who identify as URM are more likely to experience higher levels of debt-related stress than non-URM students. Our study also found that among all higher debt burden categories, debt levels were higher for URM students, findings consistent with studies that have shown the disproportionate burden of debt among URM students [1]. Our semi-structured interviews illuminated that students from low-income backgrounds feel unsupported by their medical schools in these varied financial stressors that extend beyond tuition costs (insight 2), leaving their needs unmet and increasing financial stress over time: “We don’t have different socio-economic classes in medicine because there’s constantly a cost that [isn’t] even factored into tuition cost [and] that we can’t take student loans for.” Many URM students feel especially stressed by their financial obligations towards their families (insight 3), and describe the decision to enter into medicine as one that is collective (“the family’s going to school”) rather than individual, placing additional pressure on themselves to succeed in their career: “Being of low SES, the most significant stressor for me is the financing of medical school and the pull of responsibility for my family.” Several other studies from the literature confirm that students who identify as URM and first-generation college or medical students are at higher risk for financial stress compared to their counterparts [7], and report that they feel as though it is their responsibility to honor their families through their educational and career pursuits [25]. Our study demonstrates and describes how low-income and URM students face numerous financial barriers in medical school, resulting in medical trainees that are less diverse than the patient populations they are serving [1, 8].

Debt amount

Our quantitative analysis found that students with debt amounts over $100,000 are at much higher risk of experiencing severe stress than students with debt less than that amount. Although this finding may seem intuitive, it is important to highlight the degree to which this risk differs between these two cohorts. Students with debt amounts between $20,000 and $100,000 are approximately 5 times more likely to experience high stress than students with debt less than $20,000, while students with debt amounts over $100,000 are approximately 13 times more likely to experience severe stress when compared to the same cohort. Interview participants described that the more debt they have, the less hopeful they feel towards achieving financial security (insight 1): “There are other healthcare professionals that will not accrue the same amount of loans that we will, and then may or may not have the same salary or privileges […] makes me question, did I do the right thing?” Students internalize this rising stress so as not to shift the feelings of guilt onto their families (insight 3), thereby compounding the psychological burden associated with large amounts of debt (insight 6): “As long as you’re in debt, you’re owned by someone or something and the sooner you can get out of it, the better; the sooner I can get started with my life.”

Pre-clinical students

According to our survey analysis, students who are in their pre-clinical years are at higher risk of stress than students in their clinical years. Our interview findings from insight 4 suggest that students initially feel overwhelmed and unsure about what questions to ask (“One of my fears is that I don’t know what I don’t know”) or how to manage their loans so that the debt does not have a permanent impact on their lives: “The biggest worry is, what if [the debt] becomes so large that I am never able to pay it off and it ends up ruining me financially.” Pre-clinical students may therefore feel unsure or ill-equipped to manage their loans, making them feel overwhelmed by the initial stimulus of debt. By the time students reach their clinical years, they may have had time to develop strategies for managing stress, acquire more financial knowledge, and/or normalize the idea of having debt.

Medical school characteristics

Our survey analysis found several risk factors related to medical school characteristics. First, we found that students who attended a private school were at higher risk for debt-related stress than students who attended a public school. Not only is the median 4-year cost of attendance in 2023 almost $100,000 higher in private compared to public medical schools [ 26 ], but it is also the case that financial aid packages are more liberally available for public schools due to state government funding [ 27 ]. This not only relieves students from having higher amounts of debt, but it also creates a more inclusive cohort of medical students. Insight 2 from our interviews suggests that private medical schools without the infrastructure to meet students’ varying financial needs force low-income students to make tradeoffs between their education and personal lives.

Another characteristic that was found to be a risk factor for debt stress was attending a medical school on the West Coast (compared to a non-coastal school). This was a surprising finding given that tuition rates for both private and public schools on the West Coast are no higher than those in other regions [17]. The distribution of survey respondents did not vary significantly across regional categories, so no bias in sample size is suspected. While these interviews were not designed to address the reasoning behind students’ choice of medical school matriculation, there is a potential explanation for this finding. Historically, students match into residency programs that are in or near their home state [28, 29]; therefore, we speculate that students may prefer to settle on the West Coast, and may be willing to take on more financial debt in pursuit of their long-term practice and lifestyle goals.

Our quantitative analysis found that students who reported having considered taking a leave of absence for well-being purposes were at higher risk of debt-related stress. This cohort of students likely experiences higher levels of stress, as they are conscious of the negative impact it has on their lives and have already ruminated on leaving medical school. A study by Fallar et al. found that the period leading up to a leave of absence is particularly stressful for students because they are unfamiliar with the logistics of taking time off, and don’t feel as though leaving medical school is encouraged or normalized for students [30]. One interviewee, a student in a joint MD–PhD program, described having more time for herself during her PhD and using money for activities that could alleviate stress (“I took figure skating during my PhD”) rather than creating more stress by compromising on her lifestyle during medical school (insight 5). More research may be needed to better understand and support students considering taking a leave of absence from medical school.

Specialty choice

Our study found that students with high debt stress pursue moderately competitive specialties compared to students with low debt stress. This may be explained by the fact that low debt stress gives students the freedom to pursue minimally competitive specialties, which may be more fulfilling to them but typically have lower salaries. Insight 6 further elaborates upon this finding, suggesting that students with high debt stress deprioritize specialties for which they are passionate in favor of higher paying specialties that might alleviate their debt: “I love working with kids…but being an outpatient pediatrician just wasn’t going to be enough to justify the [private school] price tag.” Students with lower debt stress describe having the freedom to choose specialties that align with their values, regardless of anticipated salary: “Scholarships give me the freedom to do [specialties] that maybe are a little bit less well-paying in medicine.” Interestingly, certain studies examining the relationship between specialty choice and debt stress have found that high debt stress is associated with a higher likelihood of pursuing a more competitive, and presumably higher paying, specialty [5]. More research investigating the relationship between debt stress and specialty choice could illuminate opportunities for increasing a sense of agency and overall satisfaction among students for their career choices.

In our exploration of potential protective factors against the effects of debt-related stress, our survey analysis found that the two variables measured (high mental health resource utilization and meeting with a counselor) did not have any impact on reducing debt-related stress. This finding is inconsistent with the literature, which considers these activities to promote general well-being among students, although they have never been studied in the context of debt-related stress [13, 14, 15]. A potential explanation is that the survey questions that assessed these activities were imperfect. For example, the question about meeting with a counselor was not a standalone question but instead sat at the bottom of a list of other wellbeing activities; students may have been fatigued by the time they got to the bottom of the list and not selected it. Additionally, our definition of “high” mental health resource utilization may have been too strict (i.e.: 80–100%), and perhaps we would have seen effects at lower percentages of utilization (i.e.: 40–60%). Despite this finding, students describe in their interviews that having access to certain resources such as financial knowledge and physician role models can help to alleviate stress by helping them feel confident in managing their loans in the immediate and more distant future (insight 4): “I’ve had explicit discussions with physicians who went to med school, had debt, paid it off [.] the debt hasn’t hindered their life in any way. I think that just makes me feel a lot calmer.” This finding aligns with previous studies that suggest that financial knowledge, such as knowledge about loans and a payoff plan, confers confidence in students’ financial management [11, 12]. It is also aligned with previous studies that suggest financial optimism, such as from a physician role model who successfully paid off loans, is associated with less financial stress [10].

Our quantitative analysis of risk factors helped us to identify which areas might significantly impact debt-related stress among medical students, while our qualitative analysis provided more in-depth insight into those risk factors for more human-centered intervention design. The HCD process not only provides additional context from the perspective of medical students, but also proposes distinct design opportunities upon which interventions may be designed and tested. Drawing from the six design opportunities outlined in this paper, we propose a solution on a national scale: lowering the cost of the MCAT and medical school applications to reduce the financial barrier to applying to medical school [31]. We also propose the following solutions that can be implemented at the level of medical schools to better support medical students facing debt-related stress: (1) providing adequate financial aid that prevents low-income students from needing to work while in medical school [32], (2) providing targeted financial planning classes and counseling for first-year medical students who have taken loans [33], and (3) creating mentorship programs that pair medical students with debt with physician role models who also had debt but successfully paid it off [34]. We encourage medical schools to consider these suggestions, choosing the ideas from the list that make sense and tailoring them as necessary for their students and their unique needs. Additionally, given that the quantitative portion of our study was a secondary analysis of a survey focused on general medical student well-being, a nationwide study is needed that is specifically designed to explore the topic of debt-related stress among medical students. Furthermore, more research is needed that assesses the impact of activities that promote well-being (e.g., access to therapy, mindfulness practices, exercise) on debt-related stress among medical students.

Limitations

Our study had some notable limitations. One potential limitation is that our data collection occurred between 2019 and 2021 for this publication in 2023. Additionally, as described in the original study [3], a limitation of the MSWS is the inability to determine a response rate of students due to the survey distribution by medical student liaisons from each medical school; under the reasonable assumption that the survey was distributed to every US allopathic medical student, the response rate was estimated to have been 8.7% [3]. An additional limitation is the potential for response bias [3]. A limitation of the qualitative interviews is the potential for response bias among the interviewees. Although we sampled purposefully, the students who accepted the invitation to interview may have been students with extreme views, either very negative views of debt or very neutral views of debt. Additionally, the interviewees were not representative of all possible financial situations, given that most students were from private schools, which typically have higher tuition rates. Also, all students had debt amounts in the middle and high categories, with none in the low category. Finally, our model of risk factors for debt-related stress suggested the presence of negative confounding factors, which exerted effects on specific variables (i.e.: pre-clinical year, West Coast) for which univariate analysis found no significant associations but multivariate analysis did. We did not perform further analysis to identify which variables served as the negative confounders.

Conclusions

In conclusion, our mixed methods, cross-sectional study exploring debt-related stress and its impact on US medical students’ wellbeing and professional development revealed a set of risk factors and design opportunities for intervention. By using a combined quantitative and qualitative HCD approach, we were able to develop a broad, in-depth understanding of the challenges and opportunities facing medical students with education debt. With these efforts to support the well-being and academic success of students at higher risk of debt-related stress, medical education institutions can develop and nurture a more diverse medical field that can best support the needs of future patients.

Data availability

Data is provided within the supplementary information files.

References

1. Youngclaus J, Fresne J. Physician education debt and the cost to attend medical school: 2020 update. Association of American Medical Colleges; 2020.

2. Association of American Medical Colleges. Medical school graduation questionnaire: 2020 all schools summary report. 2020 [cited 2023 Sep 7]. https://www.aamc.org/data-reports/students-residents/report/graduation-questionnaire-gq

3. Rajapuram N, Langness S, Marshall MR, Sammann A. Medical students in distress: the impact of gender, race, debt, and disability. PLoS ONE. 2020;15(12):e0243250.

4. Rohlfing J, Navarro R, Maniya OZ, Hughes BD, Rogalsky DK. Medical student debt and major life choices other than specialty. Med Educ Online. 2014;19. https://doi.org/10.3402/meo.v19.25603

5. Pisaniello MS, Asahina AT, Bacchi S, Wagner M, Perry SW, Wong ML, et al. Effect of medical student debt on mental health, academic performance and specialty choice: a systematic review. BMJ Open. 2019;9(7):e029980.

6. AAMC. Unique populations. [cited 2023 Oct 18]. https://www.aamc.org/professional-development/affinity-groups/gfa/unique-populations

7. McMichael B, Lee IVA, Fallon B, Matusko N, Sandhu G. Racial and socioeconomic inequity in the financial stress of medical school. MedEdPublish (2016). 2022;12:3.

8. McInturff B, Frontczak E. Medical school applicant survey. 2004.

9. Morrison E, Grbic D. Dimensions of diversity and perception of having learned from individuals from different backgrounds: the particular importance of racial diversity. Acad Med. 2015;90(7):937.

10. Heckman S, Lim H, Montalto C. Factors related to financial stress among college students. J Financial Therapy. 2014;5(1):19–39.

11. Heckman SJ, Grable JE. Testing the role of parental debt attitudes, student income, dependency status, and financial knowledge have in shaping financial self-efficacy among college students. Coll Student J. 2011;45(1):51–64.

12. Gillen M, Loeffler DN. Financial literacy and social work students: knowledge is power. J Financial Therapy. 2012;3(2) [cited 2023 Sep 13]. https://newprairiepress.org/jft/vol3/iss2/4

13. Conley CS, Durlak JA, Dickson DA. An evaluative review of outcome research on universal mental health promotion and prevention programs for higher education students. J Am Coll Health. 2013;61(5):286–301.

14. Luken M, Sammons A. Systematic review of mindfulness practice for reducing job burnout. Am J Occup Ther. 2016;70(2):7002250020p1–10.

15. Weight CJ, Sellon JL, Lessard-Anderson CR, Shanafelt TD, Olsen KD, Laskowski ER. Physical activity, quality of life, and burnout among physician trainees: the effect of a team-based, incentivized exercise program. Mayo Clin Proc. 2013;88(12):1435–42.

16. Design Kit. What is human-centered design? [cited 2023 Oct 2]. https://www.designkit.org/human-centered-design.html

17. AAMC. Tuition and student fees reports: 2006–2013 tuition and student fees report. https://www.aamc.org/data-reports/reporting-tools/report/tuition-and-student-fees-reports

18. Dyrbye LN, Schwartz A, Downing SM, Szydlo DW, Sloan JA, Shanafelt TD. Efficacy of a brief screening tool to identify medical students in distress. Acad Med. 2011;86(7):907–14.

19. National Resident Matching Program. Charting outcomes in the match: U.S. allopathic seniors (characteristics of U.S. allopathic seniors who matched to their preferred specialty in the 2018 main residency match). [cited 2023 Sep 18]. https://www.nrmp.org/wp-content/uploads/2021/07/Charting-Outcomes-in-the-Match-2018_Seniors-1.pdf

20. Timpany G. Top-box score – deriving a new measure. QuestionPro. https://www.questionpro.com/blog/creating-a-top-box-score/

21. Design Kit. Extremes and mainstreams. [cited 2023 Oct 9]. https://www.designkit.org/methods/extremes-and-mainstreams.html

22. Design Kit. Create insight statements. [cited 2023 Sep 18]. https://www.designkit.org/methods/create-insight-statements.html

23. Designing an information and communications technology tool with and for victims of violence and their case managers in San Francisco: human-centered design study. JMIR Mhealth Uhealth. 2020;8(8):e15866. [cited 2024 Aug 12]. https://mhealth.jmir.org/2020/8/e15866

24. Bridge Innovate®. Turning insights into opportunities. 2017 [cited 2024 Aug 12]. https://www.bridgeinnovate.com/blog/2017/12/19/turning-insights-into-opportunities

25. Bui KVT. First-generation college students at a four-year university: background characteristics, reasons for pursuing higher education, and first-year experiences. Coll Student J. 2002;36(1):3.

26. AAMC Students & Residents. You can afford medical school. [cited 2023 Oct 18]. https://students-residents.aamc.org/financial-aid-resources/you-can-afford-medical-school

27. BeMo®. Medical schools with best financial aid in 2023. [cited 2023 Oct 18]. https://bemoacademicconsulting.com/blog/medical-schools-with-best-financial-aid

28. Hasnie A, Hasnie U, Nelson B, Aldana I, Estrada C, Williams W. Relationship between residency match distance from medical school and virtual application, school characteristics, and specialty competitiveness. Cureus. 2023;15(5):e38782.

29. Dorner FH, Burr RM, Tucker SL. The geographic relationships between physicians’ residency sites and the locations of their first practices. Acad Med. 1991;66(9):540.

30. Fallar R, Leikauf J, Dokun O, Anand S, Gliatto P, Mellman L, et al. Medical students’ experiences of unplanned leaves of absence. Med Sci Educ. 2019;29(4):1003–11.

31. Millo L, Ho N, Ubel PA. The cost of applying to medical school — a barrier to diversifying the profession. N Engl J Med. 2019;381(16):1505–8.

32. U.S. News. Should you work during medical school? Medical School Admissions Doctor. [cited 2024 Jul 27]. https://www.usnews.com/education/blogs/medical-school-admissions-doctor/articles/should-you-work-during-medical-school

33. American Medical Association. First year in medical school? Here’s your financial checklist. [cited 2024 Jul 27]. https://www.ama-assn.org/medical-students/medical-student-finance/first-year-medical-school-heres-your-financial-checklist

34. The White Coat Investor. Physician personal finance. [cited 2024 Jul 27]. https://www.whitecoatinvestor.com/personal-finance-for-doctors/

Download references

Acknowledgements

We thank the members of The Better Lab, including Devika Patel, Christiana Von Hippel, and Marianna Salvatori, for their support. We appreciate Pamela Derish (UCSF) for assistance in manuscript editing and the UCSF Clinical and Translational Science Institute (CTSI) for assistance in statistical analysis. This publication was supported by the National Center for Advancing Translational Sciences, National Institutes of Health, through UCSF-CTSI Grant Number UL1 TR001872. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.

Funding

Funding was not obtained for this project.

Author information

Adrienne Yang, Simone Langness and Lara Chehab contributed equally to this work.

Authors and Affiliations

Department of Surgery, University of California, San Francisco, CA, USA

Adrienne Yang, Lara Chehab & Amanda Sammann

Department of Trauma Surgery, Sharp HealthCare, San Diego, CA, USA

Simone Langness

Department of Pediatrics, Stanford University, Stanford, CA, USA

Nikhil Rajapuram

Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA


Contributions

A.Y. and L.C. wrote the main manuscript text and prepared the figures. S.L. created the study design. All authors reviewed the manuscript.

Corresponding author

Correspondence to Adrienne Yang.

Ethics declarations

Ethics approval and consent to participate

All activities conducted as part of this study were approved by the UCSF Institutional Review Board. Informed consent was obtained verbally from all participants prior to conducting interviews.

Consent for publication

We, the authors, consent to the publication of this manuscript.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary Material 1
Supplementary Material 2
Supplementary Material 3
Supplementary Material 4
Supplementary Material 5

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Yang, A., Langness, S., Chehab, L. et al. Medical students in distress: a mixed methods approach to understanding the impact of debt on well-being. BMC Med Educ 24, 947 (2024). https://doi.org/10.1186/s12909-024-05927-9


Received: 12 April 2024

Accepted: 19 August 2024

Published: 30 August 2024

DOI: https://doi.org/10.1186/s12909-024-05927-9


Pathological Characteristics, Management, and Prognosis of Rectal Neuroendocrine Tumors: A Retrospective Study from a Tertiary Hospital


1. Introduction
2. Materials and Methods
   Statistical Analysis
3. Results
   3.1. Demographics and Baseline Characteristics
   3.2. Diagnostic Work-Up
   3.3. Individual Patient Management
   3.4. Patient Outcomes and Survival Analysis
   3.5. Risk Factors for Disease Progression
4. Discussion
5. Conclusions
Author Contributions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest



Table: Demographics, interventions, tumor characteristics, and outcomes. Values are n (%) unless stated otherwise.

Demographics (n = 45)
  Male: 24 (53.3)
  Female: 21 (46.7)
  Age at diagnosis, mean ± SD (years): 57.5 ± 13.5
  BMI, mean ± SD (kg/m²): 24.6 ± 5.3
  Smoker: 7 (15.6)
  Incidental finding: 34 (75.6)
  Symptomatic: 11 (24.4)
    Rectal bleeding: 4 (8.9)
    Abdominal pain: 5 (11.1)
    Back pain: 1 (2.2)
    Mass presence: 1 (2.2)
    Constipation: 1 (2.2)
  Distance from anal verge, mean ± SD (cm): 7.0 ± 4.2

Intervention (n = 45)
  Endoscopic resection: 26 (57.8)
    Polypectomy: 15 (57.7)
    EMR: 8 (30.8)
    ESD: 3 (11.5)
  Surgical resection: 9 (20.0)
    TAMIS: 5 (55.6)
    LAR: 3 (33.3)
    Proctocolectomy: 1 (11.1)
  No intervention: 9 (20.0)

Tumor characteristics and treatment
  Size, mean ± SD (mm): 14.5 ± 16.8
    Size ≤ 1 cm: 26 (57.8)
    Size 1–2 cm: 8 (17.8)
    Size ≥ 2 cm: 11 (24.4)
  Endocrine function: 0 (0)
  Vascular invasion: 3 (6.7)
  Lymphatic invasion: 3 (6.7)
  Abnormal chromogranin A (>95 ng/mL): 5 (11.1)
  Tumor grade
    Grade 1 (Ki-67 ≤ 3%): 27 (60.0)
    Grade 2 (Ki-67 3–20%): 6 (13.3)
    Grade 3 (Ki-67 ≥ 20%): 12 (26.7)
  Tumor stage
    Stage 1: 28 (62.2)
    Stage 2: 1 (2.2)
    Stage 3: 5 (11.1)
    Stage 4: 11 (24.4)
  Resection margin R1: 3 (6.7)
  Resection margin R2: 2 (4.4)
  Somatostatin receptor antagonist: 8 (17.8)
  Peptide receptor radionuclide therapy: 3 (6.7)
  Chemotherapy: 14 (31.1)
  Everolimus/Sunitinib: 3 (6.7)

Outcomes
  Local recurrence: 4 (8.9)
  Disease progression: 13 (28.9)
  De novo metastases: 11 (24.4)
  Tumor-related mortality: 10 (22.2)
  Overall mortality: 11 (24.4)
  Follow-up (months), mean ± SD: 46.6 ± 41.0
  Overall survival (months), mean ± SD: 46.1 ± 41.0
  Disease-free survival (months), mean ± SD: 40.0 ± 39.4
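The follow-up, overall survival, and disease-free survival rows above are time-to-event summaries; outcomes like these are typically estimated with Kaplan-Meier methods rather than plain means. Below is a minimal sketch, assuming the Python lifelines library and entirely synthetic durations and event indicators; it is not the study's actual data or analysis code.

```python
# Minimal Kaplan-Meier sketch with lifelines (synthetic data, for
# illustration only; variable names are assumptions, not study records).
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)
n = 45
durations = rng.exponential(46.0, n)   # follow-up time in months (synthetic)
events = rng.integers(0, 2, n)         # 1 = event observed, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events, label="Overall survival")

print(kmf.median_survival_time_)       # median survival estimate
print(kmf.survival_function_.head())   # estimated S(t) over time
# kmf.plot_survival_function()         # optional: draw the KM curve
```

The fitted object exposes the survival function and a median survival estimate, which is usually a more robust summary than the mean when follow-up times are heavily censored.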
Table: Univariable and multivariable analysis of risk factors for disease progression. Multivariable results are reported as odds ratio (OR) with 95% CI, coefficient (B), standard error (SE), and p-value; dashes mark cells with no value in the source.

Variable | Univariable p-Value | OR (95% CI) | B | SE | p-Value
Tumor grade | – | 6.422 (0.129–318.540) | 1.860 | 1.992 | 0.350
Tumor stage | – | – | – | – | –
Symptomatic presentation | – | 10.929 (0.548–218.038) | 2.391 | 1.527 | 0.117
Distance from anal verge | 0.240 | – | – | – | –
Size | – | 0.863 (0.737–1.009) | −0.148 | 0.080 | 0.065
Ki-67 | – | 1.010 (0.933–1.094) | 0.010 | 0.041 | 0.798
Basal chromogranin | 0.335 | – | – | – | –
Positive margins (R1 or R2) | – | – | – | – | –
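An OR above 1 indicates higher odds of progression per unit increase in the predictor, and an OR below 1 indicates lower odds; for example, the size OR of 0.863 corresponds to roughly 14% lower odds per additional millimeter in this model, though its confidence interval crosses 1. As a hedged illustration of how a table like this is produced, the sketch below fits a multivariable logistic regression with Python's statsmodels; the variable names, outcome, and all data are synthetic assumptions for demonstration, not the study's records.

```python
# Sketch: multivariable logistic regression producing the columns reported
# above (B, SE, p-value, OR, 95% CI). Synthetic data; illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 45
df = pd.DataFrame({
    "tumor_grade": rng.integers(1, 4, n),          # grades 1-3 (assumed coding)
    "symptomatic": rng.integers(0, 2, n),          # 1 = symptomatic at diagnosis
    "size_mm": rng.normal(14.5, 16.8, n).clip(1),  # tumor size in mm
    "ki67": rng.uniform(0, 40, n),                 # Ki-67 index (%)
})
df["progression"] = rng.integers(0, 2, n)          # synthetic binary outcome

X = sm.add_constant(df[["tumor_grade", "symptomatic", "size_mm", "ki67"]])
model = sm.Logit(df["progression"], X).fit(disp=0)

table = pd.DataFrame({
    "B": model.params,           # log-odds coefficients (includes intercept row)
    "SE": model.bse,             # standard errors
    "p": model.pvalues,
    "OR": np.exp(model.params),  # odds ratios
})
ci = np.exp(model.conf_int())    # 95% CI exponentiated to the OR scale
table["CI 2.5%"], table["CI 97.5%"] = ci[0], ci[1]
print(table.round(3))
```

Exponentiating the coefficients and their confidence bounds is what converts the log-odds scale (B, SE) into the odds-ratio columns reported in such tables.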
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Cavalcoli, F.; Rausa, E.; Ferrari, D.; Rosa, R.; Maccauro, M.; Pusceddu, S.; Sabella, G.; Cantù, P.; Vitellaro, M.; Coppa, J.; et al. Pathological Characteristics, Management, and Prognosis of Rectal Neuroendocrine Tumors: A Retrospective Study from a Tertiary Hospital. Diagnostics 2024, 14, 1881. https://doi.org/10.3390/diagnostics14171881

Cavalcoli F, Rausa E, Ferrari D, Rosa R, Maccauro M, Pusceddu S, Sabella G, Cantù P, Vitellaro M, Coppa J, et al. Pathological Characteristics, Management, and Prognosis of Rectal Neuroendocrine Tumors: A Retrospective Study from a Tertiary Hospital. Diagnostics. 2024;14(17):1881. https://doi.org/10.3390/diagnostics14171881

Cavalcoli, Federica, Emanuele Rausa, Davide Ferrari, Roberto Rosa, Marco Maccauro, Sara Pusceddu, Giovanna Sabella, Paolo Cantù, Marco Vitellaro, Jorgelina Coppa, and et al. 2024. "Pathological Characteristics, Management, and Prognosis of Rectal Neuroendocrine Tumors: A Retrospective Study from a Tertiary Hospital" Diagnostics 14, no. 17: 1881. https://doi.org/10.3390/diagnostics14171881

