
Correlation Analysis – Types, Methods and Examples

Correlation Analysis

Correlation analysis is a statistical method used to evaluate the strength and direction of the relationship between two or more variables. The correlation coefficient ranges from -1 to 1.

  • A correlation coefficient of 1 indicates a perfect positive correlation. This means that as one variable increases, the other variable also increases.
  • A correlation coefficient of -1 indicates a perfect negative correlation. This means that as one variable increases, the other variable decreases.
  • A correlation coefficient of 0 means that there’s no linear relationship between the two variables.

Correlation Analysis Methodology

Conducting a correlation analysis involves a series of steps, as described below:

  • Define the Problem: Identify the variables that you think might be related. The variables must be measurable on an interval or ratio scale. For example, if you’re interested in studying the relationship between the amount of time spent studying and exam scores, these would be your two variables.
  • Data Collection: Collect data on the variables of interest. The data could be collected through various means such as surveys, observations, or experiments. It’s crucial to ensure that the data collected is accurate and reliable.
  • Data Inspection: Check the data for any errors or anomalies such as outliers or missing values. Outliers can greatly affect the correlation coefficient, so it’s crucial to handle them appropriately.
  • Choose the Appropriate Correlation Method: Select the correlation method that’s most appropriate for your data. If your data meets the assumptions for Pearson’s correlation (interval or ratio level, linear relationship, approximately normally distributed variables), use that. If your data is ordinal or doesn’t meet the assumptions for Pearson’s correlation, consider using Spearman’s rank correlation or Kendall’s Tau.
  • Compute the Correlation Coefficient: Once you’ve selected the appropriate method, compute the correlation coefficient. This can be done using statistical software such as R, Python, or SPSS, or manually using the formulas (a short Python sketch follows this list).
  • Interpret the Results: Interpret the correlation coefficient you obtained. If the correlation is close to 1 or -1, the variables are strongly correlated. If the correlation is close to 0, the variables have little to no linear relationship. Also consider the sign of the correlation coefficient: a positive sign indicates a positive relationship (as one variable increases, so does the other), while a negative sign indicates a negative relationship (as one variable increases, the other decreases).
  • Check the Significance: It’s also important to test the statistical significance of the correlation. This typically involves performing a t-test. A small p-value (commonly less than 0.05) suggests that the observed correlation is statistically significant and not due to random chance.
  • Report the Results: The final step is to report your findings. This should include the correlation coefficient, the significance level, and a discussion of what these findings mean in the context of your research question.
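To illustrate the computation and significance-check steps, here is a minimal Python sketch; the study-time and exam-score values are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: weekly study hours and exam scores for 8 students
study_time = np.array([2, 4, 5, 7, 8, 10, 12, 15])
exam_score = np.array([55, 60, 62, 70, 71, 80, 85, 92])

# Pearson's r and its two-sided p-value
r, p_value = stats.pearsonr(study_time, exam_score)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("The correlation is statistically significant at the 5% level.")
```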

Types of Correlation Analysis

Types of Correlation Analysis are as follows:

Pearson Correlation

This is the most common type of correlation analysis. Pearson correlation measures the linear relationship between two continuous variables. It assumes that the variables are approximately normally distributed and linearly related. The correlation coefficient (r) ranges from -1 to +1, with -1 indicating a perfect negative linear relationship, +1 indicating a perfect positive linear relationship, and 0 indicating no linear relationship.

Spearman Rank Correlation

Spearman’s rank correlation is a non-parametric measure that assesses how well the relationship between two variables can be described using a monotonic function. In other words, it evaluates the degree to which, as one variable increases, the other variable tends to increase, without requiring that increase to be consistent.

Kendall’s Tau

Kendall’s Tau is another non-parametric correlation measure used to detect the strength of dependence between two variables. Kendall’s Tau is often used for variables measured on an ordinal scale (i.e., where values can be ranked).

Point-Biserial Correlation

This is used when you have one dichotomous and one continuous variable, and you want to test for correlations. It’s a special case of the Pearson correlation.

Phi Coefficient

This is used when both variables are dichotomous or binary (having two categories). It’s a measure of association for two binary variables.

Canonical Correlation

This measures the correlation between two sets of variables. The method finds the linear combination of each set that maximizes the correlation between them.

Partial and Semi-Partial (Part) Correlations

These are used when the researcher wants to understand the relationship between two variables while controlling for the effect of one or more additional variables.

Cross-Correlation

Used mostly in time series data to measure the similarity of two series as a function of the displacement of one relative to the other.

Autocorrelation

This is the correlation of a signal with a delayed copy of itself as a function of delay. This is often used in time series analysis to help understand the trend in the data over time.

Correlation Analysis Formulas

There are several formulas for correlation analysis, each corresponding to a different type of correlation. Here are some of the most commonly used ones:

Pearson’s Correlation Coefficient (r)

Pearson’s correlation coefficient measures the linear relationship between two variables. The formula is:

   r = Σ[(xi – Xmean)(yi – Ymean)] / sqrt[(Σ(xi – Xmean)²)(Σ(yi – Ymean)²)]

  • xi and yi are the values of X and Y variables.
  • Xmean and Ymean are the mean values of X and Y.
  • Σ denotes the sum of the values.
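As an illustration, the formula above translates almost line for line into NumPy. This is only a sketch; in practice np.corrcoef or scipy.stats.pearsonr gives the same result.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r computed directly from the formula above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()              # deviations from the means
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Example with arbitrary values
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]))   # about 0.85
```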

Spearman’s Rank Correlation Coefficient (rs)

Spearman’s correlation coefficient measures the monotonic relationship between two variables. The formula is:

   rs = 1 – (6Σd² / n(n² – 1))

  • d is the difference between the ranks of corresponding variables.
  • n is the number of observations.

Kendall’s Tau (τ)

Kendall’s Tau is a measure of rank correlation. The formula is:

   τ = (nc – nd) / [½ n(n – 1)]

  • nc is the number of concordant pairs.
  • nd is the number of discordant pairs.
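Both rank-based coefficients are available in SciPy. A short sketch with made-up ranked data:

```python
from scipy import stats

# Hypothetical ranked data for eight observations
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

rho, p_rho = stats.spearmanr(x, y)
tau, p_tau = stats.kendalltau(x, y)

print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
print(f"Kendall tau  = {tau:.3f} (p = {p_tau:.3f})")
```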

Point-Biserial Correlation

The point-biserial correlation is a special case of Pearson’s correlation, and so it uses the same formula as Pearson’s correlation.

Phi Coefficient

The phi coefficient is a measure of association for two binary variables. It’s equivalent to Pearson’s correlation in this specific case.

Partial Correlation

The formula for partial correlation is more complex and depends on the Pearson’s correlation coefficients between the variables.

For partial correlation between X and Y given Z:

  rp(xy.z) = (rxy – rxz · ryz) / sqrt[(1 – rxz²)(1 – ryz²)]

  • rxy, rxz, ryz are the Pearson’s correlation coefficients.
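Here is a small sketch of that formula in Python, computing the three pairwise Pearson coefficients with NumPy and combining them; the x, y, and z data are simulated purely for illustration.

```python
import numpy as np

def partial_corr_xy_given_z(x, y, z):
    """Partial correlation between x and y, controlling for z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Simulated example: x and y are related only through z,
# so the partial correlation should be close to zero.
rng = np.random.default_rng(0)
z = rng.normal(size=200)
x = z + rng.normal(scale=0.5, size=200)
y = z + rng.normal(scale=0.5, size=200)
print(partial_corr_xy_given_z(x, y, z))
```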

Correlation Analysis Examples

Here are a few examples of how correlation analysis could be applied in different contexts:

  • Education : A researcher might want to determine if there’s a relationship between the amount of time students spend studying each week and their exam scores. The two variables would be “study time” and “exam scores”. If a positive correlation is found, it means that students who study more tend to score higher on exams.
  • Healthcare : A healthcare researcher might be interested in understanding the relationship between age and cholesterol levels. If a positive correlation is found, it could mean that as people age, their cholesterol levels tend to increase.
  • Economics : An economist may want to investigate if there’s a correlation between the unemployment rate and the rate of crime in a given city. If a positive correlation is found, it could suggest that as the unemployment rate increases, the crime rate also tends to increase.
  • Marketing : A marketing analyst might want to analyze the correlation between advertising expenditure and sales revenue. A positive correlation would suggest that higher advertising spending is associated with higher sales revenue.
  • Environmental Science : A scientist might be interested in whether there’s a relationship between the amount of CO2 emissions and average temperature increase. A positive correlation would indicate that higher CO2 emissions are associated with higher average temperatures.

Importance of Correlation Analysis

Correlation analysis plays a crucial role in many fields of study for several reasons:

  • Understanding Relationships : Correlation analysis provides a statistical measure of the relationship between two or more variables. It helps in understanding how one variable may change in relation to another.
  • Predicting Trends : When variables are correlated, changes in one can predict changes in another. This is particularly useful in fields like finance, weather forecasting, and technology, where forecasting trends is vital.
  • Data Reduction : If two variables are highly correlated, they are conveying similar information, and you may decide to use only one of them in your analysis, reducing the dimensionality of your data.
  • Testing Hypotheses : Correlation analysis can be used to test hypotheses about relationships between variables. For example, a researcher might want to test whether there’s a significant positive correlation between physical exercise and mental health.
  • Determining Factors : It can help identify factors that are associated with certain behaviors or outcomes. For example, public health researchers might analyze correlations to identify risk factors for diseases.
  • Model Building : Correlation is a fundamental concept in building multivariate statistical models, including regression models and structural equation models. These models often require an understanding of the inter-relationships (correlations) among multiple variables.
  • Validity and Reliability Analysis : In psychometrics, correlation analysis is used to assess the validity and reliability of measurement instruments such as tests or surveys.

Applications of Correlation Analysis

Correlation analysis is used in many fields to understand and quantify the relationship between variables. Here are some of its key applications:

  • Finance : In finance, correlation analysis is used to understand the relationship between different investment types or the risk and return of a portfolio. For example, if two stocks are positively correlated, they tend to move together; if they’re negatively correlated, they move in opposite directions.
  • Economics : Economists use correlation analysis to understand the relationship between various economic indicators, such as GDP and unemployment rate, inflation rate and interest rates, or income and consumption patterns.
  • Marketing : Correlation analysis can help marketers understand the relationship between advertising spend and sales, or the relationship between price changes and demand.
  • Psychology : In psychology, correlation analysis can be used to understand the relationship between different psychological variables, such as the correlation between stress levels and sleep quality, or between self-esteem and academic performance.
  • Medicine : In healthcare, correlation analysis can be used to understand the relationships between various health outcomes and potential predictors. For example, researchers might investigate the correlation between physical activity levels and heart disease, or between smoking and lung cancer.
  • Environmental Science : Correlation analysis can be used to investigate the relationships between different environmental factors, such as the correlation between CO2 levels and average global temperature, or between pesticide use and biodiversity.
  • Social Sciences : In fields like sociology and political science, correlation analysis can be used to investigate relationships between different social and political phenomena, such as the correlation between education levels and political participation, or between income inequality and social unrest.

Advantages and Disadvantages of Correlation Analysis

Advantages:

  • Provides a statistical measure of the relationship between variables.
  • Useful for prediction if variables are known to have a correlation.
  • Can help in hypothesis testing about the relationships between variables.
  • Can help in data reduction by identifying closely related variables.
  • Fundamental concept in building multivariate statistical models.
  • Helps in validity and reliability analysis in psychometrics.

Disadvantages:

  • Cannot establish causality, only association.
  • Can be misleading if important variables are left out (omitted variable bias).
  • Outliers can greatly affect the correlation coefficient.
  • Assumes a linear relationship in Pearson correlation, which may not always hold.
  • May not capture complex relationships (e.g., quadratic or cyclical relationships).
  • Correlation can be affected by the range of observed values (restriction of range).


What is correlation analysis?


Correlation analysis is a staple of data analytics. It’s a commonly used method to measure the relationship between two variables. It helps researchers understand the extent to which changes to the value in one variable are associated with changes to the value in the other. 

Correlations are often misused and misunderstood, especially in the insight industry. Below is a helpful guide to help you understand the basics and mechanics of correlation analysis. 


  • Definition of correlation analysis

Correlation analysis, also known as bivariate analysis, is a statistical test primarily used to identify and explore linear relationships between two variables and then determine the strength and direction of that relationship. It’s mainly used to spot patterns within datasets.

It’s worth noting that correlation doesn't equate to causation. In essence, one cannot infer a cause-and-effect relationship between the two types of data with correlation analysis. However, you can determine the relationship's size, degree, and direction. 

  • Strength of the correlation

The degree of association in correlation analysis is measured by a correlation coefficient. The Pearson correlation, which is denoted by r , is the most commonly used coefficient. The correlation coefficient quantifies the degree of linear association between two variables and can take values between -1 and +1.

No correlation: This is when the value of r is zero.

Low degree: A small correlation is when the value of r lies below ±0.29.

Moderate degree: If the value of the correlation coefficient is between ±0.30 and ±0.49, then there’s a medium correlation.

High degree: When the correlation coefficient takes a value between ±0.50 and ±1, it indicates a strong correlation.

Perfect: A perfect correlation occurs when the value of r is exactly ±1, indicating that as one variable increases, the other variable either increases (if positive) or decreases (if negative) in exact proportion.

  • Direction of the correlation

You can also identify the direction of the linear relationship between two variables by the correlation coefficient's sign. 

Positive correlation

Scores from +0.5 to +1 indicate a strong positive correlation, meaning that as one variable increases, the other tends to increase as well.

Negative correlation

Scores from -0.5 to -1 indicate a strong negative correlation, meaning that as one variable increases, the other tends to decrease.

No correlation

If the correlation coefficient is 0, it means there’s no correlation or relationship between the two variables being analyzed. It's worth noting that increasing the sample size can lead to more precise and accurate results.

  • Significance of the correlation

Once we learn about the strength and direction of the correlation, it’s critical to evaluate whether the observed correlation is likely to have occurred by chance or whether it’s a real relationship between the two variables. Therefore, we need to test the correlation for significance. The most common method for determining the significance of a correlation coefficient is by conducting a hypothesis test. 

The hypothesis test (t-test) helps us decide whether the value of the population correlation coefficient ρ is "close to zero" or "significantly different from zero." We decide this based on the sample correlation coefficient ( r ) and the sample size (n). 

As with other hypothesis tests, the significance level is set first, generally at 5%. If the t-test yields a p-value below 5%, we can conclude that the correlation coefficient is significantly different from zero. Furthermore, we simply say that the correlation coefficient is "significant." Otherwise, we wouldn’t have enough evidence to conclude that there’s a true linear relationship between the two variables.

In general, the larger the correlation coefficient ( r ) and sample size (n), the more likely it is that the correlation is statistically significant. However, it's important to remember that a significant correlation doesn’t necessarily imply causation between the two variables. 
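The t-test described above can be written out directly. Below is a sketch that converts a sample correlation r and sample size n into a t statistic and a two-sided p-value; the r = 0.45, n = 30 values are purely illustrative.

```python
import numpy as np
from scipy import stats

def corr_significance(r, n):
    """Two-sided t-test of H0: the population correlation is zero."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p

t, p = corr_significance(r=0.45, n=30)
print(f"t = {t:.2f}, p = {p:.4f}")
```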

  • What factors affect a correlation analysis?

Below are the factors you must consider when arranging a correlation analysis:

Performing a correlation analysis is only appropriate if there’s evidence of a linear relationship between the quantitative variables. You can use a scatter plot to assess linearity. If you can’t draw a straight line between the points, a correlation analysis isn’t recommended.

Always examine a scatter plot of the data first, since it helps you spot outliers, heteroscedasticity, and non-linear relationships.

Avoid computing correlations when the data are repeated measures of the same variable taken from the same individuals at the same or different time points.

The required sample size should be determined a priori.

  • Uses of correlation analysis

Correlation analysis is primarily used to quantify the degree to which two variables relate. By using correlation analysis, researchers evaluate the correlation coefficient that tells them to what degree one variable changes when the other changes too. It provides researchers with a linear relationship between two variables. 

Correlation analysis is used by marketers to evaluate the efficiency of a marketing campaign by monitoring and analyzing customers' reactions to various marketing tactics. As such, they can better understand and serve their customers. 

Another use of correlation analysis is among data scientists and experts tasked with data monitoring. They can use correlation analysis for root cause analysis and to minimize time to detection (TTD) and time to remediation (TTR).

When two anomalies or unusual events occur simultaneously or at the same rate, correlating them can help identify the exact cause of an issue. As a result, users incur a lower cost from the issue if it can be understood and fixed quickly with the help of correlation analysis.

  • What is the business value of correlation analysis?

Correlation analysis has numerous business values, including identifying potential inputs for more complex analyses and testing for future changes while holding other factors constant. 

Additionally, businesses can use correlation analysis to understand the relationship between two variables. This type of analysis is easy to interpret and comprehend, as it focuses on how variation in one variable relates to variation in another.

One of the primary business values of correlation analysis is its ability to identify hidden issues within a company. For example, if there’s a positive correlation between customers looking at reviews for a particular product and whether or not they purchase it, this could indicate a place where testing can provide more information. 

By testing whether increasing the number of people who look at positive product reviews leads to an increase in purchases, businesses can develop hypotheses to improve their products and services.

Correlation analysis can also help businesses diagnose problems with multiple regression models. For instance, if a multivariate or multiple regression model isn’t producing the expected results or if independent variables are not truly independent, correlation analysis can help discover these issues.

In digital environments, correlations can be especially helpful in fueling different hypotheses that can then be rapidly tested. This is because the testing can be low risk and not require a significant investment of time or money. 

With the abundance of data available to businesses, they must be careful in selecting the variables they’ll analyze. By doing so, they can uncover previously hidden relationships between variables and gain insights that can help them make data-driven decisions. 

  • Correlation ≠ causation

As previously stated, correlation doesn't strictly imply causation, even when you identify a significant relationship by correlation analysis techniques. You can’t determine the cause by the analysis.

The significant relationship implies that there’s much more to comprehend. Additionally, it implies that there are underlying and extraneous factors that you must further explore to look for a cause. Despite the possibility of a causal relationship existing, it would be irresponsible for researchers to utilize the correlation results as proof of such existence. 

  • Example of correlation analysis

A real-life example of correlation analysis is health improvement vs. medical dose reductions. Medical researchers can use a correlation study in clinical trials to better comprehend how a newly-developed drug impacts patients. 

If a patient's health improves due to taking the drug regularly, there’s a positive correlation. Conversely, if the patient's health deteriorates or doesn't improve, there’s no correlation between the two variables (health and the drug).

What is the difference between correlation and correlation analysis?

Correlation shows us the direction and strength of a relationship between two variables. It’s expressed numerically by the correlation coefficient. Correlation analysis, on the other hand, is a statistical test that reveals the relationship between two variables/datasets.

What are correlation and regression?

Regression and correlation are the most popular methods used to examine the linear relationship between two quantitative variables. Correlation measures how strong the relationship is between a pair of variables, while regression is used to describe the relationship as an equation. 

What is the purpose of correlation?

Correlation analysis can help you to identify possible inputs for a more refined analysis. You can also use it to test for future changes while holding other things constant. The whole purpose of using correlations in research is to determine which variables are connected.



What Is Correlation Analysis? Definition, Examples, & More


Collecting. Organizing. Analyzing. Interpreting. Presenting. Those are the key steps in statistical analysis. However, each of these steps might differ depending on the specific objectives (and complexity) of the analysis being conducted.

With correlation analysis, the emphasis is on assessing whether two (or more) variables are related and, if so, how strongly. Though this statistical method is employed rather frequently, it’s still often misinterpreted.

With this in mind, let’s set the record straight and clarify the process, purpose, and applications of correlation analysis in statistics.

Defining Correlation Analysis in Statistics

As previously mentioned, correlation analysis is used to measure and describe the strength of the relationship between two variables. That is, of course, after determining whether the said linear relationship even exists. 

What correlation analysis doesn’t do, however, is make any statements about cause and effect between the two variables. Its purpose is simply to gauge the level of change in one variable associated with changes in another variable (without implying causation).

In this sense, there can be three possible results of correlation analysis.

1. No correlation. Changes in one variable are not associated with changes in the other.

2. Positive correlation. With two positively correlated variables, when one variable increases, the other does, too.

3. Negative correlation. When one variable increases, the other variable tends to decrease.

Why Is Correlation Analysis Helpful?

Due to its nature, correlation analysis serves as an excellent starting point for any research . From there, it can help researchers easily spot trends, make predictions, and identify patterns.

But that’s not all. These insights aren’t drawn just for the sake of research. Their goal is to fuel the single most important aspect of decision-making – informed choice.

Let’s take healthcare as an example. Determining a negative correlation between higher physical activity levels and lower heart disease risk can prompt healthcare providers to emphasize exercise in preventative care programs. Similarly, identifying a positive correlation between smoking and lung cancer risk can lead public health officials to prioritize anti-smoking campaigns and policies. 


How to Measure Correlation

This guide has already mentioned the three possible outcomes of the correlation analysis. But how do you actually come to this conclusion? In other words, how do you measure the correlation between two variables? Here’s a brief step-by-step guide to answer this question.

Step 1 - Write a Survey

Naturally, the first step is to collect the necessary data for the two variables of interest. Let’s say you’re conducting market research on customer satisfaction and product reviews. The process of gathering data would start with designing a survey to collect responses on both variables.

Step 2 - Program the Survey

For Step 2, you’ll need to use a program to test out the survey and ensure all the questions function as intended. Why? Because you can’t allow any technical issues or mislabeled scales to compromise the validity of data. If you do so, your data will be tainted and, thus, unusable.

Step 3 - Clean the Data

After reaching the target number of responses to your survey, you’ve officially collected the necessary data. Congratulations! But be careful. It’s still not time for the analysis. You must first clean the data, i.e., identify and correct errors, remove duplicates, handle missing values, and ensure consistency in formatting. Only then can you rest assured the integrity of your data is protected. 

Step 4 - Analyze the Relationships Between the Two Variables

This is where the correlation analysis actually takes place. How? By employing the Pearson correlation coefficient and the Spearman rank correlation methods. Learn more about these methods in the next section. 
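As a rough sketch of Step 4 (the column names and responses below are hypothetical), the two survey variables can be correlated straight from a pandas DataFrame:

```python
import pandas as pd

# Hypothetical survey responses on 1-10 scales
df = pd.DataFrame({
    "satisfaction": [7, 8, 5, 9, 6, 8, 4, 9],
    "review_score": [6, 9, 5, 8, 6, 7, 3, 9],
})

# Pearson by default; pass method="spearman" for rank-based data
r = df["satisfaction"].corr(df["review_score"])
print(f"Correlation between satisfaction and review score: {r:.2f}")
```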

Coefficients to Use for Correlation

Pearson’s correlation coefficient and Spearman’s rank correlation coefficient aren’t the only correlation coefficients to be used for correlation analysis. However, they are the most common ones. Here’s a brief overview of how each correlation coefficient works and when to use it.

Pearson Correlation Coefficient

The Pearson correlation coefficient, labeled as an “r,” is the most widely used correlation coefficient and the one researchers typically try out first. This correlation coefficient measures the linear relationship between two continuous variables that are also normally distributed.

The correlation coefficient calculated using this method will be a numerical value between -1 and +1 (as it always happens in correlation analysis). The -1 value indicates a perfect negative linear relationship, while the +1 value means a perfect positive relationship. The value of 0 means, of course, that no correlation exists.

An example of a linear correlation between two variables is a child’s height increasing with age.


Spearman Rank Coefficient

Does your data display a non-linear relationship? If so, Spearman’s rank correlation coefficient is the way to go. This correlation coefficient, denoted with a “ρ,” assesses how well the relationship between two variables can be described using a monotonic function.

A monotonic function is one that either never decreases or never increases as its variable increases. This means that the direction of the relationship between the two variables is consistent but not necessarily linear. A good example of this kind of relationship is students’ ranks in different subjects.

How to Interpret Correlation Analysis

This guide has already discussed what the values of “-1,” “0,” and “1” mean. But these values are the so-called perfect values. The rest of the received values can be interpreted as follows:

  • |r| = 0.00–0.29: Weak relationship
  • |r| = 0.30–0.49: Moderate relationship
  • |r| = 0.50–0.79: Strong relationship

Anything above 0.79 is generally considered a very strong relationship, with exactly ±1 being a perfect one.

Correlation Analysis Practical Examples

By now, you’ve already learned some practical examples of correlation analysis. Here are a few more to illustrate its wide-ranging applications:

- Education: Analyzing the amount of time you study and your GPA

- Marketing : Investigating the relationship between advertising spend and sales revenue

- Finance : Examining the correlation between interest rates and stock market returns

Make the Most of Correlation Analysis with Julius AI

Correlation analysis can be tricky to perform, especially if you lack the time, experience, or skills to ensure an accurate interpretation of data. If this is the case, you can always outsource this task to your personal data analyst – the AI-powered Julius AI. You only need to input your data, and Julius AI will take care of the rest. Achieve maximum efficiency without the data-crunching headache.


Correlation – Connecting the Dots, the Role of Correlation in Data Analysis


Correlation is a fundamental concept in statistics and data science. It quantifies the degree to which two variables are related. But what does this mean, and how can we use it to our advantage in real-world scenarios? Let’s dive deep into understanding correlation, how to measure it, and its practical implications.


In this blog post we will learn:

  • What is Correlation?
  • Importance of Correlation in Data Science
  • How to Measure Correlation (types of correlation, the Pearson correlation coefficient, its formula, explanation, and interpretation)
  • Calculating Correlation Using Python (visualizing correlations, testing for significance, handling multiple correlations, heatmaps, and non-linear correlations)
  • Difference Between Correlation and Causation

1. What is Correlation?

Correlation refers to a statistical measure that represents the strength and direction of a linear relationship between two variables. If you’ve ever wondered if one event or variable has a relationship with another, you’re thinking about correlation. For instance, does the number of hours you study correlate with your exam scores?

2. Importance of Correlation in Data Science

Understanding correlations can help data scientists:

  • Discover relationships between variables.
  • Determine important variables for predictive modeling.
  • Uncover underlying patterns in data.
  • Make better business decisions by understanding key drivers.

3. How to Measure Correlation?

The most common measure of correlation is the Pearson correlation coefficient, often denoted as ‘r’. Its values range between -1 and 1. Here’s what these values indicate:

  • 1 or -1: Perfect correlation; 1 is positive, -1 is negative.
  • 0: No correlation.
  • Between 0 and ±1: Varying degrees of correlation, with strength increasing as it approaches ±1.

3.1. Types of Correlation

Positive Correlation:
  • Value: $r$ is between 0 and +1.
  • Meaning: When one variable increases, the other also increases, and when one decreases, the other also decreases.
  • Graphically, a positive correlation will generally display a line of best fit that slopes upwards.

Negative Correlation:
  • Value: $r$ is between 0 and -1.
  • Meaning: When one variable increases, the other decreases, and vice versa.
  • Graphically, a negative correlation will typically show a line of best fit that slopes downwards.

No Correlation (Zero Correlation):
  • Value: $r$ is approximately 0.
  • Meaning: Changes in one variable do not predict any particular change in the other variable. They move independently of each other.
  • Graphically, data with no correlation will appear scattered with no discernible pattern or trend.


3.2. Pearson Correlation Coefficient

The Pearson correlation coefficient, often denoted as $r$, quantifies the linear relationship between two variables. Let’s delve into its formula and understand its significance.

3.3. Formula:

Given two variables, $X$ and $Y$, with data points $x_1, x_2, …, x_n$ and $y_1, y_2, …, y_n$ respectively, the Pearson correlation coefficient, $r$, is formulated as:

$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} $

where $\bar{x}$ represents the mean of the $x$ values and $\bar{y}$ represents the mean of the $y$ values.

3.4. Explanation:

  • The numerator, $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$, sums up the product of the deviations of each data point from their respective means. This evaluates whether the deviations of one variable coincide with the deviations of the other.
  • The denominator normalizes the coefficient, ensuring that $r$ remains between -1 and 1. The terms $\sum_{i=1}^{n} (x_i - \bar{x})^2$ and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ sum the squared deviations of each data point from its mean for $X$ and $Y$ respectively.
  • $r = 1$: Indicates a perfect positive linear relationship between $X$ and $Y$.
  • $r = -1$: Signifies a perfect negative linear relationship between $X$ and $Y$.
  • $r = 0$: Suggests no evident linear trend between the variables.

3.5. Interpretation:

Envision plotting the data points of $X$ and $Y$ on a scatter plot. The Pearson correlation provides insight into how closely these points cluster around a straight line.


  • An $r$ value near 1 implies that as $X$ elevates, $Y$ also tends to rise, resulting in an upward trending line.
  • An $r$ value nearing -1 indicates that as $X$ escalates, $Y$ generally diminishes, yielding a downward trending line.
  • A value approaching 0 indicates no discernible linear trend between the variables.

However, a crucial note is that correlation doesn’t signify causation. A strong correlation doesn’t necessarily indicate that one variable caused the other.

4. Calculate Correlation Using Python

Let’s assume you’re a teacher who wants to understand if there’s a relationship between the hours a student studies and their exam scores.

Scenario: You have data on 5 students: hours studied and their corresponding exam scores.
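A minimal sketch of this scenario in Python, with made-up values for the five students:

```python
import numpy as np

# Made-up data for the 5 students
hours_studied = np.array([1, 2, 3, 4, 5])
exam_scores = np.array([52, 60, 69, 78, 88])

r = np.corrcoef(hours_studied, exam_scores)[0, 1]
print(f"Correlation between hours studied and exam scores: {r:.2f}")
```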

This output suggests a strong positive correlation between study hours and exam scores.

4.1. Visualize Correlations

A scatter plot is a common way to visualize the relationship between two variables.
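For example, plotting the same made-up data with matplotlib:

```python
import matplotlib.pyplot as plt

# Same made-up data as above
hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 60, 69, 78, 88]

plt.scatter(hours_studied, exam_scores)
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Study hours vs. exam scores")
plt.show()
```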


4.2. Test for Significance in Correlation

It helps to determine if the observed correlation is statistically significant. This means we’re reasonably sure the correlation is real and not due to chance.
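A sketch using SciPy's pearsonr, which returns both the coefficient and its p-value (same made-up data as above):

```python
from scipy import stats

hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 60, 69, 78, 88]

r, p_value = stats.pearsonr(hours_studied, exam_scores)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
```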

If the p-value is below a threshold (commonly 0.05), the correlation is considered statistically significant.

4.3. Handle Multiple Correlations

In real-world datasets, you might want to check correlations between multiple variables. This can be done using a correlation matrix.
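With pandas, the pairwise correlations of several (here hypothetical) numeric columns come from a single call:

```python
import pandas as pd

# Hypothetical dataset with several numeric columns
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score":    [52, 60, 69, 78, 88, 90],
    "hours_slept":   [8, 7, 7, 6, 6, 5],
})

corr_matrix = df.corr()   # pairwise Pearson correlations
print(corr_matrix)
```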

4.4. Visualizing the Correlation Matrix with a Heatmap

Visualizing multiple correlations using a heatmap is a common and insightful way to quickly grasp relationships between multiple variables in a dataset. We’ll use Python libraries such as pandas and seaborn to display these correlations.

For larger datasets, visualizing this matrix as a heatmap can be insightful.
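Here is a sketch of such a heatmap with seaborn, using the parameters described in the list below; the hypothetical DataFrame mirrors the previous snippet.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset (same columns as the previous snippet)
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score":    [52, 60, 69, 78, 88, 90],
    "hours_slept":   [8, 7, 7, 6, 6, 5],
})

sns.heatmap(
    df.corr(),
    annot=True,        # print the correlation values in each cell
    cmap="coolwarm",   # diverging color palette
    linewidths=0.5,    # thin dividing lines between cells
    vmin=-1, vmax=1,   # anchor the colormap so 0 sits at the center
)
plt.show()
```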


  • annot=True ensures that the correlation values appear on the heatmap.
  • cmap specifies the color palette. In this case, we’ve chosen ‘coolwarm’, but there are various palettes available in seaborn.
  • linewidths determines the width of the lines that will divide each cell.
  • vmin and vmax are used to anchor the colormap, ensuring that the center is set at a meaningful value.

4.5. How to Account for Non-Linear Correlations?

Pearson’s correlation coefficient captures linear relationships. But what if the relationship is curved or nonlinear? Enter Spearman’s rank correlation. It’s based on ranked values rather than raw data.
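A brief sketch contrasting the two coefficients on a curved but perfectly monotonic relationship (y = x³, with x values chosen only for illustration):

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11)
y = x ** 3                 # monotonic but clearly non-linear

print(f"Pearson  r   = {stats.pearsonr(x, y)[0]:.3f}")   # below 1: misses the curvature
print(f"Spearman rho = {stats.spearmanr(x, y)[0]:.3f}")  # exactly 1: perfect monotonic link
```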

5. Difference Between Correlation and Causation?

It’s vital to note that correlation does not imply causation. Just because two variables are correlated doesn’t mean one caused the other. Using our example, while hours studied and exam scores are correlated, it doesn’t mean studying longer always causes better scores. Other factors might play a role.

6. Conclusion

Correlation is a powerful tool in data science, offering insights into relationships between variables. But it’s crucial to use it judiciously and remember that correlation doesn’t equate to causation. Python, with its rich library ecosystem, provides many tools and methods to efficiently calculate, visualize, and interpret correlations.

The key is to understand the data, choose the appropriate correlation measure, and always be aware of the underlying assumptions.


Correlation

What is correlation?

Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It’s a common tool for describing simple relationships without making a statement about cause and effect.

How is correlation measured?

The sample correlation coefficient, r , quantifies the strength of the relationship. Correlations are also tested for statistical significance.

What are some limitations of correlation analysis?

Correlation can’t look at the presence or effect of other variables outside of the two being explored. Importantly, correlation doesn’t tell us about cause and effect . Correlation also cannot accurately describe curvilinear relationships.

Correlations describe data moving together

Correlations are useful for describing simple relationships among data. For example, imagine that you are looking at a dataset of campsites in a mountain park. You want to know whether there is a relationship between the elevation of the campsite (how high up the mountain it is), and the average high temperature in the summer.

For each individual campsite, you have two measures: elevation and temperature. When you compare these two variables across your sample with a correlation, you can find a linear relationship: as elevation increases, the temperature drops. They are negatively correlated .

What do correlation numbers mean?

We describe correlations with a unit-free measure called the correlation coefficient which ranges from -1 to +1 and is denoted by r . Statistical significance is indicated with a p-value. Therefore, correlations are typically written with two key numbers: r =  and p = .

  • The closer r is to zero, the weaker the linear relationship.
  • Positive r values indicate a positive correlation, where the values of both variables tend to increase together.
  • Negative r values indicate a negative correlation, where the values of one variable tend to increase when the values of the other variable decrease.
  • The p-value gives us evidence that we can meaningfully conclude that the population correlation coefficient is likely different from zero, based on what we observe from the sample.
  • "Unit-free measure" means that correlations exist on their own scale: in our example, the number given for r is not on the same scale as either elevation or temperature. This is different from other summary statistics. For instance, the mean of the elevation measurements is on the same scale as its variable.

What is a p-value?

A p-value is a measure of probability used for hypothesis testing.

It is the probability of obtaining test results equal to or more extreme than what was observed, assuming that no effect is actually present – in other words, assuming that the null hypothesis is true. For our campsite data, the null hypothesis is that there is no linear relationship between elevation and temperature. A small p-value suggests that the observed data is unlikely under the null hypothesis. When a p-value is used to describe a result as statistically significant, this means that it falls below a pre-defined cutoff (e.g., p <.05 or p <.01) at which point we reject the null hypothesis in favor of an alternative hypothesis (for our campsite data, that there  is  a relationship between elevation and temperature).

Once we’ve obtained a significant correlation, we can also look at its strength. A perfect positive correlation has a value of 1, and a perfect negative correlation has a value of -1. But in the real world, we would never expect to see a perfect correlation unless one variable is actually a proxy measure for the other. In fact, seeing a perfect correlation number can alert you to an error in your data! For example, if you accidentally recorded distance from sea level for each campsite instead of temperature, this would correlate perfectly with elevation.

Another useful piece of information is the N, or number of observations. As with most statistical tests, knowing the size of the sample helps us judge the strength of our sample and how well it represents the population. For example, if we only measured elevation and temperature for five campsites, but the park has two thousand campsites, we’d want to add more campsites to our sample.

Visualizing correlations with scatterplots

Back to our example from above: as campsite elevation increases, temperature drops. We can look at this directly with a scatterplot. Imagine that we’ve plotted our campsite data:

  • Each point in the plot represents one campsite, which we can place on an x- and y-axis by its elevation and summertime high temperature.
  • The correlation coefficient ( r ) also illustrates our scatterplot. It tells us, in numerical terms, how close the points mapped in the scatterplot come to a linear relationship. Stronger relationships, or bigger r values, mean relationships where the points are very close to the line which we’ve fit to the data.


What about more complex relationships?

Scatterplots are also useful for determining whether there is anything in our data that might disrupt an accurate correlation, such as unusual patterns like a curvilinear relationship or an extreme outlier.

Correlations can’t accurately capture curvilinear relationships. In a curvilinear relationship, variables are correlated in a given direction until a certain point, where the relationship changes.

For example, imagine that we looked at our campsite elevations and how highly campers rate each campsite, on average. Perhaps at first, elevation and campsite ranking are positively correlated, because higher campsites get better views of the park. But at a certain point, higher elevations become negatively correlated with campsite rankings, because campers feel cold at night!


We can get even more insight by adding shaded density ellipses to our scatterplot. A density ellipse illustrates the densest region of the points in a scatterplot, which in turn helps us see the strength and direction of the correlation.

Density ellipses can be various sizes. One common choice for examining correlation is a 95% density ellipse, which captures approximately the densest 95% of the observations. If two variables are moving together, like our campsites’ elevation and temperature, we would expect to see this density ellipse mirror the shape of the line. And we can see that in a curvilinear relationship, the density ellipse looks round: a correlation won’t give us a meaningful description of this relationship.


Correlation in Statistics: Correlation Analysis Explained

  • What is Correlation?


Correlation is used to test relationships between quantitative variables or categorical variables . In other words, it’s a measure of how things are related. The study of how variables are correlated is called correlation analysis.

Some examples of data that have a high correlation:

  • Your caloric intake and your weight.
  • Your eye color and your relatives’ eye colors.
  • The amount of time you study and your GPA.

Some examples of data that have a low correlation (or none at all):

  • Your sexual preference and the type of cereal you eat.
  • A dog’s name and the type of dog biscuit they prefer.
  • The cost of a car wash and how long it takes to buy a soda inside the station.

Correlations are useful because if you can find out what relationship variables have, you can make predictions about future behavior . Knowing what the future holds is very important in the social sciences like government and healthcare. Businesses also use these statistics for budgets and business plans.

A correlation coefficient is a way to put a value to the relationship. Correlation coefficients have a value of between -1 and 1. A “0” means there is no relationship between the variables at all, while -1 or 1 means that there is a perfect negative or positive correlation (negative or positive correlation here refers to the type of graph the relationship will produce).


The most common correlation coefficient is the Pearson Correlation Coefficient . It’s used to test for linear relationships between data. In AP stats or elementary stats, the Pearson is likely the only one you’ll be working with. However, you may come across others, depending upon the type of data you are working with. For example, Goodman and Kruskal’s lambda coefficient is a fairly common coefficient. It can be symmetric , where you do not have to specify which variable is dependent, and asymmetric where the dependent variable is specified.

Correlation in Excel 2013

If you’re familiar with entering functions in Excel, you can enter the CORREL function directly:

   =CORREL(array 1, array 2)

For example: =CORREL(A2:A6,B2:B6)

However, the Data Analysis Toolpak is much easier overall, because you don’t have to remember (or hunt for) an array of functions; they are all listed in the Data Analysis dialog. If Data Analysis isn’t showing to the far right of the Data tab, make sure you have loaded the Data Analysis Toolpak. The Data Analysis Toolpak is an optional add-in to Excel which gives you access to many functions, including:

  • Correlation,
  • Linear Regression ,
  • Histograms ,
  • ANOVA one way and two way tests.

Step 1: Type your data into a worksheet in Excel. The best format is two columns. Place your x-values in column A and your y-values in column B.

Step 2: Click the “Data” tab and then click “Data Analysis.”

Step 3: Click “Correlation” and then click “OK.”

Step 4: Type the location for your x-y variables in the Input Range box. Or, use your cursor to highlight the area where your variables are located.

Step 5: Click either the “columns” or “rows” option to let Excel know how your data is laid out. In most cases, you’ll click “columns” as that’s the standard way to lay out data in Excel.

Step 6: Check the “Labels in first row” if you have column headers.

Step 7: Click the “Output Range” text box and then select an area on the worksheet where you want your output to go.

That’s it!



Correlation Analysis

Y. Z. Ma

Correlation is a fundamental tool for multivariate data analysis. Most multivariate statistical methods use correlation as a basis for data analytics. Machine learning methods are also impacted by correlations in data. With today’s big data, the role of correlation becomes increasingly important. Although the basic concept of correlation is simple, it has many complexities in practice. Many may know the common saying “correlation is not causation”, but the statement “a causation does not necessarily lead to correlation” is much less known or even debatable. This chapter presents uses and pitfalls of correlation analysis for geoscience applications.

Nothing ever exists entirely alone; everything is in relation to everything else. Buddha



Appendix 4.1 Probabilistic Definitions of Mean, Variance and Covariance

Here we present the probabilistic definitions of the most common statistical parameters: mean, variance, correlation and covariance. Knowing these definitions will facilitate understanding of many statistical and geostatistical methods. For practical purposes, the mathematical expectation operator can be thought of as "averaging", applicable to all the situations discussed below.

For a random variable X, its mean (often termed expected value in probability) is defined by

E(X) = m_X = ∫ x f(x) dx  (or, for a discrete variable, E(X) = Σᵢ xᵢ f(xᵢ))    (Eq. 4.9)

where f(x) is the probability density function. Physically, it is the frequency of occurrence for every state, xᵢ, of the random variable X. The mean in Eq. 4.9 can be interpreted as a weighted average in which the frequency of occurrence is the weighting. This is also the foundation for the weighted mean discussed in Chap. 3; the only difference is that the weighting in Eq. 4.9 is the frequencies of the values xᵢ, whereas the weighted mean in the spatial setting discussed in Chap. 3 is defined by the geometrical patterns related to sampling (although its underpinning is still the frequency).

The variance of X is defined by

Var(X) = σ_X² = E[(X − m_X)²] = ∫ (x − m_X)² f(x) dx    (Eq. 4.10)

where m_X or E(X) is the mean or expected value of X.

The covariance between two random variables X and Y is defined as

Cov(X, Y) = E[(X − m_X)(Y − m_Y)] = ∫∫ (x − m_X)(y − m_Y) f_XY(x, y) dx dy = E(XY) − m_X m_Y    (Eq. 4.11)

where m_Y or E(Y) is the mean or expected value of Y, and f_XY(x, y) is the joint probability distribution (i.e., joint frequency of occurrence) function.

Note that no matter how many variables are involved, the mathematical expectation operator can be thought of as "averaging". For example, E(XY) in Eq. 4.11 is the average of the product of X and Y. Incidentally, this term is the main component in defining covariance and correlation, and the product of two random variables is the mathematical expression of their relationship. How to evaluate such a term is very important in hydrocarbon resource evaluation (see Chap. 22).
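
The (Pearson) correlation coefficient follows directly from these definitions, as the covariance standardized by the product of the two standard deviations:

ρ_XY = Cov(X, Y) / (σ_X σ_Y) = (E(XY) − m_X m_Y) / (σ_X σ_Y)

which is why the term E(XY) plays the central role noted above, and why a correlation coefficient is always confined to the range −1 to +1.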

Covariance and correlation are simplified expressions of 2D histograms. Figure 4.10 shows two examples of 2D histograms (joint frequency distributions of two well-log variables), along with their correlation coefficients. GR and sonic (DT) have a small positive correlation of 0.34; GR and resistivity have a strong negative correlation of −0.78.

Fig. 4.10: (a) 2D histogram, displayed in contours, of sonic (DT) and GR; the two variables have a small correlation, with the correlation coefficient equal to 0.34. (b) 2D histogram, displayed in contours, of GR and resistivity; the two variables have a strong negative correlation, with the correlation coefficient equal to −0.78.

Appendix 4.2 Graphic Displays for Analyzing Variables' Relationships

Graphic displays provide quick and straightforward ways for assessing the relationships between two variables. The most common display is the crossplot, as shown in Fig. 4.11 . In the statistical literature, the crossplot is often termed scatterplot (implying scatter of data, which is not always true). Alternatively, the variables are displayed as a function of coordinates (time or space), such as shown in Fig. 4.1 . A better, but more expensive, method is the 2D histogram with the frequency as the third dimension, as shown in Chap. 2 (Fig. 2.2 ), or alternatively with the frequency as contours, such as shown in Fig. 4.10 .

Fig. 4.11: (a)–(d) Example of crossplotting porosity and density with or without linking the data. Depending on how the data are ordered, the links appear differently. Three ways are displayed in this example, but the underlying relationship is the same because the underlying data are the same (unless the spatial or temporal correlation is analyzed, see Chap. 13). The correlation coefficient in this example is 0.817.

One type of crossplotting is to link the data pairs following the order of the data. One should be careful in interpreting such displays because they may appear different when the order of the data is changed. Figure 4.11 shows four crossplots that convey the same relationship between density and porosity because they all contain exactly the same data, and thus the correlation is the same for all of them, at 0.817. A reader unfamiliar with this type of display might perceive them as quite different and think that the two variables have a higher correlation in Fig. 4.11b, d than in Fig. 4.11c.

Compared to displaying the two variables as a function of time or spatial coordinates for geospatial variables or time series, a crossplot makes the relationship easier to interpret, but it loses the geospatial or temporal ordering, which is why the four displays in Fig. 4.11 are the same despite the differences in their appearance. A spatial (or temporal) correlation function enables the analysis of a spatial (or temporal) relationship in which the order of the data is important (see Chap. 13).

An enhanced version of crossplotting is to add the histograms of the two variables alongside the crossplot. Figure 4.12 shows an example of crossplotting GR and resistivity. The histograms give information on the frequency of the data bins.

Fig. 4.12: Crossplot between GR and the logarithm of resistivity, along with their histograms.

Matrix of 2D Histograms

Like histograms for one variable (e.g., Fig. 3.1 in Chap. 3), a bivariate histogram can reveal modes and other frequency properties of the variables and is often the best way to investigate a bivariate relationship. A multivariate histogram can also be computed, but it cannot be effectively visualized graphically because of the curse of high dimensionality. Several techniques based on exploratory analysis of pairwise relationships can be used to alleviate this problem.

A correlation matrix is an exploratory tool for analyzing multivariate relationships, as shown in Table 4.1. However, because a correlation matrix contains only bivariate correlations, it does not directly describe the multivariate relationship.

Similarly, since there is no effective way of displaying a multivariate histogram, a 2D histogram matrix can help gain insights into multivariate relationships. Figure 4.13 shows a matrix of 2D histograms between any two of the three well logs, which can be used to assess the relationships among these logs.

Fig. 4.13: Matrix of 2D histograms. For the displays, porosity was multiplied by 1000. Note that when the same property is used for both axes, the 2D histogram is the same as the 1D histogram, except that the display lies along the one-to-one diagonal line. (Modified from Ma et al. (2014))
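
As a brief illustrative sketch (with synthetic values standing in for real well-log data), a bivariate correlation matrix and a 2D histogram of the kind shown above can be computed with standard Python tools:

import numpy as np
import pandas as pd

# Synthetic stand-ins for well-log data (GR, sonic, log-resistivity); real logs would be loaded from files
rng = np.random.default_rng(0)
gr = rng.normal(75, 20, 500)                          # gamma ray
dt = 60 + 0.1 * gr + rng.normal(0, 5, 500)            # sonic, weakly related to GR
log_res = 2.0 - 0.02 * gr + rng.normal(0, 0.3, 500)   # log-resistivity, negatively related to GR

logs = pd.DataFrame({"GR": gr, "DT": dt, "LogRes": log_res})

# Bivariate (Pearson) correlation matrix, analogous to Table 4.1
print(logs.corr())

# 2D histogram (joint frequency distribution) of GR and log-resistivity
counts, gr_edges, res_edges = np.histogram2d(logs["GR"], logs["LogRes"], bins=20)
print(counts.shape)  # a 20 x 20 grid of joint frequencies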

About this chapter

Ma, Y.Z. (2019). Correlation Analysis. In: Quantitative Geosciences: Data Analytics, Geostatistics, Reservoir Characterization and Modeling. Springer, Cham. https://doi.org/10.1007/978-3-030-17860-4_4

What is Correlation Analysis? A Definition and Explanation

Correlation analysis is a topic that few people may remember from statistics lessons at school, but most insights professionals will know it as a staple of data analytics. Even so, correlations are frequently misunderstood and misused, even within the insights industry, for a number of reasons. So here is a helpful guide to the basics of correlation analysis, with a few links along the way.

Definition of Correlation Analysis

Correlation analysis is a statistical method used to discover whether there is a relationship between two variables or datasets, and how strong that relationship may be.

In market research terms, this means correlation analysis is used to analyse quantitative data gathered through research methods such as surveys and polls, to identify whether there are any significant connections, patterns, or trends between the two variables.

Essentially, correlation analysis is used for spotting patterns within datasets. A positive correlation result means that both variables increase in relation to each other, while a negative correlation means that as one variable decreases, the other increases.

Correlation Coefficients

There are three common ways of measuring statistical correlation, named after Spearman, Kendall, and Pearson. Each produces a coefficient that expresses the end result as a single number, conventionally written as 'r' (or ρ for Spearman and τ for Kendall). Spearman's Rank and Pearson's Coefficient are the two most widely used formulae, and the choice between them depends on the type of data researchers have to hand:

Spearman’s Rank Correlation Coefficient

This coefficient is used to see if there is any significant relationship between the two datasets, and operates under the assumption that the data being used is ordinal, which here means that the numbers do not indicate quantity, but rather signify a position or rank in the subjects' standing (e.g. 1st, 2nd, 3rd, etc.).

Spearman's Rank formula: ρ = 1 - (6Σd²) / (n(n² - 1)), where d is the difference between the two ranks of each observation and n is the number of observations.

This coefficient requires a table of data which displays the raw data, its ranks, and the difference between the two ranks. The squared differences between the ranks feed into the formula above, and the ranked data can also be plotted on a scatter graph, which will clearly indicate whether there is a positive correlation, negative correlation, or no correlation at all between the two variables. The constraint that this coefficient works under is -1 ≤ r ≤ +1, where a result of 0 means there is no relationship between the data whatsoever.

Pearson Product-Moment Coefficient

This is the most widely used correlation analysis formula, which measures the strength of the 'linear' relationship between the raw data from both variables, rather than their ranks. It is a dimensionless coefficient, meaning that there are no data-related boundaries to be considered when conducting analyses with this formula, which is one reason why it is the first formula researchers try.

Pearson's Coefficient formula: r = Σ(x - x̄)(y - ȳ) / √(Σ(x - x̄)² × Σ(y - ȳ)²), where x̄ and ȳ are the means of the two variables.

However, if the relationship between the data is not linear, this particular coefficient will not accurately represent the relationship between the two variables, and Spearman's Rank should be used instead. Pearson's Coefficient requires the relevant data to be entered into a table similar to that used for Spearman's Rank, but without the ranks, and the result produced falls within the same numerical range that all correlation coefficients share: -1 ≤ r ≤ +1.

When to Use

The two methods outlined above are to be used according to whether there are parameters associated with the data gathered. The two terms to watch out for are:

  • Parametric: (Pearson’s Coefficient) Where the data must be handled in relation to the parameters of populations or probability distributions. Typically used with quantitative data already set out within said parameters.
  • Nonparametric: (Spearman's Rank) Where no assumptions can be made about the probability distribution. Typically used with ordinal or qualitative data, but can also be used with quantitative data when the assumptions behind Pearson's Coefficient are not met.

In cases where both are applicable, statisticians recommend using the parametric methods, such as Pearson's Coefficient, because they tend to be more precise. But that doesn't mean the non-parametric methods should be discounted when there isn't enough data, or when the data do not meet the parametric assumptions.

Interpreting Results

Typically, the best way to gain a generalised but immediate interpretation of the results for a set of data is to visualise it on a scatter graph such as these:

Positive Correlation Graph

Positive Correlation

Any score from +0.5 to +1 indicates a strong positive correlation, which means that both variables increase together. The line of best fit, or trend line, is placed to best represent the data on the graph. In this case, it follows the data points upwards, indicating the positive correlation.

Negative Correlation Graph

Negative Correlation

Any score from -0.5 to -1 indicates a strong negative correlation, which means that as one variable increases, the other decreases. The line of best fit can be seen here indicating the negative correlation; in these cases it slopes downwards.

No Correlation Graph

No Correlation

Very simply, a score of 0 indicates that there is no correlation, or relationship, between the two variables. In all cases, the larger the sample size, the more accurate the result: no matter which formula is used, the more data that is put into the calculation, the more reliable the end result will be.

Outliers or anomalies must be accounted for in both correlation coefficients. Using a scatter graph is the easiest way of identifying any anomalies that may have occurred, and running the correlation analysis twice (with and without the anomalies) is a great way to assess the strength of their influence on the analysis. If anomalies are present, Spearman's Rank coefficient may be used instead of Pearson's Coefficient, as this formula is considerably more robust against anomalies due to the ranking system used.
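
As a quick illustrative sketch (with made-up numbers), running the analysis with and without a suspect data point makes the anomaly's influence on each coefficient explicit:

import numpy as np
from scipy import stats

# Made-up data with one obvious anomaly as the last pair
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])
y = np.array([2, 4, 5, 4, 6, 7, 8, 9, 10, 1])

pearson_all, _ = stats.pearsonr(x, y)
spearman_all, _ = stats.spearmanr(x, y)

# Drop the anomalous last pair and re-run the analysis
pearson_clean, _ = stats.pearsonr(x[:-1], y[:-1])
spearman_clean, _ = stats.spearmanr(x[:-1], y[:-1])

# The rank-based Spearman coefficient is typically distorted less than Pearson's
print(f"Pearson  with / without anomaly: {pearson_all:.2f} / {pearson_clean:.2f}")
print(f"Spearman with / without anomaly: {spearman_all:.2f} / {spearman_clean:.2f}")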

Correlation ≠ Causation

While a significant relationship may be identified by correlation analysis techniques , correlation does not imply causation. The cause cannot be determined by the analysis, nor should this conclusion be attempted. The significant relationship implies that there is more to understand and that there are extraneous or underlying factors that should be explored further in order to search for a cause. While it is possible that a causal relationship exists, it would be remiss of any researcher to use the correlation results as proof of this existence.

The cause of any relationship that may be discovered through correlation analysis is for the researcher to determine through other means of statistical analysis, such as coefficient of determination analysis. However, there is a great amount of value that correlation analysis can provide; for example, the value of one variable can be estimated from the other, which can help firms estimate the cost and sales of a product or service.

In essence, the uses for and applications of correlation-based statistical analyses allow researchers to identify which aspects and variables are dependent on each other, and the results can generate actionable insights in their own right, or serve as starting points for further investigation and deeper insights.

Correlation Analysis

What Is Correlation Analysis?

Definition and Basic Concepts

Correlation analysis serves as a fundamental tool in understanding the relationship between two variables. The analysis measures how strongly one variable moves with another, providing valuable insights into their relationship. Researchers often use correlation to identify patterns within data. The analysis involves calculating a coefficient that indicates the strength and direction of the relationship.

Understanding Variables

Variables are the core components of any correlation analysis. These elements represent measurable characteristics or quantities that can change. For example, temperature and ice cream sales can be variables in a study. Understanding how these variables interact is crucial for deriving meaningful insights. Researchers must carefully select and define variables to ensure accurate analysis.

Types of Correlation

Different types of correlation exist, each revealing unique relationships. Positive correlation occurs when both variables increase together. Negative correlation happens when one variable decreases as the other increases. No correlation indicates no discernible pattern between the variables. Spearman's rank correlation and Pearson's correlation coefficient are common methods used to assess these relationships. Spearman's rank is particularly useful when dealing with ordinal data.

Importance of Correlation Analysis

Correlation analysis plays a vital role in various fields, offering significant benefits. The analysis provides a starting point for further research and helps in understanding how two variables behave in relation to each other. Drive Research emphasizes the importance of recognizing these relationships for informed decision-making.

Identifying Relationships

Identifying relationships between variables is a primary goal of correlation analysis. The analysis reveals how strongly one variable is associated with another. For instance, Drive Research might use correlation to determine how customer satisfaction relates to sales. Recognizing these relationships allows businesses to make strategic decisions based on data-driven insights.

Predictive Analysis

Correlation analysis also aids in predictive analysis. By understanding relationships, researchers can forecast future trends. Drive Research utilizes correlation to predict market behaviors, enhancing strategic planning. This predictive capability makes correlation analysis an invaluable tool in research and business analysis.

How to Measure Correlation

Correlation Coefficients

Understanding correlation coefficients is crucial for effective analysis. These coefficients measure the strength and direction of the relationship between two variables. The range of a correlation coefficient is from -1 to +1. A value of +1 indicates a perfect positive relationship, while -1 signifies a perfect negative relationship. A value of 0 suggests no relationship exists between the variables.

Pearson's Correlation Coefficient

Pearson's correlation coefficient is widely used in research. This method measures the linear relationship between two continuous variables. Researchers often use Pearson's coefficient when data follows a normal distribution. The calculation involves comparing the covariance of the variables to their individual standard deviations. A high Pearson's coefficient indicates a strong linear relationship.

Spearman's Rank Correlation

Spearman's Rank correlation is another valuable tool in analysis. This method assesses the monotonic relationship between two variables. Unlike Pearson's, Spearman's Rank does not require normally distributed data. Researchers use this method for ordinal data or when outliers are present. Spearman's Rank provides insights into the order of data rather than their exact values.

Interpreting Correlation Coefficients

Interpreting correlation coefficients helps in understanding relationships. Positive and negative correlations reveal different dynamics between variables.

Positive vs. Negative Correlation

A positive correlation means both variables increase together. For example, as temperature rises, ice cream sales might also rise. A negative correlation indicates that as one variable increases, the other decreases. An example includes the relationship between exercise frequency and body weight.

Strength of Correlation

The strength of a correlation reveals how closely variables are related. A high correlation coefficient, close to +1 or -1, indicates a strong relationship. A low correlation coefficient, near 0, suggests a weak relationship. Understanding the strength helps in making informed decisions in market research analysis.

Applications of Correlation Analysis

In Business and Economics

Market Trends

Businesses use correlation analysis to understand market trends. This analysis helps identify the relationship between different economic indicators. For example, companies analyze the correlation between consumer spending and economic growth. This understanding allows businesses to make informed decisions. Drive Research often employs this method to predict future market behaviors. Identifying these patterns supports strategic planning and investment.

Consumer Behavior

Understanding consumer behavior is crucial for businesses. Correlation analysis reveals how different factors influence purchasing decisions. Companies examine the correlation between advertising efforts and sales growth. This analysis provides insights into effective marketing strategies. Drive Research uses these findings to tailor campaigns that resonate with target audiences. Businesses can enhance customer engagement by recognizing these relationships.

In Science and Research

Experimental Studies

Researchers rely on correlation analysis in experimental studies. This method uncovers relationships between variables in scientific experiments. Scientists explore the correlation between environmental changes and species adaptation. These insights contribute to advancements in ecological research. Drive Research applies similar techniques to study various scientific phenomena. Understanding these connections aids in developing innovative solutions.

Data Analysis

Data analysis benefits greatly from correlation analysis. Analysts use this method to detect meaningful relationships within datasets. For instance, data scientists investigate the correlation between website traffic and conversion rates. This analysis helps optimize digital marketing strategies. Drive Research leverages these insights to improve client outcomes. Recognizing these correlations enhances decision-making and efficiency.

Common Misconceptions

Correlation vs. Causation

Explanation of the Difference

Many people confuse correlation with causation. Correlation shows a relationship between two variables. Causation indicates that one variable directly affects another. Understanding this difference is crucial for accurate analysis.

Examples of Misinterpretation

Misinterpretations often occur in research. For example, rising breast cancer rates may correlate with increased joint surgeries. This correlation does not mean one causes the other. Orthopedic studies often highlight the need to differentiate between correlational and causal associations.

Limitations of Correlation Analysis

Non-linear Relationships

Correlation analysis focuses on linear relationships. Non-linear relationships may exist between variables. These relationships require different analytical methods. Recognizing this limitation helps in choosing the right approach.

Outliers and Their Impact

Outliers can distort correlation results. These extreme values affect the strength and direction of relationships. Analysts must identify and address outliers. Proper handling ensures more accurate interpretations.

Practical Examples of Correlation Analysis

Case Studies

Real-world Example 1

Blood Pressure and Medication : Researchers often explore the relationship between blood pressure levels and medication effectiveness. A study might measure how a specific drug impacts systolic and diastolic readings. This analysis helps medical professionals tailor treatments to individual needs. Understanding these correlations improves patient outcomes.

Real-world Example 2

Sales and Advertising Spend : Businesses frequently analyze the link between advertising budgets and sales figures. Companies assess how increased spending influences revenue growth. This insight guides marketing strategies and budget allocations. Recognizing these patterns supports more effective decision-making.

Step-by-step Analysis

Data Collection

Start by gathering relevant data. Identify the variables you want to analyze. Use surveys, experiments, or existing datasets. Ensure data accuracy and reliability. Proper data collection forms the foundation of your analysis.

Calculation and Interpretation

Calculate correlation coefficients using statistical software. Choose methods like Pearson or Spearman based on your data type. Analyze the results to understand the strength and direction of relationships. Use visual aids like scatter plots for clarity. Interpret findings to draw meaningful conclusions. Apply insights to inform strategies and decisions.
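
As a minimal sketch of this step (the column names below are invented for illustration), statistical software makes the calculation a one-liner; pandas is used here:

import pandas as pd

# Invented example data: advertising spend vs. monthly sales
data = pd.DataFrame({
    "ad_spend": [10, 15, 12, 20, 25, 30, 28, 35],
    "sales":    [100, 130, 120, 160, 200, 230, 215, 260],
})

# Pearson (default) and Spearman coefficients between the two columns
pearson_r = data["ad_spend"].corr(data["sales"])
spearman_r = data["ad_spend"].corr(data["sales"], method="spearman")

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_r:.3f}")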

FAQs and Additional Resources

Frequently Asked Questions

Common Queries

What is correlation analysis? Correlation analysis measures the relationship between two variables. This method helps identify how one variable may influence another.

How do you interpret correlation coefficients? A coefficient close to +1 indicates a strong positive relationship. A coefficient close to -1 shows a strong negative relationship. A coefficient near 0 suggests no relationship.

What is the difference between correlation and causation? Correlation shows a relationship between variables. Causation means one variable directly affects another.

Expert Answers

Can correlation imply causation? Correlation does not imply causation. Reverse causality can distort relationships. For example, low BMI may correlate with higher mortality due to underlying health issues.

When should you use Pearson or Spearman correlation? Use Pearson for linear relationships with normally distributed data. Spearman works well with ordinal data or when outliers are present.

Further Reading and Tools

Recommended Books

"Statistics for Business and Economics" by Paul Newbold This book provides a comprehensive understanding of statistical methods, including correlation analysis.

"The Art of Statistics: Learning from Data" by David Spiegelhalter Explore the principles of statistics with practical examples and insights.

Online Resources

Khan Academy, Correlation and Causality: Access free courses that explain correlation and its implications.

Statistical Software Tutorials: Learn how to calculate and interpret correlation coefficients using various software tools.

These resources will enhance your understanding and application of correlation analysis in different fields.

Correlation analysis plays a vital role in understanding relationships between variables. This method empowers you to make informed decisions and uncover hidden patterns. Applying this knowledge in real-world scenarios enhances strategic planning and decision-making. Exploring further resources will deepen your understanding and improve your analytical skills. Embrace correlation analysis as a starting point for investigating relationships and making impactful choices.

What is Correlation Analysis? Definition, Process, Examples

Appinio Research · 07.11.2023 · 31min read

Are you curious about how different variables interact and influence each other? Correlation analysis is the key to unlocking these relationships in your data. In this guide, we'll dive deep into correlation analysis, exploring its definition, methods, applications, and practical examples.

Whether you're a data scientist, researcher, or business professional, understanding correlation analysis will empower you to make informed decisions, manage risks, and uncover valuable insights from your data. Let's embark on this exploration of correlation analysis and discover its significance in various domains.

What is Correlation Analysis?

Correlation analysis is a statistical technique used to measure and evaluate the strength and direction of the relationship between two or more variables. It helps identify whether changes in one variable are associated with changes in another and quantifies the degree of this association.

Purpose of Correlation Analysis

The primary purpose of correlation analysis is to:

  • Discover Relationships: Correlation analysis helps researchers and analysts identify patterns and relationships between variables in their data. It answers questions like, "Do these variables move together or in opposite directions?"
  • Quantify Relationships: Correlation analysis quantifies the strength and direction of associations between variables, providing a numerical measure that allows for comparisons and objective assessments.
  • Predictive Insights: Correlation analysis can be used for predictive purposes. If two variables show a strong correlation, changes in one variable can be used to predict changes in the other, which is valuable for forecasting and decision-making.
  • Data Reduction: In multivariate analysis, correlation analysis can help identify redundant variables. Highly correlated variables may carry similar information, allowing analysts to simplify their models and reduce dimensionality.
  • Diagnostics: In fields like healthcare and finance , correlation analysis is used for diagnostic purposes. For instance, it can reveal correlations between symptoms and diseases or between financial indicators and market trends.

Importance of Correlation Analysis

  • Decision-Making: Correlation analysis provides crucial insights for informed decision-making. For example, in finance, understanding the correlation between assets helps in portfolio diversification, risk management , and asset allocation decisions. In business, it aids in assessing the effectiveness of marketing strategies and identifying factors influencing sales.
  • Risk Assessment: Correlation analysis is essential for risk assessment and management. In financial risk analysis, it helps identify how assets within a portfolio move concerning each other. Highly positively correlated assets can increase risk, while negatively correlated assets can provide diversification benefits.
  • Scientific Research: In scientific research, correlation analysis is a fundamental tool for understanding relationships between variables. For example, healthcare research can uncover correlations between patient characteristics and health outcomes, leading to improved treatments and interventions.
  • Quality Control: In manufacturing and quality control, correlation analysis can be used to identify factors that affect product quality. For instance, it helps determine whether changes in manufacturing processes correlate with variations in product specifications.
  • Predictive Modeling : Correlation analysis is a precursor to building predictive models . Variables with strong correlations may be used as predictors in regression models to forecast outcomes, such as predicting customer churn based on their usage patterns and demographics.
  • Identifying Confounding Factors: In epidemiology and social sciences, correlation analysis is used to identify confounding factors. When studying the relationship between two variables, a third variable may confound the association. Correlation analysis helps researchers identify and account for these confounders.

In summary, correlation analysis is a versatile and indispensable statistical tool with broad applications in various fields. It helps reveal relationships, assess risks, make informed decisions, and advance scientific understanding, making it a valuable asset in data analysis and research.

Types of Correlation

Correlation analysis involves examining the relationship between variables. There are several methods to measure correlation, each suited for different types of data and situations. In this section, we'll explore three main types of correlation:

Pearson Correlation Coefficient

The Pearson Correlation Coefficient, often referred to as Pearson's "r," is the most widely used method to measure linear relationships between continuous variables. It quantifies the strength and direction of a linear association between two variables.

Spearman Rank Correlation

Spearman Rank Correlation, also known as Spearman's "ρ" (rho), is a non-parametric method used to measure the strength and direction of the association between two variables. It is particularly beneficial when dealing with non-linear relationships or ordinal data.

Kendall Tau Correlation

Kendall Tau Correlation, often denoted as "τ" (tau), is another non-parametric method for assessing the association between two variables. It is advantageous when dealing with small sample sizes or data with ties (values that occur more than once).

How to Prepare Data for Correlation Analysis?

Before diving into correlation analysis, you must ensure your data is well-prepared to yield meaningful results. Proper data preparation is crucial for accurate and reliable outcomes . Let's explore the essential steps involved in preparing your data.

1. Data Collection

  • Identify Relevant Variables: Determine which variables you want to analyze for correlation. These variables should be logically connected or hypothesized to have an association.
  • Data Sources: Collect data from reliable sources, ensuring that it is representative of the population or phenomenon you are studying.
  • Data Quality: Check for data quality issues such as missing values, outliers, or errors during the data collection process.

2. Data Cleaning

  • Handling Missing Data: Decide on an appropriate strategy for dealing with missing values. You can either impute missing data or exclude cases with missing values, depending on the nature of your analysis and the extent of missing data.
  • Duplicate Data: Detect and remove duplicate entries to avoid skewing your analysis.
  • Data Transformation: If necessary, perform data transformations like normalization or standardization to ensure that variables are on the same scale.

3. Handling Missing Values

  • Types of Missing Data: Understand the types of missing data, such as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).
  • Imputation Methods: Choose appropriate imputation methods, such as mean imputation, median imputation, or regression imputation, based on the missing data pattern and the nature of your variables.

4. Outlier Detection and Treatment

  • Identifying Outliers: Utilize statistical methods or visualizations (e.g., box plots, scatter plots) to identify outliers in your data.
  • Treatment Options: Decide whether to remove outliers, transform them, or leave them in your dataset based on the context and objectives of your analysis.

Effective data preparation sets the stage for robust correlation analysis. By following these steps, you ensure that your data is clean, complete, and ready for meaningful insights. In the subsequent sections of this guide, we will delve deeper into the calculations, interpretations, and practical applications of correlation analysis.
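
As a hedged sketch of these preparation steps using pandas (the file and column names are placeholders for your own dataset), missing values, duplicates, and simple IQR-based outliers might be handled as follows:

import pandas as pd

# Placeholder file and column names; substitute your own dataset
df = pd.read_csv("survey_data.csv")

# Data cleaning: drop exact duplicate rows
df = df.drop_duplicates()

# Missing values: impute a numeric column with its median (one possible strategy)
df["satisfaction"] = df["satisfaction"].fillna(df["satisfaction"].median())

# Outlier detection with the interquartile range (IQR) rule
q1, q3 = df["spend"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["spend"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_no_outliers = df[mask]

print(f"Rows kept after outlier screening: {mask.sum()} of {len(df)}")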

The Pearson Correlation Coefficient, often referred to as Pearson's "r," is a widely used statistical measure for quantifying the strength and direction of a linear relationship between two continuous variables. Understanding how to calculate, interpret, and recognize the strength and direction of this correlation is essential.

Calculation

The formula for calculating the Pearson correlation coefficient is as follows:

r = Σ((X - X̄)(Y - Ȳ)) / √(Σ(X - X̄)² × Σ(Y - Ȳ)²)
  • X and Y are the variables being analyzed.
  • X̄ and Ȳ are the means (averages) of X and Y.
  • The sums run over all n data points.

To calculate "r," you take the sum of the products of the deviations of individual data points from their respective means for both variables; this numerator is proportional to the sample covariance. Dividing by the square root of the product of the summed squared deviations of each variable normalizes the covariance by the two standard deviations, which is what confines "r" to the range -1 to +1.

Interpretation

Interpreting the Pearson correlation coefficient is crucial for understanding the nature of the relationship between two variables:

  • Positive correlation (r > 0): When "r" is positive, it indicates a positive linear relationship. This means that as one variable increases, the other tends to increase as well.
  • Negative correlation (r < 0): A negative "r" value suggests a negative linear relationship, implying that as one variable increases, the other tends to decrease.
  • No Correlation (r ≈ 0): If "r" is close to 0, there is little to no linear relationship between the variables. In this case, changes in one variable are not associated with consistent changes in the other.

Strength and Direction of Correlation

The magnitude of the Pearson correlation coefficient "r" indicates the strength of the correlation:

  • Strong Correlation: When |r| is close to 1 (either positive or negative), it suggests a strong linear relationship. A value of 1 indicates a perfect linear relationship, while -1 indicates a perfect negative linear relationship.
  • Weak Correlation: When |r| is closer to 0, it implies a weaker linear relationship. The closer "r" is to 0, the weaker the correlation.

The sign of "r" (+ or -) indicates the direction of the correlation:

  • Positive Correlation: A positive "r" suggests that as one variable increases, the other tends to increase. The variables move in the same direction.
  • Negative Correlation: A negative "r" suggests that as one variable increases, the other tends to decrease. The variables move in opposite directions.

Assumptions and Limitations

It's essential to be aware of the assumptions and limitations of the Pearson correlation coefficient:

  • Linearity: Pearson correlation assumes that there is a linear relationship between the variables. If the relationship is not linear, Pearson's correlation may not accurately capture the association.
  • Normal Distribution: It assumes that both variables are normally distributed. If this assumption is violated, the results may be less reliable.
  • Outliers: Outliers can have a significant impact on the Pearson correlation coefficient. Extreme values may distort the correlation results.
  • Independence: It assumes that the data points are independent of each other.

Understanding these assumptions and limitations is vital when interpreting the results of Pearson correlation analysis. In cases where these assumptions are not met, other correlation methods like Spearman or Kendall Tau may be more appropriate.

Spearman Rank Correlation, also known as Spearman's "ρ" (rho), is a non-parametric method used to measure the strength and direction of the association between two variables. This method is valuable when dealing with non-linear relationships or ordinal data.

To calculate Spearman Rank Correlation, you need to follow these steps:

  • Rank the values of each variable separately. Assign the lowest rank to the smallest value and the highest rank to the largest value.
  • Calculate the differences between the ranks for each pair of data points for both variables. Square the differences and sum them for all data points.
  • Use the formula for Spearman's rho:
ρ = 1 - ((6 * Σd²) / (n(n² - 1)))
  • ρ is the Spearman rank correlation coefficient.
  • Σd² is the sum of squared differences in ranks.
  • n is the number of paired observations.

When to Use Spearman Correlation?

Spearman Rank Correlation is particularly useful in the following scenarios:

  • When the relationship between variables is not strictly linear, as it does not assume linearity.
  • When dealing with ordinal data , where values have a natural order but are not equidistant.
  • When your data violates the assumptions of the Pearson correlation coefficient, such as normality and linearity.

Interpreting Spearman's rho is similar to interpreting Pearson correlation:

  • A positive ρ indicates a positive monotonic relationship, meaning that as one variable increases, the other tends to increase.
  • A negative ρ suggests a negative monotonic relationship, where as one variable increases, the other tends to decrease.
  • A ρ close to 0 implies little to no monotonic association between the variables.

Spearman Rank Correlation is robust and versatile, making it a valuable tool for analyzing relationships in a variety of data types and scenarios.
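
As a small worked sketch (with made-up numbers), the rank-and-difference procedure can be reproduced in a few lines and checked against SciPy's implementation:

import numpy as np
from scipy import stats

x = np.array([35, 23, 47, 17, 10, 43, 9, 6, 28])
y = np.array([30, 33, 45, 23, 8, 49, 12, 4, 31])

# Rank each variable (average ranks are assigned to any ties)
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)

# Spearman's rho via the squared rank differences
d_squared = (rank_x - rank_y) ** 2
n = len(x)
rho_manual = 1 - (6 * d_squared.sum()) / (n * (n ** 2 - 1))

rho_scipy, _ = stats.spearmanr(x, y)
print(f"Manual rho: {rho_manual:.4f}")
print(f"SciPy rho:  {rho_scipy:.4f}")  # the two agree when there are no ties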

The Kendall Tau Correlation, often denoted as "τ" (tau), is a non-parametric measure used to assess the strength and direction of association between two variables. Kendall Tau is particularly valuable when dealing with small sample sizes, non-linear relationships, or data that violates the assumptions of the Pearson correlation coefficient.

Calculating Kendall Tau Correlation involves counting concordant and discordant pairs of data points. Here's how it's done:

  • For each pair of data points (Xi, Xj) and (Yi, Yj), determine whether they are concordant or discordant.
  • Concordant pairs: If Xi < Xj and Yi < Yj or Xi > Xj and Yi > Yj.
  • Discordant pairs: If Xi < Xj and Yi > Yj or Xi > Xj and Yi < Yj.
  • Count the number of concordant pairs (C) and discordant pairs (D).
  • Use the formula for Kendall's Tau:
τ = (C - D) / (0.5 * n * (n - 1))
  • τ is the Kendall Tau correlation coefficient.
  • C is the number of concordant pairs.
  • D is the number of discordant pairs.

Advantages of Kendall Tau

Kendall Tau Correlation offers several advantages, making it a robust choice in various scenarios:

  • Robust to Outliers: Kendall Tau is less sensitive to outliers compared to Pearson correlation, making it suitable for data with extreme values.
  • Small Sample Sizes: It performs well with small sample sizes, making it applicable even when you have limited data.
  • Non-Parametric: Kendall Tau is non-parametric, meaning it doesn't assume specific data distributions, making it versatile for various data types.
  • No Assumption of Linearity: Unlike Pearson correlation, Kendall Tau does not assume a linear relationship between variables, making it suitable for capturing non-linear associations.

Interpreting Kendall Tau correlation follows a similar pattern to Pearson and Spearman correlation:

  • Positive τ (τ > 0): Indicates a positive association between variables. As one variable increases, the other tends to increase.
  • Negative τ (τ < 0): Suggests a negative association. As one variable increases, the other tends to decrease.
  • τ Close to 0: Implies little to no association between the variables.

Kendall Tau is a valuable tool when you want to explore associations in your data without making strong assumptions about data distribution or linearity.
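
As a short sketch (with made-up numbers), the concordant/discordant counting can be written out explicitly and checked against SciPy:

from itertools import combinations
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 2, 5]

concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    # A pair is concordant if both variables move in the same direction
    direction = (x[i] - x[j]) * (y[i] - y[j])
    if direction > 0:
        concordant += 1
    elif direction < 0:
        discordant += 1  # ties (direction == 0) are simply not counted here

n = len(x)
tau_manual = (concordant - discordant) / (0.5 * n * (n - 1))

tau_scipy, _ = stats.kendalltau(x, y)
print(f"Manual tau: {tau_manual:.4f}")
print(f"SciPy tau:  {tau_scipy:.4f}")  # the two agree when there are no ties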

How to Interpret Correlation Results?

Once you've calculated correlation coefficients, the next step is interpreting the results. Understanding how to make sense of the correlation values and what they mean for your analysis is crucial.

Correlation Heatmaps

Correlation heatmaps are visual representations of correlation coefficients between multiple variables. They provide a quick and intuitive way to identify patterns and relationships in your data.

  • Positive Correlation (High Values): Variables with high positive correlations appear as clusters of bright colors (e.g., red or yellow) in the heatmap .
  • Negative Correlation (Low Values): Variables with high negative correlations are shown as clusters of dark colors (e.g., blue or green) in the heatmap.
  • No Correlation (Values Close to 0): Variables with low or no correlation appear as a neutral color (e.g., white or gray) in the heatmap.

Correlation heatmaps are especially useful when dealing with a large number of variables, helping you identify which pairs exhibit strong associations.
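
A minimal sketch using pandas, seaborn, and matplotlib (the column names are hypothetical) shows how such a heatmap is typically produced:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset with several numeric variables
df = pd.DataFrame({
    "age":        [23, 34, 45, 52, 29, 41, 60, 37],
    "income":     [30, 48, 62, 70, 39, 55, 80, 50],
    "spend":      [12, 20, 25, 27, 15, 22, 30, 19],
    "web_visits": [40, 22, 18, 10, 35, 20, 8, 25],
})

# Correlation matrix and heatmap; warm colors indicate positive, cool colors negative
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap")
plt.show()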

Scatterplots

Scatterplots are graphical representations of data points on a Cartesian plane, with one variable on the x-axis and another on the y-axis. They are valuable for visualizing the relationship between two continuous variables.

  • Positive Correlation: In a positive correlation, data points on the scatterplot tend to form an upward-sloping pattern, suggesting that as one variable increases, the other tends to increase.
  • Negative Correlation: A negative correlation is represented by a downward-sloping pattern, indicating that as one variable increases, the other tends to decrease.
  • No Correlation: When there is no correlation, data points are scattered randomly without forming any distinct pattern.

Scatterplots provide a clear and intuitive way to assess the direction and strength of the correlation between two variables.
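
A short matplotlib sketch with synthetic data illustrates the three patterns described above:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.normal(size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(x, 2 * x + rng.normal(scale=0.5, size=200))   # positive correlation
axes[0].set_title("Positive")
axes[1].scatter(x, -2 * x + rng.normal(scale=0.5, size=200))  # negative correlation
axes[1].set_title("Negative")
axes[2].scatter(x, rng.normal(size=200))                      # no correlation
axes[2].set_title("None")
plt.show()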

Statistical Significance

It's crucial to determine whether the observed correlation is statistically significant . Statistical significance helps you assess whether the correlation is likely due to random chance or if it reflects a true relationship between the variables.

Common methods for assessing statistical significance include hypothesis testing (e.g., t-tests) or calculating p-values. A low p-value (typically less than 0.05) indicates that the correlation is likely not due to chance and is statistically significant.
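
In practice, statistical libraries return the p-value alongside the coefficient; a minimal SciPy sketch (with made-up numbers):

from scipy import stats

x = [2, 4, 6, 8, 10, 12, 14]
y = [1, 3, 7, 6, 11, 14, 13]

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")

# A p-value below the chosen threshold (commonly 0.05) suggests the
# correlation is unlikely to be due to chance alone.
if p_value < 0.05:
    print("The correlation is statistically significant at the 5% level.")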

Understanding statistical significance helps you confidently draw conclusions from your correlation analysis and make informed decisions based on your findings.
Common Mistakes in Correlation Analysis

While correlation analysis is a powerful tool for uncovering relationships in data, it's essential to be aware of common mistakes and pitfalls that can lead to incorrect conclusions. Here are some of the most prevalent issues to watch out for:

Causation vs. Correlation

Mistake: Assuming that correlation implies causation is a common error in data analysis. Correlation only indicates that two variables are associated or vary together; it does not establish a cause-and-effect relationship.

Example: Suppose you find a strong positive correlation between ice cream sales and the number of drowning incidents during the summer months. Concluding that eating ice cream causes drowning would be a mistake. The common factor here is hot weather, which drives both ice cream consumption and swimming, leading to an apparent correlation.

Solution: Always exercise caution when interpreting correlation. To establish causation, you need additional evidence from controlled experiments or a thorough understanding of the underlying mechanisms.

Confounding Variables

Mistake: Ignoring or failing to account for confounding variables can lead to misleading correlation results. Confounding variables are external factors that affect both of the variables being studied, making it appear that there is a correlation when there isn't one.

Example: Suppose you are analyzing the relationship between the number of sunscreen applications and the incidence of sunburn. You find a negative correlation, suggesting that more sunscreen leads to more sunburn. However, the confounding variable is the time spent in the sun, which affects both sunscreen application and sunburn risk.

Solution: Be vigilant about potential confounding variables and either control for them in your analysis or consider their influence on the observed correlation.

Sample Size Issues

Mistake: Drawing strong conclusions from small sample sizes can be misleading. Small samples can result in less reliable correlation estimates and may not be representative of the population.

Example: If you have only ten data points and find a strong correlation, it's challenging to generalize that correlation to a larger population with confidence.

Solution: Whenever possible, aim for larger sample sizes to improve the robustness of your correlation analysis. Statistical tests can help determine whether the observed correlation is statistically significant, given the sample size. You can also leverage the Appinio sample size calculator to determine the necessary sample size.

Applications of Correlation Analysis

Correlation analysis has a wide range of applications across various fields. Understanding the relationships between variables can provide valuable insights for decision-making and research. Here are some notable applications in different domains:

Business and Finance

  • Stock Market Analysis: Correlation analysis can help investors and portfolio managers assess the relationships between different stocks and assets. Understanding correlations can aid in diversifying portfolios to manage risk.
  • Marketing Effectiveness: Businesses use correlation analysis to determine the impact of marketing strategies on sales, customer engagement, and other key performance metrics.
  • Risk Management: In financial institutions, correlation analysis is crucial for assessing the interdependence of assets and estimating risk exposure in portfolios.

Healthcare and Medicine

  • Drug Efficacy: Researchers use correlation analysis to assess the correlation between drug dosage and patient response. It helps determine the appropriate drug dosage for specific conditions.
  • Disease Research: Correlation analysis is used to identify potential risk factors and correlations between various health indicators and the occurrence of diseases.
  • Clinical Trials: In clinical trials, correlation analysis is employed to evaluate the correlation between treatment interventions and patient outcomes.

Social Sciences

  • Education: Educational researchers use correlation analysis to explore relationships between teaching methods, student performance, and various socioeconomic factors.
  • Sociology: Correlation analysis is applied to study correlations between social variables, such as income, education, and crime rates.
  • Psychology: Psychologists use correlation analysis to investigate relationships between variables like stress levels, behavior, and mental health outcomes .

These are just a few examples of how correlation analysis is applied across diverse fields. Its versatility makes it a valuable tool for uncovering associations and guiding decision-making in many areas of research and practice.

Correlation Analysis in Python

Python is a widely used programming language for data analysis and offers several libraries that facilitate correlation analysis. In this section, we'll explore how to perform correlation analysis using Python, including the use of libraries like NumPy and pandas. We'll also provide code examples to illustrate the process.

Using Libraries

NumPy is a fundamental library for numerical computing in Python. It provides essential tools for working with arrays and performing mathematical operations, making it valuable for correlation analysis.

To calculate the Pearson correlation coefficient using NumPy, you can use the numpy.corrcoef() function:

import numpy as np

# Create two arrays (variables)
variable1 = np.array([1, 2, 3, 4, 5])
variable2 = np.array([5, 4, 3, 2, 1])

# Calculate Pearson correlation coefficient
correlation_coefficient = np.corrcoef(variable1, variable2)[0, 1]
print(f"Pearson Correlation Coefficient: {correlation_coefficient}")

pandas is a powerful data manipulation library in Python. It provides a convenient DataFrame structure for handling and analyzing data.

To perform correlation analysis using pandas, you can use the pandas.DataFrame.corr() method:

import pandas as pd

# Create a DataFrame with two columns
data = {'Variable1': [1, 2, 3, 4, 5],
        'Variable2': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)

# Calculate Pearson correlation coefficient
correlation_matrix = df.corr()
pearson_coefficient = correlation_matrix.loc['Variable1', 'Variable2']
print(f"Pearson Correlation Coefficient: {pearson_coefficient}")

Code Examples

The examples below use the scipy.stats module to calculate rank-based correlation coefficients.

import scipy.stats

# Create two arrays (variables)
variable1 = [1, 2, 3, 4, 5]
variable2 = [5, 4, 3, 2, 1]

# Calculate Spearman rank correlation coefficient
spearman_coefficient, _ = scipy.stats.spearmanr(variable1, variable2)
print(f"Spearman Rank Correlation Coefficient: {spearman_coefficient}")

import scipy.stats

# Create two arrays (variables)
variable1 = [1, 2, 3, 4, 5]
variable2 = [5, 4, 3, 2, 1]

# Calculate Kendall Tau correlation coefficient
kendall_coefficient, _ = scipy.stats.kendalltau(variable1, variable2)
print(f"Kendall Tau Correlation Coefficient: {kendall_coefficient}")

These code examples demonstrate how to calculate correlation coefficients using Python and its libraries. You can apply these techniques to your own datasets and analyses, depending on the type of correlation you want to measure.

Correlation Analysis in R

R is a powerful statistical programming language and environment that excels in data analysis and visualization. In this section, we'll explore how to perform correlation analysis in R, utilizing libraries like corrplot and psych. Additionally, we'll provide code examples to demonstrate the process.

corrplot is a popular R package for creating visually appealing correlation matrices and correlation plots. It provides various options for customizing the appearance of correlation matrices, making it an excellent choice for visualizing relationships between variables.

To use corrplot, you need to install and load the package:
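install.packages("corrplot")
library(corrplot)

Once loaded, a typical call (sketched here with the built-in mtcars dataset purely for illustration) passes a correlation matrix to corrplot():

# Visualize the correlation matrix of a few numeric columns
correlation_matrix <- cor(mtcars[, c("mpg", "wt", "hp")])
corrplot(correlation_matrix, method = "circle")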

The psych package in R provides a wide range of functions for psychometrics, including correlation analysis. It offers functions for calculating correlation matrices, performing factor analysis, and more.

To use psych, you should install and load the package:
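install.packages("psych")
library(psych)

Once loaded, psych's corr.test() function (shown here with the built-in mtcars dataset purely for illustration) returns a correlation matrix together with sample sizes and p-values:

# Correlation matrix with significance tests for selected columns
corr.test(mtcars[, c("mpg", "wt", "hp")])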

The code examples below use base R's built-in cor() function, which supports the Pearson, Spearman, and Kendall methods via its method argument.

# Create two vectors (variables)
variable1 <- c(1, 2, 3, 4, 5)
variable2 <- c(5, 4, 3, 2, 1)

# Calculate Pearson correlation coefficient
pearson_coefficient <- cor(variable1, variable2, method = "pearson")
print(paste("Pearson Correlation Coefficient:", round(pearson_coefficient, 2)))

# Create two vectors (variables)
variable1 <- c(1, 2, 3, 4, 5)
variable2 <- c(5, 4, 3, 2, 1)

# Calculate Spearman rank correlation coefficient
spearman_coefficient <- cor(variable1, variable2, method = "spearman")
print(paste("Spearman Rank Correlation Coefficient:", round(spearman_coefficient, 2)))

# Create two vectors (variables)
variable1 <- c(1, 2, 3, 4, 5)
variable2 <- c(5, 4, 3, 2, 1)

# Calculate Kendall Tau correlation coefficient
kendall_coefficient <- cor(variable1, variable2, method = "kendall")
print(paste("Kendall Tau Correlation Coefficient:", round(kendall_coefficient, 2)))

These code examples illustrate how to calculate correlation coefficients using R, specifically focusing on the Pearson, Spearman Rank, and Kendall Tau correlation methods. You can apply these techniques to your own datasets and analyses in R, depending on your specific research or data analysis needs.

Correlation Analysis Examples

Now that we've covered the fundamentals of correlation analysis, let's explore practical examples that showcase how correlation analysis can be applied to real-world scenarios. These examples will help you understand the relevance and utility of correlation analysis in various domains.

Example 1: Finance and Investment

Suppose you are an investment analyst working for a hedge fund, and you want to evaluate the relationship between two stocks: Stock A and Stock B. Your goal is to determine whether there is a correlation between the daily returns of these stocks.

  • Data Collection: Gather historical daily price data for both Stock A and Stock B.
  • Data Preparation: Calculate the daily returns for each stock, which can be done by taking the percentage change in the closing price from one day to the next.
  • Correlation Analysis: Use correlation analysis to measure the correlation between the daily returns of Stock A and Stock B. You can calculate the Pearson correlation coefficient, which will indicate the strength and direction of the relationship. A minimal pandas sketch of this step follows the list.
  • Interpretation: If the correlation coefficient is close to 1, it suggests a strong positive correlation, meaning that when Stock A goes up, Stock B tends to go up as well. If it's close to -1, it indicates a strong negative correlation, implying that when one stock rises, the other falls. A correlation coefficient close to 0 suggests little to no linear relationship.
  • Portfolio Management: Based on the correlation analysis results, you can decide whether it makes sense to include both stocks in your portfolio. If they are highly positively correlated, adding both may not provide adequate diversification. Conversely, if they are negatively correlated, they may serve as a good hedge against each other.
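As a rough illustration of the data preparation and correlation steps above, here is a minimal pandas sketch. The closing prices and column names (stock_a, stock_b) are hypothetical placeholders rather than real market data:

import pandas as pd

# Hypothetical closing prices for two stocks over six trading days
prices = pd.DataFrame({
    "stock_a": [100.0, 101.5, 103.0, 102.0, 104.5, 106.0],
    "stock_b": [50.0, 50.5, 49.8, 50.2, 51.0, 51.5],
})

# Daily returns: percentage change in the closing price from one day to the next
returns = prices.pct_change().dropna()

# Pearson correlation between the two daily-return series
correlation = returns["stock_a"].corr(returns["stock_b"])
print(f"Correlation of daily returns: {correlation:.2f}")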

Example 2: Healthcare and Medical Research

You are a researcher studying the relationship between patients' Body Mass Index (BMI) and their cholesterol levels. Your objective is to determine if there is a correlation between BMI and cholesterol levels among a sample of patients.

  • Data Collection: Collect data from a sample of patients, including their BMI and cholesterol levels.
  • Data Preparation: Ensure that the data is clean and there are no missing values. You may need to categorize BMI levels if you want to explore categorical correlations.
  • Correlation Analysis: Perform correlation analysis to calculate the Pearson correlation coefficient between BMI and cholesterol levels. This will help you quantify the strength and direction of the relationship. A small SciPy sketch of this step follows the list.
  • Interpretation: If the Pearson correlation coefficient is positive and significant, it suggests that as BMI increases, cholesterol levels tend to increase. A negative coefficient would indicate the opposite. A correlation close to 0 implies little to no linear relationship.
  • Clinical Implications: Use the correlation analysis results to inform clinical decisions. For example, if there is a strong positive correlation, healthcare professionals may consider monitoring cholesterol levels more closely in patients with higher BMI.
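As a rough sketch of the correlation step, scipy.stats.pearsonr returns both the coefficient and its p-value. The BMI and cholesterol values below are made-up illustrative numbers, not patient data:

from scipy import stats

# Hypothetical BMI values and total cholesterol levels (mg/dL) for eight patients
bmi = [21.5, 24.0, 26.3, 28.1, 30.4, 32.0, 27.5, 23.2]
cholesterol = [180, 195, 205, 220, 240, 250, 210, 190]

# Pearson correlation coefficient and two-sided p-value
r, p_value = stats.pearsonr(bmi, cholesterol)
print(f"Pearson r = {r:.2f}, p-value = {p_value:.4f}")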

Example 3: Education and Student Performance

As an educational researcher, you are interested in understanding the factors that influence student performance in a high school setting. You want to explore the correlation between variables such as student attendance, hours spent studying, and exam scores.

  • Data Collection: Collect data from a sample of high school students, including their attendance records, hours spent studying per week, and exam scores.
  • Data Preparation: Ensure data quality, handle any missing values, and categorize variables if necessary.
  • Correlation Analysis: Use correlation analysis to calculate correlation coefficients, such as the Pearson coefficient, between attendance, study hours, and exam scores. This will help identify which factors, if any, are correlated with student performance. A brief pandas sketch of this step follows the list.
  • Interpretation: Analyze the correlation coefficients to determine the strength and direction of the relationships. For instance, a positive correlation between attendance and exam scores would suggest that students with better attendance tend to perform better academically.
  • Educational Interventions: Based on the correlation analysis findings, academic institutions can implement targeted interventions. For example, if there is a negative correlation between study hours and exam scores, educators may encourage students to allocate more time to studying.
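As a rough sketch of the correlation step, a pandas DataFrame makes it easy to compute all pairwise coefficients at once. The attendance, study-hour, and exam-score values below are hypothetical:

import pandas as pd

# Hypothetical records for six students
data = pd.DataFrame({
    "attendance_pct": [95, 88, 76, 92, 60, 85],
    "study_hours": [10, 8, 5, 12, 3, 7],
    "exam_score": [88, 80, 65, 90, 55, 78],
})

# Pairwise Pearson correlations between attendance, study hours, and exam scores
print(data.corr())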

These practical examples illustrate how correlation analysis can be applied to different fields, including finance, healthcare, and education. By understanding the relationships between variables, organizations and researchers can make informed decisions, optimize strategies, and improve outcomes in their respective domains.

Conclusion for Correlation Analysis

Correlation analysis is a powerful tool that allows us to understand the connections between different variables. By quantifying these relationships, we gain insights that help us make better decisions, manage risks, and improve outcomes in various fields like finance, healthcare, and education.

So, whether you're analyzing stock market trends, researching medical data, or studying student performance, correlation analysis equips you with the knowledge to uncover meaningful connections and make data-driven choices. Embrace the power of correlation analysis in your data journey, and you'll find that it's an essential compass for navigating the complex landscape of information and decision-making.

How to Conduct Correlation Analysis in Minutes?

In the world of data-driven decision-making, Appinio is your go-to partner for real-time consumer insights. We've redefined market research, making it exciting, intuitive, and seamlessly integrated into everyday choices. When it comes to correlation analysis, here's why you'll love Appinio:

  • Lightning-Fast Insights: Say goodbye to waiting. With Appinio, you'll turn questions into insights in minutes, not days.
  • No Research Degree Required: Our platform is so user-friendly that anyone can master it, no PhD in research needed.
  • Global Reach, Local Expertise: Survey your ideal target group from 1200+ characteristics across 90+ countries. Our dedicated research consultants are with you every step of the way.


7.2 Correlational Research

Learning Objectives

  • Define correlational research and give several examples.
  • Explain why a researcher might choose to conduct correlational research rather than experimental research or another type of nonexperimental research.

What Is Correlational Research?

Correlational research is a type of nonexperimental research in which the researcher measures two variables and assesses the statistical relationship (i.e., the correlation) between them with little or no effort to control extraneous variables. There are essentially two reasons that researchers interested in statistical relationships between variables would choose to conduct a correlational study rather than an experiment. The first is that they do not believe that the statistical relationship is a causal one. For example, a researcher might evaluate the validity of a brief extraversion test by administering it to a large group of participants along with a longer extraversion test that has already been shown to be valid. This researcher might then check to see whether participants’ scores on the brief test are strongly correlated with their scores on the longer one. Neither test score is thought to cause the other, so there is no independent variable to manipulate. In fact, the terms independent variable and dependent variable do not apply to this kind of research.

The other reason that researchers would choose to use a correlational study rather than an experiment is that the statistical relationship of interest is thought to be causal, but the researcher cannot manipulate the independent variable because it is impossible, impractical, or unethical. For example, Allen Kanner and his colleagues thought that the number of “daily hassles” (e.g., rude salespeople, heavy traffic) that people experience affects the number of physical and psychological symptoms they have (Kanner, Coyne, Schaefer, & Lazarus, 1981). But because they could not manipulate the number of daily hassles their participants experienced, they had to settle for measuring the number of daily hassles—along with the number of symptoms—using self-report questionnaires. Although the strong positive relationship they found between these two variables is consistent with their idea that hassles cause symptoms, it is also consistent with the idea that symptoms cause hassles or that some third variable (e.g., neuroticism) causes both.

A common misconception among beginning researchers is that correlational research must involve two quantitative variables, such as scores on two extraversion tests or the number of hassles and number of symptoms people have experienced. However, the defining feature of correlational research is that the two variables are measured—neither one is manipulated—and this is true regardless of whether the variables are quantitative or categorical. Imagine, for example, that a researcher administers the Rosenberg Self-Esteem Scale to 50 American college students and 50 Japanese college students. Although this “feels” like a between-subjects experiment, it is a correlational study because the researcher did not manipulate the students’ nationalities. The same is true of the study by Cacioppo and Petty comparing college faculty and factory workers in terms of their need for cognition. It is a correlational study because the researchers did not manipulate the participants’ occupations.

Figure 7.2 “Results of a Hypothetical Study on Whether People Who Make Daily To-Do Lists Experience Less Stress Than People Who Do Not Make Such Lists” shows data from a hypothetical study on the relationship between whether people make a daily list of things to do (a “to-do list”) and stress. Notice that it is unclear whether this is an experiment or a correlational study because it is unclear whether the independent variable was manipulated. If the researcher randomly assigned some participants to make daily to-do lists and others not to, then it is an experiment. If the researcher simply asked participants whether they made daily to-do lists, then it is a correlational study. The distinction is important because if the study was an experiment, then it could be concluded that making the daily to-do lists reduced participants’ stress. But if it was a correlational study, it could only be concluded that these variables are statistically related. Perhaps being stressed has a negative effect on people’s ability to plan ahead (the directionality problem). Or perhaps people who are more conscientious are more likely to make to-do lists and less likely to be stressed (the third-variable problem). The crucial point is that what defines a study as experimental or correlational is not the variables being studied, nor whether the variables are quantitative or categorical, nor the type of graph or statistics used to analyze the data. It is how the study is conducted.

Figure 7.2 Results of a Hypothetical Study on Whether People Who Make Daily To-Do Lists Experience Less Stress Than People Who Do Not Make Such Lists


Data Collection in Correlational Research

Again, the defining feature of correlational research is that neither variable is manipulated. It does not matter how or where the variables are measured. A researcher could have participants come to a laboratory to complete a computerized backward digit span task and a computerized risky decision-making task and then assess the relationship between participants’ scores on the two tasks. Or a researcher could go to a shopping mall to ask people about their attitudes toward the environment and their shopping habits and then assess the relationship between these two variables. Both of these studies would be correlational because no independent variable is manipulated. However, because some approaches to data collection are strongly associated with correlational research, it makes sense to discuss them here. The two we will focus on are naturalistic observation and archival data. A third, survey research, is discussed in its own chapter.

Naturalistic Observation

Naturalistic observation is an approach to data collection that involves observing people’s behavior in the environment in which it typically occurs. Thus naturalistic observation is a type of field research (as opposed to a type of laboratory research). It could involve observing shoppers in a grocery store, children on a school playground, or psychiatric inpatients in their wards. Researchers engaged in naturalistic observation usually make their observations as unobtrusively as possible so that participants are often not aware that they are being studied. Ethically, this is considered to be acceptable if the participants remain anonymous and the behavior occurs in a public setting where people would not normally have an expectation of privacy. Grocery shoppers putting items into their shopping carts, for example, are engaged in public behavior that is easily observable by store employees and other shoppers. For this reason, most researchers would consider it ethically acceptable to observe them for a study. On the other hand, one of the arguments against the ethicality of the naturalistic observation of “bathroom behavior” discussed earlier in the book is that people have a reasonable expectation of privacy even in a public restroom and that this expectation was violated.

Researchers Robert Levine and Ara Norenzayan used naturalistic observation to study differences in the “pace of life” across countries (Levine & Norenzayan, 1999). One of their measures involved observing pedestrians in a large city to see how long it took them to walk 60 feet. They found that people in some countries walked reliably faster than people in other countries. For example, people in the United States and Japan covered 60 feet in about 12 seconds on average, while people in Brazil and Romania took close to 17 seconds.

Because naturalistic observation takes place in the complex and even chaotic “real world,” there are two closely related issues that researchers must deal with before collecting data. The first is sampling. When, where, and under what conditions will the observations be made, and who exactly will be observed? Levine and Norenzayan described their sampling process as follows:

Male and female walking speed over a distance of 60 feet was measured in at least two locations in main downtown areas in each city. Measurements were taken during main business hours on clear summer days. All locations were flat, unobstructed, had broad sidewalks, and were sufficiently uncrowded to allow pedestrians to move at potentially maximum speeds. To control for the effects of socializing, only pedestrians walking alone were used. Children, individuals with obvious physical handicaps, and window-shoppers were not timed. Thirty-five men and 35 women were timed in most cities. (p. 186)

Precise specification of the sampling process in this way makes data collection manageable for the observers, and it also provides some control over important extraneous variables. For example, by making their observations on clear summer days in all countries, Levine and Norenzayan controlled for effects of the weather on people’s walking speeds.

The second issue is measurement. What specific behaviors will be observed? In Levine and Norenzayan’s study, measurement was relatively straightforward. They simply measured out a 60-foot distance along a city sidewalk and then used a stopwatch to time participants as they walked over that distance. Often, however, the behaviors of interest are not so obvious or objective. For example, researchers Robert Kraut and Robert Johnston wanted to study bowlers’ reactions to their shots, both when they were facing the pins and then when they turned toward their companions (Kraut & Johnston, 1979). But what “reactions” should they observe? Based on previous research and their own pilot testing, Kraut and Johnston created a list of reactions that included “closed smile,” “open smile,” “laugh,” “neutral face,” “look down,” “look away,” and “face cover” (covering one’s face with one’s hands). The observers committed this list to memory and then practiced by coding the reactions of bowlers who had been videotaped. During the actual study, the observers spoke into an audio recorder, describing the reactions they observed. Among the most interesting results of this study was that bowlers rarely smiled while they still faced the pins. They were much more likely to smile after they turned toward their companions, suggesting that smiling is not purely an expression of happiness but also a form of social communication.


When the observations require a judgment on the part of the observers—as in Kraut and Johnston’s study—this process is often described as coding . Coding generally requires clearly defining a set of target behaviors. The observers then categorize participants individually in terms of which behavior they have engaged in and the number of times they engaged in each behavior. The observers might even record the duration of each behavior. The target behaviors must be defined in such a way that different observers code them in the same way. This is the issue of interrater reliability. Researchers are expected to demonstrate the interrater reliability of their coding procedure by having multiple raters code the same behaviors independently and then showing that the different observers are in close agreement. Kraut and Johnston, for example, video recorded a subset of their participants’ reactions and had two observers independently code them. The two observers showed that they agreed on the reactions that were exhibited 97% of the time, indicating good interrater reliability.
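Percent agreement of this kind is straightforward to compute once both observers' codes are lined up. The sketch below uses made-up category labels purely for illustration; it is not the actual procedure or data from Kraut and Johnston's study:

# Hypothetical codes assigned by two independent observers to the same ten reactions
observer_1 = ["open smile", "neutral face", "laugh", "look down", "closed smile",
              "neutral face", "open smile", "look away", "neutral face", "laugh"]
observer_2 = ["open smile", "neutral face", "laugh", "look down", "open smile",
              "neutral face", "open smile", "look away", "neutral face", "laugh"]

# Percent agreement: proportion of reactions coded identically by both observers
matches = sum(a == b for a, b in zip(observer_1, observer_2))
percent_agreement = 100 * matches / len(observer_1)
print(f"Interrater agreement: {percent_agreement:.0f}%")  # 90% in this made-up example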

Archival Data

Another approach to correlational research is the use of archival data , which are data that have already been collected for some other purpose. An example is a study by Brett Pelham and his colleagues on “implicit egotism”—the tendency for people to prefer people, places, and things that are similar to themselves (Pelham, Carvallo, & Jones, 2005). In one study, they examined Social Security records to show that women with the names Virginia, Georgia, Louise, and Florence were especially likely to have moved to the states of Virginia, Georgia, Louisiana, and Florida, respectively.

As with naturalistic observation, measurement can be more or less straightforward when working with archival data. For example, counting the number of people named Virginia who live in various states based on Social Security records is relatively straightforward. But consider a study by Christopher Peterson and his colleagues on the relationship between optimism and health using data that had been collected many years before for a study on adult development (Peterson, Seligman, & Vaillant, 1988). In the 1940s, healthy male college students had completed an open-ended questionnaire about difficult wartime experiences. In the late 1980s, Peterson and his colleagues reviewed the men’s questionnaire responses to obtain a measure of explanatory style—their habitual ways of explaining bad events that happen to them. More pessimistic people tend to blame themselves and expect long-term negative consequences that affect many aspects of their lives, while more optimistic people tend to blame outside forces and expect limited negative consequences. To obtain a measure of explanatory style for each participant, the researchers used a procedure in which all negative events mentioned in the questionnaire responses, and any causal explanations for them, were identified and written on index cards. These were given to a separate group of raters who rated each explanation in terms of three separate dimensions of optimism-pessimism. These ratings were then averaged to produce an explanatory style score for each participant. The researchers then assessed the statistical relationship between the men’s explanatory style as college students and archival measures of their health at approximately 60 years of age. The primary result was that the more optimistic the men were as college students, the healthier they were as older men. Pearson’s r was +.25.

This is an example of content analysis —a family of systematic approaches to measurement using complex archival data. Just as naturalistic observation requires specifying the behaviors of interest and then noting them as they occur, content analysis requires specifying keywords, phrases, or ideas and then finding all occurrences of them in the data. These occurrences can then be counted, timed (e.g., the amount of time devoted to entertainment topics on the nightly news show), or analyzed in a variety of other ways.

Key Takeaways

  • Correlational research involves measuring two variables and assessing the relationship between them, with no manipulation of an independent variable.
  • Correlational research is not defined by where or how the data are collected. However, some approaches to data collection are strongly associated with correlational research. These include naturalistic observation (in which researchers observe people’s behavior in the context in which it normally occurs) and the use of archival data that were already collected for some other purpose.

Discussion: For each of the following, decide whether it is most likely that the study described is experimental or correlational and explain why.

  • An educational researcher compares the academic performance of students from the “rich” side of town with that of students from the “poor” side of town.
  • A cognitive psychologist compares the ability of people to recall words that they were instructed to “read” with their ability to recall words that they were instructed to “imagine.”
  • A manager studies the correlation between new employees’ college grade point averages and their first-year performance reports.
  • An automotive engineer installs different stick shifts in a new car prototype, each time asking several people to rate how comfortable the stick shift feels.
  • A food scientist studies the relationship between the temperature inside people’s refrigerators and the amount of bacteria on their food.
  • A social psychologist tells some research participants that they need to hurry over to the next building to complete a study. She tells others that they can take their time. Then she observes whether they stop to help a research assistant who is pretending to be hurt.

Kanner, A. D., Coyne, J. C., Schaefer, C., & Lazarus, R. S. (1981). Comparison of two modes of stress measurement: Daily hassles and uplifts versus major life events. Journal of Behavioral Medicine, 4 , 1–39.

Kraut, R. E., & Johnston, R. E. (1979). Social and emotional messages of smiling: An ethological approach. Journal of Personality and Social Psychology, 37 , 1539–1553.

Levine, R. V., & Norenzayan, A. (1999). The pace of life in 31 countries. Journal of Cross-Cultural Psychology, 30 , 178–205.

Pelham, B. W., Carvallo, M., & Jones, J. T. (2005). Implicit egotism. Current Directions in Psychological Science, 14 , 106–110.

Peterson, C., Seligman, M. E. P., & Vaillant, G. E. (1988). Pessimistic explanatory style is a risk factor for physical illness: A thirty-five year longitudinal study. Journal of Personality and Social Psychology, 55 , 23–27.

Research Methods in Psychology Copyright © 2016 by University of Minnesota is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

Correlation in Psychology: Meaning, Types, Examples & Coefficient


Correlation means association – more precisely, it measures the extent to which two variables are related. There are three possible results of a correlational study: a positive correlation, a negative correlation, and no correlation.
  • A positive correlation is a relationship between two variables in which both variables move in the same direction. Therefore, one variable increases as the other variable increases, or one variable decreases while the other decreases. An example of a positive correlation would be height and weight. Taller people tend to be heavier.

positive correlation

  • A negative correlation is a relationship between two variables in which an increase in one variable is associated with a decrease in the other. An example of a negative correlation would be the height above sea level and temperature. As you climb the mountain (increase in height), it gets colder (decrease in temperature).

negative correlation

  • A zero correlation exists when there is no relationship between two variables. For example, there is no relationship between the amount of tea drunk and the level of intelligence.

zero correlation

Scatter Plots

A correlation can be expressed visually. This is done by drawing a scatter plot (also known as a scattergram, scatter graph, scatter chart, or scatter diagram).

A scatter plot is a graphical display that shows the relationships or associations between two numerical variables (or co-variables), which are represented as points (or dots) for each pair of scores.

A scatter plot indicates the strength and direction of the correlation between the co-variables.

Types of Correlations: Positive, Negative, and Zero

When you draw a scatter plot, it doesn’t matter which variable goes on the x-axis and which goes on the y-axis.

Remember, in correlations, we always deal with paired scores, so the values of the two variables taken together will be used to make the diagram.

Decide which variable goes on each axis and then simply put a cross at the point where the two values coincide.

Uses of Correlations

  • Prediction: If there is a relationship between two variables, we can make predictions about one from the other.
  • Validity: Concurrent validity (correlation between a new measure and an established measure).
  • Reliability: Test-retest reliability (are measures consistent?) and inter-rater reliability (are observers consistent?).
  • Theory verification: Predictive validity.

Correlation Coefficients

Instead of drawing a scatter plot, a correlation can be expressed numerically as a coefficient, ranging from -1 to +1. When working with continuous variables, the correlation coefficient to use is Pearson’s r.

Correlation Coefficient Interpretation

The correlation coefficient ( r ) indicates the extent to which the pairs of numbers for these two variables lie on a straight line. Values over zero indicate a positive correlation, while values under zero indicate a negative correlation.

A correlation of –1 indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down. A correlation of +1 indicates a perfect positive correlation, meaning that as one variable goes up, the other goes up.

There is no rule for determining what correlation size is considered strong, moderate, or weak. The interpretation of the coefficient depends on the topic of study.

When studying things that are difficult to measure, we should expect correlation coefficients to be lower. In these kinds of studies, we rarely see correlations above 0.6. For this kind of data, we generally consider correlations above 0.4 to be relatively strong; correlations between 0.2 and 0.4 are moderate, and those below 0.2 are considered weak.

When we are studying things that are easier to measure or count, such as socioeconomic and demographic data, we expect higher correlations. For this kind of data, we generally consider correlations above 0.75 to be relatively strong; correlations between 0.45 and 0.75 are moderate, and those below 0.45 are considered weak.

Correlation vs. Causation

Causation means that one variable (often called the predictor variable or independent variable) causes the other (often called the outcome variable or dependent variable).

Experiments can be conducted to establish causation. An experiment isolates and manipulates the independent variable to observe its effect on the dependent variable and controls the environment in order that extraneous variables may be eliminated.

A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the change in the values of the other variable. A correlation only shows if there is a relationship between variables.

causation vs. correlation graph

While variables are sometimes correlated because one does cause the other, it could also be that some other factor, a confounding variable , is actually causing the systematic movement in our variables of interest.

Correlation does not prove causation, as a third variable may be involved. For example, being a patient in a hospital is correlated with dying, but this does not mean that one event causes the other; a third variable (such as diet or level of exercise) might be involved.

“Correlation is not causation” means that just because two variables are related it does not necessarily mean that one causes the other.

A correlational study identifies variables and looks for a relationship between them, whereas an experiment tests the effect that an independent variable has upon a dependent variable.

This means that an experiment can establish cause and effect (causation), but a correlation can only indicate a relationship, as another extraneous variable that is not known about may be involved.

Strengths

1. Correlation allows the researcher to investigate naturally occurring variables that may be unethical or impractical to test experimentally. For example, it would be unethical to conduct an experiment on whether smoking causes lung cancer.

2. Correlation allows the researcher to clearly and easily see if there is a relationship between variables. This can then be displayed in a graphical form.

Limitations

1. Correlation is not and cannot be taken to imply causation. Even if there is a very strong association between two variables, we cannot assume that one causes the other.

For example, suppose we found a positive correlation between watching violence on T.V. and violent behavior in adolescence.

It could be that the cause of both of these is a third (extraneous) variable – for example, growing up in a violent home – and that both the watching of T.V. and the violent behavior are the outcome of this.

2. Correlation does not allow us to go beyond the given data. For example, suppose it was found that there was an association between time spent on homework (1/2 hour to 3 hours) and the number of G.C.S.E. passes (1 to 6).

It would not be legitimate to infer from this that spending 6 hours on homework would likely generate 12 G.C.S.E. passes.

How do you know if a study is correlational?

A study is considered correlational if it examines the relationship between two or more variables without manipulating them. In other words, the study does not involve the manipulation of an independent variable to see how it affects a dependent variable.

One way to identify a correlational study is to look for language that suggests a relationship between variables rather than cause and effect.

For example, the study may use phrases like “associated with,” “related to,” or “predicts” when describing the variables being studied.

Another way to identify a correlational study is to look for information about how the variables were measured. Correlational studies typically involve measuring variables using self-report surveys, questionnaires, or other measures of naturally occurring behavior.

Finally, a correlational study may include statistical analyses such as correlation coefficients or regression analyses to examine the strength and direction of the relationship between variables.

Why is a correlational study used?

Correlational studies are particularly useful when it is not possible or ethical to manipulate one of the variables.

For example, it would not be ethical to manipulate someone’s age or gender. However, researchers may still want to understand how these variables relate to outcomes such as health or behavior.

Additionally, correlational studies can be used to generate hypotheses and guide further research.

If a correlational study finds a significant relationship between two variables, this can suggest a possible causal relationship that can be further explored in future research.

What is the goal of correlational research?

The ultimate goal of correlational research is to increase our understanding of how different variables are related and to identify patterns in those relationships.

This information can then be used to generate hypotheses and guide further research aimed at establishing causality.



Interpreting Correlation Coefficients

By Jim Frost

What are Correlation Coefficients?

Correlation coefficients measure the strength of the relationship between two variables. A correlation between variables indicates that as one variable changes in value, the other variable tends to change in a specific direction.  Understanding that relationship is useful because we can use the value of one variable to predict the value of the other variable. For example, height and weight are correlated—as height increases, weight also tends to increase. Consequently, if we observe an individual who is unusually tall, we can predict that his weight is also above the average.

In statistics , correlation coefficients are a quantitative assessment that measures both the direction and the strength of this tendency to vary together. There are different types of correlation coefficients that you can use for different kinds of data . In this post, I cover the most common type of correlation—Pearson’s correlation coefficient.

Before we get into the numbers, let’s graph some data first so we can understand the concept behind what we are measuring.

Graph Your Data to Find Correlations

Scatterplots are a great way to check quickly for correlation between pairs of continuous data. The scatterplot below displays the height and weight of pre-teenage girls. Each dot on the graph represents an individual girl and her combination of height and weight. These data are actual data that I collected during an experiment.

This scatterplot displays a positive correlation between height and weight.

At a glance, you can see that there is a correlation between height and weight. As height increases, weight also tends to increase. However, it’s not a perfect relationship. If you look at a specific height, say 1.5 meters, you can see that there is a range of weights associated with it. You can also find short people who weigh more than taller people. However, the general tendency that height and weight increase together is unquestionably present—a correlation exists.

Pearson’s correlation coefficient takes all of the data points on this graph and represents them as a single number. In this case, the statistical output below indicates that the Pearson’s correlation coefficient is 0.694.

Statistical output that displays Pearson's correlation coefficient and p-value.

What do the Pearson correlation coefficient and p-value mean? We’ll interpret the output soon. First, let’s look at a range of possible correlation coefficients so we can understand how our height and weight example fits in.

Related posts : Using Excel to Calculate Correlation and Guide to Scatterplots

How to Interpret Pearson Correlation Coefficients

Pearson’s correlation coefficient is represented by the Greek letter rho ( ρ ) for the population parameter and r for a sample statistic. This correlation coefficient is a single number that measures both the strength and direction of the linear relationship between two continuous variables. Values can range from -1 to +1.

The greater the absolute value of the Pearson correlation coefficient, the stronger the relationship.

  • The extreme values of -1 and 1 indicate a perfectly linear relationship where a change in one variable is accompanied by a perfectly consistent change in the other. For these relationships, all of the data points fall on a line. In practice, you won’t see either type of perfect relationship.
  • A coefficient of zero represents no linear relationship. As one variable increases, there is no tendency in the other variable to either increase or decrease.
  • When the value is in-between 0 and +1/-1, there is a relationship, but the points don’t all fall on a line. As r approaches -1 or 1, the strength of the relationship increases and the data points tend to fall closer to a line.

The sign of the Pearson correlation coefficient represents the direction of the relationship.

  • Positive coefficients indicate that when the value of one variable increases, the value of the other variable also tends to increase. Positive relationships produce an upward slope on a scatterplot.
  • Negative coefficients represent cases when the value of one variable increases, the value of the other variable tends to decrease. Negative relationships produce a downward slope.

Statisticians consider Pearson’s correlation coefficients to be a standardized effect size because they indicate the strength of the relationship between variables using unitless values that fall within a standardized range of -1 to +1. Effect sizes help you understand how important the findings are in a practical sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics .

Learn how to calculate correlation in my post, Correlation Coefficient Formula Walkthrough .

Covariance is an unstandardized form of correlation. Learn about it in my posts:

  • Covariance: Definition, Formula & Example
  • Covariances vs Correlation: Understanding the Differences

Examples of Positive and Negative Correlation Coefficients

A positive correlation example is the relationship between the speed of a wind turbine and the amount of energy it produces. As the turbine speed increases, electricity production also increases.

A negative correlation example is the relationship between outdoor temperature and heating costs. As the temperature increases, heating costs decrease.

Graphs for Different Correlation Coefficients

Graphs always help bring concepts to life. The scatterplots below represent a spectrum of different Pearson correlation coefficients. I’ve held the horizontal and vertical scales of the scatterplots constant to allow for valid comparisons between them.

This scatterplot displays a perfect positive correlation of +1.

Discussion about the Scatterplots

For the scatterplots above, I created one positive correlation between the variables and one negative relationship between the variables. Then, I varied only the amount of dispersion between the data points and the line that defines the relationship. That process illustrates how correlation measures the strength of the relationship. The stronger the relationship, the closer the data points fall to the line. I didn’t include plots for weaker correlation coefficients that are closer to zero than 0.6 and -0.6 because they start to look like blobs of dots and it’s hard to see the relationship.

A common misinterpretation is assuming that negative Pearson correlation coefficients indicate that there is no relationship. After all, a negative correlation sounds suspiciously like no relationship. However, the scatterplots for the negative correlations display real relationships. For negative correlation coefficients, high values of one variable are associated with low values of another variable. For example, there is a negative correlation coefficient for school absences and grades. As the number of absences increases, the grades decrease.

Earlier I mentioned how crucial it is to graph your data to understand them better. However, a quantitative measurement of the relationship does have an advantage. Graphs are a great way to visualize the data, but the scaling can exaggerate or weaken the appearance of a correlation. Additionally, the automatic scaling in most statistical software tends to make all data look similar .

Fortunately, Pearson’s correlation coefficients are unaffected by scaling issues. Consequently, a statistical assessment is better for determining the precise strength of the relationship.

Graphs and the relevant statistical measures often work better in tandem.

Pearson’s Correlation Coefficients Measure Linear Relationship

Pearson’s correlation coefficients measure only linear relationships. Consequently, if your data contain a curvilinear relationship, the Pearson correlation coefficient will not detect it. For example, the correlation for the data in the scatterplot below is zero. However, there is a relationship between the two variables—it’s just not linear.

Scatterplot displays a curvilinear relationship that has a Pearson's correlation coefficient of 0.

This example illustrates another reason to graph your data! Just because the coefficient is near zero, it doesn’t necessarily indicate that there is no relationship.
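You can see this numerically with a quick made-up example: for data that follow a symmetric curve, Pearson's r comes out near zero even though one variable is completely determined by the other:

import numpy as np

# A perfect curvilinear (U-shaped) relationship: y depends entirely on x
x = np.linspace(-5, 5, 101)
y = x ** 2

# Pearson's r only measures linear association, so it is essentially zero here
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r for y = x^2: {r:.3f}")  # approximately 0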

Spearman’s correlation is a nonparametric alternative to Pearson’s correlation coefficient. Use Spearman’s correlation for nonlinear, monotonic relationships and for ordinal data. For more information, read my post Spearman’s Correlation Explained !

Hypothesis Test for Correlation Coefficients

Correlation coefficients have a hypothesis test. As with any hypothesis test, this test takes sample data and evaluates two mutually exclusive statements about the population from which the sample was drawn. For Pearson correlations, the two hypotheses are the following:

  • Null hypothesis: There is no linear relationship between the two variables. ρ = 0.
  • Alternative hypothesis: There is a linear relationship between the two variables. ρ ≠ 0.

Correlation coefficients that equal zero indicate no linear relationship exists. If your p-value is less than your significance level , the sample contains sufficient evidence to reject the null hypothesis and conclude that the Pearson correlation coefficient does not equal zero. In other words, the sample data support the notion that the relationship exists in the population.
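Under the hood, the test statistic is t = r * sqrt(n - 2) / sqrt(1 - r^2), which follows a t-distribution with n - 2 degrees of freedom when the null hypothesis is true. As a rough sketch, using the 0.694 coefficient from this post but an assumed sample size of 30 (the actual sample size is not shown here):

import math
from scipy import stats

r = 0.694   # sample Pearson correlation coefficient
n = 30      # assumed sample size for illustration

# t-statistic for testing the null hypothesis that rho = 0
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"t = {t_stat:.2f}, p-value = {p_value:.5f}")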

Related post : Overview of Hypothesis Tests

Interpreting our Height and Weight Correlation Example

Now that we have seen a range of positive and negative relationships, let’s see how our Pearson correlation coefficient of 0.694 fits in. We know that it’s a positive relationship. As height increases, weight tends to increase. Regarding the strength of the relationship, the graph shows that it’s not a very strong relationship where the data points tightly hug a line. However, it’s not an entirely amorphous blob with a very low correlation. It’s somewhere in between. That description matches our moderate correlation coefficient of 0.694.

For the hypothesis test, our p-value equals 0.000. This p-value is less than any reasonable significance level. Consequently, we can reject the null hypothesis and conclude that the relationship is statistically significant. The sample data support the notion that the relationship between height and weight exists in the population of preteen girls.

Correlation Does Not Imply Causation

I’m sure you’ve heard this expression before, and it is a crucial warning. Correlation between two variables indicates that changes in one variable are associated with changes in the other variable. However, correlation does not mean that the changes in one variable actually cause the changes in the other variable.

Sometimes it is clear that there is a causal relationship. For the height and weight data, it makes sense that adding more vertical structure to a body causes the total mass to increase. Or, increasing the wattage of lightbulbs causes the light output to increase.

However, in other cases, a causal relationship is not possible. For example, ice cream sales and shark attacks have a positive correlation coefficient. Clearly, selling more ice cream does not cause shark attacks (or vice versa). Instead, a third variable, outdoor temperatures, causes changes in the other two variables. Higher temperatures increase both sales of ice cream and the number of swimmers in the ocean, which creates the apparent relationship between ice cream sales and shark attacks.

Beware of spurious correlations!

In statistics, you typically need to perform a randomized, controlled experiment to determine that a relationship is causal rather than merely correlational. Conversely, correlational studies will find relationships quickly and easily, but they are not suitable for establishing causality.

Learn more about Correlation vs. Causation: Understanding the Differences .

Related posts : Using Random Assignment in Experiments and Observational Studies

How Strong of a Correlation is Considered Good?

What is a good correlation? How high should correlation coefficients be? These are commonly asked questions. I have seen several schemes that attempt to classify correlations as strong, medium, and weak.

However, there is only one correct answer. A Pearson correlation coefficient should accurately reflect the strength of the relationship. Take a look at the correlation between the height and weight data, 0.694. It’s not a very strong relationship, but it accurately represents our data. An accurate representation is the best-case scenario for using a statistic to describe an entire dataset.

The strength of any relationship naturally depends on the specific pair of variables. Some research questions involve weaker relationships than other subject areas. Case in point, humans are hard to predict. Studies that assess relationships involving human behavior tend to have correlation coefficients weaker than +/- 0.6.

However, if you analyze two variables in a physical process and have very precise measurements, you might expect correlations near +1 or -1. There is no one-size-fits-all answer for how strong a relationship should be. The correct values for correlation coefficients depend on your study area.

Taking Correlation to the Next Level with Regression Analysis

Wouldn’t it be nice if instead of just describing the strength of the relationship between height and weight, we could define the relationship itself using an equation? Regression analysis does just that. That analysis finds the line and corresponding equation that provides the best fit to our dataset. We can use that equation to understand how much weight increases with each additional unit of height and to make predictions for specific heights. Read my post where I talk about the regression model for the height and weight data .

Regression analysis allows us to expand on correlation in other ways. If we have more variables that explain changes in weight, we can include them in the model and potentially improve our predictions. And, if the relationship is curved, we can still fit a regression model to the data.

Additionally, a form of the Pearson correlation coefficient shows up in regression analysis. R-squared is a primary measure of how well a regression model fits the data. This statistic represents the percentage of variation in one variable that other variables explain. For a pair of variables, R-squared is simply the square of the Pearson’s correlation coefficient. For example, squaring the height-weight correlation coefficient of 0.694 produces an R-squared of 0.482, or 48.2%. In other words, height explains about half the variability of weight in preteen girls.
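As a small sketch of that connection, fitting a straight line with numpy and squaring the Pearson correlation give consistent results. The height and weight numbers below are made up and are not the data from this post:

import numpy as np

# Hypothetical heights (meters) and weights (kilograms)
height = np.array([1.40, 1.45, 1.50, 1.52, 1.55, 1.60, 1.63, 1.68])
weight = np.array([38.0, 41.5, 44.0, 46.5, 47.0, 52.0, 55.5, 57.0])

# Simple linear regression: slope and intercept of the best-fit line
slope, intercept = np.polyfit(height, weight, 1)

# For a straight-line fit, R-squared equals the squared Pearson correlation
r = np.corrcoef(height, weight)[0, 1]
print(f"slope = {slope:.1f}, intercept = {intercept:.1f}")
print(f"Pearson r = {r:.3f}, R-squared = {r ** 2:.3f}")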

If you’re learning about statistics and like the approach I use in my blog, check out my Introduction to Statistics book! It’s available at Amazon and other retailers.


Reader Interactions


August 17, 2024 at 2:43 pm

Great, thank you!


August 15, 2024 at 9:33 am

Hi Jim. I had a query. Like if we say there is a correlation of 0.68 between x and y variable, then what exactly does this “0.68” as a “number” indicate apart from the fact that we can say there is a moderate association between x and y.

May 7, 2024 at 9:18 am

Is there any benefit to doing both a correlation and a regression test? I don’t think there is – I believe that a regression output will give you the same information a correlation output would plus more. Please could you let me know if that is correct or am I missing something?


May 7, 2024 at 2:08 pm

Hi Charlotte,

In general, you are correct for simple regression, where you have one independent variable and the dependent variable. The R-square for that model is literally the square of the Pearson’s correlation (r) for those two variables. As you mention, regression gives you additional output along with the strength of the relationship.

But there are a few caveats.

Regression is much more flexible than correlation because it allows you to add other variables, fit curvature and include interaction effects. For example, regression allows you to fit curvature between the two variables using polynomials. So, there are cases where using Pearson’s correlation is inappropriate because the data violate some of the assumptions but regression analysis can handle those data acceptably.

But what you say is correct when you’re looking at a straight line relationship between a pair of variables. In that specific case, simple regression and Pearson’s correlation provide consistent information with regression providing more details.


March 12, 2024 at 4:11 am

Hi If you are finding the trend between one type of quantitative discrete data and one type of qualitative ordinal data, what correlation test do you use?


September 9, 2023 at 4:46 am

It could be that the sharks are using ice cream as bait. Maybe the sharks are smarter than we think… Seriously, the ice cream as a cause is not likely, but sometimes a perfectly sensible hypothesis with lots of data behind it can be just plain wrong.

September 9, 2023 at 11:43 pm

It can be wrong in a causal sense, but if ice cream sales have a non-causal correlation with the number of shark attacks, they can still help you make predictions. Now, if you thought limiting ice cream sales would reduce shark attacks, that's not going to work!


June 9, 2023 at 1:56 am

What is to be done when two positive items show a negative correlation within one variable? E.g., an increase in house help decreases the no. of interruptions at work. It's confusing as both are positively worded questions.

June 10, 2023 at 1:09 am

It's possibly the result of other variables, known as confounding variables (or confounders), that you might not even have recorded. For example, there might be some other variable that correlates with both "house help" and "interruptions at work" and explains the unexpected correlation. Perhaps individuals with house help have more activities occurring throughout the day at home, and those activities cause more interruptions. In that case, you'd have a chain of correlations: "home activities" and "house help" are positively correlated, and "home activities" and "interruptions" are correlated as well. Given that arrangement, it wouldn't be surprising to see an unexpected correlation between "house help" and "interruptions" that is driven by the confounder rather than by house help itself.

It goes to show that you need to understand the larger context when analyzing data. Technically, this phenomenon is known as omitted variable bias . Your model (pairwise correlation) omits an important variable (a confounder) which is biasing the results. Click the link to learn more.

The answer is to identify and record the confounding variables and include them in your model, likely a regression model or partial correlation.
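To make that last suggestion concrete, here is a rough sketch of a partial correlation computed by hand in Python. The data are simulated and the variable names are hypothetical; the point is only to show how controlling for a confounder can change a pairwise correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated confounder and two variables that both depend on it (hypothetical names).
home_activities = rng.normal(size=n)
house_help = 0.7 * home_activities + rng.normal(scale=0.7, size=n)
interruptions = 0.7 * home_activities + rng.normal(scale=0.7, size=n)

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z from each."""
    x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
    y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(x_resid, y_resid)[0, 1]

simple_r = np.corrcoef(house_help, interruptions)[0, 1]
partial_r = partial_corr(house_help, interruptions, home_activities)

print(f"simple correlation:  {simple_r:.2f}")   # driven largely by the confounder
print(f"partial correlation: {partial_r:.2f}")  # much closer to zero once the confounder is controlled
```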


May 8, 2023 at 12:58 pm

What if my Pearson's r is 0.187 and the p-value is 0.001? Do I reject the null hypothesis?

May 8, 2023 at 2:56 pm

Yes! That p-value is below any reasonable significance level. Hence, you can reject the null hypothesis. However, be aware that while the correlation is statistically significant, it is so weak that it probably isn’t practically significant in the real world. In other words, it probably exists in the population you’re assessing but it is too weak to be noticeable/meaningful.
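To see how such a weak correlation can still produce such a small p-value, here is a small simulated sketch in Python. The true correlation of 0.19 and the sample size are made-up, illustrative values, not the commenter's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, rho = 5000, 0.19  # large sample, weak true correlation (illustrative values)

# Draw from a bivariate normal distribution with the chosen correlation.
cov = [[1.0, rho], [rho, 1.0]]
x, y = rng.multivariate_normal([0, 0], cov, size=n).T

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.2e}")
# Typical output: r stays around 0.19, yet p is far below 0.05 --
# statistically significant, but the relationship is still weak in practical terms.
```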

November 30, 2022 at 4:53 am

Thank you, Jim. I really appreciate your help. I will read your post about statistical v practical significance – that sounds really useful. I love how you explain things in such an accessible way.

I have one more question that I was hoping you would be able to help me with, please?

If I have done a correlation test and I have found an extremely weak negative relationship (e.g., -.02) but the relationship is not statistically significant, would this mean that although I have found a very weak negative correlation between the variables in the sample data, it would be unlikely to be found in the population? Therefore, I would fail to reject the null hypothesis that the correlation in the population equals zero.

Thank you again for your help and for this wonderful blog.

December 1, 2022 at 1:57 am

You’re very welcome!

In the case where the correlation is not significant, it indicates that you have insufficient evidence to conclude that it does not equal zero. That's a mouthful, but there's a reason for the convoluted wording. Insignificant results don't prove that there is no effect; they just indicate that your test didn't detect an effect in the population. It could be that the effect doesn't exist in the population, OR it could be that your sample size was too small or there's too much variability in the data.

In short, we say that you failed to reject the null hypothesis.

Basically, you can’t prove a negative (no effect). All you can say is that your study didn’t detect an effect. In this case, it didn’t detect a non-zero correlation.

You can read more about the reason behind the wording failing to reject the null hypothesis and what it means precisely.

November 29, 2022 at 12:39 pm

Thank you for this webpage. It is great. I have a question, which I was hoping you’d be able to help me with please.

I have carried out a correlation test, and from my understanding a null hypothesis would be that there is no relationship between the two variables (the variables are independent – there is no correlation).

The p value is statistically significant (.000), and the Pearson correlation result is -.036.

My understanding is that if there is a statistically significant relationship then I would reject the null hypothesis (which suggests there is no relationship between the two variables). My issue is then whether -.036 suggests a very weak relationship or no relationship at all, given how close to 0 it is. If it is the latter, would I then say I have failed to reject the null hypothesis even though there is a statistically significant relationship? Or would I say that I have rejected the null hypothesis because there is a statistically significant relationship, but the correlation is very weak?

Any help would be appreciated. Kind regards.

November 29, 2022 at 4:10 pm

What you're seeing is the difference between statistical significance and practical significance. Yes, your results are statistically significant. You can reject the null hypothesis that rho (the correlation in the population) equals zero. Your data provide enough evidence to conclude that the negative correlation exists in the population (not just your sample).

However, as you say, it's an extremely weak relationship. Even though it's not zero, it is essentially zero in a practical sense. Statistically significant results don't automatically mean that the effect size (the correlation in this case) is meaningful in the real world. When a test has very high statistical power (e.g., sometimes due to a very large sample size), it can detect trivial effects. Those effects are real, but they're small in size.

I write more about this in my post about statistical vs. practical significance . But, in a nutshell, your correlation coefficient is statistically significant, but it is not a meaningful effect in the real world.


September 28, 2022 at 10:44 am

I have a simple question, only to frame how to use correlation. Imagine a trial with plants, testing different phosphate (Pi) concentrations (like 8) and its effect on plant growth (assessed as mean plant size per Pi concentration, from enough replicates and data validity to perform classical parametric statistics).

In case A, I have a strong (positive) and significant Pearson correlation between these two parameters, and in particular, the 8 average size values show statistical significant differences (ANOVA) between all the Pi concentrations tested.

In case B, I have the same strong (positive) significant Pearson correlation, but there is no statistically significant difference in terms of size between any of the Pi concentrations tested.

My guess is that it may be possible to interpret case A as Pi being correlated with plant growth; but in case B, no interpretation can be provided given that no significant difference is seen between Pi concentrations on plant size, even though a correlation is obtained. Is this right? But in this case, if there are 3 out of the 8 Pi concentrations for which I obtained a significant difference in plant size, should I perform the correlation only between the significant Pi groups, or could I still take all 8 Pi groups to make interpretations? Thanks in advance!

September 29, 2022 at 7:02 pm

I don’t fully understand your trial. You say that you have a continuous measure of Pi concentration and then average plant sizes. Pearson correlations work with two continuous measures–not a group average. So, you’d need to correlate the Pi concentration with plant size, not average plant size. Or perhaps I’m misunderstanding your description. Please clarify your process. Thanks!

In a more general sense, you have to remember that statistical significance doesn’t necessarily indicate there is a real-world, practical significance to your results. That’s possibly what you’re finding in case B. Although again it’s hard to say if you’re applying correlation to averages.

Statistical significance just indicates that you have reason to believe that a relationship/effect exists in the population. It doesn’t necessarily mean that the effect is large enough to be practically meaningful. For more information, read my post about Practical vs. Statistical Significance .


August 16, 2022 at 11:16 am

This was very educative and easy to follow through for a statistics noob such as me. Thanks! I like your books. Which one is most suited for a beginner level of knowledge?

August 17, 2022 at 12:20 am

My Introduction to Statistics book is the best one to get started with for beginners. Click the link to see a post where I discuss it and include a full table of contents.

After reading that, you'd be ready to read both of my other books: Hypothesis Testing and Regression Analysis.


May 16, 2022 at 2:45 pm

Jim, Nassim Taleb makes the point on YouTube (search for Taleb and correlation) that an r = 0.10 is much closer to zero than to r = 0.20, implying that the distribution function for r is very dependent on the r in the population and the sample size, and that the scale of -1.0 to +1.0 is not a scale separated by equal units. He then warns against significance tests because r is a random variable and subject to sampling fluctuations, and r = .25 could easily be zero due to sampling error (especially for small sample sizes). Can you please discuss whether the scale of r = -1.0 to 1.0 is set in equidistant units, or in units that only superficially look equidistant?

May 16, 2022 at 6:41 pm

I did a quick search and found a video where he’s talking about using correlation in the financial and investment areas. He seems to be saying that correlation is not the correct tool for that context. I can’t talk to that point because I’m not familiar with the context.

However, yes, I can help you out with most of the other points!

I’ll start with the fact that the scale of -1 to +1 is, in some ways, not consistent. To start, correlation coefficients are a standardized effect. As such, they are unitless. You can’t link them to anything real, but they help you compare between disparate types of studies. In other words, they excel at providing a standard basis of comparison between studies. However, they’re not as good for knowing what the statistic actually means, except for a few specific values, -1, +1, and 0. And perhaps that’s why Taleb isn’t fond of them. (At 20 minutes, I didn’t watch the entire video.)

However, we can convert r to R-squared and it becomes more meaningful. R-squared tells us how much of the variance the relationship accounts for. And, as the name implies, you simply square r to get R-squared. It's in R-squared where you see that the difference between an r of 0.1 and 0.2 is different from, say, 0.8 and 0.9. When you go from 0.1 to 0.2, R-squared increases from 0.01 to 0.04, an increase of 3 percentage points. And note that at those correlations, we're only explaining between 1 and 4% of the variance. Virtually nothing! Now, if we look at going from an r of 0.8 to 0.9, R-squared increases from 0.64 to 0.81, or 17 percentage points. So, we have the same size increase in r (0.1) in both cases, but R-squared increases by 3 points in one case and 17 points in the other. Also, notice how at an r of 0.5, you're only accounting for 25% of the variance. That's not very much. You need an r of 0.707 to explain half the variance (50%). Another way to think of it is that the range of r from 0 to 0.707 accounts for half the variance, while the range from 0.707 to 1 accounts for the other half.

I agree with the point that r = 0.1 is virtually nothing. In fact, you need an r of 0.316 to explain even a tenth (10%) of the variability. I also agree that fixed differences in r (e.g., 0.1) indicate different changes in the strength of the relationship, as I illustrate above. I think those points are valid.

Below, I include a graph showing r vs. R-squared and the curved line indicates that the relationship between the two statistics changes (the inconsistency you mention). If the relationship was consistent, it would be a straight line. For me, R-squared is the better statistic, particularly in conjunction with regression analysis, which provides more information about the nature of the relationships. Of course, the negative range of r produces the mirror graph but the same ideas apply.

Graph displaying the relationship between r and R-squared.
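For anyone who wants to reproduce that curve, here is a minimal Python sketch (numpy and matplotlib are assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

r = np.linspace(0, 1, 101)   # positive range of the correlation coefficient
r_squared = r ** 2           # R-squared is simply r squared

plt.plot(r, r_squared)
plt.xlabel("Correlation coefficient (r)")
plt.ylabel("R-squared")
plt.title("r vs. R-squared")
plt.show()

# A few of the values discussed above:
for value in (0.1, 0.2, 0.5, 0.707, 0.8, 0.9):
    print(f"r = {value:.3f} -> R-squared = {value**2:.3f}")
```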

I think correlation coefficients (r) have some other shortcomings. They describe the strength of the relationship but not the actual relationship. And they don’t account for other variables. Regression analysis handles those aspects and I generally prefer that methodology. For me, simple correlation just doesn’t provide enough information by itself in most cases. You also typically don’t get residual plots so you can be sure that you’re satisfying the assumptions (Pearson’s correlation (r) is essentially a linear model).

The sample r does depend on the relationship in the population. But that’s true for all sample statistics–as I write in my post, Sample Statistics Are Always Wrong to Some Extent! I don’t think it’s any worse for correlation than other types of sample statistics. As you increase your sample size, the estimate’s precision will increase (i.e., the error bars become smaller).

I think significance tests are valid for correlation. Yes, it’s subject to sampling fluctuations ( sampling error ) but so are all sample based statistics. Hypothesis testing is designed to factor that in. In fact, significance testing specifically helps you distinguish between cases where the sample r = 0.25 might represent 0 in the population vs. cases where that is unlikely. That’s the very intention of significance testing, so I strongly disagree with that point!


April 9, 2022 at 2:20 am

Thank you for the fast response!! I have also read the Spearman's rho article (very insightful). My scatterplot suggests that there is no correlation (completely random distribution). However, I would still like to test the correlation, but in the Spearman's rho article you mentioned that if there is no correlation, both the Spearman's rho value and the Pearson's correlation value would be close to zero. Is it also possible that one value is positive and one is negative? My results right now are R2 Linear = 0.003, Pearson correlation = .058, and Spearman's correlation coefficient = -0.19. Should I base the rejection of either of my hypotheses on Spearman's value or Pearson's value?

Thank you so much!!!

April 9, 2022 at 10:42 pm

I’m glad that it was helpful! It’s definitely possible for correlations to switch directions like that. That’s especially true because both correlations are barely different from zero. So, it wouldn’t take much to cause them to be on opposite sides of zero. The R-squared is telling you that the Pearson’s correlation explains hardly any of the variability.


April 8, 2022 at 7:05 pm

Thank you for this post!! I was wondering, I did a scatterplot which gave me an R2 value of 0.003. The fit line showed a really weak positive correlation, which I wanted to test with Spearman's rho. However, that value is negative (a negative relationship). Do you maybe know why it is showing different correlations since I am using the exact same values?

April 8, 2022 at 7:51 pm

The R-squared value and slope you're seeing are related to Pearson's correlation, which differs from Spearman's rho. They're different statistical measures using different methods, so it's not surprising that their values can differ. For more information, read my post about Spearman's Rho .


April 6, 2022 at 3:37 am

Hi Jim, I had a question. It’s kinda complicated but I try my best to explain it well.

I ran a correlation test between objective social isolation and subjective social isolation. To measure OSI, I used an instrument called the LSNS-6, while I used the R-UCLA Loneliness Scale to measure SSI. Here is the scoring guide for the instruments:
* higher score obtained on the LSNS-6 = low objective social isolation
* higher score obtained on the R-UCLA Loneliness Scale = high subjective social isolation

After I run the correlation test, I found the value was r= -.437.

My question is, does the value represent the correlation between the variables (meaning when someone is objectively isolated, they are less likely to be subjectively isolated and vice versa) OR the correlation between the scores of the instruments used (meaning when someone scores higher on the LSNS-6, they will have a lower score on the R-UCLA Loneliness Scale and vice versa)? I'm confused because of the scoring guide. I hope you can help me.

Thank you Jim!

April 8, 2022 at 8:17 pm

This specific correlation is a bit tricky because, based on what you wrote, the LSNS-6 is inverted. High LSNS-6 scores correspond to low objective social isolation. Let’s work through this example.

The negative correlation (-0.437) indicates that high LSNS-6 scores tend to correlate with low R-UCLA scores. Now, if we "translate" the instrument measures into what the scores mean as constructs, low objective social isolation tends to correspond to low subjective social isolation.

In other words, there is a negative correlation between the instrument scores. However, there is a positive correlation between the concepts of objective social isolation and subjective isolation, which makes theoretical sense.

The reason why the instrument scores have a negative correlation while the constructs have a positive correlation goes back to the fact that high LSNS-6 scores relate to low objective isolation.

I hope that helps!


April 2, 2022 at 7:16 am

Thanks so much for the highly helpful statistical resources on this website. I am a bit confused about an analysis I carried out. My scatter plot shows a kind of negative relationship between two variables, but my Pearson's correlation coefficient results seem to say something different: r = -0.198 and a p-value of 0.082. I would appreciate clarification on this.

April 4, 2022 at 3:56 pm

I’m not sure what is surprising you? Can you be more specific?

It sounds like your scatterplot displays a negative relationship and your correlation coefficient is also negative, which sounds consistent. It's a fairly weak correlation. The p-value indicates that your data don't provide quite enough evidence to conclude that the correlation you see in the sample via the scatterplot and correlation coefficient also exists in the population. It might just be sampling error.


January 14, 2022 at 8:31 am

Hi Jim, Andrew here.

I am using a Pearson test for two variables: LifeSatisfaction and JobSatisfaction. I have gotten a P-Value 0.000 whilst my R-Value is 0.338. Can you explain to me what relation this is? Am I right in thinking that is strong significance with a weak correlation? And that there is no significant correlation between the two.

January 14, 2022 at 4:59 pm

What you're running into is the difference between statistical significance and practical significance in the real world. A statistically significant result, such as your correlation, suggests that the relationship you observe in your sample also exists in the population as a whole. However, statistical significance says nothing about how important that relationship is in a practical sense.

Your correlation results suggest that a positive correlation exists between life satisfaction and job satisfaction among the population from which you drew your sample. However, the fairly weak correlation of 0.338 might not be of practical significance. People with satisfying jobs might be a little happier, but perhaps not to a noticeable degree.

So, for your correlation, statistical significance: yes! Practical significance: maybe not.

For more information, read my post about statistical significance vs. practical significance where I go into it in more detail.


January 7, 2022 at 7:07 pm

Thank you, Jim, will do.


January 7, 2022 at 5:07 pm

Hello Jim, I just came across this website. I have a query.

I wrote the following for a report: Table 5 shows the associations between all the domains. The correlation coefficients between the environment and the economy, social, and culture domains are rs=0.335 (weak), rs=0.427 (low) and rs=0.374 (weak), respectively. The correlation coefficient between the economy and the social and culture domains are rs=0.224 and rs=0.157, respectively and are negligible. The correlation coefficient (rs =0.451) between the social and the culture domains is low, positive, and significant. These weak to low correlation coefficient values imply that changes in one domain are not correlated strongly with changes in the related domain.

The comment I received was: Correlation studies are meant to see relationships- not influence- even if there is a positive correlation between x and y, one can never conclude if x or y is the reason for such correlation. It can never determine which variables have the most influence. Thus the caution and need to re-word for some of the lines above. A correlation study also does not take into account any extraneous variables that might influence the correlation outcome.

I am not sure how I should reword? I have checked several sources and their interpretations are similar to mine, Please advise. Thank you

January 7, 2022 at 9:25 pm

Personally, I think your wording is fine. Appropriately, you don't suggest that correlation implies causation. You state that there is a correlation. So, I'm not sure why the reviewer has an issue with it.

Perhaps the reviewer wants an explicit statement to that effect? “As with all correlation studies, these correlations do not necessarily represent causal relationships.”

The second portion of the review comment about extraneous variables is, in my opinion, more relevant. Pairwise correlations don’t control for the effects of other variables. Omitted variable bias can affect these pairs. I write about this in a post about omitted variable bias . These biases can exaggerate or minimize the apparent strength of pairwise correlations.

You can avoid that problem by using partial correlations or multiple regression analysis. Although, it’s not necessarily a problem. It’s just a possibility.

January 5, 2022 at 8:52 pm

Is it possible to compare two correlation coefficients? For example, let’s say that I have three data points (A, B, and C) for each of 75 subjects. If I run a Pearson’s on the A&B survey points and receive a result of .006, while the Pearson’s on the A&C survey points is .215…although both are not significant, can I say that there is a stronger correlation between A&C than between A&B? thank you!

January 6, 2022 at 8:31 pm

I am not aware of a test that will assess whether the difference between two correlation coefficients is statistically significant. I know you can do that with regression coefficients , so you might want to determine whether you can use that approach. Click the link to learn more.

However, I can guess that your two coefficients probably are not significantly different, and thus you can't say one is higher. Each of your hypothesis tests is assessing whether one of the coefficients is significantly different from zero. In both cases (0.006 and 0.215), neither is significantly different from zero. Because both of your coefficients are on the same side of zero (positive), the distance between them is even smaller than your larger coefficient's (0.215) distance from zero. Hence, that difference probably is also not statistically significant. However, one muddling issue is that with the two datasets combined you have a larger total sample size than either alone, which might allow a supposed combined test to determine that the smaller difference is significant. But that's uncertain and probably unlikely.

There's a more fundamental issue to consider beyond statistical significance . . . practical significance. The correlation of 0.006 is so small it might as well be zero. The other is 0.215 (which, according to the hypothesis test, also might as well be zero). However, in practical terms, a correlation of 0.215 is also a very weak correlation. So, even if its hypothesis test said it was significantly different from zero, it's a puny correlation that doesn't provide much predictive power at all. So, you're looking at the difference between two practically insignificant correlations. Even if the larger sample size for a combined test did indicate the difference is statistically significant, that difference (0.215 - 0.006 = 0.209) almost certainly is not practically significant in a real-world sense.

But, if you really want to know the statistical answer, look into the regression method.

May 16, 2022 at 2:57 pm

Jim, here is a YouTube video purporting to demonstrate how to compare correlation coefficients for statistical significance. I'm not a statistician and cannot vouch for the contents. https://www.youtube.com/watch?v=ipqUoAN2m4g

May 16, 2022 at 7:22 pm

That seems like a very non-standard approach in the YT video. And, with a sample size of 200 (100 males, 100 females), even very small effect sizes should be significant. So, I have some doubts about that process, but I haven’t dug into it. It might be totally valid, but it seems inefficient in terms of statistical power for the sample size.

Here's how I would've done that analysis. Instead of correlation, I'd use regression with an interaction effect. I'd want to model the relationship between the amount of time spent studying for a test and the scores. Additionally, I'd gather 100 males and 100 females and see whether the relationship between time studying and test scores differs between genders. In regression, that's an interaction effect. It's the same question the YT video assesses, but using a different approach that provides a whole lot more answers.

To see that approach in action, read my post about Comparing Regression Lines Using Hypothesis Tests . In that post, I refer to comparing the relationships between two conditions, A and B. You can equate those two conditions to gender (male and female). And I look at the relationship between Input and Output, which you can equate to Time Studying and Test Score, respectively. While reading that post, notice how much more information you obtain using that approach than just the two correlation coefficients and whether they’re significantly different.

That’s what I mean by generally preferring regression analysis over simple correlation.
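For what it's worth, here is a hedged sketch of that interaction-effect idea in Python with statsmodels. The data are simulated and the variable names (study_hours, gender, score) are illustrative stand-ins, not anyone's actual study:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100  # per group (illustrative)

# Simulated study time and test scores for two groups (hypothetical data).
df = pd.DataFrame({
    "gender": ["male"] * n + ["female"] * n,
    "study_hours": np.concatenate([rng.uniform(0, 10, n), rng.uniform(0, 10, n)]),
})
slope = np.where(df["gender"] == "male", 3.0, 4.0)  # different slopes by group
df["score"] = 50 + slope * df["study_hours"] + rng.normal(scale=5, size=2 * n)

# The interaction term tests whether the study_hours slope differs between groups.
model = smf.ols("score ~ study_hours * gender", data=df).fit()
print(model.summary())
# A significant study_hours:gender coefficient indicates the relationship
# between study time and scores differs between the two groups.
```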


December 9, 2021 at 7:33 pm

Hi Jim, thank you very much for this explanation. I am working on an article and I want to calculate the sample size in order to critique the sample size that was used. Is it possible to deduce the p-value from the graph and then apply the rule to deduce N?

December 12, 2021 at 11:57 pm

Unfortunately, I don’t speak French. However, I used Google Translate and I think I understand your question.

No, you can’t calculate the p-value by looking at a graph. You need the actual data values to do that. However, there is another approach you can use to determine whether they have a reasonable sample size.

You can use power and sample size software (such as the free G*Power ) to determine a good sample size. Keep in mind that the sample size you need depends on the strength of the correlation in the population. If the population has a correlation of 0.3, then you’ll need 67 data points to obtain a statistical power of 0.8. However, if the population correlation is higher, the required sample size declines while maintaining the statistical power of 0.8. For instance, for population correlations of 0.5 and 0.8, you’ll only need sample sizes of 23 and 8, respectively.

Using this approach, you’ll at least be able to determine whether they’re using a reasonable sample size given the size of correlation that they report even though you won’t know the p-value.

Hopefully they reported the sample size, but if not, you can just count the number of dots on the scatterplot.
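As a rough cross-check without G*Power, the standard Fisher z approximation can be coded in a few lines of Python. This is only an approximation (G*Power's exact routine, and the choice of a one- vs. two-tailed test, will shift the numbers somewhat), but it lands in the same neighborhood:

```python
import math
from scipy import stats

def n_for_correlation(rho, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size to detect a population correlation rho,
    using the Fisher z transformation (a standard approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2) if two_sided else stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    c = 0.5 * math.log((1 + rho) / (1 - rho))  # Fisher z of rho
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)

# e.g., rho = 0.3 gives roughly 85 two-sided (about 68 one-sided);
# exact software such as G*Power may differ by a few observations.
for rho in (0.3, 0.5, 0.8):
    print(rho, n_for_correlation(rho))
```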


November 19, 2021 at 4:47 pm

Hi Jim. How do I interpret r(12) = -.792, p < .001 for a Pearson correlation coefficient?


October 26, 2021 at 4:53 am

Hi, if the correlation between the two independent constructs/variables and the dependent variable/construct is medium or large, what must the manager do to improve the two independent constructs/variables?


October 7, 2021 at 1:12 am

Hi Jim, First of all thank you, this is an excellent resource and has really helped clarify some queries I had. I have run a Pearson's r test on some stats software to analyse the relationship between increasing age and need for friendship. The result is r = 0.052 and p = 0.381. Am I right in assuming there is a very slight positive correlation between the variables, but one that is not statistically significant, so the null hypothesis cannot be rejected? Kind regards

October 7, 2021 at 11:26 pm

Hi Victoria,

That correlation is so close to 0 that it essentially means that there is no relationship between your two variables. In fact, it’s so close to zero that calling it a very slight positive correlation might be exaggerating by a bit.

As for the p-value, you’re correct. It’s testing the null hypothesis that the correlation equals zero. Because your p-value is greater than any reasonable significance level, you fail to reject the null. Your data provide insufficient evidence to conclude that the correlation doesn’t equal zero (no effect).

If you haven’t, you should graph your data in a scatterplot. Perhaps there’s a U shaped relationship that Pearson’s won’t detect?


July 21, 2021 at 11:23 pm

No Jim, I mean to ask: let's assume the correlation between variables x and y is 0.91. How do we interpret the remaining 0.09, assuming a correlation of 1 is a strong positive linear correlation?

Is this because of diversification, correlation residual or any error term?

July 21, 2021 at 11:29 pm

Oh, ok. Basically, you’re asking why it’s not a perfect correlation of 1? What explains that difference of 0.09 between the observed correlation and 1? There are several reasons. The typical reason is that most relationships aren’t perfect. There’s usually a certain amount of inherent uncertainty between two variables. It’s the nature of the relationship. Occasionally, you might find very near perfect correlations for relationships governed by physical laws.

If you were to have a pair of variables that should have a perfect correlation for theoretical reasons, you might still observe an imperfect correlation thanks to measurement error.

July 20, 2021 at 12:49 pm

If two variables have a correlation of 0.91, what is the 0.09 in the equation?

July 21, 2021 at 10:59 pm

I’d need more information/context to be able to answer that question. Is it a regression coefficient?


June 30, 2021 at 4:21 pm

You are a great resource. Thank you for being so responsive. I’m sure I’ll be bugging you some more in the future.

June 30, 2021 at 12:48 pm

Jim, using Excel, I just calculated that the correlation between two variables (A and B) is .57, which I believe you would consider to be “moderate.” My question is, how can I translate that correlation into a statement that predicts what would happen to B if A goes up by 1 point. Thanks in advance for your help and most especially for your clarity.

June 30, 2021 at 2:59 pm

Hi Gerry, to get that type of information, you'll need to use regression analysis. Read my post about using Excel to perform regression for details . For your example, be sure to use A as the independent variable and B as the dependent variable. Then look at the regression coefficient for A to get your answer!
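If you'd rather do the same calculation in Python instead of Excel, here's a minimal sketch; the A and B values below are placeholders for your own data:

```python
import numpy as np

# Placeholder data -- substitute your own A and B columns.
A = np.array([2, 4, 5, 7, 9, 11, 13, 15], dtype=float)
B = np.array([5, 9, 9, 14, 16, 20, 24, 25], dtype=float)

slope, intercept = np.polyfit(A, B, 1)   # simple linear regression of B on A
r = np.corrcoef(A, B)[0, 1]

print(f"correlation r = {r:.2f}")
print(f"B changes by about {slope:.2f} for every 1-point increase in A")
```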


May 24, 2021 at 11:51 pm

Hey Man, I’m taking my stats final this week and I’m so glad I found you! Thank you for saving random college kids like me!


May 19, 2021 at 8:38 am

Hi, I am Nasib Zaman. The Spearman correlation between high temperature and COVID-19 cases was significant (r = 0.393). The correlation between UV index and COVID-19 cases was also significant (r = 0.386). Is it true?

May 20, 2021 at 1:31 am

Both suggest that as temperature and UV increase, the number of COVID cases increases, although they are weak correlations. I don't know whether that's true or not. You'd have to assess the validity of the data to make that determination. Additionally, there might be confounding variables at play, which could bias the correlations. I have no way of knowing.


April 12, 2021 at 1:49 pm

I am using Pearson's correlation coefficient to express the strength of the relationship between my two variables on happiness; would this be an appropriate use?

Pearson correlations (N = 1297; Sig. (1-tailed) = 0.00 for each pair):

                           Happiness   Diet    RelationshipSatisfaction
Happiness                    1.000      .310      .416
Diet                          .310     1.000      .193
RelationshipSatisfaction      .416      .193     1.000

If so, would I be right to say that because the coefficient was r = .193, it suggests there is not too strong a relationship between the two independent variables? Can I use anything else to indicate significance levels?


March 29, 2021 at 3:12 am

I just want to say that your posts are great, but the QA section in the comments is even greater!

Congrats, Jim.

March 29, 2021 at 2:57 pm

Thanks so much!! 🙂

And, I’m really glad you enjoy the QA in the comments. I always request readers to post their questions in the comments section of the relevant post so the answers benefit everyone!


March 24, 2021 at 1:16 am

Thank you very much. This question had been troubling me for the last few days; thanks for helping.

Have a nice day…

March 24, 2021 at 1:34 am

You’re very welcome, Ronak! I’m glad to help!


March 22, 2021 at 12:56 pm

Nalin here. I found your article to be very clarifying conceptually. I had a doubt.

So there is this dataset I have been working on, and I calculated the Pearson correlation coefficient between the target variable and the predictor variables. I found that none of the predictor variables had a correlation greater than 0.1 or less than -0.1 with the target variable, hence indicating that no linear relationship exists between them.

How can I verify whether or not any non-linear relationships exist between these pairs of variables or not? Will a scatterplot confirm my claims?

March 23, 2021 at 3:09 pm

Yes, graphing the data in a scatterplot is always a good idea. While you might not have a linear relationship, you could have a curvilinear relationship. A scatterplot would reveal that.

One other thing to watch out for is omitted variable bias. When you perform correlation on a pair of variables, you're not factoring in other relevant variables that can be confounding the results. To see what I mean, read my post about omitted variable bias . In it, I start with a correlation that appears to be zero even though there actually is a relationship. After I accounted for another variable, there was a significant relationship between the original pair of variables! Just another thing to watch out for that isn't obvious!

March 20, 2021 at 3:23 am

Yes, I am also doing well…

I am having some subsequent queries…

By overall trend, do you mean that the correlation coefficient will capture how y changes with respect to x (i.e., whether y increases or decreases as x increases or decreases)? Am I interpreting that correctly?

Scatterplot image attached by the commenter: many data points forming a curved (red) pattern with a green linear fit line.

March 22, 2021 at 12:25 am

This is something that should be clear by examining the scatterplot. Will a straight line fit the dots? Do the dots fall randomly about a straight line, or are there patterns? If a straight line fits the data, Pearson's correlation is valid. However, if it does not, then Pearson's is not valid. Graphing is the best way to make the determination.

Thanks for the image.

March 23, 2021 at 3:41 pm

Hi again Ronak!

On your graph, the data points are the red line (actually lots and lots of data points and not really a line!). And the green line is the linear fit. You don't usually think of Pearson's correlation as modeling the data, but it uses a linear fit. So, the green line is how Pearson's correlation models your data. You can see that the model doesn't fit the data adequately. There are systematic (i.e., non-random) departures from the data points. Right there you know that Pearson's correlation is invalid for these data.

Your data has an upward trend. That is, as X increases, Y also increases. And Pearson’s partially captures that trend. Hence, the positive slope for the green line and the positive correlation you calculated. But, it’s not perfect. You need a better model! In terms of correlation, the graph displays a monotonic relationship and Spearman’s correlation would be a good candidate. Or, you could use regression analysis and include a polynomial to model the curvature . Either of these methods will produce a better fit and more accurate results!

March 18, 2021 at 11:01 am

I am Ronak from India. How are you? Hoping corona has not troubled you much. You have simplified the concept very well; you are doing an amazing job, great work. I have one doubt and want to clarify it.

Question: whenever we talk about the correlation coefficient, we talk in terms of a linear relationship. But I have calculated the correlation coefficient for the relationship Y vs. X^3.

X variable: 1 to 10,000; Y = X^3

The correlation coefficient comes out around 0.9165. It is strange that even though the relationship is not linear, it still gives me a very high correlation coefficient.

March 19, 2021 at 3:53 pm

I’m doing well here. Just hunkering down like everyone else! I hope you’re doing well too! 🙂

For your data, I'd recommend graphing them in a scatterplot and fitting a linear trend line. You can do that in Excel. If your data follow an S-shaped cubic relationship, it is still possible to get a relatively strong correlation. You'll be able to see how that happens in the scatterplot with the trend line. There's an overall trend to the data that your line follows, but it doesn't hug the curves. However, if you fit a model with a cubic term to fit the curves, you'll get a better model.

So, let’s switch from a correlation to R-squared. Your correlation of 0.9165 corresponds to an R-squared of 0.84. I’m literally squaring your correlation coefficient to get the R-squared value. Now, fit a regression model with the quadratic and cubic terms to fit your data. You’ll find that your R-squared for this model is higher than for the linear model.

In short, the linear correlation is capturing the overall trend in the data but doesn’t fit the data points as well as the model designed for curvilinear data. Your correlation seems good but it doesn’t fully fit the data.
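Here is a minimal Python sketch of that comparison, using X from 1 to 100 instead of 1 to 10,000 purely to keep the numbers small (the linear correlation comes out essentially the same):

```python
import numpy as np

x = np.arange(1, 101, dtype=float)   # 1..100 instead of 1..10,000; same idea
y = x ** 3

# Pearson's r for the straight-line relationship and its R-squared.
r = np.corrcoef(x, y)[0, 1]
print(f"linear r = {r:.4f}, R-squared = {r**2:.4f}")   # roughly 0.92 and 0.84

def r_squared(y_actual, y_fitted):
    ss_res = np.sum((y_actual - y_fitted) ** 2)
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Fit a straight line and a cubic polynomial, then compare their fits.
linear_fit = np.polyval(np.polyfit(x, y, 1), x)
cubic_fit = np.polyval(np.polyfit(x, y, 3), x)

print(f"linear model R-squared: {r_squared(y, linear_fit):.4f}")
print(f"cubic model R-squared:  {r_squared(y, cubic_fit):.4f}")   # essentially 1.0
```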


March 11, 2021 at 10:56 am

Hi Jim, does partial correlation always require continuous (scale) variables? Is it possible to include other types of variables (such as nominal or ordinal)? Regards, Jagar

March 16, 2021 at 12:30 am

Pearson correlations are for continuous data that follow a linear relationship. If you have ordinal data or continuous data that follow a monotonic relationship, you can use Spearman’s correlation.

There are correlations specifically for nominal data. I need to write a blog post about those!


March 10, 2021 at 11:45 am

If the correlation coefficient is 0.153, what type of correlation is it?

February 14, 2021 at 1:49 pm


February 12, 2021 at 8:09 pm

If my r value when finding the correlation between two things is -0.0258, what would that be: a weak negative correlation or something else?

February 14, 2021 at 12:08 am

Hi Dez, your correlation coefficient is essentially zero, which indicates no relationship between the variables. As one variable increases, there is no tendency for the other variable to either increase or decrease. There's just no relationship between them according to your data.


January 9, 2021 at 12:10 pm

My correlation coefficients between my independent variables (anger, anxiety, happiness, satisfaction) and a dependent variable (entrepreneurial decision-making behavior) are 0.401, 0.303, 0.369, and 0.384.

What does this mean? How do I interpret and explain this? What's the relationship?

January 10, 2021 at 1:33 am

It means that, separately, each independent variable (IV) has a positive correlation with the dependent variable (DV). As each IV increases, the DV tends to increase. However, these are fairly weak correlations. Additionally, these correlations don't control for confounding variables. You should perform a regression analysis because you have your IVs and DV. Your model will tell you how much variability the IVs account for in the DV collectively. And it will control for the other variables in the model, which can help reduce omitted variable bias.

The information in this post should help you interpret your correlation coefficients. Just read through it carefully.


January 4, 2021 at 6:20 am

Hello there, If one were to find out the correlation between the average grade and a variable, could this coefficient be used? Thanks!

January 4, 2021 at 4:03 pm

If you mean something like an average grade per student and the other variable is something like the number of hours each student studies, yes, that's fine. You just need to be sure that the average grade applies to one person and that the other variable applies to the same person. You can't use a class average when the other variable is measured for individuals.


December 27, 2020 at 8:27 am

I'm helping a friend working on a paper and don't have the variables. The question centers on the nature of Criterion Referenced Tests (CRTs) in general, i.e., correlations of CRTs vs. Norm Referenced Tests. As you know, Norm Referenced Tests compare students to each other across a wide population. In this paper, the student is creating a teacher-made CRT. It measures proficiency of students of more similar abilities, from a smaller population, against criteria rather than against each other. I suspect, in general, the CRT doesn't distinguish as well between students with similar abilities and knowledge. Therefore, the reliability coefficients, in general, are less reliable. How does this affect high or low correlations?

December 26, 2020 at 9:40 pm

Is a high or low correlation on a CRT proficiency test good or bad?

December 27, 2020 at 1:30 am

Hi Raymond, I’d have to know more about the variables to have an idea about what the correlation means.


December 8, 2020 at 11:02 pm

I have zero statistics experience, but I want to spice up a paper that I'm writing with some quants. So I learned the basics of Pearson correlation in SPSS and plugged in my data. Now, here's where it gets "interesting." Two sets of numbers show up: one on the Pearson Correlation row and, below that, the Sig. (2-tailed) row.

I’m too embarrassed to ask folks around me (because I should already know this!). So, let me ask you: which of the row of numbers should I use in my analysis about the correlations between two variables? For example, my independent variable correlates with the dependent variable at -.002 on the first (Pearson Correlation) row. But below that is the Sig. (2-tailed) .995. What does that mean? And is it necessary to have both numbers?

I would really appreciate your response … and will acknowledge you (if the paper gets published).

Many thanks from an old-school qualitative researcher struggling in the times of quants! 🙂

December 9, 2020 at 12:32 am

The one you want to use for a measure of association is the Pearson Correlation. The other value is the p-value. The p-value is for a hypothesis test that determines whether your correlation value is significantly different from zero (no correlation).

If we take your -0.002 correlation and its p-value (0.995), we'd interpret that as meaning that your sample contains insufficient evidence to conclude that the population correlation is not zero. Given how close the correlation is to zero, that's not surprising! Zero correlation indicates there is no tendency for one variable to either increase or decrease as the other variable increases. In other words, there is no relationship between them.


November 24, 2020 at 7:55 am

Thank you for the good explanation. I am looking for the source or an article that states that most correlations regarding human behaviour are around .6. What source did you use?

Kind regards, Amy


November 13, 2020 at 5:27 am

This is an informative article and I agree with most of what is said, but this particular sentence might be misleading to readers: “R-squared is a primary measure of how well a regression model fits the data.”. R-squared is in fact based on the assumption that the regression model fits the data to a reasonable extent therefore it cannot also simultaneously be a measure of the goodness of said fit.

The rest of the claims regarding R-squared I completely agree with.

Cheers, Georgi

November 13, 2020 at 2:48 pm

Yes, I make that exact point repeatedly throughout multiple blog posts, particularly my post about R-squared .

Additionally, R-squared is a goodness-of-fit measure, so it is not misleading to say that it measures how well the model fits the data. Yes, it is not a 100% informative measure by itself. You’d also need to assess residual plots in conjunction with the R-squared. Again, that’s a point that I make repeatedly.

I don’t mind disagreements, but I do ask that before disagreeing, you read what I write about a topic to understand what I’m saying. In this case, you would’ve found in my various topics about R-squared and residual plots that we’re saying the same thing.


November 7, 2020 at 12:31 pm

Thank you very much!

November 6, 2020 at 7:34 pm

Hi Jim, I have a question for you – and thank you in advance for responding to it 🙂

Set A has a correlation coefficient of .25 and Set B has a correlation of .9. Which set has the steeper trend line, A or B?

November 6, 2020 at 8:41 pm

Set B has a stronger relationship. However, that’s not quite equivalent to saying it has a steeper trend line. It means the data points fall closer to the line.

If you look at the examples in this post, you’ll notice that all the positive correlations have roughly equal slopes despite having different correlations. Instead, you see the points moving closer to the line as the strength of the relationship increases. The only exception is that a correlation of zero has a slope of zero.

The point being that you can’t tell from the correlation alone which trend line is steeper. However, the relationship in Set B is much stronger than the relationship in Set A.


October 19, 2020 at 6:33 am

Thank you 😊. Now I understand.

October 11, 2020 at 4:49 am

hi, I’m a little confused.

What does it indicate if there is a positive correlation but a negative coefficient in the multiple regression output? In this situation, how do I interpret it? Is the relationship negative or positive?

October 13, 2020 at 1:32 pm

This is likely a case of omitted variable bias. A pairwise correlation involves just two variables. Multiple regression analysis involves three variables at a minimum (2 IVs and a DV). Correlation doesn't control for other variables, while regression analysis controls for the other variables in the model. That can explain the different relationships. Omitted variable bias occurs under specific conditions. Click the link to read about when it occurs. I include an example where I first look at a pair of variables and then three variables and show how that changes the results, similar to your example.


September 30, 2020 at 4:26 pm

Hi Jim, I have 4 objectives in my research, and when I computed the correlations between the first one and the others the results were: ob1 with ob2 is 0.87, ob1 with ob3 is 0.84, and ob1 with ob4 is 0.83. My question is, what does that mean, and can I compute the correlation coefficient for all of them at one time?


September 28, 2020 at 4:06 pm

Which best describes the correlation coefficient for r=.08?

September 30, 2020 at 4:29 pm

Hi Jolette,

I’d say that is an extremely weak correlation. I’d want to see its p-value. If it’s not significant, then you can’t conclude that the correlation is different from zero (no correlation). Is there something else particular you want to know about it?


September 15, 2020 at 11:50 am

Correlation result between Vul and FCV

t = 3.4535, df = 306, p-value = 0.0006314
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: 0.08373962 to 0.29897226
sample estimate (cor): 0.1936854

What does this mean?

September 17, 2020 at 2:53 am

Hi Lakshmi,

It means that your correlation coefficient is ~0.19. That's the sample estimate. However, because you're working with a sample, there's always sampling error, and so the population correlation is probably not exactly equal to the sample value. The confidence interval indicates that you can be 95% confident that the true population correlation falls between ~0.08 and 0.30. The p-value is less than any common significance level. Consequently, you can reject the null hypothesis that the population correlation equals zero and conclude that it does not equal zero. In other words, the correlation you see in the sample is likely to exist in the population.

A correlation of 0.19 is a fairly weak relationship. However, even though it is weak, you have enough evidence to conclude that it exists in the population.
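If you ever need to reproduce that kind of confidence interval yourself, the usual route is the Fisher z transformation. Here is a hedged Python sketch using the numbers from the output above (n = df + 2 = 308):

```python
import math
from scipy import stats

r, n = 0.1936854, 308           # sample correlation and sample size (df = 306)

# Fisher z transformation of r and its standard error.
z = math.atanh(r)
se = 1 / math.sqrt(n - 3)
z_crit = stats.norm.ppf(0.975)  # 95% confidence

lower, upper = math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
print(f"95% CI for the correlation: {lower:.3f} to {upper:.3f}")
# This approximately reproduces the 0.084 to 0.299 interval shown in the output.
```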


September 1, 2020 at 8:16 am

Hi Jim, thank you for your support. I have a question. Testing criteria for validity with Pearson correlation: the r table value is determined with df = N - 2. An item is valid if the computed correlation is greater than the r table value (Pearson correlation > r table), and invalid if it is less than the r table value (Pearson correlation < r table). I got the above information from an SPSS tutorial video about Pearson correlation.

But I didn't find this in other literature. Please can you recommend some literature that covers this? Or can you clarify more about how to check validity with a Pearson correlation?


August 31, 2020 at 3:21 am

Hi Jim, I am Zia from Pakistan. I want to find the correlation of two factors. I have found 144.6 and 66.93. Is that a positive relation?

August 31, 2020 at 12:39 pm

Hi Zia, I’m sorry but I’m not clear about what you’re asking. Correlation coefficients range between -1 and +1, so those two values are not correlation coefficients. Are they regression coefficients?


August 16, 2020 at 6:47 am

Warmest greetings.

My name is Norshidah Nordin and I am very grateful if you could provide me some answers to the following questions.

1) Can I use two different sets of samples (e.g., students' academic performance (CGPA) as a dependent variable and teachers' self-efficacy as a dependent variable) to run a Pearson correlation analysis? If yes, could you elaborate on this aspect?

2) What is the minimum sample size to use in multiple regression analysis?

August 17, 2020 at 9:06 pm

Hi Norshidah,

For correlations, you need to have multiple measurements on the same item or person. In your scenario, it sounds like you’re taking different measurements on different people. Pearson’s correlation would not be appropriate.

The minimum sample size for multiple regression depends on the number of terms you need to include in your model. Read my post about overfitting regression models , which occurs when you have too few observations for the number of model terms.

I hope this helps!


July 29, 2020 at 5:27 pm

Greetings sir, question…. Can you do an accurate regression with a Pearson’s correlation coefficient of 0.10? Why or Why not?

July 31, 2020 at 5:33 pm

Hi Monique,

It is possible. First, you should determine whether that correlation is statistically significant. You're seeing a correlation in your sample, but you want to be confident that it also exists in the larger population you're studying. There's a possibility that the correlation only exists in your sample by random chance and does not exist in the population, particularly with such a low coefficient. So, check the p-value for the coefficient. If it's significant, you have reason to proceed with the regression analysis. Additionally, graph your data. Pearson's is only for linear relationships. Perhaps your coefficient is low because the relationship is curved?

You can fit the regression model to your data. A correlation of 0.10 equates to an R-squared of only 0.01, which is very low. Perhaps adding more independent variables will increase the R-squared. Even if the R-squared stays very low, if your independent variable is significant, you're still learning something from your regression model. To understand what you can learn in this situation, read my post about regression models with significant variables and low R-squared values .

So, it is possible to do a valid regression and learn useful information even when the correlation is so low. But, you need to check for significance along the way.


July 8, 2020 at 4:55 am

Hello Jim, first and foremost thank you for giving us comprehensive information regarding this! It totally helped me. But I have a question: my Pearson results show that there's a moderate positive relationship between my variables, which are parasocial interaction and the fans' purchase intention.

But the thing is, if I look at the answers, the majority of my participants answered Neutral regarding purchase intention.

What does this mean? Could you help me figure this out? Thank you in advance! I'm a student from Malaysia currently doing my thesis.

July 8, 2020 at 4:00 pm

Hi Titania,

Have you graphed your data using a scatterplot? I'd highly recommend that because I think it will probably clarify what your data are telling you. Also, are both of your variables continuous variables? I wonder if purchase intention is ordinal, given that one of the values is Neutral. If that's the case, you'd need to use Spearman's rank correlation rather than Pearson's.


June 18, 2020 at 8:57 am

Hello Jim! I have a question. I calculated a correlation coefficient between the scale variables and got 0.36, which is relatively weak since it gives about 0.12 if squared. What does the interpretation of the correlation depend on? The sample taken, the type of data measurement, or anything else?

I hope you got my question. Thank you for your help!!

June 18, 2020 at 5:06 pm

I’m not clear what you’re asking exactly. Please clarify. The correlation measures the strength of the relationship between the two continuous variables, as I explain in this article.

Yes, that it is a weak relationship. If you’re going to include this is a regression analysis, you might want to read my article about interpreting low R-squared values .

I’m not sure what you mean by scale variables. However, if these are Likert scale items, you’ll need to use Spearman’s correlation instead of Pearson’s correlation.


May 26, 2020 at 12:08 am

Hi Jim, I am very new to statistics and data analysis. I am doing a quantitative study and my sample size is 200 participants. So far I have only obtained 50 complete responses. Using G*Power for a simple linear regression with a medium effect size, an alpha of .05, and a power level of .80, can I do a data analysis with this small sample?

May 26, 2020 at 3:52 am

Please repost your question in the comments section of the appropriate article. It has nothing to do with correlation coefficients. Use the search bar part way down in the right column and search for power. I have a post about power analysis that is a good fit.


May 24, 2020 at 9:02 pm

Thank you Mr.Jim, it was a great answer for me!😉 Take good care~

May 24, 2020 at 9:46 am

I am a student from Malaysia.

I have a question to ask Mr. Jim about how to determine the validity (the accurate figure) of the data for analysis purposes based on the table of Pearson's correlation coefficients. Is there a method for it?

For example, since the coefficient between one independent variable and the other variable is below 0.7, the data are valid for analysis purposes.

However, I have read in the table that there is a figure which is more than 0.7. I am not sure about that.

Hope to hearing from Mr.Jim soon. Thank you.

May 24, 2020 at 4:20 pm

Hi, I hope you’re doing well!

There is no single correlation coefficient value that determines whether it is valid to study. It partly depends on your subject area. A low-noise physical process might often have a correlation in the very high 0.9s, and 0.8 would be considered unacceptable. However, in a study of human behavior, it's normal and acceptable to have much lower correlations. For example, a correlation of 0.5 might be considered very good. Of course, I'm writing the positive values, but the same applies to negative correlations too.

It also depends on what the purpose of your study. If you’re doing something practical, such as describing the relationship between material composition and strength, there might be very specific requirements about how strong that relationship must be for it to be useful. It’s based on real-world practicalities. On the other hand, if you’re just studying something for the sake of science and expanding knowledge, lower correlations might still be interesting.

So, there’s not single answer. It depends on the subject-area you are studying and the purpose of your study.


February 17, 2020 at 3:49 pm

Hi Jim, what could be the implication of my result if I obtained a weak relationship between industry experience and instructional effectiveness? Thanks in advance.

February 20, 2020 at 11:29 am

The best way to think of it is to look at the graphs in this article and compare the higher correlation graphs to the lower correlation graphs. In the higher correlation graphs, if you know the value of one variable, you have a more precise prediction of the value of the other variable. Look along the x-axis and pick a value. In the higher correlation graphs, the range of y-values that correspond to your x-value is narrower. That range is relatively wide for lower correlations.

For your example, I’ll assume there is a positive correlation. As industry experience increases, instructional effectiveness also increases. However, because that relationship is weak, the range of instructional effectiveness for any given value of industry experience is relatively wide.


November 25, 2019 at 9:05 pm

If the correlation between X and Y is 0.8, what is the correlation of -X and -Y?

November 26, 2019 at 4:59 pm

If you take all the values of X and multiply them by -1 and do the same for Y, your correlation would still be 0.8.
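
If it helps to see this numerically, here is a minimal sketch (hypothetical data, NumPy assumed) showing that flipping the sign of both variables leaves the correlation unchanged:

```python
import numpy as np

# Hypothetical data with a correlation of roughly 0.8 (illustrative only).
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.6, size=500)

r_xy = np.corrcoef(x, y)[0, 1]
r_negated = np.corrcoef(-x, -y)[0, 1]

print(round(r_xy, 3), round(r_negated, 3))  # the two values are identical
```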


November 7, 2019 at 3:51 am

This is very helpful, thank you Jim!


November 6, 2019 at 3:16 am

Hi, my data are continuous – the variables are individual share volatility and oil prices – and they were non-normal. I used Kendall’s Tau and did not rank the data or alter it in any way. Can my results be trusted?

November 6, 2019 at 3:32 pm

Hi Lorraine,

Kendall’s Tau is a correlation coefficient for ranked data. Even though you might not have ranked your data, your statistical software must have created the ranks behind the scenes.

Typically, you’ll use Pearson’s correlation when you have continuous data that have a straight line relationship. If your data are ordinal, ranked, or do not have a straight line relationship, using something other than Pearson’s correlation is necessary.

You mention that your data are nonnormal. Technically, you want to graph your data and look at the shape of the relationship rather than assessing the distribution for each variable. Although, nonnormality can make a linear relationship less likely. So, graph your data on a scatterplot and see what it looks like. If it is close to a straight line, you should probably use Pearson’s correlation. If it’s not a straight line relationship, you might need to use something like Kendall’s Tau or Spearman’s rho coefficient, both of which are based on ranked data. While Spearman’s rho is more commonly used, Kendall’s Tau has preferable statistical properties.
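
As a rough numerical illustration of this point (invented data, SciPy assumed), a curved but monotonic relationship yields a lower Pearson’s r than the rank-based coefficients:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical monotonic but strongly curved relationship (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = np.exp(x) + rng.normal(scale=1.0, size=200)

print("Pearson r:   ", round(pearsonr(x, y)[0], 2))
print("Spearman rho:", round(spearmanr(x, y)[0], 2))
print("Kendall tau: ", round(kendalltau(x, y)[0], 2))
```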


October 24, 2019 at 11:56 pm

Hi, Jim. If correlations between continuous variables can be measured using Pearson’s, how is correlation between categorical variables measured? Thank you.

October 25, 2019 at 2:38 pm

There are several possible methods, although unlike with continuous data, there doesn’t seem to be a consensus best approach.

But, first off, if you want to determine whether the relationship between categorical variables is statistically significant, use the chi-square test of independence . This test determines whether the relationship between categorical variables is significant, but it does not tell you the degree of correlation.

For the correlation values themselves, there are different methods, such as Goodman and Kruskal’s lambda, Cramér’s V (or phi) for categorical variables with more than 2 levels, and the Phi coefficient for binary data. There are several others that are available as well. Offhand I don’t know the relative pros and cons of each methodology. Perhaps that would be a good post for the future!
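
To make the two-step idea concrete, here is a minimal sketch (hypothetical counts, SciPy assumed) that first runs the chi-square test of independence and then computes Cramér’s V as a correlation-like effect size:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = categories of variable A,
# columns = categories of variable B (counts are made up for illustration).
observed = np.array([[30, 10, 15],
                     [12, 25, 18]])

chi2, p, dof, expected = chi2_contingency(observed)

# Cramer's V: a correlation-like effect size for categorical variables.
n = observed.sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print(f"chi-square p-value: {p:.4f}, Cramer's V: {cramers_v:.2f}")
```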


August 29, 2019 at 7:31 pm

Thanks, great explanations.


April 25, 2019 at 11:58 am

In a multi-variable regression model, is there a method for determining whether two predictor variables are correlated in their impact on the outcome variable?

If so, then how is this type of scenario determined, and handled?

Thanks, Curt

April 25, 2019 at 1:27 pm

When predictors are correlated, it’s known as multicollinearity. This condition reduces the precision of the coefficient estimates. I’ve written a post about it: Multicollinearity: Detection, Problems, and Solutions . That post should answer all your questions!
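
As a brief illustration of one common detection method, the variance inflation factor (VIF), here is a sketch using statsmodels with made-up predictor values; large VIFs flag predictors that are highly correlated with the others:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictor data (illustrative only); x2 is nearly 2 * x1.
df = pd.DataFrame({
    "x1": [1.0, 2.1, 3.2, 4.1, 5.0, 6.2, 7.1, 8.0],
    "x2": [2.1, 4.0, 6.3, 8.1, 10.2, 12.1, 14.3, 16.0],
    "x3": [5.0, 3.2, 6.1, 2.4, 7.8, 4.5, 6.9, 3.1],
})

X = sm.add_constant(df)  # VIFs are usually computed with an intercept included
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)  # VIFs above roughly 5 or 10 suggest problematic multicollinearity
```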


February 3, 2019 at 6:45 am

Hi Jim: Great explanations. One quick thing, because the probability distribution is asymptotic, there is no p=.000. The probability can never be zero. I see students reporting that or p<.000 all of the time. The actual number may be p <.00000001, so setting a level of p < .001 is usually the best thing to do and seems like journal editors want that when reporting data. Your thoughts?

February 4, 2019 at 12:25 am

Hi Susan, yes, you’re correct about that. You can’t have a p-value that equals zero. Sometimes software will round down when it’s a very small value. The underlying issue is that no matter how large the difference between your sample value and the null hypothesis value, there is a non-zero probability that you’d obtain the observed results when the null is true.


January 9, 2019 at 6:41 pm

Sir you are love. Such a nice share


November 21, 2018 at 11:17 am

Awesome stuff, really helpful


November 9, 2018 at 11:48 am

What do you do when you can’t perform randomized controlled experiments, as in the case of social science or society-wide health issues? Apropos of gun violence in America, there appears to be a correlation between the availability of guns in a society and the number of gun deaths in that society: as the number of guns in the society goes up, the number of gun deaths goes up. This is true of individual states in the US where gun availability differs, and also of countries where gun availability differs. But when, and how, can you come to a determination that lowering the number of guns available in a society could reasonably be said to lower the number of gun deaths in that society?

November 9, 2018 at 12:20 pm

Hi Patrick,

It is difficult to prove causality using observational studies rather than randomized experiments.

In my mind, the following approach can help when you’re trying to use observational studies to show that A causes B.

In an observational study, you need to worry about confounding variables because the study is not randomized. These confounding variables can provide alternative explanations for the effects/correlations. If you can include all confounding variables in the analysis, it makes the case stronger because it helps rule out other causes. You must also show that A precedes B. Further, it helps if you can demonstrate the mechanism by which A causes B. That mechanism requires subject-area knowledge beyond just a statistical test.

Those are some ideas that come to my mind after brief reflection. There might well be more and, of course, there will be variations based on the study area.


September 19, 2018 at 4:55 am

Thank you so much, I am learning a lot of thing from you!

Please, keep doing this great job!

Best regards

September 19, 2018 at 11:45 pm

You bet, Patrik!

September 18, 2018 at 6:04 am

Another question is: should I consider transforming my variables before using Pearson correlation if they do not follow a normal distribution or if the two variables do not have a clear linear relationship? What is the implication of that transformation? How do I interpret the relationship if I use a transformed variable (let’s say log)?

September 18, 2018 at 4:44 pm

Because the data need to follow the bivariate normal distribution to use the hypothesis test, I’d assume the transformation process would be more complex than transforming each variable individually. However, I’m not sure about this.

However, if you just want to transform the data to produce a straight-line relationship for the correlation to assess, I’d be careful about that too. The correlation of the transformed data would not apply to the untransformed data. One solution would be to use Spearman’s rank order correlation. Another would be to use regression analysis. In regression analysis, you can fit curves, use transformations, etc., and the assumption that the residuals follow a normal distribution (along with some other assumptions) is easy to check.

If you’re not sure that your data fit the assumptions for Pearson’s correlation, consider using regression instead. There are more tools there for you to use.

September 18, 2018 at 5:36 am

Hi Jim, I am always here following your posts.

I would like it if you could clarify something for me, please! What are the assumptions for Pearson correlation that must hold true in order to apply the correlation coefficient?

I have read something on the internet, but there is a lot of confusion. Some people are saying that the dependent variable (if there is one) must be normally distributed, and others are saying that both the dependent and independent variables must follow a normal distribution. Therefore, I don’t know which one I should follow. I would appreciate your kind contribution a lot. This is something that I am using for my paper.

Thank you in advance!

September 18, 2018 at 4:34 pm

I’m so glad to see that you’re here reading and learning!

This issue turns out to be a bit complicated!

The assumption is actually that the two variables follow a bivariate normal distribution. I won’t go into that here in much detail, but a bivariate normal distribution is more complex than just each variable following a normal distribution. In a nutshell, if you plot data that follow a bivariate normal distribution on a scatterplot, it’ll appear as an elliptical shape.

In terms of the correlation coefficient, that simply describes the relationship between the data. It is what it is, and the data don’t need to follow a bivariate normal distribution as long as you are assessing a linear relationship.

On the other hand, the hypothesis test of Pearson’s correlation coefficient does assume that the data follow a bivariate normal distribution. If you want to test whether the coefficient equals zero, then you need to satisfy this assumption. However, one thing I’m not sure about is whether the test is robust to departures from normality. For example, a 1-sample t-test assumes normality, but with a large enough sample size you don’t need to satisfy this assumption. I’m not sure if a similar sample size requirement applies to this particular test.

I hope this clarifies this issue a bit!


August 29, 2018 at 8:04 am

Hello, thanks for the good explanation. Do variables have to be normally distributed to be analyzed in a Pearson’s correlation? Thanks, Moritz

August 30, 2018 at 1:41 pm

No, the variables do not need to follow a normal distribution to use Pearson’s correlation. However, you do need to graph the data on a scatterplot to be sure that the relationship between the variables is linear rather than curved. For curved relationships, consider using Spearman’s rank correlation.


June 1, 2018 at 9:08 am

Pearson’s correlation measures only linear relationships. But regression can be performed with nonlinear functions, and the software will calculate a value of R^2. What is the meaning of an R^2 value when it accompanies a nonlinear regression?

June 1, 2018 at 9:49 am

Hi Jerry, you raise an important point. R^2 is actually not a valid measure in nonlinear models. To read about why, read my post about R-squared in nonlinear models . In that post, I write about why it’s problematic that many statistical software packages do calculate R-squared values for nonlinear regression. Instead, you should use a different goodness-of-fit measure, such as the standard error of the regression .


May 30, 2018 at 11:59 pm

Hi, fantastic blog, very helpful. I was hoping I could ask a question? You talk about correlation coefficients but I was wondering if you have a section that talks about the slope of an association? For example, am I right in thinking that the slope is equal to the standardized coefficient from a regression?

I refer to the paper of Cameron et al., (The Aging of Elastic and Muscular Arteries. Diabetes Care 26:2133–2138, 2003) where in table 3 they report a correlation and a slope. Is the correlation the r value and the slope the beta value?

Many thanks, Matt

May 31, 2018 at 12:13 pm

Thanks and I’m glad you found the blog to be helpful!

Typically, you’d use regression analysis to obtain the slope and correlation to obtain the correlation coefficient. These statistics represent fairly different types of information. The correlation coefficient (r) is more closely related to R^2 in simple regression analysis because both statistics measure how close the data points fall to a line. Not surprisingly if you square r, you obtain R^2.

However, you can use r to calculate the slope coefficient. To do that, you’ll need some other information–the standard deviation of the X variable and the standard deviation of the Y variable.

The formula for the slope in simple regression is: slope = r × (standard deviation of Y / standard deviation of X).

For more information, read my post about slope coefficients and their p-values in regression analysis . I think that will answer a lot of your questions.
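
To see that formula in action, here is a short sketch with invented data showing that r multiplied by the ratio of the standard deviations reproduces the least-squares slope:

```python
import numpy as np

# Hypothetical paired data (illustrative only).
rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=100)
y = 2.0 * x + rng.normal(scale=15, size=100)

r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * (np.std(y, ddof=1) / np.std(x, ddof=1))
slope_from_fit = np.polyfit(x, y, 1)[0]  # slope from a simple linear regression

print(round(slope_from_r, 3), round(slope_from_fit, 3))  # the two slopes agree
```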


April 12, 2018 at 5:19 am

Nice post ! About pitfalls regarding correlation’s interpretation, here’s a funny database:

http://www.tylervigen.com/spurious-correlations

And a nice and poetic illustration of the concept of correlation:

https://www.youtube.com/watch?v=VFjaBh12C6s&t=0s&index=4&list=PLCkLQOAPOtT1xqDNK8m6IC1bgYCxGZJb_

Have a nice day

April 12, 2018 at 1:57 pm

Thanks for sharing those links! It’s always fun finding strange correlations like that.

The link for spurious correlations illustrates an important point. Many of those funny correlations are for time series data where both variables have a long-term trend. If you have two variables that you measure over time and they both have long term trends, those two variables will have a strong correlation even if there is no real connection between them!


April 3, 2018 at 7:05 pm

“In statistics, you typically need to perform a randomized, controlled experiment to determine that a relationship is causal rather than merely correlation.”

Would you please provide an example where you can reasonably conclude that x causes y? And how do you know there isn’t a z that you didn’t control for?

April 3, 2018 at 11:00 pm

That’s a great question. The trick is that when you perform an experiment, you should randomly assign subjects to treatment and control groups. This process randomly distributes any other characteristics that are related to the outcome variable (y). Suppose there is a z that is correlated to the outcome. That z gets randomly distributed between the treatment and control groups. The end result is that z should exist in all groups in roughly equal amounts. This equal distribution should occur even if you don’t know what z is. And, that’s the beautiful thing about random assignment. You don’t need to know everything that can affect the outcome, but random assignment still takes care of it all.

Consequently, if there is a relationship between a treatment and the outcome, you can be pretty certain that the treatment causes the changes in the outcome because all other correlation-only relationships should’ve been randomized away.
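
A small simulation (with a made-up confounder) can illustrate why this works: under random assignment, the confounder ends up roughly balanced across groups even though it was never measured or used when forming the groups:

```python
import numpy as np

# Hypothetical confounder z (e.g. age) for 1,000 subjects (illustrative only).
rng = np.random.default_rng(7)
z = rng.normal(loc=40, scale=12, size=1000)

# Randomly assign subjects to treatment and control groups.
assignment = rng.permutation(np.repeat(["treatment", "control"], 500))

print(round(z[assignment == "treatment"].mean(), 2),
      round(z[assignment == "control"].mean(), 2))
# The two group means are close: random assignment balances z across groups
# even though z was never used when forming the groups.
```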

I’ll be writing about random assignment in the near future. And, I’ve written about the effectiveness of flu shots , which is based on randomized controlled trials.

Clinical Kidney Journal, Volume 14, Issue 11, November 2021

Conducting correlation analysis: important limitations and pitfalls

Roemer J Janse

Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands

Tiny Hoekstra

Department of Nephrology, Amsterdam Cardiovascular Sciences, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands

Kitty J Jager

ERA-EDTA Registry, Department of Medical Informatics, Amsterdam Public Health Research Institute, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands

Carmine Zoccali

CNR-IFC, Center of Clinical Physiology, Clinical Epidemiology of Renal Diseases and Hypertension, Reggio Calabria, Italy

Giovanni Tripepi

CNR-IFC, Center of Clinical Physiology, Clinical Epidemiology of Renal Diseases and Hypertension, Reggio Calabria, Italy

Friedo W Dekker and Merel van Diepen

Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands

The correlation coefficient is a statistical measure often used in studies to show an association between variables or to look at the agreement between two methods. In this paper, we will discuss not only the basics of the correlation coefficient, such as its assumptions and how it is interpreted, but also important limitations when using the correlation coefficient, such as its assumption of a linear association and its sensitivity to the range of observations. We will also discuss why the coefficient is invalid when used to assess agreement of two methods aiming to measure a certain value, and discuss better alternatives, such as the intraclass coefficient and Bland–Altman’s limits of agreement. The concepts discussed in this paper are supported with examples from literature in the field of nephrology.

‘Correlation is not causation’: a saying not rarely uttered when a person infers causality from two variables occurring together, without them truly affecting each other. Yet, though causation may not always be understood correctly, correlation too is a concept in which mistakes are easily made. Nonetheless, the correlation coefficient has often been reported within the medical literature. It estimates the association between two variables (e.g. blood pressure and kidney function), or is used for the estimation of agreement between two methods of measurement that aim to measure the same variable (e.g. the Modification of Diet in Renal Disease (MDRD) formula and the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) formula for estimating the glomerular filtration rate (eGFR)). Despite the wide use of the correlation coefficient, limitations and pitfalls for both situations exist, of which one should be aware when drawing conclusions from correlation coefficients. In this paper, we aim to describe the correlation coefficient and its limitations, together with methods that can be applied to avoid these limitations.

The basics: the correlation coefficient

Fundamentals

The correlation coefficient was described over a hundred years ago by Karl Pearson [ 1 ], taking inspiration from a similar idea of correlation from Sir Francis Galton, who developed linear regression and was the not-so-well-known half-cousin of Charles Darwin [ 2 ]. In short, the correlation coefficient, denoted with the Greek character rho ( ρ ) for the true (theoretical) population and r for a sample of the true population, aims to estimate the strength of the linear association between two variables. If we have variables X and Y that are plotted against each other in a scatter plot, the correlation coefficient indicates how well a straight line fits these data. The coefficient ranges from −1 to 1 and is dimensionless (i.e., it has no unit). Two correlations with r = −1 and r  = 1 are shown in Figure 1A and B , respectively. The values of −1 and 1 indicate that all observations can be described perfectly using a straight line, which in turn means that if X is known, Y can be determined deterministically and vice versa. Here, the minus sign indicates an inverse association: if X increases, Y decreases. Nonetheless, real-world data are often not perfectly summarized using a straight line. In a scatterplot as shown in Figure 1C , the correlation coefficient represents how well a linear association fits the data.


Different shapes of data and their correlation coefficients. ( A ) Linear association with r = −1. ( B ) A linear association with r  = 1. ( C ) A scatterplot through which a straight line could plausibly be drawn, with r  = 0.50. ( D ) A sinusoidal association with r  = 0. ( E ) A quadratic association with r  = 0. ( F ) An exponential association with r  = 0.50.

It is also possible to test the hypothesis of whether X and Y are correlated, which yields a P-value indicating the chance of finding the correlation coefficient’s observed value or any value indicating a higher degree of correlation, given that the two variables are not actually correlated. Though the correlation coefficient will not vary depending on sample size, the P-value yielded with the t -test will.
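
As a sketch of how that test is commonly carried out (invented data, SciPy assumed), the test statistic is t = r·sqrt(n − 2)/sqrt(1 − r²) with n − 2 degrees of freedom, and the manual calculation matches the P-value reported by pearsonr:

```python
import numpy as np
from scipy.stats import t, pearsonr

# Hypothetical paired observations (illustrative only).
rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

r, p_scipy = pearsonr(x, y)
n = len(x)

# t-statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_manual = 2 * t.sf(abs(t_stat), n - 2)  # two-sided P-value

print(round(p_scipy, 4), round(p_manual, 4))  # the two P-values match
```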

The value of the correlation coefficient is also not influenced by the units of measurement, but it is influenced by measurement error. If more error (also known as noise) is present in the variables X and Y , variability in X will be partially due to the error in X , and thus not solely explainable by Y . Moreover, the correlation coefficient is also sensitive to the range of observations, which we will discuss later in this paper.

An assumption of the Pearson correlation coefficient is that the joint distribution of the variables is normal. However, it has been shown that the correlation coefficient is quite robust with regard to this assumption, meaning that Pearson’s correlation coefficient may still be validly estimated in skewed distributions [ 3 ]. If desired, a non-parametric method is also available to estimate correlation; namely, the Spearman’s rank correlation coefficient. Instead of the actual values of observations, the Spearman’s correlation coefficient uses the rank of the observations when ordering observations from small to large, hence the ‘rank’ in its name [ 4 ]. This usage of the rank makes it robust against outliers [ 4 ].

Explained variance and interpretation

One may also translate the correlation coefficient into a measure of the explained variance (also known as R²) by taking its square. The result can be interpreted as the proportion of statistical variability (i.e. variance) in one variable that can be explained by the other variable. In other words, to what degree can variable X be explained by Y and vice versa. For instance, as mentioned above, a correlation of −1 or +1 would both allow us to determine X from Y and vice versa without error, which is also shown in the coefficient of determination, which would be (−1)² or 1² = 1, indicating that 100% of the variability in one variable can be explained by the other variable.

In some cases, the interpretation of the strength of the correlation coefficient is based on rules of thumb, as is often the case with P-values (P-value <0.05 is statistically significant, P-value >0.05 is not statistically significant). However, such rules of thumb should not be used for correlations. Instead, the interpretation should always depend on context and purposes [ 5 ]. For instance, when studying the association of renin–angiotensin–system inhibitors (RASi) with blood pressure, patients with increased blood pressure may receive the perfect dosage of RASi until their blood pressure is exactly normal. Those with an already exactly normal blood pressure will not receive RASi. However, as the perfect dosage of RASi makes the blood pressure of the RASi users exactly normal, and thus equal to the blood pressure of the RASi non-users, no variation is left between users and non-users. Because of this, the correlation will be 0.

The linearity of correlation

An important limitation of the correlation coefficient is that it assumes a linear association. This also means that any linear transformation and any scale transformation of either variable X or Y , or both, will not affect the correlation coefficient. However, variables X and Y may also have a non-linear association, which could still yield a low correlation coefficient, as seen in Figure 1D and E , even though variables X and Y are clearly related. Nonetheless, the correlation coefficient will not always return 0 in case of a non-linear association, as portrayed in Figure 1F with an exponential correlation with r  = 0.5. In short, a correlation coefficient is not a measure of the best-fitted line through the observations, but only the degree to which the observations lie on one straight line.

In general, before calculating a correlation coefficient, it is advised to inspect a scatterplot of the observations in order to assess whether the data could possibly be described with a linear association and whether calculating a correlation coefficient makes sense. For instance, the scatterplot in Figure 1C could plausibly fit a straight line, and a correlation coefficient would therefore be suitable to describe the association in the data.
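
The quadratic case in Figure 1E is easy to reproduce; in this minimal sketch (made-up values), y is completely determined by x, yet r is essentially zero because the association is not linear:

```python
import numpy as np

# Hypothetical quadratic relationship, mirroring Figure 1E (illustrative only).
x = np.linspace(-3, 3, 200)
y = x**2

print(round(np.corrcoef(x, y)[0, 1], 3))
# r is essentially 0 even though y is completely determined by x,
# because the association is not linear.
```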

The range of observations for correlation

An important pitfall of the correlation coefficient is that it is influenced by the range of observations. In Figure 2A , we illustrate hypothetical data with 50 observations, with r  = 0.87. Included in the figure is an ellipse that shows the variance of the full observed data, and an ellipse that shows the variance of only the 25 lowest observations. If we subsequently analyse these 25 observations independently as shown in Figure 2B , we will see that the ellipse has shortened. If we determine the correlation coefficient for Figure 2B , we will also find a substantially lower correlation: r  = 0.57.


The effect of the range of observations on the correlation coefficient, as shown with ellipses. ( A ) Set of 50 observations from hypothetical dataset X with r  = 0.87, with an illustrative ellipse showing length and width of the whole dataset, and an ellipse showing only the first 25 observations. ( B ) Set of only the 25 lowest observations from hypothetical dataset X with r  = 0.57, with an illustrative ellipse showing length and width.

The importance of the range of observations can further be illustrated using an example from a paper by Pierrat et al. [ 6 ] in which the correlation between the eGFR calculated using inulin clearance and the eGFR calculated using the Cockcroft–Gault formula was studied both in adults and children. Children had a higher correlation coefficient than adults (r = 0.81 versus r = 0.67), after which the authors mentioned: ‘The coefficients of correlation were even better […] in children than in adults.’ However, the range of observations in children was larger than the range of observations in adults, which in itself could explain the higher correlation coefficient observed in children. One can thus not simply conclude that the Cockcroft–Gault formula for eGFR correlates better with inulin in children than in adults. Because the range of observations influences the correlation coefficient, it is important to realize that correlation coefficients cannot be readily compared between groups or studies. Another consequence of this is that researchers could inflate the correlation coefficient by including additional low and high eGFR values.
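
To reproduce the idea behind Figure 2 with simulated values (not the authors' dataset), one can draw bivariate normal data and recompute the correlation on a restricted range of X:

```python
import numpy as np

# Simulated bivariate normal data with a true correlation of about 0.87.
# These are illustrative values, not the dataset behind Figure 2.
rng = np.random.default_rng(5)
data = rng.multivariate_normal(mean=[0, 0],
                               cov=[[1.0, 0.87], [0.87, 1.0]],
                               size=1000)
x, y = data[:, 0], data[:, 1]

r_full = np.corrcoef(x, y)[0, 1]

# Restrict the range: keep only the half of the sample with the lowest x values.
keep = np.argsort(x)[:500]
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))
# The restricted-range correlation is typically noticeably smaller.
```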

The non-causality of correlation

Another important pitfall of the correlation coefficient is that it cannot be interpreted as causal. It is of course possible that there is a causal effect of one variable on the other, but there may also be other possible explanations that the correlation coefficient does not take into account. Take for example the phenomenon of confounding. We can study the association of prescribing angiotensin-converting enzyme (ACE)-inhibitors with a decline in kidney function. These two variables would be highly correlated, which may be due to the underlying factor albuminuria. A patient with albuminuria is more likely to receive ACE-inhibitors, but is also more likely to have a decline in kidney function. So ACE-inhibitors and a decline in kidney function are correlated not because of ACE-inhibitors causing a decline in kidney function, but because they have a shared underlying cause (also known as common cause) [ 7 ]. More reasons why associations may be biased exist, which are explained elsewhere [ 8 , 9 ].

It is however possible to adjust for such confounding effects, for example by using multivariable regression. Whereas a univariable (or ‘crude’) linear regression analysis is no different than calculating the correlation coefficient, a multivariable regression analysis allows one to adjust for possible confounder variables. Other factors need to be taken into account to estimate causal effects, but these are beyond the scope of this paper.

Agreement between methods

We have discussed the correlation coefficient and its limitations when studying the association between two variables. However, the correlation coefficient is also often incorrectly used to study the agreement between two methods that aim to estimate the same variable. Again, also here, the correlation coefficient is an invalid measure.

The correlation coefficient aims to represent to what degree a straight line fits the data. This is not the same as agreement between methods (i.e. whether X  =  Y ). If methods completely agree, all observations would fall on the line of equality (i.e. the line on which the observations would be situated if X and Y had equal values). Yet the correlation coefficient looks at the best-fitted straight line through the data, which is not per se the line of equality. As a result, any method that would consistently measure a twice as large value as the other method would still correlate perfectly with the other method. This is shown in Figure 3 , where the dashed line shows the line of equality, and the other lines portray different linear associations, all with perfect correlation, but no agreement between X and Y . These linear associations may portray a systematic difference, better known as bias, in one of the methods.


A set of linear associations, with the dashed line (- - -) showing the line of equality where X  =  Y . The equations and correlations for the other lines are shown as well, which shows that only a linear association is needed for r  = 1, and not specifically agreement.

This limitation applies to all comparisons of methods, where it is studied whether methods can be used interchangeably, and it also applies to situations where two individuals measure a value and where the results are then compared (inter-observer variation or agreement; here the individuals can be seen as the ‘methods’), and to situations where it is studied whether one method measures consistently at two different time points (also known as repeatability). Fortunately, other methods exist to compare methods [ 10 , 11 ], of which one was proposed by Bland and Altman themselves [ 12 ].

Intraclass coefficient

One valid method to assess interchangeability is the intraclass coefficient (ICC), which is a generalization of Cohen’s κ , a measure for the assessment of intra- and interobserver agreement. The ICC shows the proportion of the variability in the new method that is due to the normal variability between individuals. The measure takes into account both the correlation and the systematic difference (i.e. bias), which makes it a measure of both the consistency and agreement of two methods. Nonetheless, like the correlation coefficient, it is influenced by the range of observations. However, an important advantage of the ICC is that it allows comparison between multiple variables or observers. Similar to the ICC is the concordance correlation coefficient (CCC), though it has been stated that the CCC yields values similar to the ICC [ 13 ]. Nonetheless, the CCC may also be found in the literature [ 14 ].

The 95% limits of agreement and the Bland–Altman plot

When they published their critique on the use of the correlation coefficient for the measurement of agreement, Bland and Altman also published an alternative method to measure agreement, which they called the limits of agreement (also referred to as a Bland–Altman plot) [ 12 ]. To illustrate the method of the limits of agreement, an artificial dataset was created using the MASS package (version 7.3-53) for R version 4.0.4 (R Foundation for Statistical Computing, Vienna, Austria). Two sets of observations (two observations per person) were derived from a normal distribution with a mean (µ) of 120 and a randomly chosen standard deviation (σ) between 5 and 15. The mean of 120 was chosen with the aim of having the values resemble measurements of high eGFR, where the first set of observed eGFRs was hypothetically acquired using the MDRD formula, and the second set of observed eGFRs was hypothetically acquired using the CKD-EPI formula. The observations can be found in Table 1.

Artificial data portraying hypothetically observed MDRD measurements and CKD-EPI measurements

Participant ID | eGFR with MDRD (mL/min/1.73 m²) | eGFR with CKD-EPI (mL/min/1.73 m²) | Difference (CKD-EPI − MDRD)
1 | 119.1 | 118.4 | −0.7
2 | 123.7 | 121.6 | −2.1
3 | 123.5 | 117.6 | −5.9
4 | 121.1 | 118.1 | −3.0
5 | 115.7 | 119.4 | 3.7
6 | 117.4 | 120.5 | 3.1
7 | 119.2 | 120.8 | 1.6
8 | 120.0 | 119.4 | −0.6
9 | 126.7 | 118.0 | −8.7
10 | 122.1 | 123.1 | 1.0
11 | 117.8 | 120.9 | 3.1
12 | 116.8 | 118.8 | 2.0
13 | 119.2 | 121.7 | 2.5
14 | 119.2 | 117.8 | −1.4
15 | 118.9 | 118.8 | −0.1
16 | 120.7 | 115.8 | −4.9
17 | 117.5 | 124.1 | 6.6
18 | 121.2 | 122.1 | 0.9
19 | 116.6 | 125.4 | 8.8
20 | 119.4 | 120.0 | 0.6

Mean difference: 0.32

SD of differences: 4.09

The 95% limits of agreement can be easily calculated using the mean of the differences (d̄) and the standard deviation (SD) of the differences. The upper limit (UL) of the limits of agreement would then be UL = d̄ + 1.96 × SD and the lower limit (LL) would be LL = d̄ − 1.96 × SD. If we apply this to the data from Table 1, we would find d̄ = 0.32 and SD = 4.09. Subsequently, UL = 0.32 + 1.96 × 4.09 = 8.34 and LL = 0.32 − 1.96 × 4.09 = −7.70. Our limits of agreement are thus −7.70 to 8.34. We can now decide whether these limits of agreement are too broad. Imagine we decide that if we want to replace the MDRD formula with the CKD-EPI formula, the difference may not be larger than 7 mL/min/1.73 m². Because the limits of agreement extend beyond ±7 mL/min/1.73 m², the MDRD and CKD-EPI formulas cannot, on the basis of these (hypothetical) data, be used interchangeably in our case. It should also be noted that, as the limits of agreement are statistical parameters, they are also subject to uncertainty. The uncertainty can be determined by calculating 95% confidence intervals for the limits of agreement, on which Bland and Altman elaborate in their paper [ 12 ].
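
Using the differences from Table 1, the whole calculation takes only a few lines; here is a minimal sketch in Python (NumPy assumed):

```python
import numpy as np

# Differences (CKD-EPI minus MDRD) for the 20 participants in Table 1.
diff = np.array([-0.7, -2.1, -5.9, -3.0, 3.7, 3.1, 1.6, -0.6, -8.7, 1.0,
                 3.1, 2.0, 2.5, -1.4, -0.1, -4.9, 6.6, 0.9, 8.8, 0.6])

d_bar = diff.mean()        # mean difference (about 0.32, as reported above)
sd = diff.std(ddof=1)      # SD of the differences (about 4.09, as reported above)

lower = d_bar - 1.96 * sd  # lower limit of agreement
upper = d_bar + 1.96 * sd  # upper limit of agreement

print(f"mean difference = {d_bar:.2f}, SD = {sd:.2f}")
print(f"95% limits of agreement: {lower:.2f} to {upper:.2f}")
# Output is close to the -7.70 to 8.34 reported in the text
# (tiny differences come from rounding the intermediate values).
```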

The limits of agreement are also subject to two assumptions: (i) the mean and SD of the differences should be constant over the range of observations and (ii) the differences are approximately normally distributed. To check these assumptions, two plots were proposed: the Bland–Altman plot, which is the differences plotted against the means of their measurements, and a histogram of the differences. If in the Bland–Altman plot the means and SDs of the differences appear to be equal along the x -axis, the first assumption is met. The histogram of the differences should follow the pattern of a normal distribution. We checked these assumptions by creating a Bland–Altman plot in Figure 4A and a histogram of the differences in Figure 4B . As often done, we also added the limits of agreement to the Bland–Altman plot, between which approximately 95% of datapoints are expected to be. In Figure 4A , we see that the mean of the differences appears to be equal along the x -axis; i.e., these datapoints could plausibly fit the horizontal line of the total mean across the whole x -axis. Nonetheless, the SD does not appear to be distributed equally: the means of the differences at the lower values of the x -axis are closer to the total mean (thus a lower SD) than the means of the differences at the middle values of the x -axis (thus a higher SD). Therefore, the first assumption is not met. Nonetheless, the second assumption is met, because our differences follow a normal distribution, as shown in Figure 4B . Our failure to meet the first assumption can be due to a number of reasons, for which Bland and Altman also proposed solutions [ 15 ]. For example, data may be skewed. However, in that case, log-transforming variables may be a solution [ 16 ].


Plots to check assumptions for the limits of agreement. ( A ) The Bland–Altman plot for the assumption that the mean and SD of the differences are constant over the range of observations. In our case, we see that the mean of the differences appears to be equal along the x -axis; i.e., these datapoints could plausibly fit the horizontal line of the total mean across the whole x -axis. Nonetheless, the SD does not appear to be distributed equally: the means of the differences at the lower values of the x -axis are closer to the total mean (thus a lower SD) than the means of the differences at the middle values of the x -axis (thus a higher SD). Therefore, the first assumption is not met. The limits of agreement and the mean are added as dashed (- - -) lines. ( B ) A histogram of the distribution of differences to ascertain the assumption of whether the differences are normally distributed. In our case, the observations follow a normal distribution and thus, the assumption is met.

It is often mistakenly thought that the Bland–Altman plot alone is the analysis to determine the agreement between methods, but the authors themselves spoke strongly against this [ 15 ]. We suggest that authors should both report the limits of agreement and show the Bland–Altman plot, to allow readers to assess for themselves whether they think the agreement is met.

The correlation coefficient is easy to calculate and provides a measure of the strength of linear association in the data. However, it also has important limitations and pitfalls, both when studying the association between two variables and when studying agreement between methods. These limitations and pitfalls should be taken into account when using and interpreting it. If necessary, researchers should look into alternatives to the correlation coefficient, such as regression analysis for causal research, and the ICC and the limits of agreement combined with a Bland–Altman plot when comparing methods.

CONFLICT OF INTEREST STATEMENT

None declared.

Contributor Information

Roemer J Janse, Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands.

Tiny Hoekstra, Department of Nephrology, Amsterdam Cardiovascular Sciences, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands.

Kitty J Jager, ERA-EDTA Registry, Department of Medical Informatics, Amsterdam Public Health Research Institute, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands.

Carmine Zoccali, CNR-IFC, Center of Clinical Physiology, Clinical Epidemiology of Renal Diseases and Hypertension, Reggio Calabria, Italy.

Giovanni Tripepi, CNR-IFC, Center of Clinical Physiology, Clinical Epidemiology of Renal Diseases and Hypertension, Reggio Calabria, Italy.

Friedo W Dekker, Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands.

Merel van Diepen, Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands.



Introduction to Correlation and Regression Analysis


In this section we will first discuss correlation analysis, which is used to quantify the association between two continuous variables (e.g., between an independent and a dependent variable or between two independent variables). Regression analysis is a related technique to assess the relationship between an outcome variable and one or more risk factors or confounding variables. The outcome variable is also called the response or dependent variable, and the risk factors and confounders are called the predictors, or explanatory or independent variables. In regression analysis, the dependent variable is denoted "y" and the independent variables are denoted by "x".

[NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a situation in which inferences should be limited to identifying associations. The terms "independent" and "dependent" variable are less subject to these interpretations as they do not strongly imply cause and effect.]

In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson product-moment correlation coefficient. The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other).

The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association.

For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.


It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient does not detect this. Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical displays are particularly useful to explore associations between variables.

The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.

 

  • Scenario 1 depicts a strong positive association (r=0.9), similar to what we might see for the correlation between infant birth weight and birth length.
  • Scenario 2 depicts a weaker association (r=0.2) that we might expect to see between age and body mass index (which tends to increase with age).
  • Scenario 3 might depict the lack of association (r approximately 0) between the extent of media exposure in adolescence and age at which adolescents initiate sexual activity.
  • Scenario 4 might depict the strong negative association (r= -0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.

A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

[Table: gestational age (weeks) and birth weight (grams) for the 17 infants]

We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus y=birth weight and x=gestational age. The data are displayed in a scatter diagram in the figure below.

[Scatter diagram: birth weight (grams) plotted against gestational age (weeks)]

Each point represents an (x,y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that the independent variable is on the horizontal axis (or X-axis), and the dependent variable is on the vertical axis (or Y-axis). The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.

The formula for the sample correlation coefficient is

$$ r = \frac{\mathrm{Cov}(x,y)}{s_x \, s_y} $$

where Cov(x, y) is the covariance of x and y, defined as

$$ \mathrm{Cov}(x,y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1} $$

and the variances of x and y measure the variability of the x scores and y scores around their respective sample means, $\bar{x}$ and $\bar{y}$:

$$ s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}, \qquad s_y^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1} $$

To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight.

We first summarize the gestational age data by computing the mean gestational age, $\bar{x}$. To compute the variance of gestational age, we then sum the squared deviations (or differences) between each observed gestational age and the mean gestational age, and divide by n − 1. [Table of deviations and squared deviations for gestational age]

Next, we summarize the birth weight data in the same way: we compute the mean birth weight, $\bar{y}$, and then compute the variance of birth weight just as we did for gestational age. [Table of deviations and squared deviations for birth weight]

Next we compute the covariance. To compute the covariance of gestational age and birth weight, we multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, i.e. $(x_i - \bar{x})(y_i - \bar{y})$; we simply copy the deviations from the two tables above, multiply them, sum the products and divide by n − 1.

We now compute the sample correlation coefficient by dividing the covariance of gestational age and birth weight by the product of their standard deviations:

$$ r = \frac{\mathrm{Cov}(x,y)}{s_x \, s_y} $$

Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.

As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller. 1
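
For readers who prefer to verify the arithmetic in software, the following sketch (NumPy assumed) follows the same step-by-step computation; the gestational ages and birth weights in it are invented for illustration and are not the 17 observations from the study above:

```python
import numpy as np

def pearson_r(x, y):
    """Sample correlation coefficient computed step by step, as above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    dx, dy = x - x.mean(), y - y.mean()     # deviations from the sample means
    cov_xy = np.sum(dx * dy) / (n - 1)      # sample covariance
    var_x = np.sum(dx**2) / (n - 1)         # sample variance of x
    var_y = np.sum(dy**2) / (n - 1)         # sample variance of y
    return cov_xy / np.sqrt(var_x * var_y)

# Hypothetical gestational ages (weeks) and birth weights (grams);
# these are NOT the 17 observations from the study described above.
gest_age = [34, 36, 38, 40, 39, 35, 37, 41]
birth_wgt = [2200, 2600, 3000, 3500, 3300, 2400, 2800, 3600]

print(round(pearson_r(gest_age, birth_wgt), 2))
print(round(np.corrcoef(gest_age, birth_wgt)[0, 1], 2))  # same value, as a check
```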



What is Correlation Analysis? [How to Measure + Pros & Cons]

by Tim Gell

Posted at: 6/14/2024 12:30 PM

Correlation analysis is a tremendous tool to use in understanding how one variable affects another. It can also be used to find out how much they affect each other.

By providing a distinct perspective on which factors impact your business the most, you can feel more confident in the actions you take after the report.    

Correlations can be misunderstood or misused, which is why it’s important to have experience or be using expert insights in order to get things correct.   

In this blog post, our online survey agency provides more insight into correlation analysis including its definition, how to measure and interpret correlation, examples, and more.

What is Correlation Analysis? Our Definition

Correlation analysis in market research is a statistical method that identifies the strength of a relationship between two or more variables. In a nutshell, the process reveals patterns within a dataset’s many variables.

It's all about identifying relationships between variables–specifically in research. 

Using one of the several formulas, the end result will be a numerical output between -1 and +1.

Let’s say you are interested in the relationship between two variables, Variable A and Variable B. 

  • Results close to +1 indicate a positive correlation , meaning as Variable A increases, Variable B also increases.
  • Outputs closer to -1 are a sign of a negative correlation, meaning that as Variable A increases, Variable B decreases.

A value near 0 in a correlation analysis indicates a less meaningful relationship between Variable A and Variable B. 

While you are technically testing two variables at a time, you can look at as many variables as you would like in a grid output with the same variables listed as both columns and rows. 

The Drive Research team explores more into the definition of correlation analysis in the video below.

How to Measure Correlation

You must first conduct an online survey to analyze the correlation between two variables. The process includes writing, programming, and fielding a survey. The results are later used to determine strength scores.

You are likely to find a useful application for them in customer satisfaction surveys , employee surveys , customer experience (CX) programs, or market surveys .

These surveys typically include many questions that make ideal variables in a correlation analysis.

Below is the process our online survey agency follows to measure correlation.

Step 1. Write the survey

The first step in running a correlation analysis in market research is designing the survey. You will need to plan ahead with questions in mind for the analysis.

This includes anything that yields data that is both numerical and ordinal.

Think of metrics such as:

  • Agreement scales
  • Importance scales
  • Satisfaction scales
  • Temperature

Step 2. Program + field the survey

Once the survey is finalized, you will need to program and test it to ensure the questions are functioning correctly.

This is important because mislabeled scales or improper data validation in the programming will taint the data used for correlation analysis.

Use our online survey testing checklist for what to look for before launching the questionnaire into fieldwork.

Once everything checks out, it's time to administer the fieldwork of the survey.

Step 3. Analyze the correlation between 2 variables

Next, clean the survey data after the target number of responses is reached. This protects the integrity of the data for analysis.

The two most common ways to run a correlation include:

  • The Pearson r correlation is best used when the variables are quantitative, the relationship between them is linear, and there are no extreme outliers.
  • The Spearman rank correlation is best used when you want to see when one ranked variable increases if the other ranked variable increases or decreases.

That said, most data analysis software features a tool that runs a correlation analysis automatically once you enter the inputs.

For instance, you can run the analysis through some sort of spreadsheet software, like Microsoft Excel.

Here is a great video that walks through the process of using Excel to calculate a correlation coefficient. 
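
If you prefer working in code rather than a spreadsheet, here is a minimal sketch (invented scale ratings, pandas assumed) that produces the grid of pairwise correlations described earlier:

```python
import pandas as pd

# Hypothetical survey responses on 1-5 agreement scales (illustrative only).
df = pd.DataFrame({
    "satisfaction": [4, 5, 3, 2, 4, 5, 3, 4],
    "value":        [3, 5, 3, 2, 4, 4, 2, 4],
    "support":      [4, 4, 2, 3, 5, 5, 3, 3],
})

# A correlation grid with the same variables as both rows and columns.
print(df.corr(method="pearson").round(2))

# For strictly ordinal scale data, Spearman's rank correlation may be preferred.
print(df.corr(method="spearman").round(2))
```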

If you're not comfortable conducting a survey and using Excel, contact our market research company . Our team can commission a full correlation analysis study on behalf of your organization.

The Coefficients To Use

As shown in the measurement steps above, there are two main coefficients to use during the analysis phase: the Pearson coefficient and Spearman’s coefficient.

Pearson Coefficient

The Pearson Correlation Coefficient (r) is used to measure the strength and direction of the linear relationship between two continuous variables.

It assumes that both variables are normally distributed, have a linear relationship, and are measured on an interval or ratio scale.

The coefficient ranges from -1 to +1, where +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

This coefficient is useful for understanding how changes in one variable are associated with changes in another.

Spearman’s Coefficient

Spearman's Rank Correlation Coefficient (ρ or rs) measures the strength and direction of the monotonic relationship between two ranked variables.

It assumes that the data can be ranked and that the relationship between the variables is monotonic, meaning it consistently increases or decreases but is not necessarily linear.

The coefficient ranges from -1 to +1, with +1 indicating a perfect positive monotonic relationship, -1 indicating a perfect negative monotonic relationship, and 0 indicating no monotonic relationship.

This method is useful when the data do not meet the assumptions of Pearson correlation, particularly for ordinal data or non-linear relationships.

How to Interpret Correlation Analysis 

Correlation strength is judged by the absolute value of the coefficient, which ranges from 0 to 1; the higher the value, the stronger the correlation.

When the absolute value is greater than 0.5, there is considered to be a strong correlation between the two variables.

All correlation strength scores and classifications are outlined below.

  • Very strong: 0.80 to 1.00
  • Strong: 0.50 to 0.79
  • Moderate: 0.30 to 0.49
  • Weak: 0.00 to 0.29

Correlation Analysis Example

Employee surveys are a great example of how correlation analysis is used.

For instance, most full-service employee survey companies utilize correlation analysis to determine which independent variables (such as salary or benefits) impact a dependent variable (such as employee satisfaction).

Let's look at an example.

A common employee net promoter score (eNPS) question used to measure correlation asks how likely an employee is to recommend working at the company to a friend or family member on a 1 to 10 scale, where 1 reflects “not at all likely” and 10 reflects “very likely.”

The final eNPS serves as the dependent variable.

Then, a follow-up question should be included that asks employees to rate how satisfied they are with different organizational factors.

For example, asking employees to rate their level of satisfaction on a 1 to 5 scale where 1 is “not at all satisfied” and 5 is “very satisfied” with factors such as salary, benefits, training, and diversity and inclusion.

The final scores serve as independent variables.

Correlation analysis will tell you which independent variables most positively correlate with eNPS.

For instance, if salary and benefits had a correlation coefficient of 0.6, it would have a “strong” correlation with eNPS.   
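
As a sketch of how this could be computed (hypothetical ratings, illustrative column names only), a correlation matrix in pandas returns each factor's correlation with eNPS in one call:

```python
# Minimal sketch: correlate eNPS with satisfaction factors (hypothetical data).
import pandas as pd

responses = pd.DataFrame({
    "enps":     [9, 7, 10, 5, 8, 6, 9, 4, 7, 10],   # 1-10 likelihood to recommend
    "salary":   [4, 3, 5, 2, 4, 3, 5, 2, 3, 5],     # 1-5 satisfaction ratings
    "benefits": [5, 3, 4, 2, 4, 2, 5, 1, 3, 5],
    "training": [3, 4, 3, 3, 2, 4, 3, 2, 4, 4],
})

# Pearson correlation of each factor with eNPS, strongest first
print(responses.corr()["enps"].drop("enps").sort_values(ascending=False))
```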

Benefits of Finding Correlation Analysis in Market Research

There are several reasons to consider running a correlation analysis in your next market research study.

Get more from your data

For one, planning a correlation analysis motivates market researchers to ask better questions in the survey.

Knowing that many variables will be examined during the analysis, researchers will spend more time thinking through the most important and relevant data to collect.

Make more informed decisions

Once you have the data, the correlation analysis helps you identify which variables have the strongest relationships.

Unforeseen negative or positive correlations may help businesses make better-informed decisions.

Even though correlation analysis results are not predictive on their own, they can still inform future qualitative or quantitative research.

For instance, you may discover a significant pattern between variables that inspires additional research.

A great counterpart to regression analysis

Correlation analysis also nicely leads to regression analysis . By comparison, regression analysis tells you what Variable A might look like based on a particular value of Variable B.

In other words, correlation tells you there is a relationship, but regression shows you what that relationship looks like.
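
A minimal sketch of that contrast (hypothetical data, not from the original article): the correlation coefficient summarizes the strength of the relationship, while a simple regression line predicts one variable from the other.

```python
# Minimal sketch: correlation vs. regression on the same hypothetical data.
import numpy as np
from scipy import stats

ad_spend = np.array([10, 12, 15, 18, 20, 22, 25, 28, 30, 35])          # Variable B
sales = np.array([110, 118, 130, 142, 150, 158, 170, 181, 190, 212])   # Variable A

r, _ = stats.pearsonr(ad_spend, sales)
fit = stats.linregress(ad_spend, sales)

print(f"correlation: r = {r:.2f}")                                      # strength only
print(f"regression:  sales ~ {fit.intercept:.1f} + {fit.slope:.2f} * ad_spend")
print(f"predicted sales at ad_spend = 40: {fit.intercept + fit.slope * 40:.1f}")
```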

Drawbacks to Measuring Correlation

Correlation analysis is useful to understand how variables interact with one another.

That said, pitfalls exist and have to be looked out for if you choose to run the survey in-house. Drawbacks to measuring correlation include:

Coincidences within the results

One of the biggest pitfalls is that you may get a result showing a strong correlation, either negative or positive, that is nothing more than a coincidence.

Take customer satisfaction, for example. Through the analysis, you may find customer experience is highly graded and correlates strongly with overall satisfaction. 

To say that one is directly causing the other could be faulty and should be carefully considered. 

There are many factors at play that could just be a coincidence when reviewing correlation analysis statistics , and it's essential to make conclusions within the scope of reason.

Correlation is not causation

Only use correlation analysis if you understand and can explain to a client that correlation is not causation.

It is tempting to jump to the conclusion that two variables directly affect each other, but this analysis is meant for identifying connections, not for proving cause and effect.

That said, when there is an interest in discovering relationships between two or more variables, correlation analysis is an excellent fit in a market research project .

Contact Drive Research to Measure Correlation Analysis

Correlation analysis in research allows for a deeper look into variables within a business or industry.

For full confidence in your results, we recommend partnering with a market research company like Drive Research.

Our team of experts has years of experience using correlation analysis to analyze feedback from our clients' employees, customers, and other audiences.

Want to include correlation analysis in your next research study? Contact Drive Research for a quote.

  • Message us on our website
  • Email us at  [email protected]
  • Call us at  888-725-DATA
  • Text us at 315-303-2040

tim gell - about the author

As a Research Manager, Tim is involved in every stage of a market research project for our clients. He first developed an interest in market research while studying at Binghamton University based on its marriage of business, statistics, and psychology. 

Learn more about Tim, here .


Conducting correlation analysis: important limitations and pitfalls

Roemer J Janse, Tiny Hoekstra, Kitty J Jager, Carmine Zoccali, Giovanni Tripepi, Friedo W Dekker, Merel van Diepen, Conducting correlation analysis: important limitations and pitfalls, Clinical Kidney Journal , Volume 14, Issue 11, November 2021, Pages 2332–2337, https://doi.org/10.1093/ckj/sfab085


The correlation coefficient is a statistical measure often used in studies to show an association between variables or to look at the agreement between two methods. In this paper, we will discuss not only the basics of the correlation coefficient, such as its assumptions and how it is interpreted, but also important limitations when using the correlation coefficient, such as its assumption of a linear association and its sensitivity to the range of observations. We will also discuss why the coefficient is invalid when used to assess agreement of two methods aiming to measure a certain value, and discuss better alternatives, such as the intraclass coefficient and Bland–Altman’s limits of agreement. The concepts discussed in this paper are supported with examples from literature in the field of nephrology.

‘Correlation is not causation’: a saying often uttered when a person infers causality from two variables occurring together, without them truly affecting each other. Yet, though causation may not always be understood correctly, correlation too is a concept in which mistakes are easily made. Nonetheless, the correlation coefficient has often been reported within the medical literature. It estimates the association between two variables (e.g. blood pressure and kidney function), or is used for the estimation of agreement between two methods of measurement that aim to measure the same variable [e.g. the Modification of Diet in Renal Disease (MDRD) formula and the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) formula for estimating the glomerular filtration rate (eGFR)]. Despite the wide use of the correlation coefficient, limitations and pitfalls for both situations exist, of which one should be aware when drawing conclusions from correlation coefficients. In this paper, we aim to describe the correlation coefficient and its limitations, together with methods that can be applied to avoid these limitations.

Fundamentals

The correlation coefficient was described over a hundred years ago by Karl Pearson [ 1 ], taking inspiration from a similar idea of correlation from Sir Francis Galton, who developed linear regression and was the not-so-well-known half-cousin of Charles Darwin [ 2 ]. In short, the correlation coefficient, denoted with the Greek character rho ( ρ ) for the true (theoretical) population and r for a sample of the true population, aims to estimate the strength of the linear association between two variables. If we have variables X and Y that are plotted against each other in a scatter plot, the correlation coefficient indicates how well a straight line fits these data. The coefficient ranges from −1 to 1 and is dimensionless (i.e., it has no unit). Two correlations with r = −1 and r  = 1 are shown in Figure 1A and B , respectively. The values of −1 and 1 indicate that all observations can be described perfectly using a straight line, which in turn means that if X is known, Y can be determined deterministically and vice versa. Here, the minus sign indicates an inverse association: if X increases, Y decreases. Nonetheless, real-world data are often not perfectly summarized using a straight line. In a scatterplot as shown in Figure 1C , the correlation coefficient represents how well a linear association fits the data.

Figure 1: Different shapes of data and their correlation coefficients. (A) Linear association with r = −1. (B) A linear association with r = 1. (C) A scatterplot through which a straight line could plausibly be drawn, with r = 0.50. (D) A sinusoidal association with r = 0. (E) A quadratic association with r = 0. (F) An exponential association with r = 0.50.

It is also possible to test the hypothesis of whether X and Y are correlated, which yields a P-value indicating the chance of finding the correlation coefficient’s observed value or any value indicating a higher degree of correlation, given that the two variables are not actually correlated. Though the correlation coefficient will not vary depending on sample size, the P-value yielded with the t -test will.

The value of the correlation coefficient is also not influenced by the units of measurement, but it is influenced by measurement error. If more error (also known as noise) is present in the variables X and Y , variability in X will be partially due to the error in X , and thus not solely explainable by Y . Moreover, the correlation coefficient is also sensitive to the range of observations, which we will discuss later in this paper.

An assumption of the Pearson correlation coefficient is that the joint distribution of the variables is normal. However, it has been shown that the correlation coefficient is quite robust with regard to this assumption, meaning that Pearson’s correlation coefficient may still be validly estimated in skewed distributions [ 3 ]. If desired, a non-parametric method is also available to estimate correlation; namely, the Spearman’s rank correlation coefficient. Instead of the actual values of observations, the Spearman’s correlation coefficient uses the rank of the observations when ordering observations from small to large, hence the ‘rank’ in its name [ 4 ]. This usage of the rank makes it robust against outliers [ 4 ].
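
A small simulation (not part of the original article) illustrates that robustness: adding a single extreme outlier to otherwise linear data pulls Pearson's r down sharply while Spearman's rho barely moves.

```python
# Minimal sketch: effect of a single outlier on Pearson vs. Spearman.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2 * x + rng.normal(0, 1, size=30)          # roughly linear data

print("without outlier:",
      round(stats.pearsonr(x, y)[0], 2), round(stats.spearmanr(x, y)[0], 2))

x_out = np.append(x, 5)
y_out = np.append(y, 100)                      # one extreme observation

print("with outlier:   ",
      round(stats.pearsonr(x_out, y_out)[0], 2), round(stats.spearmanr(x_out, y_out)[0], 2))
```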

Explained variance and interpretation

One may also translate the correlation coefficient into a measure of the explained variance (also known as R²) by taking its square. The result can be interpreted as the proportion of statistical variability (i.e. variance) in one variable that can be explained by the other variable. In other words, to what degree can variable X be explained by Y and vice versa. For instance, as mentioned above, a correlation of −1 or +1 would both allow us to determine X from Y and vice versa without error, which is also shown in the coefficient of determination, which would be (−1)² or 1² = 1, indicating that 100% of variability in one variable can be explained by the other variable.

In some cases, the interpretation of the strength of the correlation coefficient is based on rules of thumb, as is often the case with P-values (P-value <0.05 is statistically significant, P-value >0.05 is not statistically significant). However, such rules of thumb should not be used for correlations. Instead, the interpretation should always depend on context and purposes [ 5 ]. For instance, when studying the association of renin–angiotensin–system inhibitors (RASi) with blood pressure, patients with increased blood pressure may receive the perfect dosage of RASi until their blood pressure is exactly normal. Those with an already exactly normal blood pressure will not receive RASi. However, as the perfect dosage of RASi makes the blood pressure of the RASi users exactly normal, and thus equal to the blood pressure of the RASi non-users, no variation is left between users and non-users. Because of this, the correlation will be 0.

An important limitation of the correlation coefficient is that it assumes a linear association. This also means that any linear transformation and any scale transformation of either variable X or Y , or both, will not affect the correlation coefficient. However, variables X and Y may also have a non-linear association, which could still yield a low correlation coefficient, as seen in Figure 1D and E , even though variables X and Y are clearly related. Nonetheless, the correlation coefficient will not always return 0 in case of a non-linear association, as portrayed in Figure 1F with an exponential correlation with r  = 0.5. In short, a correlation coefficient is not a measure of the best-fitted line through the observations, but only the degree to which the observations lie on one straight line.

In general, before calculating a correlation coefficient, it is advised to inspect a scatterplot of the observations in order to assess whether the data could possibly be described with a linear association and whether calculating a correlation coefficient makes sense. For instance, the scatterplot in Figure 1C could plausibly fit a straight line, and a correlation coefficient would therefore be suitable to describe the association in the data.

An important pitfall of the correlation coefficient is that it is influenced by the range of observations. In Figure 2A , we illustrate hypothetical data with 50 observations, with r  = 0.87. Included in the figure is an ellipse that shows the variance of the full observed data, and an ellipse that shows the variance of only the 25 lowest observations. If we subsequently analyse these 25 observations independently as shown in Figure 2B , we will see that the ellipse has shortened. If we determine the correlation coefficient for Figure 2B , we will also find a substantially lower correlation: r  = 0.57.

Figure 2: The effect of the range of observations on the correlation coefficient, as shown with ellipses. (A) Set of 50 observations from hypothetical dataset X with r = 0.87, with an illustrative ellipse showing length and width of the whole dataset, and an ellipse showing only the first 25 observations. (B) Set of only the 25 lowest observations from hypothetical dataset X with r = 0.57, with an illustrative ellipse showing length and width.

The importance of the range of observations can further be illustrated using an example from a paper by Pierrat et al. [ 6 ] in which the correlation between the eGFR calculated using inulin clearance and eGFR calculated using the Cockcroft–Gault formula was studied both in adults and children. Children had a higher correlation coefficient than adults (r = 0.81 versus r = 0.67), after which the authors mentioned: ‘The coefficients of correlation were even better […] in children than in adults.’ However, the range of observations in children was larger than the range of observations in adults, which in itself could explain the higher correlation coefficient observed in children. One can thus not simply conclude that the Cockcroft–Gault formula for eGFR correlates better with inulin in children than in adults. Because the range of observations influences the correlation coefficient, it is important to realize that correlation coefficients cannot be readily compared between groups or studies. Another consequence of this is that researchers could inflate the correlation coefficient by including additional low and high eGFR values.
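
The range-restriction effect is easy to reproduce with simulated data (a sketch under assumed parameters, not the paper's dataset): computing r on only the lower half of the x range yields a noticeably smaller coefficient than computing it on the full range.

```python
# Minimal sketch: restricting the range of observations lowers the correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.6, size=500)    # population correlation around 0.8

r_full, _ = stats.pearsonr(x, y)

mask = x < np.median(x)                           # keep only the lower half of x
r_restricted, _ = stats.pearsonr(x[mask], y[mask])

print(f"full range:       r = {r_full:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")   # noticeably smaller
```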

Another important pitfall of the correlation coefficient is that it cannot be interpreted as causal. It is of course possible that there is a causal effect of one variable on the other, but there may also be other possible explanations that the correlation coefficient does not take into account. Take for example the phenomenon of confounding. We can study the association of prescribing angiotensin-converting enzyme (ACE)-inhibitors with a decline in kidney function. These two variables would be highly correlated, which may be due to the underlying factor albuminuria. A patient with albuminuria is more likely to receive ACE-inhibitors, but is also more likely to have a decline in kidney function. So ACE-inhibitors and a decline in kidney function are correlated not because of ACE-inhibitors causing a decline in kidney function, but because they have a shared underlying cause (also known as common cause) [ 7 ]. More reasons why associations may be biased exist, which are explained elsewhere [ 8 , 9 ].

It is however possible to adjust for such confounding effects, for example by using multivariable regression. Whereas a univariable (or ‘crude’) linear regression analysis is no different than calculating the correlation coefficient, a multivariable regression analysis allows one to adjust for possible confounder variables. Other factors need to be taken into account to estimate causal effects, but these are beyond the scope of this paper.

We have discussed the correlation coefficient and its limitations when studying the association between two variables. However, the correlation coefficient is also often incorrectly used to study the agreement between two methods that aim to estimate the same variable. Again, also here, the correlation coefficient is an invalid measure.

The correlation coefficient aims to represent to what degree a straight line fits the data. This is not the same as agreement between methods (i.e. whether X  =  Y ). If methods completely agree, all observations would fall on the line of equality (i.e. the line on which the observations would be situated if X and Y had equal values). Yet the correlation coefficient looks at the best-fitted straight line through the data, which is not per se the line of equality. As a result, any method that would consistently measure a twice as large value as the other method would still correlate perfectly with the other method. This is shown in Figure 3 , where the dashed line shows the line of equality, and the other lines portray different linear associations, all with perfect correlation, but no agreement between X and Y . These linear associations may portray a systematic difference, better known as bias, in one of the methods.

Figure 3: A set of linear associations, with the dashed line (- - -) showing the line of equality where X = Y. The equations and correlations for the other lines are shown as well, which shows that only a linear association is needed for r = 1, and not specifically agreement.

This limitation applies to all comparisons of methods, where it is studied whether methods can be used interchangeably, and it also applies to situations where two individuals measure a value and where the results are then compared (inter-observer variation or agreement; here the individuals can be seen as the ‘methods’), and to situations where it is studied whether one method measures consistently at two different time points (also known as repeatability). Fortunately, other methods exist to compare methods [ 10 , 11 ], of which one was proposed by Bland and Altman themselves [ 12 ].
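
Before turning to valid alternatives, a numeric sketch (hypothetical values, not from the paper) of the Figure 3 point: a method that consistently reads twice as high as another correlates perfectly, yet the two methods clearly do not agree.

```python
# Minimal sketch: perfect correlation without agreement.
import numpy as np
from scipy import stats

method_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
method_b = 2 * method_a                      # systematic bias: always twice as large

r, _ = stats.pearsonr(method_a, method_b)
print("r =", round(r, 2))                    # 1.0 despite the bias
print("mean difference:", (method_b - method_a).mean())  # far from 0, so no agreement
```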

Intraclass coefficient

One valid method to assess interchangeability is the intraclass coefficient (ICC), which is a generalization of Cohen’s κ , a measure for the assessment of intra- and interobserver agreement. The ICC shows the proportion of the variability in the new method that is due to the normal variability between individuals. The measure takes into account both the correlation and the systematic difference (i.e. bias), which makes it a measure of both the consistency and agreement of two methods. Nonetheless, like the correlation coefficient, it is influenced by the range of observations. However, an important advantage of the ICC is that it allows comparison between multiple variables or observers. Similar to the ICC is the concordance correlation coefficient (CCC), though it has been stated that the CCC yields values similar to the ICC [ 13 ]. Nonetheless, the CCC may also be found in the literature [ 14 ].
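
As an illustration only (not the paper's calculation), the sketch below computes one common form of the ICC, the one-way random-effects, single-measures version often written ICC(1,1) = (MSB − MSW) / (MSB + (k − 1) × MSW); dedicated statistics packages provide the other ICC variants.

```python
# Minimal sketch: one-way random-effects, single-measures ICC (often called ICC(1,1)).
import numpy as np

def icc_1_1(ratings: np.ndarray) -> float:
    """ratings: array of shape (n_subjects, k_methods)."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    subject_means = ratings.mean(axis=1)

    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)      # MSB
    ms_within = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))  # MSW
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Two hypothetical methods measuring the same five subjects
data = np.array([[10.0, 11.0], [12.0, 12.5], [14.0, 13.5], [16.0, 17.0], [18.0, 18.5]])
print(round(icc_1_1(data), 2))
```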

The 95% limits of agreement and the Bland–Altman plot

When they published their critique on the use of the correlation coefficient for the measurement of agreement, Bland and Altman also published an alternative method to measure agreement, which they called the limits of agreement (also referred to as a Bland–Altman plot) [ 12 ]. To illustrate the method of the limits of agreement, an artificial dataset was created using the MASS package (version 7.3-53) for R version 4.0.4 (R Foundation for Statistical Computing, Vienna, Austria). Two sets of observations (two observations per person) were derived from a normal distribution with a mean (µ) of 120 and a randomly chosen standard deviation (σ) between 5 and 15. The mean of 120 was chosen with the aim to have the values resemble measurements of high eGFR, where the first set of observed eGFRs was hypothetically acquired using the MDRD formula, and the second set of observed eGFRs was hypothetically acquired using the CKD-EPI formula. The observations can be found in Table 1.

Table 1: Artificial data portraying hypothetically observed MDRD measurements and CKD-EPI measurements

Participant ID | eGFR with MDRD (mL/min/1.73 m²) | eGFR with CKD-EPI (mL/min/1.73 m²) | Difference (CKD-EPI − MDRD)
1 | 119.1 | 118.4 | −0.7
2 | 123.7 | 121.6 | −2.1
3 | 123.5 | 117.6 | −5.9
4 | 121.1 | 118.1 | −3.0
5 | 115.7 | 119.4 | 3.7
6 | 117.4 | 120.5 | 3.1
7 | 119.2 | 120.8 | 1.6
8 | 120.0 | 119.4 | −0.6
9 | 126.7 | 118.0 | −8.7
10 | 122.1 | 123.1 | 1.0
11 | 117.8 | 120.9 | 3.1
12 | 116.8 | 118.8 | 2.0
13 | 119.2 | 121.7 | 2.5
14 | 119.2 | 117.8 | −1.4
15 | 118.9 | 118.8 | −0.1
16 | 120.7 | 115.8 | −4.9
17 | 117.5 | 124.1 | 6.6
18 | 121.2 | 122.1 | 0.9
19 | 116.6 | 125.4 | 8.8
20 | 119.4 | 120.0 | 0.6

Mean of the differences: 0.32

SD of the differences: 4.09


The 95% limits of agreement can be easily calculated using the mean of the differences (d̄) and the standard deviation (SD) of the differences. The upper limit (UL) is UL = d̄ + 1.96 × SD and the lower limit (LL) is LL = d̄ − 1.96 × SD. If we apply this to the data from Table 1, we find d̄ = 0.32 and SD = 4.09. Subsequently, UL = 0.32 + 1.96 × 4.09 = 8.34 and LL = 0.32 − 1.96 × 4.09 = −7.70. Our limits of agreement are thus −7.70 to 8.34. We can now decide whether these limits of agreement are too broad. Imagine we decide that, in order to replace the MDRD formula with the CKD-EPI formula, the difference between the two may not be larger than 7 mL/min/1.73 m². Because our limits of agreement extend beyond ±7, the MDRD and CKD-EPI formulas cannot be used interchangeably in our case, on the basis of these (hypothetical) data. It should also be noted that, as the limits of agreement are statistical parameters, they are also subject to uncertainty. The uncertainty can be determined by calculating 95% confidence intervals for the limits of agreement, on which Bland and Altman elaborate in their paper [ 12 ].
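
The same arithmetic can be reproduced directly from the differences in Table 1; the sketch below (not the paper's code) recovers the mean, SD and limits reported above.

```python
# Minimal sketch: 95% limits of agreement from the Table 1 differences.
import numpy as np

diff = np.array([-0.7, -2.1, -5.9, -3.0, 3.7, 3.1, 1.6, -0.6, -8.7, 1.0,
                 3.1, 2.0, 2.5, -1.4, -0.1, -4.9, 6.6, 0.9, 8.8, 0.6])

mean_diff = diff.mean()
sd_diff = diff.std(ddof=1)                    # sample standard deviation

lower = mean_diff - 1.96 * sd_diff
upper = mean_diff + 1.96 * sd_diff
print(f"mean = {mean_diff:.2f}, SD = {sd_diff:.2f}")             # ~0.32 and ~4.09
print(f"95% limits of agreement: {lower:.2f} to {upper:.2f}")    # ~-7.70 to ~8.34
```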

The limits of agreement are also subject to two assumptions: (i) the mean and SD of the differences should be constant over the range of observations and (ii) the differences are approximately normally distributed. To check these assumptions, two plots were proposed: the Bland–Altman plot, which is the differences plotted against the means of their measurements, and a histogram of the differences. If in the Bland–Altman plot the means and SDs of the differences appear to be equal along the x -axis, the first assumption is met. The histogram of the differences should follow the pattern of a normal distribution. We checked these assumptions by creating a Bland–Altman plot in Figure 4A and a histogram of the differences in Figure 4B . As often done, we also added the limits of agreement to the Bland–Altman plot, between which approximately 95% of datapoints are expected to be. In Figure 4A , we see that the mean of the differences appears to be equal along the x -axis; i.e., these datapoints could plausibly fit the horizontal line of the total mean across the whole x -axis. Nonetheless, the SD does not appear to be distributed equally: the means of the differences at the lower values of the x -axis are closer to the total mean (thus a lower SD) than the means of the differences at the middle values of the x -axis (thus a higher SD). Therefore, the first assumption is not met. Nonetheless, the second assumption is met, because our differences follow a normal distribution, as shown in Figure 4B . Our failure to meet the first assumption can be due to a number of reasons, for which Bland and Altman also proposed solutions [ 15 ]. For example, data may be skewed. However, in that case, log-transforming variables may be a solution [ 16 ].

Figure 4: Plots to check assumptions for the limits of agreement. (A) The Bland–Altman plot for the assumption that the mean and SD of the differences are constant over the range of observations. In our case, the mean of the differences appears to be equal along the x-axis, but the SD does not appear to be distributed equally: the differences at the lower values of the x-axis lie closer to the total mean (thus a lower SD) than the differences at the middle values of the x-axis (thus a higher SD). Therefore, the first assumption is not met. The limits of agreement and the mean are added as dashed (- - -) lines. (B) A histogram of the distribution of differences to check whether the differences are normally distributed. In our case, the observations follow a normal distribution and thus the assumption is met.
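
For readers who want to reproduce the figure, a minimal Python sketch (not the paper's R code) of a Bland–Altman plot for the Table 1 data looks like this:

```python
# Minimal sketch: Bland-Altman plot of the Table 1 data with limits of agreement.
import numpy as np
import matplotlib.pyplot as plt

mdrd = np.array([119.1, 123.7, 123.5, 121.1, 115.7, 117.4, 119.2, 120.0, 126.7, 122.1,
                 117.8, 116.8, 119.2, 119.2, 118.9, 120.7, 117.5, 121.2, 116.6, 119.4])
ckd_epi = np.array([118.4, 121.6, 117.6, 118.1, 119.4, 120.5, 120.8, 119.4, 118.0, 123.1,
                    120.9, 118.8, 121.7, 117.8, 118.8, 115.8, 124.1, 122.1, 125.4, 120.0])

means = (mdrd + ckd_epi) / 2
diffs = ckd_epi - mdrd
mean_diff, sd_diff = diffs.mean(), diffs.std(ddof=1)

plt.scatter(means, diffs)
for level in (mean_diff, mean_diff + 1.96 * sd_diff, mean_diff - 1.96 * sd_diff):
    plt.axhline(level, linestyle="--")        # mean difference and 95% limits
plt.xlabel("Mean of MDRD and CKD-EPI (mL/min/1.73 m²)")
plt.ylabel("Difference (CKD-EPI − MDRD)")
plt.show()
```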

It is often mistakenly thought that the Bland–Altman plot alone is the analysis to determine the agreement between methods, but the authors themselves spoke strongly against this [ 15 ]. We suggest that authors should both report the limits of agreement and show the Bland–Altman plot, to allow readers to assess for themselves whether they think the agreement is met.

The correlation coefficient is easy to calculate and provides a measure of the strength of linear association in the data. However, it also has important limitations and pitfalls, both when studying the association between two variables and when studying agreement between methods. These limitations and pitfalls should be taken into account when using and interpreting it. If necessary, researchers should look into alternatives to the correlation coefficient, such as regression analysis for causal research, and the ICC and the limits of agreement combined with a Bland–Altman plot when comparing methods.

Conflict of interest statement: none declared.

References

1. Pearson K, Henrici OMFE. VII. Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philos Trans R Soc Lond Ser A 1896; 187: 253–318.

2. Stanton JM. Galton, Pearson, and the peas: a brief history of linear regression for statistics instructors. J Statist Educ 2001; 9: doi:10.1080/10691898.2001.11910537.

3. Havlicek LL, Peterson NL. Effect of the violation of assumptions upon significance levels of the Pearson r. Psychol Bull 1977; 84: 373–377.

4. Schober P, Boer C, Schwarte LA. Correlation coefficients: appropriate use and interpretation. Anesth Analg 2018; 126: 1763–1768.

5. Kozak M. What is strong correlation? Teach Statist 2009; 31: 85–86.

6. Pierrat A, Gravier E, Saunders C et al. Predicting GFR in children and adults: a comparison of the Cockcroft–Gault, Schwartz, and modification of diet in renal disease formulas. Kidney Int 2003; 64: 1425–1436.

7. Fu EL, van Diepen M, Xu Y et al. Pharmacoepidemiology for nephrologists (part 2): potential biases and how to overcome them. Clin Kidney J 2021; 14: 1317–1326.

8. Jager KJ, Tripepi G, Chesnaye NC et al. Where to look for the most frequent biases? Nephrology (Carlton) 2020; 25: 435–441.

9. Suttorp MM, Siegerink B, Jager KJ et al. Graphical presentation of confounding in directed acyclic graphs. Nephrol Dial Transplant 2015; 30: 1418–1423.

10. van Stralen KJ, Dekker FW, Zoccali C et al. Measuring agreement, more complicated than it seems. Nephron Clin Pract 2012; 120: c162–c167.

11. van Stralen KJ, Jager KJ, Zoccali C et al. Agreement between methods. Kidney Int 2008; 74: 1116–1120.

12. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; 1: 307–310.

13. Carol AA, Note O. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1997; 53: 1503–1507.

14. Pecchini P, Malberti F, Mieth M et al. Measuring asymmetric dimethylarginine (ADMA) in CKD: a comparison between enzyme-linked immunosorbent assay and liquid chromatography-electrospray tandem mass spectrometry. J Nephrol 2012; 25: 1016–1022.

15. Bland JM, Altman DG. Applying the right statistics: analyses of measurement studies. Ultrasound Obstet Gynecol 2003; 22: 85–93.

16. Euser AM, Dekker FW, Le Cessie S. A practical approach to Bland–Altman plots and variation coefficients for log transformed variables. J Clin Epidemiol 2008; 61: 978–982.



Correlation vs. Causation | Difference, Designs & Examples

Published on July 12, 2021 by Pritha Bhandari . Revised on June 22, 2023.

Correlation means there is a statistical association between variables. Causation means that a change in one variable causes a change in another variable.

In research, you might have come across the phrase “correlation doesn’t imply causation.” Correlation and causation are two related ideas, but understanding their differences will help you critically evaluate sources and interpret scientific research.

Table of contents

  • What's the difference?
  • Why doesn't correlation mean causation?
  • Correlational research
  • Third variable problem
  • Regression to the mean
  • Spurious correlations
  • Directionality problem
  • Causal research
  • Other interesting articles
  • Frequently asked questions about correlation and causation

Correlation describes an association between types of variables : when one variable changes, so does the other. A correlation is a statistical indicator of the relationship between variables. These variables change together: they covary. But this covariation isn’t necessarily due to a direct or indirect causal link.

Causation means that changes in one variable bring about changes in the other; there is a cause-and-effect relationship between variables. The two variables are correlated with each other, and there is also a causal link between them.


There are two main reasons why correlation isn’t causation. These problems are important to identify for drawing sound scientific conclusions from research.

The third variable problem means that a confounding variable affects both variables to make them seem causally related when they are not. For example, ice cream sales and violent crime rates are closely correlated, but they are not causally linked with each other. Instead, hot temperatures, a third variable, affect both variables separately. Failing to account for third variables can allow research biases to creep into your work.

The directionality problem occurs when two variables correlate and might actually have a causal relationship, but it’s impossible to conclude which variable causes changes in the other. For example, vitamin D levels are correlated with depression, but it’s not clear whether low vitamin D causes depression, or whether depression causes reduced vitamin D intake.

You’ll need to use an appropriate research design to distinguish between correlational and causal relationships:

  • Correlational research designs can only demonstrate correlational links between variables.
  • Experimental designs can test causation.

In a correlational research design, you collect data on your variables without manipulating them.

Correlational research is usually high in external validity , so you can generalize your findings to real life settings. But these studies are low in internal validity , which makes it difficult to causally connect changes in one variable to changes in the other.

These research designs are commonly used when it’s unethical, too costly, or too difficult to perform controlled experiments. They are also used to study relationships that aren’t expected to be causal.

Without controlled experiments, it’s hard to say whether it was the variable you’re interested in that caused changes in another variable. Extraneous variables are any third variable or omitted variable other than your variables of interest that could affect your results.

Limited control in correlational research means that extraneous or confounding variables serve as alternative explanations for the results. Confounding variables can make it seem as though a correlational relationship is causal when it isn’t.

When two variables are correlated, all you can say is that changes in one variable occur alongside changes in the other.


Regression to the mean is observed when variables that are extremely higher or extremely lower than average on the first measurement move closer to the average on the second measurement. Particularly in research that intentionally focuses on the most extreme cases or events, RTM should always be considered as a possible cause of an observed change.

For example, players or teams featured on the cover of Sports Illustrated (SI) have earned their place by performing exceptionally well. But athletic success is a mix of skill and luck, and even the best players don't always win.

Chances are that good luck will not continue indefinitely, and neither can exceptional success.

A spurious correlation is when two variables appear to be related through hidden third variables or simply by coincidence.

The satirical "Theory of the Stork" draws a simple causal link between two correlated variables (stork populations and human birth rates) to argue that storks physically deliver babies. This tongue-in-cheek study shows why you can't conclude causation from correlational research alone.

When you analyze correlations in a large dataset with many variables, the chances of finding at least one statistically significant result are high. In this case, you’re more likely to make a type I error . This means erroneously concluding there is a true correlation between variables in the population based on skewed sample data.
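
A small simulation (illustrative assumption: 20 mutually independent variables) shows how easily such spurious "significant" correlations appear:

```python
# Minimal sketch: Type I errors when screening many pairwise correlations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 20))            # 20 unrelated random variables

false_positives, pairs = 0, 0
for i in range(20):
    for j in range(i + 1, 20):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        pairs += 1
        false_positives += p < 0.05

print(f"{false_positives} of {pairs} pairs are 'significant' at p < 0.05 "
      "even though no variable is truly related to any other")
```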

To demonstrate causation, you need to show a directional relationship with no alternative explanations. This relationship can be unidirectional, with one variable impacting the other, or bidirectional, where both variables impact each other.

A correlational design won’t be able to distinguish between any of these possibilities, but an experimental design can test each possible direction, one at a time.

  • Physical activity may affect self esteem
  • Self esteem may affect physical activity
  • Physical activity and self esteem may both affect each other

In correlational research, the directionality of a relationship is unclear because there is limited researcher control. You might risk concluding reverse causality, the wrong direction of the relationship.

Causal links between variables can only be truly demonstrated with controlled experiments . Experiments test formal predictions, called hypotheses , to establish causality in one direction at a time.

Experiments are high in internal validity , so cause-and-effect relationships can be demonstrated with reasonable confidence.

You can establish directionality in one direction because you manipulate an independent variable before measuring the change in a dependent variable.

In a controlled experiment, you can also eliminate the influence of third variables by using random assignment and control groups.

Random assignment helps distribute participant characteristics evenly between groups so that they’re similar and comparable. A control group lets you compare the experimental manipulation to a similar treatment or no treatment (or a placebo, to control for the placebo effect ).

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis
  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

A correlation reflects the strength and/or direction of the association between two or more variables.

  • A positive correlation means that both variables change in the same direction.
  • A negative correlation means that the variables change in opposite directions.
  • A zero correlation means there’s no relationship between the variables.

Correlation describes an association between variables : when one variable changes, so does the other. A correlation is a statistical indicator of the relationship between variables.

Causation means that changes in one variable bring about changes in the other (i.e., there is a cause-and-effect relationship between variables). The two variables are correlated with each other, and there's also a causal link between them.

While causation and correlation can exist simultaneously, correlation does not imply causation. In other words, correlation is simply a relationship where A relates to B—but A doesn’t necessarily cause B to happen (or vice versa). Mistaking correlation for causation is a common error and can lead to false cause fallacy .

The third variable and directionality problems are two main reasons why correlation isn’t causation .

The third variable problem means that a confounding variable affects both variables to make them seem causally related when they are not.

The directionality problem is when two variables correlate and might actually have a causal relationship, but it’s impossible to conclude which variable causes changes in the other.

Controlled experiments establish causality, whereas correlational studies only show associations between variables.

  • In an experimental design , you manipulate an independent variable and measure its effect on a dependent variable. Other variables are controlled so they can’t impact the results.
  • In a correlational design , you measure variables without manipulating any of them. You can test whether your variables change together, but you can’t be sure that one variable caused a change in another.

In general, correlational research is high in external validity while experimental research is high in internal validity .

Cite this Scribbr article


Bhandari, P. (2023, June 22). Correlation vs. Causation | Difference, Designs & Examples. Scribbr. Retrieved September 9, 2024, from https://www.scribbr.com/methodology/correlation-vs-causation/


