
Data Analysis in Research: Types & Methods


Content Index

  • Why analyze data in research?
  • Types of data in research
  • Finding patterns in the qualitative data
  • Methods used for data analysis in qualitative research
  • Preparing data for analysis
  • Methods used for data analysis in quantitative research
  • Considerations in research data analysis
  • What is data analysis in research?

Definition of data analysis in research: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller fragments that make sense.

Three essential things occur during the data analysis process. The first is data organization. The second is data reduction through summarization and categorization, which helps find patterns and themes in the data for easy identification and linking. The third is data analysis itself, which researchers carry out in both top-down and bottom-up fashion.


On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning.

We can say that data analysis and interpretation is a process representing the application of deductive and inductive logic to the research data.

Researchers rely heavily on data because they have a story to tell or research problems to solve. Analysis starts with a question, and data is nothing but an answer to that question. But what if there is no question to ask? It is still possible to explore data without a problem in mind – we call this ‘data mining’, and it often reveals interesting patterns within the data that are worth exploring.

Regardless of the type of data researchers explore, their mission and their audience’s vision guide them toward the patterns that shape the story they want to tell. One essential expectation of researchers while analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Remember, sometimes data analysis tells the most unforeseen yet exciting stories that no one anticipated when the analysis began. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Types of data in research

Every kind of data describes something once a specific value is assigned to it. For analysis, these values must be organized, processed, and presented in a given context to be useful. Data can come in different forms; here are the primary data types.

  • Qualitative data: When the data presented consists of words and descriptions, we call it qualitative data. Although you can observe this data, it is subjective and harder to analyze in research, especially for comparison. Example: anything describing taste, experience, texture, or an opinion is considered qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews, qualitative observation, or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data. This type of data can be distinguished into categories, grouped, measured, calculated, or ranked. Example: questions about age, rank, cost, length, weight, scores, etc. all yield this type of data. You can present such data in graphs and charts or apply statistical analysis methods to it. Outcomes Measurement Systems (OMS) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: This is data presented in groups; an item included in the categorical data cannot belong to more than one group. Example: a person describing their living situation, marital status, smoking habit, or drinking habit provides categorical data. A chi-square test is a standard method used to analyze this data (see the sketch after this list).
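To make the chi-square idea concrete, here is a minimal sketch in Python, assuming SciPy is available; the survey counts and category labels are hypothetical, not taken from any real study.

```python
# Minimal sketch: chi-square test of independence on hypothetical survey counts.
# Rows: smoking habit (smoker / non-smoker); columns: marital status (single / married).
from scipy.stats import chi2_contingency

observed = [
    [25, 35],   # smokers: single, married
    [40, 100],  # non-smokers: single, married
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")
# A small p-value suggests the two categorical variables are not independent.
```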


Data analysis in qualitative research

Data analysis in qualitative research works a little differently from the analysis of numerical data, because qualitative data is made up of words, descriptions, images, objects, and sometimes symbols. Getting insight from such complex information is an involved process; hence, it is typically used for exploratory research and data analysis.

Although there are several ways to find patterns in textual information, a word-based method is the most widely relied-upon technique for research and data analysis. Notably, the data analysis process in qualitative research is largely manual: researchers usually read the available data and look for repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find  “food”  and  “hunger” are the most commonly used words and will highlight them for further analysis.
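As an illustration of this word-based approach, the following minimal Python sketch counts the most frequent words across a handful of hypothetical open-ended responses; a real analysis would also handle stemming, a fuller stop-word list, and multilingual text.

```python
# Minimal sketch: counting the most frequent words in hypothetical open-ended responses.
from collections import Counter
import re

responses = [
    "Food prices keep rising and hunger is getting worse",
    "Hunger and access to clean water are the biggest problems",
    "We worry about food security for our children",
]

stop_words = {"and", "is", "are", "the", "to", "for", "our", "about", "we"}
words = []
for text in responses:
    words.extend(w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words)

print(Counter(words).most_common(5))  # e.g., [('food', 2), ('hunger', 2), ...]
```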


The keyword context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which the participants use a particular keyword.  

For example , researchers conducting research and data analysis for studying the concept of ‘diabetes’ amongst respondents might analyze the context of when and how the respondent has used or referred to the word ‘diabetes.’
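A keyword-in-context pass can be sketched in a few lines of Python; the interview excerpt and the four-word window below are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: keyword-in-context (KWIC) extraction for the word "diabetes".
def keyword_in_context(text, keyword, window=4):
    tokens = text.lower().split()
    for i, token in enumerate(tokens):
        if keyword in token:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield f"... {left} [{token}] {right} ..."

excerpt = ("My mother was diagnosed with diabetes last year, "
           "so I started checking my sugar because diabetes runs in the family.")

for line in keyword_in_context(excerpt, "diabetes"):
    print(line)
```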

The scrutiny-based technique is another highly recommended text analysis method used to identify patterns in qualitative data. Compare and contrast is the most widely used method under this technique, differentiating how one piece of text is similar to or different from another.

For example, to find out the “importance of a resident doctor in a company,” the collected data is divided into people who think it is necessary to hire a resident doctor and those who think it is unnecessary. Compare and contrast is the best method for analyzing polls with single-answer question types.

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable Partitioning is another technique used to split variables so that researchers can find more coherent descriptions and explanations in an enormous data set.


There are several techniques to analyze data in qualitative research, but here are some commonly used methods:

  • Content Analysis: This is the most widely accepted and frequently employed technique for data analysis in research methodology. It can be used to analyze documented information from text, images, and sometimes physical items. When and where to use this method depends on the research questions.
  • Narrative Analysis: This method is used to analyze content gathered from various sources such as personal interviews, field observation, and surveys. Most of the time, the stories or opinions shared by people are examined to find answers to the research questions.
  • Discourse Analysis: Similar to narrative analysis, discourse analysis is used to analyze interactions with people. However, this particular method considers the social context within which the communication between the researcher and respondent takes place. Discourse analysis also takes lifestyle and day-to-day environment into account while deriving any conclusion.
  • Grounded Theory: When you want to explain why a particular phenomenon happened, grounded theory is the best resort for analyzing qualitative data. Grounded theory is applied to study data about a host of similar cases occurring in different settings. When using this method, researchers may alter explanations or produce new ones until they arrive at a conclusion.


Data analysis in quantitative research

The first stage in research and data analysis is to prepare the data for analysis so that nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to understand whether the collected data sample meets the pre-set standards or is a biased data sample. It is divided into four stages:

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent answered all the questions in an online survey, or that the interviewer asked all the questions devised in the questionnaire.

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in fields incorrectly or skip them accidentally. Data editing is a process wherein researchers confirm that the provided data is free of such errors. They need to conduct the necessary checks, including outlier checks, to edit the raw data and make it ready for analysis.

Phase III: Data Coding

Out of all three, this is the most critical phase of data preparation, associated with grouping and assigning values to the survey responses. If a survey is completed with a sample size of 1,000, the researcher might create age brackets to distinguish respondents based on their age (see the sketch below). It then becomes easier to analyze small data buckets rather than deal with the massive data pile.
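A minimal sketch of this kind of data coding, assuming pandas is available and using randomly generated ages with arbitrary bracket boundaries, might look like this:

```python
# Minimal sketch: coding a hypothetical 1,000-respondent sample into age brackets.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
survey = pd.DataFrame({"age": rng.integers(18, 80, size=1000)})

bins = [17, 25, 35, 50, 65, 120]                      # bracket edges (illustrative)
labels = ["18-25", "26-35", "36-50", "51-65", "65+"]
survey["age_bracket"] = pd.cut(survey["age"], bins=bins, labels=labels)

print(survey["age_bracket"].value_counts().sort_index())
```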


After the data is prepared for analysis, researchers can use different research and data analysis methods to derive meaningful insights. Statistical analysis is the most favored approach for numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. The methods fall into two groups: descriptive statistics, used to describe data, and inferential statistics, which help in comparing the data and generalizing beyond it.

Descriptive statistics

This method is used to describe the basic features of versatile types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. Nevertheless, descriptive analysis does not allow conclusions beyond the data at hand; any conclusions remain tied to the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods (a short sketch follows these lists).

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.

Measures of Central Tendency

  • Mean, Median, Mode
  • These measures are widely used to demonstrate the central point around which a distribution is located.
  • Researchers use this method when they want to showcase the most commonly or averagely indicated response.

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • The range is the difference between the highest and lowest observed values.
  • Variance and standard deviation capture how far observed scores fall from the mean.
  • These measures are used to identify the spread of scores by stating intervals.
  • Researchers use this method to show how spread out the data is and how strongly that spread affects the mean.

Measures of Position

  • Percentile ranks, Quartile ranks
  • These measures rely on standardized scores, helping researchers identify where a score stands relative to other scores.
  • They are often used when researchers want to compare scores against the average.
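The sketch below gathers these four families of descriptive measures in one place, using pandas on a small set of hypothetical test scores.

```python
# Minimal sketch: the four families of descriptive measures on hypothetical test scores.
import pandas as pd

scores = pd.Series([62, 71, 71, 74, 78, 81, 84, 88, 90, 95])

# Measures of frequency
print(scores.value_counts().head())              # counts per observed score

# Measures of central tendency
print(scores.mean(), scores.median(), scores.mode().tolist())

# Measures of dispersion or variation
print(scores.max() - scores.min(), scores.var(), scores.std())

# Measures of position
print(scores.quantile([0.25, 0.5, 0.75]))        # quartiles
print(scores.rank(pct=True))                     # percentile ranks
```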

For quantitative research, descriptive analysis often gives absolute numbers, but those numbers alone are never sufficient to demonstrate the rationale behind them. Nevertheless, it is necessary to think about the method of research and data analysis best suiting your survey questionnaire and the story researchers want to tell. For example, the mean is the best way to demonstrate students’ average scores in schools. It is better to rely on descriptive statistics when researchers intend to keep the research or outcome limited to the provided sample without generalizing it. For example, when you want to compare the average votes cast in two different cities, descriptive statistics are enough.

Descriptive analysis is also called a ‘univariate analysis’ since it is commonly used to analyze a single variable.

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a sample representing that population. For example, you can ask some 100 audience members at a movie theater if they like the movie they are watching. Researchers then use inferential statistics on the collected sample to reason that about 80-90% of people like the movie.
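As a rough illustration of how such a sample-to-population claim can be quantified, here is a minimal Python sketch that estimates the proportion and a 95% normal-approximation confidence interval; the sample size and the count of positive answers are hypothetical.

```python
# Minimal sketch: estimating the share of all moviegoers who like the film
# from a hypothetical sample of 100 theater-goers.
import math

n = 100        # sample size
liked = 85     # respondents who said they liked the movie

p_hat = liked / n
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)   # 95% normal approximation

print(f"Sample proportion: {p_hat:.2f}")
print(f"95% CI: [{p_hat - margin:.2f}, {p_hat + margin:.2f}]")
```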

Here are two significant areas of inferential statistics.

  • Estimating parameters: This takes statistics from the sample research data and uses them to say something about the population parameter.
  • Hypothesis test: This is about using sample research data to answer the survey research questions. For example, researchers might be interested to understand whether a newly launched shade of lipstick is well received, or whether multivitamin capsules help children perform better at games.

Inferential statistics also include sophisticated analysis methods used to showcase the relationship between different variables instead of describing a single variable. They are used when researchers want something beyond absolute numbers to understand the relationships between variables.

Here are some of the commonly used methods for data analysis in research.

  • Correlation: When researchers are not conducting experimental or quasi-experimental research but are interested in understanding the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables, cross-tabulation is used to analyze the relationship between multiple variables. Suppose the provided data has age and gender categories presented in rows and columns; a two-dimensional cross-tabulation makes data analysis and research seamless by showing the number of males and females in each age category (see the sketch after this list).
  • Regression analysis: To understand the strength of the relationship between two variables, researchers rarely look beyond the primary and commonly used regression analysis method, which is also a type of predictive analysis. In this method, you have an essential factor called the dependent variable and one or more independent variables, and you work out the impact of the independent variables on the dependent variable. The values of both independent and dependent variables are assumed to be ascertained in an error-free random manner.
  • Frequency tables: This statistical procedure summarizes how often each value or category occurs in the data; a frequency distribution is often the starting point before further tests are applied.
  • Analysis of variance (ANOVA): This statistical procedure is used to test the degree to which two or more groups vary or differ in an experiment. A considerable degree of variation means the research findings were significant. In many contexts, ANOVA testing and variance analysis are treated as synonymous.
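The sketch referenced in the cross-tabulation item above shows a contingency table and a simple one-variable regression on a tiny, hypothetical survey data frame (the column names are illustrative), assuming pandas and NumPy are available.

```python
# Minimal sketch: cross-tabulation and a simple linear regression on hypothetical survey data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "age_group": ["18-25", "18-25", "26-35", "26-35", "36-50", "36-50", "51+", "51+"],
    "satisfaction": [7, 6, 8, 7, 9, 8, 9, 9],
    "spend": [120, 90, 150, 130, 180, 160, 200, 210],
})

# Cross-tabulation: counts of respondents by age group and gender
print(pd.crosstab(df["age_group"], df["gender"]))

# Simple regression: impact of satisfaction (independent) on spend (dependent)
slope, intercept = np.polyfit(df["satisfaction"], df["spend"], deg=1)
print(f"spend ~ {slope:.1f} * satisfaction + {intercept:.1f}")
```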
Considerations in research data analysis

  • Researchers must have the necessary research skills to analyze and manipulate the data, and be trained to demonstrate a high standard of research practice. Ideally, researchers should possess more than a basic understanding of the rationale for selecting one statistical method over another to obtain better data insights.
  • Research and data analytics projects usually differ by scientific discipline; therefore, getting statistical advice at the beginning of analysis helps in designing a survey questionnaire, selecting data collection methods, and choosing samples.


  • The primary aim of research data analysis is to derive ultimate insights that are unbiased. Any mistake in, or bias while, collecting data, selecting an analysis method, or choosing an audience sample will lead to a biased inference.
  • No amount of sophistication in research data analysis can rectify poorly defined objective outcome measurements. Whether the design is at fault or the intentions are unclear, a lack of clarity can mislead readers, so avoid the practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find ways to deal with everyday challenges like outliers, missing data, data altering, data mining, and developing graphical representations.

The sheer amount of data generated daily is staggering, especially now that data analysis has taken center stage. In 2018, the total data supply amounted to 2.8 trillion gigabytes. Hence, it is clear that enterprises that wish to survive in a hypercompetitive world must possess an excellent capability to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and provides them a medium to collect data by creating appealing surveys.

What is Data Analysis?

According to the federal government, data analysis is "the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data" ( Responsible Conduct in Data Management ). Important components of data analysis include searching for patterns, remaining unbiased in drawing inference from data, practicing responsible  data management , and maintaining "honest and accurate analysis" ( Responsible Conduct in Data Management ). 

In order to understand data analysis further, it can be helpful to take a step back and ask "What is data?". Many of us associate data with spreadsheets of numbers and values; however, data can encompass much more than that. According to the federal government, data is "the recorded factual material commonly accepted in the scientific community as necessary to validate research findings" ( OMB Circular 110 ). This broad definition can include information in many formats.

Some examples of types of data are as follows:

  • Photographs 
  • Hand-written notes from field observation
  • Machine learning training data sets
  • Ethnographic interview transcripts
  • Sheet music
  • Scripts for plays and musicals 
  • Observations from laboratory experiments ( CMU Data 101 )

Thus, data analysis includes the processing and manipulation of these data sources in order to gain additional insight from data, answer a research question, or confirm a research hypothesis. 

Data analysis falls within the larger research data lifecycle (research data lifecycle figure: University of Virginia).

Why Analyze Data?

Through data analysis, a researcher can gain additional insight from data and draw conclusions to address the research question or hypothesis. Use of data analysis tools helps researchers understand and interpret data. 

What are the Types of Data Analysis?

Data analysis can be quantitative, qualitative, or mixed methods. 

Quantitative research typically involves numbers and "close-ended questions and responses" ( Creswell & Creswell, 2018 , p. 3). Quantitative research tests variables against objective theories, usually measured and collected on instruments and analyzed using statistical procedures ( Creswell & Creswell, 2018 , p. 4). Quantitative analysis usually uses deductive reasoning. 

Qualitative  research typically involves words and "open-ended questions and responses" ( Creswell & Creswell, 2018 , p. 3). According to Creswell & Creswell, "qualitative research is an approach for exploring and understanding the meaning individuals or groups ascribe to a social or human problem" ( 2018 , p. 4). Thus, qualitative analysis usually invokes inductive reasoning. 

Mixed methods  research uses methods from both quantitative and qualitative research approaches. Mixed methods research works under the "core assumption... that the integration of qualitative and quantitative data yields additional insight beyond the information provided by either the quantitative or qualitative data alone" ( Creswell & Creswell, 2018 , p. 4). 


Principles for data analysis workflows

Sara Stoudt, Váleri N. Vásquez, and Ciera C. Martinez

Contributed equally to this work with: Sara Stoudt, Váleri N. Vásquez

Affiliations: Berkeley Institute for Data Science, University of California Berkeley, Berkeley, California, United States of America; Statistical & Data Sciences Program, Smith College, Northampton, Massachusetts, United States of America; Energy and Resources Group, University of California Berkeley, Berkeley, California, United States of America; Department of Molecular and Cellular Biology, University of California Berkeley, Berkeley, California, United States of America

* E-mail: [email protected]

Published: March 18, 2021

  • https://doi.org/10.1371/journal.pcbi.1008770


A systematic and reproducible “workflow”—the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of academic data-intensive research practice. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Importantly, each phase can also give rise to a number of research products beyond traditional academic publications. Where relevant, we draw analogies between design principles and established practice in software development. The guidance provided here is not intended to be a strict rulebook; rather, the suggestions for practices and tools to advance reproducible, sound data-intensive analysis may furnish support for both students new to research and current researchers who are new to data-intensive work.

Citation: Stoudt S, Vásquez VN, Martinez CC (2021) Principles for data analysis workflows. PLoS Comput Biol 17(3): e1008770. https://doi.org/10.1371/journal.pcbi.1008770

Editor: Patricia M. Palagi, SIB Swiss Institute of Bioinformatics, SWITZERLAND

Copyright: © 2021 Stoudt et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: SS was supported by the National Physical Sciences Consortium ( https://stemfellowships.org/ ) fellowship. SS, VNV, and CCM were supported by the Gordon & Betty Moore Foundation ( https://www.moore.org/ ) (GBMF3834) and Alfred P. Sloan Foundation ( https://sloan.org/ ) (2013-10-27) as part of the Moore-Sloan Data Science Environments. CCM holds a Postdoctoral Enrichment Program Award from the Burroughs Wellcome Fund ( https://www.bwfund.org/ ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Both traditional science fields and the humanities are becoming increasingly data driven and computational. Researchers who may not identify as data scientists are working with large and complex data on a regular basis. A systematic and reproducible research workflow —the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of data-intensive research practice in any academic discipline. The importance and effective development of a workflow should, in turn, be a cornerstone of the data science education designed to prepare researchers across disciplinary specializations.

Data science education tends to review foundational statistical analysis methods [ 1 ] and furnish training in computational tools , software, and programming languages. In scientific fields, education and training includes a review of domain-specific methods and tools, but generally omits guidance on the coding practices relevant to developing new analysis software—a skill of growing relevance in data-intensive scientific fields [ 2 ]. Meanwhile, the holistic discussion of how to develop and pursue a research workflow is often left out of introductions to both data science and disciplinary science. Too frequently, students and academic practitioners of data-intensive research are left to learn these essential skills on their own and on the job. Guidance on the breadth of potential products that can emerge from research is also lacking. In the interest of both reproducible science (providing the necessary data and code to recreate the results) and effective career building, researchers should be primed to regularly generate outputs over the course of their workflow.

The goal of this paper is to deconstruct an academic data-intensive research project, demonstrating how both design principles and software development methods can motivate the creation and standardization of practices for reproducible data and code. The implementation of such practices generates research products that can be effectively communicated, in addition to constituting a scientific contribution. Here, “data-intensive” research is used interchangeably with “data science” in a recognition of the breadth of domain applications that draw upon computational analysis methods and workflows. (We define other terms we’ve bolded throughout this paper in Box 1 ). To be useful, let alone high impact, research analyses should be contextualized in the data processing decisions that led to their creation and accompanied by a narrative that explains why the rest of the world should be interested. One way of thinking about this is that the scientific method should be tangibly reflected, and feasibly reproducible, in any data-intensive research project.

Box 1. Terminology

This box provides definitions for terms in bold throughout the text. Terms are sorted alphabetically and cross referenced where applicable.

Agile: An iterative software development framework which adheres to the principles described in the Manifesto for Agile software development [ 35 ] (e.g., breaks up work into small increments).

Accessor function: A function that returns the value of a variable (synonymous term: getter function).

Assertion: An expression that is expected to be true at a particular point in the code.

Computational tool: May include libraries, packages, collections of functions, and/or data structures that have been consciously designed to facilitate the development and pursuit of data-intensive questions (synonymous term: software tool).

Continuous integration: Automatic tests that are run whenever code is updated.

Gut check: Also “data gut check.” Quick, broad, and shallow testing [ 48 ] before and during data analysis. Although this is usually described in the context of software development, the concept of a data-specific gut check can include checking the dimensions of data structures after merging or assessing null values/missing values, zero values, negative values, and ranges of values to see if they make sense (synonymous words: smoke test, sanity check [ 49 ], consistency check, sniff test, soundness check).

Data-intensive research : Research that is centrally based on the analysis of data and its structural or statistical properties. May include but is not limited to research that hinges on large volumes of data or a wide variety of data types requiring computational skills to approach such research (synonymous term: data science research). “Data science” as a stand-alone term may also refer more broadly to the use of computational tools and statistical methods to gain insights from digitized information.

Data structure: A format for storing data values and definition of operations that can be applied to data of a particular type.

Defensive programming : Strategies to guard against failures or bugs in code; this includes the use of tests and assertions.

Design thinking: The iterative process of defining a problem then identifying and prototyping potential solutions to that problem, with an emphasis on solutions that are empathetic to the particular needs of the target user.

Docstring: A code comment for a particular line of code that describes what a function does, as opposed to how the function performs that operation.

DOI: A digital object identifier or DOI is a unique handle, standardized by the International Organization for Standardization (ISO), that can be assigned to different types of information objects.

Extensibility: The flexibility to be extended or repurposed in a new scenario.

Function: A piece of more abstracted code that can be reused to perform the same operation on different inputs of the same type and has a standardized output [ 50 – 52 ].

Getter function: Another term for an accessor function.

Integrated Development Environment (IDE): A software application that facilitates software development and minimally consists of a source code editor, build automation tools, and a debugger.

Modularity: An ability to separate different functionality into stand-alone pieces.

Mutator method: A function used to control changes to variables. See “setter function” and “accessor function.”

Notebook: A computational or physical place to store details of a research process including decisions made.

Mechanistic code : Code used to perform a task as opposed to conduct an analysis. Examples include processing functions and plotting functions.

Overwrite: The process, intentional or accidental, of assigning new values to existing variables.

Package manager: A system used to automate the installation and configuration of software.

Pipeline : A series of programmatic processes during data analysis and data cleaning, usually linear in nature, that can be automated and usually be described in the context of inputs and outputs.

Premature optimization : Focusing on details before the general scheme is decided upon.

Refactoring: A change in code, such as file renaming, to make it more organized without changing the overall output or behavior.

Replicable: A new study arrives at the same scientific findings as a previous study, collecting new data (with the same or different methods) and completes new analyses [ 53 – 55 ].

Reproducible: Authors provide all the necessary data, and the computer codes to run the analysis again, recreating the results [ 53 – 55 ].

Script : A collection of code, ideally related to one particular step in the data analysis.

Setter function: A type of function that controls changes to variables. It is used to directly access and alter specific values (synonymous term: mutator method).

Serialization: The process of saving data structures, inputs and outputs, and experimental setups generally in a storable, shareable format. Serialized information can be reconstructed in different computer environments for the purpose of replicating or reproducing experiments.

Software development: A process of writing and documenting code in pursuit of an end goal, typically focused on process over analysis.

Source code editor: A program that facilitates changes to code by an author.

Technical debt: The extra work you defer by pursuing an easier, yet not ideal solution, early on in the coding process.

Test-driven development: Each change in code should be verified against tests to prove its functionality.

Unit test: A code test for the smallest chunk of code that is actually testable.

Version control: A way of managing changes to code or documentation that maintains a record of changes over time.

White paper: An informative, at least semiformal document that explains a particular issue but is not peer reviewed.

Workflow : The process that moves a scientific investigation from raw data to coherent research question to insightful contribution. This often involves a complex series of processes and includes a mixture of machine automation and human intervention. It is a nonlinear and iterative exercise.

Discussions of “workflow” in data science can take on many different meanings depending on the context. For example, the term “workflow” often gets conflated with the term “pipeline” in the context of software development and engineering. Pipelines are often described as a series of processes that can be programmatically defined and automated and explained in the context of inputs and outputs. However, in this paper, we offer an important distinction between pipelines and workflows: The former refers to what a computer does, for example, when a piece of software automatically runs a series of Bash or R scripts. For the purpose of this paper, a workflow describes what a researcher does to make advances on scientific questions: developing hypotheses, wrangling data, writing code, and interpreting results.

Data analysis workflows can culminate in a number of outcomes that are not restricted to the traditional products of software engineering (software tools and packages) or academia (research papers). Rather, the workflow that a researcher defines and iterates over the course of a data science project can lead to intellectual contributions as varied as novel data sets, new methodological approaches, or teaching materials in addition to the classical tools, packages, and papers. While the workflow should be designed to serve the researcher and their collaborators, maintaining a structured approach throughout the process will inform results that are replicable (see replicable versus reproducible in Box 1 ) and easily translated into a variety of products that furnish scientific insights for broader consumption.

In the following sections, we explain the basic principles of a constructive and productive data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Where relevant, we draw analogies to the realm of design thinking and software development . While the 3 phases described here are not intended to be a strict rulebook, we hope that the many references to additional resources—and suggestions for nontraditional research products—provide guidance and support for both students new to research and current researchers who are new to data-intensive work.

The Explore, Refine, Produce (ERP) workflow for data-intensive research

We partition the workflow of a data-intensive research process into 3 phases: Explore, Refine, and Produce. These phases, collectively the ERP workflow, are visually described in Fig 1A and 1B . In the Explore Phase, researchers “meet” their data: process it, interrogate it, and sift through potential solutions to a problem of interest. In the Refine Phase, researchers narrow their focus to a particularly promising approach, develop prototypes, and organize their code into a clearer narrative. The Produce Phase happens concurrently with the Explore and Refine Phases. In this phase, researchers prepare their work for broader consumption and critique.

Fig 1.

(A) We deconstruct a data-intensive research project into 3 phases, visualizing this process as a tree structure. Each branch in the tree represents a decision that needs to be made about the project, such as data cleaning, refining the scope of the research, or using a particular tool or model. Throughout the natural life of a project, there are many dead ends (yellow Xs). These may include choices that do not work, such as experimentation with a tool that is ultimately not compatible with our data. Dead ends can result in informal learning or procedural fine-tuning. Some dead ends that lie beyond the scope of our current project may turn into a new project later on (open turquoise circles). Throughout the Explore and Refine Phases, we are concurrently in the Produce Phase because research products (closed turquoise circles) can arise at any point throughout the workflow. Products, regardless of the phase that generates their content, contribute to scientific understanding and advance the researcher’s career goals. Thus, the data-intensive research portfolio and corresponding academic CV can be grown at any point in the workflow. (B) The ERP workflow as a nonlinear cycle. Although the tree diagram displayed in Fig 1A accurately depicts the many choices and dead ends that a research project contains, it does not as easily reflect the nonlinearity of the process; Fig 1B’s representation aims to fill this gap. We often iterate between the Explore and Refine Phases while concurrently contributing content to the Produce Phase. The time spent in each phase can vary significantly across different types of projects. For example, hypothesis generation in the Explore Phase might be the biggest hurdle in one project, while effectively communicating a result to a broader audience in the Produce Phase might be the most challenging aspect of another project.

https://doi.org/10.1371/journal.pcbi.1008770.g001

Each phase has an immediate audience—the researcher themselves, their collaborative groups, or the public—that broadens progressively and guides priorities. Each of the 3 phases can benefit from standards that the software development community uses to streamline their code-based pipelines, as well as from principles the design community uses to generate and carry out ideas; many such practices can be adapted to help structure a data-intensive researcher’s workflow. The Explore and Refine Phases provide fodder for the concurrent Produce Phase. We hope that the potential to produce a variety of research products throughout a data-intensive research process, rather than merely at the end of a project, motivates researchers to apply the ERP workflow.

Phase 1: Explore

Data-intensive research projects typically start with a domain-specific question or a particular data set to explore [ 3 ]. There is no fixed, cross-disciplinary rule that defines the point in a workflow by which a hypothesis must be established. This paper adopts an open-minded approach concerning the timing of hypothesis generation [ 4 ], assuming that data-intensive research projects can be motivated by either an explicit, preexisting hypothesis or a new data set about which no strong preconceived assumptions or intuitions exist. The often messy Explore Phase is rarely discussed as an explicit step of the methodological process, but it is an essential component of research: It allows us to gain intuition about our data, informing future phases of the workflow. As we explore our data, we refine our research question and work toward the articulation of a well-defined problem. The following section will address how to reap the benefits of data set and problem space exploration and provide pointers on how to impose structure and reproducibility during this inherently creative phase of the research workflow.

Designing data analysis: Goals and standards of the Explore Phase

Trial and error is the hallmark of the Explore Phase (note the density of “deadends” and decisions made in this phase in Fig 1A ). In “Designerly Ways of Knowing” [ 5 ], the design process is described as a “co-evolution of solution and problem spaces.” Like designers, data-intensive researchers explore the problem space, learn about the potential structure of the solution space, and iterate between the 2 spaces. Importantly, the difficulties we encounter in this phase help us build empathy for an eventual audience beyond ourselves. It is here that we experience firsthand the challenges of processing our data set, framing domain research questions appropriate to it, and structuring the beginnings of a workflow. Documenting our trial and error helps our own work stay on track in addition to assisting future researchers facing similar challenges.

One end goal of the Explore Phase is to determine whether new questions of interest might be answered by leveraging existing software tools (either off the shelf or with minor adjustments), rather than building new computational capabilities ourselves. For example, during this phase, a common activity includes surveying the software available for our data set or problem space and estimating its utility for the unique demands of our current analysis. Through exploration, we learn about relevant computational and analysis tools while concurrently building an understanding of our data.

A second important goal of the Explore Phase is data cleaning and developing a strategy to analyze our data. This is a dynamic process that often goes hand in hand with improving our understanding of the data. During the Explore Phase, we redesign and reformat data structures, identify important variables, remove redundancies, take note of missing information, and ponder outliers in our data set. Once we have established the software tools—the programming language, data analysis packages, and a handful of the useful functions therein—that are best suited to our data and domain area, we also start putting those tools to use [ 6 ]. In addition, during the Explore Phase, we perform initial tests, build a simple model, or create some basic visualizations to better grasp the contents of our data set and check for expected outputs. Our research is underway in earnest now, and this effort will help us to identify what questions we might be able to ask of our data.

The Explore Phase is often a solo endeavor; as shown in Fig 1A , our audience is typically our current or future self. This can make navigating the phase difficult, especially for new researchers. It also complicates a third goal of this phase: documentation. In this phase, we ourselves are our only audience, and if we are not conscientious documenters, we can easily end up concluding the phase without the ability to coherently describe our research process up to that point. Record keeping in the Explore Phase is often subject to our individual style of approaching problems. Some styles work in real time, subsetting or reconfiguring data as ideas occur. More methodical styles tend to systematically plan exploratory steps, recording them before taking action. These natural tendencies impact the state of our analysis code, affecting its readability and reproducibility.

However, there are strategies—inspired by analogous software development principles—that can help set us up for success in meeting the standards of reproducibility [ 7 ] relevant to a scientifically sound research workflow. These strategies impose a semblance of order on the Explore Phase. To avoid concerns of premature optimization [ 8 ] while we are iterating during this phase, documentation is the primary goal, rather than fine-tuning the code structure and style. Documentation enables the traceability of a researcher’s workflow, such that all efforts are replicable and final outcomes are reproducible.

Analogies to software development in the Explore Phase

Documentation: code and process..

Software engineers typically value formal documentation that is readable by software users. While the audience for our data analysis code may not be defined as a software user per se, documentation is still vital for workflow development. Documentation for data analysis workflows can come in many forms, including comments describing individual lines of code, README files orienting a reader within a code repository, descriptive commit history logs tracking the progress of code development, docstrings detailing function capabilities, and vignettes providing example applications. Documentation provides both a user manual for particular tools within a project (for example, data cleaning functions), and a reference log describing scientific research decisions and their rationale (for example, the reasons behind specific parameter choices).
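As one small, hypothetical illustration of documentation that records both a tool's behavior and the research decision behind it, a data-cleaning helper might carry a docstring like the following; the function name, column, and threshold are assumptions for the sketch, not part of any particular project.

```python
# Minimal sketch: a data-cleaning helper whose docstring records both what it does
# and the (hypothetical) research decision behind it.
import pandas as pd

def drop_implausible_ages(df: pd.DataFrame, max_age: int = 110) -> pd.DataFrame:
    """Remove rows whose reported age is negative or above `max_age`.

    Rationale (research decision): ages above 110 in this hypothetical survey were
    traced to data-entry errors during the Explore Phase, so we exclude them rather
    than impute. Recording the threshold here keeps the choice reproducible.
    """
    return df[(df["age"] >= 0) & (df["age"] <= max_age)].copy()
```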

In the Explore Phase, we may identify with the type of programmer described by Brant and colleagues as “opportunistic” [ 9 ]. This type of programmer finds it challenging to prioritize documenting and organizing code that they see as impermanent or a work in progress. “Opportunistic” programmers tend to build code using others’ tools, focusing on writing “glue” code that links preexisting components and iterate quickly. Hartmann and colleagues also describe this mash-up approach [ 10 ]. Rather than “opportunistic programmers,” their study focuses on “opportunistic designers.” This style of design “search[es] for bridges,” finding connections between what first appears to be different fields. Data-intensive researchers often use existing tools to answer questions of interest; we tend to build our own only when needed.

Even if the code that is used for data exploration is not developed into a software-based final research product, the exploratory process as a whole should exist as a permanent record: Future scientists should be able to rerun our analysis and work from where we left off, beginning from raw, unprocessed data. Therefore, documenting choices and decisions we make along the way is crucial to making sure we do not forget any aspect of the analysis workflow, because each choice may ultimately impact the final results. For example, if we remove some data points from our analyses, we should know which data points we removed—and our reason for removing them—and be able to communicate those choices when we start sharing our work with others. This is an important argument against ephemerally conducting our data analysis work via the command line.

Instead of the command line, tools like a computational notebook [ 11 ] can help capture a researcher’s decision-making process in real time [ 12 ]. A computational notebook where we never delete code, and—to avoid overwriting named variables—only move forward in our document, could act as “version control designed for a 10-minute scale” that Brant and colleagues found might help the “opportunistic” programmer. More recent advances in this area include the reactive notebook [ 13 – 14 ]. Such tools assist documentation while potentially enhancing our creativity during the Explore Phase. The bare minimum documentation of our Explore Phase might therefore include such a notebook or an annotated script [ 15 ] to record all analyses that we perform and code that we write.

To go a step beyond annotated scripts or notebooks, researchers might employ a version control system such as Git. With its issues, branches, and informative commit messages, Git is another useful way to maintain a record of our trial-and-error process and track which files are progressing toward which goals of the overall project. Using Git together with a public online hosting service such as GitHub allows us to share our work with collaborators and the public in real time, if we so choose.

A researcher dedicated to conducting an even more thoroughly documented Explore Phase may take Ford’s advice and include notes that explicitly document our stream of consciousness [ 16 ]. Our notes should be able to efficiently convey what failed, what worked but was uninteresting or beyond scope of the project, and what paths of inquiry we will continue forward with in more depth ( Fig 1A ). In this way, as we transition from the Explore Phase to the Refine Phase, we will have some signposts to guide our way.

Testing: Comparing expectations to output.

As Ford [ 16 ] explains, we face competing goals in the Explore Phase: We want to get results quickly, but we also want to be confident in our answers. Her strategy is to focus on documentation over tests for one-off analyses that will not form part of a larger research project. However, the complete absence of formal tests may raise a red flag for some data scientists used to the concept of test-driven development . This is a tension between the code-based work conducted in scientific research versus software development: Tests help build confidence in analysis code and convince users that it is reliable or accurate, but tests also imply finality and take time to write that we may not be willing to allocate in the experimental Explore Phase. However, software development style tests do have useful analogs in data analysis efforts: We can think of tests, in the data analysis sense, as a way of checking whether our expectations match the reality of a piece of code’s output.

Imagine we are looking at a data set for the first time. What weird things can happen? The type of variable might not be what we expect (for example, the integer 4 instead of the float 4.0). The data set could also include unexpected aspects (for example, dates formatted as strings instead of numbers). The amount of missing data may be larger than we thought, and this missingness could be coded in a variety of ways (for example, as a NaN, NULL, or −999). Finally, the dimensions of a data frame after merging or subsetting it for data cleaning may not match our expectations. Such gaps in expectation versus reality are “silent faults” [ 17 ]. Without checking for them explicitly, we might proceed with our analysis unaware that anything is amiss and encode that error in our results.

For these reasons, every data exploration should include quantitative and qualitative “gut checks” [ 18 ] that can help us diagnose an expectation mismatch as we go about examining and manipulating our data. We may check assumptions about data quality such as the proportion of missing values, verify that a joined data set has the expected dimensions, or ascertain the statistical distributions of well-known data categories. In this latter case, having domain knowledge can help us understand what to expect. We may want to compare 2 data sets (for example, pre- and post-processed versions) to ensure they are the same [ 19 ]; we may also evaluate diagnostic plots to assess a model’s goodness of fit. Each of the elements that gut checks help us monitor will impact the accuracy and direction of our future analyses.
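A minimal sketch of such gut checks, written here in Python with pandas and entirely hypothetical column names and values, might look like the following; each line compares an expectation against the reality of the data.

```python
# Minimal sketch of data "gut checks" on hypothetical survey and site tables.
import pandas as pd

raw = pd.DataFrame({
    "site_id": [1, 2, 2, 3, 4],
    "age": [34, -999, 51, 28, 45],          # -999 used as a missingness code
    "visit_date": ["2020-01-03", "2020-02-11", None, "2020-03-09", "2020-04-20"],
})
sites = pd.DataFrame({"site_id": [1, 2, 3], "site_name": ["North", "East", "South"]})

merged = raw.merge(sites, on="site_id", how="left")

print(raw.dtypes)                   # are dates really dates, or stored as strings?
print(raw.isna().mean())            # proportion of explicitly missing values per column
print((raw["age"] == -999).sum())   # missingness hidden behind a sentinel value?
print(raw.shape, merged.shape)      # did the merge add or drop rows unexpectedly?
print(raw["age"].describe())        # do the ranges of values make sense?
```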

We perform these manual checks to reassure ourselves that our actions at each step of data cleaning, processing, or preliminary analysis worked as expected. However, these types of checks often rely on us as researchers visually assessing output and deciding if we agree with it. As we transition to needing to convince users beyond ourselves of the correctness of our work, we may consider employing defensive programming techniques that help guard against specific mistakes. An example of defensive programming in the Julia language is the use of assertions, such as the @assert macro to validate values or function outputs. Another option includes writing “chatty functions” [ 20 ] that signal a user to pause, examine the output, and decide if they agree with it.
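By analogy with the Julia @assert example above, a defensive, "chatty" merge function might be sketched in Python as follows; the data frames and column names are hypothetical.

```python
# Minimal sketch of defensive programming: validate an output with an assertion
# and "chat" with the user about anything suspicious.
import pandas as pd

def merge_with_sites(responses: pd.DataFrame, sites: pd.DataFrame) -> pd.DataFrame:
    merged = responses.merge(sites, on="site_id", how="left")

    # Assertion: a left join keyed on site_id should never change the row count.
    assert len(merged) == len(responses), "merge changed the row count; check for duplicate site_ids"

    # "Chatty function" behavior: ask the user to pause and inspect unmatched rows.
    unmatched = int(merged["site_name"].isna().sum())
    if unmatched:
        print(f"Note: {unmatched} response(s) have no matching site; please inspect before continuing.")
    return merged

# Example use with tiny hypothetical data frames:
responses = pd.DataFrame({"site_id": [1, 2, 9], "score": [3.5, 4.0, 2.5]})
sites = pd.DataFrame({"site_id": [1, 2, 3], "site_name": ["North", "East", "South"]})
checked = merge_with_sites(responses, sites)
```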

When to transition from the Explore Phase: Balancing breadth and depth

A researcher in the Explore Phase experiments with a variety of potential data configurations, analysis tools, and research directions. Not all of these may bear fruit in the form of novel questions or promising preliminary findings. Learning how to find a balance between the breadth and depth of data exploration helps us understand when to transition to the Refine Phase of data-intensive research. Specific questions to ask ourselves as we prepare to transition between the Explore Phase and the Refine Phase can be found in Box 2 .

Box 2. Questions

This box provides guiding questions to assist readers in navigating through each workflow phase. Questions pertain to planning, organization, and accountability over the course of workflow iteration.

Questions to ask in the Explore Phase

  • Good: Ourselves (e.g., Code includes signposts refreshing our memory of what is happening where.)
  • Better: Our small team who has specialized knowledge about the context of the problem.
  • Best: Anyone with experience using similar tools to us.
  • Good: Dead ends marked differently than relevant and working code.
  • Better: Material connected to a handful of promising leads.
  • Best: Material connected to a clearly defined scope.
  • Good: Backed up in a second location in addition to our computer.
  • Better: Within a shared space among our team (e.g., Google Drive, Box, etc.).
  • Best: Within a version control system (e.g., GitHub) that furnishes a complete timeline of actions taken.
  • Good: Noted in a separate place from our code (e.g., a physical notebook).
  • Better: Noted in comments throughout the code itself, with expectations informally checked.
  • Best: Noted systematically throughout code as part of a narrative, with expectations formally checked.

Questions to ask in the Refine Phase

  • Who is in our team?
  • Consider career level, computational experience, and domain-specific experience.
  • How do we communicate methodology with our teammates’ skills in mind?
  • What reproducibility tools can be agreed upon?
  • How can our work be packaged into impactful research products?
  • Can we explain the same important results across different platforms (e.g., blog post in addition to white paper)?
  • How can we alert potential audiences beyond our team and make our work accessible to them?
  • How can we use narrative to make this clear?

Questions to ask in the Produce Phase

  • Do we have more than 1 audience?
  • What is the next step in our research?
  • Can we turn our work into more than 1 publishable product?
  • Consider products throughout the entire workflow.
  • See suggestions in the Tool development guide ( Box 4 ).

Imposing structure at certain points throughout the Explore Phase can help to balance our wide search for solutions with our deep dives into particular options. In an analogy to the software development world, we can treat our exploratory code as a code release—the marker of a stable version of a piece of software. For example, we can take stock of the code we have written at set intervals, decide what aspects of the analysis conducted using it seem most promising, and focus our attention on more formally tuning those parts of the code. At this point, we can also note the presence of research “dead ends” and perhaps record where they fit into our thought process. Some trains of thought may not continue into the next phase or become a formal research product, but they can still contribute to our understanding of the problem or eliminate a potential solution from consideration. As the project matures, computational pipelines are established. These inform project workflow, and tools, such as Snakemake and Nextflow, can begin to be used to improve the flexibility and reproducibility of the project [ 21 – 23 ]. As we make decisions about which research direction we are going to pursue, we can also adjust our file structure and organize files into directories with more informative names.

Just as Cross [ 5 ] finds that a “reasonably-structured process” leads to design success where “rigid, over-structured approaches” find less success, a balance between the formality of documentation and testing and the informality of creative discovery is key to the Explore Phase of data-intensive research. By taking inspiration from software development and adapting the principles of that arena to fit our data analysis work, we add enough structure to this phase to ease transition into the next phase of the research workflow.

Phase 2: Refine

Inevitably, we reach a point in the Explore Phase when we have acquainted ourselves with our data set, processed and cleaned it, identified interesting research questions that might be asked using it, and found the analysis tools that we prefer to apply. Having reached this important juncture, we may also wish to expand our audience from ourselves to a team of research collaborators. It is at this point that we are ready to transition to the Refine Phase. However, we should keep in mind that new insights may bring us back to the Explore Phase: Over the lifetime of a given research project, we are likely to cycle through each workflow phase multiple times.

In the Refine Phase, the extension of our target audience demands a higher standard for communicating our research decisions as well as a more formal approach to organizing our workflow and documenting and testing our code. In this section, we will discuss principles for structuring our data analysis in the Refine Phase. This phase will ultimately prepare our work for polishing into more traditional research products, including peer-reviewed academic papers.

Designing data analysis: Goals and standards of the Refine Phase

The Refine Phase encompasses many critical aspects of a data-intensive research project. Additional data cleaning may be conducted, analysis methodologies are chosen, and the final experimental design is decided upon. Experimental design may include identifying case studies for variables of interest within our data. If applicable, it is during this phase that we determine the details of simulations. Preliminary results from the Explore Phase inform how we might improve upon or scale up prototypes in the Refine Phase. Data management is essential during this phase and can be expanded to include the serialization of experimental setups. Finally, standards of reproducibility should be maintained throughout. Each of these aspects constitutes an important goal of the Refine Phase as we determine the most promising avenues for focusing our research workflow en route to the polished research products that will emerge from this phase and demand even higher reproducibility standards.

All of these goals are developed in conjunction with our research team. Therefore, decisions should be documented and communicated in a way that is reproducible and constructive within that group. Just as the solitary nature of the Explore Phase can be daunting, the collaboration that may happen in the Refine Phase brings its own set of challenges as we figure out how to best work together. Our team can be defined as the people who participate in developing the research question, preparing the data set it is applied to, coding the analysis, or interpreting the results. It might also include individuals who offer feedback about the progress of our work. In the context of academia, our team usually includes our laboratory or research group. Like most other aspects of data-intensive research, our team may evolve as the project evolves. But however we define our team, its members inform how our efforts proceed during the Refine Phase: Thus, another primary goal of the Refine Phase is establishing group-based standards for the research workflow. Specific questions to ask ourselves during this phase can be found in Box 2 .

In recent years, the conversation on standards within academic data science and scientific computing has shifted from “best” practices [ 24 ] to “good enough” practices [ 25 ]. This is an important distinction when establishing team standards during the Refine Phase: Reproducibility is a spectrum [ 26 ], and collaborative work in data-intensive research carries unique demands on researchers as scholars and coworkers [ 27 ]. At this point in the research workflow, standards should be adopted according to their appropriateness for our team. This means talking among ourselves not only about scientific results, but also about the computational experimental design that led to those results and the role that each team member plays in the research workflow. Establishing methods for effective communication is therefore another important goal in the Refine Phase, as we cannot develop group-based standards for the research workflow without it.

Analogies to software development in the Refine Phase

Documentation as a driver of reproducibility.

The concept of literate programming [ 8 ] is at the core of an effective Refine Phase. This philosophy brings together code with human-readable explanations, allowing scientists to demonstrate the functionality of their code in the context of words and visualizations that describe the rationale for and results of their analysis. The computational notebooks that were useful in the Explore Phase are also applicable here, where they can assist with team-wide discussions, research development, prototyping, and idea sharing. Jupyter Notebooks [ 28 ] are agnostic to choice of programming language and so provide a good option for research teams that may be working with a diverse code base or different levels of comfort with a particular programming language. Language-specific interfaces such as R’s RMarkdown functionality [ 29 ] and Literate.jl or the reactive notebook put forward by Pluto.jl in the Julia programming language furnish additional options for literate programming.

The same strategies that promote scientific reproducibility for traditional laboratory notebooks can be applied to the computational notebook [ 30 ]. After all, our data-intensive research workflow can be considered a sort of scientific experiment—we develop a hypothesis, query our data, support or reject our hypothesis, and state our insights. A central tenet of scientific reproducibility is recording inputs relevant to a given analysis, such as parameter choices, and explaining any calculation used to obtain them so that our outputs can later be verifiably replicated. Methodological details—for example, the decision to develop a dynamic model in continuous time versus discrete time or the choice of a specific statistical analysis over alternative options—should also be fully explained in computational notebooks developed during the Refine Phase. Domain knowledge may inform such decisions, making this an important part of proper notebook documentation; such details should also be elaborated in the final research product. Computational research descriptions in academic journals generally include a narrative relevant to their final results, but these descriptions often do not include enough methodological detail to enable replicability, much less reproducibility. However, this is changing with time [ 31 , 32 ].

As scientists, we should keep a record of the tools we use to obtain our results in addition to our methodological process. In a data-intensive research workflow, this includes documenting the specific version of any software that we used, as well as its relevant dependencies and compatibility constraints. Recording this information at the top of the computational notebook that details our data science experiment allows future researchers—including ourselves and our teams—to establish the precise computational environment that was used to run the original research analysis. Our chosen programming language may supply automated approaches for doing this, such as a package manager , simplifying matters and painlessly raising the standards of reproducibility in a research team. The unprecedented levels of reproducibility possible in modern computational environments have produced some variance in the expectations of different research communities; it behooves the research team to investigate the community-level standards applicable to our specific domain science and chosen programming language.
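As a minimal sketch of this practice in Python (the listed packages are placeholders for whatever the analysis actually imports), the interpreter and dependency versions can be recorded at the top of a notebook:

```python
import sys
from importlib.metadata import version, PackageNotFoundError

# Example dependency list; substitute the packages the analysis actually uses.
DEPENDENCIES = ["numpy", "pandas", "matplotlib"]

print(f"Python {sys.version.split()[0]}")
for pkg in DEPENDENCIES:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```

Package managers can export the same information in a reusable form (for example, pip freeze > requirements.txt or conda env export), which can then be committed alongside the notebook.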

A notebook can include more than a deep dive into a full-fledged data science experiment. It can also involve exploring and communicating basic properties of the data, whether for purposes of training team members new to the project or for brainstorming alternative approaches to a piece of research. In the Explore Phase, we have discovered characteristics of our data that we want our research team to know about, for example, outliers or unexpected distributions, and created preliminary visualizations to better understand their presence. In the Refine Phase, we may choose to improve these initial plots and reprise our data processing decisions with team members to ensure that the logic we applied still holds.

Computational notebooks can live in private or public repositories to ensure accessibility and transparency among team members. A version control system such as Git continues to be broadly useful for documentation purposes in the Refine Phase, beyond acting as a storage site for computational notebooks. Especially as our team and code base grows larger, a history of commits and pull requests helps keep track of responsibilities, coding or data issues, and general workflow.

Importantly, however, all tools have their appropriate use cases. Researchers should not develop an over-reliance on any one tool and should learn to recognize when different tools are required. For example, computational notebooks may quickly become unwieldy for certain projects and large teams, incurring technical debt in the form of duplications or overwritten variables. As our research project grows in complexity and size, or gains team members, we may want to transition to an Integrated Development Environment (IDE) or a source code editor—both of which interact easily with container environments like Docker and version control platforms such as GitHub—to help scale our data analysis, while retaining important properties like reproducibility.

Testing and establishing code modularity.

Code in data-intensive research is generally written as a means to an end, the end being a scientific result from which researchers can draw conclusions. This stands in stark contrast to the purpose of code developed by data engineers or computer scientists, which is generally written to optimize a mechanistic function for maximum efficiency. During the Refine Phase, we may find ourselves with both analysis-relevant and mechanistic code , especially in “big data” statistical analyses or complex dynamic simulations where optimized computation becomes a concern. Keeping the immediate audience of this workflow phase, our research team, at the forefront of our mind can help us take steps to structure both mechanistic and analysis code in a useful way.

Mechanistic code, which is designed for repeated use, often employs abstractions by wrapping code into functions that apply the same action repeatedly or stringing together multiple scripts into a computational pipeline. Unit tests and so-called accessor functions or getter and setter functions that extract parameter values from data structures or set new values are examples of mechanistic code that might be included in a data-intensive research analysis. Meanwhile, code that is designed to gain statistical insight into distributions or model scientific dynamics using mathematical equations are 2 examples of analysis code. Sometimes, the line between mechanistic code and analysis code can be a blurry one. For example, we might write a looping function to sample our data set repeatedly, and that would classify as mechanistic code. But that sampling may be designed to occur according to an algorithm such as Markov Chain Monte Carlo that is directly tied to our desire to sample from a specific probability distribution; therefore, this could be labeled analysis and mechanistic code. Keep your audience in mind and the reproducibility of your experiment when considering how to present your code.
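To make the distinction concrete, the following Python sketch (all names hypothetical) pairs a piece of mechanistic code, an accessor that extracts a parameter from a configuration structure, with a piece of analysis code that uses it to summarize a sample:

```python
import statistics

# --- Mechanistic code: reusable plumbing with no scientific content of its own ---
def get_parameter(config: dict, name: str, default=None):
    """Accessor ("getter") that extracts a parameter value from a nested config."""
    return config.get("parameters", {}).get(name, default)

# --- Analysis code: directly tied to the scientific question being asked ---
def summarize_measurements(measurements, config):
    """Summarize a sample after trimming values above a scientifically motivated cutoff."""
    cutoff = get_parameter(config, "max_plausible_value", default=float("inf"))
    kept = [m for m in measurements if m <= cutoff]
    return {
        "n_kept": len(kept),
        "mean": statistics.mean(kept),
        "stdev": statistics.stdev(kept) if len(kept) > 1 else 0.0,
    }
```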

It is common practice to wrap code that we use repeatedly into functions to increase readability and modularity while reducing the propensity for user-induced error. However, the scripts and programming notebooks so useful to establishing a narrative and documenting work in the Refine Phase are set up to be read in a linear fashion. Embedding mechanistic functions in the midst of the research narrative obscures the utility of the notebooks in telling the research story and generally clutters up the analysis with a lot of extra code. For example, if we develop a function to eliminate the redundancy of repeatedly restructuring our data to produce a particular type of plot, we do not need to showcase that function in the middle of a computational notebook analyzing the implications of the plot that is created—the point is the research implications of the image, not the code that made the plot. Then where do we keep the data-reshaping, plot-generating code?

Strategies to structure the more mechanistic aspects of our analysis can be drawn from common software development practices. As our team grows or changes, we may require the same mechanistic code. For example, the same data-reshaping, plot-generating function described earlier might be pulled into multiple computational experiments that are set up in different locations, computational notebooks, scripts, or Git branches. Therefore, a useful approach would be to start collecting those mechanistic functions into their own script or file, sometimes called “helpers” or “utils,” that acts as a supplement to the various ongoing experiments, wherever they may be conducted. This separate script or file can be referenced or “called” at the beginning of the individual data analyses. Doing so allows team members to benefit from collaborative improvements to the mechanistic code without having to reinvent the wheel themselves. It also preserves the narrative properties of team members’ analysis-centric computational notebooks or scripts while maintaining transparency in basic methodologies that ensure project-wide reproducibility. The need to begin collecting mechanistic functions into files separate from analysis code is a good indicator that it may be time for the research team to supplement computational notebooks by using a code editor or IDE for further code development.
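In Python, this pattern might look like the following sketch, with hypothetical file and function names; mechanistic code lives in a shared helpers module that each analysis imports:

```python
# helpers.py -- shared mechanistic code, maintained collaboratively
import pandas as pd

def reshape_for_plot(df: pd.DataFrame, value_cols: list) -> pd.DataFrame:
    """Melt wide-format data into the long format our plotting code expects."""
    id_cols = [c for c in df.columns if c not in value_cols]
    return df.melt(id_vars=id_cols, value_vars=value_cols,
                   var_name="measurement", value_name="value")
```

An analysis notebook then begins with from helpers import reshape_for_plot, keeping its narrative cells focused on interpretation rather than plumbing.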

Testing scientific software is not always perfectly analogous to testing typical software development projects, where automated continuous integration is often employed [ 17 ]. However, as we start to modularize our code, breaking it into functions and from there into separate scripts or files that serve specific purposes, principles from software engineering become more readily applicable to our data-intensive analysis. Unit tests can now help us ensure that our mechanistic functions are working as expected, formalizing the “gut checks” that we performed in the Explore Phase. Among other applications, these tests should verify that our functions return the appropriate value, object type, or error message as needed [ 33 ]. Formal tests can also provide a more extensive investigation of how “trustworthy” the performance of a particular analysis method might be, affording us an opportunity to check the correctness of our scientific inferences. For example, we could use control data sets where we know the result of a particular analysis to make sure our analysis code is functioning as we expect. Alternatively, we could also use a regression test to compare computational outputs before and after changes in the code to make sure we haven’t introduced any unanticipated behavior.
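A hedged sketch of such tests using pytest, continuing with the hypothetical reshape_for_plot helper from the previous sketch; the second test is a simple regression test that compares current output against a stored snapshot:

```python
# test_helpers.py -- run with `pytest`
import pandas as pd
import pandas.testing as pdt
from helpers import reshape_for_plot

def _example_frame():
    return pd.DataFrame({"id": [1, 2], "height": [1.5, 1.7], "mass": [60.0, 72.0]})

def test_reshape_returns_expected_shape():
    long_df = reshape_for_plot(_example_frame(), value_cols=["height", "mass"])
    # Two value columns melted over two rows -> four rows with the expected columns.
    assert len(long_df) == 4
    assert set(long_df.columns) == {"id", "measurement", "value"}

def test_regression_against_saved_output(tmp_path):
    result = reshape_for_plot(_example_frame(), value_cols=["height", "mass"])
    snapshot = tmp_path / "expected.csv"
    # In a real project the snapshot would be committed to the repository and
    # only regenerated deliberately; here it is written once for illustration.
    result.to_csv(snapshot, index=False)
    expected = pd.read_csv(snapshot)
    pdt.assert_frame_equal(result.reset_index(drop=True), expected, check_dtype=False)
```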

When to transition from the Refine Phase: Going backwards and forwards

Workflows in data science are rarely linear; it is often necessary for researchers to iterate between the Refine and Explore Phases ( Fig 1B ). For example, while our research team may decide on a computational experimental design to pursue in the Refine Phase, the scope of that design may require us to revisit decisions made during the data processing that was conducted in the Explore Phase. This might mean including additional information from supplementary data sets to help refine our hypothesis or research question. In returning to the Explore Phase, we investigate these potential new data sets and decide if it makes sense to merge them with our original data set.

Iteration between the Refine and Explore Phases is a careful balance. On the one hand, we should be careful not to allow “scope creep” to expand our problem space beyond an area where we are able to develop constructive research contributions. On the other hand, if we are too rigid about decisions made over the course of our workflow and refuse to look backwards as well as forwards, we may risk cutting ourselves off from an important part of the potential solution space.

Data-intensive researchers can once more look to principles within the software development community, such as Agile frameworks, to help guide the careful balancing act required to conduct research that is both comprehensive and able to be completed [ 34 , 35 ]. How a team organizes and further documents their organization process can serve as research products themselves, which we describe further in the next phase of the workflow: the Produce Phase.

Phase 3: Produce

In the previous sections of this paper, we discussed how to progress from the exploration of raw data through the refinement of a research question and selection of an analytical methodology. We also described how the details of that workflow are guided by the breadth of the immediately relevant audience: ourselves in the Explore Phase and our research team in the Refine Phase. In the Produce Phase, it becomes time to make our data analysis camera ready for a much broader group, bringing our research results into a state that can be understood and built upon by others. This may translate to developing a variety of research products in addition to—or instead of—traditional academic outputs like peer-reviewed publications and typical software development products such as computational tools.

Beyond data analysis: Goals and standards of the Produce Phase

The main goal of the Produce Phase is to prepare our analysis to enter the public realm as a set of products ready for external use, reflection, and improvement. The Produce Phase encompasses the cleanup that happens prior to initially sharing our results to a broader community beyond our team, for example, ahead of submitting our work to peer review. It also includes the process of incorporating suggestions for improvement prior to finalization, for example, adjustments to address reviewer comments ahead of publication. The research products that emerge from a given workflow may vary in both their form and their formality—indeed, some research products, like a code base, might continually evolve without ever assuming “final” status—but each product constitutes valuable contributions that push our field’s scientific boundaries in their own way.

Importantly, producing public-facing products over the course of an entire workflow ( Fig 2 ) rather than just at the end of a project can help researchers progressively build their data science research portfolios and fulfill a second goal of the Produce Phase: gaining credit, and credibility, in our domain area. This is especially relevant for junior scientists who are just starting research careers or who wish to become industry data scientists [ 3 ]. Developing polished products at several intervals along a single workflow is also instructional for the researcher themselves. Researchers who prepare their work for public assessment from the earliest phases of an analysis become acquainted with the pertinent problem and solution spaces from multiple perspectives. This additional understanding, together with the feedback that polished products generate from people outside ourselves and our immediate team, may furnish insights that improve our approach in other phases of the research workflow.


Fig 2. Research products can build off of content generated in either the Explore or the Refine Phase. As in Fig 1A, turquoise circles represent potential research products generated as the project develops; closed circles represent research products within the scope of the current project, while open circles represent products beyond that scope. The figure emphasizes how those research products project onto a timeline and represent elements in our portfolio of work or lines on a CV. The ERP workflow emphasizes and encourages production, beyond traditional academic research products, throughout the lifecycle of a data-intensive project rather than just at the very end.

https://doi.org/10.1371/journal.pcbi.1008770.g002

Building our data science research portfolio requires a method for tracking and attributing the many products that we might develop. One important method for tracking and attribution is the digital object identifier or DOI. It is a unique handle, standardized by the International Organization for Standardization (ISO), that can be assigned to different types of information objects. DOIs are usually connected to metadata, for example, they might include a URL pointing to where the object they are associated with can be found online. Academic researchers are used to thinking of DOIs as persistent identifiers for peer-reviewed publications. However, DOIs can also be generated for data sets, GitHub repositories, computational notebooks, teaching materials, management plans, reports, white papers , and preprints. Researchers would also be well advised to register for a unique and persistent digital identifier to be associated with their name, called an ORCID iD ( https://orcid.org ), as an additional method of tracking and attributing their personal outputs over the course of their career.

A third, longer-term goal of the Produce Phase involves establishing a researcher’s professional trajectory. Every individual needs to gauge how their compendium of research products contributes to their career and how intentional portfolio building might, in turn, drive the research that they ultimately conduct. For example, researchers who wish to work in academia might feel obliged to obtain “academic value” from less traditional research products by essentially reprising them as peer-reviewed papers. But judging a researcher’s productivity by the metric of paper authorship can alter how and even whether research is performed [ 36 ]. Increasingly, academic journals are revisiting their publishing requirements [ 37 ] and raising their standards of reproducibility. This shift is bringing the data and programming methodologies that underpin our written analyses closer to center stage. Data-intensive research, and the people who produce it, stand to benefit. Scientists—now encouraged, and even required, by some academic journals to share both data and code—can publish and receive credit as well as feedback for the multiple research products that support their publications. Questions to ask ourselves as we consider possible research products can be found in Box 2.

Produce: Products of the Explore Phase

The old adage that one person’s trash is another’s treasure is relevant to the Explore Phase of a data science analysis: Of the many potential applications for a particular data set, there is often only time to explore a small subset. Those applications which fall outside the scope of the current analysis can nonetheless be valuable to our future selves or to others seeking to conduct their own analyses. To that end, the documentation that accompanies data exploration can furnish valuable guidance for later projects. Further, the cleaned and processed data set that emerges from the Explore Phase is itself a valuable outcome that can be assigned a DOI and rendered a formal product of this portion of the data analysis workflow, using outlets like Dryad ( http://www.datadryad.org ) and Figshare ( https://figshare.com/ ) among others.

Publicly sharing the data set, along with its metadata, is an essential component of scientific transparency and reproducibility, and it is of fundamental importance to the scientific community. Data associated with a research outcome should follow “FAIR” principles of findability, accessibility, interoperability, and reusability. Importantly, discipline-specific data standards should be followed when preparing data, whether the data are being refined for public-facing or personal use. Data-intensive researchers should familiarize themselves with the standards relevant to their field of study and recognize that meeting these standards increases the likelihood of their work being both reusable and reproducible. In addition to enabling future scientists to use the data set as it was developed, adhering to a standard also facilitates the creation of synthetic data sets for later research projects. Examples of discipline-specific data standards in the natural sciences are Darwin Core ( https://dwc.tdwg.org ) for biodiversity data and EML ( https://eml.ecoinformatics.org ) for ecological data. To maximize the utility of a publicly accessible data set, during the Produce Phase, researchers should confirm that it includes descriptive README files and field descriptions and also ensure that all abbreviations and coded entries are defined. In addition, an appropriate license should be assigned to the data set prior to publication: The license indicates whether, or under what circumstances, the data require attribution.

The Git repositories or computational notebooks that archive a data scientist’s approach, record the process of uncovering coding bugs, redundancies, or inconsistencies and note the rationale for focusing on specific aspects of the data are also useful research products in their own right. These items, which emerge from software development practices, can provide a touchstone for alternative explorations of the same data set at a later time. In addition to documenting valuable lessons learned, contributions of this kind can formally augment a data-intensive researcher’s registered body of work: Code used to actively clean data or record an Explore Phase process can be made citable by employing services like Zenodo to add a DOI to the applicable Git commit. Smaller code snippets or data excerpts can be shared—publicly or privately—using the more lightweight GitHub Gists ( https://gist.github.com/ ). Tools such as Dr.Watson ( https://github.com/JuliaDynamics/DrWatson.jl ) and Snakemake [ 23 ] are designed to assist researchers with organization and reproducibility and can inform the polishing process for products emerging from any phase of the analysis (see [ 22 ] for more discussion of reproducible workflow design and tools). As with data products, in the Produce Phase, researchers should license their code repositories such that other scientists know how they can use, augment, or redistribute the contents. The Produce Phase is also the time for researchers to include descriptive README files and clear guidelines for future code contributors in their repository.

Alternative mechanisms for crediting the time and talent that researchers invest in the Explore Phase include relatively informal products. For example, blog posts can detail problem space exploration for a specific research question or lessons learned about data analysis training and techniques. White papers that describe the raw data set and the steps taken to clean it, together with an explanation of why and how these decisions were taken, might constitute another such informal product. Versions of these blog posts or white papers can be uploaded to open-access websites such as arXiv.org as preprints and receive a DOI.

The familiar academic route of a peer-reviewed publication is also available for products emerging from the Explore Phase. For example, depending on the domain area of interest, journals such as Nature Scientific Data and IEEE Transactions are especially suited to papers that document the methods of data set development or simply reproduce the data set itself. Pedagogical contributions that were learned or applied over the course of a research workflow can be written up for submission to training-focused journals such as the Journal of Statistics Education . For a list of potential research product examples for the Explore Phase, see Box 3 .

Box 3. Products

Research products can be developed throughout the ERP workflow. This box helps identify some options for each phase, including products less traditional to academia. Those that can be labeled with a digital object identifier (DOI) are marked as such.

Potential Products in the Explore Phase

  • Publication of cleaned and processed data set (DOI)
  • Citable GitHub repository and/or computational notebook that shows data cleaning/processing and exploratory data analysis (e.g., Jupyter Notebook, knitr, Literate, Pluto, etc.) (DOI)
  • GitHub Gists (e.g., particular piece of processing code)
  • White paper (e.g., explaining a data set)
  • Blog post (e.g., detailing exploratory process)
  • Teaching/training materials (e.g., data wrangling)
  • Preprint (e.g., about a data set or its creation) (DOI)
  • Peer-reviewed publication (e.g., about a curated data set) (DOI)

Potential Products in the Refine Phase

  • White paper (e.g., explaining preliminary findings)
  • Citable GitHub repository and/or computational notebook showing methodology and results (DOI)
  • Blog post (e.g., explaining findings informally)
  • Teaching/training materials (e.g., using your work as an example to teach a computational method)
  • Preprint (e.g., preliminary paper before being submitted to a journal) (DOI)
  • Peer-reviewed publication (e.g., formal description of your findings) (DOI)
  • Grant application incorporating the data management procedure
  • Methodology (e.g., writing a methods paper) (DOI)
  • Computational tool: this might include a package, a library, or an interactive web application (see Box 4 for further discussion of this potential research product).

Produce: Products of the Refine Phase

In the Refine Phase, documentation and the ability to communicate both methods and results become essential to daily management of the project. Happily, the implementation of these basic practices can also provide benefits beyond the immediate team of research collaborators: They can be standardized as a Data Management Plan or Protocol (DMP). DMPs are a valuable product that can emerge from the Refine Phase as a formal version of lessons learned concerning both research and team management. This product records the strategies and approaches used to, for example, describe, share, store, analyze, and preserve data.

While DMPs are often living documents over the course of a research project, evolving dynamically with the needs or restrictions that are encountered along the way, there is great utility to codifying them either for our team’s later use or for others conducting similar projects. DMPs can also potentially be leveraged into new research grants for our team, as these protocols are now a common mandate by many funders [ 38 ]. The group discussions that contribute to developing a DMP can be difficult and encompass considerations relevant to everything from team building to research design. The outcome of these discussions is often directly tied to the constructiveness of a research team and its robustness to potential turnover [ 38 ]. Sharing these standards and lessons learned in the form of polished research products can propel a proactive discussion of data management and sharing practices within our research domain. This, in turn, bolsters the creation or enhancement of community standards beyond our team and provides training materials for those new to the field.

As with the research products that are generated by the Explore Phase, DMPs can lead to polished blog posts, training materials, white papers, and preprints that enable researchers to both spread the word about their valuable findings and be credited for their work. In addition, peer-reviewed journals are beginning to allow the publication of DMPs as a formal outcome of the data analysis workflow (e.g., Rio Journal ). Importantly, when new members join a research team, they should receive a copy of the group’s DMP. If any additional training pertinent to plans or protocols is furnished to help get new members up to speed, these materials too can be polished into research products that contribute to scientific advancement. For a list of potential research product examples for the Refine Phase, see Box 3 .

Produce: Traditional research products and scientific software

By polishing our work, we finalize and format it to receive critiques beyond ourselves and our immediate team. The scientific analysis and results that are born of the full research workflow—once documented and linked appropriately to the code and data used to conduct it—are most frequently packaged into the traditional academic research product: a peer-reviewed publication. Even this product, however, can be improved upon in terms of its reproducibility and transparency thanks to software development tools and practices. For example, papers that employ literate programming notebooks enable researchers to augment the real-time evolution of a written draft with the code that informs it. A well-kept notebook can be used to outline the motivations for a manuscript and select the figures best suited to conveying the intended narrative, because it shows the evolution of ideas and the mathematics behind each analysis along with—ideally—brief textual explanations.

Peer-reviewed papers are of primary importance to the career and reputation of academic researchers [ 39 ], but the traditional format for such publications often does not take into account essential aspects of data-intensive analysis such as computational reproducibility [ 40 ]. Where strict requirements for reproducibility are not enforced by a given journal, researchers should nonetheless compile the supporting products that made our submitted manuscript possible—including relevant code and data, as well as the documentation of our computational tools and methodologies as described in the earlier sections of this paper—into a research compendium [ 37 , 41 – 43 ]. The objective is to provide transparency to those who read or wish to replicate our academic publication and reproduce the workflow that led to our results.

In addition to peer-reviewed publications and the various alternative research products described above, some scientists may choose to revisit the scripts developed during the Explore or Refine Phases and polish that code into a traditional software development product: a computational tool, also called a software tool. A computational tool can include libraries, packages, collections of functions, or data structures designed to help with a specific class of problem. Such products might be accompanied by repository documentation or a full-fledged methodological paper that can be categorized as additional research products beyond the tool itself. Each of these items can augment a researcher’s body of citable work and contribute to advances in our domain science.

One very simple example of a tool might be an interactive web application built in RShiny ( https://shiny.rstudio.com/ ) that allows the easy exploration of cleaned data sets or demonstrates the outcomes of alternative research questions. More complex examples include a software package that builds an open-source analysis pipeline or a data structure that formally standardizes the problem space of a domain-specific research area. In all cases, the README files, docstrings, example vignettes, and appropriate licensing relevant to the Explore phase are also a necessity for open-source software. Developers should also specify contributing guidelines for future researchers who might seek to improve or extend the capabilities of the original tool. Where applicable, the dynamic equations that inform simulations should be cited with the original scientific literature where they were derived.

The effort to translate reproducible scripts into reusable software and then to maintain the software and support users is often a massive undertaking. While the software engineering literature furnishes a rich suite of resources for researchers seeking to develop their own computational tools, this existing body of work is generally directed toward trained programmers and software engineers. The design decisions that are crucial to scientists—who are primarily interested in data analysis, experiment extensibility , and result reporting and inference—can be obscured by concepts that are either out of scope or described in overtly technical jargon. Box 4 furnishes a basic guide to highlight the decision points and architectural choices relevant to creating a tool for data-intensive research. Domain scientists seeking to wade into computational tool development are well advised to review the guidelines described in Gruning and colleagues [ 2 ] in addition to more traditional software development resources and texts such as Clean Code [ 44 ], Refactoring [ 45 ], and Best Practices in Scientific Computing [ 24 ].

Box 4. Tool development guide

Creating a new software tool as the polished product of a research workflow is nontrivial. This box furnishes a series of guiding questions to help researchers think through whether tool creation is appropriate to project goals, domain science needs, and team member skill sets.

  • Does a tool in this space already exist that can be used to provide the functionality/answer the research question of interest?
  • Does it formalize our research question?
  • Does it extend/allow extension of investigative capabilities beyond the research question that our existing script was developed to ask?
  • Does creating a tool advance our personal career goals or augment a desired/necessary skill set?
  • Funding (if applicable)?
  • Domain expertise?
  • Programming expertise?
  • Collaborative research partners with either time, funding, or relevant expertise?
  • Will the process of creating the new tool be valued/helpful for your career goals?
  • Should we build on an existing tool or make a new one?
  • What research area is it designed for?
  • Who is the envisioned end user? (e.g., scientist inside our domain, scientist outside our domain, policy maker, member of the public)
  • What is the goal of the end user? (e.g., analysis of raw inputs, explanation of results, creation of inputs for the next step of a larger analysis)
  • What are field norms?
  • Is it accessible (free, open source)?
  • What is the likely form and type of data input to our tool?
  • What is the desired form and type of data output from our tool?
  • Are there preexisting structures that are useful to emulate, or should we develop our own?
  • Is there an existing package that provides basic structure or building block functionalities necessary or useful for our tool, such that we do not need to reinvent the wheel?

Conclusions

Defining principles for data analysis workflows is important for scientific accuracy, efficiency, and the effective communication of results, regardless of whether researchers are working alone or in a team. Establishing standards, such as for documentation and unit testing, both improves the quality of work produced by practicing data scientists and sets a proactive example for fledgling researchers to do the same. There is no single set of principles for performing data-intensive research. Each computational project carries its own context—from the scientific domain in which it is conducted, to the software and methodological analysis tools we use to pursue our research questions, to the dynamics of our particular research team. Therefore, this paper has outlined general concepts for designing a data analysis such that researchers may incorporate the aspects of the ERP workflow that work best for them. It has also put forward suggestions for specific tools to facilitate that workflow and for a selection of nontraditional research products that could emerge throughout a given data analysis project.

Aiming for full reproducibility when communicating research results is a noble pursuit, but it is imperative to understand that there is a balance between generating a complete analysis and furnishing a 100% reproducible product. Researchers have competing motivations: finishing their work in a timely fashion versus producing a perfectly documented final product, all while weighing how these trade-offs might strengthen their careers. Despite various calls for the creation of a standard framework [ 7 , 46 ], achieving complete reproducibility may depend on factors far beyond the individual researcher, ranging from a culture-wide shift in the expectations of consumers of scientific research products to the realistic capacities of version control software. The first of these advancements is particularly challenging and unlikely to manifest quickly across data-intensive research areas, although it is underway in a number of scientific domains [ 26 ]. By reframing what a formal research product can be—and noting that polished contributions can constitute much more than the academic publications previously held forth as the benchmark for career advancement—we motivate structural change to data analysis workflows.

In addition to amassing outputs beyond the peer-reviewed academic publication, there are increasingly venues for writing less traditional papers that describe or consist solely of a novel data set, a software tool, a particular methodology, or training materials. As the professional landscape for data-intensive research evolves, these novel publications and research products are extremely valuable for distinguishing applicants to academic and nonacademic jobs, grants, and teaching positions. Data scientists and researchers should possess numerous and multifaceted skills to perform scientifically robust and computationally effective data analysis. Therefore, potential research collaborators or hiring entities both inside and outside the academy should take into account a variety of research products, from every phase of the data analysis workflow, when evaluating the career performance of data-intensive researchers [ 47 ].

Acknowledgments

We thank the Best Practices Working Group (UC Berkeley) for the thoughtful conversations and feedback that greatly informed the content of this paper. We thank the Berkeley Institute for Data Science for hosting meetings that brought together data scientists, biologists, statisticians, computer scientists, and software engineers to discuss how data-intensive research is performed and evaluated. We especially thank Stuart Geiger (UC Berkeley) for his leadership of the Best Practices in Data Science Group and Rebecca Barter (UC Berkeley) for her helpful feedback.

  • 3. Robinson E, Nolis J. Build a Career in Data Science. Simon and Schuster; 2020.
  • 6. Terence S. An Extensive Step by Step Guide to Exploratory Data Analysis. 2020 [cited 2020 Jun 15]. https://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e .
  • 13. Bostock MA. Better Way to Code—Mike Bostock—Medium. 2017 [cited 2020 Jun 15]. https://medium.com/@mbostock/a-better-way-to-code-2b1d2876a3a0 .
  • 14. van der Plas F. Pluto.jl. Github. https://github.com/fonsp/Pluto.jl .
  • 15. Best Practices for Writing R Code–Programming with R. [cited 15 Jun 2020]. https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/
  • 16. PyCon 2019. Jes Ford—Getting Started Testing in Data Science—PyCon 2019. Youtube; 5 May 2019 [cited 2020 Feb 20]. https://www.youtube.com/watch?v=0ysyWk-ox-8
  • 17. Hook D, Kelly D. Testing for trustworthiness in scientific software. 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering. 2009. pp. 59–64.
  • 18. Oh J-H. Check Yo’ Data Before You Wreck Yo’ Results. In: Medium [Internet]. ACLU Tech & Analytics; 24 Jan 2020 [cited 2020 Apr 9]. https://medium.com/aclu-tech-analytics/check-yo-data-before-you-wreck-yo-results-53f0e919d0b9 .
  • 19. Gelfand S. comparing two data frames: one #rstats, many ways! | Sharla Gelfand. In: Sharla Gelfand [Internet]. Sharla Gelfand; 17 Feb 2020 [cited 2020 Apr 20]. https://sharla.party/post/comparing-two-dfs/ .
  • 20. Gelfand S. Don’t repeat yourself, talk to yourself! Repeated reporting in the R universe | Sharla Gelfand. In: Sharla Gelfand [Internet]. 30 Jan 2020 [cited 2020 Apr 20]. https://sharla.party/talk/2020-01-01-rstudio-conf/ .
  • 27. Geiger RS, Sholler D, Culich A, Martinez C, Hoces de la Guardia F, Lanusse F, et al. Challenges of Doing Data-Intensive Research in Teams, Labs, and Groups: Report from the BIDS Best Practices in Data Science Series. 2018.
  • 29. Xie Y. Dynamic Documents with R and knitr. Chapman and Hall/CRC; 2017.
  • 33. Wickham H. R Packages: Organize, Test, Document, and Share Your Code. “O’Reilly Media, Inc.”; 2015.
  • 34. Abrahamsson P, Salo O, Ronkainen J, Warsta J. Agile Software Development Methods: Review and Analysis. arXiv [cs.SE]. 2017. http://arxiv.org/abs/1709.08439 .
  • 35. Beck K, Beedle M, Van Bennekum A, Cockburn A, Cunningham W, Fowler M, et al. Manifesto for agile software development. 2001. https://moodle2019-20.ua.es/moodle/pluginfile.php/2213/mod_resource/content/2/agile-manifesto.pdf .
  • 38. Sholler D, Das D, Hoces de la Guardia F, Hoffman C, Lanusse F, Varoquaux N, et al. Best Practices for Managing Turnover in Data Science Groups, Teams, and Labs. 2019.
  • 44. Martin RC. Clean Code: A Handbook of Agile Software Craftsmanship. Pearson Education; 2009.
  • 45. Fowler M. Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional; 2018.
  • 47. Geiger RS, Cabasse C, Cullens CY, Norén L, Fiore-Gartland B, Das D, et al. Career Paths and Prospects in Academic Data Science: Report of the Moore-Sloan Data Science Environments Survey. 2018.
  • 49. Jorgensen PC, editor. About the International Software Testing Qualification Board. In: The Craft of Model-Based Testing. 1st ed. Boca Raton: Auerbach Publications, a Taylor & Francis/CRC imprint; 2017. pp. 231–240.
  • 51. Wikipedia contributors. Functional design. In: Wikipedia, The Free Encyclopedia [Internet]. 4 Feb 2020 [cited 21 Feb 2020]. https://en.wikipedia.org/w/index.php?title=Functional_design&oldid=939128138
  • 52. 7 Essential Guidelines For Functional Design—Smashing Magazine. In: Smashing Magazine [Internet]. 5 Aug 2008 [cited 21 Feb 2020]. https://www.smashingmagazine.com/2008/08/7-essential-guidelines-for-functional-design/
  • 53. Claerbout JF, Karrenbach M. Electronic documents give reproducible research a new meaning. SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists; 1992. pp. 601–604.
  • 54. Heroux MA, Barba L, Parashar M, Stodden V, Taufer M. Toward a Compatible Reproducibility Taxonomy for Computational and Computing Sciences. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States); 2018. https://www.osti.gov/biblio/1481626 .

Data Analysis – Process, Methods and Types

Data Analysis

Definition:

Data analysis refers to the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, drawing conclusions, and supporting decision-making. It involves applying various statistical and computational techniques to interpret and derive insights from large datasets. The ultimate aim of data analysis is to convert raw data into actionable insights that can inform business decisions, scientific research, and other endeavors.

Data Analysis Process

The following is a step-by-step guide to the data analysis process:

Define the Problem

The first step in data analysis is to clearly define the problem or question that needs to be answered. This involves identifying the purpose of the analysis, the data required, and the intended outcome.

Collect the Data

The next step is to collect the relevant data from various sources. This may involve collecting data from surveys, databases, or other sources. It is important to ensure that the data collected is accurate, complete, and relevant to the problem being analyzed.

Clean and Organize the Data

Once the data has been collected, it needs to be cleaned and organized. This involves removing any errors or inconsistencies in the data, filling in missing values, and ensuring that the data is in a format that can be easily analyzed.
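For example, assuming survey responses stored in a CSV file (the file and column names below are hypothetical), cleaning and organizing the data in Python with pandas might look like this:

```python
import pandas as pd

# Load raw survey responses (hypothetical file and column names).
df = pd.read_csv("survey_responses.csv")

# Remove exact duplicate submissions.
df = df.drop_duplicates()

# Standardize inconsistent category labels.
df["region"] = df["region"].str.strip().str.lower()

# Fill missing numeric values with the column median; drop rows missing the key field.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["respondent_id"])

# Convert date strings into proper datetime objects for later analysis.
df["submitted_at"] = pd.to_datetime(df["submitted_at"], errors="coerce")
```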

Analyze the Data

The next step is to analyze the data using various statistical and analytical techniques. This may involve identifying patterns in the data, conducting statistical tests, or using machine learning algorithms to identify trends and insights.

Interpret the Results

After analyzing the data, the next step is to interpret the results. This involves drawing conclusions based on the analysis and identifying any significant findings or trends.

Communicate the Findings

Once the results have been interpreted, they need to be communicated to stakeholders. This may involve creating reports, visualizations, or presentations to effectively communicate the findings and recommendations.

Take Action

The final step in the data analysis process is to take action based on the findings. This may involve implementing new policies or procedures, making strategic decisions, or taking other actions based on the insights gained from the analysis.

Types of Data Analysis

Types of Data Analysis are as follows:

Descriptive Analysis

This type of analysis involves summarizing and describing the main characteristics of a dataset, such as the mean, median, mode, standard deviation, and range.
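For example, these summary statistics can be computed in a few lines of Python (the sample values are invented for illustration):

```python
import statistics

scores = [72, 85, 90, 85, 78, 95, 60, 85]

print("mean:", statistics.mean(scores))              # 81.25
print("median:", statistics.median(scores))          # 85.0
print("mode:", statistics.mode(scores))              # 85
print("stdev:", round(statistics.stdev(scores), 2))  # sample standard deviation
print("range:", max(scores) - min(scores))           # 35
```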

Inferential Analysis

This type of analysis involves making inferences about a population based on a sample. Inferential analysis can help determine whether a certain relationship or pattern observed in a sample is likely to be present in the entire population.
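A minimal sketch using SciPy, assuming two independent samples of invented test scores, illustrates a two-sample t-test:

```python
from scipy import stats

# Hypothetical samples, e.g., test scores from two teaching methods.
group_a = [82, 75, 90, 68, 88, 79, 85, 91]
group_b = [70, 65, 80, 72, 69, 75, 78, 66]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# With a 0.05 significance threshold, a small p-value suggests the observed
# difference is unlikely under the null hypothesis of equal population means.
if p_value < 0.05:
    print("Reject the null hypothesis of equal means.")
else:
    print("Fail to reject the null hypothesis.")
```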

Diagnostic Analysis

This type of analysis involves identifying and diagnosing problems or issues within a dataset. Diagnostic analysis can help identify outliers, errors, missing data, or other anomalies in the dataset.

Predictive Analysis

This type of analysis involves using statistical models and algorithms to predict future outcomes or trends based on historical data. Predictive analysis can help businesses and organizations make informed decisions about the future.
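A small sketch with scikit-learn, fitting a linear model to made-up historical data and predicting a future value:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: monthly advertising spend (X) and resulting sales (y).
X = np.array([[10], [15], [20], [25], [30], [35]])  # spend, in thousands
y = np.array([120, 150, 178, 205, 230, 262])        # sales, in units

model = LinearRegression().fit(X, y)
next_month_spend = np.array([[40]])
print("Predicted sales:", model.predict(next_month_spend)[0])
print("R^2 on training data:", model.score(X, y))
```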

Prescriptive Analysis

This type of analysis involves recommending a course of action based on the results of previous analyses. Prescriptive analysis can help organizations make data-driven decisions about how to optimize their operations, products, or services.

Exploratory Analysis

This type of analysis involves exploring the relationships and patterns within a dataset to identify new insights and trends. Exploratory analysis is often used in the early stages of research or data analysis to generate hypotheses and identify areas for further investigation.

Data Analysis Methods

Data Analysis Methods are as follows:

Statistical Analysis

This method involves the use of mathematical models and statistical tools to analyze and interpret data. It includes measures of central tendency, correlation analysis, regression analysis, hypothesis testing, and more.

Machine Learning

This method involves the use of algorithms to identify patterns and relationships in data. It includes supervised and unsupervised learning, classification, clustering, and predictive modeling.
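For instance, an unsupervised clustering sketch with scikit-learn on synthetic customer data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customer data: annual spend and number of purchases.
rng = np.random.default_rng(42)
spend = np.concatenate([rng.normal(500, 50, 50), rng.normal(2000, 200, 50)])
purchases = np.concatenate([rng.normal(5, 1, 50), rng.normal(25, 3, 50)])
X = StandardScaler().fit_transform(np.column_stack([spend, purchases]))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", np.bincount(kmeans.labels_))
```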

Data Mining

This method involves using statistical and machine learning techniques to extract information and insights from large and complex datasets.

Text Analysis

This method involves using natural language processing (NLP) techniques to analyze and interpret text data. It includes sentiment analysis, topic modeling, and entity recognition.
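A minimal sketch of count-based text analysis with scikit-learn (the example documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented open-ended survey responses.
docs = [
    "The product is great and support was helpful",
    "Terrible experience, the product broke immediately",
    "Helpful support team, great overall experience",
]

vectorizer = CountVectorizer(stop_words="english")
term_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary after removing stop words
print(term_matrix.toarray())               # term counts per document
```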

Network Analysis

This method involves analyzing the relationships and connections between entities in a network, such as social networks or computer networks. It includes social network analysis and graph theory.
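A brief sketch using NetworkX on its built-in karate club example graph:

```python
import networkx as nx

# Zachary's karate club graph, a classic small social network bundled with NetworkX.
G = nx.karate_club_graph()

centrality = nx.degree_centrality(G)
most_connected = sorted(centrality, key=centrality.get, reverse=True)[:3]
print("Most connected nodes:", most_connected)
print("Network density:", nx.density(G))
```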

Time Series Analysis

This method involves analyzing data collected over time to identify patterns and trends. It includes forecasting, decomposition, and smoothing techniques.
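A brief pandas sketch of resampling and moving-average smoothing on a synthetic daily series:

```python
import numpy as np
import pandas as pd

# Synthetic daily sales with an upward trend plus noise.
dates = pd.date_range("2023-01-01", periods=180, freq="D")
rng = np.random.default_rng(0)
sales = 100 + 0.5 * np.arange(180) + rng.normal(0, 10, 180)
series = pd.Series(sales, index=dates)

weekly_mean = series.resample("W").mean()   # aggregate to weekly averages
smoothed = series.rolling(window=7).mean()  # 7-day moving-average smoothing
print(weekly_mean.tail())
print(smoothed.tail())
```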

Spatial Analysis

This method involves analyzing geographic data to identify spatial patterns and relationships. It includes spatial statistics, spatial regression, and geospatial data visualization.

Data Visualization

This method involves using graphs, charts, and other visual representations to help communicate the findings of the analysis. It includes scatter plots, bar charts, heat maps, and interactive dashboards.
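For example, a scatter plot and a bar chart with matplotlib, using hypothetical values:

```python
import matplotlib.pyplot as plt

# Hypothetical data: marketing spend vs. revenue, and sales by region.
spend = [10, 15, 20, 25, 30]
revenue = [120, 150, 178, 205, 230]
regions = ["North", "South", "East", "West"]
region_sales = [340, 290, 410, 375]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(spend, revenue)
ax1.set_xlabel("Spend (thousands)")
ax1.set_ylabel("Revenue (units)")
ax1.set_title("Spend vs. revenue")

ax2.bar(regions, region_sales)
ax2.set_ylabel("Sales (units)")
ax2.set_title("Sales by region")

plt.tight_layout()
plt.show()
```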

Qualitative Analysis

This method involves analyzing non-numeric data such as interviews, observations, and open-ended survey responses. It includes thematic analysis, content analysis, and grounded theory.

Multi-criteria Decision Analysis

This method involves analyzing multiple criteria and objectives to support decision-making. It includes techniques such as the analytic hierarchy process (AHP), TOPSIS, and ELECTRE.

Data Analysis Tools

There are various data analysis tools available that can help with different aspects of data analysis. Below is a list of some commonly used data analysis tools:

  • Microsoft Excel: A widely used spreadsheet program that allows for data organization, analysis, and visualization.
  • SQL: A programming language used to manage and manipulate relational databases.
  • R: An open-source programming language and software environment for statistical computing and graphics.
  • Python: A general-purpose programming language that is widely used in data analysis and machine learning.
  • Tableau: A data visualization software that allows for interactive and dynamic visualizations of data.
  • SAS: A statistical analysis software used for data management, analysis, and reporting.
  • SPSS: A statistical analysis software used for data analysis, reporting, and modeling.
  • MATLAB: A numerical computing software that is widely used in scientific research and engineering.
  • RapidMiner: A data science platform that offers a wide range of data analysis and machine learning tools.

Applications of Data Analysis

Data analysis has numerous applications across various fields. Below are some examples of how data analysis is used in different fields:

  • Business : Data analysis is used to gain insights into customer behavior, market trends, and financial performance. This includes customer segmentation, sales forecasting, and market research.
  • Healthcare : Data analysis is used to identify patterns and trends in patient data, improve patient outcomes, and optimize healthcare operations. This includes clinical decision support, disease surveillance, and healthcare cost analysis.
  • Education : Data analysis is used to measure student performance, evaluate teaching effectiveness, and improve educational programs. This includes assessment analytics, learning analytics, and program evaluation.
  • Finance : Data analysis is used to monitor and evaluate financial performance, identify risks, and make investment decisions. This includes risk management, portfolio optimization, and fraud detection.
  • Government : Data analysis is used to inform policy-making, improve public services, and enhance public safety. This includes crime analysis, disaster response planning, and social welfare program evaluation.
  • Sports : Data analysis is used to gain insights into athlete performance, improve team strategy, and enhance fan engagement. This includes player evaluation, scouting analysis, and game strategy optimization.
  • Marketing : Data analysis is used to measure the effectiveness of marketing campaigns, understand customer behavior, and develop targeted marketing strategies. This includes customer segmentation, marketing attribution analysis, and social media analytics.
  • Environmental science : Data analysis is used to monitor and evaluate environmental conditions, assess the impact of human activities on the environment, and develop environmental policies. This includes climate modeling, ecological forecasting, and pollution monitoring.

When to Use Data Analysis

Data analysis is useful when you need to extract meaningful insights and information from large and complex datasets. It is a crucial step in the decision-making process, as it helps you understand the underlying patterns and relationships within the data, and identify potential areas for improvement or opportunities for growth.

Here are some specific scenarios where data analysis can be particularly helpful:

  • Problem-solving : When you encounter a problem or challenge, data analysis can help you identify the root cause and develop effective solutions.
  • Optimization : Data analysis can help you optimize processes, products, or services to increase efficiency, reduce costs, and improve overall performance.
  • Prediction: Data analysis can help you make predictions about future trends or outcomes, which can inform strategic planning and decision-making.
  • Performance evaluation : Data analysis can help you evaluate the performance of a process, product, or service to identify areas for improvement and potential opportunities for growth.
  • Risk assessment : Data analysis can help you assess and mitigate risks, whether it is financial, operational, or related to safety.
  • Market research : Data analysis can help you understand customer behavior and preferences, identify market trends, and develop effective marketing strategies.
  • Quality control: Data analysis can help you ensure product quality and customer satisfaction by identifying and addressing quality issues.

Purpose of Data Analysis

The primary purposes of data analysis can be summarized as follows:

  • To gain insights: Data analysis allows you to identify patterns and trends in data, which can provide valuable insights into the underlying factors that influence a particular phenomenon or process.
  • To inform decision-making: Data analysis can help you make informed decisions based on the information that is available. By analyzing data, you can identify potential risks, opportunities, and solutions to problems.
  • To improve performance: Data analysis can help you optimize processes, products, or services by identifying areas for improvement and potential opportunities for growth.
  • To measure progress: Data analysis can help you measure progress towards a specific goal or objective, allowing you to track performance over time and adjust your strategies accordingly.
  • To identify new opportunities: Data analysis can help you identify new opportunities for growth and innovation by identifying patterns and trends that may not have been visible before.

Examples of Data Analysis

Some Examples of Data Analysis are as follows:

  • Social Media Monitoring: Companies use data analysis to monitor social media activity in real-time to understand their brand reputation, identify potential customer issues, and track competitors. By analyzing social media data, businesses can make informed decisions on product development, marketing strategies, and customer service.
  • Financial Trading: Financial traders use data analysis to make real-time decisions about buying and selling stocks, bonds, and other financial instruments. By analyzing real-time market data, traders can identify trends and patterns that help them make informed investment decisions.
  • Traffic Monitoring : Cities use data analysis to monitor traffic patterns and make real-time decisions about traffic management. By analyzing data from traffic cameras, sensors, and other sources, cities can identify congestion hotspots and make changes to improve traffic flow.
  • Healthcare Monitoring: Healthcare providers use data analysis to monitor patient health in real-time. By analyzing data from wearable devices, electronic health records, and other sources, healthcare providers can identify potential health issues and provide timely interventions.
  • Online Advertising: Online advertisers use data analysis to make real-time decisions about advertising campaigns. By analyzing data on user behavior and ad performance, advertisers can make adjustments to their campaigns to improve their effectiveness.
  • Sports Analysis : Sports teams use data analysis to make real-time decisions about strategy and player performance. By analyzing data on player movement, ball position, and other variables, coaches can make informed decisions about substitutions, game strategy, and training regimens.
  • Energy Management : Energy companies use data analysis to monitor energy consumption in real-time. By analyzing data on energy usage patterns, companies can identify opportunities to reduce energy consumption and improve efficiency.

Characteristics of Data Analysis

Characteristics of Data Analysis are as follows:

  • Objective : Data analysis should be objective and based on empirical evidence, rather than subjective assumptions or opinions.
  • Systematic : Data analysis should follow a systematic approach, using established methods and procedures for collecting, cleaning, and analyzing data.
  • Accurate : Data analysis should produce accurate results, free from errors and bias. Data should be validated and verified to ensure its quality.
  • Relevant : Data analysis should be relevant to the research question or problem being addressed. It should focus on the data that is most useful for answering the research question or solving the problem.
  • Comprehensive : Data analysis should be comprehensive and consider all relevant factors that may affect the research question or problem.
  • Timely : Data analysis should be conducted in a timely manner, so that the results are available when they are needed.
  • Reproducible : Data analysis should be reproducible, meaning that other researchers should be able to replicate the analysis using the same data and methods.
  • Communicable : Data analysis should be communicated clearly and effectively to stakeholders and other interested parties. The results should be presented in a way that is understandable and useful for decision-making.

Advantages of Data Analysis

Advantages of Data Analysis are as follows:

  • Better decision-making: Data analysis helps in making informed decisions based on facts and evidence, rather than intuition or guesswork.
  • Improved efficiency: Data analysis can identify inefficiencies and bottlenecks in business processes, allowing organizations to optimize their operations and reduce costs.
  • Increased accuracy: Data analysis helps to reduce errors and bias, providing more accurate and reliable information.
  • Better customer service: Data analysis can help organizations understand their customers better, allowing them to provide better customer service and improve customer satisfaction.
  • Competitive advantage: Data analysis can provide organizations with insights into their competitors, allowing them to identify areas where they can gain a competitive advantage.
  • Identification of trends and patterns : Data analysis can identify trends and patterns in data that may not be immediately apparent, helping organizations to make predictions and plan for the future.
  • Improved risk management : Data analysis can help organizations identify potential risks and take proactive steps to mitigate them.
  • Innovation: Data analysis can inspire innovation and new ideas by revealing new opportunities or previously unknown correlations in data.

Limitations of Data Analysis

  • Data quality: The quality of data can impact the accuracy and reliability of analysis results. If data is incomplete, inconsistent, or outdated, the analysis may not provide meaningful insights.
  • Limited scope: Data analysis is limited by the scope of the data available. If data is incomplete or does not capture all relevant factors, the analysis may not provide a complete picture.
  • Human error : Data analysis is often conducted by humans, and errors can occur in data collection, cleaning, and analysis.
  • Cost : Data analysis can be expensive, requiring specialized tools, software, and expertise.
  • Time-consuming : Data analysis can be time-consuming, especially when working with large datasets or conducting complex analyses.
  • Overreliance on data: Data analysis should be complemented with human intuition and expertise. Overreliance on data can lead to a lack of creativity and innovation.
  • Privacy concerns: Data analysis can raise privacy concerns if personal or sensitive information is used without proper consent or security measures.


Your Modern Business Guide To Data Analysis Methods And Techniques

Data analysis methods and techniques blog post by datapine

Table of Contents

1) What Is Data Analysis?
2) Why Is Data Analysis Important?
3) What Is The Data Analysis Process?
4) Types Of Data Analysis Methods
5) Top Data Analysis Techniques To Apply
6) Quality Criteria For Data Analysis
7) Data Analysis Limitations & Barriers
8) Data Analysis Skills
9) Data Analysis In The Big Data Environment

In our data-rich age, understanding how to analyze and extract true meaning from our business’s digital insights is one of the primary drivers of success.

Despite the colossal volume of data we create every day, a mere 0.5% is actually analyzed and used for data discovery , improvement, and intelligence. While that may not seem like much, considering the amount of digital information we have at our fingertips, half a percent still accounts for a vast amount of data.

With so much data and so little time, knowing how to collect, curate, organize, and make sense of all of this potentially business-boosting information can be a minefield – but online data analysis is the solution.

In science, data analysis uses a more complex approach with advanced techniques to explore and experiment with data. On the other hand, in a business context, data is used to make data-driven decisions that will enable the company to improve its overall performance. In this post, we will cover the analysis of data from an organizational point of view while still going through the scientific and statistical foundations that are fundamental to understanding the basics of data analysis. 

To put all of that into perspective, we will answer a host of important analytical questions, explore analytical methods and techniques, while demonstrating how to perform analysis in the real world with a 17-step blueprint for success.

What Is Data Analysis?

Data analysis is the process of collecting, modeling, and analyzing data using various statistical and logical methods and techniques. Businesses rely on analytics processes and tools to extract insights that support strategic and operational decision-making.

All these various methods are largely based on two core areas: quantitative and qualitative research.


Gaining a better understanding of different techniques and methods in quantitative research as well as qualitative insights will give your analyzing efforts a more clearly defined direction, so it’s worth taking the time to allow this particular knowledge to sink in. Additionally, you will be able to create a comprehensive analytical report that will elevate your analysis.

Apart from qualitative and quantitative categories, there are also other types of data that you should be aware of before diving into complex data analysis processes. These categories include: 

  • Big data: Refers to massive data sets that need to be analyzed using advanced software to reveal patterns and trends. It is considered to be one of the best analytical assets as it provides larger volumes of data at a faster rate. 
  • Metadata: Putting it simply, metadata is data that provides insights about other data. It summarizes key information about specific data that makes it easier to find and reuse for later purposes. 
  • Real time data: As its name suggests, real time data is presented as soon as it is acquired. From an organizational perspective, this is the most valuable data as it can help you make important decisions based on the latest developments. Our guide on real time analytics will tell you more about the topic. 
  • Machine data: This is more complex data that is generated solely by a machine such as phones, computers, or even websites and embedded systems, without previous human interaction.

Why Is Data Analysis Important?

Before we go into detail about the categories of analysis along with its methods and techniques, you must understand the potential that analyzing data can bring to your organization.

  • Informed decision-making : From a management perspective, you can benefit from analyzing your data as it helps you make decisions based on facts and not simple intuition. For instance, you can understand where to invest your capital, detect growth opportunities, predict your income, or tackle uncommon situations before they become problems. Through this, you can extract relevant insights from all areas in your organization, and with the help of dashboard software , present the data in a professional and interactive way to different stakeholders.
  • Reduce costs : Another great benefit is to reduce costs. With the help of advanced technologies such as predictive analytics, businesses can spot improvement opportunities, trends, and patterns in their data and plan their strategies accordingly. In time, this will help you save money and resources on implementing the wrong strategies. And not just that, by predicting different scenarios such as sales and demand you can also anticipate production and supply. 
  • Target customers better : Customers are arguably the most crucial element in any business. By using analytics to get a 360° vision of all aspects related to your customers, you can understand which channels they use to communicate with you, their demographics, interests, habits, purchasing behaviors, and more. In the long run, it will drive success to your marketing strategies, allow you to identify new potential customers, and avoid wasting resources on targeting the wrong people or sending the wrong message. You can also track customer satisfaction by analyzing your client’s reviews or your customer service department’s performance.

What Is The Data Analysis Process?

Data analysis process graphic

When we talk about analyzing data, there is an order to follow to extract the needed conclusions. The analysis process consists of 5 key stages. We will cover each of them in more detail later in the post, but to start providing the needed context to understand what is coming next, here is a rundown of the 5 essential steps of data analysis. 

  • Identify: Before you get your hands dirty with data, you first need to identify why you need it in the first place. The identification is the stage in which you establish the questions you will need to answer. For example, what is the customer's perception of our brand? Or what type of packaging is more engaging to our potential customers? Once the questions are outlined you are ready for the next step. 
  • Collect: As its name suggests, this is the stage where you start collecting the needed data. Here, you define which sources of data you will use and how you will use them. The collection of data can come in different forms such as internal or external sources, surveys, interviews, questionnaires, and focus groups, among others.  An important note here is that the way you collect the data will be different in a quantitative and qualitative scenario. 
  • Clean: Once you have the necessary data it is time to clean it and leave it ready for analysis. Not all the data you collect will be useful; when collecting large amounts of data in different formats, it is very likely that you will find yourself with duplicate or badly formatted data. To avoid this, before you start working with your data you need to make sure to erase any white spaces, duplicate records, or formatting errors. This way you avoid hurting your analysis with bad-quality data. 
  • Analyze : With the help of various techniques such as statistical analysis, regressions, neural networks, text analysis, and more, you can start analyzing and manipulating your data to extract relevant conclusions. At this stage, you find trends, correlations, variations, and patterns that can help you answer the questions you first thought of in the identify stage. Various technologies in the market assist researchers and average users with the management of their data. Some of them include business intelligence and visualization software, predictive analytics, and data mining, among others. 
  • Interpret: Last but not least you have one of the most important steps: it is time to interpret your results. This stage is where the researcher comes up with courses of action based on the findings. For example, here you would understand if your clients prefer packaging that is red or green, plastic or paper, etc. Additionally, at this stage, you can also find some limitations and work on them. 

Now that you have a basic understanding of the key data analysis steps, let’s look at the top 17 essential methods.

17 Essential Types Of Data Analysis Methods

Before diving into the 17 essential types of methods, it is important to quickly go over the main analysis categories. Starting with the category of descriptive up to prescriptive analysis, the complexity and effort of data evaluation increases, but so does the added value for the company.

a) Descriptive analysis - What happened.

The descriptive analysis method is the starting point for any analytic reflection, and it aims to answer the question of what happened? It does this by ordering, manipulating, and interpreting raw data from various sources to turn it into valuable insights for your organization.

Performing descriptive analysis is essential, as it enables us to present our insights in a meaningful way. Although it is relevant to mention that this analysis on its own will not allow you to predict future outcomes or tell you the answer to questions like why something happened, it will leave your data organized and ready to conduct further investigations.

b) Exploratory analysis - How to explore data relationships.

As its name suggests, the main aim of the exploratory analysis is to explore. Prior to it, there is still no notion of the relationship between the data and the variables. Once the data is investigated, exploratory analysis helps you to find connections and generate hypotheses and solutions for specific problems. A typical area of ​​application for it is data mining.

c) Diagnostic analysis - Why it happened.

Diagnostic data analytics empowers analysts and executives by helping them gain a firm contextual understanding of why something happened. If you know why something happened as well as how it happened, you will be able to pinpoint the exact ways of tackling the issue or challenge.

Designed to provide direct and actionable answers to specific questions, this is one of the world’s most important methods in research, among its other key organizational applications such as retail analytics.

d) Predictive analysis - What will happen.

The predictive method allows you to look into the future to answer the question: what will happen? In order to do this, it uses the results of the previously mentioned descriptive, exploratory, and diagnostic analysis, in addition to machine learning (ML) and artificial intelligence (AI). Through this, you can uncover future trends, potential problems or inefficiencies, connections, and causal relationships in your data.

With predictive analysis, you can unfold and develop initiatives that will not only enhance your various operational processes but also help you gain an all-important edge over the competition. If you understand why a trend, pattern, or event happened through data, you will be able to develop an informed projection of how things may unfold in particular areas of the business.

e) Prescriptive analysis - How will it happen.

Prescriptive analysis is another of the most effective types of analysis methods in research. Prescriptive data techniques cross over from predictive analysis in that they revolve around using patterns or trends to develop responsive, practical business strategies.

By drilling down into prescriptive analysis, you will play an active role in the data consumption process by taking well-arranged sets of visual data and using it as a powerful fix to emerging issues in a number of key areas, including marketing, sales, customer experience, HR, fulfillment, finance, logistics analytics , and others.

Top 17 data analysis methods

As mentioned at the beginning of the post, data analysis methods can be divided into two big categories: quantitative and qualitative. Each of these categories holds a powerful analytical value that changes depending on the scenario and type of data you are working with. Below, we will discuss 17 methods that are divided into qualitative and quantitative approaches. 

Without further ado, here are the 17 essential types of data analysis methods with some use cases in the business world: 

A. Quantitative Methods 

To put it simply, quantitative analysis refers to all methods that use numerical data or data that can be turned into numbers (e.g. category variables like gender, age, etc.) to extract valuable insights. It is used to extract valuable conclusions about relationships, differences, and test hypotheses. Below we discuss some of the key quantitative methods. 

1. Cluster analysis

The action of grouping a set of data elements in a way that said elements are more similar (in a particular sense) to each other than to those in other groups – hence the term ‘cluster.’ Since there is no target variable when clustering, the method is often used to find hidden patterns in the data. The approach is also used to provide additional context to a trend or dataset.

Let's look at it from an organizational perspective. In a perfect world, marketers would be able to analyze each customer separately and give them the best personalized service, but let's face it, with a large customer base, that is practically impossible. That's where clustering comes in. By grouping customers into clusters based on demographics, purchasing behaviors, monetary value, or any other factor that might be relevant for your company, you will be able to immediately optimize your efforts and give your customers the best experience based on their needs.
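
As a rough sketch of how such segmentation might look in code, the example below clusters made-up customer records with scikit-learn's k-means; the column names and the number of clusters are assumptions chosen purely for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer records (illustrative only)
customers = pd.DataFrame({
    "annual_spend": [200, 250, 1200, 1500, 80, 95, 900, 1100],
    "purchases_per_year": [4, 5, 20, 24, 2, 3, 15, 18],
})

# Scale features so both contribute equally, then group customers into 3 segments
scaled = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(scaled)
print(customers)
```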

2. Cohort analysis

This type of data analysis approach uses historical data to examine and compare a determined segment of users' behavior, which can then be grouped with others with similar characteristics. By using this methodology, it's possible to gain a wealth of insight into consumer needs or a firm understanding of a broader target group.

Cohort analysis can be really useful for performing analysis in marketing as it will allow you to understand the impact of your campaigns on specific groups of customers. To exemplify, imagine you send an email campaign encouraging customers to sign up for your site. For this, you create two versions of the campaign with different designs, CTAs, and ad content. Later on, you can use cohort analysis to track the performance of the campaign for a longer period of time and understand which type of content is driving your customers to sign up, repurchase, or engage in other ways.  

A useful tool for getting started with cohort analysis is Google Analytics. You can learn more about the benefits and limitations of using cohorts in GA in this useful guide . In the image below, you can see an example of how a cohort is visualized in this tool. The segments (devices traffic) are divided into date cohorts (usage of devices) and then analyzed week by week to extract insights into performance.

Cohort analysis chart example from google analytics

3. Regression analysis

Regression uses historical data to understand how a dependent variable's value is affected when one (linear regression) or more independent variables (multiple regression) change or stay the same. By understanding each variable's relationship and how it developed in the past, you can anticipate possible outcomes and make better decisions in the future.

Let's bring it down with an example. Imagine you did a regression analysis of your sales in 2019 and discovered that variables like product quality, store design, customer service, marketing campaigns, and sales channels affected the overall result. Now you want to use regression to analyze which of these variables changed or if any new ones appeared during 2020. For example, you couldn’t sell as much in your physical store due to COVID lockdowns. Therefore, your sales could’ve either dropped in general or increased in your online channels. Through this, you can understand which independent variables affected the overall performance of your dependent variable, annual sales.
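
To make the idea concrete, here is a minimal multiple regression sketch with scikit-learn; the variables and figures are invented for illustration and do not represent any real sales data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented monthly data: [marketing spend (k), number of sales channels] -> sales (k)
X = np.array([
    [10, 2], [12, 2], [15, 3], [18, 3],
    [20, 4], [22, 4], [25, 5], [28, 5],
])
y = np.array([110, 125, 150, 170, 195, 210, 240, 265])

model = LinearRegression().fit(X, y)
print("Coefficients (spend, channels):", model.coef_)
print("Intercept:", model.intercept_)
print("Predicted sales for spend=30, channels=5:", model.predict([[30, 5]])[0])
```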

If you want to go deeper into this type of analysis, check out this article and learn more about how you can benefit from regression.

4. Neural networks

The neural network forms the basis for the intelligent algorithms of machine learning. It is a form of analytics that attempts, with minimal intervention, to understand how the human brain would generate insights and predict values. Neural networks learn from each and every data transaction, meaning that they evolve and advance over time.

A typical area of application for neural networks is predictive analytics. There are BI reporting tools that have this feature implemented within them, such as the Predictive Analytics Tool from datapine. This tool enables users to quickly and easily generate all kinds of predictions. All you have to do is select the data to be processed based on your KPIs, and the software automatically calculates forecasts based on historical and current data. Thanks to its user-friendly interface, anyone in your organization can manage it; there’s no need to be an advanced scientist. 
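
For readers who want to see the idea in code rather than in a BI tool, here is a minimal neural-network forecast sketch with scikit-learn; it is not the datapine tool's implementation, and the revenue history is synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic history: predict next-month revenue from the last three months
rng = np.random.default_rng(0)
history = 100 + np.cumsum(rng.normal(2, 5, size=60))  # fake monthly revenue
X = np.array([history[i:i + 3] for i in range(len(history) - 3)])
y = history[3:]

# A small feed-forward network trained on the sliding windows
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
model.fit(X, y)
print("Forecast for next month:", model.predict([history[-3:]])[0])
```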

Here is an example of how you can use the predictive analysis tool from datapine:

Example on how to use predictive analytics tool from datapine


5. Factor analysis

Factor analysis, also called “dimension reduction”, is a type of data analysis used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. The aim here is to uncover independent latent variables, an ideal method for streamlining specific segments.

A good way to understand this data analysis method is a customer evaluation of a product. The initial assessment is based on different variables like color, shape, wearability, current trends, materials, comfort, the place where they bought the product, and frequency of usage. Like this, the list can be endless, depending on what you want to track. In this case, factor analysis comes into the picture by summarizing all of these variables into homogenous groups, for example, by grouping the variables color, materials, quality, and trends into a broader latent variable of design.

If you want to start analyzing data using factor analysis we recommend you take a look at this practical guide from UCLA.
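
Below is a minimal sketch of dimension reduction with scikit-learn's FactorAnalysis on made-up survey ratings; the item names, the number of factors, and the random data are all assumptions for illustration, so the loadings will not reveal a meaningful structure the way real survey data would:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Made-up 1-5 ratings on six product attributes (illustrative only)
rng = np.random.default_rng(1)
ratings = pd.DataFrame(
    rng.integers(1, 6, size=(100, 6)),
    columns=["color", "materials", "trends", "comfort", "wearability", "fit"],
)

# Try to summarize the six observed variables with two latent factors
fa = FactorAnalysis(n_components=2, random_state=1)
fa.fit(ratings)
loadings = pd.DataFrame(
    fa.components_.T, index=ratings.columns, columns=["factor_1", "factor_2"]
)
print(loadings.round(2))  # high loadings indicate which items group together
```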

6. Data mining

Data mining is an umbrella term for engineering metrics and insights for additional value, direction, and context. By using exploratory statistical evaluation, data mining aims to identify dependencies, relations, patterns, and trends to generate advanced knowledge. When considering how to analyze data, adopting a data mining mindset is essential to success - as such, it’s an area that is worth exploring in greater detail.

An excellent use case of data mining is datapine intelligent data alerts . With the help of artificial intelligence and machine learning, they provide automated signals based on particular commands or occurrences within a dataset. For example, if you’re monitoring supply chain KPIs , you could set an intelligent alarm to trigger when invalid or low-quality data appears. By doing so, you will be able to drill down deep into the issue and fix it swiftly and effectively.

In the following picture, you can see how the intelligent alarms from datapine work. By setting up ranges on daily orders, sessions, and revenues, the alarms will notify you if the goal was not completed or if it exceeded expectations.

Example on how to use intelligent alerts from datapine
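
The underlying idea of such alerts can be sketched in a few lines of plain Python. This is a generic illustration of threshold-based alerting, not datapine's actual logic, and the KPI names and ranges are invented:

```python
# Invented daily KPI ranges: alert when a value falls outside its expected band
EXPECTED_RANGES = {
    "daily_orders": (80, 150),
    "sessions": (2000, 5000),
    "revenue": (4000, 9000),
}

def check_alerts(todays_values: dict) -> list:
    """Return a message for every KPI outside its expected range."""
    alerts = []
    for kpi, value in todays_values.items():
        low, high = EXPECTED_RANGES[kpi]
        if value < low:
            alerts.append(f"{kpi} below target: {value} < {low}")
        elif value > high:
            alerts.append(f"{kpi} exceeded expectations: {value} > {high}")
    return alerts

print(check_alerts({"daily_orders": 60, "sessions": 5400, "revenue": 7500}))
```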

7. Time series analysis

As its name suggests, time series analysis is used to analyze a set of data points collected over a specified period of time. Although analysts use this method to monitor the data points in a specific interval of time rather than just monitoring them intermittently, time series analysis is not used solely for the purpose of collecting data over time. Instead, it allows researchers to understand whether variables changed over the duration of the study, how the different variables depend on each other, and how they led to the end result. 

In a business context, this method is used to understand the causes of different trends and patterns to extract valuable insights. Another way of using this method is with the help of time series forecasting. Powered by predictive technologies, businesses can analyze various data sets over a period of time and forecast different future events. 

A great use case to put time series analysis into perspective is seasonality effects on sales. By using time series forecasting to analyze sales data of a specific product over time, you can understand if sales rise over a specific period of time (e.g. swimwear during summertime, or candy during Halloween). These insights allow you to predict demand and prepare production accordingly.  
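
A simple way to expose such seasonality, sketched here with pandas on invented swimwear sales, is to average sales by calendar month before attempting any forecast; the synthetic series is built with a deliberate summer peak so the pattern is visible:

```python
import numpy as np
import pandas as pd

# Invented three years of monthly swimwear sales with a built-in summer peak
months = pd.date_range("2020-01-01", periods=36, freq="MS")
seasonal = 100 + 80 * np.sin((months.month - 3) / 12 * 2 * np.pi)
noise = np.random.default_rng(2).normal(0, 10, 36)
sales = pd.Series(seasonal + noise, index=months)

# Average sales per calendar month reveals the seasonal profile
monthly_profile = sales.groupby(sales.index.month).mean().round(1)
print(monthly_profile)  # the summer months stand out as the peak season
```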

8. Decision Trees 

The decision tree analysis aims to act as a support tool to make smart and strategic decisions. By visually displaying potential outcomes, consequences, and costs in a tree-like model, researchers and company users can easily evaluate all factors involved and choose the best course of action. Decision trees are helpful to analyze quantitative data and they allow for an improved decision-making process by helping you spot improvement opportunities, reduce costs, and enhance operational efficiency and production.

But how does a decision tree actually work? This method works like a flowchart that starts with the main decision that you need to make and branches out based on the different outcomes and consequences of each decision. Each outcome will outline its own consequences, costs, and gains and, at the end of the analysis, you can compare each of them and make the smartest decision. 

Businesses can use them to understand which project is more cost-effective and will bring more earnings in the long run. For example, imagine you need to decide if you want to update your software app or build a new app entirely.  Here you would compare the total costs, the time needed to be invested, potential revenue, and any other factor that might affect your decision.  In the end, you would be able to see which of these two options is more realistic and attainable for your company or research.
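
While business decision trees are often drawn by hand, the same branching structure can also be learned from historical data. Here is a minimal scikit-learn sketch on a toy dataset; the features, labels, and thresholds are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy historical projects: [estimated_cost_k, months_needed] -> go / no-go decision
X = [[50, 3], [120, 8], [30, 2], [200, 12], [80, 5], [150, 10], [40, 3], [90, 6]]
y = ["go", "no-go", "go", "no-go", "go", "no-go", "go", "go"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned decision rules in a readable, flowchart-like form
print(export_text(tree, feature_names=["estimated_cost_k", "months_needed"]))
print("Prediction for a 100k, 7-month project:", tree.predict([[100, 7]])[0])
```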

9. Conjoint analysis 

Next, we have conjoint analysis. This approach is usually used in surveys to understand how individuals value different attributes of a product or service, and it is one of the most effective methods for extracting consumer preferences. When it comes to purchasing, some clients might be more price-focused, others more features-focused, and others might have a sustainable focus. Whatever your customers' preferences are, you can find them with conjoint analysis. Through this, companies can define pricing strategies, packaging options, subscription packages, and more. 

A great example of conjoint analysis is in marketing and sales. For instance, a cupcake brand might use conjoint analysis and find that its clients prefer gluten-free options and cupcakes with healthier toppings over super sugary ones. Thus, the cupcake brand can turn these insights into advertisements and promotions to increase sales of this particular type of product. And not just that, conjoint analysis can also help businesses segment their customers based on their interests. This allows them to send different messaging that will bring value to each of the segments. 

10. Correspondence Analysis

Also known as reciprocal averaging, correspondence analysis is a method used to analyze the relationship between categorical variables presented within a contingency table. A contingency table is a table that displays two (simple correspondence analysis) or more (multiple correspondence analysis) categorical variables across rows and columns that show the distribution of the data, which is usually answers to a survey or questionnaire on a specific topic. 

This method starts by calculating an “expected value” for each cell of the table, obtained by multiplying the corresponding row total by the column total and dividing by the table's grand total. The “expected value” is then subtracted from the observed value, producing a “residual” that allows you to draw conclusions about relationships and distribution. The results of this analysis are later displayed on a map that represents the relationships between the different values: the closer two values are on the map, the stronger the relationship. Let’s put it into perspective with an example. 

Imagine you are carrying out a market research analysis about outdoor clothing brands and how they are perceived by the public. For this analysis, you ask a group of people to match each brand with a certain attribute which can be durability, innovation, quality materials, etc. When calculating the residual numbers, you can see that brand A has a positive residual for innovation but a negative one for durability. This means that brand A is not positioned as a durable brand in the market, something that competitors could take advantage of. 
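
The expected-value and residual computation described above can be sketched directly with pandas and numpy on an invented brand-by-attribute contingency table (the counts are made up for illustration):

```python
import pandas as pd

# Invented contingency table: how often respondents matched a brand to an attribute
observed = pd.DataFrame(
    [[40, 10, 25], [15, 35, 20], [20, 25, 30]],
    index=["Brand A", "Brand B", "Brand C"],
    columns=["innovation", "durability", "quality materials"],
)

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1).to_numpy()[:, None]
col_totals = observed.sum(axis=0).to_numpy()[None, :]
expected = row_totals * col_totals / observed.to_numpy().sum()

# Positive residual = association stronger than expected, negative = weaker
residuals = observed - expected
print(residuals.round(1))
```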

11. Multidimensional Scaling (MDS)

MDS is a method used to observe the similarities or disparities between objects which can be colors, brands, people, geographical coordinates, and more. The objects are plotted using an “MDS map” that positions similar objects together and disparate ones far apart. The (dis) similarities between objects are represented using one or more dimensions that can be observed using a numerical scale. For example, if you want to know how people feel about the COVID-19 vaccine, you can use 1 for “don’t believe in the vaccine at all”  and 10 for “firmly believe in the vaccine” and a scale of 2 to 9 for in between responses.  When analyzing an MDS map the only thing that matters is the distance between the objects, the orientation of the dimensions is arbitrary and has no meaning at all. 

Multidimensional scaling is a valuable technique for market research, especially when it comes to evaluating product or brand positioning. For instance, if a cupcake brand wants to know how they are positioned compared to competitors, it can define 2-3 dimensions such as taste, ingredients, shopping experience, or more, and do a multidimensional scaling analysis to find improvement opportunities as well as areas in which competitors are currently leading. 

Another business example is in procurement when deciding on different suppliers. Decision makers can generate an MDS map to see how the different prices, delivery times, technical services, and more of the different suppliers differ and pick the one that suits their needs the best. 

A final example comes from a research paper, "An Improved Study of Multilevel Semantic Network Visualization for Analyzing Sentiment Word of Movie Review Data". The researchers picked a two-dimensional MDS map to display the distances and relationships between different sentiments in movie reviews. They used 36 sentiment words and distributed them based on their emotional distance, as we can see in the image below, where the words "outraged" and "sweet" sit on opposite sides of the map, marking the distance between the two emotions very clearly.

Example of multidimensional scaling analysis

Aside from being a valuable technique to analyze dissimilarities, MDS also serves as a dimension-reduction technique for large dimensional data. 
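
A minimal scikit-learn sketch of MDS is shown below; it positions four made-up brands in two dimensions from a hypothetical dissimilarity matrix (both the brands and the distances are assumptions for illustration):

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical pairwise dissimilarities between four brands (0 = identical)
brands = ["Brand A", "Brand B", "Brand C", "Brand D"]
dissimilarity = np.array([
    [0.0, 0.3, 0.8, 0.7],
    [0.3, 0.0, 0.7, 0.6],
    [0.8, 0.7, 0.0, 0.2],
    [0.7, 0.6, 0.2, 0.0],
])

# dissimilarity="precomputed" tells MDS to use the matrix directly
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)
for brand, (x, y) in zip(brands, coords):
    print(f"{brand}: ({x:.2f}, {y:.2f})")  # similar brands land close together
```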

B. Qualitative Methods

Qualitative data analysis methods are defined as the observation of non-numerical data that is gathered and produced using methods of observation such as interviews, focus groups, questionnaires, and more. As opposed to quantitative methods, qualitative data is more subjective and highly valuable in analyzing customer retention and product development.

12. Text analysis

Text analysis, also known in the industry as text mining, works by taking large sets of textual data and arranging them in a way that makes it easier to manage. By working through this cleansing process in stringent detail, you will be able to extract the data that is truly relevant to your organization and use it to develop actionable insights that will propel you forward.

Modern software accelerates the application of text analytics. Thanks to the combination of machine learning and intelligent algorithms, you can perform advanced analytical processes such as sentiment analysis. This technique allows you to understand the intentions and emotions of a text, for example, whether it's positive, negative, or neutral, and then give it a score depending on certain factors and categories that are relevant to your brand. Sentiment analysis is often used to monitor brand and product reputation and to understand how successful your customer experience is. To learn more about the topic check out this insightful article .

By analyzing data from various word-based sources, including product reviews, articles, social media communications, and survey responses, you will gain invaluable insights into your audience, as well as their needs, preferences, and pain points. This will allow you to create campaigns, services, and communications that meet your prospects’ needs on a personal level, growing your audience while boosting customer retention. There are various other “sub-methods” that are an extension of text analysis. Each of them serves a more specific purpose and we will look at them in detail next. 
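
As a small sketch of text mining in practice, the example below surfaces the most characteristic terms in a handful of made-up product reviews using scikit-learn's TF-IDF vectorizer; the reviews themselves are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up customer reviews (illustrative only)
reviews = [
    "The battery life is excellent but the screen scratches easily",
    "Fast delivery, great battery, love the camera",
    "Screen broke after a week, disappointed with the build quality",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(reviews)

# Show the highest-weighted terms for each review
terms = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf.toarray()):
    top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:3]
    print(f"Review {i + 1}:", [term for term, score in top])
```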

13. Content Analysis

This is a straightforward and very popular method that examines the presence and frequency of certain words, concepts, and subjects in different content formats such as text, image, audio, or video. For example, the number of times the name of a celebrity is mentioned on social media or online tabloids. It does this by coding text data that is later categorized and tabulated in a way that can provide valuable insights, making it the perfect mix of quantitative and qualitative analysis.

There are two types of content analysis. The first one is the conceptual analysis which focuses on explicit data, for instance, the number of times a concept or word is mentioned in a piece of content. The second one is relational analysis, which focuses on the relationship between different concepts or words and how they are connected within a specific context. 

Content analysis is often used by marketers to measure brand reputation and customer behavior, for example, by analyzing customer reviews. It can also be used to analyze customer interviews and find directions for new product development. It is also important to note that, to extract the maximum potential from this analysis method, you need a clearly defined research question. 

14. Thematic Analysis

Very similar to content analysis, thematic analysis also helps in identifying and interpreting patterns in qualitative data, with the main difference being that the former can also be applied to quantitative analysis. The thematic method analyzes large pieces of text data such as focus group transcripts or interviews and groups them into themes or categories that come up frequently within the text. It is a great method when trying to figure out people's views and opinions about a certain topic. For example, if you are a brand that cares about sustainability, you can survey your customers to analyze their views and opinions about sustainability and how they apply it to their lives. You can also analyze customer service call transcripts to find common issues and improve your service. 

Thematic analysis is a very subjective technique that relies on the researcher’s judgment. Therefore,  to avoid biases, it has 6 steps that include familiarization, coding, generating themes, reviewing themes, defining and naming themes, and writing up. It is also important to note that, because it is a flexible approach, the data can be interpreted in multiple ways and it can be hard to select what data is more important to emphasize. 

15. Narrative Analysis 

A bit more complex in nature than the two previous ones, narrative analysis is used to explore the meaning behind the stories that people tell and most importantly, how they tell them. By looking into the words that people use to describe a situation you can extract valuable conclusions about their perspective on a specific topic. Common sources for narrative data include autobiographies, family stories, opinion pieces, and testimonials, among others. 

From a business perspective, narrative analysis can be useful to analyze customer behaviors and feelings towards a specific product, service, feature, or others. It provides unique and deep insights that can be extremely valuable. However, it has some drawbacks.  

The biggest weakness of this method is that the sample sizes are usually very small due to the complexity and time-consuming nature of the collection of narrative data. Plus, the way a subject tells a story will be significantly influenced by his or her specific experiences, making it very hard to replicate in a subsequent study. 

16. Discourse Analysis

Discourse analysis is used to understand the meaning behind any type of written, verbal, or symbolic discourse based on its political, social, or cultural context. It mixes the analysis of languages and situations together. This means that the way the content is constructed and the meaning behind it is significantly influenced by the culture and society it takes place in. For example, if you are analyzing political speeches you need to consider different context elements such as the politician's background, the current political context of the country, the audience to which the speech is directed, and so on. 

From a business point of view, discourse analysis is a great market research tool. It allows marketers to understand how the norms and ideas of the specific market work and how their customers relate to those ideas. It can be very useful to build a brand mission or develop a unique tone of voice. 

17. Grounded Theory Analysis

Traditionally, researchers decide on a method and hypothesis and start to collect the data to prove that hypothesis. Grounded theory is the only method that doesn’t require an initial research question or hypothesis, as its value lies in the generation of new theories. With the grounded theory method, you can go into the analysis process with an open mind and explore the data to generate new theories through tests and revisions. In fact, it is not necessary to finish collecting the data before starting to analyze it; researchers usually begin to find valuable insights while they are still gathering the data. 

All of these elements make grounded theory a very valuable method as theories are fully backed by data instead of initial assumptions. It is a great technique to analyze poorly researched topics or find the causes behind specific company outcomes. For example, product managers and marketers might use the grounded theory to find the causes of high levels of customer churn and look into customer surveys and reviews to develop new theories about the causes. 

How To Analyze Data? Top 17 Data Analysis Techniques To Apply

17 top data analysis techniques by datapine

Now that we’ve answered the questions “what is data analysis?”, why is it important, and covered the different data analysis types, it’s time to dig deeper into how to perform your analysis by working through these 17 essential techniques.

1. Collaborate your needs

Before you begin analyzing or drilling down into any techniques, it’s crucial to sit down collaboratively with all key stakeholders within your organization, decide on your primary campaign or strategic goals, and gain a fundamental understanding of the types of insights that will best benefit your progress or provide you with the level of vision you need to evolve your organization.

2. Establish your questions

Once you’ve outlined your core objectives, you should consider which questions will need answering to help you achieve your mission. This is one of the most important techniques as it will shape the very foundations of your success.

To help you ask the right things and ensure your data works for you, you have to ask the right data analysis questions .

3. Data democratization

After giving your data analytics methodology some real direction, and knowing which questions need answering to extract optimum value from the information available to your organization, you should continue with democratization.

Data democratization is an action that aims to connect data from various sources efficiently and quickly so that anyone in your organization can access it at any given moment. You can extract data in text, images, videos, numbers, or any other format, and then perform cross-database analysis to achieve more advanced insights to share with the rest of the company interactively.  

Once you have decided on your most valuable sources, you need to take all of this into a structured format to start collecting your insights. For this purpose, datapine offers an easy all-in-one data connectors feature to integrate all your internal and external sources and manage them at your will. Additionally, datapine’s end-to-end solution automatically updates your data, allowing you to save time and focus on performing the right analysis to grow your company.

data connectors from datapine

4. Think of governance 

When collecting data in a business or research context you always need to think about security and privacy. With data breaches becoming a topic of concern for businesses, the need to protect your client's or subject’s sensitive information becomes critical. 

To ensure that all this is taken care of, you need to think of a data governance strategy. According to Gartner , this concept refers to “ the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption, and control of data and analytics .” In simpler words, data governance is a collection of processes, roles, and policies, that ensure the efficient use of data while still achieving the main company goals. It ensures that clear roles are in place for who can access the information and how they can access it. In time, this not only ensures that sensitive information is protected but also allows for an efficient analysis as a whole. 

5. Clean your data

After harvesting from so many sources you will be left with a vast amount of information that can be overwhelming to deal with. At the same time, you can be faced with incorrect data that can be misleading to your analysis. The smartest thing you can do to avoid dealing with this in the future is to clean the data. This is fundamental before visualizing it, as it will ensure that the insights you extract from it are correct.

There are many things that you need to look for in the cleaning process. The most important one is to eliminate any duplicate observations; these usually appear when using multiple internal and external sources of information. You can also add any missing codes, fix empty fields, and eliminate incorrectly formatted data.

Another usual form of cleaning is done with text data. As we mentioned earlier, most companies today analyze customer reviews, social media comments, questionnaires, and several other text inputs. In order for algorithms to detect patterns, text data needs to be revised to avoid invalid characters or any syntax or spelling errors. 

Most importantly, the aim of cleaning is to prevent you from arriving at false conclusions that can damage your company in the long run. By using clean data, you will also help BI solutions to interact better with your information and create better reports for your organization.
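
Below is a minimal pandas sketch of these cleaning steps on an invented customer export; the column names and values are assumptions chosen to show duplicates, stray spaces, inconsistent formatting, and missing values being handled:

```python
import pandas as pd

# Invented raw export with the usual problems: duplicates, stray spaces, gaps
raw = pd.DataFrame({
    "customer": ["  Alice ", "Bob", "Bob", "Carla", None],
    "country": ["DE", "us", "us", "FR", "FR"],
    "orders": [3, 5, 5, None, 2],
})

clean = (
    raw.drop_duplicates()                                   # remove duplicate records
       .assign(
           customer=lambda d: d["customer"].str.strip(),    # trim stray white space
           country=lambda d: d["country"].str.upper(),      # normalize formatting
           orders=lambda d: d["orders"].fillna(0),          # fill missing values explicitly
       )
       .dropna(subset=["customer"])                         # drop rows missing key fields
)
print(clean)
```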

6. Set your KPIs

Once you’ve set your sources, cleaned your data, and established clear-cut questions you want your insights to answer, you need to set a host of key performance indicators (KPIs) that will help you track, measure, and shape your progress in a number of key areas.

KPIs are critical to both qualitative and quantitative analysis research. This is one of the primary methods of data analysis you certainly shouldn’t overlook.

To help you set the best possible KPIs for your initiatives and activities, here is an example of a relevant logistics KPI : transportation-related costs. If you want to see more go explore our collection of key performance indicator examples .

Transportation costs logistics KPIs

7. Omit useless data

Having bestowed your data analysis tools and techniques with true purpose and defined your mission, you should explore the raw data you’ve collected from all sources and use your KPIs as a reference for chopping out any information you deem to be useless.

Trimming the informational fat is one of the most crucial methods of analysis as it will allow you to focus your analytical efforts and squeeze every drop of value from the remaining ‘lean’ information.

Any stats, facts, figures, or metrics that don’t align with your business goals or fit with your KPI management strategies should be eliminated from the equation.

8. Build a data management roadmap

While, at this point, this particular step is optional (you will have already gained a wealth of insight and formed a fairly sound strategy by now), creating a data management roadmap will help your data analysis methods and techniques become successful on a more sustainable basis. These roadmaps, if developed properly, are also built so they can be tweaked and scaled over time.

Invest ample time in developing a roadmap that will help you store, manage, and handle your data internally, and you will make your analysis techniques all the more fluid and functional – one of the most powerful types of data analysis methods available today.

9. Integrate technology

There are many ways to analyze data, but one of the most vital aspects of analytical success in a business context is integrating the right decision support software and technology.

Robust analysis platforms will not only allow you to pull critical data from your most valuable sources while working with dynamic KPIs that will offer you actionable insights; they will also present it in a digestible, visual, interactive format from one central, live dashboard. A data methodology you can count on.

By integrating the right technology within your data analysis methodology, you’ll avoid fragmenting your insights, saving you time and effort while allowing you to enjoy the maximum value from your business’s most valuable insights.

For a look at the power of software for the purpose of analysis and to enhance your methods of analyzing, glance over our selection of dashboard examples .

10. Answer your questions

By considering each of the above efforts, working with the right technology, and fostering a cohesive internal culture where everyone buys into the different ways to analyze data as well as the power of digital intelligence, you will swiftly start to answer your most burning business questions. Arguably, the best way to make your data concepts accessible across the organization is through data visualization.

11. Visualize your data

Online data visualization is a powerful tool as it lets you tell a story with your metrics, allowing users across the organization to extract meaningful insights that aid business evolution – and it covers all the different ways to analyze data.

The purpose of analyzing is to make your entire organization more informed and intelligent, and with the right platform or dashboard, this is simpler than you think, as demonstrated by our marketing dashboard .

An executive dashboard example showcasing high-level marketing KPIs such as cost per lead, MQL, SQL, and cost per customer.

This visual, dynamic, and interactive online dashboard is a data analysis example designed to give Chief Marketing Officers (CMO) an overview of relevant metrics to help them understand if they achieved their monthly goals.

In detail, this example generated with a modern dashboard creator displays interactive charts for monthly revenues, costs, net income, and net income per customer; all of them are compared with the previous month so that you can understand how the data fluctuated. In addition, it shows a detailed summary of the number of users, customers, SQLs, and MQLs per month to visualize the whole picture and extract relevant insights or trends for your marketing reports .

The CMO dashboard is perfect for c-level management as it can help them monitor the strategic outcome of their marketing efforts and make data-driven decisions that can benefit the company exponentially.

12. Be careful with the interpretation

We already dedicated an entire post to data interpretation as it is a fundamental part of the process of data analysis. It gives meaning to the analytical information and aims to drive a concise conclusion from the analysis results. Since most of the time companies are dealing with data from many different sources, the interpretation stage needs to be done carefully and properly in order to avoid misinterpretations. 

To help you through the process, here we list three common practices that you need to avoid at all costs when looking at your data:

  • Correlation vs. causation: The human brain is wired to find patterns. This tendency leads to one of the most common mistakes in interpretation: confusing correlation with causation. Although the two can exist simultaneously, it is not correct to assume that because two things happened together, one caused the other. A piece of advice to avoid falling into this mistake: never trust intuition alone, trust the data. If there is no objective evidence of causation, then always stick to correlation. 
  • Confirmation bias: This phenomenon describes the tendency to select and interpret only the data necessary to prove one hypothesis, often ignoring the elements that might disprove it. Even if it's not done on purpose, confirmation bias can represent a real problem, as excluding relevant information can lead to false conclusions and, therefore, bad business decisions. To avoid it, always try to disprove your hypothesis instead of proving it, share your analysis with other team members, and avoid drawing any conclusions before the entire analytical project is finalized.
  • Statistical significance: In short, statistical significance helps analysts understand whether a result reflects a real effect or whether it happened because of a sampling error or pure chance. The level of statistical significance needed might depend on the sample size and the industry being analyzed. In any case, ignoring the significance of a result when it might influence decision-making can be a huge mistake (see the short sketch after this list).
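
To make the notion of statistical significance more concrete, here is a minimal sketch in Python using SciPy. The two samples are invented conversion figures for a control and a variant group; the p-value estimates how likely a difference at least this large would be if chance alone were at work.

```python
# Minimal illustration of statistical significance (hypothetical data).
from scipy import stats

control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]  # e.g. weekly conversion rate (%)
variant = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0, 13.2, 12.6]  # same metric after a change

t_stat, p_value = stats.ttest_ind(control, variant)
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print("The difference is statistically significant at the 5% level.")
else:
    print("The difference could plausibly be explained by chance alone.")
```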

13. Build a narrative

Now, we’re going to look at how you can bring all of these elements together in a way that will benefit your business - starting with a little something called data storytelling.

The human brain responds incredibly well to strong stories or narratives. Once you’ve cleansed, shaped, and visualized your most invaluable data using various BI dashboard tools , you should strive to tell a story - one with a clear-cut beginning, middle, and end.

By doing so, you will make your analytical efforts more accessible, digestible, and universal, empowering more people within your organization to use your discoveries to their actionable advantage.

14. Consider autonomous technology

Autonomous technologies, such as artificial intelligence (AI) and machine learning (ML), play a significant role in the advancement of understanding how to analyze data more effectively.

Gartner has predicted that 80% of emerging technologies will be developed with AI foundations. This is a testament to the ever-growing power and value of autonomous technologies.

At the moment, these technologies are revolutionizing the analysis industry. Some examples that we mentioned earlier are neural networks, intelligent alarms, and sentiment analysis.

15. Share the load

If you work with the right tools and dashboards, you will be able to present your metrics in a digestible, value-driven format, allowing almost everyone in the organization to connect with and use relevant data to their advantage.

Modern dashboards consolidate data from various sources, providing access to a wealth of insights in one centralized location, no matter if you need to monitor recruitment metrics or generate reports that need to be sent across numerous departments. Moreover, these cutting-edge tools offer access to dashboards from a multitude of devices, meaning that everyone within the business can connect with practical insights remotely - and share the load.

Once everyone is able to work with a data-driven mindset, you will catalyze the success of your business in ways you never thought possible. And when it comes to knowing how to analyze data, this kind of collaborative approach is essential.

16. Data analysis tools

In order to perform high-quality analysis of data, it is fundamental to use tools and software that will ensure the best results. Here we leave you a small summary of four fundamental categories of data analysis tools for your organization.

  • Business Intelligence: BI tools allow you to process significant amounts of data from several sources in any format. Through this, you can not only analyze and monitor your data to extract relevant insights but also create interactive reports and dashboards to visualize your KPIs and put them to work for your company. datapine is an online BI software focused on delivering powerful analysis features that are accessible to beginner and advanced users alike. In this way, it offers a full-service solution that includes cutting-edge analysis of data, KPI visualization, live dashboards, reporting, and artificial intelligence technologies to predict trends and minimize risk.
  • Statistical analysis: These tools are usually designed for scientists, statisticians, market researchers, and mathematicians, as they allow them to perform complex statistical analyses with methods like regression analysis, predictive analysis, and statistical modeling. A good tool to perform this type of analysis is R-Studio, as it offers powerful data modeling and hypothesis testing features that can cover both academic and general data analysis. This tool is an industry favorite due to its capability for data cleaning, data reduction, and performing advanced analysis with several statistical methods. Another relevant tool to mention is SPSS from IBM. The software offers advanced statistical analysis for users of all skill levels. Thanks to a vast library of machine learning algorithms, text analysis, and a hypothesis testing approach, it can help your company find relevant insights to drive better decisions. SPSS also works as a cloud service that enables you to run it anywhere.
  • SQL Consoles: SQL is a programming language often used to handle structured data in relational databases. Tools like these are popular among data scientists as they are extremely effective in unlocking these databases' value. Undoubtedly, one of the most used SQL software in the market is MySQL Workbench . This tool offers several features such as a visual tool for database modeling and monitoring, complete SQL optimization, administration tools, and visual performance dashboards to keep track of KPIs.
  • Data Visualization: These tools are used to represent your data through charts, graphs, and maps that allow you to find patterns and trends in the data. datapine's already mentioned BI platform also offers a wealth of powerful online data visualization tools with several benefits. Some of them include: delivering compelling data-driven presentations to share with your entire company, the ability to see your data online with any device wherever you are, an interactive dashboard design feature that enables you to showcase your results in an interactive and understandable way, and to perform online self-service reports that can be used simultaneously with several other people to enhance team productivity.

17. Refine your process constantly 

Last is a step that might seem obvious to some people, but it can be easily ignored if you think you are done. Once you have extracted the needed results, you should always take a retrospective look at your project and think about what you can improve. As you saw throughout this long list of techniques, data analysis is a complex process that requires constant refinement. For this reason, you should always go one step further and keep improving. 

Quality Criteria For Data Analysis

So far we’ve covered a list of methods and techniques that should help you perform efficient data analysis. But how do you measure the quality and validity of your results? This is done with the help of some science quality criteria. Here we will go into a more theoretical area that is critical to understanding the fundamentals of statistical analysis in science. However, you should also be aware of these steps in a business context, as they will allow you to assess the quality of your results in the correct way. Let’s dig in. 

  • Internal validity: The results of a survey are internally valid if they measure what they are supposed to measure and thus provide credible results. In other words , internal validity measures the trustworthiness of the results and how they can be affected by factors such as the research design, operational definitions, how the variables are measured, and more. For instance, imagine you are doing an interview to ask people if they brush their teeth two times a day. While most of them will answer yes, you can still notice that their answers correspond to what is socially acceptable, which is to brush your teeth at least twice a day. In this case, you can’t be 100% sure if respondents actually brush their teeth twice a day or if they just say that they do, therefore, the internal validity of this interview is very low. 
  • External validity: Essentially, external validity refers to the extent to which the results of your research can be applied to a broader context. It basically aims to prove that the findings of a study can be applied in the real world. If the research can be applied to other settings, individuals, and times, then the external validity is high. 
  • Reliability: If your research is reliable, it means that it can be reproduced. If your measurements were repeated under the same conditions, they would produce similar results. This means that your measuring instrument consistently produces reliable results. For example, imagine a doctor building a symptoms questionnaire to detect a specific disease in a patient. Then, various other doctors use this questionnaire but end up diagnosing the same patient with a different condition. This means the questionnaire is not reliable in detecting the initial disease. Another important note here is that in order for your research to be reliable, it also needs to be objective. If the results of a study are the same, independent of who assesses or interprets them, the study can be considered reliable. Let’s see the objectivity criteria in more detail now. 
  • Objectivity: In data science, objectivity means that the researcher needs to stay fully objective when it comes to their analysis. The results of a study need to be determined by objective criteria and not by the beliefs, personality, or values of the researcher. Objectivity needs to be ensured when you are gathering the data; for example, when interviewing individuals, the questions need to be asked in a way that doesn't influence the results. Paired with this, objectivity also needs to be considered when interpreting the data. If different researchers reach the same conclusions, then the study is objective. For this last point, you can set predefined criteria to interpret the results to ensure all researchers follow the same steps. 

The discussed quality criteria cover mostly potential influences in a quantitative context. Analysis in qualitative research has by default additional subjective influences that must be controlled in a different way. Therefore, there are other quality criteria for this kind of research such as credibility, transferability, dependability, and confirmability. You can see each of them more in detail on this resource . 

Data Analysis Limitations & Barriers

Analyzing data is not an easy task. As you’ve seen throughout this post, there are many steps and techniques that you need to apply in order to extract useful information from your research. While a well-performed analysis can bring various benefits to your organization it doesn't come without limitations. In this section, we will discuss some of the main barriers you might encounter when conducting an analysis. Let’s see them more in detail. 

  • Lack of clear goals: No matter how good your data or analysis might be, if you don’t have clear goals or a hypothesis, the process might be worthless. While we mentioned some methods that don’t require a predefined hypothesis, it is always better to enter the analytical process with some clear guidelines about what you are expecting to get out of it, especially in a business context in which data is utilized to support important strategic decisions. 
  • Objectivity: Arguably one of the biggest barriers when it comes to data analysis in research is to stay objective. When trying to prove a hypothesis, researchers might find themselves, intentionally or unintentionally, directing the results toward an outcome that they want. To avoid this, always question your assumptions and avoid confusing facts with opinions. You can also show your findings to a research partner or external person to confirm that your results are objective. 
  • Data representation: A fundamental part of the analytical procedure is the way you represent your data. You can use various graphs and charts to represent your findings, but not all of them will work for all purposes. Choosing the wrong visual can not only damage your analysis but also mislead your audience; therefore, it is important to understand which type of visual to use depending on your analytical goals. Our complete guide on the types of graphs and charts lists 20 different visuals with examples of when to use them. 
  • Flawed correlation : Misleading statistics can significantly damage your research. We’ve already pointed out a few interpretation issues previously in the post, but it is an important barrier that we can't avoid addressing here as well. Flawed correlations occur when two variables appear related to each other but they are not. Confusing correlations with causation can lead to a wrong interpretation of results which can lead to building wrong strategies and loss of resources, therefore, it is very important to identify the different interpretation mistakes and avoid them. 
  • Sample size: A very common barrier to a reliable and efficient analysis process is the sample size. In order for the results to be trustworthy, the sample size should be representative of what you are analyzing. For example, imagine you have a company of 1,000 employees and you ask the question “do you like working here?” to 50 of them, of which 48 say yes, i.e. 96%. Now, imagine you ask the same question to all 1,000 employees and 950 say yes, which is 95%. Claiming that roughly 95% of employees like working at the company when the sample size was only 50 is not a representative or trustworthy conclusion. The results are far more accurate when surveying a bigger sample size (see the quick sketch after this list).   
  • Privacy concerns: In some cases, data collection can be subjected to privacy regulations. Businesses gather all kinds of information from their customers from purchasing behaviors to addresses and phone numbers. If this falls into the wrong hands due to a breach, it can affect the security and confidentiality of your clients. To avoid this issue, you need to collect only the data that is needed for your research and, if you are using sensitive facts, make it anonymous so customers are protected. The misuse of customer data can severely damage a business's reputation, so it is important to keep an eye on privacy. 
  • Lack of communication between teams : When it comes to performing data analysis on a business level, it is very likely that each department and team will have different goals and strategies. However, they are all working for the same common goal of helping the business run smoothly and keep growing. When teams are not connected and communicating with each other, it can directly affect the way general strategies are built. To avoid these issues, tools such as data dashboards enable teams to stay connected through data in a visually appealing way. 
  • Innumeracy : Businesses are working with data more and more every day. While there are many BI tools available to perform effective analysis, data literacy is still a constant barrier. Not all employees know how to apply analysis techniques or extract insights from them. To prevent this from happening, you can implement different training opportunities that will prepare every relevant user to deal with data. 
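
As a quick follow-up to the sample size point above, the sketch below compares the margin of error of a roughly 95% "yes" share for a sample of 50 versus a sample of 1,000, using the standard normal approximation for a proportion; the figures are hypothetical.

```python
# Margin of error for a sample proportion at 95% confidence (normal approximation).
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the confidence interval for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)

p_hat = 0.95  # observed share of "yes" answers
for n in (50, 1000):
    moe = margin_of_error(p_hat, n)
    print(f"n={n:5d}: 95% interval roughly {p_hat - moe:.1%} to {p_hat + moe:.1%}")

# With n=50 the interval is several times wider than with n=1000,
# which is why the small sample is far less trustworthy.
```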

Key Data Analysis Skills

As you've learned throughout this lengthy guide, analyzing data is a complex task that requires a lot of knowledge and skills. That said, thanks to the rise of self-service tools the process is way more accessible and agile than it once was. Regardless, there are still some key skills that are valuable to have when working with data, we list the most important ones below.

  • Critical and statistical thinking: To successfully analyze data you need to be creative and think outside the box. Yes, that might sound like a strange statement considering that data is often tied to facts. However, a great level of critical thinking is required to uncover connections, come up with a valuable hypothesis, and extract conclusions that go a step beyond the surface. This, of course, needs to be complemented by statistical thinking and an understanding of numbers. 
  • Data cleaning: Anyone who has ever worked with data will tell you that the cleaning and preparation process accounts for around 80% of a data analyst's work; therefore, the skill is fundamental. More than that, failing to clean the data adequately can significantly damage the analysis, which can lead to poor decision-making in a business scenario. While there are multiple tools that automate the cleaning process and reduce the possibility of human error, it is still a valuable skill to master. 
  • Data visualization: Visuals make the information easier to understand and analyze, not only for professional users but especially for non-technical ones. Having the necessary skills to not only choose the right chart type but know when to apply it correctly is key. This also means being able to design visually compelling charts that make the data exploration process more efficient. 
  • SQL: The Structured Query Language or SQL is a programming language used to communicate with databases. It is fundamental knowledge as it enables you to update, manipulate, and organize data from relational databases which are the most common databases used by companies. It is fairly easy to learn and one of the most valuable skills when it comes to data analysis. 
  • Communication skills: This is a skill that is especially valuable in a business environment. Being able to clearly communicate analytical outcomes to colleagues is incredibly important, especially when the information you are trying to convey is complex for non-technical people. This applies to in-person communication as well as written format, for example, when generating a dashboard or report. While this might be considered a “soft” skill compared to the other ones we mentioned, it should not be ignored as you most likely will need to share analytical findings with others no matter the context. 

Data Analysis In The Big Data Environment

Big data is invaluable to today’s businesses, and by using different methods for data analysis, it’s possible to view your data in a way that can help you turn insight into positive action.

To inspire your efforts and put the importance of big data into context, here are some insights that you should know:

  • By 2026 the industry of big data is expected to be worth approximately $273.4 billion.
  • 94% of enterprises say that analyzing data is important for their growth and digital transformation. 
  • Companies that exploit the full potential of their data can increase their operating margins by 60% .
  • We have already discussed the benefits of artificial intelligence throughout this article. The industry's financial impact is expected to grow to $40 billion by 2025.

Data analysis concepts may come in many forms, but fundamentally, any solid methodology will help to make your business more streamlined, cohesive, insightful, and successful than ever before.

Key Takeaways From Data Analysis 

As we reach the end of our data analysis journey, we leave a small summary of the main methods and techniques to perform excellent analysis and grow your business.

17 Essential Types of Data Analysis Methods:

  • Cluster analysis
  • Cohort analysis
  • Regression analysis
  • Factor analysis
  • Neural Networks
  • Data Mining
  • Text analysis
  • Time series analysis
  • Decision trees
  • Conjoint analysis 
  • Correspondence Analysis
  • Multidimensional Scaling 
  • Content analysis 
  • Thematic analysis
  • Narrative analysis 
  • Grounded theory analysis
  • Discourse analysis 

Top 17 Data Analysis Techniques:

  • Collaborate your needs
  • Establish your questions
  • Data democratization
  • Think of data governance 
  • Clean your data
  • Set your KPIs
  • Omit useless data
  • Build a data management roadmap
  • Integrate technology
  • Answer your questions
  • Visualize your data
  • Interpretation of data
  • Build a narrative
  • Consider autonomous technology
  • Share the load
  • Data Analysis tools
  • Refine your process constantly 

We’ve pondered the data analysis definition and drilled down into the practical applications of data-centric analytics, and one thing is clear: by taking measures to arrange your data and making your metrics work for you, it’s possible to transform raw information into action - the kind that will push your business to the next level.

Yes, good data analytics techniques result in enhanced business intelligence (BI). To help you understand this notion in more detail, read our exploration of business intelligence reporting .

And, if you’re ready to perform your own analysis, drill down into your facts and figures while interacting with your data on astonishing visuals, you can try our software for a free, 14-day trial .

A Step-by-Step Guide to the Data Analysis Process

Like any scientific discipline, data analysis follows a rigorous step-by-step process. Each stage requires different skills and know-how. To get meaningful insights, though, it’s important to understand the process as a whole. An underlying framework is invaluable for producing results that stand up to scrutiny.

In this post, we’ll explore the main steps in the data analysis process. This will cover how to define your goal, collect data, and carry out an analysis. Where applicable, we’ll also use examples and highlight a few tools to make the journey easier. When you’re done, you’ll have a much better understanding of the basics. This will help you tweak the process to fit your own needs.

Here are the steps we’ll take you through:

  • Defining the question
  • Collecting the data
  • Cleaning the data
  • Analyzing the data
  • Sharing your results
  • Embracing failure


Ready? Let’s get started with step one.

1. Step one: Defining the question

The first step in any data analysis process is to define your objective. In data analytics jargon, this is sometimes called the ‘problem statement’.

Defining your objective means coming up with a hypothesis and figuring out how to test it. Start by asking: What business problem am I trying to solve? While this might sound straightforward, it can be trickier than it seems. For instance, your organization’s senior management might pose an issue, such as: “Why are we losing customers?” It’s possible, though, that this doesn’t get to the core of the problem. A data analyst’s job is to understand the business and its goals in enough depth that they can frame the problem the right way.

Let’s say you work for a fictional company called TopNotch Learning. TopNotch creates custom training software for its clients. While it is excellent at securing new clients, it has much lower repeat business. As such, your question might not be, “Why are we losing customers?” but, “Which factors are negatively impacting the customer experience?” or better yet: “How can we boost customer retention while minimizing costs?”

Now you’ve defined a problem, you need to determine which sources of data will best help you solve it. This is where your business acumen comes in again. For instance, perhaps you’ve noticed that the sales process for new clients is very slick, but that the production team is inefficient. Knowing this, you could hypothesize that the sales process wins lots of new clients, but the subsequent customer experience is lacking. Could this be why customers don’t come back? Which sources of data will help you answer this question?

Tools to help define your objective

Defining your objective is mostly about soft skills, business knowledge, and lateral thinking. But you’ll also need to keep track of business metrics and key performance indicators (KPIs). Monthly reports can allow you to track problem points in the business. Some KPI dashboards come with a fee, like Databox and DashThis . However, you’ll also find open-source software like Grafana , Freeboard , and Dashbuilder . These are great for producing simple dashboards, both at the beginning and the end of the data analysis process.

2. Step two: Collecting the data

Once you’ve established your objective, you’ll need to create a strategy for collecting and aggregating the appropriate data. A key part of this is determining which data you need. This might be quantitative (numeric) data, e.g. sales figures, or qualitative (descriptive) data, such as customer reviews. All data fit into one of three categories: first-party, second-party, and third-party data. Let’s explore each one.

What is first-party data?

First-party data are data that you, or your company, have directly collected from customers. It might come in the form of transactional tracking data or information from your company’s customer relationship management (CRM) system. Whatever its source, first-party data is usually structured and organized in a clear, defined way. Other sources of first-party data might include customer satisfaction surveys, focus groups, interviews, or direct observation.

What is second-party data?

To enrich your analysis, you might want to secure a secondary data source. Second-party data is the first-party data of other organizations. This might be available directly from the company or through a private marketplace. The main benefit of second-party data is that they are usually structured, and although they will be less relevant than first-party data, they also tend to be quite reliable. Examples of second-party data include website, app or social media activity, like online purchase histories, or shipping data.

What is third-party data?

Third-party data is data that has been collected and aggregated from numerous sources by a third-party organization. Often (though not always) third-party data contains a vast amount of unstructured data points (big data). Many organizations collect big data to create industry reports or to conduct market research. The research and advisory firm Gartner is a good real-world example of an organization that collects big data and sells it on to other companies. Open data repositories and government portals are also sources of third-party data .

Tools to help you collect data

Once you’ve devised a data strategy (i.e. you’ve identified which data you need, and how best to go about collecting them) there are many tools you can use to help you. One thing you’ll need, regardless of industry or area of expertise, is a data management platform (DMP). A DMP is a piece of software that allows you to identify and aggregate data from numerous sources, before manipulating them, segmenting them, and so on. There are many DMPs available. Some well-known enterprise DMPs include Salesforce DMP , SAS , and the data integration platform, Xplenty . If you want to play around, you can also try some open-source platforms like Pimcore or D:Swarm .

Want to learn more about what data analytics is and the process a data analyst follows? We cover this topic (and more) in our free introductory short course for beginners. Check out tutorial one: An introduction to data analytics .

3. Step three: Cleaning the data

Once you’ve collected your data, the next step is to get it ready for analysis. This means cleaning, or ‘scrubbing’ it, and is crucial in making sure that you’re working with high-quality data . Key data cleaning tasks include:

  • Removing major errors, duplicates, and outliers —all of which are inevitable problems when aggregating data from numerous sources.
  • Removing unwanted data points —extracting irrelevant observations that have no bearing on your intended analysis.
  • Bringing structure to your data —general ‘housekeeping’, i.e. fixing typos or layout issues, which will help you map and manipulate your data more easily.
  • Filling in major gaps —as you’re tidying up, you might notice that important data are missing. Once you’ve identified gaps, you can go about filling them.

A good data analyst will spend around 70-90% of their time cleaning their data. This might sound excessive. But focusing on the wrong data points (or analyzing erroneous data) will severely impact your results. It might even send you back to square one…so don’t rush it! You’ll find a step-by-step guide to data cleaning here . You may be interested in this introductory tutorial to data cleaning, hosted by Dr. Humera Noor Minhas.
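
To make these cleaning tasks more tangible, here is a minimal Pandas sketch; the dataset and column names are invented purely for illustration, and real projects will need far more nuanced rules.

```python
# Minimal data cleaning sketch with pandas (hypothetical customer data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "country":     ["US", "us ", "us ", "DE", "FR", None],
    "order_value": [120.0, 85.5, 85.5, np.nan, 4200.0, 60.0],
})

df = df.drop_duplicates()                              # remove exact duplicate rows
df["country"] = df["country"].str.strip().str.upper()  # fix typos / layout issues
df["order_value"] = df["order_value"].fillna(df["order_value"].median())  # fill gaps

# Flag (rather than silently drop) extreme outliers using the 1.5 * IQR rule.
q1, q3 = df["order_value"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df["order_value"] < q1 - 1.5 * iqr) | (df["order_value"] > q3 + 1.5 * iqr)

print(df)
```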

Carrying out an exploratory analysis

Another thing many data analysts do (alongside cleaning data) is to carry out an exploratory analysis. This helps identify initial trends and characteristics, and can even refine your hypothesis. Let’s use our fictional learning company as an example again. Carrying out an exploratory analysis, perhaps you notice a correlation between how much TopNotch Learning’s clients pay and how quickly they move on to new suppliers. This might suggest that a low-quality customer experience (the assumption in your initial hypothesis) is actually less of an issue than cost. You might, therefore, take this into account.
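
A first pass at that kind of exploratory analysis might look like the sketch below, which assumes a small, invented extract of TopNotch Learning client data; in practice you would work with the full dataset and proper visualizations.

```python
# Quick exploratory analysis: summary statistics and a correlation matrix.
import pandas as pd

clients = pd.DataFrame({
    "annual_fee_usd":  [12000, 8500, 20000, 6000, 15000, 9500],
    "months_retained": [10, 18, 6, 24, 9, 16],
    "support_tickets": [14, 5, 22, 3, 17, 7],
})

print(clients.describe())  # central tendency and spread at a glance
print(clients.corr())      # a negative fee/retention value would hint that higher-paying clients churn sooner
```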

Tools to help you clean your data

Cleaning datasets manually—especially large ones—can be daunting. Luckily, there are many tools available to streamline the process. Open-source tools, such as OpenRefine , are excellent for basic data cleaning, as well as high-level exploration. However, free tools offer limited functionality for very large datasets. Python libraries (e.g. Pandas) and some R packages are better suited for heavy data scrubbing. You will, of course, need to be familiar with the languages. Alternatively, enterprise tools are also available. For example, Data Ladder , which is one of the highest-rated data-matching tools in the industry. There are many more. Why not see which free data cleaning tools you can find to play around with?

4. Step four: Analyzing the data

Finally, you’ve cleaned your data. Now comes the fun bit—analyzing it! The type of data analysis you carry out largely depends on what your goal is. But there are many techniques available. Univariate or bivariate analysis, time-series analysis, and regression analysis are just a few you might have heard of. More important than the different types, though, is how you apply them. This depends on what insights you’re hoping to gain. Broadly speaking, all types of data analysis fit into one of the following four categories.

Descriptive analysis

Descriptive analysis identifies what has already happened . It is a common first step that companies carry out before proceeding with deeper explorations. As an example, let’s refer back to our fictional learning provider once more. TopNotch Learning might use descriptive analytics to analyze course completion rates for their customers. Or they might identify how many users access their products during a particular period. Perhaps they’ll use it to measure sales figures over the last five years. While the company might not draw firm conclusions from any of these insights, summarizing and describing the data will help them to determine how to proceed.

Learn more: What is descriptive analytics?

Diagnostic analysis

Diagnostic analytics focuses on understanding why something has happened . It is literally the diagnosis of a problem, just as a doctor uses a patient’s symptoms to diagnose a disease. Remember TopNotch Learning’s business problem? ‘Which factors are negatively impacting the customer experience?’ A diagnostic analysis would help answer this. For instance, it could help the company draw correlations between the issue (struggling to gain repeat business) and factors that might be causing it (e.g. project costs, speed of delivery, customer sector, etc.) Let’s imagine that, using diagnostic analytics, TopNotch realizes its clients in the retail sector are departing at a faster rate than other clients. This might suggest that they’re losing customers because they lack expertise in this sector. And that’s a useful insight!

Predictive analysis

Predictive analysis allows you to identify future trends based on historical data . In business, predictive analysis is commonly used to forecast future growth, for example. But it doesn’t stop there. Predictive analysis has grown increasingly sophisticated in recent years. The speedy evolution of machine learning allows organizations to make surprisingly accurate forecasts. Take the insurance industry. Insurance providers commonly use past data to predict which customer groups are more likely to get into accidents. As a result, they’ll hike up customer insurance premiums for those groups. Likewise, the retail industry often uses transaction data to predict where future trends lie, or to determine seasonal buying habits to inform their strategies. These are just a few simple examples, but the untapped potential of predictive analysis is pretty compelling.
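
To give a hedged flavor of how such a prediction might be built, the sketch below fits a logistic regression on a tiny, invented dataset to estimate accident probability from two made-up driver features; real predictive work would involve far more data, feature engineering, and validation.

```python
# Toy predictive model: accident risk from two hypothetical driver features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [driver_age, claims_in_last_5_years]  (invented training data)
X = np.array([[19, 2], [22, 1], [25, 3], [34, 0], [45, 0], [52, 1], [61, 0], [23, 2]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # 1 = had an accident, 0 = did not

model = LogisticRegression(max_iter=1000).fit(X, y)

new_drivers = np.array([[21, 1], [50, 0]])
for features, prob in zip(new_drivers, model.predict_proba(new_drivers)[:, 1]):
    print(f"driver {features.tolist()} -> estimated accident probability {prob:.2f}")
```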

Prescriptive analysis

Prescriptive analysis allows you to make recommendations for the future. This is the final step in the analytics part of the process. It’s also the most complex. This is because it incorporates aspects of all the other analyses we’ve described. A great example of prescriptive analytics is the algorithms that guide Google’s self-driving cars. Every second, these algorithms make countless decisions based on past and present data, ensuring a smooth, safe ride. Prescriptive analytics also helps companies decide on new products or areas of business to invest in.

Learn more:  What are the different types of data analysis?

5. Step five: Sharing your results

You’ve finished carrying out your analyses. You have your insights. The final step of the data analytics process is to share these insights with the wider world (or at least with your organization’s stakeholders!) This is more complex than simply sharing the raw results of your work—it involves interpreting the outcomes, and presenting them in a manner that’s digestible for all types of audiences. Since you’ll often present information to decision-makers, it’s very important that the insights you present are 100% clear and unambiguous. For this reason, data analysts commonly use reports, dashboards, and interactive visualizations to support their findings.

How you interpret and present results will often influence the direction of a business. Depending on what you share, your organization might decide to restructure, to launch a high-risk product, or even to close an entire division. That’s why it’s very important to provide all the evidence that you’ve gathered, and not to cherry-pick data. Ensuring that you cover everything in a clear, concise way will prove that your conclusions are scientifically sound and based on the facts. On the flip side, it’s important to highlight any gaps in the data or to flag any insights that might be open to interpretation. Honest communication is the most important part of the process. It will help the business, while also helping you to excel at your job!

Tools for interpreting and sharing your findings

There are tons of data visualization tools available, suited to different experience levels. Popular tools requiring little or no coding skills include Google Charts , Tableau , Datawrapper , and Infogram . If you’re familiar with Python and R, there are also many data visualization libraries and packages available. For instance, check out the Python libraries Plotly , Seaborn , and Matplotlib . Whichever data visualization tools you use, make sure you polish up your presentation skills, too. Remember: Visualization is great, but communication is key!

You can learn more about storytelling with data in this free, hands-on tutorial .  We show you how to craft a compelling narrative for a real dataset, resulting in a presentation to share with key stakeholders. This is an excellent insight into what it’s really like to work as a data analyst!

6. Step six: Embrace your failures

The last ‘step’ in the data analytics process is to embrace your failures. The path we’ve described above is more of an iterative process than a one-way street. Data analytics is inherently messy, and the process you follow will be different for every project. For instance, while cleaning data, you might spot patterns that spark a whole new set of questions. This could send you back to step one (to redefine your objective). Equally, an exploratory analysis might highlight a set of data points you’d never considered using before. Or maybe you find that the results of your core analyses are misleading or erroneous. This might be caused by mistakes in the data, or human error earlier in the process.

While these pitfalls can feel like failures, don’t be disheartened if they happen. Data analysis is inherently chaotic, and mistakes occur. What’s important is to hone your ability to spot and rectify errors. If data analytics were straightforward, it might be easier, but it certainly wouldn’t be as interesting. Use the steps we’ve outlined as a framework, stay open-minded, and be creative. If you lose your way, you can refer back to the process to keep yourself on track.

In this post, we’ve covered the main steps of the data analytics process. These core steps can be amended, re-ordered and re-used as you deem fit, but they underpin every data analyst’s work:

  • Define the question —What business problem are you trying to solve? Frame it as a question to help you focus on finding a clear answer.
  • Collect data —Create a strategy for collecting data. Which data sources are most likely to help you solve your business problem?
  • Clean the data —Explore, scrub, tidy, de-dupe, and structure your data as needed. Do whatever you have to! But don’t rush…take your time!
  • Analyze the data —Carry out various analyses to obtain insights. Focus on the four types of data analysis: descriptive, diagnostic, predictive, and prescriptive.
  • Share your results —How best can you share your insights and recommendations? A combination of visualization tools and communication is key.
  • Embrace your mistakes —Mistakes happen. Learn from them. This is what transforms a good data analyst into a great one.

What next? From here, we strongly encourage you to explore the topic on your own. Get creative with the steps in the data analysis process, and see what tools you can find. As long as you stick to the core principles we’ve described, you can create a tailored technique that works for you.

To learn more, check out our free, 5-day data analytics short course . You might also be interested in the following:

  • These are the top 9 data analytics tools
  • 10 great places to find free datasets for your next project
  • How to build a data analytics portfolio


Data Analysis Techniques in Research – Methods, Tools & Examples


Varun Saharawat is a seasoned professional in the fields of SEO and content writing. With a profound knowledge of the intricate aspects of these disciplines, Varun has established himself as a valuable asset in the world of digital marketing and online content creation.


Data analysis techniques in research are essential because they allow researchers to derive meaningful insights from data sets to support their hypotheses or research objectives.

Data Analysis Techniques in Research : While various groups, institutions, and professionals may have diverse approaches to data analysis, a universal definition captures its essence. Data analysis involves refining, transforming, and interpreting raw data to derive actionable insights that guide informed decision-making for businesses.


A straightforward illustration of data analysis emerges when we make everyday decisions, basing our choices on past experiences or predictions of potential outcomes.

If you want to learn more about this topic and acquire valuable skills that will set you apart in today’s data-driven world, we highly recommend enrolling in the Data Analytics Course by Physics Wallah . And as a special offer for our readers, use the coupon code “READER” to get a discount on this course.


What is Data Analysis?

Data analysis is the systematic process of inspecting, cleaning, transforming, and interpreting data with the objective of discovering valuable insights and drawing meaningful conclusions. This process involves several steps:

  • Inspecting : Initial examination of data to understand its structure, quality, and completeness.
  • Cleaning : Removing errors, inconsistencies, or irrelevant information to ensure accurate analysis.
  • Transforming : Converting data into a format suitable for analysis, such as normalization or aggregation.
  • Interpreting : Analyzing the transformed data to identify patterns, trends, and relationships.

Types of Data Analysis Techniques in Research

Data analysis techniques in research are categorized into qualitative and quantitative methods, each with its specific approaches and tools. These techniques are instrumental in extracting meaningful insights, patterns, and relationships from data to support informed decision-making, validate hypotheses, and derive actionable recommendations. Below is an in-depth exploration of the various types of data analysis techniques commonly employed in research:

1) Qualitative Analysis:

Definition: Qualitative analysis focuses on understanding non-numerical data, such as opinions, concepts, or experiences, to derive insights into human behavior, attitudes, and perceptions.

  • Content Analysis: Examines textual data, such as interview transcripts, articles, or open-ended survey responses, to identify themes, patterns, or trends.
  • Narrative Analysis: Analyzes personal stories or narratives to understand individuals’ experiences, emotions, or perspectives.
  • Ethnographic Studies: Involves observing and analyzing cultural practices, behaviors, and norms within specific communities or settings.

2) Quantitative Analysis:

Quantitative analysis emphasizes numerical data and employs statistical methods to explore relationships, patterns, and trends. It encompasses several approaches:

Descriptive Analysis:

  • Frequency Distribution: Represents the number of occurrences of distinct values within a dataset.
  • Central Tendency: Measures such as mean, median, and mode provide insights into the central values of a dataset.
  • Dispersion: Techniques like variance and standard deviation indicate the spread or variability of data.
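
The sketch below computes these descriptive measures for a small, invented set of exam scores using only Python's standard library.

```python
# Descriptive statistics for a hypothetical set of exam scores.
import statistics
from collections import Counter

scores = [72, 85, 85, 90, 68, 77, 85, 92, 60, 77]

print("frequency distribution:", Counter(scores))
print("mean:    ", statistics.mean(scores))
print("median:  ", statistics.median(scores))
print("mode:    ", statistics.mode(scores))
print("variance:", statistics.variance(scores))  # sample variance
print("std dev: ", statistics.stdev(scores))     # sample standard deviation
```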

Diagnostic Analysis:

  • Regression Analysis: Assesses the relationship between dependent and independent variables, enabling prediction or understanding causality.
  • ANOVA (Analysis of Variance): Examines differences between groups to identify significant variations or effects.

Predictive Analysis:

  • Time Series Forecasting: Uses historical data points to predict future trends or outcomes.
  • Machine Learning Algorithms: Techniques like decision trees, random forests, and neural networks predict outcomes based on patterns in data.

Prescriptive Analysis:

  • Optimization Models: Utilizes linear programming, integer programming, or other optimization techniques to identify the best solutions or strategies.
  • Simulation: Mimics real-world scenarios to evaluate various strategies or decisions and determine optimal outcomes.
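
As a small, hedged illustration of an optimization model, the sketch below uses SciPy's linear programming solver to split ten study hours between two hypothetical learning resources so that an assumed (made-up) score gain is maximized.

```python
# Tiny linear program: allocate 10 study hours between two resources.
from scipy.optimize import linprog

# Assumed score gain per hour: 2.0 for video lectures, 3.0 for quizzes.
# linprog minimizes, so the gains are negated to turn this into a maximization.
objective = [-2.0, -3.0]

# Constraints: total hours <= 10, and at most 6 hours of quizzes.
A_ub = [[1, 1], [0, 1]]
b_ub = [10, 6]

result = linprog(objective, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
video_hours, quiz_hours = result.x
print(f"video: {video_hours:.1f} h, quizzes: {quiz_hours:.1f} h, predicted gain: {-result.fun:.1f} points")
```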

Specific Techniques:

  • Monte Carlo Simulation: Models probabilistic outcomes to assess risk and uncertainty.
  • Factor Analysis: Reduces the dimensionality of data by identifying underlying factors or components.
  • Cohort Analysis: Studies specific groups or cohorts over time to understand trends, behaviors, or patterns within these groups.
  • Cluster Analysis: Classifies objects or individuals into homogeneous groups or clusters based on similarities or attributes.
  • Sentiment Analysis: Uses natural language processing and machine learning techniques to determine sentiment, emotions, or opinions from textual data.
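
Of these specific techniques, Monte Carlo simulation is perhaps the simplest to sketch: the example below estimates the probability that a hypothetical project overruns its budget, using entirely made-up cost distributions.

```python
# Monte Carlo sketch: probability that total project cost exceeds a budget.
import random

BUDGET = 120_000
TRIALS = 100_000

over_budget = 0
for _ in range(TRIALS):
    labour   = random.gauss(70_000, 10_000)  # assumed normally distributed cost
    licences = random.gauss(30_000, 5_000)   # assumed normally distributed cost
    overrun  = random.uniform(0, 25_000)     # assumed uniform contingency cost
    if labour + licences + overrun > BUDGET:
        over_budget += 1

print(f"Estimated risk of exceeding the budget: {over_budget / TRIALS:.1%}")
```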

Also Read: AI and Predictive Analytics: Examples, Tools, Uses, Ai Vs Predictive Analytics

Data Analysis Techniques in Research Examples

To provide a clearer understanding of how data analysis techniques are applied in research, let’s consider a hypothetical research study focused on evaluating the impact of online learning platforms on students’ academic performance.

Research Objective:

Determine if students using online learning platforms achieve higher academic performance compared to those relying solely on traditional classroom instruction.

Data Collection:

  • Quantitative Data: Academic scores (grades) of students using online platforms and those using traditional classroom methods.
  • Qualitative Data: Feedback from students regarding their learning experiences, challenges faced, and preferences.

Data Analysis Techniques Applied:

1) Descriptive Analysis:

  • Calculate the mean, median, and mode of academic scores for both groups.
  • Create frequency distributions to represent the distribution of grades in each group.

2) Diagnostic Analysis:

  • Conduct an Analysis of Variance (ANOVA) to determine if there’s a statistically significant difference in academic scores between the two groups (a minimal sketch follows this list).
  • Perform Regression Analysis to assess the relationship between the time spent on online platforms and academic performance.
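
A minimal version of that ANOVA step might look like the sketch below, with small, invented score samples standing in for the two groups.

```python
# One-way ANOVA on hypothetical academic scores for the two groups.
from scipy import stats

online_scores    = [78, 85, 92, 88, 76, 81, 90, 84]
classroom_scores = [72, 80, 75, 68, 74, 79, 71, 77]

f_stat, p_value = stats.f_oneway(online_scores, classroom_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# A p-value below 0.05 would suggest a statistically significant difference
# in mean scores between the online and classroom groups.
```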

3) Predictive Analysis:

  • Utilize Time Series Forecasting to predict future academic performance trends based on historical data.
  • Implement Machine Learning algorithms to develop a predictive model that identifies factors contributing to academic success on online platforms.

4) Prescriptive Analysis:

  • Apply Optimization Models to identify the optimal combination of online learning resources (e.g., video lectures, interactive quizzes) that maximize academic performance.
  • Use Simulation Techniques to evaluate different scenarios, such as varying student engagement levels with online resources, to determine the most effective strategies for improving learning outcomes.

5) Specific Techniques:

  • Conduct Factor Analysis on qualitative feedback to identify common themes or factors influencing students’ perceptions and experiences with online learning.
  • Perform Cluster Analysis to segment students based on their engagement levels, preferences, or academic outcomes, enabling targeted interventions or personalized learning strategies.
  • Apply Sentiment Analysis on textual feedback to categorize students’ sentiments as positive, negative, or neutral regarding online learning experiences.

By applying a combination of qualitative and quantitative data analysis techniques, this research example aims to provide comprehensive insights into the effectiveness of online learning platforms.

Also Read: Learning Path to Become a Data Analyst in 2024

Data Analysis Techniques in Quantitative Research

Quantitative research involves collecting numerical data to examine relationships, test hypotheses, and make predictions. Various data analysis techniques are employed to interpret and draw conclusions from quantitative data. Here are some key data analysis techniques commonly used in quantitative research:

1) Descriptive Statistics:

  • Description: Descriptive statistics are used to summarize and describe the main aspects of a dataset, such as central tendency (mean, median, mode), variability (range, variance, standard deviation), and distribution (skewness, kurtosis).
  • Applications: Summarizing data, identifying patterns, and providing initial insights into the dataset.

2) Inferential Statistics:

  • Description: Inferential statistics involve making predictions or inferences about a population based on a sample of data. This technique includes hypothesis testing, confidence intervals, t-tests, chi-square tests, analysis of variance (ANOVA), regression analysis, and correlation analysis.
  • Applications: Testing hypotheses, making predictions, and generalizing findings from a sample to a larger population.

3) Regression Analysis:

  • Description: Regression analysis is a statistical technique used to model and examine the relationship between a dependent variable and one or more independent variables. Linear regression, multiple regression, logistic regression, and nonlinear regression are common types of regression analysis .
  • Applications: Predicting outcomes, identifying relationships between variables, and understanding the impact of independent variables on the dependent variable.
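
As a minimal illustration, the sketch below fits a simple linear regression with SciPy, relating study hours (independent variable) to an exam score (dependent variable); the data points are invented.

```python
# Simple linear regression on hypothetical data: exam score vs. study hours.
from scipy import stats

hours  = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 74, 79, 83]

result = stats.linregress(hours, scores)
print(f"fitted model: score = {result.intercept:.1f} + {result.slope:.1f} * hours")
print(f"R^2 = {result.rvalue ** 2:.3f}, p-value for the slope = {result.pvalue:.4g}")
```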

4) Correlation Analysis:

  • Description: Correlation analysis is used to measure and assess the strength and direction of the relationship between two or more variables. The Pearson correlation coefficient, Spearman rank correlation coefficient, and Kendall’s tau are commonly used measures of correlation.
  • Applications: Identifying associations between variables and assessing the degree and nature of the relationship.
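
The sketch below computes both a Pearson and a Spearman correlation for the same pair of invented variables; the accompanying p-values indicate whether each association is statistically distinguishable from zero.

```python
# Pearson vs. Spearman correlation on hypothetical data.
from scipy import stats

ad_spend = [10, 12, 15, 18, 22, 25, 30, 40]        # e.g. weekly ad spend (thousand $)
revenue  = [95, 100, 108, 115, 130, 138, 150, 190]

pearson_r, pearson_p   = stats.pearsonr(ad_spend, revenue)
spearman_r, spearman_p = stats.spearmanr(ad_spend, revenue)

print(f"Pearson r    = {pearson_r:.2f} (p = {pearson_p:.4f})")
print(f"Spearman rho = {spearman_r:.2f} (p = {spearman_p:.4f})")
```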

5) Factor Analysis:

  • Description: Factor analysis is a multivariate statistical technique used to identify and analyze underlying relationships or factors among a set of observed variables. It helps in reducing the dimensionality of data and identifying latent variables or constructs.
  • Applications: Identifying underlying factors or constructs, simplifying data structures, and understanding the underlying relationships among variables.

6) Time Series Analysis:

  • Description: Time series analysis involves analyzing data collected or recorded over a specific period at regular intervals to identify patterns, trends, and seasonality. Techniques such as moving averages, exponential smoothing, autoregressive integrated moving average (ARIMA), and Fourier analysis are used.
  • Applications: Forecasting future trends, analyzing seasonal patterns, and understanding time-dependent relationships in data.
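
A minimal time series sketch in Pandas, using invented monthly sales figures, might compute a moving average and an exponentially weighted average to smooth out short-term noise.

```python
# Smoothing a hypothetical monthly sales series with pandas.
import pandas as pd

sales = pd.Series(
    [200, 215, 190, 230, 250, 245, 270, 300, 290, 320, 340, 360],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),  # month-start timestamps
)

smoothed = pd.DataFrame({
    "sales": sales,
    "moving_avg_3m": sales.rolling(window=3).mean(),  # simple 3-month moving average
    "exp_smoothed": sales.ewm(alpha=0.5).mean(),      # basic exponential smoothing
})
print(smoothed)
```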

7) ANOVA (Analysis of Variance):

  • Description: Analysis of variance (ANOVA) is a statistical technique used to analyze and compare the means of two or more groups or treatments to determine if they are statistically different from each other. One-way ANOVA, two-way ANOVA, and MANOVA (Multivariate Analysis of Variance) are common types of ANOVA.
  • Applications: Comparing group means, testing hypotheses, and determining the effects of categorical independent variables on a continuous dependent variable.

8) Chi-Square Tests:

  • Description: Chi-square tests are non-parametric statistical tests used to assess the association between categorical variables in a contingency table. The Chi-square test of independence, goodness-of-fit test, and test of homogeneity are common chi-square tests.
  • Applications: Testing relationships between categorical variables, assessing goodness-of-fit, and evaluating independence.
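
A minimal chi-square test of independence on an invented contingency table (acquisition channel vs. purchase) could be sketched with SciPy as follows.

```python
# Chi-square test of independence on a hypothetical 2x2 contingency table.
from scipy.stats import chi2_contingency

#            purchased   did not purchase
observed = [[90,         210],   # visitors from email
            [60,         340]]   # visitors from social media

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")

# A small p-value suggests the purchase rate depends on the acquisition channel.
```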

These quantitative data analysis techniques provide researchers with valuable tools and methods to analyze, interpret, and derive meaningful insights from numerical data. The selection of a specific technique often depends on the research objectives, the nature of the data, and the underlying assumptions of the statistical methods being used.

Also Read: Analysis vs. Analytics: How Are They Different?

Data Analysis Methods

Data analysis methods refer to the techniques and procedures used to analyze, interpret, and draw conclusions from data. These methods are essential for transforming raw data into meaningful insights, facilitating decision-making processes, and driving strategies across various fields. Here are some common data analysis methods:

1) Descriptive Statistics:

  • Description: Descriptive statistics summarize and organize data to provide a clear and concise overview of the dataset. Measures such as mean, median, mode, range, variance, and standard deviation are commonly used.

2) Inferential Statistics:

  • Description: Inferential statistics involve making predictions or inferences about a population based on a sample of data. Techniques such as hypothesis testing, confidence intervals, and regression analysis are used.

3) Exploratory Data Analysis (EDA):

  • Description: EDA techniques involve visually exploring and analyzing data to discover patterns, relationships, anomalies, and insights. Methods such as scatter plots, histograms, box plots, and correlation matrices are utilized.
  • Applications: Identifying trends, patterns, outliers, and relationships within the dataset.

4) Predictive Analytics:

  • Description: Predictive analytics use statistical algorithms and machine learning techniques to analyze historical data and make predictions about future events or outcomes. Techniques such as regression analysis, time series forecasting, and machine learning algorithms (e.g., decision trees, random forests, neural networks) are employed.
  • Applications: Forecasting future trends, predicting outcomes, and identifying potential risks or opportunities.

5) Prescriptive Analytics:

  • Description: Prescriptive analytics involve analyzing data to recommend actions or strategies that optimize specific objectives or outcomes. Optimization techniques, simulation models, and decision-making algorithms are utilized.
  • Applications: Recommending optimal strategies, decision-making support, and resource allocation.

6) Qualitative Data Analysis:

  • Description: Qualitative data analysis involves analyzing non-numerical data, such as text, images, videos, or audio, to identify themes, patterns, and insights. Methods such as content analysis, thematic analysis, and narrative analysis are used.
  • Applications: Understanding human behavior, attitudes, perceptions, and experiences.

7) Big Data Analytics:

  • Description: Big data analytics methods are designed to analyze large volumes of structured and unstructured data to extract valuable insights. Technologies such as Hadoop, Spark, and NoSQL databases are used to process and analyze big data.
  • Applications: Analyzing large datasets, identifying trends, patterns, and insights from big data sources.

8) Text Analytics:

  • Description: Text analytics methods involve analyzing textual data, such as customer reviews, social media posts, emails, and documents, to extract meaningful information and insights. Techniques such as sentiment analysis, text mining, and natural language processing (NLP) are used.
  • Applications: Analyzing customer feedback, monitoring brand reputation, and extracting insights from textual data sources.

These data analysis methods are instrumental in transforming data into actionable insights, informing decision-making processes, and driving organizational success across various sectors, including business, healthcare, finance, marketing, and research. The selection of a specific method often depends on the nature of the data, the research objectives, and the analytical requirements of the project or organization.

Also Read: Quantitative Data Analysis: Types, Analysis & Examples

Data Analysis Tools

Data analysis tools are essential instruments that facilitate the process of examining, cleaning, transforming, and modeling data to uncover useful information, make informed decisions, and drive strategies. Here are some prominent data analysis tools widely used across various industries:

1) Microsoft Excel:

  • Description: A spreadsheet software that offers basic to advanced data analysis features, including pivot tables, data visualization tools, and statistical functions.
  • Applications: Data cleaning, basic statistical analysis, visualization, and reporting.

2) R Programming Language:

  • Description: An open-source programming language specifically designed for statistical computing and data visualization.
  • Applications: Advanced statistical analysis, data manipulation, visualization, and machine learning.

3) Python (with Libraries like Pandas, NumPy, Matplotlib, and Seaborn):

  • Description: A versatile programming language with libraries that support data manipulation, analysis, and visualization.
  • Applications: Data cleaning, statistical analysis, machine learning, and data visualization.

4) SPSS (Statistical Package for the Social Sciences):

  • Description: A comprehensive statistical software suite used for data analysis, data mining, and predictive analytics.
  • Applications: Descriptive statistics, hypothesis testing, regression analysis, and advanced analytics.

5) SAS (Statistical Analysis System):

  • Description: A software suite used for advanced analytics, multivariate analysis, and predictive modeling.
  • Applications: Data management, statistical analysis, predictive modeling, and business intelligence.

6) Tableau:

  • Description: A data visualization tool that allows users to create interactive and shareable dashboards and reports.
  • Applications: Data visualization , business intelligence , and interactive dashboard creation.

7) Power BI:

  • Description: A business analytics tool developed by Microsoft that provides interactive visualizations and business intelligence capabilities.
  • Applications: Data visualization, business intelligence, reporting, and dashboard creation.

8) SQL (Structured Query Language) Databases (e.g., MySQL, PostgreSQL, Microsoft SQL Server):

  • Description: Database management systems that support data storage, retrieval, and manipulation using SQL queries.
  • Applications: Data retrieval, data cleaning, data transformation, and database management.

9) Apache Spark:

  • Description: A fast and general-purpose distributed computing system designed for big data processing and analytics.
  • Applications: Big data processing, machine learning, data streaming, and real-time analytics.

10) IBM SPSS Modeler:

  • Description: A data mining software application used for building predictive models and conducting advanced analytics.
  • Applications: Predictive modeling, data mining, statistical analysis, and decision optimization.

These tools serve various purposes and cater to different data analysis needs, from basic statistical analysis and data visualization to advanced analytics, machine learning, and big data processing. The choice of a specific tool often depends on the nature of the data, the complexity of the analysis, and the specific requirements of the project or organization.
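As a small illustration of how the Python stack listed above fits together, the sketch below loads a toy dataset into pandas, prints summary statistics, and draws a bar chart with Matplotlib. The column names and values are invented for the example.

```python
# Summarise and visualise a toy dataset with pandas and Matplotlib.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "region": ["North", "South", "East", "West", "North", "South"],
    "score": [7.2, 6.8, 8.1, 5.9, 7.5, 6.4],          # invented survey scores
})

print(df.describe())                                   # descriptive statistics for numeric columns

region_means = df.groupby("region")["score"].mean()    # average score per region
region_means.plot(kind="bar", title="Mean score by region")
plt.ylabel("Mean score")
plt.tight_layout()
plt.show()
```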

Also Read: How to Analyze Survey Data: Methods & Examples

Importance of Data Analysis in Research

The importance of data analysis in research cannot be overstated; it serves as the backbone of any scientific investigation or study. Here are several key reasons why data analysis is crucial in the research process:

  • Data analysis helps ensure that the results obtained are valid and reliable. By systematically examining the data, researchers can identify any inconsistencies or anomalies that may affect the credibility of the findings.
  • Effective data analysis provides researchers with the necessary information to make informed decisions. By interpreting the collected data, researchers can draw conclusions, make predictions, or formulate recommendations based on evidence rather than intuition or guesswork.
  • Data analysis allows researchers to identify patterns, trends, and relationships within the data. This can lead to a deeper understanding of the research topic, enabling researchers to uncover insights that may not be immediately apparent.
  • In empirical research, data analysis plays a critical role in testing hypotheses. Researchers collect data to either support or refute their hypotheses, and data analysis provides the tools and techniques to evaluate these hypotheses rigorously.
  • Transparent and well-executed data analysis enhances the credibility of research findings. By clearly documenting the data analysis methods and procedures, researchers allow others to replicate the study, thereby contributing to the reproducibility of research findings.
  • In fields such as business or healthcare, data analysis helps organizations allocate resources more efficiently. By analyzing data on consumer behavior, market trends, or patient outcomes, organizations can make strategic decisions about resource allocation, budgeting, and planning.
  • In public policy and social sciences, data analysis is instrumental in developing and evaluating policies and interventions. By analyzing data on social, economic, or environmental factors, policymakers can assess the effectiveness of existing policies and inform the development of new ones.
  • Data analysis allows for continuous improvement in research methods and practices. By analyzing past research projects, identifying areas for improvement, and implementing changes based on data-driven insights, researchers can refine their approaches and enhance the quality of future research endeavors.

However, it is important to remember that mastering these techniques requires practice and continuous learning. That’s why we highly recommend the Data Analytics Course by Physics Wallah . Not only does it cover all the fundamentals of data analysis, but it also provides hands-on experience with various tools such as Excel, Python, and Tableau. Plus, if you use the “ READER ” coupon code at checkout, you can get a special discount on the course.


Data Analysis Techniques in Research FAQs

What are the 5 techniques for data analysis?

The five techniques for data analysis are descriptive analysis, diagnostic analysis, predictive analysis, prescriptive analysis, and qualitative analysis.

What are techniques of data analysis in research?

Techniques of data analysis in research encompass both qualitative and quantitative methods. These techniques involve processes like summarizing raw data, investigating causes of events, forecasting future outcomes, offering recommendations based on predictions, and examining non-numerical data to understand concepts or experiences.

What are the 3 methods of data analysis?

The three primary methods of data analysis are qualitative analysis, quantitative analysis, and mixed-methods analysis.

What are the four types of data analysis techniques?

The four types of data analysis techniques are descriptive analysis, diagnostic analysis, predictive analysis, and prescriptive analysis.


Indian J Anaesth. 2016 Sep; 60(9)

Basic statistical tools in research and data analysis

Zulfiqar Ali

Department of Anaesthesiology, Division of Neuroanaesthesiology, Sheri Kashmir Institute of Medical Sciences, Soura, Srinagar, Jammu and Kashmir, India

S Bala Bhaskar

1 Department of Anaesthesiology and Critical Care, Vijayanagar Institute of Medical Sciences, Bellary, Karnataka, India

Statistical methods involved in carrying out a study include planning, designing, collecting data, analysing, drawing meaningful interpretation and reporting of the research findings. Statistical analysis gives meaning to otherwise meaningless numbers, thereby breathing life into lifeless data. The results and inferences are precise only if proper statistical tests are used. This article will try to acquaint the reader with the basic research tools that are utilised while conducting various studies. The article covers a brief outline of the variables, an understanding of quantitative and qualitative variables and the measures of central tendency. An idea of the sample size estimation, power analysis and the statistical errors is given. Finally, there is a summary of parametric and non-parametric tests used for data analysis.

INTRODUCTION

Statistics is a branch of science that deals with the collection, organisation, analysis of data and drawing of inferences from the samples to the whole population.[ 1 ] This requires a proper design of the study, an appropriate selection of the study sample and choice of a suitable statistical test. An adequate knowledge of statistics is necessary for proper designing of an epidemiological study or a clinical trial. Improper statistical methods may result in erroneous conclusions which may lead to unethical practice.[ 2 ]

Variable is a characteristic that varies from one individual member of population to another individual.[ 3 ] Variables such as height and weight are measured by some type of scale, convey quantitative information and are called as quantitative variables. Sex and eye colour give qualitative information and are called as qualitative variables[ 3 ] [ Figure 1 ].

[Figure 1: Classification of variables]

Quantitative variables

Quantitative or numerical data are subdivided into discrete and continuous measurements. Discrete numerical data are recorded as a whole number such as 0, 1, 2, 3,… (integer), whereas continuous data can assume any value. Observations that can be counted constitute the discrete data and observations that can be measured constitute the continuous data. Examples of discrete data are number of episodes of respiratory arrests or the number of re-intubations in an intensive care unit. Similarly, examples of continuous data are the serial serum glucose levels, partial pressure of oxygen in arterial blood and the oesophageal temperature.

A hierarchical scale of increasing precision can be used for observing and recording the data which is based on categorical, ordinal, interval and ratio scales [ Figure 1 ].

Categorical or nominal variables are unordered. The data are merely classified into categories and cannot be arranged in any particular order. If only two categories exist (as in gender male and female), it is called as a dichotomous (or binary) data. The various causes of re-intubation in an intensive care unit due to upper airway obstruction, impaired clearance of secretions, hypoxemia, hypercapnia, pulmonary oedema and neurological impairment are examples of categorical variables.

Ordinal variables have a clear ordering between the variables. However, the ordered data may not have equal intervals. Examples are the American Society of Anesthesiologists status or Richmond agitation-sedation scale.

Interval variables are similar to an ordinal variable, except that the intervals between the values of the interval variable are equally spaced. A good example of an interval scale is the Fahrenheit degree scale used to measure temperature. With the Fahrenheit scale, the difference between 70° and 75° is equal to the difference between 80° and 85°: The units of measurement are equal throughout the full range of the scale.

Ratio scales are similar to interval scales, in that equal differences between scale values have equal quantitative meaning. However, ratio scales also have a true zero point, which gives them an additional property. For example, the system of centimetres is an example of a ratio scale. There is a true zero point and the value of 0 cm means a complete absence of length. The thyromental distance of 6 cm in an adult may be twice that of a child in whom it may be 3 cm.

STATISTICS: DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive statistics[ 4 ] try to describe the relationship between variables in a sample or population. Descriptive statistics provide a summary of data in the form of mean, median and mode. Inferential statistics[ 4 ] use a random sample of data taken from a population to describe and make inferences about the whole population. This is valuable when it is not possible to examine each member of an entire population. Examples of descriptive and inferential statistics are illustrated in Table 1 .

[Table 1: Example of descriptive and inferential statistics]

Descriptive statistics

The extent to which the observations cluster around a central location is described by the central tendency and the spread towards the extremes is described by the degree of dispersion.

Measures of central tendency

The measures of central tendency are mean, median and mode.[ 6 ] Mean (or the arithmetic average) is the sum of all the scores divided by the number of scores. Mean may be influenced profoundly by the extreme variables. For example, the average stay of organophosphorus poisoning patients in ICU may be influenced by a single patient who stays in ICU for around 5 months because of septicaemia. The extreme values are called outliers. The formula for the mean is

\bar{x} = \frac{\sum x}{n}

where x = each observation and n = number of observations. Median[ 6 ] is defined as the middle of a distribution in ranked data (with half of the variables in the sample above and half below the median value), while mode is the most frequently occurring variable in a distribution. Range defines the spread, or variability, of a sample.[ 7 ] It is described by the minimum and maximum values of the variables. If we rank the data and, after ranking, group the observations into percentiles, we can get better information on the pattern of spread of the variables. In percentiles, we rank the observations into 100 equal parts. We can then describe 25%, 50%, 75% or any other percentile amount. The median is the 50th percentile. The interquartile range is the middle 50% of the observations about the median (25th-75th percentile).

Variance[ 7 ] is a measure of how spread out the distribution is. It gives an indication of how closely an individual observation clusters about the mean value. The variance of a population is defined by the following formula:

\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - X)^2}{N}

where σ² is the population variance, X is the population mean, X_i is the i-th element from the population and N is the number of elements in the population. The variance of a sample is defined by a slightly different formula:

s^2 = \frac{\sum_{i=1}^{n} (x_i - x)^2}{n - 1}

where s² is the sample variance, x is the sample mean, x_i is the i-th element from the sample and n is the number of elements in the sample. Note that the formula for the variance of a population uses 'N' as the denominator, whereas the sample variance uses 'n − 1'. The expression 'n − 1' is known as the degrees of freedom and is one less than the number of observations: each observation is free to vary, except the last one, which must take a defined value. The variance is measured in squared units. To make the interpretation of the data simple and to retain the basic unit of observation, the square root of the variance is used. The square root of the variance is the standard deviation (SD).[ 8 ] The SD of a population is defined by the following formula:

\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - X)^2}{N}}

where σ is the population SD, X is the population mean, X_i is the i-th element from the population and N is the number of elements in the population. The SD of a sample is defined by a slightly different formula:

s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x)^2}{n - 1}}

where s is the sample SD, x is the sample mean, x_i is the i-th element from the sample and n is the number of elements in the sample. An example of the calculation of the variance and SD is illustrated in Table 2 .

[Table 2: Example of mean, variance, standard deviation]
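These calculations are easy to reproduce in software. The short sketch below uses Python's built-in statistics module on a made-up set of observations to compute the sample mean, variance (with the n − 1 denominator) and SD.

```python
# Sample mean, variance (n - 1 denominator) and standard deviation
# for a small, made-up set of observations.
import statistics

observations = [4.2, 5.1, 4.8, 5.6, 4.9, 5.3]

mean = statistics.mean(observations)           # sum of observations / n
variance = statistics.variance(observations)   # sum of squared deviations / (n - 1)
sd = statistics.stdev(observations)            # square root of the sample variance

print(f"mean = {mean:.3f}, variance = {variance:.3f}, SD = {sd:.3f}")
```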

Normal distribution or Gaussian distribution

Most of the biological variables usually cluster around a central value, with symmetrical positive and negative deviations about this point.[ 1 ] The standard normal distribution curve is symmetrical and bell-shaped. In a normal distribution curve, about 68% of the scores are within 1 SD of the mean, around 95% are within 2 SDs of the mean and 99.7% are within 3 SDs of the mean [ Figure 2 ].

[Figure 2: Normal distribution curve]
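The coverage figures quoted above can be checked numerically; the brief sketch below uses SciPy's standard normal distribution, purely as an illustration.

```python
# Verify the 68-95-99.7 rule for the standard normal distribution.
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)    # probability of lying within k SDs of the mean
    print(f"within {k} SD: {coverage:.1%}")  # approximately 68.3%, 95.4%, 99.7%
```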

Skewed distribution

It is a distribution with an asymmetry of the variables about its mean. In a negatively skewed distribution [ Figure 3 ], the mass of the distribution is concentrated on the right of the figure, leading to a longer left tail. In a positively skewed distribution [ Figure 3 ], the mass of the distribution is concentrated on the left of the figure, leading to a longer right tail.

[Figure 3: Curves showing negatively skewed and positively skewed distributions]

Inferential statistics

In inferential statistics, data are analysed from a sample to make inferences in the larger collection of the population. The purpose is to answer or test the hypotheses. A hypothesis (plural hypotheses) is a proposed explanation for a phenomenon. Hypothesis tests are thus procedures for making rational decisions about the reality of observed effects.

Probability is the measure of the likelihood that an event will occur. Probability is quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty).

In inferential statistics, the term ‘null hypothesis’ ( H 0 ‘ H-naught ,’ ‘ H-null ’) denotes that there is no relationship (difference) between the population variables in question.[ 9 ]

The alternative hypothesis (H1 or Ha) denotes that a relationship (difference) between the population variables is expected to be true.[ 9 ]

The P value (or the calculated probability) is the probability of the observed event occurring by chance if the null hypothesis is true. The P value is a number between 0 and 1 and is interpreted by researchers in deciding whether to reject or retain the null hypothesis [ Table 3 ].

[Table 3: P values with interpretation]

If the P value is less than the arbitrarily chosen value (known as α or the significance level), the null hypothesis (H0) is rejected [ Table 4 ]. However, if the null hypothesis (H0) is incorrectly rejected, this is known as a Type I error.[ 11 ] Further details regarding alpha error, beta error and sample size calculation and factors influencing them are dealt with in another section of this issue by Das S et al .[ 12 ]

[Table 4: Illustration for null hypothesis]

PARAMETRIC AND NON-PARAMETRIC TESTS

Numerical data (quantitative variables) that are normally distributed are analysed with parametric tests.[ 13 ]

Two most basic prerequisites for parametric statistical analysis are:

  • The assumption of normality which specifies that the means of the sample group are normally distributed
  • The assumption of equal variance which specifies that the variances of the samples and of their corresponding population are equal.

However, if the distribution of the sample is skewed towards one side or the distribution is unknown due to the small sample size, non-parametric[ 14 ] statistical techniques are used. Non-parametric tests are used to analyse ordinal and categorical data.

Parametric tests

The parametric tests assume that the data are on a quantitative (numerical) scale, with a normal distribution of the underlying population. The samples have the same variance (homogeneity of variances). The samples are randomly drawn from the population, and the observations within a group are independent of each other. The commonly used parametric tests are the Student's t -test, analysis of variance (ANOVA) and repeated measures ANOVA.

Student's t -test

Student's t -test is used to test the null hypothesis that there is no difference between the means of two groups. It is used in three circumstances:

  • To test if a sample mean (as an estimate of a population mean) differs significantly from a given population mean (the one-sample t -test). The formula is:

t = \frac{\bar{X} - \mu}{SE(\bar{X})}

where \bar{X} = sample mean, \mu = population mean and SE = standard error of the mean.

  • To test if the population means estimated by two independent samples differ significantly (the unpaired t -test). The formula is:

t = \frac{\bar{X}_1 - \bar{X}_2}{SE(\bar{X}_1 - \bar{X}_2)}

where \bar{X}_1 - \bar{X}_2 is the difference between the means of the two groups and SE denotes the standard error of this difference.

  • To test if the population means estimated by two dependent samples differ significantly (the paired t -test). A usual setting for paired t -test is when measurements are made on the same subjects before and after a treatment.

The formula for paired t -test is:

t = \frac{\bar{d}}{SE(\bar{d})}

where \bar{d} is the mean difference and SE denotes the standard error of this difference.

The group variances can be compared using the F -test. The F -test is the ratio of the variances (var 1/var 2). If F differs significantly from 1.0, then it is concluded that the group variances differ significantly.
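In practice, these tests are usually run in statistical software rather than by hand. The sketch below shows one way to perform a one-sample, an unpaired and a paired t-test with SciPy; the measurements are invented for illustration.

```python
# One-sample, unpaired and paired t-tests with SciPy (invented data).
from scipy import stats

group_a = [5.1, 5.4, 4.9, 5.6, 5.2, 5.0]    # e.g. measurements under treatment A
group_b = [4.6, 4.8, 4.5, 5.0, 4.7, 4.4]    # e.g. measurements under treatment B
before  = [80, 82, 78, 85, 79, 81]          # e.g. values before an intervention
after   = [76, 79, 75, 80, 77, 78]          # values after, on the same subjects

t1, p1 = stats.ttest_1samp(group_a, popmean=5.0)   # sample mean vs a known population mean
t2, p2 = stats.ttest_ind(group_a, group_b)         # unpaired (independent samples) t-test
t3, p3 = stats.ttest_rel(before, after)            # paired t-test (same subjects measured twice)

print(f"one-sample: t = {t1:.2f}, P = {p1:.3f}")
print(f"unpaired:   t = {t2:.2f}, P = {p2:.3f}")
print(f"paired:     t = {t3:.2f}, P = {p3:.3f}")
```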

Analysis of variance

The Student's t -test cannot be used for comparison of three or more groups. The purpose of ANOVA is to test if there is any significant difference between the means of two or more groups.

In ANOVA, we study two variances – (a) between-group variability and (b) within-group variability. The within-group variability (error variance) is the variation that cannot be accounted for in the study design. It is based on random differences present in our samples.

However, the between-group (or effect variance) is the result of our treatment. These two estimates of variances are compared using the F-test.

A simplified formula for the F statistic is:

F = \frac{MS_b}{MS_w}

where MS_b is the mean square between the groups and MS_w is the mean square within the groups.
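A one-way ANOVA can be run in the same way; the sketch below uses SciPy's f_oneway on three invented groups and reports the F ratio of between-group to within-group mean squares.

```python
# One-way ANOVA: does at least one of three group means differ? (invented data)
from scipy import stats

group_1 = [23, 25, 21, 27, 24]
group_2 = [30, 28, 32, 29, 31]
group_3 = [26, 27, 25, 28, 26]

f_stat, p_value = stats.f_oneway(group_1, group_2, group_3)
print(f"F = {f_stat:.2f}, P = {p_value:.4f}")   # a small P suggests the means are not all equal
```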

Repeated measures analysis of variance

As with ANOVA, repeated measures ANOVA analyses the equality of means of three or more groups. However, a repeated measure ANOVA is used when all variables of a sample are measured under different conditions or at different points in time.

As the variables are measured from a sample at different points of time, the measurement of the dependent variable is repeated. Using a standard ANOVA in this case is not appropriate because it fails to model the correlation between the repeated measures: The data violate the ANOVA assumption of independence. Hence, in the measurement of repeated dependent variables, repeated measures ANOVA should be used.

Non-parametric tests

When the assumptions of normality are not met, and the sample means are not normally distributed, parametric tests can lead to erroneous results. Non-parametric tests (distribution-free tests) are used in such situations as they do not require the normality assumption.[ 15 ] Non-parametric tests may fail to detect a significant difference when compared with a parametric test. That is, they usually have less power.

As is done for the parametric tests, the test statistic is compared with known values for the sampling distribution of that statistic and the null hypothesis is accepted or rejected. The types of non-parametric analysis techniques and the corresponding parametric analysis techniques are delineated in Table 5 .

[Table 5: Analogue of parametric and non-parametric tests]

Median test for one sample: The sign test and Wilcoxon's signed rank test

The sign test and Wilcoxon's signed rank test are used for median tests of one sample. These tests examine whether one instance of sample data is greater or smaller than the median reference value.

The sign test examines a hypothesis about the median θ0 of a population. It tests the null hypothesis H0: θ = θ0. When the observed value (Xi) is greater than the reference value (θ0), it is marked with a + sign. If the observed value is smaller than the reference value, it is marked with a − sign. If the observed value is equal to the reference value (θ0), it is eliminated from the sample.

If the null hypothesis is true, there will be an equal number of + signs and − signs.

The sign test ignores the actual values of the data and only uses + or − signs. Therefore, it is useful when it is difficult to measure the values.

Wilcoxon's signed rank test

There is a major limitation of sign test as we lose the quantitative information of the given data and merely use the + or – signs. Wilcoxon's signed rank test not only examines the observed values in comparison with θ0 but also takes into consideration the relative sizes, adding more statistical power to the test. As in the sign test, if there is an observed value that is equal to the reference value θ0, this observed value is eliminated from the sample.

Wilcoxon's rank sum test (the two-sample counterpart, equivalent to the Mann-Whitney test described below) ranks all data points in order, calculates the rank sum of each sample and compares the difference in the rank sums.

Mann-Whitney test

It is used to test the null hypothesis that two samples have the same median or, alternatively, whether observations in one sample tend to be larger than observations in the other.

Mann–Whitney test compares all data (xi) belonging to the X group and all data (yi) belonging to the Y group and calculates the probability of xi being greater than yi: P (xi > yi). The null hypothesis states that P (xi > yi) = P (xi < yi) = 1/2, while the alternative hypothesis states that P (xi > yi) ≠ 1/2.

Kolmogorov-Smirnov test

The two-sample Kolmogorov-Smirnov (KS) test was designed as a generic method to test whether two random samples are drawn from the same distribution. The null hypothesis of the KS test is that both distributions are identical. The statistic of the KS test is a distance between the two empirical distributions, computed as the maximum absolute difference between their cumulative curves.

Kruskal-Wallis test

The Kruskal–Wallis test is a non-parametric test to analyse the variance.[ 14 ] It analyses if there is any difference in the median values of three or more independent samples. The data values are ranked in an increasing order, and the rank sums calculated followed by calculation of the test statistic.
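The sketch below illustrates how the Mann-Whitney and Kruskal-Wallis tests can be carried out in SciPy on small, invented samples.

```python
# Non-parametric comparisons with SciPy (invented data).
from scipy import stats

sample_x = [3, 5, 4, 6, 7, 5, 4]
sample_y = [6, 8, 7, 9, 8, 7, 6]
sample_z = [4, 5, 5, 6, 5, 4, 6]

u_stat, p_mw = stats.mannwhitneyu(sample_x, sample_y)       # two independent samples
h_stat, p_kw = stats.kruskal(sample_x, sample_y, sample_z)  # three or more independent samples

print(f"Mann-Whitney U = {u_stat:.1f}, P = {p_mw:.3f}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, P = {p_kw:.3f}")
```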

Jonckheere test

In contrast to the Kruskal–Wallis test, the Jonckheere test assumes an a priori ordering of the groups, which gives it more statistical power than the Kruskal–Wallis test.[ 14 ]

Friedman test

The Friedman test is a non-parametric test for testing the difference between several related samples. It is an alternative to repeated measures ANOVA and is used when the same parameter has been measured under different conditions on the same subjects.[ 13 ]

Tests to analyse the categorical data

Chi-square test, Fisher's exact test and McNemar's test are used to analyse categorical or nominal variables. The Chi-square test compares the frequencies and tests whether the observed data differ significantly from the expected data if there were no differences between groups (i.e., the null hypothesis). It is calculated as the sum of the squared difference between the observed ( O ) and the expected ( E ) data (or the deviation, d ) divided by the expected data:

\chi^2 = \sum \frac{(O - E)^2}{E}

A Yates correction factor is used when the sample size is small. Fisher's exact test is used to determine if there are non-random associations between two categorical variables. It does not assume random sampling, and instead of referring a calculated statistic to a sampling distribution, it calculates an exact probability. McNemar's test is used for paired nominal data. It is applied to a 2 × 2 table with paired-dependent samples and is used to determine whether the row and column frequencies are equal (that is, whether there is 'marginal homogeneity'). The null hypothesis is that the paired proportions are equal. The Mantel-Haenszel Chi-square test is a multivariate test as it analyses multiple grouping variables. It stratifies according to the nominated confounding variables and identifies any that affect the primary outcome variable. If the outcome variable is dichotomous, then logistic regression is used.
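The categorical tests can be illustrated in the same way; the sketch below applies the Chi-square test and Fisher's exact test to a hypothetical 2 × 2 contingency table using SciPy.

```python
# Chi-square and Fisher's exact test on a hypothetical 2 x 2 table.
from scipy import stats

# Rows: treatment vs control; columns: complication vs no complication (invented counts).
table = [[12, 38],
         [ 5, 45]]

chi2, p_chi, dof, expected = stats.chi2_contingency(table)  # applies Yates' correction for 2 x 2 tables by default
odds_ratio, p_fisher = stats.fisher_exact(table)

print(f"Chi-square = {chi2:.2f} (df = {dof}), P = {p_chi:.3f}")
print(f"Fisher's exact: odds ratio = {odds_ratio:.2f}, P = {p_fisher:.3f}")
```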

SOFTWARES AVAILABLE FOR STATISTICS, SAMPLE SIZE CALCULATION AND POWER ANALYSIS

Numerous statistical software systems are available currently. The commonly used software systems are Statistical Package for the Social Sciences (SPSS – manufactured by IBM Corporation), Statistical Analysis System (SAS – developed by SAS Institute, North Carolina, United States of America), R (designed by Ross Ihaka and Robert Gentleman from the R core team), Minitab (developed by Minitab Inc.), Stata (developed by StataCorp) and MS Excel (developed by Microsoft).

There are a number of web resources which are related to statistical power analyses. A few are:

  • StatPages.net – provides links to a number of online power calculators
  • G-Power – provides a downloadable power analysis program that runs under DOS
  • Power analysis for ANOVA designs an interactive site that calculates power or sample size needed to attain a given power for one effect in a factorial ANOVA design
  • SPSS makes a program called SamplePower. It gives an output of a complete report on the computer screen which can be copied and pasted into another document.

It is important that a researcher knows the concepts of the basic statistical methods used for conduct of a research study. This will help to conduct an appropriately well-designed study leading to valid and reliable results. Inappropriate use of statistical techniques may lead to faulty conclusions, inducing errors and undermining the significance of the article. Bad statistics may lead to bad research, and bad research may lead to unethical practice. Hence, an adequate knowledge of statistics and the appropriate use of statistical tests are important. An appropriate knowledge about the basic statistical methods will go a long way in improving the research designs and producing quality medical research which can be utilised for formulating the evidence-based guidelines.

Financial support and sponsorship

Conflicts of interest

There are no conflicts of interest.


What is data analysis? Examples and how to get started


Even with years of professional experience working with data, the term "data analysis" still sets off a panic button in my soul. And yes, when it comes to serious data analysis for your business, you'll eventually want data scientists on your side. But if you're just getting started, no panic attacks are required.


Quick review: What is data analysis?

Data analysis is the process of examining, filtering, adapting, and modeling data to help solve problems. Data analysis helps determine what is and isn't working, so you can make the changes needed to achieve your business goals. 

Keep in mind that data analysis includes analyzing both quantitative data (e.g., profits and sales) and qualitative data (e.g., surveys and case studies) to paint the whole picture. Here are two simple examples (of a nuanced topic) to show you what I mean.

An example of quantitative data analysis is an online jewelry store owner using inventory data to forecast and improve reordering accuracy. The owner looks at their sales from the past six months and sees that, on average, they sold 210 gold pieces and 105 silver pieces per month, but they only had 100 gold pieces and 100 silver pieces in stock. By collecting and analyzing inventory data on these SKUs, they're forecasting to improve reordering accuracy. The next time they order inventory, they order twice as many gold pieces as silver to meet customer demand.
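A minimal sketch of that calculation, assuming the owner keeps six months of sales counts in two plain Python lists (the figures here are illustrative), might look like this:

```python
# Forecast next month's reorder quantities from average monthly sales.
gold_sales = [220, 200, 215, 205, 210, 210]         # hypothetical units sold per month
silver_sales = [110, 100, 105, 100, 110, 105]

avg_gold = sum(gold_sales) / len(gold_sales)        # about 210 pieces per month
avg_silver = sum(silver_sales) / len(silver_sales)  # about 105 pieces per month

print(f"Reorder roughly {round(avg_gold)} gold and {round(avg_silver)} silver pieces")
print(f"Gold-to-silver ratio: {avg_gold / avg_silver:.1f}x")  # roughly 2x, matching demand
```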

An example of qualitative data analysis is a fitness studio owner collecting customer feedback to improve class offerings. The studio owner sends out an open-ended survey asking customers what types of exercises they enjoy the most. The owner then performs qualitative content analysis to identify the most frequently suggested exercises and incorporates these into future workout classes.

Why is data analysis important?

Here's why it's worth implementing data analysis for your business:

Understand your target audience: You might think you know how to best target your audience, but are your assumptions backed by data? Data analysis can help answer questions like, "What demographics define my target audience?" or "What is my audience motivated by?"

Inform decisions: You don't need to toss and turn over a decision when the data points clearly to the answer. For instance, a restaurant could analyze which dishes on the menu are selling the most, helping them decide which ones to keep and which ones to change.

Adjust budgets: Similarly, data analysis can highlight areas in your business that are performing well and are worth investing more in, as well as areas that aren't generating enough revenue and should be cut. For example, a B2B software company might discover their product for enterprises is thriving while their small business solution lags behind. This discovery could prompt them to allocate more budget toward the enterprise product, resulting in better resource utilization.

Identify and solve problems: Let's say a cell phone manufacturer notices data showing a lot of customers returning a certain model. When they investigate, they find that model also happens to have the highest number of crashes. Once they identify and solve the technical issue, they can reduce the number of returns.

Types of data analysis (with examples)

There are five main types of data analysis—with increasingly scary-sounding names. Each one serves a different purpose, so take a look to see which makes the most sense for your situation. It's ok if you can't pronounce the one you choose. 

[Image: Types of data analysis: text analysis, statistical analysis, diagnostic analysis, predictive analysis, and prescriptive analysis]

Text analysis: What is happening?

Here are a few methods used to perform text analysis, to give you a sense of how it's different from a human reading through the text: 

Word frequency identifies the most frequently used words. For example, a restaurant monitors social media mentions and measures the frequency of positive and negative keywords like "delicious" or "expensive" to determine how customers feel about their experience. 

Language detection indicates the language of text. For example, a global software company may use language detection on support tickets to connect customers with the appropriate agent. 

Keyword extraction automatically identifies the most used terms. For example, instead of sifting through thousands of reviews, a popular brand uses a keyword extractor to summarize the words or phrases that are most relevant. 
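As a rough illustration of word frequency (the simplest of these methods), the sketch below counts keyword mentions in a handful of invented review snippets using only the Python standard library.

```python
# Count keyword frequency across a few invented customer reviews.
import re
from collections import Counter

reviews = [
    "Delicious food but a bit expensive",
    "Expensive for the portion size",
    "Absolutely delicious, will come back",
]

words = re.findall(r"[a-z']+", " ".join(reviews).lower())  # lowercase tokens
counts = Counter(words)

for keyword in ("delicious", "expensive"):
    print(keyword, counts[keyword])   # delicious 2, expensive 2
```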

Statistical analysis: What happened?

Statistical analysis pulls past data to identify meaningful trends. Two primary categories of statistical analysis exist: descriptive and inferential.

Descriptive analysis

Here are a few methods used to perform descriptive analysis: 

Measures of frequency identify how frequently an event occurs. For example, a popular coffee chain sends out a survey asking customers what their favorite holiday drink is and uses measures of frequency to determine how often a particular drink is selected. 

Measures of central tendency use mean, median, and mode to identify results. For example, a dating app company might use measures of central tendency to determine the average age of its users.

Measures of dispersion measure how data is distributed across a range. For example, HR may use measures of dispersion to determine what salary to offer in a given field. 

Inferential analysis

Inferential analysis uses a sample of data to draw conclusions about a much larger population. This type of analysis is used when the population you're interested in analyzing is very large. 

Here are a few methods used when performing inferential analysis: 

Hypothesis testing identifies which variables impact a particular topic. For example, a business uses hypothesis testing to determine if increased sales were the result of a specific marketing campaign. 

Regression analysis shows the effect of independent variables on a dependent variable. For example, a rental car company may use regression analysis to determine the relationship between wait times and number of bad reviews. 
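To ground the regression example, here's a minimal sketch using SciPy's linregress on invented wait-time and bad-review figures; the relationship is made up purely for illustration.

```python
# Simple linear regression: do longer wait times predict more bad reviews?
from scipy import stats

wait_minutes = [5, 10, 15, 20, 25, 30, 35]   # hypothetical average wait times
bad_reviews = [1, 2, 2, 4, 5, 7, 8]          # hypothetical bad-review counts

result = stats.linregress(wait_minutes, bad_reviews)
print(f"slope = {result.slope:.2f} extra bad reviews per minute of wait")
print(f"r = {result.rvalue:.2f}, P = {result.pvalue:.4f}")
```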

Diagnostic analysis: Why did it happen?

Diagnostic analysis, also referred to as root cause analysis, uncovers the causes of certain events or results. 

Here are a few methods used to perform diagnostic analysis: 

Time-series analysis analyzes data collected over a period of time. A retail store may use time-series analysis to determine that sales increase between October and December every year. 

Correlation analysis determines the strength of the relationship between variables. For example, a local ice cream shop may determine that as the temperature in the area rises, so do ice cream sales. 
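And a quick sketch of the ice cream example, correlating invented daily temperatures with sales using NumPy:

```python
# Correlation between daily temperature and ice cream sales (invented data).
import numpy as np

temperature_c = [18, 21, 24, 27, 30, 33, 35]
ice_cream_sales = [120, 135, 160, 180, 210, 240, 260]

r = np.corrcoef(temperature_c, ice_cream_sales)[0, 1]   # Pearson correlation coefficient
print(f"Pearson correlation: {r:.2f}")                  # close to 1: sales rise with temperature
```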

Predictive analysis: What is likely to happen?

Predictive analysis aims to anticipate future developments and events. By analyzing past data, companies can predict future scenarios and make strategic decisions.  

Here are a few methods used to perform predictive analysis: 

Decision trees map out possible courses of action and outcomes. For example, a business may use a decision tree when deciding whether to downsize or expand. 

Prescriptive analysis: What action should we take?

The highest level of analysis, prescriptive analysis, aims to find the best action plan. Typically, AI tools model different outcomes to predict the best approach. While these tools serve to provide insight, they don't replace human consideration, so always use your human brain before going with the conclusion of your prescriptive analysis. Otherwise, your GPS might drive you into a lake.

Here are a few methods used to perform prescriptive analysis: 

Algorithms are used in technology to perform specific tasks. For example, banks use prescriptive algorithms to monitor customers' spending and recommend that they deactivate their credit card if fraud is suspected. 

Data analysis process: How to get started

The actual analysis is just one step in a much bigger process of using data to move your business forward. Here's a quick look at all the steps you need to take to make sure you're making informed decisions. 

[Image: The data analysis process: data decision, data collection, data cleaning, data analysis, data interpretation, and data visualization]

Data decision

As with almost any project, the first step is to determine what problem you're trying to solve through data analysis. 

Make sure you get specific here. For example, a food delivery service may want to understand why customers are canceling their subscriptions. But to enable the most effective data analysis, they should pose a more targeted question, such as "How can we reduce customer churn without raising costs?" 

Data collection

Next, collect the required data from both internal and external sources. 

Internal data comes from within your business (think CRM software, internal reports, and archives), and helps you understand your business and processes.

External data originates from outside of the company (surveys, questionnaires, public data) and helps you understand your industry and your customers. 

Data cleaning

Data can be seriously misleading if it's not clean. So before you analyze, make sure you review the data you collected. Depending on the type of data you have, cleanup will look different, but it might include the steps below (a small pandas sketch follows the list):

Removing unnecessary information 

Addressing structural errors like misspellings

Deleting duplicates

Trimming whitespace

Human checking for accuracy 
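Here's a minimal pandas sketch of those cleanup steps on a tiny invented dataset; real cleanup will depend on your data.

```python
# Basic data cleaning with pandas: duplicates, whitespace, misspellings, missing values.
import pandas as pd

df = pd.DataFrame({
    "customer": [" Alice ", "Bob", "Bob", "Cara", None],
    "plan": ["premium", "premum", "premum", "basic", "basic"],   # note the misspelling
    "spend": [120.0, 80.0, 80.0, None, 45.0],
})

df = df.drop_duplicates()                                # delete duplicate rows
df["customer"] = df["customer"].str.strip()              # trim whitespace
df["plan"] = df["plan"].replace({"premum": "premium"})   # fix a structural error (misspelling)
df = df.dropna(subset=["customer", "spend"])             # drop rows missing key values

print(df)
```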

Data analysis

Now that you've compiled and cleaned the data, use one or more of the above types of data analysis to find relationships, patterns, and trends. 

Data analysis tools can speed up the data analysis process and remove the risk of inevitable human error. Here are some examples.

Spreadsheets sort, filter, analyze, and visualize data. 

Structured query language (SQL) tools manage and extract data in relational databases. 

Data interpretation

After you analyze the data, you'll need to go back to the original question you posed and draw conclusions from your findings. Here are some common pitfalls to avoid:

Correlation vs. causation: Just because two variables are associated doesn't mean they're necessarily related or dependent on one another. 

Confirmation bias: This occurs when you interpret data in a way that confirms your own preconceived notions. To avoid this, have multiple people interpret the data. 

Small sample size: If your sample size is too small or doesn't represent the demographics of your customers, you may get misleading results. If you run into this, consider widening your sample size to give you a more accurate representation. 

Data visualization

Automate your data collection

Frequently asked questions

Need a quick summary or still have a few nagging data analysis questions? I'm here for you.

What are the five types of data analysis?

The five types of data analysis are text analysis, statistical analysis, diagnostic analysis, predictive analysis, and prescriptive analysis. Each type offers a unique lens for understanding data: text analysis provides insights into text-based content, statistical analysis focuses on numerical trends, diagnostic analysis looks into problem causes, predictive analysis deals with what may happen in the future, and prescriptive analysis gives actionable recommendations.

What is the data analysis process?

The data analysis process involves data decision, collection, cleaning, analysis, interpretation, and visualization. Every stage comes together to transform raw data into meaningful insights. Decision determines what data to collect, collection gathers the relevant information, cleaning ensures accuracy, analysis uncovers patterns, interpretation assigns meaning, and visualization presents the insights.

What is the main purpose of data analysis?

In business, the main purpose of data analysis is to uncover patterns, trends, and anomalies, and then use that information to make decisions, solve problems, and reach your business goals.


This article was originally published in October 2022 and has since been updated with contributions from Cecilia Gillen. The most recent update was in September 2023.


Shea Stevens

Shea is a content writer currently living in Charlotte, North Carolina. After graduating with a degree in Marketing from East Carolina University, she joined the digital marketing industry focusing on content and social media. In her free time, you can find Shea visiting her local farmers market, attending a country music concert, or planning her next adventure.


  • Open access
  • Published: 16 May 2024

Promoting equality, diversity and inclusion in research and funding: reflections from a digital manufacturing research network

  • Oliver J. Fisher 1 ,
  • Debra Fearnshaw   ORCID: orcid.org/0000-0002-6498-9888 2 ,
  • Nicholas J. Watson 3 ,
  • Peter Green 4 ,
  • Fiona Charnley 5 ,
  • Duncan McFarlane 6 &
  • Sarah Sharples 2  

Research Integrity and Peer Review, volume 9, Article number: 5 (2024)

Equal, diverse, and inclusive teams lead to higher productivity, creativity, and greater problem-solving ability resulting in more impactful research. However, there is a gap between equality, diversity, and inclusion (EDI) research and practices to create an inclusive research culture. Research networks are vital to the research ecosystem, creating valuable opportunities for researchers to develop their partnerships with both academics and industrialists, progress their careers, and enable new areas of scientific discovery. A feature of a network is the provision of funding to support feasibility studies – an opportunity to develop new concepts or ideas, as well as to ‘fail fast’ in a supportive environment. The work of networks can address inequalities through equitable allocation of funding and proactive consideration of inclusion in all of their activities.

This study proposes a strategy to embed EDI within research network activities and funding review processes. This paper evaluates 21 planned mitigations introduced to address known inequalities within research events and how funding is awarded. EDI data were collected from researchers engaging in a digital manufacturing network activities and funding calls to measure the impact of the proposed method.

Quantitative analysis indicates that the network’s approach was successful in creating a more ethnically diverse network, engaging with early career researchers, and supporting researchers with care responsibilities. However, more work is required to create a gender balance across the network activities and ensure the representation of academics who declare a disability. Preliminary findings suggest the network’s anonymous funding review process has helped address inequalities in funding award rates for women and those with care responsibilities; however, more data are required to validate these observations and understand the impact of different interventions, individually and in combination.

Conclusions

In summary, this study offers compelling evidence regarding the efficacy of a research network's approach in advancing EDI within research and funding. The network hopes that these findings will inform broader efforts to promote EDI in research and funding and that researchers, funders, and other stakeholders will be encouraged to adopt evidence-based strategies for advancing this important goal.


Introduction

Achieving equality, diversity, and inclusion (EDI) is an underpinning contributor to human rights, civilisation and society-wide responsibility [ 1 ]. Furthermore, promoting and embedding EDI within research environments is essential to make the advancements required to meet today’s research challenges [ 2 ]. This is evidenced by equal, diverse and inclusive teams leading to higher productivity, creativity and greater problem-solving ability [ 3 ], which increases the scientific impact of research outputs and researchers [ 4 ]. However, there remains a gap between EDI research and the everyday implementation of inclusive practices to achieve change [ 5 ]. This paper presents and reflects on the EDI measures trialled by the UK Engineering and Physical Sciences Research Council (EPSRC) funded digital manufacturing research network, Connected Everything (grant number: EP/S036113/1) [ 6 ]. The EPSRC is a UK research council that funds engineering and physical sciences research. By sharing these reflections, this work aims to contribute to the wider effort of creating an inclusive research culture. The perceptions of equality, diversity, and inclusion may vary among individuals. For the scope of this study, the following definitions are adopted:

Equality: Equality is about ensuring that every individual has an equal opportunity to make the most of their lives and talents. No one should have poorer life chances because of the way they were born, where they come from, what they believe, or whether they have a disability.

Diversity: Diversity concerns understanding that each individual is unique, recognising our differences, and exploring these differences in a safe, positive, and nurturing way to value each other as individuals.

Inclusion: Inclusion is an effort and practice in which groups or individuals with different backgrounds are culturally and socially accepted, welcomed and treated equally. This concerns treating each person as an individual, making them feel valued, and supported and being respectful of who they are.

Research networks have varied goals, but a common purpose is to create new interdisciplinary research communities, by fostering interactions between researchers and appropriate scientific, technological and industrial groups. These networks aim to offer valuable career progression opportunities for researchers, through access to research funding, forming academic and industrial collaborations at network events, personal and professional development, and research dissemination. However, feedback from a 2021 survey of 19 UK research networks suggests that these research networks are not always diverse, and whilst on the face of it they seem inclusive, they are perceived as less inclusive by minority groups (including non-males, those with disabilities, and ethnic minority respondents) [ 7 ]. The exclusivity of these networks further exacerbates the inequality within the academic community as it prevents certain groups from being able to engage with all aspects of network activities.

Research investigating the causes of inequality and exclusivity has identified several suggestions to make research culture more inclusive, including improving diverse representation within event programmes and panels [ 8 , 9 ]; ensuring events are accessible to all [ 10 ]; providing personalised resources and training to build capacity and increase engagement [ 11 ]; educating institutions and funders to understand and address the barriers to research [ 12 ]; and increasing diversity in peer review and funding panels [ 13 ]. Universities, research institutions and research funding bodies are increasingly taking responsibility to ensure the health of the research and innovation system and to foster inclusion. For example, the EPSRC has set out their own ‘Expectation for EDI’ to promote the formation of a diverse and inclusive research culture [ 14 ]. To drive change, there is an emphasis on the importance of measuring diversity and links to measured outcomes to benchmark future studies on how interventions affect diversity [ 5 ]. Further, collecting and sharing EDI data can also drive aspirations, provide a target for actions, and allow institutions to consider common issues. However, there is a lack of available data regarding the impact of EDI practices on diversity that presents an obstacle, impeding the realisation of these benefits and hampering progress in addressing common issues and fostering diversity and inclusion [ 5 ].

Funding acquisition is important to an academic’s career progression, yet funding may often be awarded in ways that feel unequal and/or non-transparent. The importance of funding in academic career progression means that, if credit for obtaining funding is not recognised appropriately, careers can be damaged, and, as a result of the lack of recognition for those who have been involved in successful research, funding bodies may not have a complete picture of the research community, and are unable to deliver the best value for money [ 15 ]. Awarding funding is often a key research network activity and an area where networks can have a positive impact on the wider research community. It is therefore important that practices are established to embed EDI consideration within the funding process and to ensure that network funding is awarded without bias. Recommendations from the literature to make the funding award process fairer include: ensuring a diverse funding panel; funders instituting reviewer anti-bias training; anonymous review; and/or automatic adjustments to correct for known biases [ 16 ]. In the UK, the government organisation UK Research and Innovation (UKRI), tasked with overseeing research and innovation funding, has pledged to publish data to enhance transparency. This initiative aims to furnish an evidence base for designing interventions and evaluating their efficacy. While the data show some positive signs (e.g., the award rates for male and female PI applicants were equal at 29% in 2020–21), Ottoline Leyser (UKRI Chief Executive) highlights the ‘persistent pernicious disparities for under-represented groups in applying for and winning research funding’ [ 17 ]. This suggests that a more radical approach to rethinking the traditional funding review process may be required.

This paper describes the approach taken by the ‘Connected Everything’ EPSRC-funded Network to embed EDI in all aspects of its research funding process, and evaluates the impact of this ambition, leading to recommendations for embedding EDI in research funding allocation.

Connected everything’s equality diversity and inclusion strategy

Connected Everything aims to create a multidisciplinary community of researchers and industrialists to address key challenges associated with the future of digital manufacturing. The network is managed by an investigator team who are responsible for the strategic planning and, working with the network manager, to oversee the delivery of key activities. The network was first funded between 2016–2019 (grant number: EP/P001246/1) and was awarded a second grant (grant number: EP/S036113/1). The network activities are based around three goals: building partnerships, developing leadership and accelerating impact.

The Connected Everything network represents a broad range of disciplines, including manufacturing, computer science, cybersecurity, engineering, human factors, business, sociology, innovation and design. Some of the subject areas, such as Computer Science and Engineering, tend to be male-dominated (e.g., in 2021/22, a total of 185,42 higher education student enrolments in engineering & technology subjects was broken down as 20.5% Female and 79.5% Male [ 18 ]). The networks also face challenges in terms of accessibility for people with care responsibilities and disabilities. In 2019, Connected Everything committed to embedding EDI in all its network activities and published a guiding principle and goals for improving EDI (see Additional file 1 ). When designing the processes to deliver the second iteration of Connected Everything, the team identified several sources of potential bias/exclusion which have the potential to impact engagement with the network. Based on these identified factors, a series of mitigation interventions were implemented and are outlined in Table  1 .

Connected everything anonymous review process

A key Connected Everything activity is the funding of feasibility studies to enable cross-disciplinary, foresight, speculative and risky early-stage research, with a focus on low technology-readiness levels. Awards are made via a short, written application followed by a pitch to a multidisciplinary diverse panel including representatives from industry. Six- to twelve-month-long projects are funded to a maximum value of £60,000.

The current peer-review process used by funders may reveal the applicants’ identities to the reviewer. This can introduce dilemmas to the reviewer regarding (a) deciding whether to rely exclusively on information present within the application or search for additional information about the applicants and (b) whether or not to account for institutional prestige [ 34 ]. Knowing an applicant’s identity can bias the assessment of the proposal, but by focusing the assessment on the science rather than the researcher, equality is more frequently achieved between award rates (i.e., the proportion of successful applications) [ 15 ]. To progress Connected Everything’s commitment to EDI, the project team created a 2-stage review process, where the applicants’ identity was kept anonymous during the peer review stage. This anonymous process, which is outlined in Fig.  1 , was created for the feasibility study funding calls in 2019 and used for subsequent funding calls.

Figure 1: Connected Everything’s anonymous review process [EDI: Equality, diversity, and inclusion]

To facilitate the anonymous review process, the proposal was submitted in two parts: Part A, the research idea, and Part B, the capability-to-deliver statement. All proposals were first anonymously reviewed by two randomly selected members of the Connected Everything executive group, a diverse group of digital manufacturing experts and peers from academia, industry and research institutions who provide guidance and leadership on Connected Everything activities. The reviewers rated the proposals against the selection criteria (see Additional file 1, Table 1) and provided overall comments alongside a recommendation on whether or not the applicant should be invited to the panel pitch. This information was summarised and shared with a moderation sift panel, made up of at least two Connected Everything investigators and at least one member of the executive group, which tensioned the reviewers’ comments (i.e., carefully considered and weighed the peer reviewers’ comments and evaluations against each other) and ultimately decided which proposals to invite to the panel. This tensioning process included using the identifying information to ensure the applicants had the capability to deliver the project. If this remained unclear, the applicants were asked to confirm expertise in an area the moderation sift panel thought was key, or to bring additional expertise into the project team for the panel pitch.
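The random allocation of anonymised proposals to executive-group reviewers can be illustrated with a short sketch. This is not Connected Everything's actual tooling; the function and data below (e.g. `executive_group`, `reviews_per_proposal=2`) are hypothetical and simply mirror the process described above, in which each anonymised research idea is sent to two randomly selected reviewers.

```python
import random

def assign_reviewers(proposal_ids, executive_group, reviews_per_proposal=2, seed=None):
    """Randomly assign each anonymised proposal to executive-group reviewers.

    proposal_ids: identifiers of the anonymised research ideas (Part A only).
    executive_group: names of available executive-group members (hypothetical list).
    Returns a dict mapping proposal id -> list of assigned reviewers.
    """
    rng = random.Random(seed)
    assignments = {}
    for pid in proposal_ids:
        # Sample without replacement so a proposal never gets the same reviewer twice.
        assignments[pid] = rng.sample(executive_group, reviews_per_proposal)
    return assignments

# Example with made-up identifiers:
print(assign_reviewers(["FS-01", "FS-02"], ["Reviewer A", "Reviewer B", "Reviewer C"], seed=1))
```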

During stage two, the applicants were invited to pitch their research idea to a panel of experts selected to reflect the diversity of the community. The proposals, including applicants’ identities, were shared with the panel at least two weeks ahead of the pitch session. Individual panel members completed a summary sheet at the end of each pitch to record how well the proposal met the selection criteria (see Additional file 1, Table 1). Panel members did not discuss their funding decisions until all the pitches had been completed. A panel chair oversaw the process but did not declare an opinion on a specific feasibility study unless the panel could not agree on an outcome. The panel and panel chair were reminded to consider ways to manage their unconscious bias during the selection process.

Due to the positive response received regarding the anonymous review process, Connected Everything extended its use when reviewing other funded activities. As these awards were for smaller grant values (~ £5,000), it was decided that no panel pitch was required, and the researcher’s identity was kept anonymous for the entire process.

Data collection and analysis methods

Data collection.

Equality, diversity and inclusion data were voluntarily collected from applicants for Connected Everything research funding and from participants who won scholarships to attend Connected Everything funded activities. Responses to the EDI data requests were collected from nine Connected Everything coordinated activities between 2019 and 2022. Data requests were sent after the applicant had applied for Connected Everything funding or had attended a Connected Everything funded activity. All data requests were completed voluntarily, with reassurance given that completion of the data request in no way affected their application. In total, 260 responses were received, of which the three feasibility study calls accounted for 56.2%. Overall, there was a 73.8% response rate.

To understand the diversity of participants engaging with Connected Everything activities and funding, the data requests asked for details of specific diversity characteristics: gender, transgender, disability, ethnicity, age, and care responsibilities. Although sex and gender are terms that are often used interchangeably, they are two different concepts. To clarify, the definitions used by the UK government describe sex as a set of biological attributes that is generally limited to male or female, and typically attributed to individuals at birth. In contrast, gender identity is a social construction related to behaviours and attributes, and is self-determined based on a person’s internal perception, identification and experience. Transgender is a term used to describe people whose gender identity is not the same as the sex they were registered at birth. Respondents were first asked to identify their gender and then whether their gender was different from their birth sex.

For this study, respondents were asked to (voluntarily) self-declare whether they consider themselves to be disabled or not. Ethnicity within the data requests was based on the 2011 census classification system. When reporting ethnicity data, this study followed the AdvanceHE example of aggregating the census categories into six groups to enable benchmarking against the available academic ethnicity data. AdvanceHE is a UK charity that works to improve the higher education system for staff, students and society. However, it is acknowledged that there are limitations with this grouping, including the assumption that minority ethnic staff or students are a homogenous group [ 16 ]. Therefore, these groups were broken down further during the discussion of the results. The six groups are:

Asian: Asian/Asian British: Indian, Pakistani, Bangladeshi, and any other Asian background;

Black: Black/African/Caribbean/Black British: African, Caribbean, and any other Black/African/Caribbean background;

Chinese: Chinese ethnic group;

Mixed: mixed/multiple ethnic backgrounds;

Other: other ethnic backgrounds, including Arab;

White: all white ethnic groups.

Benchmarking data

Published data from the Higher Education Statistics Agency [ 26 ] (a UK organisation responsible for collecting, analysing, and disseminating data related to higher education institutions and students), UKRI funding data [ 19 , 35 ] and 2011 census data [ 36 ] were used to benchmark the EDI data collected within this study. The responses were compared to the engineering and technology cluster of academic disciplines, as this is most representative of Connected Everything’s main funder, EPSRC. The Higher Education Statistics Agency defines the engineering and technology cluster as including the following subject areas: general engineering; chemical engineering; mineral, metallurgy & materials engineering; civil engineering; electrical, electronic & computer engineering; mechanical, aero & production engineering; and IT, systems sciences & computer software engineering [ 37 ].

When assessing the equality in funding award rates, previous studies have focused on analysing the success rates of only the principal investigators [ 15 , 16 , 38 ]; however, Connected Everything recognised that writing research proposals is a collaborative task, so requested diversity data from the whole research team. The average of the last six years of published principal investigator and co-investigator diversity data for UKRI and EPSRC funding awards (2015–2021) was used to benchmark the Connected Everything funding data [ 35 ]. The UKRI and EPSRC funding review process includes a peer review stage followed by panel pitch and assessment stage; however, the applicant's track record is assessed during the peer review stage, unlike the Connected Everything review process.

The data collected have been used to evaluate the success of the planned mitigations to address EDI factors affecting the higher education research ecosystem, as outlined in Table 1 ("Connected Everything’s Equality Diversity and Inclusion Strategy" section).

Dominance of small number of research-intensive universities receiving funding from network

The dominance of a small number of research-intensive universities receiving funding from a network can have implications for the field of research, including: the unequal distribution of resources; a lack of diversity of research; limited collaboration opportunities; and impacts on innovation and progress. Analysis of published EPSRC funding data between 2015 and 2021 [ 19 ] shows that funding has been predominantly (74.1%, 95% CI [71.3%, 76.9%], of £3.98 billion) awarded to Russell Group universities. The Russell Group is a self-selected association of 24 research-intensive universities (out of the 174 universities in the UK), established in 1994. Evaluation of the universities that received Connected Everything feasibility study funding between 2016–2019 shows that Connected Everything awarded just over half (54.6%, 95% CI [25.1%, 84.0%], of 11 awards) to Russell Group universities. Figure 2 shows that the Connected Everything funding awarded to Russell Group universities reduced to 44.4%, 95% CI [12.0%, 76.9%], of 9 awards between 2019–2022.
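The 95% confidence intervals quoted for proportions throughout these results appear consistent with a simple normal-approximation (Wald) interval; assuming the reported 54.6% corresponds to 6 of the 11 awards, the interval [25.1%, 84.0%] is reproduced almost exactly. The paper does not state which method was used, so the sketch below is an assumption offered for reproducibility rather than the authors' own code.

```python
import math

def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion, returned in percent."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return 100 * (p - z * se), 100 * (p + z * se)

# If 6 of the 11 Connected Everything I awards went to Russell Group universities:
print(wald_ci(6, 11))  # ~ (25.1, 84.0), matching the interval reported above
```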

Figure 2: A comparison of funding awarded by EPSRC (total = £3.98 billion) across Russell Group universities and non-Russell Group universities, alongside the allocations for Connected Everything I (total = £660k) and Connected Everything II (total = £540k)

Dominance of successful applications from men

The percentage point differences between the award rates of researchers who identified as female, declared a disability, identified as an ethnic minority, or reported care responsibilities and the award rates of their respective counterparts are plotted in Fig. 3. Bars to the right of the axis mean that the award rate of female/declared-disability/ethnic-minority/carer applicants is greater than that of male/no-declared-disability/white/non-carer applicants.
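A percentage-point gap of the kind plotted in Fig. 3 can be computed, together with a normal-approximation interval for the difference of two proportions, as in the sketch below. The group sizes and success counts shown are purely hypothetical placeholders, since the paper reports rates rather than raw counts.

```python
import math

def award_rate_gap(successes_a, n_a, successes_b, n_b, z=1.96):
    """Percentage-point difference in award rate (group A minus group B)
    with a normal-approximation 95% CI for the difference."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return 100 * diff, (100 * (diff - z * se), 100 * (diff + z * se))

# Hypothetical counts for illustration only:
gap, ci = award_rate_gap(successes_a=7, n_a=36, successes_b=17, n_b=110)
print(f"{gap:.1f} percentage points, 95% CI {ci}")
```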

Figure 3: Percentage point (PP) differences in award rate by funding provider for gender, disability status, ethnicity and care responsibilities (data not collected by UKRI and EPSRC [ 35 ]). The total number of applicants for each funder is as follows: Connected Everything = 146, EPSRC = 37,960, and UKRI = 140,135. *The number of applicants was too small (< 5) to enable a meaningful discussion

Figure 3(A) shows that, between 2015 and 2021, research team applicants who identified as male had a higher award rate than those who identified as female when applying for EPSRC and wider UKRI research council funding. In contrast, Connected Everything funding applicants who identified as female achieved a higher award rate (19.4%, 95% CI [6.5%, 32.4%], out of 146 total applicants) than male applicants (15.6%, 95% CI [8.8%, 22.4%], out of 146). These data suggest that biases have been reduced by the Connected Everything review process and other mitigation strategies (e.g., visible gender diversity among panel pitch members and publishing Connected Everything’s principles and goals to demonstrate commitment to equality and fairness). This finding aligns with an earlier study that found gender bias during the peer review process, resulting in female investigators receiving less favourable evaluations than their male counterparts [ 15 ].

Over-representation of people identifying as male in engineering and technology academic community

Figure 4 shows the responses to the gender question, with 24.2%, 95% CI [19.0%, 29.4%], of 260 respondents identifying as female. This aligns with the average for the engineering and technology cluster (21.4%, 95% CI [20.9%, 21.9%], female of 27,740 academic staff), which includes subject areas representative of our main funder, EPSRC [ 22 ]. We also sought to understand the representation of transgender researchers within the network. However, following the rounding policy outlined by UK Government statistics policies and procedures [ 39 ], the number of respondents who identified as a gender different from their sex registered at birth was too low (< 5) to enable a meaningful discussion.
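Suppressing counts below five, as required by the rounding policy referenced above, can be implemented as a simple reporting filter. The category names and numbers in the example are hypothetical; the function only illustrates the disclosure-control rule described in the text.

```python
def suppress_small_counts(counts, threshold=5):
    """Replace any count below the threshold with '<5' before reporting,
    so that individuals in very small groups cannot be identified."""
    return {category: (n if n >= threshold else "<5") for category, n in counts.items()}

# Hypothetical example:
print(suppress_small_counts({"female": 63, "male": 189, "prefer not to say": 3}))
# -> {'female': 63, 'male': 189, 'prefer not to say': '<5'}
```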

Figure 4: Gender question responses from a total of 260 respondents

Dominance of successful applications from white academics

Figure 3(C) shows that researchers with a minority ethnicity consistently have a lower award rate than white researchers when applying for EPSRC and UKRI funding. Similarly, the results in Fig. 3(C) indicate that white researchers are more successful (by 8.0 percentage points, 95% CI [-8.6%, 24.6%]) when applying for Connected Everything funding. These results indicate that more measures should be implemented to support ethnic minority researchers applying for Connected Everything funding, as well as sense-checking that there is no unconscious bias in any of the Connected Everything funding processes. The breakdown of the ethnic diversity of applicants at different stages of the Connected Everything review process (i.e., all applications, applicants invited to panel pitch, and awarded feasibility studies) has been plotted in Fig. 5 to help identify where more support is needed. Figure 5 shows an increase in the proportion of white researchers from 54%, 95% CI [45.4%, 61.8%], of all 146 applicants to 66%, 95% CI [52.8%, 79.1%], of the 50 researchers invited to the panel pitch. This suggests that stage 1 of the Connected Everything review process (anonymous review of written applications) may favour white applicants and/or introduce unconscious bias into the process.

Figure 5: Ethnicity question responses from different stages during the Connected Everything anonymous review process. The total number of applicants is 146, with 50 at the panel stage and 23 ultimately awarded

Under-representation of those from black or minority ethnic backgrounds

Connected Everything appears to engage a wide range of ethnicities, as shown in Fig. 6. The ethnicities Asian (18.3%, 95% CI [13.6%, 23.0%]), Black (5.1%, 95% CI [2.4%, 7.7%]), Chinese (12.5%, 95% CI [8.4%, 16.5%]), mixed (3.5%, 95% CI [1.3%, 5.7%]) and other (7.8%, 95% CI [4.5%, 11.1%]) have a higher representation among the 260 individuals engaging with the network’s activities, in contrast to both the engineering and technology academic community and the wider UK population. When separating these groups into the original ethnic diversity answers, it becomes apparent that there is no engagement with ‘Black or Black British: Caribbean’, ‘Mixed: White and Black Caribbean’ or ‘Mixed: White and Asian’ researchers within Connected Everything activities. The lack of engagement with researchers from a Caribbean heritage is symptomatic of a lack of representation within the UK research landscape [ 25 ].

Figure 6: Ethnicity question responses from a total of 260 respondents compared to the distribution of the 13,085 UK engineering and technology (E&T) academic staff [ 22 ] and 56 million people recorded in the UK 2011 census data [ 36 ]

Under-representation of disabilities, chronic conditions, invisible illnesses and neurodiversity in funded activities and events

Figure  7 (A) shows that 5.7%, 95% CI [2.4%, 8.9%] of 194 responses declared a disability. This is higher than the average of engineering and technology academics that identify as disabled (3.4%, 95% CI [3.2%, 3.7%] of 27,730 academics). Between Jan-March 2022, 9.0 million people of working age (16–64) within the UK were identified as disabled by the Office for National Statistics [ 40 ], which is 21% of the working age population [ 27 ]. Considering these statistics, there is a stark under-representation of disabilities, chronic conditions, invisible illnesses and neurodiversity amongst engineering and technology academic staff and those engaging in Connected Everything activities.

Figure 7: Responses to (A) disability and (B) care responsibilities questions collected from a total of 194 respondents

Between 2015 and 2020, academics who declared a disability were less successful than academics without a disability in attracting UKRI and EPSRC funding, as shown in Fig. 3(B). While Fig. 3(B) shows that those who declared a disability have a higher Connected Everything funding award rate, the number of applicants who declared a disability was too small (< 5) to enable a meaningful discussion of this result.

Under-representation of those with care responsibilities in funded activities and events

In response to the care responsibilities question, Fig. 7(B) shows that 27.3%, 95% CI [21.1%, 33.6%], of 194 respondents identified as carers, which is higher than the 6% of adults estimated to be providing informal care across the UK in a UK Government survey of the 2020/2021 financial year [ 41 ]. However, the ‘informal care’ definition used by the 2021 survey includes unpaid care to a friend or family member needing support, perhaps due to illness, older age, disability, a mental health condition or addiction [ 41 ]. The Connected Everything survey included care responsibilities across the spectrum of care, including partners, children, other relatives, pets, friends and kin. It is important to consider a wide spectrum of care responsibilities, as key academic events, such as conferences, have previously been demonstrably exclusionary sites for academics with care responsibilities [ 42 ]. Breakdown analysis of the responses to care responsibilities by gender in Fig. 8 reveals that 37.8%, 95% CI [25.3%, 50.3%], of 58 women respondents reported care responsibilities, compared to 22.6%, 95% CI [15.6%, 29.6%], of 136 men respondents. Our findings reinforce similar studies that conclude the burden of care falls disproportionately on female academics [ 43 ].

Figure 8: Responses to care responsibilities when grouped by (A) 136 males and (B) 58 females

Figure 3(D) shows that researchers with care responsibilities applying for Connected Everything funding have a higher award rate than researchers without care responsibilities. These results suggest that the Connected Everything review process is supportive of researchers with care responsibilities, who have faced barriers in other areas of academia.

Reduced opportunities for ECRs

Early-career researchers (ECRs) represent the transition stage between starting a PhD and senior academic positions. EPSRC defines an ECR as someone who is either within eight years of their PhD award, or equivalent professional training or within six years of their first academic appointment [ 44 ]. These periods exclude any career break, for example, due to family care; health reasons; and reasons related to COVID-19 such as home schooling or increased teaching load. The median age for starting a PhD in the UK is 24 to 25, while PhDs usually last between three and four years [ 45 ]. Therefore, these data would imply that the EPSRC median age of ECRs is between 27 and 37 years. It should be noted, however, that this definition is not ideal and excludes ECRs who may have started their research career later in life.
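The EPSRC definition quoted above can be expressed as a simple eligibility check. The function below is an illustrative sketch of that rule, not an official tool; the handling of career breaks (a single figure in years to discount) is an assumption about how the exclusion would be applied in practice.

```python
from datetime import date

def is_ecr(phd_award_date, first_appointment_date=None, career_break_years=0.0, today=None):
    """Rough early-career-researcher check following the EPSRC definition above:
    within 8 years of PhD award, or within 6 years of first academic appointment,
    with career breaks discounted."""
    today = today or date.today()
    years_since_phd = (today - phd_award_date).days / 365.25 - career_break_years
    within_phd_window = years_since_phd <= 8
    within_appt_window = False
    if first_appointment_date is not None:
        years_since_appt = (today - first_appointment_date).days / 365.25 - career_break_years
        within_appt_window = years_since_appt <= 6
    return within_phd_window or within_appt_window

# Hypothetical example: PhD awarded mid-2016 with a one-year career break.
print(is_ecr(phd_award_date=date(2016, 7, 1), career_break_years=1.0, today=date(2023, 7, 1)))  # True
```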

Connected Everything aims to support ECRs via measures that include mentoring support, workshops, summer schools and podcasts. Figure 9 shows a greater representation of researchers aged between 30 and 44 engaging with Connected Everything activities (62.4%, 95% CI [55.6%, 69.2%], of 194 respondents) when compared to the wider engineering and technology academic community (43.7%, 95% CI [43.1%, 44.3%], of 27,780 academics) and the UK population (26.9%, 95% CI [26.9%, 26.9%]).

Figure 9: Age question responses from a total of 194 respondents compared to the distribution of the 27,780 UK engineering and technology (E&T) academic staff [ 22 ] and 56 million people recorded in the UK 2011 census data [ 36 ]

High competition for funding has a greater impact on ECRs

Figure 10 shows that the largest age bracket applying for and winning Connected Everything funding is 31–45, whereas 72%, 95% CI [70.1%, 74.5%], of the 12,075 researchers awarded EPSRC grants between 2015 and 2021 were 40 years or older. These results suggest that the measures introduced by Connected Everything have been successful at providing funding opportunities for researchers who are likely to be at an early-to-mid career stage.

Figure 10: Age of researchers at applicant and awarded funding stages for (A) Connected Everything between 2019–2022 (total of 146 applicants and 23 awarded) and (B) EPSRC funding between 2015–2021 [ 35 ] (total of 35,780 applicants and 12,075 awarded)

The results of this paper provide insights into the impact that Connected Everything’s planned mitigations have had on promoting equality, diversity, and inclusion (EDI) in research and funding. Collecting EDI data from individuals who engage with network activities and apply for research funding enabled an evaluation of whether these mitigations have been successful in achieving the intended outcomes outlined at the start of the study, as summarised in Table  2 .

The results in Table 2 indicate that Connected Everything’s approach to EDI has helped achieve the intended outcome of improving the representation of women, ECRs, those with a declared disability and those from black/minority ethnic backgrounds engaging with network events when compared to the engineering and technology academic community. In addition, the network has helped raise awareness of the high presence of researchers with care responsibilities at network events, which can help to track progress towards making future events inclusive and accessible to these carers. The data highlight two areas for improvement: (1) ensuring a gender balance; and (2) increasing representation of those with declared disabilities. Both these discrepancies are indicative of the wider imbalances and under-representation of these groups in the engineering and technology academic community [ 26 ], yet represent areas where networks can strive to make a difference. Possible strategies include: using targeted outreach; promoting greater representation of these groups among event speakers; and going further to create a welcoming and inclusive environment. One barrier that can disproportionately affect women researchers is the need to balance care responsibilities with attending network events [ 46 ]. This was reflected in the Connected Everything data, which reported that 37.8%, 95% CI [25.3%, 50.3%], of women engaging with network activities had care responsibilities, compared to 22.6%, 95% CI [15.6%, 29.6%], of men. Providing accommodations such as on-site childcare, flexible scheduling, or virtual attendance options can therefore help to promote inclusivity and allow more women researchers to attend.

Only 5.7%, 95% CI [2.4%, 8.9%], of respondents engaging with Connected Everything declared a disability, which is higher than in the engineering and technology academic community (3.4%, 95% CI [3.2%, 3.7%]) [ 26 ], but unrepresentative of the wider UK population. It has been suggested that academics can be uncomfortable declaring disabilities because scholarly contributions and institutional citizenship are so prized that they feel unable to be honest about their health concerns and instead keep them secret [ 47 ]. In research networks, it is important to be mindful of this hidden group within higher education and to ensure that measures are put in place to make the network’s activities inclusive to all. Future considerations for improving the inclusivity of research events include: improving the physical accessibility of venues; providing assistive technology such as screen readers, audio descriptions, and captioning to help individuals with visual or hearing impairments to access and participate; providing sign language interpreters; offering flexible scheduling options; and providing quiet rooms, written materials in accessible formats, and support staff trained to work with individuals with cognitive disabilities.

Connected Everything introduced measures (e.g., anonymised reviewing process, Q&A sessions before funding calls, inclusive design of panel pitch) to help address inequalities in how funding is awarded. Table 2 shows success in reducing the dominance of researchers who identify as male and of research-intensive universities in winning research funding, and that researchers with care responsibilities were more successful at winning funding than those without care responsibilities. The data revealed that the proposed measures were unable to address the inequality in award rates between white and ethnic minority researchers, which is an area for improvement. The inequality appears to occur during the anonymous review stage, with a greater proportion of white researchers being invited to the panel. Recommendations to make the review process fairer include: ensuring greater diversity of reviewers; reviewer anti-bias training; and automatic adjustments to correct for known biases in writing style [ 16 , 32 ].

When reflecting on the development of a strategy to embed EDI throughout the network, Connected Everything has learned several key lessons that may benefit other networks undergoing a similar activity. These include:

EDI is never ‘done’: There is a constant need to review approaches to EDI to ensure they remain relevant to the network community. Connected Everything could review its principles to include the concept of justice in its approach to diversity and inclusion. The concept of justice concerning EDI refers to the removal of systematic barriers that stop fair and equitable distribution of resources and opportunities among all members of society, regardless of their individual characteristics or backgrounds. The principles and subsequent actions could be reviewed against the EDI expectations [ 14 ], paying particular attention to areas where barriers may still be present. For example, shifting from welcoming people into existing structures and culture to creating new structures and culture together, with specific emphasis on decision or advisory mechanisms within the network. This activity could lend itself to focusing more on tailored support to overcome barriers, thus achieving equity, if it is not within the control of the network to remove the barrier itself (justice).

Widen diversity categories: By collecting data on a broad range of characteristics, we can identify and address disparities and biases that might otherwise be overlooked. A weakness of this dataset is that it ignores the experience of those with intersectional identities across race, ethnicity, gender, class, disability and/or LGBTQI identity. The Wellcome Trust has noted how little is known about the socio-economic background of scientists and researchers [ 48 ].

Collect data on whole research teams: For the first two calls for feasibility study funding, Connected Everything only asked the Principal Investigator to voluntarily provide their data. We realised that this was a limited approach and, in the third call, asked for the data regarding the whole research team to be shared anonymously. Furthermore, we do not currently measure the diversity of our event speakers, panellists or reviewers. Collecting these data in the future will help to ensure the network is accountable and will ensure that all groups are represented during our activities and in the funding decision-making process.

High response rate: Previous surveys measuring network diversity (e.g., [ 7 ]) have struggled to get responses when surveying their memberships, whereas this study achieved a response rate of 73.8%. We attribute this high response rate to sending EDI data requests at the point of contact with the network (e.g., on submitting funding proposals or after attending network events), rather than trying to survey the entire network membership at any one point in time.

Improve administration: The administration associated with collecting EDI data requires a commitment to transparency, inclusivity, and continuous improvement. For example, during the first feasibility funding call, Connected Everything made it clear that the review process would be anonymous, but the application form was not in separate documents. This made anonymising the application forms extremely time-consuming. For the subsequent calls, separate documents were created – Part A for identifying information (Principal Investigator contact details, Project Team and Industry collaborators) and Part B for the research idea.

Accepting that this can be uncomfortable: Trying to improve EDI can be uncomfortable because it often requires challenging our assumptions, biases, and existing systems and structures. However, it is essential if we want to make real progress towards equity and inclusivity. Creating processes to support embedding EDI takes time and Connected Everything has found it is rare to get it right the first time. Connected Everything is sharing its learning as widely as possible both to support others in their approaches and continue our learning as we reflect on how to continually improve, even when it is challenging.

Enabling individual engagement with EDI: During this work, Connected Everything recognised that methods for engaging with such EDI issues in research design and delivery are lacking. Connected Everything, with support from the Future Food Beacon of Excellence at the University of Nottingham, set out to develop a card-based tool [ 49 ] to help researchers and stakeholders identify questions around how their work may promote equity and increase inclusion or have a negative impact towards one or more protected groups and how this can be overcome. The results of this have been shared at conference presentations [ 50 ] and will be published later.

While this study provides insights into how EDI can be improved in research network activities and funding processes, it is essential to acknowledge several limitations that may impact the interpretation of the findings.

Sample size and generalisability: A total of 260 responses were received, which may not be representative of our overall network of 500+ members. Nevertheless, these data provide a sense of the current diversity engaging in Connected Everything activities and funding opportunities, which we can compare with other available data to steer action to further diversify the network.

Handling of missing data: Out of the 260 responses, 66 data points were missing for questions regarding age, disability, and caring responsibilities. These questions were mistakenly omitted from a Connected Everything summer school survey, which accounts for 62 of the missing data points. While we assumed the remainder of the missing data to be missing at random during analysis, it is important to acknowledge that the missingness could be related to other factors, potentially introducing bias into our results.

Emphasis on quantitative data: The study relies on using quantitative data to evaluate the impact of the EDI measures introduced by Connected Everything. However, relying solely on quantitative metrics may overlook nuanced aspects of EDI that cannot be easily quantified. For example, EDI encompasses multifaceted issues influenced by historical, cultural, and contextual factors. These nuances may not be fully captured by numbers alone. In addition, some EDI efforts may not yield immediate measurable outcomes but still contribute to a more inclusive environment.

Diversity and inclusion are not synonymous: The study proposes 21 measures to contribute towards creating an equal, diverse and inclusive research culture and collects diversity data to measure the impact of these measures. However, while diversity is simpler to monitor, increasing diversity alone does not guarantee equality or inclusion. Even with diverse research groups, individuals from underrepresented groups may still face barriers, microaggressions, or exclusion.

Balancing anonymity and rigour in grant reviews: The anonymous review process proposed by Connected Everything removes personal and organisational details from the research ideas under evaluation. However, there remains a possibility that a reviewer could discern the identity of a grant applicant from the research idea itself. Reviewers are expected to be subject matter experts in the field relevant to the proposal they are evaluating. Given the specialised nature of scientific research, it is conceivable that a well-known applicant could be identified through the specifics of the work, the methodologies employed, or even the writing style.

Expanding gender identity options: A limitation of this study emerged from the restricted gender options (male, female, other, prefer not to say) provided to respondents when answering the gender identity question. This limitation reflects the context of data collection in 2018, a time when diversity monitoring guidance was still limited. As our understanding of gender identity evolves beyond binary definitions, future data collection efforts should embrace a more expansive and inclusive approach, recognising the diverse spectrum of gender identities.

In conclusion, this study provides evidence of the effectiveness of a research network's approach to promoting equality, diversity, and inclusion (EDI) in research and funding. By collecting EDI data from individuals who engage with network activities and apply for research funding, this study has shown that the network's initiatives have had a positive impact on representation and fairness in the funding process. Specifically, the analysis reveals that the network is successful at engaging with ECRs, and those with care responsibilities and has a diverse range of ethnicities represented at Connected Everything events. Additionally, the network activities have a more equal gender balance and greater representation of researchers with disabilities when compared to the engineering and technology academic community, though there is still an underrepresentation of these groups compared to the national population.

Connected Everything introduced measures to help address inequalities in how funding is awarded. The measures introduced helped reduce the dominance of researchers who identified as male and research-intensive universities in winning research funding. Additionally, researchers with care responsibilities were more successful at winning funding than those without care responsibilities. However, inequality persisted with white researchers achieving higher award rates than those from ethnic minority backgrounds. Recommendations to make the review process fairer include: ensuring greater diversity of reviewers; reviewer anti-bias training; and automatic adjustments to correct for known biases in writing style.

Connected Everything’s approach to embedding EDI in network activities has already been shared widely with other EPSRC-funded networks and Hubs (e.g. the UKRI Circular Economy Hub and the UK Acoustics Network Plus). The network hopes that these findings will inform broader efforts to promote EDI in research and funding and that researchers, funders, and other stakeholders will be encouraged to adopt evidence-based strategies for advancing this important goal.

Availability of data and materials

The data were collected anonymously; however, it may be possible to identify an individual by combining specific records from the data request forms. Therefore, the study data have been presented in aggregate form to protect the confidentiality of individuals, and the data utilised in this study cannot be made openly accessible due to ethical obligations to protect the privacy and confidentiality of the data providers.

Abbreviations

ECR: Early career researcher

EDI: Equality, diversity and inclusion

EPSRC: Engineering and Physical Sciences Research Council

UKRI: UK Research and Innovation

Xuan J, Ocone R. The equality, diversity and inclusion in energy and AI: call for actions. Energy AI. 2022;8:100152.


Guyan K, Oloyede FD. Equality, diversity and inclusion in research and innovation: UK review. Advance HE; 2019.  https://www.ukri.org/wp-content/uploads/2020/10/UKRI-020920-EDI-EvidenceReviewUK.pdf .

Cooke A, Kemeny T. Cities, immigrant diversity, and complex problem solving. Res Policy. 2017;46:1175–85.

AlShebli BK, Rahwan T, Woon WL. The preeminence of ethnic diversity in scientific collaboration. Nat Commun. 2018;9:5163.

Gagnon S, Augustin T, Cukier W. Interplay for change in equality, diversity and inclusion studies: Hum Relations. Epub ahead of print 23 April 2021. https://doi.org/10.1177/00187267211002239 .

Connected Everything. https://connectedeverything.ac.uk/. Accessed 27 Feb 2023.

Chandler-Wilde S, Kanza S, Fisher O, Fearnshaw D, Jones E. Reflections on an EDI Survey of UK-Government-Funded Research Networks in the UK. In: The 51st International Congress and Exposition on Noise Control Engineering. St. Albans: Institute of Acoustics; 2022. p. 9.0–940.


Prathivadi Bhayankaram K, Prathivadi Bhayankaram N. Conference panels: do they reflect the diversity of the NHS workforce? BMJ Lead. 2022;6:57–59.

Goodman SW, Pepinsky TB. Gender representation and strategies for panel diversity: Lessons from the APSA Annual Meeting. PS Polit Sci Polit 2019;52:669–676.

Olsen J, Griffiths M, Soorenian A, et al. Reporting from the margins: disabled academics reflections on higher education. Scand J Disabil Res. 2020;22:265–74.

Baldie D, Dickson CAW, Sixsmith J. Building an Inclusive Research Culture. In: Knowledge, Innovation, and Impact. 2021, pp. 149–157.

Sato S, Gygax PM, Randall J, et al. The leaky pipeline in research grant peer review and funding decisions: challenges and future directions. High Educ. 2020;82:145–62.

Recio-Saucedo A, Crane K, Meadmore K, et al. What works for peer review and decision-making in research funding: a realist synthesis. Res Integr Peer Rev. 2022;7:1–28.

EPSRC. Expectations for equality, diversity and inclusion – UKRI, https://www.ukri.org/about-us/epsrc/our-policies-and-standards/equality-diversity-and-inclusion/expectations-for-equality-diversity-and-inclusion/ (2022, Accessed 26 Apr 2022).

Witteman HO, Hendricks M, Straus S, et al. Are gender gaps due to evaluations of the applicant or the science? A natural experiment at a national funding agency. Lancet. 2019;393:531–40.

Li YL, Bretscher H, Oliver R, et al. Racism, equity and inclusion in research funding. Sci Parliam. 2020;76:17–9.

UKRI publishes latest diversity data for research funding – UKRI, https://www.ukri.org/news/ukri-publishes-latest-diversity-data-for-research-funding/ (Accessed 28 July 2022).

Higher Education Statistics Agency. What do HE students study? https://www.hesa.ac.uk/data-and-analysis/students/what-study (2023, Accessed 25 March 2023).

UKRI. Competitive funding decisions, https://www.ukri.org/what-we-offer/what-we-have-funded/competitive-funding-decisions / (2023, Accessed 2 April 2023).

Santos G, Van Phu SD. Gender and academic rank in the UK. Sustain. 2019;11:3171.

Jebsen JM, Nicoll Baines K, Oliver RA, et al. Dismantling barriers faced by women in STEM. Nat Chem. 2022;14:1203–6.

Advance HE. Equality in higher education: staff statistical report 2021 | Advance HE, https://www.advance-he.ac.uk/knowledge-hub/equality-higher-education-statistical-report-2021 (28 October 2021, Accessed 26 April 2022).

EngineeringUK. Engineering in Higher Education, https://www.engineeringuk.com/media/318874/engineering-in-higher-education_report_engineeringuk_march23_fv.pdf (2023, Accessed 25 March 2023).

Bhopal K. Academics of colour in elite universities in the UK and the USA: the ‘unspoken system of exclusion’. Stud High Educ. 2022;47:2127–37.

Williams P, Bath S, Arday J, et al. The Broken Pipeline: Barriers to Black PhD Students Accessing Research Council Funding. 2019.

HESA. Who’s working in HE? Personal characteristics, https://www.hesa.ac.uk/data-and-analysis/staff/working-in-he/characteristics (2023, Accessed 1 April 2023).

Office for National Statistics. Principal projection - UK population in age groups, https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationprojections/datasets/tablea21principalprojectionukpopulationinagegroups (2022, Accessed 3 August 2022).

HESA. Who’s studying in HE? Personal characteristics, https://www.hesa.ac.uk/data-and-analysis/students/whos-in-he/characteristics (2023, Accessed 1 April 2023).

Herman E, Nicholas D, Watkinson A et al. The impact of the pandemic on early career researchers: what we already know from the internationally published literature. Prof la Inf ; 30. Epub ahead of print 11 March 2021. https://doi.org/10.3145/epi.2021.mar.08 .

Moreau M-P, Robertson M. ‘Care-free at the top’? Exploring the experiences of senior academic staff who are caregivers, https://srhe.ac.uk/wp-content/uploads/2020/03/Moreau-Robertson-SRHE-Research-Report.pdf (2019).

Shillington AM, Gehlert S, Nurius PS, et al. COVID-19 and long-term impacts on tenure-line careers. J Soc Social Work Res. 2020;11:499–507.

de Winde CM, Sarabipour S, Carignano H et al. Towards inclusive funding practices for early career researchers. J Sci Policy Gov; 18. Epub ahead of print 24 March 2021. https://doi.org/10.38126/JSPG180105 .

Wellcome Trust. Grant funding data report 2018/19, https://wellcome.org/sites/default/files/grant-funding-data-2018-2019.pdf (2020).

Vallée-Tourangeau G, Wheelock A, Vandrevala T, et al. Peer reviewers’ dilemmas: a qualitative exploration of decisional conflict in the evaluation of grant applications in the medical humanities and social sciences. Humanit Soc Sci Commun. 2022;9:1–11.

Diversity data – UKRI. https://www.ukri.org/what-we-offer/supporting-healthy-research-and-innovation-culture/equality-diversity-and-inclusion/diversity-data/ (accessed 30 September 2022).

2011 Census - Office for National Statistics. https://www.ons.gov.uk/census/2011census (Accessed 2 August 2022).

Cost centres. (2012/13 onwards) | HESA, https://www.hesa.ac.uk/support/documentation/cost-centres/2012-13-onwards (Accessed 28 July 2022).

Viner N, Powell P, Green R. Institutionalized biases in the award of research grants: a preliminary analysis revisiting the principle of accumulative advantage. Res Policy. 2004;33:443–54.

ofqual. Rounding policy - GOV.UK, https://www.gov.uk/government/publications/ofquals-statistics-policies-and-procedures/rounding-policy (2023, Accessed 2 April 2023).

Office for National Statistics. Labour market status of disabled people, https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/datasets/labourmarketstatusofdisabledpeoplea08 (2022, Accessed 3 August 2022).

Family Resources Survey: financial year 2020 to 2021 - GOV.UK, https://www.gov.uk/government/statistics/family-resources-survey-financial-year-2020-to-2021 (Accessed 10 Aug 2022).

Henderson E. Academics in two places at once: (not) managing caring responsibilities at conferences. 2018, p. 218.

Jolly S, Griffith KA, DeCastro R, et al. Gender differences in time spent on parenting and domestic responsibilities by high-achieving young physician-researchers. Ann Intern Med. 2014;160:344–53.

UKRI. Early career researchers, https://www.ukri.org/what-we-offer/developing-people-and-skills/esrc/early-career-researchers/ (2022, Accessed 2 April 2023).

Cornell B. PhD Life: The UK student experience , www.hepi.ac.uk (2019, Accessed 2 April 2023).

Kibbe MR, Kapadia MR. Underrepresentation of women at academic medical conferences—manels must stop. JAMA Netw Open. 2020;3:e2018676.

Brown N, Leigh J. Ableism in academia: where are the disabled and ill academics? Disabil Soc. 2018;33:985–989. https://doi.org/10.1080/09687599.2018.1455627

Bridge Group. Diversity in Grant Awarding and Recruitment at Wellcome Summary Report. 2017.

Craigon P, Fisher O, Fearnshaw D, et al. VERSION 1 - The Equality Diversity and Inclusion cards. Epub ahead of print 2022. https://doi.org/10.6084/m9.figshare.21222212.v3

Connected Everything II. EDI ideation cards for research - YouTube, https://www.youtube.com/watch?v=GdJjL6AaBbc&ab_channel=ConnectedEverythingII (2022, Accessed 7 June 2023).


Acknowledgements

The authors would like to acknowledge the support of the Engineering and Physical Sciences Research Council (EPSRC) [grant number EP/S036113/1], Connected Everything II: Accelerating Digital Manufacturing Research Collaboration and Innovation. The authors would also like to gratefully acknowledge the Connected Everything Executive Group for their contribution towards developing Connected Everything’s equality, diversity and inclusion strategy.

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) [grant number EP/S036113/1].

Author information

Authors and affiliations.

Food, Water, Waste Research Group, Faculty of Engineering, University of Nottingham, University Park, Nottingham, UK

Oliver J. Fisher

Human Factors Research Group, Faculty of Engineering, University of Nottingham, University Park, Nottingham, UK

Debra Fearnshaw & Sarah Sharples

School of Food Science and Nutrition, University of Leeds, Leeds, UK

Nicholas J. Watson

School of Engineering, University of Liverpool, Liverpool, UK

Peter Green

Centre for Circular Economy, University of Exeter, Exeter, UK

Fiona Charnley

Institute for Manufacturing, University of Cambridge, Cambridge, UK

Duncan McFarlane


Contributions

OJF analysed and interpreted the data, and was the lead author in writing and revising the manuscript. DF led the data acquisition and supported the interpretation of the data. DF was also a major contributor to the design of the equality, diversity and inclusion (EDI) strategy proposed in this work. NJW supported the design of the EDI strategy and was a major contributor in reviewing and revising the manuscript. PG supported the design of the EDI strategy and was a major contributor in reviewing and revising the manuscript. FC supported the design of the EDI strategy and the interpretation of the data. DM supported the design of the EDI strategy. SS led the development of the EDI strategy proposed in this work, and was a major contributor in data interpretation and in reviewing and revising the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Debra Fearnshaw .

Ethics declarations

Ethics approval and consent to participate.

The research was considered exempt from requiring ethical approval as it uses completely anonymous survey results that are routinely collected as part of the administration of the Network Plus, and informed consent was obtained at the time of original data collection.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary material 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Fisher, O.J., Fearnshaw, D., Watson, N.J. et al. Promoting equality, diversity and inclusion in research and funding: reflections from a digital manufacturing research network. Res Integr Peer Rev 9 , 5 (2024). https://doi.org/10.1186/s41073-024-00144-w


Received : 12 October 2023

Accepted : 09 April 2024

Published : 16 May 2024

DOI : https://doi.org/10.1186/s41073-024-00144-w


Keywords:

  • Research integrity
  • Network policy
  • Funding reviewing
  • EDI interventions

Research Integrity and Peer Review

ISSN: 2058-8615


Microsoft Research AI for Science


AI for Science to empower the fifth paradigm of scientific discovery


“Over the coming decade, deep learning looks set to have a transformational impact on the natural sciences. The consequences are potentially far-reaching and could dramatically improve our ability to model and predict natural phenomena over widely varying scales of space and time. Our AI4Science team encompasses world experts in machine learning, quantum physics, computational chemistry, molecular biology, fluid dynamics, software engineering, and other disciplines, who are working together to tackle some of the most pressing challenges in this field.” Professor Chris Bishop, Technical Fellow, and Director, AI for Science


FACTOR ANALYSIS INVARIANT TO LINEAR TRANSFORMATIONS OF DATA

Modeling data with Gaussian distributions is an important statistical problem. To obtain robust models, one imposes constraints on the means and covariances of these distributions [6, 4, 10, 8]. Constrained ML modeling implies the existence of optimal feature spaces where the constraints are more valid [2, 3]. This paper introduces one such constrained ML modeling technique called factor analysis invariant to linear transformations (FACILT), which is essentially factor analysis in optimal feature spaces. FACILT is a generalization of several existing methods for modeling covariances. This paper presents an EM algorithm for FACILT modeling.
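The abstract does not give the FACILT update equations, but the EM algorithm for ordinary factor analysis, which FACILT generalises, can be sketched as follows. This is a minimal baseline implementation under the standard model x ~ N(mu, L L^T + Psi) with diagonal Psi, offered only as background; it is not the FACILT algorithm itself.

```python
import numpy as np

def factor_analysis_em(X, k, n_iter=200, seed=0):
    """Fit x ~ N(mu, L L^T + Psi), Psi diagonal, by expectation-maximisation.

    X: (N, d) data matrix; k: number of latent factors.
    Returns the loading matrix L (d, k) and the noise variances psi (d,).
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    S_diag = np.diag(Xc.T @ Xc / N)        # diagonal of the sample covariance
    L = 0.1 * rng.standard_normal((d, k))  # random initial loadings
    psi = S_diag.copy()                    # initial noise variances

    for _ in range(n_iter):
        # E-step: posterior moments of the latent factors z given each x.
        psi_inv = 1.0 / psi
        G = np.linalg.inv(np.eye(k) + (L.T * psi_inv) @ L)   # posterior covariance of z
        Ez = Xc @ (psi_inv[:, None] * L) @ G                  # E[z | x], shape (N, k)
        Ezz = N * G + Ez.T @ Ez                               # sum_i E[z z^T | x_i]

        # M-step: re-estimate loadings and diagonal noise.
        L = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        psi = S_diag - np.einsum('ij,nj,ni->i', L, Ez, Xc) / N
        psi = np.maximum(psi, 1e-6)  # keep variances strictly positive

    return L, psi

# Example on synthetic data:
# X = np.random.default_rng(1).standard_normal((500, 10))
# L, psi = factor_analysis_em(X, k=3)
```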

Publication

  • R.A. Gopinath
  • B. Ramabhadran
  • S. Dharanipragada


Neuroticism and inflammatory skin diseases: a bidirectional Mendelian randomization study

  • ORIGINAL PAPER
  • Published: 24 May 2024
  • Volume 316, article number 213 (2024)


  • Charalabos Antonatos 1 ,
  • Alexandros Pontikas 1 ,
  • Adam Akritidis 1 ,
  • Sophia Georgiou 2 ,
  • Alexander J. Stratigos 3 ,
  • Ileana Afroditi Kleidona 3 ,
  • Stamatis Gregoriou 3 ,
  • Katerina Grafanaki 2 , 4 &
  • Yiannis Vasilopoulos 1  

Previous observational studies have linked inflammatory skin diseases with mental health issues and neuroticism. However, the specific impact of neuroticism and its subclusters (i.e. worry, depressed affect, and sensitivity to environmental stress and adversity) on these conditions remains underexplored. In this work, we explored causal associations between common inflammatory skin diseases and neuroticism. We conducted a two-sample, bidirectional Mendelian randomization (MR) analysis using data from genome-wide association studies in psoriasis, atopic dermatitis, neuroticism and relevant genetic subclusters conducted on participants of European ancestry. Corrections for sample overlap were applied where necessary. We found that psoriasis was causally associated with increased levels of worry (odds ratio, 95% confidence intervals: 1.011, 1.006–1.016, P = 3.84 × 10⁻⁶) while none of the neuroticism subclusters showed significant association with psoriasis. Sensitivity analyses revealed considerable evidence of directional pleiotropy between psoriasis and neuroticism traits. Conversely, genetic liability to atopic dermatitis did not exhibit any significant association with neuroticism traits. Notably, genetically predicted worry was linked to an elevated risk of atopic dermatitis (odds ratio, 95% confidence intervals: 1.227, 1.067–1.41, P = 3.97 × 10⁻³). Correction for overlapping samples confirmed the robustness of these results. These findings suggest potential avenues for future interventions aimed at reducing stress and worry among patients with inflammatory skin conditions.
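For readers unfamiliar with the method, the core fixed-effect inverse-variance weighted (IVW) estimator used in two-sample MR combines per-SNP ratio estimates as sketched below. This is a generic illustration with made-up summary statistics, not the authors' full analysis pipeline, which also included sensitivity analyses and sample-overlap corrections.

```python
import numpy as np

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted (IVW) causal estimate from
    per-SNP summary statistics (first-order weights; exposure SEs ignored).

    Returns the estimate on the outcome scale, its standard error, and the
    corresponding odds ratio when the outcome betas are log odds ratios.
    """
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    sy = np.asarray(se_outcome, dtype=float)

    weights = bx**2 / sy**2
    estimate = np.sum(bx * by / sy**2) / np.sum(weights)
    se = np.sqrt(1.0 / np.sum(weights))
    return estimate, se, np.exp(estimate)

# Made-up summary statistics for three instruments:
print(ivw_estimate([0.10, 0.08, 0.12], [0.02, 0.015, 0.03], [0.01, 0.012, 0.011]))
```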



Download references

Acknowledgements

C.A. was financially supported by the “Andreas Mentzelopoulos Foundation”.

This investigation was supported in part by a research grant from the National Eczema Association NEA23-CRG186.

Author information

Authors and affiliations

Laboratory of Genetics, Section of Genetics, Cell Biology and Development, Department of Biology, University of Patras, 26504, Patras, Greece

Charalabos Antonatos, Alexandros Pontikas, Adam Akritidis & Yiannis Vasilopoulos

Department of Dermatology-Venereology, School of Medicine, University of Patras, 26504, Patras, Greece

Sophia Georgiou & Katerina Grafanaki

Department of Dermatology-Venereology, Faculty of Medicine, Andreas Sygros Hospital, National and Kapodistrian University of Athens, 16121, Athens, Greece

Alexander J. Stratigos, Ileana Afroditi Kleidona & Stamatis Gregoriou

Department of Biochemistry, School of Medicine, University of Patras, 26504, Patras, Greece

Katerina Grafanaki


Contributions

Conceptualization: CA, KG, StG, YV; software and formal analysis: CA, AP; visualization: CA; writing (original draft preparation): CA, AP, AA; writing (review and editing): CA, AP, AA, SG, AS, IAK, StG, KG, YV. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Katerina Grafanaki or Yiannis Vasilopoulos.

Ethics declarations

Competing interests

The authors declare no competing interests.

Conflict of interest

None declared.

Ethical approval

All published GWAS datasets used in the study have existing ethical permissions from institutional boards. No ethical approval was required for this manuscript.

Data availability

Summary statistics can be downloaded from: psoriasis, https://www.ebi.ac.uk/gwas/studies/GCST90019016 ; atopic dermatitis, https://www.ebi.ac.uk/gwas/studies/GCST90244787 ; neuroticism and subclusters, https://ctg.cncr.nl/software/summary_statistics/ . Code used to perform the current study is available at: https://github.com/antonatosc/Neuroticism_Skin_MR .


Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (XLSX 204 KB)


About this article

Antonatos, C., Pontikas, A., Akritidis, A. et al. Neuroticism and inflammatory skin diseases: a bidirectional Mendelian randomization study. Arch Dermatol Res 316, 213 (2024). https://doi.org/10.1007/s00403-024-03017-w


Received : 07 April 2024

Revised : 07 April 2024

Accepted : 26 April 2024

Published : 24 May 2024

DOI : https://doi.org/10.1007/s00403-024-03017-w


  • Atopic dermatitis
  • Neuroticism
  • Depressive symptoms
  • Environmental stress

Popup Statistics

Do Popups Still Work in 2024? Here’s What The Data Tell Us (Original Research)


We analyzed 26,270 Sleeknote popup campaigns to determine how to build a high-converting website popup in 2024.

Here’s what we learned:

What Is The Average Conversion Rate for a Popup?

Despite the ongoing changes in the marketing landscape, popups continue to work for businesses that leverage them.

Our data shows that the average Sleeknote campaign conversion rate is 4.13%.


But what is a “conversion”? And what are the variables that influence that number? 

Read on to learn our seven most surprising findings (and how to build a high-converting popup).

1. Use Gamified Popups

When we think about popups, we tend to think of the following scenario:

We visit a site and, within seconds, see a popup asking for our name and email address. 

Sometimes, a brand asks for a name and email in exchange for a discount. But more often than not, it’s an afterthought, pushing instead for visitors to “get updates” (i.e., join its newsletter).

In 2024, we consider such an approach the "old way" of using popups. The relationship is transactional; the visitor trades their details for an incentive.

By contrast, the "new way" of using popups is relational. Here, the visitor plays for the incentive, such as "spinning" a wheel or completing a quiz.

The change to “gamify” the opt-in experience drives a level of engagement we haven’t seen before, as reflected in the numbers below.

Here’s an example from one of our customers, Danish travel retailer Take Offer , combining a giveaway with a “wheel popup” to drive email signups:

We found that popups using spin-to-win (8.67%) outperformed popups that did not use spin-to-win (3.70%) by a whopping 132.32%.


Aside from spin-to-win, we looked at two other gamified popup variations, each with surprising conclusions.

i. Use Daily Offers

Offering visitors a sale on a product for 24 hours is a popular tactic for collecting emails during specific seasonal periods like Easter, Christmas, and Black Week.

Rather than offering a 10% discount to all visitors (which shoppers now expect in industries like ecommerce), the brand gamifies the visitor’s experience. 

(The visitor thinks to themselves, "Should I opt in now? Or return tomorrow and compete for a potentially better, more compelling offer?")

The data shows that people love daily offers. We found that the average conversion rate for popups with daily offers was 29.59%.


One approach to using daily offers, as shown in the example below, is asking the visitor for their name and email and then revealing the day’s offer in the second step:

(As an aside, Sleeknote makes it simple to run daily offers. Set the dates and add the copy, images, coupon codes, and links for each offer. Then, we update the campaign with the day’s offer each day it goes live.)

Use daily offers to collect more details from readers, send better, more targeted emails, and leave visitors wanting to return to your site for new offers.

ii. Use Quizzes

Prevalent in ecommerce, quizzes ask website visitors questions about their interests and recommend potential products based on their answers. 

The visitor gets a personalized experience on-site, and the brand collects valuable details from its subscribers, which it can use to send better, more targeted emails. 

The problem, however, is once shoppers leave the homepage (where the invitation to take a quiz is often featured), brands miss out on using quizzes to their full potential.

To solve that problem, marketers can instead use onsite quizzes—popups with quizzes embedded—and target specific pages where it makes the most sense for the visitor. 

For instance, if a visitor is on a product page, the brand can show a popup offering to help them find the right product (without leaving the page).

We found the average conversion rate for onsite quizzes was 8.65%.


Popups that encourage active participation from the shopper perform much better than those that don’t. Invite users to spin a wheel, take a quiz, or opt into a daily deal to drive higher email conversions.
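
To make the mechanic concrete, here is a minimal TypeScript sketch of the weighted prize draw behind a spin-to-win popup. It is an illustration only, not Sleeknote's implementation; the prize labels, weights, and function names are all made up:

```typescript
// Weighted prize draw behind a "spin-to-win" popup (illustrative only).
interface Prize {
  label: string;
  weight: number; // relative likelihood of landing on this wheel slice
}

const prizes: Prize[] = [
  { label: "5% off", weight: 50 },
  { label: "10% off", weight: 30 },
  { label: "Free shipping", weight: 15 },
  { label: "20% off", weight: 5 },
];

function spinWheel(options: Prize[]): Prize {
  const total = options.reduce((sum, p) => sum + p.weight, 0);
  let roll = Math.random() * total;
  for (const prize of options) {
    roll -= prize.weight;
    if (roll <= 0) return prize;
  }
  return options[options.length - 1]; // fallback for floating-point edge cases
}

// Reveal the prize only after the visitor has submitted an email address.
function handleOptIn(email: string): void {
  const prize = spinWheel(prizes);
  console.log(`Sending a "${prize.label}" coupon to ${email}`);
}

handleOptIn("visitor@example.com");
```

The key design point is that the visitor earns the incentive through participation, which is what separates the "relational" approach from a flat discount.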

2. Use a Timer-Led Trigger

Some marketers use popups with a timer-led trigger, which shows the popup to the visitor after a certain number of seconds.

Others use a scroll-led trigger, which shows after the visitor has scrolled a certain percentage of the page.

(And some show a popup the second a visitor lands on the website for the first time, which serves no one.)

But which is better? And which is the optimal number to use when setting a trigger?

We found that timer-led triggers (4.42%) outperform scroll-based triggers (2.64%) by 67.42%, and that a 6-second delay is optimal.


Use timer-led triggers and set popups to show after 6 seconds to give visitors time to browse the current page.
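
As a rough sketch of what a timer-led trigger looks like in plain TypeScript (the element id, storage key, and wiring below are our own assumptions, not Sleeknote's API):

```typescript
// Show the popup 6 seconds after page load, at most once per session.
const POPUP_DELAY_MS = 6_000;

function showPopup(): void {
  const popup = document.getElementById("newsletter-popup"); // hypothetical element id
  if (popup) popup.classList.remove("hidden");
}

window.addEventListener("load", () => {
  if (sessionStorage.getItem("popupShown")) return; // avoid nagging repeat page views
  window.setTimeout(() => {
    showPopup();
    sessionStorage.setItem("popupShown", "1");
  }, POPUP_DELAY_MS);
});
```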

3. Optimize on Mobile

Google doesn’t like intrusive interstitials. You know that. You also know users spend more time on mobile than on desktop. 

The question is: do mobile users convert as well as desktop users, if not better? 

To our surprise, mobile popups (5.60%) outperform desktop popups (2.86%) by a whopping 97.18%.


It’s worth mentioning here, however, that no two conversions are alike, and a conversion goal might change from one device to another. 

At Sleeknote, for instance, we want to capture email subscribers on mobile and drive free trials on desktop (given the latter’s UX is better).

As such, we have campaigns that reflect the above, each with its own device-specific goal.

On our Recipes page, we want desktop visitors to start a free trial by clicking one of our pre-made popup templates.


But on mobile, we want visitors to book a free demo. 

So, to help them do that, we show a teaser, inviting the user to learn more about our product before inviting them to book a free call.


Yes, making one popup for both devices is quicker and easier. Still, given that few marketers capitalize on desktop and mobile traffic to their fullest extent, making that second mobile campaign is well worth the time and effort.

Show popups on mobile, but make them device-specific depending on your conversion goals.
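
A device-specific setup can be as simple as choosing a different campaign per viewport. The sketch below is illustrative; the campaign ids, goals, and 768px breakpoint are assumptions rather than anything prescribed by our product:

```typescript
// Pick a different popup campaign (and conversion goal) for mobile vs. desktop.
type Campaign = { id: string; goal: "email-signup" | "free-trial" | "book-demo" };

const desktopCampaign: Campaign = { id: "recipes-desktop", goal: "free-trial" };
const mobileCampaign: Campaign = { id: "recipes-mobile", goal: "book-demo" };

function pickCampaign(): Campaign {
  const isMobile = window.matchMedia("(max-width: 768px)").matches;
  return isMobile ? mobileCampaign : desktopCampaign;
}

console.log(`Showing campaign: ${pickCampaign().id}`);
```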

4. Use Multistep Popups

Internet privacy has become a hot topic in recent years. 

With the death of third-party cookies, the birth of Consent Mode 2.0, and other privacy-related updates, more marketers are looking for new ways to collect first-party data from site visitors.

The problem, as CRO marketers in particular can attest, is that asking for more details from visitors (in other words, adding more input fields) can lead to fewer conversions.

Our research confirms this: conversions drop from 4.41% to 2.90% when going from one input field to two, and from 2.90% to 1.93% when going from two to three.


But not all hope is lost.

When using a multistep popup that asks for the visitor's email in the first step and more details in the second step, 76 percent of visitors go on to input the additional information.


To enrich lead data without compromising conversions, consider using a multistep popup. In the first step, ask for the visitor’s email. Then, in the second step, ask for extra details, such as their interests or preferences.

You can later use this information and other details collected in the second step to send better-targeted emails and drive more revenue per subscriber.

Use multistep popups to collect more details from website visitors without seeing a conversion drop.
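
Conceptually, a multistep popup just defers the optional fields until after the email is captured. A minimal sketch follows; the field names and the two helper functions are hypothetical:

```typescript
// Two-step popup flow: secure the email first, then ask for optional extras.
interface Lead {
  email: string;
  interests?: string[];
  birthday?: string;
}

let lead: Lead | null = null;

function submitStepOne(email: string): void {
  lead = { email };  // the conversion is already secured at this point
  renderStepTwo();   // swap the single input for the second, optional form
}

function submitStepTwo(interests: string[], birthday?: string): void {
  if (!lead) return;
  lead = { ...lead, interests, birthday };
  sendToEmailProvider(lead); // hand the enriched lead to your email tool
}

// Hypothetical helpers, declared here only so the sketch type-checks.
declare function renderStepTwo(): void;
declare function sendToEmailProvider(lead: Lead): void;
```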

5. Use an Image

It’s no secret that images increase conversions. Showcasing a product on a product page, complete with a model demonstrating said product, moves shoppers to purchase.

But what about popups? Does adding an image move browsers to input their details or click-through to learn more about a product?

As it turns out, it does. 

We found that campaigns with an image (4.05%) outperformed popups without an image (0.66%) by 513.63%.


It’s important to mention here that images are not limited to products.

Some of our favorite popup examples, including Børsen promoting its paid subscription as per the example below, feature an image promoting the offer.

To increase a popup’s conversion rate, use an image that complements the popup’s message. (You’ll go even further when using an image for the popup and teaser.)

6. Use a Countdown Timer

Countdown timers can also increase conversions when combined with a genuine, time-sensitive promotion. 

It doesn’t matter whether it’s a seasonal campaign, a Black Friday promotion, or a limited-time offer—the fear of missing out moves shoppers to purchase.

While urgency is common in email campaigns and on landing pages, we wondered whether creating urgency in a popup had a similar effect. 

However, rather than look at popups that create urgency in their copy, we looked at another, more underused element: a countdown timer.

We found that popups with a countdown timer (5.17%) convert better than campaigns without a countdown timer (4.12%) by 25.48%.


Marketing tactics change over time. But “weapons of persuasion” that move us to action? Those will never change. Give visitors a much-needed visual reminder (when it makes sense) to drive higher onsite conversions.  

When running a time-sensitive campaign, include a countdown timer to remind users of the campaign’s deadline.
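
For reference, a countdown element is little more than a one-second interval that renders the time left until the campaign's deadline. The sketch below is generic, with a hypothetical element id and deadline:

```typescript
// Tick once per second until the campaign deadline passes.
function startCountdown(deadline: Date, render: (label: string) => void): void {
  const timer = window.setInterval(() => {
    const msLeft = deadline.getTime() - Date.now();
    if (msLeft <= 0) {
      window.clearInterval(timer);
      render("Offer expired");
      return;
    }
    const hours = Math.floor(msLeft / 3_600_000);
    const minutes = Math.floor((msLeft % 3_600_000) / 60_000);
    const seconds = Math.floor((msLeft % 60_000) / 1_000);
    render(`${hours}h ${minutes}m ${seconds}s left`);
  }, 1_000);
}

// Usage: write the remaining time into an element inside the popup.
startCountdown(new Date("2024-11-29T23:59:59"), (label) => {
  const el = document.getElementById("countdown-label"); // hypothetical id
  if (el) el.textContent = label;
});
```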

7. Use a Teaser

Timing is everything when it comes to popups. 

Showing a popup too soon causes a visitor to close it without thinking, while showing it too late increases the risk of missing out on a potential conversion.

As discussed above, using a timer-led trigger solves the issue for most marketers—waiting 6 seconds gives readers time to decide whether to join the brand’s email list.

Still, the best time to engage visitors is when they choose to see the popup themselves, which is exactly what a teaser that previews the popup's content allows.

The teaser is often visible in the screen’s bottom-left or right-hand corner, reminding visitors there’s an offer for them if/when they’re ready to learn more.

Here’s an example from a Sleeknote customer, WoodUpp .

We found that popups with a teaser (4.54%) convert better than campaigns without a teaser (2.74%) by 65.69 percent.


Sometimes, a sneak preview is all that's needed to nudge skeptical visitors to opt in or make a purchase. If not, the visitor can reopen the popup when the offer is most needed (such as adding a code before checking out).

Use a teaser to preview a popup’s message to increase conversions and reengage visitors when needed.
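
In practice, a teaser is a small, always-available tab that opens the full popup only when clicked, and that the popup can collapse back into. A bare-bones sketch, with made-up element ids and class names:

```typescript
// Corner teaser that expands into the popup on click, and back again.
function initTeaser(teaserId: string, popupId: string): void {
  const teaser = document.getElementById(teaserId);
  const popup = document.getElementById(popupId);
  if (!teaser || !popup) return;

  teaser.addEventListener("click", () => {
    popup.classList.remove("hidden");
    teaser.classList.add("hidden");
  });

  // Let visitors minimize the popup back to the teaser and reopen it later,
  // e.g. to grab the code right before checking out.
  popup.querySelector(".minimize")?.addEventListener("click", () => {
    popup.classList.add("hidden");
    teaser.classList.remove("hidden");
  });
}

initTeaser("offer-teaser", "offer-popup"); // hypothetical element ids
```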

If you’re already using popups, this guide will teach you how to use them to their full advantage.

And if you’re not, there’s no better time to get started than now. 

You can try Sleeknote for free for seven days and get unlimited access to all our features (including those mentioned in the guide).

Try Sleeknote for free now.


Sam Thomas Davies


