
214 Best Big Data Research Topics for Your Thesis Paper

Finding an ideal big data research topic can take a long time. Big data, IoT, and robotics have evolved rapidly, and future generations will be immersed in technologies that make work easier: tasks that once required ten people may soon be done by one person or a machine. Although some jobs will be lost, many new ones will be created, so the shift can benefit everyone.

Big data is being embraced globally, and data science and analytics now support institutions, governments, and the private sector alike. Below, we share the best big data research topics.

On top of that, we offer writing tips to help you succeed academically. As a university student, you need to do thorough research to earn top grades, so feel free to consult us if you need research paper writing services.

Big Data Analytics Research Topics for your Research Project

Are you looking for an ideal big data analytics research topic? Once you choose one, consult your professor to confirm it is a strong choice; this will help you earn good grades.

  • Which are the best tools and software for big data processing?
  • Evaluate the security issues that face big data.
  • An analysis of large-scale data for social networks globally.
  • The influence of big data storage systems.
  • The best platforms for big data computing.
  • The relation between business intelligence and big data analytics.
  • The importance of semantics and visualization of big data.
  • Analysis of big data technologies for businesses.
  • The common methods used for machine learning in big data.
  • The difference between self-tuning and symmetrical spectral clustering.
  • The importance of information-based clustering.
  • Evaluate the hierarchical clustering and density-based clustering application.
  • How is data mining used to analyze transaction data?
  • The major importance of dependency modeling.
  • The influence of probabilistic classification in data mining.

Interesting Big Data Analytics Topics

Who said big data had to be boring? Here are some interesting big data analytics topics that you can try. They are based on phenomena that are making the world a better place.

  • Discuss the privacy issues in big data.
  • Evaluate scalable storage systems in big data.
  • The best big data processing software and tools.
  • The most popularly used data mining tools and techniques.
  • Evaluate the scalable architectures for parallel data processing.
  • The major natural language processing methods.
  • Which are the best big data tools and deployment platforms?
  • The best algorithms for data visualization.
  • Analyze anomaly detection in cloud servers.
  • The scrutiny normally done for the recruitment of big data job profiles.
  • The malicious user detection in big data collection.
  • Learning long-term dependencies via the Fourier recurrent units.
  • Nomadic computing for big data analytics.
  • The elementary estimators for graphical models.
  • The memory-efficient kernel approximation.

Big Data Latest Research Topics

Do you know the latest research topics at the moment? These 15 topics will help you to dive into interesting research. You may even build on research done by other scholars.

  • Evaluate the data mining process.
  • The influence of the various dimension reduction methods and techniques.
  • The best data classification methods.
  • The simple linear regression modeling methods.
  • Evaluate the logistic regression modeling.
  • What are the commonly used theorems?
  • The influence of cluster analysis methods in big data.
  • The importance of smoothing methods analysis in big data.
  • How is fraud detection done through AI?
  • Analyze the use of GIS and spatial data.
  • How important is artificial intelligence in the modern world?
  • What is agile data science?
  • Analyze the behavioral analytics process.
  • Semantic analytics distribution.
  • How is domain knowledge important in data analysis?

Big Data Debate Topics

If you want to prosper in the field of big data, you need to tackle even the harder topics. These big data debate topics are interesting and will help you gain a better understanding.

  • The difference between big data analytics and traditional data analytics methods.
  • Why do you think the organization should think beyond the Hadoop hype?
  • Does the size of the data matter more than how recent the data is?
  • Is it true that bigger data are not always better?
  • The debate of privacy and personalization in maintaining ethics in big data.
  • The relation between data science and privacy.
  • Do you think data science is a rebranding of statistics?
  • Who delivers better results between data scientists and domain experts?
  • According to your view, is data science dead?
  • Do you think analytics teams need to be centralized or decentralized?
  • The best methods to resource an analytics team.
  • The best business case for investing in analytics.
  • The societal implications of the use of predictive analytics within education.
  • Is there a need for greater control to prevent experimentation on social media users without their consent?
  • How is the government using big data: to improve public statistics or to control the population?

University Dissertation Topics on Big Data

Are you doing your master's or Ph.D. and wondering which dissertation or thesis topic to choose? Why not try one of these? They are interesting and based on various phenomena. While doing the research, make sure you relate the phenomenon to modern society.

  • Machine learning algorithms used for fall recognition.
  • The divergence and convergence of the internet of things.
  • Reliable data movement using bandwidth provisioning strategies.
  • How is big data analytics using artificial neural networks in cloud gaming?
  • How is Twitter account classification done using network-based features?
  • How is online anomaly detection done in the cloud collaborative environment?
  • Evaluate the public transportation insights provided by big data.
  • Evaluate the paradigm of using nursing EHR data to predict outcomes for cancer patients.
  • Discuss current lossless data compression in the smart grid.
  • How does online advertising traffic prediction help in boosting businesses?
  • How is the hyperspectral classification done using the multiple kernel learning paradigm?
  • The analysis of large data sets downloaded from websites.
  • How does social media data help advertising companies globally?
  • Which are the systems recognizing and enforcing ownership of data records?
  • The alternate possibilities emerging for edge computing.

The Best Big Data Analysis Research Topics and Essays

There are a lot of issues associated with big data. Here are some of the research topics that you can use in your essays. These topics are ideal whether you are in high school or college.

  • The various errors and uncertainty in making data decisions.
  • The application of big data on tourism.
  • The automation innovation with big data or related technologies.
  • The business models of big data ecosystems.
  • Privacy awareness in the era of big data and machine learning.
  • The data privacy for big automotive data.
  • How is traffic managed in software-defined data center networks?
  • Big data analytics for fault detection.
  • The need for machine learning with big data.
  • The innovative big data processing used in health care institutions.
  • The money normalization and extraction from texts.
  • How is text categorization done in AI?
  • The opportunistic development of data-driven interactive applications.
  • The use of data science and big data towards personalized medicine.
  • The programming and optimization of big data applications.

The Latest Big Data Research Topics for your Research Proposal

Doing a research proposal can be hard at first unless you choose an ideal topic. If you are just diving into the big data field, you can use any of these topics to get a deeper understanding.

  • The data-centric network of things.
  • Big data management using artificial intelligence in the supply chain.
  • The big data analytics for maintenance.
  • The high confidence network predictions for big biological data.
  • The performance optimization techniques and tools for data-intensive computation platforms.
  • The predictive modeling in the legal context.
  • Analysis of large data sets in life sciences.
  • How to understand mobility and transport modal disparities using emerging data sources?
  • How do you think data analytics can support asset management decisions?
  • An analysis of travel patterns for cellular network data.
  • The data-driven strategic planning for citywide building retrofitting.
  • How is money normalization done in data analytics?
  • Major techniques used in data mining.
  • The big data adaptation and analytics of cloud computing.
  • The predictive data maintenance for fault diagnosis.

Interesting Research Topics on A/B Testing In Big Data

A/B testing topics differ from the usual big data topics, but you use a broadly similar methodology to get to the reasons behind an issue. These topics are interesting and will help you gain a deeper understanding; a short worked example of checking an A/B test's statistical significance appears after the list.

  • How is ultra-targeted marketing done?
  • The transition of A/B testing from digital to offline.
  • How can big data and A/B testing be done to win an election?
  • Evaluate the use of A/B testing on big data.
  • Evaluate A/B testing as a randomized control experiment.
  • How does A/B testing work?
  • The mistakes to avoid while conducting A/B testing.
  • The ideal time to use A/B testing.
  • The best way to interpret results for an A/B test.
  • The major principles of A/B tests.
  • Evaluate cluster randomization in big data.
  • The best way to analyze A/B test results and their statistical significance.
  • How is A/B testing used in boosting businesses?
  • The importance of data analysis in conversion research.
  • The importance of A/B testing in data science.
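For illustration, here is a minimal Python sketch (standard library only) of how the statistical significance of an A/B test might be checked with a two-proportion z-test. The conversion counts are made-up numbers for illustration, not data from any real experiment.

# A minimal sketch of analyzing an A/B test with a two-proportion z-test.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for the difference
    between two conversion rates (variant A vs. variant B)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))             # two-sided test
    return z, p_value

# Hypothetical results: 480/10,000 conversions for A, 560/10,000 for B.
z, p = two_proportion_z_test(480, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")   # p < 0.05 suggests a statistically significant lift

A result like this only tells you the difference is unlikely to be chance; whether the lift is practically meaningful is a separate judgment.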

Amazing Research Topics on Big Data and Local Governments

Governments and their institutions are now using big data to improve citizens' lives. These topics are based on real-life experiences and on making the world better.

  • Assess the benefits and barriers of big data in the public sector.
  • The best approach to smart city data ecosystems.
  • Big data analytics used for policymaking.
  • Evaluate smart technology and the emergence of algorithmic bureaucracy.
  • Evaluate the use of citizen scoring in public services.
  • An analysis of government administrative data globally.
  • The public values found in the era of big data.
  • Public engagement on local government data use.
  • Data analytics use in policymaking.
  • How are algorithms used in public sector decision-making?
  • The democratic governance in the big data era.
  • The best business model innovation to be used in sustainable organizations.
  • How does the government use the collected data from various sources?
  • The role of big data for smart cities.
  • How does big data play a role in policymaking?

Easy Research Topics on Big Data

Who said big data topics had to be hard? Here are some of the easiest research topics. They are based on data management, research, and data retention. Pick one and try it!

  • Who uses big data analytics?
  • Evaluate structured machine learning.
  • Explain the whole deep learning process.
  • Which are the best ways to manage platforms for enterprise analytics?
  • Which are the new technologies used in data management?
  • What is the importance of data retention?
  • The best ways to work with images when doing research.
  • The best ways to promote research outreach through data management.
  • The best way to source and manage external data.
  • Does machine learning improve the quality of data?
  • Describe the security technologies that can be used in data protection.
  • Evaluate token-based authentication and its importance.
  • How can poor data security lead to the loss of information?
  • How to determine whether data is secure.
  • What is the importance of centralized key management?

Unique IoT and Big Data Research Topics

The Internet of Things has evolved, and many devices now use it: smart devices, smart cities, smart locks, and much more can be controlled at the touch of a button.

  • Evaluate the 5G networks and IoT.
  • Analyze the use of artificial intelligence in the modern world.
  • How do ultra-power IoT technologies work?
  • Evaluate the adaptive systems and models at runtime.
  • How have smart cities and smart environments improved the living space?
  • The importance of the IoT-based supply chains.
  • How does smart agriculture influence water management?
  • Naming and identifiers for internet applications.
  • How does the smart grid influence energy management?
  • Which are the best design principles for IoT application development?
  • The best human-device interactions for the Internet of Things.
  • The relation between urban dynamics and crowdsourcing services.
  • The best wireless sensor network for IoT security.
  • The best intrusion detection in IoT.
  • The importance of big data on the Internet of Things.

Big Data Database Research Topics You Should Try

Big data is broad and interesting. These big data database research topics will put you in a better place in your research. You also get to evaluate the roles of various phenomena.

  • The best cloud computing platforms for big data analytics.
  • The parallel programming techniques for big data processing.
  • The importance of big data models and algorithms in research.
  • Evaluate the role of big data analytics for smart healthcare.
  • How is big data analytics used in business intelligence?
  • The best machine learning methods for big data.
  • Evaluate the Hadoop programming in big data analytics.
  • What is privacy preservation in big data analytics?
  • The best tools for massive big data processing.
  • IoT deployment in Governments and Internet service providers.
  • How will IoT be used for future internet architectures?
  • How does big data close the gap between research and implementation?
  • What are the cross-layer attacks in IoT?
  • The influence of big data and smart city planning in society.
  • Why do you think user access control is important?

Big Data Scala Research Topics

Scala is a programming language widely used in data engineering and data management, and it is closely related to other data programming languages. Here are some of the best Scala questions that you can research.

  • Which are the most used languages in big data?
  • How is Scala used in big data research?
  • Is Scala better than Java for big data?
  • How is Scala a concise programming language?
  • How does the Scala language support stream processing in real time?
  • Which are the various libraries for data science and data analysis?
  • How does Scala allow imperative programming in data collection?
  • Evaluate how Scala includes a useful REPL for interaction.
  • Evaluate Scala’s IDE support.
  • The data catalog reference model.
  • Evaluate the basics of data management and its influence on research.
  • Discuss the behavioral analytics process.
  • What can you term as the experience economy?
  • The difference between agile data science and the Scala language.
  • Explain the graph analytics process.

Independent Research Topics for Big Data

These independent research topics for big data are based on various technologies and how they are related. Big data will be greatly important to modern society.

  • The biggest investment is in big data analysis.
  • How are multi-cloud and hybrid settings taking deep root?
  • Why do you think machine learning will be in focus for a long while?
  • Discuss in-memory computing.
  • What is the difference between edge computing and in-memory computing?
  • The relation between the Internet of things and big data.
  • How will digital transformation make the world a better place?
  • How does data analysis help in social network optimization?
  • How will complex big data be essential for future enterprises?
  • Compare the various big data frameworks.
  • The best ways to gather and monitor traffic information using CCTV images.
  • Evaluate the hierarchical structure of groups and clusters in the decision tree.
  • Which are the 3D mapping techniques for live streaming data?
  • How does machine learning help to improve data analysis?
  • Evaluate DataStream management in task allocation.
  • How is big data provisioned through edge computing?
  • The model-based clustering of texts.
  • The best ways to manage big data.
  • The use of machine learning in big data.

Is Your Big Data Thesis Giving You Problems?

These are some of the best topics that you can use to prosper in your studies. Not only are they easy to research, but they also reflect real-world issues. Whether you are in university or college, you need to put enough effort into your studies to succeed. However, if you have time constraints, we can provide professional writing help. Are you looking for online expert writers? Look no further; we will provide quality work at an affordable price.


Top 35 big data interview questions with answers for 2024

Big data is a hot field, and organizations are looking for talent at all levels. Get ahead of the competition for that big data job with these top interview questions and answers.

By Robert Sheldon and Elizabeth Davies

Increasingly, organizations across the globe are seeing the wisdom of embracing big data. The careful analysis and synthesis of massive data sets can provide invaluable insights to help them make informed and timely strategic business decisions.

For example, big data analytics can help determine what new products to develop based on a deep understanding of customer behaviors, preferences and buying patterns. Analytics can also reveal untapped potential, such as new territories or nontraditional market segments.

As organizations race to augment their big data capabilities and skills, the demand for qualified candidates in the field is reaching new heights. If you aspire to pursue a career path in this domain, a world of opportunity awaits. Today's most challenging -- yet rewarding and in-demand -- big data roles include data analysts, data scientists, database administrators, big data engineers and Hadoop specialists. Knowing what big data questions an interviewer will likely ask and how to answer such questions is essential to success.

This article will provide some direction to help set you up for success in your next big data interview -- whether you are a recent graduate in data science or information management or already have experience working in big data-related roles or other technology fields. This piece will also provide you with some of the most commonly asked big data interview questions that prospective employers might ask.


How to prepare for a big data interview

Before delving into the specific big data interview questions and answers, here are the basics of interview preparation.

  • Prepare a tailored and compelling resume. Ideally, you should tailor your resume (and cover letter) to the particular role and position for which you are applying. Not only should these documents demonstrate your qualifications and experience, but they should also convince your prospective employer that you've researched the organization's history, financials, strategy, leadership, culture and vision. Also, don't be shy to call out what you believe to be your strongest soft skills that would be relevant to the role. These might include communication and presentation capabilities; tenacity and perseverance; an eye for detail and professionalism; and respect, teamwork and collaboration.
  • Remember, an interview is a two-way street. Of course, it is essential to provide correct and articulate answers to an interviewer's technical questions, but don't overlook the value of asking your own questions. Prepare a shortlist of these questions in advance of the appointment to ask at opportune moments.
  • The Q&A: prepare, prepare, prepare. Invest the time necessary to research and prepare your answers to the most commonly asked questions, then rehearse your answers before the interview. Be yourself during the interview. Look for ways to show your personality and convey your responses authentically and thoughtfully. Monosyllabic, vague or bland answers won't serve you well.

Now, here are the top 35 big data interview questions. These include a specific focus on the Hadoop framework, given its widespread adoption and ability to solve the most difficult big data challenges, thereby delivering on core business requirements.

Top 35 big data interview questions and answers

Each of the following 35 big data interview questions includes an answer. However, don't rely solely on these answers when preparing for your interview. Instead, use them as a launching point for digging more deeply into each topic.

1. What is big data?

As basic as this question might seem, you should have a clear and concise answer that demonstrates your understanding of this term and its full scope, making it clear that big data can include just about any type of data from any number of sources. The data might come from sources such as the following:

  • server logs
  • social media
  • medical records
  • temporary files
  • machinery sensors
  • automobiles
  • industrial equipment
  • internet of things (IoT) devices

Big data can include structured, semi-structured and unstructured data -- in any combination -- collected from a range of heterogeneous sources. Once collected, the data must be carefully managed so it can be mined for information and transformed into actionable insights. When mining data, data scientists and other professionals often use advanced technologies such as machine learning, deep learning, predictive modeling or other advanced analytics to gain a deeper understanding of the data.

2. How can big data analytics benefit business?

There are a number of ways that big data can benefit organizations, as long as they can extract value from the data, gain actionable insights and put those insights to work. Although you won't be expected to list every possible outcome of a big data project, you should be able to cite several examples that demonstrate what can be achieved with an effective big data project. For example, you might include any of the following:

  • Improve customer service.
  • Personalize marketing campaigns.
  • Increase worker productivity.
  • Improve daily operations and service delivery.
  • Reduce operational expenses.
  • Identify new revenue streams.
  • Improve products and services.
  • Gain a competitive advantage in your industry.
  • Gain deeper insights into customers and markets.
  • Optimize supply chains and delivery routes.

Organizations within specific industries can also gain from big data analytics. For example, a utility company might use big data to better track and manage electrical grids. Or governments might use big data to improve emergency response, help prevent crime and support smart city initiatives.

3. What are your experiences in big data?

If you have had previous roles in the field of big data, outline your title, functions, responsibilities and career path. Include any specific challenges and how you met those challenges. Also mention any highlights or achievements related either to a specific big data project or to big data in general. Be sure to include any programming languages you've worked with, especially as they pertain to big data.

4. What are some of the challenges that come with a big data project?

No big data project is without its challenges. Some of those challenges might be specific to the project itself or to big data in general. You should be aware of what some of these challenges are -- even if you haven't experienced them yourself. Below are some of the more common challenges:

  • Many organizations don't have the in-house skill sets they need to plan, deploy, manage and mine big data.
  • Managing a big data environment is a complex and time-consuming undertaking that must consider both the infrastructure and data, while ensuring that all the pieces fit together.
  • Securing data and protecting personally identifiable information is complicated by the types of data, amounts of data and the diverse origins of that data.
  • Scaling infrastructure to meet performance and storage requirements can be a complex and costly process.
  • Ensuring data quality and integrity can be difficult to achieve when working with large quantities of heterogeneous data.
  • Analyzing large sets of heterogeneous data can be time-consuming and resource-intensive, and it does not always lead to actionable insights or predictable outcomes.
  • Ensuring that you have the right tools in place and that they all work together brings its own set of challenges.
  • The cost of infrastructure, software and personnel can quickly add up, and those costs can be difficult to keep under control.

5. What are the five Vs of big data?

Big data is often discussed in terms of the following five Vs:

  • Volume. The vast amounts of data that are collected from multiple heterogeneous sources.
  • Variety. The various formats of structured, semi-structured and unstructured data, from social media, IoT devices, database tables, web applications, streaming services, machinery, business software and other sources.
  • Velocity. The ever-increasing rate at which data is being generated on all fronts in all industries.
  • Veracity. The degree of accuracy of collected data, which can vary significantly from one source to the next.
  • Value. The potential business value of the collected data.

Interviewers might ask for only four Vs rather than five, in which case they're usually looking for the first four (volume, variety, velocity and veracity). If this happens in your interview, you might also mention that there is sometimes a fifth V: value. To impress your interviewer even further, you can mention yet another V: variability, which refers to the ways in which the data can be used and formatted.

Figure: The six Vs of big data

6. What are the key steps in deploying a big data platform?

There is no one formula that defines exactly how a big data platform should be implemented. However, it's generally accepted that rolling out a big data platform follows these three basic steps (a minimal code sketch of the flow appears after the list):

  • Data ingestion. Start out by collecting data from multiple sources, such as social media platforms, log files or business documentation. Data ingestion might be an ongoing process in which data is continuously collected to support real-time analytics, or it might be collected at defined intervals to meet specific business requirements.
  • Data storage. After extracting the data, store it in a data store, which might be the Hadoop Distributed File System (HDFS), Apache HBase or another NoSQL database.
  • Data processing. The final step is to prepare the data so it is readily available for analysis. For this, you'll need to implement one or more frameworks that have the capacity to handle massive data sets, such as Hadoop, Apache Spark, Flink, Pig or MapReduce, to name a few.
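To make these three steps concrete, here is a minimal, hypothetical Python sketch using PySpark, one of the processing frameworks mentioned above. It assumes pyspark is installed and that a local events.json file exists; the file name, the column names and the output path are placeholders, not part of any specific deployment.

# A minimal PySpark sketch of the ingest -> store -> process flow.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-demo").getOrCreate()

# 1. Data ingestion: read semi-structured JSON events (hypothetical file).
events = spark.read.json("events.json")

# 2. Data storage: persist the raw data in a columnar format (Parquet),
#    which could equally live on HDFS (e.g. an hdfs:// path).
events.write.mode("overwrite").parquet("raw_events.parquet")

# 3. Data processing: a simple aggregation to make the data analysis-ready.
actions_per_user = events.groupBy("user_id").count()
actions_per_user.show()

spark.stop()

In a production pipeline each step would typically run on a cluster and be scheduled continuously or at defined intervals, but the shape of the flow is the same.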

7. What is Hadoop and what are its main components?

Hadoop is an open source distributed processing framework for handling large data sets across computer clusters. It can scale up to thousands of machines, each supporting local computation and storage. Hadoop can process large amounts of different data types and distribute the workloads across multiple nodes, which makes it a good fit for big data initiatives.

The Hadoop platform includes the following four primary modules (components):

  • Hadoop Common. A collection of utilities that support the other modules.
  • Hadoop Distributed File System (HDFS). A key component of the Hadoop ecosystem that serves as the platform's primary data storage system, while providing high-throughput access to application data.
  • Hadoop YARN (Yet Another Resource Negotiator). A resource management framework that schedules jobs and allocates system resources across the Hadoop ecosystem.
  • Hadoop MapReduce. A YARN-based system for parallel processing large data sets.

8. Why is Hadoop so popular in big data analytics?

Hadoop is effective in dealing with large amounts of structured, unstructured and semi-structured data. Analyzing unstructured data isn't easy, but Hadoop's storage, processing and data collection capabilities make it less onerous. In addition, Hadoop is open source and runs on commodity hardware, so it is less costly than systems that rely on proprietary hardware and software.

One of Hadoop's biggest selling points is that it can scale up to support thousands of hardware nodes. Its use of HDFS facilitates rapid data access across all nodes in a cluster, and its inherent fault tolerance makes it possible for applications to continue to run even if individual nodes fail. Hadoop also stores data in its raw form, without imposing any schemas. This allows each team to decide later how to process and filter the data, based on their specific requirements at the time.

As a follow-on to this question, an interviewer might ask you to define the following four terms, specifically in the context of Hadoop:

9. Open source

Hadoop is an open source platform. As a result, users can access and modify the source code to meet their specific needs. Hadoop is licensed under Apache License 2.0, which grants users a "perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form." Because Hadoop is open source and has been so widely implemented, it has a large and active user community that helps resolve issues and improve the product.

10. Scalability

Hadoop can be scaled out to support thousands of hardware nodes, using only commodity hardware. Organizations can start out with smaller systems and then scale out by adding more nodes to their clusters. They can also scale up by adding resources to the individual nodes. This scalability makes it possible to ingest, store and process the vast amounts of data typical of a big data initiative.

11. Data recovery

Hadoop replication provides built-in fault tolerance capabilities that protect against system failure. Even if a node fails, applications can keep running while avoiding any loss of data. HDFS stores files in blocks that are replicated to ensure fault tolerance, helping to improve both reliability and performance. Administrators can configure block sizes and replication factors on a per-file basis.

12. Data locality

Hadoop moves the computation close to where data resides, rather than moving large sets of data to computation. This helps to reduce network congestion while improving the overall throughput.

13. What are some vendor-specific distributions of Hadoop?

Several vendors now offer Hadoop-based products. Some of the more notable products include the following:

  • Amazon EMR (Elastic MapReduce)
  • Microsoft Azure HDInsight
  • IBM InfoSphere Information Server
  • Hortonworks Data Platform

14. What are some of the main configuration files used in Hadoop?

The Hadoop platform provides multiple configuration files for controlling cluster settings, including the following:

  • hadoop-env.sh. Site-specific environment variables for controlling Hadoop scripts in the bin directory.
  • yarn-env.sh. Site-specific environment variables for controlling YARN scripts in the bin directory.
  • mapred-site.xml. Configuration settings specific to MapReduce, such as the mapreduce.framework.name setting.
  • core-site.xml. Core configuration settings, such as the I/O configurations common to HDFS and MapReduce.
  • yarn-site.xml. Configuration settings specific to YARN's ResourceManager and NodeManager.
  • hdfs-site.xml. Configuration settings specific to HDFS, such as the file path where the NameNode stores the namespace and transaction logs.

Figure: Know core Hadoop components when entering a big data interview.

15. What is HDFS and what are its main components?

HDFS is a distributed file system that serves as Hadoop's default storage environment. It can run on low-cost commodity hardware, while providing a high degree of fault tolerance. HDFS stores the various types of data in a distributed environment that offers high throughput to applications with large data sets. HDFS is deployed in a primary/secondary architecture, with each cluster supporting the following two primary node types:

  • NameNode. A single primary node that manages the file system namespace, regulates client access to files and processes the metadata information for all the data blocks in the HDFS.
  • DataNode. A secondary node that manages the storage attached to each node in the cluster. A cluster typically contains many DataNode instances, but there is usually only one DataNode per physical node. Each DataNode serves read and write requests from the file system's clients.

16. What is Hadoop YARN and what are its main components?

Hadoop YARN manages resources and provides an execution environment for required processes, while allocating system resources to the applications running in the cluster. It also handles job scheduling and monitoring. YARN decouples resource management and scheduling from the data processing component in MapReduce.

YARN separates resource management and job scheduling into the following two daemons:

  • ResourceManager. This daemon arbitrates resources for the cluster's applications. It includes two main components: Scheduler and ApplicationsManager. The Scheduler allocates resources to running applications. The ApplicationsManager has multiple roles: accepting job submissions, negotiating the execution of the application-specific ApplicationMaster and providing a service for restarting the ApplicationMaster container on failure.
  • NodeManager. This daemon launches and manages containers on a node and uses them to run specified tasks. NodeManager also runs services that determine the health of the node, such as performing disk checks. Moreover, NodeManager can execute user-specified tasks.

17. What are Hadoop's primary operational modes?

Hadoop supports three primary operational modes.

  • Standalone. Also referred to as Local mode, the Standalone mode is the default mode. It runs as a single Java process on a single node. It also uses the local file system and requires no configuration changes. The Standalone mode is used primarily for debugging purposes.
  • Pseudo-distributed. Also referred to as a single-node cluster, the Pseudo-distributed mode runs on a single machine, but each Hadoop daemon runs in a separate Java process. This mode also uses HDFS, rather than the local file system, and it requires configuration changes. This mode is often used for debugging and testing purposes.
  • Fully distributed. This is the full production mode, with all daemons running on separate nodes in a primary/secondary configuration. Data is distributed across the cluster, which can range from a few nodes to thousands of nodes. This mode requires configuration changes but offers the scalability, reliability and fault tolerance expected of a production system.

18. What are three common input formats in Hadoop?

Hadoop supports multiple input formats, which determine the shape of the data when it is collected into the Hadoop platform. The following input formats are three of the most common:

  • Text. This is the default input format. Each line within a file is treated as a separate record. The records are saved as key/value pairs, with the line of text treated as the value.
  • Key-Value Text. This input format is similar to the Text format, breaking each line into separate records. Unlike the Text format, which treats the entire line as the value, the Key-Value Text format breaks the line itself into a key and a value, using the tab character as a separator.
  • Sequence File. This format reads binary files that store sequences of user-defined key-value pairs as individual records.

Hadoop supports other input formats as well, so you also should have a good understanding of them, in addition to the ones described here.

19. What makes an HDFS environment fault-tolerant?

HDFS can be easily set up to replicate data to different DataNodes. HDFS breaks files down into blocks that are distributed across nodes in the cluster. Each block is also replicated to other nodes. If one node fails, the other nodes take over, allowing applications to access the data through one of the backup nodes.

20. What is rack awareness in Hadoop clusters?

Rack awareness is one of the mechanisms used by Hadoop to optimize data access when processing client read and write requests. When a request comes in, the NameNode identifies and selects the nearest DataNodes, preferably those on the same rack or on nearby racks. Rack awareness can help improve performance and reliability, while reducing network traffic. Rack awareness can also play a role in fault tolerance. For example, the NameNode might place data block replicas on separate racks to help ensure availability in case a network switch fails or a rack becomes unavailable for other reasons.

21. How does Hadoop protect data against unauthorized access?

Hadoop uses the Kerberos network authentication protocol to protect data from unauthorized access. Kerberos uses secret-key cryptography to provide strong authentication for client/server applications. A client must undergo the following three basic steps to prove its identity to a server (each of which involves message exchanges with the server):

  • Authentication. The client sends an authentication request to the Kerberos authentication server. The server verifies the client and sends the client a ticket granting ticket (TGT) and a session key.
  • Authorization. Once authenticated, the client requests a service ticket from the ticket granting server (TGS). The TGT must be included with the request. If the TGS can authenticate the client, it sends the service ticket and credentials necessary to access the requested resource.
  • Service request. The client sends its request to the Hadoop resource it is trying to access. The request must include the service ticket issued by the TGS.

22. What is speculative execution in Hadoop?

Speculative execution is an optimization technique that Hadoop uses when it detects that a DataNode is executing a task too slowly. There can be many reasons for a slowdown, and it can be difficult to determine its actual cause. Rather than trying to diagnose and fix the problem, Hadoop identifies the task in question and launches an equivalent task -- the speculative task -- as a backup. If the original task completes before the speculative task, Hadoop kills that speculative task.

23. What is the purpose of the JPS command in Hadoop?

JPS, which is short for Java Virtual Machine Process Status, is a command used to check the status of the Hadoop daemons, specifically NameNode, DataNode, ResourceManager and NodeManager. Administrators can use the command to verify whether the daemons are up and running. The tool returns the process ID and process name of each Java Virtual Machine (JVM) running on the target system.

24. What commands can you use to start and stop all the Hadoop daemons at one time?

You can use the following command to start all the Hadoop daemons:

./sbin/start-all.sh

You can use the following command to stop all the Hadoop daemons:

./sbin/stop-all.sh

Figure: Hadoop YARN's architecture

25. What is an edge node in Hadoop?

An edge node is a computer that acts as an end-user portal for communicating with other nodes in a Hadoop cluster. An edge node provides an interface between the Hadoop cluster and an outside network. For this reason, it is also referred to as a gateway node or edge communication node. Edge nodes are often used to run administration tools or client applications. They typically do not run any Hadoop services.

26. What are the key differences between NFS and HDFS?

NFS, which stands for Network File System, is a widely implemented distributed file system protocol used extensively in network-attached storage (NAS) systems. It is one of the oldest distributed file storage systems and is well-suited to smaller data sets. NAS makes data available over a network but accessible like files on a local machine.

HDFS is a more recent technology. It is designed for handling big data workloads, providing high throughput and high capacity, far beyond the capabilities of an NFS-based system. HDFS also offers integrated data protections that safeguard against node failures. NFS is typically implemented on single systems that do not include the inherent fault tolerance that comes with HDFS. However, NFS-based systems are usually much less complicated to deploy and maintain than HDFS-based systems.

27. What is commodity hardware?

Commodity hardware is a device or component that is widely available, relatively inexpensive and can typically be used interchangeably with other components. Commodity hardware is sometimes referred to as off-the-shelf hardware because of its ready availability and ease of acquisition. Organizations often choose commodity hardware over proprietary hardware because it is cheaper, simpler and faster to acquire, and it is easier to replace all or some of the components in the event of hardware failure. Commodity hardware might include servers, storage systems, network equipment or other components.

28. What is MapReduce?

MapReduce is a software framework in Hadoop that's used for processing large data sets across a cluster of computers in which each node includes its own storage. MapReduce can process data in parallel on these nodes, making it possible to distribute input data and collate the results. In this way, Hadoop can run jobs split across a massive number of servers. MapReduce also provides its own level of fault tolerance, with each node periodically reporting its status to a primary node. In addition, MapReduce offers native support for writing Java applications, although you can also write MapReduce applications in other programming languages.

29. What are the two main phases of a MapReduce operation?

A MapReduce operation can be divided into the following two primary phases:

  • Map phase. MapReduce processes the input data, splits it into chunks and maps those chunks in preparation for analysis. MapReduce runs these processes in parallel.
  • Reduce phase. MapReduce processes the mapped chunks, aggregating the data based on the defined logic. The output of these phases is then written to HDFS.

MapReduce operations are sometimes divided into phases other than these two. For example, the Reduce phase might be split into the Shuffle phase and the Reduce phase. In some cases, you might also see a Combiner phase, which is an optional phase used to optimize MapReduce operations.
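To illustrate the two phases, here is a toy, single-process Python sketch of a word count. It only mimics the programming model (map, shuffle, reduce) and is not Hadoop's distributed implementation; the input documents are made up for illustration.

# A toy illustration of the Map and Reduce phases using a word count.
from collections import defaultdict

documents = ["big data is big", "data beats opinions"]

# Map phase: split the input into chunks (words) and emit key/value pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values belonging to the same key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each group according to the defined logic (a sum).
reduced = {word: sum(counts) for word, counts in grouped.items()}

print(reduced)   # {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'opinions': 1}

In Hadoop, the map and reduce functions run in parallel across many nodes and the framework handles the shuffle, fault tolerance and output to HDFS.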

30. What is feature selection in big data?

Feature selection refers to the process of extracting only specific information from a data set. This can reduce the amount of data that needs to be analyzed, while improving the quality of that data used for analysis. Feature selection makes it possible for data scientists to refine the input variables they use to model and analyze the data, leading to more accurate results, while reducing the computational overhead.

Data scientists use sophisticated algorithms for feature selection, which usually fall into one of the following three categories (a small filter-method sketch follows the list):

  • Filter methods. A subset of input variables is selected during a preprocessing stage by ranking the data based on such factors as importance and relevance.
  • Wrapper methods. This approach is a resource-intensive operation that uses machine learning and predictive analytics to try to determine which input variables to keep, usually providing better results than filter methods.
  • Embedded methods. Embedded methods combine attributes of both the filter and wrapper methods, using fewer computational resources than wrapper methods, while providing better results than filter methods. However, embedded methods are not always as effective as wrapper methods.
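As a concrete illustration of a filter method, here is a minimal Python sketch using scikit-learn's SelectKBest, assuming scikit-learn is installed; the iris data set and the choice of an ANOVA F-test are illustrative, not prescriptive.

# A minimal sketch of filter-method feature selection: rank input variables
# by a simple relevance score during preprocessing and keep the top ones.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                    # 4 input features, 3 classes

selector = SelectKBest(score_func=f_classif, k=2)    # keep the 2 best-scoring features
X_reduced = selector.fit_transform(X, y)

print("scores per feature:", selector.scores_.round(1))
print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)             # (150, 2)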

31. What is an "outlier" in the context of big data?

An outlier is a data point that's abnormally distant from others in a group of random samples. The presence of outliers can potentially mislead the process of machine learning and result in inaccurate models or substandard outcomes. In fact, an outlier can potentially bias an entire result set. That said, outliers can sometimes contain nuggets of valuable information.

32. What are two common techniques for detecting outliers?

Analysts often use the following two techniques to detect outliers:

  • Extreme value analysis. This is the most basic form of outlier detection and is limited to one-dimensional data. Extreme value analysis determines the statistical tails of the data distribution. The Altman Z-score is a good example of extreme value analysis.
  • Probabilistic and statistical models. These models determine the unlikely instances from a probabilistic model of the data. Data points with a low probability of membership are marked as outliers. However, these models assume that the data adheres to specific distributions. A common example of this type of outlier detection is the Bayesian probabilistic model.

These are only two of the core methods used to detect outliers. Other approaches include linear regression models, information theoretic models, high-dimensional outlier detection methods and other approaches.
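As a minimal illustration of extreme value analysis, the following Python sketch flags points whose z-score lies more than three standard deviations from the mean. The sample data are synthetic and purely illustrative.

# A minimal sketch of extreme value analysis on one-dimensional data.
import numpy as np

rng = np.random.default_rng(42)
samples = np.append(rng.normal(loc=10.0, scale=0.5, size=200), 25.0)  # one injected outlier

z_scores = (samples - samples.mean()) / samples.std()
outliers = samples[np.abs(z_scores) > 3]   # points more than 3 standard deviations away

print("flagged outliers:", outliers)       # expected: [25.]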

33. What is the FSCK command used for?

FSCK, which stands for file system consistency check, is an HDFS filesystem checking utility that can be used to generate a summary report about the file system's status. However, the report merely identifies the presence of errors; it does not correct them. The FSCK command can be executed against an entire system or a select subset of files.

Figure: YARN vs. MapReduce

34. Are you open to gaining additional learning and qualifications that could help you advance your career with us?

Here's your chance to demonstrate your enthusiasm and career ambitions. Of course, your answer will depend on your current level of academic qualifications and certifications, as well as your personal circumstances, which might include family responsibilities and financial considerations. Therefore, respond forthrightly and honestly. Bear in mind that many courses and learning modules are readily available online. Moreover, analytics vendors have established training courses aimed at those seeking to upskill themselves in this domain. You can also inquire about the company's policy on mentoring and coaching.

35. Do you have any questions for us?

As mentioned earlier, it's a good rule of thumb to go to interviews with a few prepared questions. But depending on how the conversation has unfolded during the interview, you might choose not to ask them. For instance, if they've already been answered or the discussion has sparked other, more pertinent queries in your mind, you can put your original questions aside.

You should also be aware of how you time your questions, taking your cue from the interviewer. Depending on the circumstances, it might be acceptable to ask questions during the interview, although it's generally more common to hold off on your questions until the end of the interview. That said, you should never hesitate to ask for clarification on a question the interviewer asks.

A final word on big data interview questions

Remember, the process doesn't end after an interview has ended. After the session, send a note of thanks to the interviewer(s) or your point(s) of contact. Follow this up with a secondary message if you haven't received any feedback within a few days.

The world of big data is expanding continuously and exponentially. If you're serious and passionate about the topic and prepared to roll up your sleeves and work hard, the sky's the limit.


Research Topics & Ideas: Data Science

50 Topic Ideas To Kickstart Your Research Project


If you’re just starting out exploring data science-related topics for your dissertation, thesis or research project, you’ve come to the right place. In this post, we’ll help kickstart your research by providing a hearty list of data science and analytics-related research ideas, including examples from recent studies.

PS – This is just the start…

We know it’s exciting to run through a list of research topics, but please keep in mind that this list is just a starting point. The topic ideas provided here are intentionally broad and generic, so you will need to develop them further. Nevertheless, they should inspire some ideas for your project.

To develop a suitable research topic, you’ll need to identify a clear and convincing research gap, and a viable plan to fill that gap. If this sounds foreign to you, check out our free research topic webinar that explores how to find and refine a high-quality research topic, from scratch. Alternatively, consider our 1-on-1 coaching service.


Data Science-Related Research Topics

  • Developing machine learning models for real-time fraud detection in online transactions.
  • The use of big data analytics in predicting and managing urban traffic flow.
  • Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.
  • The application of predictive analytics in personalizing cancer treatment plans.
  • Analyzing consumer behavior through big data to enhance retail marketing strategies.
  • The role of data science in optimizing renewable energy generation from wind farms.
  • Developing natural language processing algorithms for real-time news aggregation and summarization.
  • The application of big data in monitoring and predicting epidemic outbreaks.
  • Investigating the use of machine learning in automating credit scoring for microfinance.
  • The role of data analytics in improving patient care in telemedicine.
  • Developing AI-driven models for predictive maintenance in the manufacturing industry.
  • The use of big data analytics in enhancing cybersecurity threat intelligence.
  • Investigating the impact of sentiment analysis on brand reputation management.
  • The application of data science in optimizing logistics and supply chain operations.
  • Developing deep learning techniques for image recognition in medical diagnostics.
  • The role of big data in analyzing climate change impacts on agricultural productivity.
  • Investigating the use of data analytics in optimizing energy consumption in smart buildings.
  • The application of machine learning in detecting plagiarism in academic works.
  • Analyzing social media data for trends in political opinion and electoral predictions.
  • The role of big data in enhancing sports performance analytics.
  • Developing data-driven strategies for effective water resource management.
  • The use of big data in improving customer experience in the banking sector.
  • Investigating the application of data science in fraud detection in insurance claims.
  • The role of predictive analytics in financial market risk assessment.
  • Developing AI models for early detection of network vulnerabilities.


Data Science Research Ideas (Continued)

  • The application of big data in public transportation systems for route optimization.
  • Investigating the impact of big data analytics on e-commerce recommendation systems.
  • The use of data mining techniques in understanding consumer preferences in the entertainment industry.
  • Developing predictive models for real estate pricing and market trends.
  • The role of big data in tracking and managing environmental pollution.
  • Investigating the use of data analytics in improving airline operational efficiency.
  • The application of machine learning in optimizing pharmaceutical drug discovery.
  • Analyzing online customer reviews to inform product development in the tech industry.
  • The role of data science in crime prediction and prevention strategies.
  • Developing models for analyzing financial time series data for investment strategies.
  • The use of big data in assessing the impact of educational policies on student performance.
  • Investigating the effectiveness of data visualization techniques in business reporting.
  • The application of data analytics in human resource management and talent acquisition.
  • Developing algorithms for anomaly detection in network traffic data.
  • The role of machine learning in enhancing personalized online learning experiences.
  • Investigating the use of big data in urban planning and smart city development.
  • The application of predictive analytics in weather forecasting and disaster management.
  • Analyzing consumer data to drive innovations in the automotive industry.
  • The role of data science in optimizing content delivery networks for streaming services.
  • Developing machine learning models for automated text classification in legal documents.
  • The use of big data in tracking global supply chain disruptions.
  • Investigating the application of data analytics in personalized nutrition and fitness.
  • The role of big data in enhancing the accuracy of geological surveying for natural resource exploration.
  • Developing predictive models for customer churn in the telecommunications industry.
  • The application of data science in optimizing advertisement placement and reach.

Recent Data Science-Related Studies

While the ideas we’ve presented above are a decent starting point for finding a research topic, they are fairly generic. So, it helps to look at actual studies in the data science and analytics space to see how this all comes together in practice.

Below, we’ve included a selection of recent studies to help refine your thinking. These are real, published studies, so they can provide useful insight into what a research topic looks like in practice.

  • Data Science in Healthcare: COVID-19 and Beyond (Hulsen, 2022)
  • Auto-ML Web-application for Automated Machine Learning Algorithm Training and evaluation (Mukherjee & Rao, 2022)
  • Survey on Statistics and ML in Data Science and Effect in Businesses (Reddy et al., 2022)
  • Visualization in Data Science VDS @ KDD 2022 (Plant et al., 2022)
  • An Essay on How Data Science Can Strengthen Business (Santos, 2023)
  • A Deep study of Data science related problems, application and machine learning algorithms utilized in Data science (Ranjani et al., 2022)
  • You Teach WHAT in Your Data Science Course?!? (Posner & Kerby-Helm, 2022)
  • Statistical Analysis for the Traffic Police Activity: Nashville, Tennessee, USA (Tufail & Gul, 2022)
  • Data Management and Visual Information Processing in Financial Organization using Machine Learning (Balamurugan et al., 2022)
  • A Proposal of an Interactive Web Application Tool QuickViz: To Automate Exploratory Data Analysis (Pitroda, 2022)
  • Applications of Data Science in Respective Engineering Domains (Rasool & Chaudhary, 2022)
  • Jupyter Notebooks for Introducing Data Science to Novice Users (Fruchart et al., 2022)
  • Towards a Systematic Review of Data Science Programs: Themes, Courses, and Ethics (Nellore & Zimmer, 2022)
  • Application of data science and bioinformatics in healthcare technologies (Veeranki & Varshney, 2022)
  • TAPS Responsibility Matrix: A tool for responsible data science by design (Urovi et al., 2023)
  • Data Detectives: A Data Science Program for Middle Grade Learners (Thompson & Irgens, 2022)
  • Machine Learning for Non-Majors: A White Box Approach (Mike & Hazzan, 2022)
  • Components of Data Science and Its Applications (Paul et al., 2022)
  • Analysis on the Application of Data Science in Business Analytics (Wang, 2022)

As you can see, these research topics are a lot more focused than the generic topic ideas we presented earlier. So, to develop a high-quality research topic, you’ll need to get laser-focused on a specific context with specific variables of interest.

Get 1-On-1 Help

If you’re still unsure about how to find a quality research topic, check out our Private Coaching service, the perfect starting point for developing a unique, well-justified research topic.



99+ Data Science Research Topics: A Path to Innovation


In today’s rapidly advancing digital age, data science research plays a pivotal role in driving innovation, solving complex problems, and shaping the future of technology. Choosing the right data science research topics is paramount to making a meaningful impact in this field. 

In this blog, we will delve into the intricacies of selecting compelling data science research topics, explore a range of intriguing ideas, and discuss the methodologies to conduct meaningful research.

How to Choose Data Science Research Topics?


Selecting the right research topic is the cornerstone of a successful data science endeavor. Several factors come into play when making this decision. 

  • First and foremost, personal interests and passion are essential. A genuine curiosity about a particular subject can fuel the dedication and enthusiasm needed for in-depth research. 
  • Current trends and challenges in data science provide valuable insights into areas that demand attention. 
  • Additionally, the availability of data and resources, as well as the potential impact and applications of the research, should be carefully considered.

99+ Data Science Research Topic Ideas: Category Wise

Supervised Machine Learning

  • Predictive modeling for disease outbreak prediction.
  • Credit scoring using machine learning for financial institutions.
  • Sentiment analysis for stock market predictions.
  • Recommender systems for personalized content recommendations.
  • Customer churn prediction in e-commerce.
  • Speech recognition for voice assistants.
  • Handwriting recognition for digitization of historical documents.
  • Facial recognition for security and surveillance.
  • Time series forecasting for energy consumption.
  • Object detection in autonomous vehicles.

Unsupervised Machine Learning

  • Market basket analysis for retail optimization.
  • Topic modeling for content recommendation.
  • Clustering techniques for social network analysis.
  • Anomaly detection in manufacturing processes.
  • Customer segmentation for marketing strategies.
  • Event detection in social media data.
  • Network traffic anomaly detection for cybersecurity.
  • Anomaly detection in healthcare data.
  • Fraud detection in insurance claims.
  • Outlier detection in environmental monitoring.

Natural Language Processing (NLP)

  • Abstractive text summarization for news articles.
  • Multilingual sentiment analysis for global brands.
  • Named entity recognition for information extraction.
  • Speech-to-text transcription for accessibility.
  • Hate speech detection in social media.
  • Aspect-based sentiment analysis for product reviews.
  • Text classification for content moderation.
  • Language translation for low-resource languages.
  • Chatbot development for customer support.
  • Emotion detection in text and speech.

Deep Learning

  • Image super-resolution using convolutional neural networks.
  • Reinforcement learning for game playing and robotics.
  • Generative adversarial networks (GANs) for image generation.
  • Transfer learning for domain adaptation in deep models.
  • Deep learning for medical image analysis.
  • Video analysis for action recognition.
  • Natural language understanding with transformer models.
  • Speech synthesis using deep neural networks.
  • AI-powered creative art generation.
  • Deep reinforcement learning for autonomous vehicles.

Big Data Analytics

  • Real-time data processing for IoT sensor networks.
  • Social media data analysis for marketing insights.
  • Data-driven decision-making in supply chain management.
  • Customer journey analysis for e-commerce.
  • Predictive maintenance using sensor data.
  • Stream processing for financial market data.
  • Energy consumption optimization in smart grids.
  • Data analytics for climate change mitigation.
  • Smart city infrastructure optimization.
  • Data analytics for personalized healthcare recommendations.

Data Ethics and Privacy

  • Fairness and bias mitigation in AI algorithms.
  • Ethical considerations in AI for criminal justice.
  • Privacy-preserving data sharing techniques.
  • Algorithmic transparency and interpretability.
  • Data anonymization for privacy protection.
  • AI ethics in healthcare decision support.
  • Ethical considerations in facial recognition technology.
  • Governance frameworks for AI and data use.
  • Data protection in the age of IoT.
  • Ensuring AI accountability and responsibility.

Reinforcement Learning

  • Autonomous drone navigation for package delivery.
  • Deep reinforcement learning for game AI.
  • Optimal resource allocation in cloud computing.
  • Reinforcement learning for personalized education.
  • Dynamic pricing strategies using reinforcement learning.
  • Robot control and manipulation with RL.
  • Multi-agent reinforcement learning for traffic management.
  • Reinforcement learning in healthcare for treatment plans.
  • Learning to optimize supply chain logistics.
  • Reinforcement learning for inventory management.

Computer Vision

  • Video-based human activity recognition.
  • 3D object detection and tracking.
  • Visual question answering for image understanding.
  • Scene understanding for autonomous robots.
  • Facial emotion recognition in real-time.
  • Image deblurring and restoration.
  • Visual SLAM for augmented reality applications.
  • Image forensics and deepfake detection.
  • Object counting and density estimation.
  • Medical image segmentation and diagnosis.

Time Series Analysis

  • Time series forecasting for renewable energy generation.
  • Stock price prediction using LSTM models.
  • Climate data analysis for weather forecasting.
  • Anomaly detection in industrial sensor data.
  • Predictive maintenance for machinery.
  • Time series analysis of social media trends.
  • Human behavior modeling with time series data.
  • Forecasting economic indicators.
  • Time series analysis of health data for disease prediction.
  • Traffic flow prediction and optimization.

Graph Analytics

  • Social network analysis for influence prediction.
  • Recommender systems with graph-based models.
  • Community detection in complex networks.
  • Fraud detection in financial networks.
  • Disease spread modeling in epidemiology.
  • Knowledge graph construction and querying.
  • Link prediction in citation networks.
  • Graph-based sentiment analysis in social media.
  • Urban planning with transportation network analysis.
  • Ontology alignment and data integration in semantic web.

What Is The Right Research Methodology?

  • Alignment with Objectives: Ensure that the chosen research approach aligns with the specific objectives of your study. This will help you answer the research questions effectively.
  • Data Collection Methods: Carefully plan and execute data collection methods. Consider using surveys, interviews, data mining, or a combination of these based on the nature of your research and the data availability.
  • Data Analysis Techniques: Select appropriate data analysis techniques that suit the research questions. This may involve using statistical analysis for quantitative data, machine learning algorithms for predictive modeling, or deep learning models for complex pattern recognition, depending on the research context.
  • Ethical Considerations: Prioritize ethical considerations in data science research. This includes obtaining informed consent from study participants and ensuring data anonymization to protect privacy. Ethical guidelines should be followed throughout the research process.

Choosing the right research methodology involves a thoughtful and purposeful selection of methods and techniques that best serve the objectives of your data science research.

How to Conduct Data Science Research?

Conducting data science research involves a systematic and structured approach to generate insights or develop solutions using data. Here are the key steps to conduct data science research:

  • Define Research Objectives

Clearly define the goals and objectives of your research. What specific questions do you want to answer or problems do you want to solve?

  • Literature Review

Conduct a thorough literature review to understand the current state of research in your chosen area. Identify gaps, challenges, and potential research opportunities.

  • Data Collection

Gather the relevant data for your research. This may involve data from sources like databases, surveys, APIs, or even creating your datasets.
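As a purely illustrative sketch (the endpoint, parameters, and payload structure below are hypothetical), data might be pulled from a public API and saved for the later steps:

```python
import pandas as pd
import requests

# Hypothetical open-data endpoint; substitute the API relevant to your study.
URL = "https://example.org/api/v1/records"

response = requests.get(URL, params={"limit": 1000}, timeout=30)
response.raise_for_status()

# Assumes the JSON payload exposes a "results" list of records.
records = response.json()["results"]
df = pd.DataFrame(records)
df.to_csv("survey_data.csv", index=False)  # persisted for the preprocessing step below
```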

  • Data Preprocessing

Clean and preprocess the data to ensure it is in a usable format. This includes handling missing values, outliers, and data transformations.
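For example, a minimal preprocessing sketch with pandas might look like the following (the column names and thresholds are hypothetical, continuing the file produced in the collection step):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("survey_data.csv")

# Drop columns that are mostly empty, then impute remaining gaps:
# median for numeric columns, mode for categorical ones.
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))
for col in df.columns:
    if df[col].dtype.kind in "if":          # integer or float column
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

# Remove outliers in a (hypothetical) numeric column using the IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Example transformation: log-scale a skewed variable.
df["log_income"] = np.log1p(df["income"])
```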

  • Exploratory Data Analysis (EDA)

Perform EDA to gain a deeper understanding of the data. Visualizations, summary statistics, and data profiling can help identify patterns and insights.
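A brief EDA sketch, assuming the same hypothetical dataframe as above, could be:

```python
import matplotlib.pyplot as plt

# Summary statistics and a quick data profile.
print(df.describe(include="all"))
print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column
print(df.corr(numeric_only=True))                      # pairwise correlations (numeric columns)

# A simple distribution plot to spot skew and outliers visually.
df["log_income"].hist(bins=30)
plt.xlabel("log(income)")
plt.ylabel("count")
plt.title("Distribution of log income")
plt.show()
```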

  • Hypothesis Formulation (if applicable)

If your research involves hypothesis testing, formulate clear hypotheses based on your data and objectives.

  • Model Development

Choose the appropriate modeling techniques (e.g., machine learning, statistical models) based on your research objectives. Develop and train models as needed.
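As an illustration, the sketch below trains a simple classifier with scikit-learn; the feature and target columns are hypothetical:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and binary target (e.g., customer churn).
X = df[["log_income", "age"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
```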

  • Evaluation and Validation

Assess the performance and validity of your models or analytical methods. Use appropriate metrics to measure how well they achieve the research goals.
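Continuing the sketch above, evaluation could report standard classification metrics on the held-out test set:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
```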

  • Interpret Results

Analyze the results and interpret what they mean in the context of your research objectives. Visualizations and clear explanations are important.

  • Iterate and Refine

If necessary, iterate on your data collection, preprocessing, and modeling steps to improve results. This process may involve adjusting parameters or trying different algorithms.

  • Ethical Considerations

Ensure that your research complies with ethical guidelines, particularly concerning data privacy and informed consent.

  • Documentation

Maintain comprehensive documentation of your research process, including data sources, methodologies, and results. This helps in reproducibility and transparency.

  • Communication

Communicate your findings through reports, presentations, or academic papers. Clearly convey the significance of your research and its implications.

  • Peer Review and Feedback

If applicable, seek peer review and feedback from experts in the field to validate your research and gain valuable insights.

  • Publication and Sharing

Consider publishing your research in reputable journals or sharing it with the broader community through conferences, online platforms, or industry events.

  • Continuous Learning

Stay updated with the latest developments in data science and related fields to refine your research skills and methodologies.

Conducting data science research is a dynamic and iterative process, and each step is essential for generating meaningful insights and contributing to the field. It’s important to approach your research with a critical and systematic mindset, ensuring that your work is rigorous and well-documented.

Challenges and Pitfalls of Data Science Research

Data science research, while promising and impactful, comes with its set of challenges. Common obstacles include data quality issues, lack of domain expertise, algorithmic biases, and ethical dilemmas. 

Researchers must be aware of these challenges and devise strategies to overcome them. Collaboration with domain experts, thorough validation of algorithms, and adherence to ethical guidelines are some of the approaches to mitigate potential pitfalls.

Impact and Application

The impact of data science research topics extends far beyond the confines of laboratories and academic institutions. Research outcomes often find applications in real-world scenarios, revolutionizing industries and enhancing the quality of life. 

Predictive models in healthcare improve patient care and treatment outcomes. Advanced fraud detection systems safeguard financial transactions. Natural language processing technologies power virtual assistants and language translation services, fostering global communication. 

Real-time data processing in IoT applications drives smart cities and connected ecosystems. Ethical considerations and privacy-preserving techniques ensure responsible and respectful use of personal data, building trust between technology and society.

Embarking on a journey in data science research topics is an exciting and rewarding endeavor. By choosing the right research topics, conducting rigorous studies, and addressing challenges ethically and responsibly, researchers can contribute significantly to the ever-evolving field of data science. 

As we explore the depths of machine learning, natural language processing, big data analytics, and ethical considerations, we pave the way for innovation, shape the future of technology, and make a positive impact on the world.


15 years of Big Data: a systematic literature review

Davide Tosi, Redon Kokaj & Marco Roccetti

Journal of Big Data, volume 11, article number 73 (2024). Open access; published 14 May 2024.


Big Data continues to gain attention as a fundamental building block of the Artificial Intelligence and Machine Learning world, and a great deal of effort has therefore gone into Big Data research over the last 15 years. The objective of this Systematic Literature Review is to summarize the current state of the art of those 15 years of research on Big Data by answering a set of research questions related to the main application domains for Big Data analytics; the significant challenges and limitations researchers have encountered in Big Data analysis; and emerging research trends and future directions in Big Data. The review follows a predefined procedure that automatically searches five well-known digital libraries. After applying the selection criteria to the results, 189 primary studies were identified as relevant, of which 32 were Systematic Literature Reviews. The required information was extracted from these 32 studies and summarized. Our Systematic Literature Review sketches the picture of 15 years of research in Big Data, identifying application domains, challenges, and future directions in this research field. We believe that a substantial amount of work remains to be done to align and seamlessly integrate Big Data into the data-driven advanced software solutions of the future.

Introduction

Over the past 15 years, Big Data has emerged as a foundational pillar supporting an extensive range of scientific fields, from medicine and healthcare [1] to engineering [2], finance and marketing [3, 4, 5], politics [6], social network analysis [7, 8], and telecommunications [9], to cite only a few examples. This 15-year period has witnessed a significant increase in research efforts aimed at unraveling the major problems in Big Data, with an almost innumerable array of potential solutions and data sources [10, 11, 12, 13]. The result is a boundless body of scientific papers that, in the end, has demonstrated the twofold, ambivalent nature of Big Data. On one side, it has confirmed the pivotal role played by this scientific field in shaping the technological advancements of our time. On the other side, approaching the comprehension of Big Data through this endless universe of tens of thousands of technical papers, each specializing in its own sector, however natural it might seem, has become unsustainable, because it has often led researchers to confuse (or mix) the theory of Big Data with its practice or use. We cannot ignore that there have also been numerous attempts to describe the general landscape of Big Data through survey papers. Nonetheless, given the vastness of the subject, most of these did not escape the trap of pre-formed models and have tried to respond, as closely as possible, to the concrete requirements of a single sub-field or of a few perspectives. In this complex context, to take at least one step further into the knowledge of the state of the art of Big Data research over the above-mentioned period of time, we decided to conduct a different form of comprehensive exploration, one not biased by the specificity of given sectors or confounded by single technical perspectives. To do so, we adopted the methodology termed systematic literature review (SLR), as proposed by Kitchenham and Charters [14] in the field of software engineering [15, 16]. Although an SLR proceeds through a set of well-defined steps, also in this case an initial choice has to be made regarding the most crucial parameters through which the subject of investigation should be explored. In the case of Big Data, our primary focus has been on gaining insights into the principal application domains of Big Data, unraveling the major challenges and limitations encountered by researchers in the analysis of the typically enormous datasets they manage, and unveiling the emerging trends and directions of future Big Data research.

Guided by the structured methodology imposed by the SLR, we hence started with three research questions that matched the points raised above: essentially, (i) the most common application domains, (ii) current research challenges and limitations, and (iii) emerging future trends and directions. From this point on, we proceeded following the SLR steps. First, we translated the three research questions into specific search terms, through which five different digital libraries were investigated, namely Scopus, IEEE Xplore, ACM Digital Library, SpringerLink, and Google Scholar. Upon completion of the search activity (detailed in the following section of this paper), 189 primary studies that matched our generic search criteria were identified. Of these 189 studies, only 32 were actually reviews. Since the target of our study was to provide a panoramic view of this 15-year Big Data research period, (a) shedding light on the prevalent application domains, (b) highlighting the hurdles faced by researchers, and (c) outlining the potential trajectories for future research, we focused our analysis on just these 32 survey studies.

With this paper, we do not want to conduct a traditional literature review on a very extensive topic like Big Data. Traditional scientific surveys can include many more studies and corresponding papers, and they are mainly built with an eye toward generalizability and inclusion rather than selectivity and relevance. As a consequence, those approaches often bring little more than a mere summary of the topic of interest. SLRs, instead, start from the legitimate presumption of being more than merely a summary of a topic. In essence, they distinguish themselves from ordinary surveys of the available literature because they are specifically built to add, to the identification of all publications on a topic, all of the following activities: explicit formulation of a search objective, identification and description of a search procedure, definition of criteria for inclusion and exclusion of publications, literature selection, and information extraction based only on a transparent evaluation of the quality of publications. Moreover, an SLR should provide insightful information on the current state of research on a topic, starting from a given set of research questions and following a formal methodological procedure designed to reduce distortions caused by an overly generous or restrictive selection of the literature, while guaranteeing the reliability of the selected publications. Hence, to pursue these objectives, an SLR should start with the definition of the criteria for determining what should be included or excluded before conducting the search. Typically, an SLR is performed mainly using electronic literature databases. It should also be noted that such a structured approach should document all the information gathered (and the steps taken as part of this process), with the aim of making the paper selection process completely visible and reproducible [17].

In the end, we know very well that a point-to-point analysis of the set of almost 320 papers from which we started our SLR could have yielded more (generic) information than that provided by the circa 30 papers finally selected by our SLR. Nonetheless, it is highly likely that this information would have been somewhat redundant, more prone to defects and personal biases, and, finally, more tedious to read.

With this SLR, we aim to contribute, in a focused and structured way, to Big Data research in several ways: on one side, we provide researchers with a clear picture of how Big Data application domains have changed over time; then, we highlight the challenges faced by academia and industry and their evolution over the last 15 years; finally, we sketch a set of open points that researchers should take into consideration in the near future.

We can conclude that, while our collective understanding of Big Data has grown through this investigation, the analysis has underscored once again that a kind of optimal stability has emerged in this field in terms of research interests, reflected in the even distribution among application domains, challenges, and future trends. On one side, we observe a pervasive adoption of Big Data solutions in all everyday-life domains (such as Energy [18], Smart Cities [9], and Healthcare [19]). On the other side, researchers have devoted considerable effort to managing data quality, designing and developing advanced frameworks to manage Big Data in real time, and focusing on security and privacy. However, many challenges remain open for seamlessly integrating Big Data into the data-driven advanced software solutions of the future, such as mitigating energy consumption, optimizing algorithms, increasing framework security with a privacy and ethical focus, intersecting Artificial Intelligence and Machine Learning technologies, opening data sets, improving interoperability among different stakeholders, and accounting for societal and business changes.

The remainder of this paper is organized as follows: in Section "Research method", we run the SLR methodology on our Big Data use case (with the definition of our research questions, the search strategy, the inclusion/exclusion criteria, the study quality assessment questionnaire, and the data extraction from primary studies), in the dual attempt to explain the abstract methodology as well as its application in our field. Section "SLR: implementation" describes how we conducted the review and the results obtained in each stage and step of our SLR; Section "SLR: results" presents our findings, briefly summarizing each of our selected primary studies; Section "Discussion" critically discusses those findings garnering special attention in our analytical process; Section "Threats to validity" discusses the possible threats to the validity of our study; Section "Conclusion" presents the conclusions we drew from our SLR.

A taxonomy of key concepts for Big Data evolution over the last 15 years is presented in Fig. 1.

Figure 1: Taxonomy of Big Data evolution over the last 15 years

Research method

Research questions

This SLR has been conducted following the procedure defined by Kitchenham and Charters. As such, in the first step, we defined the research questions (RQ) that will drive the entire review methodology.

As we define the research questions that will guide our SLR, it is crucial to establish a balance between the breadth and depth of our investigation. After careful consideration, and to ensure that our review maintains a focused and meaningful scope, we decided to narrow our research questions down to the following three:

RQ1: What are the most common application domains for Big Data analytics, and how have they evolved over time?

RQ2: What are the major challenges and limitations that researchers have encountered in Big Data analysis, and how have they been addressed?

RQ3: What are the emerging research trends and directions in Big Data that will likely shape the field in the next 5 to 10 years?

Search strategy

An SLR begins by looking for relevant studies related to the research questions. To do this, we identified appropriate search terms using the method outlined by Kitchenham and Charters, which suggests considering three aspects: Population (P), Interventions (I), and Outcomes (O).

We identified the following relevant search terms for each aspect in our review:

Population: Big Data, real-time data analytics, large datasets.

Intervention: methodologies, techniques, domains, architectures, solutions.

Outcomes: research trends, future directions, emerging technologies, challenges, SLR, Systematic Literature Review.

The search string was constructed as (P) AND (I) AND (O), where P refers to the population terms, I to the intervention terms, and O to the outcome terms; terms within each group are connected by the boolean operator OR, and the three groups are joined by AND.

The search string therefore takes the following exemplar form:

(“big data” OR “real-time data analytics” OR “large datasets”) AND (“methodologies” OR “techniques” OR “domains” OR “architectures” OR “solutions”) AND (“research trends” OR “future directions” OR “emerging technologies” OR “challenges” OR “SLR” OR “Systematic Literature Review”)
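As a purely illustrative aid (ours, not part of the authors' methodology), such a string can be assembled programmatically from the three term lists:

```python
population = ["big data", "real-time data analytics", "large datasets"]
intervention = ["methodologies", "techniques", "domains", "architectures", "solutions"]
outcomes = ["research trends", "future directions", "emerging technologies",
            "challenges", "SLR", "Systematic Literature Review"]

def or_group(terms):
    """Quote each term and join the group with OR."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# (P) AND (I) AND (O)
search_string = " AND ".join(or_group(g) for g in (population, intervention, outcomes))
print(search_string)
```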

Since we need to find and study primary studies related to our research questions, selecting appropriate digital libraries and search engines is essential. For this reason, we decided to use the following state-of-the-art sources:

Scopus: a multidisciplinary database that covers a broad range of research fields.

IEEE Xplore: an invaluable resource for technology and engineering-related SLRs.

ACM Digital Library: a comprehensive collection of relevant articles, conference papers, and journals focused on computer science and information technology.

SpringerLink: an extensive collection of academic articles in fields that align closely with our research interests.

Google Scholar: a freely accessible web search engine that indexes scholarly literature across various disciplines.

We aim to ensure a comprehensive and focused literature search by utilizing these sources, thereby facilitating thorough and methodical research.

Inclusion/Exclusion criteria

In this stage of the SLR, we need to make an accurate selection of the extracted studies. To do this, we defined rigorous inclusion/exclusion criteria to decide which studies would be useful for our purpose. Studies were excluded based on the following criteria:

Studies published before the 15-year time frame

Studies in languages other than English

Non-academic sources, including blogs, news articles, marketing materials, and reports from non-academic organizations

Studies that are only marginally related to Big Data or the specific topics within our research questions.

All studies not excluded by the criteria above are considered included; these are called “Primary Studies” (PS).

Study quality assessment

Kitchenham and Charters stress the necessity of assessing the quality of primary studies to reduce bias and enhance the validity of the evaluation process. In our research, we employ a study quality assessment to ensure that only the most relevant results are retained.

To achieve this, we formulated a five-question study quality questionnaire, which serves as the foundation for assessing the quality of the primary studies:

QA1: Has the primary study established a well-defined research objective?

QA2: Did the primary study comprehensively describe its research methods and data sources?

QA3: Has the technique or approach undergone a trustworthy validation?

QA4: Has the primary study effectively identified and discussed the significant challenges and limitations encountered in Big Data analysis?

QA5: Are the findings, research trends, and directions clearly presented and directly connected to the study’s objectives or goals?

Hence, we applied the formulated questionnaire to the included PSs to assess their quality. The output of this SLR stage will be discussed in Section 4.

Data extraction

The data extraction process entails gathering relevant information from the chosen primary studies to address the research questions. To facilitate this process, we have created a dedicated data extraction form, as shown in Table 1. As suggested in Kitchenham and Charters, we used the test-retest process to check the consistency and accuracy of the extracted data with respect to the original sources. After finishing the data extraction for all the selected studies, we randomly selected 3 primary studies and performed a second extraction of the data. No inconsistencies were detected.

SLR: implementation

In this section, we describe step-by-step the implementation and execution of the different stages of our SLR. Figure 2 depicts the search stages followed and the resulting number of primary studies for each stage.

In stage 1, an automated search was performed by applying the search string to the digital libraries. The software used to manage the references is Zotero (www.zotero.org), a popular choice for SLRs. We began the search using the following string:

(“big data” OR “real-time data analytics” OR “large datasets”) AND (“methodologies” OR “techniques” OR “domains” OR “architectures” OR “solutions”) AND (“research trends” OR “future directions” OR “emerging technologies” OR “challenges” OR “SLR” OR “Systematic Literature Review”)

As a result, we found a total of 4204 studies. The large number of results can be attributed mostly to the main topic of this SLR being “Big Data”, a hugely popular field, especially in the last few years.

In stage 2, we used Zotero’s duplicate identification tool and found a total of 25 duplicates. Additionally, 1 duplicate was found manually, bringing the total number of results down to 4178 articles.

In stage 3, studies were excluded based on the title and the language. Fortunately, all the documents were in English, so we only needed to focus on the titles, eliminating those of no use for our research. This cut the total number down to 553.

In stage 4, we eliminated the articles whose abstracts were of marginal or no interest to us. At the end of the process, 189 Primary Studies remained, 32 of which were SLRs.

To ensure the best possible quality for our SLR, we collected generic information on all 189 studies that passed the Primary Study check. This information is depicted in Figs. 3 and 4. We then proceeded with an in-depth full-text review of the 32 PSs, which are the main subject of our SLR.

Figure 2: Stages of the applied search strategy

Figure 3 depicts the distribution per year of all 189 studies. Our SLR focuses on the evolution of Big Data over the last 15 years; in any case, no studies before 2012 were detected, which can be attributed to the fact that, before then, Big Data was not as popular as a research topic.

Figure 3: Number of filtered primary studies and number of total citations

Figure 4 shows the total number of citations per year for our 32 selected Primary Studies. The graph clearly shows that the most recent studies have been cited less: even though the studies released in the last two years make up about one third of our selected primary studies (11 out of 32), they have accumulated far fewer citations than those of previous years. The lower citation rate may indicate that researchers have recently focused on understudied areas or on more recent emerging trends, suggesting that the field of Big Data is currently undergoing an evolution. However, further analysis of the quality, methodology, and context of these studies is necessary for more concrete conclusions.

Figure 4: Number of total PSs per year

For further clarity, we compiled Table 2 to present the chosen articles, listing the first author’s family name, the venue, the title of each PS, and a short introduction to its main findings. Note that “J” indicates that the article has been published in a journal.

To better understand the influence of the selected Primary Studies over time, we created a bubble chart to show the most cited documents by aggregating the PSs with the same publication year (see Fig. 5). The size of each bubble is proportional to the number of citations.

Figure 5: Bubble chart showing the number of primary studies and total citations per year of publication
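As a generic illustration of how such a chart can be produced (a sketch of ours with a placeholder schema, not the authors' data or code):

```python
import matplotlib.pyplot as plt
import pandas as pd

def bubble_chart(ps: pd.DataFrame) -> None:
    """Plot the number of primary studies per publication year,
    with bubble area proportional to the total citations of that year.

    Expects one row per primary study with columns 'year' and 'citations'
    (a placeholder schema assumed for illustration).
    """
    per_year = ps.groupby("year").agg(
        n_studies=("year", "size"),
        citations=("citations", "sum"),
    )
    plt.scatter(per_year.index, per_year["n_studies"],
                s=per_year["citations"] * 2, alpha=0.5)
    plt.xlabel("Year of publication")
    plt.ylabel("Number of primary studies")
    plt.title("Primary studies per year (bubble size ~ total citations)")
    plt.show()
```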

SLR: results

The study of the PSs allowed us to pinpoint exactly which research question (RQ1-RQ3) is answered by each primary study. Table 3 summarizes our findings.

As previously stated, it is important to assess the quality of each study. In subsection "Study quality assessment", we developed a brief questionnaire to help us determine the quality of a primary study. Table 4 shows the results of this quality check, using a simple “Yes,” “No,” or “na” (used when we do not have enough information to answer) for each item of the Quality Assessment questionnaire.

In the following, we briefly summarize each study and its findings.

PS1—A comprehensive and systematic literature review on the Big Data management techniques in the internet of things [ 20 ]

In this article, the authors explored the Big Data management techniques applied to the internet of things. Big Data was initially applied for healthcare monitoring, smart cities, and industrial systems. Over time, with the evolution of IoT, it expanded to include broader topics: healthcare applications involved health state monitoring and predictive modeling, smart cities encompassed traffic management, energy efficiency and security, while industrial systems employed Big Data to improve scalability and security. The application landscape broadened, emphasizing the importance of quality attributes such as performance, efficiency, reliability, and scalability in ensuring the success of Big Data Analytics systems in IoT across ever-evolving domains.

The challenges and open issues in Big Data Analytics within IoT span various dimensions, including centralized architectures, energy consumption in data collection, blockchain limitations, communication challenges, and diverse data features.

For future research, the exploration of AI for intelligent mobile data collection will take on a more relevant role, combining compressive sensing with AI for communication challenges and utilizing new optimization algorithms for data processing. To ensure security and privacy in IoT, Big Data Analytics could involve cryptography mechanisms, a data perception layer and a lightweight framework with AI. Addressing these challenges is essential for advancing Big Data Analytics in the evolving landscape of IoT applications.

PS2—A comprehensive review on Big Data for industries challenges and opportunities [ 21 ]

The article explores the transformative impact of Big Data Analytics in power systems, mineral industries, and manufacturing. In power systems, it revolutionizes fault detection, enables early warning systems and predicts future electricity demand, enhancing reliability and decision-making. For mineral industries, Big Data improves data storage, processing and analytics, optimizing exploration, extraction, and resource management. In manufacturing, it facilitates data-driven decision-making, comprehensive product quality assessment, and streamlined supply chain management for increased operational efficiency.

The study also highlights challenges in implementing Big Data Analytics, emphasizing the crucial need for precise data quality assessment models and secure frameworks. Machine learning and data analytics play a pivotal role in overcoming challenges, particularly in fault detection, load forecasting, and reservoir management. The call for open-source databases and integration with machine learning addresses the scarcity of datasets, reflecting challenges in maximizing Big Data’s potential.

Furthermore, the paper recommends future research trends, including advanced data quality assessment models, frameworks for high-dimensional data and solutions for secure communication. Emphasizing open-source databases and integrating machine learning promotes a collaborative and transparent approach. The call for interpretable models reflects a trend toward understanding and optimizing Big Data Analytics. Overall, these recommendations shape the future direction of Big Data applications in diverse industries.

PS3—A survey on IoT Big Data current status, 13 V’s challenges, and future directions [ 22 ]

The document delves into the landscape of Big Data Analytics, particularly exploring its integration with the Internet of Things. Application domains such as energy, healthcare, transportation, and smart cities emerge prominently. The discussion unfolds how these domains have evolved, signalling a shift towards IoT-driven intelligent applications.

Within this expansive terrain, the study identifies and elucidates 13 major challenges encapsulated by the “13 V’s”. These challenges span traditional aspects like volume, velocity, and variety, extending to less common concerns like vagueness and location-aware data processing. The document also offers innovative solutions, like edge-based processing and semantic representation, as strategies to manage these complex challenges.

In regards to the future, the document outlines emerging trends anticipated to define the Big Data landscape in the coming 5 to 10 years. These include a focus on energy-efficient data acquisition, the integration of machine learning and deep learning for advanced analytics, a strategic emphasis on edge and fog infrastructures, the evolving paradigm of multi-cloud data management, a shift towards data-oriented network addressing, and the increasing adoption of blockchain technology. These trends collectively indicate a trajectory towards more efficient, scalable, and secure practices in Big Data Analytics, particularly within the realm of IoT applications.

PS4—A systematic literature review on features of deep learning in Big Data analytics [ 23 ]

The document navigates the evolution of Big Data, emphasizing challenges and the rise of machine learning, particularly Deep Learning. Machine learning’s widespread use, observed in areas like healthcare and finance, underscores its crucial role. Even in complex data scenarios, its effectiveness is evident, as demonstrated by the U.S. Department of Homeland Security’s success in identifying threats.

Recognizing a gap in existing research, the document proposes a review focusing on Deep Learning in Big Data Analytics. The goal is to explore features like hierarchical layers and high-level abstraction. The study emphasizes Deep Learning’s strength in handling extensive datasets, its versatility, and its ability to prevent overfitting.

This exploration into Big Data’s journey underscores the central role of machine learning. The proposed review, specifically focusing on Deep Learning in Big Data Analytics, not only captures current advancements but also suggests there’s more to discover in the future where Big Data and machine learning intersect.

PS5—A systematic survey of data mining and Big Data analysis in internet of things [ 24 ]

The document navigates through diverse applications of Big Data Analytics, illustrating its transformative journey across sectors. Notably, it tracks the evolution within healthcare and finance, showcasing how Big Data has become integral to these domains over time.

Going further, the research dives into the various challenges of Big Data analysis. It identifies three main challenges: dealing with societal changes, understanding how businesses use IoT, and solving technical issues like security and connectivity. The study emphasizes the need to adapt to society’s changing needs, categorize IoT uses in business, and confront technical problems for effective Big Data analysis.

Moreover, the research anticipates future trends, in particular the rising importance of Big Data frameworks in handling expansive IoT-generated data. The intersection of these frameworks with data mining in the IoT domain emerges as a pivotal focus, pointing toward exciting possibilities and potential paths for future research in the realm of Big Data.

PS6—Access methods for Big Data: current status and future directions [ 25 ]

The document explores diverse applications of Big Data Analytics in research, education, urban planning, transportation, environmental modeling, energy conservation, and homeland security, emphasizing its transformative potential.

It addresses challenges like heterogeneity, scale, timeliness, privacy, and the evolving processing paradigms due to data volume surpassing computational resources.

Future directions include the need for systems handling structured and unstructured data, embedded analytics for real-time processing, innovative paradigms, application frameworks, and advanced databases ensuring transactional semantics. The research underscores the importance of tools addressing ethical, security, and privacy concerns.

PS7—An industrial Big Data pipeline for data-driven analytics maintenance applications in large-scale smart manufacturing facilities [ 26 ]

This research introduces an innovative Big Data pipeline designed for industrial analytics in manufacturing.

The pipeline excels in integrating legacy and smart devices, ensuring cross-network communication, and adhering to open standards, marking a significant evolution in the field. The document showcases the pipeline’s ability to handle complexities, integrate older systems, ensure reliability, and scale efficiently in industrial data analytics.

The future plan involves implementing the pipeline to validate its architecture, particularly in predictive maintenance for Wind Turbines and Air Handling Units, contributing to the evolving landscape of Big Data Analytics.

PS8—Applications of Big Data in emerging management disciplines: a literature review using text mining [ 27 ]

This study explores diverse applications of Big Data Analytics across twelve emerging management domains, emphasizing their dynamic nature over time.

It addresses adoption challenges, focusing on data quality, resource management, and distinguishing between the ability and capability of organizations in using Big Data Analytics. The research underscores the thoughtful adoption of Big Data Analytics and the importance of measuring its business value comprehensively. It acknowledges the difficulty of translating insights into real-time actionable items.

Looking forward, the study proposes a framework connecting emerging management domains with conventional practices, suggesting future research areas in human resources, marketing, sales, strategy, and services. The research emphasizes the need for in-depth exploration to integrate emerging domains into established management practices, providing valuable insights for research and practical application.

PS9—Applying Big Data analytics in higher education: a systematic mapping study [ 28 ]

The document conducts a thorough exploration of Big Data Analytics (BDA) in Higher Education Institutions from 2010 to 2020. It uncovers diverse BDA applications in three domains: Educational Quality, Decision-Making Process, and Information Management.

Challenges in BDA adoption include handling large data volumes, addressing privacy concerns, and dealing with resource constraints. The study emphasizes the need for practical outcomes, automated tools, and validated frameworks.

Despite robust research interest, the field exhibits immaturity, with a prevalence of conference papers indicating an early development stage. The study calls for increased empirical research to fortify the evidence base and foster a more mature BDA integration in higher education.

PS10—Artificial intelligence approaches and mechanisms for Big Data analytics: a systematic study [ 29 ]

The SLR explores AI-driven Big Data Analytics, emphasizing machine learning, knowledge-based reasoning, decision-making algorithms, and search methods. Applications, notably in supervised learning, aim to enhance precision and efficiency but grapple with complexity and scalability issues.

Challenges encompass processing vast, heterogeneous data, ensuring system security, and addressing qualitative parameters. Fog computing emerges as a potential solution, yet security concerns remain under-explored.

Emerging trends spotlight Big Data Analytics for IoT through fog computing, the need for enhanced algorithms handling extensive data, and the necessity to address data quality issues in unstructured formats.

PS11—Bibliometric mining of research directions and trends for Big Data [ 30 ]

The research identifies key application domains, with particular focus on China, and emerging directions such as Machine Learning and Healthcare.

Navigating challenges, the study introduces a semi-automatic method, utilizing blacklists and thesauri to enhance precision in identifying research directions. This favors a balance between automation and expert input.

The study forecasts Big Data’s future using a growth rate criterion, emphasizing Machine Learning and Deep Learning. Moreover, the study suggests applying its methodology not only to Big Data but also to various research areas, such as Machine Learning, showcasing its potential applicability in diverse research areas.

PS12—Big Data adoption: state of the art and research challenges [ 31 ]

The study explores the widespread adoption of Big Data Analytics across diverse sectors such as finance, education, healthcare, and more. It identifies a need for increased research in untapped areas like education and healthcare, suggesting potential transformative effects.

Challenges in current Big Data research include the need for refined theoretical models, adaptable data collection methods, and larger sample sizes to ensure accuracy. The study recommends a mixed-method approach to address these challenges effectively.

The study, although not explicitly stating upcoming trends, suggests a changing research focus in both developing and developed countries. It indicates a growing awareness of untapped opportunities, hinting at a future emphasis on specific situations and new factors in Big Data adoption.

PS13—Big Data analytics for data-driven industry: a review of data sources, tools, challenges, solutions, and research directions [ 32 ]

This research provides a comprehensive overview of Big Data Analytics. Exploring application domains, it traces Big Data’s historical integration across education, healthcare, finance, national security, and Industry 4.0 components like IoT and smart cities.

Delving into challenges, the research highlights skill shortages, dataset management, privacy, scalability, and intellectual property issues. Solutions range from software-defined data management to innovative truthfulness and privacy preservation methods.

Looking ahead, the study identifies some emerging trends: sourcing data from education and diverse IoT devices, refining pre-processing, advancing data management, enhancing privacy, and exploring deep learning methods. These trends forecast a dynamic future for Big Data Analytics, shaping the field in the next years.

PS14—Big Data analytics in healthcare: a systematic literature review and road map for practical implementation [ 33 ]

The paper conducts a thorough examination of Big Data Analytics (BDA) applications in healthcare, introducing the novel Med-BDA architecture.

Notably, the work addresses challenges inherent in BDA (such as increased costs, difficulty in acquiring a relevant skill set, rapidly expanding technology stack, and heightened management overhead), presenting a comprehensive road map to alleviate issues such as cost escalation and skill acquisition hurdles.

The document concludes by outlining the potential for extensions to Med-BDA and its applicability to diverse Big Data domains, showcasing a forward-looking perspective in BDA research and application.

PS15—Big Data analytics in telecommunications: literature review and architecture recommendations [ 34 ]

The document explores Big Data Analytics in TELCO, introducing LambdaTel as a proposed solution for batch and streaming data processing. It discusses Big Data Analytics applications like CRM and Customer Attrition.

Challenges, such as the lack of standardized architecture, are acknowledged. LambdaTel addresses these challenges through a structured approach, emphasizing security and recommending the use of Python.

While not explicitly talking about future trends, the document suggests a commitment to ongoing adaptation, seen in recommendations like Python usage, Dockerized implementation and the application of LambdaTel in a local Telco company for cross-selling/up-selling.

PS16—Big Data analytics meets social media: a systematic review of techniques, open issues, and future directions [ 35 ]

The document highlights social media’s transformative impact in healthcare, emphasizing the value of leveraging social platforms for patient support, disease prevention, and the real-time tracking of contagious diseases.

The review highlights challenges in both content and network-oriented approaches, such as privacy concerns, scalability limitations, and accuracy enhancement with incomplete data. Comprehensive resolution remains an open frontier, requiring innovative solutions for privacy preservation and accurate predictions.

The paper also highlights emerging trends in Big Data Analytics, emphasizing real-time and predictive analysis, and addressing challenges in sentiment analysis. It identifies underexplored areas like political and e-commerce applications, underscoring the expanding trajectory of Big Data Analytics. Furthermore, it emphasizes the evolving complexities of linguistic analysis, underlining the need for domain-dependent sentiment analysis, and addressing challenges like sarcasm detection.

PS17—Big Data and its future in computational biology: a literature review [ 36 ]

The document underscores the growing significance of Big Data in computational biology and healthcare, particularly in the conversion of healthcare records into digital formats. It highlights the major application domains, focusing on optimizing health and medical care through electronic health data.

Challenges include the under-utilization of electronic health data and the need to convert raw data into actionable information. Despite increasing interest, the field lacks comprehensive literature reviews.

The document outlines emerging trends in Big Data for computational biology and bioinformatics. It emphasizes the pivotal role of volume, variety, and velocity in defining Big Data’s impact on bioinformatics. Key technologies, including Hadoop and MapReduce, are discussed, illustrating their significance in the field. The integration of Big Data technology is shown to enhance biological findings and facilitate real-time identification of high-risk patients. However, limitations, such as narrow study focuses, are noted.

PS18—Big Data and sentiment analysis: a comprehensive and systematic literature review [ 37 ]

The document delves into the diverse applications of Big Data Analytics, spotlighting its evolution, notably in sentiment analysis for marketing and disaster response.

Challenges identified include data quality issues and the absence of standardized disaster-related datasets. The limitations of centralized data mining algorithms for distributed systems are acknowledged, urging exploration into other platforms (YARN is directly cited as an example). The analysis underscores the need for immediate and improved performance, emphasizing real-time analysis.

Looking ahead, the study urges researchers to examine specific methods such as Hadoop, MapReduce, and deep learning more closely, in order to better understand their strengths and the situations in which they struggle.
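
To ground the discussion of sentiment analysis, the following is a minimal, lexicon-based scoring sketch in Python; the word lists are toy assumptions, and it is precisely the limitations of such simple approaches (domain dependence, sarcasm) that motivate the deep learning directions noted above.

```python
# A deliberately tiny lexicon; real systems rely on domain-dependent lexicons
# or trained models, which is exactly the gap the reviewed study points to.
POSITIVE = {"good", "great", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "late"}

def sentiment_score(text: str) -> int:
    """Return a crude polarity score: positive word count minus negative word count."""
    words = text.lower().split()
    return sum(word in POSITIVE for word in words) - sum(word in NEGATIVE for word in words)

print(sentiment_score("great support and fast delivery"))   # 2
print(sentiment_score("the app is slow and often broken"))  # -2
```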

PS19—Big Data applications on the internet of things: a systematic literature review [ 38 ]

This document explores the evolving applications of Big Data, from understanding customer sentiments to enhancing disaster response. Hadoop emerges as a popular framework.

Challenges include achieving robust data acquisition from IoT devices, addressing security concerns, and optimizing system scalability.

Future directions involve improving algorithms for efficiency, addressing energy consumption, and exploring the synergy of Big Data and machine learning for emergency systems.

PS20—Big Data in education: a state of the art, limitations, and future research directions [ 39 ]

The paper surveys how Big Data Analytics is applied across various areas, especially education, noting a marked increase in publications from 2014 to 2019. It highlights key topics such as student behavior analysis, model building, data-driven education, system improvement, and the integration of Big Data as a subject into study plans.

Researchers face challenges in employing qualitative methods and data collection techniques, highlighting the need for quantitative approaches and more robust methodologies.

Future research should emphasize quantifying Big Data’s impact, adopting efficient solutions, exploring new tools and developing frameworks for educational applications. Integrating the concept of Big Data into study plans requires significant restructuring and well-designed learning activities.

PS21—Big Data in healthcare—a comprehensive bibliometric analysis of current research trends [ 40 ]

This document unveils the dynamic evolution of Big Data Analytics across diverse application domains, with a notable surge in research activities within the healthcare sector since 2012.

While the study discusses various related studies and challenges in Big Data analysis, it does not directly address or provide specific solutions to those challenges.

Looking ahead, the document reveals emerging trends and directions shaping the future of Big Data Analytics over the next 5 to 10 years. Key themes include data analytics, predictive analytics, and collaborative networks, providing a glimpse into the evolving landscape of research endeavors.

PS22—Big Data life cycle in shop-floor-trends and challenges [ 41 ]

The document explores Big Data Analytics in manufacturing, emphasizing its application domains like maintenance, automation, and decision-making.

Challenges include data measurement errors, high-frequency sampling issues, and the need for real-time processing. The study notes a shift to scalable storage options and highlights the importance of efficient data management.

Emerging trends involve the prominent role of AI and statistical approaches in data processing, coupled with a growing emphasis on data privacy. The study concludes with a call for future work focused on developing a consolidated framework for the Big Data life cycle in manufacturing.

PS23—Big Data testing techniques: taxonomy, challenges and future trends [ 42 ]

The paper explores the shift from traditional to advanced testing methods to address challenges in ETL processes, data quality, and node failures.

Addressing major challenges in Big Data analysis, the paper emphasizes the inadequacy of traditional testing, highlighting specific difficulties like ETL testing, node failure prevention, and unit-level debugging. It showcases evolving strategies employed by researchers to ensure the quality of Big Data systems.

Looking ahead, the document outlines emerging research trends shaping the future of Big Data Analytics. It identifies trends such as combinatorial testing techniques, fault tolerance testing, and model-driven entity reconciliation testing as key areas for future exploration.

PS24—Big Data with cognitive computing: a review for the future [ 43 ]

The paper explores the application domains of Big Data Analytics, highlighting its early stage in conjunction with cognitive computing, particularly in healthcare.

Challenges in adoption are attributed to a perceived lack of strategic value. The study categorizes issues into data, process, and management challenges, emphasizing the potential of integrating cognitive computing to overcome barriers.

Regarding emerging trends, there’s a rising interest in cognitive computing. The research encourages more global collaboration and highlights a gap in understanding how Big Data studies impact decision-making processes.

PS25—Current approaches for executing Big Data science projects-a systematic literature review [ 44 ]

The paper explores the landscape of Big Data Analytics. Regarding the common application domains and their evolution, the study notes a significant increase in articles. Workshops play a crucial role in shaping the trajectory, reflecting a robust and expanding interest in Big Data Analytics, influenced by technological advancements.

It also addresses challenges in Big Data analysis, with a focus on workflows and agility. While acknowledging the conceptual nature of the agility papers, the study underscores a gap between theoretical benefits and practical implementation, necessitating further exploration to optimize agile frameworks for data science projects.

The study highlights emerging trends in Big Data, emphasizing the need for integrated frameworks in data science. It points out a research gap in standardized approaches, urging further exploration for innovative methodologies.

PS26—Data quality affecting Big Data analytics in smart factories: research themes, issues and methods [ 45 ]

This review explores the growing applications of Big Data Analytics in Smart Factories, emphasizing an upsurge in empirical case studies on production, process monitoring, and quality tracing.

Challenges involve key data quality issues (missing, anomalous, noisy, and old data), as well as ISO-defined data quality dimensions. While technical methods prevail, an integrated approach combining technical and non-technical methods for comprehensive data quality management is highlighted. Theoretical insights focus on data quality dimensions, issues, and resolutions, while practical implications underscore the need for collaboration and integrated methods.
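
As a small illustration of the technical side of such data quality methods, the Python sketch below applies two elementary checks, one for missing values and one for out-of-range readings; the sensor values and the assumed valid operating range are invented for the example and are not methods taken from the reviewed study.

```python
# Invented temperature readings from one machine; None marks a missing sample.
readings = [20.1, 20.3, None, 19.8, 55.0, 20.0]

# Missing-data check: locate gaps before they propagate into downstream analytics.
missing_indices = [i for i, value in enumerate(readings) if value is None]

# Range check: the valid operating window is an assumed, plant-specific limit.
VALID_MIN, VALID_MAX = 0.0, 40.0
out_of_range = [value for value in readings
                if value is not None and not VALID_MIN <= value <= VALID_MAX]

print("missing indices:", missing_indices)   # [2]
print("out-of-range values:", out_of_range)  # [55.0]
```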

The study calls for future research in frameworks, data quality requirements, and emerging scenarios, contributing to Big Data Analytics evolution in Smart Factories.

PS27—Harnessing Big Data analytics for healthcare: a comprehensive review of frameworks, implications, applications, and impacts [ 46 ]

The study meticulously explores the landscape of Big Data Analytics in healthcare. Noteworthy application domains, such as multimodal data analysis and fusion, natural language processing, and electronic health records, emerge from this exploration.

The document presents several challenges faced in Big Data analysis, highlighting issues like data quality, privacy concerns, and a shortage of skilled professionals. It emphasizes the necessity for interoperability and standardization while identifying ongoing challenges in multimodality, ethical considerations, and bias mitigation.

The research outlines emerging trends and directions in Big Data, emphasizing the importance of ongoing exploration in areas like multimodality, data mining, precision medicine, ethical considerations, and the broader understanding of the Big Data Ecosystem.

PS28—Leveraging Big Data in smart cities: a systematic review [ 47 ]

Big Data Analytics has evolved across diverse domains, expanding from finance and healthcare to smart cities and e-commerce. This evolution has been marked by a transformative impact on industries.

Challenges in Big Data, including security, privacy, and scalability issues, have prompted innovative solutions. Advanced encryption, anonymization techniques, and scalable computing frameworks address these concerns.

Looking ahead, emerging trends highlight the fusion of Big Data with AI, machine learning, and technologies like edge computing. Ethical considerations gain prominence and quantum computing’s potential is explored for handling massive datasets.

PS29—Roles and capabilities of enterprise architecture in Big Data analytics technology adoption and implementation [ 48 ]

The document explores the evolution and current state of Big Data Analytics, highlighting its diverse applications in domains like healthcare and finance.

Researchers have grappled with challenges such as data privacy and scalability, addressing them through innovations like advanced encryption and scalable algorithms.

Looking forward, emerging trends include the integration of Artificial Intelligence and Machine Learning for enhanced analytics and a growing focus on ethics and responsible data use. The intersection of Big Data with edge computing and IoT also opens new frontiers for real-time analytics.

PS30—Security and privacy challenges of Big Data adoption: a qualitative study in telecommunication industry [ 49 ]

The research investigates the evolution of Big Data Analytics applications across diverse domains, emphasizing healthcare, finance, marketing, and telecommunications.

Challenges include data security and privacy, addressed through advanced encryption and privacy-preserving techniques.

In the future, emerging trends highlight explainable AI, ethical data practices, and innovations in handling streaming data, graph databases, and blockchain integration.

PS31—The role of AI, machine learning, and Big Data in digital twinning: a systematic literature review, challenges, and opportunities [ 50 ]

The document explores diverse applications of Big Data Analytics across industries like healthcare, energy, and manufacturing. It underscores the evolution of these applications, highlighting a focus on optimization, diagnostics, and predictive analytics.

Challenges include data collection difficulties, selecting AI models that are both accurate and fast, and the ongoing need for standardization in digital twinning.

The document anticipates future trends, emphasizing the integration of AI, Machine Learning, and Big Data, particularly in digital twinning. It sets the stage for ongoing research in optimizing industrial processes, predictive analytics, healthcare, and smart city implementations.

PS32—The state of the art and taxonomy of Big Data analytics: view from new Big Data framework [ 51 ]

The document extensively explores the landscape of Big Data Analytics, emphasizing the dominant role of Hadoop while acknowledging the rise of Apache Spark in recent years.

Major challenges in the field involve handling diverse data formats, optimizing algorithms for evolving hardware configurations, and bridging the gap between complex systems and end-users through user-friendly visualization techniques.

It anticipates future advancements in applications, specifically in domains like e-commerce and the IoT, while expressing optimism about increased investments in Big Data technology.

In the last 15 years, Big Data has found applications across various domains, evolving in step with new technologies and changing business needs. Some of the most common application domains for Big Data Analytics include:

Business and Finance, for example, to detect fraud by analyzing large datasets and identifying patterns indicative of fraudulent activities, or to study customer behavior, preferences, and trends to improve marketing strategies.

Healthcare, for example, to forecast disease outbreaks, patient admission rates, and treatment outcomes, or to personalize medicine by analyzing genetic data for tailored treatments.

Retail, for example, to automatically manage and optimize inventories and stock levels by predicting demand, or to build recommender systems targeted at segmented customer profiles.

Manufacturing, for example, to predict and schedule maintenance needs and potential equipment failures by analyzing sensor data, or to improve product quality by monitoring and analyzing production processes.

Telecommunications, for example, to optimize network performance in real time and identify areas for improvement, or to predict customer churn by identifying the factors and customer behaviors that contribute to it (a minimal churn-prediction sketch follows this list).

Government, Public Services, and Transportation, for example, to plan efficient urban mobility, traffic management, and resource allocation in Smart Cities, or to predict and prevent criminal activities, or to optimize energy distribution and reduce wastage, or to optimize transportation routes, reduce delivery times, and manage vehicle fleets for efficiency and cost savings.

Media, Entertainment, and Education, for example, to recommend movies, music, or articles based on users’ behaviors and preferences, or to tailor content and advertising by studying users’ behaviors, or to improve educational impact by analyzing student performance.
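
As referenced in the telecommunications item above, the following is a minimal, illustrative churn-prediction sketch; the features, training data, and choice of scikit-learn logistic regression are assumptions made for the example and are not prescribed by any of the reviewed studies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: [monthly_minutes, support_calls] per customer.
X = np.array([[300, 0], [120, 4], [250, 1], [80, 6], [400, 0], [60, 5]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = the customer churned

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Estimated churn probability for a new, unseen customer profile.
new_customer = np.array([[100, 3]])
churn_probability = model.predict_proba(new_customer)[0, 1]
print(f"estimated churn probability: {churn_probability:.2f}")
```

In practice, such a model would be trained on far richer behavioral features and on datasets large enough to require the distributed processing frameworks discussed throughout this review.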

In Fig. 6, we show the distribution of the studies addressing the three research questions (RQ1-RQ3) from which our investigation initially started: 31 PSs discuss common application domains where the use of Big Data solutions is relevant (RQ1); 30 PSs analyze research challenges and limitations of Big Data (RQ2); 28 PSs highlight emerging research trends and directions in Big Data (RQ3). The total number of papers addressing the three RQs differs from the number of the 32 selected PSs because of overlaps and intersections (e.g., a PS can address multiple RQs).

Fig. 6. Distribution of studies addressing the three research questions

To better understand the main focus of the PSs, Fig. 7 shows the distribution of studies across the three research questions, this time without intersections (i.e., each primary study is assigned to only one of the three categories). We can classify 12 PSs as mainly focusing on RQ1, 10 PSs as mainly focusing on RQ2, and 10 PSs on RQ3. This fairly even distribution makes us optimistic about the results of our research, since a good number of studies is available to answer each research question.

Fig. 7. Distribution of studies mainly addressing the three research questions
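
The counting above can be made concrete with a small, purely illustrative Python example; the study identifiers below are placeholders, and only the logic of overlapping versus exclusive counts is being shown.

```python
# Placeholder identifiers: which (hypothetical) primary studies address each RQ.
rq1 = {"PS-A", "PS-B", "PS-C"}
rq2 = {"PS-B", "PS-C"}
rq3 = {"PS-C", "PS-D"}

# With overlaps, the per-RQ totals can exceed the number of distinct studies.
print(len(rq1), len(rq2), len(rq3))  # 3 2 2
print(len(rq1 | rq2 | rq3))          # 4 distinct studies in total

# Exclusive categorization (as in Fig. 7): each study counted once, assigned
# here to the first RQ it addresses -- an arbitrary illustrative rule.
exclusive = {"RQ1": rq1, "RQ2": rq2 - rq1, "RQ3": rq3 - rq2 - rq1}
print({rq: len(studies) for rq, studies in exclusive.items()})  # {'RQ1': 3, 'RQ2': 0, 'RQ3': 1}
```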

To further clarify the main focus of our studies, we categorized each one. Figures 8, 9, and 10 show the focus of the documents for each Research Question (note that the sum of the categorized documents may be greater than the number of studies answering that RQ, because a document may overlap and be part of more than one category).

Fig. 8. Categorization of RQ1 studies

Fig. 9. Categorization of RQ2 studies

Fig. 10. Categorization of RQ3 studies

Having clarified this, we now discuss the findings of our SLR. We divided this discussion into three sections, one for each Research Question, so that we could clearly define which elements answer which question.

RQ1: what are the most common application domains for Big Data analytics, and how have they evolved over time?

Delving into the realm of Big Data across various sectors over the last 15 years reveals a narrative of evolution and adaptation. Initially rooted in finance, healthcare and marketing, the domain of Big Data analytics has undergone a metamorphosis, embracing applications from computational biology to education and manufacturing, expanding into the avant-garde concept of digital twinning. This dynamic evolution is evident in studies investigating Big Data management techniques on the Internet of Things, where the focus has shifted from basic health state monitoring to sophisticated predictive modeling. This evolution signifies a maturation of Big Data analytics, with an increased focus on nuanced attributes like performance, efficiency, reliability, and scalability.

RQ2: what are the major challenges and limitations that researchers have encountered in Big Data analysis, and how have they been addressed?

Shifting our focus to the challenges within the Big Data analytics landscape, a complex history of persistent hurdles and inventive solutions comes into focus. The studies converge on a common thread, unraveling ongoing challenges encapsulated in the trio of data quality, scalability, and privacy/security concerns. Researchers faced with these challenges have become architects of innovative solutions, leveraging advanced algorithms, distributed frameworks, and privacy-preserving techniques. These solutions reflect a commitment to advancing the field in response to the complexities of handling vast and dynamic datasets.

In the implementation of Big Data Analytics, diverse challenges emerge. A dedicated study on industries points to crucial issues in data quality assessment models and secure frameworks. Here, the role of machine learning and data analytics, particularly in fault detection and reservoir management, becomes pivotal. The interconnected nature of these challenges emphasizes the importance of a comprehensive approach to implementation. Beyond technological challenges, ethical considerations surrounding data privacy and security take center stage. Researchers stress the significance of tools addressing ethical concerns, underlining that responsible deployment is intrinsic to the ethical use of Big Data Analytics.

In response to these challenges, the industry advocates for innovative solutions, emphasizing AI-driven approaches, cryptography mechanisms, and lightweight frameworks with AI. This recognition underscores the need for inventive strategies to navigate the intricate integration of Big Data into rapidly evolving technological landscapes.
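
The privacy-preserving techniques mentioned above span a range of mechanisms; as one hedged, widely known example, the sketch below adds Laplace noise to an aggregate count in the spirit of differential privacy. The records, the query, and the privacy budget are assumptions made for the example, and this is not claimed to be the specific mechanism adopted in the reviewed studies.

```python
import numpy as np

# Invented per-patient ages; the aggregate query is a simple count.
ages = [34, 51, 29, 62, 47, 38]
true_count = sum(1 for age in ages if age >= 40)

# Laplace mechanism: noise scale = sensitivity / epsilon. A counting query has
# sensitivity 1; epsilon = 0.5 is an assumed privacy budget for the example.
epsilon = 0.5
noisy_count = true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

print("true count:", true_count)
print("released (noisy) count:", round(noisy_count, 2))
```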

RQ3: what are the emerging research trends and directions in Big Data that will likely shape the field in the next 5 to 10 years?

Looking into the next 5 to 10 years, several trends are expected to shape the landscape of Big Data Analytics. One significant trend involves making data acquisition more energy-efficient, a move that aligns with broader sustainability goals. The integration of machine learning and deep learning techniques is anticipated to enhance the analytical capabilities of Big Data systems, enabling more accurate predictions and insights. Another noteworthy trend is the emphasis on edge and fog infrastructures, signifying a shift towards decentralized processing for faster data processing and decision-making, especially relevant in the context of the Internet of Things. Importantly, these trends extend beyond technological advancements to include ethical considerations. As Big Data assumes a pivotal role in decision-making processes, these ethical dimensions must be at the forefront, which means confronting the difficult ethical questions raised by the growing influence of data analytics.
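
To illustrate the shift towards decentralized edge processing mentioned above, the minimal Python sketch below summarizes a window of raw readings on the device and forwards only the compact summary; the device identifier, window, and alert threshold are invented for the example.

```python
# Invented raw readings from one edge device over a short time window.
window = [21.4, 21.5, 21.7, 35.2, 21.6]

# Edge-side processing: summarize locally and flag anomalies immediately,
# instead of shipping every raw reading to a central cluster.
summary = {
    "device": "sensor-07",                      # invented identifier
    "mean": round(sum(window) / len(window), 2),
    "max": max(window),
    "alerts": [v for v in window if v > 30.0],  # assumed alert threshold
}

# Only this compact summary would be forwarded to the backend for storage
# and fleet-wide analytics.
print(summary)
```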

In essence, the trajectory of Big Data analytics in the coming years is a dual journey, one that advances technologically with a keen eye on efficiency and, concurrently, prioritizes ethical practices. It’s a future where innovation and responsibility go hand in hand, defining a landscape that reflects both progress and ethical consciousness.

Threats to validity

Ensuring the validity of an SLR is essential for the development of a reliable study. For this reason, in this section, we examine potential threats to construct, internal, and external validity, aiming to maintain the robustness of our findings.

Construct validity determines whether the implementation of the SLR aligns with its initial objectives. The efficacy of our search process and the relevance of search terms are crucial concerns. While our search terms were derived from well-defined research questions and refined accordingly, the completeness and comprehensiveness of these terms may be subject to limitations. Additionally, the use of different keywords might have returned other relevant studies that have not been taken into consideration. A potential language bias may also exist due to the exclusion of non-English articles, representing a limitation that should be acknowledged in the overall validity of the research.

Internal validity assesses the extent to which the design and execution of the study minimize systematic errors. A key focus is on the process of data extraction from the selected primary studies. Some required data may not have been explicitly expressed or were entirely missing, posing a potential threat to internal validity. To reduce this risk, the SLR process was supervised by a second researcher in order to minimize errors.

External validity examines the extent to which the observed effects of the study can be applied beyond its scope. In this SLR, we concentrated on research questions and quality assessments to mitigate the risk of limited generalizability. However, the study’s focus on the specific domain of Big Data research may limit external validity. Moreover, the dynamic nature of Big Data and the predefined time frame (last 15 years) could affect the generalizability of findings. Recognizing these constraints, the outcomes of this SLR are considered generalizable within the specified context of Big Data research.

By acknowledging these potential threats to validity, we strive to enhance the credibility and reliability of our SLR, contributing valuable insights to the evolving landscape of Big Data research.

Over the past 15 years, Big Data has become a crucial player in various fields, adapting to technological shifts and meeting the changing needs of businesses. This review has taken a closer look at how Big Data has been applied, its challenges, and what we can expect in the near future. A total of 189 studies were ultimately found, 32 of which were SLRs analyzed for this study.

Big Data started in areas like Business, Healthcare, and Marketing, but its influence has ultimately grown. Now, it helps predict disease outbreaks, manage retail inventory, forecast equipment failures in manufacturing, improve network performance, optimize urban planning, personalize media content, and enhance education.

Dealing with Big Data hasn’t been without challenges. Issues like ensuring data quality, handling scalability, and maintaining privacy and security have been persistent. Researchers have responded with creative solutions, using advanced algorithms and privacy measures.

Looking to the future, the trends suggest exciting developments. Making data acquisition more energy-efficient and integrating advanced machine learning techniques are on the horizon. There is a shift toward decentralized processing, especially with the Internet of Things in mind. Importantly, these trends aren’t just about technology; they also emphasize ethical considerations. Ethical issues need careful attention as Big Data becomes more influential in decision-making processes.

To summarize, the future of Big Data is a journey that combines technological progress with a strong ethical stance. It’s a path where innovation and responsibility walk hand in hand, shaping a landscape that advances both technologically and ethically. The last 15 years have set the stage and the road ahead invites us to keep exploring and engaging with the ever-evolving world of Big Data.

Data availability

No datasets were generated or analysed during the current study.

Bibliography

Tosi D, Campi AS. How schools affected the covid-19 pandemic in Italy: data analysis for Lombardy Region, Campania Region, and Emilia Region. Future Internet. 2021. https://doi.org/10.3390/fi13050109 .

Davoudian A, Liu M. Big Data systems: a software engineering perspective. ACM Comput Surv. 2020. https://doi.org/10.1145/3408314 .

Kushwaha AK, Kar AK. Language model-driven chatbot for business to address marketing and selection of products. In: Sharma SK, Dwivedi YK, Metri B, Rana NP, editors. Re-imagining diffusion and adoption of information technology and systems: a continuing conversation. Cham: Springer; 2020. p. 16–28. https://doi.org/10.1007/978-3-030-64849-7_3 .

Kushwaha AK, Kar AK. Micro-foundations of artificial intelligence adoption in business: making the shift. In: Sharma SK, Dwivedi YK, Metri B, Rana NP, editors. Re-imagining diffusion and adoption of information technology and systems: a continuing conversation. Cham: Springer; 2020. p. 249–60. https://doi.org/10.1007/978-3-030-64849-7_22 .

Dong W, Liao S, Zhang Z. Leveraging financial social media data for corporate fraud detection. J Manag Inf Syst. 2018;35(2):461–87. https://doi.org/10.1080/07421222.2018.1451954 .

Kushwaha AK, Mandal S, Pharswan R, Kar AK, Ilavarasan PV. Studying online political behaviours as rituals: a study of social media behaviour regarding the CAA. In: Sharma SK, Dwivedi YK, Metri B, Rana NP, editors. Re-imagining diffusion and adoption of information technology and systems: a continuing conversation. Cham: Springer; 2020. p. 315–26. https://doi.org/10.1007/978-3-030-64861-9_28 .

Fronzetti Colladon A, Gloor P, Iezzi DF. Editorial introduction: the power of words and networks. Int J Inf Manag. 2020;51: 102031. https://doi.org/10.1016/j.ijinfomgt.2019.10.016 .

Kushwaha AK, Kar AK, Ilavarasan PV. Predicting retweet class using deep learning. In: Piuri V, Raj S, Genovese A, Srivastava R, editors. Trends in deep learning methodologies: hybrid computational intelligence for pattern analysis. Cambridge: Academic Press; 2021. p. 89–112. https://doi.org/10.1016/B978-0-12-822226-3.00004-0 .

Tosi D. Cell phone Big Data to compute mobility scenarios for future smart cities. Int J Data Sci Anal. 2017;4:265–84. https://doi.org/10.1007/s41060-017-0061-2 .

Chen H, Chiang RHL, Storey VC. Business intelligence and analytics: from Big Data to big impact. MIS Q. 2012;36(4):1165–88. https://doi.org/10.2307/41703503 .

Wamba SF, Ngai E, Riggins F, Akter S. Big Data and business analytics adoption and use: a step toward transforming operations and production management? Bingley: Emerald Group Publishing Limited; 2017.

George G, Osinga EC, Lavie D, Scott BA. Big Data and data science methods for management research. Acad Manag. 2016. https://doi.org/10.5465/amj.2016.4005 .

Curtin J, Kauffman RJ, Riggins FJ. Making the ‘MOST’ out of RFID technology: a research agenda for the study of the adoption, usage and impact of RFID. Inf Technol Manag. 2007;8(2):87–110. https://doi.org/10.1007/s10799-007-0010-1 .

Kitchenham BA, Charters S. Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE 2007-001, Keele University and Durham University Joint Report. 2007. https://www.elsevier.com/__data/promis_misc/525444systematicreviewsguide.pdf . Accessed 15 Jan 2024

Tosi D, Morasca S. Supporting the semi-automatic semantic annotation of web services: a systematic literature review. Inf Softw Technol. 2015;61:16–32. https://doi.org/10.1016/j.infsof.2015.01.007 .

Tahir A, Tosi D, Morasca S. A systematic review on the functional testing of semantic web services. J Syst Softw. 2013;86(11):2877–89. https://doi.org/10.1016/j.jss.2013.06.064 .

Briner RB, Denyer D. 112 systematic review and evidence synthesis as a practice and scholarship tool. In: Rousseau DM, editor. The Oxford handbook of evidence-based management. Oxford: Oxford University Press; 2012. https://doi.org/10.1093/oxfordhb/9780199763986.013.0007 .

Tosi D, Marzorati S, La Rosa M, Dondossola G, Terruggia R. Big Data from cellular networks: how to estimate energy demand at real-time. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE. 2015. pp. 1–10. https://doi.org/10.1109/DSAA.2015.7344881 .

Cappi R, Casini L, Tosi D, Roccetti M. Questioning the seasonality of SARS-COV-2: a Fourier spectral analysis. BMJ Open. 2022. https://doi.org/10.1136/bmjopen-2022-061602 .

Naghib A, Jafari Navimipour N, Hosseinzadeh M, Sharifi A. A comprehensive and systematic literature review on the Big Data management techniques in the internet of things. Wirel Netw. 2023;29(3):1085–144. https://doi.org/10.1007/s11276-022-03177-5 .

Sarker S, Arefin MS, Kowsher M, Bhuiyan T, Dhar PK, Kwon O-J. A comprehensive review on Big Data for industries: challenges and opportunities. IEEE Access. 2023;11:744–69. https://doi.org/10.1109/ACCESS.2022.3232526 .

Bansal M, Chana I, Clarke S. A survey on IoT Big Data: current status, 13 V’s challenges, and future directions. ACM Comput Surv. 2020. https://doi.org/10.1145/3419634 .

Hordri NF, Samar A, Yuhaniz SS, Shamsuddin SM. A systematic literature review on features of deep learning in Big Data analytics. Int J Adv Soft Comput Appl. 2017;9(1):32–49.

Zhong Y, Chen L, Dan C, Rezaeipanah A. A systematic survey of data mining and Big Data analysis in internet of things. J Supercomput. 2022;78(17):18405–53. https://doi.org/10.1007/s11227-022-04594-1 .

Rashid ANMB. Access methods for Big Data: current status and future directions. EAI Endorsed Trans Scal Inf Syst. 2017;4(15):1–14. https://doi.org/10.4108/eai.28-12-2017.153520 .

O’Donovan P, Leahy K, Bruton K, O’Sullivan DTJ. An industrial Big Data pipeline for data-driven analytics maintenance applications in large-scale smart manufacturing facilities. J Big Data. 2015. https://doi.org/10.1186/s40537-015-0034-z .

Kushwaha AK, Kar AK, Dwivedi YK. Applications of Big Data in emerging management disciplines: a literature review using text mining. Int J Inf Manag Data Insights. 2021. https://doi.org/10.1016/j.jjimei.2021.100017 .

Alkhalil A, Abdallah MAE, Alogali A, Aljaloud A. Applying Big Data analytics in higher education: a systematic mapping study. Int J Inf Commun Technol Educ. 2021;17(3):29–51. https://doi.org/10.4018/IJICTE.20210701.oa3 .

Rahmani AM, Azhir E, Ali S, Mohammadi M, Ahmed OH, Ghafour MY, Ahmed SH, Hosseinzadeh M. Artificial intelligence approaches and mechanisms for Big Data analytics: a systematic study. PeerJ Computer Sci. 2021;7:1–28. https://doi.org/10.7717/peerj-cs.488 .

Lundberg L. Bibliometric mining of research directions and trends for Big Data. J Big Data. 2023. https://doi.org/10.1186/s40537-023-00793-6 .

Baig MI, Shuib L, Yadegaridehkordi E. Big Data adoption: state of the art and research challenges. Inf Process Manag. 2019. https://doi.org/10.1016/j.ipm.2019.102095 .

Ikegwu AC, Nweke HF, Anikwe CV, Alo UR, Okonkwo OR. Big Data analytics for data-driven industry: a review of data sources, tools, challenges, solutions, and research directions. Clust Comput. 2022;25(5):3343–87. https://doi.org/10.1007/s10586-022-03568-5 .

Imran S, Mahmood T, Morshed A, Sellis T. Big Data analytics in healthcare: a systematic literature review and roadmap for practical implementation. IEEE/CAA J Autom Sin. 2021;8(1):1–22. https://doi.org/10.1109/JAS.2020.1003384 .

Zahid H, Mahmood T, Morshed A, Sellis T. Big Data analytics in telecommunications: literature review and architecture recommendations. IEEE/CAA J Autom Sin. 2020;7(1):18–38. https://doi.org/10.1109/JAS.2019.1911795 .

Bazzaz Abkenar S, Haghi Kashani M, Mahdipour E, Jameii SM. Big Data analytics meets social media: a systematic review of techniques, open issues, and future directions. Telemat Inf. 2021. https://doi.org/10.1016/j.tele.2020.101517 .

ElSayed IA, ElDahshan K, Hefny H, ElSayed EK. Big Data and its future in computational biology: a literature review. J Computer Sci. 2021;17(12):1222–8. https://doi.org/10.3844/jcssp.2021.1222.1228 .

Hajiali M. Big Data and sentiment analysis: a comprehensive and systematic literature review. Concurr Comput Pract Exp. 2020. https://doi.org/10.1002/cpe.5671 .

Ahmadova U, Mustafayev M, Kiani Kalejahi B, Saeedvand S, Rahmani AM. Big Data applications on the internet of things: a systematic literature review. Int J Commun Syst. 2021. https://doi.org/10.1002/dac.5004 .

Baig MI, Shuib L, Yadegaridehkordi E. Big Data in education: a state of the art, limitations, and future research directions. Int J Educ Technol High Educ. 2020;17(1):1–23.

Reshi AA, Shah ARIF, Shafi S, Qadri MH. Big Data in healthcare—a comprehensive bibliometric analysis of current research trends. Scal Comput. 2023;24(3):531–49. https://doi.org/10.12694/scpe.v24i3.2155 .

Pulikottil T, Estrada-Jimenez LA, Abadía JJP, Carrera-Rivera A, Torayev A, Rehman HU, Mo F, Nikghadam-Hojjati S, Barata J. Big Data life cycle in shop-floor-trends and challenges. IEEE Access. 2023;11:30008–26. https://doi.org/10.1109/ACCESS.2023.3253286 .

Arshad I, Alsamhi SH, Afzal W. Big Data testing techniques: taxonomy, challenges and future trends. Computers Mater Contin. 2023;74(2):2739–70. https://doi.org/10.32604/cmc.2023.030266 .

Gupta S, Kar AK, Baabdullah A, Al-Khowaiter WAA. Big Data with cognitive computing: a review for the future. Int J Inf Manag. 2018;42:78–89. https://doi.org/10.1016/j.ijinfomgt.2018.06.005 .

Saltz JS, Krasteva I. Current approaches for executing Big Data science projects-a systematic literature review. PeerJ Computer Sci. 2022. https://doi.org/10.7717/PEERJ-CS.862 .

Liu C, Peng G, Kong Y, Li S, Chen S. Data quality affecting Big Data analytics in smart factories: research themes, issues and methods. Symmetry. 2021. https://doi.org/10.3390/sym13081440 .

Ahmed A, Xi R, Hou M, Shah SA, Hameed S. Harnessing Big Data analytics for healthcare: a comprehensive review of frameworks, implications, applications, and impacts. IEEE Access. 2023;11:112891–928. https://doi.org/10.1109/ACCESS.2023.3323574 .

Karimi Y, Haghi Kashani M, Akbari M, Mahdipour E. Leveraging Big Data in smart cities: a systematic review. Concurr Comput Pract Exp. 2021. https://doi.org/10.1002/cpe.6379 .

Gong Y, Janssen M. Roles and capabilities of enterprise architecture in Big Data analytics technology adoption and implementation. J Theor Appl Electron Commer Res. 2021;16(1):37–51. https://doi.org/10.4067/S0718-18762021000100104 .

Anawar S, Othman NF, Selamat SR, Ayop Z, Harum N, Rahim FA. Security and privacy challenges of Big Data adoption: a qualitative study in telecommunication industry. Int J Interact Mob Technol. 2022;16(19):81–97. https://doi.org/10.3991/ijim.v16i19.32093 .

Rathore MM, Shah SA, Shukla D, Bentafat E, Bakiras S. The role of AI, machine learning, and Big Data in digital twinning: a systematic literature review, challenges, and opportunities. IEEE Access. 2021;9:32030–52. https://doi.org/10.1109/ACCESS.2021.3060863 .

Mohamed A, Najafabadi MK, Wah YB, Zaman EAK, Maskat R. The state of the art and taxonomy of Big Data analytics: view from new Big Data framework. Artif Intell Rev. 2020;53(2):989–1037. https://doi.org/10.1007/s10462-019-09685-9 .

Li Y, Liu Z, Zhu H. Enterprise search in the Big Data era: recent developments and open challenges. Proc VLDB Endow. 2014;7(13):1717–8. https://doi.org/10.14778/2733004.2733071 .

Lee D, Camacho D, Jung JJ. Smart mobility with Big Data: approaches, applications, and challenges. Appl Sci. 2023. https://doi.org/10.3390/app13127244 .

Himeur Y, Elnour M, Fadli F, Meskin N, Petri I, Rezgui Y, Bensaali F, Amira A. AI-Big Data analytics for building automation and management systems: a survey, actual challenges and future perspectives. Artif Intell Rev. 2023;56(6):4929–5021. https://doi.org/10.1007/s10462-022-10286-2 .

Cesario E. Big Data analytics and smart cities: applications, challenges, and opportunities. Front Big Data. 2023. https://doi.org/10.3389/fdata.2023.1149402 .

Zwilling M. Big Data challenges in social sciences: an NLP analysis. J Computer Inf Syst. 2023;63(3):537–54. https://doi.org/10.1080/08874417.2022.2085211 .

Rani R, Khurana M, Kumar A, Kumar N. Big Data dimensionality reduction techniques in IoT: review, applications and open research challenges. Clust Comput. 2022;25(6):4027–49. https://doi.org/10.1007/s10586-022-03634-y .

Jagatheesaperumal SK, Rahouti M, Ahmad K, Al-Fuqaha A, Guizani M. The duo of artificial intelligence and Big Data for industry 4.0: applications, techniques, challenges, and future research directions. IEEE Internet Things J. 2022;9(15):12861–85. https://doi.org/10.1109/JIOT.2021.3139827 .

Lundberg L, Grahn H. Research trends, enabling technologies and application areas for Big Data. Algorithms. 2022. https://doi.org/10.3390/a15080280 .

Ali TAL, Khafagy MH, Farrag MH. Big Data challenges: preserving techniques for privacy violations. J Theor Appl Inf Technol. 2022;100(8):2505–17.

Latifian A. How does cloud computing help businesses to manage Big Data issues. Kybernetes. 2022;51(6):1917–48. https://doi.org/10.1108/K-05-2021-0432 .

Rehman A, Naz S, Razzak I. Leveraging Big Data analytics in healthcare enhancement: trends, challenges and opportunities. Multimed Syst. 2022;28(4):1339–71. https://doi.org/10.1007/s00530-020-00736-8 .

Al-Zahrani A, Al-Hebbi M. Big Data major security issues: challenges and defense strategies. Tehnicki Glasnik. 2022;16(2):197–204. https://doi.org/10.31803/tg-20220124135330 .

Song X, Zhang H, Akerkar R, Huang H, Guo S, Zhong L, Ji Y, Opdahl AL, Purohit H, Skupin A, Pottathil A, Culotta A. Big Data and emergency management: concepts, methodologies, and applications. IEEE Trans Big Data. 2022;8(2):397–419. https://doi.org/10.1109/TBDATA.2020.2972871 .

Singh N, Singh DP, Pant B. Big Data knowledge discovery as a service: recent trends and challenges. Wirel Pers Commun. 2022;123(2):1789–807. https://doi.org/10.1007/s11277-021-09213-5 .

Mohammadi E, Karami A. Exploring research trends in Big Data across disciplines: a text mining analysis. J Inf Sci. 2022;48(1):44–56. https://doi.org/10.1177/0165551520932855 .

Ambeth Kumar VD, Varadarajan V, Gupta MK, Rodrigues JJPC, Janu N. AI empowered Big Data analytics for industrial applications. J Univers Computer Sci. 2022;28(9):877–81. https://doi.org/10.3897/jucs.94155 .

Kumari S, Muthulakshmi P. Transformative effects of Big Data on advanced data analytics: open issues and critical challenges. J Computer Sci. 2022;18(6):463–79. https://doi.org/10.3844/jcssp.2022.463.479 .

Tang S, He B, Yu C, Li Y, Li K. A survey on spark ecosystem: Big Data processing infrastructure, machine learning, and applications. IEEE Trans Knowl Data Eng. 2022;34(1):71–91. https://doi.org/10.1109/TKDE.2020.2975652 .

Reyes-Veras PF, Renukappa S, Suresh S. Challenges faced by the adoption of Big Data in the Dominican Republic construction industry: an empirical study. J Inf Technol Constr. 2021;26:812–31. https://doi.org/10.36680/J.ITCON.2021.044 .

Bentotahewa V, Hewage C, Williams J. Solutions to Big Data privacy and security challenges associated with COVID-19 surveillance systems. Front Big Data. 2021. https://doi.org/10.3389/fdata.2021.645204 .

Escobar CA, McGovern ME, Morales-Menendez R. Quality 4.0: a review of Big Data challenges in manufacturing. J Intell Manuf. 2021;32(8):2319–34. https://doi.org/10.1007/s10845-021-01765-4 .

Mwitondi KS, Said RA. Dealing with randomness and concept drift in large datasets. Data. 2021. https://doi.org/10.3390/data6070077 .

Kusal S, Patil S, Kotecha K, Aluvalu R, Varadarajan V. Ai based emotion detection for textual Big Data: techniques and contribution. Big Data Cognit Comput. 2021. https://doi.org/10.3390/bdcc5030043 .

Lee E, Jang J. Research trend analysis for sustainable QR code use: focus on Big Data analysis. KSII Trans Internet Inf Syst. 2021;15(9):3221–42. https://doi.org/10.3837/tiis.2021.09.008 .

Rhahla M, Allegue S, Abdellatif T. Guidelines for GDPR compliance in Big Data systems. J Inf Secur Appl. 2021. https://doi.org/10.1016/j.jisa.2021.102896 .

Amović M, Govedarica M, Radulović A, Janković I. Big Data in smart city: management challenges. Appl Sci. 2021. https://doi.org/10.3390/app11104557 .

Hoozemans J, Peltenburg J, Nonnemacher F, Hadnagy A, Al-Ars Z, Hofstee HP. FPGA acceleration for Big Data analytics: challenges and opportunities. IEEE Circuits Syst Mag. 2021;21(2):30–47. https://doi.org/10.1109/MCAS.2021.3071608 .

Jalali SMJ, Park HW, Vanani IR, Pho K-H. Research trends on Big Data domain using text mining algorithms. Digit Scholarsh Hum. 2021;36(2):361–70. https://doi.org/10.1093/llc/fqaa012 .

Almutairi MM. Role of Big Data in education in KSA. Int J Inf Technol. 2021;13(1):367–73. https://doi.org/10.1007/s41870-020-00489-7 .

Ardagna D, Barbierato E, Gianniti E, Gribaudo M, Pinto TBM, Silva APC, Almeida JM. Predicting the performance of Big Data applications on the cloud. J Supercomput. 2021;77(2):1321–53. https://doi.org/10.1007/s11227-020-03307-w .

Mkrttchian V, Gamidullaeva L, Finogeev A, Chernyshenko S, Chernyshenko V, Amirov D, Potapova I. Big Data and internet of things (IoT) technologies’ influence on higher education: current state and future prospects. Int J Web-Based Learn Teach Technol. 2021;16(5):137–57. https://doi.org/10.4018/IJWLTT.20210901.oa8 .

Mourtzis D. Towards the 5th industrial revolution: a literature review and a framework for process optimization based on Big Data analytics and semantics. J Mach Eng. 2021;21(3):5–39. https://doi.org/10.36897/jme/141834 .

Dias MNR, Hassan S, Shahzad A. The impact of Big Data utilization on Malaysian government hospital healthcare performance. Int J eBus eGov Stud. 2021;13(1):50–77. https://doi.org/10.34111/ijebeg.202113103 .

Babar M, Alshehri MD, Tariq MU, Ullah F, Khan A, Uddin MI, Almasoud AS. IoT-enabled Big Data analytics architecture for multimedia data communications. Wirel Commun Mob Comput. 2021. https://doi.org/10.1155/2021/5283309 .

Bhat SA, Huang N-F. Big Data and AI revolution in precision agriculture: survey and challenges. IEEE Access. 2021;9:110209–22. https://doi.org/10.1109/ACCESS.2021.3102227 .

Zainab A, Ghrayeb A, Syed D, Abu-Rub H, Refaat SS, Bouhali O. Big Data management in smart grids: technologies and challenges. IEEE Access. 2021;9:73046–59. https://doi.org/10.1109/ACCESS.2021.3080433 .

Jabir B, Falih N. Big Data analytics opportunities and challenges for the smart enterprise. Int J Tech Phys Probl Eng. 2021;13(2):20–6.

Zineb EF, Najat R, Jaafar A. An intelligent approach for data analysis and decision making in Big Data: a case study on e-commerce industry. Int J Adv Computer Sci Appl. 2021;12(7):723–36. https://doi.org/10.14569/IJACSA.2021.0120783 .

Syed D, Zainab A, Ghrayeb A, Refaat SS, Abu-Rub H, Bouhali O. Smart grid Big Data analytics: survey of technologies, techniques, and applications. IEEE Access. 2021;9:59564–85. https://doi.org/10.1109/ACCESS.2020.3041178 .

Talebkhah M, Sali A, Marjani M, Gordan M, Hashim SJ, Rokhani FZ. IoT and Big Data applications in smart cities: recent advances, challenges, and critical issues. IEEE Access. 2021;9:55465–84. https://doi.org/10.1109/ACCESS.2021.3070905 .

Dubuc T, Stahl F, Roesch EB. Mapping the Big Data landscape: technologies, platforms and paradigms for real-time analytics of data streams. IEEE Access. 2021;9:15351–74. https://doi.org/10.1109/ACCESS.2020.3046132 .

Ang KL-M, Seng JKP. Big Data and machine learning with hyperspectral information in agriculture. IEEE Access. 2021;9:36699–718. https://doi.org/10.1109/ACCESS.2021.3051196 .

Zeadally S, Siddiqui F, Baig Z, Ibrahim A. Smart healthcare: challenges and potential solutions using internet of things (IoT) and Big Data analytics. PSU Res Rev. 2020;4(2):149–68. https://doi.org/10.1108/PRR-08-2019-0027 .

Thudumu S, Branch P, Jin J, Singh JJ. A comprehensive survey of anomaly detection techniques for high dimensional Big Data. J Big Data. 2020. https://doi.org/10.1186/s40537-020-00320-x .

Trang NH. Limitations of Big Data partitions technology. J Appl Data Sci. 2020;1(1):11–9. https://doi.org/10.47738/jads.v1i1.7 .

Caíno-Lores S, Lapin A, Carretero J, Kropf P. Applying Big Data paradigms to a large scale scientific workflow: lessons learned and future directions. Future Gener Computer Syst. 2020;110:440–52. https://doi.org/10.1016/j.future.2018.04.014 .

Awaysheh FM, Alazab M, Gupta M, Pena TF, Cabaleiro JC. Next-generation Big Data federation access control: a reference model. Future Gener Computer Syst. 2020;108:726–41. https://doi.org/10.1016/j.future.2020.02.052 .

Valencia-Parra A, Varela-Vaca AJ, Parody L, Gomez-Lopez MT. Unleashing constraint optimisation problem solving in Big Data environments. J Comput Sci. 2020. https://doi.org/10.1016/j.jocs.2020.101180 .

López-Martínez F, Núñez-Valdez ER, García-Díaz V, Bursac Z. A case study for a Big Data and machine learning platform to improve medical decision support in population health management. Algorithms. 2020. https://doi.org/10.3390/A13040102 .

Iqbal R, Doctor F, More B, Mahmud S, Yousuf U. Big Data analytics and computational intelligence for cyber-physical systems: recent trends and state of the art applications. Future Gener Computer Syst. 2020;105:766–78. https://doi.org/10.1016/j.future.2017.10.021 .

Carnevale L, Celesti A, Fazio M, Villari M. A Big Data analytics approach for the development of advanced cardiology applications. Information. 2020. https://doi.org/10.3390/info11020060 .

Shukla AK, Muhuri PK, Abraham A. A bibliometric analysis and cutting-edge overview on fuzzy techniques in Big Data. Eng Appl Artif Intell. 2020. https://doi.org/10.1016/j.engappai.2020.103625 .

Karim A, Siddiqa A, Safdar Z, Razzaq M, Gillani SA, Tahir H, Kiran S, Ahmed E, Imran M. Big Data management in participatory sensing: issues, trends and future directions. Future Gener Computer Syst. 2020;107:942–55. https://doi.org/10.1016/j.future.2017.10.007 .

Humayun M. Role of emerging IoT Big Data and cloud computing for real time application. Int J Adv Computer Sci Appl. 2020;11(4):494–506.

Rabanal F, Martínez C. Cryptography for Big Data environments: current status, challenges, and opportunities. Comput Math Methods. 2020. https://doi.org/10.1002/cmm4.1075 .

Ramesh T, Santhi V. Exploring Big Data analytics in health care. Int J Intell Netw. 2020;1:135–40. https://doi.org/10.1016/j.ijin.2020.11.003 .

Gautam A, Chatterjee I. Big Data and cloud computing: a critical review. Int J Oper Res Inf Syst. 2020;11(3):19–38. https://doi.org/10.4018/IJORIS.2020070102 .

Bajaber F, Sakr S, Batarfi O, Altalhi A, Barnawi A. Benchmarking Big Data systems: a survey. Computer Commun. 2020;149:241–51. https://doi.org/10.1016/j.comcom.2019.10.002 .

Maksimov P, Koiranen T. Application of novel Big Data processing techniques in process industries. Int J Computer Appl Technol. 2020;62(3):200–15. https://doi.org/10.1504/IJCAT.2020.106591 .

Dash S, Shakyawar SK, Sharma M, Kaushik S. Big Data in healthcare: management, analysis and future prospects. J Big Data. 2019. https://doi.org/10.1186/s40537-019-0217-0 .

Nagalakshmi N, Anand Babu GL, Reddy KS, Ashalatha T. Security challenges associated with Big Data in health care system. Int J Eng Adv Technol. 2019;9(1):4057–60. https://doi.org/10.35940/ijeat.A1296.109119 .

Dai H-N, Wong RC-W, Wang H, Zheng Z, Vasilakos AV. Big Data analytics for large-scale wireless networks: challenges and opportunities. ACM Comput Surv. 2019. https://doi.org/10.1145/3337065 .

Barika M, Garg S, Zomaya AY, Wang L, Moorsel AVAN, Ranjan R. Orchestrating Big Data analysis workflows in the cloud: research challenges, survey, and future directions. ACM Comput Surv. 2019. https://doi.org/10.1145/3332301 .

Hariri RH, Fredericks EM, Bowers KM. Uncertainty in Big Data analytics: survey, opportunities, and challenges. J Big Data. 2019. https://doi.org/10.1186/s40537-019-0206-3 .

Latif Z, Lei W, Latif S, Pathan ZH, Ullah R, Jianqiu Z. Big Data challenges: prioritizing by decision-making process using analytic network process technique. Multimed Tools Appl. 2019;78(19):27127–53. https://doi.org/10.1007/s11042-017-5161-4 .

Kumari A, Tanwar S, Tyagi S, Kumar N. Verification and validation techniques for streaming Big Data analytics in internet of things environment. IET Netw. 2019;8(3):155–63. https://doi.org/10.1049/iet-net.2018.5187 .

Singh SP, Nayyar A, Kumar R, Sharma A. Fog computing: from architecture to edge computing and Big Data processing. J Supercomput. 2019;75(4):2070–105. https://doi.org/10.1007/s11227-018-2701-2 .

Raufi B, Ismaili F, Ajdari J, Zenuni X. Web personalization issues in Big Data and semantic web: challenges and opportunities. Turk J Electr Eng Computer Sci. 2019;27(4):2379–94. https://doi.org/10.3906/elk-1812-25 .

Rahman NA, Nor NM. Healthcare using social media: Big Data analytics perspective. J Adv Res Dyn Control Syst. 2019;11(8 Special Issue):1169–79.

Mishra S, Pattnaik S, Mishra BB. Application of Big Data analysis in supply chain management: future challenges. J Adv Res Dyn Control Syst. 2019;11(8 Special Issue):2541–8.

Ivanovic M, Klasnja-Milicevic A. Big Data and collective intelligence. Int J Embed Syst. 2019;11(5):573–83. https://doi.org/10.1504/IJES.2019.102430 .

Qolomany B, Al-Fuqaha A, Gupta A, Benhaddou D, Alwajidi S, Qadir J, Fong AC. Leveraging machine learning and Big Data for smart buildings: a comprehensive survey. IEEE Access. 2019;7:90316–56. https://doi.org/10.1109/ACCESS.2019.2926642 .

Shah SA, Seker DZ, Hameed S, Draheim D. The rising role of Big Data analytics and IoT in disaster management: recent advances, taxonomy and prospects. IEEE Access. 2019;7:54595–614. https://doi.org/10.1109/ACCESS.2019.2913340 .

Lin W, Zhang Z, Peng S. Academic research trend analysis based on Big Data technology. Int J Comput Sci Eng. 2019;20(1):31–9. https://doi.org/10.1504/ijcse.2019.103247 .

Hong L, Luo M, Wang R, Lu P, Lu W, Lu L. Big Data in health care: applications and challenges. Data Inf Manag. 2018;2(3):175–97. https://doi.org/10.2478/dim-2018-0014 .

Pal D, Triyason T, Padungweang P. Big Data in smart-cities: current research and challenges. Indones J Electr Eng Inf. 2018;6(4):351–60. https://doi.org/10.11591/ijeei.v6i4.543 .

Li N, Mahalik NP. A Big Data and cloud computing specification, standards and architecture: agricultural and food informatics. Int J Inf Commun Technol. 2019;14(2):159–74. https://doi.org/10.1504/IJICT.2019.097687 .

Chiroma H, Abdullahi UA, Abdulhamid SM, Abdulsalam Alarood A, Gabralla LA, Rana N, Shuib L, Targio Hashem IA, Gbenga DE, Abubakar AI, Zeki AM, Herawan T. Progress on artificial neural networks for Big Data analytics: a survey. IEEE Access. 2019;7:70535–51. https://doi.org/10.1109/ACCESS.2018.2880694 .

Waheed H, Hassan S-U, Aljohani NR, Wasif M. A bibliometric perspective of learning analytics research landscape. Behav Inf Technol. 2018;37(10–11):941–57. https://doi.org/10.1080/0144929X.2018.1467967 .

Ray J, Johnny O, Trovati M, Sotiriadis S, Bessis N. The rise of Big Data science: a survey of techniques, methods and approaches in the field of natural language processing and network theory. Big Data Cognit Comput. 2018;2(3):1–18. https://doi.org/10.3390/bdcc2030022 .

Mantelero A. AI and Big Data: a blueprint for a human rights, social and ethical impact assessment. Computer Law Secur Rev. 2018;34(4):754–72. https://doi.org/10.1016/j.clsr.2018.05.017 .

Sultan K, Ali H, Zhang Z. Big Data perspective and challenges in next generation networks. Future Internet. 2018. https://doi.org/10.3390/fi10070056 .

Li Q, Chen Y, Wang J, Chen Y, Chen H. Web media and stock markets: a survey and future directions from a Big Data perspective. IEEE Trans Knowl Data Eng. 2018;30(2):381–99. https://doi.org/10.1109/TKDE.2017.2763144 .

Jabbar S, Malik KR, Ahmad M, Aldabbas O, Asif M, Khalid S, Han K, Ahmed SH. A methodology of real-time data fusion for localized Big Data analytics. IEEE Access. 2018;6:24510–20. https://doi.org/10.1109/ACCESS.2018.2820176 .

Darwish TSJ, Abu Bakar K. Fog based intelligent transportation Big Data analytics in the internet of vehicles environment: motivations, architecture, challenges, and critical issues. IEEE Access. 2018;6:15679–701. https://doi.org/10.1109/ACCESS.2018.2815989 .

Zheng S, Chen S, Yang L, Zhu J, Luo Z, Hu J, Yang X. Big Data processing architecture for radio signals empowered by deep learning: concept, experiment, applications and challenges. IEEE Access. 2018;6:55907–22. https://doi.org/10.1109/ACCESS.2018.2872769 .

Stefanowski J, Krawiec K, Wrembel R. Exploring complex and Big Data. Int J Appl Math Computer Sci. 2017;27(4):669–79. https://doi.org/10.1515/amcs-2017-0046 .

Harerimana G, Jang B, Kim JW, Park HK. Health Big Data analytics: a technology survey. IEEE Access. 2018;6:65661–78. https://doi.org/10.1109/ACCESS.2018.2878254 .

Ravi S, Jeyaprakash T. Combined ideas on the necessity of Big Data on internet of things and researchers point of view and its challenges, future directions. J Adv Res Dyn Control Syst. 2018;10(9 Special Issue):2140–4.

Neggers J, Allix O, Hild F, Roux S. Big Data in experimental mechanics and model order reduction: today’s challenges and tomorrow’s opportunities. Arch Comput Methods Eng. 2018;25(1):143–64. https://doi.org/10.1007/s11831-017-9234-3 .

Khan S, Liu X, Shakil KA, Alam M. A survey on scholarly data: from Big Data perspective. Inf Process Manag. 2017;53(4):923–44. https://doi.org/10.1016/j.ipm.2017.03.006 .

Costa C, Santos MY. Big Data: state-of-the-art concepts, techniques, technologies, modeling approaches and research challenges. IAENG Int J Computer Sci. 2017;44(3):285–301.

Lv Z, Song H, Basanta-Val P, Steed A, Jo M. Next-generation Big Data analytics: state of the art, challenges, and future research topics. IEEE Trans Ind Inf. 2017;13(4):1891–9. https://doi.org/10.1109/TII.2017.2650204 .

Memon MA, Soomro S, Jumani AK, Kartio MA. Big Data analytics and its applications. Ann Emerg Technol Comput. 2017;1(1):45–54. https://doi.org/10.33166/AETiC.2017.01.006 .

Mantelero A. Regulating Big Data: the guidelines of the Council of Europe in the context of the European data protection framework. Computer Law Secur Rev. 2017;33(5):584–602. https://doi.org/10.1016/j.clsr.2017.05.011 .

Zhou L, Pan S, Wang J, Vasilakos AV. Machine learning on Big Data: opportunities and challenges. Neurocomputing. 2017;237:350–61. https://doi.org/10.1016/j.neucom.2017.01.026 .

Yan J, Meng Y, Lu L, Li L. Industrial Big Data in an industry 4.0 environment: challenges, schemes, and applications for predictive maintenance. IEEE Access. 2017;5:23484–91. https://doi.org/10.1109/ACCESS.2017.2765544 .

Gonçalves ME. The EU data protection reform and the challenges of Big Data: remaining uncertainties and ways forward. Inf Commun Technol Law. 2017;26(2):90–115. https://doi.org/10.1080/13600834.2017.1295838 .

Peng S, Wang G, Xie D. Social influence analysis in social networking Big Data: opportunities and challenges. IEEE Netw. 2017;31(1):11–7. https://doi.org/10.1109/MNET.2016.1500104NM .

L’Heureux A, Grolinger K, Elyamany HF, Capretz MAM. Machine learning with Big Data: challenges and approaches. IEEE Access. 2017;5:7776–97. https://doi.org/10.1109/ACCESS.2017.2696365 .

Manikyam NRH, Mohan Kumar S. Methods and techniques to deal with Big Data analytics and challenges in cloud computing environment. Int J Civil Eng Technol. 2017;8(4):669–78.

El-Seoud SA, El-Sofany HF, Abdelfattah M, Mohamed R. Big Data and cloud computing: trends and challenges. Int J Interact Mob Technol. 2017;11(2):34–52. https://doi.org/10.3991/ijim.v11i2.6561 .

Zúñiga H, Diehl T. Citizenship, social media, and Big Data: current and future research in the social sciences. Soc Sci Computer Rev. 2017;35(1):3–9. https://doi.org/10.1177/0894439315619589 .

Wang H, Xu Z, Pedrycz W. An overview on the roles of fuzzy set techniques in Big Data processing: trends, challenges and opportunities. Knowl-Based Syst. 2017;118:15–30. https://doi.org/10.1016/j.knosys.2016.11.008 .

Choi T-M, Chan HK, Yue X. Recent development in Big Data analytics for business operations and risk management. IEEE Trans Cybern. 2017;47(1):81–92. https://doi.org/10.1109/TCYB.2015.2507599 .

Zhong RY, Newman ST, Huang GQ, Lan S. Big Data for supply chain management in the service and manufacturing sectors: challenges, opportunities, and future perspectives. Comput Ind Eng. 2016;101:572–91. https://doi.org/10.1016/j.cie.2016.07.013 .

Bajaber F, Elshawi R, Batarfi O, Altalhi A, Barnawi A, Sakr S. Big Data 2.0 processing systems: taxonomy and open challenges. J Grid Comput. 2016;14(3):379–405. https://doi.org/10.1007/s10723-016-9371-1 .

De Gennaro M, Paffumi E, Martini G. Big Data for supporting low-carbon road transport policies in europe: applications, challenges and opportunities. Big Data Res. 2016;6:11–25. https://doi.org/10.1016/j.bdr.2016.04.003 .

Wang H, Xu Z, Fujita H, Liu S. Towards felicitous decision making: an overview on challenges and trends of Big Data. Inf Sci. 2016;367–368:747–65. https://doi.org/10.1016/j.ins.2016.07.007 .

Rodríguez-Mazahua L, Rodríguez-Enríquez C-A, Sánchez-Cervantes JL, Cervantes J, García-Alcaraz JL, Alor-Hernández G. A general perspective of Big Data: applications, tools, challenges and trends. J Supercomput. 2016;72(8):3073–113. https://doi.org/10.1007/s11227-015-1501-1 .

Bello-Orgaz G, Jung JJ, Camacho D. Social Big Data: recent achievements and new challenges. Inf Fus. 2016;28:45–59. https://doi.org/10.1016/j.inffus.2015.08.005 .

Zheng X, Chen W, Wang P, Shen D, Chen S, Wang X, Zhang Q, Yang L. Big Data for social transportation. IEEE Trans Intell Transp Syst. 2016;17(3):620–30. https://doi.org/10.1109/TITS.2015.2480157 .

Sahay S. Big Data and public health: challenges and opportunities for low and middle income countries. Commun Assoc Inf Syst. 2016;39(1):419–38. https://doi.org/10.17705/1cais.03920 .

Sharma N, Namratha B. Towards addressing the challenges of data intensive computing in Big Data analytics. Int J Control Theor Appl. 2016;9(23):57–62.

Yu S. Big privacy: challenges and opportunities of privacy study in the age of Big Data. IEEE Access. 2016;4:2751–63. https://doi.org/10.1109/ACCESS.2016.2577036 .

Chen C-M. Use cases and challenges in telecom Big Data analytics. APSIPA Trans Signal Inf Process. 2016. https://doi.org/10.1017/ATSIP.2016.20 .

Anagnostopoulos I, Zeadally S, Exposito E. Handling Big Data: research challenges and future directions. J Supercomput. 2016;72(4):1494–516. https://doi.org/10.1007/s11227-016-1677-z .

Jothi B, Pushpalatha M, Krishnaveni S. Significance and challenges in Big Data: a survey. Int J Control Theor Appl. 2016;9(34):235–43.

Huang Y, Schuehle J, Porter AL, Youtie J. A systematic method to create search strategies for emerging technologies based on the Web of Science: illustrated for ‘Big Data’. Scientometrics. 2015;105(3):2005–22. https://doi.org/10.1007/s11192-015-1638-y .

Xu Z, Shi Y. Exploring Big Data analysis: fundamental scientific problems. Ann Data Sci. 2015;2(4):363–72. https://doi.org/10.1007/s40745-015-0063-7 .

Olshannikova E, Ometov A, Koucheryavy Y, Olsson T. Visualizing Big Data with augmented and virtual reality: challenges and research agenda. J Big Data. 2015. https://doi.org/10.1186/s40537-015-0031-2 .

Assunção MD, Calheiros RN, Bianchi S, Netto MAS, Buyya R. Big Data computing and clouds: trends and future directions. J Parallel Distrib Comput. 2015;79–80:3–15. https://doi.org/10.1016/j.jpdc.2014.08.003 .

Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in Big Data analytics. J Big Data. 2015. https://doi.org/10.1186/s40537-014-0007-7 .

Nativi S, Mazzetti P, Santoro M, Papeschi F, Craglia M, Ochiai O. Big Data challenges in building the global earth observation system of systems. Environ Model Softw. 2015;68:1–26. https://doi.org/10.1016/j.envsoft.2015.01.017 .

Tian X, Han R, Wang L, Lu G, Zhan J. Latency critical Big Data computing in finance. J Finance Data Sci. 2015;1(1):33–41. https://doi.org/10.1016/j.jfds.2015.07.002 .

Perera C, Ranjan R, Wang L, Khan SU, Zomaya AY. Big Data privacy in the internet of things era. IT Prof. 2015;17(3):32–9. https://doi.org/10.1109/MITP.2015.34 .

Jin X, Wah BW, Cheng X, Wang Y. Significance and challenges of Big Data research. Big Data Res. 2015;2(2):59–64. https://doi.org/10.1016/j.bdr.2015.01.006 .

Mao R, Xu H, Wu W, Li J, Li Y, Lu M. Overcoming the challenge of variety: Big Data abstraction, the next evolution of data management for AAL communication systems. IEEE Commun Mag. 2015;53(1):42–7. https://doi.org/10.1109/MCOM.2015.7010514 .

Philip Chen CL, Zhang C-Y. Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. Inf Sci. 2014;275:314–47. https://doi.org/10.1016/j.ins.2014.01.015 .

Ma Y, Wu H, Wang L, Huang B, Ranjan R, Zomaya A, Jie W. Remote sensing Big Data computing: challenges and opportunities. Future Gener Computer Syst. 2015;51:47–60. https://doi.org/10.1016/j.future.2014.10.029 .

Jeong SR, Ghani I. Semantic computing for Big Data: approaches, tools, and emerging directions (2011–2014). KSII Trans Internet Inf Syst. 2014;8(6):2022–42. https://doi.org/10.3837/tiis.2014.06.012 .

Sun D, Liu C, Ren D. Prospects, challenges and latest developments in designing a scalable Big Data stream computing system. Int J Wirel Mob Comput. 2015;9(2):155–60. https://doi.org/10.1504/IJWMC.2015.072567 .

Dobre C, Xhafa F. Intelligent services for Big Data science. Future Gener Computer Syst. 2014;37:267–81. https://doi.org/10.1016/j.future.2013.07.014 .

Qin HF, Li ZH. Research on the method of Big Data analysis. Inf Technol J. 2013;12(10):1974–80. https://doi.org/10.3923/itj.2013.1974.1980 .

Ji C, Li Y, Qiu W, Jin Y, Xu Y, Awada U, Li K, Qu W. Big Data processing: big challenges. J Interconnect Netw. 2012. https://doi.org/10.1142/S0219265912500090 .

Kambatla K, Kollias G, Kumar V, Grama A. Trends in Big Data analytics. J Parallel Distrib Comput. 2014;74(7):2561–73. https://doi.org/10.1016/j.jpdc.2014.01.003 .

Dong XL, Srivastava D. Big Data integration. Proc VLDB Endow. 2013;6(11):1188–9. https://doi.org/10.14778/2536222.2536253 .

Yin H, Jiang Y, Lin C, Luo Y, Liu Y. Big Data: transforming the design philosophy of future internet. IEEE Netw. 2014;28(4):14–9. https://doi.org/10.1109/MNET.2014.6863126 .

Nti IK, Quarcoo JA, Aning J, Fosu GK. A mini-review of machine learning in Big Data analytics: applications, challenges, and prospects. Big Data Min Anal. 2022;5(2):81–97. https://doi.org/10.26599/BDMA.2021.9020028 .

Yu Y, Li M, Liu L, Li Y, Wang J. Clinical Big Data and deep learning: applications, challenges, and future outlooks. Big Data Min Analy. 2019;2(4):288–305. https://doi.org/10.26599/BDMA.2019.9020007 .

Amalina F, Targio Hashem IA, Azizul ZH, Fong AT, Firdaus A, Imran M, Anuar NB. Blending Big Data analytics: review on challenges and a recent study. IEEE Access. 2020;8:3629–45. https://doi.org/10.1109/ACCESS.2019.2923270 .

Chen X-W, Lin X. Big Data deep learning: challenges and perspectives. IEEE Access. 2014;2:514–25. https://doi.org/10.1109/ACCESS.2014.2325029 .

Alam A, Ullah I, Lee Y-K. Video Big Data analytics in the cloud: a reference architecture, survey, opportunities, and open research issues. IEEE Access. 2020;8:152377–422. https://doi.org/10.1109/ACCESS.2020.3017135 .

Pham Q-V, Nguyen DC, Huynh-The T, Hwang W-J, Pathirana PN. Artificial intelligence (AI) and Big Data for coronavirus (COVID-19) pandemic: a survey on the state-of-the-arts. IEEE Access. 2020;8:130820–39. https://doi.org/10.1109/ACCESS.2020.3009328 .

Aydin AA. A comparative perspective on technologies of Big Data value chain. IEEE Access. 2023;11:112133–46. https://doi.org/10.1109/ACCESS.2023.3323160 .

Kalantari A, Kamsin A, Kamaruddin HS, Ale Ebrahim N, Gani A, Ebrahimi A, Shamshirband S. A bibliometric approach to tracking Big Data research trends. J Big Data. 2017. https://doi.org/10.1186/s40537-017-0088-1 .

Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV. Big Data analytics: a survey. J Big Data. 2015. https://doi.org/10.1186/s40537-015-0030-3 .

Raghupathi W, Raghupathi V. Big Data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2(1):3. https://doi.org/10.1186/2047-2501-2-3 .

Ram Mohan Rao P, Murali Krishna S, Siva Kumar AP. Privacy preservation techniques in Big Data analytics: a survey. J Big Data. 2018. https://doi.org/10.1186/s40537-018-0141-8 .

Ali A, Qadir J, Rasool RU, Sathiaseelan A, Zwitter A, Crowcroft J. Big Data for development: applications and techniques. Big Data Anal. 2016. https://doi.org/10.1186/s41044-016-0002-4 .

Hasan MM, Popp J, Oláh J. Current landscape and influence of Big Data on finance. J Big Data. 2020. https://doi.org/10.1186/s40537-020-00291-z .

Seyedan M, Mafakheri F. Predictive Big Data analytics for supply chain demand forecasting: methods, applications, and research opportunities. J Big Data. 2020. https://doi.org/10.1186/s40537-020-00329-2 .

Chang V, Muñoz VM, Ramachandran M. Emerging applications of internet of things, Big Data, security, and complexity: special issue on collaboration opportunity for IoTBDS and COMPLEXIS. Computing. 2020;102(6):1301–4. https://doi.org/10.1007/s00607-020-00811-y .

Biswas S, Khare N, Agrawal P, Jain P. Machine learning concepts for correlated Big Data privacy. J Big Data. 2021. https://doi.org/10.1186/s40537-021-00530-x .

Belcastro L, Cantini R, Marozzo F, Orsino A, Talia D, Trunfio P. Programming Big Data analysis: principles and solutions. J Big Data. 2022. https://doi.org/10.1186/s40537-021-00555-2 .

Abdalla HB. A brief survey on Big Data: technologies, terminologies and data-intensive applications. J Big Data. 2022. https://doi.org/10.1186/s40537-022-00659-3 .

Download references

Acknowledgments

Not applicable.

Author information

Authors and Affiliations

Department of Theoretical and Applied Sciences, Università degli Studi dell’Insubria, Via Mazzini 5, 21100, Varese, Italy

Davide Tosi & Redon Kokaj

Department of Computer Science and Engineering, University of Bologna, Mura Anteo Zamboni 7, 40126, Bologna, Italy

Marco Roccetti


Contributions

D.T. designed the SLR and wrote the main manuscript text. R.K. conducted the SLR. M.R. contributed to the Introduction, Challenges, and Conclusions. All authors reviewed the manuscript.

Corresponding author

Correspondence to Davide Tosi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Tosi, D., Kokaj, R. & Roccetti, M. 15 years of Big Data: a systematic literature review. J Big Data 11, 73 (2024). https://doi.org/10.1186/s40537-024-00914-9


Received: 05 February 2024

Accepted: 07 April 2024

Published: 14 May 2024

DOI: https://doi.org/10.1186/s40537-024-00914-9


Keywords

  • Systematic literature review
  • Data analysis
  • Artificial intelligence



Creating a Good Research Question

  • Advice & Growth
  • Process in Practice

Successful translation of research begins with a strong question. How do you get started? How do good research questions evolve? And where do you find inspiration to generate good questions in the first place?  It’s helpful to understand existing frameworks, guidelines, and standards, as well as hear from researchers who utilize these strategies in their own work.

In the fall and winter of 2020, Naomi Fisher, MD, conducted 10 interviews with clinical and translational researchers at Harvard University and affiliated academic healthcare centers, with the purpose of capturing their experiences developing good research questions. The researchers featured in this project represent various specialties, drawn from every stage of their careers. Below you will find clips from their interviews and additional resources that highlight how to get started, as well as helpful frameworks and factors to consider. Additionally, visit the Advice & Growth section to hear candid advice and explore the Process in Practice section to hear how researchers have applied these recommendations to their published research.

  • Naomi Fisher, MD, is associate professor of medicine at Harvard Medical School (HMS) and clinical staff at Brigham and Women’s Hospital (BWH). Fisher is founder and director of Hypertension Services and the Hypertension Specialty Clinic at BWH, where she is a renowned endocrinologist. She serves as a faculty director for communication-related Boundary-Crossing Skills for Research Careers webinar sessions and the Writing and Communication Center.
  • Christopher Gibbons, MD, is associate professor of neurology at HMS and clinical staff at Beth Israel Deaconess Medical Center (BIDMC) and Joslin Diabetes Center. Gibbons’ research focus is on peripheral and autonomic neuropathies.
  • Clare Tempany-Afdhal, MD, is professor of radiology at HMS and the Ferenc Jolesz Chair of Research, Radiology at BWH. Her major areas of research are MR imaging of the pelvis and image-guided therapy.
  • David Sykes, MD, PhD, is assistant professor of medicine at Massachusetts General Hospital (MGH), where he is also principal investigator of the Sykes Lab. His special interest area is rare hematologic conditions.
  • Elliot Israel, MD, is professor of medicine at HMS, director of the Respiratory Therapy Department, director of clinical research in the Pulmonary and Critical Care Medical Division, and associate physician at BWH. Israel’s research interests include therapeutic interventions to alter asthmatic airway hyperactivity and the role of arachidonic acid metabolites in airway narrowing.
  • Jonathan Williams, MD, MMSc, is assistant professor of medicine at HMS and associate physician at BWH. He focuses on endocrinology, specifically unravelling the intricate relationship between genetics and environment with respect to susceptibility to cardiometabolic disease.
  • Junichi Tokuda, PhD, is associate professor of radiology at HMS and a research scientist in the Department of Radiology, BWH. Tokuda is particularly interested in technologies to support image-guided “closed-loop” interventions. He also serves as a principal investigator leading several projects funded by the National Institutes of Health and industry.
  • Osama Rahma, MD, is assistant professor of medicine at HMS and a clinical staff member in medical oncology at Dana-Farber Cancer Institute (DFCI). Rahma is currently a principal investigator at the Center for Immuno-Oncology and Gastroenterology Cancer Center at DFCI. His research focus is on the development of combinational immune therapeutics.
  • Sharmila Dorbala, MD, MPH, is professor of radiology at HMS and clinical staff at BWH in cardiovascular medicine and radiology. She is also the president of the American Society of Nuclear Medicine. Dorbala’s specialty is using nuclear medicine for cardiovascular discoveries.
  • Subha Ramani, PhD, MBBS, MMed, is associate professor of medicine at HMS, as well as associate physician in the Division of General Internal Medicine and Primary Care at BWH. Ramani’s scholarly interests focus on innovative approaches to teaching, learning and assessment of clinical trainees, faculty development in teaching, and qualitative research methods in medical education.
  • Ursula Kaiser, MD, is professor at HMS and chief of the Division of Endocrinology, Diabetes and Hypertension, and senior physician at BWH. Kaiser’s research focuses on understanding the molecular mechanisms by which pulsatile gonadotropin-releasing hormone regulates the expression of luteinizing hormone and follicle-stimulating hormone genes.

Insights on Creating a Good Research Question

Video clips: Junichi Tokuda, PhD; Ursula Kaiser, MD.

Start Successfully: Build the Foundation of a Good Research Question

Video clip: Jonathan Williams, MD, MMSc.

Start Successfully Resources

Ideation in Device Development: Finding Clinical Need (Josh Tolkoff, MS). A lecture explaining the critical importance of identifying a compelling clinical need before embarking on a research project. Play the Ideation in Device Development video.

Radical Innovation (Jeff Karp, PhD). This ThinkResearch podcast episode focuses on one researcher’s approach of using radical simplicity to break down big problems and questions. Play Radical Innovation.

Using Healthcare Data: How can Researchers Come up with Interesting Questions? (Anupam Jena, MD, PhD). Another ThinkResearch podcast episode addresses how to discover good research questions by using a backward design approach, which involves analyzing big data and allowing the research question to unfold from the findings. Play Using Healthcare Data.

Important Factors: Consider Feasibility and Novelty

Video clip: Sharmila Dorbala, MD, MPH.

Refining Your Research Question

Video clips: Clare Tempany-Afdhal, MD; Elliot Israel, MD.

Frameworks and Structure: Evaluate Research Questions Using Tools and Techniques

Frameworks and Structure Resources

Designing Clinical Research (Hulley et al.). A comprehensive and practical guide to clinical research, including the FINER framework for evaluating research questions. Learn more about the book.

Translational Medicine Library Guide (Queens University Library). An introduction to popular frameworks for research questions, including FINER and PICO. Review the translational medicine guide.

Asking a Good T3/T4 Question (Niteesh K. Choudhry, MD, PhD). This video explains the PICO framework in practice as participants in a workshop propose research questions that compare interventions. Play the Asking a Good T3/T4 Question video.

Introduction to Designing & Conducting Mixed Methods Research. An online course that provides a deeper dive into mixed methods research questions and methodologies. Learn more about the course.
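As a small illustration of how a framework such as PICO decomposes a question into checkable parts, the sketch below encodes a hypothetical PICO-style question as a simple data structure. The field names follow the usual PICO expansion (Population, Intervention, Comparison, Outcome); the example content is invented and is not part of the resources listed above.

```python
# A hypothetical PICO-framed research question expressed as a data structure.
# The fields follow the usual PICO expansion; the example content is invented.
from dataclasses import dataclass

@dataclass
class PicoQuestion:
    population: str    # P: who is being studied
    intervention: str  # I: what is being done
    comparison: str    # C: what the intervention is compared against
    outcome: str       # O: what is being measured

    def as_sentence(self) -> str:
        return (f"In {self.population}, does {self.intervention}, "
                f"compared with {self.comparison}, change {self.outcome}?")

question = PicoQuestion(
    population="adults with resistant hypertension",
    intervention="home blood-pressure telemonitoring",
    comparison="usual clinic follow-up",
    outcome="systolic blood pressure at 6 months",
)
print(question.as_sentence())
```

Writing the components out separately makes it easier to check each one against criteria such as FINER (feasible, interesting, novel, ethical, relevant) before committing to a study design.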

Network and Support: Find the Collaborators and Stakeholders to Help Evaluate Research Questions

Video clip: Christopher Gibbons, MD.

Network & Support Resource

Bench-to-bedside, Bedside-to-bench (Christopher Gibbons, MD). In this lecture, Gibbons shares his experience of bringing research from bench to bedside, and from bedside to bench. His talk highlights the formation and evolution of research questions based on clinical need. Play Bench-to-bedside.


  • Open access
  • Published: 04 August 2020

Moving back to the future of big data-driven research: reflecting on the social in genomics

  • Melanie Goisauf (ORCID: orcid.org/0000-0002-3909-8071),
  • Kaya Akyüz (ORCID: orcid.org/0000-0002-2444-2095) &
  • Gillian M. Martin (ORCID: orcid.org/0000-0002-5281-8117)

Humanities and Social Sciences Communications, volume 7, Article number: 55 (2020)


  • Science, technology and society

With the advance of genomics, specific individual conditions have received increased attention in the generation of scientific knowledge. This spans the extremes of the aim of curing genetic diseases and identifying the biological basis of social behaviour. In this development, the ways knowledge is produced have gained significant relevance, as the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory. This article argues that an in-depth discussion and critical reflection on the social configurations that are inscribed in, and reproduced by, genomic data-intensive research is urgently needed. This is illustrated by debating a recent case: a large-scale genome-wide association study (GWAS) on sexual orientation that suggested a partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). This case is analysed from three angles: (1) the demonstration of how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) the exploration of the ways that (big) data-driven research is constituted by increasingly moving away from theory and from the methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a); and (3) the demonstration of how the assumption of being ‘free from theory’ in this case does not mean free of choices made, which are themselves restricted by the data that are available. In questioning how key sociological categories are incorporated in a wider scientific debate on genetic conditions and knowledge production, the article shows how underlying classifications and categorizations, which are inherently social in their production, can have wide-ranging implications. The conclusion cautions against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.


Introduction

With the advance of genomic research, specific individual conditions received increased attention in scientific knowledge generation. While understanding the genetic foundations of diseases has become an important driver for the advancement of personalized medicine, the focus of interest has also expanded from disease to social behaviour. These developments are embedded in a wider discourse in science and society about the opportunities and limits of genomic research and intervention. With the emergence of the genome as a key concept for ‘life itself’, understandings of health and disease, responsibility and risk, and the relation between present conditions and future health outcomes have shifted, impacting also the ways in which identities are conceptualized under new genetic conditions (Novas and Rose 2000 ). At the same time, the growing literature of postgenomics points to evolving understandings of what ‘gene’ and ‘environment’ are (Landecker and Panofsky 2013 ; Fox Keller 2014 ; Meloni 2016 ). The postgenomic genome is no longer understood as merely directional and static, but rather as a complex and dynamic system that responds to its environment (Fox Keller 2015 ), where the social as part of the environment becomes a signal for activation or silencing of genes (Landecker 2016 ). At the same time, genetic engineering, prominently known as the gene-editing technology CRISPR/Cas9, has received considerable attention, but also caused concerns regarding its ethical, legal and societal implications (ELSI) and governance (Howard et al. 2018 ; Jasanoff and Hurlbut 2018 ). Taking these developments together, the big question of nature vs. nurture has taken on a new significance.

Studies which aim to reveal how biology and culture are being put in relation to each other appear frequently and pursue a genomic re-thinking of social outcomes and phenomena, such as educational attainment (Lee et al. 2018 ) or social stratification (Abdellaoui et al. 2019 ). Yet, we also witness very controversial applications of biotechnology, such as the first known case of human germline editing by He Jiankui in China, which has impacted the scientific community both as an impetus of wide protests and insecurity about the future of gene-editing and its use, but also instigated calls towards public consensus to (re-)set boundaries to what is editable (Morrison and de Saille 2019 ).

Against this background, we debate in this article a particular case that appeared within the same timeframe as these developments: a large-scale genome-wide association study (GWAS) on sexual orientation (Footnote 1), which suggested a partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). Some scientists have claimed for years that sexual orientation is partly heritable and have tried to identify a genetic basis for it (Hamer et al. 1993); however, this was the first time that genetic variants were identified as statistically significant and replicated in an independent sample. We consider this GWAS not only by questioning the ways genes are associated with “the social” within this research, but also by exploring how the complexity of the social is reduced through specific data practices in research.

The sexual orientation study also constitutes an interesting case for reflecting on how knowledge is produced at a time when the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory (Meloni 2014). Large amounts of genomic data are needed to identify genetic variations and to find correlations with different biological and social factors. The rise of the genome corresponds to the rise of big data, as the collection and sharing of genomic data gains power with the development of big data analytics (Parry and Greenhough 2017). A growing number of correlations, e.g. in the genomics of educational attainment (Lee et al. 2018; Okbay et al. 2016), are being found that link the genome to the social, increasingly blurring the established biological/social divide. These could open up new ways of understanding life and underpin the importance of culture, while, paradoxically, they may also carry the risk of a new genetic determinism and essentialism. The changing understanding of the now molecularised and datafied body also illustrates the changing significance of empirical research and sociology (Savage and Burrows 2007) in the era of postgenomics and ‘datafication’ (Ruckenstein and Schüll 2017). These developments are situated within methodological debates in which the social sciences often appear through the perspective of ELSI.

As the field of genomics is progressing rapidly and intervention in the human genome is no longer science fiction, we argue that it is important to discuss and reflect now on the social configurations that are inscribed in, and reproduced by, genomic data-driven research. These may co-produce the conception of certain potentially editable conditions, i.e. create new, and reproduce existing, classifications that are largely shaped by societal understandings of difference and order. Such definitions could have real consequences, as Thomas and Thomas (1929) remind us, for individuals and societies, and mark what has been described as an epistemic shift in biomedicine from the clinical gaze to the ‘molecular gaze’, where the processes of “medicalisation and biomedicalisation both legitimate and compel interventions that may produce transformations in individual, familial and other collective identities” (Clarke et al. 2013, p. 23). While Science and Technology Studies (STS) has demonstrated how science and society are co-produced in research (Jasanoff 2004), we want to use the momentum of the current discourse to critically reflect on these developments from three angles: (1) we demonstrate how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) we explore the ways that (big) data-driven research is constituted by increasingly moving away from theory and from the methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a); and (3) using the GWAS case in focus, we show how the assumption of being ‘free from theory’ (Kitchin 2014a) in this case does not mean free of choices made, choices which are themselves restricted by the data that are available. We highlight Griffiths’ (2016) contention that the material nature of genes, their impacts on the biological makeup of individuals, and their socially and culturally situated behaviour are not deterministic, and need to be understood within the dynamic, culturally and temporally situated context within which knowledge claims are made. We conclude by making the important point that ignoring the social may lead to a distorted, datafied, genomised body which ignores the key fact that “genes are not stable but essentially malleable” (Prainsack 2015) and that this ‘malleability’ is rooted in the complex interplay between biological and social environments.

From this perspective, the body is understood through the lens of embodiment, considering that humans ‘live’ their genome within their own lifeworld contexts (Rehmann-Sutter and Mahr 2016). We also consider this paper an intervention into the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

In the following reflections, we proceed step by step: First, we introduce the case of the GWAS on same-sex sexual behaviour, as well as its limits, context and impact. Second, we recall key sociological theory on categorizations and their implications. Third, we discuss the emergence of a digital-datafication of scientific knowledge production. Finally, we conclude by cautioning against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

Studying sexual orientation: The case of same-sex sexual behaviour

Currently, a number of studies at the intersection of genetic and social conditions appear on the horizon. Just as in the examples we have already mentioned, such as those on educational attainment (Lee et al. 2018) or social stratification (Abdellaoui et al. 2019), it is important to note that the only limit to such studies is the availability of the data itself. In other words, once the data is available, there is always the potential that it will eventually be used. That said, an analysis of the entirety of genomic research on social outcomes and behaviour is beyond the scope of this article. Therefore, we exemplify our argument with reference to the research on the genetics of same-sex sexual behaviour.

Based on a sample of half a million individuals of European ancestry, the first large-scale GWAS of its kind claims that five genetic variants contribute to the assessed “same-sex sexual behaviour” (Ganna et al. 2019b). Among these variants, two are relevant only to male–male sexual behaviour, one only to female–female sexual behaviour, and the remaining two to both. The data that led to this analysis were sourced from biobanks/cohorts with different methods of data collection. The authors conclude that these genetic variations are not predictive of sexual orientation: not only because genetics is supposedly only part of the picture, but also because the variations account for only a small part (<1% of the variance in same-sex sexual behaviour, p. 4) of the approximated genetic basis (8–25% of the variance in same-sex sexual behaviour) that may be identified with large sample sizes (p. 1). The study is an example of how the ‘gay gene’ discourse that has been around for years gets transformed with the available data accumulating in biobanks and the consequent genomic analysis, offering only one facet of a complex social phenomenon: same-sex sexual behaviour.
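To make the statistical machinery behind such claims concrete, the sketch below illustrates the core step of a typical GWAS: testing each variant separately for association with a binary trait and keeping only variants that pass the conventional genome-wide significance threshold (p < 5 × 10⁻⁸). This is a minimal illustration on simulated data, not the pipeline used by Ganna et al. (2019b); the sample size, variant count, effect sizes and the use of statsmodels are our own assumptions.

```python
# Minimal, illustrative GWAS-style association scan on simulated data.
# This is NOT the pipeline of Ganna et al. (2019b); it only sketches the
# per-variant logistic regression test that underlies a typical GWAS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_individuals, n_variants = 5_000, 200   # real GWASes use hundreds of thousands of people

# Genotypes coded as 0/1/2 copies of the minor allele (one column per variant).
allele_freqs = rng.uniform(0.05, 0.5, size=n_variants)
genotypes = rng.binomial(2, allele_freqs, size=(n_individuals, n_variants))

# A binary phenotype (e.g. a yes/no survey item), simulated so that a handful
# of variants carry a small true effect, as in the study discussed above.
true_effects = np.zeros(n_variants)
true_effects[:5] = 0.15
logits = -1.0 + genotypes @ true_effects
phenotype = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

GENOME_WIDE_SIGNIFICANCE = 5e-8  # conventional GWAS threshold

hits = []
for j in range(n_variants):
    exog = sm.add_constant(genotypes[:, [j]].astype(float))  # intercept + one variant
    result = sm.Logit(phenotype, exog).fit(disp=0)
    if result.pvalues[1] < GENOME_WIDE_SIGNIFICANCE:
        hits.append((j, result.params[1], result.pvalues[1]))

print(f"{len(hits)} of {n_variants} simulated variants reached genome-wide significance")
```

With per-allele effects this small and only a few thousand simulated individuals, most runs report no significant variant at all, which is one way to see why such studies lean on biobank-scale samples and why the detected variants explain so little of the variance.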

The way the GWAS was conducted was not novel in terms of data collection. Genome-wide studies of similar scale, e.g. on insomnia (Jansen et al. 2019) or blood pressure (Evangelou et al. 2018), often rely on data already collected in biobanks rather than trying to collect hundreds of thousands of individuals’ DNA from scratch. Furthermore, in line with wider developments, the study was preregistered (Footnote 2) with an analysis plan for the data to be used by the researchers. Unlike other GWASes, however, the researchers partnered with an LGBTQIA+ advocacy group (GLAAD) and a science communication charity (Sense About Science), where individuals beyond the research team interpreted the findings and discussed how to convey the results (Footnote 3). Following these engagements, the researchers produced a website (Footnote 4) with potential frequently asked questions as well as a video about the study, highlighting what it does and does not claim.

Despite efforts to keep the study from drifting into genetically deterministic and discriminatory interpretations, it has been criticized by many (Footnote 5). Indeed, the controversial “How gay are you?” app (Footnote 6) on the GenePlaza website utilized the findings of the study, which in turn raised alarm bells and, ultimately, was taken down after much debate. The application, however, showed how rapidly such findings can translate into individualized systems of categorization and, consequently, feed into and be fed by the public imaginary. One of the study authors demands continuation of the research, noting that “[s]cientists have a responsibility to describe the human condition in a more nuanced and deeper way” (Maxmen 2019, p. 610). Critics, however, note that the context of the data collected from individuals may influence the findings; for instance, past developments (i.e. the decriminalization of homosexuality, the HIV/AIDS epidemic, and the legalization of same-sex marriage) are relevant to understanding the UK Biobank’s donor profile, and if the GWAS were redone according to the birth year of the individuals, different findings could have come out of the study (Richardson et al. 2019, p. 1461).

It has been pointed out that such research should be assessed by a competent ethical review board according to its potential risks and benefits (Maxmen 2019 , p. 610), in addition to the review and approval by the UK Biobank Access Sub-Committee (Ganna et al. 2019a , p. 1461). Another ethical issue of concern raised by critics is that the informed consent form of UK Biobank does not specify that it could be used for such research since “homosexuality has long been removed from disease classifications” and that the broad consent forms allow only “health-related research” (Holm and Ploug 2019 , p. 1460). We do not want to make a statement here for or against broad consent. However, we argue that discussions about informed consent showcase the complexities related to secondary use of data in research. Similarly, the ‘gay gene’ app developed in the wake of the sexual orientation study, revealed the difficulty of controlling how the produced knowledge may be used, including in ways that are openly denounced by the study authors.

To the best of our knowledge, there have not been similar genome-wide studies published on sexual orientation and, while we acknowledge the limitations associated with focusing on a single case in our discussion, we see this case as relevant to opening up the following question: How are certain social categorizations incorporated into the knowledge production practices? We want to answer this by first revisiting some of the fundamental sociological perspectives into categorizations and the social implications these may have.

Categorizing sex, gender, bodies, disease and knowledge

Sociological perspectives on categorizations

Categorizations and classifications take a central role in the sociology of knowledge, social stratifications and data-based knowledge production. Categories like gender, race, sexuality and class (and their intersection, see Crenshaw 1989 ) have become key classifications for the study of societies and in understanding the reproduction of social order. One of the most influential theories about the intertwining of categories like gender and class with power relations was formulated by Bourdieu ( 2010 , 2001 ). He claimed that belonging to a certain class or gender is an embodied practice that ensures the reproduction of social structure which is shaped by power relations. The position of subjects within this structure reflects the acquired cultural capital, such as education. Incorporated dispositions, schemes of perception, appreciation, classification that make up the individual’s habitus are shaped by social structure, which actors reproduce in practices. One key mechanism of social categorization is gender classification. The gender order appears to be in the ‘nature of things’ of biologically different bodies, whereas it is in fact an incorporated social construction that reflects and constitutes power relations. Bourdieu’s theory links the function of structuring classifications with embodied knowledge and demonstrates that categories of understanding are pervaded by societal power relations.

In a similar vein Foucault ( 2003 , 2005 ) describes the intertwining of ordering classifications, bodies and power in his study of the clinic. Understandings of and knowledge about the body follow a specific way of looking at it—the ‘medical gaze’ of separating the patient’s body from identity and distinguishing healthy from the diseased, which, too, is a process pervaded by power differentials. Such classifications evolved historically. Foucault reminds us that all periods in history are characterized by specific epistemological assumptions that shape discourses and manifest in modalities of order that made certain kinds of knowledge, for instance scientific knowledge, possible. The unnoticed “order of things”, as well as the social order, is implemented in classifications. Such categorizations also evolved historically for the discourse about sexuality, or, in particular as he pointed out writing in the late 1970s, distinguishing sexuality of married couples from other forms, such as homosexuality (Foucault 1998 ).

Bourdieu and Foucault offer two influential approaches within the wider field of the sociology of knowledge that provide a theoretical framework for how categorizations and classifications structure the world in conjunction with social practice and power relations. Their work demonstrates that such structuration is never free from theory, i.e. categorizations do not exist prediscursively, but are embedded within a certain temporal and spatial context that constitutes ‘situated knowledge’ (Haraway 1988). Consequently, classifications create (social) order that cannot be understood as ‘naturally’ given but as a result of relational social dynamics embedded in power differentials.

Feminist theory in the 1970s emphasized the inherently social dimension of male and female embodiment, which distinguished between biological sex and socially rooted gender. This distinction built the basis for a variety of approaches that examined gender as a social phenomenon, as something that is (re-)constructed in social interaction, impacted by collectively held beliefs and normative expectations. Consequently, the difference between men and women was no longer simply understood as a given biological fact, but as something that is, also, a result of socialization and relational exchanges within social contexts (see, e.g., Connell 2005 ; Lorber 1994 ). Belonging to a gender or sex is a complex practice of attribution, assignment, identification and, consequently, classification (Kessler and McKenna 1978 ). The influential concept of ‘doing gender’ emphasized that not only the gender, but also the assignment of sex is based on socially agreed-upon biological classification criteria, that form the basis of placing a person in a sex category , which needs to be practically sustained in everyday life. The analytical distinction between sex and gender became eventually implausible as it obscures the process in which the body itself is subject to social forces (West and Zimmerman 1991 ).

In a similar way, sexual behaviour and sexuality are also shaped by society, as societal expectations influence sexual attraction, in many societies within the normative boundaries of the gender binary and heteronormativity (Butler 1990). This also had consequences for deviation from this norm, resulting, for example, in the medicalisation of homosexuality (Foucault 1998).

Reference to our illustrative case study on the recently published research into the genetic basis of sexuality brings the relevance of this theorization into focus. The study cautions against the ‘gay gene’ discourse, the use of the findings for prediction, and genetic determinism of sexual orientation, noting “the richness and diversity of human sexuality” and stressing that the results do not “make any conclusive statements about the degree to which ‘nature’ and ‘nurture’ influence sexual preference” (Ganna et al. 2019b, p. 6).

Coming back to categorizations, more recent approaches from STS are also based on the assumption that classifications are a “spatio-temporal segmentation of the world” (Bowker and Star 2000, p. 10), and that classification systems are, similar to concepts of gender theory (e.g. Garfinkel 1967), consistent, mutually exclusive and complete. The International Classification of Diseases (ICD), a classification scheme of diseases based on their statistical significance, is an example of such a historically grown knowledge system. How the ICD is utilized in practice points to the ethical and social dimensions involved (Bowker and Star 2000). Such approaches help to unravel current epistemological shifts in medical research and intervention, including the removal of homosexuality from the disease classification half a century ago.

Re-classifying diseases in tandem with genetic conditions creates new forms of ‘genetic responsibilities’ (Novas and Rose 2000). For instance, this may result in a change of the ‘sick role’ (described early on in Parsons 1951), creating new obligations not only for diseased but also for actually healthy persons in relation to potential futures. Such genetic knowledge is increasingly produced using large-scale genomic databases and creates new categories based on genetic risk and, consequently, may result in new categories of individuals who are ‘genetically at risk’ (Novas and Rose 2000). The question now is how these new categories will alter, structure or replace evolved categories in terms of constructing the social world and medical practice.

While advancement in genomics is changing understandings of bodies and diseases, the meanings of certain social categories for medical research remain rather stable. Developments of personalized medicine go along with “the ‘re-inscription’ of traditional epidemiological categories into people’s DNA” and adherence to “old population categories while working out new taxonomies of individual difference” (Prainsack 2015, pp. 28–29). This, again, highlights the fact that knowledge production draws on and is shaped by categories that have a political and cultural meaning within a social world that is pervaded by power relations.

From categorization to social implication and intervention

While categorizations are inherently social in their production, their use in knowledge production has wide ranging implications. Such is the case of how geneticisation of sexual orientation has been an issue that troubled and comforted the LGBTQIA+ communities. Despite the inexistence of an identified gene, ‘gay gene’ has been part of societal discourse. Such circulation disseminates an unequal emphasis on the biologized interpretations of sexual orientation, which may be portrayed differently in media and appeal to groups of opposing views in contrasting ways (Conrad and Markens 2001 ). Geneticisation, especially through media, moves sexual orientation to an oppositional framework between individual choice and biological consequence (Fausto-Sterling 2007 ) and there have been mixed opinions within LGBTQIA+ communities, whether this would resolve the moralization of sexual orientation or be a move back into its medicalisation (Nelkin and Lindee 2004 ). Thus, while some activists support geneticisation, others resist it and work against the potential medicalisation of homosexuality (Shostak et al. 2008 ). The ease of communicating to the general public simple genetic basis for complex social outcomes which are genetically more complex than reported, contributes to the geneticisation process, while the scientific failures of replicating ‘genetic basis’ claims do not get reported (Conrad 1999 ). In other words, while finding a genetic basis becomes entrenched as an idea in the public imaginary, research showing the opposite does not get an equal share in the media and societal discourse, neither of course does the social sciences’ critique of knowledge production that has been discussed for decades.

A widely, and often quantitatively, studied aspect of the geneticisation of sexual orientation is how this plays out in the broader understanding of sexual orientation in society. While there are claims that the geneticisation of sexual orientation can result in a depoliticization of the identities (O’Riordan 2012), it may at the same time lead to a polarization of society. According to social psychologists, genetic attributions to conditions are likely to lead to perceptions of immutability, specificity in aetiology, homogeneity and discreteness, as well as to the naturalistic fallacy (Dar-Nimrod and Heine 2011). Despite the multitude of suggestive surveys indicating that belief in a genetic basis of homosexuality correlates with acceptance, some studies suggest that learning about genetic attribution to homosexuality can be polarizing and confirmatory of previously held negative or positive attitudes (Boysen and Vogel 2007; Mitchell and Dezarn 2014). Such conclusions can be taken as a caution that, just as scientific knowledge production is social, its consequences are, too.

Looking beyond the case

We want to exemplify this argument by taking a detour to another case where the intersection between scientific practice, knowledge production and the social environment is of particular interest. While we have discussed the social implications of geneticisation with a focus on sexual orientation, recent developments in the biomedical sciences and biotechnology also have the potential to reframe the old debates in entirely different ways. For instance, while ‘designer babies’ were only an imaginary concept until recently, the facility and affordability of processes such as in vitro selection of a baby’s genotype and germline genome editing have potentially important impacts in this regard. When the CRISPR/Cas9 technique was developed for rapid and easy gene editing, both the hopes and the worries associated with its use were high. Martin and others (2020, pp. 237–238) claim gene editing is causing both disruption within the postgenomic regime, specifically to its norms and practices, and the convergence of various biotechnologies such as sequencing and editing. Against this background, He Jiankui’s announcement in November 2018 through YouTube (Footnote 7) that twins had been born with edited genomes was an unwelcome surprise for many. This unexpected move may have hijacked the discussions on the ethical, legal and societal implications of human germline genome editing, but it also rang alarm bells across the globe about similar “rogue” scientists planning experimentation with the human germline (Morrison and de Saille 2019). The facility to conduct germline editing is, logically, only one step away from ‘correcting’, and if there is a correction, then that would mean a return to a normative state. He’s construction of HIV infection as a genetic risk can be read as a placeholder for numerous questions about human germline editing: What are the variations that are “valuable” enough for a change in the germline? For instance, there are plans by Denis Rebrikov in Russia to genome edit embryos to ‘fix’ a mutation that causes congenital deafness (Cyranoski 2019). If legalized, what would be the limits applied, and who would be able to afford such techniques? At a time when genomics research into human sociality is booming, would the knowledge currently produced in this field and others translate into ‘corrective’ genome editing? Who would decide?

The science in itself is still unclear at this stage because, for many complex conditions, using gene editing to change one allele to another often has a minuscule effect, considering that numerous alleles together may affect phenotypes, while at the same time a single allele may affect multiple phenotypes. In another GWAS case, social genomicists claim there are thousands of variations found to be influential for a particular social outcome, such as educational attainment (Lee et al. 2018), with each having a minimal effect. It has also been shown in the last few years that, as the same study is conducted with ever larger samples, more genomic variants are associated with the social outcome: 74 single nucleotide polymorphisms (SNPs) associated with the outcome in a sample size of 293,723 (Okbay et al. 2016), and 1271 SNPs associated with the outcome in a sample size of 1.1 million individuals (Lee et al. 2018).
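The practical upshot of many variants with minimal individual effects is usually a polygenic score: a weighted sum of a person's allele counts, with weights taken from GWAS effect-size estimates. The sketch below shows only that arithmetic on simulated data; the allele frequencies, effect sizes and variable names are invented for illustration and do not correspond to any published score (the SNP count merely echoes the 1271 reported by Lee et al. 2018).

```python
# Illustrative polygenic score: a weighted sum of allele counts using
# hypothetical per-variant effect sizes. Real scores for complex outcomes
# aggregate thousands of SNPs, each with a very small individual effect.
import numpy as np

rng = np.random.default_rng(1)
n_individuals, n_snps = 1_000, 1_271   # SNP count only echoes Lee et al. (2018)

allele_counts = rng.binomial(2, 0.3, size=(n_individuals, n_snps))   # 0/1/2 per SNP
effect_sizes = rng.normal(0.0, 0.01, size=n_snps)                    # tiny invented weights

polygenic_score = allele_counts @ effect_sizes        # one number per person

# Scores are usually standardized before being correlated with an outcome.
standardized = (polygenic_score - polygenic_score.mean()) / polygenic_score.std()
print(standardized[:5])
```

The reduction at stake in the surrounding argument is visible here: whatever the social outcome being modelled, each person is ultimately represented by a single standardized number.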

Applying this reasoning to the GWAS on same-sex sexual behaviour, it is highly probable that the findings will be superseded in the following years with similar studies of bigger data, increasing the number of associations.

A genomic re-thinking?

The examples outlined here have served to show how focusing the discussion on “genetic determinism” is fruitless considering the complexity of the knowledge production practices and how the produced knowledge could both mirror social dynamics and shape these further. Genomic rethinking of the social necessitates a new formulation of social equality, where genomes are also relevant. Within the work of social genomics researchers, there has been cautious optimism toward the contribution of findings from genomics research to understanding social outcomes of policy change (Conley and Fletcher 2018 ; Lehrer and Ding 2019 ). Two fundamental thoughts govern this thinking. First, genetic basis is not to be equalized with fate; in other words, ‘genetic predispositions’ make sense only within the broader social and physical environmental frame, which often allows room for intervention. Second, genetics often relates to heterogeneity of the individuals within a population, in ways that the same policy may be positive, neutral or negative for different individuals due to their genes. In this respect, knowledge gained via social genomics may be imagined as a basis for a more equal society in ‘uncovering’ invisible variables, while, paradoxically, it may also be a justification for exclusion of certain groups. For example, a case that has initially raised the possibility that policies affect individuals differently because of their genetic background was a genetic variant that was correlated to being unaffected by tax increases on tobacco (Fletcher 2012 ). The study suggested that raising the taxes may be an ineffective tool for lowering smoking rates below a certain level, since those who are continuing to smoke may be those who cannot easily stop due to their genetic predisposition to smoking. Similar ideas could also apply to a diverse array of knowledge produced in social genomics, where the policies may be under scrutiny according to how they are claimed to variably influence the members of a society due to their genetics.

Datafication of scientific knowledge production

From theory to data-driven science

More than a decade has gone by since Savage and Burrows ( 2007 ) described a crisis in empirical research, where the well-developed methodologies for collecting data about the social world would become marginal as such data are being increasingly generated and collected as a by-product of daily virtual transactions. Today, sociological research faces a widely datafied world, where (big) data analytics are profoundly changing the paradigm of knowledge production, as Facebook, Twitter, Google and others produce large amounts of socially relevant data. A similar phenomenon is taking place through opportunities that public and private biobanks, such as UK Biobank or 23andMe, offer. Crossing the boundaries of social sciences and biological sciences is facilitated through mapping correlations between genomic data, and data on social behaviour or outcomes.

This shift from theory to data-driven science misleadingly implies a purely inductive knowledge production, neglecting the fact that data is not produced free of preceding theoretical framing, methodological decisions, technological conditions and the interpretation of correlations—i.e. an assemblage situated within a specific place, time, political regime and cultural context (Kitchin 2014a ). It glosses over the fact that data cannot simply be treated as raw materials, but rather as “inherently partial, selective and representative”, the collection of which has consequences (Kitchin 2014b , p. 3). How knowledge of the body is generated starts with how data is produced and how it is used and mobilized. Through sequencing, biological samples are translated into digital data that are circulated and merged and correlated with other data. With the translation from genes into data, their meaning also changes (Saukko 2017 ). The kind of knowledge that is produced is also not free of scientific and societal concepts.

Categorical variables individually assigned to genomes have become important for genomic research and are impacting the ways in which identities are conceptualized under (social) genomic conditions. These characteristics include those of social identity, such as gender, ethnicity, educational and socioeconomic status. They are often used for the study of human genetic variation and individual differences, with the aim of advancing personalized medicine, and are based on demographic and ascribed social characteristics.

The sexual orientation study that is central to this paper can be read as a case where such categories intersect with the mode of knowledge production. As the largest contributor of data to the study, the UK Biobank data used in this research are revealing, since they are based on the answer to the question “Have you ever had sexual intercourse with someone of the same sex?”, accompanied by the statement “Sexual intercourse includes vaginal, oral or anal intercourse.” (Footnote 8).

Furthermore, the authors accept that they made numerous reductive assumptions and that their study has methodological limitations. For instance, Ganna et al. (2019b) acknowledge, both within the article (p. 1) and on an accompanying website (Footnote 9), that the research is based on a binary ‘sex’ system with exclusions of non-complying groups, as the authors report that they “dropped individuals from [the] study whose biological sex and self-identified sex/gender did not match” (p. 2). However, both categorizing sexual orientation mainly on practice rather than attraction or desire, and building it on normative assumptions about sexuality, i.e. the gender binary and heteronormativity, are problematic, as sexual behaviour is diverse and does not necessarily correspond with such assumptions.
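To make the degree of reduction visible, the following sketch mimics, with invented column names and toy records, the two data practices described above: collapsing a single yes/no survey item into a binary phenotype and excluding records where registered sex and self-identified gender differ. It is a schematic reconstruction of the kind of preprocessing described, not the study's actual code.

```python
# Schematic illustration of the kind of phenotype coding and exclusion rules
# described above. Column names and records are invented; this is not the
# study's actual preprocessing code.
import pandas as pd

records = pd.DataFrame({
    "ever_same_sex_intercourse": ["yes", "no", "yes", "prefer not to answer"],
    "registered_sex":            ["F",   "M",  "M",   "F"],
    "self_identified_gender":    ["F",   "M",  "F",   "F"],
})

# 1. Collapse a complex phenomenon into a 0/1 phenotype; non-answers are dropped.
answered = records[records["ever_same_sex_intercourse"].isin(["yes", "no"])].copy()
answered["phenotype"] = (answered["ever_same_sex_intercourse"] == "yes").astype(int)

# 2. Exclude records where registered sex and self-identified gender differ,
#    mirroring the exclusion the authors report.
analysed = answered[answered["registered_sex"] == answered["self_identified_gender"]]

print(analysed[["phenotype", "registered_sex"]])
```

Each step discards information (non-answers, non-conforming records, everything about attraction or desire), which is precisely the reduction the surrounding discussion problematizes.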

The variations found in the sexual orientation study, as is true for other genome-wide association studies, are often relevant for the populations studied, and in this case those mainly belong to certain age groups and European ancestry. While the study deflects critique by saying that the research is not about the genetics of sexual orientation but rather of same-sex sexual behaviour, whether such a genomic study would even be possible is also questionable. This example demonstrates that, despite the increasing influence of big data, a fundamental problem with the datafication of many social phenomena is whether or not they are amenable to measurement. In the case of sexual orientation, whether the answer to the sexual orientation questions corresponds to “homosexuality” or to “willingness to reveal homosexuality”/“stated sexual orientation” is debatable, considering the social pressure and stigma that may be an element in certain social contexts (Conley 2009, p. 242).

While our aim is to bring a social scientific perspective, biologists have raised at least two critical objections to the knowledge production practice in the sexual orientation study: first, concerning the implications of the produced knowledge (see footnote 10), and second, concerning the problems and flaws of the search for a genetic basis (see footnote 11). In STS, however, genetic differences that were hypothesized to be relevant for health, especially under the category of race in the US, were a major point of discussion within the genomic ‘inclusion’ debates of the 1990s (Reardon 2017, p. 49; Bliss 2015). In other words, one criticism of the knowledge production was its focus on certain “racial” or racialized groups, such as Americans of European ancestry, which supposedly biased the findings and the downstream development of therapies for ‘other’ groups. However, measuring health and medical conditions against the background of groups that are constituted on the basis of social or cultural categories (e.g. age, gender, ethnicity) may also result in a reinscription/reconstitution of the social inequalities attached to these categories (Prainsack 2015), and at the same time turn health justice into a topic seen through a postgenomic lens, where postgenomics is “a frontline weapon against inequality” (Bliss 2015, p. 175). Socio-economic factors may recede into the background, while data, with their own often invisible politics, are foregrounded.

Unlike what Savage and Burrows suggested in 2007, the coming crisis can be seen not only as a crisis of sociology but of science in general. Just as the shift of focus in the social sciences towards digital data is only one part of the picture, another part could be the developments in the genomisation of the social. Considering that censuses and large-scale statistics are not new, what distinguishes the current phenomenon is possibly the opportunity to individualize the data, while the categories themselves often remain unable to capture the complexity, despite producing knowledge more efficiently. In that sense, the above-mentioned survey questions do not do justice to the complexity of social behaviour. What is most important to flag within these transformations is the lack of reflexivity regarding how big data come to represent the world and whether they add to and/or take away from the ways of knowing that preceded big data. These developments and directions of genetic-based research and big data go far beyond the struggle of a single discipline, namely sociology, with a paradigm shift in empirical research. They could set the stage for real consequences for individuals and groups. Just as the definition of an editable condition emerges through a social process that relies on socio-political categories, the knowledge acquired from big data relies in a similar way on the same kinds of categories.

The data choices and restrictions: ‘Free from theory’ or freedom of choice

Data, broadly understood, have become a fundamental part of our lives, from accepting and granting different kinds of consent for our data to travel on the internet, to gaining the ‘right to be forgotten’ in certain countries, to being able to retrieve collected information about ourselves from states, websites, even supermarket chains. While becoming part of our lives, the data collected about individuals in the form of big data are transferred between academic and non-academic research, and between scientific and commercial enterprises. The associated changes in knowledge production have important consequences for the ways in which we understand and live in the world (Jasanoff 2004). The co-productionist perspective in this sense does not relate to whether or how the social and the biological are co-produced; rather, it points to how the knowledge produced in science is both shaped by and shaping societies. Thus, the increasing impact and authority of big data in general, and within the sexual orientation study in focus here, opens up new avenues to claim, as some suggest, that we have reached the end of theory.

The “end of theory” has been actively debated within and beyond science. Kitchin (2014a) locates the recent origin of this debate in a piece in Wired, where the author states that “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson 2008). Others call this a paradigm shift towards data-intensive research that leaves behind the empirical and theoretical stages (Gray 2009, p. xviii). While Google and others build this data-driven understanding on their predictive capacity, or on ‘letting the data speak’, the idea that knowledge production is ‘free from theory’ in this case seems to be, at best, a disregard of the data infrastructure and of how categories are formed within it.

Taking a deeper look at the same-sex sexual behaviour study from this angle suggests that such research cannot be free from theory, as it has to make an assumption regarding the role of genetics in the context of social dynamics. In other words, it has to move sexual orientation, at least partially in the form of same-sex sexual behaviour, out of the domain of the social and towards the biological. In doing so, just as the study concludes that sexual orientation is complex, the authors note in the informative video on their website (see footnote 12) that “they found that about a third of the differences between people in their sexual behaviour could be explained by inherited genetic factors. But the environment also plays a large role in shaping these differences.” While the study points to a minuscule component of the biological, it also frames biology as the basis upon which the social, as part of the environment, acts.
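
To make explicit what a statement like “about a third of the differences could be explained by inherited genetic factors” amounts to in quantitative-genetic terms, the conventional variance-decomposition reading is sketched below. This is a generic textbook formulation, not the authors’ actual estimator, and it assumes away gene-environment covariance and interaction.

```latex
% Conventional reading of "about a third of the differences are explained by
% inherited genetic factors": the phenotypic variance is partitioned into a
% genetic and an environmental component. This sketch assumes no
% gene-environment covariance or interaction, which real estimators must
% handle explicitly.
h^{2} \;=\; \frac{\sigma^{2}_{G}}{\sigma^{2}_{P}}
      \;=\; \frac{\sigma^{2}_{G}}{\sigma^{2}_{G} + \sigma^{2}_{E}} \;\approx\; \frac{1}{3}
```

Everything not captured by the genetic component is folded into the ‘environment’ term, which is exactly the residual framing of the social that is problematized above.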

Reconsidering how the biological and the social are represented in the study, three theoretical choices are made due to the limitations of the data. First of all, the biological is taken to be “the genome-wide data” in the biobanks that the study relies on. This means that sexual orientation is assumed to reside within the SNPs, i.e. points on the genome that are common variations across a population, and not in other kinds of variation that are rare or not captured by the genotyped SNPs. These differences include, but are not limited to, large- and small-scale duplications and deletions of genomic regions, rare variants, and even common variants that the SNP chips do not capture. Such ignored differences are very important for a number of conditions, from cancer to neurobiology. Similarly, the genomic focus leaves aside the epigenetic factors that could theoretically be the missing link between genomes and environments. In noting this, we do not suggest that the authors of the study are unaware of or uninterested in epigenetics; however, regardless of their interest and/or knowledge, the availability of large-scale genome-wide data puts such data ahead of any other variation in the genome and epigenome. In other words, if the UK Biobank and 23andMe had held similar amounts of epigenomic or whole-genome data beyond the SNPs, the study would most probably have relied on these other variations in the genome. The search for a genetic basis within SNPs is a theoretical choice, and in this case the choice is pre-determined by the limitations of the data infrastructures.
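
For readers unfamiliar with what ‘searching for a genetic basis within SNPs’ looks like at the data level, the sketch below shows a generic single-SNP association test. It is illustrative only: the file name, column names and additive 0/1/2 genotype coding are assumptions, and this is not the pipeline used by Ganna et al.

```python
# Minimal sketch of a generic per-SNP association test (not the Ganna et al.
# pipeline). Hypothetical input: 'genotypes.csv' with one row per participant,
# SNP columns coded 0/1/2 (copies of the minor allele) and a 0/1 'phenotype'.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("genotypes.csv")                 # hypothetical file
snp_cols = [c for c in df.columns if c.startswith("rs")]

pvalues = {}
for snp in snp_cols:
    X = sm.add_constant(df[[snp]])                # additive coding: 0, 1 or 2
    model = sm.Logit(df["phenotype"], X).fit(disp=0)
    pvalues[snp] = model.pvalues[snp]             # one test per genotyped SNP

# Only variation present on the genotyping chip can ever surface here; rare
# variants, structural variants and epigenetic marks are invisible by design.
print(sorted(pvalues.items(), key=lambda kv: kv[1])[:10])
```

The design choice the sketch makes visible is that the unit of analysis is the genotyped SNP itself: whatever the chip does not measure cannot become a ‘finding’.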

The second choice the authors make is to take three survey questions, in the case of the UK Biobank data, as capturing enough of the complexity of sexual orientation for their research. As partly discussed earlier, these questions simply ask about sexual behaviour. Based on the UK Biobank’s definition of sexual intercourse as “vaginal, oral or anal intercourse”, the answers to the following questions were relevant for the research: “Have you ever had sexual intercourse with someone of the same sex?” (Data-Field 2159), “How many sexual partners of the same sex have you had in your lifetime?” (Data-Field 3669), and “About how many sexual partners have you had in your lifetime?” (Data-Field 2149). Answers to such questions do little justice to the complexity of the topic. Considering that they were not included in the biobank for the purpose of identifying a genetic basis of same-sex sexual behaviour, it is worth asking in what capacity they can serve that purpose. The UK Biobank is primarily focused on health-related research, and thus these three survey questions could not have been asked with a genomic exploration of ‘same-sex sexual behaviour’ or ‘sexual orientation’ in mind. The degree to which they can successfully be used to identify the genetic basis of complex social behaviours is questionable.
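
To illustrate the reduction at stake, the sketch below shows how a single survey answer typically becomes a binary phenotype for analysis. The answer strings and the helper function are hypothetical; they are not UK Biobank’s actual internal codings for Data-Field 2159.

```python
# Sketch of flattening one survey answer into a 0/1 phenotype. The answer
# strings are assumptions for illustration, not UK Biobank's actual codings.
from typing import Optional

def same_sex_behaviour_phenotype(answer_2159: str) -> Optional[int]:
    """Map the answer to "Have you ever had sexual intercourse with someone
    of the same sex?" (Data-Field 2159) to a binary phenotype; anything else
    (e.g. "Prefer not to answer") is treated as missing."""
    return {"Yes": 1, "No": 0}.get(answer_2159)
```

Whatever the actual coding scheme, the point stands: a lifetime of behaviour, attraction and identity enters the association analysis as a single bit.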

The authors of the study consider the UK Biobank sample to comprise relatively old individuals and regard this as a shortcoming (see footnote 13). Similarly, the study authors claim that the 23andMe sample may be biased because “[i]ndividuals who engage in same-sex sexual behaviour may be more likely to self-select the sexual orientation survey”, which then explains the high percentage of such individuals (18.9%) (Ganna et al. 2019b, p. 1). However, the authors do not problematize the fact that there is at least a three-fold difference between the youngest and oldest generations in the UK Biobank sample in their responses to the same-sex sexual behaviour question (Ganna et al. 2019b, p. 2). The study thus highlights the problematic issue of who should be regarded as a representative sample to be asked about their “same-sex sexual behaviour”. Still, this is a data choice that the authors make in drawing a universal explanation out of a very specific and socially constrained collection of self-reported data that encompasses only part of what the researchers are interested in.

The third choice is a choice unmade. The study data mainly came from the UK Biobank, following a proposal by Brendan Zietsch titled “Direct test whether genetic factors predisposing to homosexuality increase mating success in heterosexuals” (see footnote 14). The original research plan frames “homosexuality” as a condition to which heterosexuals can be “predisposed”, and since this condition has not been eliminated through evolution, the scientists hypothesize that whatever genetic variation predisposes an individual to homosexuality may also function to increase the individual’s reproductive capacity. Despite using such an evolutionary explanation as the theoretical basis for obtaining the data from the UK Biobank, the authors use ‘evolution’/‘evolutionary’ only three times in the article, whereas the concept of “mating success” is entirely absent. Contrary to the expectation in the research proposal, the authors observe a lower number of offspring for individuals reporting same-sex sexual behaviour, and they conclude briefly: “This reproductive deficit raises questions about the evolutionary maintenance of the trait, but we do not address these here” (Ganna et al. 2019b, p. 2). In other words, the hypothesis that allowed the scientists to acquire the UK Biobank data becomes irrelevant for the researchers when they report their findings.

In this section, we have analysed how data choices are made at different steps of the research and hinted at how these choices reflect certain understandings of how society functions. These are evident in the ways sexual behaviour is represented and categorized according to quantitative data, and in the considerations of whether certain samples are contemporary enough (UK Biobank) or too self-selecting (the share of same-sex sexual behaviour being too high in 23andMe). The study, however, does not problematize the fact that the percentage of individuals reporting same-sex sexual behaviour steadily increases with year of birth, at least tripling for males and increasing more than five-fold for females between 1940 and 1970 (for the UK Biobank). Such details are among the data that the authors display as descriptive statistics in Fig. 1 (Ganna et al. 2019b, p. 2); however, they do not attract the kind of discussion that the genomic data receive. The study itself starts from the idea that genetic markers associated with same-sex sexual behaviour could have an evolutionary advantage and ends by saying that the behaviour is complex. Critics claim that the “approach [of the study] implies that it is acceptable to issue claims of genetic drivers of behaviours and then lay the burden of proof on social scientists to perform post-hoc socio-cultural analysis” (Richardson et al. 2019, p. 1461).

In this paper, we have ‘moved back to the future’—taking stock of the present-day accelerated impact of big data and of its potential and real consequences. Using the sexual orientation GWAS as a point of reference, we have shown that claims to be working under the premise of a ‘pure science’ of genomics are untenable, as the social is present by default—within the methodological choices made by the researchers, the impact on/of the social imaginary, and the epigenetic context.

By focusing on the contingency of knowledge production on social categories that are themselves reflections of the social in data practices, we have highlighted the relational processes at the root of knowledge production. We are experiencing a period in which the repertoire of what gets quantified increases continuously, and possibly exponentially; however, this does not necessarily mean that our understanding of complexity increases at the same rate. Rather, it may lead to unintended simplification, in which meaningful levels of understanding of causality are lost in the “triumph of correlations” in big data (Mayer-Schönberger and Cukier 2013; cited in Leonelli 2014). While sociology has much to offer through its qualitative roots, we think it should do more than critique, especially considering that culturally and temporally specific understandings of the social are also linked to socio-material consequences.

We want to highlight that now is the time to think about the broader developments in science and society, not merely from an external perspective, but within a new framework. Clearly, our discussion of a single case cannot sustain suggestions for a comprehensive framework applicable to any study; however, we can flag the urgency of its requirement. We have shown that, in the context of the rapid developments within big data-driven and socio-genomic research, it is necessary to renew the argument for bringing the social, and its interrelatedness with the biological, clearly back into focus. We strongly believe that reemphasizing this argument is essential to underline the analytical strength of the social science perspective and to avoid losing sight of the complexity of social phenomena, which risk being oversimplified in mainly statistical, data-driven science.

We can also identify three interrelated dimensions of scientific practice that such a framework would valorize: (1) recognition of the contingency of choices made within the research process, and sensitivity to their consequent impact within the social context; (2) ethical responsibilities that move beyond procedural, contractual requirements towards sustaining a process rooted in a clear understanding of societal environments; and (3) interdisciplinarity in analytical practice that potentiates the impact of each perspectival lens.

Such a framework would facilitate moving out of the disciplinary or institutionalized silos of ELSI, STS, sociology, genetics, or even the emerging social genomics. Rather than competing for authority on ‘the social’, the aim should be to critically complement each other and to refract the produced knowledge through a multiplicity of lenses. Zooming ‘back to the future’ within the field of socio-biomedical science, we would flag the necessity of re-calibrating to a multi-perspectival endeavour—one that does justice to the complex interplay of social and biological processes within which knowledge is produced.

The GWAS primarily uses the term “same-sex sexual behaviour” as one facet of “sexual orientation”, where the former becomes the component that is directly associable with genes and the latter the broader phenomenon of interest. Thus, while the article refers to “same-sex sexual behaviour” in its title, it is editorially presented in the same Science issue under the Human Genetics heading with the subheading “The genetics of sexual orientation” (p. 880) (see Funk 2019). Furthermore, the request for data from UK Biobank by the corresponding author Brendan P. Zietsch (see footnote 14) refers only to sexual orientation and homosexuality and not to same-sex sexual behaviour. Therefore, we follow the same interchangeable use in this article.

Source: https://osf.io/xwfe8 (04.03.2020).

Source: https://www.wsj.com/articles/research-finds-genetic-links-to-same-sex-behavior-11567101661 (04.03.2020).

Source: https://geneticsexbehavior.info (04.03.2020).

In addition to footnotes 10 and 11, for a discussion please see: https://www.nytimes.com/2019/08/29/science/gay-gene-sex.html (04.03.2020).

Later “122 Shades of Grey”: https://www.geneplaza.com/app-store/72/preview (04.03.2020).

Source: https://www.youtube.com/watch?v=th0vnOmFltc (04.03.2020).

Source: http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=2159 (04.03.2020).

Source: https://geneticsexbehavior.info/ (04.03.2020).

Source: https://www.broadinstitute.org/blog/opinion-big-data-scientists-must-be-ethicists-too (04.03.2020).

Source: https://medium.com/@cecilejanssens/study-finds-no-gay-gene-was-there-one-to-find-ce5321c87005 (03.03.2020).

Source: https://videos.files.wordpress.com/2AVNyj7B/gosb_subt-4_dvd.mp4 (04.03.2020).

Source: https://geneticsexbehavior.info/what-we-found/ (04.03.2020).

Source: https://www.ukbiobank.ac.uk/2017/04/direct-test-whether-genetic-factors-predisposing-to-homosexuality-increase-mating-success-in-heterosexuals/ (04.03.2020).

Abdellaoui A, Hugh-Jones D, Yengo L, Kemper KE, Nivard MG, Veul L, Holtz Y, Zietsch BP, Frayling TM, Wray NR (2019) Genetic correlates of social stratification in Great Britain. Nat Hum Behav 1–21. https://doi.org/10.1038/s41562-019-0757-5

Anderson C (2008) The end of theory: the data deluge makes the scientific method obsolete, Wired https://www.wired.com/2008/06/pb-theory/ . Accessed 31 Mar 2020

Bliss C (2015) Defining health justice in the postgenomic era. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham/London, pp. 174–191

Bourdieu P (2001) Masculine domination. Stanford University Press, Stanford

Bourdieu P (2010) Distinction: a social critique of the judgement of taste. Routledge, London/New York

Bowker GC, Star SL (2000) Sorting things out: classification and its consequences. MIT Press, Cambridge/London

Boysen GA, Vogel DL (2007) Biased assimilation and attitude polarization in response to learning about biological explanations of homosexuality. Sex Roles 57(9–10):755–762. https://doi.org/10.1007/s11199-007-9256-7

Butler J (1990) Gender trouble. Feminism and the subversion of identity. Routledge, New York

Clarke AE, Shim JK, Shostak S, Nelson A (2013) Biomedicalising genetic health, diseases and identities. In: Atkinson P, Glasner P, Lock M (eds) Handbook of genetics and society: mapping the new genomic era. Routledge, Oxon, pp. 21–40

Conley D (2009) The promise and challenges of incorporating genetic data into longitudinal social science surveys and research. Biodemogr Soc Biol 55(2):238–251. https://doi.org/10.1080/19485560903415807

Conley D, Fletcher J (2018) The genome factor: what the social genomics revolution reveals about ourselves, our history, and the future. Princeton University Press, Princeton/Oxford

Connell RW (2005) Masculinities. Polity, Cambridge

Conrad P (1999) A mirage of genes. Sociol Health Illn 21(2):228–241. https://doi.org/10.1111/1467-9566.00151

Conrad P, Markens S (2001) Constructing the ‘gay gene’ in the news: optimism and skepticism in the US and British press. Health 5(3):373–400. https://doi.org/10.1177/136345930100500306

Crenshaw K (1989) Demarginalizing the intersection of race and sex: a black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics, vol 1989(8). University of Chicago Legal Forum. http://chicagounbound.uchicago.edu/uclf/vol1989/iss1/8 . Accessed 1 Apr 2020

Cyranoski D (2019) Russian ‘CRISPR-baby’ scientist has started editing genes in human eggs with goal of altering deaf gene. Nature 574(7779):465–466. https://doi.org/10.1038/d41586-019-03018-0

Dar-Nimrod I, Heine SJ (2011) Genetic essentialism: on the deceptive determinism of DNA. Psychol Bull 137(5):800–818. https://doi.org/10.1037/a0021860

Evangelou E, Warren HR, Mosen-Ansorena D, Mifsud B, Pazoki R, Gao H, Ntritsos G, Dimou N, Cabrera CP, Karaman I (2018) Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nat Genet 50(10):1412–1425. https://doi.org/10.1038/s41588-018-0205-x

Fausto-Sterling A (2007) Frameworks of desire. Daedalus 136(2):47–57. https://doi.org/10.1162/daed.2007.136.2.47

Fletcher JM (2012) Why have tobacco control policies stalled? Using genetic moderation to examine policy impacts. PLoS ONE 7(12):e50576. https://doi.org/10.1371/journal.pone.0050576

Foucault M (1998) The history of sexuality 1: the will to knowledge. Penguin Books, London

Foucault M (2003) The birth of the clinic. Routledge, London/New York

Foucault M (2005) The order of things. Routledge, London/New York

Fox Keller E (2014) From gene action to reactive genomes. J Physiol 592(11):2423–2429. https://doi.org/10.1113/jphysiol.2014.270991

Fox Keller E (2015) The postgenomic genome. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham/London, pp. 9–31

Funk M (2019) The genetics of sexual orientation. Science 365(6456):878–880. https://doi.org/10.1126/science.365.6456.878-k

Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019a) Genome studies must account for history—response. Science 366(6472):1461–1462. https://doi.org/10.1126/science.aaz8941

Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019b) Large-scale GWAS reveals insights into the genetic architecture of same-sex sexual behavior. Science 365(6456):eaat7693. https://doi.org/10.1126/science.aat7693

Garfinkel H (1967) Studies in ethnomethodology. Polity Press, Cambridge

Gray J (2009) Jim Gray on eScience: a transformed scientific method. In: Hey T, Tansley S, Tolle KM (eds) The fourth paradigm: data-intensive scientific discovery. Microsoft Research, Redmond, pp. xvii–xxxi

Griffiths DA (2016) Queer genes: realism, sexuality and science. J Crit Realism 15(5):511–529. https://doi.org/10.1080/14767430.2016.1210872

Hamer DH, Hu S, Magnuson VL, Hu N, Pattatucci AM (1993) A linkage between DNA markers on the X chromosome and male sexual orientation. Science 261(5119):321–327. https://doi.org/10.1126/science.8332896

Haraway D (1988) Situated knowledges: the science question in feminism and the privilege of partial perspective. Fem Stud 14(3):575–599

Holm S, Ploug T (2019) Genome studies reveal flaws in broad consent. Science 366(6472):1460–1461. https://doi.org/10.1126/science.aaz3797

Howard HC, van El CG, Forzano F, Radojkovic D, Rial-Sebbag E, de Wert G, Borry P, Cornel MC (2018) One small edit for humans, one giant edit for humankind? Points and questions to consider for a responsible way forward for gene editing in humans. Eur J Hum Genet 26(1):1. https://doi.org/10.1038/s41431-017-0024-z

Jansen PR, Watanabe K, Stringer S, Skene N, Bryois J, Hammerschlag AR, de Leeuw CA, Benjamins JS, Muñoz-Manchado AB, Nagel M, Savage JE, Tiemeier H, White T, Agee M, Alipanahi B, Auton A, Bell RK, Bryc K, Elson SL, Fontanillas P, Furlotte NA, Hinds DA, Huber KE, Kleinman A, Litterman NK, McCreight JC, McIntyre MH, Mountain JL, Noblin ES, Northover CAM, Pitts SJ, Sathirapongsasuti JF, Sazonova OV, Shelton JF, Shringarpure S, Tian C, Wilson CH, Tung JY, Hinds DA, Vacic V, Wang X, Sullivan PF, van der Sluis S, Polderman TJC, Smit AB, Hjerling-Leffler J, Van Someren EJW, Posthuma D, The 23andMe Research, T. (2019) Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways. Nat Genet 51(3):394–403. https://doi.org/10.1038/s41588-018-0333-3

Jasanoff S (2004) The idiom of co-production. In: Jasanoff S (ed.) States of knowledge: the co-production of science and social order. Routledge, London, p 1–12

Jasanoff S, Hurlbut JB (2018) A global observatory for gene editing. Nature 555:435–437. https://doi.org/10.1038/d41586-018-03270-w

Kessler SJ, McKenna W (1978) Gender: an ethnomethodological approach. John Wiley & Sons, New York

Kitchin R (2014a) Big Data, new epistemologies and paradigm shifts. Big Data Soc. https://doi.org/10.1177/2053951714528481

Kitchin R (2014b) The data revolution. Big data, open data, data infrastructures and their consequences. Sage, London

Landecker H (2016) The social as signal in the body of chromatin. Sociol Rev 64(1_suppl):79–99. https://doi.org/10.1111/2059-7932.12014

Landecker H, Panofsky A (2013) From social structure to gene regulation, and back: a critical introduction to environmental epigenetics for sociology. Annu Rev Sociol 39:333–357. https://doi.org/10.1146/annurev-soc-071312-145707

Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, Nguyen-Viet TA, Bowers P, Sidorenko J, Linnér RK (2018) Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment. Nat Genet 50(8):1112. https://doi.org/10.1038/s41588-018-0147-3

Lehrer SF, Ding W (2019) Can social scientists use molecular genetic data to explain individual differences and inform public policy? In: Foster G (ed.) Biophysical measurement in experimental social science research. Academic Press, London/San Diego/Cambridge/Oxford, pp. 225–265

Leonelli S (2014) What difference does quantity make? On the epistemology of Big Data in biology. Big Data Soc. https://doi.org/10.1177/2053951714534395

Lorber J (1994) Paradoxes of gender. Yale University Press, New Haven

Martin P, Morrison M, Turkmendag I, Nerlich B, McMahon A, de Saille S, Bartlett A (2020) Genome editing: the dynamics of continuity, convergence, and change in the engineering of life. New Genet Soc 39(2):219–242. https://doi.org/10.1080/14636778.2020.1730166

Maxmen A (2019) Controversial ‘gay gene’ app provokes fears of a genetic Wild West. Nature 574(7780):609–610. https://doi.org/10.1038/d41586-019-03282-0

Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, Boston/New York

Meloni M (2014) Biology without biologism: social theory in a postgenomic age. Sociology 48(4):731–746. https://doi.org/10.1177/0038038513501944

Meloni M (2016) Political biology: science and social values in human heredity from eugenics to epigenetics. Palgrave Macmillan, n.p.

Mitchell RW, Dezarn L (2014) Does knowing why someone is gay influence tolerance? Genetic, environmental, choice, and “reparative” explanations. Sex Cult 18(4):994–1009. https://doi.org/10.1007/s12119-014-9233-6

Morrison M, de Saille S (2019) CRISPR in context: towards a socially responsible debate on embryo editing. Palgrave Commun 5(1):1–9. https://doi.org/10.1057/s41599-019-0319-5

Nelkin D, Lindee MS (2004) The DNA mystique: the gene as a cultural icon. University of Michigan Press, Ann Arbor

Novas C, Rose N (2000) Genetic risk and the birth of the somatic individual. Econ Soc 29(4):485–513. https://doi.org/10.1080/03085140050174750

O’Riordan K (2012) The life of the gay gene: from hypothetical genetic marker to social reality. J Sex Res 49(4):362–368. https://doi.org/10.1080/00224499.2012.663420

Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, Turley P, Chen G-B, Emilsson V, Meddens SFW (2016) Genome-wide association study identifies 74 loci associated with educational attainment. Nature 533(7604):539–542. https://doi.org/10.1038/nature17671

Parry B, Greenhough B (2017) Bioinformation. Polity Press, Cambridge

Parsons T (1951) The social system. Free Press, New York

Prainsack B (2015) Is personalized medicine different? (Reinscription: the sequel) A response to Troy Duster. Br J Sociol 66(1):28–35. https://doi.org/10.1111/1468-4446.12117

Reardon J (2017) The postgenomic condition: ethics, justice, and knowledge after the genome. University of Chicago Press, Chicago/London

Rehmann-Sutter C, Mahr D (2016) The lived genome. In: Whitehead A, Woods A (eds) Edinburgh companion to the critical medical humanities. Edinburgh University Press, Edinburgh, pp. 87–103

Richardson SS, Borsa A, Boulicault M, Galka J, Ghosh N, Gompers A, Noll NE, Perret M, Reiches MW, Sandoval JCB (2019) Genome studies must account for history. Science 366(6472):1461. https://doi.org/10.1126/science.aaz6594

Ruckenstein M, Schüll ND (2017) The datafication of health. Annu Rev Anthropol 46:261–278. https://doi.org/10.1146/annurev-anthro-102116-041244

Saukko P (2017) Shifting metaphors in direct-to-consumer genetic testing: from genes as information to genes as big data. New Genet Soc 36(3):296–313. https://doi.org/10.1080/14636778.2017.1354691

Savage M, Burrows R (2007) The coming crisis of empirical sociology. Sociology 41(5):885–899. https://doi.org/10.1177/0038038507080443

Shostak S, Conrad P, Horwitz AV (2008) Sequencing and its consequences: path dependence and the relationships between genetics and medicalization. Am J Sociol 114(S1):S287–S316. https://doi.org/10.1086/595570

Thomas WJ, Thomas DS (1929) The child in America. Behavior problems and programs. Knopf, New York

West C, Zimmerman DH (1991) Doing gender. In: Lorber J, Farrell SA (eds) The social construction of gender. Sage, Newbury Park/London, pp. 13–37

Acknowledgements

Open access funding provided by University of Vienna. The authors thank Brígida Riso for contributing to a previous version of this article.

Author information

These authors contributed equally: Melanie Goisauf, Kaya Akyüz, Gillian M. Martin.

Authors and Affiliations

Department of Science and Technology Studies, University of Vienna, Vienna, Austria

Melanie Goisauf & Kaya Akyüz

BBMRI-ERIC, Graz, Austria

Department of Sociology, University of Malta, Msida, Malta

Gillian M. Martin

Corresponding authors

Correspondence to Melanie Goisauf or Kaya Akyüz .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Goisauf, M., Akyüz, K. & Martin, G.M. Moving back to the future of big data-driven research: reflecting on the social in genomics. Humanit Soc Sci Commun 7 , 55 (2020). https://doi.org/10.1057/s41599-020-00544-5

Received: 15 November 2019

Accepted: 09 July 2020

Published: 04 August 2020

DOI: https://doi.org/10.1057/s41599-020-00544-5

This article is cited by

Biobanking and risk assessment: a comprehensive typology of risks for an adaptive risk governance.

  • Gauthier Chassang
  • Michaela Th. Mayrhofer

Life Sciences, Society and Policy (2021)

Big Data Research Questions

Open Access

Ten simple rules for responsible big data research

* E-mail: [email protected]

Affiliation Department of Geography, University of Kentucky, Lexington, Kentucky, United States of America

Affiliation Microsoft Research, New York, New York, United States of America

Affiliations Microsoft Research, New York, New York, United States of America, Data & Society, New York, New York, United States of America

Affiliations Microsoft Research, New York, New York, United States of America, Information Law Institute, New York University, New York, New York, United States of America

Affiliation Data & Society, New York, New York, United States of America

Affiliation Department of Media and Communications, London School of Economics, London, United Kingdom

Affiliation Harvard-Smithsonian Center for Astrophysics, Harvard University, Cambridge, Massachusetts, United States of America

Affiliation Center for Engineering Ethics and Society, National Academy of Engineering, Washington, DC, United States of America

Affiliation Institute for Health Aging, University of California-San Francisco, San Francisco, California, United States of America

Affiliation Ethical Resolve, Santa Cruz, California, United States of America

Affiliation Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America

Affiliation Department of Sociology, Columbia University, New York, New York, United States of America

Affiliation Carey School of Law, University of Maryland, Baltimore, Maryland, United States of America

  • Matthew Zook, 
  • Solon Barocas, 
  • danah boyd, 
  • Kate Crawford, 
  • Emily Keller, 
  • Seeta Peña Gangadharan, 
  • Alyssa Goodman, 
  • Rachelle Hollander, 
  • Barbara A. Koenig, 

Published: March 30, 2017

Citation: Zook M, Barocas S, boyd d, Crawford K, Keller E, Gangadharan SP, et al. (2017) Ten simple rules for responsible big data research. PLoS Comput Biol 13(3): e1005399. https://doi.org/10.1371/journal.pcbi.1005399

Editor: Fran Lewitter, Whitehead Institute, UNITED STATES

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Funding: The work for this article was supported by the National Science Foundation grant # IIS-1413864. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The use of big data research methods has grown tremendously over the past five years in both academia and industry. As the size and complexity of available datasets has grown, so too have the ethical questions raised by big data research. These questions become increasingly urgent as data and research agendas move well beyond those typical of the computational and natural sciences, to more directly address sensitive aspects of human behavior, interaction, and health. The tools of big data research are increasingly woven into our daily lives, including mining digital medical records for scientific and economic insights, mapping relationships via social media, capturing individuals’ speech and action via sensors, tracking movement across space, shaping police and security policy via “predictive policing,” and much more.

The beneficial possibilities for big data in science and industry are tempered by new challenges facing researchers that often lie outside their training and comfort zone. Social scientists now grapple with data structures and cloud computing, while computer scientists must contend with human subject protocols and institutional review boards (IRBs). While the connection between individual datum and actual human beings can appear quite abstract, the scope, scale, and complexity of many forms of big data creates a rich ecosystem in which human participants and their communities are deeply embedded and susceptible to harm. This complexity challenges any normative set of rules and makes devising universal guidelines difficult.

Nevertheless, the need for direction in responsible big data research is evident, and this article provides a set of “ten simple rules” for addressing the complex ethical issues that will inevitably arise. Modeled on PLOS Computational Biology ’s ongoing collection of rules, the recommendations we outline involve more nuance than the words “simple” and “rules” suggest. This nuance is inevitably tied to our paper’s starting premise: all big data research on social, medical, psychological, and economic phenomena engages with human subjects, and researchers have the ethical responsibility to minimize potential harm.

The variety in data sources, research topics, and methodological approaches in big data belies a one-size-fits-all checklist; as a result, these rules are less specific than some might hope. Rather, we exhort researchers to recognize the human participants and complex systems contained within their data and make grappling with ethical questions part of their standard workflow. Towards this end, we structure the first five rules around how to reduce the chance of harm resulting from big data research practices; the second five rules focus on ways researchers can contribute to building best practices that fit their disciplinary and methodological approaches. At the core of these rules, we challenge big data researchers who consider their data disentangled from the ability to harm to reexamine their assumptions. The examples in this paper show how often even seemingly innocuous and anonymized data have produced unanticipated ethical questions and detrimental impacts.

This paper is a result of a two-year National Science Foundation (NSF)-funded project that established the Council for Big Data, Ethics, and Society, a group of 20 scholars from a wide range of social, natural, and computational sciences ( http://bdes.datasociety.net/ ). The Council was charged with providing guidance to the NSF on how to best encourage ethical practices in scientific and engineering research, utilizing big data research methods and infrastructures [ 1 ].

1. Acknowledge that data are people and can do harm

One of the most fundamental rules of responsible big data research is the steadfast recognition that most data represent or impact people. Simply starting with the assumption that all data are people until proven otherwise places the difficulty of disassociating data from specific individuals front and center. This logic is readily evident for “risky” datasets, e.g., social media with inflammatory language, but even seemingly benign data can contain sensitive and private information, e.g., it is possible to extract data on the exact heart rates of people from YouTube videos [ 2 ]. Even data that seemingly have nothing to do with people might impact individuals’ lives in unexpected ways, e.g., oceanographic data that change the risk profiles of communities and property values, or Exchangeable Image Format (EXIF) records from photos that contain location coordinates and reveal the photographer’s movements or even home location.
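
As a concrete illustration of the EXIF point, the following is a minimal sketch of recovering GPS coordinates from a photo’s metadata using the Pillow library. It is illustrative only: the exact tag layout varies by camera and library version, and the helper function name is our own.

```python
# Sketch: recovering GPS coordinates from a photo's EXIF metadata with Pillow.
# Illustrative only; tag layout and availability vary by camera and by
# Pillow version.
from PIL import Image, ExifTags

def gps_from_photo(path):
    exif = Image.open(path).getexif()
    gps_ifd = exif.get_ifd(0x8825)               # 0x8825 is the GPSInfo IFD
    if not gps_ifd:
        return None
    gps = {ExifTags.GPSTAGS.get(k, k): v for k, v in gps_ifd.items()}

    def to_degrees(dms, ref):
        deg = dms[0] + dms[1] / 60 + dms[2] / 3600
        return -deg if ref in ("S", "W") else deg

    return (to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"]),
            to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"]))

# A folder of seemingly innocuous photos, mapped this way, can reveal a home
# address or a daily routine: the kind of unexpected harm this rule flags.
```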

Harm can also result when seemingly innocuous datasets about population-wide effects are used to shape the lives of individuals or stigmatize groups, often without procedural recourse [ 3 , 4 ]. For example, social network maps for services such as Twitter can determine credit-worthiness [ 5 ], opaque recidivism scores can shape criminal justice decisions in a racially disparate manner [ 6 ], and categorization based on zip codes resulted in less access to Amazon Prime same-day delivery service for African-Americans in United States cities [ 7 ]. These high-profile cases show that apparently neutral data can yield discriminatory outcomes, thereby compounding social inequities.

Other cases show that “public” datasets are easily adapted for highly invasive research by incorporating other data, such as Hague et al.’s [ 8 ] use of property records and geographic profiling techniques to allegedly identify the pseudonymous artist Banksy [ 9 ]. In particular, data ungoverned by substantive consent practices, whether social media or the residual DNA we continually leave behind us, may seem public but can cause unintentional breaches of privacy and other harms [ 9 , 10 ].

Start with the assumption that data are people (until proven otherwise), and use it to guide your analysis. No one gets an automatic pass on ethics.

2. Recognize that privacy is more than a binary value

Breaches of privacy are key means by which big data research can do harm, and it is important to recognize that privacy is contextual [ 11 ] and situational [ 12 ], not reducible to a simple public/private binary. Just because something has been shared publicly does not mean any subsequent use would be unproblematic. Looking at a single Instagram photo by an individual has different ethical implications than looking at someone’s full history of all social media posts. Privacy depends on the nature of the data, the context in which they were created and obtained, and the expectations and norms of those who are affected. Understand that your attitude towards acceptable use and privacy may not correspond with those whose data you are using, as privacy preferences differ across and within societies.

For example, Tene and Polonetsky [ 13 ] explore how pushing past social norms, particularly in novel situations created by new technologies, is perceived by individuals as “creepy” even when they do not violate data protection regulations or privacy laws. Social media apps that utilize users’ locations to push information, corporate tracking of individuals’ social media and private communications to gain customer intelligence, and marketing based on search patterns have been perceived by some to be “creepy” or even outright breaches of privacy. Likewise, distributing health records is a necessary part of receiving health care, but this same sharing brings new ethical concerns when it goes beyond providers to marketers.

Privacy also goes beyond single individuals and extends to groups [ 10 ]. This is particularly resonant for communities who have been on the receiving end of discriminatory data-driven policies historically, such as the practice of redlining [ 14 , 15 ]. Other examples include community maps—made to identify problematic properties or an assertion of land rights—being reused by others to identify opportunities for redevelopment or exploitation [ 16 ]. Thus, reusing a seemingly public dataset could run counter to the original privacy intents of those who created it and raise questions about whether it represents responsible big data research.

Situate and contextualize your data to anticipate privacy breaches and minimize harm. The availability or perceived publicness of data does not guarantee lack of harm, nor does it mean that data creators consent to researchers using their data.

3. Guard against the reidentification of your data

It is problematic to assume that data cannot be reidentified. There are numerous examples of researchers with good intentions and seemingly good methods failing to anonymize data sufficiently to prevent the later identification of specific individuals [ 17 ]; in other cases, these efforts were extremely superficial [ 18 , 19 ]. When datasets thought to be anonymized are combined with other variables, it may result in unexpected reidentification, much like a chemical reaction resulting from the addition of a final ingredient.

While the identificatory power of birthdate, gender, and zip code is well known [ 20 ], there are a number of other parameters—particularly the metadata associated with digital activity—that may be as or even more useful for identifying individuals [ 21 ]. Surprising to many, unlabeled network graphs—such as location and movement, DNA profiles, call records from mobile phone data, and even high-resolution satellite images of the earth—can be used to reidentify people [ 22 ]. More important than specifying the variables that allow for reidentification, however, is the realization that it is difficult to recognize these vulnerable points a priori [ 23 ]. Factors discounted today as irrelevant or inherently harmless—such as battery usage—may very well prove to be a significant vector of personal identification tomorrow [ 24 ]. For example, the addition of spatial location can turn social media posts into a means of identifying home location [ 25 ], and Google’s reverse image search can connect previously separate personal activities—such as dating and professional profiles—in unanticipated ways [ 26 ]. Even data about groups—“aggregate statistics”—can have serious implications if they reveal that certain communities, for example, suffer from stigmatized diseases or social behavior much more than others [ 27 ].
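
One lightweight way to act on this rule before releasing data is to audit how many records share each combination of quasi-identifiers, in the spirit of a k-anonymity check. The sketch below is a generic illustration; the column names, file name, and threshold are assumptions, not a prescribed standard.

```python
# Sketch: flag quasi-identifier combinations shared by fewer than k people,
# since such records are the easiest to reidentify by linking other datasets.
# Column names, the input file and the threshold k are illustrative.
import pandas as pd

def risky_groups(df: pd.DataFrame, quasi_ids=("birthdate", "gender", "zip"), k=5):
    sizes = df.groupby(list(quasi_ids)).size()
    return sizes[sizes < k]                      # combinations below the threshold

released = pd.read_csv("deidentified_release.csv")   # hypothetical file
at_risk = risky_groups(released)
print(f"{len(at_risk)} quasi-identifier combinations fall below k=5")
```

Checks like this do not guarantee safety, since new linkage vectors keep appearing, as the examples above show, but they make the known ones visible before publication.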

Identify possible vectors of reidentification in your data. Work to minimize them in your published results to the greatest extent possible.

4. Practice ethical data sharing

For some projects, sharing data is an expectation of the human participants involved and thus a key part of ethical research. For example, in rare genetic disease research, biological samples are shared in the hope of finding cures, making dissemination a condition of participation. In other projects, questions of the larger public good—an admittedly difficult to define category—provide compelling arguments for sharing data, e.g., the NIH-sponsored database of Genotypes and Phenotypes (dbGaP), which makes deidentified genomic data widely available to researchers, democratizing access, or the justice claim made by the Institute of Medicine about the value of mandating that individual-level data from clinical trials be shared among researchers [ 28 ]. Asking participants for broad, as opposed to narrowly structured consent for downstream data management makes it easier to share data. Careful research design and guidance from IRBs can help clarify consent processes. However, we caution that even when broad consent was obtained upfront, researchers should consider the best interests of the human participant, proactively considering the likelihood of privacy breaches and reidentification issues. This is of particular concern for human DNA data, which is uniquely identifiable.

These types of projects, however—in which rules of use and sharing are well governed by informed consent and right of withdrawal—are increasingly the exception rather than the rule for big data. In our digital society, we are followed by data clouds composed of the trace elements of daily life—credit card transactions, medical test results, closed-circuit television (CCTV) images and video, smart phone apps, etc.—collected under mandatory terms of service rather than responsible research design overseen by university compliance officers. While we might wish to have the standards of informed consent and right of withdrawal, these informal big data sources are gathered by agents other than the researcher—private software companies, state agencies, and telecommunications firms. These data are only accessible to researchers after their creation, making it impossible to gain informed consent a priori, and contacting the human participants retroactively for permission is often forbidden by the owner of the data or is impossible to do at scale.

Of course, researchers within software companies and state institutions collecting these data have a special responsibility to address the terms under which data are collected; but that does not exempt the end-user of shared data. In short, the burden of ethical use (see Rules 1 to 3) and sharing is placed on the researcher, since the terms of service under which the human subjects’ data were produced can often be extremely broad with little protection for breaches of privacy. In these circumstances, researchers must balance the requirements from funding agencies to share data [ 29 ] with their responsibilities to the human beings behind the data they acquired. A researcher needs to inform funding agencies about possible ethical concerns before the research begins and guard against reidentification before sharing.
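
Where consent information is recorded at all, one practical safeguard before sharing is to filter records by whether their recorded consent scope covers the intended use. The sketch below is a generic illustration; the field name, scope labels, and file are assumptions rather than any standard schema.

```python
# Sketch: share only records whose recorded consent covers the intended use.
# The 'consent_scope' field and the scope labels are illustrative assumptions.
import pandas as pd

ALLOWED_SCOPES = {
    "secondary_analysis": {"broad"},
    "health_research": {"broad", "health_research_only"},
}

def shareable(df: pd.DataFrame, intended_use: str) -> pd.DataFrame:
    ok = ALLOWED_SCOPES.get(intended_use, set())
    return df[df["consent_scope"].isin(ok)]

cohort = pd.read_csv("cohort_with_consent.csv")       # hypothetical file
to_share = shareable(cohort, intended_use="secondary_analysis")
```

For informally collected big data with no such field, the absence of a consent record is itself the signal: the burden shifts to the researcher to justify sharing at all.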

Share data as specified in research protocols, but proactively address concerns of potential harm from informally collected big data.

5. Consider the strengths and limitations of your data; big does not automatically mean better

In order to do both accurate and responsible big data research, it is important to ground datasets in their proper context, including conflicts of interest. Context also affects every stage of research: from data acquisition, to cleaning, to interpretation of findings, and dissemination of the results. During the step of data acquisition, it is crucial to understand both the source of the data and the rules and regulations under which they were gathered. This is especially important for research conducted in relatively loose regulatory environments, in which the use of data to answer research questions may conflict with the expectations of those who provided the data. One possible approach might be the ethical norms employed to track the provenance of artifacts, often in cooperation and collaboration with the communities from which they come (e.g., archaeologists working with indigenous communities to determine the disposition of material culture). In a similar manner, computer scientists use data lineage techniques to track the evolution of a dataset and often to trace bugs in the data.

Being mindful of the data’s context provides the foundation for clarifying when your data and analysis are working and when they are not. While it is tempting to interpret findings based on big data as a clear outcome, a key step within scientific research is clearly articulating what data or an indicator represent and what they do not. Are your findings as clear-cut if your interpretation of a social media posting switches from a recording of fact to the performance of a social identity? Given the messy, almost organic nature of many datasets derived from social actions, it is fundamental that researchers be sensitive to the potential multiple meanings of data.

For example, is a Facebook post or an Instagram photo best interpreted as an approval/disapproval of a phenomenon, a simple observation, or an effort to improve status within a friend network? While any of these interpretations are potentially valid, the lack of context makes it even more difficult to justify the choice of one understanding over another. Reflecting on the potential multiple meanings of data fosters greater clarity in research hypotheses and also makes researchers aware of the other potential uses of their data. Again, the act of interpretation is a human process, and because the judgments of those (re)using your data may differ from your own, it is essential to clarify both the strengths and shortcomings of the data.

Document the provenance and evolution of your data. Do not overstate clarity; acknowledge messiness and multiple meanings.

6. Debate the tough, ethical choices

Research involving human participants at federally funded institutions is governed by IRBs charged with preventing harm through well-established procedures and are familiar to many researchers. IRBs, however, are not the sole arbiter of ethics; many ethical issues involving big data are outside of their governance mandate. Precisely because big data researchers often encounter situations that are foreign to or outside of the mandate of IRBs, we emphasize the importance of debating the issues within groups of peers.

Rather than a bug, the lack of clear-cut solutions and governance protocols should be more appropriately understood as a feature that researchers should embrace within their own work. Discussion and debate of ethical issues is an essential part of professional development—both within and between disciplines—as it can establish a mature community of responsible practitioners. Bringing these debates into coursework and training can produce peer reviewers who are particularly well placed to raise these ethical questions and spur recognition of the need for these conversations.

A precondition of any formal ethics rules or regulations is the capacity to have such open-ended debates. As digital social scientist and ethicist Annette Markham [ 30 ] writes, “we can make [data ethics] an easier topic to broach by addressing ethics as being about choices we make at critical junctures; choices that will invariably have impact.” Given the nature of big data, bringing technical, scientific, social, and humanistic researchers together on projects enables this debate to emerge as a strength because, if done well, it provides the means to understand the ethical issues from a range of perspectives and disrupt the silos of disciplines [ 31 ]. There are a number of good models for interdisciplinary ethics research, such as the trainings offered by the Science and Justice research center at the University of California, Santa Cruz [ 32 ] and Values in Design curricula [ 33 ]. Research ethics consultation services, available at some universities as a result of the Clinical and Translational Science Award (CTSA) program of the National Institutes of Health (NIH), can also be resources for researchers [ 34 ].

Some of the better-known “big data” ethical cases—i.e., the Facebook emotional contagion study [ 35 ]—provide extremely productive venues for cross-disciplinary discussions. Why might one set of scholars see this as a relatively benign approach while other groups see significant ethical shortcomings? Where do researchers differ in drawing the line between responsible and irresponsible research and why? Understanding the different ways people discuss these challenges and processes provides an important check for researchers, especially if they come from disciplines not focused on human subject concerns.

Moreover, the high visibility surrounding these events means that (for better or worse) they represent the “public” view of big data research, and becoming an active member of this conversation ensures that researchers can give voice to their insights rather than simply being at the receiving end of policy decisions. In an effort to help these debates along, the Council for Big Data, Ethics, and Society has produced a number of case studies focused specifically on big data research and a white paper with recommendations to start these important conversations ( http://bdes.datasociety.net/output/ ).

Engage your colleagues and students about ethical practice for big data research.

7. Develop a code of conduct for your organization, research community, or industry

The process of debating tough choices inserts ethics directly into the workflow of research, making “faking ethics” as unacceptable as faking data or results. Internalizing these debates, rather than treating them as an afterthought or a problem to outsource, is key for successful research, particularly when using trace data produced by people. This is relevant for all research including those within industry who have privileged access to the data streams of digital daily life. Public attention to the ethical use of these data should not be avoided; after all, these datasets are based on an infrastructure that billions of people are using to live their lives, and there is a compelling public interest that research is done responsibly.

One of the best ways to cement this in daily practice is to develop codes of conduct for use in your organization or research community and for inclusion in formal education and ongoing training. The codes can provide guidance in peer review of publications and in funding consideration. In practice, a highly visible case of unethical research brings problems to an entire field, not just to those directly involved. Moreover, designing codes of conduct makes researchers more successful. Issues that might otherwise be ignored until they blow up—e.g., Are we abiding by the terms of service or users’ expectations? Does the general public consider our research “creepy”? [ 13 ]—can be addressed thoughtfully rather than in a scramble for damage control. This is particularly relevant to public-facing private businesses interested in avoiding potentially unfavorable attention.

An additional, longer-term advantage of developing codes of conduct is preparedness: change is clearly coming to big data research. The NSF funded the Council for Big Data, Ethics, and Society as a means of getting in front of a developing issue and pending regulatory changes within federal rules for the protection of human subjects that are currently under review [ 1 ]. Actively developing rules for responsible big data research within a research community is a key way researchers can join this ongoing process.

Establish appropriate codes of ethical conduct within your community. Make industry researchers and representatives of affected communities active contributors to this process.

8. Design your data and systems for auditability

Although codes of conduct will vary depending on the topic and research community, a particularly important element is designing data and systems for auditability. Responsible internal auditing processes flow easily into audit systems and also keep track of factors that might contribute to problematic outcomes. Developing automated testing processes for assessing problematic outcomes, and mechanisms for auditing others’ work during review processes, can help strengthen research as a whole. The goal of auditability is to clearly document when decisions are made and, if necessary, backtrack to an earlier dataset and address the issue at the root (e.g., if strategies for anonymizing data are compromised).

Designing for auditability also brings direct benefits to researchers by providing a mechanism for double-checking work and forcing oneself to be explicit about decisions, increasing understandability and replicability. For example, many types of social media and other trace data are unstructured, and answers to even basic questions such as network ties, location, and randomness depend on the steps taken to collect and collate data. Systems of auditability clarify how different datasets (and the subsequent analysis) differ from each other, aiding understanding and creating better research.
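As a concrete illustration of what designing for auditability can mean in practice, here is a minimal sketch in Python (not taken from the article; the class and method names such as AuditTrail and record_step are hypothetical) of an append-only provenance log. Each entry records what was done to a dataset, why, by whom, and a hash of the result, so that later analyses can be traced back to specific decisions or rolled back to an earlier version.

import hashlib
import json
from datetime import datetime, timezone

class AuditTrail:
    """Append-only record of the processing decisions applied to a dataset."""

    def __init__(self, dataset_name):
        self.dataset_name = dataset_name
        self.entries = []

    def record_step(self, action, rationale, operator, data_snapshot):
        # Log one processing decision (e.g., anonymization, filtering),
        # including a fingerprint of the resulting data.
        digest = hashlib.sha256(
            json.dumps(data_snapshot, sort_keys=True).encode("utf-8")
        ).hexdigest()
        self.entries.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "dataset": self.dataset_name,
            "action": action,          # what was done
            "rationale": rationale,    # why it was done
            "operator": operator,      # who made the decision
            "data_sha256": digest,     # fingerprint of the result
        })

    def export(self, path):
        # Write the full trail to disk so reviewers or auditors can inspect it.
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.entries, fh, indent=2)

# Usage sketch: log one decision made while preparing a toy dataset.
trail = AuditTrail("survey_responses_v1")
raw = [{"user": "u1", "age": 34}, {"user": "u2", "age": 29}]
pseudonymized = [
    {"user": hashlib.sha256(r["user"].encode()).hexdigest()[:8], "age": r["age"]}
    for r in raw
]
trail.record_step("pseudonymize user IDs",
                  "reduce re-identification risk before analysis",
                  "analyst_01", pseudonymized)
trail.export("audit_trail.json")

A trail of this kind doubles as the mechanism for double-checking work described above: it makes decisions explicit and comparable across versions of a dataset.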

Plan for and welcome audits of your big data practices.

9. Engage with the broader consequences of data and analysis practices

It is also important for responsible big data researchers to think beyond the traditional metrics of success in business and the academy. For example, the energy demands for digital daily life, a key source of big data for social science research, are significant in this era of climate change [ 36 ]. How might big data research lessen the environmental impact of data analytics work? For example, should researchers take the lead in asking cloud storage providers and data processing centers to shift to sustainable and renewable energy sources? As important and publicly visible users of the cloud, big data researchers collectively represent an interest group that could rally behind such a call for change.

The pursuit of citations, reputation, or money is a key incentive for pushing research forward, but it can also result in unintended and undesirable outcomes. In contrast, we might ask to what extent a research project is focused on enhancing the public good or serving the underserved of society. Are questions about equity or promoting other public values being addressed in one’s data streams, or is a big data focus rendering them invisible or irrelevant to your analysis [ 37 ]? How can increasingly vulnerable yet fundamentally important public resources—such as state-mandated cancer registries—be protected? How might research aid or inhibit different business and political actors? While all big data research need not take up social and cultural questions, a fundamental aim of research goes beyond understanding the world to considering ways to improve it.

Recognize that doing big data research has societal-wide effects.

10. Know when to break these rules

The final (and counterintuitive) rule is the charge to recognize when it is appropriate to stray from these rules. For example, in times of natural disaster or a public health emergency, it may be important to temporarily put aside questions of individual privacy in order to serve a larger public good. Likewise, the use of genetic or other biological data collected without informed consent might be vital in managing an emerging disease epidemic.

Be sure, moreover, to review the regulatory expectations and legal demands associated with the protection of privacy within your dataset. This is an exceedingly slippery slope, so before following this rule (to break others), make certain that the “emergency” is not simply a convenient justification. The best way to ensure this is to build experience in engaging in the tough debates (Rule 6), constructing codes of conduct (Rule 7), and developing systems for auditing (Rule 8). The more mature a community of researchers is about its processes, checks, and balances, the better equipped it is to assess when breaking the rules is acceptable. It may very well be that you do not come to a final, clear set of practices. After all, just as privacy is not binary (Rule 2), neither is responsible research. Ethics is often about finding a good or better, but not perfect, answer, and it is important to ask (and try to answer) the challenging questions. Only through this engagement can a culture of responsible big data research emerge.

Understand that responsible big data research depends on more than meeting checklists.

The goal of this set of ten rules is to help researchers do better work and ultimately become more successful while avoiding larger complications, including public mistrust. To achieve this, however, scholars must shift from a mindset that is rigorous when focused on techniques and methodology and naïve when it comes to ethics. Statements to the effect that “Data is [sic] already public” [ 38 ] are unjustified simplifications of much more complex data ecosystems embedded in even more complex and contingent social practices. Data are people, and to maintain a rigorously naïve definition to the contrary [ 18 ] will end up harming research efforts in the long run as pushback comes from the people whose actions and utterances are subject to analysis.

In short, responsible big data research is not about preventing research but making sure that the work is sound, accurate, and maximizes the good while minimizing harm. The problems and choices researchers face are real, complex, and challenging and so too must be our response. We must treat big data research with the respect that it deserves and recognize that unethical research undermines the production of knowledge. Fantastic opportunities to better understand society and our world exist, but with these opportunities also come the responsibility to consider the ethics of our choices in the everyday practices and actions of our research. The Council for Big Data, Ethics, and Society ( http://bdes.datasociety.net/ ) provides an initial set of case studies, papers, and even ten simple rules for guiding this process; it is now incumbent on you to use and improve these in your research.

Acknowledgments

This article benefited from the input of Geoff Bowker and Helen Nissenbaum.

  • 1. Metcalf J, boyd d, Keller E. Perspectives on Big Data, Ethics, and Society. Council for Big Data, Ethics, and Society. 2016. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/ . Accessed 31 May 2016.
  • 5. Danyllo WA, Alisson VB, Alexandre ND, Moacir LM, Jansepetrus BP, Oliveira RF. Identifying relevant users and groups in the context of credit analysis based on data from Twitter. In: Cloud and Green Computing (CGC), 2013 Third International Conference on; 2013 Sep 30. pp. 587–592. IEEE.
  • 6. Angwin J, Larson J, Mattu S, Kirchner L. Machine bias. Pro Publica. 23 May 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing . Accessed 4 September 2016.
  • 7. Ingold D, Spencer S. Amazon Doesn’t Consider the Race of Its Customers. Should It? Bloomberg.com 21 April 2016. http://www.bloomberg.com/graphics/2016-amazon-same-day/ . Accessed 12 June 2016.
  • 9. Metcalf J, Crawford K. Where are Human Subjects in Big Data Research? The Emerging Ethics Divide. Big Data and Society, 2016.
  • 11. Nissenbaum H. Privacy in context: Technology, policy, and the integrity of social life. Stanford University Press; 2009.
  • 12. Marwick AE, boyd d. Networked privacy: How teenagers negotiate context in social media. New Media & Society. 2014:1461444814543995.
  • 14. Massey DS, Denton NA. American apartheid: Segregation and the making of the underclass. Harvard University Press; 1993.
  • 15. Davidow B. Redlining for the 21st Century. The Atlantic. 5 March 2014. http://www.theatlantic.com/business/archive/2014/03/redlining-for-the-21st-century/284235/ . Accessed 31 May 2016.
  • 17. Barbaro M, Zeller T, Hansell S. A face is exposed for AOL searcher no. 4417749. New York Times. 2006 Aug 9;9.
  • 18. Cox J. 70,000 OkCupid Users Just Had Their Data Published. Motherboard. 12 May 2016. http://motherboard.vice.com/read/70000-okcupid-users-just-had-their-data-published . Accessed 12 June 2016.
  • 19. Pandurangan V. On Taxis and Rainbows: Lessons from NYC’s improperly anonymized taxi logs. Medium. 2014. https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1 . Accessed 10 November 2015.
  • 22. Kloumann IM, Kleinberg JM. Community membership identification from small seed sets. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2014 Aug 24. pp. 1366–1375. ACM.
  • 23. Narayanan A, Huey J, Felten EW. A precautionary approach to big data privacy. In Data protection on the move 2016 (pp. 357–385). Springer Netherlands.
  • 24. Michalevsky Y, Schulman A, Veerapandian GA, Boneh D, Nakibly G. PowerSpy: Location tracking using mobile device power analysis. In: 24th USENIX Security Symposium (USENIX Security 15); 2015. pp. 785–800.
  • 30. Markham A. OKCupid data release fiasco: It’s time to rethink ethics education. Points (Medium). 18 May 2016. https://points.datasociety.net/okcupid-data-release-fiasco-ba0388348cd#.g4ofbpnc6 . Accessed 12 June 2016.
  • 36. Cook G, Dowdall T, Pomerantz D, Wang Y. Clicking clean: how companies are creating the green internet. Greenpeace Inc., Washington, DC. 2014. http://www.greenpeace.org/usa/wp-content/uploads/legacy/Global/usa/planet3/PDFs/clickingclean.pdf
  • 38. Zimmer M. OkCupid Study Reveals the Perils of Big-Data Science. Wired. 14 May 2016. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/ . Accessed 12 June 2016.


Ethics review of big data research: What should stay and what should be reformed?

Agata Ferretti

1 Health Ethics and Policy Lab, Department of Health Sciences and Technology, ETH Zürich, Hottingerstrasse 10 (HOA), 8092 Zürich, Switzerland

Marcello Ienca

Mark Sheehan

2 The Ethox Centre, Department of Population Health, University of Oxford, Oxford, UK

Alessandro Blasimme

Edward S. Dove

3 School of Law, University of Edinburgh, Edinburgh, UK

Bobbie Farsides

4 Brighton and Sussex Medical School, Brighton, UK

Phoebe Friesen

5 Biomedical Ethics Unit, Department of Social Studies of Medicine, McGill University, Montreal, Canada

6 Johns Hopkins Berman Institute of Bioethics, Baltimore, USA

Walter Karlen

7 Mobile Health Systems Lab, Department of Health Sciences and Technology, ETH Zürich, Zürich, Switzerland

Peter Kleist

8 Cantonal Ethics Committee Zürich, Zürich, Switzerland

S. Matthew Liao

9 Center for Bioethics, Department of Philosophy, New York University, New York, USA

Camille Nebeker

10 Research Center for Optimal Digital Ethics in Health (ReCODE Health), Herbert Wertheim School of Public Health and Longevity Science, University of California, San Diego, USA

Gabrielle Samuel

11 Department of Global Health and Social Medicine, King’s College London, London, UK

Mahsa Shabani

12 Faculty of Law and Criminology, Ghent University, Ghent, Belgium

Minerva Rivas Velarde

13 Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland

Effy Vayena

Associated data

Not applicable.

Ethics review is the process of assessing the ethics of research involving humans. The Ethics Review Committee (ERC) is the key oversight mechanism designated to ensure ethics review. Whether or not this governance mechanism is still fit for purpose in the data-driven research context remains a debated issue among research ethics experts.

In this article, we seek to address this issue in a twofold manner. First, we review the strengths and weaknesses of ERCs in ensuring ethical oversight. Second, we map these strengths and weaknesses onto specific challenges raised by big data research. We distinguish two categories of potential weakness. The first category concerns persistent weaknesses, i.e., those which are not specific to big data research, but may be exacerbated by it. The second category concerns novel weaknesses, i.e., those which are created by and inherent to big data projects. Within this second category, we further distinguish between purview weaknesses related to the ERC’s scope (e.g., how big data projects may evade ERC review) and functional weaknesses, related to the ERC’s way of operating. Based on this analysis, we propose reforms aimed at improving the oversight capacity of ERCs in the era of big data science.

Conclusions

We believe the oversight mechanism could benefit from these reforms because they will help to overcome data-intensive research challenges and consequently benefit research at large.

Background

The debate about the adequacy of the Ethics Review Committee (ERC) as the chief oversight body for big data studies is partly rooted in the historical evolution of the ERC. Particularly relevant is the ERC’s changing response to new methods and technologies in scientific research. ERCs—also known as Institutional Review Boards (IRBs) or Research Ethics Committees (RECs)—came into existence in the 1950s and 1960s [ 1 ]. Their original mission was to protect the interests of human research participants, particularly through an assessment of potential harms to them (e.g., physical pain or psychological distress) and benefits that might accrue from the proposed research. ERCs expanded in scope during the 1970s, from participant protection towards ensuring valuable and ethical human subject research (e.g., having researchers implement an informed consent process), as well as supporting researchers in exploring their queries [ 2 ].

Fast forward fifty years, and a lot has changed. Today, biomedical projects leverage unconventional data sources (e.g., social media), partially inscrutable data analytics tools (e.g., machine learning), and unprecedented volumes of data [ 3 – 5 ]. Moreover, the evolution of research practices and new methodologies such as post-hoc data mining have blurred the concept of the ‘human subject’ and elicited a shift towards the concept of the ‘data subject’—as attested in data protection regulations [ 6 , 7 ]. With data protection and privacy concerns being in the spotlight of big data research review, language from data protection laws has worked its way into the vocabulary of research ethics. This terminological shift further reveals that big data, together with modern analytic methods used to interpret the data, creates novel dynamics between researchers and participants [ 8 ]. Research data repositories about individuals and aggregates of individuals are considerably expanding in size. Researchers can remotely access and use large volumes of potentially sensitive data without communicating or actively engaging with study participants. Consequently, participants become more vulnerable and subjected to the research itself [ 9 ]. As such, the nature of risk involved in this new form of research changes too. In particular, it moves from the risk of physical or psychological harm towards the risk of informational harm, such as privacy breaches or algorithmic discrimination [ 10 ]. This is the case, for instance, with projects using data collected through web search engines, mobile and smart devices, entertainment websites, and social media platforms. The fact that health-related research is leaving hospital labs and spreading into online space creates novel opportunities for research, but also raises novel challenges for ERCs. For this reason, it is important to re-examine the fit between new data-driven forms of research and existing oversight mechanisms [ 11 ].

The suitability of ERCs in the context of big data research is not merely a theoretical puzzle but also a practical concern resulting from recent developments in data science. In 2014, for example, the so-called ‘emotional contagion study’ received severe criticism for avoiding ethical oversight by an ERC, failing to obtain research consent, violating privacy, inflicting emotional harm, discriminating against data subjects, and placing vulnerable participants (e.g., children and adolescents) at risk [ 12 , 13 ]. In both public and expert opinion [ 14 ], a responsible ERC would have rejected this study because it contravened the research ethics principles of preventing harm (in this case, emotional distress) and adequately informing data subjects. However, the protocol adopted by the researchers was not required to undergo ethics review under US law [ 15 ] for two reasons. First, the data analyzed were considered non-identifiable, and researchers did not engage directly with subjects, exempting the study from ethics review. Second, the study team included both scientists affiliated with a public university (Cornell) and Facebook employees. The affiliation of the researchers is relevant because—in the US and some other countries—privately funded studies are not subject to the same research protections and ethical regulations as publicly funded research [ 16 ]. An additional example is the 2015 case in which the United Kingdom (UK) National Health Service (NHS) shared 1.6 million pieces of identifiable and sensitive data with Google DeepMind. This data transfer from the public to the private party took place legally, without the need for patient consent or ethics review oversight [ 17 ]. These cases demonstrate how researchers can pursue potentially risky big data studies without falling under the ERC’s purview. The limitations of the regulatory framework for research oversight are evident, in both private and public contexts.

The gaps in the ERC’s regulatory process, together with the increased sophistication of research contexts—which now include a variety of actors such as universities, corporations, funding agencies, public institutes, and citizens’ associations—have led to an increase in the range of oversight bodies. For instance, besides traditional university ethics committees and national oversight committees, funding agencies and national research initiatives have increasingly created internal ethics review boards [ 18 , 19 ]. New participatory models of governance have emerged, largely due to an increase in subjects’ requests to control their own data [ 20 ]. Corporations are creating research ethics committees as well, modelled after the institutional ERC [ 21 ]. In May 2020, for example, Facebook welcomed the first members of its Oversight Board, whose aim is to review the company’s decisions about content moderation [ 22 ]. Whether this increase in oversight models is motivated by the urge to fill the existing regulatory gaps, or whether it is just ‘ethics washing’, is still an open question. However, other types of specialized committees have already found their place alongside ERCs, when research involves international collaboration and data sharing [ 23 ]. Among others, data safety monitoring boards, data access committees, and responsible research and innovation panels serve the purpose of covering research areas left largely unregulated by current oversight [ 24 ].

The data-driven digital transformation challenges the purview and efficacy of ERCs. It also raises fundamental questions concerning the role and scope of ERCs as the oversight body for ethical and methodological soundness in scientific research. 1 Among these questions, this article will explore whether ERCs are still capable of their intended purpose, given the range of novel (maybe not categorically new, but at least different in practice) issues that have emerged in this type of research. To answer this question, we explore some of the challenges that the ERC oversight approach faces in the context of big data research and review the main strengths and weaknesses of this oversight mechanism. Based on this analysis, we will outline possible solutions to address current weaknesses and improve ethics review in the era of big data science.

Strengths of the ethics review via ERC

Historically, ERCs have enabled cross-disciplinary exchange and assessment [ 27 ]. ERC members typically come from different backgrounds and bring their perspectives to the debate; when multi-disciplinarity is achieved, the mixture of expertise provides the conditions for a solid assessment of advantages and risks associated with new research. Committees which include members from a variety of backgrounds are also suited to promote projects from a range of fields, and research that cuts across disciplines [ 28 ]. Within these committees, the reviewers’ expertise can be paired with a specific type of content to be reviewed. This one-to-one match can bring timely and, ideally, useful feedback [ 29 ]. In many countries (e.g., European countries, the United States (US), Canada, Australia), ERCs are explicitly mandated by law to review many forms of research involving human participants; moreover, these laws also describe how such a body should be structured and the purview of its review [ 30 , 31 ]. In principle, ERCs also aim to be representative of society and the research enterprise, including members of the public and minorities, as well as researchers and experts [ 32 ]. And in performing a gatekeeping function to the research enterprise, ERCs play an important role: they recognize that both experts and lay people should have a say, with different views to contribute [ 33 ].

Furthermore, the ERC model strives to ensure independent assessment. The fact that ERCs assess projects “from the outside” and maintain a certain degree of objectivity towards what they are reviewing reduces the risk of overlooking research issues and of conflicts of interest. Being institutionally distinct—for example, established by an organization that is separate from the researcher or the research sponsor—brings added value to the research itself because it further lessens the risk of conflict of interest. Conflict of interest is a serious issue in research ethics because it can compromise the judgment of reviewers. Institutionalized review committees might particularly suffer from political interference. This is the case, for example, for universities and health care systems (like the NHS), which tend to engage “in house” experts as ethics board members. However, ERCs that can prove themselves independent are considered more trustworthy by the general public and data subjects; it is reassuring to know that an independent committee is overseeing research projects [ 34 ].

The ex-ante (or pre-emptive) ethical evaluation of research studies is considered by many to be the standard procedural approach of ERCs [ 35 ]. Though the literature is divided on the usefulness and added value provided by this form of review [ 36 , 37 ], ex-ante review is commonly used as a mechanism to ensure the ethical validity of a study design before the research is conducted [ 38 , 39 ]. Early research scrutiny aims at risk mitigation: the ERC evaluates potential research risks and benefits, in order to protect participants’ physical and psychological well-being, dignity, and data privacy. This practice saves researchers’ resources and valuable time by preventing the pursuit of unethical or illegal paths [ 40 ]. Finally, the ex-ante ethical assessment gives researchers an opportunity to receive feedback from ERCs, whose competence and experience may improve the research quality and increase public trust in the research [ 41 ].

All strengths mentioned in this section are strengths of the ERC model in principle. In practice, there are many ERCs that are not appropriately interdisciplinary or representative of the population and minorities, that lack independence from the research being reviewed, and that fail to improve research quality, and may in fact hinder it. We now turn to consider some of these weaknesses in more detail.

Weaknesses of the ethics review via ERC

In order to assess whether ERCs are adequately equipped to oversee big data research, we must consider the weaknesses of this model. We identify two categories of weaknesses which are described in the following section and summarized in Fig.  1 :

  • Persistent weaknesses : those existing in the current oversight system, which could be exacerbated by big data research
  • Novel weaknesses : those created by and inherent to big data projects

Within this second category of novel weaknesses, we further differentiate between:

  • Purview weaknesses : reasons why some big data projects may bypass the ERCs’ purview
  • Functional weaknesses : reasons why some ERCs may be inadequate to assess big data projects specifically

Fig. 1 Weaknesses of the ERCs

We base the conceptual distinction between persistent and novel weaknesses on the fact that big data research diverges from traditional biomedical research in many respects. As previously mentioned, big data projects are often broad in scope, involve new actors, use unprecedented methodologies to analyze data, and require specific expertise. Furthermore, the peculiarities of big data itself (e.g., being large in volume and from a variety of sources) make data-driven research different in practice from traditional research. However, we should not consider the category of “novel weaknesses” a closed category. We do not argue that weaknesses mentioned here do not, at least partially, overlap with others which already exist. In fact, in almost all cases of ‘novelty’, (i) there is some link back to a concept from traditional research ethics, and (ii) some thought has been given to the issue outside of a big data or biomedical context (e.g., the problem of ERCs’ expertise has arisen in other fields [ 42 ]). We believe that by creating conceptual clarity about novel oversight challenges presented by big data research, we can begin to identify tailored reforms.

Persistent weaknesses

As regulation for research oversight varies between countries, ERCs often suffer from a lack of harmonization. This weakness in the current oversight mechanism is compounded by big data research, which often relies on multi-center international consortia. These consortia in turn depend on approval by multiple oversight bodies demanding different types of scrutiny [ 43 ]. Furthermore, big data research may give rise to collaborations between public bodies, universities, corporations, foundations, and citizen science cooperatives. In this network, each stakeholder has different priorities and depends upon its own rules for regulation of the research process [ 44 – 46 ]. Indeed, this expansion of regulatory bodies and aims does not come with a coordinated effort towards agreed-upon review protocols [ 47 ]. The lack of harmonization is perpetuated by academic journals and funding bodies with diverging views on the ethics of big data. If the review bodies which constitute the “ethics ecosystem” [ 19 ] do not agree to the same ethics review requirements, a big data project deemed acceptable by an ERC in one country may be rejected by another ERC, within or beyond the national borders.

In addition, there is inconsistency in the assessment criteria used within and across committees. Researchers report subjective bias in the evaluation methodology of ERCs, as well as variations in ERC judgements which are not based on morally relevant contextual considerations [ 48 , 49 ]. Some authors have argued that the probability of research acceptance among experts increases if some research peer or same-field expert sits on the evaluation committee [ 50 , 51 ]. The judgement of an ERC can also be influenced by the boundaries of the scientific knowledge of its members. These boundaries can impact the ERC’s approach towards risk taking in unexplored fields of research [ 52 ]. Big data research might worsen this problem since the field is relatively new, with no standardized metric to assess risk within and across countries [ 53 ]. The committees do not necessarily communicate with each other to clarify their specific role in the review process, or try to streamline their approach to the assessment. This results in unclear oversight mandates and inconsistent ethical evaluations [ 27 , 54 ].

Additionally, ERCs may fall short in their efforts to justly redistribute the risks and benefits of research. The current review system is still primarily tilted toward protecting the interests of individual research participants. ERCs do not consistently assess societal benefit, or risks and benefits in light of the overall conduct of research (balancing risks for the individual with collective benefits). Although demands on ERCs vary from country to country [ 55 ], the ERC approach is still generally tailored towards traditional forms of biomedical research, such as clinical trials and longitudinal cohort studies with hospital patients. These studies are usually narrow in scope and carry specific risks only for the participants involved. In contrast, big data projects can impact society more broadly. As an example, computational technologies have shown potential to determine individuals’ sexual orientation by screening facial images [ 56 ]. An inadequate assessment of the common good resulting from this type of study can be socially detrimental [ 57 ]. In this sense, big data projects resemble public health research studies, with an ethical focus on the common good over individual autonomy [ 58 ]. Within this context, ERCs have an even greater responsibility to ensure the just distribution of research benefits across the population. Accurately determining the social value of big data research is challenging, as negative consequences may be difficult to detect before research begins. Nevertheless, this task remains a crucial objective of research oversight.

The literature reports examples of the failure of ERCs to be accountable and transparent [ 59 ]. This might be the result of an already unclear role of ERCs. Indeed, ERCs’ practices are an outcome of different levels of legal, ethical, and professional regulations, which largely vary across jurisdictions. Therefore, some ERCs might function as peer counselors, others as independent advisors, and still others as legal controllers. What seems to be common across countries, though, is that ERCs rarely disclose their procedures, policies, and decision-making process. The ERCs’ “secrecy” can result in an absence of trust in the ethical oversight model [ 60 ]. This is problematic because ERCs rely on public acceptance as accountable and trustworthy entities [ 61 ]. In big data research, as the number of data subjects is exponentially greater, a lack of accountability and an opaque deliberative process on the part of ERCs might bring even more significant public backlash. Ensuring truthfulness of the stated benefits and risks of research is a major determinant of trust in both science and research oversight. Researchers are another category of stakeholders negatively impacted by poor communication and publicity on the part of the ERC. Commentators have shown that ERCs often do not clearly provide guidance about the ethical standards applied in the research review [ 62 ]. For instance, if researchers provide unrealistic expectations of privacy and security to data subjects, ERCs have an institutional responsibility to flag those promises (e.g., about data security and the secondary uses of subject data), especially when the research involves personal and highly sensitive data [ 63 ]. For their part, however, ERCs should make their expectations and decision-making processes clear.

Finally, ERCs face the increasing issue of being overwhelmed by the number of studies to review [ 64 , 65 ]. Whereas ERCs originally reviewed only human subjects research happening in natural sciences and medicine, over time they also became the ethical body of reference for those conducting human research in the social sciences (e.g., in behavioral psychology, educational sciences, etc.). This increase in demand creates pressure on ERC members, who often review research pro bono and on a voluntary basis. The wide range of big data research could exacerbate this existing issue. Having more research to assess and less time to accomplish the task may negatively impact the quality of the ERC’s output, as well as increase the time needed for review [ 66 ]. Consequently, researchers might carry out potentially risky studies because the relevant ethical issues of those studies were overlooked. Furthermore, research itself could be significantly delayed, until it loses its timely scientific value.

Novel weaknesses: purview weaknesses

To determine whether the ERC is still the most fit-for-purpose entity to oversee big data research, it is important to establish under which conditions big data projects fall under the purview of ERCs.

Historically, research oversight has primarily focused on human subject research in the biomedical field, using public funding. In the US, for instance, each review board is responsible for a subtype of research based on content or methodology (for example, there are IRBs dedicated to validating clinical trial protocols, assessing cancer treatments, examining pediatric research, and reviewing qualitative research). This traditional ethics review structure cannot accommodate big data research [ 2 ]. Big data projects often reach beyond a single institution, cut across disciplines, involve data collected from a variety of sources, re-use data not originally collected for research purposes, combine diverse methodologies, orient towards population-level research, rely on large data aggregates, and emerge from collaboration with the private sector. Given this scenario, big data projects are likely to fall beyond the purview of ERCs.

Another case in which big data research does not fall under ERC purview is when it relies on anonymized data. If researchers use data that cannot be traced back to subjects (anonymized or non-personal data), then according to both the US Common Rule and HIPAA regulations, the project is considered safe enough to be granted an ethics review waiver. If instead researchers use pseudonymized (or de-identified) data, they must apply for research ethics review, as in principle the key that links the de-identified data with subjects could be revealed or hacked, causing harm to subjects. In the European Union, it would be left to each Member State (and national laws or policies at local institutions) to define whether research using anonymized data should seek ethical review. This case shows once more that current research ethics regulation is relatively loose and disjointed across jurisdictions, and may leave areas where big data research is unregulated. In particular, the special treatment given to anonymized data comes from an emphasis on risk at the individual level. So far in the big data discourse, the concept of harm has been mainly linked to vulnerability in data protection. Therefore, if privacy laws are respected, and protection is built into the data system, researchers can prevent harmful outcomes [ 40 ]. However, this view is myopic as it does not include other misuses of data aggregates, such as group discrimination and dignitary harm. These types of harm are already emerging in the big data ecosystem, where anonymized data reveal health patterns of a certain sub-group, or computational technologies include strong racial biases [ 67 , 68 ]. Furthermore, studies using anonymized data should not be deemed oversight-free by default, as it is increasingly hard to anonymize data. Technological advancements might soon make it possible to re-identify individuals from aggregate data sets [ 69 ].
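To make the re-identification concern concrete, the following is a minimal sketch in Python (an illustration, not a method proposed in the article) of a k-anonymity check over quasi-identifiers. Records whose combination of indirect attributes is shared by fewer than k individuals remain effectively identifiable even after direct identifiers such as names are removed.

from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k=5):
    # Return the quasi-identifier combinations shared by fewer than k records.
    # Removing names or emails does not guarantee anonymity: rare combinations
    # of indirect attributes can still single people out.
    combos = Counter(
        tuple(rec[attr] for attr in quasi_identifiers) for rec in records
    )
    return {combo: count for combo, count in combos.items() if count < k}

# Toy example: "anonymized" records in which one individual is still unique.
records = [
    {"zip": "8001", "birth_year": 1980, "sex": "F", "diagnosis": "A"},
    {"zip": "8001", "birth_year": 1980, "sex": "F", "diagnosis": "B"},
    {"zip": "8032", "birth_year": 1955, "sex": "M", "diagnosis": "C"},
]
print(k_anonymity_violations(records, ["zip", "birth_year", "sex"], k=2))
# {('8032', 1955, 'M'): 1} -> this record is re-identifiable

Checks of this kind only measure one narrow dimension of risk; they do not address group-level harms such as the discriminatory patterns discussed above.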

The risks associated with big data projects also increase due to the variety of actors involved in research alongside university researchers (e.g., private companies, citizen science associations, bio-citizen groups, community workers cooperatives, foundations, and non-profit organizations) [ 70 , 71 ]. The novel aspect of health-related big data research compared with traditional research is that anyone who can access large amounts of data about individuals and build predictive models based on that data can now determine and infer the health status of a person without directly engaging with that person in a research program [ 72 ]. Facebook, for example, is carrying out a suicide prediction and prevention project, which relies exclusively on the information that users post on the social network [ 18 ]. Because this type of research is now possible, and the available ethics review model exempts many big data projects from ERC appraisal, gaps in oversight are growing [ 17 , 73 ]. Just as corporations can re-use publicly available datasets (such as social media data) to determine life insurance premiums [ 74 ], citizen science projects can be conducted without seeking research oversight [ 75 ]. Indeed, participant-led big data research (despite being increasingly common) is another area where the traditional oversight model is not effective [ 76 ]. In addition, ERCs might not take seriously research conducted outside academia or publicly funded institutions, and may therefore disregard review requests from actors outside the academic environment (e.g., citizen science groups or health tech start-ups) [ 77 ].

Novel weaknesses: functional weaknesses

Functional weaknesses are those related to the skills, composition, and operational activities of ERCs in relation to big data research.

From this functional perspective, we argue that the ex-ante review model might not be appropriate for big data research. Project assessment at the project design phase or at the data collection level is insufficient to address emerging challenges that characterize big data projects – especially as data, over time, could become useful for other purposes, and therefore be re-used or shared [ 53 ]. Limitations of the ex-ante review model have already become apparent in the field of genetic research [ 78 ]. In this context, biobanks must often undergo a second ethics assessment to authorize specific research uses, such as exome sequencing, of their primary data samples [ 79 ]. Similarly, in a case in which an ERC approved the original collection of sensitive personal data, a data access committee would ensure that the secondary uses are in line with original consent and ethics approval. However, if researchers collect data from publicly accessible platforms, they can potentially use and re-use data for research lawfully, without seeking data subject consent or ERC review. This is often the case in social media research. Social media data, which are collected by researchers or private companies using a form of broad consent, can be re-used by researchers to conduct additional analysis without ERC approval. It is not only the re-use of data that poses unforeseeable risks. The ex-ante approach might not be suitable to assess other stages of the data lifecycle [ 80 ], such as the deployment of machine learning algorithms.

Rather than re-using data, some big data studies build models on existing data (using data mining and machine learning methods), creating new data, which is then used to further feed the algorithms [ 81 ]. Sometimes it is not possible to anticipate which analytic models or tools (e.g., artificial intelligence) will be leveraged in the research. And even then, the nature of computational technologies which extract meaning from big data makes it difficult to anticipate all the correlations that will emerge from the analysis [ 37 ]. This is an additional reason that big data research often takes a tentative approach to a research question, instead of growing from a specific research hypothesis [ 82 ]. The difficulty of clearly framing big data research itself makes it even harder for ERCs to anticipate unforeseeable risks and potential societal consequences. Given the existing regulations and the intrinsic exploratory nature of big data projects, the mandate of ERCs does not appear well placed to guarantee research oversight. It seems even less so if we consider problems that might arise after the publication of big data studies, such as repurposing or dual-use issues [ 83 ].

ERCs also face the challenge of assessing the value of informed consent for big data projects. Re-obtaining consent from research subjects is impractical, particularly when using consumer-generated data (e.g., social media data) for research purposes. In these cases, researchers often rely on broad consent and consent waivers. This leaves the data subjects unaware of their participation in specific studies, and therefore unable to engage with the research as it progresses. As a result, the data subjects and the communities they represent become vulnerable to potential negative research outcomes. The tool of consent has limitations in big data research—it cannot disclose all possible future uses of data, in part because these uses may be unknown at the time of data generation. Moreover, researchers can access existing datasets multiple times and reuse the same data for alternative purposes [ 84 ]. What should the ERCs’ strategy be, given that the current model of informed consent leaves an ethical gap in big data projects? ERCs may be tempted to focus on the consent challenge, neglecting other pressing big data issues [ 53 ]. However, the literature reports an increasing number of authors who are against the idea of a new consent form for big data studies [ 5 ].

A final widely discussed concern is the ERC’s inadequate expertise in the area of big data research [ 85 , 86 ]. In the past, there have been questions about the technical and statistical expertise of ERC members. For example, ERCs have attempted to conform social science research to the clinical trial model, using the same knowledge and approach to review both types of research [ 87 ]. However, big data research poses further challenges to ERCs’ expertise. First, the distinct methodology of big data studies (based on data aggregation and mining) requires a specialized technical expertise (e.g., information systems, self-learning algorithms, and anonymization protocols). Indeed, big data projects have a strong technical component, due to data volume and sources, which brings specific challenges (e.g., collecting data outside traditional protocols on social media) [ 88 , 89 ]. Second, ERCs may be unfamiliar with new actors involved in big data research, such as citizen science actors or private corporations. Because of this lack of relevant expertise, ERCs may require unjustified amendments to research studies, or even reject big data projects tout-court [ 36 ]. Finally, ERCs may lose credibility as an oversight body capable of assessing ethical violations and research misconduct. In the past, ERCs solved this challenge by consulting independent experts in a relevant field when reviewing a protocol in that domain. However, this solution is not always practical as it depends upon the availability of an expert. Furthermore, experts may be researchers working and publishing in the field themselves. This scenario would be problematic because researchers would have to define the rules experts must abide by, compromising the concept of independent review [ 19 ]. Nonetheless, this problem does not disqualify the idea of expertise but requires high transparency standards regarding rule development and compliance. Other options include ad-hoc expert committees or provision of relevant training for existing committee members [ 47 , 90 , 91 ]. Given these options, which one is best to address ERCs’ lack of expertise in big data research?

Reforming the ERC

Our analysis shows that ERCs play a critical role in ensuring ethical oversight and risk–benefit evaluation [ 92 ], assessing the scientific validity of a project in its early stages, and offering an independent, critical, and interdisciplinary approach to the review. These strengths demonstrate why the ERC is an oversight model worth holding on to. Nevertheless, ERCs carry both persistent weaknesses and novel, big data-specific ones, reducing their effectiveness and appropriateness as oversight bodies for data-driven research. To answer our initial research question, we propose that the current oversight mechanism is not as fit for purpose to assess the ethics of big data research as it could be in principle. ERCs should be improved at several levels to be able to adequately address and overcome these challenges. Changes could be introduced at the level of the regulatory framework as well as procedures. Additionally, reforming the ERC model might mean introducing complementary forms of oversight. In this section we explore these possibilities. Figure 2 offers an overview of the reforms that could aid ERCs in improving their process.

Fig. 2 Reforms overview for the research oversight mechanism

Regulatory reforms

The regulatory design of research oversight is the first aspect which needs reform. ERCs could benefit from new guidance (e.g., in the form of a flowchart) on the ethics of big data research. This guidance could build upon a deep rethinking of the importance of data for the functioning of societies, the way we use data in society, and our justifications for this use. In the UK, for instance, individuals can generally opt out of having their data (e.g., hospital visit data, health records, prescription drugs) stored by physicians’ offices or by NHS digital services. However, exceptions to this opt-out policy apply when uses of the data are vital to the functioning of society (for example, in the case of official national statistics or overriding public interest, such as the COVID-19 pandemic) [ 93 ].

We imagine this new guidance also re-defining the scope of ERC review, from protection of individual interest to a broader research impact assessment. In other words, it will allow the ERC’s scope to expand and to address purview issues which were previously discussed. For example, less research will be oversight-free because more factors would trigger ERC purview in the first place. The new governance would impose ERC review for research involving anonymized data, or big data research within public–private partnerships. Furthermore, ERC purview could be extended beyond the initial phase of the study to other points in the data lifecycle [ 94 ]. A possible option is to assess a study after its conclusion (as is the case in the pharmaceutical industry): ERCs could then decide if research findings and results should be released and further used by the scientific community. This new ethical guidance would serve ERCs not only in deciding whether a project requires review, but also in learning from past examples and best practices how to best proceed in the assessment. Hence, this guidance could come in handy to increase transparency surrounding assessment criteria used across ERCs. Transparency could be achieved by defining a minimum global standard for ethics assessment that allows international collaboration based on open data and a homogenous evaluation model. Acceptance of a global standard would also mean that the same oversight procedures will apply to research projects with similar risks and research paths, regardless of whether they are carried on by public or private entities. Increased clarification and transparency might also streamline the review process within and across committees, rendering the entire system more efficient.
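One way to make such guidance operational would be to express the expanded purview as an explicit, auditable checklist that researchers and committees apply in the same way. The sketch below, in Python, is purely illustrative: the trigger factors and their names are assumptions drawn loosely from the reforms discussed above (e.g., anonymized data and public–private partnerships no longer exempting a study), not criteria taken from the article.

def requires_erc_review(project):
    # Return the list of factors that would trigger ERC review under the
    # hypothetical expanded purview sketched in the text.
    triggers = []
    if project.get("involves_human_derived_data"):
        triggers.append("uses data about or derived from people")
    if project.get("uses_anonymized_data"):
        triggers.append("anonymized data (re-identification risk is never zero)")
    if project.get("public_private_partnership"):
        triggers.append("public-private partnership")
    if project.get("secondary_use_of_existing_data"):
        triggers.append("re-use of data beyond its original purpose")
    if project.get("automated_decision_system"):
        triggers.append("deploys an automated decision or profiling system")
    return triggers

# Usage sketch: a social media study re-using "anonymized" platform data.
project = {
    "involves_human_derived_data": True,
    "uses_anonymized_data": True,
    "public_private_partnership": True,
    "secondary_use_of_existing_data": True,
    "automated_decision_system": False,
}
for reason in requires_erc_review(project):
    print("Review trigger:", reason)

Whatever the specific factors, writing them down in a shared, machine-checkable form is one way to pursue the minimum global standard and cross-committee consistency described above.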

Procedural reforms

Procedural reforms might target specific aspects of the ERC model to make it more suitable for the review of big data research. To begin with, ERCs should develop new operational tools to mitigate emerging big data challenges. For example, the AI Now algorithmic impact assessment tool, which appraises the ethics of automated decision systems and informs decisions about whether or not to deploy such systems in society, could be used [ 95 ]. Forms of broad consent [ 96 ] and dynamic consent [ 20 ] can also address some of the issues raised by the use, re-use, and sharing of big data (publicly available or not). Nonetheless, informed consent should not be considered a panacea for all ethical issues in big data research—especially in the case of publicly available social media data [ 97 ]. If the ethical implications of big data studies affect society and its vulnerable sub-groups, individual consent cannot be relied upon as an effective safeguard. For this reason, ERCs should move towards a more democratic process of review. Possible strategies include engaging research subjects and communities in the decision-making process or promoting a co-governance system. The recent Montreal Declaration for Responsible AI is an example of an ethical oversight process developed out of public involvement [ 98 ]. Furthermore, this inclusive approach could increase the trustworthiness of the ethics review mechanism itself [ 99 ]. In practice, the more that ERCs involve potential data subjects in a transparent conversation about the risks of big data research, the more socially accountable the oversight mechanism will become.

ERCs must also address their lack of big data and general computing expertise. There are several potential ways to bridge this gap. First, ERCs could build capacity with formal training on big data. ERCs are willing to learn from researchers about social media data and computational methodologies used for data mining and analysis [ 85 ]. Second, ERCs could adjust membership to include specific experts from needed fields (e.g., computer scientists, biotechnologists, bioinformaticians, data protection experts). Third, ERCs could engage with external experts for specific consultations. Despite some resistance to accepting help, recent empirical research has shown that ERCs may be inclined to rely upon external experts in case of need [ 86 ].

In the data-driven research context, ERCs must embrace their role as regulatory stewards, and walk researchers through the process of ethics review [ 40 ]. ERCs should establish an open communication channel with researchers to communicate the value of research ethics while clarifying the criteria used to assess research. If ERCs and researchers agree to mutually increase transparency, they create an opportunity to learn from past mistakes and prevent future ones [ 100 ]. Universities might seek to educate researchers on ethical issues that can arise when conducting data-driven research. In general, researchers would benefit from training on identifying issues of ethics or completing ethics self-assessment forms, particularly if they are responsible for submitting projects for review [ 101 ]. As biomedical research is trending away from hospitals and clinical trials, and towards people’s homes and private corporations, researchers should strive towards greater clarity, transparency, and responsibility. Researchers should disclose both envisioned risks and benefits, as well as the anticipated impact at the individual and population level [ 54 ]. ERCs can then more effectively assess the impact of big data research and determine whether the common good is guaranteed. Furthermore, they might examine how research benefits are distributed throughout society. Localized decision making can play a role here [ 55 ]. ERCs may take into account characteristics specific to the social context, to evaluate whether or not the research respects societal values.

Complementary reforms

An additional measure to tackle the novelty of big data research might consist in reforming the current research ethics system through regulatory and procedural tools. However, this strategy may not be sufficient: the current system might require additional support from other forms of oversight to complement its work.

One possibility is the creation of hybrid review mechanisms and norms, merging valuable aspects of the traditional ERC review model with more innovative models, which have been adopted by various partners involved in the research (e.g., corporations, participants, communities) [ 102 ]. This integrated mechanism of oversight would cover all stages of big data research and involve all relevant stakeholders [ 103 ]. Journals and the publishing industry could play a role within this hybrid ecosystem in limiting potential dual use concerns. For instance, in the research publication phase, resources could be assigned to editors so as to assess research integrity standards and promote only those projects which are ethically aligned. However, these implementations can have an impact only when there is a shared understanding of best practice within the oversight ecosystem [ 19 ].

A further option is to include specialized and distinct ethical committees alongside ERCs, whose purpose is to assess big data research and provide sectorial accreditation to researchers. In this model, ERCs would not be overwhelmed by the numbers of study proposals to review and could outsource evaluations requiring specialist knowledge in the field of big data. It is true that specialized committees (data safety monitoring boards, data access committees, and responsible research and innovation panels) already exist and support big data researchers in ensuring data protection (e.g., system security, data storage, data transfer). However, something like a “data review board” could assess research implications both for the individual and society, while reviewing a project’s technical features. Peer review could play a critical role in this model: the research community retains the expertise needed to conduct ethical research and to support each other when the path is unclear [ 101 ].

Despite their promise, these scenarios all suffer from at least one primary limitation. The former might face a backlash when attempting to bring together the priorities and ethical values of various stakeholders, within common research norms. Furthermore, while decentralized oversight approaches might bring creativity over how to tackle hard problems, they may also be very dispersive and inefficient. The latter could suffer from overlapping scope across committees, resulting in confusing procedures, and multiplying efforts while diluting liability. For example, research oversight committees have multiplied within the United States, leading to redundancy and disharmony across committees [ 47 ]. Moreover, specialized big data ethics committees working in parallel with current ERCs could lead to questions over the role of the traditional ERC, when an increasing number of studies will be big data studies.

ERCs face several challenges in the context of big data research. In this article, we sought to bring clarity regarding those which might affect the ERC’s practice, distinguishing between novel and persistent weaknesses which are compounded by big data research. While these flaws are profound and inherent in the current sociotechnical transformation, we argue that the current oversight model is still partially capable of guaranteeing the ethical assessment of research. However, we also advance the notion that introducing reform at several levels of the oversight mechanism could benefit and improve the ERC system itself. Among these reforms, we identify the urgency for new ethical guidelines and new ethical assessment tools to safeguard society from novel risks brought by big data research. Moreover, we recommend that ERCs adapt their membership to include necessary expertise for addressing the research needs of the future. Additionally, ERCs should accept external experts’ consultations and consider training in big data technical features as well as big data ethics. A further reform concerns the need for transparent engagement among stakeholders. Therefore, we recommend that ERCs involve both researchers and data subjects in the assessment of big data research. Finally, we acknowledge the existing space for a coordinated and complementary support action from other forms of oversight. However, the actors involved must share a common understanding of best practice and assessment criteria in order to efficiently complement the existing oversight mechanism. We believe that these adaptive suggestions could render the ERC mechanism sufficiently agile and well-equipped to overcome data-intensive research challenges and benefit research at large.

Acknowledgements

This article reports the ideas and conclusions that emerged during a collaborative and participatory online workshop. All authors participated in the “Big Data Challenges for Ethics Review Committees” workshop, held online on 23–24 April 2020 and organized by the Health Ethics and Policy Lab, ETH Zurich.

Abbreviations

ERC(s): Ethics Review Committee(s)
HIPAA: Health Insurance Portability and Accountability Act
IRB(s): Institutional Review Board(s)
NHS: National Health Service
REC(s): Research Ethics Committee(s)
UK: United Kingdom
US: United States

Authors' contributions

AF drafted the manuscript, MI, MS1 and EV contributed substantially to the writing. EV is the senior lead on the project from which this article derives. All the authors (AF, MI, MS1, AB, ESD, BF, PF, JK, WK, PK, SML, CN, GS, MS2, MRV, EV) contributed greatly to the intellectual content of this article, edited it, and approved the final version. All authors read and approved the final manuscript.

Funding

This research is supported by the Swiss National Science Foundation under award 407540_167223 (NRP 75 Big Data). MS1 is grateful for funding from the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The funding bodies did not take part in designing this research and writing the manuscript.

Availability of data and materials

Declarations

Competing interests

The authors declare that they have no competing interests.

1. There is an unsettled discussion about whether ERCs ought to play a role in evaluating both scientific and ethical aspects of research, or whether these can even come apart, but we will not go into detail here. See: Dawson AJ, Yentis SM. Contesting the science/ethics distinction in the review of clinical research. Journal of Medical Ethics. 2007;33(3):165–7; and Angell EL, Bryman A, Ashcroft RE, Dixon-Woods M. An analysis of decision letters by research ethics committees: the ethics/scientific quality boundary examined. BMJ Quality & Safety. 2008;17(2):131–6.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

What Are Some Real-World Examples of Big Data?

Since our first ancestors put ink to parchment, data has been part of the human experience.

From tracking the complex movements of the planets, to more basic things like bookkeeping, data has shaped the way we’ve evolved. Today, thanks to the internet, we collect such vast amounts of data that we have a whole new term to describe it: “big data.”

While big data is not only collected online, the digital space is undoubtedly its most abundant source. From social media likes, to emails, weather reports, and wearable devices, huge amounts of data are created and accumulated every second of every day. But how exactly is it used?

If you’re just starting out from scratch, then try this free data short course for size.

In this article, I’ll focus on some of the most notable big data examples out there. These are ways in which organizations—large and small—use big data to shape the way they work.

  • What is big data and why is it useful?
  • Big data in marketing and advertising
  • Big data in education
  • Big data in healthcare
  • Big data in travel, transport, and logistics
  • Big data in finance and banking
  • Big data in agriculture
  • Key takeaways

First, let’s start with a quick summary of what big data is, and why so many organizations are scrambling to harness its potential.

1. What is big data and why is it useful?

“Big data” is used to describe repositories of information too large or complex to be analyzed using traditional techniques. For the most part, big data is unstructured, i.e. it is not organized in a meaningful way.

Although the term is commonly used to describe information collected online, to understand it better, it can help to picture it literally. Imagine walking into a vast office space without desks, computers, or filing cabinets. Instead, the whole place is a towering mess of disorganized papers, documents, and files. Your job is to organize all of this information and to make sense of it. No mean feat!

While digitization has all but eradicated the need for paper documentation, it has actually increased the complexity of the task. The skill in tackling big data is in knowing how to categorize and analyze it. For this, we need the right big data tools  and know-how. But how do we categorize such vast amounts of information in a way that makes it useful?

While this might seem like a daunting task, organizations worldwide are investing huge amounts of time and money in trying to tap big data’s potential. This is why data scientists and data analysts are currently so in demand.

Learn more about it in our complete guide to what big data is.

But how is it done? Let’s take a look.

2. Big data in marketing and advertising

One of big data’s most obvious uses is in marketing and advertising. If you’ve ever seen an advert on Facebook or Instagram, then you’ve seen big data at work. Let’s explore some more concrete examples.

Netflix and big data

Netflix has over 150 million subscribers, and collects data on all of them. They track what people watch, when they watch it, the device being used, if a show is paused, and how quickly a user finishes watching a series.

They even take screenshots of scenes that people watch twice. Why? Because by feeding all this information into their algorithms, Netflix can create custom user profiles. These allow them to tailor the experience by recommending movies and TV shows with impressive accuracy.

And while you might have seen articles about how Netflix likes to splash the cash on new shows, this isn’t done blindly—all the data they collect helps them decide what to commission next.
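To make this concrete, here’s a minimal sketch of how finished-watch data could feed simple item-to-item recommendations. This is purely illustrative and not Netflix’s actual algorithm; the users, shows, and similarity rule are all invented.

```python
# A toy item-to-item recommender built from watch history (illustrative only).
from math import sqrt

# Toy "big data": which users finished which shows (1 = watched).
watch_matrix = {
    "alice": {"Stranger Things": 1, "The Crown": 1, "Dark": 1},
    "bob":   {"Stranger Things": 1, "Dark": 1, "Black Mirror": 1},
    "carol": {"The Crown": 1, "Bridgerton": 1},
}

# Invert the matrix: show -> set of users who watched it.
viewers = {}
for user, shows in watch_matrix.items():
    for show in shows:
        viewers.setdefault(show, set()).add(user)

def cosine_similarity(a: set, b: set) -> float:
    """Similarity between two shows, based on the overlap of their audiences."""
    if not a or not b:
        return 0.0
    return len(a & b) / (sqrt(len(a)) * sqrt(len(b)))

def recommend(user: str, top_n: int = 3) -> list:
    """Rank unseen shows by how similar their audiences are to the user's watched shows."""
    seen = set(watch_matrix[user])
    scores = {}
    for show, audience in viewers.items():
        if show in seen:
            continue
        scores[show] = sum(cosine_similarity(audience, viewers[s]) for s in seen)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice"))  # ['Black Mirror', 'Bridgerton']
```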

Amazon and big data

Much like Netflix, Amazon collects vast amounts of data on its users. They track what users buy, how often (and for how long) they stay online, and even things like product reviews (useful for sentiment analysis).

Amazon can even guess people’s income based on their billing address. By compiling all this data across millions of users, Amazon can create highly specialized, segmented user profiles.

Using predictive analytics, they can then target their marketing based on users’ browsing habits. This is used for suggesting what you might want to buy next, but also for things like grouping products together to streamline the shopping experience.
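As a rough illustration of what segmented user profiles can mean in practice, the sketch below clusters a handful of invented customers by shopping behaviour with k-means (scikit-learn). The features, values, and number of segments are assumptions for illustration, not Amazon’s actual setup.

```python
# A hypothetical customer-segmentation sketch using k-means clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy features per customer: [orders_per_month, avg_order_value, reviews_written]
customers = np.array([
    [12, 25.0, 8],
    [11, 30.0, 5],
    [1, 200.0, 0],
    [2, 180.0, 1],
    [5, 60.0, 2],
    [6, 55.0, 3],
])

# Scale the features so that no single unit (e.g. dollars) dominates the distances.
scaled = StandardScaler().fit_transform(customers)

# Group customers into three behavioural segments.
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(segments)  # e.g. frequent buyers, big spenders, occasional shoppers
```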

McDonald’s and big data

Big data isn’t just used to tailor online experiences. A good example of this is McDonald’s, who use big data to shape key aspects of their offering offline, too. This includes their mobile app, drive-thru experience, and digital menus.

With its own app, McDonald’s collects vital information about user habits. This lets them offer tailored loyalty rewards to encourage repeat business. But they also collect data from each restaurant’s drive-thru, allowing them to ensure enough staff is on shift to cover demand. Finally, their digital menus offer different options depending on factors such as the time of day, if any events are taking place nearby, and even the weather.

So, if it’s a hot day, expect to be offered a McFlurry or a cold drink…not a spicy burger!
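A toy version of that weather- and time-aware menu logic might look like the sketch below; the items, thresholds, and rules are invented purely for illustration and are not McDonald’s actual system.

```python
# Hypothetical rules for choosing which promotions a digital menu highlights.
def pick_promotions(hour: int, temperature_c: float, nearby_event: bool) -> list:
    promos = []
    if temperature_c >= 25:
        promos += ["McFlurry", "Iced drinks"]   # hot day: push cold items
    else:
        promos += ["Hot coffee", "Spicy burger"]
    if hour < 11:
        promos.insert(0, "Breakfast menu")      # mornings lead with breakfast
    if nearby_event:
        promos.append("Family bundle")          # expect bigger groups nearby
    return promos

print(pick_promotions(hour=14, temperature_c=29.0, nearby_event=False))
# ['McFlurry', 'Iced drinks']
```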

3. Big data in education

Until recently, the approach to education was more or less one-size-fits-all. With companies now harnessing big data, this is no longer the case. Schools, colleges, and technology providers are all using it to enhance the educational experience.

Reducing drop-out rates with big data

Purdue University in Indiana was an early adopter of big data in education. In 2007, Purdue launched a unique early-intervention system called Signals, which was designed to help predict academic and behavioral issues.

By applying predictive modeling to student data (e.g. class prep, level of engagement, and overall academic performance), Purdue was able to accurately forecast which students were at risk of dropping out. When action was required, both students and teachers were informed, meaning the college could intervene and tackle any issues. As a result, according to one study, those taking two or more Signals courses were 21% less likely to drop out.
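In spirit, that kind of early-warning system can be as simple as a classifier trained on past cohorts. The sketch below fits a logistic regression on invented data; it is loosely inspired by systems like Signals, not a reconstruction of Purdue’s actual model.

```python
# A hypothetical dropout-risk classifier trained on invented historical data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features per past student: [logins_per_week, assignments_submitted, average_grade]
X = np.array([
    [10, 9, 85], [8, 8, 78], [1, 2, 40], [2, 3, 52],
    [7, 7, 70], [0, 1, 35], [9, 10, 90], [3, 4, 55],
])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])  # 1 = dropped out

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score current students and flag anyone above a risk threshold.
current = np.array([[2, 2, 48], [8, 9, 82]])
risk = model.predict_proba(current)[:, 1]
for student, p in zip(["student_a", "student_b"], risk):
    if p > 0.5:
        print(f"{student}: at risk (p={p:.2f}) -> notify student and instructor")
```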

Improving the learner experience with big data

Some educational technology providers use big data to enhance student learning. One example of this is the UK-based company Sparx, which created a math app for school kids. Using machine learning, personalized content, and data analytics, the app helps improve the pupil learning experience.

With over 32,000 questions, the app uses an adaptive algorithm to push the most relevant content to each student based on their previous answers. This includes real-time feedback, therefore tackling mistakes as soon as they arise. Plus, by collecting data from all their users across schools, Sparx gains broader insight into the overall learning patterns and pitfalls that students face, helping them to constantly improve their product.
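The heart of an adaptive algorithm like this can be surprisingly small: estimate the pupil’s current ability and serve the question whose difficulty matches it best. The sketch below is a deliberately simplified illustration, not Sparx’s actual method; the difficulty scale and scoring rule are assumptions.

```python
# A hypothetical adaptive question picker: match question difficulty to recent performance.
questions = [
    {"id": "q1", "difficulty": 0.2},
    {"id": "q2", "difficulty": 0.5},
    {"id": "q3", "difficulty": 0.8},
]

def next_question(answer_history, bank):
    """Estimate ability as the recent success rate, then pick the closest-difficulty question."""
    recent = answer_history[-5:]
    ability = sum(recent) / len(recent) if recent else 0.5  # neutral start for new pupils
    return min(bank, key=lambda q: abs(q["difficulty"] - ability))

print(next_question([1, 1, 0, 1], questions))  # ability 0.75 -> picks q3 (difficulty 0.8)
```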

Improving teaching methods with big data

Other educational technology providers have used big data to improve teaching methods. At Roosevelt Elementary School in San Francisco, teachers use an analytics app called DIBELS. The app gathers data on children’s reading habits so that teachers can see where they most need help.

Aggregating data on all pupils, teachers can group those with the same learning needs, targeting teaching where it’s most needed. This also encourages educators to reflect on their methods. For instance, if they face similar issues across multiple students, they might need to adapt their approach.

4. Big data in healthcare

From pharmaceutical companies to medical product providers, big data’s potential within the healthcare industry is huge. Vast volumes of data inform everything from diagnosis and treatment, to disease prevention, and tracking.

Electronic health records and big data

Our medical records include everything from our personal demographics to our family histories, diets, and more. For decades, this information was in a paper format, limiting its usefulness.

However, health systems around the world are now digitizing these data, creating a substantial set of electronic health records (EHRs). EHRs have vast potential. On a day-to-day level, they allow doctors to receive reminders or warnings when a patient needs to be contacted (for instance, to check their medication).

However, EHRs also allow clinical researchers to spot patterns between things like disease, lifestyle, and environment—correlations that would previously have been impossible to detect. This is revolutionizing how we detect, prevent, and treat disease, informing new interventions, and changes in government health policy.

Big data and wearable devices

Healthcare providers are always seeking new ways to improve patient care with faster, cheaper, more effective treatments. Wearables are a key part of this. They allow us to track patient data in real-time.

For instance, a wearable monitor that tracks blood pressure can allow doctors to follow patients for extended periods at home, rather than relying on the results of a quick hospital test. If there’s a problem, doctors can quickly intervene. More importantly though, using big data analytics tools, information collected from countless patients can offer invaluable insights, helping healthcare providers improve their products. This ultimately saves money and lives.
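At its simplest, that kind of remote monitoring is a rolling check against a clinical threshold. The sketch below is illustrative only; the threshold, window size, and readings are invented, and a real system would involve far more careful clinical logic.

```python
# A toy monitor that alerts when the rolling average of readings stays above a limit.
from statistics import mean

SYSTOLIC_LIMIT = 140  # assumed alert threshold (mmHg)

def monitor(readings, window=5):
    """Alert when the average of the last `window` readings exceeds the limit."""
    recent = []
    for timestamp, systolic in readings:
        recent.append(systolic)
        recent = recent[-window:]
        if len(recent) == window and mean(recent) > SYSTOLIC_LIMIT:
            print(f"{timestamp}: sustained high blood pressure ({mean(recent):.0f}), notify clinician")

stream = [("09:00", 128), ("09:05", 135), ("09:10", 142),
          ("09:15", 150), ("09:20", 155), ("09:25", 149)]
monitor(stream)
```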

Big data for disease tracking

Another application of big data in healthcare is disease tracking. The current coronavirus pandemic is a perfect example. Since the coronavirus outbreak began, governments have been scrambling to launch track-and-trace systems to stem the spread of disease.

In China, for instance, the government has introduced heat detectors at train stations to identify those with fever. Because every passenger is legally required to use identification before using public transport, authorities can quickly alert those who may have been exposed. The Chinese government also uses security cameras and mobile phone data to track those who have broken quarantine. While this does come with privacy concerns, China’s approach nevertheless demonstrates the power of big data.

5. Big data in travel, transport, and logistics

From flying off on vacation to ordering packages to your front door, big data has myriad applications in travel, transport, and logistics. Let’s explore further.

Big data in logistics

Tracking warehouse stock levels, traffic reports, product orders, and more, logistics companies use big data to streamline their operations. A good example is UPS. By tracking weather and truck sensor data, UPS learned the quickest routes for their drivers.

This itself was a useful insight, but after analyzing the data in more detail, they made an interesting discovery: by turning left across traffic, drivers were wasting a lot of fuel . As a result, UPS introduced a ‘no left turn’ policy. The company claims that they now use 10 million gallons less gas per year, and emit 20,000 tonnes less carbon dioxide. Pretty impressive stuff!

Big data and city mobility

Big data is big business in urban mobility, from car hire companies to the boom of e-bike and e-scooter hire. Uber is an excellent example of a company that has harnessed the full potential of big data. Firstly, because they have a large database of drivers, they can match users to the closest driver in a matter of seconds.

But it doesn’t stop there. Uber also stores data for every trip taken. This enables them to predict when the service is going to be at its busiest, allowing them to set their fares accordingly. What’s more, by pooling data from across the cities they operate in, Uber can analyze how to avoid traffic jams and bottlenecks. Cool, huh?
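The driver-matching part of that story can be sketched in a few lines: find the available driver closest to the rider. The coordinates and straight-line distance below are simplifications for illustration; a production system would work with road networks, ETAs, and live supply and demand.

```python
# A toy nearest-driver match using straight-line distance between coordinates.
from math import dist

drivers = {
    "driver_a": (51.515, -0.141),
    "driver_b": (51.509, -0.128),
    "driver_c": (51.503, -0.119),
}

def nearest_driver(rider_location, available):
    """Return the driver whose coordinates are closest to the rider."""
    return min(available, key=lambda d: dist(available[d], rider_location))

print(nearest_driver((51.507, -0.122), drivers))  # 'driver_c'
```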

Big data and the airline industry

Aircraft manufacturer Boeing operates an Airplane Health Management System. Every day, the system analyzes millions of measurements from across the fleet. From in-flight metrics to mechanical analysis, the resulting data has numerous applications.

For instance, by predicting potential failures, the company knows when servicing is required, saving them thousands of dollars annually on unnecessary maintenance. More importantly, this big data provides invaluable safety insights, improving airplane safety at Boeing, and across the airline industry at large.

6. Big data in finance and banking

Fraud detection with big data

Banks and financial institutions process billions of transactions daily—in 2022 there were more than 21,510 credit card transactions per second! With the rise of online banking, mobile payments, and digital transactions, the risk of fraud has also increased.

Big data analytics can help in detecting unusual patterns or behaviors in transaction data. For instance, if a credit card is used in two different countries within a short time frame, it might be flagged as suspicious. By analyzing vast amounts of transaction data in real-time, banks can quickly detect and prevent fraudulent activities.
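The “two countries in a short time frame” rule described above is easy to sketch in code. Real fraud systems combine many such signals with machine-learned scores; the example below is only a toy version of that single rule, with invented transactions.

```python
# Flag a card used in two different countries within a short time window (toy rule).
from datetime import datetime, timedelta

transactions = [
    {"card": "1234", "country": "US", "time": datetime(2024, 5, 1, 10, 0)},
    {"card": "1234", "country": "DE", "time": datetime(2024, 5, 1, 10, 45)},
    {"card": "5678", "country": "UK", "time": datetime(2024, 5, 1, 11, 0)},
]

WINDOW = timedelta(hours=2)

def flag_suspicious(txns):
    """Return transactions whose country differs from the card's previous one within the window."""
    flagged = []
    last_seen = {}  # card -> (country, time)
    for t in sorted(txns, key=lambda t: t["time"]):
        prev = last_seen.get(t["card"])
        if prev and prev[0] != t["country"] and t["time"] - prev[1] <= WINDOW:
            flagged.append(t)
        last_seen[t["card"]] = (t["country"], t["time"])
    return flagged

print(flag_suspicious(transactions))  # flags the German transaction on card 1234
```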

Personalized banking with big data

With over 78% of Americans banking digitally, banks are increasingly using big data to offer personalized services to their customers. By analyzing a customer’s transaction history, browsing habits, and even social media activities, banks can offer tailored financial products, interest rates, or even financial advice.

For instance, if a bank notices that a customer is frequently spending on travel, they might offer them a credit card with travel rewards or discounts.

7. Big data in agriculture

Precision farming with big data

Farmers are using big data to make more informed decisions about their crops. How do they achieve this? With sensors that measure moisture levels, temperature, and soil conditions, placed in fields as well as on tractors and other farm machinery.

Speaking of farm machinery, here’s an example that is unusual now but won’t be for long: drones. Equipping drones with cameras provides detailed aerial views of the crops, helping to detect diseases or pests. Hobby drone giant DJI already produces its own line of drones for this purpose.

By analyzing this data, farmers can determine the optimal time to plant, irrigate, or harvest their crops, leading to increased yields and reduced costs.
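As a simple illustration, readings from field sensors can be aggregated and compared against a moisture threshold to decide where to irrigate. The threshold, field names, and readings below are invented; real decisions would also weigh weather forecasts, crop type, and growth stage.

```python
# A toy irrigation planner: irrigate any field whose average soil moisture is too low.
MOISTURE_THRESHOLD = 0.30  # assumed minimum volumetric soil moisture

sensor_readings = {
    "field_north": [0.41, 0.39, 0.40],
    "field_south": [0.22, 0.25, 0.28],
    "field_east":  [0.35, 0.18, 0.20],
}

def irrigation_plan(readings):
    """Recommend irrigation for fields whose average moisture falls below the threshold."""
    plan = {}
    for field, values in readings.items():
        avg = sum(values) / len(values)
        plan[field] = "irrigate" if avg < MOISTURE_THRESHOLD else "ok"
    return plan

print(irrigation_plan(sensor_readings))
# {'field_north': 'ok', 'field_south': 'irrigate', 'field_east': 'irrigate'}
```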

Supply chain optimization with big data

Agricultural supply chains are complex, with multiple stages from farm to table. Big data can help in tracking and optimizing each stage of the supply chain. For instance, by analyzing data from transportation vehicles, storage facilities, and retail outlets, suppliers can ensure that perishable goods like fruits and vegetables are delivered in the shortest time, reducing wastage and ensuring freshness.


8. Key takeaways

In this post, we’ve explored big data’s real-world uses in several industries. Big data is regularly used by:

  • Advertisers and marketers —to tailor offers and promotions, and to make customer recommendations
  • Educational institutions —to minimize drop-outs, offer tailored learning, and to improve teaching methods
  • Healthcare providers —to create new treatments, develop wearable devices, and to improve clinical research
  • Transport and logistics —to streamline supply chain operations, improve airline safety, and even to save fuel and reduce carbon emissions
  • Banking and finance —to help prevent fraud, as well as to offer customers tailored products based on their activity
  • Agriculture —to help farmers perform as efficiently as possible and to monitor their crops

This taster of big data’s potential highlights just how powerful it can be. From financial services to the food industry, mining and manufacturing, big data insights are shaping the world we live in. If you want to be a part of this incredible journey, and are curious about a career in data analytics, why not try our free, five-day data analytics short course?

Keen to explore further? Check out the following:

  • How To Become A Data Consultant: A Beginner’s Guide
  • Bias in Machine Learning: What Are the Ethics of AI?
  • What Are Large Language Models? A Complete Guide


10 Research Question Examples to Guide your Research Project

Published on October 30, 2022 by Shona McCombes. Revised on October 19, 2023.

The research question is one of the most important parts of your research paper, thesis or dissertation. It’s important to spend some time assessing and refining your question before you get started.

The exact form of your question will depend on a few things, such as the length of your project, the type of research you’re conducting, the topic, and the research problem. However, all research questions should be focused, specific, and relevant to a timely social or scholarly issue.

Once you’ve read our guide on how to write a research question, you can use these examples to craft your own.

Each example below contrasts a weaker first research question with a stronger, revised second question:

  • The first question is not focused enough. The second question is more focused, using clearly defined concepts.
  • Starting with “why” often means that your question is not focused enough: there are too many possible answers. By targeting just one aspect of the problem, the second question offers a clear path for research.
  • The first question is too broad and subjective: there are no clear criteria for what counts as “better.” The second question is much more specific. It uses clearly defined terms and narrows its focus to a specific population.
  • It is generally not feasible for academic research to answer broad normative questions. The second question is more specific, aiming to gain an understanding of possible solutions in order to make informed recommendations.
  • The first question is too simple: it can be answered with a simple yes or no. The second question is more complex, requiring in-depth investigation and the development of an original argument.
  • The first question is too broad and not very original. The second question identifies an underexplored aspect of the topic that requires investigation of various sources to answer.
  • The first question is not focused enough: it tries to address two different problems (the quality of sexual health services and LGBT support services). Even though the two issues are related, it’s not clear how the research will bring them together. The second integrates the two problems into one focused, specific question.
  • The first question is too simple, asking for a straightforward fact that can be easily found online. The second is a more complex question that requires investigation and detailed discussion to answer.
  • The first question is not original enough: it would be very difficult to contribute anything new. The second question takes a specific angle to make an original argument, and has more relevance to current social concerns and debates.
  • The first question asks for a ready-made solution, and is not researchable. The second question is a clearer comparative question, but note that it may not be practically feasible. For a smaller research project or thesis, it could be narrowed down further to focus on the effectiveness of drunk driving laws in just one or two countries.

Note that the design of your research question can also depend on the method you are pursuing, whether qualitative, quantitative, or statistical.


Other interesting articles

If you want to know more about the research process, methodology, research bias, or statistics, make sure to check out some of our other articles with explanations and examples.

Methodology

  • Sampling methods
  • Simple random sampling
  • Stratified sampling
  • Cluster sampling
  • Likert scales
  • Reproducibility

 Statistics

  • Null hypothesis
  • Statistical power
  • Probability distribution
  • Effect size
  • Poisson distribution

Research bias

  • Optimism bias
  • Cognitive bias
  • Implicit bias
  • Hawthorne effect
  • Anchoring bias
  • Explicit bias

Cite this Scribbr article

If you want to cite this source, you can copy and paste the citation below.

McCombes, S. (2023, October 19). 10 Research Question Examples to Guide your Research Project. Scribbr. Retrieved September 6, 2024, from https://www.scribbr.com/research-process/research-question-examples/


Big Data in Academic Research: Challenges, Pitfalls, and Opportunities

  • First Online: 05 October 2021


  • Jacques Raubenheimer

Part of the book series: Policy Implications of Research in Education (PIRE, volume 13)


Big Data are a product of the computer era, enabling the knowledge economy, in which academic researchers are key players, although researchers have been slow to adopt Big Data as a source for academic enquiry. This may be in part because Big Data are curated by commercial or governmental entities, not by researchers. Big Data present several challenges to researchers, including those associated with the size of the data, the development and growth of data sources, and the temporal changes in large data sets. Further challenges are that Big Data are gathered for purposes other than research, making their fit-for-purpose problematic; that Big Data may easily lead to overfitting and spuriousness; and the biases inherent to Big Data. Linkage of data sets always remains problematic. Big Data results are hard to generalize, and working with Big Data may raise new ethical problems, even while obviating old ethical concerns. Nonetheless, Big Data offer many opportunities, allowing researchers to study previously inaccessible problems, with previously inconceivable sources of data. Although Big Data overcome some of the challenges of small data studies, Big Data studies will not supplant small data studies—these should work in concert, leading to real-world translation that can have a lasting impact.




Rassen, J. A., Glynn, R. J., Brookhart, M. A., & Schneeweiss, S. (2011). Covariate selection in high-dimensional propensity score analyses of treatment effects in small samples. American Journal of Epidemiology, 173 (12), 1404–1413. https://doi.org/10.1093/aje/kwr001.

Raubenheimer, J. E. (2019). Google Trends Extraction Tool. https://doi.org/10.5281/zenodo.2620618

Raubenheimer, J. E. (2021). Google Trends Extraction Tool for Google Trends Extended for Health data. Software Impacts, 8, 100060. https://doi.org/10.1016/j.simpa.2021.100060

Reuters. (2018). Cambridge Analytica and British parent shut down after Facebook scandal. Retrieved May 3, 2018, from https://www.reuters.com/article/us-facebook-privacy/cambridge-analytica-and-british-parent-shut-down-after-facebook-scandal-idUSKBN1I32L7

Robb, D. (2017). The Global Heatmap, now 6x hotter . Retrieved January 1, 2018, from https://medium.com/strava-engineering/the-global-heatmap-now-6x-hotter-23fc01d301de

Roberts, M. E., Stewart, B. M., & Nielsen, R. (2015). Matching methods for high-dimensional data with applications to text . http://www.margaretroberts.net/wp-content/uploads/2015/07/textmatching.pdf

Robinson-Cimpian, J. P. (2014). Inaccurate estimation of disparities due to mischievous responders: Several suggestions to assess conclusions. Educational Researcher, 43 (4), 171–185. https://doi.org/10.3102/0013189X14534297.

Rosenbaum, P. R. (1987). Sensitivity analysis for certain permutation inferences in matched observational studies. Biometrika, 74 (1), 13–26. http://www.jstor.org/stable/2336017

Rosenbaum, P. R. (1989). Sensitivity analysis for matched observational studies with many ordered treatments. Scandinavian Journal of Statistics, 16 (3), 227–236. http://www.jstor.org/stable/4616136

Runge, K. K., Yeo, S. K., Cacciatore, M., Scheufele, D. A., Brossard, D., Xenos, M., Anderson, A., Choi, D. H., Kim, J., Li, N., Liang, X., Stubbings, M., & Su, L. Y. F. (2013). Tweeting nano: How public discourses about nanotechnology develop in social media environments. Journal of Nanoparticle Research, 15 (1). https://doi.org/10.1007/s11051-012-1381-8 .

Salsburg, D. S. (2017). Errors, blunders, and lies: How to tell the difference . CRC Press.

Book   Google Scholar  

Salzberg, S. (2014). Why Google Flu is a failure . Retrieved January 25, 2018, from https://www.forbes.com/sites/stevensalzberg/2014/03/23/why-google-flu-is-a-failure/#42fed4945535

Sari Aslama, N., Cheshire, B. J., & Cheng, T. (2015). Big Data analysis of population flow between TfL oyster and bicycle hire networks in London . University College London. http://leeds.gisruk.org/abstracts/GISRUK2015_submission_92.pdf

Schaffer, A. L., Buckley, N. A., Dobbins, T. A., Banks, E., & Pearson, S.-A. (2015). The crux of the matter: Did the ABC’s catalyst program change statin use in Australia? Medical Journal of Australia, 11 (11), 591–595. https://doi.org/10.5694/mja15.0010 .

Schneeweiss, S. (2006). Sensitivity analysis and external adjustment for unmeasured confounders in epidemiologic database studies of therapeutics. Pharmacoepidemiology and Drug Safety , 15, 291–303. https://doi.org/10.1002/pds.1200.

Schrage, E., & Ginsberg, D. (2018). Facebook launches new initiative to help scholars assess social media’s impact on elections . Retrieved June 2, 2018, from https://newsroom.fb.com/news/2018/04/new-elections-initiative/

Scurr, J. H., Machin, S. J., Bailey-King, S., Mackie, I. J., McDonald, S., & Coleridge Smith, P. D. (2001). Frequency and prevention of symptomless deep vein thrombosis in long-haul flights: A randomised trial. Lancet, 357 , 1485–1489. https://www.thelancet.com/journals/lancet/article/PIIS0140673600046456/abstract

Smith, G. C. S., & Pell, J. P. (2003). Parachute use to prevent death and major trauma related to gravitational challenge: Systematic review of randomised controlled trials. BMJ (Clinical Research Ed.), 327 (7429), 1459–1461. https://doi.org/10.1177/154510970400300401 .

Solano, P., Ustulin, M., Pizzorno, E., Vichi, M., Pompili, M., Serafini, G., & Amore, M. (2016). A Google-based approach for monitoring suicide risk. Psychiatry Research , 246, 581–586. https://doi.org/10.1016/J.PSYCHRES.2016.10.030 .

Song, T. M., Song, J., An, J. Y., Hayman, L. L., & Woo, J. M. (2014). Psychological and social factors affecting Internet searches on suicide in Korea: A Big Data analysis of Google search trends. Yonsei Medical Journal, 55 (1), 254–263. https://doi.org/10.3349/ymj.2014.55.1.254 .

Spielberg, S. (2002). Minority Report . USA: Twentieth Century Fox. http://www.imdb.com/title/tt0181689

Stephens-Davidowitz, S. (2017). Everybody lies . HarperCollins.

Sueki, H. (2011). Does the volume of Internet searches using suicide-related search terms influence the suicide death rate: Data from 2004 to 2009 in Japan. Psychiatry and Clinical Neurosciences, 65 (4), 392–394. https://doi.org/10.1111/j.1440-1819.2011.02216.x .

Taleb, N. (2013). Beware the big errors of “Big Data.” Retrieved December 8, 2017, from https://www.wired.com/2013/02/big-data-means-big-errors-people/

The Flu Trends Team. (2015). The next chapter for Flu Trends . Retrieved January 25, 2018, from https://research.googleblog.com/2015/08/the-next-chapter-for-flu-trends.html

The Statistics Portal. (2018). Global shipments of hard disk drives (HDD) from 4th quarter 2010 to 3rd quarter 2017 (in millions) . Retrieved January 22, 2018, from https://www.statista.com/statistics/275336/global-shipment-figures-for-hard-disk-drives-from-4th-quarter-2010/

Thomas, R., & McSharry, P. (2015). Big Data revolution: What farmers, doctors and insurance agents teach us about discovering Big Data patterns . John Wiley & Sons.

Tran, U. S., Andel, R., Niederkrotenthaler, T., Till, B., Ajdacic-Gross, V., & Voracek, M. (2017). Low validity of Google trends for behavioral forecasting of national suicide rates. PLoS One, 12 (8), 1–26. https://doi.org/10.1371/journal.pone.0183149 .

Tromp, M., Ravelli, A. C., Bonsel, G. J., Hasman, A., & Reitsma, J. B. (2011). Results from simulated data sets: Probabilistic record linkage outperforms deterministic record linkage. Journal of Clinical Epidemiology, 64 (5), 565–572. https://doi.org/10.1016/j.jclinepi.2010.05.008 .

Turriago-Hoyos, A., Thoene, U., & Arjoon, S. (2016). Knowledge workers and virtues in Peter Drucker’s management theory. SAGE Open, 6 (1). https://doi.org/10.1177/2158244016639631 .

Ueda, M., Mori, K., Matsubayashi, T., & Sawada, Y. (2017). Tweeting celebrity suicides: Users’ reaction to prominent suicide deaths on Twitter and subsequent increases in actual suicides. Social Science and Medicine , 189, 158–166. https://doi.org/10.1016/j.socscimed.2017.06.032 .

Ugander, J., Backstrom, L., Marlow, C., & Kleinberg, J. (2012). Structural diversity in social contagion. Proceedings of the National Academy of Sciences USA, 109 (16), 5962–5966. https://doi.org/10.1073/pnas.1116502109.

UN Global Pulse. (2014). Mining Indonesian tweets to understand food price crises . Jakarta. https://www.unglobalpulse.org/projects/social-media-social-protection-indonesia

Valdivia, A., Lopez-Alcalde, J., Vicente, M., Pichiule, M., Ruiz, M., & Ordobas, M. (2010). Monitoring influenza activity in Europe with Google Flu Trends: Comparison with the findings of sentinel physician networks—Results for 2009-10. Euro Surveillance, 15 (29), 1–6. http://www.eurosurveillance.org/ViewArticle.aspx?ArticleId=19621

Vaughan-Nichols, S. (2014). We’re all just lab rats in Facebook’s laboratory . Retrieved January 25, 2018, from http://www.zdnet.com/article/were-all-just-lab-rats-in-facebooks-laboratory/

Verma, I. M. (2014). Editorial expression of concern and correction. Proceedings of the National Academy of Sciences USA , 111 (29), 10779. www.pnas.org/cgi/doi/10.1073/pnas.1412469111.

Vespignani, A. (2009). Predicting the behavior of techno-social systems. Science, 325 (5939), 425–428. https://doi.org/10.1126/science.1171990 .

Walsh, B. (2014). Google’s Flu project shows the failings of Big Data . Retrieved January 25, 2018, from http://time.com/23782/google-flu-trends-big-data-problems/

Ware, M., & Mabe, M. (2009). The STM report: An overview of scientific and scholarly journal publishing. http://www.markwareconsulting.com/institutional-repositories/the-stm-report-an-overview-of-scientific-and-scholarly-journal-publishing/

Ware, M., & Mabe, M. (2012). The STM report: An overview of scientific and scholarly journal publishing (3rd ed). http://www.stm-assoc.org/2012_12_11_STM_Report_2012.pdf

Ware, M., & Mabe, M. (2015). The STM report: An overview of scientific and scholarly journal publishing (4th ed). http://www.stm-assoc.org/2015_02_20_STM_Report_2015.pdf

Wartzman, R. (2014). What Peter Drucker knew about 2020 . Retrieved January 23, 2018, from https://hbr.org/2014/10/what-peter-drucker-knew-about-2020

Wettermark, B., Zoëga, H., Furu, K., Korhonen, M., Hallas, J., Nørgaard, M., Almarsdottir, A. B., Andersen, M., Andersson Sundell, K., Bergman, U., Helin-Salmivaara, A., Hoffmann, M., Kieler, H., Martikainen, J. E., Mortensen, M., Petzold, M., Wallach-Kildemoes, H., Wallin, C., & Sørensen, H. (2013). The Nordic prescription databases as a resource for pharmacoepidemiological research—A literature review. Pharmacoepidemiology and Drug Safety, 22 (7), 691–699. https://doi.org/10.1002/pds.3457 .

Why the 3V’s are not sufficient to describe Big Data. (2015). Retrieved January 25, 2018, from https://datafloq.com/read/3vs-sufficient-describe-big-data/166

Wikipedia. (2018). Wikipedia: Database download . Retrieved January 31, 2018, from wikipedia.org/wiki/Wikipedia:Database_download.

Wilson, N., Mason, K., Tobias, M., Peacey, M., Huang, Q. S., & Baker, M. (2009). Interpreting “Google Flu Trends” data for pandemic H1N1 Influenza: The New Zealand experience. Euro Surveillance, 14 (44), 1–3. http://www.eurosurveillance.org/ViewArticle.aspx?ArticleId=19386

Winkler, W. E. (1993). Matching and record linkage . US Census Bureau – Research Reports . https://www.census.gov/srd/papers/pdf/rr93-8.pdf

Winkler, W. E. (2014). Matching and record linkage. Wiley Interdisciplinary Reviews: Computational Statistics, 6(5), 313–325. https://doi.org/10.1002/wics.1317 .

Yang, A. C., Tsai, S. J., Huang, N. E., & Peng, C. K. (2011). Association of Internet search trends with suicide death in Taipei City, Taiwan, 2004-2009. Journal of Affective Disorders, 132 (1–2), 179–184. https://doi.org/10.1016/j.jad.2011.01.019 .

Youtie, J., Porter, A. L., & Huang, Y. (2017). Early social science research about Big Data. Science and Public Policy, 44 (1), scw021. https://doi.org/10.1093/scipol/scw021 .

Download references

Acknowledgments

This project is partially funded by the National Health and Medical Research Council (NHMRC) through the Translational Australian Clinical Toxicology Program (TACT) (grant ID1055176).

Author information

Authors and Affiliations

University of Sydney, Sydney, Australia

Jacques Raubenheimer

Corresponding author

Correspondence to Jacques Raubenheimer.

Editor information

Editors and Affiliations

University of New England, Armidale, NSW, Australia

Theodosia Prodromou


Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Raubenheimer, J. (2021). Big Data in Academic Research: Challenges, Pitfalls, and Opportunities. In: Prodromou, T. (Ed.), Big Data in Education: Pedagogy and Research. Policy Implications of Research in Education, vol 13. Springer, Cham. https://doi.org/10.1007/978-3-030-76841-6_1

DOI: https://doi.org/10.1007/978-3-030-76841-6_1

Published: 05 October 2021

Publisher: Springer, Cham

Print ISBN: 978-3-030-76840-9

Online ISBN: 978-3-030-76841-6


With big data, answers drive questions


  • Andreas Schmidt
  • 29 March 2017


Usually, when we search for a solution, we start with a question and then seek out answers. According to Viktor Mayer-Schönberger , one of the plenary speakers at the 2017 OCLC EMEA Regional Council Meeting in Berlin , big data flips that equation on its head.

Tying into the event’s theme, “Libraries at the Crossroads: Resolving Identities,” Viktor explained that big data is all about gaining new perspectives on the world. It is revolutionizing what we see and how we process information. And he explained that with big data, we start with answers—what the data tells us—and then go back to fill in appropriate questions and hypotheses.

As a Professor at Oxford University’s Internet Institute and author of Big Data: A Revolution That Will Transform How We Live, Work, and Think , Viktor also explained that every additional data point is an opportunity to boost customer services and find new synergies. He talked about the quantity of big data translating into a new capability to make sense of patterns.

As I thought about his presentation, I wondered about the impact of big data on libraries. In our own way, we librarians have been big data crunchers for decades. We’ve made great strides in collecting bibliographic data at scale. So how do we move these efforts forward?

Positioning libraries for big data success

Big data has made processing large collections of data inexpensive and fast. It provides the ability for forward-looking decision-making based on data from multiple, disparate data sources.

Some recent opportunities include:

Curating research data . University researchers and government agencies manage and preserve massive digital assets—images, text and data—that require integrated management and preservation programs. These data include project proposals, grant proposals, researcher notes, researcher profiles, datasets, experiment results, article drafts and copies of published articles. The library’s role in connecting and curating these institutional assets is needed and a big opportunity for new services. OCLC Research scientists are exploring topics related to data curation and libraries with an eye toward distinctive services that will support research missions.

Aggregating library data . We are leveraging members’ collected knowledge investment for efficiency and re-use by libraries and other organizations. One example is the Virtual International Authority File ( VIAF ), which virtually combines multiple name authority files into a single dataset. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries that serve different language communities. VIAF became an OCLC service in 2012 and today, 25 national libraries from 30 countries are represented in the cooperative data file.

Managing collection data . As libraries move from locally owned to jointly managed print collections, good data about collections can help establish priorities and focus. When aggregated and analyzed across many libraries (through programs such as Sustainable Collections Services ), collections data can suggest patterns and provide insights that inform management decisions. We anticipate that a large part of existing print collections, spread across many libraries, will move into coordinated or shared management within a few years. While quantitative data must be used carefully, information about overlap and usage can supplement the judgment of librarians.

Getting ready for the future

The “Crossroads” theme of the conference was woven through many of the presentations, discussions and conversations I heard. But big data cuts across many of the topics presented, such as issues of digitization, research information management and institutional identities.

Library services will clearly be increasingly affected by big data—but here’s a thought-provoking question: Will the data be our own, or that which comes from an increasingly connected and monitored world? Will we be able to collect data from thousands of institutions in ways that present answers for which we can formulate library-specific questions? Or will we be stuck trying to adjust our inquiries and plans based on data collected elsewhere?

We are still in the early days of aggregating all sorts of new and exciting library data. Indeed, library big data might play a crucial role in framing questions about education, authority and literacy outside the spheres of commercial interest—if we can successfully navigate these crossroads together.



Target Population: What It Is + Strategies for Targeting


Understanding your target population is key to any successful campaign, whether it’s marketing, public health, or social initiatives. But what exactly is a target population, and how can you effectively reach the right people? 

In this blog, we will explore the concept of a target population and outline the best strategies for targeting them to ensure your efforts are both effective and impactful.

What is the Target Population?

A target population is a specific group of individuals that a particular study, program, campaign, or intervention is designed to reach, influence, or study. This group is characterized by certain common attributes or criteria that make it the focus of the effort. 

Understanding and defining a target audience is crucial for the effectiveness and relevance of various activities in fields like research, marketing, and public health.

Importance of Identifying a Target Population

Identifying a target population is crucial across many fields, including research, marketing, and public health. Understanding your target audience matters for several reasons:

Resource Efficiency

Focusing on a specific group can help you allocate your resources more effectively, ensuring that your efforts aren’t wasted on people who are unlikely to engage with your message.

Personalized Messaging

Knowing your target population allows you to tailor your messaging to resonate with their specific needs, challenges, and interests, increasing the chances of your message being well-received.

Higher Conversion Rates

When your efforts are directed toward those most likely to respond, you’re more likely to see higher conversion rates, whether that means more sales, higher customer engagement , or greater participation in your initiative.

Strategies for Effective Population Targeting

Once you’ve identified your target population, it’s important to employ strategies that effectively reach and engage them. Here’s how to do it:

1. Define Your Objectives

Start by clearly defining what you want to achieve. Whether the aim is boosting sales, raising awareness, or improving customer satisfaction, clear goals help you focus your efforts and measure success. Setting specific, measurable objectives, and deciding which data you will need to track them, gives you a clear direction and criteria for evaluating your strategies.

2. Segment Your Audience

Segmentation involves dividing your target population into smaller, more manageable groups based on shared characteristics. This can be done through:

  • Demographic Segmentation : Age, gender, income, education level.
  • Geographic Segmentation : City, region, country.
  • Psychographic Segmentation : Interests, values, lifestyle.
  • Behavioral Segmentation : Buying habits, brand loyalty, usage patterns.

By understanding these segments, you can tailor your messaging to address the unique needs and preferences of each group. For example, within the broad population of "millennials," you might identify sub-groups like "tech-savvy young professionals" or "health-conscious parents."
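To make segmentation concrete, here is a minimal sketch (not from the original article) of how such segments could be derived from survey data with pandas. The column names, age bands, and values are hypothetical.

```python
import pandas as pd

# Hypothetical survey respondents; in practice this comes from a survey export.
df = pd.DataFrame({
    "age":      [22, 35, 29, 41, 55],
    "income":   [28000, 72000, 54000, 90000, 61000],
    "interest": ["fitness", "tech", "tech", "fitness", "travel"],
})

# Demographic segmentation: bucket respondents into age bands.
df["age_band"] = pd.cut(df["age"], bins=[17, 24, 34, 44, 64],
                        labels=["18-24", "25-34", "35-44", "45-64"])

# Psychographic segmentation: group by stated interest within each age band.
segments = df.groupby(["age_band", "interest"], observed=True).size()
print(segments)
```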

3. Utilize Data and Analytics

Leverage data to guide your targeting efforts. Collect data through surveys, social media sentiment analysis , and website analytics to better understand your audience. This data can reveal trends and patterns that help predict future behaviors. The more data-driven your approach, the more accurate your targeting will be.

  • Surveys : Collect detailed information about your audience’s preferences and behaviors.
  • Social Media Analytics: Track engagement, interests, and demographics.
  • Website Analytics: Monitor user behavior and interaction patterns.

4. Create Detailed Audience Personas

Develop personas that represent your ideal customers or target groups. These personas should include demographic information, interests, challenges, and motivations. 

For example, a persona might be “Jessica, a 30-year-old marketing manager who loves fitness and is always on the lookout for new wellness apps.” Creating these profiles helps you understand and empathize with your audience, making it easier to craft messages that resonate with them.

5. Leverage Digital Marketing Tools

Take advantage of digital marketing tools to reach your target population effectively in your market research study. These tools include:

  • Social Media Advertising: Target specific groups based on their interests and behaviors.
  • Search Engine Optimization (SEO): Optimize your website and content to attract relevant traffic.
  • Programmatic Advertising: Use automated systems to buy and place ads that reach your audience in real time.

6. Test and Optimize Your Strategies

Continuously test different strategies to see what works best for your target population. Employ A/B testing to compare different messages, formats, or channels. Gather feedback from your audience and be prepared to adjust your approach based on the results. This iterative process helps you refine your targeting strategies and improve overall effectiveness.
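As a hypothetical illustration of the A/B testing step, the sketch below compares two message variants with a two-proportion z-test from statsmodels. The conversion counts are invented; in practice you would plug in your own experiment data.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B test results: conversions and visitors per variant.
conversions = [120, 150]    # variant A, variant B
visitors    = [2400, 2500]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# A small p-value (e.g. below 0.05) suggests the difference in conversion
# rates is unlikely to be due to chance alone.
```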

7. Maintain Ethical Standards

Ensure that your targeting practices respect privacy and are free from bias. Follow data protection regulations like GDPR, and avoid stereotypes or unfair assumptions. Ethical targeting fosters trust and helps build long-term relationships with your audience.

8. Collaborate with Influencers and Partners

Partner with influencers or brands that share your target audience. Influencers can amplify your message to their followers, while partnerships can provide access to new segments of your population. For example, a fitness app might collaborate with a well-known fitness influencer to reach a larger, relevant audience.

9. Monitor and Adapt Your Approach

Regularly monitor the performance of your targeting efforts. Use analytics to track key performance indicators (KPIs) such as engagement rates, conversion rates, and return on investment (ROI). Be ready to adapt your strategy based on new data, trends, or feedback. This flexibility ensures you stay relevant and effective in your target market.

10. Employ a Multichannel Strategy

Don’t rely on a single method to reach your audience. Use a mix of online and offline channels, such as social media, email marketing, events, and traditional advertising, to ensure you reach your target population wherever they are. Consistent messaging across all channels reinforces your brand and helps achieve your objectives.

Examples of Target Population

Here are some examples of target audiences across different contexts:

1. Marketing a Product

  • Target audience: Young adults aged 18-24 who are tech-savvy and live in urban areas.
  • Example: A smartphone company might target this group for a new feature-rich phone by advertising on social media platforms like Instagram and TikTok, where this demographic spends a lot of time.

2. Public Health Campaign

  • Target audience: Middle-aged adults (45-65) with a history of smoking.
  • Example: A public health organization might target this group for a smoking cessation program, using outreach methods like community health seminars and informational brochures in clinics.

3. Educational Program

  • Target audience: High school students in underprivileged areas.
  • Example: A non-profit organization might target this group for a scholarship and mentoring program, focusing on schools in low-income neighborhoods and promoting the program through school counselors and community centers.

How Does QuestionPro Help You Reach Your Target Population?

QuestionPro offers a range of tools and features that can significantly help identify, understand, and effectively reach a target audience. Here’s how:

1. Survey Creation and Distribution

QuestionPro allows you to create detailed and customizable surveys that can be tailored to your specific target population. This helps in gathering relevant data directly from the group you want to study. You can distribute surveys through various channels, such as social media, ensuring you reach your target population wherever they are.

2. Advanced Segmentation

With the data collected through surveys, you can segment your audience based on demographics (age, gender, income) or psychographics (lifestyle, values, interests). This helps you understand different sub-groups within your target population. QuestionPro also enables you to gather data on behaviors, such as purchase history or usage patterns, helping you refine your target population further.

3. Audience Targeting

QuestionPro’s audience targeting tools allow you to select specific criteria for your survey respondents. This ensures that you’re gathering insights from the exact population segment you’re interested in.

4. Data Analysis and Reporting

QuestionPro provides real-time data analysis, allowing you to quickly see trends and insights as responses come in. This helps you make timely decisions about your target population.

5. Persona Development

The insights gathered through QuestionPro can be used to create detailed personas that represent your target population. These personas can include demographic details, behavioral patterns, and preferences, making it easier to tailor your strategies.

6. Feedback Loops

By regularly using QuestionPro to collect feedback from your target population, you can continuously refine your understanding of their needs and preferences. This helps you adapt your strategies over time to better meet their expectations.

7. Global Reach

If your target population is spread across different regions or speaks multiple languages, QuestionPro supports multilingual surveys, making it easier to reach a global audience.

Targeting the right population is a crucial step in any successful campaign. By understanding who your target population is and employing these strategies to reach and engage them, you can maximize the effectiveness of your efforts, leading to better results and a higher return on investment. 

Effective population targeting is the key to success, whether you’re marketing a product, promoting a cause, or launching a new initiative.

QuestionPro helps organizations precisely identify and engage with their target population, gather meaningful insights, and make data-driven decisions to achieve their goals. Contact QuestionPro for more information!



Key things to know about U.S. election polling in 2024


Confidence in U.S. public opinion polling was shaken by errors in 2016 and 2020. In both years’ general elections, many polls underestimated the strength of Republican candidates, including Donald Trump. These errors laid bare some real limitations of polling.

In the midterms that followed those elections, polling performed better . But many Americans remain skeptical that it can paint an accurate portrait of the public’s political preferences.

Restoring people’s confidence in polling is an important goal, because robust and independent public polling has a critical role to play in a democratic society. It gathers and publishes information about the well-being of the public and about citizens’ views on major issues. And it provides an important counterweight to people in power, or those seeking power, when they make claims about “what the people want.”

The challenges facing polling are undeniable. In addition to the longstanding issues of rising nonresponse and cost, summer 2024 brought extraordinary events that transformed the presidential race . The good news is that people with deep knowledge of polling are working hard to fix the problems exposed in 2016 and 2020, experimenting with more data sources and interview approaches than ever before. Still, polls are more useful to the public if people have realistic expectations about what surveys can do well – and what they cannot.

With that in mind, here are some key points to know about polling heading into this year’s presidential election.

Probability sampling (or “random sampling”). This refers to a polling method in which survey participants are recruited using random sampling from a database or list that includes nearly everyone in the population. The pollster selects the sample. The survey is not open for anyone who wants to sign up.

Online opt-in polling (or “nonprobability sampling”). These polls are recruited using a variety of methods that are sometimes referred to as “convenience sampling.” Respondents come from a variety of online sources such as ads on social media or search engines, websites offering rewards in exchange for survey participation, or self-enrollment. Unlike surveys with probability samples, people can volunteer to participate in opt-in surveys.

Nonresponse and nonresponse bias. Nonresponse is when someone sampled for a survey does not participate. Nonresponse bias occurs when the pattern of nonresponse leads to error in a poll estimate. For example, college graduates are more likely than those without a degree to participate in surveys, leading to the potential that the share of college graduates in the resulting sample will be too high.

Mode of interview. This refers to the format in which respondents are presented with and respond to survey questions. The most common modes are online, live telephone, text message and paper. Some polls use more than one mode.

Weighting. This is a statistical procedure pollsters perform to make their survey align with the broader population on key characteristics like age, race, etc. For example, if a survey has too many college graduates compared with their share in the population, people without a college degree are “weighted up” to match the proper share.
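As a rough, hypothetical illustration of that idea (not any pollster's actual procedure), the sketch below computes simple post-stratification weights for the college-graduate example: each group's weight is its population share divided by its sample share.

```python
import pandas as pd

# Hypothetical sample that is 60% college graduates versus 38% in the population.
sample = pd.DataFrame({"college_grad": [True] * 600 + [False] * 400})
population_share = {True: 0.38, False: 0.62}

sample_share = sample["college_grad"].value_counts(normalize=True).to_dict()

# Weight = population share / sample share, so under-represented groups are weighted up.
weights = {group: population_share[group] / sample_share[group] for group in (True, False)}
sample["weight"] = sample["college_grad"].map(weights)

print(weights)  # graduates weighted down (~0.63), non-graduates weighted up (~1.55)
```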

How are election polls being conducted?

Pollsters are making changes in response to the problems in previous elections. As a result, polling is different today than in 2016. Most U.S. polling organizations that conducted and publicly released national surveys in both 2016 and 2022 (61%) used methods in 2022 that differed from what they used in 2016 . And change has continued since 2022.

A sand chart showing that, as the number of public pollsters in the U.S. has grown, survey methods have become more diverse.

One change is that the number of active polling organizations has grown significantly, indicating that there are fewer barriers to entry into the polling field. The number of organizations that conduct national election polls more than doubled between 2000 and 2022.

This growth has been driven largely by pollsters using inexpensive opt-in sampling methods. But previous Pew Research Center analyses have demonstrated how surveys that use nonprobability sampling may have errors twice as large , on average, as those that use probability sampling.

The second change is that many of the more prominent polling organizations that use probability sampling – including Pew Research Center – have shifted from conducting polls primarily by telephone to using online methods, or some combination of online, mail and telephone. The result is that polling methodologies are far more diverse now than in the past.

(For more about how public opinion polling works, including a chapter on election polls, read our short online course on public opinion polling basics .)

All good polling relies on statistical adjustment called “weighting,” which makes sure that the survey sample aligns with the broader population on key characteristics. Historically, public opinion researchers have adjusted their data using a core set of demographic variables to correct imbalances between the survey sample and the population.

But there is a growing realization among survey researchers that weighting a poll on just a few variables like age, race and gender is insufficient for getting accurate results. Some groups of people – such as older adults and college graduates – are more likely to take surveys, which can lead to errors that are too sizable for a simple three- or four-variable adjustment to work well. Adjusting on more variables produces more accurate results, according to Center studies in 2016 and 2018 .

A number of pollsters have taken this lesson to heart. For example, recent high-quality polls by Gallup and The New York Times/Siena College adjusted on eight and 12 variables, respectively. Our own polls typically adjust on 12 variables . In a perfect world, it wouldn’t be necessary to have that much intervention by the pollster. But the real world of survey research is not perfect.


Predicting who will vote is critical – and difficult. Preelection polls face one crucial challenge that routine opinion polls do not: determining who of the people surveyed will actually cast a ballot.

Roughly a third of eligible Americans do not vote in presidential elections , despite the enormous attention paid to these contests. Determining who will abstain is difficult because people can’t perfectly predict their future behavior – and because many people feel social pressure to say they’ll vote even if it’s unlikely.

No one knows the profile of voters ahead of Election Day. We can’t know for sure whether young people will turn out in greater numbers than usual, or whether key racial or ethnic groups will do so. This means pollsters are left to make educated guesses about turnout, often using a mix of historical data and current measures of voting enthusiasm. This is very different from routine opinion polls, which mostly do not ask about people’s future intentions.

When major news breaks, a poll’s timing can matter. Public opinion on most issues is remarkably stable, so you don’t necessarily need a recent poll about an issue to get a sense of what people think about it. But dramatic events can and do change public opinion , especially when people are first learning about a new topic. For example, polls this summer saw notable changes in voter attitudes following Joe Biden’s withdrawal from the presidential race. Polls taken immediately after a major event may pick up a shift in public opinion, but those shifts are sometimes short-lived. Polls fielded weeks or months later are what allow us to see whether an event has had a long-term impact on the public’s psyche.

How accurate are polls?

The answer to this question depends on what you want polls to do. Polls are used for all kinds of purposes in addition to showing who’s ahead and who’s behind in a campaign. Fair or not, however, the accuracy of election polling is usually judged by how closely the polls matched the outcome of the election.

A diverging bar chart showing polling errors in U.S. presidential elections.

By this standard, polling in 2016 and 2020 performed poorly. In both years, state polling was characterized by serious errors. National polling did reasonably well in 2016 but faltered in 2020.

In 2020, a post-election review of polling by the American Association for Public Opinion Research (AAPOR) found that “the 2020 polls featured polling error of an unusual magnitude: It was the highest in 40 years for the national popular vote and the highest in at least 20 years for state-level estimates of the vote in presidential, senatorial, and gubernatorial contests.”

How big were the errors? Polls conducted in the last two weeks before the election suggested that Biden’s margin over Trump was nearly twice as large as it ended up being in the final national vote tally.

Errors of this size make it difficult to be confident about who is leading if the election is closely contested, as many U.S. elections are .

Pollsters are rightly working to improve the accuracy of their polls. But even an error of 4 or 5 percentage points isn’t too concerning if the purpose of the poll is to describe whether the public has favorable or unfavorable opinions about candidates , or to show which issues matter to which voters. And on questions that gauge where people stand on issues, we usually want to know broadly where the public stands. We don’t necessarily need to know the precise share of Americans who say, for example, that climate change is mostly caused by human activity. Even judged by its performance in recent elections, polling can still provide a faithful picture of public sentiment on the important issues of the day.

The 2022 midterms saw generally accurate polling, despite a wave of partisan polls predicting a broad Republican victory. In fact, FiveThirtyEight found that “polls were more accurate in 2022 than in any cycle since at least 1998, with almost no bias toward either party.” Moreover, a handful of contrarian polls that predicted a 2022 “red wave” largely washed out when the votes were tallied. In sum, if we focus on polling in the most recent national election, there’s plenty of reason to be encouraged.

Compared with other elections in the past 20 years, polls have been less accurate when Donald Trump is on the ballot. Preelection surveys suffered from large errors – especially at the state level – in 2016 and 2020, when Trump was standing for election. But they performed reasonably well in the 2018 and 2022 midterms, when he was not.


During the 2016 campaign, observers speculated about the possibility that Trump supporters might be less willing to express their support to a pollster – a phenomenon sometimes described as the “shy Trump effect.” But a committee of polling experts evaluated five different tests of the “shy Trump” theory and turned up little to no evidence for each one . Later, Pew Research Center and, in a separate test, a researcher from Yale also found little to no evidence in support of the claim.

Instead, two other explanations are more likely. One is about the difficulty of estimating who will turn out to vote. Research has found that Trump is popular among people who tend to sit out midterms but turn out for him in presidential election years. Since pollsters often use past turnout to predict who will vote, it can be difficult to anticipate when irregular voters will actually show up.

The other explanation is that Republicans in the Trump era have become a little less likely than Democrats to participate in polls . Pollsters call this “partisan nonresponse bias.” Surprisingly, polls historically have not shown any particular pattern of favoring one side or the other. The errors that favored Democratic candidates in the past eight years may be a result of the growth of political polarization, along with declining trust among conservatives in news organizations and other institutions that conduct polls.

Whatever the cause, the fact that Trump is again the nominee of the Republican Party means that pollsters must be especially careful to make sure all segments of the population are properly represented in surveys.

The real margin of error is often about double the one reported. A typical election poll sample of about 1,000 people has a margin of sampling error that’s about plus or minus 3 percentage points. That number expresses the uncertainty that results from taking a sample of the population rather than interviewing everyone . Random samples are likely to differ a little from the population just by chance, in the same way that the quality of your hand in a card game varies from one deal to the next.
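The sketch below shows, under the textbook simple-random-sampling assumption, how that roughly plus-or-minus 3-point figure arises for a 1,000-person sample. As the article goes on to note, this covers sampling error only, so the total error is often about twice as large.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of sampling error for a proportion p
    estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# A 50/50 split measured in a sample of 1,000 people.
moe = margin_of_error(0.5, 1000)
print(f"+/- {100 * moe:.1f} percentage points")  # about +/- 3.1
```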

A table showing that sampling error is not the only kind of polling error.

The problem is that sampling error is not the only kind of error that affects a poll. Those other kinds of error, in fact, can be as large or larger than sampling error. Consequently, the reported margin of error can lead people to think that polls are more accurate than they really are.

There are three other, equally important sources of error in polling: noncoverage error , where not all the target population has a chance of being sampled; nonresponse error, where certain groups of people may be less likely to participate; and measurement error, where people may not properly understand the questions or misreport their opinions. Not only does the margin of error fail to account for those other sources of potential error, putting a number only on sampling error implies to the public that other kinds of error do not exist.

Several recent studies show that the average total error in a poll estimate may be closer to twice as large as that implied by a typical margin of sampling error. This hidden error underscores the fact that polls may not be precise enough to call the winner in a close election.

Other important things to remember

Transparency in how a poll was conducted is associated with better accuracy . The polling industry has several platforms and initiatives aimed at promoting transparency in survey methodology. These include AAPOR’s transparency initiative and the Roper Center archive . Polling organizations that participate in these organizations have less error, on average, than those that don’t participate, an analysis by FiveThirtyEight found .

Participation in these transparency efforts does not guarantee that a poll is rigorous, but it is undoubtedly a positive signal. Transparency in polling means disclosing essential information, including the poll’s sponsor, the data collection firm, where and how participants were selected, modes of interview, field dates, sample size, question wording, and weighting procedures.

There is evidence that when the public is told that a candidate is extremely likely to win, some people may be less likely to vote . Following the 2016 election, many people wondered whether the pervasive forecasts that seemed to all but guarantee a Hillary Clinton victory – two modelers put her chances at 99% – led some would-be voters to conclude that the race was effectively over and that their vote would not make a difference. There is scientific research to back up that claim: A team of researchers found experimental evidence that when people have high confidence that one candidate will win, they are less likely to vote. This helps explain why some polling analysts say elections should be covered using traditional polling estimates and margins of error rather than speculative win probabilities (also known as “probabilistic forecasts”).

National polls tell us what the entire public thinks about the presidential candidates, but the outcome of the election is determined state by state in the Electoral College . The 2000 and 2016 presidential elections demonstrated a difficult truth: The candidate with the largest share of support among all voters in the United States sometimes loses the election. In those two elections, the national popular vote winners (Al Gore and Hillary Clinton) lost the election in the Electoral College (to George W. Bush and Donald Trump). In recent years, analysts have shown that Republican candidates do somewhat better in the Electoral College than in the popular vote because every state gets three electoral votes regardless of population – and many less-populated states are rural and more Republican.

For some, this raises the question: What is the use of national polls if they don’t tell us who is likely to win the presidency? In fact, national polls try to gauge the opinions of all Americans, regardless of whether they live in a battleground state like Pennsylvania, a reliably red state like Idaho or a reliably blue state like Rhode Island. In short, national polls tell us what the entire citizenry is thinking. Polls that focus only on the competitive states run the risk of giving too little attention to the needs and views of the vast majority of Americans who live in uncompetitive states – about 80%.

Fortunately, this is not how most pollsters view the world . As the noted political scientist Sidney Verba explained, “Surveys produce just what democracy is supposed to produce – equal representation of all citizens.”


Scott Keeter is a senior survey advisor at Pew Research Center .


Courtney Kennedy is Vice President of Methods and Innovation at Pew Research Center .


Data Science Interview Questions

Introduction:

Data science is an interdisciplinary field that mines raw data, analyses it, and identifies patterns from which valuable insights can be extracted. Statistics, computer science, machine learning, deep learning, data analysis, data visualization, and various other technologies form its core foundation.

Over the years, data science has gained widespread importance because of the value of data. Data is often described as the new oil: when analyzed and harnessed properly, it can be highly beneficial to stakeholders. Data scientists also get to work across diverse domains, solving practical, real-world problems with modern technologies. A familiar real-time application is food delivery in apps such as Uber Eats, which show the delivery person the fastest route from the restaurant to the destination.

Data science also powers recommendation systems on e-commerce sites such as Amazon and Flipkart, which suggest items to users based on their search and purchase history, and it is increasingly used to detect fraud in credit-based financial applications. A successful data scientist can interpret data, innovate, and bring creativity to problems that drive business and strategic goals, which makes it one of the most sought-after jobs of the 21st century.


In this article, we will explore the most commonly asked data science technical interview questions, which should help both aspiring and experienced data scientists.

Data Science Interview Questions for Freshers

1. What is Data Science?

An interdisciplinary field that constitutes various scientific processes, algorithms, tools, and machine learning techniques working to help find common patterns and gather sensible insights from the given raw input data using statistical and mathematical analysis is called Data Science.

The data science life cycle typically proceeds through the following stages:

  • It starts with gathering the business requirements and relevant data.
  • Once the data is acquired, it is maintained by performing data cleaning, data warehousing, data staging, and data architecture.
  • Data processing does the task of exploring the data, mining it, and analyzing it which can be finally used to generate the summary of the insights extracted from the data.
  • Once the exploratory steps are completed, the cleansed data is subjected to various algorithms like predictive analysis, regression, text mining, recognition patterns, etc depending on the requirements.
  • In the final stage, the results are communicated to the business in a visually appealing manner. This is where data visualization, reporting, and business intelligence tools come into the picture.

2. Define the terms KPI, lift, model fitting, robustness and DOE.

  • KPI: KPI stands for Key Performance Indicator that measures how well the business achieves its objectives.
  • Lift: This is a performance measure of the target model measured against a random-choice baseline; it indicates how much better the model predicts than having no model at all (a short sketch of computing lift follows this list).
  • Model fitting: This indicates how well the model under consideration fits given observations.
  • Robustness: This represents the system’s capability to handle differences and variances effectively.
  • DOE: This stands for design of experiments, the systematic design of tasks aimed at describing and explaining how information varies under hypothesized conditions that reflect the study variables.
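As a quick, hypothetical illustration of the lift measure mentioned above (not part of the original answer), the sketch below compares the positive rate among the cases a model ranks highest against the overall positive rate; a lift above 1 means the model beats random selection.

```python
import numpy as np

def lift_at_k(y_true, y_score, k=0.1):
    """Lift in the top k fraction of cases ranked by model score,
    relative to the overall positive rate (a random-choice baseline)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n_top = max(1, int(len(y_true) * k))
    top_idx = np.argsort(y_score)[::-1][:n_top]
    return y_true[top_idx].mean() / y_true.mean()

# Invented labels and model scores, for illustration only.
y_true  = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]
y_score = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.6]
print(lift_at_k(y_true, y_score, k=0.3))  # 2.5: the top 30% is 2.5x richer in positives
```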

3. What is the difference between data analytics and data science?

  • Data science involves transforming data with various technical analysis methods to extract meaningful insights that a data analyst can then apply to business scenarios.
  • Data analytics deals with checking existing hypotheses and information, and it answers questions that support better and more effective business decision-making.
  • Data science drives innovation by answering questions that build connections and solutions for future problems. Data analytics focuses on deriving present meaning from existing historical context, whereas data science focuses on predictive modeling.
  • Data science can be considered a broad subject that uses various mathematical and scientific tools and algorithms to solve complex problems, whereas data analytics is a narrower field that addresses specific, concentrated problems with fewer tools, chiefly statistics and visualization.

The following Venn diagram depicts the difference between data science and data analytics clearly:

[Figure: Venn diagram of data science vs. data analytics]

4. What are some of the techniques used for sampling? What is the main advantage of sampling?

Data analysis cannot be done on the whole volume of data at a time, especially when it involves larger datasets. It becomes crucial to take data samples that can represent the whole population and then perform analysis on them. While doing this, it is very important to carefully draw sample data out of the huge dataset so that it truly represents the entire dataset.


There are majorly two categories of sampling techniques based on the usage of statistics, they are:

  • Probability Sampling techniques: Clustered sampling, Simple random sampling, Stratified sampling.
  • Non-Probability Sampling techniques: Quota sampling, Convenience sampling, snowball sampling, etc.
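As a rough illustration, here is a minimal sketch of simple random sampling and stratified sampling using pandas and scikit-learn; the tiny DataFrame and its column names are invented for the example.

```python
# A minimal sketch of simple random and stratified sampling;
# the DataFrame `df` and the column names are illustrative only.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "income": [30, 45, 52, 61, 75, 88, 95, 120],
    "segment": ["A", "A", "B", "B", "A", "B", "A", "B"],
})

# Simple random sampling: every row has the same chance of selection.
simple_sample = df.sample(frac=0.5, random_state=42)

# Stratified sampling: preserve the proportion of each `segment` class.
strat_sample, _ = train_test_split(
    df, train_size=0.5, stratify=df["segment"], random_state=42
)

print(simple_sample)
print(strat_sample["segment"].value_counts())
```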

5. List down the conditions for Overfitting and Underfitting.

Overfitting: The model performs well only on the sample training data. When any new data is given as input, the model fails to produce accurate results. These conditions occur due to low bias and high variance in the model. Decision trees are more prone to overfitting.


Underfitting: Here, the model is so simple that it is not able to identify the correct relationships in the data, and hence it does not perform well even on the training data. This can happen due to high bias and low variance. Linear regression is more prone to underfitting.



6. Differentiate between long format and wide format data.

Long format data:

  • Each row of the data represents one observation (one time point) of a subject, so each subject has its data spread across multiple rows.
  • The data can be recognized by considering rows as groups.
  • This format is most commonly used in R analyses and for writing to log files after each trial.

Wide format data:

  • The repeated responses of a subject are placed in separate columns of a single row.
  • The data can be recognized by considering columns as groups.
  • This format is rarely used in R analyses and is most commonly used in stats packages for repeated-measures ANOVAs.

The following image depicts the representation of wide format and long format data:

[Figure: Long format vs. wide format data]

7. What are Eigenvectors and Eigenvalues?

Eigenvectors are column vectors or unit vectors whose length/magnitude is equal to 1. They are also called right vectors. Eigenvalues are coefficients that are applied on eigenvectors which give these vectors different values for length or magnitude.


A matrix can be decomposed into Eigenvectors and Eigenvalues and this process is called Eigen decomposition. These are then eventually used in machine learning methods like PCA (Principal Component Analysis) for gathering valuable insights from the given matrix.
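As a small illustration of eigendecomposition (the building block behind PCA), the following NumPy sketch uses a made-up 2×2 matrix; nothing here is specific to any particular dataset.

```python
# Eigendecomposition of a small symmetric matrix with NumPy.
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of `eigenvectors` are unit eigenvectors
print(eigenvalues)

# Verify A v = lambda v for the first eigenpair.
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))             # True
```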

8. What does it mean when the p-values are high and low?

A p-value is the measure of the probability of having results equal to or more than the results achieved under a specific hypothesis assuming that the null hypothesis is correct. This represents the probability that the observed difference occurred randomly by chance.

  • A low p-value (≤ 0.05) means that the null hypothesis can be rejected; the observed data would be unlikely if the null hypothesis were true.
  • A high p-value (> 0.05) indicates strength in favor of the null hypothesis; the observed data is likely under a true null.
  • A p-value of exactly 0.05 means that the hypothesis can go either way.

9. When is resampling done?

Resampling is a methodology used to sample data for improving accuracy and quantify the uncertainty of population parameters. It is done to ensure the model is good enough by training the model on different patterns of a dataset to ensure variations are handled. It is also done in the cases where models need to be validated using random subsets or when substituting labels on data points while performing tests.

10. What do you understand by Imbalanced Data?

Data is said to be highly imbalanced if it is distributed unequally across different categories. These datasets result in an error in model performance and result in inaccuracy.

11. Are there any differences between the expected value and mean value?

There are not many differences between these two, but it is to be noted that these are used in different contexts. The mean value generally refers to the probability distribution whereas the expected value is referred to in the contexts involving random variables.

12. What do you understand by Survivorship Bias?

This bias refers to the logical error while focusing on aspects that survived some process and overlooking those that did not work due to lack of prominence. This bias can lead to deriving wrong conclusions.

13. What is a Gradient and Gradient Descent?

Gradient: The gradient measures how much the output of a function changes with respect to a small change in its input. In other words, it is a measure of the change in the weights with respect to the change in error. Mathematically, the gradient can be represented as the slope of a function.


Gradient Descent: Gradient descent is a minimization algorithm. It can minimize any function given to it, but in machine learning it is usually applied to the loss (cost) function of the model.

Gradient descent, as the name suggests, means a descent or decrease in something. The analogy often used for gradient descent is a person climbing down a hill or mountain. The update rule of gradient descent can be written as:

b = a − γ∇F(a)

So, if a person is climbing down the hill, the next position the climber moves to is denoted by "b" in this equation, and the current position is "a". There is a minus sign because it denotes minimization (gradient descent is a minimization algorithm). Gamma (γ) is the weighting factor, also called the learning rate, and the remaining gradient term ∇F(a) gives the direction of the steepest descent.

This situation can be represented in a graph as follows:

[Figure: Gradient descent moving from the initial weights towards the global minimum]

Here, we are somewhere at the “Initial Weights” and we want to reach the Global minimum. So, this minimization algorithm will help us do that.
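To make the idea concrete, here is a minimal gradient descent sketch that minimizes the toy function f(b) = (b − 3)²; the function, learning rate, and step count are all invented for illustration.

```python
# A minimal gradient descent loop; `gamma` plays the role of the
# weighting factor (learning rate) described above.
def gradient_descent(start, gamma=0.1, steps=50):
    b = start
    for _ in range(steps):
        grad = 2 * (b - 3)        # derivative of f(b) = (b - 3)^2
        b = b - gamma * grad      # b_next = b_current - gamma * gradient
    return b

print(gradient_descent(start=10.0))   # converges towards the minimum at b = 3
```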

14. Define confounding variables.

Confounding variables are also known as confounders. They are a type of extraneous variable that influences both the independent and dependent variables, causing spurious associations and mathematical relationships between variables that are associated but not causally related to each other.

15. Define and explain selection bias?

Selection bias occurs when the researcher has to decide which participants to study and that selection is not random. It is also called the selection effect, and it is caused by the method of sample collection.

Four types of selection bias are explained below:

  • Sampling Bias: As a result of a population that is not random at all, some members of a population have fewer chances of getting included than others, resulting in a biased sample. This causes a systematic error known as sampling bias.
  • Time interval: Trials may be stopped early once an extreme value is reached; if all variables have similar means, the variable with the highest variance has a higher chance of achieving that extreme value.
  • Data: It is when specific data is selected arbitrarily and the generally agreed criteria are not followed.
  • Attrition: Attrition in this context means the loss of the participants. It is the discounting of those subjects that did not complete the trial.

16. Define bias-variance trade-off?

Let us first understand the meaning of bias and variance in detail:

Bias: It is a kind of error in a machine learning model when an ML Algorithm is oversimplified. When a model is trained, at that time it makes simplified assumptions so that it can easily understand the target function. Some algorithms that have low bias are Decision Trees, SVM, etc. On the other hand, logistic and linear regression algorithms are the ones with a high bias.

Variance: Variance is also a kind of error. It is introduced into an ML model when the algorithm is made highly complex. Such a model also learns noise from the training data set and therefore performs badly on the test data set. This leads to overfitting and high sensitivity.

When the complexity of a model is increased, a reduction in error is seen at first, caused by the lower bias in the model. But this only continues until a particular point, called the optimal point. If we keep increasing the complexity of the model beyond this point, it becomes overfitted and suffers from the problem of high variance. We can represent this situation with the help of a graph as shown below:

[Figure: Error vs. model complexity, showing the bias-variance trade-off around the optimal point]

As you can see from the image above, before the optimal point, increasing the complexity of the model reduces the error (bias). However, after the optimal point, we see that the increase in the complexity of the machine learning model increases the variance.

Trade-off Of Bias And Variance: So, as we know that bias and variance, both are errors in machine learning models, it is very essential that any machine learning model has low variance as well as a low bias so that it can achieve good performance.

Let us see some examples. The K-Nearest Neighbor Algorithm is a good example of an algorithm with low bias and high variance. This trade-off can easily be reversed by increasing the k value which in turn results in increasing the number of neighbours. This, in turn, results in increasing the bias and reducing the variance.

Another example can be the algorithm of a support vector machine. This algorithm also has a high variance and obviously, a low bias and we can reverse the trade-off by increasing the value of parameter C. Thus, increasing the C parameter increases the bias and decreases the variance.

So, the trade-off is simple. If we increase the bias, the variance will decrease and vice versa.

17. Define the confusion matrix?

It is a matrix that has 2 rows and 2 columns. It has 4 outputs that a binary classifier provides to it. It is used to derive various measures like specificity, error rate, accuracy, precision, sensitivity, and recall.

[Figure: 2×2 confusion matrix showing true/false positives and negatives]

The test data set should contain the correct labels and the predicted labels. If the binary classifier performs perfectly, the predicted labels are exactly the same as the observed labels; in real-world scenarios they only match a part of the observed labels. The four outcomes shown in the confusion matrix mean the following:

  • True Positive: This means that the positive prediction is correct.
  • False Positive: This means that the positive prediction is incorrect.
  • True Negative: This means that the negative prediction is correct.
  • False Negative: This means that the negative prediction is incorrect.

The formulas for calculating the basic measures that come from the confusion matrix are:

  • Error rate : (FP + FN)/(P + N)
  • Accuracy : (TP + TN)/(P + N)
  • Sensitivity = TP/P
  • Specificity = TN/N
  • Precision = TP/(TP + FP)
  • F-Score = (1 + b²)(Precision × Recall) / (b² × Precision + Recall). Here, b is usually 0.5, 1, or 2.

In these formulas:

FP = false positive, FN = false negative, TP = true positive, TN = true negative

Sensitivity is the measure of the True Positive Rate. It is also called recall. Specificity is the measure of the true negative rate. Precision is the measure of a positive predicted value. F-score is the harmonic mean of precision and recall.
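The following short sketch computes a confusion matrix and the derived measures with scikit-learn; the label arrays are toy values chosen only for illustration.

```python
# Confusion matrix and the measures derived from it.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

print("accuracy :", (tp + tn) / (tp + tn + fp + fn))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))    # sensitivity / true positive rate
print("f1-score :", f1_score(y_true, y_pred))
```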

18. What is logistic regression? State an example where you have recently used logistic regression.

Logistic Regression is also known as the logit model. It is a technique to predict the binary outcome from a linear combination of variables (called the predictor variables). 

For example , let us say that we want to predict the outcome of elections for a particular political leader. So, we want to find out whether this leader is going to win the election or not. So, the result is binary i.e. win (1) or loss (0). However, the input is a combination of linear variables like the money spent on advertising, the past work done by the leader and the party, etc. 
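A hedged sketch of the election example with scikit-learn's LogisticRegression is shown below; the two predictors (advertising spend and a past-work score) and the win/loss labels are entirely invented.

```python
# Logistic regression on a toy "election outcome" dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[10, 3], [25, 7], [40, 8], [5, 2], [35, 9], [15, 4]])  # [ad spend, past work score]
y = np.array([0, 1, 1, 0, 1, 0])                                     # 1 = win, 0 = loss

model = LogisticRegression().fit(X, y)
print(model.predict([[30, 6]]))         # predicted class for a new candidate
print(model.predict_proba([[30, 6]]))   # loss/win probabilities
```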

19. What is Linear Regression? What are some of the major drawbacks of the linear model?

Linear regression is a technique in which the score of a variable Y is predicted using the score of a predictor variable X. Y is called the criterion variable. Some of the drawbacks of Linear Regression are as follows:

  • The assumption of linearity (a linear relationship between the predictor and the outcome, with well-behaved errors) is a major drawback, since it often does not hold.
  • It cannot be used for binary outcomes; we have Logistic Regression for that.
  • It is prone to overfitting, which the basic model offers no built-in way to control.

20. What is a random forest? Explain its working.

Classification is very important in machine learning. It is important to know to which class an observation belongs. Hence, we have various classification algorithms in machine learning like logistic regression, support vector machines, decision trees, the Naive Bayes classifier, etc. One such classification technique that sits near the top of the classification hierarchy is the random forest classifier.

So, firstly we need to understand a decision tree before we can understand the random forest classifier and its works. So, let us say that we have a string as given below:

[Figure: Example string of 5 ones and 4 zeroes, with colour and underline features]

So, we have the string with 5 ones and 4 zeroes and we want to classify the characters of this string using their features. These features are colour (red or green in this case) and whether the observation (i.e. character) is underlined or not. Now, let us say that we are only interested in red and underlined observations. So, the decision tree would look something like this:

[Figure: Decision tree splitting first on colour, then on underline]

So, we started with the colour first as we are only interested in the red observations and we separated the red and the green-coloured characters. After that, the “No” branch i.e. the branch that had all the green coloured characters was not expanded further as we want only red-underlined characters. So, we expanded the “Yes” branch and we again got a “Yes” and a “No” branch based on the fact whether the characters were underlined or not. 

So, this is how we draw a typical decision tree. However, the data in real life is not this clean but this was just to give an idea about the working of the decision trees. Let us now move to the random forest.

Random Forest

It consists of a large number of decision trees that operate as an ensemble. Basically, each tree in the forest gives a class prediction and the one with the maximum number of votes becomes the prediction of our model. For instance, in the example shown below, 4 decision trees predict 1, and 2 predict 0. Hence, prediction 1 will be considered.

[Figure: Random forest of decision trees voting by majority]

The underlying principle of a random forest is that several weak learners combine to form a strong learner. The steps to build a random forest are as follows (see the sketch after this list):

  • Build several decision trees on bootstrap samples of the data and record their predictions.
  • Each time a split is considered for a tree, choose a random sample of m predictors as the split candidates out of all p predictors. This happens for every tree in the random forest.
  • Apply the rule of thumb: at each split, m ≈ √p.
  • Aggregate the predictions of the trees using the majority rule.
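Here is a minimal random forest sketch with scikit-learn; the iris dataset simply stands in for any labelled classification data, and the hyperparameter values are illustrative.

```python
# Random forest classifier; max_features="sqrt" mirrors the m ≈ √p rule of thumb.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)                 # builds 100 decision trees on bootstrap samples

print(forest.score(X_test, y_test))          # accuracy of the majority-vote predictions
```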

21. In a time interval of 15-minutes, the probability that you may see a shooting star or a bunch of them is 0.2. What is the percentage chance of you seeing at least one star shooting from the sky if you are under it for about an hour?

Let us say that Prob is the probability that we may see a minimum of one shooting star in 15 minutes.

So, Prob = 0.2

Now, the probability that we may not see any shooting star in the time duration of 15 minutes is = 1 - Prob

1-0.2 = 0.8

The probability that we may not see any shooting star for an hour is: 

= (1 − Prob)⁴ = 0.8 × 0.8 × 0.8 × 0.8 = (0.8)⁴ ≈ 0.41

So, the probability that we will see at least one shooting star in the time interval of an hour is = 1 − 0.41 = 0.59

So, there is approximately a 59% (roughly 60%) chance that we may see a shooting star within the span of an hour.

22. What is deep learning? What is the difference between deep learning and machine learning?

Deep learning is a paradigm of machine learning. In deep learning, multiple layers of processing are involved in order to extract high-level features from the data. The neural networks are designed in such a way that they try to simulate the human brain.

Deep learning has shown incredible performance in recent years because of the fact that it shows great analogy with the human brain.

The difference between machine learning and deep learning is that deep learning is a paradigm, or a part, of machine learning that is inspired by the structure and functions of the human brain, realised through artificial neural networks.

Data Science Interview Questions for Experienced

1. How are time series problems different from other regression problems?

  • Time series data can be thought of as an extension of linear regression, using terms like autocorrelation and moving averages to summarize the historical data of the y-axis variable for predicting a better future.
  • Forecasting and prediction are the main goals of time series problems, where accurate predictions can be made even though the underlying reasons might not be known.
  • Having time in the problem does not necessarily make it a time series problem; there should be a relationship between the target and time for a problem to become a time series problem.
  • Observations close to one another in time are expected to be similar to each other, unlike observations far apart, which accounts for seasonality. For instance, today's weather would be similar to tomorrow's weather but not to the weather 4 months from today. Hence, weather prediction based on past data becomes a time series problem.

2. What are RMSE and MSE in a linear regression model?

RMSE: RMSE stands for Root Mean Square Error. In a linear regression model, RMSE is used to test the performance of the machine learning model. It is used to evaluate the data spread around the line of best fit. So, in simple words, it is used to measure the deviation of the residuals.

RMSE is calculated using the formula:

RMSE = sqrt( (1/N) × Σ (Yi − Ŷi)² )

  • Yi is the actual value of the output variable.
  • Y(Cap) is the predicted value and,
  • N is the number of data points.

MSE: Mean Squared Error is used to find how close the line is to the actual data. We take the difference between each data point and the line, square it, do this for all the data points, and divide the sum of the squared differences by the total number of data points; this gives us the Mean Squared Error (MSE).

So, if we are taking the squared difference of N data points and dividing the sum by N, what does it mean? Yes, it represents the average of the squared difference of a data point from the line i.e. the average of the squared difference between the actual and the predicted values. The formula for finding MSE is given below:

MSE = (1/N) × Σ (Yi − Ŷi)²

  • Yi is the actual value of the output variable (the ith data point)
  • Y(cap) is the predicted value and,
  • N is the total number of data points.

So, RMSE is the square root of MSE .
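The following small sketch computes MSE and RMSE by hand and compares the result against scikit-learn; the actual and predicted arrays are toy values.

```python
# MSE and RMSE for a toy set of actual vs. predicted values.
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

mse = np.mean((y_actual - y_pred) ** 2)   # average of squared residuals
rmse = np.sqrt(mse)                       # square root of MSE

print(mse, rmse)
print(mean_squared_error(y_actual, y_pred))   # same MSE via scikit-learn
```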

3. What are Support Vectors in SVM (Support Vector Machine)?

[Figure: SVM hyperplane with support vectors marked by thin lines]

In the above diagram, we can see that the thin lines mark the distance from the classifier to the closest data points (darkened data points). These are called support vectors. So, we can define the support vectors as the data points or vectors that are nearest (closest) to the hyperplane. They affect the position of the hyperplane. Since they support the hyperplane, they are known as support vectors.

4. So, you have done some projects in machine learning and data science and we see you are a bit experienced in the field. Let's say your laptop's RAM is only 4GB and you want to train your model on a 10GB data set. What will you do? Have you experienced such an issue before?

In such types of questions, we first need to ask what ML model we have to train. After that, it depends on whether we have to train a model based on Neural Networks or SVM.

The steps for Neural Networks are given below:

  • The entire data can be referenced through a NumPy memory-mapped array (for example, np.memmap or np.load with mmap_mode). This never stores the entire data in RAM; it just creates a mapping to the data on disk.
  • Now, in order to get some desired data, pass the index into the NumPy Array.
  • This data can be used to pass as an input to the neural network maintaining a small batch size.

The steps for SVM are given below:

  • For SVM, smaller data sets can be obtained by dividing the big data set.
  • A subset of the data set can be passed as input when using the partial fit function.
  • Repeat the step of using the partial fit method for the other subsets as well.

Now, you may describe the situation if you have faced such an issue in your projects or working in machine learning/ data science.
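As a rough illustration of both ideas, here is a minimal out-of-core training sketch. It assumes the features and labels already sit in .npy files on disk (the file names, shapes, and batch size are invented), loads them as memory maps so the full 10GB never enters RAM, and feeds mini-batches to an SGD-based linear classifier via partial_fit, which behaves like a linear SVM when trained with the hinge loss.

```python
# Out-of-core training sketch: memory-mapped arrays + incremental fitting.
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.load("features.npy", mmap_mode="r")   # data stays on disk; only slices are read
y = np.load("labels.npy", mmap_mode="r")

model = SGDClassifier(loss="hinge")          # hinge loss gives a linear SVM-like model
classes = np.unique(y)                       # partial_fit needs all classes on the first call
batch_size = 1024

for i, start in enumerate(range(0, X.shape[0], batch_size)):
    batch_X = X[start:start + batch_size]
    batch_y = y[start:start + batch_size]
    if i == 0:
        model.partial_fit(batch_X, batch_y, classes=classes)
    else:
        model.partial_fit(batch_X, batch_y)
```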

5. Explain Neural Network Fundamentals.

In the human brain, different neurons are present. These neurons combine and perform various tasks. The Neural Network in deep learning tries to imitate human brain neurons. The neural network learns the patterns from the data and uses the knowledge that it gains from various patterns to predict the output for new data, without any human assistance.

A perceptron is the simplest neural network that contains a single neuron that performs 2 functions. The first function is to perform the weighted sum of all the inputs and the second is an activation function.


There are some other neural networks that are more complicated. Such networks consist of the following three layers:

  • Input Layer: The neural network has the input layer to receive the input.
  • Hidden Layer: There can be multiple hidden layers between the input layer and the output layer. The initial hidden layers are used for detecting low-level patterns, whereas the later layers combine the outputs of previous layers to find more complex patterns.
  • Output Layer: This layer outputs the prediction.

An example neural network image is shown below:

[Figure: Example neural network with input, hidden, and output layers]

6. What is Generative Adversarial Network?

This approach can be understood with the famous example of the wine seller. Let us say that there is a wine seller who has his own shop. This wine seller purchases wine from the dealers who sell him the wine at a low cost so that he can sell the wine at a high cost to the customers. Now, let us say that the dealers whom he is purchasing the wine from, are selling him fake wine. They do this as the fake wine costs way less than the original wine and the fake and the real wine are indistinguishable to a normal consumer (customer in this case). The shop owner has some friends who are wine experts and he sends his wine to them every time before keeping the stock for sale in his shop. So, his friends, the wine experts, give him feedback that the wine is probably fake. Since the wine seller has been purchasing the wine for a long time from the same dealers, he wants to make sure that their feedback is right before he complains to the dealers about it. Now, let us say that the dealers also have got a tip from somewhere that the wine seller is suspicious of them.

So, in this situation, the dealers will try their best to sell the fake wine whereas the wine seller will try his best to identify the fake wine. Let us see this with the help of a diagram shown below:

[Figure: GAN analogy, with the dealer as generator and the wine expert as discriminator]

From the image above, it is clear that a noise vector is entering the generator (dealer) and he generates the fake wine and the discriminator has to distinguish between the fake wine and real wine. This is a Generative Adversarial Network (GAN).

In a GAN, there are 2 main components viz. the Generator and the Discriminator. The generator is a neural network (often a CNN for image data) that keeps producing images, and the discriminator tries to identify the real images from the fake ones.

7. What is a computational graph?

A computational graph is also known as a "Dataflow Graph". Everything in the famous deep learning library TensorFlow is based on the computational graph. The computational graph in TensorFlow is a network of nodes where each node performs an operation. The nodes of this graph represent operations and the edges represent tensors.

8. What are auto-encoders?

Auto-encoders are learning networks. They transform inputs into outputs with the minimum possible error, so the output we get should be almost equal to, or as close as possible to, the input.

Multiple layers are added between the input and the output layer, and the layers in between are smaller than the input layer. An auto-encoder receives unlabelled input, which is encoded and then used to reconstruct the input later.

9. What are Exploding Gradients and Vanishing Gradients?

  • Exploding Gradients: Let us say that you are training an RNN. Say, you saw exponentially growing error gradients that accumulate, and as a result of this, very large updates are made to the neural network model weights. These exponentially growing error gradients that update the neural network weights to a great extent are called Exploding Gradients .
  • Vanishing Gradients: Let us say again, that you are training an RNN. Say, the slope became too small. This problem of the slope becoming too small is called Vanishing Gradient . It causes a major increase in the training time and causes poor performance and extremely low accuracy.

10. What is the p-value and what does it indicate in the Null Hypothesis?

P-value is a number that ranges from 0 to 1. In a hypothesis test in statistics, the p-value helps in telling us how strong the results are. The claim that is kept for experiment or trial is called Null Hypothesis.

  • A low p-value i.e. p-value less than or equal to 0.05 indicates the strength of the results against the Null Hypothesis which in turn means that the Null Hypothesis can be rejected. 
  • A high p-value i.e. p-value greater than 0.05 indicates the strength of the results in favour of the Null Hypothesis i.e. for the Null Hypothesis which in turn means that the Null Hypothesis can be accepted.

11. Since you have experience in the deep learning field, can you tell us why TensorFlow is the most preferred library in deep learning?

TensorFlow is a very famous library in deep learning. The reason is pretty simple: it provides both C++ and Python APIs, which makes it much easier to work with. TensorFlow also has a fast compilation speed compared to other famous deep learning libraries like Keras and Torch. Apart from that, TensorFlow supports both GPU and CPU computing devices. Hence, it is a major success and a very popular library for deep learning.

12. Suppose there is a dataset having variables with missing values of more than 30%, how will you deal with such a dataset?

Depending on the size of the dataset, we follow the below ways:

  • In case the datasets are small, the missing values are substituted with the mean or average of the remaining data. In pandas, this can be done by using mean = df.mean() where df represents the pandas dataframe representing the dataset and mean() calculates the mean of the data. To substitute the missing values with the calculated mean, we can use df.fillna(mean) .
  • For larger datasets, the rows with missing values can be removed and the remaining data can be used for data prediction.
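A short sketch of the mean-substitution approach mentioned above is shown below, using a toy pandas DataFrame with missing values; the column names are invented.

```python
# Substituting missing values with the column means, then the drop-rows alternative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "salary": [50000, 60000, np.nan, 55000]})

mean = df.mean()              # column-wise means, ignoring NaNs
df_filled = df.fillna(mean)   # substitute missing values with the calculated means
print(df_filled)

# For larger datasets, rows with missing values can simply be dropped instead:
df_dropped = df.dropna()
```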

13. What is Cross-Validation?

Cross-Validation is a Statistical technique used for improving a model’s performance. Here, the model will be trained and tested with rotation using different samples of the training dataset to ensure that the model performs well for unknown data. The training data will be split into various groups and the model is run and validated against these groups in rotation.


The most commonly used techniques are:

  • K- Fold method
  • Leave p-out method
  • Leave-one-out method
  • Holdout method
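A minimal K-fold cross-validation sketch with scikit-learn follows; the iris dataset and the logistic regression model are only placeholders for any data and estimator.

```python
# 5-fold cross-validation: the model is trained and validated on rotating splits.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)   # one accuracy score per fold

print(scores, scores.mean())
```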

14. What are the differences between correlation and covariance?

Although these two terms are used for establishing a relationship and dependency between any two random variables, the following are the differences between them:

  • Correlation: This technique is used to measure and estimate the quantitative relationship between two variables, expressed in terms of how strongly the variables are related.
  • Covariance: It represents the extent to which the variables change together. This explains the systematic relationship between a pair of variables, where changes in one variable affect changes in the other.

Mathematically, consider 2 random variables X and Y, where the means are represented as μX and μY respectively, the standard deviations are represented by σX and σY respectively, and E represents the expected value operator. Then:

  • covariance(X, Y) = E[(X − μX)(Y − μY)]
  • correlation(X, Y) = E[(X − μX)(Y − μY)] / (σX σY) = covariance(X, Y) / (σX σY)

Based on the above formula, we can deduce that the correlation is dimensionless whereas covariance is represented in units that are obtained from the multiplication of units of two variables.
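The following small NumPy sketch contrasts the two quantities; the x and y arrays are toy measurements, and rescaling x shows that covariance is unit-dependent while correlation is not.

```python
# Covariance changes with the units of measurement; correlation does not.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 10.2])

print(np.cov(x, y)[0, 1])          # covariance of x and y
print(np.corrcoef(x, y)[0, 1])     # correlation, always between -1 and 1

print(np.cov(x * 100, y)[0, 1])        # covariance scales with x
print(np.corrcoef(x * 100, y)[0, 1])   # correlation is unchanged
```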

The following image graphically shows the difference between correlation and covariance:

[Figure: Correlation vs. covariance]

15. How do you approach solving any data analytics based project?

Generally, we follow the below steps:

  • The first step is to thoroughly understand the business requirement/problem
  • Next, explore the given data and analyze it carefully. If you find any data missing, get the requirements clarified from the business.
  • The data cleanup and preparation step is performed next, and its output is then used for modelling. Here, the missing values are found and the variables are transformed.
  • Run your model against the data, build meaningful visualization and analyze the results to get meaningful insights.
  • Release the model implementation, and track the results and performance over a specified period to analyze the usefulness.
  • Perform cross-validation of the model.



16. How regularly must we update an algorithm in the field of machine learning?

We do not want to update and make changes to an algorithm on a regular basis, because an algorithm is a well-defined step-by-step procedure to solve a problem, and if the steps keep changing it can no longer be called well defined. Frequent changes also cause a lot of problems for the systems already implementing the algorithm, as it becomes difficult to absorb continuous and regular changes. So, we should update an algorithm only in any of the following cases:

  • If you want the model to evolve as data streams through infrastructure, it is fair to make changes to an algorithm and update it accordingly.
  • If the underlying data source is changing, it almost becomes necessary to update the algorithm accordingly.
  • If there is a case of non-stationarity, we may update the algorithm.
  • One of the most important reasons for updating any algorithm is its underperformance and lack of efficiency. So, if an algorithm lacks efficiency or underperforms it should be either replaced by some better algorithm or it must be updated.

17. Why do we need to be aware of selection bias?

Selection Bias happens in cases where there is no randomization specifically achieved while picking a part of the dataset for analysis. This bias tells that the sample analyzed does not represent the whole population meant to be analyzed.

  • For example, in the below image, we can see that the sample that we selected does not entirely represent the whole population that we have. This helps us to question whether we have selected the right data for analysis or not.

[Figure: A selected sample that does not represent the whole population]

18. Why is data cleaning crucial? How do you clean the data?

While running an algorithm on any data, to gather proper insights, it is very much necessary to have correct and clean data that contains only relevant information. Dirty data most often results in poor or incorrect insights and predictions which can have damaging effects.

For example, while launching any big campaign to market a product, if our data analysis tells us to target a product that in reality has no demand and if the campaign is launched, it is bound to fail. This results in a loss of the company’s revenue. This is where the importance of having proper and clean data comes into the picture.

  • Cleaning the data coming from different sources helps in data transformation and results in data that data scientists can actually work with.
  • Properly cleaned data increases the accuracy of the model and provides very good predictions.
  • If the dataset is very large, it becomes cumbersome to run models on it. The data cleanup step takes a lot of time (around 80% of the total) when the data is huge, and it cannot simply be folded into the modelling run. Hence, cleaning the data before running the model increases the speed and efficiency of the model.
  • Data cleaning helps to identify and fix any structural issues in the data. It also helps in removing duplicates and maintaining the consistency of the data.

The following diagram represents the advantages of data cleaning:

[Figure: Advantages of data cleaning]

19. What are the available feature selection methods for selecting the right variables for building efficient predictive models?

While using a dataset in data science or machine learning algorithms, it often happens that not all the variables are necessary and useful to build a model. Smarter feature selection methods are required to avoid redundant features and increase the efficiency of the model. The three main families of feature selection methods are filter, wrapper, and embedded methods:

Filter Methods:

  • These methods pick up only the intrinsic properties of features, measured via univariate statistics rather than cross-validated performance. They are straightforward, generally faster, and require fewer computational resources than wrapper methods.
  • There are various filter methods such as the Chi-Square test, Fisher's Score method, Correlation Coefficient, Variance Threshold, Mean Absolute Difference (MAD) method, Dispersion Ratios, etc.


Wrapper Methods:

  • These methods greedily search over possible feature subsets, assessing the quality of each subset by training and evaluating a classifier with those features.
  • The selection technique is built on top of the machine learning algorithm that the given dataset needs to fit.
  • Forward Selection: One feature is tested at a time and new features are added until a good fit is obtained.
  • Backward Selection: All the features are tested first and the non-fitting ones are eliminated one by one, checking at each step which subset works better.
  • Recursive Feature Elimination: The features are recursively checked and evaluated for how well they perform.
  • These methods are generally computationally intensive and require high-end resources for analysis, but they usually lead to better predictive models with higher accuracy than filter methods.


Embedded Methods:

  • Embedded methods combine the advantages of both filter and wrapper methods by including feature interactions while maintaining reasonable computational costs.
  • These methods are iterative: each model iteration is examined and the features contributing most to the training in that iteration are carefully extracted.
  • Examples of embedded methods: LASSO regularization (L1) and Random Forest importance. A short sketch of all three families follows.
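As a hedged sketch, the snippet below applies one method from each family to the breast cancer dataset bundled with scikit-learn: a chi-square filter, recursive feature elimination as a wrapper, and random forest importances as an embedded method. The dataset and the number of features kept are illustrative choices only.

```python
# One example from each feature selection family.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: keep the 5 features with the highest chi-square score.
X_filtered = SelectKBest(chi2, k=5).fit_transform(X, y)

# Wrapper: recursively eliminate features using a logistic regression estimator.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)

# Embedded: rank features by random forest importance.
forest = RandomForestClassifier(random_state=0).fit(X, y)

print(X_filtered.shape, rfe.support_.sum(), forest.feature_importances_[:5])
```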


20. During analysis, how do you treat the missing values?

To identify the extent of missing values, we first have to identify the variables that have them. If a pattern is identified among the missing values, the analyst should concentrate on it, as it could lead to interesting and meaningful insights. However, if no patterns are identified, we can substitute the missing values with the median or mean values, or simply ignore them.

If the variable is categorical, the common strategies for handling missing values include:

  • Assigning a New Category: You can assign a new category, such as "Unknown" or "Other," to represent the missing values.
  • Mode imputation: You can replace missing values with the mode, which represents the most frequent category in the variable.
  • Using a Separate Category: If the missing values carry significant information, you can create a separate category to indicate missing values.

It's important to select an appropriate strategy based on the nature of the data and the potential impact on subsequent analysis or modelling.

If 80% of the values are missing for a particular variable, then we would drop the variable instead of treating the missing values.

21. Will treating categorical variables as continuous variables result in a better predictive model?

Yes! A categorical variable is a variable that can be assigned to two or more categories with no definite ordering between the categories. Ordinal variables are similar to categorical variables but with a proper and clear ordering defined. So, if the variable is ordinal, then treating the categorical values as continuous variables will result in better predictive models.

22. How will you treat missing values during data analysis?

The impact of missing values can be known after identifying what type of variables have missing values.

  • If the data analyst finds any pattern in these missing values, then there are chances of finding meaningful insights.
  • In case of patterns are not found, then these missing values can either be ignored or can be replaced with default values such as mean, minimum, maximum, or median values.
  • Assigning a new category: You can assign a new category, such as "Unknown" or "Other," to represent the missing values.
  • Using a separate category : If the missing values carry significant information, you can create a separate category to indicate the missing values. It's important to select an appropriate strategy based on the nature of the data and the potential impact on subsequent analysis or modelling.
  • If 80% of values are missing, then it depends on the analyst to either replace them with default values or drop the variables.

23. What does the ROC Curve represent and how to create it?

ROC (Receiver Operating Characteristic) curve is a graphical representation of the contrast between false-positive rates and true positive rates at different thresholds. The curve is used as a proxy for a trade-off between sensitivity and specificity.

The ROC curve is created by plotting the true positive rate (TPR, or sensitivity) against the false-positive rate (FPR, or 1 − specificity). TPR represents the proportion of observations correctly predicted as positive out of all positive observations. FPR represents the proportion of observations incorrectly predicted as positive out of all negative observations. Considering the example of medical testing, the TPR represents the rate at which people are correctly tested positive for a particular disease.
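A short sketch of building an ROC curve from predicted probabilities with scikit-learn is given below; the dataset and the classifier are placeholders chosen only so the example runs end to end.

```python
# ROC curve points and the area under the curve for a binary classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)   # points of the ROC curve
print(roc_auc_score(y_test, probs))               # area under the curve
```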


24. What are the differences between univariate, bivariate and multivariate analysis?

Statistical analyses are classified based on the number of variables processed at a given time.

  • Univariate analysis: Deals with only one variable at a time. Example: sales pie charts by territory.
  • Bivariate analysis: Deals with the statistical study of two variables at a given time. Example: a scatterplot analysing sales against spend volume.
  • Multivariate analysis: Deals with the statistical analysis of more than two variables and studies the responses. Example: a study of the relationship between people's social media habits and their self-esteem, which depends on multiple factors like age, number of hours spent, employment status, relationship status, etc.

25. What is the difference between the Test set and validation set?

The test set is used to test or evaluate the performance of the trained model. It evaluates the predictive power of the model. The validation set is part of the training set that is used to select parameters for avoiding model overfitting.

26. What do you understand by a kernel trick?

Kernel functions are generalized dot product functions used for computing the dot product of vectors x and y in a high-dimensional feature space. The kernel trick method is used for solving a non-linear problem with a linear classifier by transforming linearly inseparable data into separable data in higher dimensions.
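A minimal sketch of the kernel trick with scikit-learn's SVC is shown below: the same linearly inseparable "circles" data is handled by a linear kernel and by an RBF kernel. The dataset parameters are invented for illustration.

```python
# Linear vs. RBF kernel on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

print(SVC(kernel="linear").fit(X, y).score(X, y))  # struggles: the classes form concentric circles
print(SVC(kernel="rbf").fit(X, y).score(X, y))     # near-perfect separation in the implicit higher-dimensional space
```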


27. Differentiate between box plot and histogram.

Box plots and histograms are both visualizations used for showing data distributions for efficient communication of information. Histograms are the bar chart representation of information that represents the frequency of numerical variable values that are useful in estimating probability distribution, variations and outliers. Boxplots are used for communicating different aspects of data distribution where the shape of the distribution is not seen but still the insights can be gathered. These are useful for comparing multiple charts at the same time as they take less space when compared to histograms.


28. How will you balance/correct imbalanced data?

There are different techniques to correct/balance imbalanced data. Balancing can be done by increasing the number of samples for minority classes, or by decreasing the number of samples for classes with extremely many data points. In addition to resampling, evaluation metrics that are robust to imbalance should be used, such as:

  • Specificity/Precision: Indicates the number of selected instances that are relevant.
  • Sensitivity: Indicates the number of relevant instances that are selected.
  • F1 score: It represents the harmonic mean of precision and sensitivity.
  • MCC (Matthews correlation coefficient): It represents the correlation coefficient between observed and predicted binary classifications.
  • AUC (Area Under the Curve): This represents a relation between the true positive rates and false-positive rates.

For example, consider a training dataset in which almost all labels are "0". If we measure the accuracy of the model simply in terms of predicting "0"s, the accuracy would be very high (around 99.9%), but the model would not provide any valuable information. In such cases, we can apply the different evaluation metrics stated above.

  • Under-sampling: This balances the data by reducing the size of the abundant class and is used when the data quantity is sufficient. By performing this, a new, balanced dataset is obtained that can be used for further modeling.
  • Over-sampling: This is used when the data quantity is not sufficient. It balances the dataset by increasing the size of the rare class. Instead of getting rid of extra samples, new samples are generated and introduced by employing methods such as repetition and bootstrapping (a minimal sketch follows this list).
  • Perform K-fold cross-validation correctly: Cross-validation needs to be applied properly while using over-sampling. The cross-validation should be done before over-sampling, because doing it afterwards would amount to overfitting the model to get a specific result. To avoid this, resampling of the data is done repeatedly with different ratios.
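Below is a minimal sketch of the over-sampling idea using sklearn.utils.resample; the tiny DataFrame and its column names are invented for illustration, and dedicated imbalanced-learning libraries offer more sophisticated strategies.

```python
# Over-sampling the minority class by drawing its rows with replacement.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["label"].value_counts())   # both classes now have 8 rows
```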

29. What is better - random forest or multiple decision trees?

Random forest is better than multiple independent decision trees, as random forests are much more robust, more accurate, and less prone to overfitting, because they are an ensemble method in which multiple weak decision trees combine to learn strongly.

30. Consider a case where you know the probability of finding at least one shooting star in a 15-minute interval is 30%. Evaluate the probability of finding at least one shooting star in a one-hour duration?

The probability of finding at least one shooting star in a 15-minute interval is 0.3, so the probability of not finding any in 15 minutes is 1 − 0.3 = 0.7. Over one hour, i.e. four independent 15-minute intervals, the probability of not finding any shooting star is 0.7⁴ ≈ 0.2401. So the probability of finding at least one shooting star in an hour is 1 − 0.2401 = 0.7599, i.e. about 76%.

31. Toss the selected coin 10 times from a jar of 1000 coins. Out of 1000 coins, 999 coins are fair and 1 coin is double-headed, assume that you see 10 heads. Estimate the probability of getting a head in the next coin toss.

We know that there are two types of coins - fair and double-headed. Hence, there are two possible ways of choosing a coin. The first is to choose a fair coin and the second is to choose a coin having 2 heads.

P(selecting a fair coin) = 999/1000 = 0.999
P(selecting a double-headed coin) = 1/1000 = 0.001

Using Bayes' rule:

P(double-headed coin | 10 heads) = (0.001 × 1) / (0.001 × 1 + 0.999 × (1/2)¹⁰) ≈ 0.5062
P(fair coin | 10 heads) = 1 − 0.5062 ≈ 0.4938

P(head on the next toss) = 0.5062 × 1 + 0.4938 × 0.5 ≈ 0.7531

So, the answer is 0.7531 or 75.31%.
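The calculation can be checked quickly in plain Python:

```python
# Verifying the Bayes' rule calculation above.
p_fair, p_double = 999 / 1000, 1 / 1000

lik_fair, lik_double = 0.5 ** 10, 1.0   # likelihood of observing 10 heads with each coin

posterior_double = (p_double * lik_double) / (p_double * lik_double + p_fair * lik_fair)
posterior_fair = 1 - posterior_double

p_next_head = posterior_double * 1.0 + posterior_fair * 0.5
print(round(p_next_head, 4))   # 0.7531
```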

32. What are some examples where a false positive has proven more important than a false negative?

Before citing instances, let us understand what are false positives and false negatives.

  • False Positives are those cases that were wrongly identified as an event even if they were not. They are called Type I errors.
  • False Negatives are those cases that were wrongly identified as non-events despite being an event. They are called Type II errors.

Some examples where false positives were more important than false negatives are:

  • In the medical field: Consider a lab report that predicted cancer for a patient who did not actually have cancer. This is an example of a false positive error. It would be dangerous to start chemotherapy for that patient, since he doesn't have cancer and chemotherapy would damage healthy cells and might even itself lead to cancer.
  • In the e-commerce field: Suppose a company starts a campaign in which they give $100 gift vouchers to customers they expect to purchase $10,000 worth of items, assuming the campaign would result in at least 20% profit on items sold above $10,000. If the vouchers are given to customers who haven't purchased anything but were mistakenly marked as having purchased $10,000 worth of products, that is a false positive error.

33. Give one example where both false positives and false negatives are important equally?

In banking: Lending loans is one of the main sources of income for banks, but if the repayment rate isn't good, there is a risk of huge losses instead of profits. Giving out loans is therefore a gamble: banks can't risk losing good customers, but at the same time they can't afford to acquire bad customers. This case is a classic example of false positives and false negatives being equally important.

34. Is it good to do dimensionality reduction before fitting a Support Vector Model?

If the number of features is greater than the number of observations, then performing dimensionality reduction generally improves the SVM (Support Vector Machine).

35. What are various assumptions used in linear regression? What would happen if they are violated?

Linear regression is done under the following assumptions:

  • The sample data used for modeling represents the entire population.
  • There exists a linear relationship between the X-axis variable and the mean of the Y variable.
  • The residual variance is the same for any X values. This is called homoscedasticity
  • The observations are independent of one another.
  • Y is distributed normally for any value of X.

Extreme violations of the above assumptions lead to unreliable results. Smaller violations result in greater variance or bias of the estimates.

36. How is feature selection performed using the regularization method?

The method of regularization entails the addition of penalties to different parameters in the machine learning model, reducing the freedom of the model in order to avoid overfitting. There are various regularization methods available, such as linear model regularization, Lasso/L1 regularization, etc. Linear model regularization applies a penalty over the coefficients that multiply the predictors. Lasso/L1 regularization can shrink some coefficients to zero, making the corresponding features eligible to be removed from the model, as sketched below.
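The following small sketch shows embedded feature selection via L1 (Lasso) regularization with scikit-learn; the diabetes dataset and the alpha value are placeholder choices, and the features whose coefficients are shrunk to zero would be dropped.

```python
# Lasso (L1) regularization: coefficients shrunk to exactly zero mark removable features.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=1.0).fit(X, y)     # a larger alpha means a stronger penalty
kept = np.flatnonzero(lasso.coef_)     # indices of features that survived

print(lasso.coef_)
print("features kept:", kept)
```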

37. How do you identify if a coin is biased?

To identify this, we perform a hypothesis test as below. According to the null hypothesis, the coin is unbiased, i.e. the probability of flipping heads is 50%. According to the alternative hypothesis, the coin is biased and the probability is not equal to 50%. Perform the below steps (a sketch of the test follows the list):

  • Flip coin 500 times
  • Calculate p-value.
  • p-value > alpha: Then null hypothesis holds good and the coin is unbiased.
  • p-value < alpha: Then the null hypothesis is rejected and the coin is biased.
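A hedged sketch of this test using SciPy's binomial test (available in recent SciPy versions) is shown below; the observed head count is an invented value for illustration.

```python
# Two-sided binomial test of H0: P(head) = 0.5.
from scipy.stats import binomtest

n_flips, n_heads, alpha = 500, 280, 0.05

result = binomtest(n_heads, n_flips, p=0.5)
print(result.pvalue)

if result.pvalue < alpha:
    print("Reject H0: the coin appears biased")
else:
    print("Fail to reject H0: no evidence that the coin is biased")
```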

38. What is the importance of dimensionality reduction?

The process of dimensionality reduction involves reducing the number of features in a dataset to avoid overfitting and reduce the variance. There are mainly 4 advantages of this process:

  • This reduces the storage space and time for model execution.
  • Removes the issue of multi-collinearity thereby improving the parameter interpretation of the ML model.
  • Makes it easier for visualizing data when the dimensions are reduced.
  • Avoids the curse of dimensionality.

39. How is the grid search parameter different from the random search tuning strategy?

Tuning strategies are used to find the right set of hyperparameters. Hyperparameters are those properties that are fixed and model-specific before the model is tested or trained on the dataset. Both the grid search and random search tuning strategies are optimization techniques to find efficient hyperparameters.

Grid Search:

  • Here, every combination of a preset list of hyperparameter values is tried out and evaluated.
  • The search pattern is similar to searching in a grid, where the values are arranged in a matrix and a search is performed over them. Each parameter set is tried out and its accuracy is tracked; after every combination has been tried, the model with the highest accuracy is chosen as the best one.
  • The main drawback is that the technique suffers as the number of hyperparameters increases: the number of evaluations can grow exponentially with each additional hyperparameter. This is called the problem of dimensionality in a grid search.


Random Search:

  • In this technique, random combinations of hyperparameter values are tried and evaluated to find the best solution. To optimize the search, the function is tested at random configurations in the parameter space.
  • In this method, there are increased chances of finding optimal parameters because the pattern followed is random, so the model may end up trained on well-performing parameters without an exhaustive search.
  • This search works best when there is a lower number of dimensions, as it takes less time to find the right set.


Conclusion:

Data Science is a very vast field comprising many topics like Data Mining, Data Analysis, Data Visualization, Machine Learning, and Deep Learning, and, most importantly, it is laid on the foundation of mathematical concepts like Linear Algebra and Statistical Analysis. Since there are a lot of prerequisites for becoming a good professional Data Scientist, the perks and benefits are very big. Data Scientist has become one of the most sought-after job roles these days.

Looking for a comprehensive course on Data Science: Check out Scaler’s Data Science Course .

Useful Resources:

  • Best Data Science Courses
  • Python Data Science Interview Questions
  • Google Data Scientist Salary
  • Spotify Data Scientist Salary
  • Data Scientist Salary
  • Data Science Resume
  • Data Analyst: Career Guide
  • Tableau Interview
  • Additional Technical Interview Questions

Frequently Asked Questions

1. How do I prepare for a data science interview?

Some of the preparation tips for data science interviews are as follows:

  • Resume Building: Firstly, prepare your resume well. It is preferable if the resume is only a 1-page resume, especially for a fresher. You should give great thought to the format of the resume as it matters a lot. The data science interviews can be based more on the topics like linear and logistic regression, SVM, root cause analysis, random forest, etc. So, prepare well for the data science-specific questions like those discussed in this article, make sure your resume has a mention of such important topics and you have a good knowledge of them. Also, please make sure that your resume contains some Data Science-based Projects as well. It is always better to have a group project or internship experience in the field that you are interested to go for. However, personal projects will also have a good impact on the resume. So, your resume should contain at least 2-3 data science-based projects that show your skill and knowledge level in data science. Please do not write any such skill in your resume that you do not possess. If you are just familiar with some technology and have not studied it at an advanced level, you can mention a beginner tag for those skills.
  • Prepare Well: Apart from the specific questions on data science, questions on Core subjects like Database Management systems (DBMS), Operating Systems (OS), Computer Networks(CN), and Object-Oriented Programming (OOPS) can be asked from the freshers especially. So, prepare well for that as well.
  • Data structures and Algorithms are the basic building blocks of programming. So, you should be well versed with that as well.
  • Research the Company: This is the tip that most people miss and it is very important. If you are going for an interview with any company, read about the company before and especially in the case of data science, learn which libraries the company uses, what kind of models are they building, and so on. This gives you an edge over most other people.

2. Are data science interviews hard?

An honest reply will be “YES”. This is because of the fact that this field is newly emerging and will keep on emerging forever. In almost every interview, you have to answer many tough and challenging questions with full confidence and your concepts should be strong to satisfy the interviewer. However, with great practice, anything can be achieved. So, follow the tips discussed above and keep practising and learning. You will definitely succeed.

3. What are the top 3 technical skills of a data scientist?

The top 3 skills of a data scientist are:

  • Mathematics: Data science requires a lot of mathematics and a good data scientist is strong in it. It is not possible to become a good data scientist if you are weak in mathematics.
  • Machine Learning and Deep Learning : A data scientist should be very skilled in Artificial Intelligence technologies like deep learning and machine learning. Some good projects and a lot of hands-on practice will help in achieving excellence in that field.
  • Programming: This is an obvious yet very important skill. Being able to solve complex problems is just a problem-solving skill; programming is the ability to write clean and industry-understandable code. This is the skill that most freshers lack because of limited exposure to industry-level code. It also improves with practice and experience.

4. Is data science a good career?

Yes, data science is one of the most futuristic and great career fields. Today and tomorrow or even years later, this field is just going to expand and never end. The reason is simple. Data can be compared to gold today as it is the key to selling everything in the world. Data scientists know how to play with this data to generate some tremendous outputs that are not even imaginable today making it a great career.

5. Are coding questions asked in data science interviews?

Yes, coding questions are asked in data science interviews. One more important thing to note here is that the data scientists are very good problem solvers as they are indulged in a lot of strict mathematics-based activities. Hence, the interviewer expects the data science interview candidates to know data structures and algorithms and at least come up with the solutions to most of the problems.

6. Is python and SQL enough for data science?

Yes. Python and SQL are sufficient for data science roles. However, knowing the R programming language can also have a positive impact. If you know these 3 languages, you have an edge over most of the competitors. Still, Python and SQL are enough for data science interviews.

7. What are Data Science tools?

There are many data science tools available today. TensorFlow is one of the best known; other widely used tools include BigML, SAS (Statistical Analysis System), KNIME, scikit-learn, and PyTorch.
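To give a feel for one of the tools mentioned above, here is a minimal scikit-learn sketch (assuming scikit-learn is installed) that fits a logistic regression classifier on the bundled iris dataset and reports its accuracy on held-out data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a simple classifier and check how well it generalizes.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```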

Some further practice questions to test yourself with:

  • Which of the following is NOT a necessary condition for weakly stationary time series data?
  • Overfitting is more likely to occur when there is a huge amount of training data. True or false?
  • Given that demand was 100 in October 2020, 150 in November 2020, 350 in December 2020, and 400 in January 2021, calculate the 3-month simple moving average for February 2021.
  • Which method depicts hierarchical data in a nested format?
  • What term describes the analysis of data objects that do not comply with general data behaviour?
  • What does a linear equation in 3 variables represent?
  • How would you represent the problem "the price of 2 pens and 1 pencil is 10 units" as a formula in terms of the variables x and y?
  • Which of the following is true regarding hypothesis testing?
  • Which model parameters are used to build ML models with iterative methods under model-based learning?
  • What skills are necessary for a data scientist?


Digital Humanities in Practice: From Research Questions to Results

Combine literary research with data science to find answers in unexpected ways. Learn basic coding tools to help save time and draw insights from thousands of digital documents at once.


Associated Schools

Harvard Faculty of Arts & Sciences

What you'll learn.

  • Understand which digital methods are most suitable to meaningfully analyze large databases of text
  • Identify the resources needed to complete complex digital projects and learn about their possible limitations
  • Create enhanced datasets by scraping websites, identifying character sets and search criteria, and using APIs
  • Download existing datasets and create new ones by scraping websites and using APIs
  • Enrich metadata and tag text to optimize the results of your analysis
  • Analyze thousands of books with digital methods such as topic modeling, vector models, and concept search

Course description

From the printing press to the typewriter, there is a long history of scholars adapting to new technologies. In the last forty or fifty years, the most significant advance has been the digitization of books. We now have whole libraries—centuries of history, literature, and philosophy—available instantaneously. This new access is a wonderful benefit, but it can also be overwhelming. If you have hundreds of thousands of books available to you in an instant, where do you even start? With a bit of elementary code, you can study all of these books at once, and derive new sorts of insights.

Computation is changing the very nature of how we do research in the humanities. Tools from data science can help you to explore the record of human culture in ways that just wouldn’t have been possible before. You’re more likely to reach out to others, to work across disciplines, and to assemble teams. Whether you're a student wanting to expand your skillset, a librarian supporting new modes of research, or a journalist who has just received a massive cache of leaked e-mails, this course will show you how to draw insights from thousands of documents at once. You will learn how, with a few simple lines of code, to make use of the metadata—the information about our objects of study—to zero in on what matters most, and visualize your results so that you can understand them at a glance.

In this course, you’ll work on building parts of a search engine, one tailor-made to the needs of academic research. Along the way, you'll learn the fundamentals of text analysis: a set of techniques for manipulating the written word that stand at the core of the digital humanities.

By the end of the course, you will be able to apply what you learn to whatever interests you most, be it contemporary speeches, journalism, case law, or even art objects. The course itself analyzes pieces of 18th-century literature, showing how these methods can be applied to philosophical works, religious texts, and political and historical records – material from across the spectrum of humanistic inquiry.

Combine your traditional research skills with data science to find answers you might never have expected.

Instructors

Stephen Osadetz

Cole Crawford

Christine Fernsebner Eslao


Money blog: House prices hit two-year high - see the average cost in your region

House prices have hit a two-year high after jumping 0.3% in August, the latest data from Halifax has shown. Scroll through the Money blog for this plus more personal finance and consumer posts - and leave your comments below.

Saturday 7 September 2024 08:31, UK

  • Liam Gallagher jokes about Oasis ticket prices
  • Reality star tells Sky News she didn't have pension in her 40s 
  • Sharp rise in price of first class stamp
  • House prices hit two-year high - see how they vary by region
  • Supermarket's tea beats more expensive brands in taste test  

Essential reads

  • Who's to blame for concert prices going through the roof - and who gets the money?
  • Fake voucher trend sees supermarket call in police
  • How data roaming charges compare by network
  • How your pension could be taxed

Tips and advice

  • Weekly mortgage guide
  • Free school meals guide
  • Cheapest holidays dates before Xmas
  • Money Problem: 'My dog died but insurance still wants a year's payment'

Ask a question or make a comment

Instead of our regular Saturday long read, we've published our first ever Money blog spin off - a student finance special.

In it you'll find:

  • All the best student discounts - food, clothes, beer and more
  • Top 10 budgeting tips for starting uni 
  • What are the highest-paying jobs in the UK?
  • The best bank accounts for students
  • Eight things you need to know about renting as a student
  • Student loans: How do they work and is it too late to apply? 
  • The towns and cities where it's cheapest to be a student 

Check it out here - and we'll be back with live updates on Monday...

By Jimmy Rice , Money blog editor

Away from Oasis ticket prices, the news agenda in Money this week was dominated by pensions.

We learned on Wednesday that the state pension looks set to rise by just below 4% next April - equalling around £400 extra per year for those on the full state pension.

Pre-2016 retirees who may be eligible for the secondary state pension could see a £300 per year increase.

Because of the triple lock, each year the state pension rises by whatever is highest from inflation, average wage growth or 2.5%.

Officials did nothing to downplay a BBC report, apparently based on internal Treasury figures, that average wage growth would be the highest of these this year.

The figures that would be used to set next April's rise are released next week but the OBR forecast is for 3.7% - which would take the full state pension to around £12,000.

Whether or not pensioners would view this as good news is up for debate (see our last post), but there was definite bad news for older Britons earlier in the week, as Chancellor Rachel Reeves refused to rule out heavier taxation on pensions in the October budget.

How could pensions be taxed further? We had a look here...

Ms Reeves also confirmed on Tuesday that she'd impose a cap on corporation tax.

She said the tax would be capped at its current level of 25% to "give business the confidence to grow".

A final piece of news from Money this week that could have consequences for your bank balance was confirmation that the Household Support Fund would be extended until April.

Councils decide how to dish out their share of the fund but it's often via cash grants or vouchers. Many councils also use the cash to work with local charities and community groups to provide residents with key appliances, school uniforms, cookery classes and items to improve energy efficiency in the home.

People should contact their local council for details on how to apply for the Household Support Fund - they can find their council here .

On the Oasis ticket price story, which continued to make headlines through the week including today, a post in Money appeared to help prompt a U-turn from official reseller Twickets.

The company told us it would be lowering its fees after criticism online...

Unofficial resellers were also in the spotlight and, on an episode of the Daily podcast, Niall Paterson spoke to Viagogo - eliciting an admission that things need to change...

Here in Money, we published a few explainers that are well worth checking out...

We'll be back with live updates on Monday - but do check out our Money blog spin-off tomorrow, a student finance special.

Have a good weekend.

We start this week's round up of your comments with Virgin Media O2's decision to axe its weekly free Greggs perk...

Customers on social media claimed they'd review whether they remained with O2 - while one Money blog reader asked what his rights were if he wanted to cancel...

I signed a new O2 contract on 16 August based largely on the advertised promise of the Greggs priority offer. I'm angry that I have been mis-sold my new contract and I will not be able to enjoy the benefit that I signed it for. I want to end it early, what are my rights? Phil

We looked at O2 Priority's T&Cs - and they clearly set out that they can make any change to the terms of the agreement and service without giving you a right to cancel.

Therefore, if you want to cancel you'll have to pay an early termination charge.

There is one exception - but only if you're in the first two weeks of your contract.

Consumer champion Scott Dixon says: "When you enter into a phone contract with a mobile phone provider online, it is classed as a distance sale and is covered by legislation.

"This legislation binds traders to provide key information at the point of sale including right to cancel information. This gives you a 14-day cooling-off period to leave without paying any termination fees, although you would have to pay for what you have used such as calls, texts and data.

"If you entered into the contract in-store, this would not apply." 

This probably isn't what Phil wants to hear - but we did look at other ways he and others might be able to get free or discount Greggs...

This post, which we hoped would be helpful, didn't go down well with everyone...

How to eat Greggs on the cheap?! Give me strength... Pork Pie Percy

Another topic that elicited a strong response from readers was a campaign group's call for the chancellor to impose a pay-per-mile tax on electric vehicles.

EV drivers obviously don't pay fuel duty - and the pay-per-mile proposal would make up for lost revenue to the Treasury as more people ditch petrol and diesel cars.

The Campaign for Better Transport group proposing the tax says the public would be on board - but our LinkedIn poll suggests this isn't the case...

Readers said...

I wonder how many people realise that an introduction of pay per mile, I guess by means of a tracker type of device, will actually allow big brother to watch your every move when travelling in your car, your speed on any given road, accident data etc... our freedom is diminishing. Big Ian
EVs need electricity to work, the cost of electricity in the UK is mad. I pay higher electricity bills because I don't have a diesel anymore. Why should I be charged pence per mile just by having an EV? It's money and NOT pollution targets the government are looking at. A Grant
The proposed introduction of pay per mile for ZEV will clearly by necessary to compensate for the taxes lost from the sale of petroleum based fuels. This was always going to happen. EU4ME
Only a matter of time before they came for the electric clan. I wonder if sales of electric will now suffer?  Chappers2013

Read more on this story here...

Pension stories always attract a lot of feedback - and this week's suggestion that the state pension will rise in line with average earnings growth next year was no different.

A rise of 3.7% would equal another £400 a year...

Wow how generous, suggested £400 rise to state pension would equate to a rise of £7.69 a week to a pensioner. But in reality, take away winter fuel and the rise is £100, that's £1.92 a week - will be rolling in the money. SueP
Without raising the personal allowance any pension increases will be eaten up with tax. This country is unbelievable in the way it treats its old folk. Monkee knows best
A potential £400 rise in state pension is hardly a headline, it's still a long way off from the minimum living wage. Prendy

An Oasis fan who spent more than £350 on a single ticket says she was left "fuming" after extra show dates were announced. 

Diane Green, from Middlesbrough, was close to buying a ticket costing £158 but said she was kicked out of an online queue. 

She then had to wait four hours to pay £357.95 for one ticket.

The 60-year-old wanted to buy a total of four tickets to take herself, her son and two friends to see the band at Heaton Park in Manchester, but said "there's just no way I could have got more".

"I would never have done it (purchased the ticket)," she said.

"If I had known they were putting more dates on, I would have just thought 'no, I'll chance it again', but it was really frustrating."

"I paid double. I could have got two tickets when I paid and now only one person can go. In our household, it's like, who goes?"

Ms Green said she bought the ticket thinking it was her only chance to see the band and was "absolutely fuming" when they announced more dates.

"It's disgraceful," she added. "For me to purchase a ticket for £358, it's a lot of money. I regret doing it in a way."

Oasis announced two new Wembley Stadium dates due to "phenomenal public demand" earlier this week.

It comes after controversy over the sale of tickets for their reunion tour, with 17 shows across Cardiff, Manchester, Wembley, Edinburgh and Dublin selling out.

Fans were beset with problems getting on to ticket websites, from being labelled bots and being kicked out of queuing to some ending up paying more than the advertised price of £148 as costs surged past £355. 

Liam Gallagher appeared to brush off the controversy earlier as he joked about ticket prices on social media, telling one person to "shut up" after Oasis were accused of ripping off fans.

Nationwide's £2.9bn takeover of rival Virgin Money is expected to complete next month after the deal was approved by the UK's financial regulators.

The deal will still need to be sanctioned in court, with a hearing set to take place on 27 September, but it is due to be formally complete on 1 October. 

It comes after Nationwide agreed to the takeover of its London-listed rival in March.

The building society struck the deal with a 220p-a-share offer for Virgin Money, including a planned 2p-per-share dividend payout.

It will bring together Britain's fifth and sixth-largest retail lenders, creating a combined group with around 24.5 million customers and more than 25,000 staff. 

The new owners of The Body Shop are lining up tens of millions of pounds in new financing as they finalise a deal to buy the chain out of administration.

Sky News has learnt that Aurea, an investment company led by cosmetics entrepreneur Mike Jatania, is in advanced talks to secure more than £30m in working capital from Hilco Capital, a prolific investor in and lender to the retail industry.

Banking sources said that the deal between Aurea and FRP Advisory, The Body Shop's administrators, was likely to be finalised within days.

If confirmed, the new debt from Hilco would be used to help place the cosmetics chain back on a growth footing, the bankers said.

The UK economy would need investment of £1trn over a decade for an annual growth rate of 3% to be achieved, according to a business lobby group.

The Capital Markets Industry Taskforce (CMIT), which represents leaders in the financial services sphere, said £100bn a year must be found to help the country catch up after trailing its peers for many years.

It urged a focus on energy, housing and venture capital, arguing the money could be unlocked from the £6trn in long-term capital within the pensions and insurance sector.

The government has made growing the economy its top priority.

Prime Minister Sir Keir Starmer let it be known during the election campaign that he was seeking to achieve a growth rate of 2.5% - a level the economy has struggled to reach since the financial crisis of 2008.

You've waved your magic wand, and your "happily ever after" home appears... 

It sounds like a buyer's dream - and one property has come to market that could be a dream come true for a Disney fan. 

A semi-detached house in Rhyl, Wales, looks ordinary from the outside, but its interior has been decorated as an homage to Disney and other cartoon characters. 

The cast of Aladdin, Maleficent from Sleeping Beauty and Tinkerbell from Peter Pan are just some of the characters displayed around this three-bed house. 

It's been put on the market for £179,950 - more than £44,400 less than the average price of a property in Wales (you can read more about this in our 8.54 post). 

On Zoopla, it is listed as being close to public transport and within walking distance to the town centre. 

It also has two reception areas, a shed and a garden. 

According to the online estate agent, it is "ideal for first time buyers". 

Daniel Copley, consumer expert at Zoopla, told the Money blog: "It goes without saying that this property would make the perfect home for a Disney fan with its spectacular murals showcasing a whole new world.

"Aside from this, the property is conveniently located near the local leisure centre and schools, while Rhyl’s beautiful beaches are also within walking distance." 

Visa says it is planning a new service which offers more control and better protection to people paying bills by bank transfer.

The dedicated service for account-to-account (A2A) payments will launch early in the UK next year, it said - with an "easy to use" resolution service that could make it easier for customers to claw their money back if something goes wrong.

Visa said consumers using the service will be able to monitor their payments more easily and raise any issues by clicking a button in their banking app, giving them a similar level of protection to when they use their cards.

Biometrics will also be incorporated to offer a new level of security, it added.

Royal Mail is hiking the price of first class stamps again - this time by 30p. 

From 7 October, they will increase to £1.65, while second class stamps will remain at 85p.

In April, first class stamp prices increased by 10p to £1.35, and by 10p to 85p for second class.

Royal Mail said it had sought to keep price increases as low as possible in the face of declining letter volumes, inflationary pressures and the costs of maintaining the Universal Service Obligation, under which deliveries have to be made six days a week.

It added that letter volumes have fallen from 20 billion in 2004/5 to around 6.7 billion a year in 2023/4. 

This means the average household now receives four letters a week, compared to 14 a decade ago.

In the same period, the number of addresses Royal Mail must deliver to has risen by four million, meaning the cost of each delivery has also risen. 

Nick Landon, Royal Mail's chief commercial officer, said: "We always consider price increases very carefully. 

"However, when letter volumes have declined by two-thirds since their peak, the cost of delivering each letter inevitably increases."

He called for the universal service to be adapted to reflect changing customer preferences, saying the financial cost of meeting its current demands is "significant".

"The universal service must adapt to reflect changing customer preferences and increasing costs so that we can protect the one-price-goes anywhere service, now and in the future," he added. 

Postal regulator Ofcom said this week that Royal Mail could be allowed to drop Saturday deliveries for second class letters under an overhaul of the service.

Up to 60 new Wagamama restaurants could be coming to the UK. 

The Asian food chain's owner, The Restaurant Group (TRG), said it wanted to operate between 200 and 220 premises across the country as part of a long-term plan. 

It's currently on track to open 10 new sites this year, which would create around 500 jobs, according to The Caterer. 

It comes as TRG posted its financial results for the year ending December 2023. 

It said Wagamama saw its dine-in like-for-like sales increase by 11%. 

Its other brand, Brunning and Price Pubs, saw sales go up by 10%. 

TRG's chief executive Andy Hornby said 2023 was a "genuinely transformational" year for the company. 

"We traded strongly throughout the year thanks to the phenomenal efforts of our restaurant and pub teams," he said. 

"We are on track to open 10 more Wagamama sites in the UK during 2024 and we have acquired 100% ownership of our Wagamama business in the USA." 

He added that he was "confident" that the company would continue to grow in the years ahead, despite the "challenging" consumer backdrop. 


IMAGES

  1. 140 Excellent Big Data Research Topics to Consider

  2. 110 Best Big Data Research Topics and Project Ideas

  3. Top 85 Big Data Interview Questions and Answers for 2024

  4. 166 Big Data Research Topics To Ace Your Paper

  5. (PDF) Critical Questions for Big Data

  6. 100+ Big Data Interview Questions and Answers 2023

VIDEO

  1. Data Science and Big Data Research Group Live Stream

  2. How to tackle interview questions

  3. Next Generation Data Integration Platform Apache Seatunnel

  4. Hadoop and Big Data Interview Questions: Answers and Insights

  5. Lecture 57: Handling Big Data Research

  6. Big Data: Researching Big Data

COMMENTS

  1. 214 Big Data Research Topics: Interesting Ideas To Try

    These 15 topics will help you to dive into interesting research. You may even build on research done by other scholars. Evaluate the data mining process. The influence of the various dimension reduction methods and techniques. The best data classification methods. The simple linear regression modeling methods.

  2. 1057 questions with answers in BIG DATA

    This data is then integrated into big data systems to provide a holistic view of the supply chain. 4. Blockchain Technology: Blockchain is being explored for enhancing transparency and ...

  3. Top 35 big data interview questions with answers for 2024

    Top 35 big data interview questions and answers. Each of the following 35 big data interview questions includes an answer. However, don't rely solely on these answers when preparing for your interview. Instead, use them as a launching point for digging more deeply into each topic. 1.

  4. Research Topics & Ideas: Data Science

    Research Topics & Ideas: Data Science

  5. 99+ Data Science Research Topics: A Path to Innovation

    As we explore the depths of machine learning, natural language processing, big data analytics, and ethical considerations, we pave the way for innovation, shape the future of technology, and make a positive impact on the world. Discover exciting 99+ data science research topics and methodologies in this in-depth blog.

  6. 15 years of Big Data: a systematic literature review

    Over the past 15 years, Big Data has emerged as a foundational pillar providing support to an extensive range of different scientific fields, from medicine and healthcare [] to engineering [], finance and marketing [3,4,5], politics [], social networks analysis [7, 8], and telecommunications [], to cite only a few examples. This 15-year period has witnessed a significant increase in research ...

  7. Frontiers in Big Data

    Foundation Models for Healthcare: Innovations in Generative AI, Computer Vision, Language Models, and Multimodal Systems. This innovative journal focuses on the power of big data - its role in machine learning, AI, and data mining, and its practical application from cybersecurity to climate science and public health.

  8. Creating a Good Research Question

    Using Healthcare Data: How can Researchers Come up with Interesting Questions? Anupam Jena, MD, PhD Another ThinkResearch podcast episode addresses how to discover good research questions by using a backward design approach which involves analyzing big data and allowing the research question to unfold from findings. Play Using Healthcare Data.

  9. Moving back to the future of big data-driven research: reflecting on

    From theory to data-driven science. More than a decade has gone by since Savage and Burrows described a crisis in empirical research, where the well-developed methodologies for collecting data ...

  10. Big Data Research

    The journal aims to promote and communicate advances in big data research by providing a fast and high-quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in …

  11. What are the threats and potentials of big data for qualitative research?

    The potentials of big data for qualitative research are examined, providing recommendations to bring together complementary research endeavors that map large scale social patterns using big data with qualitative questions about participants' subjective perceptions, rich expression of feelings, and reasons for human action.

  12. Ten simple rules for responsible big data research

    The use of big data research methods has grown tremendously over the past five years in both academia and industry. As the size and complexity of available datasets has grown, so too have the ethical questions raised by big data research. These questions become increasingly urgent as data and research agendas move well beyond those typical of ...

  13. CRITICAL QUESTIONS FOR BIG DATA

    Abstract. The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential ...

  14. (PDF) Critical Questions for Big Data

    We define Big Data as a cultural, technological, and scholarly phenomenon that rests on the interplay of: 1) Technology: maximizing computation power and algorithmic accuracy to gather, analyze ...

  15. What Is Big Data Analytics? Definition, Benefits, and More

    What is big data analytics? Big data analytics is the process of collecting, examining, and analyzing large amounts of data to discover market trends, insights, and patterns that can help companies make better business decisions. This information is available quickly and efficiently so that companies can be agile in crafting plans to maintain their competitive advantage.

  16. Ethics review of big data research: What should stay and what should be

    This is an additional reason that big data research often has a tentative approach to a research question, instead of growing from a specific research hypothesis. The difficulty of clearly framing the big data research itself makes it even harder for ERCs to anticipate unforeseeable risks and potential societal consequences. Given the existing ...

  17. 16 Fascinating Real-World Big Data Examples

    16 Fascinating Real-World Big Data Examples

  18. 10 Research Question Examples to Guide your Research Project

    10 Research Question Examples to Guide your ...

  19. Big Data in Academic Research: Challenges, Pitfalls, and ...

    These new frontiers indicate that Big Data allow us—compel us—to study heretofore inaccessible research questions. Big Data can perhaps best be employed to answer questions where conventional methods are failing (although the caveat may be that the failure of conventional methods may only become apparent when their results are contrasted ...

  20. With big data, answers drive questions

    With big data, answers drive questions. Andreas Schmidt. 29 March 2017. Big data. Usually, when we search for a solution, we start with a question and then seek out answers. According to Viktor Mayer-Schönberger, one of the plenary speakers at the 2017 OCLC EMEA Regional Council Meeting in Berlin, big data flips that equation on its head.

  21. Target Population: What It Is + Strategies for Targeting

    Setting specific, measurable objectives of the research and data types ensures that you have a clear direction and criteria for evaluating your strategies. 2. Segment Your Audience. Segmentation involves dividing your target population into smaller, more manageable groups based on shared characteristics. This can be done through:

  22. Key things to know about election polls in the U.S.

    How big were the errors? Polls conducted in the last two weeks before the election suggested that Biden's margin over Trump was nearly twice as large as it ended up being in the final national vote tally. Errors of this size make it difficult to be confident about who is leading if the election is closely contested, as many U.S. elections are.

  23. Data Science Interview Questions

    Top Data Science Interview Questions and Answers (2024)

  24. Big Tasks for Small Computers

    Most current ideas about computing seem to revolve around the notion of putting together an optimal computer for a specific - generally cutting-edge, "big data" - task. For the past couple of years, I have focused on the following question: What is the "best thing" I can get the "worst computer" to do? This piece is a sampler of technical advice, philosophy and the story of ...

  25. Digital Humanities in Practice: From Research Questions to Results

    Combine literary research with data science to find answers in unexpected ways. Learn basic coding tools to help save time and draw insights from thousands of digital documents at once.

  26. Money blog: House prices hit two-year high

    House prices have hit a two-year high after jumping 0.3% in August, the latest data from Halifax has shown. Scroll through the Money blog for this plus more personal finance and consumer posts ...