CBE Life Sci Educ, v.21(3); Fall 2022

Literature Reviews, Theoretical Frameworks, and Conceptual Frameworks: An Introduction for New Biology Education Researchers

Julie A. Luft

† Department of Mathematics, Social Studies, and Science Education, Mary Frances Early College of Education, University of Georgia, Athens, GA 30602-7124

Sophia Jeong

‡ Department of Teaching & Learning, College of Education & Human Ecology, Ohio State University, Columbus, OH 43210

Robert Idsardi

§ Department of Biology, Eastern Washington University, Cheney, WA 99004

Grant Gardner

∥ Department of Biology, Middle Tennessee State University, Murfreesboro, TN 37132


To frame their work, biology education researchers need to consider the role of literature reviews, theoretical frameworks, and conceptual frameworks as critical elements of the research and writing process. However, these elements can be confusing for scholars new to education research. This Research Methods article is designed to provide an overview of each of these elements and delineate the purpose of each in the educational research process. We describe what biology education researchers should consider as they conduct literature reviews, identify theoretical frameworks, and construct conceptual frameworks. Clarifying these different components of educational research studies can be helpful to new biology education researchers and the biology education research community at large in situating their work in the broader scholarly literature.

INTRODUCTION

Discipline-based education research (DBER) involves the purposeful and situated study of teaching and learning in specific disciplinary areas ( Singer et al. , 2012 ). Studies in DBER are guided by research questions that reflect disciplines’ priorities and worldviews. Researchers can use quantitative data, qualitative data, or both to answer these research questions through a variety of methodological traditions. Across all methodologies, there are different methods associated with planning and conducting educational research studies that include the use of surveys, interviews, observations, artifacts, or instruments. Ensuring the coherence of these elements to the discipline’s perspective also involves situating the work in the broader scholarly literature. The tools for doing this include literature reviews, theoretical frameworks, and conceptual frameworks. However, the purpose and function of each of these elements is often confusing to new education researchers. The goal of this article is to introduce new biology education researchers to these three elements important in DBER scholarship and the broader educational literature.

The first element we discuss is a review of research (literature reviews), which highlights the need for a specific research question, study problem, or topic of investigation. Literature reviews situate the relevance of the study within a topic and a field. The process may seem familiar to science researchers entering DBER fields, but new researchers may still struggle in conducting the review. Booth et al. (2016b) highlight some of the challenges novice education researchers face when conducting a review of literature. They point out that novice researchers struggle in deciding how to focus the review, determining the scope of articles needed in the review, and knowing how to be critical of the articles in the review. Overcoming these challenges (and others) can help novice researchers construct a sound literature review that can inform the design of the study and help ensure the work makes a contribution to the field.

The second and third highlighted elements are theoretical and conceptual frameworks. These guide biology education research (BER) studies, and may be less familiar to science researchers. These elements are important in shaping the construction of new knowledge. Theoretical frameworks offer a way to explain and interpret the studied phenomenon, while conceptual frameworks clarify assumptions about the studied phenomenon. Despite the importance of these constructs in educational research, biology education researchers have noted the limited use of theoretical or conceptual frameworks in published work ( DeHaan, 2011 ; Dirks, 2011 ; Lo et al. , 2019 ). In reviewing articles published in CBE—Life Sciences Education ( LSE ) between 2015 and 2019, we found that fewer than 25% of the research articles had a theoretical or conceptual framework (see the Supplemental Information), and at times there was an inconsistent use of theoretical and conceptual frameworks. Clearly, these frameworks are challenging for published biology education researchers, which suggests the importance of providing some initial guidance to new biology education researchers.

Fortunately, educational researchers have increased their explicit use of these frameworks over time, and this is influencing educational research in science, technology, engineering, and mathematics (STEM) fields. For instance, a quick search for theoretical or conceptual frameworks in the abstracts of articles in Educational Research Complete (a common database for educational research) in STEM fields demonstrates a dramatic change over the last 20 years: from only 778 articles published between 2000 and 2010 to 5703 articles published between 2010 and 2020, a more than sevenfold increase. Greater recognition of the importance of these frameworks is contributing to DBER authors being more explicit about such frameworks in their studies.
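The reported growth can be checked with a quick calculation (a minimal sketch using only the article counts cited above; the figures are as reported, not re-queried from the database):

```python
# Article counts from the Educational Research Complete search described above.
articles_2000_2010 = 778
articles_2010_2020 = 5703

# Fold increase across the two decades.
fold_increase = articles_2010_2020 / articles_2000_2010
print(f"{fold_increase:.1f}x increase")  # roughly 7.3x, i.e., more than sevenfold
```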

Collectively, literature reviews, theoretical frameworks, and conceptual frameworks work to guide methodological decisions and the elucidation of important findings. Each offers a different perspective on the problem of study and is an essential element in all forms of educational research. As new researchers seek to learn about these elements, they will find different resources, a variety of perspectives, and many suggestions about the construction and use of these elements. The wide range of available information can overwhelm the new researcher who just wants to learn the distinction between these elements or how to craft them adequately.

Our goal in writing this paper is not to offer specific advice about how to write these sections in scholarly work. Instead, we wanted to introduce these elements to those who are new to BER and who are interested in better distinguishing one from the other. In this paper, we share the purpose of each element in BER scholarship, along with important points on its construction. We also provide references for additional resources that may be beneficial to better understanding each element. Table 1 summarizes the key distinctions among these elements.

TABLE 1. Comparison of literature reviews, theoretical frameworks, and conceptual frameworks

| | Literature reviews | Theoretical frameworks | Conceptual frameworks |
|---|---|---|---|
| Purpose | To point out the need for the study in BER and connection to the field | To state the assumptions and orientations of the researcher regarding the topic of study | To describe the researcher’s understanding of the main concepts under investigation |
| Aims | A literature review examines current and relevant research associated with the study question. It is comprehensive, critical, and purposeful. | A theoretical framework illuminates the phenomenon of study and the corresponding assumptions adopted by the researcher. Frameworks can take on different orientations. | The conceptual framework is created by the researcher(s), includes the presumed relationships among concepts, and addresses needed areas of study discovered in literature reviews. |
| Connection to the manuscript | A literature review should connect to the study question, guide the study methodology, and be central in the discussion by indicating how the analyzed data advance what is known in the field. | A theoretical framework drives the question, guides the types of methods for data collection and analysis, informs the discussion of the findings, and reveals the subjectivities of the researcher. | The conceptual framework is informed by literature reviews, experiences, or experiments. It may include emergent ideas that are not yet grounded in the literature. It should be coherent with the paper’s theoretical framing. |
| Additional points | A literature review may reach beyond BER and include other education research fields. | A theoretical framework does not rationalize the need for the study, and a theoretical framework can come from different fields. | A conceptual framework articulates the phenomenon under study through written descriptions and/or visual representations. |

This article is written for the new biology education researcher who is just learning about these different elements or for scientists looking to become more involved in BER. It is a result of our own work as science education and biology education researchers, whether as graduate students and postdoctoral scholars or newly hired and established faculty members. This is the article we wish had been available as we started to learn about these elements or discussed them with new educational researchers in biology.

LITERATURE REVIEWS

Purpose of a literature review.

A literature review is foundational to any research study in education or science. In education, a well-conceptualized and well-executed review provides a summary of the research that has already been done on a specific topic and identifies questions that remain to be answered, thus illustrating the current research project’s potential contribution to the field and the reasoning behind the methodological approach selected for the study ( Maxwell, 2012 ). BER is an evolving disciplinary area that is redefining areas of conceptual emphasis as well as orientations toward teaching and learning (e.g., Labov et al. , 2010 ; American Association for the Advancement of Science, 2011 ; Nehm, 2019 ). As a result, building comprehensive, critical, purposeful, and concise literature reviews can be a challenge for new biology education researchers.

Building Literature Reviews

There are different ways to approach and construct a literature review. Booth et al. (2016a) provide an overview that includes, for example, scoping reviews, which are focused only on notable studies and use a basic method of analysis, and integrative reviews, which are the result of exhaustive literature searches across different genres. Underlying each of these different review processes are attention to the Search process, Appraisal of articles, Synthesis of the literature, and Analysis: SALSA ( Booth et al. , 2016a ). This useful acronym can help the researcher focus on the process while building a specific type of review.

However, new educational researchers often have questions about literature reviews that are foundational to SALSA or other approaches. Common questions concern determining which literature pertains to the topic of study or the role of the literature review in the design of the study. This section addresses such questions broadly while providing general guidance for writing a narrative literature review that evaluates the most pertinent studies.

The literature review process should begin before the research is conducted. As Boote and Beile (2005 , p. 3) suggested, researchers should be “scholars before researchers.” They point out that having a good working knowledge of the proposed topic helps illuminate avenues of study. Some subject areas have a deep body of work to read and reflect upon, providing a strong foundation for developing the research question(s). For instance, the teaching and learning of evolution is an area of long-standing interest in the BER community, generating many studies (e.g., Perry et al. , 2008 ; Barnes and Brownell, 2016 ) and reviews of research (e.g., Sickel and Friedrichsen, 2013 ; Ziadie and Andrews, 2018 ). Emerging areas of BER include the affective domain, issues of transfer, and metacognition ( Singer et al. , 2012 ). Many studies in these areas are transdisciplinary and not always specific to biology education (e.g., Rodrigo-Peiris et al. , 2018 ; Kolpikova et al. , 2019 ). These newer areas may require reading outside BER; fortunately, summaries of some of these topics can be found in the Current Insights section of the LSE website.

In focusing on a specific problem within a broader research strand, a new researcher will likely need to examine research outside BER. Depending upon the area of study, the expanded reading list might involve a mix of BER, DBER, and educational research studies. Determining the scope of the reading is not always straightforward. A simple way to focus one’s reading is to create a “summary phrase” or “research nugget,” which is a very brief descriptive statement about the study. It should focus on the essence of the study, for example, “first-year nonmajor students’ understanding of evolution,” “metacognitive prompts to enhance learning during biochemistry,” or “instructors’ inquiry-based instructional practices after professional development programming.” This type of phrase should help a new researcher identify two or more areas to review that pertain to the study. Focusing on recent research in the last 5 years is a good first step. Additional studies can be identified by reading relevant works referenced in those articles. It is also important to read seminal studies that are more than 5 years old. Reading a range of studies should give the researcher the necessary command of the subject in order to suggest a research question.

Given that the research question(s) arise from the literature review, the review should also substantiate the selected methodological approach. The review and research question(s) guide the researcher in determining how to collect and analyze data. Often the methodological approach used in a study is selected to contribute knowledge that expands upon what has been published previously about the topic (see Institute of Education Sciences and National Science Foundation, 2013 ). An emerging topic of study may need an exploratory approach that allows for a description of the phenomenon and development of a potential theory. This could, but does not necessarily, require a methodological approach that uses interviews, observations, surveys, or other instruments. An extensively studied topic may call for the additional understanding of specific factors or variables; this type of study would be well suited to a verification or a causal research design. These could entail a methodological approach that uses valid and reliable instruments, observations, or interviews to determine an effect in the studied event. In either of these examples, the researcher(s) may use a qualitative, quantitative, or mixed methods methodological approach.

Even with a good research question, there is still more reading to be done. The complexity and focus of the research question dictates the depth and breadth of the literature to be examined. Questions that connect multiple topics can require broad literature reviews. For instance, a study that explores the impact of a biology faculty learning community on the inquiry instruction of faculty could have the following review areas: learning communities among biology faculty, inquiry instruction among biology faculty, and inquiry instruction among biology faculty as a result of professional learning. Biology education researchers need to consider whether their literature review requires studies from different disciplines within or outside DBER. For the example given, it would be fruitful to look at research focused on learning communities with faculty in STEM fields or in general education fields that result in instructional change. It is important not to be too narrow or too broad when reading. When the conclusions of articles start to sound similar or no new insights are gained, the researcher likely has a good foundation for a literature review. This level of reading should allow the researcher to demonstrate a mastery in understanding the researched topic, explain the suitability of the proposed research approach, and point to the need for the refined research question(s).

The literature review should include the researcher’s evaluation and critique of the selected studies. A researcher may have a large collection of studies, but not all of the studies will follow standards important in the reporting of empirical work in the social sciences. The American Educational Research Association ( Duran et al. , 2006 ), for example, offers a general discussion about standards for such work: an adequate review of research informing the study, the existence of sound and appropriate data collection and analysis methods, and appropriate conclusions that do not overstep or underexplore the analyzed data. The Institute of Education Sciences and National Science Foundation (2013) also offer Common Guidelines for Education Research and Development that can be used to evaluate collected studies.

Because not all journals adhere to such standards, it is important that a researcher review each study to determine the quality of published research, per the guidelines suggested earlier. In some instances, the research may be fatally flawed. Examples of such flaws include data that do not pertain to the question, a lack of discussion about the data collection, poorly constructed instruments, or an inadequate analysis. These types of errors result in studies that are incomplete, error-laden, or inaccurate and should be excluded from the review. Most studies have limitations, and the author(s) often make them explicit. For instance, there may be an instructor effect, recognized bias in the analysis, or issues with the sample population. Limitations are usually addressed by the research team in some way to ensure a sound and acceptable research process. Occasionally, the limitations associated with the study can be significant and not addressed adequately, which leaves a consequential decision in the hands of the researcher. Providing critiques of studies in the literature review process gives the reader confidence that the researcher has carefully examined relevant work in preparation for the study and, ultimately, the manuscript.

A solid literature review clearly anchors the proposed study in the field and connects the research question(s), the methodological approach, and the discussion. Reviewing extant research leads to research questions that will contribute to what is known in the field. By summarizing what is known, the literature review points to what needs to be known, which in turn guides decisions about methodology. Finally, notable findings of the new study are discussed in reference to those described in the literature review.

Within published BER studies, literature reviews can be placed in different locations in an article. When included in the introductory section of the study, the first few paragraphs of the manuscript set the stage, with the literature review following the opening paragraphs. Cooper et al. (2019) illustrate this approach in their study of course-based undergraduate research experiences (CUREs). An introduction discussing the potential of CUREs is followed by an analysis of the existing literature relevant to the design of CUREs that allows for novel student discoveries. Within this review, the authors point out contradictory findings among research on novel student discoveries. This clarifies the need for their study, which is described and highlighted through specific research aims.

A literature review can also make up a separate section in a paper. For example, the introduction to Todd et al. (2019) illustrates the need for their research topic by highlighting the potential of learning progressions (LPs) and suggesting that LPs may help mitigate learning loss in genetics. At the end of the introduction, the authors state their specific research questions. The review of literature following this opening section comprises two subsections. One focuses on learning loss in general and examines a variety of studies and meta-analyses from the disciplines of medical education, mathematics, and reading. The second section focuses specifically on LPs in genetics and highlights student learning in the context of LPs. These separate reviews provide insights into the stated research question.

Suggestions and Advice

A well-conceptualized, comprehensive, and critical literature review reveals the understanding of the topic that the researcher brings to the study. Literature reviews should not be so big that there is no clear area of focus; nor should they be so narrow that no real research question arises. The task for a researcher is to craft an efficient literature review that offers a critical analysis of published work, articulates the need for the study, guides the methodological approach to the topic of study, and provides an adequate foundation for the discussion of the findings.

In our own writing of literature reviews, there are often many drafts. An early draft may seem well suited to the study because the need for and approach to the study are well described. However, as the results of the study are analyzed and findings begin to emerge, the existing literature review may be inadequate and need revision. The need for an expanded discussion about the research area can result in the inclusion of new studies that support the explanation of a potential finding. The literature review may also prove to be too broad. Refocusing on a specific area allows for more contemplation of a finding.

It should be noted that there are different types of literature reviews, and many books and articles have been written about the different ways to embark on these types of reviews. Among these different resources, the following may be helpful in considering how to refine the review process for scholarly journals:

  • Booth, A., Sutton, A., & Papaioannou, D. (2016a). Systemic approaches to a successful literature review (2nd ed.). Los Angeles, CA: Sage. This book addresses different types of literature reviews and offers important suggestions pertaining to defining the scope of the literature review and assessing extant studies.
  • Booth, W. C., Colomb, G. G., Williams, J. M., Bizup, J., & Fitzgerald, W. T. (2016b). The craft of research (4th ed.). Chicago: University of Chicago Press. This book can help the novice consider how to make the case for an area of study. While this book is not specifically about literature reviews, it offers suggestions about making the case for your study.
  • Galvan, J. L., & Galvan, M. C. (2017). Writing literature reviews: A guide for students of the social and behavioral sciences (7th ed.). Routledge. This book offers guidance on writing different types of literature reviews. For the novice researcher, there are useful suggestions for creating coherent literature reviews.

THEORETICAL FRAMEWORKS

Purpose of theoretical frameworks.

As new education researchers may be less familiar with theoretical frameworks than with literature reviews, this discussion begins with an analogy. Envision a biologist, chemist, and physicist examining together the dramatic effect of a fog tsunami over the ocean. A biologist gazing at this phenomenon may be concerned with the effect of fog on various species. A chemist may be interested in the chemical composition of the fog as water vapor condenses around bits of salt. A physicist may be focused on the refraction of light to make fog appear to be “sitting” above the ocean. While observing the same “objective event,” the scientists are operating under different theoretical frameworks that provide a particular perspective or “lens” for the interpretation of the phenomenon. Each of these scientists brings specialized knowledge, experiences, and values to this phenomenon, and these influence the interpretation of the phenomenon. The scientists’ theoretical frameworks influence how they design and carry out their studies and interpret their data.

Within an educational study, a theoretical framework helps to explain a phenomenon through a particular lens and challenges and extends existing knowledge within the limitations of that lens. Theoretical frameworks are explicitly stated by an educational researcher in the paper’s framework, theory, or relevant literature section. The framework shapes the types of questions asked, guides the method by which data are collected and analyzed, and informs the discussion of the results of the study. It also reveals the researcher’s subjectivities, for example, values, social experience, and viewpoint ( Allen, 2017 ). It is essential that a novice researcher learn to explicitly state a theoretical framework, because all research questions are being asked from the researcher’s implicit or explicit assumptions about a phenomenon of interest ( Schwandt, 2000 ).

Selecting Theoretical Frameworks

Theoretical frameworks are one of the most contemplated elements in our work in educational research. In this section, we share three important considerations for new scholars selecting a theoretical framework.

The first step in identifying a theoretical framework involves reflecting on the phenomenon within the study and the assumptions aligned with the phenomenon. The phenomenon involves the studied event. There are many possibilities, for example, student learning, instructional approach, or group organization. A researcher holds assumptions about how the phenomenon will be affected, influenced, changed, or portrayed. It is ultimately the researcher’s assumption(s) about the phenomenon that aligns with a theoretical framework. An example can help illustrate how a researcher’s reflection on the phenomenon and acknowledgment of assumptions can result in the identification of a theoretical framework.

In our example, a biology education researcher may be interested in exploring how students’ learning of difficult biological concepts can be supported by the interactions of group members. The phenomenon of interest is the interactions among the peers, and the researcher assumes that more knowledgeable students are important in supporting the learning of the group. As a result, the researcher may draw on Vygotsky’s (1978) sociocultural theory of learning and development that is focused on the phenomenon of student learning in a social setting. This theory posits the critical nature of interactions among students and between students and teachers in the process of building knowledge. A researcher drawing upon this framework holds the assumption that learning is a dynamic social process involving questions and explanations among students in the classroom and that more knowledgeable peers play an important part in the process of building conceptual knowledge.

It is important to state at this point that there are many different theoretical frameworks. Some frameworks focus on learning and knowing, while other theoretical frameworks focus on equity, empowerment, or discourse. Some frameworks are well articulated, and others are still being refined. For a new researcher, it can be challenging to find a theoretical framework. One of the best ways to find theoretical frameworks is through published works that highlight different frameworks.

When a theoretical framework is selected, it should clearly connect to all parts of the study. The framework should augment the study by adding a perspective that provides greater insights into the phenomenon. It should clearly align with the studies described in the literature review. For instance, a framework focused on learning would correspond to research that reported different learning outcomes for similar studies. The methods for data collection and analysis should also correspond to the framework. For instance, a study about instructional interventions could use a theoretical framework concerned with learning and could collect data about the effect of the intervention on what is learned. When the data are analyzed, the theoretical framework should provide added meaning to the findings, and the findings should align with the theoretical framework.

A study by Jensen and Lawson (2011) provides an example of how a theoretical framework connects different parts of the study. They compared undergraduate biology students in heterogeneous and homogeneous groups over the course of a semester. Jensen and Lawson (2011) assumed that learning involved collaboration and more knowledgeable peers, which made Vygotsky’s (1978) theory a good fit for their study. They predicted that students in heterogeneous groups would experience greater improvement in their reasoning abilities and science achievements with much of the learning guided by the more knowledgeable peers.

In the enactment of the study, they collected data about the instruction in traditional and inquiry-oriented classes, while the students worked in homogeneous or heterogeneous groups. To determine the effect of working in groups, the authors also measured students’ reasoning abilities and achievement. Each data-collection and analysis decision connected to understanding the influence of collaborative work.

Their findings highlighted aspects of Vygotsky’s (1978) theory of learning. One finding, for instance, posited that inquiry instruction, as a whole, resulted in reasoning and achievement gains. This links to Vygotsky (1978) , because inquiry instruction involves interactions among group members. A more nuanced finding was that group composition had a conditional effect. Heterogeneous groups performed better with more traditional and didactic instruction, regardless of the reasoning ability of the group members. Homogeneous groups worked better during interaction-rich activities for students with low reasoning ability. The authors attributed the variation to the different types of helping behaviors of students. High-performing students provided the answers, while students with low reasoning ability had to work collectively through the material. In terms of Vygotsky (1978) , this finding provided new insights into the learning context in which productive interactions can occur for students.

Another consideration in the selection and use of a theoretical framework pertains to its orientation to the study. This can result in the theoretical framework prioritizing individuals, institutions, and/or policies ( Anfara and Mertz, 2014 ). Frameworks that connect to individuals, for instance, could contribute to understanding their actions, learning, or knowledge. Institutional frameworks, on the other hand, offer insights into how institutions, organizations, or groups can influence individuals or materials. Policy theories provide ways to understand how national or local policies can dictate an emphasis on outcomes or instructional design. These different types of frameworks highlight different aspects in an educational setting, which influences the design of the study and the collection of data. In addition, these different frameworks offer a way to make sense of the data. Aligning the data collection and analysis with the framework ensures that a study is coherent and can contribute to the field.

New understandings emerge when different theoretical frameworks are used. For instance, Ebert-May et al. (2015) prioritized the individual level within conceptual change theory (see Posner et al. , 1982 ). In this theory, an individual’s knowledge changes when it no longer fits the phenomenon. Ebert-May et al. (2015) designed a professional development program challenging biology postdoctoral scholars’ existing conceptions of teaching. The authors reported that the biology postdoctoral scholars’ teaching practices became more student-centered as they were challenged to explain their instructional decision making. According to the theory, the biology postdoctoral scholars’ dissatisfaction in their descriptions of teaching and learning initiated change in their knowledge and instruction. These results reveal how conceptual change theory can explain the learning of participants and guide the design of professional development programming.

The communities of practice (CoP) theoretical framework (Lave, 1988; Wenger, 1998) prioritizes the institutional level, suggesting that learning occurs when individuals learn from and contribute to the communities in which they reside. Grounded in the assumption of community learning, the literature on CoP suggests that, as individuals interact regularly with the other members of their group, they learn about the rules, roles, and goals of the community (Allee, 2000). A study conducted by Gehrke and Kezar (2017) used the CoP framework to understand organizational change by examining the involvement of individual faculty engaged in a cross-institutional CoP focused on changing the instructional practice of faculty at each institution. In the CoP, faculty members were involved in enhancing instructional materials within their department, which aligned with an overarching goal of instituting instruction that embraced active learning. Not surprisingly, Gehrke and Kezar (2017) revealed that faculty who perceived the community culture as important in their work cultivated institutional change. Furthermore, they found that institutional change was sustained when key leaders served as mentors and provided support for faculty, and as faculty themselves developed into leaders. This study reveals the complexity of individual roles within a CoP in supporting institutional instructional change.

It is important to explicitly state the theoretical framework used in a study, but elucidating a theoretical framework can be challenging for a new educational researcher. The literature review can help to identify an applicable theoretical framework. Focal areas of the review or central terms often connect to assumptions and assertions associated with the framework that pertain to the phenomenon of interest. Another way to identify a theoretical framework is for researchers to reflect on the personal beliefs and understandings about the nature of knowledge that they bring to the study (Lysaght, 2011). In stating one’s beliefs and understandings related to the study (e.g., students construct their knowledge, instructional materials support learning), an orientation becomes evident that will suggest a particular theoretical framework. Theoretical frameworks are not arbitrary, but purposefully selected.

With experience, a researcher may find expanded roles for theoretical frameworks. Researchers may revise an existing framework that has limited explanatory power, or they may decide there is a need to develop a new theoretical framework. These frameworks can emerge from a current study or the need to explain a phenomenon in a new way. Researchers may also find that multiple theoretical frameworks are necessary to frame and explore a problem, as different frameworks can provide different insights into a problem.

Finally, it is important to recognize that choosing “x” theoretical framework does not necessarily mean a researcher chooses “y” methodology and so on, nor is there a clear-cut, linear process in selecting a theoretical framework for one’s study. In part, the nonlinear process of identifying a theoretical framework is what makes understanding and using theoretical frameworks challenging. For the novice scholar, contemplating and understanding theoretical frameworks is essential. Fortunately, there are articles and books that can help:

  • Creswell, J. W. (2018). Research design: Qualitative, quantitative, and mixed methods approaches (5th ed.). Los Angeles, CA: Sage. This book provides an overview of theoretical frameworks in general educational research.
  • Ding, L. (2019). Theoretical perspectives of quantitative physics education research. Physical Review Physics Education Research, 15(2), 020101-1–020101-13. This paper illustrates how a DBER field can use theoretical frameworks.
  • Nehm, R. (2019). Biology education research: Building integrative frameworks for teaching and learning about living systems. Disciplinary and Interdisciplinary Science Education Research, 1, ar15. https://doi.org/10.1186/s43031-019-0017-6. This paper articulates the need for studies in BER to explicitly state theoretical frameworks and provides examples of potential studies.
  • Patton, M. Q. (2015). Qualitative research & evaluation methods: Integrating theory and practice. Los Angeles, CA: Sage. This book also provides an overview of theoretical frameworks, but for both research and evaluation.

CONCEPTUAL FRAMEWORKS

Purpose of a Conceptual Framework

A conceptual framework is a description of the way a researcher understands the factors and/or variables that are involved in the study and their relationships to one another. The purpose of a conceptual framework is to articulate the concepts under study using relevant literature (Rocco and Plakhotnik, 2009) and to clarify the presumed relationships among those concepts (Rocco and Plakhotnik, 2009; Anfara and Mertz, 2014). Conceptual frameworks are different from theoretical frameworks in both their breadth and grounding in established findings. Whereas a theoretical framework articulates the lens through which a researcher views the work, the conceptual framework is often more mechanistic and malleable.

Conceptual frameworks are broader, encompassing both established theories (i.e., theoretical frameworks) and the researchers’ own emergent ideas. Emergent ideas, for example, may be rooted in informal and/or unpublished observations from experience. These emergent ideas would not be considered a “theory” if they are not yet tested, supported by systematically collected evidence, and peer reviewed. However, they do still play an important role in the way researchers approach their studies. The conceptual framework allows authors to clearly describe their emergent ideas so that connections among ideas in the study and the significance of the study are apparent to readers.

Constructing Conceptual Frameworks

Including a conceptual framework in a research study is important, but researchers often opt to include either a conceptual or a theoretical framework. Either may be adequate, but including both provides greater insight into the research approach. Imagine, for instance, that a research team plans to test a novel component of an existing theory. In their study, they describe the existing theoretical framework that informs their work and then present their own conceptual framework. Within this conceptual framework, specific topics portray emergent ideas that are related to the theory. Describing both frameworks allows readers to better understand the researchers’ assumptions, orientations, and understanding of concepts being investigated. For example, Connolly et al. (2018) included a conceptual framework that described how they applied a theoretical framework of social cognitive career theory (SCCT) to their study on teaching programs for doctoral students. In their conceptual framework, the authors described SCCT, explained how it applied to the investigation, and drew upon results from previous studies to justify the proposed connections between the theory and their emergent ideas.

In some cases, authors may be able to sufficiently describe their conceptualization of the phenomenon under study in an introduction alone, without a separate conceptual framework section. However, incomplete descriptions of how the researchers conceptualize the components of the study may limit the significance of the study by making the research less intelligible to readers. This is especially problematic when studying topics in which researchers use the same terms for different constructs or different terms for similar and overlapping constructs (e.g., inquiry, teacher beliefs, pedagogical content knowledge, or active learning). Authors must describe their conceptualization of a construct if the research is to be understandable and useful.

There are some key areas to consider regarding the inclusion of a conceptual framework in a study. To begin with, it is important to recognize that conceptual frameworks are constructed by the researchers conducting the study (Rocco and Plakhotnik, 2009; Maxwell, 2012). This is different from theoretical frameworks that are often taken from established literature. Researchers should bring together ideas from the literature, but they may be influenced by their own experiences as a student and/or instructor, the shared experiences of others, or thought experiments as they construct a description, model, or representation of their understanding of the phenomenon under study. This is an exercise in intellectual organization and clarity that often considers what is learned, known, and experienced. The conceptual framework makes these constructs explicitly visible to readers, who may have different understandings of the phenomenon based on their prior knowledge and experience. There is no single method to go about this intellectual work.

Reeves et al. (2016) is an example of an article that proposed a conceptual framework about graduate teaching assistant professional development evaluation and research. The authors used existing literature to create a novel framework that filled a gap in current research and practice related to the training of graduate teaching assistants. This conceptual framework can guide the systematic collection of data by other researchers because the framework describes the relationships among various factors that influence teaching and learning. The Reeves et al. (2016) conceptual framework may be modified as additional data are collected and analyzed by other researchers. This is not uncommon, as conceptual frameworks can serve as catalysts for concerted research efforts that systematically explore a phenomenon (e.g., Reynolds et al., 2012; Brownell and Kloser, 2015).

Sabel et al. (2017) used a conceptual framework in their exploration of how scaffolds, an external factor, interact with internal factors to support student learning. Their conceptual framework integrated principles from two theoretical frameworks, self-regulated learning and metacognition, to illustrate how the research team conceptualized students’ use of scaffolds in their learning (Figure 1). Sabel et al. (2017) created this model using their interpretations of these two frameworks in the context of their teaching.

FIGURE 1. Conceptual framework from Sabel et al. (2017).

A conceptual framework should describe the relationship among components of the investigation (Anfara and Mertz, 2014). These relationships should guide the researcher’s methods of approaching the study (Miles et al., 2014) and inform both the data to be collected and how those data should be analyzed. Explicitly describing the connections among the ideas allows the researcher to justify the importance of the study and the rigor of the research design. Just as importantly, these frameworks help readers understand why certain components of a system were not explored in the study. This is a challenge in education research, which is rooted in complex environments with many variables that are difficult to control.

For example, Sabel et al. (2017) stated: “Scaffolds, such as enhanced answer keys and reflection questions, can help students and instructors bridge the external and internal factors and support learning” (p. 3). They connected the scaffolds in the study to the three dimensions of metacognition and the eventual transformation of existing ideas into new or revised ideas. Their framework provides a rationale for focusing on how students use two different scaffolds, and not on other factors that may influence a student’s success (self-efficacy, use of active learning, exam format, etc.).

In constructing conceptual frameworks, researchers should address needed areas of study and/or contradictions discovered in literature reviews. By attending to these areas, researchers can strengthen their arguments for the importance of a study. For instance, conceptual frameworks can address how the current study will fill gaps in the research, resolve contradictions in existing literature, or suggest a new area of study. While a literature review describes what is known and not known about the phenomenon, the conceptual framework leverages these gaps in describing the current study (Maxwell, 2012). In the example of Sabel et al. (2017), the authors indicated there was a gap in the literature regarding how scaffolds engage students in metacognition to promote learning in large classes. Their study helps fill that gap by describing how scaffolds can support students in the three dimensions of metacognition: intelligibility, plausibility, and wide applicability. In another example, Lane (2016) integrated research from science identity, the ethic of care, the sense of belonging, and an expertise model of student success to form a conceptual framework that addressed the critiques of other frameworks. In a more recent example, Sbeglia et al. (2021) illustrated how a conceptual framework influences the methodological choices and inferences in studies by educational researchers.

Sometimes researchers draw upon the conceptual frameworks of other researchers. When a researcher’s conceptual framework closely aligns with an existing framework, the discussion may be brief. For example, Ghee et al. (2016) referred to portions of SCCT as their conceptual framework to explain the significance of their work on students’ self-efficacy and career interests. Because the authors’ conceptualization of this phenomenon aligned with a previously described framework, they briefly mentioned the conceptual framework and included additional citations that offer readers more detail.

Within both the BER and the broader DBER communities, conceptual frameworks have been used to describe different constructs. For example, some researchers have used the term “conceptual framework” to describe students’ conceptual understandings of a biological phenomenon. This is distinct from a researcher’s conceptual framework of the educational phenomenon under investigation, which may also need to be explicitly described in the article. Other studies have presented a research logic model or flowchart of the research design as a conceptual framework. These constructions can be quite valuable in helping readers understand the data-collection and analysis process. However, a model depicting the study design does not serve the same role as a conceptual framework. Researchers should avoid conflating these constructs by differentiating the conceptual framework that guides the study from the research design, when applicable.

Explicitly describing conceptual frameworks is essential in depicting the focus of the study. We have found that being explicit in a conceptual framework means using accepted terminology, referencing prior work, and clearly noting connections between terms. This description can also highlight gaps in the literature or suggest potential contributions to the field of study. A well-elucidated conceptual framework can suggest additional studies that may be warranted. This can also spur other researchers to consider how they would approach the examination of a phenomenon and could result in a revised conceptual framework.

It can be challenging to create conceptual frameworks, but they are important. Below are two resources that could be helpful in constructing and presenting conceptual frameworks in educational research:

  • Maxwell, J. A. (2012). Qualitative research design: An interactive approach (3rd ed.). Los Angeles, CA: Sage. Chapter 3 in this book describes how to construct conceptual frameworks.
  • Ravitch, S. M., & Riggan, M. (2016). Reason & rigor: How conceptual frameworks guide research. Los Angeles, CA: Sage. This book explains how conceptual frameworks guide the research questions, data collection, data analyses, and interpretation of results.

CONCLUDING THOUGHTS

Literature reviews, theoretical frameworks, and conceptual frameworks are all important in DBER and BER. Robust literature reviews reinforce the importance of a study. Theoretical frameworks connect the study to the base of knowledge in educational theory and specify the researcher’s assumptions. Conceptual frameworks allow researchers to explicitly describe their conceptualization of the relationships among the components of the phenomenon under study. Table 1 provides a general overview of these components in order to assist biology education researchers in thinking about these elements.

It is important to emphasize that these different elements are intertwined. When these elements are aligned and complement one another, the study is coherent, and the study findings contribute to knowledge in the field. When literature reviews, theoretical frameworks, and conceptual frameworks are disconnected from one another, the study suffers. The point of the study is lost, suggested findings are unsupported, or important conclusions are invisible to the researcher. In addition, this misalignment may be costly in terms of time and money.

Conducting a literature review, selecting a theoretical framework, and building a conceptual framework are some of the most difficult elements of a research study. It takes time to understand the relevant research, identify a theoretical framework that provides important insights into the study, and formulate a conceptual framework that organizes the findings. In the research process, there is often a constant back and forth among these elements as the study evolves. With an ongoing refinement of the review of literature, clarification of the theoretical framework, and articulation of a conceptual framework, a sound study can emerge that makes a contribution to the field. This is the goal of BER and education research.

REFERENCES

  • Allee, V. (2000). Knowledge networks and communities of learning. OD Practitioner, 32(4), 4–13.
  • Allen, M. (2017). The Sage encyclopedia of communication research methods (Vols. 1–4). Los Angeles, CA: Sage. https://doi.org/10.4135/9781483381411
  • American Association for the Advancement of Science. (2011). Vision and change in undergraduate biology education: A call to action. Washington, DC.
  • Anfara, V. A., Mertz, N. T. (2014). Setting the stage. In Anfara, V. A., Mertz, N. T. (Eds.), Theoretical frameworks in qualitative research (pp. 1–22). Sage.
  • Barnes, M. E., Brownell, S. E. (2016). Practices and perspectives of college instructors on addressing religious beliefs when teaching evolution. CBE—Life Sciences Education, 15(2), ar18. https://doi.org/10.1187/cbe.15-11-0243
  • Boote, D. N., Beile, P. (2005). Scholars before researchers: On the centrality of the dissertation literature review in research preparation. Educational Researcher, 34(6), 3–15. https://doi.org/10.3102/0013189x034006003
  • Booth, A., Sutton, A., Papaioannou, D. (2016a). Systematic approaches to a successful literature review (2nd ed.). Los Angeles, CA: Sage.
  • Booth, W. C., Colomb, G. G., Williams, J. M., Bizup, J., Fitzgerald, W. T. (2016b). The craft of research (4th ed.). Chicago, IL: University of Chicago Press.
  • Brownell, S. E., Kloser, M. J. (2015). Toward a conceptual framework for measuring the effectiveness of course-based undergraduate research experiences in undergraduate biology. Studies in Higher Education, 40(3), 525–544. https://doi.org/10.1080/03075079.2015.1004234
  • Connolly, M. R., Lee, Y. G., Savoy, J. N. (2018). The effects of doctoral teaching development on early-career STEM scholars’ college teaching self-efficacy. CBE—Life Sciences Education, 17(1), ar14. https://doi.org/10.1187/cbe.17-02-0039
  • Cooper, K. M., Blattman, J. N., Hendrix, T., Brownell, S. E. (2019). The impact of broadly relevant novel discoveries on student project ownership in a traditional lab course turned CURE. CBE—Life Sciences Education, 18(4), ar57. https://doi.org/10.1187/cbe.19-06-0113
  • Creswell, J. W. (2018). Research design: Qualitative, quantitative, and mixed methods approaches (5th ed.). Los Angeles, CA: Sage.
  • DeHaan, R. L. (2011). Education research in the biological sciences: A nine decade review (Paper commissioned by the NAS/NRC Committee on the Status, Contributions, and Future Directions of Discipline Based Education Research). Washington, DC: National Academies Press. Retrieved May 20, 2022, from www7.nationalacademies.org/bose/DBER_Meeting2_commissioned_papers_page.html
  • Ding, L. (2019). Theoretical perspectives of quantitative physics education research. Physical Review Physics Education Research, 15(2), 020101.
  • Dirks, C. (2011). The current status and future direction of biology education research. Paper presented at: Second Committee Meeting on the Status, Contributions, and Future Directions of Discipline-Based Education Research, 18–19 October (Washington, DC). Retrieved May 20, 2022, from http://sites.nationalacademies.org/DBASSE/BOSE/DBASSE_071087
  • Duran, R. P., Eisenhart, M. A., Erickson, F. D., Grant, C. A., Green, J. L., Hedges, L. V., Schneider, B. L. (2006). Standards for reporting on empirical social science research in AERA publications: American Educational Research Association. Educational Researcher, 35(6), 33–40.
  • Ebert-May, D., Derting, T. L., Henkel, T. P., Middlemis Maher, J., Momsen, J. L., Arnold, B., Passmore, H. A. (2015). Breaking the cycle: Future faculty begin teaching with learner-centered strategies after professional development. CBE—Life Sciences Education, 14(2), ar22. https://doi.org/10.1187/cbe.14-12-0222
  • Galvan, J. L., Galvan, M. C. (2017). Writing literature reviews: A guide for students of the social and behavioral sciences (7th ed.). New York, NY: Routledge. https://doi.org/10.4324/9781315229386
  • Gehrke, S., Kezar, A. (2017). The roles of STEM faculty communities of practice in institutional and departmental reform in higher education. American Educational Research Journal, 54(5), 803–833. https://doi.org/10.3102/0002831217706736
  • Ghee, M., Keels, M., Collins, D., Neal-Spence, C., Baker, E. (2016). Fine-tuning summer research programs to promote underrepresented students’ persistence in the STEM pathway. CBE—Life Sciences Education, 15(3), ar28. https://doi.org/10.1187/cbe.16-01-0046
  • Institute of Education Sciences & National Science Foundation. (2013). Common guidelines for education research and development. Retrieved May 20, 2022, from www.nsf.gov/pubs/2013/nsf13126/nsf13126.pdf
  • Jensen, J. L., Lawson, A. (2011). Effects of collaborative group composition and inquiry instruction on reasoning gains and achievement in undergraduate biology. CBE—Life Sciences Education, 10(1), 64–73.
  • Kolpikova, E. P., Chen, D. C., Doherty, J. H. (2019). Does the format of preclass reading quizzes matter? An evaluation of traditional and gamified, adaptive preclass reading quizzes. CBE—Life Sciences Education, 18(4), ar52. https://doi.org/10.1187/cbe.19-05-0098
  • Labov, J. B., Reid, A. H., Yamamoto, K. R. (2010). Integrated biology and undergraduate science education: A new biology education for the twenty-first century? CBE—Life Sciences Education, 9(1), 10–16. https://doi.org/10.1187/cbe.09-12-0092
  • Lane, T. B. (2016). Beyond academic and social integration: Understanding the impact of a STEM enrichment program on the retention and degree attainment of underrepresented students. CBE—Life Sciences Education, 15(3), ar39. https://doi.org/10.1187/cbe.16-01-0070
  • Lave, J. (1988). Cognition in practice: Mind, mathematics and culture in everyday life. New York, NY: Cambridge University Press.
  • Lo, S. M., Gardner, G. E., Reid, J., Napoleon-Fanis, V., Carroll, P., Smith, E., Sato, B. K. (2019). Prevailing questions and methodologies in biology education research: A longitudinal analysis of research in CBE—Life Sciences Education and at the Society for the Advancement of Biology Education Research. CBE—Life Sciences Education, 18(1), ar9. https://doi.org/10.1187/cbe.18-08-0164
  • Lysaght, Z. (2011). Epistemological and paradigmatic ecumenism in “Pasteur’s quadrant:” Tales from doctoral research. In Official Conference Proceedings of the Third Asian Conference on Education in Osaka, Japan. Retrieved May 20, 2022, from http://iafor.org/ace2011_offprint/ACE2011_offprint_0254.pdf
  • Maxwell, J. A. (2012). Qualitative research design: An interactive approach (3rd ed.). Los Angeles, CA: Sage.
  • Miles, M. B., Huberman, A. M., Saldaña, J. (2014). Qualitative data analysis (3rd ed.). Los Angeles, CA: Sage.
  • Nehm, R. (2019). Biology education research: Building integrative frameworks for teaching and learning about living systems. Disciplinary and Interdisciplinary Science Education Research, 1, ar15. https://doi.org/10.1186/s43031-019-0017-6
  • Patton, M. Q. (2015). Qualitative research & evaluation methods: Integrating theory and practice. Los Angeles, CA: Sage.
  • Perry, J., Meir, E., Herron, J. C., Maruca, S., Stal, D. (2008). Evaluating two approaches to helping college students understand evolutionary trees through diagramming tasks. CBE—Life Sciences Education, 7(2), 193–201. https://doi.org/10.1187/cbe.07-01-0007
  • Posner, G. J., Strike, K. A., Hewson, P. W., Gertzog, W. A. (1982). Accommodation of a scientific conception: Toward a theory of conceptual change. Science Education, 66(2), 211–227.
  • Ravitch, S. M., Riggan, M. (2016). Reason & rigor: How conceptual frameworks guide research. Los Angeles, CA: Sage.
  • Reeves, T. D., Marbach-Ad, G., Miller, K. R., Ridgway, J., Gardner, G. E., Schussler, E. E., Wischusen, E. W. (2016). A conceptual framework for graduate teaching assistant professional development evaluation and research. CBE—Life Sciences Education, 15(2), es2. https://doi.org/10.1187/cbe.15-10-0225
  • Reynolds, J. A., Thaiss, C., Katkin, W., Thompson, R. J. Jr. (2012). Writing-to-learn in undergraduate science education: A community-based, conceptually driven approach. CBE—Life Sciences Education, 11(1), 17–25. https://doi.org/10.1187/cbe.11-08-0064
  • Rocco, T. S., Plakhotnik, M. S. (2009). Literature reviews, conceptual frameworks, and theoretical frameworks: Terms, functions, and distinctions. Human Resource Development Review, 8(1), 120–130. https://doi.org/10.1177/1534484309332617
  • Rodrigo-Peiris, T., Xiang, L., Cassone, V. M. (2018). A low-intensity, hybrid design between a “traditional” and a “course-based” research experience yields positive outcomes for science undergraduate freshmen and shows potential for large-scale application. CBE—Life Sciences Education, 17(4), ar53. https://doi.org/10.1187/cbe.17-11-0248
  • Sabel, J. L., Dauer, J. T., Forbes, C. T. (2017). Introductory biology students’ use of enhanced answer keys and reflection questions to engage in metacognition and enhance understanding. CBE—Life Sciences Education, 16(3), ar40. https://doi.org/10.1187/cbe.16-10-0298
  • Sbeglia, G. C., Goodridge, J. A., Gordon, L. H., Nehm, R. H. (2021). Are faculty changing? How reform frameworks, sampling intensities, and instrument measures impact inferences about student-centered teaching practices. CBE—Life Sciences Education, 20(3), ar39. https://doi.org/10.1187/cbe.20-11-0259
  • Schwandt, T. A. (2000). Three epistemological stances for qualitative inquiry: Interpretivism, hermeneutics, and social constructionism. In Denzin, N. K., Lincoln, Y. S. (Eds.), Handbook of qualitative research (2nd ed., pp. 189–213). Los Angeles, CA: Sage.
  • Sickel, A. J., Friedrichsen, P. (2013). Examining the evolution education literature with a focus on teachers: Major findings, goals for teacher preparation, and directions for future research. Evolution: Education and Outreach, 6(1), 23. https://doi.org/10.1186/1936-6434-6-23
  • Singer, S. R., Nielsen, N. R., Schweingruber, H. A. (2012). Discipline-based education research: Understanding and improving learning in undergraduate science and engineering. Washington, DC: National Academies Press.
  • Todd, A., Romine, W. L., Correa-Menendez, J. (2019). Modeling the transition from a phenotypic to genotypic conceptualization of genetics in a university-level introductory biology context. Research in Science Education, 49(2), 569–589. https://doi.org/10.1007/s11165-017-9626-2
  • Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Cambridge, MA: Harvard University Press.
  • Wenger, E. (1998). Communities of practice: Learning as a social system. Systems Thinker, 9(5), 2–3.
  • Ziadie, M. A., Andrews, T. C. (2018). Moving evolution education forward: A systematic analysis of literature to identify gaps in collective knowledge for teaching. CBE—Life Sciences Education, 17(1), ar11. https://doi.org/10.1187/cbe.17-08-0190

Have a language expert improve your writing

Run a free plagiarism check in 10 minutes, generate accurate citations for free.

  • Knowledge Base
  • Methodology
  • What Is a Conceptual Framework? | Tips & Examples

What Is a Conceptual Framework? | Tips & Examples

Published on August 2, 2022 by Bas Swaen and Tegan George. Revised on September 5, 2024.

Conceptual-Framework-example

A conceptual framework illustrates the expected relationship between your variables. It defines the relevant objectives for your research process and maps out how they come together to draw coherent conclusions.

Keep reading for a step-by-step guide to help you construct your own conceptual framework.

Table of contents

Developing a conceptual framework in research, step 1: choose your research question, step 2: select your independent and dependent variables, step 3: visualize your cause-and-effect relationship, step 4: identify other influencing variables, frequently asked questions about conceptual models.

A conceptual framework is a representation of the relationship you expect to see between your variables, or the characteristics or properties that you want to study.

Conceptual frameworks can be written or visual and are generally developed based on a literature review of existing studies about your topic.

Your research question guides your work by determining exactly what you want to find out, giving your research process a clear focus.

However, before you start collecting your data, consider constructing a conceptual framework. This will help you map out which variables you will measure and how you expect them to relate to one another.

In order to move forward with your research question and test a cause-and-effect relationship, you must first identify at least two key variables: your independent and dependent variables. Suppose, for example, that you are asking whether the number of hours a student studies affects their exam score:

  • The expected cause, “hours of study,” is the independent variable (the predictor, or explanatory variable).
  • The expected effect, “exam score,” is the dependent variable (the response, or outcome variable).

Note that causal relationships often involve several independent variables that affect the dependent variable. For the purpose of this example, we’ll work with just one independent variable (“hours of study”).

Now that you’ve figured out your research question and variables, the first step in designing your conceptual framework is visualizing your expected cause-and-effect relationship.

We demonstrate this using basic design components of boxes and arrows. Here, each variable appears in a box. To indicate a causal relationship, each arrow should start from the independent variable (the cause) and point to the dependent variable (the effect).

[Figure: sample conceptual framework with an independent and a dependent variable]
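Although conceptual frameworks are usually drawn as diagrams, the same boxes-and-arrows structure can be captured in a few lines of code. This sketch is purely illustrative (it is not part of the original article): each box is a dictionary key, and each arrow is an entry in its value list.

```python
# A conceptual framework as a tiny directed graph: each key is a "box"
# (a variable) and each value lists the variables its arrows point to.
# The variable names come from the running example in the text.
framework = {
    "hours of study": ["exam score"],  # independent -> dependent
}

def arrows(graph):
    # Render every causal arrow as "cause -> effect".
    return [f"{cause} -> {effect}"
            for cause, effects in graph.items()
            for effect in effects]

print(arrows(framework))  # ['hours of study -> exam score']
```

As later steps add moderators and mediators, the graph simply gains more boxes and arrows.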

It’s crucial to identify other variables that can influence the relationship between your independent and dependent variables early in your research process.

Some common variables to include are moderating, mediating, and control variables.

Moderating variables

Moderating variables (or moderators) alter the effect that an independent variable has on a dependent variable. In other words, moderators change the “effect” component of the cause-and-effect relationship.

Let’s add the moderator “IQ.” Here, a student’s IQ level can change the effect that the variable “hours of study” has on the exam score. The higher the IQ, the fewer hours of study are needed to do well on the exam.

[Figure: sample conceptual framework with a moderator variable]

Let’s take a look at how this might work. The graph below shows how the number of hours spent studying affects exam score. As expected, the more hours you study, the better your results. Here, a student who studies for 20 hours will get a perfect score.

[Figure: effect of hours of study on exam score, without moderator]

But the graph looks different when we add our “IQ” moderator of 120. A student with this IQ will achieve a perfect score after just 15 hours of study.

[Figure: effect of hours of study on exam score, with moderator IQ = 120]

Below, the value of the “IQ” moderator has been increased to 150. A student with this IQ will only need to invest five hours of study in order to get a perfect score.

[Figure: effect of hours of study on exam score, with moderator IQ = 150]

Here, we see that a moderating variable does indeed change the cause-and-effect relationship between two variables.
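A moderating effect like this can also be sketched in code. The scoring formula below is invented for illustration (it is not fitted to the article's graphs or to any data); the point is only that the moderator changes how strongly hours of study affect the score.

```python
# Sketch of moderation using the running example. The formula is hypothetical:
# the moderator (IQ) changes the slope of the hours -> score effect, so that
# higher IQ means each hour of study earns more points.

def exam_score(hours, iq):
    points_per_hour = 5 * (iq / 100)
    return min(100, points_per_hour * hours)  # scores are capped at 100

# Same 10 hours of study, different IQs, different effects:
print(exam_score(10, 100))  # 50.0
print(exam_score(10, 150))  # 75.0
print(exam_score(20, 100))  # 100 -> a perfect score at 20 hours
```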

Mediating variables

Now we’ll expand the framework by adding a mediating variable. Mediating variables link the independent and dependent variables, allowing the relationship between them to be better explained.

Here’s how the conceptual framework might look if a mediator variable were involved:

[Figure: conceptual framework with a mediator variable]

In this case, the mediator helps explain why studying more hours leads to a higher exam score. The more hours a student studies, the more practice problems they will complete; the more practice problems completed, the higher the student’s exam score will be.

Moderator vs. mediator

It’s important not to confuse moderating and mediating variables. To remember the difference, you can think of them in relation to the independent variable:

  • A moderating variable is not affected by the independent variable, even though it affects the dependent variable. For example, no matter how many hours you study (the independent variable), your IQ will not get higher.
  • A mediating variable is affected by the independent variable. In turn, it also affects the dependent variable. Therefore, it links the two variables and helps explain the relationship between them.
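The practice-problems chain described above can be sketched the same way. All coefficients here are invented for illustration; what matters is that study hours raise the mediator (practice problems completed), and the mediator, not hours directly, raises the score.

```python
# Sketch of mediation: hours of study -> practice problems -> exam score.

def practice_problems(hours):
    # Mediator: each hour of study yields ~3 completed practice problems.
    return 3 * hours

def exam_score(problems):
    # The outcome depends on the mediator, not on hours directly.
    return min(100, 2 * problems)  # capped at 100

hours = 10
problems = practice_problems(hours)  # 30
score = exam_score(problems)         # 60
print(score)
```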

Control variables

Lastly, control variables must also be taken into account. These are variables that are held constant so that they don’t interfere with the results. Even though you aren’t interested in measuring them for your study, it’s crucial to be aware of as many of them as you can.

[Figure: conceptual framework with a control variable]

Frequently asked questions about conceptual models

A mediator variable explains the process through which two variables are related, while a moderator variable affects the strength and direction of that relationship.

A confounding variable is closely related to both the independent and dependent variables in a study. An independent variable represents the supposed cause, while the dependent variable is the supposed effect. A confounding variable is a third variable that influences both the independent and dependent variables.

Failing to account for confounding variables can cause you to wrongly estimate the relationship between your independent and dependent variables.
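The bias a confounder introduces can be demonstrated with a small synthetic simulation. In this sketch (all numbers invented), a confounder Z drives both X and Y, while X has no direct effect on Y; a naive regression of Y on X still finds a clearly positive slope, wrongly attributing Z's influence to X.

```python
import random

# Synthetic data: Z causes both X and Y; X does NOT cause Y.
random.seed(0)
n = 1000
z = [random.gauss(0, 1) for _ in range(n)]            # confounder
x = [zi + random.gauss(0, 0.5) for zi in z]           # X caused by Z
y = [2 * zi + random.gauss(0, 0.5) for zi in z]       # Y caused by Z only

def slope(xs, ys):
    # Ordinary least-squares slope of ys on xs.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

print(slope(x, y))  # clearly positive, despite no direct X -> Y effect
```

Controlling for Z (for example, by holding it constant or including it in the model) would remove this spurious slope.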

Yes, a framework can include more than one independent or dependent variable, but including more than one of either type requires multiple research questions .

For example, if you are interested in the effect of a diet on health, you can use multiple measures of health: blood sugar, blood pressure, weight, pulse, and many more. Each of these is its own dependent variable with its own research question.

You could also choose to look at the effect of exercise levels as well as diet, or even the additional effect of the two combined. Each of these is a separate independent variable.

To ensure the internal validity of an experiment, you should only change one independent variable at a time.

A control variable is any variable that’s held constant in a research study. It’s not a variable of interest in the study, but it’s controlled because it could influence the outcomes.

A confounding variable, also called a confounder or confounding factor, is a third variable in a study examining a potential cause-and-effect relationship.

A confounding variable is related to both the supposed cause and the supposed effect of the study. It can be difficult to separate the true effect of the independent variable from the effect of the confounding variable.

In your research design, it’s important to identify potential confounding variables and plan how you will reduce their impact.


What is a good example of a conceptual framework?

Last updated 18 April 2023. Reviewed by Miroslav Damyanov.

A well-designed study doesn’t just happen. Researchers work hard to ensure the studies they conduct will be scientifically valid and will advance understanding in their field.


  • The importance of a conceptual framework

The main purpose of a conceptual framework is to improve the quality of a research study. A conceptual framework achieves this by identifying important information about the topic and providing a clear roadmap for researchers to study it.

Through the process of developing this information, researchers will be able to improve the quality of their studies in a few key ways.

Clarify research goals and objectives

A conceptual framework helps researchers create a clear research goal. Research projects often become vague and lose their focus, which makes them less useful. However, a well-designed conceptual framework helps researchers maintain focus. It reinforces the project’s scope, ensuring it stays on track and produces meaningful results.

Provide a theoretical basis for the study

Forming a hypothesis requires knowledge of the key variables and their relationship to each other. Researchers need to identify these variables early on to create a conceptual framework. This ensures researchers have developed a strong understanding of the topic before finalizing the study design. It also helps them select the most appropriate research and analysis methods.

Guide the research design

As they develop their conceptual framework, researchers often uncover information that can help them further refine their work.

Here are some examples:

Confounding variables they hadn’t previously considered

Sources of bias they will have to take into account when designing the project

Whether or not the information they were going to study has already been covered—this allows them to pivot to a more meaningful goal that brings new and relevant information to their field

  • Steps to develop a conceptual framework

There are four major steps researchers will follow to develop a conceptual framework. Each step will be described in detail in the sections that follow. You’ll also find examples of how each might be applied in a range of fields.

Step 1: Choose the research question

The first step in creating a conceptual framework is choosing a research question. The goal of this step is to create a question that’s specific and focused.

By developing a clear question, researchers can more easily identify the variables they will need to account for and keep their research focused. Without it, the next steps will be more difficult and less effective.

Here are some examples of good research questions in a few common fields:

Natural sciences: How does exposure to ultraviolet radiation affect the growth rate of a particular type of algae?

Health sciences: What is the effectiveness of cognitive-behavioral therapy for treating depression in adolescents?

Business: What factors contribute to the success of small businesses in a particular industry?

Education: How does implementing technology in the classroom impact student learning outcomes?

Step 2: Select the independent and dependent variables

Once the research question has been chosen, it’s time to identify the dependent and independent variables.

The independent variable is the variable researchers think will affect the dependent variable. Without this information, researchers cannot develop a meaningful hypothesis or design a way to test it.

The dependent and independent variables for our example questions above are:

Natural sciences

Independent variable: exposure to ultraviolet radiation

Dependent variable: the growth rate of a particular type of algae

Health sciences

Independent variable: cognitive-behavioral therapy

Dependent variable: depression in adolescents

Business

Independent variables: factors contributing to the business’s success

Dependent variable: sales, return on investment (ROI), or another concrete metric

Education

Independent variable: implementation of technology in the classroom

Dependent variable: student learning outcomes, such as test scores, GPAs, or exam results

Step 3: Visualize the cause-and-effect relationship

This step is where researchers actually develop their hypothesis. They will predict how the independent variable will impact the dependent variable based on their knowledge of the field and their intuition.

With a hypothesis formed, researchers can more accurately determine what data to collect and how to analyze it. They will then visualize their hypothesis by creating a diagram. This visualization will serve as a framework to help guide their research.

The diagrams for our examples might be used as follows:

Natural sciences : how exposure to radiation affects the biological processes in the algae that contribute to its growth rate

Health sciences : how different aspects of cognitive behavioral therapy can affect how patients experience symptoms of depression

Business : how factors such as market demand, managerial expertise, and financial resources influence a business’s success

Education : how different types of technology interact with different aspects of the learning process and alter student learning outcomes

Step 4: Identify other influencing variables

The independent and dependent variables are only part of the equation. Moderating, mediating, and control variables are also important parts of a well-designed study. These variables can impact the relationship between the two main variables and must be accounted for.

A moderating variable is one that can change how the independent variable affects the dependent variable. A mediating variable explains the relationship between the two. Control variables are kept the same to eliminate their impact on the results. Examples of each are given below:

Natural sciences

Moderating variable: water temperature (might impact how algae respond to radiation exposure)

Mediating variable: chlorophyll production (might explain how radiation exposure affects algae growth rate)

Control variable: nutrient levels in the water

Health sciences

Moderating variable: the severity of depression symptoms at baseline (might impact how effective the therapy is for different adolescents)

Mediating variable: social support (might explain how cognitive-behavioral therapy leads to improvements in depression)

Control variable: other forms of treatment received before or during the study

Business

Moderating variable: the size of the business (might impact how different factors contribute to market share, sales, ROI, and other key success metrics)

Mediating variable: customer satisfaction (might explain how different factors impact business success)

Control variable: industry competition

Education

Moderating variable: student age (might impact how effective technology is for different students)

Mediating variable: teacher training (might explain how technology leads to improvements in learning outcomes)

Control variable: student learning style
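The natural-sciences example (UV exposure and algae growth) can be sketched in code with all four variable roles in play. Every function, formula, and coefficient below is invented purely for illustration; the sketch only shows how the roles fit together.

```python
# Hypothetical sketch: UV exposure (independent) affects growth rate
# (dependent) through chlorophyll production (mediator), water temperature
# moderates the effect, and nutrient levels are held constant (control).

NUTRIENTS = 1.0  # control variable: fixed across all runs

def chlorophyll(uv_hours):
    # Mediator: UV exposure stimulates chlorophyll production.
    return 0.5 * uv_hours

def growth_rate(uv_hours, water_temp):
    # Moderator: warmer water amplifies the mediated effect of UV.
    temp_factor = water_temp / 20.0
    return chlorophyll(uv_hours) * temp_factor * NUTRIENTS

print(growth_rate(4, 20))  # baseline temperature
print(growth_rate(4, 30))  # warmer water -> larger effect of the same UV dose
```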

  • Conceptual versus theoretical frameworks

Although they sound similar, conceptual and theoretical frameworks have different goals and are used in different contexts. Understanding which to use will help researchers craft better studies.

Conceptual frameworks describe a broad overview of the subject and outline key concepts, variables, and the relationships between them. They provide structure to studies that are more exploratory in nature, where the relationships between the variables are still being established. They are particularly helpful in studies that are complex or interdisciplinary because they help researchers better organize the factors involved in the study.

Theoretical frameworks, on the other hand, are used when the research question is more clearly defined and there’s an existing body of work to draw upon. They define the relationships between the variables and help researchers predict outcomes. They are particularly helpful when researchers want to refine the existing body of knowledge rather than establish it.


Mastering Research Frameworks: An 8-Step Guide to Successful Academic Writing!

Introduction

Research is a fundamental aspect of any academic or scientific endeavor. It involves the systematic investigation of a particular topic or problem to generate new knowledge or validate existing theories. However, conducting research can be a complex and challenging process, requiring careful planning and organization. This is where research frameworks come into play.

A research framework is important because it provides a structured approach to guide and organize the entire research process, ensuring that studies are methodical, coherent, and aligned with established objectives. (Researchmate.net)

In this comprehensive guide, we will explore the concept of research frameworks and how they can help researchers in their work. We will discuss the components of a research framework, the different types of frameworks, and the methodology behind developing and implementing a research framework. Additionally, we will provide sample research frameworks to guide researchers in designing their own projects. For researchers looking to collaborate and enhance their research framework strategies, platforms like Researchmate.net offer valuable resources and networking opportunities.

What is a Research Framework?

A research framework refers to the overall structure, approach, and theoretical underpinnings that guide a research study. It is a systematic and organized plan that outlines the key elements of a research project, including the research questions, objectives, methodology, data collection methods, and data analysis techniques.

A research framework provides researchers with a roadmap to follow throughout the research process, ensuring that the study is conducted in a logical and coherent manner. It helps researchers to organize their thoughts, identify gaps in existing knowledge, and develop a clear research plan. By establishing a research framework, researchers can ensure that their study is rigorous, valid, and reliable, and that it contributes to the existing body of knowledge in their field. Overall, a research framework serves as a foundation for the research study, guiding the researcher in every step of the research process.

Components of a Research Framework

A research framework consists of several key components that work together to guide the research process. It is essentially a structured outline that serves as a guide for researchers to organize their thoughts, define research objectives, and plan the research process comprehensively. While there are various research framework templates available, they typically include the following components:

Problem Statement

The problem statement defines the research problem or question that the study aims to address. It provides a clear and concise statement of the issue that needs to be investigated. This often emerges from identifying a research gap in the existing literature, highlighting areas that lack sufficient study or have not been explored at all.

Research Objectives

The research objectives outline the specific goals and outcomes that the study aims to achieve. These objectives help to focus the research and provide a clear direction for the study. The objectives should be measurable and aligned with the research question to ensure that the study is targeted and relevant.

Literature Review

The literature review is a critical component of a research framework. It involves reviewing existing research and literature related to the research topic. This helps to identify gaps in the current knowledge and provides a foundation for the study.

Theoretical or Conceptual Framework

The phrases ‘conceptual framework’ and ‘theoretical framework’ are often used to describe the overall structure that defines and outlines a research project. These frameworks are composed of theories, concepts, and models that serve as the foundation and guide for the research process.

Methodology

The research methodology outlines the methods and techniques that will be used to collect and analyze data. It includes details on the research design, data collection methods, and data analysis techniques.

Data Collection

Data collection is a component of the research methodology that involves gathering data from various sources, such as surveys, interviews, observations, or existing datasets. The data collected should be relevant to the research objectives and provide insights into the research problem.

Data Analysis

Data analysis involves organizing, interpreting, and analyzing the collected data. This can include statistical analysis, qualitative analysis, or a combination of both, depending on the research objectives and data collected.
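A minimal quantitative-analysis step might look like the sketch below: organize the responses, summarize them, and produce numbers the researcher can interpret. The survey scores are invented (e.g. responses on a 1–5 Likert scale), purely to make the step concrete.

```python
import statistics

# Hypothetical survey responses on a 1-5 scale.
survey_scores = [4, 5, 3, 4, 5, 2, 4, 5, 3, 4]

# Organize and summarize the data into interpretable descriptive statistics.
summary = {
    "n": len(survey_scores),
    "mean": statistics.mean(survey_scores),
    "stdev": round(statistics.stdev(survey_scores), 2),
}
print(summary)  # {'n': 10, 'mean': 3.9, 'stdev': 0.99}
```

Qualitative analysis would instead involve coding and thematically grouping responses, but the same organize–summarize–interpret sequence applies.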

Findings and Conclusion

The findings and conclusion section presents the results of the data analysis and discusses the implications of the findings. It summarizes the key findings, draws conclusions, and provides recommendations for future research or practical applications. It highlights the contribution of the study to the existing body of knowledge and suggests areas for further investigation.

These components work together to provide a comprehensive framework for conducting research. Each component plays a crucial role in guiding the research process and ensuring that the study is rigorous and valid.

Types of Research Frameworks

There are two types of research frameworks: theoretical and conceptual.

A theoretical framework is a single formal theory that is used as the basis for a study. It provides a set of concepts and principles that guide the research process. On the other hand, a conceptual framework is a broader framework that includes multiple concepts and theories. It provides a unified framework for understanding and analyzing a particular research problem. The two types of frameworks relate differently to the research question and design. The theoretical framework often inspires the research question based on the existing theory, while the conceptual framework helps in organizing and structuring the research process.

Both types of frameworks have their advantages and limitations. A theoretical framework provides a solid foundation for research and allows for the testing of specific hypotheses. However, it may be limited in its applicability to a specific research problem. On the other hand, a conceptual framework allows for a more holistic and comprehensive understanding of the research problem. It provides a framework for exploring multiple perspectives and theories. However, it may lack the specificity and precision of a theoretical framework.

In practice, researchers often use a combination of theoretical and conceptual frameworks to guide their research. They may start with a theoretical framework to establish a foundation and then use a conceptual framework to explore and analyze the research problem from different angles. The choice of research framework depends on the nature of the research problem, the research question, and the goals of the study. Researchers should carefully consider the advantages and limitations of each type of framework and select the most appropriate one for their specific research context.

Research Framework Methodology

Methodology is an essential component of a research framework, as it provides a structured approach to conducting research projects. The methodology section of a research framework includes the research design, sampling design, data collection techniques, and the analysis and interpretation of the data. These elements are crucial in ensuring the validity and reliability of the research findings:

  • The research design refers to the overall plan or strategy that researchers adopt to answer their research questions. It includes decisions about the type of research, the research approach, and the research paradigm. The research design provides a roadmap for the entire research process.
  • Sampling design is another important aspect of the methodology. It involves selecting a representative sample from the target population. The sample should be chosen in such a way that it accurately represents the characteristics of the population and allows for generalization of the findings.
  • Data collection techniques are the methods used to gather data for the research. These can include surveys, interviews, observations, experiments, or the analysis of existing data. The choice of data collection techniques depends on the research questions and the nature of the data being collected.
  • The analysis and interpretation of data are crucial in generating meaningful insights and answering the research questions. Once the data is collected, it needs to be organized and summarized, patterns and trends identified, and conclusions drawn based on the findings.

Research Framework Examples

Example 1: Tourism Research Framework

One example of a research framework is a tourism research framework. This framework includes various components such as tourism systems and development models, the political economy and political ecology of tourism, and community involvement in tourism. By using this framework, researchers can analyze and understand the complex dynamics of tourism and its impact on communities and the environment.

Example 2: Educational Research Framework

Another example of a research framework is an educational research framework. This framework focuses on studying various aspects of education, such as teaching methods, curriculum development, and student learning outcomes. It may include components like educational theories, pedagogical approaches, and assessment methods. Researchers can use this framework to guide their studies and gain insights into improving educational practices and policies.

Example 3: Health Research Framework

A health research framework is another common example. This framework is used to investigate different aspects of health, such as disease prevention, healthcare delivery, and patient outcomes. It may include components like epidemiological models, healthcare systems analysis, and health behavior theories. Researchers can utilize this framework to design studies that contribute to the understanding and improvement of healthcare practices and policies.

Example 4: Business Research Framework

In the field of business, a research framework can be developed to study various aspects of business operations, management strategies, and market dynamics. This framework may include components like organizational theories, market analysis models, and strategic planning frameworks. Researchers can apply this framework to investigate business-related phenomena and provide valuable insights for decision-making and industry development.

Example 5: Social Science Research Framework

A social science research framework is designed to study human behavior, social structures, and societal issues. It may include components like sociological theories, psychological models, and qualitative research methods. Researchers in the social sciences can use this framework to explore and analyze various social phenomena, contributing to the understanding and improvement of society as a whole.

In conclusion, a research framework provides a structured approach to organizing and analyzing research data, allowing researchers to make informed decisions and draw meaningful conclusions. Throughout this guide, we have delved into the nature of research frameworks, including their components, types, methodologies, and practical examples. These frameworks are essential tools for conducting effective and efficient research, helping researchers streamline processes, enhance the quality of findings, and contribute significantly to their fields.

However, it is important to recognize that research frameworks are not a one-size-fits-all solution; they may need to be tailored to suit the specific objectives, scope, and context of individual research projects. While these frameworks provide essential structure, they should not replace critical thinking and creativity. Researchers are encouraged to remain open to new ideas and perspectives, adapting frameworks to meet their unique needs and navigate the complexities of the research process, thereby advancing knowledge within their disciplines.


Deakin Library

Module 2: Frame your research

Purpose of frameworks in developing research questions for reviews

In research, you may come across the term "framework" in multiple contexts. For the purpose of developing a review, the term "framework" refers to a tool used to formulate the research question. Frameworks provide a structured approach to the review process by defining boundaries on the research question to make sure it stays focused on a specific topic.

Frameworks can be helpful in formulating a clear and answerable research question, by identifying the elements within the question. 

There are many frameworks, which can be customised to address different types of research questions. The most well-known framework is PICO, used to frame questions about the effectiveness of interventions. However, many other frameworks are suitable for this type of question, as well as for other question types across multiple discipline areas.


Read the articles listed below for more information about using frameworks to form a research question for your review: 

  • Formulating the Evidence Based Practice Question: A Review of the Frameworks by K. S. Davies. 
  • Formulating questions to explore complex interventions within qualitative evidence synthesis by A. Booth et al. See the supplementary material for information specific to using frameworks to develop a question for a rapid review. 

Explore common frameworks

There are various types of frameworks that can be applied to a research question, each helping to identify its key elements. This module focuses on frameworks commonly used in literature-based research in health and the sciences.

The definition, purpose, and examples for each framework type are explored below.

Definition and purpose

The PICO framework is useful for questions about the effectiveness of interventions.   

PICO stands for: 

  • Population or Patient or Problem 
  • Intervention or Indicator  
  • Comparison or Control 
  • Outcome 

Wrist splints are commonly prescribed for people with carpal tunnel syndrome. You want to know what evidence there is for their effectiveness in reducing pain and increasing wrist function.

What is the effectiveness of wrist splints (I) compared with corticosteroid injections (C) for reducing pain and increasing function (O) in people with carpal tunnel syndrome (P)?

Framework and scenario matching

Element | Definition | Scenario
P (patient/population/problem) | Who is the population of interest? OR What is the problem of interest? | People with carpal tunnel syndrome
I (intervention/indicator) | What is the intervention or indicator of interest? | Wrist splints
C (comparison/control) | What are you comparing the intervention to? | Corticosteroid injections
O (outcome) | What is the outcome of interest? | Improvement of pain and wrist function

Example research paper

Karjalainen, T. V., Lusa, V., Page, M. J., O'Connor, D., Massy-Westropp, N., & Peters, S. E. (2023). Splinting for carpal tunnel syndrome. Cochrane Database of Systematic Reviews, (2).

Variant PICO Frameworks 

PICO has variations and extensions to accommodate different question types, including qualitative questions. For example: 

  • PICOS stands for PICO plus Study design 
  • PICOT stands for PICO plus Time 
  • PECO stands for Population/Problem, Exposure, Comparison, Outcome 
  • PECOS stands for Population/Problem, Exposure, Comparison, Outcome, Study design 
  • PICo stands for Problem, phenomenon of Interest, Context (used for qualitative questions)

The PCC framework is useful for questions that are broad or reviewing qualitative research.

The PCC framework is recommended by the JBI Scoping Review guidelines ( 11.2.2 Developing the title and question ). 

PCC stands for:

  • Population or Problem
  • Concept
  • Context

The government is funding a review into measuring the experiences of adults with atrial fibrillation.  They're particularly interested in the impact atrial fibrillation has on quality of life.  You want to apply for the grant and start planning your methodology.

What tools are available to measure quality of life (C) in adults with atrial fibrillation (P) in Australia (C)?  

Element | Definition | Scenario
P (population/problem) | Who is the population of interest? OR What is the problem of interest? | Adults with atrial fibrillation
C (concept) | What is the concept of interest? | Quality of life measurement
C (context) | What is the context? (e.g. geographic, setting) | Australia

Risom, S. S., Nørgaard, M. W., & Streur, M. M. (2022). Quality of life and symptom experience measurement tools in adults with atrial fibrillation: a scoping review protocol. JBI Evidence Synthesis, 20(5), 1376-1384.

The PEO framework is useful for epidemiological questions about exposure to an event or an illness. 

PEO stands for: 

  • Population and their problems 
  • Exposure 
  • Outcomes or themes

Recently, there have been increasing cases of laryngeal cancer amongst people who work as stonemasons. The research team seeks to examine the literature to determine whether there is an association between occupational exposure to silica dust through stonemasonry and developing laryngeal cancer or silicosis.

Is there an association for people who work as stonemasons (P) between occupational exposure to silica dust (E) and laryngeal cancer (O)?  

Element | Definition | Scenario
P (population and their problem) | Who is the population of interest? AND What is the problem of interest? | People who work as stonemasons
E (exposure) | What is the exposure event or exposure disease? | Silica dust
O (outcomes or themes) | What is the result or outcome of interest? OR What themes are of interest? | Laryngeal cancer or silicosis

Chen, M., & Tse, L. A. (2012). Laryngeal cancer and silica dust exposure: A systematic review and meta-analysis. American Journal of Industrial Medicine, 55(8), 669-676.

The SPICE framework is useful for questions evaluating the results of a service, project, or intervention. 

SPICE stands for: 

  • Setting 
  • Perspective 
  • Intervention 
  • Comparison 
  • Evaluation

You want to design a new program to support the well-being of people living with spinal cord injury, but first you want to know what other programs have been developed and how they have been received by program participants.

From the perspective of community-based (S) people living with spinal cord injury (P), what is the impact of well-being interventions (I) on their own quality of life (E)?  

Element | Definition | Scenario
S (setting) | What is the setting? | Community
P (perspective) | Whose perspectives and experiences are of interest? | People living with a spinal cord injury
I (intervention) | What is the intervention of interest? | Well-being interventions
C (comparison)* | What are you comparing the intervention to? | No comparison
E (evaluation) | What is the result? | Impact of well-being interventions on people with a spinal cord injury

*Note: There may not always be a comparison element.

Simpson, B., Villeneuve, M., & Clifton, S. (2022). The experience and perspective of people with spinal cord injury about well-being interventions: a systematic review of qualitative studies. Disability and Rehabilitation, 44(14), 3349-3363.

The SPIDER framework is useful to help frame qualitative questions or those involving mixed methods research. 

SPIDER stands for: 

  • Sample 
  • Phenomenon of Interest 
  • Design 
  • Evaluation 
  • Research type

You're beginning a research degree in which you want to investigate barriers to nurses offering cross-cultural care. Before you start, you want to know whether any studies have been undertaken from the perspectives of nurses.

What are the perspectives (E) of nurses and nursing students (S) of their experiences in delivering transcultural care (PI)?  

Element | Definition | Scenario
S (sample) | Who is the group of interest? | Nurses or nursing students
PI (phenomenon of interest) | What is the researcher interested in? (e.g. behaviours, experiences) | Experiences of transcultural care
D (design) | What study designs will be included in the review? | Interviews, surveys, focus groups, questionnaires
E (evaluation) | What are the outcomes of the research? (e.g. perspectives) | Themes in nurse perspectives
R (research type) | What type of research will be included in the review? | Qualitative, mixed methods

Shahzad, S., Ali, N., Younas, A., & Tayaben, J. L. (2021). Challenges and approaches to transcultural care: An integrative review of nurses' and nursing students' experiences. Journal of Professional Nursing, 37(6), 1119-1131.


Activity: Match the framework to the question

Reviews can explore the same topic from different perspectives and for different reasons.

Below are three reviews looking at the topic of counselling for children. For each one, think about which framework has been used to frame the research question.


First review question

In school aged children with anxiety how effective is online counselling compared with in-person counselling in reducing panic attacks?


Answer to first review question

The PICO framework has been used to frame this review question based on its key elements:  In school aged children with anxiety (P) how effective is online counselling (I) compared with in-person counselling (C) in reducing panic attacks (O)?

Second review question

What are the experiences of school aged children with anxiety undergoing online counselling in rural Australia?

Answer to second review question

The PCC framework has been used to frame this review question based on its key elements: What are the experiences of school aged children with anxiety (P) undergoing online counselling (C) in rural Australia (C)?

Third review question

Children in rural Australia often lack access to in-person counselling services. From the perspective of their parents, how effective is online counselling compared with in person counselling?

Answer to third review question

The SPICE framework has been used to frame this review question based on its key elements: Children in rural Australia (S) often lack access to in-person counselling services. From the perspective of their parents (P), how effective (E) is online counselling (I) compared with in person counselling (C)?

Remember and reflect

Key takeaway

A framework is a useful tool in shaping your review topic into a research question. A framework does this by identifying key elements from your review topic. These key elements can then be used when developing your search to address your research question.

Before beginning your search, take a moment to clarify your research question: it's important to have a clear and answerable question at the start of the review process. If your question changes later in the process, the risk of bias in your review increases, and you may even need to restart the review from the beginning.
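The key-element idea above can also be sketched in code. The following Python snippet is a hypothetical illustration (not part of this module): it shows one way to turn framework elements and their synonyms into a Boolean search string, OR-ing synonyms within an element and AND-ing the elements together. The PICO terms used here are assumed example synonyms, not a validated search strategy.

```python
# Hypothetical sketch: map PICO elements (each with a list of synonyms)
# to a Boolean search string for a literature database.
pico = {
    "P": ["carpal tunnel syndrome", "median nerve entrapment"],
    "I": ["wrist splint", "splinting"],
    "C": ["corticosteroid injection"],
    "O": ["pain", "wrist function"],
}

def build_search(elements):
    """OR synonyms within each element, then AND the elements together."""
    groups = []
    for terms in elements.values():
        groups.append("(" + " OR ".join(f'"{t}"' for t in terms) + ")")
    return " AND ".join(groups)

print(build_search(pico))
# e.g. ("carpal tunnel syndrome" OR "median nerve entrapment") AND ("wrist splint" OR ...
```

In practice, a librarian-reviewed search strategy would add database-specific field tags and controlled vocabulary, but the element-by-element structure is the same.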

  • Last Updated: Jul 26, 2024 12:02 PM
  • URL: https://deakin.libguides.com/frame-research-module



What is a Theoretical Framework? How to Write It (with Examples) 


A theoretical framework 1,2 is the structure that supports and describes a theory. A theory is a set of interrelated concepts and definitions that presents a systematic view of phenomena by describing the relationships among the variables that explain these phenomena. A theory is developed after a long research process and explains the existence of a research problem in a study. A theoretical framework guides the research process like a roadmap and helps researchers clearly interpret their findings by providing a structure for organizing data and developing conclusions.

A theoretical framework in research is an important part of a manuscript and should be presented in the first section. It shows an understanding of the theories and concepts relevant to the research and helps limit the scope of the research.  


What is a theoretical framework?

A theoretical framework in research can be defined as a set of concepts, theories, ideas, and assumptions that help you understand a specific phenomenon or problem. It can be considered a blueprint that is borrowed by researchers to develop their own research inquiry. A theoretical framework in research helps researchers design and conduct their research and analyze and interpret their findings. It explains the relationship between variables, identifies gaps in existing knowledge, and guides the development of research questions, hypotheses, and methodologies to address that gap.  


Now that you know the answer to "What is a theoretical framework?", check the following table, which lists the different types of theoretical frameworks in research: 3

Type | Description
Conceptual | Defines key concepts and relationships
Deductive | Starts with a general hypothesis and then uses data to test it; used in quantitative research
Inductive | Starts with data and then develops a hypothesis; used in qualitative research
Empirical | Focuses on the collection and analysis of empirical data; used in scientific research
Normative | Defines a set of norms that guide behavior; used in ethics and social sciences
Explanatory | Explains causes of particular behavior; used in psychology and social sciences

Developing a theoretical framework in research can help in the following situations: 4

  • When conducting research on complex phenomena because a theoretical framework helps organize the research questions, hypotheses, and findings  
  • When the research problem requires a deeper understanding of the underlying concepts  
  • When conducting research that seeks to address a specific gap in knowledge  
  • When conducting research that involves the analysis of existing theories  


Importance of a theoretical framework  

The purpose of a theoretical framework is to support you in the following ways during the research process: 2

  • Provide a structure for the complete research process  
  • Assist researchers in incorporating formal theories into their study as a guide  
  • Provide a broad guideline to maintain the research focus  
  • Guide the selection of research methods, data collection, and data analysis  
  • Help understand the relationships between different concepts and develop hypotheses and research questions  
  • Address gaps in existing literature  
  • Analyze the data collected and draw meaningful conclusions and make the findings more generalizable  

Theoretical vs. Conceptual framework  

While a theoretical framework covers the theoretical aspect of your study, that is, the various theories that can guide your research, a conceptual framework defines the variables for your study and presents how they relate to each other. The conceptual framework is developed before collecting the data. However, both frameworks help in understanding the research problem and guide the development, collection, and analysis of the research.  

The following table lists some differences between conceptual and theoretical frameworks. 5

Theoretical framework | Conceptual framework
Based on existing theories that have been tested and validated by others | Based on concepts that are the main variables in the study
Used to create a foundation of the theory on which your study will be developed | Visualizes the relationships between the concepts and variables based on the existing literature
Used to test theories, to predict and control the situations within the context of a research inquiry | Helps the development of a theory that would be useful to practitioners
Provides a general set of ideas within which a study belongs | Refers to specific ideas that researchers utilize in their study
Offers a focal point for approaching unknown research in a specific field of inquiry | Shows logically how the research inquiry should be undertaken
Works deductively | Works inductively
Used in quantitative studies | Used in qualitative studies


How to write a theoretical framework  

The following general steps can help those wondering how to write a theoretical framework: 2

  • Identify and define the key concepts clearly and organize them into a suitable structure.  
  • Use appropriate terminology and define all key terms to ensure consistency.  
  • Identify the relationships between concepts and provide a logical and coherent structure.  
  • Develop hypotheses that can be tested through data collection and analysis.  
  • Keep it concise and focused with clear and specific aims.  


Examples of a theoretical framework  

Here are two examples of a theoretical framework. 6,7

Example 1.

An insurance company is facing a challenge in cross-selling its products. The sales department indicates that most customers hold just one policy, although the company offers over 10 unique policies. The company wants its customers to purchase more than one policy, since most customers are currently purchasing additional policies from other companies.

Objective : To sell more insurance products to existing customers.  

Problem : Many customers are purchasing additional policies from other companies.  

Research question : How can customer product awareness be improved to increase cross-selling of insurance products?  

Sub-questions: What is the relationship between product awareness and sales? Which factors determine product awareness?  

Since “product awareness” is the main focus in this study, the theoretical framework should analyze this concept and study previous literature on this subject and propose theories that discuss the relationship between product awareness and its improvement in sales of other products.  

Example 2.

A company is facing a continued decline in its sales and profitability. The main reason for the decline in the profitability is poor services, which have resulted in a high level of dissatisfaction among customers and consequently a decline in customer loyalty. The management is planning to concentrate on clients’ satisfaction and customer loyalty.  

Objective: To provide better service to customers and increase customer loyalty and satisfaction.  

Problem: Continued decrease in sales and profitability.  

Research question: How can customer satisfaction help in increasing sales and profitability?  

Sub-questions: What is the relationship between customer loyalty and sales? Which factors influence the level of satisfaction gained by customers?  

Since customer satisfaction, loyalty, profitability, and sales are the important topics in this example, the theoretical framework should focus on these concepts.  

Benefits of a theoretical framework  

There are several benefits of a theoretical framework in research: 2  

  • Provides a structured approach allowing researchers to organize their thoughts in a coherent way.  
  • Helps to identify gaps in knowledge highlighting areas where further research is needed.  
  • Increases research efficiency by providing a clear direction for research and focusing efforts on relevant data.  
  • Improves the quality of research by providing a rigorous and systematic approach to research, which can increase the likelihood of producing valid and reliable results.  
  • Provides a basis for comparison by providing a common language and conceptual framework for researchers to compare their findings with other research in the field, facilitating the exchange of ideas and the development of new knowledge.  


Frequently Asked Questions 

Q1. How do I develop a theoretical framework? 7

A1. The following steps can be used for developing a theoretical framework :  

  • Identify the research problem and research questions by clearly defining the problem that the research aims to address and identifying the specific questions that the research aims to answer.
  • Review the existing literature to identify the key concepts that have been studied previously. These concepts should be clearly defined and organized into a structure.
  • Develop propositions that describe the relationships between the concepts. These propositions should be based on the existing literature and should be testable.
  • Develop hypotheses that can be tested through data collection and analysis.
  • Test the theoretical framework through data collection and analysis to determine whether the framework is valid and reliable.

Q2. How do I know if I have developed a good theoretical framework or not? 8

A2. The following checklist could help you answer this question:  

  • Is my theoretical framework clearly seen as emerging from my literature review?  
  • Is it the result of my analysis of the main theories previously studied in my same research field?  
  • Does it represent or is it relevant to the most current state of theoretical knowledge on my topic?  
  • Does the theoretical framework in research present a logical, coherent, and analytical structure that will support my data analysis?  
  • Do the different parts of the theory help analyze the relationships among the variables in my research?  
  • Does the theoretical framework target how I will answer my research questions or test the hypotheses?  
  • Have I documented every source I have used in developing this theoretical framework?  
  • Is my theoretical framework a model, a table, a figure, or a description?  
  • Have I explained why this is the appropriate theoretical framework for my data analysis?  

Q3. Can I use multiple theoretical frameworks in a single study?  

A3. Using multiple theoretical frameworks in a single study is acceptable as long as each theory is clearly defined and related to the study. Each theory should also be discussed individually. This approach may, however, be tedious and effort intensive. Therefore, multiple theoretical frameworks should be used only if absolutely necessary for the study.  

Q4. Is it necessary to include a theoretical framework in every research study?  

A4. The theoretical framework connects researchers to existing knowledge. So, including a theoretical framework would help researchers get a clear idea about the research process and help structure their study effectively by clearly defining an objective, a research problem, and a research question.  

Q5. Can a theoretical framework be developed for qualitative research?  

A5. Yes, a theoretical framework can be developed for qualitative research. However, qualitative research methods may not involve a theory developed beforehand; in such studies, a theory can instead be developed inductively during the data analysis phase. The outcome of this inductive approach is referred to as an emergent theoretical framework, which explains a phenomenon without a guiding framework at the outset.


Q6. What is the main difference between a literature review and a theoretical framework?

A6. A literature review explores already existing studies about a specific topic in order to highlight a gap, which becomes the focus of the current research study. A theoretical framework can be considered the next step in the process, in which the researcher plans a specific conceptual and analytical approach to address the identified gap in the research.  

Theoretical frameworks are thus important components of the research process, and researchers should devote ample time to developing a solid theoretical framework so that it can effectively guide their research in a suitable direction. We hope this article has provided a good insight into the concept of theoretical frameworks in research and their benefits.

References  

1. Organizing academic research papers: Theoretical framework. Sacred Heart University Library. Accessed August 4, 2023. https://library.sacredheart.edu/c.php?g=29803&p=185919
2. Salomao A. Understanding what is theoretical framework. Mind the Graph website. Accessed August 5, 2023. https://mindthegraph.com/blog/what-is-theoretical-framework/
3. Theoretical framework: Types, examples, and writing guide. Research Method website. Accessed August 6, 2023. https://researchmethod.net/theoretical-framework/
4. Grant C., Osanloo A. Understanding, selecting, and integrating a theoretical framework in dissertation research: Creating the blueprint for your "house." Administrative Issues Journal: Connecting Education, Practice, and Research, 4(2):12-26. 2014. Accessed August 7, 2023. https://files.eric.ed.gov/fulltext/EJ1058505.pdf
5. Difference between conceptual framework and theoretical framework. MIM Learnovate website. Accessed August 7, 2023. https://mimlearnovate.com/difference-between-conceptual-framework-and-theoretical-framework/
6. Example of a theoretical framework: Thesis & dissertation. BachelorPrint website. Accessed August 6, 2023. https://www.bachelorprint.com/dissertation/example-of-a-theoretical-framework/
7. Sample theoretical framework in dissertation and thesis: Overview and example. Students Assignment Help website. Accessed August 6, 2023. https://www.studentsassignmenthelp.co.uk/blogs/sample-dissertation-theoretical-framework/#Example_of_the_theoretical_framework
8. Kivunja C. Distinguishing between theory, theoretical framework, and conceptual framework: A systematic review of lessons from the field. Accessed August 8, 2023. https://files.eric.ed.gov/fulltext/EJ1198682.pdf



What is a Theoretical Framework? | A Step-by-Step Guide

Published on 14 February 2020 by Shona McCombes . Revised on 10 October 2022.

A theoretical framework is a foundational review of existing theories that serves as a roadmap for developing the arguments you will use in your own work.

Theories are developed by researchers to explain phenomena, draw connections, and make predictions. In a theoretical framework, you explain the existing theories that support your research, showing that your work is grounded in established ideas.

In other words, your theoretical framework justifies and contextualises your later research, and it’s a crucial first step for your research paper , thesis, or dissertation . A well-rounded theoretical framework sets you up for success later on in your research and writing process.


Table of contents

  • Why do you need a theoretical framework?
  • How to write a theoretical framework
  • Structuring your theoretical framework
  • Example of a theoretical framework
  • Frequently asked questions about theoretical frameworks

Before you start your own research, it’s crucial to familiarise yourself with the theories and models that other researchers have already developed. Your theoretical framework is your opportunity to present and explain what you’ve learned, situated within your future research topic.

There’s a good chance that many different theories about your topic already exist, especially if the topic is broad. In your theoretical framework, you will evaluate, compare, and select the most relevant ones.

By “framing” your research within a clearly defined field, you make the reader aware of the assumptions that inform your approach, showing the rationale behind your choices for later sections, like methodology and discussion . This part of your dissertation lays the foundations that will support your analysis, helping you interpret your results and make broader generalisations .

  • In literature , a scholar using postmodernist literary theory would analyse The Great Gatsby differently than a scholar using Marxist literary theory.
  • In psychology , a behaviourist approach to depression would involve different research methods and assumptions than a psychoanalytic approach.
  • In economics , wealth inequality would be explained and interpreted differently based on a classical economics approach than based on a Keynesian economics one.


To create your own theoretical framework, you can follow these three steps:

  • Identifying your key concepts
  • Evaluating and explaining relevant theories
  • Showing how your research fits into existing research

1. Identify your key concepts

The first step is to pick out the key terms from your problem statement and research questions . Concepts often have multiple definitions, so your theoretical framework should also clearly define what you mean by each term.

Suppose company X is struggling with low customer retention. To investigate this problem, you have identified and plan to focus on the following problem statement, objective, and research questions:

Problem : Many online customers do not return to make subsequent purchases.

Objective : To increase the quantity of return customers.

Research question : How can the satisfaction of company X’s online customers be improved in order to increase the quantity of return customers?

2. Evaluate and explain relevant theories

By conducting a thorough literature review , you can determine how other researchers have defined these key concepts and drawn connections between them. As you write your theoretical framework, your aim is to compare and critically evaluate the approaches that different authors have taken.

After discussing different models and theories, you can establish the definitions that best fit your research and justify why. You can even combine theories from different fields to build your own unique framework if this better suits your topic.

Make sure to at least briefly mention each of the most important theories related to your key concepts. If there is a well-established theory that you don’t want to apply to your own research, explain why it isn’t suitable for your purposes.

3. Show how your research fits into existing research

Apart from summarising and discussing existing theories, your theoretical framework should show how your project will make use of these ideas and take them a step further.

You might aim to do one or more of the following:

  • Test whether a theory holds in a specific, previously unexamined context
  • Use an existing theory as a basis for interpreting your results
  • Critique or challenge a theory
  • Combine different theories in a new or unique way

A theoretical framework can sometimes be integrated into a literature review chapter , but it can also be included as its own chapter or section in your dissertation. As a rule of thumb, if your research involves dealing with a lot of complex theories, it’s a good idea to include a separate theoretical framework chapter.

There are no fixed rules for structuring your theoretical framework, but it’s best to double-check with your department or institution to make sure they don’t have any formatting guidelines. The most important thing is to create a clear, logical structure. There are a few ways to do this:

  • Draw on your research questions, structuring each section around a question or key concept
  • Organise by theory cluster
  • Organise by date

As in all other parts of your research paper, thesis, or dissertation, make sure to properly cite your sources to avoid plagiarism.

To get a sense of what this part of your thesis or dissertation might look like, take a look at our full example.


While a theoretical framework describes the theoretical underpinnings of your work based on existing research, a conceptual framework allows you to draw your own conclusions, mapping out the variables you may use in your study and the interplay between them.

A literature review and a theoretical framework are not the same thing and cannot be used interchangeably. While a theoretical framework describes the theoretical underpinnings of your work, a literature review critically evaluates existing research relating to your topic. You'll likely need both in your dissertation.


A literature review is a survey of scholarly sources (such as books, journal articles, and theses) related to a specific topic or research question.

It is often written as part of a dissertation, thesis, research paper, or proposal.


McCombes, S. (2022, October 10). What is a Theoretical Framework? | A Step-by-Step Guide. Scribbr. Retrieved 9 September 2024, from https://www.scribbr.co.uk/thesis-dissertation/the-theoretical-framework/


Practical Research Module: Conceptual Framework and Review of Related Literature

This Senior High School Self-Learning Module (SLM) is prepared so that you, our dear learners, can continue your studies and learn while at home. Activities, questions, directions, exercises, and discussions are carefully stated for you to understand each lesson.

At the end of this module, you should be able to:

1. illustrate and explain the research framework (CS_RS12-If-j-6) ;

2. define terms used in the study (CS_RS12-If-j-7);

3. list research hypothesis (if appropriate) (CS_RS12-If-j-8) and

4. present a written review of related literature and conceptual framework (CS_RS12-If-j-9) .

Senior High School Quarter 1 Self-Learning Module Practical Research 2 – Conceptual Framework and Review of Related Literature


Research Framework

  • First Online: 01 January 2013


  • Basudeb Bhatta 2  

Part of the book series: SpringerBriefs in Earth Sciences ((BRIEFSEARTH))


Although this book is focused on research methods, the entire research framework also needs to be addressed in context. As stated earlier in Chap. 1, research methods involve some fundamental theoretical questions. These questions are philosophical and concern ontology and epistemology. Such philosophical concerns tend to get sorted into distinct paradigms that a researcher can follow. Nested within the theoretical coordinates of a paradigm is a set of decisions one has to make about methodology. Finally, at the most concrete and practical level, we find research methods. This chapter discusses the entire research framework in the context of remote sensing. Once we have an overview of the research framework, we can proceed to research methods in the subsequent chapters.



Author information

Authors and Affiliations

Computer Aided Design Centre, Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700 032, India

Basudeb Bhatta


Corresponding author

Correspondence to Basudeb Bhatta .


Copyright information

© 2013 The Author(s)

About this chapter

Bhatta, B. (2013). Research Framework. In: Research Methods in Remote Sensing. SpringerBriefs in Earth Sciences. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-6594-8_2

Download citation

DOI : https://doi.org/10.1007/978-94-007-6594-8_2

Published : 16 April 2013

Publisher Name : Springer, Dordrecht

Print ISBN : 978-94-007-6593-1

Online ISBN : 978-94-007-6594-8

eBook Packages : Earth and Environmental Science Earth and Environmental Science (R0)




Related Resources

  • Nature of Inquiry and Research

Self-Learning Module, Quarter 1, Practical Research 2: SHS Modules 1–3 (ZIP)

Curriculum Information

Education Type: K to 12
Grade Level: Grade 11, Grade 12
Learning Area:
Content/Topic: Nature of Inquiry and Research; Identifying the Inquiry and Stating the Problem
Intended Users: Educators, Learners
Competencies: Differentiates quantitative from qualitative research; Designs a research project related to daily life; Writes a research title

Copyright Information

Copyright: Yes
Copyright Owner: Department of Education
Conditions of Use: Use, Copy, Print

Technical Information

File Size: 3.17 MB
File Type: application/x-zip-compressed


  • Open access
  • Published: 12 September 2024

An open-source framework for end-to-end analysis of electronic health record data

  • Lukas Heumos 1 , 2 , 3 ,
  • Philipp Ehmele 1 ,
  • Tim Treis 1 , 3 ,
  • Julius Upmeier zu Belzen   ORCID: orcid.org/0000-0002-0966-4458 4 ,
  • Eljas Roellin 1 , 5 ,
  • Lilly May 1 , 5 ,
  • Altana Namsaraeva 1 , 6 ,
  • Nastassya Horlava 1 , 3 ,
  • Vladimir A. Shitov   ORCID: orcid.org/0000-0002-1960-8812 1 , 3 ,
  • Xinyue Zhang   ORCID: orcid.org/0000-0003-4806-4049 1 ,
  • Luke Zappia   ORCID: orcid.org/0000-0001-7744-8565 1 , 5 ,
  • Rainer Knoll 7 ,
  • Niklas J. Lang 2 ,
  • Leon Hetzel 1 , 5 ,
  • Isaac Virshup 1 ,
  • Lisa Sikkema   ORCID: orcid.org/0000-0001-9686-6295 1 , 3 ,
  • Fabiola Curion 1 , 5 ,
  • Roland Eils 4 , 8 ,
  • Herbert B. Schiller 2 , 9 ,
  • Anne Hilgendorff 2 , 10 &
  • Fabian J. Theis   ORCID: orcid.org/0000-0002-2419-1943 1 , 3 , 5  

Nature Medicine (2024)


  • Epidemiology
  • Translational research

With progressive digitalization of healthcare systems worldwide, large-scale collection of electronic health records (EHRs) has become commonplace. However, an extensible framework for comprehensive exploratory analysis that accounts for data heterogeneity is missing. Here we introduce ehrapy, a modular open-source Python framework designed for exploratory analysis of heterogeneous epidemiology and EHR data. ehrapy incorporates a series of analytical steps, from data extraction and quality control to the generation of low-dimensional representations. Complemented by rich statistical modules, ehrapy facilitates associating patients with disease states, differential comparison between patient clusters, survival analysis, trajectory inference, causal inference and more. Leveraging ontologies, ehrapy further enables data sharing and training EHR deep learning models, paving the way for foundational models in biomedical research. We demonstrate ehrapy’s features in six distinct examples. We applied ehrapy to stratify patients affected by unspecified pneumonia into finer-grained phenotypes. Furthermore, we reveal biomarkers for significant differences in survival among these groups. Additionally, we quantify medication-class effects of pneumonia medications on length of stay. We further leveraged ehrapy to analyze cardiovascular risks across different data modalities. We reconstructed disease state trajectories in patients with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) based on imaging data. Finally, we conducted a case study to demonstrate how ehrapy can detect and mitigate biases in EHR data. ehrapy, thus, provides a framework that we envision will standardize analysis pipelines on EHR data and serve as a cornerstone for the community.


Electronic health records (EHRs) are becoming increasingly common due to standardized data collection 1 and digitalization in healthcare institutions. EHRs collected at medical care sites serve as efficient storage and sharing units of health information 2 , enabling the informed treatment of individuals using the patient’s complete history 3 . Routinely collected EHR data are approaching genomic-scale size and complexity 4 , posing challenges in extracting information without quantitative analysis methods. The application of such approaches to EHR databases 1 , 5 , 6 , 7 , 8 , 9 has enabled the prediction and classification of diseases 10 , 11 , study of population health 12 , determination of optimal treatment policies 13 , 14 , simulation of clinical trials 15 and stratification of patients 16 .

However, current EHR datasets suffer from serious limitations, such as data collection issues, inconsistencies and lack of data diversity. EHR data collection and sharing problems often arise due to non-standardized formats, with disparate systems using exchange protocols, such as Health Level Seven International (HL7) and Fast Healthcare Interoperability Resources (FHIR) 17 . In addition, EHR data are stored in various on-disk formats, including, but not limited to, relational databases and CSV, XML and JSON formats. These variations pose challenges with respect to data retrieval, scalability, interoperability and data sharing.

Beyond format variability, inherent biases of the collected data can compromise the validity of findings. Selection bias stemming from non-representative sample composition can lead to skewed inferences about disease prevalence or treatment efficacy 18 , 19 . Filtering bias arises through inconsistent criteria for data inclusion, obscuring true variable relationships 20 . Surveillance bias exaggerates associations between exposure and outcomes due to differential monitoring frequencies 21 . EHR data are further prone to missing data 22 , 23 , which can be broadly classified into three categories: missing completely at random (MCAR), where missingness is unrelated to the data; missing at random (MAR), where missingness depends on observed data; and missing not at random (MNAR), where missingness depends on unobserved data 22 , 23 . Information and coding biases, related to inaccuracies in data recording or coding inconsistencies, respectively, can lead to misclassification and unreliable research conclusions 24 , 25 . Data may even contradict itself, such as when measurements were reported for deceased patients 26 , 27 . Technical variation and differing data collection standards lead to distribution differences and inconsistencies in representation and semantics across EHR datasets 28 , 29 . Attrition and confounding biases, resulting from differential patient dropout rates or unaccounted external variable effects, can significantly skew study outcomes 30 , 31 , 32 . The diversity of EHR data that comprise demographics, laboratory results, vital signs, diagnoses, medications, x-rays, written notes and even omics measurements amplifies all the aforementioned issues.
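The missingness mechanisms above are easiest to see in a small simulation. The sketch below uses entirely synthetic data with invented variable names (it is not drawn from the paper) to contrast MCAR with MAR; MNAR is omitted because its missingness depends on unobserved values and cannot be diagnosed from the observed data alone.

```python
import numpy as np

# Toy lab-value dataset: an observed covariate (age) and a measurement (lab).
rng = np.random.default_rng(0)
n_patients = 1_000
age = rng.uniform(1, 90, n_patients)
lab = rng.normal(5.0, 1.0, n_patients)

# MCAR: every value has the same 20% chance of being missing,
# independent of anything in the data.
mcar_mask = rng.random(n_patients) < 0.20

# MAR: missingness depends on an *observed* variable (here, age) --
# e.g. a test ordered less often for pediatric patients.
mar_prob = np.where(age < 18, 0.40, 0.05)
mar_mask = rng.random(n_patients) < mar_prob

lab_mcar = np.where(mcar_mask, np.nan, lab)
lab_mar = np.where(mar_mask, np.nan, lab)

# Under MAR, the missing rate differs sharply between observed subgroups;
# under MCAR it does not (up to sampling noise).
young = age < 18
mcar_gap = abs(np.isnan(lab_mcar)[young].mean() - np.isnan(lab_mcar)[~young].mean())
mar_gap = abs(np.isnan(lab_mar)[young].mean() - np.isnan(lab_mar)[~young].mean())
print(f"MCAR gap: {mcar_gap:.2f}, MAR gap: {mar_gap:.2f}")
```

Comparing missing rates across observed subgroups, as done here, is exactly the kind of exploratory bias check that motivates the workflow described in the next paragraph.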

Addressing these challenges requires rigorous study design, careful data pre-processing and continuous bias evaluation through exploratory data analysis. Several EHR data pre-processing and analysis workflows were previously developed 4 , 33 , 34 , 35 , 36 , 37 , but none of them enables the analysis of heterogeneous data, provides in-depth documentation, is available as a software package or allows for exploratory visual analysis. Current EHR analysis pipelines, therefore, differ considerably in their approaches and are often commercial, vendor-specific solutions 38 . This is in contrast to strategies using community standards for the analysis of omics data, such as Bioconductor 39 or scverse 40 . As a result, EHR data frequently remain underexplored and are commonly investigated only for a particular research question 41 . Even in such cases, EHR data are then frequently input into machine learning models with serious data quality issues that greatly impact prediction performance and generalizability 42 .

To address this lack of analysis tooling, we developed the EHR Analysis in Python framework, ehrapy, which enables exploratory analysis of diverse EHR datasets. The ehrapy package is purpose-built to organize, analyze, visualize and statistically compare complex EHR data. ehrapy can be applied to datasets of different data types, sizes, diseases and origins. To demonstrate this versatility, we applied ehrapy to datasets obtained from EHR and population-based studies. Using the Pediatric Intensive Care (PIC) EHR database 43 , we stratified patients diagnosed with ‘unspecified pneumonia’ into distinct clinically relevant groups, extracted clinical indicators of pneumonia through statistical analysis and quantified medication-class effects on length of stay (LOS) with causal inference. Using the UK Biobank 44 (UKB), a population-scale cohort comprising over 500,000 participants from the United Kingdom, we employed ehrapy to explore cardiovascular risk factors using clinical predictors, metabolomics, genomics and retinal imaging-derived features. Additionally, we performed image analysis to project disease progression through fate mapping in patients affected by coronavirus disease 2019 (COVID-19) using chest x-rays. Finally, we demonstrate how exploratory analysis with ehrapy unveils and mitigates biases in over 100,000 visits by patients with diabetes across 130 US hospitals. We provide online links to additional use cases that demonstrate ehrapy’s usage with further datasets, including MIMIC-II (ref. 45 ), and for various medical conditions, such as patients subject to indwelling arterial catheter usage. ehrapy is compatible with any EHR dataset that can be transformed into vectors and is accessible as a user-friendly open-source software package hosted at https://github.com/theislab/ehrapy and installable from PyPI. It comes with comprehensive documentation, tutorials and further examples, all available at https://ehrapy.readthedocs.io .

ehrapy: a framework for exploratory EHR data analysis

The foundation of ehrapy is a robust and scalable data storage backend that is combined with a series of pre-processing and analysis modules. In ehrapy, EHR data are organized as a data matrix where observations are individual patient visits (or patients, in the absence of follow-up visits), and variables represent all measured quantities ( Methods ). These data matrices are stored together with metadata of observations and variables. By leveraging the AnnData (annotated data) data structure that implements this design, ehrapy builds upon established standards and is compatible with analysis and visualization functions provided by the omics scverse 40 ecosystem. Readers are also available in R, Julia and Javascript 46 . We additionally provide a dataset module with more than 20 public loadable EHR datasets in AnnData format to kickstart analysis and development with ehrapy.
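The observations-by-variables layout described above can be illustrated with plain pandas. This is only a stand-in for AnnData's matrix-plus-metadata design (its `X`, `.obs` and `.var` slots); the visit IDs, measurements and units below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Rows = patient visits (observations), columns = measured quantities (variables).
X = pd.DataFrame(
    {
        "heart_rate": [72.0, 95.0, np.nan],
        "creatinine": [0.9, 1.4, 1.1],
        "crp": [3.0, 55.0, 12.0],
    },
    index=["visit_001", "visit_002", "visit_003"],
)

# Metadata of observations (per visit) and of variables (per measurement),
# kept aligned with the matrix just as AnnData's .obs and .var are.
obs = pd.DataFrame(
    {"age_months": [38, 120, 54], "icu": ["PICU", "GICU", "PICU"]},
    index=X.index,
)
var = pd.DataFrame({"unit": ["bpm", "mg/dL", "mg/L"]}, index=X.columns)

print(X.shape)  # → (3, 3)
```

Keeping per-visit and per-variable annotations indexed against the same matrix is what makes intermediate results from each pipeline step reusable downstream.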

For standardized analysis of EHR data, it is crucial that these data are encoded and stored in consistent, reusable formats. Thus, ehrapy requires that input data are organized in structured vectors. Readers for common formats, such as CSV, OMOP 47 or SQL databases, are available in ehrapy. Data loaded into AnnData objects can be mapped against several hierarchical ontologies 48 , 49 , 50 , 51 ( Methods ). Clinical keywords of free text notes can be automatically extracted ( Methods ).

Powered by scanpy, which scales to millions of observations 52 ( Methods and Supplementary Table 1 ) and the machine learning library scikit-learn 53 , ehrapy provides more than 100 composable analysis functions organized in modules from which custom analysis pipelines can be built. Each function directly interacts with the AnnData object and adds all intermediate results for simple access and reuse of information to it. To facilitate setting up these pipelines, ehrapy guides analysts through a general analysis pipeline (Fig. 1 ). At any step of an analysis pipeline, community software packages can be integrated without any vendor lock-in. Because ehrapy is built on open standards, it can be purposefully extended to solve new challenges, such as the development of foundational models ( Methods ).

Figure 1

a , Heterogeneous health data are first loaded into memory as an AnnData object with patient visits as observational rows and variables as columns. Next, the data can be mapped against ontologies, and key terms are extracted from free text notes. b , The EHR data are subject to quality control where low-quality or spurious measurements are removed or imputed. Subsequently, numerical data are normalized, and categorical data are encoded. Data from different sources with data distribution shifts are integrated, embedded, clustered and annotated in a patient landscape. c , Further downstream analyses depend on the question of interest and can include the inference of causal effects and trajectories, survival analysis or patient stratification.

In the ehrapy analysis pipeline, EHR data are initially inspected for quality issues by analyzing feature distributions that may skew results and by detecting visits and features with high missing rates that ehrapy can then impute ( Methods ). ehrapy tracks all filtering steps while keeping track of population dynamics to highlight potential selection and filtering biases ( Methods ). Subsequently, ehrapy’s normalization and encoding functions ( Methods ) are applied to achieve a uniform numerical representation that facilitates data integration and corrects for dataset shift effects ( Methods ). Calculated lower-dimensional representations can subsequently be visualized, clustered and annotated to obtain a patient landscape ( Methods ). Such annotated groups of patients can be used for statistical comparisons to find differences in features among them to ultimately learn markers of patient states.
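The early pipeline stages above (quality control of missing rates, imputation, normalization) can be sketched as follows. This is a simplified stand-in, not ehrapy's own API: median imputation and z-scoring are assumed as the simplest representatives of each step, and the feature names and the 50% missingness threshold are invented.

```python
import numpy as np
import pandas as pd

# Synthetic lab-value table with some values removed, as if flagged during QC.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["wbc", "crp", "pct", "albumin"])
df[df > 2.5] = np.nan

# 1) Quality control: compute per-feature missing rates and drop features
#    whose missingness exceeds a chosen threshold.
missing_rate = df.isna().mean()
keep = missing_rate[missing_rate < 0.5].index

# 2) Imputation: fill remaining gaps (median imputation as a simple stand-in).
imputed = df[keep].fillna(df[keep].median())

# 3) Normalization: z-score each feature to a uniform numerical representation.
normalized = (imputed - imputed.mean()) / imputed.std(ddof=0)

print(int(normalized.isna().sum().sum()))  # → 0
```

In practice each step would additionally be logged against the population, so that filtering decisions remain auditable for selection bias, as the text describes.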

As analysis goals can differ between users and datasets, the ehrapy analysis pipeline is customizable during the final knowledge inference step. ehrapy provides statistical methods for group comparison and extensive support for survival analysis ( Methods ), enabling the discovery of biomarkers. Furthermore, ehrapy offers functions for causal inference to go from statistically determined associations to causal relations ( Methods ). Moreover, patient visits in aggregated EHR data can be regarded as snapshots where individual measurements taken at specific timepoints might not adequately reflect the underlying progression of disease and result from unrelated variation due to, for example, day-to-day differences 54 , 55 , 56 . Therefore, disease progression models should rely on analysis of the underlying clinical data, as disease progression in an individual patient may not be monotonous in time. ehrapy allows for the use of advanced trajectory inference methods to overcome sparse measurements 57 , 58 , 59 . We show that this approach can order snapshots to calculate a pseudotime that can adequately reflect the progression of the underlying clinical process. Given a sufficient number of snapshots, ehrapy increases the potential to understand disease progression, which is likely not robustly captured within a single EHR but, rather, across several.
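The idea of recovering an ordering (a pseudotime) from unordered patient snapshots can be illustrated with a toy one-dimensional embedding. This is a crude stand-in for the trajectory inference methods referenced above: purely synthetic data, with the first principal component standing in for a learned latent progression axis.

```python
import numpy as np

# Hidden disease progression and two noisy markers that track it.
rng = np.random.default_rng(3)
true_progress = np.sort(rng.uniform(0, 1, 150))
features = np.column_stack([
    2.0 * true_progress + rng.normal(0, 0.1, 150),   # marker rising with disease
    -1.5 * true_progress + rng.normal(0, 0.1, 150),  # marker falling with disease
])

# First principal component via SVD as a one-dimensional embedding.
centered = features - features.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pseudotime = centered @ vt[0]
if np.corrcoef(pseudotime, true_progress)[0, 1] < 0:  # fix arbitrary PC sign
    pseudotime = -pseudotime

corr = np.corrcoef(pseudotime, true_progress)[0, 1]
print(f"correlation with true progression: {corr:.2f}")
```

Given enough snapshots, the recovered ordering closely tracks the hidden progression even though no single patient contributes a full time course, which is the premise of pseudotime analysis across EHRs.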

ehrapy enables patient stratification in pneumonia cases

To demonstrate ehrapy’s capability to analyze heterogeneous datasets from a broad patient set across multiple care units, we applied our exploratory strategy to the PIC 43 database. The PIC database is a single-center database hosting information on children admitted to critical care units at the Children’s Hospital of Zhejiang University School of Medicine in China. It contains 13,499 distinct hospital admissions of 12,881 individual pediatric patients admitted between 2010 and 2018 for whom demographics, diagnoses, doctors’ notes, vital signs, laboratory and microbiology tests, medications, fluid balances and more were collected (Extended Data Figs. 1 and 2a and Methods ). After missing data imputation and subsequent pre-processing (Extended Data Figs. 2b,c and 3 and Methods ), we generated a uniform manifold approximation and projection (UMAP) embedding to visualize variation across all patients using ehrapy (Fig. 2a ). This visualization of the low-dimensional patient manifold shows the heterogeneity of the collected data in the PIC database, with malformations, perinatal and respiratory being the most abundant International Classification of Diseases (ICD) chapters (Fig. 2b ). The most common respiratory disease categories (Fig. 2c ) were labeled pneumonia and influenza ( n  = 984). We focused on pneumonia to apply ehrapy to a challenging, broad-spectrum disease that affects all age groups. Pneumonia is a prevalent respiratory infection that poses a substantial burden on public health 60 and is characterized by inflammation of the alveoli and distal airways 60 . Individuals with pre-existing chronic conditions are particularly vulnerable, as are children under the age of 5 (ref. 61 ). Pneumonia can be caused by a range of microorganisms, encompassing bacteria, respiratory viruses and fungi.

Figure 2

a , UMAP of all patient visits in the ICU with primary discharge diagnosis grouped by ICD chapter. b , The prevalence of respiratory diseases prompted us to investigate them further. c , Respiratory categories show the abundance of influenza and pneumonia diagnoses that we investigated more closely. d , We observed the ‘unspecified pneumonia’ subgroup, which led us to investigate and annotate it in more detail. e , The previously ‘unspecified pneumonia’-labeled patients were annotated using several clinical features (Extended Data Fig. 5 ), of which the most important ones are shown in the heatmap ( f ). g , Example disease progression of an individual child with pneumonia illustrating pharmacotherapy over time until positive A. baumannii swab.

We selected the age group ‘youths’ (13 months to 18 years of age) for further analysis, addressing a total of 265 patients who dominated the pneumonia cases and were diagnosed with ‘unspecified pneumonia’ (Fig. 2d and Extended Data Fig. 4 ). Neonates (0–28 d old) and infants (29 d to 12 months old) were excluded from the analysis as the disease context is significantly different in these age groups due to distinct anatomical and physical conditions. Patients were 61% male, had a total of 277 admissions, had a mean age at admission of 54 months (median, 38 months) and had an average LOS of 15 d (median, 7 d). Of these, 152 patients were admitted to the pediatric intensive care unit (PICU), 118 to the general ICU (GICU), four to the surgical ICU (SICU) and three to the cardiac ICU (CICU). Laboratory measurements typically had 12–14% missing data, except for serum procalcitonin (PCT), a marker for bacterial infections, with 24.5% missing, and C-reactive protein (CRP), a marker of inflammation, with 16.8% missing. Measurements assigned as ‘vital signs’ contained between 44% and 54% missing values. Stratifying patients with unspecified pneumonia further enables a more nuanced understanding of the disease, potentially facilitating tailored approaches to treatment.

To deepen clinical phenotyping for the disease group ‘unspecified pneumonia’, we calculated a k -nearest neighbor graph to cluster patients into groups and visualize these in UMAP space ( Methods ). Leiden clustering 62 identified four patient groupings with distinct clinical features that we annotated (Fig. 2e ). To identify the laboratory values, medications and pathogens that were most characteristic for these four groups (Fig. 2f ), we applied t -tests for numerical data and g -tests for categorical data between the identified groups using ehrapy (Extended Data Fig. 5 and Methods ). Based on this analysis, we identified patient groups with ‘sepsis-like, ‘severe pneumonia with co-infection’, ‘viral pneumonia’ and ‘mild pneumonia’ phenotypes. The ‘sepsis-like’ group of patients ( n  = 28) was characterized by rapid disease progression as exemplified by an increased number of deaths (adjusted P  ≤ 5.04 × 10 −3 , 43% ( n  = 28), 95% confidence interval (CI): 23%, 62%); indication of multiple organ failure, such as elevated creatinine (adjusted P  ≤ 0.01, 52.74 ± 23.71 μmol L −1 ) or reduced albumin levels (adjusted P  ≤ 2.89 × 10 −4 , 33.40 ± 6.78 g L −1 ); and increased expression levels and peaks of inflammation markers, including PCT (adjusted P  ≤ 3.01 × 10 −2 , 1.42 ± 2.03 ng ml −1 ), whole blood cell count, neutrophils, lymphocytes, monocytes and lower platelet counts (adjusted P  ≤ 6.3 × 10 −2 , 159.30 ± 142.00 × 10 9 per liter) and changes in electrolyte levels—that is, lower potassium levels (adjusted P  ≤ 0.09 × 10 −2 , 3.14 ± 0.54 mmol L −1 ). 
Patients whom we associated with the term ‘severe pneumonia with co-infection’ ( n  = 74) were characterized by prolonged ICU stays (adjusted P  ≤ 3.59 × 10 −4 , 15.01 ± 29.24 d); organ involvement, such as higher levels of creatinine (adjusted P  ≤ 1.10 × 10 −4 , 52.74 ± 23.71 μmol L −1 ) and lower platelet count (adjusted P  ≤ 5.40 × 10 −23 , 159.30 ± 142.00 × 10 9 per liter); increased inflammation markers, such as peaks of PCT (adjusted P  ≤ 5.06 × 10 −5 , 1.42 ± 2.03 ng ml −1 ), CRP (adjusted P  ≤ 1.40 × 10 −6 , 50.60 ± 37.58 mg L −1 ) and neutrophils (adjusted P  ≤ 8.51 × 10 −6 , 13.01 ± 6.98 × 10 9 per liter); detection of bacteria in combination with additional fungal pathogens in sputum samples (adjusted P  ≤ 1.67 × 10 −2 , 26% ( n  = 74), 95% CI: 16%, 36%); and increased application of medication, including antifungals (adjusted P  ≤ 1.30 × 10 −4 , 15% ( n  = 74), 95% CI: 7%, 23%) and catecholamines (adjusted P  ≤ 2.0 × 10 −2 , 45% ( n  = 74), 95% CI: 33%, 56%). Patients in the ‘mild pneumonia’ group were characterized by positive sputum cultures in the presence of relatively lower inflammation markers, such as PCT (adjusted P  ≤ 1.63 × 10 −3 , 1.42 ± 2.03 ng ml −1 ) and CRP (adjusted P  ≤ 0.03 × 10 −1 , 50.60 ± 37.58 mg L −1 ), while receiving antibiotics more frequently (adjusted P  ≤ 1.00 × 10 −5 , 80% ( n  = 78), 95% CI: 70%, 89%) and additional medications (electrolytes, blood thinners and circulation-supporting medications) (adjusted P  ≤ 1.00 × 10 −5 , 82% ( n  = 78), 95% CI: 73%, 91%).
Finally, patients in the ‘viral pneumonia’ group were characterized by shorter LOSs (adjusted P  ≤ 8.00 × 10 −6 , 15.01 ± 29.24 d), a lack of non-viral pathogen detection in combination with higher lymphocyte counts (adjusted P  ≤ 0.01, 4.11 ± 2.49 × 10 9 per liter), lower levels of PCT (adjusted P  ≤ 0.03 × 10 −2 , 1.42 ± 2.03 ng ml −1 ) and reduced application of catecholamines (adjusted P  ≤ 5.96 × 10 −7 , 15% (n = 97), 95% CI: 8%, 23%), antibiotics (adjusted P  ≤ 8.53 × 10 −6 , 41% ( n  = 97), 95% CI: 31%, 51%) and antifungals (adjusted P  ≤ 5.96 × 10 −7 , 0% ( n  = 97), 95% CI: 0%, 0%).
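The per-group comparisons described above (t-tests for numerical features, g-tests for categorical features) can be reproduced outside of ehrapy with SciPy; the data below are synthetic stand-ins, not PIC values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative lab values: CRP (mg/L) for one cluster vs. the rest.
crp_group = rng.normal(80, 20, size=70)   # e.g. a 'severe pneumonia' cluster
crp_rest = rng.normal(45, 20, size=200)   # all other patients

# Numerical feature: two-sided Welch's t-test between group and rest.
t_stat, p_num = stats.ttest_ind(crp_group, crp_rest, equal_var=False)

# Categorical feature: g-test (log-likelihood ratio chi-squared) on a
# 2x2 contingency table, e.g. fungal detection positive/negative per group.
table = np.array([[19, 51],    # group: positive, negative
                  [12, 188]])  # rest:  positive, negative
g_stat, p_cat, dof, _ = stats.chi2_contingency(table, lambda_="log-likelihood")

print(p_num < 0.05, p_cat < 0.05)
```

In practice the resulting P values would additionally be adjusted for multiple testing, as done for the adjusted P values reported above.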

To demonstrate the ability of ehrapy to examine EHR data from different levels of resolution, we additionally reconstructed a case from the ‘severe pneumonia with co-infection’ group (Fig. 2g ). In this case, the analysis revealed that CRP levels remained elevated despite broad-spectrum antibiotic treatment until a positive Acinetobacter baumannii result led to a change in medication and a subsequent decrease in CRP and monocyte levels.

ehrapy facilitates extraction of pneumonia indicators

ehrapy’s survival analysis module allowed us to identify, through Kaplan–Meier analysis, clinical indicators of disease stages that could serve as biomarkers. We found strong variance in overall aspartate aminotransferase (AST), alanine aminotransferase (ALT), gamma-glutamyl transferase (GGT) and bilirubin levels (Fig. 3a), including changes over time (Extended Data Fig. 6a,b), in all four ‘unspecified pneumonia’ groups. These enzymes are routinely used to assess liver function, and studies provide evidence that AST, ALT and GGT levels are elevated during respiratory infections 63 , including severe pneumonia 64 , and can guide diagnosis and management of pneumonia in children 63 . We confirmed reduced survival in more severely affected children (‘sepsis-like pneumonia’ and ‘severe pneumonia with co-infection’) using Kaplan–Meier curves and a multivariate log-rank test (Fig. 3b; P  ≤ 1.09 × 10 −18 ) through ehrapy. To verify the association of this trajectory with altered AST, ALT and GGT expression levels, we further grouped all patients based on liver enzyme reference ranges ( Methods and Supplementary Table 2 ). By Kaplan–Meier survival analysis, cases with peaks of GGT ( P  ≤ 1.4 × 10 −2 , 58.01 ± 2.03 U L −1 ), ALT ( P  ≤ 2.9 × 10 −2 , 43.59 ± 38.02 U L −1 ) and AST ( P  ≤ 4.8 × 10 −4 , 78.69 ± 60.03 U L −1 ) outside the norm range were found to correlate with lower survival in all groups (Fig. 3c and Extended Data Fig. 6), in line with previous studies 63 , 65 . Bilirubin was not found to significantly affect survival ( P  ≤ 2.1 × 10 −1 , 12.57 ± 21.22 mg dl −1 ).
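The Kaplan–Meier estimates underlying this analysis follow the standard product-limit formula, which can be sketched in a few lines of NumPy (a minimal illustration on synthetic data, not ehrapy's implementation):

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit survival estimate S(t) at each distinct event time.

    time:  follow-up time per patient (e.g. days)
    event: 1 if death was observed, 0 if the patient was censored
    """
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    times = np.unique(time[event == 1])  # distinct observed event times
    surv = []
    s = 1.0
    for t in times:
        at_risk = np.sum(time >= t)                 # still under observation at t
        deaths = np.sum((time == t) & (event == 1))
        s *= 1.0 - deaths / at_risk                 # product-limit update
        surv.append(s)
    return times, np.array(surv)

# Synthetic example: 6 patients, deaths observed at days 5 and 10.
times, surv = kaplan_meier([5, 8, 10, 12, 15, 20], [1, 0, 1, 0, 0, 0])
print(times, surv)  # survival drops at days 5 and 10
```

Group-wise curves (for example, per pneumonia phenotype or per liver enzyme stratum) follow by applying the estimator to each subgroup and comparing them with a log-rank test.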

Figure 3

a , Line plots of major hepatic system laboratory measurements per group show variance in the measurements per pneumonia group. b , Kaplan–Meier survival curves demonstrate lower survival for ‘sepsis-like’ and ‘severe pneumonia with co-infection’ groups. c , Kaplan–Meier survival curves for children with GGT measurements outside the norm range display lower survival.

ehrapy quantifies medication class effect on LOS

Pneumonia requires case-specific medications due to its diverse causes. To demonstrate the potential of ehrapy’s causal inference module, we quantified the effect of medication on ICU LOS to evaluate case-specific administration of medication. In contrast to causal discovery, which attempts to find a causal graph reflecting the causal relationships, causal inference is a statistical process used to investigate possible effects when altering a provided system, as represented by a causal graph and observational data (Fig. 4a) 66 . This approach makes it possible to identify and quantify the impact of specific interventions or treatments on outcome measures, thereby providing insight for evidence-based decision-making in healthcare. Causal inference relies on datasets incorporating interventions to accurately quantify effects.
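As a self-contained illustration of the underlying idea (not ehrapy's causal module, which builds on ‘dowhy’), the effect of a treatment on LOS under confounding by disease severity can be estimated by backdoor adjustment, here with a plain linear regression on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Confounder: a disease severity proxy (e.g. an inflammation marker level).
severity = rng.normal(0, 1, n)
# Sicker patients are more likely to receive the medication.
treated = (severity + rng.normal(0, 1, n) > 0).astype(float)
# Ground truth (known only because data are synthetic): treatment shortens
# LOS by 2 days; each unit of severity adds 3 days.
los = 10 + 3 * severity - 2 * treated + rng.normal(0, 1, n)

# Naive comparison is confounded: treated patients are sicker on average.
naive = los[treated == 1].mean() - los[treated == 0].mean()

# Backdoor adjustment: regress LOS on treatment *and* the confounder; the
# treatment coefficient then estimates the causal effect.
X = np.column_stack([np.ones(n), treated, severity])
coef, *_ = np.linalg.lstsq(X, los, rcond=None)
adjusted = coef[1]

print(round(naive, 2), round(adjusted, 2))  # naive is biased upward; adjusted is near -2
```

The causal graph determines which variables must be adjusted for; refutation tests (placebo treatment, random common cause) then probe the robustness of the estimate.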

Figure 4

a , ehrapy’s causal module is based on the strategy of the tool ‘dowhy’. Here, EHR data containing treatment, outcome and measurements and a causal graph serve as input for causal effect quantification. The process includes the identification of the target estimand based on the causal graph, the estimation of causal effects using various models and, finally, refutation where sensitivity analyses and refutation tests are performed to assess the robustness of the results and assumptions. b , Curated causal graph using age, liver damage and inflammation markers as disease progression proxies together with medications as interventions to assess the causal effect on length of ICU stay. c , Determined causal effect strength on LOS in days of administered medication categories.

We manually constructed a minimal causal graph with ehrapy (Fig. 4b ) on records of treatment with corticosteroids, carbapenems, penicillins, cephalosporins and antifungal and antiviral medications as interventions (Extended Data Fig. 7 and Methods ). We assumed that the medications affect disease progression proxies, such as inflammation markers and markers of organ function. The selection of ‘interventions’ is consistent with current treatment standards for bacterial pneumonia and respiratory distress 67 , 68 . Based on the approach of the tool ‘dowhy’ 69 (Fig. 4a ), ehrapy’s causal module identified the application of corticosteroids, antivirals and carbapenems to be associated with shorter LOSs, in line with current evidence 61 , 70 , 71 , 72 . In contrast, penicillins and cephalosporins were associated with longer LOSs, whereas antifungal medication did not strongly influence LOS (Fig. 4c ).

ehrapy enables deriving population-scale risk factors

To illustrate the advantages of using a unified data management and quality control framework, such as ehrapy, we modeled myocardial infarction risk using Cox proportional hazards models on UKB 44 data. Large population cohort studies, such as the UKB, enable the investigation of common diseases across a wide range of modalities, including genomics, metabolomics, proteomics, imaging data and common clinical variables (Fig. 5a,b). From these, we used a publicly available polygenic risk score for coronary heart disease 73 comprising 6.6 million variants, 80 nuclear magnetic resonance (NMR) spectroscopy-based metabolomics 74 features, 81 features derived from retinal optical coherence tomography 75 , 76 and the Framingham Risk Score 77 feature set, which includes known clinical predictors, such as age, sex, body mass index, blood pressure, smoking behavior and cholesterol levels. We excluded features with more than 10% missingness and imputed the remaining missing values ( Methods ). Furthermore, individuals with events up to 1 year after the sampling time were excluded from the analyses, ultimately selecting 29,216 individuals for whom all mentioned data types were available (Extended Data Figs. 8 and 9 and Methods ). Myocardial infarction, as defined by our mapping to the phecode nomenclature 51 , served as the endpoint (Fig. 5c). We modeled the risk for myocardial infarction 1 year after either the metabolomic sample was obtained or imaging was performed.

Figure 5

a , The UKB includes 502,359 participants from 22 assessment centers. Most participants have genetic data (97%) and physical measurement data (93%), but fewer have data for complex measures, such as metabolomics, retinal imaging or proteomics. b , We found a distinct cluster of individuals (bottom right) from the Birmingham assessment center in the retinal imaging data, which is an artifact of the image acquisition process and was, thus, excluded. c , Myocardial infarctions are recorded for 15% of the male and 7% of the female study population. Kaplan–Meier estimators with 95% CIs are shown. d , For every modality combination, a linear Cox proportional hazards model was fit to determine the prognostic potential of these for myocardial infarction. Cardiovascular risk factors show expected positive log hazard ratios (log (HRs)) for increased blood pressure or total cholesterol and negative ones for sampling age and systolic blood pressure (BP). log (HRs) with 95% CIs are shown. e , Combining all features yields a C-index of 0.81. c – e , Error bars indicate 95% CIs ( n  = 29,216).

Predictive performance for each modality was assessed by fitting Cox proportional hazards models (Fig. 5c) on each of the feature sets using ehrapy (Fig. 5d). The age of the first occurrence served as the time to event; alternatively, date of death or date of the last record in the EHR served as censoring times. Models were evaluated using the concordance index (C-index) ( Methods ). Combining multiple modalities successfully improved the predictive performance for coronary heart disease, increasing the C-index from 0.63 (genetics) to 0.76 (genetics, age and sex), 0.77 (clinical predictors) and 0.81 (imaging and clinical predictors) (Fig. 5e). Our finding is in line with previous observations of complementary effects between different modalities, where a broader ‘major adverse cardiac event’ phenotype was modeled in the UKB achieving a C-index of 0.72 (ref. 78 ). Adding genetic data improves predictive potential because it is independent of sampling age and is only weakly captured by the other modalities 79 . The addition of metabolomic data did not improve predictive power (Fig. 5e).
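The concordance index used for model evaluation measures how well predicted risk scores order observed survival times under right-censoring; a minimal sketch (not ehrapy's implementation):

```python
import numpy as np

def concordance_index(time, event, risk):
    """Fraction of comparable patient pairs that the risk score orders correctly.

    A pair (i, j) is comparable if patient i had an observed event and patient j
    survived longer; a higher risk for i then counts as concordant. Ties in risk
    count as 0.5. Returns a value in [0, 1]; 0.5 corresponds to random ordering.
    """
    time, event, risk = map(np.asarray, (time, event, risk))
    num, den = 0.0, 0.0
    for i in range(len(time)):
        if not event[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(len(time)):
            if time[j] > time[i]:  # patient j outlived patient i
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

# Perfectly ordered risks (higher risk, earlier event) yield a C-index of 1.0.
c = concordance_index(time=[2, 4, 6, 8], event=[1, 1, 0, 1], risk=[4, 3, 2, 1])
print(c)
```

For a fitted Cox model, `risk` would be the linear predictor per individual; the quadratic loop is fine for illustration, while production implementations use sorted, vectorized variants.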

Imaging-based disease severity projection via fate mapping

To demonstrate ehrapy’s ability to handle diverse image data and recover disease stages, we embedded pulmonary imaging data obtained from patients with COVID-19 into a lower-dimensional space and computationally inferred disease progression trajectories using pseudotemporal ordering, which describes a continuous trajectory or ordering of individual points based on feature similarity 80 . Continuous trajectories enable mapping the fate of new patients onto precise states to potentially predict their future condition.

In COVID-19, a highly contagious respiratory illness caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), symptoms range from mild flu-like symptoms to severe respiratory distress. Chest x-rays typically show opacities (bilateral patchy, ground glass) associated with disease severity 81 .

We used COVID-19 chest x-ray images from the BrixIA 82 dataset consisting of 192 images (Fig. 6a ) with expert annotations of disease severity. We used the BrixIA database scores, which are based on six regions annotated by radiologists, to classify disease severity ( Methods ). We embedded raw image features using a pre-trained DenseNet model ( Methods ) and further processed this embedding into a nearest-neighbors-based UMAP space using ehrapy (Fig. 6b and Methods ). Fate mapping based on imaging information ( Methods ) determined a severity ordering from mild to critical cases (Fig. 6b–d ). Images labeled as ‘normal’ are projected to stay within the healthy group, illustrating the robustness of our approach. Images of diseased patients were ordered by disease severity, highlighting clear trajectories from ‘normal’ to ‘critical’ states despite the heterogeneity of the x-ray images stemming from, for example, different zoom levels (Fig. 6a ).
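A minimal version of such pseudotemporal ordering builds a k-nearest-neighbor graph on the embedding and uses the geodesic (graph shortest-path) distance from a chosen root as pseudotime; the sketch below substitutes a synthetic one-dimensional severity curve for the DenseNet image features:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(2)

# Stand-in for image embeddings: 60 points along a 1-D 'disease severity'
# curve in feature space (in the paper, DenseNet features of chest x-rays).
severity = np.linspace(0, 10, 60)
emb = np.column_stack([severity, np.sin(severity)]) + rng.normal(0, 0.05, (60, 2))

# kNN graph on the embedding, then geodesic distances between all points.
graph = kneighbors_graph(emb, n_neighbors=5, mode="distance")
dist = shortest_path(graph, directed=False)

root = 0                          # a 'normal' image chosen as trajectory root
pseudotime = dist[root]
pseudotime = pseudotime / pseudotime.max()  # normalize to [0, 1]

# Pseudotime should track the underlying severity ordering.
corr = np.corrcoef(pseudotime, severity)[0, 1]
print(round(corr, 3))  # close to 1: the ordering is recovered
```

The same recipe applies to real embeddings; the paper's fate mapping additionally projects a vector field of the ordering into UMAP space for visualization.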

Figure 6

a , Randomly selected chest x-ray images from the BrixIA dataset demonstrate its variance. b , UMAP visualization of the BrixIA dataset embedding shows a separation of disease severity classes. c , Calculated pseudotime for all images increases with distance to the ‘normal’ images. d , Stream projection of fate mapping in UMAP space showcases disease severity trajectory of the COVID-19 chest x-ray images.

Detecting and mitigating biases in EHR data with ehrapy

To showcase how exploratory analysis using ehrapy can reveal and mitigate biases, we analyzed the Fairlearn 83 version of the Diabetes 130-US Hospitals 84 dataset. The dataset covers 10 years (1999–2008) of clinical records from 130 US hospitals, detailing 47 features of diabetes diagnoses, laboratory tests, medications and additional data from up to 14 d of inpatient care of 101,766 diagnosed patient visits ( Methods ). It was originally collected to explore the link between the measurement of hemoglobin A1c (HbA1c) and early readmission.

The cohort primarily consists of White and African American individuals, with only a minority of cases from Asian or Hispanic backgrounds (Extended Data Fig. 10a ). ehrapy’s cohort tracker unveiled selection and surveillance biases when filtering for Medicare recipients for further analysis, resulting in a shift of age distribution toward an age of over 60 years in addition to an increasing ratio of White participants. Using ehrapy’s visualization modules, our analysis showed that HbA1c was measured in only 18.4% of inpatients, with a higher frequency in emergency admissions compared to referral cases (Extended Data Fig. 10b ). Normalization biases can skew data relationships when standardization techniques ignore subgroup variability or assume incorrect distributions. The choice of normalization strategy must be carefully considered to avoid obscuring important factors. When normalizing the number of applied medications individually, differences in distributions between age groups remained. However, when normalizing both distributions jointly with age group as an additional group variable, differences between age groups were masked (Extended Data Fig. 10c ). To investigate missing data and imputation biases, we introduced missingness for the number of applied medications according to an MCAR mechanism, which we verified using ehrapy’s Little’s test ( P  ≤ 0.01 × 10 −2 ), and an MAR mechanism ( Methods ). Whereas imputing the mean in the MCAR case did not affect the overall location of the distribution, it led to an underestimation of the variance, with the standard deviation dropping from 8.1 in the original data to 6.8 in the imputed data (Extended Data Fig. 10d ). Mean imputation in the MAR case skewed both location and variance of the mean from 16.02 to 14.66, with a standard deviation of only 5.72 (Extended Data Fig. 10d ). 
Applying ehrapy’s multiple imputation via MissForest 85 to the MAR data resulted in a mean of 16.04 and a standard deviation of 6.45. To predict patient readmission in fewer than 30 d, we merged the three smallest race groups, ‘Asian’, ‘Hispanic’ and ‘Other’. Furthermore, we dropped the gender group ‘Unknown/Invalid’ owing to its small sample size, which made meaningful assessment impossible, and we performed balanced random undersampling, resulting in 5,677 cases from each condition. We observed an overall balanced accuracy of 0.59 using a logistic regression model. However, the false-negative rate was highest for the races ‘Other’ and ‘Unknown’, whereas their selection rate was lowest; this model was, therefore, biased (Extended Data Fig. 10e). Using ehrapy’s compatibility with existing machine learning packages, we applied Fairlearn’s ThresholdOptimizer ( Methods ), which improved the selection rates for ‘Other’ from 0.32 to 0.38 and for ‘Unknown’ from 0.23 to 0.42 and the false-negative rates for ‘Other’ from 0.48 to 0.42 and for ‘Unknown’ from 0.61 to 0.45 (Extended Data Fig. 10e).
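The variance shrinkage caused by single mean imputation, as observed above, is easy to reproduce on synthetic MCAR data (values below are illustrative, not the dataset's):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 'number of applied medications' feature, roughly matching the text.
meds = rng.normal(16, 8, size=5000)

# MCAR mechanism: knock out 30% of values completely at random.
mask = rng.random(meds.size) < 0.3
observed = meds.copy()
observed[mask] = np.nan

# Single mean imputation: the location is preserved under MCAR, but the
# variance is deflated because every imputed value sits exactly at the mean.
imputed = np.where(np.isnan(observed), np.nanmean(observed), observed)

print(round(meds.std(), 2), round(imputed.std(), 2))  # imputed std is smaller
```

With a fraction p of values imputed at the mean, the standard deviation shrinks by roughly a factor of sqrt(1 − p), which is why multiple imputation methods such as MissForest are preferred when downstream analyses depend on variability.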

Clustering offers a hypothesis-free alternative to supervised classification when clear hypotheses or labels are missing. It has enabled the identification of heart failure subtypes 86 , progression pathways 87 and COVID-19 severity states 88 . This concept, which is central to ehrapy, further allowed us to identify fine-grained groups of ‘unspecified pneumonia’ cases in the PIC dataset while discovering biomarkers and quantifying effects of medications on LOS. Such retroactive characterization showcases ehrapy’s ability to put complex evidence into context. This approach supports feedback loops to improve diagnostic and therapeutic strategies, leading to more efficiently allocated resources in healthcare.

ehrapy’s flexible data structures enabled us to integrate the heterogeneous UKB data to predict myocardial infarction risk. The different data types and distributions posed a challenge for predictive models that was overcome with ehrapy’s pre-processing modules. Our analysis underscores the potential of combining phenotypic and health data at population scale through ehrapy to enhance risk prediction.

By adapting pseudotime approaches that are commonly used in other omics domains, we successfully recovered disease trajectories from raw imaging data with ehrapy. The determined pseudotime, however, only orders data but does not necessarily provide a future projection per patient. Understanding the driver features for fate mapping in image-based datasets is challenging. The incorporation of image segmentation approaches could mitigate this issue and provide a deeper insight into the spatial and temporal dynamics of disease-related processes.

Limitations of our analyses include the lack of control for informative missingness where the absence of information represents information in itself 89 . Translation from Chinese to English in the PIC database can cause information loss and inaccuracies because the Chinese ICD-10 codes are seven characters long compared to the five-character English codes. Incompleteness of databases, such as the lack of radiology images in the PIC database, low sample sizes, underrepresentation of non-White ancestries and participant self-selection, cannot be accounted for and limit generalizability. This restricts deeper phenotyping of, for example, all ‘unspecified pneumonia’ cases with respect to their survival, which could be overcome by the use of multiple databases. Our causal inference use case is limited by unrecorded variables, such as Sequential Organ Failure Assessment (SOFA) scores, and pneumonia-related pathogens that are missing in the causal graph due to dataset constraints, such as high sparsity and substantial missing data, which risk overfitting and can lead to overinterpretation. We counterbalanced this by employing several refutation methods that statistically reject the causal hypothesis, such as a placebo treatment, a random common cause or an unobserved common cause. The longer hospital stays associated with penicillins and cephalosporins may be dataset specific and stem from higher antibiotic resistance, their use as first-line treatments, more severe initial cases, comorbidities and hospital-specific protocols.

Most analysis steps can introduce algorithmic biases where results are misleading or unfavorably affect specific groups. This is particularly relevant in the context of missing data 22 , where determining the type of missingness is necessary to handle it correctly. ehrapy includes an implementation of Little’s test 90 , which tests whether data are distributed MCAR, to discern missing data types. For MCAR data, single-imputation approaches, such as mean, median or mode imputation, can suffice, but these methods are known to reduce variability 91 , 92 . Multiple imputation strategies, such as Multiple Imputation by Chained Equations (MICE) 93 and MissForest 85 , as implemented in ehrapy, are effective for both MCAR and MAR data 22 , 94 , 95 . MNAR data require pattern-mixture or shared-parameter models that explicitly incorporate the mechanism by which data are missing 96 . Because MNAR involves unobserved data, the assumptions about the missingness mechanism cannot be directly verified, making sensitivity analysis crucial 21 . ehrapy’s wide range of normalization functions and grouping functionality makes it possible to account for intrinsic variability within subgroups, and its compatibility with Fairlearn 83 can potentially mitigate predictor biases. Generally, we recommend assessing all pre-processing steps in an iterative manner with respect to downstream applications, such as patient stratification. Moreover, sensitivity analysis can help verify the robustness of all inferred knowledge 97 .

These diverse use cases illustrate ehrapy’s potential to address the need for a computationally efficient, extendable, reproducible and easy-to-use framework. ehrapy is compatible with major standards, such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) 47 , HL7 FHIR or openEHR, with flexible support for common tabular data formats. Once data are loaded into an AnnData object, sharing analysis results is straightforward because AnnData objects can be stored and read platform independently. ehrapy’s rich documentation of the application programming interface (API) and extensive hands-on tutorials make EHR analysis accessible to both novices and experienced analysts.

As ehrapy remains under active development, users can expect ehrapy to continuously evolve. We are improving support for the joint analysis of EHR, genetics and molecular data where ehrapy serves as a bridge between the EHR and the omics communities. We further anticipate the generation of EHR-specific reference datasets, so-called atlases 98 , to enable query-to-reference mapping where new datasets get contextualized by transferring annotations from the reference to the new dataset. To promote the sharing and collective analysis of EHR data, we envision adapted versions of interactive single-cell data explorers, such as CELLxGENE 99 or the UCSC Cell Browser 100 , for EHR data. Such web interfaces would also include disparity dashboards 20 to unveil trends of preferential outcomes for distinct patient groups. Additional modules specifically for high-frequency time-series data, natural language processing and other data types are currently under development. With the widespread availability of code-generating large language models, frameworks such as ehrapy are becoming accessible to medical professionals without coding expertise who can leverage its analytical power directly. Therefore, ehrapy, together with a lively ecosystem of packages, has the potential to enhance the scientific discovery pipeline to shape the era of EHR analysis.

All datasets used during the development of ehrapy and in the use cases were accessed according to their terms of use as indicated by each provider.

Design and implementation of ehrapy

A unified pipeline as provided by our ehrapy framework streamlines the analysis of EHR data by providing an efficient, standardized approach, which reduces the complexity and variability in data pre-processing and analysis. This consistency ensures reproducibility of results and facilitates collaboration and sharing within the research community. Additionally, the modular structure allows for easy extension and customization, enabling researchers to adapt the pipeline to their specific needs while building on a solid foundational framework.

ehrapy was designed from the ground up as an open-source effort with community support. The package, as well as all associated tutorials and dataset preparation scripts, is open source. Development takes place publicly on GitHub, where the developers discuss feature requests and issues directly with users. This tight interaction between both groups ensures that we implement the most pressing needs to cater to the most important use cases and can guide users when difficulties arise. The open-source nature, extensive documentation and modular structure of ehrapy are designed for other developers to build upon and extend ehrapy’s functionality where necessary. This allows us to focus ehrapy on the most important features while keeping the number of dependencies to a minimum.

ehrapy was implemented in the Python programming language and builds upon numerous existing numerical and scientific open-source libraries, specifically matplotlib 101 , seaborn 102 , NumPy 103 , numba 104 , SciPy 105 , scikit-learn 53 and pandas 106 . Although taking considerable advantage of all packages implemented, ehrapy also shares their limitations, such as a lack of GPU support or small performance losses caused by the translation layer between the Python interpreter and the lower-level C implementations of matrix operations. However, by building on very widely used open-source software, we ensure seamless integration and compatibility with a broad range of tools and platforms to promote community contributions. Additionally, by doing so, we enhance security by allowing a larger pool of developers to identify and address vulnerabilities 107 . All functions are grouped into task-specific modules whose implementation is complemented with additional dependencies.

Data preparation

Dataloaders.

ehrapy is compatible with any type of vectorized data, where vectorized refers to the data being stored in structured tables in either on-disk or database form. The input and output module of ehrapy provides readers for common formats, such as OMOP, CSV tables or SQL databases through Pandas. When reading in such datasets, the data are stored in the appropriate slots in a new AnnData 46 object. ehrapy’s data module provides access to more than 20 public EHR datasets that feature diseases, including, but not limited to, Parkinson’s disease, breast cancer, chronic kidney disease and more. All dataloaders return AnnData objects to allow for immediate analysis.

AnnData for EHR data

Our framework required a versatile data structure capable of handling various matrix formats, including Numpy 103 for general use cases and interoperability, Scipy 105 sparse matrices for efficient storage, Dask 108 matrices for larger-than-memory analysis and Awkward array 109 for irregular time-series data. We needed a single data structure that not only stores data but also includes comprehensive annotations for thorough contextual analysis. It was essential for this structure to be widely used and supported, which ensures robustness and continual updates. Interoperability with other analytical packages was a key criterion to facilitate seamless integration within existing tools and workflows. Finally, the data structure had to support both in-memory operations and on-disk storage using formats such as HDF5 (ref. 110 ) and Zarr 111 , ensuring efficient handling and accessibility of large datasets and the ability to easily share them with collaborators.

All of these requirements are fulfilled by the AnnData format, which is a popular data structure in single-cell genomics. At its core, an AnnData object encapsulates diverse components, providing a holistic representation of data and metadata that are always aligned in dimensions and easily accessible. A data matrix (commonly referred to as ‘ X ’) stands as the foundational element, embodying the measured data. This matrix can be dense (as NumPy array), sparse (as SciPy sparse matrix) or ragged (as Awkward array) where dimensions do not align within the data matrix. The AnnData object can feature several such data matrices stored in ‘layers’. Examples of such layers can be unnormalized or unencoded data. These data matrices are complemented by an observations (commonly referred to as ‘obs’) segment where annotations on the level of patients or visits are stored. Patients’ age or sex, for instance, are often used as such annotations. The variables (commonly referred to as ‘var’) section complements the observations, offering supplementary details about the features in the dataset, such as missing data rates. The observation-specific matrices (commonly referred to as ‘obsm’) section extends the capabilities of the AnnData structure by allowing the incorporation of matrices aligned to the observations. These matrices can represent various types of information per observation, such as principal component analysis (PCA) results, t-distributed stochastic neighbor embedding (t-SNE) coordinates or other dimensionality reduction outputs. Analogously, AnnData features a variable-specific matrices (commonly referred to as ‘varm’) component. The observation-specific pairwise relationships (commonly referred to as ‘obsp’) segment complements the ‘obsm’ section by accommodating observation-specific pairwise relationships. This can include connectivity matrices, indicating relationships between patients.
The inclusion of an unstructured annotations (commonly referred to as ‘uns’) component further enhances flexibility. This segment accommodates unstructured annotations or arbitrary data that might not conform to the structured observations or variables categories. Any AnnData object can be stored on disk in h5ad or Zarr format to facilitate data exchange.
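To make the alignment rules above concrete, the described layout can be mimicked with plain NumPy arrays and pandas DataFrames (a toy stand-in illustrating the conventions, not the actual AnnData implementation):

```python
import numpy as np
import pandas as pd

# X: one row per patient (observation), one column per feature (variable).
X = np.array([[4.1, 120.0],
              [7.9,  98.5],
              [5.5, 110.2]])

# obs: per-patient annotations, aligned with the rows of X.
obs = pd.DataFrame({"age_months": [38, 54, 12], "sex": ["M", "F", "M"]},
                   index=["patient_1", "patient_2", "patient_3"])

# var: per-feature annotations, aligned with the columns of X.
var = pd.DataFrame({"missing_rate": [0.12, 0.24]},
                   index=["lymphocytes", "crp"])

# obsm: observation-level matrices, e.g. a 2-D UMAP embedding per patient.
obsm = {"X_umap": np.zeros((X.shape[0], 2))}

# The core AnnData invariant: all components agree in their dimensions.
assert X.shape == (len(obs), len(var))
assert obsm["X_umap"].shape[0] == len(obs)
print(X.shape)
```

AnnData enforces exactly this alignment automatically (and adds layers, varm, obsp, uns and h5ad/Zarr serialization on top), which is what makes subsetting and sharing safe.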

ehrapy natively interfaces with the scientific Python ecosystem via Pandas 112 and Numpy 103 . The development of deep learning models for EHR data 113 is further accelerated through compatibility with pathml 114 , a unified framework for whole-slide image analysis in pathology, and scvi-tools 115 , which provides data loaders for loading tensors from AnnData objects into PyTorch 116 or Jax arrays 117 to facilitate the development of generalizing foundational models for medical artificial intelligence 118 .

Feature annotation

After AnnData creation, any metadata can be mapped against ontologies using Bionty ( https://github.com/laminlabs/bionty-base ). Bionty provides access to the Human Phenotype, Phecodes, Phenotype and Trait, Drug, Mondo and Human Disease ontologies.

Key medical terms stored in an AnnData object in free text can be extracted using the Medical Concept Annotation Toolkit (MedCAT) 119 .

Data processing

Cohort tracking.

ehrapy provides a CohortTracker tool that traces all filtering steps applied to an associated AnnData object. To calculate cohort summary statistics, the implementation makes use of tableone 120 and can subsequently be plotted as bar charts together with flow diagrams 121 that visualize the order and reasoning of filtering operations.

Basic pre-processing and quality control

ehrapy encompasses a suite of functionalities for fundamental data processing that are adopted from scanpy 52 but adapted to EHR data:

Regress out: To address unwanted sources of variation, a regression procedure is integrated, enhancing the dataset’s robustness.

Subsample: Selects a specified fraction of observations.

Balanced sample: Balances groups in the dataset by random oversampling or undersampling.

Highly variable features: The identification and annotation of highly variable features following the ‘highly variable genes’ function of scanpy is seamlessly incorporated, providing users with insights into pivotal elements influencing the dataset.

To identify and minimize quality issues, ehrapy provides several quality control functions:

Basic quality control: Determines the relative and absolute number of missing values per feature and per patient.

Winsorization: For data refinement, ehrapy implements a winsorization process, creating a version of the input array less susceptible to extreme values.

Feature clipping: Clips feature values at user-defined limits to curb the influence of extreme values.

Detect biases: Computes pairwise correlations between features, standardized mean differences for numeric features between groups of sensitive features, categorical feature value count differences between groups of sensitive features and feature importances when predicting a target variable.

Little’s MCAR test: Applies Little’s test, whose null hypothesis is that data are missing completely at random (MCAR). Rejecting the null hypothesis does not necessarily mean that data are not MCAR, nor is accepting the null hypothesis a guarantee that data are MCAR. For more details, see Schouten et al. 122 .

Summarize features: Calculates statistical indicators per feature, including minimum, maximum and average values. This can be especially useful to reduce complex data with multiple measurements per feature per patient into sets of columns with single values.
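To make the first two of these quality control steps concrete, the sketch below computes per-feature and per-patient missingness rates and a percentile-based winsorization in plain NumPy. This is only an illustration of the idea, not ehrapy’s implementation, and the data matrix and percentile limits are invented:

```python
import numpy as np

# Toy patient-by-feature matrix; NaN marks missing values (values are invented).
X = np.array([
    [1.0, np.nan, 200.0],
    [2.0, 5.0, 180.0],
    [np.nan, 6.0, 5000.0],  # 5000.0: an implausible extreme value
    [3.0, 7.0, 190.0],
])

# Relative number of missing values per feature (column) and per patient (row).
missing_per_feature = np.isnan(X).mean(axis=0)
missing_per_patient = np.isnan(X).mean(axis=1)

# Winsorization: clip each feature to its 5th-95th percentile range so that
# single extreme values no longer dominate the feature's distribution.
def winsorize(col, lower=5, upper=95):
    lo, hi = np.nanpercentile(col, [lower, upper])
    return np.clip(col, lo, hi)

X_wins = np.apply_along_axis(winsorize, 0, X)
```

Note that `np.clip` leaves NaN entries untouched, so missing values survive winsorization and remain visible to downstream imputation.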

Imputation is crucial in data analysis to address missing values, ensuring the completeness of datasets that can be required for specific algorithms. The ‘ehrapy’ pre-processing module offers a range of imputation techniques:

Explicit Impute: Replaces missing values, in either all columns or a user-specified subset, with a designated replacement value.

Simple Impute: Imputes missing values in numerical data using mean, median or the most frequent value, contributing to a more complete dataset.

KNN Impute: Uses k -nearest neighbor imputation to fill in missing values in the input AnnData object, preserving local data patterns.

MissForest Impute: Implements the MissForest strategy for imputing missing data, providing a robust approach for handling complex datasets.

MICE Impute: Applies the MICE algorithm for imputing data. This implementation is based on the miceforest ( https://github.com/AnotherSamWilson/miceforest ) package.
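The simplest of these strategies can be sketched in a few lines of NumPy. This mirrors the idea behind Simple Impute rather than ehrapy’s actual code, and the matrix is invented:

```python
import numpy as np

X = np.array([
    [1.0, np.nan],
    [2.0, 10.0],
    [np.nan, 14.0],
    [3.0, 12.0],
])

def simple_impute(X, strategy="mean"):
    """Column-wise mean/median imputation, analogous in spirit to Simple Impute."""
    X = X.copy()
    fill = np.nanmean(X, axis=0) if strategy == "mean" else np.nanmedian(X, axis=0)
    # Replace each NaN with the fill value of its column.
    idx = np.where(np.isnan(X))
    X[idx] = np.take(fill, idx[1])
    return X

X_imputed = simple_impute(X)
```

KNN and MissForest imputation refine this idea by deriving the fill value from similar patients rather than from the whole column.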

Data encoding is required when the dataset contains categorical values, because most algorithms in ehrapy are compatible only with numerical values. ehrapy offers two encoding algorithms based on scikit-learn 53 :

One-Hot Encoding: Transforms categorical variables into binary vectors, creating a binary feature for each category and capturing the presence or absence of each category in a concise representation.

Label Encoding: Assigns a unique numerical label to each category, facilitating the representation of categorical data as ordinal values and supporting algorithms that require numerical input.
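Both encodings can be illustrated with pandas (ehrapy itself wraps scikit-learn; the column names and values here are invented):

```python
import pandas as pd

df = pd.DataFrame({"smoker": ["yes", "no", "yes"], "ward": ["ICU", "ER", "ICU"]})

# One-hot encoding: one binary indicator column per category.
one_hot = pd.get_dummies(df["ward"], prefix="ward")

# Label encoding: one integer code per category, in order of appearance.
codes, categories = pd.factorize(df["ward"])
```

One-hot encoding avoids imposing an artificial order on categories, whereas label encoding is more compact but implies ordinality, so the choice depends on the downstream algorithm.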

To ensure that the distributions of the heterogeneous data are aligned, ehrapy offers several normalization procedures:

Log Normalization: Applies the natural logarithm function to the data, useful for handling skewed distributions and reducing the impact of outliers.

Max-Abs Normalization: Scales each feature by its maximum absolute value, ensuring that the maximum absolute value for each feature is 1.

Min-Max Normalization: Transforms the data to a specific range (commonly (0, 1)) by scaling each feature based on its minimum and maximum values.

Power Transformation Normalization: Applies a power transformation to make the data more Gaussian like, often useful for stabilizing variance and improving the performance of models sensitive to distributional assumptions.

Quantile Normalization: Aligns the distributions of multiple variables, ensuring that their quantiles match, which can be beneficial for comparing datasets or removing batch effects.

Robust Scaling Normalization: Scales data using the interquartile range, making it robust to outliers and suitable for datasets with extreme values.

Scaling Normalization: Standardizes data by subtracting the mean and dividing by the standard deviation, creating a distribution with a mean of 0 and a standard deviation of 1.

Offset to Positive Values: Shifts all values by a constant offset to make all values non-negative, with the lowest negative value becoming 0.
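Three of these normalizations are simple enough to sketch directly in NumPy. This is illustrative only, with an invented feature vector; ehrapy’s own functions additionally handle the AnnData bookkeeping:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Min-Max normalization: rescale to the (0, 1) range.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Scaling normalization (z-score): mean 0, standard deviation 1.
x_scaled = (x - x.mean()) / x.std()

# Max-Abs normalization: the maximum absolute value becomes 1.
x_maxabs = x / np.abs(x).max()
```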

Dataset shifts can be corrected using the scanpy implementation of the ComBat 123 algorithm, which employs a parametric and non-parametric empirical Bayes framework for adjusting data for batch effects that is robust to outliers.

Finally, a neighbors graph can be efficiently computed using scanpy’s implementation.

To obtain meaningful lower-dimensional embeddings that can subsequently be visualized and reused for downstream algorithms, ehrapy provides the following algorithms based on scanpy’s implementation:

t-SNE: Uses a probabilistic approach to embed high-dimensional data into a lower-dimensional space, emphasizing the preservation of local similarities and revealing clusters in the data.

UMAP: Embeds data points by modeling their local neighborhood relationships, offering an efficient and scalable technique that captures both global and local structures in high-dimensional data.

Force-Directed Graph Drawing: Uses a physical simulation to position nodes in a graph, with edges representing pairwise relationships, creating a visually meaningful representation that emphasizes connectedness and clustering in the data.

Diffusion Maps: Applies spectral methods to capture the intrinsic geometry of high-dimensional data by modeling diffusion processes, providing a way to uncover underlying structures and patterns.

Density Calculation in Embedding: Quantifies the density of observations within an embedding, considering conditions or groups, offering insights into the concentration of data points in different regions and aiding in the identification of densely populated areas.

ehrapy further provides algorithms for clustering and trajectory inference based on scanpy:

Leiden Clustering: Uses the Leiden algorithm to cluster observations into groups, revealing distinct communities within the dataset with an emphasis on intra-cluster cohesion.

Hierarchical Clustering Dendrogram: Constructs a dendrogram through hierarchical clustering based on specified group by categories, illustrating the hierarchical relationships among observations and facilitating the exploration of structured patterns.

Feature ranking

ehrapy provides two ways of ranking feature contributions to clusters and target variables:

Statistical tests: To compare any obtained clusters to obtain marker features that are significantly different between the groups, ehrapy extends scanpy’s ‘rank genes groups’. The original implementation, which features a t -test for numerical data, is complemented by a g -test for categorical data.

Feature importance: Calculates feature rankings for a target variable using linear regression, support vector machine or random forest models from scikit-learn. ehrapy evaluates the relative importance of each predictor by fitting the model and extracting model-specific metrics, such as coefficients or feature importances.
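The g-test used for categorical features is a likelihood-ratio test of independence on a contingency table. A minimal stand-alone version (not ehrapy’s implementation, which plugs into ‘rank genes groups’) looks like this:

```python
import numpy as np

def g_test(table):
    """Likelihood-ratio (G) test of independence on a contingency table.

    G = 2 * sum(O * ln(O / E)), with expected counts E derived from the
    row and column marginals. Cells with zero observed counts contribute 0.
    """
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(table > 0, table * np.log(table / expected), 0.0)
    g = 2.0 * terms.sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return g, dof
```

Under the null hypothesis of independence, G is approximately chi-squared distributed with the given degrees of freedom, so a P value can be obtained from the chi-squared survival function.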

Dataset integration

Based on scanpy’s ‘ingest’ function, ehrapy facilitates the integration of labels and embeddings from a well-annotated reference dataset into a new dataset, enabling the mapping of cluster annotations and spatial relationships for consistent comparative analysis. This process ensures harmonized clinical interpretations across datasets, especially useful when dealing with multiple experimental diseases or batches.

Knowledge inference

Survival analysis.

ehrapy’s implementation of survival analysis algorithms is based on lifelines 124 :

Ordinary Least Squares (OLS) Model: Creates a linear regression model using OLS from a specified formula and an AnnData object, allowing for the analysis of relationships between variables and observations.

Generalized Linear Model (GLM): Constructs a GLM from a given formula, distribution and AnnData, providing a versatile framework for modeling relationships with nonlinear data structures.

Kaplan–Meier: Fits the Kaplan–Meier curve to generate survival curves, offering a visual representation of the probability of survival over time in a dataset.

Cox Hazard Model: Constructs a Cox proportional hazards model using a specified formula and an AnnData object, enabling the analysis of survival data by modeling the hazard rates and their relationship to predictor variables.

Log-Rank Test: Calculates the P value for the log-rank test, comparing the survival functions of two groups, providing statistical significance for differences in survival distributions.

GLM Comparison: Given two fitted GLMs, where the larger encompasses the parameter space of the smaller, this function returns the P value indicating whether the larger model adds explanatory power beyond the smaller model.
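The Kaplan–Meier estimator itself is compact enough to sketch from scratch. ehrapy delegates to lifelines, so the following is only an illustration of the underlying product-limit formula, on invented durations:

```python
import numpy as np

def kaplan_meier(durations, events):
    """Product-limit estimator: S(t) = prod over event times t_i <= t of
    (1 - d_i / n_i), where d_i is the number of events and n_i the number
    of individuals still at risk at t_i. Censored individuals (event = 0)
    reduce the risk set but contribute no drop in survival."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    times = np.unique(durations[events])
    survival, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)
        d = np.sum((durations == t) & events)
        s *= 1.0 - d / at_risk
        survival.append(s)
    return times, np.array(survival)
```

For example, with durations [1, 2, 2, 3] and one censored patient at time 2, the curve drops to 3/4, then 1/2, then 0.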

Trajectory inference

Trajectory inference is a computational approach that reconstructs and models the developmental paths and transitions within heterogeneous clinical data, providing insights into the temporal progression underlying complex systems. ehrapy offers several inbuilt algorithms for trajectory inference based on scanpy:

Diffusion Pseudotime: Infers the progression of observations by measuring geodesic distance along the graph, providing a pseudotime metric that represents the developmental trajectory within the dataset.

Partition-based Graph Abstraction (PAGA): Maps out the coarse-grained connectivity structures of complex manifolds using a partition-based approach, offering a comprehensive visualization of relationships in high-dimensional data and aiding in the identification of macroscopic connectivity patterns.

Because ehrapy is compatible with scverse, further trajectory inference-based algorithms, such as CellRank, can be seamlessly applied.

Causal inference

ehrapy’s causal inference module is based on ‘dowhy’ 69 and follows its four key steps, all of which are implemented in ehrapy:

Graphical Model Specification: Define a causal graphical model representing relationships between variables and potential causal effects.

Causal Effect Identification: Automatically identify whether a causal effect can be inferred from the given data, addressing confounding and selection bias.

Causal Effect Estimation: Employ automated tools to estimate causal effects, using methods such as matching, instrumental variables or regression.

Sensitivity Analysis and Testing: Perform sensitivity analysis to assess the robustness of causal inferences and conduct statistical testing to determine the significance of the estimated causal effects.
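Why the identification and estimation steps matter can be demonstrated on simulated data with plain NumPy, using regression adjustment (one of the estimation methods listed above). The variable names and effect sizes are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated data: confounder z affects both the treatment t and the outcome y.
z = rng.normal(size=n)
t = 0.8 * z + rng.normal(scale=0.5, size=n)             # treatment
y = 2.0 * t + 3.0 * z + rng.normal(scale=0.5, size=n)   # true effect of t on y: 2.0

# Naive estimate (ignores z): biased, because z opens a backdoor path t <- z -> y.
naive = np.polyfit(t, y, 1)[0]

# Adjusted estimate: regress y on both t and z (backdoor adjustment via OLS).
X = np.column_stack([np.ones(n), t, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted = coef[1]
```

Here the naive slope overstates the effect by more than a factor of two, while the confounder-adjusted coefficient recovers the true causal effect; the refutation methods above probe how fragile such an estimate is to violated assumptions.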

Patient stratification

ehrapy’s complete pipeline from pre-processing to the generation of lower-dimensional embeddings, clustering, statistical comparison between determined groups and more facilitates the stratification of patients.

Visualization

ehrapy features an extensive visualization pipeline that is customizable and yet offers reasonable defaults. Almost every analysis function is matched with at least one visualization function that often shares the name but is available through the plotting module. For example, after importing ehrapy as ‘ep’, ‘ep.tl.umap(adata)’ runs the UMAP algorithm on an AnnData object, and ‘ep.pl.umap(adata)’ would then plot a scatter plot of the UMAP embedding.

ehrapy further offers a suite of more generally usable and modifiable plots:

Scatter Plot: Visualizes data points along observation or variable axes, offering insights into the distribution and relationships between individual data points.

Heatmap: Represents feature values in a grid, providing a comprehensive overview of the data’s structure and patterns.

Dot Plot: Displays count values of specified variables as dots, offering a clear depiction of the distribution of counts for each variable.

Filled Line Plot: Illustrates trends in data with filled lines, emphasizing variations in values over a specified axis.

Violin Plot: Presents the distribution of data through mirrored density plots, offering a concise view of the data’s spread.

Stacked Violin Plot: Combines multiple violin plots, stacked to allow for visual comparison of distributions across categories.

Group Mean Heatmap: Creates a heatmap displaying the mean count per group for each specified variable, providing insights into group-wise trends.

Hierarchically Clustered Heatmap: Uses hierarchical clustering to arrange data in a heatmap, revealing relationships and patterns among variables and observations.

Rankings Plot: Visualizes rankings within the data, offering a clear representation of the order and magnitude of values.

Dendrogram Plot: Plots a dendrogram of categories defined in a group by operation, illustrating hierarchical relationships within the dataset.

Benchmarking ehrapy

We generated a subset of the UKB data selecting 261 features and 488,170 patient visits. We removed all features with missingness rates greater than 70%. To demonstrate speed and memory consumption for various scenarios, we subsampled the data to 20%, 30% and 50%. We ran a minimal ehrapy analysis pipeline on each of those subsets and the full data, including the calculation of quality control metrics, filtering of variables by a missingness threshold, nearest neighbor imputation, normalization, dimensionality reduction and clustering (Supplementary Table 1 ). We conducted our benchmark on a single CPU with eight threads and 60 GB of maximum memory.

ehrapy further provides out-of-core implementations using Dask 108 for many of its algorithms, such as the normalization functions and the PCA implementation. Out-of-core computation refers to techniques that process data that do not fit entirely in memory, using disk storage to manage data overflow. This approach is crucial for handling large datasets without being constrained by system memory limits. Because the principal components are reused by other computationally expensive algorithms, such as the neighbors graph calculation, this effectively enables the analysis of very large datasets. We are currently working on supporting out-of-core computation for all computationally expensive algorithms in ehrapy.

We demonstrate the memory benefits in a hosted tutorial where the in-memory pipeline for 50,000 patients with 1,000 features required about 2 GB of memory, and the corresponding out-of-core implementation required less than 200 MB of memory.
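The core idea behind out-of-core computation can be illustrated without Dask: accumulate sufficient statistics chunk by chunk, so that no full copy of the data is ever held in memory. This is a sketch of the principle, not ehrapy’s implementation:

```python
import numpy as np

def chunked_mean_std(chunks):
    """Accumulate count, sum and sum of squares over chunks, so mean and
    standard deviation are obtained without loading all data at once."""
    n, s, ss = 0, 0.0, 0.0
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=float)
        n += chunk.size
        s += chunk.sum()
        ss += np.square(chunk).sum()
    mean = s / n
    std = np.sqrt(ss / n - mean**2)
    return mean, std

# Processing three chunks gives the same result as one in-memory pass.
data = np.arange(10.0)
mean, std = chunked_mean_std([data[:4], data[4:7], data[7:]])
```

In practice, the sum-of-squares formulation can lose precision for large-magnitude data, which is why production implementations typically use numerically stabler streaming algorithms such as Welford’s method.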

The code for benchmarking is available at https://github.com/theislab/ehrapy-reproducibility . The implementation of ehrapy is accessible at https://github.com/theislab/ehrapy together with extensive API documentation and tutorials at https://ehrapy.readthedocs.io .

PIC database analysis

Study design.

We collected clinical data from the PIC 43 version 1.1.0 database. PIC is a single-center, bilingual (English and Chinese) database hosting information of children admitted to critical care units at the Children’s Hospital of Zhejiang University School of Medicine in China. The requirement for individual patient consent was waived because the study did not impact clinical care, and all protected health information was de-identified. The database contains 13,499 distinct hospital admissions of 12,881 distinct pediatric patients. These patients were admitted to five ICU units (GICU, PICU, SICU, CICU and NICU) with 119 total critical care beds between 2010 and 2018. The mean age of the patients was 2.5 years, and 42.5% were female. The in-hospital mortality was 7.1%; the mean hospital stay was 17.6 d; the mean ICU stay was 9.3 d; and 468 (3.6%) patients were admitted multiple times. Demographics, diagnoses, doctors’ notes, laboratory and microbiology tests, prescriptions, fluid balances, vital signs and radiographic reports were collected from all patients. For more details, see the original publication of Zeng et al. 43 .

Study participants

Individuals older than 18 years were excluded from the study. We grouped the data into three distinct groups: ‘neonates’ (0–28 d of age; 2,968 patients), ‘infants’ (1–12 months of age; 4,876 patients) and ‘youths’ (13 months to 18 years of age; 6,097 patients). We primarily analyzed the ‘youths’ group with the discharge diagnosis ‘unspecified pneumonia’ (277 patients).

Data collection

The collected clinical data included demographics, laboratory and vital sign measurements, diagnoses, microbiology and medication information and mortality outcomes. The five-character English ICD-10 codes were used, whose values are based on the seven-character Chinese ICD-10 codes.

Dataset extraction and analysis

We downloaded the PIC database of version 1.1.0 from Physionet 1 to obtain 17 CSV tables. Using Pandas, we selected all information with more than 50% coverage rate, including demographics and laboratory and vital sign measurements (Fig. 2 ). To reduce the amount of noise, we calculated and added only the minimum, maximum and average of all measurements that had multiple values per patient. Examination reports were removed because they describe only diagnostics and not detailed findings. All further diagnoses and microbiology and medication information were included in the observations slot to ensure that the data were not used for the calculation of embeddings but were still available for the analysis. This ensured that any calculated embedding would not separate treated and untreated groups but, rather, would be based solely on phenotypic features. We imputed all missing data through k -nearest neighbors imputation ( k  = 20) using the knn_impute function of ehrapy. Next, we log normalized the data with ehrapy using the log_norm function. Afterwards, we winsorized the data using ehrapy’s winsorize function to obtain 277 ICU visits ( n  = 265 patients) with 572 features. Of those 572 features, 254 were stored in the matrix X and the remaining 318 in the ‘obs’ slot of the AnnData object. For clustering and visualization purposes, we calculated 50 principal components using ehrapy’s pca function. The obtained principal component representation was then used to calculate a nearest neighbors graph using the neighbors function of ehrapy. The nearest neighbors graph then served as the basis for a UMAP embedding calculation using ehrapy’s umap function.
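The reduction of repeated measurements to minimum, maximum and average values per patient can be expressed as a pandas group-by aggregation. The column names and values below are invented for illustration:

```python
import pandas as pd

# Long-format table: several measurements per patient (values invented).
measurements = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "heart_rate": [80, 95, 110, 70, 75],
})

# Collapse multiple measurements per patient into min/max/mean columns,
# mirroring the reduction described above.
agg = measurements.groupby("patient_id")["heart_rate"].agg(["min", "max", "mean"])
```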

We applied the community detection algorithm Leiden with resolution 0.6 on the nearest neighbor graph using ehrapy’s leiden function. The four obtained clusters served as input for two-sided t -tests for all numerical values and two-sided g -tests for all categorical values for all four clusters against the union of all three other clusters, respectively. This was conducted using ehrapy’s rank_feature_groups function, which also corrects P values for multiple testing with the Benjamini–Hochberg method 125 . We presented the four groups and the statistically significantly different features between the groups to two pediatricians who annotated the groups with labels.
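The Benjamini–Hochberg correction applied by rank_feature_groups can be written out explicitly. This is a stand-alone sketch of the procedure, not ehrapy’s code:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values: p_(i) * n / i over the sorted
    P values, made monotone by a running minimum from the largest P value
    down, then capped at 1 and mapped back to the input order."""
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * n / np.arange(1, n + 1)
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    out = np.empty(n)
    out[order] = adjusted
    return out
```

Compared with a Bonferroni correction, this controls the false discovery rate rather than the family-wise error rate, which is usually the more appropriate choice when ranking many candidate marker features.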

Our determined groups can be confidently labeled owing to their distinct clinical profiles. Nevertheless, we could only take into account clinical features that were measured. Insightful features, such as lung function tests, are missing. Moreover, the feature representation of the time-series data is simplified, which can hide some nuances between the groups. Generally, deciding on a clustering resolution is difficult. However, more fine-grained clusters obtained via higher clustering resolutions may become too specific and not generalize well enough.

Kaplan–Meier survival analysis

We selected patients with up to 360 h of total stay for Kaplan–Meier survival analysis to ensure a sufficiently high number of participants. We proceeded with the AnnData object prepared as described in the ‘Patient stratification’ subsection to conduct Kaplan–Meier analysis among all four determined pneumonia groups using ehrapy’s kmf function. Significance was tested through ehrapy’s test_kmf_logrank function, which tests whether the difference between two Kaplan–Meier series is statistically significant, employing a chi-squared test statistic under the null hypothesis. Let h_i(t) be the hazard rate of group i at time t and c a constant that represents a proportional change in the hazard ratio between the two groups; the test then evaluates:

H_0: h_1(t) = c · h_2(t), with c = 1, versus H_A: h_1(t) = c · h_2(t), with c ≠ 1, for all t.

This implicitly uses the log-rank weights. An additional Kaplan–Meier analysis was conducted for all children jointly concerning the liver markers AST, ALT and GGT. To determine whether measurements were inside or outside the norm range, we used reference ranges (Supplementary Table 2 ). P values less than 0.05 were labeled significant.

Our Kaplan–Meier curve analysis depends on the groups being well defined and shares the same limitations as the patient stratification. Additionally, the analysis is sensitive to the reference table where we selected limits that generalize well for the age ranges, but, due to children of different ages being examined, they may not necessarily be perfectly accurate for all children.

Causal effect of mechanism of action on LOS

Although the dataset was not initially intended for investigating causal effects of interventions, we adapted it for this purpose by focusing on the LOS in the ICU, measured in months, as the outcome variable. This choice aligns with the clinical aim of stabilizing patients sufficiently for ICU discharge. We constructed a causal graph to explore how different drug administrations could potentially reduce the LOS. Based on consultations with clinicians, we included several biomarkers of liver damage (AST, ALT and GGT) and inflammation (CRP and PCT) in our model. Patient age was also considered a relevant variable.

Because several different medications act by the same mechanisms, we grouped specific medications by their drug classes. This grouping was achieved by cross-referencing the drugs listed in the dataset with DrugBank release 5.1 (ref. 126 ), using Levenshtein distances for partial string matching. After manual verification, we extracted the corresponding DrugBank categories, counted the number of features per category and compiled a list of commonly prescribed medications, as advised by clinicians. This approach facilitated the modeling of the causal graph depicted in Fig. 4 , where an intervention is defined as the administration of at least one drug from a specified category.
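The Levenshtein distance used for the partial string matching has a compact dynamic-programming form. This is a generic implementation for illustration; the drug names are invented examples:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of single-character insertions,
    deletions and substitutions needed to turn string a into string b,
    computed row by row over the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

A small distance between a dataset entry and a DrugBank name (relative to the name length) then flags a likely match for manual verification.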

Causal inference was then conducted with ehrapy’s ‘dowhy’ 69 -based causal inference module using the expert-curated causal graph. Medication groups were designated as causal interventions, and the LOS was the outcome of interest. Linear regression served as the estimation method for analyzing these causal effects. We excluded four patients from the analysis owing to their notably long hospital stays exceeding 90 d, which were deemed outliers. To validate the robustness of our causal estimates, we incorporated several refutation methods:

Placebo Treatment Refuter: This method involved replacing the treatment assignment with a placebo to test the effect of the treatment variable being null.

Random Common Cause: A randomly generated variable was added to the data to assess the sensitivity of the causal estimate to the inclusion of potential unmeasured confounders.

Data Subset Refuter: The stability of the causal estimate was tested across various random subsets of the data to ensure that the observed effects were not dependent on a specific subset.

Add Unobserved Common Cause: This approach tested the effect of an omitted variable by adding a theoretically relevant unobserved confounder to the model, evaluating how much an unmeasured variable could influence the causal relationship.

Dummy Outcome: Replaces the true outcome variable with a random variable. If the causal effect nullifies, it supports the validity of the original causal relationship, indicating that the outcome is not driven by random factors.

Bootstrap Validation: Employs bootstrapping to generate multiple samples from the dataset, testing the consistency of the causal effect across these samples.

The selection of these refuters addresses a broad spectrum of potential biases and model sensitivities, including unobserved confounders and data dependencies. This comprehensive approach ensures robust verification of the causal analysis. Each refuter provides an orthogonal perspective, targeting specific vulnerabilities in causal analysis, which strengthens the overall credibility of the findings.

UKB analysis

Study population.

We used information from the UKB cohort, which includes 502,164 study participants from the general UK population without enrichment for specific diseases. The study involved the enrollment of individuals between 2006 and 2010 across 22 different assessment centers throughout the United Kingdom. The tracking of participants is still ongoing. Within the UKB dataset, metabolomics, proteomics and retinal optical coherence tomography data are available for a subset of individuals without any enrichment for specific diseases. Additionally, EHRs, questionnaire responses and other physical measures are available for almost everyone in the study. Furthermore, a variety of genotype information is available for nearly the entire cohort, including whole-genome sequencing, whole-exome sequencing, genotyping array data as well as imputed genotypes from the genotyping array 44 . Because only the latter two are available for download, and are sufficient for polygenic risk score calculation as performed here, we used the imputed genotypes in the present study. Participants visited the assessment center up to four times for additional and repeat measurements and completed additional online follow-up questionnaires.

In the present study, we restricted the analyses to data obtained from the initial assessment, including the blood draw, for obtaining the metabolomics data and the retinal imaging as well as physical measures. This restricts the study population to 33,521 individuals for whom all of these modalities are available. We have a clear study start point for each individual with the date of their initial assessment center visit. The study population has a mean age of 57 years, is 54% female and is censored at age 69 years on average; 4.7% experienced an incident myocardial infarction; and 8.1% have prevalent type 2 diabetes. The study population comes from six of the 22 assessment centers because the retinal imaging was performed only at those centers.

For the myocardial infarction endpoint definition, we relied on the first occurrence data available in the UKB, which compiles the first date that each diagnosis was recorded for a participant in a hospital in ICD-10 nomenclature. Subsequently, we mapped these data to phecodes and focused on phecode 404.1 for myocardial infarction.

The Framingham Risk Score was developed on data from 8,491 participants in the Framingham Heart Study to assess general cardiovascular risk 77 . It includes easily obtainable predictors and is, therefore, easily applicable in clinical practice, although newer and more specific risk scores exist and might be used more frequently. It includes age, sex, smoking behavior, blood pressure, total and low-density lipoprotein cholesterol as well as information on insulin, antihypertensive and cholesterol-lowering medications, all of which are routinely collected in the UKB and used in this study as the Framingham feature set.

The metabolomics data used in this study were obtained using proton NMR spectroscopy, a low-cost method with relatively low batch effects. It covers established clinical predictors, such as albumin and cholesterol, as well as a range of lipids, amino acids and carbohydrate-related metabolites.

The retinal optical coherence tomography–derived features were returned by researchers to the UKB 75 , 76 . They used the available scans and determined the macular volume, macular thickness, retinal pigment epithelium thickness, disc diameter, cup-to-disk ratio across different regions as well as the thickness between the inner nuclear layer and external limiting membrane, inner and outer photoreceptor segments and the retinal pigment epithelium across different regions. Furthermore, they determined a wide range of quality metrics for each scan, including the image quality score, minimum motion correlation and inner limiting membrane (ILM) indicator.

Data analysis

After exporting the data from the UKB, all timepoints were transformed into participant age entries. Only participants without prevalent myocardial infarction (relative to the first assessment center visit at which all data were collected) were included.

The data were pre-processed for the retinal imaging and metabolomics subsets separately, to enable a clear analysis of missing data and allow for the k -nearest neighbors–based imputation ( k  = 20) of missing values when less than 10% were missing for a given participant. Otherwise, participants were dropped from the analyses. The imputed genotypes and the Framingham features were available for almost every participant and were, therefore, not further imputed; individuals missing them were dropped from the analyses instead. Because genetic risk modeling poses entirely different methodological and computational challenges, we applied a published polygenic risk score for coronary heart disease using 6.6 million variants 73 . This was computed using the plink2 score option on the imputed genotypes available in the UKB.

UMAP embeddings were computed using default parameters on the full feature sets with ehrapy’s umap function. For all analyses, the same time-to-event and event-indicator columns were used. The event indicator is a Boolean variable indicating whether a myocardial infarction was observed for a study participant. The time to event is defined as the timespan between the start of the study (the date of the first assessment center visit) and the event. Otherwise, it is the timespan from the start of the study to the start of censoring; in this case, this is set to the last date for which EHRs were available, unless a participant died, in which case the date of death is the start of censoring. Kaplan–Meier curves and Cox proportional hazards models were fit using ehrapy’s survival analysis module and the lifelines 124 package’s CoxPHFitter function with default parameters. For Cox proportional hazards models with multiple feature sets, individually imputed and quality-controlled feature sets were concatenated, and the model was fit on the resulting matrix. Models were evaluated using the C-index 127 as a metric. It can be seen as an extension of the common area under the receiver operator characteristic score to time-to-event datasets, in which events are not observed for every sample, and it ranges from 0.0 (entirely discordant) through 0.5 (random) to 1.0 (entirely concordant). CIs for the C-index were computed based on bootstrapping by sampling 1,000 times with replacement from all computed partial hazards and computing the C-index over each of these samples. The percentiles at 2.5% and 97.5% then give the upper and lower confidence bounds for the 95% CIs.
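The C-index and its bootstrap CI can be sketched as follows. This is a simplified Harrell’s C (risk ties count 0.5) on invented data with far fewer resamples, not the implementation referenced above:

```python
import numpy as np

def c_index(times, events, risks):
    """Simplified Harrell's C: among comparable pairs (the earlier time is an
    observed event), count pairs where the earlier event has the higher
    predicted risk; risk ties count 0.5. NaN if no pair is comparable."""
    concordant = comparable = 0.0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

# Invented example: higher predicted risk corresponds to earlier events.
times = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
events = np.array([1, 1, 0, 1, 1])
risks = np.array([0.9, 0.7, 0.6, 0.4, 0.2])
point = c_index(times, events, risks)

# Bootstrap 95% CI: resample patients with replacement, take the 2.5% and
# 97.5% percentiles of the resampled C-indices.
rng = np.random.default_rng(0)
boot = []
for _ in range(200):
    idx = rng.integers(0, len(times), len(times))
    c = c_index(times[idx], events[idx], risks[idx])
    if not np.isnan(c):  # skip degenerate resamples with no comparable pairs
        boot.append(c)
lower, upper = np.percentile(boot, [2.5, 97.5])
```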

In all UKB analyses, the unit of study for a statistical test or predictive model is always an individual study participant.

The generalizability of the analysis is limited as the UK Biobank cohort may not represent the general population, with potential selection biases and underrepresentation of the different demographic groups. Additionally, by restricting analysis to initial assessment data and censoring based on the last available EHR or date of death, our analysis does not account for longitudinal changes and can introduce follow-up bias, especially if participants lost to follow-up have different risk profiles.

In-depth quality control of retina-derived features

A UMAP plot of the retina-derived features indicating the assessment centers shows a cluster of samples that lie somewhat outside the general population and mostly attended the Birmingham assessment center (Fig. 5b ). To further investigate this, we performed Leiden clustering of resolution 0.3 (Extended Data Fig. 9a ) and isolated this group in cluster 5. When comparing cluster 5 to the rest of the population in the retina-derived feature space, we noticed that many individuals in cluster 5 showed overall retinal pigment epithelium (RPE) thickness measures substantially elevated over the rest of the population in both eyes (Extended Data Fig. 9b ), which is mostly a feature of this cluster (Extended Data Fig. 9c ). To investigate potential confounding, we computed ratios between cluster 5 and the rest of the population over the ‘obs’ DataFrame containing the Framingham features, diabetes-related phecodes and genetic principal components. Of the five highest and five lowest ratios observed, six were in genetic principal components, which are commonly used to represent genetic ancestry in a continuous space (Extended Data Fig. 9d ). Additionally, diagnoses for type 1 and type 2 diabetes and antihypertensive use were enriched in cluster 5. Further investigating the ancestry, we computed log ratios for self-reported ancestries and absolute counts, which showed no robust enrichment or depletion effects.

A closer look at three quality control measures of the imaging pipeline revealed that cluster 5 contained outliers with respect to image quality (Extended Data Fig. 9e ), minimum motion correlation (Extended Data Fig. 9f ) and the ILM indicator (Extended Data Fig. 9g ), all of which can be indicative of artifacts in image acquisition and downstream processing 128 . Subsequently, we excluded the 301 individuals in cluster 5 from all analyses.

COVID-19 chest-x-ray fate determination

Dataset overview

We used the public BrixIA COVID-19 dataset, which contains 192 chest x-ray images annotated with BrixIA scores 82 . Six lung regions were each annotated with a disease severity score ranging from 0 to 3 by a senior radiologist with more than 20 years of experience and a junior radiologist. A global score (S-Global) was determined as the sum over all six regions and, therefore, ranges from 0 to 18. Images with an S-Global score of 0 were classified as normal. Images with severity values of at most 1 in all six regions were classified as mild. Images with severity values greater than or equal to 2 but an S-Global score of less than 7 were classified as moderate. Images containing at least one 3 in any of the six regions and an S-Global score between 7 and 10 were classified as severe, and all remaining images with an S-Global score greater than 10 and at least one 3 were labeled critical. The dataset and instructions to download the images can be found at https://github.com/ieee8023/covid-chestxray-dataset .
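
These labeling rules can be written down as a small function. This is a direct transcription of the rules as stated above; score combinations the text does not cover return None.

```python
def brixia_class(region_scores):
    """Map six BrixIA region scores (each 0-3) to a severity class.

    Direct transcription of the rules stated in the text; combinations
    not covered there return None.
    """
    assert len(region_scores) == 6
    s_global = sum(region_scores)  # S-Global ranges from 0 to 18
    if s_global == 0:
        return "normal"
    if max(region_scores) <= 1:
        return "mild"
    if s_global < 7:  # at this point some region is >= 2
        return "moderate"
    if 3 in region_scores and s_global <= 10:
        return "severe"
    if 3 in region_scores and s_global > 10:
        return "critical"
    return None

print(brixia_class([0, 0, 0, 0, 0, 0]))  # normal
print(brixia_class([1, 1, 0, 0, 1, 0]))  # mild
print(brixia_class([2, 2, 1, 0, 0, 0]))  # moderate (S-Global 5)
print(brixia_class([3, 2, 2, 1, 1, 0]))  # severe (S-Global 9)
print(brixia_class([3, 3, 3, 2, 2, 2]))  # critical (S-Global 15)
```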

We first resized all images to 224 × 224. Afterwards, the images underwent a random affine transformation involving rotation, translation and scaling. The rotation angle was randomly selected from a range of −45° to 45°. The images were also subject to horizontal and vertical translation, with the maximum translation being 15% of the image size in either direction. Additionally, the images were scaled by a factor ranging from 0.85 to 1.15. These transformations augment the dataset and introduce variation, ultimately improving the robustness and generalization of the model.
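
Sampling the augmentation parameters can be sketched as follows. This is a minimal NumPy illustration of the stated parameter ranges only; the image-warping step itself (typically done with a transform library) is omitted.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_affine_params(image_size=224):
    """Draw one random affine augmentation as described above:
    rotation in [-45, 45] degrees, translation of up to 15% of the
    image size in each direction and scaling in [0.85, 1.15]."""
    angle = rng.uniform(-45.0, 45.0)
    max_shift = 0.15 * image_size
    tx = rng.uniform(-max_shift, max_shift)  # horizontal shift in pixels
    ty = rng.uniform(-max_shift, max_shift)  # vertical shift in pixels
    scale = rng.uniform(0.85, 1.15)
    return angle, tx, ty, scale

angle, tx, ty, scale = sample_affine_params()
print(angle, tx, ty, scale)
```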

To generate embeddings, we used a pre-trained DenseNet model with the densenet121-res224-all weights of TorchXRayVision 129 . A DenseNet is a convolutional neural network that makes use of dense connections between layers (dense blocks), in which all layers with matching feature map sizes connect directly with each other. To maintain a feed-forward nature, every layer in the DenseNet architecture receives supplementary inputs from all preceding layers and transmits its own feature maps to all subsequent layers. The model was trained on the nih-pc-chex-mimic_ch-google-openi-rsna dataset 130 .

Next, we calculated 50 principal components on the DenseNet feature representations of all images using ehrapy’s pca function. The principal component representation served as input for a nearest neighbors graph calculation using ehrapy’s neighbors function. This graph served as the basis for the calculation of a UMAP embedding with three components, which was finally visualized using ehrapy.
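
The PCA and neighbors steps can be sketched in plain NumPy on a toy feature matrix. This is an illustration of the underlying operations, not the actual ehrapy calls (which wrap these computations and add the UMAP embedding on top).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # toy stand-in for the DenseNet feature matrix

# PCA via SVD on the centered matrix (mirrors the pca step;
# 10 components for this toy instead of 50).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_comp = 10
pcs = Xc @ Vt[:n_comp].T  # (100, 10) principal-component scores

# k-nearest-neighbors graph on the PC representation (mirrors the
# neighbors step; a UMAP embedding would then be computed on this graph).
k = 15
d2 = ((pcs[:, None, :] - pcs[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)            # exclude self-neighbors
knn_idx = np.argsort(d2, axis=1)[:, :k]  # each sample's k nearest neighbors
print(pcs.shape, knn_idx.shape)
```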

We randomly picked a root among the images labeled ‘Normal’. We then calculated pseudotime by fitting a trajectory through the calculated UMAP space using diffusion maps as implemented in ehrapy’s dpt function 57 . Each image’s pseudotime value represents its estimated position along this trajectory, serving as a proxy for its severity stage relative to others in the dataset. To determine fates, we employed CellRank 58 , 59 with the PseudotimeKernel . This kernel computes transition probabilities for patient visits based on the connectivity of the k -nearest neighbors graph and the pseudotime values of patient visits, which reflect their progression through a process. Directionality is infused into the nearest neighbors graph by having the kernel remove or downweight edges that contradict the directional flow of increasing pseudotime, thereby refining the graph to better reflect the developmental trajectory. We computed the transition matrix with a soft threshold scheme (a parameter of the PseudotimeKernel ), which downweights edges that point against the direction of increasing pseudotime. Finally, we calculated a projection on top of the UMAP embedding with CellRank using the plot_projection function of the PseudotimeKernel and subsequently plotted it.
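
The core idea can be illustrated on a toy chain graph. The sketch below uses hop distance from the root as a simple pseudotime stand-in (ehrapy's dpt uses diffusion pseudotime) and mimics the PseudotimeKernel's soft thresholding by downweighting edges that point against increasing pseudotime before row-normalizing into a transition matrix.

```python
import numpy as np

# Toy severity trajectory: a chain of six "images", root at node 0.
# A is the adjacency matrix of an undirected k-NN-like graph.
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1.0

# Pseudotime proxy: breadth-first hop distance from the chosen root.
root = 0
pt = np.full(6, np.inf)
pt[root] = 0
frontier = [root]
while frontier:
    nxt = []
    for u in frontier:
        for v in np.nonzero(A[u])[0]:
            if pt[v] == np.inf:
                pt[v] = pt[u] + 1
                nxt.append(v)
    frontier = nxt

# Soft thresholding: downweight edges pointing against increasing
# pseudotime, then row-normalize into transition probabilities.
W = A * np.where(pt[None, :] >= pt[:, None], 1.0, 0.1)
T = W / W.sum(axis=1, keepdims=True)
print(pt)    # hop-distance pseudotime along the chain
print(T[2])  # transitions from node 2 favor the forward neighbor, node 3
```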

This analysis is limited by the small dataset of 192 chest x-ray images, which may affect the model’s generalizability and robustness. Annotation subjectivity from radiologists can further introduce variability in severity scores. Additionally, the random selection of a root from ‘Normal’ images can introduce bias in pseudotime calculations and subsequent analyses.

Diabetes 130-US hospitals analysis

We used data from the Diabetes 130-US hospitals dataset, collected between 1999 and 2008, which contains clinical care information from 130 hospitals and integrated delivery networks. The extracted database information pertains to hospital admissions specifically for patients diagnosed with diabetes. These encounters involved a hospital stay of 1–14 d, during which both laboratory tests and medications were administered; the selection criteria focused exclusively on inpatient encounters with these defined characteristics. More specifically, we used a version curated by the Fairlearn team in which the target variable ‘readmitted’ was binarized and a few features were renamed or binned ( https://fairlearn.org/main/user_guide/datasets/diabetes_hospital_data.html ). The dataset contains 101,877 patient visits and 25 features. The dataset predominantly consists of White patients (74.8%), followed by African Americans (18.9%), with other racial groups, such as Hispanic, Asian and Unknown categories, comprising smaller percentages. Females make up a slight majority at 53.8%, with males accounting for 46.2% and a negligible number of entries listed as unknown or invalid. A substantial majority of patients are over 60 years of age (67.4%), whereas those aged 30–60 years represent 30.2% and those 30 years or younger constitute just 2.5%.

All of the following descriptions start by loading the Fairlearn version of the Diabetes 130-US hospitals dataset using ehrapy’s dataloader as an AnnData object.

Selection and filtering bias

An overview of sensitive variables was generated using tableone. Subsequently, ehrapy’s CohortTracker was used to track the age, gender and race variables. The cohort was filtered to Medicare recipients, and the resulting cohort composition was plotted.
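
The effect of such a filtering step on cohort composition can be sketched with pandas. The columns mirror the tracked variables, but the toy values and the helper function are illustrative, not the CohortTracker API itself.

```python
import pandas as pd

# Toy cohort (column names mirror the Diabetes 130-US hospitals
# variables; the rows are hypothetical).
df = pd.DataFrame({
    "age": ["Over 60 years", "30-60 years", "Over 60 years", "30 years or younger"],
    "gender": ["Female", "Male", "Female", "Male"],
    "race": ["Caucasian", "AfricanAmerican", "Caucasian", "Hispanic"],
    "medicare": [True, False, True, False],
})

def composition(d):
    """Share of each category per tracked variable."""
    return {c: d[c].value_counts(normalize=True).to_dict()
            for c in ["age", "gender", "race"]}

before = composition(df)
after = composition(df[df["medicare"]])  # filter to Medicare recipients
print(before["gender"], "->", after["gender"])
```

Comparing `before` and `after` makes the selection bias explicit: here, filtering to Medicare recipients leaves an all-female, all-Caucasian, all-over-60 subcohort.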

Surveillance bias

We plotted the HbA1c measurement ratios using ehrapy’s catplot .

Missing data and imputation bias

MCAR-type missing data for the number of medications variable (‘num_medications’) were introduced by randomly setting 30% of its values to missing using NumPy’s choice function. We verified that the data are MCAR by applying ehrapy’s implementation of Little’s MCAR test, which returned a non-significant P value of 0.71. MAR data for the same variable were introduced by scaling the ‘time_in_hospital’ variable to have a mean of 0 and a standard deviation of 1, adjusting these values by multiplying by 1.2 and subtracting 0.6 to influence the overall missingness rate, and then using them to generate missingness in ‘num_medications’ via a logistic transformation and binomial sampling. We verified that the newly introduced missing values are not MCAR with respect to the ‘time_in_hospital’ variable by applying ehrapy’s implementation of Little’s test, which was significant (P = 0.01 × 10 −2 ). The missing data were imputed using ehrapy’s mean imputation and MissForest implementation.
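
The two missingness mechanisms can be sketched in NumPy on synthetic data (toy values, not the actual dataset columns; Little's test and the imputation steps are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
time_in_hospital = rng.integers(1, 15, n).astype(float)  # 1-14 d stays
num_medications = rng.normal(16, 8, n)

# MCAR: knock out 30% of the values completely at random.
mcar = num_medications.copy()
mcar[rng.choice(n, size=int(0.3 * n), replace=False)] = np.nan

# MAR: missingness in num_medications depends on time_in_hospital.
z = (time_in_hospital - time_in_hospital.mean()) / time_in_hospital.std()
logits = 1.2 * z - 0.6                  # scale/shift controls the overall rate
p_miss = 1.0 / (1.0 + np.exp(-logits))  # logistic transformation
mar = num_medications.copy()
mar[rng.binomial(1, p_miss).astype(bool)] = np.nan  # binomial sampling

print(np.isnan(mcar).mean(), np.isnan(mar).mean())
```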

Algorithmic bias

The variables ‘race’, ‘gender’, ‘age’, ‘readmitted’, ‘readmit_binary’ and ‘discharge_disposition_id’ were moved to the ‘obs’ slot of the AnnData object to ensure that they were not used for model training. We built a binary label ‘readmit_30_days’ indicating whether a patient had been readmitted in fewer than 30 d. Next, we combined the ‘Asian’ and ‘Hispanic’ categories into a single ‘Other’ category within the ‘race’ column of our AnnData object, filtered out and discarded any samples labeled as ‘Unknown/Invalid’ under the ‘gender’ column and subsequently moved the ‘gender’ data to the variable matrix X of the AnnData object. All categorical variables were encoded. The data were split into train and test groups with a test size of 50%. The data were scaled, and a logistic regression model was trained using scikit-learn, which was also used to determine the balanced accuracy score. Fairlearn’s MetricFrame function was used to inspect model performance across the sensitive variable ‘race’. We subsequently fit Fairlearn’s ThresholdOptimizer using the logistic regression estimator with balanced_accuracy_score as the target objective. An algorithmic demonstration of Fairlearn’s abilities on this dataset is shown here: https://github.com/fairlearn/talks/tree/main/2021_scipy_tutorial .
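
The core idea behind threshold optimization can be sketched without Fairlearn: pick a separate decision threshold per sensitive group so that a chosen metric (here, balanced accuracy) improves. This toy NumPy version shows only that core idea; Fairlearn's ThresholdOptimizer additionally enforces a fairness constraint across groups.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n)    # toy sensitive attribute (e.g. race)
y = rng.binomial(1, 0.3, n)      # toy readmit_30_days labels
# Toy model scores whose calibration differs by group.
score = 0.5 * y + 0.2 * group + rng.normal(0, 0.3, n)

def balanced_acc(y_true, y_pred):
    tpr = (y_pred[y_true == 1] == 1).mean()
    tnr = (y_pred[y_true == 0] == 0).mean()
    return 0.5 * (tpr + tnr)

# Per-group threshold search over a candidate grid.
thresholds = {}
for g in (0, 1):
    m = group == g
    cand = np.linspace(score[m].min(), score[m].max(), 101)
    thresholds[g] = max(
        cand,
        key=lambda t: balanced_acc(y[m], (score[m] >= t).astype(int)),
    )

y_pred = (score >= np.where(group == 1, thresholds[1], thresholds[0])).astype(int)
print(thresholds, balanced_acc(y, y_pred))
```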

Normalization bias

We one-hot encoded all categorical variables with ehrapy using the encode function. We then applied ehrapy’s scaling normalization ( scale_norm function) with and without the ‘Age group’ variable as group key, scaling the data per group and jointly, respectively.
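
The difference between joint and per-group scaling can be shown on a toy variable (the data and helper below are illustrative, not the scale_norm API): joint z-scoring preserves the offset between age groups, whereas scaling within each group removes it, which changes downstream comparisons.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy variable measured in two age groups with different baselines.
groups = np.repeat(["30-60 years", "Over 60 years"], 100)
x = np.concatenate([rng.normal(10, 2, 100), rng.normal(20, 2, 100)])

def zscore(v):
    return (v - v.mean()) / v.std()

joint = zscore(x)          # one global mean/std for all samples
per_group = x.copy()       # group key: scale within each age group
for g in np.unique(groups):
    per_group[groups == g] = zscore(x[groups == g])

older = groups == "Over 60 years"
# Joint scaling keeps the older group's elevated mean; per-group
# scaling centers every group at zero.
print(joint[older].mean(), per_group[older].mean())
```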

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Physionet provides access to the PIC database 43 at https://physionet.org/content/picdb/1.1.0 for credentialed users. The BrixIA images 82 are available at https://github.com/BrixIA/Brixia-score-COVID-19 . The data used in this study were obtained from the UK Biobank 44 ( https://www.ukbiobank.ac.uk/ ). Access to the UK Biobank resource was granted under application number 49966. The data are available to researchers upon application to the UK Biobank in accordance with their data access policies and procedures. The Diabetes 130-US Hospitals dataset is available at https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008 .

Code availability

The ehrapy source code is available at https://github.com/theislab/ehrapy under an Apache 2.0 license. Further documentation, tutorials and examples are available at https://ehrapy.readthedocs.io . We are actively developing the software and invite contributions from the community.

Jupyter notebooks to reproduce our analysis and figures, including Conda environments that specify all versions, are available at https://github.com/theislab/ehrapy-reproducibility .

Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101 , E215–E220 (2000).

Atasoy, H., Greenwood, B. N. & McCullough, J. S. The digitization of patient care: a review of the effects of electronic health records on health care quality and utilization. Annu. Rev. Public Health 40 , 487–500 (2019).

Jamoom, E. W., Patel, V., Furukawa, M. F. & King, J. EHR adopters vs. non-adopters: impacts of, barriers to, and federal initiatives for EHR adoption. Health (Amst.) 2 , 33–39 (2014).

Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 1 , 18 (2018).

Wolf, A. et al. Data resource profile: Clinical Practice Research Datalink (CPRD) Aurum. Int. J. Epidemiol. 48 , 1740–1740g (2019).

Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12 , e1001779 (2015).

Pollard, T. J. et al. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci. Data 5 , 180178 (2018).

Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3 , 160035 (2016).

Hyland, S. L. et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat. Med. 26 , 364–373 (2020).

Rasmy, L. et al. Recurrent neural network models (CovRNN) for predicting outcomes of patients with COVID-19 on admission to hospital: model development and validation using electronic health record data. Lancet Digit. Health 4 , e415–e425 (2022).

Marcus, J. L. et al. Use of electronic health record data and machine learning to identify candidates for HIV pre-exposure prophylaxis: a modelling study. Lancet HIV 6 , e688–e695 (2019).

Kruse, C. S., Stein, A., Thomas, H. & Kaur, H. The use of electronic health records to support population health: a systematic review of the literature. J. Med. Syst. 42 , 214 (2018).

Sheikh, A., Jha, A., Cresswell, K., Greaves, F. & Bates, D. W. Adoption of electronic health records in UK hospitals: lessons from the USA. Lancet 384 , 8–9 (2014).

Sheikh, A. et al. Health information technology and digital innovation for national learning health and care systems. Lancet Digit. Health 3 , e383–e396 (2021).

Cord, K. A. M., Mc Cord, K. A. & Hemkens, L. G. Using electronic health records for clinical trials: where do we stand and where can we go? Can. Med. Assoc. J. 191 , E128–E133 (2019).

Landi, I. et al. Deep representation learning of electronic health records to unlock patient stratification at scale. NPJ Digit. Med. 3 , 96 (2020).

Ayaz, M., Pasha, M. F., Alzahrani, M. Y., Budiarto, R. & Stiawan, D. The Fast Health Interoperability Resources (FHIR) standard: systematic literature review of implementations, applications, challenges and opportunities. JMIR Med. Inform. 9 , e21929 (2021).

Peskoe, S. B. et al. Adjusting for selection bias due to missing data in electronic health records-based research. Stat. Methods Med. Res. 30 , 2221–2238 (2021).

Haneuse, S. & Daniels, M. A general framework for considering selection bias in EHR-based studies: what data are observed and why? EGEMS (Wash. DC) 4 , 1203 (2016).

Gallifant, J. et al. Disparity dashboards: an evaluation of the literature and framework for health equity improvement. Lancet Digit. Health 5 , e831–e839 (2023).

Sauer, C. M. et al. Leveraging electronic health records for data science: common pitfalls and how to avoid them. Lancet Digit. Health 4 , e893–e898 (2022).

Li, J. et al. Imputation of missing values for electronic health record laboratory data. NPJ Digit. Med. 4 , 147 (2021).

Rubin, D. B. Inference and missing data. Biometrika 63 , 581 (1976).

Scheid, L. M., Brown, L. S., Clark, C. & Rosenfeld, C. R. Data electronically extracted from the electronic health record require validation. J. Perinatol. 39 , 468–474 (2019).

Phelan, M., Bhavsar, N. A. & Goldstein, B. A. Illustrating informed presence bias in electronic health records data: how patient interactions with a health system can impact inference. EGEMS (Wash. DC). 5 , 22 (2017).

Secondary Analysis of Electronic Health Records (ed MIT Critical Data) (Springer, 2016).

Jetley, G. & Zhang, H. Electronic health records in IS research: quality issues, essential thresholds and remedial actions. Decis. Support Syst. 126 , 113137 (2019).

McCormack, J. P. & Holmes, D. T. Your results may vary: the imprecision of medical measurements. BMJ 368 , m149 (2020).

Hobbs, F. D. et al. Is the international normalised ratio (INR) reliable? A trial of comparative measurements in hospital laboratory and primary care settings. J. Clin. Pathol. 52 , 494–497 (1999).

Huguet, N. et al. Using electronic health records in longitudinal studies: estimating patient attrition. Med. Care 58 Suppl 6 Suppl 1 , S46–S52 (2020).

Zeng, J., Gensheimer, M. F., Rubin, D. L., Athey, S. & Shachter, R. D. Uncovering interpretable potential confounders in electronic medical records. Nat. Commun. 13 , 1014 (2022).

Getzen, E., Ungar, L., Mowery, D., Jiang, X. & Long, Q. Mining for equitable health: assessing the impact of missing data in electronic health records. J. Biomed. Inform. 139 , 104269 (2023).

Tang, S. et al. Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data. J. Am. Med. Inform. Assoc. 27 , 1921–1934 (2020).

Dagliati, A. et al. A process mining pipeline to characterize COVID-19 patients’ trajectories and identify relevant temporal phenotypes from EHR data. Front. Public Health 10 , 815674 (2022).

Sun, Y. & Zhou, Y.-H. A machine learning pipeline for mortality prediction in the ICU. Int. J. Digit. Health 2 , 3 (2022).

Mandyam, A., Yoo, E. C., Soules, J., Laudanski, K. & Engelhardt, B. E. COP-E-CAT: cleaning and organization pipeline for EHR computational and analytic tasks. In Proc. of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. https://doi.org/10.1145/3459930.3469536 (Association for Computing Machinery, 2021).

Gao, C. A. et al. A machine learning approach identifies unresolving secondary pneumonia as a contributor to mortality in patients with severe pneumonia, including COVID-19. J. Clin. Invest. 133 , e170682 (2023).

Makam, A. N. et al. The good, the bad and the early adopters: providers’ attitudes about a common, commercial EHR. J. Eval. Clin. Pract. 20 , 36–42 (2014).

Amezquita, R. A. et al. Orchestrating single-cell analysis with Bioconductor. Nat. Methods 17 , 137–145 (2020).

Virshup, I. et al. The scverse project provides a computational ecosystem for single-cell omics data analysis. Nat. Biotechnol. 41 , 604–606 (2023).

Zou, Q. et al. Predicting diabetes mellitus with machine learning techniques. Front. Genet. 9 , 515 (2018).

Cios, K. J. & William Moore, G. Uniqueness of medical data mining. Artif. Intell. Med. 26 , 1–24 (2002).

Zeng, X. et al. PIC, a paediatric-specific intensive care database. Sci. Data 7 , 14 (2020).

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562 , 203–209 (2018).

Lee, J. et al. Open-access MIMIC-II database for intensive care research. Annu. Int. Conf. Proc. IEEE Eng. Med. Biol. Soc. 2011 , 8315–8318 (2011).

Virshup, I., Rybakov, S., Theis, F. J., Angerer, P. & Alexander Wolf, F. anndata: annotated data. Preprint at bioRxiv https://doi.org/10.1101/2021.12.16.473007 (2021).

Voss, E. A. et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J. Am. Med. Inform. Assoc. 22 , 553–564 (2015).

Vasilevsky, N. A. et al. Mondo: unifying diseases for the world, by the world. Preprint at medRxiv https://doi.org/10.1101/2022.04.13.22273750 (2022).

Harrison, J. E., Weber, S., Jakob, R. & Chute, C. G. ICD-11: an international classification of diseases for the twenty-first century. BMC Med. Inform. Decis. Mak. 21 , 206 (2021).

Köhler, S. et al. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 47 , D1018–D1027 (2019).

Wu, P. et al. Mapping ICD-10 and ICD-10-CM codes to phecodes: workflow development and initial evaluation. JMIR Med. Inform. 7 , e14325 (2019).

Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19 , 15 (2018).

Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res . 12 , 2825–2830 (2011).

de Haan-Rietdijk, S., de Haan-Rietdijk, S., Kuppens, P. & Hamaker, E. L. What’s in a day? A guide to decomposing the variance in intensive longitudinal data. Front. Psychol. 7 , 891 (2016).

Pedersen, E. S. L., Danquah, I. H., Petersen, C. B. & Tolstrup, J. S. Intra-individual variability in day-to-day and month-to-month measurements of physical activity and sedentary behaviour at work and in leisure-time among Danish adults. BMC Public Health 16 , 1222 (2016).

Roffey, D. M., Byrne, N. M. & Hills, A. P. Day-to-day variance in measurement of resting metabolic rate using ventilated-hood and mouthpiece & nose-clip indirect calorimetry systems. JPEN J. Parenter. Enter. Nutr. 30 , 426–432 (2006).

Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13 , 845–848 (2016).

Lange, M. et al. CellRank for directed single-cell fate mapping. Nat. Methods 19 , 159–170 (2022).

Weiler, P., Lange, M., Klein, M., Pe'er, D. & Theis, F. CellRank 2: unified fate mapping in multiview single-cell data. Nat. Methods 21 , 1196–1205 (2024).

Zhang, S. et al. Cost of management of severe pneumonia in young children: systematic analysis. J. Glob. Health 6 , 010408 (2016).

Torres, A. et al. Pneumonia. Nat. Rev. Dis. Prim. 7 , 25 (2021).

Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9 , 5233 (2019).

Kamin, W. et al. Liver involvement in acute respiratory infections in children and adolescents—results of a non-interventional study. Front. Pediatr. 10 , 840008 (2022).

Shi, T. et al. Risk factors for mortality from severe community-acquired pneumonia in hospitalized children transferred to the pediatric intensive care unit. Pediatr. Neonatol. 61 , 577–583 (2020).

Dudnyk, V. & Pasik, V. Liver dysfunction in children with community-acquired pneumonia: the role of infectious and inflammatory markers. J. Educ. Health Sport 11 , 169–181 (2021).

Charpignon, M.-L. et al. Causal inference in medical records and complementary systems pharmacology for metformin drug repurposing towards dementia. Nat. Commun. 13 , 7652 (2022).

Grief, S. N. & Loza, J. K. Guidelines for the evaluation and treatment of pneumonia. Prim. Care 45 , 485–503 (2018).

Paul, M. Corticosteroids for pneumonia. Cochrane Database Syst. Rev. 12 , CD007720 (2017).

Sharma, A. & Kiciman, E. DoWhy: an end-to-end library for causal inference. Preprint at arXiv https://doi.org/10.48550/ARXIV.2011.04216 (2020).

Khilnani, G. C. et al. Guidelines for antibiotic prescription in intensive care unit. Indian J. Crit. Care Med. 23 , S1–S63 (2019).

Harris, L. K. & Crannage, A. J. Corticosteroids in community-acquired pneumonia: a review of current literature. J. Pharm. Technol. 37 , 152–160 (2021).

Dou, L. et al. Decreased hospital length of stay with early administration of oseltamivir in patients hospitalized with influenza. Mayo Clin. Proc. Innov. Qual. Outcomes 4 , 176–182 (2020).

Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50 , 1219–1224 (2018).

Julkunen, H. et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nat. Commun. 14 , 604 (2023).

Ko, F. et al. Associations with retinal pigment epithelium thickness measures in a large cohort: results from the UK Biobank. Ophthalmology 124 , 105–117 (2017).

Patel, P. J. et al. Spectral-domain optical coherence tomography imaging in 67 321 adults: associations with macular thickness in the UK Biobank study. Ophthalmology 123 , 829–840 (2016).

D’Agostino Sr, R. B. et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation 117 , 743–753 (2008).

Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nat. Med. 28 , 2309–2320 (2022).

Xu, Y. et al. An atlas of genetic scores to predict multi-omic traits. Nature 616 , 123–131 (2023).

Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37 , 547–554 (2019).

Rousan, L. A., Elobeid, E., Karrar, M. & Khader, Y. Chest x-ray findings and temporal lung changes in patients with COVID-19 pneumonia. BMC Pulm. Med. 20 , 245 (2020).

Signoroni, A. et al. BS-Net: learning COVID-19 pneumonia severity on a large chest X-ray dataset. Med. Image Anal. 71 , 102046 (2021).

Bird, S. et al. Fairlearn: a toolkit for assessing and improving fairness in AI. https://www.microsoft.com/en-us/research/publication/fairlearn-a-toolkit-for-assessing-and-improving-fairness-in-ai/ (2020).

Strack, B. et al. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed. Res. Int. 2014 , 781670 (2014).

Stekhoven, D. J. & Bühlmann, P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28 , 112–118 (2012).

Banerjee, A. et al. Identifying subtypes of heart failure from three electronic health record sources with machine learning: an external, prognostic, and genetic validation study. Lancet Digit. Health 5 , e370–e379 (2023).

Nagamine, T. et al. Data-driven identification of heart failure disease states and progression pathways using electronic health records. Sci. Rep. 12 , 17871 (2022).

Da Silva Filho, J. et al. Disease trajectories in hospitalized COVID-19 patients are predicted by clinical and peripheral blood signatures representing distinct lung pathologies. Preprint at bioRxiv https://doi.org/10.1101/2023.09.08.23295024 (2023).

Haneuse, S., Arterburn, D. & Daniels, M. J. Assessing missing data assumptions in EHR-based studies: a complex and underappreciated task. JAMA Netw. Open 4 , e210184 (2021).

Little, R. J. A. A test of missing completely at random for multivariate data with missing values. J. Am. Stat. Assoc. 83 , 1198–1202 (1988).

Jakobsen, J. C., Gluud, C., Wetterslev, J. & Winkel, P. When and how should multiple imputation be used for handling missing data in randomised clinical trials—a practical guide with flowcharts. BMC Med. Res. Methodol. 17 , 162 (2017).

Dziura, J. D., Post, L. A., Zhao, Q., Fu, Z. & Peduzzi, P. Strategies for dealing with missing data in clinical trials: from design to analysis. Yale J. Biol. Med. 86 , 343–358 (2013).

White, I. R., Royston, P. & Wood, A. M. Multiple imputation using chained equations: issues and guidance for practice. Stat. Med. 30 , 377–399 (2011).

Jäger, S., Allhorn, A. & Bießmann, F. A benchmark for data imputation methods. Front. Big Data 4 , 693674 (2021).

Waljee, A. K. et al. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3 , e002847 (2013).

Ibrahim, J. G. & Molenberghs, G. Missing data methods in longitudinal studies: a review. Test (Madr.) 18 , 1–43 (2009).

Li, C., Alsheikh, A. M., Robinson, K. A. & Lehmann, H. P. Use of recommended real-world methods for electronic health record data analysis has not improved over 10 years. Preprint at bioRxiv https://doi.org/10.1101/2023.06.21.23291706 (2023).

Regev, A. et al. The Human Cell Atlas. eLife 6 , e27041 (2017).

Megill, C. et al. cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices. Preprint at bioRxiv https://doi.org/10.1101/2021.04.05.438318 (2021).

Speir, M. L. et al. UCSC Cell Browser: visualize your single-cell data. Bioinformatics 37 , 4578–4580 (2021).

Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9 , 90–95 (2007).

Waskom, M. seaborn: statistical data visualization. J. Open Source Softw. 6 , 3021 (2021).

Harris, C. R. et al. Array programming with NumPy. Nature 585 , 357–362 (2020).

Lam, S. K., Pitrou, A. & Seibert, S. Numba: a LLVM-based Python JIT compiler. In Proc. of the Second Workshop on the LLVM Compiler Infrastructure in HPC. https://doi.org/10.1145/2833157.2833162 (Association for Computing Machinery, 2015).

Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17 , 261–272 (2020).

McKinney, W. Data structures for statistical computing in Python. In Proc. of the 9th Python in Science Conference (eds van der Walt, S. & Millman, J.). https://doi.org/10.25080/majora-92bf1922-00a (SciPy, 2010).

Boulanger, A. Open-source versus proprietary software: is one more reliable and secure than the other? IBM Syst. J. 44 , 239–248 (2005).

Rocklin, M. Dask: parallel computation with blocked algorithms and task scheduling. In Proc. of the 14th Python in Science Conference. https://doi.org/10.25080/majora-7b98e3ed-013 (SciPy, 2015).

Pivarski, J. et al. Awkward Array. https://doi.org/10.5281/ZENODO.4341376

Collette, A. Python and HDF5: Unlocking Scientific Data (‘O’Reilly Media, Inc., 2013).

Miles, A. et al. zarr-developers/zarr-python: v2.13.6. https://doi.org/10.5281/zenodo.7541518 (2023).

The pandas development team. pandas-dev/pandas: Pandas. https://doi.org/10.5281/ZENODO.3509134 (2024).

Weberpals, J. et al. Deep learning-based propensity scores for confounding control in comparative effectiveness research: a large-scale, real-world data study. Epidemiology 32 , 378–388 (2021).

Rosenthal, J. et al. Building tools for machine learning and artificial intelligence in cancer research: best practices and a case study with the PathML toolkit for computational pathology. Mol. Cancer Res. 20 , 202–206 (2022).

Gayoso, A. et al. A Python library for probabilistic analysis of single-cell omics data. Nat. Biotechnol. 40 , 163–166 (2022).

Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.). 8024–8035 (Curran Associates, 2019).

Frostig, R., Johnson, M. & Leary, C. Compiling machine learning programs via high-level tracing. https://cs.stanford.edu/~rfrostig/pubs/jax-mlsys2018.pdf (2018).

Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616 , 259–265 (2023).

Kraljevic, Z. et al. Multi-domain clinical natural language processing with MedCAT: the Medical Concept Annotation Toolkit. Artif. Intell. Med. 117 , 102083 (2021).

Pollard, T. J., Johnson, A. E. W., Raffa, J. D. & Mark, R. G. An open source Python package for producing summary statistics for research papers. JAMIA Open 1 , 26–31 (2018).

Ellen, J. G. et al. Participant flow diagrams for health equity in AI. J. Biomed. Inform. 152 , 104631 (2024).

Schouten, R. M. & Vink, G. The dance of the mechanisms: how observed information influences the validity of missingness assumptions. Sociol. Methods Res. 50 , 1243–1258 (2021).

Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8 , 118–127 (2007).

Davidson-Pilon, C. lifelines: survival analysis in Python. J. Open Source Softw. 4 , 1317 (2019).

Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57 , 289–300 (1995).

Wishart, D. S. et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34 , D668–D672 (2006).

Harrell, F. E. Jr, Califf, R. M., Pryor, D. B., Lee, K. L. & Rosati, R. A. Evaluating the yield of medical tests. JAMA 247 , 2543–2546 (1982).

Currant, H. et al. Genetic variation affects morphological retinal phenotypes extracted from UK Biobank optical coherence tomography images. PLoS Genet. 17 , e1009497 (2021).

Cohen, J. P. et al. TorchXRayVision: a library of chest X-ray datasets and models. In Proc. of the 5th International Conference on Medical Imaging with Deep Learning (eds Konukoglu, E. et al.). 172 , 231–249 (PMLR, 2022).

Cohen, J.P., Hashir, M., Brooks, R. & Bertrand, H. On the limits of cross-domain generalization in automated X-ray prediction. In Proceedings of Machine Learning Research , Vol. 121 (eds Arbel, T. et al.) 136–155 (PMLR, 2020).

Acknowledgements

We thank M. Ansari who designed the ehrapy logo. The authors thank F. A. Wolf, M. Lücken, J. Steinfeldt, B. Wild, G. Rätsch and D. Shung for feedback on the project. We further thank L. Halle, Y. Ji, M. Lücken and R. K. Rubens for constructive comments on the paper. We thank F. Hashemi for her help in implementing the survival analysis module. This research was conducted using data from the UK Biobank, a major biomedical database ( https://www.ukbiobank.ac.uk ), under application number 49966. This work was supported by the German Center for Lung Research (DZL), the Helmholtz Association and the CRC/TRR 359 Perinatal Development of Immune Cell Topology (PILOT). N.H. and F.J.T. acknowledge support from the German Federal Ministry of Education and Research (BMBF) (LODE, 031L0210A), co-funded by the European Union (ERC, DeepCell, 101054957). A.N. is supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD program Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research. This work was also supported by the Chan Zuckerberg Initiative (CZIF2022-007488; Human Cell Atlas Data Ecosystem).

Open access funding provided by Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH).

Author information

Authors and Affiliations

Institute of Computational Biology, Helmholtz Munich, Munich, Germany

Lukas Heumos, Philipp Ehmele, Tim Treis, Eljas Roellin, Lilly May, Altana Namsaraeva, Nastassya Horlava, Vladimir A. Shitov, Xinyue Zhang, Luke Zappia, Leon Hetzel, Isaac Virshup, Lisa Sikkema, Fabiola Curion & Fabian J. Theis

Institute of Lung Health and Immunity and Comprehensive Pneumology Center with the CPC-M bioArchive; Helmholtz Zentrum Munich; member of the German Center for Lung Research (DZL), Munich, Germany

Lukas Heumos, Niklas J. Lang, Herbert B. Schiller & Anne Hilgendorff

TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany

Lukas Heumos, Tim Treis, Nastassya Horlava, Vladimir A. Shitov, Lisa Sikkema & Fabian J. Theis

Health Data Science Unit, Heidelberg University and BioQuant, Heidelberg, Germany

Julius Upmeier zu Belzen & Roland Eils

Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany

Eljas Roellin, Lilly May, Luke Zappia, Leon Hetzel, Fabiola Curion & Fabian J. Theis

Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA), Darmstadt, Germany

Altana Namsaraeva

Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Bonn, Germany

Rainer Knoll

Center for Digital Health, Berlin Institute of Health (BIH) at Charité – Universitätsmedizin Berlin, Berlin, Germany

Roland Eils

Research Unit, Precision Regenerative Medicine (PRM), Helmholtz Munich, Munich, Germany

Herbert B. Schiller

Center for Comprehensive Developmental Care (CDeCLMU) at the Social Pediatric Center, Dr. von Hauner Children’s Hospital, LMU Hospital, Ludwig Maximilian University, Munich, Germany

Anne Hilgendorff


Contributions

L. Heumos and F.J.T. conceived the study. L. Heumos, P.E., X.Z., E.R., L.M., A.N., L.Z., V.S., T.T., L. Hetzel, N.H., R.K. and I.V. implemented ehrapy. L. Heumos, P.E., N.L., L.S., T.T. and A.H. analyzed the PIC database. J.U.z.B. and L. Heumos analyzed the UK Biobank database. X.Z. and L. Heumos analyzed the COVID-19 chest x-ray dataset. L. Heumos, P.E. and J.U.z.B. wrote the paper. F.J.T., A.H., H.B.S. and R.E. supervised the work. All authors read, corrected and approved the final paper.

Corresponding author

Correspondence to Fabian J. Theis .

Ethics declarations

Competing interests

L. Heumos is an employee of LaminLabs. F.J.T. consults for Immunai Inc., Singularity Bio B.V., CytoReason Ltd. and Omniscope Ltd. and has ownership interest in Dermagnostix GmbH and Cellarity. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks Leo Anthony Celi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary handling editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Overview of the Paediatric Intensive Care Database (PIC).

The database consists of several tables corresponding to several data modalities and measurement types. All tables colored in green were selected for analysis and all tables in blue were discarded based on coverage rate. Despite the high coverage rate, we discarded the ‘OR_EXAM_REPORTS’ table because of the lack of detail in the exam reports.

Extended Data Fig. 2 Preprocessing of the Paediatric Intensive Care (PIC) dataset with ehrapy.

( a ) Heterogeneous data of the PIC database is stored in ‘data’ (the matrix used for computations) and ‘observations’ (metadata per patient visit). During quality control, further annotations are added to the ‘variables’ (metadata per feature) slot. ( b ) Preprocessing steps of the PIC dataset. ( c ) Example of the function calls in the data analysis pipeline that mirrors the preprocessing steps in (b) using ehrapy.
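The ‘data’/‘observations’/‘variables’ slots described above correspond to the AnnData layout familiar from the scanpy ecosystem, which ehrapy builds on. A minimal sketch of that three-part structure using only numpy and pandas (toy values and feature names are invented for illustration; this is not the PIC data or ehrapy's actual API):

```python
import numpy as np
import pandas as pd

# 'data': numeric matrix used for computations (patient visits x features)
X = np.array([[7.2, 120.0],
              [6.8, 135.0],
              [7.5, 110.0]])

# 'observations': metadata per patient visit (one row of X each)
obs = pd.DataFrame(
    {"age_years": [4, 9, 15], "care_unit": ["GICU", "PICU", "GICU"]},
    index=["visit_1", "visit_2", "visit_3"],
)

# 'variables': metadata per feature (one column of X each),
# extended with further annotations during quality control
var = pd.DataFrame(
    {"unit": ["mmol/L", "mmHg"], "missing_pct": [0.0, 0.0]},
    index=["glucose", "systolic_bp"],
)

# rows of X align with visits, columns with features
assert X.shape == (len(obs), len(var))
```

Keeping the computation matrix separate from the per-visit and per-feature metadata is what lets quality-control steps annotate features without touching the numeric data itself.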

Extended Data Fig. 3 Missing data distribution for the ‘youths’ group of the PIC dataset.

The x-axis represents the percentage of missing values in each feature. The y-axis reflects the number of features in each bin with text labels representing the names of the individual features.

Extended Data Fig. 4 Patient selection during analysis of the PIC dataset.

Filtering for the pneumonia cohort of the ‘youths’ group removes all care units except the general intensive care unit and the paediatric intensive care unit.

Extended Data Fig. 5 Feature rankings of stratified patient groups.

Scores reflect the z-score underlying the p-value per measurement for each group. Higher scores (above 0) reflect overrepresentation of the measurement compared to all other groups and vice versa. ( a ) By clinical chemistry. ( b ) By liver markers. ( c ) By medication type. ( d ) By infection markers.

Extended Data Fig. 6 Liver marker value progression for the ‘youths’ group and Kaplan-Meier curves.

( a ) Viral and severe pneumonia with co-infection groups display enriched gamma-glutamyl transferase levels in blood serum. ( b ) Aspartate aminotransferase (AST) and alanine aminotransferase (ALT) levels are enriched for severe pneumonia with co-infection during early ICU stay. ( c ) and ( d ) Kaplan-Meier curves for ALT and AST demonstrate lower survival for children with measurements outside the norm.
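Kaplan-Meier curves like those in panels (c) and (d) follow the standard product-limit estimator. A self-contained numpy sketch on a toy cohort (illustrative only; the function and data below are invented for the example, and ehrapy's survival module wraps established implementations rather than this code):

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    times:  follow-up times (e.g. days in ICU)
    events: 1 if the event occurred, 0 if the subject was censored
    Returns arrays of distinct event times and survival probabilities.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    order = np.argsort(times)
    times, events = times[order], events[order]

    surv, out_t, out_s = 1.0, [], []
    n_at_risk = len(times)
    for t in np.unique(times):
        mask = times == t
        d = int(events[mask].sum())       # events observed at time t
        if d > 0:
            surv *= 1.0 - d / n_at_risk   # product-limit step
            out_t.append(t)
            out_s.append(surv)
        n_at_risk -= int(mask.sum())      # drop events and censored subjects
    return np.array(out_t), np.array(out_s)

# toy cohort of five subjects; 0 marks a censored observation
t, s = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
# survival drops to 0.8, then 0.6, then 0.3 at times 2, 3 and 5
```

Censored subjects still count toward the at-risk set up to their censoring time, which is why the curve only steps at observed event times.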

Extended Data Fig. 7 Overview of medication categories used for causal inference.

( a ) Feature engineering process to group administered medications into medication categories using DrugBank. ( b ) Number of medications per medication category. ( c ) Number of patients that received (dark blue) and did not receive (light blue) specific medication categories.

Extended Data Fig. 8 UK-Biobank data overview and quality control across modalities.

( a ) UMAP plot of the metabolomics data demonstrating a clear gradient with respect to age at sampling, and ( b ) type 2 diabetes prevalence. ( c ) Analogously, the features derived from retinal imaging show a less pronounced age gradient, and ( d ) type 2 diabetes prevalence gradient. ( e ) Stratifying myocardial infarction risk by the type 2 diabetes comorbidity confirms vastly increased risk with a prior type 2 diabetes (T2D) diagnosis. Kaplan-Meier estimators with 95% confidence intervals are shown. ( f ) Similarly, the polygenic risk score for coronary heart disease used in this work substantially enriches myocardial infarction risk in its top 5%. Kaplan-Meier estimators with 95% confidence intervals are shown. ( g ) UMAP visualization of the metabolomics features colored by the assessment center shows no discernible biases. (a–g) n = 29,216.

Extended Data Fig. 9 UK-Biobank retina derived feature quality control.

( a ) Leiden clustering of the retina-derived feature space. ( b ) Comparison of ‘overall retinal pigment epithelium (RPE) thickness’ values between cluster 5 (n = 301) and the rest of the population (n = 28,915). ( c ) RPE thickness in the right eye outliers on the UMAP largely corresponds to cluster 5. ( d ) Log ratio of top and bottom 5 fields in obs dataframe between cluster 5 and the rest of the population. ( e ) Image quality of the optical coherence tomography scan as reported in the UKB. ( f ) Minimum motion correlation quality control indicator. ( g ) Inner limiting membrane (ILM) quality control indicator. (d–g) Data are shown for the right eye only; comparable results for the left eye are omitted. (a–g) n = 29,216.

Extended Data Fig. 10 Bias detection and mitigation study on the Diabetes 130-US hospitals dataset (n = 101,766 hospital visits, one patient can have multiple visits).

( a ) Filtering to the visits of Medicare recipients results in an increase of Caucasians. ( b ) Proportion of visits where HbA1c measurements are recorded, stratified by admission type. Adjusted P values were calculated with chi-squared tests and Bonferroni correction (adjusted P values: Emergency vs Referral 3.3E-131, Emergency vs Other 1.4E-101, Referral vs Other 1.6E-4). ( c ) Normalizing feature distributions jointly vs. separately can mask distribution differences. ( d ) Imputing the number of medications for visits. Onto the complete data (blue), MCAR (30% missing data) and MAR (38% missing data) were introduced (orange), with the MAR mechanism depending on the time in hospital. Mean imputation (green) can reduce the variance of the distribution under MCAR and MAR mechanisms, and bias the center of the distribution under an MAR mechanism. Multiple imputation methods such as MissForest can impute meaningfully even in MAR cases when they have access to the variables involved in the MAR mechanism. Each boxplot represents the IQR of the data, with the horizontal line inside the box indicating the median value. The left and right bounds of the box represent the first and third quartiles, respectively. The ‘whiskers’ extend to the minimum and maximum values within 1.5 times the IQR from the lower and upper quartiles, respectively. ( e ) Predicting the early readmission within 30 days after release on a per-stay level. Balanced accuracy can mask differences in selection and false negative rate between sensitive groups.
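The variance-shrinking effect of mean imputation under MCAR described in panel (d) is easy to reproduce. A hedged numpy illustration on synthetic data (the feature distribution below is an assumption made for the example, not the Diabetes 130-US dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic 'number of medications' feature (assumed distribution)
complete = rng.normal(loc=16.0, scale=8.0, size=10_000)

# MCAR mechanism: remove 30% of values completely at random
missing = rng.random(complete.size) < 0.30
observed = complete.copy()
observed[missing] = np.nan

# mean imputation: fill every gap with the observed mean
imputed = observed.copy()
imputed[missing] = np.nanmean(observed)

# the center is roughly preserved under MCAR, but ~30% of values now
# sit exactly at the mean, so the variance of the feature shrinks
print(f"variance: {np.var(complete):.1f} -> {np.var(imputed):.1f}")
```

Because the variance of the imputed feature is roughly the original variance times the observed fraction, downstream analyses that rely on spread (standardization, outlier detection) are silently biased, which is why the figure contrasts mean imputation with multiple imputation.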

Supplementary information

Supplementary Tables 1 and 2

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Heumos, L., Ehmele, P., Treis, T. et al. An open-source framework for end-to-end analysis of electronic health record data. Nat Med (2024). https://doi.org/10.1038/s41591-024-03214-0


Received : 11 December 2023

Accepted : 25 July 2024

Published : 12 September 2024

DOI : https://doi.org/10.1038/s41591-024-03214-0

