
What is an Indicator?

An indicator provides a measure of a concept, and is typically used in quantitative research.

It is useful to distinguish between an indicator and a measure:

Measures refer to things that can be relatively unambiguously counted, such as personal income, household income, age, number of children, or number of years spent at school. Measures, in other words, are quantities. If we are interested in some of the changes in personal income, the latter can be quantified in a reasonably direct way (assuming we have access to all the relevant data).

Sociologists use indicators to tap concepts that are less directly quantifiable, such as job satisfaction. If we are interested in the causes of variation of job satisfaction, we will need indicators that stand for the concept of ‘job satisfaction’. These indicators will allow the level of ‘job satisfaction’ to be measured, and we can treat the resulting quantitative information as if it were a measure.

An indicator, then, is something that is devised and then employed as though it were a measure of a concept.
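To make this concrete, here is a minimal sketch in Python, using hypothetical Likert-style survey items and a simple averaging rule (not any standard instrument), of how responses might be combined into a 'job satisfaction' indicator that can then be treated as though it were a measure.

```python
# Minimal sketch (hypothetical item names and scoring, not a standard scale)
# of how survey responses might be combined into a 'job satisfaction' indicator.

likert_items = {                      # 1 = strongly disagree ... 5 = strongly agree
    "satisfied_with_pay": 4,
    "work_feels_meaningful": 5,
    "would_recommend_employer": 3,
    "rarely_thinks_about_quitting": 2,
}

def job_satisfaction_score(responses: dict[str, int]) -> float:
    """Average the item scores and rescale to 0-100 so the resulting
    indicator can be treated as if it were a quantitative measure."""
    mean_item = sum(responses.values()) / len(responses)
    return (mean_item - 1) / 4 * 100  # map the 1..5 range onto 0..100

print(job_satisfaction_score(likert_items))  # 62.5 for the example above
```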

Direct and indirect indicators

Direct indicators are ones which are closely related to the concept being measured. For example, questions about how much a person earns each month are direct indicators of personal income; but the same questions would only be an indirect indicator of the concept of social class background.



Improving Measures of Science, Technology, and Innovation: Interim Report (2012)

Chapter 2: Concepts and Uses of Indicators

THE ROLE OF INDICATORS

There are myriad descriptions and definitions of indicators—their composition, use, and limitations. The National Center for Science and Engineering Statistics (NCSES) defines an indicator as “a statistical proxy of one or more metrics that allow for an assessment of a condition.” 1 Indicators allow one to assess the current status of a project, program, or other activity and how far one is from targets or goals. In many circumstances, an activity is not directly measurable, and therefore indicators provide analytically proximate values that are based on expert judgement.

Indicators of science, technology, and innovation (STI) often substitute for direct measures of knowledge creation, invention, innovation, technological diffusion, and science and engineering talent, which would be difficult if not impossible to obtain. Techniques are improving for obtaining data that directly measure innovation activities, and these data are already being used to complement indicators that are derived from traditional methods. STI indicators, however, will still have an important role to play in informing policy decisions, especially if they are based on tested analytical frameworks.

USES AND DESIRABLE ATTRIBUTES OF INDICATORS

STI indicators are often used to relate knowledge inputs to outputs, outcomes or impacts. At a very basic level, knowledge inputs include years of schooling, level of degree, and the amount of training an employee receives on the job. Outputs are specific products, processes, or services. Outcomes and impacts are the near-term and longer-term effects and ramifications for the economy or society in which the technological ecosystem operates. 2 Indicators are relied on for both post-activity evaluations and analysis prior to an activity, although there are major limitations in using STI indicators for predictive exercises. Foresight is often the best that can be asked of indicators.

____________

1 Definition from NCSES (personal communication, 2011). In that communication, NCSES also provided the definitions of “data” and “metric”: data— information in raw or unorganized form that represents conditions, ideas, or objects; metric—a systematic measurement of data.

2 For example, scientific advancement in the detection and removal of pathogenic microorganisms led to technological mechanisms that in turn lead to cleaner water, thereby increasing productivity (through a healthier workforce) and hence increasing inputs in the production of goods and services, as well as increasing the welfare of citizens.

A comprehensive review of the use of STI indicators for policy decisions is found in Gault (2010), who outlines four ways that indicators are used for policy purposes: monitoring, benchmarking, evaluating, and “foresighting.” 3

At the panel’s workshop, several presenters described attributes of indicators that NCSES should keep in mind as it develops new STI indicators. One important desirable attribute that was emphasized is a low sensitivity to manipulation. In addition, STI indicators are like baseball statistics—it is unlikely that any single statistic tells the whole story. Instead, users will need to rely on a collection or suite of indicators. During the workshop, Hugo Hollanders of UNU-MERIT 4 noted that composite indices nevertheless hold both political and media appeal. 5 Other ideal characteristics of indicators that workshop participants mentioned included being scientifically derived/evidence based; comparable across regions; powerful for communication; affordable; accessible; scalable; sustainable; and policy and analytically relevant. STI indicators need to be policy neutral, even though the particular ones selected may reflect the preferences of the stakeholders who request them.
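As an illustration of why composite indices appeal to users, the following minimal sketch, with hypothetical country values, indicator names, and equal weights chosen purely for illustration, shows the min-max normalization and averaging on which such indices typically rest. As footnote 5 makes clear, the panel is not advocating a headline indicator of this kind.

```python
# Minimal sketch (hypothetical countries, indicator names and values; equal
# weights) of a composite index built by min-max normalization and averaging.

countries = {
    "A": {"rnd_intensity": 3.1, "phd_graduates": 1.8, "patents_per_capita": 120},
    "B": {"rnd_intensity": 1.9, "phd_graduates": 2.4, "patents_per_capita": 45},
    "C": {"rnd_intensity": 2.5, "phd_graduates": 0.9, "patents_per_capita": 310},
}

def composite_index(data: dict[str, dict[str, float]]) -> dict[str, float]:
    indicators = list(next(iter(data.values())))
    lo = {i: min(c[i] for c in data.values()) for i in indicators}
    hi = {i: max(c[i] for c in data.values()) for i in indicators}
    scores = {}
    for country, values in data.items():
        # rescale each indicator to 0..1, then take the (equally weighted) mean
        normalized = [(values[i] - lo[i]) / (hi[i] - lo[i]) for i in indicators]
        scores[country] = sum(normalized) / len(normalized)
    return scores

print(composite_index(countries))  # roughly {'A': 0.63, 'B': 0.33, 'C': 0.50}
```

The normalization and weighting choices are themselves judgment calls, which is one reason a suite of separate indicators can be more informative than a single composite.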

Although the production of indicators across many fields has an established history, there are at least three major cautions regarding their use that are important to note.

  • First, indicators can send mixed signals, which require expert judgment for interpretation. For example, increased innovation—which is key to advancing living standards—is often considered to enhance job creation. Policy makers discuss spurring innovation as a job-creation tactic. However, innovation can lead to fewer jobs if new processes or managerial expertise increase efficiency. Short-term displacement of workers in one industry or sector can be counterbalanced in the longer term by development of new products, services, and even sectors and by increased market demand if process efficiencies drive down prices (see Pianta, 2005; Van Reenen, 1997). One way to be cautious about mixed signals is to develop STI indicators that support analysis by time scale, sector, and geographic location.
  • Second, a given metric, once it becomes widely used, changes the behavior of the people and practices it attempts to measure. The worst thing a metric can do is not just to deliver a bad (i.e., misleading) answer, but to incentivize bad practice (see, e.g., West and Bergstrom, 2010). It is important that indicators avoid sending distorted signals to users.
  • Third, not everything that counts can be counted, and not everything that can be counted counts (an idea attributed to Albert Einstein). It seems clear that some outcome measures that reflect the importance of research and development (R&D) and innovation to society are elusive. For example, social well-being is difficult to measure, yet one of the key interests of policy makers is the return on investment of public funding for science and technology for the good of society.

3 For example, at the panel’s workshop Changlin Gao reported that China is targeting its STI indicators program on four broad uses: (1) monitoring (the international innovation system; linkages within and between national innovation systems; regional innovation systems and industrial clusters; firms; innovation; the implementation of national S&T projects; the selected quantitative indicators in the S&T development goals); (2) evaluating (performance of public investment in S&T; performance of government research institutes and national labs; national S&T programs; specialization of S&T fields; advantages versus disadvantages; new emerging industries, such as information technology, biotechnology, energy, health, and knowledge-based services); (3) benchmarking (international benchmarking; interprovincial benchmarking); and (4) forecasting (the latest data not available in gathered statistics).

4 UNU-MERIT—the U.N. University Maastricht Economic and Social Research Institute on Innovation and Technology—is a research and training center of the United Nations University and works in close collaboration with the University of Maastricht.

5 To clarify, the panel is not advocating that NCSES develop a “headline indicator.” A suite of key STI indicators should be more informative for users of the statistics.


BEYOND SCORING TO POLICY RELEVANCE

An important aspect of the charge to this panel is the assessment of the utility of STI indicators. Although the National Science Foundation (NSF) does not do policy work, the statistics that NCSES produces are often cited in debates about policies regarding the science and engineering enterprise. For instance, the American Association for the Advancement of Science (AAAS) annually prepares a report giving various breakdowns of R&D expenditures in the federal budget. These data are informed by NSF’s publications, National Patterns of R&D Resources and Federal Funds for Research and Development . In the latest report (American Association for the Advancement of Science, 2011), NSF data are used to show the role of innovation in productivity growth and how innovation affects the quality of life.

The Congressional Research Service (CRS) 6 regularly refers to the National Science Board’s Science and Engineering Indicators (SEI) biennial volumes (see National Science Board, 2010), which are prepared by NCSES. 7 The online version of SEI also has a sizable share of users outside the policy arena and outside the United States. There are several highly influential reports each year that rely on NCSES indicators to relate scientific inputs to socioeconomic outcomes. The final report of this panel will contain a comprehensive representation of the policy relevance of STI indicators.

In the course of its work to date, the panel queried a variety of users, including policy makers, government and academic administrators, researchers, and corporate decision makers in high-tech manufacturing and service industries. We also sought input from developers of STI indicators and from individuals who are called on by policy makers to do assessments of high-tech sectors in the United States and abroad. This input yielded dozens of questions that STI indicators could address. From the extensive list of questions and issues we received, the panel distilled eight key issues that are expected to be prominent in the minds of decision-makers for the foreseeable future: growth, productivity and jobs; STI activities; STI talent; private investment, government investment and procurement; institutions, networks, and regulations (including intellectual property protection and technology transfer); global STI activities and outcomes; subnational STI activities and outcomes; and systemic changes on the horizon. Box 2-1 shows the questions that flow from these issues. Although the policy relevance of the STI indicators is of primary importance for the panel’s work, the recommendations here and in the final report will address fundamental aspects of monitoring and benchmarking that are of broader interest.

6 See National Research Council. (2011, p. 86): “In meeting the requirements of Congress for objective and impartial analysis, CRS publishes periodic reports on trends in federal support for R&D, as well as reports on special topics in R&D funding.”

7 The National Science Board released the SEI 2012 on January 18, 2012. The chapter topics are unchanged in the new edition.

Key Issues and Questions for STI Indicators

  • Growth, Productivity and Jobs: What is the contribution of science, technology, and innovation (STI) activity to productivity, employment and growth? What is the relative importance of technological innovation versus non-technological innovation for economic growth? Is the United States falling behind with respect to innovation, and what are the effects on socioeconomic outcomes?
  • STI Activities: What are the drivers of innovation? How influential is R&D for innovation and growth (by sector)? What would constitute a “balance” between the biological and physical sciences? On what basis could that be determined? Does biological science depend on physical science for advancement? How important are the following for advancing innovation: small businesses, large businesses, strategic alliances, technology transfer between universities and firms, academic researchers, government labs and procurement activities, and nonprofit organizations? What are the emerging innovative sectors and what is unique about them?
  • STI Talent: How much knowledge capital does the United States have? How many people, possessing what kind of skills, are needed to achieve a robust STI system? What additional sources of “talent” can best be tapped—especially among immigrants, women, and minorities? How many science and engineering doctorate holders took nontraditional pathways into the science, technology, engineering and mathematics (STEM) workforce? Did this vary by race/ethnicity, gender or the existence of a disability? How important are community colleges in developing human resources for STEM talent? Is the U.S. falling behind in STEM workers? What fields other than science, technology, engineering, and mathematics are important for advances in STI?
  • Private Investment, Government Investment and Procurement: What impact does federal research spending have on innovation and economic health, and over what time frame? How large should the federal research budget be? How should policy makers decide where to put additional research dollars or reallocate existing funding streams—information and communications technology, biotechnology, physical science, nanotechnology, environmental technology, social science, etc.? Does government investment crowd out or energize private investment in STI activities? What is the role of entrepreneurship in driving innovation?
  • Institutions, Networks and Regulations: What impacts are federal research programs having on entrepreneurial activities in science and engineering sectors? Where are the key gaps in the transfer of scientific and technological knowledge that undercut the performance of the STI system? Where is the supposed “valley of death” in innovation? In which industries is the valley of death most prevalent? What part of the process is underfunded for specific sectors? What is the nature and impact of intellectual property protection on scientific and innovation outputs?
  • Global STI Activities and Outcomes: What can we learn from other countries, and what are other countries learning from us? In what technological areas are other countries accelerating? What impact does the international flow of STI have on U.S. economic performance? What is the relative cost of innovation inputs in the U.S. versus other countries? Where are multinational corporations sourcing R&D? What are the institutional differences that affect innovation activities among nations, and how are they changing?
  • Subnational STI Activities and Outcomes: How does innovation activity in a given firm at a given place contribute to that firm’s productivity, employment and growth, and perhaps also to these characteristics in the surrounding area? How are innovation supply chains working within a state? Are firms principally sourcing new knowledge from customers or from universities?
  • Systemic Changes on the Horizon: How will demographic shifts affect the STEM workforce, nationally and internationally? Will they shift the locus of the most highly productive regions? Will global financial crises slow innovation activities or merely shift the locus of activities? When will emerging economies be integrated into the global ecosystem of innovation, and what impact will that have on the system? How are public views of science and technology changing over time?

The National Center for Science and Engineering Statistics (NCSES), at the U.S. National Science Foundation, is 1 of 14 major statistical agencies in the federal government, of which at least 5 collect relevant information on science, technology, and innovation activities in the United States and abroad. The America COMPETES Reauthorization Act of 2010 expanded and codified NCSES's role as a U.S. federal statistical agency. Important aspects of the agency's mandate include the collection, acquisition, analysis, reporting, and dissemination of data on research and development trends, on U.S. competitiveness in science, technology, and research and development, and on the condition and progress of U.S. science, technology, engineering, and mathematics (STEM) education.

Improving Measures of Science, Technology and Innovation: Interim Report examines the status of NCSES's science, technology, and innovation (STI) indicators. The report assesses, and provides recommendations regarding, the need for revised, refocused, and newly developed indicators designed to better reflect the fundamental and rapid changes that are reshaping global science, technology and innovation systems. It also considers the international scope of STI indicators and the need for new indicators that measure developments in innovative activities in the United States and abroad, and it offers foresight on the types of data, metrics and indicators that will be particularly influential in evidentiary policy decision-making for years to come.

In carrying out its charge, the authoring panel undertook a broad and comprehensive review of STI indicators from different countries, including Japan, China, India and several countries in Europe, Latin America and Africa. Improving Measures of Science, Technology, and Innovation makes recommendations for near-term action by NCSES along two dimensions: (1) development of new policy-relevant indicators that are based on NCSES survey data or on data collections at other statistical agencies; and (2) exploration of new data extraction and management tools for generating statistics, using automated methods of harvesting unstructured or scientometric data and data derived from administrative records.



What makes a good quality indicator set? A systematic review of criteria

Laura Schang

Department of Methodology, Federal Institute for Quality Assurance and Transparency in Health Care (IQTIG), Katharina-Heinroth-Ufer 1, Berlin 10787, Germany

Iris Blotenberg

Dennis Boywitt

Associated Data

Data on all reviewed studies are incorporated in Supplementary Appendix 2 .

While single indicators measure a specific aspect of quality (e.g. timely support during labour), users of these indicators, such as patients, providers and policy-makers, are typically interested in some broader construct (e.g. quality of maternity care) whose measurement requires a set of indicators. However, guidance on desirable properties of indicator sets is lacking.

Based on the premise that a set of valid indicators does not guarantee a valid set of indicators, the aim of this review is twofold. First, we introduce content validity as a desirable property of indicator sets and review the extent to which studies in the peer-reviewed health care quality literature address this criterion. Second, to obtain a complete inventory of criteria, we examine what additional criteria for quality indicator sets have been used so far.

We searched the databases Web of Science, Medline, Cinahl and PsycInfo from inception to May 2021 and the reference lists of included studies. English- or German-language, peer-reviewed studies concerned with desirable characteristics of quality indicator sets were included. Applying qualitative content analysis, two authors independently coded the articles using a structured coding scheme and discussed conflicting codes until consensus was reached.

Of 366 studies screened, 62 were included in the review. Eighty-five per cent (53/62) of studies addressed at least one of the component criteria of content validity (content coverage, proportional representation and contamination) and 15% (9/62) addressed all component criteria. Studies used various content domains to structure the targeted construct (e.g. quality dimensions, elements of the care pathway and policy priorities), providing a framework to assess content validity. The review revealed four additional substantive criteria for indicator sets: cost of measurement (21% [13/62] of the included studies), prioritization of ‘essential’ indicators (21% [13/62]), avoidance of redundancy (13% [8/62]) and size of the set (15% [9/62]). Additionally, four procedural criteria were identified: stakeholder involvement (69% [43/62]), using a conceptual framework (44% [27/62]), defining the purpose of measurement (26% [16/62]) and transparency of the development process (8% [5/62]).

The concept of content validity and its component criteria help in assessing whether conclusions based on a set of indicators are valid conclusions about the targeted construct. To develop a valid indicator set, careful definition of the targeted construct, including its (sub-)domains, is paramount. Developers of quality indicators should specify the purpose of measurement and consider trade-offs with other criteria for indicator sets whose application may reduce content validity (e.g. costs of measurement) in light thereof.

Introduction

Health care quality indicators serve to enable their users—such as patients, providers and policy-makers—to make informed decisions based on the quality of care [ 1–3 ]. While single indicators measure specific aspects of quality [ 4 ], users of these measures are frequently interested in some broader construct. For instance, single indicators may measure the provision of smoking cessation advice or timely support during labour [ 5 ]. However, it is the quality of community-based maternity care that would be of interest to patients (e.g. when choosing a provider) or policy-makers (e.g. for accountability purposes) [ 5 , 6 ]. Since health care quality is multidimensional [ 7–9 ] and providers may perform relatively well on some aspects of care, but less so on others [ 10 ], multiple indicators are needed to measure constructs such as ‘quality of community-based maternity care’. Conclusions about such constructs thus depend on the properties not only of single indicators but also of the indicator set as a whole [ 11–14 ].

So far, however, recommendations for developing quality indicators focus primarily on the criteria for single indicators, such as the validity, reliability and feasibility of an indicator [see e.g. 4 , 15–22 ]. In contrast, guidance on desirable properties of indicator sets is lacking [ 13 , 23 ].

To address this gap, the ‘lens model’ [ 24–26 ] provides a helpful starting point: Accordingly, indicators serve as ‘cues’ forming the ‘lens’ through which users of measurement results ‘view’ the targeted construct (see Figure 1 ). If the ‘cues’ do not represent the construct in a valid fashion, conclusions about the construct may be misguided. Therefore, we propose that content validity constitutes an important property of indicator sets. Generally, assuring content validity of an indicator set means ensuring that the content of the assessment instrument adequately reflects the targeted construct [ 27–29 ]. There are three main threats to the content validity of an indicator set: omission of relevant indicators, overrepresentation of indicators for some aspects of care and inclusion of irrelevant indicators. These threats reduce the content validity of the set and, ultimately, limit the quality of conclusions one can draw about the targeted construct based on measurement results [e.g. 28 , 30 ]. As such, content validity provides the theoretical yardstick to confirm—or refute—concerns that existing indicator sets often seem imbalanced [ 23 , 31–33 ].

Figure 1. Illustration of content validity using the Brunswik lens model ( 24–26 , own display): The construct of interest (‘what’ to measure) may be quality of care regarding a specific sector, service area or another topic. Content domains and subdomains structure the targeted construct, for instance, in terms of quality dimensions, the care pathway, policy priorities or other domains (see Table 2 ). The content domains and subdomains thus form the conceptual framework guiding the selection of indicators. A content-valid indicator set covers the relevant content domains and subdomains, assures proportional representation and does not contain irrelevant content (see Table 1 ). Thus, a content-valid indicator set ensures that conclusions about the targeted construct based on measurement results (see panel on the far right) are valid conclusions about the targeted construct according to the conceptual framework (see panel on the far left; see [ 28 , 30 ]).
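The three threats to content validity can be expressed as simple checks on a mapping of indicators to content domains. The sketch below is a minimal illustration with hypothetical domain and indicator names, not a validated assessment procedure: unmapped subdomains flag gaps in content coverage, indicators that map to no domain flag contamination, and counts per domain give a rough view of proportional representation.

```python
# Minimal sketch (hypothetical domains and indicators, not a validated procedure)
# of checks mirroring the three component criteria of content validity.

framework = {                        # content domains -> required subdomains
    "effectiveness": {"screening", "treatment"},
    "safety": {"medication", "infection"},
    "responsiveness": {"communication"},
}

indicator_set = {                    # indicator -> (sub)domain it maps to
    "hba1c_checked": "screening",
    "statin_prescribed": "treatment",
    "med_reconciliation": "medication",
    "parking_satisfaction": None,    # maps to no domain -> contamination
}

all_subdomains = set().union(*framework.values())
covered = {d for d in indicator_set.values() if d is not None}

missing_subdomains = all_subdomains - covered                    # content coverage gaps
irrelevant = [i for i, d in indicator_set.items() if d is None]  # contamination
per_domain = {dom: sum(1 for d in indicator_set.values() if d in subs)
              for dom, subs in framework.items()}                # proportional representation

print("missing subdomains:", missing_subdomains)
print("irrelevant indicators:", irrelevant)
print("indicators per domain:", per_domain)
```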

Given the current lack of guidance on the criteria for indicator sets [ 13 , 23 ], the aim of this paper is to take stock of the criteria addressed so far in the peer-reviewed health care quality literature. Since we deem content validity a desirable property of indicator sets, our first research question is: to what extent do studies address the content validity of indicator sets? Second, to obtain a complete inventory of criteria, we ask what additional criteria of indicator sets exist in the health care quality literature. We discuss our results with the aim of providing guidance for those tasked with developing indicator sets.

Search strategy

We systematically searched the databases Web of Science, Medline, Cinahl and PsycInfo on 21 May 2021. To obtain a comprehensive overview of the field, we used the broad search term ‘indicator set’ without any filters or limits. Additionally, we searched the reference lists of included studies.

Eligibility criteria

Inclusion criteria

Studies were eligible for inclusion if they addressed the criteria for indicator sets (defined as desirable properties that can only be assessed at the level of the set [ 13 , 23 ]), were published in a peer-reviewed journal and focused on health care quality.

Exclusion criteria

We excluded studies without full text available and those not written in English or German.

Study selection

Two authors (L.S. and I.B.) independently screened all titles, abstracts and potentially relevant articles retrieved for full-text review. They resolved any doubts about the eligibility of studies through discussion until consensus was reached.

Data extraction

Following qualitative content analysis (QCA), we developed a coding scheme with definitions and exemplars for all codes [ 34 , 35 ], which we used to extract information from each included study. We developed codes in two ways. First, following directed QCA, we used existing theory to develop codes [ 34 , 36 ]. Since content validity comprises three component criteria—content coverage, proportional representation and contamination [ 28 , 37 ] (for definitions, see Table 1 )—we used these to derive codes deductively.

Table 1. Criteria of content validity: definition, exemplar and frequency in included studies

Second, because generally no unified definitions of criteria for indicator sets exist [ 13 , 23 ], we inductively developed codes in accordance with conventional QCA [ 34 ]. Thus, two authors (L.S. and I.B.) read all documents and, in iterative discussions with D.B., determined codes by identifying desirable characteristics of indicator sets from the studies themselves [ 34 , 38 , 39 ]. To achieve this, we examined definitions and procedures adopted by the studies. We did not code mere labels or adjectives whose meaning remained unclear (e.g. ‘comprehensive’, ‘wide scope’). Instead, we coded text segments only if the authors described what they meant or did to assure ‘good’ indicator sets. In addition, we extracted information on the construct targeted by the respective study (e.g. diabetes care) and on the domains (e.g. quality dimensions) selected by the authors to assess content validity.

To ensure a consistent understanding of the codes, two authors (L.S. and I.B.) independently coded and compared the results of an identical sample of articles. Subsequently, both authors repeated this process for all articles using the analysis software MAXQDA. Any conflicts in coding were reconciled through discussion until consensus was reached.

Data synthesis

To synthesize the data in relation to our research questions, we tabulated the absolute and relative frequencies of the criteria and the domains identified from all included studies.
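As a concrete illustration of this synthesis step, the sketch below uses invented study identifiers and codes (not the review's actual data) to tabulate absolute and relative frequencies of criteria across coded studies.

```python
# Minimal sketch (invented study identifiers and codes, not the review's data)
# of tabulating absolute and relative frequencies of criteria across studies.

from collections import Counter

coded_studies = {                    # study -> criteria coded in that study
    "study_01": {"content_coverage", "stakeholder_involvement"},
    "study_02": {"content_coverage", "contamination", "size_of_set"},
    "study_03": {"proportional_representation", "stakeholder_involvement"},
}

n_studies = len(coded_studies)
counts = Counter(code for codes in coded_studies.values() for code in codes)

for criterion, n in counts.most_common():
    print(f"{criterion}: {n}/{n_studies} ({n / n_studies:.0%})")
```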

Of 531 studies identified through database searching and 27 studies identified through the search of reference lists, we ultimately included 62 studies ( Figure 2 ; for details see Supplementary Appendix 1 and Appendix 2 ). The studies addressed a variety of constructs, including, amongst others, quality of hospital care [ 12 ], quality of primary care [ 40 ], quality of mental health care [ 41 ] or quality of community-based maternity care [ 5 ] (for details on all studies, see Supplementary Appendix 2 ).

Figure 2. Study selection process.

In 90% (56/62) of the studies, authors structured the construct they intended to measure in content domains, such as quality dimensions, policy priorities or elements of the care pathway ( Table 2 ). Frequently, studies also referred to the coverage of different measurement domains ( Table 2 ).

Table 2. Domains for structuring health care quality constructs

Research question 1: to what extent do studies address the content validity of indicator sets?

Overall, while only 19% (12/62) of the studies in our review used the term ‘content validity’, 85% (53/62) of the studies addressed at least one of its component criteria. Only nine studies (15%) addressed all three criteria ( Table 1 ).

Content coverage

Seventy-one per cent (44/62) of studies referred to the criterion ‘content coverage’ ( Table 1 ). While more than half of all studies (35/62) addressed content coverage in terms of the ‘breadth’ of content domains covered, 15% (9/62) additionally referred to the ‘depth’ of coverage of a specific content domain (with respect to its subdomains).

Proportional representation

Proportional representation was addressed by about a third of the studies (19/62); typically, by commenting on unequal numbers of indicators across different quality dimensions (see exemplar in Table 1 ). Some studies pre-specified a particular number of indicators for each domain in order to ensure proportional representation of all content domains in the indicator set [e.g. 33 , 42 ].

Contamination

Half of the studies (31/62) referred to avoiding the contamination of the indicator set by including indicators only if they were relevant for the targeted construct ( Table 1 ).

Research question 2: what additional criteria of indicator sets exist in the health care quality literature?

Additional substantive criteria.

We identified four additional substantive criteria of indicator sets from the included studies ( Table 3 ). Studies concerned with ‘costs of measurement’ frequently addressed the burden of data collection imposed on providers (see exemplar, Table 3 ). While several studies referred to the ‘size’ of the set, this criterion was frequently introduced as a means to an end, e.g. to reduce costs of measurement (by reducing the number of indicators) [e.g. 22 , 43 ], to enhance content coverage (by increasing the number of indicators) [ 42 ] or to promote proportional representation (by aiming for a specified number of indicators in each content domain) [ 33 , 44 ]. With respect to the criterion ‘prioritization’, studies typically used a ranking or rating procedure to identify the ‘most important’ or ‘essential’ indicators. Some studies also mentioned avoiding redundancy as a criterion.

Table 3. Additional criteria for indicator sets: definition, exemplar and frequency in included studies

Procedural criteria

Several studies also pointed out the desirable properties of the process of developing indicator sets ( Table 3 ). While the rationale behind these procedural criteria often remained unclear, in several studies, they appeared to serve as a means to assure content validity. Several studies developed a framework that was then used to map indicators and thus assure content coverage [ 5 , 45 , 46 ]. Early involvement of stakeholders, in turn, served to define the construct and identify the relevant content domains by eliciting aspects considered important from the perspectives of patients and providers [e.g. 5 , 33 ]. During the process of indicator selection, stakeholders were frequently involved to ensure content coverage [e.g. 5 , 12 ] and prevent contamination of the set [e.g. 40 , 47 ]. Some studies also emphasized the need to consider the assessment purpose when developing indicator sets and to ensure transparency about methods and limitations ( Table 3 ).

Statement of principal findings

Regarding our first research question—the extent to which studies in the health care quality literature address content validity as a criterion for indicator sets—three principal findings emerge. First, while 85% (53/62) of the studies addressed at least one of the component criteria of content validity (content coverage, proportional representation, or contamination), suggesting that most studies consider (components of) content validity important, only 15% (9/62) addressed all of its component criteria. Second, our review revealed that several authors distinguished between the ‘breadth’ and ‘depth’ of content coverage. Third, we found that authors used various content and/or measurement domains to structure the targeted construct in order to provide a framework for assessing content validity.

Regarding our second research question, we further identified four substantive criteria and four procedural criteria. Among the former, costs of measurement and prioritization of ‘essential’ indicators were addressed most frequently (each by 21% [13/62] of the included studies). Among the latter, several studies emphasized the importance of defining or using a conceptual framework (44% [27/62]) and stakeholder involvement (69% [43/62]).

Strengths and limitations

Our review is, to our knowledge, the first review of criteria for indicator sets in the health care quality literature. These criteria are an inventory of what previous studies have considered important properties of indicator sets. As such, the review offers a valuable guide for those tasked with developing indicator sets and for further research on this topic. Second, with our analytic approach, we went beyond the frequently inconsistent terminology in the studies and examined instead what the authors recommended or did to obtain ‘good’ indicator sets. This enabled us to offer a taxonomy of criteria and, based on consistent definitions, to report their frequencies in the studies included.

Our study has limitations. First, while our review was extensive in that it covered four scientific databases using broad search terms, we focussed on the peer-reviewed health care quality literature and did not examine in detail other fields (e.g. sustainability and education). From the non-health studies examined, however, we identified no additional criteria [ 11 , 48 , 49 ]. Second, searches of the grey literature might have yielded additional criteria. However, including searches of grey literature in a systematic review also entails several limitations, such as poor methodological reproducibility, missing citation information and varying indexing and search functionalities of Web-based search engines and repositories [ 50 ]. Third, QCA always involves some subjectivity in coding [ 34 ]. However, we took several steps to enhance the trustworthiness of the results, including the use of a coding scheme, coder training to ensure consistent implementation of the scheme, independent coding by two reviewers and comparison of all conflicts until consensus was reached [ 35 , 39 ]. We are therefore convinced that our results provide a credible account of the reviewed studies.

Interpretation within the context of the wider literature

Typically, users of measurement results want to draw valid conclusions about some broader construct (such as a provider’s quality of primary care [ 40 ] or quality of mental health care [ 41 ], as in some of the studies in our review). In these cases, an exclusive emphasis on the methodological quality of single indicators is insufficient: it might result in incomplete coverage, overrepresentation of indicators for some aspects of care and/or superfluous indicators [ 11 ]. Because each component criterion of content validity helps to remedy one of these threats [e.g. 28 ], an indicator set becomes more valid when all three component criteria are assured [e.g. 28 , 30 ]. Thus, our finding that only 15% (9/62) of the included studies sought to assure all three component criteria suggests the need for a stronger emphasis on content validity for developers of indicator sets.

Health care quality constructs are frequently conceptualized in terms of multiple levels, with several domains and subdomains ( 12 , 13 , 45 ; see also Figure 1 ). Thus, the distinction between the ‘breadth’ and ‘depth’ of content coverage we found in several studies seems important for quality indicator sets. While an indicator set may address all relevant content domains (thus achieving high ‘breadth’), the ‘depth’ to which each of these domains is covered also influences the degree to which an indicator set measures what it purports to measure [ 13 ]. Therefore, it seems important to assess both the ‘breadth’ and ‘depth’ of content coverage of quality indicator sets.

Content validity is assessed with reference to content domains [ 28 , 30 ]. Therefore, careful development of the (sub-)domains of the targeted construct represents the crucial first step to obtain a valid indicator set [ 28 , 29 ]. Our finding that more than two thirds (42/62) of the reviewed studies employed Donabedian’s generic measurement domains to assess indicator sets may reflect the enduring debate in the literature about the merits and demerits of structure, process and outcome indicators [ 51 , 52 ]. These measurement domains, however, are not helpful for structuring the construct. For instance, patient safety of primary care can be measured with structure, process and outcome indicators, but this would not ensure the coverage of other quality dimensions of the construct ‘quality of primary care’ such as effectiveness and responsiveness [ 13 ]. Therefore, we caution against using measurement domains as a substitute for actual content domains. Instead, we suggest, the development of the content domains should be driven by the quality objectives regarding the targeted construct [ 53 , 54 ].

Our findings also reflect long-standing tensions between maximising insights gained from measurement and minimising costs to obtain these insights [ 11 , 55 ]. While ‘comprehensive’ measurement of all aspects of health care quality has been deemed an unrealistic ambition [ 13 , 56 ], it is important to emphasize that assuring content validity does not entail measuring ‘everything’. Rather, it involves making explicit the content domains that are relevant for the targeted construct and the degree to which an indicator set represents these domains [ 27 , 28 ]. The criterion ‘prioritization’ identified in the literature seems premised on the notion that some indicators are more important to the targeted construct than others. The consequent exclusion of (relevant) indicators reduces, however, content validity and limits the ability to draw conclusions about the targeted construct [ 27 , 28 ]. Similar trade-offs arise with the criterion ‘size’: Unless a relatively narrow construct such as preoperative management in colorectal cancer care [ 57 ] is targeted, it is difficult to achieve a highly content-valid indicator set with very few indicators [ 11 , 48 ]. Yet, a large number of indicators does not guarantee high content validity [ 11 ], for instance, when not all relevant content domains are covered.

Implications for policy, practice and research

The component criteria of content validity help with assessing whether conclusions based on a set of indicators are valid conclusions about the targeted construct. Those tasked with developing quality indicators should therefore assure the validity of not only single indicators but also of the indicator set as a whole. Developers of quality indicators should specify the purpose of measurement and consider trade-offs with other potential criteria for indicator sets whose application may reduce content validity (e.g. costs of measurement and prioritization) in light thereof.

To develop a valid indicator set, careful definition of the targeted construct, including its (sub-)domains, is paramount: Since content validity can only be assessed in relation to a conceptual framework [ 27 , 28 ], the indicator set can only be as good as the chosen framework. The conceptual framework should serve as a mapping tool to select indicators and to signal gaps in content coverage [ 11 , 21 , 58 , 59 ]. Building on the finding that the indicator set can only be as good as the content domains specified, future research should examine how different purposes of quality measurement, such as accountability and improvement [ 3 ], influence how the targeted construct should be conceptualized.

Conclusions

Based on the premise that a set of ‘valid indicators’ does not guarantee a ‘valid set’ of indicators, this review takes stock of existing criteria for indicator sets in the health care quality literature with a focus on content validity. These criteria can guide the process of developing indicator sets and, by complementing the assessment of single indicators, support patients, providers and policy-makers in making informed decisions based on the results of quality measurement.

Acknowledgements

None declared.

Contributor Information

Laura Schang, Department of Methodology, Federal Institute for Quality Assurance and Transparency in Health Care (IQTIG), Katharina-Heinroth-Ufer 1, Berlin 10787, Germany.

Iris Blotenberg, Department of Methodology, Federal Institute for Quality Assurance and Transparency in Health Care (IQTIG), Katharina-Heinroth-Ufer 1, Berlin 10787, Germany.

Dennis Boywitt, Department of Methodology, Federal Institute for Quality Assurance and Transparency in Health Care (IQTIG), Katharina-Heinroth-Ufer 1, Berlin 10787, Germany.

Supplementary material

Supplementary material is available at International Journal for Quality in Health Care online.

This work was supported by the authors’ institution. No external funding was received.

Contributorship

All authors developed the research question and conceptualized the study. I.B. and L.S. conducted the literature review and analysed the data. All authors contributed to drafting the manuscript. Throughout the design and implementation of the study, all authors regularly discussed the methods, the progress of the study and emergent findings. All authors read and approved the final manuscript and endorsed the decision for publication.

Ethics and other permissions

Not required, since this was a systematic literature review.


Research excellence indicators: time to reimagine the ‘making of’?


Federico Ferretti, Ângela Guimarães Pereira, Dániel Vértesy, Sjoerd Hardeman, Research excellence indicators: time to reimagine the ‘making of’?, Science and Public Policy , Volume 45, Issue 5, October 2018, Pages 731–741, https://doi.org/10.1093/scipol/scy007


In the current parlance of evidence-based policy, indicators are increasingly called upon to inform policymakers, including in the research and innovation domain. However, few studies have scrutinized how such indicators come about in practice. We take as an example the development of an indicator by the European Commission, the Research Excellence in Science & Technology indicator. First, we outline tensions related to defining and measuring research excellence for policy using the notion of ‘essentially contested concept’. Second, we explore the construction and use of the aforementioned indicator through in-depth interviews with relevant actors and the co-production of indicators, that is, the interplay of their making vis-à-vis academic practices and policy expectations. We find that although many respondents in our study feel uncomfortable with the current usage of notions of excellence as an indicator of the quality of research practices, few alternatives are suggested. We identify a number of challenges which may contribute to the debate on indicator development, suggesting that the making of current indicators for research policy in the EU may be in need of serious review.

When it comes to research policy, excellence is at the top of the agenda. Yet the meaning attributed to the notion of excellence differs markedly among academics and policymakers alike.

There is an extensive scholarly debate around the breadth and depth of the meaning of excellence, its capacity to provide quantitative assessments of research activities and its potential to support policy choices. Yet there is considerable agreement that it strongly influences the conduct of science. The contestedness of the excellence concept can be inferred from the discomfort it has evoked among scholars, leading some even to plead for an outright rejection of the concept (Stilgoe 2015). The discomfort with the concept is greater whenever proposals are made to measure it. The critique of measuring excellence follows two lines. One is technical and emphasises the need for methodological rigour. While in principle not denying the need for and the possibility of designing science and technology indicators, this line of criticism stresses the shortcomings of methodological approaches used up until now (Grupp and Mogee 2004; Grupp and Schubert 2010 ). The other critique is more philosophical and, while not denying the theoretical and political relevance of excellence, it takes issue with the use of current metrics in assessing it ( Weingart 2005 ; Martin 2011 ; Sørensen et al. 2015 ). Notwithstanding these criticisms, and especially given the period of science professionalization in which policymaking finds itself ( Elzinga 2012 ), these same metrics are frequently called upon to legitimate policy interventions (Wilsdon et al. 2015).

In addition, widely discussed shortcomings in the existing mechanisms of science’s quality control system undermine trust in assessment practices around scientific excellence—in other words, if the peer review system is in crisis, what research outcomes are evaluated as excellent? (See Martin 2013 ; Sarewitz 2015 ; Saltelli and Funtowicz 2017 .)

The aspiration for an ‘evidence-based society’ ( Smith 1996 ) requires that policy makers and the like, especially those operating at the level of transnational governmental organisations, rely on information on the current state of research to identify policy priorities or to allocate funds. Indicators are typically proposed as tools catering to this need ( Saltelli et al. 2011 ). A main issue remains, however: how to come up with indicators of research excellence in the face of its often controversial underpinnings, as well as their situated nature?

At the Joint Research Centre of the European Commission, we have been actively involved in the design and construction of a country-level indicator of excellence, the Research Excellence Science & Technology indicator (RES&T) offered and used by the European Commission (cf. European Commission 2014 ; Hardeman et al. 2013 ). Hence we are in a unique position to critically reflect upon challenges of quantifying research excellence for policy purposes.

Here we adopt the notion of essentially contested concept as our theoretical workhorse ( Gallie 1955 ; Collier et al. 2006 ) to discuss why the usefulness of research excellence for policy purposes is a subject of contention and what this means for its quantification. Essentially contested concepts are concepts ‘the proper use of which inevitably involves endless disputes about their proper uses on the part of their users’ ( Gallie 1955 : 169).

The work presented in this article revolves around two questions, which evolved as we learned from the empirical material. First, we examine whether research excellence can be ‘institutionalised’ in the form of stable research excellence indicators, from the vantage point of Gallie’s notion of ‘essentially contested concept’. Second, we ask whether the re-negotiation of meanings of research excellence that underpin current indicators revolves around the articulation of different imaginaries of excellence displayed by different actors. These initial questions were reframed as the authors progressively came to understand that the focus on practices was certainly relevant, but that larger questions emerged, such as whether ‘excellence’ alone was indeed the relevant descriptor for evaluating the quality of research in the EU. Hence, this discussion is also offered vis-à-vis our findings throughout the research process.

The article starts by looking into the notion of excellence and its function as a proxy for scientific quality, using the notion of essentially contested concept as well as elements of tension around its conceptualization (Section 2) as reported in the literature. It proceeds by briefly describing the development of the indicator that we take as an example to respond to the research questions described earlier. The second part of the article explains the methodology applied (Section 3) and the outcomes (Section 4) of the empirical research carried out to inform this article, which consisted of a number of in-depth interviews with relevant actors, that is, developers of the RES&T indicator, EU policymakers, and academics. The interviews aimed at exploring meanings, challenges, and ways to reimagine the processes behind indicator development. In those interviews, we explore ‘re-imagination’ as a space for our interviewees to reflect further and discuss alternatives to current research indicator frameworks. These are offered in a discussion (Section 5) of current challenges in reimagining an indicator to qualify quality in science.

2.1 Measuring and quantifying indicators-for-policy

The appeal of numbers is especially compelling to bureaucratic officials who lack a mandate of popular election or divine right; scientific objectivity thus provides an answer to a moral demand for impartiality and fairness; it is a way of making decisions without seeming to decide. (T. M. Porter 1995)

Indicators seek to put into numbers phenomena that are hard to measure ( Boulanger 2014 ; Porter 2015 ). Therewith, measuring is something other than quantifying ( Desrosieres 2015 ): while measuring is about putting into numbers something that already exists, quantifying is about putting into numbers something that requires an interpretative act. Indicators are often exemplary of quantifications. They are desirable because they offer narratives to simplify complex phenomena and therewith attempt to render them comprehensible ( Espeland 2015 ). Such simplifications are especially appealing whenever information is called for by policymakers operating at a distance from the real contexts that are the actual object of their policy action. Simplification means that someone decides which aspects of complex phenomena are stripped away while others are taken on board. The (knowledge and values) grounds for that operation are not always visible. The risk is that, in stripping away some aspects (and focusing on others), a distorted view of the phenomenon of interest may arise, with potentially severe consequences for policy decisions derived from it. Lacking the opportunity to gather detailed information on each and every aspect of a phenomenon of concern, policymakers are nevertheless drawn to indicators offering them the information needed in the form of summary accounts ( Porter 2015 ).

Constructing an indicator on research excellence typically involves activities of quantification as research excellence has no physical substance in itself. For an indicator on research excellence to come into existence one first needs a meaning and understanding about what ‘research excellence’ is about before one can even start assigning numbers to the concept ( Barré 2001 ). We find that the notion of ‘co-production’ ( Jasanoff 2004 ) is relevant as it makes visible that indicators are not developed in a vacuum but respond and simultaneously normalise scientific practice and policy expectations.

2.2 Research excellence as an essentially contested concept

Research excellence could be straightforwardly defined as going beyond a superior standard in research ( Tijssen 2003 ). However, straightforward and intuitively appealing as this definition may seem, it merely shifts the issue of defining what is meant by research excellence towards what counts as ‘a superior standard in research’. For one thing, it remains unclear what should be counted as research to begin with, as well as how standards of superiority should be set, on what account and by whom. Overall, the notion of research excellence is potentially much more controversial than it might seem at first. In fact, whenever it comes to articulating what should count as excellent research and why this is so, scientific communities systematically strive to come to an agreement ( Lamont 2009 ).

One way to conceive of research excellence, then, is to think of it as an essentially contested concept . The notion of an essentially contested concept was first introduced by Gallie (1955) to describe ideas or phenomena that are widely appraised yet controversial at the same time. In substantiating his view, Gallie (1955) listed five properties of essentially contested concepts (see also: Collier et al. 2006 ). Essentially contested concepts are (1) appraisive , (2) internally complex, (3) describable in multiple ways, (4) inherently open, and (5) recognized reciprocally among different parties ( Gallie 1955 ). Due to their complex, open, and value-laden nature, essentially contested concepts cannot be defined in a single best, fixed, and objective way from the outset. Hence, they are likely to produce endless debates on their interpretation and implications.

Research excellence might well serve as an instance of an essentially contested concept. First, research excellence, by its very appeal to superior standards, evokes a general sense of worth and, therewith, shareability . Although one can argue about its exact definition and the implications that such definitions could have, it is hard to be against excellence altogether (Stilgoe 2014). Second, research excellence is likely to be internally complex as it pertains to elements of the research enterprise that need not be additive in straightforward ways.

For example, research excellence can be about process as well as outcomes, whereby the former need not automatically transform into the latter ( Merton 1973 ). Third, it follows that research excellence can be described in multiple ways: while some might simply speak of research excellence with reference to science’s peer review system ( Tijssen 2003 ), others prefer to broaden the notion of research excellence beyond its internal value system to include science’s wider societal impact as well (Stilgoe 2014). Fourth, what counts as excellent research now might not necessarily count as excellent research in the future, and any definition of research excellence might well be subject to revision. Finally, the fact that one can hold a different view of what research excellence is or should be is recognised by proponents of different definitions. Ultimately, proponents of a particular notion of research excellence may or may not be aware of alternative interpretations.

Recently, Sir Keith Burnett (2016) argued that a mechanical vision of academia is driving ‘mechanical and conventional ways we think about “excellence”. We measure a community of scholars in forms of published papers and league tables’ (Burnett 2016). Hence, what counts as excellence is entertained by the imagination of some about what ‘excellent research’ is; but what political, social, and ethical commitments are built into the adopted notion and into the choice of what needs to be quantified?

2.3 Quantifying research excellence for policy purposes: critical issues

Following the previous discussion, if one acknowledges research excellence as an essentially contested concept, the construction of indicators faces difficulties, which start with the mere act of attempting quantification, that is, agreeing on a shared meaning of research excellence. In the 1970s Merton (1973 : 433–435) introduced three questions that need to be addressed to come to terms with the notion of research excellence (see also: Sørensen et al. 2015).

First, what is the basic unit of analysis to which research excellence pertains? Merton (1973) suggested that this could be anything ranging from a discovery, a paper, a painting, a building, a book, a sculpture, or a symphony to a person’s life work or oeuvre. There is both a temporal and a socio-spatial dimension to the identification of a unit for research excellence. Temporal in the sense that research excellence need not be attributable to a specific point in time only but might span larger time periods. Also, though not much discussed by Merton (1973) , a unit for research excellence has a socio-spatial dimension. Research excellence might pertain to objects (books, papers, sculptures, etc.) or people. When it comes to the latter, a major issue is to whom excellence can be attributed (individuals, groups, organisations, territories) and how to draw appropriate boundaries among them (cf. Hardeman 2013 ). Expanding or restricting a unit’s range in time and/or space affects the quantification of research excellence accordingly.

Second, what qualities of research excellence are to be judged? Beyond the identification of an appropriate unit of analysis, this second issue raised by Merton (1973) points to several concerns. One is about the domain of research itself. As with disputes about science and non-science ( Gieryn 1983 ), demarcating research from non-research is more easily said than done. Yet, to attribute excellence to research, such boundary work needs to be done nevertheless. Should research excellence, as in ‘The Republic of Science’ (Polanyi 1962), be judged according to its own criteria? Or should research, in line with Weinberg’s (1962) emphasis on external criteria, be judged according to its contribution to society at large? As with setting the unit of excellence, setting the qualities in one way (and not another) certainly produces different outcomes for the policies derived therefrom. That said, focusing on a particular notion of excellence (i.e. using a particular set of qualities) might crowd out other, in principle equally valid, qualities ( Rafols et al. 2012 ; Sørensen et al. 2015).

Third, who shall judge? For example, a researcher working in a public lab might have a quite different idea of what counts as excellent research than one working in a private lab. This brings Stilgoe (2014) to argue that ‘“Excellence” tells us nothing about how important the science is and everything about who decides’. It is undoubtedly of eminent importance to determine the goals and interests that excellence serves. Likewise, and in line with Funtowicz and Ravetz’s (1990) focus on fitness-for-purpose to describe the quality of a process or product, the quality of an indicator of research excellence crucially depends on its use. One concern here is that research excellence indicators might set standards for research practices that do not conform to the underlying concept of excellence they seek to achieve ( Hicks 2012 ; Sørensen et al. 2015). For example, in Australia, in seeking to achieve excellence, the explicit focus on publication output indeed increased the number of papers produced but left the issue of the actual worth of those papers unaddressed (Butler 2003). Interestingly, in 2009 a new excellence framework came into existence in Australia to replace the former quality framework. While the old quality framework made use of a one-size-fits-all model, the new excellence-based one presents a matrix approach in which entire sets of indicators, as well as expert reviews, coexist as measures of quality. Again, any definition of research excellence and its implications for quantification need to be positioned against the background of the goals and interests it serves.

2.4 The construction of the Research Excellence Indicator (RES&T) at the Joint Research Centre

The development of the Research Excellence Indicator (RES&T) at the Joint Research Centre of the European Commission (JRC) inspired this research. Its history and development, together with our privileged position of proximity to that development, form the basis from which we departed to conduct our inquiries.

In 2011, an expert group on the measurement of innovation set up by the European Commission’s Directorate-General for Research and Innovation (DG-RTD) was requested ‘to reflect on the indicators which are the most relevant to describe the progress to excellence of European research’ ( Barré et al. 2011 : 3). At that point the whole notion of excellence was said to be ‘in a rather fuzzy state’ ( Barré et al. 2011 : 3). To overcome the conceptual confusion surrounding research excellence and to come up with a short list of indicators capable of grasping research excellence, the expert group proceeded in four steps. First, it defined and described the types of activities eligible to be called excellent. Second, a set of potential indicators was identified. Third, from this set of potential indicators a short list of (actually available) indicators was recommended. Fourth, a process for interpreting research excellence as a whole at the level of countries was proposed.

This was followed by Vertesy and Tarantola (2012) proposing ways to aggregate the set of indicators identified by the expert group into a single composite index measuring research excellence. The index closely resembled the theoretical framework offered by the expert group while aiming for statistical soundness at the same time.
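To make concrete what aggregating underlying indicators into a single composite can involve, the sketch below shows one generic recipe: min–max normalisation of each component followed by a weighted geometric mean. It is a minimal illustration only, not the methodology of Vertesy and Tarantola (2012); the component names, country values, and weights are all hypothetical.

```python
# Illustrative sketch of a generic composite-indicator aggregation.
# Not the actual RES&T methodology; names, values, and weights are hypothetical.
import numpy as np

def min_max_normalise(x):
    """Rescale a vector of raw indicator values to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def composite_score(indicators, weights):
    """Aggregate normalised indicators with a weighted geometric mean.

    indicators: dict mapping indicator name -> list of country values
    weights:    dict mapping indicator name -> weight (summing to 1)
    """
    normalised = {k: min_max_normalise(v) for k, v in indicators.items()}
    eps = 1e-6  # small offset: a geometric mean cannot handle exact zeros
    log_sum = sum(weights[k] * np.log(normalised[k] + eps) for k in indicators)
    return np.exp(log_sum)

# Hypothetical data for three countries and three underlying components.
indicators = {
    "highly_cited_publications": [120, 80, 45],
    "top_universities":          [10, 6, 2],
    "high_value_patents":        [300, 150, 90],
}
weights = {"highly_cited_publications": 0.4,
           "top_universities": 0.3,
           "high_value_patents": 0.3}

print(composite_score(indicators, weights))
```

Swapping the geometric mean for an arithmetic one, or changing the weights, would be equally defensible recipes, which is precisely why such seemingly technical choices invite the scrutiny described below.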

Presented at a workshop organised in Ispra (Italy) during fall 2012 by the European Commission and attended by both policymakers and academic scholars, the newly proposed composite indicator met with fierce criticism. A first critique was that the proposed composite indicator mixed up inputs and outputs, whereas research excellence, according to the critics, should be about research outputs only: since the outcomes of research and innovation activities are fundamentally uncertain, the nature and magnitude of research and innovation inputs say little to nothing about their outputs. A second critique raised during the workshop was that some of the indicators used, while certainly pertaining to research, need not say much about its excellence. Focusing on outputs only would largely exclude other dimensions that could refer to any kind of input (e.g. gross investment in R&D) or any kind of process organizing the translation of inputs into outputs (e.g. university–industry collaborations).

Taking these critiques on board, the research excellence indicator was further refined towards the finalization of the 2013 report ( Hardeman et al. 2013 ). First, the scope of the indicator was made explicit by limiting it to research in science and technology only. Second, following the critique that inputs should be strictly distinguished from outputs, it was made clear which of the underlying indicators were primarily focused on outputs. Given that the underlying indicators were not available for all countries, the rankings presented in the 2013 Innovation Union Competitiveness Report were based on a single composite indicator aggregating either three (non-ERA countries) or four (ERA countries) underlying indicators (European Commission 2013).

In a subsequent report aimed at refining the indicator, Hardeman and Vertesy (2015) addressed a number of methodological choices, some of which were also pointed out by Sørensen et al. (2015) . These concerned the scope of coverage in terms of the number and kind of countries and the range of (consecutive) years, the variables included (both numerators and denominators), and the choice of weighting and aggregating components. The sensitivity and uncertainty analyses highlighted that some of the methodological choices were more influential than others. While these findings highlighted the importance of normative choices, such normative debates materialized only within a limited arena.
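The logic of such a sensitivity analysis can be pictured as repeatedly perturbing one methodological choice, here the weights, and recording how far country ranks move. The sketch below is a minimal illustration under hypothetical scores and a uniform baseline weighting; it does not reproduce the analysis of Hardeman and Vertesy (2015).

```python
# Illustrative weight-perturbation sensitivity check on hypothetical data.
import numpy as np

rng = np.random.default_rng(0)

# Normalised scores for five hypothetical countries on four components.
scores = np.array([
    [0.9, 0.7, 0.8, 0.6],
    [0.6, 0.9, 0.5, 0.7],
    [0.4, 0.5, 0.9, 0.8],
    [0.7, 0.6, 0.6, 0.5],
    [0.3, 0.4, 0.4, 0.9],
])
baseline_weights = np.array([0.25, 0.25, 0.25, 0.25])

def ranks(composite):
    """Rank countries from 1 (best) by composite score."""
    order = np.argsort(-composite)
    r = np.empty_like(order)
    r[order] = np.arange(1, len(composite) + 1)
    return r

baseline_ranks = ranks(scores @ baseline_weights)

# Draw random weight vectors and record how far each country's rank moves.
shifts = []
for _ in range(1000):
    w = rng.dirichlet(np.ones(4))          # random weights summing to 1
    shifts.append(np.abs(ranks(scores @ w) - baseline_ranks))

print("mean absolute rank shift per country:", np.mean(shifts, axis=0))
```

Countries whose ranks shift substantially across weightings are exactly the cases where the methodological choice, rather than the underlying data, drives the headline result.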

Based on our research and experience with the RES&T, we will discuss whether careful reconsideration of the processes by which these types of indicators are developed and applied is needed.

A qualitative social research methodology was adopted to gain insights from different actors’ vantage points on the concepts, challenges, and practices that sustain the quantification of research excellence.

A series of in-depth interviews was carried out by two of the authors of this paper between March and May 2016. A first set of interviews was conducted with five people directly involved in the construction of the RES&T indicator from the policy and research spheres, or identified through our review of the relevant literature.

This was followed by a second set of interviews (six participants), partially suggested by the interviewees in the first set. This second set was thus composed of senior managers and scholars of research centres with departments on scientometrics and bibliometrics, as well as policymakers.

Hence, the eleven interviewees included people who were either involved in different phases of the RES&T indicator’s development or professionally close (in research or policy areas) to the topic of research indicators and metrics. Awareness of the RES&T indicator was a preferred requirement. The eleven telephone semi-structured in-depth interviews conducted with these experts, scholars, and users of indicators may seem small in number; however, this pool offered relevant insights into the practices of research evaluation in the EU. The interviewees were, in short, the relevant actors for our work.

We performed coding as suggested by Clarke (2003) as soon as data became available; this approach allowed us to focus more closely on those aspects of the research that emerged as particularly important. The accuracy of our interpretations was checked through multiple blind comparisons of the coding generated by the authors of this paper. Our codes were often also explicitly verified with the interviewees to check for potential misalignments in the representativeness of our interpretations.

RES&T indicator developers (hereafter referred to as ‘developers’): The three interviewees in this group were all in some way involved in the design and implementation of the RES&T indicator. Among them were two senior researchers and one retired researcher, all active in the areas of innovation and statistics. Given that we knew two of the interviewees before the interviews, we paid particular attention to the influence of interviewer–interviewee identities at the moment of data analysis, following the recommendations of Gunasekara (2007) .

Policy analysts (hereafter referred to as ‘users’): This group was composed of four senior experts who are users of research indicators. They are active as policymakers at the European Commission; they have all been involved in various expert groups and at least two of them have also published their own research.

Practitioners and scholars in fields relevant to our endeavour, concerned with science and technology indicators and active at different levels in their conceptualization, use, and construction (hereafter referred to as ‘practitioners’): This group was composed of four scholars (one senior researcher, one full professor, one department director, and one editor-in-chief of a scientific journal) who critically study statistical indicators.

The interviews were structured around three main axes:

  • Insights into meanings of excellence.

  • Critical overview of current metrics in general and of the processes and backstage dynamics in the development of the RES&T indicator (where the interviewee was personally involved).

  • Reimagination of ways to assess and assure the quality of processes of indicator development, taking stock of transformations in knowledge production, knowledge governance, and policy needs (new narratives).

All interviews, which were on average one hour long, were transcribed, and data analysis was conducted according to the principles of grounded theory, particularly as presented in Charmaz (2006) . The analysis consisted of highlighting potential common viewpoints and identifying similar themes and patterns around the topics discussed with the interviewees; these are discussed in the next sections of this article.

In this section, we attempt to make sense of the issues raised by our interviewees, summarising the main recurrent elements of the three main axes that were at the core of the questionnaire structure: (1) meanings of ‘excellence’; (2) challenges and backstage processes of developing and using research excellence indicators; and (3) ways to reimagine the process of indicator development so as to promote better quality frameworks for assessing scientific quality.

4.1 On meanings of research excellence

Many of us are persuaded that we know what we mean by excellence and would prefer not to be asked to explain. We act as though we believe that close inspection of the idea of excellence will cause it to dissolve into nothing. ( Merton 1973 : 422)

Our starting question to all interviewees was ‘please, define research excellence’. Our interviewees found themselves rather unprepared for this question, which could suggest either that the expression is taken for granted and not seen as in need of reflection or, as the literature review shows, that no shared definition seems to exist; in the end, our interviewees largely agreed with the latter. Such unpreparedness seems somewhat paradoxical, as it implies an assumption that the definition of excellence is stable and in no need of reflection, whereas our interviewees’ responses suggest rather the contrary. Excellence is referred to as ‘hard to define’, ‘complex’, ‘ambiguous’, ‘dynamic’, ‘dangerous’, ‘tricky’, as well as a ‘contextual’ and ‘actor-dependent’ notion. The quotes below reflect different vantage points, indicating some agreement on its multidimensional, contested, distributed, situated, and contextual nature:

[…] this is a dangerous concept, because you have different starting positions. Developer 3

Clearly, excellence is multi-dimensional. Secondly, excellence ought to be considered in dynamic terms, and therefore excellence is also dynamics, movement and progress which can be labelled excellent. Third, excellence is not a natural notion in the absolute, but it is relative to objectives. Therefore, we immediately enter into a more complex notion of excellence I would say, which of course the policy makers do not like because it is more complicated. Developer 2

[…] you need to see the concept of research excellence from different perspectives. For universities it might mean one thing, for a private company it might mean something completely different. Developer 1

You could say that excellence is an emergent property of the system and not so much an individual attribute of particular people or groups. Stakeholder 1

The quotes suggest agreement among the interviewees that research excellence is a multidimensional, complex, and value-laden concept, which links well with the notion of the essentially contested concept introduced earlier. While some experts simply think of highly cited publications as the main ingredient for a quantification of excellence, others tend to problematize the notion of excellence once they are invited to reflect carefully upon it, moving away from the initial official viewpoint. Indeed, the lack of consensus about the meanings of excellence is highlighted by different interviewees and, not surprisingly, seems to be a rather important issue at the level of institutional users and developers, who described it as an unavoidable limitation. For example:

It is extremely difficult to have a consensus and [therefore] it is impossible to have a perfect indicator. User 2

I do see that there was no clear understanding of the concept [of excellence] [since] the Lisbon agenda. This partly explains why [a] high level panel was brought together [by DG RTD], [whose] task was to define it and they gave a very broad definition, but I would not identify it as the Commission’s view. Developer 3

The way users and developers responded to this lack of consensus seems to be different though. Developers, on the one hand, do not seem to take any definition of research excellence for granted. It seems that, as a way out of the idea that research excellence constitutes an essentially contested concept, developers stick to a rather abstract notion of research excellence: specific dimensions, aggregation methods, and weights are not spelled out in detail. For example, when asked to define excellence, one developer responded:

I would say there is a natural or obvious standard language meaning , which is being in the first ranks of competition. Excellence is coming first. Now, we know that such a simple definition is not very relevant [for the task of indicators making]. Developer 2

The more concrete, and perhaps more relevant, decisions are thereby avoided, as it is immediately acknowledged that research excellence constitutes an essentially contested concept. Users, on the other hand, seem to take established definitions for granted much more easily. Here, one interviewee simply referred to the legal basis of Horizon 2020 in defining excellence:

I think I would stick to the definition of the legal basis : what is important is that the research that is funded is the top research. How is this defined? In general, it is what is talented, looking for future solutions, preparing for the next generation of science and technology, being able to make breakthroughs in society. 1 User 3

What both developers and users share is their insistence on the need for quantification of research excellence, albeit for different reasons. From the user-perspective, the call for a research excellence indicator seems to be grounded in a desire for evidence-based policy (EBP) making.

To our question on whether excellence is the right concept for assessing quality in science, interviewees responded that the call for EBP at all costs surely plays a fundamental role in the mechanisms promoting excellence measures and, from there, indicator development:

There is a huge debate on what the real impact of that is in investment and we need to have a more scientific and evidence-based approach to measure this impact , both to justify the expense and the impact of reform, but also to better allocate spending. User 2

Notwithstanding the difficulty involved in operationalizing a notion of excellence into indicators, what comes to the fore is that no single agreed-upon solution is to be expected from academia when it comes to defining excellence for quantification purposes. This seems to be acknowledged by one of the developers, who commented on the composition of the high-level expert panel that:

You have a bunch of researchers who have a very different understanding of what research excellence would be, and some were selected for this high level panel. I am not aware of any reasoning why specific researchers or professors were selected while others were not. I am sure that if there was a different group, there would have been a different outcome , but this is a tricky thing. Developer 3

Such considerations seem to confirm that the processes behind indicator development, such as the involvement of certain academic communities, potentially influence further conceptualisations of research excellence. These aspects are discussed in the last section of this article.

4.2 Metrics and processes of research excellence

… the whole indicator activity is a social process, a socio-political process; it is not just a technical process. Developer 2

Indicators and metrics respond and correspond to social and political needs and are not mere technical processes, and this is made visible by different types of tensions identified by our interviewees.

First, the process of quantifying research excellence requires an agreement on its definition. Any definition is neither definitive nor stable, not least because of its contextual dependencies. In the section above, it emerged that what needs to be quantified is substantially contested. However, our interviews show that other, at least equally contested, dimensions exist: methodological (quantification practices), social (the actors involved), and normative (scientific and policy practices).

In the remainder of this section, we explore through our interviews the production of indicators vis-a-vis their processes and outcomes.

4.2.1 Normativity: who is involved in the design of an indicator?

Indicators clearly require political choices to be made. What needs to be quantified, and who decides, remains an important question. The normativity aspects always refer back to definitional issues, the social actors of concern, and institutional dependencies.

The observation of one of the practitioners resonates with Jack Stilgoe’s provocation that ‘excellence tells us nothing about how important the science is and everything about who decides’. 2

Who decides what excellence is? Is it policy makers, is it the citizen, is it the researchers themselves? Is it absolute or does it depend on the conditions? Does it depend on the discipline? Does it depend on the kind of institution concerned? You see what I mean. Developer 2

A practitioner suggests that the level of satisfaction, and therefore acceptance, of an indicator is primarily defined by its usage:

Who will decide when an indicator is good enough and to what extent? […] The short answer is the users, the people out there who use indicators and also whose careers are affected by indicators, they decide whether it’s good enough. Practitioner 3

These quotes raise different questions related to what we here call ‘normativity’ and to ideas of co-production, both in terms of indicator development and usage: first, what are the power relations between the actors involved and how can they influence the processes behind indicators? Second, to what extent can these kinds of quantification be deemed unsatisfactory and, ultimately, rejected, and by whom? Third, in the idiom of co-production, how do research excellence metrics influence research practices in both mainstream knowledge production systems and other emerging systems of knowledge production (namely what is designated as ‘DIY science’, ‘citizen science’, ‘the maker movement’, etc.)?

4.2.2 Inescapable simplifications?

Simplification seems to be an inescapable avenue in any attempt to represent complex concepts with just one number; because it implies the inclusion and exclusion of dimensions, it raises the question of responsibility and accountability. At the end of the day, complex systems of knowledge production are evaluated through very limited information. Although we do not want to expand this discussion here, it is important to point out that when using these scientific tools in realms that will have major implications for systems of knowledge production and governance, the ‘who’ and ‘to whom’ need careful consideration.

At some point, you need to reduce the complexity of reality. Otherwise you cannot move on. We tend to be in favour of something. The problem is that we have far too many possibilities for indicators […]. In general, we need to take decisions on the basis of a limited amount of information. Practitioner 4

What is the limitation of the index? These were the main issues and dimensions that it was not able to address. I do not know what the most problematic things were. I have seen many questions, which address for instance the choice of a certain indicator, data or denominator and the exclusion or inclusion of an index. I do not know which ones were more important than the others. We ran a number of sensitivity tests, which showed that some of the choices had a more substantial impact on country rankings. You could put the ranks upside down. Developer 3

Different interviewees deem that quantification practices ought to be robust to deviations arising from different theoretical assumptions, that is, when specific variables, time periods, weights, and aggregation schemes are varied.
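As a toy illustration of the kind of deviation interviewees have in mind, the sketch below aggregates the same normalised component scores with an arithmetic and with a geometric mean; the data and equal weights are hypothetical, and the point is only that this single technical choice can already reorder countries.

```python
# Illustrative comparison of aggregation schemes on hypothetical scores.
import numpy as np

scores = np.array([
    [0.95, 0.20, 0.90],   # country with an unbalanced profile
    [0.60, 0.65, 0.60],   # country with a balanced profile
    [0.40, 0.85, 0.55],
])
weights = np.array([1 / 3, 1 / 3, 1 / 3])

arithmetic = scores @ weights
geometric = np.exp(np.log(scores) @ weights)   # penalises unbalanced profiles more

def ranking(composite):
    """Indices of countries ordered from best to worst composite score."""
    return list(np.argsort(-composite))

print("arithmetic ranking:", ranking(arithmetic))
print("geometric ranking: ", ranking(geometric))
```

When two defensible aggregation schemes invert a ranking, choosing between them stops being a purely technical matter, which is precisely the interviewees’ concern.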

With regard to the RES&T, one user pointed to purposeful correction as an issue of major concern for the quantification of research excellence:

Part of [the] alignment [of the RES&T indicator] led to counter-intuitive results, like a very low performance from Denmark, and then we reinstated the previous definition because it led to slightly better results. The definition of results is also important for political acceptance. User 4

As reported in the literature, and as also emerged throughout our interviews, excellence has been contested as the relevant policy concept for tackling the major challenges of measuring the impacts of science in society. The two quotes below stress the importance of aiming for indicators that go beyond mere scientific outputs, suggesting that assessment frameworks should also encompass other critical issues related to process (e.g. research ethics):

It is OK to measure publications, but not just the number. For instance, also how a certain topic has been developed in a specific domain or has been taken into a wider issue, or any more specific issues, these needs to be tracked as well. User 3

Let’s imagine two research groups: one does not do animal testing, and obtain mediocre results, the other does animal testing and have better results and more publications. How those two very different ethical approaches can be accounted? We should correct excellence by ethics! Developer 1

Our material illustrates several facets of different types of reductionism: first, the loss of multidimensionality as an inevitable consequence of reducing complexity; second, the rankings following from indicators sometimes work as drivers of, and specifications for, the production of the indicators themselves; finally, volatility in the results is expected to become an issue of major concern, specifically given ever-changing systems of knowledge production (see e.g. Hessels and van Lente 2008 ).

4.2.3 Backstage negotiations

Indicators largely depend on negotiations among actors seeking to implement their own visions and interests. From such a view, research indicators redefine reputation and prioritise funding. This process is depicted as an embedded and inevitable dynamic within indicator production:

[When developing an indicator] you will always have a negotiation process. You will always discuss ‘what you will do in this case’; ‘you would not include that’ or ‘you would do that’; ‘this does not cover it all’. You will always have imperfect and to a certain extent wrong data in whatever indicator you will have. User 1

[Developers] mainly do methodological work. The political decisions on the indicator are taken a bit higher up. Practitioner 3

Many politicians have a very poor view of what actually goes into knowledge production. This is what we have experienced in Europe, The Netherlands and the UK. Give me one number and one A4 with a half page summary and I can take decisions. We need to have some condensation and summarisation, and you cannot expect politicians to deal with all the complexities. At the same time, they must be aware that too poor a view of what knowledge production is, kills the chicken that lays the eggs. Practitioner 1

These quotes seem to suggest that there are ‘clear’ separate roles for those who participate in the production of the indicator and those who are empowered to decide what the final product may look like. In the case of the development of the RES&T indicator, the process of revision and validation of the indicator included a workshop organised by EC policymakers, in which developers and academics were invited to review the indicator’s proposed theoretical framework. The publication of the feasibility study by Barré et al. (2011) was the main input of this workshop; one of the developers that we interviewed remarked the following:

I find it interesting that [at the workshop] also policymakers had their own idea of what it [the indicator] should be. Developer 3

In other words, even if roles seem to be rather well defined, at the end of the day indicators respond to predefined political requests. On the other hand, it is interesting to note how this workshop worked as a space for clarifying positions and for establishing what the relevant expertise is.

Workshops are interesting in showing the controversies, and even if that is not the case for all indicators, the excellence one has gone through a certain level of peer review, revision and criticism. Even when you want to have an open underpinning, as a commissioning policy body, you’re in a difficult position: how do you select experts? User 2

Although the aim was reviewing and validating, people came up with another set of variables [different from the one proposed by the EG] that should have been taken into consideration. People make a difference and that is clear. Developer 3

Hence, these quotes seem to suggest that indicators are based on the selected ‘facts’ of the selected ‘experts’ called upon to perform the exercise. The call for evidence-based policy needs to acknowledge this context and carefully examine ‘factual’ promises that cannot be fulfilled, which also puts unnecessary pressure on developers:

You have to understand, we had to consider the request … . They [DG RTD] just wanted a ranking of member states in terms of this kind of excellence concept. This is what they want; this is what we had to deliver within the project. Developer 1

We found two elements intrinsic to the negotiation processes behind indicator development: first, different actors (developers vs. policymakers) move in different arenas (academic vs. political) and are moved by different interests; second, power relationships set what needs to be measured, which makes indicators little more than political devices, coherent with a performative function.

4.3 Reimagining what?

Our interviewees explored practical strategies to deal with the policy need for research quality assessments. As researchers, we had assumed that, because of the many controversies and the discontent expressed, there would be a wealth of ideas about novel ways to look into the quality of research. Yet, our empirical material shows that there are no clear alternative proposals either for measuring ‘excellent research’ or for enhancing the robustness of indicators, beyond small variations. As emerged frequently throughout almost all the interviews, many actors highlighted the necessity of carefully interrogating the very use of excellence as the right proxy for research quality, as in this quote:

The norm of quality in research that you consider valid and others might not consider valid needs to be discussed as well. A debate is possible and is fundamental within indicators. Developer 2

Despite the different positions on the controversial underpinnings of research excellence, widely discussed by the majority of interviewees from each of the three categories, none offered even slight or indirect suggestions on how to go beyond the issue of quantifying research quality for policy purposes:

When you have evidence based policy, unfortunately, at the moment, almost the only thing that counts is quantitative data. Case studies and evaluation studies are only strong if they have quantitative data. Then you will get into indicators and it is very difficult to get away from it. User 1

This observation summarises an inevitable commitment to quantification: when asked about research excellence, different actors tend to digress around specific implementations and their implications but do not strongly question the overall scope of the indicator as a means to map or ascertain scientific quality. But quantifications fit the policy (and political) purpose they are meant to support, as suggested in this honest account by one user:

I think the reasoning is very simple. If we want an indicator that fits its purpose, which are political purposes , for policy makers and objective measures, we need to be very clear on what we measure and, as you say, to have the best matching and mismatching between purpose and reality. I think that is the first question. Then we have to deal with the nitty gritty and see how, sorry, nitty gritty is important, whether we can improve statistically what we have. User 2

Hence, in our interviews the narrative of the inevitability of quantification persisted despite the recognition of its inherent limitations and misrepresentations. Interviewees focused on the possibility of improving indicators’ resonance with quality research, avoiding oversimplifications, and limiting possible unwanted implications. The quote below suggests that the known imperfections of indicators can actually help with raising questions; we therefore suggest that indicators could be viewed as prompts for further enquiry rather than as answering devices:

The point is that to take into account the fact that an indicator will never satisfy the totality of the issues concerned, my view is that an indicator is fine when it is built carefully, as long as it is then used not only to provide answers but to raise questions . […] for example, the indicator we are talking about is fine only as long as it goes along with the discussion of what it does not cover, of what it may hide or not consider with sufficient attention; or in what sense other types of institution or objectives can be taken into account. Developer 2

Along these lines, allowing for frequent (re)adjustments of the evaluation exercises and practices that sustain research indicators is seen as a major improvement:

I am more interested in making sure that as someone involved in composite indicator development, I get the chance to revisit regularly an index which was developed. I can look around and have some kind of conceptual, methodological or statistical review, or see if it is reflecting the ongoing discussions. I can do this if I have the freedom to choose my research. This is not necessarily the case in settings where research is very politically or policy driven. Developer 1

The issue of data availability is quite relevant, not only for the quality of the resulting indicators, but more interestingly because existing datasets determine what can be measured and ultimately give shape to the indicator itself, which is a normative issue tout court :

Many researchers or many users criticize existing indicators and say they are too conservative. [While they are] always criticized, it is difficult to come with new metrics and the traditional ones are very well grounded in statistics. We have a very good database on data metrics and patents, therefore these databases have some gravitational attraction, and people always go back to them. An indicator needs to be based on existing data. These data has to be acknowledged and there needs to be some experience of them and a bit of time lag between the coverage of new developments by data and then the use for developing indicators. User 4

Finally, excellence does not necessarily need to be a comparative concept, and indeed comparisons ultimately rely on a fair amount of de-contextualisation, which implies overlooking scientific (foremost disciplinary) differences of an epistemic, ontological, and practical nature. This is recognised by many of our interviewees:

[Excellence] it is not so useful for comparing EU performance to non-European countries, to US and Japan, because they do not have the same components. They do not have ERC grants, for example! User 4

My suspicion is that [excellence] also depends on the discipline! Practitioner 2

Our quest for reimagination stayed mostly confined to discussing the processes of indicator development, with interviewees largely sharing stances on the apparent inevitability of quantifying research excellence for policy purposes. In fact, we were somewhat disappointed that the discussion of other ways to describe and map quality in science did not really produce substantial alternatives. However, a few points were raised as central to strengthening the robustness of existing indicators: first, evaluation exercises that deploy research indicators should be checked frequently and fine-tuned if necessary; second, what it is possible to evaluate should not be constrained by existing datasets, and other sources of information should be sought, created, and imagined. In any case, the available sources of information are not sufficient when one considers the changing nature of current knowledge production and governance modes, which today involve a wider range of societal actors and practices (e.g. knowledge production systems appearing outside mainstream institutions).

In this article, we explored the making of a particular ‘research excellence’ indicator, starting from its underlying concept and institutional framing. Below, we summarise our findings in five main points. Together, these may constitute points of departure for future debates around alternative evaluation framings, descriptors, and frameworks to describe and map the quality of scientific research in the EU.

5.1 Research excellence: contested concept or misnomer?

Early in this article, we advanced the idea of excellence as an essentially contested concept, as articulated by Gallie (1955) . Our interviews seem to concur with the general idea that the definition of such a concept is not stable and that there are genuine difficulties (and actual unpreparedness) among interviewees even in coming up with a working definition of ‘research excellence’. In most cases, interviewees seem to agree that research excellence is a multidimensional, complex, and value-laden concept whose quantification is likely to end in controversy. ‘Institutionalised’ definitions, which may not necessarily have been subject to thorough reflection, were often given by our interviewees; they repeatedly remarked that each definition depends very much on the actors involved in developing indicators. So, would a more extended debate about the meanings and usefulness of the concept for assessing and comparing scientific research quality help to address some of the current discussions?

5.2 Inescapability of quantification?

The majority of our interviewees had a hard time imagining an assessment of research that does not rely on quantification . Yet, whether or not to quantify research excellence for policy purposes does not seem to be the question; the issue rather revolves around what really needs to be quantified. Is the number of published papers really an indication of excellence? Does citing a paper really imply that it has actually been read? As with classifications ( Bowker and Star 1999 ), indicators of research excellence are hard to live both with and without. The question is how to make life with indicators acceptable while recognising their fallibility. Once one recognises that quantifying research excellence requires choices to be made, reflecting on whose values and interests such choices serve, at the neglect of others, becomes important. We would argue that quantifying research excellence is first and foremost a political and normative issue and, as such, Merton’s (1973) pertinent question ‘who is to judge on research excellence?’ remains.

The need for quantification is encouraged by, and responds to, the trend of evidence-based policy. After all, this is a legacy of the ‘modern’ paradigm for policymaking, which needs to be based on scientific evidence which, in turn, needs to be delivered in numbers. However, as Boden and Epstein (2006) remarked, we might instead be in a situation of ‘policy based evidence’, where scientific research is assessed and governed to meet policy imaginaries of scientific activity (e.g. a focus on outcomes such as the number of publications, ‘one size fits all’ approaches to quantification across different scientific fields, etc.). The question then remains: can ideas of qualifying quantifications be developed in this case as well?

5.3 The lamppost

In the Mulla Nasruddin story, a drunken man searches under the lamppost for the keys he lost elsewhere, simply because that is where the light is. Some of the interviewees suggested that the bottleneck for quantification is existing data. In other words, data availability influences what it is possible to quantify: only those parameters for which considerable data already exist, that is those which are easy to count, seem to be the ones taken into account. We argue that this type of a priori limitation needs to be reflected upon, not least because knowledge production and the ways in which researchers make their work visible to the public are not confined to academic formats only. Moreover, if one considers the processes by which scientific endeavour actually develops, then we might really need to look outside the lamppost’s circle of light. Can we afford to describe and assess ‘excellent research’ relying exclusively on current parameters for which data are already available?

5.4 Drawing hands

In an introductory piece about the co-production idiom, Jasanoff (2004 : 2) says that ‘the ways in which we know and represent the world are inseparable from the ways we choose to live in it’. We concur with the idea that the construction of indicators is a sociopolitical practice. From such a perspective, it becomes clear that knowledge production practices are in turn conditioned by knowledge production assessment practices, exactly as depicted in artist M. C. Escher’s piece Drawing Hands . In other words, whichever ways (research excellence) indicators are constructed, their normative nature contributes to redefining scientific practices. We suggest that the construction of an indicator is a process in which both the concept (research excellence) and its measurements are mutually defined and are co-produced. If an indicator redefines reputation and eligibility for funding, researchers will necessarily adapt their conduct to meet such pre-established standards. However, this understanding is not shared by all interviewees, which suggests that future practice needs to raise awareness of the normativity inherent to the use of indicators.

5.5 One size does not fit all

Indicators necessarily de-contextualize information. Many of our interviewees suggested that other types of information would need to be captured by research indicators; to us, this casts doubt on the appropriateness of using indicators alone as the relevant devices for assessing research for the purposes of designing policy. What do such indicators tell us about scientific practices across fields and across different countries and institutions? The assumption that citation and publication practices are homogeneous across different specialties and fields of science has previously been shown to be problematic ( Leydesdorff 2008 ), and it is specifically within the policy context that indicators need to be discussed (see e.g. Moed et al. 2004).

The STS literature offers examples of cultures of scientific practice that warn us that indicators alone cannot be used to sustain policies, but they are certainly very useful for asking questions.

Nowotny (2007) and Paradeise and Thoenig (2013) argued that, like many other economic indicators, ‘research excellence’ is promoted at the EU level as a ‘soft’ policy tool (i.e. it responds to benchmarks to compel Member States to meet agreed obligations). But the implied measurements and comparisons ‘at all costs’ cannot be considered ‘soft’ at all: they inevitably trigger unforeseen and indirect incentives in pursuing a specific kind of excellence (see e.g. Martin 2011 ), often based on varied, synthetic, and implicit evaluations. In the interviews, we were told stories about the purposeful retuning of indicators because some countries did not perform as expected when variations to the original indicators were introduced.

If going beyond quantification eventually turns out not to be an option at all, we should at least aim for more transparency in the ‘participatory’ processes behind the construction of indicators. To cite Innes: ‘the most influential, valid, and reliable social indicators are constructed not just through the efforts of technicians, but also through the vision and understanding of the other participants in the policy process. Influential indicators reflect socially shared meanings and policy purposes, as well as respected technical methodology’ ( Innes 1990 ).

This work departed from the idea that the concept of research excellence is hard to institutionalise in the form of stable research excellence indicators, because it inevitably involves endless disputes about its usage. We therefore expected to find alternatives offered by other imaginaries and transformative ideas that could sustain potential changes. To test these ideas, we examined the development and quantification of the RES&T indicator, highlighting that this indicator is developed in a context in which it simultaneously responds to and normalises both scientific practice and policy expectations. We also explored the difficulties of measuring a concept (research excellence) that lacks agreed meanings. The in-depth interviews conducted with relevant actors involved in the development of the RES&T research indicator suggest that, while respondents widely acknowledge the intrinsic controversies in the concept and its measurement, and are willing to discuss alternatives (what we called ‘re-imagination’), they did not find it easy to imagine alternative ways to address research quality for policy purposes. Quantification is hard-wired into the practices and tools used to assess and assure the quality of scientific research, further reinforced by the current blind, at-all-costs call for quantified evidence-based policy to be applied in twenty-eight different EU Member States. However, suggestions were made to make reimagination a continuous stage of the process of developing excellence assessments, which reminds us of Barré’s agora model ( Barré 2004 ).

To conclude, more than a contested concept, our research led us to wonder whether ‘research excellence’ could be a misnomer for assessing the quality of scientific research in a world where processes, and not only outcomes, are increasingly subject to ethical and societal scrutiny. And what is the significance of excellence indicators when scientific research is a distributed endeavour that involves different actors and institutions, often even outside mainstream circles?

Conflict of interest statement . The views expressed in the article are purely those of the authors and may not in any circumstances be regarded as stating an official position of the European Commission or Rabobank.

The authors would like to thank the interviewees for their contribution, the two anonymous reviewers for their comments and suggestions, and the participants of the workshop on “Excellence Policies in Science” held in Leiden in 2016.

Barré R. ( 2001 ) ‘ Sense and Nonsense of S&T Productivity Indicators ’, Science and Public Policy , 28 / 4 : 259 – 66 .

Barré R. , ( 2004 ) ‘S&T indicators for policy making in a changing science-society relationship’, in: Moed H. F. , Glänzel W. , Schmoch U. . (eds) Handbook of Quantitative Science and Technology Research: The Use of Publication and Patent Statistics in Studies of S&T Systems , pp. 115–131. Dordrecht : Springer .

Barré R. ( 2010 ) ‘ Towards Socially Robust S&T Indicators: Indicators as Debatable Devices, Enabling Collective Learning ’, Research Evaluation , 19 / 3 : 227 – 31 .

Barré R. , Hollanders H. , Salter A. ( 2011 ). Indicators of Research Excellence . Expert Group on the measurement of innovation.

Burnett K. ( 2016 ) ‘Universities are Becoming Like Mechanical Nightingales’. Times Higher Education. < https://www.timeshighereducation.com/blog/universities-are-becoming-mechanical-nightingales>

Boden R. , Epstein D. ( 2006 ) ‘ Managing the Research Imagination? Globalisation and Research in Higher Education ’, Globalisation, Societies and Education , 4 / 2 : 223 – 36 .

Boulanger P.-M. ( 2014 ) Elements for a Comprehensive Assessment of Public Indicators . JRC Scientific and Policy Reports, Luxembourg : Publications Office of the European Union .

Bowker G. C. , Star S. L. ( 1999 ) Sorting Things Out: Classification and its Consequences . Cambridge : MIT Press .

Butler L. ( 2003 ) ‘ Explaining Australia’s Increased Share of ISI Publications—the Effects of a Funding Formula Based On Publication Counts ’, Research Policy , 32 : 143 – 55 .

Charmaz K. ( 2006 ) Constructing Grounded Theory: A Practical Guide Through Qualitative Analysis , Vol. 10. http://doi.org/10.1016/j.lisr.2007.11.003

Clarke A. E. ( 2003 ) ‘ Situational Analyses ’, Symbolic Interaction , 26 / 4 : 553 – 76 .

Collier D. , Daniel Hidalgo F. , Olivia Maciuceanu A. ( 2006 ) ‘ Essentially Contested Concepts: Debates and Applications ’, Journal of Political Ideologies , 11 : 211 – 46 .

Desrosieres A. , ( 2015 ) ‘Retroaction: How Indicators Feed Back onto Quantified Actors’. In: Rottenburg . et al.  (eds) The World of Indicators: The Making of Governmental Knowledge through Quantification . Cambridge : Cambridge University Press .

Elzinga A. ( 2012 ) ‘ Features of the Current Science Policy Regime: Viewed in Historical Perspective ’, Science and Public Policy , 39 / 4 : 416 – 28 .

European Commission ( 2014 ) Innovation Union Competitiveness Report 2013—Commission Staff Working Document, Directorate-General for Research and Innovation . Luxembourg : Publications Office of the European Union .

Espeland W. ( 2015 ) ‘Narrating Numbers’. In: Rottenburg et al.  (eds) The World of Indicators: The Making of Governmental Knowledge through Quantification . Cambridge, UK : Cambridge University Press .

Funtowicz S. O. , Ravetz J. R. ( 1990 ) Uncertainty and Quality in Science for Policy , 229 – 41 . Dordrecht : Kluwer Academic Publishers .

Gallie W. B. ( 1955 ) ‘ Essentially Contested Concepts ’, Proceedings of the Aristotelian Society , 56 : 167 – 98 .

Gieryn T. F. ( 1983 ) ‘ Boundary-Work and the Demarcation of Science from Non-science: Strains and Interests in Professional Ideologies of Scientists ’, American Sociological Review , 48 / 6 : 781 – 95 .

Grupp H. , Mogee M. ( 2004 ) ‘ Indicators for National Science and Technology Policy. How Robust are Composite Indicators? ’, Research Policy , 33 : 1373 – 84 .

Grupp H. , Schubert T. ( 2010 ) ‘ Review and New Evidence on Composite Innovation Indicators for Evaluating National Performance ’, Research Policy , 39 / 1 : 67 – 78 .

Gunasekara C. ( 2007 ) ‘ Pivoting the Centre: Reflections on Undertaking Qualitative Interviewing in Academia ’, Qualitative Research , 7 : 461 – 75 .

Hardeman S. ( 2013 ) ‘ Organization Level Research in Scientometrics: A Plea for an Explicit Pragmatic Approach ’, Scientometrics , 94 / 3 : 1175 – 94 .

Hardeman S. , Van Roy V. , Vertesy D. ( 2013 ) An Analysis of National Research Systems (I): A Composite Indicator for Scientific and Technological Research Excellence . JRC Scientific and Policy Reports, Luxembourg : Publications Office of the European Union .

Hessels L. K. , van Lente H. ( 2008 ) ‘ Re-thinking New Knowledge Production: A Literature Review and a Research Agenda ’, Research Policy , 37 / 4 : 740 – 60 .

Hicks D. ( 2012 ) ‘ Performance-based University Research Funding Systems ’, Research Policy , 41 / 2 : 251 – 61 .

Innes J. E. ( 1990 ) Knowledge and Public Policy. The Search for Meaningful Indicators . New Brunswick (USA) and London (UK ): Transaction Publishers .

Jasanoff S. (ed.) ( 2004 ) States of Knowledge: The Co-Production of Science and the Social Order . London : Routledge .

Lamont M. ( 2009 ) How Professors Think: Inside the Curious World of Academic Judgment . Cambridge/London : Harvard University Press .

Leydesdorff L. ( 2008 ) ‘ Caveats for the Use of Citation Indicators in Research and Journal Evaluations ’, Journal of the American Society for Information Science and Technology , 59 / 2 : 278 – 87 .

Martin B. R. ( 2011 ) ‘ The Research Excellence Framework and the ‘impact agenda’: are we Creating a Frankenstein Monster? ’, Research Evaluation , 20 / 3 : 247 – 54 .

Martin B. R. ( 2013 ) ‘ Whither Research Integrity? Plagiarism, Self-Plagiarism and Coercive Citation in an Age of Research Assessment ’, Research Policy , 42 / 5 : 1005 – 14 .

Merton R. K. ( 1973 ) ‘Recognition and Excellence: Instructive Ambiguities’. In: Merton R. K. (ed.) The Sociology of Science . Chicago : University of Chicago Press .

Moed H. F. , Glänzel W. , Schmoch U. (eds) ( 2004 ) Handbook of Quantitative Science and Technology Research. The Use of Publication and Patent Statistics in Studies of S&T Systems . Dordrecht : Kluwer .

Nowotny H. ( 2007 ) ‘ How Many Policy Rooms Are There? Evidence-Based and Other Kinds of Science Policies ’, Science, Technology & Human Values , 32 / 4 : 479 – 90 .

Paradeise C. , Thoenig J.-C. ( 2013 ) ‘ Organization Studies Orders and Global Standards ’, Organization Studies , 34 / 2 : 189 – 218 .

Polanyi M. ( 2000 ) ‘ The Republic of Science: Its Political and Economic Theory ’, Minerva , 38 / 1 : 1 – 21 .

Porter T. M. , ( 2015 ) The Flight of the Indicator. In: Rottenburg et al.  (eds) The World of Indicators: The Making of Governmental Knowledge through Quantification . Cambridge : Cambridge University Press .

Rafols I. , Leydesdorff L. , O’Hare A. et al.  ( 2012 ) ‘ How Journal Rankings Can Suppress Interdisciplinary Research: A Comparison Between Innovation Studies and Business & management ’, Research Policy , 41 / 7 : 1262 – 82 .

Saltelli A. , D’Hombres B. , Jesinghaus J. et al.  ( 2011 ) ‘ Indicators for European Union Policies. Business as Usual? ’, Social Indicators Research , 102 / 2 : 197 – 207 .

Saltelli A. , Funtowicz S. ( 2017 ). ‘What is Science’s Crisis Really About?’ Futures —in press. < http://www.sciencedirect.com/science/article/pii/S0016328717301969> accessed July 2017.

Sarewitz D. ( 2015 ) ‘ Reproducibility Will Not Cure What Ails Science ’, Nature , 525 : 159.

Smith A. F. ( 1996 ) ‘ MAD Cows and Ecstasy: Chance and Choice in an Evidence-Based Society ’, Journal-Royal Statistical Society Series A , 159 : 367 – 84 .

Sørensen M. P. , Bloch C. , Young M. ( 2015 ) ‘ Excellence in the Knowledge-Based Economy: from Scientific to Research Excellence ’, European Journal of Higher Education , 1 – 21 .

Stilgoe J. ( 2014 ). ‘Against Excellence’. The Guardian , 19 December 2014.

Tijssen R. J. ( 2003 ) ‘ Scoreboards of Research Excellence ’, Research Evaluation , 12 / 2 : 91 – 103 .

Vertesy D. , Tarantola S. ( 2012 ). Composite Indicators of Research Excellence . JRC Scientific and Policy Reports. Luxembourg : Publications Office of the European Union .

Weinberg A. M. ( 2000 ) ‘ Criteria for Scientific Choice ’, Minerva , 38 / 3 : 253 – 66 .

Weingart P. ( 2005 ) ‘ Impact of Bibliometrics upon the Science System: Inadvertent Consequences? ’, Scientometrics , 62 / 1 : 117 – 31 .

Wilsdon J. ( 2015 ) ‘ We Need a Measured Approach to Metrics ’, Nature , 523 / 7559 : 129 .

The official Horizon2020 document defines that research excellence is about to “[] ensure a steady stream of world-class research to secure Europe's long-term competitiveness. It will support the best ideas, develop talent within Europe, provide researchers with access to priority research infrastructure, and make Europe an attractive location for the world's best researchers. ” (European Commission, 2011) (p.4).

From “Against Excellence”, the Guardian, 19 /12/2014 Retrieved at https://www.theguardian.com/science/political-science/2014/dec/19/against-excellence.


What Are Research Indicators and How Do You Use Them?

In recent years, the field of science policy has placed increasing emphasis on “societal value and value for money, performance-based funding and on globalization of academic research.” There has also been notable growth in the need for internal research assessment and for broad research information systems. Alongside this shift towards research indicators, the computerization of the research process and the move to social media for academic communication mean that research assessment now relies heavily on data and metrics: citation indexes, electronic databases, publication repositories, usage analytics from publishers’ sites, and tools such as Google Analytics ( Moed, 2017 ).

According to ASPIRE, a research performance scheme, four indicator categories go into measuring research activity and quality:

  • Research income measures the monetary value of grants and research income, whether private or public.
  • Research training measures research activity through supervised doctoral completions and supervised master’s research completions, and research quality through timely completions (four years or less for doctorates, two years or less for master’s degrees).
  • Research outputs measure research activity through the number of publications (books, book chapters, journal articles, conference papers) and the number of creative works, such as live performances, exhibitions and events. Quality is measured by publications indexed in ERA-listed citation databases and by high-quality creative works identified through an internal peer review process.
  • Research engagement measures research activity through the number of accepted invention disclosures, granted patents and commercialization income, and research quality through externally co-authored or co-created outputs and external collaboration grants ( Edith Cowan University, 2016 ).
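
To make these categories concrete, here is a minimal, hypothetical Python sketch of how a couple of such indicators might be computed from institutional records. The records, figures and field names are invented for illustration; only the “four years or less” threshold comes from the scheme described above.

    # Hypothetical records of supervised doctoral completions (years to completion).
    doctoral_completions = [3.5, 4.0, 4.8, 3.9, 5.2, 4.1]

    # Research training activity: total supervised completions.
    training_activity = len(doctoral_completions)

    # Research training quality: share completed in four years or less ("timely").
    timely = [y for y in doctoral_completions if y <= 4.0]
    timely_completion_rate = len(timely) / len(doctoral_completions)

    # Research income indicator: total value of grants, by source (illustrative figures).
    grants = [{"source": "public", "value": 250_000}, {"source": "private", "value": 80_000}]
    research_income = sum(g["value"] for g in grants)

    print(training_activity, round(timely_completion_rate, 2), research_income)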

As the use of such indicators has grown, many different measures have been developed to better understand research impact. One approach is to measure outcomes, such as dollars saved, lives saved, and crashes avoided, and to combine them with other outputs; this is the approach taken in the Research Performance Measures System ( National Academies of Sciences, Engineering, and Medicine, 2008 ).

When choosing a measurement system, there are a few things to consider so that you do not come to depend on the wrong indicators. The way indicators are used in research assessment is increasingly criticized: indicators may be biased and fall short of measuring what they are expected to measure. Most studies also have a limited time horizon, which can make some indicators unreliable. In addition, indicators can be manipulated, and the societal impact they appear to capture may be flawed. For these reasons, many argue that relying on indicators alone to judge individual researchers produces faulty assessments. A valid and fair assessment of an individual’s research is only possible with sufficient background knowledge of the role that person played in the publications being assessed, alongside other information affecting their performance ( Moed, 2017 ).

Although contributing to scientific-scholarly progress is ultimately judged by history, it is argued that such impact can only be measured in the long term. As a result, some current indicators of scientific-scholarly contribution capture less the actual contribution than “attention, visibility, and short-term impact.” Societal value, for its part, is almost impossible to assess in a politically neutral manner, because it is usually measured with reference to a particular policy domain ( Moed, 2017 ).

These limitations should not lead you to underestimate the importance of indicators. Instead, pay attention to how such influences may shape your perception and try to counter any biases. One long-standing assumption in the use of indicators to assess academic research is that what matters to policymakers and researchers is not the “potential influence or importance of research but the actual influence or impact.” Another bias you can counter is the treatment of citations as indicators of a work’s importance rather than of effective communication; viewing citations in this second way discourages their use as a major indicator of importance ( Moed, 2017 ).

That said, do not let the shortcomings of research indicators stop you from benefiting from their best qualities. There are many ways to use indicators to inform your decisions and improve your outcomes. For example, you can treat citations as an indicator of your visibility and of whether you are effectively getting your work out into the world. Although how accurate indicators are remains hotly contested, balancing trust in indicators with a degree of skepticism will serve you better in the long run. The only way to fully understand these biases and calibrate your expectations is to look at different systems and dig into the studies behind the specific indicators you use.

Since indicators depend on a wide variety of outcomes, one way to improve those outcomes is to maximize your work’s impact and visibility. You can find tips on how to go about this by looking through eContent Pro’s blogs, which can help you raise your visibility and improve your impact. Also, since indicators rely on citations and other similarly measured factors, you might want to take a look at eContent Pro’s publishing services to make sure your work reaches the right audience and becomes an important piece of scholarly work in your academic field.

  • Edith Cowan University. (2016). Research performance analytics. https://intranet.ecu.edu.au/__data/assets/pdf_file/0005/720374/Research-Performance-Analytics.pdf
  • Moed, H. (2017). How can we use research performance indicators in an informed and responsible manner? The Bibliomagician. https://thebibliomagician.wordpress.com/2017/11/03/how-can-we-use-research-performance-indicators-in-an-informed-and-responsible-manner-guest-post-by-henk-moed/
  • National Academies of Sciences, Engineering, and Medicine. (2008). Performance measurement tool box and reporting system for research programs and projects. The National Academies Press. https://nap.nationalacademies.org/read/23093/chapter/5

Frequently asked questions

What’s the difference between concepts, variables, and indicators?

In scientific research, concepts are the abstract ideas or phenomena that are being studied (e.g., educational achievement). Variables are properties or characteristics of the concept (e.g., performance at school), while indicators are ways of measuring or quantifying variables (e.g., yearly grade reports).

The process of turning abstract concepts into measurable variables and indicators is called operationalization .
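
For a concrete (and entirely hypothetical) illustration of operationalization, the short Python sketch below takes the concept “educational achievement”, the variable “performance at school”, and an indicator built from yearly grade reports. The grading scale and the data are invented assumptions.

    # Hypothetical yearly grade reports for one student (0-100 scale, assumed).
    grade_reports = {"2021": [72, 68, 75], "2022": [80, 77, 83]}

    def performance_at_school(reports):
        """Indicator: mean of all recorded grades, used as a proxy for the variable
        'performance at school', which operationalizes 'educational achievement'."""
        grades = [g for year in reports.values() for g in year]
        return sum(grades) / len(grades)

    print(round(performance_at_school(grade_reports), 1))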

Frequently asked questions: Methodology

Attrition refers to participants leaving a study. It always happens to some extent—for example, in randomized controlled trials for medical research.

Differential attrition occurs when attrition or dropout rates differ systematically between the intervention and the control group . As a result, the characteristics of the participants who drop out differ from the characteristics of those who stay in the study. Because of this, study results may be biased .

Action research is conducted in order to solve a particular issue immediately, while case studies are often conducted over a longer period of time and focus more on observing and analyzing a particular ongoing phenomenon.

Action research is focused on solving a problem or informing individual and community-based knowledge in a way that impacts teaching, learning, and other related processes. It is less focused on contributing theoretical input, instead producing actionable input.

Action research is particularly popular with educators as a form of systematic inquiry because it prioritizes reflection and bridges the gap between theory and practice. Educators are able to simultaneously investigate an issue as they solve it, and the method is very iterative and flexible.

A cycle of inquiry is another name for action research . It is usually visualized in a spiral shape following a series of steps, such as “planning → acting → observing → reflecting.”

To make quantitative observations , you need to use instruments that are capable of measuring the quantity you want to observe. For example, you might use a ruler to measure the length of an object or a thermometer to measure its temperature.

Criterion validity and construct validity are both types of measurement validity . In other words, they both show you how accurately a method measures something.

While construct validity is the degree to which a test or other measurement method measures what it claims to measure, criterion validity is the degree to which a test can predictively (in the future) or concurrently (in the present) measure something.

Construct validity is often considered the overarching type of measurement validity . You need to have face validity , content validity , and criterion validity in order to achieve construct validity.

Convergent validity and discriminant validity are both subtypes of construct validity . Together, they help you evaluate whether a test measures the concept it was designed to measure.

  • Convergent validity indicates whether a test that is designed to measure a particular construct correlates with other tests that assess the same or similar construct.
  • Discriminant validity indicates whether two tests that should not be highly related to each other are indeed not related. This type of validity is also called divergent validity .

You need to assess both in order to demonstrate construct validity. Neither one alone is sufficient for establishing construct validity.

Content validity shows you how accurately a test or other measurement method taps  into the various aspects of the specific construct you are researching.

In other words, it helps you answer the question: “does the test measure all aspects of the construct I want to measure?” If it does, then the test has high content validity.

The higher the content validity, the more accurate the measurement of the construct.

If the test fails to include parts of the construct, or irrelevant parts are included, the validity of the instrument is threatened, which brings your results into question.

Face validity and content validity are similar in that they both evaluate how suitable the content of a test is. The difference is that face validity is subjective, and assesses content at surface level.

When a test has strong face validity, anyone would agree that the test’s questions appear to measure what they are intended to measure.

For example, looking at a 4th grade math test consisting of problems in which students have to add and multiply, most people would agree that it has strong face validity (i.e., it looks like a math test).

On the other hand, content validity evaluates how well a test represents all the aspects of a topic. Assessing content validity is more systematic and relies on expert evaluation of each question, analyzing whether each one covers the aspects that the test was designed to cover.

A 4th grade math test would have high content validity if it covered all the skills taught in that grade. Experts (in this case, math teachers) would have to evaluate the content validity by comparing the test to the learning objectives.

Snowball sampling is a non-probability sampling method . Unlike probability sampling (which involves some form of random selection ), the initial individuals selected to be studied are the ones who recruit new participants.

Because not every member of the target population has an equal chance of being recruited into the sample, selection in snowball sampling is non-random.

Snowball sampling is a non-probability sampling method , where there is not an equal chance for every member of the population to be included in the sample .

This means that you cannot use inferential statistics and make generalizations —often the goal of quantitative research . As such, a snowball sample is not representative of the target population and is usually a better fit for qualitative research .

Snowball sampling relies on the use of referrals. Here, the researcher recruits one or more initial participants, who then recruit the next ones.

Participants share similar characteristics and/or know each other. Because of this, not every member of the population has an equal chance of being included in the sample, giving rise to sampling bias .

Snowball sampling is best used in the following cases:

  • If there is no sampling frame available (e.g., people with a rare disease)
  • If the population of interest is hard to access or locate (e.g., people experiencing homelessness)
  • If the research focuses on a sensitive topic (e.g., extramarital affairs)

The reproducibility and replicability of a study can be ensured by writing a transparent, detailed method section and using clear, unambiguous language.

Reproducibility and replicability are related terms.

  • Reproducing research entails reanalyzing the existing data in the same manner.
  • Replicating (or repeating ) the research entails reconducting the entire analysis, including the collection of new data . 
  • A successful reproduction shows that the data analyses were conducted in a fair and honest manner.
  • A successful replication shows that the reliability of the results is high.

Stratified sampling and quota sampling both involve dividing the population into subgroups and selecting units from each subgroup. The purpose in both cases is to select a representative sample and/or to allow comparisons between subgroups.

The main difference is that in stratified sampling, you draw a random sample from each subgroup ( probability sampling ). In quota sampling you select a predetermined number or proportion of units, in a non-random manner ( non-probability sampling ).

Purposive and convenience sampling are both sampling methods that are typically used in qualitative data collection.

A convenience sample is drawn from a source that is conveniently accessible to the researcher. Convenience sampling does not distinguish characteristics among the participants. On the other hand, purposive sampling focuses on selecting participants possessing characteristics associated with the research study.

The findings of studies based on either convenience or purposive sampling can only be generalized to the (sub)population from which the sample is drawn, and not to the entire population.

Random sampling or probability sampling is based on random selection. This means that each unit has an equal chance (i.e., equal probability) of being included in the sample.

On the other hand, convenience sampling involves stopping people at random, which means that not everyone has an equal chance of being selected depending on the place, time, or day you are collecting your data.

Convenience sampling and quota sampling are both non-probability sampling methods. They both use non-random criteria like availability, geographical proximity, or expert knowledge to recruit study participants.

However, in convenience sampling, you continue to sample units or cases until you reach the required sample size.

In quota sampling, you first need to divide your population of interest into subgroups (strata) and estimate their proportions (quota) in the population. Then you can start your data collection, using convenience sampling to recruit participants, until the proportions in each subgroup coincide with the estimated proportions in the population.

A sampling frame is a list of every member in the entire population . It is important that the sampling frame is as complete as possible, so that your sample accurately reflects your population.

Stratified and cluster sampling may look similar, but bear in mind that groups created in cluster sampling are heterogeneous , so the individual characteristics in the cluster vary. In contrast, groups created in stratified sampling are homogeneous , as units share characteristics.

Relatedly, in cluster sampling you randomly select entire groups and include all units of each group in your sample. However, in stratified sampling, you select some units of all groups and include them in your sample. In this way, both methods can ensure that your sample is representative of the target population .
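
If it helps to see the two procedures side by side, here is a minimal Python sketch using an invented population of 40 units spread across four groups; the group labels and sample sizes are assumptions made purely for illustration.

    import random

    # Hypothetical population: each unit belongs to a group (e.g., a school).
    population = [{"id": i, "group": f"school_{i % 4}"} for i in range(40)]
    groups = sorted({u["group"] for u in population})

    # Stratified sampling: randomly draw some units from every group (stratum).
    stratified_sample = []
    for g in groups:
        stratum = [u for u in population if u["group"] == g]
        stratified_sample += random.sample(stratum, k=3)

    # Cluster sampling: randomly choose whole groups and keep all of their units.
    chosen_clusters = random.sample(groups, k=2)
    cluster_sample = [u for u in population if u["group"] in chosen_clusters]

    print(len(stratified_sample), len(cluster_sample))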

A systematic review is secondary research because it uses existing research. You don’t collect new data yourself.

The key difference between observational studies and experimental designs is that a well-done observational study does not influence the responses of participants, while experiments do have some sort of treatment condition applied to at least some participants by random assignment .

An observational study is a great choice for you if your research question is based purely on observations. If there are ethical, logistical, or practical concerns that prevent you from conducting a traditional experiment , an observational study may be a good choice. In an observational study, there is no interference or manipulation of the research subjects, as well as no control or treatment groups .

It’s often best to ask a variety of people to review your measurements. You can ask experts, such as other researchers, or laypeople, such as potential participants, to judge the face validity of tests.

While experts have a deep understanding of research methods , the people you’re studying can provide you with valuable insights you may have missed otherwise.

Face validity is important because it’s a simple first step to measuring the overall validity of a test or technique. It’s a relatively intuitive, quick, and easy way to start checking whether a new measure seems useful at first glance.

Good face validity means that anyone who reviews your measure says that it seems to be measuring what it’s supposed to. With poor face validity, someone reviewing your measure may be left confused about what you’re measuring and why you’re using this method.

Face validity is about whether a test appears to measure what it’s supposed to measure. This type of validity is concerned with whether a measure seems relevant and appropriate for what it’s assessing only on the surface.

Statistical analyses are often applied to test validity with data from your measures. You test convergent validity and discriminant validity with correlations to see if results from your test are positively or negatively related to those of other established tests.

You can also use regression analyses to assess whether your measure is actually predictive of outcomes that you expect it to predict theoretically. A regression analysis that supports your expectations strengthens your claim of construct validity .
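
As a rough sketch of the correlation approach described above, the Python snippet below correlates scores from a hypothetical new measure with an established measure of the same construct (expecting a high correlation, for convergent validity) and with a measure of a distinct construct (expecting a low correlation, for discriminant validity). All scores are invented for illustration.

    import numpy as np

    # Hypothetical scores for the same 8 participants on three measures.
    new_measure       = np.array([10, 12, 15, 9, 14, 11, 13, 16])
    established_same  = np.array([11, 13, 16, 8, 15, 10, 14, 17])  # same construct
    established_other = np.array([ 3,  9,  2, 8,  5,  7,  4,  6])  # distinct construct

    convergent_r   = np.corrcoef(new_measure, established_same)[0, 1]
    discriminant_r = np.corrcoef(new_measure, established_other)[0, 1]

    # Convergent validity: expect convergent_r to be high (close to 1).
    # Discriminant validity: expect discriminant_r to be low (close to 0).
    print(round(convergent_r, 2), round(discriminant_r, 2))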

When designing or evaluating a measure, construct validity helps you ensure you’re actually measuring the construct you’re interested in. If you don’t have construct validity, you may inadvertently measure unrelated or distinct constructs and lose precision in your research.

Construct validity is often considered the overarching type of measurement validity ,  because it covers all of the other types. You need to have face validity , content validity , and criterion validity to achieve construct validity.

Construct validity is about how well a test measures the concept it was designed to evaluate. It’s one of four main types of measurement validity , alongside face validity , content validity, and criterion validity.

There are two subtypes of construct validity.

  • Convergent validity : The extent to which your measure corresponds to measures of related constructs
  • Discriminant validity : The extent to which your measure is unrelated or negatively related to measures of distinct constructs

Naturalistic observation is a valuable tool because of its flexibility, external validity , and suitability for topics that can’t be studied in a lab setting.

The downsides of naturalistic observation include its lack of scientific control , ethical considerations , and potential for bias from observers and subjects.

Naturalistic observation is a qualitative research method where you record the behaviors of your research subjects in real world settings. You avoid interfering or influencing anything in a naturalistic observation.

You can think of naturalistic observation as “people watching” with a purpose.

A dependent variable is what changes as a result of the independent variable manipulation in experiments . It’s what you’re interested in measuring, and it “depends” on your independent variable.

In statistics, dependent variables are also called:

  • Response variables (they respond to a change in another variable)
  • Outcome variables (they represent the outcome you want to measure)
  • Left-hand-side variables (they appear on the left-hand side of a regression equation)

An independent variable is the variable you manipulate, control, or vary in an experimental study to explore its effects. It’s called “independent” because it’s not influenced by any other variables in the study.

Independent variables are also called:

  • Explanatory variables (they explain an event or outcome)
  • Predictor variables (they can be used to predict the value of a dependent variable)
  • Right-hand-side variables (they appear on the right-hand side of a regression equation).

As a rule of thumb, questions related to thoughts, beliefs, and feelings work well in focus groups. Take your time formulating strong questions, paying special attention to phrasing. Be careful to avoid leading questions , which can bias your responses.

Overall, your focus group questions should be:

  • Open-ended and flexible
  • Impossible to answer with “yes” or “no” (questions that start with “why” or “how” are often best)
  • Unambiguous, getting straight to the point while still stimulating discussion
  • Unbiased and neutral

A structured interview is a data collection method that relies on asking questions in a set order to collect data on a topic. They are often quantitative in nature. Structured interviews are best used when: 

  • You already have a very clear understanding of your topic. Perhaps significant research has already been conducted, or you have done some prior research yourself, but you already possess a baseline for designing strong structured questions.
  • You are constrained in terms of time or resources and need to analyze your data quickly and efficiently.
  • Your research question depends on strong parity between participants, with environmental conditions held constant.

More flexible interview options include semi-structured interviews , unstructured interviews , and focus groups .

Social desirability bias is the tendency for interview participants to give responses that will be viewed favorably by the interviewer or other participants. It occurs in all types of interviews and surveys , but is most common in semi-structured interviews , unstructured interviews , and focus groups .

Social desirability bias can be mitigated by ensuring participants feel at ease and comfortable sharing their views. Make sure to pay attention to your own body language and any physical or verbal cues, such as nodding or widening your eyes.

This type of bias can also occur in observations if the participants know they’re being observed. They might alter their behavior accordingly.

The interviewer effect is a type of bias that emerges when a characteristic of an interviewer (race, age, gender identity, etc.) influences the responses given by the interviewee.

There is a risk of an interviewer effect in all types of interviews , but it can be mitigated by writing really high-quality interview questions.

A semi-structured interview is a blend of structured and unstructured types of interviews. Semi-structured interviews are best used when:

  • You have prior interview experience. Spontaneous questions are deceptively challenging, and it’s easy to accidentally ask a leading question or make a participant uncomfortable.
  • Your research question is exploratory in nature. Participant answers can guide future research questions and help you develop a more robust knowledge base for future research.

An unstructured interview is the most flexible type of interview, but it is not always the best fit for your research topic.

Unstructured interviews are best used when:

  • You are an experienced interviewer and have a very strong background in your research topic, since it is challenging to ask spontaneous, colloquial questions.
  • Your research question is exploratory in nature. While you may have developed hypotheses, you are open to discovering new or shifting viewpoints through the interview process.
  • You are seeking descriptive data, and are ready to ask questions that will deepen and contextualize your initial thoughts and hypotheses.
  • Your research depends on forming connections with your participants and making them feel comfortable revealing deeper emotions, lived experiences, or thoughts.

The four most common types of interviews are:

  • Structured interviews : The questions are predetermined in both topic and order. 
  • Semi-structured interviews : A few questions are predetermined, but other questions aren’t planned.
  • Unstructured interviews : None of the questions are predetermined.
  • Focus group interviews : The questions are presented to a group instead of one individual.

Deductive reasoning is commonly used in scientific research, and it’s especially associated with quantitative research .

In research, you might have come across something called the hypothetico-deductive method . It’s the scientific method of testing hypotheses to check whether your predictions are substantiated by real-world data.

Deductive reasoning is a logical approach where you progress from general ideas to specific conclusions. It’s often contrasted with inductive reasoning , where you start with specific observations and form general conclusions.

Deductive reasoning is also called deductive logic.

There are many different types of inductive reasoning that people use formally or informally.

Here are a few common types:

  • Inductive generalization : You use observations about a sample to come to a conclusion about the population it came from.
  • Statistical generalization: You use specific numbers about samples to make statements about populations.
  • Causal reasoning: You make cause-and-effect links between different things.
  • Sign reasoning: You make a conclusion about a correlational relationship between different things.
  • Analogical reasoning: You make a conclusion about something based on its similarities to something else.

Inductive reasoning is a bottom-up approach, while deductive reasoning is top-down.

Inductive reasoning takes you from the specific to the general, while in deductive reasoning, you make inferences by going from general premises to specific conclusions.

In inductive research , you start by making observations or gathering data. Then, you take a broad scan of your data and search for patterns. Finally, you make general conclusions that you might incorporate into theories.

Inductive reasoning is a method of drawing conclusions by going from the specific to the general. It’s usually contrasted with deductive reasoning, where you proceed from general information to specific conclusions.

Inductive reasoning is also called inductive logic or bottom-up reasoning.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Triangulation can help:

  • Reduce research bias that comes from using a single method, theory, or investigator
  • Enhance validity by approaching the same topic with different tools
  • Establish credibility by giving you a complete picture of the research problem

But triangulation can also pose problems:

  • It’s time-consuming and labor-intensive, often involving an interdisciplinary team.
  • Your results may be inconsistent or even contradictory.

There are four main types of triangulation :

  • Data triangulation : Using data from different times, spaces, and people
  • Investigator triangulation : Involving multiple researchers in collecting or analyzing data
  • Theory triangulation : Using varying theoretical perspectives in your research
  • Methodological triangulation : Using different methodologies to approach the same topic

Many academic fields use peer review , largely to determine whether a manuscript is suitable for publication. Peer review enhances the credibility of the published manuscript.

However, peer review is also common in non-academic settings. The United Nations, the European Union, and many individual nations use peer review to evaluate grant applications. It is also widely used in medical and health-related fields as a teaching or quality-of-care measure. 

Peer assessment is often used in the classroom as a pedagogical tool. Both receiving feedback and providing it are thought to enhance the learning process, helping students think critically and collaboratively.

Peer review can stop obviously problematic, falsified, or otherwise untrustworthy research from being published. It also represents an excellent opportunity to get feedback from renowned experts in your field. It acts as a first defense, helping you ensure your argument is clear and that there are no gaps, vague terms, or unanswered questions for readers who weren’t involved in the research process.

Peer-reviewed articles are considered a highly credible source due to this stringent process they go through before publication.

In general, the peer review process involves the following steps: 

  • First, the author submits the manuscript to the editor.
  • The editor then either rejects the manuscript and sends it back to the author, or sends it onward to the selected peer reviewer(s).
  • Next, the peer review process occurs. The reviewer provides feedback, addressing any major or minor issues with the manuscript, and gives their advice regarding what edits should be made. 
  • Lastly, the edited manuscript is sent back to the author. They input the edits and resubmit it to the editor for publication.

Exploratory research is often used when the issue you’re studying is new or when the data collection process is challenging for some reason.

You can use exploratory research if you have a general idea or a specific question that you want to study but there is no preexisting knowledge or paradigm with which to study it.

Exploratory research is a methodology approach that explores research questions that have not previously been studied in depth. It is often used when the issue you’re studying is new, or the data collection process is challenging in some way.

Explanatory research is used to investigate how or why a phenomenon occurs. Therefore, this type of research is often one of the first stages in the research process , serving as a jumping-off point for future research.

Exploratory research aims to explore the main aspects of an under-researched problem, while explanatory research aims to explain the causes and consequences of a well-defined problem.

Explanatory research is a research method used to investigate how or why something occurs when only a small amount of information is available pertaining to that topic. It can help you increase your understanding of a given topic.

Clean data are valid, accurate, complete, consistent, unique, and uniform. Dirty data include inconsistencies and errors.

Dirty data can come from any part of the research process, including poor research design , inappropriate measurement materials, or flawed data entry.

Data cleaning takes place between data collection and data analyses. But you can use some methods even before collecting data.

For clean data, you should start by designing measures that collect valid data. Data validation at the time of data entry or collection helps you minimize the amount of data cleaning you’ll need to do.

After data collection, you can use data standardization and data transformation to clean your data. You’ll also deal with any missing values, outliers, and duplicate values.

Every dataset requires different techniques to clean dirty data , but you need to address these issues in a systematic way. You focus on finding and resolving data points that don’t agree or fit with the rest of your dataset.

These data might be missing values, outliers, duplicate values, incorrectly formatted, or irrelevant. You’ll start with screening and diagnosing your data. Then, you’ll often standardize and accept or remove data to make your dataset consistent and valid.
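
A minimal pandas sketch of that workflow, using an invented dataset (the column names and plausibility cut-offs are assumptions): screen for duplicates and missing values, standardize an inconsistently coded column, and drop values outside a plausible range.

    import pandas as pd

    # Hypothetical raw survey data with typical problems.
    raw = pd.DataFrame({
        "id":     [1, 2, 2, 3, 4],
        "weight": [68.0, 72.5, 72.5, None, 640.0],   # missing value and implausible outlier
        "sex":    ["F", "female", "female", "M", "m"],
    })

    clean = raw.drop_duplicates(subset="id").copy()   # remove duplicate records
    clean = clean.dropna(subset=["weight"])           # handle missing values
    clean["sex"] = clean["sex"].str.upper().str[0]    # standardize coding to "F"/"M"
    clean = clean[clean["weight"].between(30, 250)]   # drop values outside a plausible range

    print(clean)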

Data cleaning is necessary for valid and appropriate analyses. Dirty data contain inconsistencies or errors , but cleaning your data helps you minimize or resolve these.

Without data cleaning, you could end up with a Type I or II error in your conclusion. These types of erroneous conclusions can be practically significant with important consequences, because they lead to misplaced investments or missed opportunities.

Data cleaning involves spotting and resolving potential data inconsistencies or errors to improve your data quality. An error is any value (e.g., recorded weight) that doesn’t reflect the true value (e.g., actual weight) of something that’s being measured.

In this process, you review, analyze, detect, modify, or remove “dirty” data to make your dataset “clean.” Data cleaning is also called data cleansing or data scrubbing.

Research misconduct means making up or falsifying data, manipulating data analyses, or misrepresenting results in research reports. It’s a form of academic fraud.

These actions are committed intentionally and can have serious consequences; research misconduct is not a simple mistake or a point of disagreement but a serious ethical failure.

Anonymity means you don’t know who the participants are, while confidentiality means you know who they are but remove identifying information from your research report. Both are important ethical considerations .

You can only guarantee anonymity by not collecting any personally identifying information—for example, names, phone numbers, email addresses, IP addresses, physical characteristics, photos, or videos.

You can keep data confidential by using aggregate information in your research report, so that you only refer to groups of participants rather than individuals.

Research ethics matter for scientific integrity, human rights and dignity, and collaboration between science and society. These principles make sure that participation in studies is voluntary, informed, and safe.

Ethical considerations in research are a set of principles that guide your research designs and practices. These principles include voluntary participation, informed consent, anonymity, confidentiality, potential for harm, and results communication.

Scientists and researchers must always adhere to a certain code of conduct when collecting data from others .

These considerations protect the rights of research participants, enhance research validity , and maintain scientific integrity.

In multistage sampling , you can use probability or non-probability sampling methods .

For a probability sample, you have to conduct probability sampling at every stage.

You can mix it up by using simple random sampling , systematic sampling , or stratified sampling to select units at different stages, depending on what is applicable and relevant to your study.

Multistage sampling can simplify data collection when you have large, geographically spread samples, and you can obtain a probability sample without a complete sampling frame.

But multistage sampling may not lead to a representative sample, and larger samples are needed for multistage samples to achieve the statistical properties of simple random samples .

These are four of the most common mixed methods designs :

  • Convergent parallel: Quantitative and qualitative data are collected at the same time and analyzed separately. After both analyses are complete, compare your results to draw overall conclusions. 
  • Embedded: Quantitative and qualitative data are collected at the same time, but within a larger quantitative or qualitative design. One type of data is secondary to the other.
  • Explanatory sequential: Quantitative data is collected and analyzed first, followed by qualitative data. You can use this design if you think your qualitative data will explain and contextualize your quantitative findings.
  • Exploratory sequential: Qualitative data is collected and analyzed first, followed by quantitative data. You can use this design if you think the quantitative data will confirm or validate your qualitative findings.

Triangulation in research means using multiple datasets, methods, theories and/or investigators to address a research question. It’s a research strategy that can help you enhance the validity and credibility of your findings.

Triangulation is mainly used in qualitative research , but it’s also commonly applied in quantitative research . Mixed methods research always uses triangulation.

In multistage sampling , or multistage cluster sampling, you draw a sample from a population using smaller and smaller groups at each stage.

This method is often used to collect data from a large, geographically spread group of people in national surveys, for example. You take advantage of hierarchical groupings (e.g., from state to city to neighborhood) to create a sample that’s less expensive and time-consuming to collect data from.

No, the steepness or slope of the line isn’t related to the correlation coefficient value. The correlation coefficient only tells you how closely your data fit on a line, so two datasets with the same correlation coefficient can have very different slopes.

To find the slope of the line, you’ll need to perform a regression analysis .
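
A small numerical illustration of this point, using made-up data: the two datasets below have roughly the same correlation coefficient but very different regression slopes.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y1 = 2.0 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])            # steep line, slope ~ 2
    y2 = 0.1 * x + np.array([0.005, -0.01, 0.0, 0.01, -0.005])      # shallow line, slope ~ 0.1

    for y in (y1, y2):
        r = np.corrcoef(x, y)[0, 1]             # correlation: how tightly points fit a line
        slope, intercept = np.polyfit(x, y, 1)  # simple linear regression: the line itself
        print(round(r, 3), round(slope, 3))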

Correlation coefficients always range between -1 and 1.

The sign of the coefficient tells you the direction of the relationship: a positive value means the variables change together in the same direction, while a negative value means they change together in opposite directions.

The absolute value of a number is equal to the number without its sign. The absolute value of a correlation coefficient tells you the magnitude of the correlation: the greater the absolute value, the stronger the correlation.

These are the assumptions your data must meet if you want to use Pearson’s r :

  • Both variables are on an interval or ratio level of measurement
  • Data from both variables follow normal distributions
  • Your data have no outliers
  • Your data is from a random or representative sample
  • You expect a linear relationship between the two variables
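
Once those assumptions look reasonable, computing Pearson’s r is straightforward. Here is a minimal sketch using scipy, with invented data:

    from scipy.stats import pearsonr

    # Hypothetical interval-level measurements for ten participants.
    hours_studied = [2, 5, 1, 3, 8, 7, 4, 6, 9, 5]
    exam_score    = [55, 70, 50, 62, 88, 80, 66, 75, 90, 72]

    r, p_value = pearsonr(hours_studied, exam_score)
    print(round(r, 2), round(p_value, 4))   # strength/direction of the linear relationship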

Quantitative research designs can be divided into two main categories:

  • Correlational and descriptive designs are used to investigate characteristics, averages, trends, and associations between variables.
  • Experimental and quasi-experimental designs are used to test causal relationships .

Qualitative research designs tend to be more flexible. Common types of qualitative design include case study , ethnography , and grounded theory designs.

A well-planned research design helps ensure that your methods match your research aims, that you collect high-quality data, and that you use the right kind of analysis to answer your questions, utilizing credible sources . This allows you to draw valid , trustworthy conclusions.

The priorities of a research design can vary depending on the field, but you usually have to specify:

  • Your research questions and/or hypotheses
  • Your overall approach (e.g., qualitative or quantitative )
  • The type of design you’re using (e.g., a survey , experiment , or case study )
  • Your sampling methods or criteria for selecting subjects
  • Your data collection methods (e.g., questionnaires , observations)
  • Your data collection procedures (e.g., operationalization , timing and data management)
  • Your data analysis methods (e.g., statistical tests  or thematic analysis )

A research design is a strategy for answering your   research question . It defines your overall approach and determines how you will collect and analyze data.

Questionnaires can be self-administered or researcher-administered.

Self-administered questionnaires can be delivered online or in paper-and-pen formats, in person or through mail. All questions are standardized so that all respondents receive the same questions with identical wording.

Researcher-administered questionnaires are interviews that take place by phone, in-person, or online between researchers and respondents. You can gain deeper insights by clarifying questions for respondents or asking follow-up questions.

You can organize the questions logically, with a clear progression from simple to complex, or randomly between respondents. A logical flow helps respondents process the questionnaire more easily and quickly, but it may lead to bias. Randomization can minimize the bias from order effects.

Closed-ended, or restricted-choice, questions offer respondents a fixed set of choices to select from. These questions are easier to answer quickly.

Open-ended or long-form questions allow respondents to answer in their own words. Because there are no restrictions on their choices, respondents can answer in ways that researchers may not have otherwise considered.

A questionnaire is a data collection tool or instrument, while a survey is an overarching research method that involves collecting and analyzing data from people using questionnaires.

The third variable and directionality problems are two main reasons why correlation isn’t causation .

The third variable problem means that a confounding variable affects both variables to make them seem causally related when they are not.

The directionality problem is when two variables correlate and might actually have a causal relationship, but it’s impossible to conclude which variable causes changes in the other.

Correlation describes an association between variables : when one variable changes, so does the other. A correlation is a statistical indicator of the relationship between variables.

Causation means that changes in one variable bring about changes in the other (i.e., there is a cause-and-effect relationship between variables). The two variables are correlated with each other, and there’s also a causal link between them.

While causation and correlation can exist simultaneously, correlation does not imply causation. In other words, correlation is simply a relationship where A relates to B—but A doesn’t necessarily cause B to happen (or vice versa). Mistaking correlation for causation is a common error and can lead to false cause fallacy .

Controlled experiments establish causality, whereas correlational studies only show associations between variables.

  • In an experimental design , you manipulate an independent variable and measure its effect on a dependent variable. Other variables are controlled so they can’t impact the results.
  • In a correlational design , you measure variables without manipulating any of them. You can test whether your variables change together, but you can’t be sure that one variable caused a change in another.

In general, correlational research is high in external validity while experimental research is high in internal validity .

A correlation is usually tested for two variables at a time, but you can test correlations between three or more variables.

A correlation coefficient is a single number that describes the strength and direction of the relationship between your variables.

Different types of correlation coefficients might be appropriate for your data based on their levels of measurement and distributions . The Pearson product-moment correlation coefficient (Pearson’s r ) is commonly used to assess a linear relationship between two quantitative variables.

A correlational research design investigates relationships between two variables (or more) without the researcher controlling or manipulating any of them. It’s a non-experimental type of quantitative research .

A correlation reflects the strength and/or direction of the association between two or more variables.

  • A positive correlation means that both variables change in the same direction.
  • A negative correlation means that the variables change in opposite directions.
  • A zero correlation means there’s no relationship between the variables.

Random error  is almost always present in scientific studies, even in highly controlled settings. While you can’t eradicate it completely, you can reduce random error by taking repeated measurements, using a large sample, and controlling extraneous variables .

You can avoid systematic error through careful design of your sampling , data collection , and analysis procedures. For example, use triangulation to measure your variables using multiple methods; regularly calibrate instruments or procedures; use random sampling and random assignment ; and apply masking (blinding) where possible.

Systematic error is generally a bigger problem in research.

With random error, multiple measurements will tend to cluster around the true value. When you’re collecting data from a large sample , the errors in different directions will cancel each other out.

Systematic errors are much more problematic because they can skew your data away from the true value. This can lead you to false conclusions ( Type I and II errors ) about the relationship between the variables you’re studying.

Random and systematic error are two types of measurement error.

Random error is a chance difference between the observed and true values of something (e.g., a researcher misreading a weighing scale records an incorrect measurement).

Systematic error is a consistent or proportional difference between the observed and true values of something (e.g., a miscalibrated scale consistently records weights as higher than they actually are).
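
A tiny simulation (with invented numbers) makes the difference concrete: random error averages out over many measurements, while systematic error shifts every measurement in the same direction.

    import random

    random.seed(0)
    true_weight = 70.0  # kg, the quantity being measured

    # Random error: each reading is off by a random amount in either direction.
    random_readings = [true_weight + random.gauss(0, 0.5) for _ in range(1000)]

    # Systematic error: a miscalibrated scale adds 2 kg to every reading.
    systematic_readings = [true_weight + 2.0 + random.gauss(0, 0.5) for _ in range(1000)]

    print(round(sum(random_readings) / len(random_readings), 2))          # ~70.0: errors cancel out
    print(round(sum(systematic_readings) / len(systematic_readings), 2))  # ~72.0: consistently biased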

On graphs, the explanatory variable is conventionally placed on the x-axis, while the response variable is placed on the y-axis.

  • If you have quantitative variables , use a scatterplot or a line graph.
  • If your response variable is categorical, use a scatterplot or a line graph.
  • If your explanatory variable is categorical, use a bar graph.

The term “ explanatory variable ” is sometimes preferred over “ independent variable ” because, in real world contexts, independent variables are often influenced by other variables. This means they aren’t totally independent.

Multiple independent variables may also be correlated with each other, so “explanatory variables” is a more appropriate term.

The difference between explanatory and response variables is simple:

  • An explanatory variable is the expected cause, and it explains the results.
  • A response variable is the expected effect, and it responds to other variables.

In a controlled experiment , all extraneous variables are held constant so that they can’t influence the results. Controlled experiments require:

  • A control group that receives a standard treatment, a fake treatment, or no treatment.
  • Random assignment of participants to ensure the groups are equivalent.

Depending on your study topic, there are various other methods of controlling variables .

There are 4 main types of extraneous variables :

  • Demand characteristics : environmental cues that encourage participants to conform to researchers’ expectations.
  • Experimenter effects : unintentional actions by researchers that influence study outcomes.
  • Situational variables : environmental variables that alter participants’ behaviors.
  • Participant variables : any characteristic or aspect of a participant’s background that could affect study results.

An extraneous variable is any variable that you’re not investigating that can potentially affect the dependent variable of your research study.

A confounding variable is a type of extraneous variable that not only affects the dependent variable, but is also related to the independent variable.

In a factorial design, multiple independent variables are tested.

If you test two variables, each level of one independent variable is combined with each level of the other independent variable to create different conditions.

Within-subjects designs have many potential threats to internal validity , but they are also very statistically powerful .

Advantages:

  • Only requires small samples
  • Statistically powerful
  • Removes the effects of individual differences on the outcomes

Disadvantages:

  • Internal validity threats reduce the likelihood of establishing a direct relationship between variables
  • Time-related effects, such as growth, can influence the outcomes
  • Carryover effects mean that the specific order of different treatments affect the outcomes

While a between-subjects design has fewer threats to internal validity , it also requires more participants for high statistical power than a within-subjects design .

Advantages:

  • Prevents carryover effects of learning and fatigue.
  • Shorter study duration.

Disadvantages:

  • Needs larger samples for high power.
  • Uses more resources to recruit participants, administer sessions, cover costs, etc.
  • Individual differences may be an alternative explanation for results.

Yes. Between-subjects and within-subjects designs can be combined in a single study when you have two or more independent variables (a factorial design). In a mixed factorial design, one variable is altered between subjects and another is altered within subjects.

In a between-subjects design , every participant experiences only one condition, and researchers assess group differences between participants in various conditions.

In a within-subjects design , each participant experiences all conditions, and researchers test the same participants repeatedly for differences between conditions.

The word “between” means that you’re comparing different conditions between groups, while the word “within” means you’re comparing different conditions within the same group.

Random assignment is used in experiments with a between-groups or independent measures design. In this research design, there’s usually a control group and one or more experimental groups. Random assignment helps ensure that the groups are comparable.

In general, you should always use random assignment in this type of experimental design when it is ethically possible and makes sense for your study topic.

To implement random assignment , assign a unique number to every member of your study’s sample .

Then, you can use a random number generator or a lottery method to randomly assign each number to a control or experimental group. You can also do so manually, by flipping a coin or rolling a die to randomly assign participants to groups.
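
A minimal sketch of this procedure in Python, using the standard library’s random number generator (the participant IDs below are invented):

```python
import random

# Step 1: assign a unique number to every member of the sample
participants = list(range(1, 21))   # 20 hypothetical participants

# Step 2: shuffle the numbers randomly ...
random.shuffle(participants)

# ... and split the shuffled list into two equally sized groups
midpoint = len(participants) // 2
control_group = participants[:midpoint]
experimental_group = participants[midpoint:]

print("Control group:", sorted(control_group))
print("Experimental group:", sorted(experimental_group))
```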

Random selection, or random sampling , is a way of selecting members of a population for your study’s sample.

In contrast, random assignment is a way of sorting the sample into control and experimental groups.

Random sampling enhances the external validity or generalizability of your results, while random assignment improves the internal validity of your study.

In experimental research, random assignment is a way of placing participants from your sample into different groups using randomization. With this method, every member of the sample has a known or equal chance of being placed in a control group or an experimental group.

“Controlling for a variable” means measuring extraneous variables and accounting for them statistically to remove their effects on other variables.

Researchers often model control variable data along with independent and dependent variable data in regression analyses and ANCOVAs . That way, you can isolate the control variable’s effects from the relationship between the variables of interest.
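
As a rough sketch of this idea, a regression that models a control variable alongside the independent variable might look like the following in Python with statsmodels. The variable names and data are invented for illustration:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented data: outcome, predictor of interest, and a control variable
data = pd.DataFrame({
    "income":    [32, 45, 51, 38, 60, 48, 55, 41, 63, 36],
    "education": [12, 14, 16, 12, 18, 16, 16, 13, 18, 12],
    "age":       [25, 31, 40, 29, 50, 45, 38, 27, 52, 24],
})

# Including "age" in the model statistically controls for it, so the
# education coefficient reflects the association net of age
model = smf.ols("income ~ education + age", data=data).fit()
print(model.summary())
```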

Control variables help you establish a correlational or causal relationship between variables by enhancing internal validity .

If you don’t control relevant extraneous variables , they may influence the outcomes of your study, and you may not be able to demonstrate that your results are really an effect of your independent variable .

A control variable is any variable that’s held constant in a research study. It’s not a variable of interest in the study, but it’s controlled because it could influence the outcomes.

Including mediators and moderators in your research helps you go beyond studying a simple relationship between two variables for a fuller picture of the real world. They are important to consider when studying complex correlational or causal relationships.

Mediators are part of the causal pathway of an effect, and they tell you how or why an effect takes place. Moderators usually help you judge the external validity of your study by identifying the limitations of when the relationship between variables holds.

If something is a mediating variable :

  • It’s caused by the independent variable .
  • It influences the dependent variable.
  • When it’s statistically accounted for, the correlation between the independent and dependent variables weakens, because part of the effect is transmitted through the mediator.

A confounder is a third variable that affects variables of interest and makes them seem related when they are not. In contrast, a mediator is the mechanism of a relationship between two variables: it explains the process by which they are related.

A mediator variable explains the process through which two variables are related, while a moderator variable affects the strength and direction of that relationship.

There are three key steps in systematic sampling :

  • Define and list your population , ensuring that it is not ordered in a cyclical or periodic order.
  • Decide on your sample size and calculate your interval, k , by dividing the population size by your target sample size.
  • Choose every k th member of the population as your sample.
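
A minimal sketch of these three steps in Python (the population list and sample size are invented; starting from a random point within the first interval is a common refinement):

```python
import random

# Step 1: a listed population that is not in a cyclical or periodic order
population = [f"member_{i}" for i in range(1, 101)]   # 100 hypothetical members

# Step 2: choose a sample size and calculate the interval k
sample_size = 20
k = len(population) // sample_size                    # k = 5

# Step 3: take every kth member, starting at a random point in the first interval
start = random.randrange(k)
sample = population[start::k]

print(len(sample), sample[:3])
```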

Systematic sampling is a probability sampling method where researchers select members of the population at a regular interval – for example, by selecting every 15th person on a list of the population. If the population is in a random order, this can imitate the benefits of simple random sampling .

Yes, you can create a stratified sample using multiple characteristics, but you must ensure that every participant in your study belongs to one and only one subgroup. In this case, you multiply the numbers of subgroups for each characteristic to get the total number of groups.

For example, if you were stratifying by location with three subgroups (urban, rural, or suburban) and marital status with five subgroups (single, divorced, widowed, married, or partnered), you would have 3 x 5 = 15 subgroups.

You should use stratified sampling when your sample can be divided into mutually exclusive and exhaustive subgroups that you believe will take on different mean values for the variable that you’re studying.

Using stratified sampling will allow you to obtain more precise (with lower variance ) statistical estimates of whatever you are trying to measure.

For example, say you want to investigate how income differs based on educational attainment, but you know that this relationship can vary based on race. Using stratified sampling, you can ensure you obtain a large enough sample from each racial group, allowing you to draw more precise conclusions.

In stratified sampling , researchers divide subjects into subgroups called strata based on characteristics that they share (e.g., race, gender, educational attainment).

Once divided, each subgroup is randomly sampled using another probability sampling method.
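
A minimal sketch with pandas, drawing an equal-sized random sample from each stratum (the data frame, column names, and group labels are invented):

```python
import pandas as pd

# Invented population of 9,000 people with a stratification variable
population = pd.DataFrame({
    "person_id": range(1, 9001),
    "group": ["A"] * 5000 + ["B"] * 3000 + ["C"] * 1000,
})

# Divide into strata, then randomly sample within each stratum
# (here: 100 people per group)
sample = population.groupby("group", group_keys=False).sample(n=100, random_state=1)

print(sample["group"].value_counts())
```

Sampling each stratum in proportion to its size (for example, by passing frac= instead of n=) is another common choice.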

Cluster sampling is more time- and cost-efficient than other probability sampling methods , particularly when it comes to large samples spread across a wide geographical area.

However, it provides less statistical certainty than other methods, such as simple random sampling , because it is difficult to ensure that your clusters properly represent the population as a whole.

There are three types of cluster sampling : single-stage, double-stage and multi-stage clustering. In all three types, you first divide the population into clusters, then randomly select clusters for use in your sample.

  • In single-stage sampling , you collect data from every unit within the selected clusters.
  • In double-stage sampling , you select a random sample of units from within the clusters.
  • In multi-stage sampling , you repeat the procedure of randomly sampling elements from within the clusters until you have reached a manageable sample.

Cluster sampling is a probability sampling method in which you divide a population into clusters, such as districts or schools, and then randomly select some of these clusters as your sample.

The clusters should ideally each be mini-representations of the population as a whole.
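
A minimal sketch of single-stage cluster sampling in Python (the cluster names and units are invented):

```python
import random

# Invented population organized into clusters (e.g., schools)
clusters = {
    "School A": ["a1", "a2", "a3"],
    "School B": ["b1", "b2", "b3", "b4"],
    "School C": ["c1", "c2"],
    "School D": ["d1", "d2", "d3"],
    "School E": ["e1", "e2", "e3", "e4", "e5"],
}

# Randomly select some of the clusters ...
selected_clusters = random.sample(list(clusters), k=2)

# ... then, in single-stage sampling, collect data from every unit
# within the selected clusters
sample = [unit for name in selected_clusters for unit in clusters[name]]

print(selected_clusters, sample)
```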

If properly implemented, simple random sampling is usually the best sampling method for ensuring both internal and external validity . However, it can sometimes be impractical and expensive to implement, depending on the size of the population to be studied.

If you have a list of every member of the population and the ability to reach whichever members are selected, you can use simple random sampling.

The American Community Survey is an example of simple random sampling . In order to collect detailed data on the population of the US, Census Bureau officials randomly select 3.5 million households per year and use a variety of methods to convince them to fill out the survey.

Simple random sampling is a type of probability sampling in which the researcher randomly selects a subset of participants from a population . Each member of the population has an equal chance of being selected. Data is then collected from as large a percentage as possible of this random subset.
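
A minimal sketch in Python, assuming you already have a complete list of the population (the list below is invented):

```python
import random

# Sampling frame: every member of a hypothetical population of 1,000 people
population = [f"person_{i}" for i in range(1, 1001)]

# Every member has an equal chance of being selected
sample = random.sample(population, k=50)

print(sample[:5])
```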

Quasi-experimental design is most useful in situations where it would be unethical or impractical to run a true experiment .

Quasi-experiments have lower internal validity than true experiments, but they often have higher external validity  as they can use real-world interventions instead of artificial laboratory settings.

A quasi-experiment is a type of research design that attempts to establish a cause-and-effect relationship. The main difference with a true experiment is that the groups are not randomly assigned.

Blinding is important to reduce research bias (e.g., observer bias , demand characteristics ) and ensure a study’s internal validity .

If participants know whether they are in a control or treatment group , they may adjust their behavior in ways that affect the outcome that researchers are trying to measure. If the people administering the treatment are aware of group assignment, they may treat participants differently and thus directly or indirectly influence the final results.

  • In a single-blind study , only the participants are blinded.
  • In a double-blind study , both participants and experimenters are blinded.
  • In a triple-blind study , the assignment is hidden not only from participants and experimenters, but also from the researchers analyzing the data.

Blinding means hiding who is assigned to the treatment group and who is assigned to the control group in an experiment .

A true experiment (a.k.a. a controlled experiment) always includes at least one control group that doesn’t receive the experimental treatment.

However, some experiments use a within-subjects design to test treatments without a control group. In these designs, you usually compare one group’s outcomes before and after a treatment (instead of comparing outcomes between different groups).

For strong internal validity , it’s usually best to include a control group if possible. Without a control group, it’s harder to be certain that the outcome was caused by the experimental treatment and not by other variables.

An experimental group, also known as a treatment group, receives the treatment whose effect researchers wish to study, whereas a control group does not. They should be identical in all other ways.

Individual Likert-type questions are generally considered ordinal data , because the items have clear rank order, but don’t have an even distribution.

Overall Likert scale scores are sometimes treated as interval data. These scores are considered to have directionality and even spacing between them.

The type of data determines what statistical tests you should use to analyze your data.

A Likert scale is a rating scale that quantitatively assesses opinions, attitudes, or behaviors. It is made up of 4 or more questions that measure a single attitude or trait when response scores are combined.

To use a Likert scale in a survey , you present participants with Likert-type questions or statements, and a continuum of items, usually with 5 or 7 possible responses, to capture their degree of agreement.
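
As a rough illustration of how item responses are combined into an overall scale score, a small pandas example might look like this (the items and responses are invented):

```python
import pandas as pd

# Invented responses from 4 participants to four Likert-type items
# (1 = strongly disagree ... 5 = strongly agree)
responses = pd.DataFrame({
    "item_1": [4, 2, 5, 3],
    "item_2": [5, 1, 4, 3],
    "item_3": [4, 2, 5, 2],
    "item_4": [3, 1, 4, 3],
})

# Individual items are ordinal; the summed scale score is what is
# sometimes treated as interval data for analysis
responses["scale_score"] = responses.sum(axis=1)

print(responses)
```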

There are various approaches to qualitative data analysis , but they all share five steps in common:

  • Prepare and organize your data.
  • Review and explore your data.
  • Develop a data coding system.
  • Assign codes to the data.
  • Identify recurring themes.

The specifics of each step depend on the focus of the analysis. Some common approaches include textual analysis , thematic analysis , and discourse analysis .

There are five common approaches to qualitative research :

  • Grounded theory involves collecting data in order to develop new theories.
  • Ethnography involves immersing yourself in a group or organization to understand its culture.
  • Narrative research involves interpreting stories to understand how people make sense of their experiences and perceptions.
  • Phenomenological research involves investigating phenomena through people’s lived experiences.
  • Action research links theory and practice in several cycles to drive innovative changes.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.
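
For example, a two-sample t test calculates how likely an observed difference between two group means would be if the null hypothesis were true. A minimal sketch in Python with SciPy, using invented scores:

```python
from scipy import stats

# Invented outcome scores for a control and a treatment group
control = [78, 82, 69, 74, 80, 77, 73, 79]
treatment = [85, 88, 79, 83, 90, 84, 81, 87]

# Null hypothesis: the two groups have the same population mean
result = stats.ttest_ind(control, treatment)

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

A small p value (conventionally below 0.05) would lead you to reject the null hypothesis.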

Operationalization means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioral avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data , it’s important to consider how you will operationalize the variables that you want to measure.

When conducting research, collecting original data has significant advantages:

  • You can tailor data collection to your specific research aims (e.g. understanding the needs of your consumers or user testing your website)
  • You can control and standardize the process for high reliability and validity (e.g. choosing appropriate measurements and sampling methods )

However, there are also some drawbacks: data collection can be time-consuming, labor-intensive and expensive. In some cases, it’s more efficient to use secondary data that has already been collected by someone else, but the data might be less reliable.

Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organizations.

There are several methods you can use to decrease the impact of confounding variables on your research: restriction, matching, statistical control and randomization.

In restriction , you restrict your sample by only including certain subjects that have the same values of potential confounding variables.

In matching , you match each of the subjects in your treatment group with a counterpart in the comparison group. The matched subjects have the same values on any potential confounding variables, and only differ in the independent variable .

In statistical control , you include potential confounders as variables in your regression .

In randomization , you randomly assign the treatment (or independent variable) in your study to a sufficiently large number of subjects, which allows you to control for all potential confounding variables.

A confounding variable is closely related to both the independent and dependent variables in a study. An independent variable represents the supposed cause , while the dependent variable is the supposed effect . A confounding variable is a third variable that influences both the independent and dependent variables.

Failing to account for confounding variables can cause you to wrongly estimate the relationship between your independent and dependent variables.

To ensure the internal validity of your research, you must consider the impact of confounding variables. If you fail to account for them, you might over- or underestimate the causal relationship between your independent and dependent variables , or even find a causal relationship where none exists.

Yes, but including more than one of either type requires multiple research questions .

For example, if you are interested in the effect of a diet on health, you can use multiple measures of health: blood sugar, blood pressure, weight, pulse, and many more. Each of these is its own dependent variable with its own research question.

You could also choose to look at the effect of exercise levels as well as diet, or even the additional effect of the two combined. Each of these is a separate independent variable .

To ensure the internal validity of an experiment , you should only change one independent variable at a time.

No. The value of a dependent variable depends on an independent variable, so a variable cannot be both independent and dependent at the same time. It must be either the cause or the effect, not both!

You want to find out how blood sugar levels are affected by drinking diet soda and regular soda, so you conduct an experiment .

  • The type of soda – diet or regular – is the independent variable .
  • The level of blood sugar that you measure is the dependent variable – it changes depending on the type of soda.

Determining cause and effect is one of the most important parts of scientific research. It’s essential to know which is the cause – the independent variable – and which is the effect – the dependent variable.

In non-probability sampling , the sample is selected based on non-random criteria, and not every member of the population has a chance of being included.

Common non-probability sampling methods include convenience sampling , voluntary response sampling, purposive sampling , snowball sampling, and quota sampling .

Probability sampling means that every member of the target population has a known chance of being included in the sample.

Probability sampling methods include simple random sampling , systematic sampling , stratified sampling , and cluster sampling .

Using careful research design and sampling procedures can help you avoid sampling bias . Oversampling can be used to correct undercoverage bias .

Some common types of sampling bias include self-selection bias , nonresponse bias , undercoverage bias , survivorship bias , pre-screening or advertising bias, and healthy user bias.

Sampling bias is a threat to external validity – it limits the generalizability of your findings to a broader group of people.

A sampling error is the difference between a population parameter and a sample statistic .

A statistic refers to measures about the sample , while a parameter refers to measures about the population .

Populations are used when a research question requires data from every member of the population. This is usually only feasible when the population is small and easily accessible.

Samples are used to make inferences about populations . Samples are easier to collect data from because they are practical, cost-effective, convenient, and manageable.

There are seven threats to external validity : selection bias , history, experimenter effect, Hawthorne effect , testing effect, aptitude-treatment interaction, and situation effect.

The two types of external validity are population validity (whether you can generalize to other groups of people) and ecological validity (whether you can generalize to other situations and settings).

The external validity of a study is the extent to which you can generalize your findings to different groups of people, situations, and measures.

Cross-sectional studies cannot establish a cause-and-effect relationship or analyze behavior over a period of time. To investigate cause and effect, you need to do a longitudinal study or an experimental study .

Cross-sectional studies are less expensive and time-consuming than many other types of study. They can provide useful insights into a population’s characteristics and identify correlations for further research.

Sometimes only cross-sectional data is available for analysis; other times your research question may only require a cross-sectional study to answer it.

Longitudinal studies can last anywhere from weeks to decades, although they tend to be at least a year long.

The 1970 British Cohort Study , which has collected data on the lives of 17,000 Brits since their births in 1970, is one well-known example of a longitudinal study .

Longitudinal studies are better to establish the correct sequence of events, identify changes over time, and provide insight into cause-and-effect relationships, but they also tend to be more expensive and time-consuming than other types of studies.

Longitudinal studies and cross-sectional studies are two different types of research design . In a cross-sectional study you collect data from a population at a specific point in time; in a longitudinal study you repeatedly collect data from the same sample over an extended period of time.

There are eight threats to internal validity : history, maturation, instrumentation, testing, selection bias , regression to the mean, social interaction and attrition .

Internal validity is the extent to which you can be confident that a cause-and-effect relationship established in a study cannot be explained by other factors.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

The research methods you use depend on the type of data you need to answer your research question .

  • If you want to measure something or test a hypothesis , use quantitative methods . If you want to explore ideas, thoughts and meanings, use qualitative methods .
  • If you want to analyze a large amount of readily-available data, use secondary data. If you want data specific to your purposes with control over how it is generated, collect primary data.
  • If you want to establish cause-and-effect relationships between variables , use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.

A confounding variable , also called a confounder or confounding factor, is a third variable in a study examining a potential cause-and-effect relationship.

A confounding variable is related to both the supposed cause and the supposed effect of the study. It can be difficult to separate the true effect of the independent variable from the effect of the confounding variable.

In your research design , it’s important to identify potential confounding variables and plan how you will reduce their impact.

Discrete and continuous variables are two types of quantitative variables :

  • Discrete variables represent counts (e.g. the number of objects in a collection).
  • Continuous variables represent measurable amounts (e.g. water volume or weight).

Quantitative variables are any variables where the data represent amounts (e.g. height, weight, or age).

Categorical variables are any variables where the data represent groups. This includes rankings (e.g. finishing places in a race), classifications (e.g. brands of cereal), and binary outcomes (e.g. coin flips).

You need to know what type of variables you are working with to choose the right statistical test for your data and interpret your results .

You can think of independent and dependent variables in terms of cause and effect: an independent variable is the variable you think is the cause , while a dependent variable is the effect .

In an experiment, you manipulate the independent variable and measure the outcome in the dependent variable. For example, in an experiment about the effect of nutrients on crop growth:

  • The independent variable is the amount of nutrients added to the crop field.
  • The dependent variable is the biomass of the crops at harvest time.

Defining your variables, and deciding how you will manipulate and measure them, is an important part of experimental design .

Experimental design means planning a set of procedures to investigate a relationship between variables . To design a controlled experiment, you need:

  • A testable hypothesis
  • At least one independent variable that can be precisely manipulated
  • At least one dependent variable that can be precisely measured

When designing the experiment, you decide:

  • How you will manipulate the variable(s)
  • How you will control for any potential confounding variables
  • How many subjects or samples will be included in the study
  • How subjects will be assigned to treatment levels

Experimental design is essential to the internal and external validity of your experiment.

Internal validity is the degree of confidence that the causal relationship you are testing is not influenced by other factors or variables.

External validity is the extent to which your results can be generalized to other contexts.

The validity of your experiment depends on your experimental design .

Reliability and validity are both about how well a method measures something:

  • Reliability refers to the consistency of a measure (whether the results can be reproduced under the same conditions).
  • Validity refers to the accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research, you also have to consider the internal and external validity of your experiment.

A sample is a subset of individuals from a larger population . Sampling means selecting the group that you will actually collect data from in your research. For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students.

In statistics, sampling allows you to test a hypothesis about the characteristics of a population.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.

Methodology refers to the overarching strategy and rationale of your research project . It involves studying the methods used in your field and the theories or principles behind them, in order to develop an approach that matches your objectives.

Methods are the specific tools and procedures you use to collect and analyze data (for example, experiments, surveys , and statistical tests ).

In shorter scientific papers, where the aim is to report the findings of a specific study, you might simply describe what you did in a methods section .

In a longer or more complex research project, such as a thesis or dissertation , you will probably include a methodology section , where you explain your approach to answering the research questions and cite relevant sources to support your choice of methods.

Eric P. Green

  • Using Conceptual Models to Plan Study Measurement
  • Terminology
  • What Makes a Good Indicator?
  • Constructing Indicators
  • The Takeaway

7 Outcomes and Indicators

This chapter describes key measurement concepts, such as how to identify, define, and quantify study constructs. We’ll start by reviewing an example from the global mental health literature and use a conceptual model to think through important targets of measurement. Then we’ll consider what makes a good indicator of study constructs and outcomes and discuss common types of indicators you’ll come across in global health.

7.1 Using Conceptual Models to Plan Study Measurement

A good conceptual model, such as a theory of change or a logic model, can be a bridge to good measurement. I’ll demonstrate this using a study by Patel et al. (2017) that reports on the results of a randomized controlled trial in India to test the efficacy of a lay counsellor-delivered, brief psychological treatment for severe depression called the Healthy Activity Program, or HAP. Please download the article here and give it a read.

Figure 7.1: Abstract from Patel et al. (2017) published in The Lancet.

7.1.1 PROCESS INDICATORS

Figure 7.2: Logic model. Process indicators in a logic model capture how well a program is implemented—the “M” (monitoring) in M&E. As researchers, we care about collecting good process and monitoring data to develop a better understanding of why programs do or do not work. For example, program costs must be accurately tracked to estimate cost-effectiveness. Or it may be important to determine whether the intervention was delivered according to the plan.

Figure 7.4: A visual representation of the Dupas (2011) conceptual framework from Chapter 6.

Figure 7.7: Sustainable Development Goals. Source: http://bit.ly/2cuDpWN