
Comparative Research


Although not everyone would agree, comparing is not always bad. Comparing things can also offer a handful of benefits. For instance, there are times in our lives when we feel lost. You may not be getting the job that you want or the body that you have been aiming for. Then you happen to cross paths with an old friend who landed the job that you always wanted. This scenario may dent your self-esteem, knowing that this friend got what you want while you didn't. Or you can choose to look at your friend as proof that your desire is actually attainable. Come up with a plan to achieve your personal development goal. Perhaps ask for tips from this person or from the people who inspire you. According to an article posted on brit.co, licensed master social worker and therapist Kimberly Hershenson said that comparing yourself to someone successful can be excellent self-motivation to work on your goals.

Aside from self-improvement, as a researcher, you should know that comparison is an essential method in scientific studies, such as experimental research and descriptive research. Through this method, you can uncover the relationship between two or more variables of your project in the form of a comparative analysis.

What is Comparative Research?

Comparative research aims to compare two or more variables of an experimental project. Experts usually apply it in the social sciences to compare countries and cultures across a particular area or the entire world. Despite its proven effectiveness, keep in mind that different states have different rules for sharing data. Thus, it would help to consider the factors that affect gathering specific information.

Quantitative and Qualitative Research Methods in Comparative Studies

In comparing variables, the statistical and mathematical data collection and analysis that quantitative research methodology naturally uses to uncover the correlational connection of the variables can be essential. Additionally, since quantitative research requires a specific research question, this method can help you quickly come up with one particular comparative research question.
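Quantitative comparative work of this kind often starts with a simple correlation test. Below is a minimal sketch in Python of how the correlational connection between two variables might be checked; the data are invented for illustration, and the scipy library is assumed to be available.

```python
# Minimal sketch: testing the correlational connection between two
# variables, as a quantitative comparative design might.
# The data below are hypothetical, invented for illustration only.
from scipy.stats import pearsonr

study_hours = [5, 8, 12, 3, 9, 15, 7, 11]       # variable 1 (invented)
exam_scores = [62, 70, 85, 55, 74, 90, 66, 80]  # variable 2 (invented)

r, p_value = pearsonr(study_hours, exam_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")
```

A strong, statistically significant r suggests the variables move together, which is exactly the kind of relationship a comparative research question can then probe further.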

The goal of comparative research is to draw conclusions from the similarities and differences between the focal variables. Through non-experimental or qualitative research, you can include this type of research method in your comparative research design.

13+ Comparative Research Examples

Know more about comparative research by going over the following examples. You can download these zipped documents in PDF and MS Word formats.

1. Comparative Research Report Template


2. Business Comparative Research Template


3. Comparative Market Research Template


4. Comparative Research Strategies Example


5. Comparative Research in Anthropology Example


6. Sample Comparative Research Example


7. Comparative Area Research Example


8. Comparative Research on Women’s Employment Example


9. Basic Comparative Research Example


10. Comparative Research in Medical Treatments Example


11. Comparative Research in Education Example


12. Formal Comparative Research Example


13. Comparative Research Designs Example


14. Causal Comparative Research in DOC


Best Practices in Writing an Essay for Comparative Research in Visual Arts

If you are going to write an essay for a comparative research paper, this section is for you. You must know that there are common mistakes that students make in essay writing. To avoid those mistakes, follow these pointers.

1. Compare the Artworks, Not the Artists

One of the mistakes that students make when writing a comparative essay is comparing the artists instead of the artworks. Unless your instructor has asked you to write a biographical essay, focus your writing on the works of the artists that you choose.

2. Consult Your Instructor

There is broad coverage of information that you can find on the internet for your project. Some students, however, prefer to choose images randomly. In doing so, you may not create a successful comparative study. Therefore, we recommend discussing your selections with your teacher.

3. Avoid Redundancy

It is common for students to repeat the ideas that they have listed in the comparison part. Keep in mind that the space for this activity is limited. Thus, it is crucial to reserve each space for more thoroughly developed ideas.

4. Be Minimal

Unless instructed otherwise, it is practical to include only a few items (artworks). In this way, you can focus on developing well-argued information for your study.

5. Master the Assessment Method and the Goals of the Project

We get it. You are doing this project because your instructor told you so. However, you can make your study more valuable by understanding the goals of the project. Know how you can apply this new learning. You should also know the criteria that your teacher uses to assess your output. It will give you a chance to maximize the grade that you can get from this project.

Comparing things is one way to know what to improve in various aspects. Whether you are aiming to attain a personal goal or attempting to find a solution to a certain task, you can accomplish it by knowing how to conduct a comparative study. Use this content as a tool to expand your knowledge about this research methodology .


Università di Napoli Federico II

Comparative Research Designs and Methods


Instructor: Dirk Berg Schlosser


Skills you'll gain

  • Research Methods
  • Qualitative Comparative Analysis (QCA)
  • comparative research
  • Macro-quantitative methods


There are 5 modules in this course

Emile Durkheim, one of the founders of modern empirical social science, once stated that the comparative method is the only one that suits the social sciences. But Descartes had already reminded us that “comparaison n’est pas raison”, which means that comparison is not reason (or theory) by itself.

This course provides an introduction and overview of systematic comparative analyses in the social sciences and shows how to employ this method for constructive explanation and theory building. It begins with comparisons of very few cases and specific “most similar” and “most different” research designs. A major part is then devoted to the frequently occurring situation of dealing with a small number of highly complex cases, for example when comparing EU member states, Latin American political systems, or particular policy areas. In response to this complexity, new approaches and software have been developed in recent years (“Qualitative Comparative Analysis”, QCA, and related methods). These procedures are able to reduce complexity and to arrive at “configurational” solutions based on set theory and Boolean algebra, which are more meaningful in this context than the usual broad-based statistical methods. In the last section, these methods are contrasted with more common statistical comparative methods at the macro-level of states or societies and the respective strengths and weaknesses are discussed. Some basic quantitative or qualitative methodological training is probably useful to get more out of the course, but participants with little methodological training should find no major obstacles to following it.

An Introduction to Comparative Research

This module presents fundamental notions of comparative research designs. To begin with, you will be introduced to multi-dimensional matters. Subsequently, you will delve into John Stuart Mill’s methods and limitations.

What's included

6 videos 10 readings 2 quizzes

6 videos • Total 41 minutes

  • Multi-dimensional substance matter • 7 minutes • Preview module
  • The plastic matter of social sciences • 9 minutes
  • Linking levels of analysis • 6 minutes
  • Mill's canons • 5 minutes
  • Mill‘s methods: Pop's Seafood Platter • 4 minutes
  • Mill's methods: limitations • 7 minutes

10 readings • Total 31 minutes

  • Three Fundamental notions • 3 minutes
  • Multi-dimensional substance matter • 4 minutes
  • The plastic matter of social sciences • 4 minutes
  • Qualitative Comparative Analysis • 3 minutes
  • Linking levels of analysis • 2 minutes
  • Coleman's "bathtub" • 3 minutes
  • Pop's Seafood Platter • 3 minutes
  • Mill's limitations • 3 minutes
  • Mill's methods and recent advances • 1 minute

2 quizzes • Total 30 minutes

  • Challenge yourself: Epistemological foundations of the social sciences • 15 minutes
  • Challenge yourself: Mill's canon • 15 minutes

Comparative Research Designs

This module presents further advances in comparative research designs. To begin with, you will be introduced to case selection and types of research designs. Subsequently, you will delve into most similar and most different designs (MSDO/MDSO) and observe their operationalization.
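To give a concrete, if simplified, feel for how similarities and dissimilarities can be operationalized in MSDO/MDSO-style designs, here is a minimal sketch. The cases and binary condition codings are invented, and real applications use dedicated procedures rather than this toy matching count.

```python
# Hypothetical sketch: quantify pairwise similarity between cases whose
# conditions have been coded as binary attributes (1 = present, 0 = absent).
from itertools import combinations

cases = {
    "Case A": [1, 0, 1, 1, 0],  # invented binary-coded conditions
    "Case B": [1, 0, 1, 0, 0],
    "Case C": [0, 1, 0, 0, 1],
}

for (name_x, x), (name_y, y) in combinations(cases.items(), 2):
    matches = sum(a == b for a, b in zip(x, y))
    print(f"{name_x} vs {name_y}: {matches}/{len(x)} conditions match")
```

In an MSDO design one would then pair cases that are maximally similar on the conditions but differ on the outcome; MDSO reverses the logic.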

6 videos 6 readings 2 quizzes

6 videos • Total 48 minutes

  • Further advances • 7 minutes • Preview module
  • Overview • 7 minutes
  • Major steps of research process • 6 minutes
  • MSDO/MDSO application • 8 minutes
  • Operationalizing similarities and dissimilarities • 9 minutes
  • Analysis and interpretation • 8 minutes

6 readings • Total 30 minutes

  • Further advances • 4 minutes
  • Overview of comparative research designs • 4 minutes
  • Selection of variables and cases • 5 minutes
  • Operationalizing similarities and dissimilarities • 5 minutes
  • Analysis and interpretation • 6 minutes

2 quizzes • Total 30 minutes

  • Challenge yourself: Further Advances, Comparative Research Designs • 15 minutes
  • Challenge yourself: Most similar and most different designs • 15 minutes

QCA Analysis

This module presents Boolean Algebra and the main steps of QCA. The first lesson will introduce basic features of QCA and provide an example of such analysis. The second lesson will focus on QCA applications, troubleshooting, Multi-Value QCA (mv-QCA), and more specific features of QCA.
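As a rough illustration of what a truth table looks like in crisp-set QCA, here is a toy sketch with invented observations. It only groups cases into configurations and flags contradictory rows; a real analysis would go on to minimize the consistent configurations with Boolean algebra (typically via dedicated QCA software).

```python
# Toy crisp-set QCA sketch with invented data: group cases into truth-table
# rows and flag configurations whose outcome is consistent vs. contradictory.
from collections import defaultdict

# Each row: (condition_a, condition_b, condition_c, outcome), all invented.
observations = [
    (1, 1, 0, 1),
    (1, 1, 0, 1),
    (1, 0, 1, 0),
    (0, 1, 1, 1),
    (0, 1, 1, 1),
    (0, 0, 1, 0),
]

truth_table = defaultdict(list)
for *conditions, outcome in observations:
    truth_table[tuple(conditions)].append(outcome)

for config, outcomes in sorted(truth_table.items()):
    consistency = sum(outcomes) / len(outcomes)
    if consistency == 1.0:
        verdict = "consistently produces the outcome"
    elif consistency == 0.0:
        verdict = "outcome consistently absent"
    else:
        verdict = "contradictory row (needs troubleshooting)"
    print(config, f"-> consistency {consistency:.2f}: {verdict}")
```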

6 videos 7 readings 2 quizzes

6 videos • Total 40 minutes

  • QCA Basics • 8 minutes • Preview module
  • QCA Analysis • 9 minutes
  • Simple paper and pencil example • 5 minutes
  • Troubleshooting Contradictions • 6 minutes
  • Threshold setting, necessary and sufficient conditions • 4 minutes
  • Multi-Value QCA (mv-QCA) • 6 minutes

7 readings • Total 41 minutes

  • QCA Basics • 7 minutes
  • QCA Analysis pt.1 • 5 minutes
  • QCA Analysis pt.2 • 7 minutes
  • Simple paper and pencil example • 7 minutes
  • Troubleshooting Contradictions (C) • 6 minutes
  • Threshold setting, necessary and sufficient conditions • 3 minutes

2 quizzes • Total 30 minutes

  • Challenge yourself: Introduction to Boolean Algebra, main steps of QCA • 15 minutes
  • Challenge yourself: QCA applications, troubleshooting, Multi-Value QCA (mv-QCA) • 15 minutes

Fuzzy set analyses

This module presents the basic features of fuzzy set analyses and their application, and analyzes QCA in greater depth. The first lesson will introduce the basic features of fuzzy set analyses and provide examples of such analysis. The second lesson will focus on fuzzy set applications, their purposes and advantages, and explore more specific features of QCA.
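The calculation of necessary and sufficient conditions in fuzzy-set analysis comes down to simple set-theoretic formulas over membership scores. Below is a minimal sketch of the standard consistency measures used in fs/QCA; the membership scores are invented for illustration.

```python
# Minimal sketch of the standard fs/QCA consistency measures, using
# invented fuzzy membership scores in [0, 1].
condition = [0.8, 0.6, 0.9, 0.2, 0.7]  # membership in the condition (invented)
outcome   = [0.9, 0.7, 0.8, 0.4, 0.9]  # membership in the outcome (invented)

overlap = sum(min(x, y) for x, y in zip(condition, outcome))

# X is (nearly) sufficient for Y when min(x, y) is close to x case by case:
print(f"sufficiency consistency = {overlap / sum(condition):.2f}")
# X is (nearly) necessary for Y when min(x, y) is close to y case by case:
print(f"necessity consistency  = {overlap / sum(outcome):.2f}")
```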

6 videos • Total 37 minutes

  • Fuzzy sets • 6 minutes • Preview module
  • Calculation of necessary and sufficient conditions • 3 minutes
  • Fuzzy sets. Relationship between condition and outcome as in a triangular scatterplot • 3 minutes
  • Principles • 6 minutes
  • Lipset's conditions • 7 minutes
  • Conclusions • 9 minutes

7 readings • Total 75 minutes

  • Fuzzy sets • 10 minutes
  • Calculation of necessary and sufficient conditions • 10 minutes
  • Fuzzy sets. Relationship between condition and outcome as in a triangular scatterplot • 10 minutes
  • Principles • 10 minutes
  • Cases • 5 minutes
  • Examples • 15 minutes
  • Conclusions • 15 minutes

2 quizzes • Total 30 minutes

  • Challenge yourself: Fuzzy set analyses, basic features • 15 minutes
  • Challenge yourself: Fuzzy set applications (fs/qca) • 15 minutes

Macro-quantitative (statistical): Methods and perspectives

This module presents macro-quantitative (statistical) methods by giving examples of recent research employing them. It covers regression analysis and various ways of analyzing data. Moreover, it concludes the course and opens further perspectives on comparative research designs and methods.
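For a small taste of the macro-quantitative toolkit the module contrasts with QCA, here is a hedged sketch of a bivariate least-squares regression across invented country-level observations; numpy is assumed to be available, and the variable names are placeholders.

```python
# Hypothetical macro-quantitative sketch: fit y = a + b*x by ordinary
# least squares across six invented country-level observations.
import numpy as np

x = np.array([10.2, 25.4, 33.1, 8.7, 41.0, 18.9])  # e.g., GDP per capita (invented)
y = np.array([5.1, 7.8, 8.2, 4.3, 9.0, 6.5])       # e.g., democracy index (invented)

X = np.column_stack([np.ones_like(x), x])           # add an intercept column
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"intercept a = {a:.2f}, slope b = {b:.3f}")
```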

6 videos • Total 45 minutes

  • Data • 7 minutes • Preview module
  • Examples • 6 minutes
  • Regression analysis • 6 minutes
  • Summary • 8 minutes
  • Contrasting macro qualitative and quantitative methods • 8 minutes
  • Continuing debates: prospects • 6 minutes

6 readings • Total 75 minutes

  • Data • 10 minutes
  • Examples • 10 minutes
  • Regression analysis • 10 minutes
  • Summary • 15 minutes
  • Contrasting macro qualitative and quantitative methods • 15 minutes
  • Continuing debates, prospects • 15 minutes

2 quizzes • Total 30 minutes

  • Challenge yourself: Macro-quantitative (statistical) Methods • 15 minutes
  • Challenge yourself: Conclusions and Perspectives • 15 minutes


Founded in 1224, Federico II is the oldest lay university in Europe. With its "Federica Web Learning" Center, it is the leader in Europe for open-access multimedia education, and among the world's top ten for the production of MOOCs, providing new links between higher education and lifelong learning. Find out more at www.federica.eu.


100+ Quantitative Research Topics For Students


Quantitative research is a research strategy focusing on quantified data collection and analysis processes. This research strategy emphasizes testing theories on various subjects through the collection and analysis of numerical data.

Quantitative research is a common approach in the natural and social sciences, like marketing, business, sociology, chemistry, biology, economics, and psychology. So, if you are fond of statistics and figures, a quantitative research title would be an excellent option for your research proposal or project.

How to Get a Title for Quantitative Research

In this post:

  • How to make a quantitative research title
  • What is the best title for quantitative research
  • Amazing quantitative research topics for students
  • Creative quantitative research topics
  • Perfect quantitative research title examples
  • Unique quantitative research titles
  • Outstanding quantitative research title examples for students
  • Creative example titles of quantitative research samples
  • Outstanding quantitative research problems examples
  • Fantastic quantitative research topic examples
  • The best quantitative research topics
  • Grade 12 quantitative research titles for students
  • List of quantitative research titles for high school
  • Easy quantitative research topics for students
  • Trending topics for quantitative research
  • Quantitative research proposal topics
  • Samples of quantitative research titles
  • Research titles about business (quantitative)

Finding a great title is the key to writing a great quantitative research proposal or paper. A title for quantitative research prepares you for success, failure, or mediocre grades. This post features examples of quantitative research titles for all students.

Putting together a research title and quantitative research design is not as easy as some students assume. So, an example topic of quantitative research can help you craft your own. However, even with the examples, you may need some guidelines for personalizing your research project or proposal topics.

So, here are some tips for getting a title for quantitative research:

  • Consider your area of studies
  • Look out for relevant subjects in the area
  • Expert advice may come in handy
  • Check out some sample quantitative research titles

Making a quantitative research title is easy if you know the qualities of a good title in quantitative research. Reading about how to make one may not help as much as looking at some samples: a quantitative research example title will give you an idea of where to start.

However, let’s look at some tips for how to make a quantitative research title:

  • The title should seem interesting to readers
  • Ensure that the title represents the content of the research paper
  • Reflect on the tone of the writing in the title
  • The title should contain important keywords in your chosen subject to help readers find your paper
  • The title should not be too lengthy
  • It should be grammatically correct and creative
  • It must generate curiosity

An excellent quantitative title should be clear, which implies that it should effectively explain the paper and what readers can expect. A research title for quantitative research is the gateway to your article or proposal. So, it should be well thought out. Additionally, it should give you room for extensive topic research.

A sample of quantitative research titles will give you an idea of what a good title for quantitative research looks like. Here are some examples:

  • What is the correlation between inflation rates and unemployment rates?
  • Has climate adaptation influenced the mitigation of funds allocation?
  • Job satisfaction and employee turnover: What is the link?
  • A look at the relationship between poor households and the development of entrepreneurship skills
  • Urbanization and economic growth: What is the link between these elements?
  • Does education achievement influence people’s economic status?
  • What is the impact of solar electricity on the wholesale energy market?
  • Debt accumulation and retirement: What is the relationship between these concepts?
  • Can people with psychiatric disorders develop independent living skills?
  • Children’s nutrition and its impact on cognitive development

Quantitative research applies to various subjects in the natural and social sciences. Therefore, depending on your intended subject, you have numerous options. Below are some good quantitative research topics for students:

  • The difference between the calorific intake of men and women in your country
  • Top strategies used to measure customer satisfaction and how they work
  • Black Friday sales: are they profitable?
  • The correlation between estimated target market and practical competitive risk assignment
  • Are smartphones making us brighter or dumber?
  • Nuclear families Vs. Joint families: Is there a difference?
  • What will society look like in the absence of organized religion?
  • A comparison of the weight-loss benefits of low-carbohydrate and high-carbohydrate diets
  • How does emotional stability influence your overall well-being?
  • The extent of the impact of technology in the communications sector

Creativity is the key to creating a good research topic in quantitative research. Find a good quantitative research topic below:

  • How much exercise is good for lasting physical well-being?
  • A comparison of the nutritional therapy uses and contemporary medical approaches
  • Does sugar intake have a direct impact on diabetes diagnosis?
  • Education attainment: Does it influence crime rates in society?
  • Is there an actual link between obesity and cancer rates?
  • Do kids with siblings have better social skills than those without?
  • Computer games and their impact on the young generation
  • Has social media marketing taken over conventional marketing strategies?
  • The impact of technology development on human relationships and communication
  • What is the link between drug addiction and age?

Need more quantitative research title examples to inspire you? Here are some to look at:

  • Habitat fragmentation and biodiversity loss: What is the link?
  • Radiation and biodiversity: Assessing its effects
  • An assessment of the impact of the coronavirus on global population growth
  • Is the pandemic truly over, or have human bodies built resistance against the virus?
  • The ozone hole and its impact on the environment
  • The greenhouse effect: What is it and how has it impacted the atmosphere?
  • GMO crops: are they good or bad for your health?
  • Is there a direct link between education quality and job attainment?
  • How have education systems changed from traditional to modern times?
  • The good and bad impacts of technology on education qualities

Your examiner will give you excellent grades if you come up with a unique title and outstanding content. Here are some quantitative research title examples:

  • Online classes: are they helpful or not?
  • What changes has the global coronavirus pandemic brought to the population growth curve?
  • Daily habits influenced by the global pandemic
  • An analysis of the impact of culture on people’s personalities
  • How has feminism influenced the education system’s approach to the girl child’s education?
  • Academic competition: what are its benefits and downsides for students?
  • Is there a link between education and student integrity?
  • An analysis of how the education sector can influence a country’s economy
  • An overview of the link between crime rates and concern for crime
  • Is there a link between education and obesity?

A well-thought-out research title guarantees a paper that is a good read. Look at the examples below to get started.

  • What are the impacts of online games on students?
  • Sex education in schools: how important is it?
  • Should schools be teaching about safe sex in their sex education classes?
  • The correlation between extreme parental interference and student academic performance
  • Is there a real link between academic marks and intelligence?
  • Teacher feedback: How necessary is it, and how does it help students?
  • An analysis of modern education systems and their impact on student performance
  • An overview of the link between academic performance/marks and intelligence
  • Are grading systems helpful or harmful to students?
  • What was the impact of the pandemic on students?

Irrespective of the course you take, some titles can fit diverse subjects pretty well. Here are some creative quantitative research title ideas:

  • A look at the pre-corona and post-corona economy
  • How are conventional retail businesses faring against eCommerce sites like Amazon and Shopify?
  • An evaluation of mortality rates of heart attacks
  • Effective treatments for cardiovascular issues and their prevention
  • A comparison of the effectiveness of home care and nursing home care
  • Strategies for managing effective dissemination of information to modern students
  • How does educational discrimination influence students’ futures?
  • The impacts of unfavorable classroom environment and bullying on students and teachers
  • An overview of the implementation of STEM education to K-12 students
  • How effective is digital learning?

If your paper addresses a problem, you must present facts that solve the question or tell more about the question. Here are examples of quantitative research titles that will inspire you.

  • An elaborate study of the influence of telemedicine in healthcare practices
  • How has scientific innovation influenced the defense or military system?
  • The link between technology and people’s mental health
  • Has social media helped create awareness or worsened people’s mental health?
  • How do engineers promote green technology?
  • How can engineers raise sustainability in building and structural infrastructures?
  • An analysis of how decision-making is dependent on someone’s sub-conscious
  • A comprehensive study of ADHD and its impact on students’ capabilities
  • The impact of racism on people’s mental health and overall wellbeing
  • How has the current surge in social activism helped shape people’s relationships?

Are you looking for an example of a quantitative research title? These ten examples below will get you started.

  • The prevalence of nonverbal communication in social control and people’s interactions
  • The impacts of stress on people’s behavior in society
  • A study of the connection between capital structures and corporate strategies
  • How do changes in credit ratings impact equality returns?
  • A quantitative analysis of the effect of bond rating changes on stock prices
  • The impact of semantics on web technology
  • An analysis of persuasion, propaganda, and marketing impact on individuals
  • The dominant-firm model: what is it, and how does it apply to your country’s retail sector?
  • The role of income inequality in economic growth
  • An examination of juvenile delinquents’ treatment in your country

Excellent Topics For Quantitative Research

Here are some titles for quantitative research you should consider:

  • Does studying mathematics help businesses implement data safety?
  • How are art-related subjects interdependent with mathematics?
  • How do eco-friendly practices in the hospitality industry influence tourism rates?
  • A deep insight into how people view eco-tourisms
  • Religion vs. hospitality: Details on their correlation
  • Has your country’s tourist sector revived after the pandemic?
  • How effective is non-verbal communication in conveying emotions?
  • Are there similarities between the English and French vocabulary?
  • How do politicians use persuasive language in political speeches?
  • The correlation between popular culture and translation

Here are some quantitative research titles examples for your consideration:

  • How do world leaders use language to change the emotional climate in their nations?
  • Extensive research on how linguistics cultivate political buzzwords
  • The impact of globalization on the global tourism sector
  • An analysis of the effects of the pandemic on the worldwide hospitality sector
  • The influence of social media platforms on people’s choice of tourism destinations
  • Educational tourism: What is it and what you should know about it
  • Why do college students experience math anxiety?
  • Is math anxiety a phenomenon?
  • A guide on effective ways to fight cultural bias in modern society
  • Creative ways to solve the overpopulation issue

An example of quantitative research topics for 12th-grade students will come in handy if you want to score a good grade. Here are some of the best ones:

  • The link between global warming and climate change
  • What is the greenhouse gas impact on biodiversity and the atmosphere?
  • Has the internet successfully influenced literacy rates in society?
  • The value and downsides of competition for students
  • A comparison of the education system in first-world and third-world countries
  • The impact of alcohol addiction on the younger generation
  • How has social media influenced human relationships?
  • Has education helped boost feminism among men and women?
  • Are computers in classrooms beneficial or detrimental to students?
  • How has social media affected bullying rates among teenagers?

High school students can apply research titles on social issues  or other elements, depending on the subject. Let’s look at some quantitative topics for students:

  • What is the right age to introduce sex education for students?
  • Can extreme punishment help reduce alcohol consumption among teenagers?
  • Should the government increase the age of sexual consent?
  • The link between globalization and the collapse of local economies
  • How are global companies influencing local economies?

There are numerous possible quantitative research topics you can write about. Here are some great quantitative research topics examples:

  • The correlation between video games and crime rates
  • Do college studies impact future job satisfaction?
  • What can the education sector do to encourage more college enrollment?
  • The impact of education on self-esteem
  • The relationship between income and occupation

You can find inspiration for your research topic from trending affairs on social media or in the news. Such topics will make your research enticing. Find a trending topic for quantitative research example from the list below:

  • How the country’s economy is faring after the pandemic
  • An analysis of the riots by women in Iran and what the women aim to achieve
  • Is the current US government living up to voters’ expectations?
  • How is the war in Ukraine affecting the global economy?
  • Can social media riots affect political decisions?

A proposal is a paper in which you propose the subject you would like to cover for your research and the research techniques you will apply. If the proposal is approved, it becomes your research topic. Here are some quantitative titles you should consider for your research proposal:

  • Military support and economic development: What is the impact in developing nations?
  • How does gun ownership influence crime rates in developed countries?
  • How can the US government reduce gun violence without influencing people’s rights?
  • What is the link between school prestige and academic standards?
  • Is there a scientific link between abortion and the definition of viability?

You can never have too many sample titles. The samples allow you to find a unique title for your research or proposal. Find a sample quantitative research title here:

  • Does weight loss indicate good or poor health?
  • Should schools do away with grading systems?
  • The impact of culture on student interactions and personalities
  • How can parents successfully protect their kids from the dangers of the internet?
  • Is the US education system better or worse than Europe’s?

If you’re a business major, then you should choose a quantitative research title about business. Let’s look at some quantitative research title examples in business:

  • Creating shareholder value in business: How important is it?
  • The changes in credit ratings and their impact on equity returns
  • The importance of data privacy laws in business operations
  • How do businesses benefit from e-waste and carbon footprint reduction?
  • Organizational culture in business: what is its importance?

We Are A Call Away

Interesting, creative, unique, and easy quantitative research topics allow you to explain your paper and make research easy. Therefore, you should not take choosing a research paper or proposal topic lightly. With your topic ready, reach out to us today for excellent research paper writing services .


Causal Comparative Research: Methods And Examples


Ritu was in charge of marketing a new protein drink about to be launched. The client wanted a causal-comparative study highlighting the drink’s benefits. They demanded that comparative analysis be made the main campaign design strategy. After carefully analyzing the project requirements, Ritu decided to follow a causal-comparative research design. She realized that causal-comparative research emphasizing physical development in different groups of people would lay a good foundation to establish the product.

What Is Causal Comparative Research?


Causal-comparative research is a method used to identify the cause-effect relationship between a dependent and an independent variable. This relationship is usually a suggested relationship because we can't completely control the independent variable. Unlike correlational research, it doesn't stop at describing a relationship. In a causal-comparative research design, the researcher compares two groups to find out whether the independent variable affected the outcome, or the dependent variable.

A causal-comparative method determines whether one variable has a direct influence on the other and why. It identifies the causes of certain occurrences (or non-occurrences). It makes a study descriptive rather than experimental by scrutinizing the relationships among different variables in which the independent variable has already occurred. Variables can't always be manipulated, but a link between the dependent and independent variables is established, and the implications of possible causes are used to draw conclusions.

In a causal-comparative design, researchers study cause and effect in retrospect and determine consequences or causes of differences already existing among or between groups of people.
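To make that retrospective comparison concrete, here is a minimal sketch of its statistical core: two pre-existing groups compared on an outcome with an independent-samples t-test. The groups and scores are invented, and the scipy library is assumed to be available.

```python
# Hypothetical sketch: compare an outcome across two pre-existing groups,
# as in a retrospective causal-comparative design.
from scipy.stats import ttest_ind

exposed_group = [72, 68, 75, 80, 66, 74]    # invented outcome scores
unexposed_group = [65, 60, 70, 62, 58, 64]  # invented outcome scores

t_stat, p_value = ttest_ind(exposed_group, unexposed_group)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Because group membership was not randomly assigned, a significant difference only suggests, and cannot prove, that the independent variable caused the outcome.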

Let’s look at some characteristics of causal-comparative research:

  • This method tries to identify cause and effect relationships.
  • Two or more groups are included as variables.
  • Individuals aren’t selected randomly.
  • Independent variables can’t be manipulated.
  • It helps save time and money.

The main purpose of a causal-comparative study is to explore effects, consequences and causes. There are two types of causal-comparative research design. They are:

Retrospective Causal Comparative Research

For this type of research, a researcher has to investigate a particular question after the effects have occurred. They attempt to determine whether or not a variable influences another variable.

Prospective Causal Comparative Research

The researcher initiates a study beginning with the causes and proceeds to analyze the effects of a given condition. This is not as common as retrospective causal-comparative research.

Usually, it’s easier to compare a variable with the known than the unknown.

Examples of Causal Comparative Research Variables

Researchers use causal-comparative research to achieve research goals by comparing two variables that represent two groups. This data can include differences in opportunities, privileges exclusive to certain groups or developments with respect to gender, race, nationality or ability.

For example, to find out the difference in wages between men and women, researchers have to make a comparative study of wages earned by both genders across various professions, hierarchies and locations. None of the variables can be influenced, and a cause-effect relationship has to be established with a persuasive logical argument (a minimal data sketch follows the list below). Some common variables investigated in this type of research are:

  • Achievement and other ability variables
  • Family-related variables
  • Organismic variables such as age, sex and ethnicity
  • Variables related to schools
  • Personality variables
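As a concrete illustration of the wage example above, here is a minimal sketch of the descriptive comparison step. The records, professions, and figures are all invented, and the pandas library is assumed to be available.

```python
# Hypothetical sketch: a wage comparison across two pre-existing groups,
# broken down by profession, using pandas.
import pandas as pd

# Invented records: (profession, gender, annual wage).
df = pd.DataFrame(
    [
        ("engineer", "men", 85000), ("engineer", "women", 78000),
        ("engineer", "men", 90000), ("engineer", "women", 82000),
        ("teacher", "men", 52000), ("teacher", "women", 50000),
        ("teacher", "men", 54000), ("teacher", "women", 49000),
    ],
    columns=["profession", "gender", "wage"],
)

# Mean wage per profession and gender, then the gap within each profession.
means = df.groupby(["profession", "gender"])["wage"].mean().unstack()
means["gap"] = means["men"] - means["women"]
print(means)
```

Establishing a cause-effect claim would still require the persuasive logical argument described above, since none of these variables can be manipulated.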

While raw test scores, assessments and other measures (such as grade point averages) are used as data in this research, standardized tests, structured interviews and surveys are popular research tools.

However, causal-comparative research has drawbacks too, such as its inability to manipulate or control an independent variable and the lack of randomization. Subject-selection bias always remains a possibility and poses a threat to the internal validity of a study. Researchers can control it with statistical matching or by creating identical subgroups. Executives have to look out for loss of subjects, location influences, poor subject attitudes and testing threats to produce a valid research study.

Harappa’s Thinking Critically program is for managers who want to learn how to think effectively before making critical decisions. Learn how leaders articulate the reasons behind and implications of their decisions. Become a growth-driven manager looking to select the right strategies to outperform targets. It’s packed with problem-solving and effective-thinking tools that are essential for skill development. What’s more, it offers live learning support and the opportunity to progress at your own pace. Ask for your free demo today!

Explore Harappa Diaries to learn more about topics such as Objectives Of Research Methodology , Types Of Thinking , What Is Visualisation and Effective Learning Methods to upgrade your knowledge and skills.



On Evaluating Curricular Effectiveness: Judging the Quality of K-12 Mathematics Evaluations (2004)

5 Comparative Studies

It is deceptively simple to imagine that a curriculum’s effectiveness could be easily determined by a single well-designed study. Such a study would randomly assign students to two treatment groups, one using the experimental materials and the other using a widely established comparative program. The students would be taught the entire curriculum, and a test administered at the end of instruction would provide unequivocal results that would permit one to identify the more effective treatment.

The truth is that conducting definitive comparative studies is not simple, and many factors make such an approach difficult. Student placement and curricular choice are decisions that involve multiple groups of decision makers, accrue over time, and are subject to day-to-day conditions of instability, including student mobility, parent preference, teacher assignment, administrator and school board decisions, and the impact of standardized testing. This complex set of institutional policies, school contexts, and individual personalities makes comparative studies, even quasi-experimental approaches, challenging, and thus demands an honest and feasible assessment of what can be expected of evaluation studies (Usiskin, 1997; Kilpatrick, 2002; Schoenfeld, 2002; Shafer, in press).

Comparative evaluation study is an evolving methodology, and our purpose in conducting this review was to evaluate and learn from the efforts undertaken so far and advise on future efforts. We stipulated the use of comparative studies as follows:

A comparative study was defined as a study in which two (or more) curricular treatments were investigated over a substantial period of time (at least one semester, and more typically an entire school year) and a comparison of various curricular outcomes was examined using statistical tests. A statistical test was required to ensure the robustness of the results relative to the study’s design.

We read and reviewed a set of 95 comparative studies. In this report we describe that database, analyze its results, and draw conclusions about the quality of the evaluation database both as a whole and separated into evaluations supported by the National Science Foundation and commercially generated evaluations. In addition to describing and analyzing this database, we also provide advice to those who might wish to fund or conduct future comparative evaluations of mathematics curricular effectiveness. We have concluded that the process of conducting such evaluations is in its adolescence and could benefit from careful synthesis and advice in order to increase its rigor, feasibility, and credibility. In addition, we took an interdisciplinary approach to the task, noting that various committee members brought different expertise and priorities to the consideration of what constitutes the most essential qualities of rigorous and valid experimental or quasi-experimental design in evaluation. This interdisciplinary approach has led to some interesting observations and innovations in our methodology of evaluation study review.

This chapter is organized as follows:

  • Study counts disaggregated by program and program type.
  • Seven critical decision points and identification of at least minimally methodologically adequate studies.
  • Definition and illustration of each decision point.
  • A summary of results by student achievement in relation to program types (NSF-supported, University of Chicago School Mathematics Project (UCSMP), and commercially generated) in relation to their reported outcome measures.
  • A list of alternative hypotheses on effectiveness.
  • Filters based on the critical decision points.
  • An analysis of results by subpopulations.
  • An analysis of results by content strand.
  • An analysis of interactions among content, equity, and grade levels.
  • Discussion and summary statements.

In this report, we describe our methodology for review and synthesis so that others might scrutinize our approach and offer criticism on the basis of our methodology and its connection to the results stated and conclusions drawn. In the spirit of scientific, fair, and open investigation, we welcome others to undertake similar or contrasting approaches and compare and discuss the results. Our work was limited by the short timeline set by the funding agencies resulting from the urgency of the task. Although we made multiple efforts to collect comparative studies, we apologize to any curriculum evaluators if comparative studies were unintentionally omitted from our database.

Of these 95 comparative studies, 65 were studies of NSF-supported curricula, 27 were studies of commercially generated materials, and 3 included two curricula each from one of these two categories. To avoid the problem of double coding, two studies, White et al. (1995) and Zahrt (2001), were coded within studies of NSF-supported curricula because more of the classes studied used the NSF-supported curriculum. These studies were not used in later analyses because they did not meet the requirements for the at least minimally methodologically adequate studies, as described below. The other, Peters (1992), compared two commercially generated curricula, and was coded in that category under the primary program of focus. Therefore, of the 95 comparative studies, 67 studies were coded as NSF-supported curricula and 28 were coded as commercially generated materials.

The 11 evaluation studies of the UCSMP secondary program that we reviewed, not including White et al. and Zahrt as previously mentioned, benefit from the maturity of the program, while demonstrating an orientation to both establishing effectiveness and improving a product line. For these reasons, at times we will present the summary of UCSMP’s data separately.

The Saxon materials also present a somewhat different profile from the other commercially generated materials because many of the evaluations of these materials were conducted in the 1980s and the materials were originally developed with a rather atypical program theory. Saxon (1981) designed its algebra materials to combine distributed practice with incremental development. We selected the Saxon materials as a middle grades commercially generated program, and limited its review to middle school studies from 1989 onward when the first National Council of Teachers of Mathematics (NCTM) Standards (NCTM, 1989) were released. This eliminated concerns that the materials or the conditions of educational practice have been altered during the intervening time period. The Saxon materials explicitly do not draw from the NCTM Standards nor did they receive support from the NSF; thus they truly represent a commercial venture. As a result, we categorized the Saxon studies within the group of studies of commercial materials.

At times in this report, we describe characteristics of the database by particular curricular program evaluations, in which case all 19 programs are listed separately. At other times, when we seek to inform ourselves on policy-related issues of funding and evaluating curricular materials, we use the NSF-supported, commercially generated, and UCSMP distinctions. We remind the reader of the artificial aspects of this distinction because at the present time, 18 of the 19 curricula are published commercially. In order to track the question of historical inception and policy implications, a distinction is drawn between the three categories. Figure 5-1 shows the distribution of comparative studies across the 14 programs.

FIGURE 5-1 The distribution of comparative studies across programs. Programs are coded by grade band: black bars = elementary, white bars = middle grades, and gray bars = secondary. In this figure, there are six studies that involved two programs and one study that involved three programs.

NOTE: Five programs (MathScape, MMAP, MMOW/ARISE, Addison-Wesley, and Harcourt) are not shown above since no comparative studies were reviewed.

The first result the committee wishes to report is the uneven distribution of studies across the curricula programs. There were 67 coded studies of the NSF curricula, 11 studies of UCSMP, and 17 studies of the commercial publishers. The 14 evaluation studies conducted on the Saxon materials compose the bulk of these 17 non-UCSMP and non-NSF-supported curricular evaluation studies. As these results suggest, we know more about the evaluations of the NSF-supported curricula and UCSMP than about the evaluations of the commercial programs. We suggest that three factors account for this uneven distribution of studies. First, evaluations have been funded by the NSF both as a part of the original call, and as follow-up to the work in the case of three supplemental awards to two of the curricula programs. Second, most NSF-supported programs and UCSMP were developed at university sites where there is access to the resources of graduate students and research staff. Finally, there was some reported reluctance on the part of commercial companies to release studies that could affect perceptions of competitive advantage. As Figure 5-1 shows, there were quite a few comparative studies of Everyday Mathematics (EM), Connected Mathematics Project (CMP), Contemporary Mathematics in Context (Core-Plus Mathematics Project [CPMP]), Interactive Mathematics Program (IMP), UCSMP, and Saxon.

In the programs with many studies, we note that a significant number of studies were generated by a core set of authors. In some cases, the evaluation reports follow a relatively uniform structure applied to single schools, generating multiple studies or following cohorts over years. Others use a standardized evaluation approach to evaluate sequential courses. Any reports duplicating exactly the same sample, outcome measures, or forms of analysis were eliminated. For example, one study of Mathematics Trailblazers (Carter et al., 2002) reanalyzed the data from the larger ARC Implementation Center study (Sconiers et al., 2002), so it was not included separately. Synthesis studies referencing a variety of evaluation reports are summarized in Chapter 6 , but relevant individual studies that were referenced in them were sought out and included in this comparative review.

Other less formal comparative studies are conducted regularly at the school or district level, but such studies were not included in this review unless we could obtain formal reports of their results, and the studies met the criteria outlined for inclusion in our database. In our conclusions, we address the issue of how to collect such data more systematically at the district or state level in order to subject the data to the standards of scholarly peer review and make it more systematically and fairly a part of the national database on curricular effectiveness.

A standard for evaluation of any social program requires that an impact assessment is warranted only if two conditions are met: (1) the curricular program is clearly specified, and (2) the intervention is well implemented. Absent this assurance, one must have a means of ensuring or measuring treatment integrity in order to make causal inferences. Rossi et al. (1999, p. 238) warned that:

two prerequisites [must exist] for assessing the impact of an intervention. First, the program’s objectives must be sufficiently well articulated to make it possible to specify credible measures of the expected outcomes, or the evaluator must be able to establish such a set of measurable outcomes. Second, the intervention should be sufficiently well implemented that there is no question that its critical elements have been delivered to appropriate targets. It would be a waste of time, effort, and resources to attempt to estimate the impact of a program that lacks measurable outcomes or that has not been properly implemented. An important implication of this last consideration is that interventions should be evaluated for impact only when they have been in place long enough to have ironed out implementation problems.

These same conditions apply to evaluation of mathematics curricula. The comparative studies in this report varied in the quality of documentation of these two conditions; however, all addressed them to some degree or another. Initially by reviewing the studies, we were able to identify one general design template, which consisted of seven critical decision points and determined that it could be used to develop a framework for conducting our meta-analysis. The seven critical decision points we identified initially were:

1. Choice of type of design: experimental or quasi-experimental;

2. For those studies that do not use random assignment: what methods of establishing comparability of groups were built into the design—this includes student characteristics, teacher characteristics, and the extent to which professional development was involved as part of the definition of a curriculum;

3. Definition of the appropriate unit of analysis (students, classes, teachers, schools, or districts);

4. Inclusion of an examination of implementation components;

5. Definition of the outcome measures and disaggregated results by program;

6. The choice of statistical tests, including statistical significance levels and effect size (a brief effect-size sketch follows this list); and

7. Recognition of limitations to generalizability resulting from design choices.
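Decision point 6 calls for reporting an effect size alongside any significance test. As a minimal, hedged illustration, the sketch below computes Cohen's d with a pooled standard deviation; the treatment and comparison scores are invented, not drawn from any study in the database.

```python
# Hypothetical sketch of decision point 6: compute Cohen's d (pooled-SD
# version) for a treatment/comparison contrast, using invented scores.
import statistics

treatment = [78, 82, 74, 90, 85, 80]    # invented test scores
comparison = [70, 75, 72, 68, 77, 73]   # invented test scores

n1, n2 = len(treatment), len(comparison)
s1, s2 = statistics.stdev(treatment), statistics.stdev(comparison)
pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
d = (statistics.mean(treatment) - statistics.mean(comparison)) / pooled_sd
print(f"Cohen's d = {d:.2f}")
```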

These are critical decisions that affect the quality of an evaluation. We further identified a subset of these evaluation studies that met a set of minimum conditions that we termed at least minimally methodologically adequate studies. Such studies are those with the greatest likelihood of shedding light on the effectiveness of these programs. To be classified as at least minimally methodologically adequate, and therefore to be considered for further analysis, each evaluation study was required to:

  • Include quantifiably measurable outcomes such as test scores, responses to specified cognitive tasks of mathematical reasoning, performance evaluations, grades, and subsequent course taking; and

  • Provide adequate information to judge the comparability of samples.

In addition, a study must have included at least one of the following additional design elements:

  • A report of implementation fidelity or professional development activity;

  • Results disaggregated by content strands or by performance by student subgroups; and/or

  • Multiple outcome measures or precise theoretical analysis of a measured construct, such as number sense, proof, or proportional reasoning.

Using this rubric, the committee identified a subset of 63 comparative studies to classify as at least minimally methodologically adequate and to analyze in depth to inform the conduct of future evaluations. There are those who would argue that any threat to the validity of a study discredits the findings, thus claiming that until we know everything, we know nothing. Others would claim that from the myriad of studies, examining patterns of effects and patterns of variation, one can learn a great deal, perhaps tentatively, about programs and their possible effects. More importantly, we can learn about methodologies and how to concentrate and focus to increase the likelihood of learning more quickly. As Lipsey (1997, p. 22) wrote:

In the long run, our most useful and informative contribution to program managers and policy makers and even to the evaluation profession itself may be the consolidation of our piecemeal knowledge into broader pictures of the program and policy spaces at issue, rather than individual studies of particular programs.

We do not wish to imply that we devalue studies of student affect or conceptions of mathematics, but we decided that unless these indicators were connected to direct indicators of student learning, we would eliminate them from further study. As a result of this sorting, we eliminated 19 studies of NSF-supported curricula and 13 studies of commercially generated curricula. Of these, 4 were eliminated for their sole focus on affect or conceptions, 3 for their comparative focus on outcomes other than achievement (such as teacher-related variables), and 19 for their failure to meet the minimum additional characteristics specified in the criteria above. In addition, six others were excluded from the studies of commercial materials because they were not conducted within the grade-level band specified by the committee for the selection of that program. From this point onward, all references can be assumed to refer to at least minimally methodologically adequate studies unless a study is referenced for illustration, in which case we label it with “EX” to indicate that it is excluded from the summary analyses. Studies labeled “EX” are occasionally referenced because they can provide useful information on certain aspects of curricular evaluation, but not on overall effectiveness.

The at least minimally methodologically adequate studies reported on a variety of grade levels. Figure 5-2 shows the different grade levels of the studies. At times, the choice of grade levels was dictated by the years in which high-stakes tests were given. Most of the studies reported on multiple grade levels, as shown in Figure 5-2.

FIGURE 5-2 Single-grade studies by grade and multigrade studies by grade band.

Using the seven critical design elements of at least minimally methodologically adequate studies as a design template, we describe the overall database and discuss the array of choices on critical decision points with examples. Following that, we report the results of the at least minimally methodologically adequate studies by program type. To do so, the results of each study were coded as either statistically significant or not. Those studies that contained statistically significant results were assigned a percentage of outcomes that are positive (in favor of the treatment curriculum), based on the number of statistically significant comparisons reported relative to the total number of comparisons reported, and a percentage of outcomes that are negative (in favor of the comparison curriculum). The remaining were coded as the percentage of outcomes that are nonsignificant. Then, using the seven critical decision points as filters, we identified and examined more closely sets of studies that exhibited the strongest designs, and would therefore be most likely to increase our confidence in the validity of the evaluation. In the last section, we consider alternative hypotheses that could explain the results.
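To make this outcome-coding rule concrete, the sketch below (our illustration only; the committee describes no software, and the class name and fields are hypothetical) converts a study’s counts of reported comparisons into the three percentages described above.

```python
from dataclasses import dataclass

@dataclass
class StudyOutcomes:
    """Counts of reported comparisons for one study (hypothetical fields)."""
    favoring_treatment: int   # statistically significant, favoring the treatment
    favoring_comparison: int  # statistically significant, favoring the comparison
    nonsignificant: int       # comparisons with no significant difference

    def summarize(self) -> dict:
        total = self.favoring_treatment + self.favoring_comparison + self.nonsignificant
        return {
            "pct_positive": 100 * self.favoring_treatment / total,
            "pct_negative": 100 * self.favoring_comparison / total,
            "pct_nonsignificant": 100 * self.nonsignificant / total,
        }

# A study reporting 6 comparisons: 3 favor the treatment curriculum,
# 1 favors the comparison curriculum, and 2 are nonsignificant.
print(StudyOutcomes(3, 1, 2).summarize())
# -> 50.0 percent positive, ~16.7 percent negative, ~33.3 percent nonsignificant
```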

The committee emphasizes that we did not directly evaluate the materials. We present no analysis of results aggregated across studies by naming individual curricular programs because we did not consider the magnitude or rigor of the database for individual programs substantial enough to do so. Nevertheless, there are studies that provide compelling data concerning the effectiveness of a program in a particular context. Furthermore, although we report on individual studies and their results to highlight issues of approach and methodology, we do not summarize results of the individual programs, in keeping with our primary charge, which was to evaluate the evaluations.

DESCRIPTION OF COMPARATIVE STUDIES DATABASE ON CRITICAL DECISION POINTS

An Experimental or Quasi-Experimental Design

We separated the studies into experimental and quasi-experimental, and found that 100 percent of the studies were quasi-experimental (Campbell and Stanley, 1966; Cook and Campbell, 1979; Rossi et al., 1999).[1] Within the quasi-experimental studies, we identified three subcategories of comparative study. In the first case, we identified a study as cross-curricular comparative if it compared the results of curriculum A with curriculum B. A few studies in this category also compared two samples within the same curriculum to each other, specifying different conditions such as high and low implementation quality.

A second category of quasi-experimental study involved comparisons that could shed light on effectiveness through time series studies. These studies compared the performance of a sample of students in a curriculum under investigation across time, such as in a longitudinal study of the same students over time. A third category of comparative study involved a comparison to some form of externally normed results, such as populations taking state, national, or international tests, or prior research assessments from a published study or studies. We categorized these studies, divided them into NSF, UCSMP, and commercial, and labeled them by the categories above (Figure 5-3).

[1] One study, by Peters (1992), used random assignment to two classrooms, but was classified as quasi-experimental because of its sample size and use of qualitative methods.

FIGURE 5-3 The number of comparative studies in each category.

In nearly all studies in the comparative group, the titles of the experimental curricula were explicitly identified. The only exception was the ARC Implementation Center study (Sconiers et al., 2002), in which three NSF-supported elementary curricula were examined but their effects were pooled in the results. In contrast, in the majority of cases, the comparison curriculum is referred to simply as “traditional.” In only 22 cases were comparisons made between two identified curricula. Many others surveyed the array of curricula at comparison schools and reported on the most frequently used, but did not identify a single curriculum. This design strategy is used often because other factors were used in selecting comparison groups, and the additional requirement of a single identified curriculum in these sites would often have made matching difficult. Studies were categorized into specified (including a single or multiple identified curricula) and nonspecified curricula. In the 63 studies, the central group was compared to an NSF-supported curriculum (1), an unnamed traditional curriculum (41), a named traditional curriculum (19), or one of the six commercial curricula (2). To our knowledge, any systematic impact of such a decision on results has not been studied, but we express concern that when a specified curriculum is compared to an unspecified mix of many informal curricula, the comparison may favor the coherence and consistency of the single specified curriculum; we consider this possibility subsequently under alternative hypotheses. We believe that a quality study should at least report the array of curricula that comprise the comparison group and include a measure of the frequency of use of each, though a well-defined alternative is more desirable.

If a study was both longitudinal and comparative, it was coded as comparative. When studies only examined performances of a group over time, as in some longitudinal studies, they were coded as quasi-experimental normed. In longitudinal studies, the problems created by student mobility were evident. In one study, Carroll (2001), a five-year longitudinal study of Everyday Mathematics, the sample began with 500 students, 24 classrooms, and 11 schools. By 2nd grade, the longitudinal sample was 343. By 3rd grade, the number of classes increased to 29 while the number of original students decreased to 236. At the completion of the study, approximately 170 of the original students were still in the sample. This high rate of attrition suggests that mobility is a major challenge in curricular evaluation, and that the effects of curricular change on mobile students need to be studied as a potential threat to the validity of the comparison. Mobility is also a challenge in curriculum implementation because students coming into a program do not experience its cumulative, developmental effect.

Longitudinal studies also have unique challenges associated with outcome measures; a study by Romberg et al. (in press) (EX) discussed one approach to this problem. In this study, an external assessment system and a problem-solving assessment system were used. In the external assessment system, items from the National Assessment of Educational Progress (NAEP) and the Third International Mathematics and Science Study (TIMSS) were balanced across four strands (number, geometry, algebra, probability and statistics), and 20 items of moderate difficulty, called anchor items, were repeated on each grade-specific assessment (p. 8). Because the analyses of the results are currently under way, the evaluators could not provide us with final results, so the study is coded as EX.

However, such longitudinal studies can provide substantial evidence of the effects of a curricular program because they may be more sensitive to an accumulation of modest effects and/or can reveal whether the rates of learning change over time within curricular change.

TABLE 5-1 Scores in Percentage Correct by Everyday Mathematics Students and Various Comparison Groups Over a Five-Year Longitudinal Study

Group              Sample Size   1st Grade   2nd Grade   3rd Grade   4th Grade   5th Grade
EM                 n=170-503     58          62          61          71          75
Traditional U.S.   n=976         43          53.5        —           —           44
Japanese           n=750         64          71          —           —           80
Chinese            n=1,037       52          —           —           —           76
NAEP Sample        n=18,033      —           —           44          44          —

NOTE: Number of items per test: 1st grade, 44; 2nd grade, 24; 3rd grade, 22; 4th grade, 29; 5th grade, 33. Dashes indicate grades for which no comparison data were available.

SOURCE: Adapted from Carroll (2001).

The longitudinal study by Carroll (2001) showed that the effects of curricula may often accrue over time, but measurements of achievement present challenges to drawing such conclusions as the content and grade level change. A variety of measures were used over time to demonstrate growth in relation to comparison groups. The author chose a set of measures used previously in studies involving two Asian samples and an American sample to provide a contrast to the students in EM over time. For 3rd and 4th grades, where the data from the comparison group were not available, the authors selected items from the NAEP to bridge the gap. Table 5-1 summarizes the scores of the different comparative groups over five years. Scores are reported as the mean percentage correct for a series of tests on number computation, number concepts and applications, geometry, measurement, and data analysis.

It is difficult to compare performances of different groups on different tests over time against a single longitudinal EM group, and it is not possible to determine whether the students’ performance is increasing or whether the changes in the tests at each grade level are producing the results. Thus, results from longitudinal studies lacking a control group or sophisticated methodological analysis may be suspect and should be interpreted with caution.

In the Hirsch and Schoen (2002) study, based on a sample of 1,457 students’ scores on the Ability to Do Quantitative Thinking (ITED-Q) test, a subtest of the Iowa Tests of Educational Development, students in Core-Plus showed increasing performance relative to national norms over the three-year period. The authors describe the content of the ITED-Q test and point out that “although very little symbolic algebra is required, the ITED-Q is quite demanding for the full range of high school students” (p. 3). They further point out that “[t]his 3-year pattern is consistent, on average, in rural, urban, and suburban schools, for males and females, for various minority groups, and for students for whom English was not their first language” (p. 4). In this case, one sees that studies over time are important, as results over shorter periods may mask cumulative effects of consistent and coherent treatments; such studies could also show increases that do not persist over longer trajectories. One approach to longitudinal studies was used by Webb and Dowling in their studies of the Interactive Mathematics Program (Webb and Dowling, 1995a, 1995b, 1995c). These researchers conducted transcript analyses as a means to examine student persistence and success in subsequent course taking.

The third category of quasi-experimental comparative studies measured student outcomes on a particular curricular program and simply compared them to performance on national or international tests. When these tests were of good quality and represented a genuine sample of a relevant population, such as NAEP reports or TIMSS results, they often provided a reasonable indicator of the effects of the program when combined with a careful description of the sample. Sometimes the national or state tests used were norm-referenced tests producing national percentiles or grade-level equivalents. These normed studies were considered of weaker quality in establishing effectiveness, but were still considered valid as examples of comparing samples to populations.

For Studies That Do Not Use Random Assignment: What Methods of Establishing Comparability Across Groups Were Built into the Design

The most fundamental question in an evaluation study is whether the treatment has had an effect on the chosen criterion variable. In our context, the treatment is the curriculum materials and, in some cases, related professional development, and the outcome of interest is academic learning. To establish whether there is a treatment effect, one must logically rule out as many other explanations as possible for the differences in the outcome variable. There is a long tradition of how this is best done, and the principle from a design point of view is to ensure that there are no differences between the treatment conditions (in these evaluations, often only the new curriculum materials to be evaluated and a control group) either at the outset of the study or during its conduct.

To ensure the first condition, the ideal procedure is the random assignment of the appropriate units to the treatment conditions. The second condition requires that the treatment is administered reliably during the length of the study, and is ensured through careful observation and control of the situation. Without randomization, there are a host of possible confounding variables that could differ among the treatment conditions and that are themselves related to the outcome variables. Put another way, the treatment effect is a parameter that the study is set up to estimate. Statistically, an unbiased estimate is desired: the goal is that its expected value over repeated samplings equals the true value of the parameter. Without randomization at the onset of a study, there is no way to assure this property of unbiasedness. Variables that differ across treatment conditions and are related to the outcomes are confounding variables, which bias the estimation process.
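The mechanism can be illustrated with a small simulation (a generic sketch of confounding, not data from any study reviewed here): when a confounder such as prior achievement drives both selection into the treatment and the outcome, the naive difference in group means is biased, while a regression adjustment that includes the confounder recovers an estimate near the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

prior = rng.normal(0, 1, n)  # confounder: prior achievement
# Without randomization, higher-achieving students are likelier to be treated.
treated = rng.random(n) < 1 / (1 + np.exp(-1.5 * prior))
true_effect = 0.20
outcome = 0.8 * prior + true_effect * treated + rng.normal(0, 1, n)

# Naive estimate: difference in group means (confounded, biased upward).
naive = outcome[treated].mean() - outcome[~treated].mean()

# Adjusted estimate: regress outcome on treatment indicator and confounder.
X = np.column_stack([np.ones(n), treated, prior])
beta = np.linalg.lstsq(X, outcome, rcond=None)[0]

print(f"true effect:       {true_effect:.2f}")
print(f"naive estimate:    {naive:.2f}")    # far from 0.20
print(f"adjusted estimate: {beta[1]:.2f}")  # close to 0.20
```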

Only one study we reviewed, Peters (1992), used randomization in the assignment of students to treatments, but because the study was limited to one teacher teaching two sections and included substantial qualitative methods, we coded it as quasi-experimental. Others reported partially assigning teachers randomly to treatment conditions (Thompson et al., 2001; Thompson et al., 2003). Two primary reasons seem to account for the lack of use of pure experimental design. First, to justify the conduct and expense of a randomized field trial, the program must be described adequately and there must be relative assurance that its implementation has occurred over the duration of the experiment (Peterson et al., 1999). Additionally, one must be sure that the outcome measures are appropriate for the range of performances in the groups and valid relative to the curricula under investigation. Seldom can such conditions be assured for all students and teachers over the duration of a year or more.

A second reason is that random assignment of classrooms to curricular treatment groups typically is not permitted or encouraged under normal school conditions. As one evaluator wrote, “Building or district administrators typically identified teachers who would be in the study and in only a few cases was random assignment of teachers to UCSMP Algebra or comparison classes possible. School scheduling and teacher preference were more important factors to administrators and at the risk of losing potential sites, we did not insist on randomization” (Mathison et al., 1989, p. 11).

The Joint Committee on Standards for Educational Evaluation (1994, p. 165) recognized the likelihood of limitations on randomization, writing:

The groups being compared are seldom formed by random assignment. Rather, they tend to be natural groupings that are likely to differ in various ways. Analytical methods may be used to adjust for these initial differences, but these methods are based upon a number of assumptions. As it is often difficult to check such assumptions, it is advisable, when time and resources permit, to use several different methods of analysis to determine whether a replicable pattern of results is obtained.

Does the dearth of pure experimentation render the results of the studies reviewed worthless? Bias is not an “either-or” proposition; it is a quantity of varying degrees. Through careful measurement of the most salient potential confounding variables, precise theoretical description of constructs, and appropriate methods of statistical analysis, it is possible to reduce the amount of bias in the estimated treatment effect. Identification of the most likely confounding variables, their measurement, and subsequent adjustments can greatly reduce bias and help estimate an effect that is likely to be more reflective of the true value. A theoretically fully specified model is an alternative to randomization, in that including all relevant variables allows unbiased estimation of the parameter. The only problem is knowing when the model is fully specified.

We recognized that we can never have enough knowledge to assure a fully specified model, especially in the complex and unstable conditions of schools. However, a key issue in determining the degree of confidence we have in these evaluations is to examine how they have identified, measured, or controlled for such confounding variables. In the next sections, we report on the methods of the evaluators in identifying and adjusting for such potential confounding variables.

One method of eliminating confounding variables is to examine the extent to which the samples investigated are equated, either by sample selection or by methods of statistical adjustment. For individual students, there is a large literature suggesting the importance of social class to achievement. In addition, prior achievement of students must be considered. In the comparative studies, investigators first identified districts, schools, or classes whose participation could provide sufficient duration of use of curricular materials (typically two years or more), availability of target classes, or adequate levels of use of program materials. Establishing comparability was a secondary concern.

Two major factors were generally used in establishing the comparability of the sample:

Student population characteristics, such as demographic characteristics of students in terms of race/ethnicity, economic levels, or location type (urban, suburban, or rural).

Performance-level characteristics such as performance on prior tests, pretest performance, percentage passing standardized tests, or related measures (e.g., problem solving, reading).

In general, four methods of comparing groups were used in the studies we examined, and they permit different degrees of confidence in their results. In the first type, a matching class, school, or district was identified. Studies were coded as this type if specified characteristics were used to select the schools systematically. In some of these studies, the methodology was relatively complex, as correlates of performance on the outcome measures were found empirically and matches were created on that basis (Schneider, 2000; Riordan and Noyce, 2001; Sconiers et al., 2002). For example, in the Sconiers et al. study, where the total sample of more than 100,000 students was drawn from five states and three elementary curricula were reviewed (Everyday Mathematics, Math Trailblazers [MT], and Investigations [IN]), a highly systematic method was developed. After defining eligibility as a “reform school,” evaluators conducted separate regression analyses for the five states at each tested grade level to identify the strongest predictors of average school mathematics score. They reported that “reading score and low-income variables … consistently accounted for the greatest percentage of total variance. These variables were given the greatest weight in the matching process. Other variables—such as percent white, school mobility rate, and percent with limited English proficiency (LEP)—accounted for little of the total variance but were typically significant. These variables were given less weight in the matching process” (Sconiers et al., 2002, p. 10). To further provide a fair and complete comparison, adjustments were made based on regression analysis of the scores to minimize bias prior to calculating the difference in scores and reporting effect sizes. In their results the evaluators report, “The combined state-grade effect sizes for math and total are virtually identical and correspond to a percentile change of about 4 percent favoring the reform students” (p. 12).
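The flavor of this regression-weighted matching can be sketched as follows. This is our simplified illustration, not the Sconiers et al. procedure itself; the data and column names are hypothetical, and the weights stand in for the shares of variance their per-state regressions attributed to each predictor.

```python
import pandas as pd

# Hypothetical school-level data; columns are illustrative only.
schools = pd.DataFrame({
    "school_id": range(8),
    "reform": [1, 1, 1, 1, 0, 0, 0, 0],  # 1 = uses the reform curriculum
    "reading_score": [52, 61, 47, 70, 50, 60, 49, 68],
    "pct_low_income": [40, 25, 55, 10, 42, 28, 50, 12],
})

# Heavier weight on the stronger predictor, echoing the study's matching.
weights = {"reading_score": 0.7, "pct_low_income": 0.3}

def distance(a: pd.Series, b: pd.Series) -> float:
    """Weighted distance between two schools on standardized covariates."""
    return sum(w * abs(a[col] - b[col]) / schools[col].std()
               for col, w in weights.items())

reform = schools[schools.reform == 1]
comparison = schools[schools.reform == 0]
# Greedy nearest-neighbor matching; real studies also match without replacement.
for _, r in reform.iterrows():
    match = min(comparison.iterrows(), key=lambda kv: distance(r, kv[1]))[1]
    print(f"reform school {r.school_id} -> comparison school {match.school_id}")
```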

A second type of matching procedure was used in the UCSMP evaluations. For example, in an evaluation centered on geometry learning, evaluators advertised in NCTM and UCSMP publications and set conditions for participation from schools using their program in terms of length of use and grade level. After selecting schools with heterogeneous grouping and no tracking, the researchers used a matched-pair design in which they selected classes from the same school on the basis of mathematics ability. They used a pretest to determine this, and because the pretest consisted of two parts, they adjusted their significance level using the Bonferroni method.[2] Pairs were discarded if the differences in means and variance were significant for all students or for those students completing all measures, or if class sizes became too variable. In the algebra study, there were 20 pairs as a result of the matching, and because they were comparing three experimental conditions—first edition, second edition, and comparison classes—in the comparison study relevant to this review, their matching procedure identified 8 pairs. When possible, teachers were assigned randomly to treatment conditions. Most results are presented with the eight identified pairs and an accumulated set of means. The outcomes of this particular study are described below in a discussion of outcome measures (Thompson et al., 2003).

[2] The Bonferroni method is a simple method that allows multiple comparison statements to be made (or confidence intervals to be constructed) while still assuring that an overall confidence coefficient is maintained.
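As a worked illustration of the Bonferroni adjustment (our example; the p-values are hypothetical): with a two-part pretest and a desired overall significance level of 0.05, each part is tested at 0.05/2 = 0.025.

```python
def bonferroni_alpha(overall_alpha: float, num_tests: int) -> float:
    """Per-test significance level that caps the familywise error rate."""
    return overall_alpha / num_tests

alpha = bonferroni_alpha(0.05, num_tests=2)  # two pretest parts -> 0.025
p_values = [0.031, 0.012]                    # hypothetical pretest p-values
for part, p in enumerate(p_values, start=1):
    verdict = "significant" if p < alpha else "not significant"
    print(f"pretest part {part}: p = {p:.3f} -> {verdict} at alpha = {alpha}")
```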

A third method was to measure factors such as prior performance or socioeconomic status (SES) based on pretesting, and then to use analysis of covariance or multiple regression in the subsequent analysis to factor out the variance associated with these factors. These studies were coded as “control.” A number of studies of the Saxon curricula used this method. For example, Rentschler (1995) conducted a study of Saxon 76 compared to Silver Burdett with 7th graders in West Virginia. He reported that the groups differed significantly in that the control classes had 65 percent of students on free and reduced-price lunch programs compared to 55 percent in the experimental condition. He used scores on the California Test of Basic Skills mathematics computation and mathematics concepts and applications as his pretest scores and found significant differences in favor of the experimental group. His posttest scores showed the Saxon experimental group outperforming the control group on both computation and concepts and applications. Using analysis of covariance, the computation difference in favor of the experimental group remained statistically significant; however, the difference in concepts and applications, once adjusted, showed no significant difference at the p < .05 level.
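The analysis-of-covariance logic of these “control” studies can be sketched as follows. This is a generic illustration with simulated data, not Rentschler’s analysis: the posttest is regressed on the treatment indicator with the pretest as a covariate, so the treatment contrast is adjusted for initial group differences.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 120

# Simulated data: the experimental group starts slightly ahead on the pretest.
group = np.repeat(["control", "experimental"], n // 2)
pre = rng.normal(50, 10, n) + np.where(group == "experimental", 3, 0)
post = 0.7 * pre + np.where(group == "experimental", 4, 0) + rng.normal(0, 8, n)
df = pd.DataFrame({"group": group, "pre": pre, "post": post})

# ANCOVA: posttest on group, controlling for the pretest covariate.
model = smf.ols("post ~ pre + C(group)", data=df).fit()
# The C(group)[T.experimental] coefficient is the pretest-adjusted effect.
print(model.params)
print(model.pvalues)
```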

A fourth method was noted in studies that used less rigorous methods of sample selection and comparison of prior achievement or similar demographics. These studies were coded as “compare.” Typically, there was no explicit procedure to decide whether the comparison was good enough. In some of the studies, it appeared that the comparison was used not as a means of selection, but rather as a more informal device to convince the reader of the plausibility of the equivalence of the groups. Clearly, the studies that used a more precise method of selection were more likely to produce results in which one can place greater confidence.

Definition of Unit of Analysis

A major decision in forming an evaluation design is the unit of analysis. The unit of selection or randomization used to assign elements to treatment and control groups is closely linked to the unit of analysis. As noted in the National Research Council (NRC) report (1992, p. 21):

If one carries out the assignment of treatments at the level of schools, then that is the level that can be justified for causal analysis. To analyze the results at the student level is to introduce a new, nonrandomized level into the study, and it raises the same issues as does the nonrandomized observational study…. The implications … are twofold. First, it is advisable to use randomization at the level at which units are most naturally manipulated. Second, when the unit of observation is at a “lower” level of aggregation than the unit of randomization, then for many purposes the data need to be aggregated in some appropriate fashion to provide a measure that can be analyzed at the level of assignment. Such aggregation may be as simple as a summary statistic or as complex as a context-specific model for association among lower-level observations.

In many studies, inadequate attention was paid to the fact that the unit of selection would later become the unit of analysis. The unit of analysis, for most curriculum evaluators, needs to be at least the classroom, if not the school or even the district. The units must be independently responding units because instruction is a group process. Students within a classroom are not independent, and even classrooms are not entirely independent when teachers work together on instruction within a school, in which case the school becomes the appropriate unit. Care needed to be taken to ensure that an adequate number of units would be available to provide sufficient statistical power to detect important differences.

A curriculum is experienced by students in a group, and this implies that individual student responses and what they learn are correlated. As a result, the appropriate unit of assignment and analysis must at least be defined at the classroom or teacher level. Other researchers (Bryk et al., 1993) suggest that the unit might be better selected at an even higher level of aggregation. The school itself provides a culture in which the curriculum is enacted as it is influenced by the policies and assignments of the principal, by the professional interactions and governance exhibited by the teachers as a group, and by the community in which the school resides. This would imply that the school might be the appropriate unit of analysis. Even further, to the extent that such decisions about curriculum are made at the district level and supported through resources and professional development at that level, the appropriate unit could arguably be the district. On a more practical level, we found that arguments can be made for a variety of decisions on the selection of units, and what is most essential is to make a clear argument for one’s choice, to use the same unit in the analysis as in the sample selection process, and to recognize the potential limits to generalization that result from one’s decisions.
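A minimal sketch of the aggregation this implies (our illustration with simulated data): student scores are first averaged within classrooms, and the curricular comparison is then run on classroom means, because students within a classroom do not respond independently.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)

# Simulated data: 12 classrooms of 25 students, 6 classrooms per curriculum.
rows = []
for classroom in range(12):
    curriculum = "A" if classroom < 6 else "B"
    class_effect = rng.normal(0, 4)  # classrooms respond as groups
    for _ in range(25):
        rows.append({
            "classroom": classroom,
            "curriculum": curriculum,
            "score": 70 + (3 if curriculum == "A" else 0)
                     + class_effect + rng.normal(0, 10),
        })
df = pd.DataFrame(rows)

# Misleading: treats 300 correlated students as independent units.
a = df[df.curriculum == "A"].score
b = df[df.curriculum == "B"].score
print("student-level p-value:  ", stats.ttest_ind(a, b).pvalue)

# Defensible: aggregate to classroom means, then compare 6 vs. 6 classrooms.
means = df.groupby(["classroom", "curriculum"], as_index=False)["score"].mean()
a_m = means[means.curriculum == "A"].score
b_m = means[means.curriculum == "B"].score
print("classroom-level p-value:", stats.ttest_ind(a_m, b_m).pvalue)
```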

We would argue in all cases that reports of how sites were selected must be explicit in the evaluation report. For example, one set of evaluation studies (UCSMP) selected sites through advertisements in a journal distributed by the program and in NCTM journals (Thompson et al., 2001; Thompson et al., 2003). The samples in these studies tended to be affluent suburban and predominantly white populations. Other conditions of inclusion, such as frequency of use, also might have influenced this outcome, but it is important that over a set of studies on effectiveness, all populations of students be adequately sampled. When a study is not randomized, adjustments for these confounding variables should be included. In our analysis of equity, we report on concerns about the representativeness of the overall samples and their impact on the generalizability of the results.

Implementation Components

The complexity of doing research on curricular materials introduces a number of possible confounding variables. Due to the documented complexity of curricular implementation, most comparative study evaluators attempt to monitor implementation in some fashion. A valuable outcome of a well-conducted evaluation is to determine not only whether the experimental curriculum could ideally have a positive impact on learning, but whether it can survive or thrive in the conditions of schooling that are so variable across sites. It is essential to know what the treatment was, whether it occurred, and if so, to what degree of intensity, fidelity, duration, and quality. In our model in Chapter 3, these factors were referred to as “implementation components.” Measuring implementation can be costly for large-scale comparative studies; however, many researchers have shown that variation in implementation is a key factor in determining effectiveness. In coding the comparative studies, we identified three types of components that help to document the character of the treatment: implementation fidelity, professional development treatments, and attention to teacher effects.

Implementation Fidelity

Implementation fidelity is a measure of the basic extent of use of the curricular materials. It does not address issues of instructional quality. In some studies, implementation fidelity is synonymous with “opportunity to learn.” In examining implementation fidelity, a variety of data were reported, including, most frequently, the extent of coverage of the curricular material, the consistency of the instructional approach to content in relation to the program’s theory, reports of pedagogical techniques, and the length of use of the curricula at the sample sites. Other, less frequently used approaches documented the calendar of curricular coverage, requested teacher feedback by textbook chapter, conducted student surveys, and gauged homework policies, use of technology, and other particular program elements. Interviews with teachers and students, classroom surveys, and observations were the most frequently used data-gathering techniques. Classroom observations were conducted infrequently in these studies, except when comparative studies were combined with case studies, typically with small numbers of schools and classes where observations were conducted for long or frequent time periods. In our analysis, we coded only the presence or absence of one or more of these methods.

If the extent of implementation was used in interpreting the results, then we classified the study as having adjusted for implementation differences. Across all 63 at least minimally methodologically adequate studies, 44 percent reported some type of implementation fidelity measure, 3 percent reported and adjusted for it in interpreting their outcome measures, and 53 percent recorded no information on this issue. Differences among studies, by study type (NSF, UCSMP, and commercially generated), showed variation on this issue, with 46 percent of NSF reporting or adjusting for implementation, 75 percent of UCSMP, and only 11 percent of the other studies of commercial materials doing so. Of the commercial, non-UCSMP studies included, only one reported on implementation. Possibly, the evaluators for the NSF and UCSMP Secondary programs recognized more clearly that their programs demanded significant changes in practice that could affect their outcomes and could pose challenges to the teachers assigned to them.

A study by Abrams (1989) (EX)[3] on the use of Saxon algebra by ninth graders showed that concerns for implementation fidelity extend to all curricula, even those like Saxon whose methods may seem more consistent with common practice. Abrams wrote, “It was not the intent of this study to determine the effectiveness of the Saxon text when used as Saxon suggests, but rather to determine the effect of the text as it is being used in the classroom situations. However, one aspect of the research was to identify how the text is being taught, and how closely teachers adhere to its content and the recommended presentation” (p. 7). Her findings showed that for the 9 teachers and 300 students, treatment effects favoring the traditional group (using Dolciani’s Algebra I textbook, Houghton Mifflin, 1980) were found on the algebra test, the algebra knowledge/skills subtest, and the problem-solving test for this population of teachers (fixed effect). No differences were found between the groups on an algebra understanding/applications subtest, overall attitude toward mathematics, mathematical self-confidence, anxiety about mathematics, or enjoyment of mathematics. She suggests that the lack of differences might be due to the ways in which teachers supplement materials, change test conditions, emphasize and deemphasize topics, use their own tests, vary the proportion of time spent on development and practice, use calculators and group work, and generally adapt the materials to their own interpretation and method. Many of these practices conflict directly with the recommendations of the authors of the materials.

[3] Neither study referenced in this section met the criteria for inclusion in the comparative studies, but both shed direct light on comparative issues of implementation. The Abrams study was omitted because it examined a program at a grade level outside the specified grade band for that curriculum. Briars and Resnick (2000) did not provide explicit comparison scores to permit one to evaluate the level of student attainment.

A study by Briars and Resnick (2000) (EX) in Pittsburgh schools directly confronted issues relevant to professional development and implementation. Evaluators contrasted the performance of students of teachers with high and low implementation quality, and showed the results on two contrasting outcome measures, Iowa Test of Basic Skills (ITBS) and Balanced Assessment. Strong implementers were defined as those who used all of the EM components and provided student-centered instruction by giving students opportunities to explore mathematical ideas, solve problems, and explain their reasoning. Weak implementers were either not using EM or using it so little that the overall instruction in the classrooms was “hardly distinguishable from traditional mathematics instruction” (p. 8). Assignment was based on observations of student behavior in classes, the presence or absence of manipulatives, teacher questionnaires about the programs, and students’ knowledge of classroom routines associated with the program.

From the identification of strong- and weak-implementing teachers, strong- and weak-implementation schools were identified as those with strong- or weak-implementing teachers in 3rd and 4th grades over two consecutive years. The performance of students with 2 years of EM experience in these settings composed the comparative samples. Three pairs of strong- and weak-implementation schools with similar demographics in terms of free and reduced-price lunch (range 76 to 93 percent), students living with only one parent (range 57 to 82 percent), mobility (range 8 to 16 percent), and ethnicity (range 43 to 98 percent African American) were identified. These students’ 1st-grade ITBS scores indicated similarity in prior performance levels. Finally, the evaluators predicted that if the effects were due to the curricular implementation and accompanying professional development, the effects on scores should be seen in 1998, after full implementation. Figure 5-4 shows that on the 1998 New Standards exams, placement in strong- and weak-implementation schools strongly affected students’ scores. Over three years, performance in the district on skills, concepts, and problem solving rose, confirming the evaluators’ predictions.

FIGURE 5-4 Percentage of students who met or exceeded the standard: districtwide grade 4 New Standards Mathematics Reference Examination (NSMRE) performance for 1996, 1997, and 1998 by level of Everyday Mathematics implementation. Error bars denote the 99 percent confidence interval for each data point.

SOURCE: Re-created from Briars and Resnick (2000, pp. 19-20).

An article by McCaffrey et al. (2001) examining the interactions among instructional practices, curriculum, and student achievement illustrates the point that the terms traditional and reform teaching are often inadequately linked to measurement tools. In this study, researchers conducted an exploratory factor analysis that led them to create two scales for instructional practice: Reform Practices and Traditional Practices. The reform scale measured the frequency, by means of teacher report, of teacher and student behaviors associated with reform instruction and assessment practices, such as using small-group work, explaining reasoning, representing and using data, writing reflections, or performing tasks in groups. The traditional scale focused on explanations to whole classes, the use of worksheets, practice, and short-answer assessments. There was a –0.32 correlation between the two scale scores for integrated curriculum teachers and a 0.27 correlation for traditional curriculum teachers. This shows that it is overly simplistic to think of reform and traditional practices as oppositional. The relationship among a variety of instructional practices is rather more complex as they interact with curriculum and various student populations.
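For concreteness, scale correlations like the –0.32 and 0.27 reported above are ordinary Pearson correlations between per-teacher scale scores; the sketch below (our illustration, with simulated teacher data) shows the computation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers = 40

# Hypothetical per-teacher scale scores (e.g., mean frequency ratings).
reform = rng.normal(3.0, 0.6, n_teachers)
# Built to be negatively related, echoing the -0.32 reported for
# integrated-curriculum teachers; real scores come from teacher surveys.
traditional = 3.5 - 0.4 * reform + rng.normal(0, 0.5, n_teachers)

r = np.corrcoef(reform, traditional)[0, 1]
print(f"correlation between reform and traditional scales: {r:.2f}")
```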

Professional Development

Professional development and teacher effects were separated in our analysis from implementation fidelity. We recognized that professional development could be viewed by the readers of this report in two ways. As indicated in our model, professional development can be considered a program element or component or it can be viewed as part of the implementation process. When viewed as a program element, professional development resources are considered mandatory along with program materials. In relation to evaluation, proponents of considering professional development as a mandatory program element argue that curricular innovations, which involve the introduction of new topics, new types of assessment, or new ways of teaching, must make provision for adequate training, just as with the introduction of any new technology.

For others, the inclusion of professional development among the program elements without a concomitant inclusion of equal amounts of professional development for the comparative treatment interjects a priori disproportionate treatments and biases the results. We hoped for an array of evaluation studies that might shed some empirical light on this dispute, and hence separated professional development from treatment fidelity, coding whether or not studies reported on the amount of professional development provided for the treatment and/or comparison groups. A study was coded as positive if it either reported on the professional development provided to the experimental group or reported the data for both treatments. Across all 63 at least minimally methodologically adequate studies, 27 percent reported some type of professional development measure, 1.5 percent reported and adjusted for it in interpreting their outcome measures, and 71.5 percent recorded no information on the issue.

A study by Collins (2002) (EX)[4] illustrates the critical and controversial role of professional development in evaluation. Collins studied the use of Connected Math over three years in three middle schools under threat of being classified as low performing in the Massachusetts accountability system. A comparison was made between one school (School A) that engaged substantively in the professional development opportunities accompanying the program and two that did not (Schools B and C). In School A, the CMP school, between 100 and 136 hours of professional development were recorded for all seven teachers in grades 6 through 8. In School B, 66 hours were reported for two teachers, and in School C, 150 hours were reported for eight teachers over three years. Results showed significant differences in subsequent performance by students at the school with higher participation in professional development (School A), which became a districtwide top performer; the other two schools remained at risk for low performance. No controls for teacher effects were possible, but the results suggest either the centrality of professional development for successful implementation or the possibility that the results were due to professional development rather than the curriculum materials. That these two interpretations cannot be separated is a problem when professional development is given to one group and not the other: the effect could be due to the textbook, to the professional development, or to an interaction between the two. Research designs should be adjusted to consider these issues when different conditions of professional development are provided.

[4] The Collins study lacked a comparison group and is coded as EX; however, it is reported here as a case study.

Teacher Effects

These studies make it obvious that teacher effects are potential confounding factors. Many evaluation studies devoted inadequate attention to the variable of teacher quality. A few studies (Goodrow, 1998; Riordan and Noyce, 2001; Thompson et al., 2001; Thompson et al., 2003) reported on teacher characteristics such as certification, length of service, experience with curricula, or degrees completed. Those studies that matched classrooms and reported matched rather than aggregated results sought ways to acknowledge the large variations in teacher performance and their impact on student outcomes. We coded any effort to report on possible teacher effects as one indicator of quality. Across all 63 at least minimally methodologically adequate studies, 16 percent reported some type of teacher effect measure, 3 percent reported and adjusted for it in interpreting their outcome measures, and 81 percent recorded no information on this issue.

One can see that the potential confounding factors of teacher effects, in terms of the provision of professional development or the measurement of teacher effects, are not adequately considered in most evaluation designs. Some studies mention the problem and give a subjective judgment as to its nature, but this is descriptive at most. Hardly any of the studies do anything analytical, and because these are such important potential confounding variables, this presents a serious challenge to the efficacy of these studies. Figure 5-5 shows how attention to these factors varies across program categories among NSF-supported, UCSMP, and studies of commercial materials. In general, evaluations of NSF-supported studies were the most likely to measure these variables; UCSMP had the most standardized use of methods to do so across studies; and commercial material evaluators seldom reported on issues of implementation fidelity.

FIGURE 5-5 Treatment of implementation components by program type.

NOTE: PD = professional development.

Identification of a Set of Outcome Measures and Forms of Disaggregation

Using the selected student outcomes identified in the program theory, one must conduct an impact assessment that refers to the design and measurement of student outcomes. In addition to selecting what outcomes should be measured within one’s program theory, one must determine how these outcomes are measured, when those measures are collected, and what purpose they serve from the perspective of the participants. In the case of curricular evaluation, there are significant issues involved in how these measures are reported. To provide insight into the level of curricular validity, many evaluators prefer to report results by topic, content strand, or item cluster. These reports often present the level of specificity of outcome needed to inform curriculum designers, especially when efforts are made to document patterns of errors, distribution of results across multiple choices, or analyses of student methods. In these cases, whole test scores may mask essential differences in impact among curricula at the level of content topics, reporting only average performance.

On the other hand, many large-scale assessments depend on methods of test equating that rely on whole test scores, making comparative interpretations of different test administrations by content strand of questionable reliability. Furthermore, there are questions such as whether to present only gain scores or effect sizes, how to link pretests and posttests, and how to determine the relative curricular sensitivity of various outcome measures.

The findings of comparative studies are reported in terms of the outcome measure(s) collected. To describe the nature of the database with regard to outcome measures and to facilitate our analyses of the studies, we classified each of the included studies on four outcome measure dimensions:

Total score reported;

Disaggregation by content strand, subtest, performance level, SES, or gender;

Use of an outcome measure specific to the curriculum; and

Use of multiple outcome measures.

Most studies reported a total score, but we did find studies that reported only subtest scores or only scores on an item-by-item basis. For example, in the Ben-Chaim et al. (1998) evaluation study of Connected Math, the authors were interested in students’ proportional reasoning proficiency as a result of use of this curriculum. They asked students from eight seventh-grade CMP classes and six seventh-grade control classes to solve a variety of tasks categorized as rate and density problems. The authors provide precise descriptions of the cognitive challenges in the items; however, they do not explain whether the problems written up were representative of performance on a larger set of items. A special rating form was developed to code responses in three major categories (correct answer, incorrect answer, and no response), with subcategories indicating the quality of the work that accompanied the response. No reports on reliability of coding were given. Performance on standardized tests indicated that control students’ scores were slightly higher than CMP students’ at the beginning of the year and lower at the end. Twenty-five percent of the experimental group members were interviewed about their approaches to the problems. The CMP students outperformed the control students (53 percent versus 28 percent) overall in providing correct answers and supporting work, and 27 percent of the control group gave an incorrect answer or showed incorrect thinking compared to 13 percent of the CMP group. An item-level analysis permitted the researchers to evaluate the actual strategies used by the students. They reported, for example, that 82 percent of CMP students used a “strategy focused on package price, unit price, or a combination of the two; those effective strategies were used by only 56 of 91 control students (62 percent)” (p. 264).

The use of item- or content strand-level comparative reports had the advantage of permitting evaluators to assess student learning strategies specific to a curriculum’s program theory. For example, at times evaluators wanted to gauge the effectiveness of using problems different from those on typical standardized tests. In this case, problems were drawn from familiar circumstances but carefully designed to create significant cognitive challenges, to assess how well the informal strategies approach in CMP works in comparison to traditional instruction. The disadvantages of such an approach include the use of only a small number of items and concerns about reliability in scoring. These studies seem to represent a method of creating hybrid research models that build on the detailed analyses possible in case studies while still reporting on samples that provide comparative data. This possibly reflects the concerns of some mathematicians and mathematics educators that the effectiveness of materials needs to be evaluated relative to very specific, research-based issues of learning, which are often inadequately measured by multiple-choice tests. However, a decision not to report total scores entails a trade-off in the reliability and representativeness of the reported data, which must be addressed to increase the objectivity of the reports.

Second, we coded whether outcome data were disaggregated in some way. Disaggregation involved reporting data on dimensions such as content strand, subtest, test item, ethnic group, performance level, SES, and gender. We found disaggregated results particularly helpful in understanding the findings of studies that found main effects, and also in examining patterns across studies. We report the results of the studies’ disaggregation by content strand in our reports of effects. We report the results of the studies’ disaggregation by subgroup in our discussions of generalizability.

Third, we coded whether a study used an outcome measure that the evaluator reported as being sensitive to a particular treatment—a subcategory of what was defined in our framework as “curricular validity of measures.” In such studies, the rationale was that readily available measures such as state-mandated tests, norm-referenced standardized tests, and college entrance examinations do not measure some of the aims of the program under study. A frequently cited instance was that “off the shelf” instruments do not measure well students’ ability to apply their mathematical knowledge to problems embedded in complex settings. Thus, some studies constructed a collection of tasks that assessed this ability and collected data on it (Ben-Chaim et al., 1998; Huntley et al., 2000).

Finally, we recorded whether a study used multiple outcome measures. Some studies used a variety of achievement measures and other studies reported on achievement accompanied by measures such as subsequent course taking or various types of affective measures. For example, Carroll (2001, p. 47) reported results on a norm-referenced standardized achievement test as well as a collection of tasks developed in other studies.

A study by Huntley et al. (2000) illustrates how a variety of these techniques were combined in outcome measures. They developed three assessments: the first emphasized contextualized problem solving, based on items from the American Mathematical Association of Two-Year Colleges and other sources; the second assessed context-free symbolic manipulation; and the third required collaborative problem solving. To link these measures to the overall evaluation, they articulated an explicit model of cognition based on how one links an applied situation to mathematical activity through processes of formulation and interpretation. Their assessment strategy permitted them to investigate algebraic reasoning as an ability to use algebraic ideas and techniques to (1) mathematize quantitative problem situations, (2) use algebraic principles and procedures to solve equations, and (3) interpret the results of reasoning and calculations.

In presenting their data comparing performance on Core-Plus and the traditional curriculum, they presented both main effects and comparisons on subscales. Their design of outcome measures permitted them to examine differences in performance with and without context and to conclude with statements such as “This result illustrates that CPMP students perform better than control students when setting up models and solving algebraic problems presented in meaningful contexts while having access to calculators, but CPMP students do not perform as well on formal symbol-manipulation tasks without access to context cues or calculators” (p. 349). The authors go on to present data on the relationship between knowing how to plan or interpret solutions and knowing how to carry them out. The correlations between these variables were weak but significantly different (0.26 for control groups and 0.35 for Core-Plus). The advantage of using multiple measures carefully tied to program theory is that they permit one to test fine content distinctions, which are likely to be the level of adjustments necessary to fine-tune and improve curricular programs.
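Whether two independent correlations such as 0.26 and 0.35 differ is conventionally tested with a Fisher z-transformation. The sketch below is our illustration of that standard technique, not the authors’ analysis, and the group sizes are hypothetical because the study’s ns are not given here.

```python
import math
from scipy.stats import norm

def fisher_z_test(r1: float, n1: int, r2: float, n2: int) -> float:
    """Two-sided p-value for H0: two independent correlations are equal."""
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher z-transform
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return 2 * norm.sf(abs(z1 - z2) / se)

# Correlations from the text; the sample sizes here are hypothetical.
print(fisher_z_test(0.26, 300, 0.35, 300))
```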

Another interesting approach to the use of outcome measures is found in the UCSMP studies. In many of these studies, evaluators collected information from teachers’ reports and chapter reviews as to whether topics for items on the posttests were taught, calling this an “opportunity to learn” measure. The authors reported results from three types of analyses: (1) total test scores; (2) fair test scores (scores reported by program, but only on items covering topics taught); and (3) conservative test scores (scores on common items taught in both treatments). Table 5-2 reports the variation across these multiple-choice test scores for the Geometry study (Thompson et al., 2003), on a standardized test (High School Subject Tests—Geometry Form B) and the UCSMP-constructed Geometry test, and for the Advanced Algebra study on the UCSMP-constructed Advanced Algebra test (Thompson et al., 2001). The table shows the mean scores for UCSMP classes and comparison classes. In each cell, mean percentage correct is reported first by whole test, then by fair test, and then by conservative test.

TABLE 5-2 Mean Percentage Correct on the Subject Tests

Treatment Group   Geometry—Standard   Geometry—UCSMP   Advanced Algebra—UCSMP
UCSMP             43.1, 44.7, 50.5    51.2, 54.5*      56.1, 58.8, 56.1
Comparison        42.7, 45.5, 51.5    36.6, 40.8*      42.0, 50.1, 50.0

NOTE: In each cell, mean percentage correct is reported for the whole test, the fair test, and the conservative test; for example, “43.1, 44.7, 50.5” means students were correct on 43.1 percent of the total items, 44.7 percent of the fair items for UCSMP, and 50.5 percent of the items taught in both treatments.

*Too few common items to report conservative test data.

SOURCES: Adapted from Thompson et al. (2001); Thompson et al. (2003).
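A small sketch of the three scoring rules (our reconstruction of the idea, with hypothetical item data): the whole-test score uses every item, the fair-test score keeps only items whose topics a program’s teachers taught, and the conservative score keeps only items taught in both treatments.

```python
def pct_correct(flags: list) -> float:
    """Percentage of items answered correctly."""
    return 100 * sum(flags) / len(flags)

# Hypothetical items: (answered correctly, taught in UCSMP, taught in comparison).
items = [
    (True,  True,  True),
    (False, True,  True),
    (True,  True,  False),
    (True,  False, True),
    (False, True,  True),
]

whole = pct_correct([c for c, _, _ in items])
fair_ucsmp = pct_correct([c for c, taught, _ in items if taught])
conservative = pct_correct([c for c, u, t in items if u and t])

print(f"whole test: {whole:.1f}%; fair (UCSMP): {fair_ucsmp:.1f}%; "
      f"conservative: {conservative:.1f}%")
```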

The authors explicitly compare the items from the standard Geometry test with the items from the UCSMP test and indicate overlap and difference. They constructed their own test because, in their view, the standard test was not adequately balanced among skills, properties, and real-world uses. The UCSMP test included items on transformations, representations, and applications that were lacking in the national test. Only five items were taught by all teachers; hence in the case of the UCSMP geometry test, there is no report on a conservative test. In the Advanced Algebra evaluation, only a UCSMP-constructed test was viewed as appropriate to cover the treatment of the prior material and alignment to the goals of the new course. These data sets demonstrate the challenge of selecting appropriate outcome measures, the sensitivity of the results to those decisions, and the importance of full disclosure of decision-making processes in order to permit readers to assess the implications of the choices. The methodology utilized sought to ensure that the material in the course was covered adequately by treatment teachers while finding ways to make comparisons that reflected content coverage.

Only one study reported on its outcomes using embedded assessment items employed over the course of the year. In a study of Saxon and UCSMP, Peters (1992) (EX) studied the use of these materials with two classrooms taught by the same teacher. In this small study, he randomly assigned students to treatment groups and then measured their performance on four unit tests composed of items common to both curricula, as well as their progress on the Orleans-Hanna Algebra Prognosis Test.

Peters' study showed no significant difference in placement scores between Saxon and UCSMP on the posttest, but did show differences on the embedded assessment. Figure 5-6 (Peters, 1992, p. 75) shows an interesting display of the differences on a "continuum" that shows both the direction and magnitude of the differences and provides a level of concept specificity missing in many reports. This figure and a display (Figure 5-7) in a study by Senk (1991, p. 18) of students' mean scores on Curriculum A versus Curriculum B, with a 10 percent range of differences marked, represent two excellent means to communicate the kinds of detailed content outcome information that promises to be informative to curriculum writers, publishers, and school decision makers. In Figure 5-7, 16 items listed by number were taken from the Second International Mathematics Study. The Functions, Statistics, and Trigonometry sample averaged 41 percent correct on these items whereas the U.S. precalculus sample averaged 38 percent. As shown in the figure, differences of 10 percent or less fall inside the banded area and greater than 10 percent fall outside, producing a display that makes it easy for readers and designers to identify the relative curricular strengths and weaknesses of topics.

While we value detailed outcome measure information, we also recognize the importance of examining curricular impact on students' standardized test performance. Many developers, but not all, are explicit in rejecting standardized tests as adequate measures of the outcomes of their programs, claiming that these tests focus on skills and manipulations, that they are overly reliant on multiple-choice questions, and that they are often poorly aligned to new content emphases such as probability and statistics, transformations, use of contextual problems and functions, and process skills, such as problem solving, representation, or use of calculators. However, national and state tests are being revised to include more content on these topics and to draw on more advanced reasoning. Furthermore, these high-stakes tests are of major importance in school systems, determining graduation, passing standards, school ratings, and so forth. For this reason, if a curricular program demonstrated positive impact on such measures, we referred to that in Chapter 3 as establishing "curricular alignment with systemic factors." Adequate performance on these measures is of paramount importance, to large groups of parents and school administrators, to the survival of reform.


FIGURE 5-6 Continuum of criterion score averages for studied programs.

SOURCE: Peters (1992, p. 75).

These examples demonstrate how careful attention to outcome measures is an essential element of valid evaluation.

In Table 5-3, we document the number of studies using each of the types of outcome measures we used to code the data, and also report on the types of tests used across the studies.


FIGURE 5-7 Achievement (percentage correct) on Second International Mathematics Study (SIMS) items by U.S. precalculus students and functions, statistics, and trigonometry (FST) students.

SOURCE: Re-created from Senk (1991, p. 18).

TABLE 5-3 Number of Studies Using a Variety of Outcome Measures by Program Type

              Total Test    Content Strands   Test Match to Program   Multiple Test
              Yes    No     Yes    No         Yes    No               Yes    No
NSF           43     3      28     18         26     20               21     25
Commercial    8      1      4      5          2      7                2      7
UCSMP         7      1      7      1          7      1                7      1

A Choice of Statistical Tests, Including Statistical Significance and Effect Size

In our first review of the studies, we coded what methods of statistical evaluation were used by different evaluators. Most common were t-tests; less frequently one found Analysis of Variance (ANOVA), Analysis of Covariance (ANCOVA), and chi-square tests.


FIGURE 5-8 Statistical tests most frequently used.

In a few cases, results were reported using multiple regression or hierarchical linear modeling. Some used multiple tests; hence the total exceeds 63 (Figure 5-8).

One of the difficult aspects of doing curriculum evaluations concerns using the appropriate unit both in terms of the unit to be randomly assigned in an experimental study and the unit to be used in statistical analysis in either an experimental or quasi-experimental study.

For our purposes, we decided that unless the study concerned an intact student population, such as the freshmen at a single university where a student comparison was the correct unit, the unit for statistical tests should be at least at the classroom level. Judgments were made for each study as to whether the appropriate unit was utilized. This question is an important one because statistical significance is related to sample size, and as a result, studies that inappropriately use the student as the unit of analysis could be concluding significant differences where they are not present. For example, if achievement differences between two curricula are tested in 16 classrooms with 400 students, it will always be easier to show significant differences using scores from those 400 students than using 16 classroom means.
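A minimal simulation (ours, not drawn from any of the studies) illustrates the mechanism: when scores cluster within classrooms, a student-level t-test treats 400 correlated scores as 400 independent pieces of evidence, while a class-mean test honestly uses 16. All counts, means, and variances below are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_classes, class_size = 8, 25   # 8 classes per curriculum, 400 students in all

def sample_classes(overall_mean):
    # Each class gets its own mean (between-class variation),
    # then students vary around that class mean.
    class_means = rng.normal(overall_mean, 5.0, n_classes)
    return [rng.normal(m, 10.0, class_size) for m in class_means]

treatment = sample_classes(52.0)
control = sample_classes(50.0)

# Student as the unit of analysis: n = 400, deceptively small standard errors.
t_stu, p_stu = stats.ttest_ind(np.concatenate(treatment), np.concatenate(control))

# Class mean as the unit of analysis: n = 16 means, honest standard errors.
t_cls, p_cls = stats.ttest_ind([c.mean() for c in treatment],
                               [c.mean() for c in control])

print(f"student-level:    t = {t_stu:.2f}, p = {p_stu:.4f}")
print(f"class-mean level: t = {t_cls:.2f}, p = {p_cls:.4f}")
```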

Fifty-seven studies used students as the unit of analysis in at least one test of significance. Three of these were coded as correct because they involved whole populations. In all, 10 studies were coded as using the correct unit of analysis; hence, 7 studies used teachers, classes, or schools.

TABLE 5-4 Performance on Applied Algebra Problems with Use of Calculators, Part 1

Treatment    n     M (0-100)    SD
Control      273   34.1         14.8
CPMP         320   42.6         21.3

NOTE: t = -5.69, p < .001. All sites combined.

SOURCE: Huntley et al. (2000). Reprinted with permission.

TABLE 5-5 Reanalysis of Algebra Performance Data

                     Site Mean            Independent Samples   Dependent Sample
Site                 Control    CPMP      Difference            Difference
1                    31.7       35.5                            3.8
2                    26.0       49.4                            23.4
3                    36.7       25.2                            -11.5
4                    41.9       47.7                            5.8
5                    29.4       38.3                            8.9
6                    30.5       45.6                            15.1
Average              32.7       40.3      7.58                  7.58
Standard deviation   5.70       9.17      7.64                  11.75
Standard error                            4.41                  4.80
t                                         1.7                   1.6
p                                         0.116                 0.175

SOURCE: Huntley et al. (2000).

For some studies where multiple tests were conducted, a judgment was made as to whether the primary conclusions drawn treated the unit of analysis adequately. For example, Huntley et al. (2000) compared the performance of CPMP students with students in a traditional course on a measure of ability to formulate and use algebraic models to answer various questions about relationships among variables. The analysis used students as the unit of analysis and showed a significant difference, as shown in Table 5-4.

To examine the robustness of this result, we reanalyzed the data using an independent-samples t-test and a matched-pairs t-test with class means as the unit of analysis in both tests (Table 5-5). As can be seen from the analyses, in neither statistical test was the difference between groups statistically significant (p < .05), thus emphasizing the importance of using the correct unit in analyzing the data.
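This reanalysis can be reproduced from the site means published in Table 5-5; the short sketch below (our reconstruction using scipy, not the committee's code) recovers both test statistics:

```python
# Reproducing the Table 5-5 reanalysis from its published site means:
# an independent-samples t-test and a matched-pairs t-test on class means.
from scipy import stats

control = [31.7, 26.0, 36.7, 41.9, 29.4, 30.5]   # site means, control
cpmp    = [35.5, 49.4, 25.2, 47.7, 38.3, 45.6]   # site means, CPMP

t_ind, p_ind = stats.ttest_ind(cpmp, control)    # independent samples
t_dep, p_dep = stats.ttest_rel(cpmp, control)    # dependent (matched) samples

print(f"independent: t = {t_ind:.1f}, p = {p_ind:.3f}")  # t = 1.7, p = 0.116
print(f"dependent:   t = {t_dep:.1f}, p = {p_dep:.3f}")  # t = 1.6, p = 0.175
```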

Reanalysis of student-level data using class means will not always result in a change in finding. Furthermore, using class means as the unit of analysis does not suggest that significant differences will not be found. For example, a study by Thompson et al. (2001) compared the performance of UCSMP students with the performance of students in a more traditional program across several measures of achievement. They found significant differences between UCSMP students and the non-UCSMP students on several measures. Table 5-6 shows results of an analysis of a multiple-choice algebraic posttest using class means as the unit of analysis. Significant differences were found in five of eight separate classroom comparisons, as shown in the table. They also found a significant difference using a matched-pairs t-test on class means.

TABLE 5-6 Mean Percentage Correct on Entire Multiple-Choice Posttest: Second Edition and Non-UCSMP

                 UCSMP Second Edition         Non-UCSMP
School   Pair
Code     ID      n     Mean   SD     OTL      n     Mean   SD     OTL      SE     t      df    p
J        18      18    60.8   9.0    100      14    55.2   10.2   69       3.40   1.65   30    0.110
J        19      11    58.8   13.5   100      15    53.7   11.0   69       4.81   1.06   24    0.299
K        20      22    63.8   13.0   94       24    45.9   10.0   72       3.41   5.22*  44
K        21      16    64.8   14.0   94       23    43.0   11.9   72       4.16   5.23*  37
L        22      19    57.6   16.9   92       20    38.8   9.1    75       4.32   4.36*  37
L        23      13    44.7   11.2   92       15    38.3   11.0   75       4.20   1.52   26    0.140
M        24      29    58.4   12.7   92       22    37.8   13.8   47       3.72   5.56*  49
M        25      22    39.6   13.5   92       23    30.8   9.9    47       3.52   2.51*  43
Overall          150   56.1   15.4            156   42.0   13.1

NOTE: The mean is the mean percentage correct on a 36-item multiple-choice posttest. The OTL is the percentage of the items for which teachers reported their students had the opportunity to learn the needed content. An asterisk indicates a statistically significant difference between the mean percentages correct for that pair. A matched-pairs t-test indicates that the differences between the two curricula are significant.

SOURCE: Thompson et al. (2001). Reprinted with permission.


The lesson to be learned from these reanalyses is that the choice of unit of analysis and the way the data are aggregated can impact study findings in important ways, including the extent to which those findings can be generalized. Thus it is imperative that evaluators pay close attention to such considerations in the design, implementation, and analysis of their studies.


Second, effect size has become a relatively common and standard way of gauging the practical significance of the findings. Statistical significance only indicates whether the mean-level differences between two curricula are large enough not to be due to chance, assuming they come from the same population. When statistical differences are found, the question remains as to whether such differences are large enough to consider. Because any innovation has its costs, the question becomes one of cost-effectiveness: Are the differences in student achievement large enough to warrant the costs of change? Quantifying the practical effect once statistical significance is established is one way to address this issue. There is a statistical literature for doing this, and for the purposes of this review, the committee simply noted whether these studies estimated such an effect. However, the committee further noted that in conducting meta-analyses across these studies, effect size was likely to be of little value. These studies used an enormous variety of outcome measures, and even using effect size as a means to standardize units across studies is not sensible when the measures in each study address such a variety of topics, forms of reasoning, content levels, and assessment strategies.
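As an illustration of what such an estimate involves, the sketch below (ours; the report itself only coded whether an effect size was reported) computes Cohen's d with a pooled standard deviation from the Table 5-4 summary statistics:

```python
# Cohen's d with a pooled standard deviation, using the Table 5-4 summaries.
import math

n1, m1, sd1 = 273, 34.1, 14.8   # control
n2, m2, sd2 = 320, 42.6, 21.3   # CPMP

pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
d = (m2 - m1) / pooled_sd
print(f"Cohen's d = {d:.2f}")   # roughly 0.46, a small-to-medium effect
```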

We note that very few studies drew upon advances in statistical modeling, including causal modeling, hierarchical linear modeling (Bryk and Raudenbush, 1992; Bryk et al., 1993), and selection bias modeling (Heckman and Hotz, 1989). Although developing detailed specifications for these approaches is beyond the scope of this review, we wish to emphasize that these methodological advances should be considered within future evaluation designs.

Results and Limitations to Generalizability Resulting from Design Constraints

One also must consider what generalizations can be drawn from the results (Campbell and Stanley, 1966; Caporaso and Roos, 1973; Boruch, 1997). Generalization is a matter of external validity, in that it determines to what populations the study results are likely to apply. In designing an evaluation study, one must carefully consider, in the selection of units of analysis, how various characteristics of those units will affect the generalizability of the study. It is common for evaluators to conflate issues of representativeness for the purpose of generalizability (external validity) with comparativeness (the selection of or adjustment for comparison groups [internal validity]). Not all studies must be representative of the population served by mathematics curricula to be internally valid. But, to be generalizable beyond restricted communities, representativeness must be obtained by the random selection of the basic units. Clearly specifying such limitations to generalizability is critical. Furthermore, on the basis of equity considerations, one must be sure that if overall effectiveness is claimed, the studies have been conducted and analyzed with reference to all relevant subgroups.

Thus, depending on the design of a study, its results may be limited in generalizability to other populations and circumstances. We identified four typical kinds of limitations on the generalizability of studies and coded them to determine, on the whole, how generalizable the results across studies might be.

First, there were studies whose designs were limited by the ability or performance level of the students in the samples. It was not unusual to find that when new curricula were implemented at the secondary level, schools kept in place systems of tracking that assigned the top students to traditional college-bound curriculum sequences. As a result, studies either used comparison groups that were matched demographically but less skilled than the population as a whole in relation to prior learning, or compared samples of less well-prepared students to samples of students with stronger preparations.

Alternatively, some studies reported on the effects of curricular reform on gifted and talented students or on college-attending students. In these cases, the study results would also limit the generalizability of the results to similar populations. Reports using limited samples of students' ability and prior performance levels were coded as a limitation to the generalizability of the study.

For example, Wasman (2000) conducted a study of one school (six teachers) and examined the students’ development of algebraic reasoning after one (n=100) and two years (n=73) in CMP. In this school, the top 25 percent of the students are counseled to take a more traditional algebra course, so her experimental sample, which was 61 percent white, 35 percent African American, 3 percent Asian, and 1 percent Hispanic, consisted of the lower 75 percent of the students. She reported on the student performance on the Iowa Algebraic Aptitude Test (IAAT) (1992), in the subcategories of interpreting information, translating symbols, finding relationships, and using symbols. Results for Forms 1 and 2 of the test, for the experimental and norm group, are shown in Table 5-7 for 8th graders.

In our coding of outcomes, this study was coded as showing no significant differences, although arguably its results demonstrate a positive set of outcomes, as the treatment group was weaker than the control group. Had the researcher used a prior achievement measure and a different statistical technique, significance might have been demonstrated, although potential teacher effects confound interpretations of results.

TABLE 5-7 Comparing Iowa Algebraic Aptitude Test (IAAT) Mean Scores of the Connected Mathematics Project Forms 1 and 2 to the Normative Group (8th Graders)

                          Interpreting   Translating   Finding         Using
                          Information    Symbols       Relationships   Symbols       Total
CMP Form 1, 7th (n=51)    9.35 (3.36)    8.22 (3.44)   9.90 (3.26)     8.65 (3.12)   36.12 (11.28)
CMP Form 1, 8th (n=41)    9.76 (3.89)    8.56 (3.64)   9.41 (4.13)     8.27 (3.74)   36.00 (13.65)
Norm, Form 1 (n=2,467)    10.03 (3.35)   9.55 (2.89)   9.14 (3.59)     8.87 (3.19)   37.59 (10.57)
CMP Form 2, 7th (n=49)    9.41 (4.05)    7.82 (3.03)   9.29 (3.57)     7.65 (3.35)   34.16 (11.47)
CMP Form 2, 8th (n=32)    11.28 (3.74)   8.66 (3.81)   10.94 (3.79)    9.81 (3.64)   40.69 (12.94)
Norm, Form 2 (n=2,467)    10.63 (3.78)   8.58 (2.91)   8.67 (3.84)     9.19 (3.17)   37.07 (11.05)

NOTE: Parentheses indicate standard deviation.

SOURCE: Adapted from Wasman (2000).


A second limitation to generalizability was when comparative studies resided entirely at curriculum pilot site locations, where such sites were developed as a means to conduct formative evaluations of the materials with close contact and advice from teachers. Typically, pilot sites have unusual levels of teacher support, whether it is in the form of daily technical support in the use of materials or technology or increased quantities of professional development. These sites are often selected for study because they have established cooperative agreements with the program developers and other sources of data, such as classroom observations, are already available. We coded whether the study was conducted at a pilot site to signal potential limitations in generalizability of the findings.

Third, studies were also coded as being of limited generalizability if they failed to disaggregate their data by socioeconomic class, race, gender, or some other potentially significant source of restriction on the claims. We recorded the categories in which disaggregation occurred and compiled their frequency across the studies. Because of the need to open the pipeline to advanced study in mathematics to members of underrepresented groups, we were particularly concerned about gauging the extent to which evaluators factored such variables into their analysis of results, and not just into the selection of the sample.

Of the 46 included studies of NSF-supported curricula, 19 disaggregated their data by student subgroup. Nine of 17 studies of commercial materials disaggregated their data. Figure 5-9 shows the number of studies that disaggregated outcomes by race or ethnicity, SES, gender, LEP, special education status, or prior achievement. Studies using multiple categories of disaggregation were counted multiple times by program category.

The last category of restricted generalization occurred in studies of limited sample size. Although such studies may have provided more in-depth observations of implementation and reports on professional development factors, the smaller numbers of classrooms and students in the study would limit the extent of generalization that could be drawn from it. Figure 5-10 shows the distribution of sizes of the samples in terms of numbers of students by study type.

Summary of Results by Student Achievement Among Program Types

We present the results of the studies as a means to further investigate their methodological implications. To this end, for each study, we counted across outcome measures the number of findings that were positive, negative, or indeterminate (no significant difference) and then calculated the proportion of each.


FIGURE 5-9 Disaggregation of subpopulations.


FIGURE 5-10 Proportion of studies by sample size and program.

We represented the calculation of each study as a triplet (a, b, c), where a indicates the proportion of the results that were positive and statistically significantly stronger than the comparison program, b indicates the proportion that were negative and statistically significantly weaker than the comparison program, and c indicates the proportion that showed no significant difference between the treatment and the comparative group. For studies with a single outcome measure, without disaggregation by content strand, the triplet is always composed of two zeros and a single one. For studies with multiple measures or disaggregation by content strand, the triplet is typically a set of three decimal values that sum to one. For example, a study with one outcome measure in favor of the experimental treatment would be coded (1, 0, 0), while one with multiple measures and mixed results more strongly in favor of the comparative curriculum might be listed as (.20, .50, .30). This triplet would mean that for 20 percent of the comparisons examined, the evaluators reported statistically significant positive results; for 50 percent of the comparisons, the results were statistically significant in favor of the comparison group; and for 30 percent of the comparisons, no significant difference was found. Overall, the mean score on these distributions was (.54, .07, .40), indicating that across all the studies, 54 percent of the comparisons favored the treatment, 7 percent favored the comparison group, and 40 percent showed no significant difference. Table 5-8 shows the comparison by curricular program types. We present the results by individual program types, because each program type relies on a similar program theory and hence could lead to patterns of results that would be lost in combining the data. If the studies of commercial materials are all grouped together to include UCSMP, their pattern of results is (.38, .11, .51). Again we emphasize that due to our call for increased methodological rigor and the use of multiple methods, this result is not sufficient to establish the curricular effectiveness of these programs as a whole with adequate certainty.
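The triplet coding itself is simple to express; a minimal sketch (ours, not the committee's code) of the tallying described above:

```python
# Compute the (a, b, c) triplet for one study from its coded comparisons.
def triplet(findings):
    """findings: list of 'pos', 'neg', or 'nsd' codes, one per comparison."""
    n = len(findings)
    return (findings.count('pos') / n,
            findings.count('neg') / n,
            findings.count('nsd') / n)

print(triplet(['pos']))                              # (1.0, 0.0, 0.0)
print(triplet(['pos', 'pos', 'neg', 'nsd', 'nsd']))  # (0.4, 0.2, 0.4)
```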

We caution readers that these results are summaries of the results presented across a set of evaluations that meet only the standard of at least minimally methodologically adequate.

TABLE 5-8 Comparison by Curricular Program Types

Proportion of Results That Are:   NSF-Supported (n=46)   UCSMP (n=8)   Commercially Generated (n=9)
In favor of treatment             .591                   .491          .285
In favor of comparison            .055                   .087          .130
Show no significant difference    .354                   .422          .585

Calculations of statistical significance of each program's results were reported by the evaluators; we have made no adjustments for weaknesses in the evaluations such as inappropriate use of units of analysis in calculating statistical significance. Evaluations that consistently used the correct unit of analysis, such as UCSMP, could have fewer reports of significant results as a consequence. Furthermore, these results are not weighted by study size. Within any study, the results pay no attention to comparative effect size or to the established credibility of an outcome measure. Similarly, these results do not take into account differences in the populations sampled, an important consideration in generalizing the results. For example, using the same set of studies as an example, UCSMP studies used volunteer samples who responded to advertisements in their newsletters, resulting in samples with disproportionately Caucasian subjects from wealthier schools compared to national samples. As a result, we would suggest that these results are useful only as baseline data for future evaluation efforts. Our purpose in calculating these results is to permit us to create filters from the critical decision points and test how the results change as one applies more rigorous standards.

Given that none of the studies adequately addressed all of the critical criteria, we do not offer these results as definitive, only suggestive—a hypothesis for further study. In effect, given the limitations of time and support, and the urgency of providing advice related to policy, we offer this filtering approach as an informal meta-analytic technique sufficient to permit us to address our primary task, namely, evaluating the quality of the evaluation studies.

This approach reflects the committee's view that to deeply understand and improve methodology, it is necessary to scrutinize the results and to determine what inferences they provide about the conduct of future evaluations. Analogous to debates on consequential validity in testing, we argue that to strengthen methodology, one must consider what current methodologies are able (or not able) to produce across an entire series of studies. The remainder of the chapter considers in detail what claims are made by these studies, and how robust those claims are when subjected to challenge by alternative hypotheses, filtering by tests of increasing rigor, and examination of results and patterns across the studies.

Alternative Hypotheses on Effectiveness

In the spirit of scientific rigor, the committee sought to consider rival hypotheses that could explain the data. Given the general weaknesses in the designs, these alternative hypotheses often cannot be dismissed. However, we believed that only after examining the configuration of results and

alternative hypotheses can the next generation of evaluations be better informed and better designed. We began by generating alternative hypotheses to explain the positive directionality of the results in favor of experimental groups. Alternative hypotheses included the following:

The teachers in the experimental groups tended to be self-selecting early adopters, and thus able to achieve effects not likely in regular populations.

Changes in student outcomes reflect the effects of professional development instruction, or level of classroom support (in pilot sites), and thus inflate the predictions of effectiveness of curricular programs.

A Hawthorne effect (Franke and Kaul, 1978) occurs when treatments are compared to everyday practices, due to motivational factors that influence experimental participants.

The consistent difference is due to the coherence and consistency of a single curricular program when compared to multiple programs.

The significance level is only achieved by the use of the wrong unit of analysis to test for significance.

Supplemental materials or new teaching techniques produce the results and not the experimental curricula.

Significant results reflect inadequate outcome measures that focus on a restricted set of activities.

The results are due to evaluator bias because too few evaluators are independent of the program developers.

At the same time, one could argue that the results actually underestimate the performance of these materials and are conservative measures; these alternative hypotheses also deserve consideration:

Many standardized tests are not sensitive to these curricular approaches, and by eliminating studies focusing on affect, we eliminated a key indicator of the appeal of these curricula to students.

Poor implementation or increased demands on teachers’ knowledge dampens the effects.

Often in the experimental treatment, top-performing students are missing as they are advised to take traditional sequences, rendering the samples unequal.

Materials are not well aligned with universities and colleges because tests for placement and success in early courses focus extensively on algebraic manipulation.

Program implementation has been undercut by negative publicity and the fears of parents concerning change.

There are also a number of possible hypotheses that may be affecting the results in either direction, and we list a few of these:

Examining the role of the teacher in curricular decision making is an important element in effective implementation, and the mandates of evaluation design make this impossible (consider also the positives and negatives of single- versus dual-track curricula, as in Lundin, 2001).

Local tests that are sensitive to the curricular effects typically are not mandatory and hence may lead to unpredictable performance by students.

Different types and extent of professional development may affect outcomes differentially.

Persistence or attrition may affect the mean scores and are often not considered in the comparative analyses.

One could also generate reasons why the curricular programs produced results showing no significance when one program or the other is actually more effective. This could include high degrees of variability in the results, samples that used the correct unit of analysis but did not obtain consistent participation across enough cases, implementation that did not show enough fidelity to the measures, or outcome measures insensitive to the results. Again, subsequent designs should be better informed by these findings to improve the likelihood that they will produce less ambiguous results and replication of studies could also give more confidence in the findings.

It is beyond the scope of this report to consider each of these alternative hypotheses separately and to seek confirmation or refutation of them. However, in the next section, we describe a set of analyses carried out by the committee that permits us to examine and consider the impact of various critical evaluation design decisions on the patterns of outcomes across sets of studies. A number of analyses shed some light on various alternative hypotheses and may inform the conduct of future evaluations.

Filtering Studies by Critical Decision Points to Increase Rigor

In examining the comparative studies, we identified seven critical decision points that we believed would directly affect the rigor and efficacy of the study design. These decision points were used to create a set of 16 filters. These are listed as the following questions:

Was there a report on comparability relative to SES?

Was there a report on comparability of samples relative to prior knowledge?

Was there a report on treatment fidelity?

Was professional development reported on?

Was the comparative curriculum specified?

Was there any attempt to report on teacher effects?

Was a total test score reported?

Was total test score(s) disaggregated by content strand?

Did the outcome measures match the curriculum?

Were multiple tests used?

Was the appropriate unit of analysis used in their statistical tests?

Did they estimate effect size for the study?

Was the generalizability of their findings limited by use of a restricted range of ability levels?

Was the generalizability of their findings limited by use of pilot sites for their study?

Was the generalizability of their findings limited by not disaggregating their results by subgroup?

Was the generalizability of their findings limited by use of small sample size?

The studies were coded to indicate if they reported having addressed these considerations. In some cases, the decision points were coded dichotomously as present or absent in the studies, and in other cases, the decision points were coded trichotomously, as description presented, absent, or statistically adjusted for in the results. For example, a study may or may not report on the comparability of the samples in terms of race, ethnicity, or socioeconomic status. If a report on SES was given, the study was coded as "present" on this decision; if a report was missing, it was coded as "absent"; and if SES status or ethnicity was used in the analysis to actually adjust outcomes, it was coded as "adjusted for." For each coding, the table that follows reports the number of studies that met that condition, then reports the mean percentage of statistically significant results and of results showing no significant difference for that set of studies. A significance test is run to see if the application of the filter produces changes in the probability that are significantly different. (The significance test used was a chi-square, not corrected for continuity.)

In the cases in which studies are coded into three distinct categories—present, absent, and adjusted for—a second set of filters is applied. First, the studies coded as present or adjusted for are combined and compared to those coded as absent; this is what we refer to as a weak test of the rigor of the study. Second, the studies coded as present or absent are combined and compared to those coded as adjusted for; this is what we refer to as a strong test. For dichotomous codings, there can be as few as three comparisons,

  


and for trichotomous codings, there can be nine comparisons with accompanying tests of significance. Trichotomous codes were used for adjustments for SES and prior knowledge, examining treatment fidelity, professional development, teacher effects, and reports on effect sizes. All others were dichotomous.
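To illustrate the kind of test involved, the sketch below (with hypothetical study counts; the chi-square is computed without a continuity correction, as in our analysis) compares the outcome distributions of studies that meet a filter against those that do not:

```python
# Compare outcome distributions for filtered vs. unfiltered studies.
from scipy.stats import chi2_contingency

#                positive  negative  no significant difference
reported     = [       12,        2,        9]   # studies meeting the filter
not_reported = [       14,        1,        8]   # studies failing the filter

chi2, p, dof, expected = chi2_contingency([reported, not_reported],
                                          correction=False)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```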

NSF Studies and the Filters

For example, there were 11 studies of NSF-supported curricula that simply reported on the issues of SES in creating equivalent samples for comparison, and for this subset the mean probabilities of getting positive, negative, or no-significant-difference results were (.47, .10, .43). If no report of SES was supplied (n=21), those probabilities become (.57, .07, .37), indicating an increase in positive results and a decrease in results showing no significant difference. When an adjustment is made in outcomes based on differences in SES (n=14), the probabilities change to (.72, .00, .28), showing a higher likelihood of positive outcomes. The probabilities that result from filtering should always be compared back to the overall results of (.59, .06, .35) (see Table 5-8) to permit one to judge the effects of more rigorous methodological constraints. This suggests that studies that simply reported on SES without adjustment were the least likely to produce positive outcomes, studies with no report were the next most likely, and studies that adjusted for SES had the highest proportion of comparisons producing positive results.

The second method of applying the filter (the weak test of rigor) for the treatment of SES compares the probabilities when a report is either given or adjusted for to those when no report is offered. The combined probabilities for studies in which SES is reported or adjusted for are (.61, .05, .34), while those for studies with no report remain, as reported previously, (.57, .07, .37). A final filter compares the probabilities of the studies in which SES is adjusted for with those that either only report it or do not report it at all; here we compare (.72, .00, .28) to (.53, .08, .37) in what we call a strong test. In each case we compared the probability produced by the whole group to those of the filtered studies and conducted a test of whether the differences were significant. These differences were not significant. These findings indicate that to date, with this set of studies, there is no statistically significant difference in results when one reports or adjusts for SES. It appears that by adjusting for SES, one sees increases in positive results, and this result deserves closer examination should it prove to hold up over larger sets of studies.

We ran tests that report the impact of the filters on the number of studies, the percentage of studies, and the effects described as probabilities for each of the study categories: NSF-supported, and commercially generated with UCSMP included. We claim that when a pattern of probabilities of results does not change after filtering, one can have more confidence in that pattern. When the pattern of results changes, there is a need for an explanatory hypothesis, and that hypothesis can shed light on experimental design. We propose that this "filtering process" constitutes a test of the robustness of the outcome measures when subjected to increasing degrees of rigor.

Results of Filtering on Evaluations of NSF-Supported Curricula

For the NSF-supported curricular programs, 5 of the 15 filters produced a probability that differed significantly at the p < .1 level: treatment fidelity, specification of the control group, choice of the appropriate statistical unit, generalizability by ability, and generalizability based on disaggregation by subgroup. For each filter, there were from three to nine comparisons, as we examined how the probabilities of outcomes changed as tests became more stringent, across the categories of positive results, negative results, and results with no significant differences. Out of a total of 72 possible tests, only 11 produced a probability that differed significantly at the p < .1 level. With 85 percent of the comparisons showing no significant difference after filtering, we suggest the results of the studies were relatively robust in relation to these tests. At the same time, when rigor is increased for the five filters just listed, the results become generally more ambiguous and signal the need for further research with more careful designs.

Studies of Commercial Materials and the Filters

To ensure enough studies to conduct the analysis (n=17), our filtering analysis of the commercially generated studies included UCSMP (n=8). In this case, there were six filters that produced a probability that differed significantly at the p < .1 level. These were treatment fidelity, disaggregation by content, use of multiple tests, use of effect size, generalizability by ability, and generalizability by sample size. In this case, because there were no studies in some possible categories, there were a total of 57 comparisons, and 9 displayed significant differences in the probabilities after filtering at the p < .1 level. With 84 percent of the comparisons showing no significant difference after filtering, we suggest the results of the studies were relatively robust in relation to these tests. Table 5-9 shows the cases in which significant differences were recorded.

Impact of Treatment Fidelity on Probabilities

A few of these differences are worthy of comment. In the cases of both the NSF-supported and commercially generated curricula evaluation studies, studies that reported treatment fidelity differed significantly from those that did not. In the case of the studies of NSF-supported curricula, it appeared that a report or adjustment on treatment fidelity led to proportions with less positive effects and more results showing no significant differences. We hypothesize that this is partly because larger studies often do not examine actual classroom practices, but can obtain significance more easily due to large sample sizes.

In the studies of commercial materials, the presence or absence of measures of treatment fidelity worked differently. Studies reporting on or adjusting for treatment fidelity tended to have significantly higher probabilities in favor of experimental treatment, less positive effects in fewer of the comparative treatments, and more likelihood of results with no significant differences. We hypothesize, and confirm with a separate analysis, that this is because UCSMP frequently reported on treatment fidelity in their designs while study of Saxon typically did not, and the change represents the preponderance of these different curricular treatments in the studies of commercially generated materials.

Impact of Identification of Curricular Program on Probabilities

The significant differences reported under specificity of curricular comparison also merit discussion for studies of NSF-supported curricula. When the comparison group is not specified, a higher percentage of mean scores in favor of the experimental curricula is reported. In the studies of commercial materials, a failure to name specific curricular comparisons also produced a higher percentage of positive outcomes for the treatment, but the difference was not statistically significant. This suggests the possibility that when a specified curriculum is compared to an unspecified curriculum, reports of impact may be inflated. This finding may suggest that in studies of effectiveness, specifying comparative treatments would provide more rigorous tests of experimental approaches.

When studies of commercial materials disaggregate their results by content strand or use multiple measures, their reports of positive outcomes increase, their negative outcomes decrease, and in one case, the results show no significant differences. A significant difference was recorded in only one comparison within each of these filters.

TABLE 5-9 Cases of Significant Differences

Test                        Type of Comparison   Category Code               N    Probabilities Before Filter   p
Treatment fidelity          Simple compare       Specified                   21   .51, .02, .47*                *p=.049
                                                 Not specified               24   .68, .09, .23*
                                                 Adjusted for                1    .25, .00, .75
Treatment fidelity          Strong test          Adjusted for                22   .49*, .02, .49**              *p=.098
                                                 Reported or not specified   24   .68*, .09, .23**              **p=.019
Control group specified     Simple compare       Specified                   8    .33*, .00, .66**              *p=.033
                                                 Not specified               38   .65*, .07, .29**              **p=.008
Appropriate unit            Simple compare       Correct                     5    .30*, .40**, .30              *p=.069
of analysis                                      Incorrect                   41   .63*, .01**, .36              **p=.000
Generalizability            Simple compare       Limited                     5    .22*, .41**, .37              *p=.019
by ability                                       Not limited                 41   .64*, .01**, .35              **p=.000
Generalizability by         Simple compare       Limited                     28   .48*, .09, .43**              *p=.013
disaggregated subgroup                           Not limited                 18   .76*, .00, .24**              **p=.085
Treatment fidelity          Simple compare       Reported                    7    .53, .37*, .20                *p=.032
                                                 Not specified               9    .26, .67*, .11
                                                 Adjusted for                1    .45, .00*, .55
Treatment fidelity          Weak test            Adjusted for or reported    8    .52, .33, .25*                *p=.087
                                                 Not specified               9    .26, .67, .11*
Outcomes disaggregated      Simple compare       Reported                    11   .50, .37, .22*                *p=.052
by content strand                                Not reported                6    .17, .77, .10*
Outcomes using              Simple compare       Yes                         9    .55*, .35, .19                *p=.076
multiple tests                                   No                          8    .20*, .68, .20
Effect size reported        Simple compare       Yes                         3    .72, .05, .29*                *p=.029
                                                 No                          14   .31, .61, .16*
Generalization              Simple compare       Limited                     4    .23, .41*, .32                *p=.004
by ability                                       Not limited                 14   .42, .53, .09
Generalization by           Simple compare       Limited                     6    .57, .23, .27*                *p=.036
sample size                                      Not limited                 11   .28, .66, .10*

NOTE: In the comparisons shown, only the comparisons marked by an asterisk showed significant differences at p < .1. Probabilities are estimated for each significant difference.

Impact of Units of Analysis on Probabilities

For the evaluations of the NSF-supported materials, a significant difference was reported on the outcomes for the studies that used the correct unit of analysis compared to those that did not. The probabilities for those with the correct unit were (.30, .40, .30), compared to (.63, .01, .36) for those that used the incorrect unit. These results suggest that our prediction that using the correct unit of analysis would decrease the percentage of positive outcomes is likely to be correct. They also suggest that the most serious threat to the apparent conclusions of these studies comes from selecting an incorrect unit of analysis. Using the correct unit causes a decrease in favorable results, making the results more ambiguous, but never reverses the direction of the effect. This is a concern that merits major attention in the conduct of further studies. (It should be noted that of the five studies in which the correct unit of analysis was used, two were population studies of freshmen entering college, and these reported few results in favor of the experimental treatments. The high proportion of these studies involving college students may skew this particular result relative to the preponderance of other studies involving K-12 students.)

For the commercially generated studies, most of the ones coded with the correct unit of analysis were UCSMP studies. Because of the small number of studies involved, we could not break these out from the overall filtering of studies of commercial materials, but we report this issue to assist readers in interpreting the relative patterns of results.

Impact of Generalizability on Probabilities

Both types of studies yielded significant differences for some of the comparisons coded as restrictions to generalizability. Investigating these is important in order to understand the effects of these curricular programs on different subpopulations of students. In the case of the studies of commercially generated materials, significantly different results occurred in the categories of ability and sample size. In the studies of NSF-supported materials, the significant differences occurred in ability and disaggregation by subgroups.

In relation to generalizability, the studies of NSF-supported curricula reported significantly more positive results in favor of the treatment when they included all students. Because studies coded as "limited by ability" were restricted either by focusing only on higher achieving students or only on lower achieving students, we sorted these two groups. For higher performing students (n=3), the probabilities of effects were (.11, .67, .22). For lower performing students (n=2), the probabilities were (.39, .025, .59).

  


The first two comparisons are significantly different at p < .05. These findings are based on only a total of five studies, but they suggest that these programs may be serving the weaker ability students more effectively than the stronger ability students, serving both less well than they serve whole heterogeneous groups. For the studies of commercial materials, there were only three studies that were restricted to limited populations. The results for those three studies were (.23, .41, .32) and for all students (n=14) were (.42, .53, .09). These studies were significantly different at p = .004. All three studies included UCSMP, and one also included Saxon and was limited by serving primarily high-performing students. This means both categories of programs are showing weaker results when used with high-ability students.

Finally, for the studies of NSF-supported materials, 28 studies did not report results disaggregated by subgroup. A complete analysis of this set follows, but the studies that did not report results disaggregated by subgroup generated probabilities of (.48, .09, .43), whereas those that did disaggregate their results reported (.76, .00, .24). These gains in positive effects came with significant losses in reports of no significant differences. Studies of commercial materials also reported a small decrease in the likelihood of negative effects for the comparison program when disaggregation by subgroup was reported, offset by increases in positive results and results with no significant differences, although these comparisons were not significantly different. A further analysis of this topic follows.

Overall, these results suggest that increased rigor seems to lead in general to less strong outcomes, but never to reports of completely contrary results. These results also suggest that in recommending design considerations to evaluators, there should be careful attention to having evaluators include measures of treatment fidelity; considering the impact on all students as well as on particular subgroups; using the correct unit of analysis; and using multiple tests that are also disaggregated by content strand.

Further Analyses

We conducted four further analyses: (1) an analysis of the outcome probabilities by test type; (2) a content strand analysis; (3) an equity analysis; and (4) an analysis of the interactions of content and equity by grade band. Careful attention to the issues of content strand, equity, and their interaction is essential for the advancement of curricular evaluation. Content strand analysis provides the detail that is often lost by reporting overall scores; equity analysis can provide essential information on which subgroups are adequately served by the innovations; and analysis by content and grade level can shed light on the controversies that evolve over time.

Analysis by Test Type

Different studies used varied combinations of outcome measures. Because of the importance of outcome measures on test results, we chose to examine whether the probabilities for the studies changed significantly across different types of outcome measures (national tests, local tests). The most frequent uses of tests across all studies were a combination of national and local tests (n=18 studies), local tests only (n=16), and national tests only (n=17). Other combinations of tests were used by three studies or fewer. The percentages of various outcomes by test type in comparison to all studies are described in Table 5-10.

These data (Table 5-11) suggest that national tests tend to produce less positive results, with the gains falling into results showing no significant differences, suggesting that national tests demonstrate less curricular sensitivity and specificity.

TABLE 5-10 Percentage of Outcomes by Test Type

Test Type     National/Local         Local Only             National Only          All Studies
All studies   (.48, .18, .34) n=18   (.63, .03, .34) n=16   (.31, .05, .64) n=17   (.54, .07, .40) n=63

NOTE: The first number in each set of parentheses represents the percentage of outcomes that are positive, the second the percentage that are negative, and the third the percentage that are nonsignificant.

TABLE 5-11 Percentage of Outcomes by Test Type and Program Type

Test Type            National/Local         Local Only             National Only          All Studies
NSF effects          (.52, .15, .34) n=14   (.57, .03, .39) n=14   (.44, .00, .56) n=4    (.59, .06, .35) n=46
UCSMP effects        (.41, .18, .41) n=3    ***                    ***                    (.49, .09, .42) n=8
Commercial effects   **                     **                     (.29, .08, .63) n=8    (.29, .13, .59) n=9

NOTE: The first number in each set of parentheses represents the percentage of outcomes that are positive, the second the percentage that are negative, and the third the percentage that are nonsignificant.

TABLE 5-12 Number of Studies That Disaggregated by Content Strand

Program Type             Elementary   Middle   High School   Total
NSF-supported            14           6        9             29
Commercially generated   0            4        5             9

Content Strand

Curricular effectiveness is not an all-or-nothing proposition. A curriculum may be effective in some topics and less effective in others. For this reason, it is useful for evaluators to include an analysis of curricular strands and to report on the performance of students on those strands. To examine this issue, we conducted an analysis of the studies that reported their results by content strand. Thirty-eight studies did this; the breakdown is shown in Table 5-12 by type of curricular program and grade band.

To examine the evaluations of these content strands, we began by listing all of the content strands reported across studies, as well as the frequency of report by the number of studies at each grade band. These results are shown in Figure 5-11, which is broken down by content strand, grade level, and program type.

Although there are numerous content strands, some were reported on infrequently. To allow the analysis to focus on the key results from these studies, we separated out the most frequently reported strands, which we call the "major content strands." We defined these as strands that were examined in at least 10 percent of the studies; they are marked with an asterisk in Figure 5-11. When we conduct analyses across curricular program types or grade levels, we use these to facilitate comparisons.

A second phase of our analysis was to examine the performance of students by content strand in the treatment group in comparison to the control groups. Our analysis was conducted across the major content strands at the level of NSF-supported versus commercially generated, initially for all studies and then by grade band. Such analysis permitted some patterns to emerge that might prove helpful to future evaluators in considering the overall effectiveness of each approach. To do this, we coded the number of times any particular strand was measured across all studies that disaggregated by content strand, and then coded the proportion of times that this strand was reported as favoring the experimental treatment, favoring the comparative curricula, or showing no significant difference. These data are presented across the major content strands for the NSF-supported curricula (Figure 5-12) and the commercially generated curricula (Figure 5-13), except in the case of the elementary curricula where no data were available, in the form of percentages, with the frequencies listed in the bars.


FIGURE 5-11 Study counts for all content strands.


The presentation of results by strands must be accompanied by the same restrictions as stated previously. These results are based on studies identified as at least minimally methodologically adequate. The quality of the outcome measures in measuring the content strands has not been examined. Their results are coded in relation to the comparison group in the study and are indicated as statistically in favor of the program, as in favor of the comparative program, or as showing no significant differences. The results are combined across studies with no weighting by study size. Their results should be viewed as a means for the identification of topics for potential future study. It is completely possible that a refinement of methodologies may affect the future patterns of results, so the results are to be viewed as tentative and suggestive.


FIGURE 5-12 Major content strand result: All NSF (n=27).

According to these tentative results, future evaluations should examine whether the NSF-supported programs produce sufficient competency among students in the areas of algebraic manipulation and computation. In computation, approximately 40 percent of the results were in favor of the treatment group, no significant differences were reported in approximately 50 percent of the results, and results in favor of the comparison were revealed 10 percent of the time. Interpreting that final proportion of no significant difference is essential. Some would argue that because computation has not been emphasized, findings of no significant differences are acceptable. Others would suggest that such findings indicate weakness, because the development of the materials and accompanying professional development yielded no significant difference in key areas.


FIGURE 5-13 Major content strand result: All commercial (n=8).

The findings in Figure 5-13 from studies of commercially generated curricula show that mixed results are commonly reported. Thus, in evaluations of commercial materials, the lack of significant differences in computations/operations, word problems, and probability and statistics suggests that careful attention should be given to measuring these outcomes in future evaluations.

Overall, the grade band results for the NSF-supported programs, while consistent with the aggregated results, provide more detail. At the elementary level, evaluations of NSF-supported curricula (n=12) report better performance in mathematics concepts, geometry, and reasoning and problem solving, and some weaknesses in computation. No content strand analysis for commercially generated materials was possible. At the middle grades, evaluations (n=6) of NSF-supported curricula showed strength in measurement, geometry, and probability and statistics, and some weaknesses in computation. In the studies of commercial materials, evaluations (n=4) reported favorable results in reasoning and problem solving and some unfavorable results in algebraic procedures, contextual problems, and mathematics concepts. Finally, at the high school level, the evaluations (n=9) by content strand for the NSF-supported curricula showed strong favorable results in algebra concepts, reasoning/problem solving, word problems, probability and statistics, and measurement. Results in favor of the control were reported in 25 percent of the algebra procedures measures and 33 percent of the computation measures.

For the studies of commercial materials (n=4), only the geometry results favor the control group 25 percent of the time, with 50 percent having favorable results. Algebra concepts, reasoning, and probability and statistics also produced favorable results.

Equity Analysis of Comparative Studies

When the goal of providing a standards-based curriculum to all students was proposed, most people could recognize its merits: the replacement of dull, repetitive, largely dead-end courses with courses that would lead all students to be able, if desired and earned, to pursue careers in mathematics-reliant fields. It was clear that the NSF-supported projects, a stated goal of which was to provide standards-based courses to all students, called for curricula that would address the problem of too few students persisting in the study of mathematics. For example, as stated in the NSF Request for Proposals (RFP):

Rather than prematurely tracking students by curricular objectives, secondary school mathematics should provide for all students a common core of mainstream mathematics differentiated instructionally by level of abstraction and formalism, depth of treatment and pace (National Science Foundation, 1991, p. 1).

In the elementary level solicitation, a similar statement concerning all students was made (National Science Foundation, 1988, pp. 4-5).

Some, but not enough, attention has been paid to the education of students who fall below the average of the class. On the other hand, because above-average students sometimes do not receive a demanding education, it may be incorrectly assumed that they are easy to teach (National Science Foundation, 1989, p. 2).

Likewise, with increasing numbers of students in urban schools, and increased demographic diversity, the challenges of equity are equally significant for commercial publishers, who feel increasing pressures to demonstrate the effectiveness of their products in various contexts.

The problem was clearly identified: poorer performance by certain subgroups of students (non-Asian minorities, limited English proficient [LEP] students, and sometimes females) and a resulting lack of representation of such groups in mathematics-reliant fields. In addition, a secondary problem was acknowledged: highly talented American students were not being provided adequate challenge and stimulation in comparison with their international counterparts. We relied on the concept of equity in examining the evaluations. Equity was contrasted to equality, where one assumed all students should be treated exactly the same (Secada et al., 1995). Equity was defined as providing opportunities and eliminating barriers so that membership in a subgroup does not subject one to undue and systematically diminished possibility of success in pursuing mathematical study. Appropriate treatment therefore varies according to the needs of and obstacles facing any subgroup.

Applying the principles of equity to evaluate the progress of curricular programs is a conceptually thorny challenge. What is challenging is how to evaluate curricular programs on their progress toward equity in meeting the needs of a diverse student body. Consider how the following questions provide one with a variety of perspectives on the effectiveness of curricular reform regarding equity:

Does one expect all students to improve performance, thus raising the bar, but possibly not to decrease the gap between traditionally well-served and under-served students?

Does one focus on reducing the gap and devote less attention to overall gains, thus closing the gap but possibly not raising the bar?

Or, does one seek evidence that progress is made on both challenges—seeking progress for all students and arguably faster progress for those most at risk?

Evaluating each of the first two questions independently seems relatively straightforward. When one opts for a combination of the two, the potential for tension between them becomes more evident. For example, how can one distinguish the case in which the gap is closed because talented students are being underchallenged from the case in which the gap is closed because low-performing students improved at an increased rate? Many believe that nearly all mathematics curricula in this country are insufficiently challenging and rigorous. Therefore, achieving modest gains across all ability levels with evidence of accelerated progress by at-risk students may still be criticized for failing to stimulate the top-performing student group adequately. Evaluating curricula with regard to this aspect therefore requires judgment and careful methodological attention.

Depending on one’s view of equity, different implications for the collection of data follow. These considerations made examination of the quality of the evaluations as they treated questions of equity challenging for the committee members. Hence we spell out our assumptions as precisely as possible:

Evaluation studies should include representative samples of student demographics, which may require particular attention to the inclusion of underrepresented minority students from lower socioeconomic groups, females, and special needs populations (LEP, learning disabled, gifted and talented students) in the samples. This may require one to solicit participation by particular schools or districts, rather than to follow the patterns of commercial implementation, which may lead to an unrepresentative sample in aggregate.

Analysis of results should always consider the impact of the program on the entire spectrum of the sample to determine whether the overall gains are distributed fairly among differing student groups, and not achieved as improvements in the mean(s) of an identifiable subpopulation(s) alone.

Analysis should examine whether any group of students is systematically less well served by curricular implementation, causing losses or weakening the rate of gains. For example, this could occur if one neglected the continued development of programs for gifted and talented students in mathematics in order to implement programs focused on improving access for underserved youth, or if one improved programs solely for one group of language learners, ignoring the needs of others, or if one’s study systematically failed to report high attrition, affecting rates of participation, success, or failure.

Analyses should examine whether gaps in scores between significantly disadvantaged or underperforming subgroups and advantaged subgroups are decreasing both in relation to eliminating the development of gaps in the first place and in relation to accelerating improvement for underserved youth relative to their advantaged peers at the upper grades.
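As a minimal illustration of the last of these assumptions, the sketch below checks, from hypothetical subgroup means, whether the score gap between advantaged and underserved groups shrinks from pretest to posttest:

```python
# Hypothetical subgroup means; not data from any study reviewed here.
pre = {"advantaged": 52.0, "underserved": 44.0}
post = {"advantaged": 61.0, "underserved": 56.0}

gap_pre = pre["advantaged"] - pre["underserved"]     # 8.0
gap_post = post["advantaged"] - post["underserved"]  # 5.0
trend = "closing" if gap_post < gap_pre else "widening or static"
print(f"gap moved from {gap_pre} to {gap_post} ({trend})")
```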

In reviewing the outcomes of the studies, the committee reports first on what kinds of attention to these issues were apparent in the database, and second on what kinds of results were produced. Some of the studies used multiple methods to provide readers with information on these issues. In our report on the evaluations, we both provide descriptive information on the approaches used and summarize the results of those studies. Developing more effective methods to monitor the achievement of these objectives may need to go beyond what is reported in this study.

Among the 63 at least minimally methodologically adequate studies, 26 reported on the effects of their programs on subgroups of students.

TABLE 5-13 Most Common Subgroups Used in the Analyses and the Number of Studies That Reported on That Variable

Identified Subgroup | Number of Studies of NSF-Supported | Number of Studies of Commercially Generated | Total
Gender | 14 | 5 | 19
Race and ethnicity | 14 | 2 | 16
Socioeconomic status | 8 | 2 | 10
Achievement levels | 5 | 3 | 8
English as a second language (ESL) | 2 | 1 | 3
Total | 43 | 13 | 56

Achievement levels: Outcome data are reported in relation to categorizations by quartiles or by achievement level based on an independent test.

The other 37 reported on the effects of the curricular intervention on means of whole groups and their standard deviations, but did not report on their data in terms of the impact on subpopulations. Of those 26 evaluations, 19 studies were on NSF-supported programs and 7 were on commercially generated materials. Table 5-13 reports the most common subgroups used in the analyses and the number of studies that reported on each variable. Because many studies used multiple categories for disaggregation (ethnicity, SES, and gender), the number of reports is more than double the number of studies. For this reason, we report the study results in terms of the “frequency of reports on a particular subgroup” and distinguish this from what we refer to as “study counts.” The advantage of this approach is that it permits reporting on studies that investigated multiple ways to disaggregate their data. The disadvantage is that, in a sense, studies undertaking multiple disaggregations become overrepresented in the data set as a result. A similar distinction and approach were used in our treatment of disaggregation by content strands.
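The distinction between “study counts” and “frequency of reports” can be made concrete with a short sketch; the study records and field names below are assumptions for illustration only:

```python
# Hypothetical study records: each lists the subgroups it disaggregated by.
studies = [
    {"id": 1, "subgroups": ["gender", "race/ethnicity", "SES"]},
    {"id": 2, "subgroups": ["gender"]},
    {"id": 3, "subgroups": []},  # reported whole-group means only
]

# A study counts once if it disaggregated at all; each subgroup analysis
# contributes one report, so multi-subgroup studies yield multiple reports.
study_count = sum(1 for s in studies if s["subgroups"])   # 2 studies
report_count = sum(len(s["subgroups"]) for s in studies)  # 4 reports
print(study_count, report_count)
```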

It is apparent from these data that the evaluators of NSF-supported curricula documented more equity-based outcomes, as they reported 43 of the 56 comparisons. However, the same percentage of NSF-supported and commercially generated evaluations disaggregated their results by subgroup (41 percent in both cases). This is an area where evaluations of curricula could benefit greatly from standardization of expectation and methodology. Given the importance of the topic of equity, it should be standard practice to include such analyses in evaluation studies.

In summarizing these 26 studies, the first consideration was whether representative samples of students were evaluated. As we have learned from medical studies, if conclusions on effectiveness are drawn without careful attention to representativeness of the sample relative to the whole population, then the generalizations drawn from the results can be seriously flawed. In Chapter 2 we reported that across the studies, approximately 81 percent of the comparative studies and 73 percent of the case studies reported data on school location (urban, suburban, rural, or state/region), with suburban students being the largest percentage in both study types. The proportions of students studied indicated a tendency to undersample urban and rural populations and oversample suburban schools. With a high concentration of minorities and lower SES students in these areas, there are some concerns about the representativeness of the work.

A second consideration was to see whether the achievement effects of curricular interventions were achieved evenly among the various subgroups. Studies answered this question in different ways. Most commonly, evaluators reported on the performance of various subgroups in the treatment conditions as compared to those same subgroups in the comparative condition. They reported outcome scores or gains from pretest to posttest. We refer to these as “between” comparisons.

Other studies reported on the differences among subgroups within an experimental treatment, describing how well one group does in comparison with another group. Again, these reports were done in relation either to outcome measures or to gains from pretest to posttest. Often these reports contained a time element, reporting on how the internal achievement patterns changed over time as a curricular program was used. We refer to these as “within” comparisons.

Some studies reported both between and within comparisons. Others did not report findings by comparing mean scores or gains, but rather created regression equations that predicted the outcomes and examined whether demographic characteristics were related to performance. Six studies (all on NSF-supported curricula) used this approach with variables related to subpopulations. Twelve studies used ANCOVA or Multivariate Analysis of Variance (MANOVA) to study disaggregation by subgroup, and two reported on comparative effect sizes. Among the studies using statistical tests other than t-tests or Chi-squares, two were evaluations of commercially generated materials and the rest were of NSF-supported materials.
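As an illustration of the regression approach just described, here is a sketch using statsmodels on synthetic data. The variable names, coefficients, and model form are assumptions for illustration, not a reproduction of any reviewed study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic student-level data: pretest score, treatment indicator, and
# demographic indicators (all names and effect sizes are invented).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "pretest": rng.normal(50, 10, n),
    "treatment": rng.integers(0, 2, n),  # 1 = experimental curriculum
    "female": rng.integers(0, 2, n),
    "low_ses": rng.integers(0, 2, n),
})
df["posttest"] = 5 + 0.9 * df["pretest"] + 3 * df["treatment"] + rng.normal(0, 5, n)

# ANCOVA-style model: treatment effect adjusted for pretest, with
# demographic terms included to probe whether subgroup membership
# relates to performance.
model = smf.ols("posttest ~ pretest + treatment + female + low_ses", data=df).fit()
print(model.summary())
```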

Of the studies that reported on gender (n=19), the NSF-supported ones (n=13) reported five cases in which the females outperformed their counterparts in the controls and one case in which the female-male gap decreased within the experimental treatments across grades. In most cases, the studies present a mixed picture with some bright spots, with the majority showing no significant difference. One study reported significant improvements for African-American females.

In relation to race, 15 of 16 reports on African Americans showed positive effects in favor of the treatment group for NSF-supported curricula. Two studies reported decreases in the gaps between African Americans and whites or Asians. One of the two evaluations of African Americans’ performance reported for the commercially generated materials showed significant positive results, as mentioned previously.

For Hispanic students, 12 of 15 reports of the NSF-supported materials were significantly positive, with the other 3 showing no significant difference. One study reported a decrease in the gaps in favor of the experimental group. No evaluations of commercially generated materials were reported on Hispanic populations. Other reports on ethnic groups occurred too seldom to generalize.

Students from lower socioeconomic groups fared well, according to reported evaluations of NSF-supported materials (n=8), in that experimental groups outperformed control groups in all but one case. The one study of commercially generated materials that included SES as a variable reported no significant difference. For students with limited English proficiency, of the two evaluations of NSF-supported materials, one reported significantly more positive results for the experimental treatment. Likewise, one study of commercially generated materials yielded a positive result at the elementary level.

We also examined the data for ability differences and found reports by quartiles for a few evaluation studies. In these cases, the evaluations showed results across quartiles in favor of the NSF-supported materials. In one case using the same program, the lower quartiles showed the most improvement, and in the other, the gains were in the middle and upper groups for the Iowa Test of Basic Skills and evenly distributed for the informal assessment.

Summary Statements

After reviewing these studies, the committee observed that differences by gender, race, SES, and performance levels should be examined as a regular part of any review of effectiveness. We recommend that all comparative studies report on both “between” and “within” comparisons so that the audience of an evaluation can simply and easily consider the level of improvement, its distribution across subgroups, and the impact of curricular implementation on any gaps in performance. Each of the major categories—gender, race/ethnicity, SES, and achievement level—contributes a significant and contrasting view of curricular impact. Furthermore, more sophisticated accounts would begin to permit finer distinctions to emerge across studies, such as the effect of a program on young African-American women or on first-generation Asian students.

In addition, the committee encourages further study and deliberation on the use of more complex approaches to the examination of equity issues. This is particularly important because of the overlaps among these categories, where poverty can show itself as its own variable but also may be highly correlated with prior performance. Hence, the use of one variable can mask differences that should be more directly attributable to another. The committee recommends that a group of measurement and equity specialists confer on the most effective designs to advance work on these questions.

Finally, it is imperative that evaluation studies systematically include demographically representative student populations and distinguish evaluations that follow the commercial patterns of use from those that seek to establish effectiveness with a diverse student population. Along these lines, it is also important that studies report the impact data on all substantial ethnic groups, including whites. Many studies, perhaps because whites were the majority population, failed to report on this ethnic group in their analyses. As we saw in one study, in which Asian students were from poor homes and first generation, any subgroup can be an at-risk population in some settings, and gains in means cannot be assumed to translate into gains for all subgroups, or even for the majority subgroup. More complete and thorough descriptions of the characteristics and configurations of the subgroups being served at any location—with careful attention to interactions—are needed in evaluations.

Interactions Among Content and Equity, by Grade Band

By examining disaggregation by content strand by grade levels, along with disaggregation by diverse subpopulations, the committee began to discover grade band patterns of performance that should be useful in the conduct of future evaluations. Examining each of these issues in isolation can mask some of the overall effects of curricular use. Two examples of such analysis are provided. The first example examines all the evaluations of NSF-supported curricula at the elementary level. The second examines the set of evaluations of NSF-supported curricula at the high school level; a parallel analysis cannot be carried out on evaluations of commercially generated programs because they lack disaggregation by student subgroup.

Example One

At the elementary level, the review of evaluations of data on the effectiveness of NSF-supported curricula reports consistent patterns of benefits to students. Across the studies, it appears that positive results are enhanced when accompanied by adequate professional development and the use of pedagogical methods consistent with those indicated by the curricula. The benefits are most consistently evidenced in the broadening topics of geometry, measurement, probability, and statistics, and in applied problem solving and reasoning. It is important to consider whether the outcome measures in these areas demonstrate a depth of understanding. In early understanding of fractions and algebra, there is some evidence of improvement. Weaknesses are sometimes reported in the areas of computational skills, especially in the routinization of multiplication and division. These assertions are tentative due to possible flaws in designs but quite consistent across studies, and future evaluations should seek to replicate, modify, or discredit these results.

The way to most efficiently and effectively link informal reasoning and formal algorithms and procedures is an open question. Further research is needed to determine how to most effectively link the gains and flexibility associated with student-generated reasoning to the automaticity and generalizability often associated with mastery of standard algorithms.

The data from these evaluations at the elementary level generally present credible evidence of increased success in engaging minority students and students in poverty, based on reported gains that are modestly higher for these students than for the comparative groups. Less well documented in the studies is the extent to which the curricula counteract the tendency for gaps in performance by gender and minority group membership to emerge and persist as students move up the grades. However, the evaluations do indicate that these curricula can help, and almost never do harm. Finally, on the question of adequate challenge for advanced and talented students, the data are equivocal. More attention to this issue is needed.

Example Two

The data at the high school level produced the most conflicting results, and in conducting future evaluations, evaluators will need to examine this level more closely. We identify the high school as the crucible for curricular change for three reasons: (1) the transition to postsecondary education puts considerable pressure on these curricula; (2) the criteria outlined in the NSF RFP specify significant changes from traditional practice; and (3) high school freshmen arrive from a myriad of middle school curricular experiences. For the NSF-supported curricula, the RFP required that the programs provide a core curriculum “drawn from statistics/probability, algebra/functions, geometry/trigonometry, and discrete mathematics” (NSF, 1991, p. 2) and use “a full range of tools, including graphing calculators and computers” (NSF, 1991, p. 2). The NSF RFP also specified the inclusion of “situations from the natural and social sciences and from other parts of the school curriculum as contexts for developing and using mathematics” (NSF, 1991, p. 1). It was during the fourth year that “course options should focus on special mathematical needs of individual students, accommodating not only the curricular demands of the college-bound but also specialized applications supportive of the workplace aspirations of employment-bound students” (NSF, 1991, p. 2). Because this set of requirements comprises a significant departure from conventional practice, the implementation of the high school curricula should be studied in particular detail.

We report on a Systemic Initiative for Montana Mathematics and Science (SIMMS) study by Souhrada (2001) and Brown et al. (1990), in which students were permitted to select traditional, reform, and mixed tracks. It became apparent that the students were quite aware of the choices they faced, as illustrated in the following quote:

The advantage of the traditional courses is that you learn—just math. It’s not applied. You get a lot of math. You may not know where to use it, but you learn a lot…. An advantage in SIMMS is that the kids in SIMMS tell me that they really understand the math. They understand where it comes from and where it is used.

This quote succinctly captures the tensions reported as experienced by students. It suggests that student perceptions are an important source of evidence in conducting evaluations. As we examined these curricular evaluations across the grades, we paid particular attention to the specificity of the outcome measures in relation to curricular objectives. Overall, a review of these studies would lead one to draw the following tentative summary conclusions:

There is some evidence of discontinuity in the articulation between high school and college, resulting from the organization and emphasis of the new curricula. This discontinuity can emerge in scores on college admission tests, placement tests, and first-semester grades, where nonreform students have shown some advantage on typical college achievement measures.

The most significant areas of disadvantage seem to be in students’ facility with algebraic manipulation, and with formalization, mathematical structure, and proof when isolated from context and denied technological supports. There is some evidence of weakness in computation and numeration, perhaps due to reliance on calculators and varied policies regarding their use at colleges (Kahan, 1999; Huntley et al., 2000).

There is also consistent evidence that the new curricula present strengths in solving applied problems, the use of technology, and new areas of content development such as probability and statistics, functions-based reasoning in the use of graphs, using data in tables, and producing equations to describe situations (Huntley et al., 2000; Hirsch and Schoen, 2002).

Early performance on standard outcome measures at the high school level has shown equivalent or better performance by reform students (Austin et al., 1997; Merlino and Wolff, 2001). However, the common standardized outcome measures (Preliminary Scholastic Assessment Test [PSAT] scores or national tests) are too imprecise to permit more specific comparisons between the NSF-supported and comparison approaches, while program-generated measures lack evidence of external validity and objectivity. There is an urgent need for a set of measures that would provide detailed information on specific concepts and conceptual development over time; such measures may need to be used as embedded as well as summative assessment tools to provide precise enough data on curricular effectiveness.

The data also report some progress in strengthening the performance of underrepresented groups in mathematics relative to their counterparts in the comparative programs (Schoen et al., 1998; Hirsch and Schoen, 2002).

This reported pattern of results should be viewed as very tentative, as there are only a few studies in each of these areas, and most do not adequately control for competing factors, such as the nature of the course received in college. Difficulties in the transition may also be the result of a lack of alignment of measures, especially as placement exams often emphasize algebraic proficiencies. These results are presented only for the purpose of stimulating further evaluation efforts. They further emphasize the need to be certain that such designs examine the level of mathematical reasoning of students, particularly in relation to their knowledge and understanding of the role of proofs and definitions and their facility with algebraic manipulation, as well as carefully document the competencies taught in the curricular materials. In our framework, gauging the ease of transition to college study is an issue of examining curricular alignment with systemic factors, and it needs to be considered along with tests that demonstrate the curricular validity of measures. Furthermore, the results raising concerns about college success need replication before secure conclusions are drawn.

Also, it is important that subsequent evaluations examine curricular effects on students’ interest in mathematics and willingness to persist in its study. Walker (1999) reported that there may be some systematic differences in these behaviors among different curricula and that interest and persistence may help students across a variety of subgroups to survive entry-level hurdles, especially if technical facility with symbol manipulation can be improved. In the context of declines in advanced study in mathematics by American students (Hawkins, 2003), evaluation of curricular impact on students’ interest, beliefs, persistence, and success is needed.

The committee takes the position that ultimately the question of the impact of different curricula on performance at the collegiate level should be resolved by whether students are adequately prepared to pursue careers in mathematical sciences, broadly defined, and to reason quantitatively about societal and technological issues. It would be a mistake to focus evaluation efforts solely or primarily on performance on entry-level courses, which can clearly function as filters and may overly emphasize procedural competence, but do not necessarily represent what concepts and skills lead to excellence and success in the field.

These tentative patterns of findings indicate that at the high school level, it is necessary to conduct individual evaluations that examine the transition to college carefully in order to gauge the level of success in preparing students for college entry and the successful negotiation of majors. Equally, it is imperative to examine the impact of high school curricula on other possible student trajectories, such as obtaining high school diplomas, moving into the world of work, or proceeding through transitional programs leading to technical training, two-year colleges, and so on.

These two analyses of programs by grade-level band, content strand, and equity represent a methodological innovation that could strengthen the empirical database on curricula significantly and provide the level of detail really needed by curriculum designers to improve their programs. In addition, it appears that one could characterize the NSF programs (and not the commercial programs as a group) as representing a particular approach to curriculum, as discussed in Chapter 3 . It is an approach that integrates content strands; relies heavily on the use of situations, applications, and modeling; encourages the use of technology; and has a significant dose of mathematical inquiry. One could ask the question of whether this approach as a whole is “effective.” It is beyond the charge and scope of this report, but is a worthy target of investigation if one uses proper care in design, execution, and analysis. Likewise other approaches to curricular change should be investigated at the aggregate level, using careful and rigorous design.

The committee believes that a diversity of curricular approaches is a strength in an educational system that maintains local and state control of curricular decision making. While “scientifically established as effective” should be an increasingly important consideration in curricular choice, local cultural differences, needs, values, and goals will also properly influence curricular choice. A diverse set of effective curricula would be ideal. Finally, the committee emphasizes once again the importance of basing the studies on measures with established curricular validity and avoiding corruption of indicators as a result of inappropriate amounts of teaching to the test, so as to be certain that the outcomes are the product of genuine student learning.

CONCLUSIONS FROM THE COMPARATIVE STUDIES

In summary, the committee reviewed a total of 95 comparative studies. There were more NSF-supported program evaluations than commercial ones, and the commercial ones were primarily on Saxon or UCSMP materials. Of the 19 curricular programs reviewed, 23 percent of the NSF-supported and 33 percent of the commercially generated materials had no comparative reviews. This finding is particularly disturbing in light of the legislative mandate in No Child Left Behind (U.S. Department of Education, 2001) for scientifically based curricular programs and materials to be used in the schools. It suggests that more explicit protocols for the evaluation of programs, including comparative studies, need to be required and utilized.

Sixty-nine percent of NSF-supported and 61 percent of commercially generated program evaluations met basic conditions to be classified as at least minimally methodologically adequate studies for the evaluation of effectiveness. These studies met the criteria of including measures of student outcomes on mathematical achievement, reporting a method of establishing comparability among samples, and reporting on implementation elements, disaggregating by content strand, or using precise, theoretical analyses of the construct or multiple measures.

Most of these studies had both strengths and weaknesses in their quasi-experimental designs. The committee reviewed the studies and found that evaluators had developed a number of features that merit inclusion in future work. At the same time, many had internal threats to validity that suggest a need for clearer guidelines for the conduct of comparative evaluations.

Many of the strengths and innovations came from the evaluators’ understanding of the program theories behind the curricula, their knowledge of the complexity of practice, and their commitment to measuring valid and significant mathematical ideas. Many of the weaknesses came from inadequate attention to experimental design, insufficient evidence of the independence of evaluators in some studies, and instability and lack of cooperation in interfacing with the conditions of everyday practice.

The committee identified 10 elements of comparative studies needed to establish a basis for determining the effectiveness of a curriculum. We recognize that not all studies will be able to implement all elements successfully, and that experimental design variations will be based largely on study size and location. The list of elements begins with the seven elements corresponding to the seven critical decisions and adds three additional elements that emerged as a result of our review:

A better balance needs to be achieved between experimental and quasi-experimental studies. The virtual absence of large-scale experimental studies does not provide a way to determine whether the use of quasi-experimental approaches is being systematically biased in unseen ways.

If a quasi-experimental design is selected, it is necessary to establish comparability. When quasi-experimentation is used, it “pertains to studies in which the model to describe effects of secondary variables is not known but assumed” (NRC, 1992, p. 18). This will lead to weaker and potentially suspect causal claims, which should be acknowledged in the evaluation report, but may be necessary in relation to feasibility (Joint Committee on Standards for Educational Evaluation, 1994). In general, to date, studies have assumed that prior achievement measures, ethnicity, gender, and SES are acceptable variables on which to match samples or on which to make statistical adjustments. But there are often other variables in need of such control in such evaluations, including opportunity to learn, teacher effectiveness, and implementation (see the fourth element below).

The selection of a unit of analysis is of critical importance to the design. To the extent possible, it is useful to randomly assign the unit for the different curricula. The number of units of analysis necessary for the study to establish statistical significance depends not on the number of students, but on this unit of analysis. It appears that classrooms and schools are the most likely units of analysis. In addition, increasingly sophisticated means of conducting studies are needed that recognize that the level of the educational system in which experimentation occurs affects research designs. (A sketch following this list illustrates analysis on classroom means.)

It is essential to examine the implementation components through a set of variables that include the extent to which the materials are implemented, teaching methods, the use of supplemental materials, professional development resources, teacher background variables, and teacher effects. Gathering these data to gauge the level of implementation fidelity is essential for evaluators to ensure adequate implementation. Studies could also include nested designs to support analysis of variation by implementation components.

Outcome data should include a variety of measures of the highest quality. These measures should vary by question type (open ended, multiple choice), by type of test (international, national, local), and by relation of testing to everyday practice (formative, summative, high stakes), and ensure curricular validity of measures and assess curricular alignment with systemic factors. The use of comparisons among total tests, fair tests, and conservative tests, as done in the evaluations of UCSMP, permits one to gain insight into teacher effects and to contrast test results by items included. Tests should also include content strands to aid disaggregation, at a level of major content strands (see Figure 5-11) and content-specific items relevant to the experimental curricula.

Statistical analysis should be conducted on the appropriate unit of analysis and should include more sophisticated methods of analysis, such as ANOVA, ANCOVA, MANCOVA, linear regression, and multiple regression analysis, as appropriate.

Reports should include clear statements of the limitations to generalization of the study. These should include indications of limitations in populations sampled, sample size, unique population inclusions or exclusions, and levels of use or attrition. Data should also be disaggregated by gender, race/ethnicity, SES, and performance levels to permit readers to see comparative gains across subgroups both between and within studies.

It is useful to report effect sizes. It is also useful to present item-level data across treatment programs and show when performances between the two groups are within the 10 percent confidence interval of each other. These two extremes document how crucial it is for curriculum developers to garner both precise and generalizable information to inform their revisions. (The sketch following this list shows one common effect size computation.)

Careful attention should also be given to the selection of samples of populations for participation. These samples should be representative of the populations to whom one wants to generalize the results. Studies should be clear if they are generalizing to groups who have already selected the materials (prior users) or to populations who might be interested in using the materials (demographically representative).

The control group should use an identified comparative curriculum or curricula to avoid comparisons to unstructured instruction.
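To make the unit-of-analysis and effect-size elements concrete (as referenced in those items above), here is a minimal sketch on synthetic classroom data. It collapses student scores to classroom means, tests the difference at that randomized unit, and reports a standard Cohen's d with a pooled SD. All names and numbers are illustrative, not from the report.

```python
import statistics
from scipy import stats

# Student scores keyed by (classroom, condition); classrooms are the
# randomized units, so analysis uses one mean per classroom.
classrooms = {
    ("A", "reform"): [72, 68, 75], ("B", "reform"): [71, 74, 70],
    ("C", "control"): [66, 70, 63], ("D", "control"): [64, 69, 65],
}
means = {"reform": [], "control": []}
for (_room, condition), scores in classrooms.items():
    means[condition].append(statistics.mean(scores))

# Significance test on classroom means, the appropriate unit of analysis.
t, p = stats.ttest_ind(means["reform"], means["control"])

# Effect size: standardized mean difference using the pooled SD.
def cohens_d(a, b):
    sa, sb = statistics.stdev(a), statistics.stdev(b)
    pooled = (((len(a) - 1) * sa**2 + (len(b) - 1) * sb**2)
              / (len(a) + len(b) - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

print(f"t={t:.2f}, p={p:.3f}, d={cohens_d(means['reform'], means['control']):.2f}")
```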

In addition to these prototypical decisions to be made in the conduct of comparative studies, the committee suggests that it would be ideal for future studies to consider some of the overall effects of these curricula and to test more directly and rigorously some of the findings and alternative hypotheses. Toward this end, the committee reported the tentative findings of these studies by program type. Although these results are subject to revision, based on the potential weaknesses in design of many of the studies summarized, the form of analysis demonstrated in this chapter provides clear guidance about the kinds of knowledge claims and the level of detail that we need to be able to judge effectiveness. Until we are able to achieve an array of comparative studies that provide valid and reliable information on these issues, we will be vulnerable to decision making based excessively on opinion, limited experience, and preconceptions.

This book reviews the evaluation research literature that has accumulated around 19 K-12 mathematics curricula and breaks new ground in framing an ambitious and rigorous approach to curriculum evaluation that has relevance beyond mathematics. The committee that produced this book consisted of mathematicians, mathematics educators, and methodologists who began with the following charge:

  • Evaluate the quality of the evaluations of the thirteen National Science Foundation (NSF)-supported and six commercially generated mathematics curriculum materials;
  • Determine whether the available data are sufficient for evaluating the efficacy of these materials, and, if not,
  • Develop recommendations about the design of a project that could result in the generation of more reliable and valid data for evaluating such materials.

The committee collected, reviewed, and classified almost 700 studies, solicited expert testimony during two workshops, developed an evaluation framework, established dimensions/criteria for three methodologies (content analyses, comparative studies, and case studies), drew conclusions on the corpus of studies, and made recommendations for future research.


Structure of comparative research questions

There are five steps required to construct a comparative research question: (1) choose your starting phrase; (2) identify and name the dependent variable; (3) identify the groups you are interested in; (4) identify the appropriate adjoining text; and (5) write out the comparative research question. Each of these steps is discussed in turn:

Choose your starting phrase

Identify and name the dependent variable

Identify the groups you are interested in

Identify the appropriate adjoining text

Write out the comparative research question

FIRST Choose your starting phrase

Comparative research questions typically start with one of two phrases:

Number of dependent variables Starting phrase
One What is the difference in?
Two or more What are the differences in?

These starting phrases appear at the beginning of the examples below:

What is the difference in the daily calorific intake of American men and women?

What is the difference in the weekly photo uploads on Facebook between British male and female university students?

What are the differences in perceptions towards Internet banking security between adolescents and pensioners?

What are the differences in attitudes towards music piracy when pirated music is freely distributed or purchased?

SECOND Identify and name the dependent variable

All comparative research questions have a dependent variable . You need to identify what this is. However, how the dependent variable is written out in a research question and what you call it are often two different things. The examples below list the name of the dependent variable and show how it would be written out in the question.

Name of the dependent variable | How the dependent variable is written out
Daily calorific intake | What is the difference in the daily calorific intake of American men and women?
Perceptions towards Internet banking security | What are the differences in perceptions towards Internet banking security between adolescents and pensioners?
Attitudes towards music piracy | What are the differences in attitudes towards music piracy when pirated music is freely distributed or purchased?
Weekly Facebook photo uploads | What is the difference in the weekly photo uploads on Facebook between British male and female university students?

These examples highlight that the formal name of the dependent variable and the way it is written out in the research question often differ; “daily calorific intake,” for instance, appears inside a longer phrase in the question itself.

THIRD Identify the groups you are interested in

All comparative research questions have at least two groups . You need to identify these groups. In the examples below, the groups appear at the end of each question.

What is the difference in the daily calorific intake of American men and women?

What is the difference in the weekly photo uploads on Facebook between British male and female university students?

What are the differences in perceptions towards Internet banking security between adolescents and pensioners?

What are the differences in attitudes towards music piracy when pirated music is freely distributed or purchased?

It is often easy to identify groups because they reflect different types of people (e.g., men and women, adolescents and pensioners), as highlighted by the first three examples. However, sometimes the two groups you are interested in reflect two different conditions, as highlighted by the final example. In this final example, the two conditions (i.e., groups) are pirated music that is freely distributed and pirated music that is purchased. So we are interested in how attitudes towards music piracy differ when pirated music is freely distributed as opposed to when it is purchased.

FOURTH Identify the appropriate adjoining text

Before you write out the groups you are interested in comparing, you typically need to include some adjoining text. Typically, this adjoining text includes the words between or amongst , but other words may be more appropriate, as the examples above show (e.g., of, when).

FIFTH Write out the comparative research question

Once you have these details - (1) the starting phrase, (2) the name of the dependent variable, (3) the name of the groups you are interested in comparing, and (4) any potential adjoining words - you can write out the comparative research question in full, as in the four example questions above.
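As a toy illustration of this assembly, the sketch below builds a comparative research question from its components; the function and its inputs are hypothetical:

```python
def comparative_question(starting_phrase, dependent_variable, adjoining, groups):
    """Assemble a comparative research question from its four components."""
    return f"{starting_phrase} {dependent_variable} {adjoining} {' and '.join(groups)}?"

print(comparative_question(
    "What is the difference in",
    "the daily calorific intake",
    "of",
    ["American men", "women"],
))
# -> What is the difference in the daily calorific intake of American men and women?
```

The same assembly extends naturally to the relationship-based questions discussed next, with the independent variable(s) inserted after the starting phrase.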

In the section that follows, the structure of relationship-based research questions is discussed.

Structure of relationship-based research questions

There are six steps required to construct a relationship-based research question: (1) choose your starting phrase; (2) identify the independent variable(s); (3) identify the dependent variable(s); (4) identify the group(s); (5) identify the appropriate adjoining text; and (6) write out the relationship-based research question. Each of these steps is discussed in turn.

Identify the independent variable(s)

Identify the dependent variable(s)

Identify the group(s)

Write out the relationship-based research question

Relationship-based research questions typically start with one of two phrases:

Number of variables Starting phrase
Two What is the relationship between?
Three or more What are the relationships of?

What is the relationship between gender and attitudes towards music piracy amongst adolescents?

What is the relationship between study time and exam scores amongst university students?

What is the relationship of career prospects, salary and benefits, and physical working conditions on job satisfaction between managers and non-managers?

SECOND Name the independent variable(s)

All relationship-based research questions have at least one independent variable . You need to identify what this is. In the example that follows, the independent variables are career prospects, salary and benefits, and physical working conditions.

What is the relationship of career prospects, salary and benefits, and physical working conditions on job satisfaction between managers and non-managers?

When doing a dissertation at the undergraduate and master's level, it is likely that your research question will only have one or two independent variables, but this is not always the case.

THIRD Name the dependent variable(s)

All relationship-based research questions also have at least one dependent variable . You also need to identify what this is. At the undergraduate and master's level, it is likely that your research question will only have one dependent variable (e.g., attitudes towards music piracy, exam scores, or job satisfaction in the examples used here).

FOURTH Name the group(s)

All relationship-based research questions have at least one group , but can have multiple groups . You need to identify this group or groups. In the examples below, the groups appear at the end of each question.

What is the relationship between gender and attitudes towards music piracy amongst adolescents?

What is the relationship between study time and exam scores amongst university students?

What is the relationship of career prospects, salary and benefits, and physical working conditions on job satisfaction between managers and non-managers?

FIFTH Identify the appropriate adjoining text

Before you write out the groups you are interested in comparing, you typically need to include some adjoining text (i.e., usually the words between or amongst):

Number of groups Adjoining text
One amongst [e.g., group 1]
Two or more between, of [e.g., group 1 and group 2]

These adjoining words are illustrated in the example questions above.

SIXTH Write out the relationship-based research question

Once you have these details - (1) the starting phrase, (2) the name of the independent variable(s), (3) the name of the dependent variable, (4) the name of the group(s) you are interested in, and (5) any potential adjoining words - you can write out the relationship-based research question in full, as in the example questions above.

STEP FOUR Write out the problem or issues you are trying to address in the form of a complete research question

In the previous section, we illustrated how to write out the three types of research question (i.e., descriptive, comparative and relationship-based research questions). Whilst these rules should help you when writing out your research question(s), the main thing you should keep in mind is whether your research question(s) flow and are easy to read .

What is comparative analysis? A complete guide


Comparative analysis is a valuable tool for acquiring deep insights into your organization’s processes, products, and services so you can continuously improve them. 

Similarly, if you want to streamline, price appropriately, and ultimately be a market leader, you’ll likely need to draw on comparative analyses quite often.

When faced with multiple options or solutions to a given problem, a thorough comparative analysis can help you compare and contrast your options and make a clear, informed decision.

If you want to get up to speed on conducting a comparative analysis or need a refresher, here’s your guide.


What exactly is comparative analysis?

A comparative analysis is a side-by-side comparison that systematically compares two or more things to pinpoint their similarities and differences. The focus of the investigation might be conceptual—a particular problem, idea, or theory—or perhaps something more tangible, like two different data sets.

For instance, you could use comparative analysis to investigate how your product features measure up to the competition.

After a successful comparative analysis, you should be able to identify strengths and weaknesses and clearly understand which product is more effective.

You could also use comparative analysis to examine different methods of producing that product and determine which way is most efficient and profitable.

The potential applications for using comparative analysis in everyday business are almost unlimited. That said, a comparative analysis is most commonly used to examine:

Emerging trends and opportunities (new technologies, marketing)

Competitor strategies

Financial health

Effects of trends on a target audience


Why is comparative analysis so important?

Comparative analysis can help narrow your focus so your business pursues the most meaningful opportunities rather than attempting dozens of improvements simultaneously.

A comparative approach also helps frame up data to illuminate interrelationships. For example, comparative research might reveal nuanced relationships or critical contexts behind specific processes or dependencies that wouldn’t be well-understood without the research.

For instance, if your business compares the cost of producing several existing products relative to which ones have historically sold well, that should provide helpful information once you’re ready to look at developing new products or features.

Comparative vs. competitive analysis—what’s the difference?

Comparative analysis is generally divided into three subtypes, which use quantitative or qualitative data and then extend the findings to a larger group. These include:

Pattern analysis —identifying patterns or recurrences of trends and behavior across large data sets.

Data filtering —analyzing large data sets to extract an underlying subset of information. It may involve rearranging, excluding, and apportioning comparative data to fit different criteria. 

Decision tree —flowcharting to visually map and assess potential outcomes, costs, and consequences.

In contrast, competitive analysis is a type of comparative analysis in which you deeply research one or more of your industry competitors. In this case, you’re using qualitative research to explore what the competition is up to across one or more dimensions.

For example

Service delivery —metrics like Net Promoter Score indicate customer satisfaction levels.

Market position — the share of the market that the competition has captured.

Brand reputation —how well-known or recognized your competitors are within their target market.

Tips for optimizing your comparative analysis

Conduct original research

Thorough, independent research is a significant asset when doing comparative analysis. It provides evidence to support your findings and may present a perspective or angle not considered previously. 

Make analysis routine

To get the maximum benefit from comparative research, make it a regular practice, and establish a cadence you can realistically stick to. Some business areas you could plan to analyze regularly include:

Profitability

Competition

Experiment with controlled and uncontrolled variables

In addition to simply comparing and contrasting, explore how different variables might affect your outcomes.

For example, a controllable variable would be offering a seasonal feature like a shopping bot to assist in holiday shopping or raising or lowering the selling price of a product.

Uncontrollable variables include weather, changing regulations, the current political climate, or global pandemics.

Put equal effort into each point of comparison

Most people enter into comparative research with a particular idea or hypothesis already in mind to validate. For instance, you might try to prove the worthwhileness of launching a new service. So, you may be disappointed if your analysis results don’t support your plan.

However, in any comparative analysis, try to maintain an unbiased approach by spending equal time debating the merits and drawbacks of any decision. Ultimately, this will be a practical, more long-term sustainable approach for your business than focusing only on the evidence that favors pursuing your argument or strategy.

Writing a comparative analysis in five steps

To put together a coherent, insightful analysis that goes beyond a list of pros and cons or similarities and differences, try organizing the information into these five components:

1. Frame of reference

Here is where you provide context. First, what driving idea or problem is your research anchored in? Then, for added substance, cite existing research or insights from a subject matter expert, such as a thought leader in marketing, startup growth, or investment.

2. Grounds for comparison

Why have you chosen to examine the two things you’re analyzing instead of focusing on two entirely different things? What are you hoping to accomplish?

3. Thesis

What argument or choice are you advocating for? What will be the before and after effects of going with either decision? What do you anticipate happening with and without this approach?

For example, “If we release an AI feature for our shopping cart, we will have an edge over the rest of the market before the holiday season.” The finished comparative analysis will weigh all the pros and cons of choosing to build the new, expensive AI feature, including variables like how “intelligent” it will be, what it “pushes” customers to use, and how much work it takes off the plates of customer service, etc.

Ultimately, you will gauge whether building an AI feature is the right plan for your e-commerce shop.

4. Organize the scheme

Typically, there are two ways to organize a comparative analysis report. First, you can discuss everything about comparison point “A” and then go into everything about aspect “B.” Or, you alternate back and forth between points “A” and “B,” sometimes referred to as point-by-point analysis.

Using the AI feature as an example again, you could first cover all the pros and cons of building the AI feature, then discuss the benefits and drawbacks of maintaining it. Or you could compare and contrast each aspect of the AI feature one at a time: for example, a side-by-side comparison of shopping with the AI feature versus shopping without it, before proceeding to another point of differentiation.

5. Connect the dots

Tie it all together in a way that either confirms or disproves your hypothesis.

For instance, “Building the AI bot would allow our customer service team to save 12% on returns in Q3 while offering optimizations and savings in future strategies. However, it would also increase the product development budget by 43% in both Q1 and Q2. Our budget for product development won’t increase again until series 3 of funding is reached, so despite its potential, we will hold off building the bot until funding is secured and more opportunities and benefits can be proved effective.”


J Korean Med Sci. 2022 Apr 25;37(16).

A Practical Guide to Writing Quantitative and Qualitative Research Questions and Hypotheses in Scholarly Articles

Edward Barroga 1 and Glafera Janet Matanguihan 2

1 Department of General Education, Graduate School of Nursing Science, St. Luke’s International University, Tokyo, Japan.

2 Department of Biological Sciences, Messiah University, Mechanicsburg, PA, USA.

The development of research questions and the subsequent hypotheses are prerequisites to defining the main research purpose and specific objectives of a study. Consequently, these objectives determine the study design and research outcome. The development of research questions is a process based on knowledge of current trends, cutting-edge studies, and technological advances in the research field. Excellent research questions are focused and require a comprehensive literature search and in-depth understanding of the problem being investigated. Initially, research questions may be written as descriptive questions which could be developed into inferential questions. These questions must be specific and concise to provide a clear foundation for developing hypotheses. Hypotheses are more formal predictions about the research outcomes. These specify the possible results that may or may not be expected regarding the relationship between groups. Thus, research questions and hypotheses clarify the main purpose and specific objectives of the study, which in turn dictate the design of the study, its direction, and outcome. Studies developed from good research questions and hypotheses will have trustworthy outcomes with wide-ranging social and health implications.

INTRODUCTION

Scientific research is usually initiated by posing evidence-based research questions which are then explicitly restated as hypotheses. 1 , 2 The hypotheses provide directions to guide the study, solutions, explanations, and expected results. 3 , 4 Both research questions and hypotheses are essentially formulated based on conventional theories and real-world processes, which allow the inception of novel studies and the ethical testing of ideas. 5 , 6

It is crucial to have knowledge of both quantitative and qualitative research 2 as both types of research involve writing research questions and hypotheses. 7 However, these crucial elements of research are sometimes overlooked; if not overlooked, they are framed without the forethought and meticulous attention they need. Planning and careful consideration are needed when developing quantitative or qualitative research, particularly when conceptualizing research questions and hypotheses. 4

There is a continuing need to support researchers in the creation of innovative research questions and hypotheses, as well as for journal articles that carefully review these elements. 1 When research questions and hypotheses are not carefully thought of, unethical studies and poor outcomes usually ensue. Carefully formulated research questions and hypotheses define well-founded objectives, which in turn determine the appropriate design, course, and outcome of the study. This article then aims to discuss in detail the various aspects of crafting research questions and hypotheses, with the goal of guiding researchers as they develop their own. Examples from the authors and peer-reviewed scientific articles in the healthcare field are provided to illustrate key points.

DEFINITIONS AND RELATIONSHIP OF RESEARCH QUESTIONS AND HYPOTHESES

A research question is what a study aims to answer after data analysis and interpretation. The answer is written at length in the discussion section of the paper. Thus, the research question gives a preview of the different parts and variables of the study meant to address the problem posed in the research question. 1 An excellent research question clarifies the research writing while facilitating understanding of the research topic, objective, scope, and limitations of the study. 5

On the other hand, a research hypothesis is an educated statement of an expected outcome. This statement is based on background research and current knowledge. 8 , 9 The research hypothesis makes a specific prediction about a new phenomenon 10 or a formal statement on the expected relationship between an independent variable and a dependent variable. 3 , 11 It provides a tentative answer to the research question to be tested or explored. 4

Hypotheses employ reasoning to predict a theory-based outcome. 10 These can also be developed from theories by focusing on components of theories that have not yet been observed. 10 The validity of hypotheses is often based on the testability of the prediction made in a reproducible experiment. 8

Conversely, hypotheses can also be rephrased as research questions. Several hypotheses based on existing theories and knowledge may be needed to answer a research question. Developing ethical research questions and hypotheses creates a research design that has logical relationships among variables. These relationships serve as a solid foundation for the conduct of the study. 4 , 11 Haphazardly constructed research questions can result in poorly formulated hypotheses and improper study designs, leading to unreliable results. Thus, the formulations of relevant research questions and verifiable hypotheses are crucial when beginning research. 12

CHARACTERISTICS OF GOOD RESEARCH QUESTIONS AND HYPOTHESES

Excellent research questions are specific and focused. These integrate collective data and observations to confirm or refute the subsequent hypotheses. Well-constructed hypotheses are based on previous reports and verify the research context. These are realistic, in-depth, sufficiently complex, and reproducible. More importantly, these hypotheses can be addressed and tested. 13

There are several characteristics of well-developed hypotheses. Good hypotheses are 1) empirically testable 7 , 10 , 11 , 13 ; 2) backed by preliminary evidence 9 ; 3) testable by ethical research 7 , 9 ; 4) based on original ideas 9 ; 5) have evidence-based logical reasoning 10 ; and 6) can be predicted. 11 Good hypotheses can infer ethical and positive implications, indicating the presence of a relationship or effect relevant to the research theme. 7 , 11 These are initially developed from a general theory and branch into specific hypotheses by deductive reasoning. In the absence of a theory to base the hypotheses, inductive reasoning based on specific observations or findings forms more general hypotheses. 10

TYPES OF RESEARCH QUESTIONS AND HYPOTHESES

Research questions and hypotheses are developed according to the type of research, which can be broadly classified into quantitative and qualitative research. We provide a summary of the types of research questions and hypotheses under quantitative and qualitative research categories in Table 1 .

Table 1. Types of research questions and hypotheses

Quantitative research questions: descriptive research questions, comparative research questions, relationship research questions.

Quantitative research hypotheses: simple hypothesis, complex hypothesis, directional hypothesis, non-directional hypothesis, associative hypothesis, causal hypothesis, null hypothesis, alternative hypothesis, working hypothesis, statistical hypothesis, logical hypothesis, hypothesis-testing.

Qualitative research questions: contextual research questions, descriptive research questions, evaluation research questions, explanatory research questions, exploratory research questions, generative research questions, ideological research questions, ethnographic research questions, phenomenological research questions, grounded theory questions, qualitative case study questions.

Qualitative research hypotheses: hypothesis-generating.

Research questions in quantitative research

In quantitative research, research questions inquire about the relationships among variables being investigated and are usually framed at the start of the study. These are precise and typically linked to the subject population, dependent and independent variables, and research design. 1 Research questions may also attempt to describe the behavior of a population in relation to one or more variables, or describe the characteristics of variables to be measured ( descriptive research questions ). 1 , 5 , 14 These questions may also aim to discover differences between groups within the context of an outcome variable ( comparative research questions ), 1 , 5 , 14 or elucidate trends and interactions among variables ( relationship research questions ). 1 , 5 We provide examples of descriptive, comparative, and relationship research questions in quantitative research in Table 2 .

Quantitative research questions
Descriptive research question
- Measures responses of subjects to variables
- Presents variables to measure, analyze, or assess
Example: What is the proportion of resident doctors in the hospital who have mastered ultrasonography (response of subjects to a variable) as a diagnostic technique in their clinical training?
Comparative research question
- Clarifies the difference between one group with an outcome variable and another group without the outcome variable
Example: Is there a difference in the reduction of lung metastasis in osteosarcoma patients who received the vitamin D adjunctive therapy (group with outcome variable) compared with osteosarcoma patients who did not receive the vitamin D adjunctive therapy (group without outcome variable)?
- Compares the effects of variables
Example: How does the vitamin D analogue 22-Oxacalcitriol (variable 1) mimic the antiproliferative activity of 1,25-Dihydroxyvitamin D (variable 2) in osteosarcoma cells?
Relationship research question
- Defines trends, associations, relationships, or interactions between a dependent variable and an independent variable
Example: Is there a relationship between the number of medical student suicides (dependent variable) and the level of medical student stress (independent variable) in Japan during the first wave of the COVID-19 pandemic?

Hypotheses in quantitative research

In quantitative research, hypotheses predict the expected relationships among variables. 15 Relationships among variables that can be predicted include 1) between a single dependent variable and a single independent variable ( simple hypothesis ) or 2) between two or more independent and dependent variables ( complex hypothesis ). 4 , 11 Hypotheses may also specify the expected direction to be followed and imply an intellectual commitment to a particular outcome ( directional hypothesis ). 4 On the other hand, hypotheses may not predict the exact direction and are used in the absence of a theory, or when findings contradict previous studies ( non-directional hypothesis ). 4 In addition, hypotheses can 1) define interdependency between variables ( associative hypothesis ), 4 2) propose an effect on the dependent variable from manipulation of the independent variable ( causal hypothesis ), 4 3) state a negative relationship between two variables ( null hypothesis ), 4 , 11 , 15 4) replace the working hypothesis if rejected ( alternative hypothesis ), 15 5) explain the relationship of phenomena to possibly generate a theory ( working hypothesis ), 11 6) involve quantifiable variables that can be tested statistically ( statistical hypothesis ), 11 or 7) express a relationship whose interlinks can be verified logically ( logical hypothesis ). 11 We provide examples of simple, complex, directional, non-directional, associative, causal, null, alternative, working, statistical, and logical hypotheses in quantitative research, as well as the definition of quantitative hypothesis-testing research, in Table 3 .

Quantitative research hypotheses
Simple hypothesis
- Predicts a relationship between a single dependent variable and a single independent variable
Example: If the dose of the new medication (single independent variable) is high, blood pressure (single dependent variable) is lowered.
Complex hypothesis
- Foretells a relationship between two or more independent and dependent variables
Example: The higher the use of anticancer drugs, radiation therapy, and adjunctive agents (3 independent variables), the higher would be the survival rate (1 dependent variable).
Directional hypothesis
- Identifies study direction based on theory towards a particular outcome to clarify the relationship between variables
Example: Privately funded research projects will have a larger international scope (study direction) than publicly funded research projects.
Non-directional hypothesis
- Nature of the relationship between two variables or the exact study direction is not identified
- Does not involve a theory
Example: Women and men are different in terms of helpfulness. (Exact study direction is not identified.)
Associative hypothesis
- Describes variable interdependency
- Change in one variable causes change in another variable
Example: A larger number of people vaccinated against COVID-19 in the region (change in independent variable) will reduce the region’s incidence of COVID-19 infection (change in dependent variable).
Causal hypothesis
- An effect on the dependent variable is predicted from manipulation of the independent variable
Example: A change into a high-fiber diet (independent variable) will reduce the blood sugar level (dependent variable) of the patient.
Null hypothesis
- A negative statement indicating no relationship or difference between 2 variables
Example: There is no significant difference in the severity of pulmonary metastases between the new drug (variable 1) and the current drug (variable 2).
Alternative hypothesis
- Following a null hypothesis, an alternative hypothesis predicts a relationship between 2 study variables
Example: The new drug (variable 1) is better on average in reducing the level of pain from pulmonary metastasis than the current drug (variable 2).
Working hypothesis
- A hypothesis that is initially accepted for further research to produce a feasible theory
Example: Dairy cows fed with concentrates of different formulations will produce different amounts of milk.
Statistical hypothesis
- An assumption about the value of a population parameter or the relationship among several population characteristics
- Validity is tested by a statistical experiment or analysis
Example: The mean recovery rate from COVID-19 infection (value of population parameter) is not significantly different between population 1 and population 2.
Example: There is a positive correlation between the level of stress at the workplace and the number of suicides (population characteristics) among working people in Japan.
Logical hypothesis
- Offers or proposes an explanation with limited or no extensive evidence
Example: If healthcare workers provide more educational programs about contraception methods, the number of adolescent pregnancies will be less.
Hypothesis-testing (Quantitative hypothesis-testing research)
- Quantitative research uses deductive reasoning.
- This involves the formation of a hypothesis, collection of data in the investigation of the problem, analysis and use of the data from the investigation, and drawing of conclusions to validate or nullify the hypotheses.
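
To connect the null and alternative hypotheses above to actual analysis, here is a minimal sketch of an independent-samples t-test in Python; the severity scores are simulated, and the group means, sample sizes, and significance level are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated severity scores, invented for illustration only
new_drug = rng.normal(loc=4.2, scale=1.0, size=50)
current_drug = rng.normal(loc=5.0, scale=1.0, size=50)

# H0 (null): no difference in mean severity between the two drugs
# H1 (alternative): the mean severities differ
t_stat, p_value = stats.ttest_ind(new_drug, current_drug)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f}: reject H0 in favour of H1")
else:
    print(f"p = {p_value:.4f}: fail to reject H0")
```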

Research questions in qualitative research

Unlike research questions in quantitative research, research questions in qualitative research are usually continuously reviewed and reformulated. The central question and associated subquestions are stated more often than hypotheses. 15 The central question broadly explores a complex set of factors surrounding the central phenomenon, aiming to present the varied perspectives of participants. 15

There are varied goals for which qualitative research questions are developed. These questions can function in several ways, such as to 1) identify and describe existing conditions ( contextual research questions ); 2) describe a phenomenon ( descriptive research questions ); 3) assess the effectiveness of existing methods, protocols, theories, or procedures ( evaluation research questions ); 4) examine a phenomenon or analyze the reasons or relationships between subjects or phenomena ( explanatory research questions ); or 5) focus on unknown aspects of a particular topic ( exploratory research questions ). 5 In addition, some qualitative research questions provide new ideas for the development of theories and actions ( generative research questions ) or advance specific ideologies of a position ( ideological research questions ). 1 Other qualitative research questions may build on a body of existing literature and become working guidelines ( ethnographic research questions ). Research questions may also be broadly stated without specific reference to the existing literature or a typology of questions ( phenomenological research questions ), may be directed towards generating a theory of some process ( grounded theory questions ), or may address a description of the case and the emerging themes ( qualitative case study questions ). 15 We provide examples of contextual, descriptive, evaluation, explanatory, exploratory, generative, ideological, ethnographic, phenomenological, grounded theory, and qualitative case study research questions in qualitative research in Table 4 , and the definition of qualitative hypothesis-generating research in Table 5 .

Qualitative research questions
Contextual research question
- Asks about the nature of what already exists
- Individuals or groups function to further clarify and understand the natural context of real-world problems
Example: What are the experiences of nurses working night shifts in healthcare during the COVID-19 pandemic? (natural context of real-world problems)
Descriptive research question
- Aims to describe a phenomenon
Example: What are the different forms of disrespect and abuse (phenomenon) experienced by Tanzanian women when giving birth in healthcare facilities?
Evaluation research question
- Examines the effectiveness of existing practice or accepted frameworks
Example: How effective are decision aids (effectiveness of existing practice) in helping decide whether to give birth at home or in a healthcare facility?
Explanatory research question
- Clarifies a previously studied phenomenon and explains why it occurs
Example: Why is there an increase in teenage pregnancy (phenomenon) in Tanzania?
Exploratory research question
- Explores areas that have not been fully investigated to gain a deeper understanding of the research problem
Example: What factors affect the mental health of medical students (areas that have not yet been fully investigated) during the COVID-19 pandemic?
Generative research question
- Develops an in-depth understanding of people’s behavior by asking ‘how would’ or ‘what if’ to identify problems and find solutions
Example: How would the extensive research experience of the behavior of new staff impact the success of the novel drug initiative?
Ideological research question
- Aims to advance specific ideas or ideologies of a position
Example: Are Japanese nurses who volunteer in remote African hospitals able to promote humanized care of patients (specific ideas or ideologies) in the areas of safe patient environment, respect of patient privacy, and provision of accurate information related to health and care?
Ethnographic research question
- Clarifies peoples’ nature, activities, their interactions, and the outcomes of their actions in specific settings
Example: What are the demographic characteristics, rehabilitative treatments, community interactions, and disease outcomes (nature, activities, their interactions, and the outcomes) of people in China who are suffering from pneumoconiosis?
Phenomenological research question
- Seeks to know more about the phenomena that have impacted an individual
Example: What are the lived experiences of parents who have been living with and caring for children with a diagnosis of autism? (phenomena that have impacted an individual)
Grounded theory question
- Focuses on social processes, asking about what happens and how people interact, or uncovering the social relationships and behaviors of groups
Example: What are the problems that pregnant adolescents face in terms of social and cultural norms (social processes), and how can these be addressed?
Qualitative case study question
- Assesses a phenomenon using different sources of data to answer “why” and “how” questions
- Considers how the phenomenon is influenced by its contextual situation
Example: How does quitting work and assuming the role of a full-time mother (phenomenon assessed) change the lives of women in Japan?
Qualitative research hypotheses
Hypothesis-generating (Qualitative hypothesis-generating research)
- Qualitative research uses inductive reasoning.
- This involves data collection from study participants or the literature regarding a phenomenon of interest, using the collected data to develop a formal hypothesis, and using the formal hypothesis as a framework for testing the hypothesis.
- Qualitative exploratory studies explore areas deeper, clarifying subjective experience and allowing formulation of a formal hypothesis potentially testable in a future quantitative approach.

Qualitative studies usually pose at least one central research question and several subquestions starting with How or What . These research questions use exploratory verbs such as explore or describe . These also focus on one central phenomenon of interest, and may mention the participants and research site. 15

Hypotheses in qualitative research

Hypotheses in qualitative research are stated in the form of a clear statement concerning the problem to be investigated. Unlike in quantitative research where hypotheses are usually developed to be tested, qualitative research can lead to both hypothesis-testing and hypothesis-generating outcomes. 2 When studies require both quantitative and qualitative research questions, this suggests an integrative process between both research methods wherein a single mixed-methods research question can be developed. 1

FRAMEWORKS FOR DEVELOPING RESEARCH QUESTIONS AND HYPOTHESES

Research questions followed by hypotheses should be developed before the start of the study. 1 , 12 , 14 It is crucial to develop feasible research questions on a topic that is interesting to both the researcher and the scientific community. This can be achieved by a meticulous review of previous and current studies to establish a novel topic. Specific areas are subsequently focused on to generate ethical research questions. The relevance of the research questions is evaluated in terms of clarity of the resulting data, specificity of the methodology, objectivity of the outcome, depth of the research, and impact of the study. 1 , 5 These aspects constitute the FINER criteria (i.e., Feasible, Interesting, Novel, Ethical, and Relevant). 1 Clarity and effectiveness are achieved if research questions meet the FINER criteria. In addition to the FINER criteria, Ratan et al. described focus, complexity, novelty, feasibility, and measurability for evaluating the effectiveness of research questions. 14

The PICOT and PEO frameworks are also used when developing research questions. 1 The following elements are addressed in these frameworks, PICOT: P-population/patients/problem, I-intervention or indicator being studied, C-comparison group, O-outcome of interest, and T-timeframe of the study; PEO: P-population being studied, E-exposure to preexisting conditions, and O-outcome of interest. 1 Research questions are also considered good if these meet the “FINERMAPS” framework: Feasible, Interesting, Novel, Ethical, Relevant, Manageable, Appropriate, Potential value/publishable, and Systematic. 14

As we indicated earlier, research questions and hypotheses that are not carefully formulated result in unethical studies or poor outcomes. To illustrate this, we provide some examples of ambiguous research questions and hypotheses that result in unclear and weak research objectives in quantitative research ( Table 6 ) 16 and qualitative research ( Table 7 ) 17 , and how to transform these ambiguous research questions and hypotheses into clear and good statements.

Research question
- Unclear and weak statement (Statement 1): Which is more effective between smoke moxibustion and smokeless moxibustion?
- Clear and good statement (Statement 2): “Moreover, regarding smoke moxibustion versus smokeless moxibustion, it remains unclear which is more effective, safe, and acceptable to pregnant women, and whether there is any difference in the amount of heat generated.”
- Points to avoid: 1) vague and unfocused questions; 2) closed questions simply answerable by yes or no; 3) questions requiring a simple choice.
Hypothesis
- Unclear and weak statement (Statement 1): The smoke moxibustion group will have higher cephalic presentation.
- Clear and good statement (Statement 2): “Hypothesis 1. The smoke moxibustion stick group (SM group) and smokeless moxibustion stick group (SLM group) will have higher rates of cephalic presentation after treatment than the control group. Hypothesis 2. The SM group and SLM group will have higher rates of cephalic presentation at birth than the control group. Hypothesis 3. There will be no significant differences in the well-being of the mother and child among the three groups in terms of the following outcomes: premature birth, premature rupture of membranes (PROM) at < 37 weeks, Apgar score < 7 at 5 min, umbilical cord blood pH < 7.1, admission to neonatal intensive care unit (NICU), and intrauterine fetal death.”
- Points to avoid: 1) unverifiable hypotheses; 2) incompletely stated groups of comparison; 3) insufficiently described variables or outcomes.
Research objective
- Unclear and weak statement (Statement 1): To determine which is more effective between smoke moxibustion and smokeless moxibustion.
- Clear and good statement (Statement 2): “The specific aims of this pilot study were (a) to compare the effects of smoke moxibustion and smokeless moxibustion treatments with the control group as a possible supplement to ECV for converting breech presentation to cephalic presentation and increasing adherence to the newly obtained cephalic position, and (b) to assess the effects of these treatments on the well-being of the mother and child.”
- Points to avoid: 1) poor understanding of the research question and hypotheses; 2) insufficient description of population, variables, or study outcomes.

a These statements were composed for comparison and illustrative purposes only.

b These statements are direct quotes from Higashihara and Horiuchi. 16

Research question
- Unclear and weak statement (Statement 1): Does disrespect and abuse (D&A) occur in childbirth in Tanzania?
- Clear and good statement (Statement 2): How does disrespect and abuse (D&A) occur and what are the types of physical and psychological abuses observed in midwives’ actual care during facility-based childbirth in urban Tanzania?
- Points to avoid: 1) ambiguous or oversimplistic questions; 2) questions unverifiable by data collection and analysis.
Hypothesis
- Unclear and weak statement (Statement 1): Disrespect and abuse (D&A) occur in childbirth in Tanzania.
- Clear and good statement (Statement 2): Hypothesis 1: Several types of physical and psychological abuse by midwives in actual care occur during facility-based childbirth in urban Tanzania. Hypothesis 2: Weak nursing and midwifery management contribute to the D&A of women during facility-based childbirth in urban Tanzania.
- Points to avoid: 1) statements simply expressing facts; 2) insufficiently described concepts or variables.
Research objective
- Unclear and weak statement (Statement 1): To describe disrespect and abuse (D&A) in childbirth in Tanzania.
- Clear and good statement (Statement 2): “This study aimed to describe from actual observations the respectful and disrespectful care received by women from midwives during their labor period in two hospitals in urban Tanzania.”
- Points to avoid: 1) statements unrelated to the research question and hypotheses; 2) unattainable or unexplorable objectives.

a This statement is a direct quote from Shimoda et al. 17

The other statements were composed for comparison and illustrative purposes only.

CONSTRUCTING RESEARCH QUESTIONS AND HYPOTHESES

To construct effective research questions and hypotheses, it is very important to 1) clarify the background and 2) identify the research problem at the outset of the research, within a specific timeframe. 9 Then, 3) review or conduct preliminary research to collect all available knowledge about the possible research questions by studying theories and previous studies. 18 Afterwards, 4) construct research questions to investigate the research problem. Identify variables to be accessed from the research questions 4 and make operational definitions of constructs from the research problem and questions. Thereafter, 5) construct specific deductive or inductive predictions in the form of hypotheses. 4 Finally, 6) state the study aims . This general flow for constructing effective research questions and hypotheses prior to conducting research is shown in Fig. 1 .

[Fig. 1. General flow for constructing effective research questions and hypotheses prior to conducting research.]

Research questions are used more frequently in qualitative research than objectives or hypotheses. 3 These questions seek to discover, understand, explore or describe experiences by asking “What” or “How.” The questions are open-ended to elicit a description rather than to relate variables or compare groups. The questions are continually reviewed, reformulated, and changed during the qualitative study. 3 In quantitative research, by contrast, research questions are used more frequently in survey projects, while hypotheses are used in experiments to compare variables and their relationships.

Hypotheses are constructed based on the variables identified and as an if-then statement, following the template, ‘If a specific action is taken, then a certain outcome is expected.’ At this stage, some ideas regarding expectations from the research to be conducted must be drawn. 18 Then, the variables to be manipulated (independent) and influenced (dependent) are defined. 4 Thereafter, the hypothesis is stated and refined, and reproducible data tailored to the hypothesis are identified, collected, and analyzed. 4 The hypotheses must be testable and specific, 18 and should describe the variables and their relationships, the specific group being studied, and the predicted research outcome. 18 Hypotheses construction involves a testable proposition to be deduced from theory, and independent and dependent variables to be separated and measured separately. 3 Therefore, good hypotheses must be based on good research questions constructed at the start of a study or trial. 12

In summary, research questions are constructed after establishing the background of the study. Hypotheses are then developed based on the research questions. Thus, it is crucial to have excellent research questions to generate superior hypotheses. In turn, these would determine the research objectives and the design of the study, and ultimately, the outcome of the research. 12 Algorithms for building research questions and hypotheses are shown in Fig. 2 for quantitative research and in Fig. 3 for qualitative research.

[Fig. 2. Algorithm for building research questions and hypotheses in quantitative research.]

EXAMPLES OF RESEARCH QUESTIONS FROM PUBLISHED ARTICLES

  • EXAMPLE 1. Descriptive research question (quantitative research)
  • - Presents research variables to be assessed (distinct phenotypes and subphenotypes)
  • “BACKGROUND: Since COVID-19 was identified, its clinical and biological heterogeneity has been recognized. Identifying COVID-19 phenotypes might help guide basic, clinical, and translational research efforts.
  • RESEARCH QUESTION: Does the clinical spectrum of patients with COVID-19 contain distinct phenotypes and subphenotypes? ” 19
  • EXAMPLE 2. Relationship research question (quantitative research)
  • - Shows interactions between dependent variable (static postural control) and independent variable (peripheral visual field loss)
  • “Background: Integration of visual, vestibular, and proprioceptive sensations contributes to postural control. People with peripheral visual field loss have serious postural instability. However, the directional specificity of postural stability and sensory reweighting caused by gradual peripheral visual field loss remain unclear.
  • Research question: What are the effects of peripheral visual field loss on static postural control ?” 20
  • EXAMPLE 3. Comparative research question (quantitative research)
  • - Clarifies the difference among groups with an outcome variable (patients enrolled in COMPERA with moderate PH or severe PH in COPD) and another group without the outcome variable (patients with idiopathic pulmonary arterial hypertension (IPAH))
  • “BACKGROUND: Pulmonary hypertension (PH) in COPD is a poorly investigated clinical condition.
  • RESEARCH QUESTION: Which factors determine the outcome of PH in COPD?
  • STUDY DESIGN AND METHODS: We analyzed the characteristics and outcome of patients enrolled in the Comparative, Prospective Registry of Newly Initiated Therapies for Pulmonary Hypertension (COMPERA) with moderate or severe PH in COPD as defined during the 6th PH World Symposium who received medical therapy for PH and compared them with patients with idiopathic pulmonary arterial hypertension (IPAH) .” 21
  • EXAMPLE 4. Exploratory research question (qualitative research)
  • - Explores areas that have not been fully investigated (perspectives of families and children who receive care in clinic-based child obesity treatment) to have a deeper understanding of the research problem
  • “Problem: Interventions for children with obesity lead to only modest improvements in BMI and long-term outcomes, and data are limited on the perspectives of families of children with obesity in clinic-based treatment. This scoping review seeks to answer the question: What is known about the perspectives of families and children who receive care in clinic-based child obesity treatment? This review aims to explore the scope of perspectives reported by families of children with obesity who have received individualized outpatient clinic-based obesity treatment.” 22
  • EXAMPLE 5. Relationship research question (quantitative research)
  • - Defines interactions between dependent variable (use of ankle strategies) and independent variable (changes in muscle tone)
  • “Background: To maintain an upright standing posture against external disturbances, the human body mainly employs two types of postural control strategies: “ankle strategy” and “hip strategy.” While it has been reported that the magnitude of the disturbance alters the use of postural control strategies, it has not been elucidated how the level of muscle tone, one of the crucial parameters of bodily function, determines the use of each strategy. We have previously confirmed using forward dynamics simulations of human musculoskeletal models that an increased muscle tone promotes the use of ankle strategies. The objective of the present study was to experimentally evaluate a hypothesis: an increased muscle tone promotes the use of ankle strategies. Research question: Do changes in the muscle tone affect the use of ankle strategies ?” 23

EXAMPLES OF HYPOTHESES IN PUBLISHED ARTICLES

  • EXAMPLE 1. Working hypothesis (quantitative research)
  • - A hypothesis that is initially accepted for further research to produce a feasible theory
  • “As fever may have benefit in shortening the duration of viral illness, it is plausible to hypothesize that the antipyretic efficacy of ibuprofen may be hindering the benefits of a fever response when taken during the early stages of COVID-19 illness .” 24
  • “In conclusion, it is plausible to hypothesize that the antipyretic efficacy of ibuprofen may be hindering the benefits of a fever response . The difference in perceived safety of these agents in COVID-19 illness could be related to the more potent efficacy to reduce fever with ibuprofen compared to acetaminophen. Compelling data on the benefit of fever warrant further research and review to determine when to treat or withhold ibuprofen for early stage fever for COVID-19 and other related viral illnesses .” 24
  • EXAMPLE 2. Exploratory hypothesis (qualitative research)
  • - Explores particular areas deeper to clarify subjective experience and develop a formal hypothesis potentially testable in a future quantitative approach
  • “We hypothesized that when thinking about a past experience of help-seeking, a self distancing prompt would cause increased help-seeking intentions and more favorable help-seeking outcome expectations .” 25
  • “Conclusion
  • Although a priori hypotheses were not supported, further research is warranted as results indicate the potential for using self-distancing approaches to increasing help-seeking among some people with depressive symptomatology.” 25
  • EXAMPLE 3. Hypothesis-generating research to establish a framework for hypothesis testing (qualitative research)
  • “We hypothesize that compassionate care is beneficial for patients (better outcomes), healthcare systems and payers (lower costs), and healthcare providers (lower burnout). ” 26
  • Compassionomics is the branch of knowledge and scientific study of the effects of compassionate healthcare. Our main hypotheses are that compassionate healthcare is beneficial for (1) patients, by improving clinical outcomes, (2) healthcare systems and payers, by supporting financial sustainability, and (3) HCPs, by lowering burnout and promoting resilience and well-being. The purpose of this paper is to establish a scientific framework for testing the hypotheses above . If these hypotheses are confirmed through rigorous research, compassionomics will belong in the science of evidence-based medicine, with major implications for all healthcare domains.” 26
  • EXAMPLE 4. Statistical hypothesis (quantitative research)
  • - An assumption is made about the relationship among several population characteristics ( gender differences in sociodemographic and clinical characteristics of adults with ADHD ). Validity is tested by statistical experiment or analysis ( chi-square test, Student’s t-test, and logistic regression analysis)
  • “Our research investigated gender differences in sociodemographic and clinical characteristics of adults with ADHD in a Japanese clinical sample. Due to unique Japanese cultural ideals and expectations of women's behavior that are in opposition to ADHD symptoms, we hypothesized that women with ADHD experience more difficulties and present more dysfunctions than men . We tested the following hypotheses: first, women with ADHD have more comorbidities than men with ADHD; second, women with ADHD experience more social hardships than men, such as having less full-time employment and being more likely to be divorced.” 27
  • “Statistical Analysis
  • ( text omitted ) Between-gender comparisons were made using the chi-squared test for categorical variables and Student’s t-test for continuous variables…( text omitted ). A logistic regression analysis was performed for employment status, marital status, and comorbidity to evaluate the independent effects of gender on these dependent variables.” 27
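
The analysis plan quoted in Example 4 can be sketched in code. The following is only an illustration on simulated data, not the study’s actual analysis; the variable names (female, employed, comorbidity) are invented.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated binary variables, invented for illustration only
df = pd.DataFrame({
    "female":      rng.integers(0, 2, 200),
    "employed":    rng.integers(0, 2, 200),
    "comorbidity": rng.integers(0, 2, 200),
})

# Between-gender comparison of a categorical variable (chi-squared test)
table = pd.crosstab(df["female"], df["employed"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")

# Logistic regression: independent effect of gender on employment
# status, controlling for comorbidity
X = sm.add_constant(df[["female", "comorbidity"]])
result = sm.Logit(df["employed"], X).fit(disp=0)
print(result.summary())
```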

EXAMPLES OF HYPOTHESIS AS WRITTEN IN PUBLISHED ARTICLES IN RELATION TO OTHER PARTS

  • EXAMPLE 1. Background, hypotheses, and aims are provided
  • “Pregnant women need skilled care during pregnancy and childbirth, but that skilled care is often delayed in some countries …( text omitted ). The focused antenatal care (FANC) model of WHO recommends that nurses provide information or counseling to all pregnant women …( text omitted ). Job aids are visual support materials that provide the right kind of information using graphics and words in a simple and yet effective manner. When nurses are not highly trained or have many work details to attend to, these job aids can serve as a content reminder for the nurses and can be used for educating their patients (Jennings, Yebadokpo, Affo, & Agbogbe, 2010) ( text omitted ). Importantly, additional evidence is needed to confirm how job aids can further improve the quality of ANC counseling by health workers in maternal care …( text omitted )” 28
  • “ This has led us to hypothesize that the quality of ANC counseling would be better if supported by job aids. Consequently, a better quality of ANC counseling is expected to produce higher levels of awareness concerning the danger signs of pregnancy and a more favorable impression of the caring behavior of nurses .” 28
  • “This study aimed to examine the differences in the responses of pregnant women to a job aid-supported intervention during ANC visit in terms of 1) their understanding of the danger signs of pregnancy and 2) their impression of the caring behaviors of nurses to pregnant women in rural Tanzania.” 28
  • EXAMPLE 2. Background, hypotheses, and aims are provided
  • “We conducted a two-arm randomized controlled trial (RCT) to evaluate and compare changes in salivary cortisol and oxytocin levels of first-time pregnant women between experimental and control groups. The women in the experimental group touched and held an infant for 30 min (experimental intervention protocol), whereas those in the control group watched a DVD movie of an infant (control intervention protocol). The primary outcome was salivary cortisol level and the secondary outcome was salivary oxytocin level.” 29
  • “ We hypothesize that at 30 min after touching and holding an infant, the salivary cortisol level will significantly decrease and the salivary oxytocin level will increase in the experimental group compared with the control group .” 29
  • EXAMPLE 3. Background, aim, and hypothesis are provided
  • “In countries where the maternal mortality ratio remains high, antenatal education to increase Birth Preparedness and Complication Readiness (BPCR) is considered one of the top priorities [1]. BPCR includes birth plans during the antenatal period, such as the birthplace, birth attendant, transportation, health facility for complications, expenses, and birth materials, as well as family coordination to achieve such birth plans. In Tanzania, although increasing, only about half of all pregnant women attend an antenatal clinic more than four times [4]. Moreover, the information provided during antenatal care (ANC) is insufficient. In the resource-poor settings, antenatal group education is a potential approach because of the limited time for individual counseling at antenatal clinics.” 30
  • “This study aimed to evaluate an antenatal group education program among pregnant women and their families with respect to birth-preparedness and maternal and infant outcomes in rural villages of Tanzania.” 30
  • “ The study hypothesis was if Tanzanian pregnant women and their families received a family-oriented antenatal group education, they would (1) have a higher level of BPCR, (2) attend antenatal clinic four or more times, (3) give birth in a health facility, (4) have less complications of women at birth, and (5) have less complications and deaths of infants than those who did not receive the education .” 30

Research questions and hypotheses are crucial components to any type of research, whether quantitative or qualitative. These questions should be developed at the very beginning of the study. Excellent research questions lead to superior hypotheses, which, like a compass, set the direction of research, and can often determine the successful conduct of the study. Many research studies have floundered because the development of research questions and subsequent hypotheses was not given the thought and meticulous attention needed. The development of research questions and hypotheses is an iterative process based on extensive knowledge of the literature and insightful grasp of the knowledge gap. Focused, concise, and specific research questions provide a strong foundation for constructing hypotheses which serve as formal predictions about the research outcomes. Research questions and hypotheses are crucial elements of research that should not be overlooked. They should be carefully thought of and constructed when planning research. This avoids unethical studies and poor outcomes by defining well-founded objectives that determine the design, course, and outcome of the study.

Disclosure: The authors have no potential conflicts of interest to disclose.

Author Contributions:

  • Conceptualization: Barroga E, Matanguihan GJ.
  • Methodology: Barroga E, Matanguihan GJ.
  • Writing - original draft: Barroga E, Matanguihan GJ.
  • Writing - review & editing: Barroga E, Matanguihan GJ.

Quantitative Approaches to Comparative Analyses: Data Properties and Their Implications for Theory, Measurement and Modelling

Robert Neumann and Peter Graeff

Open access. Published: 06 November 2015. Volume 14, pages 385–393 (2015).

While there is an abundant use of macro data in the social sciences, little attention is given to the sources or the construction of these data. Owing to the restricted amount of indices or items, researchers most often apply the ‘available data at hand’. Since the opportunities to analyse data are constantly increasing and the availability of macro indicators is improving as well, one may be enticed to incorporate even qualitatively inferior indicators for the sake of statistically significant results. The pitfalls of applying biased indicators or using instruments with unknown methodological characteristics are biased estimates, false statistical inferences and, as one potential consequence, the derivation of misleading policy recommendations. This Special Issue assembles contributions that attempt to stimulate the missing debate about the criteria of assessing aggregate data and their measurement properties for comparative analyses.


INTRODUCTION

The social sciences are witnessing an ever increasing supply of data at the aggregate level on several key dimensions of societal progress or politico-institutional conditions. Next to standardised sources for comparing countries worldwide ( Solt, 2014 ), a wealth of indicators has been introduced over the past three decades to allow for comparative analyses regarding such issues as levels of perceived corruption, quality of governance, environmental sustainability, political rights and democratic freedom. And while there is an abundant use of these macro data, less attention has been given to the sources or to the construction of these data. Despite the spike in data availability, information on countries or regions often remains restricted to only a handful of indicators compiled by organisations that have the resources and know-how to offer worldwide coverage of countries. Due to this restricted amount of indices or items, researchers for the most part apply the ‘available data at hand’ with only little consideration of their measurement properties.

There already have been attempts to address questions of data quality within the community of comparative political science. Herrera and Kapur (2007) try to foster the debate about the quality of comparative data sets by highlighting the three components of validity, coverage and accuracy. Mudde and Schedler (2010) discuss the challenges of data choice, distinguishing between procedural and outcome-oriented criteria when data quality is to be assessed. They relate the procedural criterion to aspects of transparency, reliability and replicability of data. The latter criterion is connected to validity, accuracy and precision ( Mudde and Schedler, 2010 : 411). Both groups of authors agree that research on data properties usually offers few scientific rewards, but that the debate about the measures is crucial and requires constant stimulation.

A few landmark books and articles have laid out some fundamental guidelines and approaches concerning case selection, operationalisation and implications for comparative model testing at the macro level (see for instance King et al, 1994 ; Adcock and Collier, 2001 ; Gerring, 2001 ). Yet it appears that the discussion within comparative research about the measurement properties of different indicators lags behind the ongoing application of numerous indices in all sorts of comparative empirical research. That is, theoretical and empirical work with new and improved measurements has so far passed up the opportunity to foster an exchange about the conceptual framework for comparative multivariate modelling. Furthermore, it often remains difficult to grasp the core intentions of different streams of knowledge production, especially when the computation of new cross-country indices was performed in response to prior criticism of existing measures.

DATA PROPERTIES AND THEIR TRADE-OFF

Judging data properties from a qualitative and quantitative perspective, King et al (1994: 63, 97) propose the criteria of unbiasedness, efficiency and consistency. In particular, they concentrate on the inferential performance of measures. Here, bias relates to the property of introducing specific variance into the measurement, which in turn leads to non-random variation between different or repeated applications of the measure in inferential tasks. For example, Hawken and Munck (2011: 4) report that ratings on perceived corruption made by commercial risk assessment agencies systematically rate economies as more corrupt than surveys of business executives do, representing a bias ‘which does not seem consistent with random measurement error’. Efficiency relates to the variance of a measure when taken as an estimator. The simple idea is that an increase in sample size will likely reduce the variance of a measure and thus measure a phenomenon more efficiently. But even King et al (1994: 66) emphasise that these two properties come with a trade-off that is not always easily reconciled to achieve consistency; most likely, researchers should allow for more bias in their measure if they thereby achieve larger improvements in efficiency. They do not elaborate on consistency further, although they obviously relate it to reliability, which points towards traditional criteria or properties of measurement theory.
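
The efficiency idea can be made precise with a standard result that the text leaves implicit: for n independent observations with common variance, the variance of the sample mean shrinks proportionally to 1/n.

```latex
\operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n},
\qquad
\operatorname{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}
```

So quadrupling the sample size halves the standard error, which is why larger samples measure a phenomenon more efficiently; note, however, that no increase in n can remove a systematic bias.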


This traditional approach of (psychometric) test or measurement theory usually provides social scientists with a framework to think about properties of measures or data. That is, the criteria of validity and reliability remain the cornerstones of any discussion about measurement properties. One can define reliability as an ‘agreement between two efforts to measure same trait through maximally similar methods’ ( Campbell and Fiske, 1959 : 83). Usually, this translates to a test of internal consistency of an indicator or test-retest approaches to check whether the systematic variation of an observed phenomenon can be captured by an empirical measure, at several points in time or across different (sub-)samples ( Nunnally and Bernstein, 1978 : 191). Validity represents a more demanding measurement criterion. A few authors have put forward conceptual approaches to address the problems of constructing indices under the perspective of measurement validity (e.g., Bollen, 1989 ; Adcock and Collier, 2001 ). While measurement validity may be broadly defined as the achievement that ‘… scores (including the results of qualitative classification) meaningfully capture the ideas contained in the corresponding concept’ ( Adcock and Collier, 2001 : 530), it consists of various subcategories such as content, construct, internal/external validity, convergent/discriminant validity and even touches upon more ambitious concepts such as ecological validity as well. These various dimensions also reflect a variety of sources of measurement error, whether stemming from the process of data collection (randomisation versus case selection), survey mode and origin of data, data operationalisation or aggregation of different data sources.
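
Internal consistency, one reliability check mentioned above, is commonly operationalised as Cronbach’s alpha. Here is a minimal sketch on simulated item responses; the single-latent-trait setup and the noise level are assumptions made purely for illustration.

```python
import numpy as np

def cronbach_alpha(items):
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Simulated 5-item scale answered by 100 respondents:
# one latent trait plus invented item-level noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 1))
responses = latent + rng.normal(scale=0.8, size=(100, 5))
print(f"alpha = {cronbach_alpha(responses):.2f}")
```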

Three aspects require us to think harder about the feasibility of these classical concepts of measurement theory. First, the increasing availability of data for the computation or aggregation of macro indicators should improve the reliability of measurements. In fact, it seems that econometricians have completely abandoned the idea of measurement validity and instead focus on statistical techniques for aggregating data. For instance, a recent debate has yielded the impression that reliability remains the main goal to be established, while the concept of validity is not treated as equally important (see the discussion between Kaufmann et al (2010) and Thomas (2010) ). The problem with the idea of increasing the reliability of measures arises at the point when validity is sacrificed due to ‘methodological contamination’ ( Sullivan and Feldman, 1979 : 19), especially with regard to the notion that reliability ‘represents a necessary but not sufficient condition for validity’ ( Nunnally and Bernstein, 1978 : 192, italics in the original). Hence, aggregated or broadly defined measures that are unable to discriminate concepts which are theoretically distinct – and hence are not supposed to be measured by the initial approaches – do not necessarily represent threats to reliability, but rather to validity. This is especially the case in empirical tests of theoretical predictions regarding the determinants or consequences of certain politico-institutional conditions, where invalid measures are likely to generate biased coefficients due to measurement error among independent or even dependent variables ( Herrera and Kapur, 2007 ). As a consequence, results will lack generalisability. For example, combining several reliable measures of the same phenomena to increase the reliability of the aggregate measure can only claim to be unbiased if all underlying measures capture the same portion of systematic variation in a phenomenon and are able to exclude random measurement error equally well. Testing theories with aggregate measures always comes with the caveat of introducing random measurement error into a measure that is supposed to only represent systematic variation in a phenomenon (see for instance Bollen, 2009 for a discussion), despite being highly reliable.

The potential trade-off between reliability and components of validity leads to the second aspect to keep in mind when thinking about measurement properties: a lack of validity may only bother researchers who follow a theory-driven approach to quantitative analysis. The shift towards a data-driven approach puts less emphasis on the underlying theory from which one derives hypotheses to be tested. Hypothesis testing may even be the least important aspect of statistical modelling (Varian, 2014: 5). Instead, the goal of data analysis is prediction: forecasting specific behaviours, events or outcomes based on large sets of data, prior knowledge or prior evidence. Given the large amounts of data available and the increasing computing capacity that has enabled the widespread use of Bayesian approaches and machine learning techniques in the social sciences (see Gelman et al, 2014; Jackman, 2009), one can argue that measurement properties rooted in a theory-driven perspective may lose their relevance. This shift implies a growing importance for concepts such as reliability or predictive validity, which sit closer to the data-driven approach. Footnote 2
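To make the contrast concrete, here is a minimal sketch of the data-driven criterion: a model is judged by out-of-sample predictive accuracy rather than by in-sample coefficient tests. This is my own toy illustration of the logic, not an example from the cited literature.

```python
# Judge a linear model by held-out prediction error, the data-driven yardstick.
import numpy as np

rng = np.random.default_rng(7)
n = 1000
x = rng.normal(size=(n, 3))
y = x @ np.array([1.0, 0.0, 0.5]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X[:800], y[:800], rcond=None)[0]  # fit on a training split

pred = X[800:] @ beta                                    # predict the held-out fifth
rmse = np.sqrt(np.mean((y[800:] - pred) ** 2))
print(f"out-of-sample RMSE: {rmse:.2f}")                 # ~1.0, the irreducible noise level
```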

The third challenge confronts comparative scholars working with individual-level data. Here, the extension and longevity of survey programmes such as the World Values Survey or the International Social Survey Programme (ISSP) have made the application of multilevel models to comparative cross-sectional longitudinal analyses feasible (Beck, 2007; Fairbrother, 2014). Given these opportunities, one core assumption is that measurement invariance holds across countries, that is, that questionnaire items capture the same underlying concept in a similar way across different contexts of data collection. At the same time, the theoretical emphasis on the contextuality of social phenomena creates a desire to reflect the idiosyncratic characteristics of a society within the subsequent measurement approaches.

This creates another trade-off for scholars within the respective research communities. As in the case of reliability and validity, contextually sensitive measures can come at the cost of measurement invariance. Given that measurement invariance is tested against some theoretical model, the shift to data-driven approaches may affect the importance of this particular measurement property in much the same way as illustrated for the relationship between reliability and validity.
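A toy simulation (again my own, for illustration only) shows why a violation of measurement invariance undermines comparison: two countries have identical latent attitudes, but one survey item carries a country-specific intercept shift, so comparing raw scale means manufactures a spurious cross-country difference.

```python
# Differential item functioning turning into a spurious country difference.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
latent_a = rng.normal(0.0, 1.0, n)   # country A latent attitude
latent_b = rng.normal(0.0, 1.0, n)   # country B: identical latent distribution

def observed_items(latent, shift_item3=0.0):
    noise = lambda: rng.normal(0.0, 0.6, latent.size)
    return np.column_stack([
        latent + noise(),
        latent + noise(),
        latent + shift_item3 + noise(),  # item 3 may read differently by country
    ])

scale_a = observed_items(latent_a).mean(axis=1)
scale_b = observed_items(latent_b, shift_item3=0.9).mean(axis=1)

print(f"latent mean difference:   {latent_b.mean() - latent_a.mean():+.2f}")  # ~0.00
print(f"observed mean difference: {scale_b.mean() - scale_a.mean():+.2f}")    # ~+0.30, an artefact
```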

We perceive this development as neither definitive nor one-dimensional. Measurement theory and concepts like validity remain crucial for evaluating and applying the right instruments and for knowing where to look when research questions are to be answered. That is, how to think about and assess the properties of data remains a crucial aspect of any empirical endeavour. But these concepts seldom represent the only criteria for assessing the characteristics of data. Our own work has concentrated on comparing different indices by their measurement properties (Neumann and Graeff, 2010, 2013). One conclusion from this work is that researchers face certain incentives that require decisions on how to cope with the aforementioned trade-offs when measures from comparative data are applied.

THE EDITED SPECIAL ISSUE

Despite the known problems with comparative data, only a few questions have been conclusively answered, and the stream of new indicators constantly creates new challenges for current comparative research. Some key problems can be summarised as follows: How can one account for the contextuality of measuring country characteristics while maintaining comparability? What are the consequences when prior knowledge and existing empirical findings are to be included in the derivation of existing and new indicators? How can the accuracy of an index be assessed – and how can accuracy even be defined or measured in a measurement sense?

This edited issue comprises papers in which the properties of the applied aggregate data and their underlying sources are explicitly reflected. As the authors bring in different methodological backgrounds, the papers apply a variety of contemporary approaches to reliability and validity, which does not always coincide with a psychometric notion of constructs or measurement criteria. The authors do not, however, fall prey to typical publication strategies such as reporting only significant and/or theoretically congruent results instead of null results (Gelman and Loken, 2014). All papers share the ambition to accurately reflect the underlying theoretical meaning of the constructs of interest. In doing so, they address the above-mentioned key questions in their own way.

Susanne Pickel et al (2015) present a new framework for comparative social scientists that tackles one of the most prominent topics in political research: the quality of democracy. In particular, the authors propose a framework to assess the measurement properties of three prominent indices of the quality of democracy. This evaluative process requires the integration of theoretical considerations about the definitional clarity and validity of the underlying concepts as well as empirical concerns about the choice of data sources or procedures of operationalisation and aggregation. Their contribution picks up several important points for the measurement of macro phenomena. First, although the definition of a concept that encompasses concept validity may vary between researchers or research schools, an assessment of measurement properties remains tied to rather objective criteria like reliability, transparency, parsimony or replicability. Second, the assessment of a concept and its measurement characteristics ultimately faces the challenge of measuring the contextual characteristics of a political system as closely as possible while adhering to more general measurement principles. The latter represents a task for researchers who want to investigate the comparability of indices. Pickel et al apply a framework that includes twenty criteria, focusing on three indices of the quality of democracy. The authors state that a theory-based conceptualisation is the necessary condition for facing the (potential) trade-off between the adequacy of a measure and its capacity to be compared with other measures in a meaningful way.

Mark David Nieman and Jonathan Ring (2015) pick up another big topic of political research: human rights. Their starting point is that all researchers dealing with country data on human rights have to rely on a restricted number of data sources. Namely, the Cingranelli-Richards (CIRI) index and the Political Terror Scale (PTS) represent two widely used indices that are both constructed from the same country reports on human rights violations by the United States State Department and Amnesty International. Their main concern is that if data sources share systematic measurement error, for instance due to politico-ideological or geopolitical bias in the country reports, these properties will likely be reflected in the indices constructed from them. After clarifying why the reports of the US State Department possess such undesirable measurement properties, they propose specific remedies for the problem. Nieman and Ring discuss possible solutions such as data truncation as well as strategies for correcting systematic bias using an instrumental variable approach. Their replication analysis reveals that applying the corrected version indeed changes results from prior analyses. Their work highlights the importance of decisions made during indicator choice and subsequent analysis, where some choice sets and their consequences for inferential reasoning pose conflicting incentives for researchers, given the publication bias favouring statistically significant findings (Brodeur et al, 2012).
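The instrumental-variable logic invoked here can be sketched in a few lines. The simulation below is a stylised two-stage least squares, not Nieman and Ring's actual estimator; all variable names and coefficients are illustrative. An instrument that predicts the contaminated measure but is unrelated to the shared reporting bias recovers the true effect.

```python
# Stylised 2SLS: purging shared systematic bias from a contaminated measure.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)                      # instrument: unrelated to the reporting bias
bias = rng.normal(size=n)                   # systematic bias shared by measure and outcome
x_true = 0.8 * z + rng.normal(size=n)       # true level of the phenomenon
x_obs = x_true + 0.7 * bias                 # reported score, contaminated by the bias
y = 1.5 * x_true + 1.5 * bias + rng.normal(size=n)  # outcome, also touched by the bias

def ols(X, y):
    """OLS with an intercept; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(f"naive OLS slope: {ols(x_obs, y)[1]:.2f}")         # distorted by the shared bias
x_hat = np.column_stack([np.ones(n), z]) @ ols(z, x_obs)  # stage 1: fitted values from z
print(f"2SLS slope:      {ols(x_hat, y)[1]:.2f}")         # ~1.5, the true coefficient
```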

Joakim Kreutz (2015) also scrutinises the methodological foundations of the PTS and CIRI. By referring to both indices, he tries to clarify the connection between human rights and the level of state repression in eighteen West African countries. But instead of focusing on repression levels, Kreutz focuses on changes in repression. By highlighting the importance of repression dynamics, he extends prior evidence on the connection of state repression and politico-institutional factors. From a measurement perspective, disaggregating levels of repression by the direction of change (increase/decrease) and by the nature of repressive actions (indiscriminate, selective targeting) may improve our understanding of the contextual features of repression dynamics. His study provides several implications for current research efforts that try to disentangle the relationship between levels of democracy and state repression.

Alexander Schmotz identifies a gap in the political science literature regarding the measurement of co-optation, the process by which non-members are absorbed into a ruling elite. Concepts of co-optation are particularly important for explaining the persistence of autocratic regimes. As such, issues of co-optation are at the heart of political science research but are only seldom operationalised, especially across time. Schmotz develops an index that is capable of measuring several threats posed to autocratic regimes by social pressure groups; co-optation is a way of dealing with these threats. This topic illustrates a general problem in social science research, namely that theoretical ideas, their predictions about causes and effects, and their testing in empirical research are often intertwined. In such a situation, measurement quality (e.g., content validity) is also related to the performance of the index, in particular if the concept of co-optation refers to a ‘seemingly unrelated set of indicators’ (Schmotz, 2015). Counterintuitive findings are then of particular importance, as in Schmotz's study. He concludes that the concept of co-optation might not be as important as the relevant literature suggests. Such a finding – based on a new index with the potential for testing and improving its measurement features – will incite discussion in this field and will most likely lead to refinements of theoretical ideas and their operationalisations.

Barbara Bechter and Bernd Brandl (2015) start with the observation that comparative research is mainly based on aggregates at the national level. This ‘methodological nationalism’ comes to a dead end if the variance between countries on the variable of interest vanishes (as typically occurs for political regime indicators in western countries, such as the Polity index). They provide an excellent example of how to account for the contextuality of comparative research measures, finding that, in the field of industrial relations, relevant variables reveal more variability across industrial sectors than across countries. This does not render cross-country comparisons meaningless. Rather, it opens the perspective towards alternative levels of analysis, and not only in the field of industrial relations.

William Pollock, Jason Barabas, Jennifer Jerit, Martijn Schoonvelde, Susan Banducci and Daniel Stevens (2015) introduce their study of media effects with the statement that results from analyses of the effect of media exposure on certain attitudes or public opinion are affected by ‘data issues related to the number of observations, the timing of the inquiry, and (most importantly) the design choices that lead to alternative counterfactuals’ (Pollock et al, 2015). In an attempt to provide a comprehensive overview, two identification strategies for causal claims from cross-country or single-country survey data (a difference-in-differences estimator versus a within-survey/within-subject design) are compared with a traditional approach of statistical inference from regression analyses. Using the European Social Survey and information about media-related events during the data collection process allows them to investigate the media effects of political or economic events across countries, across types and numbers of events, and across time. With a focus on the external validity of such (quasi-)experimental uses of survey data, they are able to generate partly counterintuitive results regarding the impact of sample size and design effects. Their study emphasises that the process of data collection and design choices have an important impact on subsequent data analyses.
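The difference-in-differences logic that Pollock et al set against conventional regression inference can be illustrated with a bare-bones example. The following simulation is mine, with made-up numbers: an 'event effect' is recovered by differencing out both the baseline gap between exposed and unexposed respondents and the common time trend.

```python
# Bare-bones difference-in-differences on simulated survey data.
import numpy as np

rng = np.random.default_rng(3)
n = 4000
exposed = rng.integers(0, 2, n)    # respondents reachable by the media event
post = rng.integers(0, 2, n)       # interviewed after the event occurred
effect = 0.5                       # the causal quantity to recover

y = (1.0 + 0.3 * exposed           # baseline gap between groups
     + 0.2 * post                  # common time trend
     + effect * exposed * post
     + rng.normal(0.0, 1.0, n))

def cell_mean(e, p):
    return y[(exposed == e) & (post == p)].mean()

did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(f"DiD estimate: {did:.2f}")  # ~0.50
```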

By referring to psychometric techniques, Jan Cieciuch et al (2015) raise the question of how to reliably test for measurement invariance. As a precondition for comparing data, measurement invariance can be determined at the level of theoretical constructs (or latent variables), at the level of the relations between the theoretical constructs and their indicators, or at the level of the indicators themselves. Standard methods for pinpointing measurement invariance based on factor analytical techniques are prone to produce false inferences due to model misspecification. Cieciuch and his colleagues pick up the discussion in the literature about model misspecification and show how one can assess whether a certain level of measurement invariance is obtained. As misspecification must be considered a matter of degree, their study stimulates discussion about how much misspecification is acceptable.

Footnote 1: King et al (1994: 25) noted early on that achieving reliability and validity represents a key goal of any social inquiry, whether qualitative or quantitative in nature.

Footnote 2: This change does not imply a shift from deductive to inductive reasoning from data to theories, because researchers remain bound to deriving their results from a theoretical framework. The nomological core of the data-driven approach stems from the distributional characteristics of different probability distributions. See Gelman and Shalizi (2014) for more details on this line of reasoning.

References

Adcock, R. and Collier, D. (2001) ‘Measurement validity: A shared standard for qualitative and quantitative research’, American Political Science Review 95(3): 529–546.


Beck, N. (2007) ‘From statistical nuisances to serious modeling: Changing how we think about the analysis of time-series–cross-section data’, Political Analysis 15 (2): 97–100. doi:10.1093/pan/mpm001.

Bechter, B. and Brandl, B. (2015) ‘Measurement and analysis of industrial relations aggregates: What is the relevant unit of analysis in comparative research?’ European Political Science 14(4): 422–438.

Bollen, K.A. (1989) Structural Equations with Latent Variables, New York, NY: Wiley.


Bollen, K.A. (2009) ‘Liberal democracy series I, 1972–1988: Definition, measurement, and trajectories’, Electoral Studies 28 (3): 368–374.

Brodeur, A., Lé, M., Sangnier, M. and Zylberberg, Y. (2012) ‘Star wars: The empirics strike back’, Paris School of Economics Working Paper 2012–29.

Campbell, D.T. and Fiske, D.W. (1959) ‘Convergent and discriminant validation by the multitrait-multimethod matrix’, Psychological Bulletin 56(2): 81–105.

Cieciuch, J., Davidov, E., Oberski, D.L. and Algersheimer, R. (2015) ‘Testing for measurement invariance by detecting local misspecification and an illustration across online and paper-and-pencil samples’, European Political Science 14(4): 521–538.

Fairbrother, M. (2014) ‘Two multilevel modeling techniques for analyzing comparative longitudinal survey datasets’, Political Science Research and Methods 2 (1): 119–140.

Gelman, A., Carlin, J., Stern, H., Dunson, D.B., Vehtari, A. and Rubin, D. (2014) Bayesian Data Analysis, 3rd edn. London: CRC Press.


Gelman, A. and Shalizi, C. (2014) ‘Philosophy and the practice of Bayesian statistics’, British Journal of Mathematical and Statistical Psychology 66 (1): 8–38.

Gelman, A. and Loken, E. (2014) ‘The statistical crisis in science: Data-dependent analysis – a “garden of forking paths” – explains why many statistically significant comparisons don't hold up’, American Scientist 102(6): 460. doi:10.1511/2014.111.460.

Gerring, J. (2001) Social Science Methodology: A Criterial Framework, Cambridge: Cambridge University Press.

Hawken, A. and Munck, G.L. (2011) ‘Does the evaluator make a difference? Measurement validity in corruption research’, Working paper.

Herrera, Y.M. and Kapur, D. (2007) ‘Improving data quality: Actors, incentives, and capabilities’, Political Analysis 15 (4): 365–386.

Jackman, S. (2009) Bayesian Analysis for the Social Sciences, New York: John Wiley & Sons.

Kaufmann, D., Kraay, A. and Mastruzzi, M. (2010) ‘Response to ‘what do the worldwide governance indicators measure?’’, European Journal of Development Research 22 (1): 55–58.

King, G., Keohane, R.O. and Verba, S. (1994) Designing Social Inquiry: Scientific Inference in Qualitative Research, Princeton, NJ: Princeton University Press.

Kreutz, J. (2015) ‘Separating dirty war from dirty peace: Revisiting the conceptualization of state repression in quantitative data’, European Political Science 14(4): 458–472.

Mudde, C. and Schedler, A. (2010) ‘Introduction: Rational data choice’, Political Research Quarterly 63 (2): 410–416.

Neumann, R. and Graeff, P. (2010) ‘A multitrait-multimethod approach to pinpoint the validity of aggregated governance indicators’, Quality & Quantity 44 (5): 849–864.

Neumann, R. and Graeff, P. (2013) ‘Method bias in comparative research: Problems of construct validity as exemplified by the measurement of ethnic diversity’, Journal of Mathematical Sociology 37 (2): 85–112.

Nieman, M.D. and Ring, J.J. (2015) ‘The construction of human rights: Accounting for systematic bias in common human rights measures’, European Political Science 14(4): 473–495.

Nunnally, J.C. and Bernstein, I.H. (1978) Psychometric Theory, New York: McGraw-Hill.

Pickel, S., Stark, T. and Breustedt, W. (2015) ‘Assessing the quality of quality measures of democracy: a theoretical framework and its empirical application’, European Political Science 14(4): 496–520.

Pollock, W., Barabas, J., Jerit, J., Schoonvelde, M., Banducci, S. and Stevens, D. (2015) ‘Studying media events in the European social surveys across research designs, countries, time, issues, and outcomes’, European Political Science 14(4): 394–421.

Schmotz, A. (2015) ‘Vulnerability and compensation – Constructing an index of co-optation in autocratic regimes’, European Political Science 14(4): 439–457.

Solt, F. (2014) ‘The Standardized World Income Inequality Database‘, Working paper. SWIID Version 5.0, October 2014. http://myweb.uiowa.edu/fsolt/index.html .

Sullivan, J.L. and Feldman, S. (1979) ‘Multiple indicators – An introduction‘ Sage University Paper series in Quantitative Applications in the Social Sciences No. 07–15, Beverly Hills and London: Sage.

Thomas, M. (2010) ‘What do the worldwide governance indicators measure?’ European Journal of Development Research 22 (1): 31–54.

Varian, H.R. (2014) ‘Big data: New tricks for econometrics’, The Journal of Economic Perspectives 28 (2): 3–28.


Acknowledgements

Parts of this Special Issue follow upon the symposium ‘The Quality of Measurement – Validity, Reliability and its Ramifications for Multivariate Modelling in Social Sciences’ held at Technische Universität Dresden from 21 to 22 September 2012. Videos of the presentations from the Symposium can be accessed through the website of the symposium at http://tinyurl.com/vwmeasurement . This symposium was financed by the Volkswagen Foundation, which supported the publication of this special issue as well. We thank all participants of the symposium for their remarks and contributions. Foremost, we thank the Volkswagen Foundation for their financial support.

Author information

Authors and affiliations

Technische Universität Dresden, Dresden, 01069, Germany

Robert Neumann

University of Kiel, Christian-Albrechts-Platz 4, Kiel, 24118, Germany

Peter Graeff


Corresponding author

Correspondence to Robert Neumann.

Additional information

The online version of this article is available Open Access


About this article

Neumann, R. and Graeff, P. Quantitative approaches to comparative analyses: data properties and their implications for theory, measurement and modelling. Eur Polit Sci 14, 385–393 (2015). https://doi.org/10.1057/eps.2015.59


Published : 06 November 2015

Issue Date : 01 December 2015


  • reliability
  • measurement
  • quantitative analysis
  • comparative politics
  • comparative sociology


Qualitative vs. Quantitative Research | Differences, Examples & Methods

Published on April 12, 2019 by Raimo Streefkerk. Revised on June 22, 2023.

When collecting and analyzing data, quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings. Both are important for gaining different kinds of knowledge.

Common quantitative methods include experiments, observations recorded as numbers, and surveys with closed-ended questions.

Quantitative research is at risk for research biases including information bias, omitted variable bias, sampling bias, or selection bias.

Qualitative research

Qualitative research is expressed in words. It is used to understand concepts, thoughts or experiences. This type of research enables you to gather in-depth insights on topics that are not well understood.

Common qualitative methods include interviews with open-ended questions, observations described in words, and literature reviews that explore concepts and theories.

Table of contents

  • The differences between quantitative and qualitative research
  • Data collection methods
  • When to use qualitative vs. quantitative research
  • How to analyze qualitative and quantitative data
  • Other interesting articles
  • Frequently asked questions about qualitative and quantitative research

Quantitative and qualitative research use different research methods to collect and analyze data, and they allow you to answer different kinds of research questions.

Qualitative vs. quantitative research

Quantitative and qualitative data can be collected using various methods. It is important to use a data collection method that will help answer your research question(s).

Many data collection methods can be either qualitative or quantitative. For example, in surveys, observational studies or case studies , your data can be represented as numbers (e.g., using rating scales or counting frequencies) or as words (e.g., with open-ended questions or descriptions of what you observe).

However, some methods are more commonly used in one type or the other.

Quantitative data collection methods

  • Surveys :  List of closed or multiple choice questions that is distributed to a sample (online, in person, or over the phone).
  • Experiments : Situation in which different types of variables are controlled and manipulated to establish cause-and-effect relationships.
  • Observations : Observing subjects in a natural environment where variables can’t be controlled.

Qualitative data collection methods

  • Interviews : Asking open-ended questions verbally to respondents.
  • Focus groups : Discussion among a group of people about a topic to gather opinions that can be used for further research.
  • Ethnography : Participating in a community or organization for an extended period of time to closely observe culture and behavior.
  • Literature review : Survey of published works by other authors.

A rule of thumb for deciding whether to use qualitative or quantitative data is:

  • Use quantitative research if you want to confirm or test something (a theory or hypothesis )
  • Use qualitative research if you want to understand something (concepts, thoughts, experiences)

For most research topics you can choose a qualitative, quantitative or mixed methods approach . Which type you choose depends on, among other things, whether you’re taking an inductive vs. deductive research approach ; your research question(s) ; whether you’re doing experimental , correlational , or descriptive research ; and practical considerations such as time, money, availability of data, and access to respondents.

Quantitative research approach

You survey 300 students at your university and ask them questions such as: “on a scale from 1-5, how satisfied are you with your professors?”

You can perform statistical analysis on the data and draw conclusions such as: “on average students rated their professors 4.4”.

Qualitative research approach

You conduct in-depth interviews with 15 students and ask them open-ended questions such as: “How satisfied are you with your studies?”, “What is the most positive aspect of your study program?” and “What can be done to improve the study program?”

Based on the answers you get you can ask follow-up questions to clarify things. You transcribe all interviews using transcription software and try to find commonalities and patterns.

Mixed methods approach

You conduct interviews to find out how satisfied students are with their studies. Through open-ended questions you learn things you never thought about before and gain new insights. Later, you use a survey to test these insights on a larger scale.

It’s also possible to start with a survey to find out the overall trends, followed by interviews to better understand the reasons behind the trends.

Qualitative or quantitative data by itself can’t prove or demonstrate anything, but has to be analyzed to show its meaning in relation to the research questions. The method of analysis differs for each type of data.

Analyzing quantitative data

Quantitative data is based on numbers. Simple math or more advanced statistical analysis is used to discover commonalities or patterns in the data. The results are often reported in graphs and tables.

Applications such as Excel, SPSS, or R can be used to calculate things like the following (a short code sketch appears after the list):

  • Average scores ( means )
  • The number of times a particular answer was given
  • The correlation or causation between two or more variables
  • The reliability and validity of the results
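For instance, a few lines of Python (standing in here for Excel, SPSS, or R, with made-up ratings) cover the first three items on the list:

```python
# Mean, answer frequencies, and a correlation for a small made-up survey.
import numpy as np

ratings = np.array([4, 5, 3, 4, 5, 4, 2, 5, 4, 4])        # 1-5 satisfaction scores
hours = np.array([10, 14, 6, 9, 15, 11, 4, 16, 10, 12])   # weekly study hours

print(f"mean rating: {ratings.mean():.1f}")                # average score
values, counts = np.unique(ratings, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))         # how often each answer was given
print(f"correlation: {np.corrcoef(ratings, hours)[0, 1]:.2f}")  # rating vs. study time
```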

Analyzing qualitative data

Qualitative data is more difficult to analyze than quantitative data. It consists of text, images or videos instead of numbers.

Some common approaches to analyzing qualitative data include:

  • Qualitative content analysis : Tracking the occurrence, position and meaning of words or phrases (a toy example follows this list)
  • Thematic analysis : Closely examining the data to identify the main themes and patterns
  • Discourse analysis : Studying how communication works in social contexts
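As a toy illustration of the first approach, the snippet below (using invented transcript snippets) counts how often candidate theme words occur across interview responses:

```python
# Minimal word-frequency content analysis over invented transcripts.
from collections import Counter
import re

transcripts = [
    "I feel supported by my professors, but the workload is heavy.",
    "The workload is fine; what I value most is feeling supported.",
    "More feedback would help. The workload leaves little time.",
]

tokens = Counter()
for text in transcripts:
    tokens.update(re.findall(r"[a-z']+", text.lower()))

for theme in ["supported", "workload", "feedback"]:
    print(f"{theme}: {tokens[theme]}")  # occurrences of each candidate theme word
```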

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square goodness of fit test
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Inclusion and exclusion criteria

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

The research methods you use depend on the type of data you need to answer your research question .

  • If you want to measure something or test a hypothesis , use quantitative methods . If you want to explore ideas, thoughts and meanings, use qualitative methods .
  • If you want to analyze a large amount of readily-available data, use secondary data. If you want data specific to your purposes with control over how it is generated, collect primary data.
  • If you want to establish cause-and-effect relationships between variables , use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.

Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organizations.

There are various approaches to qualitative data analysis , but they all share five steps in common:

  • Prepare and organize your data.
  • Review and explore your data.
  • Develop a data coding system.
  • Assign codes to the data.
  • Identify recurring themes.

The specifics of each step depend on the focus of the analysis. Some common approaches include textual analysis , thematic analysis , and discourse analysis .

A research project is an academic, scientific, or professional undertaking to answer a research question . Research projects can take many forms, such as qualitative or quantitative , descriptive , longitudinal , experimental , or correlational . What kind of research approach you choose will depend on your topic.

Cite this Scribbr article


Streefkerk, R. (2023, June 22). Qualitative vs. Quantitative Research | Differences, Examples & Methods. Scribbr. Retrieved September 2, 2024, from https://www.scribbr.com/methodology/qualitative-quantitative-research/


Review Article | Open access | Published: 31 August 2024

Self-regulated learning in ESL/EFL contexts: a methodological exploration

Omid Mazandarani

Humanities and Social Sciences Communications volume 11, Article number: 1118 (2024)


Subject: Language and linguistics

The present systematic review provides an overview and analysis of the methodological underpinnings of self-regulated learning (SRL) research in ESL/EFL contexts. A search of five academic databases was conducted for studies published from 2017 to 2022. Adopting the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, the search yielded 31 studies conducted in various countries and educational settings. Informed by a 16-item coding scheme, the analysis found that SRL research is more nested within higher education. The results provided evidence to substantiate the idea that quantitative approaches towards SRL research are in the ascendancy. Experimental and survey designs were identified as the most preferred research designs. The results revealed an absolute dominance of the questionnaire/scale as the most frequently utilised data collection instrument. As for data analysis software, SPSS and Mplus were applied in the majority of studies. The results demonstrated that correlation, confirmatory factor analysis (CFA), and structural equation modelling (SEM) were among the most widely applied statistical tests. Finally, writing, compared to other language skills/subskills, was found to receive a surge of interest in SRL research. The study concludes with some suggestions for future research.


Introduction

With the global popularisation of English medium instruction in higher education institutions in non-English-speaking countries around the world, researchers have inevitably been drawn to the significance of maximising students' learning opportunities in English as a second or foreign language (ESL/EFL) contexts. To this end, identifying variables which may help promote learning outcomes has become a sine qua non for theory, policy, and practice, exerting influence on students' learning experience (Ardasheva et al., 2017, p. 544). Indeed, a wide array of factors and variables conducive to learners' learning experiences has been reported and documented in the literature, including mode of instructional delivery such as blended learning (e.g., Bouilheres et al., 2020), gamification (e.g., Kian Tan et al., 2023), virtual and augmented reality (e.g., Jiawei et al., 2024; Videnovik et al., 2020), the flipped classroom (e.g., Sointu et al., 2023), classroom climate (e.g., Li et al., 2023), artificial intelligence (AI) (e.g., Díaz and Nussbaum, 2024), and instructional quality and student satisfaction (e.g., Yang et al., 2023), to name but a few.

In a similar vein, one variable, amongst others, which has long been considered an element in students' success is self-regulated learning (e.g., Zimmerman, 1990, p. 4). As a “desirable educational outcome” (Paris and Newman, 1990, p. 87) encompassing a good number of variables influential in learning (Panadero, 2017, p. 1), self-regulated learning has been evidenced as a precious asset to students. The surge of interest in self-regulated learning research over the past decades has culminated in the emergence of several models (e.g., Boekaerts, 2017; Winne and Hadwin, 1998; Zimmerman, 1989), each of which has been the focus of several review studies (e.g., Panadero, 2017; Puustinen and Pulkkinen, 2001).

Nevertheless, since coming into its own in the 1980s, self-regulated learning research has mostly resided in the fields of mainstream general education and educational psychology. Only later did the notion gain momentum in ESL/EFL contexts, especially over the last decade (e.g., Bai and Wang, 2023; Kondo et al., 2012). Given that contextual constraints tend to thwart and exert impact on efforts at regulation (Pintrich, 2004, p. 387), examining how self-regulated learning strategies and models transpire and manifest themselves in ESL/EFL contexts – where the medium of instruction differs from students' mother tongue and is imbued with idiosyncratic subtleties (Mazandarani and Troudi, 2022) – is of great importance. Despite all the endeavours made so far, drawing solid conclusions as to how self-regulated learning interacts with diverse educational variables and covariates (e.g., different types of language skills, subskills, and components, level of education, gender, age, level of proficiency) remains rather enigmatic in language teaching contexts. As such, several researchers have referred to instances of inconsistent findings, leaving lacunae in the SRL literature (see e.g., Chen, 2022; Guo et al., 2023; Shen and Bai, 2022). For instance, in their study on self-regulated learning strategies in a flipped course, Öztürk and Çakıroğlu (2021, p. 1) found that whereas students' speaking, reading, writing, and grammar performances benefited significantly from SRL, their listening performance showed no significant difference.

This predicament could be partly due to the fact that research into the dynamics and mechanisms of self-regulated learning in ESL/EFL contexts, compared to mainstream general education, appears to be in its infancy, especially when it comes to understanding the paradigmatic underpinnings of SRL research. In the current educational research milieu, in which, as Pring (2000a) eloquently contends, there is a bulk of “bad research” (p. 5), gaining deep insights into how to design, collect, analyse, and interpret accurate data, and how to draw robust conclusions, is of high significance. As a prized asset to researchers (Mazandarani, 2022a, p. 217), awareness of the paradigmatic nature of what is to be researched is quite seminal, on the very grounds that rigorous findings in a research project tend to be contingent upon solid ontological, epistemological, and methodological assumptions, which per se lay the groundwork for the selection of appropriate methods and instruments. Yet, research has shown that researchers rarely make the underlying philosophical assumptions explicit in their works (Mazandarani, 2022a). On such grounds, therefore, one recommended course of action for enabling researchers to make sense of the past, present, and future directions of what they research is to appreciate the importance of the methodological approaches taken to their research topics. One way of doing so lies in conducting meta-analyses and systematic review studies. Whilst the literature on different dimensions of SRL in mainstream education hosts various meta-analysis and systematic review studies offering rich perspectives (e.g., Broadbent and Poon, 2015; Dignath et al., 2008; Jansen et al., 2019; Panadero, 2017; Sitzmann and Ely, 2011; Theobald, 2021), it is somewhat young and in a state of flux in ESL/EFL contexts, with some recent meta-analytic works (e.g., Ardasheva et al., 2017; Chen, 2022; Yang et al., 2023) of limited methodological-analytical scope. This study is, therefore, one of the first of its kind to delve into the different philosophical and methodological dimensions of SRL research.

Methodological underpinnings in educational research

Educational research is a messy and convoluted enterprise with trade-offs (Cohen et al., 2018, p. 3). Despite several decades of philosophical discussion, methodological concepts remain ambiguous and opaque (Hammersley, 2023, p. 12). Such vagueness in terminology has led to the interchangeable use of methodology and method not only by some researchers but, more surprisingly, by some journals (Mazandarani, 2022a, p. 218). Of note is that researchers' methodological choices cannot be exercised in a vacuum, devoid of philosophical positions. Inasmuch as philosophical assumptions exert deep influence on the conduct of research (Pring, 2000b, p. 88) and have implications for researchers' methodological concerns (Cohen et al., 2018, p. 6), probing into them is a high priority for researchers, who inevitably bring a number of assumptions to their adopted methodologies (Crotty, 1998, p. 7). Despite serving as a desideratum in educational research, however, philosophy, along with its underlying assumptions, tends to escape researchers' attention, presumably owing to the intricacy and abstractness of philosophical assumptions (Mazandarani, 2022a, p. 218). Understanding the methodology of research projects is vital, in that not only does it tell us about researchers' philosophical stances, but it also provides the rationale that lies behind the chosen methods (Crotty, 1998, p. 7), instrumentation, and data collection (Cohen et al., 2018, p. 3). Good researchers are responsible and disciplined (Dörnyei, 2007, p. 17), and accountable for what they add to the literature. To achieve this, research needs to be philosophically and methodologically well-informed. As such, the methodological analysis of state-of-the-art research on a given topic can provide invaluable information as to which research philosophies and worldviews dominate research on that topic. This is quite important, as how researchers view ‘truth’ and what they consider ‘knowledge’ tend to have direct and indirect implications for theory, policy, and, more importantly, practice. However, tracing the philosophical underpinnings of a research study may not be straightforward. In doing so, one is required to understand beforehand the competing research paradigms, which, as Lincoln, Lynham and Guba (2018, p. 214) argue, have begun to “interbreed”. Irrespective of the type of research paradigm (e.g., positivist, interpretivist, pragmatist) undergirding a given research project, researchers need to be cognisant of the impact of philosophical stances on their methodological decisions. For instance, those who espouse a positivist position will favour experiments and surveys, whereas those who give countenance to anti-positivist standpoints will opt for interpretive approaches such as observation (Cohen et al., 2018, p. 6). A review of the literature on SRL simply substantiates the paucity of research on philosophical and methodological issues in SRL studies in the field of ESL/EFL research. It is, therefore, the aim of the present study to address this gap through a systematic review of the articles published on SRL in the past few years. Systematic reviews have come to prominence in recent years (Bryman, 2012, p. 103). As Andrews (2005, p. 404) posits, the existing body of knowledge deserves reviewing, and systematic reviews provide an opportunity to synthesise research findings in the existing literature. Among the several functions of systematic reviews, as Andrews (2005, p. 409) continues, is the analysis of the methodological approaches adopted towards a research topic, exploring where methodological flaws lie.

Previous systematic reviews and meta-analyses on SRL in ESL/EFL contexts

As mentioned, silence prevails in the literature on systematic review and meta-analysis studies pertinent to SRL in ESL/EFL contexts, especially when it comes to methodological reflexivity. This section, therefore, provides an overview of the most salient attempts mentioned in the literature. In their meta-analysis of 37 articles, Ardasheva et al. (2017) explored how language learning strategy instruction is associated with self-regulated learning. Supporting the link between the two variables, the results called for further attention to self-regulated learning in strategy instruction research (Ardasheva et al., 2017, p. 544). In her meta-analytic study of 16 articles, Chen (2022) investigated the extent to which SRL interventions are effective for students' achievement, strategy employment, and self-efficacy; the results supported the effectiveness of SRL interventions (Chen, 2022, p. 14). In a similar study, Xu et al. (2023a) addressed the effect of SRL interventions on students' academic achievement in both online and blended learning environments across different levels of education. A review of 50 articles showed a moderate and positive effect exerted by SRL interventions on students' performance in elementary, secondary, and higher education contexts as well as informal settings (Xu et al., 2023a, p. 2911). Perhaps the most relevant systematic review, which partly addressed the methodological issues surrounding SRL, is that of Yang, Wen and Song (2023). Focusing on technology-enhanced SRL strategies, their systematic review of 34 studies conducted from 2011 to 2020 substantiated the preponderance of quantitative methods, placing emphasis on outcome rather than process in SRL learning (Yang et al., 2023, p. 31).

As can be seen in the above-reviewed literature, various methodological aspects of SRL research in ESL/EFL contexts have not yet been addressed. Pursuant to such a lacuna, in this systematic review study, I provide an overall picture of the status quo of the epistemological and methodological approaches of state-of-the-art ESL/EFL-specific SRL research. In particular, this paper delves into the methodological issues surrounding SRL research, such as which research paradigms, designs, data collection instruments, data analysis software, and statistical techniques, amongst others, tend to be adopted by researchers. To partly bridge the gap, the following research questions were posed:

1. What are the paradigmatic and methodological features of SRL research in ESL/EFL contexts?

2. What is the geographical distribution of SRL research?

3. To what extent are different levels of education addressed in SRL research?

4. What aspects and variables of language teaching germane to SRL are investigated?

In order to ensure the rigour and robustness of the review process, this systematic review was informed by PRISMA (Page et al., 2021) as its guiding framework.

Search strategy

The adopted multi-phase search strategy encompassed searching the most relevant terms and queries in five major electronic academic databases: ScienceDirect, SpringerLink, Taylor & Francis Online, Wiley Online Library, and SageJournals. The rationale behind this is that they publish journals which are mostly indexed by the two most well-known academic indexing databases, i.e., Elsevier's Scopus and Clarivate Analytics' Web of Science. As the most frequently used databases for bibliometric analysis (Singh et al., 2021), the journals covered in Scopus and Web of Science ensured the extraction of rigorous, quality articles for this study. In order to minimise search bias, a multi-phase searching strategy was applied. Given that different abbreviations (“EFL”, “ESL”, and “L2”) are inconsistently and interchangeably used in the literature to refer to English language education contexts, I used the three abbreviations separately, together with the “self-regulated learning” query. This means that the search was repeated 15 times to maximise the search hits. However, given that the initial search yielded abundant results embracing irrelevant studies, and in order to make the search more precise, the search was constrained with the parameter of including “self-regulated learning” in “title” AND “EFL” OR “ESL” OR “L2” “anywhere” (see Table 1). This modification allowed for finding the most relevant studies on self-regulated learning in English language education contexts. Finally, in order to understand the state-of-the-art trends of research on SRL, the search, conducted in July 2023, was set to cover studies published from 2017 to 2022. The rationale behind the selection of a six-year period for SRL research was twofold. First, the present study is an attempt to present an updated, state-of-the-art understanding of SRL research. Second, the literature shows that SRL research in L2 contexts has been booming in the past few years, as a consequence of which the selected period is deemed saturated enough with quality, relevant studies. As Gan, Liu and Yang (2020) contend, SRL has in recent years come into its own as an educational innovation.
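The 15 search runs follow directly from crossing the three context abbreviations with the five databases. A short sketch of that grid is given below; the query template is illustrative only, since each database has its own search syntax.

```python
# Enumerating the 3 x 5 = 15 database searches described above.
from itertools import product

databases = ["ScienceDirect", "SpringerLink", "Taylor & Francis Online",
             "Wiley Online Library", "SageJournals"]
contexts = ["EFL", "ESL", "L2"]

searches = [(db, f'title:"self-regulated learning" AND anywhere:"{ctx}"')
            for db, ctx in product(databases, contexts)]

print(len(searches))            # 15 search runs
for db, query in searches[:3]:  # a peek at the first few
    print(f"{db}: {query}")
```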

Eligibility criteria

In order to exclude grey literature and eliminate irrelevant articles, a set of inclusion/exclusion criteria was applied (see Table 2). For a study to be considered eligible for the article pool, it had to investigate the variable or case of SRL in English language education contexts and, being peer-reviewed, report an original study. Therefore, all other types of academic publications, including reviews, conceptual/theoretical papers, short communications, book chapters, conference proceedings, editorials, etc., were excluded. In line with the PRISMA 2020 flowchart (Page et al., 2021, p. 6), the search was performed in four stages, as illustrated in Fig. 1.

Figure 1. PRISMA 2020 flow diagram adopted for the systematic review (Page et al., 2021, p. 6).

Data coding and analysis

After proposing a set of parameters and running a rigorous search of the electronic databases, an initial pool of 69 articles relevant to the search strings was aggregated. In the next phase, all titles and abstracts of the extracted articles were screened against a screening guide in which all eligibility criteria were identified. Where it was not possible to judge the relevance of an article from its title and abstract, the full text was examined to check the eligibility criteria and make the final decision. Subsequent to several stages of screening and pruning, informed by the PRISMA framework, 38 articles were removed, leaving 31 eligible articles for the final data analysis. In the next phase, the extracted full texts were subjected to content analysis using a pre-determined coding schedule, in consonance with the proposed research questions. To this end, labelled with a unique ID, the full text of each paper was screened for publisher, journal title, year of publication, geographical context, level of education, methodological approach, data collection instruments, variables, data analysis software, design of the study, philosophical assumptions, type and number of participants, number of authors, and main data analysis tests and techniques. Apart from screening the relevant sections of each article, Adobe Acrobat's FIND function was used to locate the information needed for the content analysis, as indicated in the coding scheme.

The statistical analysis and thematic content analysis of the final 31 articles were conducted using SPSS version 27, NVivo version 12, and Microsoft Excel. SPSS was used to produce a descriptive profile of the articles, codifying the thematic content analysis carried out for each article. Adopting an inductive thematic analysis approach, NVivo was additionally used to complement the data exploration. In order to identify the most frequent words and concepts mentioned in the relevant sections of the selected articles, the summary function of NVivo was used. On this ground, the titles, abstracts, and keywords sections of all articles were extracted from the full texts, ensuring the elimination of redundant information.

Preliminary analysis

Distribution of articles by publishers

The number of articles extracted from each publisher database is shown in Table 3. The majority of articles were obtained from Taylor & Francis Online, with 12 articles (38.7%).

Distribution of articles by journals

As for the titles of the journals included in the analysis, Fig. 2 shows that the final 31 articles were published in 24 journals, with System, Cogent Education, International Journal of Educational Research, Computer Assisted Language Learning, and International Journal of Bilingual Education and Bilingualism leading the distribution.

Figure 2. Journal-wise distribution of articles.

Distribution of articles by year

Figure 3 illustrates the dispersion of articles over a six-year span from 2017 to 2022, as follows:

Figure 3. Year-wise distribution of articles.

Country-wise distribution of articles

The analysis of the countries where the selected studies were conducted is provided in Table 4, representing all continents except Africa and Antarctica. Of great note is the number of studies conducted in Hong Kong and China, which accounts for nearly half the studies (48.4%). Notably, the results showed that an absolute majority of studies (71%) were conducted in Asian countries. It is also of note that 11,160 participants took part in the selected studies, which were conducted by 76 researchers.

Levels of education-wise distribution of articles

The analysis of context and participants of the selected studies demonstrated that SRL research targeted both K-12 and higher education contexts, with 13 studies (41.9%) and 17 studies (54.8%), respectively (see Table 5 ).

Main analysis

Methodological approaches of SRL research

As shown in Fig. 4, a huge majority of articles (80.6%) were conducted quantitatively (e.g., Cho et al., 2020; Lin and Dai, 2022), showing the researchers' inclination to adopt a positivist-quantitative paradigm. In contrast, qualitative studies (e.g., Hu and Gao, 2020; Nakata, 2019) and mixed methods studies (e.g., Onah et al., 2020; Xu, 2021) accounted for a very small proportion of articles, just under 10% each.

Figure 4. Methodological approaches of studies.

Research design

The analysis of the research designs adopted in the selected studies revealed that ‘design’ appears to go unnoticed by researchers, inasmuch as a good few articles (41.9%) made no clear reference to the type of design used for the study. Of the remaining articles, as seen in Table 6, (quasi-)experimental designs (e.g., Ferreira et al., 2017; Öztürk and Çakıroğlu, 2021; Teng and Zhang, 2020) and survey designs (e.g., Yi, 2021) were the most adopted research designs, at 22.6% and 16.1%, respectively.

Data collection instruments

As for the instruments and materials used for collecting data, the analysis revealed that researchers deployed a variety of data collection instruments and tools, many of which were used in only a single study. However, as shown in Fig. 5, ‘questionnaire/scale’ was used in 28 out of 31 articles (90.3%) (e.g., Bai and Guo, 2018; Guo et al., 2021; Teng, 2021), making it the most widely used instrument for collecting data. Researchers also utilised ‘test’ in 15 studies (48.4%) (e.g., Öztürk and Çakıroğlu, 2021), making it the second most used data collection instrument, followed by ‘interview’ with 8 studies (25.7%).

Figure 5. Data collection instrument(s).

Data analysis software

The obtained results indicated that researchers incorporated various software tools for data analysis. As can be seen in Table 7, however, more than one-third of the studies (35.5%) made no mention of any statistical analysis software. Among the remaining articles, it was found that SPSS and Mplus were the most utilised data analysis software, appearing in 25.8% (e.g., Ferreira et al., 2017; Lin and Dai, 2022) and 19.4% (e.g., Bai and Wang, 2021; Bai et al., 2021; Yi, 2021) of the selected studies, respectively. As for qualitative data analysis, NVivo was the only analysis software reported in the selected studies (e.g., Alvi and Gillies, 2023; Zhang, 2017).

Main statistical procedure

The analysis of the methods and results sections of the selected articles revealed a wide range of statistical procedures, techniques, and tests used by researchers to answer the proposed research questions. As highlighted in Table 8, correlation and regression were used in 12 studies (38.4%) (e.g., Lin and Dai, 2022), followed by confirmatory factor analysis (CFA) in 10 articles (32%) (e.g., Şahin Kızıl and Savran, 2018) and structural equation modelling (SEM) in 10 articles (32%) (e.g., Tse et al., 2022). The use of ANOVA, MANOVA, ANCOVA, and MANCOVA was also reported, in 12.8%, 12.8%, 9.6%, and 3.2% of articles, respectively.

Variables related to SRL research

An important yet little-researched dimension of the data analysis revolved around the variables (dependent, independent, moderator, etc.) and the language skills and components addressed alongside SRL in the selected studies. As presented in Table 9, in terms of language skills, ‘writing’ was found to be of the highest priority in SRL research, inasmuch as 12 articles (38.6%) addressed writing in one way or another (e.g., Guo et al., 2021; Teng and Zhang, 2020). Three articles (9.7%) targeted ‘reading’ in relation to SRL (e.g., Tse et al., 2022). Online and blended learning were also the focus of three studies (9.6%) (e.g., Lin and Dai, 2022; Zhu et al., 2020).

As for the thematic analysis of the selected articles, the summary function of NVivo was run. Table 10 presents the 20 most frequently used words in the titles, abstracts, and keywords sections of the selected papers. Quite expectedly, words such as ‘self’, ‘learning’, ‘regulated’, ‘strategies’, ‘students’, ‘writing’, ‘motivation’, ‘efficacy’, ‘assessment’, ‘instruction’, ‘online’, and ‘reading’ were among the highly mentioned concepts.
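As a rough open-source analogue of the NVivo word-frequency summary used here (not the author's actual procedure), a few lines of Python suffice; the two sample strings below are invented placeholders for the 31 papers' titles, abstracts, and keywords.

```python
# Rough analogue of NVivo's word-frequency summary: count content words
# across titles, abstracts, and keywords. The strings are placeholders.
import re
from collections import Counter

texts = [
    "Self-regulated learning strategies and motivation in EFL writing",
    "Self-efficacy, assessment, and self-regulated online reading instruction",
]

stopwords = {"and", "in", "the", "of", "a", "an", "to", "for"}
tokens = (w for text in texts for w in re.findall(r"[a-z]+", text.lower()))
freq = Counter(w for w in tokens if w not in stopwords)
print(freq.most_common(20))  # cf. Table 10: 'self', 'learning', 'regulated', ...
```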

This systematic review was conducted to cast new light on the methodological underpinnings of research on SRL in ESL/EFL contexts. Following PRISMA guidelines, the search yielded 31 articles published from 2017 to 2022, the data from which were subjected to content analysis using SPSS 27, NVivo 12, and Microsoft Excel. The obtained results revealed a number of facts, gaps, and issues underlying SRL research in ESL/EFL contexts, which are discussed below in response to the research questions posed for this study.

The results (see Table 4) showed that Asian countries led state-of-the-art research on SRL, with China and Hong Kong accounting for approximately half of the selected studies. This result corroborates the literature on other dimensions of ESL/EFL education. For instance, in his study on L2 teacher education, Mazandarani (2022b, p. 1) found that research on L2 teacher education, compared to mainstream general teacher education, is more nested within Asian countries. Similarly, reviewing technological, pedagogical, and content knowledge (TPACK) research, Tseng et al. (2022, p. 948) identified Asia as the context where most of their selected studies were conducted. It should be emphasised that, with the advent of internationalisation, English as a medium of instruction (EMI) has been in the ascendancy in many non-English-dominant Asian countries. As such, EMI policies in countries such as China are considered a vital ingredient in the internationalisation of higher education (Zhang, 2018, p. 542). It is therefore not surprising that various aspects of ESL/EFL education have become the focus of academic research in non-English-speaking countries, and SRL research is no exception.

The extent to which different educational levels have been targeted by SRL research was also addressed in this study, and the results gave precedence to higher education over K-12 education (see Table 5). This is consistent with the results of the scoping review conducted by Xu et al. (2023b, p. 8), in which higher education was identified as the most widely investigated level of education (71.17%) in SRL research in blended or online educational contexts. This result also echoes Yang, Wen and Song's (2023, p. 35) systematic review of technology-enhanced SRL, in which higher education was the context of 73.5% of studies. The surge of interest in self-regulated learning in higher education could be partly due to students' age-specific learning needs, which differ in higher education vis-à-vis K-12 education. Higher education students, for instance, may need more support to acquire self-regulated learning strategies and skills in order to avail themselves of artificial intelligence technology (Koć-Januchta et al., 2022, p. 18), which is currently a feature of higher education. Another plausible explanation is the convenience of conducting intervention research with adults (Xu et al., 2023b, p. 8).

The preponderance of quantitative research methodology was evidenced by 25 studies (80.6%), followed by qualitative and mixed methods methodologies (9.7% each) (see Fig. 4). This was expected, in that hypothesis testing, treatments, interventions, causal relationships, correlations, and predictions, all informed by an explanatory approach and all frequent in SRL research, are the epitome of quantitative research. This finding is consistent with other SRL reviews in which quantitative studies were the most prevalent type of research (e.g., Junaštíková, 2023; Xu et al., 2023b; Yang et al., 2023).

As for study design, it was quite surprising that 13 articles (41.9%) had no specific section on, or even a clear reference to, the type of design used. A lack of clear and correct reference to study design is a common error in many manuscripts (Praharaj, 2023). One possible explanation is that some journals' author guidelines are rather silent on the 'design' of a study, or make it optional for authors to refer to the design in the 'methods' section. Of the remaining articles, which did highlight the design of the study, experimental designs were used in seven (22.6%), followed by survey and mixed methods designs, at 16.1% and 9.7%, respectively (see Table 6).

A variety of instruments utilised for data collection purposes were identified in this study (see Fig. 5). Questionnaires and scales were found to be the dominant data collection instruments in SRL research, appearing in 28 studies (90.3%). This finding is consistent with Junaštíková's (2023) review of empirical studies on self-regulation of learning. One explanation for such ubiquitous use of questionnaires/scales, closely aligned with the dominance of the quantitative research approach, is that the existence of several well-established, valid, and reliable questionnaires and tests in the literature is convenient for researchers: these instruments can be administered easily and quickly in the pre- and post-intervention phases of research, serving as reliable tools for obtaining numerical data for hypothesis testing. As one of the main data collection tools in a survey design, self-completion questionnaires offer several advantages, such as cheap and quick administration and convenience (Bryman, 2012, p. 233), making them a suitable instrument for SRL research.

When it comes to the software used for data analysis, this review revealed that researchers in more than one-third of the selected studies (35.5%) had a lackadaisical approach towards identifying the quantitative and/or qualitative data analysis software they employed. One underlying reason for such heedlessness, in a similar vein, might emanate from some journals' policies and author guidelines. Another plausible explanation is that it is customary in academia for some researchers to turn to statisticians for assistance with statistical and/or thematic analyses, inasmuch as they see these as a technical domain of enquiry requiring expert statistical knowledge (Mazandarani, 2024, p. 408). As a consequence, data analysis results and reports tend to be the prime concern for researchers, rather than the type or name of the data analysis software per se. In the remaining articles, researchers used a variety of data analysis software, with SPSS leading in frequency (25.8%) (see Table 7). Such a result is not uncommon, in that SPSS is the most frequently applied statistical analysis software in the social sciences (Cohen et al., 2018, p. 725; Dörnyei and Csizér, 2012, p. 83). The literature on SRL, however, is rather silent about the usefulness of the various statistical software packages used for data analysis in SRL research. Further research is therefore needed to investigate which data analysis software can best accommodate SRL researchers' needs in ESL/EFL contexts.

As for statistical tests and procedures, correlation and regression analyses, CFA, and SEM were found to be the most frequently applied by researchers (see Table 8). This finding was expected, on the grounds that one of the main data collection instruments for investigating SRL and its pertinent variables in survey and correlational (associational) designs is the questionnaire/scale; correlation analysis and CFA are typical statistical techniques for questionnaire/scale development and validation. Unfortunately, there is a dearth of literature on this aspect of SRL research against which this finding could be compared and contrasted.

The analysis of the data offered some new insights into the different variables and domains involved in SRL research. As shown in Table 9, 'writing' was the main language skill with respect to which SRL-related investigations were conducted. There are several possible reasons for this surge of interest in writing compared to other language skills and components. First, writing is usually a compulsory course across a wide range of academic programmes in higher education in ESL/EFL contexts. Second, compared to other academic tasks, writing assignments are reported to be more closely connected with student procrastination (Fritzsche et al., 2003, p. 1550). Third, as a multidimensional phenomenon (Bai and Wang, 2021) of utmost significance for academic success and future occupation (Bai et al., 2021, p. 65), writing demands high self-regulation abilities embracing an intricate framework of interdependent processes (Zimmerman and Risemberg, 1997, p. 97). Fourth, research has demonstrated that metacognition is a key ingredient of SRL (e.g., Meyer et al., 2010; Senko, Perry and Greiser, 2022), while cognitive and metacognitive strategies rest at the heart of writing quality (Wischgoll, 2016, p. 1). It is therefore explicable why the literature on SRL has witnessed an upswing in studies targeting the interconnection between SRL strategies and writing, the two concepts being among the most frequently mentioned words in the selected articles.

Limitations

This review has several limitations. First, it was limited to papers published in English, leaving out contexts where publications are in other languages. Second, in order to avoid grey and low-quality literature, the search was limited to five well-known academic databases; although this strategy yielded a collection of quality research papers, it may have resulted in the underrepresentation of many others. Third, although carefully chosen, the combination of search keywords might have excluded some relevant, high-quality articles. Fourth, the adopted coding protocol could have included more, or different, items, generating a richer analysis. Finally, some important information may have been missed while screening for keywords to locate relevant information in the manuscripts using Adobe Acrobat's FIND function.

In view of the surge of attention to SRL research in mainstream general education and, in particular, ESL/EFL education in recent years, there is a need to understand the status quo of SRL research and to identify the associated strengths, weaknesses, opportunities, and threats. Methodological reflexivity provides an opportunity for researchers to reflect on the consequences of their adopted methods, values, biases, and decisions throughout their knowledge production mission (Bryman, 2012, p. 393). This systematic review brings to the fore several facts and gaps underlying SRL research in ESL/EFL contexts, informed by a 16-item coding scheme. One main issue, amongst others, highlighted in this study was the hegemony of etic philosophical positions in SRL research; future research should therefore bring into play more emic approaches. In the same fashion, the existing understanding of SRL research and know-how relies heavily on data obtained via questionnaires, scales, and tests, which can potentially limit researchers' insights into the underlying issues of SRL; further utilisation of various types of data collection instruments can deepen researchers' views. Finally, from among the English language skills, subskills, and components, writing has gained much momentum, and further research is expected to address SRL in relation to the other language skills and components more evenly.

Alvi E, Gillies RM (2023) Self-regulated learning (SRL) perspectives and strategies of Australian primary school students: a qualitative exploration at different year levels. Educ Rev 75(4):680–702. https://doi.org/10.1080/00131911.2021.1948390

Andrews R (2005) The place of systematic reviews in education research. Br J Educ Stud 53(4):399–416. https://doi.org/10.1111/j.1467-8527.2005.00303.x

Ardasheva Y, Wang Z, Adesope OO, Valentine JC (2017) Exploring effectiveness and moderators of language learning strategy instruction on second language and self-regulated learning outcomes. Rev Educ Res 87(3):544–582. https://doi.org/10.3102/0034654316689135

Bai B, Guo W (2018) Influences of self-regulated learning strategy use on self-efficacy in primary school students’ English writing in Hong Kong. Read Writ Q 34(6):523–536. https://doi.org/10.1080/10573569.2018.1499058

Bai B, Wang J (2021) Hong Kong secondary students’ self-regulated learning strategy use and English writing: influences of motivational beliefs. System 96:102404. https://doi.org/10.1016/j.system.2020.102404

Bai B, Wang J (2023) The role of growth mindset, self-efficacy and intrinsic value in self-regulated learning and English language learning achievements. Lang Teach Res 27(1):207–228. https://doi.org/10.1177/1362168820933190

Bai B, Wang J, Nie Y (2021) Self-efficacy, task values and growth mindset: what has the most predictive power for primary school students’ self-regulated learning in English writing and writing competence in an Asian Confucian cultural context? Camb J Educ 51(1):65–84. https://doi.org/10.1080/0305764X.2020.1778639

Boekaerts M (2017) Cognitive load and self-regulation: attempts to build a bridge. Learn Instr 51:90–97. https://doi.org/10.1016/j.learninstruc.2017.07.001

Bouilheres F, Le LTVH, McDonald S, Nkhoma C, Jandug-Montera L (2020) Defining student learning experience through blended learning. Educ Inf Technol 25(4):3049–3069. https://doi.org/10.1007/s10639-020-10100-y

Broadbent J, Poon WL (2015) Self-regulated learning strategies & academic achievement in online higher education learning environments: a systematic review. Internet High Educ 27:1–13. https://doi.org/10.1016/j.iheduc.2015.04.007

Bryman A (2012) Social research methods, 4th edn. Oxford University Press

Chen J (2022) The effectiveness of self-regulated learning (SRL) interventions on L2 learning achievement, strategy employment and self-efficacy: a meta-analytic study [Systematic Review]. Front Psychol. https://doi.org/10.3389/fpsyg.2022.1021101

Cho HJ, Yough M, Levesque-Bristol C (2020) Relationships between beliefs about assessment and self-regulated learning in second language learning. Int J Educ Res 99:101505. https://doi.org/10.1016/j.ijer.2019.101505

Cohen L, Manion L, Morrison K (2018) Research methods in education, 8th edn. Routledge

Crotty M (1998) The foundations of social research: Meaning and perspective in the research process. Sage Publications

Díaz B, Nussbaum M (2024) Artificial intelligence for teaching and learning in schools: the need for pedagogical intelligence. Computers Educ 217:105071. https://doi.org/10.1016/j.compedu.2024.105071

Dignath C, Buettner G, Langfeldt H-P (2008) How can primary school students learn self-regulated learning strategies most effectively?: A meta-analysis on self-regulation training programmes. Educ Res Rev 3(2):101–129. https://doi.org/10.1016/j.edurev.2008.02.003

Dörnyei Z (2007) Research methods in applied linguistics: Quantitative, qualitative, and mixed methodologies. Oxford University Press

Dörnyei Z, Csizér K (2012) How to design and analyze surveys in second language acquisition research. In: A Mackey, SM Gass (eds) Research methods in second language acquisition: a practical guide. Wiley-Blackwell, pp 74–94

Ferreira PC, Simão AMV, da Silva AL (2017) How and with what accuracy do children report self-regulated learning in contemporary EFL instructional settings? Eur J Psychol Educ 32(4):589–615. https://doi.org/10.1007/s10212-016-0313-x

Fritzsche BA, Rapp Young B, Hickson KC (2003) Individual differences in academic procrastination tendency and writing success. Personal Individ Differ 35(7):1549–1557. https://doi.org/10.1016/S0191-8869(02)00369-0

Gan Z, Liu F, Yang CCR (2020) Student-teachers’ self-efficacy for instructing self-regulated learning in the classroom. J Educ Teach 46(1):120–123. https://doi.org/10.1080/02607476.2019.1708632

Guo W, Bai B, Song H (2021) Influences of process-based instruction on students’ use of self-regulated learning strategies in EFL writing. System 101:102578. https://doi.org/10.1016/j.system.2021.102578

Guo W, Lau KL, Wei J, Bai B (2023) Academic subject and gender differences in high school students’ self-regulated learning of language and mathematics. Curr Psychol 42(10):7965–7980. https://doi.org/10.1007/s12144-021-02120-9

Hammersley M (2023) Methodological concepts: a critical guide. Routledge

Hu J, Gao X (2020) Appropriation of resources by bilingual students for self-regulated learning of science. Int J Bilingual Educ Bilingualism 23(5):567–583. https://doi.org/10.1080/13670050.2017.1386615

Jansen RS, van Leeuwen A, Janssen J, Jak S, Kester L (2019) Self-regulated learning partially mediates the effect of self-regulated learning interventions on achievement in higher education: a meta-analysis. Educ Res Rev 28:100292. https://doi.org/10.1016/j.edurev.2019.100292

Jiawei W, Mokmin NAM, Shaorong J (2024) Enhancing higher education art students’ learning experience through virtual reality: a comprehensive literature review of product design courses. Interactive Learn Environ 1–17. https://doi.org/10.1080/10494820.2024.2315125

Junaštíková J (2023) Self-regulation of learning in the context of modern technology: a review of empirical studies. Interactive Technol Smart Educ. https://doi.org/10.1108/ITSE-02-2023-0030

Kian Tan W, Shahrizal Sunar M, Su Goh E (2023) Analysis of the college underachievers’ transformation via gamified learning experience. Entertain Comput 44:100524. https://doi.org/10.1016/j.entcom.2022.100524

Koć-Januchta MM, Schönborn KJ, Roehrig C, Chaudhri VK, Tibell LAE, Heller HC (2022) “Connecting concepts helps put main ideas together”: cognitive load and usability in learning biology with an AI-enriched textbook. Int J Educ Technol High Educ 19(1):11. https://doi.org/10.1186/s41239-021-00317-3

Kondo M, Ishikawa Y, Smith C, Sakamoto K, Shimomura H, Wada N (2012) Mobile assisted language learning in university EFL courses in Japan: developing attitudes and skills for self-regulated learning. ReCALL 24(2):169–187. https://doi.org/10.1017/S0958344012000055

Li W, Ren X, Qian L, Luo H, Liu B (2023) Uncovering the effect of classroom climates on learning experience and performance in a virtual environment. Interactive Learn Environ. https://doi.org/10.1080/10494820.2023.2195450

Lin X, Dai Y (2022) An exploratory study of the effect of online learning readiness on self-regulated learning. Int J Chin Educ 11(2):2212585X221111938. https://doi.org/10.1177/2212585x221111938

Lincoln YS, Lynham SA, Guba EG (2018) Paradigmatic controversies, contradictions, and emerging confluences, revisited. In: NK Denzin, YS Lincoln (eds) The SAGE handbook of qualitative research. SAGE Publications, Inc, pp 213–263

Mazandarani O (2022a) Philosophical assumptions in ELT research: a systematic review. Asia-Pac Educ Res 31(3):217–226. https://doi.org/10.1007/s40299-021-00554-0

Mazandarani O (2022b) The status quo of L2 vis-à-vis general teacher education. Educ Stud 48(1):1–19. https://doi.org/10.1080/03055698.2020.1729101

Mazandarani O (2024) Statistical literacy: a point of contention in L2 teacher education. Lang Relat Res 14(6):405–421. https://doi.org/10.29252/lrr.14.6.13

Mazandarani O, Troudi S (2022) Measures and features of teacher effectiveness evaluation: perspectives from Iranian EFL lecturers. Educ Res Policy Pract 21(1):19–42. https://doi.org/10.1007/s10671-021-09290-0

Meyer E, Abrami PC, Wade CA, Aslan O, Deault L (2010) Improving literacy and metacognition with electronic portfolios: teaching and learning with ePEARL. Comput Educ 55(1):84–91. https://doi.org/10.1016/j.compedu.2009.12.005

Nakata Y (2019) Encouraging student teachers to support self-regulated learning: a multiple case study on prospective language teachers. Int J Educ Res 95:200–211. https://doi.org/10.1016/j.ijer.2019.01.007

Onah DFO, Pang ELL, Sinclair JE (2020) Cognitive optimism of distinctive initiatives to foster self-directed and self-regulated learning skills: a comparative analysis of conventional and blended-learning in undergraduate studies. Educ Inf Technol 25(5):4365–4380. https://doi.org/10.1007/s10639-020-10172-w

Öztürk M, Çakıroğlu Ü (2021) Flipped learning design in EFL classrooms: implementing self-regulated learning strategies to develop language skills. Smart Learn Environ. https://doi.org/10.1186/s40561-021-00146-x

Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, Moher D (2021) The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Int J Surg 88:105906. https://doi.org/10.1016/j.ijsu.2021.105906

Panadero E (2017) A review of self-regulated learning: six models and four directions for research [review]. Front Psychol. https://doi.org/10.3389/fpsyg.2017.00422

Paris SG, Newman RS (1990) Development aspects of self-regulated learning. Educ Psychol 25(1):87–102. https://doi.org/10.1207/s15326985ep2501_7

Pintrich PR (2004) A conceptual framework for assessing motivation and self-regulated learning in college students. Educ Psychol Rev 16(4):385–407. https://doi.org/10.1007/s10648-004-0006-x

Praharaj SK (2023) Boil the ocean: no apologies for setting the standard high! Indian J Psychol Med. https://doi.org/10.1177/02537176231188588

Pring R (2000a) Editorial conclusion: a philosophical perspective. Oxf Rev Educ 26(3-4):495–501. https://doi.org/10.1080/713688536

Pring R (2000b) Philosophy of educational research, 2nd edn. Continuum

Puustinen M, Pulkkinen L (2001) Models of self-regulated learning: a review. Scand J Educ Res 45(3):269–286. https://doi.org/10.1080/00313830120074206

Şahin Kızıl A, Savran Z (2018) Assessing self-regulated learning: the case of vocabulary learning through information and communication technologies. Computer Assist Lang Learn 31(5-6):599–616. https://doi.org/10.1080/09588221.2018.1428201

Senko C, Perry AH, Greiser M (2022) Does triggering learners’ interest make them overconfident? J Educ Psychol 114(3):482–497. https://doi.org/10.1037/edu0000649

Shen B, Bai B (2022) Chinese university students’ self-regulated writing strategy use and EFL writing performance: influences of self-efficacy, gender, and major. Appl Linguistics Rev. https://doi.org/10.1515/applirev-2020-0103

Singh VK, Singh P, Karmakar M, Leta J, Mayr P (2021) The journal coverage of web of science, Scopus and dimensions: a comparative analysis. Scientometrics 126(6):5113–5142. https://doi.org/10.1007/s11192-021-03948-5

Sitzmann T, Ely K (2011) A meta-analysis of self-regulated learning in work-related training and educational attainment: what we know and where we need to go. Psychol Bull 137(3):421–442. https://doi.org/10.1037/a0022777

Sointu E, Hyypiä M, Lambert MC, Hirsto L, Saarelainen M, Valtonen T (2023) Preliminary evidence of key factors in successful flipping: predicting positive student experiences in flipped classrooms. High Educ 85(3):503–520. https://doi.org/10.1007/s10734-022-00848-2

Teng LS (2021) Individual differences in self-regulated learning: exploring the nexus of motivational beliefs, self-efficacy, and SRL strategies in EFL writing. Language Teach Res. https://doi.org/10.1177/13621688211006881

Teng LS, Zhang LJ (2020) Empowering learners in the second/foreign language classroom: can self-regulated learning strategies-based writing instruction make a difference? J Second Lang Writ 48:100701. https://doi.org/10.1016/j.jslw.2019.100701

Theobald M (2021) Self-regulated learning training programs enhance university students’ academic performance, self-regulated learning strategies, and motivation: a meta-analysis. Contemp Educ Psychol 66:101976. https://doi.org/10.1016/j.cedpsych.2021.101976

Tse SK, Lin L, Ng RHW (2022) Self-regulated learning strategies and reading comprehension among bilingual primary school students in Hong Kong. Int J Bilingual Educ Bilingualism 25(9):3258–3273. https://doi.org/10.1080/13670050.2022.2049686

Tseng J-J, Chai CS, Tan L, Park M (2022) A critical review of research on technological pedagogical and content knowledge (TPACK) in language teaching. Comput Assist Lang Learn 35(4):948–971. https://doi.org/10.1080/09588221.2020.1868531

Videnovik M, Trajkovik V, Kiønig LV, Vold T (2020) Increasing quality of learning experience using augmented reality educational games. Multimed Tools Appl 79(33):23861–23885. https://doi.org/10.1007/s11042-020-09046-7

Winne PH, Hadwin AF (1998) Studying as self-regulated learning. In: DJ Hacker, J Dunlosky, & AC Graesser (eds) Metacognition in educational theory and practice, 1st edn. Lawrence Erlbaum Associates Publishers, pp 277–304

Wischgoll A (2016) Combined training of one cognitive and one metacognitive strategy improves academic writing skills. Front Psychol 7(187):1–13

Xu J (2021) Chinese university students’ L2 writing feedback orientation and self-regulated learning writing strategies in online teaching during COVID-19. Asia-Pac Educ Res 30(6):563–574. https://doi.org/10.1007/s40299-021-00586-6

Xu Z, Zhao Y, Zhang B, Liew J, Kogut A (2023a) A meta-analysis of the efficacy of self-regulated learning interventions on academic achievement in online and blended environments in K-12 and higher education. Behav Inf Technol 42(16):2911–2931. https://doi.org/10.1080/0144929X.2022.2151935

Xu Z, Zhao Y, Liew J, Zhou X, Kogut A (2023b) Synthesizing research evidence on self-regulated learning and academic achievement in online and blended learning environments: a scoping review. Educ Res Rev 39:100510. https://doi.org/10.1016/j.edurev.2023.100510

Yang G, Shen Q, Jiang R (2023) Exploring the relationship between university students’ perceived English instructional quality and learner satisfaction in the online environment. System 119:103178. https://doi.org/10.1016/j.system.2023.103178

Yang Y, Wen Y, Song Y (2023) A systematic review of technology-enhanced self-regulated language learning. Educ Technol Soc 26(1):31–44. https://www.jstor.org/stable/48707965

Yi Y-S (2021) On the usefulness of CDA-based score reporting: implications for self-regulated learning. Lang Test Asia 11(1):13. https://doi.org/10.1186/s40468-021-00127-4

Zhang W (2017) Using classroom assessment to promote self-regulated learning and the factors influencing its (in)effectiveness. Front Educ China 12(2):261–295. https://doi.org/10.1007/s11516-017-0019-0

Zhang Z (2018) English-medium instruction policies in China: internationalisation of higher education. J Multiling Multicult Dev 39(6):542–555. https://doi.org/10.1080/01434632.2017.1404070

Zhu Y, Zhang JH, Au W, Yates G (2020) University students’ online learning attitudes and continuous intention to undertake online courses: a self-regulated learning perspective. Educ Technol Res Dev 68(3):1485–1519. https://doi.org/10.1007/s11423-020-09753-w

Zimmerman BJ (1989) A social cognitive view of self-regulated academic learning. J Educ Psychol 81(3):329–339. https://doi.org/10.1037/0022-0663.81.3.329

Zimmerman BJ (1990) Self-regulated learning and academic achievement: an overview. Educ Psychol 25(1):3–17. https://doi.org/10.1207/s15326985ep2501_2

Zimmerman BJ, Risemberg R (1997) Becoming a self-regulated writer: a social cognitive perspective. Contemp Educ Psychol 22(1):73–101. https://doi.org/10.1006/ceps.1997.0919

Author information

Authors and affiliations

Department of English Language Teaching, Aliabad Katoul Branch, Islamic Azad University, Aliabad Katoul, Iran

Omid Mazandarani

Contributions

The author contributed to and supervised this work.

Corresponding author

Correspondence to Omid Mazandarani .

Ethics declarations

Competing interests

The author declares no competing interests.

Ethics approval

This article does not contain any studies with human participants performed by any of the authors.

Informed consent

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article

Mazandarani, O. Self-regulated learning in ESL/EFL contexts: a methodological exploration. Humanit Soc Sci Commun 11, 1118 (2024). https://doi.org/10.1057/s41599-024-03617-x

Received: 11 January 2024

Accepted: 19 August 2024

Published: 31 August 2024

DOI: https://doi.org/10.1057/s41599-024-03617-x

  • Open access
  • Published: 02 September 2024

Encompassing trust in medical AI from the perspective of medical students: a quantitative comparative study

  • Anamaria Malešević 1 ,
  • Mária Kolesárová 2 &
  • Anto Čartolovni 1 , 3  

BMC Medical Ethics volume 25, Article number: 94 (2024)

In the years to come, artificial intelligence will become an indispensable tool in medical practice. The digital transformation will undoubtedly affect today’s medical students. This study focuses on trust from the perspective of three groups of medical students - students from Croatia, students from Slovakia, and international students studying in Slovakia.

A paper-pen survey was conducted using a non-probabilistic convenience sample. In the second half of 2022, 1715 students were surveyed at five faculties in Croatia and three in Slovakia.

Specifically, 38.2% of students indicated familiarity with the concept of AI, while 44.8% believed they would use AI in the future. Patient readiness for the implementation of these technologies was mostly assessed as low. More than half of the students (59.1%) believe that the implementation of digital technology (AI) will negatively impact the patient-physician relationship, and 51.3% believe that patients will trust physicians less; the lowest agreement with this statement was observed among international students, while higher agreement was expressed by Slovak and Croatian students. 40.9% of Croatian students believe that users do not trust the healthcare system, as do 56.9% of Slovak students, while only 17.3% of international students share this opinion. The ability to explain to patients how AI works, if asked, also differed statistically significantly between the student groups: international students expressed the lowest agreement, while the Slovak and Croatian students showed higher agreement.

This study provides insight into the attitudes of medical students from Croatia and Slovakia, as well as international students, regarding the role of artificial intelligence (AI) in the future healthcare system, with a particular emphasis on the concept of trust. A notable difference was observed between the three groups of students, with the international students differing from their Croatian and Slovak colleagues. The study also highlights the importance of integrating AI topics into the medical curriculum, taking into account national social and cultural specificities that could negatively impact AI implementation if not carefully addressed.

Introduction

Technological advancements and artificial intelligence (AI) have transformed healthcare over the past few years. There has been a broad range of applications for AI in medicine, ranging from appointment scheduling and digitising health records to using algorithms to determine drug dosage [1]. The enthusiasm for the application of AI has extended to various medical specialties, such as radiology [2, 3], oncology [4], neurology [5], and nephrology [6]. Changes in the field have also prompted many studies to focus on the attitudes of students and their choice of specialisation. Interesting results that have emerged from this research include a shift in interest toward certain specialisations, anticipated changes in daily work, and the consideration of fears and expectations [7, 8, 9]. Students represent an interesting group when researching the future of healthcare and perceptions regarding the use of AI. Research has shown that, in most cases, medical students agree with statements indicating that they understand what AI is [10, 11]. However, when asked to define it themselves, the majority are unable to do so [12]. The existing literature recognises the necessity of incorporating education on the use of AI into medical curricula, highlighting that current education in this area is neither sufficient nor satisfactory [11, 12, 13, 14]. Although medical students expect AI to transform and revolutionise healthcare, they note that the current education on this topic is inadequate [15]. In Croatia, most medical faculties include medical informatics as a mandatory course in their curriculum (in the 2nd or 5th year of study), while no course directly focused on AI has been found; however, several elective courses, such as “Robotics in Medicine” and “Digital Technologies in the Healthcare System and E-Health,” introduce students to AI through practical applications. Although there are no specific subjects on AI in the medical curricula in Slovakia, medical faculties organise lectures and workshops on AI for medical students. At the largest Slovak medical faculty, in Bratislava, the topic of AI has been addressed for the last four years in the first-year medical ethics course. Medical students' readiness for AI, which they should develop during their studies, has received attention in the form of the Medical Artificial Intelligence Readiness Scale for Medical Students (MAIRS-MS) [16]. While some studies suggest what medical students should know about artificial intelligence in medicine [17], others highlight the need for health AI ethics in medical school education [18]. Students believe that AI will make medicine more exciting in the future and that AI should be a partner rather than a competitor [19]. They also think that receiving education in AI will greatly benefit their careers [20]. While significant progress has been observed in implementing AI across various applications, these are still early stages that require validation and solutions to emerging ethical and social challenges [21]. Students have expressed fear about reduced interaction with patients due to the integration of AI [14], decreased job opportunities, and the emergence of new ethical and social challenges [10]. They are also concerned that AI will increase patient risks, reduce physicians' skills, and harm patients [22].

Implementing AI brings about changes that will impact the patient-physician relationship [23]. Adopting AI involves a patient-centred approach that promotes informed choices [24]. The relationship between physicians and patients has been evolving under the influence of social circumstances and technological progress. The information and digital age has provided patients with tools empowering them to take on an active role as co-decision-makers, unlike the era when a paternalistic model prevailed and only physicians had access to medical information [25, 26].

Trust is a crucial factor in the current model of the patient-physician relationship. As a complex concept from the perspective of both physicians and patients, trust is the foundation for successful health outcomes and a quality relationship between them [27]. Trust is deeply embedded in the physician-patient relationship, making it a fiduciary relationship. Inserting a new actor will bring disruption and potentially even the creation of new dyadic or triadic trust relationships between physicians and AI, patients and AI, or even between patients, the physician, and AI [35]. Due to technological advancements, trust relationships in healthcare will become even more of an issue, necessitating active reflection and action [28].

One of the most critical ethical values in the design, development, and deployment of medical AI is transparency. It is not merely a recommendation but a necessity, tied to the informed consent of the user (the physician), who may or may not be fully aware of the underlying processes in algorithmic decision-making. Thus, one of the most pressing issues, alongside transparency, is explainability [29]. Explainability and transparency are closely linked with the level of trust and trustworthiness; trust mainly refers to the belief that we can depend on someone or something, hence a gradual increase in reliability may lead to trust [30]. From a phenomenological perspective, trust in medical AI is an affective-cognitive state of the entities involved in these relationships, namely the trustor (the person who trusts) and the trustee (the entity to be trusted) [31]. In this instance, the trustor is a physician, and the trustee would be the medical AI system. Given the ongoing discussion on whether medical AI can be trusted or only relied on [32, 33, 34], an interesting research question emerges: whether future physicians perceive such trust as possible, or as disruptive.

Research aims

In our study, we focus on medical students' attitudes towards the role of AI in the future of healthcare, with particular attention to the concept of trust.

This study aims to explore:

How students perceive the phenomenon of trust in the physician-patient relationship.

The perception of their own medical expertise in the context of AI use.

Students’ estimation of patient preparedness to embrace AI as part of everyday healthcare provision.

Additionally, the study investigated whether trust is a prerequisite for the physician-patient relationship in the context of AI implementation.

Participants and data collection

This study involved medical students from Croatia and Slovakia, two Eastern European countries with many similarities in their history, state development, social circumstances, and healthcare challenges. International students from different societal backgrounds were also included in the study and were treated as a third group in the analysis. The study was conducted between May 2022 and November 2022 at five medical schools in Croatia and three in Slovakia (Table 1), using a non-probabilistic convenience sample. The inclusion criteria were being a medical student at one of the medical schools in Croatia or Slovakia and being physically present at the lectures where the researchers conducted the research. The study included students from all years of study, as has been the practice in other studies on this topic [15, 20, 33, 36, 37, 39]. The survey was conducted using the paper-pen method, except at one university in Slovakia, where the students, after signing an informed consent form, received a URL link to the survey on the LimeSurvey platform. In agreement with the lecturers, the researchers arrived at the beginning of lectures, introduced the research, and asked for the students' voluntary participation. Students who were interested in the study were asked to sign the informed consent form. In total, 1715 medical students participated; 14 were excluded from the statistical analysis due to insufficient survey completion. The final sample consisted of 1701 medical students.

Design of the questionnaire

The research team developed a questionnaire; the English version is available in the supplementary files (Additional file 1). The survey questions were based on a prior qualitative study conducted in 2021 in Croatia [35], as well as a literature review of previous surveys involving medical students, patients, and physicians [23, 36, 37, 38, 39, 40, 41]. As in our qualitative study [35], the anticipatory ethics approach [42] was followed, with the same scenario. To preserve continuity between the qualitative and quantitative studies, we deliberately decided to focus primarily on ethical, legal, and social issues and not to use the existing MAIRS-MS [16]. The survey covered six broad topics, exploring the participants': (1) motivation for enrolling in medical studies and self-reported knowledge of medical ethics and/or bioethics; (2) attitudes regarding the impact of AI on the patient-physician relationship; (3) self-reported perception of their understanding of artificial intelligence; (4) propensity to use AI and digital technologies in future medical practice; (5) perceived utility of AI in the future, and societal readiness and preparedness for implementation; and (6) demographic characteristics. The questions included multiple-choice answers on a 5-point Likert scale (participants were instructed to read the statements and express their agreement or disagreement). At the beginning of the survey, a short scenario (Additional file 2), based on the anticipatory ethics approach [42], was presented to the medical students, followed by the survey questions. This short scenario focused on an AI-based virtual assistant used in a hospital context in 2030. The survey was pilot-tested with a small sample of first-year students from the researchers' university to check questionnaire comprehension, clarity, and completion time. The survey was available in Croatian, Slovak, and English, the latter particularly for the international students studying medicine in the English programme. The part of the questionnaire related to the perception of patient readiness, which was taken for further analysis, consisted of four questions with a high level of internal consistency, as determined by a Cronbach's alpha of 0.810.
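For readers unfamiliar with the internal-consistency figure quoted above, the sketch below computes Cronbach's alpha from its standard formula on simulated four-item Likert data. It is a minimal illustration under those assumptions, not the authors' SPSS output.

```python
# Minimal sketch: Cronbach's alpha for a k-item Likert scale, as quoted for
# the four patient-readiness items (the paper reports 0.810). Data simulated.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of responses."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)       # variance of the sum score
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

rng = np.random.default_rng(0)
latent = rng.normal(3, 1, size=(200, 1))            # shared 'readiness' factor
items = np.clip(np.rint(latent + rng.normal(0, 0.8, size=(200, 4))), 1, 5)
print(f"alpha = {cronbach_alpha(items):.3f}")
```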

Data analysis

All statistical analyses were conducted using SPSS version 25 (IBM Corp., Armonk, NY, USA). Simple descriptive statistics are presented as percentages. An independent t-test and a one-way ANOVA were conducted to examine group differences based on demographic determinants. Principal axis factoring was run on the questions about attitudes towards using AI technology in future work.
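By way of illustration (the analyses themselves were run in SPSS 25), the two tests named here could be reproduced on simulated data as follows; the group sizes, means, and SDs merely echo figures from the Results and are not the study's raw data.

```python
# Sketch of the two procedures named above: an independent-samples t-test
# (Welch-corrected, as the fractional df in the Results suggest) and a
# one-way ANOVA across the three student groups. All data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
males = rng.normal(3.45, 1.01, size=587)
females = rng.normal(3.05, 0.98, size=1084)
t, p = stats.ttest_ind(males, females, equal_var=False)  # Welch's t-test
print(f"t = {t:.3f}, p = {p:.3g}")

croatian = rng.normal(2.75, 0.85, size=771)
slovak = rng.normal(2.51, 0.74, size=587)
international = rng.normal(3.28, 0.80, size=343)
F, p = stats.f_oneway(croatian, slovak, international)   # one-way ANOVA
print(f"F = {F:.2f}, p = {p:.3g}")
```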

Demographics

A total of 1701 responses were collected from eight schools of medicine (Table 1). Among these, 771 students (45.3%) were from Croatia and 930 (54.7%) were studying in Slovakia, comprising 587 (34.5%) Slovak students and 343 (20.2%) international students, mainly from Western European and Scandinavian countries. Overall, 63.7% (1084) were female and 34.5% (587) were male, while gender information was missing for 30 participants (1.8%). Female students were thus more strongly represented than male students, which is in line with gender trends in medical studies. The Eurostudent VI survey for Croatia (2019) shows that 77.6% of students in medicine and social care are female, compared to 22.1% male [43]. Other studies of medical students in Croatia have observed ratios similar to those in this research [44, 45], and recent studies of medical students in Slovakia also report a higher proportion of women than men in their samples [46, 47]. The most represented group consisted of first-year students, followed by fourth-year and fifth-year students. The lowest representation was among sixth-year students, which is attributed to the sampling approach, which included only students attending lectures at the Faculty of Medicine. Given the specificities of medical education, this group was often located in hospital centres and clinics, making them less accessible to the researchers.

General attitudes on AI and trust within the patient-physician relationship

Regarding their acquaintance with the concept of artificial intelligence, a significant portion of students (38.6%) remained neutral, indicating neither agreement nor disagreement with the statement (Fig. 1). Additionally, 38.2% of students agreed with the assertion, while 23.2% negatively assessed their familiarity with the concept of AI. There was a statistically significant difference in the mean acquaintance score between males and females, t(1162.09) = 7.928, P < .001, with males scoring higher (M = 3.45, SD = 1.014) than females (M = 3.05, SD = 0.977). Similar results were seen for the statement, “I expect to actively use artificial intelligence in my medical practice.” In this context, 39% of students remained neutral, 44.8% expressed an expectation to actively utilise artificial intelligence in their future medical practice, while 16.2% disagreed.

Fig. 1: Students' attitudes toward AI.

Regarding trust within the patient-physician relationship, the medical students exhibited pronounced affirmative attitudes (Fig. 2). In response to the statement, “The patient and the physician should trust each other,” 80% of students strongly agreed, 16.8% agreed, 2.1% were neutral, and only 1.1% disagreed. For the statement, “The patient should trust the physician upon consulting him/her,” only 0.8% of students disagreed, 3% were neutral, and 96.2% agreed. Among the participating medical students, 2.9% disagreed with the assertion that “The physician is required to clarify to the patient how he or she came to a certain conclusion,” 8.9% were neutral, and 89.2% agreed.

Fig. 2: Students' attitudes toward different aspects of the patient-physician relationship.

Based on the provided statements, a statistically significant difference was found among the Croatian, Slovak, and international students, as illustrated in Table 2. The international students were less likely than the Croatian and Slovak students to agree with the statements asserting that patients should trust the physician during consultations and must rely entirely on the physician's opinion. Conversely, they were more inclined to agree that patients respect physicians' time, unlike their Croatian and Slovak counterparts, who agreed with this to a lesser extent.

Trust in the healthcare system

Table 3 presents the distribution of responses to the question, “To what extent do you think users trust the healthcare system in the country you study in?” Here, 40.9% of Croatian students believe that users do not trust the healthcare system, as do 56.9% of Slovak students, while only 17.3% of international students share this opinion. A one-way ANOVA was conducted to determine whether the student groups' perceptions of patient trust differed. The perception of patient trust in the healthcare system was statistically significantly different across the student groups, Welch's F(2, 106.211) = 901.153, P < .001. There was a difference in means between the Slovak students (M = 2.51, SD = 0.737), Croatian students (M = 2.75, SD = 0.847), and international students (M = 3.28, SD = 0.798), which was statistically significant (P < .001). Interestingly, the international students believe that users trust the Slovak healthcare system more than the Slovak students do, with a mean difference of 0.77, 95% CI [0.64, 0.9].
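Welch's F, reported above, is the heteroscedasticity-robust variant of the one-way ANOVA. SciPy has no built-in Welch ANOVA, so the sketch below implements the textbook formula directly; the simulated groups only mirror the reported sizes, means, and SDs, not the actual survey data.

```python
# Welch's one-way ANOVA (robust to unequal group variances), implemented
# from the standard formula. Groups are simulated, not the study's data.
import numpy as np
from scipy import stats

def welch_anova(*groups):
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    w = n / np.array([np.var(g, ddof=1) for g in groups])  # precision weights
    grand = np.sum(w * m) / w.sum()                        # weighted grand mean
    num = np.sum(w * (m - grand) ** 2) / (k - 1)
    tmp = np.sum((1 - w / w.sum()) ** 2 / (n - 1))
    f = num / (1 + 2 * (k - 2) / (k**2 - 1) * tmp)
    df1, df2 = k - 1, (k**2 - 1) / (3 * tmp)
    return f, df1, df2, stats.f.sf(f, df1, df2)            # right-tail p-value

rng = np.random.default_rng(2)
f, df1, df2, p = welch_anova(rng.normal(2.75, 0.85, 771),
                             rng.normal(2.51, 0.74, 587),
                             rng.normal(3.28, 0.80, 343))
print(f"Welch F({df1}, {df2:.1f}) = {f:.2f}, p = {p:.3g}")
```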

Patient readiness to use AI

The construct of patient readiness consisted of the students' perception of patient trust in technology, adaptability, digital literacy, and medical literacy, aspects recognised as necessary for patients to be ready to use the technology. Scores ranged from a minimum of 4 to a maximum of 20: a score of 4 was obtained if a student responded to all statements with “strongly disagree,” and 20 if the student responded to all statements with “strongly agree.” A statistically significant difference (P < .001) in the perception of patient readiness was observed among the Croatian, Slovak, and international students. The Croatian students gave, on average, the lowest scores for patient readiness (M = 8.40, SD = 2.814), followed by the Slovak students (M = 8.79, SD = 2.689), while the international students expressed the highest confidence in patients' readiness to use AI technology in the future (M = 9.62, SD = 2.829).
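The scoring of this construct is simple enough to show directly: sum four 5-point items per respondent, giving the 4-20 range described above. The sketch below uses hypothetical responses, not the survey data.

```python
# The readiness score is the sum of four 5-point Likert items (trust in
# technology, adaptability, digital literacy, medical literacy): range 4-20.
import numpy as np

rng = np.random.default_rng(3)
items = rng.integers(1, 6, size=(1701, 4))  # hypothetical: 1701 students x 4 items
readiness = items.sum(axis=1)
assert readiness.min() >= 4 and readiness.max() <= 20
print(f"mean = {readiness.mean():.2f}, SD = {readiness.std(ddof=1):.3f}")
```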

Here, 59.1% of students agreed that implementing digital technologies will have a negative impact on the patient-physician relationship (M = 3.62, SD = 1.009). No statistically significant difference was found based on the students' country of origin. On the other hand, there was a statistically significant difference (P < .001) among the students regarding the belief that patients will trust physicians less as more digital technologies are implemented; 51.3% of students believe that patients will trust physicians less. The lowest agreement with the statement was observed among international students (M = 3.09, SD = 1.006), while higher agreement was expressed by the Slovak (M = 3.50, SD = 1.030) and Croatian students (M = 3.51, SD = 1.006).

The third aspect of trust focused on confidence in use. Here, 53.6% of students believe that, if asked by a patient, they would be able to explain how the technology works. The ability to explain to patients how AI works differed statistically significantly between the student groups, Welch's F(2, 856.821) = 12.294, P < .001. International students expressed the lowest agreement with the statement (M = 3.09, SD = 1.215), while the Slovak (M = 3.41, SD = 1.048) and Croatian (M = 3.47, SD = 1.096) students showed higher agreement.

In the scenario (Annex I), AI was presented through the virtual assistant Cronko. The students were asked to assess how likely they would be to react in a specific way if the diagnosis they provided differed significantly from that of the virtual assistant (AI) (Table 4). A statistically significant difference was found among the Slovak, Croatian, and international students: the international students expressed a lower likelihood of standing by their diagnostic conclusion and a higher mean score for rejecting their conclusion in favour of the AI's opinion.

The students were also asked to decide how patients should react if the diagnoses of the physician and the AI differed significantly (Table 5). Here, 49.4% of students believed that patients should seek a third (expert) opinion, 42.1% that they should trust the physician, and 7.4% that they should consider both diagnoses and decide for themselves. Only a small number thought that patients should trust the AI (0.7%) or seek a third opinion from another artificial intelligence system (0.4%).

The crosstabulation analysis revealed that international students believe, at a lower percentage than Croatian and Slovak students, that patients should trust the physician. Based on Pearson's chi-square test (χ² = 43.731, df = 8, P < .001), it was concluded that the students' country of origin and the opinion that patients should trust the physician are associated. The measure of association (Cramér's V = 0.114, P < .001) indicates a statistically significant but weak association between the variables.
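The reported effect size can be checked from the figures given: Cramér's V is sqrt(χ² / (n · (min(r, c) − 1))). Assuming df = 8 corresponds to a 3 × 5 table (three student groups by the five answer options of Table 5), min(r, c) − 1 = 2, which reproduces the reported value up to rounding.

```python
# Worked check: Cramér's V from the reported chi-square. df = 8 is taken to
# imply a 3 x 5 table, so min(rows, cols) - 1 = 2 (an assumption).
import math

chi2, n, min_dim_minus_one = 43.731, 1701, 2
v = math.sqrt(chi2 / (n * min_dim_minus_one))
print(f"Cramer's V = {v:.3f}")  # ~0.113, consistent with the reported 0.114
```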

As far as the authors are aware, this is the first study to provide the perspective of Eastern European countries on medical students' attitudes towards the use of AI in medical practice. Previous studies have focused on Western countries such as Germany [48, 49, 50], Switzerland [37], the United Kingdom [39, 40], and Canada [7, 10, 12], as well as Asian countries [11, 13, 51, 52, 53, 54, 55, 56, 57, 58]. Although many expect AI to be implemented in healthcare in the coming years, only 44.8% of students believe they will use AI in the future. Here, 53.6% of students believed they would be able to explain to patients how AI technology works, while only 38.2% stated that they were (currently) familiar with the concept of AI. These results align with a study in Germany, where 64.3% of students expressed that they did not feel well informed about AI in medicine [48]. It is important to note that previous research has observed a discrepancy between medical students' perceived understanding of AI and their actual knowledge [9]. In the current era, medical education should aim to develop the skills that enable students to acquire knowledge about AI and successfully apply it in patient interactions, allowing them to convey information to patients in an understandable manner [59].

The prevailing view among Croatian and Slovak students was that users do not trust the healthcare system. This perception of a lack of trust aligns with research conducted on the general population. The EVS survey indicated that only 43% of Croatian citizens trust the healthcare system [60]. Studies have shown that a quarter of the population considers the healthcare system completely ineffective, and the majority believes that fundamental changes are needed, with the lowest levels of trust expressed by social groups with the lowest levels of education [61]. The general level of satisfaction with the healthcare system in Slovakia recently reached 44%. When asked, “To what extent do you trust conventional medicine in doctors and hospitals?”, Slovakia fell to the bottom of the ranking, with 55% of the population trusting conventional medicine, compared to the European average. The main reasons Slovaks cited for this dissatisfaction are the inability to get an appointment with a doctor (57%) and a bad personal or mediated negative experience with the care provided (51%) [62]. As previously highlighted, most of the international students come from Norway and other Scandinavian countries, where many studies show that trust in healthcare is exceptionally high [63, 64, 65]. International students can therefore be expected to project the same perception of trust onto the healthcare system of the country in which they study.

In Croatia and Slovakia, where trust in the healthcare system is relatively low and students perceive that patients do not have much trust in the system, students are more likely to believe that patients must fully trust their physicians during consultations and that patients are not respectful of physicians' time. The implementation of AI requires collaborative cooperation between the patient and the physician, which necessitates mutual trust and understanding [66]. Trust has been defined as “individuals' calculated exposure to the risk of harm from the actions of an influential other” [31, 67], where harm signifies the extent of physical and/or psychological damage that can result from incorrectly calibrated trust decisions [31]. In the physician's use of medical AI, however, the damage primarily manifests as harm to the patient and directly affects the physician-patient relationship [35, 68]. This also affects the reliability aspect and the physician's trust in medical AI, as well as its acceptability and future use, which are directly related to trustworthiness.

The differing views of international students on questions of AI and medical trust may also stem from the fact that these students mostly come from Western and Northern European countries, where the shared decision-making model of the patient-physician relationship is firmly established in medical practice. The shared decision-making model avoids the trap of two extremes: on one side, the physician acting as the dominant decision-maker; on the other, the patient holding an absolute position and making the decision entirely alone. Modern medicine has moved from a paternalistic approach to a physician-patient partnership based on mutual discussion. International students from Western Europe are therefore very likely more accustomed to a system that emphasises patient autonomy and ethical communication. The persistence of a paternalistic mentality in the healthcare system is noticeable in some post-communist or transitional countries [69, 70]. Although these countries are transforming and increasingly involving patients in decision-making, remnants of the old mentality still exist. The Slovak and Croatian students expressed more negative attitudes than international students regarding patients respecting physicians’ time, and they were similarly more inclined to believe that patients should fully trust physicians’ opinions. The attitudes of both Croatian and Slovak students towards trust between patient and physician in the context of AI can thus be partly explained by the paternalistic model of the patient-physician relationship, which is still present to some extent in these countries. Transitional countries, including Croatia and Slovakia, have specific cultural patterns in patient-physician communication, such as a lack of information sharing and a paternalistic approach to the patient [71]. In the region of Central and South-Eastern Europe, these issues have not been studied systematically [71]. However, Croatian researchers, following the Slovak research team [72], have carried out a study of patient rights focusing on patient-physician communication and the informed consent process [71]. The results showed that communication during the informed consent process in selected Croatian hospitals was based on the shared decision-making model, but that a paternalistic relationship was still present. Given the similar cultural and political background, we assume the situation in Slovakia is probably analogous, although, to the best of our knowledge, no such research has been conducted there recently. A case of persistent medical paternalism in Slovakia that sparked public debate was the involuntary sterilisation of Roma women, which began in communist Czechoslovakia and continued into the 2000s. This case has contributed to ongoing mistrust of the national health system among Roma, impacting vaccine uptake and highlighting the need for improved communication and informed consent practices [73, 74].

In cases of conflict between the judgements of the physician and the AI, our results show that most medical students consider that patients should either seek a third (expert) opinion (49.4%) or trust the physician (42.1%). These results are similar to a German study [48] in which the majority (82.5%) stated that the physician’s decision should be followed. In such a disagreement, international students were more willing than Croatian and Slovak students to set aside their own judgement in favour of the AI, despite attending the same programme as their Slovak colleagues. These new insights represent a valuable contribution to the ongoing discussion [32, 33, 34] on the possibility of trusting medical AI from the perspective of future physicians, who will probably use AI in their everyday work.
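Whether such response distributions differ between the three student groups is the kind of question usually examined with a chi-square test of independence. The sketch below uses invented counts purely to illustrate the procedure; it is not a reproduction of the study’s actual analysis or data.

```python
from scipy.stats import chi2_contingency

# Rows: Croatian, Slovak, international students (invented counts).
# Columns: "trust the physician", "seek a third opinion", "favour the AI".
table = [
    [55, 60, 10],
    [50, 58, 12],
    [30, 45, 25],
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```

A small p-value would indicate that group membership and response choice are not independent, i.e. that the groups answer differently.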

In cases of differing diagnoses, Croatian and Slovak students were more likely to believe that patients should rely on the physician’s opinion. Almost 90% of students think the physician must explain to the patient how they reached a conclusion, yet only 53.6% of students believe they could explain to a patient how AI technology works. This gap may pose a problem in healthcare: inadequate explanations can undermine both patients’ understanding and future physicians’ acceptance of AI diagnostic conclusions, especially when those conclusions diverge. Future physicians must know how to use AI, understand and interpret its results, be aware of all the risks, and explain them to patients in an understandable way [75].

Strengths and limitations

To the best of our knowledge, no similar research has been conducted focusing on Eastern Europe, specifically Croatia and Slovakia, and emphasising the various aspects of trust that are crucial to consider in the context of medical AI. This study highlights the differences between medical students’ perceptions of trust and of patient-physician relationships. The main limitation of this research is its non-probabilistic sample, which means the results cannot be generalised; due to technical and organisational difficulties, a convenience sample was the only available option. It is also essential to consider that the research was conducted at the end of 2022, during the ongoing COVID-19 pandemic, which could have influenced students’ attitudes towards the healthcare system. Finally, international students filled out the questionnaire in English (not their first language), which could have led to misinterpretation or misunderstanding of specific questions.

Conclusions

This study provides insight into the attitudes of Croatian, Slovak, and international medical students regarding the role of artificial intelligence (AI) in the future healthcare system, with a particular emphasis on the concept of trust. The insights from our study represent a valuable contribution to the ongoing debate on the possibility of trust in medical AI from the perspective of future physicians. Students agree that physicians and patients must trust each other; however, they also believe that implementing digital technologies will negatively impact the patient-physician relationship. A notable difference was observed between the three groups of students, with international students differing from their Croatian and Slovak colleagues. Croatian and Slovak students are more inclined to believe that patients will trust them less once AI is implemented, and they also express certain paternalistic views. Additionally, Croatian and Slovak students exhibit higher confidence in their own abilities (accuracy of diagnosis, ability to explain how AI functions) than international students. This study also highlights the importance of integrating AI topics into the medical curriculum, taking into account national specificities that could negatively impact AI implementation if not carefully addressed. Increasing explainability and trust through education about AI will contribute to better acceptance in the future, as well as to a stronger relationship between patients and physicians.

Data availability

The dataset generated by the survey research is available at: https://osf.io/2pyv9/files/osfstorage/6606a02b58fa490843e4f06b.
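Because the dataset is openly available at the link above, group comparisons can in principle be reproduced. The following sketch assumes a CSV export and hypothetical column names (student_group and a single trust item), both of which would need to be adapted to the actual files in the OSF repository:

```python
import pandas as pd

# File name and column names are assumptions for illustration;
# adapt them to the actual files in the OSF repository.
df = pd.read_csv("survey_data.csv")

# Distribution of one trust item within each student group.
print(
    df.groupby("student_group")["patients_should_fully_trust_physician"]
      .value_counts(normalize=True)
      .round(3)
)
```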

References

Amisha F, Malik P, Pathania M, Rathaur VK. Overview of artificial intelligence in medicine. J Family Med Prim Care. 2019;8(7):2328.


Reyes M, Meier R, Pereira S, et al. On the Interpretability of Artificial Intelligence in Radiology: challenges and opportunities. Radiol Artif Intell. 2020;2(3):e190043.

Mehrizi MHR, Van Ooijen PMA, Homan M. Applications of artificial intelligence (AI) in diagnostic radiology: a technography study. Eur Radiol. 2020;31(4):1805–11.

Dlamini Z, Francies FZ, Hull R, Marima R. Artificial intelligence (AI) and big data in cancer and precision oncology. Comput Struct Biotechnol J. 2020;18:2300–11.

Kalani M, Anjankar A. Revolutionizing neurology: the role of artificial intelligence in advancing diagnosis and treatment. Cureus. 2024.

Bajaj T, Koyner JL. Artificial intelligence in acute kidney injury prediction. Adv Chronic Kidney Dis. 2022;29(5):450–60.

Gong B, Nugent J, Guest W, et al. Influence of artificial intelligence on Canadian medical students’ preference for radiology specialty: a National Survey study. Acad Radiol. 2019;26(4):566–77.

Capparos Galán G, Portero FS. Medical students’ perceptions of the impact of artificial intelligence in radiology. Radiología. 2022;64(6):516–24.

Bin Dahmash A, Alabdulkareem M, Alfutais A, Kamel AM, Alkholaiwi F, Alshehri S et al. Artificial intelligence in radiology: does it impact medical students preference for radiology as their future career? BJR|Open. 2020;2(1):20200037.

Mehta N, Harish V, Bilimoria K et al. Knowledge and attitudes on Artificial intelligence in Healthcare: a provincial survey study of medical students. MedEdPublish. 2021;10(1).

Al Hadithy ZA, Al Lawati A, Al-Zadjali R et al. Knowledge, attitudes, and perceptions of Artificial Intelligence in Healthcare among Medical students at Sultan Qaboos University. Cureus. 2023;15(9).

Teng M, Singla R, Yau O, Lamoureux D, Gupta A, Hu Z, et al. Health Care Students’ perspectives on Artificial Intelligence: Countrywide Survey in Canada. JMIR Med Educ. 2022;8(1):e33390.

Abid S, Awan B, Ismail T, Sarwar N, Sarwar G, Tariq M. Artificial Intelligence: medical students attitude in District Peshawar Pakistan. Pakistan J Public Health. 2019;9(1):19–21.

Bisdas S, Topriceanu C, Zakrzewska Z et al. Artificial Intelligence in Medicine: a multinational Multi-center survey on the medical and dental students’ perception. Front Public Health. 2021;9.

Jebreen K, Radwan E, Kammoun-Rebai W, Alattar E, Radwan A, Safi W et al. Perceptions of undergraduate medical students on artificial intelligence in medicine: mixed-methods survey study from Palestine. BMC Med Educ. 2024;24(1).

Karaca O, Çalişkan S, Demir K. Medical artificial intelligence readiness scale for medical students (MAIRS-MS) – development, validity and reliability study. BMC Med Educ. 2021;21(1).

Park SH, Hyun K, Kim S, Park JH, Lim YS. What should medical students know about artificial intelligence in medicine? J Educational Evaluation Health Professions. 2019;16:18.

Katznelson G, Gerke S. The need for health AI ethics in medical school education. Adv Health Sci Educ. 2021;26(4):1447–58.

Bisdas S, Topriceanu CC, Zakrzewska Z, Irimia AV, Shakallis L, Subhash J et al. Artificial Intelligence in Medicine: a multinational Multi-center survey on the medical and dental students’ perception. Front Public Health. 2021;9.

Tung AYZ, Dong LW. Malaysian medical students’ attitudes and readiness toward AI (Artificial Intelligence): a cross-sectional study. J Med Educ Curric Dev. 2023;10.

Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28(1):31–8.

Boillat T, Nawaz FA, Rivas H. Readiness to Embrace Artificial intelligence among medical doctors and students: questionnaire-based study. JMIR Med Educ. 2022;8(2):e34973.

Ongena Y, Haan M, Yakar D, Kwee TC. Patients’ views on the implementation of artificial intelligence in radiology: development and validation of a standardized questionnaire. Eur Radiol. 2019;30(2):1033–40.

Quinn TP, Senadeera M, Jacobs S, Coghlan S, Le V. Trust and medical AI: the challenges we face and the expertise needed to overcome them. J Am Med Inform Assoc. 2020;28(4):890–4.

Gerber BS, Eiser AR. The patient-physician relationship in the internet age: future prospects and the research agenda. J Med Internet Res. 2001;3(2):e15.


Agarwal AK, Murinson BB. New dimensions in patient–physician interaction: values, autonomy, and medical information in the patient-centered clinical encounter. Rambam Maimonides Med J. 2012;3(3):e0017.

Chandra S, Mohammadnezhad M, Ward P. Trust and Communication in a doctor- patient relationship: a literature review. J Healthc Commun. 2018;03(03).

Cado V. Trust as a factor for higher performance in healthcare: COVID 19, digitalization, and positive patient experiences. IJQHC Commun. 2022;2(2).

Gerdes A. The role of explainability in AI-supported medical decision-making. Discover Artif Intell. 2024;4(1).

De Fine Licht K, Brülde B. On defining Reliance and Trust: purposes, conditions of adequacy, and new definitions. Philosophia. 2021;49(5):1981–2001.

Hancock PA, Kessler TT, Kaplan AD, Stowers K, Brill JC, Billings DR et al. How and why humans trust: a meta-analysis and elaborated model. Front Psychol. 2023;14.

Hatherley J. Limits of trust in medical AI. J Med Ethics. 2020;46(7):478–81.

Kerasidou C, Kerasidou A, Büscher M, Wilkinson S. Before and beyond trust: reliance in medical AI. J Med Ethics. 2021;48(11):852–6.

Ferrario A, Loi M, Viganò E. Trust does not need to be human: it is possible to trust medical AI. J Med Ethics. 2020;47(6):437–8.

Čartolovni A, Malešević A, Poslon L. Critical analysis of the AI impact on the patient–physician relationship: a multi-stakeholder qualitative study. Digit Health. 2023;9.

Coppola F, Faggioni L, Regge D, et al. Artificial intelligence: radiologists’ expectations and opinions gleaned from a nationwide online survey. Radiol Med. 2020;126(1):63–71.

Van Der Hoek J, Huber AT, Leichtle AB, et al. A survey on the future of radiology among radiologists, medical students and surgeons: students and surgeons tend to be more skeptical about artificial intelligence and radiologists may fear that other disciplines take over. Eur J Radiol. 2019;121:108742.

Abdullah R, Fakieh B. Health care employees’ perceptions of the use of artificial intelligence applications: survey study. J Med Internet Res. 2020;22(5):e17620.

Blease C, Bernstein MH, Gaab J, et al. Computerization and the future of primary care: a survey of general practitioners in the UK. PLoS ONE. 2018;13(12):e0207418.

Sit C, Srinivasan R, Amlani A et al. Attitudes and perceptions of UK medical students towards artificial intelligence and radiology: a multicentre survey. Insights into Imaging. 2020;11(1).

Oh S, Kim JH, Choi SK, Lee HJ, Hong J, Kwon SH. Physician confidence in artificial intelligence: an online mobile survey. J Med Internet Res. 2019;21(3):e12422.

York E, Conley SN. Creative anticipatory ethical reasoning with scenario analysis and design fiction. Sci Eng Ethics. 2020;26(6):2985–3016.

Rimac I, Bovan K, Ogresta J. Nacionalno izvješće istraživanja EUROSTUDENT VI za Hrvatsku [National report of the EUROSTUDENT VI survey for Croatia]. Ministarstvo znanosti i obrazovanja; 2019.

Dragun R, Veček NN, Marendić M, Pribisalić A, Đivić G, Cena H, et al. Have Lifestyle habits and Psychological Well-being changed among adolescents and medical students due to COVID-19 Lockdown in Croatia? Nutrients. 2020;13(1):97.

Đogaš V, Jerončić A, Marušić M, Marušić A. Who would students ask for help in academic cheating? Cross-sectional study of medical students in Croatia. BMC Med Educ. 2014;14(1).

Sovicova M, Zibolenova J, Svihrova V, Hudeckova H. Odds ratio estimation of medical students’ attitudes towards COVID-19 vaccination. Int J Environ Res Public Health. 2021;18(13):6815.

Faixová D, Jurinová Z, Faixová Z, Kyselovič J, Gažová A. Dietary changes during the examination period in medical students. EAS J Pharm Pharmacol. 2023;5(03):78–86.

McLennan S, Meyer A, Schreyer K, Buyx A. German medical students´ views regarding artificial intelligence in medicine: a cross-sectional survey. PLOS Digit Health. 2022;1(10):e0000114.

Gillissen A, Kochanek T, Zupanic M, Ehlers JP. Medical students’ perceptions towards digitization and Artificial Intelligence: a mixed-methods study. Healthcare. 2022;10(4):723.

Moldt JA, Loda T, Mamlouk AM, Nieselt K, Fuhl W, Herrmann-Werner A. Chatbots for future docs: exploring medical students’ attitudes and knowledge towards artificial intelligence and medical chatbots. Med Educ Online. 2023;28(1).

Syed W, Basil A, Al-Rawi M. Assessment of awareness, perceptions, and opinions towards Artificial Intelligence among Healthcare students in Riyadh, Saudi Arabia. Medicina. 2023;59(5):828.

Komasawa N, Nakano T, Terasaki F, Kawata R. Attitude survey toward artificial intelligence in medicine among Japanese medical students. Bull Osaka Med Pharm Univ. 2021;67(1–2):9–16.

Jha N, Shankar PR, Al-Betar MA, Mukhia R, Hada K, Palaian S. Undergraduate medical students’ and interns’ knowledge and perception of artificial intelligence in medicine. Adv Med Educ Pract. 2022;13:927–37.

Swed S, Alibrahim H, Elkalagi NKH et al. Knowledge, attitude, and practice of artificial intelligence among doctors and medical students in Syria: a cross-sectional online survey. Front Artif Intell. 2022;5.

Doumat G, Daher D, Ghanem NN, Khater B. Knowledge and attitudes of medical students in Lebanon toward artificial intelligence: a national survey study. Front Artif Intell. 2022;5.

Buabbas AJ, Miskin B, Alnaqi A, et al. Investigating students’ perceptions towards Artificial Intelligence in Medical Education. Healthcare. 2023;11(9):1298.

Kansal R, Bawa A, Bansal A et al. Differences in knowledge and perspectives on the usage of artificial intelligence among doctors and medical students of a developing country: a cross-sectional study. Cureus. Published online January 19, 2022.

AlZaabi A, AlMaskari S, AalAbdulsalam A. Are physicians and medical students ready for artificial intelligence applications in healthcare? Digit Health. 2023;9:205520762311521.

Pupic N, Ghaffari-Zadeh A, Hu R, et al. An evidence-based approach to artificial intelligence education for medical students: a systematic review. PLOS Digit Health. 2023;2(11):e0000255.

Baloban J, Črpić G, Ježovita J. Vrednote u Hrvatskoj od 1999. do 2018. prema European Values Study [Values in Croatia from 1999 to 2018 according to the European Values Study]. Kršćanska sadašnjost; 2019.

Popović S. Determinants of citizen’s attitudes and satisfaction with the Croatian health care system. Medicina. 2017;53(1):85–100.

STADA Health Report 2024: Satisfaction with healthcare system continues to decline. 2024.

Price D, Bonsaksen T, Leung J, McClure-Thomas C, Ruffolo M, Lamph G et al. Factors Associated with Trust in Public Authorities among adults in Norway, United Kingdom, United States, and Australia two years after the COVID-19 outbreak. Int J Public Health. 2023;68.

Skirbekk H, Magelssen M, Conradsen S. Trust in healthcare before and during the COVID-19 pandemic. BMC Public Health. 2023;23(1).

Baroudi M, Goicolea I, Hurtig AK, San-Sebastian M. Social factors associated with trust in the health system in northern Sweden: a cross-sectional study. BMC Public Health. 2022;22(1).

Chin JJ. Doctor-patient relationship: from medical paternalism to enhanced autonomy. Singapore Med J. 2002;43(3):152–5.

Hancock PA, Billings DR, Schaefer KE, Chen JYC, De Visser EJ, Parasuraman R. A Meta-analysis of factors affecting Trust in Human-Robot Interaction. Hum Factors. 2011;53(5):517–27.

Čartolovni A, Tomičić A, Mosler EL. Ethical, legal, and social considerations of AI-based medical decision-support tools: a scoping review. Int J Med Informatics. 2022;161:104738.

Vyshka G, Kruja J. Inapplicability of advance directives in a paternalistic setting: the case of a post-communist health system. BMC Med Ethics. 2011;12(1).

Murgic L, Hébert PC, Sovic S, Pavlekovic G. Paternalism and autonomy: views of patients and providers in a transitional (post-communist) country. BMC Med Ethics. 2015;16(1).

Vučemilo L, Ćurković M, Milošević M, Mustajbegović J, Borovečki A. Are physician-patient communication practices slowly changing in Croatia? – a cross-sectional questionnaire study. Croatian Med J. 2013;54(2):185–91.

Nemcekova M, Ziakova K, Mistuna D, Kudlicka J. Respecting patients’ rights. Bull Med Ethics. 1998;140:13–8.

Hammarberg T. Report by the Commissioner for Human Rights of the Council of Europe. 2011. Available online: https://rm.coe.int/16806db7c5

The Advisory Committee on the Framework Convention for the Protection of National Minorities. Fifth Opinion on the Slovak Republic. 2022.

McCoy LG, Nagaraj S, Morgado F, Harish V, Das S, Celi LA. What do medical students actually need to know about artificial intelligence? Npj Digit Med. 2020;3(1).


Funding

This work was supported by the Hrvatska zaklada za znanost (Croatian Science Foundation (CSF)) [grant number UIP-2019-04-3212], “(New) Ethical and Social Challenges of Digital Technologies in the Healthcare Domain”. The funder had no role in the design of this study, its execution, the analyses, the interpretation of the data, or the decision to submit the results.

Author information

Authors and affiliations

Digital Healthcare Ethics Laboratory (Digit-HeaL), Catholic University of Croatia, Zagreb, Croatia

Anamaria Malešević & Anto Čartolovni

Institute of Social Medicine and Medical Ethics, School of Medicine, Comenius University in Bratislava, Bratislava, Slovakia

Mária Kolesárová

School of Medicine, Catholic University of Croatia, Zagreb, Croatia

Anto Čartolovni


Contributions

AČ and AM planned the study. MK assisted in the research implementation process. AM analysed the data, with contributions from MK and AČ. All authors contributed to the data interpretation and writing of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Anamaria Malešević.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the Catholic University of Croatia’s Ethics Committee on 21 January 2022 (Classification number: 641-03/21 − 03/03; registration number: 498 − 16/2-22-06). Participation in the research was anonymous and voluntary. Before completing the survey, participants were informed about the research objectives, data processing, and storage procedures and signed an informed consent form.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article

Malešević, A., Kolesárová, M. & Čartolovni, A. Encompassing trust in medical AI from the perspective of medical students: a quantitative comparative study. BMC Med Ethics 25, 94 (2024). https://doi.org/10.1186/s12910-024-01092-2


Received: 03 May 2024

Accepted: 23 August 2024

Published: 02 September 2024

DOI: https://doi.org/10.1186/s12910-024-01092-2


Keywords

• Artificial intelligence
• Medical students
• Quantitative study
• Medical ethics
• Patient-physician relationship

