What Makes Reading Comprehension Questions Difficult?

Saku Sugawara , Nikita Nangia , Alex Warstadt , Samuel Bowman

Saku Sugawara, Nikita Nangia, Alex Warstadt, and Samuel Bowman. 2022. What Makes Reading Comprehension Questions Difficult? In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6951–6971, Dublin, Ireland. Association for Computational Linguistics. https://aclanthology.org/2022.acl-long.479

Iowa Reading Research Center

Research Article of the Month: April 2024

This blog post is part of our Research Article of the Month series. For this month, we highlight “Designing an Intervention in Reading and Self-Regulation for Students With Significant Reading Difficulties, Including Dyslexia,” an article published in the journal Learning Disability Quarterly in 2021. Important words related to research are bolded, and definitions of these terms are included at the end of the article in the “Terms to Know” section.

Why Did We Pick This Paper?

Self-regulation is the ability to modify one’s thinking, emotions, and behavior to achieve a goal. Some self-regulation strategies include setting goals, becoming aware of emotions, practicing positive self-statements (“I am doing my best” or “I will not give up”), and believing in the ability to grow and learn. 

Self-regulation contributes to reading proficiency (Berkeley & Larsen, 2018), and students with reading difficulties tend to have impaired self-regulation (Cutting et al., 2009). Fortunately, training in self-regulation has been shown to improve the use of self-regulation strategies and reading comprehension outcomes (Spörer & Schünemann, 2014). Reading interventions that target self-regulation may therefore support the reading outcomes of students with reading disabilities (RDs). This study examined the feasibility and effects of a reading intervention that explicitly teaches self-regulation strategies.

What Are the Research Questions or Purpose?

This study examined the feasibility of implementing a specific reading intervention with self-regulation instruction by addressing the following questions:

  • Is the intervention associated with stronger effects on reading outcomes than the interventions currently provided to students with RDs in the participating schools?
  • Can teachers implement the intervention as designed?
  • What are the barriers to consistent implementation and to student progress in the intervention?
  • What are teachers’ perceptions of the self-regulation component of the intervention?
  • What parts of the intervention should be maintained as they are and how should the intervention be revised?

What Methodology Do the Authors Employ?

To assess the feasibility of the intervention and explore its potential effects on reading outcomes, the study employed a quasi-experimental design. 

A group of special education teachers, dyslexia specialists, and reading interventionists were randomly assigned to either teach the intervention (the experimental condition) or continue delivering their typical instruction (the business-as-usual, or BAU, condition). Instruction was delivered in small groups of 2–4 students, 4 days a week, for 26 weeks.

A total of 21 instructors participated in the study (10 in the intervention and 11 in the BAU condition), as well as 43 students in Grades 2-4 (23 in the intervention and 20 in the BAU condition).

The students were assessed on a number of reading skills, including word recognition, decoding, reading comprehension, and oral reading fluency, at the beginning and end of the study. Pre- and post-test scores were compared in order to assess students’ growth in the measured skills over the course of the study.

The intervention consisted of word study, text reading, reading comprehension, and self-regulation, as described below:

  • Word study: The word study component included instruction in phonemic awareness, decoding, word recognition, and spelling. 
  • Text reading: For the text-reading component, students read high-interest, motivating decodable texts that included phonics and spelling patterns the students had been explicitly taught. They also applied these skills to authentic texts, extending them to new contexts. 
  • Comprehension: The comprehension component included guiding questions to focus students’ attention and activate prior knowledge. Teachers asked questions that would stimulate students to recall events, generate inferences, make connections across texts, paraphrase, identify main ideas, monitor their understanding, generate questions, and visualize. Teachers modeled comprehension skills and provided students with multiple practice opportunities. 
  • Self-regulation: The self-regulation component included activities designed to support a growth mindset (the belief that one can grow and achieve success in the future despite present challenges), emotional self-regulation (the ability to identify emotions), reflection on comprehension strategy use, positive self-statements, and goal setting. 

These components were delivered in two phases: the first phase focused on foundational reading skills, and the second phase addressed more advanced skills. For all components, students received direct instruction and modeling from teachers, and they practiced skills using multiple modalities (e.g., reading, writing, and manipulation of letter tiles). 

Students in the BAU condition received instruction using other evidence-based programs.

The researchers monitored the fidelity and quality of implementation for the intervention by recording videos of classroom instruction. The researchers conducted an analysis of covariance with pre- and post-test scores to determine whether the intervention was associated with greater effects than traditional instruction. The intervention teachers also participated in two focus groups to provide feedback on the feasibility of the intervention. 
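
As a rough illustration of the analytic approach described above (a sketch with made-up numbers, not the authors' code or data), an analysis of covariance can be run by regressing post-test scores on group membership while adjusting for pre-test scores. The column names and values below are hypothetical.

```python
# Minimal ANCOVA sketch: post-test ~ pre-test (covariate) + group (condition).
import pandas as pd
import statsmodels.formula.api as smf

scores = pd.DataFrame({
    "group":    ["intervention"] * 4 + ["BAU"] * 4,
    "pretest":  [12, 15, 10, 14, 18, 16, 20, 17],
    "posttest": [22, 25, 19, 24, 24, 23, 27, 25],
})

# The C(group) coefficient estimates the adjusted difference between conditions.
model = smf.ols("posttest ~ pretest + C(group)", data=scores).fit()
print(model.summary())
```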

What Are the Key Findings?

Research Question 1: Is the intervention associated with stronger effects on reading outcomes than the interventions currently provided to students with RDs in the participating schools?

Students’ pre-test scores on all reading skill variables were higher in the BAU condition compared to the experimental condition, but there were no significant differences between groups for any measures on the post-test. Thus, the intervention was not associated with stronger effects on reading outcomes than other interventions used in the participating schools.

Research Question 2: Can teachers implement the intervention as designed?

The fidelity and quality of implementation were reported as percentages to measure whether the intervention was implemented as designed. The mean word study and text reading fidelity rating was 88%, and the quality rating was 92%. For comprehension and self-regulation, the mean fidelity rating was 81%, and the quality rating was 94%. The lower fidelity rating for the comprehension and self-regulation components indicates that these components were more difficult for teachers to implement as intended. 

Research Question 3: What are the barriers to consistent implementation and to student progress in the intervention?

Teachers identified context barriers, including scheduling, limited school resources, limited instructional and planning time, and logistics related to providing the intervention at two different schools on the same day. They also identified student-related barriers, including student frustration with literacy tasks, lack of confidence and inconsistent focus, and behavior management. 

Research Question 4: What are teachers’ perceptions of the self-regulation component of the intervention?

In focus groups, teachers voiced their support for the self-regulation component of the intervention, citing the positive effects of growth mindset instruction on students’ confidence and self-esteem. Teachers also noted the benefits of recognizing negative self-statements and substituting them with positive ones. 

Research Question 5: What parts of the intervention should be maintained as they are and how should the intervention be revised?

The teachers requested a better approach for organizing and managing materials (e.g., letter tiles, books, visual aids). They suggested that future versions of the intervention should focus more on active student participation than on teacher talk. They also wanted a stronger fluency component in the intervention and guidance on incorporating technology into instruction. Overall, the teachers highlighted that the strengths of the intervention include its well-designed curriculum and content, the material resources provided, and the variety of activities that support student interest and participation.

What Are the Limitations of This Paper?

The study examined the implementation of a multi-component intervention for reading and self-regulation for students with RDs. Teachers highlighted several challenges they faced in implementing this intervention, including managing materials and coordinating the pace of the different lesson components. This complexity could limit the intervention’s ease of implementation, as educators may struggle to implement it effectively without substantial preparatory training or additional support and technology. Enhancing teacher readiness through targeted professional development, along with providing ongoing support and technology, could improve the fidelity and quality of implementation of this complex intervention. 

Additionally, the study was constrained by its small sample size, which limits the statistical power to detect the possible effectiveness of the intervention. With few participants, it is difficult to detect smaller effects or subtle differences between the treatment and BAU groups as statistically significant. Further research on the effectiveness of the intervention will need a larger number of participants to validate the results and conclusions drawn from this study. 

Terms to Know

  • Feasibility:  A feasibility study follows the implementation of a project or process (such as a new reading instructional program) in order to assess its potential for success. Researchers may gather data and feedback to inform future revisions.
  • Quasi-experimental: Experimental research aims to determine whether a certain treatment influences a measurable outcome—for example, whether a certain instructional method influences students’ reading comprehension scores. To do this, participants are assigned to one of two groups: the experimental group, which receives the treatment, and the control group, which does not receive the treatment. In an experimental study, these groups are randomly assigned, meaning each participant has equal probability of being in either the treatment or the control group. A quasi-experimental study is similar to an experimental study except that participants are not randomly assigned to groups. In educational research, groups often are assigned by classroom rather than through random assignment, making this kind of research quasi-experimental. In either case, participants in both groups are tested before and after the treatment, and their results are compared.
  • Business-as-usual (BAU) condition:  The business-as-usual condition is another name for the control group in an experimental or quasi-experimental study.   This group does not receive the experimental treatment and therefore serves as a point of comparison for the experimental group. 
  • Fidelity:  Fidelity is a measure of the extent to which a process, such as an instructional approach, is implemented as intended.
  • Covariance:  Covariance , in statistics, is a measure of the relationship between two variables and the extent to which they change together.
  • Effects:  In statistics, effect size is a measure of the strength of the relationship between two variables in statistical analyses. A commonly used interpretation refers to effect sizes as small (g = 0.2), medium (g = 0.5), and large (g = 0.8) based on the benchmarks suggested by Cohen (1988), where “g” refers to Hedges’ g, a statistical measure of effect size (a short computational sketch follows this list).
  • Focus groups:  A focus group gathers participants for a guided discussion or interview in order to elicit feedback about a product or process.
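
To make the effect-size benchmarks above concrete, here is a small sketch of how Hedges’ g is typically computed from group means, standard deviations, and sample sizes. The values are illustrative only, not data from the study.

```python
import math

def hedges_g(mean1, mean2, sd1, sd2, n1, n2):
    """Standardized mean difference with Hedges' small-sample correction."""
    # Pooled standard deviation across the two groups
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / pooled_sd            # Cohen's d
    correction = 1 - 3 / (4 * (n1 + n2) - 9)   # small-sample correction factor
    return d * correction

# Example: a difference of half a pooled standard deviation is a "medium" effect.
print(round(hedges_g(105, 100, 10, 10, 25, 25), 2))  # ≈ 0.49
```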

Berkeley, S., & Larsen, A. (2018). Fostering self-regulation of students with learning disabilities: Insights from 30 years of reading comprehension intervention research. Learning Disabilities Research & Practice, 33(2), 75-86. https://doi.org/10.1111/ldrp.12165

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Routledge Academic.

Cutting, L. E., Materek, A., Cole, C. A. S., Levine, T. M., & Mahone, E. M. (2009). Effects of fluency, oral language, and executive function on reading comprehension performance. Annals of Dyslexia, 59, 34–54. https://doi.org/10.1007/s11881-009-0022-0

Denton, C. A., Montroy, J. J., Zucker, T. A., & Cannon, G. (2021). Designing an intervention in reading and self-regulation for students with significant reading difficulties, including dyslexia. Learning Disability Quarterly, 44(3), 170-182. https://doi.org/10.1177/0731948719899479

Spörer, N., & Schünemann, N. (2014). Improvements of self-regulation procedures for fifth graders' reading competence: Analyzing effects on reading comprehension, reading strategy performance, and motivation for reading. Learning and Instruction, 33, 147-157. https://doi.org/10.1016/j.learninstruc.2014.05.002

  • emotional control
  • goal setting
  • growth mindset
  • interventions
  • learning disabilities
  • Research Article of the Month
  • self regulation
  • self-esteem
  • self-monitoring



Effective Strategies for Improving Reading Comprehension

  • Meenakshi Gajria, St. Thomas Aquinas College
  • Athena Lentini McAlenney, St. Thomas Aquinas College
  • https://doi.org/10.1093/acrefore/9780190264093.013.1225
  • Published online: 27 October 2020

Reading comprehension, or the ability to extract information accurately from narrative or content-area texts, is critical for school success. Many students identified with learning disabilities struggle with comprehending or acquiring knowledge from text despite adequate word-recognition skills. These students experience greater difficulty as they move from elementary to middle school, where the focus shifts from “learning to read” to “reading to learn.” Although students with learning disabilities vary with respect to their challenges in reading, general characteristics of this group include problems identifying the central ideas of a text and their relationship to supporting ideas, differentiating between important and unimportant details, asking questions, drawing inferences, creating a summary, and recalling textual ideas. Typically, these students are passive readers who do not spontaneously employ task-appropriate cognitive strategies or monitor their ongoing understanding of the text, resulting in limited understanding of both narrative and expository texts. An evidence-based approach to comprehension instruction is centered on teaching students the cognitive strategies used by proficient readers. Within the framework of reading comprehension, the goal of cognitive strategies is to teach students to actively engage with the text and make connections between it and their prior knowledge, so that learning becomes more purposeful, deliberate, and self-regulated.

Texts differ in the level of challenge that they present to students. Narrative texts are generally simpler to read because they are based on a temporal sequence of events and have a predictable story structure. In contrast, expository texts, such as social studies and science, can be particularly demanding because there are multiple and complex text structures based on the relationship of ideas about a particular concept or topic. Using principles of explicit instruction, all learners, including students with learning disabilities and English language learners, can be taught cognitive strategies that have been proven effective for increasing reading comprehension. Early research focused on instruction in a single cognitive strategy to promote reading comprehension, such as identifying story grammar elements and story mapping for narrative texts and identifying the main idea, summarizing, and text structure for expository texts. Later researchers embedded a metacognitive component, such as self-monitoring, with a specific cognitive strategy and also developed multicomponent reading packages, such as reciprocal teaching, that integrated the use of several cognitive strategies. Instruction in cognitive and metacognitive strategies is a promising approach for students with learning disabilities to support their independent use of reading comprehension strategies and for promoting academic achievement across content areas and grade levels.

  • reading comprehension
  • story grammar
  • story mapping
  • summarization
  • text structure
  • multicomponent
  • cognitive strategies


Teach. Learn. Grow.


The science of teaching reading comprehension


We think about reading comprehension as the product of word recognition and language comprehension. Nationally, we’ve done a great job getting the word out on the importance of phonics. This is, arguably, the easiest part of the equation to get right. However, that’s not all that needs to happen in the early years so students are successful readers later on.

Two pathways to teaching reading comprehension

We at NWEA recently spoke with Natalie Wexler, an education writer and author of The Knowledge Gap: The Hidden Cause of America’s Broken Education System—And How to Fix It . Natalie reminds us that “We really have to see literacy developing along two pathways that are going to be, to some extent, pretty separate in the early years.”

These pathways are word recognition and language comprehension. While phonics has a mound of intervention research on how to effectively get students to reading fluency, cognitive science tells us that students need to acquire plenty of knowledge to be able to understand the texts they encounter, and that this must start early on. Otherwise, the opportunity gaps between kids with experiences to gain background knowledge and kids without will only grow wider.


In the early years, these pathways to becoming a reader are largely separate. Younger students or older readers with decoding difficulties won’t yet be able to read texts that build their vocabulary and knowledge. They need to have these rich and complex texts read aloud to them. What does this mean for educators? Both paths need to be effectively taught for the best chance of literacy success in the upper elementary grades and later in life. One doesn’t come before the other: decoding and comprehension both must be valued in the early grades, and both must be given adequate instructional time.

What is reading comprehension?

Natalie says, “we have to think of reading comprehension as a process.” Sometimes you may hear teachers asking students comprehension questions about a text. That treats comprehension as a product, not a process. Assessing students’ comprehension of a text by asking them questions is not the same as teaching students to comprehend.

Comprehension is a metacognitive skill, one that is developed through purposely choosing text sets to build knowledge and leveraging specific reading comprehension strategies to help students acquire this knowledge and apply these metacognitive skills on their own.

So how do we go about building knowledge?

Reading strategies should not be the focus of teaching reading comprehension. Instead, they should be used in service of teaching students new content. The most recent research suggests we use three strategies to help students learn the content of the texts they are reading. Specifically, when combined with instruction in vocabulary and background knowledge, these strategies are most helpful in building student knowledge and understanding. We can teach students to:

  • Identify the text structure
  • Using the text structure, identify the main idea
  • Summarize a text by expanding on the main idea

If students can summarize a text, they now have a situation model to work from. Think of it like helping them build a web of Velcro that all the details in the text can stick to. Teaching students to use these steps will help them build the metacognitive muscles they’ll need to do this type of understanding on their own. By helping students arrive at a coherent understanding, teachers position readers to do the deep work of making inferences, generating questions, and making connections.

Imagine, for example, a class of first-grade students learning about animals and their habitats in science. They read an informational text about owls. Their teacher may then plan to use the book Owl Moon by Jane Yolen to help students step into the role of the child protagonist who is going owling for the first time. Their teacher may refer to what the students learned about owls’ eyesight and sleeping patterns from the informational text. With these goals in mind, the teacher may use various reading strategies and activities to help students understand what they are reading and gain knowledge about animals and their habitats.

Before reading , the teacher may activate students’ background knowledge from the earlier lesson by asking questions like, “What are the ‘special powers’ we learned about owls yesterday?” and “What are owls’ sleeping patterns like?” Activating these concepts will help students make connections during the narrative story. The teacher may also focus students on a problem–solution sentence stem or a narrative story map to help them better understand the plot. The work could be displayed on an anchor chart in a student-friendly format so the class can take notes together. This could transition to students taking brief notes on a graphic organizer or dry-erase board once they are more independent spellers, typically toward the middle of the year.

During reading , the teacher may ask connecting questions to help solidify knowledge, such as, “When did this happen?” and “Why do you think Pa chose to take them owling so late?” The teacher may also highlight the meaning of unfamiliar vocabulary that is related to understanding the content, such as “pine trees,” “meadow,” or “clearing.” The teacher can list these words on index cards so students can refer to them and use them in their writing throughout the unit. As they encounter a plot element, they can record it together on their graphic organizer.

After reading , the class could talk about the plot structure and use the completed graphic organizer or sentence stems to summarize the story. The teacher could also have students add descriptive words about the owl’s habitat to their science journal. This could be extended to a few sentences to explain why it was so difficult to find an owl. Students may also be guided to use a graphic organizer to compare their learning about the owl habitat to the habitat of a field mouse they explored while reading Frederick by Leo Lionni.

Notice that each of the strategies and activities—from recognizing a story’s structure, to summarizing, to eliciting details and answering questions, to comparing and contrasting—are all in service of learning content related to the science unit on animals and their habitats. The focus of reading a new text is not on learning a certain strategy but using the strategies to learn the content.

Natalie notes, “There is evidence that teaching kids comprehension strategies, or at least certain kinds of comprehension strategies, does boost their comprehension. But we’ve been trying to do this in the abstract… What really will work better is teaching a topic and bringing in whatever strategy or skill is appropriate to help kids think deeply about that topic and understand that text for that topic.”

Recommendations for teachers

When teaching reading comprehension, I encourage teachers to avoid choosing texts to focus on a particular comprehension skill or strategy. Choose texts instead based on the content focus. Here are some suggestions for how to align your instructional focus with best practices in reading science:

  • Plan to use texts that revolve around a specific science or social studies topic. These can be both narrative and informational texts, as in the narrative example I shared earlier. Using texts around a common topic enables students to build a rich and enduring web of knowledge.
  • Teach students to identify the text structure and generate a main idea statement. This enables students to understand and summarize what they are reading more easily. When students understand the main idea of a text, it empowers them to move into higher levels of understanding.
  • Explicitly teach and review new vocabulary that relates back to the science or social studies topic. Help students understand how these words relate to one another and the topic at hand. Research in cognitive science suggests using distributed practice enables students to learn more words and, therefore, understand more concepts.

Recommendations for school administrators

If you’re a school administrator, here are some ways to support your teachers in this work of shifting from a strategy focus to a content focus when teaching reading comprehension:

  • Provide teachers with high-quality text sets for read-alouds related to your grade-level science and social studies standards. In second grade and up, also provide multiple copies of chapter books around these topics for students to discuss in small groups or as a whole-class book study.
  • Provide teachers high-quality professional learning and time to plan. Teachers need to be able to think deeply with one another about the vocabulary to highlight and strategies to use to help students acquire information and learn new concepts. Use practitioner articles to guide PLCs in integrating new practices into your existing curricula.
  • Create a culture of collaboration. Give time for art, music, PE, and other shared-subjects teachers to plan lessons around the topic of study. Students are more likely to learn deeply when they are building common knowledge across class periods.

To hear more from Natalie on the importance of effectively teaching reading comprehension, watch our interview with her.

For additional ideas and tips on literacy instruction from Teach. Learn. Grow. authors, browse our archive of ELA posts .


Enhancing Pre-Trained Generative Language Models with Question Attended Span Extraction on Machine Reading Comprehension

Machine Reading Comprehension (MRC) poses a significant challenge in the field of Natural Language Processing (NLP). While mainstream MRC methods predominantly leverage extractive strategies using encoder-only models such as BERT, generative approaches face the issue of out-of-control generation – a critical problem where answers generated are often incorrect, irrelevant, or unfaithful to the source text. To address these limitations in generative models for MRC, we introduce the Question-Attended Span Extraction (QASE) module. Integrated during the fine-tuning phase of pre-trained generative language models (PLMs), QASE significantly enhances their performance, allowing them to surpass the extractive capabilities of advanced Large Language Models (LLMs) such as GPT-4. Notably, these gains in performance do not come with an increase in computational demands. The efficacy of the QASE module has been rigorously tested across various datasets, consistently achieving or even surpassing state-of-the-art (SOTA) results. Our code is available at this anonymous repo link.

Lin Ai, Zheng Hui, Zizhou Liu, Julia Hirschberg (Columbia University, New York, NY)

1 Introduction

Machine Reading Comprehension (MRC), also referred to as text-grounded question answering (QA) Wang et al. ( 2022 ) , involves presenting a model with a text passage and a question, requiring it to formulate an answer based solely on the given text. This can be achieved either by identifying a specific span within the text or by generating a concise answer. MRC poses a significant challenge within the domain of Natural Language Processing (NLP). Predominant strategies for addressing MRC employ extractive methods, which typically extract pertinent text snippets from a broader context in response to a query Wang et al. ( 2018 ); Yan et al. ( 2019 ); Chen et al. ( 2020 ) . However, the most precise answers in practical settings often span multiple text passages or necessitate inferential reasoning that extends beyond the surface-level content Li et al. ( 2021 ) . Therefore, there is a compelling necessity to integrate generative models alongside extractive approaches to enhance the robustness and comprehensiveness of solutions in this field.

[Figure 1: Examples of out-of-control generation: (a) ill-formed generations and (b) factual inconsistencies.]

Yet, generative models often fall short in MRC tasks due to a phenomenon known as out-of-control generation Li et al. (2021), which encompasses two primary issues, as illustrated in Figure 1: (a) ill-formed generations that include incomplete or redundant phrases, and (b) factual inconsistencies that diverge from the intended information. To tackle these issues, this paper introduces the lightweight Question-Attended Span Extraction (QASE) module. This module is integrated during the fine-tuning of various open-source generative pre-trained language models (PLMs) across multiple MRC datasets to enhance the reliability and accuracy of the generated answers.

Our key contributions are outlined as follows:

We develop the QASE module to enhance the quality and factual accuracy of answers generated by fine-tuned generative PLMs, achieving performance on par with state-of-the-art (SOTA) extractive methods and surpassing that of advanced Large Language Models (LLMs) such as GPT-4.

QASE enhances model performance without imposing significant additional computational demands, offering a cost-effective solution for researchers operating under resource constraints.

2 Related Work

Research in MRC

In recent research on MRC, there is a predominant focus on extractive question answering using encoder-only PLMs such as BERT and XLM-RoBERTa. These studies typically involve predicting the start and end positions of answers directly from the context provided Ohsugi et al. (2019); Lan et al. (2019); Bachina et al. (2021); Chen et al. (2022). Additionally, to accommodate scenarios where answers comprise multiple spans from the text – termed the multi-span setting – researchers have proposed different strategies. Segal et al. (2020) treated this as a sequence tagging task in which each token in the sequence is tagged as being part of an answer span or not. Others Hu et al. (2019); Lee et al. (2023); Zhang et al. (2023) have experimented with hybrid approaches that combine different modeling techniques or tasks to enhance performance on complex MRC problems. Beyond these extractive methods, there is an emerging interest in applying generative language models (GLMs) for MRC Yang et al. (2020); Li et al. (2021); Jiang et al. (2022); Su et al. (2022). These models do not just predict the location of an answer within a given text but generate an answer by reformulating information found across the context.

Retrieval-augmented text generation (RAG)

RAG augments the input of PLMs with in-domain Gu et al. ( 2018 ); Weston et al. ( 2018 ); Saha and Srihari ( 2023 ) or external knowledge Su et al. ( 2021 ); Xiao et al. ( 2021 ) to control the quality and factual consistency of generated content. It has become a new text generation paradigm in many NLP tasks Li et al. ( 2022b ) , such as dialogue response generation Wu et al. ( 2021 ); Liu et al. ( 2023b ) and machine translation He et al. ( 2021 ); Zhu et al. ( 2023 ) . However, RAG is typically utilized in scenarios where document retrieval is necessary to reduce input context window Chen et al. ( 2024 ); Ram et al. ( 2023 ) , whereas selective MRC often requires accessing information beyond the immediate context. Our approach diverges from RAG as it directly fine-tunes the weights of the PLMs rather than altering the input to the PLMs with additional information.

Controllable Text Generation

In the domain of controllable text generation, significant progress has been achieved. Gururangan et al. ( 2020 ) employ fine-tuning of language models on domain-adaptive text to customize the attributes of generated content. Other promising methodologies include reinforcement learning Li et al. ( 2024 ) , contrastive learning Zheng et al. ( 2023 ) , and the use of control codes for fine-tuning PLMs Keskar et al. ( 2019 ) . Some approaches involve modifying the probability distribution of PLMs. For instance, Liu et al. ( 2021 ) propose a technique using two smaller "expert" models to refine the PLM’s output, while Yang and Klein ( 2021 ) introduce a method that conditions the generation process using a "future discriminator" based on predicted outcomes. Moreover, Huang et al. ( 2023 ) explore efficient multi-aspect text generation with trainable gates for enhanced control. QASE represents a novel adaptation of controlled text generation tailored to the specific challenges of MRC, with a focus on the precision and relevance of generated answers. Unlike methods that modify the overall generative process through complex architectural alterations or additional learning mechanisms, QASE directly utilizes the question context to guide the extraction and generation phases.

This section presents our proposed QASE module and the multi-task fine-tuning strategy we employ.


3.1 Question-Attended Span Extraction

To guide text generation, we employ the QASE module, a question-attended span extraction tool, during the fine-tuning of generative PLMs. QASE directs model focus to potential answer spans within the original text. We frame span extraction as a sequence tagging task using the Inside-Outside (IO) tagging schema. In this schema, each token is labeled as ‘inside’ (I) if it falls within a relevant span, or ‘outside’ (O) otherwise. This approach effectively handles both single- and multi-span extractions and has been shown to perform on par with or better than the well-known BIO format Huang et al. (2015), as demonstrated by Segal et al. (2020).
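
As a concrete illustration of the IO schema (a sketch, not the authors' released preprocessing code), the labeling step can be written as follows; span indices are assumed to be token-level and inclusive.

```python
# Label every context token 'I' if it falls inside any answer span, else 'O'.
def io_tags(num_tokens: int, spans: list[tuple[int, int]]) -> list[str]:
    tags = ["O"] * num_tokens
    for start, end in spans:            # end index is inclusive
        for i in range(start, end + 1):
            tags[i] = "I"
    return tags

# Two answer spans in a 10-token context (a multi-span example).
print(io_tags(10, [(2, 3), (7, 7)]))
# ['O', 'O', 'I', 'I', 'O', 'O', 'O', 'I', 'O', 'O']
```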

The hidden states from the PLM output are first passed through a projection layer: $z_i = \mathrm{ReLU}(W_{proj} v_i + b_{proj})$, where $v_i \in \mathbb{R}^d$ is the hidden state of the $i^{th}$ token from the PLM output.

To capture the relationship of context tokens to specific questions, we utilize a multi-head attention mechanism (MHA). Each attention head targets different aspects of the context in relation to the question, treating question embeddings as queries and context embeddings as keys and values. Specifically, for each question-context pair, we compute a mean question embedding by averaging the embeddings of the question tokens, which is then expanded to align with the length of the context sequence. This expanded question embedding, $z^{*}_{Q}$, serves as the query in the MHA, with the context embedding, $z_{C}$, acting as both key and value. This mechanism allows the derived representation of each context token to encapsulate its relevance to the posed question.

Finally, the QASE module processes the projected embeddings $z_{C}$ and $z^{*}_{Q}$ through the MHA mechanism, followed by a linear layer and a softmax layer, to compute the probability $p_{C_i}$ that the $i^{th}$ context token belongs to an answer span. To measure the accuracy of span prediction, we compute a sequence tagging loss using cross-entropy.
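
A rough PyTorch sketch of the module as described above is shown below; the projection size, number of attention heads, and layer names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class QASEHead(nn.Module):
    def __init__(self, hidden_dim: int, proj_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, proj_dim)                 # z_i = ReLU(W v_i + b)
        self.mha = nn.MultiheadAttention(proj_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(proj_dim, 2)                    # IO tags: inside / outside

    def forward(self, context_hidden: torch.Tensor, question_hidden: torch.Tensor) -> torch.Tensor:
        z_c = torch.relu(self.proj(context_hidden))                 # (B, Lc, proj_dim)
        z_q = torch.relu(self.proj(question_hidden))                # (B, Lq, proj_dim)
        # Mean question embedding, expanded to the context length, serves as the query;
        # the projected context embeddings serve as keys and values.
        q_query = z_q.mean(dim=1, keepdim=True).expand_as(z_c)
        attended, _ = self.mha(query=q_query, key=z_c, value=z_c)
        return self.classifier(attended)                            # per-token IO logits

# Toy usage: batch of 2, 12 context tokens, 6 question tokens, hidden size 1024.
head = QASEHead(hidden_dim=1024)
logits = head(torch.randn(2, 12, 1024), torch.randn(2, 6, 1024))
gold_tags = torch.zeros(2, 12, dtype=torch.long)                    # 0 = 'O', 1 = 'I'
tag_loss = nn.CrossEntropyLoss()(logits.reshape(-1, 2), gold_tags.reshape(-1))
```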

3.2 Fine-Tuning and Inference

We fine-tune the PLMs with a multi-task learning strategy that concurrently optimizes both the language modeling loss and the sequence tagging loss:

$\mathcal{L} = \mathcal{L}_{LM} + \beta \, \mathcal{L}_{tag},$

where $\beta$ is a hyper-parameter that determines the weight assigned to the span extraction task. This dual-objective approach substantially improves the PLMs’ ability to generate contextually grounded and relevant answers. During the inference phase, only the generation component of the fine-tuned model is used.

4 Experiments

This section presents the experimental framework, detailing the datasets used, experimental setup, comprehensive quantitative results of model performance, ablation studies, analysis of model factual consistency, and qualitative case studies.

4.1 Datasets and Metrics

We utilize three MRC benchmark datasets:

SQuAD Rajpurkar et al. (2016): A benchmark dataset consisting of 100K+ questions with single-span answers. We use SQuAD v1.1. Since official evaluation on v1.1 ended long ago, we report our results on the official v1.1 development set.

MultiSpanQA Li et al. ( 2022a ) : This dataset consists of over 6.5k question-answer pairs. Unlike most existing single-span answer MRC datasets, MultiSpanQA focuses on multi-span answers.

Quoref Dasigi et al. (2019): A benchmark dataset containing more than 24K questions, with most answers being single-span and approximately 10% being multi-span.

Following the conventions of the datasets’ official leaderboards (listed in A.1 ), we employ exact match (EM) and partial match (Overlap) F1 scores as metrics on MultiSpanQA, and exact match percentage and macro-averaged F1 score on SQuAD and Quoref.
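
For readers unfamiliar with these metrics, the following is a simplified sketch of exact match and token-overlap F1. It assumes whitespace tokenization and lowercasing only; the official evaluation scripts apply additional normalization (e.g., stripping articles and punctuation).

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold.strip().lower()

def overlap_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)   # shared tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("ESPN Deportes", "espn deportes"))                        # True
print(round(overlap_f1("the ESPN Deportes network", "ESPN Deportes"), 2))   # 0.67
```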

4.2 Experimental Setup

To assess the efficacy of the QASE module independent of any specific language models, we conduct experiments with multiple open-source LLMs. Our tests include both decoder-only LLMs, such as Llama 2 Touvron et al. ( 2023 ) and Alpaca Taori et al. ( 2023 ) , and an encoder-decoder model family, Flan-T5 Chung et al. ( 2022 ) . For Llama 2 and Alpaca, we employ the pre-trained 7B version and fine-tune it using LoRA Hu et al. ( 2021 ) combined with instruction-tuning (instruction templates are detailed in A.4 ). For the Flan-T5 family, we fine-tune the small, base, and large versions. Detailed information about the trainable parameters for each model is provided in Table 1 .

We determine the hyper-parameter $\beta = 1$ and the learning rate $lr = 1e{-}4$ using results from a grid search. For the LoRA fine-tuning of the Llama 2 and Alpaca models, we set a rank $r = 8$, $\alpha = 32$, and a dropout rate of 0.05. The methodology for selecting these hyper-parameters is detailed in A.2. All models are trained on individual GPUs with batch sizes ranging from 2 to 4, adjusted according to each GPU’s VRAM capabilities. We employ four types of GPUs: A40, A10, A5500, and A100. Training continues for three epochs or until the models converge. Consistency is maintained across all variants of each base PLM in terms of GPU type, batch size, and training epochs.
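
For reference, a hedged sketch of a comparable LoRA setup using the Hugging Face peft library is shown below; the model identifier and task type are illustrative assumptions, not the exact training script.

```python
# LoRA configuration mirroring the reported hyper-parameters: r=8, alpha=32, dropout=0.05.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative checkpoint
lora_config = LoraConfig(
    r=8,               # LoRA rank
    lora_alpha=32,     # scaling factor
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```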

4.3 Experiment Results

To assess the QASE module, we compare the performance of various PLMs fine-tuned with and without QASE , as detailed in Table 2 . Overall, models fine-tuned with QASE consistently outperform those without it. Specifically, for the SQuAD dataset, models with QASE show an exact match (EM) percentage increase of up to 33.8% and an F1 score improvement of up to 8.4% compared to vanilla fine-tuned models. For MultiSpanQA, improvements include up to 1.6% in EM F1 and up to 3.3% in overlap F1. Likewise, on the Quoref dataset, enhancements of up to 19.2% in EM percentage and up to 16.0% in F1 score are observed. These results confirm that QASE enables generative-based PLMs to produce more accurate, contextually coherent, and higher-quality answers in MRC tasks compared to vanilla fine-tuning approaches.

For additional comparisons, we also evaluate the fine-tuned PLMs against their zero-shot performance, as outlined in Appendix A.3 . Specifically, on the SQuAD dataset, models using QASE perform up to 5.6 times better in EM and 3.0 times better in F1 score compared to the zero-shot models. On the MultiSpanQA dataset, the EM improves by up to 124.4 times, and F1 score by up to 3.4 times. Similarly, on the Quoref dataset, the EM improves by up to 38.4 times, and F1 score by up to 11.2 times with QASE . It is important to note that these substantial improvements stem from comparing zero-shot models to those fine-tuned with QASE . Nonetheless, the previously discussed results comparing fine-tuned models with and without QASE have clearly illustrated its effectiveness.

4.4 Model Comparisons

Our top model, Flan-T5-Large QASE , is further benchmarked against leading models on each dataset’s official leaderboard, alongside zero-shot GPT-3.5-Turbo and GPT-4. GPT-3.5-Turbo stands as one of OpenAI’s most efficient models in terms of capability and cost, while GPT-4 shows superior reasoning abilities Liu et al. ( 2023c ) . Studies indicate their superiority over traditional fine-tuning methods in most logical reasoning benchmarks Liu et al. ( 2023a ) . The prompts used to query the GPT variants are detailed in Appendix A.4 .

On SQuAD, as shown in Table 3, Flan-T5-Large QASE surpasses human performance, equaling the NLNet model from Microsoft Research Asia and the original pre-trained BERT-Large from Google Devlin et al. (2019). Additionally, it surpasses GPT-4 by 113.8% on the exact match score and 32.6% on F1.

On MultiSpanQA , Table 4 shows that Flan-T5-Large QASE outperforms LIQUID Lee et al. ( 2023 ) , which currently ranks #1 on the leaderboard, with respect to the overlap F1 score. Moreover, it surpasses GPT-4 by 4.5% on the exact match F1 and 1.5% on the overlap F1.

On Quoref , Table 5 shows that Flan-T5-Large QASE is comparable to CorefRoberta-Large Ye et al. ( 2020 ) , which ranks #9 on the leaderboard, with a 0.5% higher exact match. Furthermore, it outperforms GPT-4 by 11.9% on the exact match and 4.8% on F1.

All top-performing models on these datasets’ leaderboards, equaling or exceeding Flan-T5-Large QASE , are encoder-only extractive models. Therefore, these results demonstrate that QASE -enhanced generative PLMs can be fine-tuned to match or exceed the capabilities of SOTA extractive models and outperform leading LLMs in MRC.

4.5 Ablation Studies

We conduct ablation studies to assess the effectiveness of the QASE architecture and to determine the optimal prompting strategy. Specifically, we compare Flan-T5-Large QASE with both the vanilla fine-tuned Flan-T5-Large FT and the baseline Flan-T5-Large baseline . As shown in Figure 3 in Appendix A.5 , the baseline span extraction module does not include the MHA component, rendering it a conventional architecture for fine-tuning pre-trained encoders on downstream sequence tagging tasks. For each configuration – Flan-T5-Large FT , Flan-T5-Large QASE , and Flan-T5-Large baseline – we explored both a question-first ( qf ) and a context-first prompting strategy, with a detailed description of these strategies provided in Appendix A.5 .

Table 6 shows that the baseline-embedded model performs better with a question-first prompting strategy, as Flan-T5-Large baseline (qf) surpasses both Flan-T5-Large baseline and Flan-T5-Large FT (qf). Conversely, the baseline span extraction module decreases performance in the context-first setting, where Flan-T5-Large baseline underperforms compared to Flan-T5-Large FT. This suggests that adding an auxiliary span extraction module without careful design can negatively affect instruction fine-tuning. Meanwhile, the QASE-enhanced model excels over both the vanilla fine-tuned and baseline-embedded models in both prompting scenarios, demonstrating its architectural superiority. Specifically, in the context-first setting, Flan-T5-Large QASE significantly outperforms Flan-T5-Large baseline with a 4.3% higher F1.

4.6 Computational Cost

To assess the computational cost associated with QASE , Table 1 reveals that incorporating the QASE module incurs only a slight increase in the number of trainable parameters in PLMs. The degree of this increase varies based on the hidden sizes of the models. Remarkably, for the largest model, Flan-T5-Large, the addition of QASE accounts for merely an extra 0.2% in parameters. This underscores that QASE can substantially boost the performance of fine-tuned PLMs in MRC tasks without requiring significant additional computational resources.

4.7 Factual Consistency

While token-based EM and F1 scores measure the structural quality of generated text, they do not reflect factual accuracy relative to the context. For this we use Q² Honovich et al. (2021), an automatic metric for assessing factual consistency in generated text that relies on question generation and question answering rather than token-based matching. We compare fine-tuned Flan-T5-Large with and without QASE in both single-span (SQuAD) and multi-span (MultiSpanQA) answer settings. Table 7 shows that QASE-enhanced models consistently outperform the vanilla fine-tuned model. On SQuAD, the Q² NLI score improves by 1.0%, and on MultiSpanQA it improves by 16.0%.

4.7.1 Qualitative Case Studies

In addition to the Q² statistical analysis, we also perform qualitative case studies to further demonstrate the effectiveness of QASE in generating factually consistent answers.

Question Attended Alignment

Table 8 shows that Flan-T5-Large QASE more accurately identifies the key focus of the question and locates the pertinent factual information within the context, with the aid of the QASE module. For instance, in Sample 1, Flan-T5-Large QASE correctly interprets the question as seeking the age difference between Newton and Manning, rather than the age of either individual, and accordingly provides the accurate answer. In contrast, Flan-T5-Large FT mistakenly provides Newton’s age as the answer. Similarly, in Sample 2, Flan-T5-Large QASE accurately discerns that the question pertains to Thoreau’s claim regarding the majority and generates the correct answer, whereas Flan-T5-Large FT misguidedly responds with Thoreau’s political philosophy.

Multi-Span Answers

Flan-T5-Large QASE also shows a notable improvement in comprehending complex, lengthy sentences and synthesizing answers from information that is sparsely distributed across multiple spans requiring logical processing. This capability is particularly valuable when the answer to a question does not directly stem from a single phrase. Table 9 provides examples of such instances. In Sample 3 , the model needs to recognize that ESPN Deportes is the exclusive broadcaster in Spanish and that CBS, although mentioned, does not offer Spanish-language broadcasting. Combining these facts leads to the correct answer, that ESPN Deportes is the network that broadcast the game in Spanish. Flan-T5-Large QASE accurately generates this answer, whereas Flan-T5-Large FT incorrectly answers with "CBS", likely due to confusion caused by the complex sentence structures and dispersed information. Similarly, in Sample 4 , Flan-T5-Large QASE correctly identifies the question as seeking the name of the force related to a potential field between two locations. It successfully locates the relevant long sentence, deconstructs, and comprehends it to produce the correct answer, in contrast to Flan-T5-Large FT , which incorrectly selects the first phrase mentioning "force". In Sample 5 , the question asks for the class most commonly not ascribed to the graph isomorphism problem. The model needs to deduce from the context that "it is widely believed that the polynomial hierarchy does not collapse to any finite level", implying "graph isomorphism is not NP-complete". Once again, Flan-T5-Large QASE arrives at the correct conclusion, while Flan-T5-Large FT does not.

Real-World Knowledge

While our primary evaluation focuses on the model’s proficiency in deriving answers from provided contexts, we also note that QASE enhances the model’s capacity to leverage real-world knowledge acquired during its pre-training phase. This improvement is attributed to QASE ’s ability to better align the model’s focus on parts of the context that are relevant to the questions asked. Table 10 presents an example of this phenomenon. In Sample 6 , when asked about the California venue considered for the Super Bowl, Flan-T5-Large QASE correctly associates the San Francisco Bay Area with California, thus producing the accurate answer. On the other hand, Flan-T5-Large FT erroneously identifies a stadium in Miami as the answer. This example illustrates how QASE not only improves context-based answer generation but also the model’s application of pre-existing real-world knowledge to the questions posed.

5 Discussions

In this section, we briefly address the weak performance of Flan-T5 zero-shot and Llama 2 on MRC tasks, despite their strong language understanding abilities. We note that a comprehensive analysis is beyond our study’s scope. Our goal is to gain insights into further improving these PLMs’ effectiveness in MRC.

5.1 Flan-T5 Zero-Shot Performance

Despite being trained on SQuAD during pre-training, Flan-T5 models demonstrate poor performance across datasets, including SQuAD. While a comprehensive analysis of Flan-T5’s performance is beyond the focus of our study, we briefly explore potential reasons for this underperformance to gain better insights. This underperformance may stem from their training on a wide range of tasks (1,836 tasks), focusing on free-form generation, QA, and reasoning tasks, rather than being finely optimized for extractive QA tasks like MRC. Additionally, generative models like Flan-T5 and Llama 2 generally struggle in MRC tasks, as discussed earlier. For extended discussions, refer to Appendix B.1 .

For fairness in our zero-shot experiments, we compare our prompt template with Google’s instruct-tuning prompts for Flan-T5 on the SQuAD v1 dataset. Our results, as illustrated in Table 14 , reveal that our prompt template achieves the highest F1 score. This implies that Flan-T5’s lower zero-shot performance on MRC is expected.

5.2 Llama 2 Performance

We also observe that models based on Llama 2 and Alpaca consistently underperform compared to those based on Flan-T5, across zero-shot and fine-tuned scenarios, with or without QASE . This discrepancy may arise from the significant difference in the number of trainable parameters, as illustrated in Table 1 , during fine-tuning. Additionally, factors such as differences in pre-training datasets and varied adaptation to tasks due to structural disparities can also contribute to this performance gap. While acknowledging these factors, conducting a comprehensive comparison of different generative model architectures in MRC tasks exceeds the scope of our study. For further discussion, please refer to Appendix B.2 .

5.3 Prompting Strategies

While we acknowledge the prevalent use of prompting strategies like in-context learning, our main focus is not on evaluating these strategies but rather on improving the performance of generative PLMs on MRC tasks through fine-tuning, with and without the integration of QASE . However, we do experiment with various prompt templates, as discussed in Section 4.5 and Appendix B.1 .

6 Conclusion and Future Work

In this study, we address the out-of-control generation issue of generative PLMs in MRC using QASE , a lightweight question-attended span extraction module, during the fine-tuning of PLMs. Our experiments show that QASE -enhanced PLMs generate better-quality responses with improved formality and factual consistency, matching SOTA extractive models and outperforming GPT-4 by a significant margin on all three MRC datasets. Importantly, QASE improves performance without a significant increase in computational cost, benefiting researchers with limited resources.

In the future, we aim to evaluate our model on generative MRC datasets (Nguyen et al., 2016) to gauge its effectiveness in handling more intricate scenarios. Additionally, a significant emphasis will be placed on assessing the model’s overall answer-generation capability as perceived by humans, incorporating human annotators alongside automatic metrics. Looking further ahead, we aspire to extend our research to explore strategies for mitigating input- and context-conflicting hallucinations in LLMs.

Limitations

Due to our limited computational resources, we have only been able to perform our experiments on models no larger than Flan-T5-Large. The same constraint led us to fine-tune Llama 2 and Alpaca only with LoRA. We note that models based on Llama 2 and Alpaca generally underperform those based on Flan-T5. Apart from the inherent distinctions between decoder-only and encoder-decoder models and their suitability for different tasks (as seen from the models’ zero-shot performance), a possible factor is the number of trainable parameters during fine-tuning. Specifically, fine-tuning Llama 2 and Alpaca with LoRA yields only 4.2M trainable parameters, whereas even the smallest Flan-T5 model provides 77.0M trainable parameters, as shown in Table 1 . We acknowledge that many researchers face similar computational resource limitations. Our research should therefore be broadly useful, as it proposes a lightweight module that enables smaller PLMs to outperform leading LLMs on MRC tasks, balancing effectiveness and affordability.

One foreseeable limitation of our work is the dependency of the fine-tuning process on answer span annotations, since QASE works as an auxiliary supervised span extraction module. This reliance on annotated data could limit the model’s broader applicability. A promising future direction to address this limitation is to develop a semi- or unsupervised module that selects relevant spans or rationales within a given context. Integrating such a module with our current model could significantly improve its generalization capabilities, making it more adaptable and effective across a wider range of scenarios.

One popular method to enhance the formality of answers generated by LLMs is prompt engineering, paired with few-shot or in-context learning techniques. While these strategies offer clear advantages, our ultimate goal is a system with broad domain generalization that minimizes the need for extensive, calibrated prompt engineering and sample selection for task adaptation. Although developing a robust prompt engineering framework is an appealing direction, our current focus diverges from this path. As a long-term goal, we aim for a solution that handles diverse tasks with minimal task-specific tuning.

  • Bachina et al. (2021) Sony Bachina, Spandana Balumuri, and Sowmya Kamath S. 2021. Ensemble ALBERT and RoBERTa for span prediction in question answering . In Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021) , pages 63–68, Online. Association for Computational Linguistics.
  • Chen et al. (2024) Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking large language models in retrieval-augmented generation . Proceedings of the AAAI Conference on Artificial Intelligence , 38(16):17754–17762.
  • Chen et al. (2020) Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, and Wei Chu. 2020. Question directed graph attention network for numerical reasoning over text. arXiv preprint arXiv:2009.07448 .
  • Chen et al. (2022) Nuo Chen, Linjun Shou, Ming Gong, and Jian Pei. 2022. From good to best: Two-stage training for cross-lingual machine reading comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 36, pages 10501–10508.
  • Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 .
  • Dasigi et al. (2019) Pradeep Dasigi, Nelson F. Liu, Ana Marasović, Noah A. Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages 5925–5932, Hong Kong, China. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Gu et al. (2018) Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor OK Li. 2018. Search engine guided neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 32.
  • Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 8342–8360, Online. Association for Computational Linguistics.
  • He et al. (2021) Qiuxiang He, Guoping Huang, Qu Cui, Li Li, and Lemao Liu. 2021. Fast and accurate neural machine translation with translation memory . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages 3170–3180, Online. Association for Computational Linguistics.
  • Honovich et al. (2021) Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. 2021. Q²: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. arXiv preprint arXiv:2104.08202 .
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 .
  • Hu et al. (2019) Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. 2019. A multi-type multi-span network for reading comprehension that requires discrete reasoning . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages 1596–1606, Hong Kong, China. Association for Computational Linguistics.
  • Huang et al. (2023) Xuancheng Huang, Zijun Liu, Peng Li, Tao Li, Maosong Sun, and Yang Liu. 2023. An extensible plug-and-play method for multi-aspect controllable text generation . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 15233–15256, Toronto, Canada. Association for Computational Linguistics.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 .
  • Jiang et al. (2022) Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. 2022. Understanding and improving zero-shot multi-hop reasoning in generative question answering . Preprint , arXiv:2210.04234.
  • Keskar et al. (2019) Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A conditional transformer language model for controllable generation . Preprint , arXiv:1909.05858.
  • Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 .
  • Lee et al. (2023) Seongyun Lee, Hyunjae Kim, and Jaewoo Kang. 2023. Liquid: A framework for list question answering dataset generation. arXiv preprint arXiv:2302.01691 .
  • Li et al. (2021) Chenliang Li, Bin Bi, Ming Yan, Wei Wang, and Songfang Huang. 2021. Addressing semantic drift in generative question answering with auxiliary extraction . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) , pages 942–947, Online. Association for Computational Linguistics.
  • Li et al. (2022a) Haonan Li, Martin Tomko, Maria Vasardani, and Timothy Baldwin. 2022a. MultiSpanQA: A dataset for multi-span question answering . In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 1250–1260, Seattle, United States. Association for Computational Linguistics.
  • Li et al. (2022b) Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. 2022b. A survey on retrieval-augmented text generation. arXiv preprint arXiv:2202.01110 .
  • Li et al. (2024) Wendi Li, Wei Wei, Kaihe Xu, Wenfeng Xie, Dangyang Chen, and Yu Cheng. 2024. Reinforcement learning with token-level feedback for controllable text generation . Preprint , arXiv:2403.11558.
  • Liu et al. (2021) Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. DExperts: Decoding-time controlled text generation with experts and anti-experts . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages 6691–6706, Online. Association for Computational Linguistics.
  • Liu et al. (2023a) Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. 2023a. Evaluating the logical reasoning ability of chatgpt and gpt-4. arXiv preprint arXiv:2304.03439 .
  • Liu et al. (2023b) Shuai Liu, Hyundong Cho, Marjorie Freedman, Xuezhe Ma, and Jonathan May. 2023b. RECAP: Retrieval-enhanced context-aware prefix encoder for personalized dialogue response generation . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 8404–8419, Toronto, Canada. Association for Computational Linguistics.
  • Liu et al. (2023c) Xiao Liu, Junfeng Yu, Yibo He, Lujun Zhang, Kaiyichen Wei, Hongbo Sun, and Gang Tu. 2023c. System report for CCL23-eval task 9: HUST1037 explore proper prompt strategy for LLM in MRC task . In Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations) , pages 310–319, Harbin, China. Chinese Information Processing Society of China.
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 .
  • Ohsugi et al. (2019) Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Hisako Asano, and Junji Tomita. 2019. A simple but effective method to incorporate multi-turn context with BERT for conversational machine comprehension . In Proceedings of the First Workshop on NLP for Conversational AI , pages 11–17, Florence, Italy. Association for Computational Linguistics.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
  • Ram et al. (2023) Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-Context Retrieval-Augmented Language Models . Transactions of the Association for Computational Linguistics , 11:1316–1331.
  • Saha and Srihari (2023) Sougata Saha and Rohini Srihari. 2023. ArgU: A controllable factual argument generator . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 8373–8388, Toronto, Canada. Association for Computational Linguistics.
  • Segal et al. (2020) Elad Segal, Avia Efrat, Mor Shoham, Amir Globerson, and Jonathan Berant. 2020. A simple and effective model for answering multi-span questions . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 3074–3080, Online. Association for Computational Linguistics.
  • Su et al. (2022) Dan Su, Xiaoguang Li, Jindi Zhang, Lifeng Shang, Xin Jiang, Qun Liu, and Pascale Fung. 2022. Read before generate! faithful long form question answering with machine reading . In Findings of the Association for Computational Linguistics: ACL 2022 , pages 744–756, Dublin, Ireland. Association for Computational Linguistics.
  • Su et al. (2021) Yixuan Su, Yan Wang, Deng Cai, Simon Baker, Anna Korhonen, and Nigel Collier. 2021. Prototype-to-style: Dialogue generation with style-aware editing on retrieval memory. IEEE/ACM Transactions on Audio, Speech, and Language Processing , 29:2152–2161.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca .
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 .
  • Wang et al. (2022) Luqi Wang, Kaiwen Zheng, Liyin Qian, and Sheng Li. 2022. A survey of extractive question answering. In 2022 International Conference on High Performance Big Data and Intelligent Systems (HDIS) , pages 147–153. IEEE.
  • Wang et al. (2018) Wei Wang, Ming Yan, and Chen Wu. 2018. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1705–1714, Melbourne, Australia. Association for Computational Linguistics.
  • Weston et al. (2018) Jason Weston, Emily Dinan, and Alexander Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue . In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI , pages 87–92, Brussels, Belgium. Association for Computational Linguistics.
  • Wu et al. (2021) Zeqiu Wu, Michel Galley, Chris Brockett, Yizhe Zhang, Xiang Gao, Chris Quirk, Rik Koncel-Kedziorski, Jianfeng Gao, Hannaneh Hajishirzi, Mari Ostendorf, et al. 2021. A controllable model of grounded response generation. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 35, pages 14085–14093.
  • Xiao et al. (2021) Fei Xiao, Liang Pang, Yanyan Lan, Yan Wang, Huawei Shen, and Xueqi Cheng. 2021. Transductive learning for unsupervised text style transfer . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages 2510–2521, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Yan et al. (2019) Ming Yan, Jiangnan Xia, Chen Wu, Bin Bi, Zhongzhou Zhao, Ji Zhang, Luo Si, Rui Wang, Wei Wang, and Haiqing Chen. 2019. A deep cascade model for multi-document reading comprehension. In Proceedings of the AAAI conference on artificial intelligence , volume 33, pages 7354–7361.
  • Yang et al. (2020) Junjie Yang, Zhuosheng Zhang, and Hai Zhao. 2020. Multi-span style extraction for generative reading comprehension. arXiv preprint arXiv:2009.07382 .
  • Yang and Klein (2021) Kevin Yang and Dan Klein. 2021. FUDGE: Controlled text generation with future discriminators . In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 3511–3535, Online. Association for Computational Linguistics.
  • Ye et al. (2020) Deming Ye, Yankai Lin, Jiaju Du, Zhenghao Liu, Peng Li, Maosong Sun, and Zhiyuan Liu. 2020. Coreferential Reasoning Learning for Language Representation . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 7170–7186, Online. Association for Computational Linguistics.
  • Zhang et al. (2023) Chen Zhang, Jiuheng Lin, Xiao Liu, Yuxuan Lai, Yansong Feng, and Dongyan Zhao. 2023. How many answers should i give? an empirical study of multi-answer reading comprehension. arXiv preprint arXiv:2306.00435 .
  • Zheng et al. (2023) Chujie Zheng, Pei Ke, Zheng Zhang, and Minlie Huang. 2023. Click: Controllable text generation with sequence likelihood contrastive learning . Preprint , arXiv:2306.03350.
  • Zhu et al. (2023) Wenhao Zhu, Jingjing Xu, Shujian Huang, Lingpeng Kong, and Jiajun Chen. 2023. INK: Injecting kNN knowledge in nearest neighbor machine translation . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 15948–15959, Toronto, Canada. Association for Computational Linguistics.

Appendix A Detailed Experiment Setup and Results

A.1 Dataset Leaderboards

Below are the official leaderboards for all the datasets we refer to:

A.2 Hyper-Parameter Selection

In this section, we outline the process for selecting the hyper-parameter β and detail our approach to LoRA fine-tuning.

For selecting β, we use a grid search, exploring values from 0.5 to 2 in increments of 0.1 on 30% of the MultiSpanQA training dataset. This process determines that β = 1 empirically yields the best performance, so it is selected for use in our experiments.
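For reference, the block below gives a plausible form of the joint objective that β weights, assuming the auxiliary QASE span-extraction loss is simply added to the answer-generation loss; the exact objective is defined in the main text and is not reproduced here.

```latex
% Assumed form of the joint fine-tuning objective (illustrative, not the paper's
% exact formula): \beta weights the auxiliary QASE span-extraction loss against
% the answer-generation loss.
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{gen}} + \beta \, \mathcal{L}_{\text{QASE}}
```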

To select the learning rate lr, we conduct a grid search over the values {1e-5, 5e-5, 1e-4, 5e-4, 1e-3} on 30% of the MultiSpanQA training dataset. Empirically, 1e-4 demonstrates the best performance and is therefore chosen for our experiments. This selection agrees with the default lr value used in Meta’s official Llama 2 fine-tuning recipe.
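To make the search procedure concrete, here is a minimal sketch of the two grid searches. The helper train_and_evaluate is a hypothetical stand-in for fine-tuning on the 30% MultiSpanQA split and returning a validation F1 score, and holding lr fixed during the β search is our assumption; none of this is code released with the paper.

```python
import random

def train_and_evaluate(beta: float, lr: float) -> float:
    # Placeholder: in practice this would fine-tune the model with the given
    # hyper-parameters on the 30% MultiSpanQA split and return validation F1.
    return random.random()

def grid_search(name, values, fixed):
    # Return the value of `name` that maximizes the validation score.
    best_value, best_f1 = None, float("-inf")
    for v in values:
        f1 = train_and_evaluate(**{name: v}, **fixed)
        if f1 > best_f1:
            best_value, best_f1 = v, f1
    return best_value

# Beta search: 0.5 to 2.0 in steps of 0.1 (learning rate held fixed by assumption).
betas = [round(0.5 + 0.1 * i, 1) for i in range(16)]
best_beta = grid_search("beta", betas, fixed={"lr": 1e-4})

# Learning-rate search over the values listed above.
best_lr = grid_search("lr", [1e-5, 5e-5, 1e-4, 5e-4, 1e-3], fixed={"beta": best_beta})

print(best_beta, best_lr)  # the paper reports beta = 1 and lr = 1e-4 as best
```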

In the case of LoRA fine-tuning, we follow the methodology outlined by Hu et al. ( 2021 ). This involves applying LoRA to Llama 2 and the pre-trained Alpaca models by freezing their pre-trained weights and integrating trainable rank-decomposition matrices at every layer of their Transformer structures, reducing the number of trainable parameters to enhance computational efficiency. We implement this using the Hugging Face PEFT package. The LoRA fine-tuning hyper-parameters follow the default settings in Meta’s official Llama 2 fine-tuning recipe: rank r = 8, α = 32, and a dropout rate of 0.05.
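A minimal sketch of this LoRA setup with the Hugging Face PEFT package is shown below. The hyper-parameters mirror the values quoted above (r = 8, α = 32, dropout 0.05), while the target modules and the model identifier are assumptions for illustration rather than details taken from the paper.

```python
# Minimal sketch of the LoRA setup described above, using the Hugging Face PEFT
# package. Hyper-parameters follow the quoted values; target modules and the
# model identifier are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the decomposition matrices
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports only a few million trainable parameters
```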

A.3 Full Experiment Results

In addition to the highlighted results presented in Section 4 , we also compare the fine-tuned PLMs to their corresponding base PLMs in zero-shot settings. The results, presented in Table 12 , show that fine-tuning with QASE improves performance across all datasets. Specifically, on the SQuAD dataset, models using QASE perform up to 5.6 times better in exact match and 3.0 times better in F1 score compared to the original models. On the MultiSpanQA dataset, the exact match improves by up to 124.4 times, and F1 score by up to 3.4 times. Similarly, on the Quoref dataset, the exact match improves by up to 38.4 times, and F1 score by up to 11.2 times with QASE .
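The exact match and F1 figures above are the standard SQuAD-style metrics; the following is a minimal sketch of how they are commonly computed, using the usual answer normalization (lowercasing, removing punctuation and articles). This is illustrative code, not code released with the paper.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation and articles, and collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("ESPN Deportes", "the ESPN Deportes network"))          # 0.0
print(round(f1_score("ESPN Deportes", "the ESPN Deportes network"), 2))   # 0.8
```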

A.4 Instruction Templates and Model Prompts

Table 13 provides the instruction and prompt templates used for fine-tuning the PLMs and for zero-shot querying of PLMs and GPT variants across both single- and multi-span answer datasets.
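As an illustration of how such a prompt is issued in the zero-shot setting, the sketch below queries Flan-T5-Large with the Hugging Face transformers library; the instruction wording and the toy context are illustrative and not the exact template from Table 13.

```python
# Sketch of zero-shot MRC querying with Flan-T5-Large; prompt wording is
# illustrative, not the exact Table 13 template.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

context = (
    "Super Bowl 50 was broadcast in Spanish by ESPN Deportes, "
    "while CBS carried the English-language broadcast."
)
question = "Which network broadcast the game in Spanish?"

# Context-first ordering with an instruction asking for exact phrases.
prompt = (
    "Answer the question with exact phrases from the context and avoid explanations.\n"
    f"Context: {context}\nQuestion: {question}"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```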

A.5 Ablation Studies Details

Figure 3 depicts the architecture of the model we use for the ablation studies, with a baseline span extraction module. The baseline span extraction module omits the MHA component, typifying a standard architecture for fine-tuning pre-trained encoders on downstream sequence tagging tasks. The baseline-embedded Flan-T5-Large models are fine-tuned with the same configuration as Flan-T5-Large QASE , including learning rate, weight decay, batch size, epoch number, and GPU type.

[Figure 3: ablation model architecture with the baseline span extraction module.]
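For concreteness, below is a minimal sketch of such a baseline span extraction head, assuming a single linear tagging layer over the encoder hidden states with BIO labels; the hidden size and tag set are assumptions for illustration, not details from the paper.

```python
# Minimal sketch of a baseline (no-MHA) span extraction head: a linear BIO
# tagger over encoder hidden states. Shapes and tag set are assumptions.
import torch
import torch.nn as nn

class BaselineSpanExtractor(nn.Module):
    def __init__(self, hidden_size: int = 1024, num_tags: int = 3):  # B, I, O
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_tags)

    def forward(self, encoder_hidden_states: torch.Tensor) -> torch.Tensor:
        # encoder_hidden_states: (batch, seq_len, hidden_size)
        # returns per-token tag logits: (batch, seq_len, num_tags)
        return self.classifier(encoder_hidden_states)

# Example: tag logits for a batch of 2 sequences of 128 tokens.
hidden = torch.randn(2, 128, 1024)
print(BaselineSpanExtractor()(hidden).shape)  # torch.Size([2, 128, 3])
```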

We experiment with two prompting strategies for the ablation studies (a brief construction sketch follows their descriptions):

Context-first prompting: The default prompting strategy we utilize for fine-tuning PLMs, both with and without QASE . In this setting, the prompt is ordered as "<instruction tokens> <context tokens> <question tokens>".

Question-first prompting ( qf ): Following BERT’s standard fine-tuning procedures. In this setting, the prompt is ordered as "<instruction tokens> <question tokens> <SEP> <context tokens>". <SEP> is a special separator token.
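A small sketch of how the two orderings might be constructed is given below; the literal instruction text and separator token are placeholders rather than the exact strings used in our experiments.

```python
# Sketch of context-first vs. question-first prompt construction; instruction
# text and separator token are illustrative placeholders.
def build_prompt(instruction: str, context: str, question: str,
                 question_first: bool = False, sep_token: str = "<SEP>") -> str:
    if question_first:
        # "qf": <instruction tokens> <question tokens> <SEP> <context tokens>
        return f"{instruction} {question} {sep_token} {context}"
    # default: <instruction tokens> <context tokens> <question tokens>
    return f"{instruction} {context} {question}"

print(build_prompt("Answer the question.", "Some passage ...", "What is asked?"))
print(build_prompt("Answer the question.", "Some passage ...", "What is asked?",
                   question_first=True))
```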

Appendix B Extended Discussion on Model Performance

In this section, we engage in a detailed discussion on the performance of the Flan-T5 family of models and Llama 2 in MRC tasks. Our aim is to gain insights into the reasons behind the modest zero-shot performance of these large PLMs on MRC tasks, despite their adeptness at handling other complex NLP tasks such as dialogue generation and summarization. Although a comprehensive analysis falls outside the scope of our current study, exploring these performance nuances can provide valuable perspectives on how to potentially enhance the effectiveness of these PLMs on similar tasks.

B.1 Discussion on Flan-T5 Zero-Shot Performance

We observe that the zero-shot performance of Flan-T5 models remains low across all datasets, including SQuAD, as shown in Table 12 , despite the models being instruct-tuned on the SQuAD dataset during the pre-training phase. This underperformance might stem from the fact that Flan-T5 models, although trained on the <SQuAD, Extractive QA> task, are also trained on a broad spectrum of 1,836 tasks, predominantly free-form generation, QA, and reasoning tasks (Chung et al., 2022). Consequently, these models are not finely optimized for extractive QA tasks like MRC, especially under metrics like exact match and F1; this holds particularly for the Small to Large variants under study, while the larger XL and XXL variants may perform better on these tasks. Furthermore, as discussed in the previous sections, generative models, including Llama 2, Alpaca, and the GPT variants, generally show limited effectiveness on MRC tasks in zero-shot settings, underscored by their poorer performance despite having significantly more parameters than the Flan-T5 variants we experiment with.

To ensure that our prompts do not adversely affect Flan-T5’s zero-shot performance, we compare our prompt template, detailed in Table 13 , with the templates Google released for Flan-T5’s instruct-tuning on the SQuAD v1 dataset. Our template, similar to Google’s, differs mainly by including "with exact phrases and avoid explanations." This difference could potentially affect performance, yet our subsequent experiments demonstrate otherwise.

We conduct a series of experiments assessing the zero-shot performance of Flan-T5-Large on SQuAD using the Google-released templates for Flan-T5 instruct-tuning. We select three templates of varying complexity, as listed in Table 14 . Our results, also detailed in Table 14 , reveal that our template achieves the highest F1 score. This indicates that the lower performance of zero-shot Flan-T5 on SQuAD and similar MRC datasets is expected, even with the original instruct-tuning templates. It supports our hypothesis that, although Flan-T5 is instruct-tuned on SQuAD, its primary strengths lie in broader generative question answering and reasoning rather than in extractive QA tasks such as MRC, particularly when evaluated by exact match and F1 metrics.

B.2 Discussion on Llama 2 Performance

We observe that models based on Llama 2 and Alpaca generally underperform compared to those based on Flan-T5, in both zero-shot and fine-tuned scenarios, with or without QASE . This section delves into a detailed discussion of the potential reasons behind this trend.

Firstly, the discrepancy in performance may stem from the inherent structural differences between decoder-only models (Llama 2 and Alpaca) and encoder-decoder models (Flan-T5). Encoder-decoder models are better equipped for tasks that require extensive processing of the input, such as MRC, whereas decoder-only models are typically more suited to open-ended QA scenarios. This fundamental distinction partially accounts for Flan-T5’s superior performance in context-based question answering across both zero-shot and fine-tuned settings.

Additionally, the difference in the number of trainable parameters during fine-tuning might contribute to the observed performance gap. Table 1 indicates that fine-tuning Llama 2 and Alpaca with LoRA leads to a significantly lower count of trainable parameters (4.2M) compared to even the smallest Flan-T5 model (77.0M). This disparity in trainable parameters is a crucial factor in explaining why fine-tuned Flan-T5 models, irrespective of the use of QASE, outperform Llama 2 and Alpaca models.

While we address these factors, conducting a comprehensive comparison and analysis of different generative model architectures in MRC tasks exceeds the scope of our current study. Nonetheless, we acknowledge that additional factors, such as the specific instruct-fine-tuning of Flan-T5 models on MRC datasets like SQuAD, might also play a role in their enhanced performance over Llama 2 and Alpaca.

ScienceDaily

Intervention based on science of reading, math boosts comprehension, word problem-solving skills

English learners with math difficulty showed improvement following culturally-responsive training.

New research from the University of Kansas has found an intervention based on the science of reading and math effectively helped English learners boost their comprehension, visualize and synthesize information, and make connections that significantly improved their math performance.

The intervention, performed for 30 minutes twice a week for 10 weeks with 66 third-grade English language learners who displayed math learning difficulties, improved students' performance when compared to students who received general instruction. That indicates that emphasizing the cognitive concepts involved in the science of reading and math is key to helping students improve, according to researchers.

"Word problem-solving is influenced by both the science of reading and the science of math. Key components include number sense, decoding, language comprehension and working memory. Utilizing direct and explicit teaching methods enhances understanding and enables students to effectively connect these skills to solve math problems. This integrated approach ensures that students are equipped with necessary tools to navigate both the linguistic and numerical demands of word problems," said Michael Orosco, professor of educational psychology at KU and lead author of the study.

The intervention incorporates comprehension strategy instruction in both reading and math, focusing on decoding, phonological awareness, vocabulary development, inferential thinking, contextualized learning, and numeracy.

"It is proving to be one of the most effective evidence-based practices available for this growing population," Orosco said.

The study, co-written with Deborah Reed of the University of Tennessee, was published in the journal Learning Disabilities Research and Practice .

For the research, trained tutors delivered the intervention, which was developed by Orosco and colleagues based on cognitive and culturally responsive research conducted over a span of 20 years. One example of an intervention session tested in the study included a script in which a tutor examined a word problem explaining that a person made a quesadilla for his friend Mario, giving him one-fourth of it, and then asked students to determine how much remained.

The tutor first asked students whether they remembered a class session in which they made quesadillas and what shape they were, then demonstrated the concepts by drawing a circle on the board, dividing it into four equal pieces, having students repeat terms like numerator and denominator, and explaining that when a question asks how much is left, subtraction is required. The students also collaborated with peers to practice using important vocabulary in sentences. The approach helps students learn and understand mathematical concepts while being culturally responsive.

"Word problems are complex because they require translating words into mathematical equations, and this involves integrating the science of reading and math through language concepts and differentiated instruction," Orosco said. "We have not extensively tested these approaches with this group of children. However, we are establishing an evidence-based framework that aids them in developing background knowledge and connecting it to their cultural contexts."

Orosco, director of KU's Center for Culturally Responsive Educational Neuroscience, emphasized the critical role of language in word problems, highlighting the importance of using culturally familiar terms. For instance, substituting "pastry" for "quesadilla" could significantly affect comprehension for students from diverse backgrounds. Failure to grasp the initial scenario can impede subsequent problem-solving efforts.

The intervention proved effective in improving students' problem-solving abilities even after accounting for covariates such as an individual's basic calculation skills, fluid intelligence, and reading comprehension scores. That finding is key because, while ideally all students would begin on equal footing with little variation in a classroom, in reality such covariates are commonplace.

The study had trained tutors deliver the intervention, and its effectiveness should be further tested with working teachers, the authors wrote. Orosco said professional development to help teachers gain the skills is necessary, and it is vital for teacher preparation programs to train future teachers with such skills as well. And helping students at the elementary level is necessary to help ensure success in future higher-level math classes such as algebra.

The research builds on Orosco and colleagues' work in understanding and improving math instruction for English learners. Future work will continue to examine the role of cognitive functions such as working memory and brain science, as well as potential integration of artificial intelligence in teaching math.

"Comprehension strategy instruction helps students make connections, ask questions, visualize, synthesize and monitor their thinking about word problems," Orosco and Reed wrote. "Finally, applying comprehension strategy instruction supports ELs in integrating their reading, language and math cognition… Focusing on relevant language in word problems and providing collaborative support significantly improved students' solution accuracy."


Story Source:

Materials provided by University of Kansas . Original written by Mike Krings.

Journal Reference :

  • Michael J. Orosco, Deborah K. Reed. Supplemental intervention for third-grade English learners with significant problem-solving challenges . Learning Disabilities Research & Practice , 2024; 39 (2): 60 DOI: 10.1177/09388982241229407



The Golden Goblet. Reading Comprehension Questions, Multiple-choice questions

YourFellowTeacher's Shop

Last updated

27 April 2024


Resources included (2)

The Golden Goblet. 30 multiple-choice questions (Editable)

The Golden Goblet. 40 Reading Comprehension Questions (Editable)

This comprehensive bundle for The Golden Goblet, by Eloise Jarvis McGraw, combines 40 thought-provoking reading comprehension questions with 30 meticulously crafted multiple-choice questions. Explore the depths of the narrative as you unravel its themes, analyze character motivations, and dissect plot developments with precision. Perfect for educators seeking to enrich their curriculum with rigorous yet accessible assessments, this bundle empowers students to engage critically with the text while honing their reading comprehension skills. Whether used for individual assessment, group discussion, or classroom activities, these questions are designed to foster critical thinking and literary exploration among students of all levels. Ideal for literature studies, comprehension, critical thinking, discussion, and independent learning, and it can also be used as a test or quiz.

Note to buyers: This resource does not contain answer keys. We intentionally designed it this way to encourage students to actively engage with the text and collaborate in finding their own answers. Embrace the opportunity for students to develop critical thinking skills and explore diverse interpretations while working through the comprehension questions.

