Writing Beginner

Writing Rubrics [Examples, Best Practices, & Free Templates]

Writing rubrics are essential tools for teachers.

Rubrics can improve both teaching and learning. This guide will explain writing rubrics, their benefits, and how to create and use them effectively.

What Is a Writing Rubric?


A writing rubric is a scoring guide used to evaluate written work.

It lists criteria and describes levels of quality from excellent to poor. Rubrics provide a standardized way to assess writing.

They make expectations clear and grading consistent.

Key Components of a Writing Rubric

  • Criteria : Specific aspects of writing being evaluated (e.g., grammar, organization).
  • Descriptors : Detailed descriptions of what each level of performance looks like.
  • Scoring Levels : Typically, a range (e.g., 1-4 or 1-6) showing levels of mastery.

Example Breakdown

| Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor) |
|---|---|---|---|---|
| Grammar | No errors | Few minor errors | Several errors | Many errors |
| Organization | Clear and logical | Mostly clear | Somewhat clear | Not clear |
| Content | Thorough and insightful | Good, but not thorough | Basic, lacks insight | Incomplete or off-topic |
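
If you keep grades in a spreadsheet or script, the short Python sketch below shows one way the example breakdown above could be stored and applied; the function and the sample ratings are illustrative assumptions, not part of the rubric itself.

```python
# Minimal sketch: the example analytic rubric stored as a dictionary,
# plus a helper that totals a set of ratings and builds feedback lines.
RUBRIC = {
    "Grammar": {4: "No errors", 3: "Few minor errors",
                2: "Several errors", 1: "Many errors"},
    "Organization": {4: "Clear and logical", 3: "Mostly clear",
                     2: "Somewhat clear", 1: "Not clear"},
    "Content": {4: "Thorough and insightful", 3: "Good, but not thorough",
                2: "Basic, lacks insight", 1: "Incomplete or off-topic"},
}

def score_essay(ratings):
    """Return (total score, feedback lines) for one level chosen per criterion."""
    total = sum(ratings.values())
    feedback = [f"{criterion}: {level} - {RUBRIC[criterion][level]}"
                for criterion, level in ratings.items()]
    return total, feedback

total, feedback = score_essay({"Grammar": 3, "Organization": 4, "Content": 2})
print(total)               # 9 out of a possible 12
print("\n".join(feedback))
```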

Benefits of Using Writing Rubrics

Writing rubrics offer many advantages:

  • Clarity : Rubrics clarify expectations for students. They know what is required for each level of performance.
  • Consistency : Rubrics standardize grading. This ensures fairness and consistency across different students and assignments.
  • Feedback : Rubrics provide detailed feedback. Students understand their strengths and areas for improvement.
  • Efficiency : Rubrics streamline the grading process. Teachers can evaluate work more quickly and systematically.
  • Self-Assessment : Students can use rubrics to self-assess. This promotes reflection and responsibility for their learning.

Examples of Writing Rubrics

Here are some examples of writing rubrics.

Narrative Writing Rubric

| Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor) |
|---|---|---|---|---|
| Story Elements | Well-developed | Developed, some details | Basic, missing details | Underdeveloped |
| Creativity | Highly creative | Creative | Some creativity | Lacks creativity |
| Grammar | No errors | Few minor errors | Several errors | Many errors |
| Organization | Clear and logical | Mostly clear | Somewhat clear | Not clear |
| Language Use | Rich and varied | Varied | Limited | Basic or inappropriate |

Persuasive Writing Rubric

| Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor) |
|---|---|---|---|---|
| Argument | Strong and convincing | Convincing, some gaps | Basic, lacks support | Weak or unsupported |
| Evidence | Strong and relevant | Relevant, but not strong | Some relevant, weak | Irrelevant or missing |
| Grammar | No errors | Few minor errors | Several errors | Many errors |
| Organization | Clear and logical | Mostly clear | Somewhat clear | Not clear |
| Language Use | Persuasive and engaging | Engaging | Somewhat engaging | Not engaging |

Best Practices for Creating Writing Rubrics

Let’s look at some best practices for creating useful writing rubrics.

1. Define Clear Criteria

Identify specific aspects of writing to evaluate. Be clear and precise.

The criteria should reflect the key components of the writing task. For example, for a narrative essay, criteria might include plot development, character depth, and use of descriptive language.

Clear criteria help students understand what is expected and allow teachers to provide targeted feedback.

Insider Tip : Collaborate with colleagues to establish consistent criteria across grade levels. This ensures uniformity in expectations and assessments.

2. Use Detailed Descriptors

Describe what each level of performance looks like.

This ensures transparency and clarity. Avoid vague language. Instead of saying “good,” describe what “good” entails. For example, “Few minor grammatical errors that do not impede readability.”

Detailed descriptors help students gauge their performance accurately.

Insider Tip : Use student work samples to illustrate each performance level. This provides concrete examples and helps students visualize expectations.

3. Involve Students

Involve students in the rubric creation process. This increases their understanding and buy-in.

Ask for their input on what they think is important in their writing.

This collaborative approach not only demystifies the grading process but also fosters a sense of ownership and responsibility in students.

Insider Tip : Conduct a workshop where students help create a rubric for an upcoming assignment. This interactive session can clarify doubts and make students more invested in their work.

4. Align with Objectives

Ensure the rubric aligns with learning objectives. This ensures relevance and focus.

If the objective is to enhance persuasive writing skills, the rubric should emphasize argument strength, evidence quality, and persuasive techniques.

Alignment ensures that the assessment directly supports instructional goals.

Insider Tip : Regularly revisit and update rubrics to reflect changes in curriculum and instructional priorities. This keeps the rubrics relevant and effective.

5. Review and Revise

Regularly review and revise rubrics. Ensure they remain accurate and effective.

Solicit feedback from students and colleagues. Continuous improvement of rubrics ensures they remain a valuable tool for both assessment and instruction.

Insider Tip : After using a rubric, take notes on its effectiveness. Were students confused by any criteria? Did the rubric cover all necessary aspects of the assignment? Use these observations to make adjustments.

6. Be Consistent

Use the rubric consistently across all assignments.

This ensures fairness and reliability. Consistency in applying the rubric helps build trust with students and maintains the integrity of the assessment process.

Insider Tip : Develop a grading checklist to accompany the rubric. This can help ensure that all criteria are consistently applied and none are overlooked during the grading process.

7. Provide Examples

Provide examples of each performance level.

This helps students understand expectations. Use annotated examples to show why a particular piece of writing meets a specific level.

This visual and practical demonstration can be more effective than descriptions alone.

Insider Tip : Create a portfolio of exemplar works for different assignments. This can be a valuable resource for both new and experienced teachers to standardize grading.

How to Use Writing Rubrics Effectively

Here is how to use writing rubrics like the pros.

1. Introduce Rubrics Early

Introduce rubrics at the beginning of the assignment.

Explain each criterion and performance level. This upfront clarity helps students understand what is expected and guides their work from the start.

Insider Tip : Conduct a rubric walkthrough session where you discuss each part of the rubric in detail. Allow students to ask questions and provide examples to illustrate each criterion.

2. Use Rubrics as a Teaching Tool

Use rubrics to teach writing skills. Discuss what constitutes good writing and why.

This can be an opportunity to reinforce lessons on grammar, organization, and other writing components.

Insider Tip : Pair the rubric with writing workshops. Use the rubric to critique sample essays and show students how to apply the rubric to improve their own writing.

3. Provide Feedback

Use the rubric to give detailed feedback. Highlight strengths and areas for improvement.

This targeted feedback helps students understand their performance and learn how to improve.

Insider Tip : Instead of just marking scores, add comments next to each criterion on the rubric. This personalized feedback can be more impactful and instructive for students.

4. Encourage Self-Assessment

Encourage students to use rubrics to self-assess.

This promotes reflection and growth. Before submitting their work, ask students to evaluate their own writing against the rubric.

This practice fosters self-awareness and critical thinking.

Insider Tip : Incorporate self-assessment as a mandatory step in the assignment process. Provide a simplified version of the rubric for students to use during self-assessment.

5. Use Rubrics for Peer Assessment

Use rubrics for peer assessment. This allows students to learn from each other.

Peer assessments can provide new perspectives and reinforce learning.

Insider Tip : Conduct a peer assessment workshop. Train students on how to use the rubric to evaluate each other’s work constructively. This can improve the quality of peer feedback.

6. Reflect and Improve

Reflect on the effectiveness of the rubric. Make adjustments as needed for future assignments.

Continuous reflection ensures that rubrics remain relevant and effective tools for assessment and learning.

Insider Tip : After an assignment, hold a debrief session with students to gather their feedback on the rubric. Use their insights to make improvements.


Common Mistakes with Writing Rubrics

Creating and using writing rubrics can be incredibly effective, but there are common mistakes that can undermine their effectiveness.

Here are some pitfalls to avoid:

1. Vague Criteria

Vague criteria can confuse students and lead to inconsistent grading.

Ensure that each criterion is specific and clearly defined. Ambiguous terms like “good” or “satisfactory” should be replaced with concrete descriptions of what those levels of performance look like.

2. Overly Complex Rubrics

While detail is important, overly complex rubrics can be overwhelming for both students and teachers.

Too many criteria and performance levels can complicate the grading process and make it difficult for students to understand what is expected.

Keep rubrics concise and focused on the most important aspects of the assignment.

3. Inconsistent Application

Applying the rubric inconsistently can lead to unfair grading.

Ensure that you apply the rubric in the same way for all students and all assignments. Consistency builds trust and ensures that grades accurately reflect student performance.

4. Ignoring Student Input

Ignoring student input when creating rubrics can result in criteria that do not align with student understanding or priorities.

Involving students in the creation process can enhance their understanding and engagement with the rubric.

5. Failing to Update Rubrics

Rubrics should evolve to reflect changes in instructional goals and student needs.

Failing to update rubrics can result in outdated criteria that no longer align with current teaching objectives.

Regularly review and revise rubrics to keep them relevant and effective.

6. Lack of Examples

Without examples, students may struggle to understand the expectations for each performance level.

Providing annotated examples of work that meets each criterion can help students visualize what is required and guide their efforts more effectively.

7. Not Providing Feedback

Rubrics should be used as a tool for feedback, not just scoring.

Simply assigning a score without providing detailed feedback can leave students unclear about their strengths and areas for improvement.

Use the rubric to give comprehensive feedback that guides students’ growth.

8. Overlooking Self-Assessment and Peer Assessment

Self-assessment and peer assessment are valuable components of the learning process.

Overlooking these opportunities can limit students’ ability to reflect on their own work and learn from their peers.

Encourage students to use the rubric for self and peer assessment to deepen their understanding and enhance their skills.

What Is a Holistic Scoring Rubric for Writing?

A holistic scoring rubric for writing is a type of rubric that evaluates a piece of writing as a whole rather than breaking it down into separate criteria.

This approach provides a single overall score based on the general impression of the writing’s quality and effectiveness.

Here’s a closer look at holistic scoring rubrics.

Key Features of Holistic Scoring Rubrics

  • Single Overall Score : Assigns one score based on the overall quality of the writing.
  • General Criteria : Focuses on the overall effectiveness, coherence, and impact of the writing.
  • Descriptors : Uses broad descriptors for each score level to capture the general characteristics of the writing.

Example Holistic Scoring Rubric

| Score | Description |
|---|---|
| 5 | Exceptionally clear, engaging, and well-organized writing. Demonstrates excellent control of language, grammar, and style. |
| 4 | Clear and well-organized writing. Minor errors do not detract from the overall quality. Demonstrates good control of language and style. |
| 3 | Satisfactory writing with some organizational issues. Contains a few errors that may distract but do not impede understanding. |
| 2 | Basic writing that lacks organization and contains several errors. Demonstrates limited control of language and style. |
| 1 | Unclear and poorly organized writing. Contains numerous errors that impede understanding. Demonstrates poor control of language and style. |

Advantages of Holistic Scoring Rubrics

  • Efficiency : Faster to use because it involves a single overall judgment rather than multiple criteria.
  • Flexibility : Allows for a more intuitive assessment of the writing’s overall impact and effectiveness.
  • Comprehensiveness : Captures the overall quality of writing, considering all elements together.

Disadvantages of Holistic Scoring Rubrics

  • Less Detailed Feedback : Provides a general score without specific feedback on individual aspects of writing.
  • Subjectivity : Can be more subjective, as it relies on the assessor’s overall impression rather than specific criteria.
  • Limited Diagnostic Use : Less useful for identifying specific areas of strength and weakness for instructional purposes.

When to Use Holistic Scoring Rubrics

  • Quick Assessments : When a quick, overall evaluation is needed.
  • Standardized Testing : Often used in standardized testing scenarios where consistency and efficiency are priorities.
  • Initial Impressions : Useful for providing an initial overall impression before more detailed analysis.

Free Writing Rubric Templates

Feel free to use the following writing rubric templates.

You can easily copy and paste them into a Word Document. Please do credit this website on any written, printed, or published use.

Otherwise, go wild.

Narrative Writing Rubric Template

| Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor) |
|---|---|---|---|---|
| Story Elements | Well-developed, engaging, and clear plot, characters, and setting. | Developed plot, characters, and setting with some details missing. | Basic plot, characters, and setting; lacks details. | Underdeveloped plot, characters, and setting. |
| Creativity | Highly creative and original. | Creative with some originality. | Some creativity but lacks originality. | Lacks creativity and originality. |
| Grammar | No grammatical errors. | Few minor grammatical errors. | Several grammatical errors. | Numerous grammatical errors. |
| Organization | Clear and logical structure. | Mostly clear structure. | Somewhat clear structure. | Lacks clear structure. |
| Language Use | Rich, varied, and appropriate language. | Varied and appropriate language. | Limited language variety. | Basic or inappropriate language. |

Persuasive Writing Rubric Template

| Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor) |
|---|---|---|---|---|
| Argument | Strong, clear, and convincing argument. | Convincing argument with minor gaps. | Basic argument; lacks strong support. | Weak or unsupported argument. |
| Evidence | Strong, relevant, and well-integrated evidence. | Relevant evidence but not strong. | Some relevant evidence, but weak. | Irrelevant or missing evidence. |
| Grammar | No grammatical errors. | Few minor grammatical errors. | Several grammatical errors. | Numerous grammatical errors. |
| Organization | Clear and logical structure. | Mostly clear structure. | Somewhat clear structure. | Lacks clear structure. |
| Language Use | Persuasive and engaging language. | Engaging language. | Somewhat engaging language. | Not engaging language. |

Expository Writing Rubric

| Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor) |
|---|---|---|---|---|
| Content | Thorough, accurate, and insightful content. | Accurate content with some details missing. | Basic content; lacks depth. | Incomplete or inaccurate content. |
| Clarity | Clear and concise explanations. | Mostly clear explanations. | Somewhat clear explanations. | Unclear explanations. |
| Grammar | No grammatical errors. | Few minor grammatical errors. | Several grammatical errors. | Numerous grammatical errors. |
| Organization | Clear and logical structure. | Mostly clear structure. | Somewhat clear structure. | Lacks clear structure. |
| Language Use | Precise and appropriate language. | Appropriate language. | Limited language variety. | Basic or inappropriate language. |

Descriptive Writing Rubric

| Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor) |
|---|---|---|---|---|
| Imagery | Vivid and detailed imagery that engages the senses. | Detailed imagery with minor gaps. | Basic imagery; lacks vivid details. | Little to no imagery. |
| Creativity | Highly creative and original descriptions. | Creative with some originality. | Some creativity but lacks originality. | Lacks creativity and originality. |
| Grammar | No grammatical errors. | Few minor grammatical errors. | Several grammatical errors. | Numerous grammatical errors. |
| Organization | Clear and logical structure. | Mostly clear structure. | Somewhat clear structure. | Lacks clear structure. |
| Language Use | Rich, varied, and appropriate language. | Varied and appropriate language. | Limited language variety. | Basic or inappropriate language. |

Analytical Writing Rubric

| Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor) |
|---|---|---|---|---|
| Analysis | Insightful, thorough, and well-supported analysis. | Good analysis with some depth. | Basic analysis; lacks depth. | Weak or unsupported analysis. |
| Evidence | Strong, relevant, and well-integrated evidence. | Relevant evidence but not strong. | Some relevant evidence, but weak. | Irrelevant or missing evidence. |
| Grammar | No grammatical errors. | Few minor grammatical errors. | Several grammatical errors. | Numerous grammatical errors. |
| Organization | Clear and logical structure. | Mostly clear structure. | Somewhat clear structure. | Lacks clear structure. |
| Language Use | Precise and appropriate language. | Appropriate language. | Limited language variety. | Basic or inappropriate language. |

Final Thoughts: Writing Rubrics

I have a lot more resources for teaching on this site.

Check out some of the blog posts I’ve listed below. I think you might enjoy them.

Read This Next:

  • Narrative Writing Graphic Organizer [Guide + Free Templates]
  • 100 Best A Words for Kids (+ How to Use Them)
  • 100 Best B Words For Kids (+How to Teach Them)
  • 100 Dictation Word Ideas for Students and Kids
  • 50 Tricky Words to Pronounce and Spell (How to Teach Them)

Rubric Best Practices, Examples, and Templates

A rubric is a scoring tool that identifies the different criteria relevant to an assignment, assessment, or learning outcome and states the possible levels of achievement in a specific, clear, and objective way. Use rubrics to assess project-based student work including essays, group projects, creative endeavors, and oral presentations.

Rubrics can help instructors communicate expectations to students and assess student work fairly, consistently and efficiently. Rubrics can provide students with informative feedback on their strengths and weaknesses so that they can reflect on their performance and work on areas that need improvement.

How to Get Started


Step 1: Analyze the assignment

The first step in the rubric creation process is to analyze the assignment or assessment for which you are creating a rubric. To do this, consider the following questions:

  • What is the purpose of the assignment and your feedback? What do you want students to demonstrate through the completion of this assignment (i.e. what are the learning objectives measured by it)? Is it a summative assessment, or will students use the feedback to create an improved product?
  • Does the assignment break down into different or smaller tasks? Are these tasks equally important to the overall assignment?
  • What would an “excellent” assignment look like? An “acceptable” assignment? One that still needs major work?
  • How detailed do you want the feedback you give students to be? Do you want/need to give them a grade?

Step 2: Decide what kind of rubric you will use

Types of rubrics: holistic, analytic/descriptive, single-point

Holistic Rubric. A holistic rubric includes all the criteria (such as clarity, organization, mechanics, etc.) to be considered together and included in a single evaluation. With a holistic rubric, the rater or grader assigns a single score based on an overall judgment of the student’s work, using descriptions of each performance level to assign the score.

Advantages of holistic rubrics:

  • Can place an emphasis on what learners can demonstrate rather than what they cannot
  • Save grader time by minimizing the number of evaluations to be made for each student
  • Can be used consistently across raters, provided they have all been trained

Disadvantages of holistic rubrics:

  • Provide less specific feedback than analytic/descriptive rubrics
  • Can be difficult to choose a score when a student’s work is at varying levels across the criteria
  • Any weighting of criteria cannot be indicated in the rubric

Analytic/Descriptive Rubric. An analytic or descriptive rubric often takes the form of a table with the criteria listed in the left column and with levels of performance listed across the top row. Each cell contains a description of what the specified criterion looks like at a given level of performance. Each of the criteria is scored individually.

Advantages of analytic rubrics:

  • Provide detailed feedback on areas of strength or weakness
  • Each criterion can be weighted to reflect its relative importance

Disadvantages of analytic rubrics:

  • More time-consuming to create and use than a holistic rubric
  • May not be used consistently across raters unless the cells are well defined
  • May result in giving less personalized feedback

Single-Point Rubric. A single-point rubric breaks down the components of an assignment into different criteria, but instead of describing different levels of performance, only the “proficient” level is described. Feedback space is provided for instructors to give individualized comments to help students improve and/or show where they excelled beyond the proficiency descriptors.

Advantages of single-point rubrics:

  • Easier to create than an analytic/descriptive rubric
  • Perhaps more likely that students will read the descriptors
  • Areas of concern and excellence are open-ended
  • May remove a focus on the grade/points
  • May increase student creativity in project-based assignments

Disadvantage of single-point rubrics: Requires more work for instructors writing feedback

Step 3 (Optional): Look for templates and examples.

You might Google “Rubric for persuasive essay at the college level” and see if there are any publicly available examples to start from. Ask your colleagues if they have used a rubric for a similar assignment. Some examples are also available at the end of this article. These rubrics can be a great starting point for you, but consider steps 4, 5, and 6 below to ensure that the rubric matches your assignment description, learning objectives, and expectations.

Step 4: Define the assignment criteria

Make a list of the knowledge and skills you are measuring with the assignment/assessment. Refer to your stated learning objectives, the assignment instructions, past examples of student work, etc. for help.

  Helpful strategies for defining grading criteria:

  • Collaborate with co-instructors, teaching assistants, and other colleagues
  • Brainstorm and discuss with students
  • Check each criterion: Can it be observed and measured? Is it important and essential? Is it distinct from the other criteria? Is it phrased in precise, unambiguous language?
  • Revise the criteria as needed
  • Consider whether some criteria are more important than others, and how you will weight them.

Step 5: Design the rating scale

Most rating scales include between 3 and 5 levels. Consider the following questions when designing your rating scale:

  • Given what students are able to demonstrate in this assignment/assessment, what are the possible levels of achievement?
  • How many levels would you like to include? (More levels mean more detailed descriptions.)
  • Will you use numbers and/or descriptive labels for each level of performance? (for example 5, 4, 3, 2, 1 and/or Exceeds expectations, Accomplished, Proficient, Developing, Beginning, etc.)
  • Don’t use too many columns, and recognize that some criteria can have more columns than others. The rubric needs to be comprehensible and organized. Pick the right number of columns so that the criteria flow logically and naturally across levels.

Step 6: Write descriptions for each level of the rating scale

Artificial intelligence tools like ChatGPT have proven useful for creating rubrics. You will want to engineer the prompt you provide to the AI assistant to ensure you get what you want. For example, you might include the assignment description, the criteria you feel are important, and the number of performance levels you want in your prompt. Use the results as a starting point, and adjust the descriptions as needed.
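
As a concrete illustration, here is a minimal sketch of how such a prompt might be assembled; the assignment text, criteria list, and level count are placeholder assumptions rather than recommendations from this guide.

```python
# Illustrative only: build a rubric-generation prompt from the pieces the
# article suggests including (assignment description, criteria, level count).
ASSIGNMENT = "Write a 1,000-word persuasive essay on a local policy issue."  # placeholder
CRITERIA = ["argument strength", "quality of evidence", "organization", "grammar"]
LEVELS = 4

prompt = (
    f"Create an analytic writing rubric for this assignment:\n{ASSIGNMENT}\n\n"
    f"Criteria to evaluate: {', '.join(CRITERIA)}.\n"
    f"Use {LEVELS} performance levels and write a specific, observable "
    f"descriptor for every criterion at every level."
)

print(prompt)  # paste the output into the AI assistant of your choice
```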

Building a rubric from scratch

For a single-point rubric, describe what would be considered “proficient” (i.e., B-level work). You might also include suggestions for students outside of the actual rubric about how they might surpass proficient-level work.

For analytic and holistic rubrics, create statements of expected performance at each level of the rubric.

  • Consider what descriptor is appropriate for each criterion, e.g., presence vs. absence, complete vs. incomplete, many vs. none, major vs. minor, consistent vs. inconsistent, always vs. never. If you have an indicator described in one level, it will need to be described in each level.
  • You might start with the top/exemplary level. What does it look like when a student has achieved excellence for each/every criterion? Then, look at the “bottom” level. What does it look like when a student has not achieved the learning goals in any way? Then, complete the in-between levels.
  • For an analytic rubric, do this for each particular criterion of the rubric so that every cell in the table is filled. These descriptions help students understand your expectations and their performance in regard to those expectations.

Well-written descriptions:

  • Describe observable and measurable behavior
  • Use parallel language across the scale
  • Indicate the degree to which the standards are met

Step 7: Create your rubric

Create your rubric in a table or spreadsheet in Word, Google Docs, Sheets, etc., and then transfer it by typing it into Moodle. You can also use online tools to create the rubric, but you will still have to type the criteria, indicators, levels, etc., into Moodle. Rubric creators: Rubistar, iRubric

Step 8: Pilot-test your rubric

Prior to implementing your rubric on a live course, obtain feedback from:

  • Teaching assistants

Try out your new rubric on a sample of student work. After you pilot-test your rubric, analyze the results to consider its effectiveness and revise accordingly.

  • Limit the rubric to a single page for reading and grading ease
  • Use parallel language. Use similar language and syntax/wording from column to column. Make sure that the rubric can be easily read from left to right or vice versa.
  • Use student-friendly language. Make sure the language is learning-level appropriate. If you use academic language or concepts, you will need to teach those concepts.
  • Share and discuss the rubric with your students. Students should understand that the rubric is there to help them learn, reflect, and self-assess. If students use a rubric, they will understand the expectations and their relevance to learning.
  • Consider scalability and reusability of rubrics. Create rubric templates that you can alter as needed for multiple assignments.
  • Maximize the descriptiveness of your language. Avoid words like “good” and “excellent.” For example, instead of saying, “uses excellent sources,” you might describe what makes a resource excellent so that students will know. You might also consider reducing the reliance on quantity, such as a number of allowable misspelled words. Focus instead, for example, on how distracting any spelling errors are.

Example of an analytic rubric for a final paper

| Criteria | Above Average (4) | Sufficient (3) | Developing (2) | Needs improvement (1) |
|---|---|---|---|---|
| Thesis supported by relevant information and ideas | The central purpose of the student work is clear, and supporting ideas are always well-focused. Details are relevant and enrich the work. | The central purpose of the student work is clear, and ideas are almost always focused in a way that supports the thesis. Relevant details illustrate the author’s ideas. | The central purpose of the student work is identified. Ideas are mostly focused in a way that supports the thesis. | The purpose of the student work is not well-defined. A number of central ideas do not support the thesis. Thoughts appear disconnected. |
| Sequencing of elements/ideas | Information and ideas are presented in a logical sequence which flows naturally and is engaging to the audience. | Information and ideas are presented in a logical sequence which is followed by the reader with little or no difficulty. | Information and ideas are presented in an order that the audience can mostly follow. | Information and ideas are poorly sequenced. The audience has difficulty following the thread of thought. |
| Correctness of grammar and spelling | Minimal to no distracting errors in grammar and spelling. | The readability of the work is only slightly interrupted by spelling and/or grammatical errors. | Grammatical and/or spelling errors distract from the work. | The readability of the work is seriously hampered by spelling and/or grammatical errors. |

Example of a holistic rubric for a final paper

  • 4: The audience is able to easily identify the central message of the work and is engaged by the paper’s clear focus and relevant details. Information is presented logically and naturally. There are minimal to no distracting errors in grammar and spelling.
  • 3: The audience is easily able to identify the focus of the student work, which is supported by relevant ideas and supporting details. Information is presented in a logical manner that is easily followed. The readability of the work is only slightly interrupted by errors.
  • 2: The audience can identify the central purpose of the student work with little difficulty, and supporting ideas are present and clear. The information is presented in an orderly fashion that can be followed with little difficulty. Grammatical and spelling errors distract from the work.
  • 1: The audience cannot clearly or easily identify the central ideas or purpose of the student work. Information is presented in a disorganized fashion, causing the audience to have difficulty following the author’s ideas. The readability of the work is seriously hampered by errors.

Single-Point Rubric

| Advanced (evidence of exceeding standards) | Criteria described at a proficient level | Concerns (things that need work) |
|---|---|---|
| | Criteria #1: Description reflecting achievement of proficient level of performance | |
| | Criteria #2: Description reflecting achievement of proficient level of performance | |
| | Criteria #3: Description reflecting achievement of proficient level of performance | |
| | Criteria #4: Description reflecting achievement of proficient level of performance | |
| 90-100 points | 80-90 points | <80 points |
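
If you attach a point total to this template, the brief sketch below (an assumption for illustration, not part of the template) shows one way to map that total onto the three columns.

```python
# Illustrative helper: place a point total into the template's three bands.
def single_point_band(points):
    """Map a score to Advanced (90-100), Proficient (80-90), or Concerns (<80)."""
    if points >= 90:
        return "Advanced (evidence of exceeding standards)"
    if points >= 80:
        return "Proficient (criteria met as described)"
    return "Concerns (things that need work)"

for score in (95, 85, 72):
    print(score, "->", single_point_band(score))
```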

More examples:

  • Single Point Rubric Template (variation)
  • Analytic Rubric Template (make a copy to edit)
  • A Rubric for Rubrics
  • Bank of Online Discussion Rubrics in different formats
  • Mathematical Presentations Descriptive Rubric
  • Math Proof Assessment Rubric
  • Kansas State Sample Rubrics
  • Design Single Point Rubric

Technology Tools: Rubrics in Moodle

  • Moodle Docs: Rubrics
  • Moodle Docs: Grading Guide (use for single-point rubrics)

Tools with rubrics (other than Moodle)

  • Google Assignments
  • Turnitin Assignments: Rubric or Grading Form

Other resources

  • DePaul University (n.d.). Rubrics .
  • Gonzalez, J. (2014). Know your terms: Holistic, Analytic, and Single-Point Rubrics . Cult of Pedagogy.
  • Goodrich, H. (1996). Understanding rubrics. Teaching for Authentic Student Performance, 54(4), 14-17.
  • Miller, A. (2012). Tame the beast: tips for designing and using rubrics.
  • Ragupathi, K., Lee, A. (2020). Beyond Fairness and Consistency in Grading: The Role of Rubrics in Higher Education. In: Sanger, C., Gleason, N. (eds) Diversity and Inclusion in Global Higher Education. Palgrave Macmillan, Singapore.

Designing Rubrics

Deciding which type of rubric to use.

Rubrics are generally broken down into two types: holistic and analytic.

Holistic Rubrics

A holistic rubric provides students with a general overview of what is expected by describing the characteristics of a paper that would earn an “A” (or be marked “excellent”), a “B” (or “proficient”), a “C” (or “average”), and so on.

Here is an example of a holistic rubric for weekly reading responses in a religion course: 

[Image: sample holistic rubric for weekly reading responses]

As you can see, a holistic rubric gives students a sense of the criteria for evaluation (in this case: understanding of the text, engagement with the text, ability to explain significance of argument, organization & ability to answer the prompt, and grammar, mechanics & formatting).  However, it does not assign any particular value to these criteria and therefore allows more room for variation between papers of one grade.

Benefits of Holistic Rubrics:

Holistic rubrics tend to work best for low-stakes writing assignments, and there are several benefits to using a holistic rubric for evaluation:

  • They allow for slightly more impressionistic grading, which is useful when papers may vary dramatically from one another.  (This particular rubric would be used to respond to one of several different prompts that students could choose from each week).
  • They encourage students to think of all the parts of their writing as interconnected, so (for example) students see organization as connected to clarity of ideas.
  • When used for recurring assignments, they allow students to see a trend in the feedback for their writing.
  • They allow for quicker grading, since you can highlight or circle specific words or phrases to draw students’ attention to areas of possible improvement.

Drawbacks of Holistic Rubrics:

One potential drawback to holistic rubrics, however, is that it can be difficult for students to identify discrete areas for improvement or get specific examples of common missteps.

Analytic Rubrics

An analytic rubric is one that explicitly breaks down an assignment into its constitutive skills and provides students with guidelines for what each performance level looks like for each skill.

Here is an example of an analytic rubric for the same assignment:

[Image: sample analytic rubric for weekly reading responses]

As you can see, an analytic rubric provides students with a much clearer definition of the evaluation criteria. It may or may not assign points to each criterion.

Benefits of Analytic Rubrics: 

Analytic rubrics tend to work well for complex assignments.  There are several benefits to choosing an analytic rubric:

  • They allow more specific feedback for students, which can be particularly useful in guiding revision.
  • They provide students with more specific guidelines that they can follow when writing their papers.
  • They provide students with a sense of your priorities for the assignment.
  • They allow for more regular grading.

Drawbacks of Analytic Rubrics:

One drawback to analytic rubrics, however, is that they can be difficult to develop for assignments you’re asking students to complete for the first time; if you haven’t yet seen what can go wrong, it can be difficult to identify what poor performance might look like.

Bean, John C. Engaging Ideas: The Professor’s Guide to Integrating Writing, Critical Thinking, and Active Learning in the Classroom. San Francisco: Jossey-Bass, 2001.

“Creating and Using Rubrics.” The Assessment Office. The University of Hawaii at Mānoa. 18 December 2013. Web. 1 June 2014.

“How to Develop a Rubric.” Ohio State Writing Across the Curriculum Resources. Ohio State University. Web. 1 June 2014.

“Rubric Development.” Center for University Teaching, Learning, and Assessment. University of West Florida. 24 April 2014. Web. 1 June 2014.


Designing and Using Rubrics

Grading rubrics (structured scoring guides) can make writing criteria more explicit, improving student performance and making valid and consistent grading easier for course instructors. This page provides an overview of rubric types and offers guidelines for their development and use.

Why use a rubric?

While grading criteria can come in many forms—a checklist of requirements, a description of grade-level expectations, articulated standards, or a contract between instructor and students, to name but a few options—they often take the form of a rubric, a structured scoring guide. 

Because of their flexibility, rubrics can provide several benefits for students and instructors:

  • They make the grading criteria explicit to students by providing specific dimensions (e.g., thesis, organization, use of evidence, etc.), the performance-level descriptions for those dimensions, and the relative weight of those dimensions within the overall assignment.
  • They can serve as guidelines and targets for students as they develop their writing, especially when the rubrics are distributed with the assignment.
  • They can be used by faculty to coach and reinforce writing criteria in the class.
  • They are useful for norming assessment and ensuring reliability and consistency among multiple graders, such as teaching assistants.
  • They can help instructors to isolate specific features of student writing for praise or for instruction.
  • They are very adaptable in form, from basic to complex, and can be used to assess minor and major assignments.
  • They can be a data source for instructors to improve future teaching and learning.

What types of rubrics are there?

Rubrics come in many forms. Here are some of the key types, using terms introduced by John Bean (2011), along with the advantages and disadvantages of rubric types, as detailed by the Center for Advanced Research on Language Acquisition (CARLA).

Holistic Rubrics stress an overall evaluation of the work by creating single-score categories (letter or numeric). Holistic rubrics are often used in standardized assessments, such as Advanced Placement exams. Here is a sample of a holistic rubric.

Some potential benefits of holistic rubrics:

  • They often save time by minimizing the number of decisions graders must make.
  • Multiple graders (such as teaching assistants) who norm with holistic rubrics tend to apply them consistently, resulting in more reliable measurement.
  • They are good for summative assessments that do not require additional feedback.

Some potential challenges of holistic rubrics:

  • Unless space is provided for specific comments, they are less useful for offering specific feedback to learners about how to improve performance.
  • They are not very useful for formative assessments , where the goal is to provide actionable feedback for the student.

Analytic Rubrics stress the weight of different criteria or traits, such as content, organization, use of conventions, etc. Most analytic rubrics are formatted as grids. Here is a sample of an analytic rubric.

Some potential benefits of analytic rubrics:

  • They provide useful feedback to learners on specific areas of strength and weakness.
  • Their dimensions can be weighted to reflect the relative importance of individual criteria on the assignment.
  • They can show learners that they have made progress over time in some or all dimensions when the same rubric categories are used repeatedly (Moskal, 2000).

Some potential challenges of analytic rubrics:

  • As Tedick (2002) notes, "Separate scores for different aspects of a student’s writing or speaking performance may be considered artificial in that it does not give the teacher (or student) a good assessment of the ‘whole’ of a performance."
  • They often take more time to create and use, and it can be challenging to name all the possible attributes that will signal success or failure on the assignment.
  • Because there are more dimensions to score, it can take more time to norm and achieve reliability. 
  • Given evidence that graders tend to evaluate grammar-related categories more harshly than they do other categories (McNamara, 1996), analytic rubrics containing a category for “grammar” may provide a negatively skewed picture of a learner’s proficiency.

Generic Rubrics can take holistic or analytic forms. In generic rubrics, the grading criteria are generalized in such a way that the rubric can be used for multiple assignments and/or across multiple sections of courses. Here is a sample of a generic rubric.

Some potential benefits of generic rubrics:

  • They can be applied to a number of different tasks across a single mode of communication (such as persuasion, analysis, oral presentation, etc.).
  • They can be used repeatedly for assignments with fixed formats and genres (lab reports, technical memos, etc.).
  • They may be useful in departments for collecting data about student performance across courses.

Some potential challenges of generic rubrics:

  • They are not directly aligned with the language in the assignment prompt.
  • They may reinforce a singular and reductive view of effective writing.

Task-Specific Rubrics closely align the grading criteria with the language and specifications in the assignment prompt. Here is a sample of a task-specific rubric.

Some potential benefits of task-specific rubrics:

  • According to Walvoord (2014), task-specific rubrics can be “credible and actionable for students because they involve faculty in their own disciplinary language, their own assignments, and their own criteria.”
  • They emphasize the specificity of discipline and genre-based writing.
  • They can be useful for both formative and summative feedback.

Some potential challenges of task-specific rubrics:

  • They take some time to develop.
  • They are not easily transferable to other assignments. 

Guidelines for Creating a Writing Rubric

Step 1: Identify your grading criteria.


What are the intended outcomes for the assignment? What do you want students to do or demonstrate? What are the primary dimensions (note: these are often referred to as “traits” or as “criteria”) that count in the evaluation? Try writing each one as a noun or noun phrase—for example, “Insights and ideas that are central to the assignment”; “Address of audience”; “Logic of organization”; “Integration of source materials.”

Suggestion: Try not to exceed ten total criteria. If you have too many criteria, it can be challenging to distinguish among them, and you may be required to clarify, repeatedly, the distinctions for students (or for yourself!).

Step 2: Describe the levels of success for each criterion.

For each trait or criterion, consider a 2–4-point scale (e.g. strong, satisfactory, weak). For each point on the scale, describe the performance.

Suggestions: Either begin with the optimum performance and then describe lower levels as falling short of it (adequate, insufficient, etc.), or fully describe a baseline performance and then add values. To write an effective performance level for a criterion, describe in precise language what the text is doing successfully.

Effective grading criteria are…

  • Explicit and well detailed, leaving little room for unstated assumptions.

Ineffective: Includes figures and graphs.

Effective: Includes figures that are legible and labeled accurately, and that illustrate data in a manner free from distortion. 

  • Focused on qualities, not components, segments, or sections.

Ineffective: Use the IMRAD structure.

Effective: Includes a materials and methods section that identifies all components, technical standards, equipment, and methodological description such that a professional might reproduce the research.

  • Focused on discrete features, without trying to do too much.

Ineffective: Contains at least five sources.

Effective: Uses research from carefully vetted sources, presented with an in-text and terminal citation, to support assertions.

  • Focused on observable characteristics of the writing, not impressions of the writer’s intent.

Ineffective: Does not use slang or jargon.

Effective: Uses language appropriate to fellow professionals and patient communication in context.

Step 3: Weight the criteria.

When criteria have been identified and performance-levels described, decisions should be made about their varying importance in relation to each other.

Suggestion: If you use a point-based grading system, consider using a range of points within  performance levels, and make sure the points for each criterion reflect their relative value to one another. Rubrics without carefully determined and relative grade weights can often produce a final score that does not align with the instructor’s expectations for the score. Here is a sample of a rubric with a range of points within each performance level .
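
To make the arithmetic concrete, here is a minimal sketch of how weighted criteria can combine into a final percentage; the criterion names, weights, and level points are invented for illustration.

```python
# Illustrative weighted scoring: each criterion has a weight (summing to 1.0)
# and each performance level maps to points; the result is a percentage.
WEIGHTS = {"argument": 0.4, "evidence": 0.3, "organization": 0.2, "mechanics": 0.1}
LEVEL_POINTS = {"excellent": 4, "good": 3, "fair": 2, "poor": 1}
MAX_POINTS = max(LEVEL_POINTS.values())

def weighted_score(ratings):
    """ratings maps each criterion name to a performance-level name."""
    earned = sum(WEIGHTS[c] * LEVEL_POINTS[level] for c, level in ratings.items())
    return 100 * earned / MAX_POINTS  # weights sum to 1, so divide by the max level

print(weighted_score({"argument": "good", "evidence": "excellent",
                      "organization": "good", "mechanics": "fair"}))  # 80.0
```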

Step 4: Create a format for the rubric.

When the specific criteria and levels of success have been named and ranked, they can be sorted into a variety of formats and distributed with the assignment. The right format will depend on how and when you are using the rubric. Consider these three examples of an Anthropology rubric and how each format might be useful (or not), depending on the course context. [ Rubric 1 , Rubric 2 , Rubric 3 ]

Suggestion: Consider allowing space on the rubric to insert comments on each item and again at the end. Regardless of how well your rubric identifies, describes, and weighs the grading criteria, students will still appreciate and benefit from brief comments that personalize your assessment.

Step 5: Test (and refine) the rubric.


Ideally, a rubric will be tested in advance of full implementation. A practical way to test the rubric is to apply it to a subset of student assignments. Even after you have tested and used the rubric, you will likely discover, as with the assignment prompt itself, that there are parts that need tweaking and refinement.

Suggestion: A peer review of the rubric before it gets used on an assignment will allow you to take stock of the questions, confusions, or issues students have about your rubric, so you can make timely and effective adjustments.

Additional Ways to Use Rubrics

Beyond their value as formative and summative assessment tools, rubrics can be used to support teaching and learning in the classroom.

Here are three suggestions for additional uses:

  • For in-class norming sessions with students—effective for discussing, clarifying, and reinforcing writing criteria;
  • For constructing rubric criteria and values with students—most effective when students are quite familiar with the specific writing genre (e.g. capstone-level writing);
  • For guiding a peer-review session.

Any Downsides to Rubrics?

While many faculty members use rubrics, some resist them because they worry that rubrics are unable to accurately convey authentic and nuanced assessment. As Bob Broad (2003) argues, rubrics can leave out many of the rhetorical qualities and contexts that influence how well a work is received or not. Rubrics, Broad maintains, convey a temporary sense of standardization that does not capture the real ways that real readers respond in different ways to a given work. John Bean (2011) has also described this as the “myth of the universal reader” and the “problem of implied precision” (279). Of course, the alternative to using a rubric, such as providing a holistic grade with comments that justify the grade—still a common practice among instructors—is often labor-intensive and poses its own set of challenges when it comes to consistency with assessment across all students enrolled in a course. Ultimately, a rubric’s impact depends on the criteria on which it is built and the ways it is used.


Writing Resources

Using Rubrics: Tips and Examples

Rubrics are a tool for effective assessment of student work. A rubric identifies specific expectations from a given assignment, as well as how the successful completion of these elements contributes to a grade.

For instructors, rubrics:

  • Help grading and feedback reflect the goals of the assignment and the class
  • Remove bias from grading (including across graders and TAs)

For students, rubrics:

  • Ensure students know the expectations of an assignment (rubrics should be shared with students in advance)
  • Clearly justify grades for students

Rubrics can be used to evaluate progress, as well as to assess final products and assign grades. There are different types of rubrics, depending on the needs of the assignment:

Checklist Rubrics

Checklist rubrics assess completion of the parts of an assignment. The student is not assessed on how well each element is executed, but simply on whether it was completed. Checklists can be completed by the instructor, but they can also be used by students themselves to self-assess their progress or product. For example, for the UWS proposal assignment, a checklist rubric may look like this:

Proposal Element (completed/not completed):

  • Introduction (1 paragraph)
  • Literature Review (~2 pages)
  • Library Research Plan (~1 page)
  • Motive (1 paragraph)
  • Weekly Timeline
  • Annotated Bibliography (minimum of 3 sources)

An instructor can choose to give partial credit if an element of the assignment is partially completed.
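
To make the arithmetic concrete, here is a minimal sketch of how a checklist score with partial credit might be tallied. This is only an illustration: the element names mirror the proposal example above, and the 0 / 0.5 / 1 completion values are our own assumption, not part of any official rubric.

```python
# Minimal sketch: tallying a checklist rubric with optional partial credit.
# Completion values are an assumption: 1 = complete, 0.5 = partial, 0 = missing.

def checklist_score(completion, points_per_element=1.0):
    """Sum credit across checklist elements; partial values earn partial credit."""
    return sum(value * points_per_element for value in completion.values())

proposal = {
    "Introduction": 1.0,
    "Literature Review": 0.5,      # partially completed
    "Library Research Plan": 1.0,
    "Motive": 1.0,
    "Weekly Timeline": 0.0,        # missing
    "Annotated Bibliography": 1.0,
}

print(f"{checklist_score(proposal):.1f} of {len(proposal)} elements completed")
# -> 4.5 of 6 elements completed
```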

Narrative / Holistic Rubrics

Narrative/holistic rubrics provide overall descriptions of the qualities of work expected at each grade or performance level, rather than scoring each element of the assignment separately.

For example, this rubric refers to an assignment where students contributed to an online discussion board. As you can see, the assessment can be indicated in various ways—as a letter grade, as a descriptive word or phrase, or as a numerical rating. The rubric then shows a description of the expectations for that grade. To create this for a specific assignment, the instructor would consider what a submission that earned an A should look like, versus one that earned a B, C, etc. The instructor would then write detailed descriptions of what qualities a student would have to demonstrate in order to earn an A, B, C, etc.

A — Outstanding — (90-100). Student created an original post that was highly insightful and which responded thoroughly to all parts of the prompt. The response effectively utilized a variety of evidence from the literary readings for the week, directly referencing the texts at least three times. Additionally, the student responded to peers' posts with a high degree of professionalism in interaction, grammar/mechanics, and spelling.
B — Very Good — (81-89). Student created an original post that was insightful and responded to all or most parts of the prompt. The response utilized a variety of evidence from the text as well, though there may have been room for more and/or further explanation. Additionally, the student responded to peers' posts with a good degree of professionalism in interaction, grammar/mechanics, and spelling.
C — Average — (71-79). Student created an original post that was at times insightful, though lacking in substantial well-explained evidence and/or which was at times off-topic from the prompt. Responses to peers were attempted, though these responses were lacking in insightful observation and/or had errors/lapses in professional interaction, grammar/mechanics, and spelling.
D — Needs Improvement — (60-69). Student created an original post which did not effectively address the prompt and/or which was lacking in substantial evidence from the text. Responses to peers were incomplete or insubstantial with numerous errors/lapses in professional interaction, grammar/mechanics, and spelling.
F — Does not Meet Expectations — (59 and below). Student created an original post which was too brief, did not respond to the prompt, and was lacking in substantial evidence from the text. Missing or overly brief responses to peers with errors/lapses in professional interaction, grammar/mechanics, and spelling.
Source: Virginia Commonwealth University

Narrative/Holistic Rubrics can be easier for instructors to create and use. However, for students, these sorts of rubrics often provide less specific feedback. A student may not know where exactly their writing fails to meet expectations. Narrative/holistic rubrics thus should be used in tandem with specific comments that articulate where the student needs improvement. Often, when an instructor is using a narrative/holistic rubric, there is a sort of analytical rubric (see below) going on behind the scenes, which helps inform the final grade. This behind-the-scenes thinking should be communicated to students.  

Analytical / Developmental Rubrics

Analytical/developmental rubrics are similar to holistic rubrics, but they break the assignment down into its component elements. The benefit of this approach is that students see exactly where they are and are not succeeding in their writing. These rubrics can be time-consuming to produce, but they are effective in both communicating expectations and justifying grades. Some instructors opt to return each writing assignment with the rubric attached, highlighted or otherwise marked to show progress. Other instructors may opt to use detailed marginal and block comments that refer to the elements of the rubric.

Assignment Elements

THESIS
  • A (Exceeding Standard): The major claim of the essay is complex, insightful, and unexpected.
  • B (Proficient): The major claim is clear and arguable but lacks complexity or is too narrow in scope.
  • C (Progressing): The major claim of the essay is weak, i.e., vague, simple, or obvious.
  • D (Not meeting standard): The major claim is missing or unclear.

EVIDENCE
  • A: Strong evidence is used in supportive and creative ways.
  • B: Most ideas are supported by evidence, but not the best evidence.
  • C: Evidence may be lacking or irrelevant.
  • D: There is little to no appropriate evidence.

STRUCTURE
  • A: Ideas develop over the course of the essay.
  • B: The argument is mostly logical and structured.
  • C: The argument does not develop over the course of the essay.
  • D: Argument shows no clear structure.

REVISION
  • A: Extensive & effective revision beyond instructor’s comments.
  • B: Extensive revision.
  • C: Some evidence of revision.
  • D: Little evidence of revision.


Elissa Jacobs and Paige Eggebrecht


Rubric Design

Articulating Your Assessment Values

Reading, commenting on, and then assigning a grade to a piece of student writing requires intense attention and difficult judgment calls. Some faculty dread “the stack.” Students may share the faculty’s dim view of writing assessment, perceiving it as highly subjective. They wonder why one faculty member values evidence and correctness before all else, while another seeks a vaguely defined originality.

Writing rubrics can help address the concerns of both faculty and students by making writing assessment more efficient, consistent, and public. Whether it is called a grading rubric, a grading sheet, or a scoring guide, a writing assignment rubric lists criteria by which the writing is graded.

Why create a writing rubric?

  • It makes your tacit rhetorical knowledge explicit
  • It articulates community- and discipline-specific standards of excellence
  • It links the grade you give the assignment to the criteria
  • It can make your grading more efficient, consistent, and fair as you can read and comment with your criteria in mind
  • It can help you reverse engineer your course: once you have the rubrics created, you can align your readings, activities, and lectures with the rubrics to set your students up for success
  • It can help your students produce writing that you look forward to reading

How to create a writing rubric

Create a rubric at the same time you create the assignment. It will help you explain to the students what your goals are for the assignment.

  • Consider your purpose: do you need a rubric that addresses the standards for all the writing in the course? Or do you need to address the writing requirements and standards for just one assignment?  Task-specific rubrics are written to help teachers assess individual assignments or genres, whereas generic rubrics are written to help teachers assess multiple assignments.
  • Begin by listing the important qualities of the writing that will be produced in response to a particular assignment. It may be helpful to have several examples of excellent versions of the assignment in front of you: what writing elements do they all have in common? Among other things, these may include features of the argument, such as a main claim or thesis; use and presentation of sources, including visuals; and formatting guidelines such as the requirement of a works cited.
  • Then consider how the criteria will be weighted in grading. Perhaps all criteria are equally important, or perhaps there are two or three that all students must achieve to earn a passing grade. Decide what best fits the class and requirements of the assignment.

Consider involving students in the second and third steps above. A class session devoted to developing a rubric can provoke many important discussions about the ways the features of the language serve the purpose of the writing. And when students themselves work to describe the writing they are expected to produce, they are more likely to achieve it.

At this point, you will need to decide if you want to create a holistic or an analytic rubric. There is much debate about these two approaches to assessment.

Comparing Holistic and Analytic Rubrics

Holistic Scoring

Holistic scoring aims to rate overall proficiency in a given student writing sample. It is often used in large-scale writing program assessment and impromptu classroom writing for diagnostic purposes.

General tenets of holistic scoring:

  • Responding to drafts is part of evaluation
  • Responses do not focus on grammar and mechanics during drafting and there is little correction
  • Marginal comments are kept to 2-3 per page with summative comments at end
  • End commentary attends to students’ overall performance across learning objectives as articulated in the assignment
  • Response language aims to foster students’ self-assessment

Holistic rubrics emphasize what students do well and generally increase efficiency; they may also be more valid because scoring includes the authentic, personal reaction of the reader. But holistic scores won’t tell a student how they’ve progressed relative to previous assignments and may be rater-dependent, reducing reliability. (For a summary of advantages and disadvantages of holistic scoring, see Becker, 2011, p. 116.)

Here is an example of a partial holistic rubric:

Summary meets all the criteria. The writer understands the article thoroughly. The main points in the article appear in the summary with all main points proportionately developed. The summary should be as comprehensive as possible and should read smoothly, with appropriate transitions between ideas. Sentences should be clear, without vagueness or ambiguity and without grammatical or mechanical errors.

A complete holistic rubric for a research paper (authored by Jonah Willihnganz) can be downloaded here.

Analytic Scoring

Analytic scoring makes explicit the contribution to the final grade of each element of writing. For example, an instructor may choose to give 30 points for an essay whose ideas are sufficiently complex, that marshals good reasons in support of a thesis, and whose argument is logical; and 20 points for well-constructed sentences and careful copy editing.
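
To illustrate the arithmetic behind that kind of point allocation, here is a minimal sketch of totaling an analytic score. The criterion names, point weights, and sample ratings below are hypothetical, not a recommended scheme.

```python
# Minimal sketch: totaling an analytic rubric score.
# Criterion names, maximum points, and sample ratings are hypothetical.

max_points = {"ideas and argument": 30, "evidence": 25,
              "organization": 25, "sentences and editing": 20}

# Each criterion is rated as a fraction of its maximum (e.g., 0.8 of 25 points).
ratings = {"ideas and argument": 0.9, "evidence": 0.8,
           "organization": 0.6, "sentences and editing": 1.0}

total = sum(max_points[c] * ratings[c] for c in max_points)
print(f"Total: {total:.1f} / {sum(max_points.values())}")  # e.g., Total: 82.0 / 100
```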

General tenets of analytic scoring:

  • Reflect emphases in your teaching and communicate the learning goals for the course
  • Emphasize student performance across criteria, which are established as central to the assignment in advance, usually on an assignment sheet
  • Typically take a quantitative approach, providing a scaled set of points for each criterion
  • Make the analytic framework available to students before they write  

Advantages of an analytic rubric include ease of training raters and improved reliability. Meanwhile, writers often can more easily diagnose the strengths and weaknesses of their work. But analytic rubrics can be time-consuming to produce, and raters may judge the writing holistically anyway. Moreover, many readers believe that writing traits cannot be separated. (For a summary of the advantages and disadvantages of analytic scoring, see Becker, 2011, p. 115.)

For example, a partial analytic rubric for a single trait, “addresses a significant issue”:

  • Excellent: Elegantly establishes the current problem, why it matters, to whom
  • Above Average: Identifies the problem; explains why it matters and to whom
  • Competent: Describes topic but relevance unclear or cursory
  • Developing: Unclear issue and relevance

A complete analytic rubric for a research paper can be downloaded here. In WIM courses, this language should be revised to name specific disciplinary conventions.

Whichever type of rubric you write, your goal is to avoid pushing students into prescriptive formulas and limiting thinking (e.g., “each paragraph has five sentences”). By carefully describing the writing you want to read, you give students a clear target, and, as Ed White puts it, “describe the ongoing work of the class” (75).

Writing rubrics contribute meaningfully to the teaching of writing. Think of them as a coaching aide. In class and in conferences, you can use the language of the rubric to help you move past generic statements about what makes good writing good to statements about what constitutes success on the assignment and in the genre or discourse community. The rubric articulates what you are asking students to produce on the page; once that work is accomplished, you can turn your attention to explaining how students can achieve it.

Works Cited

Becker, Anthony. “Examining Rubrics Used to Measure Writing Performance in U.S. Intensive English Programs.” The CATESOL Journal 22.1 (2010/2011): 113-30. Web.

White, Edward M.  Teaching and Assessing Writing . Proquest Info and Learning, 1985. Print.

Further Resources

CCCC Committee on Assessment. “Writing Assessment: A Position Statement.” November 2006 (Revised March 2009). Conference on College Composition and Communication. Web.

Gallagher, Chris W. “Assess Locally, Validate Globally: Heuristics for Validating Local Writing Assessments.” Writing Program Administration 34.1 (2010): 10-32. Web.

Huot, Brian.  (Re)Articulating Writing Assessment for Teaching and Learning.  Logan: Utah State UP, 2002. Print.

Kelly-Riley, Diane, and Peggy O’Neill, eds. Journal of Writing Assessment. Web.

McKee, Heidi A., and Dànielle Nicole DeVoss, eds. Digital Writing Assessment & Evaluation. Logan, UT: Computers and Composition Digital Press/Utah State University Press, 2013. Web.

O’Neill, Peggy, Cindy Moore, and Brian Huot.  A Guide to College Writing Assessment . Logan: Utah State UP, 2009. Print.

Sommers, Nancy.  Responding to Student Writers . Macmillan Higher Education, 2013.

Straub, Richard. “Responding, Really Responding to Other Students’ Writing.” The Subject is Writing: Essays by Teachers and Students. Ed. Wendy Bishop. Boynton/Cook, 1999. Web.

White, Edward M., and Cassie A. Wright.  Assigning, Responding, Evaluating: A Writing Teacher’s Guide . 5th ed. Bedford/St. Martin’s, 2015. Print.

Of Rubrics and Writing

Another Rubric for Creative Assignments: Short Stories

I have used a holistic, comment-based rubric for my short story assignment in Creative Writing for several years. After reading all this information about rubrics, I decided to revise it into a point-based, more analytic rubric. I also changed the point values: because the short story ends up being one of the longest assignments in the class, I raised it from 100 to 150 points (I plan to decrease the points for the literary critique, since that is a shorter assignment overall). I hope this new rubric makes the expectations of the assignment clearer to students and makes grading more objective and transparent.

Here is my original rubric (with examples of comments and a grade):

Short story rubric  

  • Character development: Are the characters well developed through a variety of character techniques (such as dialogue, gestures, observations, etc.)?
    Rating: OK. Comment: Good character, but I wanted to know more about her—and see her more in action. So much of the story is summary that we only get general info on her.
  • Plot: Is the plot interesting and original? Is the plot condensed enough to develop in the length of the story?
    Rating: OK. Comment: Good idea for plot—just need more scenes and less summary to make the story more effective.
  • Story beginning: Does the story start with action or dialogue instead of summary?
    Rating: Needs work. Comment: It’s most effective to start with dialogue and/or action. You begin more with an introduction or summary. I would suggest just starting with the first scene—let the background of the characters come out through the plot.
  • Scenes: Does the story contain scenes that let the characters act and move, and not just a summary of events or time periods?
    Rating: Needs work. Comment: Good at the start, but try to let the action and dialogue show things—try not to explain everything. Also, you need more scenes to really move the action along and help the readers get into the story and characters.
  • Grammar and style: Does the story contain college-level writing and an interesting writing style? Are there too many grammar, spelling, and punctuation errors?
    Rating: Needs work. Comment: Avoid using second person (you) in fiction. Also, some comma splices, apostrophe errors, run-ons, and other errors are getting in the way of your ideas.
  • Dialogue: Is the dialogue in the story natural and realistic? Does it help develop characters, action, and scenes?
    Rating: OK. Comment: Use a comma between speaker and dialogue. Just need more dialogue in scenes.
  • Setting and detail: Are the setting and details in the story well developed and unique?
    Rating: Good. Comment: Great detail about the city, but need more details in some places—scenes would help with that.

Overall comments: Great start here—see comments above for ways to improve the story.

Grade: 84/100 (B)

Here is the first draft of my new rubric:

Short Story Assignment  

Write a short story, possibly using a character or characters you have developed in class assignments (week three discussion assignment). Think about all the elements of fiction which the fiction lessons and your textbook discuss. Try to write a unique story in your own writing style. Try not to fall back on common plots, stereotypical characters, etc.

Length: 6-25 pages (1,200-6,000 words)

Format: Double-spaced, in RTF format.

  • Name the file as: yourlastname_story (for example: swing_story)
  • Make sure to have a title page with your name, the name of the story, the date, etc.
  • Make sure to start a new paragraph when a new character speaks.
  • Make sure to use correct capitalization, spelling, and grammar. See this website for a grammar review if needed: http://grammar.ccc.commnet.edu/grammar/

Plot (50 points)

□   Plot is original and surprising (has tension), but not shocking. It engages the audience throughout the story.

□   The plot is condensed enough to develop in a short story (time is condensed)

□   Beginning of the story engages audience and begins with action or a scene and not summary or background.

□   Ending is satisfying even if it’s abrupt or doesn’t wrap up all ideas.

□   Story meets word requirements.

□   Plot is interesting but may contain some confusion, clichéd ideas, or vagueness.

□   The plot is fairly condensed but may span too much time or have too much history or summary.

□   Beginning of the story is interesting but may have too much summary and not enough action.

□   Ending is ok but could be more satisfying or original.

□   Story meets word requirements but needs to be longer; ideas need to be developed further.

□    Plot is not engaging, doesn’t contain tension, or is clichéd.

□   The plot tries to cover too much time or is confusing to follow.

□   Beginning of the story has too much summary and background—needs a scene and action.

□   Ending is clichéd, shocking, or unbelievable.

□   Story does not meet minimum word requirements.

 
Character development (20 points)

□   Characters, especially the main character, are developed well through multiple techniques (dialogue, gestures, description, action, etc.).

□   Characters are unique and not stereotypes or one dimensional

□   Character relationships are well developed and interesting.

□   Character makes some significant change in the story.

□   Characters, especially the main character, are developed well but need more showing and less telling. Need to have the character in action more.

□   Characters are interesting but may be a bit stereotypical or one dimensional at times.

□   Character relationships are interesting but may need more development.

□   Character makes some changes but they might not be enough or realistic based on the plot of the story.

□    Characters, especially main character, are not developed enough. Need action, dialogue, background, etc.  

□   Characters are stereotypical or one dimensional.

□   Character relationships are not developed or unrealistic.

□   Character does not make any significant or realistic changes throughout the story.

 
Scenes (20 points)

□   Multiple scenes are used in the story to show and not tell the story.

□   Scenes are in a clear and logical sequence even if flashbacks are used

□   Scenes are interesting and effective

□   Story has some scenes that develop ideas, but may need more scenes and less summary.

□   Scenes are in a clear order but may need some reorganization.

□   Scenes are good but may need more action or tension

□   Story is mostly summary and needs scenes to develop characters, tension, and ideas.

□   Scenes are not in a clear order and are confusing.

□   Scenes are unrealistic or uninteresting or unoriginal.

  
Dialogue (15 points)

□   Dialogue is natural and not stilted or awkward

□   Dialogue is effectively used to develop characters, give character background, and develop tension.

□   Dialogue uses correct quotation mark placement and  is indented with each new speaker

 

□   Dialogue is original but may be stilted or inconsistent at times (need to use contractions, for example).

□   Dialogue gives some character and plot details but could be used more to develop those traits.

□   Dialogue uses mostly correct format, but may need some corrections like a comma between speaker and quotation or correct indentation.

 

□   Dialogue is not used enough or is stilted and/or inconsistent (need to use contractions, for example, or the character’s voice changes).

□   Dialogue needs to be used to develop characters and details more effectively.

□   Dialogue does not follow correct format (indent with each speaker, comma between speaker and quote, correct quotation marks, etc.)

 

  
Grammar and style (20 points)

□   The story is written using college-level writing skills in a professional manner.

□   The story does not contain many errors in spelling, sentence structure, pronoun use, apostrophes, or other areas.

□   Style of the story is consistent and engaging and not wordy or overly passive.

□    Story uses appropriate and consistent point of view.

□   The story is written at college level but may have some inconsistencies.

□   The story contains some errors in spelling, sentence structure, pronoun use, apostrophes, or other areas.

□   Style of the story is mostly consistent and engaging but may have some wordiness, vagueness, etc.

□   Story uses appropriate point of view but may shift once or twice.

□   The story is not written at college level.

□   The story contains many errors in spelling, sentence structure, pronoun use, apostrophes, or other areas.

□   Style of the story is inconsistent and contains too much wordiness, vagueness, etc.

□   Story shifts point of view multiple times and for no logical reason.

 
Setting and detail (15 points)

□    Setting in the story is clear, unique, and well developed.

□   Setting is an important part of the plot or tension in the story.

□    Details in the story such as colors, clothes, music, objects, are unique and used to develop characters and plot.

 

□   Setting in the story is clear but could be developed further.

□   Setting could be used more as part of the plot or tension.

□   Some of the details in the story such as colors, clothes, music, objects, are unique but could be used more to develop characters and plot.

□   Setting is vague or unclear.

□    Setting has no relationship to the plot or characters.

□   Story needs more details like colors, clothes, music, cars, landscape, etc. to develop characters and plot.

  
Paper format (10 points)

□   Story was submitted on time in the dropbox with correct file name.

□   Story follows paper format (double-spaced, one-inch margins).

□   Story has a unique title and correct heading.

□   Story was submitted on time in the dropbox with the correct file name.

□   Story follows paper format (double-spaced, one-inch margins) with one or two minor errors.

□   Story has a title and heading but may have some errors.

□   Story was not submitted on time in the dropbox and/or has an incorrect file name.

□   Story does not follow paper format (double-spaced, one-inch margins).

□   Story has no title and/or no heading.

 
Overall comments        
Points / Out of 150 points Grade:    


15 Helpful Scoring Rubric Examples for All Grades and Subjects

In the end, they actually make grading easier.

Collage of scoring rubric examples including written response rubric and interactive notebook rubric

When it comes to student assessment and evaluation, there are a lot of methods to consider. In some cases, testing is the best way to assess a student’s knowledge, and the answers are either right or wrong. But often, assessing a student’s performance is much less clear-cut. In these situations, a scoring rubric is often the way to go, especially if you’re using standards-based grading. Here’s what you need to know about this useful tool, along with lots of rubric examples to get you started.

What is a scoring rubric?

In the United States, a rubric is a guide that lays out the performance expectations for an assignment. It helps students understand what’s required of them, and guides teachers through the evaluation process. (Note that in other countries, the term “rubric” may instead refer to the set of instructions at the beginning of an exam. To avoid confusion, some people use the term “scoring rubric” instead.)

A rubric generally has three parts (see the short data sketch after this list):

  • Performance criteria: These are the various aspects on which the assignment will be evaluated. They should align with the desired learning outcomes for the assignment.
  • Rating scale: This could be a number system (often 1 to 4) or words like “exceeds expectations, meets expectations, below expectations,” etc.
  • Indicators: These describe the qualities needed to earn a specific rating for each of the performance criteria. The level of detail may vary depending on the assignment and the purpose of the rubric itself.
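
For readers who like to see structure spelled out, here is a minimal sketch of those three parts as a data structure. Everything in it is an illustrative assumption: the class names, criteria, and wording are not taken from any particular rubric.

```python
# Minimal sketch: a rubric as data, with performance criteria, a rating scale,
# and indicators describing each rating level for each criterion.
# All names and wording below are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Criterion:
    name: str
    indicators: dict  # rating level -> description of that level

@dataclass
class Rubric:
    rating_scale: list = field(default_factory=lambda: [4, 3, 2, 1])
    criteria: list = field(default_factory=list)

rubric = Rubric(criteria=[
    Criterion("Focus", {4: "Consistently on topic", 3: "Mostly on topic",
                        2: "Occasionally drifts", 1: "Off topic"}),
    Criterion("Support", {4: "Specific, well-chosen details", 3: "Adequate details",
                          2: "Few or generic details", 1: "Little to no support"}),
])
```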

Rubrics take more time to develop up front, but they help ensure more consistent assessment, especially when the skills being assessed are more subjective. A well-developed rubric can actually save teachers a lot of time when it comes to grading. What’s more, sharing your scoring rubric with students in advance often helps improve performance. This way, students have a clear picture of what’s expected of them and what they need to do to achieve a specific grade or performance rating.

Learn more about why and how to use a rubric here.

Types of Rubrics

There are three basic rubric categories, each with its own purpose.

Holistic Rubric

A holistic scoring rubric laying out the criteria for a rating of 1 to 4 when creating an infographic

Source: Cambrian College

This type of rubric combines all the scoring criteria in a single scale. They’re quick to create and use, but they have drawbacks. If a student’s work spans different levels, it can be difficult to decide which score to assign. They also make it harder to provide feedback on specific aspects.

Traditional letter grades are a type of holistic rubric. So are the popular “hamburger rubric” and “cupcake rubric” examples. Learn more about holistic rubrics here.

Analytic Rubric

Layout of an analytic scoring rubric, describing the different sections like criteria, rating, and indicators

Source: University of Nebraska

Analytic rubrics are much more complex and generally take a great deal more time up front to design. They include specific details of the expected learning outcomes, and descriptions of what criteria are required to meet various performance ratings in each. Each rating is assigned a point value, and the total number of points earned determines the overall grade for the assignment.

Though they’re more time-intensive to create, analytic rubrics actually save time while grading. Teachers can simply circle or highlight any relevant phrases in each rating, and add a comment or two if needed. They also help ensure consistency in grading, and make it much easier for students to understand what’s expected of them.

Learn more about analytic rubrics here.

Developmental Rubric

A developmental rubric for kindergarten skills, with illustrations to describe the indicators of criteria

Source: Deb’s Data Digest

A developmental rubric is a type of analytic rubric, but it’s used to assess progress along the way rather than determining a final score on an assignment. The details in these rubrics help students understand their achievements, as well as highlight the specific skills they still need to improve.

Developmental rubrics are essentially a subset of analytic rubrics. They leave off the point values, though, and focus instead on giving feedback using the criteria and indicators of performance.

Learn how to use developmental rubrics here.

Ready to create your own rubrics? Find general tips on designing rubrics here. Then, check out these examples across all grades and subjects to inspire you.

Elementary School Rubric Examples

These elementary school rubric examples come from real teachers who use them with their students. Adapt them to fit your needs and grade level.

Reading Fluency Rubric

A developmental rubric example for reading fluency

You can use this one as an analytic rubric by counting up points to earn a final score, or just to provide developmental feedback. There’s a second rubric page available specifically to assess prosody (reading with expression).

Learn more: Teacher Thrive

Reading Comprehension Rubric

Reading comprehension rubric, with criteria and indicators for different comprehension skills

The nice thing about this rubric is that you can use it at any grade level, for any text. If you like this style, you can get a reading fluency rubric here too.

Learn more: Pawprints Resource Center

Written Response Rubric

Two anchor charts illustrating a written response rubric

Rubrics aren’t just for huge projects. They can also help kids work on very specific skills, like this one for improving written responses on assessments.

Learn more: Dianna Radcliffe: Teaching Upper Elementary and More

Interactive Notebook Rubric

Interactive Notebook rubric example, with criteria and indicators for assessment

If you use interactive notebooks as a learning tool, this rubric can help kids stay on track and meet your expectations.

Learn more: Classroom Nook

Project Rubric

Rubric that can be used for assessing any elementary school project

Use this simple rubric as it is, or tweak it to include more specific indicators for the project you have in mind.

Learn more: Tales of a Title One Teacher

Behavior Rubric

Rubric for assessing student behavior in school and classroom

Developmental rubrics are perfect for assessing behavior and helping students identify opportunities for improvement. Send these home regularly to keep parents in the loop.

Learn more: Teachers.net Gazette

Middle School Rubric Examples

In middle school, use rubrics to offer detailed feedback on projects, presentations, and more. Be sure to share them with students in advance, and encourage them to use them as they work so they’ll know if they’re meeting expectations.

Argumentative Writing Rubric

An argumentative rubric example to use with middle school students

Argumentative writing is a part of language arts, social studies, science, and more. That makes this rubric especially useful.

Learn more: Dr. Caitlyn Tucker

Role-Play Rubric

A rubric example for assessing student role play in the classroom

Role-plays can be really useful when teaching social and critical thinking skills, but it’s hard to assess them. Try a rubric like this one to evaluate and provide useful feedback.

Learn more: A Question of Influence

Art Project Rubric

A rubric used to grade middle school art projects

Art is one of those subjects where grading can feel very subjective. Bring some objectivity to the process with a rubric like this.

Source: Art Ed Guru

Diorama Project Rubric

A rubric for grading middle school diorama projects

You can use diorama projects in almost any subject, and they’re a great chance to encourage creativity. Simplify the grading process and help kids know how to make their projects shine with this scoring rubric.

Learn more: Historyourstory.com

Oral Presentation Rubric

Rubric example for grading oral presentations given by middle school students

Rubrics are terrific for grading presentations, since you can include a variety of skills and other criteria. Consider letting students use a rubric like this to offer peer feedback too.

Learn more: Bright Hub Education

High School Rubric Examples

In high school, it’s important to include your grading rubrics when you give assignments like presentations, research projects, or essays. Kids who go on to college will definitely encounter rubrics, so helping them become familiar with them now will help in the future.

Presentation Rubric

Example of a rubric used to grade a high school project presentation

Analyze a student’s presentation both for content and communication skills with a rubric like this one. If needed, create a separate one for content knowledge with even more criteria and indicators.

Learn more: Michael A. Pena Jr.

Debate Rubric

A rubric for assessing a student's performance in a high school debate

Debate is a valuable learning tool that encourages critical thinking and oral communication skills. This rubric can help you assess those skills objectively.

Learn more: Education World

Project-Based Learning Rubric

A rubric for assessing high school project based learning assignments

Implementing project-based learning can be time-intensive, but the payoffs are worth it. Try this rubric to make student expectations clear and end-of-project assessment easier.

Learn more: Free Technology for Teachers

100-Point Essay Rubric

Rubric for scoring an essay with a final score out of 100 points

Need an easy way to convert a scoring rubric to a letter grade? This example for essay writing earns students a final score out of 100 points.

Learn more: Learn for Your Life

Drama Performance Rubric

A rubric teachers can use to evaluate a student's participation and performance in a theater production

If you’re unsure how to grade a student’s participation and performance in drama class, consider this example. It offers lots of objective criteria and indicators to evaluate.

Learn more: Chase March

How do you use rubrics in your classroom? Come share your thoughts and exchange ideas in the WeAreTeachers HELPLINE group on Facebook.

Plus, check out 25 of the best alternative assessment ideas.

Scoring rubrics help establish expectations and ensure assessment consistency. Use these rubric examples to help you design your own.



Linking essay-writing tests using many-facet models and neural automated essay scoring

  • Original Manuscript
  • Open access
  • Published: 20 August 2024

  • Masaki Uto (ORCID: orcid.org/0000-0002-9330-5158)
  • Kota Aramaki


For essay-writing tests, challenges arise when scores assigned to essays are influenced by the characteristics of raters, such as rater severity and consistency. Item response theory (IRT) models incorporating rater parameters have been developed to tackle this issue, exemplified by the many-facet Rasch models. These IRT models enable the estimation of examinees’ abilities while accounting for the impact of rater characteristics, thereby enhancing the accuracy of ability measurement. However, difficulties can arise when different groups of examinees are evaluated by different sets of raters. In such cases, test linking is essential for unifying the scale of model parameters estimated for individual examinee–rater groups. Traditional test-linking methods typically require administrators to design groups in which either examinees or raters are partially shared. However, this is often impractical in real-world testing scenarios. To address this, we introduce a novel method for linking the parameters of IRT models with rater parameters that uses neural automated essay scoring technology. Our experimental results indicate that our method successfully accomplishes test linking with accuracy comparable to that of linear linking using few common examinees.


Introduction

The growing demand for assessing higher-order skills, such as logical reasoning and expressive capabilities, has led to increased interest in essay-writing assessments (Abosalem, 2016 ; Bernardin et al., 2016 ; Liu et al., 2014 ; Rosen & Tager, 2014 ; Schendel & Tolmie, 2017 ). In these assessments, human raters assess the written responses of examinees to specific writing tasks. However, a major limitation of these assessments is the strong influence that rater characteristics, including severity and consistency, have on the accuracy of ability measurement (Bernardin et al., 2016 ; Eckes, 2005 , 2023 ; Kassim, 2011 ; Myford & Wolfe, 2003 ). Several item response theory (IRT) models that incorporate parameters representing rater characteristics have been proposed to mitigate this issue (Eckes, 2023 ; Myford & Wolfe, 2003 ; Uto & Ueno, 2018 ).

The most prominent among them are many-facet Rasch models (MFRMs) (Linacre, 1989 ), and various extensions of MFRMs have been proposed to date (Patz & Junker, 1999 ; Patz et al., 2002 ; Uto & Ueno, 2018 , 2020 ). These IRT models have the advantage of being able to estimate examinee ability while accounting for rater effects, making them more accurate than simple scoring methods based on point totals or averages.

However, difficulties can arise when essays from different groups of examinees are evaluated by different sets of raters, a scenario often encountered in real-world testing. For instance, in academic settings such as university admissions, individual departments may use different pools of raters to assess essays from specific applicant pools. Similarly, in the context of large-scale standardized tests, different sets of raters may be allocated to various test dates or locations. Thus, when applying IRT models with rater parameters to account for such real-world testing cases while also ensuring that ability estimates are comparable across groups of examinees and raters, test linking becomes essential for unifying the scale of model parameters estimated for each group.

Conventional test-linking methods generally require some overlap of examinees or raters across the groups being linked (Eckes, 2023 ; Engelhard, 1997 ; Ilhan, 2016 ; Linacre, 2014 ; Uto, 2021a ). For example, linear linking based on common examinees, a popular linking method, estimates the IRT parameters for shared examinees using data from each group. These estimates are then used to build a linear regression model, which adjusts the parameter scales across groups. However, the design of such overlapping groups can often be impractical in real-world testing environments.

To facilitate test linking in these challenging environments, we introduce a novel method that leverages neural automated essay scoring (AES) technology. Specifically, we employ a cutting-edge deep neural AES method (Uto & Okano, 2021 ) that can predict IRT-based abilities from examinees’ essays. The central concept of our linking method is to construct an AES model using the ability estimates of examinees in a reference group, along with their essays, and then to apply this model to predict the abilities of examinees in other groups. An important point is that the AES model is trained to predict examinee abilities on the scale established by the reference group. This implies that the trained AES model can predict the abilities of examinees in other groups on the ability scale established by the reference group. Therefore, we use the predicted abilities to calculate the linking coefficients required for linear linking and to perform a test linking. In this study, we conducted experiments based on real-world data to demonstrate that our method successfully accomplishes test linking with accuracy comparable to that of linear linking using few common examinees.

It should be noted that previous studies have attempted to employ AES technologies for test linking (Almond, 2014 ; Olgar, 2015 ), but their focus has primarily been on linking tests with varied writing tasks or a mixture of essay tasks and objective items, while overlooking the influence of rater characteristics. This differs from the specific scenarios and goals that our study aims to address. To the best of our knowledge, this is the first study that employs AES technologies to link IRT models incorporating rater parameters for writing assessments without the need for common examinees and raters.

Setting and data

In this study, we assume scenarios in which two groups of examinees respond to the same writing task and their written essays are assessed by two distinct sets of raters following the same scoring rubric. We refer to one group as the reference group , which serves as the basis for the scale, and the other as the focal group , whose scale we aim to align with that of the reference group.

Let \(u^{\text{ref}}_{jr}\) be the score assigned by rater \(r \in \mathcal{R}^{\text{ref}}\) to the essay of examinee \(j \in \mathcal{J}^{\text{ref}}\), where \(\mathcal{R}^{\text{ref}}\) and \(\mathcal{J}^{\text{ref}}\) denote the sets of raters and examinees in the reference group, respectively. Then, a collection of scores for the reference group can be defined as

\[ \textbf{U}^{\text{ref}} = \left\{ u^{\text{ref}}_{jr} \in \mathcal{K} \cup \{-1\} \;\middle|\; j \in \mathcal{J}^{\text{ref}},\ r \in \mathcal{R}^{\text{ref}} \right\}, \]

where \(\mathcal{K} = \{1,\ldots,K\}\) represents the rating categories, and \(-1\) indicates missing data.

Similarly, a collection of scores for the focal group can be defined as

\[ \textbf{U}^{\text{foc}} = \left\{ u^{\text{foc}}_{jr} \in \mathcal{K} \cup \{-1\} \;\middle|\; j \in \mathcal{J}^{\text{foc}},\ r \in \mathcal{R}^{\text{foc}} \right\}, \]

where \(u^{\text{foc}}_{jr}\) indicates the score assigned by rater \(r \in \mathcal{R}^{\text{foc}}\) to the essay of examinee \(j \in \mathcal{J}^{\text{foc}}\), and \(\mathcal{R}^{\text{foc}}\) and \(\mathcal{J}^{\text{foc}}\) represent the sets of raters and examinees in the focal group, respectively.

The primary objective of this study is to apply IRT models with rater parameters to the two sets of data, \(\textbf{U}^{\text {ref}}\) and \(\textbf{U}^{\text {foc}}\) , and to establish IRT parameter linking without shared examinees and raters: \(\mathcal {J}^{\text {ref}} \cap \mathcal {J}^{\text {foc}} = \emptyset \) and \(\mathcal {R}^{\text {ref}} \cap \mathcal {R}^{\text {foc}} = \emptyset \) . More specifically, we seek to align the scale derived from \(\textbf{U}^{\text {foc}}\) with that of \(\textbf{U}^{\text {ref}}\) .
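To make the data setting concrete, the sketch below shows one way the score collections \(\textbf{U}^{\text{ref}}\) and \(\textbf{U}^{\text{foc}}\) could be represented in code: integer matrices of examinees by raters, with entries in \(\{1,\ldots,K\}\) and \(-1\) marking rater–essay pairs that were never scored. The sizes, rater assignments, and scores are illustrative and are not taken from the paper's data.

```python
import numpy as np

K = 5  # number of rating categories (illustrative)
rng = np.random.default_rng(0)

def make_score_matrix(n_examinees: int, n_raters: int, raters_per_essay: int) -> np.ndarray:
    """Sparse rating matrix: each essay is scored by only a few raters; -1 = missing."""
    U = np.full((n_examinees, n_raters), -1, dtype=int)
    for j in range(n_examinees):
        assigned = rng.choice(n_raters, size=raters_per_essay, replace=False)
        U[j, assigned] = rng.integers(1, K + 1, size=raters_per_essay)
    return U

# Reference and focal groups share no examinees and no raters (disjoint matrices).
U_ref = make_score_matrix(n_examinees=100, n_raters=10, raters_per_essay=4)
U_foc = make_score_matrix(n_examinees=80, n_raters=8, raters_per_essay=4)
print(U_ref.shape, round(float((U_ref == -1).mean()), 2))  # matrix size and share of missing cells
```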

Item response theory

IRT (Lord, 1980 ), a test theory grounded in mathematical models, has recently gained widespread use in various testing situations due to the growing prevalence of computer-based testing. In objective testing contexts, IRT makes use of latent variable models, commonly referred to as IRT models. Traditional IRT models, such as the Rasch model and the two-parameter logistic model, give the probability of an examinee’s response to a test item as a probabilistic function influenced by both the examinee’s latent ability and the item’s characteristic parameters, such as difficulty and discrimination. These IRT parameters can be estimated from a dataset consisting of examinees’ responses to test items.

However, traditional IRT models are not directly applicable to essay-writing test data, where the examinees’ responses to test items are assessed by multiple human raters. Extended IRT models with rater parameters have been proposed to address this issue (Eckes, 2023 ; Jin and Wang, 2018 ; Linacre, 1989 ; Shin et al., 2019 ; Uto, 2023 ; Wilson & Hoskens, 2001 ).

Many-facet Rasch models and their extensions

The MFRM (Linacre, 1989) is the most commonly used IRT model that incorporates rater parameters. Although several variants of the MFRM exist (Eckes, 2023; Myford & Wolfe, 2004), the most representative model defines the probability that the essay of examinee j for a given test item (either a writing task or prompt) i receives a score of k from rater r as

\[ P_{ijrk} = \frac{\exp \sum_{m=1}^{k} D\left(\theta_j - \beta_i - \beta_r - d_m\right)}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} D\left(\theta_j - \beta_i - \beta_r - d_m\right)}, \]

where \(\theta_j\) is the latent ability of examinee j, \(\beta_{i}\) represents the difficulty of item i, \(\beta_{r}\) represents the severity of rater r, and \(d_{m}\) is a step parameter denoting the difficulty of transitioning between scores \(m-1\) and m. \(D = 1.7\) is a scaling constant used to minimize the difference between the normal and logistic distribution functions. For model identification, \(\sum_{i} \beta_{i} = 0\), \(d_1 = 0\), \(\sum_{m = 2}^{K} d_{m} = 0\), and a normal distribution for the ability \(\theta_j\) are assumed.

Another popular MFRM is one in which \(d_{m}\) is replaced with \(d_{rm}\) , a rater-specific step parameter denoting the severity of rater r when transitioning from score  \(m-1\) to m . This model is often used to investigate variations in rating scale criteria among raters caused by differences in the central tendency, extreme response tendency, and range restriction among raters (Eckes, 2023 ; Myford & Wolfe, 2004 ; Qiu et al., 2022 ; Uto, 2021a ).

A recent extension of the MFRM is the generalized many-facet model (GMFM) (Uto & Ueno, 2020), which incorporates parameters denoting rater consistency and item discrimination. GMFM defines the probability \(P_{ijrk}\) as

\[ P_{ijrk} = \frac{\exp \sum_{m=1}^{k} D\alpha_i \alpha_r \left(\theta_j - \beta_i - \beta_r - d_{rm}\right)}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} D\alpha_i \alpha_r \left(\theta_j - \beta_i - \beta_r - d_{rm}\right)}, \]

where \(\alpha_i\) indicates the discrimination power of item i, and \(\alpha_r\) indicates the consistency of rater r. For model identification, \(\prod_{r} \alpha_r = 1\), \(\sum_{i} \beta_{i} = 0\), \(d_{r1} = 0\), \(\sum_{m = 2}^{K} d_{rm} = 0\), and a normal distribution for the ability \(\theta_j\) are assumed.

In this study, we seek to apply the aforementioned IRT models to data involving a single test item, as detailed in the Setting and data section. When there is only one test item, the item parameters in the above equations become superfluous and can be omitted. Consequently, the equations for these models can be simplified as follows.

MFRM:

\[ P_{jrk} = \frac{\exp \sum_{m=1}^{k} D\left(\theta_j - \beta_r - d_m\right)}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} D\left(\theta_j - \beta_r - d_m\right)} \tag{5} \]

MFRM with rater-specific step parameters (referred to as MFRM with RSS in the subsequent sections):

\[ P_{jrk} = \frac{\exp \sum_{m=1}^{k} D\left(\theta_j - \beta_r - d_{rm}\right)}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} D\left(\theta_j - \beta_r - d_{rm}\right)} \tag{6} \]

GMFM:

\[ P_{jrk} = \frac{\exp \sum_{m=1}^{k} D\alpha_r \left(\theta_j - \beta_r - d_{rm}\right)}{\sum_{l=1}^{K} \exp \sum_{m=1}^{l} D\alpha_r \left(\theta_j - \beta_r - d_{rm}\right)} \tag{7} \]
Note that the GMFM can simultaneously capture the following typical characteristics of raters, whereas the MFRM and MFRM with RSS can only consider a subset of these characteristics.

  • Severity: This refers to the tendency of some raters to systematically assign higher or lower scores compared with other raters regardless of the actual performance of the examinee. This tendency is quantified by the parameter \(\beta_r\).

  • Consistency: This is the extent to which raters maintain their scoring criteria consistently over time and across different examinees. Consistent raters exhibit stable scoring patterns, which make their evaluations more reliable and predictable. In contrast, inconsistent raters show varying scoring tendencies. This characteristic is represented by the parameter \(\alpha_r\).

  • Range Restriction: This describes the limited variability in scores assigned by a rater. Central tendency and extreme response tendency are special cases of range restriction. This characteristic is represented by the parameter \(d_{rm}\).

For details on how these characteristics are represented in the GMFM, see the article (Uto & Ueno, 2020 ).

Based on the above, it is evident that both the MFRM and MFRM with RSS are special cases of the GMFM. Specifically, the GMFM with constant rater consistency corresponds to the MFRM with RSS. Moreover, the MFRM with RSS that assumes no differences in the range restriction characteristic among raters aligns with the MFRM.
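To make these relationships concrete, the following minimal Python sketch computes the category probabilities of the single-item GMFM in Eq. 7; setting \(\alpha_r = 1\) recovers the MFRM with RSS, and additionally sharing one step vector across all raters recovers the MFRM. The function name and parameter values are ours and purely illustrative, not estimates from the paper's dataset.

```python
import numpy as np

D = 1.7  # scaling constant

def gmfm_probs(theta: float, alpha_r: float, beta_r: float, d_r) -> np.ndarray:
    """Category probabilities P_{jrk} of the single-item GMFM (Eq. 7).

    d_r holds the rater-specific step parameters (d_{r1}, ..., d_{rK}) with d_{r1} = 0.
    alpha_r = 1 gives the MFRM with RSS; a common step vector for all raters gives the MFRM.
    """
    d_r = np.asarray(d_r, dtype=float)
    # Cumulative numerator terms: sum_{m=1}^{k} D * alpha_r * (theta - beta_r - d_{rm})
    cum = np.cumsum(D * alpha_r * (theta - beta_r - d_r))
    probs = np.exp(cum - cum.max())  # subtract the max for numerical stability
    return probs / probs.sum()

# Illustrative rater: slightly severe (beta_r > 0) and fairly consistent (alpha_r > 1)
print(gmfm_probs(theta=0.5, alpha_r=1.3, beta_r=0.2, d_r=[0.0, -1.2, -0.3, 0.4, 1.1]))
```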

When the aforementioned IRT models are applied to datasets from multiple groups composed of different examinees and raters, such as \(\textbf{U}^{\text{ref}}\) and \(\textbf{U}^{\text{foc}}\), the scales of the estimated parameters generally differ among them. This discrepancy arises because IRT permits arbitrary scaling of parameters for each independent dataset. An exception occurs when it is feasible to assume that the distributions of examinee abilities and rater parameters are equal across tests (Linacre, 2014). However, real-world testing conditions may not always satisfy this assumption. Therefore, if the aim is to compare parameter estimates between different groups, test linking is generally required to unify the scale of model parameters estimated from each individual group's dataset.

One widely used approach for test linking is linear linking . In the context of the essay-writing test considered in this study, implementing linear linking necessitates designing two groups so that there is some overlap in examinees between them. With this design, IRT parameters for the shared examinees are estimated individually for each group. These estimates are then used to construct a linear regression model for aligning the parameter scales across groups, thereby rendering them comparable. We now introduce the mean and sigma method  (Kolen & Brennan, 2014 ; Marco, 1977 ), a popular method for linear linking, and illustrate the procedures for parameter linking specifically for the GMFM, as defined in Eq.  7 , because both the MFRM and the MFRM with RSS can be regarded as special cases of the GMFM, as explained earlier.

To elucidate this, let us assume that the datasets corresponding to the reference and focal groups, denoted as \(\textbf{U}^{\text{ref}}\) and \(\textbf{U}^{\text{foc}}\), contain overlapping sets of examinees. Furthermore, let us assume that \(\hat{\varvec{\theta}}^{\text{foc}}\), \(\hat{\varvec{\alpha}}^{\text{foc}}\), \(\hat{\varvec{\beta}}^{\text{foc}}\), and \(\hat{\varvec{d}}^{\text{foc}}\) are the GMFM parameters estimated from \(\textbf{U}^{\text{foc}}\). The mean and sigma method aims to transform these parameters linearly so that their scale aligns with those estimated from \(\textbf{U}^{\text{ref}}\). This transformation is guided by the equations

\[ \tilde{\varvec{\theta}}^{\text{foc}} = A\,\hat{\varvec{\theta}}^{\text{foc}} + K, \quad \tilde{\varvec{\alpha}}^{\text{foc}} = \hat{\varvec{\alpha}}^{\text{foc}} / A, \quad \tilde{\varvec{\beta}}^{\text{foc}} = A\,\hat{\varvec{\beta}}^{\text{foc}} + K, \quad \tilde{\varvec{d}}^{\text{foc}} = A\,\hat{\varvec{d}}^{\text{foc}}, \tag{8} \]

where \(\tilde{\varvec{\theta}}^{\text{foc}}\), \(\tilde{\varvec{\alpha}}^{\text{foc}}\), \(\tilde{\varvec{\beta}}^{\text{foc}}\), and \(\tilde{\varvec{d}}^{\text{foc}}\) represent the scale-transformed parameters for the focal group. The linking coefficients are defined as

\[ A = \frac{\sigma^{\text{ref}}}{\sigma^{\text{foc}}}, \qquad K = \mu^{\text{ref}} - A\,\mu^{\text{foc}}, \]

where \(\mu^{\text{ref}}\) and \(\sigma^{\text{ref}}\) represent the mean and standard deviation (SD) of the common examinees' ability values estimated from \(\textbf{U}^{\text{ref}}\), and \(\mu^{\text{foc}}\) and \(\sigma^{\text{foc}}\) represent those values obtained from \(\textbf{U}^{\text{foc}}\).
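A minimal sketch of the mean and sigma method for the GMFM is given below, assuming ability estimates for the common examinees are available from both groups. The function and variable names are ours, and the numeric values are invented for illustration.

```python
import numpy as np

def mean_sigma_coefficients(theta_common_ref: np.ndarray, theta_common_foc: np.ndarray):
    """Linking coefficients A and K from the common examinees' ability estimates."""
    A = theta_common_ref.std() / theta_common_foc.std()
    K = theta_common_ref.mean() - A * theta_common_foc.mean()
    return A, K

def link_gmfm_parameters(A, K, theta_foc, alpha_foc, beta_foc, d_foc):
    """Transform focal-group GMFM estimates onto the reference scale (Eq. 8)."""
    return (A * theta_foc + K,  # examinee abilities
            alpha_foc / A,      # rater consistencies
            A * beta_foc + K,   # rater severities
            A * d_foc)          # rater-specific step parameters

# Illustrative ability estimates for five common examinees in each group
A, K = mean_sigma_coefficients(np.array([0.8, 0.1, -0.4, 1.2, 0.3]),
                               np.array([0.5, -0.2, -0.9, 1.0, 0.0]))
print(round(A, 3), round(K, 3))
```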

This linear linking method is applicable when there are common examinees across different groups. However, as discussed in the introduction, arranging for multiple groups with partially overlapping examinees (and/or raters) can often be impractical in real-world testing environments. To address this limitation, we aim to facilitate test linking without the need for common examinees and raters by leveraging AES technology.

Automated essay scoring models

Many AES methods have been developed over recent decades and can be broadly categorized into either feature-engineering or automatic feature extraction approaches (Hussein et al., 2019 ; Ke & Ng, 2019 ). The feature-engineering approach predicts essay scores using either a regression or classification model that employs manually designed features, such as essay length and the number of spelling errors (Amorim et al., 2018 ; Dascalu et al., 2017 ; Nguyen & Litman, 2018 ; Shermis & Burstein, 2002 ). The advantages of this approach include greater interpretability and explainability. However, it generally requires considerable effort in developing effective features to achieve high scoring accuracy for various datasets. Automatic feature extraction approaches based on deep neural networks (DNNs) have recently attracted attention as a means of eliminating the need for feature engineering. Many DNN-based AES models have been proposed in the last decade and have achieved state-of-the-art accuracy (Alikaniotis et al., 2016 ; Dasgupta et al., 2018 ; Farag et al., 2018 ; Jin et al., 2018 ; Mesgar & Strube, 2018 ; Mim et al., 2019 ; Nadeem et al., 2019 ; Ridley et al., 2021 ; Taghipour & Ng, 2016 ; Uto, 2021b ; Wang et al., 2018 ). In the next section, we introduce the most widely used DNN-based AES model, which utilizes Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019 ).

BERT-based AES model

BERT, a pre-trained language model developed by Google’s AI language team, achieved state-of-the-art performance in various natural language processing (NLP) tasks in 2019 (Devlin et al., 2019 ). Since then, it has frequently been applied to AES (Rodriguez et al., 2019 ) and automated short-answer grading (Liu et al., 2019 ; Lun et al., 2020 ; Sung et al., 2019 ) and has demonstrated high accuracy.

BERT is structured as a multilayer bidirectional transformer network, where the transformer is a neural network architecture designed to handle ordered sequences of data using an attention mechanism. See Ref. (Vaswani et al., 2017 ) for details of transformers.

BERT undergoes training in two distinct phases, pretraining and fine-tuning. The pretraining phase utilizes massive volumes of unlabeled text data and is conducted through two unsupervised learning tasks, specifically, masked language modeling and next-sentence prediction. Masked language modeling predicts the identities of words that have been masked out of the input text, while next-sentence prediction predicts whether two given sentences are adjacent.

Fine-tuning is required to adapt a pre-trained BERT model for a specific NLP task, including AES. This entails retraining the BERT model using a task-specific supervised dataset after initializing the model parameters with pre-trained values and augmenting it with task-specific output layers. For AES applications, the addition of a special token, [CLS], at the beginning of each input is required. Then, BERT condenses the entire input text into a fixed-length real-valued hidden vector referred to as the distributed text representation, which corresponds to the output of the special token [CLS] (Devlin et al., 2019). AES scores can thus be derived by feeding the distributed text representation into a linear layer with sigmoid activation, as depicted in Fig. 1. More formally, let \(\varvec{h}\) be the distributed text representation. The linear layer with sigmoid activation is defined as \(\sigma(\varvec{W}\varvec{h}+b)\), where \(\varvec{W}\) is a weight matrix and \(b\) is a bias, both learned during the fine-tuning process. The sigmoid function \(\sigma(\cdot)\) maps its input to a value between 0 and 1. Therefore, the model is trained to minimize an error loss function between the predicted scores and the gold-standard scores, which are normalized to the [0, 1] range. Moreover, score prediction using the trained model is performed by linearly rescaling the predicted scores back to the original score range.

Figure 1. BERT-based AES model architecture. \(w_{jt}\) is the t-th word in the essay of examinee j, \(n_j\) is the number of words in the essay, and \(\hat{y}_{j}\) represents the predicted score from the model.
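The sketch below illustrates this architecture using PyTorch and the Hugging Face transformers library: the [CLS] vector from a pre-trained BERT encoder is passed through a linear layer with sigmoid activation to produce a score in [0, 1]. It is a minimal illustration rather than the authors' implementation; the checkpoint name and the 512-token maximum length are assumptions.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class BertForEssayScoring(nn.Module):
    """[CLS] representation -> linear layer -> sigmoid, yielding a score in [0, 1]."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.regressor = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0, :]  # distributed text representation
        return torch.sigmoid(self.regressor(cls_vector)).squeeze(-1)

# Example forward pass; predictions would later be rescaled to the rubric's score range.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForEssayScoring()
batch = tokenizer(["An example essay ..."], truncation=True, max_length=512,
                  padding=True, return_tensors="pt")
with torch.no_grad():
    print(model(batch["input_ids"], batch["attention_mask"]))
```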

Problems with AES model training

As mentioned above, to employ BERT-based and other DNN-based AES models, they must be trained or fine-tuned using a large dataset of essays that have been graded by human raters. Typically, the mean-squared error (MSE) between the predicted and the gold-standard scores serves as the loss function for model training. Specifically, let \(y_{j}\) be the normalized gold-standard score for the j-th examinee's essay, and let \(\hat{y}_{j}\) be the predicted score from the model. The MSE loss function is then defined as

\[ \mathcal{L} = \frac{1}{J} \sum_{j=1}^{J} \left( \hat{y}_{j} - y_{j} \right)^2, \]

where J denotes the number of examinees, which is equivalent to the number of essays, in the training dataset.

Here, note that a large-scale training dataset is often created by assigning a few raters from a pool of potential raters to each essay to reduce the scoring burden and to increase scoring reliability. In such cases, the gold-standard score for each essay is commonly determined by averaging the scores given by multiple raters assigned to that essay. However, as discussed in earlier sections, these straightforward average scores are highly sensitive to rater characteristics. When training data includes rater bias effects, an AES model trained on that data can show decreased performance as a result of inheriting these biases (Amorim et al., 2018 ; Huang et al., 2019 ; Li et al., 2020 ; Wind et al., 2018 ). An AES method that uses IRT has been proposed to address this issue (Uto & Okano, 2021 ).
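The toy sketch below makes the issue concrete: gold-standard targets are formed by averaging each essay's observed ratings (with \(-1\) marking unassigned raters), so whichever raters happen to be assigned leave their severity imprinted on the target. The matrix and scores are invented for illustration.

```python
import numpy as np

def average_gold_scores(U: np.ndarray) -> np.ndarray:
    """Average each essay's observed ratings; -1 marks raters who did not score the essay."""
    masked = np.ma.masked_equal(U, -1)
    return masked.mean(axis=1).filled(np.nan)

# Three essays, four raters; each essay happens to be scored by a different pair of raters.
U = np.array([[4, -1, 3, -1],
              [-1, 2, -1, 2],
              [5, 5, -1, -1]])
print(average_gold_scores(U))  # [3.5, 2.0, 5.0]: each target inherits its assigned raters' severity
```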

AES method using IRT

The main idea behind the AES method using IRT (Uto & Okano, 2021 ) is to train an AES model using the ability value \(\theta _j\) estimated by IRT models with rater parameters, such as MFRM and its extensions, from the data given by multiple raters for each essay, instead of a simple average score. Specifically, AES model training in this method occurs in two steps, as outlined in Fig.  2 .

Step 1: Estimate the IRT-based abilities \(\varvec{\theta}\) from a score dataset, which includes scores given to essays by multiple raters.

Step 2: Train an AES model given the ability estimates as the gold-standard scores. Specifically, the MSE loss function for training is defined as

\[ \mathcal{L} = \frac{1}{J} \sum_{j=1}^{J} \left( \hat{\theta}_{j} - \theta_{j} \right)^2, \]

where \(\hat{\theta}_j\) represents the AES's predicted ability of the j-th examinee, and \(\theta_{j}\) is the gold-standard ability for the examinee obtained from Step 1. Note that the gold-standard scores are rescaled into the range [0, 1] by applying a linear transformation from the logit range \([-3, 3]\) to [0, 1]. See the original paper (Uto & Okano, 2021) for details.

Figure 2. Architecture of a BERT-based AES model that uses IRT.

A trained AES model based on this method will not reflect bias effects because IRT-based abilities \(\varvec{\theta }\) are estimated while removing rater bias effects.

In the prediction phase, the score for an essay from examinee \(j^{\prime }\) is calculated in two steps.

1. Predict the IRT-based ability \(\theta_{j^{\prime}}\) for the examinee using the trained AES model, and then linearly rescale it to the logit range \([-3, 3]\).

2. Calculate the expected score \(\mathbb{E}_{r,k}\left[P_{j^{\prime}rk}\right]\), which corresponds to an unbiased original-scaled score, given \(\theta_{j'}\) and the rater parameters. This expected score is used as the predicted essay score in this method.

This method originally aimed to train an AES model while mitigating the impact of varying rater characteristics present in the training data. A key feature, however, is its ability to predict an examinee’s IRT-based ability from their essay texts. Our linking approach leverages this feature to enable test linking without requiring common examinees and raters.
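A minimal sketch of this two-step prediction is shown below: the AES output in [0, 1] is rescaled to the logit range [-3, 3], and the expected score is then computed under the single-item GMFM (Eq. 7) by averaging the category expectations over the raters. The helper mirrors the earlier GMFM sketch, and the rater parameters are illustrative rather than estimates from the paper.

```python
import numpy as np

D = 1.7

def gmfm_probs(theta, alpha_r, beta_r, d_r):
    """Category probabilities for one rater under the single-item GMFM."""
    d_r = np.asarray(d_r, dtype=float)
    cum = np.cumsum(D * alpha_r * (theta - beta_r - d_r))
    p = np.exp(cum - cum.max())
    return p / p.sum()

def predicted_essay_score(aes_output: float, rater_params) -> float:
    """Rescale the AES output from [0, 1] to [-3, 3], then return the expected
    score over raters and categories, E_{r,k}[P_{j'rk}]."""
    theta = aes_output * 6.0 - 3.0
    expected = []
    for p in rater_params:
        probs = gmfm_probs(theta, p["alpha"], p["beta"], p["d"])
        categories = np.arange(1, len(probs) + 1)
        expected.append(float(categories @ probs))
    return float(np.mean(expected))

# Two illustrative raters with five rating categories
raters = [{"alpha": 1.0, "beta": 0.3, "d": [0.0, -1.0, -0.2, 0.3, 0.9]},
          {"alpha": 1.4, "beta": -0.2, "d": [0.0, -0.8, -0.4, 0.2, 1.0]}]
print(round(predicted_essay_score(0.62, raters), 2))
```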

Figure 3. Outline of our proposed method, steps 1 and 2.

Figure 4. Outline of our proposed method, steps 3–6.

Proposed method

The core idea behind our method is to develop an AES model that predicts examinee ability using score and essay data from the reference group, and then to use this model to predict the abilities of examinees in the focal group. These predictions are then used to estimate the linking coefficients for a linear linking. An outline of our method is illustrated in Figs.  3 and 4 . The detailed steps involved in the procedure are as follows.

Step 1: Estimate the IRT model parameters from the reference group's data \(\textbf{U}^{\text{ref}}\) to obtain \(\hat{\varvec{\theta}}^{\text{ref}}\), the ability estimates of the examinees in the reference group.

Step 2: Use the ability estimates \(\hat{\varvec{\theta}}^{\text{ref}}\) and the essays written by the examinees in the reference group to train the AES model that predicts examinee ability.

Step 3: Use the trained AES model to predict the abilities of examinees in the focal group by inputting their essays. We designate these AES-predicted abilities as \(\hat{\varvec{\theta}}^{\text{foc}}_{\text{pred}}\) from here on. An important point to note is that the AES model is trained to predict ability values on the parameter scale aligned with the reference group's data, meaning that the predicted abilities for examinees in the focal group follow the same scale.

Step 4: Estimate the IRT model parameters from the focal group's data \(\textbf{U}^{\text{foc}}\).

Step 5: Calculate the linking coefficients A and K using the AES-predicted abilities \(\hat{\varvec{\theta}}^{\text{foc}}_{\text{pred}}\) and the IRT-based ability estimates \(\hat{\varvec{\theta}}^{\text{foc}}\) for examinees in the focal group as follows:

\[ A = \frac{\sigma^{\text{foc}}_{\text{pred}}}{\sigma^{\text{foc}}}, \qquad K = \mu^{\text{foc}}_{\text{pred}} - A\,\mu^{\text{foc}}, \]

where \(\mu^{\text{foc}}_{\text{pred}}\) and \(\sigma^{\text{foc}}_{\text{pred}}\) represent the mean and the SD of the AES-predicted abilities \(\hat{\varvec{\theta}}^{\text{foc}}_{\text{pred}}\), respectively. Furthermore, \(\mu^{\text{foc}}\) and \(\sigma^{\text{foc}}\) represent the corresponding values for the IRT-based ability estimates \(\hat{\varvec{\theta}}^{\text{foc}}\).

Step 6: Apply linear linking based on the mean and sigma method given in Eq. 8 using the above linking coefficients and the parameter estimates for the focal group obtained in Step 4. This procedure yields parameter estimates for the focal group that are aligned with the scale of the parameters of the reference group.

As described in Step 3, the AES model used in our method is trained to predict examinee abilities on the scale derived from the reference data \(\textbf{U}^{\text {ref}}\) . Therefore, the abilities predicted by the trained AES model for the examinees in the focal group, denoted as \(\hat{\varvec{\theta }}^{\text {foc}}_{\text {pred}}\) , also follow the ability scale derived from the reference data. Consequently, by using the AES-predicted abilities, we can infer the differences in the ability distribution between the reference and focal groups. This enables us to estimate the linking coefficients, which then allows us to perform linear linking based on the mean and sigma method. Thus, our method allows for test linking without the need for common examinees and raters.
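The following sketch condenses Steps 5 and 6: the linking coefficients are computed from the AES-predicted focal abilities (which lie on the reference scale) and the focal group's own IRT estimates, and are then applied to all focal-group GMFM parameters. The values are illustrative, and only the means and SDs of the predicted abilities are used, consistent with the point made in the next paragraph.

```python
import numpy as np

def linking_from_aes(theta_foc_pred: np.ndarray, theta_foc_irt: np.ndarray):
    """Step 5: coefficients A and K from AES-predicted abilities (reference scale)
    and the focal group's own IRT ability estimates (focal scale)."""
    A = theta_foc_pred.std() / theta_foc_irt.std()
    K = theta_foc_pred.mean() - A * theta_foc_irt.mean()
    return A, K

def apply_linking(A, K, theta, alpha, beta, d):
    """Step 6: mean-and-sigma transformation (Eq. 8) of focal-group GMFM estimates."""
    return A * theta + K, alpha / A, A * beta + K, A * d

# Illustrative values: three focal examinees, two focal raters, four rating categories
theta_pred = np.array([-0.8, 0.1, 0.9])  # predicted by the AES model trained on the reference group
theta_irt = np.array([-1.1, 0.0, 1.1])   # estimated from the focal group's ratings alone
A, K = linking_from_aes(theta_pred, theta_irt)
linked = apply_linking(A, K, theta_irt,
                       np.array([1.2, 0.9]),                    # rater consistencies
                       np.array([0.4, -0.3]),                   # rater severities
                       np.array([[0.0, -1.0, 0.2, 0.8],
                                 [0.0, -0.6, 0.1, 0.5]]))       # rater-specific step parameters
print(linked[0].round(2))  # focal abilities re-expressed on the reference scale
```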

It is important to note that the current AES model for predicting examinees’ abilities does not necessarily offer sufficient prediction accuracy for individual ability estimates. This implies that their direct use in mid- to high-stakes assessments could be problematic. Therefore, we focus solely on the mean and SD values of the ability distribution based on predicted abilities, rather than using individual predicted ability values. Our underlying assumption is that these AES models can provide valuable insights into differences in the ability distribution across various groups, even though the individual predictions might be somewhat inaccurate, thereby substantiating their utility for test linking.

Experiments

In this section, we provide an overview of the experiments we conducted using actual data to evaluate the effectiveness of our method.

Actual data

We used the dataset previously collected in Uto and Okano ( 2021 ). It consists of essays written in English by 1805 students from grades 7 to 10 along with scores from 38 raters for these essays. The essays originally came from the ASAP (Automated Student Assessment Prize) dataset, which is a well-known benchmark dataset for AES studies. The raters were native English speakers recruited from Amazon Mechanical Turk (AMT), a popular crowdsourcing platform. To alleviate the scoring burden, only a few raters were assigned to each essay, rather than having all raters evaluate every essay. Rater assignment was conducted based on a systematic links design  (Shin et al., 2019 ; Uto, 2021a ; Wind & Jones, 2019 ) to achieve IRT-scale linking. Consequently, each rater evaluated approximately 195 essays, and each essay was graded by four raters on average. The raters were asked to grade the essays using a holistic rubric with five rating categories, which is identical to the one used in the original ASAP dataset. The raters were provided no training before the scoring process began. The average Pearson correlation between the scores from AMT raters and the ground-truth scores included in the original ASAP dataset was 0.70 with an SD of 0.09. The minimum and maximum correlations were 0.37 and 0.81, respectively. Furthermore, we also calculated the intraclass correlation coefficient (ICC) between the scores from each AMT rater and the ground-truth scores. The average ICC was 0.60 with an SD of 0.15, and the minimum and maximum ICCs were 0.29 and 0.79, respectively. The calculation of the correlation coefficients and ICC for each AMT rater excluded essays that the AMT rater did not assess. Furthermore, because the ground-truth scores were given as the total scores from two raters, we divided them by two in order to align the score scale with the AMT raters’ scores.
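As a small illustration of how these agreement statistics can be computed, the sketch below calculates each rater's Pearson correlation with the ground-truth scores, skipping essays the rater did not assess (marked \(-1\)) and halving the ground-truth totals as described above. The tiny matrix is invented; the real dataset has 1805 essays and 38 raters.

```python
import numpy as np

def per_rater_correlations(U: np.ndarray, ground_truth_totals: np.ndarray) -> np.ndarray:
    """Pearson correlation between each rater's scores and the ground-truth scores,
    excluding essays the rater did not assess (-1); totals from two raters are halved."""
    gt = ground_truth_totals / 2.0
    corrs = []
    for r in range(U.shape[1]):
        scored = U[:, r] != -1
        corrs.append(np.corrcoef(U[scored, r], gt[scored])[0, 1])
    return np.array(corrs)

# Five essays, two raters (illustrative only)
U = np.array([[4, -1], [3, 2], [-1, 5], [2, 2], [5, 4]])
gt_totals = np.array([8, 5, 9, 4, 10])
print(per_rater_correlations(U, gt_totals).round(2))
```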

For further analysis, we also evaluated the ICC among the AMT raters as their interrater reliability. In this analysis, missing value imputation was required because all essays were evaluated by a subset of AMT raters. Thus, we first applied multiple imputation with predictive mean matching to the AMT raters’ score dataset. In this process, we generated five imputed datasets. For each imputed dataset, we calculated the ICC among all AMT raters. Finally, we aggregated the ICC values from each imputed dataset to calculate the mean ICC and its SD. The results revealed a mean ICC of 0.43 with an SD of 0.01.

These results suggest that the reliability of raters is not necessarily high. This variability in scoring behavior among raters underscores the importance of applying IRT models with rater parameters. For further details of the dataset see Uto and Okano ( 2021 ).

Experimental procedures

Using this dataset, we conducted the following experiment for three IRT models with rater parameters, MFRM, MFRM with RSS, and GMFM, defined by Eqs.  5 , 6 , and 7 , respectively.

Step 1: We estimated the IRT parameters from the entire dataset using the No-U-Turn sampler-based Markov chain Monte Carlo (MCMC) algorithm, given the prior distributions \(\theta_j, \beta_r, d_m, d_{rm} \sim N(0, 1)\) and \(\alpha_r \sim LN(0, 0.5)\), following the previous work (Uto & Ueno, 2020). Here, \(N(\cdot, \cdot)\) and \(LN(\cdot, \cdot)\) indicate normal and log-normal distributions with the given mean and SD, respectively. The expected a posteriori (EAP) estimator was used to obtain point estimates.

Step 2: We then separated the dataset randomly into two groups, the reference group and the focal group, ensuring no overlap of examinees and raters between them. In this separation, we selected examinees and raters in each group to ensure distinct distributions of examinee abilities and rater severities. Various separation patterns were tested and are listed in Table 1. For example, condition 1 in Table 1 means that the reference group comprised randomly selected high-ability examinees and low-severity raters, while the focal group comprised low-ability examinees and high-severity raters. Condition 2 provided a similar separation but controlled for narrower variance in rater severity in the focal group. Details of the group creation procedures can be found in Appendix A.

Step 3: Using the obtained data for the reference and focal groups, we conducted test linking using our method, the details of which are given in the Proposed method section. In this step, the IRT parameter estimations were carried out using the same MCMC algorithm as in Step 1.

Step 4: We calculated the root mean squared error (RMSE) between the IRT parameters for the focal group, which were linked using our proposed method, and their gold-standard parameters. In this context, the gold-standard parameters were obtained by transforming the scale of the parameters estimated from the entire dataset in Step 1 so that it aligned with that of the reference group. Specifically, we estimated the IRT parameters using data from the reference group and collected those estimated from the entire dataset in Step 1. Then, using the examinees in the reference group as common examinees, we applied linear linking based on the mean and sigma method to adjust the scale of the parameters estimated from the entire dataset to match that of the reference group.

Step 5: For comparison, we also calculated the RMSE between the focal group's IRT parameters, obtained without applying the proposed linking, and their gold-standard parameters. This serves as the worst-case baseline against which the results of the proposed method are compared. Additionally, we examined other baselines that use linear linking based on common examinees. For these baselines, we randomly selected five or ten examinees from the reference group who were assigned scores by at least two of the focal group's raters in the entire dataset. The scores given to these selected examinees by the focal group's raters were then merged with the focal group's data, so that the added examinees served as common examinees between the reference and focal groups. Using this data, we examined linear linking using common examinees. Specifically, we estimated the IRT parameters from the data of the focal group with common examinees and applied linear linking based on the mean and sigma method using the ability estimates of the common examinees to align its scale with that of the reference group. Finally, we calculated the RMSE between the linked parameter estimates for the examinees and raters belonging only to the original focal group and their gold-standard parameters. Note that this common examinee approach operates under more advantageous conditions compared with the proposed linking method because it can utilize larger samples for estimating the parameters of raters in the focal group.

We repeated Steps 2–5 ten times for each data separation condition and calculated the average RMSE for four cases: one in which our proposed linking method was applied, one without linking, and two others where linear linkings using five and ten common examinees were applied.

The parameter estimation program utilized in Steps 1, 4, and 5 was implemented using RStan (Stan Development Team, 2018). The EAP estimates were calculated as the mean of the parameter samples obtained from iterations 2,000 to 5,000 of three independent chains. The AES model was developed in Python, leveraging the PyTorch library. For the AES model training in Step 3, we randomly selected \(90\%\) of the data from the reference group to serve as the training set, with the remaining \(10\%\) designated as the development set. We limited the maximum number of steps for training the AES model to 800 and set the maximum number of epochs to 800 divided by the number of mini-batches. Additionally, we employed early stopping based on the performance on the development set. The AdamW optimization algorithm was used, and the mini-batch size was set to 8.
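A sketch of the fine-tuning loop implied by these settings is shown below (AdamW, mini-batches of size 8 configured in the data loader, at most 800 steps, and early stopping on development-set RMSE). The model class, the data loaders yielding (input_ids, attention_mask, target) batches, the learning rate, and the patience value are assumptions, not details reported in the paper.

```python
import math
import torch
from torch.optim import AdamW

def fine_tune(model, train_loader, dev_loader, max_steps: int = 800, patience: int = 5):
    """Sketch: AdamW optimization, at most `max_steps` updates, early stopping on dev RMSE."""
    optimizer = AdamW(model.parameters(), lr=2e-5)  # learning rate is an assumption
    loss_fn = torch.nn.MSELoss()
    max_epochs = math.ceil(max_steps / len(train_loader))
    best_rmse, bad_epochs, step = float("inf"), 0, 0

    for _ in range(max_epochs):
        model.train()
        for input_ids, attention_mask, targets in train_loader:
            if step >= max_steps:
                break
            optimizer.zero_grad()
            preds = model(input_ids, attention_mask)
            loss_fn(preds, targets).backward()
            optimizer.step()
            step += 1

        # Early stopping based on development-set RMSE
        model.eval()
        with torch.no_grad():
            sq_err, n = 0.0, 0
            for input_ids, attention_mask, targets in dev_loader:
                preds = model(input_ids, attention_mask)
                sq_err += torch.sum((preds - targets) ** 2).item()
                n += targets.numel()
        rmse = math.sqrt(sq_err / n)
        if rmse < best_rmse:
            best_rmse, bad_epochs = rmse, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return model
```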

MCMC statistics and model fitting

Before delving into the results of the aforementioned experiments, we provide some statistics related to the MCMC-based parameter estimation. Specifically, we computed the Gelman–Rubin statistic \(\hat{R}\)  (Gelman et al., 2013 ; Gelman & Rubin, 1992 ), a well-established diagnostic index for convergence, as well as the effective sample size (ESS) and the number of divergent transitions for each IRT model during the parameter estimation phase in Step 1. Across all models, the \(\hat{R}\) statistics were below 1.1 for all parameters, indicating convergence of the MCMC runs. Furthermore, as shown in the first row of Table  2 , our ESS values for all parameters in all models exceeded the criterion of 400, which is considered sufficiently large according to Zitzmann and Hecht ( 2019 ). We also observed no divergent transitions in any of the cases. These results support the validity of the MCMC-based parameter estimation.

Furthermore, we evaluated the model–data fit for each IRT model during the parameter estimation in Step 1. To assess this fit, we employed the posterior predictive p value (PPP-value) (Gelman et al., 2013), a commonly used metric for evaluating the model–data fit in Bayesian frameworks (Nering & Ostini, 2010; van der Linden, 2016). Specifically, we calculated the PPP-value using an averaged standardized residual, a conventional metric for IRT model fit in non-Bayesian settings, as a discrepancy function, similar to the approach in Nering and Ostini (2010), Tran (2020), and Uto and Okano (2021). A well-fitted model yields a PPP-value close to 0.5, while poorly fitted models exhibit extremely low or high values, such as those below 0.05 or above 0.95. Additionally, we calculated two information criteria, the widely applicable information criterion (WAIC) (Watanabe, 2010) and the widely applicable Bayesian information criterion (WBIC) (Watanabe, 2013). The model that minimizes these criteria is considered optimal.
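For reference, the sketch below shows how such diagnostics could be computed in Python with ArviZ for a Stan fit, and how a PPP-value can be obtained from per-draw discrepancies; the fit object and the discrepancy arrays are placeholders, not the paper's code.

```python
import numpy as np
# import arviz as az  # e.g., idata = az.from_cmdstanpy(fit); az.rhat(idata); az.ess(idata)

def ppp_value(disc_observed: np.ndarray, disc_replicated: np.ndarray) -> float:
    """Posterior predictive p value: share of posterior draws for which the discrepancy of
    data replicated under the model exceeds the discrepancy of the observed data.
    Values near 0.5 suggest good fit; values below 0.05 or above 0.95 suggest misfit."""
    return float(np.mean(disc_replicated > disc_observed))

# Illustrative per-draw discrepancies (e.g., averaged standardized residuals)
rng = np.random.default_rng(1)
print(round(ppp_value(rng.normal(1.00, 0.10, 1000), rng.normal(1.02, 0.10, 1000)), 2))
```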

The last three rows in Table 2 show the results. We can see that the PPP-value for GMFM is close to 0.5, indicating a good fit to the data. In contrast, the other models exhibit high values, suggesting a poor fit to the data. Furthermore, among the three IRT models evaluated, GMFM exhibits the lowest WAIC and WBIC values. These findings suggest that GMFM offers the best fit to the data, corroborating previous work that investigated the same dataset using IRT models (Uto & Okano, 2021). We provide further discussion about the model fit in the Analysis of rater characteristics section given later.

According to these results, the following section focuses on the results for GMFM. Note that we also include the results for MFRM and MFRM with RSS in Appendix  B , along with the open practices statement.

Effectiveness of our proposed linking method

The results of the aforementioned experiments for GMFM are shown in Table  3 . In the table, the Unlinked row represents the average RMSE between the focal group’s IRT parameters without applying our linking method and their gold-standard parameters. Similarly, the Linked by proposed method row represents the average RMSE between the focal group’s IRT parameters after applying our linking method and their gold-standard parameters. The rows labeled Linked by five/ten common examinees represent the results for linear linking using common examinees.

A comparison of the results from the unlinked condition and the proposed method reveals that the proposed method improved the RMSEs for the ability and rater severity parameters, namely, \(\theta _j\) and \(\beta _r\) , which we intentionally varied between the reference and focal groups. The degree of improvement is notably substantial when the distributional differences between the reference and focal groups are large, as is the case in Conditions 1–5. On the other hand, for Conditions 6–8, where the distributional differences are relatively minor, the improvements are also smaller in comparison. This is because the RMSEs for the unlinked parameters are already lower in these conditions than in Conditions 1–5. Nonetheless, it is worth emphasizing that the RMSEs after employing our linking method are exceptionally low in Conditions 6–8.

Furthermore, the table indicates that the RMSEs for the step parameters and rater consistency parameters, namely, \(d_{rm}\) and \(\alpha _r\) , also improved in many cases, while the impact of applying our linking method is relatively small for these parameters compared with the ability and rater severity parameters. This is because we did not intentionally vary their distribution between the reference and focal groups, and thus their distribution differences were smaller than those for the ability and rater severity parameters, as shown in the next section.

Comparing the results from the proposed method and linear linking using five common examinees, we observe that the proposed method generally exhibits lower RMSE values for the ability \(\theta_j\) and the rater severity parameters \(\beta_r\), except for Conditions 2 and 3. Furthermore, when comparing the proposed method with linear linking using ten common examinees, it achieves superior performance in Conditions 4–8 and slightly lower performance in Conditions 1–3 for \(\theta_j\) and \(\beta_r\), although the differences are smaller overall than those observed in the comparison with five common examinees. Note that the reasons why the proposed method tends to show lower performance for Conditions 1–3 are as follows.

  • The proposed method utilizes fewer samples to estimate the rater parameters compared with the linear linking method using common examinees.

  • In situations where distributional differences between the reference and focal groups are relatively large, as in Conditions 1–3, constructing an accurate AES model for the focal group becomes challenging due to the limited overlap in the ability value range. We elaborate on this point in the next section.

Furthermore, in terms of the rater consistency parameter \(\alpha _r\) and the step parameter \(d_{rm}\) , the proposed method typically shows lower RMSE values compared with linear linking using common examinees. We attribute this to the fact that the performance of the linking method using common examinees is highly dependent on the choice of common examinees, which can sometimes result in significant errors in these parameters. This issue is also further discussed in the next section.

These results suggest that our method can perform linking with comparable accuracy to linear linking using few common examinees, even in the absence of common examinees and raters. Additionally, as reported in Tables  15 and 16 in Appendix  B , both MFRM and MFRM with RSS also exhibit a similar tendency, further validating the effectiveness of our approach regardless of the IRT models employed.

Detailed analysis

Analysis of parameter scale transformation using the proposed method

In this section, we detail how our method transforms the parameter scale. To demonstrate this, we first summarize the mean and SD values of the gold-standard parameters for both the reference and focal groups in Table  4 . The values in the table are averages calculated from ten repetitions of the experimental procedures. The table shows that the mean and SD values of both examinee ability and rater severity vary significantly between the reference and focal groups following our intended settings, as outlined in Table  1 . Additionally, the mean and SD values for the rater consistency parameter \(\alpha _r\) and the rater-specific step parameters \(d_{rm}\) also differ slightly between the groups, although we did not intentionally alter them.

Second, the averaged values of the means and SDs of the parameters, estimated solely from either the reference or the focal group’s data over ten repetitions, are presented in Table  5 . The table reveals that the estimated parameters for both groups align with a normal distribution centered at nearly zero, despite the actual ability distributions differing between the groups. This phenomenon arises because IRT permits arbitrary scaling of parameters for each independent dataset, as mentioned in the Linking section. This leads to differences in the parameter scale for the focal group compared with their gold-standard values, thereby highlighting the need for parameter linking.

Next, the first two rows of Table  6 display the mean and SD values of the ability estimates for the focal group’s examinees, as predicted by the BERT-based AES model. In the table, the RMSE row indicates the RMSE between the AES-predicted ability values and the gold-standard ability values for the focal groups. The Linking Coefficients row presents the linking coefficients calculated based on the AES-predicted abilities. As with the abovementioned tables, these values are also averages over ten experimental repetitions. According to the table, for Conditions 6–8, where the distributional differences between the groups are relatively minor, both the mean and SD estimates align closely with those of the gold-standard parameters. In contrast, for Conditions 1–5, where the distributional differences are more pronounced, the mean and SD estimates tend to deviate from the gold-standard values, highlighting the challenges of parameter linking under such conditions.

In addition, as indicated in the RMSE row, the AES-predicted abilities may lack accuracy under specific conditions, such as Conditions 1, 2, and 3. This inaccuracy could arise because the AES model, trained on the reference group’s data, could not cover the ability range of the focal group due to significant differences in the ability distribution between the groups. Note that even in cases where the mean and SD estimates are relatively inaccurate, these values are closer to the gold-standard ones than those estimated solely from the focal group’s data. This leads to meaningful linking coefficients, which transform the focal group’s parameters toward the scale of their gold-standard values.

Finally, Table  7 displays the averaged values of the means and SDs of the focal group’s parameters obtained through our linking method over ten repetitions. Note that the mean and SD values of the ability estimates are the same as those reported in Table  6 because the proposed method is designed to align them. The table indicates that the differences in the mean and SD values between the proposed method and the gold-standard condition, shown in Table  4 , tend to be smaller compared with those between the unlinked condition, shown in Table  5 , and the gold-standard. To verify this point more precisely, Table  8 shows the average absolute differences in the mean and SD values of the parameters for the focal groups between the proposed method and the gold-standard condition, as well as those between the unlinked condition and the gold-standard. These values were calculated by averaging the absolute differences in the mean and SD values obtained from each of the ten repetitions, unlike the simple absolute differences in the values reported in Tables  4 and 7 . The table shows that the proposed linking method tends to derive lower values, especially for \(\theta _j\) and \(\beta _r\) , than the unlinked condition. Furthermore, this tendency is prominent for conditions 6–8 in which the distributional differences between the focal and reference groups are relatively small. These trends are consistent with the cases for which our method revealed high linking performance, detailed in the previous section.

In summary, the above analyses suggest that although the AES model’s predictions may not always be perfectly accurate, they can offer valuable insights into scale differences between the reference and focal groups, thereby facilitating successful IRT parameter linking without common examinees and raters.

We now present the distributions of examinee ability and rater severity for the focal group, comparing their gold-standard values with those before and after the application of the linking method. Figures 5, 6, 7, 8, 9, 10, 11, and 12 are illustrative examples for the eight data-splitting conditions. The gray bars depict the distributions of the gold-standard parameters, the blue bars represent those of the parameters estimated from the focal group's data, the red bars signify those of the parameters obtained using our linking method, and the green bars indicate the ability distribution as predicted by the BERT-based AES. The upper part of each figure presents results for examinee ability \(\theta_j\), and the lower part presents those for rater severity \(\beta_r\).

The blue bars in these figures reveal that the parameters estimated from the focal group’s data exhibit distributions with different locations and/or scales compared with their gold-standard values. Meanwhile, the red bars reveal that the distributions of the parameters obtained through our linking method tend to align closely with those of the gold-standard parameters. This is attributed to the fact that the ability distributions for the focal group given by the BERT-based AES model, as depicted by the green bars, were informative for performing linear linking.

Analysis of the linking method based on common examinees

For a detailed analysis of the linking method based on common examinees, Table 9 reports the averaged values of means and SDs of the focal groups' parameter estimates obtained by the linking method based on five and ten common examinees for each condition. Furthermore, Table 10 shows the average absolute differences between these values and those from the gold-standard condition. The table indicates that an increase in the number of common examinees tends to lower the average absolute differences, which is a reasonable trend. Furthermore, comparing the results with those of the proposed method reported in Table 8, the proposed method tends to achieve smaller absolute differences in Conditions 4–8 for \(\theta_j\) and \(\beta_r\), which is consistent with the tendency of the linking performance discussed in the "Effectiveness of our proposed linking method" section.

Note that although the mean and SD values in Table  9 are close to those of the gold-standard parameters shown in Table  4 , this does not imply that linear linking based on five or ten common examinees achieves high linking accuracy for each repetition. To explain this, Table  11 shows the means of the gold-standard ability values for the focal group and their estimates obtained from the proposed method and the linking method based on ten common examinees, for each of ten repetitions under condition 8. This table also shows the absolute differences between the estimated ability means and the corresponding gold-standard means.

Figure 5. Example of ability and rater severity distributions for the focal group under data-splitting condition 1.

Figure 6. Example of ability and rater severity distributions for the focal group under data-splitting condition 2.

Figure 7. Example of ability and rater severity distributions for the focal group under data-splitting condition 3.

Figure 8. Example of ability and rater severity distributions for the focal group under data-splitting condition 4.

Figure 9. Example of ability and rater severity distributions for the focal group under data-splitting condition 5.

Figure 10. Example of ability and rater severity distributions for the focal group under data-splitting condition 6.

Figure 11. Example of ability and rater severity distributions for the focal group under data-splitting condition 7.

Figure 12. Example of ability and rater severity distributions for the focal group under data-splitting condition 8.

The table shows that the results of the proposed method are relatively stable, consistently revealing low absolute differences for every repetition. In contrast, the results of linear linking based on ten common examinees vary significantly across repetitions, resulting in large absolute differences for some repetitions. These results yield a smaller average absolute difference for the proposed method compared with linear linking based on ten common examinees. However, in terms of the absolute difference in the averaged ability means, linear linking based on ten common examinees shows a smaller difference ( \(|0.38-0.33| = 0.05\) ) compared with the proposed method ( \(|0.38-0.46| = 0.08\) ). This occurs because the results of linear linking based on ten common examinees for ten repetitions fluctuate around the ten-repetition average of the gold standard, thereby canceling out the positive and negative differences. However, this does not imply that linear linking based on ten common examinees achieves high linking accuracy for each repetition. Thus, it is reasonable to interpret the average of the absolute differences calculated for each of the ten repetitions, as reported in Tables  8 and  10 .

This greater variability in performance of the linking method based on common examinees also relates to the tendency of the proposed method to show lower RMSE values for the rater consistency parameter \(\alpha _r\) and the step parameters \(d_{rm}\) compared with linking based on common examinees, as mentioned in the Effectiveness of our proposed linking method section. In that section, we mentioned that this is due to the fact that linear linking based on common examinees is highly dependent on the selection of common examinees, which can sometimes lead to significant errors in these parameters.

To confirm this point, Table  12 displays the SD of RMSEs calculated from ten repetitions of the experimental procedures for both the proposed method and linear linking using ten common examinees. The table indicates that the linking method using common examinees tends to exhibit larger SD values overall, suggesting that this linking method sometimes becomes inaccurate, as we also exemplified in Table  11 . This variability also implies that the estimation of the linking coefficient can be unstable.

Furthermore, the tendency toward larger SD values in the common examinee approach is particularly pronounced for the step parameters at the extreme categories, namely, \(d_{r2}\) and \(d_{r5}\). We attribute this to the instability of the linking coefficients and the fact that the step parameters for the extreme categories tend to have large absolute values (see Table 13 for detailed estimates). Linear linking multiplies the step parameters by the linking coefficient A, and applying an inappropriate linking coefficient to larger absolute values has a more substantial impact than applying it to smaller values. We conclude that this is why the RMSEs of the step parameters in the common examinee approach deteriorated compared with those in the proposed method. The same reasoning applies to the rater consistency parameter, given that it is distributed among positive values with a mean over one. See Table 13 for details.

Prerequisites of the proposed method

As demonstrated thus far, the proposed method can perform IRT parameter linking without the need for common examinees and raters. As outlined in the Introduction section, certain testing scenarios may encounter challenges or incur significant costs in assembling common examinees or raters. Our method provides a viable solution in these situations. However, it does come with specific prerequisites and inherent costs.

The prerequisites of our proposed method are as follows.

1. The same essay writing task is offered to both the reference and focal groups, and the written essays are scored by different groups of raters using the same rubric.

2. Raters function equivalently across the reference and focal groups, so that the established scales can be aligned through a linear transformation. This implies that there are no systematic differences in scoring that are correlated with the groups but are unrelated to the measured construct, such as differential rater functioning (Leckie & Baird, 2011; Myford & Wolfe, 2009; Uto, 2023; Wind & Guo, 2019).

3. The ability ranges of the reference and focal groups overlap to some extent, because the ability prediction accuracy of the AES decreases as the differences in the ability distributions between the groups increase, as discussed in the Detailed analysis section. This is a limitation of the approach that future studies will need to overcome.

4. The reference group contains a sufficient number of examinees for training AES models using their essays as training data.

Related to the fourth point, we conducted an additional experiment to investigate the number of samples required to train AES models. In this experiment, we assessed the ability prediction accuracy of the BERT-based AES model used in this study by varying the number of training samples. The detailed experimental procedures are outlined below.

1. Estimate the abilities of all 1805 examinees from the entire dataset based on the GMFM.

2. Randomly split the examinees into 80% (1444) and 20% (361) groups. The 20% subset, consisting of examinees’ essays and their ability estimates, serves as test data for evaluating the ability prediction accuracy of the AES models trained in the following steps.

3. Further divide the 80% subset into 80% (1155) and 20% (289) groups. The essays and ability estimates of the 80% subset serve as training data, while those of the 20% subset serve as development data for selecting the optimal epoch.

4. Train the BERT-based AES model using the training data, and select the optimal epoch that minimizes the RMSE between the predicted and gold-standard ability values for the development set.

5. Use the trained AES model at the optimal epoch to evaluate the RMSE between the predicted and gold-standard ability values for the test data.

6. Randomly sample 50, 100, 200, 300, 500, 750, and 1000 examinees from the training data created in Step 3.

7. Train the AES model on each sampled set, and select the optimal epoch using the same development data as before.

8. Use each trained AES model to evaluate the RMSE for the same test data as before.

9. Repeat Steps 2–8 five times and calculate the average RMSE for the test data.
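A condensed Python sketch of this procedure (focusing on Steps 2, 3, and 6–9) is given below. Here, train_aes_model and predict_abilities are hypothetical placeholders standing in for the BERT-based AES pipeline, and essays and abilities denote the examinees’ essays and their GMFM-based ability estimates from Step 1.

```python
# Sketch of the sample-size experiment above. train_aes_model() and
# predict_abilities() are hypothetical placeholders for the BERT-based AES
# pipeline described in the text.
import numpy as np
from sklearn.model_selection import train_test_split

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def one_repetition(essays, abilities, sample_sizes, seed):
    # Step 2: hold out 20% of examinees as test data.
    e_rest, e_test, a_rest, a_test = train_test_split(
        essays, abilities, test_size=0.2, random_state=seed)
    # Step 3: split the remaining examinees into training (80%) and development (20%) data.
    e_tr, e_dev, a_tr, a_dev = train_test_split(
        e_rest, a_rest, test_size=0.2, random_state=seed)

    rng = np.random.default_rng(seed)
    rmses = {}
    # Steps 6-8: subsample the training data, retrain, and evaluate on the test data.
    for n in sample_sizes:
        idx = rng.choice(len(e_tr), size=n, replace=False)
        model = train_aes_model([e_tr[i] for i in idx], [a_tr[i] for i in idx],
                                dev_data=(e_dev, a_dev))           # hypothetical helper
        rmses[n] = rmse(a_test, predict_abilities(model, e_test))  # hypothetical helper
    return rmses

# Step 9: repeat with different random splits and average the RMSE per sample size.
def average_rmse(essays, abilities, sample_sizes=(50, 100, 200, 300, 500, 750, 1000)):
    runs = [one_repetition(essays, abilities, sample_sizes, seed) for seed in range(5)]
    return {n: float(np.mean([run[n] for run in runs])) for n in sample_sizes}
```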

Figure 13. Relationship between the number of training samples and the ability prediction accuracy of AES.

Figure 14. Item response curves of four representative raters found in experiments using actual data.

Figure 13 displays the results. The horizontal axis represents the number of training samples, and the vertical axis shows the RMSE values. Each point shows the average RMSE, with error bars indicating the SD range. The results demonstrate that larger sample sizes improve the accuracy of the AES model. The RMSE drops sharply as the sample size grows from small values, but the improvements tend to plateau beyond 500 samples. This suggests that, for this dataset, approximately 500 samples would be sufficient to train the AES model with reasonable accuracy. However, note that the required number of samples may vary depending on the essay task. A detailed analysis of the relationship between the required number of samples and the characteristics of essay-writing tasks is planned for future work.

An inherent cost associated with the proposed method is the computational expense of constructing the BERT-based AES model. Specifically, a computer with a reasonably powerful GPU is necessary to train the AES model efficiently. In this study, for example, we utilized an NVIDIA Tesla T4 GPU on Google Colaboratory. To quantify this expense, we measured the computation times and costs for the above experiment under the condition with 1155 training samples. Training the AES model with 1155 samples, including evaluating the RMSE for the development set of 289 essays in each epoch, took approximately 10 min in total. Moreover, it took about 10 s to predict the abilities of 361 examinees from their essays using the trained model. The computational units consumed on Google Colaboratory for both training and inference amounted to 0.44, which corresponds to approximately $0.044. These costs and times are far smaller than those required for human scoring.
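To give a rough sense of what this training step involves, the following is a minimal fine-tuning sketch of a BERT-based regression model using the Hugging Face transformers library. It is an assumed setup for illustration only; the model name, hyperparameters, and the data variables (essays, abilities) are placeholders, not the authors’ exact implementation.

```python
# Minimal sketch of fine-tuning BERT to predict IRT-based abilities from essays.
# Illustrative setup only; model choice, hyperparameters, and data variables are assumed.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=1 turns the classification head into a single-output regression head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1).to(device)

def make_loader(essays, abilities, batch_size=16, shuffle=True):
    enc = tokenizer(essays, truncation=True, padding=True,
                    max_length=512, return_tensors="pt")
    labels = torch.tensor(abilities, dtype=torch.float).unsqueeze(1)
    ds = TensorDataset(enc["input_ids"], enc["attention_mask"], labels)
    return DataLoader(ds, batch_size=batch_size, shuffle=shuffle)

def train(essays, abilities, epochs=3, lr=2e-5):
    loader = make_loader(essays, abilities)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in loader:
            optimizer.zero_grad()
            out = model(input_ids=input_ids.to(device),
                        attention_mask=attention_mask.to(device))
            loss = loss_fn(out.logits, labels.to(device))  # regression loss on abilities
            loss.backward()
            optimizer.step()

@torch.no_grad()
def predict(essays):
    model.eval()
    enc = tokenizer(essays, truncation=True, padding=True,
                    max_length=512, return_tensors="pt")
    out = model(input_ids=enc["input_ids"].to(device),
                attention_mask=enc["attention_mask"].to(device))
    return out.logits.squeeze(-1).cpu().tolist()
```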

Analysis of rater characteristics

The MCMC statistics and model fitting section demonstrated that the GMFM provides a better fit to the actual data compared with the MFRM and MFRM with RSS. To explain this, Table  13 shows the rater parameters estimated by the GMFM using the entire dataset. Additionally, Fig.  14 illustrates the item response curves (IRCs) for raters 3, 16, 31, and 34, where the horizontal axis represents the ability \(\theta _j\) , and the vertical axis depicts the response probability for each category.

The table and figure reveal that the raters exhibit diverse and unique characteristics in terms of severity, consistency, and range restriction. For instance, Rater 3 demonstrates nearly average values for all parameters, indicating standard rating characteristics. In contrast, Rater 16 exhibits a pronounced extreme response tendency, as evidenced by higher \(d_{r2}\) and lower \(d_{r5}\) values. Additionally, Rater 31 is characterized by a low severity score, generally preferring higher scores (four and five). Rater 34 exhibits a low consistency value \(\alpha _r\) , which results in minimal variation in response probabilities among categories. This indicates that the rater is likely to assign different ratings to essays of similar quality.

As detailed in the Item Response Theory section, the GMFM can capture these variations in rater severity, consistency, and range restriction simultaneously, whereas the MFRM and MFRM with RSS can capture only subsets of them. We infer that this capability, together with the wide variety of rater characteristics, contributed to the superior model fit of the GMFM compared with the other models.

Note that the proposed method is useful for facilitating linking not only under the GMFM but also under the MFRM and MFRM with RSS, even though their model fits were relatively worse, as mentioned earlier and shown in Appendix B.

Effect of using crowd workers as raters

As detailed in the Actual data section, we used scores assigned by untrained, non-expert crowd workers rather than expert raters. A concern with using crowd workers as raters without adequate training is the potential for greater variability in rating characteristics compared with expert raters. This variability is evidenced by the diverse correlations between the raters’ scores and the ground-truth scores reported in the Actual data section, and by the wide variety of rater parameters discussed above. These observations suggest the importance of the following two strategies for ensuring reliable essay scoring when employing crowd workers as raters.

1. Assign a larger number of raters to each essay than would typically be used with expert raters.

2. Estimate standardized essay scores while accounting for differences in rater characteristics, for example through IRT models that incorporate rater parameters, as used in this study.

Conclusion

In this study, we propose a novel IRT-based linking method for essay-writing tests that uses AES technology to enable parameter linking based on IRT models with rater parameters across multiple groups in which neither examinees nor raters are shared. Specifically, we use a deep neural AES method capable of predicting IRT-based examinee abilities based on their essays. The core concept of our approach is to train an AES model that predicts examinee abilities using data from a reference group. This AES model is then applied to predict the abilities of examinees in the focal group, and these predictions are used to estimate the linking coefficients required for linear linking. Experimental results with real data demonstrate that our method successfully accomplishes test linking with accuracy comparable to that of linear linking using a small number of common examinees.

In our experiments, we compared the linking performance of the proposed method with that of linear linking based on the mean and sigma method using only five or ten common examinees. However, such a small number of common examinees is generally insufficient for accurate linear linking and thus leads to unstable estimation of the linking coefficients, as discussed in the Analysis of the linking method based on common examinees section. Although this study concluded that our method performs linking with accuracy comparable to that of linear linking with a small number of common examinees, more detailed evaluations involving comparisons with various conventional linking methods using different numbers of common examinees and raters remain a target for future work.

Additionally, our experimental results suggest that although the AES model may not provide sufficient predictive accuracy for individual examinee abilities, it does tend to yield reasonable mean and SD values for the ability distribution of focal groups. This lends credence to our assumption stated in the Proposed method section that AES models incorporating IRT can offer valuable insights into differences in ability distribution across various groups, thereby validating their utility for test linking. This result also supports the use of the mean and sigma method for linking. While concurrent calibration, another common linking method, requires highly accurate individual AES-predicted abilities to serve as anchor values, linear linking through the mean and sigma method necessitates only the mean and SD of the ability distribution. Given that the AES model can provide accurate estimates for these statistics, successful linking can be achieved, as shown in our experiments.
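As a concrete sketch of the mean and sigma method in this setting, the following assumes that theta_pred holds the AES-predicted abilities of focal-group examinees (on the reference scale) and theta_foc holds the abilities of the same examinees estimated within the focal group; both arrays and the toy values are hypothetical.

```python
# Sketch of the mean and sigma method for estimating linking coefficients.
# theta_pred: abilities of focal-group examinees predicted by the AES model
#             (already on the reference-group scale).
# theta_foc:  abilities of the same examinees estimated within the focal group
#             (on the focal-group scale).
import numpy as np

def mean_sigma_coefficients(theta_pred, theta_foc):
    theta_pred, theta_foc = np.asarray(theta_pred), np.asarray(theta_foc)
    A = theta_pred.std(ddof=1) / theta_foc.std(ddof=1)   # slope: ratio of SDs
    K = theta_pred.mean() - A * theta_foc.mean()         # intercept: difference of means
    return A, K

def link_abilities(theta_foc, A, K):
    # Linear transformation onto the reference scale: theta* = A * theta + K.
    return A * np.asarray(theta_foc) + K

# Example with toy values:
A, K = mean_sigma_coefficients(theta_pred=[0.2, 0.9, -0.4, 1.3],
                               theta_foc=[-0.5, 0.4, -1.2, 0.9])
```

The focal group’s remaining parameter estimates can then be mapped onto the reference scale with the usual linear-linking transformations (e.g., step difficulties multiplied by A, as noted earlier).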

A limitation of this study is that our method is designed for test situations where a single essay writing item is administered to multiple groups, each comprising different examinees and raters. Consequently, the method is not directly applicable for linking multiple tests that offer different items. Developing an extension of our approach to accommodate such test situations is one direction for future research. Another involves evaluating the effectiveness of our method using other datasets. To the best of our knowledge, there are no open datasets that include examinee essays along with scores from multiple assigned raters. Therefore, we plan to develop additional datasets and to conduct further evaluations. Further investigation of the impact of the AES model’s accuracy on linking performance is also warranted.

Availability of data and materials

The data and materials from our experiments are available at https://github.com/AI-Behaviormetrics/LinkingIRTbyAES.git . This includes all experimental results and a sample dataset.

Code availability

The source code for our linking method, developed in R and Python, is available in the same GitHub repository.

Footnotes

1. The original paper referred to this model as the generalized MFRM. However, in this paper, we refer to it as GMFM because it does not strictly belong to the family of Rasch models.

2. https://pytorch.org/

References

Abosalem, Y. (2016). Assessment techniques and students’ higher-order thinking skills. International Journal of Secondary Education, 4 (1), 1–11. https://doi.org/10.11648/j.ijsedu.20160401.11

Alikaniotis, D., Yannakoudakis, H., & Rei, M. (2016). Automatic text scoring using neural networks. Proceedings of the annual meeting of the association for computational linguistics (pp. 715–725).

Almond, R. G. (2014). Using automated essay scores as an anchor when equating constructed response writing tests. International Journal of Testing, 14 (1), 73–91. https://doi.org/10.1080/15305058.2013.816309

Amorim, E., Cançado, M., & Veloso, A. (2018). Automated essay scoring in the presence of biased ratings. Proceedings of the annual conference of the north american chapter of the association for computational linguistics (pp. 229–237).

Bernardin, H. J., Thomason, S., Buckley, M. R., & Kane, J. S. (2016). Rater rating-level bias and accuracy in performance appraisals: The impact of rater personality, performance management competence, and rater accountability. Human Resource Management, 55 (2), 321–340. https://doi.org/10.1002/hrm.21678

Dascalu, M., Westera, W., Ruseti, S., Trausan-Matu, S., & Kurvers, H. (2017). ReaderBench learns Dutch: Building a comprehensive automated essay scoring system for Dutch language. Proceedings of the international conference on artificial intelligence in education (pp. 52–63).

Dasgupta, T., Naskar, A., Dey, L., & Saha, R. (2018). Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. Proceedings of the workshop on natural language processing techniques for educational applications (pp. 93–102).

Devlin, J., Chang, M. -W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the annual conference of the north American chapter of the association for computational linguistics: Human language technologies (pp. 4171–4186).

Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2 (3), 197–221. https://doi.org/10.1207/s15434311laq0203_2

Eckes, T. (2023). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments . Peter Lang Pub. Inc.

Engelhard, G. (1997). Constructing rater and task banks for performance assessments. Journal of Outcome Measurement, 1 (1), 19–33.

Farag, Y., Yannakoudakis, H., & Briscoe, T. (2018). Neural automated essay scoring and coherence modeling for adversarially crafted input. Proceedings of the annual conference of the north American chapter of the association for computational linguistics (pp. 263–271).

Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., & Rubin, D. (2013). Bayesian data analysis (3rd ed.). Taylor & Francis.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7 (4), 457–472. https://doi.org/10.1214/ss/1177011136

Huang, J., Qu, L., Jia, R., & Zhao, B. (2019). O2U-Net: A simple noisy label detection approach for deep neural networks. Proceedings of the IEEE international conference on computer vision .

Hussein, M. A., Hassan, H. A., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5 , e208. https://doi.org/10.7717/peerj-cs.208

Ilhan, M. (2016). A comparison of the results of many-facet Rasch analyses based on crossed and judge pair designs. Educational Sciences: Theory and Practice, 16 (2), 579–601. https://doi.org/10.12738/estp.2016.2.0390

Jin, C., He, B., Hui, K., & Sun, L. (2018). TDNN: A two-stage deep neural network for prompt-independent automated essay scoring. Proceedings of the annual meeting of the association for computational linguistics (pp. 1088–1097).

Jin, K. Y., & Wang, W. C. (2018). A new facets model for rater’s centrality/extremity response style. Journal of Educational Measurement, 55 (4), 543–563. https://doi.org/10.1111/jedm.12191

Kassim, N. L. A. (2011). Judging behaviour and rater errors: An application of the many-facet Rasch model. GEMA Online Journal of Language Studies, 11 (3), 179–197.

Ke, Z., & Ng, V. (2019). Automated essay scoring: A survey of the state of the art. Proceedings of the international joint conference on artificial intelligence (pp. 6300–6308).

Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking . New York: Springer.

Leckie, G., & Baird, J. A. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48 (4), 399–418. https://doi.org/10.1111/j.1745-3984.2011.00152.x

Li, S., Ge, S., Hua, Y., Zhang, C., Wen, H., Liu, T., & Wang, W. (2020). Coupled-view deep classifier learning from multiple noisy annotators. Proceedings of the association for the advancement of artificial intelligence (vol. 34, pp. 4667–4674).

Linacre, J. M. (1989). Many-faceted Rasch measurement . MESA Press.

Linacre, J. M. (2014). A user’s guide to FACETS Rasch-model computer programs .

Liu, O. L., Frankel, L., & Roohr, K. C. (2014). Assessing critical thinking in higher education: Current state and directions for next-generation assessment. ETS Research Report Series, 2014 (1), 1–23. https://doi.org/10.1002/ets2.12009

Liu, T., Ding, W., Wang, Z., Tang, J., Huang, G. Y., & Liu, Z. (2019). Automatic short answer grading via multiway attention networks. Proceedings of the international conference on artificial intelligence in education (pp. 169–173).

Lord, F. (1980). Applications of item response theory to practical testing problems . Routledge.

Lun, J., Zhu, J., Tang, Y., & Yang, M. (2020). Multiple data augmentation strategies for improving performance on automatic short answer scoring. Proceedings of the association for the advancement of artificial intelligence (vol. 34, pp. 13389–13396).

Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14 (2), 139–160.

Mesgar, M., & Strube, M. (2018). A neural local coherence model for text quality assessment. Proceedings of the conference on empirical methods in natural language processing (pp. 4328–4339).

Mim, F. S., Inoue, N., Reisert, P., Ouchi, H., & Inui, K. (2019). Unsupervised learning of discourse-aware text representation for essay scoring. Proceedings of the annual meeting of the association for computational linguistics: Student research workshop (pp. 378–385).

Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4 (4), 386–422.

Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5 (2), 189–227.

Myford, C. M., & Wolfe, E. W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale category use. Journal of Educational Measurement, 46 (4), 371–389. https://doi.org/10.1111/j.1745-3984.2009.00088.x

Nadeem, F., Nguyen, H., Liu, Y., & Ostendorf, M. (2019). Automated essay scoring with discourse-aware neural models. Proceedings of the workshop on innovative use of NLP for building educational applications (pp. 484–493).

Nering, M. L., & Ostini, R. (2010). Handbook of polytomous item response theory models . Evanston, IL, USA: Routledge.

Nguyen, H. V., & Litman, D. J. (2018). Argument mining for improving the automated scoring of persuasive essays. Proceedings of the association for the advancement of artificial intelligence (Vol. 32).

Olgar, S. (2015). The integration of automated essay scoring systems into the equating process for mixed-format tests [Doctoral dissertation, The Florida State University].

Patz, R. J., & Junker, B. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24 (4), 342–366. https://doi.org/10.3102/10769986024004342

Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27 (4), 341–384. https://doi.org/10.3102/10769986027004341

Qiu, X. L., Chiu, M. M., Wang, W. C., & Chen, P. H. (2022). A new item response theory model for rater centrality using a hierarchical rater model approach. Behavior Research Methods, 54 , 1854–1868. https://doi.org/10.3758/s13428-021-01699-y

Ridley, R., He, L., Dai, X. Y., Huang, S., & Chen, J. (2021). Automated cross-prompt scoring of essay traits. Proceedings of the association for the advancement of artificial intelligence (vol. 35, pp. 13745–13753).

Rodriguez, P. U., Jafari, A., & Ormerod, C. M. (2019). Language models and automated essay scoring. https://doi.org/10.48550/arXiv.1909.09482 . arXiv:1909.09482

Rosen, Y., & Tager, M. (2014). Making student thinking visible through a concept map in computer-based assessment of critical thinking. Journal of Educational Computing Research, 50 (2), 249–270. https://doi.org/10.2190/EC.50.2.f

Schendel, R., & Tolmie, A. (2017). Beyond translation: adapting a performance-task-based assessment of critical thinking ability for use in Rwanda. Assessment & Evaluation in Higher Education, 42 (5), 673–689. https://doi.org/10.1080/02602938.2016.1177484

Shermis, M. D., & Burstein, J. C. (2002). Automated essay scoring: A cross-disciplinary perspective . Routledge.

Shin, H. J., Rabe-Hesketh, S., & Wilson, M. (2019). Trifactor models for Multiple-Ratings data. Multivariate Behavioral Research, 54 (3), 360–381. https://doi.org/10.1080/00273171.2018.1530091

Stan Development Team. (2018). RStan: The R interface to Stan. R package version 2.17.3.

Sung, C., Dhamecha, T. I., & Mukhi, N. (2019). Improving short answer grading using transformer-based pre-training. Proceedings of the international conference on artificial intelligence in education (pp. 469–481).

Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. Proceedings of the conference on empirical methods in natural language processing (pp. 1882–1891).

Tran, T. D. (2020). Bayesian analysis of multivariate longitudinal data using latent structures with applications to medical data. (Doctoral dissertation, KU Leuven).

Uto, M. (2021a). Accuracy of performance-test linking based on a many-facet Rasch model. Behavior Research Methods, 53 , 1440–1454. https://doi.org/10.3758/s13428-020-01498-x

Uto, M. (2021b). A review of deep-neural automated essay scoring models. Behaviormetrika, 48 , 459–484. https://doi.org/10.1007/s41237-021-00142-y

Uto, M. (2023). A Bayesian many-facet Rasch model with Markov modeling for rater severity drift. Behavior Research Methods, 55 , 3910–3928. https://doi.org/10.3758/s13428-022-01997-z

Uto, M., & Okano, M. (2021). Learning automated essay scoring models using item-response-theory-based scores to decrease effects of rater biases. IEEE Transactions on Learning Technologies, 14 (6), 763–776. https://doi.org/10.1109/TLT.2022.3145352

Uto, M., & Ueno, M. (2018). Empirical comparison of item response theory models with rater’s parameters. Heliyon, 4 (5), e00622. https://doi.org/10.1016/j.heliyon.2018.e00622

Uto, M., & Ueno, M. (2020). A generalized many-facet Rasch model and its Bayesian estimation using Hamiltonian Monte Carlo. Behaviormetrika, 47, 469–496. https://doi.org/10.1007/s41237-020-00115-7

van der Linden, W. J. (2016). Handbook of item response theory, volume two: Statistical tools . Boca Raton, FL, USA: CRC Press.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems (pp. 5998–6008).

Wang, Y., Wei, Z., Zhou, Y., & Huang, X. (2018). Automatic essay scoring incorporating rating schema via reinforcement learning. Proceedings of the conference on empirical methods in natural language processing (pp. 791–797).

Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11 , 3571–3594. https://doi.org/10.48550/arXiv.1004.2316

Watanabe, S. (2013). A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14 (1), 867–897. https://doi.org/10.48550/arXiv.1208.6338

Wilson, M., & Hoskens, M. (2001). The rater bundle model. Journal of Educational and Behavioral Statistics, 26 (3), 283–306. https://doi.org/10.3102/10769986026003283

Wind, S. A., & Guo, W. (2019). Exploring the combined effects of rater misfit and differential rater functioning in performance assessments. Educational and Psychological Measurement, 79 (5), 962–987. https://doi.org/10.1177/0013164419834613

Wind, S. A., & Jones, E. (2019). The effects of incomplete rating designs in combination with rater effects. Journal of Educational Measurement, 56 (1), 76–100. https://doi.org/10.1111/jedm.12201

Wind, S. A., Wolfe, E. W., Engelhard, G., Jr., Foltz, P., & Rosenstein, M. (2018). The influence of rater effects in training sets on the psychometric quality of automated scoring for writing assessments. International Journal of Testing, 18 (1), 27–49. https://doi.org/10.1080/15305058.2017.1361426

Zitzmann, S., & Hecht, M. (2019). Going beyond convergence in Bayesian estimation: Why precision matters too and how to assess it. Structural Equation Modeling: A Multidisciplinary Journal, 26 (4), 646–661. https://doi.org/10.1080/10705511.2018.1545232

Funding

This work was supported by Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers 19H05663, 21H00898, and 23K17585.

Author information

Authors and affiliations

The University of Electro-Communications, Tokyo, Japan

Masaki Uto & Kota Aramaki

Corresponding author

Correspondence to Masaki Uto .

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Ethics approval

Not applicable

Consent to participate

Consent for publication

All authors agreed to publish the article.

Open Practices Statement

All results presented from our experiments for all models, including MFRM, MFRM with RSS, and GMFM, as well as the results for each repetition, are available for download at https://github.com/AI-Behaviormetrics/LinkingIRTbyAES.git . This repository also includes programs for performing our linking method, along with a sample dataset. These programs were developed using R and Python, along with RStan and PyTorch. Please refer to the README file for information on program usage and data format details.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Data splitting procedures

In this appendix, we explain the detailed procedures used to construct the reference group and the focal group while aiming to ensure distinct distributions of examinee abilities and rater severities, as outlined in experimental Procedure 2 in the Experimental procedures section.

Let \(\mu ^{\text {all}}_\theta \) and \(\sigma ^{\text {all}}_\theta \) be the mean and SD of the examinees’ abilities estimated from the entire dataset in Procedure 1 of the Experimental procedures section. Similarly, let \(\mu ^{\text {all}}_\beta \) and \(\sigma ^{\text {all}}_\beta \) be the mean and SD of the rater severity parameter estimated from the entire dataset. Using these values, we set target mean and SD values of abilities and severities for both the reference and focal groups. Specifically, let \(\acute{\mu }^{\text {ref}}_{\theta }\) and \(\acute{\sigma }^{\text {ref}}_{\theta }\) denote the target mean and SD for the abilities of examinees in the reference group, and \(\acute{\mu }^{\text {ref}}_{\beta }\) and \(\acute{\sigma }^{\text {ref}}_{\beta }\) be those for the rater severities in the reference group. Similarly, let \(\acute{\mu }^{\text {foc}}_{\theta }\) , \(\acute{\sigma }^{\text {foc}}_{\theta }\) , \(\acute{\mu }^{\text {foc}}_{\beta }\) , and \(\acute{\sigma }^{\text {foc}}_{\beta }\) represent the target mean and SD for the examinee abilities and rater severities in the focal group. Each of the eight conditions in Table 1 uses these target values, as summarized in Table  14 .

Given these target means and SDs, we constructed the reference and focal groups for each condition through the following procedure.

1. Prepare the entire set of examinees and raters along with their ability and severity estimates. Specifically, let \(\hat{\varvec{\theta }}\) and \(\hat{\varvec{\beta }}\) be the collections of ability and severity estimates, respectively.

2. Randomly sample a value from the normal distribution \(N(\acute{\mu }^{\text {ref}}_\theta , \acute{\sigma }^{\text {ref}}_\theta )\), and choose the examinee with \(\hat{\theta }_j \in \hat{\varvec{\theta }}\) nearest to the sampled value. Add this examinee to the reference group, and remove them from the remaining pool of examinee candidates \(\hat{\varvec{\theta }}\).

3. Similarly, randomly sample a value from \(N(\acute{\mu }^{\text {ref}}_\beta , \acute{\sigma }^{\text {ref}}_\beta )\), and choose the rater with \(\hat{\beta }_r \in \hat{\varvec{\beta }}\) nearest to the sampled value. Add this rater to the reference group, and remove them from the remaining pool of rater candidates \(\hat{\varvec{\beta }}\).

4. Repeat Steps 2 and 3 for the focal group, using \(N(\acute{\mu }^{\text {foc}}_\theta , \acute{\sigma }^{\text {foc}}_\theta )\) and \(N(\acute{\mu }^{\text {foc}}_\beta , \acute{\sigma }^{\text {foc}}_\beta )\) as the sampling distributions.

5. Continue to repeat Steps 2, 3, and 4 until the pools \(\hat{\varvec{\theta }}\) and \(\hat{\varvec{\beta }}\) are empty.

6. Given the examinees and raters in each group, create the data for the reference group \(\textbf{U}^{\text {ref}}\) and the focal group \(\textbf{U}^{\text {foc}}\).

7. Remove examinees, together with their data, from each group if they received scores from only one rater, thereby ensuring that each examinee is graded by at least two raters.
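A Python sketch of Steps 2 through 5 is given below. The estimate arrays theta_hat and beta_hat and the target means and SDs (from Table 14) are assumed inputs, and the construction of the data matrices in Steps 6 and 7 is omitted.

```python
# Sketch of the group-construction procedure in Appendix A (Steps 2-5).
# theta_hat and beta_hat are the ability and severity estimates from the entire
# dataset; the target means/SDs come from Table 14. All inputs are assumed.
import numpy as np

def build_groups(theta_hat, beta_hat, targets, seed=0):
    """targets = {'ref': (mu_th, sd_th, mu_be, sd_be), 'foc': (mu_th, sd_th, mu_be, sd_be)}."""
    rng = np.random.default_rng(seed)
    theta_pool = list(enumerate(theta_hat))  # (examinee index, ability estimate)
    beta_pool = list(enumerate(beta_hat))    # (rater index, severity estimate)
    groups = {"ref": {"examinees": [], "raters": []},
              "foc": {"examinees": [], "raters": []}}

    def draw_nearest(pool, mu, sd):
        # Sample a value from N(mu, sd) and pop the candidate nearest to it.
        target = rng.normal(mu, sd)
        k = min(range(len(pool)), key=lambda i: abs(pool[i][1] - target))
        return pool.pop(k)[0]

    # Alternate between the reference and focal groups until both pools are empty.
    while theta_pool or beta_pool:
        for g in ("ref", "foc"):
            mu_th, sd_th, mu_be, sd_be = targets[g]
            if theta_pool:
                groups[g]["examinees"].append(draw_nearest(theta_pool, mu_th, sd_th))
            if beta_pool:
                groups[g]["raters"].append(draw_nearest(beta_pool, mu_be, sd_be))
    return groups
```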

Appendix B: Experimental results for MFRM and MFRM with RSS

The experiments discussed in the main text focus on the results obtained from GMFM, as this model demonstrated the best fit to the dataset. However, it is important to note that our linking method is not restricted to GMFM and can also be applied to other models, including MFRM and MFRM with RSS. Experiments involving these models were carried out in the manner described in the Experimental procedures section, and the results are shown in Tables  15 and 16 . These tables reveal trends similar to those observed for GMFM, validating the effectiveness of our linking method under the MFRM and MFRM with RSS as well.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Uto, M., Aramaki, K. Linking essay-writing tests using many-facet models and neural automated essay scoring. Behav Res (2024). https://doi.org/10.3758/s13428-024-02485-2

Accepted: 26 July 2024

Published: 20 August 2024

DOI: https://doi.org/10.3758/s13428-024-02485-2


Keywords

  • Writing assessment
  • Many-facet Rasch models
  • IRT linking
  • Automated essay scoring
  • Educational measurement
