
Title: On the Use of BERT for Automated Essay Scoring: Joint Learning of Multi-Scale Essay Representation

Abstract: In recent years, pre-trained models have become dominant in most natural language processing (NLP) tasks. However, in the area of Automated Essay Scoring (AES), pre-trained models such as BERT have not been properly used to outperform other deep learning models such as LSTM. In this paper, we introduce a novel multi-scale essay representation for BERT that can be jointly learned. We also employ multiple losses and transfer learning from out-of-domain essays to further improve performance. Experimental results show that our approach benefits greatly from joint learning of the multi-scale essay representation and obtains nearly state-of-the-art results among deep learning models on the ASAP task. Our multi-scale essay representation also generalizes well to the CommonLit Readability Prize dataset, which suggests that the novel text representation proposed in this paper may be a new and effective choice for long-text tasks.


A review of deep-neural automated essay scoring models

  • Review Paper
  • Open access
  • Published: 20 July 2021
  • Volume 48, pages 459–484 (2021)


  • Masaki Uto (ORCID: orcid.org/0000-0002-9330-5158)


Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to grading by humans. Although traditional AES models typically rely on manually designed features, deep neural network (DNN)-based AES models that obviate the need for feature engineering have recently attracted increased attention. Various DNN-AES models with different characteristics have been proposed over the past few years. To our knowledge, however, no study has provided a comprehensive review of DNN-AES models while introducing each model in detail. Therefore, this review presents a comprehensive survey of DNN-AES models, describing the main idea and detailed architecture of each model. We classify the AES task into four types and introduce existing DNN-AES models according to this classification.


1 Introduction

Essay-writing tests have attracted much attention as a means of measuring practical and higher-order abilities such as logical thinking, critical reasoning, and creative thinking in various assessment fields (Abosalem 2016 ; Bernardin et al. 2016 ; Liu et al. 2014 ; Rosen and Tager 2014 ; Schendel and Tolmie 2017 ). In essay-writing tests, examinees write an essay about a given topic, and human raters grade those essays. However, essay grading is an expensive and time-consuming process, especially when there are many examinees (Hussein et al. 2019 ; Ke and Ng 2019 ). In addition, grading by human raters is not always consistent among and within raters (Eckes 2015 ; Hua and Wind 2019 ; Kassim 2011 ; Myford and Wolfe 2003 ; Rahman et al. 2017 ; Uto and Ueno 2018a ). One approach to resolving this problem is automated essay scoring (AES), which utilizes natural language processing (NLP) and machine learning techniques to automatically grade essays.

Many AES models have been developed over recent decades, and these can generally be classified as feature-engineering or automatic feature extraction approaches (Hussein et al. 2019 ; Ke and Ng 2019 ).

AES models based on the feature-engineering approach predict scores using textual features that are manually designed by human experts (e.g., Dascalu et al. 2017 ; Mark and Shermis 2016 ; Nguyen and Litman 2018 ). Typical features include essay length and the number of grammatical and spelling errors. The AES model first calculates these types of textual features from a target essay, then inputs the feature vector into a regression or classification model and outputs a score. Various models based on this approach have long been proposed (e.g., Nguyen and Litman 2018 ; Attali and Burstein 2006 ; Phandi et al. 2015 ; Beigman Klebanov et al. 2016 ; Cozma et al. 2018 ). For example, e-rater (Attali and Burstein 2006 ) is a representative model that was developed and has been used by the Educational Testing Service. Another recent popular model is the Enhanced AI Scoring Engine (Phandi et al. 2015 ), which achieved high performance in the Automated Student Assessment Prize (ASAP) competition run by Kaggle.

The advantages of feature-engineering approach models include interpretability and explainability. However, this approach generally requires extensive effort in engineering and tuning features to achieve high scoring accuracy for a target collection of essays. To obviate the need for feature engineering, automatic feature extraction approach models based on deep neural networks (DNNs) have recently attracted attention. Many DNN-AES models have been proposed over the last five years and have achieved state-of-the-art accuracy (e.g., Alikaniotis et al. 2016 ; Taghipour and Ng 2016 ; Dasgupta et al. 2018 ; Farag et al. 2018 ; Jin et al. 2018 ; Mesgar and Strube 2018 ; Wang et al. 2018 ; Mim et al. 2019 ; Nadeem et al. 2019 ; Uto et al. 2020 ; Ridley et al. 2021 ). The purpose of this paper is to review these DNN-AES models.

Several recent studies have reviewed AES models (Ke and Ng 2019 ; Hussein et al. 2019 ; Borade and Netak 2021 ). For example, Ke and Ng ( 2019 ) reviewed various AES models, including both feature-engineering approach models and DNN-AES models. However, because the purpose of their study was to present an overview of major milestones reached in AES research since its inception, they provided only a short summary of each DNN-AES model. Another review (Hussein et al. 2019 ) explained some DNN-AES models in detail, but only a few models were introduced. Borade and Netak ( 2021 ) also reviewed AES models, but they focused on feature-engineering approach models.

To our knowledge, no study has provided a comprehensive review of DNN-AES models while introducing each model in detail. Therefore, this review presents a comprehensive survey of DNN-AES models, describing the main idea and detailed architecture of each model. We classify AES tasks into four types according to recent findings (Li et al. 2020 ; Ridley et al. 2021 ), and introduce existing DNN-AES models according to this classification.

2 Automated essay scoring tasks

AES tasks are generally classified into the following four types (Li et al. 2020 ; Ridley et al. 2021 ).

Prompt-specific holistic scoring This is the most common AES task type, whereby an AES model is trained using rated essays that have holistic scores and have been written for a prompt. This trained model is used to predict the scores of essays written for the same prompt. Note that a prompt refers to an essay topic or a writing task that generally consists of reading materials and a task instruction.

Prompt-specific trait scoring This task involves predicting multiple trait-specific scores for each essay in a prompt-specific setting in which essays used for model training and unrated target essays are written for the same prompt. Such scoring is often required when an analytic rubric is used to provide more detailed feedback for educational purposes.

Cross-prompt holistic scoring In this task, an AES model is trained using rated essays with holistic scores written for non-target prompts and the trained model is transferred to a target prompt. This task has recently attracted attention because it is difficult to obtain a sufficient number of rated essays written for a target prompt in practice. This task includes a zero-shot setting in which rated essays written for a target prompt do not exist, and another setting in which a relatively small number of rated essays written for a target prompt can be used. The cross-prompt AES task relates to domain adaptation and transfer learning tasks, which are widely studied in machine learning fields.

Cross-prompt trait scoring This task involves predicting multiple trait-specific scores for each essay in a cross-prompt setting in which essays written for non-target prompts are used to train an AES model.

In the following section, we review representative DNN-AES models for each task type. Table  1 summarizes the models introduced in this paper.

3 Prompt-specific holistic scoring

This section introduces DNN-AES models for prompt-specific holistic scoring.

3.1 RNN-based model

One of the first DNN-AES models was a recurrent neural network (RNN)-based model proposed by Taghipour and Ng ( 2016 ). This model predicts a score for a given essay, defined as a sequence of words, using the following multi-layered neural network, whose architecture is shown in Fig.  1 .

Fig. 1 Architecture of the RNN-based model

Lookup table layer This layer transforms each word in a given essay into a G -dimensional word-embedding representation. A word-embedding representation is a real-valued fixed-length vector of a word, in which words with similar meaning have similar vectors. Suppose \({{\mathcal {V}}}\) is a vocabulary list for the essay collection, \(\varvec{w}_{t}\) represents a \(|{{\mathcal {V}}}|\) -dimensional one-hot representation of the t -th word \(w_{t}\) in a given essay, and \(\varvec{A}\) represents a \(G \times |{{\mathcal {V}}}|\) -dimensional trainable embedding matrix. Then, the embedding representation \(\tilde{\varvec{w}}_t\) corresponding to \(w_{t}\) is calculable as the matrix-vector product \(\tilde{\varvec{w}}_t = \varvec{A}\cdot \varvec{w}_{t}\) .

Convolution layer This layer captures local textual dependencies using convolutional neural networks (CNNs) from a sequence of word-embedding vectors. Given an input sequence \(\{\tilde{\varvec{w}}_1,\tilde{\varvec{w}}_2, \ldots , \tilde{\varvec{w}}_L\}\) (where L is the number of words in a given essay), this layer is applied to a window of c words to capture local textual dependencies among c -gram words. Concretely, the t -th output of this layer is calculable as follows.

\(\varvec{c}_{t} = \mathbf {W_c} \cdot [\tilde{\varvec{w}}_{t}, \ldots , \tilde{\varvec{w}}_{t+c-1}] + b_c ,\)

where \(\mathbf {W_c}\) and \(b_c\) are trainable weight and bias parameters, and \([\cdot , \cdot ]\) means the concatenation of the given elements. Zero padding is applied so that the input and output sequence lengths are preserved. This is an optional layer that has often been omitted in recent studies.

Recurrent layer This layer generally uses a long short-term memory (LSTM) network, a representative RNN, that outputs a vector at each timestep while capturing time series dependencies in an input sequence. A single-layer unidirectional LSTM is generally used, but bidirectional or multilayered LSTMs are also often used.

Pooling layer This layer transforms the output hidden vector sequence of the recurrent layer \(\{ \varvec{h}_{1}, \varvec{h}_{2}, \ldots ,\varvec{h}_{L}\}\) (where \(\varvec{h}_{t}\) represents the hidden vector of the t -th output of the recurrent layer) into an aggregated fixed-length hidden vector. Mean-over-time pooling, which calculates an average vector

\(\tilde{\varvec{h}} = \frac{1}{L}\sum _{t=1}^{L} \varvec{h}_{t} ,\)

is generally used because it tends to provide stable accuracy. Other frequently used pooling methods include the last pool (Alikaniotis et al. 2016 ), which uses the last output of the recurrent layer \(\varvec{h}_{L}\) , and an attention pooling layer (Dong et al. 2017 ), which we explain later in the present study.

Linear layer with sigmoid activation This layer projects a pooling layer output onto a scalar value in the range [0, 1] by utilizing the sigmoid function as

\(y = \sigma (\mathbf{W}_o \cdot \tilde{\varvec{h}} + b_o),\)

where \(\mathbf{W}_o\) is a weight matrix, \(b_o\) is a bias parameter, and \(\sigma ()\) represents the sigmoid function.

For model training, the mean-squared error (MSE) between predicted and gold-standard scores is generally used as the loss function. Specifically, letting \(y_{n}\) be the gold-standard score for the n -th essay and letting \({\hat{y}}_{n}\) be the predicted score, the MSE loss function is defined as

\(\mathcal {L} = \frac{1}{N}\sum _{n=1}^{N} \left( y_{n} - {\hat{y}}_{n} \right) ^2 ,\)

where N is the number of essays. Note that model training is conducted after normalizing the gold-standard scores to [0, 1], and the predicted scores are linearly rescaled to the original score range in the prediction phase.
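To make the above pipeline concrete, the following is a minimal PyTorch sketch of an RNN-based scorer of this type (embedding, convolution, LSTM, mean-over-time pooling, and a sigmoid output trained with MSE). The hyperparameters and the single unidirectional LSTM are illustrative assumptions, not the configuration used by Taghipour and Ng ( 2016 ).

```python
import torch
import torch.nn as nn

class RNNEssayScorer(nn.Module):
    """Minimal RNN-based AES model: embedding -> CNN -> LSTM -> mean pooling -> sigmoid."""

    def __init__(self, vocab_size, emb_dim=50, conv_filters=100, window=3, hidden=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Convolution over word windows (optional layer); padding preserves sequence length.
        self.conv = nn.Conv1d(emb_dim, conv_filters, kernel_size=window, padding=window // 2)
        self.lstm = nn.LSTM(conv_filters, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, word_ids):                          # word_ids: (batch, seq_len)
        x = self.embed(word_ids)                          # (batch, seq_len, emb_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, seq_len, conv_filters)
        h, _ = self.lstm(x)                               # (batch, seq_len, hidden)
        pooled = h.mean(dim=1)                            # mean-over-time pooling
        return torch.sigmoid(self.out(pooled)).squeeze(-1)  # score in [0, 1]

# Training uses MSE against gold scores normalized to [0, 1].
model = RNNEssayScorer(vocab_size=4000)
scores = model(torch.randint(1, 4000, (8, 120)))          # 8 essays of 120 tokens
loss = nn.MSELoss()(scores, torch.rand(8))
loss.backward()
```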

3.2 RNN-based model with score-specific word embedding

Alikaniotis et al. ( 2016 ) also proposed a similar RNN-based model consisting of three layers, namely, a lookup table layer, a recurrent layer, and a pooling layer. The model uses a bidirectional LSTM for the recurrent layer and the last pooling for the pooling layer. The unique feature of this model is the use of score-specific word embedding (SSWE), which is an extension of Collobert & Weston (C&W) word-embedding (Collobert and Weston 2008 ), in the lookup table layer.

Suppose we train a representation for a target word \(w_t\) within a sequence of one-hot encoded words \(\varvec{S}=\{ \varvec{w}_1, \ldots , \varvec{w}_t, \ldots \varvec{w}_L\}\) . To derive this representation, the C&W word-embedding model learns to distinguish between the original sequence \(\varvec{S}\) and an artificially created noisy sequence \(\varvec{S}'\) in which the target word is replaced with a randomly selected word. Given a trainable embedding matrix \(\varvec{A}\) , the model concatenates the embedding representation vectors of the words in the sequence, that is, \(\tilde{\varvec{S}} = [\varvec{A} \cdot \varvec{w}_1, \varvec{A} \cdot \varvec{w}_2, \ldots , \varvec{A} \cdot \varvec{w}_L]\) . Using this vector, the C&W word-embedding model predicts whether the given word sequence \(\varvec{S}\) is the original sequence or a noisy one based on the following function.

\(f(\varvec{S}) = \mathbf{W}_2 \cdot \mathrm {htanh}(\mathbf{W}_1 \cdot \tilde{\varvec{S}} + b_1) + b_2 ,\)

where \(\mathbf{W}_1\) , \(\mathbf{W}_2\) , \(b_1\) , and \(b_2\) are the trainable parameters, and \(\mathrm {htanh}()\) is the hard hyperbolic tangent function.

The SSWE model extends the C&W word-embedding model by adding another output layer that predicts essay scores as follows.

\(f_{score}(\varvec{S}) = \mathbf{W}'_2 \cdot \mathrm {htanh}(\mathbf{W}'_1 \cdot \tilde{\varvec{S}} + b'_1) + b'_2 ,\)

where \(\mathbf{W}'_1\) , \(\mathbf{W}'_2\) , \(b'_1\) , and \(b'_2\) are the trainable parameters. The SSWE model is trained while minimizing a weighted linear combination of two loss functions, namely, a classification loss function based on Eq. ( 5 ) and a scoring error loss function based on Eq. ( 6 ).

The SSWE model provides a more effective word-embedding representation to distinguish essay qualities than does the C&W word-embedding model. Thus, Alikaniotis et al. ( 2016 ) proposed using the embedding matrix \(\varvec{A}\) trained by the SSWE model in the lookup table layer.
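To illustrate the joint objective, the sketch below pairs a shared embedding and hidden layer with two heads, one for real-versus-noisy classification and one for score prediction, and minimizes a weighted sum of the two losses. The window size, loss weights, and the hinge-style classification loss are assumptions made for illustration rather than the exact SSWE configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSWE(nn.Module):
    """Sketch of score-specific word-embedding training with two output heads."""

    def __init__(self, vocab_size, emb_dim=50, window=5, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden)
        self.cls_head = nn.Linear(hidden, 1)     # real vs. noisy window
        self.score_head = nn.Linear(hidden, 1)   # essay score attached to the window

    def forward(self, window_ids):                        # (batch, window)
        x = self.embed(window_ids).flatten(start_dim=1)   # concatenated window embeddings
        h = F.hardtanh(self.hidden(x))
        return self.cls_head(h).squeeze(-1), self.score_head(h).squeeze(-1)

model = SSWE(vocab_size=4000)
real_logit, real_score = model(torch.randint(0, 4000, (32, 5)))
noisy_logit, _ = model(torch.randint(0, 4000, (32, 5)))   # windows with a corrupted center word
essay_scores = torch.rand(32)

# Hinge-style ranking loss for real vs. noisy plus MSE on the essay score, weighted by alpha.
alpha = 0.5
cls_loss = torch.clamp(1.0 - real_logit + noisy_logit, min=0.0).mean()
score_loss = nn.MSELoss()(real_score, essay_scores)
loss = alpha * cls_loss + (1 - alpha) * score_loss
loss.backward()
```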

3.3 Hierarchical representation models

Fig. 2 Architecture of the hierarchical representation model

The models introduced above handle an essay as a linear sequence of words. Dong and Zhang ( 2016 ), however, proposed modeling the hierarchical structure of a text. Concretely, they assumed that an essay is constructed as a sequence of sentences defined as word sequences. Accordingly, they introduced a two-level hierarchical representation model consisting of a word-level CNN and a sentence-level CNN, as shown in Fig.  2 . Each CNN works as explained below.

Word-level CNN The sequence of words in each sentence is processed and an aggregated vector is output, which can be taken as an embedding representation of a sentence. Suppose an essay consists of I sentences \(\{\varvec{s}_1, \ldots , \varvec{s}_I\}\) , and each sentence is defined as a sequence of words as \(\varvec{s}_i = \{w_{i1}, \ldots , w_{iL_i}\}\) (where \(w_{it}\) is the t -th word in i -th sentence, and \(L_i\) is the number of words in i -th sentence). For each sentence \(\varvec{s}_i\) , the lookup table layer transforms each word into an embedding representation, and then the word-level CNN processes the sequence of word-embedding vectors. The operation of the word-level CNN is the same as that of the convolution layer explained in Subsection  3.1 . The output sequence of the word-level CNN is transformed into an aggregated fixed-length hidden vector \(\tilde{\varvec{h}}_{s_i}\) through a pooling layer.

Sentence-level CNN This CNN takes the sequence of sentence vectors \(\{\tilde{\varvec{h}}_{s_1}, \ldots , \tilde{\varvec{h}}_{s_I}\}\) as input and extracts n-gram level features over the sentence sequence. Then, a pooling layer transforms the CNN output sequence into an aggregated fixed-length hidden vector \(\tilde{\varvec{h}}\) . Finally, the linear layer with sigmoid activation maps vector \(\tilde{\varvec{h}}\) to a score.

Dong et al. ( 2017 ) proposed another hierarchical representation model that extends the above model by using an attention mechanism (Bahdanau et al. 2014 ) to automatically identify important words and sentences. The attention mechanism is a neural architecture that enables a model to dynamically focus on relevant regions of the input data when making predictions. The main idea of the attention mechanism is to compute a weight distribution over the input data, assigning higher values to more relevant regions. Dong et al. ( 2017 ) use attention-based pooling in the pooling layers. Letting the input sequence for the pooling layer be \(\{\varvec{x}_1, \ldots , \varvec{x}_J\}\) , where J indicates the sequence length, the attention mechanism aggregates the input sequence into a fixed-length vector \(\tilde{\varvec{x}}\) by performing the following operations.

\(\tilde{\varvec{x}}_j = \tanh (\mathbf{W}_{a_1} \cdot \varvec{x}_j + b),\)

\(a_j = \frac{\exp (\mathbf{W}_{a_2} \cdot \tilde{\varvec{x}}_j)}{\sum _{j'=1}^{J} \exp (\mathbf{W}_{a_2} \cdot \tilde{\varvec{x}}_{j'})},\)

\(\tilde{\varvec{x}} = \sum _{j=1}^{J} a_j \varvec{x}_j .\)

In these equations, \(\mathbf{W}_{a_1}\) , \(\mathbf{W}_{a_2}\) , and b are trainable parameters. \(\tilde{\varvec{x}}_j\) and \(a_j\) are called the attention vector and the attention weight for the j -th input, respectively.

In addition to the incorporation of the attention mechanism, Dong et al. ( 2017 ) proposed adding a character-level CNN before the word-level CNN and using LSTM as an alternative to the sentence-level CNN.
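A minimal PyTorch version of the attention pooling operations above might look as follows; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention-based pooling: aggregate a sequence of vectors into one fixed-length vector."""

    def __init__(self, dim, attn_dim=100):
        super().__init__()
        self.proj = nn.Linear(dim, attn_dim)             # W_a1 and b
        self.score = nn.Linear(attn_dim, 1, bias=False)  # W_a2

    def forward(self, x):                                # x: (batch, seq_len, dim)
        attn_vec = torch.tanh(self.proj(x))              # attention vectors
        weights = torch.softmax(self.score(attn_vec), dim=1)  # attention weights a_j
        return (weights * x).sum(dim=1)                  # weighted sum of the inputs

pooling = AttentionPooling(dim=300)
sentence_vecs = torch.randn(8, 20, 300)                  # e.g., 20 sentence representations per essay
essay_vec = pooling(sentence_vecs)                       # (8, 300)
```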

3.4 Coherence modeling

Coherence is an important criterion for evaluating the quality of essays. However, the RNN-based models introduced above are known to have difficulty capturing the relationships between multiple regions in an essay because they compress a word sequence into a fixed-length hidden vector in the order in which words are input. To resolve this difficulty, several DNN-AES models that consider coherence features have been proposed (Tay et al. 2018 ; Li et al. 2018 ; Farag et al. 2018 ; Mesgar and Strube 2018 ; Yang and Zhong 2021 ). This subsection introduces two representative models.

3.4.1 SKIPFLOW model

Fig. 3 Architecture of the SKIPFLOW model

Tay et al. ( 2018 ) proposed SKIPFLOW, which learns coherence features explicitly using a neural network architecture. The model is the RNN-based model augmented with a neural tensor layer, as shown in Fig.  3 . The neural tensor layer takes two positional outputs of the recurrent layer that are collected from different time steps as input and computes the similarity between each of these pairs of positional outputs. Concretely, for a recurrent layer output sequence \(\{ \varvec{h}_{1}, \varvec{h}_{2}, \ldots ,\varvec{h}_{L}\}\) , the model first selects pairs of sequential outputs of width \(\delta\) , that is, \(\{(\varvec{h}_1,\varvec{h}_\delta ), (\varvec{h}_{\delta +1},\varvec{h}_{2\delta }), \ldots , (\varvec{h}_{t\delta +1}, \varvec{h}_{(t+1)\delta }), \ldots \}\) . Then, each pair of hidden vectors \((\varvec{h}_{t\delta +1}, \varvec{h}_{(t+1)\delta })\) is input into the following neural tensor layer to return a similarity score as

\(sim(\varvec{h}_{t\delta +1}, \varvec{h}_{(t+1)\delta }) = \sigma \left( \mathbf{W}_u \cdot \tanh \left( \varvec{h}_{t\delta +1}^{\top } \mathbf{M} \varvec{h}_{(t+1)\delta } + \mathbf{V} [\varvec{h}_{t\delta +1}, \varvec{h}_{(t+1)\delta }] + \mathbf{b}_u \right) \right) ,\)

where \(\mathbf{W}_u\) , \(\mathbf{V}\) , and \(\mathbf{b}_u\) are the weight and bias vectors and \(\mathbf{M}\) is a three-dimensional tensor. These are trainable parameters.

The similarity scores for all the pairs are concatenated with the pooling layer output vector \(\tilde{\varvec{h}}\) , and the resulting vector is mapped to a score through a fully connected neural network layer and a linear layer with sigmoid activation.
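For intuition, the following is a sketch of a neural tensor similarity layer in the spirit described above, combining a bilinear tensor term with a linear term over the concatenated pair; the exact parameterization used in SKIPFLOW may differ.

```python
import torch
import torch.nn as nn

class NeuralTensorSimilarity(nn.Module):
    """Similarity between two hidden states via a bilinear tensor plus a linear term."""

    def __init__(self, dim, slices=4):
        super().__init__()
        self.M = nn.Parameter(torch.randn(slices, dim, dim) * 0.01)  # 3-D tensor
        self.V = nn.Linear(2 * dim, slices, bias=True)               # linear term with bias
        self.u = nn.Linear(slices, 1, bias=False)                    # output weight vector

    def forward(self, h_a, h_b):                                     # h_a, h_b: (batch, dim)
        bilinear = torch.einsum('bd,kde,be->bk', h_a, self.M, h_b)   # (batch, slices)
        linear = self.V(torch.cat([h_a, h_b], dim=-1))
        return torch.sigmoid(self.u(torch.tanh(bilinear + linear))).squeeze(-1)

sim = NeuralTensorSimilarity(dim=300)
score = sim(torch.randn(8, 300), torch.randn(8, 300))                # one similarity per pair in the batch
```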

3.4.2 Self-attention-based model

Li et al. ( 2018 ) proposed another model using a self-attention mechanism to capture relationships between multiple points in an essay. Self-attention mechanisms have been shown to be able to capture long-distance relationships between words in a sequence and have recently been used in various NLP tasks.

Fig. 4 Architecture of the self-attention-based model

Figure  4 shows the model architecture. This model first transforms each word into an embedding representation through a lookup table layer with a position encoding, and then inputs the sequence into a multi-head self-attention model that combines multiple self-attention models in parallel. See Vaswani et al. ( 2017 ) for details of the lookup table layer with the position encoding and the multi-head self-attention architecture. The self-attention output sequence is input into a recurrent layer, a pooling layer, and a linear layer with sigmoid activation to produce an essay score.

3.5 BERT-based models

Bidirectional encoder representations from transformers (BERT), a pre-trained language model released by the Google AI Language team in 2018, has achieved state-of-the-art results in various NLP tasks (Devlin et al. 2019 ). Since then, BERT has also been applied to automated text scoring tasks, including AES (Nadeem et al. 2019 ; Uto et al. 2020 ; Rodriguez et al. 2019 ; Yang et al. 2020 ; Mayfield and Black 2020 ) and automated short-answer grading (Liu et al. 2019 ; Lun et al. 2020 ; Sung et al. 2019 ), and has shown good performance.

BERT is defined as a multilayer bidirectional transformer network (Vaswani et al. 2017 ). Transformers are a neural network architecture designed to handle ordered sequences of data using an attention mechanism. Specifically, transformers consist of multiple layers (called transformer blocks), each containing a multi-head self-attention network and a position-wise fully connected feed-forward network. See (Vaswani et al. 2017 ) for details of this architecture.

Fig. 5 Architecture of the BERT-based model

Fig. 6 Architecture of the BERT-based model with ranking task

BERT is trained in pre-training and fine-tuning steps. Pre-training is conducted on huge amounts of unlabeled text data over two tasks, namely, masked language modeling and next-sentence prediction. Masked language modeling is the task of predicting the identities of words that have been masked out of the input text. Next-sentence prediction is the task of predicting whether two given sentences are adjacent.

Using pre-trained BERT for a target NLP task, such as AES, requires fine-tuning (retraining), which is conducted on a task-specific supervised dataset after initializing the model parameters to the pre-trained values. When using BERT for AES, input essays require preprocessing, namely, adding a special token (“[CLS]”) to the beginning of each input. The BERT output corresponding to this token is used as the aggregate hidden representation for a given essay (Devlin et al. 2019 ). We can thus score an essay by inputting this representation into a linear layer with sigmoid activation, as illustrated in Fig.  5 .

Furthermore, Yang et al. ( 2020 ) proposed fine-tuning the BERT model so that the essay scoring task and an essay ranking task are jointly resolved. As shown in Fig.  6 , the proposed model is formulated as a BERT-based AES model with an additional output layer that predicts essay ranks. The model uses ListNet (Cao et al. 2007 ) for predicting the ranking list. This model is fine-tuned by minimizing a combination of the scoring MSE loss function and a ranking error loss function based on ListNet.
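The sketch below shows one common way to fine-tune a pre-trained BERT model for AES with the Hugging Face transformers library: the hidden state of the [CLS] token is fed to a linear layer with sigmoid activation and trained with MSE. The model name, truncation length, and regression head are illustrative assumptions, and a ranking loss could be added to the same objective as in the model above.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertEssayScorer(nn.Module):
    """BERT encoder with a sigmoid regression head on the [CLS] representation."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.out = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        cls_vec = hidden[:, 0]                       # representation of the [CLS] token
        return torch.sigmoid(self.out(cls_vec)).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["An example essay ...", "Another essay ..."],
                  padding=True, truncation=True, max_length=512, return_tensors="pt")
model = BertEssayScorer()
pred = model(batch["input_ids"], batch["attention_mask"])
loss = nn.MSELoss()(pred, torch.tensor([0.8, 0.4]))   # gold scores normalized to [0, 1]
loss.backward()
```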

3.6 Hybrid models

Fig. 7 Architecture of the hybrid model with an additional RNN for sentence-level features

Fig. 8 Architecture of the DNN-AES model with handcrafted essay-level features

The feature-engineering approach and the DNN-AES approach can be viewed as complementary rather than competing approaches (Ke and Ng 2019 ; Uto et al. 2020 ) because they provide different advantages. To receive both benefits, some hybrid models that integrate the two approaches have been proposed (Dasgupta et al. 2018 ; Uto et al. 2020 ).

One such hybrid model was proposed by Dasgupta et al. ( 2018 ). Figure  7 shows the model architecture. As shown in the figure, it mainly consists of two DNNs. One processes word sequences in a given essay in the same way as the conventional RNN-based model (Taghipour and Ng 2016 ). Specifically, a word sequence is transformed into a fixed-length hidden vector \(\tilde{\varvec{h}}\) through a lookup table layer, a convolution layer, a recurrent layer, and a pooling layer. The other DNN processes a sequence of manually designed sentence-level features. Letting a given essay have I sentences, and letting \(\varvec{f}_{i}\) be a manually designed sentence-level feature vector for the i -th sentence, the feature sequence \(\{\varvec{f}_{1},\varvec{f}_{2},\ldots ,\varvec{f}_{I}\}\) is transformed into a fixed-length hidden vector \(\tilde{\varvec{h}}_f\) through a convolution layer, a recurrent layer, and a pooling layer. The model uses LSTM for the recurrent layer and attention pooling for the pooling layer. Finally, after concatenating the hidden vectors into \([\tilde{\varvec{h}}, \tilde{\varvec{h}}_f]\) , a linear layer with sigmoid activation maps the concatenated vector to a score.

Another hybrid model is formulated as a DNN-AES model incorporating manually designed essay-level features (Uto et al. 2020 ). Concretely, letting \(\varvec{F}\) be a manually designed essay-level feature vector, the model concatenates the feature vector with the hidden vector \(\tilde{\varvec{h}}\) , which is obtained from a DNN-AES model. Then, a linear layer with sigmoid activation maps the concatenated vector \([\tilde{\varvec{h}}, \varvec{F}]\) to a score value. Figure  8 shows the architecture of this model. This hybrid model is easy to construct using various DNN-AES models.
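As a concrete illustration of the second hybrid scheme, the sketch below concatenates a handcrafted essay-level feature vector with the hidden vector produced by an arbitrary DNN encoder before the final sigmoid layer; the toy encoder and feature dimensions are placeholders.

```python
import torch
import torch.nn as nn

class HybridScorer(nn.Module):
    """Concatenate a DNN essay representation with handcrafted essay-level features."""

    def __init__(self, encoder, hidden_dim, num_features):
        super().__init__()
        self.encoder = encoder                   # any DNN-AES encoder producing (batch, hidden_dim)
        self.out = nn.Linear(hidden_dim + num_features, 1)

    def forward(self, word_ids, features):       # features: (batch, num_features)
        h = self.encoder(word_ids)               # essay representation from the DNN
        return torch.sigmoid(self.out(torch.cat([h, features], dim=-1))).squeeze(-1)

# Example with a trivial encoder standing in for an RNN- or BERT-based encoder.
encoder = nn.Sequential(nn.Embedding(4000, 50), nn.Flatten(), nn.LazyLinear(300))
model = HybridScorer(encoder, hidden_dim=300, num_features=12)
pred = model(torch.randint(1, 4000, (8, 120)), torch.randn(8, 12))
```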

3.7 Improving robustness for biased training data

DNN-AES models generally require a large dataset of essays graded by human raters as training data. When creating a training dataset, essay grading tasks are generally shared among many raters by assigning a few raters to each essay to lower the burden of assessment. However, in such cases, assigned scores are known to be biased owing to the effects of rater characteristics (Rahman et al. 2017 ; Amidei et al. 2020 ). The performance of AES models drops when biased data are used for model training because the resulting model reflects the bias effects (Amorim et al. 2018 ; Huang et al. 2019 ; Li et al. 2020 ).

To resolve this problem, Uto and Okano ( 2020 ) proposed an AES framework that integrates item response theory (IRT), a test theory based on mathematical models. Specifically, they used an IRT model incorporating parameters representing rater characteristics (e.g., Eckes 2015 ; Uto and Ueno 2016 , 2018a ) that can estimate essay scores while mitigating rater bias effects. The applied IRT model is the generalized many-facet Rasch model (Uto and Ueno 2018b , 2020 ), which defines the probability that rater r assigns score k to the n -th essay for a prompt as

\(P_{nrk} = \frac{\exp \sum _{m=1}^{k} \left[ \alpha _r (\theta _n - \beta _{r} - \beta _{rm}) \right] }{\sum _{l=1}^{K} \exp \sum _{m=1}^{l} \left[ \alpha _r (\theta _n - \beta _{r} - \beta _{rm}) \right] } ,\)

where \(\alpha _r\) is the consistency of rater r , \(\beta _{r}\) is the severity of rater r , \(\beta _{rm}\) represents the strictness of rater r for category m , and K indicates the number of score categories. Furthermore, \(\theta _n\) represents the latent score for the n -th essay, which is free of the effects of the rater characteristics.

Using this IRT model, Uto and Okano ( 2020 ) proposed training an AES model through the following two steps. 1) Apply the IRT model to observed rating data to estimate the IRT-based score \(\theta _n\) , which removes the effects of rater bias. 2) Train an AES model using the unbiased scores \(\varvec{\theta } =\{\theta _1, \ldots , \theta _N\}\) as the gold-standard scores based on the following loss function.

\(\mathcal {L} = \frac{1}{N}\sum _{n=1}^{N} \left( \theta _{n} - {\hat{\theta }}_{n} \right) ^2 ,\)

where \({\hat{\theta }}_n\) represents the AES model’s predicted score for the n -th essay. Because the IRT-based scores are theoretically free from rater bias effects, the AES model will not reflect the bias effects.

In the prediction phase, the score for a new essay is calculated in two steps: (1) Predict the IRT score \(\theta\) for the essay using the trained AES model. (2) Given \(\theta\) and the rater parameters, calculate the expected score, which corresponds to an unbiased original-scaled score (Uto 2019 ), as

\(\frac{1}{R}\sum _{r=1}^{R} \sum _{k=1}^{K} k \, P_{rk} ,\)

where \(P_{rk}\) is the probability defined above evaluated at the predicted \(\theta\) , and R indicates the number of raters who graded essays in the training data. The expected score is used as the predicted essay score, which is robust against rater biases.
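The second step of this framework reduces to ordinary MSE training against the IRT-estimated latent scores, as in the sketch below; the toy scorer stands in for any DNN-AES model, and the IRT estimation of step 1 and the expected-score conversion are assumed to be performed separately.

```python
import torch
import torch.nn as nn

# Step 2 of the IRT-based framework: train an AES model against IRT-estimated latent
# scores theta (assumed to have been obtained in step 1 from the observed rating data).
class TinyScorer(nn.Module):               # placeholder for any DNN-AES model
    def __init__(self, vocab_size=4000, dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, word_ids):
        return self.out(self.embed(word_ids).mean(dim=1)).squeeze(-1)  # predicted theta

essays = torch.randint(1, 4000, (32, 120))
theta = torch.randn(32)                    # rater-bias-free scores from the fitted IRT model

model = TinyScorer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(20):
    optimizer.zero_grad()
    loss = nn.MSELoss()(model(essays), theta)
    loss.backward()
    optimizer.step()
# At prediction time, the model's theta estimate is converted to an expected
# original-scale score using the estimated rater parameters (Eq. (12) in the text).
```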

3.8 Integration of AES models

Conventional AES models including those introduced above have different scoring characteristics. Therefore, integrating multiple AES models is expected to improve scoring accuracy. For these reasons, Aomi et al. ( 2021 ) proposed a framework that integrates multiple AES models while considering the characteristics of each model using IRT. In the framework, multiple AES models are first trained independently, and the trained models are used to produce prediction scores for target essays. Then, the generalized many-facet Rasch model introduced above is applied to the obtained prediction scores by regarding rater characteristic parameters, \(\alpha _r\) , \(\beta _r\) , and \(\beta _{rm}\) as characteristic parameters of AES models. Given the estimated IRT score \(\theta\) for the target essays, a predicted essay score is calculated as the expected score based on Eq. ( 12 ).

This framework can integrate prediction scores from various AES models while considering the characteristics of each model. Consequently, it provides scores that are more accurate than those obtained by simple averaging or by a single AES model.

4 Prompt-specific trait scoring

This section introduces DNN-AES models for the prompt-specific trait scoring task. Although this task is important especially for educational purposes, only a limited number of models have been proposed for the task.

4.1 Use of multiple trait-specific models

Mathias et al. ( 2020 ) present one of the first attempts to perform prompt-specific trait scoring based on a DNN-AES model. Their study used the hierarchical representation model with an attention mechanism (Dong et al. 2017 ), introduced in Sect.  3.3 , to predict trait-specific scores for each essay. Concretely, in their study, the AES model was trained for each trait independently, and scores were predicted using the trait-specific models.

Fig. 9 Architecture of the RNN-based model with multiple output layers for prompt-specific trait scoring

4.2 Model with multiple output modules

Hussein et al. ( 2020 ) proposed a model specialized for the prompt-specific trait scoring task that can predict multiple trait scores jointly. The model is formulated as a multi-output model based on the RNN-based model (Taghipour and Ng 2016 ), introduced in Sect.  3.1 . Concretely, as shown in Fig.  9 , they extended the RNN-based model by adding as many output linear layers as there are traits. Additionally, an optional fully connected neural network layer was added after the pooling layer. The loss function is defined as a linear combination of multiple MSE loss functions as follows.

\(\mathcal {L} = \sum _{d=1}^{D} \frac{1}{N}\sum _{n=1}^{N} \left( y_{nd} - {\hat{y}}_{nd} \right) ^2 ,\)

where D is the number of traits, and \(y_{nd}\) and \({\hat{y}}_{nd}\) are the gold-standard and predicted d -th trait scores for the n -th essay, respectively.
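A minimal multi-output head over a shared encoder, trained with a sum of per-trait MSE losses, could look like the following sketch; the trait count, dimensions, and toy encoder are illustrative.

```python
import torch
import torch.nn as nn

class MultiTraitScorer(nn.Module):
    """Shared encoder with one sigmoid output layer per trait."""

    def __init__(self, encoder, hidden_dim, num_traits):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(num_traits)])

    def forward(self, word_ids):
        h = self.encoder(word_ids)                                   # (batch, hidden_dim)
        return torch.cat([torch.sigmoid(head(h)) for head in self.heads], dim=-1)  # (batch, traits)

encoder = nn.Sequential(nn.Embedding(4000, 50), nn.Flatten(), nn.LazyLinear(300))
model = MultiTraitScorer(encoder, hidden_dim=300, num_traits=4)
pred = model(torch.randint(1, 4000, (8, 120)))                       # (8, 4)
gold = torch.rand(8, 4)                                              # normalized trait scores
loss = sum(nn.MSELoss()(pred[:, d], gold[:, d]) for d in range(4))   # sum of per-trait MSE losses
loss.backward()
```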

5 Cross-prompt holistic scoring

The prompt-specific scoring models introduced above assume situations in which rated training essays and unrated target essays are written for the same prompt. However, we often face situations in which we cannot use any rated essays or only a relatively small number of rated essays written for the target prompt in model training, even though we have many rated essays written for other non-target prompts. AES for such settings is generally called a cross-prompt scoring task. This section introduces cross-prompt holistic scoring models.

5.1 Two-stage learning models

One of the first cross-prompt holistic scoring models using DNN was proposed by Jin et al. ( 2018 ). The method is constructed as a two-stage DNN (TDNN) approach in which a prompt-independent scoring model is trained using rated essays for non-target prompts in the first stage, and is used to generate pseudo rating data for unrated essays in a target prompt. Then, using the pseudo rating data, a prompt-specific scoring model for the target prompt is trained in the second stage. The TDNN is detailed below.

First stage (Training a prompt-independent AES model) In this stage, rated essays written for non-target prompts are used to train a prompt-independent AES model that uses manually designed prompt-independent shallow features, such as the number of typos, grammatical errors, and spelling errors. Here, a ranking support vector machine (Joachims 2002 ) is used as the prompt-independent model.

Second stage (Training a prompt-specific AES model) The trained prompt-independent AES model is used to produce the scores of unrated essays written for a target prompt, and the pseudo scores are used to train a prompt-specific scoring model. To train a prompt-specific scoring model, only confident essays with the highest and lowest pseudo scores are used, instead of using all the produced scores. The prompt-specific AES model in the study by Jin et al. ( 2018 ) used an extended model of the RNN-based model (Taghipour and Ng 2016 ) that can process three types of sequential inputs, namely, a sequence of words, part-of-speech (POS) tags, and syntactic tags.

Li et al. ( 2020 ) pointed out that the TDNN model uses a limited number of general linguistic features in the prompt-independent AES model, which may seriously affect the accuracy of the generated pseudo scores for essays in a target prompt. To extract more efficient features, they proposed another two-stage framework called a shared and enhanced deep neural network (SEDNN) model. The SEDNN model consists of two stages, described as follows.

First stage As an alternative to a prompt-independent model with manually designed shallow linguistic features, the SEDNN uses a DNN-AES model that extends the hierarchical representation model with an attention mechanism (Dong et al. 2017 ), introduced in Sect.  3.3 . Concretely, in the model, a new output layer is added to jointly solve the AES task and a binary classification task that distinguishes whether a given essay was written for the target prompt. The model is trained based on a combination of the loss functions for the essay scoring task and the prompt discrimination task using a dataset consisting of rated essays written for non-target prompts and the unrated essays written for the target prompt.

Second stage As in the second stage of the TDNN model, scores of unrated essays written for a target prompt are generated by the prompt-independent AES model, and the pseudo scores are used to train a prompt-specific scoring model. The prompt-specific scoring model in the study by Li et al. ( 2020 ) is a Siamese network model that jointly uses the essay text and the text of the target prompt itself to learn prompt-dependent features more efficiently. In the model, an essay text is processed by a model similar to SKIPFLOW (Tay et al. 2018 ) and is transformed into a vector representation. The word sequence in the prompt text is also transformed into a fixed-length hidden vector representation by another neural architecture consisting of a lookup table layer, a convolution layer, a recurrent layer, and a mechanism that measures the relevance between the given essay and the target prompt text. After concatenating the two vector representations corresponding to the essay text and the prompt text, a linear layer with sigmoid activation maps the concatenated vector to a prediction score.
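The pseudo-labeling step shared by TDNN and SEDNN can be sketched as follows: a first-stage, prompt-independent scorer labels unrated target-prompt essays, the essays with the highest and lowest pseudo scores are kept as confident training examples, and a prompt-specific model is trained on them. The selection ratio and the toy scorers are placeholders.

```python
import torch
import torch.nn as nn

def pseudo_label_and_train(stage1_model, stage2_model, target_essays, top_ratio=0.2, epochs=10):
    """Two-stage sketch: label target-prompt essays, keep the extremes, train the stage-2 model."""
    with torch.no_grad():
        pseudo = stage1_model(target_essays)              # pseudo scores for unrated essays
    k = int(top_ratio * len(pseudo))
    order = torch.argsort(pseudo)
    keep = torch.cat([order[:k], order[-k:]])             # most confident: lowest and highest scores
    essays, targets = target_essays[keep], pseudo[keep]

    optimizer = torch.optim.Adam(stage2_model.parameters(), lr=1e-3)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = nn.MSELoss()(stage2_model(essays), targets)
        loss.backward()
        optimizer.step()
    return stage2_model

# Example usage with toy scorers (placeholders for the real first- and second-stage models).
class MeanScorer(nn.Module):
    def __init__(self, vocab=4000, dim=50):
        super().__init__()
        self.embed, self.out = nn.Embedding(vocab, dim), nn.Linear(dim, 1)
    def forward(self, ids):
        return torch.sigmoid(self.out(self.embed(ids).mean(dim=1))).squeeze(-1)

essays = torch.randint(1, 4000, (100, 120))
trained = pseudo_label_and_train(MeanScorer(), MeanScorer(), essays)
```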

5.2 Multi-stage pre-training approach model

Another cross-prompt holistic scoring approach incorporates pre-training processes. In the approach, an AES model is developed by performing pre-training on a vast number of essays with or without scores written for non-target prompts, and then the model is fine-tuned using a limited number of rated essays written for a target prompt. The pre-training process enables a DNN model to capture a general language model for predicting essay quality. Thus, the use of a pre-trained model as an initial model helps in obtaining a model for a target scoring task. The BERT-based AES models explained in Sect.  3.5 are examples of the pre-training and fine-tuning approach models. In various NLP tasks, the use of pre-training has been popular and has achieved great success.

For cross-prompt holistic scoring, Song et al. ( 2020 ) proposed training the hierarchical representation model with the attention mechanism (Dong et al. 2017 ), as explained in Sect.  3.3 , through the following three pre-training and fine-tuning steps.

Weakly supervised pre-training The AES model is trained based on a vast number of roughly scored essays written for diverse prompts collected from the Web. The study by Song et al. ( 2020 ) assumed that binary scores are given to the essays; thus, this step is called weakly supervised. The objective of this pre-training step was to have the AES model learn a general language representation that can roughly distinguish essay quality.

Cross-prompt supervised fine-tuning If we have rated essays written for non-target prompts, the pre-trained model is fine-tuned using the data.

Target-prompt supervised fine-tuning The model obtained from the above steps is fine-tuned using rated essays written for the target prompt. The study by Song et al. ( 2020 ) reported that incorporating the above two-stage pre-training and fine-tuning improves the performance of the target-prompt scoring.

5.3 Model with self-supervised learning

Cao et al. ( 2020 ) proposed another cross-prompt holistic scoring model that was designed to jointly solve the AES task and two prompt-independent self-supervised learning tasks. The two self-supervised learning tasks, which are appended to efficiently extract prompt-independent common knowledge, are a sentence reordering task and a noise identification task, as explained below.

Sentence reordering In this task, each essay is divided into four parts and then shuffled according to a certain permutation order. The sentence reordering task predicts an appropriate permutation for each given essay.

Noise identification In this task, each essay is transformed into noisy data by performing random insertion, random swap, and random deletion operations on 10% of the words in the essay. The noise identification task predicts whether a given essay is noisy or not.
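A simple data-augmentation routine for the noise identification task, following the description above (random insertion, swap, and deletion applied to roughly 10% of the words), might look like this sketch; the exact corruption procedure in the original work may differ in detail.

```python
import random

def make_noisy(words, noise_ratio=0.1, seed=None):
    """Corrupt roughly noise_ratio of the words by random insertion, swap, or deletion."""
    rng = random.Random(seed)
    words = list(words)
    n_ops = max(1, int(noise_ratio * len(words)))
    for _ in range(n_ops):
        op = rng.choice(["insert", "swap", "delete"])
        i = rng.randrange(len(words))
        if op == "insert":
            words.insert(i, rng.choice(words))        # duplicate a random word at position i
        elif op == "swap" and len(words) > 1:
            j = rng.randrange(len(words))
            words[i], words[j] = words[j], words[i]
        elif op == "delete" and len(words) > 1:
            del words[i]
    return words

essay = "the quick brown fox jumps over the lazy dog".split()
noisy = make_noisy(essay, seed=0)   # label 1 (noisy) vs. 0 (original) for the identification task
```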

The above two self-supervised learning tasks are simultaneously trained with the AES task in a model. Figure  10 shows the model architecture. This model has a shared encoder that transforms an input word sequence into a fixed-length essay representation vector, and three task-specific output layers.

Fig. 10 Architecture of the cross-prompt holistic scoring model with self-supervised learning

The shared encoder is formulated as a hierarchical representation DNN model such as that introduced in Sect.  3.3 . In this model, a sequence of words corresponding to each sentence is transformed into a fixed-length sentence representation vector through a lookup table layer, a recurrent layer, a self-attention layer, a fusion gate, and a mean-over-time pooling layer. Here, the fusion gate is an operation that combines the input and output of the self-attention layer as follows.

\(\hat{\varvec{H}}_{s_i} = \varvec{g} \odot \varvec{H}_{s_i} + (1 - \varvec{g}) \odot \tilde{\varvec{H}}_{s_i} , \quad \varvec{g} = \sigma \left( \mathbf{W}_{g1} \varvec{H}_{s_i} + \mathbf{W}_{g2} \tilde{\varvec{H}}_{s_i} \right) ,\)

where \(\varvec{H}_{s_i}\) and \(\tilde{\varvec{H}}_{s_i}\) are the input and output vector sequences of the self-attention layer for the i -th sentence, \(\hat{\varvec{H}}_{s_i}\) is the fusion gate output, and \(\varvec{g}\) is a gating vector. \(\mathbf{W}_{g1}\) and \(\mathbf{W}_{g2}\) are trainable parameters. The essay representation vector is calculated by averaging the obtained sentence vectors, and this vector is used for the AES task and the two self-supervised learning tasks. This model is trained based on a weighted sum of the MSE loss function for the AES task and error loss functions for the two self-supervised learning tasks.

Furthermore, Cao et al. ( 2020 ) proposed a technique to improve the adaptability of the model to a target prompt. Concretely, during the model training processes, this technique calculates the averaged essay representation vector for each prompt and shifts the representation of each essay into the target prompt’s averaged vector.

6 Cross-prompt trait scoring

This section introduces cross-prompt trait scoring models that predict multiple trait-specific scores for each essay in a cross-prompt setting.

6.1 Use of multiple trait-specific models with self-supervised learning

Mim et al. ( 2019 ) proposed a method to predict two trait scores, namely, coherence and argument strength , for each essay. They used a vast number of unrated essays for non-target prompts to pre-train a DNN model, and then the model was transferred to a target AES task. They used the RNN-based model (Taghipour and Ng 2016 ) introduced in Sect.  3.1 as the base model. The detailed processes are as follows.

Pre-training based on self-supervised learning with non-target essays In this step, the base model is trained using unrated essays written for non-target prompts based on a self-supervised learning task, which is a binary classification task that distinguishes artificially created incoherent essays. For the self-supervised learning task, incoherent essays are created by randomly shuffling sentences, discourse indicators, and paragraphs in the original essays. This pre-training is introduced to enable the base model to learn features for distinguishing logical text from illogical text.

Pre-training based on self-supervised learning with target essays The pre-trained model is retrained using essays written for the target prompt based on the same self-supervised task described above. This step is introduced to alleviate mismatch between essays written for non-target prompts and those written for the target prompt.

Fine-tuning for AES The pre-trained model is fine-tuned for the AES task using rated essays for the target prompt. Note that, for the AES task, the base model is extended by adding two RNN-based architectures that process a prompt text and a sequence of paragraph function labels (i.e., Introduction, Body, Rebuttal and Conclusion). The fine-tuning is conducted independently for two traits, namely, coherence and argument strength .

6.2 Model with multiple output modules

Ridley et al. ( 2021 ) proposed a model specialized in trait scoring that can predict multiple trait scores jointly. As shown in Fig.  11 , the model is formulated as the following multi-output DNN model.

Fig. 11 Architecture of the cross-prompt trait scoring model with multiple output layers

Shared layers The model first processes an input through shared layers that are commonly used for predicting all trait scores. The shared layers consist of a POS embedding layer, a convolutional layer, and an attention pooling layer, as explained below.

A POS embedding layer takes a sequence of POS tags for words in a given essay and transforms it into embedding representations, using the same operations as in the lookup table layer. Note that this model uses a POS tag sequence as the input instead of a word sequence because word information depends strongly on a prompt, but POS information that represents syntactic information is more adaptable to different prompts.

A convolutional layer extracts n-gram level features from a sequence of POS embeddings for each sentence in the same way as described in Sect.  3.1 .

An attention pooling layer applies an attention mechanism to produce a fixed-length vector representation for each sentence from the convolutional layer outputs.

Trait-specific layers The sequence of sentence representations produced by the shared layers is input into trait-specific layers that are used for predicting each trait score through the following procedures.

The sentence representation sequence is transformed into a fixed-length vector corresponding to the essay representation through a recurrent layer and an attention pooling layer.

The essay representation vector is concatenated with prompt-independent manually designed features, similar to those used in the first stage of TDNN.

To obtain a final representation for each trait score, the model applies an attention mechanism so that each trait-specific layer can utilize the relevant information from the other trait-specific layers.

A linear layer with sigmoid activation maps the aggregated vector to a corresponding trait score.

The loss function for training this model is similar to Eq. ( 13 ). Note that, because different prompts are often designed to evaluate different trait scores, the model introduces a masking function. Concretely, letting \(mask_{nd}\) be a variable that takes 1 if the prompt corresponding to the n -th essay has the d -th trait score, and 0 otherwise, the loss function with the mask function is defined as follows.

\(\mathcal {L} = \frac{1}{N}\sum _{n=1}^{N} \sum _{d=1}^{D} mask_{nd} \left( y_{nd} - {\hat{y}}_{nd} \right) ^2 .\)

The mask function sets the loss values for traits without gold-standard scores to 0.
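The masked trait loss can be implemented directly as element-wise masking of the per-trait squared errors, as in the short sketch below; the normalization term is an assumption.

```python
import torch

def masked_trait_mse(pred, gold, mask):
    """MSE over traits, ignoring traits that a prompt does not define (mask == 0)."""
    sq_err = mask * (gold - pred) ** 2         # zero out undefined traits
    return sq_err.sum() / mask.sum().clamp(min=1)

pred = torch.rand(8, 5)                        # predicted scores for 5 traits
gold = torch.rand(8, 5)
mask = torch.randint(0, 2, (8, 5)).float()     # 1 if the trait is rated for this essay's prompt
loss = masked_trait_mse(pred, gold, mask)
```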

A special case of this model that has a single output layer for the holistic score has also been proposed as a cross-prompt holistic scoring model (Ridley et al. 2020 ).

7 Conclusions and remarks

This review has presented a comprehensive survey of DNN-AES models. Concretely, we classified the AES task into four types, namely, (1) prompt-specific holistic scoring, (2) prompt-specific trait scoring, (3) cross-prompt holistic scoring, and (4) cross-prompt trait scoring, and introduced the main ideas and the architectures of representative DNN-AES models for each task type.

As shown in our study, earlier DNN-AES models focused mainly on the prompt-specific holistic scoring task. The commonly used baseline model is the RNN-based model (Taghipour and Ng 2016 ), which has been extended by incorporating efficient word-embedding representations, the hierarchical structure of a text, coherence modeling, and manually designed features. We also described transformer-based models such as BERT, which have recently been applied to AES following their widespread adoption across NLP research.

These prompt-specific holistic scoring models have been extended for prompt-specific trait scoring, which predicts multiple trait scores for each essay. Trait scoring is practically important, especially when we need to provide detailed feedback to examinees for educational purposes, although the number of papers for this task is still limited.

Although prompt-specific scoring tasks assume that we can use a sufficient number of rated essays for a target prompt, this assumption is not often satisfied in practice because collecting rated essays is an expensive and time-consuming task. To overcome this limitation, cross-prompt scoring models have provided frameworks that use a large number of essays for non-target prompts. Although the number of cross-prompt scoring models is still limited, this task is important for increasing the feasibility of applying DNN-AES models to practical situations.

We can use several corpora to develop and evaluate AES models. The ASAP corpus, which was released as part of a Kaggle competition, has been commonly used for holistic scoring models. For trait scoring models, the International Corpus of Learner English (Ke et al. 2019 ) and the ASAP++ corpus (Mathias et al. 2018 ) are available. See (Ke and Ng 2019 ) for a more detailed summary of these corpora.

A future direction of AES studies is developing efficient and accurate trait scoring models and cross-prompt models. As described above, although the number of studies for those DNN-AES models is limited, such studies are essential to the use of AES technologies in various situations. It is also important to develop methodologies that reduce costs and noise when training data are being created. Approaches to reducing rating costs include recently examined active learning approaches (e.g., Hellman et al. 2019 ). To reduce scoring noise or biases, the integration of statistical models such as the IRT models described in Sect.  3.7 would be a possible approach.

Another future direction is to analyze the quality of each essay test and the characteristics of an applied AES model based on test theory. From the perspective of test theory, evaluating the reliability and validity of a test and its scoring processes is important for discussing the appropriateness of the test as a measurement tool. Although AES studies tend to ignore these points, several works have considered the relationship between DNN-based AES tasks and test theory (e.g., Uysal and Doğan 2021 ; Uto and Uchida 2020 ; Ha et al. 2020 ).

The application of AES methods to various related domains is also desired. For example, AES methods could be applied in writing support systems (e.g., Ito et al. 2020 ; Tsai et al. 2020 ) and peer grading processes (Han et al. 2020 ).

Abosalem Y (2016) Beyond translation: adapting a performance-task-based assessment of critical thinking ability for use in Rwanda. Int J Secondary Educ 4(1):1–11


Alikaniotis D, Yannakoudakis H, Rei M (2016) Automatic text scoring using neural networks. In: Proceedings of the annual meeting of the association for computational linguistics (pp. 715–725)

Amidei J, Piwek P, Willis A (2020) Identifying annotator bias: a new irt-based method for bias identification. In: Proceedings of the international conference on computational linguistics (pp. 4787–4797)

Amorim E, Cançado M, Veloso A (2018) Automated essay scoring in the presence of biased ratings. In: Proceedings of the annual conference of the north American chapter of the association for computational linguistics (pp. 229–237)

Aomi I, Tsutsumi E, Uto M, Ueno M (2021) Integration of automated essay scoring models using item response theory. In: Proceedings of the international conference on artificial intelligence in education (pp. 54–59)

Attali Y, Burstein J (2006) Automated essay scoring with e-rater v.2. J Technol, Learn Assessment 4(3):1–31


Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv

Beigman Klebanov B, Flor M, Gyawali B (2016) Topicality-based indices for essay scoring. In: Proceedings of the workshop on innovative use of NLP for building educational applications (pp. 63–72)

Bernardin HJ, Thomason S, Buckley MR, Kane JS (2016) Rater rating-level bias and accuracy in performance appraisals: the impact of rater personality, performance management competence, and rater accountability. Hum Resour Manage 55(2):321–340

Borade JG, Netak LD (2021) Automated grading of essays: a review. In: Intelligent human computer interaction (vol. 12615, pp. 238–249), Springer International Publishing

Cao Y, Jin H, Wan X, Yu Z (2020) Domain-adaptive neural automated essay scoring. In: Proceedings of the international ACM SIGIR conference on research and development in information retrieval (pp. 1011–1020), Association for Computing Machinery

Cao Z, Qin T, Liu TY, Tsai MF, Li H (2007) Learning to rank: From pairwise approach to listwise approach. In: Proceedings of the international conference on machine learning (pp. 129–136), Association for Computing Machinery

Collobert R, Weston J (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the international conference on machine learning (pp. 160–167), Association for Computing Machinery

Cozma M, Butnaru A, Ionescu RT (2018) Automated essay scoring with string kernels and word embeddings. In: Proceedings of the annual meeting of the association for computational linguistics (pp. 503–509)

Dascalu M, Westera W, Ruseti S, Trausan-Matu S, Kurvers H (2017) Readerbench learns Dutch: building a comprehensive automated essay scoring system for Dutch language. In: Proceedings of the international conference on artificial intelligence in education (pp. 52–63)

Dasgupta T, Naskar A, Dey L, Saha R (2018) Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In: Proceedings of the workshop on natural language processing techniques for educational applications, association for computational linguistics (pp. 93–102)

Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the annual conference of the north American chapter of the association for computational linguistics: Human language technologies (pp. 4171–4186)

Dong F, Zhang Y (2016) Automatic features for essay scoring—an empirical study. In: Proceedings of the conference on empirical methods in natural language processing (pp. 1072–1077), Association for Computational Linguistics

Dong F, Zhang Y, Yang J (2017) Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the conference on computational natural language learning (pp. 153–162), Association for Computational Linguistics

Eckes T (2015) Introduction to many-facet Rasch measurement: analyzing and evaluating rater-mediated assessments, Peter Lang Pub. Inc

Farag Y, Yannakoudakis H, Briscoe T (2018) Neural automated essay scoring and coherence modeling for adversarially crafted input. In: Proceedings of the annual conference of the north American chapter of the association for computational linguistics (pp. 263–271)

Ha LA, Yaneva V, Harik P, Pandian R, Morales A, Clauser B (2020) Automated prediction of examinee proficiency from short-answer questions. In: Proceedings of the international conference on computational linguistics (pp. 893–903)

Han Y, Wu W, Yan Y, Zhang L (2020) Human-machine hybrid peer grading in SPOCs. IEEE Access 8:220922–220934

Hellman S, Rosenstein M, Gorman A, Murray W, Becker L, Baikadi A, Foltz PW (2019) Scaling up writing in the curriculum: Batch mode active learning for automated essay scoring. In: Proceedings of the ACM conference on learning (pp. 1—10), Association for Computing Machinery

Hua C, Wind SA (2019) Exploring the psychometric properties of the mind-map scoring rubric. Behaviormetrika 46(1):73–99

Huang J, Qu L, Jia R, Zhao B (2019) O2U-Net: a simple noisy label detection approach for deep neural networks. In: Proceedings of the IEEE international conference on computer vision (pp. 3326–3334)

Hussein MA, Hassan HA, Nassef M (2019) Automated language essay scoring systems: a literature review. Peer J Comput Sci 5:e208

Hussein MA, Hassan HA, Nassef M (2020) A trait-based deep learning automated essay scoring system with adaptive feedback. Int J Adv Comput Sci Appl 11(5):287–293

Ito T, Kuribayashi T, Hidaka M, Suzuki J, Inui K (2020) Langsmith: an interactive academic text revision system. In: Proceedings of conference on empirical methods in natural language processing (pp. 216–226), Association for Computational Linguistics

Jin C, He B, Hui K, Sun L (2018) TDNN: a two-stage deep neural network for prompt-independent automated essay scoring. In: Proceedings of the annual meeting of the association for computational linguistics (pp. 1088–1097)

Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 133–142), Association for Computing Machinery

Kassim NLA (2011) Judging behaviour and rater errors: an application of the many-facet Rasch model. GEMA Online J Lang Stud 11(3):179–197

Ke Z, Inamdar H, Lin H, Ng V (2019) Give me more feedback II: Annotating thesis strength and related attributes in student essays. In: Proceedings of the annual meeting of the association for computational linguistics (pp. 3994–4004)

Ke Z, Ng V (2019) Automated essay scoring: a survey of the state of the art. In: Proceedings of the international joint conference on artificial intelligence (pp. 6300–6308)

Li S, Ge S, Hua Y, Zhang C, Wen H, Liu T, Wang W (2020) Coupled-view deep classifier learning from multiple noisy annotators. In: Proceedings of the association for the advancement of artificial intelligence (pp. 4667–4674)

Li X, Chen M, Nie J, Liu Z, Feng Z, Cai Y (2018) Coherence-based automated essay scoring using self-attention. In: Chinese computational linguistics and natural language processing based on naturally annotated big data (pp. 386–397), Springer International Publishing

Li X, Chen M, Nie JY (2020) SEDNN: shared and enhanced deep neural network model for cross-prompt automated essay scoring. Knowl-Based Syst 210:106491

Liu OL, Frankel L, Roohr KC (2014) Assessing critical thinking in higher education: current state and directions for next-generation assessment. ETS Res Rep Series 1:1–23

Liu T, Ding W, Wang Z, Tang J, Huang GY, Liu Z (2019) Automatic short answer grading via multiway attention networks. In: Proceedings of the international conference on artificial intelligence in education (pp. 169–173)

Lun J, Zhu J, Tang Y, Yang M (2020) Multiple data augmentation strategies for improving performance on automatic short answer scoring. In: Proceedings of the association for the advancement of artificial intelligence (pp. 13389–13396)

Shermis MD, Burstein JC (2016) Automated essay scoring: a cross-disciplinary perspective. Taylor & Francis

Mathias S, Bhattacharyya P (2018) ASAP++: enriching the ASAP automated essay grading dataset with essay attribute scores. In: Proceedings of the eleventh international conference on language resources and evaluation (pp. 1169–1173)

Mathias S, Bhattacharyya P (2020) Can neural networks automatically score essay traits? In: Proceedings of the workshop on innovative use of nlp for building educational applications (pp. 85–91), Association for Computational Linguistics

Mayfield E, Black AW (2020) Should you fine-tune BERT for automated essay scoring? In: Proceedings of the workshop on innovative use of nlp for building educational applications (pp. 151–162), Association for Computational Linguistics

Mesgar M, Strube M (2018) A neural local coherence model for text quality assessment. In: Proceedings of the conference on empirical methods in natural language processing (pp. 4328–4339)

Mim FS, Inoue N, Reisert P, Ouchi H, Inui K (2019) Unsupervised learning of discourse-aware text representation for essay scoring. In: Proceedings of the annual meeting of the association for computational linguistics: student research workshop (pp. 378–385)

Myford CM, Wolfe EW (2003) Detecting and measuring rater effects using many-facet Rasch measurement: part I. J Appl Meas 4:386–422

Nadeem F, Nguyen H, Liu Y, Ostendorf M (2019) Automated essay scoring with discourse-aware neural models. In: Proceedings of the workshop on innovative use of NLP for building educational applications, association for computational linguistics (pp. 484–493)

Nguyen HV, Litman DJ (2018) Argument mining for improving the automated scoring of persuasive essays. In: Proceedings of the association for the advancement of artificial intelligence (pp. 5892–5899)

Phandi P, Chai KMA, Ng HT (2015) Flexible domain adaptation for automated essay scoring using correlated linear regression. In: Proceedings of the conference on empirical methods in natural language processing (pp. 431–439)

Rahman AA, Ahmad J, Yasin RM, Hanafi NM (2017) Investigating central tendency in competency assessment of design electronic circuit: analysis using many facet Rasch measurement (MFRM). Int J Inf Educ Technol 7(7):525–528

Ridley R, He L, Dai X, Huang S, Chen J (2020) Prompt agnostic essay scorer: a domain generalization approach to cross-prompt automated essay scoring. arXiv

Ridley R, He L, Dai X, Huang S, Chen J (2021) Automated cross-prompt scoring of essay traits. In: Proceedings of the AAAI conference on artificial intelligence (vol 35, pp. 13745–13753)

Rodriguez PU, Jafari A, Ormerod CM (2019) Language models and automated essay scoring. arXiv

Rosen Y, Tager M (2014) Making student thinking visible through a concept map in computer-based assessment of critical thinking. J Educ Comput Res 50(2):249–270

Schendel R, Tolmie A (2017) Assessment techniques and students’ higher-order thinking skills. Assess & Eval Higher Educ 42(5):673–689

Song W, Zhang K, Fu R, Liu L, Liu T, Cheng M (2020) Multi-stage pre-training for automated Chinese essay scoring. In: Proceedings of the conference on empirical methods in natural language processing (pp. 6723–6733), Association for Computational Linguistics

Sung C, Dhamecha TI, Mukhi N (2019) Improving short answer grading using transformer-based pre-training. In: Proceedings of the international conference on artificial intelligence in education (pp. 469–481)

Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. In: Proceedings of the conference on empirical methods in natural language processing (pp. 1882–1891)

Tay Y, Phan MC, Tuan LA, Hui SC (2018) SKIPFLOW: Incorporating neural coherence features for end-to-end automatic text scoring. In: Proceedings of the AAAI conference on artificial intelligence (pp. 5948–5955)

Tsai CT, Chen JJ, Yang CY, Chang JS (2020) LinggleWrite: a coaching system for essay writing. In: Proceedings of annual meeting of the association for computational linguistics (pp. 127–133), Association for Computational Linguistics

Uto M (2019) Rater-effect IRT model integrating supervised LDA for accurate measurement of essay writing ability. In: Proceedings of the international conference on artificial intelligence in education (pp. 494–506)

Uto M, Okano M (2020) Robust neural automated essay scoring using item response theory. In: Proceedings of the international conference on artificial intelligence in education (pp. 549–561)

Uto M, Uchida Y (2020) Automated short-answer grading using deep neural networks and item response theory. In: Proceedings of the artificial intelligence in education (pp. 334–339)

Uto M, Ueno M (2016) Item response theory for peer assessment. IEEE Trans Learn Technol 9(2):157–170

Uto M, Ueno M (2018a) Empirical comparison of item response theory models with rater’s parameters. Heliyon, Elsevier 4(5):1–32

Uto M, Ueno M (2018b) Item response theory without restriction of equal interval scale for rater’s score. In: Proceedings of the international conference on artificial intelligence in education (pp. 363–368)

Uto M, Ueno M (2020) A generalized many-facet Rasch model and its Bayesian estimation using Hamiltonian Monte Carlo. Behaviormetrika, Springer 47(2):469–496

Uto M, Xie Y, Ueno M (2020) Neural automated essay scoring incorporating handcrafted features. In: Proceedings of the international conference on computational linguistics (pp. 6077–6088), International Committee on Computational Linguistics

Uysal İ, Doğan N (2021) Automated essay scoring effect on test equating errors in mixed-format test. Int J Assess Tools Educ 8:222–238

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you need. In: Proceedings of the international conference on advances in neural information processing systems (pp. 5998–6008)

Wang Y, Wei Z, Zhou Y, Huang X (2018) Automatic essay scoring incorporating rating schema via reinforcement learning. In: Proceedings of the conference on empirical methods in natural language processing (pp. 791–797)

Yang R, Cao J, Wen Z, Wu Y, He X (2020) Enhancing automated essay scoring performance via fine-tuning pre-trained language models with combination of regression and ranking. In: Findings of the association for computational linguistics: EMNLP 2020 (pp. 1560–1569), Association for Computational Linguistics

Yang Y, Zhong J (2021) Automated essay scoring via example-based learning. In: Brambilla M, Chbeir R, Frasincar F, Manolescu I (eds) Web engineering. Springer International Publishing, pp 201–208


Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 19H05663 and 21H00898.

Author information

Authors and affiliations.

The University of Electro-Communications, Tokyo, Japan


Corresponding author

Correspondence to Masaki Uto .

Ethics declarations

Conflict of interest.

The authors have no conflicts of interest directly relevant to the content of this article.

Additional information

Communicated by Kazuo Shigemasu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Uto, M. A review of deep-neural automated essay scoring models. Behaviormetrika 48 , 459–484 (2021). https://doi.org/10.1007/s41237-021-00142-y


Received : 18 June 2021

Accepted : 08 July 2021

Published : 20 July 2021

Issue Date : July 2021

DOI : https://doi.org/10.1007/s41237-021-00142-y


Keywords

  • Automated essay scoring
  • Deep neural networks
  • Natural language processing
  • Educational/psychological measurement
  • Open access
  • Published: 03 June 2024

Applying large language models for automated essay scoring for non-native Japanese

  • Wenchao Li 1 &
  • Haitao Liu 2  

Humanities and Social Sciences Communications volume 11, Article number: 723 (2024)


  • Language and linguistics

Recent advancements in artificial intelligence (AI) have led to an increased use of large language models (LLMs) for language assessment tasks such as automated essay scoring (AES), automated listening tests, and automated oral proficiency assessments. The application of LLMs for AES in the context of non-native Japanese, however, remains limited. This study explores the potential of LLM-based AES by comparing the effectiveness of different models, i.e. two conventional machine learning technology-based methods (Jess and JWriter), two LLMs (GPT and BERT), and one Japanese local LLM (Open-Calm large model). To conduct the evaluation, a dataset consisting of 1400 story-writing scripts authored by learners with 12 different first languages was used. Statistical analysis revealed that GPT-4 outperforms Jess and JWriter, BERT, and the Japanese language-specific trained Open-Calm large model in terms of annotation accuracy and predicting learning levels. Furthermore, by comparing 18 different models that utilize various prompts, the study emphasizes the significance of prompts in achieving accurate and reliable evaluations using LLMs.


Conventional machine learning technology in AES

AES has experienced significant growth with the advancement of machine learning technologies in recent decades. In the earlier stages of AES development, conventional machine learning-based approaches were commonly used. These approaches involved the following procedures: (a) feeding the machine a dataset of essays, which serves as the basis for training the model and establishing patterns and correlations between linguistic features and human ratings; and (b) training the machine learning model on linguistic features that best represent human ratings and can effectively discriminate learners’ writing proficiency. These features include lexical richness (Lu, 2012; Kyle and Crossley, 2015; Kyle et al. 2021), syntactic complexity (Lu, 2010; Liu, 2008), and text cohesion (Crossley and McNamara, 2016), among others. Conventional machine learning approaches in AES require human intervention, such as manual correction and annotation of essays. This human involvement is necessary to create a labeled dataset for training the model. Several AES systems have been developed using conventional machine learning technologies, including the Intelligent Essay Assessor (Landauer et al. 2003), the e-rater engine by Educational Testing Service (Attali and Burstein, 2006; Burstein, 2003), MyAccess with the IntelliMetric scoring engine by Vantage Learning (Elliot, 2003), and the Bayesian Essay Test Scoring system (Rudner and Liang, 2002). These systems have played a significant role in automating the essay scoring process and providing quick and consistent feedback to learners. However, as touched upon earlier, conventional machine learning approaches rely on predetermined linguistic features and often require manual intervention, making them less flexible and potentially limiting their generalizability to different contexts.

In the context of the Japanese language, conventional machine learning-incorporated AES tools include Jess (Ishioka and Kameda, 2006 ) and JWriter (Lee and Hasebe, 2017 ). Jess assesses essays by deducting points from the perfect score, utilizing the Mainichi Daily News newspaper as a database. The evaluation criteria employed by Jess encompass various aspects, such as rhetorical elements (e.g., reading comprehension, vocabulary diversity, percentage of complex words, and percentage of passive sentences), organizational structures (e.g., forward and reverse connection structures), and content analysis (e.g., latent semantic indexing). JWriter employs linear regression analysis to assign weights to various measurement indices, such as average sentence length and total number of characters. These weights are then combined to derive the overall score. A pilot study involving the Jess model was conducted on 1320 essays at different proficiency levels, including primary, intermediate, and advanced. However, the results indicated that the Jess model failed to significantly distinguish between these essay levels. Out of the 16 measures used, four measures, namely median sentence length, median clause length, median number of phrases, and maximum number of phrases, did not show statistically significant differences between the levels. Additionally, two measures exhibited between-level differences but lacked linear progression: the number of attributives declined words and the Kanji/kana ratio. On the other hand, the remaining measures, including maximum sentence length, maximum clause length, number of attributive conjugated words, maximum number of consecutive infinitive forms, maximum number of conjunctive-particle clauses, k characteristic value, percentage of big words, and percentage of passive sentences, demonstrated statistically significant between-level differences and displayed linear progression.

Both Jess and JWriter exhibit notable limitations, including the manual selection of feature parameters and weights, which can introduce biases into the scoring process. The reliance on human annotators to label non-native language essays also introduces potential noise and variability in the scoring. Furthermore, an important concern is the possibility of system manipulation and cheating by learners who are aware of the regression equation utilized by the models (Hirao et al. 2020 ). These limitations emphasize the need for further advancements in AES systems to address these challenges.

Deep learning technology in AES

Deep learning has emerged as one of the approaches for improving the accuracy and effectiveness of AES. Deep learning-based AES methods utilize artificial neural networks that mimic the human brain’s functioning through layered algorithms and computational units. Unlike conventional machine learning, deep learning autonomously learns from the environment and past errors without human intervention. This enables deep learning models to establish nonlinear correlations, resulting in higher accuracy. Recent advancements in deep learning have led to the development of transformers, which are particularly effective in learning text representations. Noteworthy examples include bidirectional encoder representations from transformers (BERT) (Devlin et al. 2019 ) and the generative pretrained transformer (GPT) (OpenAI).

BERT is a linguistic representation model that utilizes a transformer architecture and is trained on two tasks: masked linguistic modeling and next-sentence prediction (Hirao et al. 2020 ; Vaswani et al. 2017 ). In the context of AES, BERT follows specific procedures, as illustrated in Fig. 1 : (a) the tokenized prompts and essays are taken as input; (b) special tokens, such as [CLS] and [SEP], are added to mark the beginning and separation of prompts and essays; (c) the transformer encoder processes the prompt and essay sequences, resulting in hidden layer sequences; (d) the hidden layers corresponding to the [CLS] tokens (T[CLS]) represent distributed representations of the prompts and essays; and (e) a multilayer perceptron uses these distributed representations as input to obtain the final score (Hirao et al. 2020 ).

Figure 1. AES system with BERT (Hirao et al. 2020).
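This pipeline can be illustrated with a short PyTorch sketch. The snippet below is a minimal illustration rather than the exact system of Hirao et al. (2020); the pretrained model name, maximum sequence length, and single-layer regression head are assumptions.

```python
# Minimal sketch of a BERT-based AES regressor (model name, max length, and
# regression head are assumptions, not the exact setup of Hirao et al. 2020).
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "cl-tohoku/bert-base-japanese"  # assumed Japanese BERT checkpoint

class BertEssayScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)  # regression head on T[CLS]

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        cls_vec = hidden[:, 0, :]              # distributed representation of prompt + essay
        return self.head(cls_vec).squeeze(-1)  # predicted essay score

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)  # adds [CLS]/[SEP] automatically
batch = tokenizer(["エッセイ課題"], ["学習者のエッセイ本文"],
                  truncation=True, max_length=512, padding=True, return_tensors="pt")
model = BertEssayScorer()
with torch.no_grad():
    print(model(batch["input_ids"], batch["attention_mask"]))
```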

The training of BERT using a substantial amount of sentence data through the Masked Language Model (MLM) allows it to capture contextual information within the hidden layers. Consequently, BERT is expected to be capable of identifying artificial essays as invalid and assigning them lower scores (Mizumoto and Eguchi, 2023 ). In the context of AES for nonnative Japanese learners, Hirao et al. ( 2020 ) combined the long short-term memory (LSTM) model proposed by Hochreiter and Schmidhuber ( 1997 ) with BERT to develop a tailored automated Essay Scoring System. The findings of their study revealed that the BERT model outperformed both the conventional machine learning approach utilizing character-type features such as “kanji” and “hiragana”, as well as the standalone LSTM model. Takeuchi et al. ( 2021 ) presented an approach to Japanese AES that eliminates the requirement for pre-scored essays by relying solely on reference texts or a model answer for the essay task. They investigated multiple similarity evaluation methods, including frequency of morphemes, idf values calculated on Wikipedia, LSI, LDA, word-embedding vectors, and document vectors produced by BERT. The experimental findings revealed that the method utilizing the frequency of morphemes with idf values exhibited the strongest correlation with human-annotated scores across different essay tasks. The utilization of BERT in AES encounters several limitations. Firstly, essays often exceed the model’s maximum length limit. Second, only score labels are available for training, which restricts access to additional information.

Mizumoto and Eguchi ( 2023 ) were pioneers in employing the GPT model for AES in non-native English writing. Their study focused on evaluating the accuracy and reliability of AES using the GPT-3 text-davinci-003 model, analyzing a dataset of 12,100 essays from the corpus of nonnative written English (TOEFL11). The findings indicated that AES utilizing the GPT-3 model exhibited a certain degree of accuracy and reliability. They suggest that GPT-3-based AES systems hold the potential to provide support for human ratings. However, applying GPT model to AES presents a unique natural language processing (NLP) task that involves considerations such as nonnative language proficiency, the influence of the learner’s first language on the output in the target language, and identifying linguistic features that best indicate writing quality in a specific language. These linguistic features may differ morphologically or syntactically from those present in the learners’ first language, as observed in (1)–(3).

(1) Isolating (Chinese)

我-送了-他-一本-书

Wǒ-sòngle-tā-yī běn-shū

1sg-give.past-him-one.cl-book

“I gave him a book.”

(2) Agglutinative (Japanese)

彼-に-本-を-あげ-まし-た

Kare-ni-hon-o-age-mashi-ta

3sg-dat-book-acc-give-hon-past

“(I) gave him a book.”

(3) Inflectional (English)

give, give-s, gave, given, giving

Additionally, the morphological agglutination and subject-object-verb (SOV) order in Japanese, along with its idiomatic expressions, pose additional challenges for applying language models in AES tasks (4).

(4)

足-が 棒-に なり-ました

Ashi-ga bō-ni nari-mashita

leg-nom stick-dat become-past

“My leg became like a stick (I am extremely tired).”

The example sentence provided demonstrates the morpho-syntactic structure of Japanese and the presence of an idiomatic expression. In this sentence, the verb “なる” (naru), meaning “to become”, appears at the end of the sentence. The verb stem “なり” (nari) has morphemes attached to it indicating honorification (“ます” - masu) and tense (“た” - ta), showcasing agglutination. While the sentence can be literally translated as “my leg became like a stick”, it carries an idiomatic interpretation that implies “I am extremely tired”.

To overcome this issue, CyberAgent Inc. ( 2023 ) has developed the Open-Calm series of language models specifically designed for Japanese. Open-Calm consists of pre-trained models available in various sizes, such as Small, Medium, Large, and 7b. Figure 2 depicts the fundamental structure of the Open-Calm model. A key feature of this architecture is the incorporation of the Lora Adapter and GPT-NeoX frameworks, which can enhance its language processing capabilities.

Figure 2. GPT-NeoX Model Architecture (Okgetheng and Takeuchi, 2024).

In a recent study conducted by Okgetheng and Takeuchi ( 2024 ), they assessed the efficacy of Open-Calm language models in grading Japanese essays. The research utilized a dataset of approximately 300 essays, which were annotated by native Japanese educators. The findings of the study demonstrate the considerable potential of Open-Calm language models in automated Japanese essay scoring. Specifically, among the Open-Calm family, the Open-Calm Large model (referred to as OCLL) exhibited the highest performance. However, it is important to note that, as of the current date, the Open-Calm Large model does not offer public access to its server. Consequently, users are required to independently deploy and operate the environment for OCLL. In order to utilize OCLL, users must have a PC equipped with an NVIDIA GeForce RTX 3060 (8 or 12 GB VRAM).
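As a rough sketch of what local deployment involves, the snippet below loads an Open-Calm checkpoint through the Hugging Face transformers library and generates a grade from a Japanese scoring prompt; the model identifier, decoding settings, and prompt wording are assumptions, and a GPU with sufficient VRAM is required in practice.

```python
# Sketch of prompting a locally deployed Open-Calm model (model id, decoding
# settings, and prompt wording are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cyberagent/open-calm-large"  # assumed Hugging Face identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")  # requires a CUDA GPU

prompt = "次の日本語の作文をCEFRの6段階(A1〜C2)で採点してください。\n作文: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```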

In summary, while the potential of LLMs in automated scoring of nonnative Japanese essays has been demonstrated in two studies—BERT-driven AES (Hirao et al. 2020 ) and OCLL-based AES (Okgetheng and Takeuchi, 2024 )—the number of research efforts in this area remains limited.

Another significant challenge in applying LLMs to AES lies in prompt engineering and ensuring its reliability and effectiveness (Brown et al. 2020 ; Rae et al. 2021 ; Zhang et al. 2021 ). Various prompting strategies have been proposed, such as the zero-shot chain of thought (CoT) approach (Kojima et al. 2022 ), which involves manually crafting diverse and effective examples. However, manual efforts can lead to mistakes. To address this, Zhang et al. ( 2021 ) introduced an automatic CoT prompting method called Auto-CoT, which demonstrates matching or superior performance compared to the CoT paradigm. Another prompt framework is trees of thoughts, enabling a model to self-evaluate its progress at intermediate stages of problem-solving through deliberate reasoning (Yao et al. 2023 ).

Beyond linguistic studies, there has been a noticeable increase in the number of foreign workers in Japan and Japanese learners worldwide (Ministry of Health, Labor, and Welfare of Japan, 2022 ; Japan Foundation, 2021 ). However, existing assessment methods, such as the Japanese Language Proficiency Test (JLPT), J-CAT, and TTBJ Footnote 1 , primarily focus on reading, listening, vocabulary, and grammar skills, neglecting the evaluation of writing proficiency. As the number of workers and language learners continues to grow, there is a rising demand for an efficient AES system that can reduce costs and time for raters and be utilized for employment, examinations, and self-study purposes.

This study aims to explore the potential of LLM-based AES by comparing the effectiveness of five models: two LLMs (GPT Footnote 2 and BERT), one Japanese local LLM (OCLL), and two conventional machine learning-based methods (linguistic feature-based scoring tools - Jess and JWriter).

The research questions addressed in this study are as follows:

To what extent do the LLM-driven AES and linguistic feature-based AES, when used as automated tools to support human rating, accurately reflect test takers’ actual performance?

What influence does the prompt have on the accuracy and performance of LLM-based AES methods?

The subsequent sections of the manuscript cover the methodology, including the assessment measures for nonnative Japanese writing proficiency, criteria for prompts, and the dataset. The evaluation section focuses on the analysis of annotations and rating scores generated by LLM-driven and linguistic feature-based AES methods.

Methodology

The dataset utilized in this study was obtained from the International Corpus of Japanese as a Second Language (I-JAS) Footnote 3 . This corpus consisted of 1000 participants who represented 12 different first languages. For the study, the participants were given a story-writing task on a personal computer. They were required to write two stories based on the 4-panel illustrations titled “Picnic” and “The key” (see Appendix A). Background information for the participants was provided by the corpus, including their Japanese language proficiency levels assessed through two online tests: J-CAT and SPOT. These tests evaluated their reading, listening, vocabulary, and grammar abilities. The learners’ proficiency levels were categorized into six levels aligned with the Common European Framework of Reference for Languages (CEFR) and the Reference Framework for Japanese Language Education (RFJLE): A1, A2, B1, B2, C1, and C2. According to Lee et al. ( 2015 ), there is a high level of agreement (r = 0.86) between the J-CAT and SPOT assessments, indicating that the proficiency certifications provided by J-CAT are consistent with those of SPOT. However, it is important to note that the scores of J-CAT and SPOT do not have a one-to-one correspondence. In this study, the J-CAT scores were used as a benchmark to differentiate learners of different proficiency levels. A total of 1400 essays were utilized, representing the beginner (aligned with A1), A2, B1, B2, C1, and C2 levels based on the J-CAT scores. Table 1 provides information about the learners’ proficiency levels and their corresponding J-CAT and SPOT scores.

A dataset comprising a total of 1400 essays from the story writing tasks was collected. Among these, 714 essays were utilized to evaluate the reliability of the LLM-based AES method, while the remaining 686 essays were designated as development data to assess the LLM-based AES’s capability to distinguish participants with varying proficiency levels. The GPT 4 API was used in this study. A detailed explanation of the prompt-assessment criteria is provided in Section Prompt . All essays were sent to the model for measurement and scoring.

Measures of writing proficiency for nonnative Japanese

Japanese exhibits a morphologically agglutinative structure where morphemes are attached to the word stem to convey grammatical functions such as tense, aspect, voice, and honorifics, e.g. (5).

(5)

食べ-させ-られ-まし-た-か

tabe-sase-rare-mashi-ta-ka

[eat(stem)-causative-passive voice-honorification-past-question marker]

Japanese employs nine case particles to indicate grammatical functions: the nominative case particle が (ga), the accusative case particle を (o), the genitive case particle の (no), the dative case particle に (ni), the locative/instrumental case particle で (de), the ablative case particle から (kara), the directional case particle へ (e), and the comitative case particle と (to). The agglutinative nature of the language, combined with the case particle system, provides an efficient means of distinguishing between active and passive voice, either through morphemes or case particles, e.g. 食べる taberu “eat (conclusive)” (active voice); 食べられる taberareru “eat (conclusive)” (passive voice). In the active voice, “パン を 食べる” (pan o taberu) translates to “to eat bread”. On the other hand, in the passive voice, it becomes “パン が 食べられた” (pan ga taberareta), which means “(the) bread was eaten”. Additionally, it is important to note that different conjugations of the same lemma are counted as one type in order to ensure a comprehensive assessment of the language features; for example, 食べる taberu “eat (conclusive)”, 食べている tabeteiru “eat (progressive)”, and 食べた tabeta “eat (past)” are treated as one type.
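Collapsing different conjugations of the same lemma into one type requires a morphological analyzer; the sketch below uses fugashi with a UniDic dictionary, which is an assumption — the paper does not state which analyzer was used.

```python
# Count word types by lemma so that 食べる, 食べた, and 食べている count as one type
# (fugashi + UniDic are assumed tools; the paper does not name its analyzer).
from fugashi import Tagger

tagger = Tagger()  # requires the unidic-lite (or unidic) dictionary package

def lemma_types(text: str) -> set:
    lemmas = set()
    for word in tagger(text):
        lemmas.add(word.feature.lemma or word.surface)  # fall back to the surface form
    return lemmas

print(lemma_types("パンを食べた。今もパンを食べている。"))
```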

To incorporate these features, previous research (Suzuki, 1999 ; Watanabe et al. 1988 ; Ishioka, 2001 ; Ishioka and Kameda, 2006 ; Hirao et al. 2020 ) has identified complexity, fluency, and accuracy as crucial factors for evaluating writing quality. These criteria are assessed through various aspects, including lexical richness (lexical density, diversity, and sophistication), syntactic complexity, and cohesion (Kyle et al. 2021 ; Mizumoto and Eguchi, 2023 ; Ure, 1971 ; Halliday, 1985 ; Barkaoui and Hadidi, 2020 ; Zenker and Kyle, 2021 ; Kim et al. 2018 ; Lu, 2017 ; Ortega, 2015 ). Therefore, this study proposes five scoring categories: lexical richness, syntactic complexity, cohesion, content elaboration, and grammatical accuracy. A total of 16 measures were employed to capture these categories. The calculation process and specific details of these measures can be found in Table 2 .

T-unit, first introduced by Hunt ( 1966 ), is a measure used for evaluating speech and composition. It serves as an indicator of syntactic development and represents the shortest units into which a piece of discourse can be divided without leaving any sentence fragments. In the context of Japanese language assessment, Sakoda and Hosoi ( 2020 ) utilized T-unit as the basic unit to assess the accuracy and complexity of Japanese learners’ speaking and storytelling. The calculation of T-units in Japanese follows the following principles:

A single main clause constitutes 1 T-unit, regardless of the presence or absence of dependent clauses, e.g. (6).

ケンとマリはピクニックに行きました (main clause): 1 T-unit.

If a sentence contains a main clause along with subclauses, each subclause is considered part of the same T-unit, e.g. (7).

天気が良かった の で (subclause)、ケンとマリはピクニックに行きました (main clause): 1 T-unit.

In the case of coordinate clauses, where multiple clauses are connected, each coordinated clause is counted separately. Thus, a sentence with coordinate clauses may have 2 T-units or more, e.g. (8).

ケンは地図で場所を探して (coordinate clause)、マリはサンドイッチを作りました (coordinate clause): 2 T-units.

Lexical diversity refers to the range of words used within a text (Engber, 1995 ; Kyle et al. 2021 ) and is considered a useful measure of the breadth of vocabulary in L n production (Jarvis, 2013a , 2013b ).

The type/token ratio (TTR) is widely recognized as a straightforward measure for calculating lexical diversity and has been employed in numerous studies. These studies have demonstrated a strong correlation between TTR and other methods of measuring lexical diversity (e.g., Bentz et al. 2016 ; Čech and Miroslav, 2018 ; Çöltekin and Taraka, 2018 ). TTR is computed by considering both the number of unique words (types) and the total number of words (tokens) in a given text. Given that the length of learners’ writing texts can vary, this study employs the moving average type-token ratio (MATTR) to mitigate the influence of text length. MATTR is calculated using a 50-word moving window. Initially, a TTR is determined for words 1–50 in an essay, followed by words 2–51, 3–52, and so on until the end of the essay is reached (Díez-Ortega and Kyle, 2023 ). The final MATTR scores were obtained by averaging the TTR scores for all 50-word windows. The following formula was employed to derive MATTR:

\({\rm{MATTR}}({\rm{W}})=\frac{{\sum }_{{\rm{i}}=1}^{{\rm{N}}-{\rm{W}}+1}{{\rm{F}}}_{{\rm{i}}}}{{\rm{W}}({\rm{N}}-{\rm{W}}+1)}\)

Here, N refers to the number of tokens in the corpus. W is the randomly selected token size (W < N). \({F}_{i}\) is the number of types in each window. The \({\rm{MATTR}}({\rm{W}})\) is the mean of a series of type-token ratios (TTRs) based on the word form for all windows. It is expected that individuals with higher language proficiency will produce texts with greater lexical diversity, as indicated by higher MATTR scores.
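The windowed computation is straightforward to implement; the sketch below assumes the essay has already been tokenized into a list of words.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over a fixed-size window (50 words by default)."""
    if len(tokens) < window:  # fall back to a plain TTR for essays shorter than the window
        return len(set(tokens)) / len(tokens)
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)
```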

Lexical density was captured by the ratio of the number of lexical words to the total number of words (Lu, 2012 ). Lexical sophistication refers to the utilization of advanced vocabulary, often evaluated through word frequency indices (Crossley et al. 2013 ; Haberman, 2008 ; Kyle and Crossley, 2015 ; Laufer and Nation, 1995 ; Lu, 2012 ; Read, 2000 ). In line of writing, lexical sophistication can be interpreted as vocabulary breadth, which entails the appropriate usage of vocabulary items across various lexicon-grammatical contexts and registers (Garner et al. 2019 ; Kim et al. 2018 ; Kyle et al. 2018 ). In Japanese specifically, words are considered lexically sophisticated if they are not included in the “Japanese Education Vocabulary List Ver 1.0”. Footnote 4 Consequently, lexical sophistication was calculated by determining the number of sophisticated word types relative to the total number of words per essay. Furthermore, it has been suggested that, in Japanese writing, sentences should ideally have a length of no more than 40 to 50 characters, as this promotes readability. Therefore, the median and maximum sentence length can be considered as useful indices for assessment (Ishioka and Kameda, 2006 ).
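Both ratios reduce to simple counts once the essay has been morphologically tagged. The sketch below assumes a list of (lemma, part-of-speech) pairs and a set of lemmas loaded from the Japanese Education Vocabulary List; the part-of-speech labels are illustrative.

```python
# Lexical density and sophistication as simple ratios (the tagged input and the
# loaded vocabulary list are assumptions about preprocessing).
LEXICAL_POS = {"名詞", "動詞", "形容詞", "副詞"}  # nouns, verbs, adjectives, adverbs

def lexical_density(tagged_tokens):
    lexical = [lemma for lemma, pos in tagged_tokens if pos in LEXICAL_POS]
    return len(lexical) / len(tagged_tokens)

def lexical_sophistication(tagged_tokens, basic_vocab):
    sophisticated_types = {lemma for lemma, _ in tagged_tokens if lemma not in basic_vocab}
    return len(sophisticated_types) / len(tagged_tokens)
```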

Syntactic complexity was assessed based on several measures, including the mean length of clauses, verb phrases per T-unit, clauses per T-unit, dependent clauses per T-unit, complex nominals per clause, adverbial clauses per clause, coordinate phrases per clause, and mean dependency distance (MDD). The MDD reflects the distance between the governor and dependent positions in a sentence. A larger dependency distance indicates a higher cognitive load and greater complexity in syntactic processing (Liu, 2008 ; Liu et al. 2017 ). The MDD has been established as an efficient metric for measuring syntactic complexity (Jiang, Quyang, and Liu, 2019 ; Li and Yan, 2021 ). To calculate the MDD, the position numbers of the governor and dependent are subtracted, assuming that words in a sentence are assigned in a linear order, such as W1 … Wi … Wn. In any dependency relationship between words Wa and Wb, Wa is the governor and Wb is the dependent. The MDD of the entire sentence was obtained by taking the absolute value of governor – dependent:

MDD = \(\frac{1}{n}{\sum }_{i=1}^{n}|{\rm{D}}{{\rm{D}}}_{i}|\)

In this formula, \(n\) represents the number of words in the sentence, and \({DD}_{i}\) is the dependency distance of the \(i\)-th dependency relationship of the sentence. As an example, for the sentence ‘Mary-ga John-ni keshigomu-o watashita’ [Mary-nom John-dat eraser-acc give-past], the MDD would be 2. Table 3 provides the CSV file used as a prompt for GPT-4.
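Given a dependency parse in which each word stores the position of its governor, the MDD is the mean absolute difference between the two positions. The sketch below reproduces the worked example; the head indices are illustrative annotations, with the verb treated as the root.

```python
def mean_dependency_distance(heads):
    """heads[i] is the 1-based position of the governor of word i+1; 0 marks the root."""
    distances = [abs(head - (i + 1)) for i, head in enumerate(heads) if head != 0]
    return sum(distances) / len(distances)

# "Mary-ga John-ni keshigomu-o watashita": words 1-3 depend on the verb in position 4.
print(mean_dependency_distance([4, 4, 4, 0]))  # (3 + 2 + 1) / 3 = 2.0
```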

Cohesion (semantic similarity) and content elaboration aim to capture the ideas presented in test taker’s essays. Cohesion was assessed using three measures: Synonym overlap/paragraph (topic), Synonym overlap/paragraph (keywords), and word2vec cosine similarity. Content elaboration and development were measured as the number of metadiscourse markers (type)/number of words. To capture content closely, this study proposed a novel-distance based representation, by encoding the cosine distance between the essay (by learner) and essay task’s (topic and keyword) i -vectors. The learner’s essay is decoded into a word sequence, and aligned to the essay task’ topic and keyword for log-likelihood measurement. The cosine distance reveals the content elaboration score in the leaners’ essay. The mathematical equation of cosine similarity between target-reference vectors is shown in (11), assuming there are i essays and ( L i , …. L n ) and ( N i , …. N n ) are the vectors representing the learner and task’s topic and keyword respectively. The content elaboration distance between L i and N i was calculated as follows:

\(\cos \left(\theta \right)=\frac{{\rm{L}}\,\cdot\, {\rm{N}}}{\left|{\rm{L}}\right|{\rm{|N|}}}=\frac{\mathop{\sum }\nolimits_{i=1}^{n}{L}_{i}{N}_{i}}{\sqrt{\mathop{\sum }\nolimits_{i=1}^{n}{L}_{i}^{2}}\sqrt{\mathop{\sum }\nolimits_{i=1}^{n}{N}_{i}^{2}}}\)

A high similarity value indicates a low difference between the two recognition outcomes, which in turn suggests a high level of proficiency in content elaboration.
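Once the learner essay and the task topic/keywords have been embedded (for example, as averaged word2vec vectors), the cosine similarity itself is a one-line NumPy computation; the embedding step is assumed here.

```python
import numpy as np

def content_elaboration(essay_vec, task_vec):
    """Cosine similarity between the essay vector and the task topic/keyword vector."""
    return float(np.dot(essay_vec, task_vec) /
                 (np.linalg.norm(essay_vec) * np.linalg.norm(task_vec)))
```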

To evaluate the effectiveness of the proposed measures in distinguishing different proficiency levels among nonnative Japanese speakers’ writing, we conducted a multi-faceted Rasch measurement analysis (Linacre, 1994 ). This approach applies measurement models to thoroughly analyze various factors that can influence test outcomes, including test takers’ proficiency, item difficulty, and rater severity, among others. The underlying principles and functionality of multi-faceted Rasch measurement are illustrated in (12).

\(\log \left(\frac{{P}_{{nijk}}}{{P}_{{nij}(k-1)}}\right)={B}_{n}-{D}_{i}-{C}_{j}-{F}_{k}\)

(12) defines the logarithmic transformation of the probability ratio ( P nijk /P nij(k-1) )) as a function of multiple parameters. Here, n represents the test taker, i denotes a writing proficiency measure, j corresponds to the human rater, and k represents the proficiency score. The parameter B n signifies the proficiency level of test taker n (where n ranges from 1 to N). D j represents the difficulty parameter of test item i (where i ranges from 1 to L), while C j represents the severity of rater j (where j ranges from 1 to J). Additionally, F k represents the step difficulty for a test taker to move from score ‘k-1’ to k . P nijk refers to the probability of rater j assigning score k to test taker n for test item i . P nij(k-1) represents the likelihood of test taker n being assigned score ‘k-1’ by rater j for test item i . Each facet within the test is treated as an independent parameter and estimated within the same reference framework. To evaluate the consistency of scores obtained through both human and computer analysis, we utilized the Infit mean-square statistic. This statistic is a chi-square measure divided by the degrees of freedom and is weighted with information. It demonstrates higher sensitivity to unexpected patterns in responses to items near a person’s proficiency level (Linacre, 2002 ). Fit statistics are assessed based on predefined thresholds for acceptable fit. For the Infit MNSQ, which has a mean of 1.00, different thresholds have been suggested. Some propose stricter thresholds ranging from 0.7 to 1.3 (Bond et al. 2021 ), while others suggest more lenient thresholds ranging from 0.5 to 1.5 (Eckes, 2009 ). In this study, we adopted the criterion of 0.70–1.30 for the Infit MNSQ.

Moving forward, we can now proceed to assess the effectiveness of the 16 proposed measures, based on five criteria, for accurately distinguishing various levels of writing proficiency among non-native Japanese speakers. To conduct this evaluation, we utilized the development dataset from the I-JAS corpus, as described in Section Dataset. Table 4 provides a measurement report that presents the performance details of the 16 metrics under consideration. The measure separation was found to be 4.02, indicating a clear differentiation among the measures. The reliability index for the measure separation was 0.891, suggesting consistency in the measurement. Similarly, the person separation reliability index was 0.802, indicating the accuracy of the assessment in distinguishing between individuals. All 16 measures demonstrated Infit mean squares within a reasonable range, from 0.76 to 1.28. The Synonym overlap/paragraph (topic) measure exhibited a relatively high Outfit mean square of 1.46, although its Infit mean square falls within the acceptable range. The standard errors for the measures ranged from 0.13 to 0.28, indicating the precision of the estimates.

Table 5 further illustrated the weights assigned to different linguistic measures for score prediction, with higher weights indicating stronger correlations between those measures and higher scores. Specifically, the following measures exhibited higher weights compared to others: moving average type token ratio per essay has a weight of 0.0391. Mean dependency distance had a weight of 0.0388. Mean length of clause, calculated by dividing the number of words by the number of clauses, had a weight of 0.0374. Complex nominals per T-unit, calculated by dividing the number of complex nominals by the number of T-units, had a weight of 0.0379. Coordinate phrases rate, calculated by dividing the number of coordinate phrases by the number of clauses, had a weight of 0.0325. Grammatical error rate, representing the number of errors per essay, had a weight of 0.0322.

Criteria (output indicator)

The criteria used to evaluate the writing ability in this study were based on CEFR, which follows a six-point scale ranging from A1 to C2. To assess the quality of Japanese writing, the scoring criteria from Table 6 were utilized. These criteria were derived from the IELTS writing standards and served as assessment guidelines and prompts for the written output.

Prompt

A prompt is a question or detailed instruction that is provided to the model to obtain a proper response. After several pilot experiments, we decided to provide the measures (Section Measures of writing proficiency for nonnative Japanese) as the input prompt and use the criteria (Section Criteria (output indicator)) as the output indicator. Regarding the prompt language, given that the LLM was tasked with rating Japanese essays, would a prompt written in Japanese work better? Footnote 5 We conducted experiments comparing the performance of GPT-4 using both English and Japanese prompts, and additionally used the Japanese local model OCLL with Japanese prompts. Multiple trials were conducted using the same sample. Regardless of the prompt language used, we consistently obtained the same grading results with GPT-4, which assigned a grade of B1 to the writing sample. This suggested that GPT-4 is reliable and capable of producing consistent ratings regardless of the prompt language. In contrast, when we used Japanese prompts with the Japanese local model OCLL, we encountered inconsistent grading results: out of 10 attempts with OCLL, only 6 yielded the same grade (B1), while the remaining 4 showed different outcomes, including A1 and B2 grades. These findings indicated that the language of the prompt was not the determining factor for reliable AES. Instead, the size of the training data and the model parameters played crucial roles in achieving consistent and reliable AES results for the language model.

The following is the utilized prompt, which details all measures and requires the LLM to score the essays using holistic and trait scores.

Please evaluate Japanese essays written by Japanese learners and assign a score to each essay on a six-point scale, ranging from A1, A2, B1, B2, C1 to C2. Additionally, please provide trait scores and display the calculation process for each trait score. The scoring should be based on the following criteria:

Moving average type-token ratio.

Number of lexical words (token) divided by the total number of words per essay.

Number of sophisticated word types divided by the total number of words per essay.

Mean length of clause.

Verb phrases per T-unit.

Clauses per T-unit.

Dependent clauses per T-unit.

Complex nominals per clause.

Adverbial clauses per clause.

Coordinate phrases per clause.

Mean dependency distance.

Synonym overlap paragraph (topic and keywords).

Word2vec cosine similarity.

Connectives per essay.

Conjunctions per essay.

Number of metadiscourse markers (types) divided by the total number of words.

Number of errors per essay.

Japanese essay text

出かける前に二人が地図を見ている間に、サンドイッチを入れたバスケットに犬が入ってしまいました。それに気づかずに二人は楽しそうに出かけて行きました。やがて突然犬がバスケットから飛び出し、二人は驚きました。バスケット の 中を見ると、食べ物はすべて犬に食べられていて、二人は困ってしまいました。(ID_JJJ01_SW1)

The score of the example above was B1. Figure 3 provides an example of holistic and trait scores provided by GPT-4 (with a prompt indicating all measures) via Bing Footnote 6 .

Figure 3. Example of GPT-4 AES and feedback (with a prompt indicating all measures).
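In practice, the rubric prompt and each essay are sent to the model programmatically. The snippet below is a minimal sketch using the OpenAI Python client; the model name, message structure, and temperature setting are assumptions, and an API key must be configured.

```python
# Minimal sketch of sending the scoring rubric and one essay to GPT-4
# (model name, message roles, and temperature are assumptions; needs OPENAI_API_KEY).
from openai import OpenAI

client = OpenAI()
rubric_prompt = "Please evaluate Japanese essays written by Japanese learners ..."  # full rubric as above
essay = "出かける前に二人が地図を見ている間に、..."  # learner essay text

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # deterministic decoding for consistent ratings
    messages=[
        {"role": "system", "content": rubric_prompt},
        {"role": "user", "content": essay},
    ],
)
print(response.choices[0].message.content)  # holistic CEFR grade plus trait scores
```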

Statistical analysis

The aim of this study is to investigate the potential use of LLM for nonnative Japanese AES. It seeks to compare the scoring outcomes obtained from feature-based AES tools, which rely on conventional machine learning technology (i.e. Jess, JWriter), with those generated by AI-driven AES tools utilizing deep learning technology (BERT, GPT, OCLL). To assess the reliability of a computer-assisted annotation tool, the study initially established human-human agreement as the benchmark measure. Subsequently, the performance of the LLM-based method was evaluated by comparing it to human-human agreement.

To assess annotation agreement, the study employed standard measures such as precision, recall, and F-score (Brants 2000 ; Lu 2010 ), along with the quadratically weighted kappa (QWK) to evaluate the consistency and agreement in the annotation process. Assume A and B represent human annotators. When comparing the annotations of the two annotators, the following results are obtained. The evaluation of precision, recall, and F-score metrics was illustrated in equations (13) to (15).

\({\rm{Recall}}(A,B)=\frac{{\rm{Number}}\,{\rm{of}}\,{\rm{identical}}\,{\rm{nodes}}\,{\rm{in}}\,A\,{\rm{and}}\,B}{{\rm{Number}}\,{\rm{of}}\,{\rm{nodes}}\,{\rm{in}}\,A}\)

\({\rm{Precision}}(A,\,B)=\frac{{\rm{Number}}\,{\rm{of}}\,{\rm{identical}}\,{\rm{nodes}}\,{\rm{in}}\,A\,{\rm{and}}\,B}{{\rm{Number}}\,{\rm{of}}\,{\rm{nodes}}\,{\rm{in}}\,B}\)

The F-score is the harmonic mean of recall and precision:

\({\rm{F}}-{\rm{score}}=\frac{2* ({\rm{Precision}}* {\rm{Recall}})}{{\rm{Precision}}+{\rm{Recall}}}\)

The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either precision or recall are zero.

In accordance with Taghipour and Ng ( 2016 ), the calculation of QWK involves two steps:

Step 1: Construct a weight matrix W as follows:

\({W}_{{ij}}=\frac{{(i-j)}^{2}}{{(N-1)}^{2}}\)

i represents the annotation made by the tool, while j represents the annotation made by a human rater. N denotes the total number of possible annotations. Matrix O is subsequently computed, where O_( i, j ) represents the count of data annotated by the tool ( i ) and the human annotator ( j ). On the other hand, E refers to the expected count matrix, which undergoes normalization to ensure that the sum of elements in E matches the sum of elements in O.

Step 2: With matrices O and E, the QWK is obtained as follows:

\(K=1-\frac{{\sum }_{i,j}{W}_{i,j}{O}_{i,j}}{{\sum }_{i,j}{W}_{i,j}{E}_{i,j}}\)
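For reference, the same statistic can be obtained with scikit-learn's quadratically weighted Cohen's kappa, which is a convenient cross-check on a hand-rolled implementation; the grade-to-integer mapping and example values below are illustrative.

```python
from sklearn.metrics import cohen_kappa_score

# CEFR grades mapped to integers (A1=0, ..., C2=5); the values are illustrative.
human_grades = [2, 3, 1, 4, 2, 5]
model_grades = [2, 3, 2, 4, 2, 5]
qwk = cohen_kappa_score(human_grades, model_grades, weights="quadratic")
print(round(qwk, 3))
```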

The value of the quadratic weighted kappa increases as the level of agreement improves. Further, to assess the accuracy of LLM scoring, the proportional reductive mean square error (PRMSE) was employed. The PRMSE approach takes into account the variability observed in human ratings to estimate the rater error, which is then subtracted from the variance of the human labels. This calculation provides an overall measure of agreement between the automated scores and true scores (Haberman et al. 2015 ; Loukina et al. 2020 ; Taghipour and Ng, 2016 ). The computation of PRMSE involves the following steps:

Step 1: Calculate the mean squared errors (MSEs) for the scoring outcomes of the computer-assisted tool (MSE tool) and the human scoring outcomes (MSE human).

Step 2: Determine the PRMSE by comparing the MSE of the computer-assisted tool (MSE tool) with the MSE from human raters (MSE human), using the following formula:

\({\rm{PRMSE}}=1-\frac{{\rm{MSE}}_{{\rm{tool}}}}{{\rm{MSE}}_{{\rm{human}}}}=1-\frac{{\sum }_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}}{{\sum }_{i=1}^{n}{({y}_{i}-\bar{y})}^{2}}\)

In the numerator, ŷi represents the scoring outcome predicted by a specific LLM-driven AES system for a given sample. The term y i − ŷ i represents the difference between this predicted outcome and the mean value of all LLM-driven AES systems’ scoring outcomes. It quantifies the deviation of the specific LLM-driven AES system’s prediction from the average prediction of all LLM-driven AES systems. In the denominator, y i − ŷ represents the difference between the scoring outcome provided by a specific human rater for a given sample and the mean value of all human raters’ scoring outcomes. It measures the discrepancy between the specific human rater’s score and the average score given by all human raters. The PRMSE is then calculated by subtracting the ratio of the MSE tool to the MSE human from 1. PRMSE falls within the range of 0 to 1, with larger values indicating reduced errors in LLM’s scoring compared to those of human raters. In other words, a higher PRMSE implies that LLM’s scoring demonstrates greater accuracy in predicting the true scores (Loukina et al. 2020 ). The interpretation of kappa values, ranging from 0 to 1, is based on the work of Landis and Koch ( 1977 ). Specifically, the following categories are assigned to different ranges of kappa values: −1 indicates complete inconsistency, 0 indicates random agreement, 0.0 ~ 0.20 indicates extremely low level of agreement (slight), 0.21 ~ 0.40 indicates moderate level of agreement (fair), 0.41 ~ 0.60 indicates medium level of agreement (moderate), 0.61 ~ 0.80 indicates high level of agreement (substantial), 0.81 ~ 1 indicates almost perfect level of agreement. All statistical analyses were executed using Python script.
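A simplified version of this computation — using the mean of the human scores as the baseline and omitting the rater-error correction applied in the cited work — can be sketched as follows.

```python
import numpy as np

def prmse(human_scores, tool_scores):
    """Proportional reduction in MSE of the tool relative to predicting the human mean
    (simplified sketch; the rater-error correction of the cited estimator is omitted)."""
    y = np.asarray(human_scores, dtype=float)
    y_hat = np.asarray(tool_scores, dtype=float)
    mse_tool = np.mean((y - y_hat) ** 2)
    mse_baseline = np.mean((y - y.mean()) ** 2)
    return 1.0 - mse_tool / mse_baseline
```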

Results and discussion

Annotation reliability of the LLM

This section focuses on assessing the reliability of the LLM’s annotation and scoring capabilities. To evaluate the reliability, several tests were conducted simultaneously, aiming to achieve the following objectives:

Assess the LLM’s ability to differentiate between test takers with varying levels of writing proficiency.

Determine the level of agreement between the annotations and scoring performed by the LLM and those done by human raters.

The evaluation of the results encompassed several metrics, including: precision, recall, F-Score, quadratically-weighted kappa, proportional reduction of mean squared error, Pearson correlation, and multi-faceted Rasch measurement.

Inter-annotator agreement (human–human annotator agreement)

We started with an agreement test of the two human annotators. Two trained annotators were recruited to determine the writing task data measures. A total of 714 scripts, as the test data, was utilized. Each analysis lasted 300–360 min. Inter-annotator agreement was evaluated using the standard measures of precision, recall, and F-score and QWK. Table 7 presents the inter-annotator agreement for the various indicators. As shown, the inter-annotator agreement was fairly high, with F-scores ranging from 1.0 for sentence and word number to 0.666 for grammatical errors.

The findings from the QWK analysis provided further confirmation of the inter-annotator agreement. The QWK values covered a range from 0.950 ( p  = 0.000) for sentence and word number to 0.695 for synonym overlap number (keyword) and grammatical errors ( p  = 0.001).

Agreement of annotation outcomes between human and LLM

To evaluate the consistency between human annotators and LLM annotators (BERT, GPT, OCLL) across the indices, the same test was conducted. The results of the inter-annotator agreement (F-score) between LLM and human annotation are provided in Appendix B-D. The F-scores ranged from 0.706 for Grammatical error # for OCLL-human to a perfect 1.000 for GPT-human, for sentences, clauses, T-units, and words. These findings were further supported by the QWK analysis, which showed agreement levels ranging from 0.807 ( p  = 0.001) for metadiscourse markers for OCLL-human to 0.962 for words ( p  = 0.000) for GPT-human. The findings demonstrated that the LLM annotation achieved a significant level of accuracy in identifying measurement units and counts.

Reliability of LLM-driven AES’s scoring and discriminating proficiency levels

This section examines the reliability of the LLM-driven AES scoring through a comparison of the scoring outcomes produced by human raters and the LLM ( Reliability of LLM-driven AES scoring ). It also assesses the effectiveness of the LLM-based AES system in differentiating participants with varying proficiency levels ( Reliability of LLM-driven AES discriminating proficiency levels ).

Reliability of LLM-driven AES scoring

Table 8 summarizes the QWK coefficient analysis between the scores computed by the human raters and the GPT-4 for the individual essays from I-JAS Footnote 7 . As shown, the QWK of all measures ranged from k  = 0.819 for lexical density (number of lexical words (tokens)/number of words per essay) to k  = 0.644 for word2vec cosine similarity. Table 9 further presents the Pearson correlations between the 16 writing proficiency measures scored by human raters and GPT 4 for the individual essays. The correlations ranged from 0.672 for syntactic complexity to 0.734 for grammatical accuracy. The correlations between the writing proficiency scores assigned by human raters and the BERT-based AES system were found to range from 0.661 for syntactic complexity to 0.713 for grammatical accuracy. The correlations between the writing proficiency scores given by human raters and the OCLL-based AES system ranged from 0.654 for cohesion to 0.721 for grammatical accuracy. These findings indicated an alignment between the assessments made by human raters and both the BERT-based and OCLL-based AES systems in terms of various aspects of writing proficiency.

Reliability of LLM-driven AES discriminating proficiency levels

After validating the reliability of the LLM’s annotation and scoring, the subsequent objective was to evaluate its ability to distinguish between various proficiency levels. For this analysis, a dataset of 686 individual essays was utilized. Table 10 presents a sample of the results, summarizing the means, standard deviations, and the outcomes of the one-way ANOVAs based on the measures assessed by the GPT-4 model. A post hoc multiple comparison test, specifically the Bonferroni test, was conducted to identify any potential differences between pairs of levels.
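The level comparisons reported here follow a standard recipe: a one-way ANOVA per measure, followed by Bonferroni-corrected pairwise tests. The sketch below uses SciPy and statsmodels, which is an assumption — the paper only states that Python scripts were used — and the group values are illustrative.

```python
# One-way ANOVA across proficiency groups for a single measure, followed by
# Bonferroni-corrected pairwise t-tests (SciPy/statsmodels assumed; toy data).
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

groups = {
    "primary":      [0.61, 0.58, 0.63, 0.60],
    "intermediate": [0.66, 0.69, 0.64, 0.67],
    "advanced":     [0.72, 0.74, 0.71, 0.73],
}

f_stat, p_value = stats.f_oneway(*groups.values())

pairs = list(combinations(groups, 2))
raw_p = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]
reject, corrected_p, _, _ = multipletests(raw_p, method="bonferroni")
print(f_stat, p_value, dict(zip(pairs, corrected_p.round(4))))
```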

As the results reveal, seven measures presented linear upward or downward progress across the three proficiency levels. These were marked in bold in Table 10 and comprise one measure of lexical richness, i.e. MATTR (lexical diversity); four measures of syntactic complexity, i.e. MDD (mean dependency distance), MLC (mean length of clause), CNT (complex nominals per T-unit), CPC (coordinate phrases rate); one cohesion measure, i.e. word2vec cosine similarity and GER (grammatical error rate). Regarding the ability of the sixteen measures to distinguish adjacent proficiency levels, the Bonferroni tests indicated that statistically significant differences exist between the primary level and the intermediate level for MLC and GER. One measure of lexical richness, namely LD, along with three measures of syntactic complexity (VPT, CT, DCT, ACC), two measures of cohesion (SOPT, SOPK), and one measure of content elaboration (IMM), exhibited statistically significant differences between proficiency levels. However, these differences did not demonstrate a linear progression between adjacent proficiency levels. No significant difference was observed in lexical sophistication between proficiency levels.

To summarize, this part of the study evaluated the reliability and the discrimination capability of the LLM-driven AES method. For reliability, we first assessed the accuracy of the LLM's annotation against human annotation using precision, recall, F-score, and quadratically weighted kappa, and then compared the scoring outcomes generated by human raters and the LLM, using quadratically weighted kappa and Pearson correlations across the 16 writing proficiency measures for the individual essays. For discrimination, we examined whether the LLM-derived measures distinguish test takers at different writing proficiency levels. The results confirm the feasibility of using the LLM for annotation and scoring in AES for nonnative Japanese, thereby addressing Research Question 1.

Comparison of BERT-, GPT-, OCLL-based AES, and linguistic-feature-based computation methods

This section compares the effectiveness of five AES methods for nonnative Japanese writing: the LLM-driven approaches using BERT, GPT, and OCLL, and the linguistic-feature-based approaches using Jess and JWriter. The comparison was conducted by relating the ratings obtained from each approach to human ratings. All ratings were derived from the dataset introduced in Dataset . Agreement between the automated methods and human ratings was assessed using QWK and PRMSE. The performance of each approach is summarized in Table 11 .

The QWK coefficients indicate that the LLM-based methods (GPT, BERT, OCLL) agreed more closely with human ratings than the feature-based AES methods (Jess and JWriter) across the writing proficiency criteria, including lexical richness, syntactic complexity, content, and grammatical accuracy. Among the LLMs, the GPT-4-driven AES showed the highest agreement with human ratings on all criteria except syntactic complexity. The PRMSE values likewise suggest that the GPT-based method outperformed both the linguistic-feature-based methods and the other LLM-based approaches.

An interesting finding emerged during the study: the agreement coefficient between GPT-4 and human scoring was even higher than the agreement between different human raters themselves, which highlights an advantage of GPT-based AES over human rating. Rating involves a chain of processes, including reading the learners' writing, evaluating the content and language, and assigning scores, and various biases can be introduced along this chain, stemming from factors such as rater bias, test design, and rating scales. These biases can affect the consistency and objectivity of human ratings. GPT-based AES, by contrast, applies consistent evaluation criteria: by prompting the GPT model with detailed scoring rubrics and linguistic features, potential biases in human ratings can be mitigated, since the model follows a predefined set of guidelines and does not carry the subjective biases that individual human raters may exhibit. This standardization of the evaluation process contributes to the higher agreement observed between GPT-4 and human scoring. Section Prompt strategy examines further how the choice and implementation of prompts affect the performance and reliability of LLM-based AES. Finally, it is worth acknowledging the strengths of the local Japanese model OCLL, which excels at processing certain idiomatic expressions; nevertheless, our analysis indicated that GPT-4 surpasses the local models in AES. This superior performance can be attributed to the larger parameter size of GPT-4, estimated to be between 500 billion and 1 trillion, which exceeds the sizes of both BERT and OCLL.
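
To clarify what PRMSE measures, the sketch below implements a simplified two-rater version of the proportional-reduction-in-mean-squared-error idea described by Loukina et al. (2020): it estimates how much of the true-score variance the automated scores explain after discounting human rater error. The full estimator (e.g., as implemented in RSMTool) also handles unequal numbers of ratings per essay; all values here are hypothetical.

```python
# Hedged, simplified PRMSE sketch for the two-human-rater case (toy data).
import numpy as np

h1 = np.array([3, 4, 2, 5, 3, 4], dtype=float)      # first human rating per essay
h2 = np.array([3, 5, 2, 4, 3, 4], dtype=float)      # second human rating per essay
system = np.array([3.4, 4.0, 2.6, 4.9, 3.3, 3.6])   # automated (e.g., GPT-4) scores

h_mean = (h1 + h2) / 2
rater_error_var = np.mean((h1 - h2) ** 2) / 2            # per-rating error variance
true_score_var = np.var(h_mean, ddof=1) - rater_error_var / 2   # must stay positive
mse_vs_true = np.mean((system - h_mean) ** 2) - rater_error_var / 2

prmse = 1 - mse_vs_true / true_score_var
print(f"PRMSE = {prmse:.3f}")
```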

Prompt strategy

In the context of prompt strategy, Mizumoto and Eguchi ( 2023 ) applied the GPT-3 model to automatically score English essays from the TOEFL test. They found that the accuracy of the GPT model alone was moderate to fair, but that incorporating linguistic measures such as cohesion, syntactic complexity, and lexical features alongside the GPT model significantly improved accuracy. This highlights the importance of prompt engineering and of providing the model with specific instructions to enhance its performance. In this study, a similar approach was taken to optimize the performance of the LLMs. GPT-4, which outperformed BERT and OCLL, was selected as the candidate model. Model 1 served as the baseline, representing GPT-4 without any additional prompting. Model 2 involved GPT-4 prompted with all 16 measures, including the scoring criteria, linguistic features relevant to writing assessment, and detailed measurement units and calculation formulas. The remaining models (Models 3 to 18) used GPT-4 prompted with individual measures. The performance of these 18 models was assessed using the output indicators described in Section Criteria (output indicator) . By comparing their performances, the study aimed to understand the impact of prompt engineering on the accuracy and effectiveness of GPT-4 in AES tasks.
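
The prompting setup can be pictured with a short, purely illustrative sketch using the OpenAI Python client. The rubric text, model name, and essay placeholder below are assumptions for demonstration and do not reproduce the study's actual prompts.

```python
# Illustrative only: sending a measure-specific scoring prompt to GPT-4.
# Assumes OPENAI_API_KEY is set; rubric wording and scale are placeholders.
from openai import OpenAI

client = OpenAI()

rubric = (
    "Score the following Japanese essay on a 1-6 scale for syntactic complexity. "
    "Consider mean length of clause, clauses per T-unit, and mean dependency distance. "
    "Return only the numeric score."
)
essay_text = "..."  # a learner essay from the corpus would be inserted here

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an experienced rater of L2 Japanese writing."},
        {"role": "user", "content": f"{rubric}\n\nEssay:\n{essay_text}"},
    ],
    temperature=0,  # deterministic output helps scoring consistency
)
print(response.choices[0].message.content)
```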

Based on the PRMSE scores presented in Fig. 4 , it was observed that Model 1, representing GPT-4 without any additional prompting, achieved a fair level of performance. However, Model 2, which utilized GPT-4 prompted with all measures, outperformed all other models in terms of PRMSE score, achieving a score of 0.681. These results indicate that the inclusion of specific measures and prompts significantly enhanced the performance of GPT-4 in AES. Among the measures, syntactic complexity was found to play a particularly significant role in improving the accuracy of GPT-4 in assessing writing quality. Following that, lexical diversity emerged as another important factor contributing to the model’s effectiveness. The study suggests that a well-prompted GPT-4 can serve as a valuable tool to support human assessors in evaluating writing quality. By utilizing GPT-4 as an automated scoring tool, the evaluation biases associated with human raters can be minimized. This has the potential to empower teachers by allowing them to focus on designing writing tasks and guiding writing strategies, while leveraging the capabilities of GPT-4 for efficient and reliable scoring.

Fig. 4: PRMSE scores of the 18 AES models.

This study aimed to investigate two main research questions: the feasibility of utilizing LLMs for AES and the impact of prompt engineering on the application of LLMs in AES.

To address the first objective, the study compared the effectiveness of five different models: GPT, BERT, the Japanese local LLM (OCLL), and two conventional machine learning-based AES tools (Jess and JWriter). The PRMSE values indicated that the GPT-4-based method outperformed other LLMs (BERT, OCLL) and linguistic feature-based computational methods (Jess and JWriter) across various writing proficiency criteria. Furthermore, the agreement coefficient between GPT-4 and human scoring surpassed the agreement among human raters themselves, highlighting the potential of using the GPT-4 tool to enhance AES by reducing biases and subjectivity, saving time, labor, and cost, and providing valuable feedback for self-study. Regarding the second goal, the role of prompt design was investigated by comparing 18 models, including a baseline model, a model prompted with all measures, and 16 models prompted with one measure at a time. GPT-4, which outperformed BERT and OCLL, was selected as the candidate model. The PRMSE scores of the models showed that GPT-4 prompted with all measures achieved the best performance, surpassing the baseline and other models.

In conclusion, this study has demonstrated the potential of LLMs in supporting human rating in assessments. By incorporating automation, we can save time and resources while reducing biases and subjectivity inherent in human rating processes. Automated language assessments offer the advantage of accessibility, providing equal opportunities and economic feasibility for individuals who lack access to traditional assessment centers or necessary resources. LLM-based language assessments provide valuable feedback and support to learners, aiding in the enhancement of their language proficiency and the achievement of their goals. This personalized feedback can cater to individual learner needs, facilitating a more tailored and effective language-learning experience.

Three areas merit further exploration. First, prompt engineering requires attention to ensure optimal performance of LLM-based AES across different language types. This study showed that GPT-4, when prompted with all measures, outperformed models prompted with fewer measures, so investigating and refining prompt strategies can further enhance the effectiveness of LLMs in automated language assessment. Second, the application of LLMs to second-language assessment and learning of oral proficiency deserves attention. Recent advancements in self-supervised machine learning have significantly improved automatic speech recognition (ASR) systems, yet challenges persist: automatic pronunciation evaluation assumes correct word pronunciation, which is problematic for learners in the early stages of acquisition whose accents are shaped by their native languages and which makes accurate segmentation of short words difficult; developing precise audio-text transcriptions for non-native accented speech remains a formidable task; and assessing oral proficiency requires capturing linguistic features such as fluency, pronunciation, accuracy, and complexity that current NLP technology does not easily capture. Third, the potential of LLM-based assessment for under-resourced languages, where training data are limited, remains to be explored.

Data availability

The dataset utilized was obtained from the International Corpus of Japanese as a Second Language (I-JAS). The data URLs: [ https://www2.ninjal.ac.jp/jll/lsaj/ihome2.html ].

J-CAT and TTBJ are two computerized adaptive tests used to assess Japanese language proficiency.

SPOT is a specific component of the TTBJ test.

J-CAT: https://www.j-cat2.org/html/ja/pages/interpret.html

SPOT: https://ttbj.cegloc.tsukuba.ac.jp/p1.html#SPOT .

The study utilized a prompt-based GPT-4 model developed by OpenAI, reported to have an architecture with approximately 1.8 trillion parameters across 120 layers. GPT-4 was trained on a vast dataset of roughly 13 trillion tokens in two stages: initial training on internet text datasets to predict the next token, and subsequent fine-tuning through reinforcement learning from human feedback.

https://www2.ninjal.ac.jp/jll/lsaj/ihome2-en.html .

http://jhlee.sakura.ne.jp/JEV/ by Japanese Learning Dictionary Support Group 2015.

We express our sincere gratitude to the reviewer for bringing this matter to our attention.

On February 7, 2023, Microsoft began rolling out a major overhaul to Bing that included a new chatbot feature based on OpenAI’s GPT-4 (Bing.com).

Appendix E-F present the analysis results of the QWK coefficient between the scores computed by the human raters and the BERT, OCLL models.

Attali Y, Burstein J (2006) Automated essay scoring with e-rater® V.2. J. Technol., Learn. Assess., 4

Barkaoui K, Hadidi A (2020) Assessing Change in English Second Language Writing Performance (1st ed.). Routledge, New York. https://doi.org/10.4324/9781003092346

Bentz C, Ruzsics T, Koplenig A, Samardžić T (2016) A comparison between morphological complexity measures: Typological data vs. language corpora. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), 142–153. Osaka, Japan: The COLING 2016 Organizing Committee

Bond TG, Yan Z, Heene M (2021) Applying the Rasch model: Fundamental measurement in the human sciences (4th ed). Routledge

Brants T (2000) Inter-annotator agreement for a German newspaper corpus. Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00), Athens, Greece, 31 May-2 June, European Language Resources Association

Brown TB, Mann B, Ryder N, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems, Online, 6–12 December, Curran Associates, Inc., Red Hook, NY

Burstein J (2003) The E-rater scoring engine: Automated essay scoring with natural language processing. In Shermis MD and Burstein JC (ed) Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, NJ

Čech R, Kubát M (2018) Morphological richness of text. In Fidler M, Cvrček V (ed) Taming the corpus: From inflection and lexis to interpretation, 63–77. Cham, Switzerland: Springer Nature

Çöltekin Ç, Rama T (2018) Exploiting Universal Dependencies treebanks for measuring morphosyntactic complexity. In Berdicevskis A, Bentz C (ed), Proceedings of the First Workshop on Measuring Language Complexity, 1–7. Torun, Poland

Crossley SA, Cobb T, McNamara DS (2013) Comparing count-based and band-based indices of word frequency: Implications for active vocabulary research and pedagogical applications. System 41:965–981. https://doi.org/10.1016/j.system.2013.08.002


Crossley SA, McNamara DS (2016) Say more and be more coherent: How text elaboration and cohesion can increase writing quality. J. Writ. Res. 7:351–370

CyberAgent Inc (2023) Open-Calm series of Japanese language models. Retrieved from: https://www.cyberagent.co.jp/news/detail/id=28817

Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota, 2–7 June, pp. 4171–4186. Association for Computational Linguistics

Diez-Ortega M, Kyle K (2023) Measuring the development of lexical richness of L2 Spanish: a longitudinal learner corpus study. Studies in Second Language Acquisition 1-31

Eckes T (2009) On common ground? How raters perceive scoring criteria in oral proficiency testing. In Brown A, Hill K (ed) Language testing and evaluation 13: Tasks and criteria in performance assessment (pp. 43–73). Peter Lang Publishing

Elliot S (2003) IntelliMetric: from here to validity. In: Shermis MD, Burstein JC (ed) Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, NJ


Engber CA (1995) The relationship of lexical proficiency to the quality of ESL compositions. J. Second Lang. Writ. 4:139–155

Garner J, Crossley SA, Kyle K (2019) N-gram measures and L2 writing proficiency. System 80:176–187. https://doi.org/10.1016/j.system.2018.12.001

Haberman SJ (2008) When can subscores have value? J. Educat. Behav. Stat., 33:204–229

Haberman SJ, Yao L, Sinharay S (2015) Prediction of true test scores from observed item scores and ancillary data. Brit. J. Math. Stat. Psychol. 68:363–385

Halliday MAK (1985) Spoken and Written Language. Deakin University Press, Melbourne, Australia

Hirao R, Arai M, Shimanaka H et al. (2020) Automated essay scoring system for nonnative Japanese learners. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 1250–1257. European Language Resources Association

Hunt KW (1966) Recent Measures in Syntactic Development. Elementary English, 43(7), 732–739. http://www.jstor.org/stable/41386067

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput. 9(8):1735–1780

Ishioka T (2001) About e-rater, a computer-based automatic scoring system for essays [Konpyūta ni yoru essei no jidō saiten shisutemu e-rater ni tsuite]. University Entrance Examination Forum [Daigaku nyūshi fōramu] 24:71–76

Ishioka T, Kameda M (2006) Automated Japanese essay scoring system based on articles written by experts. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17–18 July 2006, pp. 233–240. Association for Computational Linguistics, USA

Japan Foundation (2021) Retrieved from: https://www.jpf.gp.jp/j/project/japanese/survey/result/dl/survey2021/all.pdf

Jarvis S (2013a) Defining and measuring lexical diversity. In Jarvis S, Daller M (ed) Vocabulary knowledge: Human ratings and automated measures (Vol. 47, pp. 13–44). John Benjamins. https://doi.org/10.1075/sibil.47.03ch1

Jarvis S (2013b) Capturing the diversity in lexical diversity. Lang. Learn. 63:87–106. https://doi.org/10.1111/j.1467-9922.2012.00739.x

Jiang J, Ouyang J, Liu H (2019) Interlanguage: A perspective of quantitative linguistic typology. Lang. Sci. 74:85–97

Kim M, Crossley SA, Kyle K (2018) Lexical sophistication as a multidimensional phenomenon: Relations to second language lexical proficiency, development, and writing quality. Mod. Lang. J. 102(1):120–141. https://doi.org/10.1111/modl.12447

Kojima T, Gu S, Reid M et al. (2022) Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, New Orleans, LA, 29 November-1 December, Curran Associates, Inc., Red Hook, NY

Kyle K, Crossley SA (2015) Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Q 49:757–786

Kyle K, Crossley SA, Berger CM (2018) The tool for the automatic analysis of lexical sophistication (TAALES): Version 2.0. Behav. Res. Methods 50:1030–1046. https://doi.org/10.3758/s13428-017-0924-4


Kyle K, Crossley SA, Jarvis S (2021) Assessing the validity of lexical diversity using direct judgements. Lang. Assess. Q. 18:154–170. https://doi.org/10.1080/15434303.2020.1844205

Landauer TK, Laham D, Foltz PW (2003) Automated essay scoring and annotation of essays with the Intelligent Essay Assessor. In Shermis MD, Burstein JC (ed), Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, NJ

Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174

Laufer B, Nation P (1995) Vocabulary size and use: Lexical richness in L2 written production. Appl. Linguist. 16:307–322. https://doi.org/10.1093/applin/16.3.307

Lee J, Hasebe Y (2017) jWriter Learner Text Evaluator, URL: https://jreadability.net/jwriter/

Lee J, Kobayashi N, Sakai T, Sakota K (2015) A Comparison of SPOT and J-CAT Based on Test Analysis [Tesuto bunseki ni motozuku ‘SPOT’ to ‘J-CAT’ no hikaku]. Research on the Acquisition of Second Language Japanese [Dainigengo to shite no nihongo no shūtoku kenkyū] (18) 53–69

Li W, Yan J (2021) Probability distribution of dependency distance based on a treebank of Japanese EFL learners' interlanguage. J. Quant. Linguist. 28(2):172–186. https://doi.org/10.1080/09296174.2020.1754611


Linacre JM (2002) Optimizing rating scale category effectiveness. J. Appl. Meas. 3(1):85–106


Linacre JM (1994) Constructing measurement with a Many-Facet Rasch Model. In Wilson M (ed) Objective measurement: Theory into practice, Volume 2 (pp. 129–144). Norwood, NJ: Ablex

Liu H (2008) Dependency distance as a metric of language comprehension difficulty. J. Cognitive Sci. 9:159–191

Liu H, Xu C, Liang J (2017) Dependency distance: A new perspective on syntactic patterns in natural languages. Phys. Life Rev. 21. https://doi.org/10.1016/j.plrev.2017.03.002

Loukina A, Madnani N, Cahill A, et al. (2020) Using PRMSE to evaluate automated scoring systems in the presence of label noise. Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, Seattle, WA, USA → Online, 10 July, pp. 18–29. Association for Computational Linguistics

Lu X (2010) Automatic analysis of syntactic complexity in second language writing. Int. J. Corpus Linguist. 15:474–496

Lu X (2012) The relationship of lexical richness to the quality of ESL learners’ oral narratives. Mod. Lang. J. 96:190–208

Lu X (2017) Automated measurement of syntactic complexity in corpus-based L2 writing research and implications for writing assessment. Lang. Test. 34:493–511

Lu X, Hu R (2022) Sense-aware lexical sophistication indices and their relationship to second language writing quality. Behav. Res. Method. 54:1444–1460. https://doi.org/10.3758/s13428-021-01675-6

Ministry of Health, Labor, and Welfare of Japan (2022) Retrieved from: https://www.mhlw.go.jp/stf/newpage_30367.html

Mizumoto A, Eguchi M (2023) Exploring the potential of using an AI language model for automated essay scoring. Res. Methods Appl. Linguist. 3:100050

Okgetheng B, Takeuchi K (2024) Estimating Japanese Essay Grading Scores with Large Language Models. Proceedings of 30th Annual Conference of the Language Processing Society in Japan, March 2024

Ortega L (2015) Second language learning explained? SLA across 10 contemporary theories. In VanPatten B, Williams J (ed) Theories in Second Language Acquisition: An Introduction

Rae JW, Borgeaud S, Cai T, et al. (2021) Scaling Language Models: Methods, Analysis & Insights from Training Gopher. ArXiv, abs/2112.11446

Read J (2000) Assessing vocabulary. Cambridge University Press. https://doi.org/10.1017/CBO9780511732942

Rudner LM, Liang T (2002) Automated Essay Scoring Using Bayes’ Theorem. J. Technol., Learning and Assessment, 1 (2)

Sakoda K, Hosoi Y (2020) Accuracy and complexity of Japanese Language usage by SLA learners in different learning environments based on the analysis of I-JAS, a learners’ corpus of Japanese as L2. Math. Linguist. 32(7):403–418. https://doi.org/10.24701/mathling.32.7_403

Suzuki N (1999) Summary of survey results regarding comprehensive essay questions. Final report of “Joint Research on Comprehensive Examinations for the Aim of Evaluating Applicability to Each Specialized Field of Universities” for 1996-2000 [shōronbun sōgō mondai ni kansuru chōsa kekka no gaiyō. Heisei 8 - Heisei 12-nendo daigaku no kaku senmon bun’ya e no tekisei no hyōka o mokuteki to suru sōgō shiken no arikata ni kansuru kyōdō kenkyū’ saishū hōkoku-sho]. University Entrance Examination Section Center Research and Development Department [Daigaku nyūshi sentā kenkyū kaihatsubu], 21–32

Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, 1–5 November, pp. 1882–1891. Association for Computational Linguistics

Takeuchi K, Ohno M, Motojin K, Taguchi M, Inada Y, Iizuka M, Abo T, Ueda H (2021) Development of essay scoring methods based on reference texts with construction of research-available Japanese essay data. In IPSJ J 62(9):1586–1604

Ure J (1971) Lexical density: A computational technique and some findings. In Coultard M (ed) Talking about Text. English Language Research, University of Birmingham, Birmingham, England

Vaswani A, Shazeer N, Parmar N, et al. (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Long Beach, CA, 4–7 December, pp. 5998–6008, Curran Associates, Inc., Red Hook, NY

Watanabe H, Taira Y, Inoue Y (1988) Analysis of essay evaluation data [Shōronbun hyōka dēta no kaiseki]. Bulletin of the Faculty of Education, University of Tokyo [Tōkyōdaigaku kyōiku gakubu kiyō], Vol. 28, 143–164

Yao S, Yu D, Zhao J, et al. (2023) Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36

Zenker F, Kyle K (2021) Investigating minimum text lengths for lexical diversity indices. Assess. Writ. 47:100505. https://doi.org/10.1016/j.asw.2020.100505

Zhang Y, Warstadt A, Li X, et al. (2021) When do you need billions of words of pretraining data? Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, pp. 1112-1125. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.90


This research was funded by National Foundation of Social Sciences (22BYY186) to Wenchao Li.

Author information

Authors and affiliations.

Department of Japanese Studies, Zhejiang University, Hangzhou, China

Department of Linguistics and Applied Linguistics, Zhejiang University, Hangzhou, China


Contributions

Wenchao Li is in charge of conceptualization, validation, formal analysis, investigation, data curation, visualization and writing the draft. Haitao Liu is in charge of supervision.

Corresponding author

Correspondence to Wenchao Li.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Ethical approval

Ethical approval was not required as the study did not involve human participants.

Informed consent

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplemental material file #1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Li, W., Liu, H. Applying large language models for automated essay scoring for non-native Japanese. Humanit Soc Sci Commun 11 , 723 (2024). https://doi.org/10.1057/s41599-024-03209-9

Download citation

Received : 02 February 2024

Accepted : 16 May 2024

Published : 03 June 2024

DOI : https://doi.org/10.1057/s41599-024-03209-9

