Generative AI in Writing Evaluation: Where We Stand and What Lies Ahead

Abstract

Generative artificial intelligence is reshaping educational assessment; however, its use in high-stakes evaluation of student writing remains contentious. This study proposes an LLM-derived similarity metric as an automated indicator of L2 English writing proficiency: the cosine similarity between the essay-level embedding vector of a student essay and those of expert model texts (e.g., instructor-written benchmark essays). Using a longitudinal design, approximately 35 Japanese university students will produce argumentative essays at three time points over a 15-week semester. Essays will be scored by trained human raters and analyzed for linguistic features, including lexical diversity, syntactic complexity, and cohesion. The author will examine (a) convergent validity, via correlations between the similarity metric and human scores; (b) sensitivity to developmental change, using repeated-measures models; and (c) incremental predictive validity, via hierarchical regression in which the similarity metric is added to models based on surface linguistic features. It is hypothesized that the similarity metric will show strong positive associations with human ratings, detect significant longitudinal gains, and explain unique variance beyond traditional feature-based predictors. If validated, this approach could support scalable diagnostics that complement human judgment and improve the reliability and pedagogical utility of L2 writing assessment.
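
The similarity metric described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the study's implementation: the abstract does not say which embedding model is used or how scores against several benchmark texts are combined, so the toy three-dimensional vectors, the function names, and the choice to average over benchmarks are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_to_benchmarks(essay_vec, benchmark_vecs):
    """Mean cosine similarity of one essay to a set of benchmark essays.

    Averaging is an assumption made for this sketch; the abstract does not
    specify how multiple expert model texts are aggregated.
    """
    scores = [cosine_similarity(essay_vec, b) for b in benchmark_vecs]
    return sum(scores) / len(scores)


# Toy vectors standing in for essay-level embeddings (hypothetical values).
student_essay = [0.2, 0.7, 0.1]
benchmark_essays = [[0.25, 0.65, 0.05], [0.1, 0.8, 0.2]]

score = similarity_to_benchmarks(student_essay, benchmark_essays)
print(f"similarity score: {score:.3f}")
```

In practice the vectors would come from an LLM embedding model and have hundreds or thousands of dimensions; the computation is otherwise the same. One score per essay per time point would then feed the correlation, repeated-measures, and hierarchical regression analyses the abstract outlines.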

Creators IWANAKA Takahiro

Source Identifiers [EISSN] 2189-4825

Resource Type departmental bulletin paper

Date Issued 2026-03-31

File Version Version of Record

Access Rights open access

Relations

[EISSN]2189-4825