Annotated bibliography on machine grading of essays, Part 1

The following annotated bibliography on machine scoring and evaluation of essay-length writing is based on the 2012 published bibliography in the Journal of Writing Assessment 5 (compiled by Richard Haswell, Whitney Donnelly, Vicki Hester, Peggy O’Neill, and Ellen Schendel).

The bibliography was compiled by reviewing recent scholarship on machine scoring of essays, also referred to as automated essay scoring (AES), using databases such as ERIC and CompPile. Entries were selected for their attention to machine scoring of essays and publication in peer-reviewed venues (with exceptions noted). We also endeavored to cover the breadth of the issues addressed in the research without being overly redundant. We avoided publications that were very narrowly focused on highly technical aspects of assessment. The earliest research — such as Ellis Page’s 1966 piece in Phi Delta Kappan, “The Imminence of Essay Grading by Computer” — is not included because many more recent entries provide a review of the early development of machine scoring.

The bibliography is organized by publication date, with the most recent entries appearing first. Entries that have been excerpted from the published JWA bibliography are indicated by an asterisk.

Klobucar, Andrew, Deane, Paul, Elliot, Norbert, Ramineni, Chaitanya, Deess, Perry & Rudniy, Alex. (2012). Automated essay scoring and the search for valid writing assessment. In Charles Bazerman et al. (Eds.) International Advances in Writing Research: Cultures, Places, Measures (pp. 103-119). Fort Collins, CO: WAC Clearinghouse & Parlor Press.

This chapter reports on an ETS and New Jersey Institute of Technology research collaboration that used Criterion, an integrated instruction and assessment system that includes automated essay scoring. The purpose of the research was “to explore ways in which automated essay scoring might fit within a larger ecology as one among a family of assessment techniques supporting the development of digitally enhanced literacy” (105). The study used scores from multiple writing measures, including the SAT-W, beginning-of-semester impromptu essays scored by Criterion, an essay written over an extended timeline and scored by faculty, end-of-semester portfolios, and course grades. The researchers compare the scores and conclude that, when embedded in a course, AES can be used as “an early warning system for instructors and their students.” The authors also note concerns that over-reliance on AES could result in a fixation on error and on surface features such as length.

Perelman, Les. (2012). Construct validity, length, score, and time in holistically graded writing assessments: The case against automated essay scoring (AES). In Charles Bazerman et al. (Eds.) International Advances in Writing Research: Cultures, Places, Measures (pp. 121-150). Fort Collins, CO: WAC Clearinghouse & Parlor Press.

An accessible critique of the writing tasks (the timed impromptu) and the automated essay scoring process. The author argues that while “the whole enterprise of automated essay scoring claims various kinds of construct validity, the measures it employs substantially fail to represent any reasonable real-world construct of writing ability” (p. 121). He explains how length affects scoring: for short impromptus, length correlates with scores, but once more time is given to write and subjects are known in advance, the influence of length on scores diminishes. He also explains how AES differs from holistic scoring even though both produce a single number: in AES, that number is generated by a set of analytical measures. These individual measures (e.g., word length, sentence length, grammar, and mechanics) are not the same as the construct AES purports to measure (writing ability). The AES program discussed is primarily the ETS e-rater 2.0 system because ETS has been more transparent about it than other AES developers. Perelman draws on his own research into AES, many ETS technical reports, and peer-reviewed research in making his argument.

Bridgeman, Brent, Trapani, Catherine & Attali, Yigal. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education 25(1): 27-40.*

Reports on two studies comparing human and machine scoring in terms of certain sub-populations. The data come from writing samples for two high-stakes exams, the Graduate Record Exam (GRE) and the Test of English as a Foreign Language Internet-based form (TOEFL iBT), both scored with e-rater, an automated essay scoring program. The study uses a large pool of US and international test takers. The authors, all affiliated with the Educational Testing Service, contend that the studies reported here build on the earlier work of Chodorow and Burstein (2004) in three ways: (1) they use a more recent version of e-rater that considers micro-features; (2) the samples include both domestic US subgroups and more comprehensive international test takers; and (3) they identify “some of the features that appear to contribute to the discrepancy between human and machine scores” (29). The authors conclude that while “differences between human and e-rater scores for various ethnic and language or country subgroups are generally not large, they are substantial enough that they should not be ignored” (38). Essays that are slightly off topic tend to get higher scores from e-rater. They also explain that “it appears that, for certain groups, essays that are well organized and developed,” but are flawed in terms of “grammar, usage, and mechanics, tend to get higher scores from e-rater than human scorers” (39).

Cope, Bill, Kalantzis, Mary, McCarthey, Sarah, Vojak, Colleen & Kline, Sonia. (2011). Technology-mediated writing assessments: Principles and processes. Computers and Composition 28, 79–96.

This article is not specifically focused on machine scoring but argues for a more comprehensive approach to the assessment of writing with technology. Meaningful assessment, the authors argue, should “be situated in a knowledge-making practice, draw explicitly on social cognition, measure metacognition, address multimodal texts, be ‘for learning,’ not just ‘of learning,’ be ubiquitous” (p. 81-82). Technology is defined more broadly than merely programs for machine scoring although it does include ways that those types of programs may be incorporated in a more comprehensive approach. It is a companion piece to the authors’ other article in the same issue of Computers and Composition (see below).

Vojak, Colleen, Kline, Sonia, Cope, Bill, McCarthey, Sarah & Kalantzis, Mary. (2011). New spaces and old places: An analysis of writing assessment software. Computers and Composition 28, 97-111.

A systematic review of seventeen computer-based writing assessment programs, including both those that score or rate essays and those that provide technology-mediated assessment. The programs included, among others, Criterion, MY Access! Essayrater, MyCompLab, Project Essay Grader, and Calibrated Peer Review. The analysis framed writing as “a socially situated activity” that is “functionally and formally diverse” and considers it to be “a meaning-making activity that can be conveyed in multiple modalities” (p. 98). The authors reviewed components of each program, such as its primary purpose, the underlying primary algorithm, feedback mechanisms, genres/forms of writing promoted, and opportunities for engaging in the writing process. They also identified strengths and weaknesses for each program. Although this review considers more than AES, it includes AES as part of a larger movement to incorporate technology in various forms of writing assessment, whether formative or summative. The authors conclude that the programs do help raise test scores but that they “largely neglect the potential technology has” in terms of the three fundamental understandings of writing that the authors identified, promoting instead a “narrow view that conforms to systems requirements in an era of testing and accountability.” They “found evidence of formulaic approaches, non-specific feedback, incorrect identification of errors, a strong emphasis on writing mechanics such as grammar and punctuation, and a tendency to value length over content” and that the programs assumed “that successful student writers would reproduce conventional, purely written linguistic generic structures” (108).

Neal, Michael R. (2011). Writing Assessment and the Revolution in Digital Texts and Technologies. New York: Teachers College Press.*

After a thorough review of the response to machine scoring from composition scholars (pp. 65-74), this book argues that the mechanization represented by machine scoring is a “misdirection” in which we, the teaching community, are partly complicit: “somewhere along the way we have lost the idea of how and why people read and write within meaningful rhetorical situations,” noting, however, that machine scoring is “a cheap, mechanized solution to a problem that we have not had opportunity to help define” (p. 74).

Elliott, Scott. (2011). Computer-graded essays full of flaws. Dayton Daily News (May 24). *

Describes how the reporter tested Educational Testing Service’s e-rater by submitting two essays, one his best effort and one designed to meet the computer program’s preference for “long paragraphs, transitional words, and a vocabulary a bureaucrat would love” but also filled with such nonsense as “king Richard Simmons, a shoe-eating television interloper, alien beings and green swamp toads.” E-rater gave the first essay a score of 5 (on a scale of 1 to 6) and the nonsense essay a score of 6. An English teacher gave the first essay 6+ and the second 1 on the same scale. Rich Swartz of Educational Testing Service explained that “we’re a long way from computers actually reading stuff.” This isn’t a peer-reviewed publication, but it provides a useful perspective on the limitations of AES.

Dikli, Semire. (2010). The nature of automated essay scoring feedback. CALICO Journal 28(1), 99-134.

A study of the feedback that twelve adult English language learners received on their writing, either from MY Access!, an automated essay scoring (AES) program that uses the IntelliMetric system, or from a teacher. The program was not scoring essays but providing students with feedback. The study used case study methodology including observation, interviews with the students, and examination of the texts. Students were divided into two groups of six: one received feedback from the computer system and the other from the teacher. The feedback from AES and the teacher differed extensively in terms of length, usability, redundancy, and consistency. The researcher reported that MY Access! provided substantially more feedback than the teacher but that this feedback was less usable, highly redundant, and generic, though consistent. The AES system did not use positive reinforcement and didn’t connect on a personal level to the student. The researcher concluded that the AES program did not meet the needs of nonnative speakers.

Byrne, Roxanne, Tang, Michael, Truduc, John & Tang, Matthew. (2010). eGrader, a software application that automatically scores student essays: with a postscript on the ethical complexities. Journal of Systemics, Cybernetics & Informatics 8(6), 30-35.*

Provides a very brief overview of three commercially available automatic essay scoring services (Project Essay Grade, IntelliMetric, and e-rater) as well as eGrader. eGrader differs from the others because it operates on a client PC, requires little human training, is cost effective, and does not require a huge database. While it shares some processes with these other AES applications, differences include keyword searching of webpages for benchmark data. The authors used 33 essays to compare the eGrader results with those of human judges. Correlations between the scores were comparable with those of other AES applications. In classroom use, however, the instructor “found a disturbing pattern”: “The machine algorithm could not detect ideas that were not contained in the readings or Web benchmark documents although the ideas expressed were germane to the essay question.” Ultimately, the authors decided not to use machine readers because they “could not detect other subtleties of writing such as irony, metaphor, puns, connotation and other rhetorical devices” and “appears to penalize those students we want to nurture, those who think and write in original or different ways.”

Crusan, Deborah. (2010). Assessment in the Second Language Classroom. Ann Arbor, MI: University of Michigan Press.*

With an interest in second-language instruction, the author tested out Pearson Educational’s Intelligent Essay Assessor and found the diagnosis “vague and unhelpful” (p. 165). For instance, IEA said that the introduction was “missing, undeveloped, or predictable. Which was it?” (p. 166). Her chapter on machine scoring (pp. 156-179) compares all the major writing-analysis software, with an especially intense look at Vantage Learning’s MY Access! (based on IntelliMetric), and finds the feedback problematical, in part because it can be wrongly used by administrators and it can lead to “de-skilling” of teachers (p. 170). Cautions that the programs, “if used at all, ought to be used with care and constant teacher supervision and intervention” (p. 178).

McCurry, Doug. (2010). Can machine scoring deal with broad and open writing tests as well as human readers? Assessing Writing 15(2), 118-129.*

Investigates the claim that machine scoring of essays agrees with human scorers. Argues that the research supporting this claim is based on limited, constrained writing tasks such as those used for the GMAT, but a 2005 study reported by NAEP shows automated essay scoring (AES) is not reliable for more open tasks. McCurry reports on a study that compares the results of two machine-scoring applications to the results of human readers for the writing portion of the Australian Scaling Test (AST), which has been designed specifically to encourage test takers to identify an issue and use drafting and revising to present a point of view. It does not prescribe a form or genre, or even the issue. It has been designed to reflect classroom practice, not to facilitate grading and inter-rater agreement, according to McCurry. Scoring procedures, which are also different than those typically used in large-scale testing in the USA, involve four readers scoring essays on a 10-point scale. After comparing and analyzing the results between the human scores and the scores given by the AES applications, McCurry concludes that machine scoring cannot score open, broad writing tasks more reliably than human readers.

Herrington, Anne & Moran, Charles. (2009). Writing, assessment, and new technologies. In Marie C. Paretti & Katrina Powell (Eds.), Assessment in Writing (Assessment in the Disciplines, Vol. 4) (pp. 159-177). Tallahassee, FL: Association of Institutional Researchers.*

Herrington and Moran argue against educators and assessors “relying principally or exclusively on standardized assessment programs or using automated, externally developed writing assessment programs” (p. 177). They submitted an essay written by Moran to Educational Testing Service’s Criterion, and found that the program was “vague, generally misleading, and often dead wrong” (p. 163). For instance, of the eight problems Criterion found in grammar, usage, and mechanics, all eight were false flags. The authors also critique Edward Brent’s SAGrader, finding the software’s analysis of free responses written for content courses generally helpful if used in pedagogically sound ways; but they severely question Collegiate Learning Assessment’s ability to identify meaningful learning outcomes, especially now that CLA has resorted to Educational Testing Service’s e-rater to score essays composed for CLA’s “more reductive” task-based prompts (p. 171).

Scharber, Cassandra, Dexter, Sara & Riedel, Eric. (2008). Students’ experiences with an automated essay scorer. Journal of Technology, Learning and Assessment 7(1). Retrieved 4/1/2013 from http://www.jtla.org.

The study explored preservice English teachers’ experience with automated essay scoring (AES) for formative feedback in an online, case-based course. Data collected included post-assignment surveys, a user log of students’ actions within the cases, instructor-assigned scores on final essays, and interviews with four selected students. The course used ETIPS, a comprehensive, online system that includes an AES option for formative feedback. The cases “are multimedia, network-based, online instructional resources that provide learning opportunities. . . to practice instructional decision-making skills related to technology integration and implementation” (p. 6). Twenty-five of the thirty-four students agreed to participate in the study, and thirteen of the twenty-five agreed to be interviewed, with four being selected through a purposive sampling matrix. Survey results showed that “most students did not assign strong positive ratings to any aspect of the scorer” but did find the AES helpful in composing their own response, yet they did not have much confidence in its evaluation. In response to an open-ended question about their use of the AES, the authors reported the two most frequent responses from the students were “that they tried to ‘please and then beat the scorer’” (n=16) and they “used the scorer then gave up” (n=9) (p. 14). From the four case studies, the authors concluded that “the nature of the formative feedback given to these students by the ETIPS scorer was not sophisticated enough for them to know what specific sort of revision to make to their answers” (p. 28).

Shermis, Mark D., Shneyderman, Aleksandr & Attali, Yigal. (2008). How important is content in the ratings of essay assessments? Assessment in Education: Principles, Policy & Practice, 15(1), 91-105.

EBSCO ABSTRACT: This study was designed to examine the extent to which “content” accounts for variance in scores assigned in automated essay scoring protocols. Specifically it was hypothesized that certain writing genres would emphasize content more than others. Data were drawn from 1,668 essays calibrated at two grade levels (6 and 8) using “e-rater™”, an automated essay scoring engine with established validity and reliability. “E-rater” v 2.0’s scoring algorithm divides 12 variables into “content” (scores assigned to essays with similar vocabulary; similarity of vocabulary to essays with the highest scores) and “non-content” (grammar, usage, mechanics, style, and discourse structure) related components. The essays were classified by genre: persuasive, expository, and descriptive. The analysis showed that there were significant main effects due to grade, F(1, 1653) = 58.71, p < 0.001, and genre, F(2, 1653) = 20.57, p < 0.001. The interaction of grade and genre was not significant. Eighth-grade students had significantly higher mean scores than sixth-grade students, and descriptive essays were rated significantly higher than those classified as persuasive or expository. Prompts elicited “content” according to expectations, with lowest proportion of content variance in persuasive essays, followed by expository and then descriptive. Content accounted for approximately 0-6% of the overall variance when all predictor variables were used. It accounted for approximately 35-58% of the overall variance when “content” variables alone were used in the prediction equation. (Contains 9 tables, 2 figures and 2 notes.)

Chen, Chi-Fen Emily & Cheng, Wei-Yuan Eugene. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning and Technology 12(2), 94-112.*

Used naturalistic classroom investigation to see how effectively MY Access! worked for ESL students in Taiwan. Found that the computer feedback was most useful during drafting and revising but only when it was followed with human feedback from peer students and from teachers. When students tried to use MY Access! on their own, they were often frustrated and their learning was limited. Generally, both teachers and students perceived the software and its feedback negatively.

Wang, Jinhao & Brown, Michelle Stallone. (2008). Automated essay scoring versus human scoring: A correlational study. Contemporary Issues in Technology and Teacher Education 8(4).*

In one of the few empirical comparisons of machine with human scoring conducted outside the testing companies themselves, Wang and Brown had trained human raters independently to score student essays that had been scored by IntelliMetric in WritePlace Plus. The students were enrolled in an advanced basic-writing course in a Hispanic-serving college in south Texas. On the global or holistic level, the correlation between human and machine scores was only .11. On the five dimensions of focus, development, organization, mechanics, and sentence structure, the correlations ranged from .06 to .21. These dismal machine-human correlations question the generalizability of industry findings, which, as Wang and Brown point out, emerge from the same population of writers on which both machines and raters are trained. IntelliMetric scores also had no correlation (.01) with scores that students later achieved on the human-scored essay in a state-mandated exam, whereas the two human ratings correlated significantly (.35).
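For readers less familiar with the statistic behind figures such as .11 or .35, the following is a minimal sketch of how a Pearson correlation between human and machine scores could be computed. The function name `pearson_r` and the score lists are hypothetical, invented purely for illustration; they are not data or code from Wang and Brown's study, and the study may have used a different procedure.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores, invented for illustration only (not data from the study).
human_scores = [2, 3, 4, 3, 5, 1]
machine_scores = [3, 3, 4, 2, 4, 3]

r = pearson_r(human_scores, machine_scores)
# r squared is the share of variance the two sets of scores have in common.
print(f"r = {r:.2f}, shared variance r^2 = {r * r:.2f}")
```

Squaring a correlation gives the proportion of variance the two sets of scores share, so a correlation of .11 corresponds to roughly 1% shared variance.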

Wohlpart, James, Lindsey, Chuck & Rademacher, Craig. (2008). The reliability of computer software to score essays: Innovations in a humanities course. Computers and Composition 25(2), 203-223.*

Considers Florida Gulf Coast University’s general-education course Understanding the Visual and Performing Arts, taught online in two large sections. Used Intelligent Essay Assessor to score two short essays that were part of module examinations. On four readings, using a four-point holistic scale, faculty readers achieved exact agreement with two independent readers only 49, 61, 49, and 57 percent of the time. IEA’s scores correlated with the final human scores (achieved sometimes after four readings) 64% of the time. When faculty later re-read discrepant essays, their scores almost always moved toward the IEA score. With essays where there was still a discrepancy, 78% were scored higher by IEA. The faculty were “convinced” that the use of IEA was a “success.” Note that the authors do not investigate the part that statistical regression toward the mean might have played in these results.
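As a rough illustration of what an exact-agreement percentage means, the sketch below counts how often two scorers assign the identical score to the same essay. The function name `exact_agreement` and the scores are hypothetical, made up for this example; they do not come from the Wohlpart, Lindsey, and Rademacher study.

```python
def exact_agreement(scores_a, scores_b):
    """Fraction of essays on which two scorers gave the identical score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


# Hypothetical scores on a four-point holistic scale (illustration only).
reader_scores = [1, 2, 3, 4, 2, 3, 4, 1]
machine_scores = [1, 2, 3, 3, 2, 4, 4, 2]

rate = exact_agreement(reader_scores, machine_scores)
print(f"exact agreement: {rate:.1%}")
```

Exact agreement is a stricter measure than correlation: two scorers can rank essays in nearly the same order and still disagree on the exact score for many of them.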

Hutchison, Dougal. (2007). An evaluation of computerised essay marking for national curriculum assessment in the UK for 11-year-olds. British Journal of Educational Technology 38(6), 977-989.

This study examines “how well the computer program can replicate human marking” (p. 980) and the “discrepancies between computer and human marking, and to try to identify the reasons for these” (p. 981). It used e-rater and a subset of 600 essays collected as part of the National Foundation for Educational Research’s work in developing National Curriculum Assessments in English. The comparison of e-rater scores with human readers showed that “e-rater scores agree nearly as often with human readers as two human readers agree with each other, and more closely with the average of the readers” (p. 981). To determine the reason for the discrepancies, the markers discussed the texts that had received discrepant scores, and the researcher identified three reasons for the discrepancies that he termed Human Friendly, Neutral, and Computer Friendly. Based on the analysis of the results of the studies, the author concluded “that even the most sophisticated programs, such as e-rater, which bases its assessment on a number of dimensions, can still miss out on important intrinsic qualities of an essay, such as whether it was lively or pedestrian” (988).

James, Cindy L. (2007). Validating a computerized scoring system for assessing writing and placing students in composition courses. Assessing Writing 11(3), 167-178.*

Compares scores given by ACCUPLACER OnLine WritePlacer Plus (using IntelliMetric) with student essay scores given by “untrained” faculty at Thompson Rivers University, and then compares the success of these two sets in predicting pass or failure in an introductory writing course and a course in literature and composition. ACCUPLACER was administered during the first week of class. Correlations between machine and human scores (ranging from .40 to .61) were lower than those between humans (from .45 to .80). Neither machine nor human scores accounted for much of the variation in success in the composition or literature courses (machine: 16% and 5%; humans: 26% and 9%). IntelliMetric identified only one of the 18 nonsuccessful students, and humans identified only 6 of them.

Prepared by the NCTE Task Force on Writing Assessment

Chris Anson, North Carolina State University (chair)

Scott Filkins, Champaign Unit 4 School District, Illinois

Troy Hicks, Central Michigan University

Peggy O’Neill, Loyola University Maryland

Kathryn Mitchell Pierce, Clayton School District, Missouri

Maisha Winn, University of Wisconsin