Developing an e-rater Advisory to detect Babel-generated essays

  • Aoife Cahill Educational Testing Service
  • Martin Chodorow Hunter College and the Graduate Center, CUNY
  • Michael Flor Educational Testing Service
Keywords: automated essay scoring, constructed response items, gaming strategies, lexical-semantic cohesion, natural language processing, writing analytics


  • Background: It is important for developers of automated scoring systems to ensure that their systems are as fair and valid as possible. This commitment means evaluating the performance of these systems in light of construct-irrelevant response strategies. Enhancing systems to detect and handle such strategies is often an iterative process: as new strategies come to light, they must be evaluated, and effective countermeasures must be built into the automated scoring systems. In this paper, we focus on the Babel system, which automatically generates semantically incohesive essays. We expect that these essays may unfairly receive high scores from automated scoring engines despite being essentially nonsense.
  • Literature Review: We discuss literature related to the gaming of automated scoring systems. One reason that human readers can so easily identify Babel essays as nonsense is that the essays lack any semantic cohesion. Therefore, we also discuss literature on cohesion and on detecting semantic cohesion automatically.
  • Research Questions: This study addressed three research questions:
  1. Can we automatically detect essays generated by the Babel system?
  2. Can we integrate the detection of Babel-generated essays into an operational automated essay scoring system while making sure not to flag valid student responses?
  3. Does a general approach for detecting semantically incohesive essays also detect Babel-generated essays?
  • Research Methodology: This article describes the creation of two corpora necessary to address the research questions: (1) a corpus of Babel-generated essays and (2) a corresponding corpus of good-faith essays. We built a classifier to distinguish Babel-generated essays from good-faith essays and investigated whether the classifier can be integrated into an automated scoring engine without adverse effects. We also developed a measure of lexical-semantic cohesion and examined its distribution in Babel-generated and in good-faith essays.
  • Results: We found that a classifier trained on Babel-generated and good-faith essays, using features from the automated scoring engine, can distinguish the Babel-generated essays from the good-faith ones with 100% accuracy. We also found that when this classifier was integrated into the automated scoring engine, it flagged very few operational responses (76 of 434,656, roughly 0.02%). The responses that were flagged had previously been assigned a score of Null (non-scorable) or a score of 1 by human experts. The measure of lexical-semantic cohesion shows promise for distinguishing Babel-generated essays from good-faith essays.
  • Conclusions: Our results show that it is possible to detect the kind of gaming strategy illustrated by the Babel system and to add this detection to an automated scoring engine without adverse effects on essays seen during real high-stakes tests. We also show that a measure of lexical-semantic cohesion can separate Babel-generated essays from good-faith essays to a degree that depends on the task. This points to future work that would generalize the capability to detect semantic incoherence in essays.
  • Directions for Further Research: Babel-generated essays can be identified and flagged by an automated scoring system without any adverse effects on a large set of good-faith essays. However, this is just one type of gaming strategy. It is important for developers of automated scoring systems to continue to be diligent about expanding the construct coverage of their systems in order to prevent weaknesses that can be exploited by tools such as Babel. It is also important to focus on the underlying linguistic reasons that lead to nonsense sentences. Successful identification of such nonsense would lead to improved automated scoring and feedback.
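
To make the idea of a lexical-semantic cohesion measure more concrete, the following is a minimal sketch under stated assumptions, not the measure developed in the paper (the abstract does not specify how that measure is computed). It assumes cohesion can be approximated as the average pairwise cosine similarity between word vectors. The three-dimensional vectors in `TOY_VECTORS` and the function names `cosine` and `cohesion_score` are invented for illustration only.

```python
import math
from itertools import combinations

# Hypothetical toy "embeddings", invented for illustration only.
# A real system would load vectors trained on a large corpus.
TOY_VECTORS = {
    "school": (0.9, 0.1, 0.0),
    "teacher": (0.8, 0.2, 0.1),
    "student": (0.85, 0.15, 0.05),
    "volcano": (0.0, 0.9, 0.2),
    "tariff": (0.1, 0.0, 0.95),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cohesion_score(words, vectors=TOY_VECTORS):
    """Average pairwise cosine similarity over the words that have a vector.

    Returns 0.0 when fewer than two words are covered by the vector table.
    """
    known = [vectors[w] for w in words if w in vectors]
    pairs = list(combinations(known, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Words from one topic versus words drawn from unrelated topics.
on_topic = ["school", "teacher", "student"]
scattered = ["school", "volcano", "tariff"]

print(f"on-topic cohesion:  {cohesion_score(on_topic):.3f}")
print(f"scattered cohesion: {cohesion_score(scattered):.3f}")
```

Under this assumption, text whose words come from unrelated topics scores lower than text whose words share a topic, which is the kind of contrast one would hope to see between Babel-generated and good-faith essays. Any threshold for flagging would have to be set empirically and, as the conclusions note, may depend on the task.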

