Discovering the Predictive Power of Five Baseline Writing Competences

  • Vivekanandan Kumar, Athabasca University
  • Shawn N. Fraser, Athabasca University
  • David Boulanger, Athabasca University
Keywords: automated essay scoring, connectivity, grammatical accuracy, inter-rater agreement, lexical diversity, regression, semantic similarity, spelling accuracy, writing analytics


  • Background: In recent years, the development of automated essay scoring (AES) systems has shifted in focus from merely assigning a holistic score to an essay to providing constructive feedback on it. Despite major advances in the domain, many objections persist concerning the credibility of these systems and their readiness to replace human scoring in high-stakes writing assessments. The purpose of this study is to shed light on how to build a relatively simple AES system based on five baseline writing features. The study shows that the proposed AES system compares very well with other state-of-the-art systems despite its obvious limitations.

  • Literature Review: In 2012, the Automated Student Assessment Prize (ASAP) competition launched a demonstration to benchmark the performance of state-of-the-art AES systems using eight hand-graded essay datasets originating from state writing assessments. These datasets are still used today to measure the accuracy of new AES systems. More recently, Zupanc and Bosnić (2017) developed and evaluated another state-of-the-art AES system, called SAGE, which incorporated new semantic and consistency features and provided, for the first time, automatic semantic feedback. SAGE’s agreement level between machine and human scores on ASAP dataset #8 (the dataset also of interest in this study) reached a quadratic weighted kappa of 0.81, whereas it ranged between 0.60 and 0.73 for 10 other state-of-the-art systems (Chen et al., 2012; Shermis, 2014). Finally, this section discusses the limitations of AES systems, which stem mainly from their failure to assess the higher-order thinking skills that all writing constructs are ultimately designed to measure.

  • Research Questions: The research questions that guide this study are as follows:
  1. RQ1: How well does the writing analytics tool’s five-variable model (spelling accuracy, grammatical accuracy, semantic similarity, connectivity, lexical diversity) predict the holistic scores of Grade 10 narrative essays (ASAP dataset #8)?
  2. RQ2: What is the agreement level between the computer rater based on the regression model obtained in RQ1 and the human raters who scored the 723 narrative essays written by Grade 10 students (ASAP dataset #8)?
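To make the five predictors more concrete, the sketch below computes crude, illustrative proxies for two of them: lexical diversity as a type-token ratio and connectivity as the density of connective words. These are simplified stand-ins, not the study's actual feature definitions, and the connective list is an invented assumption for illustration only.

```python
import re

# Illustrative list of connectives; the study's actual inventory is not specified here.
CONNECTIVES = {"however", "therefore", "because", "moreover", "although",
               "furthermore", "consequently", "meanwhile", "thus", "since"}

def tokenize(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def lexical_diversity(text):
    """Type-token ratio: unique words / total words (0 if empty)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def connectivity(text):
    """Proportion of tokens that are connectives (a crude cohesion proxy)."""
    tokens = tokenize(text)
    return sum(t in CONNECTIVES for t in tokens) / len(tokens) if tokens else 0.0

essay = "The plan failed. However, the team persisted because the goal mattered."
print(round(lexical_diversity(essay), 2))  # 0.82 (9 unique words / 11 tokens)
print(round(connectivity(essay), 2))       # 0.18 (2 connectives / 11 tokens)
```

In practice, spelling and grammatical accuracy would require a dictionary or a rule-based checker (e.g., the LanguageTool line of work cited below), and semantic similarity a model such as latent semantic analysis; the simple token counts above only illustrate the shape of such features.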
  • Methodology: ASAP dataset #8 was used to train the predictive model of the writing analytics tool introduced in this study. Each essay was graded by two teachers; in cases of disagreement between the two raters, the score was resolved by a third rater. Essay scores were the weighted sums of four rubric scores. A multiple linear regression analysis was conducted to determine the extent to which a five-variable model (selected from a set of 86 writing features) was effective in predicting essay scores.
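The regression step described above can be sketched as ordinary least squares on the five feature columns. The following minimal, dependency-free sketch fits a model via the normal equations and reports R²; the feature values and scores in the example are invented toy data, not the ASAP dataset.

```python
def ols_fit(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved with Gaussian elimination. Each row of X is [1, x1, ..., xk]."""
    k = len(X[0])
    # Build the normal equations.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

def r_squared(X, y, beta):
    """Proportion of score variance explained by the model."""
    preds = [sum(bi * xi for bi, xi in zip(beta, row)) for row in X]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Toy data: rows are [intercept, spelling accuracy, lexical diversity].
X = [[1, 0.9, 0.6], [1, 0.7, 0.5], [1, 0.95, 0.8], [1, 0.5, 0.4], [1, 0.8, 0.7]]
y = [48, 35, 55, 25, 44]
beta = ols_fit(X, y)
print(round(r_squared(X, y, beta), 2))
```

The study reports R² = 0.57 for its five-variable model; the sketch simply shows how such a figure is derived from fitted coefficients and residuals.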
  • Results: The regression model in this study accounted for 57% of the variability in essay scores. The Pearson correlation, the percentage of exact matches, the percentage of adjacent matches (±2), and the quadratic weighted kappa between the resolved scores and the predicted essay scores were 0.76, 10%, 49%, and 0.73, respectively. All metrics were computed on the integer scale of resolved essay scores, which ranges from 10 to 60.
  • Discussion: When measuring the accuracy of an AES system, it is important to take several metrics into account to better understand how predicted essay scores are distributed along the distribution of human scores. Using an average ranking over correlation, exact/adjacent agreement, quadratic weighted kappa, and distributional characteristics such as the mean and standard deviation, this study’s regression model ranks 4th among 10 AES systems. Despite this relatively good rank, the proposed AES system’s predictions remain imprecise and appear suboptimal even for the simpler binary task of identifying poor-quality essays, that is, essays scoring at or below a 65% threshold (71% precision and 92% recall).
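The binary screening figures quoted above follow the usual definitions of precision and recall. A minimal sketch with invented scores, assuming (as an illustrative interpretation, not the study's stated procedure) that the 65% threshold is taken as 65% of the maximum score on the 10-60 scale, i.e., a cutoff of 39:

```python
def precision_recall(truth, pred, cutoff):
    """Treat scores <= cutoff as the positive class ('poor-quality' essay).
    Precision = correctly flagged / all flagged;
    recall    = correctly flagged / all truly poor-quality."""
    tp = sum(t <= cutoff and p <= cutoff for t, p in zip(truth, pred))
    fp = sum(t > cutoff and p <= cutoff for t, p in zip(truth, pred))
    fn = sum(t <= cutoff and p > cutoff for t, p in zip(truth, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

human = [20, 35, 42, 50, 38, 55, 30]
machine = [25, 38, 40, 48, 41, 52, 33]
print(precision_recall(human, machine, cutoff=39))  # (1.0, 0.75)
```

The trade-off the study reports (high recall, lower precision) means the system catches most genuinely weak essays but also flags some adequate ones.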
  • Conclusions: This study sheds light on the implementation and evaluation of a new, simple AES system that is comparable to the state of the art, and suggests that state-of-the-art AES systems, whose inner workings are generally opaque, most likely perform only a shallow assessment of text production features. Consequently, the authors advocate greater transparency in the development and publication of AES systems. In addition, the relationship between the explained variability of essay scores and the inter-rater agreement level should be further investigated to better represent how the level of agreement changes when a new variable is added to a regression model. The study should also be replicated at a larger scale and in several different writing settings to yield more robust results.

Author Biographies

Vivekanandan Kumar, Athabasca University
Dr. Vivekanandan Kumar is a Professor in the School of Computing and Information Systems at Athabasca University, Canada. He holds the Natural Sciences and Engineering Research Council of Canada’s (NSERC) Discovery Grant on Anthropomorphic Pedagogical Agents, funded by the Government of Canada. His research focuses on developing anthropomorphic agents, which mimic and perfect human-like traits to better assist learners in their regulatory tasks. His research includes investigating technology-enhanced erudition methods that employ big data learning analytics, self-regulated learning, co-regulated learning, causal modeling, and machine learning to facilitate deep learning and open research.
Shawn N. Fraser, Athabasca University
Dr. Shawn N. Fraser is an Associate Professor at Athabasca University and an Adjunct Assistant Professor in Physical Education and Recreation at the University of Alberta. His research interests include understanding how stress can impact upon rehabilitation success for heart patients. He teaches research methods courses in the Faculty of Health Disciplines and is interested in interdisciplinary approaches to studying and teaching research methods and data analysis.
David Boulanger, Athabasca University
David Boulanger is a student and data scientist involved in the learning analytics research group at Athabasca University. His primary research focus is on observational study designs and the application of computational tools and machine learning algorithms in learning analytics including writing analytics.

References


Aluthman, E. S. (2016). The effect of using automated essay evaluation on ESL undergraduate students’ writing skill. International Journal of English Linguistics, 6(5), 54–67.

Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater v.2. Journal of Technology, Learning, and Assessment, 4(3), 1–31.

Bennett, R. E. (2011). CBAL: Results from piloting innovative K–12 assessments. ETS Research Report Series, 2011(1), 1–39.

Brenner, H., & Kliebsch, U. (1996). Dependence of weighted kappa coefficients on the number of categories. Epidemiology, 7(2), 199–202.

Chen, H., He, B., Luo, T., & Li, B. (2012). A ranked-based learning approach to automated essay scoring. In 2012 Second International Conference on Cloud and Green Computing (pp. 448–455).

Clemens, C. (2017). A causal model of writing competence (Master's thesis).

Crossley, S. A., Kyle, K., & McNamara, D. S. (2016a). The development and use of cohesive devices in L2 writing and their relations to judgments of essay quality. Journal of Second Language Writing, 32, 1–16.

Crossley, S. A., Kyle, K., & McNamara, D. S. (2016b). The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion. Behavior Research Methods, 48(4), 1227–1237.

Crossley, S. A., & McNamara, D. S. (2011). Text coherence and judgments of essay quality: Models of quality and coherence. In Proceedings of the Cognitive Science Society (Vol. 33).

Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18(1), 7–24.

El Ebyary, K., & Windeatt, S. (2010). The impact of computer-based feedback on students’ written work. International Journal of English Studies, 10(2), 121–142.

Elliot, N., Rudniy, A., Deess, P., Klobucar, A., Collins, R., & Sava, S. (2016). ePortfolios: Foundational measurement issues. Journal of Writing Assessment, 9(2).

Fazal, A., Hussain, F. K., & Dillon, T. S. (2013). An innovative approach for automatically grading spelling in essays using rubric-based scoring. Journal of Computer and System Sciences, 79(7), 1040–1056.

Foltz, P. W. (1996). Latent semantic analysis for text-based research. Behavior Research Methods, 28(2), 197–202.

Gebril, A., & Plakans, L. (2016). Source-based tasks in academic writing assessment: Lexical diversity, textual borrowing and proficiency. Journal of English for Academic Purposes, 24, 78–88.

Gregori-Signes, C., & Clavel-Arroitia, B. (2015). Analysing lexical density and lexical diversity in university students’ written discourse. Procedia - Social and Behavioral Sciences, 198, 546–556.

Huck, S. (2009). Statistical misconceptions. New York, NY: Taylor & Francis.

Kleinbaum, D., Kupper, L., Nizam, A., & Rosenberg, E. (2013). Applied regression analysis and other multivariable methods. Nelson Education.

Larkey, L. S. (1998). Automatic essay grading using text categorization techniques. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 90–95). New York, NY, USA: ACM.

Latifi, S., Gierl, M. J., Boulais, A.-P., & De Champlain, A. F. (2016). Using automated scoring to evaluate written responses in English and French on a high-stakes clinical competency examination. Evaluation & the Health Professions, 39(1), 100–113.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.

McNamara, D. S., Crossley, S. A., & Roscoe, R. (2013). Natural language processing in an intelligent writing strategy tutoring system. Behavior Research Methods, 45(2), 499–515.

McNamara, D. S., Crossley, S. A., Roscoe, R. D., Allen, L. K., & Dai, J. (2015). A hierarchical classification approach to automated essay scoring. Assessing Writing, 23, 35–59.

Miłkowski, M. (2010). Developing an open-source, rule-based proofreading tool. Software: Practice and Experience, 40(7), 543–566. https://doi.org/10.1002/spe.971

Naber, D. (2003). A rule-based style and grammar checker.

Perelman, L. (2013). Critique of Mark D. Shermis & Ben Hammer, Contrasting state-of-the-art automated scoring of essays: Analysis. Journal of Writing Assessment, 6(1).

Perelman, L. (2014). When “the state of the art” is counting words. Assessing Writing, 21, 104–111.

Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetric™ essay scoring system. The Journal of Technology, Learning and Assessment, 4(4), 1–22.

Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20(1), 53–76.

Slomp, D. H. (2012). Challenges in assessing the development of writing ability: Theories, constructs and methods. Assessing Writing, 17(2), 81–91.

Villalon, J., & Calvo, R. A. (2013). A decoupled architecture for scalability in text mining applications. Journal of Universal Computer Science, 19(3), 406–427.

Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 157–180.

Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.

Wilson, J., Olinghouse, N. G., McCoach, D. B., Santangelo, T., & Andrada, G. N. (2016). Comparing the accuracy of different scoring methods for identifying sixth graders at risk of failing a state writing assessment. Assessing Writing, 27, 11–23.

Zupanc, K., & Bosnić, Z. (2017). Automated essay evaluation with semantic analysis. Knowledge-Based Systems, 120, 118–132.