Discovering the Predictive Power of Five Baseline Writing Competences
Keywords: automated essay scoring, connectivity, grammatical accuracy, inter-rater agreement, lexical diversity, regression, semantic similarity, spelling accuracy, writing analytics
- Background: In recent years, the development of automated essay scoring (AES) systems has shifted its focus from merely assigning a holistic score to an essay to providing constructive feedback on it. Despite major advances in the domain, many objections persist concerning the credibility of these systems and their readiness to replace human scoring in high-stakes writing assessments. The purpose of this study is to shed light on how to build a relatively simple AES system based on five baseline writing features. The study shows that the proposed AES system compares very well with other state-of-the-art systems despite its obvious limitations.
- Literature Review: In 2012, ASAP (Automated Student Assessment Prize) launched a demonstration to benchmark the performance of state-of-the-art AES systems using eight hand-graded essay datasets originating from state writing assessments. These datasets are still used today to measure the accuracy of new AES systems. Recently, Zupanc and Bosnić (2017) developed and evaluated another state-of-the-art AES system, called SAGE, which incorporated new semantic and consistency features and provided automatic semantic feedback for the first time. For ASAP dataset #8 (the dataset also of interest in this study), SAGE’s agreement level between machine and human scores, measured by quadratic weighted kappa, was 0.81, while it ranged between 0.60 and 0.73 for 10 other state-of-the-art systems (Chen et al., 2012; Shermis, 2014). Finally, this section discusses the limitations of AES, which stem mainly from its failure to assess the higher-order thinking skills that all writing constructs are ultimately designed to assess.
- Research Questions: The research questions that guide this study are as follows:
- RQ1: What is the power of the writing analytics tool’s five-variable model (spelling accuracy, grammatical accuracy, semantic similarity, connectivity, lexical diversity) to predict the holistic scores of Grade 10 narrative essays (ASAP dataset #8)?
- RQ2: What is the agreement level between the computer rater based on the regression model obtained in RQ1 and the human raters who scored the 723 narrative essays written by Grade 10 students (ASAP dataset #8)?
- Methodology: ASAP dataset #8 was used to train the predictive model of the writing analytics tool introduced in this study. Each essay was graded by two teachers. When the two raters disagreed, a third rater resolved the score. Essay scores were the weighted sums of four rubric scores. A multiple linear regression analysis was conducted to determine the extent to which a five-variable model (selected from a set of 86 writing features) was effective in predicting essay scores.
- Results: The regression model in this study accounted for 57% of the essay score variability. The correlation (Pearson), the percentage of perfect matches, the percentage of adjacent matches (±2), and the quadratic weighted kappa between the resolved scores and predicted essay scores were 0.76, 10%, 49%, and 0.73, respectively. The results were measured on an integer scale of resolved essay scores ranging from 10 to 60.
- Discussion: When measuring the accuracy of an AES system, it is important to take several metrics into account to better understand how predicted essay scores are distributed along the distribution of human scores. Using average ranking over correlation, exact/adjacent agreement, quadratic weighted kappa, and distributional characteristics such as standard deviation and mean, this study’s regression model ranks 4th out of 10 AES systems. Despite its relatively good rank, the predictions of the proposed AES system remain imprecise and do not even appear optimal for identifying poor-quality essays, treated as a binary condition of scoring at or below a 65% threshold (71% precision and 92% recall).
- Conclusions: This study sheds light on the implementation process and the evaluation of a new simple AES system comparable to the state of the art and reveals that the generally obscure state-of-the-art AES system is most likely concerned only with shallow assessment of text production features. Consequently, the authors advocate greater transparency in the development and publication of AES systems. In addition, the relationship between the explanation of essay score variability and the inter-rater agreement level should be further investigated to better represent the changes in terms of level of agreement when a new variable is added to a regression model. This study should also be replicated at a larger scale in several different writing settings for more robust results.
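The agreement metrics named in the Results item (Pearson correlation, perfect matches, adjacent matches within ±2, and quadratic weighted kappa) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the toy scores are hypothetical, and it assumes that adjacent matches count every pair within the tolerance, exact matches included, which the abstract does not specify. Only the 10 to 60 integer scale comes from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def exact_agreement(human, machine):
    """Share of essays where both raters give the same score."""
    return sum(h == m for h, m in zip(human, machine)) / len(human)


def adjacent_agreement(human, machine, tolerance=2):
    """Share of essays whose scores differ by at most `tolerance`.

    Assumption: exact matches are included in this count.
    """
    return sum(abs(h - m) <= tolerance for h, m in zip(human, machine)) / len(human)


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Quadratic weighted kappa for integer scores in [min_score, max_score]."""
    k = max_score - min_score + 1
    n = len(human)
    # Observed confusion matrix: rows are human scores, columns machine scores.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, machine):
        observed[h - min_score][m - min_score] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            # Expected count if the two raters scored independently.
            expected = hist_h[i] * hist_m[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    # Undefined (division by zero) when both raters give one identical constant score.
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Made-up scores on the study's 10-60 scale, for illustration only.
    human = [24, 30, 36, 40, 45, 50]
    machine = [26, 30, 33, 41, 40, 52]
    print("Pearson r:", round(pearson(human, machine), 3))
    print("Exact agreement:", exact_agreement(human, machine))
    print("Adjacent agreement (±2):", adjacent_agreement(human, machine))
    print("QWK:", round(quadratic_weighted_kappa(human, machine, 10, 60), 3))
```

Because the quadratic weights depend on the number of score categories, a kappa computed on a 51-point scale is not directly comparable with one computed on a coarser scale, which is one reason to report several metrics side by side.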
Aluthman, E. S. (2016). The effect of using automated essay evaluation on ESL undergraduate students’ writing skill. International Journal of English Linguistics, 6(5), 54–67. http://doi.org/10.5539/ijel.v6n5p54
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater v.2. Journal of Technology, Learning, and Assessment, 4(3), 1–31. Retrieved from https://www.ets.org/research/policy_research_reports/publications/article/2006/hsjv
Bennett, R. E. (2011). CBAL: Results from piloting innovative K–12 assessments. ETS Research Report Series, 2011(1), 1–39. http://doi.org/10.1002/j.2333-8504.2011.tb02259.x
Brenner, H., & Kliebsch, U. (1996). Dependence of weighted kappa coefficients on the number of categories. Epidemiology, 7(2), 199–202. Retrieved from http://www.jstor.org/stable/3703036
Chen, H., He, B., Luo, T., & Li, B. (2012). A ranked-based learning approach to automated essay scoring. In 2012 Second International Conference on Cloud and Green Computing (pp. 448–455). http://doi.org/10.1109/CGC.2012.41
Clemens, C. (2017). A causal model of writing competence (Master's thesis). Retrieved from https://dt.athabascau.ca/jspui/handle/10791/233
Crossley, S. A., Kyle, K., & McNamara, D. S. (2016a). The development and use of cohesive devices in L2 writing and their relations to judgments of essay quality. Journal of Second Language Writing, 32, 1–16. http://doi.org/10.1016/j.jslw.2016.01.003
Crossley, S. A., Kyle, K., & McNamara, D. S. (2016b). The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion. Behavior Research Methods, 48(4), 1227–1237. http://doi.org/10.3758/s13428-015-0651-7
Crossley, S. A., & McNamara, D. S. (2011). Text coherence and judgments of essay quality: Models of quality and coherence. In Proceedings of the Cognitive Science Society (Vol. 33, No. 33). Retrieved from https://www.semanticscholar.org/paper/Text-Coherence-and-Judgments-of-Essay-Quality-Mode-Crossley-McNamara/89c191a8053412356eb8a68144ca59d8b5eb6a63
Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18(1), 7–24. https://doi.org/10.1016/j.asw.2012.10.002
El Ebyary, K., & Windeatt, S. (2010). The impact of computer-based feedback on students’ written work. International Journal of English Studies, 10(2), 121–142. Retrieved from http://revistas.um.es/ijes/article/view/119231
Elliot, N., Rudniy, A., Deess, P., Klobucar, A., Collins, R., & Sava, S. (2016). ePortfolios: Foundational measurement issues. Journal of Writing Assessment, 9(2). Retrieved from http://journalofwritingassessment.org/article.php?article=110
Fazal, A., Hussain, F. K., & Dillon, T. S. (2013). An innovative approach for automatically grading spelling in essays using rubric-based scoring. Journal of Computer and System Sciences, 79(7), 1040–1056. https://doi.org/10.1016/j.jcss.2013.01.021
Foltz, P. W. (1996). Latent semantic analysis for text-based research. Behavior Research Methods, 28(2), 197–202. http://doi.org/10.3758/BF03204765
Gebril, A., & Plakans, L. (2016). Source-based tasks in academic writing assessment: Lexical diversity, textual borrowing and proficiency. Journal of English for Academic Purposes, 24, 78–88. https://doi.org/10.1016/j.jeap.2016.10.001
Gregori-Signes, C., & Clavel-Arroitia, B. (2015). Analysing lexical density and lexical diversity in university students’ written discourse. Procedia - Social and Behavioral Sciences, 198, 546–556. https://doi.org/10.1016/j.sbspro.2015.07.477
Huck, S. (2009). Statistics misconceptions. New York, NY: Taylor & Francis.
Kleinbaum, D., Kupper, L., Nizam, A., & Rosenberg, E. (2013). Applied regression analysis and other multivariable methods. Nelson Education.
Larkey, L. S. (1998). Automatic essay grading using text categorization techniques. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 90–95). New York, NY, USA: ACM. http://doi.org/10.1145/290941.290965
Latifi, S., Gierl, M. J., Boulais, A.-P., & De Champlain, A. F. (2016). Using automated scoring to evaluate written responses in English and French on a high-stakes clinical competency examination. Evaluation & the Health Professions, 39(1), 100–113. http://doi.org/10.1177/0163278715605358
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval (Vol. 1). Cambridge: Cambridge University Press.
McNamara, D. S., Crossley, S. A., & Roscoe, R. (2013). Natural language processing in an intelligent writing strategy tutoring system. Behavior Research Methods, 45(2), 499–515. http://doi.org/10.3758/s13428-012-0258-1
McNamara, D. S., Crossley, S. A., Roscoe, R. D., Allen, L. K., & Dai, J. (2015). A hierarchical classification approach to automated essay scoring. Assessing Writing, 23, 35–59. http://doi.org/10.1016/j.asw.2014.09.002
Miłkowski, M. (2010). Developing an open-source, rule-based proofreading tool. Software: Practice and Experience, 40(7), 543–566. http://doi.org/10.1002/spe.971
Naber, D. (2003). A rule-based style and grammar checker. Retrieved from https://www.researchgate.net/publication/239556866_A_Rule-Based_Style_and_Grammar_Checker
Perelman, L. (2013). Critique of Mark D. Shermis & Ben Hammer, Contrasting state-of-the-art automated scoring of essays: Analysis. Journal of Writing Assessment, 6(1). Retrieved from http://journalofwritingassessment.org/article.php?article=69
Perelman, L. (2014). When “the state of the art” is counting words. Assessing Writing, 21, 104–111. http://doi.org/10.1016/j.asw.2014.05.001
Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetric™ essay scoring system. The Journal of Technology, Learning and Assessment, 4(4), 1–22. Retrieved from https://ejournals.bc.edu/ojs/index.php/jtla/article/download/1651/1493
Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20(1), 53–76. http://doi.org/10.1016/j.asw.2013.04.001
Slomp, D. H. (2012). Challenges in assessing the development of writing ability: Theories, constructs and methods. Assessing Writing, 17(2), 81–91. https://doi.org/10.1016/j.asw.2012.02.001
Villalon, J., & Calvo, R. A. (2013). A decoupled architecture for scalability in text mining applications. Journal of Universal Computer Science, 19(3), 406–427. Retrieved from https://pdfs.semanticscholar.org/ffcc/204e98f16fd0fc47af8c2ff312f5f50df81d.pdf
Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 157–180. https://doi.org/10.1191/1362168806lr190oa
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
Wilson, J., Olinghouse, N. G., McCoach, D. B., Santangelo, T., & Andrada, G. N. (2016). Comparing the accuracy of different scoring methods for identifying sixth graders at risk of failing a state writing assessment. Assessing Writing, 27, 11–23. http://doi.org/10.1016/j.asw.2015.06.003
Zupanc, K., & Bosnić, Z. (2017). Automated essay evaluation with semantic analysis. Knowledge-Based Systems, 120, 118–132. https://doi.org/10.1016/j.knosys.2017.01.006