Structural Features of Undergraduate Writing: A Computational Approach

  • Noah Arthurs, Stanford University
Keywords: computational features, corpus linguistics, feature extraction, machine learning, NLP, parse trees, stance, student writing, topic modeling, undergraduate writing, writing analytics


  • Background: Over a decade ago, the Stanford Study of Writing (SSW) collected more than 15,000 writing samples from undergraduate students, but to this point the corpus has not been analyzed using computational methods. Through the use of natural language processing (NLP) techniques, this study attempts to reveal underlying structures in the SSW, while at the same time developing a set of interpretable features for computationally understanding student writing. These features fall into three categories: topic-based features that reveal what students are writing about; stance-based features that reveal how students are framing their arguments; and structure-based features that reveal sentence complexity. Using these features, we are able to characterize the development of the SSW participants across four years of undergraduate study, specifically gaining insight into the different trajectories of humanities, social science, and STEM students. While the results are specific to Stanford University’s undergraduate program, they demonstrate that these three categories of features can give insight into how groups of students develop as writers.
  • Literature Review: The Stanford Study of Writing (Lunsford et al., 2008; SSW, 2018) involved the collection of more than 15,000 writing samples from 189 students in the Stanford class of 2005. The literature surrounding the original study is largely qualitative (Fishman, Lunsford, McGregor, & Otuteye, 2005; Lunsford, 2013; Lunsford, Fishman, & Liew, 2013), so this study makes a first attempt at a quantitative analysis of the SSW. When considering the ethics of a computational approach, we find it important not to stray into the territory of writing evaluation, as purely evaluative systems have been shown to have limited instructional use in the classroom (Chen & Cheng, 2008; Weaver, 2006). We therefore take a descriptive rather than an evaluative approach. All of the features that we extract are both interpretable and grounded in prior research: topic modeling has been used on undergraduate writing to improve the prediction of neuroticism and depression in college students (Resnik, Garron, & Resnik, 2013); stance markers have been used to show the development of undergraduate writers (Aull & Lancaster, 2014); and parse trees have been used to measure the syntactic complexity of student writing (Lu, 2010).
  • Research Questions: What computational features are useful for analyzing the development of student writers? Based on these features, what insights can we gain into undergraduate writing at Stanford and similar institutions?
  • Methodology: To extract topic features, we use latent Dirichlet allocation (LDA) topic modeling (Blei, Ng, & Jordan, 2003) with Gibbs sampling (Griffiths, 2002). To extract stance features, we replicate the stance-marker approach of a past study (Aull & Lancaster, 2014). To describe sentence structure, we use parse trees generated by shift-reduce dependency parsing (Sagae & Tsujii, 2008). For each parse tree, we use the tree depth and the average dependency length as heuristics for the syntactic complexity of the sentence.
  • Results: Topic modeling was useful for sorting papers into academic disciplines, as well as for distinguishing between argumentative and personal writing. Stance markers helped us characterize the intersection between the majors that students hold and the topics that they are writing about at a given time. Parse tree complexity demonstrated differences between writing in different disciplines. In addition, we found that students of different disciplines have different syntactic features even during their first year at Stanford.
  • Discussion: Topic modeling has given us a picture of interdisciplinary study at Stanford by showing how often students in the SSW wrote about topics outside their majors. Furthermore, studying interdisciplinary Stanford students allowed us to examine the intersection of a student’s major and current topic of writing when analyzing the other two sets of features. Stance markers in the SSW show that both field of study and topic of writing influence the ways in which students employ metadiscourse. In addition, when looking at stance across years, we see that seniors regress toward their first-year habits. The complexity results raise the question of whether different disciplines have different “ideal” levels of writing complexity.
  • Conclusions: The present study yields insight into undergraduate writing at Stanford in particular. Notably, we find that students develop most as writers during their first two years and that students of different majors develop as writers in different ways. We consider our three categories of features to be useful because they were able to give us these insights into the dataset. We hope that, moving forward, educators will be able to use this kind of analysis to understand how their students are developing as writers.
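
The stance and structure features named under Methodology can be illustrated with a minimal Python sketch. It is not the study's implementation: the marker lists are hypothetical placeholders rather than the lists of Aull and Lancaster (2014), the per-1,000-token normalization and the convention that the root has depth 1 are assumptions, and a parse is represented simply as a list of 1-based head positions (0 for the root) rather than the output of any particular parser.

```python
import re
from typing import List

# Hypothetical marker lists for illustration only; they are NOT the
# lists used by Aull & Lancaster (2014) or by this study.
HEDGES = ["may", "might", "perhaps", "possibly", "suggests"]
BOOSTERS = ["clearly", "certainly", "definitely", "must"]


def marker_rate(text: str, markers: List[str]) -> float:
    """Return marker occurrences per 1,000 word tokens (assumed normalization)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    wanted = set(markers)
    hits = sum(1 for token in tokens if token in wanted)
    return 1000.0 * hits / len(tokens)


def tree_depth(heads: List[int]) -> int:
    """Depth of a dependency tree given as 1-based head positions.

    heads[i] is the position of the head of word i + 1, and 0 marks the
    root. The root counts as depth 1. The input is assumed to be a
    well-formed tree (no cycles).
    """
    def depth(position: int) -> int:
        steps = 1
        # Walk from the word up to the root, counting nodes on the path.
        while heads[position - 1] != 0:
            position = heads[position - 1]
            steps += 1
        return steps

    return max((depth(p) for p in range(1, len(heads) + 1)), default=0)


def average_dependency_length(heads: List[int]) -> float:
    """Mean linear distance between each non-root word and its head."""
    lengths = [
        abs(position - head)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

For the sentence "The student wrote a long essay", hand-annotated with heads [2, 3, 0, 6, 6, 3], this sketch gives a tree depth of 3 and an average dependency length of 1.6.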

Author Biography

Noah Arthurs, Stanford University
Computer Science master's student at Stanford.

References

2. Re - Regular Expression Operations. (2018, May 1). Retrieved from Python 3.6.5 Documentation:

Aull, L. L., & Lancaster, Z. (2014). Linguistic markers of stance in early and advanced academic writing: A corpus-based comparison. Written Communication, 31(2), 151–183. Retrieved from

Bird, S., & Loper, E. (2004). NLTK: The Natural Language Toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions (p. 31). Association for Computational Linguistics. Retrieved from

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022. Retrieved from

Chen, C.-F. E., & Cheng, W.-Y. E. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Retrieved from

Fishman, J., Lunsford, A. A., McGregor, B., & Otuteye, M. (2005). Performing writing, performing literacy. College Composition and Communication, 57(2), 224–252. Retrieved from

Griffiths, T. (2002). Gibbs sampling in the generative model of Latent Dirichlet Allocation. Retrieved from

Hays, D. G. (1964). Dependency theory: A formalism and some observations. Language, 40(4), 511–525. Retrieved from

Honnibal, M., & Johnson, M. (2015). An improved non-monotonic transition system for dependency parsing. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1373–1378). Retrieved from

Hult, C. A. (1986). The computer and the inexperienced writer. Retrieved from

Hyland, K. (1998). Persuasion and context: The pragmatics of academic metadiscourse. Journal of Pragmatics, 30(4), 437–455. Retrieved from

Kumar, V., Fraser, S. N., & Boulanger, D. (2017). Discovering the predictive power of five baseline writing competences. Journal of Writing Analytics, 1, 176–226. Retrieved from

Langdetect. (2018, May 1). Retrieved from PyPI:

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474–496. Retrieved from

Lunsford, A. A. (2013). Our semi-literate youth? Not so fast. Stanford University. Retrieved from

Lunsford, A. A., Fishman, J., & Liew, W. M. (2013). College writing, identification, and the production of intellectual property: Voices from the Stanford Study of Writing. College English, 75(5), 470–492. Retrieved from

Lunsford, A. A., Stapleton, L., Fishman, J., Krampetz, E., Rogers, P. M., Diogenes, M., & Otuteye, M. (2008). The Stanford Study of Writing. Stanford University. Retrieved from

McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. Retrieved from

Papadimitriou, C. H., Tamaki, H., Raghavan, P., & Vempala, S. (1998). Latent semantic indexing: A probabilistic analysis. Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 159–168). ACM. Retrieved from

Resnik, P., Garron, A., & Resnik, R. (2013). Using topic modeling to improve prediction of neuroticism and depression in college students. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1348–1353). Retrieved from

Sagae, K., & Tsujii, J. I. (2008). Shift-reduce dependency DAG parsing. Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1 (pp. 753–760). Association for Computational Linguistics. Retrieved from

SSW. (2018, April 30). Retrieved from Stanford Study of Writing:

Stab, C., & Gurevych, I. (2014). Identifying argumentative discourse structures in persuasive essays. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 46–56). Retrieved from

Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 1882–1891). Retrieved from

The Hewlett Foundation. (2018, April 30). Short answer scoring. Retrieved from Kaggle:

Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for recommending scientific articles. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 448–456). ACM. Retrieved from

Weaver, M. R. (2006). Do students value feedback? Student perceptions of tutors’ written responses. Assessment & Evaluation in Higher Education, 31(3), 379–394. Retrieved from