Assessing Writing Constructs: Toward an Expanded View of Inter-Reader Reliability

  • Valerie Ross University of Pennsylvania
  • Rodger LeGrand Massachusetts Institute of Technology
Keywords: Cohen's Unweighted Kappa, ePortfolio, formative assessment, interrater agreement, interrater reliability, interreader reliability, kappa paradox, portfolio assessment, rubric traits, trait adjudication, writing analytics, writing assessment, writing in th


  • Background: This study focuses on construct representation and inter-reader agreement and reliability in ePortfolio assessment of 1,315 writing portfolios. These portfolios were submitted by undergraduates enrolled in required writing seminars at the University of Pennsylvania (Penn) in the fall of 2014.  Penn is an Ivy League university with a diverse student population, half of whom identify as students of color. Over half of Penn’s students are women, 12% are international, and 12% are first-generation college students. The students’ portfolios are scored by the instructor and an outside reader drawn from a writing-in-the-disciplines faculty who represent 24 disciplines. The portfolios are the product of a shared curriculum that uses formative assessment and a program-wide multiple-trait rubric. The study contributes to scholarship on the inter-reader reliability and validity of multiple-trait portfolio assessments as well as to recent discussions about reconceptualizing evidence in ePortfolio assessment. 

  •  Research Questions: Four questions guided our study:
  1. What levels of interrater agreement and reliability can be achieved when assessing complex writing performances that a) contain several different documents to be assessed; b) use a construct-based, multi-trait rubric; c) are designed for formative assessment rather than testing; and d) are rated by a multidisciplinary writing faculty? 
  2.  What can be learned from assessing agreement and reliability of individual traits?
  3. How might these measurements contribute to curriculum design, teacher development, and student learning?
  4. How might these findings contribute to research on fairness, reliability, and validity; rubrics; and multidisciplinary writing assessment?
  • Literature Review: There is a long history of empirical work exploring the reliability of scoring highly controlled timed writings, particularly by test measurement specialists. However, until quite recently, there have been few instances of applying empirical assessment techniques to writing portfolios.  Developed by writing theorists, writing portfolios contain multiple documents and genres and are produced and assessed under conditions significantly different from those of timed essay measurement. Interrater reliability can be affected by the different approaches to reading texts depending on the background, training, and goals of the rater. While a few writing theorists question the use of rubrics, most quantitatively based scholarship points to their effectiveness for portfolio assessment and calls into question the meaningfulness of single score holistic grading, whether impressionistic or rubric-based. Increasing attention is being paid to multi-trait rubrics, including, in the field of writing portfolio assessment, the use of robust writing constructs based on psychometrics alongside the more conventional cognitive traits assessed in writing studies, and rubrics that can identify areas of opportunity as well as unfairness in relation to the background of the student or the assessor. Scholars in the emergent field of empirical portfolio assessment in writing advocate the use of reliability as a means to identify fairness and validity and to create great opportunities for portfolios to advance student learning and professional development of faculty.  They also note that while the writing assessment community has paid attention to the work of test measurement practitioners, the reverse has not been the case, and that conversations and collaborations between the two communities are long overdue.

  • Methodology: We calculated interrater agreement in two ways: as absolute and adjacent agreement percentages, and with Cohen’s Unweighted Kappa, which adjusts observed agreement for the agreement expected by chance. For interrater reliability, we used the Pearson product-moment correlation coefficient. We used SPSS to produce all of the calculations in this study.

  • Results: Interrater agreement and reliability rates of portfolio scores landed in the medium range of statistical significance. Combined absolute and adjacent percentages of interrater agreement were above the recommended 90% range; however, absolute agreement was below the 70% ideal. Furthermore, Cohen’s Unweighted Kappa rates were statistically significant but very low, which may be due to the “kappa paradox.”

  • Discussion: The study suggests that a formative, rubric-based approach to ePortfolio assessment that uses disciplinarily diverse raters can achieve medium-level rates of interrater agreement and reliability. It raises the question of whether absolute agreement is a desirable or even relevant goal for authentic feedback processes that address a complex set of documents and aim to advance student learning. At the same time, our findings point to how agreement and reliability measures can significantly contribute to our assessment process, teacher training, and curriculum. Finally, the study highlights potential concerns about construct validity and rater training.

  • Conclusion: This study contributes to the emergent field of empirical writing portfolio assessment that calls into question the prevailing standard of reliability built upon timed essay measurement rather than the measurement, conditions, and objectives of complex writing performances.  It also contributes to recent research on multi-trait and discipline-based portfolio assessment.  We point to several directions for further research:  conducting “talk aloud” and recorded sessions with raters to obtain qualitative data on areas of disagreement; expanding the number of constructs assessed; increasing the range and granularity of the numeric scoring scale; and investigating traits that are receiving low interrater reliability scores. We also ask whether absolute agreement might be more useful for writing portfolio assessment than reliability and point to the potential “kappa paradox,” borrowed from the field of medicine, which examines interrater reliability in assessment of rare cases. Kappa paradox might be useful in assessing types of portfolios that are less frequently encountered by faculty readers. These, combined with the identification of jagged profiles and student demographics, hold considerable potential for rethinking how to work with and assess students from a range of backgrounds, preparation, and abilities.  Finally, our findings contribute to a growing effort to understand the role of rater background, particularly disciplinarity, in shaping writing assessment. The goals of our assessment process are to ensure that we are measuring what we intend to measure, specifically those things that students have an equal chance at achieving and that advance student learning.  Our findings suggest that interrater agreement and reliability measures, if thoughtfully approached, will contribute significantly to each of these goals.
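
The statistics named in the Methodology, and the kappa paradox raised in the Results and Conclusion, can be illustrated with a minimal Python sketch. This is not the authors' procedure (the study used SPSS), and every score below is invented for illustration: the 1-4 scale, the eight paired ratings, and the two 100-portfolio samples are all assumptions. The function names are likewise hypothetical.

```python
from collections import Counter
from math import sqrt


def agreement_rates(scores_a, scores_b):
    """Return (absolute, adjacent) agreement as proportions of all pairs."""
    n = len(scores_a)
    absolute = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    adjacent = sum(abs(a - b) == 1 for a, b in zip(scores_a, scores_b)) / n
    return absolute, adjacent


def cohen_kappa(scores_a, scores_b):
    """Unweighted kappa: (p_o - p_e) / (1 - p_e).

    p_o is observed agreement; p_e is the agreement expected by chance
    from each rater's marginal score distribution. Undefined if p_e == 1.
    """
    n = len(scores_a)
    p_o = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)


def pearson_r(scores_a, scores_b):
    """Pearson product-moment correlation between two score lists."""
    n = len(scores_a)
    mean_a, mean_b = sum(scores_a) / n, sum(scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    return cov / sqrt(var_a * var_b)


# Hypothetical paired scores (instructor vs. outside reader) on an assumed 1-4 scale.
instructor = [4, 3, 3, 2, 4, 1, 3, 2]
outside = [4, 3, 2, 2, 3, 1, 4, 2]

absolute, adjacent = agreement_rates(instructor, outside)
kappa = cohen_kappa(instructor, outside)
r = pearson_r(instructor, outside)
print(f"absolute={absolute:.3f} adjacent={adjacent:.3f} kappa={kappa:.3f} r={r:.3f}")

# Kappa paradox illustration: both hypothetical samples agree on 90 of 100
# portfolios, but in the skewed sample nearly every portfolio receives a 3.
balanced_a = [3] * 45 + [4] * 45 + [3] * 5 + [4] * 5
balanced_b = [3] * 45 + [4] * 45 + [4] * 5 + [3] * 5
skewed_a = [3] * 90 + [3] * 5 + [4] * 5
skewed_b = [3] * 90 + [4] * 5 + [3] * 5

kappa_balanced = cohen_kappa(balanced_a, balanced_b)
kappa_skewed = cohen_kappa(skewed_a, skewed_b)
print(f"balanced kappa={kappa_balanced:.3f} skewed kappa={kappa_skewed:.3f}")
```

In these invented samples, absolute agreement is 90% in both cases, yet kappa is 0.8 for the balanced scores and slightly below zero for the skewed ones, because chance-expected agreement rises as scores concentrate in one category. Whether this mechanism accounts for the low kappa values reported in the study is left open by the abstract.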

Author Biographies

Valerie Ross, University of Pennsylvania

Valerie Ross is the Director of the Critical Writing Program in the Center for Contemporary Writing at the University of Pennsylvania. Her current research focuses on writing in the disciplines, knowledge transfer, curriculum and assessment development, and writing program administration.

Rodger LeGrand, Massachusetts Institute of Technology

Rodger LeGrand is a lecturer in Writing, Rhetoric, and Professional Communication at the Massachusetts Institute of Technology. Prior to this appointment, he was director of academic administration for the Critical Writing Program at the University of Pennsylvania. In addition to his teaching, research, and administrative work, he is the author of five poetry collections, including Seeds (2017).


