This week marks the end of the Kaggle Automated Essay Scoring Competition. I participated as a way to build my machine learning skills after learning the basics in Andrew Ng’s online class.
The goal of the competition was to develop algorithms that could automatically score student essays for standardized achievement tests. Kaggle (and the sponsoring Hewlett Foundation) provided thousands of student responses and scores for eight essay prompts in a variety of formats. After the competition, the organizers will compare the best results to the predictions of several commercial grading packages.
Optimizing machine learning algorithms is challenging, as there is a great deal of freedom in constructing them. One must first choose and extract the best numerical features from the text. In this case, these may range from simple counts of the number of words in the essay or the number of misspellings to sophisticated measures of semantic similarity. Next, one must select the best algorithms with which to model the features (Linear Regression and its variants, Neural Networks, Support Vector Machines, Random Forests, Gradient Boosting…). Many of these algorithms require tuning to achieve the best performance. Finally, one may use a variety of ensemble methods to blend the results of multiple algorithms in order to obtain optimal predictions.
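To give a concrete flavor of the first two steps, here is a minimal sketch in Python with scikit-learn. It is purely illustrative (not my competition code), and the feature choices and parameters are stand-ins:

```python
# Illustrative essay-scoring pipeline: hand-crafted features + Random Forest.
# Not my competition code; features and parameters are stand-ins.
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(essay):
    """Turn one essay into a small vector of simple numeric features."""
    words = re.findall(r"[A-Za-z']+", essay)
    return [
        len(essay),                           # number of characters
        len(words),                           # number of words
        len(set(w.lower() for w in words)),   # vocabulary size
        essay.count(",") + essay.count(";"),  # crude punctuation count
    ]

def fit_scorer(essays, scores):
    """Fit a Random Forest to predict human-assigned scores."""
    X = np.array([extract_features(e) for e in essays])
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X, scores)
```

Blending, the final step, would then amount to training several such models with different features or algorithms and combining their predictions, for example by simple averaging.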
Given this complexity, I didn’t have high expectations for my results in my first real machine learning task. My performance was decent: I placed 19th in the final standings. Our results were scored with a measure called “Quadratic Weighted Kappa,” which assesses the consistency of the grades assigned by two different raters. A score of zero indicates only chance-level agreement, while a score of 1 indicates perfect agreement. My best scores, averaged over all eight essays, were about 0.74, while the highest-scoring submissions are around 0.80.
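For the curious, the metric itself is straightforward to compute from the two raters’ scores. Here is a from-scratch sketch (my own illustration, not the official evaluation script):

```python
# From-scratch quadratic weighted kappa (illustration, not the official script).
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """1.0 means perfect agreement; 0.0 means chance-level agreement."""
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    n = max_rating - min_rating + 1

    # Observed matrix: O[i, j] counts essays rated i by A and j by B.
    O = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating, b - min_rating] += 1

    # Expected matrix under independence, scaled to the same total count.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(rater_a)

    # Quadratic weights: penalty grows with the squared distance between ratings.
    i, j = np.indices((n, n))
    W = (i - j) ** 2 / (n - 1) ** 2

    return 1.0 - (W * O).sum() / (W * E).sum()
```

For example, ratings of [4, 3, 2, 1] against [4, 3, 2, 2] on a 1-to-4 scale give a kappa of 0.875.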
By far the most predictive feature was essay length: the number of characters accounted for 64% of the feature weight in my Random Forest classifier, followed by determiners (the DT part-of-speech tag) at 12%, misspelled words at 11%, the number of words at 3.5%, and comparative adjectives (JJR) at 1.4%. Everything else was below 1%.
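Those percentages come from the forest’s built-in feature importances. In scikit-learn terms, reading them off looks roughly like this (with hypothetical feature names and dummy data standing in for my actual feature matrix):

```python
# Sketch of reading feature weights off a fitted Random Forest.
# Feature names and data are stand-ins for my actual feature matrix.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["n_characters", "n_determiners", "n_misspellings",
                 "n_words", "n_comparative_adjectives"]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))   # dummy feature matrix
y = rng.integers(0, 4, size=500)                 # dummy essay scores

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, weight in sorted(zip(feature_names, forest.feature_importances_),
                           key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {weight:.1%}")
```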
Audrey Watters wrote a post about the contest and included a link to the paper (Shermis & Hamner 2012) assessing the performance of commercial vendors on this dataset. Surprisingly, four of the nine commercial vendors scored worse than I did! My score was also right around the level of consistency human raters achieve with each other (0.75). The best commercial scores were 0.77 and 0.79, so it looks like the Kaggle winner will have beaten all the commercial packages.
It’s extremely interesting that these algorithms can be as consistent with human raters as humans are with each other despite not “reading” the essays in a way comparable to human reading.
While participating in this competition has been an interesting and challenging learning experience, I can’t help but have reservations about the use of algorithmic essay grading in education. (Inside Higher Ed quotes Shermis arguing that automated graders could be a useful supplement in entry-level writing classrooms.)
The standardized testing paradigm is indicative of modern education’s roots in the Industrial Revolution–students as standardized widgets. Writing to please an algorithm seems especially dehumanizing, though, and there’s no learning happening here. The black box spits out a score, providing no feedback a student could use to improve their logic, use of evidence, rhetoric, grammar, usage, style–in short, any of the things we value in writing. A teacher individually engaging with a student and her writing over a period of time is still the best way to improve reasoning and its expression.
Moreover, because the algorithms are trained on a representative sample of essays, automated essay scoring necessarily devalues idiosyncratic or creative responses. Can you imagine how an algorithm would score an essay by Emily Dickinson (improper punctuation use) or Ernest Hemingway (short, simple sentences)?
To be clear, I appreciate the value of these tools. You can’t improve student learning if you don’t measure it, and human graders are costly, cursory readers. But I think there’s a direct analogy between automated essay grading and standardized testing itself. In both cases, you’re trying to assess the quality of a multifaceted entity (the essay or the individual) in terms of measurable quantities. The temptation is to conflate measurability with importance.
As Shermis & Hamner themselves write,
A predictive model may do a good job of matching human scoring behavior, but do this by means of features and methods which do not bear any plausible relationship to the competencies and construct that the item aims to assess. To the extent that such models are used, this will limit the validity argument for the assessment as a whole.
Assessments have real-world consequences. When an algorithm becomes the measure of quality of an essay, or a standardized test determines the classes a student can take or the university she attends, correlation becomes causation. Longer essays score higher, and students who score better on tests have better opportunities.
Instead of sorting students along arbitrary scales, I think we’d be better served by trying to maximize the unique potential of each individual.
For fun, I fed this blog entry into my essay scoring model. Despite having nothing to do with the essay prompts, it earned scores ranging from 64% to a whopping 95% of the maximum, with a mean of 82%.