Surely something like this was inevitable.
Students sitting for their STAAR exams this week will be part of a new method of evaluating Texas schools: Their written answers on the state’s standardized tests will be graded automatically by computers.
The Texas Education Agency is rolling out an “automated scoring engine” for open-ended questions on the State of Texas Assessment of Academic Readiness for reading, writing, science and social studies. The technology, which uses natural language processing, the same building block behind artificial intelligence chatbots such as GPT-4, will save the agency an estimated $15 million to $20 million per year that it would otherwise have spent hiring human scorers through a third-party contractor.
The change comes after the STAAR test, which measures students’ understanding of state-mandated core curriculum, was redesigned in 2023. The test now includes fewer multiple-choice questions and more open-ended questions, known as constructed-response items; the redesigned test has six to seven times as many of them as before.
“We wanted to keep as many constructed open ended responses as we can, but they take an incredible amount of time to score,” said Jose Rios, director of student assessment at the Texas Education Agency.
Rios said TEA hired about 6,000 temporary scorers in 2023; this year, it will need fewer than 2,000.
To develop the scoring system, the TEA gathered 3,000 responses that went through two rounds of human scoring. The automated scoring engine learns the characteristics of responses from this field sample and is programmed to assign the same scores a human would have given.
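The Tribune story doesn’t say what the engine looks like under the hood, so take this with a grain of salt, but here’s a minimal sketch of the general approach, assuming a simple supervised classifier trained on the double-scored field sample and checked against human scorers with quadratic weighted kappa, the standard agreement metric for automated essay scoring. The library, the toy responses, and the pipeline here are all illustrative, not TEA’s actual system.

```python
# Hypothetical sketch of training an automated scoring engine on a
# human-scored field sample. This is NOT TEA's system; it assumes a
# plain supervised text classifier over rubric scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline

# Toy stand-in for the field sample: (response_text, human_score) pairs.
# The real sample was about 3,000 responses, each scored twice by humans.
field_sample = [
    ("The author uses vivid storm imagery to support the central idea.", 2),
    ("The writer talks about the storm and how it was scary for the town.", 1),
    ("storm big", 0),
    ("Strong textual evidence connects the flood to the theme of loss.", 2),
    ("It mentions rain happening in the story.", 1),
    ("idk", 0),
]
texts, scores = zip(*field_sample)

# Treat each rubric point (0, 1, 2) as a class; predict_proba later doubles
# as a rough confidence signal for routing low-confidence responses.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, scores)

# Quadratic weighted kappa rewards near-misses and penalizes big disagreements,
# which is why it's the usual yardstick for automated essay scoring.
predictions = model.predict(texts)  # in practice, evaluate on held-out responses
print("QWK vs. human scores:", cohen_kappa_score(scores, predictions, weights="quadratic"))
```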
This spring, as students complete their tests, the computer will first grade all the constructed responses. Then, a quarter of the responses will be rescored by humans.
When the computer has “low confidence” in a score it has assigned, that response will be automatically routed to a human. The same will happen when the computer encounters a type of response its programming does not recognize, such as one using lots of slang or words in a language other than English.
“We have always had very robust quality control processes with humans,” said Chris Rozunick, division director for assessment development at the Texas Education Agency. With a computer system, the quality control looks similar.
Every day, Rozunick and other testing administrators will review a summary of results to check that they match what is expected. In addition to “low confidence” scores and responses that do not fit in the computer’s programming, a random sample of responses will also be automatically handed off to humans to check the computer’s work.
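Again, purely as illustration: the routing described above, with low-confidence scores, unrecognized response types, and a random audit sample all handed to humans, might look something like the sketch below. The 0.80 confidence floor and the out-of-scope heuristic are my assumptions, not anything TEA has published.

```python
# Hypothetical sketch of the human-in-the-loop routing described above.
import random
from typing import Sequence

# Assumed parameters, purely illustrative; TEA has not published its rules.
CONFIDENCE_FLOOR = 0.80  # below this, the engine's score counts as "low confidence"
AUDIT_RATE = 0.25        # share of responses randomly rescored by humans

def looks_out_of_scope(text: str) -> bool:
    """Crude stand-in for detecting responses the engine's programming does
    not recognize, such as heavy slang or a language other than English."""
    ascii_ratio = sum(c.isascii() for c in text) / max(len(text), 1)
    return ascii_ratio < 0.9

def route(text: str, score_probs: Sequence[float]) -> str:
    """Decide whether a machine-scored response also goes to a human.
    score_probs is the engine's per-score confidence distribution, e.g.
    the predict_proba output of the model sketched earlier."""
    if looks_out_of_scope(text):
        return "human: unrecognized response type"
    if max(score_probs) < CONFIDENCE_FLOOR:
        return "human: low-confidence score"
    if random.random() < AUDIT_RATE:
        return "human: random audit of the computer's work"
    return "machine score stands"

# Example: a confident score that may still be pulled for the random audit.
print(route("The author uses imagery to show the storm's power.", [0.05, 0.10, 0.85]))
```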
TEA officials have resisted the suggestion that the scoring engine is artificial intelligence. It may use technology similar to chatbots such as GPT-4 or Google’s Gemini, but the agency has stressed that the process will have systematic human oversight. The engine won’t “learn” from one response to the next; it will always defer to the original programming set up by the state.
“We are way far away from anything that’s autonomous or can think on its own,” Rozunick said.
But the plan has still generated worry among educators and parents in a world still wary of the influence of machine learning, automation and AI.
You can read on for a list of those concerns, which are about what you’d expect: this is too new and untested, it’s not fair to the kids or to the schools given the high stakes of the STAAR, it’s being sprung on us with no notice, AI isn’t good enough yet to tackle a task like this, and so on. There is a way to challenge the score you get, whether from automated grading or a human reviewer: it costs $50 to request a re-grade, and the fee is refunded if the score goes up as a result, but I didn’t see anything to suggest that those who can’t afford the fee up front can get the same consideration. It’s nice that this will save a few bucks, but at a time when the state is sitting on a $30 billion surplus and no new funds were allocated to public schools because Greg Abbott held it all hostage to his voucher dreams, any talk of savings has a bitter tinge.
As I said, I think we could all see this sort of thing coming. Humans aren’t infallible graders either, and this will surely take less time. Maybe it will get to a point where the process is seen as more objective and thus fairer, or at least less subject to random factors. Whenever this was done for the first time, the same complaints were going to be raised. I fully expect there to be problems, some of which will seem unbelievable and ridiculous, but I also expect that over time, probably less time than we think, it will improve to the point where few people will think the old way was better.

I still wouldn’t want my kid to be one of the beta testers for this, and I fully sympathize with the fears expressed by teachers, administrators, and everyone else. I hope it’s done reasonably well, and I hope the TEA responds quickly and compassionately to the problems that will arise. And we still need to elect a better government, because everything that’s been happening with public education lately, among many other things, is just screwed up and we deserve so much better.
If you have ever graded anything, being reassured that the job is going to be handled by something that is NOT “intelligence” is hardly reassuring.
Then again, it stands to reason that anyone sanguine about having robot cars aimed at them would also be OK with letting computers determine how their kid’s school will be funded.