The Fatal Race Condition: QA Lessons from the Therac-25 Tragedy
We are surrounded by the work of QA testers. Every social media update, every fintech app, every e-commerce platform passes through rounds of beta testing and bug bashing before it reaches your phone. If a photo fails to upload or a cart doesn't refresh, you close the app, reopen it, and move on with your life. It's annoying. It's not dangerous.
But this everyday experience with software has quietly seeded a dangerous assumption: that testing catches everything, and that the engineers building the tools we depend on are always paying attention. In the world of safety-critical systems—medicine, aerospace, nuclear power—that assumption has cost people their lives. In these domains, a software bug is not a "glitch." It is a systemic failure. And no story illustrates this more viscerally than the tragedy of the Therac-25.
A Machine Built on Trust
The Therac-25 was a radiation therapy machine, a linear accelerator built by Atomic Energy of Canada Limited in the early 1980s. It was cutting-edge technology. Where its predecessors had relied on physical, mechanical safeguards to control radiation output, the Therac-25 put its faith almost entirely in software. The code itself was the safety net.
Between 1985 and 1987, that net failed catastrophically. At least six patients received massive radiation overdoses. The numbers are hard to comprehend: a standard therapeutic dose is around 200 rads. These patients received between 13,000 and 25,000 rads. Three of them died. The survivors lost limbs. Their internal organs failed. Their injuries were permanent and profound. The machine that was supposed to heal them had, in the most literal sense, burned them from the inside.
The Bug That Nobody Noticed
To understand what went wrong, you have to picture a nurse sitting at a VT100 terminal in 1985, typing quickly.
The Therac-25 had two treatment modes: X-ray mode, which drove the electron beam at roughly a hundred times higher intensity into a metal target to produce X-rays, and Electron mode, which delivered a far weaker beam directly to the patient. The operator would type in the treatment parameters, and if they made a mistake and corrected it quickly enough, something extraordinary and terrible would happen.
Two separate threads of software logic were running in parallel. One thread handled what appeared on the operator's screen. The other controlled the physical position of the hardware inside the machine. If the operator selected X-ray mode, caught the error, and corrected it to Electron mode, all within about 8 seconds, the screen would obediently update to show "Electron Mode." The nurse would see the right setting. She had no reason to doubt it.
But the hardware thread hadn't kept up. The beam was still configured for X-ray mode, at maximum power, with none of the protective shielding in place.
This is called a race condition: two processes racing to complete, and when the wrong one wins, the system enters a state that nobody designed for, nobody tested for, and nobody noticed until patients started dying.
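To make the hazard concrete, here is a minimal sketch in C. This is not the Therac-25's actual code (that was PDP-11 assembly), and every name here (`requested`, `displayed`, `hardware`) is invented for illustration, with the timing grossly simplified. The display thread tracks the operator's input almost instantly, while the hardware thread lags by several seconds, so a fast correction leaves the screen and the machine disagreeing:

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef enum { MODE_XRAY, MODE_ELECTRON } beam_mode;

/* Shared state with no synchronization -- deliberately reproducing
 * the flaw. Each thread acts on the operator's request at its own pace. */
static volatile beam_mode requested = MODE_XRAY;  /* operator's last entry */
static volatile beam_mode displayed = MODE_XRAY;  /* what the screen shows */
static volatile beam_mode hardware  = MODE_XRAY;  /* actual beam setup     */

static void *display_thread(void *arg) {
    (void)arg;
    for (;;) {
        displayed = requested;   /* the screen tracks the request instantly */
        usleep(10000);
    }
}

static void *hardware_thread(void *arg) {
    (void)arg;
    for (;;) {
        sleep(8);                /* magnets take ~8 s to reposition */
        hardware = requested;    /* the machine catches up much later */
    }
}

int main(void) {
    pthread_t d, h;
    pthread_create(&d, NULL, display_thread, NULL);
    pthread_create(&h, NULL, hardware_thread, NULL);

    requested = MODE_ELECTRON;   /* operator corrects the entry quickly */
    sleep(1);                    /* well inside the 8-second window     */

    /* The screen reassures the operator; the beam does not agree. */
    printf("screen shows: %s, beam configured for: %s\n",
           displayed == MODE_ELECTRON ? "ELECTRON" : "XRAY",
           hardware  == MODE_ELECTRON ? "ELECTRON" : "XRAY");
    return 0;
}
```

Run it and the output reads `screen shows: ELECTRON, beam configured for: XRAY`. Nothing crashed, nothing logged an error; the two views of reality simply diverged.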
Why Did No One Catch This?
The most haunting part of the Therac-25 story is not the bug itself. It's how easily it could have been caught, and why it wasn't.
The race condition had actually existed in an earlier machine, the Therac-20. But the Therac-20 had physical hardware interlocks: mechanical switches that would physically block the beam if the settings were wrong. The software could get confused all it liked. The hardware would step in and say no. So the bug was invisible. It never caused harm because the machine itself was the last line of defense.
When engineers developed the Therac-25, they copied much of the existing code. After all, it had worked fine in the Therac-20. What they didn't carry over were the hardware interlocks. The physical veto was gone. And with it, the invisible safety net that had been quietly correcting a flaw no one knew existed.
The code also suffered from a lack of what engineers call defensive programming. At no point did the software stop and ask the hardware, "Are you actually in the position you're supposed to be in?" It simply assumed the command had been followed. It sent the order and moved on.
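Here is a hedged sketch of what that missing check might look like, again in C with hypothetical names (`read_turntable_sensor` and `safe_to_fire` are illustrations, not the device's real API): the software reads the hardware's actual state back from a sensor and refuses to proceed on a mismatch, rather than trusting its own bookkeeping.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { MODE_XRAY, MODE_ELECTRON } beam_mode;

/* Stub sensor for this sketch: the turntable is still in the X-ray
 * position even though Electron mode was commanded. */
static beam_mode read_turntable_sensor(void) {
    return MODE_XRAY;
}

/* Defensive check: compare the hardware's reported state against the
 * commanded state before permitting the beam to fire. */
static bool safe_to_fire(beam_mode commanded) {
    beam_mode actual = read_turntable_sensor();
    if (actual != commanded) {
        fprintf(stderr, "interlock: commanded mode %d, sensor reports %d\n",
                (int)commanded, (int)actual);
        return false;            /* never assume the command was obeyed */
    }
    return true;
}

int main(void) {
    if (!safe_to_fire(MODE_ELECTRON))
        puts("beam inhibited");  /* the mismatch blocks treatment */
    return 0;
}
```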
How We Build Safer Software Today
The Therac-25 disasters reshaped how medical software is engineered, regulated, and tested. Today, the international standard governing medical device software is IEC 62304. It requires that software be classified by risk level, from Class A (low risk) to Class C (high risk). For devices like radiation therapy machines, every single line of code must be documented, audited, and tested against failure scenarios.
Independent Verification and Validation (IV&V) is now standard practice. The team that tests the software cannot be the same team that wrote it. This exists because the people who built something are the least likely to spot its flaws.

Modern radiation systems also use closed-loop feedback. If the software commands the hardware to enter Electron mode, the hardware sensors must send back a confirmed signal before the beam is allowed to fire. If those signals don't match, the system performs a hard fail. It cuts power completely. It does not guess, and it does not proceed.
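As a rough illustration of that closed loop, in C with hypothetical names (`sense_hardware_mode` and `hard_fail` are invented for this sketch, not any real device's interface): the beam is enabled only after the sensed state confirms the commanded one, and any disagreement de-energizes the system outright instead of retrying.

```c
#include <stdio.h>
#include <stdlib.h>

typedef enum { MODE_XRAY, MODE_ELECTRON } beam_mode;

/* Stub readback for this sketch: the hardware has not yet reached
 * the commanded state. */
static beam_mode sense_hardware_mode(void) {
    return MODE_XRAY;
}

/* Hard fail: on any mismatch, cut power completely. No guessing,
 * no retrying, no proceeding on a best effort. */
static void hard_fail(const char *why) {
    fprintf(stderr, "HARD FAIL: %s -- beam power cut\n", why);
    exit(EXIT_FAILURE);   /* stands in for de-energizing the machine */
}

static void fire_beam(beam_mode commanded) {
    if (sense_hardware_mode() != commanded)
        hard_fail("sensor readback does not match commanded mode");
    puts("beam enabled");  /* reached only on a confirmed match */
}

int main(void) {
    fire_beam(MODE_ELECTRON);  /* mismatched in this sketch: hard fail */
    return 0;
}
```

The design choice worth noticing is the default: the unsafe path requires positive confirmation, while the safe path (power off) requires nothing at all. The Therac-25 had this inverted.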
The Takeaway
The Therac-25 is a story about the danger of reusing code without understanding why it worked in the first place. The most dangerous variable in a safety-critical system is not always the operator who makes a typo. It's the engineer who assumes the software is already correct. In medicine, QA is not a final formality before launch. It is the last and sometimes only thing standing between a line of buggy code and a person's life.
References & Further Reading
- Nancy Leveson and Clark Turner's definitive academic investigation, "An Investigation of the Therac-25 Accidents" (IEEE Computer, 1993)
- FDA Case Study on Software Safety and pre-market approval history
- IEEE Spectrum retrospective on the engineering culture of the era
- ComputingCases.org Case Study: Institutional and social factors of the accidents