Duck Soup: America Has a Pangram Problem

AI-detection tools are getting better. But they still aren’t good enough.

Basically every recent, high-profile accusation of someone passing off AI-generated writing as their own has started in the same way: with a tool called Pangram. In March, when a horror novel from a major publishing house was pulled just days before its scheduled U.S. release date, it was in part because Pangram, an AI-detection program, had identified the text as AI-generated. Other people have fed text into Pangram to suggest that chatbots have been used to write articles in major newspapers including The New York Times, multiple short stories awarded a prestigious literary prize, and most recently, significant chunks of Pope Leo XIV’s encyclical warning about the dangers of AI. The tool is also used by universities to vet student work and scientific associations to scan research papers. As panic builds over AI-generated writing, Pangram is at the foundation.

Just a few years ago, it seemed like it might never be possible to instantly and reliably determine whether a piece of text was written by a bot or a person. In 2023, one detection tool, ZeroGPT, declared the U.S. Constitution to be AI-written; the same year, OpenAI abandoned its AI detector altogether owing to a “low rate of accuracy.” And that was when the quality of ChatGPT’s writing was markedly worse than it is today. But detection tools have gotten much better of late—and Pangram, in particular, has emerged as the gold standard: Paste a chunk of text into Pangram, and the model appraises what portions were “AI Generated,” “AI Assisted,” or “Human Written.”

Yet an AI detector that is mostly reliable might in some ways be more dangerous than a broken one. While Pangram is accumulating the power to end reputations and careers, the tool does make mistakes, perhaps to a greater extent than is currently understood. In turn, AI accusations could very quickly spiral into a witch hunt.

Pangram says its algorithm is so accurate that it incorrectly identifies text as an AI output only about one in every 10,000 times. “There is a great responsibility, a huge weight” in saying something is AI-generated, Max Spero, Pangram’s CEO, told me. “The only reason we do so is because we’re extremely confident.” Several independent analyses have also confirmed that it is quite good. One paper, from the University of Chicago, found that Pangram had almost no false positives on some 3,000 sample texts of roughly 500 to 1,000 words.

But Pangram’s ability to guarantee something was written by a human is shakier. Spero pointed me to a test showing that Pangram’s false-negative rate, or how frequently the model incorrectly labels text as human, is closer to one in 70 (although some other assessments say it is more accurate than that).
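
To make the two error rates concrete, here is a rough back-of-the-envelope sketch in Python. The one-in-10,000 false-positive rate and the one-in-70 false-negative rate come from the article; the batch size of 100,000 documents and the 20 percent share of AI-written text are invented purely for illustration, and the function name is hypothetical.

```python
def detector_outcomes(total_docs, ai_share, false_positive_rate, false_negative_rate):
    """Expected outcomes when a detector scans a batch of documents.

    ai_share is the (assumed) fraction of the batch that is AI-written.
    """
    ai_docs = total_docs * ai_share
    human_docs = total_docs - ai_docs

    # Human texts wrongly flagged as AI, and AI texts wrongly passed as human.
    false_positives = human_docs * false_positive_rate
    false_negatives = ai_docs * false_negative_rate

    true_positives = ai_docs - false_negatives
    true_negatives = human_docs - false_positives

    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        # Share of "AI" verdicts that are correct.
        "ppv": true_positives / (true_positives + false_positives),
        # Share of "human" verdicts that are correct.
        "npv": true_negatives / (true_negatives + false_negatives),
    }


# Rates quoted in the article; batch size and AI share are assumptions.
result = detector_outcomes(
    total_docs=100_000,
    ai_share=0.2,
    false_positive_rate=1 / 10_000,
    false_negative_rate=1 / 70,
)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed numbers, wrongly accused humans number in the single digits while missed AI texts number in the hundreds, which is one way to read the claim that a "human" verdict is the shakier of the two. Changing `ai_share` shifts both figures, so the real-world balance depends on how much AI text is actually in circulation.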

Part of the problem is that Pangram is in an arms race with the major AI labs, which have an interest in making the writing of ChatGPT and Claude sound as natural and human as possible. And at the same time, Pangram has to deal with AI “humanizers”—programs designed explicitly to disguise AI text as your own. Reddit users rave about a humanizer called Walter Writes AI, which I decided to test out for myself. I had ChatGPT and Claude write brief articles, then pasted them into Walter Writes AI. The program, like other humanizer tools, does some anodyne rewording, swaps one clunky transition clause for another, and introduces grammatical oddities. For instance, ChatGPT’s “The numbers are no longer small enough to ignore” became “The sheer size of these usage figures can no longer be ignored.” When I pasted any output from Walter Writes AI into Pangram, it invariably told me that the twice-baked AI article was human-written. (It’s worth mentioning that The Atlantic forbids using AI-generated text unless labeled as such, and that I do not use AI for research.) [...]

Further complicating matters are the opaque ways in which Pangram and similar tools are designed. The model was trained by feeding it mountains of examples written by a human and by a bot (a book review in an actual magazine, then a review about the same book in the style of the same magazine, but produced by ChatGPT) until it could tell the two apart. This is akin to feeding millions of photos of cats and dogs into an image-recognition algorithm until it learns to spot the differences. Pangram cannot point to much specific evidence or patterns in diction, phrasing, or punctuation to support why it deems something AI or human. (I do not, for instance, understand why “these usage figures” was more human than “the numbers.”) Moreover, while Pangram distinguishes between “lightly” and “moderately AI-assisted,” these broad categories can mean just about anything short of copy-pasting from Claude: using AI for research, for coming up with counterarguments, as a thesaurus, or for a grammar check. The algorithm’s inner workings are “pretty uninterpretable,” Spero said, and although he wants to make Pangram’s “AI-assisted” label more granular, he is also “still not sure how possible it is.” Amid concerns of overreliance on AI chatbots, we risk simply layering on dependence on yet another black-box algorithm.
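
The general idea of training on labeled human and machine examples can be sketched with a toy classifier. This is not Pangram's method: the article does not describe its architecture or features, so a simple naive Bayes model over word counts stands in here, and the four training sentences are made up for illustration. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude whitespace tokenizer; a real system would be far more careful.
    return text.lower().split()


def train(examples):
    """examples: list of (text, label) pairs, label is 'human' or 'ai'."""
    word_counts = {"human": Counter(), "ai": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = set(word_counts["human"]) | set(word_counts["ai"])
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the label with the higher naive Bayes log-score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("human", "ai"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing so unseen (label, word) pairs are not zero.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy data: paired "reviews", one set per label.
training_examples = [
    ("the plot drags but the ending lands hard", "human"),
    ("i laughed twice and cried once reading this", "human"),
    ("the novel is a rich tapestry of themes", "ai"),
    ("this book delves into a tapestry of emotion", "ai"),
]
model = train(training_examples)
print(classify(model, "a tapestry of themes"))
print(classify(model, "the ending lands hard"))
```

A model this small can at least be inspected word by word. The "pretty uninterpretable" quality Spero describes would apply to much larger learned models, where no comparable table of word counts exists to consult.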

Spero told me that Pangram should “never be the ending arbiter” but instead a starting point for a more thorough investigation, and that the company looks into every reported error its model makes. He also noted that all sorts of detection technology we rely on—smoke detectors, TSA scanners—have base error rates too. On some level, in all these cases the biggest problems lie not in the technologies themselves but in what they’re trying to detect. It’s a problem that buildings catch on fire. It’s a problem that AI is seeping haphazardly into every facet of written communication.

by Matteo Wong, The Atlantic

Image: Atlantic/Getty
[ed. This seems like a transient issue to me. If AI is eventually able to write something (or create art) that's indistinguishable from what a human would produce, who cares? (Except for writers and artists, obviously.) You don't see this controversy in coding. See also: AI-Writing Scandals Are Getting Very Confusing (Atlantic). Also via DWAtV:

***

Again, we learn not that AI is a good writer, or that humans are bad writers, but that the literary prize judgment processes are worthless.

Jack: That which can be won with undisclosed AI output should be

Nabeel S. Qureshi: *Another* apparently AI-generated story wins a literary prize, this time judged by a panel including the novelist Ruth Ozeki.

Literary prizes need to start including Pangram checks in their process, or else change the rules to make AI writing ok. It’s very simple! [...]

How should we think about ‘witch hunts’ where people identify writing as AI?

Shashank Joshi: One of the worst trends of recent months: pseudoscientific witch-hunts using AI detection tools

The hunts are fully scientific. The detection tools work, at least for now. I have yet to see a case where Pangram said something was AI, and the piece was neither written using AI nor crafted intentionally to fool Pangram. There are some cases of heavy copyediting that trigger Pangram, but if it’s heavy enough to trigger Pangram then I consider that to be on you.

Thursday, June 25, 2026
