You’ve probably already seen many screenshots from the recent, life-affirmingly funny rollout of Google search’s new “AI Overview” feature, which uses a large language model (of the kind that powers, e.g., ChatGPT) to synthesize and summarize search results into a paragraph or two of text, placed prominently above the results. These “overviews,” synthesized from Google search results, are frequently incorrect or incoherent, often in ways that no human would ever be (screenshots below from Twitter and Bluesky; (...)
How do Google search results work?
One particularly fascinating aspect of the AI Overview fiasco is the manner of its failure. Maybe the greatest of the AI Overview errors was the cheerful suggestion to “add 1/8 a cup of non-toxic glue to [pizza] sauce to give it more tackiness,”1 which appears to come from a decade-old joke comment by someone called “Fucksmith” located deep in a Reddit thread somewhere.
In other words, the AI Overview errors aren’t the familiar hallucinations that open-ended large language models like ChatGPT will produce as a matter of course, but something like their opposite. If hallucinations are plausible falsehoods--statements that sound correct but have no grounding in reality--these are implausible “facts”--statements that are facially untrue but have direct textual grounding. Unlike the responses produced by ChatGPT, every string of text in an AI Overview has a specific (and often a cited) source that the model is dutifully summarizing and synthesizing, as it has been directed to do by its prompting. It’s just that, well, sometimes the source is Fucksmith.
In its statement, Google is eager to make clear that the AI Overviews don’t “hallucinate,” but I’m not sure this is entirely flattering to Google. What it suggests is that Google’s problem here is not so much a misunderstanding of what LLMs are good at and what they’re for, but--more troublingly--a misunderstanding of what Google is good at and what it’s for. Deploying the AI Overview model to recapitulate a set of specific and narrowly selected texts--that is, the search results--strikes me as a relatively legitimate and logical use of a large language model, insofar as it takes advantage of the technology’s strengths (synthesizing and summarizing information) and minimizes its weaknesses (making bullshit up all the time).2 The LLM is even doing a pretty good job at creating a syntactically coherent summary of the top results--it’s just that “the top results” for any given search string often don’t represent a coherent body of text that can be sensibly synthesized--nor, as Mike Caulfield explains in an excellent post on why this style of synthetic “overview” drawn from Google results is poorly adapted to the task of search, should they:
a good search result for an ambiguous query will often be a bit of a buffet, pulling a mix of search results from different topical domains. When I say, for example, “why do people like pancakes but hate waffles” I’m going to get two sorts of results. First, I’ll get a long string of conversations where the internet debates the virtues of pancakes vs. waffles. Second, I’ll get links and discussion about a famous Twitter post about the hopelessness of conversation on Twitter […] For an ambiguous query, this is a good result set. If you see a bit of each in the first 10 blue links you can choose your own adventure here. It’s hard for Google to know what matches your intent, but it’s trivially easy for you to spot things that match your intent. In fact, the wider apart the nature of the items, the easier it is to spot the context that applies to you. So a good result set will have a majority of results for what you’re probably asking and a couple results for things you might be asking, and you get to choose your path.

You cannot combine this intentionally diverse array of results into a single coherent “answer,” and, it’s worth saying again, this is a strength of the Google search product. It is possible I am in a minority here, but speaking for myself I want to see a selection of different possible results and use the brain my ancestors spent hundreds of millions of years evolving to determine the context, tone, and intent.
Likewise, when you put in terms about cheese sliding off pizza, Google could restrict the returned results to recipe sites, whose advice would be relatively glue-free. But maybe you want advice, or maybe you want to see discussion. Maybe you want to see jokes. Maybe you are looking for a very specific post you read, and not looking to solve an issue at all, in which case you just want relevance completely determined by closeness of word match. Maybe you’re looking for a movie scene you remember about cheese sliding off pizza.
LLMs vs. Platforms
Google’s response to complaints about AI Overview errors has been to dismiss the examples as “uncommon queries.” That’s probably true, not just in the specific sense (it is uncommon to Google “how many rocks should I eat”) but also in the general sense that most Google searches are essentially navigational (i.e., looking for a specific official website you want to visit) or for pretty straightforward factual questions like “what time is the Super Bowl,” where a user probably doesn’t actually want or need domain diversity in their results. But the addition of the AI Overview module at the top of the results isn’t merely a problem in the case of those “uncommon queries” where it produces implausible “facts” based on shitposts. It’s a problem everywhere, because it fundamentally changes what Google presents itself as.
by Max Read, Read Max | Read more:
Images: Twitter/Bluesky