Type a few words into Google and hit ‘return’. Almost instantaneously, a list of links will appear. To find them, you may have to scroll past a bit of clutter – ads and, these days, an ‘AI Overview’ – but even if your query is obscure, and mine often are, it’s nevertheless quite likely that one of the links on your screen will take you to what you’re looking for. That’s striking, given that there are probably more than a billion sites on the web, and more than fifty times as many webpages.
On the foundation of that everyday miracle, a company currently worth around $3 trillion was built, yet today the future of Google is far from certain. It was founded in September 1998, at which point the world wide web, to which it became an indispensable guide, was less than ten years old. Google was by no means the first web search engine, but its older competitors had been weakened by ‘spamming’, much of it by the owners of the web’s already prevalent porn sites. Just as Google was to do, these early search engines deployed ‘web crawlers’ to find websites, ingest their contents and assemble an electronic index of them. They then used that index to find sites whose contents seemed the best match to the words in the user’s query. A spammer such as the owner of a porn site could plaster their site with words which, while irrelevant to the site’s content, were likely to appear in web searches. Often hidden from the users’ sight – encoded, for example, in the same colour as the background – those words would still be ingested by web crawlers. By the late 1990s, it was possible, even usual, to enter an entirely innocent search query – ‘skiing’, ‘beach holidays’, ‘best colleges’ – and be served up a bunch of links to porn.
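To see why keyword stuffing worked, here is a minimal sketch of the kind of inverted index those early engines built from crawled text (the code, URLs and example pages are hypothetical, for illustration only): every ingested word points back to the pages that contain it, hidden or not, and a query simply intersects those lists, with nothing to push the spam down.

```python
# A minimal, hypothetical sketch of a keyword-based inverted index of the kind
# early search engines built from crawled pages. Real systems were far more
# elaborate; this only illustrates why hidden spam words were enough.
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> full page text (including any hidden words)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)   # every ingested word points back to the page
    return index

def search(index, query):
    """Return the pages containing all the query words, in no particular order."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

pages = {
    "https://example.com/ski-guide": "skiing holidays in the alps",
    "https://example.com/spam": "buy now " + "skiing beach holidays " * 50,  # stuffed keywords
}
index = build_index(pages)
print(search(index, "skiing holidays"))   # both pages match; nothing ranks the spam site down
```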
In the mid to late 1990s, Google’s co-founders, Larry Page and Sergey Brin, were PhD students at Stanford University’s Computer Science Department. One of the problems Page was working on was how to increase the chances that the first entries someone would see in the comments section on a website would be useful, even authoritative. What was needed, as Page told Steven Levy, a tech journalist and historian of Google, was a ‘rating system’. In thinking about how websites could be rated, Page was struck by the analogy between the links to a website that the owners of other websites create and the citations that an authoritative scientific paper receives. The greater the number of links, the higher the probability that the site was well regarded, especially if the links were from sites that were themselves of high quality.
Using thousands of human beings to rate millions of websites wasn’t necessary, Page and Brin realised. ‘It’s all recursive,’ as Levy reports Page saying in 2001. ‘How good you are is determined by who links to you,’ and how good they are is determined by who links to them. ‘It’s all a big circle. But mathematics is great. You can solve this.’ Their algorithm, PageRank, did not entirely stop porn sites and other spammers infiltrating the results of unrelated searches – one of Google’s engineers, Matt Cutts, used to organise a ‘Look for Porn Day’ before each new version of its web index was launched – but it did help Google to improve substantially on earlier search engines.
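The recursion Page describes can be computed directly. Below is a toy sketch of the PageRank iteration (illustrative Python with made-up site names; the 0.85 damping factor is the value used in Brin and Page's original paper, but nothing else here is Google's actual code): each site's score is repeatedly recomputed from the scores of the sites that link to it, until the numbers settle.

```python
# Toy PageRank by power iteration: a page's score is built from the scores of
# the pages that link to it. Illustrative only; assumes every page has at least
# one outgoing link, and ignores the refinements a real implementation needs.
def pagerank(outlinks, damping=0.85, iterations=50):
    """outlinks: dict mapping page -> list of pages it links to."""
    pages = list(outlinks)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum the contributions from every page q that links to p.
            incoming = sum(rank[q] / len(outlinks[q])
                           for q in pages if p in outlinks[q])
            new_rank[p] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}
print(pagerank(web))  # c.example scores highest: both of the other sites link to it
```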
Page’s undramatic word ‘recursive’ hid a giant material challenge. You can’t find the incoming links to a website just by examining the website itself. You have to go instead to the sites that link to it. But since you don’t know in advance which they are, you will have to crawl large expanses of the web to find them. The logic of what Page and Brin were setting out to do involved them in a hugely ambitious project: to ingest and index effectively every website in existence. That, in essence, is what Google still does. (...)
A quite different, and potentially more serious, threat to Google is a development that it did a great deal to foster: the emergence of large language models (LLMs) and the chatbots based on them, most prominently ChatGPT, developed by the start-up OpenAI. Google’s researchers have worked for more than twenty years on what a computer scientist would call ‘natural language processing’ – Google Translate, for example, dates from 2006 – and Google was one of the pioneers in applying neural networks to the task. These are computational structures (now often gigantic) that were originally thought to be loosely analogous to the brain’s array of neurons. They are not programmed in detail by their human developers: they learn from examples – these days, in many cases, billions of examples.
The efficiency with which a neural network learns is strongly affected by its structure or ‘architecture’. A pervasive issue in natural language processing, for example, is what linguists call ‘coreference resolution’. Take the sentence: ‘The animal didn’t cross the street because it was too tired.’ The ‘it’ could refer to the animal or to the street. Humans are called on to resolve such ambiguities all the time, and if the process takes conscious thought, it’s often a sign that what you’re reading is badly written. Coreference resolution is, however, a much harder problem for a computer system, even a sophisticated neural network.
In August 2017, a machine-learning researcher called Jakob Uszkoreit uploaded to Google’s research blog a post about a new architecture for neural networks that he and his colleagues called the Transformer. Neural networks were by then already powering Google Translate, but still made mistakes – in coreference resolution, for example, which can become embarrassingly evident when English is translated into a gendered language such as French. Uszkoreit’s example was the sentence I have just quoted. ‘L’animal’ is masculine and ‘la rue’ feminine, so the correct translation should end ‘il était trop fatigué,’ but Google Translate was still rendering it as ‘elle était trop fatiguée,’ presumably because in the sentence’s word order ‘street’ is closer than ‘animal’ to the word ‘it’.
The Transformer, Uszkoreit reported, was much less likely to make this sort of mistake, because it ‘directly models relationships between all words in a sentence, regardless of their respective position’. Before this, the general view had been that complex tasks such as coreference resolution require a network architecture with a complicated structure. The Transformer was structurally simpler, ‘dispensing with recurrence and convolutions entirely’, as Uszkoreit and seven current or former Google colleagues put it in a paper from 2017. Because of its simplicity, the Transformer was ‘more parallelisable’ than earlier architectures. Using it made it easier to divide language processing into computational subtasks that could run simultaneously, rather than one after the other.
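The core of that direct modelling is the attention operation. A bare-bones sketch (illustrative only; a real Transformer adds learned projection matrices, multiple attention heads and many stacked layers) shows how every position attends to every other position in a single matrix product, however far apart the words are, which is also what makes the computation so parallelisable.

```python
# Bare-bones scaled dot-product attention, the core operation of the Transformer.
# Illustrative sketch only: a real model uses learned query/key/value projections,
# multiple heads and many stacked layers.
import numpy as np

def attention(Q, K, V):
    """Each row of Q attends to every row of K, regardless of distance."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise word-to-word affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all positions
    return weights @ V                               # weighted mix of every word's value

# Eight words, each represented here by a made-up 4-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
out = attention(X, X, X)   # all eight positions are processed in one matrix product
print(out.shape)           # (8, 4)
```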
Just as Dean and Ghemawat had done, the authors of the Transformer paper made it publicly available, at Neural Information Processing Systems, AI’s leading annual meeting, in 2017. One of those who read it was the computer scientist Ilya Sutskever, co-founder of OpenAI, who says that ‘as soon as the paper came out, literally the next day, it was clear to me, to us, that transformers address the limitations’ of the more complex neural-network architecture OpenAI had been using for language processing. The Transformer, in other words, should scale. As Karen Hao reports in Empire of AI, Sutskever started ‘evangelising’ for it within OpenAI, but met with some scepticism: ‘It felt like a wack idea,’ one of his OpenAI colleagues told Hao.† Crucially, however, another colleague, Alec Radford, ‘began hacking away on his laptop, often late into the night, to scale Transformers just a little and observe what happened’.
Sutskever was right: the Transformer architecture did scale. It made genuinely large, indeed giant, language models feasible. Its parallelisability meant that it could readily be implemented on graphics chips, originally designed primarily for rendering images in computer games, a task that has to be done very fast but is also highly parallelisable. (Nvidia, the leading designer of graphics chips, provides much of the material foundation of LLMs, making it the world’s most valuable company, currently worth around 30 per cent more than Alphabet.) If you have enough suitable chips, you can do a huge amount of what’s called ‘pre-training’ of a Transformer model ‘generatively’, without direct human input. This involves feeding the model huge bodies of text, usually scraped from the internet, getting the model to generate what it thinks will be the next word in each piece of text, then the word after that and so on, and having it continuously and automatically adjust its billions of parameters to improve its predictions. Only once you have done enough pre-training do you start fine-tuning the model to perform more specific tasks.
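What generative pre-training amounts to can be sketched schematically (the tiny model and one-sentence 'corpus' below are stand-ins, not OpenAI's or Google's actual training code): slide over the text, predict each next token, and adjust the parameters to make the observed next token a little more likely.

```python
# Schematic next-token pre-training loop (PyTorch). A stand-in model and a tiny
# corpus; real pre-training does the same thing over billions of tokens with a
# Transformer of billions of parameters.
import torch
import torch.nn as nn

text = "the animal didn't cross the street because it was too tired".split()
vocab = {w: i for i, w in enumerate(sorted(set(text)))}
ids = torch.tensor([vocab[w] for w in text])

model = nn.Sequential(nn.Embedding(len(vocab), 32), nn.Linear(32, len(vocab)))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    inputs, targets = ids[:-1], ids[1:]   # predict each word from the ones before it
    logits = model(inputs)                # scores over the vocabulary at every position
    loss = loss_fn(logits, targets)       # how badly did we predict the actual next word?
    optimiser.zero_grad()
    loss.backward()                       # nudge the parameters to do slightly better
    optimiser.step()
```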
It was OpenAI, not Google, that made the most decisive use of the Transformer. Its debt is right there in the name: OpenAI’s evolving LLMs are all called GPT, or Generative Pre-trained Transformer. GPT-1 and GPT-2 weren’t hugely impressive; the breakthrough came in 2020 with the much larger GPT-3. It didn’t yet take the form of a chatbot that laypeople could use – ChatGPT was released only in November 2022 – but developers in firms other than OpenAI were given access to GPT-3 from June 2020, and found that it went well beyond previous systems in its capacity to produce large quantities of text (and computer code) that was hard to distinguish from something that a well-informed human being might write.
GPT-3’s success intensified the enthusiasm for LLMs that had already been growing at other tech firms, but it also caused unease. Timnit Gebru, co-founder of Black in AI and co-head of Google’s Ethical AI team, along with Emily Bender, a computational linguist at the University of Washington, and five co-authors, some of whom had to remain anonymous, wrote what has become the most famous critique of LLMs. They argued that LLMs don’t really understand language. Instead, they wrote, an LLM is a ‘stochastic parrot ... haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning’. What’s more, Bender, Gebru and colleagues noted, training such a model consumes huge quantities of electricity, and the giant datasets used in the training often ‘encode hegemonic views that are harmful to marginalised populations’. (They quoted the computer scientists Abeba Birhane and Vinay Uday Prabhu: ‘Feeding AI systems on the world’s beauty, ugliness and cruelty but expecting it to reflect only the beauty is a fantasy’.)
by Donald MacKenzie, London Review of Books
[ed. How search works, how it could change. Well worth a full read.]