Sunday, April 7, 2024

How Tech Giants Cut Corners to Harvest Data for A.I.

In late 2021, OpenAI faced a supply problem.

The artificial intelligence lab had exhausted every reservoir of reputable English-language text on the internet as it developed its latest A.I. system. It needed more data to train the next version of its technology — lots more.

So OpenAI researchers created a speech recognition tool called Whisper. It could transcribe the audio from YouTube videos, yielding new conversational text that would make an A.I. system smarter.

Some OpenAI employees discussed how such a move might go against YouTube’s rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are “independent” of the video platform.

Ultimately, an OpenAI team transcribed more than one million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI’s president, who personally helped collect the videos, two of the people said. The texts were then fed into a system called GPT-4, which was widely considered one of the world’s most powerful A.I. models and was the basis of the latest version of the ChatGPT chatbot.

The race to lead A.I. has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law, according to an examination by The New York Times.

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by The Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.

Like OpenAI, Google transcribed YouTube videos to harvest text for its A.I. models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which belong to their creators.

Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message viewed by The Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its A.I. products.

The companies’ actions illustrate how online information — news stories, fictional works, message board posts, Wikipedia articles, computer programs, photos, podcasts and movie clips — has increasingly become the lifeblood of the booming A.I. industry. Creating innovative systems depends on having enough data to teach the technologies to instantly produce text, images, sounds and videos that resemble what a human creates.

The volume of data is crucial. Leading chatbot systems have learned from pools of digital text spanning as many as three trillion words, or roughly twice the number of words stored in Oxford University’s Bodleian Library, which has collected manuscripts since 1602. The most prized data, A.I. researchers said, is high-quality information, such as published books and articles, which have been carefully written and edited by professionals.

For years, the internet — with sites like Wikipedia and Reddit — was a seemingly endless source of data. But as A.I. advanced, tech companies sought more repositories. Google and Meta, which have billions of users who produce search queries and social media posts every day, were largely limited by privacy laws and their own policies from drawing on much of that content for A.I.

Their situation is urgent. Tech companies could run through the high-quality data on the internet as soon as 2026, according to Epoch, a research institute. The companies are using the data faster than it is being produced.

“The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data,” Sy Damle, a lawyer who represents Andreessen Horowitz, a Silicon Valley venture capital firm, said of A.I. models last year in a public discussion about copyright law. “The data needed is so massive that even collective licensing really can’t work.”

Tech companies are so hungry for new data that some are developing “synthetic” information. This is not organic data created by humans, but text, images and code that A.I. models produce — in other words, the systems learn from what they themselves generate.

OpenAI said each of its A.I. models “has a unique data set that we curate to help their understanding of the world and remain globally competitive in research.” Google said that its A.I. models “are trained on some YouTube content,” which was allowed under agreements with YouTube creators, and that the company did not use data from office apps outside of an experimental program. Meta said it had “made aggressive investments” to integrate A.I. into its services and had billions of publicly shared images and videos from Instagram and Facebook for training its models.

For creators, the growing use of their works by A.I. companies has prompted lawsuits over copyright and licensing. The Times sued OpenAI and Microsoft last year for using copyrighted news articles without permission to train A.I. chatbots. OpenAI and Microsoft have said using the articles was “fair use,” or allowed under copyright law, because they transformed the works for a different purpose.

More than 10,000 trade groups, authors, companies and others submitted comments last year about the use of creative works by A.I. models to the Copyright Office, a federal agency that is preparing guidance on how copyright law applies in the A.I. era.

Justine Bateman, a filmmaker, former actress and author of two books, told the Copyright Office that A.I. models were taking content — including her writing and films — without permission or payment.

“This is the largest theft in the United States, period,” she said in an interview.

by Cade Metz, Cecilia Kang, Sheera Frenkel, Stuart A. Thompson and Nico Grant, NY Times | Read more:
Image: Jason Henry for The New York Times
[ed. Read the whole thing. Of course it's illegal. Arrogantly so. It's part of Tech's ethic - move fast, ask for permission/forgiveness later. Congress needs to get off their lazy, self-absorbed asses and do some fast moving themselves (ha!). Big Tech have simply become modern day robber barons. See also: OpenAI transcribed over a million hours of YouTube videos to train GPT-4 (The Verge).]