Corporate America's new data gold rush
AI’s next breakthrough won’t come from scraping the web. Companies are racing to unlock new training data, from personal data to drones and corporate archives
Photo by Sean Gallup/Getty Images
The era of free AI training data is over. Reddit charges millions for API access. The New York Times sued OpenAI and Microsoft over the use of its articles. Publishers are blocking scrapers. Even if AI companies could still vacuum up the public internet, they're running into a bigger problem: the next leap in abilities requires entirely different kinds of data.
Large language models were built by scraping text and images from the web. But as AI systems move beyond chatbots, they need training data that was never publicly available in the first place. Data that's locked away, or scattered, or doesn't even exist yet.
New markets are emerging to unlock these sources. Here are three.
Your digital exhaust, monetized
Most people think of personal data as Social Security numbers and health records. But nearly everything you do online generates data that platforms collect and use: your Spotify listening history, your email patterns, the documents you write in Google Docs, your conversations with ChatGPT.
When you download your Instagram data, for example, the company doesn't just give you your photos. You get everything Instagram has inferred about you based on your browsing behavior: hundreds of data points ranging from innocuous labels like "interested in nature" to psychological assessments...