Can Synthetic Data Help Solve Generative A.I.’s Training Data Crisis?

The supply of quality, real-world data used to train generative A.I. models appears to be dwindling as digital publishers increasingly restrict their access to their public data, according to a recent study. That means the advancement of large language models like OpenAI’s GPT-4 and Google’s Gemini could hit a wall once the A.I.s scrape all the remaining data on the internet.

To address the growing A.I. training data crisis, some experts are considering synthetic data as a potential alternative. Real-world data, created by real humans, includes news articles, YouTube videos, and other forms of text and image content. Synthetic data, on the other hand, is artificially generated by machine learning models trained on samples of real data. While synthetic data isn't particularly new, using it to train A.I. models like GPT is a technique major companies including OpenAI are exploring—a practice experts say could backfire if done incorrectly.

“It’s still kind of the Wild West when it comes to generative A.I. models,” Kjell Carlsson, head of AI strategy at Domino Data Lab, a machine learning platform for businesses, told Observer.

Synthetic data has long been used to address the lack of sufficient training data for A.I. applications such as autonomous driving systems. For instance, companies like Waymo and Tesla use…

© Observer