- The question of what data is used for AI, and the value of that content, is becoming a hot topic.
- Can you measure the value of a specific piece of data in a huge AI model?
- Researchers are trying to measure this, and the 'data leverage' concept is gaining ground.
Years before ChatGPT, Nick Vincent was studying how much AI relies on human-generated data. One thing always struck him about the researchers and tech companies behind these powerful models.
"They always highlight their clever algorithms, not the underlying data," said Vincent, an assistant professor of computing science at Simon Fraser University near Vancouver.
That's beginning to change as the question of what data is used for AI, and the value of that information, becomes a hot topic.
Giant models, such as OpenAI's GPT-4, Google's PaLM 2 and Meta's Llama 2, have been built partly on millions of books, articles, online chats and other content posted online. Some of the creators of these works have sued, claiming copyright violations, while others want to be paid for their contributions.
But how can you measure the value of a particular piece of data when a giant AI model has sucked up most of what's been published online in the past decade or more?
This problem was highlighted in a recent blog post on AI by tech analyst Benedict Evans: "It doesn't need your book or website in particular and doesn't care what you in particular wrote about, but it does need 'all' the books and 'all' the websites. It would work if one company removed its content, but not if everyone did."
Vincent calls this "data leverage." If communities know the value of their data to AI models, they can negotiate payment for their work more effectively.
"If we know that all our books together are responsible for half the 'goodness' of ChatGPT, then we can put a value on that," he said. "That was a fringe concept a few years ago and it is becoming more mainstream now. I've been beating this drum for years, and it's finally happening. I'm shocked to see it."
What makes LLMs tick?
This month, serious AI researchers waded into this debate with two papers that tackle different aspects of the problem.
On August 7, Anthropic, one of the world's most advanced AI companies, published a research paper describing a new, more efficient way to swap data in and out of a model and gauge how its performance changes. Until now, these kinds of experiments on large language models have been so expensive that they were rarely attempted.
"When an LLM outputs information it knows to be false, correctly solves math or programming problems, or begs the user not to shut it down, is it simply regurgitating (or splicing together) passages from the training set? Or is it combining its stored knowledge in creative ways and building on a detailed world model?" the Anthropic researchers wrote. "We believe this work is the first step towards a top-down approach to understanding what makes LLMs tick."
SILO and the value of high-quality content
The second paper introduced SILO, a new language model developed by researchers at the University of Washington in Seattle, UC Berkeley, and the Allen Institute for AI.
Their broad goal was to create a model from which data can be removed to reduce legal risk. In the process, they also developed a way to measure how specific data contributes to an AI model's output.
"SILO could provide a path forward for data owners to get properly credited (or be paid directly) every time their data in a datastore contributes to a prediction," the researchers wrote in a paper unveiling the technology on August 8.
The authors settled one important question right away: AI models rely heavily on high-quality, human-generated content that is often under copyright. Without it, performance drops sharply.
"As we show, model performance significantly degrades if trained only on low-risk text (e.g., out-of-copyright books or government documents), due to its limited size and domain coverage," they wrote.
The Harry Potter test
Then the researchers went deep into the weeds, using J.K. Rowling's Harry Potter books to see whether individual pieces of data influence an AI model's performance.
They started with a large collection of published books that are part of The Pile, a huge dataset that's been built by scraping and storing a lot of what's been posted online over the years.
Then they created two "datastores." One had all the published books except the first Harry Potter book. The other excluded all seven Harry Potter books. They ran tests to see how the model performed using those two different datastores. Then they repeated the exercise, excluding the second Harry Potter book, then the third, and so on. The idea of this "leave-out" analysis was to see how well the model performed when these pieces of content were missing.
"When the Harry Potter books are removed from the datastore, the perplexity gets worse," the researchers found. Perplexity measures the accuracy of AI models. So, without Harry Potter, the model isn't as good.
The more specific conclusion seems painfully logical, but it is important: If you take specific content away, LLMs can't answer questions well about that content.
"LLMs threaten our ability to make these obvious conclusions," Vincent said. "Until now, throwing all data into an AI model has worked well. So there's been less need to specifically know what data is helping to make a model good."
Important legal benefits
Helping J.K. Rowling make even more money from her Harry Potter books was not the goal of the SILO study, though.
What the researchers proved is that it's possible to build powerful AI models while mitigating legal risk, according to Oren Etzioni, a former CEO of the Allen Institute for AI who remains a board member and advisor to the organization.
The researchers trained the SILO model only on low-risk datasets containing public domain text, such as books whose copyrights have expired.
An important next step is inference, when the trained model interprets new information and decides the best output. The inference stage is where the researchers introduced the high-risk data, placing copyrighted books, news articles, medical text and other content into a separate datastore the model can consult. This is also where the Harry Potter "leave-out" tests happened.
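The SILO paper pairs its permissively trained model with a nonparametric datastore queried at inference time; one standard recipe for that, used in the kNN-LM line of work, retrieves the datastore entries nearest to the current context and blends them with the model's own next-token prediction. The numpy sketch below is a minimal illustration of that interpolation idea, with invented vectors, a made-up three-word vocabulary and an arbitrary mixing weight; it is not SILO's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["hogwarts", "london", "hobbiton"]          # invented toy vocabulary

# Datastore built at inference time from the high-risk corpus:
# (context embedding, observed next token) pairs. Random stand-ins here.
datastore_keys = rng.normal(size=(100, 8))
datastore_next = rng.integers(0, len(VOCAB), size=100)

def knn_next_token_probs(query, keys, next_tokens, k=8, temperature=1.0):
    """Turn the k nearest datastore entries into a next-token distribution."""
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temperature)
    probs = np.zeros(len(VOCAB))
    for idx, w in zip(nearest, weights):
        probs[next_tokens[idx]] += w
    return probs / probs.sum()

# The parametric model's own prediction (placeholder numbers): it was trained
# only on low-risk text, so it never memorized the copyrighted material.
p_lm = np.array([0.2, 0.5, 0.3])

query = rng.normal(size=8)                          # embedding of the current context
p_knn = knn_next_token_probs(query, datastore_keys, datastore_next)

lam = 0.5                                           # interpolation weight
p_final = lam * p_knn + (1 - lam) * p_lm
print(dict(zip(VOCAB, p_final.round(3))))
```

Because the copyrighted text lives only in the datastore, deleting an author's work changes p_knn immediately without retraining the weights behind p_lm, and the retrieved entries show which documents influenced each prediction, which is what makes opt-out and attribution practical.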
This approach has important legal benefits, according to Etzioni. Authors can opt out at any time, and the model does not have to be retrained. In addition, particular sentences in the results can be attributed to their sources, enabling credit to be assigned to authors.
"However, if authors insist on opting out en masse, then SILO will not end up being useful in practice," he added.
And is it legal to use copyrighted works at the inference stage of an AI model's development?
"That's a question for a copyright attorney," Etzioni said.