- Retrieval-augmented generation is enhancing large language models' accuracy and specificity.
- However, it still poses challenges and requires specific implementation techniques.
- This article is part of "Build IT," a series about digital tech trends disrupting industries.
The November 2022 launch of OpenAI's ChatGPT kicked off the latest wave of interest in AI, but it came with some serious issues. People could ask questions on almost any topic, but many of the large language model's answers were uselessly generic — or completely wrong. No, ChatGPT, the population of Mars is not 2.5 billion.
Such problems still plague large language models. But there's a solution: retrieval-augmented generation, or RAG. This technique, introduced in 2020 by researchers at Meta's AI research group, is rewriting the rules of LLMs. The first wave of vague, meandering chatbots is receding, replaced by expert chatbots that can answer surprisingly specific questions.
RAG isn't well known outside the AI industry but has come to dominate conversations among insiders — especially those creating user-facing chatbots. Nvidia used RAG to build an LLM that helps its engineers design chips; Perplexity employs RAG to construct an AI-powered search engine that now claims over 10 million monthly active users; Salesforce used RAG to build a chatbot platform for customer relations.
"For a long time we were looking at databases, and we had a lot of excitement for AI. But what was the unique use case? RAG was the first," said Bob van Luijt, the CEO and cofounder of the AI data infrastructure company Weaviate. "From a user perspective, there was a simple problem, which is that generative models were stateless." (Meaning they couldn't update themselves in response to new information.) "If you tell it, 'Hey, I had a conversation with Bob,' the next time you use it, it won't remember. RAG solves that."
The innovation that's sweeping AI
"Every industry that has a lot of unstructured data can benefit from RAG," van Luijt said. "That ranges from insurance companies to legal companies, banks, and telecommunications." Companies in these industries often have vast troves of data, but sifting through it to gain insights is a difficult task. "That's where RAG adds a lot of value. You throw that information in, and you're like, 'Make sense of that for me.' And it does."
That's accomplished by adding a step when an LLM generates a reply. Instead of offering a response rooted only in how the model was trained, RAG retrieves additional data provided to it by the person or organization implementing RAG — most often text, though the latest methods can handle images, audio, and video — and incorporates it into its reply.
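The added retrieval step can be sketched in a few lines of Python. This is a minimal illustration, not how any company named here implements RAG: the `embed` function is a toy word-count stand-in for a real embedding model, the two sample documents are placeholder sentences, and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 1) -> str:
    """The extra RAG step: place retrieved text in front of the question."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Placeholder documents, invented for illustration only.
docs = [
    "Studies have linked egg consumption with type 2 diabetes risk.",
    "The vacation rental has easy access to public transit.",
]
question = "Do eggs reduce the risk of diabetes?"
top = retrieve(question, docs)
prompt = build_prompt(question, docs)
```

The resulting `prompt` string is what would be sent to the LLM in place of the bare question, so the model's reply can draw on the retrieved text rather than on its training alone.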
Nadaa Taiyab, a data scientist at the healthcare IT company Tegria, offered an example from the chatbot she designed, which uses RAG to answer nutrition questions based on data from NutritionFacts.org. The nonprofit has highlighted studies linking eggs and type 2 diabetes, a correlation that most LLMs won't report if asked whether eggs reduce the risk of diabetes. However, her RAG-powered chatbot can retrieve and reference NutritionFacts.org's published work in its response. "And it just works," Taiyab said. "It's pretty magical."
But it's not perfect
That magic makes RAG the go-to technique for those looking to build a chatbot grounded in specific, often proprietary data. However, van Luijt warned, "Like any tech, it's not a silver bullet."
Any data used for RAG is typically converted into lists of numbers, known as embeddings, and stored in a vector database in a form an LLM-based system can search and compare. This is well understood by AI engineers, as it's core to how generative AI works, but the devil is in the details. Van Luijt said developers need to adopt specific techniques, such as "chunking strategies," that manipulate how RAG presents data to the LLM.
Fixed-size chunking, the most basic strategy, divides data like a pizza: every slice is (hopefully) the same size. But that's not necessarily the best approach, especially if an LLM needs to access data that's spread across many different documents. Other strategies, such as "semantic chunking," use algorithms to pick out the relevant data spread across many documents. This approach requires more expertise to implement, however, and access to powerful computers. Put simply: It's better, but it's not cheap.
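For the curious, fixed-size chunking might look something like the sketch below. It measures size in characters for simplicity, while production systems often count tokens instead; the `fixed_size_chunks` name and the optional `overlap` parameter are assumptions made for this example, not part of any particular library.

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into slices of at most `size` characters.

    `overlap` repeats that many characters at the start of the next slice,
    which can help when a relevant sentence straddles a boundary.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")

    step = size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the slice just added already reaches the end of the text
        start += step
    return chunks


plain = fixed_size_chunks("abcdefghij", 4)
overlapping = fixed_size_chunks("abcdefghij", 4, overlap=2)
```

Note that the last slice can come out smaller than the rest, which is the "hopefully" in the pizza analogy.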
Overcoming that obstacle can immediately lead to another issue. When successful, RAG can work a bit too well.
Kyle DeSana, the cofounder of the AI analytics company Siftree, warned against careless RAG implementations. "What they're doing without realizing it, without analytics, is that they're losing touch with the voice of their customer," DeSana said.
He said that a successful RAG chatbot could carry its own pitfalls. A chatbot with domain expertise that replies in seconds can encourage users to ask even more questions. The resulting back-and-forth may lead to questions beyond the chatbot's scope. This becomes what's known as a feedback loop.
Solving for the feedback loop
Analytics are essential for identifying shortcomings in a RAG-powered AI tool, but they are still reactive. AI engineers are eager to find more proactive solutions that don't require constant meddling with the data RAG provides to the AI. One cutting-edge technique, generative feedback loops, attempts to harness feedback loops to reinforce desirable results.
"A RAG pipeline is usually one direction," van Luijt explained. But an AI model can also use generated data to improve the quality of the information available through RAG. Van Lujit used vacation-rental companies such as Airbnb and Vrbo as an example. Listings on these sites have many details, some of which are missed or omitted by a listing's creator (does the place have easy access to transit?), and AI is quite good at filling in these gaps. Once that's done, the data can be included in RAG to improve the precision and detail of answers.
"We tell the model, 'Based on what you have, do you think you can fill in the blanks?' It starts to update itself," van Lujit said. Weaviate has published examples of generative feedback loops in action, including a recreation of Amazon's AI-driven review summaries. In this example, the summary can not only be published for people to read but also placed into a database for later retrieval through RAG. When new summaries are required in the future, the AI can refer to the previous answer rather than ingesting every published review — which may span tens or hundreds of thousands of reviews — again.
Both van Luijt and Taiyab speculated that as the AI industry continues its growth, new techniques will push models to a point where retrieval is no longer necessary. A recent paper from researchers at Google described a hypothetical LLM with infinite context. Put simply, an AI chatbot would have an effectively infinite memory, letting it "remember" any data presented to it in the past. In February, Google announced it had tested a context window of up to 10 million tokens, each representing a small chunk of text. That's large enough to store hundreds of books or tens of thousands of shorter documents.
At this point, the computing resources required are beyond all but the largest tech giants: Google's announcement said its February test pushed its hardware to its "thermal limit." RAG, on the other hand, can be implemented by a single developer in their spare time. It scales to serve millions of users, and it's available now.
"Maybe in the future RAG will go away altogether, because it's not perfect," Taiyab said. "But for now, this is all we have. Everyone is doing it. It's a core, fundamental application of large language models."