RAG (Retrieval-Augmented Generation) Revisited
- Piotr Fic

What are we used to doing when implementing LLM-based applications?
With the rapid advance of AI, and especially Large Language Models (LLMs), a few design patterns have become go-to standards for AI application development. One of the most popular is RAG (Retrieval-Augmented Generation). If you’re not familiar with RAG (which is rather unlikely these days), here’s a quick recap:
RAG is a technique for incorporating relevant external information into LLMs to improve the accuracy and grounding of their responses.
LLMs are trained on publicly available data, which means they typically lack knowledge of internal or private information. A well-implemented RAG system bridges this gap by providing relevant chunks of such data to the model, enhancing its ability to generate accurate, context-aware, and source-grounded outputs.
When RAG proved its value, developers quickly adopted it as the standard for LLM-powered apps.
The fast-moving LLM landscape produced multiple frameworks and tools, which teams often combined into custom "DIY" RAG systems. Early prototypes usually look great, but taking them to production? That’s another story.
In this post, we’ll explore why custom RAG systems often disappoint, and what simpler, modern alternatives can get you to value faster.
Why can custom RAG systems disappoint?
Starting a RAG system is deceptively simple. You pick a vector database, an LLM provider for embedding and chat completion models, and a framework to wire it all together. A bit of coding later, you’ve loaded some trial documents, and your end product, maybe a chatbot, can answer questions about the loaded documents.
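To make that "deceptively simple" starting point concrete, here is a minimal sketch of such a prototype. It assumes Chroma as the local vector store and OpenAI for chat completion; the model name and sample documents are just illustrative placeholders:

```python
# A minimal "DIY" RAG loop: index documents in a local vector store,
# retrieve the closest chunks for a question, and pass them to a chat model.
import chromadb
from openai import OpenAI

chroma = chromadb.Client()                      # in-memory vector store
collection = chroma.create_collection("docs")   # uses Chroma's default embedder
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Our refund policy allows returns within 30 days of purchase.",
        "Support is available Monday to Friday, 9:00-17:00 CET.",
    ],
)

question = "How long do customers have to return a product?"
hits = collection.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])       # retrieved chunks for the prompt

llm = OpenAI()  # expects OPENAI_API_KEY in the environment
answer = llm.chat.completions.create(
    model="gpt-4o-mini",                        # example model name
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```

A prototype like this really can come together in an afternoon, which is exactly why the pitfalls below are so easy to underestimate.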
If that’s your mental image of RAG in practice, it’s worth pausing to consider a few common pitfalls:
Data extraction and preparation: Loading a few documents is easy; handling many messy, inconsistent ones is not. Your input might be full of irrelevant content that slows things down and hurts response quality. Every edge case becomes a headache requiring custom parsing logic. Supporting multiple file formats, especially non-text ones like images, adds further complexity. Even if you get all that right, you still need to experiment with chunking strategies and embedding models (a simple chunking example is sketched after this list).
Data staleness: Once your system is live, the underlying data needs constant updating. Building a robust, efficient, and reliable ingestion pipeline is essential. It demands strong data engineering skills and ongoing maintenance. Many teams underestimate this effort, especially when organizational data is poorly structured or not ready for automated consumption.
Costs: Beyond infrastructure and operational costs, RAG projects often require specialized skills to maintain quality and performance. Expenses grow quickly with the scale and complexity of your data.
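As an illustration of the chunking experimentation mentioned above, here is a simple fixed-size chunker with overlap; real pipelines often need smarter, format-aware splitting, and the sizes below are arbitrary starting points:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbours."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():                 # skip empty or whitespace-only chunks
            chunks.append(chunk)
    return chunks

# Overlap keeps sentences that straddle a boundary visible in both chunks,
# at the cost of some duplicated tokens in the index.
print(len(chunk_text("Some long document text. " * 200)))
```

Even this trivial strategy has tuning knobs (chunk size, overlap, whether to split on characters, sentences, or document structure), and each choice interacts with the embedding model and retrieval quality.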
Are there simpler ways?
As we’ve seen, building a custom RAG system from scratch is rarely simple. It takes time, expertise, and a lot of trial and error to get right. Let’s look at a few alternatives that can help you build context-aware applications faster and with less friction.
RAG as a Service
Major cloud and AI providers now offer managed RAG services. These typically charge based on data volume (e.g., tokens processed or API calls). While they may seem more expensive upfront, they often turn out to be more cost-effective in the long run by eliminating many of the maintenance and integration headaches.
Modern offerings come packed with features and flexibility, allowing teams to balance customization with simplicity. Developers can mix and match out-of-the-box tools with custom components on top of a ready-made RAG engine. Such services automate data ingestion, optimize chunking and embeddings, and expose APIs for querying the data. You can choose the platform that best fits your stack, for example:
Amazon Bedrock or Google Vertex AI for seamless cloud integration,
OpenAI’s Assistant API or Google Gemini File Search for direct access to LLM-native RAG capabilities.
These solutions let you focus on building value, not infrastructure.
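As a rough illustration, querying a managed RAG engine can be a single API call. The sketch below assumes an Amazon Bedrock Knowledge Base that has already ingested your documents; the knowledge base ID and model ARN are placeholders, and parameter names may differ slightly across SDK versions:

```python
import boto3

# Client for querying an existing Bedrock Knowledge Base (managed RAG engine).
client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "What is our parental leave policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",   # placeholder
            "modelArn": "YOUR_MODEL_ARN",      # placeholder
        },
    },
)
print(response["output"]["text"])   # grounded answer
# response["citations"] carries the retrieved source passages used for the answer.
```

Ingestion, chunking, embedding, and retrieval all happen behind that one call, which is the main trade you make: less control over the pipeline in exchange for far less code to maintain.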
Live Web Search Context
To address data staleness, one effective option is to use live web search to augment LLM responses. Search engines are excellent and proven tools for information retrieval, and, crucially, they provide access to the latest information. The web contains vast amounts of relevant documents (including less obvious gems like PDFs hosted on websites).
You might worry this brings new problems, such as noisy sources or unstructured results, but specialized APIs can help. For example, the Perplexity Search API returns structured text chunks with metadata for a given query. Developers can even configure domain restrictions or time ranges to filter results.
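A search call might look roughly like the sketch below; note that the endpoint, request fields, and response shape shown here are assumptions based on Perplexity's documentation and may change, so check the current API reference before relying on them:

```python
import requests

# Query the Perplexity Search API for structured web results.
# NOTE: endpoint and parameter names are assumptions; verify against the docs.
resp = requests.post(
    "https://api.perplexity.ai/search",
    headers={"Authorization": "Bearer YOUR_PPLX_API_KEY"},
    json={
        "query": "latest EU AI Act implementation deadlines",
        "max_results": 5,                         # assumed parameter name
    },
    timeout=30,
)
resp.raise_for_status()
for result in resp.json().get("results", []):     # assumed response shape
    print(result.get("title"), result.get("url"))
```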
This approach integrates easily with existing RAG systems, either replacing the vector database lookup or supplementing it with more up-to-date results. An even simpler option is to use a grounded LLM response and let the provider handle the search and context augmentation under the hood; the response is then the LLM output plus metadata about the sources used to generate it. Examples of such tools are Gemini Grounding with Google Search and the Perplexity Chat API.
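For instance, a grounded call with the google-genai SDK can look roughly like this; the model name is just an example, and the exact grounding metadata fields may vary between versions:

```python
from google import genai
from google.genai import types

client = genai.Client()  # expects GEMINI_API_KEY in the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",                                     # example model name
    contents="What were the key announcements at last week's Google I/O?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],   # enable grounding
    ),
)
print(response.text)
# Sources and the search queries used for grounding are attached as metadata.
print(response.candidates[0].grounding_metadata)
```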
Hybrid Approaches
Each of these approaches has its strengths and weaknesses.
RAG built on internal data remains essential for sensitive or proprietary information.
Managed services accelerate time-to-market and reduce maintenance overhead.
Live web context keeps outputs current and diverse.
Combining them can yield cost-effective, high-quality results. While this post focuses on practical and quick wins to modernize traditional RAG systems, it’s worth noting that more advanced methods exist, such as querying relational databases, spreadsheets, or even letting autonomous agents execute code or SQL queries. These techniques hold great promise but come with their own complexities and risks.
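One simple hybrid pattern is to merge internal retrieval with fresh web results before prompting the model. The helpers below (search_internal_kb, search_web) are hypothetical stand-ins for whichever internal store or managed RAG engine and web search API you actually use:

```python
def search_internal_kb(query: str, k: int = 3) -> list[str]:
    """Hypothetical lookup against your internal vector store or managed RAG engine."""
    raise NotImplementedError

def search_web(query: str, k: int = 3) -> list[str]:
    """Hypothetical call to a web search API (e.g. a search endpoint like Perplexity's)."""
    raise NotImplementedError

def build_context(query: str) -> str:
    # Internal data first (proprietary knowledge), web results second (freshness).
    internal = search_internal_kb(query)
    web = search_web(query)
    return "\n\n".join(internal + web)

# The combined context is then passed to the chat model exactly as in a plain RAG prompt.
```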
Conclusion
In this post, we explored why custom RAG implementations often fall short, and how managed RAG services and live web search can simplify development while improving results. Try new tools, experiment with modern APIs, and see how much faster you can turn context into value.


