Details about RAG
Understanding Retrieval-Augmented Generation (RAG)
Imagine a Large Language Model (LLM) like ChatGPT or Gemini as a brilliant, well-read student who has studied every book in a vast public library. They can talk about almost any general topic, but they have two main limitations:
- Their knowledge is frozen in time; they don’t know anything that happened after their training was completed.
- They have never seen your company’s private documents, your personal notes, or any other specific, non-public information.
This can lead to the LLM giving generic answers or, worse, making up plausible-sounding but incorrect information, an issue often called “hallucination”.
Retrieval-Augmented Generation (RAG) is a technique that solves this problem by giving the LLM an “open book” to consult before it answers your question. It enhances the model by retrieving relevant, external information and adding it to the prompt as context. This ensures the LLM’s answers are grounded in specific, accurate, and up-to-date facts.
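In miniature, the difference between a plain prompt and a retrieval-augmented prompt looks like the Python snippet below. The question and the “retrieved” snippet are invented for illustration; how such a snippet actually gets retrieved is covered step by step in the sections that follow.

```python
# The core idea behind RAG: instead of sending the question alone, prepend
# retrieved facts the model was never trained on. The question and the
# "retrieved" context here are invented for illustration.
question = "How many vacation days do new hires get?"

plain_prompt = question  # without RAG, the LLM can only guess from its training data

retrieved_context = "HR handbook, section 2: New hires receive 25 paid vacation days per year."
augmented_prompt = (
    f"Context: {retrieved_context}\n\n"
    f"Question: {question}\n"
    "Answer based only on the context provided."
)
print(augmented_prompt)
```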
How the RAG Process Works, Step-by-Step
The RAG process can be broken down into two main phases: the preparation phase (indexing the knowledge base) and the generation phase (answering a query).
Phase 1: Indexing the Knowledge Base (Creating the “Open Book”)
Before any questions can be answered, your private or domain-specific information needs to be prepared and stored in a form the system can search efficiently.
- Collect Documents: First, you gather your knowledge base. This can be any collection of text documents: internal company wikis, technical manuals, customer support articles, legal contracts, or even your personal notes.
- Chunking: These documents are broken down into smaller, manageable chunks (e.g., paragraphs or sections).
- Vectorization: Each chunk of text is fed into a machine learning model (an embedding model) that converts it into a numerical representation called a vector embedding. This vector is essentially a long list of numbers that captures the semantic meaning of the text. In this high-dimensional vector space, chunks with similar meanings will have vectors that are “close” to each other.
- Store in a Vector Database: These vector embeddings, along with their original text and any metadata, are stored and indexed in a vector database. This type of database is specially designed for one primary task: finding vectors that are similar to a query vector with incredible speed. (A minimal indexing sketch follows this list.)
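As a concrete sketch of this indexing phase, the snippet below uses the open-source sentence-transformers library as the embedding model and a plain NumPy array as a stand-in for a real vector database. The document chunks are invented examples, and the library and model choice are just one possible setup among many.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

# Example document chunks (invented); in practice these come from your own
# documents, split into paragraphs or sections during the chunking step.
chunks = [
    "All full-time employees are eligible for up to 16 weeks of paid parental leave.",
    "Parental leave must be requested at least 30 days in advance.",
    "Expense reports are reimbursed within two weeks of submission.",
]

# Vectorization: convert each chunk into a vector embedding that captures its meaning.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, normalize_embeddings=True)  # shape: (num_chunks, dim)

# "Vector database" stand-in: keep the embeddings and their source text together.
# A real vector database (e.g. FAISS, Pinecone, pgvector) adds fast approximate search.
index = {"embeddings": np.asarray(embeddings), "texts": chunks}
```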
Phase 2: Answering a User’s Query (Using the “Open Book”)
This is what happens in real-time when you ask a question.
- User Asks a Question: You submit a prompt to the application, for example: “What is our company’s policy on parental leave?”
- Vectorize the Query: Just like the document chunks, your question is converted into a vector embedding using the same embedding model.
- Retrieve Relevant Information: The system takes your query vector and uses the vector database to perform a similarity search. It searches for the document chunks whose vectors are the closest or most similar to your question’s vector. These are the pieces of information most likely to contain the answer.
- Augment the Prompt: The system takes the most relevant text chunks it found (e.g., the top 3-5 results) and combines them with your original question. The new, expanded prompt might look something like this:
Context:
“From HR-Policy-Doc, page 5: All full-time employees are eligible for up to 16 weeks of paid parental leave…”
“From Onboarding-Guide, section 3: Parental leave must be requested at least 30 days in advance…”
Question: What is our company’s policy on parental leave?
Answer based only on the context provided.
- Generate the Answer: This augmented prompt is sent to the LLM. Because the model is instructed to answer based only on the provided context, it is guided to generate a response that is accurate, specific to your organization’s data, and factually grounded. (A sketch of this query-time flow follows this list.)
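Continuing the indexing sketch above (and reusing its model and index variables), the query-time flow might look roughly like this. The final LLM call is left as a placeholder, since the actual client depends entirely on which model provider you use.

```python
# Query-time flow, continuing the indexing sketch from Phase 1.
question = "What is our company's policy on parental leave?"

# 1. Vectorize the query with the same embedding model used for the chunks.
query_vec = model.encode([question], normalize_embeddings=True)[0]

# 2. Retrieve: with normalized vectors, cosine similarity is just a dot product.
scores = index["embeddings"] @ query_vec
top_k = np.argsort(scores)[::-1][:3]
retrieved = [index["texts"][i] for i in top_k]

# 3. Augment: combine the retrieved chunks with the original question.
prompt = (
    "Context:\n" + "\n".join(f"- {text}" for text in retrieved) + "\n\n"
    f"Question: {question}\n"
    "Answer based only on the context provided."
)

# 4. Generate: send the augmented prompt to an LLM. This is a placeholder;
#    substitute your provider's API (OpenAI, Gemini, a local model, etc.).
def generate_answer(augmented_prompt: str) -> str:
    raise NotImplementedError("Call your LLM provider here")

print(prompt)  # inspect the augmented prompt before wiring in generate_answer(prompt)
```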
Why is RAG so Important?
RAG is a critical component for building powerful and trustworthy AI applications for several reasons:
- Reduces Hallucinations: It grounds the LLM in facts, significantly reducing the risk of the model inventing answers.
- Provides Current and Domain-Specific Knowledge: It allows an LLM to use information it was never trained on, making it useful for enterprise applications that rely on private or rapidly changing data.
- Increases Trust and Explainability: Because the system retrieves specific documents to form its answer, it can often cite its sources, allowing users to verify the information.
- Cost-Effective: It is much cheaper and faster to update a vector database with new information than it is to retrain an entire LLM.