Augment LLMs with external knowledge through retrieval-augmented generation techniques
RAG (Retrieval-Augmented Generation) enhances LLMs by retrieving relevant external knowledge from documents, databases, or knowledge bases before generating responses, grounding answers in verifiable facts.
LLMs have knowledge cutoff dates and can hallucinate. RAG provides up-to-date, domain-specific information with citations, reducing hallucinations and enabling access to proprietary knowledge.
Use RAG when you need current information, domain-specific knowledge, or verifiable sources. Combine with semantic search and proper chunking for optimal retrieval quality.
RAG (Retrieval-Augmented Generation) is a technique that enhances LLMs by giving them access to external knowledge. Instead of relying only on training data, the model can "look up" relevant information from documents, databases, or knowledge bases before generating a response.
Access current data beyond the model's training cutoff date
Ground responses in verifiable facts from your knowledge base
Use proprietary company docs, manuals, and internal wikis
Provide citations showing exactly where information came from
User asks a question: "What is our company's remote work policy?"
Convert the question into a vector (numerical representation)
Search vector database for most similar document chunks
Add retrieved chunks to the original prompt
Model generates an answer grounded in the retrieved context (see the sketch below)
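A minimal sketch of that flow in Python, where answer_with_rag, embed_text, vector_search, and generate are hypothetical names standing in for your embedding model, vector database query, and LLM call:

from typing import Callable, List

def answer_with_rag(
    question: str,
    embed_text: Callable[[str], List[float]],                 # embedding model
    vector_search: Callable[[List[float], int], List[str]],   # vector DB query
    generate: Callable[[str], str],                           # LLM call
) -> str:
    # 1. Convert the question into a vector
    query_vector = embed_text(question)
    # 2. Retrieve the most similar document chunks
    chunks = vector_search(query_vector, 3)
    # 3. Augment the prompt with the retrieved chunks
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # 4. Generate an answer grounded in the retrieved context
    return generate(prompt)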
Convert text to numerical vectors that capture semantic meaning
Find relevant documents based on meaning, not just keywords
Add retrieved knowledge to prompts for grounded generation
Generate responses with citations and source attribution (example prompt format below)
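One simple way to connect augmentation with citations is to number each retrieved chunk and ask the model to cite those numbers. The prompt format, the build_augmented_prompt helper, and the sample chunk below are purely illustrative:

def build_augmented_prompt(question: str, chunks: list) -> str:
    # Number each chunk so the model can cite its sources, e.g. [1]
    sources = "\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, 1)
    )
    return (
        "Answer the question using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )

prompt = build_augmented_prompt(
    "What is our company's remote work policy?",
    [{"source": "hr_handbook.md", "text": "Employees may work remotely up to three days per week."}],
)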
Numerical representations of text that capture semantic meaning. Similar concepts have similar vectors.
Finding documents based on meaning, not just keywords. Understands that "furry feline" means "cat" (see the sketch below).
Specialized databases optimized for storing and searching embeddings at scale.
Breaking large documents into smaller pieces for efficient retrieval and processing.
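To make embeddings and semantic search concrete, here is a small sketch using the sentence-transformers library (one of several possible choices); the model name and example texts are just placeholders:

from sentence_transformers import SentenceTransformer, util

# Load a small, commonly used sentence-embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Cats are small domesticated felines kept as pets.",
    "The quarterly sales report is due on Friday.",
    "Dogs need daily walks and plenty of exercise.",
]
query = "furry feline"

# Embeddings: numerical vectors that capture semantic meaning
chunk_vectors = model.encode(chunks)
query_vector = model.encode(query)

# Semantic search: rank chunks by cosine similarity to the query
scores = util.cos_sim(query_vector, chunk_vectors)[0]
best = int(scores.argmax())
print(chunks[best])  # the cat sentence ranks highest despite sharing no keywords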
Let's build a simple RAG system for company documentation:
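# Note: this walkthrough uses the classic LangChain import paths; newer releases
# move these classes into langchain_community, langchain_openai, and
# langchain_text_splitters, so adjust the imports to your installed version.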
# 1. Load and chunk documents
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
loader = TextLoader('company_policies.txt')
documents = loader.load()
splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
# 2. Create embeddings and store in vector DB
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# 3. Create retriever
retriever = vectorstore.as_retriever()
# 4. Build RAG chain
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
qa_chain = RetrievalQA.from_chain_type(
llm=OpenAI(),
retriever=retriever
)
# 5. Ask questions!
answer = qa_chain.run("What is the vacation policy?")
print(answer)

Agentic RAG adds a reasoning layer that actively evaluates, validates, and refines retrieved information: