How to build a RAG pipeline with Weaviate and Ollama
Welcome to the final part of our fascinating journey into AI-assisted search. This article is the last in a three-part series designed to demystify vector databases.
- Part 1: The “What & Why”: a conceptual introduction to what a vector database is, why it's a game-changer, and its core use cases.
- Part 2: The “How”: a practical, hands-on tutorial where we built a project to demonstrate the technology in action.
- Part 3: The “Real World” (you are here): a hands-on guide to building a retrieval-augmented generation (RAG) pipeline: hybrid search, intelligent chunking, and turning search results into AI-generated answers.
From lists to answers
Part 1 explored the world of vector embeddings: turning content into coordinates on a meaning map. Part 2 focused on building a local semantic search engine with Weaviate and Node.js.
This part goes one step further to deliver a better user experience.
The thing about search results is that they're a list. The user asks a question, and gets links to pages that include the search term. That's useful, but it's how search worked in 2018. What they actually want is an answer.
To close that gap, turn your search engine into an AI assistant that answers questions grounded in relevant content, and does it locally: no API keys, no cloud dependencies.
Here's what you’ll build:
- A hybrid search retrieval function (an upgrade to the pure vector search)
- A prompt template that grounds the LLM in your content
- A streaming generation pipeline with Ollama
- Intelligent chunking that guarantees quality answers
- Relevance thresholds that prevent hallucination
Already familiar with RAG and ready to start building? Skip right to the coding part.
The two librarians
Before you write any code, let me paint a picture.
Imagine you walk into a library in search of knowledge. The librarian greets you. They’ve read thousands of books over the years and memorized all of them by heart. You ask, “What's the best framework for server-side rendering?” The librarian smiles confidently and answers from memory. They even email you the framework’s name and URL. Later, you log into your work device, type in the URL, and, well, nothing. Turns out that the knowledgeable librarian confidently recommended a framework that doesn't exist. They weren’t lying; they genuinely believed they were right. They just...misremember things.
That's an LLM without RAG: remarkable recall, questionable accuracy.
Now imagine a different librarian. Same question. But instead of answering from memory, this one says, "Let me check." They walk over to the shelves, pull out three relevant books, flip to the right chapters, read them, and then answer your question, citing specifically relevant pages.
That's an LLM with RAG. The shelves are your vector database from part 2. In this part, you bring in the librarian.
What is RAG, really?
As the name suggests, retrieval-augmented generation is a three-step technique:
- Retrieval: search your vector database for content relevant to the question
- Augmentation: insert that content into the LLM's prompt as context
- Generation: enable the LLM to answer based on this fresh context, not the outdated training data
RAG doesn't make the LLM smarter, it makes it informed. And that's a crucial distinction. LLMs are frozen in time at their training cutoff, but your content changes constantly, maybe daily. RAG bridges the gap by giving the model up-to-date, relevant context at query time.
The beauty of this architecture is its simplicity. Each step is independent: this tutorial uses Weaviate for retrieval, Qwen 3.5 on Ollama for generation, and Node.js to wire it all together and run the JavaScript scripts. But you can swap your vector database, change your prompt template, or upgrade your LLM (if you have enough VRAM, and given today's market, I doubt it), all without rebuilding the pipeline from scratch.
Case in point, you could skip Weaviate and use Ollama for embeddings. However, since this tutorial builds directly on the previous part, Weaviate is already set up with text2vec-transformers and the relevant data. Switching would mean re-ingesting all articles with different vectors, so we keep what works and add what's missing.
- Weaviate still handles embeddings with the same `text2vec-transformers` model that turns content and queries into vectors for search.
- Ollama's job is to run LLMs that generate natural language answers from retrieved context.
Two models, two roles, one pipeline.
Let's get to it.
This tutorial builds directly on the previous part, so you need the Weaviate instance and ingested articles from that tutorial. The complete code is available on GitHub.
Set up Ollama
In part 2, you kept everything local with Docker. Same philosophy here: Ollama lets you run LLMs on your machine with zero cloud dependencies.
To keep the entire project as a standalone Dockerized container, including Ollama and Qwen, follow Weaviate’s official quickstart manual.
Why not use an API like Claude or GPT? Fair question. For production, that's often the right call. For learning, local is better: you manage every piece, control every parameter, and don't worry about rate limits or costs while experimenting. And if you have sensitive data that you don't want to send to a cloud platform, you can build an on-premises solution with the same tools.
Install Ollama following the steps for your platform, then pull the Qwen 3.5 4b model:
ollama pull qwen3.5:4b

Qwen 3.5 4b is a 3.4GB model that runs on a laptop, handles 256K tokens of context, and produces solid answers from retrieved content. We'll upgrade to a larger model later to see what changes.
Verify it's running:
ollama list
# NAME ID SIZE MODIFIED
# qwen3.5:4b 2a654d98e6fb 3.4 GB Just now

Connect to Ollama from Node.js (full version):
import ollama from "ollama";
const response = await ollama.chat({
model: "qwen3.5:4b",
messages: [{ role: "user", content: "Say hello in one sentence." }],
});
console.log(response.message.content);

If you see a greeting in the console, you're good.
If your content lives in Storyblok, you can fetch articles via the Content Delivery API and feed them into the same pipeline. As we explain later in this tutorial, the component-based structure gives you even better chunking out of the box.
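If you go that route, a rough sketch with the storyblok-js-client package might look like the following; the folder slug, field names, and mapping are assumptions to adapt to your own content model:

```js
// Hypothetical sketch: pull articles from Storyblok's Content Delivery API and
// map them to the { title, body, url } shape used by this pipeline.
import StoryblokClient from "storyblok-js-client";

const storyblok = new StoryblokClient({ accessToken: "YOUR_ACCESS_TOKEN" });

// "articles/" and the `body` field are assumptions; adjust them to your space.
const { data } = await storyblok.get("cdn/stories", {
  starts_with: "articles/",
  version: "published",
  per_page: 100,
});

const articles = data.stories.map((story) => ({
  title: story.name,
  body: story.content.body,
  url: story.full_slug,
}));
```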
Build the RAG pipeline
This is the core of the article, five steps with working examples. The snippets below are extracted from the complete pipeline in scripts/5-ask.js. Treat the snippets as a reference implementation: understand how each piece works, then write a version tailored to your data and use case.
1. The retrieval function
To implement a great search engine that gives useful answers, you first need to improve the retrieval from the vector database.
The earlier solution used nearText for pure vector search. It works, but it has a blind spot: specific terms. Try searching for an error code like ERR_MODULE_NOT_FOUND or a version number like v4.14.0. Vector search treats these as meaningless noise because they carry no semantic weight.
The fix is a hybrid search: combine vector similarity with keyword matching (BM25). Weaviate supports this natively with a single alpha parameter that controls the balance: 0 means pure keywords, 1 means pure vectors, and anything in between blends both signals.
Check the following snippet:
import weaviate from "weaviate-client";
import ollama from "ollama";
const client = await weaviate.connectToLocal();
const collection = client.collections.get("Article");
async function retrieve(question, limit = 5) {
const results = await collection.query.hybrid(question, {
limit,
alpha: 0.75, // That's the balance control
returnMetadata: ["score"],
returnProperties: ["title", "body", "description"],
});
return results.objects;
}

The `alpha: 0.75` leans toward semantic understanding but doesn't ignore keywords. In practice, this catches both conceptual matches ("How do I set up internationalization?") and exact matches ("ERR_MODULE_NOT_FOUND").
Here's the same query ("v4", as if a user searched with just that keyword) run through both approaches to demonstrate the difference:
const query = process.argv[2] || "How do I set up Storyblok with Next.js?";
// Pure vector search (Part 2 approach)
const vectorResults = await collection.query.nearText(query, {
limit: 3,
returnMetadata: ["distance"],
returnProperties: ["title"],
});
// Hybrid search (Part 3 approach)
const hybridResults = await collection.query.hybrid(query, {
limit: 3,
alpha: 0.75,
returnMetadata: ["score"],
returnProperties: ["title"],
});

Run it with a keyword-heavy query:

node scripts/4-search.js "v4"

--- nearText (pure vector) ---
Query: "v4"
📄 You're Not Building Netflix: Stop Coding Like You Are
📄 Nobody Writes Clean Code. We All Just Pretend
📄 Developers vs AI: Are We Becoming AI Managers Instead of Coders?

--- hybrid (vector + BM25) ---
Query: "v4"
📄 Introducing Storyblok CLI v4
📄 You're Not Building Netflix: Stop Coding Like You Are
📄 Nobody Writes Clean Code. We All Just Pretend

The vector search is completely lost: "v4" carries no semantic meaning, so it returns irrelevant articles. Meanwhile, the hybrid search detects the literal keyword match and surfaces "Introducing Storyblok CLI v4" at the top. That's BM25 doing its job, with the vector component still contributing to the overall ranking.
2. The prompt template
Retrieval gives you context. Now you need to package it for the LLM. Like the imaginary librarian, you literally take the sources from the vector database and augment the prompt, turning it into a powerful command that guides the LLM to craft a meaningful answer for your users.
This is the augmentation step, and the system prompt is arguably the most important piece of code in the entire pipeline. Get it wrong, and the model ignores the context and hallucinates. Get it right, and you have an assistant that stays grounded, cites its sources, and admits when it doesn't know:
function buildPrompt(question, sources) {
const context = sources
.map(
(source, i) =>
`[Source ${i + 1}: ${source.properties.title}]\n${source.properties.body.slice(0, 2000)}`
)
.join("\n\n---\n\n");
return {
system: `You are a helpful assistant that answers questions based ONLY on the provided context.
If the context does not contain enough information to answer the question, say so clearly.
When you use information from a source, cite it by number (e.g., [Source 1]).
Keep answers concise and practical.`,
user: `Context:\n${context}\n\nQuestion: ${question}`,
};
}

Every word in that system prompt is there for a reason. Let's break it down:
- “Based ONLY on the provided context”: this is the anti-hallucination instruction. Without it, the model happily fills in gaps from its training data, which defeats the entire purpose of RAG.
- Numbered sources: formatting the context with clear `[Source N]` labels lets the model cite its references. This makes every claim in the answer verifiable.
- The "say so" escape hatch: explicitly telling the model to say "I don't know" curbs its tendency to fabricate answers. LLMs are trained to be helpful, which means they'll invent something rather than stay silent. You need to give the model explicit permission to decline.
After the system prompt, concatenate the sources with the user’s question. This is the magic of augmentation.
3. The generation call
Now it’s time to send the assembled prompt to Ollama and get an answer back. Let's start simple, with a blocking call, then upgrade to streaming:
import ollama from "ollama";
async function generate(prompt) {
const response = await ollama.chat({
model: "qwen3.5:4b",
messages: [
{ role: "system", content: prompt.system },
{ role: "user", content: prompt.user },
],
});
return response.message.content;
}

This works, but the experience is terrible. You wait several seconds staring at a blank screen, then the entire answer appears at once. Streaming transforms the experience into something that feels like a real AI assistant:
import ollama from "ollama";

async function generateStream(prompt, model = "qwen3.5:4b") {
  const stream = await ollama.chat({
    model,
    messages: [
      { role: "system", content: prompt.system },
      { role: "user", content: prompt.user },
    ],
    think: false,
    stream: true,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.message.content);
  }
  console.log();
}

Notice the most important changes here: `stream: true` and a `for await` loop. The tokens flow in as the model generates them, exactly like ChatGPT, but running on your laptop.
To generate faster responses and less verbose output from Qwen or other thinking models, turn off thinking with think: false.
4. Assemble all the pieces
Let's wire retrieval, prompt building, and generation into a single function:
async function askQuestion(question, model) {
console.log(`\n🔍 Searching for: "${question}"\n`);
// Step 1: Retrieve
const sources = await retrieve(question, 5);
console.log(
`📚 Found ${sources.length} sources:`,
sources.map((s) => s.properties.title)
);
// Step 2: Augment
const prompt = buildPrompt(question, sources);
// Step 3: Generate
console.log(`\n💬 Answer (${model}):\n`);
await generateStream(prompt, model);
}
// Read question + model from CLI args, with sensible defaults
const question = process.argv[2] || "How do I set up Storyblok with Next.js?";
const model = process.argv[3] || "qwen3.5:4b";
await askQuestion(question, model);
client.close();

Question in, grounded answer out. Three steps, clearly separated, easy to debug. The second CLI argument lets us swap models without touching the code, which we'll use in the model comparison section below.
5. Run it
Test your implementation with a real question. Remember that LLMs are non-deterministic, so the answer you’ll get could be different.
node scripts/5-ask.js "How do I set up Storyblok with Next.js?"

🔍 Searching for: "How do I set up Storyblok with Next.js?"
📚 Found 5 sources: [
'Announcing React SDK v4 with full support for React Server Components',
'Announcing Official Storyblok Richtext Support in our Frontend SDKs',
'Announcing Stable Live Preview for Storyblok’s Astro SDK',
'Introducing Storyblok CLI v4',
'Announcing Storyblok Svelte SDK v5: now fully compatible with Svelte 5'
]
💬 Answer (qwen3.5:4b):
Based on the provided context, here's how to set up Storyblok with Next.js:
1. Install the SDK: `npm install @storyblok/react@latest` [Source 2]
2. Initialize the SDK with a `getStoryblokApi()` function in `lib/storyblok.js` [Source 1]
3. For App Router with React Server Components, ensure you're on v4.0.0+ which provides full RSC support with live preview [Source 1]

The answer is much more useful: it cites specific sources, draws from multiple articles to compose a complete response, and gets everything right, because it's reading your content, not guessing from training data.
But it can be even better.
Chunking: where most RAG pipelines fail
Look back at the answer we got in step 5. It's grounded in the right articles, cites sources, and isn’t hallucinating. But why does it feel…generic?
The model retrieved the correct articles, but the embeddings averaged across 3,000 words, diluting the specific setup steps. You’re forced to keep probing instead of getting started with integrating Storyblok into your Next.js project.
The problem is embedding dilution. When you vectorize an entire 3,000-word article, the result represents the average meaning of the whole piece. It's like describing a book by its most frequent word. The specific paragraph about cache TTL gets buried under the other 2,900 words about API authentication, rate limits, and SDK setup.
The fix is chunking: break documents into smaller, focused pieces before vectorizing them.
Naive chunking: the random way
Most tutorials suggest splitting text every 500 tokens (or every X tokens). Fast, simple, and useless. It slices mid-sentence, mid-paragraph, mid-thought. The resulting vectors are semantic smoothies, blended beyond recognition.
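For contrast, here is a minimal sketch of that naive approach, approximating tokens with whitespace-separated words (the 500 figure is arbitrary):

```js
// Naive fixed-size chunking: group every `chunkSize` words regardless of where
// sentences, paragraphs, or code blocks begin and end.
function chunkBySize(article, chunkSize = 500) {
  const words = article.body.split(/\s+/);
  const chunks = [];
  for (let i = 0; i < words.length; i += chunkSize) {
    chunks.push({
      title: article.title,
      chunkIndex: chunks.length,
      body: words.slice(i, i + chunkSize).join(" "),
    });
  }
  return chunks;
}
```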
Heading-based chunking: the practical approach
Our data is in Markdown (DEV.to articles from part 2), so we can do better: split by headings. Each section under an ## or ### heading is a coherent chunk with a clear topic (full version):
function chunkByHeaders(article) {
const sections = article.body.split(/(?=^#{2,3}\s)/m);
return sections
.filter((section) => section.trim().length > 50)
.map((section, i) => ({
title: article.title,
chunkIndex: i,
body: section.trim(),
originalUrl: article.url,
author: article.author,
}));
}

Now re-ingest the chunked content into Weaviate. The script creates a fresh ArticleChunk collection, flattens every article into its chunks, and batches the inserts so we don't spam the DB one object at a time.
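The collection-creation step isn't reproduced in the snippet below. As a rough sketch, creating ArticleChunk by hand with the weaviate-client v3 API could look something like this; the exact configure helpers and option names differ between client versions, so treat them as assumptions and check your client's docs:

```js
// Hypothetical sketch: a fresh ArticleChunk collection vectorized by the same
// text2vec-transformers module used for Article. Verify the helper names
// against your installed weaviate-client version.
await client.collections.create({
  name: "ArticleChunk",
  vectorizers: weaviate.configure.vectorizer.text2VecTransformers(),
  properties: [
    { name: "title", dataType: "text" },
    { name: "chunkIndex", dataType: "int" },
    { name: "body", dataType: "text" },
    { name: "originalUrl", dataType: "text" },
    { name: "author", dataType: "text" },
  ],
});
```

With the collection in place, flatten the articles into chunks and batch the inserts: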
import { readFileSync } from "fs";
const data = JSON.parse(readFileSync("./data/articles.json", "utf-8"));
const chunks = data.flatMap(chunkByHeaders);
console.log(`Chunked ${data.length} articles into ${chunks.length} chunks`);
const objects = chunks.map((chunk) => ({ properties: chunk }));
// Batch insert
const batchSize = 20;
for (let i = 0; i < objects.length; i += batchSize) {
const batch = objects.slice(i, i + batchSize);
await client.collections.get("ArticleChunk").data.insertMany(batch);
}

Run the chunking script:
node scripts/3-ingest-chunks.js

Created ArticleChunk collection
Chunked 20 articles into 297 chunks
Ingested 297 chunks

20 articles became 297 focused chunks. To experience the difference, use a helper file: 6-compare-chunking.js. It's a trimmed variant of 5-ask.js that runs the same question through both collections (Article and ArticleChunk) and prints both answers back-to-back.
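The helper isn't reproduced in full here, but its core is a loop like the following sketch, reusing the buildPrompt and generateStream functions from earlier and running the same hybrid query against each collection:

```js
// Sketch of the comparison loop: same question, two collections, two answers.
// Assumes buildPrompt() and generateStream() from the previous steps are in scope.
const question = process.argv[2] || "How do I set up Storyblok with Next.js?";

for (const name of ["Article", "ArticleChunk"]) {
  const collection = client.collections.get(name);
  const results = await collection.query.hybrid(question, {
    limit: 5,
    alpha: 0.75,
    returnMetadata: ["score"],
    returnProperties: ["title", "body"],
  });
  console.log(`\n💬 Answer (qwen3.5:4b — ${name}):\n`);
  await generateStream(buildPrompt(question, results.objects));
}
client.close();
```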
Let's run it:
node scripts/6-compare-chunking.js "How do I set up Storyblok with Next.js?"

The whole-article answer comes out first and looks close to the one from step 5: correct at a high level, but vague. Now look at the answer generated from the chunked articles:
💬 Answer (qwen3.5:4b — chunked articles):
1. Install the SDK: `npm install @storyblok/react@latest` [Source 5]
2. Initialize the SDK in `lib/storyblok.js`:
import { apiPlugin, storyblokInit } from '@storyblok/react/rsc';
export const getStoryblokApi = storyblokInit({
accessToken: 'YOUR_ACCESS_TOKEN',
use: [apiPlugin],
components: { teaser: Teaser, page: Page },
});
[Source 1]
3. Wrap your app with `StoryblokProvider` in `app/layout.jsx` [Source 1]
4. Create the Provider as a client component with `'use client'` [Source 1]
5. Fetch and render content using `getStoryblokApi()` in server components
with `StoryblokStory` [Source 1]

Same model, same question, but the chunked version features storyblokInit, StoryblokProvider, the 'use client' directive, server component patterns with StoryblokStory, and actual working setup steps pulled from focused sections, not averaged across entire articles.
The chunk-level embeddings captured the precise meaning of each section instead of diluting it.
Structure-aware chunking: the CMS advantage
For plain Markdown, heading-based splitting works well. But if your content lives in a CMS with a component model, you can do much better.
Consider how content is structured in Storyblok: a typical article has a title component, an introduction block, multiple section blocks (each with its own heading and body), code blocks, and a conclusion.
Each component is a self-contained semantic unit with clear boundaries. No need to invent a structure for chunking—the content architecture is the semantic structure.
| Strategy | Pros | Cons | Best for |
|---|---|---|---|
| Fixed-size (every N tokens) | Simple, predictable | Ignores context boundaries | Prototypes only |
| Heading-based | Respects document structure | Misses nested semantics | Markdown, flat files |
| Structure-aware (CMS components) | Preserves semantic boundaries | Requires content model knowledge | Structured CMS content |
This is one of the underappreciated advantages of structured content in the AI era: your component model gives you chunking for free.
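As an illustration, here's a sketch of component-based chunking for a hypothetical Storyblok article whose body is a list of section blocks, each with its own heading and text; the field names are assumptions, not a fixed Storyblok schema:

```js
// Hypothetical sketch: one chunk per section component of a Storyblok story.
// `body`, `heading`, and `text` depend entirely on your own content model.
function chunkByComponents(story) {
  return story.content.body
    .filter((block) => block.component === "section")
    .map((block, i) => ({
      title: story.name,
      chunkIndex: i,
      body: `${block.heading}\n\n${block.text}`,
      originalUrl: story.full_slug,
    }));
}
```

Each chunk carries exactly one component's worth of meaning, with the story's metadata attached, which is exactly what the embedding model needs.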
Model swap: when bigger isn't better
You might be thinking, “Sure, chunking helped, but what if I just use a bigger model with the whole articles? More parameters must generate better answers.”
To test that, keep the whole-article retrieval (no chunks) and upgrade the model instead:
ollama pull qwen3.5:9b

Ask the bigger model the same question:
node scripts/5-ask.js "How do I set up Storyblok with Next.js?" "qwen3.5:9b"

💬 Answer (qwen3.5:9b — whole articles):
1. Install the SDK: `npm install @storyblok/react@latest` [Source 1][Source 2]
2. Initialize the SDK: Create a file (e.g., `lib/storyblok.js`) to initialize the SDK and export the `getStoryblokApi()` function.
3. Use Rich Text Components: render rich text fields using the `StoryblokRichText` component [Source 2].
4. For Live Preview: version 4.0.0+ provides full support for live preview functionality [Source 1].

The qwen3.5:9b model took noticeably longer to generate and used much more memory. And the result? Roughly the same generic answer the qwen3.5:4b gave in step 5. Compare that with the chunked version that included storyblokInit, StoryblokProvider, the 'use client' directive, proper file paths, and server component fetching with StoryblokStory.
The bigger model still relies on diluted whole-article embeddings. It can't synthesize what isn't in the context.
In RAG, the most impactful improvements come from how you prepare and retrieve context, not from throwing a bigger model at the problem. Chunking strategy, hybrid search tuning, and prompt design are all human engineering decisions that outweigh model size. You don't need the most expensive model. You need smarter retrieval, better chunking, and a well-crafted system prompt.
This is equally true for cloud APIs. You don't need to call the most expensive frontier model if your retrieval pipeline is solid. A smaller, cheaper model with great context will outperform a giant model with mediocre context, and cost a fraction of the price.
Start small, invest in your retrieval pipeline, and upgrade the model only when you've exhausted the engineering improvements.
From local to production
The pipeline works. Before shipping it, add one critical safety net and survey additional production requirements.
Relevance thresholds
The most dangerous failure mode in RAG isn't a wrong answer, it's a confident wrong answer based on irrelevant context. If someone asks an off-topic question and the retrieval returns vaguely related noise, the model will dutifully synthesize an answer from that noise.
The fix is a relevance threshold: if nothing in the database is close enough to the question, say so instead of guessing:
async function retrieve(question, limit = 5, threshold = 0.3) {
const results = await collection.query.hybrid(question, {
limit,
alpha: 0.75,
returnMetadata: ["score"],
returnProperties: ["title", "body", "description"],
});
// Filter out results below the relevance threshold
const relevant = results.objects.filter(
(obj) => obj.metadata.score >= threshold
);
if (relevant.length === 0) {
return null; // Signal: nothing relevant found
}
return relevant;
}

Hybrid search scores combine vector similarity and BM25 keyword matching, but don't map directly to cosine similarity. A threshold of 0.3 filters out noise while keeping genuinely relevant results. The exact value depends on your data and model, so test and adjust based on the results.
Update the main pipeline:
async function askQuestion(question, model) {
const sources = await retrieve(question, 5);
if (!sources) {
console.log("I don't have enough information to answer that question.");
client.close();
return;
}
const prompt = buildPrompt(question, sources);
await generateStream(prompt, model);
}
} Finally, test it with an off-topic question:
node scripts/5-ask.js "What's the best pizza in Naples?"

🔍 Searching for: "What's the best pizza in Naples?"
📚 Found 5 sources: [
"Mastering Scheduling and Gantt Charting: A Recap of Bryntum's Guide",
"DEV's Worldwide Show and Tell Challenge Presented by Mux",
'Will WebAssembly Kill JavaScript? Let's Find Out',
'Building a Custom Calendar with React + Storyblok',
"You're Not Building Netflix: Stop Coding Like You Are"
]
💬 Answer (qwen3.5:4b):
The provided context does not contain enough information to answer the question.
None of the sources discuss food, restaurants, or pizza in Naples.

No hallucinations, and that's exactly what builds trust: the assistant knows when to stay quiet.
What to explore next
A production RAG system has more aspects than one tutorial can cover.
Here's a roadmap:
- Reranking with cross-encoders: a second retrieval stage that re-scores the top results with much higher precision. Weaviate's `reranker-transformers` module supports this natively for self-hosted instances. This is the single biggest quality improvement you can add after chunking.
- Context window management: LLMs have token limits (though Qwen 3.5's 256K context window gives you plenty of room). When your retrieval returns more context than the model can handle, you need a truncation strategy that preserves the most relevant content (see the sketch after this list).
- Caching common queries: if the same questions come up repeatedly, cache the retrieval results (or even the full answers) to reduce latency and computation.
- Error handling: what happens when Ollama is slow, returns garbage, or the context window is exceeded? Production systems need graceful degradation.
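As a taste of the context window item, here is a minimal sketch of a truncation strategy: keep sources in relevance order and stop adding them once an arbitrary character budget (a crude stand-in for tokens) is used up:

```js
// Minimal sketch: drop lower-ranked sources once the character budget is spent.
// 24000 characters is an arbitrary budget, not a recommendation.
function fitToBudget(sources, charBudget = 24000) {
  const kept = [];
  let used = 0;
  for (const source of sources) {
    const length = source.properties.body.length;
    if (used + length > charBudget) break;
    kept.push(source);
    used += length;
  }
  return kept;
}
```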
Takeaways
In part 1, you learned what vectors are. In part 2, you built a search engine. In this part, you turned that search engine into an AI assistant that runs locally.
The key lessons:
- Hybrid search beats pure vector search: keywords and semantics complement each other.
- Chunking matters more than model size: an efficient retrieval pipeline with a small model outperforms a mediocre pipeline with a frontier model.
- RAG makes LLMs informed, not smarter: the quality of the answers depends on the quality of the retrieval.
- Know when to say “I don't know”: relevance thresholds prevent the most dangerous failure mode: confident wrong answers.
- Structured content is a RAG superpower: if your content lives in a CMS with a component model (like Storyblok), the content architecture is the semantic structure. Your components give you chunking for free. This is the idea behind Strata, Storyblok's AI-ready content layer, currently in preview.
The pipeline you built is linear: retrieve, then generate. The industry keeps progressing (agentic retrieval, graph RAG, evaluation frameworks like RAGAS), but hybrid search, smart chunking, and grounded generation remain the foundation everything else builds on.
Your content already has the answers; you just need to tell the AI where to look.