
Build a Semantic Search Engine from scratch

Developers
Edoardo Dusi

In Part 1 of this series, we explored the concepts behind vector databases: how embedding models transform content into numerical coordinates on a “meaning map”, why semantic search outperforms traditional keyword matching, and the use cases that make this technology essential for modern AI applications.

Now it's time to get our hands dirty.

In this tutorial, we build a working semantic search engine from scratch. We fetch blog posts, store them in a vector database, and query them by meaning rather than keywords. Here's what you can expect to learn:

  • Set up Weaviate, an Open Source vector database, locally with Docker
  • Design a schema and ingest content with automatic vectorization
  • Perform semantic searches and combine them with structured filters
  • Understand how approximate nearest neighbor (ANN) indexes make similarity search fast

This article is the second in a three-part series that aims to demystify vector databases.

  • The “What & Why”: a conceptual introduction to what a vector database is, why it's a game-changer, and its core use cases.
  • The “How” (you are here): a practical, hands-on tutorial where we build a project to demonstrate the technology in action.
  • The “Real World”: an advanced guide covering production concerns like scaling, hybrid search, and architectures like Retrieval-Augmented Generation (RAG).

At Storyblok, we develop solutions that enable our CMS to do more than manage stories and assets, ensuring your content is smarter and AI-ready. This vision has led to our latest offering, Strata.

In this series, we share the foundational knowledge behind Strata, a custom vector database that's based on your content and redefines what you can do with it.

The tech stack

  • Weaviate (v1.34.5): our vector database
  • Docker: to run Weaviate and the embedding model locally
  • Node.js: for the ingestion and search scripts

The sample content is posts from DEV.to.

Note:

The snippets in this tutorial are simplified and focus on the core concepts being discussed. You can find the complete, working code from this tutorial in a GitHub repository.

Set up Weaviate locally

In production, you'd likely use Weaviate Cloud or a managed deployment, but for understanding the mechanics, local is ideal: it's perfect for learning, gives you full control, and requires no account.

We use Docker Compose to spin up two containers: Weaviate itself and a text2vec transformer model from Hugging Face that generates the embeddings.

First, create a file with the following contents (full version):

docker-compose.yml
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.34.5
    ports:
      - "8080:8080"
      - "50051:50051"
    environment:
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
    depends_on:
      - t2v-transformers

  t2v-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2
    environment:
      ENABLE_CUDA: '0'

We expose two ports:

  • 8080 for the REST/GraphQL API (used for schema operations and simple queries) and
  • 50051 for gRPC (used by client libraries for high-throughput data operations).

This tutorial uses Weaviate’s JavaScript/TypeScript client v3.9.0, which fits our Node.js stack and talks to Weaviate over both ports.

Run the following command to start:

docker compose up -d

Once running, verify that Weaviate is responding:

curl http://localhost:8080/v1/meta | jq '.version'
# Should return "1.34.5"
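
You can run a similar check from Node.js with the client mentioned above. Here's a minimal sketch, assuming you've installed weaviate-client from npm; connectToLocal() already defaults to these values, so the explicit options are only there to show which port serves which protocol:

import weaviate from 'weaviate-client';

// connectToLocal() defaults to localhost:8080 and gRPC on 50051
const client = await weaviate.connectToLocal({
  host: 'localhost',
  port: 8080,      // REST/GraphQL
  grpcPort: 50051, // gRPC, used for batch imports and queries
});

console.log(await client.isReady()); // true once both containers are up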

Understand vectorizer modules and models

Before you start coding, take a moment to understand how Weaviate handles vectorization. This relates to the embedding models discussed in part 1.

Weaviate uses a module architecture. A module is an integration layer that connects Weaviate to an embedding provider. The module we use, text2vec-transformers, runs a neural network locally in a pre-built Docker container provided by Weaviate.

The embedding model is what transforms the content into coordinates on the “meaning map.” Weaviate supports various self-hosted and API-based models, and choosing the right one affects the quality of the search results.

The model we use in this tutorial, all-MiniLM-L6-v2, produces 384-dimensional vectors. It's fast and provides good quality for demonstrations, but it isn’t suitable for production workloads.

If you plan to self-host, consider bigger models, like all-mpnet-base-v2 or snowflake-arctic-embed-l.

Model                     Dimensions   Trade-off
all-MiniLM-L6-v2          384          Fast, good for demos
all-mpnet-base-v2         768          Better quality, slower
snowflake-arctic-embed-l  1024         High accuracy, higher cost

Weaviate can also integrate with cloud-based vectorizers like text2vec-openai or text2vec-cohere, which provide state-of-the-art embeddings but require API keys.

Alternatively, use a custom local model with Ollama and text2vec-ollama.

Create a data collection

The first step is to fetch data and define a matching schema.

The sample data

DEV.to has a free API that lets you fetch articles by organization or tag. Download a set of articles and store them in a JSON file as an array of objects with the following structure (full version):

data/articles.json
{
  "title": "Article title",
  "description": "Brief summary",
  "body": "Full article content...",
  "url": "<https://dev.to/>...",
  "author": "Author name",
  "published_at": "2025-01-15T10:00:00Z",
  "tags": ["javascript", "tutorial"]
}

This same approach works with any content source: Storyblok, another CMS, or your own data.
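
If you'd rather build that file yourself than reuse ours, the download step could look roughly like the sketch below. The endpoint and field names come from the public DEV.to (Forem) API and may differ slightly from what the repository's script does; note in particular that the article list endpoint doesn't include the full body, which requires one extra request per article:

import { writeFile } from 'node:fs/promises';

// Fetch recent articles for a tag from the public DEV.to API
const response = await fetch('https://dev.to/api/articles?tag=javascript&per_page=30');
const posts = await response.json();

// Keep only the fields our Article schema needs
const articles = posts.map((post) => ({
  title: post.title,
  description: post.description,
  body: post.description,          // the full body needs a follow-up request per article
  url: post.url,
  author: post.user?.name,
  published_at: post.published_at,
  tags: post.tag_list,
}));

await writeFile('data/articles.json', JSON.stringify(articles, null, 2));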

Design the data schema

In Weaviate, a schema defines the structure of your data, similar to a table schema in a relational database, but with vector-specific capabilities. The two relevant terms are collections and vectorizers.

Think of a collection as a database table, and its properties as columns. The key difference is the vectorizer configuration. To create a collection, specify which properties should be embedded and which model to use.

The following snippet is a subset of how we define an Article collection (full version):

scripts/1-create-collection.js
// ...

await client.collections.create({
    name: 'Article',
    description: 'Blog articles for semantic search',

    // Define the structure of our data
    properties: [
      { name: 'title', dataType: 'text' },
      { name: 'description', dataType: 'text' },
      { name: 'body', dataType: 'text' },
      { name: 'url', dataType: 'text' },
      { name: 'author', dataType: 'text' },
      { name: 'organization', dataType: 'text' },
      { name: 'published_at', dataType: 'date' },
      { name: 'tags', dataType: 'text[]' },
    ],

    // Configure automatic vectorization
    // Only title and body are embedded - not URLs, dates, etc.
    vectorizers: weaviate.configure.vectorizer.text2VecTransformers({
      sourceProperties: ['title', 'body'],
    }),
  });
  
  // ...

The crucial decision here is sourceProperties: ['title', 'body']. This instructs Weaviate to vectorize only the semantic content: the title and body text. Metadata such as tags or author names shouldn't influence vector similarity, and values like URLs or dates would only pollute your results with meaningless noise.

Ingest the content

With the schema defined, it's time to import articles using Weaviate's client (full version):

scripts/2-ingest.js
import weaviate from 'weaviate-client';

// Connect to local Weaviate
const client = await weaviate.connectToLocal();
const articles = client.collections.get('Article');

// Our articles loaded from the JSON file
const objects = [{ title: '...', body: '...', /* ... */ }];

// Batch import
const result = await articles.data.insertMany(objects);

What happens under the hood when the app ingests an article?

  1. Weaviate receives the object with its properties
  2. It extracts the title and body fields (the sourceProperties)
  3. It sends that text to the transformer container
  4. The transformer returns a 384-dimensional vector
  5. Weaviate stores both the original properties and the vector

This automatic vectorization is one of Weaviate's strengths. No need to manage embeddings; send text, and Weaviate handles the rest.
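
To convince yourself the vectors are really there, you can fetch an object back together with its stored embedding. Here's a minimal sketch using the same client; the exact shape of the returned vector field may vary slightly between client versions:

// Fetch one object and include its stored vector
const { objects } = await articles.query.fetchObjects({
  limit: 1,
  includeVector: true,
});

console.log(objects[0].properties.title);
console.log(objects[0].vectors.default.length); // 384 with all-MiniLM-L6-v2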

Hint:

This tutorial ingests articles as single documents. In production, you'd typically split longer content into smaller chunks for more precise retrieval. Don’t miss part 3, where we’ll cover chunking strategies.

Semantic search in action

Now for the payoff. Let's query those articles by meaning (full version):

scripts/3-search.js
const articles = client.collections.get('Article');

const result = await articles.query.nearText(
  'storyblok in the php ecosystem',
  {
    limit: 3,
    returnMetadata: ['distance']
  }
);

The nearText method does something powerful: it takes the query string, passes it through the same embedding model that processed the articles, and finds the vectors closest to the query vector.
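
To display the matches, loop over the returned objects and read the distance from each object's metadata; a minimal sketch:

// Print each match with its title and distance
for (const item of result.objects) {
  console.log(`📄 ${item.properties.title}`);
  console.log(`   Distance: ${item.metadata?.distance?.toFixed(4)}`);
}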

Here's what we get back:

📄 Storyblok unveils new PHP packages in collaboration with SensioLabs
   Distance: 0.2082

📄 Announcing Official Storyblok Richtext Support in our Frontend SDKs
   Distance: 0.3900

📄 Global Financial Starter: Multilingual Template
   Distance: 0.4128

Let’s examine the results. The query was “storyblok in the php ecosystem”, and the best match, with an excellent distance score of 0.2082, is the article about PHP packages developed with SensioLabs. The second and third results are also relevant, but more distant.

Learn:

Weaviate returns a distance score with each result. By default, it uses cosine distance for text.

  • Cosine similarity: 1 = identical, 0 = unrelated
  • Cosine distance: 0 = identical, 2 = opposite

Lower distance means higher relevance, and the thresholds will vary depending on the model and content domain. Experimentation is key.
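
If you want to see where those numbers come from, cosine distance is simply 1 minus cosine similarity. Weaviate computes this internally; the plain-JavaScript function below is just for intuition:

// Cosine distance = 1 - cosine similarity
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineDistance([1, 0], [1, 0]);  // 0 -> identical direction
cosineDistance([1, 0], [0, 1]);  // 1 -> unrelated (orthogonal)
cosineDistance([1, 0], [-1, 0]); // 2 -> opposite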

This is the power of semantic search. The embedding model understands that “php ecosystem” relates to PHP packages, SDKs, and developer tools, even when those exact words aren’t in the title.

Filtering: combine vectors with structured data

Pure semantic search is powerful, but real applications need boundaries. For example, a user might ask for something like “find similar articles published in 2025”.

Weaviate enables combining vector search with property filters:

scripts/3-search.js
// ...

// Semantic search + organization filter
const result = await articles.query.nearText('frontend development tips', {
  limit: 3,
  filters: articles.filter.byProperty('organization').equal('Storyblok')
});


// Semantic search + tag filter
const result = await articles.query.nearText('building user interfaces', {
  limit: 3,
  filters: articles.filter.byProperty('tags').containsAny(['react'])
});


// Semantic search + date filter
const result = await articles.query.nearText('developer productivity', {
  limit: 3,
  filters: articles.filter
    .byProperty('published_at')
    .greaterOrEqual(new Date('2025-01-01'))
});

Weaviate applies the filters before the vector search, so it returns semantically similar results within the filtered subset. This is where vector databases stand out compared to pure embedding search: they combine semantic understanding with the structured querying of traditional databases.

Learn:

The difference between filtered search and hybrid search

Vector search with filters performs a semantic search within a filtered subset of the data.

True hybrid search merges results from both a vector search and a traditional keyword search (BM25), then re-ranks them.

In part 3, we'll explore how Weaviate supports this via the hybrid query method.
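
As a quick preview of what that looks like in the JavaScript client (the alpha parameter blends the two scores; treat this as a sketch, and see part 3 for the details):

// Hybrid search: blends BM25 keyword scoring with vector similarity
const hybridResult = await articles.query.hybrid('storyblok php packages', {
  alpha: 0.5, // 0 = pure keyword, 1 = pure vector
  limit: 3,
});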

You may have noticed that the semantic queries returned results almost instantly, even though Weaviate had to compare the query vector against every article in the database. With a few dozen articles, that's trivial. But what happens when you have millions of vectors?

In part 1, we mentioned that vector databases use ANN algorithms to speed up similarity searches. But how does this work in practice?

When you have millions of vectors, comparing a query against every single one would be prohibitively slow. ANN algorithms build an index structure that finds approximately the nearest neighbors without exhaustive comparison.

Weaviate’s default index is Hierarchical Navigable Small World (HNSW). Instead of checking every vector, HNSW builds a graph of connections between vectors. When you search, it navigates this graph, jumping through layers to quickly narrow down the most similar vectors. If you're familiar with the skip list algorithm, think of HNSW as a skip list combined with a navigable graph.

HNSW has several configurable parameters that balance speed, recall, and memory usage. The defaults usually work well, but you can fine-tune these settings.
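
For reference, here's a sketch of what tuning those parameters could look like at collection-creation time, using the client's configure helpers. The option names and values are illustrative assumptions rather than recommendations; check the Weaviate docs for your client version before copying them:

import weaviate from 'weaviate-client';

const client = await weaviate.connectToLocal();

await client.collections.create({
  name: 'ArticleTuned', // separate collection so it doesn't clash with 'Article'
  vectorizers: weaviate.configure.vectorizer.text2VecTransformers({
    sourceProperties: ['title', 'body'],
    vectorIndexConfig: weaviate.configure.vectorIndex.hnsw({
      efConstruction: 128, // build-time effort: higher = better graph, slower ingest
      maxConnections: 32,  // edges per node: higher = better recall, more memory
      ef: 64,              // query-time effort: higher = better recall, slower queries
    }),
  }),
});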

The “approximate” in ANN might sound concerning, but in practice, HNSW achieves over 95% recall while being significantly faster than exact search. For content search, this trade-off is almost always worthwhile.

Conclusion

Building on the conceptual foundations in part 1, this tutorial shows how to:

  • Run a vector database locally with Docker
  • Evaluate the relationship between vectorizer modules and embedding models
  • Design schemas that separate semantic content from metadata
  • Perform semantic searches based on meaning, not just keywords
  • Combine vector similarity with structured filters
  • Understand how HNSW indexes speed up ANN search

These building blocks power intelligent search, recommendations, and AI assistants across the industry. At Storyblok, we've packaged these patterns into Strata, making content AI-ready without requiring you to manage vector infrastructure yourself.

But there's more to explore. The real power of vector databases emerges when you combine them with large language models (LLMs) via RAG. Instead of an LLM making things up, it answers questions grounded in your content.

In part 3, we tackle real-world scenarios: building a production-ready RAG pipeline, implementing true hybrid search, tuning relevance thresholds, and creating AI applications for your content.

Semantic search isn't about to arrive. It's here. And now you know how to build it.