
Vector Databases & Embeddings: The 'Semantic Atlas' Mental Model

How does Spotify know you'll like this song? A mastery guide to embeddings, cosine similarity, and vector databases (Pinecone, Weaviate, pgvector).

Spotify recommends a song you’ve never heard and you immediately love it. Netflix shows you a movie that’s perfect for your mood. Google finds the exact document you need even though you used different words.

All of these are powered by Embeddings and Vector Databases.

This is the Mastery Guide to the infrastructure powering modern AI — from semantic search to choosing the right vector DB for production.


Part 1: Foundations (The Mental Model)

Traditional Search = Finding an Exact Word

A traditional database search is lexical — it looks for exact character matches.

SELECT * FROM docs WHERE content LIKE '%refund policy%';

This finds “refund policy” but misses: “money back guarantee”, “cancellation terms”, or “how to return a product” — all meaning the same thing.

Embeddings = The Semantic Atlas

An Embedding is a mathematical translation of meaning into coordinates in a high-dimensional space (typically 768–3072 dimensions).

Think of it like a map (2D simplification):

"Machine Learning" ── "Neural Networks" ── "Deep Learning"

          "Quantum Physics" ── "Astronomy"

"Dog" ──── "Cat" ──── "Puppy"

"Democracy" ─ "Election" ─ "Politics"

Words/sentences that are semantically similar are close together on this map. Your query “money back” lands near “refund policy” — even though the words are different.

Key insight: A Vector DB doesn’t search for words. It searches for nearby coordinates on the semantic map.
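To make that concrete, here is a toy 2D version of the map. The coordinates are invented purely for illustration (real embeddings have hundreds of dimensions), but the mechanics are the same: the query never has to match a stored word, it only has to land near one.

```python
import math

# Toy 2D semantic map -- coordinates are made up for illustration only
atlas = {
    "refund policy": (0.90, 0.78),
    "cancellation terms": (0.86, 0.74),
    "dog": (-0.70, 0.20),
    "quantum physics": (0.10, -0.90),
}

def nearest(query_xy, atlas):
    # Exact nearest neighbor by Euclidean distance on the 2D map
    return min(atlas, key=lambda word: math.dist(atlas[word], query_xy))

# Pretend "money back" embeds to (0.88, 0.77). That string appears nowhere
# in the atlas, yet its coordinates land next to "refund policy".
print(nearest((0.88, 0.77), atlas))  # refund policy
```

Swap the made-up coordinates for real embedding vectors and this is, conceptually, what every vector DB does on each query.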


Part 2: The Investigation (How Similarity Works)

Cosine Similarity

The most common way to measure “closeness” between two vectors:

  • similarity = 1.0 → Perfect match.
  • similarity ≈ 0.8 → Very similar (“dog” vs “canine”).
  • similarity ≈ 0.1 → Very different (“dog” vs “database”).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding  # Returns [0.23, -0.87, ...] (1536 dims)

def cosine_similarity(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embed("dog"), embed("puppy")))    # ~0.92
print(cosine_similarity(embed("dog"), embed("database"))) # ~0.15

ANN Search (Approximate Nearest Neighbor)

Brute-force search through 10 million vectors is too slow. Vector DBs use HNSW (Hierarchical Navigable Small World) to find nearest neighbors in milliseconds — trading a tiny bit of accuracy for massive speed gains.
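For intuition, here is the exact (brute-force) version that HNSW approximates: one dot product per stored vector, an O(n·d) scan. The corpus below is synthetic random data, just to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 fake 64-dim "embeddings", normalized so dot product = cosine similarity
corpus = rng.normal(size=(10_000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def knn(query: np.ndarray, corpus: np.ndarray, k: int = 5):
    # Exact search: score every vector, then take the top k.
    # This is the O(n * d) scan that HNSW avoids.
    scores = corpus @ (query / np.linalg.norm(query))
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# A slightly perturbed copy of vector #42 should find #42 as its nearest neighbor
query = corpus[42] + 0.01 * rng.normal(size=64)
ids, scores = knn(query, corpus)
print(ids[0])  # 42
```

At 10M vectors and 1536 dims this scan becomes billions of multiplications per query; HNSW's graph shortcuts visit only a tiny fraction of the vectors, at the cost of occasionally missing a true neighbor.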


Part 3: The Diagnosis (Choosing the Right Vector DB)

| Database | Best For | Key Feature |
|----------|----------|-------------|
| pgvector | Small–medium scale, existing Postgres users | Zero extra infra. SQL + vectors together. |
| Chroma | Local dev, prototyping | Easiest to start. In-memory mode. |
| Weaviate | Hybrid search (keyword + semantic) | Built-in BM25 + vector search. |
| Qdrant | High-performance, self-hosted | Fast, Rust-based, excellent filtering. |
| Pinecone | Managed, serverless, large scale | Zero ops. Expensive at scale. |
| Milvus | Billion-scale, open source | Most scalable open-source option. |

Decision Guide

Prototyping?              → Chroma (local, zero setup)
Already use Postgres?     → pgvector (no new infra)
Need hybrid search?       → Weaviate
Need fast + self-hosted?  → Qdrant
Need managed cloud?       → Pinecone
Need billion-scale OSS?   → Milvus

Part 4: The Resolution (Python Cookbook)

1. pgvector (Simplest Production Setup)

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

-- HNSW index for fast ANN search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector  # pip install pgvector

conn = psycopg2.connect("postgresql://...")
register_vector(conn)  # teaches psycopg2 to send numpy arrays as vector values

def index_document(content: str):
    embedding = embed(content)  # embed() from Part 2
    with conn.cursor() as cur:  # psycopg2 connections have no execute(); use a cursor
        cur.execute(
            "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
            (content, np.array(embedding))
        )
    conn.commit()

def search(query: str, top_k: int = 5) -> list[dict]:
    q_vec = np.array(embed(query))
    cur = conn.cursor()
    cur.execute("""
        SELECT content, 1 - (embedding <=> %s) AS score
        FROM documents
        ORDER BY embedding <=> %s  -- <=> is cosine distance
        LIMIT %s
    """, (q_vec, q_vec, top_k))
    return [{"content": r[0], "score": r[1]} for r in cur.fetchall()]

results = search("What is the refund policy?")

results = search("What is the refund policy?")

2. Qdrant (High Performance, Self-Hosted)

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # or url="http://localhost:6333"

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Index documents
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=embed(chunk), payload={"content": chunk})
        for i, chunk in enumerate(chunks)
    ]
)

# Search
results = client.search(
    collection_name="docs",
    query_vector=embed("refund policy"),
    limit=5
)
for r in results:
    print(f"Score: {r.score:.3f} | {r.payload['content'][:100]}")

3. Chroma (Local Prototyping)

import chromadb

client = chromadb.Client()  # In-memory, zero setup
collection = client.create_collection("docs")

collection.add(
    documents=chunks,
    ids=[f"id_{i}" for i in range(len(chunks))]
    # Chroma auto-embeds if you don't provide embeddings
)

results = collection.query(query_texts=["refund policy"], n_results=5)

Final Mental Model

Traditional DB   → Finds the exact word "dog". Misses "canine", "puppy", "hound".
Vector DB        → Finds everything near the concept of "dog". Semantics, not syntax.

Embedding        → GPS coordinate of a meaning on the semantic map.
Cosine Similarity→ Angle between two coordinate vectors. (0=opposite, 1=identical).
HNSW Index       → Fast navigation shortcut through the high-dimensional map.

pgvector  → Your Postgres DB grows wings. Start here.
Pinecone  → Someone else runs it. You pay more.
Qdrant    → The performance king for self-hosted.
Chroma    → Your local playground. Zero friction.

The AI Stack of 2026:

  • Embeddings → Turn your data into semantic coordinates.
  • Vector DB → Store and search those coordinates at scale.
  • LLM + RAG → Reason over the retrieved semantic results.

This is the complete foundation. Three posts. One unified architecture.

Made with ~~laziness~~ love 🦥
