📐 Middle School → Production · Full Visual Guide

Cosine Similarity,
HNSW Graphs & Transformer Embeddings
inside Elasticsearch & LLM Retrieval

Every concept explained from geometry basics — dot products, vector direction, transformer neural networks, multi-layer HNSW graphs — up to how real RAG pipelines ingest documents and answer your questions.

Section 1 — The Core Math
📐
Cosine Similarity — Measuring Direction, Not Length
cos(θ) = a·b ÷ (‖a‖·‖b‖)  |  O(d) time complexity

Imagine shining a flashlight. The direction you point it is what matters — not how bright it is. Cosine similarity measures the angle between two vectors. Vectors pointing the same way = similar meaning. Vectors pointing at right angles = totally different.

🧭 Magnitude-Blind A 3-sentence paragraph and a 30-sentence essay about the same topic will produce vectors pointing in nearly the same direction. Cosine ignores how long each vector is — only direction matters. This is perfect for comparing short vs. long text chunks.
⚡ O(d) Time Complexity Computing cosine costs exactly O(d) — one pass through all d dimensions. For a 768-d embedding: 768 multiplications + 768 additions + 2 square roots. Modern CPUs do this in microseconds with SIMD vector instructions.
The Formula — Fully Annotated
cos(θ)  =  a · b  ÷  ( ‖a‖  ·  ‖b‖ )
a · b (dot product)
Multiply matching dimensions then sum: a₁b₁ + a₂b₂ + … + aₙbₙ
‖a‖ (magnitude)
Length of vector a = √(a₁²+a₂²+…+aₙ²). Normalises for text length.
‖b‖ (magnitude)
Same for vector b. Dividing by both makes the result always −1 to +1.
θ (theta, the angle)
0° → cos=1 (identical). 90° → cos=0 (unrelated). 180° → cos=−1 (opposite).

📝 Worked Example — 2-D Vectors (Easy Numbers)

a = [3, 4]   ← embedding for "The cat sat on the mat"
b = [1, 2]   ← embedding for "A cat rested on a rug"

── Step 1: Dot product ──────────────────────────
a · b = (3×1) + (4×2) = 3 + 8 = 11

── Step 2: Magnitudes ───────────────────────────
‖a‖ = √(3²+4²) = √25 = 5.000
‖b‖ = √(1²+2²) = √5  ≈ 2.236

── Step 3: Cosine similarity ────────────────────
cos(θ) = 11 ÷ (5.000 × 2.236) = 11 ÷ 11.18 ≈ 0.984  ✓ very similar!
[Figure: vectors a = [3, 4] ("cat on mat") and b = [1, 2] ("cat on rug") plotted in the plane, with an angle of θ ≈ 10° between them. Result: cos(θ) ≈ 0.984 → very similar meaning; ChromaDB distance = 1 − 0.984 = 0.016, and smaller distance = closer match.]
🟠 ChromaDB — distance = 1 − cosine ChromaDB returns a distance score, not a similarity score. So distance = 1 − cos(θ). A distance of 0.0 means identical. A distance of 1.0 means completely unrelated. Smaller is always better — this is the opposite convention from Elasticsearch's similarity score where bigger is better.
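To verify the arithmetic, here is a tiny NumPy sketch that reproduces the worked example and converts the result into a ChromaDB-style distance. The 2-D toy vectors stand in for real 768-d embeddings.

── Sketch: Checking the Worked Example in Python ──
import numpy as np

a = np.array([3.0, 4.0])   # toy embedding for "The cat sat on the mat"
b = np.array([1.0, 2.0])   # toy embedding for "A cat rested on a rug"

dot = np.dot(a, b)                                       # 11.0
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # ≈ 0.984
distance = 1 - cosine                                    # ChromaDB-style distance

print(f"cosine similarity ≈ {cosine:.3f}")    # 0.984
print(f"ChromaDB distance ≈ {distance:.3f}")  # 0.016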
Cosine Similarity Scale — What the Numbers Mean: −1 = opposite · 0 = unrelated · +1 = identical. Our worked example, 0.984, sits almost at +1.

🎮 Interactive Cosine Playground

[Interactive widget: drag an angle slider to see the cosine change, e.g. cos(30°) ≈ 0.866: similar direction → similar meaning.]
Section 2 — Transformer Embeddings
🧠
Transformer Embeddings — Text → Numbers
768-d · nomic-embed · OpenAI ada-002 · "warranty" → [0.012, −0.034, 0.567, …]

A transformer embedding model reads text and outputs a fixed-length list of numbers — a vector. Think of it as a super-smart translator that converts meaning into a location in a high-dimensional map.

🔑 The Core Insight "warranty period" and "coverage duration" are completely different words, but they mean the same thing. A transformer embedding model has learned from billions of documents that these phrases appear in similar contexts — so it maps them to nearby locations in 768-dimensional space. Traditional keyword search would miss this. Cosine similarity catches it.
[Diagram: Transformer Embedding Pipeline. Raw text "warranty period" → tokenizer → token IDs [8943, 2201] → transformer (12–24 attention layers, e.g. nomic-embed / ada-002) → 768-dimensional vector [0.012, −0.034, 0.567, 0.891, −0.203, …] (768 floats total). The different text "coverage duration" goes through the same model and lands on a nearly identical vector [0.011, −0.031, 0.571, 0.884, −0.198, …]. Same model, different words → nearly identical vectors → cosine ≈ 0.97.]

📊 Popular Embedding Models Compared

Model | Dimensions | Provider | Best For | Context Window
nomic-embed-text (open) | 768 | Nomic AI | Long docs, local deploy | 8,192 tokens
text-embedding-ada-002 | 1,536 | OpenAI | General-purpose RAG | 8,191 tokens
text-embedding-3-small | 1,536 | OpenAI | Cost-efficient | 8,191 tokens
text-embedding-3-large | 3,072 | OpenAI | Highest accuracy | 8,191 tokens
all-MiniLM-L6-v2 (open) | 384 | Sentence-BERT | Fast, small footprint | 512 tokens
e5-mistral-7b (open) | 4,096 | Microsoft | State-of-the-art open model | 32,768 tokens
🔬 How Does the Transformer Produce These Numbers? The transformer reads every word in relation to every other word (called self-attention). After 12–24 layers of this, it has built a rich understanding of context. The final [CLS] token or mean-pooled output is taken as the sentence's embedding. The model was trained to make similar sentences produce similar vectors using contrastive learning.
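To make the pooling step concrete, here is a minimal sketch of mean pooling with the Hugging Face transformers library, using all-MiniLM-L6-v2 (384-d, from the table above) as the example encoder. In practice sentence-transformers does this pooling for you, as in the example that follows.

── Sketch: Mean-Pooling Token Vectors into a Sentence Embedding ──
import torch
from transformers import AutoModel, AutoTokenizer

# Example encoder; any BERT-style model works the same way
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

inputs = tokenizer("warranty period", return_tensors="pt")
with torch.no_grad():
    token_vectors = model(**inputs).last_hidden_state     # shape (1, seq_len, 384)

# Average the token vectors, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)              # shape (1, seq_len, 1)
sentence_embedding = (token_vectors * mask).sum(1) / mask.sum(1)   # shape (1, 384)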
── Python Example: Generating Embeddings ─────────────────────
from sentence_transformers import SentenceTransformer

# nomic-embed-text-v1 requires trust_remote_code=True to load
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

chunks = [
    "The warranty period covers 12 months from purchase",
    "Coverage duration is one year after the sale date",
    "Python is a programming language for data science",
]

embeddings = model.encode(chunks)   # shape: (3, 768)

# embeddings[0] ≈ embeddings[1]  → cosine ≈ 0.96
# embeddings[0] vs embeddings[2] → cosine ≈ 0.12 (unrelated)
Section 3 — HNSW Index
🗺️
HNSW — Hierarchical Navigable Small Worlds
Multi-layer graph · Sparse highways at top · Dense streets at bottom · O(log n) search

Searching cosine similarity across millions of vectors one by one would be O(n·d) — impossibly slow. HNSW solves this with a brilliant multi-layer graph structure inspired by how highway maps work. It finds your nearest neighbor in O(log n) time.

🛣️ Highways & Streets Analogy Imagine you're looking for a restaurant in a city. You don't check every street — you take the highway to the right district (sparse top layer), exit onto main roads (middle layers), then walk the local streets to the exact address (dense bottom layer). HNSW does exactly this with vectors.
📉 O(log n) vs O(n) For 10 million vectors: brute-force checks 10,000,000 nodes. HNSW checks about 23 (log₂ of 10M). That's a 430,000× speedup. This is why vector search is practical at scale.
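A quick back-of-the-envelope check of those numbers (illustrative only; a real HNSW search visits a small multiple of log₂ n nodes, tuned by ef):

── Sanity-Checking the O(log n) Claim ──
import math

n = 10_000_000
hops = math.log2(n)           # ≈ 23.25 greedy steps
print(round(hops, 2))         # 23.25
print(round(n / hops))        # ≈ 430,000× fewer comparisons than brute force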
[Diagram: HNSW Multi-Layer Graph Structure]
Layer 2 — Highway (sparse, long-range connections): few nodes, each connected to others far away; the entry point for all searches.
Layer 1 — Main Roads (medium density): more nodes, shorter connections; refines the search to the right neighbourhood.
Layer 0 — Local Streets (dense, short-range, every node): all vectors live here, with dense connectivity to nearby neighbours; the final exact comparison happens here.
The search enters at the top layer, greedily descends layer by layer, and finds the nearest neighbour in O(log n). 🚀

🔨 How HNSW is Built (Insertion Algorithm)

── Inserting a new vector into HNSW ────────────────────────
for each new vector v:
  1. Randomly assign it to layers 0 … L
     (exponential decay: most vectors land on layer 0 only)
  2. for layer = top → assigned_layer:
       Find the nearest neighbour (greedy search, just like query time)
  3. for layer = assigned_layer → 0:
       Add bidirectional edges to the M nearest neighbours
       If any node exceeds M_max connections → prune its weakest edges

Parameters you tune:
  M              = max connections per node (default 16)   ↑ = more accurate, more RAM
  efConstruction = search width at build time              ↑ = slower build, better index
  ef             = search width at query time              ↑ = more accurate, slower query
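The same parameters appear directly in hnswlib (the library ChromaDB uses, as noted below). Here is a minimal sketch with random stand-in vectors; the dimension and parameter values simply mirror the defaults discussed above.

── Sketch: Building and Querying an HNSW Index with hnswlib ──
import hnswlib
import numpy as np

dim, n = 768, 10_000
vectors = np.random.rand(n, dim).astype(np.float32)     # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)          # distance = 1 − cosine
index.init_index(max_elements=n, M=16, ef_construction=100)
index.add_items(vectors, ids=np.arange(n))

index.set_ef(50)                                        # ef: query-time search width
labels, distances = index.knn_query(vectors[:1], k=3)   # 3 nearest neighbours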
Method | Search Speed | Accuracy | Memory | Used By
Brute Force | O(n·d) 🐢 | 100% exact | Just vectors | Tiny datasets (<10k)
IVF (Flat) | O(√n·d) | ~95% | Centroids + vectors | Faiss, classic Pinecone
HNSW | O(log n) 🚀 | 95–99% | ~1.5× vectors | Elasticsearch, ChromaDB, Weaviate, Qdrant
Product Quantization | O(log n) | ~90% | ~0.25× vectors | When RAM is scarce
✅ HNSW in Elasticsearch Elasticsearch uses HNSW natively for its dense_vector field type. When you set "index": true, ES builds an HNSW index automatically. You control M and ef_construction in the mapping. At query time, num_candidates sets the ef parameter.
✅ HNSW in ChromaDB ChromaDB uses the hnswlib C++ library under the hood. When you call collection.add(), it inserts into HNSW. When you call collection.query(), it traverses the graph and returns distance = 1 − cosine.
Section 4 — Classic Text Search (BM25)
📖
Inverted Index + BM25 — Elasticsearch's Classic Engine
term frequency · inverse document frequency · field boosting

Before vector search existed, Elasticsearch was already the world's fastest full-text search engine. It uses two powerful ideas together: the inverted index (a backwards book index) and BM25 scoring (a smart relevance formula).

Building the Inverted Index
Doc 1: "The quick brown fox" · Doc 2: "Fox ran through forest" · Doc 3: "Brown bears eat fish"
After analysis, each token points to a posting list of doc IDs and positions:
  fox   → Doc1:p4, Doc2:p1
  brown → Doc1:p3, Doc3:p1
  quick → Doc1:p2
  fish  → Doc3:p4
(p = position in document; positions are what make phrase queries possible)
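A toy version of that build step as a Python sketch, using lower-casing as the only analysis step (real analyzers also strip punctuation, remove stop words, and apply stemming):

── Sketch: Building a Toy Inverted Index ──
from collections import defaultdict

docs = {
    1: "The quick brown fox",
    2: "Fox ran through forest",
    3: "Brown bears eat fish",
}

# token → list of (doc_id, position) postings
inverted_index = defaultdict(list)
for doc_id, text in docs.items():
    for position, token in enumerate(text.lower().split(), start=1):
        inverted_index[token].append((doc_id, position))

print(inverted_index["fox"])    # [(1, 4), (2, 1)]
print(inverted_index["brown"])  # [(1, 3), (3, 1)]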

📊 BM25 Scoring Formula

── BM25 — What Elasticsearch Actually Computes ─────────────
BM25(doc, term) = IDF(term) × [ TF × (k₁+1) ] / [ TF + k₁ × (1 − b + b × |doc|/avgLen) ]

IDF(term) = log( (N − df + 0.5) / (df + 0.5) + 1 )

TF     = term frequency in this document
|doc|  = length of this document (in tokens)
avgLen = average document length across all docs
N      = total documents in the index
df     = documents containing this term
k₁     = 1.2  (saturation — how much TF matters)
b      = 0.75 (length normalisation)

Key insight: a word appearing 100× vs 10× isn't 10× more relevant.
BM25 flattens this — the score saturates at high TF values.
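To see the formula in action, here is a small self-contained sketch that scores the three toy documents above for the term "fox". It is illustrative only; Lucene's production implementation differs in small details such as how document lengths are encoded.

── Sketch: BM25 Scoring in Plain Python ──
import math

docs = [
    "The quick brown fox".lower().split(),
    "Fox ran through forest".lower().split(),
    "Brown bears eat fish".lower().split(),
]
k1, b = 1.2, 0.75
N = len(docs)
avg_len = sum(len(d) for d in docs) / N

def bm25(doc, term):
    tf = doc.count(term)                                  # term frequency in this doc
    df = sum(1 for d in docs if term in d)                # docs containing the term
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avg_len))

for i, doc in enumerate(docs, start=1):
    print(f"Doc {i}: BM25('fox') = {bm25(doc, 'fox'):.3f}")
# Docs 1 and 2 score > 0; Doc 3 scores 0 (it never mentions 'fox')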
Section 5 — ChromaDB Vector Store
🟠
ChromaDB — Purpose-Built Vector Database
distance = 1 − cosine · HNSW under the hood · Python-native

ChromaDB is a vector-first database — its only job is storing embeddings and finding similar ones fast. It's the most popular choice for local RAG prototypes and Python-native LLM apps.

🔑 ChromaDB Distance Convention ChromaDB uses distance = 1 − cosine_similarity. So when you get back results with distances=[0.02, 0.15, 0.41], the smallest number is the most similar. A distance of 0 means identical. Distance of 1 means totally unrelated. This is the opposite of Elasticsearch where bigger score = better.
🏗️ What's Inside ChromaDB Each collection has: an HNSW index (hnswlib), a SQLite metadata store, and a flat file for raw embeddings. You can swap in different embedding functions (OpenAI, Cohere, local models). Persistence is optional — great for testing in-memory.
── ChromaDB Full Example: Ingest + Query ───────────────────
import chromadb
from chromadb.utils import embedding_functions

# 1. Create client + collection
client = chromadb.PersistentClient(path="./chroma_db")
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="nomic-ai/nomic-embed-text-v1"
)
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=embed_fn,
    metadata={"hnsw:space": "cosine"}   # ← use cosine distance
)

# 2. Add documents (embeddings generated automatically)
collection.add(
    documents=[
        "The warranty period covers 12 months from purchase date",
        "Coverage duration is one year after the sale",
        "Python is great for machine learning projects",
    ],
    ids=["doc1", "doc2", "doc3"]
)

# 3. Query — returns distance = 1 − cosine (smaller = better)
results = collection.query(
    query_texts=["how long is the warranty?"],
    n_results=2
)
# results["distances"][0] → [0.03, 0.04]  ← doc1 and doc2 are very close!
# results["documents"][0] → ["The warranty period...", "Coverage duration..."]
[Diagram: ChromaDB distance = 1 − cosine(query, chunk). Query vector "warranty length?" vs. Doc 1 "warranty period..." → distance = 0.03 ✓, and Doc 2 "coverage duration..." → distance = 0.04 ✓. Lower distance = higher cosine similarity = more relevant chunk for LLM context; both results sit near 0 on the 0-to-1 distance scale.]
Section 6 — Elasticsearch kNN + Hybrid Search
Elasticsearch kNN — Vector Search at Production Scale
dense_vector · HNSW built-in · Hybrid BM25 + cosine in one query

Elasticsearch 8.0+ added native kNN vector search. It uses HNSW internally and — uniquely — lets you combine BM25 keyword scoring with cosine similarity in a single query. This is called hybrid search and it outperforms either method alone.

── Step 1: Create Index with dense_vector field ────────────
PUT /my_docs
{
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 768,                  // must match your model
        "index": true,                // build HNSW index
        "similarity": "cosine",       // cosine similarity metric
        "index_options": {
          "type": "hnsw",
          "m": 16,                    // HNSW M parameter
          "ef_construction": 100      // build accuracy vs speed
        }
      }
    }
  }
}

── Step 2: Index a document ────────────────────────────────
POST /my_docs/_doc
{
  "text": "The warranty period covers 12 months from purchase",
  "embedding": [0.012, -0.034, 0.567, ...]   // 768 floats
}

── Step 3: Hybrid search — BM25 + kNN cosine ───────────────
POST /my_docs/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.008, -0.031, 0.571, ...],   // query embedding
    "k": 10,                    // return 10 nearest
    "num_candidates": 100       // HNSW ef parameter
  },
  "query": {
    "match": { "text": "warranty period" }         // BM25 boost
  },
  "rank": { "rrf": {} }         // Reciprocal Rank Fusion = merge both rankings
}
🏆 Why Hybrid Search Wins BM25 is great at exact keyword matches ("warranty" → finds "warranty"). Cosine is great at semantic matches ("coverage duration" → finds "warranty period"). Combining both with Reciprocal Rank Fusion (RRF) gets you the best of both worlds — exact terms score high AND semantically similar results score high. Benchmark studies show RRF hybrid outperforms either alone by 10–30% on recall metrics.
[Diagram: Hybrid Search = BM25 + kNN → RRF Fusion. The user query fans out to two rankers: BM25 (inverted-index keyword match, scores like [2.4, 1.8, 1.2 …]) and kNN cosine (HNSW vector similarity, scores like [0.98, 0.95, 0.87 …]); RRF merges the two ranked lists into the final results. 🏆]
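Elasticsearch performs the fusion for you, but the RRF idea itself fits in a few lines. Here is a sketch using the standard formula score(d) = Σ 1/(k + rank of d) with the commonly used constant k = 60; the document IDs below are made up for illustration.

── Sketch: Reciprocal Rank Fusion in Plain Python ──
from collections import defaultdict

def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc7", "doc2", "doc9"]   # keyword matches
knn_ranking  = ["doc2", "doc5", "doc7"]   # cosine matches

print(rrf([bm25_ranking, knn_ranking]))   # doc2 and doc7 rise to the top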
Section 7 — Full RAG Ingestion & Query Pipeline
🤖
RAG — Retrieval Augmented Generation
How ChatGPT-style apps use your documents to answer questions

RAG is how modern LLM apps (like enterprise chatbots, document Q&A) work. Instead of baking all knowledge into the model, you retrieve relevant chunks at query time and inject them into the LLM's prompt. Two phases: Ingestion and Retrieval.

📥 Phase 1 — Ingestion (Done Once / On New Data)

[Diagram: Ingestion Pipeline]
1. Raw Docs: PDF, HTML, Markdown, database rows
2. Chunking: split into ~500-token overlapping segments
3. Embedding: nomic-embed (768-d) or text-embedding-ada-002 (1,536-d) turns each chunk into a float vector
4. Store: ChromaDB or Elasticsearch HNSW index
5. Metadata: source file, page number, section title, date created
Each chunk gets its own vector → stored as a row in ChromaDB / a document in Elasticsearch (see the sketch below).
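Phase 1 as a minimal sketch: naive word-window chunking plus chromadb storage. The chunk size, overlap, file name, and metadata fields are illustrative, and real pipelines normally use token-aware splitters.

── Sketch: Minimal Ingestion into ChromaDB ──
import chromadb

def chunk(text, size=120, overlap=20):
    """Naive word-window chunking (real pipelines split by tokens)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("documents")   # uses the default embedding function

raw_docs = {"warranty.md": "The warranty period covers 12 months from the purchase date. ..."}

for source, text in raw_docs.items():
    pieces = chunk(text)
    collection.add(
        documents=pieces,
        metadatas=[{"source": source, "chunk": i} for i in range(len(pieces))],
        ids=[f"{source}-{i}" for i in range(len(pieces))],
    )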

🔎 Phase 2 — Retrieval (Done on Every User Query)

[Diagram: Retrieval Pipeline]
1. 👤 User question
2. Embed the query: same embedding model → 768-d vector
3. HNSW search: ChromaDB / Elasticsearch kNN, O(log n)
4. Top-k chunks: the k=3 most similar passages come back with their distances
5. Build the prompt: system message + retrieved chunks + user question → LLM context
6. LLM (GPT-4 / Claude / Llama) reads the context chunks plus the question and generates a grounded answer
✅ Grounded answer: the LLM answers only from the retrieved chunks, which greatly reduces hallucination.
── Full RAG Query with LangChain + ChromaDB + OpenAI ───────
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Load existing ChromaDB vector store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings()
)

# 2. Create retriever — returns top 3 chunks by cosine similarity
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)

# 3. Build QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4", temperature=0),
    retriever=retriever,
    return_source_documents=True
)

# 4. Ask question → retrieval → grounded answer
result = qa_chain({"query": "How long is the warranty period?"})
# LLM reads chunk: "The warranty period covers 12 months..."
# Answers: "The warranty period is 12 months from the purchase date."
🎯 Why RAG Beats Fine-Tuning for Most Use Cases Fine-tuning bakes knowledge into model weights — expensive, slow, hard to update. RAG externalises knowledge into a retrieval store — cheap, instant updates, fully auditable (you can see exactly which chunks were used). The LLM becomes a reasoning engine; the vector DB becomes its long-term memory.
Section 8 — Alternatives & Decision Guide
⚖️
Which Tool Should You Use?
Vector DBs, hybrid engines, full-text search — decision guide
Tool | Type | Search Method | Best For | Weakness
Elasticsearch 8+ | hybrid | BM25 + HNSW cosine | Production RAG + keyword + analytics | Complex ops, RAM heavy
ChromaDB | vector | HNSW cosine/L2/IP | Local prototyping, Python RAG | Not production-grade at huge scale
Pinecone | vector | HNSW / ANN | Managed cloud vector search | Cost, vendor lock-in
Weaviate | vector | HNSW + BM25 | GraphQL + vector, multi-modal | Steeper learning curve
Qdrant | vector | HNSW cosine | High performance, Rust-based | Smaller ecosystem
pgvector | SQL | Exact cosine / IVFFlat | Already on PostgreSQL | Slower than dedicated DBs at scale
Redis (VSS) | cache | HNSW / FLAT | Low-latency, in-memory | Data size limited by RAM
Milvus | vector | HNSW / IVF / ANNOY | Billion-scale vectors | Heavy infrastructure
OpenSearch | hybrid | BM25 + k-NN | AWS-native, ES alternative | Slightly behind ES features
[Decision tree: New RAG project? Prototype → ChromaDB 🟠. Production and you also need keyword search → Elasticsearch ⚡ (hybrid). Production, pure vector → Pinecone / Qdrant. Already on Postgres → pgvector 🐘. Billion+ vectors on a tight RAM budget → Milvus / Weaviate.]
📚
References & Further Reading
Papers, docs, courses, and free tools