📐 Middle School → Production · Full Visual Guide

Cosine Similarity,
HNSW Graphs & Transformer Embeddings
inside Elasticsearch & LLM Retrieval

Every concept explained from geometry basics — dot products, vector direction, transformer neural networks, multi-layer HNSW graphs — up to how real RAG pipelines ingest documents and answer your questions.

Section 1 — The Core Math
📐
Cosine Similarity — Measuring Direction, Not Length
cos(θ) = a·b ÷ (‖a‖·‖b‖)  |  O(d) time complexity

Imagine shining a flashlight. The direction you point it is what matters — not how bright it is. Cosine similarity measures the angle between two vectors. Vectors pointing the same way = similar meaning. Vectors pointing at right angles = totally different.

🧭 Magnitude-Blind A 3-sentence paragraph and a 30-sentence essay about the same topic will produce vectors pointing in nearly the same direction. Cosine ignores how long each vector is — only direction matters. This is perfect for comparing short vs. long text chunks.
⚡ O(d) Time Complexity Computing cosine costs exactly O(d) — one pass through all d dimensions. For a 768-d embedding: 768 multiplications + 768 additions + 2 square roots. Modern CPUs do this in microseconds with SIMD vector instructions.
The Formula — Fully Annotated
cos(θ)  =  a · b  ÷  ( ‖a‖  ·  ‖b‖ )
a · b (dot product)
Multiply matching dimensions then sum: a₁b₁ + a₂b₂ + … + aₙbₙ
‖a‖ (magnitude)
Length of vector a = √(a₁²+a₂²+…+aₙ²). Normalises for text length.
‖b‖ (magnitude)
Same for vector b. Dividing by both makes the result always −1 to +1.
θ (theta, the angle)
0° → cos=1 (identical). 90° → cos=0 (unrelated). 180° → cos=−1 (opposite).

📝 Worked Example — 2-D Vectors (Easy Numbers)

a = [3, 4]   ← embedding for "The cat sat on the mat"
b = [1, 2]   ← embedding for "A cat rested on a rug"

── Step 1: Dot product ──────────────────────────
a · b = (3×1) + (4×2) = 3 + 8 = 11

── Step 2: Magnitudes ───────────────────────────
‖a‖ = √(3²+4²) = √25 = 5.000
‖b‖ = √(1²+2²) = √5  ≈ 2.236

── Step 3: Cosine similarity ────────────────────
cos(θ) = 11 ÷ (5.000 × 2.236) = 11 ÷ 11.18 ≈ 0.984  ✓ very similar!
[Figure: vectors a = [3, 4] ("cat on mat") and b = [1, 2] ("cat on rug") plotted in the plane, with an angle of θ ≈ 10° between them. Result: cos(θ) ≈ 0.984 → very similar meaning; ChromaDB distance = 1 − 0.984 = 0.016, and smaller distance = closer match.]
🟠 ChromaDB — distance = 1 − cosine ChromaDB returns a distance score, not a similarity score. So distance = 1 − cos(θ). A distance of 0.0 means identical. A distance of 1.0 means completely unrelated. Smaller is always better — this is the opposite convention from Elasticsearch's similarity score where bigger is better.
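To verify the arithmetic, here is a tiny NumPy sketch that reproduces the worked example and converts the result into a ChromaDB-style distance. The 2-D toy vectors stand in for real 768-d embeddings.

── Sketch: Checking the Worked Example in Python ──
import numpy as np

a = np.array([3.0, 4.0])   # toy embedding for "The cat sat on the mat"
b = np.array([1.0, 2.0])   # toy embedding for "A cat rested on a rug"

dot = np.dot(a, b)                                       # 11.0
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # ≈ 0.984
distance = 1 - cosine                                    # ChromaDB-style distance

print(f"cosine similarity ≈ {cosine:.3f}")    # 0.984
print(f"ChromaDB distance ≈ {distance:.3f}")  # 0.016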
Cosine Similarity Scale — What the Numbers Mean: −1 = opposite · 0 = unrelated · +1 = identical. Our worked example, 0.984, sits almost at +1.

🎮 Interactive Cosine Playground

[Interactive widget: drag an angle slider to see the cosine change, e.g. cos(30°) ≈ 0.866: similar direction → similar meaning.]
Section 2 — Transformer Embeddings
🧠
Transformer Embeddings — Text → Numbers
768-d · nomic-embed · OpenAI ada-002 · "warranty" → [0.012, −0.034, 0.567, …]

A transformer embedding model reads text and outputs a fixed-length list of numbers — a vector. Think of it as a super-smart translator that converts meaning into a location in a high-dimensional map.

🔑 The Core Insight "warranty period" and "coverage duration" are completely different words, but they mean the same thing. A transformer embedding model has learned from billions of documents that these phrases appear in similar contexts — so it maps them to nearby locations in 768-dimensional space. Traditional keyword search would miss this. Cosine similarity catches it.
[Diagram: Transformer Embedding Pipeline. Raw text "warranty period" → tokenizer → token IDs [8943, 2201] → transformer (12–24 attention layers, e.g. nomic-embed / ada-002) → 768-dimensional vector [0.012, −0.034, 0.567, 0.891, −0.203, …] (768 floats total). The different text "coverage duration" goes through the same model and lands on a nearly identical vector [0.011, −0.031, 0.571, 0.884, −0.198, …]. Same model, different words → nearly identical vectors → cosine ≈ 0.97.]

📊 Popular Embedding Models Compared

Model | Dimensions | Provider | Best For | Context Window
nomic-embed-text (open) | 768 | Nomic AI | Long docs, local deploy | 8,192 tokens
text-embedding-ada-002 | 1,536 | OpenAI | General-purpose RAG | 8,191 tokens
text-embedding-3-small | 1,536 | OpenAI | Cost-efficient | 8,191 tokens
text-embedding-3-large | 3,072 | OpenAI | Highest accuracy | 8,191 tokens
all-MiniLM-L6-v2 (open) | 384 | Sentence-BERT | Fast, small footprint | 512 tokens
e5-mistral-7b (open) | 4,096 | Microsoft | State-of-the-art open model | 32,768 tokens
🔬 How Does the Transformer Produce These Numbers? The transformer reads every word in relation to every other word (called self-attention). After 12–24 layers of this, it has built a rich understanding of context. The final [CLS] token or mean-pooled output is taken as the sentence's embedding. The model was trained to make similar sentences produce similar vectors using contrastive learning.
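To make the pooling step concrete, here is a minimal sketch of mean pooling with the Hugging Face transformers library, using all-MiniLM-L6-v2 (384-d, from the table above) as the example encoder. In practice sentence-transformers does this pooling for you, as in the example that follows.

── Sketch: Mean-Pooling Token Vectors into a Sentence Embedding ──
import torch
from transformers import AutoModel, AutoTokenizer

# Example encoder; any BERT-style model works the same way
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

inputs = tokenizer("warranty period", return_tensors="pt")
with torch.no_grad():
    token_vectors = model(**inputs).last_hidden_state     # shape (1, seq_len, 384)

# Average the token vectors, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)              # shape (1, seq_len, 1)
sentence_embedding = (token_vectors * mask).sum(1) / mask.sum(1)   # shape (1, 384)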
── Python Example: Generating Embeddings ─────────────────────
from sentence_transformers import SentenceTransformer

# nomic-embed-text-v1 requires trust_remote_code=True to load
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

chunks = [
    "The warranty period covers 12 months from purchase",
    "Coverage duration is one year after the sale date",
    "Python is a programming language for data science",
]

embeddings = model.encode(chunks)   # shape: (3, 768)

# embeddings[0] ≈ embeddings[1]  → cosine ≈ 0.96
# embeddings[0] vs embeddings[2] → cosine ≈ 0.12 (unrelated)
Section 3 — HNSW Index
🗺️
HNSW — Hierarchical Navigable Small Worlds
Multi-layer graph · Sparse highways at top · Dense streets at bottom · O(log n) search

Searching cosine similarity across millions of vectors one by one would be O(n·d) — impossibly slow. HNSW solves this with a brilliant multi-layer graph structure inspired by how highway maps work. It finds your nearest neighbor in O(log n) time.

🛣️ Highways & Streets Analogy Imagine you're looking for a restaurant in a city. You don't check every street — you take the highway to the right district (sparse top layer), exit onto main roads (middle layers), then walk the local streets to the exact address (dense bottom layer). HNSW does exactly this with vectors.
📉 O(log n) vs O(n) For 10 million vectors: brute-force checks 10,000,000 nodes. HNSW checks about 23 (log₂ of 10M). That's a 430,000× speedup. This is why vector search is practical at scale.
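A quick back-of-the-envelope check of those numbers (illustrative only; a real HNSW search visits a small multiple of log₂ n nodes, tuned by ef):

── Sanity-Checking the O(log n) Claim ──
import math

n = 10_000_000
hops = math.log2(n)           # ≈ 23.25 greedy steps
print(round(hops, 2))         # 23.25
print(round(n / hops))        # ≈ 430,000× fewer comparisons than brute force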
[Diagram: HNSW Multi-Layer Graph Structure]
Layer 2 — Highway (sparse, long-range connections): few nodes, each connected to others far away; the entry point for all searches.
Layer 1 — Main Roads (medium density): more nodes, shorter connections; refines the search to the right neighbourhood.
Layer 0 — Local Streets (dense, short-range, every node): all vectors live here, with dense connectivity to nearby neighbours; the final exact comparison happens here.
The search enters at the top layer, greedily descends layer by layer, and finds the nearest neighbour in O(log n). 🚀

🔨 How HNSW is Built (Insertion Algorithm)

── Inserting a new vector into HNSW ────────────────────────
for each new vector v:
  1. Randomly assign it to layers 0 … L
     (exponential decay: most vectors land on layer 0 only)
  2. for layer = top → assigned_layer:
       Find the nearest neighbour (greedy search, just like query time)
  3. for layer = assigned_layer → 0:
       Add bidirectional edges to the M nearest neighbours
       If any node exceeds M_max connections → prune its weakest edges

Parameters you tune:
  M              = max connections per node (default 16)   ↑ = more accurate, more RAM
  efConstruction = search width at build time              ↑ = slower build, better index
  ef             = search width at query time              ↑ = more accurate, slower query
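The same parameters appear directly in hnswlib (the library ChromaDB uses, as noted below). Here is a minimal sketch with random stand-in vectors; the dimension and parameter values simply mirror the defaults discussed above.

── Sketch: Building and Querying an HNSW Index with hnswlib ──
import hnswlib
import numpy as np

dim, n = 768, 10_000
vectors = np.random.rand(n, dim).astype(np.float32)     # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)          # distance = 1 − cosine
index.init_index(max_elements=n, M=16, ef_construction=100)
index.add_items(vectors, ids=np.arange(n))

index.set_ef(50)                                        # ef: query-time search width
labels, distances = index.knn_query(vectors[:1], k=3)   # 3 nearest neighbours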
Method | Search Speed | Accuracy | Memory | Used By
Brute Force | O(n·d) 🐢 | 100% exact | Just vectors | Tiny datasets (<10k)
IVF (Flat) | O(√n·d) | ~95% | Centroids + vectors | Faiss, classic Pinecone
HNSW | O(log n) 🚀 | 95–99% | ~1.5× vectors | Elasticsearch, ChromaDB, Weaviate, Qdrant
Product Quantization | O(log n) | ~90% | ~0.25× vectors | When RAM is scarce
✅ HNSW in Elasticsearch Elasticsearch uses HNSW natively for its dense_vector field type. When you set "index": true, ES builds an HNSW index automatically. You control M and ef_construction in the mapping. At query time, num_candidates sets the ef parameter.
✅ HNSW in ChromaDB ChromaDB uses the hnswlib C++ library under the hood. When you call collection.add(), it inserts into HNSW. When you call collection.query(), it traverses the graph and returns distance = 1 − cosine.
Section 4 — Classic Text Search (BM25)
📖
Inverted Index + BM25 — Elasticsearch's Classic Engine
term frequency · inverse document frequency · field boosting

Before vector search existed, Elasticsearch was already the world's fastest full-text search engine. It uses two powerful ideas together: the inverted index (a backwards book index) and BM25 scoring (a smart relevance formula).

Building the Inverted Index
Doc 1: "The quick brown fox" · Doc 2: "Fox ran through forest" · Doc 3: "Brown bears eat fish"
After analysis, each token points to a posting list of doc IDs and positions:
  fox   → Doc1:p4, Doc2:p1
  brown → Doc1:p3, Doc3:p1
  quick → Doc1:p2
  fish  → Doc3:p4
(p = position in document; positions are what make phrase queries possible)
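A toy version of that build step as a Python sketch, using lower-casing as the only analysis step (real analyzers also strip punctuation, remove stop words, and apply stemming):

── Sketch: Building a Toy Inverted Index ──
from collections import defaultdict

docs = {
    1: "The quick brown fox",
    2: "Fox ran through forest",
    3: "Brown bears eat fish",
}

# token → list of (doc_id, position) postings
inverted_index = defaultdict(list)
for doc_id, text in docs.items():
    for position, token in enumerate(text.lower().split(), start=1):
        inverted_index[token].append((doc_id, position))

print(inverted_index["fox"])    # [(1, 4), (2, 1)]
print(inverted_index["brown"])  # [(1, 3), (3, 1)]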

📊 BM25 Scoring Formula

── BM25 — What Elasticsearch Actually Computes ─────────────
BM25(doc, term) = IDF(term) × [ TF × (k₁+1) ] / [ TF + k₁ × (1 − b + b × |doc|/avgLen) ]

IDF(term) = log( (N − df + 0.5) / (df + 0.5) + 1 )

TF     = term frequency in this document
|doc|  = length of this document (in tokens)
avgLen = average document length across all docs
N      = total documents in the index
df     = documents containing this term
k₁     = 1.2  (saturation — how much TF matters)
b      = 0.75 (length normalisation)

Key insight: a word appearing 100× vs 10× isn't 10× more relevant.
BM25 flattens this — the score saturates at high TF values.
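To see the formula in action, here is a small self-contained sketch that scores the three toy documents above for the term "fox". It is illustrative only; Lucene's production implementation differs in small details such as how document lengths are encoded.

── Sketch: BM25 Scoring in Plain Python ──
import math

docs = [
    "The quick brown fox".lower().split(),
    "Fox ran through forest".lower().split(),
    "Brown bears eat fish".lower().split(),
]
k1, b = 1.2, 0.75
N = len(docs)
avg_len = sum(len(d) for d in docs) / N

def bm25(doc, term):
    tf = doc.count(term)                                  # term frequency in this doc
    df = sum(1 for d in docs if term in d)                # docs containing the term
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avg_len))

for i, doc in enumerate(docs, start=1):
    print(f"Doc {i}: BM25('fox') = {bm25(doc, 'fox'):.3f}")
# Docs 1 and 2 score > 0; Doc 3 scores 0 (it never mentions 'fox')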
Section 5 — ChromaDB Vector Store
🟠
ChromaDB — Purpose-Built Vector Database
distance = 1 − cosine · HNSW under the hood · Python-native

ChromaDB is a vector-first database — its only job is storing embeddings and finding similar ones fast. It's the most popular choice for local RAG prototypes and Python-native LLM apps.

🔑 ChromaDB Distance Convention ChromaDB uses distance = 1 − cosine_similarity. So when you get back results with distances=[0.02, 0.15, 0.41], the smallest number is the most similar. A distance of 0 means identical. Distance of 1 means totally unrelated. This is the opposite of Elasticsearch where bigger score = better.
🏗️ What's Inside ChromaDB Each collection has: an HNSW index (hnswlib), a SQLite metadata store, and a flat file for raw embeddings. You can swap in different embedding functions (OpenAI, Cohere, local models). Persistence is optional — great for testing in-memory.
── ChromaDB Full Example: Ingest + Query ───────────────────
import chromadb
from chromadb.utils import embedding_functions

# 1. Create client + collection
client = chromadb.PersistentClient(path="./chroma_db")
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="nomic-ai/nomic-embed-text-v1"
)
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=embed_fn,
    metadata={"hnsw:space": "cosine"}   # ← use cosine distance
)

# 2. Add documents (embeddings generated automatically)
collection.add(
    documents=[
        "The warranty period covers 12 months from purchase date",
        "Coverage duration is one year after the sale",
        "Python is great for machine learning projects",
    ],
    ids=["doc1", "doc2", "doc3"]
)

# 3. Query — returns distance = 1 − cosine (smaller = better)
results = collection.query(
    query_texts=["how long is the warranty?"],
    n_results=2
)
# results["distances"][0] → [0.03, 0.04]  ← doc1 and doc2 are very close!
# results["documents"][0] → ["The warranty period...", "Coverage duration..."]
[Diagram: ChromaDB distance = 1 − cosine(query, chunk). Query vector "warranty length?" vs. Doc 1 "warranty period..." → distance = 0.03 ✓, and Doc 2 "coverage duration..." → distance = 0.04 ✓. Lower distance = higher cosine similarity = more relevant chunk for LLM context; both results sit near 0 on the 0-to-1 distance scale.]
Section 6 — Elasticsearch kNN + Hybrid Search
Elasticsearch kNN — Vector Search at Production Scale
dense_vector · HNSW built-in · Hybrid BM25 + cosine in one query

Elasticsearch 8.0+ added native kNN vector search. It uses HNSW internally and — uniquely — lets you combine BM25 keyword scoring with cosine similarity in a single query. This is called hybrid search and it outperforms either method alone.

── Step 1: Create Index with dense_vector field ────────────
PUT /my_docs
{
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 768,                  // must match your model
        "index": true,                // build HNSW index
        "similarity": "cosine",       // cosine similarity metric
        "index_options": {
          "type": "hnsw",
          "m": 16,                    // HNSW M parameter
          "ef_construction": 100      // build accuracy vs speed
        }
      }
    }
  }
}

── Step 2: Index a document ────────────────────────────────
POST /my_docs/_doc
{
  "text": "The warranty period covers 12 months from purchase",
  "embedding": [0.012, -0.034, 0.567, ...]   // 768 floats
}

── Step 3: Hybrid search — BM25 + kNN cosine ───────────────
POST /my_docs/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.008, -0.031, 0.571, ...],   // query embedding
    "k": 10,                    // return 10 nearest
    "num_candidates": 100       // HNSW ef parameter
  },
  "query": {
    "match": { "text": "warranty period" }         // BM25 boost
  },
  "rank": { "rrf": {} }         // Reciprocal Rank Fusion = merge both rankings
}
🏆 Why Hybrid Search Wins BM25 is great at exact keyword matches ("warranty" → finds "warranty"). Cosine is great at semantic matches ("coverage duration" → finds "warranty period"). Combining both with Reciprocal Rank Fusion (RRF) gets you the best of both worlds — exact terms score high AND semantically similar results score high. Benchmark studies show RRF hybrid outperforms either alone by 10–30% on recall metrics.
[Diagram: Hybrid Search = BM25 + kNN → RRF Fusion. The user query fans out to two rankers: BM25 (inverted-index keyword match, scores like [2.4, 1.8, 1.2 …]) and kNN cosine (HNSW vector similarity, scores like [0.98, 0.95, 0.87 …]); RRF merges the two ranked lists into the final results. 🏆]
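Elasticsearch performs the fusion for you, but the RRF idea itself fits in a few lines. Here is a sketch using the standard formula score(d) = Σ 1/(k + rank of d) with the commonly used constant k = 60; the document IDs below are made up for illustration.

── Sketch: Reciprocal Rank Fusion in Plain Python ──
from collections import defaultdict

def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc7", "doc2", "doc9"]   # keyword matches
knn_ranking  = ["doc2", "doc5", "doc7"]   # cosine matches

print(rrf([bm25_ranking, knn_ranking]))   # doc2 and doc7 rise to the top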
Section 7 — Full RAG Ingestion & Query Pipeline
🤖
RAG — Retrieval Augmented Generation
How ChatGPT-style apps use your documents to answer questions

RAG is how modern LLM apps (like enterprise chatbots, document Q&A) work. Instead of baking all knowledge into the model, you retrieve relevant chunks at query time and inject them into the LLM's prompt. Two phases: Ingestion and Retrieval.

📥 Phase 1 — Ingestion (Done Once / On New Data)

[Diagram: Ingestion Pipeline]
1. Raw Docs: PDF, HTML, Markdown, database rows
2. Chunking: split into ~500-token overlapping segments
3. Embedding: nomic-embed (768-d) or text-embedding-ada-002 (1,536-d) turns each chunk into a float vector
4. Store: ChromaDB or Elasticsearch HNSW index
5. Metadata: source file, page number, section title, date created
Each chunk gets its own vector → stored as a row in ChromaDB / a document in Elasticsearch (see the sketch below).
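Phase 1 as a minimal sketch: naive word-window chunking plus chromadb storage. The chunk size, overlap, file name, and metadata fields are illustrative, and real pipelines normally use token-aware splitters.

── Sketch: Minimal Ingestion into ChromaDB ──
import chromadb

def chunk(text, size=120, overlap=20):
    """Naive word-window chunking (real pipelines split by tokens)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("documents")   # uses the default embedding function

raw_docs = {"warranty.md": "The warranty period covers 12 months from the purchase date. ..."}

for source, text in raw_docs.items():
    pieces = chunk(text)
    collection.add(
        documents=pieces,
        metadatas=[{"source": source, "chunk": i} for i in range(len(pieces))],
        ids=[f"{source}-{i}" for i in range(len(pieces))],
    )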

🔎 Phase 2 — Retrieval (Done on Every User Query)

[Diagram: Retrieval Pipeline]
1. 👤 User question
2. Embed the query: same embedding model → 768-d vector
3. HNSW search: ChromaDB / Elasticsearch kNN, O(log n)
4. Top-k chunks: the k=3 most similar passages come back with their distances
5. Build the prompt: system message + retrieved chunks + user question → LLM context
6. LLM (GPT-4 / Claude / Llama) reads the context chunks plus the question and generates a grounded answer
✅ Grounded answer: the LLM answers only from the retrieved chunks, which greatly reduces hallucination.
── Full RAG Query with LangChain + ChromaDB + OpenAI ───────
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Load existing ChromaDB vector store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings()
)

# 2. Create retriever — returns top 3 chunks by cosine similarity
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)

# 3. Build QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4", temperature=0),
    retriever=retriever,
    return_source_documents=True
)

# 4. Ask question → retrieval → grounded answer
result = qa_chain({"query": "How long is the warranty period?"})
# LLM reads chunk: "The warranty period covers 12 months..."
# Answers: "The warranty period is 12 months from the purchase date."
🎯 Why RAG Beats Fine-Tuning for Most Use Cases Fine-tuning bakes knowledge into model weights — expensive, slow, hard to update. RAG externalises knowledge into a retrieval store — cheap, instant updates, fully auditable (you can see exactly which chunks were used). The LLM becomes a reasoning engine; the vector DB becomes its long-term memory.
Section 8 — Alternatives & Decision Guide
⚖️
Which Tool Should You Use?
Vector DBs, hybrid engines, full-text search — decision guide
Tool | Type | Search Method | Best For | Weakness
Elasticsearch 8+ | hybrid | BM25 + HNSW cosine | Production RAG + keyword + analytics | Complex ops, RAM heavy
ChromaDB | vector | HNSW cosine/L2/IP | Local prototyping, Python RAG | Not production-grade at huge scale
Pinecone | vector | HNSW / ANN | Managed cloud vector search | Cost, vendor lock-in
Weaviate | vector | HNSW + BM25 | GraphQL + vector, multi-modal | Steeper learning curve
Qdrant | vector | HNSW cosine | High performance, Rust-based | Smaller ecosystem
pgvector | SQL | Exact cosine / IVFFlat | Already on PostgreSQL | Slower than dedicated DBs at scale
Redis (VSS) | cache | HNSW / FLAT | Low-latency, in-memory | Data size limited by RAM
Milvus | vector | HNSW / IVF / ANNOY | Billion-scale vectors | Heavy infrastructure
OpenSearch | hybrid | BM25 + k-NN | AWS-native, ES alternative | Slightly behind ES features
[Decision tree: New RAG project? Prototype → ChromaDB 🟠. Production and you also need keyword search → Elasticsearch ⚡ (hybrid). Production, pure vector → Pinecone / Qdrant. Already on Postgres → pgvector 🐘. Billion+ vectors on a tight RAM budget → Milvus / Weaviate.]
📚
References & Further Reading
Papers, docs, courses, and free tools