Vector Databases: Internals, Indexes & Distributed Architecture
A Complete Deep-Dive Into HNSW, IVF, PQ, Embeddings, Metadata Filtering & Multi-Node Search Systems
Vector databases have exploded in popularity because large language models (LLMs) turned embeddings into the new “universal representation” for text, images, video, audio, code, protein sequences — anything that can be encoded into high-dimensional vector space.
In 2024–25, vector database questions became a core part of HLD interviews at FAANG, MAMAA, and Indian product companies:
- “Explain how HNSW works.”
- “How does a vector database scale horizontally?”
- “How do you combine ANN search with metadata filters?”
- “How does PQ compress vector space?”
- “Why are insertions slow in HNSW?”
- “What index would you pick for 100M vectors?”
- “How does Pinecone/Milvus shard data?”
- “What if you need freshness + fast ingestion?”
This article teaches everything you need for interviews and actual engineering work: From embeddings → indexing → compression → filters → distributed search → ingestion → consistency → system design.
This is your interview guide on vector databases.
Table of Contents
- Why Vector Databases Exist
- Embeddings 101 — The Semantic Foundation
- The Challenge: High-Dimensional Nearest Neighbor Search
- Exact Search vs Approximate Search
- ANN Indexes: Categories & Trade-offs
- HNSW Deep Dive (Most Important)
- IVF / IVF-Flat / IVF-PQ
- PQ, OPQ, Scalar Quantization
- Hybrid Index Architectures (HNSW-PQ, IVF-HNSW)
- Storage Engine Design for Vector Databases
- WAL, segments, delta logs & compaction
- Distributed Architecture (Sharding, Replication, Routing)
- Metadata Filtering + Vector Search Fusion
- Real-Time Ingestion Challenges
- Memory, Latency, Recall Tuning
- Case Studies: FAISS, Pinecone, Milvus, Weaviate, OpenSearch
- Interview Mental Models
- Summary
1. Why Vector Databases Exist
LLMs turned all content into vector form.
Instead of keyword matching (BM25, inverted indexes), embeddings allow engines to search for meaning:
- “dog” and “puppy” → close in vector space
- “machine learning engineer” and “AI developer” → similar
- “refund request” and “money back issue” → close
Embeddings capture semantic information that keyword search simply cannot.
This creates a new problem:
How do you search millions or billions of 768-dimensional vectors?
Naively, by computing `distance(query, vector_i)` for every vector. This is O(N × d) → too slow beyond 100K–1M vectors.
Vector databases exist to:
- store embeddings
- index them efficiently
- perform approximate nearest neighbor (ANN) search
- scale horizontally
- support filters
- handle ingestion
- provide consistency
They are specialized search engines — not general databases.
2. Embeddings 101 — The Semantic Foundation
An embedding is a d-dimensional float vector:
`v = [0.12, -0.83, 0.44, …]`
Common embeddings:
- 384 dimensions (MiniLM)
- 768 dimensions (BERT)
- 1024–4096 dimensions (OpenAI / Llama)
- 8k dimensions (image embeddings)
Distance metrics:
- Cosine similarity (most common)
- L2 distance (Euclidean)
- Inner product (dot-product)
Cosine similarity requires normalization:
`v_normalized = v / ||v||`
Distance and similarity are interchangeable (for normalized vectors):
`cosine_similarity = dot(q, v)`
`l2_distance = ||q - v||`
Vector DBs convert similarity queries into top-K nearest neighbor searches.
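To make this concrete, here is a minimal numpy sketch (with made-up random vectors) showing normalization and why, for unit vectors, cosine similarity and L2 distance produce the same ranking:

```python
import numpy as np

q = np.random.rand(768).astype(np.float32)   # hypothetical query embedding
v = np.random.rand(768).astype(np.float32)   # hypothetical document embedding

# Normalize to unit length so that a plain dot product equals cosine similarity
q_n = q / np.linalg.norm(q)
v_n = v / np.linalg.norm(v)

cosine = float(np.dot(q_n, v_n))
l2 = float(np.linalg.norm(q_n - v_n))

# For unit vectors: ||q - v||^2 = 2 - 2 * cos(q, v), so both metrics rank results identically
print(cosine, l2 ** 2, 2 - 2 * cosine)
```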
3. The Challenge: High-Dimensional Nearest Neighbor Search
Exact search is easy:

```
for v in all_vectors:
    compute distance(q, v)
return top K results
```

But high-dimensional spaces suffer from:
The Curse of Dimensionality
- Distances become less meaningful
- Indexes like KD-trees collapse
- Partitioning becomes ineffective
- Clustering becomes expensive
Thus the shift to Approximate Nearest Neighbor (ANN):
Return results that are very close to the nearest ones, but not necessarily the exact top-K.
ANN trades off accuracy for speed → this is acceptable for most LLM applications.
4. Exact Search vs Approximate Search
| Property | Exact | Approximate (ANN) |
|---|---|---|
| Accuracy | 100% | ~90–99% |
| Latency | Slow | Fast |
| Memory | High | Medium/High |
| Use Case | Small datasets | >1M vectors |
Exact search uses:
- Flat index (brute force)
- GPU acceleration (FAISS GPU)
ANN uses:
- HNSW
- IVF
- PQ
- ScaNN
- Annoy
- DiskANN
Vector databases almost always choose ANN.
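For reference, the exact (flat) baseline is trivial to implement and is often used to measure the recall of an ANN index. A minimal numpy sketch over synthetic data (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.random((100_000, 128)).astype(np.float32)   # synthetic corpus
query = rng.random(128).astype(np.float32)

# Exact (flat) search: compute every distance, keep the K smallest -> O(N × d)
k = 10
dists = np.linalg.norm(vectors - query, axis=1)
topk = np.argpartition(dists, k)[:k]          # K smallest, unordered
topk = topk[np.argsort(dists[topk])]          # order the winners by distance
print(topk, dists[topk])
```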
5. ANN Index Families (The Big Picture)
Three major families dominate ANN indexing:
- Graph-based: HNSW, NSG
- Tree/Cluster: IVF, k-means, LSH
- Quantization: PQ, OPQ, SQ
They differ in accuracy, speed, memory footprint, and ingestion cost.
6. HNSW Deep Dive (Most Important for Interviews)
Hierarchical Navigable Small Worlds — The King of ANN Indexes
HNSW is used by:
- Pinecone
- Weaviate
- Qdrant
- Vespa
- Milvus
- Many enterprise vector search engines
It is widely considered the best all-around ANN index.
6.1 Intuition: A Multi-Level Navigable Graph
HNSW builds a graph where:
- nodes = vectors
- edges = proximity links
- upper layers = fewer nodes, long-range links
- bottom layer = dense links, local neighbors
Diagram:
```
Level 3: A ---- B
Level 2: C ---- D ---- E
Level 1: F -- G -- H -- I -- J
Level 0: (dense graph with nearest neighbors)
```

Searching works by:
- Start at top layer
- Greedily move to neighbor closer to query
- Drop to next layer
- Repeat
- Use a best-first search with a priority queue (beam of size efSearch) at the bottom layer
Search complexity is O(log N).
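Below is a deliberately simplified Python sketch of that descent, using hypothetical data structures (each layer is a dict of node id → neighbor ids); it omits the efSearch beam and neighbor-selection heuristics of a real implementation:

```python
import numpy as np

def greedy_search_layer(query, entry, layer_graph, vectors):
    """Greedy walk within one layer: keep moving to the closest neighbor until stuck."""
    current = entry
    best = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for nbr in layer_graph.get(current, []):
            d = np.linalg.norm(vectors[nbr] - query)
            if d < best:
                current, best, improved = nbr, d, True
    return current

def hnsw_descent(query, layers, vectors, entry_point):
    """Layers ordered top (sparse) to bottom (dense); each layer's result seeds the next."""
    entry = entry_point
    for layer_graph in layers:
        entry = greedy_search_layer(query, entry, layer_graph, vectors)
    return entry   # best candidate at layer 0; a real search widens this into a top-K beam
```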
6.2 Key Parameters
M
Max number of neighbors per node.
Higher M → higher accuracy, more memory.
efConstruction
Size of the candidate list used while building the graph (controls graph quality).
Higher → better recall, slower build.
efSearch
Search beam width.
Higher → better recall, slower search.
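These three knobs map directly onto open-source implementations. A minimal sketch assuming the hnswlib library and random data (parameter values are illustrative, not recommendations):

```python
import hnswlib
import numpy as np

dim, n = 768, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)   # M and efConstruction are fixed at build time
index.add_items(data, np.arange(n))

index.set_ef(64)                                  # efSearch: raise for recall, lower for latency
labels, distances = index.knn_query(data[:5], k=10)
```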
6.3 HNSW Strengths
- Best-in-class recall/latency trade-off
- Incremental inserts supported
- Very strong locality-sensitive navigation
- Executes on CPU, no GPU required
6.4 HNSW Weaknesses
- Memory heavy (graph links add roughly 64–128 bytes per node on top of the raw vectors)
- Insertions expensive
- Deletions complicated (lazy deletion + rebuild)
- Harder to persist to disk (needs compact segments)
- Poor for streaming ingestion >10–50K writes/sec
Thus many vector DBs use:
- HNSW for in-memory serving
- On-disk indexes (IVF-PQ, DiskANN) for cold storage
7. IVF (Inverted File Index) & IVF-PQ
IVF is the coarse quantization approach.
7.1 Concept
- Run k-means clustering on all vectors.
- Assign each vector to its nearest centroid.
- Search only in a few nearest clusters.
Visual:
```
100M vectors
   ↓
1000 clusters
   ↓
At query time: search only 4 clusters (4/1000 = 0.4%)
```

Huge speedup.
7.2 IVF-Flat
IVF without compression.
Inside each cluster, you search raw vectors.
Fast but heavy on memory.
7.3 IVF-PQ (Product Quantization)
PQ compresses vectors:
- Split vector into subspaces
- Quantize each subspace
- Store compact codes (8–16 bytes per vector instead of 512–4096 bytes)
Example:
```
128-dim vector
   ↓ split into 8 segments
each segment encoded to 1 byte
   ↓
8 bytes total storage
```

Reduction from 512 bytes → ~8 bytes.
This allows 1B vectors to fit in memory.
Downside: lower recall.
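A minimal sketch using FAISS (assuming the faiss package and synthetic data; nlist, m and nprobe values are illustrative):

```python
import faiss
import numpy as np

d, nb, nq = 128, 100_000, 5
xb = np.random.rand(nb, d).astype(np.float32)   # database vectors (synthetic)
xq = np.random.rand(nq, d).astype(np.float32)   # query vectors (synthetic)

nlist = 1024        # number of coarse (IVF) clusters
m = 8               # PQ sub-quantizers -> 8 bytes per vector
quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer over centroids
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-code

index.train(xb)     # learns centroids + PQ codebooks
index.add(xb)
index.nprobe = 8    # clusters visited per query (the recall/latency knob)

D, I = index.search(xq, 10)   # distances and ids of the approximate top-10
```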
8. Product Quantization (PQ) — Compression for Massive Scale
PQ is critical for large-scale vector databases.
PQ breaks vectors into subspaces:
`128 dims → 8 groups of 16 dims`
Each group quantized:
`v = [segment1][segment2]…[segment8]`
Each segment encoded as 1 byte.
Distance computed using lookup tables
For each query, the distances from the query's sub-vectors to every centroid are precomputed once per subspace; a stored vector's distance is then just a handful of table lookups and additions, which is extremely fast.
PQ variants:
- PQ — basic
- OPQ — rotates vector space to improve quantization
- SQ — scalar quantization (1D quantization per dimension)
PQ allows FAISS to search billion-scale datasets on a single machine.
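A toy numpy sketch of the lookup-table trick (asymmetric distance computation). The codebooks here are random stand-ins; in a real system they come from k-means training:

```python
import numpy as np

d, m, k = 128, 8, 256                     # dims, subspaces, centroids per subspace
dsub = d // m
rng = np.random.default_rng(0)
codebooks = rng.random((m, k, dsub)).astype(np.float32)   # stand-in for trained k-means codebooks

def encode(v):
    """Compress a d-dim vector into m one-byte codes (nearest centroid per subspace)."""
    codes = np.empty(m, dtype=np.uint8)
    for j in range(m):
        sub = v[j * dsub:(j + 1) * dsub]
        codes[j] = np.argmin(((codebooks[j] - sub) ** 2).sum(axis=1))
    return codes

def adc_distance(query, codes):
    """Approximate squared L2 distance via per-subspace lookup tables."""
    # For each subspace, squared distance from the query sub-vector to all 256 centroids
    tables = np.stack([
        ((codebooks[j] - query[j * dsub:(j + 1) * dsub]) ** 2).sum(axis=1)
        for j in range(m)
    ])                                    # shape (m, k)
    return float(sum(tables[j, codes[j]] for j in range(m)))

v = rng.random(d).astype(np.float32)
q = rng.random(d).astype(np.float32)
print(adc_distance(q, encode(v)), ((q - v) ** 2).sum())   # approximate vs exact squared distance
```

In a real engine the per-query tables are built once and reused across millions of stored codes, which is where the speedup comes from.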
9. Hybrid Indexes (HNSW + PQ, IVF + HNSW)
Real systems combine indexes:
- HNSW for top layer, PQ for bottom layer
- IVF + HNSW refined search
- HNSW graph over PQ vectors
Hybrid systems provide:
- High recall
- Low memory
- Fast latency
- Good ingestion performance
This is how Milvus and Pinecone optimize real workloads.
10. Storage Engine Internals
Unlike relational DBs, vector DBs must store:
- raw vectors
- compressed vectors (PQ)
- metadata
- filters (inverted index)
- index files (graph, cluster assignments)
- WAL
- snapshot files
Segment Files (LSM-inspired)
Most systems (Milvus, Pinecone) use segments:

```
Segment 1: 500K vectors + index
Segment 2: 500K vectors + index
Segment 3: new writes (in-memory)
```

When a segment grows large enough, it is:
- sealed
- persisted
- indexed via HNSW / IVF
- merged later
This solves the “HNSW is slow to insert” problem.
11. WAL, Delta Logs & Compaction
Vector DBs maintain durability with:
- Write-Ahead Logs
- Segment logs
- Batch writes
- Periodic snapshotting
Process:
```
writes → WAL
small in-memory index
flush → segment file
segment indexing → background
compaction → merge old segments
```

This mirrors RocksDB but adapted for ANN indexing.
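A toy Python sketch of that write path (hypothetical class, thresholds and file format; real systems add fsync policies, timestamps and distributed coordination):

```python
import json

class VectorWriteBuffer:
    """Append writes to a WAL, buffer them in memory, and seal them into segments."""

    def __init__(self, wal_path, segment_size=500_000):
        self.wal = open(wal_path, "a")    # durability: every write hits the log first
        self.buffer = []                  # small in-memory batch (searchable via a tiny index)
        self.segments = []                # sealed, immutable segments indexed in the background
        self.segment_size = segment_size

    def insert(self, vec_id, vector, metadata):
        record = {"id": vec_id, "vector": vector, "meta": metadata}
        self.wal.write(json.dumps(record) + "\n")
        self.wal.flush()                  # fsync omitted for brevity
        self.buffer.append(record)
        if len(self.buffer) >= self.segment_size:
            self.flush_segment()

    def flush_segment(self):
        # Seal the buffer into an immutable segment; ANN indexing (HNSW/IVF) happens asynchronously
        self.segments.append(list(self.buffer))
        self.buffer.clear()
```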
12. Distributed Architecture (Sharding, Replication, Routing)
Modern vector DBs must scale horizontally.
The architecture typically looks like:
```
Query Node / Coordinator
          ↓
Shards (Segment replicas)
          ↓
Index Nodes (HNSW/IVF/PQ)
```

12.1 Sharding Strategies
A. Hash by ID
Uniform distribution, simple.
B. Hash by vector
Rare, because embeddings don’t hash uniformly.
C. Cluster-based sharding (IVF centroids)
Shards correspond to cluster partitions.
D. Range-based (for metadata hybrid DBs)
Used in Weaviate hybrid search.
12.2 Replication
Replication modes:
- sync — strong consistency
- async — eventual consistency
- multi-primary — conflict resolution needed
Often replicates entire segment files.
12.3 Query Routing
Query steps:
- Coordinator determines which shards are relevant
- Sends query vector to top candidate shards
- Each shard returns top-K partial results
- Coordinator merges into global top-K
Diagram:
```
            Query
              ↓
         Coordinator
       ↓      ↓      ↓
  Shard 1  Shard 2  Shard 3
       ↓      ↓      ↓
   Top-K1  Top-K2  Top-K3
              ↓
            merge
         Final Top-K
```
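The merge step is a simple k-way selection over the partial results. A minimal sketch, assuming each shard already returns (distance, id) pairs sorted ascending:

```python
import heapq

def merge_topk(shard_results, k):
    """Merge per-shard top-K lists (each a sorted list of (distance, id) tuples) into a global top-K."""
    merged = heapq.merge(*shard_results)   # streaming k-way merge of already-sorted inputs
    return [pair for _, pair in zip(range(k), merged)]

shard1 = [(0.12, "a"), (0.30, "b"), (0.55, "c")]
shard2 = [(0.08, "x"), (0.40, "y"), (0.90, "z")]
shard3 = [(0.25, "p"), (0.60, "q"), (0.95, "r")]
print(merge_topk([shard1, shard2, shard3], k=3))   # [(0.08, 'x'), (0.12, 'a'), (0.25, 'p')]
```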
13. Metadata Filtering + ANN Fusion
One of the hardest problems in vector databases.
Example:
```
Find products semantically similar to Q where:
  brand = "Nike"
  price between 2000 and 4000
  category = "Shoes"
```

ANN indexes don’t handle filters well. Vector DBs solve this using hybrid indexes:
Solution Approaches
1. Pre-filtering (Filter → ANN)
Apply the filter first → ANN search only on the remaining candidates.
Good when filters are highly selective.
2. Post-filtering (ANN → Filter)
ANN returns candidates → filter them.
May degrade recall if filtered too aggressively.
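A toy numpy sketch contrasting the two strategies over a brute-force scorer (illustrative only; real engines push the filter into the index instead of scanning):

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.random((10_000, 64)).astype(np.float32)          # synthetic corpus
brands = rng.choice(["Nike", "Adidas", "Puma"], size=10_000)   # synthetic metadata
query = rng.random(64).astype(np.float32)

def topk(candidate_ids, k):
    dists = np.linalg.norm(vectors[candidate_ids] - query, axis=1)
    return candidate_ids[np.argsort(dists)[:k]]

def pre_filter_search(k=10):
    # Filter first, then rank only the survivors (cheap when the filter is very selective)
    allowed = np.where(brands == "Nike")[0]
    return topk(allowed, k)

def post_filter_search(k=10, overfetch=5):
    # Rank everything first, then drop non-matching results; over-fetch to protect recall
    candidates = topk(np.arange(len(vectors)), k * overfetch)
    return [i for i in candidates if brands[i] == "Nike"][:k]

print(pre_filter_search())
print(post_filter_search())
```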
3. Inverted index + ANN hybrid
Systems like Weaviate store:
- Vector index (HNSW)
- Inverted index for metadata
Hybrid score computed using weighted fusion.
4. Pinecone Sparse-Dense Hybrid
Dense vector + sparse keyword vector (BM25-like):
`score = w_dense * sim_dense + w_sparse * sim_sparse`
This is extremely effective for RAG pipelines.
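A tiny sketch of this weighted fusion (weights and scores are illustrative; in practice the dense and sparse scores are normalized before mixing):

```python
def hybrid_score(sim_dense, sim_sparse, w_dense=0.7, w_sparse=0.3):
    """Weighted fusion of a dense (embedding) score and a sparse (keyword/BM25-like) score."""
    return w_dense * sim_dense + w_sparse * sim_sparse

# Rank documents by the fused score: (dense score, sparse score) per document
docs = {"doc1": (0.82, 0.10), "doc2": (0.75, 0.60), "doc3": (0.40, 0.95)}
ranked = sorted(docs, key=lambda d: hybrid_score(*docs[d]), reverse=True)
print(ranked)   # doc2's keyword match lifts it: ['doc2', 'doc1', 'doc3']
```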
14. Real-Time Ingestion Challenges
HNSW insertions are slow because:
- Graph rewiring required
- Each insertion updates multiple edges
- efConstruction parameter heavily affects time
Solutions:
A. Buffer writes into log segments
Then build HNSW offline.
B. Use IVF-PQ for write-friendly ingestion
C. Store writes in RAM index (small HNSW)
Merge later.
D. Append-only segments + periodic rebuild
Like LSM compaction.
15. Memory, Latency & Recall Tuning
Memory Footprint = Vectors + Index
HNSW is memory heavy:
`memory_per_vector ≈ vector_size + M * bytes_per_neighbor_link`
PQ compresses the vector size massively.
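A quick back-of-the-envelope example (assumed workload: 10M vectors, 768 dimensions, float32, M = 16 with 4-byte neighbor ids; the 2× factor for bidirectional links is an approximation):

```python
n = 10_000_000            # vectors
d = 768                   # dimensions
vector_bytes = d * 4      # float32 -> 3,072 bytes per vector
link_bytes = 16 * 2 * 4   # M=16, roughly bidirectional, 4-byte ids -> ~128 bytes per vector

total_gb = n * (vector_bytes + link_bytes) / 1e9
print(round(total_gb, 1), "GB")   # ~32 GB of RAM before any PQ compression
```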
Recall Tuning
HNSW tunables:
- efSearch
- efConstruction
- M
IVF tunables:
- nprobe (# clusters searched)
Higher values → better recall, slower search.
Latency Optimization
- SIMD instructions for distance calculation
- cache-aware block layout
- prefetching
- GPU-based search
- reducing dimensionality via PCA
16. Case Studies
16.1 FAISS (Meta)
- Best GPU-accelerated ANN library
- IVF-PQ, HNSW, Flat, OPQ
- Billion-scale search on a single node
16.2 Pinecone
- Fully managed vector DB
- Built on HNSW + proprietary segmenting
- Adaptive capacity
- Multi-tenant isolation
- Sparse-dense hybrid retrieval
16.3 Milvus
- Most advanced open-source vector DB
- Supports HNSW, IVF, IVF-PQ, DiskANN
- Distributed metadata server
- LSM-tree-like segment design
16.4 Weaviate
- Hybrid search (vector + keyword)
- HNSW for vector search
- Inverted index for metadata
- Excellent for enterprise use cases
16.5 OpenSearch KNN
- FAISS/HNSW under the hood
- Integrates with existing search pipeline
17. Interview Mental Models
Q: Why ANN instead of exact search?
High-dimensional exact search collapses after 100K vectors.
Q: When do you choose HNSW?
When recall & latency are the priority and ingestion is moderate.
Q: When IVF-PQ?
When scaling to 100M–10B vectors with memory constraints.
Q: How do you scale vector search horizontally?
Sharding + distributed top-K merging.
Q: Why are inserts slow in HNSW?
Graph rewiring and greedy search to find neighbors.
Q: How do filters work with vectors?
Hybrid indexing → pre-filter or post-filter strategies.
Q: How do vector DBs persist data?
Segments + WAL + compaction.
Q: Which index for streaming ingestion?
IVF-flat or HNSW + segment buffering.
18. Summary (The one-sentence version)
A vector database is a specialized distributed search engine that combines embeddings, ANN indexes (HNSW/IVF/PQ), metadata filters, write-ahead logging, segment storage, and multi-node top-K routing to deliver low-latency semantic search at massive scale.