
Vector Databases: Internals, Indexes & Distributed Architecture

A Complete Deep-Dive Into HNSW, IVF, PQ, Embeddings, Metadata Filtering & Multi-Node Search Systems

Vector databases have exploded in popularity because large language models (LLMs) turned embeddings into the new “universal representation” for text, images, video, audio, code, protein sequences — anything that can be encoded into high-dimensional vector space.

In 2024–25, vector database questions became a core part of HLD interviews at FAANG, MAMAA, and Indian product companies:

  • “Explain how HNSW works.”
  • “How does a vector database scale horizontally?”
  • “How do you combine ANN search with metadata filters?”
  • “How does PQ compress vector space?”
  • “Why are insertions slow in HNSW?”
  • “What index would you pick for 100M vectors?”
  • “How does Pinecone/Milvus shard data?”
  • “What if you need freshness + fast ingestion?”

This article teaches everything you need for interviews and actual engineering work: from embeddings → indexing → compression → filters → distributed search → ingestion → consistency → system design.

This is your interview guide on vector databases.


Table of Contents

  1. Why Vector Databases Exist
  2. Embeddings 101 — The Semantic Foundation
  3. The Challenge: High-Dimensional Nearest Neighbor Search
  4. Exact Search vs Approximate Search
  5. ANN Index Families (The Big Picture)
  6. HNSW Deep Dive (Most Important for Interviews)
  7. IVF (Inverted File Index) & IVF-PQ
  8. Product Quantization (PQ) — Compression for Massive Scale
  9. Hybrid Indexes (HNSW + PQ, IVF + HNSW)
  10. Storage Engine Internals
  11. WAL, Delta Logs & Compaction
  12. Distributed Architecture (Sharding, Replication, Routing)
  13. Metadata Filtering + ANN Fusion
  14. Real-Time Ingestion Challenges
  15. Memory, Latency & Recall Tuning
  16. Case Studies: FAISS, Pinecone, Milvus, Weaviate, OpenSearch
  17. Interview Mental Models
  18. Summary

1. Why Vector Databases Exist

LLMs turned all content into vector form.

Instead of keyword matching (BM25, inverted indexes), embeddings allow engines to search for meaning:

  • “dog” and “puppy” → close in vector space
  • “machine learning engineer” and “AI developer” → similar
  • “refund request” and “money back issue” → close

Embeddings extract semantic information that keyword search simply cannot.

This creates a new problem:

How do you search millions or billions of 768-dimensional vectors?

Naively, by computing:

distance(query, vector_i)

for every vector. This is O(N × d) → too slow beyond 100K–1M vectors.

Vector databases exist to:

  • store embeddings
  • index them efficiently
  • perform approximate nearest neighbor (ANN) search
  • scale horizontally
  • support filters
  • handle ingestion
  • provide consistency

They are specialized search engines — not general databases.


2. Embeddings 101 — The Semantic Foundation

An embedding is a d-dimensional float vector:

v = [0.12, -0.83, 0.44, …]

Common embeddings:

  • 384 dimensions (MiniLM)
  • 768 dimensions (BERT)
  • 1024–4096 dimensions (OpenAI / Llama)
  • 8k dimensions (image embeddings)

Distance metrics:

  1. Cosine similarity (most common)
  2. L2 distance (Euclidean)
  3. Inner product (dot-product)

Cosine similarity is usually implemented by normalizing vectors up front:

v_normalized = v / ||v||

For unit-length vectors, the metrics become interchangeable for ranking:

cosine_similarity = dot(q, v)
l2_distance^2 = ||q - v||^2 = 2 - 2 * dot(q, v)
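
A minimal numpy sketch of this relationship (illustrative only; it assumes both vectors are unit-normalized as above):

import numpy as np

q, v = np.random.rand(768), np.random.rand(768)
q, v = q / np.linalg.norm(q), v / np.linalg.norm(v)   # unit-normalize

cos_sim = np.dot(q, v)                       # cosine similarity == dot product here
l2_sq = np.sum((q - v) ** 2)                 # squared Euclidean distance
assert np.isclose(l2_sq, 2 - 2 * cos_sim)    # nearest by cosine == nearest by L2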

Vector DBs convert similarity queries into top-K nearest neighbor searches.


3. The Challenge: High-Dimensional Nearest Neighbor Search

Exact search is easy:

import numpy as np

def exact_top_k(q, all_vectors, k):                  # brute force, O(N × d)
    dists = np.linalg.norm(all_vectors - q, axis=1)  # distance to every vector
    return np.argsort(dists)[:k]                     # indices of the k closest

But high-dimensional spaces suffer from:

The Curse of Dimensionality

  • Distances concentrate (nearest and farthest neighbors become almost equally far apart)
  • Space-partitioning indexes like KD-trees degrade to near-linear scans
  • Partitioning becomes ineffective
  • Clustering becomes expensive

Thus the shift to Approximate Nearest Neighbor (ANN):

Return results that are very close to the nearest ones, but not necessarily the exact top-K.

ANN trades off accuracy for speed → this is acceptable for most LLM applications.


4. Exact Search vs Approximate Search

Property   | Exact           | Approximate (ANN)
Accuracy   | 100%            | ~90–99%
Latency    | Slow            | Fast
Memory     | High            | Medium/High
Use case   | Small datasets  | >1M vectors

Exact search uses:

  • Flat index (brute force)
  • GPU acceleration (FAISS GPU)

ANN uses:

  • HNSW
  • IVF
  • PQ
  • ScaNN
  • Annoy
  • DiskANN

Vector databases almost always choose ANN.


5. ANN Index Families (The Big Picture)

Three major families dominate ANN indexing:

Graph-based:             HNSW, NSG
Partition/cluster-based: IVF (k-means), trees (Annoy), LSH buckets
Quantization-based:      PQ, OPQ, SQ

They differ in accuracy, speed, memory footprint, and ingestion cost.


6. HNSW Deep Dive (Most Important for Interviews)

Hierarchical Navigable Small World Graphs — The King of ANN Indexes

HNSW is used by:

  • Pinecone
  • Weaviate
  • Qdrant
  • Vespa
  • Milvus
  • Many enterprise vector search engines

It is widely considered the best all-around ANN index.


6.1 Intuition: A Multi-Level Navigable Graph

HNSW builds a graph where:

  • nodes = vectors
  • edges = proximity links
  • upper layers = fewer nodes, long-range links
  • bottom layer = dense links, local neighbors

Diagram:

Level 3:     A ---- B
             |      |
Level 2:   C ---- D ---- E
           |       \
Level 1: F -- G -- H -- I -- J
           \    \      \ 
Level 0:  (dense graph with nearest neighbors)

Searching works by:

  1. Start at top layer
  2. Greedily move to neighbor closer to query
  3. Drop to next layer
  4. Repeat
  5. Run a best-first search with a priority queue (beam width = efSearch) at the bottom layer

Search complexity is roughly O(log N).


6.2 Key Parameters

M

Maximum number of neighbor links per node.

Higher M → higher accuracy, more memory.

efConstruction

Size of the candidate list explored while building the graph.

Higher → better recall, slower build.

efSearch

Search beam width.

Higher → better recall, slower search.
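
To make these knobs concrete, here is a minimal usage sketch with the open-source hnswlib library (parameter values are arbitrary, chosen only for illustration):

import hnswlib
import numpy as np

dim, n = 768, 100_000
data = np.random.rand(n, dim).astype("float32")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)   # M, efConstruction
index.add_items(data, np.arange(n))

index.set_ef(100)                        # efSearch: beam width at query time
labels, distances = index.knn_query(data[:5], k=10)

Raising M and ef_construction improves recall at the cost of memory and build time; raising the ef value trades query latency for recall.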


6.3 HNSW Strengths

  • Best-in-class recall/latency trade-off
  • Incremental inserts supported
  • Graph navigation exploits small-world structure, reaching the query's neighborhood in few hops
  • Executes on CPU, no GPU required

6.4 HNSW Weaknesses

  • Memory heavy (the graph adds roughly M × 2 × 4–8 bytes of neighbor links per vector)
  • Insertions expensive
  • Deletions complicated (lazy deletion + rebuild)
  • Harder to persist to disk (needs compact segments)
  • Poor for streaming ingestion >10–50K writes/sec

Thus many vector DBs use:

  • HNSW for in-memory serving
  • On-disk indexes (IVF-PQ, DiskANN) for cold storage

7. IVF (Inverted File Index) & IVF-PQ

IVF is a coarse quantization (clustering) approach.


7.1 Concept

  1. Run k-means clustering on all vectors.
  2. Assign each vector to its nearest centroid.
  3. Search only in a few nearest clusters.

Visual:

100M vectors

1000 clusters

At query time: search only 4 clusters (4/1000 = 0.4%)

Huge speedup.
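
As an illustration, an IVF-Flat index in FAISS looks roughly like this (toy data and parameters, not tuned):

import faiss
import numpy as np

d, nlist = 128, 1000
xb = np.random.rand(1_000_000, d).astype("float32")   # toy corpus

quantizer = faiss.IndexFlatL2(d)               # holds the k-means centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                # k-means clustering step
index.add(xb)                                  # assign vectors to clusters

index.nprobe = 4                               # search only the 4 nearest clusters
D, I = index.search(xb[:5], 10)                # distances + ids of the top-10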


7.2 IVF-Flat

IVF without compression.

Inside each cluster, you search raw vectors.

Fast but heavy on memory.


7.3 IVF-PQ (Product Quantization)

PQ compresses vectors:

  • Split vector into subspaces
  • Quantize each subspace
  • Store compact codes (8–16 bytes per vector instead of 512–4096 bytes)

Example:

128-dim vector
↓ split into 8 segments
each segment encoded to 1 byte

8 bytes total storage

Reduction from 512 bytes → ~8 bytes.

This allows 1B vectors to fit in memory.

Downside: lower recall.


8. Product Quantization (PQ) — Compression for Massive Scale

PQ is critical for large-scale vector databases.

PQ breaks vectors into subspaces:

128 dims → 8 groups of 16 dims

Each group quantized:

v = [segment1][segment2]…[segment8]

Each segment encoded as 1 byte.

Distances are computed with precomputed lookup tables (asymmetric distance computation), so scanning millions of compressed codes is extremely fast.

PQ variants:

  • PQ — basic
  • OPQ — rotates vector space to improve quantization
  • SQ — scalar quantization (1D quantization per dimension)

PQ allows FAISS to search billion-scale datasets on a single machine.
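
A sketch of how this looks in FAISS with IVF-PQ (8 sub-quantizers × 8 bits → 8-byte codes, matching the example above; values are illustrative):

import faiss
import numpy as np

d, nlist, m, nbits = 128, 1024, 8, 8           # 8 segments × 1 byte = 8 B/vector
xb = np.random.rand(1_000_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                # learns centroids + PQ codebooks
index.add(xb)                                  # stores compact codes, not raw vectors
index.nprobe = 16
D, I = index.search(xb[:5], 10)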


9. Hybrid Indexes (HNSW + PQ, IVF + HNSW)

Real systems combine indexes:

  • HNSW for top layer, PQ for bottom layer
  • IVF + HNSW refined search
  • HNSW graph over PQ vectors

Hybrid systems provide:

  • High recall
  • Low memory
  • Fast latency
  • Good ingestion performance

This is how Milvus and Pinecone optimize real workloads.


10. Storage Engine Internals

Unlike relational DBs, vector DBs must store:

  • raw vectors
  • compressed vectors (PQ)
  • metadata
  • filters (inverted index)
  • index files (graph, cluster assignments)
  • WAL
  • snapshot files

Segment Files (LSM-inspired)

Most systems (Milvus, Pinecone) use segments:

Segment 1: 500K vectors + index
Segment 2: 500K vectors + index
Segment 3: new writes (in-memory)

When a segment grows large enough, it is:

  • sealed (made immutable)
  • persisted to disk
  • indexed via HNSW / IVF
  • merged with other segments later

This solves the "HNSW is slow to insert" problem.


11. WAL, Delta Logs & Compaction

Vector DBs maintain durability with:

  • Write-Ahead Logs
  • Segment logs
  • Batch writes
  • Periodic snapshotting

Process:

writes → WAL
small in-memory index
flush → segment file
segment indexing → background
compaction → merge old segments

This mirrors RocksDB's LSM design, adapted for ANN indexing.


12. Distributed Architecture (Sharding, Replication, Routing)

Modern vector DBs must scale horizontally.

The architecture typically looks like:

Query Node / Coordinator
          ↓
Shards (segment replicas)
          ↓
Index Nodes (HNSW / IVF / PQ)

12.1 Sharding Strategies

A. Hash by ID

Uniform distribution, simple.

B. Hash by vector

Rare: nearby vectors do not map to the same shard, so every shard must be searched anyway.

C. Cluster-based sharding (IVF centroids)

Shards correspond to cluster partitions.

D. Range-based (for metadata hybrid DBs)

Used in Weaviate hybrid search.


12.2 Replication

Replication modes:

  • sync — strong consistency
  • async — eventual consistency
  • multi-primary — conflict resolution needed

Replication is usually done by copying entire sealed segment files rather than individual vectors.


12.3 Query Routing

Query steps:

  1. Coordinator determines which shards are relevant
  2. Sends query vector to top candidate shards
  3. Each shard returns top-K partial results
  4. Coordinator merges them into the global top-K (see the merge sketch below the diagram)

Diagram:

Query
  ↓
Coordinator
 ↓            ↓             ↓
Shard 1     Shard 2      Shard 3
 ↓            ↓             ↓
Top-K1      Top-K2       Top-K3
 ↓ merge
Final Top-K
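
A minimal sketch of the coordinator-side merge, assuming each shard returns (distance, doc_id) pairs where smaller distance means closer:

import heapq
from itertools import chain

def merge_global_top_k(shard_results, k=10):
    # shard_results: list of per-shard top-K lists of (distance, doc_id) pairs.
    # The global top-K is simply the k smallest distances across all shards.
    return heapq.nsmallest(k, chain.from_iterable(shard_results))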

13. Metadata Filtering + ANN Fusion

One of the hardest problems in vector databases.

Example:

Find products semantically similar to Q where:
brand = "Nike"
price between 2000 and 4000
category = "Shoes"

ANN indexes don’t handle filters natively. Vector DBs solve this with a few different strategies:


Solution Approaches

1. Pre-filtering (Filter → ANN)

Filter narrow set → ANN search on remaining.

Good when filters are highly selective.


2. Post-filtering (ANN → Filter)

ANN returns candidates → filter them.

May degrade recall if filtered too aggressively.
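
An illustrative post-filtering sketch (the index object, its search method, and the oversampling factor are hypothetical stand-ins for any ANN index; oversampling compensates for candidates the filter will drop):

def post_filtered_search(index, query, predicate, k=10, oversample=5):
    ids, dists = index.search(query, k * oversample)       # hypothetical ANN API
    survivors = [(d, i) for d, i in zip(dists, ids) if predicate(i)]
    return sorted(survivors)[:k]                           # k closest that pass the filter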


3. Inverted index + ANN hybrid

Systems like Weaviate store:

  • Vector index (HNSW)
  • Inverted index for metadata

Hybrid score computed using weighted fusion.


4. Pinecone Sparse-Dense Hybrid

Dense vector + sparse keyword vector (BM25-like):

score = w_dense * sim_dense + w_sparse * sim_sparse

This is extremely effective for RAG pipelines.


14. Real-Time Ingestion Challenges

HNSW is slow to insert into because:

  • Graph rewiring required
  • Each insertion updates multiple edges
  • efConstruction parameter heavily affects time

Solutions:

A. Buffer writes into log segments

Then build HNSW offline.

B. Use IVF-PQ for write-friendly ingestion

C. Store writes in RAM index (small HNSW)

Merge later.

D. Append-only segments + periodic rebuild

Like LSM compaction (see the sketch below).
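
A minimal sketch of approaches A and D under simplifying assumptions: writes accumulate in a RAM buffer, and a full buffer is sealed into an immutable HNSW segment (built here with hnswlib). A real system would build the segment in a background job and fan queries out over all sealed segments plus the buffer.

import hnswlib
import numpy as np

class SegmentedWriter:
    def __init__(self, dim, segment_size=100_000):
        self.dim, self.segment_size = dim, segment_size
        self.buffer, self.segments = [], []        # RAM buffer + sealed segments

    def insert(self, vector):
        self.buffer.append(vector)
        if len(self.buffer) >= self.segment_size:
            self.flush()                           # seal the buffer into a segment

    def flush(self):
        data = np.asarray(self.buffer, dtype=np.float32)
        seg = hnswlib.Index(space="l2", dim=self.dim)
        seg.init_index(max_elements=len(data), M=16, ef_construction=200)
        seg.add_items(data)                        # bulk build, not per-write rewiring
        self.segments.append(seg)
        self.buffer = []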


15. Memory, Latency & Recall Tuning

Memory Footprint = Vectors + Index

HNSW is memory heavy; a rough per-vector estimate is:

memory ≈ vector_size (d × 4 bytes for float32) + graph links (≈ M × 2 × 4 bytes)

PQ compresses the vector term massively.
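
A back-of-the-envelope calculation with that estimate (assumed values: 768 dimensions, M = 16, 4-byte link ids, 10M vectors):

d, M, n = 768, 16, 10_000_000
bytes_per_vector = d * 4 + M * 2 * 4          # vector + layer-0 graph links
total_gb = n * bytes_per_vector / 1e9
print(f"{bytes_per_vector} B/vector ≈ {total_gb:.0f} GB total")   # 3200 B ≈ 32 GB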


Recall Tuning

HNSW tunables:

  • efSearch
  • efConstruction
  • M

IVF tunables:

  • nprobe (# clusters searched)

Higher values → better recall, slower search.


Latency Optimization

  • SIMD instructions for distance calculation
  • cache-aware block layout
  • prefetching
  • GPU-based search
  • reducing dimensionality via PCA

16. Case Studies

16.1 FAISS (Meta)

  • Best GPU-accelerated ANN library
  • IVF-PQ, HNSW, Flat, OPQ
  • Billion-scale search on a single node

16.2 Pinecone

  • Fully managed vector DB
  • Built on HNSW + proprietary segmenting
  • Adaptive capacity
  • Multi-tenant isolation
  • Sparse-dense hybrid retrieval

16.3 Milvus

  • Most advanced open-source vector DB
  • Supports HNSW, IVF, IVF-PQ, DiskANN
  • Distributed metadata server
  • LSM-tree-like segment design

16.4 Weaviate

  • Hybrid search (vector + keyword)
  • HNSW for vector search
  • Inverted index for metadata
  • Excellent for enterprise use cases

16.5 OpenSearch KNN

  • FAISS/HNSW under the hood
  • Integrates with existing search pipeline

17. Interview Mental Models

Core mental model: brute-force exact search stops being practical beyond roughly 100K–1M vectors.

Q: When do you choose HNSW?

When recall & latency are priority and ingestion is moderate.

Q: When IVF-PQ?

When scaling to 100M–10B vectors with memory constraints.

Q: How do you scale vector search horizontally?

Sharding + distributed top-K merging.

Q: Why are inserts slow in HNSW?

Graph rewiring and greedy search to find neighbors.

Q: How do filters work with vectors?

Hybrid indexing → pre-filter or post-filter strategies.

Q: How do vector DBs persist data?

Segments + WAL + compaction.

Q: Which index for streaming ingestion?

IVF-flat or HNSW + segment buffering.


18. Summary (The one-sentence version)

A vector database is a specialized distributed search engine that combines embeddings, ANN indexes (HNSW/IVF/PQ), metadata filters, write-ahead logging, segment storage, and multi-node top-K routing to deliver low-latency semantic search at massive scale.
