Comet
A high-performance hybrid vector store written in Go. Comet brings together multiple indexing strategies and search modalities in a unified, hackable package: hybrid retrieval with reciprocal rank fusion, autocut, pre-filtering, semantic search, full-text search, multi-KNN search, and multi-query operations -- all in pure Go.
Understand search internals from the inside out. Built for hackers, not hyperscalers. Tiny enough to fit in your head. Decent enough to blow it.
Choose from:
- Flat (exact), HNSW (graph), IVF (clustering), PQ (quantization), or IVFPQ (hybrid) storage backends
- Full-Text Search: BM25 ranking algorithm with tokenization and normalization
- Metadata Filtering: Fast filtering using Roaring Bitmaps and Bit-Sliced Indexes
- Ranking Programmability: Reciprocal Rank Fusion, fixed-size result sets, threshold-based result sets, dynamic result sets, and more
- Hybrid Search: Unified interface combining vector, text, and metadata search
Table of Contents
- Overview
- Features
- Installation
- Quick Start
- Architecture
- Core Concepts
- API Reference
- Examples
- Configuration
- API Details
- Use Cases
- Contributing
- License
Overview
Everything you need to understand how vector databases actually work--and build one yourself.
What's inside:
- 5 Vector Storage Types: Flat, HNSW, IVF, PQ, IVFPQ
- 3 Distance Metrics: L2, L2 Squared, Cosine
- Full-Text Search: BM25 ranking with Unicode tokenization
- Metadata Filtering: Roaring bitmaps + Bit-Sliced Indexes
- Hybrid Search: Combine vector + text + metadata with Reciprocal Rank Fusion
- Advanced Search: Multi-KNN queries, multi-query operations, autocut result truncation
- Production Features: Thread-safe, serialization, soft deletes, configurable parameters
Features
Vector Storage
- Flat: Brute-force exact search (100% recall baseline)
- HNSW: Hierarchical navigable small world graphs (95-99% recall, O(log n) search)
- IVF: Inverted file index with k-means clustering (85-95% recall, 10-20x speedup)
- PQ: Product quantization for compression (85-95% recall, 10-500x memory reduction)
- IVFPQ: IVF + PQ combined (85-95% recall, 100x speedup + 500x compression)
Search Modalities
- Vector Search: L2, L2 Squared, and Cosine distance metrics
- Full-Text Search: BM25 ranking with Unicode-aware tokenization
- Metadata Filtering: Boolean queries on structured attributes
- Hybrid Search: Combine all three with configurable fusion strategies
Fusion Strategies
- Weighted Sum: Linear combination with configurable weights
- Reciprocal Rank Fusion (RRF): Scale-independent rank-based fusion
- Max/Min Score: Simple score aggregation
Data Structures (The Good Stuff)
- HNSW Graphs: Multi-layer skip lists for approximate nearest neighbor search
- Roaring Bitmaps: Compressed bitmaps for metadata filtering (array, bitmap, run-length encoding)
- Bit-Sliced Index (BSI): Efficient numeric range queries without full scans
- Product Quantization Codebooks: Learned k-means centroids for vector compression
- Inverted Indexes: Token-to-document mappings for full-text search
Other Capabilities
- Quantization: Full precision, half precision, int8 precision
- Soft Deletes: Fast deletion with lazy cleanup
- Serialization: Persist and reload indexes
- Thread-Safe: Concurrent read/write operations
- Autocut: Automatic result truncation based on score gaps
Installation
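Comet is a plain Go module; assuming the module path used in the Quick Start below, installation is the standard go get:

```shell
go get github.com/wizenheimer/comet
```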
Quick Start
package main

import (
	"fmt"
	"log"

	"github.com/wizenheimer/comet"
)

func main() {
	// Create a vector store (384-dimensional embeddings with cosine distance)
	index, err := comet.NewFlatIndex(384, comet.Cosine)
	if err != nil {
		log.Fatal(err)
	}

	// Add vectors
	vec1 := make([]float32, 384)
	// ... populate vec1 with your embedding ...
	node := comet.NewVectorNode(vec1)
	index.Add(*node)

	// Search for similar vectors
	query := make([]float32, 384)
	// ... populate query vector ...
	results, err := index.NewSearch().
		WithQuery(query).
		WithK(10).
		Execute()
	if err != nil {
		log.Fatal(err)
	}

	// Process results
	for i, result := range results {
		fmt.Printf("%d. ID=%d, Score=%.4f\n", i+1, result.GetId(), result.GetScore())
	}
}
Output:
1. ID=123, Score=0.0234
2. ID=456, Score=0.0567
3. ID=789, Score=0.0823
...
Architecture
System Architecture
Comet is organized into three main search engines that can work independently or together:
Application Layer
+-------------------------------------------------------------+
| Your Application |
| (Using Comet as a Go Library) |
+----------------------+--------------------------------------+
|
+-------------+-------------+
| | |
V V V
Vector Text Metadata
Search Engine Layer
+-------------+ +-------------+ +-------------+
| Vector | | Text | | Metadata |
| Search | | Search | | Filtering |
| Engine | | Engine | | Engine |
+------+------+ +------+------+ +------+------+
| | |
| Semantic | Keywords | Filters
| Similarity | + Relevance | + Boolean Logic
V V V
Index Storage Layer
+-------------+ +-------------+ +-------------+
| HNSW / IVF | | BM25 Index | | Roaring |
| / PQ / Flat | | (Inverted) | | Bitmaps |
+-------------+ +-------------+ +-------------+
Graph/Trees Token-DocIDs Compressed Sets
Hybrid Coordinator
All Three Engines
|
V
+-----------------+
| Hybrid Search |
| Coordinator |
| (Score Fusion) |
+-----------------+
|
V
Combined Results
Component Details
Component A: Vector Storage Engine
Manages vector storage and similarity search across multiple index types.
Common Interface:
+------------------------------------+
| VectorIndex Interface |
| |
| +- Train(vectors) |
| +- Add(vector) |
| +- Remove(vector) |
| +- NewSearch() |
+------------------------------------+
Available Implementations:
FlatIndex - Brute force, 100% recall
HNSWIndex - Graph-based, O(log n)
IVFIndex - Clustering, 10-20x faster
PQIndex - Quantization, 10-500x compression
IVFPQIndex - Hybrid, best of IVF + PQ
Responsibilities:
- Vector preprocessing (normalization for cosine distance)
- Distance calculations (Euclidean, L2 Squared, Cosine)
- K-nearest neighbor search
- Serialization/deserialization
- Soft delete management with flush mechanism
Performance Characteristics:
- Flat: O(n x d) search, 100% recall
- HNSW: O(M x ef x log n) search, 95-99% recall
- IVF: O(nProbe x (n/k) x d) search, 85-95% recall
Component B: Text Search Engine
Full-text search using BM25 ranking algorithm.
Inverted Index:
+------------------------------------+
| term - RoaringBitmap(docIDs) |
| |
| "machine" - {1, 5, 12, 45} |
| "learning" - {1, 3, 12, 20} |
| "neural" - {3, 20, 45} |
+------------------------------------+
Term Frequencies:
+------------------------------------+
| term - {docID: count} |
| |
| "machine" - {1: 3, 5: 1, 12: 2} |
+------------------------------------+
Document Stats:
+------------------------------------+
| docID - (length, token_count) |
| |
| 1 - (250 chars, 45 tokens) |
| 5 - (180 chars, 32 tokens) |
+------------------------------------+
Responsibilities:
- Text tokenization (UAX#29 word segmentation)
- Unicode normalization (NFKC)
- Inverted index maintenance
- BM25 score calculation
- Top-K retrieval with heap
Performance Characteristics:
- Add: O(m) where m = tokens
- Search: O(q x d_avg) where q = query terms, d_avg = avg docs per term
- Memory: Compressed inverted index, no original text stored
Component C: Metadata Filter Engine
Fast filtering using compressed bitmaps.
Categorical Fields (Roaring Bitmaps):
+------------------------------------+
| field:value - Bitmap(docIDs) |
| |
| "category:electronics" - {1,5,12} |
| "category:books" - {2,8,15} |
| "in_stock:true" - {1,2,5} |
+------------------------------------+
Numeric Fields (Bit-Sliced Index):
+------------------------------------+
| field - BSI (range queries) |
| |
| "price" - [0-1000, 1000-5000] |
| "rating" - [0-5 scale] |
+------------------------------------+
Document Universe:
+------------------------------------+
| allDocs - Bitmap(all IDs) |
| |
| Used for NOT operations |
+------------------------------------+
Responsibilities:
- Bitmap index maintenance (Roaring compression)
- BSI for numeric range queries
- Boolean query evaluation (AND, OR, NOT)
- Existence checks
- Set operations (IN, NOT IN)
Performance Characteristics:
- Add: O(f) where f = fields
- Query: O(f x log n) with bitmap operations
- Memory: Highly compressed, 1-10% of uncompressed
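The boolean semantics above can be sketched with plain Go sets standing in for compressed bitmaps. The real engine uses Roaring containers for compression, but AND/OR/NOT behave identically; this is an illustrative sketch, not Comet's implementation:

```go
package main

import "fmt"

// set is an uncompressed stand-in for a Roaring bitmap of doc IDs.
type set map[uint32]bool

// and intersects two posting sets (AND).
func and(a, b set) set {
	out := set{}
	for id := range a {
		if b[id] {
			out[id] = true
		}
	}
	return out
}

// not complements a set against the document universe (NOT).
func not(universe, a set) set {
	out := set{}
	for id := range universe {
		if !a[id] {
			out[id] = true
		}
	}
	return out
}

func main() {
	electronics := set{1: true, 5: true, 12: true}
	inStock := set{1: true, 2: true, 5: true}
	allDocs := set{1: true, 2: true, 5: true, 8: true, 12: true, 15: true}

	// category="electronics" AND in_stock=true -> docs 1 and 5
	fmt.Println(len(and(electronics, inStock)))

	// NOT in_stock needs the document universe, as noted above
	fmt.Println(len(not(allDocs, inStock)))
}
```

This is why the engine tracks an "allDocs" bitmap: NOT is only well-defined relative to the full document universe.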
Data Flow
How a hybrid search request flows through the system:
Step 1: Query Input
User Query:
+- Vector: [0.1, 0.5, ...]
+- Text: "machine learning"
+- Filters: {category="ai", price<100}
Step 2: Validation
+------------+
| Validation | ----> Error Handler
+-----+------+ (Invalid dimension, etc.)
|
Valid
Step 3: Metadata Pre-filtering
+-------------------------+
| Metadata Filter Engine |
| Apply: category="ai" |
| price<100 |
+-----+-------------------+
|
V
Candidates: {1, 5, 7, 12, 15}
Step 4: Parallel Search
Candidates - Split into both engines
+------------+ +------------+
| Vector | | Text |
| Search | | Search |
| (on cands) | | (on cands) |
+-----+------+ +-----+------+
| |
V V
Vec Results: Text Results:
{1: 0.2, {7: 8.5,
5: 0.3, 1: 7.2,
7: 0.4} 12: 6.8}
Step 5: Score Fusion
Both Result Sets
|
V
+---------------+
| Score Fusion |
| (RRF/Weighted)|
+-------+-------+
|
V
Fused Rankings
Step 6: Final Ranking
+---------------+
| Rank & Sort |
+-------+-------+
|
V
+---------------+
| Top-K Filter |
| (k=10) |
+-------+-------+
|
V
Final Results:
[{1: 0.8},
{7: 0.7},
{5: 0.6}]
Memory Layout
Understanding how different index types use memory:
HNSW Index Memory
File Header (24 bytes):
+-----------------------------+
| Magic: "HNSW" (4 B) |
| Version: 1 (4 B) |
| Dimensions: 384 (4 B) |
| M: 16 (4 B) |
| Max Level: 3 (4 B) |
| Entry Point: 42 (4 B) |
+-----------------------------+
Per-Node Storage:
Node ID: 4 bytes
Level: 4 bytes
Vector Data: 1536 bytes (384-dim x 4)
Graph Edges: 320 bytes (M connections x layers)
-------------
Total: ~1864 bytes per node
Scaling Analysis:
+-------------------+----------+-----------+
| Component | Per Node | 1M Nodes |
+-------------------+----------+-----------+
| Vectors (raw) | 1536 B | 1.46 GB |
| Graph structure | 320 B | 305 MB |
| Metadata | 8 B | 7.6 MB |
+-------------------+----------+-----------+
| Total | 1864 B | 1.78 GB |
+-------------------+----------+-----------+
Product Quantization Memory
Compression Overview:
Original Vector (384-dim):
384 x 4 bytes = 1536 bytes
|
Quantization
|
PQ Codes (8 subspaces):
8 x 1 byte = 8 bytes
|
192x smaller!
Codebook Storage:
8 codebooks x 256 centroids x 48 dims x 4 bytes
= 393 KB (shared across all vectors)
1M Vectors Comparison:
+-------------------+------------+----------+
| Format | Size | Ratio |
+-------------------+------------+----------+
| Original (float32)| 1.46 GB | 1x |
| PQ-8 | 7.6 MB | 192x |
| PQ-16 | 15.3 MB | 96x |
| PQ-32 | 30.5 MB | 48x |
| + Codebooks | +393 KB | - |
+-------------------+------------+----------+
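The table's ratios fall out of simple arithmetic. A sketch using the document's setup (384 dimensions, 256 centroids per subspace):

```go
package main

import "fmt"

// pqRatio returns the compression ratio of PQ codes vs. raw float32 vectors.
func pqRatio(dims, nSubspaces int) int {
	rawBytes := dims * 4        // float32 = 4 bytes per dimension
	codeBytes := nSubspaces * 1 // one uint8 code per subspace
	return rawBytes / codeBytes
}

// codebookBytes is the shared overhead:
// M codebooks x 256 centroids x (dims/M) sub-dimensions x 4 bytes.
func codebookBytes(dims, nSubspaces int) int {
	return nSubspaces * 256 * (dims / nSubspaces) * 4
}

func main() {
	fmt.Println(pqRatio(384, 8))       // 192x, as in the PQ-8 row
	fmt.Println(pqRatio(384, 16))      // 96x
	fmt.Println(pqRatio(384, 32))      // 48x
	fmt.Println(codebookBytes(384, 8)) // 393216 bytes ~ 393 KB, shared
}
```

Note that the codebook cost is fixed regardless of dataset size, which is why it amortizes to nothing at a million vectors.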
Core Concepts
Concept 1: Vector Storage and Distance Metrics
A vector store maintains high-dimensional embeddings and enables efficient similarity search. The choice of storage type determines the tradeoff between search speed, memory usage, and accuracy.
Example: How vectors are stored and searched
Given these input vectors:
Vector 1: [0.1, 0.5, 0.3, 0.8]
Vector 2: [0.2, 0.4, 0.7, 0.1]
Query: [0.15, 0.45, 0.5, 0.5]
The index computes distances and returns nearest neighbors (Euclidean/L2 shown):
+-------------+--------------------------------+----------+
| Vector ID   | L2 Distance to Query           | Rank     |
+-------------+--------------------------------+----------+
| 1           | 0.367 (closest)                | 1st      |
| 2           | 0.453 (further)                | 2nd      |
+-------------+--------------------------------+----------+
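Computing this by hand is a minimal sketch of what a flat index does per query (this toy loop is illustrative, not Comet's implementation):

```go
package main

import (
	"fmt"
	"math"
)

// l2 returns the Euclidean distance between two equal-length vectors.
func l2(a, b []float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i] - b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

func main() {
	v1 := []float32{0.1, 0.5, 0.3, 0.8}
	v2 := []float32{0.2, 0.4, 0.7, 0.1}
	q := []float32{0.15, 0.45, 0.5, 0.5}

	// Brute-force search: compare the query against every stored vector.
	fmt.Printf("v1: %.3f\n", l2(v1, q)) // ~0.367, rank 1
	fmt.Printf("v2: %.3f\n", l2(v2, q)) // ~0.453, rank 2
}
```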
Visual Representation: Storage Types Comparison
Flat Index (Brute Force)
Query -> Compare with ALL vectors -> Sort -> Return K
100% Accuracy
O(n) time complexity
Slow for large datasets
HNSW Index (Graph-Based)
Layer 2: *-------------* (highways - long jumps)
| |
Layer 1: *---*---*---* (roads - medium jumps)
| | \ | \ |
Layer 0: *-*-*-*-*-*-* (streets - all nodes)
Search: Start high - Navigate greedily - Descend
O(log n) time complexity
95-99% recall
2-3x memory overhead
IVF Index (Clustering)
Cluster 1: **** Cluster 2: ****
Cluster 3: **** Cluster 4: ****
Query -> Find nearest clusters -> Search within
Fast on large datasets
Requires training
~ 85-95% recall
Product Quantization (Compression)
Original: [0.123, 0.456, 0.789, ...]
| Compress
Quantized: [17, 42, 89, ...]
4 bytes each - 1 byte each
4-32x memory reduction
Slight accuracy loss
Benefits of Different Storage Types:
- Flat Index: Perfect recall, simple, no training. Use for small datasets (<100K vectors)
- HNSW: Excellent speed/accuracy tradeoff, no training. Use for most production workloads
- IVF: Fast filtering with clusters, scalable. Use for very large datasets (>10M vectors)
- PQ: Massive memory savings. Use when storage cost is a concern
- IVFPQ: Best of IVF + PQ. Use for extremely large, memory-constrained environments
Concept 2: BM25 Full-Text Search Algorithm
BM25 (Best Matching 25) ranks documents by relevance to a text query using term frequency and inverse document frequency.
Setup
Test Corpus:
Doc 1: "the quick brown fox jumps over the lazy dog" (9 words)
Doc 2: "the lazy dog sleeps" (4 words)
Doc 3: "quick brown rabbits jump" (4 words)
Query: "quick brown"
Parameters:
Average Document Length: 7 words (the worked example rounds the true average, 17/3 ~ 5.7, up to 7 to keep the arithmetic simple)
K1 = 1.2 (term frequency saturation)
B = 0.75 (length normalization)
Step 1: Calculate IDF (Inverse Document Frequency)
For term "quick":
Appears in: 2 out of 3 documents
IDF = log((N - df + 0.5) / (df + 0.5) + 1)
= log((3 - 2 + 0.5) / (2 + 0.5) + 1)
= log(1.5 / 2.5 + 1)
= log(1.6)
= 0.470
For term "brown":
Appears in: 2 out of 3 documents
IDF = 0.470 (same calculation)
Step 2: Calculate TF Component for Each Document
Doc 1 (9 words - longer than average):
Term "quick" (tf=1):
TF_score = (tf x (K1 + 1)) / (tf + K1 x (1 - B + B x (docLen/avgLen)))
= (1 x 2.2) / (1 + 1.2 x (1 - 0.75 + 0.75 x (9/7)))
= 2.2 / 2.457
= 0.895
Term "brown" (tf=1):
TF_score = 0.895 (same calculation)
Doc 2 (4 words):
Terms "quick" and "brown": tf=0 (not present)
TF_score = 0
Doc 3 (4 words - shorter than average):
Term "quick" (tf=1):
TF_score = (1 x 2.2) / (1 + 1.2 x (1 - 0.75 + 0.75 x (4/7)))
= 2.2 / 1.815
= 1.212
Term "brown" (tf=1):
TF_score = 1.212 (same calculation)
Step 3: Calculate Final BM25 Scores
Combine IDF x TF for each document:
Doc 1: (0.895 x 0.470) + (0.895 x 0.470) = 0.841
Doc 2: 0 (no matching terms)
Doc 3: (1.212 x 0.470) + (1.212 x 0.470) = 1.139
Final Ranking
+---------+----------+--------+
| Rank | Doc ID | Score |
+---------+----------+--------+
| 1st | Doc 3 | 1.139 |
| 2nd | Doc 1 | 0.841 |
| 3rd | Doc 2 | 0.000 |
+---------+----------+--------+
Why Doc 3 Ranks Higher
Doc 1 vs Doc 3:
+- Same term frequencies (1 occurrence each)
+- Doc 3 is shorter (4 words vs 9 words)
+- BM25 length normalization penalizes longer docs
+- Result: Doc 3 gets higher TF scores (1.212 vs 0.895)
Key Insight: Shorter documents with same term frequency
are considered more relevant
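The whole walk-through can be checked in a few lines of Go. This is a toy scorer hardcoded to the corpus above (avgLen fixed at 7 to match the worked example), not Comet's implementation:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

const (
	k1     = 1.2
	b      = 0.75
	avgLen = 7.0 // as stated in the worked example
)

// idf implements the IDF formula from Step 1: log((N - df + 0.5)/(df + 0.5) + 1).
func idf(nDocs, docFreq int) float64 {
	return math.Log((float64(nDocs)-float64(docFreq)+0.5)/(float64(docFreq)+0.5) + 1)
}

// bm25 scores one document against the query terms (Steps 2-3).
func bm25(doc string, query []string, nDocs int, docFreq map[string]int) float64 {
	words := strings.Fields(doc)
	tf := map[string]int{}
	for _, w := range words {
		tf[w]++
	}
	score := 0.0
	for _, term := range query {
		f := float64(tf[term])
		if f == 0 {
			continue // term absent: contributes nothing
		}
		norm := f + k1*(1-b+b*float64(len(words))/avgLen)
		score += idf(nDocs, docFreq[term]) * (f * (k1 + 1)) / norm
	}
	return score
}

func main() {
	docs := []string{
		"the quick brown fox jumps over the lazy dog",
		"the lazy dog sleeps",
		"quick brown rabbits jump",
	}
	query := []string{"quick", "brown"}
	df := map[string]int{"quick": 2, "brown": 2} // document frequencies
	for i, d := range docs {
		fmt.Printf("Doc %d: %.3f\n", i+1, bm25(d, query, len(docs), df))
	}
	// Doc 3 outranks Doc 1: same term counts, shorter document.
}
```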
Concept 3: Hybrid Search with Score Fusion
Hybrid search combines vector similarity, text relevance, and metadata filtering. Different fusion strategies handle score normalization differently.
Query Setup
INPUT:
+- Query: "machine learning tutorial"
+- Vector: [0.12, 0.45, 0.89, ...] (embedding)
+- Filter: category="education" AND price<50
Step 1: Apply Metadata Filter
+---------------------------------+
| Metadata Filtering |
| category="education" |
| price<50 |
+---------------------------------+
|
V
Candidate Docs: {1, 3, 5, 7, 9, 12, 15, 18, 20}
Step 2: Vector Search Results
Semantic Similarity (on candidates):
+--------+----------+------+
| Doc ID | Distance | Rank |
+--------+----------+------+
| 1 | 0.12 | 1 | - Closest
| 5 | 0.23 | 2 |
| 7 | 0.34 | 3 |
| 12 | 0.45 | 4 |
+--------+----------+------+
Step 3: Text Search Results
BM25 Ranking (on candidates):
+--------+------------+------+
| Doc ID | BM25 Score | Rank |
+--------+------------+------+
| 7 | 8.5 | 1 | - Most relevant
| 1 | 7.2 | 2 |
| 12 | 6.8 | 3 |
| 5 | 4.1 | 4 |
+--------+------------+------+
Step 4: Reciprocal Rank Fusion (RRF)
Formula: RRF_score = sum(1 / (K + rank_i)) where K=60
Doc 1 Calculation:
Vector rank 1 -> 1/(60+1) = 0.0164
Text rank 2 -> 1/(60+2) = 0.0161
--------
Combined RRF score: 0.0325
Doc 5 Calculation:
Vector rank 2 -> 1/(60+2) = 0.0161
Text rank 4 -> 1/(60+4) = 0.0156
--------
Combined RRF score: 0.0317
Doc 7 Calculation:
Vector rank 3 -> 1/(60+3) = 0.0159
Text rank 1 -> 1/(60+1) = 0.0164
--------
Combined RRF score: 0.0323
Doc 12 Calculation:
Vector rank 4 -> 1/(60+4) = 0.0156
Text rank 3 -> 1/(60+3) = 0.0159
--------
Combined RRF score: 0.0315
Final Ranking
+------+--------+-----------+------------------+
| Rank | Doc ID | RRF Score | Why |
+------+--------+-----------+------------------+
| 1st | 1 | 0.0325 | Best vector |
| 2nd | 7 | 0.0323 | Best text |
| 3rd | 5 | 0.0317 | Balanced |
| 4th | 12 | 0.0315 | Lower in both |
+------+--------+-----------+------------------+
Why RRF Over Weighted Sum?
Problem with Weighted Sum:
+- Vector distances: 0-2 range
+- BM25 scores: 0-100+ range
+- Different scales need manual tuning
RRF Advantages:
+- Scale Independent (uses ranks, not raw scores)
+- Robust (stable across score distributions)
+- No Tuning (no manual weight calibration)
+- Industry Standard (Elasticsearch, Vespa, etc.)
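The fusion step is a one-liner per document. A sketch that reproduces the Step 4 calculations (illustrative, not Comet's implementation):

```go
package main

import "fmt"

// rrfScore fuses one document's ranks across result lists: sum of 1/(K + rank).
func rrfScore(ranks []int) float64 {
	const K = 60 // standard RRF constant, as in the formula above
	score := 0.0
	for _, r := range ranks {
		score += 1.0 / float64(K+r)
	}
	return score
}

func main() {
	// {vector rank, text rank} for each candidate from Steps 2-3.
	docs := map[int][]int{1: {1, 2}, 5: {2, 4}, 7: {3, 1}, 12: {4, 3}}
	for id, ranks := range docs {
		fmt.Printf("Doc %d: %.4f\n", id, rrfScore(ranks))
	}
	// Doc 1 (0.0325) edges out Doc 7 (0.0323):
	// best in one list and strong in the other beats best-in-one-only.
}
```

Because only ranks enter the formula, the 0-2 distance scale and the 0-100+ BM25 scale never need to be reconciled.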
Storage Type Deep Dive
Choose your poison: Flat (exact), HNSW (graph), IVF (clustering), PQ (quantization), or IVFPQ (hybrid). Each trades speed, memory, and accuracy differently.
Full Deep Dive: See INDEX.md for complete algorithms, benchmarks, and implementation details.
Flat Index (Brute Force)
The Simplest Approach: Compare query against EVERY vector. 100% recall, zero approximation.
How It Works
Query Vector
|
+--------------------+
| Compare to ALL n |
| vectors | - Every single one checked
+--------------------+ No shortcuts!
|
Sort by distance
|
Return top K
Implementation
1. Store vectors - Raw float32 arrays
2. Search step - Compute distance to all vectors
3. Selection - Keep top-k closest
4. Index struct - None (just a list)
Complexity Analysis
+-------------+--------------------------+
| Operation | Complexity |
+-------------+--------------------------+
| Build | O(1) - just store |
| Search | O(n x d) - check all |
| Memory | O(n x d) - raw vectors |
| Recall | 100% - exhaustive |
+-------------+--------------------------+
When to Use
Use it for:
- Small datasets (<10K vectors)
- 100% recall requirements (fingerprinting, security)
- A benchmarking baseline
Avoid for:
- Large datasets (too slow)
HNSW Index (Hierarchical Graph)
The Graph Navigator: Build a multi-layer graph where each node connects to nearby vectors. Search by greedy navigation from layer to layer.
Graph Structure
Layer 2: [A]---------[D] - Highways (long jumps)
| | Few nodes
Layer 1: [A]--[B]----[D]--[E] - Roads (medium jumps)
| | \ | \ | More nodes
Layer 0: [A]-[B]-[C]-[D]-[E]-[F] - Streets (short links)
ALL vectors here
Search Process
1. Start at top layer - Long-range navigation
2. Greedy best move - Move to closest neighbor
3. Drop to next layer - Refine search
4. Repeat steps 2-3 - Until Layer 0
5. Return k-nearest - Final results
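Steps 2-3 are plain greedy descent. Here is a toy, single-layer version on 1-D "vectors"; real HNSW repeats this per layer and keeps an ef-sized candidate list rather than a single best node:

```go
package main

import (
	"fmt"
	"math"
)

// greedy walks the graph from start, always moving to the neighbor
// closest to the query, and stops at a local minimum.
func greedy(vecs []float64, edges map[int][]int, start int, query float64) int {
	cur := start
	for {
		best, bestDist := cur, math.Abs(vecs[cur]-query)
		for _, n := range edges[cur] {
			if d := math.Abs(vecs[n] - query); d < bestDist {
				best, bestDist = n, d
			}
		}
		if best == cur {
			return cur // no neighbor improves: local minimum reached
		}
		cur = best
	}
}

func main() {
	// Six 1-D points with short links plus one long-range edge (0 <-> 3),
	// the kind of shortcut the upper layers provide.
	vecs := []float64{0.0, 1.0, 2.0, 3.0, 4.0, 5.0}
	edges := map[int][]int{
		0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4, 0}, 4: {3, 5}, 5: {4},
	}
	fmt.Println(greedy(vecs, edges, 0, 4.2)) // navigates 0 -> 3 -> 4
}
```

The long-range edge lets the walk skip directly from node 0 to node 3 instead of stepping through every intermediate node, which is exactly the speedup the highway layers buy.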
How Insertion Works
New Vector:
+- 1. Add to Layer 0 (always)
+- 2. Randomly assign to higher layers (exponential decay)
+- 3. Connect to M nearest neighbors per layer
+- 4. Update neighbors' connections (bidirectional)
Complexity Analysis
+-------------+--------------------------------+
| Operation | Complexity |
+-------------+--------------------------------+
| Search | O(M x efSearch x log n) |
| | Typically checks <1% of data |
| Insert | O(M x efConstruction x log n) |
| Memory | O(n x d + n x M x log n) |
| Recall | 95-99% with proper tuning |
+-------------+--------------------------------+
Key Parameters
M (Connections per layer):
+- 4-8: Low memory, lower recall
+- 16: Balanced (default)
+- 32-48: High recall, more memory
efConstruction (Build quality):
+- 100: Fast build, lower quality
+- 200: Good balance (default)
+- 400+: Better graph, slower build
efSearch (Search quality):
+- 50: Fast search, ~85% recall
+- 200: Balanced (default), ~96% recall
+- 400+: High recall, slower search
When to Use
Use it for:
- Large datasets (10K-10M vectors)
- 95-99% recall requirements
- Sub-millisecond latency requirements
- Workloads that can afford 2-3x memory overhead
Avoid for:
- Memory-constrained environments
IVF Index (Inverted File with Clustering)
The Clustering Approach: Partition vectors into clusters using k-means. Search only the nearest clusters.
Build Phase (Training)
All Vectors (n)
|
k-means clustering
|
+------------------------------------+
| [C1] [C2] [C3] [C4] ... [Cn] | - Centroids
+------------------------------------+
| | | | |
{v1} {v8} {v3} {v12} {v7} - Vectors assigned
{v5} {v9} {v6} {v19} {v11} to nearest
{v20} ... ... ... ... centroid
Search Phase
Query Vector
|
Find nProbe nearest centroids
|
+----------------------------+
| [C2] [C3] (only these!) | - Search 2 of 100 clusters
+----------------------------+
| |
{...} {...} Search within clusters
|
Top-k results
Key Insight
Instead of: Check all 1M vectors
Do this: 1. Find 8 nearest clusters (O(nClusters))
2. Search 80K vectors total (10% of data)
+-> 10-20x faster!
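The two-stage search can be sketched in a few lines. Centroids and assignments are fixed by hand here; real IVF learns them with k-means during Train. Illustrative only, not Comet's implementation:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

func dist(a, b float64) float64 { return math.Abs(a - b) }

// ivfSearch finds the nProbe nearest centroids, then scans only their lists.
func ivfSearch(centroids []float64, lists map[int][]float64, query float64, nProbe int) float64 {
	// Stage 1: rank centroids by distance to the query.
	order := make([]int, len(centroids))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(a, b int) bool {
		return dist(centroids[order[a]], query) < dist(centroids[order[b]], query)
	})
	// Stage 2: brute-force scan, but only inside the nProbe selected lists.
	best, bestDist := math.NaN(), math.Inf(1)
	for _, c := range order[:nProbe] {
		for _, v := range lists[c] {
			if d := dist(v, query); d < bestDist {
				best, bestDist = v, d
			}
		}
	}
	return best
}

func main() {
	centroids := []float64{0, 10, 20}
	lists := map[int][]float64{ // vectors pre-assigned to their nearest centroid
		0: {1, 2}, 1: {9, 11, 12}, 2: {19, 21},
	}
	// With nProbe=1, only cluster 1's three vectors are scanned, not all seven.
	fmt.Println(ivfSearch(centroids, lists, 11.4, 1)) // 11
}
```

The recall risk is also visible here: if the true nearest neighbor sat in a cluster outside the nProbe selection, it would be missed, which is why raising nProbe raises recall.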
Complexity Analysis
+-------------+--------------------------------+
| Operation | Complexity |
+-------------+--------------------------------+
| Build | O(iterations x n x k x d) |
| | k-means clustering |
| Search | O(k x d + (n/k) x nProbe x d) |
| | Typically checks 5-20% of data |
| Memory | O(n x d + k x d) |
| Recall | 80-95% depending on nProbe |
+-------------+--------------------------------+
Key Parameters
nClusters (number of partitions):
+- Rule of thumb: sqrt(n) to n/10
+- 10K vectors - 100 clusters
+- 100K vectors - 316 clusters
+- 1M vectors - 1,000 clusters
+- More clusters - Faster search, slower build
nProbe (clusters to search):
+- 1: Fastest, ~60-70% recall
+- 8: Good balance, ~85% recall
+- 16: Better recall, ~92% recall
+- 32+: High recall, ~96% recall
When to Use
Use it for:
- Large datasets (>100K vectors)
- Workloads that tolerate 85-95% recall
- 10-20x speedups over Flat
- Cases where training data is available
Avoid for:
- 100% recall requirements (use Flat)
- Small datasets (<10K)
PQ Index (Product Quantization)
The Compression Master: Split vectors into subvectors, quantize each subvector to 256 codes. Reduce memory by 16-32x.
Compression Process
Original Vector (384 dimensions):
[0.23, 0.91, ..., 0.15, 0.44, ..., 0.73, 0.22, ...]
\_____48D_____/ \_____48D_____/ \_____48D_____/
Subspace 1 Subspace 2 Subspace 3
| | |
K-means on K-means on K-means on
subspace 1 subspace 2 subspace 3
| | |
Codebook with Codebook with Codebook with
256 centroids 256 centroids 256 centroids
| | |
Find nearest Find nearest Find nearest
centroid ID centroid ID centroid ID
| | |
[12] [203] [45]
1 byte 1 byte 1 byte
Result
Before: [0.23, 0.91, ...] - 384 x 4 bytes = 1536 bytes
After: [12, 203, 45, ...] - 8 x 1 byte = 8 bytes
192x compression!
Search Process
1. Query arrives
|
2. Split query into subspaces (like training)
|
3. Precompute distance tables for each subspace
(query_subspace to all 256 codebook centroids)
|
4. For each vector:
- Look up codes: [12, 203, 45, ...]
- Table lookup: distances[12] + distances[203] + ...
- No float operations! Just array lookups
|
5. Return top-k
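Steps 2-4 in miniature: two 1-D subspaces with 4-entry codebooks stand in for the 256-centroid codebooks described above. Per stored vector, the distance is just one table lookup per subspace. Illustrative only, not Comet's encoder:

```go
package main

import "fmt"

// Toy setup: 2-D vectors split into two 1-D subspaces, each with a
// 4-entry codebook (real PQ uses 256 centroids per subspace).
var codebooks = [2][4]float32{
	{0.0, 0.25, 0.5, 0.75},
	{0.1, 0.4, 0.6, 0.9},
}

func sq(x float32) float32 { return x * x }

// encode maps each subvector to the index of its nearest centroid.
func encode(v [2]float32) [2]uint8 {
	var code [2]uint8
	for s := 0; s < 2; s++ {
		for c := 1; c < 4; c++ {
			if sq(v[s]-codebooks[s][c]) < sq(v[s]-codebooks[s][code[s]]) {
				code[s] = uint8(c)
			}
		}
	}
	return code
}

// adc is the asymmetric distance computation: precompute query-to-centroid
// tables once, then score each stored code with pure array lookups.
func adc(query [2]float32, code [2]uint8) float32 {
	var tables [2][4]float32
	for s := 0; s < 2; s++ {
		for c := 0; c < 4; c++ {
			tables[s][c] = sq(query[s] - codebooks[s][c])
		}
	}
	return tables[0][code[0]] + tables[1][code[1]] // squared L2 via lookups
}

func main() {
	code := encode([2]float32{0.52, 0.38})
	fmt.Println(code) // nearest centroid index per subspace
	fmt.Println(adc([2]float32{0.5, 0.4}, code))
}
```

In a real index the tables are built once per query and reused across all n stored codes, which is where the cache-friendly O(n x M) scan in the complexity table comes from.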
Complexity Analysis
+-------------+--------------------------------+
| Operation | Complexity |
+-------------+--------------------------------+
| Training | O(M x iterations x K x n/M) |
| | K-means on each subspace |
| Encoding | O(M x K x d/M) per vector |
| Search | O(M x K + n x M) |
| | Super fast, cache-friendly |
| Memory | O(n x M) - massive savings! |
| Recall | 70-85% typical |
+-------------+--------------------------------+
Memory Comparison
1M vectors, 384 dimensions:
Float32: 1M x 384 x 4 bytes = 1.46 GB
PQ-8: 1M x 8 x 1 byte = 7.6 MB (192x smaller!)
PQ-16: 1M x 16 x 1 byte = 15.3 MB (96x smaller!)
PQ-32: 1M x 32 x 1 byte = 30.5 MB (48x smaller!)
+ Codebooks: ~393 KB (shared across all vectors)
Key Parameters
M (nSubspaces):
+- 8: Maximum compression, lower accuracy
+- 16: Good balance
+- 32: Better accuracy, less compression
+- 64: High accuracy, moderate compression
bitsPerCode:
+- 8: Standard (256 centroids per subspace)
Perfect for uint8 storage
When to Use
Use it for:
- Massive datasets (millions of vectors)
- Memory-bottlenecked deployments
- Workloads that tolerate 70-85% recall
- 30-200x memory reduction
Avoid for:
- High recall requirements (>95%)
- Deployments with plenty of memory
IVFPQ Index (Hybrid: Clustering + Quantization)
Best of Both Worlds: IVF clusters to reduce search space, PQ compression to reduce memory. The ultimate scalability index.
Build Process
Step 1: IVF Clustering
All Vectors
|
K-means clustering
|
[C1] [C2] [C3] [C4] ... [Cn]
Step 2: PQ Compression per Cluster
Cluster 1: Cluster 2: Cluster 3:
{vectors} {vectors} {vectors}
| | |
Apply PQ Apply PQ Apply PQ
| | |
[12,203,45] [91,34,178] [56,211,19]
[88,9,101] [23,156,88] [199,44,73]
[...] [...] [...]
uint8 codes uint8 codes uint8 codes
Search Process
Step 1: IVF Stage
Query - Find nProbe nearest centroids
|
[C2] [C3] [C7] (e.g., 3 of 100 clusters)
|
Search only 3% of the data!
Step 2: PQ Stage
For each selected cluster:
+- Precompute distance tables
+- Fast table lookups on PQ codes
+- No float operations
|
Top-k results
The Magic Combination
IVF contributes:
+- Speed: 10-100x faster (search only nProbe clusters)
PQ contributes:
+- Memory: 30-200x smaller (compressed codes)
Combined:
+- Fast + Tiny = Billion-scale capability!
Complexity Analysis
+-------------+--------------------------------+
| Operation | Complexity |
+-------------+--------------------------------+
| Training | O(IVF_kmeans + PQ_kmeans) |
| Search      | O(k x d + (n/k) x nProbe x M)  |
|             | Searches ~1-10% of data        |
| Memory      | O(n x M + k x d)               |
| | Massive compression |
| Recall | 70-90% depending on params |
+-------------+--------------------------------+
Memory Savings Example
1M vectors, 384 dimensions:
Float32 + IVF: 1.46 GB
IVFPQ (M=8): 7.6 MB + 400 KB (centroids)
= ~8 MB total
180x compression!
Key Parameters
nClusters (IVF):
+- 100: For 100K vectors
+- 1K: For 1M vectors
+- 10K: For 100M vectors
nProbe (IVF search):
+- 1: Fastest, lower recall
+- 8: Good balance
+- 16: Better recall
M (PQ subspaces):
+- 8: Maximum compression
+- 16: Good balance
+- 32: Better accuracy
When to Use
Use it for:
- Billion-scale datasets
- <10ms latency requirements
- Severe memory constraints
- Workloads that tolerate 70-90% recall
- 100x speed + 100x compression
Avoid for:
- >95% recall requirements (use HNSW)
- Small datasets (use Flat or HNSW)
Real-World Example
Use Case: 100M vectors, 768-dim
Flat Index:
+- Memory: ~288 GB
+- Search: ~10 seconds
+- Not feasible!
IVFPQ:
+- Memory: ~800 MB
+- Search: ~5 ms
+- Practical!
Decision Matrix
| Index | Recall | Search Speed | Memory | Build Time | Best For |
|---|---|---|---|---|---|
| Flat | 100% | Slow O(n) | High | Instant | <10k vectors, benchmarks |
| HNSW | 90-99% | Fast O(log n) | Highest | Slow | 10k-10M vectors, low latency |
| IVF | 80-95% | Medium | High | Medium | >100k vectors, moderate recall |
| PQ | 70-85% | Fast | Lowest | Slow | >1M vectors, memory-constrained |
| IVFPQ | 70-90% | Fastest | Very Low | Slow | >10M vectors, billion-scale |
Rules of thumb:
- Small data (<10k): Flat - brute force is fast enough
- Medium data (10k-1M): HNSW - best recall/speed tradeoff
- Large data (1M-100M): IVF or PQ - choose speed (IVF) vs memory (PQ)
- Massive data (>100M): IVFPQ - only option that scales to billions
Latency targets:
- Flat: 1-100ms (depends on n)
- HNSW: 0.1-2ms (sub-millisecond possible)
- IVF: 0.5-5ms
- PQ: 1-10ms (fast scan but approximate)
- IVFPQ: 0.5-3ms (fastest for massive datasets)
API Reference
For detailed API documentation, see API.md.
Examples
For practical examples and code samples, see EXAMPLE.md.
Configuration
Basic Configuration
// Flat Index - exact search, no tuning parameters
flatIdx, _ := comet.NewFlatIndex(384, comet.Cosine)
// HNSW Index - Configure build and search parameters
hnswIdx, _ := comet.NewHNSWIndex(
384, // dimensions
comet.Cosine, // distance metric
16, // M: connections per layer
200, // efConstruction: build quality
200, // efSearch: search quality
)
// IVF Index - Configure clustering
ivfIdx, _ := comet.NewIVFIndex(
384, // dimensions
comet.Cosine, // distance metric
100, // nClusters: number of partitions
)
// Training required for IVF
trainingVectors := []comet.VectorNode{ /* ... */ }
ivfIdx.Train(trainingVectors)
Configuration Options
HNSW Parameters
M (Connections Per Layer)
- Type: int
- Default: 16
- Range: 4 to 64
- Description: Number of bidirectional connections per node at each layer (except layer 0, which uses 2xM)
Effects of different values:
+---------+------------------------------------------+
| Value | Effect |
+---------+------------------------------------------+
| 4-8 | Low memory, faster build, lower recall |
| 12-16 | Balanced (recommended for most cases) |
| 24-48 | High recall, slower build, more memory |
| 64+ | Diminishing returns, excessive memory |
+---------+------------------------------------------+
efConstruction (Build-Time Candidate List Size)
- Type: int
- Default: 200
- Range: 100 to 1000
- Description: Size of the candidate list during index construction. Higher = better graph quality.
Effects:
+---------+------------------------------------------+
| Value | Effect |
+---------+------------------------------------------+
| 100 | Fast build, lower quality graph |
| 200 | Good balance (default) |
| 400-500 | Better quality, 2-3x slower build |
| 800+ | Marginal gains, very slow build |
+---------+------------------------------------------+
efSearch (Search-Time Candidate List Size)
- Type: int
- Default: 200
- Range: k to 1000 (must be >= k)
- Description: Size of the candidate list during search. Can be adjusted dynamically.
Effects:
+---------+------------------------------------------+
| Value | Effect |
+---------+------------------------------------------+
| 50 | Very fast, ~85% recall |
| 100 | Fast, ~92% recall |
| 200 | Balanced, ~96% recall |
| 400 | Slower, ~98% recall |
| 800 | Much slower, ~99% recall |
+---------+------------------------------------------+
Example:
index, _ := comet.NewHNSWIndex(
384,
comet.Cosine,
32, // More connections = better recall
400, // Higher construction quality
200, // Search quality (can adjust later)
)
// Dynamically adjust search quality
index.SetEfSearch(400) // Trade speed for recall
IVF Parameters
nClusters (Number of Clusters)
- Type: int
- Default: sqrt(n), where n = dataset size
- Range: 16 to n/100
- Description: Number of k-means clusters (Voronoi cells) to partition the space
+-------------+------------+-------------------------+
| Dataset Size| nClusters | Typical Range |
+-------------+------------+-------------------------+
| 10K | 100 | 50-200 |
| 100K | 316 | 200-500 |
| 1M | 1000 | 500-2000 |
| 10M | 3162 | 2000-5000 |
+-------------+------------+-------------------------+
nProbes (Search-Time Clusters to Probe)
- Type: int
- Default: 1
- Range: 1 to nClusters
- Description: Number of nearest clusters to search during a query
+---------+------------------------------------------+
| nProbes | Effect |
+---------+------------------------------------------+
| 1 | Fastest, ~60-70% recall |
| 8 | Good balance, ~85% recall |
| 16 | Better recall, ~92% recall |
| 32 | High recall, ~96% recall |
| 64 | Very high recall, slower |
+---------+------------------------------------------+
Example:
index, _ := comet.NewIVFIndex(384, comet.Cosine, 256)
// Train with representative sample
trainData := sampleVectors(10000) // 10K samples for training
index.Train(trainData)
// Search with multiple probes
results, _ := index.NewSearch().
WithQuery(query).
WithK(10).
WithNProbes(16). // Search 16 nearest clusters
Execute()
Advanced Configuration
Fusion Configuration
config := &comet.FusionConfig{
VectorWeight: 0.7, // 70% weight to semantic similarity
TextWeight: 0.3, // 30% weight to keyword relevance
}
fusion, _ := comet.NewFusion(comet.WeightedSumFusion, config)
hybridResults, _ := hybrid.NewSearch().
	WithVector(queryVec).
	WithText("machine learning").
	WithFusion(fusion).
	Execute()
// Reciprocal Rank Fusion (rank-based, no weights needed)
rrfResults, _ := hybrid.NewSearch().
	WithVector(queryVec).
	WithText("machine learning").
	WithFusionKind(comet.ReciprocalRankFusion).
	Execute()
Distance Metrics
// Euclidean: Standard L2 distance (general-purpose)
euclideanIdx, _ := comet.NewFlatIndex(384, comet.Euclidean)
// L2 Squared: Faster than Euclidean (no sqrt), preserves ranking
l2sqIdx, _ := comet.NewFlatIndex(384, comet.L2Squared)
// Cosine: Use for normalized vectors (text embeddings, etc.)
cosineIdx, _ := comet.NewFlatIndex(384, comet.Cosine)
API Details
FlatIndex
Constructor:
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| dim | int | Yes | Vector dimension (must be > 0) |
| distanceKind | DistanceKind | Yes | Distance metric: Euclidean, L2Squared, or Cosine |
Returns:
- *FlatIndex: Initialized flat index
- error: Error if parameters are invalid
HNSWIndex
Constructor:
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dim | int | Yes | - | Vector dimension (must be > 0) |
| distanceKind | DistanceKind | Yes | - | Distance metric: Euclidean, L2Squared, or Cosine |
| m | int | No | 16 | Max connections per node (higher = better accuracy, more memory) |
| efConstruction | int | No | 200 | Build-time search depth (higher = better graph quality, slower build) |
| efSearch | int | No | efConstruction | Query-time search depth (higher = better accuracy, slower search) |
Returns:
- *HNSWIndex: Initialized HNSW index
- error: Error if parameters are invalid
Parameter Guidelines:
+--------------+-----+----------------+----------+-------------+
| Use Case | M | efConstruction | efSearch | Description |
+--------------+-----+----------------+----------+-------------+
| Fast | 8 | 100 | 50 | Speed first |
| Balanced | 16 | 200 | 100 | Default |
| High Recall | 32 | 400 | 200 | Accuracy |
| Memory Eff. | 4 | 50 | 25 | Low memory |
+--------------+-----+----------------+----------+-------------+
IVFIndex
Constructor:
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| dim | int | Yes | Vector dimension (must be > 0) |
| nlist | int | Yes | Number of clusters/partitions (must be > 0) |
| distanceKind | DistanceKind | Yes | Distance metric: Euclidean, L2Squared, or Cosine |
Returns:
- *IVFIndex: Initialized IVF index
- error: Error if parameters are invalid
Parameter Guidelines:
nlist Selection:
- Small dataset (10K-100K): nlist = sqrt(n) = 100-300
- Medium dataset (100K-1M): nlist = sqrt(n) = 300-1000
- Large dataset (1M-10M): nlist = sqrt(n) = 1000-3000
- Very large (10M+): nlist = sqrt(n) = 3000-10000
Rule of thumb: nlist ≈ sqrt(number_of_vectors)
Search Parameters:
results, _ := index.NewSearch().
WithQuery(queryVec).
WithK(10).
WithNProbe(8). // Search top 8 nearest clusters
Execute()
Speed vs Accuracy:
- nProbe = 1: Fastest, lowest recall
- nProbe = 8: Balanced (typical)
- nProbe = nlist: Exhaustive (same as flat)
PQIndex
Constructor:
Parameters:
| Parameter | Type | Required | Constraint | Description |
|---|---|---|---|---|
| dim | int | Yes | > 0 | Vector dimension |
| distanceKind | DistanceKind | Yes | - | Distance metric: Euclidean, L2Squared, or Cosine |
| M | int | Yes | dim % M == 0 | Number of subspaces (dim must be divisible by M) |
| Nbits | int | Yes | 1-16 | Bits per subspace (typical: 8) |
Returns:
- *PQIndex: Initialized PQ index
- error: Error if parameters are invalid or constraints violated
Parameter Guidelines:
M (Number of Subspaces):
- M = 8: 192x compression (typical, good balance)
- M = 16: 96x compression (better accuracy)
- M = 32: 48x compression (highest accuracy)
- Constraint: dim must be divisible by M
Nbits (Codebook size = 2^Nbits):
- Nbits = 4: 16 centroids per subspace (very fast)
- Nbits = 8: 256 centroids per subspace (typical)
- Nbits = 12: 4096 centroids per subspace (high accuracy)
- Constraint: 1 <= Nbits <= 16
Common Configurations:
+------+-------+---------------+---------------------+
| M | Nbits | Compression | Use Case |
+------+-------+---------------+---------------------+
| 8 | 8 | 192x (1 byte) | Maximum compression |
| 16 | 8 | 96x (2 byte) | Balanced |
| 32 | 8 | 48x (4 byte) | Better accuracy |
| 8 | 12 | 128x (1.5 B) | Higher quality |
+------+-------+---------------+---------------------+
IVFPQIndex
Constructor:
Parameters:
| Parameter | Type | Required | Constraint | Description |
|---|---|---|---|---|
| dim | int | Yes | > 0 | Vector dimension |
| distanceKind | DistanceKind | Yes | - | Distance metric: Euclidean, L2Squared, or Cosine |
| nlist | int | Yes | > 0 | Number of IVF clusters |
| m | int | Yes | dim % m == 0 | Number of PQ subspaces |
| nbits | int | Yes | 1-16 | Bits per PQ subspace |
Returns:
- *IVFPQIndex: Initialized IVFPQ index
- error: Error if parameters are invalid
Parameter Guidelines:
Combined IVF + PQ Configuration:
IVF Parameters:
- nlist ≈ sqrt(n), where n = number of vectors
- See IVFIndex guidelines above
PQ Parameters:
- m: Typically 8, 16, or 32
- nbits: Typically 8
Common Configurations for 100M vectors (384-dim):
+--------+-----+--------+----------+-----------------+
| nlist | m | nbits | Memory | Use Case |
+--------+-----+--------+----------+-----------------+
| 4096 | 8 | 8 | ~800 MB | Extreme speed |
| 8192 | 8 | 8 | ~900 MB | Balanced |
| 16384 | 16 | 8 | ~1.6 GB | Better accuracy |
| 8192 | 8 | 12 | ~1.2 GB | High quality |
+--------+-----+--------+----------+-----------------+
Memory Savings Example (100M vectors, 384-dim):
- Original float32: 100M x 384 x 4 bytes ≈ 146 GB
- IVF only: still ~146 GB (no compression)
- PQ only: 100M x 8 x 1 byte ≈ 760 MB (faster to train)
- IVFPQ (m=8): 100M x 8 x 1 byte + centroids ≈ 800 MB (best of both!)
Search Parameters:
results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    WithNProbe(8).    // IVF: clusters to search
    WithNRefine(100). // Optional: refine top-100 with original vectors
    Execute()
BM25SearchIndex
Constructor:
Parameters: None (uses default BM25 parameters: k1=1.5, b=0.75)
Returns:
*BM25SearchIndex: Initialized BM25 full-text search index
BM25 Parameters:
- k1 = 1.5: Term frequency saturation (typical range: 1.2-2.0)
- b = 0.75: Document length normalization (typical range: 0.5-0.9)
RoaringMetadataIndex
Constructor:
Parameters: None
Returns:
*RoaringMetadataIndex: Initialized metadata filtering index using Roaring Bitmaps
HybridSearchIndex
Constructor:
func NewHybridSearchIndex(
    vectorIndex VectorIndex,
    textIndex TextIndex,
    metadataIndex MetadataIndex,
) HybridSearchIndex
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| vectorIndex | VectorIndex | Yes | Any vector index (Flat, HNSW, IVF, PQ, IVFPQ) |
| textIndex | TextIndex | Yes | BM25 text search index |
| metadataIndex | MetadataIndex | Yes | Roaring metadata filter index |
Returns:
- HybridSearchIndex: Initialized hybrid search combining all three modalities
Search APIs
Vector Search (VectorSearch Interface)
Creating a Search:
Methods:
| Method | Parameters | Description |
|---|---|---|
| WithQuery | queries ...[]float32 | Set query vector(s) for similarity search |
| WithNodes | nodes ...uint32 | Find vectors similar to indexed nodes by ID |
| WithK | k int | Number of results to return (default: 10) |
| WithDocumentIDs | ids ...uint32 | Pre-filter: only search within these document IDs |
| WithScoreAggregation | agg ScoreAggregation | How to combine multi-query results: SumAggregation, MaxAggregation, MeanAggregation |
| WithThreshold | threshold float32 | Only return results with distance <= threshold |
| WithAutocut | enabled bool | Automatically determine optimal cutoff point |
| Execute | - | Run the search; returns ([]VectorResult, error) |
HNSW-Specific Methods:
| Method | Parameters | Description |
|---|---|---|
| WithEfSearch | ef int | Override default efSearch for this query |
IVF-Specific Methods:
| Method | Parameters | Description |
|---|---|---|
| WithNProbe | nprobe int | Number of clusters to search (1 to nlist) |
IVFPQ-Specific Methods:
| Method | Parameters | Description |
|---|---|---|
| WithNProbe | nprobe int | Number of clusters to search |
| WithNRefine | nrefine int | Refine top-N results with original vectors |
Examples:
// Basic single-query search
results, err := index.NewSearch().
    WithQuery(queryVector).
    WithK(10).
    Execute()

// Multi-query with aggregation
results, _ := index.NewSearch().
    WithQuery(query1, query2, query3).
    WithK(20).
    WithScoreAggregation(comet.MeanAggregation).
    Execute()

// Node-based search (find similar to existing vectors)
results, _ := index.NewSearch().
    WithNodes(1, 5, 10). // IDs of indexed vectors
    WithK(10).
    Execute()

// Pre-filtered search
eligibleIDs := []uint32{100, 200, 300, 400, 500}
results, _ := index.NewSearch().
    WithQuery(queryVector).
    WithDocumentIDs(eligibleIDs...).
    WithK(5).
    Execute()

// Distance threshold filtering
results, _ := index.NewSearch().
    WithQuery(queryVector).
    WithK(100).
    WithThreshold(0.5). // Only distances <= 0.5
    Execute()

// Autocut (automatic result truncation)
results, _ := index.NewSearch().
    WithQuery(queryVector).
    WithK(100).
    WithAutocut(true). // Returns fewer if quality drops
    Execute()

// HNSW with custom efSearch
hnswIndex, _ := comet.NewHNSWIndex(384, comet.Cosine, 16, 200, 100)
results, _ := hnswIndex.NewSearch().
    WithQuery(queryVe