wizenheimer/comet


Comet

A high-performance hybrid vector store written in Go. Comet brings together multiple indexing strategies and search modalities into a unified, hackable package: hybrid retrieval with reciprocal rank fusion, autocut, pre-filtering, semantic search, full-text search, multi-KNN queries, and multi-query operations -- all in pure Go.

Understand search internals from the inside out. Built for hackers, not hyperscalers. Tiny enough to fit in your head. Decent enough to blow it.

At a glance:

  • Storage Backends: Flat (exact), HNSW (graph), IVF (clustering), PQ (quantization), or IVFPQ (hybrid)
  • Full-Text Search: BM25 ranking with tokenization and normalization
  • Metadata Filtering: Fast filtering using Roaring Bitmaps and Bit-Sliced Indexes
  • Ranking Programmability: Reciprocal Rank Fusion plus fixed-size, threshold-based, and dynamic result sets
  • Hybrid Search: Unified interface combining vector, text, and metadata search

Table of Contents

  • Overview
  • Features
  • Installation
  • Quick Start
  • Architecture
  • Core Concepts
  • API Reference
  • Examples
  • Configuration
  • API Details
  • Use Cases
  • Contributing
  • License

Overview

Everything you need to understand how vector databases actually work--and build one yourself.

What's inside:

  • 5 Vector Storage Types: Flat, HNSW, IVF, PQ, IVFPQ
  • 3 Distance Metrics: L2, L2 Squared, Cosine
  • Full-Text Search: BM25 ranking with Unicode tokenization
  • Metadata Filtering: Roaring bitmaps + Bit-Sliced Indexes
  • Hybrid Search: Combine vector + text + metadata with Reciprocal Rank Fusion
  • Advanced Search: Multi-KNN queries, multi-query operations, autocut result truncation
  • Production Features: Thread-safe, serialization, soft deletes, configurable parameters


Features

Vector Storage

  • Flat: Brute-force exact search (100% recall baseline)
  • HNSW: Hierarchical navigable small world graphs (95-99% recall, O(log n) search)
  • IVF: Inverted file index with k-means clustering (85-95% recall, 10-20x speedup)
  • PQ: Product quantization for compression (85-95% recall, 10-500x memory reduction)
  • IVFPQ: IVF + PQ combined (85-95% recall, 100x speedup + 500x compression)

Search Modalities

  • Vector Search: L2, L2 Squared, and Cosine distance metrics
  • Full-Text Search: BM25 ranking with Unicode-aware tokenization
  • Metadata Filtering: Boolean queries on structured attributes
  • Hybrid Search: Combine all three with configurable fusion strategies

Fusion Strategies

  • Weighted Sum: Linear combination with configurable weights
  • Reciprocal Rank Fusion (RRF): Scale-independent rank-based fusion
  • Max/Min Score: Simple score aggregation

Data Structures (The Good Stuff)

  • HNSW Graphs: Multi-layer skip lists for approximate nearest neighbor search
  • Roaring Bitmaps: Compressed bitmaps for metadata filtering (array, bitmap, run-length encoding)
  • Bit-Sliced Index (BSI): Efficient numeric range queries without full scans
  • Product Quantization Codebooks: Learned k-means centroids for vector compression
  • Inverted Indexes: Token-to-document mappings for full-text search

Other Capabilities

  • Quantization: Full precision, half precision, int8 precision
  • Soft Deletes: Fast deletion with lazy cleanup
  • Serialization: Persist and reload indexes
  • Thread-Safe: Concurrent read/write operations
  • Autocut: Automatic result truncation based on score gaps
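The autocut idea above -- truncating at a large score gap -- can be sketched in a few lines. This is a toy illustration of the concept, not Comet's actual heuristic, which may differ:

```go
package main

import "fmt"

// autocut truncates a descending result list at the largest gap
// between consecutive scores. A sketch of the idea only; Comet's
// actual gap heuristic may differ.
func autocut(scores []float64) []float64 {
	if len(scores) < 2 {
		return scores
	}
	cut, maxGap := len(scores), 0.0
	for i := 1; i < len(scores); i++ {
		gap := scores[i-1] - scores[i]
		if gap > maxGap {
			maxGap = gap
			cut = i
		}
	}
	return scores[:cut]
}

func main() {
	// Two tight clusters of scores with a large gap between them:
	// autocut keeps only the first cluster.
	scores := []float64{0.95, 0.93, 0.91, 0.40, 0.38}
	fmt.Println(autocut(scores)) // [0.95 0.93 0.91]
}
```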

Installation

go get github.com/wizenheimer/comet

Quick Start

package main

import (
	"fmt"
	"log"

	"github.com/wizenheimer/comet"
)

func main() {
	// Create a vector store (384-dimensional embeddings with cosine distance)
	index, err := comet.NewFlatIndex(384, comet.Cosine)
	if err != nil {
		log.Fatal(err)
	}

	// Add vectors
	vec1 := make([]float32, 384)
	// ... populate vec1 with your embedding ...
	node := comet.NewVectorNode(vec1)
	index.Add(*node)

	// Search for similar vectors
	query := make([]float32, 384)
	// ... populate query vector ...
	results, err := index.NewSearch().
		WithQuery(query).
		WithK(10).
		Execute()

	if err != nil {
		log.Fatal(err)
	}

	// Process results
	for i, result := range results {
		fmt.Printf("%d. ID=%d, Score=%.4f\n", i+1, result.GetId(), result.GetScore())
	}
}

Output:

1. ID=123, Score=0.0234
2. ID=456, Score=0.0567
3. ID=789, Score=0.0823
...

Architecture

System Architecture

Comet is organized into three main search engines that can work independently or together:

Application Layer

+-------------------------------------------------------------+
| Your Application |
| (Using Comet as a Go Library) |
+----------------------+--------------------------------------+
|
+-------------+-------------+
| | |
V V V
Vector Text Metadata

Search Engine Layer

+-------------+ +-------------+ +-------------+
| Vector | | Text | | Metadata |
| Search | | Search | | Filtering |
| Engine | | Engine | | Engine |
+------+------+ +------+------+ +------+------+
| | |
| Semantic | Keywords | Filters
| Similarity | + Relevance | + Boolean Logic
V V V

Index Storage Layer

+-------------+ +-------------+ +-------------+
| HNSW / IVF | | BM25 Index | | Roaring |
| / PQ / Flat | | (Inverted) | | Bitmaps |
+-------------+ +-------------+ +-------------+
Graph/Trees Token-DocIDs Compressed Sets

Hybrid Coordinator

All Three Engines
|
V
+-----------------+
| Hybrid Search |
| Coordinator |
| (Score Fusion) |
+-----------------+
|
V
Combined Results

Component Details

Component A: Vector Storage Engine

Manages vector storage and similarity search across multiple index types.

Common Interface:

+------------------------------------+
| VectorIndex Interface |
| |
| +- Train(vectors) |
| +- Add(vector) |
| +- Remove(vector) |
| +- NewSearch() |
+------------------------------------+

Available Implementations:

FlatIndex - Brute force, 100% recall
HNSWIndex - Graph-based, O(log n)
IVFIndex - Clustering, 10-20x faster
PQIndex - Quantization, 10-500x compression
IVFPQIndex - Hybrid, best of IVF + PQ

Responsibilities:

  • Vector preprocessing (normalization for cosine distance)
  • Distance calculations (Euclidean, L2 Squared, Cosine)
  • K-nearest neighbor search
  • Serialization/deserialization
  • Soft delete management with flush mechanism

Performance Characteristics:

  • Flat: O(n x d) search, 100% recall
  • HNSW: O(M x ef x log n) search, 95-99% recall
  • IVF: O(nProbes x (n/k) x d) search, 85-95% recall

Component B: Text Search Engine

Full-text search using BM25 ranking algorithm.

Inverted Index:

+------------------------------------+
| term - RoaringBitmap(docIDs) |
| |
| "machine" - {1, 5, 12, 45} |
| "learning" - {1, 3, 12, 20} |
| "neural" - {3, 20, 45} |
+------------------------------------+

Term Frequencies:

+------------------------------------+
| term - {docID: count} |
| |
| "machine" - {1: 3, 5: 1, 12: 2} |
+------------------------------------+

Document Stats:

+------------------------------------+
| docID - (length, token_count) |
| |
| 1 - (250 chars, 45 tokens) |
| 5 - (180 chars, 32 tokens) |
+------------------------------------+

Responsibilities:

  • Text tokenization (UAX#29 word segmentation)
  • Unicode normalization (NFKC)
  • Inverted index maintenance
  • BM25 score calculation
  • Top-K retrieval with heap

Performance Characteristics:

  • Add: O(m) where m = tokens
  • Search: O(q x d_avg) where q = query terms, d_avg = avg docs per term
  • Memory: Compressed inverted index, no original text stored

Component C: Metadata Filter Engine

Fast filtering using compressed bitmaps.

Categorical Fields (Roaring Bitmaps):

+------------------------------------+
| field:value - Bitmap(docIDs) |
| |
| "category:electronics" - {1,5,12} |
| "category:books" - {2,8,15} |
| "in_stock:true" - {1,2,5} |
+------------------------------------+

Numeric Fields (Bit-Sliced Index):

+------------------------------------+
| field - BSI (range queries) |
| |
| "price" - [0-1000, 1000-5000] |
| "rating" - [0-5 scale] |
+------------------------------------+

Document Universe:

+------------------------------------+
| allDocs - Bitmap(all IDs) |
| |
| Used for NOT operations |
+------------------------------------+

Responsibilities:

  • Bitmap index maintenance (Roaring compression)
  • BSI for numeric range queries
  • Boolean query evaluation (AND, OR, NOT)
  • Existence checks
  • Set operations (IN, NOT IN)

Performance Characteristics:

  • Add: O(f) where f = fields
  • Query: O(f x log n) with bitmap operations
  • Memory: Highly compressed, 1-10% of uncompressed
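The boolean evaluation above can be sketched with plain Go maps standing in for compressed bitmaps. This is a toy model of the idea (real Roaring Bitmaps add compression and vectorized set operations), not Comet's API:

```go
package main

import "fmt"

// bitmap is a toy stand-in for a compressed Roaring Bitmap:
// a set of document IDs supporting AND and NOT.
type bitmap map[uint32]struct{}

// and intersects two posting bitmaps (boolean AND).
func and(a, b bitmap) bitmap {
	out := bitmap{}
	for id := range a {
		if _, ok := b[id]; ok {
			out[id] = struct{}{}
		}
	}
	return out
}

// not complements a bitmap against the document universe,
// which is why the engine tracks an allDocs bitmap.
func not(universe, a bitmap) bitmap {
	out := bitmap{}
	for id := range universe {
		if _, ok := a[id]; !ok {
			out[id] = struct{}{}
		}
	}
	return out
}

func main() {
	// field:value -> posting bitmap, as in the diagram above
	electronics := bitmap{1: {}, 5: {}, 12: {}}
	inStock := bitmap{1: {}, 2: {}, 5: {}}
	universe := bitmap{1: {}, 2: {}, 5: {}, 8: {}, 12: {}, 15: {}}

	// category="electronics" AND in_stock=true -> {1, 5}
	fmt.Println(len(and(electronics, inStock))) // 2

	// NOT in_stock=true, evaluated against the universe -> {8, 12, 15}
	fmt.Println(len(not(universe, inStock))) // 3
}
```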

Data Flow

How a hybrid search request flows through the system:

Step 1: Query Input

User Query:
+- Vector: [0.1, 0.5, ...]
+- Text: "machine learning"
+- Filters: {category="ai", price<100}

Step 2: Validation

+------------+
| Validation | ----> Error Handler
+-----+------+ (Invalid dimension, etc.)
|
Valid

Step 3: Metadata Pre-filtering

+-------------------------+
| Metadata Filter Engine |
| Apply: category="ai" |
| price<100 |
+-----+-------------------+
|
V
Candidates: {1, 5, 7, 12, 15}

Step 4: Parallel Search

Candidates - Split into both engines

+------------+ +------------+
| Vector | | Text |
| Search | | Search |
| (on cands) | | (on cands) |
+-----+------+ +-----+------+
| |
V V
Vec Results: Text Results:
{1: 0.2, {7: 8.5,
5: 0.3, 1: 7.2,
7: 0.4} 12: 6.8}

Step 5: Score Fusion

Both Result Sets
|
V
+---------------+
| Score Fusion |
| (RRF/Weighted)|
+-------+-------+
|
V
Fused Rankings

Step 6: Final Ranking

+---------------+
| Rank & Sort |
+-------+-------+
|
V
+---------------+
| Top-K Filter |
| (k=10) |
+-------+-------+
|
V
Final Results:
[{1: 0.8},
{7: 0.7},
{5: 0.6}]

Memory Layout

Understanding how different index types use memory:

HNSW Index Memory

File Header (24 bytes):

+-----------------------------+
| Magic: "HNSW" (4 B) |
| Version: 1 (4 B) |
| Dimensions: 384 (4 B) |
| M: 16 (4 B) |
| Max Level: 3 (4 B) |
| Entry Point: 42 (4 B) |
+-----------------------------+

Per-Node Storage:

Node ID: 4 bytes
Level: 4 bytes
Vector Data: 1536 bytes (384-dim x 4)
Graph Edges: 320 bytes (M connections x layers)
-------------
Total: ~1864 bytes per node

Scaling Analysis:

+-------------------+----------+-----------+
| Component | Per Node | 1M Nodes |
+-------------------+----------+-----------+
| Vectors (raw) | 1536 B | 1.46 GB |
| Graph structure | 320 B | 305 MB |
| Metadata | 8 B | 7.6 MB |
+-------------------+----------+-----------+
| Total | 1864 B | 1.78 GB |
+-------------------+----------+-----------+

Product Quantization Memory

Compression Overview:

Original Vector (384-dim):
384 x 4 bytes = 1536 bytes
|
Quantization
|
PQ Codes (8 subspaces):
8 x 1 byte = 8 bytes
|
192x smaller!

Codebook Storage:

8 codebooks x 256 centroids x 48 dims x 4 bytes
= 393 KB (shared across all vectors)

1M Vectors Comparison:

+-------------------+------------+----------+
| Format | Size | Ratio |
+-------------------+------------+----------+
| Original (float32)| 1.46 GB | 1x |
| PQ-8 | 7.6 MB | 192x |
| PQ-16 | 15.3 MB | 96x |
| PQ-32 | 30.5 MB | 48x |
| + Codebooks | +393 KB | - |
+-------------------+------------+----------+
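The figures in the comparison above are straightforward arithmetic; a few lines reproduce them (sizes in MiB/KB, matching the table):

```go
package main

import "fmt"

func main() {
	const (
		n   = 1_000_000 // vectors
		dim = 384       // dimensions
	)
	// Raw float32 storage: n x dim x 4 bytes
	rawMB := float64(n*dim*4) / (1 << 20)
	fmt.Printf("raw float32: %.1f MB\n", rawMB) // 1464.8 MB (~1.46 GB)

	// PQ-M storage: one 1-byte code per subspace per vector
	for _, m := range []int{8, 16, 32} {
		codeMB := float64(n*m) / (1 << 20)
		fmt.Printf("PQ-%d: %.1f MB (%dx smaller)\n", m, codeMB, dim*4/m)
	}

	// Shared codebooks for M=8: 8 x 256 centroids x 48 dims x 4 bytes
	codebookKB := float64(8*256*(dim/8)*4) / 1000
	fmt.Printf("codebooks: %.0f KB\n", codebookKB) // 393 KB
}
```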

Core Concepts

Concept 1: Vector Storage and Distance Metrics

A vector store maintains high-dimensional embeddings and enables efficient similarity search. The choice of storage type determines the tradeoff between search speed, memory usage, and accuracy.

Example: How vectors are stored and searched

Given these input vectors:

Vector 1: [0.1, 0.5, 0.3, 0.8]
Vector 2: [0.2, 0.4, 0.7, 0.1]
Query: [0.15, 0.45, 0.5, 0.5]

The index computes distances and returns nearest neighbors:

+-------------+--------------------------------+----------+
| Vector ID | Distance to Query | Rank |
+-------------+--------------------------------+----------+
| 1 | 0.234 (closest) | 1st |
| 2 | 0.567 (further) | 2nd |
+-------------+--------------------------------+----------+

Visual Representation: Storage Types Comparison

Flat Index (Brute Force)

Query - Compare with ALL vectors - Sort - Return K

100% Accuracy
O(n) time complexity
Slow for large datasets

HNSW Index (Graph-Based)

Layer 2: *-------------* (highways - long jumps)
| |
Layer 1: *---*---*---* (roads - medium jumps)
| | \ | \ |
Layer 0: *-*-*-*-*-*-* (streets - all nodes)

Search: Start high - Navigate greedily - Descend

O(log n) time complexity
95-99% recall
2-3x memory overhead

IVF Index (Clustering)

Cluster 1: **** Cluster 2: ****
Cluster 3: **** Cluster 4: ****

Query - Find nearest clusters - Search within

Fast on large datasets
Requires training
~ 85-95% recall

Product Quantization (Compression)

Original: [0.123, 0.456, 0.789, ...]
| Compress
Quantized: [17, 42, 89, ...]

4 bytes each - 1 byte each

4-32x memory reduction
Slight accuracy loss

Benefits of Different Storage Types:

  • Flat Index: Perfect recall, simple, no training. Use for small datasets (<100K vectors)
  • HNSW: Excellent speed/accuracy tradeoff, no training. Use for most production workloads
  • IVF: Fast filtering with clusters, scalable. Use for very large datasets (>10M vectors)
  • PQ: Massive memory savings. Use when storage cost is a concern
  • IVFPQ: Best of IVF + PQ. Use for extremely large, memory-constrained environments

Concept 2: BM25 Full-Text Search Algorithm

BM25 (Best Matching 25) ranks documents by relevance to a text query using term frequency and inverse document frequency.

Setup

Test Corpus:

Doc 1: "the quick brown fox jumps over the lazy dog" (9 words)
Doc 2: "the lazy dog sleeps" (4 words)
Doc 3: "quick brown rabbits jump" (4 words)

Query: "quick brown"

Parameters:

Average Document Length: taken as 7 words for this example
(the exact corpus average is 17/3 ≈ 5.67)
K1 = 1.2 (term frequency saturation)
B = 0.75 (length normalization)

Step 1: Calculate IDF (Inverse Document Frequency)

For term "quick":

Appears in: 2 out of 3 documents

IDF = log((N - df + 0.5) / (df + 0.5) + 1)
= log((3 - 2 + 0.5) / (2 + 0.5) + 1)
= log(1.5 / 2.5 + 1)
= log(1.6)
= 0.470

For term "brown":

Appears in: 2 out of 3 documents
IDF = 0.470 (same calculation)

Step 2: Calculate TF Component for Each Document

Doc 1 (9 words - longer than average):

Term "quick" (tf=1):
TF_score = (tf x (K1 + 1)) / (tf + K1 x (1 - B + B x (docLen/avgLen)))
= (1 x 2.2) / (1 + 1.2 x (1 - 0.75 + 0.75 x (9/7)))
= 2.2 / 2.457
= 0.895

Term "brown" (tf=1):
TF_score = 0.895 (same calculation)

Doc 2 (4 words):

Terms "quick" and "brown": tf=0 (not present)
TF_score = 0

Doc 3 (4 words - shorter than average):

Term "quick" (tf=1):
TF_score = (1 x 2.2) / (1 + 1.2 x (1 - 0.75 + 0.75 x (4/7)))
= 2.2 / 1.815
= 1.212

Term "brown" (tf=1):
TF_score = 1.212 (same calculation)

Step 3: Calculate Final BM25 Scores

Combine IDF x TF for each document:

Doc 1: (0.895 x 0.470) + (0.895 x 0.470) = 0.841
Doc 2: 0 (no matching terms)
Doc 3: (1.212 x 0.470) + (1.212 x 0.470) = 1.139

Final Ranking

+---------+----------+--------+
| Rank | Doc ID | Score |
+---------+----------+--------+
| 1st | Doc 3 | 1.139 |
| 2nd | Doc 1 | 0.841 |
| 3rd | Doc 2 | 0.000 |
+---------+----------+--------+

Why Doc 3 Ranks Higher

Doc 1 vs Doc 3:
+- Same term frequencies (1 occurrence each)
+- Doc 3 is shorter (4 words vs 9 words)
+- BM25 length normalization penalizes longer docs
+- Result: Doc 3 gets higher TF scores (1.212 vs 0.895)

Key Insight: Shorter documents with same term frequency
are considered more relevant

Concept 3: Hybrid Search with Score Fusion

Hybrid search combines vector similarity, text relevance, and metadata filtering. Different fusion strategies handle score normalization differently.

Query Setup

INPUT:
+- Query: "machine learning tutorial"
+- Vector: [0.12, 0.45, 0.89, ...] (embedding)
+- Filter: category="education" AND price<50

Step 1: Apply Metadata Filter

+---------------------------------+
| Metadata Filtering |
| category="education" |
| price<50 |
+---------------------------------+
|
V
Candidate Docs: {1, 3, 5, 7, 9, 12, 15, 18, 20}

Step 2: Vector Search Results

Semantic Similarity (on candidates):

+--------+----------+------+
| Doc ID | Distance | Rank |
+--------+----------+------+
| 1 | 0.12 | 1 | - Closest
| 5 | 0.23 | 2 |
| 7 | 0.34 | 3 |
| 12 | 0.45 | 4 |
+--------+----------+------+

Step 3: Text Search Results

BM25 Ranking (on candidates):

+--------+------------+------+
| Doc ID | BM25 Score | Rank |
+--------+------------+------+
| 7 | 8.5 | 1 | - Most relevant
| 1 | 7.2 | 2 |
| 12 | 6.8 | 3 |
| 5 | 4.1 | 4 |
+--------+------------+------+

Step 4: Reciprocal Rank Fusion (RRF)

Formula: RRF_score = sum(1 / (K + rank_i)) where K=60

Doc 1 Calculation:

Vector rank: 1 - 1/(60+1) = 0.0164
Text rank: 2 - 1/(60+2) = 0.0161
--------
Combined RRF score: 0.0325

Doc 5 Calculation:

Vector rank: 2 - 1/(60+2) = 0.0161
Text rank: 4 - 1/(60+4) = 0.0156
--------
Combined RRF score: 0.0317

Doc 7 Calculation:

Vector rank: 3 - 1/(60+3) = 0.0159
Text rank: 1 - 1/(60+1) = 0.0164
--------
Combined RRF score: 0.0323

Doc 12 Calculation:

Vector rank: 4 - 1/(60+4) = 0.0156
Text rank: 3 - 1/(60+3) = 0.0159
--------
Combined RRF score: 0.0315

Final Ranking

+------+--------+-----------+------------------+
| Rank | Doc ID | RRF Score | Why |
+------+--------+-----------+------------------+
| 1st | 1 | 0.0325 | Best vector |
| 2nd | 7 | 0.0323 | Best text |
| 3rd | 5 | 0.0317 | Balanced |
| 4th | 12 | 0.0315 | Lower in both |
+------+--------+-----------+------------------+

Why RRF Over Weighted Sum?

Problem with Weighted Sum:
+- Vector distances: 0-2 range
+- BM25 scores: 0-100+ range
+- Different scales need manual tuning

RRF Advantages:
+- Scale Independent (uses ranks, not raw scores)
+- Robust (stable across score distributions)
+- No Tuning (no manual weight calibration)
+- Industry Standard (Elasticsearch, Vespa, etc.)
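The RRF calculations above are easy to reproduce. This standalone sketch (a toy helper, not Comet's API) fuses the two rank lists from the example:

```go
package main

import (
	"fmt"
	"sort"
)

// rrf fuses per-engine rankings into one score per document:
// score(d) = sum over engines of 1 / (K + rank_d).
func rrf(rankings []map[int]int, k float64) map[int]float64 {
	fused := map[int]float64{}
	for _, ranking := range rankings {
		for docID, rank := range ranking {
			fused[docID] += 1.0 / (k + float64(rank))
		}
	}
	return fused
}

func main() {
	// Ranks from the example: vector search and text search
	vectorRanks := map[int]int{1: 1, 5: 2, 7: 3, 12: 4}
	textRanks := map[int]int{7: 1, 1: 2, 12: 3, 5: 4}

	fused := rrf([]map[int]int{vectorRanks, textRanks}, 60)

	// Sort doc IDs by fused score, descending
	ids := []int{1, 5, 7, 12}
	sort.Slice(ids, func(i, j int) bool { return fused[ids[i]] > fused[ids[j]] })
	for _, id := range ids {
		fmt.Printf("Doc %d: %.4f\n", id, fused[id]) // 1, 7, 5, 12
	}
}
```

Note that only ranks enter the formula -- the raw distances (0-2 range) and BM25 scores (0-100+ range) never need to be put on a common scale.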

Storage Type Deep Dive

Choose your poison: Flat (exact), HNSW (graph), IVF (clustering), PQ (quantization), or IVFPQ (hybrid). Each trades speed, memory, and accuracy differently.

Full Deep Dive: See INDEX.md for complete algorithms, benchmarks, and implementation details.

Flat Index (Brute Force)

The Simplest Approach: Compare query against EVERY vector. 100% recall, zero approximation.

How It Works

Query Vector
|
+--------------------+
| Compare to ALL n |
| vectors | - Every single one checked
+--------------------+ No shortcuts!
|
Sort by distance
|
Return top K

Implementation

1. Store vectors - Raw float32 arrays
2. Search step - Compute distance to all vectors
3. Selection - Keep top-k closest
4. Index struct - None (just a list)

Complexity Analysis

+-------------+--------------------------+
| Operation | Complexity |
+-------------+--------------------------+
| Build | O(1) - just store |
| Search | O(n x d) - check all |
| Memory | O(n x d) - raw vectors |
| Recall | 100% - exhaustive |
+-------------+--------------------------+
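The O(n x d) loop in the table is all there is to a flat index. A minimal brute-force k-NN sketch (not Comet's internal implementation):

```go
package main

import (
	"fmt"
	"sort"
)

// flatSearch scores every stored vector against the query, sorts,
// and returns the k closest IDs -- the exhaustive O(n x d) scan.
func flatSearch(vectors map[int][]float32, query []float32, k int) []int {
	type hit struct {
		id   int
		dist float32
	}
	hits := make([]hit, 0, len(vectors))
	for id, v := range vectors {
		var d float32
		for i := range v {
			diff := v[i] - query[i]
			d += diff * diff // L2 squared: same ranking as L2, no sqrt
		}
		hits = append(hits, hit{id, d})
	}
	sort.Slice(hits, func(i, j int) bool { return hits[i].dist < hits[j].dist })
	if k > len(hits) {
		k = len(hits)
	}
	ids := make([]int, k)
	for i := range ids {
		ids[i] = hits[i].id
	}
	return ids
}

func main() {
	vectors := map[int][]float32{
		1: {0.0, 1.0},
		2: {1.0, 1.0},
		3: {5.0, 5.0},
	}
	fmt.Println(flatSearch(vectors, []float32{0.1, 0.9}, 2)) // [1 2]
}
```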

When to Use

Use for:
  • Small datasets (<10K vectors)
  • 100% recall requirements (fingerprinting, security)
  • Benchmarking baselines

Avoid for:
  • Large datasets (too slow)

HNSW Index (Hierarchical Graph)

The Graph Navigator: Build a multi-layer graph where each node connects to nearby vectors. Search by greedy navigation from layer to layer.

Graph Structure

Layer 2: [A]---------[D] - Highways (long jumps)
| | Few nodes

Layer 1: [A]--[B]----[D]--[E] - Roads (medium jumps)
| | \ | \ | More nodes

Layer 0: [A]-[B]-[C]-[D]-[E]-[F] - Streets (short links)
ALL vectors here

Search Process

1. Start at top layer - Long-range navigation
2. Greedy best move - Move to closest neighbor
3. Drop to next layer - Refine search
4. Repeat steps 2-3 - Until Layer 0
5. Return k-nearest - Final results

How Insertion Works

New Vector:
+- 1. Add to Layer 0 (always)
+- 2. Randomly assign to higher layers (exponential decay)
+- 3. Connect to M nearest neighbors per layer
+- 4. Update neighbors' connections (bidirectional)

Complexity Analysis

+-------------+--------------------------------+
| Operation | Complexity |
+-------------+--------------------------------+
| Search | O(M x efSearch x log n) |
| | Typically checks <1% of data |
| Insert | O(M x efConstruction x log n) |
| Memory | O(n x d + n x M x log n) |
| Recall | 95-99% with proper tuning |
+-------------+--------------------------------+

Key Parameters

M (Connections per layer):
+- 4-8: Low memory, lower recall
+- 16: Balanced (default)
+- 32-48: High recall, more memory

efConstruction (Build quality):
+- 100: Fast build, lower quality
+- 200: Good balance (default)
+- 400+: Better graph, slower build

efSearch (Search quality):
+- 50: Fast search, ~85% recall
+- 200: Balanced (default), ~96% recall
+- 400+: High recall, slower search

When to Use

Use for:
  • Large datasets (10K-10M vectors)
  • 95-99% recall targets
  • Sub-millisecond latency requirements
  • Workloads that can afford 2-3x memory overhead

Avoid for:
  • Memory-constrained environments

IVF Index (Inverted File with Clustering)

The Clustering Approach: Partition vectors into clusters using k-means. Search only the nearest clusters.

Build Phase (Training)

All Vectors (n)
|
k-means clustering
|
+------------------------------------+
| [C1] [C2] [C3] [C4] ... [Cn] | - Centroids
+------------------------------------+
| | | | |
{v1} {v8} {v3} {v12} {v7} - Vectors assigned
{v5} {v9} {v6} {v19} {v11} to nearest
{v20} ... ... ... ... centroid

Search Phase

Query Vector
|
Find nProbe nearest centroids
|
+----------------------------+
| [C2] [C3] (only these!) | - Search 2 of 100 clusters
+----------------------------+
| |
{...} {...} Search within clusters
|
Top-k results

Key Insight

Instead of: Check all 1M vectors
Do this: 1. Find 8 nearest clusters (O(nClusters))
2. Search 80K vectors total (10% of data)
+-> 10-20x faster!

Complexity Analysis

+-------------+--------------------------------+
| Operation | Complexity |
+-------------+--------------------------------+
| Build | O(iterations x n x k x d) |
| | k-means clustering |
| Search | O(k x d + (n/k) x nProbe x d) |
| | Typically checks 5-20% of data |
| Memory | O(n x d + k x d) |
| Recall | 80-95% depending on nProbe |
+-------------+--------------------------------+

Key Parameters

nClusters (number of partitions):
+- Rule of thumb: sqrt(n) to n/10
+- 10K vectors - 100 clusters
+- 100K vectors - 316 clusters
+- 1M vectors - 1,000 clusters
+- More clusters - Faster search, slower build

nProbe (clusters to search):
+- 1: Fastest, ~60-70% recall
+- 8: Good balance, ~85% recall
+- 16: Better recall, ~92% recall
+- 32+: High recall, ~96% recall

When to Use

Use for:
  • Large datasets (>100K vectors)
  • Workloads that can tolerate 85-95% recall
  • 10-20x speedup targets over Flat
  • Cases where training data is available

Avoid for:
  • 100% recall requirements (use Flat)
  • Very small datasets (<10K)

PQ Index (Product Quantization)

The Compression Master: Split vectors into subvectors, quantize each subvector to 256 codes. Reduce memory by 16-32x.

Compression Process

Original Vector (384 dimensions):
[0.23, 0.91, ..., 0.15, 0.44, ..., 0.73, 0.22, ...]
\_____48D_____/ \_____48D_____/ \_____48D_____/
Subspace 1 Subspace 2 Subspace 3
| | |
K-means on K-means on K-means on
subspace 1 subspace 2 subspace 3
| | |
Codebook with Codebook with Codebook with
256 centroids 256 centroids 256 centroids
| | |
Find nearest Find nearest Find nearest
centroid ID centroid ID centroid ID
| | |
[12] [203] [45]
1 byte 1 byte 1 byte

Result

Before: [0.23, 0.91, ...] - 384 x 4 bytes = 1536 bytes
After: [12, 203, 45, ...] - 8 x 1 byte = 8 bytes

192x compression!

Search Process

1. Query arrives
|
2. Split query into subspaces (like training)
|
3. Precompute distance tables for each subspace
(query_subspace to all 256 codebook centroids)
|
4. For each vector:
- Look up codes: [12, 203, 45, ...]
- Table lookup: distances[12] + distances[203] + ...
- No float operations! Just array lookups
|
5. Return top-k
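Step 4 -- scoring a compressed vector with table lookups -- is the heart of the method. Here is a toy asymmetric distance computation (ADC) sketch with deliberately tiny sizes (M=2 subspaces, K=4 centroids) rather than the real 8x256:

```go
package main

import "fmt"

// adc scores each PQ-compressed vector by summing one precomputed
// table lookup per subspace: table[m][c] holds the distance from the
// query's m-th subvector to centroid c of subspace m. No float math
// on the original vectors is needed.
func adc(table [][]float32, codes [][]uint8) []float32 {
	scores := make([]float32, len(codes))
	for i, code := range codes {
		var sum float32
		for m, c := range code {
			sum += table[m][c] // one lookup per subspace
		}
		scores[i] = sum
	}
	return scores
}

func main() {
	// Step 3's precomputed query-to-centroid distances per subspace
	table := [][]float32{
		{0.125, 0.875, 0.5, 0.25},  // subspace 0
		{0.25, 0.5, 0.75, 0.625},   // subspace 1
	}
	// Each vector is stored as M centroid IDs (1 byte each)
	codes := [][]uint8{
		{0, 0}, // nearest centroids in both subspaces
		{1, 2},
	}
	fmt.Println(adc(table, codes)) // [0.375 1.625]
}
```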

Complexity Analysis

+-------------+--------------------------------+
| Operation | Complexity |
+-------------+--------------------------------+
| Training | O(M x iterations x K x n/M) |
| | K-means on each subspace |
| Encoding | O(M x K x d/M) per vector |
| Search | O(M x K + n x M) |
| | Super fast, cache-friendly |
| Memory | O(n x M) - massive savings! |
| Recall | 70-85% typical |
+-------------+--------------------------------+

Memory Comparison

1M vectors, 384 dimensions:

Float32: 1M x 384 x 4 bytes = 1.46 GB
PQ-8: 1M x 8 x 1 byte = 7.6 MB (192x smaller!)
PQ-16: 1M x 16 x 1 byte = 15.3 MB (96x smaller!)
PQ-32: 1M x 32 x 1 byte = 30.5 MB (48x smaller!)

+ Codebooks: ~393 KB (shared across all vectors)

Key Parameters

M (nSubspaces):
+- 8: Maximum compression, lower accuracy
+- 16: Good balance
+- 32: Better accuracy, less compression
+- 64: High accuracy, moderate compression

bitsPerCode:
+- 8: Standard (256 centroids per subspace)
Perfect for uint8 storage

When to Use

Use for:
  • Massive datasets (millions of vectors)
  • Memory-bottlenecked deployments
  • Workloads that can tolerate 70-85% recall
  • 30-200x memory reduction targets

Avoid for:
  • High recall requirements (>95%)
  • Deployments with plenty of memory

IVFPQ Index (Hybrid: Clustering + Quantization)

Best of Both Worlds: IVF clusters to reduce search space, PQ compression to reduce memory. The ultimate scalability index.

Build Process

Step 1: IVF Clustering

All Vectors
|
K-means clustering
|
[C1] [C2] [C3] [C4] ... [Cn]

Step 2: PQ Compression per Cluster

Cluster 1: Cluster 2: Cluster 3:
{vectors} {vectors} {vectors}
| | |
Apply PQ Apply PQ Apply PQ
| | |
[12,203,45] [91,34,178] [56,211,19]
[88,9,101] [23,156,88] [199,44,73]
[...] [...] [...]
uint8 codes uint8 codes uint8 codes

Search Process

Step 1: IVF Stage
Query - Find nProbe nearest centroids
|
[C2] [C3] [C7] (e.g., 3 of 100 clusters)
|
Search only 3% of the data!

Step 2: PQ Stage
For each selected cluster:
+- Precompute distance tables
+- Fast table lookups on PQ codes
+- No float operations
|
Top-k results

The Magic Combination

IVF contributes:
+- Speed: 10-100x faster (search only nProbe clusters)

PQ contributes:
+- Memory: 30-200x smaller (compressed codes)

Combined:
+- Fast + Tiny = Billion-scale capability!

Complexity Analysis

+-------------+--------------------------------+
| Operation | Complexity |
+-------------+--------------------------------+
| Training | O(IVF_kmeans + PQ_kmeans) |
| Search | O(k x d + (n/k) x nProbe x M) |
| | Searches ~1-10% of data |
| Memory | O(n x M + k x d) |
| | Massive compression |
| Recall | 70-90% depending on params |
+-------------+--------------------------------+

Memory Savings Example

1M vectors, 384 dimensions:

Float32 + IVF: 1.46 GB
IVFPQ (M=8): 7.6 MB + 400 KB (centroids)
= ~8 MB total

180x compression!

Key Parameters

nClusters (IVF):
+- 100: For 100K vectors
+- 1K: For 1M vectors
+- 10K: For 100M vectors

nProbe (IVF search):
+- 1: Fastest, lower recall
+- 8: Good balance
+- 16: Better recall

M (PQ subspaces):
+- 8: Maximum compression
+- 16: Good balance
+- 32: Better accuracy

When to Use

Use for:
  • Billion-scale datasets
  • <10ms latency targets
  • Severe memory constraints
  • Workloads that can tolerate 70-90% recall
  • 100x speed + 100x compression targets

Avoid for:
  • >95% recall requirements (use HNSW)
  • Small datasets (use Flat or HNSW)

Real-World Example

Use Case: 100M vectors, 768-dim

Flat Index:
+- Memory: ~288 GB
+- Search: ~10 seconds
+- Not feasible!

IVFPQ:
+- Memory: ~800 MB
+- Search: ~5 ms
+- Practical!

Decision Matrix

+-------+--------+---------------+----------+------------+---------------------------------+
| Index | Recall | Search Speed  | Memory   | Build Time | Best For                        |
+-------+--------+---------------+----------+------------+---------------------------------+
| Flat  | 100%   | Slow O(n)     | High     | Instant    | <10K vectors, benchmarks        |
| HNSW  | 90-99% | Fast O(log n) | Highest  | Slow       | 10K-10M vectors, low latency    |
| IVF   | 80-95% | Medium        | High     | Medium     | >100K vectors, moderate recall  |
| PQ    | 70-85% | Fast          | Lowest   | Slow       | >1M vectors, memory-constrained |
| IVFPQ | 70-90% | Fastest       | Very Low | Slow       | >10M vectors, billion-scale     |
+-------+--------+---------------+----------+------------+---------------------------------+

Rules of thumb:

  • Small data (<10k): Flat - brute force is fast enough
  • Medium data (10k-1M): HNSW - best recall/speed tradeoff
  • Large data (1M-100M): IVF or PQ - choose speed (IVF) vs memory (PQ)
  • Massive data (>100M): IVFPQ - only option that scales to billions

Latency targets:

  • Flat: 1-100ms (depends on n)
  • HNSW: 0.1-2ms (sub-millisecond possible)
  • IVF: 0.5-5ms
  • PQ: 1-10ms (fast scan but approximate)
  • IVFPQ: 0.5-3ms (fastest for massive datasets)

API Reference

For detailed API documentation, see API.md.

Examples

For practical examples and code samples, see EXAMPLE.md.

Configuration

Basic Configuration

// Flat Index - No configuration needed
flatIdx, _ := comet.NewFlatIndex(384, comet.Cosine)

// HNSW Index - Configure build and search parameters
hnswIdx, _ := comet.NewHNSWIndex(
	384,          // dimensions
	comet.Cosine, // distance metric
	16,           // M: connections per layer
	200,          // efConstruction: build quality
	200,          // efSearch: search quality
)

// IVF Index - Configure clustering
ivfIdx, _ := comet.NewIVFIndex(
	384,          // dimensions
	comet.Cosine, // distance metric
	100,          // nClusters: number of partitions
)

// Training required for IVF
trainingVectors := []comet.VectorNode{ /* ... */ }
ivfIdx.Train(trainingVectors)

Configuration Options

HNSW Parameters

M (Connections Per Layer)

  • Type: int
  • Default: 16
  • Range: 4 to 64
  • Description: Number of bidirectional connections per node at each layer (except layer 0, which uses 2 x M)

Effects of different values:

+---------+------------------------------------------+
| Value | Effect |
+---------+------------------------------------------+
| 4-8 | Low memory, faster build, lower recall |
| 12-16 | Balanced (recommended for most cases) |
| 24-48 | High recall, slower build, more memory |
| 64+ | Diminishing returns, excessive memory |
+---------+------------------------------------------+

efConstruction (Build-Time Candidate List Size)

  • Type: int
  • Default: 200
  • Range: 100 to 1000
  • Description: Size of candidate list during index construction. Higher = better graph quality

Effects:

+---------+------------------------------------------+
| Value | Effect |
+---------+------------------------------------------+
| 100 | Fast build, lower quality graph |
| 200 | Good balance (default) |
| 400-500 | Better quality, 2-3x slower build |
| 800+ | Marginal gains, very slow build |
+---------+------------------------------------------+

efSearch (Search-Time Candidate List Size)

  • Type: int
  • Default: 200
  • Range: k to 1000 (must be >= k)
  • Description: Size of candidate list during search. Can be adjusted dynamically.

Effects:

+---------+------------------------------------------+
| Value | Effect |
+---------+------------------------------------------+
| 50 | Very fast, ~85% recall |
| 100 | Fast, ~92% recall |
| 200 | Balanced, ~96% recall |
| 400 | Slower, ~98% recall |
| 800 | Much slower, ~99% recall |
+---------+------------------------------------------+

Example:

// Create HNSW with high quality settings
index, _ := comet.NewHNSWIndex(
	384,
	comet.Cosine,
	32,  // More connections = better recall
	400, // Higher construction quality
	200, // Search quality (can adjust later)
)

// Dynamically adjust search quality
index.SetEfSearch(400) // Trade speed for recall

IVF Parameters

nClusters (Number of Clusters)

  • Type: int
  • Default: sqrt(n) where n = dataset size
  • Range: 16 to n/100
  • Description: Number of k-means clusters (Voronoi cells) to partition the space

+-------------+------------+-------------------------+
| Dataset Size| nClusters | Typical Range |
+-------------+------------+-------------------------+
| 10K | 100 | 50-200 |
| 100K | 316 | 200-500 |
| 1M | 1000 | 500-2000 |
| 10M | 3162 | 2000-5000 |
+-------------+------------+-------------------------+

nProbes (Search-Time Clusters to Probe)

  • Type: int
  • Default: 1
  • Range: 1 to nClusters
  • Description: Number of nearest clusters to search during query

+---------+------------------------------------------+
| nProbes | Effect |
+---------+------------------------------------------+
| 1 | Fastest, ~60-70% recall |
| 8 | Good balance, ~85% recall |
| 16 | Better recall, ~92% recall |
| 32 | High recall, ~96% recall |
| 64 | Very high recall, slower |
+---------+------------------------------------------+

Example:

// Create IVF index
index, _ := comet.NewIVFIndex(384, comet.Cosine, 256)

// Train with a representative sample
trainData := sampleVectors(10000) // 10K samples for training
index.Train(trainData)

// Search with multiple probes
results, _ := index.NewSearch().
    WithQuery(query).
    WithK(10).
    WithNProbe(16). // Search the 16 nearest clusters
    Execute()

Advanced Configuration

Fusion Configuration

// Weighted Sum Fusion (manual weights)
config := &comet.FusionConfig{
    VectorWeight: 0.7, // 70% weight to semantic similarity
    TextWeight:   0.3, // 30% weight to keyword relevance
}
fusion, _ := comet.NewFusion(comet.WeightedSumFusion, config)

results, _ := hybrid.NewSearch().
    WithVector(queryVec).
    WithText("machine learning").
    WithFusion(fusion).
    Execute()

// Reciprocal Rank Fusion (rank-based, no weights needed)
results, _ = hybrid.NewSearch().
    WithVector(queryVec).
    WithText("machine learning").
    WithFusionKind(comet.ReciprocalRankFusion).
    Execute()

Distance Metrics

// Euclidean (L2): Use when magnitude matters
euclideanIdx, _ := comet.NewFlatIndex(384, comet.Euclidean)

// L2 Squared: Faster than Euclidean (no sqrt), preserves ranking
l2sqIdx, _ := comet.NewFlatIndex(384, comet.L2Squared)

// Cosine: Use for normalized vectors (text embeddings, etc.)
cosineIdx, _ := comet.NewFlatIndex(384, comet.Cosine)

API Details

FlatIndex

Constructor:

func NewFlatIndex(dim int, distanceKind DistanceKind) (*FlatIndex, error)

Parameters:

+--------------+--------------+----------+--------------------------------------------------+
| Parameter    | Type         | Required | Description                                      |
+--------------+--------------+----------+--------------------------------------------------+
| dim          | int          | Yes      | Vector dimension (must be > 0)                   |
| distanceKind | DistanceKind | Yes      | Distance metric: Euclidean, L2Squared, or Cosine |
+--------------+--------------+----------+--------------------------------------------------+

Returns:

  • *FlatIndex: Initialized flat index
  • error: Error if parameters are invalid

HNSWIndex

Constructor:

func NewHNSWIndex(dim int, distanceKind DistanceKind, m, efConstruction, efSearch int) (*HNSWIndex, error)

Parameters:

+----------------+--------------+----------+----------------+------------------------------------------------------------------------+
| Parameter      | Type         | Required | Default        | Description                                                            |
+----------------+--------------+----------+----------------+------------------------------------------------------------------------+
| dim            | int          | Yes      | -              | Vector dimension (must be > 0)                                         |
| distanceKind   | DistanceKind | Yes      | -              | Distance metric: Euclidean, L2Squared, or Cosine                       |
| m              | int          | No       | 16             | Max connections per node (higher = better accuracy, more memory)       |
| efConstruction | int          | No       | 200            | Build-time search depth (higher = better graph quality, slower build)  |
| efSearch       | int          | No       | efConstruction | Query-time search depth (higher = better accuracy, slower search)      |
+----------------+--------------+----------+----------------+------------------------------------------------------------------------+

Returns:

  • *HNSWIndex: Initialized HNSW index
  • error: Error if parameters are invalid

Parameter Guidelines:

+--------------+-----+----------------+----------+-------------+
| Use Case | M | efConstruction | efSearch | Description |
+--------------+-----+----------------+----------+-------------+
| Fast | 8 | 100 | 50 | Speed first |
| Balanced | 16 | 200 | 100 | Default |
| High Recall | 32 | 400 | 200 | Accuracy |
| Memory Eff. | 4 | 50 | 25 | Low memory |
+--------------+-----+----------------+----------+-------------+

IVFIndex

Constructor:

func NewIVFIndex(dim int, distanceKind DistanceKind, nlist int) (*IVFIndex, error)

Parameters:

+--------------+--------------+----------+--------------------------------------------------+
| Parameter    | Type         | Required | Description                                      |
+--------------+--------------+----------+--------------------------------------------------+
| dim          | int          | Yes      | Vector dimension (must be > 0)                   |
| distanceKind | DistanceKind | Yes      | Distance metric: Euclidean, L2Squared, or Cosine |
| nlist        | int          | Yes      | Number of clusters/partitions (must be > 0)      |
+--------------+--------------+----------+--------------------------------------------------+

Returns:

  • *IVFIndex: Initialized IVF index
  • error: Error if parameters are invalid

Parameter Guidelines:

nlist Selection:
  • Small dataset (10K-100K): nlist ≈ sqrt(n) ≈ 100-300
  • Medium dataset (100K-1M): nlist ≈ sqrt(n) ≈ 300-1000
  • Large dataset (1M-10M): nlist ≈ sqrt(n) ≈ 1000-3000
  • Very large (10M+): nlist ≈ sqrt(n) ≈ 3000-10000

Rule of thumb: nlist ≈ sqrt(number_of_vectors)

Search Parameters:

// nProbe: number of clusters to search (1 to nlist)
results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    WithNProbe(8). // Search the top 8 nearest clusters
    Execute()

Speed vs Accuracy:
  • nProbe = 1: Fastest, lowest recall
  • nProbe = 8: Balanced (typical)
  • nProbe = nlist: Exhaustive (same as flat)

PQIndex

Constructor:

func NewPQIndex(dim int, distanceKind DistanceKind, M int, Nbits int) (*PQIndex, error)

Parameters:

+--------------+--------------+----------+--------------+--------------------------------------------------+
| Parameter    | Type         | Required | Constraint   | Description                                      |
+--------------+--------------+----------+--------------+--------------------------------------------------+
| dim          | int          | Yes      | > 0          | Vector dimension                                 |
| distanceKind | DistanceKind | Yes      | -            | Distance metric: Euclidean, L2Squared, or Cosine |
| M            | int          | Yes      | dim % M == 0 | Number of subspaces (dim must be divisible by M) |
| Nbits        | int          | Yes      | 1-16         | Bits per subspace (typical: 8)                   |
+--------------+--------------+----------+--------------+--------------------------------------------------+

Returns:

  • *PQIndex: Initialized PQ index
  • error: Error if parameters are invalid or constraints violated

Parameter Guidelines:

M (Number of Subspaces):
  • M = 8: 192x compression (typical, good balance)
  • M = 16: 96x compression (better accuracy)
  • M = 32: 48x compression (highest accuracy)
  • Constraint: dim must be divisible by M
  • Note: ratios assume 384-dim float32 vectors with Nbits = 8

Nbits (Codebook size = 2^Nbits):
  • Nbits = 4: 16 centroids per subspace (very fast)
  • Nbits = 8: 256 centroids per subspace (typical)
  • Nbits = 12: 4096 centroids per subspace (high accuracy)
  • Constraint: 1 <= Nbits <= 16

Common Configurations (384-dim float32 vectors):

+----+-------+-----------------+---------------------+
| M  | Nbits | Compression     | Use Case            |
+----+-------+-----------------+---------------------+
| 8  | 8     | 192x (8 B/vec)  | Maximum compression |
| 16 | 8     | 96x (16 B/vec)  | Balanced            |
| 32 | 8     | 48x (32 B/vec)  | Better accuracy     |
| 8  | 12    | 128x (12 B/vec) | Higher quality      |
+----+-------+-----------------+---------------------+
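The ratios above are straightforward arithmetic: a dim-dimensional float32 vector occupies dim x 4 bytes, while its PQ code occupies M x Nbits bits. A short sketch in plain Go (independent of comet) that reproduces the table:

```go
package main

import "fmt"

// pqCompression returns the compression ratio of product quantization:
// original float32 size (dim * 4 bytes) over the code size (m * nbits bits).
func pqCompression(dim, m, nbits int) float64 {
	originalBytes := float64(dim * 4)
	codeBytes := float64(m*nbits) / 8.0
	return originalBytes / codeBytes
}

func main() {
	// Reproduce the table above for 384-dim vectors.
	fmt.Println(pqCompression(384, 8, 8))  // 192
	fmt.Println(pqCompression(384, 16, 8)) // 96
	fmt.Println(pqCompression(384, 32, 8)) // 48
	fmt.Println(pqCompression(384, 8, 12)) // 128
}
```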

IVFPQIndex

Constructor:

func NewIVFPQIndex(dim int, distanceKind DistanceKind, nlist int, m int, nbits int) (*IVFPQIndex, error)

Parameters:

+--------------+--------------+----------+--------------+--------------------------------------------------+
| Parameter    | Type         | Required | Constraint   | Description                                      |
+--------------+--------------+----------+--------------+--------------------------------------------------+
| dim          | int          | Yes      | > 0          | Vector dimension                                 |
| distanceKind | DistanceKind | Yes      | -            | Distance metric: Euclidean, L2Squared, or Cosine |
| nlist        | int          | Yes      | > 0          | Number of IVF clusters                           |
| m            | int          | Yes      | dim % m == 0 | Number of PQ subspaces                           |
| nbits        | int          | Yes      | 1-16         | Bits per PQ subspace                             |
+--------------+--------------+----------+--------------+--------------------------------------------------+

Returns:

  • *IVFPQIndex: Initialized IVFPQ index
  • error: Error if parameters are invalid

Parameter Guidelines:

Combined IVF + PQ Configuration:

IVF Parameters:
  • nlist ≈ sqrt(n), where n = number of vectors
  • See the IVFIndex guidelines above

PQ Parameters:
  • m: typically 8, 16, or 32
  • nbits: typically 8

Common Configurations for 100M vectors (384-dim):

+-------+----+-------+---------+-----------------+
| nlist | m  | nbits | Memory  | Use Case        |
+-------+----+-------+---------+-----------------+
| 4096  | 8  | 8     | ~800 MB | Extreme speed   |
| 8192  | 8  | 8     | ~900 MB | Balanced        |
| 16384 | 16 | 8     | ~1.6 GB | Better accuracy |
| 8192  | 8  | 12    | ~1.2 GB | High quality    |
+-------+----+-------+---------+-----------------+

Memory Savings Example (100M vectors, 384-dim):
  • Original float32: 100M x 384 x 4 B ≈ 146 GB
  • IVF only: still ~146 GB (no compression)
  • PQ only (m=8, nbits=8): 100M x 8 B ≈ 800 MB, but every code is scanned at query time
  • IVFPQ (m=8, nbits=8): 100M x 8 B + centroids ≈ 800 MB, and only probed clusters are scanned (best of both)
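Those figures come out of a back-of-the-envelope estimate: one m x nbits-bit code per vector, plus the coarse centroids and PQ codebooks stored as float32. The helper below is a sketch (hypothetical, not comet API) that ignores ID lists and per-struct overhead:

```go
package main

import "fmt"

// ivfpqBytes is a rough memory estimate for an IVFPQ index:
// one code of m*nbits bits per vector, plus nlist coarse centroids and
// m codebooks of 2^nbits centroids (dim/m floats each), all float32.
// ID lists and struct overhead are ignored.
func ivfpqBytes(nVectors, dim, nlist, m, nbits int) int64 {
	codes := int64(nVectors) * int64(m*nbits) / 8
	coarseCentroids := int64(nlist) * int64(dim) * 4
	codebooks := int64(m) * (int64(1) << uint(nbits)) * int64(dim/m) * 4
	return codes + coarseCentroids + codebooks
}

func main() {
	// 100M vectors, 384-dim, nlist=8192, m=8, nbits=8
	b := ivfpqBytes(100_000_000, 384, 8192, 8, 8)
	fmt.Printf("~%.0f MB\n", float64(b)/1e6) // ~813 MB
}
```

The codes dominate (800 MB); the coarse centroids (~12.6 MB) and codebooks (~0.4 MB) are noise by comparison, which is why the table rows with the same m and nbits differ so little in memory.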

Search Parameters:

results, _ := index.NewSearch().
    WithQuery(queryVec).
    WithK(10).
    WithNProbe(8).    // IVF: clusters to search
    WithNRefine(100). // Optional: refine top 100 with original vectors
    Execute()

BM25SearchIndex

Constructor:

func NewBM25SearchIndex() *BM25SearchIndex

Parameters: None (uses default BM25 parameters: k1=1.5, b=0.75)

Returns:

  • *BM25SearchIndex: Initialized BM25 full-text search index

BM25 Parameters:

  • k1 = 1.5: Term frequency saturation (typical range: 1.2-2.0)
  • b = 0.75: Document length normalization (typical range: 0.5-0.9)
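To see what k1 and b actually control, here is a self-contained sketch of the per-term BM25 weight (illustrative only, not comet's internal implementation): k1 caps how much repeated occurrences of a term can add, and b scales the penalty for longer-than-average documents.

```go
package main

import (
	"fmt"
	"math"
)

// bm25Term computes one term's BM25 contribution for one document,
// using the defaults noted above (k1=1.5, b=0.75).
// tf: term frequency in the document; docLen/avgDocLen: document length
// normalization; nDocs/docFreq: corpus stats feeding the IDF.
func bm25Term(tf, docLen, avgDocLen float64, nDocs, docFreq int) float64 {
	const k1, b = 1.5, 0.75
	idf := math.Log(1 + (float64(nDocs)-float64(docFreq)+0.5)/(float64(docFreq)+0.5))
	norm := k1 * (1 - b + b*docLen/avgDocLen)
	return idf * tf * (k1 + 1) / (tf + norm)
}

func main() {
	// Term frequency saturates: tf=10 scores far less than 10x tf=1.
	fmt.Println(bm25Term(1, 100, 100, 1000, 10))
	fmt.Println(bm25Term(10, 100, 100, 1000, 10))
}
```

With k1 = 1.5, raising a term's frequency from 1 to 10 roughly doubles its score rather than multiplying it by 10; that saturation curve is what k1 tunes, while b decides how strongly long documents are discounted.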

RoaringMetadataIndex

Constructor:

func NewRoaringMetadataIndex() *RoaringMetadataIndex

Parameters: None

Returns:

  • *RoaringMetadataIndex: Initialized metadata filtering index using Roaring Bitmaps

HybridSearchIndex

Constructor:

func NewHybridSearchIndex(
    vectorIndex VectorIndex,
    textIndex TextIndex,
    metadataIndex MetadataIndex,
) HybridSearchIndex

Parameters:

+---------------+---------------+----------+------------------------------------------------+
| Parameter     | Type          | Required | Description                                    |
+---------------+---------------+----------+------------------------------------------------+
| vectorIndex   | VectorIndex   | Yes      | Any vector index (Flat, HNSW, IVF, PQ, IVFPQ)  |
| textIndex     | TextIndex     | Yes      | BM25 text search index                         |
| metadataIndex | MetadataIndex | Yes      | Roaring metadata filter index                  |
+---------------+---------------+----------+------------------------------------------------+

Returns:

  • HybridSearchIndex: Initialized hybrid search combining all three modalities

Search APIs

Vector Search (VectorSearch Interface)

Creating a Search:

search := index.NewSearch() // Available on all VectorIndex implementations

Methods:

+----------------------+----------------------+--------------------------------------------------------------------------------------+
| Method               | Parameters           | Description                                                                          |
+----------------------+----------------------+--------------------------------------------------------------------------------------+
| WithQuery            | queries ...[]float32 | Set query vector(s) for similarity search                                            |
| WithNodes            | nodes ...uint32      | Find vectors similar to indexed nodes by ID                                          |
| WithK                | k int                | Number of results to return (default: 10)                                            |
| WithDocumentIDs      | ids ...uint32        | Pre-filter: only search within these document IDs                                    |
| WithScoreAggregation | agg ScoreAggregation | How to combine multi-query results: SumAggregation, MaxAggregation, MeanAggregation  |
| WithThreshold        | threshold float32    | Only return results with distance <= threshold                                       |
| WithAutocut          | enabled bool         | Automatically determine the optimal cutoff point                                     |
| Execute              | -                    | Run the search; returns []VectorResult, error                                        |
+----------------------+----------------------+--------------------------------------------------------------------------------------+

HNSW-Specific Methods:

+--------------+------------+--------------------------------------------+
| Method       | Parameters | Description                                |
+--------------+------------+--------------------------------------------+
| WithEfSearch | ef int     | Override default efSearch for this query   |
+--------------+------------+--------------------------------------------+

IVF-Specific Methods:

+--------------+------------+--------------------------------------------+
| Method       | Parameters | Description                                |
+--------------+------------+--------------------------------------------+
| WithNProbe   | nprobe int | Number of clusters to search (1 to nlist)  |
+--------------+------------+--------------------------------------------+

IVFPQ-Specific Methods:

+--------------+-------------+---------------------------------------------+
| Method       | Parameters  | Description                                 |
+--------------+-------------+---------------------------------------------+
| WithNProbe   | nprobe int  | Number of clusters to search                |
| WithNRefine  | nrefine int | Refine top-N results with original vectors  |
+--------------+-------------+---------------------------------------------+

Examples:

// Basic single-query search
results, err := index.NewSearch().
 WithQuery(queryVector).
 WithK(10).
 Execute()

// Multi-query with aggregation
results, _ := index.NewSearch().
 WithQuery(query1, query2, query3).
 WithK(20).
 WithScoreAggregation(comet.MeanAggregation).
 Execute()

// Node-based search (find similar to existing vectors)
results, _ := index.NewSearch().
 WithNodes(1, 5, 10). // IDs of indexed vectors
 WithK(10).
 Execute()

// Pre-filtered search
eligibleIDs := []uint32{100, 200, 300, 400, 500}
results, _ := index.NewSearch().
 WithQuery(queryVector).
 WithDocumentIDs(eligibleIDs...).
 WithK(5).
 Execute()

// Distance threshold filtering
results, _ := index.NewSearch().
 WithQuery(queryVector).
 WithK(100).
 WithThreshold(0.5). // Only distances <= 0.5
 Execute()

// Autocut (automatic result truncation)
results, _ := index.NewSearch().
 WithQuery(queryVector).
 WithK(100).
 WithAutocut(true). // Returns fewer if quality drops
 Execute()

// HNSW with custom efSearch
hnswIndex, _ := comet.NewHNSWIndex(384, comet.Cosine, 16, 200, 100)
results, _ := hnswIndex.NewSearch().
 WithQuery(queryVe