Adaptive GPU Hash Table
A high-performance, GPU-accelerated hash table implementation using CUDA cooperative groups for optimized concurrent operations. Designed for large-scale parallel workloads requiring efficient key-value storage and retrieval on NVIDIA GPUs.
Overview
The Adaptive GPU Hash Table is a plug-and-play, header-only hash table optimized for NVIDIA CUDA architectures. It leverages cooperative groups (CG) to enable efficient parallel insertions, lookups, and deletions across thousands of GPU threads. The implementation uses linear probing with tombstone-based deletion and supports configurable load factors and probing strategies.
Architecture
Design Principles
- Cooperative Groups: Uses CUDA cooperative groups (tile size: 16 threads) to parallelize probing sequences
- Linear Probing: Open addressing with linear probing for cache-friendly memory access
- Tombstone Deletion: Tombstone markers enable efficient deletion without rehashing
- Structure of Arrays: Separate arrays for keys, values, and metadata for optimal memory coalescing
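The probing and deletion scheme above can be sketched in plain host code. This is a minimal, single-threaded C++ sketch of linear probing with tombstones over a structure-of-arrays layout, written for intuition only; the slot-state encoding and hash function are assumptions of the sketch, not the library's actual representation, and the real implementation runs these probe sequences cooperatively on the GPU:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative slot states; the library's metadata encoding may differ.
enum class Slot : uint8_t { Empty, Full, Tomb };

struct Sketch {
    std::vector<uint64_t> keys, vals;   // structure-of-arrays layout
    std::vector<Slot> state;
    size_t cap, max_probe;

    // cap must be a power of two so `& (cap - 1)` can replace modulo.
    Sketch(size_t c, size_t mp)
        : keys(c), vals(c), state(c, Slot::Empty), cap(c), max_probe(mp) {}

    size_t slot(uint64_t k, size_t i) const {
        return (std::hash<uint64_t>{}(k) + i) & (cap - 1);  // linear probing
    }

    bool insert(uint64_t k, uint64_t v) {
        for (size_t i = 0; i < max_probe; ++i) {
            size_t s = slot(k, i);
            // Reuse empty slots and tombstones; overwrite on key match.
            // (A full implementation would first scan the chain for the key
            // before reusing a tombstone, to avoid duplicates.)
            if (state[s] != Slot::Full || keys[s] == k) {
                keys[s] = k; vals[s] = v; state[s] = Slot::Full;
                return true;
            }
        }
        return false;  // probe limit reached
    }

    bool find(uint64_t k, uint64_t& out) const {
        for (size_t i = 0; i < max_probe; ++i) {
            size_t s = slot(k, i);
            if (state[s] == Slot::Empty) return false;  // never-used slot: absent
            if (state[s] == Slot::Full && keys[s] == k) { out = vals[s]; return true; }
            // Tombstones are skipped, so probe chains survive deletions.
        }
        return false;
    }

    bool erase(uint64_t k) {
        for (size_t i = 0; i < max_probe; ++i) {
            size_t s = slot(k, i);
            if (state[s] == Slot::Empty) return false;
            if (state[s] == Slot::Full && keys[s] == k) { state[s] = Slot::Tomb; return true; }
        }
        return false;
    }
};
```

The tombstone state is what makes deletion cheap: erasing flips one flag instead of rehashing the chain, at the cost of longer probes until a later compaction.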
Core Components
- AhgDeviceView: Device-side view providing insert, find, and erase operations
- AhgHost: Host-side table management for memory allocation and initialization
- AdaptiveHashTableGPU: High-level C++ API wrapper for batch operations
- Kernels: Optimized CUDA kernels for batch insert, find, and erase operations
Requirements
Hardware
- Recommended GPU Hardware: NVIDIA GeForce RTX 4070 or better
- Memory: Sufficient GPU memory for your dataset (capacity x 17 bytes per element)
Software
- CUDA Toolkit: Version 12.4 or compatible
- Compiler: nvcc (CUDA compiler) and g++ (with C++17 support)
- OS: Linux (tested on CUDA servers; all reported benchmarks were run on the cuda5.cims.nyu.edu cluster at NYU Courant)
Installation
Step 1: Setup CUDA Environment
```bash
module load cuda-12.4
# Verify CUDA installation
nvcc --version
```
Step 2: Clone Repository
```bash
git clone https://github.com/NerdyVisky/adaptive-gpu-hashtable.git
cd adaptive-gpu-hashtable
```
Step 3: Include in Your Project
Simply include the high-level API header in your CUDA source, e.g. `#include "src/hashtable_gpu_adaptive.cuh"`.
Step 4: Compile
Compile with nvcc (for example, `nvcc -std=c++17 -O3 -arch=sm_80 main.cu -o main`). Note: adjust -arch=sm_80 based on your GPU's compute capability:
- sm_75 for Turing (RTX 20xx)
- sm_80 for Ampere (A100)
- sm_86 for Ampere (RTX 30xx)
- sm_89 for Ada Lovelace (RTX 40xx)
Quick Start
Basic Usage
```cpp
#include "src/hashtable_gpu_adaptive.cuh"  // high-level C++ API
#include <cstdint>
#include <vector>

using namespace gpu::hashtable;

int main() {
    // Create hash table with capacity (will be rounded up to power of 2)
    size_t capacity = 1'000'000;
    AdaptiveHashTableGPU<uint64_t> table(capacity, 256); // max_probe = 256

    // Prepare data
    std::vector<uint64_t> keys = {1, 2, 3, 4, 5};
    std::vector<uint64_t> values = {100, 200, 300, 400, 500};
    std::vector<uint8_t> success(keys.size());

    // Batch insert
    table.insert(keys.data(), values.data(), keys.size(), success.data());

    // Batch find
    std::vector<uint64_t> out_values(keys.size());
    std::vector<uint8_t> found(keys.size());
    table.find(keys.data(), out_values.data(), found.data(), keys.size());

    // Batch erase
    table.erase(keys.data(), success.data(), keys.size());

    return 0;
}
```
API Reference
AdaptiveHashTableGPU
Constructor
Creates a new hash table with the specified capacity (rounded up to power of 2) and maximum probe length.
Parameters:
- capacity_pow2: Desired capacity (automatically rounded up to the next power of 2)
- max_probe: Maximum number of probe steps before giving up (default: 256)
Methods
void insert(const KeyType* h_keys, const uint64_t* h_vals, size_t n, uint8_t* h_success)
Batch insert operation. Inserts n key-value pairs into the hash table.
Parameters:
- h_keys: Host pointer to array of keys
- h_vals: Host pointer to array of values
- n: Number of elements to insert
- h_success: Output array indicating success (1) or failure (0) for each insertion
Returns: void (success flags written to h_success)
void find(const KeyType* h_keys, uint64_t* h_out_vals, uint8_t* h_found, size_t n) const
Batch find operation. Looks up n keys in the hash table.
Parameters:
- h_keys: Host pointer to array of keys to find
- h_out_vals: Output array for found values
- h_found: Output array indicating whether each key was found (1) or not (0)
- n: Number of keys to look up
Returns: void (results written to h_out_vals and h_found)
void erase(const KeyType* h_keys, uint8_t* h_success, size_t n)
Batch erase operation. Removes n keys from the hash table using tombstone markers.
Parameters:
- h_keys: Host pointer to array of keys to erase
- h_success: Output array indicating success (1) or failure (0) for each erasure
- n: Number of keys to erase
Returns: void (success flags written to h_success)
size_t capacity() const
Returns the actual capacity of the hash table (power of 2).
std::string hash_policy_name() const
Returns the name of the hash function used ("ahg_hash64").
std::string probing_policy_name() const
Returns the name of the probing strategy ("LinearProbing(CG-optimized)").
Benchmarks
We provide comprehensive benchmarking tools to evaluate performance across different configurations.
Running Benchmarks
Testing Dynamic Resizing and Compaction
The project includes a dedicated benchmark for evaluating the dynamic (resizing and compaction) features of the adaptive GPU hash table.
To run the adaptive resizing benchmark:
1. Compile the benchmark:

```bash
nvcc -std=c++17 -O3 -arch=sm_80 benchmarks/benchmark_adaptive_resize.cu -o exe/nvcc/benchmark_adaptive_resize
```

2. Run the benchmark:

```bash
./exe/nvcc/benchmark_adaptive_resize
```
What this benchmark does:
- Insertion Phase: Inserts a large number of items in batches, triggering table growth and dynamic resizing.
- Deletion Phase: Deletes a significant portion of the table, creating tombstones and fragmentation.
- Compaction Phase: Triggers table shrinking and compaction, efficiently cleaning up tombstones.
- Performance Reporting: Prints table capacity, memory usage, and read performance before and after compaction, showing the effect of dynamic resizing.
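The grow/compact policy these phases exercise can be sketched in host code. The thresholds below (grow above 0.75 occupancy, compact when tombstones exceed half the occupied slots) are assumptions for illustration; the actual trigger points live in the CUDA implementation and the benchmark source:

```cpp
#include <cstdint>

// Illustrative resize policy: grow when occupancy (live + tombstones)
// exceeds a max load factor, compact when tombstones dominate.
// Threshold values are assumptions of this sketch.
struct ResizePolicy {
    double max_load = 0.75;   // grow above this occupancy
    double max_tomb = 0.50;   // compact when tombstones exceed this share
    uint64_t capacity = 1024; // power-of-two slot count

    bool should_grow(uint64_t live, uint64_t tombs) const {
        return double(live + tombs) / capacity > max_load;
    }
    bool should_compact(uint64_t live, uint64_t tombs) const {
        return tombs > 0 && double(tombs) / (live + tombs) > max_tomb;
    }
    void grow() { capacity *= 2; }  // rebuild into a table twice as large
    void shrink_to(uint64_t live) {
        // Smallest power of two that keeps the live entries under max_load;
        // tombstones are dropped by the rebuild.
        uint64_t c = 1024;
        while (double(live) / c > max_load) c *= 2;
        capacity = c;
    }
};
```

This mirrors the benchmark's arc: batched inserts trip `should_grow`, mass deletions make `should_compact` fire, and the rebuild after `shrink_to` both frees memory and removes the tombstones that slow fragmented reads.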
Output Example:
```text
--- INSERTION PHASE (Growth) ---
Batch,Capacity,Memory(MB),Time(ms)
...
Final capacity: 262144

--- DELETION PHASE (80% Data) ---
Pre-Compact Capacity: 262144 (Fragmented)
Fragmented Read Time (100k items): 12345 us

--- COMPACTION TRIGGERED ---
Post-Compact Capacity: 32768
Memory Saved: 3.6 MB
Compacted Read Time (100k items): 6789 us
Speedup: 1.8x
```
You can adjust the number of insertions, batch size, and load factor thresholds in benchmarks/benchmark_adaptive_resize.cu to suit your experiments.
Option 1: Using Precompiled Executables (Recommended)
Fixed Load Factor, Variable Size:
```bash
./exe/nvcc/syn_gpu_adaptive_benchmark \
    ./results/gpu/adaptive \
    linear \
    adaptive \
    0.75 \
    64
```
Fixed Size, Variable Load Factor:
```bash
./exe/nvcc/syn_gpu_adaptive_benchmark \
    ./results/gpu/adaptive \
    linear \
    adaptive \
    1000000 \
    64
```
Parameters:
- EXECUTABLE: Path to benchmark executable
- RESULTS_DIR: Directory to save results
- PROBING_STRATEGY: Probing method (linear, quadratic, double)
- HASHTABLE_IMPLEMENTATION: Implementation type (adaptive, naive, default)
- MAX_LOAD_FACTOR or N_SIZE: Load factor (0.0-1.0) or number of elements
- KEY_SIZE: Key size in bits (32 or 64)
Option 2: Compile Your Own Benchmark
```bash
# GPU Adaptive
nvcc -std=c++17 -O3 -arch=sm_80 \
    benchmarks/synthetic/run_benchmark_gpu_adaptive.cu \
    -o exe/nvcc/syn_gpu_adaptive_benchmark

# GPU Naive
nvcc -std=c++17 -O3 -arch=sm_80 \
    benchmarks/synthetic/run_benchmark_gpu_naive.cu \
    -o exe/nvcc/syn_gpu_naive_benchmark

# CPU
g++ -std=c++17 -O3 \
    benchmarks/synthetic/run_benchmark.cpp \
    -o exe/g++/syn_cpu_benchmark
```
Performance Results
Static Operations Performance
Performance analysis of insert, find, and erase operations at various input sizes:
Dynamic Performance Comparison
Comparing C++ STL's `std::unordered_map` on the CPU vs. the adaptive hash map on the GPU:
Load Factor Analysis
Effect of load factor on CPU vs. GPU performance at 10M elements:
Repository Structure
```text
adaptive-gpu-hashtable/
+-- src/                               # Core implementation
|   +-- adaptive_hash_gpu.cuh          # Low-level device operations
|   +-- hashtable_gpu_adaptive.cuh     # High-level C++ API
|   +-- hashtable_gpu_naive.cuh        # Naive GPU implementation
|   +-- hashtable_cpu.hpp              # CPU reference implementation
+-- benchmarks/                        # Benchmarking suite
|   +-- synthetic/
|   |   +-- run_benchmark_gpu.cu       # GPU benchmark driver
|   |   +-- run_benchmark.cpp          # CPU benchmark driver
|   +-- cuda_helpers/                  # GPU benchmark utilities
+-- scripts/                           # Benchmark automation scripts
|   +-- run_benchmark.sh               # Variable size benchmarks
|   +-- run_benchmark_load_factors.sh  # Variable load factor benchmarks
+-- results/                           # Benchmark results
|   +-- cpu/                           # CPU benchmark results
|   +-- gpu/                           # GPU benchmark results
|   +-- plots/                         # Performance plots
+-- exe/                               # Precompiled executables
|   +-- g++/                           # CPU executables
|   +-- nvcc/                          # GPU executables
+-- assets/                            # Documentation assets
```
Advanced Configuration
Switching Between Serial and Cooperative Group Modes
To use serial (scalar) operations instead of cooperative groups, modify src/adaptive_hash_gpu.cuh:
```cpp
template<unsigned int TILE_SIZE, typename Parent>
__device__ bool insert(cg::thread_block_tile<TILE_SIZE, Parent> tile, KeyType k, uint64_t v) {
    // return insert_cg(tile, k, v); // Comment out CG version
    return insert_scalar(k, v);      // Use scalar version
}

template<unsigned int TILE_SIZE, typename Parent>
__device__ bool erase(cg::thread_block_tile<TILE_SIZE, Parent> tile, KeyType k) {
    // return erase_cg(tile, k); // Comment out CG version
    return erase_scalar(k);      // Use scalar version
}
```
Customizing Tile Size
Modify the tile size by defining AHG_TILE before including the header:
```cpp
#define AHG_TILE 8  // e.g., 8-thread tiles instead of the default 16
#include "src/adaptive_hash_gpu.cuh"
```
Contributing
Contributions are welcome! Please follow these guidelines:
- Fork the repository and create a feature branch
- Follow code style: Use consistent formatting and naming conventions
- Add tests: Include benchmarks or unit tests for new features
- Update documentation: Keep README and code comments up to date
- Submit a pull request with a clear description of changes
Development Setup
```bash
git clone https://github.com/your-username/adaptive-gpu-hashtable.git
cd adaptive-gpu-hashtable

# Create a branch
git checkout -b feature/your-feature-name

# Make changes and test
nvcc -std=c++17 -O3 -arch=sm_80 your_test.cu -o test
./test

# Commit and push
git commit -m "Add your feature"
git push origin feature/your-feature-name
```
Known Issues
- Memory Alignment: Ensure keys and values are properly aligned for optimal performance
- Load Factor: Very high load factors (>0.95) may result in increased probe lengths and reduced performance
- Key Restrictions: Keys cannot be 0xFFFFFFFFFFFFFFFF (EMPTY_KEY) or 0xFFFFFFFFFFFFFFFE (TOMB_KEY)
- Concurrent Modifications: While thread-safe, extreme contention may reduce performance
Future Work
- Support for custom hash functions
- Dynamic resizing/rehashing
- Additional probing strategies (quadratic, double hashing)
- Multi-GPU support
- Stream-based asynchronous operations
- Integration with CUDA unified memory
- Support for non-64-bit value types
- Performance profiling and optimization tools
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use this implementation in your research, please cite:
```bibtex
@misc{trivedi2025adaptive,
  title  = {Adaptive GPU Hash Table},
  author = {Trivedi, Vishesh and Sathi, Vaigarai},
  year   = {2025},
  url    = {https://github.com/NerdyVisky/adaptive-gpu-hashtable}
}
```
Contact
For questions, issues, or contributions, please open an issue on GitHub.
Note: This implementation is optimized for NVIDIA GPUs with CUDA support. For CPU-based hash tables, consider using the reference implementation in src/hashtable_cpu.hpp or standard library containers.