# TokenBeast

A Rust tokenizer using the ungreedy 6-branch algorithm. Produces 4-15% fewer tokens than BPE at the same vocabulary size, with 8-9x faster tokenization throughput.

TokenBeast is a ground-up Rust rewrite of TokenMonster. Training finishes faster while matching or surpassing Go TokenMonster's compression quality.

## Training Algorithms

Two algorithms are available via `--algorithm`:

TokenBeast (default) is a single-pass iterative distillation algorithm. It starts with millions of candidate tokens extracted from the corpus, then iteratively removes the lowest-scoring tokens until the target vocabulary size is reached. It protects short tokens from premature removal, uses a gradual removal schedule, and finishes with a tournament refinement phase that gives rejected tokens a second chance.
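
The prune loop can be sketched roughly as follows. This is an illustrative simplification, not the actual implementation: the function names, the toy scoring rule, and the batch schedule are assumptions, and the tournament refinement phase is omitted.

```python
# Hypothetical sketch of iterative distillation: score all candidates,
# then prune the lowest scorers in small batches until the target size.

def score(tok, corpus):
    """Toy score: total characters this token could cover in the corpus."""
    return corpus.count(tok) * len(tok)

def distill(candidates, corpus, target_size, batch_frac=0.01, protect_len=2):
    """Iteratively remove the lowest-scoring tokens until target_size remain."""
    vocab = set(candidates)
    while len(vocab) > target_size:
        # Short tokens are protected from premature removal.
        removable = [t for t in vocab if len(t) > protect_len]
        if not removable:
            break
        # Gradual schedule: prune only a small batch per iteration.
        batch = max(1, int(len(vocab) * batch_frac))
        removable.sort(key=lambda t: score(t, corpus))
        for tok in removable[:batch]:
            if len(vocab) <= target_size:
                break
            vocab.discard(tok)
    return vocab
```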

TokenMonster is a faithful Rust port of the Go multi-worker training algorithm. It uses parallel workers with dataset strips, union-based voting for token removal, and random swap refinement. Use this when you want results comparable to the original Go implementation.
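
The multi-worker scheme can be sketched as below. This is an illustrative simplification rather than the actual port: the worker scoring rule and function names are assumptions; only the "each worker nominates tokens from its strip, and the union of nominations is removed" shape comes from the description above.

```python
# Illustrative sketch of multi-worker, union-based token removal.
from concurrent.futures import ThreadPoolExecutor

def worker_vote(strip, vocab, n_remove):
    """Each worker scores tokens on its own dataset strip and nominates
    the n_remove lowest scorers for removal (toy scoring rule)."""
    scores = {tok: strip.count(tok) * len(tok) for tok in vocab}
    return set(sorted(vocab, key=lambda t: scores[t])[:n_remove])

def union_removal_round(strips, vocab, n_remove):
    """Remove every token nominated by any worker (the union of votes)."""
    with ThreadPoolExecutor() as pool:
        votes = pool.map(worker_vote, strips,
                         [vocab] * len(strips), [n_remove] * len(strips))
        to_remove = set().union(*votes)
    return [t for t in vocab if t not in to_remove]
```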

See docs/algorithm-comparison.md for a detailed breakdown of how each algorithm works and where they differ.

## Results

Trained on tiny_shakespeare (1.1 MB). All systems trained on the same data.

### Vocab 4096

| System | chr/tok | MB/s | vs SentencePiece |
|---|---|---|---|
| SentencePiece BPE | 3.679 | 3.5 | baseline |
| TokenMonster (Go, fast) | 3.927 | 29.0 | +6.8% |
| TokenMonster (Go, full) | 3.762 | 30.6 | +2.3% |
| TokenBeast (Rust) | 3.812 | 32.6 | +3.6% |

### Vocab 8192

| System | chr/tok | MB/s | vs SentencePiece |
|---|---|---|---|
| SentencePiece BPE | 4.000 | 3.2 | baseline |
| TokenMonster (Go, fast) | 4.338 | 25.9 | +8.4% |
| TokenMonster (Go, full) | 4.246 | 28.6 | +6.2% |
| TokenBeast (Rust) | 4.591 | 30.4 | +14.8% |

TokenBeast beats the SentencePiece baseline at both sizes, and its advantage grows with vocabulary size: at vocab 8192 it leads every system tested. All variants tokenize at 26-33 MB/s, roughly 8-9x faster than SentencePiece.
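
To make the columns concrete: chr/tok is characters per token (higher means better compression), and "vs SentencePiece" is the relative chr/tok improvement over the baseline. Using the vocab-8192 numbers from the table above:

```python
# Relative compression improvement, vocab 8192 row of the table above.
baseline_chr_per_tok = 4.000    # SentencePiece BPE
tokenbeast_chr_per_tok = 4.591  # TokenBeast (Rust)

improvement = (tokenbeast_chr_per_tok - baseline_chr_per_tok) / baseline_chr_per_tok
print(f"+{improvement * 100:.1f}%")  # +14.8%, matching the table
```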

Full benchmark tables including Rust TokenMonster results and vocabulary overlap analysis are in docs/benchmarks.md.

## Crates

| Crate | Description |
|---|---|
| `tokenbeast` | Core library: vocab loading, tokenization, encoding/decoding |
| `tokenbeast-train` | Training library + CLI: dictionary extraction and vocabulary training |
| `tokenbeast-py` | Python bindings via PyO3: tokenization, training, and HuggingFace integration |

## Quick Start (Python)

```bash
cd crates/tokenbeast-py
pip install -e .

# With HuggingFace support
pip install -e ".[hf]"
```

### Tokenize

```python
from tokenbeast import Vocab

vocab = Vocab.load("results/final.tb")
ids = vocab.tokenize("Hello world")
text = vocab.decode_str(ids)
print(f"{len(ids)} tokens, round-trip: {text}")
```

### Train

```python
from tokenbeast import extract, train

# Extract a dictionary from training data
extract(dataset="data.txt", output="tokens.dict")

# Train a vocabulary (releases the GIL)
train(
    dataset="data.txt",
    dictionary="tokens.dict",
    vocab_size=4096,
    algorithm="tokenbeast",  # or "tokenmonster"
    dir="results",
)
```

### HuggingFace Integration

```python
from tokenbeast import TokenBeastTokenizer

tok = TokenBeastTokenizer(vocab_file="results/final.tb")
encoded = tok("Hello world")
print(encoded["input_ids"])
print(tok.decode(encoded["input_ids"]))

# Save and reload
tok.save_pretrained("/tmp/my_tokenizer")
tok2 = TokenBeastTokenizer.from_pretrained("/tmp/my_tokenizer")
```

Requires `transformers`. Install with `pip install -e ".[hf]"` or `pip install transformers`.

## Quick Start (CLI)

### Extract a dictionary

```bash
tokenbeast-train extract \
  --dataset data.txt \
  --output tokens.dict
```

### Train a vocabulary

```bash
# TokenBeast (default)
tokenbeast-train train \
  --dataset data.txt \
  --dictionary tokens.dict \
  --vocab-size 4096 \
  --dir results

# TokenMonster
tokenbeast-train train \
  --algorithm tokenmonster \
  --dataset data.txt \
  --dictionary tokens.dict \
  --vocab-size 4096 \
  --dir results_tm \
  --workers 8
```

### Evaluate a vocabulary

```bash
tokenbeast-train eval \
  --vocab results/final.tb \
  --dataset data.txt
```

## Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--algorithm` | `tokenbeast` | `tokenbeast` (alias: `tb`) or `tokenmonster` (alias: `tm`) |
| `--dataset` | required | Path to training data file |
| `--dictionary` | required | Path to token dictionary (from `extract`) |
| `--vocab-size` | 32000 | Target vocabulary size |
| `--dir` | `results` | Output directory for vocab snapshots |
| `--workers` | 8 | Number of parallel workers (TokenMonster only) |
| `--percentage` | 15 | Percentage of dataset to use as scoring strips |
| `--midway-target` | 0 | When to switch from strips to full dataset (0 = 6x vocab_size) |
| `--keep-trying` | 1000 | Max no-improvement attempts in refinement phase |
| `--capcode` | 2 | Capcode encoding level (0 = none, 1 = basic, 2 = full) |
| `--charset` | `utf8` | Character set: `utf8`, `utf16`, `none` |
| `--level` | `clean` | Optimization level: `unfiltered`, `clean`, `balanced`, `consistent`, `strict` |
| `--seed-vocab` | none | Optional seed vocabulary to warm-start training |
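
Note that `--midway-target 0` means the switch point defaults to six times the target vocabulary size. A small illustration of that resolution (the helper name is hypothetical, not part of the CLI):

```python
def effective_midway_target(midway_target: int, vocab_size: int) -> int:
    """Resolve the --midway-target default: 0 means 6x vocab_size."""
    return midway_target if midway_target > 0 else 6 * vocab_size

print(effective_midway_target(0, 4096))      # 24576
print(effective_midway_target(50000, 4096))  # 50000
```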

## Building

```bash
cargo build --release
```

The training binary will be at `target/release/tokenbeast-train`.
