# TokenBeast

A Rust tokenizer using the ungreedy 6-branch algorithm. Produces 4-15% fewer tokens than BPE at the same vocabulary size, with 8-9x faster tokenization throughput.

TokenBeast is a ground-up Rust rewrite of TokenMonster. Training finishes faster while matching or surpassing Go TokenMonster's compression quality.

## Training Algorithms

Two algorithms are available via `--algorithm`:

TokenBeast (default) is a single-pass iterative distillation algorithm. It starts with millions of candidate tokens extracted from the corpus, then iteratively removes the lowest-scoring tokens until the target vocabulary size is reached. It protects short tokens from premature removal, uses a gradual removal schedule, and finishes with a tournament refinement phase that gives rejected tokens a second chance.
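
The prune loop can be sketched roughly as follows. This is an illustrative simplification, not the actual implementation: the function names, the toy scoring rule, and the batch schedule are assumptions, and the tournament refinement phase is omitted.

```python
# Hypothetical sketch of iterative distillation: score all candidates,
# then prune the lowest scorers in small batches until the target size.

def score(tok, corpus):
    """Toy score: total characters this token could cover in the corpus."""
    return corpus.count(tok) * len(tok)

def distill(candidates, corpus, target_size, batch_frac=0.01, protect_len=2):
    """Iteratively remove the lowest-scoring tokens until target_size remain."""
    vocab = set(candidates)
    while len(vocab) > target_size:
        # Short tokens are protected from premature removal.
        removable = [t for t in vocab if len(t) > protect_len]
        if not removable:
            break
        # Gradual schedule: prune only a small batch per iteration.
        batch = max(1, int(len(vocab) * batch_frac))
        removable.sort(key=lambda t: score(t, corpus))
        for tok in removable[:batch]:
            if len(vocab) <= target_size:
                break
            vocab.discard(tok)
    return vocab
```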

TokenMonster is a faithful Rust port of the Go multi-worker training algorithm. It uses parallel workers with dataset strips, union-based voting for token removal, and random swap refinement. Use this when you want results comparable to the original Go implementation.
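
The multi-worker scheme can be sketched as below. This is an illustrative simplification rather than the actual port: the worker scoring rule and function names are assumptions; only the "each worker nominates tokens from its strip, and the union of nominations is removed" shape comes from the description above.

```python
# Illustrative sketch of multi-worker, union-based token removal.
from concurrent.futures import ThreadPoolExecutor

def worker_vote(strip, vocab, n_remove):
    """Each worker scores tokens on its own dataset strip and nominates
    the n_remove lowest scorers for removal (toy scoring rule)."""
    scores = {tok: strip.count(tok) * len(tok) for tok in vocab}
    return set(sorted(vocab, key=lambda t: scores[t])[:n_remove])

def union_removal_round(strips, vocab, n_remove):
    """Remove every token nominated by any worker (the union of votes)."""
    with ThreadPoolExecutor() as pool:
        votes = pool.map(worker_vote, strips,
                         [vocab] * len(strips), [n_remove] * len(strips))
        to_remove = set().union(*votes)
    return [t for t in vocab if t not in to_remove]
```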

See docs/algorithm-comparison.md for a detailed breakdown of how each algorithm works and where they differ.

## Results

Trained on tiny_shakespeare (1.1 MB). All systems trained on the same data.

### Vocab 4096

| System | chr/tok | MB/s | vs SentencePiece |
|---|---|---|---|
| SentencePiece BPE | 3.679 | 3.5 | baseline |
| TokenMonster (Go, fast) | 3.927 | 29.0 | +6.8% |
| TokenMonster (Go, full) | 3.762 | 30.6 | +2.3% |
| TokenBeast (Rust) | 3.812 | 32.6 | +3.6% |

### Vocab 8192

| System | chr/tok | MB/s | vs SentencePiece |
|---|---|---|---|
| SentencePiece BPE | 4.000 | 3.2 | baseline |
| TokenMonster (Go, fast) | 4.338 | 25.9 | +8.4% |
| TokenMonster (Go, full) | 4.246 | 28.6 | +6.2% |
| TokenBeast (Rust) | 4.591 | 30.4 | +14.8% |

TokenBeast beats the SentencePiece baseline at both sizes, and its advantage grows with vocabulary size: at vocab 8192 it leads every system tested. All variants tokenize at 26-33 MB/s, roughly 8-9x faster than SentencePiece.
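
To make the columns concrete: chr/tok is characters per token (higher means better compression), and "vs SentencePiece" is the relative chr/tok improvement over the baseline. Using the vocab-8192 numbers from the table above:

```python
# Relative compression improvement, vocab 8192 row of the table above.
baseline_chr_per_tok = 4.000    # SentencePiece BPE
tokenbeast_chr_per_tok = 4.591  # TokenBeast (Rust)

improvement = (tokenbeast_chr_per_tok - baseline_chr_per_tok) / baseline_chr_per_tok
print(f"+{improvement * 100:.1f}%")  # +14.8%, matching the table
```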

Full benchmark tables including Rust TokenMonster results and vocabulary overlap analysis are in docs/benchmarks.md.

## Crates

| Crate | Description |
|---|---|
| `tokenbeast` | Core library: vocab loading, tokenization, encoding/decoding |
| `tokenbeast-train` | Training library + CLI: dictionary extraction and vocabulary training |
| `tokenbeast-py` | Python bindings via PyO3: tokenization, training, and HuggingFace integration |

## Quick Start (Python)

```bash
cd crates/tokenbeast-py
pip install -e .

# With HuggingFace support
pip install -e ".[hf]"
```

### Tokenize

```python
from tokenbeast import Vocab

vocab = Vocab.load("results/final.tb")
ids = vocab.tokenize("Hello world")
text = vocab.decode_str(ids)
print(f"{len(ids)} tokens, round-trip: {text}")
```

### Train

```python
from tokenbeast import extract, train

# Extract a dictionary from training data
extract(dataset="data.txt", output="tokens.dict")

# Train a vocabulary (releases the GIL)
train(
    dataset="data.txt",
    dictionary="tokens.dict",
    vocab_size=4096,
    algorithm="tokenbeast",  # or "tokenmonster"
    dir="results",
)
```

### HuggingFace Integration

```python
from tokenbeast import TokenBeastTokenizer

tok = TokenBeastTokenizer(vocab_file="results/final.tb")
encoded = tok("Hello world")
print(encoded["input_ids"])
print(tok.decode(encoded["input_ids"]))

# Save and reload
tok.save_pretrained("/tmp/my_tokenizer")
tok2 = TokenBeastTokenizer.from_pretrained("/tmp/my_tokenizer")
```

Requires `transformers`. Install with `pip install -e ".[hf]"` or `pip install transformers`.

## Quick Start (CLI)

### Extract a dictionary

```bash
tokenbeast-train extract \
  --dataset data.txt \
  --output tokens.dict
```

### Train a vocabulary

```bash
# TokenBeast (default)
tokenbeast-train train \
  --dataset data.txt \
  --dictionary tokens.dict \
  --vocab-size 4096 \
  --dir results

# TokenMonster
tokenbeast-train train \
  --algorithm tokenmonster \
  --dataset data.txt \
  --dictionary tokens.dict \
  --vocab-size 4096 \
  --dir results_tm \
  --workers 8
```

### Evaluate a vocabulary

```bash
tokenbeast-train eval \
  --vocab results/final.tb \
  --dataset data.txt
```

## Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--algorithm` | `tokenbeast` | `tokenbeast` (alias: `tb`) or `tokenmonster` (alias: `tm`) |
| `--dataset` | required | Path to training data file |
| `--dictionary` | required | Path to token dictionary (from `extract`) |
| `--vocab-size` | 32000 | Target vocabulary size |
| `--dir` | `results` | Output directory for vocab snapshots |
| `--workers` | 8 | Number of parallel workers (TokenMonster only) |
| `--percentage` | 15 | Percentage of dataset to use as scoring strips |
| `--midway-target` | 0 | When to switch from strips to full dataset (0 = 6x vocab_size) |
| `--keep-trying` | 1000 | Max no-improvement attempts in refinement phase |
| `--capcode` | 2 | Capcode encoding level (0 = none, 1 = basic, 2 = full) |
| `--charset` | `utf8` | Character set: `utf8`, `utf16`, `none` |
| `--level` | `clean` | Optimization level: `unfiltered`, `clean`, `balanced`, `consistent`, `strict` |
| `--seed-vocab` | none | Optional seed vocabulary to warm-start training |
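
Note that `--midway-target 0` means the switch point defaults to six times the target vocabulary size. A small illustration of that resolution (the helper name is hypothetical, not part of the CLI):

```python
def effective_midway_target(midway_target: int, vocab_size: int) -> int:
    """Resolve the --midway-target default: 0 means 6x vocab_size."""
    return midway_target if midway_target > 0 else 6 * vocab_size

print(effective_midway_target(0, 4096))      # 24576
print(effective_midway_target(50000, 4096))  # 50000
```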

## Building

```bash
cargo build --release
```

The training binary will be at `target/release/tokenbeast-train`.
