tokenizers/bindings/node at main * huggingface/tokenizers * GitHub

Name
.cargo
.yarn/releases
examples/documentation
lib/bindings
npm
src
.editorconfig
.eslintrc.yml
.gitattributes
.gitignore
.prettierignore
.taplo.toml
.yarnrc.yml
Cargo.toml
LICENSE
Makefile
README.md
build.rs
index.d.ts
index.js
jest.config.js
package.json
rustfmt.toml
tsconfig.json
types.ts
yarn.lock

NodeJS implementation of today's most used tokenizers, with a focus on performance and versatility. These are bindings over the Rust implementation. If you are interested in the high-level design, you can find it described with the Rust implementation.

Main features

Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
Easy to use, but also extremely versatile.
Designed for research and production.
Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
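
To make the truncation and padding feature concrete, here is a minimal conceptual sketch in plain TypeScript. It is not the library's implementation and does not call the bindings: `truncateAndPad`, `Prepared`, and the sample ids are hypothetical names and values chosen for illustration, and the sketch ignores special tokens and truncation strategies.

```typescript
// Conceptual sketch only: shows what truncating and padding a sequence of
// token ids to a fixed length looks like. Not the library's own code.
interface Prepared {
  ids: number[];
  attentionMask: number[];
}

// Hypothetical helper: cut `ids` down to `maxLength`, then right-pad with
// `padId`. The attention mask is 1 for real tokens and 0 for padding.
function truncateAndPad(ids: number[], maxLength: number, padId: number): Prepared {
  const kept = ids.slice(0, maxLength);
  const padCount = maxLength - kept.length;
  return {
    ids: [...kept, ...new Array<number>(padCount).fill(padId)],
    attentionMask: [...kept.map(() => 1), ...new Array<number>(padCount).fill(0)],
  };
}

// Hypothetical token ids, written by hand for illustration.
const shortInput = truncateAndPad([5, 6, 7], 5, 0);
const longInput = truncateAndPad([1, 2, 3, 4, 5, 6, 7], 5, 0);

console.log(shortInput); // padded up to length 5
console.log(longInput); // truncated down to length 5
```

Both results have the same length, which is what lets sequences of different sizes be batched together for a model.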

Installation

npm install tokenizers@latest

Basic example

import { Tokenizer } from "tokenizers";

const tokenizer = await Tokenizer.fromFile("tokenizer.json");
const wpEncoded = await tokenizer.encode("Who is John?");

console.log(wpEncoded.getLength());
console.log(wpEncoded.getTokens());
console.log(wpEncoded.getIds());
console.log(wpEncoded.getAttentionMask());
console.log(wpEncoded.getOffsets());
console.log(wpEncoded.getOverflowing());
console.log(wpEncoded.getSpecialTokensMask());
console.log(wpEncoded.getTypeIds());
console.log(wpEncoded.getWordIds());
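
The offsets returned by `getOffsets()` are what make alignment tracking usable: each token can be mapped back to the span of the original sentence it came from. The sketch below shows one way to do that, assuming each offset is a `[start, end)` index pair into the input string. `sliceByOffsets` is a hypothetical helper, and the sample offsets are written by hand rather than produced by a real tokenizer, so actual values will depend on the `tokenizer.json` in use.

```typescript
// Assumption: each offset is a [start, end) index pair into the original string.
type Offset = [number, number];

// Hypothetical helper: return the slice of the original text for each token.
function sliceByOffsets(text: string, offsets: Offset[]): string[] {
  return offsets.map(([start, end]) => text.slice(start, end));
}

// Hand-written sample offsets for illustration only. In real use they would
// come from wpEncoded.getOffsets(). Special tokens are assumed to map to the
// empty span [0, 0].
const text = "Who is John?";
const sampleOffsets: Offset[] = [
  [0, 0],
  [0, 3],
  [4, 6],
  [7, 11],
  [11, 12],
  [0, 0],
];

const pieces = sliceByOffsets(text, sampleOffsets);
console.log(pieces);
```

With these sample offsets, the non-empty slices are the words and punctuation of the input, while the assumed special-token entries yield empty strings.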

License

Apache License 2.0
