sudachi.rs - English README
sudachi.rs is a Rust implementation of Sudachi, a Japanese morphological analyzer.
A Python implementation is also available; see the SudachiPy Documentation.
TL;DR
Install Python version
or Rust version
$ cd ./sudachi.rs
$ cargo build --release
$ cargo install --path sudachi-cli/
$ ./fetch_dictionary.sh
$ echo "高輪ゲートウェイ駅" | sudachi
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS
Example
Multi-granular Tokenization
$ echo 選挙管理委員会 | sudachi
選挙管理委員会	名詞,固有名詞,一般,*,*,*	選挙管理委員会
EOS
$ echo 選挙管理委員会 | sudachi --mode A
選挙	名詞,普通名詞,サ変可能,*,*,*	選挙
管理	名詞,普通名詞,サ変可能,*,*,*	管理
委員	名詞,普通名詞,一般,*,*,*	委員
会	名詞,普通名詞,一般,*,*,*	会
EOS
Normalized Form
$ echo 打込む かつ丼 附属 vintage | sudachi
打込む	動詞,一般,*,*,五段-マ行,終止形-一般	打ち込む
 	空白,*,*,*,*,*	 
かつ丼	名詞,普通名詞,一般,*,*,*	カツ丼
 	空白,*,*,*,*,*	 
附属	名詞,普通名詞,サ変可能,*,*,*	付属
 	空白,*,*,*,*,*	 
vintage	名詞,普通名詞,一般,*,*,*	ビンテージ
EOS
Wakati (space-delimited surface form) Output
$ cat lemon.txt
えたいの知れない不吉な塊が私の心を始終圧えつけていた。
焦躁と言おうか、嫌悪と言おうか――酒を飲んだあとに宿酔があるように、酒を毎日飲んでいると宿酔に相当した時期がやって来る。
それが来たのだ。これはちょっといけなかった。
$ sudachi --wakati lemon.txt
えたい の 知れ ない 不吉 な 塊 が 私 の 心 を 始終 圧え つけ て い た 。
焦躁 と 言おう か 、 嫌悪 と 言おう か ― ― 酒 を 飲ん だ あと に 宿酔 が ある よう に 、 酒 を 毎日 飲ん で いる と 宿酔 に 相当 し た 時期 が やっ て 来る 。
それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。
Setup
You need sudachi.rs, the default plugins, and a dictionary. (This crate does not include a dictionary.)
1. Get the source code
2. Download a Sudachi Dictionary
Sudachi requires a dictionary to operate.
You can download a dictionary ZIP file from WorksApplications/SudachiDict (choose one from small, core, or full), unzip it, and place the system_*.dic file somewhere.
With the default setting file, sudachi.rs expects the dictionary to be placed at resources/system.dic.
Convenience Script
Optionally, you can use the fetch_dictionary.sh shell script to download a dictionary and install it to resources/system.dic (any existing file there is overwritten).
./fetch_dictionary.sh
# fetch dictionary of specified version and type
./fetch_dictionary.sh 20241021 small
3. Build
Build (bake dictionary into binary)
This feature is currently unimplemented and does not work; see #35.
Specify the bake_dictionary feature to embed a dictionary into the binary.
The sudachi executable will contain the dictionary binary.
The baked dictionary is used if no dictionary is specified via a CLI option or the setting file.
You must specify the path to the dictionary file in the SUDACHI_DICT_PATH environment variable when building.
SUDACHI_DICT_PATH is relative to the sudachi.rs directory (or absolute).
Example on a Unix-like system:
$ ./fetch_dictionary.sh
# Build with bake_dictionary feature (relative path)
$ env SUDACHI_DICT_PATH=resources/system.dic cargo build --release --features bake_dictionary
# or
# Build with bake_dictionary feature (absolute path)
$ env SUDACHI_DICT_PATH=/path/to/my-sudachi.dic cargo build --release --features bake_dictionary
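The lookup order described in this section (CLI option first, then the setting file, then the baked dictionary) can be sketched as a small function. This is only an illustration of the priority rule, not the sudachi.rs implementation: DictSource and resolve_dictionary are hypothetical names, and the real CLI may resolve paths differently.

```rust
use std::path::PathBuf;

/// Where the dictionary would come from (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum DictSource {
    /// Path given with the -l / --dict option.
    CliOption(PathBuf),
    /// Path taken from the setting file.
    SettingFile(PathBuf),
    /// Dictionary embedded at build time with the bake_dictionary feature.
    Baked,
}

/// Picks a dictionary source by priority: CLI option, setting file, baked.
/// Returns None when nothing is available.
fn resolve_dictionary(
    cli: Option<PathBuf>,
    setting: Option<PathBuf>,
    has_baked: bool,
) -> Option<DictSource> {
    cli.map(DictSource::CliOption)
        .or_else(|| setting.map(DictSource::SettingFile))
        .or(if has_baked { Some(DictSource::Baked) } else { None })
}
```

With this sketch, resolve_dictionary(None, None, true) yields DictSource::Baked, while any path passed on the command line wins over the other two sources.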
4. Install
$ cargo install --path sudachi-cli/
$ which sudachi
/Users/<USER>/.cargo/bin/sudachi
$ sudachi -h
sudachi 0.6.0
A Japanese tokenizer
...
Usage as a command
A Japanese tokenizer
Usage: sudachi [OPTIONS] [FILE] [COMMAND]
Commands:
build
Builds system dictionary
ubuild
Builds user dictionary
dump
help
Print this message or the help of the given subcommand(s)
Arguments:
[FILE]
Input text file: If not present, read from STDIN
Options:
-r, --config-file <CONFIG_FILE>
Path to the setting file in JSON format
-p, --resource_dir <RESOURCE_DIR>
Path to the root directory of resources
-m, --mode <MODE>
Split unit: "A" (short), "B" (middle), or "C" (Named Entity) [default: C]
-o, --output <OUTPUT_FILE>
Output text file: If not present, use stdout
-a, --all
Prints all fields
-w, --wakati
Outputs only surface form
-d, --debug
Debug mode: Print the debug information
-l, --dict <DICTIONARY_PATH>
Path to sudachi dictionary. If None, it refer config and then baked dictionary
--split-sentences <SPLIT_SENTENCES>
How to split sentences [default: yes]
-h, --help
Print help (see more with '--help')
-V, --version
Print version
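The -m / --mode option in the help text above accepts "A", "B", or "C" and defaults to C. A minimal sketch of that rule follows. SplitMode and mode_from_option are hypothetical names for illustration and are not taken from the sudachi.rs API; the sketch assumes only the uppercase spellings shown in the help text are valid.

```rust
use std::str::FromStr;

/// Split unit, as listed in the help text (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SplitMode {
    A, // short
    B, // middle
    C, // named entity
}

impl Default for SplitMode {
    /// The help text states that C is the default.
    fn default() -> Self {
        SplitMode::C
    }
}

impl FromStr for SplitMode {
    type Err = String;

    /// Assumption: only the uppercase letters from the help text are accepted.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "A" => Ok(SplitMode::A),
            "B" => Ok(SplitMode::B),
            "C" => Ok(SplitMode::C),
            other => Err(format!("unknown split mode: {}", other)),
        }
    }
}

/// Falls back to the default when the option is absent.
fn mode_from_option(arg: Option<&str>) -> Result<SplitMode, String> {
    match arg {
        Some(s) => s.parse(),
        None => Ok(SplitMode::default()),
    }
}
```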
Output
Columns are tab separated.
- Surface
- Part-of-Speech Tags (comma separated)
- Normalized Form
When you add the -a (--all) flag, it additionally outputs
- Dictionary Form
- Reading Form
- Dictionary ID
  - 0 for the system dictionary
  - 1 and above for the user dictionaries
  - -1 if a word is Out-of-Vocabulary (not in the dictionary)
- Synonym group IDs
- (OOV) if a word is Out-of-Vocabulary (not in the dictionary)
Wai Guo Ren Can Zheng Quan Ming Ci ,Pu Tong Ming Ci ,Yi Ban ,*,*,* Wai Guo Ren Can Zheng Quan Wai Guo Ren Can Zheng Quan gaikokuzinsanseiken 0 []
EOS
A Ming Ci ,Pu Tong Ming Ci ,Yi Ban ,*,*,* A A -1 [] (OOV)
quei Ming Ci ,Pu Tong Ming Ci ,Yi Ban ,*,*,* quei quei -1 [] (OOV)
EOS
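If you post-process the -a output in your own program, the column layout listed above can be read with a plain tab split. The following is a minimal sketch, not part of sudachi.rs: TokenLine and parse_all_fields_line are hypothetical names. It assumes the (OOV) marker arrives as its own trailing tab-separated column, and it keeps the synonym group IDs as raw text rather than guessing their exact syntax.

```rust
/// One token line of `sudachi -a` output (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct TokenLine {
    surface: String,
    pos: Vec<String>,
    normalized_form: String,
    dictionary_form: String,
    reading_form: String,
    /// 0 = system dictionary, 1 and above = user dictionaries, -1 = OOV.
    dictionary_id: i32,
    /// Kept as raw text, e.g. "[]".
    synonym_group_ids: String,
    is_oov: bool,
}

/// Parses one tab-separated line.
/// Returns None for the "EOS" marker and for lines with too few columns.
fn parse_all_fields_line(line: &str) -> Option<TokenLine> {
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
    if line == "EOS" {
        return None;
    }
    let cols: Vec<&str> = line.split('\t').collect();
    if cols.len() < 7 {
        return None;
    }
    Some(TokenLine {
        surface: cols[0].to_string(),
        pos: cols[1].split(',').map(str::to_string).collect(),
        normalized_form: cols[2].to_string(),
        dictionary_form: cols[3].to_string(),
        // May be empty, as in the OOV example above.
        reading_form: cols[4].to_string(),
        dictionary_id: cols[5].parse().ok()?,
        synonym_group_ids: cols[6].to_string(),
        // Assumption: the marker is an eighth column when present.
        is_oov: cols.get(7) == Some(&"(OOV)"),
    })
}
```

Splitting on the tab character rather than on arbitrary whitespace matters here, because a surface form can itself be a space and the reading form can be empty.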
When you add the -w (--wakati) flag, it outputs space-delimited surface forms instead.
$ echo "外国人参政権" | sudachi -m A -w
外国 人 参政 権
API
See API reference page.
ToDo
- Out of Vocabulary handling
- Easy dictionary file install & management, similar to SudachiPy
- Registration to crates.io
References
Sudachi
- WorksApplications/Sudachi
- WorksApplications/SudachiDict
- WorksApplications/SudachiPy
- msnoigrs/gosudachi
Morphological Analyzers in Rust
- agatan/yoin: A Japanese Morphological Analyzer written in pure Rust
- wareya/notmecab-rs: notmecab-rs is a very basic mecab clone, designed only to do parsing, not training.
Logo
- Sudachi Logo
- Crab illustration: Pixabay