sudachi.rs - English README

sudachi.rs is a Rust implementation of Sudachi, a Japanese morphological analyzer.

日本語 README (Japanese).

Python implementation is also available: SudachiPy Documentation.

TL;DR

Install Python version

pip install --upgrade 'sudachipy>=0.6.10'

or Rust version

$ git clone https://github.com/WorksApplications/sudachi.rs.git
$ cd ./sudachi.rs

$ cargo build --release
$ cargo install --path sudachi-cli/
$ ./fetch_dictionary.sh

$ echo "高輪ゲートウェイ駅" | sudachi
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS

Example

Multi-granular Tokenization

$ echo 選挙管理委員会 | sudachi
選挙管理委員会	名詞,固有名詞,一般,*,*,*	選挙管理委員会
EOS

$ echo 選挙管理委員会 | sudachi --mode A
選挙	名詞,普通名詞,サ変可能,*,*,*	選挙
管理	名詞,普通名詞,サ変可能,*,*,*	管理
委員	名詞,普通名詞,一般,*,*,*	委員
会	名詞,普通名詞,一般,*,*,*	会
EOS
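
To compare the split units side by side, the same text can be fed through each mode in a loop. The following is a minimal sketch, assuming `sudachi` is installed and can find a dictionary. The `compare_modes` function and the `SUDACHI_BIN` override variable are hypothetical names introduced for this illustration; only the `--mode` option and its values A, B, and C come from the CLI help.

```shell
# Tokenize the same text in each split mode and label each result.
# SUDACHI_BIN is a hypothetical override for the command name (default: sudachi).
compare_modes() {
  text=$1
  for mode in A B C; do
    echo "== mode $mode =="
    printf '%s\n' "$text" | "${SUDACHI_BIN:-sudachi}" --mode "$mode"
  done
}

# Example call, run only when the command is actually available.
if command -v "${SUDACHI_BIN:-sudachi}" >/dev/null 2>&1; then
  compare_modes 選挙管理委員会
fi
```

Labeling each run makes it easier to see where a shorter split unit breaks a compound that mode C keeps whole.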

Normalized Form

$ echo 打込む かつ丼 附属 vintage | sudachi
打込む	動詞,一般,*,*,五段-マ行,終止形-一般	打ち込む
 	空白,*,*,*,*,*	 
かつ丼	名詞,普通名詞,一般,*,*,*	カツ丼
 	空白,*,*,*,*,*	 
附属	名詞,普通名詞,サ変可能,*,*,*	付属
 	空白,*,*,*,*,*	 
vintage	名詞,普通名詞,一般,*,*,*	ビンテージ
EOS

Wakati (space-delimited surface form) Output

$ cat lemon.txt
えたいの知れない不吉な塊が私の心を始終圧えつけていた。
焦躁と言おうか、嫌悪と言おうか――酒を飲んだあとに宿酔があるように、酒を毎日飲んでいると宿酔に相当した時期がやって来る。
それが来たのだ。これはちょっといけなかった。

$ sudachi --wakati lemon.txt
えたい の 知れ ない 不吉 な 塊 が 私 の 心 を 始終 圧え つけ て い た 。
焦躁 と 言おう か 、 嫌悪 と 言おう か ― ― 酒 を 飲ん だ あと に 宿酔 が ある よう に 、 酒 を 毎日 飲ん で いる と 宿酔 に 相当 し た 時期 が やっ て 来る 。
それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。
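
Because wakati output separates tokens with plain spaces, ordinary text tools can consume it directly. As a small sketch, the hypothetical helper `count_tokens` below counts the tokens on each line. The sample line is hand-copied in the shape of the output above, not produced by a live run.

```shell
# Count tokens per line of wakati output read from stdin.
# awk splits on runs of blanks by default, so NF is the number of tokens.
count_tokens() {
  awk '{ print NF }'
}

# Hand-copied sample line in the shape of the wakati output.
printf '%s\n' 'それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。' | count_tokens
```

With a working installation, the same helper could be placed at the end of a `sudachi --wakati lemon.txt` pipeline.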

Setup

You need sudachi.rs, the default plugins, and a dictionary. (This crate does not include a dictionary.)

1. Get the source code

git clone https://github.com/WorksApplications/sudachi.rs.git

2. Download a Sudachi Dictionary

Sudachi requires a dictionary to operate. Download a dictionary ZIP file from WorksApplications/SudachiDict (choose one of small, core, or full), unzip it, and place the system_*.dic file where sudachi.rs can find it. The default setting file assumes the dictionary is at resources/system.dic.
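
The manual placement step can be scripted. The sketch below assumes the ZIP file has already been downloaded and unzipped into some directory. The `place_dictionary` function and its arguments are hypothetical names used for illustration; only the system_*.dic file pattern and the resources/system.dic target come from the description above.

```shell
# Copy the first system_*.dic found under an unzipped SudachiDict directory
# to the location the default setting file expects.
# Usage: place_dictionary <unzipped_dir> [sudachi.rs checkout, default: .]
place_dictionary() {
  src_dir=$1
  repo_dir=${2:-.}
  # Pick the first matching dictionary file, if any.
  dict=$(find "$src_dir" -type f -name 'system_*.dic' | head -n 1)
  if [ -z "$dict" ]; then
    echo "no system_*.dic found under $src_dir" >&2
    return 1
  fi
  mkdir -p "$repo_dir/resources"
  # Overwrites an existing resources/system.dic.
  cp "$dict" "$repo_dir/resources/system.dic"
}
```

For example, `place_dictionary ~/Downloads/sudachi-dict .` run from the sudachi.rs checkout would copy the dictionary into place, where the download path shown is only a placeholder.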

Convenience Script

Optionally, you can use the fetch_dictionary.sh shell script to download a dictionary and install it to resources/system.dic (overwriting any existing file).

# fetch latest core dictionary
./fetch_dictionary.sh

# fetch dictionary of specified version and type
./fetch_dictionary.sh 20241021 small

3. Build

cargo build --release

Build (bake dictionary into binary)

This is not implemented at present and does not work; see #35.

Specify the bake_dictionary feature to embed a dictionary into the binary. The sudachi executable will then contain the dictionary binary. The baked dictionary is used if none is specified via a CLI option or the setting file.

You must specify the path to the dictionary file in the SUDACHI_DICT_PATH environment variable when building. SUDACHI_DICT_PATH is either relative to the sudachi.rs directory or absolute.

Example on Unix-like system:

# Download dictionary to resources/system.dic
$ ./fetch_dictionary.sh

# Build with bake_dictionary feature (relative path)
$ env SUDACHI_DICT_PATH=resources/system.dic cargo build --release --features bake_dictionary

# or

# Build with bake_dictionary feature (absolute path)
$ env SUDACHI_DICT_PATH=/path/to/my-sudachi.dic cargo build --release --features bake_dictionary

4. Install

$ cd sudachi.rs/
$ cargo install --path sudachi-cli/

$ which sudachi
/Users/<USER>/.cargo/bin/sudachi

$ sudachi -h
sudachi 0.6.0
A Japanese tokenizer
...

Usage as a command

$ sudachi -h
A Japanese tokenizer

Usage: sudachi [OPTIONS] [FILE] [COMMAND]

Commands:
  build
          Builds system dictionary
  ubuild
          Builds user dictionary
  dump

  help
          Print this message or the help of the given subcommand(s)

Arguments:
  [FILE]
          Input text file: If not present, read from STDIN

Options:
  -r, --config-file <CONFIG_FILE>
          Path to the setting file in JSON format
  -p, --resource_dir <RESOURCE_DIR>
          Path to the root directory of resources
  -m, --mode <MODE>
          Split unit: "A" (short), "B" (middle), or "C" (Named Entity) [default: C]
  -o, --output <OUTPUT_FILE>
          Output text file: If not present, use stdout
  -a, --all
          Prints all fields
  -w, --wakati
          Outputs only surface form
  -d, --debug
          Debug mode: Print the debug information
  -l, --dict <DICTIONARY_PATH>
          Path to sudachi dictionary. If None, it refer config and then baked dictionary
      --split-sentences <SPLIT_SENTENCES>
          How to split sentences [default: yes]
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version

Output

Columns are tab separated.

  • Surface
  • Part-of-Speech Tags (comma separated)
  • Normalized Form

When you add the -a (--all) flag, it additionally outputs

  • Dictionary Form
  • Reading Form
  • Dictionary ID
    • 0 for the system dictionary
    • 1 and above for the user dictionaries
    • -1 if a word is Out-of-Vocabulary (not in the dictionary)
  • Synonym group IDs
  • (OOV) if a word is Out-of-Vocabulary (not in the dictionary)
$ echo "Wai Guo Ren Can Zheng Quan " | sudachi -a
Wai Guo Ren Can Zheng Quan Ming Ci ,Pu Tong Ming Ci ,Yi Ban ,*,*,* Wai Guo Ren Can Zheng Quan Wai Guo Ren Can Zheng Quan gaikokuzinsanseiken 0 []
EOS
echo "A quei" | sudachipy -a
A Ming Ci ,Pu Tong Ming Ci ,Yi Ban ,*,*,* A A -1 [] (OOV)
quei Ming Ci ,Pu Tong Ming Ci ,Yi Ban ,*,*,* quei quei -1 [] (OOV)
EOS
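
Since the columns are tab-separated, the `-a` output can be post-processed with standard tools. The sketch below assumes the column order listed above, with the dictionary ID in the sixth column, and prints the surface form of every Out-of-Vocabulary token. `list_oov` and `sample_output` are hypothetical helpers, and the sample lines are hand-written in the shape of the example rather than captured from a live run.

```shell
# Print the surface form of every out-of-vocabulary token.
# Expects `sudachi -a` style output on stdin: tab-separated columns,
# with the dictionary ID in column 6 (-1 means not in any dictionary).
list_oov() {
  awk -F'\t' '$6 == "-1" { print $1 }'
}

# Hand-written sample in the same shape as the -a output.
sample_output() {
  printf '外国人参政権\t名詞,普通名詞,一般,*,*,*\t外国人参政権\t外国人参政権\tガイコクジンサンセイケン\t0\t[]\n'
  printf 'EOS\n'
  printf 'quei\t名詞,普通名詞,一般,*,*,*\tquei\tquei\t\t-1\t[]\t(OOV)\n'
  printf 'EOS\n'
}

sample_output | list_oov
```

Filtering on the dictionary ID column rather than on the trailing (OOV) marker keeps the check tied to a fixed column position.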

When you add the -w (--wakati) flag, it outputs space-delimited surface forms instead.

$ echo "外国人参政権" | sudachi -m A -w
外国 人 参政 権

API

See API reference page.

ToDo

  • Out of Vocabulary handling
  • Easy dictionary file install & management, similar to SudachiPy
  • Registration to crates.io

License

Apache-2.0 license