PyThaiNLP: Thai Natural Language Processing in Python

pythainlp.org | Tutorials | License info | Model cards | Adopters | Thai-language documentation

Designed to be a Thai-focused counterpart to NLTK, PyThaiNLP provides standard tools for linguistic analysis under an Apache-2.0 license, with its data and models covered by CC0-1.0 and CC-BY-4.0.

pip install pythainlp

Version | Python version | Changes | Documentation
5.2.0   | 3.7+           | Log     | pythainlp.org/docs
dev     | 3.9+           | Log     | pythainlp.org/dev-docs

Features

  • Linguistic units: Sentence, word, and subword segmentation (sent_tokenize, word_tokenize, subword_tokenize).

  • Tagging: Part-of-speech tagging (pos_tag).

  • Transliteration: Romanization (romanize) and IPA conversion (transliterate).

  • Correction: Spelling suggestion and correction (spell, correct).

  • Utilities: Soundex, collation, number-to-text (bahttext), datetime formatting (thai_strftime), and keyboard layout correction.

  • Data: Built-in Thai character sets, word lists, and stop words.

  • CLI: Command-line interface via thainlp.

    thainlp data catalog # List datasets
    thainlp help # Show usage
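
The functions listed above can also be called directly from Python. The following is a minimal sketch, assuming a core installation and the module paths pythainlp.tokenize and pythainlp.util. The helper name summarize and its arguments are illustrative only and are not part of the library. The imports sit inside the function so the snippet can be loaded even where PyThaiNLP is not installed.

```python
def summarize(text, amount):
    """Tokenize Thai text and spell out an amount in Thai baht.

    Assumes `pip install pythainlp` has already been run.
    """
    # Imported lazily so this module loads without PyThaiNLP present.
    from pythainlp.tokenize import word_tokenize
    from pythainlp.util import bahttext

    return {
        "words": word_tokenize(text),      # list of word tokens (default engine)
        "amount_text": bahttext(amount),   # amount written out in Thai
    }
```

Calling summarize("ฉันรักภาษาไทย", 1250.50) should return a dictionary holding the word tokens and the amount in Thai words.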

Installation options

To install with specific extras (e.g., translate, wordnet, full):

pip install "pythainlp[extra1,extra2,...]"

Available extras include:

  • compact -- install a stable and small subset of dependencies (recommended)
  • translate -- machine translation support
  • wordnet -- WordNet support
  • full -- install all optional dependencies (may introduce conflicts)

The documentation website maintains the full list of extras. To see the specific libraries included in each extra, please inspect the [project.optional-dependencies] section of pyproject.toml.

Data directory

PyThaiNLP downloads data (see the data catalog db.json at pythainlp-corpus) to ~/pythainlp-data by default. Set the PYTHAINLP_DATA_DIR environment variable to override this location.
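
The override can also be applied from within a Python process. The sketch below assumes the variable is set before the first PyThaiNLP call that needs data. The helper name configure_data_dir is illustrative, and creating the directory up front is a precaution of this example rather than a documented requirement.

```python
import os
from pathlib import Path


def configure_data_dir(path):
    """Point PYTHAINLP_DATA_DIR at `path`, creating the directory if needed."""
    data_dir = Path(path).expanduser().resolve()
    data_dir.mkdir(parents=True, exist_ok=True)
    # PyThaiNLP is expected to read this variable when locating its data.
    os.environ["PYTHAINLP_DATA_DIR"] = str(data_dir)
    return data_dir
```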

When using PyThaiNLP in distributed computing environments (e.g., Apache Spark), set the PYTHAINLP_DATA_DIR environment variable inside the function that will be distributed to worker nodes. See details in the documentation.
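
One possible shape for such a distributed function is sketched below. The path in WORKER_DATA_DIR and the function name tokenize_partition are assumptions made for illustration. The important part is that the environment variable is set inside the function body, so it takes effect in each worker's own process.

```python
import os

WORKER_DATA_DIR = "/tmp/pythainlp-data"  # hypothetical path on each worker


def tokenize_partition(rows):
    """Tokenize an iterable of Thai strings on a worker node."""
    # Set the variable here, inside the distributed function, so it is
    # applied in the worker process and not only on the driver.
    os.environ["PYTHAINLP_DATA_DIR"] = WORKER_DATA_DIR

    # Import on the worker, after the variable has been set.
    from pythainlp.tokenize import word_tokenize

    for text in rows:
        yield word_tokenize(text)
```

With PySpark, a function of this shape could be passed to rdd.mapPartitions(tokenize_partition). The documentation remains the authoritative reference for the recommended setup.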

Testing

We test core functionality on all officially supported Python versions.

See tests/README.md for the test matrix and other details.

Contribute to PyThaiNLP

Please fork and create a pull request. See CONTRIBUTING.md for guidelines and algorithm references.

Citations

If you use the PyThaiNLP library in your project, please cite the software as follows:

Phatthiyaphaibun, Wannaphong, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, and Pattarawat Chormai. "PyThaiNLP: Thai Natural Language Processing in Python". Zenodo, 2 June 2024. http://doi.org/10.5281/zenodo.3519354.

with this BibTeX entry:

@software{pythainlp,
title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
author = "Phatthiyaphaibun, Wannaphong and
Chaovavanich, Korakot and
Polpanumas, Charin and
Suriyawongkul, Arthit and
Lowphansirikul, Lalita and
Chormai, Pattarawat",
doi = {10.5281/zenodo.3519354},
license = {Apache-2.0},
month = jun,
url = {https://github.com/PyThaiNLP/pythainlp/},
version = {v5.0.4},
year = {2024},
}

To cite our NLP-OSS 2023 academic paper, please use the following:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai Natural Language Processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25-36, Singapore, Singapore. Empirical Methods in Natural Language Processing.

with this BibTeX entry:

@inproceedings{phatthiyaphaibun-etal-2023-pythainlp,
title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
author = "Phatthiyaphaibun, Wannaphong and
Chaovavanich, Korakot and
Polpanumas, Charin and
Suriyawongkul, Arthit and
Lowphansirikul, Lalita and
Chormai, Pattarawat and
Limkonchotiwat, Peerat and
Suntorntip, Thanathip and
Udomcharoenchaikit, Can",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Empirical Methods in Natural Language Processing",
url = "https://aclanthology.org/2023.nlposs-1.4",
pages = "25--36",
abstract = "We present PyThaiNLP, a free and open-source natural language processing (NLP) library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.",
}

Acknowledgements

PyThaiNLP was founded by Wannaphong Phatthiyaphaibun in 2016. His contributions from 2021 were made during a PhD studentship supported by Vidyasirimedhi Institute of Science and Technology (VISTEC).

The contributions of Arthit Suriyawongkul to PyThaiNLP from November 2017 until August 2019 were funded by Wisesight. His contributions from November 2019 until October 2024 were made during a PhD studentship supported by Taighde Éireann - Research Ireland under Grant Number 18/CRT/6224 (Research Ireland Centre for Research Training in Digitally-Enhanced Reality (d-real)).

The contributions of Pattarawat Chormai to PyThaiNLP from 2018 until 2019 were made during a research internship at the Natural Language Processing Lab, Department of Linguistics, Faculty of Arts, Chulalongkorn University.

The contributions of Korakot Chaovavanich and Lalita Lowphansirikul to PyThaiNLP from 2019 until 2022 were funded by the VISTEC-depa Thailand AI Research Institute.

The Mac Mini M1 used for macOS testing was donated by MacStadium. This hardware was essential for the project's testing suite from October 2022 to October 2023, filling a critical gap before GitHub Actions introduced native support for Apple Silicon runners.

The only official repository is at https://github.com/PyThaiNLP/pythainlp, with a mirror at https://gitlab.com/pythainlp/pythainlp.

Beware of malware if you use code from places other than these two.

Made by the PyThaiNLP Team | "We build Thai NLP"
