monologg/ko_lm_dataformat

ko_lm_dataformat

  • A utility for storing and loading training data for Korean language models

    • Uses zstandard and ultrajson to speed up data loading and compression
    • Stores each document's metadata alongside the document
  • The code is based on lm_dataformat, which is used by EleutherAI

    • Fixes some bugs
    • Adds and adjusts features to suit Korean (sentence splitter, text cleaner)
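To make the "document plus metadata" idea concrete, here is a minimal sketch of compressed JSON Lines storage. It is not this library's implementation: the `{"text": ..., "meta": ...}` record layout is an assumption, the helper names `write_records` and `read_records` are hypothetical, and the standard-library `gzip` and `json` modules stand in for zstandard and ultrajson so that the sketch runs without extra dependencies.

```python
import gzip
import io
import json


def write_records(records):
    """Serialize (text, meta) pairs as compressed JSON Lines and return the bytes."""
    buffer = io.BytesIO()
    # gzip is a stand-in here; the library itself uses zstandard.
    with gzip.GzipFile(fileobj=buffer, mode="wb") as fh:
        for text, meta in records:
            # Assumed record layout: one JSON object per line.
            line = json.dumps({"text": text, "meta": meta}, ensure_ascii=False)
            fh.write(line.encode("utf-8") + b"\n")
    return buffer.getvalue()


def read_records(blob):
    """Yield (text, meta) pairs back from the compressed bytes."""
    with gzip.GzipFile(fileobj=io.BytesIO(blob), mode="rb") as fh:
        for raw in fh:
            record = json.loads(raw.decode("utf-8"))
            yield record["text"], record["meta"]


blob = write_records([("첫 번째 문서입니다.", {"source": "kowiki"})])
restored = list(read_records(blob))
print(restored[0][1])  # metadata travels with the document
```

Keeping one record per line means a reader can stream documents one at a time instead of loading the whole archive into memory.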

Installation

Versions 0.3.1 and later support Python 3.9 or higher.

pip3 install ko_lm_dataformat

Usage

1. Write Data

1.1. Archive

import ko_lm_dataformat as kldf

ar = kldf.Archive("output_dir")
ar = kldf.Archive("output_dir", sentence_splitter=kldf.KssV1SentenceSplitter()) # Use sentence splitter

1.2. Adding data

  • Metadata can be added (e.g. title, url)
  • The input is assumed to be a single document (if a List[str] is passed instead of a str, it is treated as multiple sentences)
  • If split_sent=True, the document is split into sentences and stored as a List[str]
  • If clean_sent=True, NFC normalization, control character removal, and whitespace cleanup are applied
for doc in doc_lst:
    ar.add_data(
        data=doc,
        meta={
            "source": "kowiki",
            "meta_key_1": [othermetadata, otherrandomstuff],
            "meta_key_2": True,
        },
        split_sent=False,
        clean_sent=False,
    )

# remember to commit at the end!
ar.commit()
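The effect of the split_sent and clean_sent options described above can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the library's code: `clean_text`, `naive_split`, and `prepare_document` are hypothetical helpers, the punctuation-based splitter is a crude stand-in for a real Korean sentence splitter, and the order of operations (split first, then clean) is a guess.

```python
import re
import unicodedata


def clean_text(text):
    """NFC-normalize, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Map every whitespace character (tabs, newlines, ...) to a plain space
    # first, so word boundaries survive the control-character removal below.
    text = re.sub(r"\s", " ", text)
    # Remove the remaining control characters (Unicode category "Cc").
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Collapse runs of spaces and trim both ends.
    return re.sub(r" +", " ", text).strip()


def naive_split(text):
    """Split after '.', '!' or '?' followed by whitespace (stand-in splitter)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def prepare_document(data, split_sent=False, clean_sent=False):
    """Return a str (one document) or a list of str (sentences)."""
    if isinstance(data, str):
        sentences = naive_split(data) if split_sent else None
    else:
        # A list input is treated as already split into sentences.
        sentences = list(data)
    if sentences is None:
        return clean_text(data) if clean_sent else data
    return [clean_text(s) for s in sentences] if clean_sent else sentences


print(prepare_document("One. Two! Three?", split_sent=True))
```

NFC normalization matters for Korean in particular, because the same Hangul syllable can be encoded either as one precomposed code point or as a sequence of jamo; normalizing makes the two forms compare equal.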

2. Read Data

  • With rdr.stream_data(get_meta=True), each item is returned as a (doc, meta) tuple
import ko_lm_dataformat as kldf

rdr = kldf.Reader("output_dir")

# Documents only
for data in rdr.stream_data(get_meta=False):
    print(data)
    # "간단하게 설명하면, 언어를 통해 인간의 삶을 미적(美的)으로 형상화한 것이라고 볼...."


# (document, metadata) tuples
for data in rdr.stream_data(get_meta=True):
    print(data)
    # ("간단하게 설명하면, 언어를 통해 인간의 삶을 미적(美的)으로 형상화한 것이라고 볼....", {"source": "kowiki", ...})
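Because get_meta=True yields (doc, meta) tuples, the stream can be unpacked and filtered on metadata while it is being consumed. The following is a small sketch under stated assumptions: `docs_from_source` is a hypothetical helper, metadata is assumed to be a dict with a "source" key as in the writing example, and a plain list of tuples stands in for the reader so the snippet runs without the library installed.

```python
def docs_from_source(pairs, source):
    """Yield only the documents whose metadata has the given "source" value."""
    for doc, meta in pairs:
        if meta.get("source") == source:
            yield doc


# Stand-in for rdr.stream_data(get_meta=True): any iterable of (doc, meta) tuples.
sample_pairs = [
    ("document from kowiki", {"source": "kowiki"}),
    ("document from news", {"source": "news"}),
]

kowiki_docs = list(docs_from_source(sample_pairs, "kowiki"))
print(kowiki_docs)
```

Since the helper is a generator, it keeps the streaming behaviour: documents are filtered one at a time rather than collected up front.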

License

MIT license