ko_lm_dataformat
- A utility for storing and loading training data for Korean language models
- The code is based on lm_dataformat, which is used by EleutherAI
  - Fixes some bugs
  - Adds and modifies features for Korean (sentence splitter, text cleaner)
Installation
Versions from 0.3.1 onward support Python 3.9 or later.
pip3 install ko_lm_dataformat
Usage
1. Write Data
1.1. Archive
- The kss v1 sentence splitter can be used
import ko_lm_dataformat as kldf
ar = kldf.Archive("output_dir")
ar = kldf.Archive("output_dir", sentence_splitter=kldf.KssV1SentenceSplitter()) # Use sentence splitter
1.2. Adding data
- Metadata can be added (e.g. title, url)
- Each call is assumed to receive a single document (if a List[str] is passed instead of a str, it is treated as multiple sentences)
- If split_sent=True, the document is split into multiple sentences and stored as a List[str]
- If clean_sent=True, NFC normalization, control character removal, and whitespace cleanup are applied
for doc in doc_lst:
    ar.add_data(
        data=doc,
        meta={
            "source": "kowiki",
            "meta_key_1": [othermetadata, otherrandomstuff],
            "meta_key_2": True,
        },
        split_sent=False,
        clean_sent=False,
    )

# remember to commit at the end!
ar.commit()
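To make the split_sent and clean_sent options above more concrete, here is a minimal sketch of the behavior they describe, written with the standard library only. It is not the library's implementation: clean_sentence, naive_split, and prepare are hypothetical names, the regex splitter is a toy stand-in for a real sentence splitter such as kss, and the order of splitting and cleaning is an assumption.

```python
import re
import unicodedata
from typing import List, Union


def clean_sentence(text: str) -> str:
    """Hypothetical helper approximating the clean_sent=True steps."""
    # 1) NFC normalization: compose decomposed Hangul jamo into syllables
    text = unicodedata.normalize("NFC", text)
    # 2) Remove control characters (category Cc), keeping ordinary whitespace
    text = "".join(
        ch for ch in text if ch in "\t\n " or unicodedata.category(ch) != "Cc"
    )
    # 3) Whitespace cleanup: collapse runs and trim both ends
    return re.sub(r"\s+", " ", text).strip()


def naive_split(text: str) -> List[str]:
    """Toy splitter on sentence-final punctuation (assumption, not kss)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def prepare(
    data: Union[str, List[str]],
    split_sent: bool = False,
    clean_sent: bool = False,
) -> Union[str, List[str]]:
    """Hypothetical sketch of how one add_data input could be normalized."""
    if isinstance(data, str):
        # A str is one document; optionally split it into sentences
        items: Union[str, List[str]] = naive_split(data) if split_sent else data
    else:
        # A list is treated as already-split sentences
        items = list(data)
    if clean_sent:
        if isinstance(items, str):
            items = clean_sentence(items)
        else:
            items = [clean_sentence(s) for s in items]
    return items
```

With these assumptions, a str input stays a single str unless split_sent=True, while a List[str] input always stays a list of sentences.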
2. Read Data
When called as rdr.stream_data(get_meta=True), each item is returned as a (doc, meta) tuple.
import ko_lm_dataformat as kldf

rdr = kldf.Reader("output_dir")

# Documents only
for data in rdr.stream_data(get_meta=False):
    print(data)
    # "Put simply, it can be seen as an aesthetic rendering of human life through language...."

# (doc, meta) tuples
for data in rdr.stream_data(get_meta=True):
    print(data)
    # ("Put simply, it can be seen as an aesthetic rendering of human life through language....", {"source": "kowiki", ...})
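Because get_meta=True yields (doc, meta) tuples, the metadata can be used to select documents while streaming. The sketch below shows one way to do that; filter_by_source is a hypothetical helper, not part of ko_lm_dataformat, and it works on any iterable of (doc, meta) pairs. The sample list stands in for a real reader.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def filter_by_source(
    pairs: Iterable[Tuple[Any, Dict[str, Any]]], source: str
) -> Iterator[Any]:
    """Yield only the documents whose metadata has the given "source" value."""
    for doc, meta in pairs:
        if meta.get("source") == source:
            yield doc


# Stand-in data shaped like the (doc, meta) tuples described above
sample = [
    ("doc A", {"source": "kowiki"}),
    ("doc B", {"source": "news"}),
    ("doc C", {"source": "kowiki"}),
]
kowiki_docs = list(filter_by_source(sample, "kowiki"))
```

With a real archive, the pairs argument would be rdr.stream_data(get_meta=True) instead of the sample list.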