SAF Vectorizers
SAF Vectorizers is a plugin for SmartApp Framework that vectorizes texts (computes embeddings) using several models:
- SBERT (SentenceBERT): a pretrained Russian-language model from SberDevices, available as open source (see the article on Habr for details). One of the model's authors is Alexander Abramov; questions and suggestions about the SBERT model can be addressed to him.
- USE (Universal Sentence Encoder): a pretrained multilingual model (details are available on TensorFlow Hub). The model is distributed under the Apache-2.0 license and is used in its original form, without any modifications.
- FastText: a pretrained Russian-language model distributed under the Creative Commons Attribution-Share-Alike License 3.0. The model is downloaded from the official FastText site and is used in its original form, without any modifications. The authors of the model are:
@inproceedings{grave2018learning,
title={Learning Word Vectors for 157 Languages},
author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
year={2018}
}
- Word2Vec: a pretrained Russian-language model distributed under the Creative Commons Attribution (CC-BY) license. The model is downloaded from the official NLPL word embeddings repository and is used in its original form, without any modifications. The authors of the model are the Language Technology Group at the University of Oslo.
Model type names (used as arguments to the model download script and in the "vectorizer" field of classifier configs): sbert, use, fasttext, word2vec
Contents
- Installation
- New functionality
- Connecting the plugin
- Documentation
- Feedback
Installation
Before installing, it is recommended to run the script that downloads the pretrained vectorizer models (they are not in the repository because all the models are large). First give the script execute permission and turn off your VPN (if you use one).
The script takes as arguments the names of the vectorizer models you want to download and use. With the argument all, every model is downloaded. If, for example, you only want to download and use sbert, replace all with sbert. If you only need use and fasttext, write use fasttext instead of all, and so on.
Note, however, that running the download script separately is not required, because setup.py already runs it by default. If you do not want to download all the models, open setup.py and replace all with another value.
Command to run the model download script:
./download_models.sh all
A static directory should appear inside saf_vectorizers; the model files are stored there. If you download all the models, the directory will be about 16 GB in total.
Downloading the models is slow and takes some time; the console log shows which model is currently being downloaded.
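The argument rules above, and the size check on the static directory, can be sketched in Python. This is only an illustration: build_download_command and directory_size_gb are hypothetical helpers that are not part of the plugin, and the paths assume you are in the root of the cloned repository.

```python
from pathlib import Path

# Model type names listed in this README.
KNOWN_MODELS = ("sbert", "use", "fasttext", "word2vec")


def build_download_command(models, script="./download_models.sh"):
    """Return the argument list for the download script (hypothetical helper)."""
    models = list(models)
    if not models or models == ["all"]:
        # No explicit selection means "download everything".
        return [script, "all"]
    unknown = [name for name in models if name not in KNOWN_MODELS]
    if unknown:
        raise ValueError(f"Unknown model type(s): {unknown}")
    return [script, *models]


def directory_size_gb(path):
    """Sum the sizes of all regular files under `path`, in gigabytes."""
    root = Path(path)
    total_bytes = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return total_bytes / 1024 ** 3


if __name__ == "__main__":
    command = build_download_command(["use", "fasttext"])
    print("Would run:", " ".join(command))
    # To actually start the download, pass `command` to subprocess.run(..., check=True).
    static_dir = Path("saf_vectorizers") / "static"  # assumed location, per the text above
    if static_dir.is_dir():
        print(f"{static_dir}: {directory_size_gb(static_dir):.2f} GB")
```

Validating the names before calling the script avoids starting a long download with a mistyped model type.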
Command to install the plugin:
Installing this way, rather than via git, is recommended because the files from the static directory must be included (see the MANIFEST.in file). That is, activate the env where smart_app_framework is already installed, or create a new env, then change into the cloned saf_vectorizers repository (main branch) and run
pip install -e .
You can check that everything was installed into your env successfully like this:
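# TextPreprocessingResult comes from SmartApp Framework; this import path is an assumption about its package layout.
from core.text_preprocessing.preprocessing_result import TextPreprocessingResult
from saf_vectorizers import SBERTVectorizer

vectorizer = SBERTVectorizer()
# Russian test phrase: "I want to know tomorrow's weather forecast in Moscow"
test_text = TextPreprocessingResult({"original_text": "khochu uznat' prognoz pogody na zavtra v moskve"})
res_vector = vectorizer.vectorize(test_text)
print(res_vector)        # the embedding as a NumPy array
print(res_vector.shape)  # its dimensions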
New functionality
The plugin provides the following entities:
- class FastTextVectorizer
- class SBERTVectorizer
- class USEVectorizer
- class Word2VecVectorizer
Each of these classes is a vectorizer that you can use when training your classification models and also at inference time, so that the model receives a vector representation of the text as input. To get the vector representation of a text, call the vectorizer's vectorize method. It takes a TextPreprocessingResult object and returns the vector as a NumPy array.
An example of a TextPreprocessingResult object can be found here:
https://github.com/sberdevices/saf_vectorizers/blob/main/saf_vectorizers/check_vectorizers.py
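Because vectorize returns a NumPy array, the resulting vectors can be compared with ordinary NumPy operations. The sketch below computes cosine similarity; cosine_similarity is a hypothetical helper, and the three short arrays are made-up stand-ins for real vectorizer outputs, whose actual shape and dimensionality depend on the model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; inputs are flattened first."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # A zero vector has no direction; treat the similarity as 0.
        return 0.0
    return float(np.dot(a, b) / denom)


# Toy stand-ins for vectors returned by vectorizer.vectorize(...).
vec_a = np.array([1.0, 2.0, 3.0])
vec_b = np.array([2.0, 4.0, 6.0])   # same direction as vec_a
vec_c = np.array([3.0, 0.0, -1.0])  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

With real embeddings, replace the toy arrays with the outputs of the same vectorizer for two different texts; vectors from different vectorizers are not comparable.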
Connecting the plugin
To connect the plugin, add its name to the PLUGINS variable in the app_config of your smartapp:
PLUGINS = ["saf_vectorizers"]
In the configuration of a classifier whose model must receive an already vectorized utterance as input, add a "vectorizer" field set to one of the vectorization model types (sbert, use, fasttext, word2vec), the same one that was used when the model was trained:
{
    "type": "scikit",
    "threshold": 0.7,
    "path": "pretrained_model.pkl",
    "intents": ["intent_1", "intent_2", ..., "intent_n"],
    "vectorizer": "sbert"
}
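A mistyped "vectorizer" value is easy to miss, so it can help to check the field before the config is used. The following is a minimal sketch, not part of the plugin or of SmartApp Framework: check_vectorizer_field is a hypothetical helper, and the example config is a shortened copy of the one above with a concrete intents list.

```python
import json

# Vectorizer type names listed in this README.
ALLOWED_VECTORIZERS = {"sbert", "use", "fasttext", "word2vec"}


def check_vectorizer_field(classifier_config):
    """Return the vectorizer name, or raise ValueError if it is missing or unknown."""
    name = classifier_config.get("vectorizer")
    if name is None:
        raise ValueError('classifier config has no "vectorizer" field')
    if name not in ALLOWED_VECTORIZERS:
        raise ValueError(
            f"unknown vectorizer {name!r}; expected one of {sorted(ALLOWED_VECTORIZERS)}"
        )
    return name


example_config = json.loads("""
{
    "type": "scikit",
    "threshold": 0.7,
    "path": "pretrained_model.pkl",
    "intents": ["intent_1", "intent_2"],
    "vectorizer": "sbert"
}
""")

print(check_vectorizer_field(example_config))
```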
Documentation
Feedback
Send questions and suggestions to developer@sberdevices.ru or join our Telegram channel, SmartMarket Community.