sberdevices/saf_vectorizers

Folders and files

NameName
Last commit message
Last commit date

Latest commit

History

25 Commits

Repository files navigation

SAF Vectorizers

SAF Vectorizers is a plugin for SmartApp Framework that vectorizes texts (computes embeddings) using several models:

  • SBERT (SentenceBERT): a pretrained Russian-language model from SberDevices, available as open source (for details, see the article on Habr).
    One of the model's authors is Aleksandr Abramov; questions and suggestions about the SBERT model can be addressed to him.

  • USE (Universal Sentence Encoder): a pretrained multilingual model (details are available on TensorFlow Hub). The model is distributed under the Apache-2.0 license and is used in its original form, without any modifications.

  • FastText: a pretrained Russian-language model distributed under the Creative Commons Attribution-Share-Alike License 3.0. The model is downloaded from the official FastText site and is used in its original form, without any modifications.
    The authors of the model are:

@inproceedings{grave2018learning,
title={Learning Word Vectors for 157 Languages},
author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
year={2018}
}

Model type names (used as arguments to the model download script and in the "vectorizer" field of classifier configs): sbert, use, fasttext, word2vec

Contents

  • Installation
  • New functionality
  • Connecting the plugin
  • Documentation
  • Feedback

Installation

Before installing, it is recommended to run the script that downloads the pretrained vectorizer models (they are not in the repository because all of the models are large). First make the script executable and turn off your VPN, if you use one.

The script takes as arguments the names of the vectorizer models you want to download and use. With the argument all, every model is downloaded. To download and use only sbert, for example, replace all with sbert. If you need only use and fasttext, write use fasttext instead of all, and so on.

Note that running the download script separately is not required, because setup.py already runs it by default. If you do not want to download all of the models, open setup.py and replace all with another value.

Command to run the model download script:

chmod u+r+x download_models.sh
./download_models.sh all
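
If the download is driven from Python (for example, from a provisioning script), the argument rules described above could be sketched as a small helper. The function name build_download_command is hypothetical and not part of the plugin; only the script name, the all keyword, and the model type names are taken from this README.

```python
from typing import List, Sequence

# Model type names listed in this README.
KNOWN_MODELS = ("sbert", "use", "fasttext", "word2vec")


def build_download_command(models: Sequence[str]) -> List[str]:
    """Build the argv for download_models.sh (hypothetical helper).

    Pass ["all"] to request every model, or a list of model type names.
    """
    if not models:
        raise ValueError("pass 'all' or at least one model name")
    if list(models) == ["all"]:
        return ["./download_models.sh", "all"]
    unknown = [m for m in models if m not in KNOWN_MODELS]
    if unknown:
        raise ValueError(f"unknown model names: {unknown}")
    return ["./download_models.sh", *models]


if __name__ == "__main__":
    # Only prints the command. To execute it, pass the list to
    # subprocess.run(cmd, check=True) from the repository root.
    print(build_download_command(["use", "fasttext"]))
```

Building the argument list first and running it separately keeps a typo in a model name from starting a long download.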

A static directory should appear in saf_vectorizers; the model files are stored there. If you download all of the models, the directory ends up at about 16 GB.

Downloading the models is slow and takes a while; the console log shows which model is being downloaded at the moment.
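
To confirm that a download completed, one option is to measure the static directory and compare it with the figure above. This is a minimal sketch using only the standard library; the path saf_vectorizers/static is an assumption about where the directory sits relative to the repository root, so adjust it to your checkout.

```python
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Assumed location of the model files; not confirmed by the README.
    static_dir = Path("saf_vectorizers") / "static"
    if static_dir.is_dir():
        size_gib = directory_size_bytes(static_dir) / 1024 ** 3
        print(f"{static_dir}: {size_gib:.1f} GiB")
    else:
        print(f"{static_dir} not found; have the models been downloaded?")
```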

Command to install the plugin:

pip install -e .

Installing this way, rather than via git, is recommended because files from the static directory have to be included (see the MANIFEST.in file). That is: activate the env where smart_app_framework is already installed, or create a new env, then change into the cloned saf_vectorizers repository (main branch) and run pip install -e .

To check that everything was installed into your env successfully:

from core.text_preprocessing.preprocessing_result import TextPreprocessingResult
from saf_vectorizers import SBERTVectorizer

# Create the vectorizer.
vectorizer = SBERTVectorizer()

# Wrap a sample utterance in the object that vectorize expects.
test_text = TextPreprocessingResult({"original_text": "khochu uznat' prognoz pogody na zavtra v moskve"})

# vectorize returns the embedding as a NumPy array.
res_vector = vectorizer.vectorize(test_text)

print(res_vector)
print(res_vector.shape)
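
Before running the check above, which loads a model, a lighter first step is to confirm that the packages it imports can be found in the active env at all. The sketch below uses importlib.util.find_spec from the standard library; the helper name is_installed is hypothetical, and the check works for top-level package names only.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located, without importing it."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # "core" is the top-level package that TextPreprocessingResult
    # is imported from in the check above.
    for name in ("core", "saf_vectorizers"):
        status = "found" if is_installed(name) else "MISSING"
        print(f"{name}: {status}")
```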

New functionality

The plugin provides the following entities:

  • class FastTextVectorizer
  • class SBERTVectorizer
  • class USEVectorizer
  • class Word2VecVectorizer

Each of these classes is a vectorizer that you can use when training your classification models and also at inference time, so that the model receives a vector representation of the text as input. To get the vector representation of a text, call the vectorizer's vectorize method. It takes a TextPreprocessingResult object and returns the vector as a NumPy array:

def vectorize(self, text_preprocessing_result: TextPreprocessingResult) -> np.ndarray:

An example of a TextPreprocessingResult object can be found here: https://github.com/sberdevices/saf_vectorizers/blob/main/saf_vectorizers/check_vectorizers.py
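
For training, the same vectorize call is applied to every example to build the feature rows for a classifier. The sketch below shows that loop against any object that exposes vectorize; it assumes each call returns a one-dimensional vector. The names build_feature_rows and ToyVectorizer are hypothetical, and ToyVectorizer is only a stand-in so the example runs without the plugin or its models.

```python
from typing import Any, Iterable, List, Protocol, Sequence


class Vectorizer(Protocol):
    """Anything with a vectorize method like the one described above."""

    def vectorize(self, text_preprocessing_result: Any) -> Sequence[float]:
        ...


def build_feature_rows(vectorizer: Vectorizer, items: Iterable[Any]) -> List[List[float]]:
    """Vectorize each item and check that all rows have the same length."""
    rows = [[float(x) for x in vectorizer.vectorize(item)] for item in items]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("vectorizer returned vectors of different lengths")
    return rows


class ToyVectorizer:
    """Stand-in for illustration only; NOT part of saf_vectorizers."""

    def vectorize(self, text_preprocessing_result: Any) -> List[float]:
        text = text_preprocessing_result["original_text"]
        # Two toy features: character count and word count.
        return [float(len(text)), float(text.count(" ") + 1)]


if __name__ == "__main__":
    samples = [{"original_text": "hello world"}, {"original_text": "hi"}]
    print(build_feature_rows(ToyVectorizer(), samples))
```

With the real plugin, a vectorizer such as SBERTVectorizer and a list of TextPreprocessingResult objects would take the place of the stand-ins.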

Connecting the plugin

To connect the plugin, add its name to the PLUGINS variable in the app_config of your smartapp:
PLUGINS = ["saf_vectorizers"]

In the configuration of a classifier whose model expects an already vectorized utterance as input, add a "vectorizer" field set to one of the vectorization model types (sbert, use, fasttext, word2vec), the same one that was used when training the model:

{
"type": "scikit",
"threshold": 0.7,
"path": "pretrained_model.pkl",
"intents": ["intent_1", "intent_2" ... "intent_n"],
"vectorizer": "sbert"
}
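
A mistyped "vectorizer" value is easy to miss in a config file, so a quick pre-flight check can help. The sketch below is not part of the plugin or of SmartApp Framework: check_vectorizer_field is a hypothetical helper that only compares the field against the four type names listed in this README, and the sample config spells out two intents in place of the elided list.

```python
import json

# Vectorizer type names listed in this README.
ALLOWED_VECTORIZERS = {"sbert", "use", "fasttext", "word2vec"}


def check_vectorizer_field(config: dict) -> str:
    """Return the config's "vectorizer" value, or raise ValueError."""
    value = config.get("vectorizer")
    if not isinstance(value, str) or value not in ALLOWED_VECTORIZERS:
        raise ValueError(
            f'"vectorizer" must be one of {sorted(ALLOWED_VECTORIZERS)}, got {value!r}'
        )
    return value


if __name__ == "__main__":
    raw = """
    {
        "type": "scikit",
        "threshold": 0.7,
        "path": "pretrained_model.pkl",
        "intents": ["intent_1", "intent_2"],
        "vectorizer": "sbert"
    }
    """
    print(check_vectorizer_field(json.loads(raw)))
```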

Documentation

Official documentation

Feedback

Send questions and suggestions to developer@sberdevices.ru or join our Telegram channel, SmartMarket Community.
