SiameseXML

Code for SiameseXML: Siamese networks meet extreme classifiers with 100M labels

Best Practices for feature creation

Adding sub-words on top of unigrams to the vocabulary can help in training more accurate embeddings and classifiers.
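
As an illustration of the sub-word idea, the sketch below extends a unigram vocabulary with character n-grams in the style of FastText. The function names, the `<`/`>` boundary markers, and the n-gram range are assumptions made for this example; the repository does not prescribe a particular sub-word scheme.

```python
from collections import Counter


def char_ngrams(token, n_min=3, n_max=5):
    """Return character n-grams of a token padded with boundary markers."""
    padded = f"<{token}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def build_vocabulary(documents, min_count=1):
    """Count unigrams plus their sub-words; keep terms seen >= min_count times."""
    counts = Counter()
    for doc in documents:
        for token in doc.lower().split():
            counts[token] += 1                 # unigram
            counts.update(char_ngrams(token))  # sub-words (hypothetical scheme)
    kept = sorted(term for term, c in counts.items() if c >= min_count)
    return {term: idx for idx, term in enumerate(kept)}


vocab = build_vocabulary(["siamese networks", "extreme classifiers"])
```

The resulting vocabulary contains both whole words and their fragments, so the bag-of-words features built on it can share statistical strength between related tokens.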

Setting up

Expected directory structure

Download data for SiameseXML

* Download the (zipped) BoW features file from the XML repository.
* Extract the zipped file into the data directory.
* The Yf.txt file contains label features; either rename it to lbl_X_Xf.txt or create a soft link with that name.
* The following files should be available in /data/ for new datasets (ignore the next step):
    - trn_X_Xf.txt
    - trn_X_Y.txt
    - tst_X_Xf.txt
    - lbl_X_Xf.txt
    - tst_X_Y.txt
    - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
* The following files should be available in /data/ if the dataset is in the old format (please refer to the next step to convert the data to the new format):
    - train.txt
    - test.txt
    - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
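
The checklist above can be verified before training with a small script. This is a minimal sketch, not part of the repository: the helper names `link_label_features` and `missing_files` are hypothetical, and `data_dir` is assumed to point at the directory of one dataset.

```python
import os

REQUIRED_FILES = [
    "trn_X_Xf.txt", "trn_X_Y.txt", "tst_X_Xf.txt",
    "lbl_X_Xf.txt", "tst_X_Y.txt",
]
EMBEDDING_FILES = [
    "fasttextB_embeddings_300d.npy",
    "fasttextB_embeddings_512d.npy",
]


def link_label_features(data_dir):
    """Soft-link Yf.txt to lbl_X_Xf.txt when only the former exists."""
    src = os.path.join(data_dir, "Yf.txt")
    dst = os.path.join(data_dir, "lbl_X_Xf.txt")
    if os.path.exists(src) and not os.path.lexists(dst):
        os.symlink(os.path.abspath(src), dst)


def missing_files(data_dir):
    """Return the names of expected files that are absent from data_dir."""
    missing = [name for name in REQUIRED_FILES
               if not os.path.exists(os.path.join(data_dir, name))]
    # Only one of the two embedding files is needed.
    if not any(os.path.exists(os.path.join(data_dir, name))
               for name in EMBEDDING_FILES):
        missing.append(" or ".join(EMBEDDING_FILES))
    return missing
```

Calling `link_label_features(data_dir)` followed by `missing_files(data_dir)` gives a quick list of what still needs to be downloaded or converted.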

Convert to new data format

# A perl script is provided (in siamesexml/tools) to convert the data into the new format
# Either set the $data_dir variable to the data directory of a particular dataset or replace it with the path
perl convert_format.pl $data_dir/train.txt $data_dir/trn_X_Xf.txt $data_dir/trn_X_Y.txt
perl convert_format.pl $data_dir/test.txt $data_dir/tst_X_Xf.txt $data_dir/tst_X_Y.txt
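
For readers who want to see what such a conversion involves, here is a rough Python sketch. It rests on assumptions not stated in this README: that the old format starts with a header line `num_points num_features num_labels` followed by one line per point of the form `l1,l2,... f1:v1 f2:v2 ...`, and that the new format stores features and labels as two sparse text files, each with its own `rows cols` header. The function `convert_format` below is hypothetical; the provided perl script remains the authoritative tool.

```python
def convert_format(in_path, feat_path, lbl_path):
    """Split an old-format file into separate feature and label files."""
    with open(in_path) as fin, \
            open(feat_path, "w") as ffeat, \
            open(lbl_path, "w") as flbl:
        # Assumed header: "<num_points> <num_features> <num_labels>"
        num_points, num_feats, num_labels = fin.readline().split()
        ffeat.write(f"{num_points} {num_feats}\n")
        flbl.write(f"{num_points} {num_labels}\n")
        for line in fin:
            line = line.rstrip("\n")
            # The first space-separated field is assumed to hold the labels.
            labels, _, feats = line.partition(" ")
            if ":" in labels:
                # No label field on this line; everything is features.
                labels, feats = "", line
            ffeat.write(feats.strip() + "\n")
            pairs = [f"{lbl}:1" for lbl in labels.split(",") if lbl]
            flbl.write(" ".join(pairs) + "\n")
```

A call such as `convert_format("train.txt", "trn_X_Xf.txt", "trn_X_Y.txt")` mirrors the argument order of the perl commands above.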

Example use cases

A single learner

The given code can be utilized as follows. A json file is used to specify architecture and other arguments. Please refer to the full documentation below for more details.

./run_main.sh 0 SiameseXML LF-AmazonTitles-131K 0 108

Full Documentation

./run_main.sh <gpu_id> <type> <dataset> <version> <seed>

* gpu_id: Run the program on this GPU.
* type: SiameseXML uses the DeepXML [2] framework for training. The classifier is trained in M-IV.
    - SiameseXML: The intermediate representation is not fine-tuned while training the classifier (more scalable; suitable for large datasets).
    - SiameseXML++: The intermediate representation is fine-tuned while training the classifier (leads to better accuracy on some datasets).
* dataset
    - Name of the dataset.
    - SiameseXML expects the following files in /data/
        - trn_X_Xf.txt
        - trn_X_Y.txt
        - tst_X_Xf.txt
        - lbl_X_Xf.txt
        - tst_X_Y.txt
        - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
    - You can set 'embedding_dims' in the config file to switch between 300d and 512d embeddings.
* version
    - Different runs can be managed by version and seed.
    - Models and results are stored under this argument.
* seed
    - Seed value as used by numpy and PyTorch.
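
Since runs are distinguished by version and seed, several runs can be scripted by generating the argument list for each one. The sketch below is an assumption-laden helper, not part of the repository: `build_command` and `sweep` are hypothetical names, and pairing one version with one seed is just one possible convention.

```python
import subprocess

VALID_TYPES = ("SiameseXML", "SiameseXML++")


def build_command(gpu_id, model_type, dataset, version, seed,
                  script="./run_main.sh"):
    """Return the argument list for one run, in the documented order."""
    if model_type not in VALID_TYPES:
        raise ValueError(f"unknown type: {model_type}")
    return [script, str(gpu_id), model_type, dataset, str(version), str(seed)]


def sweep(gpu_id, model_type, dataset, seeds, first_version=0):
    """Build one command per seed, assigning consecutive version numbers."""
    return [build_command(gpu_id, model_type, dataset, first_version + i, seed)
            for i, seed in enumerate(seeds)]


def run_all(commands):
    """Execute the commands one after another, stopping on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = sweep(0, "SiameseXML", "LF-AmazonTitles-131K", seeds=[108, 109])
```

Passing `commands` to `run_all` from the directory containing run_main.sh would launch the runs sequentially on the chosen GPU.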

Notes

* Other file formats such as npy, npz, and pickle are also supported.
* Initializing with token embeddings (computed from FastText) leads to noticeable accuracy gains. Please ensure that the token embedding file is available in the data directory if 'init=token_embeddings'; otherwise, the code will throw an error.
* Config files are made available in siamesexml/configs// for datasets in the XC repository. You can use them when trying out the given code on new datasets.
* We conducted our experiments on a 24-core Intel Xeon 2.6 GHz machine with 440GB RAM and a single Nvidia P40 GPU. 128GB of memory should suffice for most datasets.
* The code makes use of the CPU (mainly for hnswlib) as well as the GPU.

Cite as

@InProceedings{Dahiya21b,
    author = "Dahiya, K. and Agarwal, A. and Saini, D. and Gururaj, K. and Jiao, J. and Singh, A. and Agarwal, S. and Kar, P. and Varma, M",
    title = "SiameseXML: Siamese Networks meet Extreme Classifiers with 100M Labels",
    booktitle = "Proceedings of the International Conference on Machine Learning",
    month = "July",
    year = "2021"
}

References

[1] K. Dahiya, A. Agarwal, D. Saini, K. Gururaj, J. Jiao, A. Singh, S. Agarwal, P. Kar, and M. Varma. SiameseXML: Siamese networks meet extreme classifiers with 100M labels. In ICML, July 2021.

[2] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, and M. Varma. DeepXML: A deep extreme multi-label learning framework applied to short text documents. In WSDM, 2021.

[3] pyxclib: https://github.com/kunaldahiya/pyxclib
