Extreme-classification/siamesexml


SiameseXML

Code for SiameseXML: Siamese networks meet extreme classifiers with 100M labels


Best practices for feature creation


  • Adding sub-words on top of unigrams to the vocabulary can help in training more accurate embeddings and classifiers.
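As an illustration of the tip above, here is a minimal sketch that augments unigram tokens with character n-gram sub-words, in the style popularized by FastText. The README does not say how sub-words should be built, so the function names `char_ngrams` and `tokens_with_subwords`, the `<`/`>` boundary markers, and the n-gram range are all assumptions made for this example.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def tokens_with_subwords(text, n_min=3, n_max=5):
    """Keep every unigram and append its sub-words (hypothetical scheme)."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)  # the unigram itself
        # An n-gram may coincide with a unigram; a real pipeline may
        # want to tag sub-words so the two stay distinct in the vocabulary.
        tokens.extend(char_ngrams(word, n_min, n_max))
    return tokens


if __name__ == "__main__":
    print(tokens_with_subwords("wireless mouse", n_min=3, n_max=4))
```

The resulting token list could then feed whatever bag-of-words feature extraction is already in use; the vocabulary simply grows to include the sub-word entries.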

Setting up


Expected directory structure

+--
| +-- programs
| | +-- siamesexml
| | +-- siamesexml
| +-- data
| +--
| +-- models
| +-- results

Download data for SiameseXML

* Download the BoW features (zipped file) from the XML repository.
* Extract the zipped file into the data directory.
* The Yf.txt file contains label features; either rename it to lbl_X_Xf.txt or create a soft-link with that name.
* The following files should be available in /data/ for new datasets (ignore the next step)
- trn_X_Xf.txt
- trn_X_Y.txt
- tst_X_Xf.txt
- lbl_X_Xf.txt
- tst_X_Y.txt
- fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
* The following files should be available in /data/ if the dataset is in old format (please refer to next step to convert the data to new format)
- train.txt
- test.txt
- fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
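The checklist above can be verified before training with a short script. The sketch below is not part of the repository: the helper names `link_label_features` and `missing_files` are hypothetical, and the file names are taken directly from the lists above. It assumes a platform where creating soft-links is permitted.

```python
from pathlib import Path

NEW_FORMAT_FILES = ["trn_X_Xf.txt", "trn_X_Y.txt", "tst_X_Xf.txt",
                    "lbl_X_Xf.txt", "tst_X_Y.txt"]
OLD_FORMAT_FILES = ["train.txt", "test.txt"]


def embedding_file(embedding_dims=300):
    """Name of the token embedding file for the given dimensionality."""
    return f"fasttextB_embeddings_{embedding_dims}d.npy"


def link_label_features(data_dir):
    """Soft-link lbl_X_Xf.txt to Yf.txt when only Yf.txt is present."""
    data_dir = Path(data_dir)
    src, dst = data_dir / "Yf.txt", data_dir / "lbl_X_Xf.txt"
    if src.exists() and not dst.exists():
        dst.symlink_to(src.name)  # relative link inside the same directory
        return True
    return False


def missing_files(data_dir, new_format=True, embedding_dims=300):
    """List the expected files that are absent from data_dir."""
    data_dir = Path(data_dir)
    expected = list(NEW_FORMAT_FILES if new_format else OLD_FORMAT_FILES)
    expected.append(embedding_file(embedding_dims))
    return [name for name in expected if not (data_dir / name).exists()]
```

Calling `link_label_features(data_dir)` followed by `missing_files(data_dir)` returns an empty list once the dataset directory is complete.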

Convert to new data format

# A perl script is provided (in siamesexml/tools) to convert the data into new format
# Either set the $data_dir variable to the data directory of a particular dataset or replace it with the path
perl convert_format.pl $data_dir/train.txt $data_dir/trn_X_Xf.txt $data_dir/trn_X_Y.txt
perl convert_format.pl $data_dir/test.txt $data_dir/tst_X_Xf.txt $data_dir/tst_X_Y.txt
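For scripted pipelines, the two conversion commands above can be assembled from Python. This is only a sketch: the function names `conversion_commands` and `convert` are hypothetical, the script path is assumed to be passed in by the caller, and the commands are printed rather than executed unless `dry_run=False`.

```python
import subprocess
from pathlib import Path


def conversion_commands(data_dir, script="convert_format.pl"):
    """Build the perl invocations for the train and test splits."""
    data_dir = Path(data_dir)
    jobs = [("train.txt", "trn"), ("test.txt", "tst")]
    return [
        ["perl", str(script),
         str(data_dir / old_file),
         str(data_dir / f"{prefix}_X_Xf.txt"),   # features output
         str(data_dir / f"{prefix}_X_Y.txt")]    # labels output
        for old_file, prefix in jobs
    ]


def convert(data_dir, script="convert_format.pl", dry_run=True):
    """Return the commands; run them only when dry_run is False."""
    commands = conversion_commands(data_dir, script)
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if perl exits non-zero
    return commands


if __name__ == "__main__":
    for cmd in convert("data/my_dataset"):  # placeholder directory
        print(" ".join(cmd))
```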

Example use cases


A single learner

The given code can be utilized as follows. A json file is used to specify architecture and other arguments. Please refer to the full documentation below for more details.

./run_main.sh 0 SiameseXML LF-AmazonTitles-131K 0 108

Full Documentation

./run_main.sh <gpu_id> <type> <dataset> <version> <seed>

* gpu_id: Run the program on this GPU.

* type
SiameseXML uses the DeepXML [2] framework for training. The classifier is trained in M-IV.
- SiameseXML: The intermediate representation is not fine-tuned while training the classifier (more scalable; suitable for large datasets).
- SiameseXML++: The intermediate representation is fine-tuned while training the classifier (leads to better accuracy on some datasets).

* dataset
- Name of the dataset.
- SiameseXML expects the following files in /data/
- trn_X_Xf.txt
- trn_X_Y.txt
- tst_X_Xf.txt
- lbl_X_Xf.txt
- tst_X_Y.txt
- fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
- You can set 'embedding_dims' in the config file to switch between 300d and 512d embeddings.

* version
- Different runs can be managed by version and seed.
- Models and results are stored with this argument.

* seed
- Seed value used by NumPy and PyTorch.
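When launching several runs programmatically, the arguments described above can be validated and assembled in Python. The sketch below is an assumption-laden helper, not part of the repository: `run_main_args` is a hypothetical name, and the positional order (gpu_id, type, dataset, version, seed) is inferred from the example command `./run_main.sh 0 SiameseXML LF-AmazonTitles-131K 0 108`.

```python
VALID_TYPES = ("SiameseXML", "SiameseXML++")


def run_main_args(gpu_id, type_, dataset, version, seed,
                  script="./run_main.sh"):
    """Return the argument list for one run, checking the type first."""
    if type_ not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}, got {type_!r}")
    # Order assumed from the example command in this README.
    return [script, str(gpu_id), type_, dataset, str(version), str(seed)]


if __name__ == "__main__":
    # Same dataset and version, three different seeds.
    for seed in (108, 109, 110):
        print(" ".join(run_main_args(0, "SiameseXML",
                                     "LF-AmazonTitles-131K", 0, seed)))
```

Each returned list can be passed to `subprocess.run(args, check=True)` from the directory that contains `run_main.sh`.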

Notes

* Other file formats such as npy, npz, pickle are also supported.
* Initializing with token embeddings (computed from FastText) leads to noticeable accuracy gains. If 'init=token_embeddings' is set, please ensure that the token embedding file is available in the data directory; otherwise the code will throw an error.
* Config files are made available in siamesexml/configs// for datasets in the XC repository. You can use them when trying out the given code on new datasets.
* We conducted our experiments on a 24-core Intel Xeon 2.6 GHz machine with 440GB RAM with a single Nvidia P40 GPU. 128GB memory should suffice for most datasets.
* The code makes use of the CPU (mainly for hnswlib) as well as the GPU.

Cite as

@InProceedings{Dahiya21b,
author = "Dahiya, K. and Agarwal, A. and Saini, D. and Gururaj, K. and Jiao, J. and Singh, A. and Agarwal, S. and Kar, P. and Varma, M.",
title = "SiameseXML: Siamese Networks meet Extreme Classifiers with 100M Labels",
booktitle = "Proceedings of the International Conference on Machine Learning",
month = "July",
year = "2021"
}

References


[1] K. Dahiya, A. Agarwal, D. Saini, K. Gururaj, J. Jiao, A. Singh, S. Agarwal, P. Kar, and M. Varma. SiameseXML: Siamese networks meet extreme classifiers with 100M labels. In ICML, 2021.

[2] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, and M. Varma. DeepXML: A deep extreme multi-label learning framework applied to short text documents. In WSDM, 2021.

[3] pyxclib: https://github.com/kunaldahiya/pyxclib

About

Implementation of SiameseXML (ICML 2021)
