DEXA
Code for DEXA: Deep Encoders with Auxiliary Parameters for Extreme Classification [1]
Setting up
Expected directory structure
| +-- programs
| | +-- dexa
| | +-- dexa
| +-- data
| +--
| +-- models
| +-- results
Download data for DEXA
* Extract the zipped file into the data directory.
* The following files should be available in
- trn.json.gz
- tst.json.gz
- lbl.json.gz
- filter_labels_test.txt
- filter_labels_train.txt
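A quick sanity check after extraction can be sketched as below. This is only an illustration, not part of the DEXA code: the helper names `read_json_gz` and `missing_raw_files` are hypothetical, and the reader assumes each `.json.gz` file stores one JSON object per line, which should be verified against the downloaded data.

```python
import gzip
import json
from pathlib import Path


def read_json_gz(path):
    """Read a gzipped file, assuming one JSON object per line (an assumption)."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def missing_raw_files(data_dir):
    """Return the names of the expected raw text files absent from data_dir."""
    expected = ["trn.json.gz", "tst.json.gz", "lbl.json.gz"]
    return [name for name in expected if not (Path(data_dir) / name).is_file()]
```

An empty list from `missing_raw_files` means the three raw text files are in place; the first record returned by `read_json_gz` shows which fields the dataset actually provides.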
Example use cases
A single learner
Extract and tokenize data as follows.
The algorithm can be run as follows. A JSON file (e.g., config/DEXA/LF-AmazonTitles-131K.json) specifies the architecture and other arguments. Please refer to the full documentation below for more details.
Full Documentation
Tokenize the data
* dataset
- Name of the dataset.
- Tokenizer expects the following files in
- trn.json.gz
- tst.json.gz
- lbl.json.gz
- It writes the following six tokenized files
- trn_doc_input_ids.npy
- trn_doc_attention_mask.npy
- tst_doc_input_ids.npy
- tst_doc_attention_mask.npy
- lbl_input_ids.npy
- lbl_attention_mask.npy
* seq-len
- sequence length of text to consider while tokenizing
- 32 for titles datasets
- 256 for Wikipedia
- 128 for other full-text datasets
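To illustrate how seq-len shapes the tokenized arrays, here is a minimal sketch of fixed-length padding and truncation. It is not the tokenizer used by DEXA: `pad_and_mask` is a hypothetical helper, the token ids are toy values, and `pad_id=0` is an assumption. A real tokenizer would typically also keep the trailing special token when truncating, which this sketch does not.

```python
import numpy as np


def pad_and_mask(token_ids, seq_len, pad_id=0):
    """Truncate or pad each token-id list to seq_len and build its attention mask."""
    input_ids = np.full((len(token_ids), seq_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(token_ids), seq_len), dtype=np.int64)
    for row, ids in enumerate(token_ids):
        ids = ids[:seq_len]  # drop tokens beyond seq_len
        input_ids[row, :len(ids)] = ids
        attention_mask[row, :len(ids)] = 1  # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask


# Toy token ids standing in for real tokenizer output (hypothetical values).
toy_ids = [[101, 7, 8, 102], [101, 9, 102]]
input_ids, attention_mask = pad_and_mask(toy_ids, seq_len=32)

# Both arrays have shape (number of texts, seq_len) and could be saved with
# np.save("trn_doc_input_ids.npy", input_ids), and likewise for the mask.
```

A larger seq-len keeps more of each text but increases memory and compute per document, which is why short-text title datasets use a much smaller value than full-text ones.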
Run DEXA
* gpu_id: Run the program on this GPU.
* type
DEXA builds upon NGAME [2], SiameseXML [3] and DeepXML [4] for training. The encoder is trained in module M-I and the classifier is trained in module M-IV.
- DEXA: The intermediate representation is not fine-tuned while training the classifier (more scalable; suitable for large datasets).
* dataset
- Name of the dataset.
- DEXA expects the following files in
- trn_doc_input_ids.npy
- trn_doc_attention_mask.npy
- trn_X_Y.txt
- tst_doc_input_ids.npy
- tst_doc_attention_mask.npy
- tst_X_Y.txt
- lbl_input_ids.npy
- lbl_attention_mask.npy
- filter_labels_test.txt (put empty file or set as null in config when unavailable)
* version
- Different runs can be managed by version and seed.
- models and results are stored with this argument.
* seed
- seed value as used by numpy and PyTorch.
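Before launching a run, the inputs and seed described above could be checked with a small script like the following. This is a sketch under stated assumptions, not the repository's own code: `missing_inputs` and `set_seed` are hypothetical helpers, the dataset directory is whatever path holds the tokenized files, and PyTorch is seeded only if it is installed.

```python
from pathlib import Path

import numpy as np

# File names taken from the list of inputs DEXA expects.
REQUIRED_FILES = [
    "trn_doc_input_ids.npy",
    "trn_doc_attention_mask.npy",
    "trn_X_Y.txt",
    "tst_doc_input_ids.npy",
    "tst_doc_attention_mask.npy",
    "tst_X_Y.txt",
    "lbl_input_ids.npy",
    "lbl_attention_mask.npy",
]


def missing_inputs(data_dir, filter_file="filter_labels_test.txt"):
    """Return expected files absent from data_dir.

    Pass filter_file=None when the filter file is set to null in the config.
    """
    names = list(REQUIRED_FILES)
    if filter_file is not None:
        names.append(filter_file)
    return [name for name in names if not (Path(data_dir) / name).is_file()]


def set_seed(seed):
    """Seed numpy and, when available, PyTorch (CPU and all GPUs)."""
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; only numpy was seeded
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call without CUDA
```

Calling `set_seed` with the same value before two runs makes their numpy and PyTorch random streams start identically, although some GPU operations can still be non-deterministic.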
Cite as
author = "Dahiya, K. and Yadav, S. and Sondhi, S. and Saini, D. and Mehta, S. and Jiao, J. and Agarwal, S. and Kar, P. and Varma, M.",
title = "Deep encoders with auxiliary parameters for extreme classification",
booktitle = "KDD",
month = "August",
year = "2023"
}
YOU MAY ALSO LIKE
- NGAME: Negative mining-aware mini-batching for extreme classification
- SiameseXML: Siamese networks meet extreme classifiers with 100M labels
- DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents
- DECAF: Deep Extreme Classification with Label Features
- ECLARE: Extreme Classification with Label Graph Correlations
- GalaXC: Graph Neural Networks with Labelwise Attention for Extreme Classification
References
[1] K. Dahiya, S. Yadav, S. Sondhi, D. Saini, S. Mehta, J. Jiao, S. Agarwal, P. Kar and M. Varma. Deep encoders with auxiliary parameters for extreme classification. In KDD, Long Beach (CA), August 2023.
[2] K. Dahiya, N. Gupta, D. Saini, A. Soni, Y. Wang, K. Dave, J. Jiao, K. Gururaj, P. Dey, A. Singh, D. Hada, V. Jain, B. Paliwal, A. Mittal, S. Mehta, R. Ramjee, S. Agarwal, P. Kar and M. Varma. NGAME: Negative mining-aware mini-batching for extreme classification. In WSDM, Singapore, March 2023.
[3] K. Dahiya, A. Agarwal, D. Saini, K. Gururaj, J. Jiao, A. Singh, S. Agarwal, P. Kar and M. Varma. SiameseXML: Siamese networks meet extreme classifiers with 100M labels. In ICML, July 2021.
[4] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal and M. Varma. DeepXML: A deep extreme multi-label learning framework applied to short text documents. In WSDM, 2021.
[5] pyxclib: https://github.com/kunaldahiya/pyxclib
[6] The Extreme Classification Repository: http://manikvarma.org/downloads/XC/XMLRepository.html