Extreme-classification/dexa

DEXA

Code for DEXA: Deep Encoders with Auxiliary Parameters for Extreme Classification [1]


Setting up


Expected directory structure

+-- <work_dir>
|  +-- programs
|  |  +-- dexa
|  |    +-- dexa
|  +-- data
|    +-- <dataset>
|  +-- models
|  +-- results
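
The layout above can be created with a few lines of Python. This is only a minimal sketch: the helper name `make_layout` and its arguments are illustrative assumptions, not part of the DEXA code base, and the repository itself still has to be cloned into the programs directory.

```python
from pathlib import Path


def make_layout(work_dir: str, dataset: str) -> list[Path]:
    """Create the directories expected around the DEXA code (hypothetical helper)."""
    root = Path(work_dir)
    wanted = [
        root / "programs",        # the dexa repository is cloned here
        root / "data" / dataset,  # raw and tokenized files for one dataset
        root / "models",
        root / "results",
    ]
    for directory in wanted:
        # parents=True also creates the root; exist_ok=True makes reruns harmless
        directory.mkdir(parents=True, exist_ok=True)
    return wanted
```

Calling `make_layout("my_work_dir", "LF-AmazonTitles-131K")` would leave an empty folder ready to receive the downloaded data.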

Download data for DEXA

* Download the raw data (zipped file) from The Extreme Classification Repository [5].
* Extract the zipped file into the data directory.
* The following files should be available in the dataset's folder under data/ (create empty filter files if unavailable):
  - trn.json.gz
  - tst.json.gz
  - lbl.json.gz
  - filter_labels_test.txt
  - filter_labels_train.txt
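
Before tokenizing, it may help to confirm that the download is complete. The sketch below is an assumption-laden helper, not something shipped with DEXA: `check_raw_data` is a hypothetical name, and the test filter file is named `filter_labels_test.txt` as in the Run DEXA section. It reports missing raw files and creates empty filter files where none exist.

```python
from pathlib import Path

RAW_FILES = ["trn.json.gz", "tst.json.gz", "lbl.json.gz"]
# Assumed names; the test filter follows the name used in the Run DEXA section.
FILTER_FILES = ["filter_labels_test.txt", "filter_labels_train.txt"]


def check_raw_data(data_dir: str) -> list[str]:
    """Return the raw files that are missing; create empty filter files if absent."""
    root = Path(data_dir)
    for name in FILTER_FILES:
        path = root / name
        if not path.exists():
            path.touch()  # empty placeholder, as the list above asks for
    return [name for name in RAW_FILES if not (root / name).exists()]
```

An empty return value means all three raw files are in place.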

Example use cases


A single learner

Extract and tokenize data as follows.

./prepare_data.sh LF-AmazonTitles-131K 32

The algorithm can be run as follows. A JSON file (e.g., config/DEXA/LF-AmazonTitles-131K.json) specifies the architecture and other arguments. Please refer to the full documentation below for more details.

./run_main.sh 0 DEXA LF-AmazonTitles-131K 0 108
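
The five positional arguments are easy to mix up, so the following sketch names them explicitly. The argument order (gpu_id, type, dataset, version, seed) is taken from the Run DEXA documentation. The helpers `build_run_command` and `config_path` are hypothetical, and the config/<type>/<dataset>.json pattern is inferred from the single example path given here, so treat it as an assumption.

```python
def build_run_command(gpu_id: int, model_type: str, dataset: str,
                      version: int, seed: int) -> list[str]:
    """Assemble the positional arguments for run_main.sh in documented order."""
    return ["./run_main.sh", str(gpu_id), model_type, dataset,
            str(version), str(seed)]


def config_path(model_type: str, dataset: str) -> str:
    """Config location, assuming the pattern of the example path (not verified)."""
    return f"config/{model_type}/{dataset}.json"


cmd = build_run_command(0, "DEXA", "LF-AmazonTitles-131K", version=0, seed=108)
print(" ".join(cmd))  # ./run_main.sh 0 DEXA LF-AmazonTitles-131K 0 108

# To launch the run from the directory that contains run_main.sh:
#   import subprocess
#   subprocess.run(cmd, check=True)
```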

Full Documentation

Tokenize the data

./prepare_data.sh <dataset> <seq-len>

* dataset
  - Name of the dataset.
  - The tokenizer expects the following files in the dataset's folder under data/:
    - trn.json.gz
    - tst.json.gz
    - lbl.json.gz
  - It writes the following six tokenized files:
    - trn_doc_input_ids.npy
    - trn_doc_attention_mask.npy
    - tst_doc_input_ids.npy
    - tst_doc_attention_mask.npy
    - lbl_input_ids.npy
    - lbl_attention_mask.npy

* seq-len
  - Sequence length of the text to consider while tokenizing:
    - 32 for titles datasets
    - 256 for Wikipedia
    - 128 for other full-text datasets
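
The sequence-length guidance above can be written as a small lookup. This is a rough sketch under a stated assumption: it guesses the dataset kind from its name (for example "Titles" or "Wiki" in names such as LF-AmazonTitles-131K), which is a naming heuristic and not something the DEXA scripts are known to do. The function name `suggest_seq_len` is hypothetical.

```python
def suggest_seq_len(dataset: str) -> int:
    """Suggest a tokenizer sequence length from a dataset name (heuristic)."""
    name = dataset.lower()
    if "titles" in name:
        return 32    # short-text (titles) datasets
    if "wiki" in name:
        return 256   # Wikipedia full-text datasets
    return 128       # other full-text datasets


print(suggest_seq_len("LF-AmazonTitles-131K"))  # 32
```

The titles check runs first so that a titles-only Wikipedia dataset is still treated as short text.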

Run DEXA

./run_main.sh <gpu_id> <type> <dataset> <version> <seed>

* gpu_id: Run the program on this GPU.

* type
  DEXA builds upon NGAME [2], SiameseXML [3] and DeepXML [4] for training. An encoder is trained in M-I and the classifier is trained in M-IV.
  - DEXA: The intermediate representation is not fine-tuned while training the classifier (more scalable; suitable for large datasets).

* dataset
  - Name of the dataset.
  - DEXA expects the following files in the dataset's folder under data/:
    - trn_doc_input_ids.npy
    - trn_doc_attention_mask.npy
    - trn_X_Y.txt
    - tst_doc_input_ids.npy
    - tst_doc_attention_mask.npy
    - tst_X_Y.txt
    - lbl_input_ids.npy
    - lbl_attention_mask.npy
    - filter_labels_test.txt (put an empty file or set it to null in the config when unavailable)

* version
  - Different runs can be managed with version and seed.
  - Models and results are stored under this argument.

* seed
  - Seed value used by NumPy and PyTorch.
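
For readers who want the same reproducibility in their own scripts, seeding typically looks like the sketch below. It is an illustration only: `set_seed` is a hypothetical helper, DEXA's actual seeding code may differ, and seeding Python's `random` module is an extra that the description above does not mention. NumPy and PyTorch are imported lazily so the snippet also runs where they are not installed.

```python
import random


def set_seed(seed: int) -> None:
    """Seed Python, NumPy and PyTorch generators where available (sketch)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_seed(108)
```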

Cite as

@InProceedings{Dahiya23b,
author = "Dahiya, K. and Yadav, S. and Sondhi, S. and Saini, D. and Mehta, S. and Jiao, J. and Agarwal, S. and Kar, P. and Varma, M.",
title = "Deep encoders with auxiliary parameters for extreme classification",
booktitle = "KDD",
month = "August",
year = "2023"
}

References


[1] K. Dahiya, S. Yadav, S. Sondhi, D. Saini, S. Mehta, J. Jiao, S. Agarwal, P. Kar and M. Varma. Deep encoders with auxiliary parameters for extreme classification. In KDD, Long Beach (CA), August 2023.

[2] K. Dahiya, N. Gupta, D. Saini, A. Soni, Y. Wang, K. Dave, J. Jiao, K. Gururaj, P. Dey, A. Singh, D. Hada, V. Jain, B. Paliwal, A. Mittal, S. Mehta, R. Ramjee, S. Agarwal, P. Kar and M. Varma. NGAME: Negative mining-aware mini-batching for extreme classification. In WSDM, Singapore, March 2023.

[3] K. Dahiya, A. Agarwal, D. Saini, K. Gururaj, J. Jiao, A. Singh, S. Agarwal, P. Kar and M. Varma. SiameseXML: Siamese networks meet extreme classifiers with 100M labels. In ICML, July 2021.

[4] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, and M. Varma. DeepXML: A deep extreme multi-label learning framework applied to short text documents. In WSDM, 2021.

[5] The Extreme Classification Repository: http://manikvarma.org/downloads/XC/XMLRepository.html

[6] pyxclib: https://github.com/kunaldahiya/pyxclib