samgoldman97/mist-cf

MIST-CF: Metabolite Inference with Spectrum Transformers (Chemical Formula)

This repository provides implementations and code examples for MIST-CF, an extension of MIST for annotating MS1 precursor masses from MS/MS data in a de novo setting. MIST-CF ranks chemical formula and adduct assignments for an unknown mass spectrum using an end-to-end, energy-based modeling approach, without referencing any spectrum databases. Instead of computing fragmentation trees, MIST-CF adopts a formula transformer neural network architecture and learns in a data-dependent fashion.
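To make the ranking idea concrete, here is a minimal sketch of how per-candidate scores could be turned into a ranked list of (formula, adduct) pairs. The scores, the candidate list, and the function name `rank_candidates` are hypothetical; in MIST-CF the scores come from the trained formula transformer, which is not reproduced here. The "higher is better" convention and the softmax normalization are assumptions made for readability only.

```python
import math

def rank_candidates(scores):
    """Sort candidates by score (higher = better, an assumed convention)
    and attach a softmax-normalized weight to each one."""
    max_s = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exp = {c: math.exp(s - max_s) for c, s in scores.items()}
    total = sum(exp.values())
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(c, exp[c] / total) for c in ranked]

# Hypothetical scores for three (formula, adduct) candidates.
example_scores = {
    ("C6H12O6", "[M+H]+"): 2.1,
    ("C7H8N4O2", "[M+H]+"): 0.3,
    ("C8H14O3", "[M+Na]+"): -1.0,
}
ranking = rank_candidates(example_scores)
for candidate, weight in ranking:
    print(candidate, round(weight, 3))
```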

Paper: https://pubs.acs.org/doi/full/10.1021/acs.jcim.3c01082

We note several advances to the MIST-CF chemical formula transformer architecture over the original MIST chemical formula transformer that we plan to add back into the MIST model architecture used for fingerprint prediction in future work:

Utilizing an internal chemical subformula assignment protocol (rather than SIRIUS fragmentation trees)
Considering multiple adduct types beyond [M+H]+ (still only positive mode)
Utilizing sinusoidal formula embeddings as developed in our previous work SCARF
Embedding instrument type used to measure the MS/MS as an additional model "covariate" to help make predictions
Embedding the neutral loss fragment formula for each peak in addition to the fragment formula
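Two of the items above can be illustrated with a short sketch: computing the neutral loss formula of a peak (precursor formula minus fragment formula) and encoding an integer element count with sinusoids. All function names are hypothetical, the parser handles only simple formulae without brackets or charges, and the embedding uses the generic transformer-style sin/cos scheme, which is an assumption and not necessarily the exact SCARF formulation.

```python
import math
import re
from collections import Counter

def parse_formula(formula):
    """Parse a simple formula such as 'C6H12O6' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def neutral_loss(precursor, fragment):
    """Element-wise difference between precursor and fragment formulae."""
    loss = parse_formula(precursor)
    loss.subtract(parse_formula(fragment))
    if any(v < 0 for v in loss.values()):
        raise ValueError("fragment is not a subformula of the precursor")
    return {el: n for el, n in loss.items() if n > 0}

def sinusoidal_count_embedding(count, dim=8):
    """Encode one integer element count with sin/cos features (generic scheme)."""
    angles = [count / (10000 ** (2 * i / dim)) for i in range(dim // 2)]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

# Losing H2O from a hexose precursor.
print(neutral_loss("C6H12O6", "C6H10O5"))
print(sinusoidal_count_embedding(12))
```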

Install & setup
Quick start
Data
Training models
Experiments
Analysis
Citations

Install & setup

After git cloning the repository, the environment and package can be installed using Mamba:

mamba env create -f environment.yml
mamba activate ms-gen
pip install -r requirements.txt
python setup.py develop

SIRIUS

To list all potential formulae for an observed MS1 mass, we use the dynamic programming algorithm implemented by SIRIUS (SIRIUS decomp), which is provided as an independent module. SIRIUS can be downloaded and moved into a suitable folder using the following commands. For non-Linux systems, we suggest visiting the SIRIUS website.
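For intuition about what formula enumeration does, the sketch below lists C/H/N/O formulae whose monoisotopic mass falls within a ppm tolerance of a neutral target mass. It is a brute-force illustration, not the SIRIUS decomp algorithm (which uses dynamic programming and supports more elements); the function name, the element set, and the tolerance are assumptions, and no chemical plausibility rules are applied.

```python
# Monoisotopic masses (Da) of the four elements considered in this sketch.
MONO_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}

def decompose_mass(target, ppm=5.0):
    """Enumerate C/H/N/O formulae with monoisotopic mass within `ppm` of target."""
    tol = target * ppm * 1e-6
    hits = []
    for c in range(int(target // MONO_MASS["C"]) + 1):
        rem_c = target - c * MONO_MASS["C"]
        for n in range(int((rem_c + tol) // MONO_MASS["N"]) + 1):
            rem_n = rem_c - n * MONO_MASS["N"]
            for o in range(int((rem_n + tol) // MONO_MASS["O"]) + 1):
                rem_o = rem_n - o * MONO_MASS["O"]
                # The remaining mass must be explained by hydrogens alone.
                h = round(rem_o / MONO_MASS["H"])
                if h < 0:
                    continue
                if abs(rem_o - h * MONO_MASS["H"]) <= tol:
                    hits.append({"C": c, "H": h, "N": n, "O": o})
    return hits

# Neutral monoisotopic mass of C6H12O6 is about 180.06339 Da.
candidates = decompose_mass(180.06339)
print(len(candidates), "candidate formulae")
```

Even with four elements and a tight tolerance, several candidates usually survive, which is why a learned model is needed to rank them.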

Download SIRIUS:

device=linux64
version=5.5.7
url_base=https://github.com/boecker-lab/sirius/releases/download/
wget ${url_base}v$version/sirius-$version-$device-headless.zip
unzip sirius-$version-$device-headless.zip
rm sirius-$version-$device-headless.zip

Set SIRIUS environment variable:

echo 'export SIRIUS_PATH=${your_sirius_path}' >> ~/.bashrc
. ~/.bashrc
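A quick way to confirm that the variable is visible from Python is sketched below. The helper `get_sirius_path` is hypothetical and is not part of the mist_cf package; how the repository itself reads SIRIUS_PATH is not shown in this README.

```python
import os
from pathlib import Path

def get_sirius_path(env=None):
    """Return the SIRIUS location from SIRIUS_PATH, or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get("SIRIUS_PATH", "").strip()
    if not value:
        raise RuntimeError("SIRIUS_PATH is not set; export it in ~/.bashrc first")
    return Path(value).expanduser()

# Example with an explicit mapping so the snippet runs without touching the shell.
print(get_sirius_path({"SIRIUS_PATH": "/opt/sirius"}))
```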

Quick start

We have released a trained MIST-CF model (trained on the public NPLIB1/CANOPUS dataset). It can be downloaded with quickstart/download_model.sh and used to make predictions for a set of 10 spectra from CASMI22, included at data/demo_specs.mgf, using the following commands:

# MIST-CF quickstart
. quickstart/download_model.sh
. quickstart/run_model.sh

Model output will be saved in quickstart/mist_cf_out/. This model may perform worse than the model trained on the commercial NIST20 library, particularly for Orbitrap or other high-resolution data. Download links to models trained on NIST20 are available upon reasonable request to users with a NIST license.
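The demo input is an MGF file. For readers unfamiliar with the format, the following is a minimal parser sketch that reads `BEGIN IONS` / `END IONS` blocks into metadata and peak lists. The function name and the example spectrum are hypothetical, the sketch assumes two-column peak lines, and the exact metadata fields present in data/demo_specs.mgf are not verified here.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {'meta': dict, 'peaks': [(mz, intensity)]}."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            current = {"meta": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Metadata line, e.g. PEPMASS=181.0707
                key, value = line.split("=", 1)
                current["meta"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity
                parts = line.split()
                current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra

demo = """BEGIN IONS
TITLE=example_spectrum
PEPMASS=181.0707
CHARGE=1+
85.0284 120.0
127.0390 45.5
END IONS
"""
spectra = parse_mgf(demo)
print(spectra[0]["meta"], len(spectra[0]["peaks"]))
```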

MIST-CF scores the agreement between a precursor formula candidate and an unknown spectrum. Refer to notebooks/demo_mist_cf.ipynb for an interactive demo of how to use MIST-CF and also set user-defined formula candidates.
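When supplying your own formula candidates, it can help to first check that each (formula, adduct) pair is consistent with the observed precursor m/z. The sketch below does this with a ppm tolerance. The adduct list, the mass shifts, the tolerance, and the function names are assumptions for illustration; the set of adducts actually supported by MIST-CF should be taken from the notebook and source code.

```python
# Mass shifts (Da) added to the neutral monoisotopic mass for singly charged
# positive-mode adducts; values are commonly tabulated constants.
ADDUCT_SHIFT = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
}

def ppm_error(neutral_mass, adduct, observed_mz):
    """Signed ppm difference between observed and theoretical precursor m/z."""
    theoretical = neutral_mass + ADDUCT_SHIFT[adduct]
    return (observed_mz - theoretical) / theoretical * 1e6

def is_consistent(neutral_mass, adduct, observed_mz, tol_ppm=10.0):
    """True if the candidate explains the precursor m/z within the tolerance."""
    return abs(ppm_error(neutral_mass, adduct, observed_mz)) <= tol_ppm

glucose_mass = 180.063388  # neutral monoisotopic mass of C6H12O6
print(is_consistent(glucose_mass, "[M+H]+", 181.0707))
print(is_consistent(glucose_mass, "[M+Na]+", 181.0707))
```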

Data

Four key datasets were used in the process of this paper:

biomols: A dataset of biologically relevant molecules that we used to learn a fast filter model
NPLIB1: A public natural products dataset extracted from the GNPS database. NPLIB1 is used for model training and evaluation.
NPLIB1 + NIST: A proprietary dataset of NPLIB1 and NIST20 spectra. This dataset is used for model training and evaluation.
CASMI 2022: A dataset of positive mode spectra from the CASMI 2022 challenge. This dataset is used for prospective analysis.

We provide instructions for downloading and extracting datasets biomols, NPLIB1, and CASMI 2022 as we preprocessed them.

For a more in depth view and understanding of our preprocessing pipeline, we refer the reader to preprocessing/run_all.sh and other relevant preprocessing scripts.

BIOMOL

The fast filter model is trained using a large database of molecules and chemical formulae extracted from various sources and prepared by Dührkop et al.

# download data and extract it under ./data/biomols
wget https://zenodo.org/record/8151490/files/biomols.zip
unzip biomols.zip
mv biomols data/
rm biomols.zip

NPLIB1

NPLIB1 is a public natural products dataset extracted from the GNPS database and is used for model training and evaluation. We refer to it in the directory structure as "canopus_train".

# download data and extract it under ./data/canopus_train
wget https://zenodo.org/record/8151490/files/canopus_train.zip
unzip canopus_train.zip
mv canopus_train data/
rm canopus_train.zip

CASMI 2022

The CASMI 2022 dataset is a recent, well-accepted benchmark.

# download data and extract it under ./data/casmi22
wget https://zenodo.org/record/8151490/files/casmi22.zip
unzip casmi22.zip
mv casmi22 data/
rm casmi22.zip

Training models

Most of our data processing pipelines refer to a dataset, nist_canopus, that includes both NIST and NPLIB1. The following instructions provide a simple demo of how to train a model on NPLIB1 alone (canopus_train).

Training a fast filter model:

. run_scripts/public_data_train/train_fast_filter.sh

Training a MIST-CF model:

. run_scripts/public_data_train/train_mist_cf.sh

The exact source locations of the training and prediction files are also listed for reference:

FastFilter

Train: src/mist_cf/fast_form_score/train.py
Predict: src/mist_cf/fast_form_score/predict.py

MIST-CF

Train: src/mist_cf/mist_cf_score/train.py
Predict: src/mist_cf/mist_cf_score/predict.py
Prospective analysis: src/mist_cf/mist_cf_score/predict_mgf.py

FFN

Train: src/mist_cf/ffn_score/train.py
Predict: src/mist_cf/ffn_score/predict.py

Xformer

Train: src/mist_cf/xformer_score/train.py
Predict: src/mist_cf/xformer_score/predict.py

Experiments

The entries below provide a record of the experiments used in our initial MIST-CF paper. They are not intended to be re-run exactly as shown, because they require the nist_canopus subfolder and data, but they demonstrate the parameters and call signatures we used.

Evaluate fast filter

Evaluate how well the fast filter reduces the number of formula candidates.

Experiment pipeline:

Hyperopt fast filter model: run_scripts/hyperparams/find_fast_params.sh
Edit fast filter config file: configs/fast_filter.yaml
Train fast filter model: run_scripts/fast_filter/launch_fast_train.sh
Evaluate fast filter prediction: run_scripts/fast_filter/launch_fast_pred.py
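The role of the fast filter, shrinking a large candidate list before the more expensive model is applied, can be sketched as a simple top-k selection. The scores below are made up, and `prefilter_candidates` is a hypothetical helper rather than a function from src/mist_cf/fast_form_score; the real filter scores candidates with a trained model.

```python
import heapq

def prefilter_candidates(candidates, fast_score, keep=2):
    """Keep the `keep` highest-scoring candidates under a cheap scoring function."""
    return heapq.nlargest(keep, candidates, key=fast_score)

# Hypothetical fast-filter scores; higher means "more plausible" in this sketch.
toy_scores = {"C6H12O6": 0.92, "C7H8N4O2": 0.81, "C2H16N8O": 0.03}
shortlist = prefilter_candidates(list(toy_scores), toy_scores.get, keep=2)
print(shortlist)
```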

Retrospective benchmarking

Benchmark MIST-CF performance with baseline models.

Experiment pipeline:

Hyperopt all models
- run_scripts/hyperparams/find_mist_cf_params.sh
- run_scripts/hyperparams/find_ffn_params.sh
- run_scripts/hyperparams/find_xformer_params.sh
Edit model config files:
- configs/mist_cf_canopus.yaml
- configs/ffn_canopus.yaml
- configs/ms1_canopus.yaml (hyperparameter same as ffn)
- configs/xformer_canopus.yaml
Train models:
- run_scripts/benchmarking/train_mist_cf.sh
- run_scripts/benchmarking/train_ffn.sh
- run_scripts/benchmarking/train_ms1.sh
- run_scripts/benchmarking/train_xformer.sh
Evaluate models: run_scripts/benchmarking/eval_models.py
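A common way to compare formula rankers is top-k accuracy: the fraction of spectra whose true formula appears among the first k ranked candidates. The sketch below shows that computation on toy data. It is an assumption that this mirrors the metric computed in run_scripts/benchmarking/eval_models.py; the function name and data layout are hypothetical.

```python
def top_k_accuracy(ranked_predictions, true_formulae, k):
    """Fraction of spectra whose true formula is within the top-k predictions."""
    hits = sum(
        1
        for spec_id, truth in true_formulae.items()
        if truth in ranked_predictions.get(spec_id, [])[:k]
    )
    return hits / len(true_formulae)

# Toy ranked outputs (best first) and ground truth for two spectra.
predictions = {
    "spec_1": ["C6H12O6", "C7H8N4O2"],
    "spec_2": ["C9H11NO2", "C8H9NO3"],
}
truth = {"spec_1": "C6H12O6", "spec_2": "C8H9NO3"}
print(top_k_accuracy(predictions, truth, k=1))
print(top_k_accuracy(predictions, truth, k=2))
```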

Sweep MS2 peak number

Show that few MS2 peaks are sufficient to learn candidate formula ranking.

Experiment pipeline:

Edit MS2 peak config file: configs/mist_cf_canopus_nist_max_subpeak.yaml
Train models: run_scripts/max_subpeak/train_mist_cf_subpeak.sh
Evaluate models: run_scripts/max_subpeak/eval_models.py
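To illustrate what limiting the number of MS2 peaks means, the sketch below keeps only the most intense peaks of a spectrum. Selecting by intensity is an assumption; how the max_subpeak configuration actually chooses peaks is defined in the repository code, and `keep_most_intense` is a hypothetical helper.

```python
def keep_most_intense(peaks, max_peaks):
    """Keep the `max_peaks` most intense (mz, intensity) peaks, sorted by m/z."""
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_peaks]
    return sorted(top)

toy_peaks = [(50.0, 10.0), (85.0, 120.0), (100.0, 5.0), (127.0, 45.5)]
print(keep_most_intense(toy_peaks, 2))
```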

Comparison with SIRIUS on test data

Compare MIST-CF and SIRIUS using a single split of NPLIB1 test data.

Experiment pipeline:

Run SIRIUS prediction: run_scripts/sirius_compare/sirius_1_run.py
Format SIRIUS output: run_scripts/sirius_compare/sirius_2_wrangle.py
Run MIST-CF prediction: run_scripts/sirius_compare/mist_cf_1_predict.sh
Evaluate performance: run_scripts/sirius_compare/eval_models.py

Prospective analysis: CASMI 2022

Compare MIST-CF and SIRIUS on CASMI-2022.

Experiment pipeline:

Run SIRIUS prediction: run_scripts/casmi22_eval/run_sirius.sh
Format SIRIUS output: run_scripts/casmi22_eval/wrangle_sirius.py
Run MIST-CF prediction: run_scripts/casmi22_eval/run_mist_cf.sh
Evaluate performance: run_scripts/casmi22_eval/eval_models.py

Analysis

Analysis scripts can be found in analysis/; model predictions are evaluated with analysis/evaluate_pred.py.

Additional analyses used for figure generation were conducted in notebooks/.

Citations

If you use this repository, please consider citing both our work and the original SIRIUS papers, as we still rely on their deterministic tool for formula enumerations:

@article{doi:10.1021/acs.jcim.3c01082,
  author = {Goldman, Samuel and Xin, Jiayi and Provenzano, Joules and Coley, Connor W.},
  title = {MIST-CF: Chemical Formula Inference from Tandem Mass Spectra},
  journal = {Journal of Chemical Information and Modeling},
  doi = {10.1021/acs.jcim.3c01082},
  URL = {https://doi.org/10.1021/acs.jcim.3c01082},
}

Böcker, Sebastian, et al. "SIRIUS: decomposing isotope patterns for metabolite identification." Bioinformatics 25.2 (2009): 218-224.
Dührkop, Kai, et al. "SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information." Nature Methods 16.4 (2019): 299-302.

About

Predicting MS1 precursor chemical formula from MS/MS data

License

MIT license
