MIST-CF: Metabolite Inference with Spectrum Transformers (Chemical Formula)
This repository provides implementations and code examples for
MIST-CF, an extension of
MIST for annotating MS1
precursor masses from MS/MS data in a de novo setting. MIST-CF ranks chemical
formula and adduct assignments for an unknown mass spectrum using an end-to-end
energy-based modeling approach, without referencing any spectrum databases.
Instead of computing fragmentation trees, MIST-CF adopts a formula transformer
neural network architecture and learns in a data-dependent fashion.
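As a rough illustration of what ranking candidates with a learned score looks like, the sketch below normalizes per-candidate scores with a softmax and sorts them. The scores are invented, and the convention that a higher score means better agreement with the spectrum is an assumption made for illustration; neither is taken from the MIST-CF code.

```python
import math
from typing import Dict, List, Tuple

Candidate = Tuple[str, str]  # (chemical formula, adduct)


def rank_candidates(scores: Dict[Candidate, float]) -> List[Tuple[Candidate, float]]:
    """Sort candidates by score and attach softmax-normalized probabilities."""
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {cand: math.exp(score - top) for cand, score in scores.items()}
    total = sum(weights.values())
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(cand, weights[cand] / total) for cand in ranked]


# Hypothetical scores for two near-isobaric formula hypotheses of one spectrum.
example_scores = {
    ("C6H12O6", "[M+H]+"): 2.3,
    ("C7H8N4O2", "[M+H]+"): 0.4,
}
ranking = rank_candidates(example_scores)
```

The first entry of `ranking` is the best-scoring (formula, adduct) pair, and the probabilities sum to one over the candidate set.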
The MIST-CF chemical formula transformer makes several advances over the original MIST chemical formula transformer, which we plan to port back into the MIST model architecture used for fingerprint prediction in future work:
Utilizing an internal chemical subformula assignment protocol (rather than SIRIUS fragmentation trees)
Considering multiple adduct types beyond [M+H]+ (still only positive mode)
Utilizing sinusoidal formula embeddings as developed in our previous work SCARF
Embedding instrument type used to measure the MS/MS as an additional model "covariate" to help make predictions
Embedding the neutral loss fragment formula for each peak in addition to the fragment formula
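Two of the items above can be sketched in a few lines. The neutral loss of a peak is the precursor formula minus the fragment formula, and a sinusoidal embedding maps each integer element count to sine and cosine features. The frequency schedule below is the standard transformer positional-encoding one and is an assumption; the exact scheme used in SCARF and MIST-CF may differ, and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def parse_formula(formula: str) -> Counter:
    """Parse a simple formula such as 'C6H12O6' into element counts."""
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def neutral_loss(precursor: str, fragment: str) -> Dict[str, int]:
    """Return the element counts lost when going from precursor to fragment."""
    diff = parse_formula(precursor)
    diff.subtract(parse_formula(fragment))
    if any(count < 0 for count in diff.values()):
        raise ValueError("fragment is not a subformula of the precursor")
    return {element: count for element, count in diff.items() if count > 0}


def sinusoidal_count_embedding(count: int, dim: int = 8) -> List[float]:
    """Encode an integer count as interleaved sin/cos features (assumed schedule)."""
    features: List[float] = []
    for i in range(dim // 2):
        frequency = 1.0 / (10000 ** (2 * i / dim))
        features.append(math.sin(count * frequency))
        features.append(math.cos(count * frequency))
    return features


loss = neutral_loss("C6H12O6", "C6H10O5")  # loss of water
carbon_embedding = sinusoidal_count_embedding(6)
```

In this sketch, each element count of a fragment or neutral loss formula would be embedded separately and the per-element vectors concatenated before being passed to the model.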
Table of Contents
Install & setup
Quick start
Data
Training models
Experiments
Analysis
Citations
Install & setup
After cloning the repository with git, the environment and package can be installed using Mamba:
To list all potential formulae for an observed MS1 mass, we use the dynamic programming algorithm implemented by SIRIUS (SIRIUS decomp), which is provided as an independent module. SIRIUS can be downloaded and moved into the appropriate folder using the following commands. For non-Linux systems, we suggest visiting the SIRIUS website.
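To make the decomposition task concrete, the sketch below enumerates CHNO formulae whose monoisotopic mass matches an observed [M+H]+ precursor within a ppm tolerance. It is a naive brute-force search written for illustration, not the dynamic programming algorithm used by SIRIUS decomp; the element set, the tolerance, and the function names are assumptions, and no chemical plausibility filters are applied.

```python
from typing import Dict, List

# Monoisotopic masses in Da for a small, assumed element set.
MONO_MASS = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}
PROTON_MASS = 1.00727646688  # mass shift of the [M+H]+ adduct


def formula_mass(counts: Dict[str, int]) -> float:
    """Monoisotopic mass of a neutral formula given as element counts."""
    return sum(MONO_MASS[element] * n for element, n in counts.items())


def enumerate_formulae(precursor_mz: float, ppm: float = 5.0,
                       adduct_shift: float = PROTON_MASS) -> List[Dict[str, int]]:
    """Brute-force all CHNO formulae matching the neutral mass within ppm."""
    neutral = precursor_mz - adduct_shift
    tol = neutral * ppm * 1e-6
    upper = neutral + tol
    hits: List[Dict[str, int]] = []
    for c in range(int(upper // MONO_MASS["C"]) + 1):
        left_c = upper - c * MONO_MASS["C"]
        for n in range(int(left_c // MONO_MASS["N"]) + 1):
            left_n = left_c - n * MONO_MASS["N"]
            for o in range(int(left_n // MONO_MASS["O"]) + 1):
                heavy = c * MONO_MASS["C"] + n * MONO_MASS["N"] + o * MONO_MASS["O"]
                # The tolerance is far below one hydrogen mass, so at most
                # one hydrogen count can complete this heavy-atom skeleton.
                h = round((neutral - heavy) / MONO_MASS["H"])
                if h < 0:
                    continue
                candidate = {"C": c, "H": h, "N": n, "O": o}
                if abs(formula_mass(candidate) - neutral) <= tol:
                    hits.append(candidate)
    return hits


# Observed m/z close to protonated glucose, C6H12O6.
hits = enumerate_formulae(181.07066)
```

Other adducts can be tried by passing a different `adduct_shift`; the candidate list grows quickly with mass and with the number of allowed elements, which is the motivation for a fast filter before scoring.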
Quick start
We have released a trained MIST-CF model (trained on the public NPLIB1/CANOPUS dataset). It can be downloaded with quickstart/download_model.sh and used to predict a set of 10 spectra from CASMI22, included at data/demo_specs.mgf, using the following commands:
Model output will be saved in quickstart/mist_cf_out/. This model may be less performant than the model trained on the commercial NIST20 library (particularly for Orbitrap or other high-resolution data). Download links to models trained on NIST20 are available upon reasonable request to users with a NIST license.
MIST-CF scores the agreement between a precursor formula candidate and an unknown spectrum. Refer to notebooks/demo_mist_cf.ipynb for an interactive demo of how to use MIST-CF and also set user-defined formula candidates.
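Before running the demo, it can help to inspect the input spectra. The sketch below is a minimal reader for the MGF text format (BEGIN IONS / END IONS blocks with KEY=VALUE headers followed by m/z and intensity pairs). It is not the parser used inside MIST-CF, the function name is hypothetical, and the inline spectrum and its header fields are invented for illustration; the fields present in data/demo_specs.mgf may differ.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Spectrum = Tuple[Dict[str, str], List[Tuple[float, float]]]


def read_mgf(lines: Iterable[str]) -> Iterator[Spectrum]:
    """Yield (metadata, peaks) for every BEGIN IONS / END IONS block."""
    meta: Dict[str, str] = {}
    peaks: List[Tuple[float, float]] = []
    inside = False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            meta, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                yield meta, peaks
            inside = False
        elif inside and line:
            if "=" in line:
                # Header line, e.g. PEPMASS=181.0707
                key, value = line.split("=", 1)
                meta[key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity
                parts = line.split()
                if len(parts) >= 2:
                    peaks.append((float(parts[0]), float(parts[1])))


# Hypothetical spectrum used only to exercise the reader.
demo_text = """BEGIN IONS
TITLE=demo_spectrum
PEPMASS=181.0707
CHARGE=1+
85.0284 120.0
127.0390 300.5
163.0601 999.0
END IONS
"""
spectra = list(read_mgf(demo_text.splitlines()))
```

To read a file instead of the inline string, pass an open file handle, since iterating over a text file yields its lines.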
Data
Four key datasets were used in this work:
biomols: A dataset of biologically relevant molecules that we used to learn a fast filter model
NPLIB1: A public natural products dataset extracted from the GNPS database. NPLIB1 is used for model training and evaluation.
NPLIB1 + NIST: A proprietary dataset of NPLIB1 and NIST20 spectra. This dataset is used for model training and evaluation.
CASMI 2022: A dataset of positive mode spectra from the CASMI 2022 challenge. This dataset is used for prospective analysis.
We provide instructions for downloading and extracting datasets biomols, NPLIB1, and CASMI 2022 as we preprocessed them.
For a more in-depth view of our preprocessing pipeline, we refer the reader to preprocessing/run_all.sh and the other relevant preprocessing scripts.
BIOMOL
The fast filter model is trained on a large database of molecules and chemical formulae
extracted from various sources, prepared by Duhrkop et al.
# download data and extract it under ./data/biomols
wget https://zenodo.org/record/8151490/files/biomols.zip
unzip biomols.zip
mv biomols data/
rm biomols.zip
NPLIB1
NPLIB1 is a public natural products dataset extracted from the GNPS database. NPLIB1 is used for model training and evaluation. We refer to this in the directory structure as "canopus_train".
# download data and extract it under ./data/canopus
wget https://zenodo.org/record/8151490/files/canopus_train.zip
The CASMI 2022 dataset is a recent, well-accepted benchmark.
# download data and extract it under ./data/casmi22
wget https://zenodo.org/record/8151490/files/casmi22.zip
unzip casmi22.zip
mv casmi22 data/
rm casmi22.zip
Training models
Most of our data processing pipelines refer to a dataset, nist_canopus, that includes both NIST and NPLIB1. The following instructions provide a simple demo of how to train a model on NPLIB1 alone (canopus_train).
Experiments
The entries below provide a record of our experiments used in our initial MIST-CF paper. While not intended to be re-run exactly as shown below due to the requirement of the nist_canopus subfolder and data, these demonstrate the parameters and call signatures we utilized.
Evaluate fast filter
Evaluate how well the fast filter reduces the set of formula candidates.
Experiment pipeline:
Hyperopt fast filter model: run_scripts/hyperparams/find_fast_params.sh
Edit fast filter config file: configs/fast_filter.yaml
Train fast filter model: run_scripts/fast_filter/launch_fast_train.sh
Evaluate fast filter prediction: run_scripts/fast_filter/launch_fast_pred.py
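One natural way to judge a candidate filter is to ask how often the true formula survives when only the top k candidates are kept. The sketch below computes that top-k recall on invented data. The metric, the function name, and the example rankings are assumptions for illustration; the scripts listed above may report different statistics.

```python
from typing import List, Sequence


def topk_recall(ranked_candidates: Sequence[Sequence[str]],
                true_formulae: Sequence[str], k: int) -> float:
    """Fraction of spectra whose true formula is among the top k candidates."""
    if len(ranked_candidates) != len(true_formulae):
        raise ValueError("need one candidate list per true formula")
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_candidates, true_formulae)
    )
    return hits / len(true_formulae)


# Hypothetical filter output: candidates sorted from most to least likely.
filter_rankings: List[List[str]] = [
    ["C6H12O6", "C7H8N4O2", "C9H8O4"],
    ["C9H8O4", "C6H12O6"],
    ["C7H8N4O2"],
]
ground_truth = ["C6H12O6", "C6H12O6", "C9H8O4"]

recall_at_1 = topk_recall(filter_rankings, ground_truth, k=1)
recall_at_2 = topk_recall(filter_rankings, ground_truth, k=2)
```

Sweeping k shows the trade-off between how many candidates are passed on to the slower scoring model and how often the correct formula is lost at the filtering stage.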
Retrospective benchmarking
Benchmark MIST-CF performance against baseline models.
Experiment pipeline:
Hyperopt all models
run_scripts/hyperparams/find_mist_cf_params.sh
run_scripts/hyperparams/find_ffn_params.sh
run_scripts/hyperparams/find_xformer_params.sh
Edit model config files:
configs/mist_cf_canopus.yaml
configs/ffn_canopus.yaml
configs/ms1_canopus.yaml (same hyperparameters as ffn)
Analysis
Scripts for evaluating model predictions, such as analysis/evaluate_pred.py, can be found in analysis/.
Additional analyses used for figure generation were conducted in notebooks/.
Citations
If you use this repository, please consider citing both our work and the original SIRIUS papers, as we still rely on their deterministic tool for formula enumerations:
@article{doi:10.1021/acs.jcim.3c01082,
  author = {Goldman, Samuel and Xin, Jiayi and Provenzano, Joules and Coley, Connor W.},
  title = {MIST-CF: Chemical Formula Inference from Tandem Mass Spectra},
  journal = {Journal of Chemical Information and Modeling},
  doi = {10.1021/acs.jcim.3c01082},
  URL = {https://doi.org/10.1021/acs.jcim.3c01082},
}
Bocker, Sebastian, et al. "SIRIUS: decomposing isotope patterns for metabolite identification." Bioinformatics 25.2 (2009): 218-224.
Duhrkop, Kai, et al. "SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information." Nature methods 16.4 (2019): 299-302.