Lodestone
Table of Contents
- What is Lodestone
- Quick Start
- Executors
- System Requirements
- Parameters
- Stub Runs
- Checkpoints
- Acknowledgments
- License
What is Lodestone?
This pipeline takes as input reads presumed to be from one of 10 mycobacterial genomes: abscessus, africanum, avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis. Input should be in the form of one directory containing pairs of fastq(.gz) or bam files.
Pipeline cleans and QCs reads with fastp and FastQC, classifies with Kraken2 & Afanc, removes non-bacterial content, and - by alignment to any minority genomes - disambiguates mixtures of bacterial reads. Cleaned reads are aligned to either of the 10 supported genomes and variants called. Produces as output one directory per sample, containing cleaned fastqs, sorted, indexed BAM, VCF, F2 and F47 statistics, an antibiogram and summary reports.
Note that while Mykrobe is included within this pipeline, it runs as an independent process and is not used for any downstream reporting.
WARNING: There are currently known errors with vcfmix, as such errorStrategy 'ignore' has been added to the processes vcfpredict:vcfmix to stop the pipeline from crashing. Please check the stdout from nextflow to see whether these processes have ran successfully.
Quick Start
This is a Nextflow DSL2 pipeline, it requires a version of Nextflow that supports DSL2 and the stub-run feature. It is recommended to run the pipeline with NXF_VER="24.10.0", as the pipeline has been tested using this version. To download run:
export NXF_VER="24.10.0"
curl -fsSL https://get.nextflow.io | bash
The workflow is designed to run with either docker -profile docker or singularity -profile singularity. The container images are pulled from quay.io and a singularity cache directory is set in the nextflow.config.
Please note, when running the pipeline with Singularity, either the $TMPDIR or $SINGULARITY_TMPDIR must be specified; e.g. export TMPDIR="./" to run in the working directory.
E.g. to run the workflow:
NXF_VER=24.10.0 nextflow run main.nf -profile singularity --filetype fastq --input_dir fq_dir --pattern "*_R{1,2}.fastq.gz" --unmix_myco yes \
--output_dir . --kraken_db /path/to/database --bowtie2_index /path/to/index --bowtie_index_name hg19_1kgmaj
NXF_VER=24.10.0 nextflow run main.nf -profile docker --filetype bam --input_dir bam_dir --unmix_myco no \
--output_dir . --kraken_db /path/to/database --bowtie2_index /path/to/index --bowtie_index_name hg19_1kgmaj
There is also a pre-configured climb profile to run Lodestone on a CLIMB Jupyter Notebook Server. Add -profile climb to your command invocation. The input directory can point to an S3 bucket natively (e.g. --input_dir s3://my-team/bucket). By default this will run the workflow in Docker containers and take advantage of kubernetes pods. The Kraken2, Bowtie2 and Afanc databases will by default point to the pluspf16, hg19_1kgmaj_bt2 and Mycobacteriaciae_DB_7.0 directories by default. These are mounted on a public S3 bucket hosted on CLIMB.
Executors
By default, the pipeline will just run on the local machine. To run on a cluster, modifications will have to be made to the nextflow.config to add in the executor. E.g. for a SLURM cluster add process.executor = 'slurm'. For more information on executor options see the Nextflow docs: https://www.nextflow.io/docs/latest/executor.html
System Requirements
Minimum recommended requirements: 32GB RAM, 8CPU
Paramaters
The following parameters should be set in nextflow.config. They can be accessed by nextflow run main.nf --help:
--input_dir [string] Input directory containing FASTQs or BAMs
--pattern [string] Glob pattern for FASTQs or BAM
--output_dir [string] Output directory
--permissive [boolean] Flag. If True, errors in decontamination will be demoted to warnings
--filetype [string] Either "fastq" or "bam". Assumes FASTQs are PE Illumina reads and BAMs are mapped against one of the references in resources/ (accepted: bam, fastq) [default: fastq]
--unmix_myco [boolean] Flag. If True then minority Mycobacteriaceae reads will be removed. If False, they will be discarded
--species [string] Species which will be mapped against, corresponding to references in resources/: can be one of abscessus, africanum, avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis or null. If 'null' the top hit as determined by Afanc will be used (accepted: null, abscessus, africanum,
avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis)
--sing_dir [string] Directory to singularity definition files. Used to parse versions for reporting [default: ${baseDir}/resources]
--config_file [string] Path to Nextflow config file. Used for parsing arguments to write to results if needed [default: ${baseDir}/nextflow.config]
--help [boolean, string] Show the help message for all top level parameters. When a parameter is given to `--help`, the full help message of that parameter will be printed.
--helpFull [boolean] Show the help message for all non-hidden parameters.
--showHidden [boolean] Show all hidden parameters in the help message. This needs to be used in combination with `--help` or `--helpFull`.
resources
--resource_dir [string] Path to resources directroy where utility files are stored [default: ${baseDir}/resources]
--refseq [string] Path to NCBI refseq summary file [default: ${baseDir}/resources/assembly_summary_refseq.txt]
resistance
--resistance_profiler [string] Tool used for tb-profiler. Either tb-profiler or tbtamr (accepted: tb-profiler, tbtamr) [default: tb-profiler]
--collate [boolean] Flag. If True resistance reports will be summarised
bowtie
--bowtie_index [string] Bowtie index directory [default: ${baseDir}/bowtie2/]
--bowtie_index_name [string] Prefix for the Bowtie2 index (minus the file extensions). [default: hg19_1kgmaj]
afanc
--afanc_percent_threshold [number] Minimum percentage threshold for reads in order for a taxa to be considered in Afanc if the pipeline has failed earlier on (for reporting) [default: 5]
--afanc_n_reads_threshold [integer] Minimum reads threshold for reads in order for a taxa to be considered in Afanc [default: 500]
--afanc_fail_percent_threshold [number] Minimum percentage threshold for reads in order for a taxa to be considered in Afanc [default: 2]
--afanc_fail_n_reads_threshold [integer] Minimum reads threshold for reads in order for a taxa to be considered in Afanc if the pipeline has failed earlier on (for reporting) [default: 200]
kraken
--kraken_percent_threshold [number] Percentage threshold of reads required for taxa to be included in Kraken reports [default: 10]
--kraken_n_reads_threshold [integer] Raw reads threshold required for taxa to be included in Kraken reports [default: 10000]
--kraken_db [string] Kraken2 database path [default: kraken2/]