Dark Mode

Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

IPNP-BIPN/SPLIT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

History

12 Commits

Repository files navigation

SPLIT

SNP-Level Inspection of Parental Transcripts

Nextflow DSL2 pipeline for allele-specific RNA-seq analysis using STAR, SNPsplit, and featureCounts. Ultra-minimalist -- 2 files only (main.nf + nextflow.config). Designed for solo bioinformaticians.


Pipeline Overview

SRA_DL["SRA_DOWNLOAD"] GEO["GSE / GSM"] --> RESOLVE["RESOLVE_GEO"] --> SRA_DL FQ_DIR["FASTQ directory"] CSV["CSV samplesheet"] end SRA_DL --> FASTQS(("FASTQs")) FQ_DIR --> FASTQS CSV --> FASTQS DOWNLOAD["DOWNLOAD_REFERENCES"] --> GPREP["SNPSPLIT_GENOME_PREP"] GPREP --> IDX1["STAR_INDEX (N-masked)"] DOWNLOAD --> IDX2["STAR_INDEX (reference)"] IDX1 --> A1["STAR_ALIGN (N-masked)"] IDX2 --> A2["STAR_ALIGN (reference)"] FASTQS --> A1 FASTQS --> A2 A1 --> S1["SORT_DEDUP"] --> SNP["SNPSPLIT"] A2 --> S2["SORT_DEDUP "] --> FC3["FEATURECOUNTS (reference)"] SNP -->|"genome1"| FC1["FEATURECOUNTS (genome1)"] SNP -->|"genome2"| FC2["FEATURECOUNTS (genome2)"] FC1 --> O1["genome1 counts"] FC2 --> O2["genome2 counts"] FC3 --> O3["reference counts"] FC1 & FC2 & FC3 --> MQC["MULTIQC"] --> O4["MultiQC report"] classDef input fill:#0570b0,stroke:#0570b0,color:#fff classDef process fill:#238b45,stroke:#238b45,color:#fff classDef key fill:#cb181d,stroke:#cb181d,color:#fff,stroke-width:3px classDef output fill:#6a51a3,stroke:#6a51a3,color:#fff classDef data fill:#e6550d,stroke:#e6550d,color:#fff classDef mqc fill:#41ab5d,stroke:#41ab5d,color:#fff class SRA,GEO,FQ_DIR,CSV input class SRA_DL,RESOLVE,DOWNLOAD,GPREP,IDX1,IDX2,A1,S1,A2,S2,FC1,FC2, FC3 process class SNP key class O1,O2,O3,O4 output class FASTQS data class MQC mqc " dir="auto">
%%{init: {'theme': 'base', 'themeVariables': {'background': '#ffffff', 'primaryTextColor': '#231f20', 'lineColor': '#999999', 'textColor': '#231f20', 'mainBkg': '#ffffff', 'nodeBorder': '#999999'}}}%%
flowchart TD
subgraph INPUT ["Input (one of)"]
SRA["SRR / ERR / DRR"] --> SRA_DL["SRA_DOWNLOAD"]
GEO["GSE / GSM"] --> RESOLVE["RESOLVE_GEO"] --> SRA_DL
FQ_DIR["FASTQ directory"]
CSV["CSV samplesheet"]
end

SRA_DL --> FASTQS(("FASTQs"))
FQ_DIR --> FASTQS
CSV --> FASTQS

DOWNLOAD["DOWNLOAD_REFERENCES"] --> GPREP["SNPSPLIT_GENOME_PREP"]
GPREP --> IDX1["STAR_INDEX (N-masked)"]
DOWNLOAD --> IDX2["STAR_INDEX (reference)"]

IDX1 --> A1["STAR_ALIGN (N-masked)"]
IDX2 --> A2["STAR_ALIGN (reference)"]
FASTQS --> A1
FASTQS --> A2

A1 --> S1["SORT_DEDUP"] --> SNP["SNPSPLIT"]
A2 --> S2["SORT_DEDUP "] --> FC3["FEATURECOUNTS (reference)"]

SNP -->|"genome1"| FC1["FEATURECOUNTS (genome1)"]
SNP -->|"genome2"| FC2["FEATURECOUNTS (genome2)"]

FC1 --> O1["genome1 counts"]
FC2 --> O2["genome2 counts"]
FC3 --> O3["reference counts"]
FC1 & FC2 & FC3 --> MQC["MULTIQC"] --> O4["MultiQC report"]

classDef input fill:#0570b0,stroke:#0570b0,color:#fff
classDef process fill:#238b45,stroke:#238b45,color:#fff
classDef key fill:#cb181d,stroke:#cb181d,color:#fff,stroke-width:3px
classDef output fill:#6a51a3,stroke:#6a51a3,color:#fff
classDef data fill:#e6550d,stroke:#e6550d,color:#fff
classDef mqc fill:#41ab5d,stroke:#41ab5d,color:#fff

class SRA,GEO,FQ_DIR,CSV input
class SRA_DL,RESOLVE,DOWNLOAD,GPREP,IDX1,IDX2,A1,S1,A2,S2,FC1,FC2, FC3 process
class SNP key
class O1,O2,O3,O4 output
class FASTQS data
class MQC mqc
Loading

Quick Start

# From a FASTQ directory (auto-detects PE/SE)
nextflow run IPNP-BIPN/SPLIT --fastq_dir /path/to/fastqs --outdir results -resume

# From SRA accessions
nextflow run IPNP-BIPN/SPLIT --sra_ids "SRR1234567,SRR1234568" --outdir results -resume

# From a GEO dataset (auto-resolves GSE - SRR)
nextflow run IPNP-BIPN/SPLIT --sra_ids GSE80810 --outdir results -resume

# From a samplesheet CSV
nextflow run IPNP-BIPN/SPLIT --input samplesheet.csv --outdir results -resume

# Custom strains
nextflow run IPNP-BIPN/SPLIT \
--fastq_dir /path/to/fastqs \
--strain1 CAST_EiJ \
--strain2 C57BL_6NJ \
--dedup true \
--outdir results \
-resume

Samplesheet format (CSV)

sample,fastq_1,fastq_2
sampleA,/path/to/sampleA_R1_001.fastq.gz,/path/to/sampleA_R2_001.fastq.gz
sampleB,/path/to/sampleB.fastq.gz,

Leave fastq_2 empty for single-end reads.


Parameters

Parameter Default Description
--input null Samplesheet CSV (sample,fastq_1,fastq_2)
--fastq_dir null Directory of FASTQs (*.fastq.gz)
--sra_ids null SRA/GEO accessions (comma-separated or file)
--outdir results Output directory
--strain1 CAST_EiJ First strain in VCF - genome1
--strain2 C57BL_6NJ Second strain in VCF - genome2
--dedup true Remove PCR duplicates (samtools markdup -r)
--strandedness 0 featureCounts strandedness (0/1/2)
--force_se false Force single-end counting in featureCounts
--genome_url Ensembl GRCm39 Genome FASTA URL
--gtf_url Ensembl 2023_04 GTF annotation URL
--vcf_url MGP REL2021 v8 VCF SNPs URL
--star_limit_genome_ram 60000000000 STAR --limitGenomeGenerateRAM
--max_cpus auto Maximum number of CPUs
--max_memory 64 GB Maximum memory (scales all process labels)

Pre-built references (skip downloads)

Parameter Description
--genome_fa Pre-downloaded genome FASTA
--gtf Pre-downloaded GTF
--vcf Pre-downloaded VCF
--star_index_nmask Pre-built STAR index (N-masked)
--star_index_ref Pre-built STAR index (reference)
--snp_file Pre-existing SNPsplit SNP annotation file

Output Structure

results/
+-- 00_sra_fastq/ # Downloaded FASTQs (if SRA input)
+-- 04_aln_nmask/ # N-mask aligned BAMs (sorted, deduped)
+-- 05_aln_ref/ # Reference aligned BAMs (sorted, deduped)
+-- 06_snpsplit/ # SNPsplit output (genome1, genome2, unassigned)
+-- 07_counts/
| +-- counts_genome1_CAST_EiJ.txt # Allele-specific counts (strain1)
| +-- counts_genome2_C57BL_6NJ.txt # Allele-specific counts (strain2)
| +-- counts_reference.txt # Standard reference counts
+-- 08_multiqc/ # Aggregated QC report
+-- reference/ # Downloaded + cached references
| +-- genome.fa # GRCm39 soft-masked
| +-- genes.gtf # Ensembl annotation
| +-- snps.vcf.gz # MGP REL2021 SNPs
| +-- snpsplit_prep/ # N-masked genome + SNP file
| +-- star_nmask/ # STAR index (N-masked)
| +-- star_ref/ # STAR index (reference)
+-- pipeline_info/ # Nextflow timeline, trace, DAG, report

Requirements

Core (always required): STAR samtools SNPsplit featureCounts (subread) multiqc wget

Optional: sra-tools bgzip (htslib/tabix) -- for SRA download

Nextflow >= 23.04


How it works

  1. References are automatically downloaded from Ensembl (GRCm39 genome + GTF) and MGP (SNP VCF). Cached via storeDir -- only downloaded once.

  2. SNPsplit genome preparation creates an N-masked genome where strain-discriminating SNP positions are replaced by N. This prevents alignment bias toward the reference allele.

  3. Two parallel alignment tracks:

    • N-masked track: alignments used for allele-specific analysis (SNPsplit)
    • Reference track: standard alignments for total gene expression
  4. SNPsplit assigns each read from the N-mask track to genome1 (strain1), genome2 (strain2), or unassigned based on informative SNP positions.

  5. featureCounts produces three count tables using gene_name attribute for human-readable gene symbols (e.g., Gapdh instead of ENSMUSG00000057666).


Resume & Cache

The pipeline natively leverages Nextflow's cache (-resume). Already completed steps are automatically skipped. References (genome, GTF, VCF, STAR indexes, N-masked genome) are persisted via storeDir and reused across runs.

# Re-run after a crash -- picks up exactly where it left off
nextflow run main.nf --fastq_dir fastqs --outdir results -resume

License

MIT

About

Allele-specific RNA-seq pipeline: STAR + SNPsplit + featureCounts (Nextflow DSL2)

Topics

Resources

Readme

Stars

Watchers

Forks

Releases

No releases published

Packages

Contributors