MONSDA
Installation
Install MONSDA via conda or pip
To install via conda/mamba in an environment named ‘monsda’ simply run
mamba create -n monsda -c bioconda -c conda-forge monsda
To install via pip you first need to create the MONSDA environment as found in the envs directory of this repository (simply clone with git clone) like so:
mamba env create -n monsda -f MONSDA/envs/monsda.yaml
The envs directory holds all the environments needed to run the pipelines in the workflows directory; these will be installed automatically alongside MONSDA.
Then activate the MONSDA environment and install via pip:
conda activate monsda
pip install MONSDA
Install from source
Simply clone this repository with
git clone git@github.com:jfallmann/MONSDA.git
You can then install dependencies as described for pip installation and manually run setup.py.
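Put together, a source installation might look like the following sketch; the environment file name follows the pip instructions above and may differ in your checkout:
git clone git@github.com:jfallmann/MONSDA.git
cd MONSDA
mamba env create -n monsda -f envs/monsda.yaml   # create the MONSDA environment (file name assumed)
conda activate monsda
pip install .   # equivalent to running setup.py manually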
Be aware that MONSDA is version-dependent: config files can only be run with the specified version of MONSDA, which guarantees reproducibility by conserving dependencies and environments.
First Steps
MONSDA acts as a wrapper around Snakemake or Nextflow, based on a user-defined config.json file. This config.json holds all the information that is needed to run the jobs and will be parsed by MONSDA and split into independent sub-configs that can later be found in the directory SubSnakes or SubFlows respectively. Command line execution calls are stored in the directory JOBS, so users can manipulate those or rerun them manually as needed. By default, however, MONSDA will run those jobs automatically, either locally or through Snakemake’s or Nextflow’s integrated cluster interfaces.
To successfully run an analysis pipeline, a few steps have to be followed:
- Install MONSDA either via bioconda or pip, following the instructions in Installation
- Directory structure: the structure of the directories is dictated by The Condition-Tree in the config file
- Config file: this is the central part of a MONSDA run. Depending on The config-file, MONSDA will determine processing steps and generate corresponding config and workflow files to run each subworkflow until all processing steps are done.
In general it is necessary to write a configuration file containing information on paths, files to process and settings beyond default for mapping tools and others. The template on which analysis is based can be found in the config directory and will be explained in detail later.
To create a working environment for this repository please install the MONSDA.yaml environment (if not installed via bioconda) as found in the envs directory like so:
conda env create -n monsda -f envs/MONSDA.yaml
The envs directory holds all the environments needed to run the pipelines in the workflows directory; these will be installed automatically when needed.
For fast resolution of conda packages, we recommend mamba.
For distribution of jobs one can either rely on local hardware, use scheduling software like Slurm or the SGE, or follow any other integration in Snakemake or Nextflow. Be aware that most of these have not been tested for this repository and usually require additional system-dependent setup and configuration.
This manual will only show examples for local and SLURM usage.
Planning a project run
Before you set up a MONSDA project, you should take a few things into consideration to make sure that MONSDA “understands” your experiment. You should also have a good overview of your data and ideally decide in advance which analysis steps you want to run. Although it is possible to run MONSDA sequentially, building up a config step by step, this will lead to some workflows being run multiple times, as a unique “tool-key” is defined for the set of enabled workflow steps and configured tools. More information on that can be found in The Condition-Tree.
To be well prepared, go through the following steps. Afterwards you will run the Configurator, which helps you to set up a MONSDA project.
STEP 1: What do I want to do exactly?
What is the experimental design? Which conditions do I have?
If you have data from an experiment and it is necessary to distinguish between conditions, MONSDA represents this as a condition-tree. Make sure you know how to transfer your experimental design to the condition-tree. An example could be a simple ‘knockout’ experiment. You would transfer it as follows:
                        ┌── Condition1: "Wild-Type"
Knockout-Experiment ────┤
                        └── Condition2: "Knockout"
Your experimental design can be more complex. To get a better understanding of the concept behind the condition-tree, see The Condition-Tree.
What kind of analysis do I want to run? Which processing steps and tools do I need?
You probably have an idea of the workflow you want to create. You should write down which processing steps you want to run. MONSDA already offers a bunch of processing tools. Jump to Workflow Overview to get an overview of which tools you can use for which workflow-steps of your workflow.
Which settings should the tools run with?
Depending on your data and your specific analysis, most of the tools must be assigned specific settings or options. MONSDA does not try to guess which settings might work best for you. Of course it is possible to adjust the settings afterwards, but this requires deleting generated results and is more tedious than thinking about settings beforehand. MONSDA does not recommend settings for tools and will run all tools with their default settings. The only exception to that rule is a “paired” or “stranded” setting, which will automatically be derived from the config and resolved for each tool, where applicable.
It can be helpful to test your settings by hand before you integrate them in the MONSDA configuration, however, you can always re-run MONSDA after cleaning up results and changing the configuration. To understand better how MONSDA understands tool options have a look at the config-file.
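For example, a hypothetical manual test of the cutadapt trimming options used later in this tutorial could look like this (file names are placeholders and cutadapt must be available in your active environment):
cutadapt -m 8 -e 0.15 -o test_R1.fastq.gz -p test_R2.fastq.gz Sample_1_R1.fastq.gz Sample_1_R2.fastq.gz   # paired-end trimming test with the options from the config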
STEP 2: Keep the overview
You may want to do a standard analysis. Most likely you will need a genome FASTA file and annotation files in addition to your samples. Maybe there are additional files your workflow needs.
So before you start the Configurator and set up the MONSDA project, make sure you have organized your data well and know where to find things. The Configurator will create a project folder with softlinks to your data; changing locations later means moving your data again, which is unnecessary extra work.
If you eventually share your workflow and configuration, we also assume that you share all input data including genome and annotation files, at least as links to the original download location, so it makes sense to keep that information.
STEP 3: Hardware issues
MONSDA can be run on different platforms and allows control of multithreading. If you want to run it on SLURM, have a look at Run on Slurm. Always make sure you know how many threads you can run on your machine and keep an eye on memory consumption. For some workflow steps like mapping, memory can be a bottleneck, so best check your SLURM profile and set caps on memory usage.
Last but not least, the processing results will be saved in your project folder. This can take up a lot of disk space, so make sure you have enough free space.
More details on input and output formats and files can be found in the Workflow Overview.
Configure MONSDA
Setting up an initial project folder
Creating the configuration file
Modifying existing configuration files easily
Run the Configurator
To run the Configurator, the MONSDA conda environment must be installed and activated (we assume you installed via conda in an environment named ‘monsda’).
conda activate monsda
Run the Configurator with
monsda_configure
To see further options, run the Configurator with the --help flag
monsda_configure --help
Main Functions
Create a new project or config-file
SELECT WORKFLOWS ⊳ MAKE CONDITION-TREE ⊳ ASSIGN SAMPLES ⊳ SET WORKFLOWS ⊳ SET THREADS |
To create a new project or a new config-file, the Configurator will take you through all necessary steps. You will have to choose a working directory. Note that your project folder will grow with your results, so choose a place with enough space.
Modify an existing config-file
1. Add workflows
SELECT CONFIG ⊳ SELECT ADDITIONAL WORKFLOWS ⊳ SET WORKFLOWS |
Select workflows not activated in an existing config-file. The Configurator will expand each condition. Afterwards you have to define settings for the new workflows.
2. Remove workflows
SELECT CONFIG ⊳ SELECT REMOVABLE WORKFLOWS |
The Configurator will show you all established workflows. After selecting the ones to be removed it will delete them from the config-file for each condition.
3. Add conditions
SELECT CONFIG ⊳ MAKE CONDITION-TREE ⊳ ASSIGN SAMPLES ⊳ SET WORKFLOWS |
You can add conditions in a similar way to how you created the condition-tree. Note, however, that you can’t add sub-conditions to existing leaves. The Configurator will expand the condition-tree for the settings-block and each workflow. Because the config-file now contains new option fields, the Configurator will ask whether to copy existing workflow settings or to create new ones.
4. Remove conditions
SELECT CONFIG ⊳ SELECT REMOVABLE CONDITIONS |
The Configurator will offer you all conditions the condition-tree represents. After selecting one or several to be removed, it will delete them from the settings-block and from each workflow.
Interrupt Configuration
It can happen that the Configurator asks for entries you haven’t thought about yet. In this case you can interrupt the configuration and the Configurator will cache your entries. A temporary backup file called unfinished_config.pkl is created for that.
In most cases you can even just abort the script, but to guarantee a clean re-entry you should type
exit
When you start the Configurator again later and it finds the unfinished_config.pkl in the current directory, it will provide the option to continue the session.
Note that the unfinished_config.pkl will always be overwritten. To avoid this, you can rename the file. You can then continue with the --session flag. Run the Configurator like this:
monsda_configure -s my_renamed_unfinished_config.pkl
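Put together, resuming from a renamed session file might look like this (the file name is just the example from above):
mv unfinished_config.pkl my_renamed_unfinished_config.pkl   # keep the cached session under a new name
monsda_configure -s my_renamed_unfinished_config.pkl        # continue the interrupted configuration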
Assistance in detail
Create Condition-Tree
============================================================
{
"NewExperiment": {
"wild-type": {
"day1": {},
"day2": {}
},
"knockout": {
"day1": {},
"day2": {} <=(add sub-conditions here)
}
}
}
============================================================
MONSDA understands your experimental design by creating a condition-tree. The Configurator helps you to create it. To do this, the Configurator points to a condition in which you are allowed to add further sub-conditions. In this way you can create a nested condition-tree. Note that each leaf of this tree represents a separate condition. Later you will define the workflow settings for each of these conditions.
Sample Assignment:
============================================================
{
"NewExperiment": {
"wild-type": {
"day1": {
"SAMPLES": [
"Sample_1",
"Sample_2"
]
},
"day2": {} <-
},
"knockout": {
"day1": {},
"day2": {}
}
}
}
============================================================
1 > Sample_1 in path/to/knockout/samples
2 > Sample_2 in path/to/knockout/samples
3 > Sample_3 in path/to/knockout/samples
4 > Sample_4 in path/to/knockout/samples
5 > Sample_a in path/to/wild-type/samples
6 > Sample_b in path/to/wild-type/samples
7 > Sample_c in path/to/wild-type/samples
8 > Sample_d in path/to/wild-type/samples
The Configurator helps you to assign samples to conditions. If you have activated the FETCH workflow, it will ask you for SRA Accession Numbers. Otherwise you have to add the paths where your samples are stored. The Configurator finds every file with a “.fastq.gz” ending and presents it for assignment. At the same time, the condition-tree is displayed with an arrow indicating the condition to which samples are assigned.
Make Settings for Conditions
============================================================
{
"NewExperiment": {
"wild-type": {
"day1": {}, <- 1
"day2": {}, <- 1
"day3": {} <- 1
},
"knockout": {
"day1": {}, <- 2
"day2": {}, <- 2
"day3": {} <- 2
}
}
}
============================================================
MONSDA can run the same workflow with different settings, differentiated by conditions. Therefore the config-file needs workflow settings for each condition you created. However, you will often use the same settings. To avoid repetition during config creation, the Configurator offers to set several conditions at once. In the example shown above, you would go through two setting loops: all sub-conditions of “wild-type” and all sub-conditions of “knockout” are each assigned the same settings. To change which conditions are set simultaneously, you can loop through the possible selections by pressing enter.
Start a pipeline run
Snakemake
Activate the MONSDA conda environment and run
monsda --help
to see the help and available options that will be passed through to Snakemake.
To start a job with Snakemake, which is the default, run
monsda -j NUMBER_OF_CORES -c YOUR_CONFIG.json --directory ${PWD}
or add additional arguments for Snakemake as you see fit. We highly recommend setting mamba as the conda frontend and defining a fixed directory to store environments (here conda_envs):
--conda-frontend mamba --conda-prefix path_to_conda_envs
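Put together with the call above, a typical invocation would then be (core count and config file are placeholders):
monsda -j NUMBER_OF_CORES -c YOUR_CONFIG.json --directory ${PWD} --conda-frontend mamba --conda-prefix path_to_conda_envs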
Nextflow
To run MONSDA in Nextflow mode just add ‘--Nextflow’
monsda --Nextflow -j NUMBER_OF_CORES -c YOUR_CONFIG.json --directory ${PWD}
As with Snakemake additional arguments for Nextflow can be added and will be passed through.
Run on Slurm
Snakemake
You can either use the SLURM profile adapted from Snakemake-Profiles that comes with this repository, or create one yourself, for example following the cookiecutter example in the Snakemake-Profiles repository. To use the shipped profile, adapt it to your system and execute
monsda -j ${cpus} --configfile ${config.json} --directory ${PWD} --profile ${path_to_slurm_profile}
Further adaptions like grouping of jobs and advanced configs for rule based performance increase will be tackled in future releases of MONSDA.
Nextflow
Cluster config for Nextflow follows the description Nextflow-Executors and Nextflow-Profiles. To use SLURM as executor you can adapt the profile that comes with this repository and simply append
export NXF_EXECUTOR=slurm
to the call to MONSDA.
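For example, a Nextflow-mode run on SLURM could then be started like this (core count and config file are placeholders; the profile adaptions are assumed to be in place):
export NXF_EXECUTOR=slurm
monsda --Nextflow -j NUMBER_OF_CORES -c YOUR_CONFIG.json --directory ${PWD}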
TUTORIAL
Here we will show three MONSDA runs, from simple to complex, explaining the configuration and how to run them in real life.
First please create a working directory and access it, then install MONSDA following Installation and activate the conda environment.
All examples are based on data from SRA, which you can either download beforehand or fetch via the “FETCH” workflow as demonstrated in the tutorial.
Genome and Annotation files need to be downloaded beforehand and can be found here: GENOME, GFF, GTF
Ideally you download those files into a sub-directory of your workspace named “GENOMES/Ecoli” and rename them to something simpler; in the tutorial we will refer to “GCF_000005845.2_ASM584v2_genomic” as “ecoli”. You also need to change the “.fna.gz” suffix of the genome file to “.fa.gz” and we should be ready to go.
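A minimal sketch of that setup, assuming the three files have already been downloaded into the working directory (the exact names of the downloaded annotation files may differ):
mkdir -p GENOMES/Ecoli
# genome: note the suffix change from .fna.gz to .fa.gz
mv GCF_000005845.2_ASM584v2_genomic.fna.gz GENOMES/Ecoli/ecoli.fa.gz
mv GCF_000005845.2_ASM584v2_genomic.gff.gz GENOMES/Ecoli/ecoli.gff.gz
mv GCF_000005845.2_ASM584v2_genomic.gtf.gz GENOMES/Ecoli/ecoli.gtf.gz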
All config files this tutorial is based on can be found in envs/monsda/share/MONSDA/configs of your local conda installation or the “configs” directory of the MONSDA repository.
A simple run
This run is based on a simple mapping only use-case. We have a single paired-end FASTQ file and want to map that. In this case we have no information about strandedness and ignore this setting.
The corresponding config file looks as follows:
{
"WORKFLOWS": "FETCH,MAPPING",
"BINS": "",
"MAXTHREADS": "4",
"VERSION": "1.0.0",
"SETTINGS": {
"SIMPLE": {
"SAMPLES": [
"SRR16324019"
],
"SEQUENCING": "paired",
"REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
"ANNOTATION": {
"GFF": "GENOMES/Ecoli/ecoli.gff.gz",
"GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
}
}
},
"FETCH": {
"TOOLS" :
{
"sra" : "fasterq-dump"
},
"SIMPLE": {
"sra": {
"OPTIONS":
{
"DOWNLOAD": ""
}
}
}
},
"MAPPING": {
"TOOLS": {
"star": "STAR"
},
"SIMPLE": {
"star": {
"OPTIONS": {
"INDEX": "--genomeSAindexNbases 8",
"MAP": "--outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20",
"EXTENSION": ""
}
}
}
}
}
This tells MONSDA version “1.0.0” to run the workflows “FETCH” and “MAPPING” on the sample “SRR16324019”.
“FETCH” will activate the environment “sra” and run the binary “fasterq-dump” with no extra options to fetch the sample from SRA.
“MAPPING” will then run the executable “STAR” in environment “star” on the downloaded FASTQ file. First it will check for existence of the needed INDEX file, and as no INDEX file is defined in the config it will try to generate one. For that it will use the REFERENCE “GENOMES/Ecoli/ecoli.fa.gz” as well as the ANNOTATION GTF “GENOMES/Ecoli/ecoli.gtf.gz”. GTF is always selected first and GFF is used as a fallback if no GTF is found. After successfully building the STAR index, MONSDA will start mapping with STAR in “paired” mode and applying a set of options as can be seen under the “MAP” key in section “MAPPING”->”SIMPLE”->”star”->”OPTIONS”. We do not define a pre- or suffix for output files, so “EXTENSION” can be skipped or left blank.
Starting the run with 4 cores (requesting more would be capped, as MAXTHREADS is set to 4)
monsda -j 4 -c ${CONDA_PREFIX}/share/MONSDA/configs/tutorial_quick.json --directory ${PWD}
Will start the run in the current directory and generate a “FASTQ” sub-directory containing the downloaded sample, a “GENOME/Ecoli/INDICES” directory containing the built index and a “MAPPING” directory containing the mapped files. Furthermore, MONSDA will create a “LOG” directory containing its own log, as well as logs of all executed jobs, and a “JOBS” directory containing the ‘monsda.commands’ file, which contains all command-line calls that MONSDA runs. A successful run will show the message ‘Workflow finished, no error’ at the end.
A more complex run
This slightly more complex use case involves multiple input files, two conditions (WT/KO) and a more or less standard DE analysis workflow. We also include a “dummylevel” that is a placeholder for settings or other subdivisions of the WT level, to demonstrate that MONSDA can work on condition-trees of differing depth.
Workflows include:
FETCH: Download from SRA
QC: FASTQC of input and output
TRIMMING: Adaptor removal with cutadapt/trimgalore
MAPPING: Read mapping with STAR, hisat2, bwa, segemehl3 and minimap2
DEDUP: Read deduplication with umi_tools and picard
DE: Differential Expression analysis with EdgeR and DESeq2
The more complex config for this analysis follows
{
"WORKFLOWS": "FETCH, QC, TRIMMING, MAPPING, DEDUP, DE",
"VERSION": "1.0.0",
"BINS": "",
"MAXTHREADS": "16",
"SETTINGS": {
"Ecoli": {
"WT": {
"dummylevel": {
"SAMPLES": [
"SRR16324019",
"SRR16324018",
"SRR16324017"
],
"GROUPS": [
"ctrl",
"ctrl",
"ctrl"
],
"SEQUENCING": "paired",
"REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
"ANNOTATION": {
"GFF": "GENOMES/Ecoli/ecoli.gff.gz",
"GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
}
}
},
"KO": {
"SAMPLES": [
"SRR16324016",
"SRR16324015",
"SRR16324014"
],
"GROUPS": [
"ko",
"ko",
"ko"
],
"SEQUENCING": "paired",
"REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
"ANNOTATION": {
"GFF": "GENOMES/Ecoli/ecoli.gff.gz",
"GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
}
}
}
},
"FETCH": {
"TOOLS" :
{
"sra" : "fasterq-dump"
},
"Ecoli": {
"WT": {
"dummylevel": {
"sra": {
"OPTIONS":
{
"DOWNLOAD": ""
}
}
}
},
"KO": {
"sra": {
"OPTIONS":
{
"DOWNLOAD": ""
}
}
}
}
},
"QC": {
"TOOLS" :
{
"fastqc" : "fastqc"
},
"Ecoli": {
"KO": {
"fastqc": {
"OPTIONS":
{
"QC": "", #QC options here if any, KO is not required, will be resolved by rules
"MULTI": "" #MultiQC options here if any, KO is not required, will be resolved by rules
}
}
},
"WT": {
"dummylevel": {
"fastqc": {
"OPTIONS":
{
"QC": "", #QC options here if any, KO is not required, will be resolved by rules
"MULTI": "" #MultiQC options here if any, KO is not required, will be resolved by rules
}
}
}
}
}
},
"TRIMMING": {
"TOOLS" :
{
"trimgalore": "trim_galore",
"cutadapt": "cutadapt"
},
"Ecoli": {
"KO": {
"trimgalore":{
"OPTIONS":
{
"TRIM": "-q 15 --length 8 -e 0.15" # trimming options here, --KO is not required, will be resolved by rules
}
},
"cutadapt":{
"OPTIONS":
{
"TRIM": "-m 8 -e 0.15" # trimming options here, --KO is not required, will be resolved by rules
}
}
},
"WT": {
"dummylevel": {
"trimgalore":{
"OPTIONS":
{
"TRIM": "-q 15 --length 8 -e 0.15" # trimming options here, --KO is not required, will be resolved by rules
}
},
"cutadapt":{
"OPTIONS":
{
"TRIM": "-m 8 -e 0.15" # trimming options here, --KO is not required, will be resolved by rules
}
}
}
}
}
},
"DEDUP": {
"TOOLS": {
"umitools": "umi_tools",
"picard": "picard"
},
"Ecoli": {
"KO":{
"umitools":{
"OPTIONS":
{
"EXTRACT": "--extract-method string --bc-pattern AGANNNNACGT",
"DEDUP": "--extract-umi-method read_id" # umitools dedup options
}
},
"picard":{
"OPTIONS":
{
"JAVA": "",
"DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
}
}
},
"WT":{
"dummylevel": {
"umitools": {
"OPTIONS":
{
"EXTRACT": "--extract-method string --bc-pattern AGANNNNACGT",
"DEDUP": "--extract-umi-method read_id" # umitools dedup options
}
},
"picard": {
"OPTIONS":
{
"JAVA": "",
"DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
}
}
}
}
}
},
"MAPPING": {
"TOOLS": {
"star": "STAR",
"segemehl3": "segemehl.x",
"hisat2": "hisat2",
"bwa": "bwa mem",
"minimap": "minimap2"
},
"Ecoli": {
"WT": {
"dummylevel": {
"star": {
"OPTIONS": {
"INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 13",
"MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
}
},
"segemehl3": {
"OPTIONS": {
"INDEX": "",
"MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
}
},
"hisat2": {
"OPTIONS":
{
"INDEX": "",
"MAP": ""
}
},
"bwa": {
"OPTIONS":
{
"INDEX": "-a bwtsw",
"MAP": ""
}
},
"minimap": {
"OPTIONS":
{
"INDEX": "-k 15",
"MAP": "-ax sr -ub -Y -L --MD"
}
}
}
},
"KO": {
"star": {
"OPTIONS":
{
"INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 13",
"MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
}
},
"segemehl3": {
"OPTIONS": {
"INDEX": "",
"MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
}
},
"hisat2": {
"OPTIONS":
{
"INDEX": ""
}
},
"bwa": {
"OPTIONS":
{
"INDEX": "-a bwtsw",
"MAP": ""
}
},
"minimap": {
"OPTIONS":
{
"INDEX": "-k 15",
"MAP": "-ax sr -ub -Y -L --MD"
}
}
}
}
},
"DE": {
"TOOLS" :
{
"deseq2" : "Analysis/DE/DESEQ2.R",
"edger" : "Analysis/DE/EDGER.R"
},
"COMPARABLE" :
{
},
"EXCLUDE": [
],
"Ecoli": {
"WT": {
"dummylevel": {
"deseq2": {
"OPTIONS":
{
"COUNT": "-O", # Options for read counting, independent of COUNTS workflows
"DE": "" # Options for DE Analysis
}
},
"edger": {
"OPTIONS":
{
"COUNT": "-O", # Options for read counting, independent of COUNTS workflows
"DE": "" # Options for DE Analysis
}
}
}
},
"KO": {
"deseq2": {
"OPTIONS":
{
"COUNT": "-O", # Options for read counting, independent of COUNTS workflows
"DE": "" # Options for DE Analysis
}
},
"edger": {
"OPTIONS":
{
"COUNT": "-O", # Options for read counting, independent of COUNTS workflows
"DE": "" # Options for DE Analysis
}
}
}
}
}
}
Note that "SETTINGS" now also contains "GROUPS", which holds identifiers for the DE stage and can be used to define "COMPARABLES" that tell MONSDA which samples should be treated as replicates and compared to which other samples. By default this will make an all-vs-all comparison, in our simple case ctrl-vs-ko. Keeping this "GROUPS" setting separate from the condition-tree makes it possible to mix samples from different conditions for all pre- and processing steps and to separate them again for postprocessing without further struggle. This means that in this simple case we could have put all samples under one condition, as all other tool options stay the same and reference and annotation do not differ, and we would still be able to split by "GROUPS" for the DE step. However, sometimes we need to compare data that has to be processed differently, e.g. mixing paired and single-ended reads, so we can apply different settings as needed for all workflow steps by simply splitting our condition-tree accordingly.
Starting the run with 12 cores (requesting more than 16 would be capped, as MAXTHREADS is set to 16)
monsda -j 12 -c ${CONDA_PREFIX}/share/MONSDA/configs/tutorial_de.json --directory ${PWD}
Will start the run in the current directory and generate a “FASTQ” sub-directory containing the downloaded samples, a “GENOME/INDICES” directory containing the built index, a “QC” directory containing all FASTQC reports and MULTIQC output, a “TRIMMED_FASTQ” directory for trimgalore and cutadapt output, a “DEDUP” directory for umi_tools (runs before trimming and after mapping) and picard (runs after mapping) output and a “MAPPING” directory containing the mapped files. Furthermore, a “DE” directory will be created which will hold output from counting with featurecounts and DE input and output from EDGER and DESeq2. Again, MONSDA will create a “LOG” directory containing its own log, as well as logs of all executed jobs, and the “JOBS” directory with all command-line calls.
A successful run will show the message ‘Workflow finished, no error’ at the end.
Run it all
This exhaustive use case involves multiple input files, two conditions (WT/KO) and almost all workflows available. We also include a “dummylevel” that is a placeholder for settings or other subdivisions of the WT level, to demonstrate that MONSDA can work on condition-trees of differing depth.
Workflows include:
FETCH: Download from SRA
QC: FASTQC of input and output
TRIMMING: Adaptor removal with cutadapt/trimgalore
MAPPING: Read mapping with STAR, hisat2, bwa, segemehl3 and minimap2
DEDUP: Read deduplication with umi_tools and picard
DE: Differential Expression analysis with EdgeR and DESeq2
DEU: Differential Exon Usage analysis with EdgeR and DEXSeq
DAS: Differential Alternative Splicing analysis with EdgeR and DIEGO
DTU: Differential Transcript Usage analysis with DRIMSeq and DEXSeq
COUNTING: Read counting with FeatureCounts or quantification with Salmon
TRACKS: Generation of tracks for UCSC or other genome browsers
PEAKS: Analysis of ChIP-Seq or CLIP-Seq or cyPhyRNA-Seq Peaks
The more complex config for this analysis follows
You will see here that for DTU workflows we will need an additional FASTA and GTF file covering transcripts. As our sample data is from E.coli, transcripts and genes can be seen as synonymous, so we can easily create the needed files from our already downloaded files.
Creating a transcript gtf file from the downloaded gtf by subsetting for gene features:
zcat ecoli.gtf.gz |head -n+3 > bla; zcat ecoli.gtf.gz|grep -P '\tgene\t' >> bla;cat bla|gzip > Ecoli_trans.gtf.gz;rm -f bla
We now fix the empty transcript-id tag by replacing it with the gene-id followed by “_1”:
zcat Ecoli_trans.gtf.gz |perl -wlane 'BEGIN{%ids}{if($_=~/^#/){print}else{$n=$F[9];$ids{$n}++;$F[11]=substr($n,0,-2)."_".$ids{$n}."\";";print join("\t",@F[0..7],join(" ",@F[8..$#F]))}}'|perl -F'\t' -wlane 'print $_;if($_ !~/^#/){print join("\t",@F[0..1])."\ttranscript\t".join("\t",@F[3..$#F])}'|gzip > Ecoli_trans_fix.gtf.gz
From this GTF we can now create a FASTA file by writing a helper BED file and using BEDtools to extract sequences from our genome FASTA file in a strand-specific manner. We then only need to clean up the name tag as follows:
zcat Ecoli_trans_fix.gtf.gz|grep -P '\ttranscript\t'|perl -wlane 'next if($_=~/^#/);($name=(split(/;/,$F[11]))[0])=~s/\"//g;print join("\t",$F[0],$F[3]-1,$F[4],$name,100,$F[6])' > Ecoli_trans.bed
bedtools getfasta -fi ecoli.fa -bed Ecoli_trans.bed -fo Ecoli_trans.fa -nameOnly -s
perl -wlane 'if($_=~/^>/){$_=~s/\(|\+|\-|\)//g;}print' Ecoli_trans.fa|gzip > Ecoli_trans.fa.gz
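As an optional sanity check (not part of the original recipe), the number of extracted sequences should match the number of BED records:
zcat Ecoli_trans.fa.gz | grep -c '^>'   # number of FASTA entries
wc -l < Ecoli_trans.bed                 # number of BED records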
With these “Ecoli_trans.fa.gz” and “Ecoli_trans_fix.gtf.gz” files we can now also run the DTU workflow. The config file for the exhaustive run looks as follows:
{
"WORKFLOWS": "FETCH, QC, TRIMMING, MAPPING, DEDUP, TRACKS, PEAKS, DE, DEU, DAS, DTU, COUNTING",
"VERSION": "1.0.0",
"BINS": "",
"MAXTHREADS": "16",
"SETTINGS": {
"Ecoli": {
"WT": {
"dummylevel": {
"SAMPLES": [
"SRR16324019",
"SRR16324018",
"SRR16324017"
],
"GROUPS": [
"ctrl",
"ctrl",
"ctrl"
],
"SEQUENCING": "paired",
"REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
"ANNOTATION": {
"GFF": "GENOMES/Ecoli/ecoli.gff.gz",
"GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
}
}
},
"KO": {
"SAMPLES": [
"SRR16324016",
"SRR16324015",
"SRR16324014"
],
"GROUPS": [
"ko",
"ko",
"ko"
],
"SEQUENCING": "paired",
"REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
"ANNOTATION": {
"GFF": "GENOMES/Ecoli/ecoli.gff.gz",
"GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
}
}
}
},
"FETCH": {
"TOOLS" :
{
"sra" : "fasterq-dump"
},
"Ecoli": {
"WT": {
"dummylevel": {
"sra": {
"OPTIONS":
{
"DOWNLOAD": ""
}
}
}
},
"KO": {
"sra": {
"OPTIONS":
{
"DOWNLOAD": ""
}
}
}
}
},
"QC": {
"TOOLS" :
{
"fastqc" : "fastqc"
},
"Ecoli": {
"KO": {
"fastqc": {
"OPTIONS":
{
"QC": "", #QC options here if any, KO is not required, will be resolved by rules
"MULTI": "" #MultiQC options here if any, KO is not required, will be resolved by rules
}
}
},
"WT": {
"dummylevel": {
"fastqc": {
"OPTIONS":
{
"QC": "", #QC options here if any, KO is not required, will be resolved by rules
"MULTI": "" #MultiQC options here if any, KO is not required, will be resolved by rules
}
}
}
}
}
},
"TRIMMING": {
"TOOLS" :
{
"trimgalore": "trim_galore",
"cutadapt": "cutadapt"
},
"Ecoli": {
"KO": {
"trimgalore":{
"OPTIONS":
{
"TRIM": "-q 15 --length 8 -e 0.15" # trimming options here, --KO is not required, will be resolved by rules
}
},
"cutadapt":{
"OPTIONS":
{
"TRIM": "-m 8 -e 0.15" # trimming options here, --KO is not required, will be resolved by rules
}
}
},
"WT": {
"dummylevel": {
"trimgalore":{
"OPTIONS":
{
"TRIM": "-q 15 --length 8 -e 0.15" # trimming options here, --KO is not required, will be resolved by rules
}
},
"cutadapt":{
"OPTIONS":
{
"TRIM": "-m 8 -e 0.15" # trimming options here, --KO is not required, will be resolved by rules
}
}
}
}
}
},
"DEDUP": {
"TOOLS": {
"umitools": "umi_tools",
"picard": "picard"
},
"Ecoli": {
"KO":{
"umitools":{
"OPTIONS":
{
"EXTRACT": "--extract-method string --bc-pattern AGANNNNACGT",
"DEDUP": "--extract-umi-method read_id" # umitools dedup options
}
},
"picard":{
"OPTIONS":
{
"JAVA": "",
"DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
}
}
},
"WT":{
"dummylevel": {
"umitools": {
"OPTIONS":
{
"EXTRACT": "--extract-method string --bc-pattern AGANNNNACGT",
"DEDUP": "--extract-umi-method read_id" # umitools dedup options
}
},
"picard": {
"OPTIONS":
{
"JAVA": "",
"DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
}
}
}
}
}
},
"MAPPING": {
"TOOLS": {
"star": "STAR",
"segemehl3": "segemehl.x",
"hisat2": "hisat2",
"bwa": "bwa mem",
"minimap": "minimap2"
},
"Ecoli": {
"WT": {
"dummylevel": {
"star": {
"OPTIONS": {
"INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 8",
"MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
}
},
"segemehl3": {
"OPTIONS": {
"INDEX": "",
"MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
}
},
"hisat2": {
"OPTIONS":
{
"INDEX": ""
}
},
"bwa": {
"OPTIONS":
{
"INDEX": "-a bwtsw",
"MAP": ""
}
},
"minimap": {
"OPTIONS":
{
"INDEX": "-k 15",
"MAP": "-ax sr -ub -Y -L --MD"
}
}
}
},
"KO": {
"star": {
"OPTIONS":
{
"INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 8",
"MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
}
},
"segemehl3": {
"OPTIONS": {
"INDEX": "",
"MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
}
},
"hisat2": {
"OPTIONS":
{
"INDEX": ""
}
},
"bwa": {
"OPTIONS":
{
"INDEX": "-a bwtsw",
"MAP": ""
}
},
"minimap": {
"OPTIONS":
{
"INDEX": "-k 15",
"MAP": "-ax sr -ub -Y -L --MD"
}
}
}
}
},
"COUNTING": {
"FEATURES": {
"CDS": "ID"
},
"TOOLS": {
"countreads": "featureCounts"
},
"Ecoli": {
"WT": {
"dummylevel": {
"countreads": {
"OPTIONS":
{
"COUNT" : "-f --fraction -O -M"
}
}
}
},
"KO": {
"countreads": {
"OPTIONS":
{
"COUNT" : "-f --fraction -O -M"
}
}
}
}
},
"TRACKS": {
"TOOLS": {
"ucsc": "ucsc"
},
"Ecoli": {
"WT": {
"dummylevel": {
"ucsc": {
"OPTIONS":
{
"UCSC": "-n Eco_compare_Mapping -s dm6_st -l UCSC_DM6_compare_Mapping -b UCSC_dm6_star1"
}
}
}
},
"KO": {
"ucsc": {
"OPTIONS":
{
"UCSC": "-n Eco_paired_Mapping -s dm6_p -l UCSC_DM6_paired_Mapping -b UCSC_dm6_pstar1"
}
}
}
}
},
#DE Analysis
"DE": {
"TOOLS" :
{
"deseq2" : "Analysis/DE/DESEQ2.R",
"edger" : "Analysis/DE/EDGER.R"
},
"COMPARABLE" :
{
},
"EXCLUDE": [
],
"Ecoli": {
"WT": {
"dummylevel": {
"deseq2": {
"OPTIONS":
{
"COUNT": "-O", # Options for read counting, independent of COUNTS workflows
"DE": "" # Options for DE Analysis
}
},
"edger": {
"OPTIONS":
{
"COUNT": "-O", # Options for read counting, independent of COUNTS workflows
"DE": "" # Options for DE Analysis
}
}
}
},
"KO": {
"deseq2": {
"OPTIONS":
{
"COUNT": "-O", # Options for read counting, independent of COUNTS workflows
"DE": "" # Options for DE Analysis
}
},
"edger": {
"OPTIONS":
{
"COUNT": "-O", # Options for read counting, independent of COUNTS workflows
"DE": "" # Options for DE Analysis
}
}
}
}
},
#DEU Analysis options
"DEU": {
"TOOLS" :
{
"dexseq" : "Analysis/DEU/DEXSEQ.R",
"edger" : "Analysis/DEU/EDGER.R"
},
"COMPARABLE" :
{
},
"EXCLUDE": [
],
"Ecoli": {
"WT": {
"dummylevel": {
"dexseq": {
"OPTIONS":
{
"COUNT": "-f -O", # Options for read counting, independent of COUNTS workflows
"DEU": "" # Options for DEU Analysis
}
},
"edger": {
"OPTIONS":
{
"COUNT": "-f -O", # Options for read counting, independent of COUNTS workflows
"DEU": "" # Options for DEU Analysis
}
}
}
},
"KO": {
"dexseq": {
"OPTIONS":
{
"COUNT": "-f -O", # Options for read counting, independent of COUNTS workflows
"DEU": "" # Options for DEU Analysis
}
},
"edger": {
"OPTIONS":
{
"COUNT": "-f -O", # Options for read counting, independent of COUNTS workflows
"DEU": "" # Options for DEU Analysis
}
}
}
}
},
#DAS Analysis options
"DAS": {
"TOOLS" :
{
"diego" : "diego.py",
"edger" : "Analysis/DAS/EDGER.R"
},
"COMPARABLE" :
{
},
"EXCLUDE": [
],
"Ecoli": {
"WT": {
"dummylevel": {
"diego": {
"OPTIONS":
{
"COUNT": "-f -O", # Options for read counting, independent of COUNTS workflows
"DAS": "-c 10 -d 3 -q 0.01 -z 1.0 -r" # Options for DAS Analysis
}
},
"edger": {
"OPTIONS":
{
"COUNT": "-f -O", # Options for read counting, independent of COUNTS workflows
"DAS": "" # Options for DAS Analysis
}
}
}
},
"KO": {
"diego": {
"OPTIONS":
{
"COUNT": "-f -O", # Options for read counting, independent of COUNTS workflows
"DAS": "-c 10 -d 3 -q 0.01 -z 1.0 -r" # Options for DAS Analysis
}
},
"edger": {
"OPTIONS":
{
"COUNT": "-f -O", # Options for read counting, independent of COUNTS workflows
"DAS": "" # Options for DAS Analysis
}
}
}
}
},
"DTU": {
"TOOLS" :
{
"dexseq" : "Analysis/DTU/DEXSEQ.R",
"drimseq" : "Analysis/DTU/DRIMSEQ.R"
},
"COMPARABLE" :
{
},
"EXCLUDE": [
],
"Ecoli": {
"WT": {
"dummylevel": {
"dexseq": {
"REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
"ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
"OPTIONS":
{
"INDEX": "",
"QUANT": "", # Options for read counting, independent of COUNTS workflows
"DTU": "" # Options for DE Analysis
}
},
"drimseq": {
"REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
"ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
"OPTIONS":
{
"INDEX": "",
"QUANT": "", # Options for read counting, independent of COUNTS workflows
"DTU": "" # Options for DE Analysis
}
}
}
},
"KO": {
"dexseq": {
"REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
"ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
"OPTIONS":
{
"INDEX": "",
"QUANT": "", # Options for read counting, independent of COUNTS workflows
"DTU": "" # Options for DE Analysis
}
},
"drimseq": {
"REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
"ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
"OPTIONS":
{
"INDEX": "", # Options for read counting, independent of COUNTS workflows
"QUANT": "",
"DTU": "" # Options for DE Analysis
}
}
}
}
},
#PEAKS Analysis options
"PEAKS": {
"TOOLS" :
{
"macs" : "macs2",
"peaks" : "peaks",
"scyphy" : "Piranha"
},
"COMPARABLE" :
{
"SRR16324019": "SRR16324018",
"SRR16324014": "SRR16324015"
},
"Ecoli": {
"WT": {
"dummylevel": {
"macs": {
"OPTIONS":
{
"FINDPEAKS": "--nomodel --extsize 28"
}
},
"peaks":{
"OPTIONS":
{
"PREPROCESS": "",
"FINDPEAKS": "-l 0.6 -t 1 -w 5 -r 0.8 -c 3 -a 3"
}
},
"scyphy":{
"OPTIONS":
{
"PREPROCESS": "",
"FINDPEAKS": "-b 100"
}
}
}
},
"KO": {
"macs": {
"OPTIONS":
{
"FINDPEAKS": "--nomodel --extsize 28"
}
},
"peaks":{
"OPTIONS":
{
"PREPROCESS": "",
"FINDPEAKS": "-l 0.6 -t 1 -w 5 -r 0.8 -c 3 -a 3"
}
},
"scyphy":{
"OPTIONS":
{
"PREPROCESS": "",
"FINDPEAKS": "-b 100"
}
}
}
}
}
}
As for the more complex run, we can see that "SETTINGS" contains "GROUPS", which holds identifiers for the DE/DEU/DAS/DTU stages that will be used to define "COMPARABLES" that tell MONSDA which samples should be treated as replicates and compared to which other samples. By default this will make an all-vs-all comparison, in our case ctrl-vs-ko. Keeping this "GROUPS" setting separate from the condition-tree makes it possible to mix samples from different conditions for all pre- and processing steps and to separate them again for postprocessing without further struggle. This means that in this simple case we could have put all samples under one condition, as all other tool options stay the same and reference and annotation do not differ, and we would still be able to split by "GROUPS" for the DE/DEU/DAS steps. However, sometimes we need to compare data that has to be processed differently, e.g. mixing paired and single-ended reads, so we can apply different settings as needed for all workflow steps by simply splitting our condition-tree accordingly.
Somewhat special is the DTU step. Here we use the pseudo-mapper salmon to quantify the expression of transcripts. These transcripts were defined in the preprocessing part of this tutorial chapter. As we can see, the REFERENCE and ANNOTATION differ from what is listed in SETTINGS, so we make use of MONSDA's ability to define those settings on a per-workflow basis. Such per-tool settings ALWAYS overrule the settings made in SETTINGS. So we simply define the REFERENCE and ANNOTATION keys for each tool within the regular tool-settings part of the config, and this makes sure that these files are used instead of the globally defined ones.
Starting the run with 12 cores (requesting more than 16 would be capped, as MAXTHREADS is set to 16)
monsda -j 12 -c ${CONDA_PREFIX}/share/MONSDA/configs/tutorial_exhaustive.json --directory ${PWD}
Will start the run in the current directory and generate a “FASTQ” sub-directory containing the downloaded samples, a “GENOME/Ecoli/INDICES” directory containing the built indices, including the one built for salmon later on, a “QC” directory containing all FASTQC reports and MULTIQC output, a “TRIMMED_FASTQ” directory for trimgalore and cutadapt output, a “DEDUP” directory for umi_tools (runs before trimming and after mapping) and picard (runs after mapping) output and a “MAPPING” directory containing the mapped files. Furthermore, “DE/DEU/DAS” directories will be created which will hold output from counting with featurecounts and DE/DEU/DAS input and output from EDGER, DESeq2, DEXSeq and DIEGO respectively. Again, MONSDA will create a “LOG” directory containing its own log, as well as logs of all executed jobs, and again a “JOBS” directory for command-line calls.
A successful run will show the message ‘Workflow finished, no error’. Be aware that this is indeed an exhaustive workflow and will require a decent amount of disk-space, memory and compute-time, depending on the hardware at your disposal.
Available Workflows
FETCH -> Download from SRA
BASECALL -> Call bases from FAST5
QC -> Quality Control of FASTQ/BAM files
TRIMMING -> Trim adaptor remnants
MAPPING -> Map reads to reference
DEDUP -> Deduplicate reads
COUNTING -> Count or quantify reads
TRACKS -> Generates trackdb.txt and BIGWIG files
PEAKS -> Call peaks for ChIP, RIP, CLIP and other
DE -> Differential Expression Analysis
DEU -> Differential Exon Usage Analysis
DAS -> Differential Alternative Splicing Analysis
DTU -> Differential Transcript Usage Analysis
PREPROCESSING
FETCH
Downloads files from SRA
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
SRAtools fasterq-dump | The fasterq-dump tool extracts data in FASTQ- or FASTA-format from SRA-accessions | sra | fasterq-dump | | SRA accession | FASTQ
BASECALL
Creates FASTQ files from ONT FAST5; Guppy needs to be installed locally
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
Guppy | Data processing toolkit that contains Oxford Nanopore’s basecalling algorithms, and several bioinformatic post-processing features, such as barcoding/demultiplexing, adapter trimming, and alignment. Needs to be installed locally as no conda version is available | guppy | $PATH_TO_LOCAL_BIN | | FAST5 | FASTQ
QUALITY CONTROL I
This workflow step can be run as a preprocessing step if none of the processing workflows is defined in the config.json.
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
FASTQC (includes MULTIQC) | A quality control tool for high throughput sequence data. | fastqc | fastqc | | FASTQ/BAM | ZIP/HTML
PROCESSING
QUALITY CONTROL II
If any of the below listed processing steps is defined in the config.json, quality control will run for all generated output files if enabled.
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
FASTQC (includes MULTIQC) | A quality control tool for high throughput sequence data. | fastqc | fastqc | | FASTQ/BAM | ZIP/HTML
Trimming
Trims adaptor left-overs from fastq files
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
trim_galore | A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries. | trimgalore | trim_galore | | FASTQ | TRIMMED_FASTQ
cutadapt | Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads. | cutadapt | cutadapt | | FASTQ | TRIMMED_FASTQ
Mapping
Maps sequences to reference genomes or transcriptomes
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
HISAT2 | HISAT2 is a fast and sensitive alignment program | hisat2 | hisat2 | | FASTQ/TRIMMED_FASTQ | SAM.gz/BAM
STAR | Spliced Transcripts Alignment to a Reference | star | star | | FASTQ/TRIMMED_FASTQ | SAM.gz/BAM
Segemehl2/3 | Segemehl is a software to map short sequencer reads to reference genomes. | segemehl2/segemehl3 | segemehl.x | | FASTQ/TRIMMED_FASTQ | SAM.gz/BAM
BWA | BWA is a software package for mapping low-divergent sequences against a large reference genome | bwa | bwa mem | | FASTQ/TRIMMED_FASTQ | SAM.gz/BAM
Minimap2 | Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. | minimap | minimap2 | | FASTQ/TRIMMED_FASTQ | SAM.gz/BAM
DEDUP
Deduplicate reads by UMI or based on mapping position and CIGAR string
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
UMI-tools | UMI-tools contains tools for dealing with Unique Molecular Identifiers (UMIs)/Random Molecular Tags (RMTs) and single cell RNA-Seq cell barcodes. | umitools | umi_tools | | FASTQ/TRIMMED_FASTQ | FASTQ/BAM
Picard tools | A better duplication marking algorithm that handles all cases including clipped and gapped alignments. | picard | picard | | BAM | BAM
POSTPROCESSING
Read-Counting and Quantification
Count (unique) mapped reads and how often they map to defined features
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
FeatureCounts | A software program developed for counting reads to genomic features such as genes, exons, promoters and genomic bins | countreads | featureCounts | | BAM/FASTQ | TEXT
Salmon | Salmon is a tool for wicked-fast transcript quantification from RNA-seq data. | salmon | salmon | | FASTQ/TRIMMED_FASTQ | TEXT
Differential Analyses
Includes DE, DEU, DAS and DTU
Tool |
Analysis |
Filtering |
Normalization |
Distribution |
Testing |
Significance |
Results Table |
further |
SigTables |
Clustering |
further |
Rmd |
---|---|---|---|---|---|---|---|---|---|---|---|---|
edgeR |
Differential Gene Expression |
filterByExpr() |
TMM |
NB |
Fisher’s exact test |
pValue, LFC |
results, sorted-results |
normalized |
Sig, SigUP, SigDOWN |
MDS-plot |
BCV, QLDisp, MD(per comparison) |
✓ |
edgeR |
Differential Exon Usage |
filterByExpr() |
TMM |
NB |
Fisher’s exact test |
pValue, LFC |
results |
normalized |
MDS-plot |
BCV, QLDisp, MD(per comparison) |
✓ |
|
edgeR |
Differential Alternative Splicing |
filterByExpr() |
TMM |
NB |
Simes, gene-level, exon-level |
pValue, LFC |
results(diffSpliceExonTest, Simes-Test, Gene-Test) |
Sig, SigUP, SigDOWN |
MDS-plot |
BCV, QLDisp, MD(per comparison), topSpliceSimes-plots(per Gene) |
✓ |
|
DESeq2 |
Differential Gene Expression |
RowSums >= 10 |
RLE |
NB |
Wald test |
pValue, LFC |
results |
rld, vsd, results(per comparison) |
Sig, SigUP, SigDOWN |
PCA |
Heatmaps, MA(per comparison), VST-and-log2 |
✓ |
DEXSeq |
Differential Exon Usage |
RowSums >= 10 |
RLE |
Cox-Reid |
likelihood ratio test |
|||||||
DEXSeq |
Differential Transcript Usage |
dmFilter() |
RLE |
Cox-Reid |
likelihood ratio test |
pValue |
results |
✓ |
||||
DIEGO |
Differential Alternative Splicing |
Mann-Whitney U test |
pValue |
results |
Sig |
Dendrogram-plot |
✓ |
|||||
DRIMSeq |
Differential Transcript Usage |
dmFilter() |
DM |
pValue, LFC |
results(transcript, genes) |
Proportions-table, genewise precision |
Sig, SigUP, SigDOWN (transcript, gene) |
FeatPerGene, precision, Pvalues (per comparison) |
✓ |
TRACKS
This workflow generates trackdb.txt files and bigwig files which can be used to create UCSC track hubs. However, bigwigs can of course be used for other genome viewers as well.
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
UCSC | Track hubs are web-accessible directories of genomic data that can be viewed on the UCSC Genome Browser | ucsc | ucsc | | BAM | BIGWIG/HUBS
PEAKS
Calls peaks from mapping data for ChIP, RIP, CLIP and other
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
Piranha | Piranha is a peak-caller for CLIP- and RIP-Seq data. | piranha | piranha | | BAM | BED/BEDG/BIGWIG
MACS | Model-based Analysis of ChIP-Seq (MACS), for identifying transcription factor binding sites. | macs | macs | | BAM | BED/BEDG/BIGWIG
SciPhy | Software for cyPhyRNA-Seq Data analysis | scyphy | piranha | | BAM | BED/BEDG/BIGWIG
Peaks | Sliding window peak finding tool for quick assessment of peaks. UNPUBLISHED, recommended for initial scanning only | peaks | peaks | | BAM | BED/BEDG/BIGWIG
CIRCS
Find circular RNAs in mapping data; CIRI2 needs to be installed locally.
TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT
---|---|---|---|---|---|---
CIRI2 | CIRI (circRNA identifier) is a novel chiastic clipping signal based algorithm, which can unbiasedly and accurately detect circRNAs from transcriptome data by employing multiple filtration strategies. | ciri2 | $Path_to_CIRI2.pl | | BAM | BED/BEDG/BIGWIG
Wrapping Workflows
In general, MONSDA is Python3 software that wraps workflows by assembling subworkflows or single tools from .smk or .nf templates. The idea here is that tools and subworkflows for similar tasks are designed in a way that starts from the same input and results in the same output. This is not only true for single workflow steps which can be performed by multiple tools, but also for the wrapped workflow management systems (WMS). In principle, output generated in Nextflow mode should be suitable as input for Snakemake and vice versa. This means that, for example, mapping output generated in Nextflow mode can be used as input for DE analysis in Snakemake mode, while both work from the same config.json.
As Snakemake is also written in Python, wrapping workflows is similar to the built-in way of submodule assembly, although we take care that submodules for the same task remain interchangeable. Wrapping Nextflow is slightly different, as MONSDA has to assemble Groovy text blocks; this does not make any difference to the end user, but requires translating the configuration from the config file into Nextflow-parsable command lines. However, the idea of creating interchangeable subworkflows or tool-specific code fragments stays the same.
Independently of the wrapped WMS, workflows are split internally in three independent stages. PREPROCESSING includes all workflow steps that generate or manipulate FASTQ files, to make them available to the PROCESSING stage. This includes download of files from SRA, basecalling with Guppy and pre-quality-control, so all workflow steps that do not require identical input formats, but lead to similar output.
PROCESSING starts from FASTQ files and includes trimming, deduplication, mapping and quality control for all subprocesses.
POSTPROCESSING builds upon PROCESSING output and includes quantification, differential expression analysis on gene, transcript and exon level, generation of tracks for UCSC or other genome browsers, peak finding and circular RNA identification. In contrast to PROCESSING workflows, these steps do not require output to be of similar format but are able to work from the same input.
In case dedicated workflows need to be established, as is the case for example for cyPhyRNA-Seq, the main idea is to split all preprocessing and processing steps in units that remain interchangeable, and deliver dedicated post-processing subworkflows which can work on their output. In the mentioned example we have quality control, trimming, mapping and deduplication embedded in standard workflows and dedicated, specific postprocessing of mapped reads wrapped in the PEAKS postprocessing step.
For new workflows, we aim to split those into as small subunits as possible to make subworkflows available for other pipelines and add dedicated parts as postprocessing to currently established categories. In case new categories need to be defined, please contact us to discuss how this can be embedded.
The Condition-Tree
A key concept behind MONSDA is that config files are split according to conditions defined in the config file and each subworkflow is then run consecutively. This separates workflows in independent subworkflows, each potentially running on different input, with differing options and executables, without danger of interfering with each other.
As described in Planning a project run, you should have an idea of what to analyze and how to split up your conditions if needed. For each ID you work on, you can define no, one or multiple conditions and settings that will be used for the analysis. The condition-tree also defines the directory structure to follow for Input and Output directories. We assume a general tree to look something like
'ID' -> 'CONDITION' -> 'SETTING'
Here ID is the first level and optional Condition the second. Setting is used by MONSDA to enable processing of the same samples under different settings or commandline options for e.g. mapping tools, trimming tools and later also postprocessing tools. MONSDA will also build an output directory based on the combination of tools used, e.g. fastqc-cutadapt-star-umitools, to indicate which combination of tools was used to generate the output and to prevent results from being mixed or overwritten.
As an example, I want to analyze samples retrieved from LabA on 01012020 (yes, that happens) with the mapping tools star and hisat; my condition-tree would look like this: LabA:01012020, and my FASTQ input directory would mirror that: FASTQ/LabA/01012020. The ‘01012020’ directory would thereby contain all the fastq.gz files I need for analysis as stated in the corresponding config file. As we assume that settings may change but the input files will stay the same, MONSDA will search one level upwards of the deepest level in the condition-tree. This means, if you have a tree like:
'ID1' -> 'CONDITION1' -> 'SETTING1', 'SETTING2', 'SETTING3'
You do not need to copy input from FASTQ/LabA/01012020 to FASTQ/LabA/01012020/SETTING1/2/3, instead MONSDA will find the input in FASTQ/LabA/01012020 and generate output directories which contain the Setting level. This works of course also if you want to analyze samples from different dates and same lab with same settings or different labs and so on.
MONSDA will automatically define a unique tool-key based on the currently enabled workflow steps and the combination of tools defined in the config file. From that information it will generate output folders like MAPPING/LabA/01012020/star and MAPPING/LabA/01012020/hisat if no other tools and workflow steps were configured to be used.
Optionally a user can also run one or the other tool with different settings for example to benchmark tools. Define for example map_stringent and map_relaxed and indicate this on the MAPPING level in the config file. FASTQ input will still be found in FASTQ/LabA/01012020 while output files will appear in MAPPING/LabA/01012020/map_stringent/star and MAPPING/LabA/01012020/map_stringent/hisat or MAPPING/LabA/01012020/map_relaxed/star and MAPPING/LabA/01012020/map_relaxed/hisat respectively.
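As a sketch of the benchmarking setup just described (sample and file names are placeholders), the resulting layout could look like:
FASTQ/LabA/01012020/Sample_1_R1.fastq.gz
FASTQ/LabA/01012020/Sample_1_R2.fastq.gz
MAPPING/LabA/01012020/map_stringent/star/
MAPPING/LabA/01012020/map_stringent/hisat/
MAPPING/LabA/01012020/map_relaxed/star/
MAPPING/LabA/01012020/map_relaxed/hisat/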
The config-file
It consists of multiple sections which will be explained in detail. For the template please refer to the template.json file in the config directory of the repo. The template.json is the default blueprint for a config file generated by monsda_configure.
The config file contains all the information needed to run workflows and tools with user defined settings and to find the correct sample and genome/annotation files. It starts with defining which workflow steps to run. Be aware that every workflow step has to correspond to a key in the config.json that defines parameters specific for the job. See Available Workflows: for available workflows and tools.
"WORKFLOWS": "", #which workflows do you want to run "MAPPING, QC, DEDUP, TRIMMING, COUNTING, TRACKS, PEAKS, ANNOTATE, DE, DEU, DAS, DTU, CIRCS"
"BINS": "", #where to find customscripts used in the workflow !!!ADVANCED USAGE ONLY!!!
"MAXTHREADS": "20", #maximum number of cores to use, make sure this fits your needs
"VERSION": "1.0.0", #needs to be identical to the installed version for reproducibility reasons
The next part defines ‘SETTINGS’, which is where you define which samples to process under which condition-tree leafs, which genomes and annotations to use, which sequencing type is to be analyzed and some workflow specific keys.
"SETTINGS": {
"id": {
"condition": {
"setting": {
"SAMPLES": ["Sample_1","Sample_2"] # List of samples you whish to analyze; skip file ending, this names will be used for input/output files of various formats; if paired sequencing make sure the names have _R1/_R2 as identifiers of first/second in pair at the very end of the filename and only list one file without the _R1/_R2 extension, this will be appended automatically
"SEQUENCING": "single", #or paired and stranded (rf,fr) info comma separated
"REFERENCE": "GENOMES/Dm6/dm6.fa.gz", #default Referene Genome fa.gz file
"INDEX": "GENOMES/Dm6/INDICES/star", #default index to use for mapping with this settings, empty if not already existing
"PREFIX": "idx", #OPTIONAL, prefix for mapping software can be set here
"ANNOTATION": {
"GTF": "GENOMES/Dm6/dm6.gtf.gz", #default annotation in GTF format, THIS WILL BE USED WHENEVER POSSIBLE OR NOT DEFINED OTHERWISE
"GFF": "GENOMES/Dm6/dm6.gff3.gz" #default annotation in GFF format
},
"IP": "iCLIP" # OPTIONAL if PEAKS is run and files need specific processing, eg. for iCLIP protocol, options are CLIP, iCLIP, revCLIP
}
}
}
},
To define the samples to run the analysis on, just add a list of sample names as the innermost value of the SAMPLES key for each condition. In case of single-end sequencing, make sure to include any _R1/_R2 tag should such exist; in case of paired-end, skip those, as the pipeline will automatically look for _R1 and _R2 tags to find read pairs and prevent you from having to write duplicated information.
Make sure the naming of you samples follows this _R1 _R2 convention when running paired-end analysis! All files have to end with _R1.fastq.gz or _R2.fastq.gz for paired end and .fastq.gz for single-end processing respectively. MONSDA will append these suffixes automatically
The SEQUENCING key allows you to define single or paired as values to enable analysis of a mix of single/paired end sequences at once, split in different leafs of the condition-tree. You can also specify strandedness of the data at hand. If unstranded leave empty, else add strandedness according to http://rseqc.sourceforge.net/#infer-experiment-py as comma separated value (rf Assumes a stranded library fr-firststrand [1+-,1-+,2++,2–], fr Assumes a stranded library fr-secondstrand [1++,1–,2+-,2-+]). MONSDA will use these values to configure tools automatically where possible. It is therefore not required to add sequencing specific settings for each used tool to your config file.
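As an illustration, assume paired-end, fr-firststrand data for two samples under the condition-tree leaf LabA/01012020 (all names are placeholders); the FASTQ directory would then contain:

FASTQ/LabA/01012020/Sample_1_R1.fastq.gz
FASTQ/LabA/01012020/Sample_1_R2.fastq.gz
FASTQ/LabA/01012020/Sample_2_R1.fastq.gz
FASTQ/LabA/01012020/Sample_2_R2.fastq.gz

and the corresponding ‘SETTINGS’ leaf would simply list the basenames once, assuming the comma-separated strandedness form described above:

"SAMPLES": ["Sample_1","Sample_2"],
"SEQUENCING": "paired,rf",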
Now the actual workflow section begins, where you define which environments and tools to use and which settings to apply for each step and condition/setting. These follow the same scheme for each step. The ENV key defines the conda environment to load from the envs directory of this installation; this also allows you to add your own environment.yaml files if needed, just make sure to share those along with your configuration. The BIN key defines the name of the executable; this is needed in case the environment and the binary name differ, as e.g. for the mapping tool segemehl/segemehl.x. Some tools, e.g. guppy, are not available via conda, hence you have to install the tool locally and indicate the path to the executable.
You can always define differing ENV/BIN keys for each condition-tree leaf separately, which will overwrite the ENV/BIN set under ‘TOOLS’. This is intended, for example, for benchmarking different versions of the same tool. Furthermore, you can define keys like ‘ANNOTATION’ and ‘REFERENCE’ in the same way, overwriting the ones defined under ‘SETTINGS’. This is useful, for example, when you run a workflow like ‘COUNTING’ with salmon and want to quantify transcript isoforms while still being able to map, trim and count reads on genome level.
The next key-level is the OPTIONS key, which is where you can define additional parameters for each tool. There is no need to define anything related to single-end/paired-end sequencing; this is handled automatically. To add parameters, simply fill the OPTIONS dict, which holds one entry per defined subworkflow-step. Parameters are defined as key/value pairs, with the subworkflow-step as key (e.g. ‘INDEX’ to generate an index file for mapping) and the settings, written as in a command-line call, as value. This should become clear when looking at the different processing steps in the template.json. If there are no options, just leave the ‘OPTIONS’ dict empty.
"FETCH": {
"TOOLS" : # which tools to run, format is Conda-environment name : binary to call, will be overwritten by ENV/BIN in ICS
{
"sra" : "sra"
},
"id": { #key for source and genome
"condition": { # sample id
"setting": {
"ENV" : "sra", # OPTIONAL if no tool is defined, name of conda env for raw file download
"BIN" : "sra", # OPTIONAL PATH to executable, usually just the name of the executable
"sra": { # for which tool environment these settings work
"OPTIONS":
{
"DOWNLOAD": "" #FETCH options here if any, paired is not required, will be resolved by rules
}
}
}
}
}
},
"BASECALL": {
"TOOLS" : # which tools to run, format is Conda-environment name : binary to call
{
"guppy" : "~/.local/bin/guppy-cpu/bin/guppy_basecaller"
},
"id": { #key for source and genome
"condition": { # sample id
"setting": {
"ENV" : "guppy", # name of conda env for raw file download
"BIN" : "~/.local/bin/guppy-cpu/bin/guppy_basecaller", #PATH to guppy executable
"guppy":{
"OPTIONS":
{
"BASECALL": "" #Guppy options here if any, paired is not required, will be resolved by rules
}
}
}
}
}
},
"QC": {
"TOOLS" :
{
"fastqc" : "fastqc"
},
"id": { #key for source and genome
"condition": { # sample id
"setting": {
"ENV" : "fastqc", # name of conda env for QC
"BIN" : "fastqc", # binary for QC
"fastqc":{
"OPTIONS":
{
"QC": "", #QC options here if any, paired is not required, will be resolved by rules
"MULTI": "" #MultiQC options here if any, paired is not required, will be resolved by rules
}
}
}
}
}
},
"TRIMMING": { #options for trimming for each sample/condition
"TOOLS" :
{
"trimgalore": "trim_galore",
"cutadapt": "cutadapt"
},
"id": {
"condition": {
"setting": { # See above
"ENV": "trimgalore", # name of conda env for trimming
"BIN": "trim_galore", # name of binary for trimming
"trimgalore":{
"OPTIONS":
{
"TRIM": "-q 15 --length 8 -e 0.15" # trimming options here, --paired is not required, will be resolved by rules
}
},
"cutadapt":{
"OPTIONS":
{
"TRIM": "-q 15 --length 8 -e 0.15" # trimming options here, --paired is not required, will be resolved by rules
}
}
}
}
}
},
"DEDUP": { #options for deduplication for each sample/condition
"TOOLS": {
"umitools": "umi_tools",
"picard": "picard"
},
"id": {
"condition": {
"setting": { # See above
"ENV": "umitools", # name of conda env for dedup
"BIN": "umi_tools", # name of binary for dedup
"umitools":{
"OPTIONS":
{
"WHITELIST" : "--extract-method string --bc-pattern AGANNNNACGT",# umitools whitelist options
"EXTRACT": "--extract-umi-method read_id", # umitools extract options
"DEDUP": "" # umitools dedup options
}
},
"picard":{
"OPTIONS":
{
"JAVA" : "",# options
"DEDUP": "" # dedup options
}
}
}
}
}
},
"MAPPING": { #options for mapping for each sample/condition
"TOOLS": {
"star": "STAR",
"segemehl3": "segemehl.x",
"hisat2": "hisat2",
"bwa": "bwa mem",
"minimap": "minimap2"
},
"id": {
"condition": {
"setting": {
"ENV": "star", # which conda env to use for mapping
"BIN": "STAR", #how the mapper binary is called
"REFERENCE": "$PATHTO/genome.fa.gz", #Path to the gzipped Genome FASTA file, overwrites SETTINGS
"ANNOTATION": "$PATHTO/genome_or_other.gtfgff.gz", #Path to the gzipped annotation file in gtf/gff format, overwrites SETTINGS
"star":{
"OPTIONS": # first entry in list is a dict of options for indexing, second for mapping, third can be e.g. appendix to index name, useful especially with minimap if using different kmer sizes
{
"INDEX" : "--sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 13", #indexing options
"MAP": "--sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore", #mapping options
"EXTENSION" : ""
}
},
"segemehl3": {
"OPTIONS":
{
"INDEX" : "",
"MAP": "",
"EXTENSION" : ""
}
},
"hisat2": {
"OPTIONS":
{
"INDEX" : "",
"MAP": "",
"EXTENSION" : ""
}
},
"bwa": {
"OPTIONS":
{
"INDEX" : "",
"MAP": "",
"EXTENSION" : ""
}
},
"minimap": {
"OPTIONS":
{
"INDEX" : "",
"MAP": "",
"EXTENSION" : ""
}
}
}
}
}
},
MONSDA further supports postprocessing steps like DE/DEU/DAS/DTU analysis for a defined (sub)set of samples. These workflows act on ‘GROUPS’, which have to be defined in the ‘SETTINGS’ and allow users to compare samples across the condition tree. In case samples have been collected in batches, or users want to assign types to samples, this can also be defined and will be considered in the respective design matrix of all provided R scripts. The settings for these steps may look as follows:
"SETTINGS": {
"Ecoli": {
"WT": {
"dummylevel": {
"SAMPLES": [
"SRR16324019",
"SRR16324018",
"SRR16324017"
],
"GROUPS": [
"ctrl",
"ctrl",
"ctrl"
],
"SEQUENCING": "paired",
"REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
"ANNOTATION": {
"GFF": "GENOMES/Ecoli/ecoli.gff.gz",
"GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
}
}
},
"KO": {
"SAMPLES": [
"SRR16324016",
"SRR16324015",
"SRR16324014"
],
"GROUPS": [
"ko",
"ko",
"ko"
],
"SEQUENCING": "paired",
"REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
"ANNOTATION": {
"GFF": "GENOMES/Ecoli/ecoli.gff.gz",
"GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
}
}
}
},
This can be extended, for example, to:
"SETTINGS": {
"Ecoli": {
"WT": {
"SAMPLES": [
"SRR16324019",
"SRR16324018",
"SRR16324017"
],
"GROUPS": [
"ctrl",
"ctrl",
"ctrl"
],
"BATCHES": [
"1",
"1",
"2"
],
"TYPES": [
"paired",
"paired",
"single"
]
}
}
}
Here ‘BATCHES’ and ‘TYPES’ can take arbitrary values; ‘BATCHES’ is intended to allow correction for batch effects, while ‘TYPES’ allows taking e.g. single-end/paired-end information into account. Be aware that this only makes sense if you indeed have batches and/or types to compare; otherwise simply skip those keys and MONSDA will take care of the rest.
Another extra key for these analysis steps is ‘EXCLUDE’. It can be used to exclude samples from postprocessing, for example if the first round of analysis reveals an outlier (see the sketch after the following comparison example). The most important key is ‘COMPARABLE’, which, if left empty, will generate all-vs-all comparisons from all available ‘GROUPS’. In case you want to compare only certain groups, you can edit the config to look like this:
"COMPARABLE" :
{
"comparison-name":
{
"Group1",
"Group2"
}
}
This will compare Group1 and Group2 and name output after comparison-name.
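A minimal sketch of the ‘EXCLUDE’ key, assuming it takes a list of sample names to drop from postprocessing (the sample name is taken from the example above; please check the template.json of your MONSDA version for the exact format):

"EXCLUDE": ["SRR16324017"],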
Another special case is ‘MACS’, a peak caller for ChIP-Seq data, which needs to process files pairwise: one file contains the signal, the other serves as background. To keep configuration simple, such pairings can also be described with the ‘COMPARABLE’ key in the PEAKS section of the config; for a complete example of the PEAKS analysis options, as well as of the COUNTING section and its indexing options, please refer to the template.json in the config directory.
For everything else please refer to the TUTORIAL, The Condition-Tree and the Workflow Overview.
Keep in mind that every workflow step needs a corresponding entry in the config file or MONSDA.py will throw an error.
Integrating new tools/workflows
In case new tools need to be integrated, please refer to similar tools already implemented, or contact us in case nothing similar is available yet. Workflows should always be split into subworkflows that follow the same principle as existing subworkflows, ideally making them reusable for other workflows in the future.
Tools should be easy to integrate; all that is needed is a tool-specific and, if applicable, version-specific .smk or .nf file describing input/output and command-line calls, as well as a fitting conda environment yaml file. Once these are available, the tool should already be usable and configurable via the config.json in the specific workflow section.
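For illustration, a minimal conda environment yaml for a hypothetical new tool (name and version are placeholders) that would go into the envs directory could look like this:

name: mytool
channels:
  - conda-forge
  - bioconda
dependencies:
  - mytool=1.0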
Contribute
If you like this project, are missing features, want to contribute or want to file bugs, please open a PR, leave an issue or contact us directly.
To contribute new tools, feel free to adapt existing ones; there should be a number of examples available that cover implementation details for almost all sorts of standard tools. If you need to add new python/groovy functions for processing of options or parameters, add them to the corresponding file in the lib directory. New environments go into the envs directory, new subworkflows into the workflows directory. Do not forget to also extend the template.json and add some documentation before opening a pull request.
PRs are always welcome.
Contributors
Joerg Fallmann @jfallmann
Robin Goldmann @meisterL
Welcome to MONSDA, the Modular Organizer of Nextflow and Snakemake driven HTS Data Analysis
MONSDA automates HTS analysis from data download, preprocessing and mapping to postprocessing/analysis and track generation, centered on a single config file. It can create Snakemake and Nextflow workflows based on a user-friendly, sharable JSON config file and reproducible subworkflows. These workflows can either be saved to disk for manual inspection and execution or executed automatically.
For details on Snakemake and Nextflow and their features please refer to the corresponding Snakemake or Nextflow documentation.
In general it is necessary to write a configuration file containing the workflows to execute, information on paths, files to process, and settings beyond default for mapping tools and others. The template on which MONSDA is based can be found in the config directory.
For MONSDA to be as FAIR as possible, one needs to use conda or the faster drop-in replacement mamba. For details on either please refer to the corresponding conda or mamba manual.
This workflow organizer makes heavy use of conda and especially the bioconda channel.