TUTORIAL

Here we will show three MONSDA runs from simple to most-complex, explaining the configuration and how to run this in real life.

First please create a working directory and access it, then install MONSDA following Installation and activate the conda environment.

All examples are based on data from SRA , which you can either download beforehand or you use the “FETCH” workflow as demonstrated in the tutorial.

Genome and Annotation files need to be downloaded beforehand and can be found here: GENOME, GFF, GTF

Ideally you download those files into a sub-directory of your workspace named “GENOME/Ecoli” and rename them to something simpler, in the tutorial we will refer to “GCF_000005845.2_ASM584v2_genomic” as “ecoli”. You also need to change the “.fna.gz” suffix of the genome file to “.fa.gz” and we should be ready to go.

All config files this tutorial is based on can be found in envs/monsda/share/MONSDA/configs of your local conda installation or the “configs” directory of the MONSDA repository.

A quick run

This run is based on a simple mapping only use-case. We have a single paired-end FASTQ file and want to map that. In this case we have no information about strandedness and ignore this setting.

The corresponding config file looks as follows:

{
    "WORKFLOWS": "FETCH,MAPPING",
    "BINS": "",
    "MAXTHREADS": "4",
    "VERSION": "1.2.5",
    "SETTINGS": {
        "SIMPLE": {
            "SAMPLES": [
                "SRR16324019"
            ],
            "SEQUENCING": "paired",
            "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
            "ANNOTATION": {
                "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
            }
        }
    },
    "FETCH": {
        "TOOLS" :
        {
            "sra" : "fasterq-dump"
        },
        "SIMPLE": {
            "sra": {
                "OPTIONS":
                {
                    "PREFETCH": "${HOME}/.ncbi/user-settings.mkfg",
                    "DOWNLOAD": ""
                }
            }
        }
    },
    "MAPPING": {
        "TOOLS": {
            "star": "STAR"
        },
        "SIMPLE": {
            "star": {
                "OPTIONS": {
                    "INDEX": "--genomeSAindexNbases 8",
                    "MAP": "--outSAMprimaryFlag AllBestScore       --outFilterMultimapNmax 20",
                    "EXTENSION": ""
                }
            }
        }
    }
}

This tells MONSDA version “1.1.1” to run the workflows “FETCH” and “MAPPING” on the sample “SRR16324019”.

“FETCH” will activate the environment “sra” and run the binary “fasterq-dump” with no extra options to fetch the sample from SRA.

“MAPPING” will then run the executable “STAR” in environment “star” on the downloaded FASTQ file. First it will check for existence of the needed INDEX file, and as no INDEX file is defined in the config it will try to generate one. For that it will use the REFERENCE “GENOMES/Ecoli/ecoli.fa.gz” as well as the ANNOTATION GTF “GENOMES/Ecoli/ecoli.gtf.gz”. GTF is always selected first and GFF is used as a fallback if no GTF is found. After successfully building the STAR index, MONSDA will start mapping with STAR in “paired” mode and applying a set of options as can be seen under the “MAP” key in section “MAPPING”->”SIMPLE”->”star”->”OPTIONS”. We do not define a pre- or suffix for output files, so “EXTENSION” can be skipped or left blank.

Starting the run with 4 cores (defining more will be capped by the config file as MAXTHREADS is set to 4 but you are free to change that accordingly)

monsda -j 4 -c ${CONDA_PREFIX}/share/MONSDA/configs/tutorial_quick.json --directory ${PWD}

Will start the run in the current directory and generate a “FASTQ” sub-directory containing the downloaded sample, a “GENOME/Ecoli/INDICES” directory containing the built index and a “MAPPING” directory containing the mapped files. Furthermore, MONSDA will create a “LOG” directory containing it’s own log, as well as logs of all executed jobs and a “JOBS” directory containing the ‘monsda.commands’ file, which contains all command-line calls that MONSDA runs. A successful run will show the message ‘Workflow finished, no error’ at the end.

A more complex toolmix run

This slightly more complex use case involves multiple input files, two conditions (WT/KO) and a more or less standard DE analysis workflow. We also include a “dummylevel” that is a placeholder for settings or other subdivisions of the WT level, to demonstrate that MONSDA can work on condition-trees of differing depth.

Workflows include:

  • FETCH: Download from SRA

  • QC: FASTQC of input and output

  • TRIMMING: Adaptor removal with cutadapt/trimgalore

  • MAPPING: Read mapping with STAR, hisat2, bwa, segemehl3 and minimap2

  • DEDUP: Read deduplication with umi_tools and picard

The more complex config for this analysis follows

{
    "WORKFLOWS": "FETCH, QC, TRIMMING, MAPPING, DEDUP",
    "VERSION": "1.2.5",
    "BINS": "",
    "MAXTHREADS": "16",
    "SETTINGS": {
        "Ecoli": {
            "WT": {   
                "dummylevel": {       
                    "SAMPLES": [
                        "SRR16324019", 
                        "SRR16324018",
                        "SRR16324017"
                    ],
                    "GROUPS": [
                        "ctrl",
                        "ctrl",
                        "ctrl"
                    ],
                    "SEQUENCING": "paired",
                    "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
                    "ANNOTATION": {
                        "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                        "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
                    }
                }
            },
            "KO": {
                "SAMPLES": [
                    "SRR16324016", 
                    "SRR16324015",
                    "SRR16324014"
                ],
                "GROUPS": [
                    "ko",
                    "ko",
                    "ko"
                ],
                "SEQUENCING": "paired",
                "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
                "ANNOTATION": {
                    "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                    "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
                }
            }
        }
    },
    "FETCH": {
        "TOOLS" :
        {
            "sra" : "fasterq-dump"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "sra": {
                        "OPTIONS":
                        {
                            "PREFETCH": "${HOME}/.ncbi/user-settings.mkfg",
                            "DOWNLOAD": ""
                        }
                    }
                }
            },
            "KO": {
                "sra": {
                    "OPTIONS":
                    {
                        "PREFETCH": "${HOME}/.ncbi/user-settings.mkfg",
                        "DOWNLOAD": ""
                    }
                }
            }
        }
    },
    "QC": {
        "TOOLS" :
        {
            "fastqc" : "fastqc"
        },
        "Ecoli": {
            "KO": {
                "fastqc": {
                    "OPTIONS":
                    {
                        "QC": "",  #QC options here if any, KO is not required, will be resolved by rules
                        "MULTI": ""  #MultiQC options here if any, KO is not required, will be resolved by rules
                    }
                }
            },
            "WT": {
                "dummylevel": {       
                    "fastqc": {
                        "OPTIONS":
                        {
                            "QC": "",  #QC options here if any, KO is not required, will be resolved by rules
                            "MULTI": ""  #MultiQC options here if any, KO is not required, will be resolved by rules
                        }
                    }
                }
            }
        }
    },
    "TRIMMING": {
        "TOOLS" :
        {
            "trimgalore": "trim_galore",
            "cutadapt": "cutadapt"
        },
        "Ecoli": {
            "KO": {
                "trimgalore":{
                    "OPTIONS":
                    {
                        "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                    }
                },
                "cutadapt":{
                    "OPTIONS":
                    {
                        "TRIM": "-m 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                    }
                }
            },
            "WT": {
                "dummylevel": {       
                    "trimgalore":{
                        "OPTIONS":
                        {
                            "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                        }
                    },
                    "cutadapt":{
                        "OPTIONS":
                        {
                            "TRIM": "-m 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                        }
                    }
                }
            }
        }
    },
    "DEDUP": {
        "TOOLS": {
            "umitools": "umi_tools",
            "picard": "picard"
        },
        "Ecoli": {
            "KO":{
                "umitools":{
                    "OPTIONS":
                    {
                        "EXTRACT": "--extract-method string --bc-pattern 'XNNNNX'",
                        "DEDUP": "--extract-umi-method read_id"  # umitools dedup options
                    }
                },
               "picard":{
                    "OPTIONS":
                    {
                        "JAVA": "",
                        "DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
                    }
              }
            },
            "WT":{
                "dummylevel": {       
                    "umitools": {
                        "OPTIONS":
                        {
                            "EXTRACT": "--extract-method string --bc-pattern 'XNNNNX'",
                            "DEDUP": "--extract-umi-method read_id"  # umitools dedup options
                        }
                    },
                    "picard": {
                        "OPTIONS":
                        {
                            "JAVA": "",
                            "DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
                        }
                    }
                }
            }
        }
    },
    "MAPPING": {
        "TOOLS": {
            "star": "STAR",
            "segemehl3": "segemehl.x",
            "hisat2": "hisat2",
            "bwa": "bwa mem",
            "minimap": "minimap2"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {       
                    "star": {
                        "OPTIONS": {
                            "INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 13",
                            "MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
                        }
                    },
                    "segemehl3": {
                        "OPTIONS": {
                            "INDEX": "",
                            "MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
                        }
                    },
                    "hisat2": {
                        "OPTIONS":
                        {
                          "INDEX": "",
                          "MAP": ""
                        }
                    },
                    "bwa": {
                        "OPTIONS":
                        {
                            "INDEX": "-a bwtsw",
                            "MAP": ""
                        }
                    },
                    "minimap": {
                        "OPTIONS":
                        {
                            "INDEX": "-k 15",
                            "MAP": "-ax sr -ub -Y -L --MD"
                        }
                    }
                }
            },
            "KO": {
                "star": {
                    "OPTIONS":
                    {
                        "INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 13",
                        "MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
                    }
                },
                "segemehl3": {
                    "OPTIONS": {
                        "INDEX":  "",
                        "MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
                    }
                },
                "hisat2": {
                    "OPTIONS":
                    {
                      "INDEX": ""
                    }
                },
                "bwa": {
                    "OPTIONS":
                    {
                        "INDEX": "-a bwtsw",
                        "MAP": ""
                    }
                },
                "minimap": {
                    "OPTIONS":
                    {
                        "INDEX": "-k 15",
                        "MAP": "-ax sr -ub -Y -L --MD"
                    }
                }
            }
        }
    }
}

Starting the run with 12 cores (defining more will be capped by the config file as MAXTHREADS is set to 12, again you can change that accordingly)

monsda -j 12 -c ${CONDA_PREFIX}/share/MONSDA/configs/tutorial_toolmix.json --directory ${PWD}

Will start the run in the current directory and generate a “FASTQ” sub-directory containing the downloaded sample, a “GENOME/INDICES” directory containing the built index, a “QC” directory containing all fastqc reports and multiqc output, a “TRIMMED_FASTQ” directory for trimgalore and cutadapt output, a “DEDUP” directory for umi_tools (runs before trimming and after mapping) and picard (runs after mapping) output and a “MAPPING” directory containing the mapped files. Furthermore, a “DE” directory will be created which will hold output from counting with featurecounts and DE input and output from EdgeR and DESeq2. Again, MONSDA will create a “LOG” directory containing it’s own log, as well as logs of all executed jobs and the “JOBS” directory with all command-line calls.

A successful run will show the message ‘Workflow finished, no error’ at the end.

Run postprocessing

This postprocessing use case involves multiple input files, two conditions (WT/KO) and almost all workflows available. We also include a “dummylevel” that is a placeholder for settings or other subdivisions of the WT level, to demonstrate that MONSDA can work on condition-trees of differing depth.

Workflows include:

  • FETCH: Download from SRA

  • QC: FASTQC of input and output

  • TRIMMING: Adaptor removal with cutadapt/trimgalore

  • MAPPING: Read mapping with STAR, hisat2, bwa, segemehl3 and minimap2

  • DEDUP: Read deduplication with umi_tools and picard

  • DE: Differential Expression analysis with EdgeR and DESeq2

  • DEU: Differential Exon Usage analysis with EdgeR (DEXSeq skipped, runtime)

  • COUNTING: Read counting with FeatureCounts

  • TRACKS: Generation of tracks for UCSC or other genome browsers

  • PEAKS: Analysis of ChIP-Seq or CLIP-Seq or cyPhyRNA-Seq Peaks

The more complex config for this analysis follows

Note that “SETTINGS” now also contain “GROUPS” which hold identifiers for the DE stage and can be used to define “COMPARABLES” that tell MONSDA which samples should be treated as replicates and compared to which other samples. By default this will make an all-vs-all comparison, in our simple case ctrl-vs-ko. Keeping this “GROUPS” setting separated from the condition-tree makes it possible to mix samples for different conditions for all pre- and processing steps and separating them for postprocessing without further struggle. This means that in this simple case we could have mixed all samples under one condition as all other tool options stay the same and reference and annotation do not differ and would still be able to split by “GROUPS” for the DE step. However, sometimes we need to compare data that has to be processed differently, e.g. mixing paired and single-ended reads, so we can apply different settings as needed for all workflow steps by simply splitting our condition-tree accordingly.

Same as for the more complex run we can see that “SETTINGS” contains “GROUPS” which hold identifiers for the DE/DEU/DAS/DTU stages that will be used to define “COMPARABLES” that tell MONSDA which samples should be treated as replicates and compared to which other samples. By default this will make an all-vs-all comparison, in our case ctrl-vs-ko. Keeping this “GROUPS” setting separated from the condition-tree makes it possible to mix samples for different conditions for all pre- and processing steps and separating them for postprocessing without further struggle. This means that in this simple case we could have mixed all samples under one condition, as all other tool options stay the same and reference and annotation do not differ and would still be able to split by “GROUPS” for the DE/DEU/DAS steps. However, sometimes we need to compare data that has to be processed differently, e.g. mixing paired and single-ended reads, so we can apply different settings as needed for all workflow steps by simply splitting our condition-tree accordingly. Another addition to the “SETTINGS” key is the “IP” (ImmunoPrecipitation, aka enrichment) part. This is intended to be combined with the PEAKS workflow and tools that have certain preprocessing step requirements dependent on the way reads where enriched. For example our novel “scyphy” (Software for cyPhyRNA-Seq) workflow can work on 5-prime and 3-prime end enriched data, which can be indicated as “iCLIP” and “revCLIP” respectively. For this tutorial we will simulate standard “CLIP” and set the corresponding key.

Starting the run with 12 cores (defining more will be capped by the config file as MAXTHREADS is set to 12, you already know the drill, change accordingly)

monsda -j 12 -c ${CONDA_PREFIX}/share/MONSDA/configs/tutorial_postprocess.json --directory ${PWD}

Will start the run in the current directory and generate a “FASTQ” sub-directory containing the downloaded sample, a “GENOME/Ecoli/INDICES” directory containing the built indices, including the one built for salmon later on, a “QC” directory containing all FastQC reports and MultiQC output, a “TRIMMED_FASTQ” directory for trimgalore and cutadapt output, a “DEDUP” directory for umi_tools (runs before trimming and after mapping) and picard (runs after mapping) output and a “MAPPING” directory containing the mapped files. Furthermore, “DE/DEU” directories will be created which will hold output from counting with FeatureCounts and DE/DEU input and output from EdgeR and DESeq2 respectively. Again, MONSDA will create a “LOG” directory containing it’s own log, as well as logs of all executed jobs and again a “JOBS” directory for command-line calls.

A successful run will show the message ‘Workflow finished, no error’. Be aware that this is indeed an exhaustive workflow and will require a decent amount of disk-space, memory and compute-time, depending on the hardware at your disposal.

Exhaustive run############

We implemented an exhaustive tutorial, but running analysis steps like DTU or DAS requires more complex datasets and thus more preparation.

This exhaustive use case involves multiple input files, two conditions (WT/KO) and almost all workflows available. We also include a “dummylevel” that is a placeholder for settings or other subdivisions of the WT level, to demonstrate that MONSDA can work on condition-trees of differing depth.

Workflows include:

  • FETCH: Download from SRA

  • QC: FASTQC of input and output

  • TRIMMING: Adaptor removal with cutadapt/trimgalore

  • MAPPING: Read mapping with STAR, hisat2, bwa, segemehl3 and minimap2

  • DEDUP: Read deduplication with umi_tools and picard

  • DE: Differential Expression analysis with EdgeR and DESeq2

  • DEU: Differential Exon Usage analysis with EdgeR (DEXSeq skipped, runtime)

  • DAS: Differential Alternative Splicing analysis with EdgeR and DIEGO

  • DTU: Differential Transcript Usage analysis with DRIMSeq

  • COUNTING: Read counting with FeaturCounts

  • TRACKS: Generation of tracks for UCSC or other genome browsers

  • PEAKS: Analysis of ChIP-Seq or CLIP-Seq or cyPhyRNA-Seq Peaks

The more complex config for this analysis follows

You will see here that for DTU workflows we will need an additional FASTA and GTF file covering transcripts. As our sample data is from E.coli, transcripts and genes can be seen as synonymous, so we can easily create the needed files from our already downloaded files.

Creating a transcript gtf file from the downloaded gtf by subsetting for gene features:

zcat ecoli.gtf.gz |head -n+3 > bla; zcat ecoli.gtf.gz|grep -P '\tgene\t|\tCDS\t' >> bla;cat bla|gzip > Ecoli_trans.gtf.gz;rm -f bla

We now fix the empty transcript-id tag by replacing it with the gene-id followed by “_1”:abbr:

zcat Ecoli_trans.gtf.gz |perl -F'\t' -wlane 'print $_;if($_ !~/^#/){$start=$F[3];$end=$F[4];$dist=int(($end-$start)/5); $start+=$dist;$end-=$dist;print join("\t",@F[0..1])."\ttranscript\t$start\t$end\t".join("\t",@F[5..$#F])}'|perl -wlane 'BEGIN{%ids}{if($_=~/^#/ or $F[2] eq "gene"){print}else{$n=$F[9];$ids{$n}++;$F[11]=substr($n,0,-2)."_".$ids{$n}."\";";print join("\t",@F[0..7],join(" ",@F[8..$#F]))}}'|sed -e 's/CDS/transcript/g' -e 's/gene_synonym/gene_name/g'|gzip > Ecoli_trans_fix.gtf.gz

From this gtf we can now create a FASTA file by writing a helper BED file and using BEDtools to extract sequences from our genome FASTA file in strand specific manner. You may have to install BEDtools first, best do so in a dedicated conda environment via

conda create -n bedtools bedtools

We then only need to clean up the name tag as follows:

zcat Ecoli_trans_fix.gtf.gz|grep -P '\ttranscript\t'|perl -wlane 'next if($_=~/^#/);($name=(split(/;/,$F[11]))[0])=~s/\"//g;print join("\t",$F[0],$F[3]-1,$F[4],$name,100,$F[6])' > Ecoli_trans.bed
zcat ecoli.fa.gz > ecoli.fa
conda activate bedtools
bedtools getfasta -fi ecoli.fa -bed Ecoli_trans.bed -fo Ecoli_trans.fa -name -s
conda deactivate
perl -wlane 'if($_=~/^>/){$_=~s/\(|\+|\-|\)//g;}print' Ecoli_trans.fa |gzip > tmp.gz;cat tmp.gz ecoli.fa.gz > Ecoli_trans.fa.gz && rm -f tmp.gz
zcat ecoli.fa.gz|grep -P '^>'|cut -d " " -f1 > salmon_decoy
sed -i.bak -e 's/>//g' salmon_decoy

Should you encounter problems with you preinstalled versiond of grep, perl or sed, please try to also install them in a dedicated conda environment and activate/deactivate them accordingly.

With these “Ecoli_trans.fa.gz” and “Ecoli_trans_fix.gtf.gz” files and the corresponding salmon decoy file we can now also run the DTU workflow. The config file for the exhaustive run looks as follows:

{
    "WORKFLOWS": "FETCH, QC, TRIMMING, MAPPING, DEDUP, TRACKS, PEAKS, DE, DEU, DAS, DTU, COUNTING",
    "VERSION": "1.2.5",
    "BINS": "",
    "MAXTHREADS": "16",
    "SETTINGS": {
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "SAMPLES": [
                        "SRR16324019", 
                        "SRR16324018",
                        "SRR16324017"
                    ],
                    "GROUPS": [
                        "ctrl",
                        "ctrl",
                        "ctrl"
                    ],
                    "SEQUENCING": "paired",
                    "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
                    "ANNOTATION": {
                        "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                        "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
                    },
                    "DECOY": {
                        "salmon": "GENOMES/Ecoli/salmon_decoy"
                    },
                    "IP": "CLIP"
                }
            },
            "KO": {
                "SAMPLES": [
                    "SRR16324016", 
                    "SRR16324015",
                    "SRR16324014"
                ],
                "GROUPS": [
                    "ko",
                    "ko",
                    "ko"
                ],
                "SEQUENCING": "paired",
                "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
                "ANNOTATION": {
                    "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                    "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
                },
                "DECOY": {
                    "salmon": "GENOMES/Ecoli/salmon_decoy"
                },
                "IP": "CLIP"
            }
        }
    },
    "FETCH": {
        "TOOLS" :
        {
            "sra" : "fasterq-dump"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "sra": {
                        "OPTIONS":
                        {
                            "PREFETCH": "${HOME}/.ncbi/user-settings.mkfg",
                            "DOWNLOAD": ""
                        }
                    }
                }
            },
            "KO": {
                "sra": {
                    "OPTIONS":
                    {
                        "PREFETCH": "${HOME}/.ncbi/user-settings.mkfg",
                        "DOWNLOAD": ""
                    }
                }
            }
        }
    },
    "QC": {
        "TOOLS" :
        {
            "fastqc" : "fastqc"
        },
        "Ecoli": {
            "KO": {
                "fastqc": {
                    "OPTIONS":
                    {
                        "QC": "",  #QC options here if any, KO is not required, will be resolved by rules
                        "MULTI": ""  #MultiQC options here if any, KO is not required, will be resolved by rules
                    }
                }
            },
            "WT": {
                "dummylevel": {
                    "fastqc": {
                        "OPTIONS":
                        {
                            "QC": "",  #QC options here if any, KO is not required, will be resolved by rules
                            "MULTI": ""  #MultiQC options here if any, KO is not required, will be resolved by rules
                        }
                    }
                }
            }
        }
    },
    "TRIMMING": {
        "TOOLS" :
        {
            "trimgalore": "trim_galore",
            "cutadapt": "cutadapt"
        },
        "Ecoli": {
            "KO": {
                "trimgalore":{
                    "OPTIONS":
                    {
                        "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                    }
                },
                "cutadapt":{
                    "OPTIONS":
                    {
                        "TRIM": "-m 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                    }
                }
            },
            "WT": {
                "dummylevel": {
                    "trimgalore":{
                        "OPTIONS":
                        {
                            "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                        }
                    },
                    "cutadapt":{
                        "OPTIONS":
                        {
                            "TRIM": "-m 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                        }
                    }
                }
            }
        }
    },
    "DEDUP": {
        "TOOLS": {
            "umitools": "umi_tools",
            "picard": "picard"
        },
        "Ecoli": {
            "KO":{
                "umitools":{
                    "OPTIONS":
                    {
                        "EXTRACT": "--extract-method string --bc-pattern 'XNNNNX'",
                        "DEDUP": "--extract-umi-method read_id"  # umitools dedup options
                    }
                },
               "picard":{
                    "OPTIONS":
                    {
                        "JAVA": "",
                        "DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
                    }
              }
            },
            "WT":{
                "dummylevel": {
                    "umitools": {
                        "OPTIONS":
                        {
                            "EXTRACT": "--extract-method string --bc-pattern 'XNNNNX'",
                            "DEDUP": "--extract-umi-method read_id"  # umitools dedup options
                        }
                    },
                   "picard": {
                     "OPTIONS":
                        {
                            "JAVA": "",
                            "DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
                        }
                    }
                }
            }
        }
    },
    "MAPPING": {
        "TOOLS": {
            "star": "STAR",
            "segemehl3": "segemehl.x",
            "hisat2": "hisat2",
            "bwa": "bwa mem",
            "minimap": "minimap2"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "star": {
                        "OPTIONS": {
                            "INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 8",
                            "MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
                        }
                    },
                    "segemehl3": {
                        "OPTIONS": {
                            "INDEX": "",
                            "MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
                        }
                    },
                    "hisat2": {
                        "OPTIONS":
                        {
                          "INDEX": ""
                        }
                    },
                    "bwa": {
                        "OPTIONS":
                        {
                            "INDEX": "-a bwtsw",
                            "MAP": ""
                        }
                    },
                    "minimap": {
                        "OPTIONS":
                        {
                            "INDEX": "-k 15",
                            "MAP": "-ax sr -ub -Y -L --MD"
                        }
                    }
                }
            },
            "KO": {
                "star": {
                    "OPTIONS":
                    {
                        "INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 8",
                        "MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
                    }
                },
                "segemehl3": {
                    "OPTIONS": {
                        "INDEX":  "",
                        "MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
                    }
                },
                "hisat2": {
                    "OPTIONS":
                    {
                      "INDEX": ""
                    }
                },
                "bwa": {
                    "OPTIONS":
                    {
                        "INDEX": "-a bwtsw",
                        "MAP": ""
                    }
                },
                "minimap": {
                    "OPTIONS":
                    {
                        "INDEX": "-k 15",
                        "MAP": "-ax sr -ub -Y -L --MD"
                    }
                }
            }
        }
    },
    "COUNTING": {
        "FEATURES": {
            "CDS": "gene_id"
        },
        "TOOLS": {
            "countreads": "featureCounts"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "countreads": {
                        "OPTIONS":
                        {
                            "COUNT" : "-f --fraction -O -M"
                        }
                    }
                }
            },
            "KO": {
                "countreads": {
                    "OPTIONS":
                    {
                        "COUNT" : "-f --fraction -O -M"
                    }
                }
            }
        }
    },
    "TRACKS": {
        "TOOLS": {
            "ucsc": "ucsc"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "ucsc": {
                        "OPTIONS":
                        {
                            "UCSC": "-n Eco_compare_Mapping -s dm6_st -l UCSC_DM6_compare_Mapping -b UCSC_dm6_star1"
                        }
                    }
                }
            },
            "KO": {
                "ucsc": {
                    "OPTIONS":
                    {
                        "UCSC": "-n Eco_paired_Mapping -s dm6_p -l UCSC_DM6_paired_Mapping -b UCSC_dm6_pstar1"
                    }
                }
            }
        }
    },
#DE Analysis
    "DE": {
        "TOOLS" :
        {
            "deseq2" : "Analysis/DE/DESEQ2.R",
            "edger" : "Analysis/DE/EDGER.R"
        },
        "COMPARABLE" :
        {
        },
        "EXCLUDE": [
        ],
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "deseq2": {
                        "OPTIONS":
                        {
                            "COUNT": "-O -t CDS -g gene_id",  # Options for read counting, independent of COUNTS workflows
                            "DE": ""  # Options for DE Analysis
                        }
                    },
                    "edger": {
                        "OPTIONS":
                        {
                            "COUNT": "-O -t CDS -g gene_id",  # Options for read counting, independent of COUNTS workflows
                            "DE": ""  # Options for DE Analysis
                        }
                    }
                }
            },
            "KO": {
                "deseq2": {
                    "OPTIONS":
                    {
                        "COUNT": "-O -t CDS -g gene_id",  # Options for read counting, independent of COUNTS workflows
                        "DE": ""  # Options for DE Analysis
                    }
                },
                "edger": {
                    "OPTIONS":
                    {
                        "COUNT": "-O -t CDS -g gene_id",  # Options for read counting, independent of COUNTS workflows
                        "DE": ""  # Options for DE Analysis
                    }
                }
            }
        }
    },
#DEU Analysis options
    "DEU": {
        "TOOLS" :
        {
            "dexseq" : "Analysis/DEU/DEXSEQ.R",  # Very specific, needs more tweaking for test data
            "edger" : "Analysis/DEU/EDGER.R"
        },
        "COMPARABLE" :
        {
        },
        "EXCLUDE": [
        ],
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "dexseq": {
                        "OPTIONS":
                        {
                            "COUNT": "-f -O -t exon -g gene_id",  # Options for read counting, independent of COUNTS workflows
                            "DEU": ""  # Options for DEU Analysis
                        }
                    },
                    "edger": {
                        "OPTIONS":
                        {
                            "COUNT": "-f -O -t CDS -g gene_id",  # Options for read counting, independent of COUNTS workflows
                            "DEU": ""  # Options for DEU Analysis
                        }
                    }
                }
            },
            "KO": {
                "dexseq": {
                    "OPTIONS":
                    {
                        "COUNT": "-f -O -t exon -g gene_id",  # Options for read counting, independent of COUNTS workflows
                        "DEU": ""  # Options for DEU Analysis
                    }
                },
                "edger": {
                    "OPTIONS":
                    {
                        "COUNT": "-f -O -t CDS -g gene_id",  # Options for read counting, independent of COUNTS workflows
                        "DEU": ""  # Options for DEU Analysis
                    }
                }
            }
        }
    },
#DAS Analysis options
    "DAS": {
        "TOOLS" :
        {
            "diego" : "Analysis/DAS/DIEGO.py",
            "edger" : "Analysis/DAS/EDGER.R"
        },
        "COMPARABLE" :
        {
        },
        "EXCLUDE": [
        ],
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "diego": {
                        "OPTIONS":
                        {
                            "COUNT": "-f -O -t CDS -g gene_id",  # Options for read counting, independent of COUNTS workflows
                            "DAS": "-c 10 -d 3 -q 0.01 -z 1.0 -r"  # Options for DAS Analysis
                        }
                    },
                    "edger": {
                        "OPTIONS":
                        {
                            "COUNT": "-f -O -t CDS -g gene_id",  # Options for read counting, independent of COUNTS workflows
                            "DAS": ""  # Options for DAS Analysis
                        }
                    }
                }
            },
            "KO": {
                "diego": {
                    "OPTIONS":
                    {
                        "COUNT": "-f -O -t CDS -g gene_id",  # Options for read counting, independent of COUNTS workflows
                        "DAS": "-c 10 -d 3 -q 0.01 -z 1.0 -r"  # Options for DAS Analysis
                    }
                },
                "edger": {
                    "OPTIONS":
                    {
                        "COUNT": "-f -O -t CDS -g gene_id",  # Options for read counting, independent of COUNTS workflows
                        "DAS": ""  # Options for DAS Analysis
                    }
                }
            }
        }
    },
    "DTU": {
        "TOOLS" :
        {
            "dexseq" : "Analysis/DTU/DEXSEQ.R",
            "drimseq" : "Analysis/DTU/DRIMSEQ.R"
        },
        "COMPARABLE" :
        {
        },
        "EXCLUDE": [
        ],
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "dexseq": {
                        "REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
                        "ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
                        "DECOY": "GENOMES/Ecoli/salmon_decoy",
                        "OPTIONS":
                        {
                            "INDEX": "",
                            "QUANT": "",  # Options for read counting, independent of COUNTS workflows
                            "DTU": "min_samps_feature_expr = 1, min_feature_expr = .1, min_samps_gene_expr = 1, min_gene_expr = 1"  # Options for DE Analysis
                        }
                    },
                    "drimseq": {
                        "REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
                        "ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
                        "DECOY": "GENOMES/Ecoli/salmon_decoy",
                        "OPTIONS":
                        {
                            "INDEX": "",
                            "QUANT": "",  # Options for read counting, independent of COUNTS workflows
                            "DTU": "min_samps_feature_expr = 1, min_feature_expr = .1,  min_samps_gene_expr = 1, min_gene_expr = 1"  # Options for DE Analysis
                        }
                    }
                }
            },
            "KO": {
                "dexseq": {
                    "REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
                    "ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
                    "DECOY": "GENOMES/Ecoli/salmon_decoy",
                    "OPTIONS":
                    {
                        "INDEX": "",
                        "QUANT": "",  # Options for read counting, independent of COUNTS workflows
                        "DTU": "min_samps_feature_expr = 1, min_feature_expr = .1,  min_samps_gene_expr = 1, min_gene_expr = 1"  # Options for DE Analysis
                    }
                },
                "drimseq": {
                    "REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
                    "ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
                    "DECOY": "GENOMES/Ecoli/salmon_decoy",
                    "OPTIONS":
                    {
                        "INDEX": "",  
                        "QUANT": "", # Options for read counting, independent of COUNTS workflows
                        "DTU": "min_samps_feature_expr = 1, min_feature_expr = .1,  min_samps_gene_expr = 1, min_gene_expr = 1"  # Options for DE Analysis
                    }
                }
            }
        }
    },
#PEAKS Analysis options
    "PEAKS": {
        "TOOLS" :
        {
            "peaks" : "peaks",
            "scyphy" : "Piranha"
        },
        "COMPARABLE" :
        {
            "SRR16324019": "SRR16324018",
            "SRR16324014": "SRR16324015"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "peaks":{
                        "OPTIONS":
                        {
                            "PREPROCESS": "",
                            "FINDPEAKS": "-l 0.6 -t 1 -w 5 -r 0.8 -c 3 -a 3"
                        }
                    },
                    "scyphy":{
                        "OPTIONS":
                        {
                            "PREPROCESS": "",
                            "FINDPEAKS": "-b 100"
                        }
                    }
                }
            },
            "KO": {
                "peaks":{
                    "OPTIONS":
                    {
                        "PREPROCESS": "",
                        "FINDPEAKS": "-l 0.6 -t 1 -w 5 -r 0.8 -c 3 -a 3"
                    }
                },
                "scyphy":{
                    "OPTIONS":
                    {
                        "PREPROCESS": "",
                        "FINDPEAKS": "-b 100"
                    }
                }
            }
        }
    }
}

Note that “SETTINGS” now contains “GROUPS” which hold identifiers for the DE/DEU/DAS/DTU stages that will be used to define “COMPARABLES” that tell MONSDA which samples should be treated as replicates and compared to which other samples. By default this will make an all-vs-all comparison, in our case ctrl-vs-ko. Keeping this “GROUPS” setting separated from the condition-tree makes it possible to mix samples for different conditions for all pre- and processing steps and separating them for postprocessing without further struggle. This means that in this simple case we could have mixed all samples under one condition, as all other tool options stay the same and reference and annotation do not differ and would still be able to split by “GROUPS” for the DE/DEU/DAS steps. However, sometimes we need to compare data that has to be processed differently, e.g. mixing paired and single-ended reads, so we can apply different settings as needed for all workflow steps by simply splitting our condition-tree accordingly. Another addition to the “SETTINGS” key is the “IP” part. This is intended to be combined with the PEAKS workflow and tools that have certain preprocessing step requirements. For example our novel “scyphy” (Software for cyPhyRNA-Seq) workflow can work on 5-prime and 3-prime end enriched data, which can be indicated as “iCLIP” and “revCLIP” respectively. For this tutorial we will simulate “iCLIP” and set the corresponding key.

Somewhat special is the DTU step. Here we use the pseudo-mapper salmon to quantify expression of transcripts. These transcripts are first defined in the preprocessing part of this tutorial chapter. As we can see the REFERENCE and ANNOTATION differ from what is listed in SETTINGS, so we make use of the ability of MONSDA to define those settings on a per-workflow basis. This will ALWAYS over-rule the settings made in SETTINGS. So we simply define the REFERENCE and ANNOTATION key for each tool within the regular tool setting part of the config and this will make sure that these files are used instead of the globally defined ones.

Starting the run with 12 cores (defining more will be capped by the config file as MAXTHREADS is set to 12, I don’t have to tell you that you can change that accordingly)

monsda -j 12 -c ${CONDA_PREFIX}/share/MONSDA/configs/tutorial_exhaustive.json --directory ${PWD}

Will start the run in the current directory and generate a “FASTQ” sub-directory containing the downloaded sample, a “GENOME/Ecoli/INDICES” directory containing the built indices, including the one built for salmon later on, a “QC” directory containing all FASTQC reports and MULTIQC output, a “TRIMMED_FASTQ” directory for trimgalore and cutadapt output, a “DEDUP” directory for umi_tools (runs before trimming and after mapping) and picard (runs after mapping) output and a “MAPPING” directory containing the mapped files. Furthermore, “DE/DEU/DAS/DTU” directories will be created which will hold output from counting with FeatureCounts (or salmon for DTU) and DE/DEU/DAS/DTU input and output from EDGER, DESeq2, Diego and DrimSeq respectively. Again, MONSDA will create a “LOG” directory containing it’s own log, as well as logs of all executed jobs and again a “JOBS” directory for command-line calls.

A successful run will show the message ‘Workflow finished, no error’. Be aware that this is indeed an exhaustive workflow and will require a decent amount of disk-space, memory and compute-time, depending on the hardware at your disposal.