MONSDA

Installation

Install MONSDA via conda or pip

To install via conda/mamba in an environment named ‘monsda’ simply run

mamba create -n monsda -c bioconda -c conda-forge monsda

To install via pip you first need to create the MONSDA environment as found in the envs directory of this repository (simply clone with git clone) like so:

mamba env create -n monsda -f MONSDA/envs/monsda.yaml

The envs directory holds all the environments needed to run the pipelines in the workflows directory; these will be installed automatically alongside MONSDA.

Then activate the MONSDA environment and run pip:

conda activate monsda
pip install MONSDA

Install from source

Simply clone this repository with

git clone git@github.com:jfallmann/MONSDA.git

You can then install dependencies as described for pip installation and manually run setup.py.

Be aware that MONSDA is version-dependent: config files can only be run with the MONSDA version they specify, which guarantees reproducibility by conserving dependencies and environments.
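
If you need to reproduce an analysis with a specific version, you can pin it at install time; a minimal sketch (the version number below is only an example):

mamba create -n monsda -c bioconda -c conda-forge monsda=1.0.0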

First Steps

MONSDA acts as a wrapper around Snakemake or Nextflow based on a user-defined config.json file. This config.json holds all the information needed to run the jobs and will be parsed by MONSDA and split into independent sub-configs that can later be found in the directory SubSnakes or SubFlows, respectively. Command-line execution calls are stored in the directory JOBS, so users can manipulate those or rerun them manually as needed. By default, however, MONSDA will run those jobs automatically, either locally or through Snakemake’s or Nextflow’s integrated cluster interfaces.

To successfully run an analysis pipeline, a few steps have to be followed:
  • Install MONSDA either via bioconda or pip following the instructions in Installation

  • Directory structure: The structure for the directories is dictated by The Condition-Tree in the config file

  • Config file: This is the central part of a MONSDA run. Based on The config-file, MONSDA will determine processing steps and generate corresponding config and workflow files to run each subworkflow until all processing steps are done.

In general it is necessary to write a configuration file containing information on paths, files to process, and settings beyond defaults for mapping and other tools. The template on which analysis is based can be found in the config directory and will be explained in detail later.

To create a working environment for this repository, please install the MONSDA.yaml environment (if not installed via bioconda) as found in the envs directory like so:

conda env create -n monsda -f envs/MONSDA.yaml

The envs directory holds all the environments needed to run the pipelines in the workflows directory; these will be installed automatically when needed.

For fast resolution of conda packages, we recommend mamba.

For distribution of jobs one can either rely on local hardware, use scheduling software like SLURM or SGE, or follow any other integration in Snakemake or Nextflow, but be aware that most of these have not been tested with this repository and usually require additional system-dependent setup and configuration.

This manual will only show examples on local and SLURM usage.

Planning a project run

Before you set up a MONSDA project, you should take a few things into consideration to make sure that MONSDA “understands” your experiment. You should also have a good overview of your data and ideally decide in advance which analysis steps you want to run. Although it is possible to run MONSDA sequentially, building up a config step by step, this will lead to some workflows being run multiple times, as a unique “tool-key” is defined for the set of enabled workflow steps and configured tools; more information on that can be found in The Condition-Tree.

To be well prepared, go through the following steps. Afterwards you will run the Configurator, which helps you to set up a MONSDA project.

STEP 1: What do I want to do exactly?

  • What is the experimental design? Which conditions do I have?

    If you have data from an experiment and it is necessary to distinguish between conditions, MONSDA represents this as a condition-tree. Make sure you know how to transfer your experimental design to the condition-tree. An example could be a simple ‘knockout’ experiment. You would transfer it as follows:

                          ┌──Condition1: "Wild-Type"
    Knockout-Experiment ──┤
                          └──Condition2: "Knockout"
    

    Your experimental design can be more complex. To get a better understanding of the concept behind the condition-tree, see The Condition-Tree; a minimal JSON sketch of this tree follows after this list.

  • What kind of analysis do I want to run? Which processing steps and tools do I need?

    You probably have an idea of the workflow you want to create. You should write down which processing steps you want to run. MONSDA already offers a range of processing tools. Jump to Workflow Overview for an overview of which tools you can use for each workflow step.

  • Which settings should the tools run with?

    Depending on your data and your specific analysis, most of the tools must be given specific settings or options. MONSDA does not try to guess or recommend which settings might work best for you; it will run all tools with their default settings. The only exception to that rule is a “paired” or “stranded” setting, which is automatically derived from the config and resolved for each tool where applicable. Of course it is possible to adjust the settings afterwards, but this requires deleting generated results and is more tedious than thinking about settings beforehand.

    It can be helpful to test your settings by hand before you integrate them into the MONSDA configuration; however, you can always re-run MONSDA after cleaning up results and changing the configuration. To better understand how MONSDA handles tool options, have a look at the config-file.
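
To make the knockout example above concrete, the following minimal sketch shows how such a tree could look inside the SETTINGS block of the config; the experiment and sample names are placeholders:

{
    "SETTINGS": {
        "KnockoutExperiment": {
            "Wild-Type": {
                "SAMPLES": ["wt_sample_1", "wt_sample_2"]
            },
            "Knockout": {
                "SAMPLES": ["ko_sample_1", "ko_sample_2"]
            }
        }
    }
}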

STEP 2: Keep the overview

You may want to do a standard analysis. Most likely you will need a genome FASTA file and annotation files in addition to your samples. Maybe there are additional files your workflow needs.

So before you start the Configurator and set up the MONSDA project, make sure you have organized your data well and know where to find things. The Configurator will create a project folder with softlinks to your data. Reorganizing your data later means fixing those links, which is avoidable extra work.

If you eventually share your workflow and configuration, we assume that you also share all input data including genome and annotation files, or at least links to the original download location, so it makes sense to keep that information.

STEP 3: Hardware issues

MONSDA can be run on different platforms and allows you to control multithreading. If you want to run it on SLURM, have a look at Run on Slurm. Always make sure you know how many threads you can run on your machine and keep an eye on memory consumption. For some workflow steps like mapping, memory can be a bottleneck, so best check your SLURM profile and set caps on memory usage.

Last but not least, processing results will be saved in your project folder. This can take up a lot of disk space, so make sure you have enough free space.
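
To get a quick overview of your resources before a run, standard shell tools are sufficient:

nproc     # number of available CPU threads
free -h   # available memory
df -h .   # free disk space in the project folder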

More details on input and output formats and files can be found in the Workflow Overview.

Configure MONSDA

MONSDA operates from a project folder containing a config.json file.
To enable easy setup and configuration, we provide the executable Configurator, which can be called via monsda_configure.
The Configurator works as an interactive terminal user interface.
It supports you with three main tasks:
  1. Setting up an initial project folder

  2. Creating the configuration file

  3. Modifying existing configuration files easily


Run the Configurator

To run the Configurator, the MONSDA conda environment must be installed and activated (we assume you installed via conda in an environment named ‘monsda’).

conda activate monsda

Run the Configurator with

monsda_configure

To see further options, run the Configurator with the --help flag

monsda_configure --help

Main Functions

Create a new project or config-file

SELECT WORKFLOWS -> MAKE CONDITION-TREE -> ASSIGN SAMPLES -> SET WORKFLOWS -> SET THREADS

To create a new project or a new config-file, the Configurator will take you through all necessary steps. You will have to choose a working directory. Note that your project folder will grow with your results, so choose a place with enough space.

Modify an existing config-file

1. Add workflows

SELECT CONFIG -> SELECT ADDITIONAL WORKFLOWS -> SET WORKFLOWS

Select workflows not activated in an existing config-file. The Configurator will expand each condition. Afterwards you have to define settings for the new workflows.

2. Remove workflows

SELECT CONFIG -> SELECT REMOVABLE WORKFLOWS

The Configurator will show you all established workflows. After selecting the ones to be removed it will delete them from the config-file for each condition.

3. Add conditions

SELECT CONFIG -> MAKE CONDITION-TREE -> ASSIGN SAMPLES -> SET WORKFLOWS

You can add conditions in a similar way to how you created the condition-tree. Note, however, that you cannot add sub-conditions to existing leaves. The Configurator will expand the condition-tree for the settings-block and each workflow. Because the config-file now has new option fields, the Configurator will ask whether to copy existing workflow settings or create new ones.

4. Remove conditions

SELECT CONFIG -> SELECT REMOVABLE CONDITIONS

The Configurator will show you all conditions the condition-tree represents. After selecting one or several to be removed, it will delete them from the settings-block and from each workflow.


Interrupt Configuration

It can happen that the Configurator asks for entries you haven’t thought about yet. In this case you can interrupt the configuration and the Configurator will cache your entries in a temporary backup file called unfinished_config.pkl.

In most cases you can even just abort the script, but to guarantee clean re-entry you should type

exit

When you start the Configurator again later and it finds the unfinished_config.pkl in the current directory, it will provide the option to continue the session.

Note that the unfinished_config.pkl will always be overwritten. To avoid this, you can rename the file. You can then continue with the --session flag. Run the Configurator like this:

monsda_configure -s my_renamed_unfinished_config.pkl

Assistance in detail

Create Condition-Tree

============================================================

{
      "NewExperiment": {
            "wild-type": {
                  "day1": {},
                  "day2": {}
            },
            "knockout": {
                  "day1": {},
                  "day2": {}    <=(add sub-conditions here)
            }
      }
}

============================================================

MONSDA understands your experimental design by creating a condition-tree. The Configurator helps you to create it. To do this, the Configurator points to a condition in which you are allowed to add further sub-conditions. In this way you can create a nested condition-tree. Note that each leaf of this tree represents a separate condition. Later you will define the workflow settings for each of these conditions.

Sample Assignment:

  ============================================================

  {
      "NewExperiment": {
            "wild-type": {
                  "day1": {
                        "SAMPLES": [
                              "Sample_1",
                              "Sample_2"
                        ]
                  },
                  "day2": {}           <-
            },
            "knockout": {
                  "day1": {},
                  "day2": {}
            }
      }
  }

============================================================

     1  >  Sample_1     in  path/to/knockout/samples
     2  >  Sample_2     in  path/to/knockout/samples
     3  >  Sample_3     in  path/to/knockout/samples
     4  >  Sample_4     in  path/to/knockout/samples
     5  >  Sample_a     in  path/to/wild-type/samples
     6  >  Sample_b     in  path/to/wild-type/samples
     7  >  Sample_c     in  path/to/wild-type/samples
     8  >  Sample_d     in  path/to/wild-type/samples

The Configurator helps you to assign samples to conditions. If you have activated the FETCH workflow, it will ask you for SRA Accession Numbers. Otherwise you have to add paths where your samples are stored. The Configurator finds every file ending in “.fastq.gz” and presents it for assignment. At the same time, the condition-tree is displayed with an arrow indicating the condition to which samples are assigned.

Make Settings for Conditions

============================================================

  {
        "NewExperiment": {
              "wild-type": {
                    "day1": {},           <-  1
                    "day2": {},           <-  1
                    "day3": {}            <-  1
              },
              "knockout": {
                    "day1": {},           <-    2
                    "day2": {},           <-    2
                    "day3": {}            <-    2
              }
        }
  }

============================================================

MONSDA can run the same workflow with different settings, differentiated by conditions. Therefore the config-file needs workflow settings for each condition you created. However, you will often use the same settings. To avoid repetition during config creation, the Configurator offers to set several conditions at once. In the example shown above, you would go through two setting loops: all sub-conditions of “wild-type” share one set of settings and all sub-conditions of “knockout” share another. To change which conditions are set simultaneously, you can loop through the possible selections by pressing enter.

Start a pipeline run

Snakemake

Activate the MONSDA conda environment and run

monsda --help

to see the help and available options that will be passed through to Snakemake.

To start a job with Snakemake, which is the default, run

monsda -j NUMBER_OF_CORES -c YOUR_CONFIG.json --directory ${PWD}

or add additional arguments for Snakemake as you see fit. We highly recommend setting mamba as the conda frontend and fixing a directory to store environments (here conda_envs):

--conda-frontend mamba --conda-prefix path_to_conda_envs
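
Putting this together, a local Snakemake-mode call could look like the following sketch; conda_envs is just an example path:

monsda -j NUMBER_OF_CORES -c YOUR_CONFIG.json --directory ${PWD} --conda-frontend mamba --conda-prefix ${PWD}/conda_envs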

Nextflow

To run MONSDA in Nextflow mode just add ‘--Nextflow’

monsda --Nextflow -j NUMBER_OF_CORES -c YOUR_CONFIG.json --directory ${PWD}

As with Snakemake additional arguments for Nextflow can be added and will be passed through.

Run on Slurm

Snakemake

You can either use the SLURM profile adapted from Snakemake-Profiles that comes with this repository, or create one manually, for example with the cookiecutter example in the Snakemake-Profiles repository. Adapt the profile as needed and execute

monsda -j ${cpus} --configfile ${config.json} --directory ${PWD} --profile ${path_to_slurm_profile}

Further adaptations like grouping of jobs and advanced configs for rule-based performance increases will be tackled in future releases of MONSDA.

Nextflow

Cluster configuration for Nextflow follows the descriptions in Nextflow-Executors and Nextflow-Profiles. To use SLURM as executor you can adapt the profile that comes with this repository and simply set

export NXF_EXECUTOR=slurm

in the environment before calling MONSDA.
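
For example, combining both (assuming the adapted SLURM profile is in place):

export NXF_EXECUTOR=slurm
monsda --Nextflow -j NUMBER_OF_CORES -c YOUR_CONFIG.json --directory ${PWD}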

TUTORIAL

Here we will show three MONSDA runs, from simple to most complex, explaining the configuration and how to run them in practice.

First, please create a working directory and change into it, then install MONSDA following Installation and activate the conda environment.

All examples are based on data from SRA, which you can either download beforehand or fetch via the “FETCH” workflow as demonstrated in the tutorial.
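
If you prefer to download the samples beforehand, a sketch using sra-tools could look like this; the FASTQ output directory mirrors the layout MONSDA generates, but the exact file naming MONSDA expects is an assumption here, and the FETCH workflow handles all of this automatically:

fasterq-dump SRR16324019 --split-files -O FASTQ   # download paired-end reads
gzip FASTQ/SRR16324019_1.fastq FASTQ/SRR16324019_2.fastq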

Genome and Annotation files need to be downloaded beforehand and can be found here: GENOME, GFF, GTF

Ideally you download those files into a sub-directory of your workspace named “GENOMES/Ecoli” and rename them to something simpler; in this tutorial we will refer to “GCF_000005845.2_ASM584v2_genomic” as “ecoli”. You also need to change the “.fna.gz” suffix of the genome file to “.fa.gz”, and we should be ready to go.
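
A possible setup, with the actual download step omitted as the URLs depend on the links above:

mkdir -p GENOMES/Ecoli && cd GENOMES/Ecoli
# download GENOME (.fna.gz), GFF and GTF from the links above, then rename:
mv GCF_000005845.2_ASM584v2_genomic.fna.gz ecoli.fa.gz
mv GCF_000005845.2_ASM584v2_genomic.gff.gz ecoli.gff.gz
mv GCF_000005845.2_ASM584v2_genomic.gtf.gz ecoli.gtf.gz
cd ../..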

All config files this tutorial is based on can be found in envs/monsda/share/MONSDA/configs of your local conda installation or the “configs” directory of the MONSDA repository.

A simple run

This run is based on a simple mapping only use-case. We have a single paired-end FASTQ file and want to map that. In this case we have no information about strandedness and ignore this setting.

The corresponding config file looks as follows:

{
    "WORKFLOWS": "FETCH,MAPPING",
    "BINS": "",
    "MAXTHREADS": "4",
    "VERSION": "1.0.0",
    "SETTINGS": {
        "SIMPLE": {
            "SAMPLES": [
                "SRR16324019"
            ],
            "SEQUENCING": "paired",
            "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
            "ANNOTATION": {
                "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
            }
        }
    },
    "FETCH": {
        "TOOLS" :
        {
            "sra" : "fasterq-dump"
        },
        "SIMPLE": {
            "sra": {
                "OPTIONS":
                {
                    "DOWNLOAD": ""
                }
            }
        }
    },
    "MAPPING": {
        "TOOLS": {
            "star": "STAR"
        },
        "SIMPLE": {
            "star": {
                "OPTIONS": {
                    "INDEX": "--genomeSAindexNbases 8",
                    "MAP": "--outSAMprimaryFlag AllBestScore       --outFilterMultimapNmax 20",
                    "EXTENSION": ""
                }
            }
        }
    }
}

This tells MONSDA version “1.0.0” to run the workflows “FETCH” and “MAPPING” on the sample “SRR16324019”.

“FETCH” will activate the environment “sra” and run the binary “fasterq-dump” with no extra options to fetch the sample from SRA.

“MAPPING” will then run the executable “STAR” in environment “star” on the downloaded FASTQ file. First it will check for existence of the needed INDEX file, and as no INDEX file is defined in the config it will try to generate one. For that it will use the REFERENCE “GENOMES/Ecoli/ecoli.fa.gz” as well as the ANNOTATION GTF “GENOMES/Ecoli/ecoli.gtf.gz”. GTF is always selected first and GFF is used as a fallback if no GTF is found. After successfully building the STAR index, MONSDA will start mapping with STAR in “paired” mode and applying a set of options as can be seen under the “MAP” key in section “MAPPING”->”SIMPLE”->”star”->”OPTIONS”. We do not define a pre- or suffix for output files, so “EXTENSION” can be skipped or left blank.

Starting the run with 4 cores (defining more will be capped by the config file as MAXTHREADS is set to 4)

monsda -j 4 -c ${CONDA_PREFIX}/share/MONSDA/configs/tutorial_quick.json --directory ${PWD}

Will start the run in the current directory and generate a “FASTQ” sub-directory containing the downloaded sample, a “GENOMES/Ecoli/INDICES” directory containing the built index and a “MAPPING” directory containing the mapped files. Furthermore, MONSDA will create a “LOG” directory containing its own log as well as logs of all executed jobs, and a “JOBS” directory containing the ‘monsda.commands’ file, which holds all command-line calls that MONSDA runs. A successful run will show the message ‘Workflow finished, no error’ at the end.

A more complex run

This slightly more complex use case involves multiple input files, two conditions (WT/KO) and a more or less standard DE analysis workflow. We also include a “dummylevel” that is a placeholder for settings or other subdivisions of the WT level, to demonstrate that MONSDA can work on condition-trees of differing depth.

Workflows include:

  • FETCH: Download from SRA

  • QC: FASTQC of input and output

  • TRIMMING: Adaptor removal with cutadapt/trimgalore

  • MAPPING: Read mapping with STAR, hisat2, bwa, segemehl3 and minimap2

  • DEDUP: Read deduplication with umi_tools and picard

  • DE: Differential Expression analysis with EdgeR and DESeq2

The more complex config for this analysis follows

{
    "WORKFLOWS": "FETCH, QC, TRIMMING, MAPPING, DEDUP, DE",
    "VERSION": "1.0.0",
    "BINS": "",
    "MAXTHREADS": "16",
    "SETTINGS": {
        "Ecoli": {
            "WT": {   
                "dummylevel": {       
                    "SAMPLES": [
                        "SRR16324019", 
                        "SRR16324018",
                        "SRR16324017"
                    ],
                    "GROUPS": [
                        "ctrl",
                        "ctrl",
                        "ctrl"
                    ],
                    "SEQUENCING": "paired",
                    "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
                    "ANNOTATION": {
                        "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                        "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
                    }
                }
            },
            "KO": {
                "SAMPLES": [
                    "SRR16324016", 
                    "SRR16324015",
                    "SRR16324014"
                ],
                "GROUPS": [
                    "ko",
                    "ko",
                    "ko"
                ],
                "SEQUENCING": "paired",
                "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
                "ANNOTATION": {
                    "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                    "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
                }
            }
        }
    },
    "FETCH": {
        "TOOLS" :
        {
            "sra" : "fasterq-dump"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "sra": {
                        "OPTIONS":
                        {
                            "DOWNLOAD": ""
                        }
                    }
                }
            },
            "KO": {
                "sra": {
                    "OPTIONS":
                    {
                        "DOWNLOAD": ""
                    }
                }
            }
        }
    },
    "QC": {
        "TOOLS" :
        {
            "fastqc" : "fastqc"
        },
        "Ecoli": {
            "KO": {
                "fastqc": {
                    "OPTIONS":
                    {
                        "QC": "",  #QC options here if any, KO is not required, will be resolved by rules
                        "MULTI": ""  #MultiQC options here if any, KO is not required, will be resolved by rules
                    }
                }
            },
            "WT": {
                "dummylevel": {       
                    "fastqc": {
                        "OPTIONS":
                        {
                            "QC": "",  #QC options here if any, KO is not required, will be resolved by rules
                            "MULTI": ""  #MultiQC options here if any, KO is not required, will be resolved by rules
                        }
                    }
                }
            }
        }
    },
    "TRIMMING": {
        "TOOLS" :
        {
            "trimgalore": "trim_galore",
            "cutadapt": "cutadapt"
        },
        "Ecoli": {
            "KO": {
                "trimgalore":{
                    "OPTIONS":
                    {
                        "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                    }
                },
                "cutadapt":{
                    "OPTIONS":
                    {
                        "TRIM": "-m 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                    }
                }
            },
            "WT": {
                "dummylevel": {       
                    "trimgalore":{
                        "OPTIONS":
                        {
                            "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                        }
                    },
                    "cutadapt":{
                        "OPTIONS":
                        {
                            "TRIM": "-m 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                        }
                    }
                }
            }
        }
    },
    "DEDUP": {
        "TOOLS": {
            "umitools": "umi_tools",
            "picard": "picard"
        },
        "Ecoli": {
            "KO":{
                "umitools":{
                    "OPTIONS":
                    {
                        "EXTRACT": "--extract-method string --bc-pattern AGANNNNACGT",
                        "DEDUP": "--extract-umi-method read_id"  # umitools dedup options
                    }
                },
               "picard":{
                    "OPTIONS":
                    {
                        "JAVA": "",
                        "DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
                    }
              }
            },
            "WT":{
                "dummylevel": {       
                    "umitools": {
                        "OPTIONS":
                        {
                            "EXTRACT": "--extract-method string --bc-pattern AGANNNNACGT",
                            "DEDUP": "--extract-umi-method read_id"  # umitools dedup options
                        }
                    },
                    "picard": {
                        "OPTIONS":
                        {
                            "JAVA": "",
                            "DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
                        }
                    }
                }
            }
        }
    },
    "MAPPING": {
        "TOOLS": {
            "star": "STAR",
            "segemehl3": "segemehl.x",
            "hisat2": "hisat2",
            "bwa": "bwa mem",
            "minimap": "minimap2"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {       
                    "star": {
                        "OPTIONS": {
                            "INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 13",
                            "MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
                        }
                    },
                    "segemehl3": {
                        "OPTIONS": {
                            "INDEX": "",
                            "MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
                        }
                    },
                    "hisat2": {
                        "OPTIONS":
                        {
                          "INDEX": "",
                          "MAP": ""
                        }
                    },
                    "bwa": {
                        "OPTIONS":
                        {
                            "INDEX": "-a bwtsw",
                            "MAP": ""
                        }
                    },
                    "minimap": {
                        "OPTIONS":
                        {
                            "INDEX": "-k 15",
                            "MAP": "-ax sr -ub -Y -L --MD"
                        }
                    }
                }
            },
            "KO": {
                "star": {
                    "OPTIONS":
                    {
                        "INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 13",
                        "MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
                    }
                },
                "segemehl3": {
                    "OPTIONS": {
                        "INDEX":  "",
                        "MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
                    }
                },
                "hisat2": {
                    "OPTIONS":
                    {
                      "INDEX": ""
                    }
                },
                "bwa": {
                    "OPTIONS":
                    {
                        "INDEX": "-a bwtsw",
                        "MAP": ""
                    }
                },
                "minimap": {
                    "OPTIONS":
                    {
                        "INDEX": "-k 15",
                        "MAP": "-ax sr -ub -Y -L --MD"
                    }
                }
            }
        }
    },
    "DE": {
        "TOOLS" :
        {
            "deseq2" : "Analysis/DE/DESEQ2.R",
            "edger" : "Analysis/DE/EDGER.R"
        },
        "COMPARABLE" :
        {
        },
        "EXCLUDE": [
        ],
        "Ecoli": {
            "WT": {
                "dummylevel": {       
                    "deseq2": {
                        "OPTIONS":
                        {
                            "COUNT": "-O",  # Options for read counting, independent of COUNTS workflows
                            "DE": ""  # Options for DE Analysis
                        }
                    },
                    "edger": {
                        "OPTIONS":
                        {
                            "COUNT": "-O",  # Options for read counting, independent of COUNTS workflows
                            "DE": ""  # Options for DE Analysis
                        }
                    }
                }
            },
            "KO": {
                "deseq2": {
                    "OPTIONS":
                    {
                        "COUNT": "-O",  # Options for read counting, independent of COUNTS workflows
                        "DE": ""  # Options for DE Analysis
                    }
                },
                "edger": {
                    "OPTIONS":
                    {
                        "COUNT": "-O",  # Options for read counting, independent of COUNTS workflows
                        "DE": ""  # Options for DE Analysis
                    }
                }
            }
        }
    }
}

Note that “SETTINGS” now also contains “GROUPS”, which holds identifiers for the DE stage and can be used to define “COMPARABLES” that tell MONSDA which samples should be treated as replicates and compared to which other samples. By default this will make an all-vs-all comparison, in our simple case ctrl-vs-ko. Keeping this “GROUPS” setting separate from the condition-tree makes it possible to mix samples of different conditions for all preprocessing and processing steps and separate them for postprocessing without further struggle. This means that in this simple case we could have put all samples under one condition, as all other tool options stay the same and reference and annotation do not differ, and we would still be able to split by “GROUPS” for the DE step. However, sometimes we need to compare data that has to be processed differently, e.g. mixing paired and single-ended reads, so we can apply different settings as needed for all workflow steps by simply splitting our condition-tree accordingly.
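
For illustration, an explicit comparison over the “GROUPS” identifiers could be sketched in the “COMPARABLE” block roughly as follows; the exact nesting is an assumption here, so consult the config-file documentation for the authoritative syntax:

"COMPARABLE":
{
    "CTRLvsKO": [["ctrl"], ["ko"]]
},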

Starting the run with 12 cores (defining more than 16 would be capped by the config file, as MAXTHREADS is set to 16)

monsda -j 12 -c ${CONDA_PREFIX}/share/MONSDA/configs/tutorial_de.json --directory ${PWD}

Will start the run in the current directory and generate a “FASTQ” sub-directory containing the downloaded samples, a “GENOMES/Ecoli/INDICES” directory containing the built index, a “QC” directory containing all FASTQC reports and MULTIQC output, a “TRIMMED_FASTQ” directory for trimgalore and cutadapt output, a “DEDUP” directory for umi_tools (runs before trimming and after mapping) and picard (runs after mapping) output, and a “MAPPING” directory containing the mapped files. Furthermore, a “DE” directory will be created which will hold output from counting with featureCounts and DE input and output from EDGER and DESeq2. Again, MONSDA will create a “LOG” directory containing its own log as well as logs of all executed jobs, and the “JOBS” directory with all command-line calls.

A successful run will show the message ‘Workflow finished, no error’ at the end.

Run it all

This exhaustive use case involves multiple input files, two conditions (WT/KO) and almost all workflows available. We also include a “dummylevel” that is a placeholder for settings or other subdivisions of the WT level, to demonstrate that MONSDA can work on condition-trees of differing depth.

Workflows include:

  • FETCH: Download from SRA

  • QC: FASTQC of input and output

  • TRIMMING: Adaptor removal with cutadapt/trimgalore

  • MAPPING: Read mapping with STAR, hisat2, bwa, segemehl3 and minimap2

  • DEDUP: Read deduplication with umi_tools and picard

  • DE: Differential Expression analysis with EdgeR and DESeq2

  • DEU: Differential Exon Usage analysis with EdgeR and DEXSeq

  • DAS: Differential Alternative Splicing analysis with EdgeR and DIEGO

  • DTU: Differential Transcript Usage analysis with DRIMSeq and DEXSeq

  • COUNTING: Read counting with featureCounts or quantification with Salmon

  • TRACKS: Generation of tracks for UCSC or other genome browsers

  • PEAKS: Analysis of ChIP-Seq, CLIP-Seq or cyPhyRNA-Seq peaks

Before we get to the config, some preparation is needed: for the DTU workflow we need an additional FASTA and GTF file covering transcripts. As our sample data is from E. coli, transcripts and genes can be treated as synonymous, so we can easily create the needed files from our already downloaded ones.

Create a transcript GTF file from the downloaded GTF by subsetting for gene features:

zcat ecoli.gtf.gz | head -n 3 > bla
zcat ecoli.gtf.gz | grep -P '\tgene\t' >> bla
cat bla | gzip > Ecoli_trans.gtf.gz
rm -f bla

We now fix the empty transcript_id tag by replacing it with the gene_id followed by a running number (e.g. “_1”):

zcat Ecoli_trans.gtf.gz |perl -wlane 'BEGIN{%ids}{if($_=~/^#/){print}else{$n=$F[9];$ids{$n}++;$F[11]=substr($n,0,-2)."_".$ids{$n}."\";";print join("\t",@F[0..7],join(" ",@F[8..$#F]))}}'|perl -F'\t' -wlane 'print $_;if($_ !~/^#/){print join("\t",@F[0..1])."\ttranscript\t".join("\t",@F[3..$#F])}'|gzip > Ecoli_trans_fix.gtf.gz

From this GTF we can now create a FASTA file by writing a helper BED file and using BEDtools to extract sequences from our genome FASTA file in a strand-specific manner. We then only need to clean up the name tag as follows:

zcat Ecoli_trans_fix.gtf.gz|grep -P '\ttranscript\t'|perl -wlane 'next if($_=~/^#/);($name=(split(/;/,$F[11]))[0])=~s/\"//g;print join("\t",$F[0],$F[3]-1,$F[4],$name,100,$F[6])' > Ecoli_trans.bed
bedtools getfasta -fi ecoli.fa -bed Ecoli_trans.bed -fo Ecoli_trans.fa -nameOnly -s
perl -wlane 'if($_=~/^>/){$_=~s/\(|\+|\-|\)//g;}print' Ecoli_trans.fa|gzip > Ecoli_trans.fa.gz

With these “Ecoli_trans.fa.gz” and “Ecoli_trans_fix.gtf.gz” files we can now also run the DTU workflow. The config file for the exhaustive run looks as follows:

{
    "WORKFLOWS": "FETCH, QC, TRIMMING, MAPPING, DEDUP, TRACKS, PEAKS, DE, DEU, DAS, DTU, COUNTING",
    "VERSION": "1.0.0",
    "BINS": "",
    "MAXTHREADS": "16",
    "SETTINGS": {
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "SAMPLES": [
                        "SRR16324019", 
                        "SRR16324018",
                        "SRR16324017"
                    ],
                    "GROUPS": [
                        "ctrl",
                        "ctrl",
                        "ctrl"
                    ],
                    "SEQUENCING": "paired",
                    "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
                    "ANNOTATION": {
                        "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                        "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
                    }
                }
            },
            "KO": {
                "SAMPLES": [
                    "SRR16324016", 
                    "SRR16324015",
                    "SRR16324014"
                ],
                "GROUPS": [
                    "ko",
                    "ko",
                    "ko"
                ],
                "SEQUENCING": "paired",
                "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
                "ANNOTATION": {
                    "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                    "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
                }
            }
        }
    },
    "FETCH": {
        "TOOLS" :
        {
            "sra" : "fasterq-dump"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "sra": {
                        "OPTIONS":
                        {
                            "DOWNLOAD": ""
                        }
                    }
                }
            },
            "KO": {
                "sra": {
                    "OPTIONS":
                    {
                        "DOWNLOAD": ""
                    }
                }
            }
        }
    },
    "QC": {
        "TOOLS" :
        {
            "fastqc" : "fastqc"
        },
        "Ecoli": {
            "KO": {
                "fastqc": {
                    "OPTIONS":
                    {
                        "QC": "",  #QC options here if any, KO is not required, will be resolved by rules
                        "MULTI": ""  #MultiQC options here if any, KO is not required, will be resolved by rules
                    }
                }
            },
            "WT": {
                "dummylevel": {
                    "fastqc": {
                        "OPTIONS":
                        {
                            "QC": "",  #QC options here if any, KO is not required, will be resolved by rules
                            "MULTI": ""  #MultiQC options here if any, KO is not required, will be resolved by rules
                        }
                    }
                }
            }
        }
    },
    "TRIMMING": {
        "TOOLS" :
        {
            "trimgalore": "trim_galore",
            "cutadapt": "cutadapt"
        },
        "Ecoli": {
            "KO": {
                "trimgalore":{
                    "OPTIONS":
                    {
                        "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                    }
                },
                "cutadapt":{
                    "OPTIONS":
                    {
                        "TRIM": "-m 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                    }
                }
            },
            "WT": {
                "dummylevel": {
                    "trimgalore":{
                        "OPTIONS":
                        {
                            "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                        }
                    },
                    "cutadapt":{
                        "OPTIONS":
                        {
                            "TRIM": "-m 8 -e 0.15"  # trimming options here, --KO is not required, will be resolved by rules
                        }
                    }
                }
            }
        }
    },
    "DEDUP": {
        "TOOLS": {
            "umitools": "umi_tools",
            "picard": "picard"
        },
        "Ecoli": {
            "KO":{
                "umitools":{
                    "OPTIONS":
                    {
                        "EXTRACT": "--extract-method string --bc-pattern AGANNNNACGT",
                        "DEDUP": "--extract-umi-method read_id"  # umitools dedup options
                    }
                },
               "picard":{
                    "OPTIONS":
                    {
                        "JAVA": "",
                        "DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
                    }
              }
            },
            "WT":{
                "dummylevel": {
                    "umitools": {
                        "OPTIONS":
                        {
                            "EXTRACT": "--extract-method string --bc-pattern AGANNNNACGT",
                            "DEDUP": "--extract-umi-method read_id"  # umitools dedup options
                        }
                    },
                   "picard": {
                     "OPTIONS":
                        {
                            "JAVA": "",
                            "DEDUP": "--VALIDATION_STRINGENCY SILENT"# umitools dedup options
                        }
                    }
                }
            }
        }
    },
    "MAPPING": {
        "TOOLS": {
            "star": "STAR",
            "segemehl3": "segemehl.x",
            "hisat2": "hisat2",
            "bwa": "bwa mem",
            "minimap": "minimap2"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "star": {
                        "OPTIONS": {
                            "INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 8",
                            "MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
                        }
                    },
                    "segemehl3": {
                        "OPTIONS": {
                            "INDEX": "",
                            "MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
                        }
                    },
                    "hisat2": {
                        "OPTIONS":
                        {
                          "INDEX": ""
                        }
                    },
                    "bwa": {
                        "OPTIONS":
                        {
                            "INDEX": "-a bwtsw",
                            "MAP": ""
                        }
                    },
                    "minimap": {
                        "OPTIONS":
                        {
                            "INDEX": "-k 15",
                            "MAP": "-ax sr -ub -Y -L --MD"
                        }
                    }
                }
            },
            "KO": {
                "star": {
                    "OPTIONS":
                    {
                        "INDEX": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 8",
                        "MAP": "--sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 20"
                    }
                },
                "segemehl3": {
                    "OPTIONS": {
                        "INDEX":  "",
                        "MAP": "--MEOP --splits --accuracy 95 --differences 1 --evalue 5 --hitstrategy 1 --minsize 15 --minfraglen 20"
                    }
                },
                "hisat2": {
                    "OPTIONS":
                    {
                      "INDEX": ""
                    }
                },
                "bwa": {
                    "OPTIONS":
                    {
                        "INDEX": "-a bwtsw",
                        "MAP": ""
                    }
                },
                "minimap": {
                    "OPTIONS":
                    {
                        "INDEX": "-k 15",
                        "MAP": "-ax sr -ub -Y -L --MD"
                    }
                }
            }
        }
    },
    "COUNTING": {
        "FEATURES": {
            "CDS": "ID"
        },
        "TOOLS": {
            "countreads": "featureCounts"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "countreads": {
                        "OPTIONS":
                        {
                            "COUNT" : "-f --fraction -O -M"
                        }
                    }
                }
            },
            "KO": {
                "countreads": {
                    "OPTIONS":
                    {
                        "COUNT" : "-f --fraction -O -M"
                    }
                }
            }
        }
    },
    "TRACKS": {
        "TOOLS": {
            "ucsc": "ucsc"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "ucsc": {
                        "OPTIONS":
                        {
                            "UCSC": "-n Eco_compare_Mapping -s dm6_st -l UCSC_DM6_compare_Mapping -b UCSC_dm6_star1"
                        }
                    }
                }
            },
            "KO": {
                "ucsc": {
                    "OPTIONS":
                    {
                        "UCSC": "-n Eco_paired_Mapping -s dm6_p -l UCSC_DM6_paired_Mapping -b UCSC_dm6_pstar1"
                    }
                }
            }
        }
    },
#DE Analysis
    "DE": {
        "TOOLS" :
        {
            "deseq2" : "Analysis/DE/DESEQ2.R",
            "edger" : "Analysis/DE/EDGER.R"
        },
        "COMPARABLE" :
        {
        },
        "EXCLUDE": [
        ],
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "deseq2": {
                        "OPTIONS":
                        {
                            "COUNT": "-O",  # Options for read counting, independent of COUNTS workflows
                            "DE": ""  # Options for DE Analysis
                        }
                    },
                    "edger": {
                        "OPTIONS":
                        {
                            "COUNT": "-O",  # Options for read counting, independent of COUNTS workflows
                            "DE": ""  # Options for DE Analysis
                        }
                    }
                }
            },
            "KO": {
                "deseq2": {
                    "OPTIONS":
                    {
                        "COUNT": "-O",  # Options for read counting, independent of COUNTS workflows
                        "DE": ""  # Options for DE Analysis
                    }
                },
                "edger": {
                    "OPTIONS":
                    {
                        "COUNT": "-O",  # Options for read counting, independent of COUNTS workflows
                        "DE": ""  # Options for DE Analysis
                    }
                }
            }
        }
    },
#DEU Analysis options
    "DEU": {
        "TOOLS" :
        {
            "dexseq" : "Analysis/DEU/DEXSEQ.R",
            "edger" : "Analysis/DEU/EDGER.R"
        },
        "COMPARABLE" :
        {
        },
        "EXCLUDE": [
        ],
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "dexseq": {
                        "OPTIONS":
                        {
                            "COUNT": "-f -O",  # Options for read counting, independent of COUNTS workflows
                            "DEU": ""  # Options for DEU Analysis
                        }
                    },
                    "edger": {
                        "OPTIONS":
                        {
                            "COUNT": "-f -O",  # Options for read counting, independent of COUNTS workflows
                            "DEU": ""  # Options for DEU Analysis
                        }
                    }
                }
            },
            "KO": {
                "dexseq": {
                    "OPTIONS":
                    {
                        "COUNT": "-f -O",  # Options for read counting, independent of COUNTS workflows
                        "DEU": ""  # Options for DEU Analysis
                    }
                },
                "edger": {
                    "OPTIONS":
                    {
                        "COUNT": "-f -O",  # Options for read counting, independent of COUNTS workflows
                        "DEU": ""  # Options for DEU Analysis
                    }
                }
            }
        }
    },
#DAS Analysis options
    "DAS": {
        "TOOLS" :
        {
            "diego" : "diego.py",
            "edger" : "Analysis/DAS/EDGER.R"
        },
        "COMPARABLE" :
        {
        },
        "EXCLUDE": [
        ],
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "diego": {
                        "OPTIONS":
                        {
                            "COUNT": "-f -O",  # Options for read counting, independent of COUNTS workflows
                            "DAS": "-c 10 -d 3 -q 0.01 -z 1.0 -r"  # Options for DAS Analysis
                        }
                    },
                    "edger": {
                        "OPTIONS":
                        {
                            "COUNT": "-f -O",  # Options for read counting, independent of COUNTS workflows
                            "DAS": ""  # Options for DAS Analysis
                        }
                    }
                }
            },
            "KO": {
                "diego": {
                    "OPTIONS":
                    {
                        "COUNT": "-f -O",  # Options for read counting, independent of COUNTS workflows
                        "DAS": "-c 10 -d 3 -q 0.01 -z 1.0 -r"  # Options for DAS Analysis
                    }
                },
                "edger": {
                    "OPTIONS":
                    {
                        "COUNT": "-f -O",  # Options for read counting, independent of COUNTS workflows
                        "DAS": ""  # Options for DAS Analysis
                    }
                }
            }
        }
    },
    "DTU": {
        "TOOLS" :
        {
            "dexseq" : "Analysis/DTU/DEXSEQ.R",
            "drimseq" : "Analysis/DTU/DRIMSEQ.R"
        },
        "COMPARABLE" :
        {
        },
        "EXCLUDE": [
        ],
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "dexseq": {
                        "REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
                        "ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
                        "OPTIONS":
                        {
                            "INDEX": "",
                            "QUANT": "",  # Options for read counting, independent of COUNTS workflows
                            "DTU": ""  # Options for DE Analysis
                        }
                    },
                    "drimseq": {
                        "REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
                        "ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
                        "OPTIONS":
                        {
                            "INDEX": "",
                            "QUANT": "",  # Options for read counting, independent of COUNTS workflows
                            "DTU": ""  # Options for DE Analysis
                        }
                    }
                }
            },
            "KO": {
                "dexseq": {
                    "REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
                    "ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
                    "OPTIONS":
                    {
                        "INDEX": "",
                        "QUANT": "",  # Options for read counting, independent of COUNTS workflows
                        "DTU": ""  # Options for DE Analysis
                    }
                },
                "drimseq": {
                    "REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
                    "ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
                    "OPTIONS":
                    {
                        "INDEX": "",  # Options for read counting, independent of COUNTS workflows
                        "QUANT": "",
                        "DTU": ""  # Options for DE Analysis
                    }
                }
            }
        }
    },
#PEAKS Analysis options
    "PEAKS": {
        "TOOLS" :
        {
            "macs" : "macs2",
            "peaks" : "peaks",
            "scyphy" : "Piranha"
        },
        "COMPARABLE" :
        {
            "SRR16324019": "SRR16324018",
            "SRR16324014": "SRR16324015"
        },
        "Ecoli": {
            "WT": {
                "dummylevel": {
                    "macs": {
                        "OPTIONS":
                        {
                            "FINDPEAKS": "--nomodel --extsize 28"
                        }
                    },
                    "peaks":{
                        "OPTIONS":
                        {
                            "PREPROCESS": "",
                            "FINDPEAKS": "-l 0.6 -t 1 -w 5 -r 0.8 -c 3 -a 3"
                        }
                    },
                    "scyphy":{
                        "OPTIONS":
                        {
                            "PREPROCESS": "",
                            "FINDPEAKS": "-b 100"
                        }
                    }
                }
            },
            "KO": {
                "macs": {
                    "OPTIONS":
                    {
                        "FINDPEAKS": "--nomodel --extsize 28"
                    }
                },
                "peaks":{
                    "OPTIONS":
                    {
                        "PREPROCESS": "",
                        "FINDPEAKS": "-l 0.6 -t 1 -w 5 -r 0.8 -c 3 -a 3"
                    }
                },
                "scyphy":{
                    "OPTIONS":
                    {
                        "PREPROCESS": "",
                        "FINDPEAKS": "-b 100"
                    }
                }
            }
        }
    }
}

Same as for the more complex run, we can see that “SETTINGS” contains “GROUPS”, which holds identifiers for the DE/DEU/DAS/DTU stages that will be used to define “COMPARABLES” telling MONSDA which samples should be treated as replicates and compared to which other samples. By default this will make an all-vs-all comparison, in our case ctrl-vs-ko. Keeping this “GROUPS” setting separate from the condition-tree makes it possible to mix samples of different conditions for all preprocessing and processing steps and separate them for postprocessing without further struggle. This means that in this simple case we could have put all samples under one condition, as all other tool options stay the same and reference and annotation do not differ, and we would still be able to split by “GROUPS” for the DE/DEU/DAS steps. However, sometimes we need to compare data that has to be processed differently, e.g. mixing paired and single-ended reads, so we can apply different settings as needed for all workflow steps by simply splitting our condition-tree accordingly.

Somewhat special is the DTU step. Here we use the pseudo-mapper salmon to quantify transcript expression. These transcripts were defined in the preprocessing part of this tutorial chapter. As we can see, the REFERENCE and ANNOTATION differ from what is listed in SETTINGS, so we make use of MONSDA’s ability to define those settings on a per-workflow basis. Such per-workflow settings ALWAYS overrule the ones made in SETTINGS. So we simply define the REFERENCE and ANNOTATION keys for each tool within the regular tool settings part of the config, which makes sure that these files are used instead of the globally defined ones.

Starting the run with 12 cores (defining more than 16 would be capped by the config file, as MAXTHREADS is set to 16)

monsda -j 12 -c ${CONDA_PREFIX}/share/MONSDA/configs/tutorial_exhaustive.json --directory ${PWD}

Will start the run in the current directory and generate a “FASTQ” sub-directory containing the downloaded samples, a “GENOMES/Ecoli/INDICES” directory containing the built indices, including the one built for salmon later on, a “QC” directory containing all FASTQC reports and MULTIQC output, a “TRIMMED_FASTQ” directory for trimgalore and cutadapt output, a “DEDUP” directory for umi_tools (runs before trimming and after mapping) and picard (runs after mapping) output, and a “MAPPING” directory containing the mapped files. Furthermore, “DE/DEU/DAS” directories will be created which hold output from counting with featureCounts and DE/DEU/DAS input and output from EDGER, DESeq2, DEXSeq and DIEGO respectively. Again, MONSDA will create a “LOG” directory containing its own log as well as logs of all executed jobs, and again a “JOBS” directory for command-line calls.

A successful run will show the message ‘Workflow finished, no error’. Be aware that this is indeed an exhaustive workflow and will require a decent amount of disk space, memory and compute time, depending on the hardware at your disposal.

Available Workflows

FETCH -> Download from SRA
BASECALL -> Call bases from FAST5
QC -> Quality Control of FASTQ/BAM files
TRIMMING -> Trim adaptor remnants
MAPPING -> Map reads to reference
DEDUP -> Deduplicate reads
COUNTING -> Count or quantify reads
TRACKS -> Generates trackdb.txt and BIGWIG files
PEAKS -> Call peaks for ChIP, RIP, CLIP and other
DE -> Differential Expression Analysis
DEU -> Differential Exon Usage Analysis
DAS -> Differential Alternative Splicing Analysis
DTU -> Differential Transcript Usage Analysis

PREPROCESSING

FETCH

Downloads files from SRA

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| SRAtools fasterq-dump | The fasterq-dump tool extracts data in FASTQ or FASTA format from SRA accessions | sra | fasterq-dump | sra | SRA accession | FASTQ |

BASECALL

Creates FASTQ files from ONT FAST5; Guppy needs to be installed locally

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| Guppy | Data processing toolkit that contains Oxford Nanopore’s basecalling algorithms and several bioinformatic post-processing features, such as barcoding/demultiplexing, adapter trimming and alignment. Needs to be installed locally as no conda version is available | guppy | $PATH_TO_LOCAL_BIN | guppy | FAST5 | FASTQ |

QUALITY CONTROL I

This workflow step can be run as a preprocessing step if none of the processing workflows is defined in the config.json.

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| FASTQC (includes MULTIQC) | A quality control tool for high throughput sequence data | fastqc | fastqc | fastqc | FASTQ/BAM | ZIP/HTML |

PROCESSING

QUALITY CONTROL II

If any of the processing steps listed below is defined in the config.json, quality control will run for all generated output files, if enabled.

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| FASTQC (includes MULTIQC) | A quality control tool for high throughput sequence data | fastqc | fastqc | fastqc | FASTQ/BAM | ZIP/HTML |

Trimming

Trims adaptor left-overs from fastq files

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| trim_galore | A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries | trimgalore | trim_galore | trimgalore | FASTQ | TRIMMED_FASTQ |
| cutadapt | Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads | cutadapt | cutadapt | cutadapt | FASTQ | TRIMMED_FASTQ |

Mapping

Maps sequences to reference genomes or transcriptomes

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| HISAT2 | HISAT2 is a fast and sensitive alignment program | hisat2 | hisat2 | hisat2 | FASTQ/TRIMMED_FASTQ | SAM.gz/BAM |
| STAR | Spliced Transcripts Alignment to a Reference | star | star | star | FASTQ/TRIMMED_FASTQ | SAM.gz/BAM |
| Segemehl2/3 | Segemehl is a software to map short sequencer reads to reference genomes | segemehl2/segemehl3 | segemehl.x | segemehl | FASTQ/TRIMMED_FASTQ | SAM.gz/BAM |
| BWA | BWA is a software package for mapping low-divergent sequences against a large reference genome | bwa | bwa mem | bwa | FASTQ/TRIMMED_FASTQ | SAM.gz/BAM |
| Minimap2 | Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database | minimap | minimap2 | minimap | FASTQ/TRIMMED_FASTQ | SAM.gz/BAM |

DEDUP

Deduplicates reads by UMI or based on mapping position and CIGAR string

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| UMI-tools | UMI-tools contains tools for dealing with Unique Molecular Identifiers (UMIs)/Random Molecular Tags (RMTs) and single cell RNA-Seq cell barcodes | umitools | umi_tools | umitools | FASTQ/TRIMMED_FASTQ | FASTQ/BAM |
| Picard tools | A better duplication marking algorithm that handles all cases including clipped and gapped alignments | picard | picard | picard | BAM | BAM |

POSTPROCESSING

Read-Counting and Quantification

Count (unique) mapped reads and how often they map to defined features

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| FeatureCounts | A software program developed for counting reads to genomic features such as genes, exons, promoters and genomic bins | countreads | featureCounts | featurecounts | BAM/FASTQ | TEXT |
| Salmon | Salmon is a tool for wicked-fast transcript quantification from RNA-seq data | salmon | salmon | salmon | FASTQ/TRIMMED_FASTQ | TEXT |

Differential Analyses

Includes DE, DEU, DAS and DTU

| Tool | Analysis | Filtering | Normalization | Distribution | Testing | Significance | Results Table | further | SigTables | Clustering | further | Rmd |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| edgeR | Differential Gene Expression | filterByExpr() | TMM | NB | Fisher’s exact test | pValue, LFC | results, sorted-results | normalized | Sig, SigUP, SigDOWN | MDS-plot | BCV, QLDisp, MD (per comparison) | |
| edgeR | Differential Exon Usage | filterByExpr() | TMM | NB | Fisher’s exact test | pValue, LFC | results | normalized | | MDS-plot | BCV, QLDisp, MD (per comparison) | |
| edgeR | Differential Alternative Splicing | filterByExpr() | TMM | NB | Simes, gene-level, exon-level | pValue, LFC | results (diffSpliceExonTest, Simes-Test, Gene-Test) | | Sig, SigUP, SigDOWN | MDS-plot | BCV, QLDisp, MD (per comparison), topSpliceSimes-plots (per Gene) | |
| DESeq2 | Differential Gene Expression | RowSums >= 10 | RLE | NB | Wald test | pValue, LFC | results | rld, vsd, results (per comparison) | Sig, SigUP, SigDOWN | PCA | Heatmaps, MA (per comparison), VST-and-log2 | |
| DEXSeq | Differential Exon Usage | RowSums >= 10 | RLE | Cox-Reid | likelihood ratio test | | | | | | | |
| DEXSeq | Differential Transcript Usage | dmFilter() | RLE | Cox-Reid | likelihood ratio test | pValue | results | | | | | |
| DIEGO | Differential Alternative Splicing | | | | Mann-Whitney U test | pValue | results | | Sig | Dendrogram-plot | | |
| DRIMSeq | Differential Transcript Usage | dmFilter() | | DM | | pValue, LFC | results (transcript, genes) | Proportions-table, genewise precision | Sig, SigUP, SigDOWN (transcript, gene) | | FeatPerGene, precision, Pvalues (per comparison) | |

TRACKS

This workflow generates trackdb.txt files and bigwig files which can be used to create UCSC track hubs. However, bigwigs can of course be used with other genome viewers as well.

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| UCSC | Track hubs are web-accessible directories of genomic data that can be viewed on the UCSC Genome Browser | ucsc | ucsc | ucsc | BAM | BIGWIG/HUBS |

PEAKS

Calls peaks from mapping data for ChIP, RIP, CLIP and other protocols

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| Piranha | Piranha is a peak-caller for CLIP- and RIP-Seq data | piranha | piranha | piranha | BAM | BED/BEDG/BIGWIG |
| MACS | Model-based Analysis of ChIP-Seq (MACS), for identifying transcription factor binding sites | macs | macs | macs | BAM | BED/BEDG/BIGWIG |
| SciPhy | Software for cyPhyRNA-Seq data analysis | scyphy | piranha | cyphyRNA-Seq | BAM | BED/BEDG/BIGWIG |
| Peaks | Sliding-window peak finding tool for quick assessment of peaks. UNPUBLISHED, recommended for initial scanning only | peaks | peaks | ttp | BAM | BED/BEDG/BIGWIG |

CIRCS

Finds circular RNAs in mapping data; CIRI2 needs to be installed locally.

| TOOL | DESCRIPTION | ENV | BIN | LINK | INPUT | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- |
| CIRI2 | CIRI (circRNA identifier) is a novel chiastic clipping signal based algorithm which can unbiasedly and accurately detect circRNAs from transcriptome data by employing multiple filtration strategies | ciri2 | $Path_to_CIRI2.pl | ciri2 | BAM | BED/BEDG/BIGWIG |

Wrapping Workflows

In general, MONSDA is Python3 software that wraps workflows by assembling subworkflows or single tools from .smk or .nf templates. The idea is that tools and subworkflows for similar tasks are designed to start from the same input and produce the same output. This holds not only for single workflow steps that can be performed by multiple tools, but also for the wrapped workflow management systems (WMS). In principle, output generated in Nextflow mode should be suitable as input for Snakemake and vice versa. This means that, for example, mapping output generated in Nextflow mode can be used as input for DE analysis in Snakemake mode, while both work from the same config.json.

As Snakemake is also written in Python, wrapping its workflows is similar to the built-in way of submodule assembly, although we take care that submodules for the same task remain interchangeable. Wrapping Nextflow is slightly different, as MONSDA has to assemble Groovy text blocks; this makes no difference to the end user, but requires translating the configuration from the config file into Nextflow-parsable command lines. However, the idea of creating interchangeable subworkflows or tool-specific code fragments stays the same.

Independently of the wrapped WMS, workflows are split internally into three independent stages. PREPROCESSING includes all workflow steps that generate or manipulate FASTQ files to make them available to the PROCESSING stage. This covers download of files from SRA, basecalling with Guppy and pre-quality-control, i.e. all workflow steps that do not require identical input formats but lead to similar output.

PROCESSING starts from FASTQ files and includes trimming, deduplication, mapping and quality control for all subprocesses.

POSTPROCESSING builds upon PROCESSING output and includes quantification, differential expression analysis on gene, transcript and exon level, generation of tracks for UCSC or other genome browsers, peak finding and circular RNA identification. In contrast to PROCESSING workflows, these steps do not require output to be of similar format but are able to work from the same input.

In case dedicated workflows need to be established, as is for example the case for cyPhyRNA-Seq, the main idea is to split all preprocessing and processing steps into units that remain interchangeable and to deliver dedicated postprocessing subworkflows that can work on their output. In the mentioned example, quality control, trimming, mapping and deduplication are embedded in standard workflows, while the dedicated, protocol-specific postprocessing of mapped reads is wrapped in the PEAKS postprocessing step.

For new workflows, we aim to split these into subunits as small as possible, making subworkflows available to other pipelines, and to add dedicated parts as postprocessing to the currently established categories. In case new categories need to be defined, please contact us to discuss how they can best be embedded.

The Condition-Tree

A key concept behind MONSDA is that config files are split according to the conditions defined in the config file, and each subworkflow is then run consecutively. This separates workflows into independent subworkflows, each potentially running on different input, with differing options and executables, without danger of interfering with each other.

As described in Planning a project run, you should have an idea of what to analyze and how to split up your conditions if needed. For each ID you work on, you can define zero, one or multiple conditions and settings that will be used for the analysis. The condition-tree also defines the directory structure to follow for input and output directories. We assume a general tree to look something like

'ID' -> 'CONDITION' -> 'SETTING'

Here, ID is the first level and the optional Condition the second. Setting is used by MONSDA to enable processing of the same samples under different settings or command-line options, e.g. for mapping tools, trimming tools and later also postprocessing tools. MONSDA will also build an output directory based on the combination of tools used, e.g. fastqc-cutadapt-star-umitools, to indicate which combination of tools generated the output and to prevent results from being mixed or overwritten.

As an example: I want to analyze samples retrieved from LabA on 01012020 (yes, that happens) with the mapping tools star and hisat. My condition-tree would then look like LabA:01012020 and my FASTQ input directory would resemble FASTQ/LabA/01012020. The ‘01012020’ directory thereby contains all the fastq.gz files needed for the analysis as stated in the corresponding config file. As we assume that settings may change while the input files stay the same, MONSDA will search one level upwards of the deepest level in the condition-tree. This means, if you have a tree like:

'ID1' -> 'CONDITION1' -> 'SETTING1', 'SETTING2', 'SETTING3'

you do not need to copy the input from FASTQ/LabA/01012020 to FASTQ/LabA/01012020/SETTING1/2/3; instead MONSDA will find the input in FASTQ/LabA/01012020 and generate output directories that contain the setting level. This of course also works if you want to analyze samples from different dates and the same lab with the same settings, or from different labs, and so on.
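
To illustrate, a sketch of the resulting layout, assuming the LabA example from above with three hypothetical settings, star as the only mapper and a paired-end sample Sample_1 (other directories omitted):

    FASTQ/LabA/01012020/Sample_1_R1.fastq.gz   # input, shared by all settings
    FASTQ/LabA/01012020/Sample_1_R2.fastq.gz
    MAPPING/LabA/01012020/SETTING1/star/...    # output, one directory per setting
    MAPPING/LabA/01012020/SETTING2/star/...
    MAPPING/LabA/01012020/SETTING3/star/...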

MONSDA will automatically define a unique tool-key based on the currently enabled workflow steps and the combination of tools defined in the config file. From that information it will generate output folders like MAPPING/LabA/01012020/star and MAPPING/LabA/01012020/hisat if no other tools and workflow steps were configured to be used.

Optionally, a user can also run one or the other tool with different settings, for example to benchmark tools. Define for example map_stringent and map_relaxed and indicate this on the MAPPING level in the config file, as sketched below. FASTQ input will still be found in FASTQ/LabA/01012020, while output files will appear in MAPPING/LabA/01012020/map_stringent/star and MAPPING/LabA/01012020/map_stringent/hisat, or MAPPING/LabA/01012020/map_relaxed/star and MAPPING/LabA/01012020/map_relaxed/hisat respectively.
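
A minimal sketch of such a MAPPING section, assuming map_stringent and map_relaxed are also defined as setting leaves under ‘SETTINGS’; the shown mapping options are placeholders, not recommendations:

    "MAPPING": {
        "TOOLS": {
            "star": "STAR",
            "hisat2": "hisat2"
        },
        "LabA": {
            "01012020": {
                "map_stringent": {
                    "star": { "OPTIONS": { "INDEX": "", "MAP": "--outFilterMismatchNmax 1", "EXTENSION": "" } },
                    "hisat2": { "OPTIONS": { "INDEX": "", "MAP": "--no-softclip", "EXTENSION": "" } }
                },
                "map_relaxed": {
                    "star": { "OPTIONS": { "INDEX": "", "MAP": "--outFilterMismatchNmax 10", "EXTENSION": "" } },
                    "hisat2": { "OPTIONS": { "INDEX": "", "MAP": "", "EXTENSION": "" } }
                }
            }
        }
    }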

The config-file

The config file consists of multiple sections, which will be explained in detail below. For the template please refer to the template.json file in the config directory of the repo. The template.json is the default blueprint for a config file generated by monsda_configure.

The config file contains all the information needed to run workflows and tools with user-defined settings and to find the correct sample and genome/annotation files. It starts by defining which workflow steps to run. Be aware that every workflow step has to correspond to a key in the config.json that defines parameters specific for the job. See Available Workflows for available workflows and tools.

    "WORKFLOWS": "", #which workflows do you want to run "MAPPING, QC, DEDUP, TRIMMING, COUNTING, TRACKS, PEAKS, ANNOTATE, DE, DEU, DAS, DTU, CIRCS"
    "BINS": "", #where to find customscripts used in the workflow !!!ADVANCED USAGE ONLY!!!
    "MAXTHREADS": "20", #maximum number of cores to use, make sure this fits your needs
    "VERSION": "1.0.0", #needs to be identical to the installed version for reproducibility reasons

The next part defines ‘SETTINGS’, which is where you define which samples to process under which condition-tree leafs, which genomes and annotations to use, which sequencing type is to be analyzed and some workflow specific keys.

    "SETTINGS": {
        "id": {
            "condition": {
                "setting": {
                    "SAMPLES": ["Sample_1","Sample_2"] # List of samples you whish to analyze; skip file ending, this names will be used for input/output files of various formats; if paired sequencing make sure the names have _R1/_R2 as identifiers of first/second in pair at the very end of the filename and only list one file without the _R1/_R2 extension, this will be appended automatically
                    "SEQUENCING": "single",  #or paired and stranded (rf,fr) info comma separated
                    "REFERENCE": "GENOMES/Dm6/dm6.fa.gz",  #default Referene Genome fa.gz file
                    "INDEX": "GENOMES/Dm6/INDICES/star",  #default index to use for mapping with this settings, empty if not already existing
                    "PREFIX": "idx",  #OPTIONAL, prefix for mapping software can be set here
                    "ANNOTATION": {
                        "GTF": "GENOMES/Dm6/dm6.gtf.gz",  #default annotation in GTF format, THIS WILL BE USED WHENEVER POSSIBLE OR NOT DEFINED OTHERWISE
                        "GFF": "GENOMES/Dm6/dm6.gff3.gz"  #default annotation in GFF format
                    },
                    "IP": "iCLIP" # OPTIONAL if PEAKS is run and files need specific processing, eg. for iCLIP protocol, options are CLIP, iCLIP, revCLIP
                }
            }
        }
    },

To define the samples to run the analysis on, just add a list of sample names as the innermost value of the SAMPLES key for each condition. In case of single-end sequencing, make sure to include any _R1/_R2 tag should it exist; in case of paired-end sequencing, skip those, as the pipeline will automatically look for _R1 and _R2 tags to find read pairs, saving you from duplicating information.

Make sure the naming of your samples follows this _R1/_R2 convention when running paired-end analysis! All files have to end with _R1.fastq.gz or _R2.fastq.gz for paired-end and with .fastq.gz for single-end processing respectively. MONSDA will append these suffixes automatically; a short example follows below.
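
For example, a paired-end sample listed once in the config matches two files on disk (paths assume the LabA example from above):

    "SAMPLES": ["Sample_1"]  # listed once, without _R1/_R2 and without file ending

    FASTQ/LabA/01012020/Sample_1_R1.fastq.gz  # first in pair
    FASTQ/LabA/01012020/Sample_1_R2.fastq.gz  # second in pair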

The SEQUENCING key takes single or paired as values, enabling the analysis of a mix of single-end/paired-end sequences at once, split over different leafs of the condition-tree. You can also specify the strandedness of the data at hand: if unstranded, leave it empty; otherwise add the strandedness according to http://rseqc.sourceforge.net/#infer-experiment-py as comma separated value (rf assumes a stranded library fr-firststrand [1+-,1-+,2++,2--]; fr assumes a stranded library fr-secondstrand [1++,1--,2+-,2-+]). MONSDA will use these values to configure tools automatically where possible. It is therefore not required to add sequencing-specific settings for each used tool to your config file. A few example values:
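
    "SEQUENCING": "single"       # single-end, unstranded
    "SEQUENCING": "paired"       # paired-end, unstranded
    "SEQUENCING": "paired,rf"    # paired-end, fr-firststrand
    "SEQUENCING": "paired,fr"    # paired-end, fr-secondstrand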

Now the actual workflow section begins, where you define which environments and tools to use and which settings to apply to the run for each step and condition/setting. These follow the same scheme for each step. The ENV key defines the conda environment to load from the envs directory of this installation. This also allows you to add your own environment.yaml files if needed; just make sure to share them along with your configuration. The BIN key defines the name of the executable, which is needed in case env and binary names differ, as e.g. for the mapping tool segemehl/segemehl.x. Some tools, e.g. guppy, are not available via conda; in that case you have to install the tool locally and indicate the path to the executable.

You can always define differing ENV/BIN keys for each condition-tree leaf separately, which will overwrite the ENV/BIN set under ‘TOOLS’. This is intended for benchmarking different versions of the same tool, as sketched below. Furthermore, you can define keys like ‘ANNOTATION’ and ‘REFERENCE’ in the same way, overwriting the ones defined under ‘SETTINGS’. This is useful, for example, when running workflows like ‘COUNTING’ with salmon to quantify transcript isoforms while still being able to map, trim and count reads on genome level.
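
A minimal sketch of such a per-leaf override, assuming two hypothetical setting leaves (‘current’ and ‘legacy’) exist under ‘SETTINGS’ and the segemehl2 environment ships the same segemehl.x binary:

    "MAPPING": {
        "TOOLS": { "segemehl3": "segemehl.x" },
        "id": {
            "condition": {
                "current": {
                    "segemehl3": { "OPTIONS": { "INDEX": "", "MAP": "", "EXTENSION": "" } }
                },
                "legacy": {
                    "ENV": "segemehl2",   # overrules the environment implied by TOOLS for this leaf only
                    "BIN": "segemehl.x",
                    "segemehl3": { "OPTIONS": { "INDEX": "", "MAP": "", "EXTENSION": "" } }
                }
            }
        }
    }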

The next key-level is the OPTIONS key, where you can define additional parameters for each tool. There is no need to define anything related to single-end/paired-end sequencing, as this is handled automatically. To add parameters, simply fill the OPTIONS dict with key/value pairs corresponding to the subworkflow-steps, e.g. ‘INDEX’ to generate an index file for mapping, with all settings given as command-line-like values. This should become clear when looking at the different processing steps in the template json. If there are no options, just leave the ‘OPTIONS’ dict empty.

    "FETCH": {
        "TOOLS" : # which tools to run, format is Conda-environment name : binary to call, will be overwritten by ENV/BIN in ICS
        {
            "sra" : "sra"
        },
        "id": { #key for source and genome
                "condition": { # sample id
                               "setting": {
                                   "ENV" : "sra",  # OPTIONAL if no tool is defined, name of conda env for raw file download
                                   "BIN" : "sra", # OPTIONAL PATH to executable, usually just the name of the executable
                                   "sra": { # for which tool environment these settings work
                                       "OPTIONS":
                                       {
                                           "DOWNLOAD": ""  #FETCH options here if any, paired is not required, will be resolved by rules

                                       }
                                   }
                               }
                             }
              }
    },
    "BASECALL": {
        "TOOLS" : # which tools to run, format is Conda-environment name : binary to call
        {
            "guppy" : "~/.local/bin/guppy-cpu/bin/guppy_basecaller"
        },
        "id": { #key for source and genome
                "condition": { # sample id
                               "setting": {
                                   "ENV" : "guppy",  # name of conda env for raw file download
                                   "BIN" : "~/.local/bin/guppy-cpu/bin/guppy_basecaller", #PATH to guppy executable
                                   "guppy":{
                                       "OPTIONS":
                                         {
                                             "BASECALL": ""  #Guppy options here if any, paired is not required, will be resolved by rules
                                         }

                                   }
                               }
                             }
              }
    },
    "QC": {
        "TOOLS" :
        {
            "fastqc" : "fastqc"
        },
        "id": { #key for source and genome
                "condition": { # sample id
                               "setting": {
                                   "ENV" : "fastqc",  # name of conda env for QC
                                   "BIN" : "fastqc", # binary for QC
                                   "fastqc":{
                                       "OPTIONS":
                                          {
                                              "QC": "",  #QC options here if any, paired is not required, will be resolved by rules
                                              "MULTI": ""  #MultiQC options here if any, paired is not required, will be resolved by rules
                                          }

                                   }
                               }
                             }
              }
    },
    "TRIMMING": { #options for trimming for each sample/condition
                  "TOOLS" :
                  {
                      "trimgalore": "trim_galore",
                      "cutadapt": "cutadapt"
                  },
                  "id": {
                      "condition": {
                          "setting": { # See above
                                       "ENV": "trimgalore", # name of conda env for trimming
                                       "BIN": "trim_galore", # name of binary for trimming
                                       "trimgalore":{
                                           "OPTIONS":
                                            {
                                                "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --paired is not required, will be resolved by rules
                                            }
                                       },
                                       "cutadapt":{
                                           "OPTIONS":
                                           {
                                                "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --paired is not required, will be resolved by rules
                                           }
                                       }
                                     }
                      }
                  }
                },
    "DEDUP": { #options for deduplication for each sample/condition
               "TOOLS": {
                   "umitools": "umi_tools",
                   "picard": "picard"
               },
               "id": {
                   "condition": {
                       "setting": { # See above
                                    "ENV": "umitools", # name of conda env for dedup
                                    "BIN": "umi_tools", # name of binary for dedup
                                    "umitools":{
                                        "OPTIONS":
                                        {
                                            "WHITELIST" : "--extract-method string --bc-pattern AGANNNNACGT",# umitools whitelist options
                                            "EXTRACT": "--extract-umi-method read_id",  # umitools extract options
                                            "DEDUP": ""  # umitools dedup options
                                        }
                                    },
                                    "picard":{
                                        "OPTIONS":
                                        {
                                            "JAVA" : "",# options
                                            "DEDUP": ""  # dedup options
                                        }
                                    }
                                  }
                   }
               }
             },
    "MAPPING": { #options for mapping for each sample/condition
                 "TOOLS": {
                     "star": "STAR",
                     "segemehl3": "segemehl.x",
                     "hisat2": "hisat2",
                     "bwa": "bwa mem",
                     "minimap": "minimap2"
                 },
                 "id": {
                     "condition": {
                         "setting": {
                             "ENV": "star",  # which conda env to use for mapping
                             "BIN": "STAR",  #how the mapper binary is called
                             "REFERENCE": "$PATHTO/genome.fa.gz",  #Path to the gzipped Genome FASTA file, overwrites SETTINGS
                             "ANNOTATION": "$PATHTO/genome_or_other.gtfgff.gz",  #Path to the gzipped annotation file in gtf/gff format, overwrites SETTINGS
                             "star":{
                                 "OPTIONS":  # first entry in list is a dict of options for indexing, second for mapping, third can be e.g. appendix to index name, useful especially with minimap if using different kmer sizes
                                    {
                                        "INDEX" : "--sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 13",  #indexing options
                                        "MAP": "--sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore",  #mapping options
                                        "EXTENSION" : ""
                                    }
                             },
                             "segemehl3": {
                                "OPTIONS":
                                {
                                    "INDEX" : "",
                                    "MAP": "",
                                    "EXTENSION" : ""
                                }
                              },
                              "hisat2": {
                                "OPTIONS":
                                {
                                    "INDEX" : "",
                                    "MAP": "",
                                    "EXTENSION" : ""
                                }
                              },
                              "bwa": {
                                "OPTIONS":
                                {
                                    "INDEX" : "",
                                    "MAP": "",
                                    "EXTENSION" : ""
                                }
                              },
                              "minimap": {
                                "OPTIONS":
                                {
                                    "INDEX" : "",
                                    "MAP": "",
                                    "EXTENSION" : ""
                                }
                              }
                         }
                     }
                 }
               },

MONSDA further supports postprocessing steps like DE/DEU/DAS/DTU analysis for a defined (sub)set of samples. These workflows act on ‘GROUPS’, which have to be defined in ‘SETTINGS’. This allows users to compare samples across the condition-tree and to account for batches or sample types, which will be considered in the respective design matrix of all provided R scripts. The settings for these steps may look as follows:

    "SETTINGS": {
        "Ecoli": {
            "WT": {   
                "dummylevel": {       
                    "SAMPLES": [
                        "SRR16324019", 
                        "SRR16324018",
                        "SRR16324017"
                    ],
                    "GROUPS": [
                        "ctrl",
                        "ctrl",
                        "ctrl"
                    ],
                    "SEQUENCING": "paired",
                    "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
                    "ANNOTATION": {
                        "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                        "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
                    }
                }
            },
            "KO": {
                "SAMPLES": [
                    "SRR16324016", 
                    "SRR16324015",
                    "SRR16324014"
                ],
                "GROUPS": [
                    "ko",
                    "ko",
                    "ko"
                ],
                "SEQUENCING": "paired",
                "REFERENCE": "GENOMES/Ecoli/ecoli.fa.gz",
                "ANNOTATION": {
                    "GFF": "GENOMES/Ecoli/ecoli.gff.gz",
                    "GTF": "GENOMES/Ecoli/ecoli.gtf.gz"
                }
            }

And can be extended, for example, to:

"SETTINGS": {
    "Ecoli": {
        "WT": {
            "SAMPLES": [
                "SRR16324019",
                "SRR16324018",
                "SRR16324017"
            ],
            "GROUPS": [
                "ctrl",
                "ctrl",
                "ctrl"
            ],
            "BATCHES": [
                "1",
                "1",
                "2"
            ],
            "TYPES": [
                "paired",
                "paired",
                "single"
            ]
        }
    }
}

Here ‘BATCHES’ and ‘TYPES’ can take arbitrary values; ‘BATCHES’ is intended to allow correction for batch effects, while ‘TYPES’ allows taking e.g. single-end/paired-end information into account. Be aware that this only makes sense if you indeed have batches and/or types to compare; otherwise just skip those keys and MONSDA will take care of the rest.

Another extra key for these analysis steps is ‘EXCLUDE’. It can be used to exclude samples from postprocessing, for example if the first round of analysis shows an outlier; a short sketch follows after the next example. The most important key is ‘COMPARABLE’, which, if left empty, will generate all-vs-all comparisons from all available ‘GROUPS’. In case you want to compare only certain groups, you can edit the config to look like this:

"COMPARABLE" :
{
    "comparison-name":
    {
        "Group1",
        "Group2"
    }
}

This will compare Group1 against Group2 and name the output after comparison-name.
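
For ‘EXCLUDE’, a minimal sketch, assuming it takes a list of sample names as given under ‘SAMPLES’ (the excluded accession is a hypothetical outlier):

    "EXCLUDE": ["SRR16324017"]  # this sample will be skipped in DE/DEU/DAS/DTU analysis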

Another special case is ‘MACS’, a peak caller for ChIP-Seq data, which needs to process files pairwise: one file contains the signal, while the other is used as background. To keep configuration simple, such comparisons can also be described with the ‘COMPARABLE’ key.
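
The authoritative pairing syntax is best taken from the shipped tutorial configs; as a purely hypothetical sketch, reusing the general ‘COMPARABLE’ structure with the signal group first and the background group second:

    "COMPARABLE":
    {
        "signal-vs-background": [
            "ChIP",   # hypothetical group holding the signal samples
            "Input"   # hypothetical group used as background
        ]
    }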

                    "ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
                    "OPTIONS":
                    {
                        "INDEX": "",  # Options for read counting, independent of COUNTS workflows
                    }
                }
            }
        }
    },
#PEAKS Analysis options

For everything else please refer to the TUTORIAL, The Condition-Tree and the Workflow Overview.

Keep in mind that every workflow step listed under WORKFLOWS needs a corresponding entry in the config file, or MONSDA will throw an error.
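
A minimal sketch of this correspondence (sections abbreviated with "..."):

    "WORKFLOWS": "TRIMMING,MAPPING",
    ...
    "TRIMMING": { ... },  # required because TRIMMING is listed in WORKFLOWS
    "MAPPING": { ... }    # required because MAPPING is listed in WORKFLOWS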

Integrating new tools/workflows

In case new tools need to be integrated, please refer to similar tools already implemented or contact us in case nothing similar is available yet. Workflows should always be split in subworkflows that follow the same principle as existing subworkflows, ideally making them reusable for other workflows in the future.

Tools should be easy to integrate: all that is needed is a tool-specific (and, if applicable, version-specific) .smk or .nf file describing input/output and command-line calls, as well as a fitting conda environment yaml file. Once these are available, the tool should already be usable and configurable via the config.json in the specific workflow section.

Contribute

If you like this project, are missing features, want to contribute or file bugs please open a PR, leave an issue or contact us directly.

To contribute new tools, feel free to adapt existing ones; there are a number of examples available that cover implementation details for almost all sorts of standard tools. If you need to add new Python/Groovy functions for processing of options or parameters, add them to the corresponding file in the lib directory. New environments go into the envs directory, new subworkflows into the workflows directory. Do not forget to also extend the template.json and add some documentation before opening a pull request.

PRs always welcome.

Contributors

Joerg Fallmann @jfallmann
Robin Goldmann @meisterL
