The config-file

The config file consists of multiple sections, which are explained in detail below. For a template, please refer to the template.json file in the config directory of the repo. This template.json is the default blueprint for config files generated by monsda_configure.

The config file contains all the information needed to run workflows and tools with user-defined settings and to find the correct sample and genome/annotation files. It starts by defining which workflow steps to run. Be aware that every workflow step has to correspond to a key in the config.json that defines parameters specific to that job. See Available Workflows for the available workflows and tools.

    "WORKFLOWS": "", #which workflows do you want to run "MAPPING, QC, DEDUP, TRIMMING, COUNTING, TRACKS, PEAKS, DE, DEU, DAS, DTU, CIRCS"
    "BINS": "", #where to find customscripts used in the workflow !!!ADVANCED USAGE ONLY!!!
    "MAXTHREADS": "20", #maximum number of cores to use, make sure this fits your needs
    "VERSION": "1.2.4", #needs to be identical to the installed version for reproducibility reasons

The next part defines ‘SETTINGS’, where you specify which samples to process under which condition-tree leaves, which genomes and annotations to use, which sequencing type is to be analyzed, and some workflow-specific keys.

    "SETTINGS": {
        "id": {
            "condition": {
                "setting": {
                    "SAMPLES": ["Sample_1","Sample_2"],  # List of samples you wish to analyze; skip the file ending, these names will be used for input/output files of various formats; for paired sequencing make sure the file names have _R1/_R2 as identifiers of first/second in pair at the very end of the filename and list each sample only once without the _R1/_R2 extension, which will be appended automatically
                    "SEQUENCING": "single",  #or paired and stranded (rf,fr) info comma separated
                    "REFERENCE": "GENOMES/Dm6/dm6.fa.gz",  #default reference genome fa.gz file
                    "INDEX": "GENOMES/Dm6/INDICES/star",  #default index to use for mapping with these settings, empty if not already existing
                    "PREFIX": "idx",  #OPTIONAL, prefix for mapping software can be set here
                    "ANNOTATION": {
                        "GTF": "GENOMES/Dm6/dm6.gtf.gz",  #default annotation in GTF format, THIS WILL BE USED WHENEVER POSSIBLE OR NOT DEFINED OTHERWISE
                        "GFF": "GENOMES/Dm6/dm6.gff3.gz"  #default annotation in GFF format
                    },
                    "DECOY": {
                        "salmon": "GENOMES/Ecoli/salmon_decoy"
                    }, # OPTIONAL if tool for mapping/counting needs decoys, like e.g. salmon
                    "IP": "iCLIP" # OPTIONAL if PEAKS is run and files need specific processing, eg. for iCLIP protocol, options are CLIP, iCLIP, revCLIP
                }

To define the samples to run the analysis on, add a list of sample names as the innermost value of the SAMPLES key for each condition. For single-end sequencing, include any _R1/_R2 tag in the sample name should one exist. For paired-end sequencing, omit those tags: the pipeline automatically looks for _R1 and _R2 to find read pairs, which saves you from writing duplicated information.

Make sure the naming of your samples follows this _R1/_R2 convention when running paired-end analysis! All files have to end with _R1.fastq.gz or _R2.fastq.gz for paired-end and .fastq.gz for single-end processing, respectively. MONSDA will append these suffixes automatically.
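
As a minimal sketch of this naming convention, the file names expected for a sample can be derived from its entry in SAMPLES and the SEQUENCING value. The helper `expected_fastq_names` is hypothetical and not part of MONSDA. File naming for singlecell data is not described here, so the sketch rejects it rather than guessing:

```python
def expected_fastq_names(sample, sequencing):
    """Return the FASTQ file names implied by the _R1/_R2 convention.

    Hypothetical helper for illustration; MONSDA resolves file names internally.
    """
    # SEQUENCING may carry strandedness after a comma, e.g. "paired,fr"
    seq_type = sequencing.split(",")[0].strip()
    if seq_type == "paired":
        return [f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"]
    if seq_type == "single":
        return [f"{sample}.fastq.gz"]
    raise ValueError(f"naming for sequencing type {seq_type!r} is not covered by this sketch")


print(expected_fastq_names("Sample_1", "paired,fr"))
# ['Sample_1_R1.fastq.gz', 'Sample_1_R2.fastq.gz']
print(expected_fastq_names("Sample_2", "single"))
# ['Sample_2.fastq.gz']
```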

The SEQUENCING key accepts single, paired or singlecell as values, which enables analysis of a mix of single-end, paired-end and single-cell data at once, split across different leaves of the condition-tree. You can also specify the strandedness of the data at hand. If unstranded, leave it empty; otherwise add strandedness according to http://rseqc.sourceforge.net/#infer-experiment-py as a comma-separated value (rf assumes a stranded library fr-firststrand [1+-,1-+,2++,2--], fr assumes a stranded library fr-secondstrand [1++,1--,2+-,2-+]). MONSDA will use these values to configure tools automatically where possible, so you do not need to add sequencing-specific settings for each tool to your config file.
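
The structure of a SEQUENCING value can be illustrated with a small parser. This is only a sketch under the assumptions stated in the text (type first, optional strandedness after a comma); the function name `parse_sequencing` and the two constant sets are made up for this example and do not reflect MONSDA internals:

```python
# Values named in the text above; the empty string stands for "unstranded".
VALID_TYPES = {"single", "paired", "singlecell"}
VALID_STRANDEDNESS = {"", "rf", "fr"}


def parse_sequencing(value):
    """Split a SEQUENCING string such as "paired,rf" into (type, strandedness)."""
    parts = [part.strip() for part in value.split(",")]
    if len(parts) > 2:
        raise ValueError(f"expected at most one comma in {value!r}")
    seq_type = parts[0]
    strandedness = parts[1] if len(parts) == 2 else ""
    if seq_type not in VALID_TYPES:
        raise ValueError(f"unknown sequencing type {seq_type!r}")
    if strandedness not in VALID_STRANDEDNESS:
        raise ValueError(f"unknown strandedness {strandedness!r}")
    return seq_type, strandedness


print(parse_sequencing("paired,rf"))  # ('paired', 'rf')
print(parse_sequencing("single"))     # ('single', '')
```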

Now the actual workflow section begins, where you define which environments and tools to use and which settings to apply to the run for each step and condition/setting. These follow the same scheme for each step. The ENV key defines the conda environment to load from the env directory of this installation. This allows you to add your own environment.yaml files if needed; just make sure to share those along with your configuration. The BIN key defines the name of the executable, which is needed in case the names of the env and the binary differ, as for the mapping tool segemehl/segemehl.x. Some tools, e.g. guppy, are not available via conda, so you have to install the tool locally and indicate the path to the executable. This also allows you to swap out the R scripts for the postprocessing steps in DE/DEU/DAS/DTU with your own; just make sure that all parameters used in the default scripts are handled in your substitute scripts, and make them available together with the pipeline when sharing your config.

You can always define differing ENV/BIN keys for each condition-tree leaf separately, which will overwrite the ENV/BIN set under ‘TOOLS’. This is intended, for example, for benchmarking different versions of the same tool. Furthermore, you can define keys like ‘ANNOTATION’ and ‘REFERENCE’ in the same way, overwriting the ones defined under ‘SETTINGS’. This is useful, for example, if you run workflows like ‘COUNTING’ with salmon and want to quantify transcript isoforms while still being able to map, trim and count reads at the genome level.
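
The precedence described here can be sketched as follows. This is an illustration of the rule only, not MONSDA's actual resolution code; the function `tools_for_leaf` and the leaf names `setting_a` and `setting_b` are hypothetical:

```python
def tools_for_leaf(workflow_cfg, leaf_path):
    """Return the {env: bin} mapping that applies to one condition-tree leaf.

    Leaf-level ENV/BIN take precedence; otherwise fall back to TOOLS.
    """
    node = workflow_cfg
    for key in leaf_path:
        node = node.get(key, {})
    if node.get("ENV") and node.get("BIN"):
        return {node["ENV"]: node["BIN"]}
    return dict(workflow_cfg.get("TOOLS", {}))


mapping = {
    "TOOLS": {"star": "STAR", "hisat2": "hisat2"},
    "id": {
        "condition": {
            "setting_a": {"ENV": "star", "BIN": "STAR"},  # leaf-specific tool
            "setting_b": {},                              # no ENV/BIN, use TOOLS
        }
    },
}

print(tools_for_leaf(mapping, ["id", "condition", "setting_a"]))
# {'star': 'STAR'}
print(tools_for_leaf(mapping, ["id", "condition", "setting_b"]))
# {'star': 'STAR', 'hisat2': 'hisat2'}
```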

The next key level is the ‘OPTIONS’ key, where you can define additional parameters for each tool. You do not need to define anything related to single-/paired-end or singlecell sequencing; this is done automatically. To add parameters, add the ‘OPTIONS’ key, which defines a dict where you can set parameters for each defined subworkflow step. Parameters are defined as key/value pairs: the key corresponds to the subworkflow step, e.g. ‘INDEX’ to generate an index file for mapping, and the value holds all settings as they would appear in a command line call. This should become clear from the different processing steps in the template json. If there are no options, just leave the ‘OPTIONS’ dict empty.
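
To make the nesting concrete, the following sketch looks up the option string for one tool and subworkflow step at a given leaf and splits it the way a shell would. The helper `get_options` is hypothetical; how MONSDA itself assembles command lines is not shown here. The TRIM value is the one used for trimgalore in the template:

```python
import shlex


def get_options(workflow_cfg, leaf_path, tool, step):
    """Return the option string for tool/step at a leaf, or "" if none is set."""
    node = workflow_cfg
    for key in leaf_path:
        node = node.get(key, {})
    return node.get(tool, {}).get("OPTIONS", {}).get(step, "")


trimming = {
    "TOOLS": {"trimgalore": "trim_galore"},
    "id": {
        "condition": {
            "setting": {
                "trimgalore": {"OPTIONS": {"TRIM": "-q 15 --length 8 -e 0.15"}}
            }
        }
    },
}

leaf = ["id", "condition", "setting"]
options = get_options(trimming, leaf, "trimgalore", "TRIM")
print(shlex.split(options))
# ['-q', '15', '--length', '8', '-e', '0.15']
print(repr(get_options(trimming, leaf, "trimgalore", "INDEX")))
# ''
```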

        }
    },
#FETCH options
    "FETCH": {
        "TOOLS" : # which tools to run, format is Conda-environment name : binary to call, will be overwritten by ENV/BIN in ICS
        {
            "sra" : "sra"
        },
        "id": { #key for source and genome
                "condition": { # sample id
                               "setting": {
                                   "ENV" : "sra",  # OPTIONAL if no tool is defined, name of conda env for raw file download
                                   "BIN" : "sra", # OPTIONAL PATH to executable, usually just the name of the executable
                                   "sra": { # for which tool environment these settings work
                                       "OPTIONS":
                                       {
                                            "PREFETCH": "${HOME}/.ncbi/user-settings.mkfg",  # PATH to vdb-config file
                                            "DOWNLOAD": ""  #FETCH options here if any, paired is not required, will be resolved by rules

                                       }
                                   }
                               }
                             }
              }
    },
    "BASECALL": {
        "TOOLS" : # which tools to run, format is Conda-environment name : binary to call
        {
            "guppy" : "~/.local/bin/guppy-cpu/bin/guppy_basecaller"
        },
        "id": { #key for source and genome
                "condition": { # sample id
                               "setting": {
                                   "ENV" : "guppy",  # name of conda env for basecalling
                                   "BIN" : "~/.local/bin/guppy-cpu/bin/guppy_basecaller", #PATH to guppy executable
                                   "guppy":{
                                       "OPTIONS":
                                         {
                                             "BASECALL": ""  #Guppy options here if any, paired is not required, will be resolved by rules
                                         }

                                   }
                               }
                             }
              }
    },
    "QC": {
        "TOOLS" :
        {
            "fastqc" : "fastqc"
        },
        "id": { #key for source and genome
                "condition": { # sample id
                               "setting": {
                                   "ENV" : "fastqc",  # name of conda env for QC
                                   "BIN" : "fastqc", # binary for QC
                                   "fastqc":{
                                       "OPTIONS":
                                          {
                                              "QC": "",  #QC options here if any, paired is not required, will be resolved by rules
                                              "MULTI": ""  #MultiQC options here if any, paired is not required, will be resolved by rules
                                          }

                                   }
                               }
                             }
              }
    },
    "TRIMMING": { #options for trimming for each sample/condition
                  "TOOLS" :
                  {
                      "trimgalore": "trim_galore",
                      "cutadapt": "cutadapt"
                  },
                  "id": {
                      "condition": {
                          "setting": { # See above
                                       "ENV": "trimgalore", # name of conda env for trimming
                                       "BIN": "trim_galore", # name of binary for trimming
                                       "trimgalore":{
                                           "OPTIONS":
                                            {
                                                "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --paired is not required, will be resolved by rules
                                            }
                                       },
                                       "cutadapt":{
                                           "OPTIONS":
                                           {
                                                "TRIM": "-q 15 --length 8 -e 0.15"  # trimming options here, --paired is not required, will be resolved by rules
                                           }
                                       }
                                     }
                      }
                  }
                },
    "DEDUP": { #options for deduplication for each sample/condition
               "TOOLS": {
                   "umitools": "umi_tools",
                   "picard": "picard"
               },
               "id": {
                   "condition": {
                       "setting": { # See above
                                    "ENV": "umitools", # name of conda env for dedup
                                    "BIN": "umi_tools", # name of binary for dedup
                                    "umitools":{
                                        "OPTIONS":
                                        {
                                            "WHITELIST" : "--extract-method string --bc-pattern 'XNNNNX'",# umitools whitelist options
                                            "EXTRACT": "--extract-umi-method read_id",  # umitools extract options
                                            "DEDUP": ""  # umitools dedup options
                                        }
                                    },
                                    "picard":{
                                        "OPTIONS":
                                        {
                                            "JAVA" : "",# options
                                            "DEDUP": ""  # dedup options
                                        }
                                    }
                                  }
                   }
               }
             },
    "MAPPING": { #options for mapping for each sample/condition
                 "TOOLS": {
                     "star": "STAR",
                     "segemehl3": "segemehl.x",
                     "segemehl": "segemehl.x",
                     "hisat2": "hisat2",
                     "bwa": "bwa mem",
                     "bwameth": "bwameth",
                     "minimap": "minimap2"
                 },
                 "id": {
                     "condition": {
                         "setting": {
                             "ENV": "star",  # which conda env to use for mapping
                             "BIN": "STAR",  #how the mapper binary is called
                             "REFERENCE": "$PATHTO/genome.fa.gz",  #Path to the gzipped Genome FASTA file, overwrites SETTINGS
                             "ANNOTATION": "$PATHTO/genome_or_other.gtfgff.gz",  #Path to the gzipped annotation file in gtf/gff format, overwrites SETTINGS
                             "star":{
                                 "OPTIONS":  # INDEX holds options for indexing, MAP options for mapping, EXTENSION can be e.g. an appendix to the index name, useful especially with minimap if using different kmer sizes
                                    {
                                        "INDEX" : "--sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript Parent --genomeSAindexNbases 13",  #indexing options
                                        "MAP": "--sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript Parent --outSAMprimaryFlag AllBestScore",  #mapping options
                                        "EXTENSION" : ""
                                    }
                             },
                             "segemehl3": {
                                "OPTIONS":
                                {
                                    "INDEX" : "",
                                    "MAP": "",
                                    "EXTENSION" : ""
                                }
                              },
                              "hisat2": {
                                "OPTIONS":
                                {
                                    "INDEX" : "",
                                    "MAP": "",
                                    "EXTENSION" : ""
                                }
                              },
                              "bwa": {
                                "OPTIONS":
                                {
                                    "INDEX" : "",
                                    "MAP": "",
                                    "EXTENSION" : ""
                                }
                              },
                              "minimap": {
                                "OPTIONS":
                                {
                                    "INDEX" : "",
                                    "MAP": "",
                                    "EXTENSION" : ""

MONSDA further supports postprocessing steps like DE/DEU/DAS/DTU analysis for a defined (sub)set of samples. These workflows act on ‘GROUPS’, which have to be defined in the ‘SETTINGS’. This allows users to compare samples across the condition tree. In case samples have been collected in batches, or users want to define types for samples that will be considered in the respective design matrix of all provided R scripts, the settings for these steps can be extended, for example, as follows:

"SETTINGS": {
    "Ecoli": {
        "WT": {
            "SAMPLES": [
                "SRR16324019",
                "SRR16324018",
                "SRR16324017"
            ],
            "GROUPS": [
                "ctrl",
                "ctrl",
                "ctrl"
            ],
            "BATCHES": [
                "1",
                "1",
                "2"
            ],
            "TYPES": [
                "paired",
                "paired",
                "single"
            ]
        }
    }
}

Here, ‘BATCHES’ and ‘TYPES’ can take on arbitrary values, but ‘BATCHES’ is intended to allow correction for batch effects, and ‘TYPES’ allows you to take e.g. single-end versus paired-end information into account. Be aware that this only makes sense if you indeed have batches and/or types to compare and those are not confounded; otherwise just skip these keys and MONSDA will take care of it.
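
In the example above, ‘GROUPS’, ‘BATCHES’ and ‘TYPES’ each hold one entry per sample, in the same order as ‘SAMPLES’. Assuming that this one-to-one correspondence is required, a quick consistency check could look like the sketch below. The helper `check_sample_annotations` is hypothetical and not part of MONSDA:

```python
def check_sample_annotations(leaf):
    """Report per-sample lists whose length differs from SAMPLES.

    Assumes GROUPS, BATCHES and TYPES are parallel to SAMPLES, as in the
    example config; keys that are absent are simply skipped.
    """
    n_samples = len(leaf.get("SAMPLES", []))
    problems = []
    for key in ("GROUPS", "BATCHES", "TYPES"):
        if key in leaf and len(leaf[key]) != n_samples:
            problems.append(f"{key} has {len(leaf[key])} entries, expected {n_samples}")
    return problems


wt = {
    "SAMPLES": ["SRR16324019", "SRR16324018", "SRR16324017"],
    "GROUPS": ["ctrl", "ctrl", "ctrl"],
    "BATCHES": ["1", "1", "2"],
    "TYPES": ["paired", "paired"],  # deliberately one entry short
}

print(check_sample_annotations(wt))
# ['TYPES has 2 entries, expected 3']
```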

Another extra key for these analysis steps is ‘EXCLUDE’. It can be used to exclude samples from postprocessing, for example if the first round of analysis reveals an outlier. The most important key is ‘COMPARABLE’, which, if left empty, will generate all-vs-all comparisons from all available ‘GROUPS’. If you only want to compare certain groups, you can edit the config to look like this:

"COMPARABLE" :
{
    "comparison-name":
    {
        "Group1",
        "Group2"
    }
}

This will compare Group1 and Group2 and name output after comparison-name.
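
What "all-vs-all" means for an empty ‘COMPARABLE’ can be sketched with `itertools.combinations`. This only illustrates the idea of pairing every group with every other group once; the function `all_vs_all`, the `a-vs-b` naming scheme and the group names `ko` and `treated` are assumptions made for this example, and the names MONSDA actually generates may differ:

```python
from itertools import combinations


def all_vs_all(groups):
    """Build one comparison per unordered pair of distinct groups."""
    # dict.fromkeys removes duplicates while keeping first-seen order
    unique = list(dict.fromkeys(groups))
    return {f"{a}-vs-{b}": [a, b] for a, b in combinations(unique, 2)}


comparisons = all_vs_all(["ctrl", "ctrl", "ko", "treated"])
print(list(comparisons))
# ['ctrl-vs-ko', 'ctrl-vs-treated', 'ko-vs-treated']
```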

Another special case is ‘MACS’, a peak caller for ChIP-Seq data, which needs to process files pairwise, where one file is the file containing the signal and the other file is used as background. To keep configuration simple, such comparisons can also be described with the ‘COMPARABLE’ key like so:

                "dexseq": {
                    "REFERENCE": "GENOMES/Ecoli/Ecoli_trans.fa.gz",
                    "ANNOTATION": "GENOMES/Ecoli/Ecoli_trans_fix.gtf.gz",
                    "DECOY": "GENOMES/Ecoli/salmon_decoy",
                        "INDEX": "",
                        "QUANT": "",  # Options for read counting, independent of COUNTS workflows
                        "DTU": "min_samps_feature_expr = 1, min_feature_expr = .1,  min_samps_gene_expr = 1, min_gene_expr = 1"  # Options for DE Analysis
                    }
                },
                "drimseq": {

For everything else please refer to the TUTORIAL, the condition-tree and the Workflow Overview.

Keep in mind that every workflow step needs a corresponding entry in the config file or MONSDA.py will throw an error.
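
This requirement can be checked before starting a run. The sketch below is a hypothetical helper, not part of MONSDA, and works on a config that has already been loaded into a Python dict. Note that the commented snippets shown on this page are not strict JSON, since JSON has no comment syntax, so the example builds the dict directly:

```python
def missing_workflow_keys(config):
    """Return workflows named in WORKFLOWS that lack a top-level config key."""
    requested = [w.strip() for w in config.get("WORKFLOWS", "").split(",") if w.strip()]
    return [w for w in requested if w not in config]


example_config = {
    "WORKFLOWS": "QC, TRIMMING, MAPPING",
    "MAXTHREADS": "20",
    "QC": {},
    "MAPPING": {},
}

print(missing_workflow_keys(example_config))
# ['TRIMMING']
```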