.. _config-file:

===============
The config-file
===============

The config file consists of multiple sections, which are explained in detail below.  For a template, please refer to the
**template.json** file in the **configs** directory of the repo.  The **template.json** is the default
blueprint for a config file generated by **monsda_configure**.

The config file contains all the information needed to run workflows and tools with user-defined settings and to find the correct sample and genome/annotation files. It starts by defining which workflow steps to run. Be aware that every workflow step has to correspond to a key in the config.json that defines parameters specific to the job. See :ref:`wfoverview` for available workflows and tools.

.. literalinclude:: ../../configs/template.json
    :language: json
    :lines: 2-5
   
The next part defines 'SETTINGS'. Here you define which samples to process under which condition-tree leaves, which genomes and annotations to use, which sequencing type is to be analyzed, and some workflow-specific keys.

.. literalinclude:: ../../configs/template.json
    :language: json
    :lines: 6-23

To define the samples to run the analysis on, just add a list of sample names as innermost value
to the *SAMPLES* key for each condition.  For single-end sequencing, make sure to include any _R1 or _R2
tag that is part of the file name. For paired-end sequencing, skip those tags; the pipeline automatically looks for _R1 and _R2 tags to find read pairs, so you do not have to write duplicated information.

*Make sure the naming of your samples follows this _R1 _R2 convention when running paired-end analysis! All files have to end with _R1.fastq.gz or _R2.fastq.gz for paired-end and with .fastq.gz for single-end processing, respectively. MONSDA will append these suffixes automatically.*
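
The naming convention above can be sketched in a few lines of Python. This is only an illustration of the suffix rule, not **MONSDA**'s own code; the function name ``expected_fastq_files`` is hypothetical, and *singlecell* is deliberately left out because its file layout is not described here.

.. code-block:: python

    def expected_fastq_files(sample: str, layout: str) -> list[str]:
        """Return the FASTQ file names expected for a sample name from *SAMPLES*.

        Hypothetical helper that mirrors the naming rule described above.
        """
        if layout == "paired":
            # Paired-end: the sample name carries no _R1/_R2 tag,
            # both mates are derived from it.
            return [f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"]
        if layout == "single":
            # Single-end: the sample name is used as is, including any _R1/_R2 tag.
            return [f"{sample}.fastq.gz"]
        raise ValueError(f"layout not covered by this sketch: {layout!r}")


    print(expected_fastq_files("SRR16324019", "paired"))
    # ['SRR16324019_R1.fastq.gz', 'SRR16324019_R2.fastq.gz']
    print(expected_fastq_files("SRR16324019_R1", "single"))
    # ['SRR16324019_R1.fastq.gz']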

The *SEQUENCING* key accepts *single*, *paired* or *singlecell* as values. Because it is set per leaf of the condition-tree, a mix of single-end, paired-end and singlecell sequences can be analyzed at once.  You can also specify the strandedness of the data at hand. If the data is unstranded, leave it empty; otherwise add the strandedness according to http://rseqc.sourceforge.net/#infer-experiment-py as a comma-separated value (*rf* assumes a stranded library of type fr-firststrand [1+-,1-+,2++,2--], *fr* assumes a stranded library of type fr-secondstrand [1++,1--,2+-,2-+]). **MONSDA** will use these values to configure tools automatically where possible. It is therefore not required to add sequencing-specific settings for each used tool to your config file.
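
As a rough illustration, a *SEQUENCING* value could be split into its two parts as follows. This sketch assumes the value is written as ``layout,strandedness`` (for example ``paired,rf``), which is one reading of the description above; ``parse_sequencing`` is a hypothetical name and not part of **MONSDA**.

.. code-block:: python

    def parse_sequencing(value: str) -> tuple[str, str]:
        """Split a SEQUENCING value into (layout, strandedness).

        Assumed format: "<layout>" or "<layout>,<strandedness>".
        """
        layout, _, strand = value.partition(",")
        layout, strand = layout.strip(), strand.strip()
        if layout not in {"single", "paired", "singlecell"}:
            raise ValueError(f"unknown sequencing layout: {layout!r}")
        if strand not in {"", "rf", "fr"}:
            raise ValueError(f"unknown strandedness: {strand!r}")
        return layout, strand


    print(parse_sequencing("paired,rf"))  # ('paired', 'rf')
    print(parse_sequencing("single"))     # ('single', ''), i.e. unstranded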

Now the actual workflow section begins, where you can define which environments and tools to use and which settings to apply to the run for each step and condition/setting. These follow the same scheme for each step. The *ENV* key defines the conda environment to load from the *env* directory of this installation. This allows you to add your own environment.yaml files if needed; just make sure to share those along with your configuration. The *BIN* key defines the name of the executable, which is needed in case the env and the bin differ, as e.g. for the mapping tool **segemehl/segemehl.x**. Some tools, e.g. guppy, are not available via conda. In that case you have to install the tool locally and indicate the path to the executable. This also allows you to swap out the R scripts for the postprocessing steps in DE/DEU/DAS/DTU with your own. Just make sure that all parameters used in the default scripts are handled in your substitution scripts, and make them available together with the pipeline when sharing your config.

You can always define differing ENV/BIN keys for each condition-tree leaf separately, which will override the ENV/BIN set under 'TOOLS'. This is intended, for example, for benchmarking different versions of the same tool. Furthermore, you can define keys like 'ANNOTATION' and 'REFERENCE' in the same way, overriding the ones defined under 'SETTINGS'. This is useful, for example, if you run workflows like 'COUNTING' with salmon and want to quantify transcript isoforms while still being able to trim, map and count reads on genome level.
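
The override behaviour can be pictured as a simple dictionary merge. The sketch below is an assumption about the principle only: the flat dictionaries, the function name ``effective_settings`` and the environment name ``segemehl_new`` are hypothetical and do not reflect how **MONSDA** stores or resolves these keys internally.

.. code-block:: python

    def effective_settings(defaults: dict, leaf: dict) -> dict:
        """Combine default values with leaf-specific overrides.

        Only the keys named in the text above are considered overridable here.
        """
        overridable = ("ENV", "BIN", "ANNOTATION", "REFERENCE")
        merged = dict(defaults)  # copy, so the defaults stay untouched
        merged.update({key: leaf[key] for key in overridable if key in leaf})
        return merged


    defaults = {"ENV": "segemehl", "BIN": "segemehl.x"}
    leaf = {"ENV": "segemehl_new"}  # hypothetical second environment to benchmark
    print(effective_settings(defaults, leaf))
    # {'ENV': 'segemehl_new', 'BIN': 'segemehl.x'}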

The next key level is the *OPTIONS* key, where you can define additional parameters for each tool. You do not need to define anything related to *single-/paired-* end or *singlecell* sequencing; this is done automatically.  *OPTIONS* defines a dict in which you can set parameters for each defined subworkflow step. Keys correspond to the subworkflow step, e.g. 'INDEX' to generate an index file for mapping, and values hold the settings as you would write them in a command line call. This should become clear when looking at the different processing steps in the template json.  If there are no options, just leave the 'OPTIONS' dict empty.


.. literalinclude:: ../../configs/template.json
    :language: json
    :lines: 25-203

MONSDA further supports postprocessing steps like DE/DEU/DAS/DTU-Analysis for a defined (sub)-set of samples. These workflows act on 'GROUPS', which have to be defined in the 'SETTINGS'. This allows users to compare samples across the condition tree. In case samples have been collected in batches or users want to define types for samples which will be considered in the respective design matrix of all provided R scripts, the settings for these steps may look as follows:

.. literalinclude:: ../../configs/tutorial_de.json
    :language: json
    :lines: 6-45

These settings can be extended, for example, to:

.. code-block:: json

    "SETTINGS": {
        "Ecoli": {
            "WT": {            
                "SAMPLES": [
                    "SRR16324019", 
                    "SRR16324018",
                    "SRR16324017"
                ],
                "GROUPS": [
                    "ctrl",
                    "ctrl",
                    "ctrl"
                ],
                "BATCHES": [
                    "1",
                    "1",
                    "2"
                ],
                "TYPES": [
                    "paired",
                    "paired",
                    "single"
                ]
            }
        }
    }

Here, 'BATCHES' and 'TYPES' can take on arbitrary values, but 'BATCHES' is intended to allow correction for batch effects, and 'TYPES' allows you to take e.g. single-end versus paired-end information into account. Be aware that this only makes sense if you indeed have batches and/or types to compare and if those are not confounded. Otherwise just skip those keys and **MONSDA** will take care of this.
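
In the example above, 'SAMPLES', 'GROUPS', 'BATCHES' and 'TYPES' each hold one entry per sample. Assuming the lists are meant to line up position by position, a quick sanity check before starting a run could look like the following sketch. ``check_leaf_lists`` is a hypothetical helper, not part of **MONSDA**.

.. code-block:: python

    def check_leaf_lists(leaf: dict) -> list[str]:
        """Report per-sample lists whose length differs from SAMPLES."""
        n_samples = len(leaf.get("SAMPLES", []))
        problems = []
        for key in ("GROUPS", "BATCHES", "TYPES"):
            # Keys that are absent are fine, they are optional.
            if key in leaf and len(leaf[key]) != n_samples:
                problems.append(
                    f"{key} has {len(leaf[key])} entries, expected {n_samples}"
                )
        return problems


    leaf = {
        "SAMPLES": ["SRR16324019", "SRR16324018", "SRR16324017"],
        "GROUPS": ["ctrl", "ctrl", "ctrl"],
        "BATCHES": ["1", "1"],  # one entry missing on purpose
    }
    print(check_leaf_lists(leaf))
    # ['BATCHES has 2 entries, expected 3']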

Another extra key for these analysis steps is 'EXCLUDE'. It can be used to exclude samples from postprocessing, for example if the first round of analysis reveals an outlier. The most important key is 'COMPARABLE', which, if left empty, will generate all-vs-all comparisons from all available 'GROUPS'. In case you only want to compare certain groups, you can edit the config to look like this:

.. code-block:: json
    
        "COMPARABLE" :
        {
            "comparison-name":
            {
                "Group1",
                "Group2"
            }
        }


This will compare Group1 and Group2 and name the output after comparison-name.
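
To get a feeling for how many comparisons an empty 'COMPARABLE' produces, the all-vs-all case can be sketched with ``itertools.combinations``. This assumes that each unordered pair of distinct groups is compared once; the order and naming that **MONSDA** actually uses may differ, and the group names below are made up.

.. code-block:: python

    from itertools import combinations


    def all_vs_all(groups: list[str]) -> list[tuple[str, str]]:
        """Return every unordered pair of distinct group names."""
        unique = list(dict.fromkeys(groups))  # drop duplicates, keep first-seen order
        return list(combinations(unique, 2))


    print(all_vs_all(["ctrl", "ctrl", "KO", "treated"]))
    # [('ctrl', 'KO'), ('ctrl', 'treated'), ('KO', 'treated')]

With *n* distinct groups this yields *n* (*n* - 1) / 2 comparisons, so restricting 'COMPARABLE' to the pairs of interest quickly pays off as the number of groups grows.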

Another special case is 'MACS', a peak caller for ChIP-Seq data, which needs to process files pairwise, where one file is the file containing the signal and the other file is used as background. To keep configuration simple, such comparisons can also be described with the 'COMPARABLE' key like so:

.. literalinclude:: ../../configs/tutorial_exhaustive.json
    :language: json
    :lines: 514-517,520-525

For everything else please refer to the :ref:`tutorials`, the :ref:`condition-tree<condition-tree>` and the :ref:`Workflow Overview<WFoverview>`.

Keep in mind that every workflow step needs a corresponding entry in the config file or **MONSDA.py** will throw an error.
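
A config can be checked for this before starting a run. The following sketch assumes that the workflow steps are listed as a comma-separated string under a top-level key called ``WORKFLOWS``; that key name, the step names in the example and the function ``missing_workflow_keys`` are assumptions made for illustration, so compare them against your own template before relying on the check.

.. code-block:: python

    import json


    def missing_workflow_keys(config: dict) -> list[str]:
        """Return requested workflow steps that have no entry in the config."""
        requested = config.get("WORKFLOWS", "")  # assumed key name and format
        steps = [step.strip() for step in requested.split(",") if step.strip()]
        return [step for step in steps if step not in config]


    example = json.loads("""
    {
        "WORKFLOWS": "MAPPING,COUNTING",
        "SETTINGS": {},
        "MAPPING": {}
    }
    """)
    print(missing_workflow_keys(example))  # ['COUNTING']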