How to create a workflow
RattleSNP allows you to build a workflow using a simple config.yaml
configuration file:
First, provide the data paths.
Second, activate tools from mapping to SNP calling.
Last, set the tool parameters.
To create this file, just run:
create_config
Create config.yaml for run
rattleSNP create_config [OPTIONS]
Options
- -c, --configyaml <configyaml>
Required Path to create config.yaml
Then, edit the relevant sections of the file to customize your flavor of a workflow.
1. Providing data
First, indicate the data path in the config.yaml
configuration file:
DATA:
    FASTQ: "/path/to/fastq/"
    VCF: ""
    REFERENCE_FILE: "/path/to/reference.fasta"
    OUTPUT: "/path/to/output"
The table below describes each input needed to launch RattleSNP:
Input | Description |
---|---|
FASTQ | Every paired FASTQ file should contain the whole set of reads to be mapped. Each FASTQ file is mapped independently. |
VCF | If SNP calling has already been run, you can provide the VCF directly for filtering. |
REFERENCE_FILE | Only one reference genome file is used by RattleSNP. This reference is used for the mapping step. |
OUTPUT | Path to the output directory. |
Warning
For FASTQ files, the naming convention accepted by RattleSNP is NAME_R1.fastq.gz, NAME_R1.fq.gz, NAME_R1.fastq or NAME_R1.fq, and the same with _R2. Prefer short names and avoid special characters, because the report can fail. Avoid using the long names given directly by the sequencer. All FASTQ files must share the same extension and can be compressed or not.
The reference FASTA file needs a .fasta or .fa extension and must be uncompressed.
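The naming rules above can be checked before launching a run. The following is a minimal sketch, not part of RattleSNP: the function names and the regular expression are hypothetical and encode only the conventions listed in the warning.

```python
import re

# Hypothetical pattern: NAME, then _R1 or _R2, then one of the accepted extensions.
FASTQ_PATTERN = re.compile(
    r"^(?P<name>.+)_(?P<read>R[12])(?P<ext>\.fastq\.gz|\.fq\.gz|\.fastq|\.fq)$"
)
# The reference must be uncompressed, so ".fasta.gz" is rejected.
REFERENCE_EXTENSIONS = (".fasta", ".fa")


def check_fastq_names(filenames):
    """Return a list of problems; an empty list means the names follow the convention."""
    problems = []
    extensions = set()
    for filename in filenames:
        match = FASTQ_PATTERN.match(filename)
        if match is None:
            problems.append(f"{filename}: expected NAME_R1 or NAME_R2 plus an accepted extension")
            continue
        extensions.add(match["ext"])
    # All FASTQ files must share one extension.
    if len(extensions) > 1:
        problems.append(f"mixed extensions: {sorted(extensions)}")
    return problems


def check_reference_name(filename):
    """Return True if the reference file name ends with .fasta or .fa."""
    return filename.endswith(REFERENCE_EXTENSIONS)
```

For example, `check_fastq_names(["s1_R1.fastq.gz", "s1_R2.fq"])` reports the mixed extensions, while a consistent pair returns an empty list.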
2. Providing params
PARAMS:
    MITOCHONDRIAL_NAME: ""
    # Suffix added to each filtered VCF, so that several filters can be applied
    FILTER_SUFFIX: ["-Q30-DP5-MAF005-MISS07",
                    "-Q30-DP20-MAF001-MISS05"]
The table below describes each parameter for RattleSNP:
Params | Description |
---|---|
MITOCHONDRIAL_NAME | Name of the mitochondrial sequence in the FASTA file, used to remove it from the VCF file. Otherwise, leave it empty. |
FILTER_SUFFIX | Suffix added to the name of each filtered VCF file. |
3. Provide workflow step
Activate/deactivate tools as you wish. Feel free to activate only mapping, mapping + SNP calling, or mapping + SNP calling + filtering.
Example:
################################
# Pipeline tools activation
FASTQC: true
CLEANING:
    ATROPOS: true
MAPPING:
    ACTIVATE: true
    TOOL: "BWA_MEM"     # Use BWA_MEM or BWA_SAMPE only
    BUILD_STATS: true   # Warning: if true while mapping is false, mapping is run automatically
SNPCALLING: true
FILTER: true            # Must be true if you want to run raxml or raxml-ng
RAXML: true
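The constraints written in the comments of this example can be expressed as a small check. This is a minimal sketch, assuming the config.yaml file has already been parsed into a Python dictionary; `check_activation` is a hypothetical helper, not a RattleSNP function.

```python
def check_activation(config):
    """Return notes about the activation switches of a parsed config.yaml (a dict)."""
    notes = []
    mapping = config.get("MAPPING", {})
    # Mapping runs if it is activated, or if BUILD_STATS forces it.
    runs_mapping = bool(mapping.get("ACTIVATE") or mapping.get("BUILD_STATS"))
    if runs_mapping and mapping.get("TOOL") not in ("BWA_MEM", "BWA_SAMPE"):
        notes.append("MAPPING.TOOL must be BWA_MEM or BWA_SAMPE")
    if mapping.get("BUILD_STATS") and not mapping.get("ACTIVATE"):
        notes.append("BUILD_STATS is true, so mapping runs even though ACTIVATE is false")
    if config.get("RAXML") and not config.get("FILTER"):
        notes.append("RAXML is true, so FILTER must be true as well")
    return notes
```

With the example configuration above, the function returns an empty list. Setting FILTER to false while RAXML stays true produces one note.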
4. Parameters for some specific tools
You can manage tool parameters in the parameters section (PARAMS_TOOLS) of the config.yaml
file.
The standard parameters used by RattleSNP are listed below. Feel free to adapt them to your needs.
################################
# Misc. options for programs
PARAMS_TOOLS:
    ATROPOS: "--minimum-length 35 -q 20,20 -U 8 -O 10"
    FASTQC: ""
    BWA_ALN: ""
    BWA_SAMPE: ""
    BWA_MEM: ""
    SAMTOOLS_VIEW: "-bh -f 2"
    SAMTOOLS_SORT: ""
    SAMTOOLS_DEPTH: ""
    PICARDTOOLS_MARK_DUPLICATES: "-CREATE_INDEX TRUE -VALIDATION_STRINGENCY SILENT"
    GATK_HAPLOTYPECALLER: "--java-options '-Xmx40G' --emit-ref-confidence GVCF --output-mode EMIT_ALL_ACTIVE_SITES -ploidy 1"
    GATK_GENOMICSDBIMPORT: "--java-options '-Xmx40G' "
    GATK_GENOTYPEGVCFS: "--java-options '-Xmx40G' -new-qual"
    VCFTOOLS: ["--minDP 5 --minQ 30 --remove-indels --recode --recode-INFO-all --maf 0.05 --max-missing 0.7",
               "--minDP 20 --minQ 30 --remove-indels --recode --recode-INFO-all --maf 0.01 --max-missing 0.5"]
    RAXML: "-m GTRGAMMAX -f a -x $RANDOM -# autoMRE -p 600"
    RAXML_NG: "--all --model GTR+G --tree pars{50},rand{50} --bs-trees 100 --seed $RANDOM"
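In the defaults, the VCFTOOLS list has as many entries as FILTER_SUFFIX, and each suffix appears to describe the option string at the same position (for example "-Q30-DP5-MAF005-MISS07" and "--minDP 5 --minQ 30 ... --maf 0.05 --max-missing 0.7"). The sketch below assumes this positional pairing, which this page does not state explicitly; `pair_filters` is a hypothetical helper working on the parsed PARAMS and PARAMS_TOOLS sections.

```python
def pair_filters(params, params_tools):
    """Pair each FILTER_SUFFIX entry with the VCFTOOLS entry at the same position.

    Assumption: RattleSNP matches both lists by position.
    """
    suffixes = params.get("FILTER_SUFFIX", [])
    options = params_tools.get("VCFTOOLS", [])
    if len(suffixes) != len(options):
        raise ValueError(
            f"{len(suffixes)} suffixes but {len(options)} VCFTOOLS option strings"
        )
    # Assumption: identical suffixes would make two filters share one output name.
    if len(set(suffixes)) != len(suffixes):
        raise ValueError("FILTER_SUFFIX entries should be unique")
    return dict(zip(suffixes, options))
```

Calling it on the default values returns a dictionary with two entries, one per filter. A length mismatch raises a `ValueError` before any job is submitted.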
Warning
Please check the documentation of each tool (outside of RattleSNP) and make sure that the settings are correct!
How to run the workflow
Before attempting to run rattleSNP, please verify that you have already modified the config.yaml
file as explained in 1. Providing data.
If you installed RattleSNP on an HPC cluster with a job scheduler, you can run:
run_cluster
rattleSNP run_cluster [OPTIONS] [SNAKEMAKE_OTHER]...
Options
- -c, --config <config>
Required Configuration file for run tool
- -pdf, --pdf
Run snakemake with --dag, --rulegraph and --filegraph
- Default:
False
Arguments
- SNAKEMAKE_OTHER
Optional argument(s)
run_local
rattleSNP run_local [OPTIONS] [SNAKEMAKE_OTHER]...
Options
- -c, --config <config>
Required Configuration file for run tool
- -t, --threads <threads>
Required Number of threads
- -p, --pdf
Run snakemake with --dag, --rulegraph and --filegraph
Arguments
- SNAKEMAKE_OTHER
Optional argument(s)
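When runs are launched from a script, the command line can be assembled from the options listed above. The following is a minimal sketch: the helper names are hypothetical, only the long options documented here are used, and it is an assumption that extra arguments such as Snakemake's `--dry-run` are forwarded unchanged through SNAKEMAKE_OTHER.

```python
import shlex


def build_run_local(config, threads, pdf=False, snakemake_other=()):
    """Build the argument list for `rattleSNP run_local`."""
    command = ["rattleSNP", "run_local", "--config", str(config), "--threads", str(threads)]
    if pdf:
        command.append("--pdf")
    command.extend(snakemake_other)
    return command


def build_run_cluster(config, pdf=False, snakemake_other=()):
    """Build the argument list for `rattleSNP run_cluster` (no --threads option)."""
    command = ["rattleSNP", "run_cluster", "--config", str(config)]
    if pdf:
        command.append("--pdf")
    command.extend(snakemake_other)
    return command


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(build_run_local("config.yaml", 8, snakemake_other=["--dry-run"])))
```

The returned list can be passed to `subprocess.run(command, check=True)` once the printed command looks right.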
Advanced run
Providing more resources
If the cluster default resources are not sufficient, you can edit the cluster_config.yaml
file. See 2. Adapting cluster_config.yaml:
edit_cluster_config
Edit cluster_config.yaml use by profile
rattleSNP edit_cluster_config [OPTIONS]
Providing your own tools_config.yaml
To change the tools used in a RattleSNP workflow, see 3. How to configure tools_path.yaml:
edit_tools
Edit own tools version
rattleSNP edit_tools [OPTIONS]
Options
- -r, --restore
Restore default tools_config.yaml (from install)
- Default:
False
Output on RattleSNP
The architecture of the RattleSNP output is designed as follows:
OUTPUT_RattleSNP/
├── 1_mapping
├── 2_snp_calling
├── 3_full_snp_calling_stats
├── 4_raxml
└── LOGS
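After a run, the layout above can be compared with what is actually on disk. This is a minimal sketch with a hypothetical helper. It assumes that a directory is only created when the corresponding step is activated, so a missing entry is not necessarily an error.

```python
from pathlib import Path

# Sub-directories shown in the output tree above.
EXPECTED_SUBDIRS = (
    "1_mapping",
    "2_snp_calling",
    "3_full_snp_calling_stats",
    "4_raxml",
    "LOGS",
)


def missing_output_dirs(output_dir):
    """Return the expected sub-directories that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

For instance, `missing_output_dirs("/path/to/output")` returns `["4_raxml"]` when every directory except the RAxML one is present.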
Report
RattleSNP generates a report containing, for each FASTQ file, a summary of useful statistics.