Computational Pipeline Strategies Lecture 16 BF528
Computational Pipelines A serialized set of data processing steps where output from one step is input into the following step
Computational Pipelines (diagram): Input → Step 1 → Step 2 → Step 3 → Final Output
Example Pipeline in Bioinformatics
Goal: Profile the transcriptional landscape of mammalian cardiac regeneration
Steps:
1. Acquire mRNA reads of regenerating myocytes
2. Perform quality control on FASTQ files
3. Align reads to reference genome
4. Quantify gene expression
5. Identify differentially expressed genes
6. Interpret findings
Example Pipeline in Bioinformatics (diagram):
Raw Reads → FastQC → FastQC Output
Raw Reads → Trim Galore → Trimmed Reads → TopHat → Aligned Reads → Cufflinks → Count Matrix → Cuffdiff → DE Genes → Summarizing Analysis → Final Output
Analysis as a DAG Analysis is a directed acyclic graph (DAG) Each step is a set of explicit instructions Input from one step feeds into another End point reachable from arbitrary starting point Pipelines are an implementation of a DAG
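The DAG view of an analysis can be sketched in a few lines of Python; here `graphlib` from the standard library orders the lecture's example steps so that every dependency runs before the steps that need it (the step names are illustrative, taken from the example pipeline, and this is a sketch of the idea rather than how any workflow tool is actually implemented):

```python
# A pipeline as a directed acyclic graph, executed in dependency order.
from graphlib import TopologicalSorter

# Each key is a step; the set holds the steps it depends on.
dag = {
    "qc":       {"raw_reads"},
    "align":    {"qc"},
    "quantify": {"align"},
    "de_genes": {"quantify"},
}

# static_order() yields a valid execution order for the whole DAG.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['raw_reads', 'qc', 'align', 'quantify', 'de_genes']
```

Because the example is a simple chain there is only one valid order; in a real pipeline with branches, any topological order is acceptable, and independent branches can run in parallel.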
Why pipelines? Exactly replicate all steps in analysis Generalize common analysis steps Minimize manual function calls
Look, there’s a lot of data in the world and I’m lazy https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702932/
Properties of the ideal pipeline system
General purpose: familiar language, can apply to any task
Modular: any language, well-tested components with tight APIs
Scalable: parallelize for free, independent of components
Integrated: LIMS, metadata, viz, versioning, reporting
Versioned: reproduce from snapshots in time
Idempotent: resume from failure, guarantee outputs
Challenges
Managing big genomics data is hard
Optimize/parallelize computation
Simplify deployment of complex pipelines
Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
Experimental nature of academic software
Difficult to install, configure, and deploy
System dependence
Unix Pipe Model
Naive solution: Bash scripts + SunGrid Engine
Parallelize by sample
Single-machine or cluster command-line workflows
A lot of user overhead:
Manage dependencies by hand
Restart from failure by hand
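The overhead of the naive bash approach can be sketched as a loop that manually checks for existing outputs in order to "resume" after a failure; the sample names, paths, and the `touch` standing in for the real alignment command are all illustrative placeholders:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Naive bash "workflow": one step per sample, with a hand-rolled
# resume-from-failure check (skip a sample if its output exists).
for sample in A B; do
    out="mapped_reads/${sample}.bam"
    if [ -e "$out" ]; then
        echo "skipping ${sample}: ${out} already exists"
        continue
    fi
    mkdir -p mapped_reads
    # Placeholder for the real step, e.g.:
    # bwa mem genome.fa "data/samples/${sample}.fastq" | samtools view -Sb - > "$out"
    touch "$out"
    echo "made ${out}"
done
```

Every piece of bookkeeping here (output checks, directory creation, ordering) is exactly what a workflow manager provides for free.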
Pipeline/Workflow Tools
Snakemake: Python-based workflow manager
Make: common tool for building software on UNIX-based systems
Nextflow: based on the dataflow programming model
Airflow: Airbnb/Apache pipeline creator, Python
SciPipe: Go-based pipeline language
Snakemake
Python-based workflow language/software, based on GNU make
File-based production rules: workflows are defined by rules describing how to create target files
input file(s) → transform → output file(s)
Excellent documentation
Implicit parallelism and cluster integration
Includes reproducibility tools: shared rule wrappers, Conda and Singularity integration
Snakemake Example: Variant Calling
Starting data: short read sequences in FASTQ format
External data: genome FASTA file
Target output: VCF file of found variants
Required steps:
1. Map reads to genome
2. Sort mapped reads
3. Index sorted reads
4. Call variants
Snakemake Example Show Data
Snakemake Example - Mapping
Define a rule that takes a FASTQ file as input and generates a BAM file:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
Snakemake Example Execute Code
Snakemake Example - Generalize
Generalize the rule to handle multiple samples using a wildcard:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
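The wildcard mechanism can be sketched in plain Python: given a requested target file, match it against the rule's output pattern, then substitute the captured wildcard into the input pattern. This mimics Snakemake's behavior for illustration only; it is not Snakemake's actual implementation, and the patterns are the ones from the rule above:

```python
import re

output_pattern = "mapped_reads/{sample}.bam"
input_pattern = "data/samples/{sample}.fastq"

def resolve(target: str) -> str:
    """Infer the input file a rule needs to produce the requested target."""
    # Turn the output pattern into a regex with a named capture group.
    regex = re.escape(output_pattern).replace(r"\{sample\}", r"(?P<sample>[^/]+)")
    m = re.fullmatch(regex, target)
    # Fill the captured wildcard value into the input pattern.
    return input_pattern.format(**m.groupdict())

print(resolve("mapped_reads/B.bam"))  # data/samples/B.fastq
```

This is why one rule suffices for any number of samples: the wildcard value is determined by whichever target file is requested.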
Snakemake Example Execute Code
Snakemake Example - Sort Alignments
Create a rule that takes BAM files and sorts them:

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"
Snakemake Example Execute Code
Snakemake Example - Index BAM
Create a rule that takes sorted BAM files and indexes them:

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"
Snakemake Example - Execute Code and Visualize DAG
snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
Snakemake Example - Call Variants
Create a rule that aggregates reads from all samples and jointly calls genomic variants:

SAMPLES = ["A", "B"]

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
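What expand() does can be approximated in a few lines of plain Python: it fills a pattern with every value of a wildcard and returns the resulting list of paths. This is a simplified single-wildcard sketch for illustration, not Snakemake's real expand(), which also handles combinations of multiple wildcards:

```python
def expand(pattern, **wildcards):
    """Simplified sketch of snakemake's expand(): one wildcard only."""
    (name, values), = wildcards.items()  # exactly one wildcard for this sketch
    return [pattern.format(**{name: v}) for v in values]

SAMPLES = ["A", "B"]
print(expand("sorted_reads/{sample}.bam", sample=SAMPLES))
# ['sorted_reads/A.bam', 'sorted_reads/B.bam']
```

This is how the aggregation rule turns a list of sample names into the full set of per-sample input files it depends on.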
Snakemake Example Execute Code
Snakemake Example - Plot Quality
Create a rule that takes the VCF file and plots a summary:

rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    script:
        "scripts/plot-quals.py"
Snakemake Example Show and Execute Code
Snakemake Example - Target Rule
Create a rule that specifies the final target file:

rule all:
    input:
        "plots/quals.svg"
Snakemake Example Execute Code Push Final Changes
Summary
Workflow management software is indispensable:
Simplifies analyzing many files in the same way
Seamlessly runs code locally or on a cluster
Documents your analysis - reproducible!
Learn how to use it!
There are many different workflow/pipeline tools; no single one is best, so pick the one you like