Computational Pipeline Strategies

Lecture 16 BF528

Computational Pipelines
A serialized set of data processing steps where the output of one step is the input to the following step

Input → Step 1 → Output 1 → Step 2 → Output 2 → Step 3 → Final
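
A minimal sketch of the same idea in Python, assuming each step is an ordinary function. The three step functions and `run_pipeline` are hypothetical names chosen for illustration, not part of any tool mentioned in the lecture.

```python
from functools import reduce

def run_pipeline(data, steps):
    """Feed data through each step in order; each output is the next input."""
    return reduce(lambda value, step: step(value), steps, data)

# Three hypothetical steps standing in for real processing stages.
def strip_blank(lines):
    return [line.strip() for line in lines if line.strip()]

def to_upper(lines):
    return [line.upper() for line in lines]

def count_bases(lines):
    return sum(len(line) for line in lines)

# Input -> Step 1 -> Step 2 -> Step 3 -> Final
final = run_pipeline(["acgt\n", "\n", "ttga\n"],
                     [strip_blank, to_upper, count_bases])
```

Swapping, adding, or removing a step only changes the list passed to `run_pipeline`, which is the property the later workflow tools build on.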

Example Pipeline in Bioinformatics
Goal: Profile transcriptional landscape of mammalian cardiac regeneration
Steps:
1. Acquire mRNA reads of regenerating myocytes
2. Perform quality control on FASTQ files
3. Align reads to reference genome
4. Quantify gene expression
5. Identify differentially expressed genes
6. Interpret findings

Example Pipeline in Bioinformatics
[Workflow diagram. Data: Raw Reads, FastQC Out, Trimmed Reads, Aligned Reads, Count Matrix, DE Genes, Final Output. Tools: FastQC, Trim Galore, TopHat, Cufflinks, Cuffdiff, Summarizing Analysis.]

Analysis as a DAG
Analysis is a directed acyclic graph (DAG)
- Each step is a set of explicit instructions
- Output from one step feeds into another
- End point reachable from arbitrary starting point
Pipelines are an implementation of a DAG
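
One way to see the DAG view in code is to write down which step depends on which and ask for a valid run order. The sketch below uses `graphlib` from the Python standard library (3.9+). The step names and the `execution_order` helper are illustrative assumptions, not taken from the slides.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Each key is a step; its value is the set of steps it depends on.
# Step names are illustrative labels only.
dependencies = {
    "fastqc": {"raw_reads"},
    "trim": {"raw_reads"},
    "align": {"trim"},
    "quantify": {"align"},
    "diff_expr": {"quantify"},
}

def execution_order(deps):
    """Return one valid run order, or raise ValueError if deps has a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # The second element of args holds the nodes forming the cycle.
        raise ValueError(f"not a DAG: {err.args[1]}") from err

order = execution_order(dependencies)
```

Several orders can be valid for the same graph (here `fastqc` and `trim` are independent of each other), which is what lets a workflow tool run independent steps in parallel.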

Why pipelines?
- Exactly replicate all steps in analysis
- Generalize common analysis steps
- Minimize manual function calls

Look, there’s a lot of data in the world and I’m lazy

Properties of the ideal pipeline system
- General purpose: familiar language, can apply to any task
- Modular: any language, well-tested components with tight APIs
- Scalable: parallelize for free, independent of components
- Integrated: LIMS, metadata, viz, versioning, reporting
- Versioned: reproduce from snapshots in time
- Idempotent: resume from failure, guarantee outputs
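
The idempotent property can be sketched with file timestamps: a step runs only when its output is missing or older than an input, so rerunning a half-finished pipeline repeats only the unfinished work. This is a minimal sketch under that assumption. `needs_run`, `run_step`, and `upper_case` are hypothetical helpers, not the API of any tool named in the lecture.

```python
import os
import tempfile
from pathlib import Path

def needs_run(inputs, output):
    """Return True if output is missing or older than any input."""
    out = Path(output)
    if not out.exists():
        return True
    out_mtime = out.stat().st_mtime
    return any(Path(path).stat().st_mtime > out_mtime for path in inputs)

def run_step(inputs, output, action):
    """Run action(inputs, tmp_path) only when needed; return True if it ran."""
    if not needs_run(inputs, output):
        return False
    tmp = f"{output}.tmp"
    action(inputs, tmp)
    # Rename into place so a crash mid-step never leaves a partial
    # file under the final output name.
    os.replace(tmp, output)
    return True

def upper_case(inputs, out_path):
    # Hypothetical transform standing in for a real tool invocation.
    Path(out_path).write_text(Path(inputs[0]).read_text().upper())

with tempfile.TemporaryDirectory() as workdir:
    src = Path(workdir, "reads.txt")
    dst = Path(workdir, "reads.upper.txt")
    src.write_text("acgt")
    first = run_step([src], dst, upper_case)   # output missing, so it runs
    second = run_step([src], dst, upper_case)  # output up to date, so it is skipped
    result = dst.read_text()
```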

Challenges
Managing big genomics data is hard
- Optimize/parallelize computation
- Simplify deployment of complex pipelines
- Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
- Experimental nature of academic software
- Difficult to install, configure and deploy
- System dependence

Unix Pipe Model
- Naive solution: Bash scripts
- Sun Grid Engine: parallelize by sample
- Single-machine or cluster command-line workflows
- A lot of user overhead
  - Manage dependencies
  - Restart after failure
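
The bookkeeping that the hand-rolled approach leaves to the user can be sketched as follows: run one job per sample in parallel and record which samples failed so they can be resubmitted. A thread pool stands in for a cluster scheduler here. `process_sample` and `run_all` are hypothetical, and the failure for sample "B" is simulated.

```python
from concurrent.futures import ThreadPoolExecutor

def process_sample(sample):
    # Placeholder for a per-sample command such as an alignment job.
    if sample == "B":
        raise RuntimeError("simulated failure")
    return f"mapped_reads/{sample}.bam"

def run_all(samples, worker, max_workers=4):
    """Run worker on every sample in parallel; split results into done/failed."""
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {sample: pool.submit(worker, sample) for sample in samples}
    # Leaving the with-block waits for every job to finish.
    for sample, future in futures.items():
        error = future.exception()
        if error is None:
            done[sample] = future.result()
        else:
            failed[sample] = str(error)
    return done, failed

done, failed = run_all(["A", "B", "C"], process_sample)
```

Only the samples in `failed` need to be resubmitted, which is the manual restart step that workflow managers automate.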

Pipeline/Workflow Tools
- Snakemake: Python-based workflow manager
- Make: common tool for building software on UNIX-based systems
- Nextflow: based on the dataflow programming model
- Airflow: Airbnb/Apache pipeline creator, Python
- SciPipe: Go-based pipeline language

Snakemake
Python-based workflow language/software
Based on GNU make
- File-based production rules: workflows are defined by rules that describe how to create target files
  input file(s) → transform → output file(s)
- Excellent documentation
- Implicit parallelism and cluster integration
- Includes reproducibility tools
  - Shared rule wrappers
  - Conda and Singularity integration
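
The file-based rule model can be sketched in a few lines of Python: given a target file, find the rule that produces it, recurse into that rule's inputs, and stop at files that already exist. This is a make-style illustration under simplifying assumptions (no wildcards, no timestamps, acyclic rules), not Snakemake's implementation. `Rule` and `plan` are hypothetical names and the file paths are examples.

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    output: str
    inputs: list = field(default_factory=list)

def plan(target, rules, existing):
    """Return the outputs to build, in order, so that target can be produced."""
    by_output = {rule.output: rule for rule in rules}
    ordered = []

    def visit(path):
        # Files already on disk or already scheduled need no further work.
        if path in existing or path in ordered:
            return
        if path not in by_output:
            raise FileNotFoundError(f"no rule produces {path}")
        # Assumes the rules form a DAG; a cycle would recurse forever.
        for dependency in by_output[path].inputs:
            visit(dependency)
        ordered.append(path)

    visit(target)
    return ordered

rules = [
    Rule("mapped_reads/A.bam", ["data/genome.fa", "data/samples/A.fastq"]),
    Rule("sorted_reads/A.bam", ["mapped_reads/A.bam"]),
    Rule("sorted_reads/A.bam.bai", ["sorted_reads/A.bam"]),
]
existing = {"data/genome.fa", "data/samples/A.fastq"}
build_order = plan("sorted_reads/A.bam.bai", rules, existing)
```

Asking for the final file is enough: the intermediate files are discovered by walking the rules backwards from the target.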

Snakemake Example: Variant Calling
Starting Data: Short read sequences in FASTQ format
Target Output: VCF file of found variants
Required Steps:
1. Map reads to genome
2. Sort mapped reads
3. Index sorted reads
4. Call variants
External Data: Genome FASTA file

Snakemake Example Show Data

Snakemake Example - Mapping
Define a rule that takes FASTQ reads as input and generates BAM files

    rule bwa_map:
        input:
            "data/genome.fa",
            "data/samples/A.fastq"
        output:
            "mapped_reads/A.bam"
        shell:
            "bwa mem {input} | samtools view -Sb - > {output}"

Snakemake Example Execute Code

Snakemake Example - Generalize
Create a rule that generalizes the mapping rule to multiple samples

    rule bwa_map:
        input:
            "data/genome.fa",
            "data/samples/{sample}.fastq"
        output:
            "mapped_reads/{sample}.bam"
        shell:
            "bwa mem {input} | samtools view -Sb - > {output}"
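
How a `{sample}` wildcard gets its value can be sketched in plain Python: turn the output pattern into a regular expression, match it against the requested file, and substitute the captured value into the input pattern. This is an illustration only, assuming a wildcard matches one or more arbitrary characters and that each wildcard name appears once per pattern. `match_wildcards` is a hypothetical helper, not Snakemake's API.

```python
import re

def match_wildcards(pattern, path):
    """Return {wildcard: value} if path matches pattern, else None."""
    regex = ""
    # Splitting on a capturing group keeps the {name} pieces in the result.
    for part in re.split(r"(\{\w+\})", pattern):
        if part.startswith("{") and part.endswith("}"):
            regex += f"(?P<{part[1:-1]}>.+)"
        else:
            regex += re.escape(part)
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None

# Requesting mapped_reads/A.bam fixes sample = "A" ...
wildcards = match_wildcards("mapped_reads/{sample}.bam", "mapped_reads/A.bam")
# ... which then determines the concrete input file.
fastq = "data/samples/{sample}.fastq".format(**wildcards)
# A file outside the pattern does not match the rule at all.
no_match = match_wildcards("mapped_reads/{sample}.bam", "sorted_reads/A.bam")
```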

Snakemake Example - Sort Alignments
Create a rule that takes BAM files and sorts them

    rule samtools_sort:
        input:
            "mapped_reads/{sample}.bam"
        output:
            "sorted_reads/{sample}.bam"
        shell:
            "samtools sort -T sorted_reads/{wildcards.sample} "
            "-O bam {input} > {output}"

Snakemake Example - Index BAM
Create a rule that takes sorted BAM files and indexes them

    rule samtools_index:
        input:
            "sorted_reads/{sample}.bam"
        output:
            "sorted_reads/{sample}.bam.bai"
        shell:
            "samtools index {input}"

Snakemake Example - Execute Code and Visualize DAG

    snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg

Create a rule that aggregates reads from all samples and jointly calls genomic variants

    SAMPLES = ["A", "B"]

    rule bcftools_call:
        input:
            fa="data/genome.fa",
            bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
            bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
        output:
            "calls/all.vcf"
        shell:
            "samtools mpileup -g -f {input.fa} {input.bam} | "
            "bcftools call -mv - > {output}"
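
What `expand` contributes here can be reproduced in plain Python: it fills the pattern once for every combination of the supplied wildcard values and returns the resulting list of paths. `expand_paths` below is a hypothetical stand-in written for illustration, not Snakemake's own function.

```python
from itertools import product

def expand_paths(pattern, **wildcards):
    """Fill pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]

SAMPLES = ["A", "B"]
bams = expand_paths("sorted_reads/{sample}.bam", sample=SAMPLES)
bais = expand_paths("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
# With two wildcards, every pairing is produced.
pairs = expand_paths("{sample}.{ext}", sample=SAMPLES, ext=["bam", "bai"])
```

The aggregating rule therefore depends on one sorted BAM and one index per sample, and adding a sample only means extending `SAMPLES`.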

Create a rule that takes the VCF file and plots a summary

    rule plot_quals:
        input:
            "calls/all.vcf"
        output:
            "plots/quals.svg"
        script:
            "scripts/plot-quals.py"
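
The contents of `scripts/plot-quals.py` are not shown in the slides. As a rough sketch of the data-extraction half of such a script, the function below pulls the QUAL column out of VCF text using only the standard library. `read_quals` and the example records are made up for illustration, and the plotting itself is left out because it needs a third-party library. Inside a Snakemake `script:`, the file paths would come from the injected `snakemake` object (`snakemake.input[0]`, `snakemake.output[0]`).

```python
def read_quals(vcf_lines):
    """Collect QUAL values (sixth tab-separated column) from VCF data lines."""
    quals = []
    for line in vcf_lines:
        # Skip meta-information, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        qual = line.rstrip("\n").split("\t")[5]
        if qual != ".":  # "." marks a missing quality value
            quals.append(float(qual))
    return quals

# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "I\t100\t.\tA\tG\t50.5\t.\t.\n",
    "I\t250\t.\tC\tT\t.\t.\t.\n",
    "I\t300\t.\tG\tA\t221\t.\t.\n",
]
quals = read_quals(example)
```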

Snakemake Example Show and Execute Code

Snakemake Example - Target Rule
Create a rule that specifies the final target file

    rule all:
        input:
            "plots/quals.svg"

Snakemake Example Execute Code Push Final Changes

Summary
Workflow management software is indispensable:
- Simplifies analyzing many files in the same way
- Seamlessly runs code locally or on a cluster
- Documents your analysis - reproducible!
Learn how to use it!
A lot of different workflow/pipeline tools
- Not a single best
- Pick which one you like
