Computational Pipeline Strategies


1 Computational Pipeline Strategies
Lecture 16 BF528

2 Computational Pipelines
A serialized set of data processing steps in which the output of one step is the input to the next step

3 Computational Pipelines
A serialized set of data processing steps in which the output of one step is the input to the next step
[Figure: Input → Step 1 → Output 1 → Step 2 → Output 2 → Step 3 → Final Output]
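As a minimal sketch of the definition above, a serial pipeline can be modelled as a list of functions where each one receives the previous one's return value. The three step functions are hypothetical stand-ins for real processing tools, not part of the lecture.

```python
from functools import reduce

def run_pipeline(data, steps):
    """Feed the output of each step into the next step, in order."""
    return reduce(lambda current, step: step(current), steps, data)

# Hypothetical stand-ins for real processing steps.
def strip_whitespace(lines):
    return [line.strip() for line in lines]

def drop_empty(lines):
    return [line for line in lines if line]

def uppercase(lines):
    return [line.upper() for line in lines]

result = run_pipeline(
    [" acgt ", "", "ttga\n"],
    [strip_whitespace, drop_empty, uppercase],
)
print(result)  # ['ACGT', 'TTGA']
```

Reordering or swapping entries in the `steps` list changes the pipeline without touching the individual steps.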

4 Example Pipeline in Bioinformatics
Goal: Profile transcriptional landscape of mammalian cardiac regeneration
Steps:
  Acquire mRNA reads of regenerating myocytes
  Perform quality control on FASTQ files
  Align reads to reference genome
  Quantify gene expression
  Identify differentially expressed genes
  Interpret findings

5 Example Pipeline in Bioinformatics
[Figure: pipeline diagram. Data nodes: Raw Reads, FastQC Out, Trimmed Reads, Aligned Reads, Count Matrix, DE Genes, Final Output. Tool nodes: FastQC, Trim Galore, TopHat, Cufflinks, Cuffdiff, Summarizing Analysis.]

6 Analysis as a DAG
Analysis is a directed acyclic graph (DAG)
Each step is a set of explicit instructions
Output from one step feeds into another
End point reachable from arbitrary starting point
Pipelines are an implementation of a DAG
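One way to make the DAG view concrete is to record, for each step, the steps it depends on and let a topological sort produce a valid execution order. The sketch below uses the standard-library `graphlib` module (Python 3.9+). The step names and edges are assumptions loosely based on the RNA-seq example, not taken from the slide's diagram.

```python
from graphlib import CycleError, TopologicalSorter

def execution_order(dependencies):
    """Return the steps in an order where every step follows its dependencies.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dependencies).static_order())

# Each key is a step; its value is the set of steps it depends on.
# Names and edges are illustrative assumptions.
dependencies = {
    "fastqc": {"raw_reads"},
    "trim": {"raw_reads"},
    "align": {"trim"},
    "quantify": {"align"},
    "diff_expr": {"quantify"},
}

order = execution_order(dependencies)
print(order)

# A cycle means no valid order exists, so the graph is not a pipeline.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Several orders can be valid for the same graph (for example, `fastqc` may run before or after `trim`), which is what leaves room for parallel execution.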

7 Why pipelines?
Exactly replicate all steps in analysis
Generalize common analysis steps
Minimize manual function calls

8 Look, there’s a lot of data in the world and I’m lazy

9

10 Properties of the ideal pipeline system
General purpose: familiar language, can apply to any task
Modular: any language, well-tested components with tight APIs
Scalable: parallelize for free, independent of components
Integrated: LIMS, metadata, viz, versioning, reporting
Versioned: reproduce from snapshots in time
Idempotent: resume from failure, guarantee outputs
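The idempotence property can be illustrated with a small sketch: a step runs only when its output is missing or older than an input, and it writes through a temporary file so an interrupted run does not leave a half-written output. This uses the same timestamp idea as Make, under simplifying assumptions. `run_step`, `is_up_to_date`, and `to_upper` are hypothetical names, not functions from any workflow tool.

```python
import os
import tempfile
from pathlib import Path

def is_up_to_date(inputs, output):
    """True when output exists and is at least as new as every input."""
    output = Path(output)
    if not output.exists():
        return False
    out_mtime = output.stat().st_mtime
    return all(Path(p).stat().st_mtime <= out_mtime for p in inputs)

def run_step(inputs, output, action):
    """Run action(inputs, tmp_path) only if output is missing or stale.

    Returns True if the action ran, False if it was skipped.
    """
    if is_up_to_date(inputs, output):
        return False
    tmp = Path(str(output) + ".tmp")
    action(inputs, tmp)        # write to a temporary file first
    os.replace(tmp, output)    # rename only after the step finished
    return True

def to_upper(inputs, out_path):
    """Hypothetical processing step: upper-case the first input file."""
    Path(out_path).write_text(Path(inputs[0]).read_text().upper())

workdir = Path(tempfile.mkdtemp())
reads = workdir / "reads.txt"
reads.write_text("acgt\n")
result = workdir / "result.txt"

first_run = run_step([reads], result, to_upper)    # output missing, so the step runs
second_run = run_step([reads], result, to_upper)   # output is current, so it is skipped
print(first_run, second_run)  # True False
```

Re-running the whole script after a failure repeats only the steps whose outputs are missing or stale.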

11 Challenges
Managing big genomics data is hard
Optimize/parallelize computation
Simplify deployment of complex pipelines
Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
Experimental nature of academic software
Difficult to install, configure and deploy
System dependence

12 Unix Pipe Model
Naive solution: Bash scripts
Sun Grid Engine
Parallelize by sample
Single-machine or cluster command-line workflows
A lot of user overhead
  Manage dependencies
  Restart from failure
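"Parallelize by sample" means the same step runs independently on each sample. A minimal single-machine sketch with the standard-library `concurrent.futures` module is below. `count_bases` and the sample data are hypothetical stand-ins for launching a real command-line tool per sample.

```python
from concurrent.futures import ThreadPoolExecutor

def count_bases(sample):
    """Hypothetical per-sample step: total number of bases in a sample's reads."""
    name, reads = sample
    return name, sum(len(read) for read in reads)

# Toy data standing in for per-sample FASTQ files.
samples = {"A": ["ACGT", "TT"], "B": ["GGGCC"]}

# Each sample is independent, so the samples can be processed concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    totals = dict(pool.map(count_bases, samples.items()))

print(totals)  # {'A': 6, 'B': 5}
```

With hand-written scripts the user still has to handle job ordering and recovery from failures, which is the overhead the slide refers to.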

13 Pipeline/Workflow Tools
Snakemake: Python-based workflow manager
Make: Common tool for building software on UNIX-based systems
Nextflow: Based on the dataflow programming model
Airflow: Airbnb/Apache pipeline creator, Python
SciPipe: Go-based pipeline language

14 Snakemake
Python-based workflow language/software
Inspired by GNU Make
File-based production rules:
  Workflows are defined by rules that define how to create target files
  input file(s) → transform → output file(s)
Excellent documentation
Implicit parallelism and cluster integration
Includes reproducibility tools
  Shared rule wrappers
  Conda and Singularity integration

15 Snakemake Example: Variant Calling
Starting Data: Short read sequences in FASTQ format
Target Output: VCF file of found variants
Required Steps:
  Map reads to genome
  Sort mapped reads
  Index sorted reads
  Call variants
External Data: Genome FASTA file

16 Snakemake Example Show Data

17 Snakemake Example - Mapping
Define a rule that takes FASTQ reads as input and generates a BAM file:

    rule bwa_map:
        input:
            "data/genome.fa",
            "data/samples/A.fastq"
        output:
            "mapped_reads/A.bam"
        shell:
            "bwa mem {input} | samtools view -Sb - > {output}"

18 Snakemake Example Execute Code

19 Snakemake Example - Generalize
Create a rule that generalizes the mapping rule to multiple samples:

    rule bwa_map:
        input:
            "data/genome.fa",
            "data/samples/{sample}.fastq"
        output:
            "mapped_reads/{sample}.bam"
        shell:
            "bwa mem {input} | samtools view -Sb - > {output}"
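To show what the `{sample}` wildcard does conceptually, the sketch below matches a requested target against an output pattern and substitutes the captured value into the input patterns. It is a simplified illustration, not Snakemake's actual implementation. `match_wildcards` is a hypothetical helper, and it assumes each wildcard name appears only once in the pattern.

```python
import re

def match_wildcards(pattern, target):
    """Return a dict of wildcard values if target matches pattern, else None.

    Each {name} in the pattern matches one or more characters.
    Assumes every wildcard name occurs only once in the pattern.
    """
    regex = ""
    pos = 0
    for found in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:found.start()])   # literal text
        regex += f"(?P<{found.group(1)}>.+)"              # wildcard capture
        pos = found.end()
    regex += re.escape(pattern[pos:])
    matched = re.fullmatch(regex, target)
    return matched.groupdict() if matched else None

output_pattern = "mapped_reads/{sample}.bam"
input_patterns = ["data/genome.fa", "data/samples/{sample}.fastq"]

# Requesting mapped_reads/B.bam binds sample="B" ...
wildcards = match_wildcards(output_pattern, "mapped_reads/B.bam")
# ... and that value is filled into the input patterns.
inputs = [p.format(**wildcards) for p in input_patterns]

print(wildcards)  # {'sample': 'B'}
print(inputs)     # ['data/genome.fa', 'data/samples/B.fastq']
```

The wildcard value is always inferred from the requested output file, which is why one rule can serve any number of samples.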

20 Snakemake Example Execute Code

21 Snakemake Example - Sort Alignments
Create a rule that takes BAM files and sorts them:

    rule samtools_sort:
        input:
            "mapped_reads/{sample}.bam"
        output:
            "sorted_reads/{sample}.bam"
        shell:
            "samtools sort -T sorted_reads/{wildcards.sample} "
            "-O bam {input} > {output}"

22 Snakemake Example Execute Code

23 Snakemake Example - Index BAM
Create a rule that takes sorted BAM files and indexes them:

    rule samtools_index:
        input:
            "sorted_reads/{sample}.bam"
        output:
            "sorted_reads/{sample}.bam.bai"
        shell:
            "samtools index {input}"

24 Snakemake Example
Execute Code and Visualize DAG:

    snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg

25 Snakemake Example - Call Variants
Create a rule that aggregates reads from all samples and jointly calls genomic variants:

    SAMPLES = ["A", "B"]

    rule bcftools_call:
        input:
            fa="data/genome.fa",
            bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
            bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
        output:
            "calls/all.vcf"
        shell:
            "samtools mpileup -g -f {input.fa} {input.bam} | "
            "bcftools call -mv - > {output}"
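The aggregation above relies on turning one pattern plus a list of samples into a list of concrete file names. A plain-Python sketch of that idea is below. `expand_pattern` is a hypothetical helper written for illustration, not Snakemake's own `expand`, and the two-wildcard call is an added example.

```python
from itertools import product

def expand_pattern(pattern, **wildcards):
    """Fill the pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]

SAMPLES = ["A", "B"]

bams = expand_pattern("sorted_reads/{sample}.bam", sample=SAMPLES)
print(bams)  # ['sorted_reads/A.bam', 'sorted_reads/B.bam']

# With several wildcards, every combination is generated.
both = expand_pattern("sorted_reads/{sample}.{ext}", sample=SAMPLES, ext=["bam", "bam.bai"])
print(both)
```

Because the file list is built from `SAMPLES`, adding a sample to that list is enough to include it in the joint variant call.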

26 Snakemake Example Execute Code

27 Snakemake Example - Plot Summary
Create a rule that takes the VCF file and plots a summary:

    rule plot_quals:
        input:
            "calls/all.vcf"
        output:
            "plots/quals.svg"
        script:
            "scripts/plot-quals.py"

28 Snakemake Example Show and Execute Code

29 Snakemake Example - Target Rule
Create a rule that specifies the final target file:

    rule all:
        input:
            "plots/quals.svg"
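Requesting the final target is enough to trigger every upstream rule, because the workflow engine works backwards from the target to the files that must be produced first. The sketch below illustrates that backward chaining for a single sample with concrete file names. The `RULES` table and `plan` function are simplified assumptions for illustration, not how Snakemake is implemented.

```python
# Output file -> input files, mirroring the example rules for sample A.
RULES = {
    "plots/quals.svg": ["calls/all.vcf"],
    "calls/all.vcf": ["data/genome.fa", "sorted_reads/A.bam", "sorted_reads/A.bam.bai"],
    "sorted_reads/A.bam.bai": ["sorted_reads/A.bam"],
    "sorted_reads/A.bam": ["mapped_reads/A.bam"],
    "mapped_reads/A.bam": ["data/genome.fa", "data/samples/A.fastq"],
}

def plan(target, rules, done=None):
    """Return the files to build, in order, so that target can be produced.

    Files with no producing rule are assumed to be existing source data.
    Assumes the rules form a DAG (no cycles).
    """
    done = [] if done is None else done
    if target in rules and target not in done:
        for dependency in rules[target]:
            plan(dependency, rules, done)   # build inputs first
        done.append(target)                 # then the target itself
    return done

jobs = plan("plots/quals.svg", RULES)
for path in jobs:
    print(path)
# mapped_reads/A.bam
# sorted_reads/A.bam
# sorted_reads/A.bam.bai
# calls/all.vcf
# plots/quals.svg
```

Each file is planned only once even when several rules need it, as with `sorted_reads/A.bam` here.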

30 Snakemake Example Execute Code Push Final Changes

31 Summary
Workflow management software is indispensable:
  Simplifies analyzing many files in the same way
  Seamlessly runs code locally or on a cluster
  Documents your analysis - reproducible!
  Learn how to use it!
A lot of different workflow/pipeline tools
  There is no single best tool
  Pick which one you like

