Computational Pipeline Strategies Lecture 16 BF528
Computational Pipelines A serialized set of data processing steps where output from one step is input into the following step
Computational Pipelines (diagram): Input → Step 1 → Step 2 → Step 3 → Final Output
Example Pipeline in Bioinformatics
Goal: Profile the transcriptional landscape of mammalian cardiac regeneration
Steps:
1. Acquire mRNA reads of regenerating myocytes
2. Perform quality control on FASTQ files
3. Align reads to reference genome
4. Quantify gene expression
5. Identify differentially expressed genes
6. Interpret findings
Example Pipeline in Bioinformatics (diagram):
Raw Reads → FastQC → FastQC Output
Raw Reads → Trim Galore → Trimmed Reads → TopHat → Aligned Reads → Cufflinks → Count Matrix → Cuffdiff → DE Genes → Summarizing Analysis → Final Output
Analysis as a DAG Analysis is a directed acyclic graph (DAG) Each step is a set of explicit instructions Input from one step feeds into another End point reachable from arbitrary starting point Pipelines are an implementation of a DAG
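The DAG view of an analysis can be sketched in a few lines of Python; here `graphlib` from the standard library orders the lecture's example steps so that every dependency runs before the steps that need it (the step names are illustrative, taken from the example pipeline, and this is a sketch of the idea rather than how any workflow tool is actually implemented):

```python
# A pipeline as a directed acyclic graph, executed in dependency order.
from graphlib import TopologicalSorter

# Each key is a step; the set holds the steps it depends on.
dag = {
    "qc":       {"raw_reads"},
    "align":    {"qc"},
    "quantify": {"align"},
    "de_genes": {"quantify"},
}

# static_order() yields a valid execution order for the whole DAG.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['raw_reads', 'qc', 'align', 'quantify', 'de_genes']
```

Because the example is a simple chain there is only one valid order; in a real pipeline with branches, any topological order is acceptable, and independent branches can run in parallel.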
Why pipelines? Exactly replicate all steps in analysis Generalize common analysis steps Minimize manual function calls
Look, there’s a lot of data in the world and I’m lazy https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702932/
Properties of the ideal pipeline system
General purpose: familiar language, can apply to any task
Modular: any language, well-tested components with tight APIs
Scalable: parallelize for free, independent of components
Integrated: LIMS, metadata, viz, versioning, reporting
Versioned: reproduce from snapshots in time
Idempotent: resume from failure, guarantee outputs
Challenges
Managing big genomics data is hard
Optimize/parallelize computation
Simplify deployment of complex pipelines
Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
Experimental nature of academic software
Difficult to install, configure, and deploy
System dependence
Unix Pipe Model
Naive solution: Bash scripts + SunGrid Engine
Parallelize by sample
Single-machine or cluster command-line workflows
A lot of user overhead:
Manage dependencies by hand
Restart from failure by hand
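The overhead of the naive bash approach can be sketched as a loop that manually checks for existing outputs in order to "resume" after a failure; the sample names, paths, and the `touch` standing in for the real alignment command are all illustrative placeholders:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Naive bash "workflow": one step per sample, with a hand-rolled
# resume-from-failure check (skip a sample if its output exists).
for sample in A B; do
    out="mapped_reads/${sample}.bam"
    if [ -e "$out" ]; then
        echo "skipping ${sample}: ${out} already exists"
        continue
    fi
    mkdir -p mapped_reads
    # Placeholder for the real step, e.g.:
    # bwa mem genome.fa "data/samples/${sample}.fastq" | samtools view -Sb - > "$out"
    touch "$out"
    echo "made ${out}"
done
```

Every piece of bookkeeping here (output checks, directory creation, ordering) is exactly what a workflow manager provides for free.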
Pipeline/Workflow Tools
Snakemake: Python-based workflow manager
Make: common tool for building software on UNIX-based systems
Nextflow: based on the dataflow programming model
Airflow: Airbnb/Apache pipeline creator, Python
SciPipe: Go-based pipeline language
Snakemake
Python-based workflow language/software, based on GNU make
File-based production rules: workflows are defined by rules describing how to create target files
input file(s) → transform → output file(s)
Excellent documentation
Implicit parallelism and cluster integration
Includes reproducibility tools: shared rule wrappers, Conda and Singularity integration
Snakemake Example: Variant Calling
Starting data: short read sequences in FASTQ format
External data: genome FASTA file
Target output: VCF file of found variants
Required steps:
1. Map reads to genome
2. Sort mapped reads
3. Index sorted reads
4. Call variants
Snakemake Example Show Data
Snakemake Example - Mapping
Define a rule that takes a FASTQ file as input and generates a BAM file:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
Snakemake Example Execute Code
Snakemake Example - Generalize
Generalize the rule to handle multiple samples using a wildcard:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
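The wildcard mechanism can be sketched in plain Python: given a requested target file, match it against the rule's output pattern, then substitute the captured wildcard into the input pattern. This mimics Snakemake's behavior for illustration only; it is not Snakemake's actual implementation, and the patterns are the ones from the rule above:

```python
import re

output_pattern = "mapped_reads/{sample}.bam"
input_pattern = "data/samples/{sample}.fastq"

def resolve(target: str) -> str:
    """Infer the input file a rule needs to produce the requested target."""
    # Turn the output pattern into a regex with a named capture group.
    regex = re.escape(output_pattern).replace(r"\{sample\}", r"(?P<sample>[^/]+)")
    m = re.fullmatch(regex, target)
    # Fill the captured wildcard value into the input pattern.
    return input_pattern.format(**m.groupdict())

print(resolve("mapped_reads/B.bam"))  # data/samples/B.fastq
```

This is why one rule suffices for any number of samples: the wildcard value is determined by whichever target file is requested.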
Snakemake Example Execute Code
Snakemake Example - Sort Alignments
Create a rule that takes BAM files and sorts them:

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"
Snakemake Example Execute Code
Snakemake Example - Index BAM
Create a rule that takes sorted BAM files and indexes them:

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"
Snakemake Example - Execute Code and Visualize DAG
snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
Snakemake Example - Call Variants
Create a rule that aggregates reads from all samples and jointly calls genomic variants:

SAMPLES = ["A", "B"]

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
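What expand() does can be approximated in a few lines of plain Python: it fills a pattern with every value of a wildcard and returns the resulting list of paths. This is a simplified single-wildcard sketch for illustration, not Snakemake's real expand(), which also handles combinations of multiple wildcards:

```python
def expand(pattern, **wildcards):
    """Simplified sketch of snakemake's expand(): one wildcard only."""
    (name, values), = wildcards.items()  # exactly one wildcard for this sketch
    return [pattern.format(**{name: v}) for v in values]

SAMPLES = ["A", "B"]
print(expand("sorted_reads/{sample}.bam", sample=SAMPLES))
# ['sorted_reads/A.bam', 'sorted_reads/B.bam']
```

This is how the aggregation rule turns a list of sample names into the full set of per-sample input files it depends on.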
Snakemake Example Execute Code
Snakemake Example - Plot Quality
Create a rule that takes the VCF file and plots a summary:

rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    script:
        "scripts/plot-quals.py"
Snakemake Example Show and Execute Code
Snakemake Example - Target Rule
Create a rule that specifies the final target file:

rule all:
    input:
        "plots/quals.svg"
Snakemake Example Execute Code Push Final Changes
Summary
Workflow management software is indispensable:
Simplifies analyzing many files in the same way
Seamlessly runs code locally or on a cluster
Documents your analysis - reproducible!
Learn how to use it!
There are many different workflow/pipeline tools; no single one is best, so pick the one you like