Using Galaxy to build and run data processing pipelines Jelle Scholtalbers / Charles Girardot GBCS Genome Biology Computational Support.

1 Using Galaxy to build and run data processing pipelines Jelle Scholtalbers / Charles Girardot GBCS Genome Biology Computational Support

2 What is Galaxy?
A web-based framework for running command-line utilities from a graphical user interface:
- Keep track of history
- Share data and analysis steps
- Create workflows
- Visualize results

3 What is Galaxy?
An extremely active and popular open-source project:
- More than 60 public servers, with focuses including proteomics, metagenomics, and metabolomics
- Solid and stable team of developers
- Regularly occurring user conferences, on all continents
- National Galaxy hubs and workshop events
- Plenty of online learning material
It also provides advanced features for bioinformaticians:
- RESTful APIs and the BioBlend scripting interface
- Can be launched on the cloud
- …
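As a sketch of what scripting against the RESTful API looks like: Galaxy's API lives under `/api/` and authenticates with a per-user key, and the helper below only assembles the shape of such a call. The function name and dict layout are illustrative, not BioBlend itself; real scripts would use BioBlend's `GalaxyInstance` or an HTTP library.

```python
# Illustrative helper: assemble a Galaxy REST API request as a plain dict.
# Real scripts would use BioBlend (GalaxyInstance) or an HTTP client;
# this only shows the shape of an authenticated call.

def build_api_request(base_url, endpoint, api_key):
    """Build the method/URL/params for a GET against Galaxy's API."""
    return {
        "method": "GET",
        "url": f"{base_url.rstrip('/')}/api/{endpoint.lstrip('/')}",
        "params": {"key": api_key},  # Galaxy accepts the API key as a parameter
    }

req = build_api_request("https://usegalaxy.org", "histories", "MY_SECRET_KEY")
print(req["url"])
```

Listing histories, datasets, and workflows all follow this same endpoint pattern, which is what makes a thin wrapper like BioBlend practical.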

4 Why do we need it?
- Easy to manage your workspace
- Rerun tools with a click
- Store, export, and share complete analyses
- Has a workflow manager

5 The main public instance is at http://usegalaxy.org

6 Tools are on the left, the history on the right. (Screenshot panels: Available Tools, Dataset History)

7 Tool parameters are set in the central view. (Screenshot panels: Available Tools, main pane to run tools and view results, Dataset History)

8 A tool without the UI looks like:

$ fastqc --help
FastQC - A high throughput sequence QC analysis tool

SYNOPSIS
    fastqc seqfile1 seqfile2 .. seqfileN
    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN
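What a Galaxy tool wrapper essentially does is map form fields to the flags in a synopsis like the one above. The sketch below builds a FastQC command line from GUI-style parameters; the function name and parameter names are illustrative, not Galaxy's actual wrapper code, and the command is only assembled, not executed.

```python
# Sketch of UI-to-CLI mapping: turn form-style parameters into the
# fastqc argument vector from the SYNOPSIS above. Illustrative only,
# not Galaxy's actual tool wrapper; the command is built, not run.

def build_fastqc_argv(seqfiles, outdir=None, fmt=None, extract=False):
    argv = ["fastqc"]
    if outdir:
        argv += ["-o", outdir]                      # [-o output dir]
    argv.append("--extract" if extract else "--noextract")  # [--(no)extract]
    if fmt:
        argv += ["-f", fmt]                         # [-f fastq|bam|sam]
    return argv + list(seqfiles)                    # seqfile1 .. seqfileN

print(build_fastqc_argv(["reads_R1.fastq"], outdir="qc", fmt="fastq"))
```

A real wrapper also declares input/output datatypes so Galaxy can track the files in the history; this sketch covers only the command-line side.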

9 With the UI: (Screenshot: the same options presented as a form in the main pane, with Available Tools and Dataset History panels on either side)

10 (Screenshot: viewing results in the main pane, via the "View" button in the history)

11 Rerun tools. (Screenshot: the "Rerun", "Save", and "View" buttons on a history dataset)

12 Shared data libraries

13 Shared histories

14 Shared workflows

15 What is a workflow manager? It allows one to create a chain of dependent tasks to achieve a defined goal.
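The idea can be sketched in a few lines: run tasks in order, feeding each step's output to the next. The tool names mirror the ChIP-seq pipeline in this talk, but the runner and the stand-in step functions are purely illustrative, not Galaxy code.

```python
# Minimal sketch of a workflow manager: a chain of dependent steps,
# each consuming the previous step's output. Step bodies are toy
# stand-ins; only the chaining structure is the point.

def run_workflow(steps, data):
    """Apply (name, func) steps in order; each consumes the previous output."""
    for name, func in steps:
        data = func(data)
        print(f"{name}: done")
    return data

steps = [
    ("fastqc",  lambda reads: reads),                       # QC: pass-through here
    ("bowtie2", lambda reads: [r.upper() for r in reads]),  # stand-in for mapping
    ("filter",  lambda hits: [h for h in hits if h]),       # drop empty records
]
result = run_workflow(steps, ["acgt", "ttga"])
```

A real workflow manager adds what this sketch omits: scheduling independent steps in parallel, persisting intermediate datasets, and resuming after failures.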

16 Instead of

17 Galaxy workflows. (Screenshot: workflow editor, with the main pane for designing the workflow, tool parameters, and Available Tools)

18 Galaxy workflows

19

20 Galaxy at EMBL runs on a compute cluster. (Diagram: jobs such as fastqc, bowtie, and flagstat are sent to the cluster queue and executed on the compute cluster)
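The setup in the diagram can be sketched as jobs pushed onto a queue by Galaxy and drained by the cluster. A `deque` stands in for the scheduler; nothing here is Galaxy- or cluster-specific.

```python
from collections import deque

# Sketch of the slide's architecture: Galaxy submits tool jobs to a
# cluster queue; the cluster drains the queue in submission order.
# A deque stands in for a real scheduler (e.g. a batch system).

queue = deque(["fastqc", "bowtie", "flagstat"])  # jobs submitted by Galaxy
completed = []
while queue:
    job = queue.popleft()   # cluster picks the next queued job
    completed.append(job)   # stand-in for actually running it
print(completed)
```

Decoupling submission from execution like this is what lets the web interface stay responsive while long mapping jobs run on the cluster.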

21 Output states: click on the bug icon or the info button for details on a dataset.

22 Practical: Building a workflow for ChIP-seq processing in Galaxy (http://galaxy.embl.de)

23 Exercise 1: Build a workflow for basic processing of FASTQ files, according to the specifications below.
Go to Workflow => "Create new workflow"
Input dataset: 1 fastq file
Steps:
- Check read quality (Tool: FastQC)
- Map reads with bowtie2 (Tool: Bowtie2, organism: dm3)
- Remove unmapped and multi-mapping reads (Tool: Filter BAM)
- Remove duplicates (Tool: MarkDuplicates)
- Check strand cross-correlation (Tool: SPP, replicates removed: yes)
- Generate a bigwig coverage file for visualization (Tool: bamCoverage, organism: dm3)
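The steps above can be sketched as an ordered list of (tool, parameters) pairs. Tool names and parameters come from the exercise; the list/dict layout is illustrative, not Galaxy's actual workflow format (which serializes to JSON in the editor).

```python
# The Exercise 1 pipeline as ordered (tool, parameters) pairs.
# Illustrative data structure only, not Galaxy's workflow format.
chipseq_workflow = [
    ("FastQC",         {}),                             # check read quality
    ("Bowtie2",        {"organism": "dm3"}),            # map reads
    ("Filter BAM",     {}),                             # drop unmapped / multi-mappers
    ("MarkDuplicates", {}),                             # remove duplicates
    ("SPP",            {"replicates_removed": True}),   # strand cross-correlation
    ("bamCoverage",    {"organism": "dm3"}),            # bigwig for visualization
]

for i, (tool, params) in enumerate(chipseq_workflow, start=1):
    print(f"step {i}: {tool} {params}")
```

Reading the workflow as a flat list like this also makes the single linear dependency chain explicit: each tool consumes the previous tool's output dataset.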

24

25 Exercise 2: Execute the workflow
Grab a (small) data file:
- Go to Shared Data | Data Libraries Beta | Training | ChIPseq Training
- Select the file "K27ac_R2_chr2L_1-5M.fastq"
- Click the "to History" button to import the dataset into your history
Execute your workflow:
- Go to Workflow
- Locate your workflow, click the down arrow, and select "Run"
- Set parameters where needed and click "Run Workflow"

26 Exercise 3: Check workflow results
- Look at the FastQC and SPP results; documentation at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/ or http://bit.ly/1RcoFtN
- Check read statistics (MarkDuplicates report)
- Visualize the bigwig file in Trackster (dm3 genome)
I have pre-run a similar workflow on two K27ac replicates and their input control. You can get all QC results by importing the history "EMBL ChIP-seq Training: QC Results".

27 Break

28 Exercise 4: Run additional quality checks and call peaks using the IDR workflow
We collected all filtered BAM and bigwig files in the history "EMBL ChIP-seq Training : Result files (BAMs, bigwig) and further analysis":
- Import it
- Check the results for "Correlate BAM GenomeWide (2K bins)". NB: you can see what was run by clicking "Run this job again"
- Check the results for "bamFingerprint GenomeWide" (dataset 16)
- Check the IDR results (datasets 25 and 28)

29 Exercise 5: Generate heatmap and average plots
Still using the history "EMBL ChIP-seq Training : Result files (BAMs, bigwig) and further analysis":
- Prepare a signal file representing the IP signal corrected for input (subtraction, e.g. IP - input), in which both IP and input are replicate averages.
  - Use the tools "Average multiple (Big)Wig files" and "Subtract two (Big)Wig files"
  - Convert the final file to bigwig format with "Wig/BedGraph-to-bigWig converter"
  - Precomputed datasets: 38 to 41
- Check the results and visualize all bigwig files (individual and summarized) in Trackster
  - Use the genome layout fetched from UCSC (dataset 42)
- Prepare a data matrix summarizing signal values around all TSSs of the genome
  - TSSs are defined in the "35 : TSS_dm3.bed" file
  - Use the computeMatrix tool (result: dataset 43)
- Plot the data matrix as a heatmap and an average profile
  - Use Deeptools' heatmapper (result: dataset 44)
  - Use Deeptools' profiler (result: dataset 45)
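The averaging and subtraction step above can be sketched on per-bin coverage values. Plain Python lists stand in for genome-wide (Big)Wig tracks, and the coverage numbers are made up for illustration; the arithmetic is the same per-bin operation the Galaxy tools perform.

```python
# Sketch of the signal correction step: average the IP replicates,
# average the input replicates, then subtract input from IP per bin.
# Plain lists stand in for (Big)Wig coverage tracks; values are made up.

def average(*tracks):
    """Per-bin mean across replicate tracks of equal length."""
    return [sum(vals) / len(vals) for vals in zip(*tracks)]

def subtract(a, b):
    """Per-bin difference (e.g. IP - input)."""
    return [x - y for x, y in zip(a, b)]

ip_rep1, ip_rep2 = [4.0, 8.0, 2.0], [6.0, 4.0, 2.0]       # made-up IP coverage
input_rep1, input_rep2 = [1.0, 2.0, 1.0], [1.0, 2.0, 3.0]  # made-up input coverage

corrected = subtract(average(ip_rep1, ip_rep2),
                     average(input_rep1, input_rep2))
print(corrected)  # IP minus input, per bin
```

Averaging before subtracting keeps a single outlier replicate from dominating the corrected track, which is why the exercise averages both IP and input first.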

30 Part of the FastQC wrapper

31 Exercise 1

32

