Using Galaxy to build and run data processing pipelines Jelle Scholtalbers / Charles Girardot GBCS Genome Biology Computational Support.

1 Using Galaxy to build and run data processing pipelines Jelle Scholtalbers / Charles Girardot GBCS Genome Biology Computational Support

2 What is Galaxy?
A web-based framework for running command-line utilities from a graphical user interface:
- Keep track of history
- Share data and analysis steps
- Create workflows
- Visualize results

3 What is Galaxy?
An extremely active and popular open-source project:
- More than 60 public servers, with focuses including proteomics, metagenomics, and metabolomics
- Solid and stable team of developers
- Regularly occurring user conferences, on all continents
- National Galaxy hubs and workshop events
- Plenty of online learning material
It also provides advanced features for bioinformaticians:
- RESTful APIs and the BioBlend scripting interface
- Can be launched on the cloud
- …
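As a sketch of what scripting against the RESTful API looks like: Galaxy's API lives under `/api/` and authenticates with a per-user key, and the helper below only assembles the shape of such a call. The function name and dict layout are illustrative, not BioBlend itself; real scripts would use BioBlend's `GalaxyInstance` or an HTTP library.

```python
# Illustrative helper: assemble a Galaxy REST API request as a plain dict.
# Real scripts would use BioBlend (GalaxyInstance) or an HTTP client;
# this only shows the shape of an authenticated call.

def build_api_request(base_url, endpoint, api_key):
    """Build the method/URL/params for a GET against Galaxy's API."""
    return {
        "method": "GET",
        "url": f"{base_url.rstrip('/')}/api/{endpoint.lstrip('/')}",
        "params": {"key": api_key},  # Galaxy accepts the API key as a parameter
    }

req = build_api_request("https://usegalaxy.org", "histories", "MY_SECRET_KEY")
print(req["url"])
```

Listing histories, datasets, and workflows all follow this same endpoint pattern, which is what makes a thin wrapper like BioBlend practical.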

4 Why do we need it?
- Easy to manage your workspace
- Rerun tools with a click
- Store, export, and share complete analyses
- Has a workflow manager

5 The main public instance is at http://usegalaxy.org

6 Tools are on the left, the history on the right. (Screenshot panels: Available Tools, Dataset History)

7 Tool parameters are set in the central view. (Screenshot panels: Available Tools, main pane to run tools and view results, Dataset History)

8 A tool without the UI looks like:

$ fastqc --help
FastQC - A high throughput sequence QC analysis tool

SYNOPSIS
    fastqc seqfile1 seqfile2 .. seqfileN
    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN
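What a Galaxy tool wrapper essentially does is map form fields to the flags in a synopsis like the one above. The sketch below builds a FastQC command line from GUI-style parameters; the function name and parameter names are illustrative, not Galaxy's actual wrapper code, and the command is only assembled, not executed.

```python
# Sketch of UI-to-CLI mapping: turn form-style parameters into the
# fastqc argument vector from the SYNOPSIS above. Illustrative only,
# not Galaxy's actual tool wrapper; the command is built, not run.

def build_fastqc_argv(seqfiles, outdir=None, fmt=None, extract=False):
    argv = ["fastqc"]
    if outdir:
        argv += ["-o", outdir]                      # [-o output dir]
    argv.append("--extract" if extract else "--noextract")  # [--(no)extract]
    if fmt:
        argv += ["-f", fmt]                         # [-f fastq|bam|sam]
    return argv + list(seqfiles)                    # seqfile1 .. seqfileN

print(build_fastqc_argv(["reads_R1.fastq"], outdir="qc", fmt="fastq"))
```

A real wrapper also declares input/output datatypes so Galaxy can track the files in the history; this sketch covers only the command-line side.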

9 With the UI: (Screenshot: the same options presented as a form in the main pane, with Available Tools and Dataset History panels on either side)

10 (Screenshot: viewing results in the main pane, via the "View" button in the history)

11 Rerun tools. (Screenshot: the "Rerun", "Save", and "View" buttons on a history dataset)

12 Shared data libraries

13 Shared histories

14 Shared workflows

15 What is a workflow manager? It allows one to create a chain of dependent tasks to achieve a defined goal.
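The idea can be sketched in a few lines: run tasks in order, feeding each step's output to the next. The tool names mirror the ChIP-seq pipeline in this talk, but the runner and the stand-in step functions are purely illustrative, not Galaxy code.

```python
# Minimal sketch of a workflow manager: a chain of dependent steps,
# each consuming the previous step's output. Step bodies are toy
# stand-ins; only the chaining structure is the point.

def run_workflow(steps, data):
    """Apply (name, func) steps in order; each consumes the previous output."""
    for name, func in steps:
        data = func(data)
        print(f"{name}: done")
    return data

steps = [
    ("fastqc",  lambda reads: reads),                       # QC: pass-through here
    ("bowtie2", lambda reads: [r.upper() for r in reads]),  # stand-in for mapping
    ("filter",  lambda hits: [h for h in hits if h]),       # drop empty records
]
result = run_workflow(steps, ["acgt", "ttga"])
```

A real workflow manager adds what this sketch omits: scheduling independent steps in parallel, persisting intermediate datasets, and resuming after failures.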

16 Instead of

17 Galaxy workflows. (Screenshot: workflow editor, with the main pane for designing the workflow, tool parameters, and Available Tools)

18 Galaxy workflows

19

20 Galaxy at EMBL runs on a compute cluster. (Diagram: jobs such as fastqc, bowtie, and flagstat are sent to the cluster queue and executed on the compute cluster)
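The setup in the diagram can be sketched as jobs pushed onto a queue by Galaxy and drained by the cluster. A `deque` stands in for the scheduler; nothing here is Galaxy- or cluster-specific.

```python
from collections import deque

# Sketch of the slide's architecture: Galaxy submits tool jobs to a
# cluster queue; the cluster drains the queue in submission order.
# A deque stands in for a real scheduler (e.g. a batch system).

queue = deque(["fastqc", "bowtie", "flagstat"])  # jobs submitted by Galaxy
completed = []
while queue:
    job = queue.popleft()   # cluster picks the next queued job
    completed.append(job)   # stand-in for actually running it
print(completed)
```

Decoupling submission from execution like this is what lets the web interface stay responsive while long mapping jobs run on the cluster.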

21 Output states: click on the bug icon or the info button for details on a dataset.

22 Practical: Building a workflow for ChIP-seq processing in Galaxy (http://galaxy.embl.de)

23 Exercise 1: Build a workflow for basic processing of FASTQ files, according to the specifications below.
Go to Workflow => "Create new workflow"
Input dataset: 1 fastq file
Steps:
- Check read quality (Tool: FastQC)
- Map reads with bowtie2 (Tool: Bowtie2, organism: dm3)
- Remove unmapped and multi-mapping reads (Tool: Filter BAM)
- Remove duplicates (Tool: MarkDuplicates)
- Check strand cross-correlation (Tool: SPP, replicates removed: yes)
- Generate a bigwig coverage file for visualization (Tool: bamCoverage, organism: dm3)
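The steps above can be sketched as an ordered list of (tool, parameters) pairs. Tool names and parameters come from the exercise; the list/dict layout is illustrative, not Galaxy's actual workflow format (which serializes to JSON in the editor).

```python
# The Exercise 1 pipeline as ordered (tool, parameters) pairs.
# Illustrative data structure only, not Galaxy's workflow format.
chipseq_workflow = [
    ("FastQC",         {}),                             # check read quality
    ("Bowtie2",        {"organism": "dm3"}),            # map reads
    ("Filter BAM",     {}),                             # drop unmapped / multi-mappers
    ("MarkDuplicates", {}),                             # remove duplicates
    ("SPP",            {"replicates_removed": True}),   # strand cross-correlation
    ("bamCoverage",    {"organism": "dm3"}),            # bigwig for visualization
]

for i, (tool, params) in enumerate(chipseq_workflow, start=1):
    print(f"step {i}: {tool} {params}")
```

Reading the workflow as a flat list like this also makes the single linear dependency chain explicit: each tool consumes the previous tool's output dataset.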

24

25 Exercise 2: Execute the workflow
Grab a (small) data file:
- Go to Shared Data | Data Libraries Beta | Training | ChIPseq Training
- Select the file "K27ac_R2_chr2L_1-5M.fastq"
- Click the "to History" button to import the dataset into your history
Execute your workflow:
- Go to Workflow
- Locate your workflow, click the down arrow, and select "Run"
- Set parameters where needed and click "Run Workflow"

26 Exercise 3: Check workflow results
- Look at the FastQC and SPP results; documentation at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/ or http://bit.ly/1RcoFtN
- Check read statistics (MarkDuplicates report)
- Visualize the bigwig file in Trackster (dm3 genome)
I have pre-run a similar workflow on two K27ac replicates and their input control. You can get all QC results by importing the history "EMBL ChIP-seq Training: QC Results".

27 Break

28 Exercise 4: Run additional quality checks and call peaks using the IDR workflow
We collected all filtered BAM and bigwig files in the history "EMBL ChIP-seq Training : Result files (BAMs, bigwig) and further analysis":
- Import it
- Check the results for "Correlate BAM GenomeWide (2K bins)". NB: you can see what was run by clicking "Run this job again"
- Check the results for "bamFingerprint GenomeWide" (dataset 16)
- Check the IDR results (datasets 25 and 28)

29 Exercise 5: Generate heatmap and average plots
Still using the history "EMBL ChIP-seq Training : Result files (BAMs, bigwig) and further analysis":
- Prepare a signal file representing the IP signal corrected for input (subtraction, e.g. IP - input), in which both IP and input are replicate averages.
  - Use the tools "Average multiple (Big)Wig files" and "Subtract two (Big)Wig files"
  - Convert the final file to bigwig format with "Wig/BedGraph-to-bigWig converter"
  - Precomputed datasets: 38 to 41
- Check the results and visualize all bigwig files (individual and summarized) in Trackster
  - Use the genome layout fetched from UCSC (dataset 42)
- Prepare a data matrix summarizing signal values around all TSSs of the genome
  - TSSs are defined in the "35 : TSS_dm3.bed" file
  - Use the computeMatrix tool (result: dataset 43)
- Plot the data matrix as a heatmap and an average profile
  - Use Deeptools' heatmapper (result: dataset 44)
  - Use Deeptools' profiler (result: dataset 45)
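The averaging and subtraction step above can be sketched on per-bin coverage values. Plain Python lists stand in for genome-wide (Big)Wig tracks, and the coverage numbers are made up for illustration; the arithmetic is the same per-bin operation the Galaxy tools perform.

```python
# Sketch of the signal correction step: average the IP replicates,
# average the input replicates, then subtract input from IP per bin.
# Plain lists stand in for (Big)Wig coverage tracks; values are made up.

def average(*tracks):
    """Per-bin mean across replicate tracks of equal length."""
    return [sum(vals) / len(vals) for vals in zip(*tracks)]

def subtract(a, b):
    """Per-bin difference (e.g. IP - input)."""
    return [x - y for x, y in zip(a, b)]

ip_rep1, ip_rep2 = [4.0, 8.0, 2.0], [6.0, 4.0, 2.0]       # made-up IP coverage
input_rep1, input_rep2 = [1.0, 2.0, 1.0], [1.0, 2.0, 3.0]  # made-up input coverage

corrected = subtract(average(ip_rep1, ip_rep2),
                     average(input_rep1, input_rep2))
print(corrected)  # IP minus input, per bin
```

Averaging before subtracting keeps a single outlier replicate from dominating the corrected track, which is why the exercise averages both IP and input first.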

30 Part of the FastQC wrapper

31 Exercise 1

32

