Galaxy for Bioinformatics Analysis: An Introduction
TCD Bioinformatics Support Team
Fiona Roche, PhD
Date: 31/08/15

Overview
- What is Galaxy?
- Why is it useful?
- Command-line vs Galaxy
- A Basic Analysis with Galaxy
- Resources for Learning

What is Galaxy?
A web-based genome analysis platform designed for experimental biologists.
www.galaxyproject.org

Why is it useful to a biologist?
- Easy to use!
- Allows data import from popular resources
- Provides access to best-practice bioinformatics tools
- Allows you to build analysis pipelines and share them
- Provides multiple ways to visualise your data

Trinity College Dublin, The University of Dublin
Case Study: ChIP-seq Analysis Pipeline
Sequencing -> Pre-processing of raw reads -> Map reads to reference genome -> Quality control -> Peak calling -> Enriched regions

Case Study: ChIP-seq Analysis Pipeline
Sequencing -> Pre-processing of raw reads -> Map reads to reference genome -> Quality control -> Peak calling -> Enriched regions
Downstream analyses:
- Visualisation with genome browser
- Motif discovery
- Relationship with gene structure
- Gene set analysis
- Differential profile analysis

Question: Which gene promoter regions do these enriched regions map to?

Command-line approach
1. Extract gene coordinates from UCSC
2. Extract 1kb upstream coordinates from UCSC
3. Merge upstream coordinates and gene annotation
4. Clean files
5. Join the input files
6. Create user track for UCSC
7. Import to UCSC
8. Run a wrapper script to enable a re-run of this pipeline with different parameters

Galaxy Approach

The Galaxy Interface
- Datasources and Tools
- Main Analysis window
- History of commands
- Main Menus

Overview of Analysis
Question: Which gene promoter regions do these enriched regions map to?
Analysis steps:
- Import two datasets into Galaxy
  1. Genomic coordinates of enriched peaks
  2. Genomic coordinates of genes
- Extract upstream regions of genes
- Data cleaning
- Identify overlap between promoter regions and enriched regions
- Visualise on a genome browser

Let’s begin!
Register an account: http://bioinf.gen.tcd.ie/workshops/Galaxy/

Let’s begin!
Step 1: Get Data into Galaxy

Step 1: Get data #1 -> TAF1 peaks
Get Data -> Upload File -> Paste/Fetch -> Enter URL -> Start
1. Click Upload File
2. Click Paste/Fetch to display the URL box
3. Paste in the URL containing your data: http://bioinf.gen.tcd.ie/workshops/Galaxy/TAF1_peaks.txt
4. Select the ‘tabular’ file type
5. Type hg19 and select Human Feb. 2009 (GRCh37/hg19) (hg19)
6. Click Start to upload the data to your history!
7. Click Close

Data uploaded to your history!
- The file was sent to your history and given a number
- The history keeps track of all steps in your analysis

Step 2: Rename your History
1. Click the history name to rename your history
2. Click the cog wheel if you want to create a new history or see a list of your saved histories
You can have multiple histories with different names.

Step 3: Review your dataset
1. Click on the dataset name to expand/collapse the metadata and mini view of the file content
2. Click the eye icon to see the file contents in the main analysis window
3. Click the pencil icon to edit the file attributes
4. Click the x to delete the file

Step 4a: Edit dataset
Change the file name to a shorter name
1. Click the pencil icon to edit the file attributes
2. There are four tabs in edit mode; to change the file name, click Attributes
3. First rename the file
4. Copy and paste the old name into the info field to keep a record of it
5. Click save

Step 4b: Edit dataset
Change the file format so Galaxy knows where to find chr, start and end
1. Click Datatype to change the file format
2. Select interval from the drop-down and then click save
3. Define which columns of your TAF1 file are “chrom”, “start” and “end”. Look at the mini view to see your TAF1 file
4. Click save
5. The format is changed to interval. Galaxy now knows where chr, start and end are.
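To illustrate what "defining the chrom, start and end columns" means, here is a minimal Python sketch. It is not how Galaxy stores this metadata; the function name, the default column numbers and the example rows are assumptions made up for illustration, and the real column layout of the TAF1 file should be read from the mini view.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def to_interval(line, chrom_col=1, start_col=2, end_col=3):
    """Read one tab-separated row as an interval.

    Column numbers are 1-based. The defaults are an assumption for
    illustration, not the known layout of the TAF1 file.
    """
    fields = line.rstrip("\n").split("\t")
    return Interval(
        fields[chrom_col - 1],
        int(fields[start_col - 1]),
        int(fields[end_col - 1]),
    )


# Hypothetical rows: the same interval stored with two different column layouts.
print(to_interval("chr1\t100\t500\tpeak1"))
print(to_interval("peak1\tchr1\t100\t500", chrom_col=2, start_col=3, end_col=4))
```

Both calls describe the same interval. The only thing that changes is which columns the reader is told to use, which is the information the interval datatype records.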

Step 5: Get data #2 -> Genes
Get Data -> UCSC Main Table Browser

Step 5: Get data #2 -> Genes
1. Select all fields from the drop-downs as shown on the slide, then click get output
2. Click Send query to Galaxy

Step 6: Edit dataset
- Click the pencil icon to edit the file name
- Change the file name to a shorter name
- File name changed
- File format = bed, so Galaxy already knows where chr, start and end are

Step 7: Get Promoter Regions
Tool: Operate on Genomic Intervals -> Get Flanks
1. Select the Genes dataset
2. Select upstream
3. Select 1000bp upstream
4. Click Execute
5. Output sent to history! Same file content as ‘Genes’, but the start and end coordinates are replaced with those of the promoter regions
6. Rename the file to ‘Promoters’
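The idea behind getting upstream flanks can be sketched in a few lines of Python. This is an illustration, not the Get Flanks tool's code. It assumes 0-based, half-open (BED-style) coordinates and that "upstream" is taken relative to the strand; the function name and the example gene coordinates are hypothetical.

```python
def upstream_flank(chrom, start, end, strand, size=1000):
    """Return the `size`-bp region immediately upstream of a gene.

    Assumes 0-based, half-open coordinates (BED style).
    """
    if strand == "-":
        # On the minus strand, upstream lies beyond the end coordinate.
        return chrom, end, end + size, strand
    # On the plus strand, upstream lies before the start coordinate.
    # Clamp at 0 so the flank never runs off the chromosome start.
    return chrom, max(0, start - size), start, strand


# Hypothetical gene at chr1:5000-8000
print(upstream_flank("chr1", 5000, 8000, "+"))  # ('chr1', 4000, 5000, '+')
print(upstream_flank("chr1", 5000, 8000, "-"))  # ('chr1', 8000, 9000, '-')
print(upstream_flank("chr1", 5000, 8000, "+", size=3000))  # ('chr1', 2000, 5000, '+')
```

The `size` argument plays the role of the flank length chosen in the tool form, so moving from 1000bp to 3000bp is a one-value change.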

Step 8: Clean dataset
Tool: Text Manipulation -> Cut
1. Cut out the specific columns we want from the ‘Promoters’ file
2. Click Execute
3. Rename the output file to ‘Clean Promoters’
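As a rough picture of what cutting columns does to each row, consider the Python sketch below. The slide does not list which columns are kept, so the selection here (columns 1, 2, 3, 4 and 6) and the example row are assumptions for illustration only.

```python
def cut_columns(line, columns):
    """Keep only the given 1-based columns of a tab-separated row."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[c - 1] for c in columns)


# Hypothetical promoter row: chrom, start, end, name, score, strand
row = "chr1\t4000\t5000\tgeneA\t0\t+"

# Assumed selection: drop the score column, keep everything else.
print(cut_columns(row, [1, 2, 3, 4, 6]))
```

Columns are returned in the order requested, so the same function can also reorder columns if needed.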

Datasets ready for analysis!
Dataset #1 and Dataset #2:
- Both files are associated with human hg19
- Galaxy knows where chr, start and end are for each file
Now we are ready to join these files and see which promoters have TAF1 peaks!

How do we Join Genomic Intervals?
Example
Interval file #1
Chr   Start  End   Name    Strand
Chr1  100    500   int1    +
Chr1  1000   1200  int2    +
Interval file #2
Chr   Start  End   Name    Strand
Chr1  200    400   cloneA  +
Chr1  900    1000  cloneB  +
Intervals that overlap!
Chr1  100    500   int1    +
Chr1  200    400   cloneA  +
Diagram: #1 spans 100-500 and 1000-1200; #2 spans 200-400 and 900-1000
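The example above can be checked with a short Python sketch. It uses the four intervals from the slide and assumes half-open coordinates, so that 900-1000 and 1000-1200 touch but do not overlap. The variable and function names are made up for illustration; this is not the Join tool's implementation.

```python
# Interval files from the example: (chrom, start, end, name, strand)
intervals_1 = [("Chr1", 100, 500, "int1", "+"), ("Chr1", 1000, 1200, "int2", "+")]
intervals_2 = [("Chr1", 200, 400, "cloneA", "+"), ("Chr1", 900, 1000, "cloneB", "+")]


def overlaps(a, b):
    """True if two intervals share at least one base (half-open coordinates)."""
    same_chrom = a[0] == b[0]
    return same_chrom and a[1] < b[2] and b[1] < a[2]


# Pair every interval in file #1 with every overlapping interval in file #2.
overlapping = [(a, b) for a in intervals_1 for b in intervals_2 if overlaps(a, b)]

for a, b in overlapping:
    print(a[3], "overlaps", b[3])  # int1 overlaps cloneA
```

Comparing every pair is fine for a four-row example. For files with tens of thousands of rows, sorted or indexed approaches are the usual choice.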

Step 9: Join on Genomic Intervals
Tool: Operate on Genomic Intervals -> Join
- The first dataset is the one we want to filter (i.e. the large dataset containing all of the promoter regions)
- The second dataset is the one we use as the filter (i.e. we want to filter the promoter dataset for just those regions that contain the TAF1 peaks)
- Inner join returns only the genomic regions that overlap in both files
- Click Execute

Step 9: Join on Genomic Intervals: Output
We have reduced the promoters from >54,000 to 154! All of these promoter regions contain a TAF1 peak region.
Rename the output file to ‘Overlap’

Step 10: Build Custom Tracks for UCSC
Tool: Graph/Display Data -> Build custom track
Click ‘Insert Track’ to open the track information. We will add three tracks to UCSC:
1. TAF1 peaks
2. Promoter regions
3. TAF1 peaks in promoter regions

Step 10: Build Custom Tracks for UCSC
Track 1: TAF1 peaks
- Select dataset
- Label the track
- Describe the track
- Select the colour of the track
- Click ‘Insert Track’ to open another track

Step 10: Build Custom Tracks for UCSC Tracks 2 and 3: Click Execute when all three tracks are filled in

Step 10: Build Custom Tracks for UCSC: Output
- This single output file contains the information to visualise three tracks on UCSC Genome Browser
- Click the link in the output dataset to visualise your three tracks on UCSC Genome Browser
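For a sense of what a file holding several custom tracks looks like, the Python sketch below writes UCSC-style `track` header lines followed by BED rows. The exact output of the Build custom track tool may differ; the function name, descriptions, colours and coordinates are all hypothetical.

```python
def custom_track(name, description, color, intervals):
    """Format one UCSC custom track: a `track` header line plus BED rows."""
    r, g, b = color
    header = f'track name="{name}" description="{description}" color={r},{g},{b}'
    rows = [f"{chrom}\t{start}\t{end}" for chrom, start, end in intervals]
    return "\n".join([header] + rows)


# Hypothetical coordinates; in the workshop these come from the history datasets.
tracks = [
    custom_track("TAF1 peaks", "TAF1 enriched regions", (255, 0, 0), [("chr1", 4200, 4600)]),
    custom_track("Promoters", "1kb upstream of genes", (0, 0, 255), [("chr1", 4000, 5000)]),
    custom_track("Overlap", "Promoters with a TAF1 peak", (0, 128, 0), [("chr1", 4000, 5000)]),
]

# One text file can carry several tracks, one after another.
print("\n".join(tracks))
```

Each `track` line starts a new track, which is how a single file can display as three separate tracks in the browser.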

Visualisation on UCSC Genome Browser
- The three tracks
- Zoom out to see a larger genomic context

Extract Workflow from History
Want to rerun your analysis but extract 3kb upstream?
Click the cog wheel and select ‘Extract Workflow’ from the drop-down menu

Extract Workflow from History
- Create a workflow name
- The page lists all the tools used to create your history
- Click Create workflow

Extract Workflow from History
Click edit workflow, or access your workflows from the top menu

Editing Workflows
- Each box is a step of the analysis
- Noodles connect the steps
- Click on a box to edit the variables of that step in the Details section on the right (in orange)
- Use the blue window to move around the workflow

Editing Workflows
This input dataset is the transcription factor dataset. Label this dataset in the details box on the right.

Editing Workflows
This input dataset is the Gene dataset. Label this dataset in the details box on the right.

Editing Workflows
1. Click on the Get Flanks tool to edit the upstream promoter region
2. Change the upstream promoter region to 3000
3. Click the cog wheel to save the workflow, then click the cog wheel again to run the workflow

Running Workflows
1. Select the transcription factor file (e.g. TAF1_peaks)
2. Select the Genes file (e.g. Genes)
3. Send output to a new history
4. Run workflow and go for a coffee!!

Your new History!

Summary
What you learned today:
- Getting data into Galaxy
- How to review and edit datasets
- Running common Galaxy tools
- How to visualise your data in UCSC genome browser
- How to extract workflows from a history

Large Tool Repository

Data Visualisations
- UCSC Genome Browser
- Clustered heatmaps
- Visualisation of ChIP-seq data
- Charts
- Circster (structural variation)

Galaxy Learning Resources

Thank You
Please fill in the online survey at bioinf.gen.tcd.ie/surveys/Galaxy
