Advanced ChIPseq Identification of consensus binding sites for the LEAFY transcription factor.

Scientific Objective
The LEAFY transcription factor has been shown (Moyroud et al. 2011) to bind a dimer of the motif CCANTG[G/T].
We will use data from a chromatin immunoprecipitation assay on the LEAFY protein to:
– Identify LEAFY binding targets
– Attempt confirmation of the binding site

A Few Known LEAFY Targets
Gene Name        Locus
APETALA1 (AP1)   AT1G
AGAMOUS (AG)     AT4G
LMI2             AT3G
LMI3             AT5G
LMI4             AT5G
LMI5             AT1G
Look for LEAFY enrichment at these loci in IGV 2.0.

AP1 (APETALA1) Mutant
Why do we even care about LEAFY? Well, it activates AP1. If AP1 is not active, Arabidopsis can't make flowers and instead makes cauliflowers!
Figure labels: Wild-type, ap1

ChIPseq Conceptual Overview

The NCBI SRA
NCBI SRA is a repository for NGS sequence reads.
Data are stored together with basic metadata describing the experimental technique and inter-sample relationships.
Data are held in the NCBI-specific SRA and SRA-lite formats, a "universal" lossless format.
Upload and download are offered via FTP and HTTP, and also via Aspera ASCP:
– Fast, parallel protocol similar in performance to the iRODS iput/iget commands used in the iPlant Data Store
One can use NCBI SRA Import to rapidly copy SRA accession SRP over ASCP into the iPlant Data Store.

Import SRA data from NCBI SRA
Extract FASTQ files from the downloaded SRA archives

NCBI SRA Toolkit
The SRA data format is universal, but no downstream apps can accept it natively.
SRA data therefore need to be exported to FASTQ, SFF, etc., the standard file formats for representing sequence.
Use fastq-dump from the NCBI SRA Toolkit to export FASTQ sequence files from SRA files so we can process them.
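
For readers who have not seen FASTQ before, the following is a minimal Python sketch of reading the records that an export step like fastq-dump produces. It assumes the common unwrapped layout of four lines per read (header, sequence, separator, qualities); the function name `read_fastq` and the example read are made up for illustration and are not part of the workshop data.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        # Basic sanity checks: header marker and one quality character per base.
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality

# Hypothetical record, for illustration only.
example = io.StringIO("@read1\nCCAATGGT\n+\nIIIIHHHH\n")
records = list(read_fastq(example))
```

Each tuple holds the read identifier, the bases, and the per-base quality string, which is the information an aligner consumes in the next step.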

BWA
BWA is one of many applications designed to efficiently align short sequence reads to a reference genome sequence.
Alternatives include Bowtie, MAQ, TopHat, Stampy, Novoalign, etc.
BWA was developed and used by the human 1000 Genomes Project due to its speed and accuracy.
BWA mem is a fast variant of BWA able to use long reads. It is newly available in the iPlant DE.

Outputs from BWA
BWA emits alignments in the SAM format.
SAM is a universal system for describing next-gen sequences and their corresponding genome alignments.
SAMtools is a suite of applications for manipulating SAM files:
– Sort, Merge, Index, and more
– Emit as binary BAM file
All SAMtools functions are in the DE.
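
To make the SAM format more concrete, here is a small Python sketch that reads the first six of the eleven mandatory tab-separated fields of one alignment line and decodes two FLAG bits (0x4 for unmapped, 0x10 for reverse strand). The class and function names and the example line are hypothetical; real files also contain `@` header lines and optional tags that this sketch ignores.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)

def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line) into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamRecord(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5])

# Hypothetical alignment line, for illustration only.
line = "read1\t16\tChr1\t1000\t60\t8M\t*\t0\t0\tCCAATGGT\tIIIIHHHH"
rec = parse_sam_line(line)
```

In practice the SAMtools apps in the DE handle sorting, merging and indexing, so hand parsing like this is mainly useful for spot checks.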

Align FASTQ files to the Arabidopsis genome using BWA
Merge and index BAM files using SAMtools apps

PeakRanger
PeakRanger is a fast, optimized algorithm for detecting enrichment peaks in ChIPseq data sets.
PeakRanger was developed at OICR in partnership between modENCODE and iPlant and is now maintained at UTSW.
It's not the only option for peak finding:
– MACS
– ChIPseq Peak Finder
– CisGenome
– FindPeaks

Use PeakRanger with the BAM files from the Control and Sample assays to find LEAFY enrichment.
NOTE: There are many parameters to tweak. We recommend reading the PeakRanger paper.

Outputs from PeakRanger
Wiggle (.wig) files: Density map of sequence reads across the reference genome for control and sample BAM alignments
Region (.bed) file: Feature file containing the significantly enriched domains in the genome
Summit (.bed) file: Feature file containing the single-base maximum of each peak

Wiggle file BED file

Integrative Genomics Viewer
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
Use IGV to inspect outputs from PeakRanger.

Using IGV in Atmosphere
1. Launch an instance of RNA-Seq Visualization (or any image that has IGV) from the Atmosphere App list
2. Use a VNC client to connect to your remote desktop

Using IGV in Atmosphere
1. Configure iDrop
2. Copy .wig and .bed files from the PeakRanger output to your Atmosphere instance desktop

Using IGV in Atmosphere
1. Launch IGV (Integrative Genomics Viewer)
2. Change the current genome to A. thaliana (TAIR10)

Using IGV in Atmosphere
1. Open igvtools and convert the .wig file to .tdf
2. Load the .tdf and .bed files into the IGV window
3. Inspect loci by entering their names into the search box

Using IGV in Atmosphere
Enrichment region and alignment peak at the promoter region of APETALA1 (AP1)

Filtering the PeakRanger summits file
The statistically best summits from PeakRanger have P-values of zero. If you look at the summits.bed file you can see this is embedded in the names of the features. So, if we filter summits.bed for only lines matching pval_0, we will generate a BED file containing the summits most likely to be near true LEAFY binding sites.
This is identical to running egrep "pval_0" peakranger_summit.bed > peakranger_summit_best.bed on a command line.
DE app: Find Lines Matching a Regular Expression
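
The same filter can be sketched in Python. The `LOOSE` pattern below is exactly the `pval_0` expression from the egrep command. The `STRICT` pattern is an added assumption, not something the slides use: it would matter only if feature names can carry values such as `pval_0.004`, which a plain `pval_0` match would also keep. The feature names in the example are invented, since the real PeakRanger naming is not shown here.

```python
import re

LOOSE = re.compile(r"pval_0")             # same pattern as the egrep command
STRICT = re.compile(r"pval_0(?![.\d])")   # assumption: reject e.g. pval_0.004

def filter_summits(lines, pattern=LOOSE):
    """Keep only the BED lines whose text matches the pattern."""
    return [line for line in lines if pattern.search(line)]

# Hypothetical summit lines; the name column layout is made up.
summit_lines = [
    "Chr1\t1000\t1001\tsummit_a_pval_0\n",
    "Chr1\t5000\t5001\tsummit_b_pval_0.004\n",
    "Chr2\t700\t701\tsummit_c_pval_1e-08\n",
]
loose_hits = filter_summits(summit_lines)
strict_hits = filter_summits(summit_lines, STRICT)
```

It is worth opening the summits file once to see how the P-values are actually written before deciding which pattern is appropriate.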

BEDTools for Interval Operations
The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely used file formats: BED, GFF/GTF, VCF, and SAM/BAM. Using BEDTools, one can develop sophisticated pipelines that answer complicated research questions by "streaming" several BEDTools together.
* The entire BEDTools suite is now integrated into the iPlant DE. Follow us on to learn when new tools become available.
slopBed – Expand the coordinates of features in a BED file by a defined number of bases
fastaFromBed – Extract a multi-FASTA file from a reference sequence using a BED file of features
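
As an illustration of what "finding feature overlaps" means, here is a simplified Python sketch of the overlap test on BED intervals, which are 0-based and half-open (the end coordinate is not included). It is not a reimplementation of BEDTools: the function names and coordinates are hypothetical, and the all-against-all comparison is only practical for small inputs.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) BED intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def shared_regions(peaks_a, peaks_b):
    """Return the regions of peaks_a that overlap any region of peaks_b."""
    return [a for a in peaks_a if any(overlaps(a, b) for b in peaks_b)]

# Hypothetical peak regions from two peak callers.
caller_one = [("Chr1", 100, 300), ("Chr1", 900, 1100), ("Chr2", 50, 150)]
caller_two = [("Chr1", 250, 400), ("Chr2", 150, 260)]
both = shared_regions(caller_one, caller_two)
```

Note that the two Chr2 regions touch at position 150 but do not overlap, because half-open intervals that merely abut share no base.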

Objective
Go from a BED file of single-base peak summits to a FASTA file containing the 100 bp surrounding those summits that can be used for motif hunting.
Workflow:
1. Filter summits.bed on pval_0 → Best Summits BED File (single base pair features)
2. BEDTools slopBed, 50 bp equidistant → 100 bp Region BED File (100 bp centered on peak centers)
3. BEDTools fastaFromBed, Arabidopsis genome → FASTA file of 100 bp regions (likely to contain consensus motifs)
4. DREME
Separate branch: Peak regions from PeakRanger and/or MACS → IntersectBed peak regions → Peaks found by both codes
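
The two BEDTools steps can be sketched in plain Python to show what happens to the coordinates. This is an illustrative approximation under stated assumptions, not the BEDTools code: the function names, the toy genome, and the FASTA header layout are all chosen here for the example. The flank is clamped so that windows never run off the ends of a chromosome.

```python
def slop(chrom, start, end, flank, chrom_sizes):
    """Extend a BED interval by `flank` bases on both sides, clamped to the chromosome."""
    new_start = max(0, start - flank)
    new_end = min(chrom_sizes[chrom], end + flank)
    return chrom, new_start, new_end

def extract_fasta(regions, genome):
    """Build multi-FASTA text for 0-based, half-open regions of an in-memory genome."""
    records = []
    for chrom, start, end in regions:
        records.append(f">{chrom}:{start}-{end}\n{genome[chrom][start:end]}\n")
    return "".join(records)

# Toy 200 bp "chromosome"; a real run would use the Arabidopsis reference.
genome = {"ChrToy": "ACGT" * 50}
sizes = {name: len(seq) for name, seq in genome.items()}
summits = [("ChrToy", 100, 101), ("ChrToy", 10, 11)]  # single-base features
windows = [slop(*s, flank=50, chrom_sizes=sizes) for s in summits]
fasta = extract_fasta(windows, genome)
```

Under these coordinates a single-base summit extended by 50 bp on each side spans 101 bp rather than exactly 100, and a summit near a chromosome end yields a shorter window. Neither matters for motif discovery, but it explains small differences in sequence lengths.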

DREME
Run DREME on 100 bp windows surrounding LEAFY peaks
Download results

DREME results
CCANTG(G/T)! Success!

Potential Next Steps
Identify all consensus LEAFY sites in the genome that fall in promoters.
Extract all the promoters where LEAFY has significant binding and associate them with genes.
Generate a simple gene list and run Ontology Term enrichment analysis to find classes of genes influenced by LEAFY.
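
One possible starting point for the first step is a regular-expression scan. The sketch below reads the motif CCANTG[G/T] literally, with N as any base, and also searches for its reverse complement ([A/C]CANTGG) so that sites on either strand are reported. It looks for single copies of the motif only and does not model the dimer arrangement; the function name and the toy sequence are hypothetical.

```python
import re

# Lookaheads are used so that overlapping matches are not skipped.
FORWARD = re.compile(r"(?=(CCA[ACGT]TG[GT]))")   # CCANTG[G/T]
REVERSE = re.compile(r"(?=([AC]CA[ACGT]TGG))")   # reverse complement of the motif

def find_sites(sequence):
    """Return sorted (0-based start, strand, matched text) for each motif occurrence."""
    seq = sequence.upper()
    hits = [(m.start(), "+", m.group(1)) for m in FORWARD.finditer(seq)]
    hits += [(m.start(), "-", m.group(1)) for m in REVERSE.finditer(seq)]
    return sorted(hits)

# Toy sequence with one forward-strand and one reverse-strand site.
toy = "TTCCAATGTAAAACAATGGTT"
sites = find_sites(toy)
```

A sequence such as CCAATGG matches both patterns and would be reported once per strand, so downstream code may want to merge hits that share a start position. Restricting hits to promoters would then be an interval-overlap problem against a gene annotation.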

Cyberinfrastructure Overview
Component: iPlant Data Store
  What we did: Imported data from SRA. Stored results of analyses. Downloaded results.
  Why we used it: Fast, flexible storage for large bioinformatics data.
Component: Discovery Environment
  What we did: Data import. NGS Alignment. Peak Finding. Data organization.
  Why we used it: One interface. Multiple bioinformatics applications. Easy to manage work products.
Component: Atmosphere
  What we did: Loaded results into desktop client application.
  Why we used it: Avoid downloading large files to personal computer. Easy access to powerful desktop environment.

On to the Exercise