EDACC Quality Characterization for Various Epigenetic Assays

EDACC Quality Characterization for Various Epigenetic Assays. Cristian Coarfa, Bioinformatics Research Laboratory, Molecular and Human Genetics.

Data Types Submitted to EDACC: ChIP-Seq, Methyl-C, RRBS, MRE-Seq, MeDIP-Seq, Chromatin Accessibility, small RNA-Seq, mRNA-Seq.

Quality Characterization: How do we measure the quality of mapped reads? Note: this is not the quality of the sequencing itself; statistics on that are provided by the sequencer. Most labs do some sort of visual inspection. We propose metrics for characterizing level 2 data quality and apply them to the various data types submitted to EDACC.

Enrichment-Based Protocols (ChIP-Seq, MeDIP-Seq, Chromatin Accessibility): Methods implemented: PTIH (percent tags in hotspots), iROC (integral of ROC), percent tags in peaks (FindPeaks), and a Poisson enrichment metric. All are implemented in the EDACC pipeline, and the metrics are computed on all submitted data.

PTIH (percent tags in hotspots): Detect enriched regions using the "hotspot" algorithm. PTIH = percentage of all tags that fall in hotspots.

Hotspot algorithm: A scan statistic that gauges enrichment with a z-score based on the binomial distribution. A small window (250 bp) containing n tags is compared with the surrounding large window (50 kb) containing N tags. The binomial distribution gives the probability of seeing n tags in the small window given N tags total in the large window. This adjusts for local background fluctuations (due to CNV, for instance).
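The z-score described on this slide can be sketched as follows. This is a minimal illustration, not the actual Hotspot implementation: it assumes each of the N tags in the large window lands in the small window with probability 250/50000, and it uses the usual mean and standard deviation of the binomial distribution. The function name `hotspot_z` and its parameters are made up for this example.

```python
import math

def hotspot_z(n_small, n_large, small_bp=250, large_bp=50_000):
    """Binomial z-score for n_small tags in the small window,
    given n_large tags in the surrounding large window (assumed model)."""
    p = small_bp / large_bp                 # chance a background tag hits the small window
    expected = n_large * p                  # binomial mean
    sd = math.sqrt(n_large * p * (1 - p))   # binomial standard deviation
    if sd == 0:
        return 0.0                          # no tags in the background window
    return (n_small - expected) / sd

# 1000 background tags predict 5 tags in the small window
z_background = hotspot_z(5, 1000)
z_enriched = hotspot_z(20, 1000)
```

Because the expectation comes from the local 50 kb window rather than the genome-wide average, a region with elevated copy number raises both n and N, so it does not automatically look enriched.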

PTIH values (examples shown in the figure): 0.48, 0.19, 0.72, 0.48

Ratio of Tags in Peaks: Determine uniquely mapping reads. Use FindPeaks to call peaks. Count the reads mapping into peaks and report them as a percentage of total mapped reads.
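The counting step can be sketched as below, assuming the peaks have already been called (by FindPeaks or any other caller) and exported as non-overlapping, half-open intervals. The function `fraction_in_peaks` and the tuple layouts are hypothetical; they are not part of FindPeaks or the EDACC pipeline.

```python
from bisect import bisect_right

def fraction_in_peaks(read_positions, peaks):
    """read_positions: list of (chrom, pos) for uniquely mapped reads.
    peaks: list of (chrom, start, end), half-open and non-overlapping (assumed)."""
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _ in ivs] for c, ivs in by_chrom.items()}

    hits = 0
    for chrom, pos in read_positions:
        intervals = by_chrom.get(chrom)
        if not intervals:
            continue
        # last peak whose start is <= pos, if any
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            hits += 1
    return hits / len(read_positions) if read_positions else 0.0

reads = [("chr1", 100), ("chr1", 250), ("chr1", 999), ("chr2", 50)]
peaks = [("chr1", 200, 300), ("chr2", 0, 100)]
ratio = fraction_in_peaks(reads, peaks)
```

Multiply the returned fraction by 100 to report it as a percentage.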

Poisson-Based Enrichment Method: Determine uniquely mapping reads. Remove duplicate reads. Bin the reads into 1 kb windows. Infer the parameters of a simple Poisson distribution. Keep the enriched windows (p-value < 0.01). Count reads mapping into enriched windows.
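A minimal sketch of these steps is given below. The slide does not say how the Poisson parameter is inferred; here it is assumed to be the mean number of de-duplicated reads per window across the genome, with the total window count passed in as `genome_windows`. All names are illustrative.

```python
import math
from collections import Counter

def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = 0.0
    for i in range(k):      # accumulate P(X = 0) .. P(X = k - 1)
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def poisson_enrichment(reads, genome_windows, window=1000, alpha=0.01):
    """reads: iterable of (chrom, pos, strand) for uniquely mapped reads.
    Returns the fraction of de-duplicated reads in enriched windows."""
    unique = set(reads)                               # remove duplicate reads
    if not unique:
        return 0.0
    counts = Counter((chrom, pos // window) for chrom, pos, _ in unique)
    lam = len(unique) / genome_windows                # assumed estimator: mean per window
    enriched = {w for w, c in counts.items() if poisson_sf(c, lam) < alpha}
    in_enriched = sum(c for w, c in counts.items() if w in enriched)
    return in_enriched / len(unique)

example_reads = (
    [("chr1", i, "+") for i in range(10)]             # 10 reads piled into one window
    + [("chr1", 5, "+")]                              # an exact duplicate
    + [("chr1", 5000 + 1000 * j, "+") for j in range(10)]  # 10 isolated reads
)
score = poisson_enrichment(example_reads, genome_windows=100)
```

In the example, only the window holding ten reads passes the p-value filter, so half of the de-duplicated reads count as enriched.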

Next Step: Metrics Evaluation. The metrics probe different features of the data. Use visual inspection to ascertain which (one or more) of the proposed methods captures useful aspects of data quality.

ChIP-Seq/Chromatin Accessibility/FindPeaks QC Metrics: Collaborative effort between centers. ~330 lanes of verified ChIP-Seq, MeDIP-Seq, and chromatin accessibility data. Accessible in the Epigenome Atlas.

Going forward: EDACC will run these metrics continuously on all submitted data, with an option to automatically flag data that fall below specified thresholds. For most data types we need further experience on what thresholds make sense. QC metrics will be included in the metadata to provide downstream users with this information. Note that we are breaking new ground: uniform quality scoring is not being performed by other major consortia (ENCODE, modENCODE).
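Automatic flagging could look like the sketch below. The metric names and threshold values are hypothetical placeholders chosen only for illustration; as the slide notes, sensible thresholds still need to be worked out for most data types.

```python
def flag_low_quality(metrics, thresholds):
    """Return the sorted names of metrics whose value falls below its threshold.
    Metrics without a configured threshold are never flagged."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value < thresholds[name]
    )

# Hypothetical thresholds and one hypothetical lane of QC results
thresholds = {"ptih": 0.25, "fraction_in_peaks": 0.10}
lane_metrics = {"ptih": 0.19, "fraction_in_peaks": 0.32, "iroc": 0.91}
flags = flag_low_quality(lane_metrics, thresholds)
```

Storing the returned list alongside the raw metric values in the metadata would let downstream users see both the scores and which ones triggered a flag.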

Pearson correlation for ChIP-Seq Histone Modification: Uses raw density maps at 10 kb resolution. Process: select uniquely mapping reads; extend each read 200 bp in the direction of its mapping strand; remove monoclonal reads; build the density map; compute the Pearson correlation with other submitted marks. Ideally, a mark correlates best with other experiments for the same assay. How well does Pearson correlation work? It helped us identify 5 bad lanes, and the REMCs retracted the data.
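The process can be sketched for a single chromosome as follows. Several details are assumptions made for this example: a read is given as its 5' position plus strand, "monoclonal" reads are taken to be reads with identical position and strand, and the density of a bin is the number of extended-read bases that fall in it.

```python
import math

def density_map(reads, chrom_len, bin_size=10_000, extend=200):
    """reads: iterable of (five_prime_pos, strand) with 0 <= pos < chrom_len.
    Returns covered bases per bin after extending reads along their strand."""
    n_bins = (chrom_len + bin_size - 1) // bin_size
    density = [0] * n_bins
    for pos, strand in set(reads):               # drop monoclonal (identical) reads
        if strand == "+":
            lo, hi = pos, min(pos + extend, chrom_len)
        else:
            lo, hi = max(pos - extend + 1, 0), pos + 1
        for b in range(lo // bin_size, (hi - 1) // bin_size + 1):
            b_lo, b_hi = b * bin_size, (b + 1) * bin_size
            density[b] += min(hi, b_hi) - max(lo, b_lo)
    return density

def pearson(x, y):
    """Pearson correlation of two equal-length density maps."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")                      # undefined for a flat map
    return sxy / math.sqrt(sxx * syy)

map_a = density_map([(9_900, "+"), (9_900, "+"), (25_000, "-")], chrom_len=30_000)
map_b = density_map([(9_950, "+"), (25_100, "-")], chrom_len=30_000)
r = pearson(map_a, map_b)
```

Computing `pearson` for every pair of submitted experiments gives the correlation matrix against which a new lane can be compared.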

PCA Analysis: 10 kb windows on chr20. PCA using the Pearson correlation metric.
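One way to obtain the leading principal component from a Pearson correlation matrix is sketched below. It uses plain power iteration as a stand-in for a full eigendecomposition; the slides do not say which PCA implementation was used. The function name and the toy matrix are illustrative only.

```python
import math

def first_component(corr, iters=200):
    """Leading eigenvector of a symmetric correlation matrix by power iteration,
    plus the fraction of total variance (trace) that it explains."""
    n = len(corr)
    v = [float(i + 1) for i in range(n)]         # arbitrary non-zero start vector
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit-length vector v
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    explained = eigenvalue / sum(corr[i][i] for i in range(n))
    return v, explained

toy_corr = [[1.0, 0.9], [0.9, 1.0]]              # two highly correlated experiments
component, explained = first_component(toy_corr)
```

Projecting each experiment onto the first two or three components is the usual way to draw the kind of scatter plot in which similar marks cluster together.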

[Figure: PCA plot using the Pearson correlation metric (axis label: PCA 53.8%). Samples shown: Input, H3K36me3, H3K9me3, H3K79me1, H3K20me1, H3K27me3, H3K4me3, H3K9ac, H2AK5ac, H2BK120ac, H2BK12ac, H2BK15ac, H2BK20ac, H3K14ac, H3K18ac, H3K23ac, H3K27ac, H3K4ac, H3K56ac, H4K5ac, H4K8ac, H4K91ac.]

MRE-Seq: Reads are mapped onto the reference genome. Uniquely mapping reads are kept. Build the fragment map of expected mapping locations based on the enzyme cocktail used. Count reads mapping within the expected digest fragments. 76-99% of reads map within expected fragments.

mRNA-Seq: Reads are mapped onto the reference genome. Uniquely mapping reads are kept. Count reads mapping within gene exons (UCSC known genes, Entrez genes). 70-90% of reads map within gene exons.

Small RNA-Seq: Trim adaptors. Reads are mapped onto the reference genome. Reads mapping to up to 100 locations are kept. Count reads overlapping with known small RNAs: miRNAs, piRNAs, sno/scaRNAs, repeat RNAs. At least 30% of reads overlap with known small RNAs.

Bisulfite Sequencing: Map using Pash. Methyl-C and RRBS: genome-wide QC using C->T conversion rates, typically 99%. RRBS: enzyme cocktail; count reads that map within expected cut sites; the ratio varies 40%-90%.
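A C->T conversion rate can be estimated along the lines of the sketch below. It rests on an assumption not stated in the slides: cytosines outside CpG context are treated as unmethylated, so any that still read as C count as unconverted. The input format (ungapped, forward-strand reference/read pairs of equal length) is also simplified for illustration and is not the output format of Pash.

```python
def conversion_rate(alignments):
    """alignments: iterable of (ref_seq, read_seq), equal length, forward strand, ungapped.
    Returns converted / (converted + unconverted) over non-CpG reference cytosines."""
    converted = unconverted = 0
    for ref, read in alignments:
        ref, read = ref.upper(), read.upper()
        for i, (r, q) in enumerate(zip(ref, read)):
            if r != "C":
                continue
            # skip CpG sites and a trailing C whose context is unknown
            if i + 1 >= len(ref) or ref[i + 1] == "G":
                continue
            if q == "T":
                converted += 1
            elif q == "C":
                unconverted += 1
    total = converted + unconverted
    return converted / total if total else float("nan")

pairs = [
    ("ACATCGCA", "ATATCGTA"),   # two non-CpG Cs, both converted; the CpG is ignored
    ("CACAG", "TACAG"),         # one converted, one unconverted
]
rate = conversion_rate(pairs)
```

Reads from the reverse strand would need the complementary G->A comparison, which is omitted here to keep the example short.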

QC for MeDIP-Seq Data Using Galaxy

Exercise: Download the input MeDIP-Seq file from the workshop wiki. Determine the ratio of reads in peaks using FindPeaks in Galaxy.