Comparison of array-detected transcription map with GENCODE/HAVANA annotations in ENCODE regions



Acknowledgements
AFFYMETRIX: AFFX Transcriptome Group (Computation, Molecular Biology): S. Bekiranov, P. Kapranov, S. Brubaker, I. Bell, J. Cheng, J. Drenkow, S. Ghosh, D. Kampa-Bailey, G. Helt, J. Long, G. Madhavan, J. Manak, S. Patel, V. Sementchenko, H. Tammana, A. Piccolboni
Support: NCI Contract (21XS019C, Phases I-III); NHGRI ENCODE Grant
Harvard Medical School: K. Struhl, H. Hirsch, H. H. Ng, E. Sekinger
Broad Institute: B. Bernstein, M. Kamal, K. Lindblad-Toh, D. J. Huebert, S. McMahon, E. K. Karlsson, E. J. Kulbokas III, S. L. Schreiber, E. S. Lander

Transcription Map & Modification Site Generation…I
1. Median Scaling: Scale all features on the chip such that the chip median = M, where M is the median of all chip medians (if there are multiple arrays in a set).
2. Quantile Normalization (QN): QN feature intensities within replicates only. QN treatment and control separately.
3. Probe Mapping to Genome: Map PM, MM pairs to the genome via exact 25-mer alignment of the PM probe.
4. Wilcoxon Signed Rank Test: Perform the test on the probe-pair signal S = log2(PM-MM). Apply a sliding window to estimate the intensity of each probe pair as the pseudo-median of all probes in the window. A sliding window makes use of neighboring probes; this reduces the false positive rate and increases sensitivity. Window size varies with the experiment: RNA ~50 bp, IP ~250 bp.
5. Map and Site Generation:
   RNA: Join probes with intensity above the 5% FPR threshold, subject to maxgap and minrun, to generate transcribed fragments (transfrags).
   Chromatin IP: Generate the Hodges-Lehmann estimator to estimate expression level: logDiff = log2(max((PM-MM)_T, 1)) - log2(max((PM-MM)_C, 1)), where T is treatment and C is control. Generate a p-value estimate per probe. Join probes that pass the p-value threshold, subject to maxgap and minrun, to generate modification/transcription factor binding sites.
Workflow: CEL file -> compute median (M) of all chip medians -> Median Scaling -> Quantile Normalization -> Probe Mapping to Genome -> Wilcoxon Signed Rank Test -> RNA: Transfrag Generation, or Chromatin IP: Site Generation
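The sliding-window step in item 4 can be sketched as follows. This is a minimal illustration rather than the group's implementation: the pseudo-median is computed as the Hodges-Lehmann estimator (the median of all pairwise Walsh averages), and the function names, the `half_window` parameter, and the toy probe positions and signals are all assumptions made for the example.

```python
from itertools import combinations_with_replacement
from statistics import median


def pseudo_median(values):
    """Hodges-Lehmann estimator: median of all Walsh averages (x_i + x_j) / 2, i <= j."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


def sliding_pseudo_median(positions, signals, half_window):
    """For each probe, take the pseudo-median of the signals of all probes
    whose position lies within +/- half_window bp of that probe (hypothetical
    window definition; the slides only give an approximate window size)."""
    smoothed = []
    for center in positions:
        window = [s for p, s in zip(positions, signals) if abs(p - center) <= half_window]
        smoothed.append(pseudo_median(window))
    return smoothed


# Toy data (assumed): probe start positions in bp and per-probe signals S.
positions = [0, 20, 40, 60, 80, 300]
signals = [1.0, 5.0, 6.0, 5.5, 1.2, 0.9]
smoothed = sliding_pseudo_median(positions, signals, half_window=25)
print(smoothed)
```

Because each probe borrows strength from its neighbors, an isolated high or low probe moves the estimate less than it would move a single-probe value, which is the effect the slide attributes to the window.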

Filtration of 10 Chromosome Data (Cheng, J., et al. Science Express; March 24, 2005) (see UCSC Browser for 8 cell line data; see Version 33)
Filters: Low Complexity Repeats; Processed Pseudogenes; BLAT hits more than itself (loses some members of gene families).
Using all filters removes ~20% of transfrags; ~30% of those removed are pseudogenes. With BLAT data, the reduction is 14%.

RACE Model (Need isothermal RT for unannotated transfrags)

RACE Analysis of a Coding Gene: DiGeorge Critical Region 14 gene

Un-annotated transfrags of PISD are part of at least 9 different, yet overlapping, sense-antisense transcripts.
[Figure panels: sense strand; antisense strand]

RACE Regions Validated for 768 Loci
Columns: Region; Total 5' and 3' + 5' or 3' successful RACE; Total successful 5' or 3' RACE; Percent total success (considering total 256)
Intergenic: 178/257870%
Intronic: 213/ %
Exonic: 243/ %

Data sets analyzed
Part 1:
a) Analysis done on v34 of the human genome. Total number of ENCODE regions analyzed = 12 (region ENm006 ignored for this analysis since no annotations are available for v34).
b) Set of known/validated exons.
c) Set of predicted exons (from multiple gene predictions).
d) Array-detected transcript maps from the HL-60 cell line at 4 time points after RA stimulation (i.e., one cell line in 4 biological states).
Part 2:
a) Analysis done on v35 of the human genome. Total number of ENCODE regions analyzed = 44.
b) Set of known/validated exons.
c) Set of Vega putative exons.
d) Set of predicted exons outside sets b and c (from multiple gene predictions).
e) Array-detected transcript maps from the HL-60 cell line at 4 time points after RA stimulation.

How comparisons are carried out using arrays, annotations and predicted regions
[Figure labels: genomic sequence; 35 bp avg. distance; repeats (RepeatMasker); probes; coverage of interrogated regions using the algorithms used to call transfrags; annotation (e.g. Vega); Exon 1 is < 100% covered; Exon 2 is 100% covered; predicted exons]
Analyses are done only within interrogated regions.
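The two detection criteria used in the result tables (overlap by at least 1 bp, and more than 75% of exon bp overlapped by a transfrag) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: intervals are half-open `(start, end)` pairs, transfrags are assumed not to overlap one another, the function names and toy coordinates are hypothetical, and the sketch does not model the restriction to interrogated regions.

```python
def covered_bp(exon, transfrags):
    """Number of exon bases overlapped by transfrags.

    Intervals are half-open (start, end); transfrags are assumed to be
    non-overlapping, so per-transfrag overlaps can simply be summed.
    """
    start, end = exon
    return sum(
        max(0, min(end, t_end) - max(start, t_start))
        for t_start, t_end in transfrags
    )


def count_detected(exons, transfrags, min_fraction=0.75):
    """Return (exons overlapped by >= 1 bp, exons covered at > min_fraction)."""
    detected = 0
    well_covered = 0
    for exon in exons:
        bp = covered_bp(exon, transfrags)
        if bp >= 1:
            detected += 1
        if bp / (exon[1] - exon[0]) > min_fraction:
            well_covered += 1
    return detected, well_covered


# Toy coordinates (assumed): three exons and two transfrags.
exons = [(100, 220), (300, 400), (500, 600)]
transfrags = [(90, 230), (380, 450)]
print(count_detected(exons, transfrags))
```

In the toy data the first exon is fully covered, the second is touched by only 20 bp, and the third is missed, so two exons count as detected and one as covered at >75%.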

[Figure labels: genomic sequence; probes; positive probes; transfrags after minrun/maxgap parameters; annotation (Exon 2); predicted exons; X]
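How positive probes become transfrags under the maxgap and minrun parameters can be sketched as follows. The slides do not define the two parameters, so this sketch assumes that maxgap is the largest allowed distance between the start positions of consecutive positive probes and that minrun is the minimum length of a reported fragment; the function name, the 25 bp probe length default, and the toy positions are likewise assumptions.

```python
def call_transfrags(positive_positions, maxgap, minrun, probe_len=25):
    """Join positive probes into half-open (start, end) fragments.

    Consecutive positive probes are merged when their start positions are
    at most maxgap bp apart; fragments shorter than minrun bp are dropped.
    Both parameter definitions are assumptions made for this sketch.
    """
    fragments = []
    run_start = None
    prev = None
    for pos in sorted(positive_positions):
        if run_start is None:
            run_start = prev = pos
        elif pos - prev <= maxgap:
            prev = pos
        else:
            # Gap too large: close the current run and start a new one.
            fragments.append((run_start, prev + probe_len))
            run_start = prev = pos
    if run_start is not None:
        fragments.append((run_start, prev + probe_len))
    return [(s, e) for s, e in fragments if e - s >= minrun]


# Toy data (assumed): start positions of probes that passed the threshold.
positives = [100, 120, 140, 400, 1000, 1030, 1060]
print(call_transfrags(positives, maxgap=40, minrun=50))
```

With these toy settings the isolated positive probe at position 400 forms a 25 bp fragment and is discarded by minrun, while the two clusters of three probes are each reported as one transfrag.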

Coverage of Annotation by array detected transfrags from HL60 cell line in 13 ENCODE regions

Analysis results of 12/13 ENCODE regions
Columns: exon set; total number of exons interrogated; number of exons detected by array-generated transfrags (overlap by at least 1 bp); number of exons detected by array-generated transfrags (>75% of exon bp overlapped by a transfrag)
Known/Validated: 1852; 1068 [57.7%] (74% avg. bp coverage); 700 (37.7%)
Predicted: 1181; 360 [30.5%] (69.2% avg. bp coverage); 175 (14.8%)

Mode size of annotated exons is ~120 bp. Detection of exons does not depend on the size (bp) of the exon (i.e., there is no bias against small exons). Of the exons detected by a transfrag, 65% are covered at >75%.

Mode size of predicted exons is ~120 bp. Approximately 30.5% of predicted exons are detected by transfrags (i.e., by at least 1 bp). Of the predicted exons detected by a transfrag, 48.6% are covered at >75%.

Coverage of Annotation by array detected transfrags from HL60 cell line in all 44 ENCODE regions

Analysis results of 44 ENCODE regions
Columns: exon set; total number of exons interrogated; number of exons detected by array-generated transfrags (overlap by at least 1 bp); number of exons detected by array-generated transfrags (>75% of exon bp overlapped by a transfrag)
Known/Validated: 6467; 3487 [53.9%] (70% avg. bp coverage); 2142 (33.1%)
Predicted: 4455; 809 [18.2%] (62.23% avg. bp coverage); 361 (8.1%)
Vega Putative: 185; 39 [21.1%] (35.71% avg. bp coverage); 3 (1.6%)

Mode size of annotated exons is ~120 bp. Detection of exons does not depend on the size (bp) of the exon (i.e., there is no bias against small exons). Of the exons detected by a transfrag, 61.4% are covered at >75%.

Mode size of predicted exons is ~80 bp. Approximately 18.2% of predicted exons are detected by transfrags (i.e., by at least 1 bp). Of the predicted exons detected by a transfrag, 44.6% are covered at >75%.
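The percentages quoted for the 44 regions follow directly from the exon counts in the 44-region results. The short check below recomputes them; the counts are copied from the slides, while the dictionary layout and the function name are only illustrative.

```python
# (total interrogated, detected by >= 1 bp, covered at > 75%) per exon set,
# as reported for the 44 ENCODE regions.
counts = {
    "Known/Validated": (6467, 3487, 2142),
    "Predicted": (4455, 809, 361),
    "Vega Putative": (185, 39, 3),
}


def rates(total, detected, well_covered):
    """Percent detected, percent covered at >75%, and percent of detected
    exons that are covered at >75%, each rounded to one decimal place."""
    return (
        round(100 * detected / total, 1),
        round(100 * well_covered / total, 1),
        round(100 * well_covered / detected, 1),
    )


for name, (total, detected, well_covered) in counts.items():
    print(name, rates(total, detected, well_covered))
```

The third value in each tuple is the conditional rate: 2142 of 3487 detected known/validated exons gives 61.4%, and 361 of 809 detected predicted exons gives 44.6%.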

Important Caveats to Recall in Pondering the Prediction vs. Array Results
Only one cell line was used in this evaluation.
We have set very conservative thresholds for transfrag prediction; other thresholds can be used.
Strand information is not deducible from the transfrag map.
TUFs (transcripts of unknown function) are collections of transfrags shown to be on the same molecule by RACE-RT/PCR-cloning/sequencing.
Array interrogation resolution is 20 bp on average for the non-repeat portion of the genome, and probes are 25-mers. Thus, the boundaries of transfrags are not as precise as with arrays with 5 bp interrogation resolution, and some small exons will not be interrogated or detected.
We have not included other functional features (e.g., TF binding), which would provide additional confidence in the transfrag data. These will be added under the ENCODE project.

Conclusions
The array-based method detects ~53.9% of known/validated exons.
Similarly, the array-based method provides evidence for ~18.2% of predicted exons. These detected exons should be analyzed further to improve the annotation.
A combination of array-based RNA map generation followed by RACE experiments can significantly improve the rate of validation of gene predictions.
Transfrags that map outside validated and predicted exons can be used to improve gene prediction programs and can form the basis for further experiments.