Q1-Q3 results. Roderic Guigó's lab, April 11th 2007 conference call.
Presentation transcript:

1 Q1-Q3 results

2 RF lengths

3 Filtered RF length distribution

4 Q1 filtered RF length distribution

5 Q2 filtered RF length distribution

6 Q3 filtered RF length distribution

7 RF position when compared to genes and exons

8 Q1-Q2-Q3: Projected filtered RF distribution (internal = overlaps target gene; projection done by pool)
Q1:
- 39% internal: 46% exonic, 54% intronic
- 61% external: 71% genic (79% exonic, 22% overlap most 5' exon of transcript; 21% intronic), 29% intergenic
Q2:
- 86% internal: 88% exonic, 12% intronic
- 14% external: 78% genic (88% exonic, 47% overlap most 5' exon of transcript; 12% intronic), 22% intergenic
Q3:
- 21% internal: 47% exonic, 53% intronic
- 79% external: 78% genic (69% exonic, 23% overlap most 5' exon of transcript; 31% intronic), 22% intergenic
-> chimeric transcripts?
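The category hierarchy on this slide (internal/external, then exonic/intronic or genic/intergenic) can be sketched as a small classifier. This is only an illustration under stated assumptions: coordinates are half-open intervals on a single chromosome, "overlap" means sharing at least one base, and the function and argument names are hypothetical rather than taken from the lab's pipeline.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome (assumption)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def classify_rf(
    rf: Interval,
    target_gene: Interval,
    target_exons: List[Interval],
    other_genes: List[Interval],
    other_exons: List[Interval],
) -> Tuple[str, ...]:
    """Place one projected RF in the internal/external hierarchy (hypothetical helper)."""
    if overlaps(rf, target_gene):
        # Internal: the RF overlaps the gene it was primed from.
        sub = "exonic" if any(overlaps(rf, e) for e in target_exons) else "intronic"
        return ("internal", sub)
    if any(overlaps(rf, g) for g in other_genes):
        # External but genic: the RF falls in some other annotated gene.
        sub = "exonic" if any(overlaps(rf, e) for e in other_exons) else "intronic"
        return ("external", "genic", sub)
    return ("external", "intergenic")
```

Counting the returned tuples over all projected RF of one experiment would give percentages of the kind listed on the slide.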

9 Why are Q3 RF mostly external (79%)? Is there a systematic swap between certain pairs of pools? For each RF, we computed the overlap with all Q3 genes and then compared the RF's pool with the pool of each overlapping gene.
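The pool comparison described on this slide can be sketched as a count of (RF pool, gene pool) pairs. This is a minimal sketch, assuming each RF and gene is a `(pool, start, end)` tuple with half-open coordinates on one chromosome; the name `pool_confusion` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

Feature = Tuple[int, int, int]  # (pool, start, end), half-open coordinates (assumption)


def pool_confusion(rfs: Iterable[Feature], genes: Iterable[Feature]) -> Counter:
    """Count RF/gene overlaps, keyed by (RF pool, overlapping gene pool)."""
    genes = list(genes)  # reused for every RF
    counts: Counter = Counter()
    for rf_pool, rf_start, rf_end in rfs:
        for gene_pool, gene_start, gene_end in genes:
            if rf_start < gene_end and gene_start < rf_end:
                counts[(rf_pool, gene_pool)] += 1
    return counts
```

A systematic swap between two pools would be expected to show up as large counts for one off-diagonal pair, for example `(8, 9)` and `(9, 8)`, while the matching diagonal entries stay small.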

10 RF overlapping gene pool

11 Q3 RF compared to Q3 genes -> Q3 RF overlap genes of their own pool more than genes of other pools (no clear pool swap)

12 6 genes of Q3 are in two different pools -> generates pool unspecific RF
Problematic pools are:
● 8-9
● 12-3

13 Q3 RF overlapping Q3 genes

14 Position of Q3 filtered projected RF when filtering RF shorter than a threshold

15 Q2 vs Encode
Q2:
- 86% internal: 88% exonic, 12% intronic
- 14% external: 78% genic (88% exonic, 47% overlap most 5' exon of transcript; 12% intronic), 22% intergenic
Encode:
- 68% internal: 49% exonic, 51% intronic
- 32% external: 80% genic (70% exonic, 23% overlap most 5' exon of transcript; 30% intronic), 20% intergenic
out of 1577 (27.5%) are novel projected RF
2859 out of 4951 (57.8%) are novel projected RF

16 Distance of RF to closest gene within pool (target gene)

17 Q1, Q3: proportion of RF > 3 Mb away from target gene
Q1: 983/10387 = 9.4% of filtered RF are > 3 Mb away from target gene
Q3: 1789/3411 = 52.4% of RF are > 3 Mb away from target gene
839/1249 = 67.2% of external non-exonic RF are > 3 Mb away from target gene
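A proportion like the ones on this slide can be sketched as follows. The sketch assumes half-open intervals on one chromosome, measures distance as the gap between an RF and its target gene (0 when they overlap), and uses hypothetical helper names; how the slides actually measured distance is not stated.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) (assumption)


def gap(a: Interval, b: Interval) -> int:
    """Number of bases separating two intervals; 0 when they overlap or touch."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def fraction_far(rfs: List[Interval], targets: List[Interval],
                 threshold: int = 3_000_000) -> float:
    """Fraction of RF lying more than `threshold` bases from their target gene."""
    far = sum(1 for rf, gene in zip(rfs, targets) if gap(rf, gene) > threshold)
    return far / len(rfs)


def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal, as reported on the slides."""
    return round(100 * count / total, 1)


print(pct(1789, 3411))  # 52.4
print(pct(839, 1249))   # 67.2
```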

18

19

20

21 Proportion of Q3 filtered RF >3 Mb away from target gene

22

23

24

25 Do external exonic projected RF overlap the most 5' exons of transcripts more than they overlap other exons?

26 Proportion of external exonic projected RF overlapping most 5' exons of transcripts
Q1:
- Real: 22.3% (63); same strand 68.3% (43), opposite strand 31.7% (20)
- Random: 19.8% (56); same strand 41.1% (23), opposite strand 58.9% (33)
Q3:
- Real: 23.0% (335); same strand 62.1% (208), opposite strand 37.9% (127)
- Random: 15.8% (230); same strand 49.1% (113), opposite strand 50.9% (117)
Q2:
- Real: 46.5% (206); same strand 45.6% (94), opposite strand 54.4% (112)
- Random: 30.7% (136); same strand 54.4% (74), opposite strand 45.6% (62)
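The same-strand and opposite-strand percentages on this slide follow directly from the counts in parentheses. A short check, with a hypothetical helper name and the counts copied from the slide:

```python
from typing import Tuple


def strand_split(same: int, opposite: int) -> Tuple[float, float]:
    """Same-strand and opposite-strand shares, in percent, rounded to one decimal."""
    total = same + opposite
    return round(100 * same / total, 1), round(100 * opposite / total, 1)


# (same-strand count, opposite-strand count) pairs as listed on the slide
slide_counts = [(43, 20), (23, 33), (208, 127), (113, 117), (94, 112), (74, 62)]
for same, opposite in slide_counts:
    print(same, opposite, strand_split(same, opposite))
```

For instance, `strand_split(43, 20)` gives `(68.3, 31.7)`, matching the first "Real" entry.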

27 Does the most 5' RF of a particular gene and a particular tissue overlap most 5' exons of transcripts more than other RF?

28 Correlation of most 5' RF with CAGE tags

29 Correlation of most 5' RACEfrags with CAGE tags

30 Pool unspecific RF

31 Pool unspecific unique RF (USPP-filtered)
Most pool unspecific unique RF are:
- Q1: internal exonic (72%)
- Q2: internal exonic (87%)
- Q3: external (91%), of which 63% are exonic
20 unique RF are in more than 4 pools

32 Pool unspecific unique Q3 RF (filtered)
- Hits found by BLAT.
- Needs to be redone using our highlighted probe simulator.

33 Q1-Q3: Number of pools a unique RF appears in (unfiltered/filtered); panels: Q3, Q1, Q2
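Counting the number of pools each unique RF appears in can be sketched as below. The input format is an assumption: one `(rf_id, pool)` pair per observation, where `rf_id` is any hashable key identifying a unique RF (for example its coordinates). The function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def pools_per_rf(observations: Iterable[Tuple[Hashable, int]]) -> Dict[Hashable, int]:
    """Map each unique RF to the number of distinct pools it was seen in."""
    pools = defaultdict(set)
    for rf_id, pool in observations:
        pools[rf_id].add(pool)
    return {rf_id: len(seen) for rf_id, seen in pools.items()}


def pool_count_histogram(n_pools: Dict[Hashable, int]) -> Counter:
    """Number of unique RF found in exactly 1 pool, 2 pools, and so on."""
    return Counter(n_pools.values())
```

An RF with a count of 1 is pool specific; anything above 1 is pool unspecific in the sense used on these slides.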

34 Pool-unspecific RFs in Q3
Possibly due to cross-hybridization? Is there a correlation between the number of pools an RF is found in and the number of non-unique probes it overlaps? No.
By the way, 135,380 / 2,191,331 (6%) of probes from the chr21/22 chip have multiple perfect matches in the genome.
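One way to test the cross-hybridization question is a correlation coefficient between the two per-RF quantities. The slide does not say which statistic was used; the sketch below uses a plain Pearson correlation as an assumption, with a hypothetical function name.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Share of chip probes with multiple perfect genome matches, from the slide's counts
print(round(100 * 135_380 / 2_191_331, 1))  # 6.2
```

Here `xs` would hold the number of pools per RF and `ys` the number of non-unique probes each RF overlaps; a value near 0 would agree with the "no" on the slide.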

35 Pool-unspecific RFs in Q3 Possibly due to high GC content? -> Answer: NO!

36 Pool-unspecific RFs in Q3
Possibly due to mis-priming on unknown transcripts of chr21 or chr22 (missed by the simulator)?
4 - Genuine chimeric transcripts?
5 - Pooling errors: the same gene is present in >1 pool because it has 2 different identifiers (UCSC known genes / RefSeq nomenclature discrepancy). We found a few cases like this; not sure yet how widespread it is (systematic survey to come).

37 Genes present in several pools
5 genes present in 2 pools:
- RP5-1042K10.2, NM_ (pools 14, 15)
- CHODL, NM_ (pools 6, 10)
- NM_005446, P2RXL1 (pools 8, 9)
- ZNF74, NM_ (pools 10, 13)
- NM_015367, BCL2L13 (pools 12, 3)
1 gene present in 3 pools:
- NM_021090, NM_ , MTMR3 (pools 15, 14, 16)
Eliminate RF present in these pairs/triplets of pools (problematic pool RF)
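The elimination step can be sketched as a set test. The pool groups below are copied from this slide; the rule itself is an assumption, since the slide does not spell it out: an RF is treated as a "problematic pool RF" when it appears in more than one pool and all of its pools fall inside one of the listed pairs or the triplet. The names are hypothetical.

```python
from typing import Dict, Hashable, List, Set

# Pool pairs/triplet in which the same gene was placed, as listed on the slide
PROBLEMATIC_GROUPS = [
    frozenset(group)
    for group in ({14, 15}, {6, 10}, {8, 9}, {10, 13}, {3, 12}, {14, 15, 16})
]


def is_problematic(pools: Set[int]) -> bool:
    """True when a multi-pool RF is fully explained by one problematic group (assumed rule)."""
    return len(pools) > 1 and any(pools <= group for group in PROBLEMATIC_GROUPS)


def drop_problematic(rf_pools: Dict[Hashable, Set[int]]) -> List[Hashable]:
    """Keep the RF identifiers that are not problematic pool RF."""
    return [rf_id for rf_id, pools in rf_pools.items() if not is_problematic(pools)]
```

Under this rule an RF seen in pools 8 and 9 is dropped, while one seen in pools 8 and 12 is kept and still counts as pool unspecific.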

38 Effect of filtering problematic pool RF on Q3 pool unspecificity
-48
Genes present in several pools do not explain all pool unspecific RF of Q3

39 Distribution of pool specific and pool unspecific unique Q3 RF
Compared to pool specific Q3 RF, pool unspecific Q3 RF are more:
● external to Q3 genes
● exonic

40 Pool specific and unspecific RF regarding gene overlap
Pool specific RF overlap their target gene more than pool unspecific RF

41 Two other criteria for comparing Q3 pool specific and unspecific RF
Overlap with gene in same orientation as target gene: pool unspecific RF behave similarly to pool specific RF
Distance to target gene: pool unspecific RF are more distant from their target gene

42 6 genes of Q3 are in two different pools -> generates pool unspecific RF
Problematic pools are:
● 8-9
● 12-3

43 Impact of index exon position on RF coverage

44

45 USPP filter results

46 The USPP filter removes more intergenic than genic RF
Q1: proportion of exonic, intronic and intergenic RF before and after USPP-based filtering

47 The USPP filter removes more RF located:
- from 100 to 200 kb
- from 1 to 5 Mb
to closest gene within pool
Q1: Distance of RF to closest gene within pool before and after the USPP-based filter

48 Q2: Class 0, 1, 3, 5 RF removed by USPP-based filter (using 0, 1 and 2 RACE/probe mismatches)
The USPP filter:
- removes 37 times more 3' RF than 5' RF
- is approximately independent of the number of RACE/probe mismatches

49 Proportion of RF and projected RF eliminated by the USPP-based filter (projections made by pool)

50 Proportion of RF and projected RF eliminated by the USPP-based filter (projections made by pool)

51 Tissue specificity results

52 Q1: number of tissues a unique RF appears in (unfiltered/filtered)

53 Q2: Number of tissues a unique RF appears in (unfiltered/filtered)

54 Generating RF from probes

55 Generating RF from probes

56 Comparison between Encode 2005 and Q2

57 Intersection between Encode 2005 and Q2 RF sets

58 Comparison between Q1 and Q3

59 Overlap between Q1 and Q3 RF assigned to genes common to Q1-Q3
40% overlap between Q1 and Q3 RF assigned to genes common to both experiments -> problem in gene assignment?