Discussion Points for 2nd Pseudogene Call. Mark Gerstein, 2005.09.22 11:00 EST

Intersection of Pseudogenes from Three Groups: Original
(Venn diagram provided by France.)
Havana-Gencode: 167 pseudogenes
Yale: 184 pseudogenes
UCSC retrogenes: 15 expressed (7-8 pseudogenes), not expressed (all pseudogenes)
86 Havana pseudogenes overlap with at least one Yale pseudogene, and 87 Yale pseudogenes overlap with at least one Havana pseudogene (likewise for retrogenes).
This is a global result: in some loci three Havana pseudogenes may overlap with only one Yale pseudogene, while in other loci several Yale pseudogenes overlap with one Havana pseudogene.
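
The asymmetry between the two overlap counts (86 versus 87) follows from counting "overlaps with any" separately in each direction. A minimal sketch of that counting, assuming pseudogenes are stored as (chromosome, start, end) tuples with half-open ends; the coordinates and function names below are invented for illustration and are not the groups' actual data or pipeline:

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_overlapping(query, reference):
    """Count intervals in `query` that overlap any interval in `reference`."""
    return sum(any(overlaps(q, r) for r in reference) for q in query)


# Toy coordinates (hypothetical): three Havana calls fall inside one long Yale call.
havana = [("chr1", 100, 200), ("chr1", 250, 300), ("chr1", 320, 400), ("chr2", 10, 50)]
yale = [("chr1", 150, 350), ("chr2", 500, 600)]

havana_hits = count_overlapping(havana, yale)  # Havana calls touching any Yale call
yale_hits = count_overlapping(yale, havana)    # Yale calls touching any Havana call
print(havana_hits, yale_hits)
```

In this toy case three Havana calls map to a single Yale call, so the two directional counts differ even though they describe the same overlaps.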

Intersection of Pseudogenes from 4 Groups: Updated
Havana-Gencode: 167 pseudogenes
Yale: 164 pseudogenes
UCSC retrogenes: 146 not expressed
Venn diagram region counts (region labels not recoverable): 82 (34), 17 (7), 33 (1), 15 (1), 14 (2), 16 (0), 52 (2)
The numbers in parentheses are pseudogenes from GIS. All from
Pseudo-exons were merged to form pseudogenes and used for this comparison (now a pseudogene has only a single start and end)
Strand information is ignored
There are a total of 229 pseudogenes in the union
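
The two preprocessing steps named on this slide (merging pseudo-exons into one span, ignoring strand) can be sketched as follows. The slide does not say how the union of 229 was counted; clustering overlapping spans, as done here, is one plausible reading and is an assumption, as are all names and coordinates:

```python
def merge_exons(exons):
    """Collapse pseudo-exons (chrom, start, end, strand) into one
    (chrom, start, end) span; the strand is deliberately dropped."""
    chrom = exons[0][0]
    return (chrom, min(e[1] for e in exons), max(e[2] for e in exons))


def union_clusters(spans):
    """Group overlapping (chrom, start, end, group) spans into clusters.
    Returns a list of [chrom, start, end, set_of_supporting_groups]."""
    clusters = []
    for chrom, start, end, group in sorted(spans):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and start < last[2]:
            last[2] = max(last[2], end)  # extend the open cluster
            last[3].add(group)
        else:
            clusters.append([chrom, start, end, {group}])
    return clusters


# Hypothetical toy data: one two-exon Yale call plus single-span calls.
yale_span = merge_exons([("chr1", 100, 150, "+"), ("chr1", 180, 260, "+")])
spans = [
    yale_span + ("Yale",),
    ("chr1", 240, 300, "Havana"),
    ("chr1", 500, 600, "UCSC"),
    ("chr2", 100, 200, "Havana"),
]
clusters = union_clusters(spans)
print(len(clusters), clusters[0])
```

Each cluster records which groups support it, which is the information a Venn diagram of this kind summarizes.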

Intersection of Pseudogenes from 4 Groups: Non-processed Consensus
Havana-Gencode: 167 pseudogenes
Yale: 164 pseudogenes
UCSC retrogenes: 146 not expressed
Venn diagram region counts (region labels not recoverable): 82 (34), 17 (7), 33 (1), 15 (1), 14 (2), 16 (0), 52 (2)
Processed versus non-processed table (columns: GENCODE Processed, GENCODE Non-Processed; cell values as extracted):
  Yale Processed: 7 / 85 / 5
  Yale Non-Processed: 4 / 439 / 37
Rough agreement now is: – 7 = 127 from 229 total
What to do with 102?

How to Pick Pseudogenes for RT-PCR?
Start with the intersection 127
Duplicated v processed: how many of each? (2:1?)
Rank pseudogenes:
  – By likelihood to be transcribed according to ENCODE evidence: ditag, then CAGE, then tiling array
  – By their uniqueness in genome
      Good primers
      Non cross-hybridizing probes
How to get a consistent rank?
Who will do RT-PCR?
What coordinates to use?
(Ignore 1 processed pseudogene already being sequenced by GIS group.)
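
One way to get a consistent rank is a lexicographic sort: evidence types in the stated priority order (ditag, then CAGE, then tiling array), with genomic uniqueness as the tie-breaker. This is only a sketch of that idea; the `Candidate` fields, the uniqueness score in [0, 1], and the example values are hypothetical and were not part of the call:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    ditag: bool        # supported by ditag evidence
    cage: bool         # supported by CAGE evidence
    tiling: bool       # supported by tiling-array evidence
    uniqueness: float  # hypothetical score, 1.0 = fully unique in genome


def rank(candidates):
    """Sort best-first: ditag, then CAGE, then tiling array, then uniqueness.
    The name is the final key so the order is deterministic."""
    return sorted(
        candidates,
        key=lambda c: (not c.ditag, not c.cage, not c.tiling, -c.uniqueness, c.name),
    )


candidates = [
    Candidate("A", ditag=False, cage=True, tiling=True, uniqueness=0.9),
    Candidate("B", ditag=True, cage=False, tiling=False, uniqueness=0.2),
    Candidate("C", ditag=True, cage=False, tiling=False, uniqueness=0.8),
    Candidate("D", ditag=False, cage=False, tiling=False, uniqueness=1.0),
]
ranked_names = [c.name for c in rank(candidates)]
print(ranked_names)
```

A strict priority order like this means any ditag support outranks all other evidence combined; a weighted score would be the alternative if that seems too rigid.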

How to generate a consensus for remaining 102 pseudogenes?
Stick with the intersection 127
Develop consistent criteria for identifying pseudogenes and uniformly apply them to ENCODE
  – E.g. protein matches with disablements found from a pipeline
  – Ignores tricky cases flagged by manual annotation
Do a simple union of UCSC, Havana & Yale, giving 229
  – GIS is a subset of other 3
  – Describe pseudogenes as being identified by multiple approaches and then explicitly flag each group's unique ones in final annotation
  – Easy but perhaps biases stats
Do a qualified union
  – Allow each group to "question" particular pseudogenes in another's set
  – Send questions around and then have a call to sort out differences
  – Need a way to arbitrate, e.g. we could demand an obvious disablement
  – We might learn something!
How do we represent this in the browser & in stats?
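
The "qualified union" option can be sketched as a small decision rule: keep everything in the union, flag calls supported by a single group, and drop a questioned call unless it shows an obvious disablement. The function, its labels, and its arguments are hypothetical; the arbitration rule is the one floated above as an example, not an agreed procedure:

```python
def classify(support, questioned_by=(), has_disablement=False):
    """Label one pseudogene in the union.

    support         -- set of groups that called it (e.g. {"Yale", "Havana"})
    questioned_by   -- groups that challenged the call
    has_disablement -- True if an obvious disablement was found
    """
    if questioned_by and not has_disablement:
        return "rejected"
    if len(support) > 1:
        return "consensus"
    return "unique:" + next(iter(support))


labels = [
    classify({"Yale", "Havana"}),
    classify({"UCSC"}),
    classify({"Yale"}, questioned_by=("Havana",)),
    classify({"Yale"}, questioned_by=("Havana",), has_disablement=True),
]
print(labels)
```

Carrying the label into the final annotation would let statistics be reported both for the consensus set and for each group's unique calls.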

Once we have consensus, how to agree on pseudogene boundaries?
Keep unchanged each group's boundaries
  – If pseudogenes overlap, take largest region (union) or smallest
Develop uniform criteria for assigning pseudogene boundaries and apply them to each of the pseudogenes in the consensus set
  – Could just take each pseudogene in the consensus and have one group realign it against parent
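
The first option (largest or smallest region for overlapping calls) is simple to state in code. In this sketch "smallest" is read as the core shared by all groups, i.e. the intersection; that reading is an assumption, since the slide could also mean the shortest single call. Spans are (start, end) pairs on one chromosome and the values are invented:

```python
def consensus_boundary(spans, mode="union"):
    """Combine overlapping (start, end) calls for one pseudogene.

    mode="union"        -- largest region covered by any group
    mode="intersection" -- region shared by every group (assumed meaning of "smallest")
    """
    starts = [s for s, _ in spans]
    ends = [e for _, e in spans]
    if mode == "union":
        return (min(starts), max(ends))
    if mode == "intersection":
        start, end = max(starts), min(ends)
        if start >= end:
            raise ValueError("calls do not all overlap; no shared region")
        return (start, end)
    raise ValueError("unknown mode: " + mode)


calls = [(100, 260), (240, 300), (120, 280)]  # hypothetical calls from three groups
largest = consensus_boundary(calls, "union")
smallest = consensus_boundary(calls, "intersection")
print(largest, smallest)
```

The second option, realigning each consensus pseudogene against its parent, would replace both rules with boundaries derived from a single alignment and so cannot be reduced to interval arithmetic like this.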