Gene discovery using combined signals from genome sequence and natural selection Michael Brent Washington University The mouse genome analysis group.

Presentation transcript:

Gene discovery using combined signals from genome sequence and natural selection
  Michael Brent
  Washington University
  The mouse genome analysis group

GENSIPS10/7/ Genes are read out via mRNA & processing

RNA Processing

A typical human gene structure

In a mammalian genome: Finding all the genes is hard
  Mammalian genomes are large
    – 5,051 miles of 10pt type
    – Raleigh to Tripoli, Libya
  Only about 1.5% protein coding
    – Raleigh to Winston-Salem

Genes are fairly unconstrained
  Intron length is highly variable
    ~5% are nt long
    ~3% are longer than 30,000 nt
  Distance between genes is highly variable
    From 10^3 to 10^6 nt or more (probably)

Exons per gene (RefSeq)

Background is not random
  Segmental duplications
    Entire regions duplicate, then diverge slowly
  Processed pseudogenes
    Spliced transcripts integrate back into the genome
    – Sequence is similar to source genes
    – Generally not functional

Gene prediction: two approaches
  1. Transcript-based (e.g., GeneWise)
    A. Map experimentally determined sequences of spliced transcripts to their genomic source
    B. Map transcript sequences to genomic regions that could produce similar transcripts
  2. De novo (genome only)
    Model DNA patterns characteristic of gene components
    – Splice donor and acceptor
    – Protein coding sequence
    – Translation start and stop

Advantages and disadvantages: Transcript-based
  Advantage: conservative
    – Evidence of transcription for every exon
  Disadvantage: conservative
    – Can’t find “truly novel” genes
  Still subject to error

Advantages and disadvantages: De novo
  Advantage 1: Less biased toward
    – Known transcripts
    – Transcripts that can be sequenced easily
  Advantage 2: Genome sequencing is easy
  Disadvantages
    – No direct evidence of transcription
    – Presumably, more false positives

Single-genome de novo: Genscan
  Strengths
    For mammalian sequence, one of the best single-genome, de novo gene predictors
    Widely used to great practical advantage
    De facto standard for mammalian sequence
  Limitations
    Predicts >45K genes (best est.: 25-30K)
    Predicts >315K exons (best est.: 200K-250K)
    Gets only 9% of known genes exactly right*

Dual genome de novo
  We developed algorithms that use two genomes to
    Reduce the number of false positives
    Refine the details of the structures

Single-genome de novo method
  Probability model
    Assigns probability to annotated DNA sequences:
    5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
    Labels: Exon, 5’ UTR, Intron
  Optimization algorithm
    Given a DNA sequence, find the most probable annotation, according to the model

Genscan’s generative model
  CCATGGCGTCTTCAGGCAGTGACTC
  Labels: Intron, Exon, Intron

Genscan’s generative model
  Generalized HMM
    States correspond to gene features
    Model generates DNA sequence by passing through states
    The probability of annotated DNA sequence is the probability of
    – generating the DNA sequence
    – by passing through states corresponding to the annotation.
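The "find the most probable annotation" step can be sketched with the Viterbi algorithm. This is a minimal illustration, assuming a plain two-state HMM (exon, intron) rather than Genscan's generalized HMM, which also models how long each feature lasts. The state names and all probabilities below are made-up values for illustration, not parameters from Genscan.

```python
import math

# Hypothetical toy model: GC-rich "exon" state, AT-rich "intron" state.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable state path (annotation) for a DNA string."""
    if not seq:
        return []
    # best[s]: log-probability of the best path that ends in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s]: best predecessor of state s at position i + 1
    for base in seq[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            p = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            pointers[s] = p
            best[s] = prev[p] + math.log(TRANS[p][s]) + math.log(EMIT[s][base])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("GCGCGCATATAT"))
```

With these toy parameters the GC-rich half is labeled exon and the AT-rich half intron. A generalized HMM would replace the per-base self-transitions with explicit length distributions for each feature.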

Dual genome prediction
  Input
    Target and informant genomes
  Idea
    Patterns of evolution since the last common ancestor may reveal gene structure

Two conservation signals
  1. Local alignment signal
    Selective pressures differ by feature
    This leaves a characteristic signature
  2. Structural signal
    Locations of introns tend to be conserved

Characteristic local alignments
  Coding exon
    human  TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
           |||||||||||||||||||| || ||||| || || |||
    mouse  TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
  Intron (non-coding)
    human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
           ||||||     ||  | |||||||||  || ||   ||
    mouse  CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT

Conservation of intron location

Align→predict→filter→test
  Align: WU-BLAST
  Predict: TWINSCAN
  Filter: Aligned Intron Filter
  Test: Validation (RT-PCR)

TWINSCAN
  Representation change
    Alignment:
      TCTGCCACC
      || || ||
      TCAGCTACT
    Conservation sequence:
      TCTGCCACC
      ||:||:||
  gHMM decoding

BLAST Alignments (Target, Informant)

Projecting BLAST Alignments (Target, Informant)

Conservation sequence
  Synthetic (projected) local alignment
    human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
           ||||||         |:|||||||||::||:||   ||:
    mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT
  Pair each nucleotide of the target with
    “|” if it is aligned and identical
    “:” if it is aligned to mismatch or gap
    “.” if it is unaligned

Conservation sequence
  Informant sequence dropped; only the target and its conservation sequence remain
    human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
           ||||||         |:|||||||||::||:||   ||:

Conservation sequence
  Gap columns in the target removed
    human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
           ||||||         |:|||||||||::||:||||:
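The three pairing rules can be written as a short function. This is a minimal sketch, assuming the projected alignment is given as two equal-length strings in which "-" marks a gap and a space in the informant row marks an unaligned target position. That input encoding and the function name are assumptions made for illustration, not TWINSCAN's actual file format or code.

```python
def conservation_sequence(target_row, informant_row):
    """Build one conservation symbol per target nucleotide.

    "|" aligned and identical, ":" aligned to a mismatch or gap,
    "." unaligned. Columns where the target itself has a gap are
    skipped, because they contain no target nucleotide.
    """
    if len(target_row) != len(informant_row):
        raise ValueError("alignment rows must have equal length")
    symbols = []
    for t, i in zip(target_row, informant_row):
        if t == "-":
            continue  # no target nucleotide in this column
        if i == " ":
            symbols.append(".")  # target base not covered by any alignment
        elif i == t:
            symbols.append("|")  # aligned and identical
        else:
            symbols.append(":")  # aligned to a mismatch or an informant gap
    return "".join(symbols)


human = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC"
mouse = "CTAGAG" + " " * 9 + "AGACAGGTACCATAGGGCTCTCCT"
print(conservation_sequence(human, mouse))
# ||||||.........|:|||||||||::||:||||:
```

The output has one symbol for each of the 36 human nucleotides, matching the final string on the slide with "." written explicitly where the slide shows blanks.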

Twinscan: Extending the model
  Probability model
    Assigns probability to annotated DNA:
    5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
    ||| |:||||:|||||||||:||::||
    Labels: Exon, 5’ UTR, Intron
  Optimization
    Given DNA and conservation sequence, find the most probable annotation, according to the model

Twinscan
  Each state “generates” DNA and conservation sequence independently
  Probability of annotated DNA and conservation sequence is probability of generating the DNA and conservation sequence by passing through corresponding states
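Because a state generates the DNA and the conservation sequence independently, the joint probability for a stretch assigned to that state is the product of the two, or the sum in log space. The sketch below assumes position-independent emission tables for simplicity. The tables and their values are hypothetical and are not TWINSCAN's estimated parameters.

```python
import math

# Hypothetical emission tables, for illustration only.
UNIFORM_DNA = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
CODING_CONS = {"|": 0.80, ":": 0.15, ".": 0.05}  # coding: mostly conserved
INTRON_CONS = {"|": 0.40, ":": 0.20, ".": 0.40}  # intron: often unaligned


def joint_log_prob(dna, cons, dna_emit, cons_emit):
    """log P(dna, cons | state) = log P(dna | state) + log P(cons | state)."""
    if len(dna) != len(cons):
        raise ValueError("need one conservation symbol per nucleotide")
    return sum(
        math.log(dna_emit[base]) + math.log(cons_emit[symbol])
        for base, symbol in zip(dna, cons)
    )


# A fully conserved stretch scores higher under the coding table.
coding = joint_log_prob("ATGGCG", "||||||", UNIFORM_DNA, CODING_CONS)
intron = joint_log_prob("ATGGCG", "||||||", UNIFORM_DNA, INTRON_CONS)
print(coding > intron)
```

With identical DNA tables, the whole difference between the two states comes from the conservation term, which is the extra signal the second genome contributes.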

Performance Evaluation
  RefSeq
    A set of ~13,000 “known” mRNAs
    Represents ~40-50% of human genes
    – Usually, only one of several splices
    Mapping to genome is imperfect
    Best available gold standard


Short term goal
  All multi-exon human genes
    Predict accurately
    – Integrate information from more genomes
    Verify at least one intron experimentally
    Follow up with full-length verification

Acknowledgments
  Funding agencies
    National Institutes of Health (NHGRI)
    National Science Foundation (DBI)
  Sequencing centers
    Sanger, Whitehead, Wash. U.
  My group
    Ian Korf, Paul Flicek, Evan Keibler, Ping Hu
  Collaborators
    Roderic Guigo, Josep Abril, Genis Parra
    – Pankaj Agarwal
    Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis

Other clades
  Plants
    Arabidopsis thaliana, cabbage, rice
  Nematodes
    C. elegans, C. briggsae
  Fungi
    Cryptococcus neoformans (JEC21, H99)

Pair HMM algorithms (SLAM,…)
  Input is orthologous sequences.
  Aligns and predicts simultaneously, using a joint probability model
  Predicts orthologous genes in 2 sequences
  All predicted CDS is aligned
  Some aligned regions are not predicted CDS
    – Labeled conserved non-coding sequence

The algorithms (SLAM,…): sgp2
  Alignment before prediction (tblastx)
  Predicts genes in target sequence only
  Doesn’t need orthologous input sequences
    – Paralogs & low-coverage shotgun can help
  Modifies scores of all potential exons, by
    – At each base, adding the tblastx score of the best overlapping local alignment (roughly)
    – To the geneid scores of that potential exon
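The exon rescoring step described on this slide can be sketched roughly as follows. This is only an illustration of the idea "at each base, add the score of the best overlapping local alignment". The function name, the per-base HSP score, the half-open coordinates, and the `weight` parameter are all assumptions, not sgp2's actual data structures or scoring formula.

```python
def rescore_exon(exon_start, exon_end, base_score, hsps, weight=1.0):
    """Add alignment support to a candidate exon's ab initio score.

    exon_start, exon_end: half-open target coordinates of the exon.
    base_score: the exon's score from the single-genome predictor.
    hsps: list of (start, end, per_base_score) tuples in target
          coordinates (hypothetical representation of local alignments).
    """
    bonus = 0.0
    for pos in range(exon_start, exon_end):
        # Scores of every local alignment that covers this base.
        covering = [score for start, end, score in hsps if start <= pos < end]
        if covering:
            bonus += max(covering)  # keep only the best overlapping alignment
    return base_score + weight * bonus


hsps = [(0, 15, 1.0), (12, 18, 2.0)]
print(rescore_exon(10, 20, 5.0, hsps))  # 19.0
```

In this example, bases 10-11 get 1.0 each, bases 12-17 get 2.0 each, and bases 18-19 are uncovered, so the bonus is 14.0 on top of the base score of 5.0. An exon with no overlapping alignment keeps its original score.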

The algorithms: TWINSCAN
  Alignment before prediction (blastn)
  Predicts in target sequence only
  Modifies scores of all potential exons, UTRs, splice sites, start and stop models, by
    – At each base, applying a feature-specific scoring model (estimated for this purpose)
    – to the best overlapping local alignment, and adding the result
    – To Genscan scores for that feature

% Aligned, CDS vs. other

Syntenic Gene Prediction (sgp2)
  Figure tracks: Query Sequence; tblastx HSPs; geneid Exons; HSPs Projections; SGP Exons

Why work on gene finding?
  Genes are
    Components responsible for biological function
    Variations cause human disease / susceptibility
    Controls for modifying biological function
    – Human gene therapy
    – Agriculture
    – Nanotechnology, etc.