Gene discovery using combined signals from genome sequence and natural selection Michael Brent Washington University The mouse genome analysis group.


GENSIPS10/7/2002 2 Genes are read out via mRNA & processing

GENSIPS10/7/2002 3 RNA Processing

GENSIPS10/7/2002 4 A typical human gene structure

GENSIPS10/7/2002 5 In a mammalian genome Finding all the genes is hard Mammalian genomes are large –5,051 miles of 10pt type –Raleigh to Tripoli, Libya Only about 1.5% protein coding –Raleigh to Winston-Salem

GENSIPS10/7/2002 6 Genes are fairly unconstrained Intron length is highly variable ~5% are 40-100 nt long ~3% are longer than 30,000 nt Distance between genes is highly variable From 10^3 to 10^6 nt or more (probably)

GENSIPS10/7/2002 7 Exons per gene (RefSeq)

GENSIPS10/7/2002 8 Background is not random Segmental duplications Entire regions duplicate, then diverge slowly Processed pseudogenes Spliced transcripts integrate back into the genome –Sequence is similar to source genes –Generally not functional

GENSIPS10/7/2002 9 Gene prediction: two approaches 1. Transcript-based (e.g., GeneWise) A. Map experimentally determined sequences of spliced transcripts to their genomic source B. Map transcript sequences to genomic regions that could produce similar transcripts 2. De novo (genome only) Model DNA patterns characteristic of gene components –Splice donor and acceptor –Protein coding sequence –Translation start and stop

GENSIPS10/7/2002 10 Advantages and disadvantages Transcript-based Advantage: conservative –Evidence of transcription for every exon Disadvantage: conservative –Can’t find “truly novel” genes Still subject to error

GENSIPS10/7/2002 11 Advantages and disadvantages De novo Advantage 1: Less biased toward –Known transcripts –Transcripts that can be sequenced easily Advantage 2: Genome sequencing is easy Disadvantages –No direct evidence of transcription –Presumably, more false positives

GENSIPS10/7/2002 12 Single-genome de novo: Genscan Strengths For mammalian sequence, one of the best single-genome, de novo gene predictors Widely used to great practical advantage De facto standard for mammalian sequence Limitations Predicts >45K genes (best est.: 25-30K) Predicts >315K exons (best est. 200K-250K) Gets only 9% of known genes exactly right*

GENSIPS10/7/2002 13 Dual genome de novo We developed algorithms that use two genomes to Reduce the number of false positives Refine the details of the structures

GENSIPS10/7/2002 14 Single-genome de novo method
Probability model Assigns probability to annotated DNA sequences:
5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
(figure labels: Exon, 5’ UTR, Intron)
Optimization algorithm Given a DNA sequence, find the most probable annotation, according to the model

GENSIPS10/7/2002 15 Genscan’s generative model [Figure: the sequence CCATGGCGTCTTCAGGCAGTGACTC with the state labels Intron, Exon, Intron]

GENSIPS10/7/2002 16 Genscan’s generative model Generalized HMM States correspond to gene features Model generates DNA sequence by passing through states The probability of an annotated DNA sequence is the probability of –generating the DNA sequence –by passing through states corresponding to the annotation.
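The two ideas on these slides, scoring an annotated sequence and finding the most probable annotation, can be illustrated with a much simpler model. The sketch below is not Genscan: it uses an ordinary two-state HMM (no explicit state durations, so not a generalized HMM), and the state set, all probabilities, and the function names `log_prob_annotated` and `viterbi` are made up for illustration.

```python
import math
from itertools import product

# Toy parameters, invented for this sketch (not Genscan's).
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def log_prob_annotated(dna, labels):
    """Log probability of generating `dna` while passing through `labels`."""
    lp = math.log(START[labels[0]]) + math.log(EMIT[labels[0]][dna[0]])
    for prev, cur, base in zip(labels, labels[1:], dna[1:]):
        lp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][base])
    return lp


def viterbi(dna):
    """Return (most probable labeling, its log probability) for `dna`."""
    best = {s: math.log(START[s]) + math.log(EMIT[s][dna[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in dna[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            ptr[s] = prev
            nxt[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
        back.append(ptr)
        best = nxt
    state = max(STATES, key=best.get)
    score = best[state]
    path = [state]
    for ptr in reversed(back):  # trace the back-pointers to recover the path
        state = ptr[state]
        path.append(state)
    return path[::-1], score


if __name__ == "__main__":
    labels, score = viterbi("GCGCGCATATAT")
    print(labels, score)
```

Dynamic programming gives the same best score as enumerating every labeling, but in time linear in the sequence length. That matters because exhaustive enumeration is impossible at genome scale.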

GENSIPS10/7/2002 17 Dual genome prediction Input Target and informant genomes Idea Patterns of evolution since the last common ancestor may reveal gene structure

GENSIPS10/7/2002 18 Two conservation signals 1. Local alignment signal Selective pressures differ by feature This leaves a characteristic signature 2. Structural signal Locations of introns tend to be conserved

GENSIPS10/7/2002 19 Characteristic local alignments
Coding exon
human TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
      |||||||||||||||||||| || ||||| || || |||
mouse TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
Intron (non-coding)
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||     ||  | |||||||||  || ||   ||
mouse CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
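One way to see the "characteristic signature" in these two alignments is to count where the differences fall modulo 3. The sketch below does that for the sequences on this slide. The helper name `difference_phases` is hypothetical, gap columns are counted as differences, and treating the first column as phase 0 is an assumption, since the slide does not give the reading frame.

```python
from collections import Counter


def difference_phases(row_a, row_b):
    """Count non-identical alignment columns by column index modulo 3."""
    return Counter(i % 3 for i, (a, b) in enumerate(zip(row_a, row_b)) if a != b)


# Sequences copied from the slide (human on top, mouse below).
exon_human = "TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC"
exon_mouse = "TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC"
intron_human = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC"
intron_mouse = "CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT"

if __name__ == "__main__":
    print("coding exon:", dict(difference_phases(exon_human, exon_mouse)))
    print("intron:     ", dict(difference_phases(intron_human, intron_mouse)))
```

In the coding exon all five differences share one phase, which is what changes confined to a single codon position would look like. In the intron the fifteen differing columns are spread evenly over the three phases and include gaps.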

GENSIPS10/7/2002 20 Conservation of intron location

GENSIPS10/7/2002 21 Align→predict→filter→test [Figure: pipeline WU-BLAST → TWINSCAN → Aligned Intron Filter → Validation (RT-PCR), illustrated with the alignment TTATCCACCAGAC CAGATAGATACTT GTCTGCCACCCTC / TTATCCACCAGAC CAGATAGGTATTT GTCAGCTACTCTC]

GENSIPS10/7/2002 22 TWINSCAN [Figure: representation change from the local alignment TCTGCCACC / TCAGCTACT to a conservation sequence over the target TCTGCCACC, followed by gHMM decoding]

GENSIPS10/7/2002 23 BLAST Alignments Target Informant

GENSIPS10/7/2002 24 Projecting BLAST Alignments Target Informant

GENSIPS10/7/2002 28 Conservation sequence
Synthetic (projected) local alignment
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||         | |||||||||  || ||   ||
mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
“|” if it is aligned and identical

GENSIPS10/7/2002 29 Conservation sequence
Synthetic (projected) local alignment
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||         |:|||||||||::||:||   ||:
mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
“|” if it is aligned and identical
“:” if it is aligned to mismatch or gap

GENSIPS10/7/2002 30 Conservation sequence
Synthetic (projected) local alignment
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||.........|:|||||||||::||:||   ||:
mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
“|” if it is aligned and identical
“:” if it is aligned to mismatch or gap
“.” if it is unaligned

GENSIPS10/7/2002 31 Conservation sequence
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||.........|:|||||||||::||:||   ||:
Pair each nucleotide of the target with
“|” if it is aligned and identical
“:” if it is aligned to mismatch or gap
“.” if it is unaligned

GENSIPS10/7/2002 32 Conservation sequence
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
      ||||||.........|:|||||||||::||:||||:
Pair each nucleotide of the target with
“|” if it is aligned and identical
“:” if it is aligned to mismatch or gap
“.” if it is unaligned
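The three pairing rules can be written as a short function. This is a minimal sketch, not TWINSCAN's own code. It assumes the projected alignment is given as two equal-length rows in which "-" marks a gap and a space in the informant row marks an unaligned target position. The function name `conservation_sequence` is hypothetical. Columns where the target itself has a gap are skipped, because they hold no target nucleotide.

```python
def conservation_sequence(target_row, informant_row):
    """Build one conservation symbol per target nucleotide.

    "|" = aligned and identical, ":" = aligned to a mismatch or gap,
    "." = unaligned (informant row holds a space at that column).
    """
    if len(target_row) != len(informant_row):
        raise ValueError("alignment rows must have the same length")
    symbols = []
    for t, i in zip(target_row, informant_row):
        if t == "-":
            continue  # gap in the target: no target nucleotide to label
        if i == " ":
            symbols.append(".")
        elif i == t:
            symbols.append("|")
        else:
            symbols.append(":")
    return "".join(symbols)


# Rows taken from the slides: human target, projected mouse informant.
human = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC"
mouse = "CTAGAG" + " " * 9 + "AGACAGGTACCATAGGGCTCTCCT"

if __name__ == "__main__":
    print(human.replace("-", ""))
    print(conservation_sequence(human, mouse))
```

The result has exactly one symbol per target nucleotide, so it can be read in parallel with the target DNA by any model that walks the target sequence.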

GENSIPS10/7/2002 33 Twinscan: Extending the model
Probability model Assigns probability to annotated DNA:
5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
  |||........|:||||:|||||||||:||::||
(figure labels: Exon, 5’ UTR, Intron)
Optimization Given DNA and conservation sequence, find the most probable annotation, according to the model

GENSIPS10/7/2002 34 Twinscan Each state “generates” DNA and conservation sequence independently Probability of annotated DNA and conservation sequence is the probability of generating the DNA and conservation sequence by passing through corresponding states
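Because each state generates the DNA and the conservation sequence independently, the joint emission probability factors into a DNA term times a conservation term, and the two add in log space. The sketch below shows only that factorization. The emission tables are invented, position-independent toy values, and the names `DNA_EMIT`, `CONS_EMIT`, and `segment_log_score` are hypothetical rather than taken from Twinscan.

```python
import math

# Toy, position-independent emission tables (made-up numbers for illustration).
DNA_EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
            "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
CONS_EMIT = {"exon": {"|": 0.80, ":": 0.15, ".": 0.05},
             "intron": {"|": 0.40, ":": 0.30, ".": 0.30}}


def dna_log_score(state, dna):
    """Log probability that `state` emits the DNA segment."""
    return sum(math.log(DNA_EMIT[state][base]) for base in dna)


def cons_log_score(state, cons):
    """Log probability that `state` emits the conservation segment."""
    return sum(math.log(CONS_EMIT[state][sym]) for sym in cons)


def segment_log_score(state, dna, cons):
    """Joint log score: independence makes it the sum of the two terms."""
    if len(dna) != len(cons):
        raise ValueError("need one conservation symbol per nucleotide")
    return dna_log_score(state, dna) + cons_log_score(state, cons)


if __name__ == "__main__":
    dna = "ACGTAC"
    for cons in ("||||||", "......"):
        scores = {s: segment_log_score(s, dna, cons) for s in ("exon", "intron")}
        print(cons, scores)
```

With these toy tables, the same DNA segment scores higher as an exon when it is fully conserved and higher as an intron when it is unaligned. This is how the conservation signal shifts the decoded annotation.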

GENSIPS10/7/2002 35 Performance Evaluation RefSeq A set of ~13,000 “Known” mRNAs Represents ~40-50% of human genes –Usually, only one of several splices Mapping to genome is imperfect Best available gold standard


GENSIPS10/7/2002 40 Short term goal All multi-exon human genes Predict accurately –Integrate information from more genomes Verify at least one intron experimentally Follow up with full-length verification

GENSIPS10/7/2002 41 Acknowledgments Funding agencies National Institutes of Health (NHGRI) National Science Foundation (DBI) Sequencing centers Sanger, Whitehead, Wash. U. My group Ian Korf, Paul Flicek, Evan Keibler, Ping Hu Collaborators Roderic Guigo, Josep Abril, Genis Parra –Pankaj Agarwal Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis

GENSIPS10/7/2002 42 Other clades Plants Arabidopsis thaliana, cabbage, rice Nematodes C. elegans, C. briggsae Fungi Cryptococcus neoformans (JEC21, H99)

GENSIPS10/7/2002 43 Pair HMM algorithms (SLAM,…) Input is orthologous sequences. Aligns and predicts simultaneously, using a joint probability model Predicts orthologous genes in 2 sequences All predicted CDS is aligned Some aligned regions are not predicted CDS –Labeled conserved non-coding sequence

GENSIPS10/7/2002 44 The algorithms (SLAM,…) sgp2 Alignment before prediction (tblastx) Predicts genes in target sequence only Doesn’t need orthologous input sequences –Paralogs & low-coverage shotgun can help Modifies scores of all potential exons, by –At each base, adding the tblastx score of the best overlapping local alignment (roughly) –To the geneid score of that potential exon
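The exon rescoring rule on this slide can be read quite literally: for each base of a candidate exon, take the score of the best local alignment overlapping that base, sum those values, and add the total to the exon's existing score. The sketch below implements that loose reading only. The slide itself says "roughly", so the details of sgp2's real scoring are not reproduced here. The half-open coordinates, the `weight` parameter, and the function names are assumptions for illustration.

```python
def best_hsp_score_at(pos, hsps):
    """Score of the best HSP covering `pos`, or 0.0 if none does.

    `hsps` is a list of (start, end, score) with half-open [start, end) spans.
    """
    return max((score for start, end, score in hsps if start <= pos < end),
               default=0.0)


def rescore_exon(exon_start, exon_end, base_score, hsps, weight=1.0):
    """Add the per-base best-HSP scores (times `weight`) to an exon's score."""
    bonus = sum(best_hsp_score_at(p, hsps) for p in range(exon_start, exon_end))
    return base_score + weight * bonus


if __name__ == "__main__":
    hsps = [(0, 10, 2.0), (5, 15, 3.0)]  # hypothetical alignment spans and scores
    print(rescore_exon(0, 12, 1.0, hsps))   # exon overlapped by both HSPs
    print(rescore_exon(20, 30, 1.5, hsps))  # exon with no alignment support
```

An exon with no overlapping alignment keeps its original score, so under this reading the alignment evidence can only promote candidates that the informant sequence supports.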

GENSIPS10/7/2002 45 The algorithms TWINSCAN Alignment before prediction (blastn) Predicts in target sequence only Modifies scores of all potential exons, UTRs, splice sites, start and stop models, by –At each base, applying a feature-specific scoring model (estimated for this purpose) –to the best overlapping local alignment, and adding the result –To Genscan scores for that feature

GENSIPS10/7/2002 46 % Aligned, CDS vs. other

GENSIPS10/7/2002 47 Syntenic Gene Prediction (sgp2) [Figure tracks: Query Sequence, tblastx HSPs, geneid Exons, HSPs Projections, SGP Exons]

GENSIPS10/7/2002 48 Why work on gene finding? Genes are Components responsible for biological function Variations cause human disease / susceptibility Controls for modifying biological function –Human gene therapy –Agriculture –Nanotechnology, etc.
