Gene discovery using combined signals from genome sequence and natural selection Michael Brent Washington University The mouse genome analysis group.


1 Gene discovery using combined signals from genome sequence and natural selection Michael Brent Washington University The mouse genome analysis group

2 GENSIPS 10/7/2002 Genes are read out via mRNA & processing

3 RNA Processing

4 A typical human gene structure

5 In a mammalian genome, finding all the genes is hard
Mammalian genomes are large
  –5,051 miles of 10pt type
  –Raleigh to Tripoli, Libya
Only about 1.5% is protein coding
  –Raleigh to Winston-Salem

6 Genes are fairly unconstrained
Intron length is highly variable
  ~5% are 40-100 nt long
  ~3% are longer than 30,000 nt
Distance between genes is highly variable
  From 10^3 to 10^6 nt or more (probably)

7 Exons per gene (RefSeq)

8 Background is not random
Segmental duplications
  Entire regions duplicate, then diverge slowly
Processed pseudogenes
  Spliced transcripts integrate back into the genome
  –Sequence is similar to source genes
  –Generally not functional

9 Gene prediction: two approaches
1. Transcript-based (e.g., GeneWise)
  A. Map experimentally determined sequences of spliced transcripts to their genomic source
  B. Map transcript sequences to genomic regions that could produce similar transcripts
2. De novo (genome only)
  Model DNA patterns characteristic of gene components
  –Splice donor and acceptor
  –Protein coding sequence
  –Translation start and stop

10 Advantages and disadvantages: transcript-based
Advantage: conservative
  –Evidence of transcription for every exon
Disadvantage: conservative
  –Can’t find “truly novel” genes
Still subject to error

11 Advantages and disadvantages: de novo
Advantage 1: Less biased toward
  –Known transcripts
  –Transcripts that can be sequenced easily
Advantage 2: Genome sequencing is easy
Disadvantages
  –No direct evidence of transcription
  –Presumably, more false positives

12 Single-genome de novo: Genscan
Strengths
  For mammalian sequence, one of the best single-genome, de novo gene predictors
  Widely used to great practical advantage
  De facto standard for mammalian sequence
Limitations
  Predicts >45K genes (best est.: 25-30K)
  Predicts >315K exons (best est.: 200K-250K)
  Gets only 9% of known genes exactly right*

13 Dual genome de novo
We developed algorithms that use two genomes to
  Reduce the number of false positives
  Refine the details of the structures

14 Single-genome de novo method
Probability model
  Assigns probability to annotated DNA sequences:
  5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
  Annotation labels in the figure: Exon, 5’ UTR, Intron
Optimization algorithm
  Given a DNA sequence, find the most probable annotation, according to the model

15 Genscan’s generative model
CCATGGCGTCTTCAGGCAGTGACTC
Annotation labels in the figure: Intron, Exon, Intron

16 Genscan’s generative model
Generalized HMM
  States correspond to gene features
  Model generates DNA sequence by passing through states
The probability of annotated DNA sequence is the probability of
  –generating the DNA sequence
  –by passing through states corresponding to the annotation.
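
The optimization step described on the model slides (find the most probable annotation for a DNA sequence) can be sketched with the Viterbi algorithm. This is a minimal sketch, not Genscan itself: it assumes a plain two-state HMM rather than a generalized HMM with explicit feature lengths, and the state names, `START`, `TRANS`, and `EMIT` tables are hypothetical toy values chosen only for illustration.

```python
import math

# Hypothetical toy model: two states, single-base emissions.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},    # assumed GC-rich
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # assumed AT-rich
}


def viterbi(dna):
    """Return the most probable state label for every base of `dna`."""
    if not dna:
        return []
    # Best log-probability of any path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(EMIT[s][dna[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in dna[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


annotation = viterbi("GCGCGCATATAT")
print(annotation)
```

With these toy parameters the GC-rich half is labeled exon and the AT-rich half intron. A real gene finder would use many more states and trained parameters.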

17 Dual genome prediction
Input
  Target and informant genomes
Idea
  Patterns of evolution since the last common ancestor may reveal gene structure

18 Two conservation signals
1. Local alignment signal
  Selective pressures differ by feature
  This leaves a characteristic signature
2. Structural signal
  Locations of introns tend to be conserved

19 Characteristic local alignments
Coding exon
human TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
      |||||||||||||||||||| || ||||| || || |||
mouse TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
Intron (non-coding)
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||     ||  | |||||||||  || ||   ||
mouse CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT

20 Conservation of intron location

21 Align→predict→filter→test
Align: WU-BLAST
Predict: TWINSCAN
Filter: Aligned Intron Filter
Test: Validation (RT-PCR)

22 TWINSCAN
Representation change: the local alignment of TCTGCCACC (target) to TCAGCTACT (informant) is replaced by the target sequence paired with a conservation sequence
gHMM decoding

23 BLAST Alignments (figure labels: Target, Informant)

24 Projecting BLAST Alignments (figure labels: Target, Informant)

25 Projecting BLAST Alignments (figure labels: Target, Informant)

26 Projecting BLAST Alignments (figure labels: Target, Informant)

27 Projecting BLAST Alignments (figure labels: Target, Informant)

28 Conservation sequence
Synthetic (projected) local alignment
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||         | |||||||||  || ||   ||
mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
  “|” if it is aligned and identical

29 Conservation sequence
Synthetic (projected) local alignment
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||         |:|||||||||::||:||   ||:
mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
  “|” if it is aligned and identical
  “:” if it is aligned to mismatch or gap

30 Conservation sequence
Synthetic (projected) local alignment
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||.........|:|||||||||::||:||   ||:
mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
  “|” if it is aligned and identical
  “:” if it is aligned to mismatch or gap
  “.” if it is unaligned

31 Conservation sequence
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||.........|:|||||||||::||:||   ||:
Pair each nucleotide of the target with
  “|” if it is aligned and identical
  “:” if it is aligned to mismatch or gap
  “.” if it is unaligned

32 Conservation sequence
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
      ||||||.........|:|||||||||::||:||||:
Pair each nucleotide of the target with
  “|” if it is aligned and identical
  “:” if it is aligned to mismatch or gap
  “.” if it is unaligned
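
The three-symbol rule on the conservation sequence slides can be sketched in a few lines. This is an illustrative sketch, not TWINSCAN's code: the function name `conservation_sequence` and the input convention (two equal-length alignment rows, `-` for a gap, a space in the informant row for an unaligned column) are assumptions made for the example. The input strings are the human and mouse rows from the projected alignment on the slides.

```python
def conservation_sequence(target_row, informant_row):
    """Build a conservation sequence from two equal-length alignment rows.

    Assumed conventions: '-' marks a gap, and a space in the informant
    row marks a column where the target is not aligned to anything.
    """
    if len(target_row) != len(informant_row):
        raise ValueError("alignment rows must have equal length")
    symbols = []
    for t, i in zip(target_row, informant_row):
        if t == "-":
            continue             # gap in the target: no nucleotide to label
        if i == " ":
            symbols.append(".")  # unaligned
        elif i == t:
            symbols.append("|")  # aligned and identical
        else:
            symbols.append(":")  # aligned to a mismatch or a gap
    return "".join(symbols)


human = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC"
mouse = "CTAGAG" + " " * 9 + "AGACAGGTACCATAGGGCTCTCCT"
print(conservation_sequence(human, mouse))
# prints ||||||.........|:|||||||||::||:||||:
```

The result has one symbol per target nucleotide, so columns that are gaps in the target drop out, as in the final slide of the build.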

33 Twinscan: Extending the model
Probability model
  Assigns probability to annotated DNA:
  5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
    |||........|:||||:|||||||||:||::||
  Annotation labels in the figure: Exon, 5’ UTR, Intron
Optimization
  Given DNA and conservation sequence, find the most probable annotation, according to the model

34 Twinscan
Each state “generates” DNA and conservation sequence independently
Probability of annotated DNA and conservation sequence is probability of generating the DNA and conservation sequence by passing through corresponding states
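
Because each state generates the DNA symbol and the conservation symbol independently, the joint probability of an annotated sequence factors into a transition term, a DNA emission term, and a conservation emission term at every position. The sketch below shows that factorization under simplifying assumptions: a plain two-state HMM with single-symbol emissions, and hypothetical toy tables (`START`, `TRANS`, `DNA_EMIT`, `CONS_EMIT`) that are not Twinscan's trained parameters.

```python
import math

# Hypothetical toy parameters, for illustration only.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
DNA_EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
CONS_EMIT = {
    "exon": {"|": 0.80, ":": 0.15, ".": 0.05},    # assumed: exons mostly conserved
    "intron": {"|": 0.40, ":": 0.30, ".": 0.30},  # assumed: introns less conserved
}


def joint_log_prob(dna, cons, states):
    """Log-probability of DNA and conservation sequence under one annotation."""
    if not (len(dna) == len(cons) == len(states)):
        raise ValueError("dna, cons, and states must have equal length")
    total = 0.0
    prev = None
    for base, symbol, state in zip(dna, cons, states):
        total += math.log(START[state] if prev is None else TRANS[prev][state])
        total += math.log(DNA_EMIT[state][base])     # DNA emission
        total += math.log(CONS_EMIT[state][symbol])  # independent conservation emission
        prev = state
    return total


dna, cons = "GCGC", "||||"
as_exon = joint_log_prob(dna, cons, ["exon"] * 4)
as_intron = joint_log_prob(dna, cons, ["intron"] * 4)
print(as_exon > as_intron)
```

With these toy numbers a fully conserved stretch scores higher as exon than as intron, which is the kind of evidence the conservation sequence is meant to add.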

35 Performance Evaluation
RefSeq
  A set of ~13,000 “known” mRNAs
  Represents ~40-50% of human genes
  –Usually, only one of several splices
  Mapping to genome is imperfect
  Best available gold standard

40 Short term goal
All multi-exon human genes
  Predict accurately
  –Integrate information from more genomes
  Verify at least one intron experimentally
  Follow up with full-length verification

41 Acknowledgments
Funding agencies
  National Institutes of Health (NHGRI)
  National Science Foundation (DBI)
Sequencing centers
  Sanger, Whitehead, Wash. U.
My group
  Ian Korf, Paul Flicek, Evan Keibler, Ping Hu
Collaborators
  Roderic Guigo, Josep Abril, Genis Parra
  –Pankaj Agarwal
  Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis

42 Other clades
Plants
  Arabidopsis thaliana, cabbage, rice
Nematodes
  C. elegans, C. briggsae
Fungi
  Cryptococcus neoformans (JEC21, H99)

43 Pair HMM algorithms (SLAM,…)
Input is orthologous sequences
Aligns and predicts simultaneously, using a joint probability model
Predicts orthologous genes in 2 sequences
All predicted CDS is aligned
Some aligned regions are not predicted CDS
  –Labeled conserved non-coding sequence

44 The algorithms (SLAM,…) sgp2
Alignment before prediction (tblastx)
Predicts genes in target sequence only
Doesn’t need orthologous input sequences
  –Paralogs & low-coverage shotgun can help
Modifies scores of all potential exons, by
  –At each base, adding the tblastx score of the best overlapping local alignment (roughly)
  –To the geneid score of that potential exon
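
The exon rescoring idea on the sgp2 slide can be sketched as follows. This is only a rough illustration of the slide's wording, not sgp2's actual scoring function: the name `rescore_exon`, the `weight` parameter, the tuple layouts, and the use of a single per-base score for each alignment are all assumptions made for the example.

```python
def rescore_exon(exon, hsps, weight=1.0):
    """Add alignment support to a candidate exon's score.

    exon: (start, end, base_score), coordinates inclusive.
    hsps: list of (start, end, per_base_score) local alignments.
    At each base of the exon, the best score among the alignments that
    overlap that base is added (0 if none overlaps).
    """
    start, end, base_score = exon
    bonus = 0.0
    for pos in range(start, end + 1):
        covering = [score for (h_start, h_end, score) in hsps
                    if h_start <= pos <= h_end]
        bonus += max(covering, default=0.0)
    return base_score + weight * bonus


# Hypothetical candidate exon and two overlapping alignments.
exon = (10, 19, 5.0)
hsps = [(0, 14, 2.0), (12, 30, 3.0)]
print(rescore_exon(exon, hsps))
```

An exon with no overlapping alignment keeps its original score, so unsupported candidates are neither rewarded nor penalized in this sketch.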

45 The algorithms TWINSCAN
Alignment before prediction (blastn)
Predicts in target sequence only
Modifies scores of all potential exons, UTRs, splice sites, start and stop models, by
  –At each base, applying a feature-specific scoring model (estimated for this purpose)
  –To the best overlapping local alignment, and adding the result
  –To Genscan scores for that feature

46 % Aligned, CDS vs. other

47 Syntenic Gene Prediction (sgp2)
Figure labels: Query Sequence, tblastx HSPs, geneid Exons, HSPs Projections, SGP Exons

48 Why work on gene finding?
Genes are
  Components responsible for biological function
  Variations cause human disease / susceptibility
  Controls for modifying biological function
  –Human gene therapy
  –Agriculture
  –Nanotechnology, etc.

