Finding genes in human using the mouse; finding genes in mouse using the human. Lior Pachter, Department of Mathematics, U.C. Berkeley.


1 Finding genes in human using the mouse; finding genes in mouse using the human. Lior Pachter, Department of Mathematics, U.C. Berkeley

2 The Gene Finding Problem. [Figure: DNA drawn 5’ to 3’ with Exons 1-4 and Introns 1-3, labeled with: promoter (TATA), translation initiation (ATG), splice site (GGTGAG), branchpoint (CTGAC), pyrimidine tract, splice site (CAG), stop codon (TAG/TGA/TAA), polyA signal.]

3 Approaches to Gene Recognition. Naïve (mid-80s to mid-90s): ORFfinder, BLAST, ... Statistical de novo: Genie (96), Genscan (97), FGENESH, ... Systems: Ensembl, ... “Ask not what mathematics can do for biology, ask what biology can do for mathematics” - Stanislaw Ulam

4 Difficulty of naïve approaches. n = number of acceptor splice sites, m = number of donor splice sites. Number of gene structures = F_{n+m+1} (Fibonacci number): 1, 1, 2, 3, 5, 8, 13, 21, 34, …
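To get a feel for how fast this count grows, the sketch below evaluates F_{n+m+1} for a few site counts. It assumes the indexing F_1 = F_2 = 1 implied by the sequence listed on the slide; the function names are illustrative and not part of the talk.

```python
def fibonacci(k: int) -> int:
    """Return F_k, assuming F_1 = F_2 = 1 (indexing taken from the slide's sequence)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    a, b = 1, 1  # F_1, F_2
    for _ in range(k - 1):
        a, b = b, a + b
    return a


def count_gene_structures(n_acceptors: int, m_donors: int) -> int:
    """Number of candidate gene structures, F_{n+m+1}, as stated on the slide."""
    return fibonacci(n_acceptors + m_donors + 1)


if __name__ == "__main__":
    # Hypothetical site counts, chosen only to show the exponential growth.
    for n, m in [(1, 1), (5, 5), (20, 20), (50, 50)]:
        print(f"n={n:3d} m={m:3d} structures={count_gene_structures(n, m)}")
```

Because the count grows exponentially in n + m, listing every candidate structure quickly becomes impractical.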

5 statistical gene finding TAATATGTCCACGGTTGTACACGGCAG GTATTG AG GTATTG AG ATGTAACTG AA

6 TAATATGTCCACGGTTGTACACGGCAG GTATTG AG GTATTG AG ATGTAACTG AA

7 Using GHMMs for ab-initio gene finding. In practice, we have an observed sequence and predict genes by estimating the hidden state sequence. Usual solution: the single most likely sequence of hidden states (Viterbi). TAATATGTCCACGGTTGTACACGGCAG GTATTG AG GTATTG AG ATGTAACTG AA
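The Viterbi idea can be sketched on a plain two-state HMM. This is a simplification: a GHMM additionally attaches an explicit length distribution to each state, which is not modeled here. The state names and all probabilities below are made-up toy values, not parameters from any gene finder.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for a plain HMM, working in log space."""
    if not observations:
        raise ValueError("observations must be non-empty")
    # best[i][s]: log-probability of the best state path ending in s at position i
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[i][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[-1][last], path


def logs(probabilities):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(value) for key, value in probabilities.items()}


# Toy model (assumed values): "exon" is GC-rich, "intron" is AT-rich.
STATES = ("exon", "intron")
LOG_START = logs({"exon": 0.5, "intron": 0.5})
LOG_TRANS = {
    "exon": logs({"exon": 0.9, "intron": 0.1}),
    "intron": logs({"exon": 0.1, "intron": 0.9}),
}
LOG_EMIT = {
    "exon": logs({"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}),
    "intron": logs({"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}),
}

if __name__ == "__main__":
    score, path = viterbi("GCGCGCGCATATATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print(score, path)
```

Under these toy parameters a GC-rich stretch followed by an AT-rich stretch is labeled exon, then intron. Working in log space avoids numerical underflow on long sequences.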

8 Results. High sensitivity / low specificity. Exon / intron length distributions. Identification of GC isochore - gene richness dependence. Splice site models.

9 Comparative Gene Finding

10 http://www-gsd.lbl.gov/vista/

11 Comparison of 1196 orthologous mRNAs (Makalowski et al., 1996) Sequence identity: –exons: 84.6% –protein: 85.4% –5’ UTRs: 67% –3’ UTRs: 69% 27 proteins were 100% identical.
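As an illustration of what a sequence identity figure measures, the sketch below computes percent identity for a pair of pre-aligned sequences. It assumes identity is counted over ungapped alignment columns only; the slide does not say how the published figures were computed, so this convention and the function name are assumptions.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of ungapped alignment columns in which both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns containing a gap (assumed convention)
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Hypothetical toy alignment, not real human/mouse data.
    print(percent_identity("ATGGC-CTA", "ATGACTCTA"))
```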

12 Comparison of 117 complete genes (Batzoglou/Pachter et al. 1999). 95% of genes have an equal number of coding exons (exceptions: Spermidine Synthase, Lymphotoxin Beta). 73% of coding exons have equal length. 95% of coding exons have length equal mod 3. Intron conservation: 35%. Intron length ratio (longer/shorter): 1.5.
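The exon statistics above can be computed per gene pair with a few comparisons. The sketch below assumes each gene is given as a list of coding exon lengths in order; the data layout, function name, and example lengths are hypothetical. A length difference that is a multiple of 3 leaves the reading frame unchanged, which is why the mod 3 comparison is informative.

```python
def compare_exon_lengths(human_exons, mouse_exons):
    """Compare ordered coding exon lengths for one orthologous gene pair."""
    same_count = len(human_exons) == len(mouse_exons)
    # Exons are only paired up when both genes have the same number of them.
    pairs = list(zip(human_exons, mouse_exons)) if same_count else []
    equal_length = sum(1 for h, m in pairs if h == m)
    equal_mod_3 = sum(1 for h, m in pairs if (h - m) % 3 == 0)
    return {
        "same_exon_count": same_count,
        "exon_pairs": len(pairs),
        "equal_length": equal_length,
        "equal_mod_3": equal_mod_3,
    }


if __name__ == "__main__":
    # Made-up exon lengths in bp, for illustration only.
    print(compare_exon_lengths([120, 87, 204, 66], [120, 90, 203, 66]))
```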

13 SLAM: alignment & gene finding. Input: –Pair of syntenic sequences (FASTA). Output: –CDS and CNS predictions in both sequences. –Protein predictions. –Protein and CNS alignment.

14 http://bio.math.berkeley.edu/slam/

15 SLAM components. Splice site detector: –VLMM. Intron and intergenic regions: –2nd order Markov chain –independent geometric lengths. Coding sequence: –PHMM on protein level –generalized length distribution. Conserved non-coding sequence: –PHMM on DNA level.
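Two of the components listed above, a 2nd order Markov chain and a geometric length distribution, are simple enough to sketch. This is a generic illustration, not SLAM's implementation: the pseudocount smoothing, the function names, and the training string are assumptions, and sequences are assumed to contain only uppercase A, C, G, T. The VLMM and PHMM components are not shown.

```python
import math
from collections import defaultdict


def train_second_order(sequences, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(next base | previous two bases) with add-pseudocount smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    model = {}
    for a in alphabet:
        for b in alphabet:
            context = a + b
            total = sum(counts[context][c] for c in alphabet) + pseudocount * len(alphabet)
            model[context] = {
                c: (counts[context][c] + pseudocount) / total for c in alphabet
            }
    return model


def log_likelihood(seq, model):
    """Log-probability of seq given its first two bases."""
    return sum(math.log(model[seq[i - 2:i]][seq[i]]) for i in range(2, len(seq)))


def geometric_log_prob(length, p):
    """log P(L = length) for a geometric length on 1, 2, 3, ...: (1 - p)^(length - 1) * p."""
    if length < 1 or not 0.0 < p < 1.0:
        raise ValueError("need length >= 1 and 0 < p < 1")
    return (length - 1) * math.log(1.0 - p) + math.log(p)


if __name__ == "__main__":
    model = train_second_order(["ACGACGACGACG"])  # toy training data
    print(log_likelihood("ACGACG", model), geometric_log_prob(500, 0.002))
```

A geometric length arises naturally from a state with a fixed self-transition probability, which is why it needs no explicit duration modeling.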

16 Input:

17 Output:

18 What have we learned from comparative gene finding? –Conservation is a stronger splice site indicator than consensus. –Intron lengths have diverged. –Gene structure conservation is more powerful than sequence conservation for prediction. –Consensus for GC splice sites.

19 SLAM whole genome run. –Align the genomes. –Construct a synteny map. –Chop up into SLAMable pieces. –Run SLAM. –Collate results.

20 Alignment project: http://zilla.lbl.gov/

21 Linux cluster with 15 1.2GHz PCs, 750 MB of RAM. Three days to align the entire mouse genome against the human genome.

22 Finding regulatory regions (Godzilla). Gene name: Enolase. Experimentally defined enhancer (beta-enolase).

23 http://lemur.lbl.gov/vistatrack/


27 Experimental gene verification with RT-PCR. [Figure labels: predicted intron, primer.] Intron > 1000bp. Aligning human/mouse exons > 60bp.
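The two thresholds on this slide can be read as a filter on which predictions to test. The sketch below is one possible reading, assuming a candidate is a predicted intron flanked by two exons that align between human and mouse; the record layout and names are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass

MIN_INTRON_BP = 1000  # threshold from the slide
MIN_EXON_BP = 60      # threshold from the slide


@dataclass
class IntronCandidate:
    """Hypothetical record for one predicted intron and its flanking exons."""
    upstream_exon_bp: int
    intron_bp: int
    downstream_exon_bp: int
    exons_align: bool  # True if the flanking exons align between human and mouse


def passes_rtpcr_filter(candidate: IntronCandidate) -> bool:
    """Apply the slide's size thresholds under the assumed reading described above."""
    return (
        candidate.exons_align
        and candidate.intron_bp > MIN_INTRON_BP
        and candidate.upstream_exon_bp > MIN_EXON_BP
        and candidate.downstream_exon_bp > MIN_EXON_BP
    )


if __name__ == "__main__":
    # Made-up candidates for illustration.
    print(passes_rtpcr_filter(IntronCandidate(120, 2500, 95, True)))
    print(passes_rtpcr_filter(IntronCandidate(45, 2500, 95, True)))
```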

28 SLAM CNS data

29 Single exon data

30 Acknowledgments. Marina Alexandersson - Gothenburg, Sweden (SLAM); Nick Bray - LBNL/UCB math (Avid alignment program); Simon Cawley - Affymetrix (SLAM); Olivier Couronne - LBNL (Godzilla); Colin Dewey - Berkeley (SLAM); Alex Poliakov - LBNL (Godzilla, VISTA); Chuck Sugnet - UCSC (SLAM); Inna Dubchak - LBNL; Eddy Rubin - LBNL

