Comparative gene hunting. Irmtraud Meyer, The Sanger Institute (now University of Oxford)


1 Comparative gene hunting. Irmtraud Meyer, The Sanger Institute (now University of Oxford), meyer@stats.ox.ac.uk

2 Making sense of the genome: What are the proteins and where are they encoded? [Diagram labels: Experiments in Lab; Sequence Database (proteins, ESTs); protein; DNA cctgctgggtgcgagagccggcgtaccggtgaggcc]

3 Aim in ab initio gene prediction: [Diagram labels: protein; Sequence Database (proteins); Experiments in Lab; DNA cctgctgggtgcgagagccggcgtaccggtgaggcc]

4 3059 million bases: GCTGCCAACGC… We will very soon have: 3286 million bases: ACTGCGGGCGC…

5 Rough comparative map: reference: www.ensembl.org

6 Typical situation: human DNA and mouse DNA, both unannotated (?):
... gagccgcctcctccccttccccacgctctaggagggggccgcgggggcctggct gcgtcggccaatcggagtgcacttccgcagctgacaaattcagtataaaagcttggggct ggggccgagcactggggactttgagggtggccaggccagcgtaggaggccagcgtaggat cctgctgggagcggggaactgagggaagcgacgccgagaaagcaggcgtaccacggaggg agagaaaagctccggaagcccagcagcgcctttacgcacagctgccaactggccgctgcc gaccgtctccagctcccgaggacgcgcgaccggacaccgggtcctgccacagccgaggac agctcgccgctcgccgcagcgagcccggggcggcccttcagggggacctttcccagatcg cccaggccgcccggatgtgcacgaaaatggaacag...
... ggcgacgggggctcgggaagcctgacagggcttttgcgcacagctgccggctgg tgctacccgcccgcgccagcccccgagaacgcgcgaccaggcacccagtccggtcaccgc agcggagagctcgccgctcgctgcagcgaggcccggagcggccccgcagggaccctcccc agaccgcctgggccgcccggatgtgcactaaaatggaacagcccttctaccacgacgact catacacagctacgggatacggccgggcccctggtggcctctctctacacgactacaaac tcctgaaaccgagcctggcggtcaacctggccgacccctaccggagtctcaaagcgcctg gggctcgcggacccggcccagagggcggcggtggcggcagctacttttc...

7 Similar problem: demotic, Greek, hieroglyphs

8 Aim in comparative ab initio gene prediction: annotate DNA x and DNA y simultaneously.
Input: x: cctgctgggtgcgagagccggcgtaccggtgaggcc y: cctgctgggagcgaaagcaggcgtaccacggaggg
Output: x: ? y: ?

9 Why is this a good idea?
IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY
KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ
Advantages:
- can detect new genes, as there is no need to search in databases for proteins
- fewer assumptions needed than in one-strand ab initio gene-prediction methods, i.e. can detect unusual genes

10 Mouse – human comparison: 3059 million bases; 3286 million bases; about 30 000 (?) genes

11 Analysing mouse and human DNA:
Training: adjust parameters of Doublescan with a set of known pairs of orthologous mouse and human genes.
Testing: test set of 80 pairs of known mouse and human genes:
- 55 %: same number of exons, different coding length
- 42 %: same number of exons, same coding length
- 3 %: different number of exons, different coding length

12 Results - Performance: [Figure: annotation versus prediction, with predictions classified as correct, overlapping, missing, or wrong]
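
The four categories on this slide can be made concrete with a small sketch. The slide gives only the labels, so the definitions below are assumptions: a predicted exon counts as correct if both of its boundaries match an annotated exon, overlapping if it shares bases with an annotated exon without matching it exactly, and wrong if it overlaps none; an annotated exon counts as missing if no prediction overlaps it. The function name `classify_exons` and the interval representation are hypothetical, not part of Doublescan.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]

def classify_exons(annotated, predicted):
    """Count correct/overlapping/wrong predictions and missing annotations.

    The category definitions are assumptions made for this sketch.
    """
    counts = {"correct": 0, "overlapping": 0, "missing": 0, "wrong": 0}
    annotated_set = set(annotated)
    for p in predicted:
        if p in annotated_set:
            counts["correct"] += 1          # both boundaries agree
        elif any(overlaps(p, a) for a in annotated):
            counts["overlapping"] += 1      # shares bases, boundaries differ
        else:
            counts["wrong"] += 1            # no annotated exon here
    for a in annotated:
        if not any(overlaps(a, p) for p in predicted):
            counts["missing"] += 1          # annotated exon not found at all
    return counts

# Toy coordinates, invented for illustration only.
annotated = [(10, 50), (100, 150), (200, 260)]
predicted = [(10, 50), (110, 150), (300, 340)]
print(classify_exons(annotated, predicted))
```

Counting per exon rather than per base is a design choice of the sketch; a per-base comparison would need a different tally.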

13 C. elegans – C. briggsae:
C. elegans: sequenced in 1998; 97 million bases; 5 autosomes, one X; about 20 000 genes
C. briggsae: around 100 million bases; 5 autosomes, one X

14 Results - Performance: [Figure: annotation versus prediction, with predictions classified as correct, overlapping, missing, or wrong]

15 Summary: Doublescan:
- predicts the gene structures of both sequences at the same time as aligning the sequences
- capable of predicting partial, complete and multiple genes or no genes at all, as well as more diverged pairs of genes which are related by events of exon-fusion or exon-splitting
- can be used to analyse long sequences using the Stepping Stone algorithm (same performance as Hirschberg algorithm)
- general concept: can be trained to analyse other pairs of related genomes
- performance on mouse - human DNA and C. elegans – C. briggsae DNA very promising

16 To do list:
- large scale mouse - human comparison
- large scale C. elegans – C. briggsae comparison
- search for regulatory regions: x: y:

17 References: www.sanger.ac.uk/Software/analysis/doublescan; I. M. Meyer and R. Durbin, Bioinformatics, 2002, 18(10), pp. 1309-

18 Acknowledgements: Richard Durbin; Sequencing centres; Trinity College, Cambridge; Wellcome Trust; The Sanger Centre

19 The method: What are pair hidden Markov models? How can they be used to find genes?

20 Pair HMMs: idea: annotate the two sequences (DNA x, DNA y) by parsing them through connected states. Each state reads a fixed number of letters from one or two of the sequences.

21 Pair HMMs: idea: annotate the two sequences (DNA x, DNA y) by parsing them through connected states. States shown: start state; match intergenic; match exon; match intron. Each state reads a fixed number of letters from one or two of the sequences; match intergenic reads 1 letter from each sequence at a time.

22 Pair HMMs: idea: annotate the two sequences (DNA x, DNA y) by parsing them through connected states. States shown: match intergenic; match exon; match intron. Each state reads a fixed number of letters from one or two of the sequences. [Diagram labels: a state; a transition]

23 Doublescan: States: match intergenic x:y; emit x:-; emit -:y.
Unaligned:
x: ACGTCGACATGGCCTATCCGCTGAGCT
y: ACGTCGGGCCTCTCCGCTAAGCT
Aligned:
x: ACGTCGACATGGCCTATCCGCTGAGCT
y: ACGTCG----GGCCTCTCCGCTAAGCT
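
The three intergenic states on this slide form the classic pair HMM for alignment, and the most probable state path can be found with the Viterbi algorithm. The following is a minimal sketch of that idea, not Doublescan itself: the state names `match`, `emit_x`, `emit_y`, the transition values `GAP_OPEN` and `GAP_EXTEND`, and the emission probabilities are all assumptions picked for illustration.

```python
import math

NEG_INF = float("-inf")

# Each state reads (letters from x, letters from y), as on the slide.
STATES = {"match": (1, 1), "emit_x": (1, 0), "emit_y": (0, 1)}

# Hypothetical parameters, chosen only for illustration.
GAP_OPEN, GAP_EXTEND = 0.1, 0.6
TRANSITIONS = {
    ("match", "match"): 1 - 2 * GAP_OPEN,
    ("match", "emit_x"): GAP_OPEN,
    ("match", "emit_y"): GAP_OPEN,
    ("emit_x", "emit_x"): GAP_EXTEND,
    ("emit_x", "match"): 1 - GAP_EXTEND,
    ("emit_y", "emit_y"): GAP_EXTEND,
    ("emit_y", "match"): 1 - GAP_EXTEND,
}

def emission(state, a, b):
    """Toy emission probabilities: identical pairs are favoured in the match state."""
    if state == "match":
        return 0.2 if a == b else 0.2 / 12
    return 0.25

def viterbi_align(x, y):
    """Return the most probable alignment of x and y under the toy model."""
    n, m = len(x), len(y)
    # score[s][i][j]: best log-probability of reading x[:i], y[:j] and ending in s
    score = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in STATES}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in STATES}
    score["match"][0][0] = 0.0  # simplification: paths start in the match state
    for i in range(n + 1):
        for j in range(m + 1):
            for s, (dx, dy) in STATES.items():
                pi, pj = i - dx, j - dy
                if pi < 0 or pj < 0:
                    continue
                a = x[i - 1] if dx else "-"
                b = y[j - 1] if dy else "-"
                emit = math.log(emission(s, a, b))
                for (prev, cur), t in TRANSITIONS.items():
                    if cur != s:
                        continue
                    cand = score[prev][pi][pj] + math.log(t) + emit
                    if cand > score[s][i][j]:
                        score[s][i][j] = cand
                        back[s][i][j] = prev
    # Trace back from the best-scoring end state.
    s = max(STATES, key=lambda k: score[k][n][m])
    i, j, ax, ay = n, m, [], []
    while i > 0 or j > 0:
        dx, dy = STATES[s]
        ax.append(x[i - 1] if dx else "-")
        ay.append(y[j - 1] if dy else "-")
        s, i, j = back[s][i][j], i - dx, j - dy
    return "".join(reversed(ax)), "".join(reversed(ay))

ax, ay = viterbi_align("ACGTCGACATGGCCTATCCGCTGAGCT", "ACGTCGGGCCTCTCCGCTAAGCT")
print(ax)
print(ay)
```

With different parameters the gap may be placed differently from the slide's alignment; the point of the sketch is only how states that read one or two letters induce an alignment.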

24 Doublescan: States: match intergenic x:y; emit x:-; emit -:y; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3.
Unaligned:
x: CAAGCATGCGACAAAGGATACAGCGACCTC
y: CAAGCCTGCGGATACAGCGAACTC
Aligned:
x: CAAGCATGCGACAAAGGATACAGCGACCTC
y: CAAGCCTGC------GGATACAGCGAACTC
Annotations: same amino-acid (Alanine); insertion of two codons; similar amino-acids (Aspartic, Glutamic acid).
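
The codon-level annotations on this slide can be checked by translating the aligned sequences three letters at a time, which is what a state reading x1x2x3:y1y2y3 sees. The sketch below uses the standard genetic code; the helper name `codon_columns` and the convention of printing "-" for a gapped codon are assumptions of this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_columns(ax, ay):
    """Return (codon_x, codon_y, aa_x, aa_y) per aligned codon column.

    Assumes both gapped sequences are in frame and gaps span whole codons.
    """
    if len(ax) != len(ay) or len(ax) % 3 != 0:
        raise ValueError("aligned sequences must have equal length divisible by 3")
    columns = []
    for k in range(0, len(ax), 3):
        cx, cy = ax[k:k + 3], ay[k:k + 3]
        columns.append((cx, cy, CODON_TABLE.get(cx, "-"), CODON_TABLE.get(cy, "-")))
    return columns

x = "CAAGCATGCGACAAAGGATACAGCGACCTC"
y = "CAAGCCTGC------GGATACAGCGAACTC"
for cx, cy, aa_x, aa_y in codon_columns(x, y):
    print(cx, cy, aa_x, aa_y)
```

GCA and GCC both translate to A (alanine), and the GAC/GAA column pairs D (aspartic acid) with E (glutamic acid), matching the slide's annotations.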

25 Doublescan: States: match intergenic x:y; emit x:-; emit -:y; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; start:start (start codon); stop:stop (stop codon). x: y:

26 Doublescan: States: match intergenic x:y; emit x:-; emit -:y; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; start:start; stop:stop; match intron x:y; emit x:-; emit -:y; GT:GT (5’ splice site); AG:AG (3’ splice site).
Example (exon, intron, exon):
GCATGCGTACAGTTG…GTCAGGAGAGCGAACTCGCA
GCCTGCGTACAGTTA…AGTACGAGAGCGAACTCGCA

27 Doublescan: States: match intergenic x:y; emit x:-; emit -:y; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; start:start; stop:stop; match intron x:y; emit x:-; emit -:y; GT:GT; AG:AG; x1GT:y1GT; AGx2x3:AGy2y3 (…).
Examples (exon, intron):
GCATGCGTACAGTTG…GTCAGGAGAGCGAACTCGCA
GCCTGCGTACAGTTA…AGTACGAGAGCGAACTCGCA
GCATGCAGTACAGTTG…GTCAGGAGGCGAACTCGCA
GCCTGCAGTACAGTTA…AGTACGAGGCGAACTCGCA

28 Doublescan: States: match intergenic x:y; emit x:-; emit -:y; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; start:start; stop:stop; match intron x:y; emit x:-; emit -:y; GT:GT; AG:AG; x1GT:y1GT; AGx2x3:AGy2y3 (…); x1x2GT:y1y2GT; AGx3:AGy3 (…).
Example (exon, intron):
GCATGCAGGTACAGTTG…GTCAGGAGCGAACTCGCA
GCCTGCAGGTACAGTTA…AGTACGAGCGAACTCGCA
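
The three pairs of splice-site states on slides 26 to 28 correspond to the three positions at which an intron can interrupt a codon: after a complete codon, after one base, or after two bases. A minimal sketch of that bookkeeping follows; the function names and the toy sequence are hypothetical, and only the state labels are taken from the slides.

```python
# Intron phase -> (5' splice-site state, 3' splice-site state), labels as on the slides.
SPLICE_STATES = {
    0: ("GT:GT", "AG:AG"),
    1: ("x1GT:y1GT", "AGx2x3:AGy2y3"),
    2: ("x1x2GT:y1y2GT", "AGx3:AGy3"),
}

def intron_phase(coding_bases_before_intron):
    """Number of bases of the interrupted codon that lie before the intron."""
    return coding_bases_before_intron % 3

def splice(seq, intron_start, intron_end):
    """Remove the intron seq[intron_start:intron_end], requiring GT...AG boundaries."""
    intron = seq[intron_start:intron_end]
    if not (intron.startswith("GT") and intron.endswith("AG")):
        raise ValueError("intron must start with GT and end with AG")
    return seq[:intron_start] + seq[intron_end:]

# Toy gene, invented for illustration: ATG GCA T | GTAAGTTTCAG | GC TAA
gene = "ATGGCATGTAAGTTTCAGGCTAA"
phase = intron_phase(7)          # 7 coding bases precede the intron
print(phase, SPLICE_STATES[phase])
print(splice(gene, 7, 18))       # spliced coding sequence
```

Keeping separate states per phase lets the model carry the partial codon across the intron, so the reading frame of the match exon state is preserved on both sides.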

29 Doublescan: States: match intergenic x:y; emit x:-; emit -:y; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; start:start; stop:stop; match intron x:y; emit x:-; emit -:y; GT:GT; AG:AG; x1GT:y1GT; AGx2x3:AGy2y3 (…); x1x2GT:y1y2GT; AGx3:AGy3 (…). [Diagram: exon fusion between x and y]

30 Doublescan: States: match intergenic x:y; emit x:-; emit -:y; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; start:start; stop:stop; match intron x:y; emit x:-; emit -:y; GT:GT; AG:AG; x1GT:y1GT; AGx2x3:AGy2y3 (…); x1x2GT:y1y2GT; AGx3:AGy3 (…); emit x intron x:-; GT:-; AG:-; x1GT:-; AGx2x3:- (…); x1x2GT:-; AGx3:- (…); emit y intron -:y; -:GT; -:AG; -:y1GT; -:AGy2y3 (…); -:y1y2GT; -:AGy3 (…). Start and End are connected to all other states.

31 Refinements: Score all potential splice sites => distinguish between true and false splice sites by rescaling the nominal transition probabilities to the splice site states. States shown: match exon x1x2x3:y1y2y3; x1GT:y1GT; x1x2GT:y1y2GT; GT:GT. [Figure: score plotted along the sequences] cctgctgggtgcgagagccggcgtaccggtgaggcccctgctgggtg cgagagccggcgtaccggtg x y cctgctggaggcggtagcgtgcttagtggtgaggcccctgttgggcg cgagagccggtaaaccgctg
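
One way to picture the rescaling is sketched below. The slide does not say how the score is computed or how the rescaling is normalised, so everything here is an assumption: every GT dinucleotide is treated as a potential 5' splice site, each site is given a hypothetical score between 0 and 1, the nominal transition probability into the splice-site state is multiplied by that score, and the outgoing probabilities are renormalised. The function names and the nominal values are invented for illustration.

```python
def candidate_donor_sites(seq):
    """Positions of every GT dinucleotide, i.e. every potential 5' splice site."""
    seq = seq.lower()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]

def rescale_transitions(nominal, splice_state, site_score):
    """Scale the transition into splice_state by site_score, then renormalise.

    nominal: dict mapping next state -> nominal transition probability.
    site_score: hypothetical score in [0, 1]; 0 forbids the site, 1 keeps the nominal value.
    """
    scaled = dict(nominal)
    scaled[splice_state] = nominal[splice_state] * site_score
    total = sum(scaled.values())
    return {state: p / total for state, p in scaled.items()}

# Invented nominal probabilities out of the match exon state.
nominal = {"match exon": 0.9, "GT:GT": 0.1}
x = "cctgctgggtgcgagagccggcgtaccggtgaggcc"
for pos in candidate_donor_sites(x):
    # A real score would come from a splice-site model; 0.5 is a placeholder.
    print(pos, rescale_transitions(nominal, "GT:GT", 0.5))
```

The effect is that a low-scoring GT makes the path through the splice-site state less likely at that position, while a high-scoring one leaves the nominal model essentially unchanged.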

32 Refinements to Doublescan: Score all potential translation start sites => distinguish between true and false translation start sites by rescaling the nominal transition probabilities to the start:start state. States shown: match intergenic x:y; start:start; stop:stop. [Figure: score plotted along the sequences] cctgctggatgcggtagcgtgcttatgggtgaggcccctgttgggca tgagagccggtaaaccgctg y cgtgctggacgcatgagcgtgcttacgggtgatgcccctgtatggca ggagagccggtatggcgctg x

