Comparative gene hunting Irmtraud Meyer The Sanger Institute now University of Oxford

Comparative gene hunting. Irmtraud Meyer, The Sanger Institute (now University of Oxford), meyer@stats.ox.ac.uk

Making sense of the genome: What are the proteins and where are they encoded?
[Diagram labels: Experiments in lab; Sequence database (proteins, ESTs); protein; DNA: cctgctgggtgcgagagccggcgtaccggtgaggcc]

Aim in ab initio gene prediction:
[Diagram labels: protein; Sequence database (proteins); Experiments in lab; DNA: cctgctgggtgcgagagccggcgtaccggtgaggcc]

3059 million bases: GCTGCCAACGC… We will very soon have: 3286 million bases: ACTGCGGGCGC…

Rough comparative map: reference: www.ensembl.org

Typical situation: two unannotated sequences (slide labels: Human DNA, Mouse DNA). What do they encode?

...gagccgcctcctccccttccccacgctctaggagggggccgcgggggcctggct
gcgtcggccaatcggagtgcacttccgcagctgacaaattcagtataaaagcttggggct
ggggccgagcactggggactttgagggtggccaggccagcgtaggaggccagcgtaggat
cctgctgggagcggggaactgagggaagcgacgccgagaaagcaggcgtaccacggaggg
agagaaaagctccggaagcccagcagcgcctttacgcacagctgccaactggccgctgcc
gaccgtctccagctcccgaggacgcgcgaccggacaccgggtcctgccacagccgaggac
agctcgccgctcgccgcagcgagcccggggcggcccttcagggggacctttcccagatcg
Cccaggccgcccggatgtgcacgaaaatggaacag...

...ggcgacgggggctcgggaagcctgacagggcttttgcgcacagctgccggctgg
tgctacccgcccgcgccagcccccgagaacgcgcgaccaggcacccagtccggtcaccgc
agcggagagctcgccgctcgctgcagcgaggcccggagcggccccgcagggaccctcccc
agaccgcctgggccgcccggatgtgcactaaaatggaacagcccttctaccacgacgact
catacacagctacgggatacggccgggcccctggtggcctctctctacacgactacaaac
tcctgaaaccgagcctggcggtcaacctggccgacccctaccggagtctcaaagcgcctg
Gggctcgcggacccggcccagagggcggcggtggcggcagctacttttc...

Similar problem: demotic, Greek, hieroglyphs

Aim in comparative ab initio gene prediction: annotate DNA x and DNA y simultaneously.
Input:
x: cctgctgggtgcgagagccggcgtaccggtgaggcc
y: cctgctgggagcgaaagcaggcgtaccacggaggg
Output: annotation of x and y (?)

Why is this a good idea?
IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY
KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ
Advantages:
can detect new genes, as there is no need to search in databases for proteins;
fewer assumptions needed than in one-strand ab initio gene-prediction methods, i.e. can detect unusual genes.
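The two letter strings on this slide can be compared position by position to show why a second sequence helps: the conserved stretches stand out from the surrounding noise. The following is a minimal sketch, not part of Doublescan; the function name `conserved_runs` and the threshold `min_len` are assumptions made for illustration.

```python
def conserved_runs(a: str, b: str, min_len: int = 4):
    """Return (start, text) for each run of identical letters at the same positions."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length for a position-wise comparison")
    runs, start = [], None
    # Iterate one step past the end so that a run reaching the last letter is closed.
    for i in range(len(a) + 1):
        same = i < len(a) and a[i] == b[i]
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                runs.append((start, a[start:i]))
            start = None
    return runs


x = "IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY"
y = "KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ"
runs = conserved_runs(x, y)
print(runs)
```

With the default threshold the words THIS and HIDDEN are recovered, along with one chance run (LKDA). This is a reminder that conservation alone is a signal rather than a proof, which is why a model of gene structure is used on top of the alignment.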

Mouse – human comparison: 3059 million bases versus 3286 million bases; about 30 000 (?) genes

Analysing mouse and human DNA:
Training: adjust the parameters of Doublescan with a set of known pairs of orthologous mouse and human genes.
Testing: test set of 80 pairs of known mouse and human genes, of which
55 %: same number of exons, different coding length;
42 %: same number of exons, same coding length;
3 %: different number of exons, different coding length.

Results - Performance: [Figure: annotation compared with prediction; categories: correct, overlapping, missing, wrong]
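The four categories in the figure can be made concrete with a small sketch. The slide does not define them, so the definitions below are assumptions: the categories are taken to apply to exons, "correct" means both boundaries agree with an annotated exon, "overlapping" means a prediction overlaps an annotated exon without matching it exactly, "wrong" means a prediction overlaps nothing, and "missing" means an annotated exon is overlapped by no prediction. The function name and the coordinates are hypothetical.

```python
def classify_exons(annotated, predicted):
    """Count exon categories; exons are (start, end) tuples with inclusive coordinates."""

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    counts = {"correct": 0, "overlapping": 0, "wrong": 0, "missing": 0}
    annotated_set = set(annotated)
    for p in predicted:
        if p in annotated_set:
            counts["correct"] += 1          # both boundaries agree
        elif any(overlaps(p, a) for a in annotated):
            counts["overlapping"] += 1      # overlaps, but a boundary differs
        else:
            counts["wrong"] += 1            # no annotated exon here
    for a in annotated:
        if not any(overlaps(a, p) for p in predicted):
            counts["missing"] += 1          # annotated exon with no prediction
    return counts


# Hypothetical coordinates, for illustration only.
annotation = [(10, 50), (100, 150), (200, 260), (300, 340)]
prediction = [(10, 50), (105, 150), (400, 420)]
counts = classify_exons(annotation, prediction)
print(counts)
```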

C. elegans – C. briggsae
C. elegans: sequenced in 1998; 97 million bases; 5 autosomes, one X; about 20 000 genes.
C. briggsae: around 100 million bases; 5 autosomes, one X.

Results - Performance: [Figure: annotation compared with prediction; categories: correct, overlapping, missing, wrong]

Summary: Doublescan
predicts the gene structures of both sequences at the same time as aligning the sequences;
is capable of predicting partial, complete and multiple genes or no genes at all, as well as more diverged pairs of genes which are related by events of exon-fusion or exon-splitting;
can be used to analyse long sequences using the Stepping Stone algorithm (same performance as the Hirschberg algorithm);
is a general concept: it can be trained to analyse other pairs of related genomes;
shows very promising performance on mouse – human DNA and C. elegans – C. briggsae DNA.

To do list:
large-scale mouse – human comparison;
large-scale C. elegans – C. briggsae comparison;
search for regulatory regions. [Diagram: sequences x and y]

References:
www.sanger.ac.uk/Software/analysis/doublescan
I. M. Meyer and R. Durbin, Bioinformatics, 2002, 18(10), pp. 1309-

Acknowledgements: Richard Durbin; Sequencing centres; Trinity College, Cambridge; Wellcome Trust; The Sanger Centre

The method: What are pair hidden Markov models? How can they be used to find genes?

Pair HMMs:
Idea: annotate the two sequences by parsing them through connected states. Each state reads a fixed number of letters from one or two of the sequences.
[Diagram: DNA x, DNA y]

Pair HMMs:
Idea: annotate the two sequences by parsing them through connected states. Each state reads a fixed number of letters from one or two of the sequences.
States shown: start state; match intergenic (reads 1 letter from each sequence at a time); match exon; match intron.
[Diagram: DNA x, DNA y]

Pair HMMs:
Idea: annotate the two sequences by parsing them through connected states. Each state reads a fixed number of letters from one or two of the sequences.
States shown: match intergenic; match exon; match intron.
[Diagram: DNA x, DNA y; legend: a state, a transition]
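The statement that each state reads a fixed number of letters from one or two of the sequences can be written down as a small data structure. This is a sketch under assumptions: the read lengths follow the state labels on the slides (x:y for match intergenic and match intron, x1x2x3:y1y2y3 for match exon), while the class name `State`, the helper `consumed` and the example path are hypothetical, and transition and emission probabilities are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    name: str
    reads_x: int  # letters read from sequence x on each visit
    reads_y: int  # letters read from sequence y on each visit


STATES = {
    s.name: s
    for s in [
        State("match intergenic", 1, 1),
        State("match exon", 3, 3),    # one codon from each sequence
        State("match intron", 1, 1),
    ]
}


def consumed(path):
    """Total letters read from x and from y along a state path."""
    dx = sum(STATES[name].reads_x for name in path)
    dy = sum(STATES[name].reads_y for name in path)
    return dx, dy


# Hypothetical state path: 4 intergenic columns, 2 codons, 5 intron columns.
path = ["match intergenic"] * 4 + ["match exon"] * 2 + ["match intron"] * 5
print(consumed(path))
```

A state path is a valid parse of a pair of sequences only if `consumed(path)` equals their two lengths, which is what ties the annotation (the state labels) to the alignment (which letters are read together).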

Doublescan:
States: match intergenic x:y; emit x:-; emit -:y
Input:
x: ACGTCGACATGGCCTATCCGCTGAGCT
y: ACGTCGGGCCTCTCCGCTAAGCT
Aligned:
x: ACGTCGACATGGCCTATCCGCTGAGCT
y: ACGTCG----GGCCTCTCCGCTAAGCT
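The correspondence between the alignment on this slide and the three intergenic states can be spelled out by reading the alignment column by column. This is a sketch of that bookkeeping only, not of how Doublescan finds the alignment; the function names are assumptions.

```python
from itertools import groupby


def state_path(aligned_x, aligned_y):
    """Map each alignment column to the intergenic state that would read it."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have equal length")
    path = []
    for a, b in zip(aligned_x, aligned_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if b == "-":
            path.append("emit x:-")              # letter in x only
        elif a == "-":
            path.append("emit -:y")              # letter in y only
        else:
            path.append("match intergenic x:y")  # one letter from each
    return path


def run_lengths(path):
    """Collapse consecutive visits to the same state into (state, count) pairs."""
    return [(state, len(list(group))) for state, group in groupby(path)]


x = "ACGTCGACATGGCCTATCCGCTGAGCT"
y = "ACGTCG----GGCCTCTCCGCTAAGCT"
summary = run_lengths(state_path(x, y))
print(summary)
```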

Doublescan:
Intergenic states: match intergenic x:y; emit x:-; emit -:y
Exon states: match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3
Input:
x: CAAGCATGCGACAAAGGATACAGCGACCTC
y: CAAGCCTGCGGATACAGCGAACTC
Aligned:
x: CAAGCATGCGACAAAGGATACAGCGACCTC
y: CAAGCCTGC------GGATACAGCGAACTC
Annotations: same amino acid (alanine); insertion of two codons; similar amino acids (aspartic acid, glutamic acid)
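The codon-level reading of this alignment can be checked with the standard genetic code. This is a minimal sketch, assuming the alignment is in frame from its first column; the function name `codon_columns` and the label wording are assumptions, and the check for "different" amino acids does not model the amino-acid similarity mentioned on the slide.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_columns(aligned_x, aligned_y):
    """Label each aligned codon column of two in-frame coding sequences."""
    labels = []
    for i in range(0, len(aligned_x), 3):
        cx, cy = aligned_x[i:i + 3], aligned_y[i:i + 3]
        if cy == "---":
            labels.append("emit x1x2x3:-")     # codon present in x only
        elif cx == "---":
            labels.append("emit -:y1y2y3")     # codon present in y only
        elif cx == cy:
            labels.append("identical codon")
        elif CODON_TABLE[cx] == CODON_TABLE[cy]:
            labels.append(f"same amino acid ({CODON_TABLE[cx]})")
        else:
            labels.append(f"different amino acids ({CODON_TABLE[cx]}/{CODON_TABLE[cy]})")
    return labels


x = "CAAGCATGCGACAAAGGATACAGCGACCTC"
y = "CAAGCCTGC------GGATACAGCGAACTC"
labels = codon_columns(x, y)
for label in labels:
    print(label)
```

In one-letter code A is alanine, D aspartic acid and E glutamic acid, so the labels reproduce the three annotations on the slide: GCA and GCC both encode alanine, two codons of x have no partner in y, and GAC against GAA is an aspartic acid to glutamic acid change.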

Doublescan:
Intergenic states: match intergenic x:y; emit x:-; emit -:y
Exon states: match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3
Added: start:start (start codon); stop:stop (stop codon)
[Diagram: sequences x and y]

Doublescan:
Intergenic states: match intergenic x:y; emit x:-; emit -:y
Exon states: start:start; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; stop:stop
Intron states: match intron x:y; emit x:-; emit -:y
Splice site states: GT:GT (5' splice site); AG:AG (3' splice site)
Example (exon, intron):
GCATGCGTACAGTTG…GTCAGGAGAGCGAACTCGCA
GCCTGCGTACAGTTA…AGTACGAGAGCGAACTCGCA

Doublescan:
Intergenic states: match intergenic x:y; emit x:-; emit -:y
Exon states: start:start; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; stop:stop
Intron states: match intron x:y; emit x:-; emit -:y
Splice site states: GT:GT, AG:AG; added: x1GT:y1GT, AGx2x3:AGy2y3 (…)
Examples (exon, intron):
GCATGCGTACAGTTG…GTCAGGAGAGCGAACTCGCA
GCCTGCGTACAGTTA…AGTACGAGAGCGAACTCGCA
GCATGCAGTACAGTTG…GTCAGGAGGCGAACTCGCA
GCCTGCAGTACAGTTA…AGTACGAGGCGAACTCGCA

Doublescan:
Intergenic states: match intergenic x:y; emit x:-; emit -:y
Exon states: start:start; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; stop:stop
Intron states: match intron x:y; emit x:-; emit -:y
Splice site states: GT:GT, AG:AG; x1GT:y1GT, AGx2x3:AGy2y3 (…); added: x1x2GT:y1y2GT, AGx3:AGy3 (…)
Example (exon, intron):
GCATGCAGGTACAGTTG…GTCAGGAGCGAACTCGCA
GCCTGCAGGTACAGTTA…AGTACGAGCGAACTCGCA
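The three pairs of splice site states differ in how many codon letters they read on either side of the intron, so that a codon interrupted by an intron is still read as three letters in total. The sketch below illustrates this for a single sequence; it assumes the exon before the intron starts in frame, and the function name `describe_intron` and the toy sequences are hypothetical. Only the state names are taken from the slides.

```python
# State names as on the slides, indexed by the number of codon letters
# read before the intron (the intron phase).
SPLICE_STATES = {
    0: ("GT:GT", "AG:AG"),
    1: ("x1GT:y1GT", "AGx2x3:AGy2y3"),
    2: ("x1x2GT:y1y2GT", "AGx3:AGy3"),
}


def describe_intron(exon1, intron, exon2):
    """Return (phase, 5' state, 3' state, codon split by the intron).

    Assumes exon1 starts in frame, i.e. its first letter is a first codon position.
    """
    if not (intron.startswith("GT") and intron.endswith("AG")):
        raise ValueError("intron must start with GT and end with AG")
    phase = len(exon1) % 3
    donor, acceptor = SPLICE_STATES[phase]
    # Letters of the interrupted codon: `phase` letters before the intron and
    # the remaining letters after it (none at all when phase is 0).
    split_codon = exon1[len(exon1) - phase:] + exon2[:(3 - phase) % 3]
    return phase, donor, acceptor, split_codon


# Toy gene: ATG GCC AAG TAA, interrupted after the first letter of AAG.
result = describe_intron("ATGGCCA", "GTAAGTTTTCAG", "AGTAA")
print(result)
```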

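Doublescan:
Intergenic states: match intergenic x:y; emit x:-; emit -:y
Exon states: start:start; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; stop:stop
Intron states: match intron x:y; emit x:-; emit -:y
Splice site states: GT:GT, AG:AG; x1GT:y1GT, AGx2x3:AGy2y3 (…); x1x2GT:y1y2GT, AGx3:AGy3 (…)
[Diagram: two pairs of sequences x and y illustrating exon fusion]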

Doublescan:
Intergenic states: match intergenic x:y; emit x:-; emit -:y
Exon states: start:start; match exon x1x2x3:y1y2y3; emit x1x2x3:-; emit -:y1y2y3; stop:stop
Intron states: match intron x:y; emit x:-; emit -:y
Splice site states in both sequences: GT:GT, AG:AG; x1GT:y1GT, AGx2x3:AGy2y3 (…); x1x2GT:y1y2GT, AGx3:AGy3 (…)
Intron in y only: -:GT, -:AG; -:y1GT, -:AGy2y3 (…); -:y1y2GT, -:AGy3 (…); emit y intron -:y
Intron in x only: GT:-, AG:-; x1GT:-, AGx2x3:- (…); x1x2GT:-, AGx3:- (…); emit x intron x:-
The Start and End states are connected to all other states.

Refinements:
Score all potential splice sites => distinguish between true and false splice sites by rescaling the nominal transition probabilities to the splice site states.
States involved: match exon x1x2x3:y1y2y3; x1GT:y1GT; x1x2GT:y1y2GT; GT:GT
[Diagram: score tracks along both sequences]
x: cctgctgggtgcgagagccggcgtaccggtgaggcccctgctgggtg cgagagccggcgtaccggtg
y: cctgctggaggcggtagcgtgcttagtggtgaggcccctgttgggcg cgagagccggtaaaccgctg
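The refinement can be pictured as two steps: list every candidate splice site, then scale the nominal transition probability into the splice site state by a site-specific score. The slide does not say how sites are scored or how the rescaling is done, so everything below is an assumption made for illustration: `toy_site_score` is a hypothetical stand-in for a real splice site model, the rescaling is a plain product, and no renormalisation of the remaining transitions is attempted.

```python
def candidate_sites(seq, motif="GT"):
    """Positions (0-based) of every occurrence of the motif in seq."""
    seq = seq.upper()
    n = len(motif)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == motif]


def toy_site_score(seq, pos):
    """Hypothetical score in (0, 1]; a real model would look at the site context."""
    following = seq[pos + 2:pos + 3].upper()
    return 1.0 if following == "A" else 0.1


def rescaled_transitions(seq, nominal_prob, score=toy_site_score):
    """Position-specific transition probability into a 5' splice site state."""
    return {pos: nominal_prob * score(seq, pos) for pos in candidate_sites(seq)}


# First fragment of a sequence shown on the slide; nominal_prob is a made-up value.
x_fragment = "cctgctgggtgcgagagccggcgtaccggtgaggcccctgctgggtg"
probs = rescaled_transitions(x_fragment, nominal_prob=0.01)
for pos, p in sorted(probs.items()):
    print(pos, x_fragment[pos:pos + 3], p)
```

The same enumeration works for 3' splice sites with `motif="AG"`. The design point is that most GT dinucleotides are not splice sites, so a single nominal probability for all of them cannot separate true from false candidates.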

Refinements to Doublescan:
Score all potential translation start sites => distinguish between true and false translation start sites by rescaling the nominal transition probabilities to the start:start state.
States involved: match intergenic x:y; start:start; stop:stop
[Diagram: score tracks along both sequences]
y: cctgctggatgcggtagcgtgcttatgggtgaggcccctgttgggca tgagagccggtaaaccgctg
x: cgtgctggacgcatgagcgtgcttacgggtgatgcccctgtatggca ggagagccggtatggcgctg
