Center for Biologisk Sekvensanalyse Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark


1 Center for Biologisk Sekvensanalyse
Nikolaj Blom
Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark
nikob@cbs.dtu.dk
"Gene Finding in Eukaryotic Genomes"
DTU course #27011, Spring 2004

2 Today's plan
13.00-13.30 Lecture on gene finding
13.30-14.00 Get notebooks (building 208; secretary)
  Pause
14.00-14.45 Gene finding exercise
14.45-15.00 Pause
15.00-15.15 Introduction to project

3 The cellular machinery recognizes genes without access to GenBank, SwissProt or computers. Can we?

4 Needles Hiding in Genome Haystacks...
Genes are embedded in the genome sequence
Coding regions constitute only 2% of the human genome
Can we distinguish the gene features from the background?

5 Can U spot 'Spot'?

6 Center for Biologisk Sekvensanalyse TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATGCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGC TGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCA TCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGT TAAGGCTGCGGTGAGCTGTGATTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCA TCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCCA TTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTAT ATATGCGTGTGTGTTGTGTGTGTTATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACA GCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTC GGGTGTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGG 
AGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGA TGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTTGTCAGGTTTTCACCCCATGCTCCTCCATTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGCTAGTCTGCTCTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCTTCCCGTCTTACTGGAAGACCA GCAGCATTTGACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGT CTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT TTT

8 AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAA GGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTG AACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTG AGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCA GCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAG CAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAG CAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGC TCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTT CATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATA TATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAA GGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTG GTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCC TGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTG AGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACA AAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTT AAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGT GAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCT CAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCT TTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCA 
AGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTT ATTCACTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGC TTTTTCTTATTCCCCCTTATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATAC AAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTT ATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAG ATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGC TTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAG CTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAA AATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGG ATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTT TTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTG GCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGC TCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAG TGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCG TATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAGATTAGGCAACTTTAACCTTCAACAGTGATCA TAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG Can U spot the Gin? Can U spot the Gene? Ooops

10 Needles Hiding in Genome Haystacks...
Intron-exon structure of genes
Large introns (average 3365 bp)
Small exons (average 145 bp)
Long genes (average 27 kb)

11 Manual Genefinding
Start codon: ATG
Stop codons: TAA, TAG, TGA
Donor splice site: ^GT[AG]AG
Acceptor splice site: [CT]AG^
>U70368 (950 bp)
351 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
401 GTGGTTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
451 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
501 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
551 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT
601 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA
651 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA
701 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG
751 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG
801 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT
851 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG
901 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG
951 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT
1001 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT
1051 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT
1101 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA
1151 GGTAAAARAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT
1201 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA
1251 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC
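The manual search for these four signals can be sketched in a few lines of Python. This is only an illustration using the standard `re` module: the function name `find_signals` and the toy sequence are hypothetical and not part of the lecture. The `^` on the slide marks the exon/intron boundary and is left out of the patterns.

```python
import re

# Signal patterns from the slide. "^" on the slide marks the exon/intron
# boundary and is not part of the pattern itself.
SIGNALS = {
    "start": "ATG",
    "stop": "TAA|TAG|TGA",
    "donor": "GT[AG]AG",      # intron begins at the G of GT
    "acceptor": "[CT]AG",     # intron ends after the G of AG
}


def find_signals(seq):
    """Return {signal name: [0-based start positions]} for every match."""
    seq = seq.upper()
    hits = {}
    for name, pattern in SIGNALS.items():
        # A lookahead keeps overlapping matches, which a plain search would skip.
        regex = re.compile("(?=(?:%s))" % pattern)
        hits[name] = [m.start() for m in regex.finditer(seq)]
    return hits


if __name__ == "__main__":
    toy = "CCATGGTAAGTCCCTAGGATGA"   # hypothetical toy sequence
    for name, positions in find_signals(toy).items():
        print(name, positions)
```

Even on a short sequence such patterns match many times, which is why a single signal on its own is weak evidence for a gene.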

12 Manual Genefinding
Start codon: ATG
P(ATG) = p(A) x p(T) x p(G) ~ 1/4 x 1/4 x 1/4 = 1/64
In 950 bp: 14.8 ATG expected; 16 observed
[sequence excerpt U70368, positions 351-1300, repeated from slide 11]
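The expected count on this slide can be reproduced with a short calculation. The sketch below assumes independent positions and equal base frequencies of 1/4, as the slide does; the function names are illustrative. It counts the 948 possible start positions of a 3-mer in 950 bp, which gives 14.8125, while dividing 950 by 64 gives about 14.84. Both round to the 14.8 quoted on the slide.

```python
from fractions import Fraction


def expected_motif_count(motif, seq_length, base_prob=Fraction(1, 4)):
    """Expected occurrences of `motif` in a random sequence of `seq_length`
    bases, assuming independent positions and equal base frequencies."""
    windows = seq_length - len(motif) + 1      # possible start positions
    return windows * base_prob ** len(motif)


def count_motif(seq, motif):
    """Count occurrences of `motif` in `seq`, including overlapping ones."""
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)


if __name__ == "__main__":
    expected = expected_motif_count("ATG", 950)
    print("P(ATG) =", Fraction(1, 4) ** 3)
    print("expected ATG in 950 bp = %.1f" % float(expected))
```

`count_motif` can be applied to the sequence excerpt to obtain the observed count for comparison.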

13 Manual Genefinding
Start codon: ATG
Stop codons: TAA, TAG, TGA
[sequence excerpt U70368, positions 351-1300, repeated from slide 11]

15 Genes and Signals

17 Manual Genefinding
Start codon: ATG
Stop codons: TAA, TAG, TGA
Donor splice site: ^GT[AG]AG
Acceptor splice site: [CT]AG^
[sequence excerpt U70368, positions 351-1300, repeated from slide 11]

19 Gene Features
Codon frequency/bias (organism dependent)
Hexamer statistics
Transcriptional: promoters/enhancers
Exon/introns: length distributions, ORFs
Splicing: donor/acceptor sites, branchpoints
Translational: start codon context
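Two of the features listed here, codon frequency and hexamer statistics, are simple counts over a sequence. The following is a minimal sketch under the assumption that the input to `codon_frequencies` is already an in-frame coding sequence; both function names are illustrative and do not come from any gene-finding package.

```python
from collections import Counter


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3           # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def hexamer_counts(seq):
    """Count every overlapping 6-mer in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + 6] for i in range(len(seq) - 5))
```

Because codon usage is organism dependent, such tables would have to be built from known genes of the organism being analysed.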

20 Gene Prediction
Eukaryotic gene prediction
Prediction relies on integration of several gene features
Each gene feature carries a low signal (e.g. ATG, splice sites, etc.)
Combinatorial explosion
Some are mutually exclusive (e.g. reading frame)
Sensor-based HMMs are well suited for gene prediction
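The reading-frame constraint mentioned above can be made concrete with a small open reading frame (ORF) scan. This is a simplified sketch, not how any particular gene finder works: it looks only at the forward strand, ignores introns, and the name `find_orfs` is illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (frame, start, end) for ATG...stop ORFs on the forward strand.

    Positions are 0-based; `end` is exclusive and includes the stop codon.
    Only the first ATG after the previous stop in each frame opens an ORF.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs
```

ORFs reported in different frames over the same region are competing hypotheses, which is one reason the individual signals have to be combined rather than scored separately.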

21 Gene Prediction
Take-home messages
The human genome sequence is known
The number of human genes is unknown!
  Before 2001: est. 30,000-140,000
  As of 2003: 25,000-40,000
Why? Because gene structure prediction is hard!
The location, structure and function of many human genes are unknown!
Genes may be discovered by different means and methods...

22 The End

24 Sensor-based methods
Similarity searches miss some/many genes
cDNA/EST libraries are not perfect
Ab initio gene finders
  HMM-based: GenScan, HMMgene
  Neural network-based: GRAIL, NetGene2 (splice sites)

25 Gene Prediction
"Isolated" methods
  Predict individual features, e.g. splice sites, coding regions
  NetGene2 (neural network): http://www.cbs.dtu.dk/services/NetGene2/
"Integrated" methods
  Predict genes in context
  "Grammar" of genes: certain elements in a specific order are required
  HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
  GenScan (HMM-based): http://genes.mit.edu/GENSCAN.html

26 Gene Grammar
HAPPYEUGENEAWASGUYFINDER
Isolated features

27 Gene Grammar
HAPPYEUGENEAWASGUYFINDER
Isolated features: Intron, 3'UTR, Exon, Promoter, Exon, RBS

28 Gene Grammar
EUGENEFINDERWASAHAPPYGUY
Integrated features
HAPPYEUGENEAWASGUYFINDER

29 Gene Grammar
EUGENEFINDERWASAHAPPYGUY
Integrated features: Prom -> RBS -> Exon -> Intron -> Exon -> 3'UTR

30 Gene Grammar
"Isolated" methods (e.g. NN): HAPPYEUGENEAWASGUYFINDER
"Integrated" methods (e.g. HMM): EUGENEFINDERWASAHAPPYGUY
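The idea of a gene "grammar" can be illustrated by checking whether a list of predicted features appears in an allowed order. The sketch below is a toy: the one-letter codes, the name `follows_grammar`, and the choice to allow any number of additional intron/exon pairs are assumptions made for this example, not rules stated in the lecture.

```python
import re

# Feature order from the slide: Prom -> RBS -> Exon -> Intron -> Exon -> 3'UTR.
# Allowing repeated (Intron, Exon) pairs is an assumption of this sketch.
GENE_GRAMMAR = re.compile(r"^P R E( I E)* U$")

CODES = {"Prom": "P", "RBS": "R", "Exon": "E", "Intron": "I", "3'UTR": "U"}


def follows_grammar(features):
    """True if the feature list is in an order the toy grammar accepts."""
    encoded = " ".join(CODES[f] for f in features)
    return GENE_GRAMMAR.match(encoded) is not None
```

An "isolated" method may find every feature correctly and still report them in an order that no real gene could have; an "integrated" method only considers orders that the grammar accepts.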

31 HMMs for genefinding
GenScan principle
E = exon, I = intron, F = 5' UTR, T = 3' UTR, P = promoter, N = intergenic
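How an HMM labels each base with a state can be shown with a toy Viterbi decoder. This is not the GenScan model: it uses only two of the state labels (E and I), and every probability below is invented for illustration, with exons made slightly GC-rich and introns slightly AT-rich.

```python
import math

STATES = ("E", "I")   # exon, intron (two of the state labels on the slide)

# All probabilities below are invented for illustration only.
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1},
         "I": {"E": 0.1, "I": 0.9}}
EMIT = {"E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},   # GC-rich
        "I": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}   # AT-rich


def viterbi(seq):
    """Return the most probable state path for `seq` as a string."""
    if not seq:
        return ""
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

A real gene-finding HMM has many more states, including the UTR, promoter and intergenic states listed above, and models splice signals and reading frame explicitly, but the decoding step follows the same principle.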

32 Genscan
http://genes.mit.edu/GENSCAN.html

37 HMMgene
http://www.cbs.dtu.dk/services/HMMgene/

38 Defining the term 'exon'
Gene prediction programs often use "exon" to mean CDS (coding sequence)
Real exons may contain 5' or 3' UTRs (untranslated regions)
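The difference between a transcript exon and the CDS portion that prediction programs report can be sketched as an interval clipping step. The function name `cds_segments` and the coordinates in the usage note are hypothetical; the sketch assumes 1-based inclusive coordinates on the forward strand.

```python
def cds_segments(exons, cds_start, cds_end):
    """Clip exon intervals to the coding region.

    `exons` is a list of (start, end) pairs, 1-based and inclusive, on the
    forward strand; `cds_start`/`cds_end` are the first base of the start
    codon and the last base of the stop codon. Exons lying entirely in a
    UTR are dropped.
    """
    segments = []
    for start, end in exons:
        clipped_start = max(start, cds_start)
        clipped_end = min(end, cds_end)
        if clipped_start <= clipped_end:
            segments.append((clipped_start, clipped_end))
    return segments
```

For example, with made-up exons at 1..100, 201..300 and 401..500 and a coding region from 51 to 450, the first and last exons are shortened to 51..100 and 401..450, so a predicted "exon" boundary need not coincide with the true exon boundary.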

39 Gene Prediction – NetGene2

43 Gene Prediction Exercises
Main website: http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/genefindingprojects/
Exercise: genefinding_exercise.htm

44 Gene Prediction Exercise
Sequence         GenBank      Genscan            HMMgene            NetGene2
Seq#1 (HoxA10)   320..1226    320 1226 0.871     320 1226 0.744     Donor 1227 0.95H
                 2401..2675   2401 2675 0.988    2401 2675 0.971    Acc. 2400 1.00H
Seq#2 (Dub-2)    398..425     -                  398 425 0.418      Donor 426 0.87
                 1208..2817   1208 2817 0.800    1208 2817 0.735    Acc. 1207 0.42
                                                                    Acc. 1210 0.71
http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/exercises/gf_exercise_solution.html
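The comparison in this table can be summarised by counting how many GenBank exons each program reproduces with both boundaries exact. The coordinates below are taken from the table; the function names and the choice of exact-match scoring are assumptions of this sketch.

```python
GENBANK = {
    "HoxA10": [(320, 1226), (2401, 2675)],
    "Dub-2": [(398, 425), (1208, 2817)],
}
PREDICTIONS = {
    "Genscan": {"HoxA10": [(320, 1226), (2401, 2675)],
                "Dub-2": [(1208, 2817)]},
    "HMMgene": {"HoxA10": [(320, 1226), (2401, 2675)],
                "Dub-2": [(398, 425), (1208, 2817)]},
}


def exon_sensitivity(annotated, predicted):
    """Fraction of annotated exons predicted with both boundaries exact."""
    found = set(annotated) & set(predicted)
    return len(found) / len(annotated)


def exon_specificity(annotated, predicted):
    """Fraction of predicted exons that exactly match an annotated exon."""
    if not predicted:
        return 0.0
    correct = set(annotated) & set(predicted)
    return len(correct) / len(predicted)


if __name__ == "__main__":
    for program, by_gene in PREDICTIONS.items():
        for gene, predicted in by_gene.items():
            print(program, gene,
                  exon_sensitivity(GENBANK[gene], predicted),
                  exon_specificity(GENBANK[gene], predicted))
```

On these two sequences the only miss is the short first exon of Dub-2 (398..425), which Genscan does not report and HMMgene reports with a low score of 0.418.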

46 Project
Main website: http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/genefindingprojects/
Project description: genefindingintro.htm

48 NIX – Visualizing Gene Predictions
http://www.hgmp.mrc.ac.uk/NIX/
No method is always best!

49 Exon definition model

