What does the genome say? Sebastian Reyes, Genome Center, UC Davis

1 What does the genome say? Sebastian Reyes, Genome Center, UC Davis

2 Outline
– What’s genome annotation?
– Why do we need it?
– What’s an annotation pipeline?
– Repeat prediction
– mRNA prediction
– ncRNA prediction
– Annotating my genome
– What are the maker.ctl files?
– What’s a GFF file?

3 AGGTCCTACCGGGAGTCG AACCCAGGTCGCTGGATT CAAAGTCCAGAGTGCTAA CCACTACACCATAGAACC Tomato Glycine tRNA

4 >SL2.40ch02: aggtagatcgatatgtaatctgcatatttattcaggaaaaagacatgaattgccccctcaacttgtaccaaaaagtcttttacacaattttattaatggggcaacctattacacacattaagtacccctttcgtcacccaacccaaaacattatgcatagagtgagagtcactcgcctctagcctctaactgaataaattgatgcctcgacccgattcatttgatccaaacccgt ttattttcacctaggttcatttttttcaatccaaatttagtttattttaccctagtccatttatttcacgcacccatttcgtttcgctaagaacaccaagctctcgagggagagagaatactaatgaggcgttcatatgtgtgtaacaaaataacacttataacgagcgtaatttatgtgtctgatttaatcaacaaagttatttttcaatatttttgttctttatcttaattta tgtgaccctttttttaaaatttatattaaaaaatatacctattgaaaatatactttattttgaacttcttcttttttttcttagtccttttacaggtatacatagtaacgtgtcaattgtcattctctatatttgaatatcaatttcaatttctctattttataccttttttcattttgatttttttaattatttttctaaataatcatcatgtaatataaaaatttatgtattt caaaatgtcgatgtaatatgaatttcataaagtaaaataagataaatattacatcatagttaaagaaaaactctattttcaagacggttatgttgctcctataaaactaaagtttgggattatatataaaacttaaaagagaaagaaaagtttaattaattgagtactgtataggtaccactacaaacatgaaagaacaacccaaaaaaaagaagaaaaagaaagttacaacaca aaaaagaaacgttattgatcatcaactgatctctctctctagataataataatataataaatatatggattacaatttcttcctcccccatggcgtagtatttccatcgaatagctcatcaatgtctgtttcaatttcttcaggttggtactttacatacttctccgtctgccttcccggagtaaatgtttgactaaaccttcccaagaatgacctgtacgctggtataaccatg tttgatatggaaactctcagctccgattgcaattgctcgtcgctaatcacccaactgctctgcgtcttgtgtatctcatcaaacaatgcattgaaactcttgaacctctcttttaatattggtttattcactttcccattcacattcaatccctcgtggtttaaacattgcaacaatttaccccacgtttctctttggtaattcttgtggtactgtcttagatctgaagatcgct ttctgtaccattgatcccccatcaggctattcatctctggagatccttttattttttgcaagatgtatcgtccgttattcatcatgaatattgaactgagtgaagtgtctttgtaaagccttgattttccctccagatttgaatccaatagatccatcacttttatcatatgcgtctcaaacggtgatggtttattcgatgacgtattattattattatgattattattattatt attattattattactactctgacccgcttgtggattttgacaatcaaaatccgatcccgtggctgagtcagctctttctatcatttgatgttctctgaatacctgctctaacgtgtctctgtactcccctacatatttcagataattcatgatgtaccgagttaaaggatgaacagcaccacctggaaccgcggtcttgtttgagtccccctggattgagttttccagctccgaa 
aagatagaaaccatggattcccctaagcggccccgactcaatgtagcttcagccttgagttcatcagcgtaagaaactggaaatatcttatccaccaaagggataaaatcccgcaacgtctcataaatgtcgagaaacttgaagagtttttcagcagcccgttttgtcatggaaacggcttcagcgaaattaaggagttggatcatcatacctcgagacaaattgctgaagatgg tctcagaaattgatggttgatcctcaaacaccgcatccgccagcttgcgttcactggagaagagcacattagtgcagtgcctgaatgttgcaatccaagcagtaacctctctctccaatggttcccaattcatcttctgcacatcctcgatgctatatttttcaaagccaagcttatgcaagctttcctccaaagctttcctccgtgcgatgaaatagacctgacagcattctgc ttcatagcctcctgcaattaaggctttggaaaatttattcaatgtggctacgatttcctcagagtagcctggaaatttattttcttcagaaggtttagattcagaagcagtgtcttgttctgcatcttcattggtgtctgatgatgaagaattagggttaggcttggcagaagaagtgtctaaattggtgatatctgaatcagtattgatcttgtaatcgtagaggattgacttg tattcttcctctatgtaggacattgctcgttgaagaacaccatcgacgcggctgattgaataagcatatttgtactctgaagaaaatcgacagagggaggtaaatagcttggagattcgatctacaatatttaaaaatgatgtggcttcctcctgagctagctggctccatttgacgggtgcatcaccgccatcatattcatcaattttcgcttcaacaagaacagcaaattgtt ctacaaacacaggaacatcgggaggcttagattcatcatcatcccctttaaagtttgatgattcagagataaattgatctatttcttctgaaaccttatcgagatcaggaggcaggggagaagaaacttcaaccttgacctcatcgtcgtcttcaggtttgatatcgtctgtcttggcaccatcatcctctgttgtgttgatgactctatcgtctgtttgcagttgttcaacaat ttcttcttctgtagactttgtttcatgggcaggaggagtttcatctggctgtttgtcattggattttgtttgatcagaagaagtttgtgttagatccatgattgtaattgtgtgagtgagtaggtaagtttagcaggaaaattgagcttacattatcaaaaagagtaaggaaagtgttgtgattggagatgcagcgagcaaatcgacgagaaaggaaatgtgtgggatcacatta attagttagttacctttaacgtccactcttgtccacgtctaatattctcaatgccataaacttgcacgaatcaaggacggcctctcccactcactactcaccatctccataaccatctcatccaaaatatattactacttccattttagtatgtcaacataaaattgacttaattttttaatttttgagattctcaatcttttagcttatataacatcaaacttgtggtcactgt gttatgaaacaaatgattatgggaaatattgaaaaagcaaaataaagttgtagttataacattttagtaatttgtccaatattttttaaaaaaaaattataatgattttatcaaaattatttatgagatgacataaattagggatagggaagaagaaaggaggaggaaggtagaggaaacaagtgatcctattctccatgtatgtatttaagtgtagacaggcaggcccgaccca 
tccgatccaatgtgaattattaaccctaggtaggtattggtaaagcttgagatgcgaatgagaatgtgatcaatgggtaaaccaactacccctcatcttcctccaggttccaacctacatactttattaataaatgttttgggtttttactcccaagctcaggattatgtttgcatacatctgctgcctgacattctttttgctacatgctctcacgaaacttctattatttatc taggttatcaacttcaatcacattagtagtattaaaggaagcttgaatttgttatgcatccccaattacatgcaaatattgctaatcatttttcctatttttaattactctgcaccttccagataatgctatcaaacatagctcactgctggagcccacacagcaagtactgatatcaaaacatgattctcctccgagaagcaaccttccaactttttcttaccgggatattgca accgccacaaacaatttcaggcgacaatccattattggggaaggtggctttgggccagtattcaaagggaaacttaacacgaatcaggtgattttcactaaatgaagtgtgttttggtcatttcaaatttatcatccttgaatttctataaccattcaggttgtggctgttaagaagctaaatcattccggtcttcaaggggataaggagttctttgtagaggttcacatgctct cactgatgcggcaccctaacctggttaatctaattggttactgctctgaaggagagcagcgacttcttatctatgaattcatgcctctaggatctttggaatatcatctccatggtaatgctcattcactattctcaccttagaaataaattcttatttgtcaacctgcaccaatctgttatgcttcagagctctacattcatgcattatttgctataaacttttggatcgtctt tgatagcttaactcatttagagcccgacctaataaaataaatagcttttaagtttttaagtcaaaaaaataaaaagtaggagagacctactcttctcttttttaaaagtttattttaagttatttttaaccttgtcaaacactttcagaagctaaaagtgacttcaaagtaagtttgatcaaacttttaagtcccttgaagatatgaagttactattagtatcattctttaatac tcattcacttctattttcattatttcttttttctttatcgaggaagtcttcacttctttaaggctgcactgccacatttgatggatatcaactgccctctcttattattctattaaccatgtattaaatttgttgatcctttttgtggttgtctgccaaagttgttactatgtgtgggaacttttttttctgtagatttttttaaaaattgtacatgcatcagaggatgcaacta atatattcattttccctctaaagtatgaaaaatgtgttgattaatagtataatgtgcagatattacgcctgacatgaagccattggattgggataccaggatggtaatagcatctggtgccgccaaaggcctggagtaccttcacaaccacgctgatcgtcctgttatctacagggaccttaagtcggcaaacatattattgggtgagggtttccatgctaaactttctgacttt ggtcttgcaaagtttggcccaattgcagacaacacacatgtttcaactcgggttatgggcactcatggatattgtgcacctgagtatgctggaacaggaaaactgactatgaaatccgacatctatagttttggtgtgctcttgttggagctaattactggatgtagagcaatggatgactctcacgaacatggaaaagaaatgcttgttgactgggtaattacttcacatttaa 
ttgattgaagtcattatacttggcaagctggaaaatgaaccaactttgagacacaaatataagtttttgtttcactattgagtttaaaaatgtttgtgtgtccttttttgactcttgaagcataaagttttttctctgagtgatgcatagttctatcaacgtgtcaaaggtattttggttgaaggtctctctctgccttggaggaggcccaaagttaccgaaaaatgagaatgac tgatctatcaagtgatatgcacagttctgaccattgaaatgtatcatagtttgatttttttttctttttctgaatgagaagtgtaggctcctgtttgtgcgcttccttttctccctgctaacagtgaagtttctcaggcacgtcctatgttaaaagaccgcatgaactatgtacagttagcggatccaatgttaagaggcaaatttccacaatctgtcttccgcagggtagtaga actggtcttaatgtgtgttcaggatgatccccatgctcgacctcacatgaaagacatcgtgcttgctttgagttactttgcatcccaaaagcatgattcacctgcagctcagattggatctcatgggggagaagggacaaatggaagctctgttgattttgatggagctcagatggatataacagaaataagagcttcaaacaaagatcaagagcgggagagagctgttgcagag gccaagaagtggggcgagacctggagagagaaagggaaacagaatgcggatgatgatttagattataaatcaaggtggtgattgattgtaagttagtttccttgtcacagaacagcatttttattcaaatttttgaaacagtgtcgtatgtatatactcataaaaaagaaattattggcattattgatgtatttgtatgcttgacaaaataaattcaatacaactactatatggc atttatcattaacaagtactactaattaggttttgttcaagtactaaccaaacaaacaagaatgttaaattaagattaaaatatataagcaaccacatggattaaggattcaagagtaataatgtaagaattgtaccgtacagagtttagtaatgtaggtatgattcttcttatttcactttttatccctcaaaatatcataaataaattcatgtattcggttaagatcggttat tggtaaaaattaaaatcaaattaatctagttggttttctaaattactaaaaccaaaccaaacggataaaataaattggtttggttcaatttttctattttttttttcagtttgaaagtaatacatttttcaggacaaacatatcttgatcgacacaaacacctagtatatgaacaatagagataaaaactattgttggttcaattagcaataagcaatattaaagttatcattac acgaataaagagattcaattaagagaaagtgtggaaccttaactaaggtaagtgtggggtttaagaaaatgataaagctaaagacttaaagttacttaaaatttaaaaaactatttatattttatttataaataatatataaataattataaaatttatatatctaattataagttcaatttcgatatttttgcagttgtctttttttagtaaaatcaaaatcaaactaaatagt attgatattcaaaattcaaagccaaatcaaacgaaatttcaagtttttaattagtttggttcgaattgtggtttggtatgattttttgtagccataaccgtaaacaatcatggttcggtttggatcgatattaattaaaaccttaaatttctaaacccaaaccaaatggaaaaaaaaaccatcggtttggtttggtctgattcggggttttgttctttttggtttaaattaaatt 
tattgagagtatatgcatctcgatagatccaaaaacctattacatgaacaatccacatgaaagttattatgtattcaattaaacaatattatagttatcattatgcgaataaagtggttcaattaaaaaaatgtgactatctaatgtgacgcggggtagataaagataaagattttgattgacttggtcttagtcgggctaaaaattaagtgttgaatttaaaaaaaatgtgaaa gttaagacttaagttaattaaattttatttatgagtaatatataaataataatgaaatttatatacaaaataatattgattttcaaaagagaggtcctaccgggagtcgaacccaggtcgctggattcaaagtccagagtgctaaccactacaccatagaaccaattgatactattcttcagctgaattgtatttagattatgtaagcactataaatatataatattttttttct gtgatgaactccttaaaagtgagaaagagggtcatttggttggtgggtgagtaaacaagtaatttcattacagaaacaaaaatagaaataggagggctagggcattagtgaagaaggaaaggaaaggtaacaaaagttgatggggtccagactccagtatgtcagccaacgataatatgccacgtcagtttgaccatcctgtaaagtgaaaaatgggtcccctaccaaaccaaac caaaccctttttacacactctctctttactaccctcatgcctttgtcggctcatttaaccaaaaaaaaaagcttcaccggaatctgagtgtttgtccggcaaggcacagaagtctctctgtttgctgctggagatctcgaggaagccccagagatgtatgtcgttcctcctccaaaacgacccgatccattgtctggatccgaggacttgcggatttaccagacatggaaaggaa gcaatgtaagagcttatccagtttattcacacttgctatctcatctaaaacttaacatcatgtttgtcacattaacttttaccaacgattattttatcaggacaaactgaatcaggtctataaaaacgatttactatcttaaaaatagacaaaagttatgactccttccttactagaacttttattcttaaattgctgacagatatttttcttccaaggaaggttcatatttggg ccagacgcaagatccctagcactgaccatatttctcatagtggcccctgtctcagttttctgtgtctttgttgcaagaaagctcatggatgatttttcaaatcacctggggatattgattatggttgtagtcattgtgttcacattctatgtaagttctctctccctacttctcattgattctaacatcctcatcacctgcactttatgacaacacaatacctctataaacttgc ctttatttaatctcacggttactggaaacaaaattccgagtattagaaagtcaatcatctcaaagttgatcacatatcaacttggcactgcatctcaaggatgtaaggttggcttggttgtgtcaatagtcttccattcgacggttaagtggctctccttacaatgaataggattggcaacctaattgccttcctattttgtatgtctaatagtttcacatttcaatattacagg ttttagttctactccttctcacatctggaagggatccaggaataattcctcgtaatgcacaccctccagaaccagaaggttatgatggtactgtggaaggtggtggacaaacccctcaattacgtttgcctcgcattaaagaagttgaggtcaatggcattaccgtcaagatcaaatactgtgacacctgcatgctttatagacctccccgctgttctcactgttcaatctgcaa 
taactgcgttgaaagatttgaccatcactgcccttgggtagggcaatgcattgggctggtaatagttcttttttttcttgttctttttcgtaacatatttttagttagtactgccacactagtaactgatattctctcttttcagttttacttttcaatgcttgatatgtctgatgaactattgcatataatctgatatgaagagtaaccagacattttatcatgatcacactcc ttctctctgcattgaaacctcataaaaagagacttaacaaaagactgattcttttacttgtaaaaacgaataaaataggtaagaccgattctttctttgttatctttttaaaaaatttacctattcactctaatgttaattcgcgaaaacaagcaagttacataaaatgtctaaagaaaaattgttgtcttaatgaagcaatgctgtggatctgaagtattagatcctcaaagca tcttgaaatctagctctcaaagtttataattatgtaaaaccgttaatactaatttatgaatcagactcttttaactattgatgctgttggcagcaggaggaagtttaaaaggtttgtaacttgagcatttagttcaaatggccatcatgaagacaattatatgttttagaaatatttagcactctagatctttgcagtggaactaaagttagagggtttagtacacatatccaca tgcttaagttctcacatttcagctgaaatctttttgtccttgcgtacagtacttgctgtcgtgattctcagatttttgaagttatgaagtcaggctggtcaatttacctgttaatgtcacgtccacctttgatttcagaaatatctatagtctttgcttgcgctgataatttctgtaacaaatgcatcttttaaaacggatcatctagtgttgatgtcttggactctttccttat gttgcacagatttctgaaacttgtgtctttttattgcagcgaaactaccgttttttctttatgtttgtcttctctacaacacttctttgcatatatgtttttgggttctgctgggtctatattaagagaatcatggttctagatgacaccaccatatggaaagcaatgatcaaaacaccggcttccattgttctaatagcatacacttttatatcagtatggtttgttggaggtc ttactgcttttcatctatacctcatcagtactaatcaggtatgttcgtggattgtaattttgtttttccctttgtctatctgaggtggtttggattttgatgtgaactgggagtgacgtctattttttgggtactgcagactacttatgagaattttagataccgatatgattggcgtgccaatccctacaacagaggagtgatgcagaatttcaaggagatattttgtactagt attcctccatccaagaacaatttccgtgcaaaggtgcccagggaacccaaggtggcaactcgatctgcaggtgggggttttgtgagtccaaacatggggaaggctgtggaggacatagaaatgggcaggaaagcggtttggagtgaagtaggggataacgaaggacaacttagcgacaatgatggcctgaacattaaagatggcatgttagggcaaatgtctcctgagataagaa gtacagtagatgagagtgatcgtgcaggaatacatcctagaggatcaagctggggaaggaaaagtggaagctgggagatgtcacctgaagttcttgctttggcatcaagagtgggagaagctaatcgaacaggtgggagtagcagaccaacagatcaaaaaaagttgtgattaacagatagtatgaaaactggattgaatagattagtggttcttggaggtgtatggtatgtggt 
cattcagtgtcgtgtattatcagatgttggctttaggaagtgtgtgatatgaggggtggttttaattcctaaaacttgtattgtatgtgtggattagttagtgtaatacattagttttgctctgttcatgctaggcggtgcatttattctttgtgcttaaacaatgtgggcaagagtcccatataatatatatagtacgttgtaacagttgttattttacaagtaagaatctggg tttttgctgaatcacagtaaggggtggatttgagtaaaaatagagaaaatacacaagtcccgtgtaaccacccacaatagtatatatggctcgaaggaattcttttataatatgataacaccaactaaaaaggtagtccaattatatttcttcaaccataggttcttttcctggaacaaacaagcattcataatatttttcctttcgatcacagccatgagccaaccaaacttgc tccaccatagttgttaccaacatctccctcgtgtcagcacaagttaagcctaagggactcctcaactttccatcctttaggatccattaaattaaaaaaaggaaatagtatgacccttcttgagaagtctttctgttgatgggtagagaacgttggcatttgtaaatggtttcttatcaaatgtgtgtgtggtttcatttaatacttcagatgagatatcatgccctcaagatta accctagatggtctccttcttacaattgtagatttattattcaaattatgaatgtctgagaacaatacaattgattcagatattatattcattaaaacaatacattacgacatataaaaataatacatcacaaccatcaattaccactttttatggagaaaagaaaaattaaaaagtttcaaataatatagtcatgggtaaatataaaacatcgtcactaaaaaatgagggagaa tagtttccaccaacttttgcaaataggttaggtaaattaaaattaaaatattaatgttttatatatattaatattttgtatatattaaaaaatttcaaataatattaatgttccgcgtgtgcaaaatattttgtatatattaagagtgagttggtacgaaggaaaatattttctcgaaaatgttttccaattttctcatatttggttgatataaaagttttgtaaaatgtttttc aaatcaactcattttcctcgaaattaaggaaaatgacttctcttcaaaaattaaggaaaacatttttcaaaattctactccaatttcaaattgcattttttttttcgaaagacatcaattttaaaagaatattttcaattttaaaattttatgtgtttacccaatccctcccccaaccccctaaaatattaatttcattcataaatcaaacacacgaaaatattttctactcacc taccaaacatgaaaaaataaatttaaaatctacttatttttcaaaacacatgtatctacgctaggttggggggtctaaaattaatttaattaagaaaaacataatctttccaactttaagcaaattatgtaaatcagattgatggagaaaaaaattaccagtcatgatttgtcttaaataatcattaataagtacataaaaataaaagtaatacgacatctaaaaatattccaag ttgccatattttttcttattacactctgttgcacttttcattttacatggcgctcaaaaaattataaataaaagaataaaatttttaaatttttttacaaataaatataaatatatttcaagaaataattgcaaaagtaatatatgacaaattctattatttattttaataaaataatggagacaattattttttataagaatatggaacggaattaataagagtagtggatgga 
tccagctaaggtttttttaataaataaagggagaatagtggtcagcaaaattttgcaataaatagatgtgtacaacttccaaaatactcttcgttcagaacgttatccataacatgtcagcatctcgtggctacacatcatcttcttttattaataatatatcaccttgcctattgcaatattaattacaaattatgtttctcatcatcaaatttatattcaagggaacacatgt ttcatacaccaatttcaatttattaaaatcagttcatcaatcatgactaatagttaatctttttaccaagttgaaaacaaattaaatatcatctagataactgcttgtaatatgagtgaatcattgcacgttttatatacctactactcgcacaaatctcgaaggaaaccaattagctgtaagctgactattacatttttttttatttttttaaaaaaaaaggaaaatcaaaatt tttatttttaaacttgtgttagtaattgaaaaacttaccacatgtaagaagaaaaaatctttttaaagttgacccaatagtaaattgtaaaagaaaaaaagtaaaaattatggtctatagatatccttttcagaatagcatcatgtggtatatatttagcatattcttccttacaaatgcaatatcgaaatcaacttttaaaaagtaacgacgattaaggggagcgaatatggtt actgaagggaaatttcgtacttttgcaagtcgaattattgcagcatgaagttaaatttgaaaaaaaaaatatatataataagagagttgagattttagaaggaatctataatgtgatgtattaattattaattaaaataaaggagtaaagaggaggacgcgtctgatgtacgtggGgagggagagaggaatatccaaaagcataacaccgattacggattgagaatattcctgat cccctctaacctcccataaataccattcttttcttacttgttgttgcatccaatccaatccaatcatcccgttttctcttctcttctcttcttctcatatatataatatatatactagtagtatatatatatataacaatacaacaacaatggctcttacagctgttcatgtttccgatgttcccaatctagatcaagtccctgacaaagctcctctatatgccacccgattctc tcaaggtttcctttgtctttaatttctatactcatttattctctttctttctagacaatacatacataatctcttattttgctttgcaggcattgaaattggaagagcatccgaatttttagttgttggacacagagggaacgggatgaatttgttgcaatcggctgaccggagaatgaatgccctcaaagaaaattccattctttctttcaatgcagctgccaattacccaatc gattttattgaatttgacgttcaggtaattttctattttccgcgtatttccttttttctttctattccccccgcccaaaaaaaaaatcgacccgacccgacccgatttaaaaatattaacagaaaaaccttaaaccctagatcttggcccagtatctctctgaagatggattctacctttagattaccaaagtaaatgtcattttttgtttgtaggtgacaaaggatgattgccc tgttatttttcacgacgatttcatcctcactcaacataataatgtaagctcttttaatatccaataacccctttctcttctaatcaaataaaatctcattctcatggtgcgtgtgcagggtacagtttatgaaaggagaattactgaattgtcacttgctgaatttcttagctatggaccccaaaaagaagagggtctcactggaaaacctttaatgaggaaaacgaaagatgga 
aagattgttagctggacagttgaaaccgatgattccgcatgtaccttaaaagaagcttttgagaaagtgaatccatctattggtttcaacatcgagctcaaatttgatgatcacattgtttatcaacaggactacctcatccatgctcttaaagcagtgttacatgtcgtattagagtatgctaaaggcagaccaatcatattctcaagtttccagcctgatgctgctctgcttg tcaagaagctccagacatgttaccctgtacgtacgtttccattttggaatctatcttaaacacaagtgaattcgatctgacatacttaccttttggcactgtgtaggtgttttttctcacaaatggaggtacagagatttactatgatgttcgaagaaactcgttggaagaggccactaaactgtgcttagagggtggtttggaaggtattgtttcggaggtgaaaggcatcttc aggaatccaggagtagttaacaagatcaaagagtccaagctgtctctgctgacatacggcaaattgaagtaagtagagtattgattcagttggacactaacgaactaacgaactaaccatttatttatttattatacgttcagtaatgtgcctgaagctgtgtatatgcaacacctgatgggaattgatggagtgatagtggattttgttgaacaagtaacagatgctgtgtgta agctggtgaagaagccagatgagatattgctggaaggggaggaaaaggttcaaaatagacctcaattttcacagagggaattgtcttttctgctcaaacttatccccgaactgatacaacaataacaattcataaaatgattgtagatagcaaagcgtgtagaaatgtagattttcattgtatttgcaccctctctgtaatcatatcaaaatattttatgaattgtcttggaatt ttagcaacatcaactcacttcaattaggttggttattaaacagtccaacccacattagttaccaaatgtacctgtggctgtgaatcatagttaaaattgacatgacctttcgagagaaagcaaagcaagtaacgtattaatctcaagtagaggggattcatttatgacagtgaatattaattaaacaataaccattagtcttcagttcccaattctatatttaaaagttccaatc ttcaattcctaattgcttaaaaatatattacaacgttttgaagccaaagcaattagatacaattatcatcaccttgtgttatcgactctccacacagagtgatagacatgttcctcaaatacaatattctttccatttcatttattccgtcttacttcccttttgattgcaacttttcgcccagtacatttaagaccacaagattaaagaatatcttgatgcactttacatatct ttaatttaaaagttttccttactttcttgaaactcaatatcaagttaaaacaaattaaagtatttttatttcgacagataatgatagcagaatatctaataggtaaactccaatatacatgttgtatattaattcatttcaaaataccagagaagtcgttgtttcttgttggactttgatggattaggatgtatgtatgttttggtccagtacttgtcaaggctatgagactcat aagaaatagggcaaaatttctatttatatgctataacaaagtttgcataatttcgctccatagcaaacatatatgtgtataattcgttatacatatacaattgaaacgaattgtataaaacgagaaagagaaaaattatatacaatttgaatttgtataaaacgaaaaagagagaaagacaaaagaaatggtttatataagtgtatattgagaat

5 What is genome annotation?
– Attaching biological information to the genome sequence
(i) Structural annotation: genes, regulatory elements, etc.
(ii) Functional annotation: biochemical, phenotypic, etc.
– Creating annotations or features
– Based on known information

6 Why should we worry about genome annotations?

7 Genome annotation pipelines
– The NCBI Eukaryotic Genome Annotation Pipeline: ion_euk/process/
– Ensembl Gene Set: build/genome_annotation.html
– JGI Plant Genomics Group Annotation Process: ant%20Genome%20Annotation%20Pipeline%20SOP.pdf
– Maker: lab.org/software/maker.html

8 BGI genome annotation pipeline

9 Annotation classifications
By source
– De novo
– Sequence alignment
– Model prediction
By type
– Repeat elements
– mRNAs
– Non-coding RNAs

10 FINDING REPEATS The start

11 Repeats 101
Sequences that are present in the genome in high copy number
Duplication rate and structure depend on the type
Main types
– Tandem repeat elements (TRF): short sequence segments repeated in tandem
– Retrotransposons: self-duplicating elements that transpose via an RNA intermediate
– DNA transposons: repeat elements that don’t use an RNA intermediate
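
A minimal sketch of what "repeated in tandem" means in practice, assuming perfect (exact) copies only. The function name `find_tandem_repeats` and its parameters are made up for illustration; dedicated tools such as TRF also handle approximate copies, which this sketch does not.

```python
import re


def find_tandem_repeats(seq, unit_min=2, unit_max=6, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    Hypothetical helper for illustration only; exact copies, ACGT only.
    """
    seq = seq.upper()
    # Group 1 is the repeat unit (shortest unit is tried first); the
    # backreference \1 demands at least (min_copies - 1) further exact
    # copies immediately after it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (unit_min, unit_max, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    for start, unit, copies in find_tandem_repeats("GGCATATATATGGC"):
        print(start, unit, copies)
```

Runs such as the `attattatt...` stretch in the chromosome fragment on slide 4 are the kind of signal this picks up.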

12 Repeats
Make up a large portion of eukaryotic genomes
Not junk anymore

13 Repeat prediction
Using pre-constructed repeat libraries
– Representative sequences of elements found in eukaryotic species (also prokaryotic organisms)
– Repbase ( )
– Dfam (http://www.dfam.org)
– Homology-based searches
Programs: RepeatRunner and RepeatMasker

14 Repeat prediction
De novo analysis of the sequence
Building custom repeat libraries
– Predict repetitive elements using the genomic sequence
– Based on sequence structure or repetitiveness
Programs:
– TRF for tandem repeats
– RepeatModeler or RepeatScout for general searches
– MITE-Hunter and LTRharvest for type-specific searches => ltrharvest
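
To make "based on repetitiveness" concrete, here is a rough sketch that counts k-mers and keeps the ones seen several times. This is only the core intuition, not the algorithm used by RepeatModeler or RepeatScout; the function name, `k` and `min_count` are illustrative assumptions.

```python
from collections import Counter


def repeated_kmers(seq, k=8, min_count=3):
    """Return {kmer: count} for k-mers seen at least min_count times.

    Hypothetical helper: counts overlapping windows on one strand only.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    toy = "TTGACCA" + "GGG" + "TTGACCA" + "CCC" + "TTGACCA"
    print(repeated_kmers(toy, k=7, min_count=3))
```

Real de novo tools go further: they extend and cluster such seeds into consensus repeat families, and they consider both strands.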

15 WHERE ARE THE GENES The hard part

16 What is expected of a gene

17 Gene prediction
Combination of methods that predict the existence of a transcribed mRNA
Based on supporting “evidence”
– EST sequences
– Assembled transcriptomes
– RNA-seq reads
– Protein alignments
– Ab initio predictions
– Repeat element distribution

18 “EST” evidence
Bundle of ESTs, unigenes, assembled transcriptomes and RNA-seq reads
Flags of potential transcription
Aligned using splice-aware aligners
Programs: GMAP, GSNAP, TopHat, Exonerate

19 Protein alignments
Using proteins from closely related species
Evidence of expression of a particular domain or entire protein
Also uses splice-aware aligners
Programs: Exonerate

20 Ab initio predictions
Predictions using Hidden Markov Models (HMMs)
Tailored for particular genomes
– Unless you are working with a model species, you will need to construct custom HMMs
Search for mRNA-like sequences in the genome
Similar to protein domain searches
Programs: SNAP, Augustus, GeneMark, FGENESH => snap => genemark
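
As a rough illustration of how an HMM labels a sequence, the sketch below runs Viterbi decoding over a toy two-state model. Everything in it is an assumption made for the example: the state names (`I` for intergenic-like, `E` for exon-like), the probabilities, and the idea that one state simply prefers G/C. Real predictors such as SNAP or Augustus use far richer models trained on the target genome.

```python
import math

# Toy model: all numbers below are invented for illustration.
STATES = ("I", "E")  # I = intergenic-like, E = exon-like
START = {"I": 0.5, "E": 0.5}
TRANS = {"I": {"I": 0.9, "E": 0.1}, "E": {"I": 0.1, "E": 0.9}}
EMIT = {
    "I": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "E": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for seq (A/C/G/T only)."""
    seq = seq.upper()
    if not seq:
        return ""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best previous state to arrive from when moving into s
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointer[s] = prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("AAAAAA" + "GGGGGG" + "AAAAAA"))
```

Training a custom HMM amounts to estimating tables like `TRANS` and `EMIT` (in a much larger model) from known or predicted genes of the species being annotated.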

21 What’s a gene model?
Combined evidence generates a gene model

22 Gene functional annotation
Assignment of putative biological functions to mRNAs (“genes”)
Searches for protein domains and similarity to reference protein databases
GO annotations, KEGG enzymes, Pfam domains
Programs: InterProScan, KAAS

23 THE OTHER RNA’S

24 ncRNAs: tRNA, miRNA, snoRNA

25 Non-coding RNAs: tRNAs
The simplest of the ncRNA predictions
Uses a combination of sequence homology and secondary-structure models
Programs: tRNAscan-SE

26 All the others
Programs: Infernal, Snoscan and miRPredict
Scan with other ncRNA databases, for snoRNAs and for miRNAs, respectively
Custom ncRNA libraries are commonly required for comprehensive searches

27 ANNOTATING A GENOME WITH MAKER

28 Outline of the MAKER annotation pipeline
MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Brandi L. Cantarel, Ian Korf, Sofia M.C. Robb, Genis Parra, Eric Ross, Barry Moore, Carson Holt, Alejandro Sánchez Alvarado, and Mark Yandell. Genome Res. January : ; Published in Advance November 19, 2007, doi: /gr

29 How to run MAKER in iPlant
https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial
https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+at+TACC+Lonestar+Guide

30 MAKER tips
For troubleshooting and testing the files, it is better to use a small FASTA file with a single sequence (e.g. a 500 bp fragment of a chromosome/scaffold)
– Verify that the FASTA files don’t present any errors for MAKER
– Verify that the system doesn’t have any issues (missing programs)
MAKER MPI (parallelization method) doesn’t scale up to systems with a high number of cores; manual parallelization is required above 24 cores
– Split the genomic FASTA file into smaller pieces
Some scaffolds will report back FAILED; in most cases these are scaffolds with a high content of repeat elements
– These require an independent RepeatMasker run
AED (Annotation Edit Distance) score => MAKER gene model quality score (lower is better)
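
The "split the genomic FASTA file into smaller pieces" tip can be sketched as follows. This is one possible approach, not part of MAKER itself: the helper names `read_fasta` and `split_records` are hypothetical, and the greedy largest-first balancing is just a simple heuristic so that each piece holds a similar number of bases.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records, n_parts):
    """Assign records to n_parts bins, largest first, balancing total bp."""
    bins = [[] for _ in range(n_parts)]
    sizes = [0] * n_parts
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        i = sizes.index(min(sizes))  # currently smallest bin
        bins[i].append((header, seq))
        sizes[i] += len(seq)
    return bins


if __name__ == "__main__":
    demo = [">s1", "ACGT", "AC", ">s2", "GG", ">s3", "TTTTTTTT"]
    for n, part in enumerate(split_records(list(read_fasta(demo)), 2)):
        print(n, [header for header, _ in part])
```

Each bin can then be written to its own FASTA file and annotated in a separate MAKER run, with the resulting annotations merged afterwards.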

31 MAKER exe.file
#-----Location of Executables Used by MAKER/EVALUATOR
makeblastdb=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/makeblastdb #location of NCBI+ makeblastdb executable
blastn=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/blastn #location of NCBI+ blastn executable
blastx=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/blastx #location of NCBI+ blastx executable
tblastx=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/tblastx #location of NCBI+ tblastx executable
formatdb= #location of NCBI formatdb executable
blastall= #location of NCBI blastall executable
xdformat= #location of WUBLAST xdformat executable
blasta= #location of WUBLAST blasta executable
RepeatMasker=/home/sreyesch/MichelmoreBin/bin/RepeatMasker #location of RepeatMasker executable
exonerate=/home/sreyesch/MichelmoreBin/bin/exonerate #location of exonerate executable
#-----Ab-initio Gene Prediction Algorithms
snap=/home/sreyesch/MichelmoreBin/bin/snap #location of snap executable
gmhmme3=/home/sreyesch/MichelmoreBin/bin/gmhmme3 #location of eukaryotic genemark executable
gmhmmp=/home/sreyesch/MichelmoreBin/bin/gmhmmp #location of prokaryotic genemark executable
augustus=/home/sreyesch/MichelmoreBin/bin/augustus #location of augustus executable
fgenesh= #location of fgenesh executable
#-----Other Algorithms
fathom=/home/sreyesch/MichelmoreBin/bin/fathom #location of fathom executable (experimental)
probuild=/home/sreyesch/MichelmoreBin/bin/probuild #location of probuild executable (required for genemark)
Location of all the programs MAKER requires to run
Most of the time it doesn’t need modification
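
Since missing programs are a common source of trouble, a quick check of the paths in this file can save a failed run. The sketch below is not a MAKER feature; it assumes only the `key=value #comment` layout shown on the slide, and the helper names `parse_ctl` and `missing_executables` are hypothetical.

```python
import shutil


def parse_ctl(text):
    """Parse 'key=value #comment' lines into a dict (comments dropped)."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue  # section headers and blank lines
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_executables(settings):
    """Return the keys whose path is set but is not an executable."""
    missing = []
    for key, path in settings.items():
        if not path:
            continue  # deliberately left blank, e.g. fgenesh=
        if shutil.which(path) is None:
            missing.append(key)
    return missing


if __name__ == "__main__":
    demo = (
        "#-----Ab-initio Gene Prediction Algorithms\n"
        "snap=/nonexistent/maker_demo/snap #location of snap executable\n"
        "fgenesh= #location of fgenesh executable\n"
    )
    print(missing_executables(parse_ctl(demo)))
```

In practice the text would be read from the exe control file in the run directory before launching MAKER.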

32 MAKER bopts.file
#-----BLAST and Exonerate Statistics Thresholds
blast_type=wublast #set to 'wublast' or 'ncbi'
pcov_blastn=0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments
pid_blastn=0.85 #Blastn Percent Identity Threshold EST-Genome Alignments
eval_blastn=1e-10 #Blastn eval cutoff
bit_blastn=40 #Blastn bit cutoff
pcov_blastx=0.5 #Blastx Percent Coverage Threshold Protein-Genome Alignments
pid_blastx=0.4 #Blastx Percent Identity Threshold Protein-Genome Alignments
eval_blastx=1e-06 #Blastx eval cutoff
bit_blastx=30 #Blastx bit cutoff
pcov_rm_blastx=0.5 #Blastx Percent Coverage Threshold For Transposable Element Masking
pid_rm_blastx=0.4 #Blastx Percent Identity Threshold For Transposable Element Masking
eval_rm_blastx=1e-06 #Blastx eval cutoff for transposable element masking
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
pcov_tblastx=0.8 #tBlastx Percent Coverage Threshold alt-EST-Genome Alignments
pid_tblastx=0.85 #tBlastx Percent Identity Threshold alt-EST-Genome Alignments
eval_tblastx=1e-10 #tBlastx eval cutoff
bit_tblastx=40 #tBlastx bit cutoff
eva_pcov_blastn=0.8 #EVALUATOR Blastn Percent Coverage Threshold EST-Genome Alignments
eva_pid_blastn=0.85 #EVALUATOR Blastn Percent Identity Threshold EST-Genome Alignments
eva_eval_blastn=1e-10 #EVALUATOR Blastn eval cutoff
eva_bit_blastn=40 #EVALUATOR Blastn bit cutoff
ep_score_limit=20 #Exonerate protein percent of maximal score threshold
en_score_limit=20 #Exonerate nucleotide percent of maximal score threshold
MAKER BLAST options (blast opts)
Specific options that MAKER will use to perform the BLAST alignments
Doesn’t require modification
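
To give a feel for what these cutoffs do, the sketch below applies the four blastn-style thresholds to a single alignment. It is an assumption-laden illustration: the slides do not say how MAKER computes coverage and identity internally or whether the comparisons are inclusive, and the `hit` dictionary layout and the function name are invented here. Only the default values come from the file above.

```python
def passes_thresholds(hit, pcov=0.8, pid=0.85, max_eval=1e-10, min_bit=40):
    """Return True if an alignment meets all four cutoffs.

    hit is a hypothetical dict with keys: aligned_len, query_len,
    identities, evalue, bitscore. Defaults mirror pcov_blastn,
    pid_blastn, eval_blastn and bit_blastn.
    """
    coverage = hit["aligned_len"] / hit["query_len"]   # fraction of query aligned
    identity = hit["identities"] / hit["aligned_len"]  # fraction of matching bases
    return (
        coverage >= pcov
        and identity >= pid
        and hit["evalue"] <= max_eval
        and hit["bitscore"] >= min_bit
    )


if __name__ == "__main__":
    est_hit = {"aligned_len": 90, "query_len": 100, "identities": 80,
               "evalue": 1e-20, "bitscore": 100}
    print(passes_thresholds(est_hit))
```

The protein (blastx) thresholds are looser than the EST (blastn) ones because proteins from related species are expected to diverge more than transcripts from the same species.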

33 MAKER opts.file
File that contains our chosen inputs for MAKER to run
We need to modify it
– The safest way to edit the file is on the command line
– Mac users should use their preferred command-line editor
– Windows users could use WordPad to edit the file
– Ubuntu: the default Ubuntu text editor works well
Didn’t fit on the screen

34 Generalized annotation pipeline with MAKER for non-model species (flowchart)
Boxes: Genomic sequence; tRNA prediction; Repeat library construction; Enough EST? (Yes: MAKER 1st iteration without ab initio predictions; No: MAKER 1st iteration with an HMM from a closely related species); Training of ab initio predictors; MAKER 2nd iteration with custom HMMs; mRNA functional annotation; Other ncRNA prediction; Uncurated genome annotation

35 Generalized annotation pipeline with MAKER for non-model species
1. Perform tRNA prediction with tRNAscan
2. Construct a custom repeat library
3. Initial MAKER prediction
   1. If enough “EST” data is available (representative of most genes and at high depth), use MAKER without any ab initio prediction
   2. If not enough “EST” data is available, search for an HMM from a closely related species that can be used for prediction
4. Train ab initio predictors using the predicted gene models from the initial MAKER run
5. Perform the second MAKER iteration using all the information from the first run, but using the HMMs generated in step 4
6. Do a draft functional annotation of the mRNAs predicted in the second MAKER iteration
7. Perform other ncRNA prediction
8. Gather all the results from the second MAKER iteration, the tRNA prediction and the other ncRNA predictions, and you have your uncurated genome annotation

36 HOW DO WE STORE THE ANNOTATIONS

37 Generic Feature Format (GFF)
Standardized format for storing genomic annotations (features)
Has conventions for a wide variety of features
Current version is GFF3

38
##gff-version 3
##sequence-region ctg
ctg123.gene ID=gene00001;Name=EDEN
ctg123.TF_binding_site ID=tfbs00001;Parent=gene00001
ctg123.mRNA ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123.mRNA ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123.mRNA ID=mRNA00003;Parent=gene00001;Name=EDEN.3
ctg123.exon ID=exon00001;Parent=mRNA00003
ctg123.exon ID=exon00002;Parent=mRNA00001,mRNA00002
ctg123.exon ID=exon00003;Parent=mRNA00001,mRNA00003
ctg123.exon ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123.exon ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123.CDS ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123.CDS ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123.CDS ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123.CDS ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123.CDS ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123.CDS ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123.CDS ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123.CDS ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123.CDS ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123.CDS ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123.CDS ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123.CDS ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123.CDS ID=cds00004;Parent=mRNA00003;Name=edenprotein.4

39 Column descriptions
Column 1: "seqid". The ID of the landmark used to establish the coordinate system for the current feature.
Column 2: "source". A free-text qualifier intended to describe the algorithm or operating procedure that generated this feature.
Column 3: "type". The type of the feature, chosen from predefined terms that characterize it.
Columns 4 & 5: "start" and "end". The start and end coordinates of the feature on the reference sequence.
Column 6: "score". The score of the feature. Varies widely depending on the source.
Column 7: "strand". The strand of the feature: + for the positive strand, - for the minus strand, and . for features that are not stranded.
Column 8: "phase". For features of type "CDS", the phase in which the CDS is read (0, 1 or 2).
Column 9: "attributes". A list of feature attributes in the format tag=value, separated by semicolons.
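
The nine columns map naturally onto a small parser. The sketch below is a minimal reading of the format, assuming tab-separated columns and `tag=value` attributes with comma-separated multiple values; `parse_gff3_line` is a hypothetical helper, and the coordinates in the demo line are illustrative, since the example on the previous slide lost its numeric columns in this transcript.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines, comments and ## directives.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    feature = dict(zip(GFF3_COLUMNS, fields))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    attributes = {}
    if feature["attributes"] != ".":
        for pair in feature["attributes"].split(";"):
            if not pair:
                continue
            tag, _, value = pair.partition("=")
            # Several values (e.g. multiple Parents) are comma-separated
            attributes[tag] = [unquote(v) for v in value.split(",")]
    feature["attributes"] = attributes
    return feature


if __name__ == "__main__":
    # Illustrative line; coordinates are made up for the demo.
    demo = "ctg123\t.\texon\t1050\t1500\t.\t+\t.\tID=exon00002;Parent=mRNA00001,mRNA00002"
    print(parse_gff3_line(demo))
```

The `Parent` attribute is what ties exons and CDS segments to their mRNA, and each mRNA to its gene, which is how one exon can belong to several transcripts.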

