Interpreting the human genome Manolis Kellis CSAIL MIT Computer Science and Artificial Intelligence Lab Broad Institute of MIT and Harvard for Genomics.


1 Interpreting the human genome Manolis Kellis CSAIL MIT Computer Science and Artificial Intelligence Lab Broad Institute of MIT and Harvard for Genomics in Medicine

2 The age of comparative genomics 32 mammals, 17 yeasts, 12 flies [Figure: species shown include opossum, armadillo, rabbit, cow, hyrax, elephant, human, mouse, rat, chimp, dog, bat, dolphin, lemur, bushbaby, pika, hedgehog, tenrec, pangolin, tree shrew, llama, etc.]

3 Resolving power in mammals, flies, fungi
            Neutral (subs/site)                Coding (subs/site)   Detect
Mammals     2.57 (opp: 0.62; 32sps: 4.87)      1.16                 6-mer at FP 10^-6
Flies       4.13                               1.65                 6-mer at 10^-11
Fungi       15.5 (Yeast: 6.5; Candida: 6.5)    7.91                 3-mer at 10^-21
[Figure: phylogenies of 10 mammals, 12 flies, and 17 yeasts (8 Candida, 9 Yeasts); fungal tree labeled post-duplication, pre-dup, diploid, haploid]

4 Comparative Genomics 101: Conservation → Function Conserved elements are typically functional (and vice versa) –For example: exons are deeply conserved to mouse, chicken, fish Some conserved elements are still uncharacterized –How do we make sense of them? –How do we distinguish each type of functional element? Answer: evolutionary signatures (Comp. Genomics 201) –Tell me how you evolve, I’ll tell you who you are –Patterns of change → selective pressures → specific function

5 Gene identification Study known genes → Derive conservation rules → Discover new genes Evolutionary signatures –“Tell me how you evolve, I’ll tell you who you are” –Each type of functional element evolves in its own specific way

6 Distinguishing genes from non-coding regions
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
Protein-coding genes have specific evolutionary constraints –Gaps are multiples of three (preserve amino acid translation) –Mutations are largely 3-periodic (silent codon substitutions) –Specific triplets exchanged more frequently (conservative substitutions) –Conservation boundaries are sharp (pinpoint individual splicing signals) Encode as ‘evolutionary signatures’ –Computational test for each of them –Combine and score systematically
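The first constraint on the slide (gaps are multiples of three) lends itself to a very small computational test. The sketch below is only an illustration under stated assumptions: the function names and the toy rows are hypothetical, not taken from the talk, and a real test would weigh gaps by position and species rather than pooling them.

```python
import re

def gap_lengths(aligned_row):
    """Return the length of every run of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]

def fraction_frame_preserving(rows):
    """Fraction of gap runs, pooled over all rows, whose length is a multiple of three."""
    lengths = [n for row in rows for n in gap_lengths(row)]
    if not lengths:
        return None  # no gaps at all, so the test is uninformative
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)

# Toy alignments (hypothetical): one gap of length 3 vs. gaps of length 1 and 2
coding_like = ["ATGGCT---AAATGA", "ATGGCTGCTAAATGA"]
noncoding_like = ["ATGGC-TAAAT--GA", "ATGGCTTAAATCCGA"]

print(fraction_frame_preserving(coding_like))
print(fraction_frame_preserving(noncoding_like))
```

A region whose gaps nearly all preserve the reading frame looks protein-coding under this signature; a region with many gaps of length 1 or 2 does not.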

7 Signature 1: Reading frame conservation
              Genes    Intergenic   Separation
Mutations     30%      58%          2-fold
Gaps          1.3%     14%          10-fold
Frameshifts   0.14%    10.2%        75-fold
[Figure: reading frame conservation (RFC) score distributions]
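One way to turn reading frame conservation into a number is to ask, for a pairwise alignment, what fraction of aligned nucleotides sit in the same codon position in both species. The sketch below is a simplified illustration, not the scoring used in the talk: the function name `rfc_score` and the choice to take the best of three informant frame offsets are assumptions.

```python
def rfc_score(target, informant):
    """Best fraction of aligned columns in which both sequences share a codon position.

    Both arguments are rows of one pairwise alignment ('-' marks a gap).
    The informant's frame is unknown, so all three offsets are tried.
    """
    if len(target) != len(informant):
        raise ValueError("aligned rows must have equal length")
    best = 0.0
    for offset in range(3):
        t_pos, i_pos = 0, offset      # running nucleotide counters
        agree = aligned = 0
        for t, i in zip(target, informant):
            t_gap, i_gap = t == "-", i == "-"
            if not t_gap and not i_gap:
                aligned += 1
                if t_pos % 3 == i_pos % 3:
                    agree += 1
            if not t_gap:
                t_pos += 1
            if not i_gap:
                i_pos += 1
        if aligned:
            best = max(best, agree / aligned)
    return best

# A 3-nt gap keeps the frame; a 1-nt gap shifts everything downstream of it.
print(rfc_score("ATGGCT---AAATGA", "ATGGCTGCTAAATGA"))
print(rfc_score("ATGGCT-AAATGA", "ATGGCTGAAATGA"))
```

Under this toy definition, the frame-preserving gap leaves every aligned column in frame, while the single-nucleotide gap leaves only half of them in frame whichever offset is chosen.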

8 Signature 2: Distinct patterns of codon substitution Codon substitution patterns specific to genes –Genetic code dictates substitution patterns –Amino acid properties dictate substitution patterns [Figure: two codon substitution matrices, one for genes and one for intergenic regions; axes: codon observed in species 1 vs. codon observed in species 2]

9 Codon Substitution Matrix (CSM) [Figure: human vs. mouse codon substitution matrix; amino acid groups labeled aliphatic, aromatic, negative, polar, positive]
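A codon substitution matrix can be used as a log-odds score: count how often each codon pair is observed in aligned genes and in aligned intergenic sequence, then score a candidate region by how much more gene-like its substitutions are. The sketch below assumes gap-free, in-frame pairwise alignments and add-one smoothing; the function names and the two-codon training sets are hypothetical and far too small for real use.

```python
import math
from collections import Counter

def codon_pairs(seq1, seq2):
    """Yield aligned codon pairs from two gap-free, in-frame sequences."""
    n = min(len(seq1), len(seq2)) // 3 * 3
    for k in range(0, n, 3):
        yield seq1[k:k + 3], seq2[k:k + 3]

def count_pairs(alignments):
    """Count codon pairs over a list of (seq1, seq2) alignments."""
    counts = Counter()
    for s1, s2 in alignments:
        counts.update(codon_pairs(s1, s2))
    return counts

def csm_score(seq1, seq2, coding_counts, noncoding_counts, pseudo=1.0):
    """Sum of log-odds (coding vs. non-coding) over the aligned codon pairs."""
    n_cells = 64 * 64  # every possible codon pair gets a pseudocount
    c_total = sum(coding_counts.values()) + pseudo * n_cells
    n_total = sum(noncoding_counts.values()) + pseudo * n_cells
    score = 0.0
    for pair in codon_pairs(seq1, seq2):
        p_coding = (coding_counts[pair] + pseudo) / c_total
        p_noncoding = (noncoding_counts[pair] + pseudo) / n_total
        score += math.log(p_coding / p_noncoding)
    return score

# Toy training data (hypothetical): silent changes in genes, arbitrary ones elsewhere
coding_counts = count_pairs([("GCTAAA", "GCCAAG")])
noncoding_counts = count_pairs([("GCTAAA", "GATAAA")])

print(csm_score("GCT", "GCC", coding_counts, noncoding_counts))  # positive: gene-like
print(csm_score("GCT", "GAT", coding_counts, noncoding_counts))  # negative: atypical
```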

10 Signatures 3, 4, 5, 6, 7, etc… Mutation patterns of splicing signals –Real splice acceptors/donors evolve in specific ways Evolution of other motifs associated with splicing –Exonic/Intronic Splicing Enhancers/Silencers (ESE, ISE) –Density of motif clouds surrounding real exons Sharp conservation boundaries –Relative conservation of exon vs. surrounding regions Length of longest ‘open’ reading frame –Frequency of stop codons in each frame / each species [Figure: real exon flanked by acceptor site and donor site, with ESEs and ISEs marked]
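The last signature on this slide (stop codon frequency per frame and the longest open stretch) is straightforward to compute for a single sequence. The sketch below is illustrative only: the helper names are hypothetical, it uses the standard stop codons TAA, TAG, and TGA, and it looks at the forward strand of one species, whereas the slide refers to each frame in each species.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code

def stops_per_frame(seq):
    """Number of stop codons in each of the three forward reading frames."""
    seq = seq.upper()
    counts = []
    for frame in range(3):
        codons = (seq[k:k + 3] for k in range(frame, len(seq) - 2, 3))
        counts.append(sum(1 for codon in codons if codon in STOPS))
    return counts

def longest_open_stretch(seq, frame):
    """Length, in codons, of the longest stop-free run in the given frame."""
    seq = seq.upper()
    best = run = 0
    for k in range(frame, len(seq) - 2, 3):
        if seq[k:k + 3] in STOPS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best

seq = "ATGGCTTAAGCTGCTGCT"  # toy sequence with one in-frame TAA
print(stops_per_frame(seq))
print(longest_open_stretch(seq, 0))
```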

11 Putting it all together: probabilistic framework Hidden Markov Models (HMMs) –Generative model, learn emission, transition probabilities –Easy to train, hard to integrate long-range signals Conditional Random Fields (CRFs) –Discriminative dual of HMMs, learn weights on features –Easy to integrate diverse signals, gradient ascent for training

12 From HMMs … to CRFs [Figure: chain of hidden states y_{i-1}, y_i, y_{i+1} over the observed sequence X, connected through feature functions F(i-1), F(i), F(i+1)]

13 From HMMs … to CRFs Transition and emission probabilities Generative model vs. discriminative model For example, features can simply be e_i and a_ij Or pretty much anything
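The contrast on this slide can be written out using the standard textbook forms of the two models. The notation below is an assumption beyond the slide's e_i and a_ij: a_{kl} is the transition probability from state k to state l, e_k(b) is the probability that state k emits symbol b, f_j are feature functions with weights w_j, and y_0 is a fixed start state.

```latex
% HMM (generative): joint probability of the labels y and the sequence x
P(x, y) = \prod_{i=1}^{n} a_{y_{i-1} y_i} \, e_{y_i}(x_i)

% Linear-chain CRF (discriminative): conditional probability of y given x
P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{j} w_j \, f_j(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i=1}^{n} \sum_{j} w_j \, f_j(y'_{i-1}, y'_i, x, i) \Big)
```

Choosing indicator features for each transition (k, l) with weight \log a_{kl} and for each emission (k, b) with weight \log e_k(b) makes the exponent equal to \log P(x, y), so Z(x) = P(x) and the CRF reduces to the HMM posterior P(y | x). Any other feature of x, such as a conservation score, can be added in the same way.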

14 Running on real genomes Obtain optimal weights (from training set) –Experimentally-defined, genetics, curation, cDNA Apply CRF systematically to new genome –Revisit existing genomes –Annotate new genomes

15 Revisiting fly genome annotation Power of evolutionary signatures –New genes and exons, dubious genes and exons –Adjust gene boundaries: ATG, frame, splice site, seq errors Signatures more powerful than primary signals –Recognize unusual gene structures: read-through, uORFs, editing Towards a revised genome annotation –Curation: FlyBase integrates prediction with cDNA, protein, literature –Experimentation: BDGP large-scale functional validation [Figure: 10,845 fully confirmed (…); 579 fully rejected; 2,499 not aligned; novel exons: 1,454 exons (~800 genes), +668 exons in 443 genes. Species: D. melanog., D. simulans, D. erecta, D. persimilis]

16 Systematic application leads to Exon-level changes –Ex 1: New genes –Ex 2: New exons –Ex 3: Dubious genes More subtle changes –Ex 4: Start/end adjustments –Ex 5: Wrong reading frame –Ex 6: Splice site adjustments –Ex 7: Sequencing errors fixed Unusual gene structures –W1: Stop-codon read-through –W2: uORFs & dicistronic –W3: Internal frame-shifts [Figure panels: Reading Frame Conservation; Codon Substitution Matrix, genes vs. intergenic]

17 Example 1: Known genes stand out Sharp conservation boundaries. Known exons stand out. High sensitivity and specificity. [Figure legend: conserved, substitution, insertion, frameshift, gap]

18 Example 2: Novel multi-exon gene 1,454 novel exons outside known genes –Many cluster in new multi-exon genes –Others are isolated high-confidence exons

19 Example 2b: Novel exons inside known genes (sorry, this example is from human, mouse, dog, rat) 668 cases in fly –New candidate alternatively spliced gene forms –New protein domains

20 Novel genes and exons 1,454 novel exons outside existing genes –60% cluster in 300 multi-exon genes –40% isolated exons 668 novel exons inside existing genes –Alternative splicing: Many with cDNA support –Nested genes: Few known examples Human curation –Collaboration with FlyBase –Hundreds of changes in release 5.1, more in 5.2 Systematic experimentation –Sue Celniker and Berkeley Genome Project –Thousands of new genes in the pipeline

21 Example 3: Dubious single-exon gene Only evidence was an open reading frame –Comparative information much stronger

22 579 Dubious Genes Classification approach: Yes / No answer –Closely related species: both genes and intergenic aligned –Show very different patterns of mutation Comparative analysis provides negative evidence –Alignment is unambiguous, orthologous, spans entire gene –Sequence shows mutations and indels in every species Weak or missing experimental evidence –100 of these independently rejected by FlyBase –These are missing from systematic clone collections –Only 34 (6%) have assigned names (vs. 36% of all fly genes)

23 Systematic application leads to Exon-level changes –Ex 1: New genes –Ex 2: New exons –Ex 3: Dubious genes More subtle changes –Ex 4: Start/end adjustments –Ex 5: Wrong reading frame –Ex 6: Splice site adjustments –Ex 7: Sequencing errors fixed Unusual gene structures –W1: Stop-codon read-through –W2: uORFs & dicistronic –W3: Internal frame-shifts [Figure panels: Reading Frame Conservation; Codon Substitution Matrix, genes vs. intergenic]

24 Example 4: Start codon adjustment Codon substitution patterns suggest new start in 200 genes –Score each substitution using Codon Substitution Matrix (CSM) [Figure: CG6664/FBtr0100439, annotated start codon vs. conserved start codon (ATG); legend: poor CSM score = atypical substitution, high CSM score = protein-like substitution]

25 Example 5: Gene annotated on wrong reading frame cDNA evidence supports overlapping reading frames, both open –Annotation traditionally selects longer one –Conservation enables distinguishing the two [Figure: annotated ORF (345 nt) vs. real ORF (315 nt). mRNA supports both ORFs; conservation only supports the shorter ORF. The shorter ORF is the correct one; CG7738-RA is incorrect]

26 Example 6: Incorrect splice causes wrong frame Second exon annotated in the wrong frame –Due to splice site boundary error –Correction is supported by cDNA evidence [Figure: first exon in correct frame, 2nd exon in incorrect frame; fix exon boundary]

27 Example 7: Detect seq. errors / strain mutations Insertion/deletion causes frameshift –Conservation signature shifts from ‘frame1’ to ‘frame2’ –All other species disagree with D. melanogaster indel –Sequencing error or species-specific mutation
chr3R:6,953,865-6,953,927 (Ugt86Dd)
dm     CAGTACATATTTGTGGAGAGTTACTTGAAAG-CTTGGCAGCTAAGGGTCATCAGGTGACCGTTA
droSec CAGTACATATTTTTGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droSim CAGTACATATTTATGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droYak CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCCAAGGGTCACCAGGTGACCGTTA
droEre CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCTAGGGGTCACCAGGTGACTGTTA
droAna CAGTACATCTTTGTGGAGACCTATCTGAAGGCTTTGGCCGACAAAGGTCACCAGGTGACTGTTA
droWil CAATACATATTCATTGAGGCGTATCTAAAGGCATTGGCTGCCAAAGGACATCAGTTAACTGTGA
droMoj CAGTACATATTCGCCGAGGCGTATTTGAAGGCGCTAGCAGCCCGGGGCCATGAGGTCACCGTGA
droVir CAGTATATATTTGCCGAGTCGTATTTGAAGGCCTTGGCAGCGCGGGGTCATGAGGTGACAGTGA
       01201201201201201201201201201201 2012012012012012012012012012012
[Figure labels: conservation in correct frame (left of the indel); conservation in 2nd frame (right of the indel); frame-shift (sequencing error / recent mutation)]
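The reasoning on this slide (an indel in one species that every other species disagrees with, and whose length breaks the reading frame) can be sketched as a simple scan over the alignment. This is an illustrative sketch, not the method used in the talk: the function name is hypothetical, it only looks for deletions in the target row, and the rows are a short excerpt around the gap copied from the alignment above.

```python
import re

def lineage_specific_frameshifts(target, informants):
    """Return (start, length) for gap runs in `target` that break the frame.

    A run is reported only if its length is not a multiple of three and
    no informant row has a gap anywhere in the same columns.
    """
    hits = []
    for match in re.finditer(r"-+", target):
        start, end = match.span()
        length = end - start
        if length % 3 == 0:
            continue  # frame-preserving gap
        if all("-" not in row[start:end] for row in informants):
            hits.append((start, length))
    return hits

# Excerpt around the D. melanogaster gap in Ugt86Dd
dm = "TTGAAAG-CTTGGCAG"
informants = [
    "TTGAAAGCCTTGGCAG",  # droSec
    "TTGAAAGCCCTGGCAG",  # droYak
]
print(lineage_specific_frameshifts(dm, informants))
```

A hit of this kind is only a candidate: as the slide notes, it may be a sequencing error or a genuine species-specific mutation, and the alignment alone cannot tell the two apart.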

28 Example 8: Dubious gene is a miRNA transcript Evolutionary signatures reveal specific function

29 Systematic application leads to Exon-level changes –Ex 1: New genes –Ex 2: New exons –Ex 3: Dubious genes More subtle changes –Ex 4: Start/end adjustments –Ex 5: Wrong reading frame –Ex 6: Splice site adjustments –Ex 7: Sequencing errors fixed Unusual gene structures –W1: Stop-codon read-through –W2: uORFs & dicistronic –W3: Internal frame-shifts [Figure panels: Reading Frame Conservation; Codon Substitution Matrix, genes vs. intergenic]

30 Unusual genes 1: Stop codon read-through Method #1 (single exons) –112 events, 95 extending known genes → Manual curation: 82 –Enriched in neuronal function Method #2 (after splicing) –256 events, looser cutoff, large overlap, needs manual curation –Enriched in transcription factors [Figure: protein-coding conservation up to the stop codon, read-through with continued protein-coding conservation, then no more conservation after the 2nd stop codon]
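The region of interest for read-through lies between the annotated stop codon and the next in-frame stop; that is where continued protein-coding conservation would be tested. The sketch below only locates that region in a single sequence. The helper names are hypothetical, and the conservation scoring that the slide relies on is not shown.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code

def in_frame_stops(seq, frame=0):
    """Start positions of all stop codons in the given reading frame."""
    seq = seq.upper()
    return [k for k in range(frame, len(seq) - 2, 3) if seq[k:k + 3] in STOPS]

def readthrough_region(seq, frame=0):
    """Half-open interval between the first and second in-frame stop codons.

    Returns None when fewer than two in-frame stops are present.
    """
    stops = in_frame_stops(seq, frame)
    if len(stops) < 2:
        return None
    return stops[0] + 3, stops[1]

seq = "ATGGCTTAAGCTGCTTGA"  # toy ORF: TAA, then two codons, then TGA
region = readthrough_region(seq)
print(region, seq[region[0]:region[1]])
```

In practice the candidate interval would then be scored across species with the same signatures used for ordinary exons.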

31 Unusual genes 2: Polycistronic messages / uORFs Method –High-scoring ORFs with cDNA evidence –Disjoint from the annotated ORF Results –217 cases Protein-coding conservation in the 5’UTR

32 Unusual genes 3: Frame-shift in the middle of exons Method –Exons changing high-scoring frame –Far from splice junctions Results –68 cases in 44 genes
chrX:2,226,518-2,226,639 (CG14047)
dm     GACTATTTCAACAATCAGCAGCGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCCTCACGCAGACCG
droSim GACTATTTCAACAACCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCCTCACGCAGACCG
droSec GACTATTTCAACAACCAACAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCCTCACGCAGACCG
droYak GACTACTTCAACAATCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---GGCGAGATTTGTACCGCCTCCACCGCCTCCGCGTCGCTTGCTGCTCACGCAGACCG
droEre GACTATTTCAACAATCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---GCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTTCTCACGCAGACCG
droAna GACTACTACAACAATCAGCAGCGGGAGCGGCACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGGCCAGCGGCGAAGTTCGTCCCTCCTCCGCCGCCTCCGCGACGTTTGCTTCTCACGCAGACAG
droPse GACTACTACAACAACCAGCAGCGGGAGCGACACTACGAGCTCCGGAGGCAGAGCCAGCGGCAGGCC---AGCGAGGTTTATACCACCGCCGCCGCCTCCGCGTCGCTTGCTGCTCACGCAGACCA
droPer GACTACTACAACAACCAGCAGCGGGAGCGACACTACGAGCTCCGGAGGCAGAGCCAGCGGAAGGCC---AGCGAGGTTTATACCACCGCCGCCGCCTCCGCGTCGCTTGCTGCTCACGCAGACCA
droWil GACTACTACAACAATCAGCAGAGGGAGCGACACTACGAGCAACGTCGCCAAAGCCAGCGGCAGGCC---AGCCAAATTTATACCACCGCCACCGCCTCCACGTCGACTGCTGCTAACGCAGACAA
droMoj GACTACTACAACAACCAGCAGCGGGAGCGGCACTACCAGCTGCGCCACCAGAGCCAACGTCAAGCC---ACCGAGATTTATACCACCACCGCCGCCGCCTCGTCGTCTGCTGCTCACGCAGACAA
droVir GACTACTACAACAACCAACAGCGGGAGCGGCACTACCAGCAGCGCCGCCAGAGCCAACGTCAAGCC---ACCGAGATTCATTCCACCGCCGCCGCCGCCTCGTCGTCTGCTGCTCACGCAGACAA
droGri GACTACTACAACAATCAGCAGCGGGAGCGGCACTATCAACAGCGTCGCCAGAGTCATCGTCAAGCC---ACCGAGATTTATACCACCACCACCGCCACCTCGTCGTCTATTGCTCACGCAGACAA
       012012012012012012012012012012012012012012012012012012012012012012 01201201201201201201201201201201201201201201201201201201
[Figure labels: Frame 1 is high-scoring (left, frame labels 012); Frame 2 is high-scoring (right, frame labels 120)]

33 Initial results for the whole human genome Fully rejected genes: weak/no evidence New exons: existing & novel experimental evidence Need: large-scale functional annotation for novel genes [Figure: 9,862 fully confirmed; 7,717 refined; 1,065 fully rejected; 1,919 not aligned; 454 novel (2,591 exons). Species: human, mouse, rat, dog]

34 Discriminative framework shows continued increase in power [Figure: reading frame conservation (RFC) score and codon substitution matrix (CSM) score, compared for 2, 3, 5, and 12 species; percentage axis ticks omitted]

35 Overview Part 1. Genome interpretation –Evolutionary signatures of genes –Revisiting the human and fly genomes –Unusual gene structures Part 2. Gene regulation –Regulatory motif discovery –microRNA regulation –Enhancer identification Part 3. Genome evolution –Phylogenomics –The two forces of gene evolution –Accurate gene trees in complete genomes

36 Who’s actually doing the work Matt Rasmussen (phylogenomics), Erez Lieberman (motif evolution), Aviva Presser (network evolution), Mike Lin (gene identification), Alex Stark (fly motifs and miRNAs), Pouya Kheradpour (human enhancers), Josh Grochow (network motif discovery), Ameya Deoras (spectral genomics)

