TITLE OF PRESENTATION Board of Scientific Counselors January 2007 Your Name.

1 TITLE OF PRESENTATION Board of Scientific Counselors January 2007 Your Name

5 COMPARATIVE GENOMICS Manolis Kellis Board of Scientific Counselors January 2007

6 TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA CATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC CGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT AGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA AAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG ATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA GTTCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CTCACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA ACTGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT GTCCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GCGGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA ATTAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC TACAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG ATTTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG ATGCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AGAATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT CTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG 
AACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA GCATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT TTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA ATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG TTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAGAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA GAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AAAATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTTTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT ACCCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT TAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA CAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA ACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC AACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT TGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT TCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTA ATGCTGAAATCTATCTTTGGAAAAGATTTACAA

7 Genes: encode proteins. Regulatory motifs: control gene expression.

8 The power of comparative genomics
Species sets: 32 mammals, 9 yeasts, 12 flies, 8 Candida (mammal tree labels: human, mouse, rat, chimp, dog)
Comparative genomics reveals selection
– Functional elements mostly conserved
– Non-functional regions mostly diverged
→ Functional regions stand out
Comparative genomics reveals function
– Each type of function is under unique constraints (proteins, RNA, and motifs each evolve differently)
– Discover them by their distinct evolutionary patterns
→ Evolutionary signatures for each type of element

9 Comparative genomics leads to…
1. Genome interpretation
– Decode the human genome
– Discover all functional elements
→ The building blocks
2. Cell circuitry
– Discover all control constructs
– Regulatory network properties
→ The interconnections
3. Evolutionary innovation
– Emergence of new functions
– Genome and network duplication
→ The dynamics

10 Distinguishing genes from non-coding regions
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
***** * * ** *** *** *** ******* ** ** ** * * ** * ** ** ** ** **** * **
Protein-coding genes have specific evolutionary constraints
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substitutions)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
Encode as ‘evolutionary signatures’
– Computational test for each of them
– Combine and score systematically
Figure labels: splice, frame-shifting indels, periodic mutations, synonymous substitutions
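Two of the constraints listed on slide 10 (gap lengths in multiples of three, and 3-periodic mutations) are easy to turn into computational tests. The following is a minimal sketch, not the method used in the talk: the function names are hypothetical, the toy sequences are made up for illustration, and the reference row is assumed to start in reading frame 0.

```python
import re


def gap_lengths(aligned_row):
    """Lengths of each run of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


def frame_preserving_fraction(aligned_row):
    """Fraction of gaps whose length is a multiple of three."""
    lengths = gap_lengths(aligned_row)
    if not lengths:
        return 1.0  # no gaps, so nothing breaks the frame
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)


def mismatches_by_codon_position(ref, other):
    """Count substitutions at codon positions 0, 1, 2 of the reference.

    Assumes (hypothetically) that the ungapped reference starts in frame 0.
    Columns where either row has a gap are not counted as substitutions.
    """
    counts = [0, 0, 0]
    pos = 0  # position in the ungapped reference
    for a, b in zip(ref, other):
        if a == "-":
            continue
        if b != "-" and a != b:
            counts[pos % 3] += 1
        pos += 1
    return counts


# Toy data, invented for illustration only.
print(gap_lengths("ATG---CCA-TTG------TAA"))
print(mismatches_by_codon_position("ATGGCTAAA", "ATGGCCAAG"))
```

In a coding region one would expect the gap fraction to be close to 1 and the mismatch counts to pile up at the third codon position; in non-coding sequence neither pattern is expected.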

11 Power of evolutionary signatures
Signatures are much more precise than the level of conservation
– Before: parse a genome into high-conservation / low-conservation regions
– Now: parse into protein-coding conservation / RNA-like / motif-like, etc.
Probabilistic framework
Hidden Markov Models (HMMs)
– Generative model; learn emission and transition probabilities
– Easy to train, hard to integrate long-range signals
Conditional Random Fields (CRFs)
– Discriminative dual of HMMs; learn weights on features
– Easy to integrate diverse signals; gradient ascent for training
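To make the HMM half of slide 11 concrete, here is a minimal two-state sketch decoded with the Viterbi algorithm. Everything in it is an assumption for illustration: the states ("C" for coding-like, "N" for non-coding-like), the two observation symbols ("s" for a column whose substitution looks silent/periodic, "r" for one that looks random), and all probabilities are made up rather than learned, and this is not the model used in the talk.

```python
import math

# Hypothetical toy parameters; a real model would learn these from data.
STATES = ("C", "N")
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {"C": {"s": 0.9, "r": 0.1}, "N": {"s": 0.1, "r": 0.9}}


def viterbi(observations):
    """Return the most probable state path for a string of 's'/'r' symbols."""
    if not observations:
        return ""
    # Log-probability of the best path ending in each state.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    back = []  # one dict of back-pointers per transition
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            )
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("ssssrrrr"))
```

A CRF would replace the emission and transition tables with weighted feature functions over the whole observation sequence, which is what makes it easier to integrate diverse signals.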

12 Known genes stand out
Figure legend: substitutions typical of protein-coding regions; substitutions typical of intergenic regions

13 Ability to identify subtle events
Translation start corrected for 200 genes
– Example: CG6664/FBtr0100439 (figure labels: ATG, previously annotated start codon, newly identified start codon)
Stop codon read-through
– Figure labels: protein-coding conservation, stop codon read-through, continued protein-coding conservation, 2nd stop codon, no more conservation
– Hundreds of read-through regions identified
– New mechanism of post-transcriptional control. Many questions remain.
– Enriched in brain proteins, ion channels. Under ADAR control.

14 Summary: Revisiting fly genome annotation
High-accuracy reannotation
– Ability to detect small genes & exons (sen | pre | spe; 40AA: 95|99|99%, 20AA: 87|96|99%)
– Detect subtle events: sequencing errors, start/stop and splice site changes
– Recognize unusual gene structures → read-through, uORFs, RNA editing
Towards a revised genome annotation
– Curation: FlyBase integrates prediction with cDNA, protein, literature
– Experimentation: BDGP large-scale functional validation of novel exons
Figure labels: D. melanog., D. simulans, D. erecta, D. persimilis (…); 454 genes, 800 genes, 668 genes, 12,000 genes; Confirmed, Dubious, Novel, Refined
Powerful approach for comprehensive genome annotation

15 Comparative genomics
1. Genome interpretation
– Decode the human genome
– Discover all functional elements
→ The building blocks
2. Cell circuitry
– Discover all control constructs
– Regulatory network properties
→ The interconnections
3. Evolutionary innovation
– Emergence of new functions
– Genome and network duplication
→ The dynamics

16 The regulatory code
Multiple levels of regulation
– Temporal and spatial regulation, disease, development
– Chromatin, pre- / post-transcriptional, splicing, translational
Combinatorial coding of individual motifs
– The core: a relatively small number of regulatory motifs
– Regions: diverse motif combinations specify diverse functions
Regulatory motifs
– Summarize information across thousands of sites (distinguish: regulatory motifs vs. motif instances)
– Challenging to discover: small (6-8 nucleotides), subtle (frequent degenerate positions), dispersed (act at a distance), diverse (sequence composition)
Figure labels: enhancer regions, promoter motifs, 5’-UTR, 3’-UTR, splicing signals, motifs at RNA level

17 Regulatory motif discovery: study known motifs, derive conservation rules, discover novel motifs

18 Known motifs are preferentially conserved
In multi-species alignments: known motifs → conservation islands
– Conserved biology: conserved regulatory code, same words are functional
– Preferential conservation: stand out from surrounding nucleotides
– Good signal for identifying individual instances of known motifs
Example alignment (motif labels: Gabpa, Errα)
human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-
* * * ********** *** *** *
Need additional power for motif discovery:
– Conservation not limited to exact binding site → additional bases would be found
– Weakly constrained positions can diverge → real motifs will be missed
– How do we discover motifs de novo?
→ Use basic property of regulatory motifs
→ Evaluate genome-wide conservation over thousands of instances
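The idea on slide 18 of scoring a candidate motif by how often its instances are conserved can be sketched on the slide's own four-species alignment. This is an illustrative toy, not the talk's pipeline: the function names are hypothetical, a real analysis would pool thousands of instances genome-wide, and "conserved" is simplified here to an exact match in the same alignment columns in every other species.

```python
# Alignment rows copied from slide 18 (all rows have the same length).
ALIGNMENT = {
    "human": "CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC",
    "dog":   "CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC",
    "mouse": "--------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC",
    "rat":   "--------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-",
}


def find_all(text, word):
    """Yield every (possibly overlapping) start index of word in text."""
    start = text.find(word)
    while start != -1:
        yield start
        start = text.find(word, start + 1)


def conservation(alignment, reference, kmer):
    """Return (conserved, total) instances of kmer found in the reference row.

    An instance counts as conserved only if every other row has exactly
    the same k-mer in the same alignment columns (a deliberate simplification).
    """
    ref_row = alignment[reference]
    others = [row for name, row in alignment.items() if name != reference]
    conserved = total = 0
    for start in find_all(ref_row, kmer):
        total += 1
        end = start + len(kmer)
        if all(row[start:end] == kmer for row in others):
            conserved += 1
    return conserved, total


print(conservation(ALIGNMENT, "human", "TGACCTTG"))  # 8-mer in the conserved block
print(conservation(ALIGNMENT, "human", "GGCC"))      # arbitrary control 4-mer
```

Comparing the conserved fraction of a candidate against that of control k-mers of similar composition is one simple way to ask whether it stands out from surrounding nucleotides.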

19 Systematically discover regulatory motifs
# | Consensus | MCS | Matches to known | Expression enrichment (promoters / enhancers)
1 | CTAATTAAA | 65.6 | engrailed (en) | 25.4 / 2
2 | TTKCAATTAA | 57.3 | reversed-polarity (repo) | 5.8 / 4.2
3 | WATTRATTK | 54.9 | araucan (ara) | 11.7 / 2.6
4 | AAATTTATGCK | 54.4 | paired (prd) | 4.5 / 16.5
5 | GCAATAAA | 51 | ventral veins lacking (vvl) | 13.2 / 0.3
6 | DTAATTTRYNR | 46.7 | Ultrabithorax (Ubx) | 163.3
7 | TGATTAAT | 45.7 | apterous (ap) | 7.1 / 1.7
8 | YMATTAAAA | 43.1 | abdominal A (abd-A) | 72.2
9 | AAACNNGTT | 41.2 | | 20.1 / 4.3
10 | RATTKAATT | 40 | | 3.9 / 0.7
11 | GCACGTGT | 39.5 | fushi tarazu (ftz) | 17.9
12 | AACASCTG | 38.8 | broad-Z3 (br-Z3) | 10.7
13 | AATTRMATTA | 38.2 | | 19.5 / 1.2
14 | TATGCWAAT | 37.8 | | 5.8 / 2
15 | TAATTATG | 37.5 | Antennapedia (Antp) | 14.1 / 5.4
16 | CATNAATCA | 36.9 | | 1.8 / 1.7
17 | TTACATAA | 36.9 | | 5.4
18 | RTAAATCAA | 36.3 | | 3.2 / 2.8
19 | AATKNMATTT | 36 | | 3.6 / 0
20 | ATGTCAAHT | 35.6 | | 2.4 / 4.6
21 | ATAAAYAAA | 35.5 | | 57.2 / -0.5
22 | YYAATCAAA | 33.9 | | 5.3 / 0.6
23 | WTTTTATG | 33.8 | Abdominal B (Abd-B) | 6.3 / 6
24 | TTTYMATTA | 33.6 | extradenticle (exd) | 6.7 / 1.7
25 | TGTMAATA | 33.2 | | 8.9 / 1.6
26 | TAAYGAG | 33.1 | | 4.7 / 2.7
27 | AAAKTGA | 32.9 | | 7.6 / 0.3
28 | AAANNAAA | 32.9 | | 449.7 / 0.8
29 | RTAAWTTAT | 32.9 | gooseberry-neuro (gsb-n) | 110.8
30 | TTATTTAYR | 32.9 | Deformed (Dfd) | 30.7
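The consensus sequences on slide 19 use IUPAC degenerate nucleotide codes (for example K = G or T, W = A or T). A minimal sketch of how such a consensus could be expanded into a regular expression to scan for instances follows; the helper names and the test sequence are hypothetical, and only the forward strand is searched.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Expand an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in consensus)


def instances(consensus, sequence):
    """Start positions of all (possibly overlapping) forward-strand matches."""
    # The lookahead lets overlapping instances be reported separately.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [match.start() for match in pattern.finditer(sequence)]


# Invented test sequence: two matching instances and one near-miss (TTA...).
toy = "GGTTGCAATTAACCTTTCAATTAATTACAATTAA"
print(consensus_to_regex("TTKCAATTAA"))
print(instances("TTKCAATTAA", toy))
```

Scanning the reverse complement as well would be the natural next step, since binding sites can occur on either strand.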

20 Functional clustering of motifs and tissues

21 Motif discovery in human enhancer regions
Can identify 40% of enhancers with 50 motifs
– 3X enrichment (vs. 15% of intergenic regions)
Motif combinations further improve performance
– 5X enrichment for top 30 motif combinations
Figure panels: chromatin signatures of enhancer regions; motif signatures of enhancer regions (74 enhancers, 208 promoters; tracks: H3K4me3, RNAPII, p300, H3K4me1)

22 Evolutionary signatures for microRNA genes
Genome-wide discovery of miRNAs
– 41 novel miRNA genes. Rediscover 81% of known (61 of 74). Reject 4 dubious.
– 454 sequencing of small RNAs confirms 27 of 41 novel miRNAs (66%).
Genomic properties:
– Introns of known genes, including several transcription factors
– Genomic clustering of known and novel miRNAs: poly-cistronic precursors
– Two ‘dubious’ protein-coding genes are in fact miRNAs
→ Improved annotation of miRNA genes

23 Functional properties of microRNA targets
Refine annotation of known miRNA genes
– Start adjustments suggested by the evolutionary signatures, confirmed by sequencing
– A small change in start (+2 nucleotides) implies a great change in target spectrum (>95%)
miRNA targets
– Novel miRNAs include many novel families → distinct groupings of genes
– Targets of novel miRNAs show large overlap with targets of known miRNAs → denser miRNA network
miR-10* as a master Hox regulator
– For three genes, both the miRNA and the miRNA* seem functional by evolution and sequencing.
– For miR-10, the star shows stronger signal, more sequencing reads, more predicted targets.
– Both miR-10 and miR-10* target several Hox genes, more than any other miRNA.
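Slide 23's point that a +2 nucleotide change in the annotated start reshapes the target spectrum follows from how targets are usually predicted: by complementarity to the "seed", commonly taken as nucleotides 2-8 of the mature miRNA. The sketch below illustrates this under that assumption; the mature sequence is an invented toy, not a real miRNA, and the function names are hypothetical.

```python
# RNA complement table (A<->U, C<->G).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed(mature, start_shift=0):
    """Nucleotides 2-8 of the mature miRNA after shifting its 5' end.

    start_shift=2 models an annotation whose start moves by +2 nucleotides.
    """
    return mature[start_shift + 1 : start_shift + 8]


def seed_match(seed_seq):
    """Reverse complement of the seed: the mRNA site it would pair with."""
    return seed_seq.translate(COMPLEMENT)[::-1]


# Invented 22-nt sequence for illustration only.
toy_mature = "AUGCCGUAAGCUUGACCAUGGA"

for shift in (0, 2):
    s = seed(toy_mature, shift)
    print(f"shift +{shift}: seed {s}, target site {seed_match(s)}")
```

Because the two seeds share no full 7-mer site, the sets of mRNAs carrying a match to each are expected to be largely different, which is consistent with the >95% change quoted on the slide.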

24 Comparative genomics
1. Genome interpretation
– Decode the human genome
– Discover all functional elements
→ The building blocks
2. Cell circuitry
– Discover all control constructs
– Regulatory network properties
→ The interconnections
3. Evolutionary innovation
– Emergence of new functions
– Genome and network duplication
→ The dynamics

25 Resolving power in mammals, flies, fungi
Species sets: 10 mammals, 12 flies, 17 yeasts (8 Candida, 9 Yeasts)
Mammals: neutral 2.57 subs/site (opp: 0.62, 32 sps: 4.87); coding 1.16 subs/site; detect 6-mer at FP 10^-6
Flies: neutral 4.13 subs/site; coding 1.65 subs/site; detect 6-mer at 10^-11
Fungi: neutral 15.5 subs/site (Yeast: 6.5, Candida: 6.5); coding 7.91 subs/site; detect 3-mer at 10^-21
Tree labels: post-duplication, pre-dup, diploid, haploid, P; scale bars: 0.3 sub/site, 0.1 sub/site, 0.8 sub/site
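The link on slide 25 between neutral branch length and the false-positive rate for detecting a conserved k-mer can be approximated with a back-of-the-envelope model. The sketch below assumes, as a simplification that is not stated in the slides, that substitutions at each site are independent and Poisson-distributed, so a neutral site stays unchanged with probability exp(-d) for total branch length d, and a k-mer is fully conserved by chance with probability exp(-k·d). The function name is hypothetical.

```python
import math


def chance_conservation(k, neutral_subs_per_site):
    """P(all k sites unchanged by chance) under an independent Poisson model."""
    return math.exp(-k * neutral_subs_per_site)


# (k, neutral subs/site) pairs taken from the slide.
for k, d in [(6, 2.57), (6, 4.13), (3, 15.5)]:
    p = chance_conservation(k, d)
    print(f"{k}-mer at {d} subs/site: p = {p:.1e}")
```

Under this assumption the 4.13 and 15.5 subs/site sets come out at roughly 2×10^-11 and 6×10^-21, in line with the slide's 10^-11 and 10^-21. The 2.57 subs/site set gives roughly 2×10^-7 rather than the slide's 10^-6, so the talk presumably used a somewhat different model or rounding.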

