GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES Shin-Han Shiu Plant Biology / QBMI Michigan State University.

1 GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES Shin-Han Shiu Plant Biology / QBMI Michigan State University

2 Genomes and gene contents (figure labels, approximate gene counts for six genomes: 30,000; 25,000; 10,000; 6,000; 45,000; 17,000)

3 Duplicate genes in the genome  Arabidopsis gene families* *: Clusters obtained by Markov clustering (MCL), using all-against-all BLAST E values as the distance measure
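
The clustering step named in the footnote can be illustrated with a minimal sketch of Markov clustering. It assumes a small hand-made similarity matrix in place of real all-against-all BLAST scores (in practice a similarity such as -log10 of the E value would be one option). The function name `mcl`, the inflation value, the convergence settings, and the toy matrix are all illustrative assumptions, not the settings used for the Arabidopsis gene families.

```python
import numpy as np

def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering on a symmetric, non-negative similarity matrix."""
    m = np.array(similarity, dtype=float)
    np.fill_diagonal(m, m.max())           # add self-loops
    m /= m.sum(axis=0)                     # make every column sum to 1
    for _ in range(max_iter):
        prev = m.copy()
        m = m @ m                          # expansion: two-step random walks
        m = m ** inflation                 # inflation: favor strong flows
        m /= m.sum(axis=0)                 # re-normalize columns
        if np.abs(m - prev).max() < tol:
            break
    # Columns that send their flow to the same attractor rows share a cluster.
    clusters = {}
    for node in range(m.shape[1]):
        attractors = tuple(np.flatnonzero(m[:, node] > 1e-3))
        clusters.setdefault(attractors, []).append(node)
    return sorted(clusters.values())

# Hypothetical similarities: genes 0-2 resemble each other, as do genes 3-4,
# with one weak hit linking gene 2 and gene 3.
toy = [
    [0, 50, 40, 0, 0],
    [50, 0, 45, 0, 0],
    [40, 45, 0, 1, 0],
    [0, 0, 1, 0, 60],
    [0, 0, 0, 60, 0],
]
print(mcl(toy))
```

Higher inflation values tend to give smaller, tighter families, so family counts from this kind of analysis depend on that parameter.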

4 Gene function and duplication  What’s the consequence?

5 Gene function and duplication  What’s the consequence?

6 Focus I: Duplication Mechanism and Loss Rate Gene Duplications: Mechanisms, Consequences, Preferential retention

7 Duplication mechanisms
- Whole genome duplication
- Tandem duplication
- Segmental duplication
- Replicative transposition

8 Lineage-specific gains in plants and animals
Organism     Lineage-specific gains  Normalized gain*  # of genes in families analyzed  % total
Rice         10115                   6743              28467                            35.5 (23.7)**
Arabidopsis  5984                    3990              21936                            27.3 (18.2)**
Human        811                                       21954                            3.7
Mouse        1265                                      24041                            5.3
*: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).
**: Numbers in parentheses refer to percentage total based on normalized gains.
- Substantially more recent duplicates in plants than in animals
- Mostly due to frequent whole genome duplications in plants
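
The normalization described in the footnote can be checked with a few lines of arithmetic. This sketch assumes the normalization is a simple scaling by the ratio of divergence times (100/150), and it reads the rice row as 10115 gains among 28467 genes and the Arabidopsis row as 5984 gains among 21936 genes. The function names are illustrative. Simple scaling gives about 3989 for Arabidopsis, one off from the 3990 on the slide, so the exact procedure or rounding used there may differ slightly.

```python
def normalized_gain(gain, own_divergence_mya, reference_divergence_mya):
    """Scale a gain count to the shorter (reference) divergence time."""
    return gain * reference_divergence_mya / own_divergence_mya

def percent_of_total(count, genes_analyzed):
    """Express a count as a percentage of the genes analyzed."""
    return 100.0 * count / genes_analyzed

# Plant lineages diverged ~150 Mya, human and mouse ~100 Mya (values from the slide).
for name, gain, genes in [("Rice", 10115, 28467), ("Arabidopsis", 5984, 21936)]:
    norm = normalized_gain(gain, 150, 100)
    print(name,
          round(norm),
          round(percent_of_total(gain, genes), 1),
          round(percent_of_total(norm, genes), 1))
```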

9 Gain vs. Loss
- 3 rounds of whole-genome duplications in the Arabidopsis lineage
- ~82% of duplicates from the last round were lost in the past 40 million years
Gene content implied by three successive doublings: 15,000*, 30,000, 60,000, 120,000
Arabidopsis gene content: 21,000**
*: Number of orthologous groups in shared families between Arabidopsis and rice.
**: Number of genes in shared families.
Genome duplications + tandem duplications – gene losses = Arabidopsis gene content
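
The doubling series on this slide can be reproduced with a short calculation. This is only a sketch of the arithmetic behind the figure labels: it assumes every gene is doubled at each whole-genome duplication and ignores tandem duplications and losses between rounds. The function name is illustrative. The overall shortfall it reports (about 82.5% of the no-loss expectation) is close to the ~82% quoted for the last round, but the slide does not say how that figure was derived, so the two should not be assumed to be the same quantity.

```python
def expected_gene_counts(ancestral_groups, wgd_rounds):
    """Gene content after each round if every gene were doubled and none lost."""
    return [ancestral_groups * 2 ** k for k in range(wgd_rounds + 1)]

ancestral_groups = 15_000   # orthologous groups shared with rice (footnote *)
observed_genes = 21_000     # Arabidopsis genes in shared families (footnote **)

series = expected_gene_counts(ancestral_groups, 3)
retained_fraction = observed_genes / series[-1]

print(series)
print(f"retained: {retained_fraction:.1%}, missing: {1 - retained_fraction:.1%}")
```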

10 “Age” distribution of animal duplicates  Steady decay in the number of duplicates  Frequent TD, SD, and RT Ks: rate of nucleotide substitutions in codon sites that do not affect amino acid identity Shiu et al., 2006
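
To make the Ks definition concrete, the sketch below classifies codon differences between two aligned coding sequences as synonymous (same amino acid) or nonsynonymous, using the standard genetic code. It is a raw count, not an estimator of Ks: real estimators also normalize by the number of synonymous sites and correct for multiple substitutions at the same site. The function name and the example sequences are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1        # amino acid unchanged
        else:
            nonsyn += 1     # amino acid changed
    return syn, nonsyn

# TTC -> TTT keeps Phe (synonymous); TAC -> TGC changes Tyr to Cys (nonsynonymous).
print(count_codon_differences("ATGTTCTAC", "ATGTTTTGC"))
```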

11 Plant duplicate “age” distribution  Apparent peak at ~0.18 instead of zero Ks  Frequent WGD, TD, SD (maybe), and RT (in some plants) Shiu et al., 2004

12 Genome remodeling in polyploids  Natural and synthetic polyploids (figure labels: ~348 Mb; ~203 Mb; ~314 Mb; ~257 Mb; 20,000 yr)

13 Experimental approaches  Genome-wide polymorphism monitored by tiling array (figure labels: Genome; Tiled probes; Gap; Resolution; Array, ~6 million features; 20,000 yr)

14 Genome-wide Single Feature Polymorphism
- Mid-parent (MP) vs. Arabidopsis suecica (As)
Polyploid  SFP
Natural    58,517
Synthetic  503

15 Genome-wide Single Feature Polymorphism  Genome-wide polymorphism monitored by tiling array (figure panels: Gene, Pseudogene, Transposon)

16 Genome-wide Single Feature Polymorphism  Duplication or deletion MP duplication or As deletion

17 Genome Survey Sequencing
- Sequence ~40-60 Mb of the Arabidopsis suecica genome
  - 0.15-0.2X coverage, will be done next week!
- Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant
- Ultra-high throughput
  - 20-30 Mb per run, each run 5 hours
  - Will be 100 Mb per run early 2007
- Cost efficient
  - ~$0.3/kb
- Read length rather limited
  - ~100 bp per read now
  - Will be ~200 bp early 2007
- For more information contact:
  - Andreas Weber (aweber@msu.edu)
  - David DeWitt (dewittd@msu.edu)
  - Or Shin-Han Shiu (shius@msu.edu)
- Seminar on instrumentation:
  - 9/29, Friday, 1pm, 1415 BPS
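
The figures on this slide combine into a quick planning estimate. The sketch below is back-of-the-envelope arithmetic only: the function names are illustrative, the pairing of the low target with the high per-run yield (and vice versa) is an arbitrary choice to bracket the range, and the implied genome size is simply the stated sequence amount divided by the stated coverage, not a number given on the slide.

```python
import math

def survey_plan(target_mb, mb_per_run, hours_per_run, usd_per_kb, read_length_bp):
    """Rough run count, instrument time, cost, and read count (1 Mb = 1,000 kb)."""
    runs = math.ceil(target_mb / mb_per_run)
    return {
        "runs": runs,
        "instrument_hours": runs * hours_per_run,
        "cost_usd": round(target_mb * 1000 * usd_per_kb, 2),
        "reads": round(target_mb * 1_000_000 / read_length_bp),
    }

def implied_genome_size_mb(sequenced_mb, coverage):
    """Genome size implied by coverage = sequenced bases / genome size."""
    return sequenced_mb / coverage

low = survey_plan(40, 30, 5, 0.3, 100)    # 40 Mb target, 30 Mb per run
high = survey_plan(60, 20, 5, 0.3, 100)   # 60 Mb target, 20 Mb per run
print(low)
print(high)
print(round(implied_genome_size_mb(40, 0.15)), round(implied_genome_size_mb(60, 0.2)))
```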

18 Summary: Gene duplication and polyploidy
- Gene duplication occurred frequently in eukaryotes, but most duplicates are lost.
- In plants, whole genome duplication is common, but gene loss also occurred frequently.
- After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.
- After 20,000 generations, most coding genes do not have clustered sequence polymorphisms that are indicative of deletion.
- Clustered polymorphisms are mostly located in pseudogenes and transposons.
- Survey sequencing is necessary to determine if some coding genes have become pseudogenes without being deleted.

19 Focus II: Differential Retention of Duplicates Gene Duplications: Mechanisms, Consequences, Preferential retention

20 Duplicate genes in the genome  Arabidopsis gene families* *: Clusters obtained by Markov clustering (MCL), using all-against-all BLAST E values as the distance measure

21 Large gene families in plants  One of the largest gene families

22 Normalized gain: % expanded OGs  Large family sizes do not necessarily indicate higher expansion rates

23 Ancestral family sizes and gene gains  Large ancestral families tend to have more lineage-specific gains, but with many exceptions

24 Differential expansion of functional categories
- GO: Gene Ontology
- Protein ubiquitination
- Polysaccharide biosynthesis
- Cell wall modification
- Transcriptional regulation
- Biotic stress response
- Secondary metabolism

25 Differences in Duplicability
Table columns: Category, Arabidopsis, Human
Categories: Defense response; Proteolysis; Transport; Ion channel activity; Metabolism; Development; Protein kinase activity; Transcription factor activity
- Duplicability
  - The propensity for the retention of a duplicate gene
- Computational analysis of genome-wide trend

26 Kinase superfamily sizes among eukaryotes
Organism                      Number of genes  Kinase superfamily  Percent total gene
Arabidopsis thaliana          25,814           1041                4.0
Oryza sativa subsp. indica    ~35,000          1607                3.6
Chlamydomonas reinhardtii     ~12,200          414                 3.4
Plasmodium falciparum         5,334            94                  1.8
Plasmodium yoelii             7,681            70                  0.9
Caenorhabditis elegans        19,484           417                 2.1
Drosophila melanogaster       13,808           262                 1.9
Anopheles gambiae             15,088           216                 1.4
Ciona intestinalis            15,852           316                 2.0
Fugu rubripes                 33,609           632                 1.9
Mus musculus                  22,444           495                 2.2
Homo sapiens                  22,980           472                 2.1
Saccharomyces cerevisiae      6,449            113                 1.8
Candida albicans              6,164            95                  1.5
Neurospora crassa             10,082           104                 1.9
Schizosaccharomyces pombe     4,945            109                 2.2
Shiu & Bleecker, 2003

27 Kinase families in rice and Arabidopsis  Gene count differences among families indicate differential expansion Shiu et al., 2004

28 Estimation of ancestral RLK family size  Kinase phylogeny of Arabidopsis and rice RLKs (figure labels: panels A. and B.; 440 speciation points; rice; Arabidopsis; WAK; LRR VIII, X, XII) Shiu et al., 2004

29 Development vs. resistance/defense RLKs Shiu et al., 2004

30 Contradiction
- Plant genes involved in development tend to have high duplicability
Developmental RLKs: low duplicability
Resistance/Defense RLKs: high duplicability
Animal tyrosine kinases: low duplicability
Transcription factors: high duplicability

31 Selection for expansion  Depends on the level of variation in the signals

32 Summary: differential retention  Longevity and duplicability of plant genes High Low High Low Duplicability Longevity Examples Transcription factors Resistance genes Enzymes in central metabolic pathways ??

33 Focus III: Functional Consequences Gene Duplications: Mechanisms, Consequences, Preferential retention

34 Functional Consequences of Duplication
- Functional divergence and conservation
- Is it because of changes in cis-regulatory elements or in coding sequences?
- How are duplicates retained: subfunctionalization or neofunctionalization?

35 Divergence in gene expression
- Develop pipelines for cis-element prediction and motif functional prediction
Pipeline diagram labels: Expression data; Clusters of genes with similar expression profiles; Over-represented sequence motifs in 5’ regions; Machine learning; Cis-regulatory logic; Experimental validations
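
One common way to look for over-represented motifs in the 5’ regions of a co-expressed cluster is a hypergeometric test on motif presence. The slide does not say which statistic the pipeline uses, so the sketch below is an assumption for illustration; the function names and the idea of scoring a single exact-match motif are hypothetical simplifications. It assumes the cluster promoters are a subset of the full promoter set, and it needs Python 3.8 or later for `math.comb`.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n promoters are drawn from N, of which K carry the motif."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def motif_enrichment(motif, cluster_promoters, all_promoters):
    """Return (cluster hits, genome-wide hits, enrichment p-value) for one motif."""
    N = len(all_promoters)
    K = sum(motif in p for p in all_promoters)
    n = len(cluster_promoters)
    k = sum(motif in p for p in cluster_promoters)
    return k, K, hypergeom_sf(k, N, K, n)

# Hypothetical promoters: the motif occurs only in the three clustered genes.
cluster = ["TTACGTGAA", "ACGTGCCC", "GGGACGTG"]
background = cluster + ["TTTTTTTT"] * 7
print(motif_enrichment("ACGTG", cluster, background))
```

In a real analysis every candidate k-mer would be tested, so the p-values would need a multiple-testing correction.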

36 Divergence in post-translational modification
- Conservation of phosphorylation sites across species
- SACE: budding yeast
- CAGL: Candida glabrata
- CAAL: Candida albicans
- CATR: Candida tropicalis
- NECR: Neurospora crassa
- DEHA: Debaryomyces hansenii

37 Detailed Functional Studies of Duplicate Genes
- Functional analyses of DDF1 and DDF2 transcription factors
- Derived from recent whole genome duplication in Arabidopsis
- Related to the well known CBF factors involved in cold and drought stress
Diagram labels (shown for both Arabidopsis thaliana and Arabidopsis lyrata): DDFs; Promoter GFP; Knockouts; Over-expression studies; Interacting proteins; Binding targets

38 Focus IV: Protein space Gene Duplications: Mechanisms, Consequences, Preferential retention

39 Tiling array analysis of transcriptome  Human Chr 21, 22 Kapranov et al., 2002

40 Posterior probability p(F|coding)

41 Performance of the CI measure
- Known Arabidopsis exons and introns, 90-300 bp
- Arabidopsis small proteins that are not annotated
  - Correctly predicts 19 out of 20 (95%)
- Yeast sORFs with translation evidence
  - Correctly predicts 98 out of 114 (86%)
- In “intergenic” sequences of the Arabidopsis genome
  - 3,274 sORFs identified

42 Coupling with tiling array expression  Hybridization intensities for feature types

43 Summary: Novel coding genes
- Many unannotated regions in the genomes are expressed.
- Using the CI measure, many proteins from yeast and Arabidopsis that were not annotated but have evidence of expression are identified correctly.
- Using the CI measure, we estimated that ~3,000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.
- Using tiling array data, we found that many of these novel coding regions are expressed.

44 Acknowledgement
- Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou, Emily Eckenrode
- University of Chicago: Justin Borevitz, Xu Zhang
- University of Wisconsin: Sara Patterson, Rick Vierstra
- University of Missouri: Scott Peck
- Michigan State University: Many…; Rong Jin, Comp Sci & Eng; Yue-Hua Cui, Stat & Prob
- Startup fund

45 Recent completion …

46 Genome remodeling in polyploids
- Genome duplication occurs frequently in plants
- What is the fate of duplicates?
- How fast do gene losses occur?
- Is there any preference in genes retained?
Diagram: genes A, B, C, D, E are duplicated into copies A1 to E1 and A2 to E2, followed by gene losses at times t1 and t2; number of genes Ng = 5, 10, 8, 5

47 Comparing degrees of expansion
Workflow labels: Combined set (Arabidopsis: ~25,000 proteins; rice prediction: ~66,000 genes); Gene/domain families (shared, unique); Pairwise distance; Putative orthologous groups; All orthologous groups
Diagram example: GO:0001, $u_i = 1$, $e_i = 4$
Total unexpanded = $\sum_i u_i$
Total expanded = $\sum_i e_i$

48 Major questions on gene duplication  When: timing of gene duplications, e.g. N = 10

49 Domain gains in rice and Arabidopsis  Gain in one lineage does not necessarily predict gain in the other

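50 Identify novel small coding genes
- Determine base composition probabilities
  Coding sequences give the CDS parameters; non-coding sequences give the NCDS parameters
  $P_c(\mathrm{AAA}) = \dfrac{\#\ \text{of AAA}}{\#\ \text{of all NNN}}$
  $P_c(\mathrm{T} \mid \mathrm{AAA}) = \dfrac{P_c(\mathrm{AAAT})}{P_c(\mathrm{AAA})}$
- Calculate posterior probability (diagram labels: c1 c2 c3, c4 c5 c6)
- Feature tables (diagram label: n)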
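
The two probabilities on this slide can be estimated directly from k-mer counts. The sketch below assumes that counts are taken over all overlapping windows of a training set; the slide does not say whether counting is restricted to particular codon positions, so that choice, the function names, and the toy sequence are illustrative assumptions. The same functions would be run once on coding sequences (CDS parameters) and once on non-coding sequences (NCDS parameters).

```python
from collections import Counter

def kmer_frequencies(sequences, k):
    """Relative frequency of every k-mer over all overlapping windows."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def conditional_probability(sequences, context, next_base):
    """P(next_base | context) = P(context + next_base) / P(context)."""
    p_context = kmer_frequencies(sequences, len(context)).get(context, 0.0)
    p_joint = kmer_frequencies(sequences, len(context) + 1).get(context + next_base, 0.0)
    return p_joint / p_context if p_context else 0.0

training = ["AAATAAAC"]   # hypothetical training sequence
print(kmer_frequencies(training, 3)["AAA"])
print(conditional_probability(training, "AAA", "T"))
```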

51 Setting up the Bayes’  Priors  S = ATG TTC TAC TTT G… …
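
A minimal sketch of the Bayes setup, assuming a much simpler model than the one on the slides: each in-frame codon of S is treated as independent, with one frequency table for coding and one for non-coding sequence. The tables, the prior of 0.5, the floor value for unseen codons, and the function names are all hypothetical. The posterior is p(coding | S) = p(S | coding) p(coding) / [p(S | coding) p(coding) + p(S | non-coding) p(non-coding)], computed in log space to avoid underflow on long sequences.

```python
import math

def log_likelihood(seq, codon_freq, floor=1e-6):
    """Sum of log codon probabilities; unseen codons get a small floor value."""
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return sum(math.log(codon_freq.get(c, floor)) for c in codons)

def posterior_coding(seq, coding_freq, noncoding_freq, prior_coding=0.5):
    """Posterior probability that seq is coding, for 0 < prior_coding < 1."""
    log_c = log_likelihood(seq, coding_freq) + math.log(prior_coding)
    log_n = log_likelihood(seq, noncoding_freq) + math.log(1.0 - prior_coding)
    diff = log_n - log_c
    if diff > 700:                            # exp() would overflow; posterior is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Hypothetical codon frequency tables for the first codons of S = ATG TTC TAC ...
coding = {"ATG": 0.4, "TTC": 0.3, "TAC": 0.3}
noncoding = {"ATG": 0.1, "TTC": 0.1, "TAC": 0.1}
print(posterior_coding("ATGTTCTAC", coding, noncoding))
```

The prior matters when scanning intergenic sequence, where true coding windows are rare, so a value well below 0.5 may be more realistic there.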

52 Coding Likelihood (CL)  Sliding windows of a sequence  Simulation based on NCDS (introns) 1 2 3 4 … n

53 Divergence in post-translational modification  Conservation of phosphorylation sites across species

