1
EVOLUTIONARY GENOMICS
Dušan Kordiš, Jožef Stefan Institute, Department of Molecular and Biomedical Sciences
2
Evolutionary genomics: molecular evolution at the genomic scale
-Whole-genome approach versus a few genes.
-Operationally defined as the simultaneous investigation of the structure and function of very large numbers of genes. Genetics looks at single genes, one at a time, as a snapshot.
-Genomics tries to look at all the genes as a dynamic system, over time, and to determine how they interact and influence biological pathways and physiology in a much more global sense.
-Examples of genomics: systems biology (rather than single genes in a pathway) and network biology (how pathways interact with each other).
Comparative genomics
-The comparison of complete genome sequences to understand general principles of genome structure and function.
-The study of human genetics by comparison with model organisms.
-A comprehensive view of large-scale changes in synteny, gene order, and regions of nonconservation, while simultaneously affording exquisite molecular resolution at the level of the nucleotide.
-Example: comparative analysis of the Y chromosome using human and ape genomes.
8
Eukaryotic genome projects
Mammalian comparative genomics; protozoan genomics for drug discovery; fish genomic resources; plant genomic resources
9
Comparisons of Genomes at Different Phylogenetic Distances Are Appropriate to Address Different Questions
10
1. From gene to genome
-The genes of higher organisms are broken up, and many non-coding DNA sections are strung together with regulatory sections.
-The genetic structure of higher organisms is consequently exceedingly complex. For this reason, today's definition of a gene also includes those regulatory DNA sequences that decide when protein-coding gene sections are activated and deactivated, as well as the quantities in which a protein is produced in a cell.
-The complexity of genes is one of the reasons why it is still not known exactly how many genes make up a human being, although the DNA sequence has been almost completely decoded. The complete genetic information describing an organism is subsumed under the term genome.
-Genomics: the study of genes and their function; more broadly, the study of all of the nucleotide sequences, including structural genes, regulatory sequences, and non-coding DNA segments, in the chromosomes of an organism.
11
STRATEGIES FOR THE SYSTEMATIC SEQUENCING OF COMPLEX GENOMES
Two main shotgun-sequencing strategies. a | Schematic overview of clone-by-clone shotgun sequencing. b | Schematic overview of whole-genome shotgun sequencing. Typically, tens of millions of sequence reads are generated and these in turn are subjected to computer-based assembly to generate contiguous sequences of various sizes. Sequence assembly
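The computer-based assembly step mentioned above can be illustrated with a toy greedy overlap merger. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` cutoff are illustrative choices, not taken from the slides or from any real assembler.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]  # hypothetical error-free reads
print(greedy_assemble(reads))
```

Real genomes break this approach quickly: repeats longer than a read create ambiguous overlaps, which is one reason clone-by-clone strategies first restrict assembly to a small mapped region.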
12
2. Genome components
Billions of base pairs of nearly complete genomic sequence, with their vast intergenic distances, contain:
-highly dispersed upstream and downstream regulatory elements,
-overlapping genes, alternative exons,
-somatic rearrangements,
-paralogs,
-gene families large and small,
-apparently randomly scattered pseudogenes,
-massive assembly-defeating arrays of repetitive sequences of all sorts,
-great volumes of transposon-inserted material, and
-confusing recent accretions of highly paralogous material in subtelomeric and pericentromeric regions.
13
The components of the human genome
Only about 1.5% of the more than 3 billion base pairs encode protein products, whereas about 45% consists of genomic parasites (transposable elements: DNA transposons, LTR retrotransposons, LINEs, and SINEs).
14
Gene ontology (GO) annotations for mouse and human proteins
The GO terms assigned to mouse (blue) and human (red) proteins based on sequence matches to InterPro domains are grouped into approximately a dozen categories. These categories fall within each of the larger ontologies of cellular component (a), molecular function (b), and biological process (c). In general, mouse and human have similar percentages of proteins in most categories.
16
Overlapping genes in vertebrate genomes
Overlapping genes in mammalian genomes are unexpected phenomena even though hundreds of pairs of protein-coding overlapping genes have been reported so far. Overlapping genes can be divided into different categories based on the direction of transcription as well as on the sequence segments shared between overlapping coding regions. The biological functions of natural antisense transcripts, and their involvement in physiological processes and gene regulation in living organisms, are not fully understood. A number of documented examples indicate that they may exert control at various levels of gene expression, such as transcription, mRNA processing, splicing, stability, transport, and translation. Similarly, the evolutionary origin of such genes is not known; existing hypotheses can explain only selected cases of mammalian gene overlaps, which could originate as a result of rearrangements, overprinting, and/or adoption of signals in the neighboring gene locus.
Different types of overlapping genes. (a) Genes sharing the same locus on the same strand but coding for different proteins. (b) Genes sharing a promoter region. (c) Nested gene. (d) Embedded gene. (e) Genes on opposite strands with overlapping loci but no overlap in the exonic region. (f) Tail-to-tail overlap in the exonic region. (g) Head-to-head overlap involving 3'-UTRs and coding sequence. Dark (red) boxes: coding sequence; light (blue) boxes: untranslated regions; patterned (green) box: promoter region.
17
Widespread occurrence of antisense transcription in the human genome (and in a variety of eukaryotic organisms)
18
Like the Dark Matter of the Universe, the Dark Matter of the Eukaryotic Transcriptome Is Becoming Better Defined
Several safeguards exist within eukaryotic cells that prevent adventitious expression of protein from transcripts resulting from spurious RNA Pol II initiation while simultaneously providing cells with valuable potential to evolve new functional genomic loci. In this figure, areas marked with X's highlight mechanisms that result in transcript degradation, and the area in light gray designates translational silencing.
19
ncRNAs (microRNAs, antisense transcripts, and other transcriptional units without any clear "ORF") are important players in the complex network of molecular regulation and interaction within eukaryotic cells.
In this model, ncRNA deregulation, in addition to other molecular defects, may underlie or be a marker for complex diseases.
20
3. Comparative methods and the functional annotation of genomes
-Functional regions in genomes can be identified by searching for conservation among genome sequences, because functional regions are believed to be under stabilizing selection and should be preferentially conserved over evolutionary time. This approach has been used successfully in the annotation of animal and yeast genomes.
-Both coding and regulatory regions can be identified by locating genomic areas under purifying selection, although regulatory regions tend to be less conserved and pose greater challenges to the comparative genomics approach. Comparisons at varying levels of evolutionary divergence are likely to reveal functional regions characteristic of different taxonomic groups; even intraspecific genomic approaches have been shown to be useful in predicting functional sequence motifs.
-The reliability and usefulness of comparative genomics for genome annotation will depend on the continuous improvement of prediction algorithms, as well as on improved characterization of the varying neutral evolutionary rates across sequenced genomes. Comparisons among multiple species, although computationally intensive, have also been shown to be a powerful method for predicting functional genomic regions.
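The idea of locating conserved regions can be sketched as a sliding-window identity scan over a pairwise alignment. This is a deliberately simplified illustration: the function names, the window size, and the fixed 0.75 cutoff are assumptions made for the example, and real annotation pipelines compare conservation against an estimated neutral rate rather than a fixed threshold.

```python
def window_identity(seq_a: str, seq_b: str, window: int, step: int = 1):
    """Fraction of identical, non-gap columns in each window of an alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for start in range(0, len(seq_a) - window + 1, step):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        scores.append((start, matches / window))
    return scores


def conserved_windows(seq_a: str, seq_b: str, window: int, threshold: float):
    """Start positions of non-overlapping windows at or above the identity cutoff."""
    return [
        start
        for start, identity in window_identity(seq_a, seq_b, window, step=window)
        if identity >= threshold
    ]


# Hypothetical aligned sequences: the first 8 columns match, the last 4 do not.
species_1 = "ACGTACGTAAAA"
species_2 = "ACGTACGTCCCC"
print(conserved_windows(species_1, species_2, window=4, threshold=0.75))
```

The same scan run on a closely related pair would mark almost everything as conserved, which is why the choice of species at an appropriate evolutionary distance matters as much as the scoring.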
24
Genome annotation past, present, and future: how to define an ORF at each locus
Recommended preference order for gene structure predictions. Alignments of full ORF cDNA sequences to the loci from which they were transcribed should take precedence. At loci where there is no full ORF cDNA alignment, alignments of highly similar proteins (probably >95% identity) should be used. Third best is a gene structure that is partially determined by an EST aligned to its native locus, possibly extended to a full ORF by de novo prediction. At loci where none of these are available, pure de novo predictions using dual- or multi-genome prediction algorithms should be used.
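The preference order above amounts to a simple priority lookup per locus. The sketch below assumes each locus arrives as a dictionary from evidence type to a candidate gene model; the evidence labels and the function name `pick_gene_model` are hypothetical and chosen only for this illustration.

```python
from typing import Optional

# Evidence types in decreasing order of preference, as listed in the text.
EVIDENCE_PRIORITY = [
    "full_orf_cdna",    # full ORF cDNA aligned to its own locus
    "similar_protein",  # alignment of a highly similar protein
    "est_extended",     # EST-supported structure, possibly extended de novo
    "de_novo",          # pure dual- or multi-genome de novo prediction
]


def pick_gene_model(models_by_evidence: dict) -> Optional[tuple]:
    """Return (evidence_type, model) for the best-supported model, or None."""
    for evidence in EVIDENCE_PRIORITY:
        model = models_by_evidence.get(evidence)
        if model is not None:
            return evidence, model
    return None


# Hypothetical locus with two competing predictions.
locus = {"de_novo": "model_B", "similar_protein": "model_A"}
print(pick_gene_model(locus))
```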
25
Whole-genome alignments available online
27
An example of genome annotation
The Ensembl web site is a rich source of annotations on the human genome.
28
URLs for Accessing Precomputed Whole-Genome Alignments and Their Analysis
Examples of UCSC Genome Browser Views of Genes and Alignments
30
H-InvDB is an Integrated Database of Annotated Human Genes and Transcripts
31
Nucleic Acids Research
Volume 35, Database issue, January 2007
The 2007 Database Issue of Nucleic Acids Research is the fourteenth in a series dedicated to databases in the field of molecular biology. These databases are essential resources for experimental and computational biologists alike, and this compilation provides descriptions and updates of the most important of these databases and serves to introduce newly compiled resources that provide specialist information in the biological area. The current issue is the largest yet and presents 68 new databases and updates of 106 existing databases. The 2007 Database Issue is not included in the print subscription to NAR. Instead, the Database Issue is freely available online at:.
32
4. The evolution of genome function
-To understand the functional roles of genes and their evolutionary histories. Especially useful has been the appearance of:
-genome-based methods to identify genomic regions of functional importance,
-intraspecific whole-genome sequences (e.g. subspecies of rice or ecotypes of Arabidopsis), which can reveal genomic regions with markedly low or high levels of single nucleotide polymorphism, possible indicators of positive or balancing selection, both of which are signatures of adaptive evolution.
-Products of genes are embedded in large-scale interaction networks that represent integrated functional units at the molecular genetic level. Thus, to understand the evolution of function, it becomes necessary to understand the evolutionary dynamics of molecular genetic networks.
35
Identifying conserved DNA sequence elements
Alignment of genome sequences and tools for visualizing those alignments have proven remarkably useful in identifying how evolution has resulted in the conservation of genetic elements and the erasure of nonessential sequence. The figure shows a multi-species alignment of the region homologous to the part of human chromosome 19 that spans the gene encoding apolipoprotein E.
36
Different Modes of Gene Evolution Increase the Diversity of Gene Function and Minimize Pleiotropy
The Modular Architecture of the cis-Regulatory Regions of Pleiotropic Genes Enables the Independent Evolution of Gene Expression in Different Body Parts
37
Gene Ontology functional assignments to eukaryotic proteomes that have been completely sequenced
38
5. The evolution of genome structure
-numerous species-specific details, -genome size, -gene number, -patterns of sequence duplication, -a catalog of transposable elements, and -syntenic relationships. These studies have underscored the diverse architectures of eukaryotic genomes.
39
Processes that generate qualitative and quantitative variation in gene number and overall DNA content in plant nuclear genomes
Small arrows indicate minor events, and large arrows indicate large events.
40
Mutational processes that generate genomic variation
Non-Allelic Homologous Recombination (NAHR) (gene conversion and rearrangement) contributes to sequence variation and structural polymorphism in the human genome.
41
Evolutionary mechanism for the origin of the new gene family
42
Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution
43
Genome-wide intraspecific DNA-sequence variations in rice:
extensive microcolinearity in gene order and content. However, deviations from colinearity are frequent owing to insertions or deletions. Intraspecific sequence polymorphisms commonly occur in both coding and non-coding regions. These variations often affect gene structures and may contribute to intraspecific phenotypic adaptations.
Sequence comparison of an orthologous region (of about 100 kb) from the two cultivated rice subspecies, Oryza sativa L. ssp. indica (cv. GLA4) and Oryza sativa L. ssp. japonica (cv. Nipponbare). Light-gray shading indicates their homologous regions, and black bars show the insertions or deletions (indels) that have occurred in the two subspecies. Repetitive elements are shown by bars of different colors. Predicted gene orders and structures in the top (Gene +) and bottom (Gene -) strands are indicated in dark blue and red, respectively. MITE, miniature inverted-repeat transposable element.
44
ALU REPEATS AND HUMAN GENOMIC DIVERSITY
Schematic of Alu-induced damage to the human genome. a | Potential consequences of insertion of a new element in the vicinity of a gene. The coloured boxes represent exons. The red arrows show existing Alu elements that are orientated in different directions in the introns of the gene. The site of insertion of an Alu element influences the effect of this insertion on the genome as shown. b | Unequal, homologous recombination between two Alu elements that are located in two different introns. The arrows that are broken by dashed lines show the path of the recombination event. The genes below show that a deletion has occurred in one copy, whereas a duplication has occurred in the other; either is likely to be deleterious.
45
6. Evolutionary dynamics of changes in genome structure and their consequences on gene content and evolution
-Whole-genome sequences at both intraspecific and interspecific levels have resolved several key issues surrounding the evolution of genome architectures. These include:
-the extents and rates of change in genome structure and size,
-patterns of large-scale genome duplications (including polyploidization),
-the dynamics of the origins and extinctions of genes,
-the role of selection acting on large-scale variation in genome structure and organization,
-the evolutionary forces that determine transposable element activity and number and the functional consequences of these mobile elements, and
-the extent and impact of epigenetic markings on genome and organismal evolution.
50
Patterns of alternative splicing
Alternative splicing and genome complexity.
Alternative splicing opens neutral paths for an accelerated rate of new exon creation.
EST estimations of alternative splicing from different eukaryotes. The relative prevalence of the four types of alternatively spliced exons (exon skipping, alternative 3'ss, alternative 5'ss, and intron retention) is shown for the different organisms.
Alternative splicing increases transcriptome and proteome diversification.
51
Conservation of synteny between human and mouse.
Segments and blocks >300 kb in size with conserved synteny in human are superimposed on the mouse genome.
52
Natural selection on protein-coding genes in the human genome
Red bars indicate loci under negative selection and blue bars are loci under positive selection at 95% credibility level. Loci with very strong evidence of selection (> 99% credibility) are denoted by their HUGO name, and within this category genes with a morbidity entry in the OMIM database are denoted by an asterisk.
53
Possible mechanisms of genome expansion in the grass genomes
Possible mechanisms leading to genome contraction
54
Lineage-specific expansions: Transcription Factor Families Have Much Higher Expansion Rates in Plants than in Animals
After examining the lineage-specific expansion of TF families in two plants, eight animals, and two fungi, we found that TF families shared among these organisms have undergone much more dramatic expansion in plants than in other eukaryotes. The high rate of expansion among plant TF genes and their propensity for parallel expansion suggest frequent adaptive responses to selection pressure common among higher plants.
55
Evolution of exon–intron structure and alternative splicing
Schematic representation of the exon–intron structure and alternative splicing of the D. melanogaster genes and their orthologs in D. pseudoobscura and A. gambiae.
-Constitutive exons and splice sites are more conserved than alternative ones,
-internal alternatives are more conserved than terminal ones,
-retained introns are the least conserved, alternative acceptor sites are slightly more conserved than donor sites, and mutually exclusive exons are almost as conserved as constitutive exons. Cassette and mutually exclusive exons experience almost no intron insertions,
-the results agree with the observations made earlier in human–mouse comparisons and demonstrate that the phenomenon of relatively low conservation of alternatively spliced regions may be universal, as it has been observed in different taxonomic groups (mammals and insects) and at various evolutionary distances.
56
Structural variation in the human genome
Array-based, genome-wide methods for the identification of copy-number variants.
The complexity of segmental duplications and copy-number variants.
57
Non-random mechanisms of segmental duplication (SD) evolution
Recent segmental duplications on human chromosome 7. The distribution of both interchromosomal (red) and intrachromosomal (blue) duplications that occurred over the last ~30 million years is shown for human chromosome 7.
Gene innovation in segmental duplications (SDs).
Segmental duplication content of hominoids: hyperexpansions in chimpanzee.
58
Genome-wide view of duplication
Gold bars indicate "hot spots of genomic instability" (SD).
59
Hotspots for copy number variation in chimpanzees and humans
Copy number variation is surprisingly common among humans and can be involved in phenotypic diversity and variable susceptibility to complex diseases. Many CNVs were observed in the corresponding regions in both chimpanzees and humans, especially those CNVs of higher frequency. Strikingly, these loci are enriched 20-fold for ancestral segmental duplications, which may facilitate CNV formation through nonallelic homologous recombination mechanisms. Therefore, some of these regions may be unstable "hotspots" for the genesis of copy number variation, with recurrent duplications and deletions occurring across and within species.
Model for evolution of CNV hotspots. Certain segmental duplications that arose in a human–chimpanzee common ancestor (depicted at point A) may facilitate separate nonallelic homologous recombination (NAHR) in both chimpanzees (B) and humans (C), leading to the genesis of CNVs in both species. If NAHR in these regions occurs frequently, it may be expected to lead to the maintenance of common CNVs by way of recurrent duplications and deletions.
60
Influence of structural variants on phenotype
Structural variants can be benign, can have subtle influences on phenotypes (for example, they can modify drug response), can predispose to or cause disease in the current generation (for example, owing to inversion, translocation or microdeletion that involves a disease-associated gene), or can predispose to disease in the next generation.
61
Mammalian chromosomal evolution is driven by regions of genome fragility
-There is a striking correspondence between fragile site (FRA) location, the positions of evolutionary breakpoints, and the distribution of tandem repeats throughout the human genome, which similarly reflect a non-uniform pattern of occurrence.
-Certain chromosomal regions in the human genome have been repeatedly used in the evolutionary process. As a consequence, the genome is a composite of fragile regions prone to reorganization that have been conserved in different lineages, and genomic tracts that do not exhibit the same levels of evolutionary plasticity.
62
7. Population genomics and phylogenomics
-A key tenet of evolutionary genetics is that natural selection affects single genes or gene regions,
-population processes, such as gene flow, range expansion, or bottlenecks, leave their imprint on all genes in the genome,
-unprecedented amounts of genome information are available with which to characterize population history and structure,
-the evolutionary dynamics of gene families and transposable elements can be studied in a population context. A consequence of the proliferation of genome studies has been the documentation of patterns of genome variation between species.
-Random genetic drift, the mutation process, recombination, long-term demography, and selection act collectively to shape the landscape of human polymorphism structure. Higher mutation rates lead to more polymorphisms; random drift drives most novel mutations to extinction while preserving some; recombination shuffles mutations that originally arose on the same chromosome and breaks down allelic association; population bottlenecks reduce genetic diversity; selection promotes the spread of an advantageous allele, allowing non-functional alleles in close proximity to "hitchhike" with it.
-Reconstruction of human demographic history: changes in long-term population size, such as population expansion, collapse, or bottleneck, leave an imprint on genome-wide SNP distributions. For example, expansion gives rise to many rare alleles, whereas a collapse preferentially weeds out rare alleles, leading to an over-representation of high-frequency or common alleles. Human polymorphism and genotype data available on the genome scale now provide sufficient data to infer these patterns for large world populations.
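The link between demography and allele frequencies can be made concrete with the site frequency spectrum and Tajima's D, a standard summary statistic that the slides do not name but that captures the same contrast: an excess of rare alleles (expansion-like) gives negative values, and an excess of intermediate-frequency alleles (bottleneck-like) gives positive values. The sketch below assumes the input is a list of derived-allele counts per segregating site in a sample of `n` chromosomes; the function names and the toy counts are illustrative only.

```python
import math


def site_frequency_spectrum(derived_counts, n):
    """Count sites by the number of sampled chromosomes carrying the derived allele."""
    sfs = [0] * (n - 1)  # sfs[k - 1] = number of sites with k derived copies
    for k in derived_counts:
        if not 0 < k < n:
            raise ValueError("each segregating site needs 0 < k < n")
        sfs[k - 1] += 1
    return sfs


def tajimas_d(derived_counts, n):
    """Tajima's D from per-site derived-allele counts in a sample of n chromosomes."""
    s = len(derived_counts)
    if n < 4 or s == 0:
        raise ValueError("need n >= 4 chromosomes and at least one segregating site")
    # Mean number of pairwise differences, summed over sites.
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in derived_counts)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Difference between the two diversity estimators, scaled by its standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


rare_only = [1, 1, 1, 1, 1]     # toy data: every variant is a singleton
intermediate = [3, 3, 3, 3, 3]  # toy data: every variant is at 50% frequency
print(site_frequency_spectrum(rare_only, 6))
print(tajimas_d(rare_only, 6) < 0, tajimas_d(intermediate, 6) > 0)
```

Because selection at a single locus can produce the same signatures, demographic inference relies on the genome-wide distribution rather than on any one gene.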
63
Sequence variations in the public human genome data reflect a bottlenecked population history
64
The payoff will be a better understanding of the genetic risk factors underlying a wide range of diseases and conditions.
65
Surveys of nucleotide diversity are beginning to show how genomes have been shaped by evolution. Surveys of nucleotide diversity provide a snapshot of evolution at its most basic level. This nucleotide diversity reflects a rich history of selection, migration, recombination, and mating systems. Additionally, the nucleotide diversity across a genome is the source of most of the phenotypic variation. The extent of polymorphism differs substantially between species and sampled loci. Nucleotide diversity is normally measured as the average sequence divergence between any two individuals for a given locus. For example, average nucleotide diversity at any one locus ranges from less than 0.05% in some cotton loci [3] to over 5% at certain loci in Leavenworthia stylosa and maize. Although many factors influence diversity (Table 1), the neutral theory of evolution suggests that the level of polymorphism (θ) should be proportional to the product of the effective population size (Ne) and the mutation rate (μ): θ = 4Neμ. There has been some success in showing the effect of demographic changes in Arabidopsis thaliana; rapid population expansion and inbreeding have resulted in many isolated, and probably slightly deleterious, polymorphisms becoming fixed in small populations. During the selection of advantageous phenotypes (the domestication process), some crops appear to have passed through bottlenecks that substantially reduced diversity. The maintenance of substantial diversity in maize, wheat, barley, and rice indicates that they had large effective population sizes that met the needs of early farmers and were therefore never severely bottlenecked.
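Both quantities in this passage are easy to compute on toy data. The sketch below measures nucleotide diversity as the average per-site difference over all pairs of aligned sequences and evaluates the neutral expectation θ = 4Neμ. The sequences and the values of Ne and μ are made up for illustration, and the function names are not from the slides.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average proportion of differing sites over all pairs of aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total_diffs / (len(pairs) * length)


def expected_theta(effective_size, mutation_rate):
    """Neutral expectation for a diploid population: theta = 4 * Ne * mu."""
    return 4 * effective_size * mutation_rate


# Hypothetical alleles sampled at one 10-bp locus.
alleles = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
print(nucleotide_diversity(alleles))

# Assumed values: Ne = 10,000 and mu = 2.5e-8 per site per generation.
print(expected_theta(10_000, 2.5e-8))
```

Comparing an observed diversity with the neutral expectation is one way to ask whether a locus or a species carries more or less variation than its population size alone would predict.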
66
8. Making sense of diverse elements in the genome is a critical problem in comparative genomics
Making sense of the diverse elements in the genome, understanding how they are organized within the genome of each species, and characterizing the changes in genome organization during evolution are critical problems in comparative genomics.
Evolutionary processes that affect genome structure:
-inversion and reciprocal translocation;
-chromosome fusion and fission;
-gene, segment, and chromosomal duplication and loss;
-polyploidization;
-various highly productive mechanisms for inserting external material, for the proliferation of repetitive sequences, and for massive ongoing sequence conversion (Y chromosome);
-a high frequency of small sequence-level rearrangements: inversions, transpositions, and duplications.
Eukaryotic genomes exhibit extensive structural variation in size, chromosome number, number and arrangement of genes, and number of genome copies per nucleus. This variation is the outcome of a set of highly active processes, including gene duplication and deletion, chromosomal duplication followed by gene loss, amplification of retrotransposons separating genes, and genome rearrangement, the latter often following hybridization and/or polyploidy.
Eukaryotic nuclear genomes are enormously variable. Genome size, chromosome number, and ploidy are all snapshots of the current state of a set of continuous processes; genome evolution is a continuous and dynamic process in which genomes rearrange, expand and contract, duplicate in whole or in part, and gain and lose genes at the same time as transposons amplify and are excised. Rearrangement, polyploidy, and gene duplication are all apparently processes of genome regeneration. Transposon amplification, gene loss, and changes in gene expression alter both the structure of the genome and the function of genes. We are only beginning to glimpse how these processes and events correlate with patterns of diversification and how they may provide the substrate for evolution.
67
Highly Conserved Non-Coding Sequences Are Associated with Vertebrate Development
CNE (conserved non-coding element) Clusters Are Found Close to Trans-Dev (regulation of development) Genes in the Human Genome
68
A distal enhancer and an ultraconserved exon are derived from a novel retroposon
69
9. Future of genome biology
Comparative genomics has proven an invaluable approach to understanding biology, not only for dissecting patterns and processes of genome evolution but also in revealing aspects of gene function. The rapid advances in technology, both for sequencing and for determining expression and interaction patterns, will continue to propel this area in the future. Intraspecific genome comparisons will primarily rely on resequencing techniques in the near future, though the advent of chip-based genomic array techniques as well as new methods will make it easier to acquire large genome coverage for individuals within populations or species. These tools, coupled with functional genomics approaches, may provide crucial insights into how genomes evolve in structure and function, and also permit us to link genome structure with organismal biology.
70
“Post-genomic era” Aim: functional annotation of every gene product
71
Towards multidimensional genome annotation
Four levels of annotation. One-dimensional genome annotation provides a list of network components. The interaction between network components can be represented using a two-dimensional annotation (where a matrix of stoichiometric coefficients is used to represent component interactions). The structural organization of the genome can also be represented spatially in a three-dimensional annotation. Changes in genome sequence can be characterized in a four-dimensional annotation.
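The two-dimensional level can be illustrated with a toy stoichiometric matrix, in which rows are metabolites, columns are reactions, and each entry is the coefficient with which a reaction produces (positive) or consumes (negative) a metabolite. The three-metabolite network, the reaction names, and the flux values below are invented for the example.

```python
# Hypothetical linear pathway: uptake -> A -> B -> C -> export.
METABOLITES = ["A", "B", "C"]
REACTIONS = {
    "uptake": {"A": 1},
    "R1": {"A": -1, "B": 1},
    "R2": {"B": -1, "C": 1},
    "export": {"C": -1},
}


def stoichiometric_matrix(metabolites, reactions):
    """Rows are metabolites, columns are reactions (in dictionary order)."""
    return [
        [coeffs.get(m, 0) for coeffs in reactions.values()]
        for m in metabolites
    ]


def net_production(matrix, fluxes):
    """Matrix-vector product S * v: net production rate of each metabolite."""
    return [sum(s * v for s, v in zip(row, fluxes)) for row in matrix]


S = stoichiometric_matrix(METABOLITES, REACTIONS)
print(S)
print(net_production(S, [1, 1, 1, 1]))  # balanced fluxes: no metabolite accumulates
print(net_production(S, [2, 1, 1, 1]))  # extra uptake: A accumulates
```

The list of reactions alone is the one-dimensional annotation; the matrix adds how the components connect, which is what makes steady-state reasoning of this kind possible.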
72
The interplay and codependence of experimental and computational approaches
The centrally located yellow box labeled "Validation of Biological Mechanisms" depicts the ultimate goal for researchers studying a biological pathway. Approaches illustrated from the top of the image downward depict high-throughput analyses used to predict transcription factor binding sites and to determine functional activity of those elements. Approaches illustrated from the bottom of the image upward signify more conventional "locus-specific" analyses that start from a narrowly defined hypothesis of biological function and can include the use of animal models.
73
Functional genomic elements being identified by the ENCODE pilot phase
The indicated methods are being used to identify different types of functional elements in the human genome.
Mammals for which genomic sequence is being generated for regions orthologous to the ENCODE targets. The current plans are to produce high-quality finished (blue), comparative-grade finished (red), or assembled whole-genome shotgun (green) sequence, as indicated. Other vertebrate species for which sequences orthologous to the ENCODE targets are being generated include chicken, frog, and zebrafish.
UCSC Genome Browser display of representative ENCODE data.
74
Genome-Phenome Superbrain Project (GPSP)
GPSP integrates various databases to build a comprehensive computerized encyclopedia of omic sciences. Our goal in the future is to evolve this intelligent system into a computational superbrain, a form of artificial intelligence that can solve a genetic researcher's problems by learning computationally a vast amount of information accumulated in documents and published data ranging from genomes to phenomes.
75
The theory of “omic space,” a model to comprehensively describe life phenomena across all omic planes, from genome to phenome.
76
OmicBrowse: A Browser of Multidimensional Omics Annotations