Evolution (1st lecture). Finding Elements in DNA Conserved by Evolution. Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes.


1 Evolution (1st lecture)

2 Finding Elements in DNA Conserved by Evolution. Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes (Gregory Cooper et al.). Identification and Characterization of Multi-Species Conserved Sequences (Elliott Margulies et al.). Presented by Penka Markova

3 Finding Elements in DNA Conserved by Evolution Premise: highly conserved sequences are more likely to reflect regions under active selection due to the presence of one or more elements that confer biological function. Involves comparative analysis; requires multiple alignments.

4

5 Outline Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes Overview Data Global Patterns of Nucleotide Substitution Rates of Transitions and Transversions in the Rodents Rates of Neutral Point Substitution Rates of Microinsertion and Microdeletion Global Identification of Constrained Elements Regional Variability of Evolutionary Parameters Identification and Characterization of Multi-Species Conserved Sequences Overview Data Binomial, Parsimony and Intersecting Methods Stats Characteristics of the detected MCSs, conclusions

6 1st Paper Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes Gregory Cooper, Michael Brudno, Eric Stone, Inna Dubchak, Serafim Batzoglou, and Arend Sidow

7 Overview Goal: Comparative analysis of the rat, mouse, and human genomes to facilitate insights into basic mechanisms of nucleotide evolution and to facilitate the discovery of elements in the genome that play a functional role in human biology (by leveraging the fact that functional DNA is constrained because of purifying selection). Summary: Provides analysis of rates and patterns of microevolutionary phenomena that have shaped the human, mouse, and rat genomes since their last common ancestor. Evidence for a shift in the mutational spectrum between the mouse and rat lineages (increase of CG content in the rat genome). Support for the idea that rates of evolution are influenced by local genomic or cell biological context. No correlation between rates of point substitution and rates of microindels (the influences that affect these processes are distinct). Identified the regions in the human genome that are evolving slowly (likely to include functional elements important to human biology).

8 Data 3 complete mammalian genome sequences: human, rat, mouse (new: rat genome). Multi-aligned with MLAGAN. 2 datasets: 1. Containing all sites that are confidently aligned among all 3 sequences (most included positions originated prior to the last common ancestor). 2. "Rodent-specific neutral sites": containing only sites present in the rodents (heavily enriched for neutrally evolving sites).

9 Outline Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes Overview Data Global Patterns of Nucleotide Substitution Rates of Transitions and Transversions in the Rodents Rates of Neutral Point Substitution Rates of Microinsertion and Microdeletion Global Identification of Constrained Elements Regional Variability of Evolutionary Parameters Identification and Characterization of Multi-Species Conserved Sequences Overview Data Binomial, Parsimony and Intersecting Methods Stats Characteristics of the detected MCSs, conclusions

10 Global Patterns of Nucleotide Substitution Global shift in the mutation spectra between mouse and rat. Rat has 0.35 percentage points more CG than mouse (41.61% vs 41.26%), a statistically highly significant difference. CpG dinucleotides: 0.92% in the mouse, 1.06% in the rat (the rest of the nucleotides exhibit smaller differences). The consistent bias toward elevated CG in the rat genome does not appear to be confined to particular types of transitions or transversions, based on quantitative analysis of Dataset1 (117 million positions with a single difference in either rodent). The causative factors for the shift, selective or otherwise, remain to be elucidated.
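The CG and CpG percentages quoted on this slide are simple composition statistics. A minimal sketch of how such numbers could be computed from a sequence string follows; the function names are hypothetical, and the exact counting conventions used in the paper (for example, how ambiguous bases or both strands are treated) are not stated in the slides, so the choices below are assumptions.

```python
def cg_fraction(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are C or G."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("C") + seq.count("G")) / acgt


def cpg_fraction(seq: str) -> float:
    """Fraction of adjacent base pairs that read 'CG' on the given strand.

    Assumption: every overlapping pair of neighbouring positions is counted once.
    """
    seq = seq.upper()
    pairs = len(seq) - 1
    if pairs <= 0:
        return 0.0
    hits = sum(1 for i in range(pairs) if seq[i:i + 2] == "CG")
    return hits / pairs


if __name__ == "__main__":
    toy = "ACGTTCGA"  # toy input, not real genome data
    print(f"CG content: {cg_fraction(toy):.2%}, CpG: {cpg_fraction(toy):.2%}")
```

Applied to whole genomes, a difference such as 41.61% versus 41.26% is the gap between two `cg_fraction` values, that is, 0.35 percentage points.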

11 Rates of Transitions and Transversions in the Rodents Transitions are approximately fourfold more likely than any transversion Useful for molecular evolutionary studies (most methods of phylogenetic inference model point substitutions on the basis of stationary Markov processes and require user-specified substitution parameters)
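For readers unfamiliar with the terms: a transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T), while a transversion swaps between the two classes. The sketch below counts both kinds of difference between two aligned sequences; the function names are hypothetical and the input format (equal-length strings with `-` for gaps) is an assumption, not the pipeline used in the paper.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify(a: str, b: str):
    """Return 'transition', 'transversion', or None for a pair of aligned bases."""
    a, b = a.upper(), b.upper()
    if a == b or a not in BASES or b not in BASES:
        return None  # identical bases, gaps, and ambiguity codes are skipped
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


def ts_tv_counts(seq1: str, seq2: str):
    """Count transitions and transversions between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        kind = classify(a, b)
        if kind == "transition":
            ts += 1
        elif kind == "transversion":
            tv += 1
    return ts, tv


if __name__ == "__main__":
    print(ts_tv_counts("ACGT-A", "GCTTAA"))
```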

12 Rates of Neutral Point Substitution Point substitution events in rodent-specific neutral sites (Dataset2). Neutral rate for the evolutionary tree relating the 3 species. Relative branch lengths of the tree: based on Dataset1 positions without a gap in any sequence. Normalized (rat branch is 1 unit length).

13 Rates of Microinsertion and Microdeletion Definition: lesions no larger than 10bp Dataset1 Gaps of size 11bp or less Rapid decline in the relative numbers of indel events as size increases
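One way to tabulate microindel sizes is to measure runs of gap characters in an aligned sequence. The sketch below is an illustration under assumptions: the function name is hypothetical, gaps are assumed to be written as `-`, and the size cutoff is left as a parameter because the slide mentions both 10 bp and 11 bp.

```python
import re
from collections import Counter


def gap_length_histogram(aligned_row: str, max_len: int = 10) -> Counter:
    """Count gap runs by length in one row of an alignment.

    Runs longer than max_len are ignored, so only 'micro' indels are tallied.
    """
    runs = re.findall(r"-+", aligned_row)
    return Counter(len(run) for run in runs if len(run) <= max_len)


if __name__ == "__main__":
    row = "AC--GT-A---C"  # toy alignment row, not real data
    for size, count in sorted(gap_length_histogram(row).items()):
        print(size, count)
```

Plotting the counts against size would show the kind of decline with increasing indel length that the slide describes.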

14 Global Identification of Constrained Elements Annotated all the regions in the human genome that are evolving, on average, significantly slower than the neutral rate. Sequences that function in organismal biology tend to be under purifying selection and thus manifest themselves as regions evolving slowly. 210,923 constrained elements (>51 bp).

15 Global Identification of Constrained Elements

16 Regional Variability of Evolutionary Parameters Substantially stable microevolutionary pressures (modest-to-strong correlations between rates of microdeletion [A, B]). Local evolutionary pressures appear to influence point substitutions and microindels differently (variation in the rate of microinsertion/microdeletion does not correlate well with point substitution). Local genomic context influences the rate of point substitution regardless of the type of site (correlation between the neutral rate and the rate of substitution [B]). CG content correlates with rates of point substitution. Sliding window analysis along rat chromosome 1, window width of 2 Mb.
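A sliding window analysis computes a statistic over consecutive, possibly overlapping, stretches of the chromosome. The sketch below does this for CG content; the function name is hypothetical, and the step size is an assumption because the slide gives only the window width (2 Mb). The same loop could track a substitution or indel rate instead of CG content.

```python
def sliding_window_cg(seq: str, width: int, step: int):
    """Return (window start, CG fraction) for each full window along seq."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    results = []
    for start in range(0, len(seq) - width + 1, step):
        window = seq[start:start + width]
        cg = sum(1 for base in window if base in "CGcg")
        results.append((start, cg / width))
    return results


if __name__ == "__main__":
    # Toy sequence with a tiny window; the slide's analysis used 2 Mb windows.
    for start, frac in sliding_window_cg("AACCGGTTAACC", width=4, step=2):
        print(start, frac)
```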

17 Outline Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes Overview Data Global Patterns of Nucleotide Substitution Rates of Transitions and Transversions in the Rodents Rates of Neutral Point Substitution Rates of Microinsertion and Microdeletion Global Identification of Constrained Elements Regional Variability of Evolutionary Parameters Identification and Characterization of Multi-Species Conserved Sequences Overview Data Binomial, Parsimony and Intersecting Methods Stats Characteristics of the detected MCSs, conclusions

18 2nd Paper Identification and Characterization of Multi-Species Conserved Sequences Elliott Margulies, Mathieu Blanchette, NISC Comparative Sequencing Program, David Haussler, Eric Green

19 Overview Goals Identify highly conserved DNA regions, in particular “Multi-species Conserved Sequences” (MCSs), in a robust fashion useful in comparative sequence analysis, aiming to elucidate genome function Evaluate the relative contribution of different species’ sequences to identifying genomic regions of interest one of the criteria considered in choosing additional species for whole-genome sequencing Summary of results Proposes 2 strategies for MCS identification (binomial, parsimony) detect virtually all known actively conserved sequences (coding seq), but very little neutrally evolving sequence (ancestral repeats) Analysis of the features of detected MCSs Currently available genome sequences are insufficient for comprehensive identification of MCSs in the human genome

20 Data Sequences of human and 11 non-human vertebrates: 2 primates (chimpanzee, baboon), 2 carnivores (cat, dog), 2 artiodactyls (cow and pig), 2 rodents (mouse and rat), 1 bird (chicken), 2 fish (fugu and tetraodon). Orthologous to a 1.8-Mb region on human chromosome 7q31. Multi-aligned: human-referenced pair-wise alignment (RepeatMasker, BLASTZ). Systematically annotated for known coding exons, UTRs, and ARs (ancestral repeats).

21 Algorithms: Binomial, Parsimony, Intersecting Take into account: the phylogenetic diversity of the aligned species' sequences; the varying neutral substitution rate; the characteristics of the available genomic multi-sequence alignment, especially sparse alignments. Requirements: sufficiently large branch length of the phylogenetic tree (non-functional regions should be sufficiently diverged); greater total branch length (compared to the length required for identification of larger functional elements); a good multi-alignment is crucial.

22 Algorithms: Binomial Binomial-Based Method for MCS Detection Calculates the conservation score based on the probability of detecting the observed amount of conservation between the human and each other species' sequence, assuming the neutral substitution rate. The neutral substitution rate is calculated from fourfold degenerate positions (the third base of codons for which any base will encode the same amino acid). Normalizes for phylogenetic biases by averaging. The final conservation score is calculated from overlapping 25-base windows.
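To make the fourfold degenerate idea concrete: in the standard genetic code, eight codon families (CTN, GTN, TCN, CCN, ACN, GCN, CGN, GGN) encode the same amino acid whatever the third base is, so differences at those third positions are treated as neutral. The sketch below estimates the fraction of such sites that match between two aligned coding sequences. The function name and the input format (in-frame, ungapped, equal-length strings) are assumptions, as is the requirement that both species share the first two codon bases; the paper's actual procedure is not described in the slides.

```python
# First two bases of the eight fourfold degenerate codon families (standard code).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
BASES = set("ACGT")


def neutral_match_probability(human_cds: str, other_cds: str) -> float:
    """Fraction of fourfold degenerate third positions where both sequences agree.

    Assumes both inputs are aligned, in-frame coding sequences without gaps.
    """
    matches = sites = 0
    usable = min(len(human_cds), len(other_cds))
    for i in range(0, usable - 2, 3):
        h = human_cds[i:i + 3].upper()
        o = other_cds[i:i + 3].upper()
        # Assumption: the codon must be fourfold degenerate in both species.
        if h[:2] in FOURFOLD_PREFIXES and o[:2] == h[:2] and h[2] in BASES and o[2] in BASES:
            sites += 1
            matches += h[2] == o[2]
    if sites == 0:
        raise ValueError("no fourfold degenerate sites found")
    return matches / sites


if __name__ == "__main__":
    print(neutral_match_probability("CTAGGAATG", "CTGGGAATG"))
```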

23 Algorithms: Binomial Algorithm 1) Within all windows of 25 bases, for each species j:
N = number of aligned bases in the 25-base window of the human-species j alignment
K = number of perfect matches
p_j = neutral substitution probability: the probability that a given base in the human sequence has been conserved in species j, assuming the neutral substitution rate between human and species j
K/N = baseline conservation level
C(j) = cumulative binomial probability of observing at least K matches in N bases
Example aligned pair: CGGCTAAG…ACTGACTGGGT CGACTGAG…ACTGACTGGGT
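The quantity C(j) defined on this slide is an upper-tail binomial probability: the chance of seeing K or more matches out of N aligned bases when each base matches independently with probability p_j. A minimal sketch of that calculation follows; the function name and the example numbers are hypothetical, and how C(j) is turned into the per-species score is not recoverable from the transcript, so the sketch stops at the tail probability.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Probability of at least k successes in n independent trials with success rate p."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical window: 25 aligned bases, 23 matches, neutral match probability 0.6.
    print(binomial_tail(25, 23, 0.6))
```

A small value means the window is more conserved than neutral evolution would usually produce, which is the signal the method looks for.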

24 Algorithms: Binomial Algorithm 2) “phylogenetically average” the individual species’ scores s_j to obtain the final conservation score for the window 3) the final score assigned to position i is 4) For a given threshold t, position i is predicted to be part of an MCS if

25 Algorithms: Binomial Binomial-Based Method: Conclusion Conservation scores below zero represent alignable regions that are less conserved than expected, the opposite for scores above zero Minimum MCS length is 25 bases Sequence conservation detected with more diverged species (with higher neutral substitution rates) is weighted more heavily Measures conservation with respect to one reference sequence only

26 Algorithms: Parsimony Parsimony-Based Method Amount of conservation within each column of the alignment is measured using a phylogenetic parsimony score P(i) P(i) reflects the minimal number of substitutions needed along the branches of an established phylogenetic tree to account for the observed bases at the leaves of the tree Based on P(i) calculates a score under a continuous-time Markov model of neutral evolution, measuring the “surprise” of observing P(i) or smaller parsimony score Requires a phylogenetic tree, a model of neutral substitution

27 Algorithms: Parsimony Algorithm
1) Calculate the parsimony score P(i) for the i-th position. P(i) = the minimum number of substitutions, performed along the branches of the tree, needed to explain the bases observed at the leaves of the tree. Note that P(i) is a tight lower bound on the number of substitutions that actually occurred at position i during evolution.
2.0) Define a model of neutral evolution based on the phylogenetic tree T relating the species under study and a neutral substitution rate matrix Q. ℓ(e) denotes the length of branch e, and r the root of the tree. Transition probability matrix along a branch (u,v): $M_{(u,v)} = e^{\ell(u,v) Q}$. Background base distribution: π. This model generates a set of random but related bases at the leaves of the tree by simulating evolution.
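The parsimony score P(i) of a single alignment column can be computed with Fitch's algorithm, a standard method for rooted binary trees. The slides do not say which algorithm the authors used, so this is a swapped-in illustration. The nested-tuple tree encoding, the function names, and the example topology are all hypothetical, and gaps or missing species are not handled.

```python
def fitch(tree, column):
    """Return (candidate ancestral bases, substitution count) for a subtree.

    tree is either a species name (leaf) or a pair (left_subtree, right_subtree).
    column maps each species name to the base observed at this alignment position.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, column)
    right_set, right_cost = fitch(right, column)
    common = left_set & right_set
    if common:
        # The children can share a base, so no substitution is needed at this node.
        return common, left_cost + right_cost
    # No shared base: one substitution is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, column) -> int:
    """Minimum number of substitutions needed to explain the column on the tree."""
    return fitch(tree, column)[1]


if __name__ == "__main__":
    # Hypothetical topology for illustration only.
    tree = (("human", "baboon"), ("mouse", "rat"))
    column = {"human": "A", "baboon": "A", "mouse": "G", "rat": "G"}
    print(parsimony_score(tree, column))
```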

28 Algorithms: Parsimony 2) Define the score assigned to position i based on the 25-base window as Z(r) is the random variable describing the parsimony score of the bases of the subtree rooted at r. Pr[Z(r) ≤ P(j)] is the probability that the parsimony score of the bases at the leaves of T generated by the model defined above is at most P(j); it is calculated using a dynamic programming algorithm proceeding from the leaves of T to its root. If this probability is small, the position is unlikely to have been generated under neutral evolution.

29 Algorithms: Parsimony 3) the final score assigned to position i is 4) For a given threshold t, position i is predicted to be part of an MCS if Parsimony-Based Method: Conclusion Requires a phylogenetic tree and a model of neutral substitution. Produces higher scores based on conservation across large phylogenetic distance.

30 Algorithms: Binomial, Parsimony, Intersecting Intersecting Method: intersects the results from the Binomial and Parsimony methods; MCSs can be shorter than 25 bp. Observations: all three methods are biased towards the identification of sequences that are conserved in most species (as opposed to only a subset of species). The conservation score threshold used was selected such that 5% of the human sequence from the analyzed region falls within an MCS (5% of the human genome is considered to be under active selection).
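The threshold choice and the intersection step can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, per-position scores are assumed to be available as plain lists, and a position is assumed to be called when its score is at least the threshold (the exact condition is not preserved in the transcript).

```python
def threshold_for_fraction(scores, fraction=0.05):
    """Cutoff t such that roughly `fraction` of positions have score >= t."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    keep = max(1, round(fraction * len(ranked)))
    return ranked[keep - 1]


def intersect_calls(binomial_scores, parsimony_scores, fraction=0.05):
    """Mark positions that pass the threshold under both scoring methods."""
    t_binomial = threshold_for_fraction(binomial_scores, fraction)
    t_parsimony = threshold_for_fraction(parsimony_scores, fraction)
    return [
        b >= t_binomial and p >= t_parsimony
        for b, p in zip(binomial_scores, parsimony_scores)
    ]


if __name__ == "__main__":
    # Toy scores for ten positions; a large fraction keeps the example readable.
    binomial = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    parsimony = [0, 0, 0, 0, 0, 0, 0, 0, 5, 1]
    print(intersect_calls(binomial, parsimony, fraction=0.2))
```

Because the intersection keeps only positions called by both methods, a run of called positions can be shorter than either method's 25-base window, which is consistent with the slide's remark that these MCSs can be shorter than 25 bp.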

31 Concordance of the binomial- and parsimony-based methods for MCS detection

32 Results: discrimination of different types of sequence using conservation scores

33 Results General features of detected MCSs: detected virtually all known actively conserved sequences (coding sequence), but very little neutrally evolving sequence (ancestral repeats); the majority of sequences conserved across multiple vertebrate species have no known function (70% of MCSs reside in non-coding regions). Uniqueness of the MCSs in the human genome. Correlating MCSs with Functional Elements: MCSs correspond to clusters of transcription factor-binding sites, non-coding RNA transcripts, and other candidate functional elements.

34 Results: characteristics of the detected MCSs

35 Positions of MCSs relative to other annotated genomic features (representative region)

36 Results Contribution of different species’ sequences to the detection of MCSs: Rodent sequences detect the greatest number of MCS bases and the largest amount of non-coding sequence. Chicken sequence has considerably higher specificity and detects the largest amount of coding MCS bases. MCSs detected with fish sequences almost exclusively contain coding sequence. Non-human primate sequences are not useful with the applied methods. None of the individual species’ sequences alone came close to identifying all the reference MCS bases. Currently available genome sequences are insufficient for comprehensive identification of MCSs in the human genome.

37 Ability of individual & combinations of species’ sequences to detect MCSs

38 Outline (The End) Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes Overview Data Global Patterns of Nucleotide Substitution Rates of Transitions and Transversions in the Rodents Rates of Neutral Point Substitution Rates of Microinsertion and Microdeletion Global Identification of Constrained Elements Regional Variability of Evolutionary Parameters Identification and Characterization of Multi-Species Conserved Sequences Overview Data Binomial, Parsimony and Intersecting Methods Stats Characteristics of the detected MCSs, conclusions

39 The end

40 A(u) is the random variable representing the base generated by this random process at node u.

