
1 Intro to Comp Genomics Lecture 11: Using models for sequence evolution

2 Comparing everything
Our intuition: if feature X is similar among a group of species, then feature X is important.
Feature X can be:
- Sequence
- Gene expression (human brain vs. chimp brain?)
- Genic structure (exons/introns)
- Protein complexes
- Protein networks
- TF-DNA interaction
Two main difficulties:
- Species have common ancestry: a lot of features may be similar simply because they have not diverged yet
- Species are related through phylogenetic trees: similarity should follow a tree structure

3 Modeling multiple genome sequences
Three example genome fragments (Genome 1-3):
AGCAACAAGTAAGGGAAACTACCCAGAAAA...
AGCCACATGTAACGGTAATAACGCAGAAAA...
AGCAACAAGAAAGGGTTACTACGCAGAAAA...
Alignment -> statistics.
[Figure: a Markov substitution process over A/C/G/T running along the branches of a tree, relating the observed genomes through unobserved ancestral sequences.]

4 Tree models
For a locus j:
- Extant species S_1..S_n
- Ancestral species H_1..H_(n-1)
- Tree T: parent relations pa(S_i), pa(H_i) (in the example tree: pa(S_1) = H_1, pa(S_3) = H_2; the root is H_2)
- Val(X) = {A, C, G, T}
An evolutionary model = a joint distribution Pr(h, s).
Locus independence: Pr(h, s) = prod_j Pr(h_j, s_j)
[Figure: example tree with root H_2, internal node H_1, and leaves S_1, S_2, S_3.]

5 Tree models
A tree T, species S_1..S_n, parent relation pa(S_i).
The Markov assumption is still in effect, but branching complicates it; we need a little more.
The model: Pr(h, s) = Pr(x_root) * prod_i Pr(x_i | x_pa(i))
In the triplet: Pr(h_1, h_2, s_1, s_2, s_3) = Pr(h_2) Pr(h_1 | h_2) Pr(s_3 | h_2) Pr(s_1 | h_1) Pr(s_2 | h_1)
[Figure: the triplet tree with example nucleotide states at its nodes.]

6 Tree models
Toy model: triplet phylogeny (root H_2 with children H_1 and S_3; H_1 with children S_1 and S_2).
A single substitution probability matrix on all of the branches.
Uniform background probability: P(x) = 0.25.

7 Tree models
Marginal probability of X_i (any r.v.): Pr(X_i = a) = sum over all assignments of the other variables of Pr(h, s).
Given partial observations s, "ancestral inference" = computing Pr(h_i | s).
The total probability of the data, Pr(s) = sum_h Pr(h, s), is the likelihood of the model given the data.
[Figure: the triplet tree H_2, H_1, S_1, S_2, S_3.]

8 Tree models
Example: the leaves are observed as A, C, A; the two ancestral nodes are unobserved (?).
Given the partial observations s, we want the total probability of the data Pr(s) and the posteriors of the unobserved ancestors.
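
To make the total-probability computation concrete, here is a minimal brute-force sketch in Python for the toy triplet model of the previous slides (uniform root, a single one-parameter substitution probability on every branch); the eps value and the observed triplet are placeholders, not numbers from the slides.

from itertools import product

ALPHABET = "ACGT"

def p_sub(a, b, eps=0.02):
    # placeholder one-parameter substitution probability (same on every branch)
    return 1 - 3 * eps if a == b else eps

def triplet_likelihood(s1, s2, s3):
    """Total probability Pr(s1,s2,s3): sum over the unobserved ancestors h1, h2.
    Tree: root H2 with children H1 and S3; H1 with children S1 and S2."""
    total = 0.0
    for h1, h2 in product(ALPHABET, repeat=2):
        total += (0.25                      # Pr(H2 = h2), uniform background
                  * p_sub(h2, h1)           # Pr(H1 = h1 | H2 = h2)
                  * p_sub(h2, s3)           # Pr(S3 = s3 | H2 = h2)
                  * p_sub(h1, s1)           # Pr(S1 = s1 | H1 = h1)
                  * p_sub(h1, s2))          # Pr(S2 = s2 | H1 = h1)
    return total

print(triplet_likelihood("A", "C", "A"))

This brute-force sum is exponential in the number of ancestors; the algorithms on the next slides avoid that.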

9 Intuition – maximum parsimony
"Parsimony" ~ minimal change.
The "small" parsimony problem: given the tree, find ancestral sequences that minimize the number of substitutions along the tree branches. What is the minimal number of substitutions? (All branches are equal, all substitutions are equal.)
(The "big" parsimony problem: find the tree topology that gives the minimal parsimony score over a set of loci.)
In the example (observed leaves A, C, A): labeling both ancestors C costs 2 substitutions; labeling both A costs 1 substitution.

10 Intuition – maximum parsimony
Algorithm (following Fitch 1971):
Up(i):
  if (extant) { up_set[i] = {S_i}; return }
  up(right(i)); up(left(i))
  up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
  if (up_set[i] = ∅) { D += 1; up_set[i] = up_set[right(i)] ∪ up_set[left(i)] }
Down(i):
  down_set[i] = up_set[sibling(i)] ∩ down_set[par(i)]
  if (down_set[i] = ∅) { down_set[i] = up_set[sibling(i)] ∪ down_set[par(i)] }
Algorithm:
  D = 0; up(root)
  down_set[root] = ∅; down(right(root)); down(left(root))
[Figure: the triplet tree, annotated with up_set[4] and up_set[5] at the internal nodes.]

11 Intuition – maximum parsimony
Completing the algorithm (Fitch 1971): the up pass and Down(i) are as on the previous slide, and Down(i) ends by recursing: down(left(i)); down(right(i)).
The candidate ancestral states at each node are Set[i] = up_set[i] ∩ down_set[i]. A runnable sketch follows below.
[Figure: the triplet tree, annotated with down_set[4], down_set[5], and up_set[3].]
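
A runnable Python sketch of the Fitch up/down passes above; the dict-based tree representation and the fallbacks used when an intersection is empty are assumptions of this sketch, not details fixed by the slides.

def up(node, counter):
    if "state" in node:                        # extant leaf: observed character
        node["up_set"] = {node["state"]}
        return
    up(node["left"], counter); up(node["right"], counter)
    l, r = node["left"]["up_set"], node["right"]["up_set"]
    inter = l & r
    if inter:
        node["up_set"] = inter
    else:                                      # empty intersection: one substitution
        counter["D"] += 1
        node["up_set"] = l | r

def down(node, parent, sibling):
    if parent is None:
        node["down_set"] = set()
    else:
        inter = sibling["up_set"] & parent["down_set"]
        node["down_set"] = inter if inter else sibling["up_set"] | parent["down_set"]
    # candidate ancestral states (falling back to up_set when the intersection is empty)
    node["set"] = node["up_set"] & node["down_set"] or node["up_set"]
    if "left" in node:
        down(node["left"], node, node["right"])
        down(node["right"], node, node["left"])

# Triplet example: ((S1 = "A", S2 = "A"), S3 = "C")
tree = {"left": {"left": {"state": "A"}, "right": {"state": "A"}},
        "right": {"state": "C"}}
counter = {"D": 0}
up(tree, counter)
down(tree, None, None)
print("minimal substitutions:", counter["D"])   # 1 for this example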

12 Probabilistic inference
Algorithm (following Felsenstein 1981):
Up(i):
  if (extant) { up[i][a] = (a == S_i ? 1 : 0); return }
  up(r(i)); up(l(i))
  for each a:
    up[i][a] = sum_{b,c} Pr(X_l(i) = b | X_i = a) up[l(i)][b] * Pr(X_r(i) = c | X_i = a) up[r(i)][c]
Down(i):
  down[i][a] = sum_{b,c} Pr(X_sib(i) = b | X_par(i) = c) up[sib(i)][b] * Pr(X_i = a | X_par(i) = c) down[par(i)][c]
  down(r(i)); down(l(i))
Algorithm:
  up(root)
  L = 0
  for each a { L += Pr(root = a) * up[root][a]; down[root][a] = Pr(root = a) }
  LL = log(L)
  down(r(root)); down(l(root))
[Figure: the triplet tree, annotated with up[4] and up[5] at the internal nodes.]

13 Probabilistic inference
Same up/down algorithm (Felsenstein 1981) as on the previous slide. Combining the two messages at any node i gives the posterior over its state:
P(h_i = c | s) = up[i][c] * down[i][c] / ( sum_j up[i][j] * down[i][j] )
A runnable sketch follows below.
[Figure: the triplet tree, annotated with down[4], down[5], and up[3].]
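
A runnable numpy sketch of Felsenstein's up (pruning) and down passes and of the posterior formula above; the single shared substitution matrix and the uniform root prior are placeholder assumptions of this sketch.

import numpy as np

ALPHABET = "ACGT"
IDX = {a: i for i, a in enumerate(ALPHABET)}

def branch_matrix(eps=0.02):
    # placeholder one-parameter substitution matrix, same on every branch
    P = np.full((4, 4), eps)
    np.fill_diagonal(P, 1 - 3 * eps)
    return P

def up(node, P):
    """Pruning pass: up[i][a] = Pr(data below i | X_i = a)."""
    if "state" in node:                        # leaf: indicator of the observed base
        node["up"] = np.eye(4)[IDX[node["state"]]]
        return
    up(node["left"], P); up(node["right"], P)
    # sum over the children's states b, c for every parent state a
    node["up"] = (P @ node["left"]["up"]) * (P @ node["right"]["up"])

def down(node, parent, sibling, P, root_prior):
    if parent is None:
        node["down"] = root_prior.copy()
    else:
        # message from above: parent's down message combined with the sibling's up message
        node["down"] = P.T @ (parent["down"] * (P @ sibling["up"]))
    if "left" in node:
        down(node["left"], node, node["right"], P, root_prior)
        down(node["right"], node, node["left"], P, root_prior)

P = branch_matrix()
prior = np.full(4, 0.25)
tree = {"left": {"left": {"state": "A"}, "right": {"state": "C"}},
        "right": {"state": "A"}}
up(tree, P)
likelihood = float(prior @ tree["up"])         # total probability of the data
down(tree, None, None, P, prior)
h1 = tree["left"]
posterior = h1["up"] * h1["down"] / (h1["up"] @ h1["down"])   # P(H1 | data)
print(likelihood, posterior)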

14 Inference as message passing
[Figure: the tree with message "speech bubbles" at its nodes.] An up message tells a node "you are P(H | our data)", i.e., the distribution given only the data in the sender's subtree; after combining the messages from all directions, a node knows "I am P(H | all the data)".

15 Inference as message passing
Numeric example from the figure (observed leaves A, C, C, C):
One up message: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)
Another up message: (0.01)^2, (0.96)^2, (0.01)^2, (0.02)^2
Down message at the root: (0.25), (0.25), (0.25), (0.25)

16 Learning: Branch decomposition
Can we learn each branch independently? We know how to compute the ML model given two observed species. We have P(S | D) for each species; can we substitute it for the statistics (the 4x4 A/C/G/T count table across each branch)?
[Figure: the example tree (leaves 1-3, ancestors 4-5) decomposed into its individual branches, next to a 4x4 A/C/G/T count table.]

17 Transition posteriors: not independent!
Example from the figure (observed leaves A, C and A, C):
Up: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)
Up: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)
Down: (0.25), (0.25), (0.25), (0.25)

18 Learning: Second attempt
Can we learn each branch independently? Given P(S_pa(i) -> S_i | D) for each species, can we substitute it for the statistics?
[Figure: the example tree (leaves 1-3, ancestors 4-5) decomposed into its individual branches.]

19 Expectation-Maximization
E-step: use the up/down messages to compute, for every branch and locus, the expected transition counts E[#(parent = c, child = a) | data]. M-step: re-estimate the branch substitution probabilities from these expected counts. (A sketch follows below.)
[Figure: the example tree (leaves 1-3, ancestors 4-5) decomposed into its individual branches.]
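
One way to fill in the E and M steps, sketched under the assumption that the up/down arrays from the Felsenstein sketch above are available for every locus; the function names here are illustrative, not from the slides.

import numpy as np

def branch_expected_counts(parent_down, sibling_up, child_up, P):
    """E-step for one branch at one locus: the posterior joint
    Pr(parent = c, child = a | data), used as expected transition counts."""
    msg_from_above = parent_down * (P @ sibling_up)           # over parent states c
    joint = msg_from_above[:, None] * P * child_up[None, :]   # joint[c, a]
    return joint / joint.sum()

def m_step(counts):
    """Re-estimate P(child | parent) from the accumulated expected counts."""
    return counts / counts.sum(axis=1, keepdims=True)

# One EM iteration, schematically:
# counts[branch] = sum over loci of branch_expected_counts(...)
# P_new[branch] = m_step(counts[branch])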

20 Continuous time
Conditions on the transition probabilities P(t): continuity P(0) = I, and the Markov (Chapman-Kolmogorov) property P(t + s) = P(t) P(s).
Theorem: q_i = lim_{h->0} (1 - p_ii(h)) / h exists (may be infinite), and q_ij = lim_{h->0} p_ij(h) / h exists and is finite.
Think of time steps that get smaller and smaller.

21 Rates and transition probabilities
The process's rate matrix Q: the off-diagonal entries q_ij are the instantaneous substitution rates, and each diagonal entry is q_ii = -sum_{j != i} q_ij, so rows sum to zero.
Transition differential equations (backward form): P'(t) = Q P(t), with P(0) = I.

22 Matrix exponential
The differential equation P'(t) = Q P(t), P(0) = I has the series solution P(t) = exp(Qt) = sum_{k >= 0} (Qt)^k / k!.
Summing over different path lengths: the successive terms of the series correspond to paths of increasing length (1-path, 2-path, 3-path, 4-path, 5-path, ...). See the sketch below.
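
A small numpy sketch of the truncated series, using a placeholder Jukes-Cantor-style rate matrix (the rate value is arbitrary, not from the slides).

import numpy as np

def expm_series(Q, t, k_max=30):
    """P(t) = exp(Qt) approximated by the truncated series sum_k (Qt)^k / k!."""
    A = Q * t
    term = np.eye(Q.shape[0])      # k = 0 term (the "0-path": no substitution)
    total = term.copy()
    for k in range(1, k_max + 1):
        term = term @ A / k        # (Qt)^k / k!
        total += term
    return total

alpha = 1.0
Q = np.full((4, 4), alpha)
np.fill_diagonal(Q, -3 * alpha)    # rows sum to zero
P = expm_series(Q, t=0.1)
print(P.sum(axis=1))               # each row should sum to ~1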

23 Computing the matrix exponential

24 Series methods: just take the first k summands.
- Reasonable when ||A|| <= 1; if the terms are converging, you are OK.
- Can do scaling and squaring: exp(A) = (exp(A / 2^s))^(2^s), so scale A down, use a short series, then square s times.
Eigenvalue decomposition A = B D B^{-1}, so exp(A) = B exp(D) B^{-1}:
- Good when the matrix is symmetric; problems when eigenvalues are similar.
- Multiple methods use other types of B (e.g., triangular).
A sketch of both approaches follows below.
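
Hedged sketches of the two approaches (scaling-and-squaring with a short series, and the symmetric eigendecomposition route), compared against scipy.linalg.expm; the Q used is the same placeholder Jukes-Cantor-style matrix as above.

import numpy as np
from scipy.linalg import expm     # reference implementation for comparison

def expm_scaling_squaring(A, series_terms=10):
    """Scale A down until its norm is small, run a short series, then square back."""
    s = max(0, int(np.ceil(np.log2(max(np.linalg.norm(A, 1), 1e-300)))))
    B = A / (2 ** s)
    term, total = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, series_terms + 1):
        term = term @ B / k
        total += term
    for _ in range(s):
        total = total @ total
    return total

def expm_symmetric(A):
    """Eigendecomposition route; only safe when A is symmetric."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.exp(w)) @ V.T

alpha = 1.0
Q = np.full((4, 4), alpha); np.fill_diagonal(Q, -3 * alpha)
t = 0.5
print(np.allclose(expm_scaling_squaring(Q * t), expm(Q * t)))
print(np.allclose(expm_symmetric(Q * t), expm(Q * t)))       # Q here is symmetric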

25 Learning a rate matrix
What if we wish to learn a single rate matrix Q? Learning is easy for a single branch of fixed length. Given (inferred) statistics n_k for multiple branch lengths, we must optimize a non-linear likelihood function.
[Figure: the example tree (leaves 1-3, ancestors 4-5) decomposed into its individual branches.]

26 Learning: Sharing the rate matrix
Use generic numerical optimization methods (e.g., BFGS) to maximize the likelihood in the shared rate matrix Q. A sketch follows below.
[Figure: the example tree (leaves 1-3, ancestors 4-5) decomposed into its individual branches.]
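
A sketch of the shared-Q optimization with BFGS via scipy.optimize.minimize; the counts, branch lengths, and the log-rate parameterization are illustrative assumptions, not part of the slides (in practice the counts would come from the E-step above).

import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

def build_Q(log_rates):
    Q = np.zeros((4, 4))
    Q[~np.eye(4, dtype=bool)] = np.exp(log_rates)    # positive off-diagonal rates
    np.fill_diagonal(Q, -Q.sum(axis=1))              # rows sum to zero
    return Q

def neg_log_likelihood(log_rates, counts, lengths):
    Q = build_Q(log_rates)
    ll = 0.0
    for n_k, t_k in zip(counts, lengths):            # one count table per branch
        P = expm(Q * t_k)
        ll += np.sum(n_k * np.log(np.clip(P, 1e-12, None)))
    return -ll

# toy data: two branches with made-up count tables and fixed lengths
counts = [np.array([[90, 3, 4, 3], [3, 91, 3, 3], [4, 3, 90, 3], [3, 3, 3, 91]], float)] * 2
lengths = [0.1, 0.3]
x0 = np.zeros(12)                                    # 12 off-diagonal log-rates
res = minimize(neg_log_likelihood, x0, args=(counts, lengths), method="BFGS")
print(build_Q(res.x))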

27 Protein genes: codes and structure
- Codons (positions 1, 2, 3) and the degenerate code
- Intron/exon structure
- Domains
- Conformation
- 5' UTR / 3' UTR
- Recombination easier?
- Epistasis: fitness correlation between two remote loci

28 The classical analysis paradigm
Target sequence -> BLAT/BLAST against GenBank -> matching sequences -> ClustalW -> multiple alignment, e.g.
ACGTACAGA
ACGT-CAGA
ACGTTCAGA
ACGTACGGA
-> phylogenetic modeling -> analysis: rates, Ka/Ks, ...

29 ClustalW and multiple alignment
ClustalW is the semi-standard multiple alignment algorithm when the sequences are relatively diverged and there are many of them.
ClustalW:
- Compute pairwise sequence distances (using pairwise alignment): Dist(s1, s2) = best pairwise alignment -> distance matrix
- Build a guide tree approximating the phylogenetic relations between the sequences (neighbor joining); note that the guide tree is based on pairwise analysis only
- "Progressive" alignment on the guide tree: from the leaves upwards, align two children given their "profiles"; several heuristics for gaps
A guide-tree sketch follows below.
Other methods are used to multi-align entire genomes, especially when one well-annotated model genome is compared to several similar species: think of using one genome as a "scaffold" for all the other sequences.
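
A compact sketch of the guide-tree (neighbor-joining) step on a toy distance matrix; computing the pairwise distances from alignments is left out, and the numbers are made up.

def neighbor_joining(D, labels):
    """D: symmetric list-of-lists distance matrix, labels: leaf names.
    Returns a nested tuple describing the join order (unrooted shape, no lengths)."""
    D = [row[:] for row in D]
    nodes = list(labels)
    while len(nodes) > 2:
        n = len(nodes)
        r = [sum(row) for row in D]
        # NJ criterion: join the pair minimizing (n-2)*d(i,j) - r(i) - r(j)
        i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # distances from the new internal node to every remaining node
        new_row = [(D[i][k] + D[j][k] - D[i][j]) / 2 for k in range(n) if k not in (i, j)]
        new_node = (nodes[i], nodes[j])
        keep = [k for k in range(n) if k not in (i, j)]
        D = [[D[a][b] for b in keep] for a in keep]
        for row, d in zip(D, new_row):
            row.append(d)
        D.append(new_row + [0.0])
        nodes = [nodes[k] for k in keep] + [new_node]
    return (nodes[0], nodes[1])

labels = ["S1", "S2", "S3", "S4"]
D = [[0, 2, 7, 7],
     [2, 0, 7, 7],
     [7, 7, 0, 2],
     [7, 7, 2, 0]]
print(neighbor_joining(D, labels))   # expect S1 grouped with S2, and S3 with S4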

30 Nucleotide substitution models
For nucleotides, fewer parameters are needed:
- Jukes-Cantor (JC): a single rate alpha for every substitution
- Kimura: one rate alpha for transitions (A<->G, C<->T) and another rate beta for transversions
But this is ignoring all we know about the properties of amino acids! (See the sketch below.)
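
The two rate matrices as numpy sketches, using the usual convention of alpha for transitions and beta for transversions (the symbols and values here are illustrative, not taken from the slides). Base order: A, C, G, T.

import numpy as np

def jukes_cantor(alpha):
    Q = np.full((4, 4), alpha)
    np.fill_diagonal(Q, -3 * alpha)       # rows sum to zero
    return Q

def kimura_k80(alpha, beta):
    # transitions A<->G and C<->T get rate alpha; all transversions get beta
    Q = np.full((4, 4), beta)
    A, C, G, T = range(4)
    for i, j in [(A, G), (G, A), (C, T), (T, C)]:
        Q[i, j] = alpha
    np.fill_diagonal(Q, 0)
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows sum to zero
    return Q

print(kimura_k80(alpha=2.0, beta=0.5))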

31 Simple phylogenetic modeling: PAM/BLOSUM62
Given a multiple alignment of protein-coding DNA, we can translate the DNA to proteins. We can then try to model the phylogenetic relations between the proteins using a fixed rate matrix Q, some phylogeny T, and branch lengths t_i.
When modeling hundreds or thousands of amino-acid sequences, we cannot learn from the data the rate matrix (20x20 parameters!) AND the branch lengths AND the phylogeny.
Based on surveys of high-quality aligned proteins, Margaret Dayhoff and colleagues generated the famous PAM (Point Accepted Mutation) matrices; PAM1 corresponds to a 1% substitution probability.
Using conserved aligned blocks, Henikoff and Henikoff generated the BLOSUM family of matrices. The Henikoff approach improved the analysis of distantly related proteins, and is based on more sequence (lots of conserved blocks) while down-weighting highly similar sequences (BLOSUM62 clusters sequences that are more than 62% identical).

32 Universal amino-acid substitution rates? Jordan et al., Nature 2005
"We compared sets of orthologous proteins encoded by triplets of closely related genomes from 15 taxa representing all three domains of life (Bacteria, Archaea and Eukaryota), and used phylogenies to polarize amino acid substitutions. Cys, Met, His, Ser and Phe accrue in at least 14 taxa, whereas Pro, Ala, Glu and Gly are consistently lost. The same nine amino acids are currently accrued or lost in human proteins, as shown by analysis of non-synonymous single-nucleotide polymorphisms. All amino acids with declining frequencies are thought to be among the first incorporated into the genetic code; conversely, all amino acids with increasing frequencies, except Ser, were probably recruited late. Thus, expansion of initially under-represented amino acids, which began over 3,400 million years ago, apparently continues to this day."

33 Your task
- Get the aligned chromosome 17 for human, chimp, orangutan, rhesus, and marmoset.
- Use EM on the known phylogeny to estimate a substitution model from the data (P(x | pa(x))).
- Partition the genome into two parts according to overall conservation (define the criterion yourself; one possible criterion is sketched below). Then train two models independently and compare them.
- Optional: can your models be explained using a single rate matrix and different branch lengths?
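
One possible conservation criterion, sketched only as a hint: the window size, the threshold, and the column representation are arbitrary choices of this sketch, not requirements of the assignment.

def partition_by_conservation(columns, window=200, threshold=0.7):
    """columns: aligned columns, e.g. ('A','A','A','A','A') per position across the 5 species.
    Returns two lists of column indices: positions in conserved windows and in diverged windows."""
    conserved, diverged = [], []
    for start in range(0, len(columns) - window + 1, window):
        win = columns[start:start + window]
        # a column counts as identical if all species agree and there is no gap
        identical = sum(len(set(col)) == 1 and "-" not in col for col in win)
        bucket = conserved if identical / window >= threshold else diverged
        bucket.extend(range(start, start + window))
    return conserved, diverged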

