CS262 Lecture 12, Win07, Batzoglou Some new sequencing technologies.


Some new sequencing technologies

Molecular Inversion Probes

Single Molecule Array for Genotyping (Solexa)

Nanopore Sequencing

Pyrosequencing on a chip
Mostafa Ronaghi, Stanford Genome Technologies Center
454 Life Sciences

Polony Sequencing

Some future directions for sequencing
1. Personalized genome sequencing
   Find your ~3,000,000 single nucleotide polymorphisms (SNPs)
   Find your rearrangements
   Goals:
     Link genome with phenotype
     Provide personalized diet and medicine
     (???) designer babies, big-brother insurance companies
   Timeline:
     Inexpensive sequencing:
     Genotype-phenotype association: 2010-???
     Personalized drugs: 2015-???

Some future directions for sequencing
2. Environmental sequencing
   Find your flora: organisms living in your body
     External organs: skin, mucous membranes
     Gut, mouth, etc.
     Normal flora: >200 species, >trillions of individuals
     Flora-disease, flora-non-optimal health associations
   Timeline:
     Inexpensive research sequencing: today
     Research & associations: within next 10 years
     Personalized sequencing: 2015+
   Find diversity of organisms living in different environments
     Hard to isolate
     Assembly of all organisms at once

Some future directions for sequencing
3. Organism sequencing
   Sequence a large fraction of all organisms
   Deduce ancestors
     Reconstruct ancestral genomes
     Synthesize ancestral genomes
     Clone: Jurassic Park!
   Study evolution of function
     Find functional elements within a genome
     How those evolved in different organisms
     Find how modules/machines composed of many genes evolved

Phylogeny Tree Reconstruction

Phylogenetic Trees
Nodes: species
Edges: time of independent evolution
Edge length represents evolution time
  - AKA genetic distance
  - Not necessarily chronological time

Inferring Phylogenetic Trees
Trees can be inferred by several criteria:
  - Morphology of the organisms
    Can lead to mistakes!
  - Sequence comparison
Example:
  Orc:    ACAGTGACGCCCCAAACGT
  Elf:    ACAGTGACGCTACAAACGT
  Dwarf:  CCTGTGACGTAACAAACGA
  Hobbit: CCTGTGACGTAGCAAACGA
  Human:  CCTGTGACGTAGCAAACGA

Modeling Evolution
During infinitesimal time Δt, there is not enough time for two substitutions to happen on the same nucleotide
So we can estimate P(x | y, Δt), for x, y ∈ {A, C, G, T}
Then let

           | P(A|A, Δt)  ...  P(A|T, Δt) |
  S(Δt) =  |    ...      ...     ...     |
           | P(T|A, Δt)  ...  P(T|T, Δt) |

[Figure: nucleotide y changing to x over time Δt]

Modeling Evolution
Reasonable assumption: multiplicative (implying a stationary Markov process)
  S(t + t') = S(t) S(t')
That is to say, P(x | y, t + t') = Σ_z P(x | z, t) P(z | y, t')

Jukes-Cantor: constant rate of evolution α between any two of A, C, G, T
For short time ε,

                     | 1-3αε    αε      αε      αε   |
  S(ε) = I + Rε  =   |  αε     1-3αε    αε      αε   |
                     |  αε      αε     1-3αε    αε   |
                     |  αε      αε      αε     1-3αε |

Modeling Evolution
Jukes-Cantor: For longer times,

          | r(t)  s(t)  s(t)  s(t) |
  S(t) =  | s(t)  r(t)  s(t)  s(t) |
          | s(t)  s(t)  r(t)  s(t) |
          | s(t)  s(t)  s(t)  r(t) |

where r(t) and s(t) can be derived as follows:
  S(t + ε) = S(t) S(ε) = S(t)(I + Rε)
  Therefore, (S(t + ε) - S(t)) / ε = S(t) R
  At the limit of ε → 0, S'(t) = S(t) R
  Equivalently,
    r' = -3αr + 3αs
    s' = -αs + αr
Those differential equations lead to:
  r(t) = ¼ (1 + 3 e^(-4αt))
  s(t) = ¼ (1 - e^(-4αt))
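
The closed forms for r(t) and s(t) can be checked numerically against the multiplicative assumption S(t + t') = S(t) S(t'). The sketch below is only an illustration: the function names and the values of t and α are arbitrary choices, not part of the lecture.

```python
import math

def jukes_cantor(t, alpha):
    """Return the 4x4 Jukes-Cantor matrix S(t), rows/columns in the order A, C, G, T."""
    e = math.exp(-4.0 * alpha * t)
    r = 0.25 * (1.0 + 3.0 * e)   # probability that the nucleotide is unchanged
    s = 0.25 * (1.0 - e)         # probability of each specific substitution
    return [[r if i == j else s for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Arbitrary illustrative values: S(0.3) S(0.5) should equal S(0.8).
S1 = jukes_cantor(0.3, alpha=0.1)
S2 = jukes_cantor(0.5, alpha=0.1)
S12 = jukes_cantor(0.8, alpha=0.1)
product = matmul(S1, S2)
max_gap = max(abs(product[i][j] - S12[i][j]) for i in range(4) for j in range(4))
print(max_gap)  # expected to be zero up to floating-point rounding
```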

Modeling Evolution
Kimura:
  Transitions: A/G, C/T
  Transversions: A/T, A/C, G/T, C/G
Transitions (rate α) are much more likely than transversions (rate β)

          | r(t)  s(t)  u(t)  s(t) |
  S(t) =  | s(t)  r(t)  s(t)  u(t) |
          | u(t)  s(t)  r(t)  s(t) |
          | s(t)  u(t)  s(t)  r(t) |

(rows and columns in the order A, C, G, T), where
  s(t) = ¼ (1 - e^(-4βt))
  u(t) = ¼ (1 + e^(-4βt) - 2 e^(-2(α+β)t))
  r(t) = 1 - 2s(t) - u(t)
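
A small sketch of the Kimura matrix follows, assuming the standard two-parameter form in which the transition term carries a factor of 2, so that u(0) = 0. The function name and the numeric values are illustrative assumptions.

```python
import math

def kimura(t, alpha, beta):
    """Kimura two-parameter matrix S(t), rows/columns in the order A, C, G, T.

    alpha is the transition rate (A/G, C/T), beta the transversion rate.
    """
    s = 0.25 * (1.0 - math.exp(-4.0 * beta * t))              # each transversion
    u = 0.25 * (1.0 + math.exp(-4.0 * beta * t)
                - 2.0 * math.exp(-2.0 * (alpha + beta) * t))  # the transition
    r = 1.0 - 2.0 * s - u                                     # no change
    return [[r, s, u, s],
            [s, r, s, u],
            [u, s, r, s],
            [s, u, s, r]]

# Illustrative rates: transitions five times faster than transversions.
K = kimura(0.1, alpha=1.0, beta=0.2)
print(K[0][2] > K[0][1])  # the A->G transition is more likely than the A->C transversion
```

With α = β the matrix reduces to the Jukes-Cantor form, since u(t) then equals s(t).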

Phylogeny and sequence comparison
Basic principles:
  Degree of sequence difference is proportional to length of independent sequence evolution
  Only use positions where alignment is pretty certain; avoid areas with (too many) gaps

Distance between two sequences
Given sequences x_i, x_j, define d_ij = distance between the two sequences
One possible definition:
  d_ij = fraction f of sites u where x_i[u] ≠ x_j[u]
Better model (Jukes-Cantor):
  f = 3 s(t) = ¾ (1 - e^(-4αt))
  ⇒ ¾ e^(-4αt) = ¾ - f
  ⇒ log(e^(-4αt)) = log(1 - 4/3 f)
  ⇒ -4αt = log(1 - 4/3 f)
  d_ij = t = -¼ α^(-1) log(1 - 4/3 f)
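
Both definitions of distance can be sketched in a few lines. The function names and the default α = 1 are assumptions made for illustration. The example reuses the Orc and Elf sequences from the earlier slide, which are treated here as already aligned without gaps.

```python
import math

def fraction_differing(x, y):
    """Fraction f of aligned sites at which the two sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

def jukes_cantor_distance(f, alpha=1.0):
    """Jukes-Cantor corrected distance t = -(1 / (4 alpha)) log(1 - 4f/3)."""
    if f >= 0.75:
        raise ValueError("the correction is undefined for f >= 3/4")
    return -math.log(1.0 - 4.0 * f / 3.0) / (4.0 * alpha)

orc = "ACAGTGACGCCCCAAACGT"
elf = "ACAGTGACGCTACAAACGT"
f = fraction_differing(orc, elf)
print(f, jukes_cantor_distance(f))
```

The corrected distance grows faster than f because it accounts for repeated substitutions at the same site.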

A simple clustering method for building a tree
UPGMA (unweighted pair group method using arithmetic averages), also called the Average Linkage Method

Given two disjoint clusters C_i, C_j of sequences,

  d_ij = (1 / (|C_i| |C_j|)) Σ_{p ∈ C_i, q ∈ C_j} d_pq

Claim: if C_k = C_i ∪ C_j, then the distance to another cluster C_l is

  d_kl = (d_il |C_i| + d_jl |C_j|) / (|C_i| + |C_j|)

Proof:
  d_kl = (Σ_{C_i,C_l} d_pq + Σ_{C_j,C_l} d_pq) / ((|C_i| + |C_j|) |C_l|)
       = (|C_i| / (|C_i| |C_l|) Σ_{C_i,C_l} d_pq + |C_j| / (|C_j| |C_l|) Σ_{C_j,C_l} d_pq) / (|C_i| + |C_j|)
       = (|C_i| d_il + |C_j| d_jl) / (|C_i| + |C_j|)

Algorithm: Average Linkage
Initialization:
  Assign each x_i into its own cluster C_i
  Define one leaf per sequence, height 0
Iteration:
  Find two clusters C_i, C_j s.t. d_ij is min
  Let C_k = C_i ∪ C_j
  Define node connecting C_i, C_j, & place it at height d_ij / 2
  Delete C_i, C_j
Termination:
  When two clusters i, j remain, place root at height d_ij / 2
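
The average linkage steps can be sketched as below. This is a minimal illustration, not a reference implementation: clusters are represented as nested tuples, the termination step is folded into the main loop, and the five-leaf distance matrix is an assumed ultrametric example.

```python
from itertools import combinations

def upgma(labels, d):
    """Average-linkage clustering.

    labels: leaf names. d: dict mapping a pair (a, b) of leaves to their distance.
    Returns (tree, root_height), where tree is a nested tuple of leaf names.
    """
    dist = {frozenset(pair): value for pair, value in d.items()}
    size = {name: 1 for name in labels}       # number of leaves in each cluster
    height = {name: 0.0 for name in labels}   # leaves sit at height 0
    active = list(labels)
    while len(active) > 1:
        # Find the two closest clusters.
        i, j = min(combinations(active, 2), key=lambda p: dist[frozenset(p)])
        dij = dist[frozenset((i, j))]
        k = (i, j)                            # new node joining C_i and C_j
        size[k] = size[i] + size[j]
        height[k] = dij / 2.0
        active = [c for c in active if c not in (i, j)]
        for l in active:
            # d_kl = (d_il |C_i| + d_jl |C_j|) / (|C_i| + |C_j|)
            dist[frozenset((k, l))] = (
                dist[frozenset((i, l))] * size[i]
                + dist[frozenset((j, l))] * size[j]
            ) / size[k]
        active.append(k)
    root = active[0]
    return root, height[root]

# Assumed example distances (ultrametric by construction).
labels = ["v", "w", "x", "y", "z"]
d = {("v", "w"): 6, ("v", "x"): 8, ("v", "y"): 8, ("v", "z"): 8,
     ("w", "x"): 8, ("w", "y"): 8, ("w", "z"): 8,
     ("x", "y"): 4, ("x", "z"): 4, ("y", "z"): 2}
tree, root_height = upgma(labels, d)
print(tree, root_height)
```

Here y and z merge first, then x joins them, then v and w merge, and the last merge places the root.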

Example

        v  w  x  y  z
    v   0  6  8  8  8
    w      0  8  8  8
    x         0  4  4
    y            0  2
    z               0

        v  w  x  yz
    v   0  6  8  8
    w      0  8  8
    x         0  4
    yz           0

        v  w  xyz
    v   0  6  8
    w      0  8
    xyz       0

        vw  xyz
    vw  0   8
    xyz     0

[Figure: resulting tree, with leaves in the order y, z, x, w, v]

Ultrametric Distances and Molecular Clock
Definition: A distance function d(.,.) is ultrametric if for any three distances d_ij ≤ d_ik ≤ d_jk, it is true that d_ij ≤ d_ik = d_jk
The Molecular Clock: The evolutionary distance between species x and y is 2 × the Earth time to reach the nearest common ancestor
That is, the molecular clock has constant rate in all species
The molecular clock results in ultrametric distances
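
The definition translates directly into a check over all triples of leaves: the two largest of the three pairwise distances must be equal. The sketch below is an illustration, and the function name, tolerance, and example distances are assumptions.

```python
from itertools import combinations

def is_ultrametric(labels, d, tol=1e-9):
    """True if in every triple the two largest pairwise distances are equal.

    d maps frozenset({a, b}) to the distance between leaves a and b.
    """
    for a, b, c in combinations(labels, 3):
        smallest, middle, largest = sorted(
            (d[frozenset((a, b))], d[frozenset((a, c))], d[frozenset((b, c))])
        )
        if abs(largest - middle) > tol:
            return False
    return True

# Assumed example: distances consistent with a molecular clock.
labels = ["v", "w", "x", "y", "z"]
pairs = {("v", "w"): 6, ("v", "x"): 8, ("v", "y"): 8, ("v", "z"): 8,
         ("w", "x"): 8, ("w", "y"): 8, ("w", "z"): 8,
         ("x", "y"): 4, ("x", "z"): 4, ("y", "z"): 2}
clock_like = {frozenset(p): value for p, value in pairs.items()}
print(is_ultrametric(labels, clock_like))
```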

Ultrametric Distances & Average Linkage
Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances
Proof: Exercise

Weakness of Average Linkage
Molecular clock: all species evolve at the same rate (Earth time)
However, certain species (e.g., mouse, rat) evolve much faster
Example where UPGMA messes up:
[Figure: the correct tree next to the tree produced by average linkage (AL)]

Additive Distances
Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them
Given a tree T & additive distances d_ij, can uniquely reconstruct edge lengths:
  Find two neighboring leaves i, j, with common parent k
  Place parent node k at distance d_km = ½ (d_im + d_jm - d_ij) from any node m ≠ i, j

Additive Distances
For any four leaves x, y, z, w, consider the three sums
  d(x, y) + d(z, w)
  d(x, z) + d(y, w)
  d(x, w) + d(y, z)
One of them is smaller than the other two, which are equal:
  d(x, y) + d(z, w) < d(x, z) + d(y, w) = d(x, w) + d(y, z)
[Figure: four-leaf tree with x, y on one side of the internal edge and z, w on the other]
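
The four-point condition can be tested by enumerating every set of four leaves and comparing the two largest of the three sums. The following is a sketch under assumptions: the function name and tolerance are arbitrary, and the distance matrix is an illustrative additive example on five leaves.

```python
from itertools import combinations

def satisfies_four_point(labels, d, tol=1e-9):
    """True if, for every four leaves, the two largest of the three sums are equal.

    d maps frozenset({a, b}) to the distance between leaves a and b.
    """
    def dist(a, b):
        return d[frozenset((a, b))]

    for x, y, z, w in combinations(labels, 4):
        sums = sorted((dist(x, y) + dist(z, w),
                       dist(x, z) + dist(y, w),
                       dist(x, w) + dist(y, z)))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Illustrative additive distances on five leaves (an assumed example).
labels = ["v", "w", "x", "y", "z"]
pairs = {("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
         ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
         ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14}
additive = {frozenset(p): value for p, value in pairs.items()}
print(satisfies_four_point(labels, additive))
```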

Reconstructing Additive Distances Given T
If we know T and D, but do not know the length of each edge, we can reconstruct those lengths

[Figure: tree T with leaves x, y, z, w, v]

D:
        v   w   x   y   z
    v   0  10  17  16  16
    w       0  15  14  14
    x           0   9  15
    y               0  14
    z                   0


Reconstructing Additive Distances Given T
Neighbors v, w are replaced by their parent a:
  d_ax = ½ (d_vx + d_wx - d_vw) = ½ (17 + 15 - 10) = 11
  d_ay = ½ (d_vy + d_wy - d_vw) = ½ (16 + 14 - 10) = 10
  d_az = ½ (d_vz + d_wz - d_vw) = ½ (16 + 14 - 10) = 10

D1:
        a   x   y   z
    a   0  11  10  10
    x       0   9  15
    y           0  14
    z               0

Reconstructing Additive Distances Given T

D1:
        a   x   y   z
    a   0  11  10  10
    x       0   9  15
    y           0  14
    z               0

D2 (x, y replaced by their parent b):
        a   b   z
    a   0   6  10
    b       0  10
    z           0

D3 (b, z replaced by their parent c):
        a   c
    a   0   3
    c       0

Edge lengths:
  d(a, c) = 3
  d(b, c) = d(a, b) - d(a, c) = 3
  d(c, z) = d(a, z) - d(a, c) = 7
  d(b, x) = d(a, x) - d(a, b) = 5
  d(b, y) = d(a, y) - d(a, b) = 4
  d(a, w) = d(z, w) - d(a, z) = 4
  d(a, v) = d(z, v) - d(a, z) = 6
Correct!!!
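
The collapsing step used in this example, d_km = ½ (d_im + d_jm - d_ij), can be sketched as a function that replaces two neighboring leaves by their parent. The helper name is hypothetical. The v and w entries of the matrix below are the values implied by the edge lengths listed in the example, which is an assumption about the original figure.

```python
def collapse_neighbors(d, i, j, parent):
    """Replace neighboring leaves i, j by their parent in an additive distance table.

    d maps frozenset({a, b}) to a distance; a new dict is returned.
    """
    others = {name for pair in d for name in pair} - {i, j}
    dij = d[frozenset((i, j))]
    # Keep every distance that does not involve i or j.
    out = {pair: value for pair, value in d.items() if i not in pair and j not in pair}
    for m in others:
        # d_km = 1/2 (d_im + d_jm - d_ij)
        out[frozenset((parent, m))] = 0.5 * (
            d[frozenset((i, m))] + d[frozenset((j, m))] - dij
        )
    return out

pairs = {("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
         ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
         ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14}
D = {frozenset(p): value for p, value in pairs.items()}
D1 = collapse_neighbors(D, "v", "w", "a")
D2 = collapse_neighbors(D1, "x", "y", "b")
print(D1[frozenset(("a", "x"))], D2[frozenset(("a", "b"))])
```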

Neighbor-Joining
Guaranteed to produce the correct tree if distance is additive
May produce a good tree even when distance is not additive
Step 1: Finding neighboring leaves
  Define D_ij = (N - 2) d_ij - Σ_{k≠i} d_ik - Σ_{k≠j} d_jk
  Claim: The above "magic trick" ensures that D_ij is minimal iff i, j are neighbors

Algorithm: Neighbor-joining
Initialization:
  Define T to be the set of leaf nodes, one per sequence
  Let L = T
Iteration:
  Pick i, j s.t. D_ij is minimal
  Define a new node k, and set d_km = ½ (d_im + d_jm - d_ij) for all m ∈ L
  Add k to T, with edges of lengths d_ik = ½ (d_ij + r_i - r_j), d_jk = d_ij - d_ik,
    where r_i = (1 / (|L| - 2)) Σ_{k ∈ L} d_ik
  Remove i, j from L; Add k to L
Termination:
  When L consists of two nodes, i, j, add the edge between them, of length d_ij
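
A compact sketch of neighbor-joining is given below. It assumes the usual definition r_i = (1 / (|L| - 2)) Σ_k d_ik for the averaged row sums, names internal nodes node0, node1, and so on (a hypothetical convention), and breaks ties by taking the first minimal pair. The input is an illustrative additive matrix.

```python
def neighbor_joining(labels, d):
    """Return the edges (node, node, length) of the neighbor-joining tree.

    d maps frozenset({a, b}) to the distance between leaves a and b.
    """
    dist = dict(d)
    active = list(labels)
    edges = []
    next_id = 0

    def get(a, b):
        return dist[frozenset((a, b))]

    while len(active) > 2:
        n = len(active)
        total = {a: sum(get(a, b) for b in active if b != a) for a in active}
        # Pick i, j minimizing D_ij = (N - 2) d_ij - sum_k d_ik - sum_k d_jk.
        best = None
        for p in range(n):
            for q in range(p + 1, n):
                a, b = active[p], active[q]
                score = (n - 2) * get(a, b) - total[a] - total[b]
                if best is None or score < best[0]:
                    best = (score, a, b)
        _, i, j = best
        k = f"node{next_id}"
        next_id += 1
        dij = get(i, j)
        r_i, r_j = total[i] / (n - 2), total[j] / (n - 2)
        d_ik = 0.5 * (dij + r_i - r_j)
        edges.append((i, k, d_ik))
        edges.append((j, k, dij - d_ik))
        active = [m for m in active if m not in (i, j)]
        for m in active:
            dist[frozenset((k, m))] = 0.5 * (get(i, m) + get(j, m) - dij)
        active.append(k)
    i, j = active
    edges.append((i, j, get(i, j)))   # termination: join the last two nodes
    return edges

labels = ["v", "w", "x", "y", "z"]
pairs = {("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
         ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
         ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14}
D = {frozenset(p): value for p, value in pairs.items()}
edges = neighbor_joining(labels, D)
for edge in edges:
    print(edge)
```

The result is an unrooted tree with 2N - 3 edges. On additive input the leaf edge lengths match the ones found by direct reconstruction.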

Parsimony: direct method not using distances
One of the most popular methods:
  - GIVEN multiple alignment
  - FIND tree & history of substitutions explaining alignment
Idea: Find the tree that explains the observed sequences with a minimal number of substitutions
Two computational subproblems:
  1. Find the parsimony cost of a given tree (easy)
  2. Search through all tree topologies (hard)

Example: Parsimony cost of one column
[Figure: parsimony scoring of a column with leaves A, B, A, A; one internal node is labeled {A, B} with cost C += 1, the other internal nodes are labeled {A}; final cost C = 1]

Parsimony Scoring
Given a tree, and an alignment column u
Label internal nodes to minimize the number of required substitutions
Initialization:
  Set cost C = 0; node k = 2N - 1 (the root)
Iteration:
  If k is a leaf, set R_k = { x_k[u] }   // R_k is simply the character of the kth species
  If k is not a leaf,
    Let i, j be the daughter nodes;
    Set R_k = R_i ∩ R_j if the intersection is nonempty
    Set R_k = R_i ∪ R_j, and C += 1, if the intersection is empty
Termination:
  Minimal cost of tree for column u = C
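
The scoring rule can be sketched recursively instead of with node indices. The tree encoding (nested 2-tuples of leaf names), the function name, and the leaf names s1 to s4 are assumptions made for this illustration.

```python
def parsimony_cost(tree, column):
    """Return (R_k, cost) for the subtree `tree` and one alignment column.

    tree: a leaf name, or a 2-tuple (left, right) of subtrees.
    column: dict mapping each leaf name to its character in this column.
    """
    if not isinstance(tree, tuple):
        return {column[tree]}, 0                 # leaf: R_k is its own character
    left_set, left_cost = parsimony_cost(tree[0], column)
    right_set, right_cost = parsimony_cost(tree[1], column)
    common = left_set & right_set
    if common:                                   # nonempty intersection: no new cost
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

tree = (("s1", "s2"), ("s3", "s4"))
root_set, cost = parsimony_cost(tree, {"s1": "A", "s2": "B", "s3": "A", "s4": "A"})
print(root_set, cost)
```

For leaves A, B, A, A the only union is taken at the parent of s1 and s2, so the cost of the column is 1.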

Example
[Figure: example trees for the leaf columns AAAB and BABA, with node sets {A}, {B} and {A, B}]

Traceback to find ancestral nucleotides
Traceback:
  1. Choose an arbitrary nucleotide from R_{2N-1} for the root
  2. Having chosen nucleotide r for parent k,
       If r ∈ R_i, choose r for daughter i
       Else, choose arbitrary nucleotide from R_i
Easy to see that this traceback produces some assignment of cost C

Example
[Figure: tree with leaves A, B, A, B and node sets {A, B}, {A}, {B}; three labelings of the internal nodes (A A A, A B A, B B B), each with two substitutions marked x, grouped under "Admissible with Traceback" and "Still optimal, but inadmissible with Traceback"]

Probabilistic Methods
A more refined measure of evolution along a tree than parsimony
  P(x_1, x_2, x_root | t_1, t_2) = P(x_root) P(x_1 | t_1, x_root) P(x_2 | t_2, x_root)
If we use Jukes-Cantor, for example, and x_1 = x_root = A, x_2 = C, t_1 = t_2 = 1,
  = p_A × ¼ (1 + 3 e^(-4α)) × ¼ (1 - e^(-4α))
  = (¼)^3 (1 + 3 e^(-4α)) (1 - e^(-4α))
[Figure: root x_root with children x_1 and x_2 on branches of length t_1 and t_2]

Probabilistic Methods
If we know all internal labels x_u,
  P(x_1, x_2, …, x_N, x_{N+1}, …, x_{2N-1} | T, t) = P(x_root) Π_{j ≠ root} P(x_j | x_parent(j), t_{j,parent(j)})
Usually we don't know the internal labels, therefore
  P(x_1, x_2, …, x_N | T, t) = Σ_{x_{N+1}} Σ_{x_{N+2}} … Σ_{x_{2N-1}} P(x_1, x_2, …, x_{2N-1} | T, t)
[Figure: tree with root x_root, internal node x_u, and leaves x_1, x_2, …, x_N]

Felsenstein's Likelihood Algorithm
To calculate P(x_1, x_2, …, x_N | T, t)
Let P(L_k | a) denote the prob. of all the leaves below node k, given that the residue at k is a
Initialization:
  Set k = 2N - 1
Iteration: Compute P(L_k | a) for all a ∈ Σ
  If k is a leaf node:
    Set P(L_k | a) = 1(a = x_k)
  If k is not a leaf node:
    1. Compute P(L_i | b), P(L_j | b) for all b, for daughter nodes i, j
    2. Set P(L_k | a) = Σ_{b,c} P(b | a, t_i) P(L_i | b) P(c | a, t_j) P(L_j | c)
Termination:
  Likelihood at this column = P(x_1, x_2, …, x_N | T, t) = Σ_a P(L_{2N-1} | a) P(a)
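
The recursion can be sketched for a single column as follows. This is an illustration under assumptions: substitution probabilities come from Jukes-Cantor, the prior P(a) is uniform, and the tree encoding (an internal node is a pair of (subtree, branch length) pairs) and all function names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def jc_prob(b, a, t, alpha):
    """Jukes-Cantor P(b | a, t)."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e) if a == b else 0.25 * (1.0 - e)

def conditional(tree, column, alpha):
    """Return {a: P(L_k | a)} for the node at the top of `tree`.

    A leaf is a name; an internal node is ((left, t_left), (right, t_right)).
    """
    if not isinstance(tree, tuple):
        return {a: 1.0 if a == column[tree] else 0.0 for a in ALPHABET}
    (left, t_i), (right, t_j) = tree
    p_i = conditional(left, column, alpha)
    p_j = conditional(right, column, alpha)
    result = {}
    for a in ALPHABET:
        # The double sum over b, c factorizes into a product of two single sums.
        from_left = sum(jc_prob(b, a, t_i, alpha) * p_i[b] for b in ALPHABET)
        from_right = sum(jc_prob(c, a, t_j, alpha) * p_j[c] for c in ALPHABET)
        result[a] = from_left * from_right
    return result

def column_likelihood(tree, column, alpha, prior=0.25):
    """Sum over root residues a of P(L_root | a) P(a), with a uniform prior."""
    return sum(prior * p for p in conditional(tree, column, alpha).values())

# Two leaves x1 = A and x2 = C, both at distance 1 from the root.
tree = (("x1", 1.0), ("x2", 1.0))
column = {"x1": "A", "x2": "C"}
likelihood = column_likelihood(tree, column, alpha=0.1)
print(likelihood)
```

Because Jukes-Cantor is reversible with a uniform stationary distribution, the result for this two-leaf tree equals ¼ P(C | A, t = 2), which gives a convenient sanity check.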

Probabilistic Methods
Given M (ungapped) alignment columns of N sequences,
Define likelihood of a tree:
  L(T, t) = P(Data | T, t) = Π_{m=1…M} P(x_1m, …, x_Nm | T, t)
Maximum Likelihood Reconstruction:
  Given data X = (x_ij), find a topology T and length vector t that maximize likelihood L(T, t)

Current popular methods
HUNDREDS of programs available!
Some recommended programs:
  Discrete (parsimony-based)
    - Rec-I-DCM3, Tandy Warnow and colleagues
  Probabilistic
    - SEMPHY, Nir Friedman and colleagues

Multiple Sequence Alignments

Protein Phylogenies
Proteins evolve by both duplication and species divergence

Protein Phylogenies: Example

Definition
Given N sequences x_1, x_2, …, x_N:
  - Insert gaps (-) in each sequence x_i, such that
      All sequences have the same length L
      Score of the global map is maximum
A faint similarity between two sequences becomes significant if present in many
Multiple alignments can point to elements that are conserved among a class of organisms and therefore important in the biology of these organisms
The patterns of conservation can help us tell function of the element

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Induces:
  x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
  y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

Sum Of Pairs (cont'd)
Heuristic way to incorporate evolution tree:
[Figure: tree with leaves Human, Mouse, Chicken, Duck]
Weighted SOP:
  S(m) = Σ_{k<l} w_kl s(m^k, m^l)
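
Induced pairwise alignments and the (optionally weighted) sum-of-pairs score can be sketched as below. The scoring values (match +1, mismatch -1, gap -2) and the function names are arbitrary assumptions; the slides do not fix a scoring scheme.

```python
from itertools import combinations

def induced_pair(a, b):
    """Project two rows of a multiple alignment, dropping columns where both have gaps."""
    kept = [(p, q) for p, q in zip(a, b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in kept), "".join(q for _, q in kept)

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score a pairwise alignment column by column (linear gap cost)."""
    score = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            score += gap
        elif p == q:
            score += match
        else:
            score += mismatch
    return score

def sum_of_pairs(rows, weights=None):
    """S(m) = sum over k < l of w_kl * s(m^k, m^l); all weights are 1 if none are given."""
    total = 0
    for k, l in combinations(range(len(rows)), 2):
        a, b = induced_pair(rows[k], rows[l])
        w = 1 if weights is None else weights[(k, l)]
        total += w * pair_score(a, b)
    return total

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]   # x, y, z
print(induced_pair(rows[0], rows[1]))
print(sum_of_pairs(rows))
```

Passing a weights dict keyed by index pairs, such as {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 1.0}, gives the weighted variant.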

A Profile Representation
Given a multiple alignment M = m_1 … m_n
  - Replace each column m_i with profile entry p_i
      Frequency of each letter in Σ
      # gaps
      Optional: # gap openings, extensions, closings
  - Can think of this as a "likelihood" of each letter in each position
[Figure: example multiple alignment and its profile, with one row of frequencies per letter A, C, G, T]
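
Building a profile amounts to counting symbols column by column. The sketch below treats the gap as a fifth symbol and ignores the optional gap-opening and extension counts. The function name is an assumption, and the input reuses the three-sequence alignment from the sum-of-pairs example.

```python
from collections import Counter

def build_profile(rows, alphabet="ACGT-"):
    """Return one dict per alignment column, mapping each symbol to its frequency."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({a: counts[a] / len(rows) for a in alphabet})
    return profile

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
profile = build_profile(rows)
print(profile[0])   # frequencies of A, C, G, T and gap in the first column
```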

Multiple Sequence Alignments: Algorithms

Multidimensional DP
Generalization of Needleman-Wunsch:
  S(m) = Σ_i S(m_i)   (sum of column scores)
  F(i_1, i_2, …, i_N): optimal alignment up to (i_1, …, i_N)
  F(i_1, i_2, …, i_N) = max over all neighbors of the cube of (F(nbr) + S(nbr))

Multidimensional DP
Example: in 3D (three sequences): 7 neighbors/cell
  F(i, j, k) = max{ F(i-1, j-1, k-1) + S(x_i, x_j, x_k),
                    F(i-1, j-1, k  ) + S(x_i, x_j, -  ),
                    F(i-1, j,   k-1) + S(x_i, -,   x_k),
                    F(i-1, j,   k  ) + S(x_i, -,   -  ),
                    F(i,   j-1, k-1) + S(-,   x_j, x_k),
                    F(i,   j-1, k  ) + S(-,   x_j, -  ),
                    F(i,   j,   k-1) + S(-,   -,   x_k) }
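
The three-sequence recurrence can be sketched directly. This version returns only the optimal score, uses a sum-of-pairs column score with assumed values (match +1, mismatch -1, gap -2, gap against gap 0), and makes no attempt at efficiency.

```python
from itertools import product

def column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; a gap paired with a gap scores 0."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        if p == "-" or q == "-":
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total

def align3_score(x, y, z):
    """Optimal global alignment score of three sequences by 3D dynamic programming."""
    moves = [m for m in product((0, 1), repeat=3) if any(m)]   # the 7 neighbors
    F = {(0, 0, 0): 0}
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = None
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if min(pi, pj, pk) < 0:
                        continue
                    # A sequence contributes its next character if it advances, else a gap.
                    a = x[pi] if di else "-"
                    b = y[pj] if dj else "-"
                    c = z[pk] if dk else "-"
                    candidate = F[(pi, pj, pk)] + column_score(a, b, c)
                    if best is None or candidate > best:
                        best = candidate
                F[(i, j, k)] = best
    return F[(len(x), len(y), len(z))]

print(align3_score("ACG", "ACG", "ACG"))
```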


Multidimensional DP
Running Time:
  1. Size of matrix: L^N, where L = length of each sequence, N = number of sequences
  2. Neighbors/cell: 2^N - 1
  Therefore: O(2^N L^N)
How do gap states generalize? VERY badly!
  - Require 2^N - 1 states, one per combination of gapped/ungapped sequences
  - Running time: O(2^N × 2^N × L^N) = O(4^N L^N)
[Figure: the gap states for three sequences: XYZ, XY, YZ, XZ, X, Y, Z]


Progressive Alignment
When evolutionary tree is known:
  - Align closest first, in the order of the tree
  - In each step, align two sequences x, y, or profiles p_x, p_y, to generate a new alignment with associated profile p_result
Weighted version:
  - Tree edges have weights, proportional to the divergence in that edge
  - New profile is a weighted average of two old profiles
[Figure: guide tree over x, y, z, w, with profiles p_xy, p_zw and p_xyzw at the internal nodes]
Example
  Profile: (A, C, G, T, -)
  p_x = (0.8, 0.2, 0, 0, 0)
  p_y = (0.6, 0, 0, 0, 0.4)
  s(p_x, p_y) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
  Result: p_xy = (0.7, 0.1, 0, 0, 0.2)
  s(p_x, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
  Result: p_x- = (0.4, 0.1, 0, 0, 0.5)
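
The profile arithmetic in this example can be reproduced with two small functions: an expected column score under a letter-level scoring function, and a weighted average of two profile columns. The scoring function `simple_s` and its values are assumptions chosen only to make the example concrete.

```python
LETTERS = "ACGT-"

def simple_s(a, b):
    """Hypothetical letter-level score: match +1, mismatch -1, letter vs gap -2, gap vs gap 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def profile_score(p, q, s=simple_s):
    """Expected score of aligning profile columns p and q (tuples over A, C, G, T, -)."""
    return sum(p[i] * q[j] * s(LETTERS[i], LETTERS[j])
               for i in range(5) for j in range(5))

def merge_profiles(p, q, w_p=0.5, w_q=0.5):
    """Weighted average of two profile columns; equal weights by default."""
    return tuple(w_p * a + w_q * b for a, b in zip(p, q))

p_x = (0.8, 0.2, 0.0, 0.0, 0.0)
p_y = (0.6, 0.0, 0.0, 0.0, 0.4)
gap = (0.0, 0.0, 0.0, 0.0, 1.0)
print(profile_score(p_x, p_y))
print(merge_profiles(p_x, p_y))   # approximately (0.7, 0.1, 0, 0, 0.2)
print(merge_profiles(p_x, gap))   # approximately (0.4, 0.1, 0, 0, 0.5)
```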

Progressive Alignment
When evolutionary tree is unknown:
  - Perform all pairwise alignments
  - Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
  - Construct a tree (UPGMA / Neighbor Joining / Other methods)
  - Align on the tree
[Figure: unknown tree over x, y, z, w]

Heuristics to improve alignments
  Iterative refinement schemes
  A*-based search
  Consistency
  Simulated Annealing
  …

Iterative Refinement
One problem of progressive alignment:
  Initial alignments are "frozen" even when new evidence comes
Example:
  x: GAAGTT
  y: GAC-TT    Frozen!
  z: GAACTG
  w: GTACTG
Now it is clear that the correct alignment is y = GA-CTT

Iterative Refinement
Algorithm (Barton-Sternberg):
  1. For j = 1 to N, remove x_j, and realign to x_1 … x_{j-1} x_{j+1} … x_N
  2. Repeat step 1 until convergence
[Figure: x and z fixed as a projection; y allowed to vary]

Iterative Refinement
Example: align (x, y), (z, w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA    + 3 matches
  z: GAACTGA
  w: GTACTGA

Iterative Refinement
Example not handled well:
  x:   GAAGTTA
  y_1: GAC-TTA
  y_2: GAC-TTA
  y_3: GAC-TTA
  z:   GAACTGA
  w:   GTACTGA
Realigning any single y_i changes nothing

A* for Multiple Alignments
Review of the A* algorithm
Say that we have a gigantic graph G
  START: start node
  GOAL: we want to reach this node with the minimum path
Dijkstra: O(V log V + E), too slow if the number of edges is huge
A*: a way of finding the optimal solution faster in practice
[Figure: graph with nodes START, v and GOAL]

A* for Multiple Alignments
Review of the A* algorithm
  g(v) is the cost so far
  h(v) is an estimate of the minimum cost from v to GOAL
  f(v) ≥ g(v) + h(v) is the minimum cost of a path passing by v
  1. Expand v with the smallest f(v)
  2. Never expand v, if f(v) ≥ shortest path to the goal found so far
Lemma
  Given sequences x, y, z, …, the sum-of-pairs score of multiple alignment M is no better than the sum of the scores of the optimal pairwise alignments
Proof
  M induces projected pairwise alignments a_xy, a_yz, a_xz, …, and
  Score(M) = d(a_xy) + d(a_xz) + d(a_yz) + …
  Each d(.) is no better than the score of the corresponding optimal pairwise alignment
[Figure: path from START to v with cost g(v), and from v to GOAL with estimated cost h(v)]
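
A generic A* search can be sketched with a priority queue ordered by g(v) + h(v). This is a minimal illustration, assuming that h never overestimates the remaining cost and that h(GOAL) = 0. The toy graph and heuristic values are invented for the example.

```python
import heapq

def a_star(start, goal, neighbors, h):
    """Return the minimum path cost from start to goal, or None if unreachable.

    neighbors(v) yields (w, edge_cost) pairs; h(v) is a lower bound on the
    remaining cost from v to goal. Nodes must be comparable (used to break ties).
    """
    g = {start: 0}                       # best known cost so far to each node
    frontier = [(h(start), start)]       # entries are (g + h, node)
    while frontier:
        f, v = heapq.heappop(frontier)
        if v == goal:
            return g[v]
        if f > g[v] + h(v):              # stale entry: a cheaper path was found later
            continue
        for w, cost in neighbors(v):
            new_g = g[v] + cost
            if new_g < g.get(w, float("inf")):
                g[w] = new_g
                heapq.heappush(frontier, (new_g + h(w), w))
    return None

# Toy graph (hypothetical): the cheapest route is S -> A -> B -> G with cost 3.
graph = {"S": [("A", 1), ("B", 4)], "A": [("B", 1), ("G", 5)], "B": [("G", 1)], "G": []}
lower_bound = {"S": 2, "A": 2, "B": 1, "G": 0}
best = a_star("S", "G", lambda v: graph[v], lambda v: lower_bound[v])
print(best)
```

For multiple alignment the nodes would be cells of the DP matrix. Since alignment scores are maximized rather than minimized, the same search applies after negating scores or switching to a cost formulation.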

A* for Multiple Alignments
Nodes: Cells in the DP matrix
g(v): alignment cost so far
h(v): sum-of-pairs of individual pairwise alignments
Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments
To compute h(v):
  For each pair of sequences x, y,
    Compute F^R(x, y), the DP matrix of scores of aligning a suffix of x to a suffix of y
  Then, at position (i_1, i_2, …, i_N), h(v) becomes the sum of (N choose 2) F^R scores
[Figure: path from START to v with cost g(v), and from v to GOAL with estimated cost h(v)]

Consistency
[Figure: sequences x, y, z with position x_i aligned to z_k, and z_k aligned to y_j rather than y_j']

Consistency
Basic method for applying consistency
  Compute all pairs of alignments xy, xz, yz, …
  When aligning x, y during progressive alignment,
    - For each (x_i, y_j), let s(x_i, y_j) = function_of(x_i, y_j, a_xz, a_yz)
    - Align x and y with DP using the modified s(.,.) function
[Figure: sequences x, y, z with positions x_i, y_j, y_j' and z_k]

Some Resources
Genome Resources
  Annotation and alignment genome browser at UCSC
  Specialized VISTA alignment browser at LBNL
  ABC: nice Stanford tool for browsing alignments
Protein Multiple Aligners
  CLUSTALW: most widely used
  MUSCLE: most scalable
  PROBCONS: most accurate