CS262 Lecture 9, Win07, Batzoglou: Real-world protein aligners


1 Real-world protein aligners
MUSCLE
  - High throughput
  - One of the best in accuracy
ProbCons
  - High accuracy
  - Reasonable speed

2 MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences: $D_{\mathrm{DRAFT}}(x, y)$ is defined in terms of the number of common k-mers (k ~ 3); $O(N^2 L \log L)$ time
2. Build tree $T_{\mathrm{DRAFT}}$ based on those distances, with UPGMA
3. Progressive alignment over $T_{\mathrm{DRAFT}}$, resulting in multiple alignment $M_{\mathrm{DRAFT}}$
4. Measure new Kimura-based distances $D(x, y)$ based on $M_{\mathrm{DRAFT}}$
5. Build tree T based on D
6. Progressive alignment over T, to build M; only perform alignment steps for the parts of the tree that have changed
7. Iterative refinement; for many rounds, do:
   - Tree partitioning: split M on one branch and realign the two resulting profiles
   - If the new alignment M' has a better sum-of-pairs score than the previous one, accept it
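As a rough illustration of step 1, the sketch below counts shared k-mers between two sequences and turns the count into a distance. The normalization (shared k-mers divided by the number of k-mers in the shorter sequence) and the function names are assumptions made for this example; the slide only says that the draft distance is defined in terms of common k-mers, so this is not MUSCLE's exact formula.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=3):
    """Hypothetical draft distance: 1 - (shared k-mers / k-mers in the shorter sequence)."""
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    denom = min(len(x), len(y)) - k + 1
    if denom <= 0:
        return 1.0  # sequences too short to contain a k-mer
    return 1.0 - shared / denom


def distance_matrix(seqs, k=3):
    """All pairwise draft distances, as a list of lists."""
    return [[kmer_distance(a, b, k) for b in seqs] for a in seqs]
```

A matrix produced this way could then be handed to any UPGMA implementation to build the draft tree.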

3 PROBCONS at a glance
1. Computation of all posterior matrices $M_{xy}$: $M_{xy}(i, j) = \mathrm{Prob}(x_i \sim y_j)$, using an HMM
2. Re-estimation of posterior matrices $M'_{xy}$ with probabilistic consistency:
   $M'_{xy}(i, j) = \frac{1}{N} \sum_{\text{sequence } z} \sum_{k} M_{xz}(i, k) \, M_{yz}(j, k)$; in matrix form, $M'_{xy} = \mathrm{Avg}_z(M_{xz} M_{zy})$
3. Compute, for every pair x, y, the maximum expected accuracy alignment $A_{xy}$: the alignment that maximizes $\sum_{\text{aligned } (i, j) \in A} M'_{xy}(i, j)$. Define $E(x, y) = \sum_{\text{aligned } (i, j) \in A_{xy}} M'_{xy}(i, j)$
4. Build tree T with hierarchical clustering, using similarity measure E(x, y)
5. Progressive alignment on T to maximize E(., .)
6. Iterative refinement; for many rounds, do:
   - Randomized partitioning: split the sequences in M into two subsets by flipping a coin for each sequence, and realign the two resulting profiles
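The consistency step (step 2) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: posterior matrices are stored as nested lists in a dictionary keyed by sequence-name pairs, $M_{zz}$ is taken to be the identity so that z = x and z = y contribute the original $M_{xy}$, and $M_{yx}$ is the transpose of $M_{xy}$. The function and variable names are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def pair_matrix(post, lengths, a, b):
    """Return M_ab; identity if a == b (assumption), transpose if only M_ba is stored."""
    if a == b:
        n = lengths[a]
        return [[1.0 if i == k else 0.0 for k in range(n)] for i in range(n)]
    if (a, b) in post:
        return post[(a, b)]
    return transpose(post[(b, a)])


def consistency_transform(post, lengths, x, y):
    """M'_xy(i, j) = (1/N) * sum over z, k of M_xz(i, k) * M_yz(j, k)."""
    out = [[0.0] * lengths[y] for _ in range(lengths[x])]
    for z in lengths:
        m_xz = pair_matrix(post, lengths, x, z)
        m_yz = pair_matrix(post, lengths, y, z)
        for i in range(lengths[x]):
            for j in range(lengths[y]):
                out[i][j] += sum(m_xz[i][k] * m_yz[j][k] for k in range(lengths[z]))
    n = len(lengths)
    return [[v / n for v in row] for row in out]
```

With three one-residue sequences where x and y each align to z with probability 1 but to each other with probability 0.5, the transformed value rises to 2/3, which is the intended effect: evidence from z strengthens the x-y match.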

4 Some Resources
Genome Resources
  - Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
  - Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2
  - ABC, a nice Stanford tool for browsing alignments: http://encode.stanford.edu/~asimenos/ABC/
Protein Multiple Aligners
  - CLUSTALW, most widely used: http://www.ebi.ac.uk/clustalw/
  - MUSCLE, most scalable: http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
  - PROBCONS, most accurate: http://probcons.stanford.edu/

5 Rapid Global Alignments
How to align genomic sequences in (more or less) linear time

7 Motivation
Genomic sequences are very long:
  - Human genome: length 3 × 10^9
  - Mouse genome: length 2.7 × 10^9
Aligning genomic regions is useful for revealing common gene structure
  - It is useful to compare regions longer than 1,000,000

8 The UCSC Browser
http://genome.ucsc.edu/cgi-bin/hgGateway

9 Main Idea
Genomic regions of interest contain islands of similarity, such as genes
1. Find local alignments
2. Chain an optimal subset of them
3. Refine/complete the alignment
Systems that use this idea to various degrees: MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, and others

10 Saving cells in DP
1. Find local alignments
2. Chain: O(N log N) L.I.S.
3. Restricted DP

11 Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)

12 The Problem: Find a Chain of Local Alignments
(x, y) → (x', y') requires x < x' and y < y'
Each local alignment has a weight
FIND the chain with highest total weight

13 Quadratic Time Solution
Build a Directed Acyclic Graph (DAG):
  - Nodes: local alignments, each spanning $(x_a, x_b)$ in x and $(y_a, y_b)$ in y, with a score
  - Directed edges: local alignments that can be chained; edge $((x_a, x_b, y_a, y_b), (x_c, x_d, y_c, y_d))$ requires $x_a < x_b < x_c < x_d$ and $y_a < y_b < y_c < y_d$
Each local alignment is a node $v_i$ with alignment score $s_i$

14 Quadratic Time Solution
Initialization: find each node $v_a$ such that there is no edge $(u, v_a)$; set the score $V(a)$ to be $s_a$
Iteration: for each $v_i$, the optimal path ending in $v_i$ has total score
  $V(i) = \max_{j \,:\, \text{there is an edge } (v_j, v_i)} \big( \mathrm{weight}(v_j, v_i) + V(j) \big)$
Termination: optimal global chain: $j = \arg\max_j V(j)$; trace the chain back from $v_j$
Worst-case time: quadratic
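The quadratic recurrence can be sketched directly, without building the DAG explicitly. The assumptions here: each local alignment is a tuple (x_a, x_b, y_a, y_b, score), and weight(v_j, v_i) is taken to be the score s_i of the alignment being appended. The function name and the sample data are invented for illustration.

```python
def chain_quadratic(alns):
    """Return (best_score, chain) where chain lists indices into alns.

    alns: list of (x_a, x_b, y_a, y_b, score) with x_a < x_b and y_a < y_b.
    """
    if not alns:
        return 0, []
    # Any valid predecessor ends before i starts, so it also starts earlier:
    # processing by increasing x_a guarantees V(j) is final before it is used.
    order = sorted(range(len(alns)), key=lambda i: alns[i][0])
    V, back = {}, {}
    for pos, i in enumerate(order):
        xa, _, ya, _, s = alns[i]
        V[i], back[i] = s, None  # initialization: chain consisting of i alone
        for j in order[:pos]:
            chainable = alns[j][1] < xa and alns[j][3] < ya
            if chainable and V[j] + s > V[i]:
                V[i], back[i] = V[j] + s, j
    end = max(V, key=V.get)  # termination: argmax over V(j)
    chain, node = [], end
    while node is not None:
        chain.append(node)
        node = back[node]
    return V[end], chain[::-1]


# Hypothetical local alignments: (x_a, x_b, y_a, y_b, score)
alignments = [(1, 3, 2, 5, 5), (5, 8, 6, 9, 6), (4, 6, 10, 11, 3),
              (7, 10, 12, 15, 4), (9, 12, 14, 16, 2)]
best_score, best_chain = chain_quadratic(alignments)
```

The inner loop over all earlier alignments is what makes the worst case quadratic.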

15 Sparse Dynamic Programming
Back to the LCS problem. Given two sequences
  - x = x_1, …, x_m
  - y = y_1, …, y_n
find the longest common subsequence
  - Quadratic solution with DP
How about when "hits" x_i = y_j are sparse?

16 Sparse Dynamic Programming
[Figure: dot matrix of hits between x = 15, 3, 24, 16, 20, 4, 24, 3, 11, 18 and y = 4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
Imagine a situation where the number of hits is much smaller than O(nm), maybe O(n) instead

17 Sparse Dynamic Programming: L.I.S.
Longest Increasing Subsequence. Given a sequence over an ordered alphabet
  - x = x_1, …, x_m
find a subsequence
  - s = s_1, …, s_k
  - s_1 < s_2 < … < s_k

18 Sparse Dynamic Programming: L.I.S.
Let the input be w: w_1, …, w_n
INITIALIZATION:
  L: array of last LIS elements; L[0] = -inf, L[1] = w_1, L[2…n] = +inf
  B: array holding the positions of LIS elements; B[0] = 0, B[1] = 1
  P: array of backpointers; P[1] = 0
  // L[j]: smallest j-th element w_i of a j-long increasing subsequence seen so far
ALGORITHM:
  for i = 2 to n {
    Find j such that L[j - 1] < w[i] ≤ L[j]
    L[j] ← w[i]
    B[j] ← i
    P[i] ← B[j - 1]
  }
That's it!!! Running time?
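A minimal Python sketch of the pseudocode above follows. It departs from the slide in a few labeled ways: indices are 0-based, the arrays grow on demand instead of being padded with ±inf, and the "find j" step uses binary search from the standard `bisect` module, which is what gives O(n log n) overall. The names `tails`, `tail_pos`, and `back` play the roles of L, B, and P.

```python
from bisect import bisect_left


def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []      # tails[j]: smallest last element of an increasing run of length j + 1
    tail_pos = []   # tail_pos[j]: index in w of tails[j]
    back = [None] * len(w)  # back[i]: index of the element preceding w[i] in its run
    for i, value in enumerate(w):
        # First j with tails[j] >= value, i.e. tails[j - 1] < value <= tails[j].
        j = bisect_left(tails, value)
        if j == len(tails):
            tails.append(value)
            tail_pos.append(i)
        else:
            tails[j] = value
            tail_pos[j] = i
        back[i] = tail_pos[j - 1] if j > 0 else None
    # Follow backpointers from the end of the longest run.
    result = []
    i = tail_pos[-1] if tail_pos else None
    while i is not None:
        result.append(w[i])
        i = back[i]
    return result[::-1]
```

For example, `longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6])` returns `[1, 4, 5, 6]`.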

19 Sparse LCS expressed as LIS
Create a sequence w. Every matching point (i, j) is inserted into w as follows:
  - For each column j = 1…m, insert in w the points (i, j), in decreasing row i order
  - The 11 example points are inserted in the order given
a = (y, x) and b = (y', x') can be chained iff
  - a is before b in w, and
  - y < y'
[Figure: dot matrix with x = 15, 3, 24, 16, 20, 4, 24, 3, 11, 18 along the columns and y = 4, 20, 24, 3, 11, 15, 11, 4, 18, 20 along the rows; the matching points are numbered]

20 Sparse LCS expressed as LIS
Create a sequence w:
  w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w's elements as ordered lexicographically, where (y, x) < (y', x') if y < y'
Claim: an increasing subsequence of w is a common subsequence of x and y
[Figure: dot matrix of x against y with the matching points marked]

21 Sparse Dynamic Programming for LIS
Example: w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Contents of L = [L1] [L2] [L3] [L4] [L5] … after each step:
   1. (4,2)
   2. (3,3)
   3. (3,3) (10,5)
   4. (2,5) (10,5)
   5. (2,5) (8,6)
   6. (1,6) (8,6)
   7. (1,6) (3,7)
   8. (1,6) (3,7) (4,8)
   9. (1,6) (3,7) (4,8) (7,9)
  10. (1,6) (3,7) (4,8) (5,9)
  11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence: s = 4, 24, 3, 11, 18
[Figure: dot matrix of x against y with the matching points marked]
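The reduction from sparse LCS to LIS can be sketched as below. This is an illustrative implementation, not the lecture's code: it uses 0-based positions, collects the matching points with a dictionary of row positions, and runs the binary-search LIS on the row coordinate. Inserting the rows of each column in decreasing order is what prevents two points from the same column from appearing in one increasing subsequence. The function name is hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def sparse_lcs(x, y):
    """Return a longest common subsequence of x and y via LIS over matching points."""
    rows_of = defaultdict(list)           # symbol -> increasing list of positions in y
    for i, sym in enumerate(y):
        rows_of[sym].append(i)
    # Build w: for each column j of x, the matching rows i in decreasing order.
    w = []
    for j, sym in enumerate(x):
        for i in reversed(rows_of.get(sym, [])):
            w.append((i, j))
    # LIS on the row coordinate, with backpointers.
    tails, tail_pos = [], []
    back = [None] * len(w)
    for t, (i, _) in enumerate(w):
        k = bisect_left(tails, i)
        if k == len(tails):
            tails.append(i)
            tail_pos.append(t)
        else:
            tails[k] = i
            tail_pos[k] = t
        back[t] = tail_pos[k - 1] if k > 0 else None
    out = []
    t = tail_pos[-1] if tail_pos else None
    while t is not None:
        out.append(y[w[t][0]])
        t = back[t]
    return out[::-1]


x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
lcs = sparse_lcs(x, y)
```

With H matching points, the LIS phase takes O(H log H) time, which beats the quadratic DP whenever hits are sparse.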

22 Sparse DP for rectangle chaining
1, …, N: rectangles
(h_j, l_j): y-coordinates of rectangle j
w(j): weight of rectangle j
V(j): optimal score of chain ending in j
L: list of triplets (l_j, V(j), j)
  - L is sorted by l_j: smallest (North) to largest (South) value
  - L is implemented as a balanced binary tree
[Figure: a rectangle with its y-coordinates h and l marked on the y axis]

23 Sparse DP for rectangle chaining
Main idea:
  - Sweep through x-coordinates
  - To the right of b, anything chainable to a is chainable to b
  - Therefore, if V(b) > V(a), rectangle a is "useless" for subsequent chaining
  - In L, keep rectangles j sorted with increasing l_j-coordinates, which also keeps them sorted with increasing V(j) score
[Figure: rectangles a and b, labeled V(a) and V(b)]

24 Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
   a. j: rectangle in L with largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. k: rectangle in L with largest l_k ≤ l_i
   b. If V(i) > V(k):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_j, V(j), j) with V(j) ≤ V(i) and l_j ≥ l_i
[Figure: rectangles i, j, and k]
Is k ever removed?

25 Example
[Figure: rectangles in the x-y plane with weights a: 5, b: 6, c: 3, d: 4, e: 2; y-axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]
1. When on the leftmost end of rectangle i:
   a. j: rectangle in L with largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. k: rectangle in L with largest l_k ≤ l_i
   b. If V(i) > V(k):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_j, V(j), j) with V(j) ≤ V(i) and l_j ≥ l_i
V: a = 5, b = 11, c = 8, d = 12, e = 13
L (columns: l_i, V(i), i):
   5   5  a
  11   8  c
   9  11  b
  15  12  d
  16  13  e
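The sweep can be sketched in Python as follows. Two things are assumptions rather than part of the lecture: L is kept as a pair of sorted Python lists searched with `bisect` instead of a balanced binary tree (so insertions and deletions cost O(N) here, not O(log N)), and the rectangle coordinates in the sample data are invented; only the weights a: 5, b: 6, c: 3, d: 4, e: 2 come from the example. Left ends are processed before right ends at equal x so that chaining stays strict.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight), y growing downward (h < l).

    Returns (best_score, chain of rectangle indices).
    """
    if not rects:
        return 0, []
    events = []
    for i, (xl, xr, _, _, _) in enumerate(rects):
        events.append((xl, 0, i))  # 0: leftmost end
        events.append((xr, 1, i))  # 1: rightmost end
    events.sort()
    V = [0] * len(rects)
    prev = [None] * len(rects)
    keys, items = [], []  # L: l-coordinates (sorted) and the matching rectangle ids
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            pos = bisect_left(keys, h)        # keys[:pos] are the l_j < h_i
            if pos > 0:
                j = items[pos - 1]
                V[i], prev[i] = w + V[j], j
            else:
                V[i] = w
        else:
            pos = bisect_right(keys, l)       # keys[:pos] are the l_k <= l_i
            if pos == 0 or V[i] > V[items[pos - 1]]:
                start = bisect_left(keys, l)
                end = start
                # Dominated entries (l_j >= l_i, V(j) <= V(i)) are consecutive.
                while end < len(keys) and V[items[end]] <= V[i]:
                    end += 1
                keys[start:end] = [l]
                items[start:end] = [i]
    best = max(range(len(rects)), key=V.__getitem__)
    chain, node = [], best
    while node is not None:
        chain.append(node)
        node = prev[node]
    return V[best], chain[::-1]


# Hypothetical coordinates (x_left, x_right, h, l, weight) for a, b, c, d, e
rects = [(1, 3, 2, 5, 5), (5, 8, 6, 9, 6), (4, 6, 10, 11, 3),
         (7, 10, 12, 15, 4), (9, 12, 14, 16, 2)]
score, chain = chain_rectangles(rects)
```

On this invented layout the sweep produces V = 5, 11, 8, 12, 13 for a to e, and c is removed from L when b is inserted, because b ends higher up with a better score.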

26 Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
   - Searching L takes log N
   - Inserting to L takes log N
   - All deletions are consecutive, so log N per deletion
   - Each element is deleted at most once: N log N for all deletions
Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree

27 Examples
Human Genome Browser
ABC

