Published by Ethel Fisher. Modified over 4 years ago.

1
Rapid Global Alignments: How to align genomic sequences in (more or less) linear time

2
Motivation
Genomic sequences are very long:
  Human genome: 3 × 10^9 long
  Mouse genome: 2.7 × 10^9 long
Aligning genomic regions is useful for revealing common gene structure.
It is useful to compare regions > 1,000,000 long.

3
Main Idea
Genomic regions of interest contain ordered islands of similarity, such as genes.
1. Find local alignments
2. Chain an optimal subset of them

4
Outline
Methods to FIND Local Alignments
  Sorting k-long words
  Suffix Trees
Methods to CHAIN Local Alignments
  Dynamic Programming
  Sparse Dynamic Programming

5
Methods to FIND Local Alignments
1. Sorting k-long words (BLAST, BLAT, and the like)
2. Suffix Trees

6
Finding Local Alignments: Sorting k-long words
Given sequences x, y:
1. Write down all triples
     (w, 0, i): w = x_{i+1} … x_{i+k}
     (z, 1, j): z = y_{j+1} … y_{j+k}
2. Sort them lexicographically
3. Deduce all k-long matches between x and y
4. Extend to local alignments

7
Sorting k-long words: example
Let x, y be matched with 3-long words:
  x = caggc: (cag,0,0), (agg,0,1), (ggc,0,2)
  y = ggcag: (ggc,1,0), (gca,1,1), (cag,1,2)
Sorted: (agg,0,1), (cag,0,0), (cag,1,2), (gca,1,1), (ggc,0,2), (ggc,1,0)
Matches:
1. cag: x_1 x_2 x_3 = y_3 y_4 y_5
2. ggc: x_3 x_4 x_5 = y_1 y_2 y_3
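
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code of BLAST or BLAT: the function name `kmer_matches` is made up for this example, offsets are 0-based as in the triples above, and the extension step (step 4) is left out.

```python
def kmer_matches(x, y, k):
    """Return all (word, i, j) with x[i:i+k] == y[j:j+k], using 0-based offsets."""
    # Step 1: write down all (word, source, offset) triples; source 0 = x, 1 = y.
    triples = [(x[i:i + k], 0, i) for i in range(len(x) - k + 1)]
    triples += [(y[j:j + k], 1, j) for j in range(len(y) - k + 1)]
    # Step 2: sort lexicographically, so that equal words become adjacent.
    triples.sort()
    # Step 3: inside each run of equal words, pair every x offset with every y offset.
    matches = []
    start = 0
    while start < len(triples):
        end = start
        while end < len(triples) and triples[end][0] == triples[start][0]:
            end += 1
        run = triples[start:end]
        x_offsets = [off for _, src, off in run if src == 0]
        y_offsets = [off for _, src, off in run if src == 1]
        matches += [(run[0][0], i, j) for i in x_offsets for j in y_offsets]
        start = end
    return matches


print(kmer_matches("caggc", "ggcag", 3))
```

On the slide's example this reports `cag` at offsets (0, 2) and `ggc` at offsets (2, 0), which are the two matches listed above in 1-based form.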

8
Running time
Worst case: O(N × M)
In practice, a large value of k results in a short list of matches.
Tradeoff:
  Low k: worse running time
  High k: significant alignments missed
PatternHunter: sampling non-consecutive positions increases the likelihood of detecting a conserved region for a fixed value of k (refer to Lecture 3).
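
To get a feel for the tradeoff, one can estimate how many k-long matches two unrelated sequences produce by chance. The sketch below assumes independent, uniformly random letters over a 4-letter alphabet, which real genomes do not satisfy; the function name and the choice of k values are illustrative only.

```python
def expected_random_matches(n, m, k, alphabet_size=4):
    """Expected number of k-long word matches between two random sequences.

    Assumes independent, uniformly distributed letters (an idealization).
    """
    # Number of (i, j) offset pairs, times the chance that one pair of words agrees.
    pairs = max(n - k + 1, 0) * max(m - k + 1, 0)
    return pairs / alphabet_size ** k


for k in (8, 11, 16):
    print(k, expected_random_matches(10**6, 10**6, k))
```

Under this model each extra letter in the word divides the number of chance matches by 4, which is why a larger k shortens the list of matches.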

9
Suffix Trees
Suffix trees are a method to find all maximal matches between two strings (and much more).
Example: x = dabdac
[Figure: suffix tree of x, with leaves numbered 1 to 6 by suffix start position]

10
Definition of a Suffix Tree
Definition: for a string x = x_1 … x_m, a suffix tree is a rooted tree with m leaves, where:
  Leaf i corresponds to the suffix x_i … x_m
  Each edge is labeled with a substring of x
  No two edges out of a node start with the same letter
It follows that every substring of x corresponds to an initial part of a path from the root to a leaf.

11
Constructing a Suffix Tree
Naïve algorithm: O(N^2) time
Better algorithms: O(N) time (outside the scope of this class: too technical and not so interesting)
Memory: O(N), but with a significant constant

12
Naïve Algorithm to Construct a Suffix Tree
1. Initialize tree T: a single root node r
2. Insert special symbol $ at the end of x
3. For j = 1 to m:
     Find the longest match of x_j … x_m to T, starting from r
     Split the edge where the match stops: new node w
     Create edge (w, j), and label it with the unmatched portion of x_j … x_m

13
Example of Suffix Tree Construction
x = dabda$
1. Insert dabda$
2. Insert abda$
3. Insert bda$
4. Insert da$
5. Insert a$
6. Insert $
[Figure: the tree after each insertion, with leaves numbered 1 to 6]
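
The naïve construction can be sketched as follows. This is an illustrative version only: the names `Node`, `build_suffix_tree` and `suffixes` are invented here, and edges store their label as a string rather than as an index pair, so the sketch does not have the O(N) memory bound discussed later.

```python
class Node:
    def __init__(self):
        self.children = {}   # first letter of edge label -> (edge label, child Node)
        self.leaf = None     # 1-based start position of the suffix (leaves only)


def build_suffix_tree(x):
    """Naive O(N^2) construction; x must not contain the terminator '$'."""
    s = x + "$"
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while True:
            first = rest[0]
            if first not in node.children:
                # The match stops at a node: add a leaf edge with the unmatched part.
                leaf = Node()
                leaf.leaf = j + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the remaining suffix.
            n = 0
            while n < len(label) and n < len(rest) and label[n] == rest[n]:
                n += 1
            if n == len(label):
                # The whole edge matched: keep matching from the child.
                node, rest = child, rest[n:]
                continue
            # The match stops inside the edge: split the edge at a new node w.
            w = Node()
            w.children[label[n]] = (label[n:], child)
            node.children[first] = (label[:n], w)
            node, rest = w, rest[n:]
    return root


def suffixes(node, prefix=""):
    """Yield (leaf number, path label) for every leaf below node."""
    if node.leaf is not None:
        yield node.leaf, prefix
    for label, child in node.children.values():
        yield from suffixes(child, prefix + label)


tree = build_suffix_tree("dabda")
print(sorted(suffixes(tree)))
```

For x = dabda the six leaves spell the six suffixes of dabda$, and the root edge starting with d carries the label `da`, created by the split in step 4 of the example.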

14
Faster Construction
Several algorithms run in O(N) time and O(N) memory, with a big constant.
They are technical but not deep, and outside the scope of this course.
Optional reading: Gusfield, chapter 6

15
Memory to Store Suffix Tree
A suffix tree can be stored in O(N) memory:
  Every edge is labeled with a pair (i, j), which denotes x_i … x_j
  The tree has O(N) nodes
Proof:
1. # internal nodes ≤ # leaves − 1, because every internal node has at least two children
2. # leaves = |x|

16
Application: find all matches between x, y
1. Build the suffix tree for x, and mark its nodes with x
2. Insert y in the suffix tree, and mark every node that y passes through with y
The path label of every node marked with both x and y is a common substring.

17
Example of Suffix Tree construction
x = dabda$, y = abada$
1. Construct tree for x
2. Insert abada$
3. Insert bada$
4. Insert ada$
5. Insert da$
6. Insert a$
7. Insert $
[Figure: the suffix tree of x after inserting the suffixes of y, with nodes marked x, y, or both]
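
The marking idea can be tried out with a simplified stand-in. The sketch below uses an uncompressed suffix trie (one node per letter, O(N^2) memory) instead of a true suffix tree, purely to keep the code short; the function name `common_substrings` and the dictionary layout are assumptions of this example.

```python
def common_substrings(x, y):
    """Return the set of non-empty substrings shared by x and y.

    Uses an uncompressed suffix trie, not a suffix tree, for brevity.
    """
    root = {"marks": set(), "kids": {}}
    for mark, s in (("x", x), ("y", y)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"marks": set(), "kids": {}})
                node["marks"].add(mark)   # this suffix passes through the node
    # Collect the path label of every node marked with both x and y.
    found = set()
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, kid in node["kids"].items():
            if kid["marks"] == {"x", "y"}:
                found.add(label + ch)
                stack.append((kid, label + ch))
    return found


shared = common_substrings("dabda", "abada")
print(sorted(shared, key=lambda s: (len(s), s)))
```

For the slide's strings the shared substrings are a, b, d, ab and da. The search only descends into nodes carrying both marks, since no node below a singly marked node can carry both.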

18
Application: String search on a database
Say we have a database D = {s_1, s_2, …, s_n} (e.g., proteins).
Question: given a new string x, find all matches of x to the database.
1. Build the suffix tree for {s_1, …, s_n}
2. Each new query x then takes O(|x|) time (somewhat like BLAST)

19
Application: common substrings of k strings
To find the longest common substring of s_1, s_2, …, s_n:
1. Build the suffix tree for s_1, …, s_n
2. Every node labeled {s_{i1}, …, s_{ik}} represents a match between s_{i1}, …, s_{ik}

20
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)

21
The Problem: Find a Chain of Local Alignments
Chaining (x, y) → (x', y') requires x < x' and y < y'.
Each local alignment has a weight.
FIND the chain with the highest total weight.

22
Quadratic Time Solution
Build a Directed Acyclic Graph (DAG):
  Nodes: each local alignment [(x_a, x_b), (y_a, y_b)] is a node v_i with alignment score s_i
  Directed edges: pairs of local alignments that can be chained
    edge ((x_a, x_b, y_a, y_b), (x_c, x_d, y_c, y_d)) exists when
    x_a < x_b < x_c < x_d and y_a < y_b < y_c < y_d

23
Quadratic Time Solution
Dynamic programming:
  Initialization: find each node v_a such that there is no edge (u, v_a); set V(a) = s_a
  Iteration: for each v_i, the optimal path ending in v_i has total score
    V(i) = max_j ( weight(v_j, v_i) + V(j) ), taken over all edges (v_j, v_i)
  Termination: the optimal global chain ends at j = argmax_j V(j); trace the chain back from v_j
Worst case time: quadratic
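
A minimal sketch of this quadratic dynamic program is given below. It assumes each local alignment is a tuple (x_a, x_b, y_a, y_b, score) with x_a ≤ x_b and y_a ≤ y_b, and that the weight of an edge into v_i is simply the score s_i; the function name and the sample alignments are made up for illustration.

```python
def chain_quadratic(alignments):
    """alignments: list of (xa, xb, ya, yb, score).

    Returns (best total score, indices of the chained alignments in order).
    """
    # Sorting by xa gives a topological order: a predecessor always starts earlier.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V, back = {}, {}
    for pos, i in enumerate(order):
        xa_i, _, ya_i, _, s_i = alignments[i]
        V[i], back[i] = s_i, None            # the chain made of v_i alone
        for j in order[:pos]:
            _, xb_j, _, yb_j, _ = alignments[j]
            # Edge (v_j, v_i) exists when j ends before i starts, in both x and y.
            if xb_j < xa_i and yb_j < ya_i and V[j] + s_i > V[i]:
                V[i], back[i] = V[j] + s_i, j
    if not V:
        return 0, []
    best = max(V, key=V.get)
    chain, k = [], best
    while k is not None:                     # trace the chain back from v_best
        chain.append(k)
        k = back[k]
    return V[best], chain[::-1]


sample = [(0, 2, 0, 2, 5), (3, 5, 6, 8, 3), (4, 7, 3, 5, 6),
          (8, 10, 9, 11, 4), (9, 12, 1, 2, 2)]
print(chain_quadratic(sample))
```

On the sample, the best chain uses alignments 0, 2 and 3 for a total score of 15. The two nested loops over the alignments are what make the running time quadratic.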

24
Sparse Dynamic Programming
Back to the LCS problem:
Given two sequences
  x = x_1, …, x_m
  y = y_1, …, y_n
find the longest common subsequence.
There is a quadratic solution with DP.
How about when the “hits” x_i = y_j are sparse?

25
Sparse Dynamic Programming
[Figure: grid of sequence x against sequence y = 4, 20, 24, 3, 11, 15, 11, 4, 18, 20, with the hits marked]
Imagine a situation where the number of hits is much smaller than O(nm), maybe O(n) instead.

26
Sparse Dynamic Programming – L.I.S.
Longest Increasing Subsequence
Given a sequence over an ordered alphabet
  x = x_1, …, x_m
find the longest subsequence
  s = s_1, …, s_k with s_1 < s_2 < … < s_k

27
Sparse LCS expressed as LIS
Create a sequence w:
  Every matching point x-to-y, (i, j), is inserted into w as follows:
  for each position j of x, from smallest to largest, insert in w the points (i, j), in decreasing order of i
The 11 example points are inserted in the order given.
Any two points (y_a, x_a), (y_b, x_b) can be chained iff a is before b in w, and y_a < y_b.
[Figure: grid of x against y with the 11 matching points numbered in insertion order]
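
The insertion order can be sketched as below. The function name `hits_in_lis_order` and the two short test strings are hypothetical; positions are 1-based and each point is written (i, j), with i a position of y and j a position of x, as on the slide.

```python
def hits_in_lis_order(x, y):
    """Return all points (i, j) with y[i] == x[j] (1-based), in the order used for w."""
    # Index the positions of every symbol of y, in increasing order.
    positions = {}
    for i, symbol in enumerate(y, start=1):
        positions.setdefault(symbol, []).append(i)
    w = []
    for j, symbol in enumerate(x, start=1):          # positions of x, smallest first
        for i in reversed(positions.get(symbol, [])):  # matching i, largest first
            w.append((i, j))
    return w


print(hits_in_lis_order("abcb", "bab"))
```

Listing the points of one x position in decreasing i matters: a strictly increasing run of i values can then never pick two points that share the same position of x.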

28
Sparse LCS expressed as LIS
Create a sequence w:
  w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w's elements as ordered lexicographically, where (y_a, x_a) < (y_b, x_b) if y_a < y_b.
Claim: an increasing subsequence of w is a common subsequence of x and y.

29
Sparse Dynamic Programming for LIS
Algorithm:
  initialize empty array L
  /* at each point, l_j will contain the smallest last element among the j-long increasing subsequences found so far */
  for i = 1 to |w|
    binary search for w[i] in L, to find l_j < w[i] ≤ l_{j+1}
    replace l_{j+1} with w[i] (append w[i] if there is no such l_{j+1})
    keep a backptr from w[i] to l_j

30
Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L =
  1. (4,2)
  2. (3,3)
  3. (3,3) (10,5)
  4. (2,5) (10,5)
  5. (2,5) (8,6)
  6. (1,6) (8,6)
  7. (1,6) (3,7)
  8. (1,6) (3,7) (4,8)
  9. (1,6) (3,7) (4,8) (7,9)
  10. (1,6) (3,7) (4,8) (5,9)
  11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence: s = 4, 24, 3, 11, 18
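
The trace above can be reproduced with a short sketch built on Python's `bisect` module. The function name `longest_increasing_chain` is an assumption of this example, points are compared by their first coordinate only, and the sequence y is read off the axis labels of the slide's figure.

```python
from bisect import bisect_left


def longest_increasing_chain(w):
    """w: list of points (i, j). Returns a longest subsequence with strictly increasing i."""
    tails = []       # tails[t]: index in w of the current last point of a (t+1)-long chain
    tail_keys = []   # tail_keys[t] == w[tails[t]][0], kept sorted for binary search
    back = [None] * len(w)
    for idx, (i, _) in enumerate(w):
        t = bisect_left(tail_keys, i)               # first slot whose key is >= i
        back[idx] = tails[t - 1] if t > 0 else None  # backpointer to the previous slot
        if t == len(tails):
            tails.append(idx)
            tail_keys.append(i)
        else:
            tails[t] = idx
            tail_keys[t] = i
    chain = []
    k = tails[-1] if tails else None
    while k is not None:                             # follow the backpointers
        chain.append(w[k])
        k = back[k]
    return chain[::-1]


w = [(4, 2), (3, 3), (10, 5), (2, 5), (8, 6), (1, 6),
     (3, 7), (4, 8), (7, 9), (5, 9), (9, 10)]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]   # as read from the figure labels
chain = longest_increasing_chain(w)
print(chain)
print([y[i - 1] for i, _ in chain])
```

The chain found is (1,6) (3,7) (4,8) (5,9) (9,10), the final state of L in the example, and mapping its first coordinates through y gives 4, 24, 3, 11, 18.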

31
Sparse DP for rectangle chaining
1, …, N: rectangles
(h_j, l_j): y-coordinates of rectangle j
w(j): weight of rectangle j
V(j): optimal score of a chain ending in j
L: list of triplets (l_j, V(j), j)
  L is sorted by l_j
  L is implemented as a balanced binary tree
[Figure: a rectangle with its y-coordinates h and l marked]

32
Sparse DP for rectangle chaining
Go through the rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of i:
   a. j: rectangle in L with the largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. j: rectangle in L with the largest l_j ≤ l_i
   b. If V(i) > V(j):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_k, V(k), k) with V(k) ≤ V(i) and l_k ≥ l_i
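
The sweep can be sketched as follows, under several assumptions that are not on the slide: each rectangle is a tuple (x_left, x_right, h, l, weight) with h ≤ l, weights are positive, left ends are processed before right ends at equal x, and V(j) counts as 0 when no rectangle j qualifies. A sorted Python list with `bisect` stands in for the balanced binary tree, so inserts and deletes here cost O(N) rather than O(log N).

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight).

    Returns (best total weight, indices of the chained rectangles in order).
    """
    events = []
    for i, (x_left, x_right, _, _, _) in enumerate(rects):
        events.append((x_left, 0, i))    # kind 0: left end (sorts first at equal x)
        events.append((x_right, 1, i))   # kind 1: right end
    events.sort()
    ls, Vs, ids = [], [], []   # the list L, sorted by l, as three parallel lists
    V = [0] * len(rects)
    back = [None] * len(rects)
    for _, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            pos = bisect_left(ls, h) - 1             # largest l_j < h_i
            V[i] = weight + (Vs[pos] if pos >= 0 else 0)
            back[i] = ids[pos] if pos >= 0 else None
        else:
            pos = bisect_right(ls, l) - 1            # largest l_j <= l_i
            if pos < 0 or V[i] > Vs[pos]:
                p = bisect_left(ls, l)
                q = p
                # Entries from p on have l_k >= l_i; drop those with V(k) <= V(i).
                while q < len(ls) and Vs[q] <= V[i]:
                    q += 1
                ls[p:q] = [l]
                Vs[p:q] = [V[i]]
                ids[p:q] = [i]
    if not rects:
        return 0, []
    best = max(range(len(rects)), key=V.__getitem__)
    chain, k = [], best
    while k is not None:
        chain.append(k)
        k = back[k]
    return V[best], chain[::-1]


sample = [(0, 2, 0, 2, 5), (3, 5, 6, 8, 3), (4, 7, 3, 5, 6),
          (8, 10, 9, 11, 4), (9, 12, 1, 2, 2)]
print(chain_rectangles(sample))
```

On the sample rectangles the sweep chains rectangles 0, 2 and 3 for a total weight of 15. Because dominated entries are removed, L always has V(k) increasing with l_k, which is why a single lookup at the left end is enough.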

33
Example
[Figure: five rectangles in the x-y plane, labeled 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; x-axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

34
Time Analysis
1. Sorting the x-coordinates takes O(N log N)
2. Going through the x-coordinates: N steps
3. Each of the N steps requires O(log N) time:
     Searching L takes log N
     Inserting to L takes log N
     All deletions are consecutive, so log N per deletion
     Each element is deleted at most once: N log N for all deletions
Recall that INSERT, DELETE, and SUCCESSOR take O(log N) time in a balanced binary search tree.
