# Rapid Global Alignments: How to align genomic sequences in (more or less) linear time

## Presentation transcript

Motivation

- Genomic sequences are very long:
  - Human genome: $3 \times 10^9$ bases long
  - Mouse genome: $2.7 \times 10^9$ bases long
- Aligning genomic regions is useful for revealing common gene structure
  - Useful to compare regions more than 1,000,000 bases long

Main Idea

Genomic regions of interest contain ordered islands of similarity, such as genes.

1. Find local alignments
2. Chain an optimal subset of them

Outline

- Methods to FIND Local Alignments
  - Sorting k-long words
  - Suffix Trees
- Methods to CHAIN Local Alignments
  - Dynamic Programming
  - Sparse Dynamic Programming

Methods to FIND Local Alignments

1. Sorting k-long words (BLAST, BLAT, and the like)
2. Suffix Trees

Finding Local Alignments: Sorting k-long words

Given sequences x, y:

1. Write down all triples
   - $(w, 0, i)$, where $w = x_{i+1} \dots x_{i+k}$
   - $(z, 1, j)$, where $z = y_{j+1} \dots y_{j+k}$
2. Sort them lexicographically
3. Deduce all k-long matches between x and y
4. Extend to local alignments

Sorting k-long words: example

Let x, y be matched with 3-long words:

- x = caggc: (cag,0,0), (agg,0,1), (ggc,0,2)
- y = ggcag: (ggc,1,0), (gca,1,1), (cag,1,2)

Sorted: (agg,0,1), (cag,0,0), (cag,1,2), (gca,1,1), (ggc,0,2), (ggc,1,0)

Matches:

1. cag: $x_1 x_2 x_3 = y_3 y_4 y_5$
2. ggc: $x_3 x_4 x_5 = y_1 y_2 y_3$
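Steps 1 to 3 of the procedure can be sketched in a few lines of Python. This is a minimal illustration, not the method used by BLAST or BLAT; the function name `kmer_matches` and the 0-based offsets are choices made for this sketch, and step 4 (extending matches to local alignments) is left out.

```python
def kmer_matches(x, y, k):
    """Return (word, i, j) for every k-long word found at offset i in x and j in y."""
    # Step 1: write down all triples (word, source, offset); source 0 = x, 1 = y.
    triples = [(x[i:i + k], 0, i) for i in range(len(x) - k + 1)]
    triples += [(y[j:j + k], 1, j) for j in range(len(y) - k + 1)]
    # Step 2: sort lexicographically, so identical words become adjacent.
    triples.sort()
    # Step 3: scan each run of identical words and pair x-offsets with y-offsets.
    matches = []
    start = 0
    while start < len(triples):
        end = start
        while end < len(triples) and triples[end][0] == triples[start][0]:
            end += 1
        run = triples[start:end]
        x_offsets = [off for _, src, off in run if src == 0]
        y_offsets = [off for _, src, off in run if src == 1]
        matches.extend((triples[start][0], i, j) for i in x_offsets for j in y_offsets)
        start = end
    return matches


# The slide's example: cag at offset 0 in x and 2 in y, ggc at offset 2 in x and 0 in y.
print(kmer_matches("caggc", "ggcag", 3))
```

A word that occurs a times in x and b times in y yields a × b pairs, which is where the quadratic worst case comes from.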

Running time

- Worst case: $O(N \times M)$
- In practice: a large value of k results in a short list of matches
- Tradeoff:
  - Low k: worse running time
  - High k: significant alignments missed
- PatternHunter: sampling non-consecutive positions increases the likelihood of detecting a conserved region, for a fixed value of k (refer to Lecture 3)

Suffix Trees

Suffix trees are a method to find all maximal matches between two strings (and much more).

Example: x = dabdac

[Figure: suffix tree of x = dabdac, with leaves numbered 1 to 6]

Definition of a Suffix Tree

Definition: for a string $x = x_1 \dots x_m$, a suffix tree is:

- A rooted tree with m leaves, where leaf i corresponds to the suffix $x_i \dots x_m$
- Each edge is labeled with a substring of x
- No two edges out of a node start with the same letter

It follows that every substring of x corresponds to an initial part of a path from the root to a leaf.

Constructing a Suffix Tree

- Naïve algorithm: $O(N^2)$ time
- Better algorithms: $O(N)$ time (outside the scope of this class: too technical and not so interesting)
- Memory: $O(N)$, but with a significant constant

Naïve Algorithm to Construct a Suffix Tree

1. Initialize tree T: a single root node r
2. Insert special symbol \$ at end of x
3. For j = 1 to m:
   - Find the longest match of $x_j \dots x_m$ to T, starting from r
   - Split the edge where the match stops: new node w
   - Create edge (w, j), and label it with the unmatched portion of $x_j \dots x_m$
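The naïve construction can be sketched in Python as follows. This is an illustrative quadratic-time version under a few assumptions: the `Node` class, the function names, and the representation of an edge as a `(start, end, child)` triple are all invented for this sketch, and x is assumed not to contain the terminator `$` itself.

```python
class Node:
    def __init__(self):
        self.children = {}  # first letter -> (start, end, child); label is text[start:end]
        self.leaf = None    # 1-based suffix number if this node is a leaf


def build_suffix_tree(x):
    """Naive O(N^2) construction: insert the suffixes of x + '$' one by one."""
    text = x + "$"
    root = Node()
    for j in range(len(text)):
        node, pos = root, j
        while True:
            c = text[pos]
            if c not in node.children:
                # Match stops at a node: hang a new leaf edge with the unmatched part.
                leaf = Node()
                leaf.leaf = j + 1
                node.children[c] = (pos, len(text), leaf)
                break
            start, end, child = node.children[c]
            k = 0
            while k < end - start and text[start + k] == text[pos + k]:
                k += 1
            if k == end - start:
                # Whole edge label matched: continue from the child node.
                node, pos = child, pos + k
                continue
            # Match stops inside the edge: split it at a new node w.
            w = Node()
            node.children[c] = (start, start + k, w)
            w.children[text[start + k]] = (start + k, end, child)
            leaf = Node()
            leaf.leaf = j + 1
            w.children[text[pos + k]] = (pos + k, len(text), leaf)
            break
    return root, text


def leaf_labels(node, text, prefix=""):
    """Map each leaf number to the string spelled on its root-to-leaf path."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    labels = {}
    for start, end, child in node.children.values():
        labels.update(leaf_labels(child, text, prefix + text[start:end]))
    return labels


root, text = build_suffix_tree("dabda")
print(leaf_labels(root, text))
```

Because `$` occurs only at the end, no suffix is a prefix of another, so every insertion ends by creating exactly one new leaf. Edges store index pairs rather than copies of the label, which keeps the memory linear in the length of x.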

Example of Suffix Tree Construction

x = d a b d a \$

1. Insert d a b d a \$
2. Insert a b d a \$
3. Insert b d a \$
4. Insert d a \$
5. Insert a \$
6. Insert \$

[Figure: the suffix tree after each insertion, with leaves numbered 1 to 6]

Faster Construction

- Several algorithms run in $O(N)$ time and $O(N)$ memory, with a big constant
- Technical but not deep; outside the scope of this course
- Optional: Gusfield, chapter 6

Memory to Store Suffix Tree

Can store in $O(N)$ memory!

- Every edge is labeled with $(i, j)$, which denotes $x_i \dots x_j$
- Tree has $O(N)$ nodes. Proof:
  1. # internal nodes ≤ # leaves − 1, since every internal node has at least two children
  2. # leaves = |x|
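The node bound can be made explicit with a short counting argument. The symbols $\ell$ (number of leaves) and $I$ (number of internal nodes) are introduced here for illustration, and the argument assumes x is non-empty and terminated by \$, so that every internal node, including the root, has at least two children. Every node except the root is the child of exactly one node, so counting edges in two ways gives:

$$
\ell + I - 1 \;=\; \sum_{v \text{ internal}} \#\mathrm{children}(v) \;\ge\; 2I
\quad\Longrightarrow\quad I \le \ell - 1
\quad\Longrightarrow\quad \ell + I \le 2\ell - 1 = O(N).
$$

With one $(i, j)$ pair per edge and at most $2\ell - 2$ edges, the whole tree fits in $O(N)$ memory.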

Application: find all matches between x, y

1. Build suffix tree for x, mark nodes with x
2. Insert y in suffix tree, mark all nodes y “passes from” with y

The path label of every node marked with both x and y is a common substring.

Example of Suffix Tree construction

x = d a b d a \$, y = a b a d a \$

1. Construct tree for x
2. Insert a b a d a \$
3. Insert b a d a \$
4. Insert a d a \$
5. Insert d a \$
6. Insert a \$
7. Insert \$

[Figure: the tree for x after inserting the suffixes of y, with nodes marked x and/or y]

Application: String search on a database

Say we have a database $D = \{ s_1, s_2, \dots, s_n \}$ (e.g., proteins).

Question: given a new string x, find all matches of x to the database.

1. Build suffix tree for $\{ s_1, \dots, s_n \}$
2. All new queries x take $O(|x|)$ time (somewhat like BLAST)

Application: common substrings of k strings

To find the longest common substring of $s_1, s_2, \dots, s_n$:

1. Build suffix tree for $s_1, \dots, s_n$
2. All nodes labeled $\{ s_{i_1}, \dots, s_{i_k} \}$ represent a match between $s_{i_1}, \dots, s_{i_k}$

Methods to CHAIN Local Alignments

Sparse Dynamic Programming: $O(N \log N)$

The Problem: Find a Chain of Local Alignments

- $(x, y) \to (x', y')$ requires $x < x'$ and $y < y'$
- Each local alignment has a weight
- FIND the chain with highest total weight

Quadratic Time Solution

Build Directed Acyclic Graph (DAG):

- Nodes: local alignments $[(x_a, x_b) \to (y_a, y_b)]$ & score
- Directed edges: local alignments that can be chained
  - edge $((x_a, x_b, y_a, y_b), (x_c, x_d, y_c, y_d))$ requires $x_a < x_b < x_c < x_d$ and $y_a < y_b < y_c < y_d$

Each local alignment is a node $v_i$ with alignment score $s_i$.

Quadratic Time Solution

Dynamic programming:

- Initialization: find each node $v_a$ s.t. there is no edge $(u, v_a)$; set $V(a) = s_a$
- Iteration: for each $v_i$, the optimal path ending in $v_i$ has total score $V(i) = \max_j \big( \mathrm{weight}(v_j, v_i) + V(j) \big)$, taken over all edges $(v_j, v_i)$
- Termination: optimal global chain: $j = \arg\max_j V(j)$; trace the chain back from $v_j$

Worst case time: quadratic
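The recurrence can be sketched directly in Python. Two assumptions are made that the slides do not state: the edge weight $\mathrm{weight}(v_j, v_i)$ is taken to be the score $s_i$ of the alignment being appended, and each alignment is a tuple `(xa, xb, ya, yb, score)` with `xa <= xb` and `ya <= yb`. The function name and the sample alignments are hypothetical.

```python
def chain_quadratic(alignments):
    """Return (best_total_score, chain_indices) for tuples (xa, xb, ya, yb, score)."""
    if not alignments:
        return 0, []
    # Sorting by the start coordinate xa gives a topological order of the DAG:
    # an edge j -> i needs xb_j < xa_i, hence xa_j < xa_i.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V = [0] * len(alignments)
    back = [None] * len(alignments)
    for pos, i in enumerate(order):
        xa, _, ya, _, score = alignments[i]
        V[i] = score                          # chain made of v_i alone
        for j in order[:pos]:                 # try every earlier node as predecessor
            _, xb_j, _, yb_j, _ = alignments[j]
            if xb_j < xa and yb_j < ya and V[j] + score > V[i]:
                V[i], back[i] = V[j] + score, j
    # Termination: pick the best end node and trace the chain back.
    best = max(range(len(alignments)), key=lambda i: V[i])
    chain, i = [], best
    while i is not None:
        chain.append(i)
        i = back[i]
    return V[best], chain[::-1]


# Hypothetical input: alignment 1 overlaps alignment 0 in y, so it cannot follow it.
sample = [(1, 3, 1, 3, 5), (4, 6, 2, 4, 3), (5, 8, 5, 9, 6), (9, 12, 10, 12, 2)]
print(chain_quadratic(sample))
```

The inner loop examines every pair of alignments once, which is the quadratic cost that sparse dynamic programming avoids.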

Sparse Dynamic Programming

Back to the LCS problem. Given two sequences

- $x = x_1, \dots, x_m$
- $y = y_1, \dots, y_n$

find the longest common subsequence.

- Quadratic solution with DP
- How about when “hits” $x_i = y_j$ are sparse?

Sparse Dynamic Programming

[Figure: grid of hits between x = 15, 3, 24, 16, 20, 4, 24, 3, 11, 18 and y = 4, 20, 24, 3, 11, 15, 11, 4, 18, 20]

Imagine a situation where the number of hits is much smaller than $O(nm)$, maybe $O(n)$ instead.

Sparse Dynamic Programming – L.I.S.

Longest Increasing Subsequence: given a sequence over an ordered alphabet

- $x = x_1, \dots, x_m$

find a longest subsequence

- $s = s_1, \dots, s_k$
- $s_1 < s_2 < \dots < s_k$

Sparse LCS expressed as LIS

Create a sequence w:

- Every matching point x-to-y, $(i, j)$ with $y_i = x_j$, is inserted into w as follows:
- For each position j of x, from smallest to largest, insert in w the points $(i, j)$, in decreasing order of i
- The 11 example points are inserted in the order given

Any two points $(y_a, x_a)$, $(y_b, x_b)$ can be chained iff

- a is before b in w, and
- $y_a < y_b$

[Figure: grid of the matching points between x and y]

Sparse LCS expressed as LIS

Create a sequence w:

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Consider now w’s elements as ordered by their first coordinate, where $(y_a, x_a) < (y_b, x_b)$ if $y_a < y_b$.

Claim: an increasing subsequence of w is a common subsequence of x and y.

[Figure: grid of the matching points between x and y]
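The construction of w can be sketched as below. The function name `matching_points` and the two short input strings are hypothetical, chosen only to show the insertion order; points are 1-based pairs (i, j) with i indexing y and j indexing x, as on the slides.

```python
def matching_points(x, y):
    """Build w: all 1-based points (i, j) with y[i] == x[j], in the order described."""
    positions_in_y = {}
    for i, symbol in enumerate(y, start=1):
        positions_in_y.setdefault(symbol, []).append(i)
    w = []
    for j, symbol in enumerate(x, start=1):              # positions of x, smallest first
        for i in reversed(positions_in_y.get(symbol, [])):  # decreasing i for this j
            w.append((i, j))
    return w


# Hypothetical example: 'b' occurs twice in y, so each 'b' of x yields two points.
print(matching_points("abcb", "bab"))
```

Listing the points of one x-position in decreasing i is what prevents two hits on the same x-position from both appearing in an increasing subsequence.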

Sparse Dynamic Programming for LIS

Algorithm:

- Initialize empty array L. At each point, $l_j$ will contain the last element of the j-long increasing subsequence that ends with the smallest $w_i$
- For i = 1 to |w|:
  - Binary search for w[i] in L, to find $l_j < w[i] \le l_{j+1}$
  - Replace $l_{j+1}$ with w[i]
  - Keep a back-pointer from w[i] to $l_j$

[Figure: grid of the matching points between x and y]

Sparse Dynamic Programming for LIS

Example:

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

L =

1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)

Longest common subsequence: s = 4, 24, 3, 11, 18

[Figure: grid of the matching points between x and y]
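The algorithm and the trace above can be reproduced with Python's standard `bisect` module. This is a sketch: the function name and the parallel-list layout (`tail_keys` for the array L, `tail_index` and `back` for the back-pointers) are choices made here, and points are compared by their first coordinate only, as in the slides.

```python
from bisect import bisect_left


def longest_increasing_chain(w):
    """Longest subsequence of w whose first coordinates strictly increase."""
    tail_keys = []    # tail_keys[k]: smallest final key of an increasing run of length k+1
    tail_index = []   # position in w of the element currently stored at tail_keys[k]
    back = [None] * len(w)
    for idx, (key, _) in enumerate(w):
        k = bisect_left(tail_keys, key)       # first slot whose key is >= key
        if k == len(tail_keys):
            tail_keys.append(key)             # extends the longest run so far
            tail_index.append(idx)
        else:
            tail_keys[k] = key                # a smaller ending for runs of length k+1
            tail_index[k] = idx
        back[idx] = tail_index[k - 1] if k > 0 else None
    # Follow the back-pointers from the end of the longest run.
    chain = []
    idx = tail_index[-1] if tail_index else None
    while idx is not None:
        chain.append(w[idx])
        idx = back[idx]
    return chain[::-1]


w = [(4, 2), (3, 3), (10, 5), (2, 5), (8, 6), (1, 6),
     (3, 7), (4, 8), (7, 9), (5, 9), (9, 10)]
print(longest_increasing_chain(w))
```

On the example w this returns the final row of L, (1,6) (3,7) (4,8) (5,9) (9,10). Each element costs one binary search, so the total time is $O(|w| \log |w|)$.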

Sparse DP for rectangle chaining

- 1, …, N: rectangles
- $(h_j, l_j)$: y-coordinates of rectangle j
- $w(j)$: weight of rectangle j
- $V(j)$: optimal score of chain ending in j
- L: list of triplets $(l_j, V(j), j)$
  - L is sorted by $l_j$
  - L is implemented as a balanced binary tree

[Figure: a rectangle with its y-coordinates h and l marked on the y-axis]

Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from lowest to highest:

1. When on the leftmost end of i:
   - (a) j: rectangle in L, with largest $l_j < h_i$
   - (b) $V(i) = w(i) + V(j)$
2. When on the rightmost end of i:
   - (a) j: rectangle in L, with largest $l_j \le l_i$
   - (b) If $V(i) > V(j)$:
     - (i) INSERT $(l_i, V(i), i)$ in L
     - (ii) REMOVE all $(l_k, V(k), k)$ with $V(k) \le V(i)$ & $l_k \ge l_i$
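The sweep can be sketched in Python as follows. Several things here are assumptions rather than part of the slides: rectangles are tuples `(x_left, x_right, h, l, weight)` with `h <= l`; when no suitable j exists, V(i) is just w(i) and the rectangle is always inserted; left ends are processed before right ends at equal x so that chaining stays strict; and L is kept as two parallel sorted Python lists instead of a balanced binary tree, so list updates cost O(N) and this sketch does not reach the O(N log N) bound.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """Return (best_score, chain_indices) for tuples (x_left, x_right, h, l, weight)."""
    if not rects:
        return 0, []
    events = []
    for i, (x_left, x_right, _, _, _) in enumerate(rects):
        events.append((x_left, 0, i))    # kind 0: left end (sorted first on ties)
        events.append((x_right, 1, i))   # kind 1: right end
    events.sort()
    V = [0] * len(rects)
    back = [None] * len(rects)
    keys, ids = [], []                   # the list L: keys[p] = l_j of rectangle ids[p]
    for _, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            p = bisect_left(keys, h) - 1         # largest l_j < h_i
            if p >= 0:
                V[i], back[i] = weight + V[ids[p]], ids[p]
            else:
                V[i] = weight                    # no predecessor: chain starts at i
        else:
            p = bisect_right(keys, l) - 1        # largest l_j <= l_i
            if p < 0 or V[i] > V[ids[p]]:
                q = bisect_left(keys, l)
                end = q
                # Entries with l_k >= l_i and V(k) <= V(i) are consecutive from q.
                while end < len(keys) and V[ids[end]] <= V[i]:
                    end += 1
                keys[q:end] = [l]                # REMOVE dominated entries, INSERT i
                ids[q:end] = [i]
    best = max(range(len(rects)), key=lambda i: V[i])
    chain, i = [], best
    while i is not None:
        chain.append(i)
        i = back[i]
    return V[best], chain[::-1]


# Hypothetical rectangles: rectangle 1 is dominated by rectangle 0 and never enters L.
sample = [(1, 3, 1, 3, 5), (4, 6, 2, 4, 3), (5, 8, 5, 9, 6), (9, 12, 10, 12, 2)]
print(chain_rectangles(sample))
```

The pruning keeps V strictly increasing along L, which is why the single entry with the largest $l_j < h_i$ is enough to find the best predecessor.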

Example

[Figure: five rectangles in the x-y plane, labeled 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

Time Analysis

1. Sorting the x-coords takes $O(N \log N)$
2. Going through x-coords: N steps
3. Each of N steps requires $O(\log N)$ time:
   - Searching L takes $\log N$
   - Inserting to L takes $\log N$
   - All deletions are consecutive, so $\log N$ per deletion
   - Each element is deleted at most once: $N \log N$ for all deletions

Recall that INSERT, DELETE, SUCCESSOR take $O(\log N)$ time in a balanced binary search tree.
