Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.


1 Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas

2 Center star algorithm for multiple sequence global alignment
$T$ is the set of strings that we want to align. Pick $S \in T$ that minimizes $\sum_{X \in T,\, X \neq S} d(S, X)$. The initial alignment starts with $S$ ($\equiv S_1$). Suppose we have already aligned $S_1, S_2, \ldots, S_i$ as $S'_1, S'_2, \ldots, S'_i$. We then add the remaining strings one at a time: align $S_{i+1}$ with $S'_1$, obtaining $S'_{i+1}$ and $S''_1$; replace $S'_1$ with $S''_1$; and add spaces to $S'_2, \ldots, S'_i$ wherever spaces were added to $S'_1$.

3 Finding $S$
$S$ is the best representative of the set $T$ in terms of the distance metric $d$. If $T$ is considered as a cluster of strings, then $S$ is the centroid of the cluster. To find $S$, align each string with every other ($\binom{k}{2}$ pairs for $k$ strings) and calculate the sum of distances for each candidate. Pick the candidate that minimizes this sum.
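The selection of $S$ can be sketched in a few lines of Python. This is a minimal illustration, not the presentation's own code: it assumes that $d$ is the unit-cost edit distance (mismatch = 1, space = 1), and the function names `edit_distance` and `find_center` are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between strings a and b (assumed choice of d)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                           prev[j] + 1,               # space in b
                           cur[j - 1] + 1))           # space in a
        prev = cur
    return prev[-1]


def find_center(strings):
    """Return (center, totals), where totals[i] is the sum of distances
    from strings[i] to every other string."""
    totals = [
        sum(edit_distance(s, t) for j, t in enumerate(strings) if j != i)
        for i, s in enumerate(strings)
    ]
    best = min(range(len(strings)), key=totals.__getitem__)
    return strings[best], totals


center, totals = find_center(["GTA", "CGT", "CAG"])
print(center, totals)
```

Under these assumed costs, the three strings GTA, CGT, CAG give totals of 5, 4 and 5, so CGT is selected.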

4 Example
Three strings: GTA, CGT, CAG.
Step 1: Calculate all three pairwise distances and pick the string that minimizes the total distance; let’s say it’s CGT.
Step 2-1: Align CGT with GTA:
  CGT-
  -GTA
Step 2-2: Extend uninvolved, processed strings with spaces (not needed now).

5 Example (continued)
Step 3-1: Align CGT- with CAG:
  C-GT-
  CAG--
Step 3-2: Extend uninvolved, processed strings with spaces (here -GTA becomes --GTA):
  C-GT-
  --GTA
  CAG--
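The steps of the example can be reproduced with a short Python sketch. It rests on assumptions not stated in the slides: unit costs (mismatch = 1, space = 1), a space placed opposite an existing space in the center costs nothing, and the input strings contain no `-` characters. The names `align` and `center_star` are hypothetical. Ties between equally good alignments may be broken differently than on the slides, so the output can be a different but equally valid alignment.

```python
GAP = "-"


def align(center, s):
    """Align a (possibly gapped) center row with an ungapped string s.

    Returns (cost, columns). Each column is (center_char, s_char, fresh);
    fresh is True when the column is a space newly inserted into the center.
    """
    n, m = len(center), len(s)

    def skip(i):
        # Cost of a space in s opposite center[i]; free if center[i] is a space.
        return 0 if center[i] == GAP else 1

    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + skip(i - 1)
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (center[i - 1] != s[j - 1]),
                cost[i - 1][j] + skip(i - 1),
                cost[i][j - 1] + 1,
            )

    # Trace back from the bottom-right corner to recover the columns.
    columns = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (center[i - 1] != s[j - 1]):
            columns.append((center[i - 1], s[j - 1], False))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + skip(i - 1):
            columns.append((center[i - 1], GAP, False))
            i -= 1
        else:
            columns.append((GAP, s[j - 1], True))
            j -= 1
    columns.reverse()
    return cost[n][m], columns


def center_star(strings):
    """Return the aligned rows: the center first, then the others in input order."""
    k = len(strings)
    totals = [sum(align(strings[i], strings[j])[0] for j in range(k) if j != i)
              for i in range(k)]
    c = min(range(k), key=totals.__getitem__)
    rows = [strings[c]]
    for idx in range(k):
        if idx == c:
            continue
        _, columns = align(rows[0], strings[idx])
        new_center = "".join(col[0] for col in columns)
        new_row = "".join(col[1] for col in columns)
        # Extend already-processed rows with spaces wherever the center got one.
        extended = []
        for row in rows[1:]:
            chars = iter(row)
            extended.append("".join(GAP if fresh else next(chars)
                                    for _, _, fresh in columns))
        rows = [new_center] + extended + [new_row]
    return rows


for row in center_star(["GTA", "CGT", "CAG"]):
    print(row)
```

Every output row has the same length, and removing the spaces from a row recovers the original string.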

6 Algorithm complexity – Finding $S$
To find $S$, we consider $k$ candidates. For each candidate, we calculate the sum of $k-1$ terms, for $O(k^2)$ such terms in total. If the maximum string length is $n$, then each term can be calculated in $O(n^2)$ time. The total for finding $S$ is $O(k^2 n^2)$.

7 Algorithm complexity – Subsequent alignments
Each subsequent alignment at step $i+1$ aligns a string $S'_1$ of length at most $i \cdot n$ with a string $S_{i+1}$ of length at most $n$. Each alignment can be found in time $O(i n \cdot n) = O(i n^2)$. The total time for these alignments is $\sum_{i=1}^{k-1} O(i n^2) = O(k^2 n^2)$.

8 Algorithm complexity – Extensions with spaces
At step $i+1$ there is an extension of $i-1$ strings, each of length at most $i \cdot n$. For each such string, we need to consider a total of $n$ new space positions. Time required is [formula missing in the transcript]. The overall total time for the algorithm is $O(k^2 n^2)$.

9 Error bounds
It is useful to know how far the solution found by an approximate algorithm is from the true optimal solution. Sometimes (but not always) it is possible to provide error bounds, that is, to give upper and lower bounds for the quantity [formula missing in the transcript]. The bounds may depend on $n$ and $k$.

10 Error analysis assumptions
Sometimes we need additional assumptions in order to derive useful bounds. For the approximate algorithm for multiple string alignment, we assume the triangle inequality for the measure $d$: $d(x,y) \le d(x,z) + d(z,y)$.

11 Background on distances
A distance or metric $d$ is formally defined as a function $A \times A \to \mathbb{R}$ on a set $A$ (called a metric space) with the following properties:
– $d(x,y) \ge 0$ (non-negativity)
– $d(x,y) = 0$ iff $x = y$ (identity of indiscernibles)
– $d(x,y) = d(y,x)$ (symmetry)
– $d(x,y) \le d(x,z) + d(z,y)$ (triangle inequality)
Metric spaces include $\mathbb{R}$ (with $d(x,y) = |x-y|$), all Euclidean spaces, the $L^p$ spaces, and inner product spaces.

12 Background on distances
The same four properties, numbered, with what happens when each is dropped:
– (1) $d(x,y) \ge 0$: follows from 2, 3, and 4
– (2) $d(x,y) = 0$ iff $x = y$: without it, $d$ is a pseudometric
– (3) $d(x,y) = d(y,x)$: without it, $d$ is a quasimetric
– (4) $d(x,y) \le d(x,z) + d(z,y)$: without it, $d$ is a semimetric
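The four properties can be checked by brute force on a small finite set. The sketch below is only an illustration; the function name `metric_violations` and the two sample distance functions are assumptions introduced here, not part of the presentation.

```python
from itertools import product


def metric_violations(points, d):
    """Return the set of metric properties that d violates on `points`."""
    failed = set()
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            failed.add("non-negativity")
        if (d(x, y) == 0) != (x == y):
            failed.add("identity of indiscernibles")
        if d(x, y) != d(y, x):
            failed.add("symmetry")
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y):
            failed.add("triangle inequality")
    return failed


points = [0, 1, 2, 5]
print(metric_violations(points, lambda x, y: abs(x - y)))     # absolute difference
print(metric_violations(points, lambda x, y: (x - y) ** 2))   # squared difference
```

The absolute difference passes every check. The squared difference fails only the triangle inequality (for example, $d(0,2) = 4 > d(0,1) + d(1,2) = 2$), which makes it a semimetric in the terminology above. A check on a finite sample can expose a violation but cannot prove that a function is a metric in general.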

13 Deriving an error bound
Let $v_0$ be the score for the optimal alignment and $v^*$ the score for the alignment produced by the center star algorithm. Let $d_0(i,j)$ (respectively $d^*(i,j)$) be the corresponding induced distances on strings $S_i$ and $S_j$.

14 Lower bound for $v_0$
Justifications for the steps of the derivation:
– Because the induced distance can be no less than the distance between the strings themselves
– Choice of $S_1$

15 Upper bound for $v^*$
Justifications for the steps of the derivation:
– Triangle inequality
– Symmetry
– Each string is aligned with $S_1$ optimally (there may be additional spaces in matching positions, which do not change the distance)

16 Combining the bounds
Better bound for low $k$
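The equations for slides 14 to 16 did not survive in the transcript. The following is a reconstruction of the standard argument, offered as a sketch under two assumptions: the score of a multiple alignment is the sum of the induced pairwise distances over all pairs of strings, and the triangle inequality also holds for the induced distances. The shorthand $M$ is introduced here for convenience.

```latex
% Shorthand (introduced here): M = \sum_{j=2}^{k} d(S_1, S_j)

% Lower bound for v_0
v_0 = \frac{1}{2} \sum_{i=1}^{k} \sum_{j \neq i} d_0(i,j)
    \;\ge\; \frac{1}{2} \sum_{i=1}^{k} \sum_{j \neq i} d(S_i, S_j)
      % induced distance is no less than the distance between the strings
    \;\ge\; \frac{1}{2} \sum_{i=1}^{k} \sum_{j=2}^{k} d(S_1, S_j)
      % choice of S_1
    \;=\; \frac{k}{2}\, M

% Upper bound for v^*
v^* = \frac{1}{2} \sum_{i=1}^{k} \sum_{j \neq i} d^*(i,j)
    \;\le\; \frac{1}{2} \sum_{i=1}^{k} \sum_{j \neq i}
            \bigl[ d^*(i,1) + d^*(1,j) \bigr]
      % triangle inequality
    \;=\; (k-1) \sum_{j=2}^{k} d^*(1,j)
      % symmetry
    \;=\; (k-1)\, M
      % each string is aligned with S_1 optimally

% Combining the bounds
\frac{v^*}{v_0} \;\le\; \frac{(k-1)\, M}{\tfrac{k}{2}\, M}
                \;=\; \frac{2(k-1)}{k} \;=\; 2 - \frac{2}{k} \;<\; 2
```

Under these assumptions the ratio is at most $4/3$ for $k = 3$ and approaches 2 as $k$ grows, which is consistent with the remark that the bound is better for low $k$.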

17 Motif data notation
A motif is denoted by three parameters:
– Its length $l$
– The number of allowed spaces $g$
– The number of allowed changes $d$
– $(l, d, g)$ notation
Changes and gaps are allowed because of mutations across organisms. In a “good” motif, $g$ and $d$ are small compared to $l$. Most work assumes $g = 0$.
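For the common case $g = 0$, checking whether a candidate string is an instance of a motif reduces to counting changes position by position. The sketch below assumes that "changes" means substitutions (Hamming distance); the function names and the example sequences are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_instance(candidate, motif, d):
    """True if candidate matches motif with at most d changes and no spaces (g = 0)."""
    return len(candidate) == len(motif) and hamming(candidate, motif) <= d


# An (l, d, g) = (8, 1, 0) motif: length 8, at most one change, no spaces.
print(is_instance("ACGTTCGT", "ACGTACGT", 1))  # one change
print(is_instance("AGGTTCGT", "ACGTACGT", 1))  # two changes
```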

18 Finding the motif consensus
Assume known motif instance positions and length (e.g., via multiple alignment). This is also known as the known site problem.
Input: A set of motif instances.
Output: The motif consensus. Further, is the consensus a valid motif, or is it statistically indistinguishable from what we would expect from other randomly chosen regions?
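One simple way to obtain a consensus from known instances is a per-column majority vote. This is a sketch under assumptions, not necessarily the method the presentation goes on to use: the instances have equal length with no spaces, ties are broken alphabetically, and the function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(instances):
    """Most frequent character in each column of equal-length instances."""
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("all instances must have the same length")
    result = []
    for column in zip(*instances):
        counts = Counter(column)
        # Highest count wins; ties are broken alphabetically for determinism.
        result.append(min(counts, key=lambda c: (-counts[c], c)))
    return "".join(result)


print(consensus(["ACGT", "ACGA", "TCGT"]))
```

A majority vote answers only the first question in the output above. Whether the consensus is statistically distinguishable from randomly chosen regions requires a separate test.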

19 Statistical estimation
An important approach to many data mining and machine learning tasks.
Requirement: The problem must be expressed as a probability function that depends on a number of modeled parameters whose values are unknown.
The estimation task: Find the optimal values for these parameters.

20 Estimation example
Estimation can be performed without an explicit probabilistic model. Example: Futures markets are exchanges where contracts are traded for future execution. The contract price reflects the probabilities of events.

21 Obama contract at intrade.com

