Presentation is loading. Please wait.

Presentation is loading. Please wait.

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] 06-11-2006 Sequence Analysis Alignments 2: Local alignment Sequence Analysis 06-11-2006.

Similar presentations


Presentation on theme: "CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] 06-11-2006 Sequence Analysis Alignments 2: Local alignment Sequence Analysis 06-11-2006."— Presentation transcript:

1 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] 06-11-2006 Sequence Analysis Alignments 2: Local alignment Sequence Analysis 06-11-2006

2 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [2] 06-11-2006 Sequence Analysis What can sequence alignment tell us about structure HSSP Sander & Schneider, 1991 ≥30% sequence identity

3 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [3] 06-11-2006 Sequence Analysis Pair-wise alignment Complexity of the problem Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. 2n (2n)! 2 2n = ~ n (n!) 2   n 2 sequences of 300 a.a.: ~10 88 alignments 2 sequences of 1000 a.a.: ~10 600 alignments! T D W V T A L K T D W L - - I K

4 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [4] 06-11-2006 Sequence Analysis The algorithm for linear gap penalties M[i-1,j-1]+score(X[i],Y[j]) M[i,j]= max M[i,j-1]-g M[i-1,j]-g Corresponds to: X 1 …X i-1 X i Y 1 …Y j-1 Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 …X i-1 X i Y 1 …Y j-1 - Value from residue exchange matrix i-1 i j-1 j

5 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [5] 06-11-2006 Sequence Analysis – Substitution (or match/mismatch) DNA proteins – Gap penalty Linear: gp(k)=  k Affine: gp(k)=  +  k Concave, e.g.: gp(k)=log(k) The score for an alignment is the sum of the scores over all alignment columns Dynamic programming Scoring alignments / Gap length penalty affine concave linear, General alignment score: S a,b =

6 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [6] 06-11-2006 Sequence Analysis Scoring and gap penalties Scoring: –DNA or protein? –Which evolutionary distance do we cover? Gap penalty: always less than 0 (i.e. they lower the alignment score) –Linear: g(k)=  k –Affine: g(k)=  +  k  harder to implement –Concave: g(k)=log(k)  no efficient solution

7 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [7] 06-11-2006 Sequence Analysis Global dynamic programming – general algorithm M[i-1,j-1] M[i,j] = score(X[i],Y[j]) + max max{M[0<x<i-1, j-1] - g open - (i-x-1)g extension } max{M[i-1, 0<x<j-1] - g open - (i-y-1)g extension } Value form residue exchange matrix i-1 i j-1 j Gap open penalty Gap extension penalty Number of gap extensions This more general way of dynamic programming also allows for affine or other gap penalty regimes

8 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [8] 06-11-2006 Sequence Analysis Easy DP recipe for using affine gap penalties M[i,j] is optimal alignment (highest scoring alignment until [i,j]) Check –Cell[i-1, j-1]: apply score for cell[i-1, j-1] –preceding row until j-2: apply appropriate gap penalties –preceding column until i-2: apply appropriate gap penalties i-1 j-1

9 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [9] 06-11-2006 Sequence Analysis Note about gap penalties Some affine schemes use gap_penalty = -g open –g extension *(l-1), while others use gap_penalty = -g open –g extension *l, where l is the length of the gap. One can be converted into the other by adapting the g open penalty

10 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [10] 06-11-2006 Sequence Analysis Global dynamic programming g open =10, g extension =2 DWVTALK 0-12-14-16-18-20-22-24 T-128-9-6-5-9-11-14 D 09223-5-3-34 W-16-1325115490-21 V-18-10-4372119 15-16 L-20-14-223463137261 K-22-12-9173353395014 -34-2917392750 DWVTALK T83811998 D12168848 W12523265 V621288106 L46 96145 K85687513 These values are copied from the PAM250 matrix, after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix) The extra bottom row and rightmost column give the final global alignment scores

11 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [11] 06-11-2006 Sequence Analysis Variation on global alignment Global alignment: the previous algorithm is called global alignment, because it uses all letters from both sequences. CAGCACTTGGATTCTCGG CAGC-----G-T----GG Semi-global alignment: don’t penalize for start/end gaps (omit the start/end of sequences). CAGCA-CTTGGATTCTCGG ---CAGCGTGG-------- – Applications of semi-global: – Finding a gene in genome – Placing marker onto a chromosome – One sequence much longer than the other – Danger! – really bad alignments for divergent seqs seq X: seq Y:

12 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [12] 06-11-2006 Sequence Analysis Global alignments - review Take two sequences: X[j] and Y[j] The best alignment for X[1…i] and Y[1…j] is called M[i, j] Initiation: M[0,0]=0 Apply the equation Find the alignment with backtracing X[j] Y[i] 2

13 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [13] 06-11-2006 Sequence Analysis Global and local alignment Local Global A B A B A B BA AB Local Global A B A C C A BC ABC A B A C C Problem:

14 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [14] 06-11-2006 Sequence Analysis Local alignment What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally maximal: cannot make it better by trimming/extending the alignment Seq X: Seq Y:

15 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [15] 06-11-2006 Sequence Analysis Local alignment Why local? –Parts of sequence diverge faster evolutionary pressure does not put constraints on the whole sequence –Proteins have modular construction sharing domains between sequences Seq X: Seq Y: seq X: seq Y: seq X: seq Y:

16 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [16] 06-11-2006 Sequence Analysis Domains - example Immunoglobulin domain

17 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [17] 06-11-2006 Sequence Analysis Global  local alignment Take the old, good equation Look at the result of the global alignment q e s - euqes- CAGCACTTGGATTCTCG- CA-C-----GATTCGT-G a) global align b) retrieve the result c) sum score along the result align. pos. sum

18 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [18] 06-11-2006 Sequence Analysis Local alignment – breaking the alignment A recipe –Just don’t let the score go below 0 –Start the new alignment when it happens –Where is the result in the matrix? Before:After: sum align. pos. q e s 0 - euqes- q e 0 s 0 - euqes- sum align. pos. sum align. pos.

19 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [19] 06-11-2006 Sequence Analysis Local alignment – the equation Init the matrix with 0’s Read the maximal value from anywhere in the matrix Find the result with backtracking M[i, j] = M[i, j- 1 ] – g M[i- 1, j] – g M[i- 1, j- 1 ] + score(X[i],Y[j]) max 0 Great contribution to science! q e s - euqes-

20 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [20] 06-11-2006 Sequence Analysis Finding second best alignment We can find the best local alignment in the sequence But where is the second best one? … 1613 A 1815 G 1618201714 A 15171916 G 141618 A … …AGA… A clump Best alignment Scoring: 1 for match -2 for a gap

21 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [21] 06-11-2006 Sequence Analysis Clump of an alignment Alignments sharing at least one aligned pair A clump … A G 20 A 1916 G 18 A … …AGA…

22 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [22] 06-11-2006 Sequence Analysis Example: repeats proteins Local alignment traces showing similarity between repeats Note: repeats are similar motifs but need not be identical

23 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [23] 06-11-2006 Sequence Analysis Clumps gene Y gene X The figure shows two proteins with so- called tandem repeats (similar motifs adjacent in the sequence) ‘Shadows’ of various alignments are visible – these all derive from parts of a same locally optimal alignment

24 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [24] 06-11-2006 Sequence Analysis Finding second best alignment Don’t let any matched pair contribute to the next alignment Recalculate the clump “Clear” the best alignment

25 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [25] 06-11-2006 Sequence Analysis Clumps gene Y gene X The figure shows two proteins with so- called tandem repeats (similar motifs adjacent in the sequence) The highest scoring local alignment is indicated

26 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [26] 06-11-2006 Sequence Analysis Clumps gene Y gene X

27 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [27] 06-11-2006 Sequence Analysis Extraction of alignments – Waterman-Eggert algorithm 1. Repeat a.Retrieve the highest scoring alignment b.Set its trace to 0

28 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [28] 06-11-2006 Sequence Analysis General local DP algorithm for using affine gap penalties M[i,j] is optimal alignment (highest scoring alignment until [i, j]) At each cell [i, j] in search matrix, check Max coming from: any cell in preceding row until j-2: add score for cell[i, j] minus appropriate gap penalties; any cell in preceding column until i-2: add score for cell[i, j] minus appropriate gap penalties; or cell[i-1, j-1]: add score for cell[i, j] Select highest scoring cell anywhere in matrix and do trace-back until zero-valued cell or start of sequence(s) i-1 j-1 Penalty = P open + gap_length*P extension S i,j = Max S i,j + Max{S 0<x<i-1,j-1 - Pi - (i-x-1)Px} S i,j + S i-1,j-1 S i,j + Max {S i-1,0<y<j-1 - Pi - (j-y-1)Px} 0

29 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [29] 06-11-2006 Sequence Analysis Example: general algorithm for local alignment DP algorithm with affine gap penalties (PAM250, Po=10, Pe=2) Extra start/end columns/rows not necessary (no end-gaps). Each negative scoring cell is set to zero. Highest scoring cell may be found anywhere in search matrix after calculating it. Trace highest scoring cell back to first cell with zero value (or the beginning of one or both sequences) DWVTALK T0-503110 D4-7-200-40 W-717-6-5-6-2-3 V-2-64002-2 L-4-221 6-3 K0 -20-35 DWVTALK T0003 D4000 W02100 V00259 L0011 K00 PAM250

30 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [30] 06-11-2006 Sequence Analysis Pitfalls of alignments  Alignment is often not a reconstruction of evolution (common ancestor is usually extinct by the time of alignment)  Repeats: matches to the same fragment seq X: seq Y:

31 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [31] 06-11-2006 Sequence Analysis Summary 1. Global e.g Needleman-Wunsch algorithm 2. Semi-global 3. Local e.g.Smith-Waterman algorithm 4. Many local alignments aka Waterman-Eggert algorithm What’s the number of steps in these algorithms? How much memory is used? seq X: seq Y: seq X: seq Y: seq X: seq Y: seq X: seq Y: These questions deal with the algorithmic complexity (time, memory) of alignment

32 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [32] 06-11-2006 Sequence Analysis Semi-global alignment Ignore start gaps –First row/column set to 0 Ignore end gaps –Read the result from last row/column T G A - GTGAG- 13000 -202 10 1 0 000000

33 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [33] 06-11-2006 Sequence Analysis Low complexity regions Local composition bias –Replication slippage: e.g. triplet repeats Results in spurious hits –Breaks down statistical models –Different proteins reported as hits due to similar composition –Up to ¼ of the sequence can be biased Widely used filtering program: SEGS (Wootton and Federhen, 1996) Huntington’s disease –Huntingtin gene of unknown (!) function –Repeats #: 6-35: normal; 36- 120: disease

34 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [34] 06-11-2006 Sequence Analysis Synteny Synteny is preservation of gene order –5 to 20 million years of divergence Sequencing and comparison of yeast species to identify genes and regulatory elements, Manolis Kellis et.al, Nature 423, 241 - 254 (15 May 2003) Arrows in figure represent genes; arrow direction indicates ‘+’ or ‘-’ strand of DNA Figure shows preservation of gene order in four yeast species

35 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [35] 06-11-2006 Sequence Analysis Genome vs. gene *) b stands for bases or nucleotides

36 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [36] 06-11-2006 Sequence Analysis Take-home messages (I) Global, semi-global and local alignment Three types of gap penalties ‘easy’ (fast) and general algorithm Pitfalls of local alignment Low-complexity regions

37 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [37] 06-11-2006 Sequence Analysis Make sure you understand and can carry out 1. the ‘simple’ DP algorithm (for linear gap penalties) 2. The general DP algorithm for global, semi- global and local alignment! Take-home messages (II)


Download ppt "CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] 06-11-2006 Sequence Analysis Alignments 2: Local alignment Sequence Analysis 06-11-2006."

Similar presentations


Ads by Google