Presentation is loading. Please wait.

Presentation is loading. Please wait.

C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E 02-11-2006 Alignments 1 Sequence Analysis.

Similar presentations


Presentation on theme: "C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E 02-11-2006 Alignments 1 Sequence Analysis."— Presentation transcript:

1 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E 02-11-2006 Alignments 1 Sequence Analysis

2 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [2] 02-11-2006 Sequence Analysis Searching for similarities What is the function of a new gene? The “lazy” investigation: – Find a set of similar proteins – Identify similarities and differences – For long proteins: identify domains Domains are structural units in a protein tertiary structure and often provide a given (sub)function to the complete protein

3 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [3] 02-11-2006 Sequence Analysis Is similarity really interesting? Common ancestry is a very important observation Makes it more likely that genes share the same function Homology: sharing a common ancestor – a binary property (yes/no) – It’s a nice tool: When (a known gene) G is homologous to (an unknown) X it means that we gain a lot of information on X Z X G

4 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [4] 02-11-2006 Sequence Analysis Functional and evolutionary Evolutionary relation, reconstruction: – Based on sequence Identity (simplest method) Similarity – Homology (the ultimate goal) – Other (e.g., 3D structure) Functional relation Sequence  Structure  Function determines

5 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [5] 02-11-2006 Sequence Analysis Evolution and 3d protein structure information Isocitrate dehydrogenise: The distance from the active site (yellow) determines the rate of evolution. (red = fast evolution blue = slow evolution) Dean, A. M. and G. B. Golding, Pacific Symposium on Bioinformatics 2000

6 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [6] 02-11-2006 Sequence Analysis How to determine similarity? Frequent evolutionary events: 1. Substitution 2. Insertion, deletion 3. Duplication 4. Inversion Evolution at work We’ll use only these Z X Y Common ancestor, usually extinct available

7 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [7] 02-11-2006 Sequence Analysis Alignment Mutations: substitution, insertion and deletion Which alignment is better? Use common sense and call it: – Simplest – Most probable – Maximum likely

8 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [8] 02-11-2006 Sequence Analysis Scoring Should give reasonable alignments And have to assign scores to: – Substitution (or match/mismatch) DNA proteins – Gap penalty Linear: g(k)=  k Affine: g(k)=  +  k Concave, e.g.: g(k)=log(k) The score for an alignment is the sum of scores of all alignment columns

9 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [9] 02-11-2006 Sequence Analysis Substitution matrices Define a score for match/mismatch of letters DNA - Simple: - Used in genome alignments

10 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [10] 02-11-2006 Sequence Analysis Substitution matrices for aa Amino acids are not equal: 1. Some are easily substituted, similar: biochemical properties structure 2. Some mutations occur more often due to similar codons The two above give us substitution matrices

11 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [11] 02-11-2006 Sequence Analysis Blosum62 matrix # BLOSUM Clustered Scoring Matrix in 1/2 Bit Units # Blocks Database = /data/blocks_5.0/blocks.dat # Cluster Percentage: >= 62 # Entropy = 0.6979, Expected = -0.5209 A R N D C Q E G H I L K M F P S T W Y V A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

12 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [12] 02-11-2006 Sequence Analysis Linear vs. affine scoring Seq1 G T A - - G - T - A Seq2 - - A T G - A T G - Linear -2 –2 1 –2 –2 (SUM=-7) -2 –2 1 -2 –2 (SUM=-7) Affine -3 –1 1 -3 –1 (SUM=-7) -3 –3 1 -3 –3 (SUM=-11) … and +1 for match Gap Scoring Introductionextension Linear-2 Affine-3

13 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [13] 02-11-2006 Sequence Analysis The algorithm Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for insertion/deletion Dynamic programming – Solve smaller subproblem(s) – Iteratively extend the solution The best alignment for X[ 1…i ] and Y[ 1…j ] is called M[ i, j ] X 1 … X i-1 - - X i Y 1 … - Y j-1 Y j -

14 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [14] 02-11-2006 Sequence Analysis The algorithm Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for insertion/deletion The best alignment for X[1…i] and Y[1…j] is called M[i, j] 3 ways to extend the alignment: X[1…i-1] X[i] X[1…i] - X[1…i-1] X[i] Y[1…j-1] Y[j] Y[1…j-1] Y[j] Y[1…j] - M[i,j]= M[i-1,j-1] M[i,j-1]-g M[i-1,j]-g +m -s

15 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [15] 02-11-2006 Sequence Analysis The algorithm for linear gap penalties M[i-1,j-1]+score(X[i],Y[j]) M[i,j]= max M[i,j-1]-g M[i-1,j]-g Corresponds to: X 1 …X i-1 X i Y 1 …Y j-1 Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 …X i-1 X i Y 1 …Y j-1 - Value form residue exchange matrix i-1 i j-1 j

16 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [16] 02-11-2006 Sequence Analysis Example: global alignment of two sequences Align two DNA sequences: – GAGTGA – GAGGCGA (note the length difference) Parameters of the algorithm: – Match: score(A,A) = 1 – Mismatch: score(A,T) = – 1 – Gap: g = 2 M[i-1,j-1]  1 M[i,j]= max M[i,j-1]-2 M[i-1,j]-2

17 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [17] 02-11-2006 Sequence Analysis The algorithm. Step 1: init Create the matrix Initiation – 0 at [0,0] – Apply the equation… M[i-1,j-1]  1 M[i,j]= max M[i,j-1]-2 M[i-1,j]-2 jj 0123456 i  -GAGTGA 0- 0 1G 2A 3G 4G 5C 6G 7A

18 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [18] 02-11-2006 Sequence Analysis The algorithm. Step 1: init Initiation of the matrix: – 0 at pos [0,0] – Fill in the first row using the “  ” rule – Fill in the first column using “  ” M[i-1,j-1]  1 M[i,j]= max M[i,j-1]-2 M[i-1,j]-2 -GAGTGA - 0-2-4-6-8-10-12 G -2 A -4 G -6 G -8 C -10 G -12 A -14 j i

19 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [19] 02-11-2006 Sequence Analysis The algorithm. Step 2: fill in Continue filling in of the matrix, remembering from which cell the result comes (arrows) M[i-1,j-1]  1 M[i,j]= max M[i,j-1]-2 M[i-1,j]-2 -GAGTGA - 0-2-4-6-8-10-12 G -21-3 A -42 G -6 G -8 C -10 G -12 A -14 j i

20 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [20] 02-11-2006 Sequence Analysis The algorithm. Step 2: fill in We are done… Where’s the result? M[i-1,j-1]  1 M[i,j]= max M[i,j-1]-2 M[i-1,j]-2 -GAGTGA - 0-2-4-6-8-10-12 G -21-3-5-7-9 A -420-2-4-6 G -3031-3 G -8-5-21220 C -10-7-4011 G -12-9-6-3-210 A -14-11-8-5-42 j i

21 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [21] 02-11-2006 Sequence Analysis The algorithm. Step 3: backtrace Start at the last cell of the matrix Go in the direction of arrows Sometimes the value may be obtained from more than one cell (which one?) -GAGTGA - 0-2-4-6-8-10-12 G -21-3-5-7-9 A -420-2-4-6 G -3031-3 G -8-5-21220 C -10-7-4011 G -12-9-6-3-210 A -14-11-8-5-42 j i M[i-1,j-1]  1 M[i,j]= max M[i,j-1]-2 M[i-1,j]-2

22 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [22] 02-11-2006 Sequence Analysis The algorithm. Step 3: backtrace Extract the alignments a) GAGT-GA GAGGCGA b) GA-GTGA GAGGCGA -GAGTGA - 0-2-4-6-8-10-12 G -21-3-5-7-9 A -420-2-4-6 G -3031-3 G -8-5-21220 C -10-7-4011 G -12-9-6-3-210 A -14-11-8-5-42 j i a b

23 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [23] 02-11-2006 Sequence Analysis Global dynamic programming – general algorithm M[i-1,j-1] M[i,j] = score(X[i],Y[j]) + max max{M[0<x<i-1, j-1] - g open - (i-x- 1)g extension } max{M[i-1, 0<x<j-1] - g open - (i-y- 1)g extension } Value form residue exchange matrix i-1 i j-1 j Gap open penalty Gap extension penalty Number of gap extensions This more general way of dynamic programming also allows for affine or other gap penalties

24 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [24] 02-11-2006 Sequence Analysis Easy DP recipe for using affine gap penalties M[i,j] is optimal alignment (highest scoring alignment until [i,j]) Check Cell[i-1, j-1]: apply score for cell[i-1, j-1] preceding row until j-2: apply appropriate gap penalties preceding column until i-2: apply appropriate gap penalties i-1 j-1

25 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [25] 02-11-2006 Sequence Analysis Note about gap penalties Some affine schemes use gap_penalty = -g open –g extension *(l-1), while others use gap_penalty = -g open –g extension *l, where l is the length of the gap.

26 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [26] 02-11-2006 Sequence Analysis Global dynamic programming g open =10, g extension =2 DWVTALK 0-12-14-16-18-20-22-24 T-128-9-6-5-9-11-14 D 09223-5-3-34 W-16-1325115490-21 V-18-10-4372119 15-16 L-20-14-223463137261 K-22-12-9173353395014 -34-2917392750 DWVTALK T83811998 D12168848 W12523265 V621288106 L46 96145 K85687513 These values are copied from the PAM250 matrix, after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix) The extra bottom row and rightmost column give the final global alignment scores

27 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [27] 02-11-2006 Sequence Analysis Variation on global alignment Global alignment: the previous algorithm is called global alignment, because it uses all letters from both sequences. CAGCACTTGGATTCTCGG CAGC-----G-T----GG Semi-global alignment: don’t penalize for start/end gaps (omit the start/end of sequences). CAGCA-CTTGGATTCTCGG ---CAGCGTGG-------- – Applications of semi-global: – Finding a gene in genome – Placing marker onto a chromosome – One sequence much longer than the other – Danger! – really bad alignments for divergent seqs seq X: seq Y:

28 CENTRFORINTEGRATIVE BIOINFORMATICSVU E [28] 02-11-2006 Sequence Analysis Take-home message Homology Why are we interested in similarity? Pairwise alignment: global alignment


Download ppt "C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E 02-11-2006 Alignments 1 Sequence Analysis."

Similar presentations


Ads by Google