Presentation is loading. Please wait.

Presentation is loading. Please wait.

Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

Similar presentations


Presentation on theme: "Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell."— Presentation transcript:

1 Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell

2 Slide 2 EE3J2 Data Mining Objectives  To motivate the need for sequence analysis  To introduce the notions of: –Alignment path –Accumulated distance along a path  To introduce Dynamic Programming and the principle of optimality  To explain the use of dynamic programming to find the distance between two sequences

3 Slide 3 EE3J2 Data Mining Sequences  Sequential data common in real applications: –DNA analysis in bioinformatics and forensic science –Sequences of the letters A, G, C and T –Signature recognition biometrics –Words and text –Spelling and grammar checkers, author verification,.. –Speech, music and audio –Speech/speaker recognition, speech coding and synthesis –Electronic music –Radar signature recognition…

4 Slide 4 EE3J2 Data Mining Retrieving sequential data  Large corpora of sequential data –E.g: DNA databases  Individual sequences may not be amenable to human interpretation  Need for automated sequential data retrieval  Need to ‘mine’ sequential data  Fundamental requirement is a measure of distance between two sequences

5 Slide 5 EE3J2 Data Mining Basic definitions  In a typical sequence analysis application we have a basic alphabet consisting of N symbols  Examples: –In conventional text A is the set of letters plus punctuation plus ‘white space’ –In bioinformatics, scientists us {A, B, D, F} to describe DNA sequences

6 Slide 6 EE3J2 Data Mining Distance between sequences (1)  Consider two sequences of symbols from the alphabet A = {A, B, C, D}  How similar are the sequences: –S 1 = ABCD –S 2 = ABD  Intuitively S 2 is obtained from S 1 by deleting C  An alternative explanation is that S 2 is obtained from S 1 by substituting D for C and then deleting D

7 Slide 7 EE3J2 Data Mining Distance between sequences (2)  Alternatively S 2 was obtained from S 1 be deleting ABCD and inserting ABC  …  Why is the first explanation intuitively ‘correct’ and the last one intuitively ‘wrong’?  Because we favour the simplest explanation which involves the minimum number of insertions, deletions and substitutions  Or do we?

8 Slide 8 EE3J2 Data Mining Distance between sequences (3)  Consider: –S 1 = AABC –S 2 = SABC –S 3 = PABC –S 4 = ASCB  If we know that these sequences were typed then S 2 is closer to S 1 than S 3 is, because A and S are adjacent on a keyboard  Similarly S 4 is close to S 2 because letter-swapping (SA  AS etc) is a common typographical error

9 Slide 9 EE3J2 Data Mining Alignments  The relationship between two sequences can be expressed as an alignment between their elements  E.G: Insertion A B B C ABCABC

10 Slide 10 EE3J2 Data Mining Alignment: deletion and substitution A X C ABCABC substitution A C ABCABC deletion N.B: All edits described relative to vertical string

11 Slide 11 EE3J2 Data Mining General alignment path A C X C C D ABCDABCD Which alignment is best?

12 Slide 12 EE3J2 Data Mining The Distance Matrix  Suppose that d(A,B) denotes the distance between the alphabet symbols A and B  Examples: –d(A,B) = 0 if A = B, otherwise d(A,B) = 1 –In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B –In bioinformatics d(A,B) might indicate how unlikely it is that symbol A in a DNA string is replaced by B

13 Slide 13 EE3J2 Data Mining Notation  Suppose we have an alphabet:  A distance matrix for A is an N  N matrix where is the distance between the m th and n th alphabet symbols

14 Slide 14 EE3J2 Data Mining The Accumulated Distance  Consider two sequences: –S 1 = ABCD –S 2 = ACXCCD  For any alignment path p between S 1 and S 2 we define the accumulated distance between S 1 and S 2, denoted by AD p (S 1,S 2 ), to be the sum over all the nodes of p of the corresponding distances between elements of S 1 and S 2.

15 Slide 15 EE3J2 Data Mining Accumulated distance along p A C X C C D ABCDABCD Path p

16 Slide 16 EE3J2 Data Mining Accumulated distance (continued) A C X C C D ABCDABCD K DEL K INS K INS K INS

17 Slide 17 EE3J2 Data Mining Optimal path  The optimal path is the path with minimum accumulated distance  Formally the optimal path is where:  The accumulated distance AD(S 1,S 2 ) between S 1 and S 2 is given by:

18 Slide 18 EE3J2 Data Mining Calculating the optimal path  Given –the distance matrix D, –the insertion penalty K INS, and –the deletion penalty K DEL *  How can we compute the optimal path between two (potentially long) sequences S 1 and S 2 ?  In a real application the lengths of S 1 and S 2 might be large * If K DEL and K INS are not defined you should assume that they are zero

19 Slide 19 EE3J2 Data Mining Dynamic Programming (DP)  The optimal path is calculated using Dynamic Programming (DP)  DP is based on the principle of optimality Suppose all paths from X to Y must pass through A or B immediately before reaching Y. The optimal path from X to Y is the best of: (i) the best path from X to A plus the cost of going from A to Y (ii) the best path from X to A plus the cost of going from B to Y XY A B

20 Slide 20 EE3J2 Data Mining DP – step 1  Step 1: draw the trellis of all possible paths A C X C C D ABCDABCD

21 Slide 21 EE3J2 Data Mining DP – forward pass – initialisation A C X C C D ABCDABCD (Forward) path matrix d(A,A) (Forward) accumulated distance matrix ad(2,1)=ad(1,1)+d(2,1)

22 Slide 22 EE3J2 Data Mining ad(i,j)  ad(i,j) is the sum of distances along the best (partial) path from (1,1) to (i,j)  It is calculated using the principle of optimality  In addition, the forward path matrix records the local optimal path at each point (i-1,j) (i,j-1) (i-1,j-1)

23 Slide 23 EE3J2 Data Mining DP – forward pass – continued A C X C C D ABCDABCD (Forward) path matrix (Forward) accumulated distance matrix ad(1,2) = ad(1,1)+d(A,C)

24 Slide 24 EE3J2 Data Mining DP – forward pass – continued A C X C C D ABCDABCD (Forward) path matrix (Forward) accumulated distance matrix

25 Slide 25 EE3J2 Data Mining DP – forward pass – continued A C X C C D ABCDABCD (Forward) path matrix (Forward) accumulated distance matrix

26 Slide 26 EE3J2 Data Mining DP – forward pass – continued A C X C C D ABCDABCD (Forward) path matrix (Forward) accumulated distance matrix Optimal path obtained by back-tracking through the forward path matrix, starting at the bottom right-hand corner

27 Slide 27 EE3J2 Data Mining Summary  Introduction to sequence analysis  Definitions of: –distance, –forward accumulated distance, –forward path matrix, –optimal path  Dynamic programming  How to computed the accumulated distance using DP  How to recover the optimal path


Download ppt "Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell."

Similar presentations


Ads by Google