Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

Slide 2: Objectives
• To motivate the need for sequence analysis
• To introduce the notions of:
  – Alignment path
  – Accumulated distance along a path
• To introduce Dynamic Programming and the principle of optimality
• To explain the use of dynamic programming to find the distance between two sequences

Slide 3: Sequences
• Sequential data is common in real applications:
  – DNA analysis in bioinformatics and forensic science
  – Sequences of the letters A, G, C and T
  – Signature recognition biometrics
  – Words and text
  – Spelling and grammar checkers, author verification, ...
  – Speech, music and audio
  – Speech/speaker recognition, speech coding and synthesis
  – Electronic music
  – Radar signature recognition ...

Slide 4: Retrieving sequential data
• Large corpora of sequential data
  – E.g. DNA databases
• Individual sequences may not be amenable to human interpretation
• Need for automated sequential data retrieval
• Need to ‘mine’ sequential data
• Fundamental requirement is a measure of distance between two sequences

Slide 5: Basic definitions
• In a typical sequence analysis application we have a basic alphabet A consisting of N symbols
• Examples:
  – In conventional text, A is the set of letters plus punctuation plus ‘white space’
  – In bioinformatics, scientists use {A, G, C, T} to describe DNA sequences

Slide 6: Distance between sequences (1)
• Consider two sequences of symbols from the alphabet A = {A, B, C, D}
• How similar are the sequences:
  – S1 = ABCD
  – S2 = ABD
• Intuitively, S2 is obtained from S1 by deleting C
• An alternative explanation is that S2 is obtained from S1 by substituting D for C and then deleting the final D

Slide 7: Distance between sequences (2)
• Alternatively, S2 was obtained from S1 by deleting ABCD and inserting ABD
• …
• Why is the first explanation intuitively ‘correct’ and the last one intuitively ‘wrong’?
• Because we favour the simplest explanation, the one that involves the minimum number of insertions, deletions and substitutions
• Or do we?

Slide 8: Distance between sequences (3)
• Consider:
  – S1 = AABC
  – S2 = SABC
  – S3 = PABC
  – S4 = ASCB
• If we know that these sequences were typed, then S2 is closer to S1 than S3 is, because A and S are adjacent on a keyboard
• Similarly, S4 is close to S2 because letter-swapping (SA → AS etc.) is a common typographical error

Slide 9: Alignments
• The relationship between two sequences can be expressed as an alignment between their elements
• E.g. insertion:
[Figure: alignment grid with ABBC along the horizontal axis and ABC along the vertical axis, illustrating an insertion]

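Slide 10: Alignment: deletion and substitution
[Figure: alignment grid with AXC along the horizontal axis and ABC along the vertical axis, illustrating a substitution]
[Figure: alignment grid with AC along the horizontal axis and ABC along the vertical axis, illustrating a deletion]
N.B. All edits are described relative to the vertical string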

Slide 11: General alignment path
[Figure: alignment grid with ACXCCD along the horizontal axis and ABCD along the vertical axis, showing a general alignment path]
• Which alignment is best?

Slide 12: The Distance Matrix
• Suppose that d(A,B) denotes the distance between the alphabet symbols A and B
• Examples:
  – d(A,B) = 0 if A = B, otherwise d(A,B) = 1
  – In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B
  – In bioinformatics, d(A,B) might indicate how unlikely it is that symbol A in a DNA string is replaced by B

Slide 13: Notation
• Suppose we have an alphabet: $A = \{a_1, \dots, a_N\}$
• A distance matrix for A is an N × N matrix $D = [d_{m,n}]$, where $d_{m,n} = d(a_m, a_n)$ is the distance between the m-th and n-th alphabet symbols
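
The symbol distance and the N × N distance matrix can be sketched in a few lines of Python. This is only an illustration: the function names, the example alphabet, and the 0.5 cost for adjacent keys are assumptions made here, not values from the lecture. The only keyboard fact taken from the slides is that A and S are adjacent.

```python
def zero_one_distance(a, b):
    """The simplest symbol distance: 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


# Hypothetical typing distance: adjacent keys are cheaper to confuse.
# Only the pair (A, S) is listed; the cost 0.5 is an illustrative assumption.
ADJACENT_KEYS = {frozenset(("A", "S"))}


def typing_distance(a, b):
    if a == b:
        return 0
    if frozenset((a, b)) in ADJACENT_KEYS:
        return 0.5
    return 1


def distance_matrix(alphabet, d):
    """Build the N x N matrix whose (m, n) entry is d(alphabet[m], alphabet[n])."""
    return [[d(a, b) for b in alphabet] for a in alphabet]


ALPHABET = ["A", "B", "C", "D"]
D = distance_matrix(ALPHABET, zero_one_distance)
for row in D:
    print(row)
```

With the zero-one distance the matrix has zeros on the diagonal and ones elsewhere; swapping in `typing_distance` changes only the entries for symbol pairs listed as adjacent.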

Slide 14: The Accumulated Distance
• Consider two sequences:
  – S1 = ABCD
  – S2 = ACXCCD
• For any alignment path p between S1 and S2 we define the accumulated distance between S1 and S2, denoted by AD_p(S1,S2), to be the sum over all the nodes of p of the corresponding distances between elements of S1 and S2
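
As a rough sketch of this definition, the accumulated distance along one given path is just a sum of symbol distances over the path's nodes. The path below is a hypothetical example chosen for illustration (it is not necessarily the path drawn on the slides), nodes are written as 0-based (i, j) index pairs into S1 and S2, and insertion and deletion penalties are ignored.

```python
def zero_one_distance(a, b):
    """0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


def accumulated_distance(s1, s2, path, d):
    """Sum d(s1[i], s2[j]) over every node (i, j) of the alignment path."""
    return sum(d(s1[i], s2[j]) for i, j in path)


S1 = "ABCD"
S2 = "ACXCCD"

# Hypothetical path: A-A, B-C, C-X, C-C, C-C, D-D (0-based indices).
EXAMPLE_PATH = [(0, 0), (1, 1), (2, 2), (2, 3), (2, 4), (3, 5)]

print(accumulated_distance(S1, S2, EXAMPLE_PATH, zero_one_distance))  # 0+1+1+0+0+0 = 2
```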

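Slide 15: Accumulated distance along p
[Figure: alignment grid with ACXCCD along the horizontal axis and ABCD along the vertical axis, showing path p]

Slide 16: Accumulated distance (continued)
[Figure: the same alignment grid and path, with the penalty labels K_DEL (once) and K_INS (three times) marked along the path]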

Slide 17: Optimal path
• The optimal path is the path with minimum accumulated distance
• Formally the optimal path is $\hat{p}$, where: $\hat{p} = \arg\min_{p} AD_p(S_1, S_2)$
• The accumulated distance AD(S1,S2) between S1 and S2 is given by: $AD(S_1, S_2) = AD_{\hat{p}}(S_1, S_2)$

Slide 18: Calculating the optimal path
• Given
  – the distance matrix D,
  – the insertion penalty K_INS, and
  – the deletion penalty K_DEL *
• How can we compute the optimal path between two (potentially long) sequences S1 and S2?
• In a real application the lengths of S1 and S2 might be large
* If K_DEL and K_INS are not defined you should assume that they are zero

Slide 19: Dynamic Programming (DP)
• The optimal path is calculated using Dynamic Programming (DP)
• DP is based on the principle of optimality
• Suppose all paths from X to Y must pass through A or B immediately before reaching Y. The optimal path from X to Y is the best of:
  (i) the best path from X to A plus the cost of going from A to Y
  (ii) the best path from X to B plus the cost of going from B to Y
[Figure: paths from X to Y passing through A or B]
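
The principle of optimality can be illustrated with a tiny numerical sketch. The function name and all four costs below are made-up values for illustration only; the point is that the best route to Y is found from the two best partial routes, without re-examining every full path.

```python
def best_path_to_y(best_x_to_a, cost_a_to_y, best_x_to_b, cost_b_to_y):
    """Return (cost, predecessor) for the cheaper of the two ways to reach Y."""
    via_a = (best_x_to_a + cost_a_to_y, "A")
    via_b = (best_x_to_b + cost_b_to_y, "B")
    return min(via_a, via_b, key=lambda option: option[0])


# Hypothetical costs: best X->A is 3, A->Y is 4, best X->B is 5, B->Y is 1.
cost, via = best_path_to_y(3, 4, 5, 1)
print(cost, via)  # 6 B, because 5 + 1 < 3 + 4
```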

Slide 20: DP – step 1
• Step 1: draw the trellis of all possible paths
[Figure: trellis with ACXCCD along the horizontal axis and ABCD along the vertical axis]

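Slide 21: DP – forward pass – initialisation
[Figure: trellis with ACXCCD along the horizontal axis and ABCD along the vertical axis, shown alongside the (forward) path matrix and the (forward) accumulated distance matrix]
• Annotations on the accumulated distance matrix: d(A,A); ad(2,1) = ad(1,1) + d(2,1)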

Slide 22: ad(i,j)
• ad(i,j) is the sum of distances along the best (partial) path from (1,1) to (i,j)
• It is calculated using the principle of optimality
• In addition, the forward path matrix records the local optimal path at each point, that is, which of the possible predecessors (i-1,j), (i,j-1) and (i-1,j-1) the best partial path to (i,j) came from
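
One way to write this calculation out as a recurrence is sketched below. It rests on assumptions that the slide text does not state explicitly: i indexes S1 (the vertical string) and j indexes S2 (the horizontal string), d(i,j) is shorthand for the distance between the i-th symbol of S1 and the j-th symbol of S2, a vertical step is charged K_DEL, and a horizontal step is charged K_INS. Predecessors that fall outside the trellis are simply left out of the minimum.

```latex
\begin{align*}
ad(1,1) &= d(1,1) \\
ad(i,j) &= d(i,j) + \min
  \begin{cases}
    ad(i-1,\,j-1) \\
    ad(i-1,\,j) + K_{DEL} \\
    ad(i,\,j-1) + K_{INS}
  \end{cases}
\end{align*}
```

With both penalties set to zero, the first column reduces to ad(2,1) = ad(1,1) + d(2,1), which matches the initialisation step of the forward pass.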

Slide 23: DP – forward pass – continued
[Figure: trellis with ACXCCD along the horizontal axis and ABCD along the vertical axis, shown alongside the (forward) path matrix and the (forward) accumulated distance matrix]
• ad(1,2) = ad(1,1) + d(A,C)

Slide 24: DP – forward pass – continued
[Figure: trellis, (forward) path matrix and (forward) accumulated distance matrix, filled in further]

Slide 25: DP – forward pass – continued
[Figure: trellis, (forward) path matrix and (forward) accumulated distance matrix, filled in further]

Slide 26: DP – forward pass – continued
[Figure: trellis, completed (forward) path matrix and completed (forward) accumulated distance matrix]
• The optimal path is obtained by back-tracking through the forward path matrix, starting at the bottom right-hand corner
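
The forward pass and the back-tracking step can be put together in a short Python sketch. This is one possible reading of the procedure, not the lecturer's own code: the function name, the 0-based indexing, the tie-breaking order (diagonal first), and the choice to charge K_DEL on vertical steps and K_INS on horizontal steps are all assumptions. Both penalties default to zero, as the lecture suggests when they are undefined.

```python
def zero_one_distance(a, b):
    """0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


def dp_align(s1, s2, d, k_ins=0, k_del=0):
    """Return (accumulated distance, optimal path) for two non-empty sequences.

    ad[i][j]   : cost of the best partial path from (0, 0) to (i, j)
    back[i][j] : predecessor of (i, j) on that path (the forward path matrix)
    """
    n, m = len(s1), len(s2)
    ad = [[0] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]

    # Forward pass: fill both matrices row by row.
    for i in range(n):
        for j in range(m):
            local = d(s1[i], s2[j])
            if i == 0 and j == 0:
                ad[i][j] = local
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((ad[i - 1][j - 1], (i - 1, j - 1)))  # diagonal
            if i > 0:
                candidates.append((ad[i - 1][j] + k_del, (i - 1, j)))  # vertical
            if j > 0:
                candidates.append((ad[i][j - 1] + k_ins, (i, j - 1)))  # horizontal
            best, pred = min(candidates, key=lambda c: c[0])
            ad[i][j] = local + best
            back[i][j] = pred

    # Back-tracking: follow predecessors from the bottom right-hand corner.
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        i, j = path[-1]
        path.append(back[i][j])
    path.reverse()
    return ad[n - 1][m - 1], path


distance, path = dp_align("ABCD", "ACXCCD", zero_one_distance)
print(distance)
print(path)
```

Under these assumptions the accumulated distance between ABCD and ACXCCD is 2 with the zero-one symbol distance. The cost is proportional to the product of the two sequence lengths, which is what makes the method practical for long sequences.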

Slide 27: Summary
• Introduction to sequence analysis
• Definitions of:
  – distance,
  – forward accumulated distance,
  – forward path matrix,
  – optimal path
• Dynamic programming
• How to compute the accumulated distance using DP
• How to recover the optimal path