
1 String Processing

2 Basic String Techniques
Storing strings
Reading text input by line
Concatenating strings
Checking for matching string at beginning
Finding a substring within a larger string
Counting occurrences in a string (e.g. how many vowels)
Tokenizing: splitting a string into substrings by delimiters
Sorting an array of strings
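A minimal Python sketch of the techniques listed above. The helper names `read_lines` and `count_vowels` and the sample text are illustrative assumptions, not part of the slides.

```python
import io


def read_lines(stream):
    """Yield each line of a text stream without its trailing newline."""
    for raw in stream:
        yield raw.rstrip("\n")


def count_vowels(text: str) -> int:
    """Count occurrences of the vowels a, e, i, o, u, ignoring case."""
    return sum(1 for ch in text.lower() if ch in "aeiou")


# An in-memory stream stands in for real input such as sys.stdin.
lines = list(read_lines(io.StringIO("the quick brown fox\nsecond line\n")))
line = lines[0]                      # storing a string

greeting = "hello, " + "world"       # concatenating strings
starts = line.startswith("the")      # matching string at the beginning
pos = line.find("brown")             # index of a substring, or -1 if absent
vowels = count_vowels(line)          # counting occurrences
tokens = line.split(" ")             # tokenizing by a delimiter
ordered = sorted(tokens)             # sorting an array of strings
```

For real input, `read_lines(sys.stdin)` would work the same way after `import sys`.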

3 String Matching
Find occurrences of T (length m) inside S (length n)
Basic matching can use library functions
  Requires reasonably small strings
Longer matching: naïve approach
  Loop over S (1 to n)
  Check whether T occurs starting at that point (1 to m)
  So, O(nm) total
Better: Knuth-Morris-Pratt (KMP) Algorithm
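The naïve approach can be sketched as follows. The function name `naive_find_all` is an assumption for illustration, and positions are 0-based here rather than the 1-based ranges on the slide.

```python
def naive_find_all(s: str, t: str) -> list[int]:
    """Return every 0-based start index where t occurs in s, in O(n*m) time."""
    n, m = len(s), len(t)
    hits = []
    for start in range(n - m + 1):       # loop over candidate positions in S
        j = 0
        while j < m and s[start + j] == t[j]:
            j += 1                       # extend the match one character at a time
        if j == m:                       # all m characters of T matched
            hits.append(start)
    return hits


print(naive_find_all("abrabracabracadabracadabra", "abracadabra"))  # [8, 15]
```

For short strings, a library call such as `s.find(t)` does the same job for the first occurrence.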

4 Knuth-Morris-Pratt (KMP) Algorithm
Idea: preprocess T (the string to find) and use the matches within T itself to know where to start the next match
Preprocess: for each character i in T
  If the string matched up to character i, but not character i+1,
  then how many of the characters just matched also match the start of T?
  This tells you where to start matching again
Match like naïve. But when you stop getting a match:
  Go back a given number of spaces (based on the preprocess)
  Start matching there
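One way to sketch the preprocessing step, assuming 0-based slots and a table that stores, for each slot j, how many characters before slot j also match the start of T. The name `build_back_table` and the extra final entry (used after a full match) are assumptions of this sketch.

```python
def build_back_table(t: str) -> list[int]:
    """back[j] = length of the longest proper prefix of t[:j] that is also its suffix.

    The table has len(t) + 1 entries; the last one says where to resume
    after a full match.
    """
    m = len(t)
    back = [0] * (m + 1)
    k = 0                                # length of the prefix currently matched
    for j in range(1, m):
        while k > 0 and t[j] != t[k]:
            k = back[k]                  # fall back to a shorter matching prefix
        if t[j] == t[k]:
            k += 1
        back[j + 1] = k
    return back


print(build_back_table("abracadabra"))
# [0, 0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]
```

Under this convention the entries at slots 4 and 6 of abracadabra are both 1, so a mismatch there sends j back to 1.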

5 KMP Algorithm - running
Example: T is abracadabra
[Table: back table for T, one value per slot]
Could represent differently, where the number stored is the number of characters matching the prefix, but then need to offset everything else by 1
Example: S is abrabracabracadabracadabra

6 S: abrabracabracadabracadabra
T: abracadabra
Mismatch at slot 4 (i=4, j=4). Back table has value 1 there. So, next we’ll continue with i=4, but j will go back to 1.

7 S: abrabracabracadabracadabra
T: abracadabra
Mismatch at slot j=6 (and i=9). Back table has value 1 there. So, next we’ll continue with i=9, but j will go back to slot 1.

8 S: abrabracabracadabracadabra
T: abracadabra
Full match here. Mark as found (at slot 8). Next one starts 4 back (i=18, j=4).

9 S: abrabracabracadabracadabra
T: abracadabra
Full match here again. Mark as found (at slot 15). Next one would start 4 back (i=21).
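The walkthrough above can be reproduced with a short KMP search. This is a sketch under assumptions: 0-based slots, a back table with one extra entry for the position after a full match, and the hypothetical name `kmp_find_all`.

```python
def kmp_find_all(s: str, t: str) -> list[int]:
    """Return every 0-based start index of t in s using KMP, in O(n + m) time."""
    m = len(t)
    if m == 0:
        return []

    # Preprocess T: back[j] is how far j falls back after a mismatch at slot j.
    back = [0] * (m + 1)
    k = 0
    for j in range(1, m):
        while k > 0 and t[j] != t[k]:
            k = back[k]
        if t[j] == t[k]:
            k += 1
        back[j + 1] = k

    # Scan S once; i never moves backwards, only j does.
    hits = []
    j = 0
    for i, ch in enumerate(s):
        while j > 0 and ch != t[j]:
            j = back[j]                  # mismatch: reuse what already matched
        if ch == t[j]:
            j += 1
        if j == m:                       # full match ending at position i
            hits.append(i - m + 1)
            j = back[j]                  # keep the overlap for the next match
    return hits


print(kmp_find_all("abrabracabracadabracadabra", "abracadabra"))  # [8, 15]
```

After each full match j falls back to 4, because the last four matched characters (abra) are also the first four characters of T.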

10 Dynamic Programming on Strings
Edit Distance: given two strings, how many edits (insert a space, delete a character, or accept a mismatch) are needed between them?
Use DP: strings A[1..n], B[1..m]
For A[1..i], B[1..j], V(i,j) = edit distance for those prefixes. We want V(n,m)
V(0,0) = 0
V(i,0) = penalty to delete all i elements from A
V(0,j) = penalty to delete all j elements from B
V(i,j) = max of:
  V(i-1,j-1) + score(A[i],B[j])
  V(i-1,j) + score(A[i],-)
  V(i,j-1) + score(-,B[j])
where score(A[i],B[j]) = 2 if matching, -1 if nonmatching, and score(x,-) = score(-,x) = -1 (penalty to delete = penalty to add a space)
Matches earn points and edits cost points, so the best alignment is the one with the maximum score.
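A sketch of this recurrence in Python, using the scores given above as default parameters. The function name `alignment_score` and the 0-based string indexing are assumptions of the sketch; the table v keeps the slide's V(i,j) indexing.

```python
def alignment_score(a: str, b: str, match: int = 2,
                    mismatch: int = -1, gap: int = -1) -> int:
    """Best alignment score of a and b; v[i][j] covers a[:i] and b[:j]."""
    n, m = len(a), len(b)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v[i][0] = i * gap                # delete all i characters of a
    for j in range(1, m + 1):
        v[0][j] = j * gap                # delete all j characters of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            v[i][j] = max(
                v[i - 1][j - 1] + pair,  # align a[i] with b[j]
                v[i - 1][j] + gap,       # align a[i] with a space
                v[i][j - 1] + gap,       # align b[j] with a space
            )
    return v[n][m]


print(alignment_score("abc", "abd"))  # 3: two matches (+4) and one mismatch (-1)
```

The table takes O(nm) time and space; only the previous row is needed if just the final score matters.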

11 More DP on Strings
For Longest Common Subsequence: same as String Alignment, with
  Penalty for mismatch = infinity
  Penalty for add/delete = 0
  Points for match = 1
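Plugging these scores into the same alignment recurrence gives the LCS length. A minimal sketch, assuming the hypothetical name `lcs_length` and representing the infinite mismatch penalty with `float("-inf")`:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    neg_inf = float("-inf")              # mismatch penalty: never chosen
    n, m = len(a), len(b)
    # Add/delete costs 0, so the first row and column stay at 0.
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = 1 if a[i - 1] == b[j - 1] else neg_inf
            v[i][j] = max(
                v[i - 1][j - 1] + pair,  # match: one more common character
                v[i - 1][j],             # skip a[i] at no cost
                v[i][j - 1],             # skip b[j] at no cost
            )
    return v[n][m]


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4, for the subsequence GTAB
```

Because skipping a character is free, the infinite penalty simply rules out the diagonal step whenever the two characters differ.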

