String Processing.


1 String Processing

2 Basic String Techniques
- Storing strings
- Reading text input line by line
- Concatenating strings
- Checking for a matching string at the beginning
- Finding a substring within a larger string
- Counting occurrences in a string (e.g. how many vowels)
- Tokenizing: splitting a string into substrings by delimiters
- Sorting an array of strings
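A minimal sketch of these techniques using Python's built-in string methods. The sample text and the variable names are illustrative assumptions; an in-memory buffer stands in for a real input file.

```python
import io

# Assumed sample input: an in-memory stand-in for a text file or stdin.
text = io.StringIO("the quick brown fox\njumps over the lazy dog\n")

lines = [line.rstrip("\n") for line in text]   # reading text input line by line
joined = " ".join(lines)                        # concatenating strings
starts = joined.startswith("the quick")         # matching string at the beginning
pos = joined.find("fox")                        # finding a substring (-1 if absent)
vowels = sum(ch in "aeiou" for ch in joined)    # counting occurrences (vowels)
tokens = joined.split(" ")                      # tokenizing by a delimiter
ordered = sorted(tokens)                        # sorting an array of strings

print(starts, pos, vowels, ordered[0])          # True 16 11 brown
```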

3 String Matching
Find occurrences of T (length m) inside S (length n)
- Basic matching can use library functions
  - Requires reasonably small strings
- Longer matching: naïve approach
  - Loop over S (1 to n)
  - Check whether T occurs starting at that point (1 to m)
  - So, O(nm) total
- Better: Knuth-Morris-Pratt (KMP) Algorithm
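The naïve approach can be sketched as follows. The function name `naive_find_all` is an assumption made for illustration, and indices are 0-based rather than the 1-based ranges on the slide.

```python
def naive_find_all(s: str, t: str) -> list[int]:
    """Return every 0-based start index where t occurs in s, in O(n*m) time."""
    n, m = len(s), len(t)
    matches = []
    for start in range(n - m + 1):                 # loop over S
        j = 0
        while j < m and s[start + j] == t[j]:      # check T at that point
            j += 1
        if j == m:                                 # all m characters matched
            matches.append(start)
    return matches


print(naive_find_all("abrabracabracadabracadabra", "abracadabra"))  # [8, 15]
```

For reasonably small strings, the library call `s.find(t)` returns the first such index directly.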

4 Knuth-Morris-Pratt (KMP) Algorithm
- Idea: preprocess T (the string to find) and use the matches within T itself to know where to start the next match
- Preprocess: for each character i in T, suppose the match succeeded up to character i but failed at character i+1. Record how many of the characters just before character i+1 also match the beginning of T
  - This tells you where to start matching again
- Match as in the naïve approach. But when you stop getting a match:
  - Go back in T to the position given by the preprocess
  - Resume matching from there, without moving back in S
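A sketch of the preprocessing step, under the assumption that the table is stored as a list `back` where `back[j]` is the number of characters already known to match the beginning of T when the first mismatch happens at position j (0-based). The name `build_back_table` and the extra final entry, used after a full match, are choices made for this illustration.

```python
def build_back_table(t: str) -> list[int]:
    """back[j] = length of the longest proper prefix of t[:j] that is also a suffix of it.

    The list has len(t) + 1 entries; back[len(t)] applies after a full match.
    """
    m = len(t)
    back = [0] * (m + 1)
    k = 0                                  # length of the prefix currently matched
    for j in range(1, m):
        while k > 0 and t[j] != t[k]:
            k = back[k]                    # fall back to a shorter matching prefix
        if t[j] == t[k]:
            k += 1
        back[j + 1] = k
    return back


print(build_back_table("abracadabra"))  # [0, 0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]
```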

5 KMP Algorithm - running
Example: T is abracadabra

slot:  0  1  2  3  4  5  6  7  8  9 10
T:     a  b  r  a  c  a  d  a  b  r  a
back:  0  0  0  0  1  0  1  0  1  2  3

The table says: if there is a match up to, but NOT including, a given character, how many of the characters just matched also match the beginning of T. E.g. if the “c” is the first character not to match, then the text had “abra” just before it. That means we can restart there, assuming 1 character (the “a” that comes right before the “c”) already matches. After a full match of T the corresponding value is 4 (the trailing “abra”).

The table could be represented differently, with the stored number being the count matching the prefix at the last matched character, but then everything else needs to be offset by 1.

Example we will use: S is abrabracabracadabracadabra

6 S: abrabracabracadabracadabra
T: abracadabra (aligned at slot 0 of S)
Mismatch at slot 4 (i=4, j=4). The back table has value 1 there. So next we continue with i=4, but j goes back to 1.

7 S: abrabracabracadabracadabra
T: abracadabra (aligned at slot 3 of S)
Mismatch at slot j=6 (and i=9). The back table has value 1 there. So next we continue with i=9, but j goes back to slot 1.

8 S: abrabracabracadabracadabra
T: abracadabra (aligned at slot 8 of S)
Full match here. Mark as found (at slot 8). The next candidate starts 4 back, at slot 15, so matching resumes with i=19, j=4.

9 S: abrabracabracadabracadabra
T: abracadabra (aligned at slot 15 of S)
Full match here again. Mark as found (at slot 15). The next candidate would start 4 back, at slot 22, but S ends here (i=26).
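The whole run can be reproduced with a short sketch. The function name `kmp_find_all` and the 0-based table layout (one entry per position of T plus a final entry used after a full match) are assumptions made for this illustration, not necessarily the layout used on the slides.

```python
def kmp_find_all(s: str, t: str) -> list[int]:
    """Return every 0-based start index of t in s in O(n + m) time."""
    m = len(t)
    if m == 0:
        return []

    # Preprocess T: back[j] = characters of T's prefix already matched
    # when the first mismatch occurs at position j.
    back = [0] * (m + 1)
    k = 0
    for j in range(1, m):
        while k > 0 and t[j] != t[k]:
            k = back[k]
        if t[j] == t[k]:
            k += 1
        back[j + 1] = k

    # Match like the naive approach, but never move i backwards.
    matches = []
    j = 0
    for i, ch in enumerate(s):
        while j > 0 and ch != t[j]:
            j = back[j]                    # mismatch: keep i, move j back
        if ch == t[j]:
            j += 1
        if j == m:                         # full match ending at position i
            matches.append(i - m + 1)
            j = back[m]                    # reuse the overlap for the next match
    return matches


print(kmp_find_all("abrabracabracadabracadabra", "abracadabra"))  # [8, 15]
```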

10 Dynamic Programming on Strings
Edit Distance: given two strings, how many edits (insert a space, delete a character, or accept a mismatch) are needed between them?
Use DP:
- Strings A[1..n], B[1..m]
- For the prefixes A[1..i], B[1..j], V(i,j) = edit distance between them. We want V(n,m)
- V(0,0) = 0
- V(i,0) = penalty to delete all i elements from A
- V(0,j) = penalty to delete all j elements from B
- V(i,j) = max of:
  - V(i-1,j-1) + score(A[i],B[j])
  - V(i-1,j) + score(A[i],-)
  - V(i,j-1) + score(-,B[j])
- Where score(A[i],B[j]) = 2 if matching, -1 if nonmatching, and score(x,-) = score(-,x) = -1 (penalty to delete = penalty to add a space)
- Matches score positive and edits score negative, so a higher V is better; that is why the recurrence takes the max
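A minimal sketch of this recurrence with the scores given above (match = 2, mismatch = -1, gap = -1). The function name `alignment_score` and its keyword parameters are assumptions for illustration; Python strings are 0-based, so A[i] on the slide is `a[i - 1]` in the code.

```python
def alignment_score(a: str, b: str, match: int = 2,
                    mismatch: int = -1, gap: int = -1) -> int:
    """Best alignment score V(n, m) of a and b under the given scores."""
    n, m = len(a), len(b)
    # v[i][j] = best score for aligning a[:i] with b[:j]
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v[i][0] = v[i - 1][0] + gap        # delete all i elements of a
    for j in range(1, m + 1):
        v[0][j] = v[0][j - 1] + gap        # delete all j elements of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            v[i][j] = max(v[i - 1][j - 1] + pair,   # align a[i] with b[j]
                          v[i - 1][j] + gap,        # a[i] against a space
                          v[i][j - 1] + gap)        # b[j] against a space
    return v[n][m]


print(alignment_score("abc", "abc"))  # 6
print(alignment_score("abc", "abd"))  # 3
print(alignment_score("abc", ""))     # -3
```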

11 More DP on Strings
For Longest Common Subsequence:
- Same as String Alignment
- Penalty for mismatch = infinity
- Penalty for add/delete = 0
- Points for match = 1
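With those scores plugged into the alignment recurrence, the DP value is the length of the longest common subsequence. The sketch below assumes the function name `lcs_length` and models the infinite mismatch penalty with `float("-inf")`, so a mismatched pair is never chosen.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # Add/delete costs 0, so the first row and column stay at 0.
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = 1 if a[i - 1] == b[j - 1] else float("-inf")
            v[i][j] = max(v[i - 1][j - 1] + pair,   # match scores 1, mismatch is ruled out
                          v[i - 1][j],              # skip a character of a (cost 0)
                          v[i][j - 1])              # skip a character of b (cost 0)
    return v[n][m]


print(lcs_length("ABCBDAB", "BDCABA"))  # 4
```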

