Information Retrieval, CSE 8337, Spring 2005: Simple Text Processing. Material for these slides obtained from: Data Mining: Introductory and Advanced Topics by Margaret H. Dunham.
Text Processing TOC: Simple Text Storage; String Matching; String-to-String Correction (Approximate Matching).
Text Storage: EBCDIC/ASCII; array of characters; linked list of characters; trees (B-tree, trie). Stuart E. Madnick, “String Processing Techniques,” Communications of the ACM, Vol. 10, No. 7, July 1967, pp
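Of the storage structures listed on this slide, the trie is the one whose layout is least obvious from its name, so a minimal sketch may help. The class and method names (`TrieNode`, `Trie`, `insert`, `contains`, `starts_with`) are illustrative assumptions, not taken from the slides or from Madnick's paper.

```python
class TrieNode:
    """One node of a trie; each outgoing edge is labeled with a character."""

    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a stored string ends at this node


class Trie:
    """Stores a set of strings so that shared prefixes share nodes."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, s):
        """Follow s from the root; return the final node, or None if the path breaks."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None
```

Lookup cost depends on the length of the query string rather than on the number of stored strings, which is the usual reason to prefer a trie over a flat array or linked list of characters for dictionary-style lookups.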
Pattern Matching (Recognition). Pattern matching finds occurrences of a predefined pattern in the data. Applications include speech recognition, information retrieval, and time series analysis.
Similarity Measures. Determine the similarity between two objects. Similarity characteristics: Alternatively, distance measures quantify how unlike or dissimilar objects are.
String Matching Problem. Input: a pattern of length m and a text string of length n. Find one (next, all) occurrences of the pattern in the text string.
String Matching Algorithms: Brute Force; Knuth-Morris-Pratt; Boyer-Moore. P. 209 in text.
Brute Force String Matching (Handbook of Algorithms and Data Structures). Space: O(m+n). Time: O(mn).
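The O(mn) time bound comes from trying every alignment of the pattern against the text and comparing up to m characters at each one. A minimal sketch, assuming Python strings for both inputs; the function name `brute_force_search` is illustrative and this is not the Handbook's own listing.

```python
def brute_force_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    # Try every alignment of the pattern against the text.
    for i in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the pattern is exhausted.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i)
    return matches
```

For example, `brute_force_search("abracadabra", "abra")` returns `[0, 7]`.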
FSR
Creating FSR. Create the FSM: (1) Construct the “correct” spine. (2) Add a default “failure bus” to state 0. (3) Add a default “initial bus” to state 1. (4) For each state, decide its attachments to the failure bus, the initial bus, or other failure links.
Knuth-Morris-Pratt (Handbook of Algorithms and Data Structures). Apply the FSM to the string by processing characters one at a time. The accepting state is reached when the pattern is found. Space: O(m+n). Time: O(m+n).
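One common way to realize this machine in code is a failure table: the state is the number of pattern characters matched so far, and on a mismatch the table says which shorter state to fall back to. The sketch below follows that standard formulation; the names `failure_table` and `kmp_search` are illustrative, and the table stands in for the failure links rather than reproducing the slide's "bus" diagram.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the state to fall back to on a mismatch)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    state = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # Follow failure links until ch can extend the match or state is 0.
        while state > 0 and ch != pattern[state]:
            state = fail[state - 1]
        if ch == pattern[state]:
            state += 1
        if state == len(pattern):  # accepting state reached
            matches.append(i - len(pattern) + 1)
            state = fail[state - 1]  # keep going to find overlapping matches
    return matches
```

The text is read once and never re-scanned, which is where the O(m+n) time bound comes from: O(m) to build the table and O(n) to run it.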
Boyer-Moore (Handbook of Algorithms and Data Structures). Scan the pattern from right to left. Skip many positions when the text contains an illegal character (one that does not occur in the pattern). Worst-case time: O(mn). Expected time is better than KMP.
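The right-to-left scan and the long skips can be illustrated with the Horspool simplification of Boyer-Moore, which keeps only a bad-character shift table and omits the good-suffix rule of the full algorithm. This is a sketch of that simplified variant, not the Handbook's listing; the name `horspool_search` is illustrative.

```python
def horspool_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text,
    using the Horspool simplification of Boyer-Moore."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # For each character in pattern[:-1], the distance from its rightmost
    # occurrence to the end of the pattern. Characters absent from the
    # table allow a full shift of m positions.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        # Compare the pattern against the current window from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)
        # Shift according to the text character under the pattern's last position.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

When the character under the last pattern position does not occur in the pattern at all, the window jumps by the full pattern length, which is why the expected number of comparisons is often well below n even though the worst case remains O(mn).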
String-to-String Correction. A measure of similarity between strings: the cost to convert one string into the other. Can also be used to determine how to convert one string into another. Transformations: Match: the current characters in both strings are the same. Delete: delete the current character in the input string. Insert: insert the current character of the target string into the input string.
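The three transformations map directly onto a dynamic-programming table. The sketch below assumes a cost of 0 for a match and 1 for each insert or delete; the slides do not state the costs, so these values and the name `correction_cost` are assumptions. Substitution is not included because the slide lists only match, delete, and insert.

```python
def correction_cost(source, target, insert_cost=1, delete_cost=1):
    """Minimum cost to convert source into target using match (cost 0),
    delete, and insert. The default costs are assumed, not from the slides."""
    m, n = len(source), len(target)
    # cost[i][j] = minimum cost to convert source[:i] into target[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = cost[i - 1][0] + delete_cost  # delete everything
    for j in range(1, n + 1):
        cost[0][j] = cost[0][j - 1] + insert_cost  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(cost[i - 1][j] + delete_cost,  # delete source[i-1]
                       cost[i][j - 1] + insert_cost)  # insert target[j-1]
            if source[i - 1] == target[j - 1]:
                best = min(best, cost[i - 1][j - 1])  # match
            cost[i][j] = best
    return cost[m][n]
```

With these costs, `correction_cost("cat", "cut")` is 2 (delete `a`, insert `u`), since a single-step substitution is not among the allowed transformations.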
Distance Between Strings
Approximate String Matching. Find patterns “close to” the string (fuzzy matching). Applications: spelling checkers, IR. Define similarity (distance) between string and pattern.
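As a sketch of the spelling-checker application, the code below ranks dictionary words by their distance from a possibly misspelled word and keeps those within a threshold. It assumes an insert/delete distance with unit costs; the function names `indel_distance` and `suggest`, the threshold `max_distance`, and the sample word list are all hypothetical.

```python
def indel_distance(a, b):
    """Number of single-character inserts and deletes needed to turn a into b
    (matches are free). Unit costs are an assumption."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # converting a[:i] into the empty string takes i deletes
        for j, cb in enumerate(b, start=1):
            best = min(prev[j] + 1, curr[j - 1] + 1)  # delete ca / insert cb
            if ca == cb:
                best = min(best, prev[j - 1])  # match
            curr.append(best)
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return the dictionary words within max_distance of word, closest first."""
    scored = sorted((indel_distance(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]
```

For example, `suggest("retreival", ["retrieval", "string"])` returns `["retrieval"]`: the transposed letters cost one delete plus one insert, while `"string"` is too far away to be suggested.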