Information Retrieval CSE 8337 Spring 2005 Simple Text Processing Material for these slides obtained from: Data Mining Introductory and Advanced Topics.


1 Information Retrieval, CSE 8337, Spring 2005
  Simple Text Processing
  Material for these slides obtained from: Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham
  http://www.engr.smu.edu/~mhd/book

2 Text Processing TOC
  Simple Text Storage
  String Matching
  String-to-String Correction (Approximate Matching)

3 Text Storage
  EBCDIC/ASCII
  Array of characters
  Linked list of characters
  Trees: B-tree, trie
  Reference: Stuart E. Madnick, “String Processing Techniques,” Communications of the ACM, Vol. 10, No. 7, July 1967, pp. 420-424.

4 Pattern Matching (Recognition)
  Pattern matching finds occurrences of a predefined pattern in the data.
  Applications include speech recognition, information retrieval, and time series analysis.

5 Similarity Measures
  Similarity measures determine how alike two objects are.
  Similarity characteristics: (not preserved in this transcript)
  Alternatively, distance measures quantify how unlike or dissimilar objects are.

6 String Matching Problem
  Input:
    Pattern of length m
    Text string of length n
  Find one (next, all) occurrences of the pattern in the string.
  Ex:
    String: 00110011011110010100100111
    Pattern: 011010

7 String Matching Algorithms
  Brute Force
  Knuth-Morris-Pratt
  Boyer-Moore
  See p. 209 in the text.

8 Brute Force String Matching
  Handbook of Algorithms and Data Structures: http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html
  Space: O(m+n)
  Time: O(mn)
  Ex:
    String: 00110011011110010100100111
    Pattern: 011010
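The brute-force approach can be sketched as follows. This is a minimal Python illustration, not the code from the Handbook link; the function name brute_force_search is hypothetical. It tries every alignment of the pattern against the text, which is where the O(mn) time bound comes from.

```python
from typing import List


def brute_force_search(text: str, pattern: str) -> List[int]:
    """Return every start index at which pattern occurs in text (hypothetical helper)."""
    n, m = len(text), len(pattern)
    matches = []
    # Try each of the n - m + 1 possible alignments of the pattern.
    for start in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the pattern is exhausted.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(start)
    return matches


text = "00110011011110010100100111"  # string from the slide
pattern = "011010"                   # pattern from the slide
print(brute_force_search(text, pattern))
print(brute_force_search(text, "0110"))  # a shorter pattern, for comparison
```

Each alignment costs up to m comparisons and there are up to n alignments, so the inner loop is what the faster algorithms on the following slides try to avoid repeating.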

9 FSR

10 Creating FSR
  Create FSM:
    Construct the “correct” spine.
    Add a default “failure bus” to state 0.
    Add a default “initial bus” to state 1.
    For each state, decide its attachments to failure bus, initial bus, or other failure links.
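One way to see the spine and the failure links concretely is to build the full transition table for a pattern. The sketch below is an assumption about how the construction could be coded, not the procedure from the slides: build_matching_fsm is a hypothetical name, and the table is computed by a simple (not the most efficient) prefix/suffix check. State k means "the last k characters read match the first k characters of the pattern".

```python
def build_matching_fsm(pattern, alphabet):
    """Return delta, where delta[state][ch] is the next state (hypothetical helper)."""
    m = len(pattern)
    delta = [{} for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            if state < m and ch == pattern[state]:
                # Spine: the expected character advances one state.
                delta[state][ch] = state + 1
            else:
                # Failure link: fall back to the longest pattern prefix
                # that is a suffix of what has been read so far.
                seen = pattern[:state] + ch
                k = min(m, state + 1)
                while k > 0 and not seen.endswith(pattern[:k]):
                    k -= 1
                delta[state][ch] = k
    return delta


delta = build_matching_fsm("011010", "01")  # pattern from the slides
for state, row in enumerate(delta):
    print(state, row)
```

For the pattern 011010, most mismatches land in state 0 or state 1, which roughly corresponds to the "failure bus" and "initial bus" above; the mismatch in state 5 falls back to state 3 instead, an example of an "other failure link".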

11 Knuth-Morris-Pratt
  Apply the FSM to the string by processing characters one at a time.
  The accepting state is reached when the pattern is found.
  Space: O(m+n)
  Time: O(m+n)
  Handbook of Algorithms and Data Structures: http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
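A compact way to run this machine without storing every transition is the usual failure-table form of Knuth-Morris-Pratt. The sketch below is illustrative and is not the Handbook code; the names failure_table and kmp_search are hypothetical. The text is scanned once and the state never moves backward in the text, which gives the O(m+n) time bound.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # follow failure links
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return every start index at which pattern occurs in text."""
    if not pattern:
        return list(range(len(text) + 1))
    fail = failure_table(pattern)
    matches = []
    state = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while state > 0 and ch != pattern[state]:
            state = fail[state - 1]
        if ch == pattern[state]:
            state += 1
        if state == len(pattern):  # accepting state reached
            matches.append(i - len(pattern) + 1)
            state = fail[state - 1]  # keep going to find later matches
    return matches


print(failure_table("011010"))
print(kmp_search("00110011011110010100100111", "0110"))
```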

12 Boyer-Moore
  Scan the pattern from right to left.
  Skip many positions when the text contains a character that cannot match at the current alignment.
  Worst-case time: O(mn)
  Expected time better than KMP
  Handbook of Algorithms and Data Structures: http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html
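The right-to-left scan and the skipping idea can be illustrated with the Horspool simplification of Boyer-Moore, which keeps only a bad-character shift table. This is a swapped-in simplified variant, not the full Boyer-Moore algorithm and not the Handbook code; horspool_search is a hypothetical name.

```python
def horspool_search(text, pattern):
    """Return every start index of pattern in text (Boyer-Moore-Horspool variant)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return list(range(n + 1))
    # For each character in pattern[:-1], the distance from its last
    # occurrence to the end of the pattern. Other characters shift by m.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches = []
    start = 0
    while start <= n - m:
        j = m - 1
        # Compare from the right end of the pattern toward the left.
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(start)
        # Shift according to the text character under the pattern's last position.
        start += shift.get(text[start + m - 1], m)
    return matches


print(horspool_search("information retrieval", "retrieval"))
```

On a large alphabet many text characters do not occur in the pattern at all, so the window often jumps a full m positions; on the binary strings used in the earlier slides the shifts are much smaller.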

13 String-to-String Correction
  Measure of similarity between strings
  Can be used to determine how to convert one string into another
  Cost to convert one to the other
  Transformations:
    Match: the current characters in both strings are the same
    Delete: delete the current character in the input string
    Insert: insert the current character of the target string into the input string
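The conversion cost can be computed with dynamic programming over the three transformations listed above. The sketch below assumes costs the slides do not state: a match costs 0, and each delete or insert costs 1. The function name correction_cost is hypothetical.

```python
def correction_cost(source, target):
    """Minimum number of deletes and inserts needed to turn source into target.

    Assumed costs (not given on the slides): match = 0, delete = 1, insert = 1.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # cost[i][j] = cost of converting source[:i] into target[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i  # delete every remaining source character
    for j in range(1, cols):
        cost[0][j] = j  # insert every remaining target character
    for i in range(1, rows):
        for j in range(1, cols):
            best = min(cost[i - 1][j] + 1,   # delete source[i-1]
                       cost[i][j - 1] + 1)   # insert target[j-1]
            if source[i - 1] == target[j - 1]:
                best = min(best, cost[i - 1][j - 1])  # match
            cost[i][j] = best
    return cost[rows - 1][cols - 1]


print(correction_cost("abc", "abc"))
print(correction_cost("abc", "axc"))
```

With only these three operations, changing one character into another takes a delete plus an insert, so it costs 2 under the assumed weights.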

14 Distance Between Strings

15 Approximate String Matching
  Find patterns “close to” the string
  Fuzzy matching
  Applications:
    Spelling checkers
    IR
  Define similarity (distance) between string and pattern
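As a rough illustration of fuzzy matching, the sketch below reports every position in a text where some substring ending there is within a chosen distance of the pattern. It assumes unit-cost insert, delete, and substitute operations (substitution is an addition not listed on the slides), and the name approximate_matches and the parameter max_distance are hypothetical. It uses a standard dynamic-programming formulation in which a match may start anywhere in the text.

```python
def approximate_matches(text, pattern, max_distance):
    """Return (end_index, distance) pairs where a substring of text ending
    at end_index is within max_distance edits of pattern."""
    m = len(pattern)
    # prev[i] = best distance between pattern[:i] and a substring ending
    # just before the current text position (initially: empty text).
    prev = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        curr = [0] * (m + 1)  # curr[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            substitute = prev[i - 1] + (pattern[i - 1] != ch)
            skip_text_char = prev[i] + 1          # extra character in the text
            skip_pattern_char = curr[i - 1] + 1   # pattern character missing
            curr[i] = min(substitute, skip_text_char, skip_pattern_char)
        if curr[m] <= max_distance:
            ends.append((pos, curr[m]))
        prev = curr
    return ends


# Spelling-checker style use: find the misspelling of "spelling".
print(approximate_matches("the speling checker", "spelling", 1))
```

Setting max_distance to 0 reduces this to exact matching, which makes the link to the earlier string matching slides explicit.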

