1 CS 430: Information Discovery
Lecture 6 Text Processing Methods 1

2 Course administration

3 String Searching: Naive Algorithm
Objective: Given a pattern, find any substring of a given text that matches the pattern.

pat   pattern to be matched
m     length of pattern pat (characters)
tx    the text to be searched
n     length of tx (characters)

The naive algorithm examines the characters of tx in sequence:

for j from 1 to n-m+1
    if character j of tx matches the first character of pat
        (compare following characters of tx and pat until a complete match or a difference is found)
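A minimal sketch of the naive algorithm in Python (function and variable names are my own; the slide's 1-based j becomes a 0-based index here):

    def naive_search(pat, tx):
        """Return the 0-based position of the first match of pat in tx, or -1."""
        m, n = len(pat), len(tx)
        for j in range(n - m + 1):               # candidate starting positions
            k = 0
            while k < m and tx[j + k] == pat[k]:
                k += 1                           # extend the partial match
            if k == m:                           # complete match found
                return j
        return -1

    print(naive_search("uniform", "the uniform commercial code ..."))   # 4

In the worst case every starting position is compared against most of the pattern, giving O(mn) character comparisons.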

4 String Searching: Knuth-Morris-Pratt Algorithm
Concept: The naive algorithm is modified so that, whenever a partial match is found, it may be possible to advance the character index j by more than 1.

Example:
pat = "university"
tx  = "the uniform commercial code ..."

After the partial match "uni" fails, the comparison can continue from where it left off in the text instead of backing up to the next starting position.

To indicate how far to advance the character pointer, pat is preprocessed to create a table, which lists how far to advance against a given length of partial match. In the example, j is advanced by the length of the partial match, 3.
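A sketch of the Knuth-Morris-Pratt algorithm in Python (a standard formulation, not code from the lecture). The preprocessing table records, for each prefix of pat, how much of a partial match can be kept after a mismatch, which is what lets the index advance by more than 1:

    def build_failure_table(pat):
        """fail[i] = length of the longest proper prefix of pat[:i+1] that is also its suffix."""
        fail = [0] * len(pat)
        k = 0
        for i in range(1, len(pat)):
            while k > 0 and pat[i] != pat[k]:
                k = fail[k - 1]
            if pat[i] == pat[k]:
                k += 1
            fail[i] = k
        return fail

    def kmp_search(pat, tx):
        """Return the 0-based position of the first match of pat in tx, or -1."""
        fail = build_failure_table(pat)
        k = 0                                    # characters of pat matched so far
        for j, c in enumerate(tx):
            while k > 0 and c != pat[k]:
                k = fail[k - 1]                  # reuse the partial match; the text pointer never moves backwards
            if c == pat[k]:
                k += 1
            if k == len(pat):
                return j - k + 1
        return -1

In the slide's example the partial match "uni" has no prefix that is also a suffix, so after the mismatch the comparison simply continues with the next text character; j has effectively advanced by 3.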

5 Signature Files: Sequential Search without Inverted File
Inexact filter: a quick test which discards many of the non-qualifying items.

Advantages
• Much faster than full text scanning -- 1 or 2 orders of magnitude
• Modest space overhead -- 10% to 15% of the file
• Insertion is straightforward

Disadvantages
• Sequential searching is no good for very large files
• Some hits are false hits

6 Signature Files
Signature size. Number of bits in a signature, F.
Word signature. A bit pattern of size F with m bits set to 1 and the others 0. The word signature is calculated by a hash function.
Block. A sequence of text that contains D distinct words.
Block signature. The logical OR of all the word signatures in a block of text.

7 Signature Files: Example
Word               Signature
free               001 000 110 010
text
block signature

F = 12   bits in a signature
m = 4    bits set per word
D = 2    words per block

8 Signature Files
A query term is processed by matching its signature against the block signature.
(a) If the term is in the block, its word signature will always match the block signature.
(b) A word signature may match the block signature even though the word is not in the block. This is a false hit.
The design challenge is to minimize the false drop probability, Fd. Frakes, Section 4.2, page 47, discusses how to minimize Fd. The rest of that chapter discusses enhancements to the basic algorithm.
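A toy illustration of these signatures in Python, using the slide's parameters (F = 12, m = 4, D = 2). The hash function and the resulting bit patterns are my own, not the values shown on slide 7:

    import hashlib

    F, M = 12, 4                      # bits per signature, bits set per word

    def word_signature(word):
        """Set M bit positions chosen by repeatedly hashing the word (illustrative hash only)."""
        sig, i = 0, 0
        while bin(sig).count("1") < M:
            h = hashlib.md5(f"{word}:{i}".encode()).digest()
            sig |= 1 << (h[0] % F)
            i += 1
        return sig

    def block_signature(words):
        """Logical OR of the word signatures in the block."""
        sig = 0
        for w in words:
            sig |= word_signature(w)
        return sig

    def may_contain(block_sig, word):
        """True if every bit of the word's signature is set in the block signature."""
        ws = word_signature(word)
        return (block_sig & ws) == ws            # can still be a false hit

    bs = block_signature(["free", "text"])
    print(may_contain(bs, "free"))               # True: a word in the block always matches
    print(may_contain(bs, "search"))             # usually False; occasionally a false hit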

9 Search for Substring
In some information retrieval applications, any substring can be a search term. Tries, using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.

10 Tries: Search for Substring
Basic concept: The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique. The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once. Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node. Suffix trees have a size of the same order of magnitude as the input documents.
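A full suffix-tree construction is too long for a slide, but the same substring lookups can be sketched with a sorted list of sistrings (essentially a suffix array); the names below are my own:

    import bisect

    def build_sistring_index(text):
        """Sorted (sistring, starting position) pairs -- a stand-in for the suffix tree."""
        return sorted((text[i:], i) for i in range(len(text)))

    def find_substring(index, pattern):
        """Starting positions of every occurrence of pattern in the indexed text."""
        lo = bisect.bisect_left(index, (pattern,))
        hits = []
        for sistring, pos in index[lo:]:
            if not sistring.startswith(pattern):
                break                            # sorted order: no later sistring can match
            hits.append(pos)
        return sorted(hits)

    idx = build_sistring_index("beginning between bread break")
    print(find_substring(idx, "rea"))            # positions of both occurrences of "rea"

As in the suffix tree, all occurrences of a substring sit together (here, as a contiguous run of sorted sistrings), and the index is of the same order of magnitude as the text.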

11 Tries: Suffix Tree
Example: suffix tree for the words begin, beginning, between, bread, break.
(Figure: the suffix tree shares common prefixes; edge labels include b, e, gin, ning, tween, rea, d, k, and null.)

12 Tries: Sistrings
A binary example. String: 01 100 100 010 111
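A quick sketch of the sistrings for this example (my own illustration; the lecture's figures show only the first few of them, numbered from 1):

    s = "01100100010111"                         # "01 100 100 010 111" without the grouping spaces

    sistrings = [(i + 1, s[i:]) for i in range(len(s))]
    for number, sistring in sorted(sistrings, key=lambda p: p[1]):
        print(number, sistring)                  # printed in lexical order, cf. the next slide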

13 Tries: Lexical Ordering
(Figure: the sistrings listed in lexical order, with the unique prefix of each sistring indicated in blue.)

14 Trie: Basic Concept
(Figure: binary trie over the sistrings; branching follows successive bits of each sistring and the leaves carry the sistring numbers 1-8.)

15 Patricia Tree
(Figure: the Patricia tree obtained from the trie above.)
Single-descendant nodes are eliminated. Each remaining node carries the bit number on which it branches.

16 Indexing Subsystem
documents
→ assign document IDs → text (with document numbers and *field numbers)
→ break into tokens → tokens
→ stop list* → non-stoplist tokens
→ stemming* → stemmed terms
→ term weighting* → terms with weights
→ Index database

*Indicates optional operation.
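A compressed sketch of this pipeline in Python. The stop list, the crude suffix-stripping stemmer, and weighting by raw term count are placeholders for the real components (stemming and term weighting are treated in later lectures); field numbers are omitted:

    import re
    from collections import defaultdict

    STOP_LIST = {"a", "and", "in", "is", "of", "the", "to"}      # illustrative only

    def tokenize(text):
        return re.findall(r"[a-z][a-z0-9]*", text.lower())       # break into tokens

    def stem(token):
        return token[:-1] if token.endswith("s") else token      # crude placeholder stemmer

    def index_documents(documents):
        """Map each term to {document ID: weight}; the weight here is just a term count."""
        index = defaultdict(lambda: defaultdict(int))
        for doc_id, text in enumerate(documents):                # assign document IDs
            for token in tokenize(text):
                if token in STOP_LIST:                           # stop list (optional)
                    continue
                index[stem(token)][doc_id] += 1                  # stemming + term weighting
        return index

    index_db = index_documents(["The text of the first document", "text processing methods"])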

17 Search Subsystem
query
→ parse query → query tokens
→ stop list* → non-stoplist tokens
→ stemming* → stemmed terms
→ Boolean operations* (against the Index database) → retrieved / relevant document set
→ ranking* → ranked document set

*Indicates optional operation.
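The search side applies the same normalization to the query and then consults the index built in the previous sketch (reusing tokenize, stem, STOP_LIST, and index_db from there). Boolean operations are omitted and the ranking, by summed term weights, is my own placeholder:

    def search(query, index):
        """Return (document ID, score) pairs as a ranked document set."""
        scores = defaultdict(int)
        for token in tokenize(query):                            # parse query into query tokens
            if token in STOP_LIST:                               # same stop list as indexing
                continue
            for doc_id, weight in index.get(stem(token), {}).items():
                scores[doc_id] += weight
        return sorted(scores.items(), key=lambda kv: -kv[1])     # ranking (optional)

    print(search("text documents", index_db))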

18 Oxford English Dictionary

19 Lexical Analysis: Tokens
What is a token?
Free text indexing: a token is a group of characters, extracted from the input string, that has some collective significance, e.g., a complete word. Usually, tokens are strings of letters, digits, or other specified characters, separated by punctuation, spaces, etc.

20 Lexical Analysis: Choices
Punctuation: In technical contexts, punctuation may be used as a character within a term, e.g., wordlist.txt.
Case: Case of letters is usually not significant.
Hyphens:
  (a) Treat as separators: state-of-art is treated as state of art.
  (b) Ignore: on-line is treated as online.
  (c) Retain: Knuth-Morris-Pratt Algorithm is unchanged.
Digits: Most numbers do not make good tokens, but some are parts of proper nouns or technical terms: CS430, Opus 22.
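A small sketch of how these choices look in code; the regular expressions are my own, and which hyphen treatment to use depends on the collection:

    import re

    def free_text_tokens(text, hyphens="separate"):
        text = text.lower()                                      # case is usually not significant
        if hyphens == "separate":                                # state-of-art -> state, of, art
            text = text.replace("-", " ")
        elif hyphens == "ignore":                                # on-line -> online
            text = text.replace("-", "")
        # hyphens == "retain": leave hyphenated terms (and dotted names) unchanged
        pattern = r"[a-z0-9][a-z0-9.-]*" if hyphens == "retain" else r"[a-z0-9]+"
        return re.findall(pattern, text)

    print(free_text_tokens("state-of-art", "separate"))          # ['state', 'of', 'art']
    print(free_text_tokens("on-line", "ignore"))                 # ['online']
    print(free_text_tokens("Knuth-Morris-Pratt", "retain"))      # ['knuth-morris-pratt']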

21 Lexical Analysis: Choices
The modern tendency, for free text searching, is to map upper and lower case letters together in index terms, but otherwise to minimize the changes made at the lexical analysis stage.

22 Lexical Analysis Example: Query Analyzer
A token is a letter followed by a sequence of letters and digits. Upper-case letters are mapped to their lower-case equivalents. The following characters have significance as operators: ( ) & |
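A regular-expression sketch of this query analyzer (a table-driven version appears after the transition-table slides below):

    import re

    # A token is a letter followed by letters and digits; ( ) & | are operators.
    QUERY_LEXEME = re.compile(r"[a-zA-Z][a-zA-Z0-9]*|[()&|]")

    def analyze_query(query):
        return [lex.lower() for lex in QUERY_LEXEME.findall(query)]

    print(analyze_query("(Information & Discovery) | CS430"))
    # ['(', 'information', '&', 'discovery', ')', '|', 'cs430']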

23 Lexical Analysis: Transition Diagram
(Figure: transition diagram for the query analyzer, with states numbered 1-7 and transitions on letter or digit, space, '(', ')', '&', '|', other characters, and end-of-string.)

24 Lexical Analysis: Transition Table
(Table: one row per state; the columns space, letter or digit, '(', ')', '&', '|', other, and end-of-string give the next state for each input character. States in red are final states.)

25 Changing the Lexical Analyzer
This use of a transition table allows the system administrator to establish different lexical choices for different collections of documents.
Example: To change the lexical analyzer to accept tokens that begin with a digit, change the top right element of the table to 1.
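A sketch of a table-driven analyzer in Python. The table layout and state names are my own, not the slide's; the point is that the lexical behaviour is data, so the administrator can change it without touching the code. Changing the start-state entry for a digit is the analogue of the slide's "change the top right element of the table to 1":

    def char_class(c):
        if c.isalpha():
            return "letter"
        if c.isdigit():
            return "digit"
        if c in " \t":
            return "space"
        if c in "()&|":
            return c
        return "other"

    # TABLE[state][class] -> next state or action (layout is illustrative, not the slide's table)
    TABLE = {
        "start": {"letter": "token", "digit": "skip", "space": "skip",
                  "(": "emit", ")": "emit", "&": "emit", "|": "emit", "other": "skip"},
        "token": {"letter": "token", "digit": "token"},    # any other class ends the token
    }

    def analyze(text, table=TABLE):
        tokens, state, current = [], "start", ""
        for c in text + " ":                               # trailing space flushes the last token
            cls = char_class(c)
            if state == "token":
                if table["token"].get(cls) == "token":
                    current += c.lower()
                    continue
                tokens.append(current)                     # token ended; reprocess c from the start state
                current, state = "", "start"
            action = table["start"][cls]
            if action == "token":
                current, state = c.lower(), "token"
            elif action == "emit":
                tokens.append(c)
            # "skip": ignore the character
        return tokens

    print(analyze("(info & CS430)"))                       # ['(', 'info', '&', 'cs430', ')']

    TABLE["start"]["digit"] = "token"                      # accept tokens that begin with a digit
    print(analyze("430 info"))                             # ['430', 'info']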

26 Stop Lists
Very common words, such as of, and, the, are rarely of use in information retrieval. A stop list is a list of such words that are removed during lexical analysis. A long stop list saves space in indexes, speeds processing, and eliminates many false hits. However, common words are sometimes significant in information retrieval, which is an argument for a short stop list. (Consider the query, "To be or not to be?")
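Stop word removal is then just a filter over the token stream; a minimal sketch with an illustrative stop list:

    STOP_LIST = {"a", "an", "and", "be", "not", "of", "or", "the", "to"}   # illustrative only

    def remove_stop_words(tokens):
        return [t for t in tokens if t not in STOP_LIST]

    print(remove_stop_words("to be or not to be".split()))   # [] -- the whole query disappears

which is exactly the risk noted above for queries made of common words.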

27 Suggestions for Including Words in a Stop List
• Include the most common words in the English language (perhaps 50 to 250 words).
• Do not include words that might be important for retrieval. (Among the 200 most frequently occurring words in general literature in English are time, war, home, life, water, and world.)
• In addition, include words that are very common in context (e.g., computer, information, system in a set of computing documents).

28 Stop Lists in Practice
The modern tendency is to:
• have very short stop lists for broad-ranging or multi-lingual document collections, especially when the users are not trained;
• have longer stop lists for document collections in well-defined fields, especially when the users are trained professionals.

