Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.

1 Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006

2 Topics
Five-step document preprocessing
Porter stemming algorithm
Text compression

3 Five-Step Document Preprocessing
Lexical analysis of the text
  –How to treat digits, hyphens, punctuation marks, and the case of letters
Elimination of stopwords
  –Words with low discrimination values
Stemming
  –Removing prefixes and suffixes
Selection of index terms
  –Determine which words/stems will be used as indexing elements
Construction of term categorization structures
  –e.g., a thesaurus

4 Step 1: Lexical analysis of the text
Converting the text of a document (a large string or a stream of characters) to a stream of words
  –Word separators (English, Chinese)
How to deal with digits, punctuation marks, hyphens, and the case of letters
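A minimal sketch of this step for English text, assuming whitespace and punctuation act as word separators (this does not hold for Chinese, which has no explicit separators). The function name `tokenize` and its two flags are illustrative choices, not part of the lecture:

```python
import re

def tokenize(text, keep_digits=False, split_hyphens=True):
    """Turn a raw character stream into a list of lowercase word tokens."""
    text = text.lower()                    # fold the case of letters
    if split_hyphens:
        text = text.replace("-", " ")      # "state-of-the-art" becomes four tokens
    # Punctuation never matches, so it acts as a separator and is dropped.
    if keep_digits:
        pattern = r"[a-z0-9]+(?:-[a-z0-9]+)*"
    else:
        pattern = r"[a-z]+(?:-[a-z]+)*"
    return re.findall(pattern, text)

print(tokenize("State-of-the-art IR, since 1999!"))
```

Each flag corresponds to one of the policy decisions listed on the slide; a real indexer would pick them per collection (for example, keeping digits when dates or part numbers matter).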

5 Step 2: Elimination of stopwords
Stopwords are words that occur frequently in the collection
Not good discriminators
  –Filtered out as potential index terms
Elimination of stopwords reduces the size of the indexing structure considerably
  –40% or more
Examples
  –Articles, prepositions, conjunctions, etc.
  –Even some verbs, adverbs, and adjectives
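Stopword elimination can be sketched as a set-membership filter over the token stream. The tiny `STOPWORDS` set below is a made-up illustration; real systems use lists of a few hundred words, often tuned to the collection:

```python
# Illustrative stopword list (an assumption, not the list used in the course).
STOPWORDS = frozenset({
    "a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are",
})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["the", "retrieval", "of", "information", "is", "hard"]))
```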

6 Step 3: Stemming
Problem with perfect match:
  –A query word “connect” does not match its variants “connected”, “connecting”, “connects” in different documents
Stemming: reduce variants of the same root word to a common concept
Stemming also reduces the number of distinct index terms
The Porter Algorithm

7 Stemming Approaches
Table lookup
  –Generation is complex
  –Final tables are often incomplete
Affix removal
  –Suffix vs. prefix (e.g. mega-volt)
  –Doesn’t always work, esp. not in German
Successor variety stemming
  –More complex than suffix removal
  –Uses (e.g.) linguistic approaches and techniques from morphology
N-grams
  –General clustering approach which can also be used for stemming
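As a hedged illustration of the n-gram approach: one common formulation compares the sets of distinct character bigrams of two words with Dice's coefficient, and then clusters words whose similarity exceeds a threshold. The slide does not specify the measure, so the choice of bigrams and Dice here is an assumption:

```python
def bigrams(word):
    """Set of distinct adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient over the distinct bigrams of two words."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0                      # words too short to have bigrams
    return 2 * len(x & y) / (len(x) + len(y))

# "statistics" has 7 distinct bigrams, "statistical" has 8, and they share 6.
print(dice("statistics", "statistical"))   # 2*6 / (7+8) = 0.8
```

Words with high pairwise similarity would then be grouped into one cluster, which plays the role of a stem class without any language-specific rules.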

8 Step 4: Selection of index terms
Full text representation vs. selected set of terms as index terms
Many distinct automatic approaches
The identification of noun groups (Inquery system)
  –Most of the semantics is carried by the nouns in a sentence
  –Combine nearby nouns into noun groups

9 Step 5: Construction of term categorization structures
A thesaurus
  –A standard vocabulary for indexing and searching
  –Relationships among indexed terms
  –Assists users with locating terms for proper query formulation
An example of an entry in Roget’s thesaurus
  –Cowardly (adjective)
  –Ignobly lacking in courage: cowardly turncoats
  –Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)

10 Thesauri
Indexed terms
  –Each term denotes a concept, the basic semantic unit
  –Can be individual words, groups of words, or phrases
  –Terms are basically nouns
  –Terms can also be verbs in gerund form whenever they are used as nouns (teaching, acting, etc.)
Relationships
  –The set of terms related to an entry is mostly composed of synonyms or near-synonyms

11 The Use of Thesauri in IR
Selecting related terms in a thesaurus to reformulate a query when the initial query words are erroneous or improper
Unfortunately, this approach does not work well in general
  –Relationships captured in a thesaurus are often not valid in the local context of a given query
An alternative: determine thesaurus-like relationships at query time
  –Challenging for web search: can’t afford the effort for each individual query
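A rough sketch of thesaurus-based query reformulation, assuming the thesaurus is available as a plain dictionary from a term to its related terms. The `THESAURUS` contents reuse a few synonyms from the Roget entry for "cowardly"; the function name `expand_query` and the `max_related` cutoff are hypothetical:

```python
# Toy thesaurus (an assumption for illustration only).
THESAURUS = {
    "cowardly": ["craven", "dastardly", "faint-hearted", "gutless"],
}

def expand_query(terms, thesaurus=THESAURUS, max_related=2):
    """Append up to max_related thesaurus terms after each query term."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(thesaurus.get(term, [])[:max_related])
    return expanded

print(expand_query(["cowardly", "turncoats"]))
```

The sketch also makes the weakness visible: the added terms are chosen without looking at the rest of the query, so they may not fit its local context.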

12 The Porter Algorithm
An algorithm specific to the English language, based on suffix removal
Five distinct phases, applied to each word one after another
Example: remove plural ‘s’ and ‘sses’
  –Rules: sses -> ss, s -> NIL (the order matters: try the longer suffix first)
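The plural-removal example can be sketched as an ordered rule list in which the first matching suffix wins. The list below uses the four rules of step 1a of the published Porter algorithm (the two from this slide plus ies -> i and ss -> ss); the function name `step_1a` is an illustrative choice:

```python
# Ordered longest-suffix-first, so "sses" is tried before "ss" and "s".
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # protects words like "caress" from the plain "s" rule
    ("s", ""),
]

def step_1a(word):
    """Apply the first rule whose suffix matches; leave other words unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

for w in ["classes", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
```

If the plain s -> NIL rule were tried first, "classes" would become "classe" and "caress" would become "cares", which is why the order must be obeyed.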

13 Porter Algorithm
Conventions
  –C: consonant, V: vowel, L: any letter (consonant or vowel)
  –Combinations of C, V, and L define patterns
  –Operators “+” and “*” form complex patterns
  –*: zero or more repetitions of a given pattern, e.g. (V*C)
  –+: one or more repetitions of a given pattern, e.g. (C)*((V)+(C)+)+(V)*
Statements/commands
  –Rule-based statements
  –Single rule: if (*V*L) then ed -> NIL (remove “ed”)
  –Multiple rules: select the rule with the longest matching suffix { sses -> ss; ies -> i; ss -> ss; s -> NIL }
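The conventions above can be made concrete with a small sketch. It assumes Porter's published definition of a consonant (any letter other than a, e, i, o, u, and other than y preceded by a consonant) and his "measure" m, the number of vowel-run/consonant-run pairs in the form [C](VC)^m[V]. The helper names are illustrative, and `strip_ed` covers only the single "ed" rule, not a full stemming step:

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions: m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def strip_ed(word):
    """Remove "ed" only if the remaining stem contains a vowel."""
    if word.endswith("ed") and contains_vowel(word[:-2]):
        return word[:-2]
    return word

print(measure("tree"), measure("trouble"), measure("troubles"))
print(strip_ed("played"), strip_ed("bled"))
```

The vowel condition is what keeps short words such as "bled" intact while "played" is reduced to "play".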

14 Try Porter Algorithm
Played
Classes
Policy
Position
Capability
Active, actively, activity

15 The Porter Algorithm: advantages & disadvantages
Advantage: a simple algorithm with good results
  –abate, abated, abatement, abatements, abates -> abat
Disadvantage: not always correct, e.g.
  –Same root for unrelated words: police and policy, execute and executive, …
  –Different roots for related words: european and europe, search and searcher, …

16 Next Lecture: Compression. Ch. 7
