Presentation is loading. Please wait.

Presentation is loading. Please wait.

Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 2: Text Operations.

Similar presentations


Presentation on theme: "Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 2: Text Operations."— Presentation transcript:

1 Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 2: Text Operations

2 Srihari-CSE535-Spring2008 Logical View 1. Description of content and semantics of document 2. Structure 3. Derivation of index terms: (a) removal of stopwords (b) stemming (c) identification of nouns

3 Srihari-CSE535-Spring2008 Index Terms Assigned terms 1. Manual indexing (a) Professional indexer (b) Authors 2. Classification efforts (Library of Congress Classification, ACM) (a) Coarseness: large collections? (b) Categories are mutually exclusive 3. Subjectivity, inter-indexer consistency (:very low!) 4. Ontologies and classications: limited appeal for retrieval Assigned terms 1. Manual indexing (a) Professional indexer (b) Authors 2. Classification efforts (Library of Congress Classification, ACM) (a) Coarseness: large collections? (b) Categories are mutually exclusive 3. Subjectivity, inter-indexer consistency (:very low!) 4. Ontologies and classications: limited appeal for retrieval Extracted terms 1. Extraction from title, abstract or full-text 2. Methodology to extract must distinguish significant from non-significant terms 3. Controlled (limit to pre-defined dictionary vs. non-controlled vocabulary (all terms as they are) 4. Controlled ! less noise, better performance?

4 Srihari-CSE535-Spring2008 Document Pre-Processing Lexical analysis digits, hypens, punctuation marks e.g. I’ll --> I will Stopword elimination filter out words with low discriminatory value stopword list reduces size of index by 40% or more might miss documents, e.g. “to be or not to be” Stemming connecting, connected, etc. Selection of index terms e.g. select nouns only Thesaurus construction permits query expansion Lexical analysis digits, hypens, punctuation marks e.g. I’ll --> I will Stopword elimination filter out words with low discriminatory value stopword list reduces size of index by 40% or more might miss documents, e.g. “to be or not to be” Stemming connecting, connected, etc. Selection of index terms e.g. select nouns only Thesaurus construction permits query expansion

5 Srihari-CSE535-Spring2008 Lexical Analysis Converting stream of characters into stream of words/tokens more than just detecting spaces between words (see next slide) Disregard numbers as index terms too vague perform date and number normalization Hyphens consistent policy on removal e.g. B-52 Punctuation marks typically removed sometimes problematic, e.g. variable names myclass.height Orthographic variations typically disregarded Converting stream of characters into stream of words/tokens more than just detecting spaces between words (see next slide) Disregard numbers as index terms too vague perform date and number normalization Hyphens consistent policy on removal e.g. B-52 Punctuation marks typically removed sometimes problematic, e.g. variable names myclass.height Orthographic variations typically disregarded

6 Srihari-CSE535-Spring2008 Tokenization

7 Srihari-CSE535-Spring2008 Tokenization Input: “ Friends, Romans and Countrymen ” Output: Tokens Friends Romans Countrymen Each such token is now a candidate for an index entry, after further processing Described below But what are valid tokens to emit? Input: “ Friends, Romans and Countrymen ” Output: Tokens Friends Romans Countrymen Each such token is now a candidate for an index entry, after further processing Described below But what are valid tokens to emit?

8 Srihari-CSE535-Spring2008 Parsing a document What format is it in? pdf/word/excel/html? What language is it in? What character set is in use? What format is it in? pdf/word/excel/html? What language is it in? What character set is in use? Each of these is a classification problem. But there are complications …

9 Srihari-CSE535-Spring2008 Format/language stripping Documents being indexed can include docs from many different languages A single index may have to contain terms of several languages. Sometimes a document or its components can contain multiple languages/formats French with a Portuguese pdf attachment. What is a unit document? An ? With attachments? An with a zip containing documents? Documents being indexed can include docs from many different languages A single index may have to contain terms of several languages. Sometimes a document or its components can contain multiple languages/formats French with a Portuguese pdf attachment. What is a unit document? An ? With attachments? An with a zip containing documents?

10 Srihari-CSE535-Spring2008 Tokenization Issues in tokenization: Finland’s capital  Finland? Finlands? Finland’s ? Hewlett-Packard  Hewlett and Packard as two tokens? San Francisco : one token or two? How do you decide it is one token? Issues in tokenization: Finland’s capital  Finland? Finlands? Finland’s ? Hewlett-Packard  Hewlett and Packard as two tokens? San Francisco : one token or two? How do you decide it is one token?

11 Srihari-CSE535-Spring2008 Language issues Accents: résumé vs. resume. L'ensemble  one token or two? L ? L’ ? Le ? How are your users like to write their queries for these words? Accents: résumé vs. resume. L'ensemble  one token or two? L ? L’ ? Le ? How are your users like to write their queries for these words?

12 Srihari-CSE535-Spring2008 Tokenization: language issues Chinese and Japanese have no spaces between words: Not always guaranteed a unique tokenization Further complicated in Japanese, with multiple alphabets intermingled Dates/amounts in multiple formats Chinese and Japanese have no spaces between words: Not always guaranteed a unique tokenization Further complicated in Japanese, with multiple alphabets intermingled Dates/amounts in multiple formats フォーチュン 500 社は情報不足のため時間あた $500K( 約 6,000 万円 ) KatakanaHiraganaKanji“Romaji” End-user can express query entirely in (say) Hiragana!

13 Srihari-CSE535-Spring2008 Normalization “Right-to-left languages” like Hebrew and Arabic: can have “left-to-right” text interspersed. Need to “normalize” indexed text as well as query terms into the same form Character-level alphabet detection and conversion Tokenization not separable from this. Sometimes ambiguous: “Right-to-left languages” like Hebrew and Arabic: can have “left-to-right” text interspersed. Need to “normalize” indexed text as well as query terms into the same form Character-level alphabet detection and conversion Tokenization not separable from this. Sometimes ambiguous: 7 月 30 日 vs. 7/30 Morgen will ich in MIT … Is this German “mit”?

14 Srihari-CSE535-Spring2008 Punctuation Ne’er : use language-specific, handcrafted “locale” to normalize. Which language? Most common: detect/apply language at a pre-determined granularity: doc/paragraph. State-of-the-art : break up hyphenated sequence. Phrase index? U.S.A. vs. USA - use locale. a.out Ne’er : use language-specific, handcrafted “locale” to normalize. Which language? Most common: detect/apply language at a pre-determined granularity: doc/paragraph. State-of-the-art : break up hyphenated sequence. Phrase index? U.S.A. vs. USA - use locale. a.out

15 Srihari-CSE535-Spring2008 Numbers 3/12/91 Mar. 12, B.C. B-52 My PGP key is 324a3df234cb23e Generally, don’t index as text. Will often index “meta-data” separately Creation date, format, etc. 3/12/91 Mar. 12, B.C. B-52 My PGP key is 324a3df234cb23e Generally, don’t index as text. Will often index “meta-data” separately Creation date, format, etc.

16 Srihari-CSE535-Spring2008 Case folding Reduce all letters to lower case exception: upper case (in mid-sentence?) e.g., General Motors Fed vs. fed SAIL vs. sail Reduce all letters to lower case exception: upper case (in mid-sentence?) e.g., General Motors Fed vs. fed SAIL vs. sail

17 Srihari-CSE535-Spring2008 Thesauri and soundex Handle synonyms and homonyms Hand-constructed equivalence classes e.g., car = automobile your and you’re Index such equivalences When the document contains automobile, index it under car as well (usually, also vice-versa) Or expand query? When the query contains automobile, look under car as well Handle synonyms and homonyms Hand-constructed equivalence classes e.g., car = automobile your and you’re Index such equivalences When the document contains automobile, index it under car as well (usually, also vice-versa) Or expand query? When the query contains automobile, look under car as well

18 Srihari-CSE535-Spring2008 Assymetric query expansion

19 Srihari-CSE535-Spring2008 Lemmatization Reduce inflectional/variant forms to base form E.g., am, are, is  be car, cars, car's, cars'  car the boy's cars are different colors  the boy car be different color Reduce inflectional/variant forms to base form E.g., am, are, is  be car, cars, car's, cars'  car the boy's cars are different colors  the boy car be different color

20 Srihari-CSE535-Spring2008 Dictionary entries – first cut ensemble.french 時間. japanese MIT.english mit.german guaranteed.english entries.english sometimes.english tokenization.english These may be grouped by language. More on this in ranking/query processing.

21 Srihari-CSE535-Spring2008 Perl program to Split Words # ! / usr/bin/perl while (<>) { $ line = $ ; chomp ( $ line ) ; Store line (with CR removed) if ( $ line = ˜ / [ aÅ|z ] / ) { # split on = split / \ s + /, $line ; words is an associative array # separately print all words in line for each $term ) { # lowercase $term = ˜ tr / [A-Z ] / [ a-z ] / ; # remove punctuation $term = ˜ s / \, | \. | \ ; | \ : | \ ” | \ ’ | \ ‘ | \ ? | \ ! / / g ; print $term. ” \ n ” ; } # ! / usr/bin/perl while (<>) { $ line = $ ; chomp ( $ line ) ; Store line (with CR removed) if ( $ line = ˜ / [ aÅ|z ] / ) { # split on = split / \ s + /, $line ; words is an associative array # separately print all words in line for each $term ) { # lowercase $term = ˜ tr / [A-Z ] / [ a-z ] / ; # remove punctuation $term = ˜ s / \, | \. | \ ; | \ : | \ ” | \ ’ | \ ‘ | \ ? | \ ! / / g ; print $term. ” \ n ” ; }

22 Srihari-CSE535-Spring2008 Building Lexical Analysers Finite state machines recognizer for regular expressions Lex (http://www.combo.org/lex_yacc_page/lex.html) popular Unix utility program generator designed for lexical processing of character input streams input: high-level, problem oriented specification for character string matching output: program in a general purpose language which recognizes regular expressions Source -> | Lex | -> yylex Input -> | yylex | -> Output Finite state machines recognizer for regular expressions Lex (http://www.combo.org/lex_yacc_page/lex.html) popular Unix utility program generator designed for lexical processing of character input streams input: high-level, problem oriented specification for character string matching output: program in a general purpose language which recognizes regular expressions Source -> | Lex | -> yylex Input -> | yylex | -> Output

23 Srihari-CSE535-Spring2008 Writing lex source files Format of a lex source file: {definitions} % {rules} % {user subroutines} invoking lex: lex source cc lex.yy.c -ll minimum lex program is % (null program copies input to output) use rules to do any special transformations Program to convert British spelling to American colour printf("color"); mechanise printf("mechanize"); petrol printf("gas");

24 Srihari-CSE535-Spring2008 Sample tokenizer using lex Sample input and output from above program here are some ids 1: now some numbers : ids with numbers x123 78xyz 3: 4: \:\

25 Srihari-CSE535-Spring2008 Stopword Elimination 1. Stopwords: (a) Too frequent, P(ki) > 0.8 (b) Irrelevant in linguistic terms i. Do not correpond to nouns, concepts etc ii. Could however be vital in natural language processing 2. Removal from collection on two grounds: (a) Retrieval efficiency i. High frequency: no distinction between documents ii. Useless for retrieval (b) Storage efficiency i. Zipf’s law ii. Heap’s law 1. Stopwords: (a) Too frequent, P(ki) > 0.8 (b) Irrelevant in linguistic terms i. Do not correpond to nouns, concepts etc ii. Could however be vital in natural language processing 2. Removal from collection on two grounds: (a) Retrieval efficiency i. High frequency: no distinction between documents ii. Useless for retrieval (b) Storage efficiency i. Zipf’s law ii. Heap’s law

26 Srihari-CSE535-Spring2008 OAC Stopword list (~ 275 words) The Complete OAC-Search Stopword List adidhasnobodysomewhereusually aboutdohavenoonesoonvery alldoeshavingnorstillwas almostdoingherenotsuchway alongdonehownothingthanways alreadyduringhowevernowthatwe alsoeachiofthewell althougheitherifoncetheirwent alwaysenoughiionetheirswere ametcinone'sthenwhat amongevenincludingonlytherewhen aneverindeedorthereforewhenever andeveryinsteadotherthesewhere anyeveryoneintootherstheywherever anyoneeverythingisourthingwhether anythingforitoursthingswhich

27 Srihari-CSE535-Spring2008 Stopword list contd. anywhere from its out this while are further itself over those who as gave just own though whoever at get like perhaps through why away getting likely rather thus will because give may really to with between given maybe same together within beyond giving me shall too without both go might should toward would but going mine simply until yes by gone much since upon yet can good must so us you cannot got neither some use your come great never someone used yours could had no something uses yourself sometimes using

28 Srihari-CSE535-Spring2008 Implementing Stoplists 2 ways to filter stoplist words from input token stream examine tokenizer output and remove any stopwords remove stopwords as part of lexical analysis (or tokenization) phase Approach 1: filtering after tokenization standard list search problem; every token must be looked up in stoplist binary search trees, hashing hashing is preferred method insert stoplist into a hash table each token hashed into table if location is empty, not a stopword; otherwise string comparisons to see whether hashed value is really a stopword Approach 2 much more efficient 2 ways to filter stoplist words from input token stream examine tokenizer output and remove any stopwords remove stopwords as part of lexical analysis (or tokenization) phase Approach 1: filtering after tokenization standard list search problem; every token must be looked up in stoplist binary search trees, hashing hashing is preferred method insert stoplist into a hash table each token hashed into table if location is empty, not a stopword; otherwise string comparisons to see whether hashed value is really a stopword Approach 2 much more efficient

29 Srihari-CSE535-Spring2008 Stemming Algorithms Designed to standardize morphological variations stem is the portion of a word which is left after the removal of its affixes (prefixes and suffixes) Benefits of stemming supposedly, to increase recall; could lead to lower precision controversy surrounding benefits web search engines do NOT perform stemming Reduces size of index (clear benefit) 4 types of stemming algorithms affix removal : most commonly used table look-up : considerable storage space, all data not available successor variety : (uses knowledge from structural linguistics, more complex) n-grams ; more a term clustering procedure that a stemming algorithm Porter Stemming algorithm Designed to standardize morphological variations stem is the portion of a word which is left after the removal of its affixes (prefixes and suffixes) Benefits of stemming supposedly, to increase recall; could lead to lower precision controversy surrounding benefits web search engines do NOT perform stemming Reduces size of index (clear benefit) 4 types of stemming algorithms affix removal : most commonly used table look-up : considerable storage space, all data not available successor variety : (uses knowledge from structural linguistics, more complex) n-grams ; more a term clustering procedure that a stemming algorithm Porter Stemming algorithm

30 Srihari-CSE535-Spring2008 Stemming - example Word is “duplicatable” duplicatstep 4 duplicatestep 1b duplicstep 3 Word is “duplicatable” duplicatstep 4 duplicatestep 1b duplicstep 3

31 Srihari-CSE535-Spring2008 Stemming Reduce terms to their “roots” before indexing language dependent e.g., automate(s), automatic, automation all reduced to automat. Reduce terms to their “roots” before indexing language dependent e.g., automate(s), automatic, automation all reduced to automat. for example compressed and compression are both accepted as equivalent to compress. for exampl compres and compres are both accept as equival to compres.

32 Srihari-CSE535-Spring2008 Porter’s algorithm Commonest algorithm for stemming English Conventions + 5 phases of reductions phases applied sequentially each phase consists of a set of commands sample convention: Of the rules in a compound command, select the one that applies to the longest suffix. Commonest algorithm for stemming English Conventions + 5 phases of reductions phases applied sequentially each phase consists of a set of commands sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

33 Srihari-CSE535-Spring2008 Typical rules in Porter sses  ss ies  i ational  ate tional  tion sses  ss ies  i ational  ate tional  tion

34 Srihari-CSE535-Spring2008 Other stemmers Other stemmers exist, e.g., Lovins stemmer m Single-pass, longest suffix removal (about 250 rules) Motivated by Linguistics as well as IR Full morphological analysis - modest benefits for retrieval Other stemmers exist, e.g., Lovins stemmer m Single-pass, longest suffix removal (about 250 rules) Motivated by Linguistics as well as IR Full morphological analysis - modest benefits for retrieval

35 Srihari-CSE535-Spring2008 Stemming variants

36 Srihari-CSE535-Spring2008 Porter Stemmer : Representing Strings A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term ‘consonant’ is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel. A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms: CVCV... C CVCV... V VCVC... C VCVC... V These may all be represented by the single form [C]VCVC... [V] where the square brackets denote arbitrary presence of their contents. Using (VC) m to denote VC repeated m times, this may again be written as [C](VC) m [V]. m will be called the measure of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples: m=0 TR, EE, TREE, Y, BY. m=1 TROUBLE, OATS, TREES, IVY. m=2 TROUBLES, PRIVATE, OATEN, ORRERY.

37 Srihari-CSE535-Spring2008 Porter Stemmer - Types of Rules The rules for removing a suffix will be given in the form (condition) S1 -> S2 This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g. (m > 1) EMENT -> Here S1 is ‘EMENT’ and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2. The ‘condition’ part may also contain the following: *S - the stem ends with S (and similarly for the other letters). *v* - the stem contains a vowel. *d - the stem ends with a double consonant (e.g. -TT, -SS). *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP). And the condition part may also contain expressions with and, or and not, so that (m>1 and (*S or *T)) tests for a stem with m>1 ending in S or T, while (*d and not (*L or *S or *Z)) tests for a stem ending with a double consonant other than L, S or Z. Elaborate conditions like this are required only rarely. In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word. For example, with SSES -> SS IES -> I SS -> SS S -> (here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1=‘SS’) and CARES to CARE (S1=‘S’).

38 Srihari-CSE535-Spring2008 Porter Stemmer - Step 1a In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. Only one rule from each step can fire. The algorithm now follows: Step 1a SSES -> SS caresses -> caress IES -> I ponies -> poni ties -> ti SS -> SS caress -> caress S -> cats -> cat

39 Srihari-CSE535-Spring2008 Porter Stemmer- Step 1b Step 1b (m>0) EED -> EE feed -> feed agreed -> agree (*v*) ED -> plastered -> plaster bled -> bled (*v*) ING -> motoring -> motor sing -> sing If the second or third of the rules in Step 1b is successful, the following is done: AT -> ATE conflat(ed) -> conflate BL -> BLE troubl(ed) -> trouble IZ -> IZE siz(ed) -> size (*d and not (*L or *S or *Z)) -> single letter hopp(ing) -> hop tann(ed) -> tan fall(ing) -> fall hiss(ing) -> hiss fizz(ed) -> fizz (m=1 and *o) -> E fail(ing) -> fail fil(ing) -> file The rule to map to a single letter causes the removal of one of the double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4.

40 Srihari-CSE535-Spring2008 Porter Stemmer- Step 1c Step 1c (*v*) Y -> I happy -> happi sky -> sky Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward.

41 Srihari-CSE535-Spring2008 Porter Stemmer - Step 2 (m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational -> rational (m>0) ENCI -> ENCE valenci -> valence (m>0) ANCI -> ANCE hesitanci -> hesitance (m>0) IZER -> IZE digitizer -> digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli -> radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli -> vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator -> operate (m>0) ALISM -> AL feudalism -> feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti -> formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensible The test for the string S1 can be made fast by doing a program switch on the penultimate letter of the word being tested. This gives a fairly even breakdown of the possible values of the string S1. It will be seen in fact that the S1-strings in step 2 are presented here in the alphabetical order of their penultimate letter. Similar techniques may be applied in the other steps.

42 Srihari-CSE535-Spring2008 Porter Stemmer - Step 3 (m>0) ICATE -> IC triplicate -> triplic (m>0) ATIVE -> formative -> form (m>0) ALIZE -> AL formalize -> formal (m>0) ICITI -> IC electriciti -> electric (m>0) ICAL -> IC electrical -> electric (m>0) FUL -> hopeful -> hope (m>0) NESS -> goodness -> good

43 Srihari-CSE535-Spring2008 Porter Stemmer- Step 4 (m>1) AL -> revival -> reviv (m>1) ANCE -> allowance -> allow (m>1) ENCE -> inference -> infer (m>1) ER -> airliner -> airlin (m>1) IC -> gyroscopic -> gyroscop (m>1) ABLE -> adjustable -> adjust (m>1) IBLE -> defensible -> defens (m>1) ANT -> irritant -> irrit (m>1) EMENT -> replacement -> replac (m>1) MENT -> adjustment -> adjust (m>1) ENT -> dependent -> depend (m>1 and (*S or *T)) ION -> adoption -> adopt (m>1) OU -> homologou -> homolog (m>1) ISM -> communism -> commun (m>1) ATE -> activate -> activ (m>1) ITI -> angulariti -> angular (m>1) OUS -> homologous -> homolog (m>1) IVE -> effective -> effect (m>1) IZE -> bowdlerize -> bowdler The suffixes are now removed. All that remains is a little tidying up.

44 Srihari-CSE535-Spring2008 Porter Stemmer - Step 5 Step 5a (m>1) E -> probate -> probat rate -> rate (m=1 and not *o) E -> cease -> ceas Step 5b (m > 1 and *d and *L) -> single letter controll -> control roll -> roll

45 Srihari-CSE535-Spring2008 Thesauri Thesaurus precompiled list of words in a given domain for each word above, a list of related words computer-aided instruction computer applications educational computing Applications of Thesaurus in IR provides a standard vocabulary for indexing and searching medical literature querying proper query formulation classified hierarchies that allow broadening/narrowing of current query e.g. Yahoo Components of a thesaurus index terms relationship among terms layout design for term relationships Thesaurus precompiled list of words in a given domain for each word above, a list of related words computer-aided instruction computer applications educational computing Applications of Thesaurus in IR provides a standard vocabulary for indexing and searching medical literature querying proper query formulation classified hierarchies that allow broadening/narrowing of current query e.g. Yahoo Components of a thesaurus index terms relationship among terms layout design for term relationships


Download ppt "Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 2: Text Operations."

Similar presentations


Ads by Google