

Term Processing & Normalization
Major goal: find the best possible representation
Minor goals: improve storage and speed
First: need to transform the sequence of bytes into a sequence of characters
Problems involved:
- Encoding, language, …
- Different formats (some might be proprietary)
- Content vs. structure information
- etc.

Lexical Analysis & Tokenization
Goal: transform the stream of characters into a stream of words / terms / tokens
Terms should be consistent and ‘reasonable’. Lots of potential problems here, e.g.:
- Apostrophes: England’s: England s – Englands? haven’t: haven t – havent – have not? user’s vs. users’ vs. users’s?
- Hyphenation: state-of-the-art vs. state of the art; co-authors vs. coauthors vs. co authors
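The apostrophe and hyphen policies above can be sketched in a few lines. This toy tokenizer hard-codes one possible policy (drop apostrophes, treat hyphens as spaces); the function name and choices are illustrative, not taken from any particular system:

```python
import re

def tokenize(text):
    """Toy tokenizer: lowercase, drop apostrophes, split on hyphens.

    This fixes one of several policies discussed above;
    real systems choose per-corpus heuristics.
    """
    text = text.lower()
    text = text.replace("’", "").replace("'", "")  # england's -> englands
    text = text.replace("-", " ")                  # state-of-the-art -> state of the art
    return re.findall(r"[a-z0-9]+", text)

print(tokenize("England's state-of-the-art co-authors"))
# ['englands', 'state', 'of', 'the', 'art', 'co', 'authors']
```

Note how each policy decision (Englands vs. England s, co authors vs. coauthors) is baked into one line, which is why "what and how" matters more than the implementation itself.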

Lex. Analysis & Tokeniz. (Cont.)
Special characters:
- Punctuation (normally removed, but what about names: C++, A* Algorithm, M*A*S*H, …?)
- Use just white space for segmentation? San Francisco? Los Angeles? Nuclear power plant?
- Language-specific issues, e.g. German ‘Komposita’ (compound nouns), e.g. Atomkraftwerk
- Case sensitivity?
Conclusion: implementation is easy, but deciding what to do and how is critical (often heuristics, etc.)

Removal of Stop Words
Goal: remove words with no relevance for any search, in order to reduce disk space, speed up search time, and improve results
Problems:
- What about ‘To be or not to be’?
- What about phrases, e.g. ‘The name of the rose’ (book/movie vs. flower names)?
- Usability?
Implementation: often via manually generated stop word lists (depending on the task and data, e.g. small vs. big, general vs. topic-specific, etc.)
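A minimal sketch of stop-word filtering (the word list here is a toy example; real lists are manually curated and task-dependent) also reproduces the ‘To be or not to be’ problem:

```python
# Toy stop-word list -- real systems use larger, task-specific lists.
STOP_WORDS = {"the", "a", "an", "of", "to", "or", "not", "be", "and", "in"}

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "name", "of", "the", "rose"]))
# ['name', 'rose']
print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))
# [] -- the whole query vanishes!
```

The second call shows why blind stop-word removal can destroy legitimate queries made entirely of frequent words.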

Thesaurus
Thesaurus: A data structure composed of (1) a pre-compiled list of important words in a given domain of knowledge and (2) for each word in the list, a list of related (synonym) words. (Glossary, page 453, Baeza-Yates & Ribeiro-Neto [1])
Implementation: either manually generated (hard & expensive) or automatic (e.g. with statistical methods)
Usefulness for retrieval is controversial: topic-specific thesauri seem to work well, but limited usage with large, dynamic databases(?)
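One common use of a thesaurus is expanding the query rather than the index. A sketch with an invented two-entry synonym table (the entries are illustrative only):

```python
# Invented toy thesaurus: each word maps to its related (synonym) words.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "movie": ["film"],
}

def expand_query(terms):
    """Append the thesaurus synonyms of each query term."""
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(THESAURUS.get(t, []))
    return expanded

print(expand_query(["car", "chase", "movie"]))
# ['car', 'automobile', 'vehicle', 'chase', 'movie', 'film']
```

Expansion tends to raise recall (more relevant documents match) at the risk of lowering precision, which is why its usefulness is controversial.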

Thesaurus Usefulness for indexing unclear, but it can be useful in another way:


Stemming
Stemming: A technique for reducing words to their grammatical roots. (Glossary, page 452, Baeza-Yates & Ribeiro-Neto [1])
Motivation: reduce different cases, singular and plural, etc. to the root (stem) of the word
Example: connected, connecting, connection, connections -> connect
But how about: operations research, operating system, operative dentistry -> opera???

Approaches for Stemming
- Table lookup: generation is complex; final tables are often incomplete
- Affix removal: suffix vs. prefix (e.g. mega-volt); doesn’t always work, esp. not in German
- Successor variety stemming: more complex than suffix removal; uses (e.g.) linguistic approaches and techniques from morphology
- N-grams: a general clustering approach which can also be used for stemming

The Porter Algorithm
A special algorithm for the English language based on suffix removal
Basic idea: successive application of several rules for suffix removal (5 phases of condition-and-action rules)
Example: remove plural ‘s’ and ‘sses’. Rules: sses -> ss, s -> NIL (obey order!)
Advantage: easy algorithm with good results
Disadvantage: not always correct, e.g.
- same root for police – policy, execute – executive, …
- different root for european – europe, search – searcher, …
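The plural rules quoted above (sses -> ss, s -> NIL) can be sketched with longest-matching-suffix selection, which is how the full algorithm resolves conflicts between rules. This fragment covers only that one rule group, not the whole five-phase stemmer:

```python
# Porter phase-1 plural rule group: (old suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_plural(word):
    """Apply the single rule whose suffix is the longest match."""
    matches = [(old, new) for old, new in RULES if word.endswith(old)]
    if not matches:
        return word
    old, new = max(matches, key=lambda r: len(r[0]))
    return word[: len(word) - len(old)] + new

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_plural(w))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```

Note why the seemingly useless rule ss -> ss is there: it outranks s -> NIL for words like "caress", preventing them from losing a final ‘s’.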

Porter Algorithm – Convention rules I
Conventions for the pseudo language used to describe the algorithm:
- a consonant variable is represented by the symbol C, which refers to any letter other than a, e, i, o, u and other than the letter y preceded by a consonant;
- a vowel variable is represented by the symbol V, which refers to any letter which is not a consonant;
- a generic letter (consonant or vowel) is represented by the symbol L;
- the symbol NIL refers to an empty string (i.e., one with no letters);
- combinations of C, V, and L are used to define patterns;
- the symbol * refers to zero or more repetitions of a given pattern;
- the symbol + refers to one or more repetitions of a given pattern;
- matched parentheses are used to subordinate a sequence of variables to the operators * and +;
- a generic pattern is a combination of symbols, matched parentheses, and the operators * and +;
- the substitution rules are treated as commands separated by a semicolon;
- the substitution rules are applied to the suffixes of the current word;
- a conditional if statement is expressed as “if (pattern) rule”, and the rule is executed only if the pattern in the condition matches the current word;
- a line which starts with a % is treated as a comment;
- curly brackets are used to form compound commands;
- a “select rule with longest suffix” statement selects a single rule for execution among all the rules in a compound command: the rule selected is the one with the longest matching suffix.

Porter Algorithm – Convention rules II
Thus, the expression (C)* refers to a sequence of zero or more consonants, while the expression ((V)*(C)*)* refers to a sequence of zero or more vowels followed by zero or more consonants, which can appear zero or more times.
It is important to distinguish the above from the sequence (V*C), which states that a sequence must be present and that this sequence necessarily starts with a vowel, is followed by a subsequence of zero or more letters, and finishes with a consonant.
Finally, the command “if (*V*L) then ed -> NIL” states that the substitution of the suffix ed by NIL (i.e., the removal of the suffix ed) only occurs if the current word contains a vowel and at least one additional letter.

Porter Algorithm – Phase 1
% Phase 1: Plurals and past participles.
select rule with longest suffix {
    sses -> ss;
    ies -> i;
    ss -> ss;
    s -> NIL;
}
select rule with longest suffix {
    if ((C)*((V)+(C)+)+(V)*eed) then eed -> ee;
    if (*V*ed or *V*ing) then {
        select rule with longest suffix {
            ed -> NIL;
            ing -> NIL;
        }
        select rule with longest suffix {
            at -> ate;
            bl -> ble;
            iz -> ize;
            if ((*C1C2) and (C1 = C2) and (C1 not in {l,s,z})) then C1C2 -> C1;
            if (((C)*((V)+(C)+)C1V1C2) and (C2 not in {w,x,y})) then C1V1C2 -> C1V1C2e;
        }
    }
}
if (*V*y) then y -> i;

Porter Algorithm – Phase 2
% Phase 2
if ((C)*((V)+(C)+)+(V)*) then select rule with longest suffix {
    ational -> ate;   tional -> tion;   enci -> ence;    anci -> ance;
    izer -> ize;      abli -> able;     alli -> al;      entli -> ent;
    eli -> e;         ousli -> ous;     ization -> ize;  ation -> ate;
    ator -> ate;      alism -> al;      iveness -> ive;  fulness -> ful;
    ousness -> ous;   aliti -> al;      iviti -> ive;    biliti -> ble;
}

Porter Algorithm – Phase 3
% Phase 3
if ((C)*((V)+(C)+)+(V)*) then select rule with longest suffix {
    icate -> ic;   ative -> NIL;   alize -> al;   iciti -> ic;
    ical -> ic;    ful -> NIL;     ness -> NIL;
}

Porter Algorithm – Phase 4
% Phase 4
if ((C)*((V)+(C)+)((V)+(C)+)+(V)*) then select rule with longest suffix {
    al -> NIL;     ance -> NIL;   ence -> NIL;   er -> NIL;
    ic -> NIL;     able -> NIL;   ible -> NIL;   ant -> NIL;
    ement -> NIL;  ment -> NIL;   ent -> NIL;    ou -> NIL;
    ism -> NIL;    ate -> NIL;    iti -> NIL;    ous -> NIL;
    ive -> NIL;    ize -> NIL;
    if (*s or *t) then ion -> NIL;
}

Porter Algorithm – Phase 5
% Phase 5
select rule with longest suffix {
    if ((C)*((V)+(C)+)((V)+(C)+)+(V)*) then e -> NIL;
    if (((C)*((V)+(C)+)(V)*) and not ((*C1V1C2) and (C2 not in {w,x,y}))) then e -> NIL;
}
if ((C)*((V)+(C)+)((V)+(C)+)+V*ll) then ll -> l;
Source: Baeza-Yates & Ribeiro-Neto [1], Appendix (available online)

Stemming
Usefulness for retrieval is controversial
Seems useful for languages such as German (lots of cases, compound nouns, etc.), i.e. languages where it is hard to do successfully!
Hard for large, multilingual data collections (such as the web)
Another problem: usability (wrong results are often hard for users to understand)

Term Processing – Summary
Term processing, e.g.:
- Normalization
- Lexical analysis & tokenization
- Removal of stop words
- Thesaurus
- Stemming (Porter’s Algorithm)
- and much more
How is it usually done? Often depends on the application and data.
Is it useful? How does it influence the result? We need a measure for that!

Evaluation – Precision & Recall
Evaluation of IR systems is often done based on some test data (benchmark tests), i.e. a set of (document, query, relevance) labels
The most important measures to quantify IR performance are Precision and Recall:
PRECISION = (# FOUND & RELEVANT) / (# FOUND)
RECALL = (# FOUND & RELEVANT) / (# RELEVANT)
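In code, with found and relevant as collections of document IDs (the example values below are hypothetical, not from any benchmark):

```python
def precision_recall(found, relevant):
    """Precision = |found & relevant| / |found|; Recall = |found & relevant| / |relevant|."""
    hits = len(set(found) & set(relevant))
    return hits / len(found), hits / len(relevant)

# Hypothetical run: 6 documents returned, 5 relevant in the collection, 2 overlap.
p, r = precision_recall(["B", "E", "F", "G", "D", "H"], ["A", "B", "E", "I", "J"])
print(p, r)  # precision = 2/6, recall = 2/5
```

The two measures pull in opposite directions: returning more documents can only keep or raise recall, but usually lowers precision.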

Precision & Recall – Example (figure): a collection of ten documents A–J; the system returns, in order, B, E, F, G, D, H; precision and recall are computed against the relevant subset using the definitions above.

Relation Term Proc. – Prec. & Rec.
How does term processing influence IR performance?
Generally:
- More specific terms: precision up, recall down
- Broader terms: precision down, recall up
More specifically:
- Thesaurus: recall up, precision down
- Phrases: precision up, recall down
- Stemming: recall up, precision down
- Compound nouns: precision up, recall down
Note: often an increase in recall rather than precision. Maybe a reason why it’s hardly used for web search?

Recap: IR System & Tasks Involved (diagram): the user’s information need is turned into a query; query processing (parsing & term processing) yields the logical view of the information need. On the document side, data is selected for indexing, then parsed and term-processed into the logical view of the documents (the index). Searching and ranking produce result documents, shown via result representation in the user interface; performance evaluation closes the loop.

Documents: Selection & Segmentation
Which documents should be indexed?
Big issue for web search (because we can’t index all of it: too big, changes too quickly, …)
General problems: confidential information, formats, etc.
What is a reasonable unit for indexing?
In most cases: 1 document = 1 file. But:
- What about archives with more than one document in one file (file = folder)?
- What about attachments?
- What about one document split over several files (e.g. HTML version of a PPT file)?


References & Recommended Reading
[1] R. Baeza-Yates, B. Ribeiro-Neto: ‘Modern Information Retrieval’, Addison-Wesley, 1999. Chapter 7 (term processing); Chapters 8.1, 8.2 (index, inverted files).
[2] C. Manning, P. Raghavan, H. Schütze: ‘Introduction to Information Retrieval’ (to appear 2007). Chapter 1 (index, inverted files); Chapter 2 (term processing). Available online at information-retrieval-book.html

Schedule
Next week: 2 lectures (Tuesday & Wednesday)
The following week: lecture & tutorials/lab will (most likely) start
I will put a schedule on the web next week