
Text Operations J. H. Wang Feb. 21, 2008

The Retrieval Process (system diagram): the user need is expressed as a query through the User Interface; Text Operations produce the logical view of the text and the query; Query Operations refine the query using user feedback; Indexing builds the inverted file over the Text Database via the DB Manager Module; Searching retrieves docs from the Index; Ranking orders the retrieved docs into ranked docs.

Outline
Document Preprocessing
Text Compression: skipped
Automatic Indexing (Chap. 9, Salton) –Term Selection

Document Preprocessing
Lexical analysis –Letters, digits, punctuation marks, …
Stopword removal –“the”, “of”, …
Stemming –Prefixes, suffixes
Index term selection –Nouns
Construction of term categorization structures –Thesaurus

Logical View of the Documents (diagram): docs → structure recognition → accents, spacing, etc. → stopwords → noun groups → stemming → manual indexing; the representation moves from document structure and full text down to a set of index terms.

Lexical Analysis
Converting a stream of characters into a stream of words –Recognition of words
–Digits: usually not good index terms Ex: the number of deaths due to car accidents between 1910 and 1989, “510B.C.”, credit card numbers, …
–Hyphens Ex: state-of-the-art, gilt-edge, B-49, …
–Punctuation marks: normally removed entirely Ex: 510B.C., program codes: x.id vs. xid, …
–The case of letters: usually not important Ex: Bank vs. bank, Unix-like operating systems, …

Elimination of Stopwords
Stopwords: words that are too frequent among the documents in the collection and are therefore not good discriminators –Articles, prepositions, conjunctions, … –Some verbs, adverbs, and adjectives
Removal reduces the size of the indexing structure
Stopword removal might reduce recall –Ex: “to be or not to be”
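The lexical-analysis and stopword-removal steps above can be sketched as follows. This is a minimal illustration: the stopword list is a tiny invented sample, not a real system's list, and the tokenizer simply keeps alphabetic runs (so digits and punctuation are dropped, as the previous slide suggests).

```python
import re

# A small illustrative stopword list; real systems use lists of
# hundreds of function words (articles, prepositions, conjunctions, ...).
STOPWORDS = {"the", "of", "to", "be", "or", "not", "and", "a", "in"}

def preprocess(text):
    """Lexical analysis (lowercase alphabetic tokens), then stopword removal."""
    tokens = re.findall(r"[a-z]+", text.lower())  # digits/punctuation dropped
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The number of deaths due to car accidents"))
# A phrase made entirely of stopwords disappears -- the recall hazard noted above:
print(preprocess("to be or not to be"))
```

Note how the second example illustrates the recall problem from the slide: every token is a stopword, so nothing survives to be indexed.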

Stemming
The substitution of words by their respective stems –Ex: plurals, gerund forms, past-tense suffixes, …
A stem is the portion of a word that is left after the removal of its affixes (i.e., prefixes and suffixes) –Ex: connect, connected, connecting, connection, connections
Controversy about the benefits –Useful for improving retrieval performance –Reduces the size of the indexing structure

Stemming
Four types of stemming strategies –Affix removal, table lookup, successor variety, and n-grams (or term clustering)
Suffix removal –Porter’s algorithm (available in the Appendix) –Simplicity and elegance
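A toy suffix-removal stemmer illustrates the idea. This is not Porter's algorithm, which applies several ordered rule sets with conditions on the measure of the remaining stem; the suffix list below is an invented sample, just enough to conflate the "connect" family from the previous slide.

```python
# A toy suffix-removal stemmer (illustrative only; NOT Porter's algorithm).
# Longer suffixes are tried first so "connections" loses "ions", not just "s".
SUFFIXES = ["ions", "ing", "ion", "ed", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

words = ["connect", "connected", "connecting", "connection", "connections"]
print([stem(w) for w in words])  # all five conflate to the stem "connect"
```

A real stemmer needs the ordered rules precisely to avoid overstemming (e.g. stripping "s" from "gas"); the minimum-stem-length check above is a crude stand-in for Porter's measure condition.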

Index Term Selection
Manually or automatically
Identification of noun groups –Most of the semantics is carried by the nouns –Systematic elimination of verbs, adjectives, adverbs, connectives, articles, and pronouns –A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold

Thesauri
Thesaurus: a reference to a treasury of words –A precompiled list of important words in a given domain of knowledge –For each word in this list, a set of related words Ex: synonyms, … –It also involves normalization of the vocabulary, and a structure

Example Entry in Peter Roget’s Thesaurus
cowardly adjective. Ignobly lacking in courage: cowardly turncoats.
Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).

Main Purposes of a Thesaurus
To provide a standard vocabulary for indexing and searching
To assist users with locating terms for proper query formulation
To provide classified hierarchies that allow broadening and narrowing of the current query request according to the user’s needs

Motivation for Building a Thesaurus
Using a controlled vocabulary for indexing and searching –Normalization of indexing concepts –Reduction of noise –Identification of indexing terms with a clear semantic meaning –Retrieval based on concepts rather than on words Ex: term classification hierarchy in Yahoo!

Main Components of a Thesaurus
Index terms: individual words, groups of words, phrases –Concept Ex: “missiles, ballistic” –Definition or explanation Ex: seal (marine animals), seal (documents)
Relationships among the terms –BT (broader), NT (narrower) –RT (related): much more difficult to determine
A layout design for these term relationships –A list or a bi-dimensional display
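The BT/NT relationships above can be sketched as a simple lookup structure supporting the query broadening and narrowing mentioned earlier. The terms and links here are invented for illustration; a real thesaurus would have far richer entries and scope notes.

```python
# A minimal thesaurus sketch: each term maps to broader (BT), narrower (NT),
# and related (RT) terms. Entries and relations are invented for illustration.
thesaurus = {
    "missiles, ballistic": {"BT": ["weapons"], "NT": ["ICBM"], "RT": ["rockets"]},
    "weapons": {"BT": [], "NT": ["missiles, ballistic"], "RT": []},
}

def broaden(term):
    """Follow BT links to widen a query that retrieves too few documents."""
    return thesaurus.get(term, {}).get("BT", [])

def narrow(term):
    """Follow NT links to make a query more specific."""
    return thesaurus.get(term, {}).get("NT", [])

print(broaden("missiles, ballistic"))  # climb to the broader concept
print(narrow("weapons"))               # descend to a narrower concept
```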

Automatic Indexing (Term Selection)

Automatic Indexing
Indexing –Assign identifiers (index terms) to text documents
Identifiers –Single terms vs. term phrases –Controlled vs. uncontrolled vocabularies Ex: instruction manuals, terminological schedules, … –Objective vs. nonobjective text identifiers Objective identifiers follow cataloging rules, e.g., author names, publisher names, dates of publication, …

Two Issues
Issue 1: indexing exhaustivity –Exhaustive: assign a large number of terms –Nonexhaustive: only the main aspects of subject content
Issue 2: term specificity –Broad (generic) terms cannot distinguish relevant from nonrelevant documents –Narrow (specific) terms retrieve relatively fewer documents, but most of them are relevant

Recall vs. Precision
Recall (R) = number of relevant documents retrieved / total number of relevant documents in the collection –The proportion of relevant items retrieved
Precision (P) = number of relevant documents retrieved / total number of documents retrieved –The proportion of retrieved items that are relevant
Example (Venn diagram): for a query, e.g., “Taipei”, the retrieved docs and the relevant docs are overlapping subsets of all docs
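The two ratios can be computed directly from the retrieved set and the relevant set. The document IDs below are hypothetical, standing in for the two overlapping circles of the Venn diagram.

```python
def recall_precision(retrieved, relevant):
    """Recall = |retrieved ∩ relevant| / |relevant|;
    Precision = |retrieved ∩ relevant| / |retrieved|."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant), hits / len(retrieved)

# Hypothetical result for a query such as "Taipei":
retrieved = ["d1", "d2", "d3", "d4"]   # 4 documents returned
relevant  = ["d1", "d2", "d5"]         # 3 relevant documents exist
r, p = recall_precision(retrieved, relevant)
print(r, p)  # recall 2/3 (d5 was missed), precision 2/4 (d3, d4 are noise)
```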

More on Recall/Precision
Simultaneously optimizing both recall and precision is not normally achievable –Narrow, specific terms: precision is favored –Broad, nonspecific terms: recall is favored
When a choice must be made between term specificity and term breadth, the former is generally preferable –High-recall, low-precision output will burden the user –Lack of precision is more easily remedied than lack of recall

Term-Frequency Considerations
Function words –For example, “and”, “of”, “or”, “but”, … –The frequencies of these words are high in all texts
Content words –Words that actually relate to document content –Varying frequencies in the different texts of a collection –Indicate term importance for content

A Frequency-Based Indexing Method
1. Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words
2. Compute the term frequency tf_ij for all remaining terms T_j in each document D_i, specifying the number of occurrences of T_j in D_i
3. Choose a threshold frequency T, and assign to each document D_i all terms T_j for which tf_ij > T
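The three steps can be sketched as follows. The stop list and threshold are illustrative; a real system would use a full stop list and tune T to the collection.

```python
from collections import Counter

STOPLIST = {"the", "of", "and", "a", "to", "on"}  # stand-in for a full stop list

def frequency_index(doc_text, threshold):
    """Steps from the slide: (1) drop stop-list words, (2) count tf_ij,
    (3) keep the terms whose frequency exceeds the threshold T."""
    terms = [w for w in doc_text.lower().split() if w not in STOPLIST]
    tf = Counter(terms)
    return {t: f for t, f in tf.items() if f > threshold}

doc = "the cat sat on the mat and the cat slept"
print(frequency_index(doc, threshold=1))  # only 'cat' occurs more than once
```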

More on Term Frequency
High-frequency terms favor recall –Ex: “apple” –But only if the occurrence frequency is not equally high in other documents
Low-frequency terms favor precision –Ex: “Huntington’s disease” –Able to distinguish the few documents in which they occur from the many from which they are absent

How to Compute the Weight w_ij?
Inverse document frequency idf_j –w_ij = tf_ij × idf_j (TF×IDF)
Term discrimination value dv_j –w_ij = tf_ij × dv_j
Probabilistic term weighting tr_j –w_ij = tf_ij × tr_j
These factors are global properties of terms in a document collection

Inverse Document Frequency
Inverse document frequency (IDF) for term T_j: idf_j = log(N / df_j), where N is the number of documents in the collection and df_j (the document frequency of term T_j) is the number of documents in which T_j occurs
–Fulfils both the recall and the precision requirements –Favors terms that occur frequently in individual documents but rarely in the remainder of the collection

TF×IDF
Weight w_ij of a term T_j in a document D_i: w_ij = tf_ij × idf_j = tf_ij × log(N / df_j)
Procedure: eliminate common function words; compute the value of w_ij for each term T_j in each document D_i; assign to the documents of the collection all terms with sufficiently high (tf × idf) weights
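A minimal tf × idf computation over a toy collection (the documents are invented), following the formula w_ij = tf_ij × log(N / df_j):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute w_ij = tf_ij * log(N / df_j) for a list of tokenized documents."""
    N = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))         # each document counts once per term
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["apple", "pie"], ["apple", "tart"], ["quantum", "pie"]]
w = tfidf(docs)
print(w[0])  # "apple" appears in 2 of 3 docs, so its idf is log(3/2)
```

A term occurring in every document gets weight log(N/N) = 0, matching the intuition that it cannot discriminate between documents.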

Term-Discrimination Value
Useful index terms –Distinguish the documents of a collection from each other
Document space –Each point represents a particular document of the collection –The distance between two points is inversely proportional to the similarity between the respective term assignments –When two documents are assigned very similar term sets, the corresponding points in the document configuration appear close together

A Virtual Document Space (figure): the original state; after assignment of a good discriminator, the points spread apart; after assignment of a poor discriminator, the points move closer together.

Good Term Assignment
When a term is assigned to the documents of a collection, the few documents to which the term is assigned will be distinguished from the rest of the collection
This should increase the average distance between the documents in the collection and hence produce a document space less dense than before

Poor Term Assignment
A high-frequency term is assigned that does not discriminate between the objects of a collection
Its assignment will render the documents more similar to one another
This is reflected in an increase in document space density

Term Discrimination Value
Definition: dv_j = Q − Q_j, where Q and Q_j are the space densities before and after the assignment of term T_j
Q is the average pairwise similarity between all pairs of distinct documents: Q = (1 / N(N−1)) Σ_{i≠k} sim(D_i, D_k)
dv_j > 0: T_j is a good term; dv_j < 0: T_j is a poor term

Variations of Term-Discrimination Value with Document Frequency (figure): low document frequency → dv_j ≈ 0; medium document frequency → dv_j > 0; high document frequency (approaching N) → dv_j < 0

tf_ij × dv_j
w_ij = tf_ij × dv_j, compared with tf_ij × idf_j:
–idf_j decreases steadily with increasing document frequency
–dv_j increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger
Issue: it is inefficient to compute the N(N−1) pairwise similarities

Document Centroid
Document centroid C = (c_1, c_2, c_3, …, c_t), where c_j = (1/N) Σ_i w_ij is the average weight of the j-th term over all N documents
–A “dummy” average document located in the center of the document space
Space density can then be approximated as Q = (1/N) Σ_i sim(C, D_i), avoiding the N(N−1) pairwise comparisons
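A sketch of computing dv_j with the centroid-based density, assuming cosine similarity between weighted term vectors. The toy documents are invented; "density before assignment" is computed by deleting the term from every vector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def density(docs):
    """Space density Q: average similarity of each document to the centroid C."""
    N = len(docs)
    centroid = {}
    for d in docs:
        for t, w in d.items():
            centroid[t] = centroid.get(t, 0.0) + w / N
    return sum(cosine(d, centroid) for d in docs) / N

def discrimination_value(docs, term):
    """dv_j = Q - Q_j: density with term j removed minus density with it assigned."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return density(without) - density(docs)

# "a" occurs everywhere (poor discriminator); "x" occurs once (good one).
docs = [{"a": 1.0, "x": 1.0}, {"a": 1.0, "y": 1.0}, {"a": 1.0, "z": 1.0}]
print(discrimination_value(docs, "x"))  # positive: assigning "x" spreads the space
print(discrimination_value(docs, "a"))  # negative: assigning "a" compresses it
```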

Probabilistic Term Weighting
Goal: make explicit distinctions between occurrences of terms in the relevant and nonrelevant documents of a collection
Definition: given a user query q, and the ideal answer set of relevant documents
From decision theory, the best ranking algorithm for a document D ranks by g(D) = log [Pr(D | rel) / Pr(D | nonrel)] + log [Pr(rel) / Pr(nonrel)]

Probabilistic Term Weighting Pr( rel ), Pr( nonrel ): document’s a priori probabilities of relevance and nonrelevance Pr( D | rel ), Pr( D | nonrel ): occurrence probabilities of document D in the relevant and nonrelevant document sets

Assumptions Terms occur independently in documents

Derivation Process
Under the term-independence assumption, Pr(D | rel) = Π_i Pr(d_i | rel) and Pr(D | nonrel) = Π_i Pr(d_i | nonrel), so the ranking function g(D) decomposes into a sum of per-term contributions

Given a document D = (d_1, d_2, …, d_t), assume d_i is either 0 (absent) or 1 (present)
Pr(x_i = 1 | rel) = p_i, Pr(x_i = 0 | rel) = 1 − p_i
Pr(x_i = 1 | nonrel) = q_i, Pr(x_i = 0 | nonrel) = 1 − q_i
For a specific document D: Pr(D | rel) = Π_i p_i^{d_i} (1 − p_i)^{1 − d_i}, and Pr(D | nonrel) = Π_i q_i^{d_i} (1 − q_i)^{1 − d_i}

Term Relevance Weight
tr_j = log [ p_j (1 − q_j) / ( q_j (1 − p_j) ) ]

Issue: How to Compute p_j and q_j?
p_j = r_j / R, q_j = (df_j − r_j) / (N − R)
–r_j: the number of relevant documents that contain term T_j
–R: the total number of relevant documents
–N: the total number of documents

Estimation of Term Relevance
The occurrence probability of a term in the nonrelevant documents, q_j, is approximated by its occurrence probability in the entire document collection: q_j = df_j / N –The large majority of documents will be nonrelevant to the average query
The occurrence probabilities of the terms in the small number of relevant documents are assumed to be equal, using a constant value p_j = 0.5 for all j
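A sketch of the term-relevance weight under these estimates (the function name and interface are my own). With relevance information it uses p_j = r_j/R and q_j = (df_j − r_j)/(N − R); without it, it falls back to p_j = 0.5 and q_j = df_j/N, which approximates idf_j for large N. In practice, smoothed estimates (e.g. adding 0.5 to the counts) are used to avoid zero or undefined probabilities; that refinement is omitted here.

```python
import math

def term_relevance(df_j, N, r_j=None, R=None):
    """tr_j = log[ p_j (1 - q_j) / ( q_j (1 - p_j) ) ].
    With no relevance information, p_j = 0.5 and q_j = df_j / N,
    which reduces to log((N - df_j) / df_j) ~= idf_j for large N."""
    if r_j is not None and R is not None:
        p = r_j / R                 # relevant docs containing T_j
        q = (df_j - r_j) / (N - R)  # nonrelevant docs containing T_j
    else:
        p = 0.5                     # constant assumed for relevant docs
        q = df_j / N                # whole-collection approximation
    return math.log((p * (1 - q)) / (q * (1 - p)))

# With p_j = 0.5 the weight is close to idf_j when N is large:
N, df = 1_000_000, 100
print(term_relevance(df, N), math.log(N / df))
```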

Comparison
With p_j = 0.5 and q_j = df_j / N, tr_j = log [(N − df_j) / df_j]; when N is sufficiently large, N − df_j ≈ N, so tr_j ≈ log(N / df_j) = idf_j