Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.

Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel

Agenda
IR system model
IP process model
The “bag of words” representation
Evidence for feature assignment
Phrase indexing
Word sense disambiguation
Machine aided indexing

Retrieval System Model
[Diagram: the User performs Query Formulation; Detection runs against an Index built by Indexing the Docs; results flow through Selection, Examination, and Delivery back to the User.]

[Diagram of query processing and document processing: an Information Need leads to Query Formulation and a Query; the Query and a Document each pass through a Representation Function to produce a Query Representation and a Document Representation; a Comparison Function matches the two to yield a Retrieval Status Value, while Human Judgment assesses Utility.]

Process Model Components
Document representation function
–This week
Query representation function
–Next week
Comparison function
–Week after

“Bag of Words” Representation
Simple strategy for representing documents
Count how many times each term occurs
A “term” is any lexical item that you choose
–A fixed-length sequence of characters (an “n-gram”)
–A word (delimited by “white space” or punctuation)
–Some standard “root form” of each word (e.g., a stem)
–A phrase (e.g., phrases listed in a dictionary)
Counts can be recorded in any consistent order

Bag of Words Example
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Indexed Term    Document 1    Document 2
the                  2             2
quick                1             0
brown                1             0
fox                  1             0
over                 1             0
lazy                 1             0
dog                  1             0
back                 1             0
now                  0             1
is                   0             1
time                 0             1
for                  0             1
all                  0             1
good                 0             1
men                  0             1
to                   0             2
come                 0             1
jump                 1             0
aid                  0             1
of                   0             1
their                0             1
party                0             1
Stopword List: the, is, for, to, of
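
A minimal Python sketch of this counting step. The stopword list below is an assumption (the five function words that this slide and the next appear to drop), and no stemming is applied, so "jumped" is not reduced to "jump" as it is on the slide.

```python
import re
from collections import Counter

# Assumed stopword list for illustration; the slide keeps its actual list in a
# separate "Stopword List" box.
STOPWORDS = {"the", "is", "for", "to", "of"}

def bag_of_words(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if len(t) > 1 and t not in STOPWORDS)

doc1 = "The quick brown fox jumped over the lazy dog's back."
doc2 = "Now is the time for all good men to come to the aid of their party."
print(bag_of_words(doc1))  # Counter({'quick': 1, 'brown': 1, 'fox': 1, ...})
print(bag_of_words(doc2))
```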

Boolean Free Text Example
Terms: quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party
[The slide shows a term-by-document incidence table for Doc 1 through Doc 8.]
dog AND fox
–Doc 3, Doc 5
dog NOT fox
–Empty
fox NOT dog
–Doc 7
dog OR fox
–Doc 3, Doc 5, Doc 7
good AND party
–Doc 6, Doc 8
good AND party NOT over
–Doc 6
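
The Boolean operators above map directly onto set operations over each term's posting list. A sketch: the postings below are hypothetical, chosen only so that the answers match the slide, since the slide's full incidence table for Doc 1 through Doc 8 is not reproduced here.

```python
# Hypothetical postings (document IDs per term); only their consistency with
# the slide's answers matters, not the specific values.
postings = {
    "dog":   {3, 5},
    "fox":   {3, 5, 7},
    "good":  {2, 6, 8},
    "party": {6, 8},
    "over":  {1, 8},
}

def term(t):
    """Posting set for a term (empty set if the term never occurs)."""
    return postings.get(t, set())

print(term("dog") & term("fox"))                      # dog AND fox -> {3, 5}
print(term("dog") - term("fox"))                      # dog NOT fox -> set()
print(term("fox") - term("dog"))                      # fox NOT dog -> {7}
print(term("dog") | term("fox"))                      # dog OR fox -> {3, 5, 7}
print(term("good") & term("party"))                   # good AND party -> {6, 8}
print((term("good") & term("party")) - term("over"))  # ...NOT over -> {6}
```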

Evidence for Feature Assignment
Orthographic
–Separate words at white space, …
Statistical
–How often do words appear together? …
Syntactic
–What relationships can a parser discover?
Lexical
–What phrases appear in the dictionary?

Phrase Formation
Two types of phrases
–Compositional: composition of word meanings
–Noncompositional: idiomatic expressions, e.g., “kick the bucket” or “buy the farm”
Three ways to find phrases
–Dictionary lookup
–Parsing
–Cooccurrence

Semantic Phrases
Same idea as longest substring match
–But look for word (not character) sequences
Compile a term list that includes phrases
–Technical terminology can be very helpful
Index any phrase that occurs in the list
Most effective in a limited domain
–Otherwise hard to capture most useful phrases
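
A sketch of the longest-match idea, assuming a small hand-built term list whose entries (single words and phrases) are purely illustrative.

```python
# Term list mixing single words and phrases; entries are illustrative.
TERM_LIST = {
    ("information", "retrieval"),
    ("latent", "semantic", "indexing"),
    ("query",),
    ("index",),
}
MAX_LEN = max(len(t) for t in TERM_LIST)

def index_phrases(tokens):
    """Scan left to right, indexing the longest term-list entry at each point."""
    i, found = 0, []
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in TERM_LIST:
                found.append(" ".join(candidate))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on
    return found

print(index_phrases("the query used latent semantic indexing".split()))
# -> ['query', 'latent semantic indexing']
```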

Statistical Phrases
Compute observed occurrence probability
–For each single word and each word n-gram
–“buy” 10 times in 1000 words yields 0.01
–“the” 100 times in 1000 words yields 0.10
–“farm” 5 times in 1000 words yields 0.005
–“buy the farm” 4 times in 1000 words yields 0.004
Compute n-gram probability if truly independent
–0.01 * 0.10 * 0.005 = 0.000005
Compare with observed probability
–Keep phrases that occur more often than expected
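
With the slide's counts, the observed probability of "buy the farm" (0.004) is hundreds of times larger than the 0.000005 expected under independence, so the phrase is kept. Below is a sketch of that test; the threshold value is an arbitrary choice for illustration.

```python
from collections import Counter

def phrase_candidates(tokens, n=3, threshold=10.0):
    """Keep n-grams whose observed probability exceeds `threshold` times the
    probability expected if their words occurred independently."""
    total = len(tokens)
    unigrams = Counter(tokens)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(total - n + 1))
    kept = {}
    for gram, count in ngrams.items():
        observed = count / total
        expected = 1.0
        for w in gram:
            expected *= unigrams[w] / total
        if observed > threshold * expected:
            kept[" ".join(gram)] = observed / expected
    return kept
```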

Phrase Indexing Lessons
Poorly chosen phrases hurt effectiveness
–And some techniques can be slow (e.g., parsing)
Better to index phrases and words
–Want to find constituents of compositional phrases
Better weighting schemes benefit less
–Negligible improvement in some TREC-6 systems
Very helpful for cross-language retrieval
–Noncompositional translation, less ambiguity

Problems With Word Matching
Word matching suffers from two problems
–Synonymy: many words with similar meanings
–Homonymy: one word has dissimilar meanings
Disambiguation seeks to resolve homonymy
–Index word senses rather than words
Synonymy can be addressed by
–Thesaurus-based query expansion
–Latent semantic indexing
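
For the synonymy side, a tiny sketch of thesaurus-based query expansion: each query term is supplemented with its thesaurus entries before matching. The thesaurus below is made up for illustration.

```python
# Illustrative thesaurus; a real system would use a controlled vocabulary.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "doctor": ["physician"],
}

def expand_query(terms):
    """Return the original query terms plus their thesaurus synonyms."""
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(THESAURUS.get(t, []))
    return expanded

print(expand_query(["car", "repair"]))
# -> ['car', 'automobile', 'vehicle', 'repair']
```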

Word Sense Disambiguation
Context provides clues to word meaning
–“The doctor removed the appendix.”
For each occurrence, note surrounding words
–Typically +/- 5 non-stopwords
Group similar contexts into clusters
–Based on overlaps in the words that they contain
Separate clusters represent different senses

Disambiguation Example
Consider four example sentences
–The doctor removed the appendix
–The appendix was incomprehensible
–The doctor examined the appendix
–The appendix was removed
What clusters can you find?
Can you find enough word senses this way?
Might you find too many word senses?
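
A rough sketch of the clustering procedure from the previous slide, applied to these four sentences. The stopword list and the minimum-overlap rule are illustrative assumptions, not the lecture's settings.

```python
STOPWORDS = {"the", "a", "an", "was", "is", "of"}

def context(tokens, i, width=5):
    """Non-stopwords within +/- width positions of tokens[i]."""
    lo, hi = max(0, i - width), min(len(tokens), i + width + 1)
    return {w for j, w in enumerate(tokens[lo:hi], start=lo)
            if j != i and w not in STOPWORDS}

def cluster_contexts(contexts, min_overlap=1):
    """Greedily merge each context into the first cluster it overlaps."""
    clusters = []
    for ctx in contexts:
        for cluster in clusters:
            if len(ctx & cluster) >= min_overlap:
                cluster |= ctx
                break
        else:
            clusters.append(set(ctx))
    return clusters

sentences = [
    "the doctor removed the appendix",
    "the appendix was incomprehensible",
    "the doctor examined the appendix",
    "the appendix was removed",
]
contexts = [context(s.split(), s.split().index("appendix")) for s in sentences]
print(cluster_contexts(contexts))  # two clusters, i.e., two candidate senses
```

This toy run yields two clusters, and the ambiguous fourth sentence is pulled into the first cluster through "removed", which suggests why the questions above do not have easy answers.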

Why Disambiguation Hurts
Bag-of-words techniques already disambiguate
–When more words are present, documents rank higher
–So a context for each term is established in the query
Same reason that passages are better than documents
Formal disambiguation tries to improve precision
–But incorrect sense assignments would hurt recall
Average precision balances recall and precision
–But the possible precision gains are small
–And present techniques substantially hurt recall

Machine Assisted Indexing
Goal: Automatically suggest descriptors
–Better consistency with lower cost
Chosen by a rule-based expert system
–Design thesaurus by hand in the usual way
–Design an expert system to process text
String matching, proximity operators, …
–Write rules for each thesaurus/collection/language
–Try it out and fine tune the rules by hand

Machine Assisted Indexing Example
//TEXT: science
IF (all caps)
    USE research policy
    USE community program
ENDIF
IF (near “Technology” AND with “Development”)
    USE community development
    USE development aid
ENDIF
near: within 250 words
with: in the same sentence
Access Innovations system:
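
A rough Python approximation of how rules like these might be evaluated. The NEAR and WITH operators follow the definitions given on the slide (within 250 words; same sentence), but the rule semantics and the hard-coded rules below are guesses made only for illustration.

```python
import re

def words(text):
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]

def near(text, a, b, window=250):
    """True if a and b occur within `window` words of each other."""
    toks = words(text)
    pos_a = [i for i, w in enumerate(toks) if w == a.lower()]
    pos_b = [i for i, w in enumerate(toks) if w == b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

def with_(text, a, b):
    """True if a and b occur in the same sentence (crude sentence split)."""
    return any(a.lower() in s.lower() and b.lower() in s.lower()
               for s in re.split(r"[.!?]", text))

def suggest_descriptors(text):
    suggested = []
    if "SCIENCE" in text:  # one reading of the rule's 'all caps' condition
        suggested += ["research policy", "community program"]
    if near(text, "science", "technology") and with_(text, "science", "development"):
        suggested += ["community development", "development aid"]
    return suggested

print(suggest_descriptors("Science and Technology for Development was discussed."))
# -> ['community development', 'development aid']
```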

Text Categorization
Goal: fully automatic descriptor assignment
Machine learning approach
–Assign descriptors manually for a “training set”
–Design a learning algorithm to find and use patterns
Bayesian classifier, neural network, genetic algorithm, …
–Present new documents
System assigns descriptors like those in training set
Tom Mitchell described an example of this
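
A minimal sketch of the machine learning approach, using a multinomial Naive Bayes classifier from scikit-learn. The training texts and descriptors are invented, and only one descriptor is assigned per document here, whereas real indexing usually assigns several.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training set: document texts paired with manually assigned descriptors.
train_texts = [
    "funding for university research laboratories",
    "grants to support scientific research programs",
    "village wells and local farming cooperatives",
    "rural infrastructure and community projects",
]
train_labels = [
    "research policy", "research policy",
    "community development", "community development",
]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# New documents get the descriptor whose training examples they most resemble.
print(model.predict(["a proposal for rural community water projects"]))
```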