
1 Feature Assignment
LBSC 878, February 22, 1999
Douglas W. Oard and Dagobert Soergel

2 Agenda
- IR system model
- IP process model
- The “bag of words” representation
- Evidence for feature assignment
- Phrase indexing
- Word sense disambiguation
- Machine aided indexing

3 Retrieval System Model
[Diagram: components include the user, query formulation, detection, selection, examination, and delivery on the query side, and docs, indexing, and the index on the document side.]

4 [Diagram: detailed process model. Components: information need, query formulation, query, query processing (representation function), query representation; document, document processing (representation function), document representation; comparison function, retrieval status value, human judgment, utility.]

5 Process Model Components
- Document representation function
  – This week
- Query representation function
  – Next week
- Comparison function
  – Week after

6 “Bag of Words” Representation
- Simple strategy for representing documents
- Count how many times each term occurs
- A “term” is any lexical item that you choose
  – A fixed-length sequence of characters (an “n-gram”)
  – A word (delimited by “white space” or punctuation)
  – Some standard “root form” of each word (e.g., a stem)
  – A phrase (e.g., phrases listed in a dictionary)
- Counts can be recorded in any consistent order
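A minimal sketch of this counting strategy in Python, using words delimited by white space and punctuation as the terms; the stopword list and example document are illustrative, and a real system would plug in whatever term definition it chose.

import re
from collections import Counter

STOPWORDS = {"the", "is", "of", "to", "for"}  # illustrative stopword list

def bag_of_words(text):
    """Count how many times each indexed term occurs in a document."""
    # Orthographic evidence: split at white space/punctuation and lowercase.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Drop stopwords; every remaining token becomes an indexed term.
    return Counter(t for t in tokens if t not in STOPWORDS)

print(bag_of_words("The quick brown fox jumped over the lazy dog's back."))
# Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jumped': 1, 'over': 1, ...})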

7 Bag of Words Example
Document 1: “The quick brown fox jumped over the lazy dog’s back.”
Document 2: “Now is the time for all good men to come to the aid of their party.”
Stopword list: the, is, for, to, of

Indexed Term   Document 1   Document 2
quick               1            0
brown               1            0
fox                 1            0
over                1            0
lazy                1            0
dog                 1            0
back                1            0
now                 0            1
time                0            1
all                 0            1
good                0            1
men                 0            1
come                0            1
jump                1            0
aid                 0            1
their               0            1
party               0            1

8 Boolean Free Text Example
[Table: 0/1 term–document incidence matrix for the same 17 indexed terms (quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party) across eight documents, Doc 1 through Doc 8.]

Example Boolean queries and the documents they retrieve:
- dog AND fox
  – Doc 3, Doc 5
- dog NOT fox
  – Empty
- fox NOT dog
  – Doc 7
- dog OR fox
  – Doc 3, Doc 5, Doc 7
- good AND party
  – Doc 6, Doc 8
- good AND party NOT over
  – Doc 6
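A minimal sketch of Boolean retrieval over term–document sets. The postings below are hypothetical, chosen only so that the example queries return the same documents as on the slide; they are not the slide's actual incidence matrix.

# Hypothetical postings: each term maps to the set of documents containing it.
postings = {
    "dog":   {3, 5},
    "fox":   {3, 5, 7},
    "good":  {6, 8},
    "party": {6, 8},
    "over":  {1, 8},
}

def docs(term):
    return postings.get(term, set())

print(docs("dog") & docs("fox"))                      # dog AND fox -> {3, 5}
print(docs("dog") - docs("fox"))                      # dog NOT fox -> set()
print(docs("fox") - docs("dog"))                      # fox NOT dog -> {7}
print(docs("dog") | docs("fox"))                      # dog OR fox  -> {3, 5, 7}
print(docs("good") & docs("party"))                   # good AND party -> {6, 8}
print((docs("good") & docs("party")) - docs("over"))  # good AND party NOT over -> {6}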

9 Evidence for Feature Assignment
- Orthographic
  – Separate words at white space, …
- Statistical
  – How often do words appear together? …
- Syntactic
  – What relationships can a parser discover?
- Lexical
  – What phrases appear in the dictionary?

10 Phrase Formation
- Two types of phrases
  – Compositional: composition of word meanings
  – Noncompositional: idiomatic expressions, e.g., “kick the bucket” or “buy the farm”
- Three ways to find phrases
  – Dictionary lookup
  – Parsing
  – Cooccurrence

11 Semantic Phrases
- Same idea as longest substring match
  – But look for word (not character) sequences
- Compile a term list that includes phrases
  – Technical terminology can be very helpful
- Index any phrase that occurs in the list
- Most effective in a limited domain
  – Otherwise hard to capture most useful phrases
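A minimal sketch of indexing phrases from a term list by longest match over word (not character) sequences; the phrase list here is a made-up stand-in for a domain term list.

# Hypothetical phrase list; in practice it would come from a domain thesaurus.
PHRASES = {("information", "retrieval"), ("word", "sense", "disambiguation")}
MAX_LEN = max(len(p) for p in PHRASES)

def index_phrases(tokens):
    """Greedy longest-match phrase lookup over a token sequence."""
    terms, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase starting here, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in PHRASES:
                terms.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            terms.append(tokens[i])  # no listed phrase starts here; index the word
            i += 1
    return terms

print(index_phrases("word sense disambiguation helps information retrieval".split()))
# ['word sense disambiguation', 'helps', 'information retrieval']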

12 Statistical Phrases
- Compute observed occurrence probability
  – For each single word and each word n-gram
  – “buy” 10 times in 1000 words yields 0.01
  – “the” 100 times in 1000 words yields 0.10
  – “farm” 5 times in 1000 words yields 0.005
  – “buy the farm” 4 times in 1000 words yields 0.004
- Compute n-gram probability if truly independent
  – 0.01 * 0.10 * 0.005 = 0.000005
- Compare with observed probability
  – Keep phrases that occur more often than expected
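A worked sketch of the comparison above: score an n-gram by the ratio of its observed probability to the probability expected if its words were independent, and keep it as a phrase when the ratio is large. The cutoff value is an assumption for illustration only.

from math import prod

def phrase_score(ngram_count, word_counts, total_words):
    """Observed n-gram probability divided by the independence estimate."""
    observed = ngram_count / total_words
    expected = prod(count / total_words for count in word_counts)
    return observed / expected

# Numbers from the slide: "buy the farm" 4 times, "buy" 10, "the" 100, "farm" 5,
# all in a 1000-word sample.
score = phrase_score(4, [10, 100, 5], 1000)
print(score)         # 0.004 / 0.000005 = 800.0
print(score > 10.0)  # True with an illustrative cutoff of 10: keep the phrase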

13 Phrase Indexing Lessons
- Poorly chosen phrases hurt effectiveness
  – And some techniques can be slow (e.g., parsing)
- Better to index both phrases and their constituent words
  – Want to find constituents of compositional phrases
- Better weighting schemes benefit less from phrases
  – Negligible improvement in some TREC-6 systems
- Very helpful for cross-language retrieval
  – Noncompositional translation, less ambiguity

14 Problems With Word Matching
- Word matching suffers from two problems
  – Synonymy: many words with similar meanings
  – Homonymy: one word with dissimilar meanings
- Disambiguation seeks to resolve homonymy
  – Index word senses rather than words
- Synonymy can be addressed by
  – Thesaurus-based query expansion
  – Latent semantic indexing
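A minimal sketch of thesaurus-based query expansion, the first of the two responses to synonymy listed above; the synonym table is hypothetical.

# Hypothetical thesaurus mapping a term to terms with similar meanings.
THESAURUS = {
    "car": ["automobile", "auto"],
    "doctor": ["physician"],
}

def expand_query(terms):
    """Add each query term's thesaurus synonyms to the query."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(THESAURUS.get(term, []))
    return expanded

print(expand_query(["car", "repair"]))
# ['car', 'automobile', 'auto', 'repair']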

15 Word Sense Disambiguation
- Context provides clues to word meaning
  – “The doctor removed the appendix.”
- For each occurrence, note surrounding words
  – Typically +/- 5 non-stopwords
- Group similar contexts into clusters
  – Based on overlaps in the words that they contain
- Separate clusters represent different senses
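A minimal sketch of the clustering idea above, assuming a small stopword list and simple single-link grouping by word overlap; the window size, overlap threshold, and example sentences are illustrative.

STOPWORDS = {"the", "was", "a", "of"}

def context(tokens, i, window=5):
    """Non-stopword words within roughly +/- `window` positions of token i."""
    nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    return {w for w in nearby if w not in STOPWORDS}

def cluster_contexts(contexts, min_overlap=1):
    """Greedy single-link clustering: join a context to the first cluster it overlaps."""
    clusters = []
    for ctx in contexts:
        for cluster in clusters:
            if any(len(ctx & other) >= min_overlap for other in cluster):
                cluster.append(ctx)
                break
        else:
            clusters.append([ctx])  # overlaps no existing cluster: a candidate new sense
    return clusters

sentences = [
    "the doctor removed the appendix",
    "the appendix was incomprehensible",
    "the doctor examined the appendix",
]
contexts = [context(s.split(), s.split().index("appendix")) for s in sentences]
print(len(cluster_contexts(contexts)))  # 2: a medical sense and a document sense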

16 Disambiguation Example
- Consider four example sentences
  – The doctor removed the appendix
  – The appendix was incomprehensible
  – The doctor examined the appendix
  – The appendix was removed
- What clusters can you find?
- Can you find enough word senses this way?
- Might you find too many word senses?

17 Why Disambiguation Hurts
- Bag-of-words techniques already disambiguate
  – When more words are present, documents rank higher
  – So a context for each term is established in the query
  – Same reason that passages are better than documents
- Formal disambiguation tries to improve precision
  – But incorrect sense assignments would hurt recall
- Average precision balances recall and precision
  – But the possible precision gains are small
  – And present techniques substantially hurt recall

18 Machine Assisted Indexing
- Goal: Automatically suggest descriptors
  – Better consistency with lower cost
- Chosen by a rule-based expert system
  – Design thesaurus by hand in the usual way
  – Design an expert system to process text (string matching, proximity operators, …)
  – Write rules for each thesaurus/collection/language
  – Try it out and fine tune the rules by hand

19 Machine Assisted Indexing Example
Rules from the Access Innovations system:

//TEXT: science
IF (all caps)
  USE research policy
  USE community program
ENDIF
IF (near “Technology” AND with “Development”)
  USE community development
  USE development aid
ENDIF

near: within 250 words
with: in the same sentence
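A minimal sketch of how such rules could be evaluated, with the near and with operators implemented as defined on the slide (within 250 words, and in the same sentence). This is an illustrative reimplementation, not the Access Innovations system, and the rule encoded below corresponds to the slide's second IF block.

import re

def near(text, term_a, term_b, window=250):
    """True if term_a and term_b occur within `window` words of each other."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w.startswith(term_a.lower())]
    pos_b = [i for i, w in enumerate(words) if w.startswith(term_b.lower())]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

def with_(text, term_a, term_b):
    """True if term_a and term_b occur in the same sentence."""
    sentences = re.split(r"[.!?]", text.lower())
    return any(term_a.lower() in s and term_b.lower() in s for s in sentences)

def suggest_descriptors(text):
    """Hypothetical rule for 'science': near 'Technology' AND with 'Development'."""
    descriptors = []
    if near(text, "science", "technology") and with_(text, "science", "development"):
        descriptors += ["community development", "development aid"]
    return descriptors

print(suggest_descriptors("Science and technology drive economic development."))
# ['community development', 'development aid']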

20 Text Categorization
- Goal: fully automatic descriptor assignment
- Machine learning approach
  – Assign descriptors manually for a “training set”
  – Design a learning algorithm to find and use the patterns (Bayesian classifier, neural network, genetic algorithm, …)
  – Present new documents; the system assigns descriptors like those in the training set
- Tom Mitchell described an example of this
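A minimal sketch of this approach using a Bayesian classifier from scikit-learn; the tiny training set and descriptor labels are made-up placeholders, not real training data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training set: documents with manually assigned descriptors.
train_docs = [
    "the patient's appendix was removed by the surgeon",
    "the appendix of the report lists all data tables",
    "the doctor examined the patient in the clinic",
    "the report summarizes development aid spending",
]
train_labels = ["medicine", "documents", "medicine", "documents"]

# Bag-of-words features feeding a Bayesian classifier.
vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(train_docs), train_labels)

# Present a new document; the system assigns a descriptor like those in training.
new_doc = ["the surgeon examined the appendix"]
print(classifier.predict(vectorizer.transform(new_doc)))  # e.g. ['medicine']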

