The Wharton School of the University of Pennsylvania, OPIM 101, 2/16/1998
The Information Retrieval Problem


1 The Information Retrieval Problem
- The IR problem is very hard
- Why? Many reasons, including:
  - Documents are not (very) structured
    - Database searches vs. document base searches
  - Language is not (very) cooperative
    - DNA: microbiology or DEC Network Architecture?
    - Free rider: game theory or urban transportation systems?
    - Corporate memory or organizational memory?
- Physical access vs. logical access
  - Physical: relatively easy
  - Logical: terribly difficult

2 The Information Retrieval Problem
- Kinds of information searches
  - Framework from David Blair
    - Search exhaustivity makes it difficult to determine whether all relevant documents were retrieved
    - Database size as a framework for text retrieval systems (greater than 250,000 pages of text)
- Distinctions
  - Large vs. small (document) databases
  - Exhaustive vs. sample searches
  - Content vs. context searches
    - "Blair and Maron 1985" vs. "left-hand side of the page, in the middle of a red book"

3 The Information Retrieval Problem: Basic IR Technology
- Your basic IR technology
  - Full text or keyword retrieval, with
  - Boolean combinations and
  - Location indicators
- Full text: has everything
  - Or does it?
- Keyword indexing
  - Requires work
- Boolean combination of words
  - Usual Boolean operators: AND, OR, NOT
  - This is a logically complete set
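The Boolean combination of words on this slide can be sketched in a few lines of Python. This is only an illustration, not the system discussed in the course: the three-document collection, the `build_index` and `term` helpers, and the function names `AND`, `OR`, and `NOT` are all invented here. Each word maps to the set of documents containing it, so the three operators become set intersection, union, and complement.

```python
from collections import defaultdict

# Hypothetical three-document collection, invented for illustration.
DOCS = {
    1: "DNA replication in microbiology",
    2: "DEC network architecture overview",
    3: "DNA network protocols from DEC",
}

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)

def term(word):
    """Documents containing the word (empty set if the word was never indexed)."""
    return INDEX.get(word.lower(), set())

def AND(a, b):
    return a & b          # documents in both sets

def OR(a, b):
    return a | b          # documents in either set

def NOT(a):
    return ALL_IDS - a    # documents not in the set

# "dna AND NOT microbiology": an attempt to pick out the networking sense of DNA.
result = AND(term("dna"), NOT(term("microbiology")))
print(sorted(result))
```

Even in this toy collection the query only works because the searcher guessed which word separates the two senses of "DNA", which is the difficulty with uncooperative language raised on the first slide.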

4 Web Search Engines - Indexing retrieval algorithms
- Manual indexing along common themes (www.yahoo.com)
- Weight each word numerically (eliminate common words such as of, that, and, etc.)
- Some weight words in the section or in the URL higher
- Some weight the first word in the query higher than the second, and so on
- Retrieve all documents that match the query (typically a Boolean query)
- Count frequency of word occurrences (the Stroud Corporation example: publishers "game" the indexing algorithm)
- Add up word weights for each document, reflecting the word frequency
- Search engines do not index words in graphics (gif and jpg files)
- Infoseek, Lycos and Yahoo offer multilingual queries
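The weighting steps listed on this slide (drop common words, count word frequency, add up the weights per document) might look roughly like the following Python sketch. It does not reproduce any real search engine's algorithm. The stop-word list, the `title_boost` parameter, the choice of a page title as the part of the page that is weighted higher, and the sample pages are all assumptions made for illustration.

```python
from collections import Counter

# Assumed stop-word list; real engines used their own lists.
STOP_WORDS = {"of", "that", "and", "the", "a", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop common words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def score(query, body, title="", title_boost=2.0):
    """Add up query-word weights: body frequency plus boosted title frequency."""
    body_counts = Counter(tokenize(body))
    title_counts = Counter(tokenize(title))
    total = 0.0
    for word in set(tokenize(query)):
        total += body_counts[word] + title_boost * title_counts[word]
    return total

def rank(query, pages):
    """pages maps a name to (title, body); return matching names, best score first."""
    scored = [(score(query, body, title), name)
              for name, (title, body) in pages.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for s, name in ordered if s > 0]

# Hypothetical pages; "c" repeats one word to inflate its frequency count.
PAGES = {
    "a": ("Game theory", "the free rider problem in game theory"),
    "b": ("Transit", "free rider counts and the cost of transit"),
    "c": ("Spam", "game game game game game"),
}
print(rank("free rider game", PAGES))
```

In this toy data, page "c" ties with the genuinely relevant page "a" simply by repeating one query word, which illustrates how a publisher can "game" a frequency-based index.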

5 Web Search Engines - Metasearches
- Advantages
  - Query is sent to multiple search engines simultaneously
  - Results are grouped, aggregated, and sorted with duplicates removed
  - Often adds new metatitles to help categorize the sites
- Disadvantages
  - Returns much less information about each site
  - Omits unique sites only found by particular nuances of a particular query engine
  - It is very difficult to formulate complex queries
- Examples
  - www.inference.com
  - www.web-search.com/savvy.html/
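The aggregation step of a metasearch (merge the result lists, remove duplicates, sort) can be sketched as below. This is a hedged illustration only: the engine names, the example URLs, the crude URL normalization, and the reciprocal-rank scoring rule are all assumptions, and the sketch takes the per-engine result lists as given rather than querying anything simultaneously.

```python
def merge_results(results_by_engine):
    """Combine ranked URL lists from several engines, dropping duplicates.

    Each URL is scored by the sum of its reciprocal ranks across engines,
    an aggregation rule assumed here purely for illustration.
    """
    scores = {}
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            # Crude normalization so trivially different URLs count as duplicates.
            key = url.lower().rstrip("/")
            scores[key] = scores.get(key, 0.0) + 1.0 / rank
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda u: (-scores[u], u))

# Hypothetical result lists from two made-up engines.
RESULTS = {
    "engine_a": ["http://example.com/ir", "http://example.org/recall"],
    "engine_b": ["http://example.org/recall/", "http://example.net/boolean"],
}
print(merge_results(RESULTS))
```

Note that the merged list carries only URLs: whatever summaries or scores each engine returned are lost, which mirrors the first disadvantage listed on the slide.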

6 The Information Retrieval Problem: Probability of Retrieving a Relevant Document
- P(word_1) = 0.6: probability the searcher uses word 1 in a query
- P(word_2) = 0.5: probability the searcher uses word 2 in a query
- P(Doc_word_1) = 0.7: probability word 1 is in a relevant document
- P(Doc_word_2) = 0.6: probability word 2 is in a relevant document
- The probability of the searcher using word 1 in a query and word 1 being in a relevant document is
  P(word_1) x P(Doc_word_1) = 0.6 x 0.7 = 0.42
- The probability of the searcher using word 2 in a query and word 2 being in a relevant document is
  P(word_2) x P(Doc_word_2) = 0.5 x 0.6 = 0.30
- The probability of the searcher using word 1 and word 2 in a query and both word 1 and word 2 being in a relevant document is
  P(word_1) x P(Doc_word_1) x P(word_2) x P(Doc_word_2) = 0.6 x 0.7 x 0.5 x 0.6 = 0.126
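The arithmetic on this slide can be checked with a short Python sketch. The dictionary names `p_query` and `p_doc` and the helper `p_match` are invented for illustration. Multiplying the probabilities, as the slide does, treats the searcher's word choice and the document's wording as independent events.

```python
import math

# Probabilities taken from the slide.
p_query = {"word1": 0.6, "word2": 0.5}   # searcher uses the word in a query
p_doc = {"word1": 0.7, "word2": 0.6}     # word appears in a relevant document

def p_match(words):
    """Probability that every listed word is both used in the query and
    present in a relevant document, assuming all events are independent."""
    return math.prod(p_query[w] * p_doc[w] for w in words)

for words in (["word1"], ["word2"], ["word1", "word2"]):
    print(words, round(p_match(words), 3))
```

Each additional word required in the query multiplies in two more factors below 1, so the chance of matching a relevant document falls quickly as a conjunctive query gets longer.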

7 The Information Retrieval Problem: Basic IR Technology
- Recall measures how well all relevant documents are retrieved (x / n_2)
- Precision measures how well only relevant documents are retrieved (x / n_1)
- Here x is the number of documents that are both relevant and retrieved, n_1 is the number of documents retrieved, and n_2 is the number of relevant documents
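A minimal Python sketch of the two measures follows. It reads x as the number of documents that are both relevant and retrieved, n1 as the number retrieved, and n2 as the number relevant, an interpretation based on the standard definitions of recall and precision. The function name and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    x = len(retrieved & relevant)   # relevant and retrieved
    n1 = len(retrieved)             # everything the search returned
    n2 = len(relevant)              # everything that should have been found
    precision = x / n1 if n1 else 0.0
    recall = x / n2 if n2 else 0.0
    return precision, recall

# Hypothetical search: 4 documents retrieved, 10 relevant, 2 in common.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up example half of what was retrieved is relevant, yet only a fifth of the relevant documents were found. The searcher sees only the retrieved set, so precision is easy to judge while recall is not.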

8 The Information Retrieval Problem: Basic IR Technology
[Figure: diagram with region labels: relevant; retrieved; not relevant not retrieved; relevant and retrieved]
- When and where and how does the recall vs. precision distinction matter?
- How well does full text retrieval work?

9 The Information Retrieval Problem: Summary of Blair and Maron Study
- Searchers perceived their searches as exhaustive (recall > 75%); actual recall was 20%
- No significant difference between the searching ability of lawyers and paralegals
- Searchers were only able to anticipate a small number of the words and phrases that could be used to retrieve relevant documents and would not be in irrelevant documents
- Extraordinary and unpredictable variability in the words and phrases used to discuss the same topics (e.g., the accident in the litigation was referred to as "situation", "difficulty", "event", "what happened last week", and "we all know why we are here")

