The Information Retrieval Problem (The Wharton School of the University of Pennsylvania, OPIM 101, 2/16/1998)

Presentation transcript:

Slide 1: The Information Retrieval Problem
- The IR problem is very hard
- Why? Many reasons, including:
  - Documents are not (very) structured
    - Database searches vs document base searches
  - Language is not (very) cooperative
    - DNA: microbiology or DEC Network Architecture?
    - Free rider: game theory or urban transportation systems?
    - Corporate memory or organizational memory?
- Physical access vs logical access
  - Physical: relatively easy
  - Logical: terribly difficult

Slide 2: The Information Retrieval Problem
- Kinds of information searches
  - Framework from David Blair
    - Search exhaustivity makes it difficult to determine whether all relevant documents were retrieved
    - Data base size as a framework for text retrieval systems (greater than 250,000 pages of text)
- Distinctions
  - Large vs small (document) data bases
  - Exhaustive vs sample searches
  - Content vs context searches
    - Blair and Maron 1985 vs left hand side of page in the middle of a red book

Slide 3: The Information Retrieval Problem: Basic IR Technology
- Your basic IR technology
  - Full text or keyword retrieval, with Boolean combinations and location indicators
- Full text: has everything
  - Or does it?
- Keyword indexing
  - Requires work
- Boolean combination of words
  - Usual Boolean operators: AND, OR, NOT
  - This is a logically complete set
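The Boolean keyword retrieval described on this slide can be sketched in a few lines of Python. This is only an illustration, not the system the lecture had in mind: the three sample documents, the `build_index` and `term` helpers, and whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical document base (not from the slides).
docs = {
    1: "dna sequencing in microbiology",
    2: "dec network architecture dna overview",
    3: "free rider problem in game theory",
}
index = build_index(docs)
all_ids = set(docs)

def term(word):
    """Return the ids of documents containing the word (empty set if none)."""
    return index.get(word.lower(), set())

# AND is set intersection, OR is set union, NOT is complement against all ids.
dna_not_network = term("dna") & (all_ids - term("network"))
dna_or_game = term("dna") | term("game")

print(dna_not_network)
print(dna_or_game)
```

The query `dna AND NOT network` shows how a searcher might try to separate the microbiology sense of DNA from DEC Network Architecture, which only works if the searcher anticipates the distinguishing word.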

Slide 4: Web Search Engines - Indexing retrieval algorithms
- Manual indexing along common themes
- Weight each word numerically (eliminate common words such as of, that, and, etc.)
- Some weight words in the section or in the URL higher.
- Some weight the first word in the query higher than the second, and so on.
- Retrieve all documents that match the query (typically a Boolean query)
- Count frequency of word occurrences (The Stroud Corporation example: publishers "game" the indexing algorithm)
- Add up word weights for each document, reflecting the word frequency
- Search engines do not index words in graphics (gif and jpg files)
- Infoseek, Lycos and Yahoo offer multilingual queries
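The match-then-weight procedure on this slide might look roughly like the following. It is a sketch under stated assumptions rather than any real engine's algorithm: the stop-word list, the `boost` factor of 2.0, the sample pages, and the idea of a single "boosted" text field per page (standing in for whatever field an engine weights higher) are all hypothetical.

```python
# Illustrative stop-word list (assumption; real engines used their own lists).
STOP_WORDS = {"of", "that", "and", "the", "a", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop common words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def score(query, boosted_text, body, boost=2.0):
    """Add up word-frequency weights; hits in the boosted field count extra."""
    boosted_words = tokenize(boosted_text)
    body_words = tokenize(body)
    total = 0.0
    for word in tokenize(query):
        total += body_words.count(word)
        total += boost * boosted_words.count(word)
    return total

def search(query, pages):
    """Keep pages containing every query word (Boolean AND), ranked by score."""
    wanted = set(tokenize(query))
    hits = []
    for url, (boosted_text, body) in pages.items():
        present = set(tokenize(boosted_text + " " + body))
        if wanted <= present:
            hits.append((score(query, boosted_text, body), url))
    return [url for _, url in sorted(hits, reverse=True)]

# Hypothetical pages: url -> (higher-weighted text, body text).
pages = {
    "a.example": ("Chess openings", "chess game theory and chess game practice"),
    "b.example": ("Game reviews", "a review of the new board game"),
    "c.example": ("Gardening", "soil and water"),
}
ranking = search("game", pages)
print(ranking)
```

Because the score grows with raw word counts, a page that simply repeats a word many times climbs the ranking, which is the kind of "gaming" the slide attributes to publishers.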

Slide 5: Web Search Engines - Metasearches
Advantages
- Query is sent to multiple search engines simultaneously
- Results are grouped, aggregated, and sorted with duplicates removed
- Often adds new metatitles to help categorize the sites
Disadvantages
- Returns much less information about each site
- Omits unique sites only found by particular nuances of a particular query engine
- It is very difficult to formulate complex queries
Examples (missing from the transcript)
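The aggregation step of a metasearch can be sketched as below. Everything here is an assumption for illustration: the engines are stand-in Python callables that return ranked URL lists, they are queried one after another rather than simultaneously, and the sort rule (more engines first, then best rank) is just one plausible choice.

```python
def metasearch(query, engines):
    """Query every engine, merge the result lists, and remove duplicates."""
    merged = {}  # url -> (best rank seen, names of engines that returned it)
    for name, engine in engines.items():
        for rank, url in enumerate(engine(query)):
            best, sources = merged.get(url, (rank, []))
            merged[url] = (min(best, rank), sources + [name])
    # URLs returned by more engines come first, then better rank, then URL.
    return sorted(merged, key=lambda u: (-len(merged[u][1]), merged[u][0], u))

# Hypothetical engines with fixed results (not real services).
engines = {
    "engine_a": lambda q: ["x.example", "y.example"],
    "engine_b": lambda q: ["y.example", "z.example"],
}
combined = metasearch("information retrieval", engines)
print(combined)
```

Note that only URLs survive the merge, which mirrors the first disadvantage on the slide: the combined list carries much less information about each site than the individual engines return.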

Slide 6: The Information Retrieval Problem: Probability of Retrieving a Relevant Document
P(word 1) = 0.6: probability searcher uses word 1 in a query
P(word 2) = 0.5: probability searcher uses word 2 in a query
P(Doc_word 1) = 0.7: probability word 1 is in relevant document
P(Doc_word 2) = 0.6: probability word 2 is in relevant document
The probability of the searcher using word 1 in a query and word 1 being in a relevant document is
P(word 1) x P(Doc_word 1) = 0.6 x 0.7 = 0.42
The probability of the searcher using word 2 in a query and word 2 being in a relevant document is
P(word 2) x P(Doc_word 2) = 0.5 x 0.6 = 0.30
The probability of the searcher using word 1 and word 2 in a query and both word 1 and word 2 being in a relevant document is
P(word 1) x P(Doc_word 1) x P(word 2) x P(Doc_word 2) = 0.6 x 0.7 x 0.5 x 0.6 = 0.126
(Multiplying the probabilities this way treats the events as independent.)
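The arithmetic on this slide can be checked with a short script. The dictionary names and helper functions are made up for the example, and the calculation assumes, as the slide's multiplication implies, that the events are independent.

```python
# Probabilities taken from the slide.
p_word = {"word1": 0.6, "word2": 0.5}       # searcher uses the word in a query
p_doc_word = {"word1": 0.7, "word2": 0.6}   # word appears in a relevant document

def p_match(word):
    """Probability the searcher uses the word and a relevant document contains it."""
    return p_word[word] * p_doc_word[word]

def p_all_match(words):
    """Probability that every word matches, assuming independence."""
    result = 1.0
    for word in words:
        result *= p_match(word)
    return result

print(round(p_match("word1"), 3))
print(round(p_match("word2"), 3))
print(round(p_all_match(["word1", "word2"]), 3))
```

Each additional query word multiplies in another factor below 1, so the chance that a multi-word AND query matches a given relevant document falls quickly as words are added.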

Slide 7: The Information Retrieval Problem: Basic IR Technology
- Recall measures how well all relevant documents are retrieved (x / n2)
- Precision measures how well only relevant documents are retrieved (x / n1)
Here x is the number of documents that are both relevant and retrieved, n1 is the number of documents retrieved, and n2 is the number of relevant documents.
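A minimal sketch of the two measures, using sets of document ids. The function name and the sample ids are assumptions for illustration; x, n1, and n2 follow the slide's notation, with n1 taken as the number retrieved and n2 as the number relevant.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    x = len(retrieved & relevant)   # relevant and retrieved
    n1 = len(retrieved)             # all retrieved documents
    n2 = len(relevant)              # all relevant documents
    precision = x / n1 if n1 else 0.0
    recall = x / n2 if n2 else 0.0
    return precision, recall

# Hypothetical search: 4 documents retrieved, 10 relevant, 2 in common.
retrieved = {1, 2, 3, 4}
relevant = {3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

In this made-up case half of what came back is relevant, yet most of the relevant documents were never retrieved, so precision and recall can diverge sharply for the same search.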

Slide 8: The Information Retrieval Problem: Basic IR Technology
[Diagram with regions labeled: relevant, retrieved, relevant and retrieved, not relevant, not retrieved]
- When and where and how does the recall vs precision distinction matter?
- How well does full text retrieval work?

Slide 9: The Information Retrieval Problem: Summary of Blair and Maron Study
- Searchers perceived their searches as exhaustive (recall > 75%); actual recall was 20%
- No significant difference between the searching ability of lawyers and paralegals
- Searchers were only able to anticipate a small number of the words and phrases that could be used to retrieve relevant documents and would not be in irrelevant documents
- Extraordinary and unpredictable variability in the words and phrases used to discuss the same topics (e.g., the accident in the litigation was referred to as "situation", "difficulty", "event", "what happened last week", and "we all know why we are here")