

Parallel and Distributed Searching

Lecture Objectives
  Review Boolean Searching
  Indicate how Searches may be carried out in parallel
  Overview of Distributed Searching
    – Collection Partitioning
    – Query Processing
    – Collection/Results Fusion

Boolean Queries
  Queries with terms connected by AND, OR and NOT
    – (Internet AND retrieval) AND (NOT english)
    – “world wide web” OR internet
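The two example queries can be evaluated with plain set operations. This is a minimal sketch, assuming a toy inverted index; the index contents, the document ids and the helper names (`docs`, `boolean_not`) are made up for illustration and are not from the lecture.

```python
# Toy inverted index: term -> set of document ids (illustrative data only).
INDEX = {
    "internet": {1, 2, 3},
    "retrieval": {2, 3, 4},
    "english": {3},
    "world wide web": {5},
}
ALL_DOCS = {1, 2, 3, 4, 5}


def docs(term):
    """Return the set of documents that mention the term."""
    return INDEX.get(term, set())


def boolean_not(matches):
    """NOT is the complement with respect to the whole collection."""
    return ALL_DOCS - matches


# (Internet AND retrieval) AND (NOT english)
q1 = (docs("internet") & docs("retrieval")) & boolean_not(docs("english"))

# "world wide web" OR internet
q2 = docs("world wide web") | docs("internet")

print(sorted(q1))  # [2]
print(sorted(q2))  # [1, 2, 3, 5]
```

AND maps to intersection, OR to union and NOT to complement, which is why each operand can be looked up independently of the others.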

Advantages
  Easy to implement
  Allow very precise query specifications
  Facilitate parallel execution

Disadvantages
  People are bad at Boolean algebra
  Difficult to interpret in a way that gives an effective relevance ranking
  Difficult to include sensible query weighting

Parallel Searching
  Useful in improving performance in very large/heavily used search engines
  Break the query down into several subqueries
  Execute each at the same time
  Combine the results
  Share subqueries between different searches
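The steps above can be sketched as follows. This is only an illustration under stated assumptions: each query term is treated as one subquery, the subqueries run in a thread pool, and a cache lets different searches share a subquery that has already been answered. The index data and the names `lookup` and `parallel_and` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Toy index: term -> documents containing it (illustrative data only).
INDEX = {
    "internet": frozenset({1, 2, 3}),
    "retrieval": frozenset({2, 3, 4}),
    "parallel": frozenset({3, 4, 5}),
}


@lru_cache(maxsize=None)
def lookup(term):
    """One subquery: fetch the documents for a single term.

    The cache is what lets later searches reuse earlier subquery results.
    """
    return INDEX.get(term, frozenset())


def parallel_and(terms):
    """Run one subquery per term at the same time, then intersect the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        postings = list(pool.map(lookup, terms))
    if not postings:
        return set()
    result = set(postings[0])
    for matches in postings[1:]:
        result &= matches
    return result


print(sorted(parallel_and(["internet", "retrieval"])))  # [2, 3]
print(sorted(parallel_and(["retrieval", "parallel"])))  # [3, 4]
```

In CPython, threads mainly help when each lookup waits on disk or network; a real engine would more likely spread subqueries over processes or machines.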

Distributed Searching
  More about metasearching and turning plain searching into metasearching

Distribution Methods
  Multiple copies of the collection: mirror sites
  Why not split the documents between servers according to their topics?

Collection Partitioning
  Manual/Semi-automatic Topic Partitioning
    – medical vs engineering
    – books vs CDs
  One Central Index
  One Index per server

Distributed Query Processing
  Select collections to search
  Distribute the query to the selected collections
  Evaluate the query at the selected servers in parallel
  Combine the results into a final result
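The four steps can be put together in a small sketch. Everything concrete here is an assumption made for illustration: the three in-memory "servers", the vocabulary-overlap test used for selection, and the scoring by number of matching query terms. The lecture does not prescribe any of these.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical servers: name -> {document id: text}.
SERVERS = {
    "medical": {"m1": "titanium alloys in artificial hips", "m2": "heart surgery"},
    "engineering": {"e1": "steel alloys wear", "e2": "bridge design"},
    "music": {"c1": "jazz cd reviews"},
}


def select_collections(terms):
    """Step 1: keep only servers whose vocabulary overlaps the query."""
    selected = []
    for name, documents in SERVERS.items():
        vocabulary = set(" ".join(documents.values()).split())
        if vocabulary & set(terms):
            selected.append(name)
    return selected


def evaluate(name, terms):
    """Step 3: score each document by how many distinct query terms it contains."""
    hits = []
    for doc_id, text in SERVERS[name].items():
        score = len(set(text.split()) & set(terms))
        if score:
            hits.append((score, doc_id))
    return hits


def distributed_search(query):
    terms = query.lower().split()
    selected = select_collections(terms)
    # Steps 2 and 3: send the query to every selected server at the same time.
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lambda name: evaluate(name, terms), selected))
    # Step 4: combine the partial results into one ranked list.
    merged = [hit for hits in partial for hit in hits]
    return [doc_id for _, doc_id in sorted(merged, key=lambda h: (-h[0], h[1]))]


print(distributed_search("titanium alloys"))  # ['m1', 'e1']
```

Merging on raw scores only works here because every server scores in the same way; combining results from servers that score differently is the results fusion problem.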

Source Selection
  Obtain global term distribution data
    – on the web?
  Analyse central index of collection relevance
  Missing gems

Missing Gems
  Example Query
    – wear characteristics of high titanium steel alloys
    – actually occurs in a medical collection describing use in artificial hips

Results Fusion
  Want to present a single result collected from several sources
  Also known as collection fusion because it makes several collections appear as one

Results Fusion
  How do you put together the results from several web sites/search engines into a single combined result?
    – Collection at a time
    – Round robin
    – Relevance ranked

Collection at a Time
  Use e.g. tf * idf across each collection to rank the searched collections by relevance
  Display the results from the best collection first
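One way to read this is to treat each collection as a single large document, score it against the query, and then list results collection by collection. The sketch below assumes that reading. The weighting `tf * log(1 + N / cf)` (with `cf` the number of collections mentioning the term) is an assumption chosen for illustration, and the collection contents and function names are hypothetical.

```python
import math

# Hypothetical collections: name -> list of document texts.
COLLECTIONS = {
    "medical": ["titanium hip implants", "hip surgery"],
    "engineering": ["titanium steel alloys", "titanium wear tests", "bridge steel"],
    "music": ["jazz cd reviews"],
}


def collection_score(name, terms):
    """Score a whole collection as if it were one long document."""
    words = " ".join(COLLECTIONS[name]).split()
    score = 0.0
    for term in terms:
        tf = words.count(term)
        # cf: how many collections mention the term at all.
        cf = sum(
            1 for texts in COLLECTIONS.values() if term in " ".join(texts).split()
        )
        if cf:
            score += tf * math.log(1 + len(COLLECTIONS) / cf)
    return score


def collection_at_a_time(terms, results):
    """List every result of the best collection first, then the next best, and so on."""
    order = sorted(results, key=lambda name: -collection_score(name, terms))
    return [doc for name in order for doc in results[name]]


results = {"medical": ["m1"], "engineering": ["e1", "e2"]}
print(collection_at_a_time(["titanium", "steel"], results))  # ['e1', 'e2', 'm1']
```

The order inside each collection is left exactly as the collection returned it; only the order of the collections changes.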

tf * idf
  tf: term frequency
    – terms that are frequently mentioned in individual documents improve recall
  idf: inverse document frequency
    – inversely proportional to the number of documents which mention a term
    – prefers discriminating terms
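The slide gives no formula, so the sketch below assumes one common variant: the raw count of the term in the document multiplied by log(N / df), where N is the number of documents and df the number of documents mentioning the term. The three documents are made up for illustration.

```python
import math

# Illustrative mini collection: document id -> text.
DOCS = {
    "d1": "parallel search parallel index",
    "d2": "distributed search",
    "d3": "results fusion",
}


def tf_idf(term, doc_id):
    """tf * idf, using raw term count and log(N / df) (an assumed variant)."""
    words = DOCS[doc_id].split()
    tf = words.count(term)
    # df: number of documents that mention the term at least once.
    df = sum(1 for text in DOCS.values() if term in text.split())
    if df == 0:
        return 0.0
    return tf * math.log(len(DOCS) / df)


# "parallel" occurs twice in d1 and nowhere else, so it outweighs "search",
# which occurs once in d1 and also appears in d2.
print(tf_idf("parallel", "d1") > tf_idf("search", "d1"))  # True
```

A term that occurs in every document gets idf = log(1) = 0 under this variant, which matches the idea that it does not discriminate between documents.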

Round Robin
  Take the first document from collection 1
  Then the first document from collection 2, and so on for each collection
  Then the second document from collection 1, and so on
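A minimal sketch of this interleaving, assuming each collection has already returned its own ranked list; the function name `round_robin` and the document labels are illustrative.

```python
from itertools import zip_longest


def round_robin(*ranked_lists):
    """Interleave ranked lists: all first results, then all second results, etc."""
    missing = object()  # sentinel marking lists that have run out of results
    merged = []
    for tier in zip_longest(*ranked_lists, fillvalue=missing):
        merged.extend(doc for doc in tier if doc is not missing)
    return merged


print(round_robin(["a1", "a2", "a3"], ["b1"], ["c1", "c2"]))
# ['a1', 'b1', 'c1', 'a2', 'c2', 'a3']
```

Shorter lists simply drop out once they are exhausted, so no collection's results are lost.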

Relevance Based Methods
  Calculate relevance for the documents returned by each selected source
  Try to calculate some global statistics
  Use some special measures
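The slide does not say which measures to use. As one possible illustration, the sketch below rescales each source's scores to the range 0 to 1 (min-max normalisation, an assumption rather than the lecture's method) so that scores from different sources become comparable, then merges everything into one ranked list. The source names and scores are made up.

```python
def normalise(hits):
    """Rescale one source's (doc, score) pairs to the range 0..1."""
    scores = [score for _, score in hits]
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        return [(doc, 1.0) for doc, _ in hits]
    return [(doc, (score - lowest) / (highest - lowest)) for doc, score in hits]


def relevance_merge(results_by_source):
    """Merge results from several sources by normalised relevance score."""
    merged = []
    for hits in results_by_source.values():
        if hits:
            merged.extend(normalise(hits))
    return [doc for doc, _ in sorted(merged, key=lambda hit: -hit[1])]


results = {
    "A": [("a1", 10.0), ("a2", 5.0), ("a3", 0.0)],  # source A scores on 0..10
    "B": [("b1", 0.9), ("b2", 0.7), ("b3", 0.1)],   # source B scores on 0..1
}
print(relevance_merge(results))  # ['a1', 'b1', 'b2', 'a2', 'a3', 'b3']
```

Without the rescaling step, every document from source A with a non-zero score would outrank all of source B simply because A uses a larger scale.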

Other Alternatives
  Random
  First come, first shown
  etc.

Conclusions
  Parallel Searching is one way to speed up searching
  Distributing information can ease and speed up searching, but has some dangers
  There are some solutions to the results fusion problem