Searching the World Wide Web: Meta Crawlers vs. Single Search Engines By: Voris Tejada.

The Beginning Since its inception, the internet has grown at a staggering rate, with an extremely large number of pages added every day Search engines such as AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light attempted to turn the internet into a "15-billion-word encyclopedia"

Performance Measures for a Search Engine Coverage: also called "recall" in IR Relevance: also called "precision" in IR Freshness of pages in the index Speed
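The two IR metrics named above can be computed directly from a set of retrieved documents and a set of known-relevant documents. A minimal sketch (the document identifiers are illustrative):

```python
def recall(retrieved, relevant):
    # Coverage: what fraction of all relevant documents did we retrieve?
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    # Relevance: what fraction of the retrieved documents are relevant?
    return len(retrieved & relevant) / len(retrieved)

retrieved = {"doc_a", "doc_b", "doc_c", "doc_d"}
relevant = {"doc_b", "doc_c", "doc_e"}

print(recall(retrieved, relevant))     # 2 of 3 relevant docs found
print(precision(retrieved, relevant))  # 2 of 4 retrieved docs relevant
```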

What is the main problem with this? Coverage –Despite their claims, no single search engine could index the entire web –Traditional IR systems were designed for static collections, so they could not keep up with the growth of the internet

Experiment by Selberg and Etzioni They ran an experiment using logs from their MetaCrawler website to count "unique documents" What were the problems with this experiment? –They took only the first X pages returned from each engine –Each search engine used a different ranking system

Experiment by Lawrence and Giles Produced statistics on the coverage of the major web search engines and the estimated size of the web Compared the number of documents returned by each engine and analyzed the results Problems –They did not know whether they were indexing unique URLs or subsets of the same URLs –They took only the first X documents returned

They found… Using the estimate that the web contains 320 million pages, they calculated the following: –HotBot: 34% coverage –AltaVista: 28% coverage –Northern Light: 20% coverage –Excite: 14% coverage –Infoseek: 10% coverage –Lycos: 3% coverage *Note: both experiments were concerned with coverage
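Turning those percentages into absolute page counts makes the gap between engines concrete. A quick sketch using the 320-million-page estimate from the slide:

```python
WEB_SIZE = 320_000_000  # Lawrence and Giles' estimate of the indexable web

coverage = {
    "HotBot": 0.34,
    "AltaVista": 0.28,
    "Northern Light": 0.20,
    "Excite": 0.14,
    "Infoseek": 0.10,
    "Lycos": 0.03,
}

for engine, frac in coverage.items():
    # e.g. HotBot at 34% indexes roughly 109 million of 320 million pages
    print(f"{engine}: ~{int(frac * WEB_SIZE):,} pages indexed")
```

Note that the percentages sum to more than 100% would if the indexes were disjoint is not implied here; the engines' indexes overlap heavily, which is exactly why no single one suffices.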

What method did they find that could increase coverage? Combining results from multiple engines –By combining all six search engines they were able to yield 3.5x as many results –Selberg and Etzioni had created a MetaCrawler which gathered a "market share" of the results of each engine *A solution better than MetaCrawler?
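Combining engines increases coverage because each engine indexes a partially disjoint slice of the web, so the union of their result sets is larger than any single set. A sketch with hypothetical result URLs:

```python
# Hypothetical per-engine result sets; each engine sees some unique URLs
engine_results = {
    "hotbot":    {"u1", "u2", "u3"},
    "altavista": {"u2", "u4", "u5"},
    "lycos":     {"u5", "u6"},
}

# The union is the combined coverage of the metasearch
combined = set().union(*engine_results.values())
best_single = max(len(r) for r in engine_results.values())

print(len(combined))                # 6 unique URLs across all engines
print(len(combined) / best_single)  # gain over the best single engine
```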

The First MetaCrawler Softbot –Invented by Selberg and Etzioni at the University of Washington –What important qualities did it provide? A single interface for querying multiple search engines such as Lycos and AltaVista Obtained higher quality results rather than simply concatenating results

Modular Design User Interface –Translates user queries and options into the appropriate parameters Aggregation Engine –Obtains references, eliminates duplicates, collates and outputs results Parallel Web Interface –Sends queries, obtains results, and downloads HTML pages from the Web Harness –Where service-specific information is kept
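The four modules above can be sketched as classes with the responsibilities the slide describes. This is an illustrative outline of the architecture, not MetaCrawler's actual code; all names and the stubbed fetch are assumptions:

```python
class Harness:
    """Holds service-specific information: how to query each engine."""
    def __init__(self):
        self.services = {}  # engine name -> query-URL template

    def register(self, name, template):
        self.services[name] = template


class ParallelWebInterface:
    """Sends queries to each registered service and collects raw results."""
    def query_all(self, harness, query):
        # A real implementation would fetch each service's URL concurrently;
        # here each engine just returns one stubbed result string.
        return {name: [f"{name}-result-for-{query}"]
                for name in harness.services}


class AggregationEngine:
    """Merges per-engine results, eliminating duplicate references."""
    def aggregate(self, raw):
        seen, merged = set(), []
        for results in raw.values():
            for ref in results:
                if ref not in seen:
                    seen.add(ref)
                    merged.append(ref)
        return merged


class UserInterface:
    """Translates the user's query and presents one collated list."""
    def __init__(self):
        self.harness = Harness()
        self.web = ParallelWebInterface()
        self.agg = AggregationEngine()

    def search(self, query):
        raw = self.web.query_all(self.harness, query)
        return self.agg.aggregate(raw)
```

Because each module sits behind a small interface, a new search service only needs a new Harness entry, which is the adaptability point the later slide returns to.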

Motivation Growth of the Web Difficulty in finding information Search engines index different documents and use different ranking algorithms –By using a single search engine, you could miss over 77% of the most relevant references Interfaces of many search engines were difficult to use

Softbot Addresses These Problems Aggregates web search services under a unified interface –The interface was much easier to use –Forwards queries to the individual search engines and ranks the results into one composite list Obtains higher quality results –Allows users to be more specific –Eliminates duplicates using a comparison algorithm –Adapts to a rapidly changing environment

Formatting and Ranking MetaCrawler translates each query into the appropriate format for each search engine Uses a "confidence score" to rank –Allows each service to vote on the relevancy of a particular document –Higher total score = higher ranking on the final list
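The voting scheme can be sketched as follows: each engine awards a document a confidence score based on its position in that engine's list, the scores are summed, and the totals determine the final ordering. The scoring formula here is an illustrative assumption, not MetaCrawler's published one:

```python
from collections import defaultdict

def rank_by_confidence(engine_rankings, max_score=1000):
    """Sum per-engine positional 'votes' and sort by total score."""
    totals = defaultdict(int)
    for ranking in engine_rankings.values():
        for position, url in enumerate(ranking):
            # Earlier positions earn a higher confidence score
            totals[url] += max_score - position * 100
    return sorted(totals, key=totals.get, reverse=True)

rankings = {
    "hotbot":    ["a.com", "b.com", "c.com"],
    "altavista": ["a.com", "d.com", "b.com"],
}
# a.com is ranked first by both engines, so it tops the composite list
print(rank_by_confidence(rankings))
```

A document returned by several services accumulates votes from each, so agreement between engines naturally pushes it up the composite list.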

Speed Has user-modifiable timeouts References are downloaded only when needed or when the user chooses Shows partial results –Doesn't wait for the full results list to be generated before showing you something
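The timeout-plus-partial-results idea can be sketched with concurrent queries: fire all engine requests in parallel, wait up to the (user-modifiable) timeout, and return whatever has arrived, dropping the stragglers. The engine names and delays are stand-ins for real network calls:

```python
import concurrent.futures
import time

def query_engine(name, delay):
    # Stand-in for a network request to one search engine
    time.sleep(delay)
    return f"results from {name}"

def metasearch(engines, timeout=0.5):
    """Return whatever results arrive before the timeout; slow engines
    are simply dropped rather than blocking the user."""
    partial = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(query_engine, name, delay): name
                   for name, delay in engines.items()}
        done, _not_done = concurrent.futures.wait(futures, timeout=timeout)
        for future in done:
            partial.append(future.result())
    return partial

# "lycos" responds within the timeout; "slowbot" misses it and is skipped
print(metasearch({"lycos": 0.1, "slowbot": 1.0}, timeout=0.5))
```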

Adaptability, Portability, Scalability Modular design allows services to be added, modified, and removed quickly Does not require large databases or large amounts of memory; can run on most machines Can scale without adding more machines

MetaCrawlers Today "The Big Four" –Dogpile –Metacrawler –Excite –Webcrawler