1 Searching the Web Baeza-Yates Modern Information Retrieval, 1999 Chapter 13.

Presentation transcript:

1 Searching the Web Baeza-Yates Modern Information Retrieval, 1999 Chapter 13

2 Introduction
• Characterizing the Web
• Three different forms
  » Search engines
    – AltaVista
  » Web directories
    – Yahoo
  » Hyperlink search
    – WebGlimpse

3 Challenges on the Web
• Distributed data
• Volatile data
• Large volume
• Unstructured and redundant data
• Data quality
• Heterogeneous data

4 Measuring the Web
• The size of the Web (the number of hosts)
  » Netsizer
    – 2.7 million web servers, 65 million internet hosts, 1999
  » Netcraft
    – 8 million web servers, running different server software, 1999
  » Internet Domain Survey
    – 56 million internet hosts
  » WWW Consortium (W3C)

5 Other measures
• The number of different institutions that maintain Web data
  » more than 40% of the number of Web servers
• The number of Web pages
  » 350 million in Jul [BB98, WWW7]
    – 20,000 random queries based on a lexicon of 400,000 words extracted from Yahoo
    – the union of all answers from four search engines covered about 70% of the Web
• The size of a page
  » 5 KB on average, with a median of 2 KB

6 Other measures (cont.)
• The number of links in a page
  » 5 to 15 links, 8 on average
  » 80% of these home pages had fewer than 10 external links
• Yahoo and other web directories are the glue of the Web
• The size of the Web (in bytes)
  » 5 KB * 350 million = about 1.7 terabytes
• The languages of the Web

7 Modeling the Web
• Heaps’ and Zipf’s laws are also valid in the Web.
  » In particular, the vocabulary grows faster (larger β) and the word distribution should be more biased (larger θ)
• Heaps’ Law
  » An empirical rule which describes the vocabulary growth as a function of the text size.
  » It establishes that a text of n words has a vocabulary of size O(n^β) for 0 < β < 1
• Zipf’s Law
  » An empirical rule that describes the frequency of the text words.
  » It states that the i-th most frequent word appears as many times as the most frequent one divided by i^θ, for some θ > 1
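
The two laws on this slide can be sketched numerically. This is only an illustration: the function names, the constant K, and the default exponent values (written β and θ here) are assumptions chosen for the example, not figures from the chapter.

```python
def heaps_vocabulary(n_words: int, k: float = 10.0, beta: float = 0.5) -> float:
    """Heaps' law estimate: vocabulary size grows as K * n**beta, with 0 < beta < 1.

    K and the default beta are illustrative assumptions.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return k * n_words ** beta


def zipf_frequency(rank: int, top_frequency: float, theta: float = 1.5) -> float:
    """Zipf's law estimate: the rank-th most frequent word occurs
    top_frequency / rank**theta times. The default theta is an assumption.
    """
    if rank < 1:
        raise ValueError("rank starts at 1")
    return top_frequency / rank ** theta
```

A larger β makes `heaps_vocabulary` grow faster with text size, and a larger θ makes `zipf_frequency` fall off more steeply with rank, which is the sense in which the slide calls the Web distribution "more biased".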

8 Zipf’s and Heaps’ Law Distribution of sorted word frequencies (left) and size of the vocabulary (right)

9 Search Engines
• Centralized Architecture
• Distributed Architecture
• User Interface
• Ranking
• Crawling the Web
• Indices

10 Typical Crawler-Indexer Architecture
[Figure: components are Interface, Query Engine (Ranking), Index, Indexer, and Crawler]
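
The components named on this slide (crawler, indexer, index, query engine) can be wired together in a toy pipeline. This is a minimal sketch under stated assumptions: `TOY_WEB` is a hypothetical in-memory stand-in for fetched pages, and ranking by raw term counts is a placeholder, not the ranking any real engine uses.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory "Web": URL -> page text (stands in for HTTP fetches).
TOY_WEB = {
    "http://a.example/": "web search engines crawl the web",
    "http://b.example/": "an index maps words to pages",
    "http://c.example/": "search engines rank pages for a query",
}


def crawl(web):
    """Crawler: yield (url, text) pairs for every page it can reach."""
    for url, text in web.items():
        yield url, text


def build_index(pages):
    """Indexer: map each word to a Counter of {url: occurrences in that page}."""
    index = defaultdict(Counter)
    for url, text in pages:
        for word in text.lower().split():
            index[word][url] += 1
    return index


def answer(index, query):
    """Query engine: rank pages by total occurrences of the query words."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(index.get(word, {}))
    return [url for url, _ in scores.most_common()]


index = build_index(crawl(TOY_WEB))
print(answer(index, "search engines"))
```

The point of the split is that the crawler and indexer run offline and the query engine only ever touches the index, never the Web itself.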

11 Centralized Architecture

12 Centralized Architecture
• HotBot, GoTo and Microsoft are powered by Inktomi
• Magellan is powered by Excite’s internal engine
• Others
  » Ask Jeeves
    – simulates an interview
  » DirectHit
    – ranks the Web pages in the order of their popularity

13 Distributed Architecture
• Harvest
  » Gatherers: collect and extract indexing information from one or more Web servers
  » Brokers: provide the indexing mechanism and the query interface to the data gathered
  » Netscape’s Catalog Server
[Figure: Harvest architecture, showing User, Brokers, Gatherers, Web, Object Cache, and Replication manager]

14 User Interface
• Query interface
  » AltaVista: OR
  » HotBot: AND
• Answer interface
  » order by relevance
  » order by URL or date
  » option: find documents similar to each Web page
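
The difference between an OR default and an AND default for multi-word queries can be shown on sets of page ids. This is a minimal sketch: the `match` function and the `postings` mapping are hypothetical names introduced for the example, and it says nothing about how AltaVista or HotBot were implemented.

```python
def match(postings, terms, mode="OR"):
    """Return the set of page ids matching the query terms.

    postings: word -> set of page ids containing that word.
    mode="OR" keeps pages with any term; mode="AND" keeps pages with all terms.
    """
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if mode == "AND":
        return set.intersection(*sets)
    if mode == "OR":
        return set.union(*sets)
    raise ValueError("mode must be 'AND' or 'OR'")
```

With an OR default, adding words to a query can only grow the answer set; with an AND default it can only shrink it, which changes how users should phrase queries on each engine.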

15 Ranking
• Most search engines follow traditional ranking models
  » Boolean or vector model
  » Yuwono and Lee (1996)
    – Boolean spread
    – vector spread
    – most-cited
• Hyperlink information
  » WebQuery (CK97, WWW6)
  » Li98, Internet Computing
  » HITS (Kleinberg, SIAM98)
  » ARC (Cha98, WWW7)
  » PageRank, Google (BP98, WWW7)
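
To make the idea of ranking by hyperlink information concrete, here is a simplified power-iteration sketch in the spirit of PageRank. It is an illustration only: the damping value of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are assumptions of this example, not details taken from the slide.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scoring by power iteration.

    links: page -> list of pages it links to. Returns page -> score,
    with scores summing to 1.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank
```

A page scores highly when it is linked from pages that themselves score highly, so the result depends on link structure rather than on the query text.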

16 Crawling the Web
• Synonyms
  » spider, robot, crawler, etc.
  » Starting from a set of popular URLs
  » Partition the Web using country codes or Internet names
• Crawling order
  » Depth-first, breadth-first
  » CG98, WWW7
• robots.txt
  » Guidelines for robot behavior, including which pages should not be indexed
  » e.g. dynamically generated pages, password-protected pages
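
A breadth-first crawl that honors the robots exclusion file (robots.txt) can be sketched with Python's standard `urllib.robotparser`. The site graph, the rules text, and the `toy-crawler` agent name are hypothetical; a real crawler would fetch pages and the rules file over HTTP instead of reading them from these in-memory stand-ins.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical site: URL -> outgoing links (stands in for fetched pages).
TOY_SITE = {
    "http://site.example/": ["http://site.example/a", "http://site.example/private/x"],
    "http://site.example/a": ["http://site.example/b"],
    "http://site.example/b": [],
    "http://site.example/private/x": ["http://site.example/secret"],
}

# Hypothetical exclusion rules for the toy site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl_bfs(seeds, site, robots_txt, agent="toy-crawler"):
    """Visit pages breadth-first from the seed URLs, skipping disallowed ones."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier:
        url = frontier.popleft()  # frontier.pop() would give depth-first order
        if not rules.can_fetch(agent, url):
            continue  # excluded pages are neither visited nor expanded
        order.append(url)
        for link in site.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

Swapping the queue for a stack is the only change needed to move between the two crawling orders the slide mentions.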

17 Indices
• Variants of the inverted file
  » Complemented by a short description of each Web page
    – creation date, size, the title and the first lines or a few headings
    – 500 bytes for each page * 100 million pages = 50 GB
  » 30% of the text size
    – 5 KB for each page * 100 million pages * 30% = 150 GB
  » compression
    – 50 GB
• Binary search on the sorted list of words of the inverted file
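
The lookup step and the size arithmetic on this slide can be sketched as follows. The function names are hypothetical, the index is kept in memory for simplicity, and the size estimates assume decimal units (1 GB = 10**9 bytes).

```python
from bisect import bisect_left


def build_inverted_file(docs):
    """docs: list of page texts. Returns (sorted vocabulary, parallel posting lists)."""
    postings = {}
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            postings.setdefault(word, []).append(doc_id)
    vocabulary = sorted(postings)
    return vocabulary, [postings[word] for word in vocabulary]


def lookup(vocabulary, posting_lists, word):
    """Binary search the sorted vocabulary; return the posting list, or [] if absent."""
    i = bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return posting_lists[i]
    return []


# Size estimates from the slide, in bytes (integer arithmetic, decimal units).
PAGES = 100_000_000
description_bytes = 500 * PAGES          # 50 GB of page descriptions
index_bytes = 5_000 * PAGES * 30 // 100  # 150 GB: 30% of the text size
```

Because the vocabulary is sorted, each lookup costs a logarithmic number of comparisons in the vocabulary size, independent of the number of pages.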

18 Indexing Granularity
• Pointing to pages or to word positions is an indication of the granularity of the index
  » Use logical blocks instead of pages
    – reduce the size of the pointers (fewer blocks than documents)
  » Occurrences of a non-frequent word will be clustered in the same block
    – reduce the number of pointers
• Queries are resolved as for inverted files
  » Obtaining a list of blocks that are then searched sequentially
  » Exact sequential search: 30Mb/sec
  » Glimpse in Harvest
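
Block addressing can be sketched as a two-step search: the index names candidate blocks, and each candidate block is then scanned sequentially. This is a toy illustration with hypothetical function names and a fixed number of documents per block; it is not how Glimpse itself lays out blocks.

```python
def build_block_index(docs, docs_per_block=2):
    """Group documents into logical blocks and index each word by block number."""
    blocks = [docs[i:i + docs_per_block] for i in range(0, len(docs), docs_per_block)]
    index = {}
    for block_id, block in enumerate(blocks):
        for text in block:
            for word in text.lower().split():
                index.setdefault(word, set()).add(block_id)
    return blocks, index


def search(blocks, index, word):
    """Find candidate blocks via the index, then scan each one sequentially.

    Returns (block_id, offset_in_block) pairs for documents containing the word.
    """
    word = word.lower()
    hits = []
    for block_id in sorted(index.get(word, ())):
        for offset, text in enumerate(blocks[block_id]):
            if word in text.lower().split():
                hits.append((block_id, offset))
    return hits
```

The index stays small because it stores at most one pointer per word per block, at the cost of a sequential scan inside each candidate block at query time.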

19 Browsing in Web Directories

20 Combining Searching with Browsing
• WebGlimpse
  » attaches a small search box to the bottom of every HTML page
  » allows the search to cover the neighborhood of that page or the whole site without having to stop browsing

21 MetaCrawlers

22 Metasearchers (cont.)
• Client-side metasearchers
  » WebCompass
  » WebSeeker
  » EchoSearch
  » WebFerret
• Better ranking
  » Inquirus (LG98, WWW7)
    – NEC Research Institute metasearch engine
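
A metasearcher has to merge the ranked answers it gets back from several engines into one list. The sketch below fuses lists by summing reciprocal ranks. That fusion rule is an assumption picked for illustration; it is not the method used by Inquirus or by any of the tools named on the slide.

```python
def merge_results(result_lists):
    """Fuse ranked URL lists from several engines into one ranking.

    result_lists: one ranked list of URLs per engine. Each URL earns
    1 / position from every engine that returned it; ties are broken by URL.
    """
    scores = {}
    for results in result_lists:
        seen = set()
        for position, url in enumerate(results, start=1):
            if url in seen:
                continue  # count each URL once per engine
            seen.add(url)
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    return sorted(scores, key=lambda url: (-scores[url], url))
```

A URL returned near the top by several engines ends up ahead of one returned by a single engine, which is the basic effect a metasearcher wants from merging.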

23 Dynamic Search and Software Agents
• Fish search (Bra94, WWW2)
  » fall94.html
• Shark search (HJM+98, WWW7)
• Searching specific information
  » LaMacchia, WWW6, Internet fish construction kit
  » SiteHelper (NW97, WWW6)
• Shopping robots
  » Jango
  » Junglee
  » Express

24 Summary
• Characterizing the Web
• Search engines