Web Search – Summer Term 2006 VII. Web Search - Indexing: Structure Index (c) Wolfgang Hürst, Albert-Ludwigs-University.



Structure Index (Links)
Represents the links between the indexed pages
Important for:
- Relevance calculation (PageRank, HITS, ...)
- Crawling (importance metrics, ...)
- Some other applications (Web mining, etc.)
Most critical issues (again):
- Size and rate of change
Most important requirements:
- Reduce space / compression
- Support required operations (random and streaming access, add / delete)
- Speed

The Web Graph
The structure index represents the web graph:
- Node = web page
- Directed edge = link
Common representation techniques for graphs:
a) Adjacency matrix
b) Adjacency list
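The two representations can be compared on a small example. The following is a minimal sketch, assuming a made-up graph of four pages numbered 0 to 3; the function names are chosen for illustration only.

```python
# Hypothetical 4-page web graph: each pair (u, v) is a link from page u to page v.
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
num_pages = 4

def to_adjacency_matrix(num_pages, edges):
    """n x n matrix; entry [u][v] is 1 if page u links to page v."""
    matrix = [[0] * num_pages for _ in range(num_pages)]
    for u, v in edges:
        matrix[u][v] = 1
    return matrix

def to_adjacency_list(num_pages, edges):
    """One sorted successor list per page."""
    successors = [[] for _ in range(num_pages)]
    for u, v in edges:
        successors[u].append(v)
    return [sorted(s) for s in successors]

matrix = to_adjacency_matrix(num_pages, edges)
adj_list = to_adjacency_list(num_pages, edges)
print(matrix)    # [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
print(adj_list)  # [[1, 2], [2], [], [0]]
```

The matrix always needs n x n entries, while the list needs one entry per link. Since a web page links to only a tiny fraction of all pages, the adjacency list is usually the far more compact choice for the web graph.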

The Structure Index
Example: The Connectivity Server [3]
Based on a data structure that supports the following operations:
- Given a URL u (or a set of URLs U), return a list of the pages that point to u (U), i.e. its predecessors (back links), and a list of the pages that u (U) points to, i.e. its successors (forward links)
- Given a set of URLs U and a distance, return the respective neighborhood of U in the graph
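The operations listed above can be sketched in a few lines. This is an illustrative in-memory version and not the Connectivity Server's actual implementation: the class name, the method names and the sample links are hypothetical, and the neighborhood is assumed to follow links in both directions.

```python
from collections import deque

class LinkGraph:
    """Toy structure index: keeps forward and back links for every URL."""

    def __init__(self, edges):
        self.successors = {}
        self.predecessors = {}
        for src, dst in edges:
            self.successors.setdefault(src, set()).add(dst)
            self.predecessors.setdefault(dst, set()).add(src)

    def forward_links(self, url):
        """Pages that `url` points to."""
        return sorted(self.successors.get(url, ()))

    def back_links(self, url):
        """Pages that point to `url`."""
        return sorted(self.predecessors.get(url, ()))

    def neighborhood(self, urls, distance):
        """All URLs within `distance` links of `urls` (assumption: either direction)."""
        seen = set(urls)
        queue = deque((u, 0) for u in seen)
        while queue:
            url, depth = queue.popleft()
            if depth == distance:
                continue  # do not expand beyond the requested distance
            linked = self.successors.get(url, set()) | self.predecessors.get(url, set())
            for nxt in linked:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
        return sorted(seen)

# Hypothetical links between five pages a..e.
graph = LinkGraph([("a", "b"), ("b", "c"), ("d", "b"), ("c", "e")])
print(graph.back_links("b"))          # ['a', 'd']
print(graph.forward_links("b"))       # ['c']
print(graph.neighborhood(["a"], 2))   # ['a', 'b', 'c', 'd']
```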

The Connectivity Server
Nodes: Array (1 node = 1 element)
Edges:
- OUTLIST: Adjacency list (successors)
- INLIST: Inverted adjacency list (predecessors)
[Figure: URL database, node table, inlist table and outlist table; each node table entry holds a pointer to the URL, a pointer to the inlist and a pointer to the outlist]
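The table layout described above might look roughly as follows in code. This is a sketch under assumptions: pointers are modeled as plain list offsets, a node's list is taken to end where the next node's list begins, a sentinel row closes the node table, and the pointer to the URL is left out.

```python
def build_tables(num_nodes, edges):
    """Return (node_table, inlist_table, outlist_table).

    node_table[i] is (offset into inlist_table, offset into outlist_table).
    """
    outs = [[] for _ in range(num_nodes)]
    ins = [[] for _ in range(num_nodes)]
    for u, v in edges:
        outs[u].append(v)
        ins[v].append(u)
    node_table, inlist_table, outlist_table = [], [], []
    for i in range(num_nodes):
        node_table.append((len(inlist_table), len(outlist_table)))
        inlist_table.extend(sorted(ins[i]))
        outlist_table.extend(sorted(outs[i]))
    # Sentinel row (an assumption of this sketch) so the last node's lists have an end.
    node_table.append((len(inlist_table), len(outlist_table)))
    return node_table, inlist_table, outlist_table

def predecessors(node_table, inlist_table, i):
    """Slice of the inlist table that belongs to node i."""
    return inlist_table[node_table[i][0]:node_table[i + 1][0]]

def successors(node_table, outlist_table, i):
    """Slice of the outlist table that belongs to node i."""
    return outlist_table[node_table[i][1]:node_table[i + 1][1]]

# Hypothetical graph with four nodes.
node_table, inlist_table, outlist_table = build_tables(
    4, [(0, 1), (0, 2), (1, 2), (3, 0)])
print(successors(node_table, outlist_table, 0))    # [1, 2]
print(predecessors(node_table, inlist_table, 2))   # [0, 1]
```

Keeping all lists in two flat tables supports both random access (via the node table) and streaming access (by scanning a table front to back).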

The Connectivity Server (cont.)
Additional data structure to map URLs to IDs (and vice versa)
ID = index in the lexicographically sorted list of all crawled URLs
Advantage: Compression, i.e. delta encoding
[Figure: example of original text vs. delta encoding (surviving fragments: GANDALF.HTM, 26, 7, GRAB.COM/, 41)]
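Delta encoding of a sorted URL list can be illustrated as follows. This is a minimal sketch of one common variant, in which each URL is stored as the length of the prefix it shares with its predecessor plus the remaining suffix; the exact format used by the Connectivity Server may differ, and the URLs are made up.

```python
def delta_encode(sorted_urls):
    """Encode each URL as (shared prefix length with previous URL, suffix)."""
    encoded = []
    previous = ""
    for url in sorted_urls:
        shared = 0
        limit = min(len(previous), len(url))
        while shared < limit and previous[shared] == url[shared]:
            shared += 1
        encoded.append((shared, url[shared:]))
        previous = url
    return encoded

def delta_decode(encoded):
    """Rebuild the URL list; each entry depends on the one before it."""
    urls = []
    previous = ""
    for shared, suffix in encoded:
        url = previous[:shared] + suffix
        urls.append(url)
        previous = url
    return urls

# Hypothetical crawl, sorted lexicographically; the list index is the URL's ID.
urls = sorted(["example.org/", "example.com/", "example.com/a.html"])
encoded = delta_encode(urls)
print(encoded)  # [(0, 'example.com/'), (12, 'a.html'), (8, 'org/')]
```

Because neighbors in the sorted list tend to share long prefixes, most entries shrink to a short suffix. The price is that a single URL can only be recovered by decoding everything before it.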

The Connectivity Server (cont.)
Problem: All URLs need to be scanned because of the delta encoding (i.e. it saves space at the cost of speed)
Solution: Include checkpoint URLs
Another problem: Updates are hard to do
Several other (newer) approaches exist that take into account, e.g., the actual web structure
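The checkpoint idea can be sketched as follows: every k-th URL is stored in full, so a lookup only decodes the entries after the nearest checkpoint instead of the whole list. The class name, the `interval` parameter and the redundant storage of checkpoints are assumptions of this illustration, not details taken from [3].

```python
import bisect
import os

class CheckpointedUrlList:
    """Delta-encoded sorted URL list with a full URL at every `interval`-th entry."""

    def __init__(self, sorted_urls, interval=3):
        self.interval = interval
        self.checkpoints = sorted_urls[::interval]  # stored uncompressed
        self.entries = []                           # (shared prefix length, suffix)
        previous = ""
        for i, url in enumerate(sorted_urls):
            if i % interval == 0:
                previous = ""                       # restart the encoding at a checkpoint
            shared = len(os.path.commonprefix([previous, url]))
            self.entries.append((shared, url[shared:]))
            previous = url

    def url_of(self, url_id):
        """Decode only from the checkpoint at or before `url_id`."""
        start = url_id - url_id % self.interval
        url = ""
        for shared, suffix in self.entries[start:url_id + 1]:
            url = url[:shared] + suffix
        return url

    def id_of(self, url):
        """Binary search over the checkpoints, then scan a single block."""
        block = bisect.bisect_right(self.checkpoints, url) - 1
        if block < 0:
            return None
        start = block * self.interval
        current = ""
        for offset, (shared, suffix) in enumerate(self.entries[start:start + self.interval]):
            current = current[:shared] + suffix
            if current == url:
                return start + offset
        return None

# Hypothetical sorted URL list with a checkpoint at every second entry.
index = CheckpointedUrlList(
    ["a.com/", "a.com/x", "a.com/y", "b.com/", "b.com/z"], interval=2)
print(index.url_of(3))        # b.com/
print(index.id_of("b.com/"))  # 3
```

A smaller interval makes lookups faster but stores more URLs uncompressed, which is the space/speed trade-off named on the slide.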

S-Node Representation [4]
Observations on the web structure:
- Link copying: Lots of clusters with nodes containing very similar adjacency lists
- Domain and URL locality: A significant fraction of links on a page point to pages from the same domain
- Page similarity: Pages that have very similar adjacency lists are likely to be related
Idea: Make use of these observations, e.g. by grouping related pages / similar URLs

S-Node Representation - Example
Partition P = {N1, N2, N3}
N1 = {P1, P2}
N2 = {P3}
N3 = {P4, P5}
[Figure: supernode graph over N1, N2 and N3, with one intranode graph per Ni and with positive and negative superedge graphs]
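The example partition can be turned into a small two-level structure in code. This is a sketch under assumptions: the page-level links are invented, since the figure's edges are not available here, and the rule for choosing between positive and negative superedges (store whichever set of page pairs is smaller) reflects one reading of [4] rather than a confirmed detail.

```python
from itertools import product

# Partition from the example; the page-level links are invented for illustration.
partition = {"N1": ["P1", "P2"], "N2": ["P3"], "N3": ["P4", "P5"]}
links = {("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
         ("P3", "P4"), ("P3", "P5"), ("P4", "P5"), ("P5", "P1")}

def build_snode(partition, links):
    """Split page links into intranode graphs and superedge graphs."""
    owner = {page: node for node, pages in partition.items() for page in pages}
    intranode = {node: set() for node in partition}
    between = {}
    for src, dst in links:
        a, b = owner[src], owner[dst]
        if a == b:
            intranode[a].add((src, dst))        # link inside one supernode
        else:
            between.setdefault((a, b), set()).add((src, dst))
    superedges = {}
    for (a, b), present in between.items():
        possible = set(product(partition[a], partition[b]))
        absent = possible - present
        # Assumed rule: keep the smaller of "links that exist" (positive)
        # and "links that do not exist" (negative).
        if len(present) <= len(absent):
            superedges[(a, b)] = ("positive", sorted(present))
        else:
            superedges[(a, b)] = ("negative", sorted(absent))
    return intranode, superedges

intranode, superedges = build_snode(partition, links)
print(superedges[("N3", "N1")])  # ('positive', [('P5', 'P1')])
print(superedges[("N1", "N2")])  # ('negative', [])
```

In this made-up data every page of N1 links to every page of N2, so the negative form stores nothing at all, which hints at why link copying makes this representation compact.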

Creating partitions
1. Initial partition: Based on URL (top two levels of DNS), e.g. ad.informatik.uni-freiburg.de
2. URL split: Split the Ni based on URL prefixes
3. Clustered split: Use a clustering algorithm to split partitions into groups with similar adjacency lists
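The first step can be sketched with the standard library alone. This is an illustration under assumptions: the function names and sample URLs are hypothetical, and taking the last two host labels is a simplification that mishandles suffixes such as co.uk.

```python
from urllib.parse import urlsplit

def top_two_dns_levels(url):
    """Return the last two labels of the host name, e.g. 'uni-freiburg.de'."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def initial_partition(urls):
    """Group URLs by the top two levels of their DNS name."""
    groups = {}
    for url in urls:
        groups.setdefault(top_two_dns_levels(url), []).append(url)
    return groups

# Hypothetical crawl results.
crawled = [
    "http://ad.informatik.uni-freiburg.de/teaching",
    "http://www.uni-freiburg.de/",
    "http://example.org/a",
]
groups = initial_partition(crawled)
print(sorted(groups))  # ['example.org', 'uni-freiburg.de']
```

The URL split and the clustered split would then refine each of these groups further.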

References - Indexing
[1] A. ARASU, J. CHO, H. GARCIA-MOLINA, A. PAEPCKE, S. RAGHAVAN: "SEARCHING THE WEB", ACM TRANSACTIONS ON INTERNET TECHNOLOGY, VOL 1/1, AUG; Chapter 4 (Indexing)
[2] S. BRIN, L. PAGE: "THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE", WWW 1998; Chapter 4 (System Anatomy)
[3] BHARAT, BRODER, HENZINGER, KUMAR, VENKATASUBRAMANIAN: "THE CONNECTIVITY SERVER: FAST ACCESS TO LINKAGE INFORMATION ON THE WEB", WWW 1998
[4] RAGHAVAN, GARCIA-MOLINA: "REPRESENTING WEB GRAPHS", STANFORD TECHNICAL REPORT 2002

General Web Search Engine Architecture (cf. [1], Fig. 1)
[Figure: Client, Query Engine, Ranking, Crawl Control, Crawler(s), WWW, Collection Analysis Module, Indexer Module, Page Repository, Indexes (Structure, Utility, Text); labels: Queries, Results, Usage Feedback]

The evolution of search engines
1st generation: Use only "on page" text data
- Word frequency, language
(AltaVista, Excite, Lycos, etc.)
2nd generation: Use off-page, web-specific data
- Link (or connectivity) analysis
- Click-through data (what results people click on)
- Anchor text (how people refer to a page)
Since 1998 (made popular by Google, but used by everyone now)
Source: TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002

The evolution of search engines
3rd generation: Answer the need behind the query (still experimental)
Semantic analysis
- What is it about?
Focus on user need, rather than on query
- Corpus reflects user needs / expectations
- Integrates multiple sources of data
- Help the user create a good query
Context determination
- Spatial (user location / target location)
- Query stream (previous queries)
- Personal (user profile)
- Explicit (vertical search)
- Implicit (on altavista.de)
Source: TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002

The evolution of search engines
3rd generation: Answer the need behind the query (cont.; still experimental)
Helping the user
- UI, spell checking, query refinement, query suggestion, syntax-driven feedback, context help, context transfer, etc.
Integration of search and text analysis
Source: TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002

Example: Google
3rd generation: Answer the need behind the query

Web Search Lecture - Schedule
1. Classic IR (Basics)
2. Classic IR Exercises
3. Web Search (Basics)
4. Web Search Exercises [June 28th till July 12th]
5. Web Search (Selected Topics) [July 18th till July 26th]

Web Search – Summer Term 2006 Web Search Basics - (Programming) Exercises (c) Wolfgang Hürst, Albert-Ludwigs-University


Programming Exercises
- Exercise sheet 1: Tools, Library (Lucene)
- Exercise sheet 2: Database (and text index)
- Exercise sheet 3: Index (structure index)
- Exercise sheet 4: Search (link-based ranking)


New Lecturnity Player
Advanced replay features (developed by us):
- Modification of replay speed (while preserving the pitch of the voice)