Web Search – Summer Term 2006 VI. Web Search - Indexing (c) Wolfgang Hürst, Albert-Ludwigs-University.

1 Web Search – Summer Term 2006 VI. Web Search - Indexing (c) Wolfgang Hürst, Albert-Ludwigs-University

2 General Web Search Engine Architecture
[Figure, cf. [1] Fig. 1. Components: client, query engine, ranking, crawl control, crawler(s), WWW, page repository, collection analysis module, indexer module, indexes (structure, utility, text). Labeled flows: queries, results, usage feedback.]

3 Types of (generic) indexes
1. Text index = "traditional", text-based index
"Inverted files have traditionally been the index structure choice of the web" [3]
Main purpose: identification and selection of relevant pages
Special characteristics:
- Size and rate of change
- Consider anchor text and surrounding text

4 Types of (generic) indexes
2. Structure / link index = description of the linkage between web pages
Usually modeled as a graph (nodes = pages, directed edges = links)
Main purpose: provide structure information (esp. neighborhood relationships), usually to create the ranking
Problem: requires a scalable and efficient representation of a VERY large graph
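The graph model above can be sketched with plain adjacency lists. This is only a toy illustration: the page identifiers are hypothetical, and an in-memory dictionary like this would not scale to the very large graph the slide warns about.

```python
from collections import defaultdict

def build_link_index(links):
    """Build forward and backward adjacency lists from (source, target) pairs."""
    out_links = defaultdict(set)  # page -> pages it links to
    in_links = defaultdict(set)   # page -> pages linking to it
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links

# Hypothetical page identifiers, used only for illustration.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
out_links, in_links = build_link_index(links)
print(sorted(out_links["a"]))  # outgoing neighborhood of "a"
print(sorted(in_links["c"]))   # incoming neighborhood of "c"
```

Keeping both directions makes neighborhood queries cheap in either direction, at the cost of storing every edge twice.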

5 Types of (generic) indexes
3. Utility index: stores additional, search-engine-dependent information needed for page selection and relevance estimation, e.g.
- PageRank
- Site index
- Special site-related characteristics
etc.
Main purpose: usually to speed up processing

6 Text Index (= Inverted File)
Inverted file: generally a mapping term -> document (web page)
- Posting (t, l): pair of term t and location l
- Sometimes: payload field to store additional info
In addition: lexicon (dictionary) with
- List of all terms in the index
- Related statistics (IDF, ...)
Note: similar to traditional IR, but size and rate of change require special techniques
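A minimal sketch of the two structures named above, an inverted file and its lexicon. It assumes that a location is a (page id, word offset) pair and that IDF is computed as log(N / df); the slide fixes neither choice, and the sample pages are made up.

```python
import math
from collections import defaultdict

def build_inverted_file(pages):
    """Map each term to its postings; a location is (page_id, word offset) here."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((page_id, position))
    return index

def build_lexicon(index, num_pages):
    """Store per-term statistics: document frequency and a simple IDF."""
    lexicon = {}
    for term, postings in index.items():
        df = len({page_id for page_id, _ in postings})
        lexicon[term] = {"df": df, "idf": math.log(num_pages / df)}
    return lexicon

# Made-up sample pages, used only for illustration.
pages = {1: "the cat sat", 2: "the dog sat", 3: "a rat ran"}
index = build_inverted_file(pages)
lexicon = build_lexicon(index, len(pages))
print(index["sat"])         # [(1, 2), (2, 2)]
print(lexicon["cat"]["df"])  # 1
```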

7 The WebBase System as an example of a distributed text index [1,3]
[Figure: WebBase architecture diagram. Labels: distributors, indexers, query servers, statistician; web pages, intermediate runs, inverted index; stage 1, stage 2.]

8 WebBase Architecture - 3 Types of Nodes
[Figure: WebBase architecture diagram with the three node types (distributors, indexers, query servers) placed first.]

9 WebBase Indexing Process - 2 Stages
[Figure: WebBase architecture diagram with stage 1 and stage 2 marked.]

10 WebBase - Distributed inverted index organization
[Figure: WebBase architecture diagram.]
Two strategies:
- Local inverted files
- Global inverted files
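The slide only names the two strategies. Following the description in [1], a local inverted file partitions the index by pages, while a global inverted file partitions it by terms. The sketch below illustrates the difference; the modulo and CRC32 assignment rules, the function names, and the sample postings are assumptions made for the example, not WebBase's actual scheme.

```python
import zlib
from collections import defaultdict

def partition_local(postings, num_servers):
    """Local inverted files: each server indexes a disjoint subset of pages."""
    servers = [defaultdict(list) for _ in range(num_servers)]
    for term, page_id in postings:
        servers[page_id % num_servers][term].append(page_id)
    return servers

def partition_global(postings, num_servers):
    """Global inverted files: each server holds complete lists for a subset of terms."""
    servers = [defaultdict(list) for _ in range(num_servers)]
    for term, page_id in postings:
        # crc32 gives a term-to-server assignment that is stable across runs.
        servers[zlib.crc32(term.encode()) % num_servers][term].append(page_id)
    return servers

def servers_holding(term, servers):
    """Return the indexes of the servers that store postings for `term`."""
    return [i for i, server in enumerate(servers) if term in server]

# Made-up (term, page_id) postings, used only for illustration.
postings = [("cat", 1), ("dog", 1), ("cat", 2), ("rat", 3), ("dog", 4)]
local_servers = partition_local(postings, 2)
global_servers = partition_global(postings, 2)
print(servers_holding("cat", local_servers))   # postings spread over servers
print(servers_holding("cat", global_servers))  # postings on a single server
```

With local files a query has to reach every server that may hold matching pages; with global files it only reaches the servers responsible for the query terms.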

11 WebBase - Parallelizing the indexing process
[Figure: WebBase architecture diagram.]

12 Parallel index construction (Indexers)
Input: stream of web pages from the repository
Output: sorted runs / intermediate runs (sorted postings of a subset of the repository)
[Figure: three phases. Loading: web pages -> memory. Processing: parsing, tokenization, sorting in memory. Flushing: memory -> sorted runs.]
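The run-creation step can be sketched as follows. This is a single-threaded toy version under stated assumptions: postings are (term, page id, position) tuples, runs stay in memory instead of being flushed to disk, and the later merge of the runs is shown with Python's `heapq.merge`. None of these details come from the slides.

```python
import heapq

def to_sorted_run(batch):
    """Processing: parse and tokenize a batch of pages, then sort the postings."""
    postings = [(term, page_id, position)
                for page_id, text in batch
                for position, term in enumerate(text.lower().split())]
    postings.sort()
    return postings

def make_sorted_runs(pages, pages_per_run):
    """Load pages_per_run pages at a time and emit one sorted run per batch."""
    runs, batch = [], []
    for page in pages:
        batch.append(page)                     # loading
        if len(batch) == pages_per_run:
            runs.append(to_sorted_run(batch))  # flushing (kept in memory here)
            batch = []
    if batch:
        runs.append(to_sorted_run(batch))
    return runs

# Made-up (page_id, text) pairs, used only for illustration.
pages = [(1, "dog cat"), (2, "cat rat"), (3, "dog")]
runs = make_sorted_runs(pages, pages_per_run=2)
merged = list(heapq.merge(*runs))  # merge the sorted runs into one sorted list
print(runs)
print(merged)
```

Because every run is already sorted, the merge only needs to look at the head of each run, which is what makes a disk-based variant feasible.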

13 Parallel index construction (Indexers)
Software pipeline to create sorted runs (multi-threaded execution)
[Figure: timeline of repeated L P F sequences. L = loading, P = processing, F = flushing.]
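A software pipeline of this kind can be sketched with one thread per phase and bounded queues between them. The function name, the queue sizes, and the use of `None` as an end marker are choices made for this example, and "flushing" just appends to a list here.

```python
import queue
import threading

def run_pipeline(batches):
    """Run loading, processing and flushing as three concurrent stages."""
    loaded = queue.Queue(maxsize=1)      # loading -> processing
    processed = queue.Queue(maxsize=1)   # processing -> flushing
    runs = []

    def load():
        for batch in batches:
            loaded.put(list(batch))
        loaded.put(None)                 # end-of-stream marker

    def process():
        while (batch := loaded.get()) is not None:
            processed.put(sorted(batch))
        processed.put(None)

    def flush():
        while (run := processed.get()) is not None:
            runs.append(run)             # a real indexer would write to disk

    threads = [threading.Thread(target=stage) for stage in (load, process, flush)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return runs

# Made-up batches of (term, page_id) postings, used only for illustration.
result = run_pipeline([[("dog", 2), ("cat", 1)], [("rat", 4), ("cat", 3)]])
print(result)
```

While one batch is being sorted, the next can already be loaded and the previous one flushed. In CPython the gain comes mainly from overlapping I/O-bound loading and flushing with processing.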

14 WebBase - Collecting global statistics
[Figure: WebBase architecture diagram.]

15 Collecting global statistics (Statistician)
Avoid disk accesses (expensive!)
- Communicate with the statistician only when the data is already in memory (i.e. during merging or flushing)
Avoid intensive communication between indexers and statistician
- Only send partly sorted (summarized) postings
Two strategies to collect statistical info on term level:
- ME strategy (during merging)
- FL strategy (during flushing)

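16 ME strategy
[Figure: exchange between indexers and statistician during merging.]
Indexers (inverted lists):
- Indexer 1: CAT (6,2) (3,1); DOG (8,3); RAT (8,3) (4,1)
- Indexer 2: CAT (4,2) (3,3) (7,1); DOG (5,2) (9,1)
Sent to the statistician:
- Indexer 1: (DOG, 1) (CAT, 2) (RAT, 2)
- Indexer 2: (DOG, 2) (CAT, 3)
Statistician (aggregate): (DOG, 3) (CAT, 5) (RAT, 2)
Returned to the indexers (lexicon):
- Indexer 1: DOG:3 CAT:5 RAT:2
- Indexer 2: DOG:3 CAT:5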
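The numbers on this slide can be reproduced with a short sketch. It treats each posting as an opaque pair, takes the per-term count to be the length of the local inverted list, and uses made-up names (`indexer_1`, `local_summary`, `aggregate`); the slide does not say what the two numbers inside a posting mean.

```python
from collections import Counter

# Inverted lists of the two indexers, copied from the slide's example.
indexer_1 = {"cat": [(6, 2), (3, 1)], "dog": [(8, 3)], "rat": [(8, 3), (4, 1)]}
indexer_2 = {"cat": [(4, 2), (3, 3), (7, 1)], "dog": [(5, 2), (9, 1)]}

def local_summary(inverted_lists):
    """Summarize an indexer's lists as term -> number of local postings."""
    return {term: len(postings) for term, postings in inverted_lists.items()}

def aggregate(summaries):
    """Statistician: add up the local counts of all indexers."""
    totals = Counter()
    for summary in summaries:
        totals.update(summary)
    return totals

global_counts = aggregate([local_summary(indexer_1), local_summary(indexer_2)])

# Each indexer receives global statistics only for the terms it holds.
lexicon_1 = {term: global_counts[term] for term in indexer_1}
lexicon_2 = {term: global_counts[term] for term in indexer_2}
print(lexicon_1)  # {'cat': 5, 'dog': 3, 'rat': 2}
print(lexicon_2)  # {'cat': 5, 'dog': 3}
```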

17 FL strategy
[Figure: exchange between indexers and statistician during flushing.]
Indexers (sorted runs): CAT(6,1) DOG(8,3) CAT(2,1) CAT(6,2) RAT(4,3) RAT(8,1) DOG(4,2) CAT(5,2) DOG(5,1) DOG(7,2)
Sent to the statistician: (CAT, 1) (DOG, 2) (CAT, 1) (DOG, 1) (CAT, 2) (RAT, 2) (DOG, 1)
Statistician hash table during processing: DOG ? CAT ? RAT ?
Statistician hash table after processing: DOG 4, CAT 4, RAT 2
Returned to the indexers (lexicon): DOG:4 CAT:4 RAT:2 and DOG:4 CAT:4
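A sketch of the statistician's side of the FL strategy, using the summary messages from this slide. The class and method names are hypothetical, and the messages are replayed as one flat stream because the transcript does not preserve which sorted run each message came from.

```python
class Statistician:
    """Accumulates per-term counts in a hash table while runs are flushed."""

    def __init__(self):
        self.table = {}  # term -> running global count

    def receive(self, term, count):
        """Add one summary message sent by an indexer during flushing."""
        self.table[term] = self.table.get(term, 0) + count

    def stats_for(self, terms):
        """After processing: return the global counts an indexer asks for."""
        return {term: self.table[term] for term in terms}

# Summary messages as they appear on the slide.
messages = [("cat", 1), ("dog", 2), ("cat", 1), ("dog", 1),
            ("cat", 2), ("rat", 2), ("dog", 1)]

statistician = Statistician()
for term, count in messages:
    statistician.receive(term, count)

print(statistician.table)                        # {'cat': 4, 'dog': 4, 'rat': 2}
print(statistician.stats_for(["dog", "cat"]))    # {'dog': 4, 'cat': 4}
```

Unlike the merging-time variant, the totals are only final once the last run has been flushed, which is why the slide shows the hash table both during and after processing.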

18 Summary: ME vs. FL strategy
General observations:
- Relatively low overhead (both strategies)
- Confirmed experimentally ("less than 5% for a 2 million page collection")
Summary of characteristics (+/-)
[Table: rows FL (flushing) and ME (merging); columns parallelism, memory usage, statistician load. Ratings as extracted: FL "++--", ME "+ -+".]

19 The WebBase System - Summary
[Figure: WebBase architecture diagram.]

20 References - Indexing
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001. Chapter 4 (Indexing)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 4 (System Anatomy)
[3] S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina: "Building a Distributed Full-Text Index for the Web", ACM Transactions on Information Systems, Vol. 19/3, July 2001

