Web Search – Summer Term 2006 VI. Web Search - Indexing (c) Wolfgang Hürst, Albert-Ludwigs-University.



General Web Search Engine Architecture (cf. [1], Fig. 1)
[Figure: architecture diagram; labels: client, query engine, ranking, crawl control, crawler(s), usage feedback, results, queries, WWW, collection analysis module, indexer module, page repository, indexes (structure, utility, text)]

Types of (generic) indexes
1. Text index = "traditional", text-based index
"Inverted files have traditionally been the index structure choice of the web" [3]
Main purpose: Identification and selection of relevant pages
Special characteristics:
- Size and rate of change
- Consider anchor text and surrounding text

Types of (generic) indexes
2. Structure / link index = description of the linkage between web pages
Usually modeled as a graph (nodes = pages, directed edges = links)
Main purpose: Provide structure information (esp. neighborhood relationships), usually to create the ranking
Problem: Requires a scalable and efficient representation of a VERY large graph
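The graph model of a link index can be illustrated with a minimal sketch. It assumes pages are identified by small integers and keeps plain in-memory adjacency lists; the function name `build_link_index` and the example links are hypothetical, and a real link index would need a far more compact representation for a graph of web scale.

```python
from collections import defaultdict

def build_link_index(links):
    """Build forward and backward adjacency lists from (source, target) pairs."""
    out_links = defaultdict(set)  # page -> pages it links to
    in_links = defaultdict(set)   # page -> pages linking to it
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links

# Hypothetical page IDs, for illustration only.
links = [(1, 2), (1, 3), (2, 3), (3, 1)]
out_links, in_links = build_link_index(links)

print(sorted(out_links[1]))  # forward neighborhood of page 1
print(sorted(in_links[3]))   # backward neighborhood of page 3
```

Keeping both directions makes neighborhood queries cheap in either direction, at the cost of storing every edge twice.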

Types of (generic) indexes
3. Utility index: Stores additional, search-engine-dependent information needed for page selection and relevance estimation, e.g.
- PageRank
- Site index
- Special site-related characteristics, etc.
Main purpose: Usually to speed up processing time

Text Index (= Inverted File)
Inverted file: generally term -> document (web page)
- Posting (t, l): pair of term t and location l
- Sometimes: payload field to store additional info
In addition: Lexicon (dictionary) with
- List of all terms in the index
- Related statistics (IDF, ...)
Note: Similar to traditional IR, but size and rate of change require special techniques
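A minimal sketch of an inverted file with a lexicon, under a few stated assumptions: a location l is taken to be a (document ID, word position) pair, tokenization is plain whitespace splitting, and IDF is computed as log(N / df). The function name and the sample documents are hypothetical.

```python
import math
from collections import defaultdict

def build_inverted_file(docs):
    """docs: dict mapping doc_id -> text. Returns (index, lexicon)."""
    index = defaultdict(list)  # term -> list of locations (doc_id, position)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id].lower().split()):
            index[term].append((doc_id, pos))

    n_docs = len(docs)
    lexicon = {}  # term -> statistics kept alongside the index
    for term, postings in index.items():
        df = len({doc_id for doc_id, _ in postings})  # document frequency
        lexicon[term] = {"df": df, "idf": math.log(n_docs / df)}
    return dict(index), lexicon

# Hypothetical two-document collection.
docs = {1: "cat dog", 2: "dog rat dog"}
index, lexicon = build_inverted_file(docs)
print(index["dog"])    # all locations of "dog"
print(lexicon["rat"])  # statistics stored in the lexicon
```

A payload field, where used, would simply extend each location tuple with extra data such as a font or anchor-text flag.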

The WebBase System as an example for a distributed text index [1,3]
[Figure: WebBase architecture diagram; labels: distributors, indexers, web pages, intermediate runs, stage 1, stage 2, statistician, query servers, inverted index]

WebBase Architecture - 3 Types of Nodes
Distributors, indexers, query servers
[Figure: WebBase architecture diagram]

WebBase Indexing Process - 2 Stages
[Figure: WebBase architecture diagram with stage 1 and stage 2 marked]

WebBase - Distributed inverted index organization
Two strategies:
- Local inverted files
- Global inverted files
[Figure: WebBase architecture diagram]
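The two organizations can be contrasted in a short sketch. It assumes the usual reading of the terms: with local inverted files each query server indexes a disjoint subset of the pages, while with global inverted files each server holds the complete lists for a subset of the terms. The modulo and CRC32 partitioning rules and all names below are illustrative assumptions, not the WebBase implementation.

```python
import zlib

def partition_local(postings, num_servers):
    """Local inverted files: partition by document."""
    servers = [dict() for _ in range(num_servers)]
    for term, doc_id in postings:
        servers[doc_id % num_servers].setdefault(term, []).append(doc_id)
    return servers

def partition_global(postings, num_servers):
    """Global inverted files: partition by term."""
    servers = [dict() for _ in range(num_servers)]
    for term, doc_id in postings:
        # CRC32 gives a hash that is stable across runs (unlike built-in hash()).
        target = zlib.crc32(term.encode("utf-8")) % num_servers
        servers[target].setdefault(term, []).append(doc_id)
    return servers

# Hypothetical (term, doc_id) postings.
postings = [("cat", 1), ("cat", 2), ("dog", 2), ("dog", 3)]
local = partition_local(postings, 2)
glob = partition_global(postings, 2)
print(local)  # "cat" appears on both servers
print(glob)   # every list for "cat" sits on exactly one server
```

With the local layout a query has to be sent to every server; with the global layout only the servers owning the query terms are involved.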

WebBase - Parallelizing the indexing process
[Figure: WebBase architecture diagram]

Parallel index construction (Indexers)
Input: stream of web pages from the repository
Output: sorted runs / intermediate runs (sorted postings of a subset of the repository)
Phases:
- Loading: web pages are read into memory
- Processing: parsing, tokenization, and sorting in memory
- Flushing: sorted runs are flushed from memory
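A minimal sketch of how one sorted run could be produced from a batch of pages already loaded into memory. It assumes whitespace tokenization and postings of the form (term, page ID, position); `make_sorted_run` and the sample batch are hypothetical names and data.

```python
def make_sorted_run(pages):
    """pages: list of (page_id, text) held in memory.

    Returns the postings of this batch sorted by term, then page, then position.
    """
    postings = []
    for page_id, text in pages:
        # Parsing and tokenization, reduced here to lowercasing and splitting.
        for pos, term in enumerate(text.lower().split()):
            postings.append((term, page_id, pos))
    postings.sort()  # sorting happens entirely in memory
    return postings

# Hypothetical batch of two pages.
run = make_sorted_run([(2, "dog cat"), (1, "cat")])
print(run)
```

Each run covers only a subset of the repository, so the runs still have to be merged in the second stage to form the final inverted index.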

Parallel index construction (Indexers)
Software pipeline to create sorted runs (multi-threaded execution)
[Figure: timeline of staggered, overlapping phase sequences; L = loading, P = processing, F = flushing]
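The loading / processing / flushing pipeline might be sketched with three threads connected by bounded queues, so that one batch can be loaded while the previous one is processed and the one before that is flushed. This is an illustrative sketch only: the queue sizes, the `DONE` sentinel, and the in-memory list standing in for disk output are all assumptions.

```python
import queue
import threading

def run_pipeline(batches):
    """batches: iterable of page batches, each a list of (page_id, text)."""
    loaded = queue.Queue(maxsize=1)     # loader -> processor
    processed = queue.Queue(maxsize=1)  # processor -> flusher
    runs = []                           # stand-in for runs written to disk
    DONE = object()                     # sentinel marking end of input

    def loader():
        for batch in batches:
            loaded.put(batch)
        loaded.put(DONE)

    def processor():
        while True:
            batch = loaded.get()
            if batch is DONE:
                processed.put(DONE)
                break
            postings = sorted(
                (term, page_id) for page_id, text in batch for term in text.split()
            )
            processed.put(postings)

    def flusher():
        while True:
            run = processed.get()
            if run is DONE:
                break
            runs.append(run)

    threads = [threading.Thread(target=f) for f in (loader, processor, flusher)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return runs

# Hypothetical input: two batches of one page each.
runs = run_pipeline([[(1, "b a")], [(2, "c")]])
print(runs)
```

The bounded queues keep memory use limited to a few batches while still letting the three phases overlap in time.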

WebBase - Collecting global statistics
[Figure: WebBase architecture diagram]

Collecting global statistics (Statistician)
- Avoid disk accesses (expensive!): communicate with the statistician only if the data is already in memory (i.e. during merging or flushing)
- Avoid intensive communication between indexer and statistician: only send partly sorted (summarized) postings
Two strategies to collect statistical info on term level:
- ME strategy (during merging)
- FL strategy (during flushing)

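ME strategy
[Figure: indexers send per-term counts to the statistician while merging; the statistician aggregates them and returns the global counts for each indexer's lexicon]
Indexers (inverted lists):
- Indexer 1: CAT (6,2) (3,1); DOG (8,3); RAT (8,3) (4,1)
- Indexer 2: CAT (4,2) (3,3) (7,1); DOG (5,2) (9,1)
Sent to the statistician:
- Indexer 1: (DOG, 1) (CAT, 2) (RAT, 2)
- Indexer 2: (DOG, 2) (CAT, 3)
Statistician (aggregate): (DOG, 3) (CAT, 5) (RAT, 2)
Returned to the indexers (lexicon):
- Indexer 1: DOG:3 CAT:5 RAT:2
- Indexer 2: DOG:3 CAT:5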
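The aggregation in the ME example can be reproduced with a short sketch. It uses the posting lists and counts shown on the slide; the assignment of the lists to two indexers follows from the per-indexer counts, and the function names are hypothetical.

```python
from collections import Counter

def local_counts(inverted_lists):
    """Summarize one indexer's inverted lists as term -> number of postings."""
    return Counter({term: len(postings) for term, postings in inverted_lists.items()})

def lexicon_for(inverted_lists, global_counts):
    """Global statistics returned to an indexer, restricted to its own terms."""
    return {term: global_counts[term] for term in inverted_lists}

# Posting lists as shown in the slide's example.
indexer_1 = {"cat": [(6, 2), (3, 1)], "dog": [(8, 3)], "rat": [(8, 3), (4, 1)]}
indexer_2 = {"cat": [(4, 2), (3, 3), (7, 1)], "dog": [(5, 2), (9, 1)]}

# The statistician aggregates the summaries sent during merging.
global_counts = local_counts(indexer_1) + local_counts(indexer_2)

lexicon_1 = lexicon_for(indexer_1, global_counts)
lexicon_2 = lexicon_for(indexer_2, global_counts)
print(dict(global_counts))
print(lexicon_1, lexicon_2)
```

Because each indexer sends one summarized count per term rather than individual postings, the traffic to the statistician stays small.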

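FL strategy
[Figure: during processing, indexers send per-term counts from their sorted runs to the statistician, which accumulates them in a hash table; after processing, the indexers request the global counts for their lexicon]
Indexers (sorted runs), postings as shown: CAT(6,1) DOG(8,3) CAT(2,1) CAT(6,2) RAT(4,3) RAT(8,1) DOG(4,2) CAT(5,2) DOG(5,1) DOG(7,2)
Sent to the statistician during processing: (CAT, 1) (DOG, 2) (CAT, 1) (DOG, 1) (CAT, 2) (RAT, 2) (DOG, 1)
Statistician hash table after processing: DOG 4, CAT 4, RAT 2
Indexer requests after processing: DOG? CAT? RAT?
Returned to the indexers (lexicon): DOG:4 CAT:4 RAT:2 and DOG:4 CAT:4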
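A minimal sketch of the statistician's role in the FL strategy: it accumulates the summaries sent at each flush in an in-memory hash table and answers lookups once processing is finished. The (term, count) pairs are those shown on the slide; how they group into individual flushes is not recoverable from the figure, so the grouping below is an assumption, as are the class and method names.

```python
class Statistician:
    """Keeps global term counts in an in-memory hash table."""

    def __init__(self):
        self.table = {}

    def report(self, summary):
        """Called by an indexer when it flushes a sorted run."""
        for term, count in summary:
            self.table[term] = self.table.get(term, 0) + count

    def lookup(self, terms):
        """Called by an indexer after processing to fill its lexicon."""
        return {term: self.table[term] for term in terms}

stat = Statistician()
# Grouping of the pairs into flushes is assumed for illustration.
flushes = [
    [("cat", 1), ("dog", 2)],
    [("cat", 1), ("dog", 1)],
    [("cat", 2)],
    [("rat", 2), ("dog", 1)],
]
for summary in flushes:
    stat.report(summary)

print(stat.lookup(["dog", "cat", "rat"]))
print(stat.lookup(["dog", "cat"]))
```

Unlike the ME sketch, the statistician here must hold counts for all terms at once, which is the memory cost that the hash table in the figure represents.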

Summary: ME vs. FL strategy
General observations:
- Relatively low overhead (both strategies)
- Confirmed experimentally ("less than 5% for a 2 million page collection")
Summary of characteristics (+/-), rated for parallelism, memory usage, and statistician load:
- ME (merging): + -+
- FL (flushing): ++--

The WebBase System - Summary
[Figure: WebBase architecture diagram]

References - Indexing
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. Chapter 4 (Indexing)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 4 (System Anatomy)
[3] S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina: "Building a Distributed Full-Text Index for the Web", ACM Transactions on Information Systems, Vol. 19/3, July 2001