Web Search – Summer Term 2006 VI. Web Search - Indexing (c) Wolfgang Hürst, Albert-Ludwigs-University.


1 Web Search – Summer Term 2006 VI. Web Search - Indexing (c) Wolfgang Hürst, Albert-Ludwigs-University

2 Indexing in the 1st Google engine
- Parsing of the HTML pages in the repository
- Indexing of the document
- Store indexed docs in barrels
- Encode words as wordIDs
- Create lexicon that maps words to wordIDs
- Store hit lists in forward barrels (Note: Indexing process is parallelized)
- Sorting
- Sort anchor and title hits from the forward barrels into inverted barrels, and all other hits into full-text inverted barrels
Now: Description of the major data structures
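The steps above can be sketched in a few lines of Python. This is a toy illustration, not the original engine: the function names `build_forward_index` and `invert` are hypothetical, tokenization is plain whitespace splitting, a "hit" is reduced to a word position, and everything lives in one in-memory dictionary instead of multiple barrels.

```python
from collections import defaultdict

def build_forward_index(docs, lexicon):
    """Forward index: docID -> {wordID: [positions]}.

    The lexicon (word -> wordID) grows as new words are seen.
    """
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(position)
        forward[doc_id] = dict(hits)
    return forward

def invert(forward):
    """Sorting step: regroup forward entries by wordID.

    Result: wordID -> [(docID, [positions])], ordered by docID.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id].items():
            inverted[word_id].append((doc_id, positions))
    return dict(inverted)

# Hypothetical two-document collection.
docs = {1: "web search engine", 2: "search the web index"}
lexicon = {}
forward = build_forward_index(docs, lexicon)
inverted = invert(forward)
```

The forward index is cheap to write while parsing, because each document is processed once; the inverted index is what a searcher needs, which is why the sorting step exists.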

3 Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
Components: URL SERVER, CRAWLERS, STORE SERVER, REPOSITORY, INDEXER, DUMP LEXICON, SORTERS, ANCHORS, URL RESOLVER, LINKS, DOC INDEX, LEXICON, PAGERANK, BARRELS, SEARCHER
REPOSITORY: one record per page with the fields DOCID, ECODE, URL_LEN, PAGE_LEN, URL, PAGE
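A length-prefixed record like the one in the repository can be sketched with Python's `struct` module. The field widths in `HEADER` and the helper names `pack_record` and `unpack_record` are assumptions made for illustration; the slide names the fields but not their sizes, and compression is left out.

```python
import struct

# docid, ecode, url_len, page_len; little-endian, widths are assumed
HEADER = struct.Struct("<IHHI")

def pack_record(doc_id, ecode, url, page):
    """Serialize one page as header + URL bytes + page bytes."""
    url_b = url.encode("utf-8")
    page_b = page.encode("utf-8")
    return HEADER.pack(doc_id, ecode, len(url_b), len(page_b)) + url_b + page_b

def unpack_record(buf, offset=0):
    """Read one record at offset; return (fields, offset of next record)."""
    doc_id, ecode, url_len, page_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    page_start = start + url_len
    page = buf[page_start:page_start + page_len].decode("utf-8")
    return (doc_id, ecode, url, page), page_start + page_len

# Two records written back to back, then read sequentially.
repo = pack_record(7, 0, "http://example.org/", "<html>hello</html>")
repo += pack_record(8, 0, "http://example.org/a", "<html>world</html>")
first, nxt = unpack_record(repo)
second, end = unpack_record(repo, nxt)
```

Storing the lengths ahead of the variable-size fields lets a reader skip from record to record without parsing the page contents.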

4 Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
DOCUMENT INDEX: DOCID ->
- CURRENT DOCUMENT STATUS
- POINTER TO REPOSITORY
- DOCUMENT CHECKSUM
- VARIOUS STATISTICS
- DOCUMENT INFO (URL + TITLE) IF DOCUMENT HAS BEEN CRAWLED
- POINTER TO URL LIST OTHERWISE
ADDITIONAL FILE TO CONVERT URLS TO DOCIDs: URL CHECKSUM -> DOCID
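As a rough sketch, the document index and the URL-to-docID file can be modelled as two dictionaries. The names `DocEntry`, `docid_for` and `mark_crawled` are hypothetical, CRC-32 stands in for whatever checksum the real system used, and checksum collisions are ignored.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class DocEntry:
    status: str                              # "crawled" or "uncrawled"
    repository_offset: Optional[int] = None  # pointer to repository
    checksum: Optional[int] = None           # document checksum
    url: Optional[str] = None                # document info, crawled docs only
    title: Optional[str] = None
    url_list_offset: Optional[int] = None    # pointer to URL list otherwise

doc_index: Dict[int, DocEntry] = {}   # docID -> entry
url_to_docid: Dict[int, int] = {}     # URL checksum -> docID

def url_checksum(url: str) -> int:
    return zlib.crc32(url.encode("utf-8"))

def docid_for(url: str, url_list_offset: int) -> int:
    """Return the docID for a URL, creating an uncrawled entry if needed."""
    key = url_checksum(url)
    if key not in url_to_docid:
        doc_id = len(doc_index)
        url_to_docid[key] = doc_id
        doc_index[doc_id] = DocEntry("uncrawled", url_list_offset=url_list_offset)
    return url_to_docid[key]

def mark_crawled(doc_id: int, url: str, title: str, page: str, repository_offset: int) -> None:
    """Replace the entry once the page has been fetched and stored."""
    doc_index[doc_id] = DocEntry(
        "crawled", repository_offset, zlib.crc32(page.encode("utf-8")), url, title
    )
```

For example, `docid_for("http://example.org/", 0)` assigns docID 0 on the first call and returns the same ID on every later call for that URL.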

5 Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
ANCHORS: SOURCE, DESTINATION, AND ANCHOR TEXT
LINKS: PAIRWISE DOCIDS

6 Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
LEXICON: WORDID, NDOCS, ...
INVERTED INDEX (WORD -> DOCUMENT), stored in INVERTED BARRELS: for each WORDID, a list of entries DOCID, NO-OF-HITS, HIT1, HIT2, ...
FORWARD INDEX (DOCUMENT -> WORD): for each DOCID, a list of entries WORDID, NO-OF-HITS, HIT1, HIT2, ..., terminated by a NULL WORDID
HITS:
- FANCY HIT (URL, TITLE, ANCHOR TEXT, META TAG): CAPITALIZATION, FONTSIZE, TYPE, POSITION IN DOCUMENT
- PLAIN HIT (EVERYTHING ELSE): CAPITALIZATION, FONTSIZE, POSITION IN DOCUMENT
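The two hit types can be packed into a fixed-size integer. The slide lists only the fields; the 16-bit layout below (1 capitalization bit, 3 font-size bits, then either 12 position bits or 4 type bits plus 8 position bits, with font size 7 marking a fancy hit) follows the description in [2] and should be read as an assumption here. The function names are hypothetical.

```python
FANCY_FONT = 0b111  # reserved font-size value that marks a fancy hit

def pack_plain(cap, font, pos):
    """Plain hit: cap(1) | font(3) | position(12)."""
    if not 0 <= font < FANCY_FONT:
        raise ValueError("font size 7 is reserved for fancy hits")
    return (int(cap) << 15) | (font << 12) | min(pos, 0xFFF)

def pack_fancy(cap, hit_type, pos):
    """Fancy hit: cap(1) | 111 | type(4) | position(8)."""
    return (int(cap) << 15) | (FANCY_FONT << 12) | ((hit_type & 0xF) << 8) | min(pos, 0xFF)

def unpack(hit):
    """Decode a 16-bit hit back into its fields."""
    cap = bool(hit >> 15)
    font = (hit >> 12) & 0b111
    if font == FANCY_FONT:
        return {"kind": "fancy", "cap": cap, "type": (hit >> 8) & 0xF, "pos": hit & 0xFF}
    return {"kind": "plain", "cap": cap, "font": font, "pos": hit & 0xFFF}
```

Positions that do not fit the available bits are clamped to the largest representable value, so a word far into a long document still produces a valid hit.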

7 Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
PAGERANK

8 Query Processing
- LEXICON: WORDID, NDOCS
- INVERTED INDEX / BARRELS: DOCID, NO-OF-HITS, HIT1, HIT2, ...
- HITLIST: CAPITALIZATION, FONTSIZE, TYPE, POS. IN DOC
- PAGERANK
- DOCUMENT INDEX: DOCID -> CURRENT DOCUMENT STATUS, POINTER TO REPOSITORY, DOCUMENT CHECKSUM, VARIOUS STATISTICS, DOCUMENT INFO (URL + TITLE)
- REPOSITORY: DOCID, ECODE, URLLEN, PAGELEN, URL, PAGE
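The lookup chain on this slide (lexicon, then inverted barrels, then hit lists and PageRank, then document index) can be sketched as a single function. The scoring formula (number of hits times PageRank) is a placeholder chosen for brevity, not the ranking function of the original engine, and all data below is made up.

```python
def search(query, lexicon, inverted, pagerank, doc_index):
    """Return (url, score) pairs for documents containing every query word."""
    words = query.lower().split()
    if not words or any(w not in lexicon for w in words):
        return []
    # One posting map per query word: docID -> hit list.
    postings = [dict(inverted[lexicon[w]]) for w in words]
    candidates = set(postings[0]).intersection(*postings[1:])
    scored = []
    for doc_id in candidates:
        hit_count = sum(len(p[doc_id]) for p in postings)
        scored.append((hit_count * pagerank.get(doc_id, 0.0), doc_id))
    # Highest score first; ties broken by docID for a stable order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(doc_index[doc_id]["url"], score) for score, doc_id in scored]

# Hypothetical index contents.
lexicon = {"web": 0, "search": 1}
inverted = {0: [(1, [0]), (2, [2])], 1: [(1, [1]), (2, [0, 5])]}
pagerank = {1: 0.2, 2: 0.5}
doc_index = {1: {"url": "http://a.example/"}, 2: {"url": "http://b.example/"}}

results = search("web search", lexicon, inverted, pagerank, doc_index)
```

Only the document index is consulted for the final result list; the repository is needed only if the stored page itself must be shown.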

9 Further reading
Note: This was information from a paper from 1998 (with a collection of 25 million pages).
Newer information about the infrastructure and data structures used by Google (today?) can be found in the following references:
- Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System
- Luiz Andre Barroso, Jeffrey Dean, Urs Hoelzle: Web Search for a Planet: The Google Cluster Architecture
which are available at http://labs.google.com/papers/

10 References - Indexing
[1] A. ARASU, J. CHO, H. GARCIA-MOLINA, A. PAEPCKE, S. RAGHAVAN: "SEARCHING THE WEB", ACM TRANSACTIONS ON INTERNET TECHNOLOGY, VOL. 1/1, AUG. 2001. Chapter 4 (Indexing)
[2] S. BRIN, L. PAGE: "THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE", WWW 1998. Chapter 4 (System Anatomy)
[3] S. MELNIK, S. RAGHAVAN, B. YANG, H. GARCIA-MOLINA: "BUILDING A DISTRIBUTED FULL-TEXT INDEX FOR THE WEB", ACM TRANSACTIONS ON INFORMATION SYSTEMS, VOL. 19/3, JULY 2001

