Web Search – Summer Term 2006 VII. Web Search - Indexing: Structure Index (c) Wolfgang Hürst, Albert-Ludwigs-University.

1 Web Search – Summer Term 2006 VII. Web Search - Indexing: Structure Index (c) Wolfgang Hürst, Albert-Ludwigs-University

2 Structure Index (Links)
Represents the links between the indexed pages.
Important for
- Relevance calculation (PageRank, HITS, ...)
- Crawling (importance metrics, ...)
and some other applications (Web mining, etc.)
Most critical issues (again):
- Size and rate of change
Most important requirements:
- Reduce space / compression
- Support required operations (random and streaming access, add / delete)
- Speed

3 The Web Graph
The structure index represents the web graph:
- Node = web page
- Directed edge = link
[Figure: example graph with nodes 1, 2, 3]
Common representation techniques for graphs:
a) Adjacency matrix

4 The Web Graph
The structure index represents the web graph:
- Node = web page
- Directed edge = link
[Figure: example graph with nodes 1, 2, 3]
Common representation techniques for graphs:
b) Adjacency list
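The two representations named on slides 3 and 4 can be sketched as follows. The edge set of the pictured graph is not recoverable from the transcript, so EDGES below is an assumed example, and the function names are illustrative only.

```python
# Hypothetical 3-node graph; the edge set is an assumption for illustration.
NODES = [1, 2, 3]
EDGES = [(1, 2), (1, 3), (3, 2)]  # (source page, target page)

def adjacency_matrix(nodes, edges):
    """Return an n x n 0/1 matrix; entry [i][j] is 1 if nodes[i] links to nodes[j]."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix

def adjacency_list(nodes, edges):
    """Return a dict mapping each node to the list of its successors."""
    adjacency = {node: [] for node in nodes}
    for src, dst in edges:
        adjacency[src].append(dst)
    return adjacency

matrix = adjacency_matrix(NODES, EDGES)
lists = adjacency_list(NODES, EDGES)
```

The matrix needs n * n entries regardless of how many links exist, while the list needs space proportional to the number of links, which is why list-based layouts are the usual starting point for a sparse graph such as the web.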

5 The Structure Index
Example: The Connectivity Server [3]
Based on a data structure that supports the following operations:
- Given a URL u (or a set of URLs U), return a list of pages that point to u (U), i.e. its predecessors (back links), and a list of pages that are pointed to from u (U), i.e. its successors (forward links)
- Given a set of URLs U and a distance, return the respective neighborhood of U in the graph
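A minimal in-memory sketch of the operations listed on slide 5. The class name LinkIndex, its methods, and the example links are hypothetical and are not the Connectivity Server's actual interface; the neighborhood is assumed here to follow links in both directions.

```python
from collections import defaultdict, deque

class LinkIndex:
    """Toy structure index: forward and backward links kept in dictionaries."""

    def __init__(self, links):
        self.out = defaultdict(set)   # url -> successors (forward links)
        self.inc = defaultdict(set)   # url -> predecessors (back links)
        for src, dst in links:
            self.out[src].add(dst)
            self.inc[dst].add(src)

    def successors(self, urls):
        return set().union(*(self.out.get(u, set()) for u in urls))

    def predecessors(self, urls):
        return set().union(*(self.inc.get(u, set()) for u in urls))

    def neighborhood(self, urls, distance):
        """Breadth-first search up to `distance` links away, in either direction."""
        seen = set(urls)
        frontier = deque((u, 0) for u in urls)
        while frontier:
            url, depth = frontier.popleft()
            if depth == distance:
                continue
            for nxt in self.out.get(url, set()) | self.inc.get(url, set()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        return seen

# Hypothetical link data.
index = LinkIndex([("a", "b"), ("b", "c"), ("c", "d"), ("e", "b")])
```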

6 The Connectivity Server
Nodes: Array (1 node = 1 element)
Edges:
- OUTLIST: Adjacency list (successors)
- INLIST: Inverted adjacency list (predecessors)
[Figure: NODE TABLE whose entries each hold a pointer to the URL (in the URL database), a pointer to the INLIST, and a pointer to the OUTLIST; the pointers lead into the INLIST TABLE and the OUTLIST TABLE]
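The layout on slide 6 can be approximated with flat arrays. This is a sketch under assumptions: node IDs are taken to be 0..n-1, each list is assumed to end where the next node's list begins, and build_tables and the field names are illustrative rather than taken from [3].

```python
def flatten(lists):
    """Concatenate per-node lists into one table plus a start-offset array."""
    ptr, table = [], []
    for node_list in lists:
        ptr.append(len(table))        # where this node's list starts
        table.extend(sorted(node_list))
    ptr.append(len(table))            # sentinel: end of the last list
    return ptr, table

def build_tables(num_nodes, edges):
    out_lists = [[] for _ in range(num_nodes)]
    in_lists = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        out_lists[src].append(dst)    # successor of src
        in_lists[dst].append(src)     # predecessor of dst
    out_ptr, out_table = flatten(out_lists)
    in_ptr, in_table = flatten(in_lists)
    return {"out_ptr": out_ptr, "out_table": out_table,
            "in_ptr": in_ptr, "in_table": in_table}

def outlist(tables, node):
    start, end = tables["out_ptr"][node], tables["out_ptr"][node + 1]
    return tables["out_table"][start:end]

def inlist(tables, node):
    start, end = tables["in_ptr"][node], tables["in_ptr"][node + 1]
    return tables["in_table"][start:end]

# Hypothetical graph with three nodes.
tables = build_tables(3, [(0, 1), (0, 2), (2, 1)])
```

Keeping both tables doubles the edge storage but makes predecessor and successor lookups equally cheap.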

7 The Connectivity Server (cont.)
Additional data structure to map URLs to IDs (and vice versa)
ID = index in the lexicographically sorted list of all crawled URLs
Advantage: Compression via delta encoding (each entry stores only the length of the prefix it shares with the previous URL, plus the remaining suffix)
Example:
ORIGINAL TEXT                  DELTA ENCODING
WWW.FOOBAR.COM/                0   WWW.FOOBAR.COM/   1
WWW.FOOBAR.COM/GANDALF.HTM     15  GANDALF.HTM       26
WWW.FOOGRAB.COM/               7   GRAB.COM/         41
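The prefix-sharing idea in the example on slide 7 can be sketched as below. Only the shared-prefix length and the suffix are modeled; the third number in each slide entry is left out. The function names are illustrative.

```python
def shared_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def delta_encode(sorted_urls):
    """Encode each URL as (length of prefix shared with previous URL, suffix)."""
    encoded, previous = [], ""
    for url in sorted_urls:
        k = shared_prefix_len(previous, url)
        encoded.append((k, url[k:]))
        previous = url
    return encoded

def delta_decode(encoded):
    """Rebuild the URL list; each entry depends on the one before it."""
    urls, previous = [], ""
    for k, suffix in encoded:
        previous = previous[:k] + suffix
        urls.append(previous)
    return urls

urls = ["WWW.FOOBAR.COM/", "WWW.FOOBAR.COM/GANDALF.HTM", "WWW.FOOGRAB.COM/"]
encoded = delta_encode(urls)
```

Because every entry is defined relative to its predecessor, decoding a single URL means replaying the entries before it.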

8 The Connectivity Server (cont.)
Problem: Because of delta encoding, looking up a URL requires scanning the list from the beginning (i.e. space is saved at the cost of speed)
Solution: Include checkpoint URLs (entries stored in full at regular intervals)
Another problem: Updates are hard to do
Several other (newer) approaches exist that take into account (e.g.) the actual web structure
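One way the checkpoint idea could look in code: every interval-th URL is stored in full, so a lookup binary-searches the checkpoints and then decodes only one short run. The interval, the function names, and the fourth URL are assumptions for illustration, not details from [3].

```python
from bisect import bisect_right

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def encode_with_checkpoints(sorted_urls, interval):
    """Delta-encode, but store every `interval`-th URL in full as a checkpoint."""
    entries, checkpoints, previous = [], [], ""
    for i, url in enumerate(sorted_urls):
        if i % interval == 0:
            checkpoints.append((url, i))      # (full URL, its ID)
            entries.append((0, url))
        else:
            k = shared_prefix_len(previous, url)
            entries.append((k, url[k:]))
        previous = url
    return entries, checkpoints

def url_to_id(url, entries, checkpoints):
    """Return the ID (index in sorted order) of `url`, or None if absent."""
    keys = [full for full, _ in checkpoints]
    pos = bisect_right(keys, url) - 1         # last checkpoint <= url
    if pos < 0:
        return None
    start = checkpoints[pos][1]
    end = checkpoints[pos + 1][1] if pos + 1 < len(checkpoints) else len(entries)
    current = ""
    for i in range(start, end):               # decode only this run
        k, suffix = entries[i]
        current = current[:k] + suffix
        if current == url:
            return i
    return None

# The last URL is a made-up addition to the slide's example.
urls = ["WWW.FOOBAR.COM/", "WWW.FOOBAR.COM/GANDALF.HTM",
        "WWW.FOOGRAB.COM/", "WWW.FOOGRAB.COM/X.HTM"]
entries, checkpoints = encode_with_checkpoints(urls, interval=2)
```

A smaller interval speeds up lookups but stores more full URLs, so the interval is the knob for the space/speed trade-off named on the slide.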

9 S-Node Representation [4]
Observations on the web structure:
- Link copying: Lots of clusters with nodes containing very similar adjacency lists
- Domain and URL locality: A significant fraction of links on a page point to pages from the same domain
- Page similarity: Pages that have very similar adjacency lists are likely to be related
Idea: Make use of these observations, e.g. by grouping related pages / similar URLs

10 S-Node Representation - Example
Partition P = {N1, N2, N3}
N1 = {P1, P2}
N2 = {P3}
N3 = {P4, P5}
[Figure: web graph on pages 1-5, the intranode graphs Ni, and the supernode graph on N1, N2, N3]

11 S-Node Representation - Example
Partition P = {N1, N2, N3}
N1 = {P1, P2}
N2 = {P3}
N3 = {P4, P5}
[Figure: web graph on pages 1-5, the intranode graphs Ni, the supernode graph on N1, N2, N3, and the positive and negative superedge graphs]
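The pieces named in the example (intranode graphs, supernode graph, positive and negative superedges) can be sketched as follows. The sketch assumes that a positive superedge graph lists the page-level links that exist from Ni to Nj, that a negative one lists the possible links that are absent, and that whichever is smaller gets stored. The edge set is hypothetical because the figure's edges are not recoverable from the transcript; build_snode is an illustrative name.

```python
from itertools import product

def build_snode(edges, partition):
    """Split a page graph into intranode graphs, a supernode graph and superedges."""
    page_to_node = {p: n for n, pages in partition.items() for p in pages}
    intranode = {n: set() for n in partition}
    positive = {}
    for src, dst in edges:
        a, b = page_to_node[src], page_to_node[dst]
        if a == b:
            intranode[a].add((src, dst))          # link inside one supernode
        else:
            positive.setdefault((a, b), set()).add((src, dst))
    supernode_graph = set(positive)               # Ni -> Nj if any page link exists
    superedges = {}
    for (a, b), pos in positive.items():
        negative = set(product(partition[a], partition[b])) - pos
        # Store the smaller of the two encodings.
        if len(negative) < len(pos):
            superedges[(a, b)] = ("neg", negative)
        else:
            superedges[(a, b)] = ("pos", pos)
    return supernode_graph, intranode, superedges

partition = {"N1": [1, 2], "N2": [3], "N3": [4, 5]}
# Hypothetical page-level links.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (1, 4), (1, 5), (2, 4), (4, 5)]
supernode_graph, intranode, superedges = build_snode(edges, partition)
```

In this example N1 links to N3 through three of the four possible page pairs, so recording the single missing pair (2, 5) is shorter than recording the three present ones.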

12 Creating partitions
1. Initial partition: Based on URL (top two levels of DNS), e.g.
- www.informatik.uni-freiburg.de
- ad.informatik.uni-freiburg.de
- www.imtek.uni-freiburg.de
2. URL split: Split the Ni based on URL prefixes, e.g.
- www.informatik.uni-freiburg.de/students
- www.informatik.uni-freiburg.de/studienberatung
3. Clustered split: Use a clustering algorithm to split partitions into groups with similar adjacency lists
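Steps 1 and 2 above can be sketched with the standard library. This is a simplification: taking the last two host labels ignores suffixes such as co.uk, the helper names are hypothetical, and the index.html paths are made up to extend the slide's examples. Step 3 (clustering on adjacency lists) is not covered.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def top_two_levels(url):
    """Naive 'top two DNS levels': the last two labels of the host name."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def url_prefix(url, depth):
    """Host name plus the first `depth` path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.hostname or ""] + segments[:depth])

def group_by(urls, key):
    groups = defaultdict(list)
    for url in urls:
        groups[key(url)].append(url)
    return dict(groups)

urls = [
    "http://www.informatik.uni-freiburg.de/students/index.html",
    "http://www.informatik.uni-freiburg.de/studienberatung/index.html",
    "http://ad.informatik.uni-freiburg.de/",
    "http://www.imtek.uni-freiburg.de/",
]
initial = group_by(urls, top_two_levels)                       # step 1
by_host = group_by(urls, lambda u: url_prefix(u, 0))           # URL split by host
by_dir = group_by(urls, lambda u: url_prefix(u, 1))            # one path level deeper
```

Each URL split refines the previous grouping by one more prefix level, so the partition can be grown step by step until the groups are small enough.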

13 References - Indexing
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001, Chapter 4 (Indexing)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998, Chapter 4 (System Anatomy)
[3] Bharat, Broder, Henzinger, Kumar, Venkatasubramanian: "The Connectivity Server: Fast Access to Linkage Information on the Web", WWW 1998
[4] Raghavan, Garcia-Molina: "Representing Web Graphs", Stanford Technical Report 2002

14 General Web Search Engine Architecture
[Figure: architecture diagram with the components Client, Query Engine, Ranking, Crawl Control, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, and Indexes (Structure, Utility, Text); labeled flows: Queries, Results, Usage Feedback, WWW. Cf. [1], Fig. 1]

15 The evolution of search engines
1st generation: Use only "on page", text data
- Word frequency, language
1995-1997 (AltaVista, Excite, Lycos, etc.)
2nd generation: Use off-page, web-specific data
- Link (or connectivity) analysis
- Click-through data (what results people click on)
- Anchor text (how people refer to a page)
From 1998 (made popular by Google, but used by everyone now)
(Tutorial on Search from the Web to the Enterprise, SIGIR 2002)

16 The evolution of search engines
3rd generation: Answer the need behind the query (still experimental)
Semantic analysis
- What is it about?
Focus on user need, rather than on query
- Corpus reflects user needs / expectations
- Integrates multiple sources of data
- Help the user create a good query
Context determination
- Spatial (user location / target location)
- Query stream (previous queries)
- Personal (user profile)
- Explicit (vertical search)
- Implicit (on altavista.de)
(Tutorial on Search from the Web to the Enterprise, SIGIR 2002)

17 The evolution of search engines
3rd generation: Answer the need behind the query (cont.) (still experimental)
Helping the user
- UI, spell checking, query refinement, query suggestion, syntax-driven feedback, context help, context transfer, etc.
Integration of search and text analysis
(Tutorial on Search from the Web to the Enterprise, SIGIR 2002)

18 3rd generation: Answer the need behind the query
Example: Google

19 Web Search Lecture - Schedule
1. Classic IR (Basics)
2. Classic IR Exercises
3. Web Search (Basics)
4. Web Search Exercises [June 28th till July 12th]
5. Web Search (Selected Topics) [July 18th till July 26th]

20 Web Search – Summer Term 2006 Web Search Basics - (Programming) Exercises (c) Wolfgang Hürst, Albert-Ludwigs-University

22 Programming Exercises
- Exercise sheet 1: Tools, Library (Lucene)
- Exercise sheet 2: Database (and text index)
- Exercise sheet 3: Index (structure index)
- Exercise sheet 4: Search (link-based ranking)

24 New Lecturnity Player
Advanced replay features (developed by us):
- Modification of replay speed (while preserving the pitch of the voice)

