Improving Hypertext Data using Pagelets and Templates Ziv Bar-Yossef U.C. Berkeley and IBM Almaden Sridhar Rajagopalan IBM Almaden 1.

1 Improving Hypertext Data using Pagelets and Templates Ziv Bar-Yossef U.C. Berkeley and IBM Almaden Sridhar Rajagopalan IBM Almaden 1

2 Non-Relevant Data on the Web A fundamental problem on the Web: many pages contain lots of non-relevant data. “Non-relevant”: not directly related to the main topic / functionality of the page. Example: 2

3 Hypertext IR Principles Underlying principles of all link based IR tools: Relevant Linkage Principle [Kleinberg 1997] –p links to q ⇒ q is relevant to p Topical Unity Principle [Kessler 1963, Small 1973] –q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other Lexical Affinity Principle [Maarek et al. 1991] –The closer the links to q1 and q2 are, the stronger the relation between them 3

4 Example: HITS & Clever [Kleinberg 1997, Chakrabarti et al. 1998] Uses the Relevant Linkage Principle –All links propagate score from hubs to authorities and vice versa Uses the Topical Unity Principle –Co-cited authorities propagate score to each other Clever uses the Lexical Affinity Principle –Text around links is used to weight relevance of the links (Figure: Hubs, Authorities) 4
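
The score propagation between hubs and authorities can be sketched as below. This is a minimal illustration, assuming a small in-memory link graph passed as a dict; the function name `hits`, the input format, and the fixed iteration count are illustrative choices that do not come from the slides, and Clever's text-based link weighting is left out.

```python
import math


def hits(links, iterations=50):
    """links: dict mapping page -> list of pages it links to (assumed format)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub update: sum the authority scores of all pages linked to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth
```

Because every link propagates score, a navigational or advertisement link contributes exactly as much as a topical one, which is where the violations discussed later in the talk enter.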

5 Example: Focused Crawler [Chakrabarti et al. 1999] Goal –fetch pages relevant to a given topic Technique –Order already crawled pages according to relevance to the topic –Crawl over the links from the top page in the list –Remove top page from the list Uses the Relevant Linkage Principle and the Topical Unity Principle –All the links from the top page are assumed relevant to the topic 5
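
The crawl loop on this slide can be sketched with a priority queue. The callbacks `fetch_links` and `relevance` are hypothetical stand-ins for a real page fetcher and topic classifier, and `max_pages` is an assumed stopping rule; none of these names come from the slides.

```python
import heapq


def focused_crawl(seeds, fetch_links, relevance, max_pages=100):
    """
    seeds: starting URLs.
    fetch_links(url) -> list of out-link URLs (hypothetical callback).
    relevance(url) -> topic relevance score of the page (hypothetical callback).
    Returns the crawled URLs in the order they were fetched.
    """
    order = list(dict.fromkeys(seeds))  # de-duplicated seeds, order preserved
    crawled = set(order)
    # heapq is a min-heap, so relevance is negated to pop the best page first.
    heap = [(-relevance(u), u) for u in order]
    heapq.heapify(heap)
    while heap and len(crawled) < max_pages:
        _, top = heapq.heappop(heap)  # remove the top page from the list
        for url in fetch_links(top):  # all of its links are assumed relevant
            if url not in crawled and len(crawled) < max_pages:
                crawled.add(url)
                order.append(url)
                heapq.heappush(heap, (-relevance(url), url))
    return order
```

Note that every out-link of the top page is followed, so a template's navigation bar on a highly relevant page pulls in off-topic pages just as readily as its content links do.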

6 Link Based Web IR Tools Search algorithms –HITS and Clever [Kleinberg 1997, Chakrabarti et al. 1998] –Google [Brin and Page 1998] –SALSA [Lempel and Moran 2000] Finding similar pages –Co-Citation [Dean and Henzinger 1999] Hypertext classification –Hyperclass [Chakrabarti et al. 1998] Focused crawling –FOCUS [Chakrabarti et al. 1999] Page clustering –[Modha and Spangler 2000] 6

7 Violations of the Hypertext IR Principles Frequent violations of all hypertext IR principles Violations are caused by systematic phenomena on the Web Violations significantly deteriorate accuracy of the hypertext IR tools 7

8 Violations of Relevant Linkage Principle Navigational links, download links, advertisement links, endorsement links, spam links 8

9 Violations of Topical Unity Principle Violations of the Relevant Linkage Principle Bookmark pages –Kjhan's Bookmark Lists General resource lists –Links of Interest to Electrical Engineers Personal homepages –Ron Fagin's Home Page 9

10 Violations of Lexical Affinity Principle Alphabetical index lists –Computer and Communication Companies ("M" entries) HTML representation –Adjacent cells in the same column are far from each other in the HTML text 10

11 Templates Semantic Definition: A template is a master HTML shell page that is used as a basis for composing new pages –Content of new pages plugged into template shell –All pages share common look & feel –Example: Usually controlled by a central authority –Not necessarily confined to a single site (e.g., Amazon and …) May include a variety of data –Navigational bars –Advertisements –Company info and policies 11

12 Why are Templates Bad for IR Tools? Violate the hypertext IR principles –Relevant linkage principle –Topical unity principle Extremely common –Became standard in web site design 12

13 IR Tool Problems Generalization –Search for “Frequency Division Multiplexing” and get back general Electrical Engineering sites Topic drift –Search for “Finite Model Theory” and get SF 49’ers fan web sites Irrelevance –Get “Yahoo” as a result regardless of the query Bias –Search for “computing companies” and get Microspy highly ranked 13

14 Hypertext Improvement Problem Main Goal: develop hypertext processing techniques that: –automatically improve hypertext data (remove violations of the Hypertext IR principles) –are efficient and scalable (quickly process millions of pages) 14

15 Hypertext Cleaning Web Crawler → Hypertext Cleaner → IR Tool 15

16 Previous Hypertext Improvement Techniques Heuristics –Ignore intra-site (“nepotistic”) links [Kleinberg 1997] –Ignore links to popular sites (“stop sites”) [Chakrabarti et al. 1998, Bharat and Henzinger 1998] Query dependent techniques –Weight links according to relevance to query [Chakrabarti et al. 1999, Bharat and Henzinger 1998] Pre-processing techniques –Eliminate duplicate pages [Broder et al. 1997] –Identify “noisy” links automatically [Davison 2000] 16

17 Pagelets Semantic Definition: A pagelet is a maximal region of a page that has a single topic or functionality –Not too large: has only one topic / functionality –Not too small: any larger region that contains it has other topics / functionalities Example: 17

18 IR with Pagelets Main Idea 1: Use pagelets rather than pages as atomic units for information retrieval –Satisfy Relevant Linkage Principle –Satisfy Topical Unity Principle 18

19 IR with Pagelets (cont.) Drawbacks –Lose some semantic data latent in pages –No natural link structure on pagelets Issues –How to divide a page into pagelets? –How to adapt IR tools to work with pagelets? 19

20 Pagelets: Syntactic Definition A pagelet is a node in the HTML parse tree of a page satisfying the following: –Its HTML tag is one of the following:,,,,,, … –It contains at least 3 hyperlinks –None of its children is a pagelet 20
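
The syntactic definition can be sketched over a simple parse tree built with Python's standard `html.parser`. The list of qualifying tags did not survive in this transcript, so `ASSUMED_PAGELET_TAGS` below is a hypothetical placeholder rather than the authors' list. The sketch also reads "children" literally as direct children, and its tree builder is deliberately simple and not a full HTML5 parser.

```python
from html.parser import HTMLParser

# Hypothetical tag set: the slide's actual list is missing from the transcript.
ASSUMED_PAGELET_TAGS = frozenset({"table", "ul", "div", "p"})
# Elements that never have a closing tag, so they cannot contain children.
VOID_TAGS = frozenset({"br", "hr", "img", "input", "meta", "link"})


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.links, self.is_pagelet = [], 0, False


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag == "a" and any(name == "href" for name, _ in attrs):
            node.links = 1  # this element is itself a hyperlink
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent


def find_pagelets(html, tags=ASSUMED_PAGELET_TAGS, min_links=3):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    pagelets = []

    def visit(node):
        # Post-order walk: children are classified before their parent.
        for child in node.children:
            visit(child)
            node.links += child.links
        node.is_pagelet = (
            node.tag in tags
            and node.links >= min_links
            and not any(c.is_pagelet for c in node.children)
        )
        if node.is_pagelet:
            pagelets.append(node)

    visit(builder.root)
    return pagelets
```

The third condition is what keeps pagelets small: an enclosing element is rejected as soon as one of its children already qualifies.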

21 Template Elimination Main Idea 2: Eliminate pagelets belonging to templates How to recognize templates efficiently? –Templates vs. mirrors –Templates vs. accidental pagelet similarities 21

22 Templates: Syntactic Definition A template is a collection T = (p1,…,pk) of pagelets satisfying: Similarity –p1,…,pk are identical or almost identical Connectivity –Every two pages owning pagelets in T are reachable from each other (undirectedly) through other pages owning pagelets in T. Template Recognition Problem: Given a set of pages S, find all the templates in S. 22

23 Template Recognition in Small Sets In small sets: –hard to validate connectivity requirement –low chance of accidental pagelet similarities Steps: (1) Eliminate duplicate pages from S (2) Calculate shingle(p) for each pagelet p ∈ S (3) Cluster pagelets in S according to shingle (4) Output clusters of size > 1 23
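
The small-set procedure on this slide can be sketched as follows. The slides do not say how shingle(p) is computed, so the `shingle` function here is an assumption: a single min-hash over word 4-grams. The input format (a dict of page id to pagelet texts) and the duplicate-page test (identical pagelet lists) are also illustrative simplifications.

```python
import hashlib
from collections import defaultdict


def shingle(text, w=4):
    """Assumed stand-in for shingle(p): smallest SHA-1 digest over word w-grams."""
    words = text.split()
    grams = [" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))]
    return min(hashlib.sha1(g.encode("utf-8")).hexdigest() for g in grams)


def templates_small_set(pages, w=4):
    """pages: dict page_id -> list of pagelet texts (assumed input format)."""
    # Eliminate duplicate pages; here, pages whose pagelet lists are identical.
    seen, unique = set(), {}
    for page_id, pagelets in pages.items():
        key = tuple(pagelets)
        if key not in seen:
            seen.add(key)
            unique[page_id] = pagelets
    # Calculate the shingle of each pagelet and cluster pagelets by shingle.
    clusters = defaultdict(list)
    for page_id, pagelets in unique.items():
        for index, text in enumerate(pagelets):
            clusters[shingle(text, w)].append((page_id, index))
    # Output clusters of size > 1; each one is a candidate template.
    return [members for members in clusters.values() if len(members) > 1]
```

Removing duplicate pages first matters because a mirrored page would otherwise make every one of its pagelets look like a template.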

24 Template Recognition in Large Sets (1) Calculate shingle(p) for each pagelet p ∈ S (2) Cluster pagelets in S according to shingle (3) Discard clusters of size 1 (4) For each remaining cluster C: –Construct graph G_C of pages that own pagelets in C –Find undirected connected components of G_C –Output components of size > 1 24
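
A sketch of the large-set procedure, assuming shingles are already computed and that hyperlinks between pages are available as (source, destination) pairs. The input formats are assumptions, and "components of size > 1" is read here as components containing more than one page. Rescanning all links for every cluster keeps the sketch short but would be too slow at scale.

```python
from collections import defaultdict, deque


def templates_large_set(pagelet_shingles, links):
    """
    pagelet_shingles: iterable of (page_id, pagelet_id, shingle) triples.
    links: iterable of (src_page, dst_page) hyperlinks.
    Both formats are assumptions made for this sketch.
    """
    clusters = defaultdict(list)
    for page, pagelet, sh in pagelet_shingles:
        clusters[sh].append((page, pagelet))

    templates = []
    for members in clusters.values():
        if len(members) < 2:  # discard clusters of size 1
            continue
        owners = {page for page, _ in members}
        # Undirected graph over the pages that own pagelets in this cluster.
        adjacency = defaultdict(set)
        for src, dst in links:
            if src in owners and dst in owners:
                adjacency[src].add(dst)
                adjacency[dst].add(src)
        # Breadth-first search to find the connected components.
        unvisited = set(owners)
        while unvisited:
            start = unvisited.pop()
            component, queue = {start}, deque([start])
            while queue:
                page = queue.popleft()
                for neighbor in adjacency[page]:
                    if neighbor in unvisited:
                        unvisited.discard(neighbor)
                        component.add(neighbor)
                        queue.append(neighbor)
            if len(component) > 1:
                templates.append(sorted(m for m in members if m[0] in component))
    return templates
```

The connectivity check is what separates a template from an accidental match: two unrelated pages with a similar pagelet land in different components and are not reported.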

25 Scalability Store pages and pagelets in database tables Template recognition & elimination can be carried out by a few cheap database operations –Finding the connected components can be done in main memory using BFS 25
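
As one illustration of a "cheap database operation", clustering by shingle and discarding singleton clusters can be expressed as a single grouped query. The table name `pagelets` and its columns are a hypothetical schema, and an in-memory SQLite database is used only to keep the sketch self-contained; the slides do not name a database or schema.

```python
import sqlite3


def clusters_via_sql(rows):
    """rows: iterable of (page_id, pagelet_id, shingle) tuples (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pagelets (page_id TEXT, pagelet_id TEXT, shingle TEXT)"
    )
    conn.executemany("INSERT INTO pagelets VALUES (?, ?, ?)", rows)
    # Keep only pagelets whose shingle occurs more than once.
    query = """
        SELECT p.shingle, p.page_id, p.pagelet_id
        FROM pagelets AS p
        JOIN (SELECT shingle FROM pagelets
              GROUP BY shingle HAVING COUNT(*) > 1) AS c
          ON p.shingle = c.shingle
        ORDER BY p.shingle, p.page_id, p.pagelet_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

The rows returned for each shingle are the candidate clusters; the connected-components step would then run in memory on each one.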

26 Example: Clever Hubs: all non-template pagelets in the base set Authorities: all pages in the base set 26
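
A minimal sketch of this setup, assuming plain HITS-style updates with pagelets as hubs and pages as authorities. The function name `clean_clever` and its input format are hypothetical, Clever's text-based link weights are omitted, and pages that receive no link from a non-template pagelet are simply left out of the result rather than given a zero score.

```python
import math


def clean_clever(pagelet_links, template_pagelets=frozenset(), iterations=50):
    """
    pagelet_links: dict pagelet_id -> list of page ids it links to (assumed format).
    template_pagelets: ids of pagelets recognized as belonging to templates.
    """
    # Hubs are the non-template pagelets; authorities are the pages they cite.
    hubs = {pl: 1.0 for pl in pagelet_links if pl not in template_pagelets}
    auths = {pg: 0.0 for pl in hubs for pg in pagelet_links[pl]}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pagelets linking to the page.
        new_auths = {pg: 0.0 for pg in auths}
        for pl, score in hubs.items():
            for pg in pagelet_links[pl]:
                new_auths[pg] += score
        norm = math.sqrt(sum(v * v for v in new_auths.values())) or 1.0
        auths = {pg: v / norm for pg, v in new_auths.items()}
        # Hub update: sum of authority scores of the pages a pagelet links to.
        new_hubs = {pl: sum(auths[pg] for pg in pagelet_links[pl]) for pl in hubs}
        norm = math.sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        hubs = {pl: v / norm for pl, v in new_hubs.items()}
    return hubs, auths
```

Dropping template pagelets before the iteration means their links never propagate score, so pages reached only through a navigation bar receive none.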

27 Classical Clever vs. “Clean” Clever 27

28 Classical Clever vs. “Clean” Clever (cont.) 28

29 Classical Clever vs. “Clean” Clever (cont.) 29

30 Conclusions Contributions –Formulation of the hypertext improvement problem –Introduction of pagelets and templates as means of improving hypertext IR –Efficient algorithms for pagelet and template recognition –Demonstration of technique’s effectiveness by improving Clever’s precision Future work –Test the new technique with other IR algorithms –Find new hypertext improvement techniques 30

31 Thank You! 31

32 The Yahoo Template 32

33 The Yahoo Pagelets 33