Hubs and Authorities on the World Wide Web (mostly from Rao’s lecture slides) Presenter: Lei Tang


Desiderata for link-based ranking
A page that is referenced by a lot of important pages (has more back links) is more important (Authority)
– A page referenced by a single important page may be more important than one referenced by five unimportant pages
– No links between competitive authorities (like Ford, Honda)
A page that references a lot of important pages is also important (Hub)
Good authoritative pages (authorities) and good hub pages (hubs) reinforce each other.
“Importance” can be propagated
– Your importance is the weighted sum of the importance conferred on you by the pages that refer to you
– The importance you confer on a page may depend on how many other pages you refer to (cite) (and also on what you say about them when you cite them!)
Different notions of importance

Authority and Hub Pages (1)
The basic idea:
- A page is a good authoritative page with respect to a given query if it is referenced (i.e., pointed to) by many (good hub) pages that are related to the query.
- A page is a good hub page with respect to a given query if it points to many good authoritative pages with respect to the query.
- Good authoritative pages (authorities) and good hub pages (hubs) reinforce each other.

Authority and Hub Pages (2)
Authorities and hubs related to the same query tend to form a bipartite subgraph of the web graph.
A web page can be a good authority and a good hub.
[Figure: bipartite subgraph with hubs on one side and authorities on the other]

Authority and Hub Pages (7)
Operation I: for each page p: $a(p) = \sum_{q : (q, p) \in E} h(q)$
Operation O: for each page p: $h(p) = \sum_{q : (p, q) \in E} a(q)$
[Figure: pages q1, q2, q3 and page p, illustrating Operation I and Operation O]

Authority and Hub Pages (8)
Matrix representation of operations I and O.
Let A be the adjacency matrix of SG: entry (p, q) is 1 if p has a link to q, else the entry is 0.
Let $A^T$ be the transpose of A.
Let $h_i$ be the vector of hub scores after i iterations.
Let $a_i$ be the vector of authority scores after i iterations.
Operation I: $a_i = A^T h_{i-1}$
Operation O: $h_i = A a_i$
Normalize after every multiplication.
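The matrix form above can be sketched in a few lines of plain Python. This is a minimal illustration, not the slide author's implementation: the helper names (`transpose`, `matvec`, `normalize`, `hits`) are hypothetical, and normalization to unit Euclidean length is an assumption (the slide only says "normalize").

```python
import math

def transpose(m):
    """Return the transpose of a square matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    """Multiply matrix m by column vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def normalize(v):
    """Scale v to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def hits(adj, iterations=50):
    """Run Operations I and O `iterations` times on adjacency matrix adj."""
    n = len(adj)
    adj_t = transpose(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        a = normalize(matvec(adj_t, h))  # Operation I: a_i = A^T h_{i-1}
        h = normalize(matvec(adj, a))    # Operation O: h_i = A a_i
    return a, h

# Pages 0 and 1 both link to page 2.
a, h = hits([[0, 0, 1],
             [0, 0, 1],
             [0, 0, 0]])
```

In this tiny graph page 2 collects all the authority weight and pages 0 and 1 share the hub weight equally.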

Authority and Hub Pages (11)
Example: Initialize all scores to 1.
1st iteration:
I operation: a(q1) = 1, a(q2) = a(q3) = 0, a(p1) = 3, a(p2) = 2
O operation: h(q1) = 5, h(q2) = 3, h(q3) = 5, h(p1) = 1, h(p2) = 0
Normalization: a(q1) = 0.267, a(q2) = a(q3) = 0, a(p1) = 0.802, a(p2) = 0.535, h(q1) = 0.645, h(q2) = 0.387, h(q3) = 0.645, h(p1) = 0.129, h(p2) = 0
[Figure: example graph on pages q1, q2, q3, p1, p2]

Authority and Hub Pages (12)
After 2 iterations: a(q1) = 0.061, a(q2) = a(q3) = 0, a(p1) = 0.791, a(p2) = 0.609, h(q1) = 0.656, h(q2) = 0.371, h(q3) = 0.656, h(p1) = 0.029, h(p2) = 0
After 5 iterations: a(q1) = a(q2) = a(q3) = 0, a(p1) = 0.788, a(p2) = (value missing), h(q1) = 0.657, h(q2) = 0.369, h(q3) = 0.657, h(p1) = h(p2) = 0
[Figure: example graph on pages q1, q2, q3, p1, p2]
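The example's figure did not survive, but the first-iteration scores pin down a consistent edge set. The sketch below assumes those inferred edges (q1→p1, q1→p2, q2→p1, q3→p1, q3→p2, p1→q1) and applies Operations I and O directly on the edge list; `EDGES`, `PAGES` and the function names are illustrative, not from the slides.

```python
import math

# Assumption: edges inferred from the slide's first-iteration scores,
# since the original figure is not available.
EDGES = [("q1", "p1"), ("q1", "p2"), ("q2", "p1"),
         ("q3", "p1"), ("q3", "p2"), ("p1", "q1")]
PAGES = ["q1", "q2", "q3", "p1", "p2"]

def l2_normalize(scores):
    """Divide every score by the Euclidean norm of the score vector."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}

def hits_iterations(pages, edges, iterations):
    """Apply Operation I, Operation O, then normalize, `iterations` times."""
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Operation I: a(p) = sum of h(q) over edges (q, p)
        a = {p: sum(h[q] for q, r in edges if r == p) for p in pages}
        # Operation O: h(p) = sum of a(q) over edges (p, q)
        h = {p: sum(a[r] for q, r in edges if q == p) for p in pages}
        a, h = l2_normalize(a), l2_normalize(h)
    return a, h

for k in (1, 2, 5):
    a, h = hits_iterations(PAGES, EDGES, k)
    print(k, {p: round(v, 3) for p, v in a.items()},
          {p: round(v, 3) for p, v in h.items()})
```

Under these assumed edges the scores after one and two iterations agree with the values listed in the example to three decimal places.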

(Why) does the procedure converge?
[Figure: successive iterates x, x_2, ..., x_k]
As we multiply repeatedly by M, the component of x in the direction of the principal eigenvector gets stretched with respect to the other directions, so we finally converge to the direction of the principal eigenvector.
Necessary condition: x must have a component in the direction of the principal eigenvector ($c_1$ must be non-zero).
The rate of convergence depends on the “eigen gap”.
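The argument can be written out as a short derivation. This is a sketch under stated assumptions: M is taken to be symmetric (as $A^T A$ and $A A^T$ are), with orthonormal eigenvectors $e_1, \dots, e_n$ and eigenvalues $\lambda_1 > \lambda_2 \ge \dots \ge \lambda_n \ge 0$; the symbols $e_i$ and $\lambda_i$ are introduced here for illustration.

```latex
x = \sum_{i=1}^{n} c_i e_i
\quad\Longrightarrow\quad
M^k x = \sum_{i=1}^{n} c_i \lambda_i^k e_i
      = \lambda_1^k \left( c_1 e_1
        + \sum_{i=2}^{n} c_i \left( \frac{\lambda_i}{\lambda_1} \right)^{k} e_i \right)
```

Since $\lambda_i / \lambda_1 < 1$ for $i \ge 2$, every term in the sum shrinks as k grows, and after normalization only the $e_1$ direction remains, provided $c_1 \neq 0$. The slowest-decaying term is governed by $\lambda_2 / \lambda_1$, which is why the gap between the two largest eigenvalues sets the rate of convergence.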

Authority and Hub Pages (3)
Main steps of the algorithm for finding good authorities and hubs related to a query q.
1. Submit q to a regular similarity-based search engine. Let S be the set of top n pages returned by the search engine. (S is called the root set and n is often in the low hundreds.)
2. Expand S into a large set T (base set):
   - Add pages that are pointed to by any page in S.
   - Add pages that point to any page in S. If a page has too many parent pages, only the first k parent pages will be used for some k.

Authority and Hub Pages (4)
3. Find the subgraph SG of the web graph that is induced by T.
[Figure: root set S contained in base set T]

Authority and Hub Pages (5)
Steps 2 and 3 can be made easy by storing the link structure of the Web in advance.
Link structure table (built during crawling):
-- Most search engines serve this information now (e.g. Google’s link: search).

parent_url    child_url
url1          url2
url1          url3
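Steps 2 and 3 can be sketched against such a link table. The code below is a minimal illustration, assuming the table is an in-memory list of `(parent_url, child_url)` rows; the function names and the `max_parents` parameter (the "k" of step 2) are hypothetical.

```python
def build_base_set(root_set, link_table, max_parents=50):
    """Expand the root set S into the base set T (step 2).

    link_table is a list of (parent_url, child_url) rows.
    At most max_parents parents are added per root page.
    """
    root = set(root_set)
    base = set(root)
    parents_of = {}
    for parent, child in link_table:
        if parent in root:
            base.add(child)  # pages pointed to by a page in S
        if child in root:
            parents_of.setdefault(child, []).append(parent)
    for parents in parents_of.values():
        base.update(parents[:max_parents])  # only the first k parents
    return base

def induced_subgraph(base, link_table):
    """Keep only the links whose endpoints are both in T (step 3)."""
    return [(p, c) for p, c in link_table if p in base and c in base]

table = [("url1", "url2"), ("url1", "url3"),
         ("url4", "url1"), ("url5", "url1"), ("url6", "url7")]
T = build_base_set({"url1"}, table, max_parents=1)
SG = induced_subgraph(T, table)
```

With `max_parents=1` only the first parent of `url1` found in the table (`url4`) joins the base set, so `url5` and its link are left out of the subgraph.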

USER(41): aaa ;; an adjacency matrix
#2A((0 0 1) (0 0 1) (1 0 0))
USER(42): x ;; an initial vector
#2A((1) (2) (3))
USER(43): (apower-iteration aaa x 2) ;; authority computation, two iterations
[1] USER(44): (apower-iterate aaa x 3) ;; after three iterations
#2A(( ) (0.0) ( ))
[1] USER(45): (apower-iterate aaa x 15) ;; after 15 iterations
#2A(( e-5) (0.0) (1.0))
[1] USER(46): (power-iterate aaa x 5) ;; hub computation, 5 iterations
#2A(( ) ( ) ( ))
[1] USER(47): (power-iterate aaa x 15) ;; 15 iterations
#2A(( ) ( ) ( e-5))
[1] USER(48): Y ;; a new initial vector
#2A((89) (25) (2))
[1] USER(49): (power-iterate aaa Y 15) ;; Magic… same answer after 15 iterations
#2A(( ) ( ) ( e-7))
[Figure: graph on nodes A, B, C]
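The Lisp definitions are not shown, so the following Python sketch only mirrors the session under an assumption: that `apower-iterate` repeatedly multiplies by $A^T A$ and `power-iterate` by $A A^T$, normalizing to unit length each time. That reading is consistent with the surviving output fragments (a middle authority entry of exactly 0.0 and residual entries of order e-5 and e-7). All Python names here are illustrative.

```python
import math

A = [[0, 0, 1],
     [0, 0, 1],
     [1, 0, 0]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(m1, m2):
    cols = list(zip(*m2))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in m1]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def power_iterate(m, x, k):
    """Multiply x by m k times, rescaling to unit length after each step."""
    for _ in range(k):
        x = matvec(m, x)
        norm = math.sqrt(sum(c * c for c in x))
        x = [c / norm for c in x]
    return x

authority_matrix = matmul(transpose(A), A)  # A^T A
hub_matrix = matmul(A, transpose(A))        # A A^T

auth = power_iterate(authority_matrix, [1, 2, 3], 15)
hubs_x = power_iterate(hub_matrix, [1, 2, 3], 15)
hubs_y = power_iterate(hub_matrix, [89, 25, 2], 15)
```

Starting from two different initial vectors, the hub scores agree to several decimal places after 15 iterations, which is the "same answer" effect the session points out.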

Authority and Hub Pages (6)
4. Compute the authority score and hub score of each web page in T based on the subgraph SG(V, E).
Given a page p, let
- a(p) be the authority score of p
- h(p) be the hub score of p
- (p, q) be a directed edge in E from p to q.
Two basic operations:
- Operation I: Update each a(p) as the sum of all the hub scores of web pages that point to p.
- Operation O: Update each h(p) as the sum of all the authority scores of web pages pointed to by p.

Authority and Hub Pages (9)
After each iteration of applying Operations I and O, normalize all authority and hub scores. Repeat until the scores for each page converge (the convergence is guaranteed).
5. Sort pages in descending order of authority score.
6. Display the top authority pages.

Authority and Hub Pages (10)
Algorithm (summary)
submit q to a search engine to obtain the root set S;
expand S into the base set T;
obtain the induced subgraph SG(V, E) using T;
initialize a(p) = h(p) = 1 for all p in V;
repeat until the scores converge {
    apply Operation I to every p in V;
    apply Operation O to every p in V;
    normalize a(p) and h(p);
}
return pages with top authority scores;

Handling “spam” links
Should all links be equally treated? Two considerations:
- Some links may be more meaningful/important than other links.
- Web site creators may trick the system to make their pages more authoritative by adding dummy pages pointing to their cover pages (spamming).

Handling Spam Links (contd)
Transverse link: a link between pages with different domain names.
Domain name: the first level of the URL of a page.
Intrinsic link: a link between pages with the same domain name.
Transverse links are more important than intrinsic links.
Two ways to incorporate this:
1. Use only transverse links and discard intrinsic links.
2. Give lower weights to intrinsic links.

Handling Spam Links (contd)
How to give lower weights to intrinsic links?
In adjacency matrix A, entry (p, q) should be assigned as follows:
- If p has a transverse link to q, the entry is 1.
- If p has an intrinsic link to q, the entry is c, where 0 < c < 1.
- If p has no link to q, the entry is 0.
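The weighting rule can be sketched as follows. This is an illustration under assumptions: the "domain name" is taken to be the host part of the URL as returned by `urllib.parse.urlparse`, and the function name `weighted_adjacency` and the default `c=0.5` are hypothetical choices.

```python
from urllib.parse import urlparse

def weighted_adjacency(urls, links, c=0.5):
    """Build the weighted matrix A for pages `urls` and links (p_url, q_url).

    Transverse links (different hosts) get weight 1,
    intrinsic links (same host) get weight c, with 0 < c < 1.
    """
    if not 0 < c < 1:
        raise ValueError("c must satisfy 0 < c < 1")
    index = {u: i for i, u in enumerate(urls)}
    matrix = [[0.0] * len(urls) for _ in urls]
    for p, q in links:
        # Assumption: the host name stands in for the slide's "domain name".
        same_domain = urlparse(p).hostname == urlparse(q).hostname
        matrix[index[p]][index[q]] = c if same_domain else 1.0
    return matrix

urls = ["http://a.com/", "http://a.com/x", "http://b.org/"]
links = [("http://a.com/", "http://a.com/x"),
         ("http://a.com/", "http://b.org/")]
A = weighted_adjacency(urls, links)
```

Here the link inside `a.com` is stored as 0.5 and the link to `b.org` as 1.0, so the same power iteration can be run on the weighted matrix without other changes.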

Considering link “context”
For a given link (p, q), let V(p, q) be the vicinity (e.g., ±50 characters) of the link.
If V(p, q) contains terms in the user query (topic), then the link should be more useful for identifying authoritative pages.
To incorporate this: in adjacency matrix A, set the weight associated with link (p, q) to 1 + n(p, q), where n(p, q) is the number of terms in V(p, q) that appear in the query.
Alternatively, consider the “vector similarity” between V(p, q) and the query Q.
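A minimal sketch of the 1 + n(p, q) weight, with several assumptions made explicit: the page is available as plain text, the link's character offsets are known, terms are split on word characters, and n(p, q) counts distinct query terms rather than occurrences. The function name and parameters are hypothetical.

```python
import re

def context_weight(page_text, link_start, link_end, query, window=50):
    """Return 1 + n(p, q) for a link occupying page_text[link_start:link_end].

    The vicinity V(p, q) is the link text plus `window` characters on
    each side; n(p, q) is the number of distinct query terms found there.
    """
    lo = max(0, link_start - window)
    vicinity = page_text[lo:link_end + window].lower()
    vicinity_words = set(re.findall(r"\w+", vicinity))
    query_terms = set(re.findall(r"\w+", query.lower()))
    return 1 + len(query_terms & vicinity_words)

text = "best car makers LINK reliable car dealers"
w_wide = context_weight(text, 16, 20, "car dealers")
w_narrow = context_weight(text, 16, 20, "car dealers", window=5)
```

With the default window both query terms fall inside the vicinity, so the link weight rises above 1; with a 5-character window neither does and the weight stays at 1.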

Evaluation
Sample experiments: rank based on large in-degree (or backlinks)
query: game
[Table: Rank, in-degree, URL; surviving entry: gamelink/gamelink.html]
Only pages 1, 2 and 4 are authoritative game pages.

Evaluation
Sample experiments (continued): rank based on large authority score
query: game
[Table: Rank, Authority, URL; surviving entry: gamefan-network.com/]
All pages are authoritative game pages.

Authority and Hub Pages (19)
Sample experiments (continued): rank based on large authority score
query: free
[Table: Rank, Authority, URL]
All pages are authoritative free pages.

Tyranny of Majority
Which do you think are authoritative pages? Which are good hubs?
- Intuitively, we would say that 4, 8, 5 will be authoritative pages and 1, 2, 3, 6, 7 will be hub pages.
BUT the power iteration will show that only 4 and 5 have non-zero authorities [ ], and only 1, 2 and 3 have non-zero hubs [.5 .7 .5].
The authority and hub mass will concentrate completely in the first component as the iterations increase. (See next slide)

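Tyranny of Majority (explained)
[Figure: two components, one with pages p1, p2, ..., pm and page p (m links), the other with pages q1, ..., qn and page q (n links)]
Suppose h0 and a0 are all initialized to 1, and m > n.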

Impact of Bridges
When the graph is disconnected, only 4 and 5 have non-zero authorities [ ], and only 1, 2 and 3 have non-zero hubs [.5 .7 .5].
When the components are bridged by adding one page (9), the authorities change: only 4, 5 and 8 have non-zero authorities [ ], and 1, 2, 3, 6, 7 and 9 will have non-zero hubs [ ].
Bad news from a stability point of view.

Authority and Hub Pages (24)
Multiple Communities (continued)
How to retrieve pages from smaller communities?
A method for finding pages in the nth largest community:
– Identify the next largest community using the existing algorithm.
– Destroy this community by removing links associated with pages having large authorities.
– Reset all authority and hub values back to 1 and calculate all authority and hub values again.
– Repeat the above n − 1 times and the next largest community will be the nth largest community.

Multiple Clusters on “House” Query: House (first community)

Authority and Hub Pages (26) Query: House (second community)

Authority and Hub Pages (20)
For a given query, the induced subgraph may have multiple dense bipartite communities due to:
- multiple meanings of query terms
- multiple web communities related to the query
[Figure labels: ad page, obscure web page]

Authority and Hub Pages (21)
Multiple Communities (continued)
If a page is not in a community, then it is unlikely to have a high authority score even when it has many backlinks.
Example: Suppose initially all hub and authority scores are 1.
[Figure: G1, several q pages pointing to a single page p; G2, several q pages pointing to several p pages]
1st iteration for G1: a(q) = 0, a(p) = 5, h(q) = 5, h(p) = 0
1st iteration for G2: a(q) = 0, a(p) = 3, h(q) = 9, h(p) = 0

Authority and Hub Pages (22)
Example (continued):
1st normalization (suppose the normalization factors are $H_1$ for hubs and $A_1$ for authorities):
for pages in G1: $a(q) = 0$, $a(p) = 5/A_1$, $h(q) = 5/H_1$, $h(p) = 0$
for pages in G2: $a(q) = 0$, $a(p) = 3/A_1$, $h(q) = 9/H_1$, $h(p) = 0$
After the nth iteration (suppose $H_n$ and $A_n$ are the respective normalization factors):
for pages in G1: $a(p) = 5^n / (H_1 \cdots H_{n-1} A_n)$ ---- a
for pages in G2: $a(p) = 3 \cdot 9^{n-1} / (H_1 \cdots H_{n-1} A_n)$ ---- b
Note that a/b approaches 0 when n is sufficiently large, that is, a is much smaller than b.
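To see why the ratio vanishes, note that the normalization factors are shared by both communities and cancel. A short worked step, using only the two expressions a and b above:

```latex
\frac{a}{b}
  = \frac{5^{n}}{3 \cdot 9^{\,n-1}}
  = \frac{5}{3} \left( \frac{5}{9} \right)^{n-1}
  \;\longrightarrow\; 0
  \quad \text{as } n \to \infty
```

The ratio shrinks by a factor of 5/9 per iteration, so after ten iterations it is already roughly 0.008: the page in G1 keeps its five backlinks but loses almost all of its relative authority.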

Authority and Hub Pages (23)
Multiple Communities (continued)
If a page is not in the largest community, then it is unlikely to have a high authority score.
– The reason is similar to that regarding pages not in a community.
[Figure: a larger community and a smaller community]

More stable because the random surfer model allows low-probability edges to every place.
Can be done for the base set too.
Can be done for the full web too.
Query relevance vs. query-time computation tradeoff.
Can be made stable with subspace-based A/H values [see Ng et al., 2001].
See the topic-specific PageRank idea.

Novel uses of Link Analysis
Link analysis algorithms (HITS and PageRank) are not limited to hyperlinks.
- Citeseer/Cora use them for analyzing citations (the link is through “citation”)
  - See the irony here: link analysis ideas originated from citation analysis, and are now being applied to citation analysis
- Some new work on “keyword search on databases” uses foreign-key links and link analysis to decide which of the tuples matching the keyword query are most important (the link is through foreign keys) [Sudarshan et al., ICDE 2002]
  - Keyword search on databases is useful to make structured databases accessible to naïve users who don’t know structured languages (such as SQL).