7CCSMWAL Algorithmic Issues in the WWW


7CCSMWAL Algorithmic Issues in the WWW Lecture HITS

HITS. HITS is an important (non-Google) method of ranking web pages. The method has many attractions, but so far has not had as successful an implementation as PageRank. HITS classifies pages in two ways, based on in-degree and out-degree. Who uses (or used) a similar algorithm? Teoma (Ask Jeeves) and Twitter. As of 2010, Ask.com referred to the Teoma algorithm as the ExpertRank algorithm.

Some documentation: https://en.wikipedia.org/wiki/HITS_algorithm. The next one is very 'famous': http://web.eecs.umich.edu/~michjc/eecs584/notes/lecture19-kleinberg.pdf. Basically, what search engines actually do is mainly secret. Why would that be?

ASK

HITS (Hypertext Induced Topic Search). Similar to PageRank, but uses both in-links and out-links to create two popularity scores for each page. HITS defines hubs and authorities: a hub is a page with many out-links, and an authority is a page with many in-links. A page can be both a hub and an authority.

Hubs and Authorities. "Good authorities are pointed to by good hubs and good hubs point to good authorities." Measures of goodness for each page Pi: the authority score (or weight) x_i and the hub score (or weight) y_i. The definitions, where E is the set of all hyperlinks and (i, j) denotes the hyperlink from Pi to Pj, are
  Hub i:  y_i = Σ_{(i,j) ∈ E} x_j   (sum over the pages Pi points to)
  Auth i: x_i = Σ_{(j,i) ∈ E} y_j   (sum over the pages pointing to Pi)

Example. Suppose the hub and authority scores of P1, P2, P3, P4 are:

  Page              P1    P2    P3    P4
  Authority x_i     3.5   1.2   4.2   1.0
  Hub y_i           2.1   0.3   5.2   1.1

The authority score of P5 is 2.1 + 0.3 + 5.2 = 7.6, because hubs P1, P2, P3 point to P5. The hub score of P5 is 3.5 + 1.0 = 4.5, because P5 points to authorities P1 and P4.
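A quick R check of this arithmetic, using only the scores and links described above:

  # Hub and authority scores of P1..P4 taken from the table above
  y <- c(2.1, 0.3, 5.2, 1.1)   # hub scores y1..y4
  x <- c(3.5, 1.2, 4.2, 1.0)   # authority scores x1..x4

  x5 <- sum(y[c(1, 2, 3)])     # authority of P5: hubs P1, P2, P3 point to P5 -> 7.6
  y5 <- sum(x[c(1, 4)])        # hub of P5: P5 points to authorities P1, P4 -> 4.5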

Iterative Approach. As in the PageRank computation, the authority and hub scores are computed iteratively, followed by normalization (so that the scores add to 1). Let x(k)_i and y(k)_i be the authority and hub scores, respectively, of page Pi at iteration k. In each iteration we first update the authority scores and then the hub scores.

Score Computation in Matrix Form. Adjacency matrix L: L_ij = 1 if there exists a link from Pi to Pj, and L_ij = 0 otherwise. (The slide shows an example graph on the nodes 1, 2, 3, 4 and its adjacency matrix.)
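As a sketch of this convention, here is how L could be built in R for a small hypothetical 4-page graph (this is not the example graph drawn on the slide):

  # Hypothetical 4-page example: links 1->2, 1->3, 2->3, 3->4, 4->1
  edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 1))
  n <- 4
  L <- matrix(0, n, n)
  L[edges] <- 1   # set L[i, j] = 1 for every link Pi -> Pj
  L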

Score Computation in Matrix Form. Let x(k) be the column vector of the authority scores and y(k) the column vector of the hub scores. Then we have y(k) = L x(k), because for each i: y(k)_i = Σ_{j=1..n} L_ij x(k)_j = Σ_{(i,j) ∈ E} x(k)_j.

Score Computation in Matrix Form. The transpose of an n×m matrix M is the m×n matrix M^T with (M^T)_ij = M_ji. (The slide shows an example.)

Score Computation in Matrix Form. We have x(k) = L^T y(k-1), because for each i: x(k)_i = Σ_{j=1..n} (L^T)_ij y(k-1)_j = Σ_{(j,i) ∈ E} y(k-1)_j. Note: x is the authority score of a page, y its hub score, and if the edge (j, i) is in E then (L^T)_ij = L_ji = 1.
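Using the hypothetical L built above, both matrix identities can be checked directly in R (the starting authority vector is arbitrary):

  x <- c(1, 1, 1, 1)        # any authority vector, here all ones
  y <- L %*% x              # y_i = sum of x_j over out-links (i, j) of Pi
  x_next <- t(L) %*% y      # x_i = sum of y_j over in-links (j, i) of Pi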

HITS Score Computation in Matrix Form. Let x(k) be the column vector of the authority scores, y(k) the column vector of the hub scores, and L the adjacency matrix of the network.
HITS Algorithm:
  Initialize: y(0) = e (the column vector of all ones)
  Until convergence, do:
    x(k) = L^T y(k-1)
    y(k) = L x(k)
    Normalize* x(k) and y(k)
    k = k + 1
* To avoid the values getting too large; the entries of each vector are scaled to add to 1.
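A minimal R sketch of this loop, assuming an adjacency matrix L as above; the convergence test, tolerance and iteration cap are illustrative choices, not part of the slides:

  hits <- function(L, tol = 1e-8, max_iter = 100) {
    n <- nrow(L)
    y <- rep(1, n)                    # y(0) = e
    x <- rep(0, n)
    for (k in 1:max_iter) {
      x_new <- as.vector(t(L) %*% y)  # x(k) = L^T y(k-1)
      y_new <- as.vector(L %*% x_new) # y(k) = L x(k)
      x_new <- x_new / sum(x_new)     # normalize: scores add to 1
      y_new <- y_new / sum(y_new)
      if (max(abs(x_new - x), abs(y_new - y)) < tol) {
        x <- x_new; y <- y_new
        break
      }
      x <- x_new; y <- y_new
    }
    list(authority = x, hub = y)
  }

  # e.g. hits(L) for the small hypothetical L built earlier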

Example

Graph

Results:

  Vertex   hub     authority   Pagerank99   PageRank85   betweenness
  1        0.000   0.090       0.170        0.180        0.180
  2        0.470   0.000       0.220        0.200        0.300
  3        0.160   0.170       0.060        0.060        0.010
  4        0.270   0.220       0.080        0.090        0.040
  5        0.000   0.260       0.100        0.100        0.130
  6        0.050   0.260       0.100        0.100        0.030
  7        0.050   0.000       0.050        0.060        0.010
  8        0.000   0.000       0.220        0.210        0.300

R program

  require(igraph)

  # read in graph
  mygraph <- read.table("compare-the-measures.csv", sep=",")
  mygraph                                   # see the data

  # turn data into a graph
  mygraph <- graph.data.frame(mygraph, directed=T)
  plot(mygraph, vertex.color="white")       # draw the graph

  # HITS (Kleinberg scores)
  H <- hub.score(mygraph)$vector
  A <- authority.score(mygraph)$vector

  # normalize HITS
  # H = H/max(H)
  # A = A/max(A)
  H = H/sum(H)
  A = A/sum(A)

  # pagerank with 'no damping' and 'Google damping'
  P1 <- page.rank(mygraph, damping=0.99999)$vector
  P2 <- page.rank(mygraph, damping=0.85)$vector
  P1 = P1/sum(P1)
  P2 = P2/sum(P2)

  # betweenness centrality
  B <- betweenness(mygraph)
  B = B/sum(B)

  answer = data.frame(hub=H, authority=A, Pagerank99=P1, PageRank85=P2, betweenness=B)
  format(round(answer, 2), nsmall = 3)


Score Computation in Matrix Form. In step 2 of the algorithm, the two equations x(k) = L^T y(k-1) and y(k) = L x(k) can be simplified by substitution to x(k) = L^T L x(k-1) and y(k) = L L^T y(k-1). The equations become very similar to that of the PageRank computation, π(k) = π(k-1) H. In HITS, L^T L is called the authority matrix and L L^T is called the hub matrix.
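A sketch of iterating directly with these two matrices in R (same hypothetical L as before; 50 iterations is an arbitrary choice):

  A_mat <- t(L) %*% L              # authority matrix L^T L
  H_mat <- L %*% t(L)              # hub matrix L L^T

  x <- rep(1, nrow(L))
  y <- rep(1, nrow(L))
  for (k in 1:50) {
    x <- as.vector(A_mat %*% x); x <- x / sum(x)   # x(k) = L^T L x(k-1), normalized
    y <- as.vector(H_mat %*% y); y <- y / sum(y)   # y(k) = L L^T y(k-1), normalized
  }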

HITS Implementation. Two main steps: (1) a neighbourhood graph N related to the query terms is built (query dependent); (2) the authority and hub scores for each page in N are computed, giving two ranked lists: the most authoritative pages and the most "hubby" pages. We focus on the first step, as the second step has already been explained.

Neighbourhood Graph N Formation. N is initialized with all pages containing references to the query terms, making use of the content index (inverted file). N is then expanded by adding vertices (from the Web graph) that link either to or from vertices in N. In this way relevant pages that do not contain the query terms can be added; e.g., with query term "car", pages containing "automobile" may be added. This is a somewhat arbitrary process, and N can become very large if a page containing the query terms has huge in-degree or out-degree. In practice a limit, say 100, is applied to the expansion from the in-links or out-links of a page with the query terms.
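A rough sketch of this expansion in R over an edge list (the edge-list layout, the root set and the limit of 100 are illustrative; a real implementation works against the content index and the Web graph):

  # edges: two-column data frame of links (from, to); root: pages containing the query terms
  expand_neighbourhood <- function(edges, root, limit = 100) {
    N <- root
    for (p in root) {
      out_links <- edges$to[edges$from == p]    # pages that p links to
      in_links  <- edges$from[edges$to == p]    # pages that link to p
      N <- c(N, head(out_links, limit), head(in_links, limit))  # cap the expansion per page
    }
    unique(N)
  }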

Score Computation with N. Once N is built, the adjacency matrix L corresponding to N is formed. The number of pages in N (and the size of L) is much smaller than the total number of pages on the Web, so computing the authority and hub scores costs much less than the PageRank method. In fact we do not need to iterate both equations: once a stable authority vector x is obtained, we can apply y = Lx to get the stable hub vector y.

Normalization (Total = 1). To limit the values of the authority and hub scores and ensure convergence, each iteration includes a normalization step: x(k) ← x(k) / m(x(k)), where m(x(k)) can be the sum of the individual values in x(k), i.e., x(k)_1 + x(k)_2 + ... + x(k)_n, and n is the number of pages in N (not the whole Web graph). Similarly for y: y(k) ← y(k) / m(y(k)).

Example. Suppose P1 and P6 contain the query terms, and the neighbourhood graph around P1 and P6 includes P2, P3, P5 and P10. The adjacency matrix L for these six pages is formed (the graph and L are shown on the slide).

Example (cont). The respective authority matrix L^T L and hub matrix L L^T are formed (shown on the slide).

Example (cont). x(0) = (1,1,1,1,1,1)^T. x(1) = L^T L x(0) = (1,0,4,2,4,0)^T. Normalize x(1): each element is divided by the sum of x(1), which is 11, giving (1/11, 0/11, 4/11, 2/11, 4/11, 0/11) ≈ (.0909, 0, .3636, .1818, .3636, 0). Then x(2) = L^T L x(1), and so on. (The slide shows the values for the first 20 iterations.)

Example (cont). By the HITS algorithm, the stable authority scores x and hub scores y are x = (0, 0, 0.3660, 0.1340, 0.5, 0)^T and y = (0.3660, 0, 0.2113, 0, 0.2113, 0.2113)^T, with the entries labelled by the pages (1, 2, 3, 5, 6, 10). Ties may occur and can be broken by any tie-breaking strategy, e.g., by page number. Authority ranking: P6, P3, P5, P1, P2, P10. Hub ranking: P1, P3, P6, P10, P2, P5. P6 is the most authoritative page and P1 is the best hub.

Modification. Similar to the PageRank method, we can also incorporate teleporting by modifying the authority and hub matrices: authority matrix ξ L^T L + (1 − ξ)(1/n) ee^T, and hub matrix ξ L L^T + (1 − ξ)(1/n) ee^T, where ξ is a number between 0 and 1 (playing the same role as the damping factor α in PageRank) and ee^T is the n×n matrix of all ones. For the example with ξ = 0.95, the modification gives the authority and hub scores x = (0.0032, 0.0023, 0.3634, 0.1351, 0.4936, 0.0023)^T and y = (0.3628, 0.0032, 0.2106, 0.0023, 0.2106, 0.2106)^T. Note that the rankings remain the same in this example.
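A sketch of the modified authority matrix and its dominant eigenvector in R, reusing the hypothetical L from earlier and ξ = 0.95 as in the example (eigen() is just one way to obtain the stable vector; power iteration works equally well):

  xi <- 0.95
  n  <- nrow(L)
  E  <- matrix(1, n, n)                                # ee^T, the n x n matrix of all ones
  A_mod <- xi * t(L) %*% L + (1 - xi) * (1 / n) * E    # modified authority matrix
  H_mod <- xi * L %*% t(L) + (1 - xi) * (1 / n) * E    # modified hub matrix
  x <- eigen(A_mod, symmetric = TRUE)$vectors[, 1]     # dominant eigenvector
  x <- x / sum(x)                                      # scale so the scores add to 1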

Strengths of HITS. It provides two ranked lists: the most authoritative pages, for more in-depth information, and the most "hubby" pages, as a portal to related information in a broad search. The size of the problem is much smaller than that of the PageRank method. Intuitively, hubs with a high score carry a lot of traffic, so they are a good place for advertising, etc.

Weakness of HITS: query dependence. For each query, a neighbourhood graph must be built at query time. It is easy to make HITS query-independent by computing the authority and hub vectors x and y using the adjacency matrix of the entire Web graph, but this then runs into the same size problem that the PageRank method faces.

Weakness of HITS: susceptibility to spamming. Adding more out-links to a page can easily increase its hub score, and the authority and hub scores are interdependent: an authority score will increase as the hub scores of the pages pointing to it increase. Bharat & Henzinger (1998) proposed to normalize links in two situations. (1) If k pages on Host 1 link to the same page Pi on Host 2, each of those links is given weight 1/k in computing the authority score of Pi; i.e., each such page Pj on Host 1 that points to Pi contributes only y_j/k to x_i.

Weakness of HITS (cont). (2) If a page Pi on Host 1 has h links to pages on Host 2, each of those links is given weight 1/h in computing the hub score of Pi; i.e., each page Pj on Host 2 that Pi points to contributes only x_j/h to y_i.
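A sketch of these two weightings in R over an edge list that also records each page's host (the column names and data layout are illustrative, not from the slides):

  # edges: data frame with columns from, to, host_from, host_to
  # x, y: current authority and hub score vectors indexed by page id

  weighted_authority <- function(i, edges, y) {
    inl <- edges[edges$to == i, ]                # links pointing to Pi
    k   <- table(inl$host_from)[inl$host_from]   # links to Pi coming from each source host
    sum(y[inl$from] / as.numeric(k))             # each Pj on a host with k such links gives yj/k
  }

  weighted_hub <- function(i, edges, x) {
    outl <- edges[edges$from == i, ]             # links leaving Pi
    h    <- table(outl$host_to)[outl$host_to]    # links from Pi into each target host
    sum(x[outl$to] / as.numeric(h))              # each Pj on a host receiving h links gives xj/h
  }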

Weakness of HITS: topic drift. A very authoritative yet off-topic page can be included in the neighbourhood graph if it is linked to a page containing the query terms. E.g., a query for "Jaguar" (the wild cat) may return homepages of car manufacturers or lists of car manufacturers. The neighbourhood graph N which is built may not relate properly to the query.

Weakness of HITS (cont). Bharat & Henzinger suggest a solution that weights the authority and hub scores of the pages in N by a measure of relevancy to the query. Let S(Q, Pi) be the "relevancy score" (a number between 0 and 1) of page Pi to the query Q; it can be computed using traditional information retrieval techniques (see later). For any hyperlink (Pi, Pj), Pi contributes S(Q, Pi) * y_i to the authority score of Pj, where y_i is the hub score of Pi, and Pj contributes S(Q, Pj) * x_j to the hub score of Pi, where x_j is the authority score of Pj.
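A sketch of the relevance-weighted updates in R (the edge-list layout and the vector S of relevancy scores are illustrative):

  # edges: data frame of links (from, to); S: relevancy scores S(Q, Pi) indexed by page id
  # x, y: current authority and hub scores indexed by page id
  x_new <- tapply(S[edges$from] * y[edges$from], edges$to,   sum)  # authority of Pj: sum of S(Q,Pi)*yi over links (Pi,Pj)
  y_new <- tapply(S[edges$to]   * x[edges$to],   edges$from, sum)  # hub of Pi: sum of S(Q,Pj)*xj over links (Pi,Pj)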