7CCSMWAL Algorithmic Issues in the WWW

Lecture HITS

HITS
HITS is an important (non-Google) method of ranking web pages.
The method has many attractions, but so far it has not had as successful an implementation as PageRank.
HITS classifies pages in two ways, based on in-degree and out-degree.
Who uses or has used a similar algorithm? Teoma (Ask Jeeves), Twitter.
As of 2010, Ask.com referred to the Teoma algorithm as the ExpertRank algorithm.

Some documentation
https://en.wikipedia.org/wiki/HITS_algorithm
The next one is very ‘famous’.
Basically, what search engines do is mainly secret. Why would that be?

HITS (Hypertext Induced Topic Search)
Similar to PageRank, but uses both in-links and out-links to create two popularity scores for each page.
HITS defines hubs and authorities:
A hub is a page with many out-links.
An authority is a page with many in-links.
A page can be both a hub and an authority.
[Figure: a hub (Hub) and an authority (Auth)]

Hubs and Authorities
“Good authorities are pointed to by good hubs and good hubs point to good authorities”
Measures of goodness for each page $P_i$:
Authority score (or weight) $x_i$
Hub score (or weight) $y_i$
The definitions:
$$x_i = \sum_{(j,i) \in E} y_j, \qquad y_i = \sum_{(i,j) \in E} x_j$$
where $E$ is the set of all hyperlinks.
Hub $i$: $(i, j)$ refers to a hyperlink from $P_i$ to $P_j$.
Auth $i$: $(j, i)$ refers to a hyperlink from $P_j$ to $P_i$.

Example
Suppose the hub and authority scores of P1, P2, P3, P4 are:
Pages                    P1    P2    P3    P4
Authority score $x_i$    3.5   1.2   4.2   1.0
Hub score $y_i$          2.1   0.3   5.2   1.1
[Figure: P1, P2 and P3 link to P5; P5 links to P1 and P4]
The authority score of P5 is $y_1 + y_2 + y_3 = 2.1 + 0.3 + 5.2 = 7.6$, because hubs P1, P2 and P3 point to P5.
The hub score of P5 is $x_1 + x_4 = 3.5 + 1.0 = 4.5$, because P5 points to authorities P1 and P4.
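
The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the dictionaries hold the scores from the table, and the edge list encodes the links stated in the example (P1, P2, P3 point to P5; P5 points to P1 and P4). The function names are illustrative, not part of the lecture.

```python
# Scores of P1..P4 taken from the example table.
authority = {"P1": 3.5, "P2": 1.2, "P3": 4.2, "P4": 1.0}
hub = {"P1": 2.1, "P2": 0.3, "P3": 5.2, "P4": 1.1}

# Hyperlinks touching P5, as (source, target) pairs.
edges = [("P1", "P5"), ("P2", "P5"), ("P3", "P5"), ("P5", "P1"), ("P5", "P4")]


def authority_score(page, edges, hub):
    """Sum the hub scores of the pages that link to `page`."""
    return sum(hub[src] for src, dst in edges if dst == page)


def hub_score(page, edges, authority):
    """Sum the authority scores of the pages that `page` links to."""
    return sum(authority[dst] for src, dst in edges if src == page)


x5 = authority_score("P5", edges, hub)
y5 = hub_score("P5", edges, authority)
print(round(x5, 1), round(y5, 1))  # 7.6 4.5
```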

Iterative Approach
Similar to the PageRank computation, the authority and hub scores are computed iteratively, followed by normalization (scores add to 1).
Let $x_i^{(k)}$ and $y_i^{(k)}$ be the authority and hub scores of page $P_i$, respectively, at iteration $k$.
For instance, in each iteration, we start by updating the authority scores and then the hub scores:
$$x_i^{(k)} = \sum_{(j,i) \in E} y_j^{(k-1)}, \qquad y_i^{(k)} = \sum_{(i,j) \in E} x_j^{(k)}$$

Score Computation in Matrix Form
Adjacency matrix $L$:
$L_{ij} = 1$ if there exists a link from $P_i$ to $P_j$
$L_{ij} = 0$ otherwise.
[Example figure: a graph on nodes 1, 2, 3, 4 and its adjacency matrix]

Let $x^{(k)}$ be the column vector of the authority scores.
Let $y^{(k)}$ be the column vector of the hub scores.
Then we have $y^{(k)} = L x^{(k)}$, because for each $i$
$$y_i^{(k)} = \sum_{j=1}^{n} L_{ij} \, x_j^{(k)} = \sum_{(i,j) \in E} x_j^{(k)}$$

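The transpose of an $n \times m$ matrix $M$ is the $m \times n$ matrix $M^T$ where $M^T_{ij} = M_{ji}$.
For instance, the transpose of $\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}$ is $\begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$.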

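We have $x^{(k)} = L^T y^{(k-1)}$, because for each $i$
$$x_i^{(k)} = \sum_{j=1}^{n} L^T_{ij} \, y_j^{(k-1)} = \sum_{(j,i) \in E} y_j^{(k-1)}$$
Note: $x$ is the authority score of a page, $y$ is the hub score of a page.
Note: if there is an edge $(j, i)$ then $L_{ji} = 1$, so $L^T_{ij} = 1$.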
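
To see that the matrix form and the edge-sum form agree, here is a small Python sketch. The 4-page graph is hypothetical (it is not the graph from the slide's figure), and the helper names `mat_vec` and `transpose` are illustrative.

```python
# Hypothetical 4-page graph; edges are (source, target) with 0-based page indices.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4

# Adjacency matrix: L[i][j] = 1 if page i links to page j.
L = [[0] * n for _ in range(n)]
for s, t in edges:
    L[s][t] = 1


def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


y_prev = [1.0, 1.0, 1.0, 1.0]        # hub scores y^(k-1)

# Matrix form.
x = mat_vec(transpose(L), y_prev)    # x^(k) = L^T y^(k-1)
y = mat_vec(L, x)                    # y^(k) = L x^(k)

# Edge-sum form: in-links for authority, out-links for hub.
x_edges = [sum(y_prev[s] for s, t in edges if t == i) for i in range(n)]
y_edges = [sum(x[t] for s, t in edges if s == i) for i in range(n)]

print(x, y)  # [1.0, 1.0, 3.0, 0.0] [4.0, 3.0, 1.0, 3.0]
```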

HITS Score Computation in Matrix Form
Let $x^{(k)}$ be the column vector of the authority scores.
Let $y^{(k)}$ be the column vector of the hub scores.
Let $L$ be the adjacency matrix of the network.
HITS Algorithm
1. Initialize: $y^{(0)} = e$ (column vector of all ones), $k = 1$
2. Until convergence, do
     $x^{(k)} = L^T y^{(k-1)}$
     $y^{(k)} = L x^{(k)}$
     Normalize* $x^{(k)}$ and $y^{(k)}$
     $k = k + 1$
* Normalization keeps the values from getting too large. The scores need to add to 1.
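
A minimal Python sketch of this algorithm follows. It is not the lecture's implementation: the stopping rule (total absolute change below `tol`), the iteration cap `max_iter`, and the guard against all-zero vectors are assumptions added to make the loop runnable, and the 3-page graph is hypothetical.

```python
def hits(L, max_iter=100, tol=1e-10):
    """Return (authority, hub) score lists for adjacency matrix L.

    L[i][j] == 1 means page i links to page j. Both lists sum to 1.
    """
    n = len(L)
    x = [0.0] * n
    y = [1.0] * n                                   # y^(0) = e
    for _ in range(max_iter):
        # x^(k) = L^T y^(k-1): add up hub scores over in-links.
        x_new = [sum(L[j][i] * y[j] for j in range(n)) for i in range(n)]
        # y^(k) = L x^(k): add up authority scores over out-links.
        y_new = [sum(L[i][j] * x_new[j] for j in range(n)) for i in range(n)]
        # Normalize so that each vector sums to 1 (skipped for an all-zero vector).
        sx, sy = sum(x_new), sum(y_new)
        if sx > 0:
            x_new = [v / sx for v in x_new]
        if sy > 0:
            y_new = [v / sy for v in y_new]
        change = sum(abs(a - b) for a, b in zip(x_new, x))
        change += sum(abs(a - b) for a, b in zip(y_new, y))
        x, y = x_new, y_new
        if change < tol:                            # assumed convergence test
            break
    return x, y


# Hypothetical graph: pages 0 and 1 both link to page 2.
L = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
authority, hub = hits(L)
print(authority, hub)  # [0.0, 0.0, 1.0] [0.5, 0.5, 0.0]
```

Page 2 collects all the authority, and the two pages pointing to it share the hub score equally.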

Example
[Figure: example graph, with a results table whose columns are hub, authority, Pagerank99, PageRank85 and betweenness]

R program
require(igraph)
# read in the edge list
mygraph <- read.table("compare-the-measures.csv", sep=",")
mygraph  # see the data
# turn the data into a directed graph
mygraph <- graph.data.frame(mygraph, directed=T)
plot(mygraph, vertex.color="white")  # draw the graph
# HITS (Kleinberg scores)
H <- hub.score(mygraph)$vector
A <- authority.score(mygraph)$vector
# normalize HITS (alternative: divide by the maximum instead of the sum)
# H = H/max(H)
# A = A/max(A)
H = H/sum(H)
A = A/sum(A)
# PageRank with 'no damping' and 'Google damping'
# damping=0.99 is inferred from the column name Pagerank99; the value is missing in the source
P1 <- page.rank(mygraph, damping=0.99)$vector
P2 <- page.rank(mygraph, damping=0.85)$vector
P1 = P1/sum(P1)
P2 = P2/sum(P2)
# betweenness centrality
B <- betweenness(mygraph)
B = B/sum(B)
answer = data.frame(hub=H, authority=A, Pagerank99=P1, PageRank85=P2, betweenness=B)
format(round(answer, 2), nsmall = 3)

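In step 2 of the algorithm, the two equations
$$x^{(k)} = L^T y^{(k-1)}, \qquad y^{(k)} = L x^{(k)}$$
can be simplified by substitution (using $y^{(k-1)} = L x^{(k-1)}$ in the first and $x^{(k)} = L^T y^{(k-1)}$ in the second) to
$$x^{(k)} = L^T L \, x^{(k-1)}, \qquad y^{(k)} = L L^T y^{(k-1)}$$
The equations become very similar to that of the PageRank computation: $\pi^{(k)} = \pi^{(k-1)} H$
In HITS:
$L^T L$ is called the authority matrix.
$L L^T$ is called the hub matrix.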
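
The substitution can be checked numerically. The sketch below builds the authority and hub matrices for a hypothetical 3-page graph and compares one step of the combined form with the two-step form. Normalization is left out, since it only rescales the vectors. All names here are illustrative.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def mat_mul(A, B):
    """Matrix product A B for matrices stored as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def mat_vec(M, v):
    return [sum(m * c for m, c in zip(row, v)) for row in M]


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2.
L = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
Lt = transpose(L)

authority_matrix = mat_mul(Lt, L)                # L^T L
hub_matrix = mat_mul(L, Lt)                      # L L^T

x_prev = [1, 1, 1]                               # x^(k-1)
y_prev = mat_vec(L, x_prev)                      # y^(k-1) = L x^(k-1)

x_two_step = mat_vec(Lt, y_prev)                 # x^(k) = L^T y^(k-1)
x_one_step = mat_vec(authority_matrix, x_prev)   # x^(k) = L^T L x^(k-1)
y_two_step = mat_vec(L, x_two_step)              # y^(k) = L x^(k)
y_one_step = mat_vec(hub_matrix, y_prev)         # y^(k) = L L^T y^(k-1)

print(authority_matrix)  # [[0, 0, 0], [0, 1, 1], [0, 1, 2]]
print(hub_matrix)        # [[2, 1, 0], [1, 1, 0], [0, 0, 0]]
```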

HITS Implementation
Two main steps:
1. A neighbourhood graph N related to the query terms is built (query dependent).
2. The authority and hub scores for each page in N are computed.
Result: two ranked lists, the most authoritative pages and the most “hubby” pages.
We focus on the first step, as the second step has been explained.

Neighbourhood Graph N Formation
N is initialized with all pages containing references to the query terms, making use of the content index (inverted file).
N is expanded by adding vertices (from the Web graph) that link either to or from vertices in N.
Relevant pages without the query terms can be added this way. E.g., with query term “car”, pages containing “automobile” may be added.
This is a somewhat arbitrary process.
N can become very large if a page containing the query terms has a huge in-degree or out-degree.
In practice, a limit, say 100, is applied to the expansion from the in-links or out-links of a page with the query terms.
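
The expansion step might be sketched as follows. Everything here is an assumption for illustration: the toy link structure, the function names, and in particular the rule for which links to keep when the limit applies (the lowest page numbers), which the lecture does not specify.

```python
def invert(out_links):
    """Build the in-link map from the out-link map."""
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, set()).add(src)
    return in_links


def build_neighbourhood(root_pages, out_links, in_links, limit=100):
    """Return the vertex set of N: the root pages (those containing the query
    terms) plus at most `limit` out-link targets and at most `limit` in-link
    sources per root page."""
    nodes = set(root_pages)
    for page in root_pages:
        # Arbitrary truncation rule (an assumption): keep the lowest page numbers.
        nodes.update(sorted(out_links.get(page, ()))[:limit])
        nodes.update(sorted(in_links.get(page, ()))[:limit])
    return nodes


# Toy web graph (hypothetical): page -> set of pages it links to.
out_links = {1: {3, 6}, 2: {1}, 3: {6}, 6: {3, 5}, 10: {6}, 4: {7}, 7: {8}}
in_links = invert(out_links)

# Suppose the inverted file reports that pages 1 and 6 contain the query terms.
N = build_neighbourhood([1, 6], out_links, in_links, limit=100)
print(sorted(N))  # [1, 2, 3, 5, 6, 10]
```

Pages 4, 7 and 8 stay outside N because they neither link to nor are linked from a root page.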

Score Computation with N
Once N is built, the adjacency matrix $L$ corresponding to N is formed.
The number of pages in N (and hence the size of $L$) is much smaller than the total number of pages on the Web.
Computing the authority and hub scores therefore costs much less than the PageRank method.
In fact, we do not need to iterate both equations: once a stable authority vector $x$ is obtained, we can apply $y = Lx$ to get the stable hub vector $y$.

Normalization (Total = 1)
Normalization limits the values of the authority and hub scores and ensures convergence.
Normalization step: $x^{(k)} \leftarrow x^{(k)} / m(x^{(k)})$
where $m(x^{(k)})$ can be the sum of the individual values in $x^{(k)}$, i.e., $x_1^{(k)} + x_2^{(k)} + \dots + x_n^{(k)}$
$n$ is the number of pages in N (not the whole Web graph).
Similarly for $y$: $y^{(k)} \leftarrow y^{(k)} / m(y^{(k)})$

Example
Suppose P1 and P6 contain the query terms and the neighbourhood graph around P1 and P6 includes P2, P3, P5 and P10.
[Figure: the neighbourhood graph on nodes 1, 2, 3, 5, 6, 10 and its adjacency matrix $L$]

Example (cont)
The respective authority and hub matrices
[Figure: the authority matrix $L^T L$ and the hub matrix $L L^T$, shown next to the graph on nodes 1, 2, 3, 5, 6, 10]

Example (cont)
$x^{(0)} = (1,1,1,1,1,1)^T$
$x^{(1)} = L^T L \, x^{(0)} = (1,0,4,2,4,0)^T$
Normalize $x^{(1)}$: each element is divided by the sum of $x^{(1)}$, which is 11
$\to (1/11, 0/11, 4/11, 2/11, 4/11, 0/11)^T \approx (.0909, 0, .3636, .1818, .3636, 0)^T$
$x^{(2)} = L^T L \, x^{(1)}$
... (continued for the first 20 iterations)

Example (cont)
By the HITS algorithm, the stable authority scores $x$ and hub scores $y$ are
$x = ( )^T$
$y = ( )^T$
Labels = (1, 2, 3, 5, 6, 10)
Ties may occur and can be broken by any tie-breaking strategy, e.g., by page number.
Authority ranking: P6, P3, P5, P1, P2, P10
Hub ranking: P1, P3, P6, P10, P2, P5
P6 is the most authoritative page and P1 is the best hub.
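
The example can be replayed in Python. One caveat: the adjacency matrix below is an assumption. The matrix from the slide did not survive, so the edge set used here (1→3, 1→6, 2→1, 3→6, 6→3, 6→5, 10→6) was chosen because it reproduces $x^{(1)} = (1,0,4,2,4,0)^T$ and both rankings quoted in the example. The scores the sketch prints are its own output, not values recovered from the slide, and the choice of 50 iterations is arbitrary.

```python
LABELS = [1, 2, 3, 5, 6, 10]

# ASSUMED adjacency matrix, rows and columns in LABELS order.
L = [
    [0, 0, 1, 0, 1, 0],  # P1  -> P3, P6
    [1, 0, 0, 0, 0, 0],  # P2  -> P1
    [0, 0, 0, 0, 1, 0],  # P3  -> P6
    [0, 0, 0, 0, 0, 0],  # P5  (no out-links)
    [0, 0, 1, 1, 0, 0],  # P6  -> P3, P5
    [0, 0, 0, 0, 1, 0],  # P10 -> P6
]
n = len(L)


def normalize(v):
    total = sum(v)
    return [s / total for s in v]


def authority_step(x):
    """Apply the authority matrix once, computed as L^T (L x)."""
    lx = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
    return [sum(L[j][i] * lx[j] for j in range(n)) for i in range(n)]


def ranking(scores):
    """Pages by decreasing score; ties are broken by page number."""
    order = sorted(range(n), key=lambda i: (-round(scores[i], 9), LABELS[i]))
    return [LABELS[i] for i in order]


x1 = authority_step([1] * n)          # unnormalized x^(1)
x = normalize(x1)
for _ in range(49):                   # 50 applications of L^T L in total
    x = normalize(authority_step(x))

# Stable hub vector from the stable authority vector: y = L x, then normalize.
y = normalize([sum(L[i][j] * x[j] for j in range(n)) for i in range(n)])

print(x1)                                      # [1, 0, 4, 2, 4, 0]
print([round(v, 4) for v in x], ranking(x))    # ranking: [6, 3, 5, 1, 2, 10]
print([round(v, 4) for v in y], ranking(y))    # ranking: [1, 3, 6, 10, 2, 5]
```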

Modification
Similar to the PageRank method, we can also incorporate teleporting by modifying the authority and hub matrices:
Authority matrix: $\xi L^T L + (1 - \xi)(1/n) \, e e^T$
Hub matrix: $\xi L L^T + (1 - \xi)(1/n) \, e e^T$
where $\xi$ is a number between 0 and 1 (similar to $\alpha$ in PageRank) and $e e^T$ is the $n \times n$ matrix of all ones.
For the example with $\xi = 0.95$, the modification obtains the authority and hub scores
$x = ( )^T$
$y = ( )^T$
Note that the rankings remain the same in this example.
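
A sketch of the modified iteration follows, under two assumptions. First, the teleporting term is taken to be the all-ones matrix scaled by (1 − 0.95)/n, so it adds the same amount to every entry of the vector. Second, the adjacency matrix is the same assumed one for the example graph (edges 1→3, 1→6, 2→1, 3→6, 6→3, 6→5, 10→6), since the matrix on the slide did not survive. The scores are whatever this sketch computes, not values from the slide.

```python
LABELS = [1, 2, 3, 5, 6, 10]

# ASSUMED adjacency matrix for the example graph, rows/columns in LABELS order.
L = [
    [0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
n = len(L)
XI = 0.95  # weight on the link structure; 1 - XI goes to teleporting


def modified_step(v, use_authority):
    """Apply XI * M + (1 - XI)/n * (all-ones matrix) to v, with M = L^T L or L L^T."""
    if use_authority:
        t = [sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]    # L v
        mv = [sum(L[j][i] * t[j] for j in range(n)) for i in range(n)]   # L^T L v
    else:
        t = [sum(L[j][i] * v[j] for j in range(n)) for i in range(n)]    # L^T v
        mv = [sum(L[i][j] * t[j] for j in range(n)) for i in range(n)]   # L L^T v
    teleport = (1 - XI) / n * sum(v)   # each entry of (all-ones matrix) v is sum(v)
    return [XI * m + teleport for m in mv]


def stable_scores(use_authority, iterations=200):
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = modified_step(v, use_authority)
        total = sum(v)
        v = [s / total for s in v]     # normalize to sum 1
    return v


x = stable_scores(use_authority=True)
y = stable_scores(use_authority=False)
print([round(s, 4) for s in x])
print([round(s, 4) for s in y])
```

With teleporting, every page receives a strictly positive score, so pages that previously tied at zero are now separated by small amounts.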

Strength of HITS
Provides two ranked lists:
The most authoritative pages: for more in-depth information.
The most “hubby” pages: for a portal to related information in a broad search.
The size of the problem is much smaller than that of the PageRank method.
Intuitively, hubs with a high score carry a lot of traffic, so they are a good place for advertising etc.

Weakness of HITS
Query-dependent
For each query, a neighbourhood graph must be built at query time.
It is easy to make HITS query-independent by computing the authority and hub vectors, $x$ and $y$, using the adjacency matrix of the entire Web graph. However, this leads to the same size problem that the PageRank method faces.

Weakness of HITS
Susceptibility to spamming
Adding more out-links to a page can easily increase its hub score.
Authority scores and hub scores are interdependent: an authority score will increase as a hub score increases.
Bharat & Henzinger (1998) proposed to normalize links in two situations.
(1) If there are $k$ links from Host 1 to the same page $P_i$ in Host 2, each link has a weight of $1/k$ in computing the authority score of $P_i$.
I.e., each page $P_j$ in Host 1 that points to $P_i$ contributes only $y_j / k$ to $x_i$.
[Figure (1): $k$ links from pages in Host 1 to page $P_i$ in Host 2]

Weakness of HITS
(2) If there are $h$ links from the same page $P_i$ of Host 1 to some pages in Host 2, each link has a weight of $1/h$ in computing the hub score of $P_i$.
I.e., each page $P_j$ in Host 2 that is pointed to from $P_i$ contributes only $x_j / h$ to $y_i$.
[Figure (2): $h$ links from page $P_i$ in Host 1 to pages in Host 2]
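
The two weighting rules can be sketched as one update step. This is an illustration of rules (1) and (2) as stated here, not Bharat & Henzinger's full method. The page names, the host assignment and the function name are hypothetical.

```python
from collections import Counter


def host_weighted_update(edges, host, y):
    """One authority update followed by one hub update with host-based weights.

    edges: list of (source, target) pairs; host: page -> host name;
    y: current hub scores. Returns (new_x, new_y) as dictionaries.
    """
    # Rule (1): k = number of links from a given host to a given target page.
    k = Counter((host[s], t) for s, t in edges)
    # Rule (2): h = number of links from a given source page to a given host.
    h = Counter((s, host[t]) for s, t in edges)

    new_x = {p: 0.0 for p in host}
    for s, t in edges:
        new_x[t] += y[s] / k[(host[s], t)]        # each link weighs 1/k

    new_y = {p: 0.0 for p in host}
    for s, t in edges:
        new_y[s] += new_x[t] / h[(s, host[t])]    # each link weighs 1/h
    return new_x, new_y


# Hypothetical data: three pages on host A all link to b1 on host B.
host = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
edges = [("a1", "b1"), ("a2", "b1"), ("a3", "b1"), ("a1", "b2")]
y0 = {p: 1.0 for p in host}

new_x, new_y = host_weighted_update(edges, host, y0)
print(new_x["b1"], new_y["a1"])
```

Without the weights, b1 would receive an authority score of 3 from the three pages of host A. With them, host A as a whole contributes the equivalent of a single link.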

Weakness of HITS
Topic drift
A very authoritative yet off-topic page can be included in the neighbourhood graph if it is linked to a page containing the query terms.
E.g., a query for “Jaguar” (wild cat) may return homepages of car manufacturers or lists of car manufacturers.
The neighbourhood graph N that is built may not relate properly to the query.

Weakness of HITS
Bharat & Henzinger suggest a solution that weights the authority and hub scores of the pages in N by a measure of relevancy to the query.
Let $S(Q, P_i)$ be the “relevancy score” (a number between 0 and 1) of page $P_i$ to the query $Q$.
It can be computed using traditional information retrieval techniques (see later).
For any hyperlink $(P_i, P_j)$:
$P_i$ contributes $S(Q, P_i) \cdot y_i$ to the authority score of $P_j$, where $y_i$ is the hub score of $P_i$.
$P_j$ contributes $S(Q, P_j) \cdot x_j$ to the hub score of $P_i$, where $x_j$ is the authority score of $P_j$.
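
One relevance-weighted update step could look like the sketch below. The page names and the relevancy scores are made up for illustration. How $S(Q, P_i)$ is actually computed is left to the information retrieval techniques mentioned above.

```python
def relevance_weighted_update(edges, relevance, y):
    """One authority update followed by one hub update, weighted by relevancy.

    edges: (i, j) pairs meaning P_i links to P_j; relevance[p] = S(Q, p);
    y: current hub scores. Returns (x, new_y) as dictionaries.
    """
    x = {p: 0.0 for p in relevance}
    for i, j in edges:
        x[j] += relevance[i] * y[i]        # P_i contributes S(Q, P_i) * y_i to x_j

    new_y = {p: 0.0 for p in relevance}
    for i, j in edges:
        new_y[i] += relevance[j] * x[j]    # P_j contributes S(Q, P_j) * x_j to y_i
    return x, new_y


# Hypothetical query "Jaguar" (wild cat), with made-up relevancy scores.
relevance = {"cat_hub": 0.9, "car_hub": 0.1, "zoo": 0.8, "dealer": 0.05}
edges = [("cat_hub", "zoo"), ("car_hub", "dealer")]
y0 = {p: 1.0 for p in relevance}

x, y1 = relevance_weighted_update(edges, relevance, y0)
print(x["zoo"], x["dealer"])
print(y1["cat_hub"], y1["car_hub"])
```

In the unweighted update both targets would get authority 1 and both hubs would get hub score 1. With the weights, the off-topic pair is pushed far down.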
