Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS349 – Link Analysis 1. Anchor text 2. Link analysis for ranking 2.1 Pagerank 2.2 Pagerank variants 2.3 HITS.

Similar presentations


Presentation on theme: "CS349 – Link Analysis 1. Anchor text 2. Link analysis for ranking 2.1 Pagerank 2.2 Pagerank variants 2.3 HITS."— Presentation transcript:

1 CS349 – Link Analysis 1. Anchor text 2. Link analysis for ranking 2.1 Pagerank 2.2 Pagerank variants 2.3 HITS

2 The Web as a Directed Graph Assumption 1: A hyperlink between pages denotes author perceived relevance (quality signal) Assumption 2: The anchor of the hyperlink describes the target page (textual context) Page A hyperlink Page B Anchor

3 Anchor Text For ibm how to distinguish between: IBM’s home page (mostly graphical) IBM’s copyright page (high term freq. for ‘ibm’) Rival’s spam page (arbitrarily high term freq.) www.ibm.com “ibm” “ibm.com” “IBM home page” A million pieces of anchor text with “ibm” send a strong signal

4 Indexing anchor text When indexing a document D, include anchor text from links pointing to D. www.ibm.com Armonk, NY-based computer giant IBM announced today Joe’s computer hardware links Compaq HP IBM Big Blue today announced record profits for the quarter

5 Indexing anchor text Can sometimes have unexpected side effects – e.g., evil empire. Can index anchor text with less weight.

6 2. Citation Analysis Citation frequency The kind of background work Deans are doing at tenure time Co-citation coupling frequency Co-citations with a given author measures “impact” Are you co-cited with influential publications? Bibliographic coupling frequency Articles that co-cite the same articles are related Citation indexing Who is author cited by? (Garfield [Garf72] )

7 PageRank PR – Single-page Definition W is a web page Wi are the web pages that have a link to P O(Wi) is the number of outlinks from Pi t is the teleportation probability N is the size of the web

8 Iteratively Computing PageRanks t is normally set to 0.15, but for simplicity lets set it to 0.5 Set initial PR values to 1 Solve the following equations iteratively:

9 Example Computation of PR

10 Pagerank – Matrix Mult. Definition Imagine a browser doing a random walk on web pages: Start at a random page At each step, go equi-probably out of the current page along one of the links on that page, “In the steady state” each page has a long-term visit rate: Use this rate as the page’s score. 1/3

11 Not quite enough The web is full of dead-ends. Random walk can get stuck in dead-ends. Makes no sense to talk about long-term visit rates. ??

12 Teleporting At a dead end, jump to a random web page. At any non-dead end, With probability, say, 10%, jump to a random web page. With remaining probability (90%), go out on a random link. t=10% is the “teleporting” parameter.

13 Result of teleporting Now cannot get stuck locally. There is a long-term rate at which any page is visited This not obvious, needs proof! How do we compute this visit rate?

14 Markov chains A Markov chain consists of n states, and an n  n transition probability matrix P. At each step, we are in exactly one of the states. For 1  i,j  n, the matrix entry P ij tells us the probability of j being the next state, given we are currently in state i. ij P ij P ii >0 is OK

15 Markov chains Clearly, for all i, Markov chains are abstractions of random walks. Exercise: Represent the teleporting random walk with teleporting parameter t=50% as a Markov chain, for this graph: ABCD

16 Computing P with teleporting t Start with Adjacency matrix A If there is hyperlink from i to j, A ij = 1, else A ij = 0 If a row has all 0’s,  replace each element by 1/N Else  divide each 1 by the number of 1’s in the row  Multiply the matrix by 1-t  Add t/N to every entry of the resulting matrix ABCD

17 Ergodic Markov chains A Markov chain is ergodic if you have a path from any state to any other For any start state, after a finite transient time T 0, the probability of being in any state at time T>T 0 is nonzero. Not ergodic (even/odd).

18 Ergodic Markov chains From Probability theory: For any ergodic Markov chain, there is a unique long-term visit rate for each state. “Steady-state probability distribution” Over a long time-period, we visit each state in proportion to this rate. It doesn’t matter where we start! As long as you walk “long enough”

19 Probability vectors A probability (row) vector x = (x 1, … x n ) tells us where the walk is at any point. E.g., (0 0 0 … 1 … 0 0 0) means we’re in state i. E.g., (0 0 0 … 0.5 … 0.5 … 0 0) means we’re? ___________ in1 More generally, the vector x = (x 1, … x n ) means the walk is in state i with probability x i. in1 j

20 Change in probability vector If the probability vector is x = (x 1, … x n ) at this step, what is it at the next step? Recall that row i of the transition prob. Matrix P tells us where we go next from state i. So from x, our next state is distributed as xP.

21 Steady state example The steady state looks like a vector of probabilities a = (a 1, … a n ): a i is the probability that we are in state i. 12 3/4 1/4 3/41/4 For this example, a 1 =1/4 and a 2 =3/4.

22 How do we compute this vector? Let a = (a 1, … a n ) denote the row vector of steady-state probabilities. If we our current position is described by a, then the next step is distributed as aP. But a is the steady state, so a=aP. Solving this matrix equation gives us a. Linear Algebra anyone? So a is the (left) eigenvector for P. Corresponds to the “principal” eigenvector of P with the largest eigenvalue. Transition probability matrices always have largest eigenvalue.

23 One way of computing a (all Pageranks) Recall, regardless of where we start, we eventually reach the steady state a. Start with any distribution (say x=(1 0 … 0)). After one step, we’re at xP; after two steps at xP 2, then xP 3 and so on. “Eventually” means for “large” k, xP k = a. Algorithm: multiply x by increasing powers of P until the product looks stable.

24 Pagerank summary Preprocessing: Given graph of links, build matrix P. From it compute a. The entry a i is a number between 0 and 1: the pagerank of page i. Query processing: Retrieve pages meeting query. Rank them by their pagerank. Order is query-independent If PR(A) > PR(B) for some query, it beats it in every query

25 How is Pagerank used? PageRank Technology: PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. http://www.google.com/corporate/tech.html “Pagerank is used in Google, but so are many other clever heuristics.” Pagerank is dead, long live Pagerank!

26 Extended PageRank example

27

28

29 2.2 Pagerank: Issues and Variants How realistic is the random surfer model? What if we modeled the back button? [Fagi00] Surfer behavior sharply skewed towards short paths [Hube98] Search engines, bookmarks & directories make jumps non- random. Biased Surfer Models Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection) Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest)

30 Topic Specific Pagerank [Have02] Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:  Selects a category (say, one of the 16 top level ODP categories) based on a query & user -specific distribution over the categories  Teleport to a page uniformly at random within the chosen category Sounds hard to implement: can’t compute PageRank at query time!

31 Implementation of Topic Specific Pagerank offline: Compute pagerank distributions w.r.t. individual categories Query independent model as before Each page has multiple pagerank scores – one for each ODP category, with teleportation only to that category online: Distribution of weights over categories computed by query context classification Generate a dynamic pagerank score for each page - weighted sum of category-specific pageranks

32 Influencing PageRank (“Personalization”) Input: Web graph W influence vector v v : (page  degree of influence) Output: Rank vector r: (page  page importance wrt v) r = PR(W, v)

33 Non-uniform Teleportation Teleport with 10% probability to a Sports page Sports

34 Interpretation of Composite Score For a set of personalization vectors {v j }  j [w j · PR(W, v j )] = PR(W,  j [w j · v j ]) Weighted sum of rank vectors itself forms a valid rank vector, because PR() is linear wrt v j

35 Interpretation 10% Sports teleportation Sports

36 Interpretation Health 10% Health teleportation

37 Interpretation Sports Health pr = (0.9 PR sports + 0.1 PR health ) gives you: 9% sports teleportation, 1% health teleportation

38 3. Hyperlink-Induced Topic Search (HITS) [Klei98] In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages: Hub pages are good lists of links on a subject.  e.g., “Bob’s list of cancer-related links.” Authority pages occur recurrently on good hubs for the subject. Best suited for “broad topic” queries rather than for page-finding queries. Gets at a broader slice of common opinion.

39 Pre-processing for HITS 1) Collect the top t pages (say t = 200) based on the input query; call this the root set. 2) Extend the root set into a base set as follows, for all pages p in the root set: 1) add to the root set all pages that p points to, and 2) add to the root set up-to q pages that point to p (say q = 50). 3) Delete all links within the same web site in the base set resulting in a focused sub-graph.

40 Expanding the Root Set

41 HITS Algorithm – Iterate until Convergence B is the base set q and p are web pages in B A(p) is the authority score for p H(p) is the hub score for p

42 Applications of HITS Search engine querying (speed is an issue). Finding web communities. Finding related pages. Populating categories in web directories. Citation analysis.

43 Communities on the Web A densely linked focused sub-graph of hubs and authorities is called a community. Over 100,000 emerging web communities have been discovered from a web crawl (a process called trawling). Alternatively, a community is a set of web pages W having at least as many links to pages inside W as to pages outside W.

44 Weblogs influence on PageRank A weblog (or blog) is a frequently updated web site on a particular topic, made up of entries in reverse chronological order. Blogs are a rich source of links, and therefore their links influence PageRank. A “google bomb” is an attempt to influence the ranking of a web page for a given phrase by adding links to the page with the phrase as its anchor text.

45 Link Spamming to Improve PageRank Spam is the act of trying unfairly to gain a high ranking on a search engine for a web page without improving the user experience. Link farms - join the farm by copying a hub page which links to all members. Selling links from sites with high PageRank.


Download ppt "CS349 – Link Analysis 1. Anchor text 2. Link analysis for ranking 2.1 Pagerank 2.2 Pagerank variants 2.3 HITS."

Similar presentations


Ads by Google