Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

- Intuition: solve the recursive equation "a page is important if important pages link to it."
- Technically, importance = the principal eigenvector of the transition matrix of the Web.
- A few fixups are needed.

- Number the pages 1, 2, ….  Page i corresponds to row and column i.
- M[i, j] = 1/n if page j links to n pages, including page i; 0 if j does not link to i.
- M[i, j] is the probability we'll next be at page i if we are now at page j.

(Diagram: page j links to 3 pages, one of which is i but not x, so M[i, j] = 1/3 and M[x, j] = 0.)
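As a concrete aid, here is a minimal sketch (not from the slides; the link lists encode the Yahoo/Amazon/Microsoft example used below) of building M with numpy:

```python
import numpy as np

# Page names for the running Yahoo/Amazon/Microsoft example.
pages = ["y", "a", "m"]
# links[j] = the pages that page j links to.
links = {"y": ["y", "a"],   # Yahoo links to itself and Amazon
         "a": ["y", "m"],   # Amazon links to Yahoo and Microsoft
         "m": ["a"]}        # Microsoft links to Amazon

idx = {p: i for i, p in enumerate(pages)}
N = len(pages)
M = np.zeros((N, N))
for j, outs in links.items():
    for i in outs:
        # Column j describes page j's out-links, each weighted 1/|O(j)|.
        M[idx[i], idx[j]] = 1.0 / len(outs)

print(M)
# [[0.5 0.5 0. ]
#  [0.5 0.  1. ]
#  [0.  0.5 0. ]]
```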

- Suppose v is a vector whose ith component is the probability that a random walker is at page i at a certain time.
- If a walker follows a link from i at random, the probability distribution for walkers is then given by the vector Mv.

- Starting from any vector v, the limit M(M(…M(Mv)…)) is the long-term distribution of walkers.
- Intuition: pages are important in proportion to how likely a walker is to be there.
- The math: limiting distribution = principal eigenvector of M = PageRank.

(Graph: Yahoo links to itself and Amazon; Amazon links to Yahoo and Microsoft; Microsoft links to Amazon.)

           y    a    m
  M =  y [ 1/2  1/2   0  ]
       a [ 1/2   0    1  ]
       m [  0   1/2   0  ]

- Because there are no constant terms, the equations v = Mv do not have a unique solution.
- In Web-sized examples, we cannot solve by Gaussian elimination anyway; we need to use relaxation (= iterative solution).
- Iteration works if you start from a fixed initial v.

- Start with the vector v = [1, 1, …, 1], representing the idea that each Web page is given one unit of importance.
- Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk.
- About 50 iterations is sufficient to estimate the limiting solution.

Equations v = Mv:

  y = y/2 + a/2
  a = y/2 + m
  m = a/2

Iterating from v = [1, 1, 1] (note: "=" is really assignment):

  y:  1    1    5/4   9/8  ...  6/5
  a:  1   3/2    1   11/8  ...  6/5
  m:  1   1/2   3/4   1/2  ...  3/5
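The same computation as a short sketch in numpy (assuming the matrix M from the example above):

```python
import numpy as np

# Transition matrix of the Yahoo/Amazon/Microsoft example.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

v = np.ones(3)        # one unit of importance per page
for _ in range(50):   # ~50 iterations, as the slides suggest
    v = M @ v         # importance flows along the links

print(v)              # -> approximately [1.2, 1.2, 0.6] = [6/5, 6/5, 3/5]
```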

(Slides 11-15: animation frames of the Yahoo/Amazon/Microsoft graph showing the importance values converging over successive iterations.)

- Some pages are dead ends (have no links out).  Such a page causes importance to leak out.
- Other groups of pages are spider traps (all out-links are within the group).  Eventually spider traps absorb all importance.

(Same graph, but Microsoft's out-link is removed, making it a dead end; its column is all 0.)

           y    a    m
  M =  y [ 1/2  1/2   0  ]
       a [ 1/2   0    0  ]
       m [  0   1/2   0  ]

Equations v = Mv:

  y = y/2 + a/2
  a = y/2
  m = a/2

Iterating from v = [1, 1, 1]:

  y:  1    1    3/4   5/8  ...  0
  a:  1   1/2   1/2   3/8  ...  0
  m:  1   1/2   1/4   1/4  ...  0
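To see the leak numerically, here is the same iteration loop (a sketch, assuming the dead-end matrix above):

```python
import numpy as np

# Dead-end matrix: Microsoft's column is all zeros.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
v = np.ones(3)
for _ in range(50):
    v = M @ v
print(v, v.sum())   # every component drains toward 0
```

Swapping in the spider-trap matrix from the next slides (Microsoft linking only to itself) makes the same loop converge to [0, 0, 3] instead: Microsoft absorbs everything.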

(Slides 19-23: animation frames showing all importance draining out through the Microsoft dead end.)

(Now Microsoft links only to itself, forming a one-page spider trap; M[m, m] = 1.)

           y    a    m
  M =  y [ 1/2  1/2   0  ]
       a [ 1/2   0    0  ]
       m [  0   1/2   1  ]

Equations v = Mv:

  y = y/2 + a/2
  a = y/2
  m = a/2 + m

Iterating from v = [1, 1, 1]:

  y:  1    1    3/4   5/8  ...  0
  a:  1   1/2   1/2   3/8  ...  0
  m:  1   3/2   7/4    2   ...  3

(Slides 26-29: animation frames showing Microsoft, the spider trap, absorbing all the importance.)

- "Tax" each page a fixed percentage at each iteration.
- Add a fixed constant to all pages.
- Good idea: distribute the tax, plus whatever is lost in dead ends, equally to all pages.
- This models a random walk with a fixed probability of leaving the system, and a fixed number of new walkers injected into the system at each step.

Equations v = 0.8(Mv) + 0.2:

  y = 0.8(y/2 + a/2) + 0.2
  a = 0.8(y/2) + 0.2
  m = 0.8(a/2 + m) + 0.2

Iterating from v = [1, 1, 1]:

  y:  1   1.00   0.84  ...  7/11
  a:  1   0.60   0.60  ...  5/11
  m:  1   1.40   1.56  ...  21/11
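A minimal sketch of the taxed iteration (assuming the spider-trap matrix above and a 20% tax):

```python
import numpy as np

# Spider-trap matrix, now iterated with taxation: v <- 0.8*(M v) + 0.2.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
v = np.ones(3)
for _ in range(50):
    v = 0.8 * (M @ v) + 0.2   # tax 20%, then add a fixed 0.2 to every page
print(v)                      # -> approximately [7/11, 5/11, 21/11]
```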

- Goal: evaluate Web pages not just according to their popularity, but by how relevant they are to a particular topic, e.g., "sports" or "history."
- Allows search queries to be answered based on the interests of the user.
- Example: the search query [SAS] wants different pages depending on whether you are interested in travel or technology.

- Assume each walker has a small probability of "teleporting" at any tick.
- The teleport can go to:
  1. Any page with equal probability (as in the "taxation" scheme).
  2. A set of "relevant" pages, the teleport set (for topic-specific PageRank).

- Only Microsoft is in the teleport set.
- Assume a 20% "tax," i.e., the probability of a teleport is 20%.
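A minimal sketch of this variant (assumptions: the original three-page matrix from the first example; the only change from the taxation code is that the 20% teleport mass goes entirely to Microsoft):

```python
import numpy as np

# Original Yahoo/Amazon/Microsoft transition matrix.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
teleport = np.array([0.0, 0.0, 1.0])   # teleport set = {Microsoft}
v = np.ones(3) / 3                     # start from a uniform distribution
for _ in range(50):
    v = 0.8 * (M @ v) + 0.2 * teleport
print(v)   # -> approximately [8/31, 12/31, 11/31]; rank shifts toward the teleport set
```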

(Slides 35-41: animation frames of the teleporting walk on the Yahoo/Amazon/Microsoft graph; the teleport to Microsoft is drawn as "Dr. Who's phone booth.")

Two ways to pick the teleport set for a topic:

1. Choose the pages belonging to the topic in Open Directory.
2. "Learn" from examples the typical words in pages belonging to the topic; use pages heavy in those words as the teleport set.

- Spam farmers create networks of millions of pages designed to focus PageRank on a few undeserving pages.  We'll discuss this technology shortly.
- To minimize their influence, use a teleport set consisting of trusted pages only.
- Example: home pages of universities.

- Mutually recursive definition:
  - A hub links to many authorities.
  - An authority is linked to by many hubs.
- Authorities turn out to be places where information can be found.  Example: course home pages.
- Hubs tell where the authorities are.  Example: a departmental course-listing page.

- HITS uses a matrix A[i, j] = 1 if page i links to page j, 0 if not.
- A^T, the transpose of A, is similar to the PageRank matrix M, but A^T has 1's where M has fractions.

(Graph for this example: Yahoo links to all three pages; Amazon links to Yahoo and Microsoft; Microsoft links to Amazon.)

           y  a  m
  A =  y [ 1  1  1 ]
       a [ 1  0  1 ]
       m [ 0  1  0 ]

- Powers of A and A^T have elements of exponential size, so we need scale factors.
- Let h and a be vectors measuring the "hubbiness" and authority of each page.
- Equations: h = λAa; a = μA^T h.
  - Hubbiness = scaled sum of the authorities of successor pages (out-links).
  - Authority = scaled sum of the hubbiness of predecessor pages (in-links).

- From h = λAa and a = μA^T h we can derive:
  - h = λμ AA^T h
  - a = λμ A^T A a
- Compute h and a by iteration, assuming initially each page has one unit of hubbiness and one unit of authority.
- Pick an appropriate value of λμ.

        [ 1 1 1 ]          [ 1 1 0 ]          [ 3 2 1 ]           [ 2 1 2 ]
  A  =  [ 1 0 1 ]   A^T =  [ 1 0 1 ]   AA^T = [ 2 2 0 ]   A^T A = [ 1 2 1 ]
        [ 0 1 0 ]          [ 1 1 0 ]          [ 1 0 1 ]           [ 2 1 2 ]

Starting from h(yahoo) = h(amazon) = h(microsoft) = 1 and iterating with scaling, the values converge to approximately:

  a(yahoo) = 1,  a(amazon) = 0.732,  a(m'soft) = 1
  h(yahoo) = 1,  h(amazon) = 0.732,  h(m'soft) = 0.268

- Iterate as for PageRank; don't try to solve the equations.
- But keep the components within bounds.  Example: scale to keep the largest component of the vector at 1.
- Trick: start with h = [1, 1, …, 1]; multiply by A^T to get the first a; scale, then multiply by A to get the next h, …
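A minimal sketch of exactly this trick (assuming the three-page A above; A.T is numpy's transpose):

```python
import numpy as np

# Link matrix A for the three-page HITS example.
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

h = np.ones(3)          # start with h = [1, 1, ..., 1]
for _ in range(50):
    a = A.T @ h         # authority: sum the hubbiness of in-links
    a /= a.max()        # scale so the largest component is 1
    h = A @ a           # hubbiness: sum the authority of out-links
    h /= h.max()
print("h =", h)         # -> approximately [1, 0.732, 0.268]
print("a =", a)         # -> approximately [1, 0.732, 1]
```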

- You may be tempted to compute AA^T and A^T A first, then iterate these matrices as for PageRank.
- Bad idea: these matrices are not nearly as sparse as A and A^T.

- PageRank prevents spammers from using term spam (faking the content of their page by adding invisible words) to fool a search engine.
- Spammers now attempt to fool PageRank with link spam: creating structures on the Web, called spam farms, that increase the PageRank of undeserving pages.

Three kinds of Web pages from a spammer's point of view:

1. Own pages: completely controlled by the spammer.
2. Accessible pages: e.g., Web-log comment pages, where the spammer can post links to his pages.
3. Inaccessible pages.

- Spammer's goal: maximize the PageRank of target page t.
- Technique:
  1. Get as many links as possible from accessible pages to target page t.
  2. Construct a "link farm" to get a PageRank multiplier effect.

(Diagram: a spam farm, one of the most common and effective organizations. The spammer's own pages are the target page t plus farm pages 1, 2, …, M; accessible pages link to t; inaccessible pages make up the rest of the Web. Goal: boost the PageRank of page t.)

- Suppose the PageRank flowing to t from accessible pages = x, and the PageRank of target page t = y.
- Let the taxation rate be 1 - β.
- Then the rank of each "farm" page = βy/M + (1-β)/N, where M = the number of farm pages (each gets an equal share of t's rank) and N = the size of the Web (each page gets an equal share of the "tax").

  y = x + βM[βy/M + (1-β)/N] + (1-β)/N
    = x + β²y + β(1-β)M/N + (1-β)/N

The final term, t's own share of the tax, is very small; ignore it. Solving for y:

  y = x/(1-β²) + cM/N, where c = β/(1+β).

- y = x/(1-β²) + cM/N, where c = β/(1+β).
- For β = 0.85, 1/(1-β²) = 3.6: a multiplier effect for "acquired" PageRank.
- By making M large, we can make y as large as we want.
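A quick check of the formula; the values of x, M, and N below are hypothetical, chosen only to show the multiplier at work:

```python
beta = 0.85
amplifier = 1 / (1 - beta ** 2)   # ~3.6: multiplier on externally acquired rank
c = beta / (1 + beta)             # ~0.46: rank gained per farm page, scaled by 1/N
x, M, N = 1e-9, 1_000_000, 1e10   # hypothetical: external rank, farm size, Web size
y = amplifier * x + c * M / N
print(round(amplifier, 2), round(c, 2), y)   # the cM/N term dominates here
```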

- Topic-specific PageRank, with a set of "trusted" pages as the teleport set, is called TrustRank.
- Spam mass = (PageRank - TrustRank)/PageRank.
- A high spam mass means most of your PageRank comes from untrusted sources; the page may be link spam.
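To make the formula concrete (illustrative numbers only): a page with PageRank 0.001 whose TrustRank is 0.00005 has spam mass (0.001 - 0.00005)/0.001 = 0.95, i.e., 95% of its rank comes from untrusted sources.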

Two conflicting considerations:

- A human has to inspect each seed page, so the seed set must be as small as possible.
- We must ensure every "good page" gets adequate TrustRank, so all good pages should be reachable from the trusted set by short paths.

Two ways to pick the seed set:

1. Pick the top k pages by PageRank.  It is almost impossible to get a spam page to the very top of the PageRank order.
2. Pick the home pages of universities.  Domains like .edu are controlled.