GRAPH AND LINK MINING


1 GRAPH AND LINK MINING

2 Graphs - Basics

3 Undirected Graphs
Undirected graph: the edges are unordered pairs of nodes; they can be traversed in either direction.
Degree of a node: the number of edges incident on the node.
Path: a sequence of edges leading from one node to another. We say that the second node is reachable from the first.
Connected component: a maximal set of nodes such that there is a path between any two nodes in the set.
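The definitions above can be sketched in a few lines of Python. The graph below is hypothetical (it does not come from the slides), and the adjacency-list representation is just one convenient choice.

```python
from collections import deque

# Hypothetical undirected graph (not from the slides); "f" is an isolated node.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]

adj = {n: set() for n in nodes}
for u, v in edges:
    # An undirected edge can be traversed both ways, so record both directions.
    adj[u].add(v)
    adj[v].add(u)

def degree(node):
    """Number of edges incident on the node."""
    return len(adj[node])

def connected_components():
    """Group nodes so that any two nodes in a group are joined by a path."""
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # breadth-first search from the start node
            u = queue.popleft()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(component))
    return components

print(degree("b"))
print(connected_components())
```

Here node "b" has degree 2, and the graph splits into three components: {a, b, c}, {d, e}, and the isolated node {f}.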

4 Directed Graphs
Directed graph: the edges are ordered pairs of nodes; they can be traversed only in the direction from the first node to the second.
In-degree and out-degree of a node: the number of incoming and outgoing edges, respectively.
Path: a sequence of directed edges leading from one node to another. We say that the second node is reachable from the first.
Strongly connected component: a maximal set of nodes such that there is a directed path between any two nodes in the set.
Weakly connected component: a maximal set of nodes such that there is a path between any two nodes in the set when edge directions are ignored.
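As a rough illustration of the difference between strong and weak connectivity, the sketch below uses a hypothetical four-node graph (not from the slides). It finds strongly connected components by intersecting forward and backward reachability, which is simple but quadratic; linear-time algorithms such as Tarjan's or Kosaraju's would be the usual choice on large graphs.

```python
# Hypothetical directed graph: a -> b -> c -> a is a cycle, and c -> d leaves it.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

out_adj = {n: set() for n in nodes}  # successors of each node
in_adj = {n: set() for n in nodes}   # predecessors of each node
for u, v in edges:
    out_adj[u].add(v)
    in_adj[v].add(u)

def reachable(start, adj):
    """Set of nodes reachable from start by following the edges in adj."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def components(forward, backward):
    """Nodes share a component when each is reachable from the other."""
    assigned, result = set(), []
    for n in nodes:
        if n in assigned:
            continue
        component = reachable(n, forward) & reachable(n, backward)
        assigned |= component
        result.append(sorted(component))
    return result

# Strong: follow edge directions. Weak: ignore them.
undirected = {n: out_adj[n] | in_adj[n] for n in nodes}
strong = components(out_adj, in_adj)
weak = components(undirected, undirected)

print(len(in_adj["c"]), len(out_adj["c"]))  # in-degree and out-degree of c
print(strong)
print(weak)
```

Node d can be reached from the cycle but cannot get back to it, so it forms its own strongly connected component, while all four nodes form a single weakly connected component.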

5 Examples of Graphs We Might Mine
Airline route maps are useful: the information can tell you about both history and politics.
Call detail records tell us about relationships between people. Based on news from the last few years, who seems most interested in this?
The Web is based on (hyper)links between documents.
Link analysis is the data mining technique that addresses relationships and connections.

6 6 Degrees of Separation
The claim is that there are no more than 6 degrees of separation between any two people.
This is important in social networks. For example, LinkedIn tells you how you connect to others, and your network expands with each link.
Stanley Milgram was not the first to note the small-world phenomenon, but he popularized it with a famous experiment.
How close are two random people?
  He picked source people in Omaha, Nebraska or Wichita, Kansas and a target person in Boston.
  Each source person was asked to send a letter to the target person or, if they did not know the target, to someone more likely to know them.
  The average path length was 5.5 or 6.
  But only 64 of 296 letters arrived.

7 Examples of Applications
Identifying authoritative sources of information on the WWW by analyzing page links.
  Google and PageRank: we will come back to this.
Understanding physician referral patterns.
Analyzing telephone call patterns.
  MCI Friends and Family.
  Could give out private info: "You know Mary Smith, also on MCI, so join MCI." But your wife does not know Mary Smith.
  Far-fetched? Facebook does it all of the time!
  Can identify fraud: calling-card thieves call the same people.

8 Mining the graph structure
A graph is a combinatorial object with a certain structure.
Mining the structure of the graph reveals information about the entities in the graph.
  E.g., if I find 100 people in the Facebook graph who are all linked to each other, then these people are likely to be a community. This is the community discovery problem.
  By measuring the number of friends each user has in the Facebook graph, I can find the most important nodes. This is the node importance problem.

9 Importance problem
What are the most important nodes in the graph?
  What are the most authoritative pages on the Web?
  Who are the important users on Facebook?
  What are the most influential Twitter accounts?

10 Link Analysis
First-generation search engines:
  viewed documents as flat text files;
  could not cope with size, spamming, or user needs.
Second-generation search engines:
  ranking becomes critical;
  shift from relevance to authoritativeness;
  authoritativeness: the static importance of the page;
  use of Web-specific data: link analysis of the Web graph;
  a success story for network analysis and a huge commercial success;
  it all started with two graduate students at Stanford.

11 Link Analysis: Intuition
A link from page p to page q denotes endorsement:
  page p considers page q an authority on a subject;
  use the graph of recommendations;
  assign an authority value to every page.
The same idea applies to other graphs as well, e.g. the Twitter graph, where user p follows user q.

12 Constructing the graph
Goal: output an authority weight for each node.
  Also known as centrality, or importance.
[Figure: example graph with an authority weight w attached to each node]

13 Rank by Popularity
Rank pages according to the number of incoming edges (in-degree, degree centrality).
1. Red Page
2. Yellow Page
3. Blue Page
4. Purple Page
5. Green Page
[Figure: example graph with node weights w=1, w=2, w=3, w=2]
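Ranking by in-degree amounts to counting incoming links and sorting. A minimal sketch follows; the page names and link list are hypothetical and do not reproduce the graph in the slide's figure.

```python
from collections import Counter

# Hypothetical link list of (source page, target page) pairs; not the slide's graph.
pages = ["A", "B", "C", "D", "E"]
links = [("A", "C"), ("B", "C"), ("D", "C"),
         ("A", "B"), ("C", "B"), ("C", "D"), ("E", "A")]

# In-degree of a page = number of links pointing at it.
# A Counter returns 0 for pages that nobody links to.
in_degree = Counter(target for _, target in links)

# Highest in-degree first; ties broken alphabetically so the order is deterministic.
ranking = sorted(pages, key=lambda p: (-in_degree[p], p))

for position, page in enumerate(ranking, start=1):
    print(position, page, in_degree[page])
```

With these links the ranking is C (3 incoming links), B (2), A (1), D (1), E (0). Every incoming link counts the same here, no matter who it comes from.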

14 Popularity
What matters is not only how many pages link to you, but also how important the pages that link to you are.
Good authorities are pointed to by good authorities.
  This is a recursive definition of importance.

15 PageRank
Good authorities should be pointed to by good authorities.
  The value of a page is the value of the pages that link to it.
How do we implement that?
  Assume that we have one unit of authority to distribute among all nodes.
  Each node distributes the authority value it has to its neighbors, in equal shares over its outgoing links.
  The authority value of each node is the sum of the authority fractions it collects from its neighbors.
  Solving the system of equations gives the authority values for the nodes.
[Figure: three-node example graph; the nodes are numbered 1, 2, 3 here]
w_1 + w_2 + w_3 = 1
w_1 = w_2 + w_3
w_2 = (1/2) w_1
Solution: w_1 = 1/2, w_2 = 1/4, w_3 = 1/4
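The three-node solution can be checked mechanically. The sketch below assumes a numbering and edge set that are consistent with the slide's equations but not stated in the transcript: node 1 is the page that ends up with weight 1/2 and links to nodes 2 and 3, while nodes 2 and 3 each link only to node 1.

```python
from fractions import Fraction

# Assumed edges, chosen to match the slide's equations (not stated in the transcript):
# node 1 links to 2 and 3; nodes 2 and 3 each link only to node 1.
out_links = {1: [2, 3], 2: [1], 3: [1]}

def collected(weights):
    """Authority each node collects when every node splits its weight
    evenly over its outgoing links."""
    received = {node: Fraction(0) for node in out_links}
    for source, targets in out_links.items():
        share = weights[source] / len(targets)
        for target in targets:
            received[target] += share
    return received

# Candidate solution from the slide.
w = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}

print(sum(w.values()) == 1)  # one unit of authority in total
print(collected(w) == w)     # redistributing leaves every weight unchanged
```

Both checks print True: the weights sum to one unit, and one round of redistribution returns exactly the same weights, which is what it means for them to solve the system.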

16 A more complex example
[Figure: directed graph on nodes v1, v2, v3, v4, v5]
w_1 = (1/3) w_4 + (1/2) w_5
w_2 = (1/2) w_1 + w_3 + (1/3) w_4
w_3 = (1/2) w_1 + (1/3) w_4
w_4 = (1/2) w_5
w_5 = w_2
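The five equations can be solved by substitution. The derivation below assumes, as in the earlier three-node example, that the weights are normalized to sum to 1; the slide itself does not list that condition here.

```latex
\begin{align*}
w_5 &= w_2, \qquad w_4 = \tfrac12 w_5 = \tfrac12 w_2 \\
w_1 &= \tfrac13 w_4 + \tfrac12 w_5 = \tfrac16 w_2 + \tfrac12 w_2 = \tfrac23 w_2 \\
w_3 &= \tfrac12 w_1 + \tfrac13 w_4 = \tfrac13 w_2 + \tfrac16 w_2 = \tfrac12 w_2 \\
w_1 + w_2 + w_3 + w_4 + w_5 &= \left(\tfrac23 + 1 + \tfrac12 + \tfrac12 + 1\right) w_2 = \tfrac{11}{3}\, w_2 = 1 \\
\Rightarrow\quad w_1 &= \tfrac{2}{11}, \quad w_2 = \tfrac{3}{11}, \quad w_3 = w_4 = \tfrac{3}{22}, \quad w_5 = \tfrac{3}{11}
\end{align*}
```

The equation for w_2 is redundant, which serves as a consistency check: (1/2)(2/3) w_2 + (1/2) w_2 + (1/3)(1/2) w_2 = w_2.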

17 Random Walks on Graphs
What we described is equivalent to a random walk on the graph.
Random walk:
  Start from a node chosen uniformly at random.
  Pick one of the outgoing edges uniformly at random and follow it.
  Repeat.
Some nodes will be visited more often than others. Those are the more important nodes.
  Importance is based not only on the number of incoming links, but also on how often the predecessor nodes are visited.
  A value like Google's PageRank indicates how often a node would be visited.
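The walk can be simulated directly. The sketch below uses the five-node example, with edges read off its equations (an interpretation: for instance, the term (1/3) w_4 appearing in three equations is taken to mean that v4 has three outgoing links). The step count and seed are arbitrary choices.

```python
import random
from collections import Counter

# Edges inferred from the five-node example's equations (an interpretation).
out_links = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}

def visit_frequencies(steps, seed=0):
    """Fraction of steps the walk spends at each node."""
    rng = random.Random(seed)           # seeded so that runs are repeatable
    node = rng.choice(list(out_links))  # start from a node uniformly at random
    visits = Counter()
    for _ in range(steps):
        visits[node] += 1
        node = rng.choice(out_links[node])  # follow a random outgoing edge
    return {n: visits[n] / steps for n in out_links}

freq = visit_frequencies(200_000)
for n in sorted(freq):
    print(f"v{n}: {freq[n]:.3f}")
```

Solving the five-node equations with the weights summing to 1 gives 2/11, 3/11, 3/22, 3/22, 3/11 for v1 through v5 (about 0.182, 0.273, 0.136, 0.136, 0.273), and the simulated frequencies should land close to those values. This simple version assumes every node has at least one outgoing edge.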

18 Random walks on graphs
[Figure: directed graph on nodes v1, v2, v3, v4, v5]
p'_1 = (1/3) p_4 + (1/2) p_5
p'_2 = (1/2) p_1 + p_3 + (1/3) p_4
p'_3 = (1/2) p_1 + (1/3) p_4
p'_4 = (1/2) p_5
p'_5 = p_2

19 How Does PageRank Work?
The PageRank of page A depends on the PageRank of the other pages pointing to it.
We can initialize all pages to the same arbitrary PageRank (e.g., 1) and then repeatedly perform the calculation for each page.
  Eventually the values will settle down (converge).
PageRank is what caused Google to succeed.
  Prior to that, only content mattered, not link structure.
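A minimal sketch of this repeated calculation, again on the five-node example with edges inferred from its equations. It starts every page at 1/n rather than 1 so that the values sum to 1; the tolerance and iteration cap are arbitrary. It is only the basic scheme described here: production PageRank also uses a damping (random jump) factor, which this sketch omits.

```python
# Edges inferred from the five-node example's equations (an interpretation).
out_links = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}

def pagerank(out_links, tol=1e-12, max_iter=10_000):
    """Repeat the per-page calculation until the values stop changing.
    Assumes every page has at least one outgoing link."""
    n = len(out_links)
    rank = {page: 1.0 / n for page in out_links}  # same starting value everywhere
    for _ in range(max_iter):
        new_rank = {page: 0.0 for page in out_links}
        for source, targets in out_links.items():
            share = rank[source] / len(targets)  # split evenly over outgoing links
            for target in targets:
                new_rank[target] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in out_links)
        rank = new_rank
        if change < tol:  # values have settled down
            break
    return rank

rank = pagerank(out_links)
for page in sorted(rank):
    print(f"v{page}: {rank[page]:.4f}")
```

On this graph the values settle at about 0.1818, 0.2727, 0.1364, 0.1364, 0.2727 for v1 through v5, i.e. 2/11, 3/11, 3/22, 3/22, 3/11.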

20 Benefits of PageRank
It is not trivial to fool PageRank.
  If you want to boost a page, you can create dummy pages that point to it, but since no one is pointing to those dummy pages, they will have low PageRank and will not help much.
  You can also make the dummy pages point to one another, but unless an outside authority points to them, the impact will be limited.
  Still, it is clear that Google must have many tweaks to catch cases like this: link spam or link farms.

