GRAPH AND LINK MINING


Graphs - Basics

Undirected Graphs
Undirected graph: the edges are unordered pairs of nodes, so they can be traversed in either direction.
Degree of a node: the number of edges incident on the node.
Path: a sequence of edges leading from one node to another; the second node is then said to be reachable from the first.
Connected component: a maximal set of nodes such that there is a path between any two nodes in the set.
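The definitions above can be made concrete with a minimal Python sketch. The edge list and the helper names (`build_undirected`, `connected_components`) are hypothetical and chosen only for illustration; nodes without any edge are not represented in this simple adjacency map.

```python
from collections import defaultdict, deque

def build_undirected(edges):
    """Return an adjacency map {node: set of neighbours} for an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)  # undirected: store the edge in both directions
    return adj

def connected_components(adj):
    """Return the connected components as a list of node sets, found by BFS."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        components.append(component)
    return components

# Hypothetical example graph: a path a-b-c plus a separate edge d-e.
edges = [("a", "b"), ("b", "c"), ("d", "e")]
adj = build_undirected(edges)
degree = {node: len(neighbours) for node, neighbours in adj.items()}
components = connected_components(adj)
```

Here `degree` maps each node to its number of incident edges, and `components` holds two node sets, one per connected piece of the example graph.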

Directed Graphs
Directed graph: the edges are ordered pairs of nodes; each can be traversed only in the direction from the first node to the second.
In-degree and out-degree of a node: the number of edges entering and leaving the node, respectively.
Path: a sequence of directed edges leading from one node to another; the second node is then said to be reachable from the first.
Strongly connected component: a maximal set of nodes such that there is a directed path from any node in the set to any other.
Weakly connected component: a maximal set of nodes such that there is a path between any two nodes in the set when edge directions are ignored.
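A small sketch of the directed-graph notions, assuming a hypothetical four-node example. It finds the strongly connected component of a node as the intersection of what the node can reach and what can reach it; this is simple rather than efficient, and a production system would more likely use a linear-time algorithm such as Tarjan's.

```python
from collections import defaultdict

def reachable(adj, start):
    """Return the set of nodes reachable from start by following edges in adj."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def partition(nodes, component_of):
    """Group nodes into components, where component_of(n) returns n's component."""
    remaining, parts = set(nodes), []
    while remaining:
        comp = component_of(next(iter(remaining)))
        parts.append(comp)
        remaining -= comp
    return parts

# Hypothetical example: a directed cycle 1 -> 2 -> 3 -> 1 with an extra edge 3 -> 4.
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
nodes = {n for edge in edges for n in edge}

successors, predecessors, neighbours = defaultdict(set), defaultdict(set), defaultdict(set)
for u, v in edges:
    successors[u].add(v)      # edge direction as given
    predecessors[v].add(u)    # reversed edge
    neighbours[u].add(v)      # direction ignored
    neighbours[v].add(u)

out_degree = {n: len(successors.get(n, ())) for n in nodes}
in_degree = {n: len(predecessors.get(n, ())) for n in nodes}

# Strong: nodes that n can reach AND that can reach n. Weak: ignore directions.
strong = partition(nodes, lambda n: reachable(successors, n) & reachable(predecessors, n))
weak = partition(nodes, lambda n: reachable(neighbours, n))
```

In this example node 4 can be reached from the cycle but cannot get back, so it forms its own strongly connected component while still belonging to the single weakly connected component.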

Examples of Graphs We Might Mine
Airline route maps are useful
  The information can tell you about both history and politics
Call detail records tell us about relationships between people
  Based on news from the last few years, who seems most interested in this?
The Web is based on (hyper)links between documents
Link analysis is the data mining technique that addresses relationships and connections

6 Degrees of Separation
The claim is that there are no more than 6 degrees of separation between any two people.
This is important in social networks. For example, LinkedIn tells you how you connect to others, and the network expands with each link.
Stanley Milgram was not the first to note the small-world phenomenon, but he popularized it with a famous experiment.
  How close are two random people?
  He picked source people in Omaha, Nebraska or Wichita, Kansas and a target person in Boston.
  Each source person was asked to send a letter to the target person or, if they did not know the target, to someone more likely to know them.
  The average path length was 5.5 or 6.
  But only 64 of 296 letters arrived (about 22%).
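The "degrees of separation" between two people correspond to the length of a shortest path in the acquaintance graph. The sketch below computes that length with breadth-first search; the `friends` graph and the function name are hypothetical and only illustrate the idea, not Milgram's procedure, in which participants forwarded letters without knowing the whole graph.

```python
from collections import deque

def degrees_of_separation(adj, source, target):
    """Return the number of edges on a shortest path, or None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adj.get(node, ()):
            if nxt not in dist:          # first visit is along a shortest path
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

# Hypothetical acquaintance graph: a chain ann - bob - cat - dan, and eve knows nobody.
friends = {
    "ann": {"bob"},
    "bob": {"ann", "cat"},
    "cat": {"bob", "dan"},
    "dan": {"cat"},
    "eve": set(),
}

ann_to_dan = degrees_of_separation(friends, "ann", "dan")
ann_to_eve = degrees_of_separation(friends, "ann", "eve")
```

In this toy graph `ann_to_dan` is 3 and `ann_to_eve` is `None`, since no chain of acquaintances connects the two.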

Examples of Applications
Identifying authoritative sources of information on the WWW by analyzing page links
  Google and PageRank: we will come back to this
Understanding physician referral patterns
Analyzing telephone call patterns
  MCI Friends and Family
    Could give out private information
    "You know Mary Smith, who is also on MCI, so join MCI"
    But your wife does not know Mary Smith
    Far-fetched? Facebook does it all of the time!
  Can identify fraud: calling-card thieves call the same people

Mining the graph structure
A graph is a combinatorial object with a certain structure.
Mining the structure of the graph reveals information about the entities in the graph.
  E.g., if in the Facebook graph I find 100 people who are all linked to each other, these people are likely to be a community.
    The community discovery problem
  By measuring the number of friends in the Facebook graph I can find the most important nodes.
    The node importance problem
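A group in which everyone is linked to everyone else is a clique. As a minimal sketch, the check below tests whether a given group of nodes forms one; the `friends` graph and the name `is_clique` are hypothetical, and the adjacency sets are assumed to be symmetric (undirected friendship). Finding large cliques in a big graph is a much harder problem than checking a given group.

```python
from itertools import combinations

def is_clique(adj, group):
    """Return True if every pair of distinct nodes in group is directly linked."""
    return all(v in adj.get(u, ()) for u, v in combinations(group, 2))

# Hypothetical friendship graph: a, b, c all know each other; d only knows c.
friends = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

tight_group = is_clique(friends, ["a", "b", "c"])
loose_group = is_clique(friends, ["a", "b", "c", "d"])
```

The first group is a clique; adding `d` breaks it because `d` is not linked to `a` or `b`.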

Importance problem
What are the most important nodes in the graph?
  What are the most authoritative pages on the Web?
  Who are the important users on Facebook?
  What are the most influential Twitter accounts?

Link Analysis
First-generation search engines
  viewed documents as flat text files
  could not cope with size, spamming, and user needs
Second-generation search engines
  Ranking becomes critical
  shift from relevance to authoritativeness
    authoritativeness: the static importance of the page
  use of Web-specific data: link analysis of the Web graph
  a success story for network analysis and a huge commercial success
  it all started with two graduate students at Stanford

Link Analysis: Intuition
A link from page p to page q denotes endorsement
  page p considers page q an authority on a subject
  use the graph of recommendations
  assign an authority value to every page
The same idea applies to other graphs as well
  e.g., the Twitter graph, where user p follows user q

Constructing the graph
Goal: output an authority weight w for each node
  Also known as centrality, or importance

Rank by Popularity
Rank pages according to the number of incoming edges (in-degree, degree centrality)
  1. Red Page
  2. Yellow Page
  3. Blue Page
  4. Purple Page
  5. Green Page
[Figure: example graph; node labels w=1, w=2, w=3, w=2]
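Ranking by popularity amounts to counting incoming edges and sorting. The sketch below uses a hypothetical four-page link list (not the colored pages of the slide's figure, whose edges are not given in the text) and breaks ties alphabetically, which is an arbitrary choice.

```python
from collections import Counter

def rank_by_in_degree(edges):
    """Return (page, in-degree) pairs, highest in-degree first, ties broken by name."""
    nodes = {n for edge in edges for n in edge}
    in_degree = Counter(target for _, target in edges)  # missing pages count as 0
    return sorted(((n, in_degree[n]) for n in nodes),
                  key=lambda item: (-item[1], item[0]))

# Hypothetical links (source page, target page).
links = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a"), ("d", "a"), ("a", "b")]
ranking = rank_by_in_degree(links)
```

For these links page `c` ranks first with three incoming edges and page `d` last with none.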

Popularity
What matters is not only how many link to you, but also how important those who link to you are.
Good authorities are pointed to by good authorities
  Recursive definition of importance

PageRank
Good authorities should be pointed to by good authorities
  The value of a page is determined by the values of the pages that link to it
How do we implement that?
  Assume that we have one unit of authority to distribute among all nodes.
  Each node distributes its authority value equally among the nodes it points to.
  The authority value of each node is the sum of the authority fractions it collects from the nodes that point to it.
Example with three nodes (numbered 1 to 3 here):
  w_1 + w_2 + w_3 = 1
  w_1 = w_2 + w_3
  w_2 = 1/2 w_1
Solving the system of equations gives the authority values of the nodes: w_1 = 1/2, w_2 = 1/4, w_3 = 1/4
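The small system on this slide can be solved by substitution. The numbering of the three nodes is an assumption made here for readability: node 1 is the one whose weight is the sum of the other two.

```latex
\begin{aligned}
&\text{Given: } w_1 + w_2 + w_3 = 1, \qquad w_1 = w_2 + w_3, \qquad w_2 = \tfrac{1}{2}\, w_1 \\[4pt]
&\text{Substitute } w_2 + w_3 = w_1 \text{ into the first equation: } 2\, w_1 = 1 \;\Rightarrow\; w_1 = \tfrac{1}{2} \\[4pt]
&w_2 = \tfrac{1}{2}\, w_1 = \tfrac{1}{4}, \qquad w_3 = w_1 - w_2 = \tfrac{1}{4}
\end{aligned}
```

This agrees with the values stated on the slide: one node holds half of the authority and the other two hold a quarter each.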

A more complex example
[Figure: directed graph on nodes v1, v2, v3, v4, v5]
  w_1 = 1/3 w_4 + 1/2 w_5
  w_2 = 1/2 w_1 + w_3 + 1/3 w_4
  w_3 = 1/2 w_1 + 1/3 w_4
  w_4 = 1/2 w_5
  w_5 = w_2
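The five equations can be solved exactly once the normalization from the previous slide (the weights sum to 1) is added. The sketch below is one way to do it, not necessarily how the lecture does: it reads the out-links off the equations (a term 1/k w_j in the equation for w_i is taken to mean that node j has k out-links, one of them to node i), which is an interpretation since the figure itself is not available, and then runs Gauss-Jordan elimination with exact fractions.

```python
from fractions import Fraction as F

# Out-links inferred from the equations (assumption, see lead-in).
out_links = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}
nodes = sorted(out_links)
n = len(nodes)

# Augmented matrix for (M - I) w = 0, where M[i][j] = 1/outdeg(j) if j links to i.
rows = []
for i in nodes:
    row = [(F(1, len(out_links[j])) if i in out_links[j] else F(0)) - (1 if i == j else 0)
           for j in nodes]
    rows.append(row + [F(0)])

# The balance equations are linearly dependent, so replace the last one
# by the normalization w_1 + ... + w_5 = 1.
rows[-1] = [F(1)] * n + [F(1)]

# Gauss-Jordan elimination with exact arithmetic.
for col in range(n):
    pivot = next(r for r in range(col, n) if rows[r][col] != 0)
    rows[col], rows[pivot] = rows[pivot], rows[col]
    pivot_value = rows[col][col]
    rows[col] = [x / pivot_value for x in rows[col]]
    for r in range(n):
        if r != col and rows[r][col] != 0:
            factor = rows[r][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]

weights = {node: rows[k][n] for k, node in enumerate(nodes)}
```

Under these assumptions the solution is w_1 = 2/11, w_2 = 3/11, w_3 = 3/22, w_4 = 3/22, w_5 = 3/11, which can be checked by substituting back into the five equations.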

Random Walks on Graphs
What we described is equivalent to a random walk on the graph
Random walk:
  Start from a node chosen uniformly at random
  Pick one of the outgoing edges uniformly at random and follow it
  Repeat
Some nodes will be visited more often than others; those are the more important ones.
  Importance depends not only on the number of incoming links, but also on how often the predecessor nodes are visited
A value like Google's PageRank indicates how often a node would be visited
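The random walk can be simulated directly. The sketch below walks on the five-node example, with out-links inferred from that example's equations (an assumption, since the figure is not available), and records the fraction of steps spent at each node. The function name, the step count, and the fixed seed are illustrative choices; the walk assumes every node has at least one outgoing edge.

```python
import random

# Out-links of the five-node example, inferred from its equations (assumption).
out_links = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}

def visit_frequencies(out_links, steps, seed=0):
    """Simulate a random walk and return the fraction of steps spent at each node."""
    rng = random.Random(seed)                # fixed seed so the run is repeatable
    node = rng.choice(sorted(out_links))     # start from a node uniformly at random
    visits = dict.fromkeys(out_links, 0)
    for _ in range(steps):
        visits[node] += 1
        node = rng.choice(out_links[node])   # follow an outgoing edge uniformly at random
    return {n: count / steps for n, count in visits.items()}

frequencies = visit_frequencies(out_links, steps=200_000)
```

With enough steps the frequencies should approach the authority weights that solve the example's equations, so nodes 2 and 5 end up visited most often.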

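Random walks on graphs
[Figure: directed graph on nodes v1, v2, v3, v4, v5]
Here p_i is the probability of being at node v_i at the current step and p'_i is the probability of being there after one more step:
  p'_1 = 1/3 p_4 + 1/2 p_5
  p'_2 = 1/2 p_1 + p_3 + 1/3 p_4
  p'_3 = 1/2 p_1 + 1/3 p_4
  p'_4 = 1/2 p_5
  p'_5 = p_2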

How Does PageRank Work?
The PageRank of page A depends on the PageRank of the pages pointing to it.
You can initialize all pages to the same arbitrary PageRank (e.g., 1) and then repeatedly perform the calculation for each page.
  Eventually the values will settle down (converge)
PageRank is what caused Google to succeed
  Prior to that, only content mattered, not link structure
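The repeated calculation can be sketched as follows. This is the plain iteration described on the slides, without the damping factor used in practical PageRank, and it assumes every page has at least one out-link. The graph is the five-node example with out-links inferred from its equations; the function name, tolerance, and round limit are illustrative.

```python
# Out-links of the five-node example, inferred from its equations (assumption).
out_links = {1: [2, 3], 2: [5], 3: [2], 4: [1, 2, 3], 5: [1, 4]}

def iterate_pagerank(out_links, initial=1.0, tol=1e-10, max_rounds=1000):
    """Repeat the update until no value changes by more than tol; return (ranks, rounds)."""
    rank = {node: initial for node in out_links}
    for rounds in range(1, max_rounds + 1):
        new_rank = {node: 0.0 for node in out_links}
        for node, targets in out_links.items():
            share = rank[node] / len(targets)    # split the value equally over out-links
            for target in targets:
                new_rank[target] += share
        change = max(abs(new_rank[n] - rank[n]) for n in out_links)
        rank = new_rank
        if change < tol:
            return rank, rounds
    return rank, max_rounds

rank, rounds = iterate_pagerank(out_links)
total = sum(rank.values())
normalized = {node: value / total for node, value in rank.items()}
```

Starting every page at 1 keeps the total equal to the number of pages, so dividing by `total` rescales the result to weights that sum to 1. On a graph like this one, where every page can reach every other and cycle lengths do not share a common divisor, the values settle; on other graphs the plain iteration may oscillate or drain.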

Benefits of PageRank
It is not trivial to fool PageRank
  If you want to boost a page, you can create dummy pages that point to it, but since no one points to those dummy pages, they will have low PageRank and will not help much
  You can also make the dummy pages point to one another, but without links from an outside authority, the impact will be limited
But it is clear that Google must have many tweaks to catch cases like this: link spam or link farms