Structure and models of real-world graphs and networks Jure Leskovec Machine Learning Department Carnegie Mellon University jure@cs.cmu.edu http://www.cs.cmu.edu/~jure/

Jure Leskovec Networks (graphs)

Examples of networks: (a) Internet, (b) citation network, (c) World Wide Web, (d) sexual network, (e) food web

Networks of the real-world (1)
Information networks:
–World Wide Web: hyperlinks
–Citation networks
–Blog networks
Social networks: people + interactions
–Organizational networks
–Communication networks
–Collaboration networks
–Sexual networks
Technological networks:
–Power grid
–Airline, road, river networks
–Telephone networks
–Internet
–Autonomous systems
[Figures: Florence families, Karate club network, Collaboration network, Friendship network]

Networks of the real-world (2)
Biological networks
–metabolic networks
–food web
–neural networks
–gene regulatory networks
Language networks
–Semantic networks
Software networks
…
[Figures: Yeast protein interactions, Semantic network, Language network, XFree86 network]

Types of networks
Directed/undirected
Multigraphs (multiple edges between nodes)
Hypergraphs (edges connecting multiple nodes)
Bipartite graphs (e.g., papers to authors)
Weighted networks
Different types of nodes and edges
Evolving networks:
–Nodes and edges only added
–Nodes, edges added and removed

Traditional approach
Sociologists were the first to study networks:
–Study of patterns of connections between people to understand the functioning of society
–People are nodes, interactions are edges
–Questionnaires are used to collect link data (hard to obtain, inaccurate, subjective)
–Typical questions: centrality and connectivity
Limited to small graphs (~10 nodes) and properties of individual nodes and edges

New approach (1)
Large networks (e.g., web, internet, on-line social networks) with millions of nodes
Many traditional questions are not useful anymore:
–Traditional: What happens if a node U is removed?
–Now: What percentage of nodes needs to be removed to affect network connectivity?
Focus moves from a single node to the study of statistical properties of the network as a whole
We cannot draw (plot) the network and examine it

New approach (2)
What does the network “look like” if I can’t look at it?
Need statistical methods and tools to quantify large networks
3 parts/goals:
–Statistical properties of large networks
–Models that help understand these properties
–Predict behavior of networked systems based on measured structural properties and local rules governing individual nodes

Statistical properties of networks
Features that are common to networks of different types:
–Properties of static networks:
 Small-world effect
 Transitivity or clustering
 Degree distributions (scale-free networks)
 Network resilience
 Community structure
 Subgraphs or motifs
–Temporal properties:
 Densification
 Shrinking diameter

Small-world effect (1)
Six degrees of separation (Milgram, 1960s):
–Random people in Nebraska were asked to send letters to stockbrokers in Boston
–Letters could only be passed to first-name acquaintances
–Only 25% of the letters reached the goal
–But they reached it in about 6 steps
Measuring path lengths:
–Diameter (longest shortest path): max d_ij
–Effective diameter: distance at which 90% of all connected pairs of nodes can be reached
–Mean geodesic (shortest-path) distance l
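
The three path-length measures above can be sketched with breadth-first search. This is a minimal illustration, not code from the talk: the adjacency-dict representation and the function names `bfs_distances` and `path_length_stats` are assumptions, and the all-pairs BFS is only practical for small graphs.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_stats(adj, quantile=0.9):
    """Return (diameter, effective diameter, mean distance).

    All three are computed over ordered pairs of distinct, connected nodes.
    """
    lengths = []
    for s in adj:
        dist = bfs_distances(adj, s)
        lengths.extend(d for t, d in dist.items() if t != s)
    lengths.sort()
    diameter = lengths[-1]
    # Smallest distance within which at least `quantile` of the pairs lie.
    effective = lengths[math.ceil(quantile * len(lengths)) - 1]
    mean = sum(lengths) / len(lengths)
    return diameter, effective, mean


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3 (hypothetical toy input).
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(path_length_stats(path))
```

On graphs the size of the Messenger network one would sample source nodes rather than run BFS from every node.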

Small-world effect (2)
Distribution of shortest path lengths in the Microsoft Messenger network:
–180 million people
–1.3 billion edges
–Edge if two people exchanged at least one message in a one-month period
Pick a random node and count how many nodes are at distance 1, 2, 3, ... hops
[Figure: number of nodes vs. distance (hops)]

Small-world effect (3)
Fact:
–If the number of vertices within distance r grows exponentially with r, then the mean shortest path length l increases as log n
Implications:
–Information (viruses) spreads quickly
–Erdos numbers are small
–Peer-to-peer networks (for navigation purposes)
Short paths exist
Humans are able to find the paths:
–People only know their friends
–People do not have global knowledge of the network
This suggests something special about the structure of the network
–In a random graph short paths exist, but no one would be able to find them

Transitivity or Clustering
“A friend of a friend is a friend”: if a connects to b, and b to c, then with high probability a connects to c.
Clustering coefficient C:
C = 3 × (number of triangles) / (number of connected triples)
Alternative (per-vertex) definition:
C_i = (number of triangles connected to vertex i) / (number of triples centered on vertex i)
–Example clustering coefficients: C_i = 1, 1, 1/6, 0, 0
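
Both definitions can be checked on a small example. The sketch below assumes a 5-node graph (a triangle 1-2-3 with two extra leaves attached to node 3) chosen because it reproduces the per-vertex values 1, 1, 1/6, 0, 0 quoted above; the graph itself and the function names are assumptions, not taken from the slide.

```python
from fractions import Fraction
from itertools import combinations


def closed_pairs(adj, i):
    """Number of neighbor pairs of i that are themselves connected."""
    return sum(1 for u, v in combinations(sorted(adj[i]), 2) if v in adj[u])


def local_clustering(adj, i):
    """C_i = triangles through i / triples centered on i (0 if degree < 2)."""
    k = len(adj[i])
    if k < 2:
        return Fraction(0)
    return Fraction(closed_pairs(adj, i), k * (k - 1) // 2)


def global_clustering(adj):
    """C = 3 * triangles / connected triples."""
    # Summing closed pairs over all centers counts every triangle 3 times.
    closed = sum(closed_pairs(adj, i) for i in adj)
    triples = sum(len(adj[i]) * (len(adj[i]) - 1) // 2 for i in adj)
    return Fraction(closed, triples)


# Hypothetical example graph: triangle 1-2-3 plus leaves 4 and 5 on node 3.
EXAMPLE = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3}, 5: {3}}

if __name__ == "__main__":
    print([local_clustering(EXAMPLE, i) for i in sorted(EXAMPLE)])
    print(global_clustering(EXAMPLE))
```

The two definitions generally disagree: on this graph the global value is 3/8 while the mean of the local values is 13/30.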

Transitivity or Clustering (2)
Clustering coefficient scales as [formula not preserved]
It is considerably higher than in a random graph
It is speculated that in real networks C = O(1) as n→∞
In an Erdos-Renyi random graph C = O(n^{-1})
[Figures: Synonyms network, World Wide Web]

Degree distributions (1)
Let p_k denote the fraction of nodes with degree k
We can plot a histogram of p_k vs. k
In an Erdos-Renyi random graph the degree distribution follows a Poisson distribution
Degrees in real networks are heavily skewed to the right
The distribution has a long tail of values that are far above the mean
Heavy (long) tail:
–Amazon sales
–word length distribution, …

Detour: how long is the long tail?
This is not directly related to graphs, but it nicely explains the “long tail” effect. It shows that there is a big market for niche products.

Degree distributions (2)
Many real-world networks contain hubs: highly connected nodes
We can easily distinguish between an exponential and a power-law tail by plotting on log-lin and log-log axes
In scale-free networks the maximum degree scales as n^{1/(α-1)}
[Figure: degree distribution p_k vs. k in a blog network, on lin-lin, log-lin and log-log axes]

Poisson vs. Scale-free network
Poisson network (Erdos-Renyi random graph): degree distribution is Poisson
Scale-free (power-law) network: degree distribution is a power law
A function is scale-free if f(ax) = b f(x)
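
As a short worked check of the scale-free condition, a power law satisfies f(ax) = b f(x) with a factor b that depends only on a. The constant c and the exponent symbol α are notation assumed here for illustration.

```latex
f(x) = c\,x^{-\alpha}
\;\Longrightarrow\;
f(ax) = c\,(ax)^{-\alpha} = a^{-\alpha}\,c\,x^{-\alpha} = b\,f(x),
\qquad b = a^{-\alpha}.
```

An exponential f(x) = c e^{-x} fails the same test, since f(ax)/f(x) = e^{-(a-1)x} still depends on x.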

Degree distribution: number of people a person talks to on Microsoft Messenger
[Figure: count vs. node degree; X marks the highest degree]

Network resilience (1)
We observe how the connectivity (length of the paths) of the network changes as vertices get removed
Vertices can be removed:
–Uniformly at random
–In order of decreasing degree
This is important for epidemiology
–Removal of vertices corresponds to vaccination

Network resilience (2)
Real-world networks are resilient to random attacks
–One has to remove all web pages of degree > 5 to disconnect the web
–But this is a very small percentage of web pages
A random network has better resilience to targeted attacks
[Figures: mean path length vs. fraction of removed nodes, for a random network and for the Internet (autonomous systems), under random removal and preferential removal]
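
A removal experiment of this kind can be sketched as follows. Note the swap: the slides track mean path length, while this sketch tracks the size of the largest connected component, which is simpler to compute and is a common alternative measure. The function names and the star-graph example are assumptions.

```python
import random


def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def removal_curve(adj, order, counts):
    """Largest component size after removing the first m nodes of `order`."""
    return [largest_component_size(adj, order[:m]) for m in counts]


if __name__ == "__main__":
    # Hypothetical hub-and-spoke graph: node 0 is connected to 9 leaves.
    star = {0: set(range(1, 10))}
    star.update({i: {0} for i in range(1, 10)})
    targeted = sorted(star, key=lambda u: len(star[u]), reverse=True)
    shuffled = list(star)
    random.Random(0).shuffle(shuffled)
    print("targeted:", removal_curve(star, targeted, [0, 1, 2]))
    print("random:  ", removal_curve(star, shuffled, [0, 1, 2]))
```

On a hub-dominated graph, removing the single highest-degree node is enough to shatter it, while a random removal most likely hits a leaf.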

Community structure
Most social networks show community structure
–groups have a higher density of edges within than across groups
–People naturally divide into groups based on interests, age, occupation, …
How to find communities:
–Spectral clustering (embedding into a low-dimensional space)
–Hierarchical clustering based on connection strength
–Combinatorial algorithms
–Block models
–Diffusion methods
[Figure: friendship network of children in a school]

MSN Messenger
Graphs have a “giant component”
The distribution of connected component sizes follows a power law
[Figure: distribution of connected components in the MSN Messenger network, count vs. size (number of nodes); X marks the largest component]
[Figure: growth of the largest component over time in a citation network]

Network motifs (1)
What are the building blocks (motifs) of networks?
Do motifs have specific roles in networks?
Network motif detection process:
–Count how many times each subgraph appears
–Compute the statistical significance of each subgraph: the probability of it appearing in a random network as often as in the real network
[Figure: 3-node motifs]

Network motifs (2)
Biological networks
–Feed-forward loop
–Bi-fan motif
Web graph:
–Feedback with two mutual dyads
–Mutual dyad
–Fully connected triad

Jure Leskovec Network motifs (3) Transcription networks Signal transduction networks WWW and friendship networks Word adjacency networks

Networks over time: Densification
A very basic question: what is the relation between the number of nodes and the number of edges in a network?
Networks are becoming denser over time
The number of edges grows faster than the number of nodes – average degree is increasing: E(t) ∝ N(t)^a
a … densification exponent, 1 ≤ a ≤ 2:
–a=1: linear growth – constant out-degree (assumed in the literature so far)
–a=2: quadratic growth – clique
[Figures: E(t) vs. N(t) for Internet (a=1.2) and Citations (a=1.7)]
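
If edges grow as a power of nodes, E(t) ∝ N(t)^a, the densification exponent a is the slope of log E(t) against log N(t). A minimal least-squares sketch, assuming snapshot counts are available as two lists (the function name and the synthetic data are hypothetical):

```python
import math


def densification_exponent(node_counts, edge_counts):
    """Least-squares slope of log E(t) against log N(t)."""
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(e) for e in edge_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic snapshots that follow E = N^1.5 exactly (not real data).
    nodes = [100, 200, 400, 800]
    edges = [n ** 1.5 for n in nodes]
    print(densification_exponent(nodes, edges))
```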

Densification & degree distribution
How does densification affect the degree distribution?
Power-law degree distribution: p_k ∝ k^{-γ}; for a power law y = b x^{-γ} with γ<2, E[y] = ∞
Given densification exponent a, the degree exponent γ behaves as follows:
–(a) For γ constant over time, we obtain densification only for 1<γ<2, and then γ=2/a
–(b) For γ>2 the degree distribution has to evolve over time, γ = γ(t)
[Figures: degree exponent γ(t) over time; (a) a=1.1, (b) a=1.6]

Shrinking diameters
Intuition says that distances between nodes slowly grow as the network grows (like log n)
But in fact, as the network grows, the distances between nodes slowly decrease
[Figures: Internet, Citations]

Models of network generation and evolution

Recap (1)
Last time we saw:
–Large networks (web, on-line social networks) are here
–Many traditional questions are not useful anymore
–We cannot plot the network, so we need statistical methods and tools to quantify large networks
–3 parts/goals:
 Statistical properties of large networks
 Models that help understand these properties
 Predict behavior of networked systems based on measured structural properties and local rules governing individual nodes

Recap (2)
We also saw features that are common to networks of various types:
Properties of static networks:
–Small-world effect
–Transitivity or clustering
–Degree distributions (scale-free networks)
–Network resilience
–Community structure
–Subgraphs or motifs
Temporal properties:
–Densification
–Shrinking diameter

Outline for today
We will see generative models for modeling networks’ features:
–Erdos-Renyi random graph
–Exponential random graphs (p*) model
–Small-world model
–Preferential attachment
–Community guided attachment
–Forest fire model
Fitting models to real data
–How to generate a synthetic, realistic-looking network?

(Erdos-Renyi) Random graphs
Also known as Poisson random graphs or Bernoulli graphs
–Given n vertices, connect each pair i.i.d. with probability p
Two variants:
–G_{n,p}: a graph with m edges appears with probability p^m (1-p)^{M-m}, where M = n(n-1)/2 is the maximum number of edges
–G_{n,m}: graphs with n nodes and m edges
Very rich mathematical theory: many properties are exactly solvable
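
The G(n,p) variant translates almost directly into code. The sketch below is an illustration under assumed names (`gnp`, `graph_probability`); it flips one independent coin per vertex pair and also evaluates the probability p^m (1-p)^{M-m} of one specific graph with m edges.

```python
import random
from itertools import combinations


def gnp(n, p, rng):
    """Sample an undirected G(n,p) graph as an adjacency dict of sets."""
    adj = {i: set() for i in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:  # independent coin flip for every pair
            adj[u].add(v)
            adj[v].add(u)
    return adj


def graph_probability(m, n, p):
    """Probability that G(n,p) produces one particular graph with m edges."""
    max_edges = n * (n - 1) // 2
    return p ** m * (1 - p) ** (max_edges - m)


if __name__ == "__main__":
    g = gnp(1000, 0.01, random.Random(0))
    mean_degree = sum(len(nbrs) for nbrs in g.values()) / len(g)
    print("mean degree:", mean_degree, "expected about", 0.01 * 999)
```

The pairwise loop costs O(n^2), which is fine for illustration but too slow for the network sizes discussed earlier.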

Properties of random graphs
Degree distribution is Poisson, since the presence and absence of edges are independent
Giant component, in terms of the average degree k=2m/n:
–k=1-ε: all components are of size of order log n
–k=1+ε: there is 1 component of size of order n
 All others are of size of order log n
 They are a tree plus an edge, i.e., cycles
Diameter: log n / log k
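
The giant-component transition around average degree k=1 is easy to observe numerically. The sketch below is an approximation of G(n,m), not the exact model: it draws m = kn/2 endpoints pairs uniformly at random (so occasional self-loops and repeated edges are possible) and tracks components with a union-find structure. The function name and parameters are assumptions.

```python
import random


def largest_component_fraction(n, avg_degree, seed=0):
    """Fraction of nodes in the largest component after m = k*n/2 random edges."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _ in range(int(avg_degree * n / 2)):
        ru, rv = find(rng.randrange(n)), find(rng.randrange(n))
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru  # attach the smaller tree under the larger one
            size[ru] += size[rv]
    return max(size[find(x)] for x in range(n)) / n


if __name__ == "__main__":
    for k in (0.5, 0.9, 1.1, 2.0):
        print(k, largest_component_fraction(5000, k))
```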

Jure Leskovec Evolution of a random graph for non-GCC vertices k

Subgraphs in random graphs
Expected number of subgraphs H(v,e) (v vertices, e edges) in G_{n,p} is
$E[X] = \binom{n}{v} \frac{v!}{a} p^{e} \approx \frac{n^{v} p^{e}}{a}$
where a is the number of isomorphic graphs
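
The expected count is simple to evaluate once the symmetry factor is known. The sketch below assumes the standard form E[X] = C(n,v) (v!/a) p^e, with a read as the number of automorphisms of the subgraph (6 for a triangle, 2 for a two-edge path); the function name is hypothetical.

```python
from math import comb, factorial


def expected_subgraph_count(n, p, v, e, automorphisms):
    """Expected number of copies of a subgraph with v vertices and e edges in G(n,p)."""
    # comb(n, v) vertex sets, v!/a distinct placements on each set,
    # and each placement is present with probability p^e.
    placements = comb(n, v) * factorial(v) // automorphisms
    return placements * p ** e


if __name__ == "__main__":
    print("triangles:  ", expected_subgraph_count(10, 0.5, 3, 3, 6))
    print("2-edge paths:", expected_subgraph_count(10, 0.5, 3, 2, 2))
```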

Random graphs: conclusion
Pros:
–Simple and tractable model
–Phase transitions
–Giant component
Cons:
–Degree distribution
–No community structure
–No degree correlations
Extensions:
–Configuration model
 Random graphs with arbitrary degree sequence
 Excess degree: the degree of a vertex at the end of a random edge is distributed as q_k ∝ k p_k
[Figure: configuration model]

Exponential random graphs (p* models)
Comes from the social sciences
Let ε_i be a set of measurable properties of a graph (number of edges, number of nodes of a given degree, number of triangles, …)
The exponential random graph model defines a probability distribution over graphs in terms of these properties
[Figure: examples of ε_i]

Exponential random graphs
Includes Erdos-Renyi as a special case
Assume parameters β_i are specified
–No analytical solutions for the model
–But can use simulation to sample the graphs:
Define local moves on a graph:
–Addition/removal of edges
–Movement of edges
–Edge swaps
Parameter estimation:
–maximum likelihood
Problem:
–Can’t solve for transitivity (produces cliques)
–Used to analyze small networks
[Table: example of parameter estimates]

Small-world model
Used for modeling network transitivity
Many networks assume some kind of geographical proximity
Small-world model:
–Start with a low-dimensional regular lattice
–Rewire: add/remove edges to create shortcuts that join remote parts of the lattice
 For each edge, with prob. p move the other end to a random vertex
Rewiring makes it possible to interpolate between a regular lattice and a random graph
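
The rewiring procedure can be sketched on a one-dimensional ring lattice. This is one possible reading of the model, with assumed names and conventions: L nodes, each linked to its k nearest neighbours on each side, and the far end of every lattice edge moved with probability p to a uniformly chosen vertex that is not already a neighbour.

```python
import random


def small_world(L, k, p, rng):
    """Ring lattice (degree 2k) with each lattice edge rewired with probability p."""
    adj = {i: set() for i in range(L)}
    for i in range(L):
        for j in range(1, k + 1):
            adj[i].add((i + j) % L)
            adj[(i + j) % L].add(i)
    for i in range(L):
        for j in range(1, k + 1):
            old = (i + j) % L
            if old in adj[i] and rng.random() < p:
                # Allowed new endpoints: no self-loops, no duplicate edges.
                candidates = [w for w in range(L) if w != i and w not in adj[i]]
                if candidates:
                    new = rng.choice(candidates)
                    adj[i].discard(old)
                    adj[old].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj


if __name__ == "__main__":
    g = small_world(20, 2, 0.1, random.Random(0))
    print(sorted(len(nbrs) for nbrs in g.values()))
```

Rewiring keeps the number of edges fixed at L·k, so p=0 returns the lattice and p=1 gives a graph close to a random graph with the same density.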

Small-world model
Regular lattice (p=0):
–Clustering coefficient C=(3k-3)/(4k-2), which tends to 3/4 for large k
–Mean distance L/4k
Almost random graph (p=1):
–Clustering coefficient C=2k/L
–Mean distance log L / log k
No power-law degree distribution
[Figures: effect of rewiring probability p; degree distribution]

Models of evolution
Models of network evolution:
–Preferential attachment
–Edge copying model
–Community Guided Attachment
–Forest Fire model
Models for realistic network generation:
–Kronecker graphs

Preferential attachment
Models the growth of the network
Preferential attachment (Price 1965, Albert & Barabasi 1999):
–Add a new node, create m out-links
–Probability of linking to a node i is proportional to its degree k_i
Based on Herbert Simon’s result
–Power laws arise from “rich get richer” (cumulative advantage)
Examples (Price 1965 for modeling citations):
–Citations: the rate of new citations to a paper is proportional to the number it already has

Preferential attachment
Leads to power-law degree distributions
But:
–all nodes have equal (constant) out-degree
–one needs complete knowledge of the network
There are many generalizations and variants, but preferential selection is the key ingredient that leads to power laws
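
A growth process of this kind can be sketched with the usual "endpoint list" trick: if every node appears in a list once per unit of degree, a uniform draw from the list is a degree-proportional draw. The seed graph (a clique on m+1 nodes), the use of total degree, and the function name are assumptions of this sketch rather than details from the slides.

```python
import random
from collections import Counter


def preferential_attachment(n, m, rng):
    """Grow a graph to n nodes; each new node adds m links chosen by degree."""
    # Seed: complete graph on m + 1 nodes, so every node starts with degree m.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    endpoints = [x for edge in edges for x in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # Uniform over endpoints == proportional to current degree.
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


if __name__ == "__main__":
    edges = preferential_attachment(2000, 2, random.Random(0))
    degree = Counter(x for edge in edges for x in edge)
    print("max degree:", max(degree.values()), "min degree:", min(degree.values()))
```

The sketch also makes the two drawbacks above visible: every new node has exactly m out-links, and each draw needs the full endpoint list, i.e. global knowledge of the network.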

Edge copying model
Copying model:
–Add a node and choose k, the number of edges to add
–With prob. β select k random vertices and link to them
–With prob. 1-β copy the edges from a randomly chosen node
Generates power-law degree distributions with exponent 1/(1-β)
Generates communities
Related: Random-surfer model
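
The copying steps above can be sketched as follows. The sketch follows the slide literally (one coin per new node, fixed k); published variants of the copying model make the choice per edge instead. The seed graph of k+1 mutually linked nodes and the function name are assumptions.

```python
import random


def copying_model(n, k, beta, rng):
    """Directed graph as {node: list of k out-link targets}."""
    # Seed: k + 1 nodes, each linking to the k others.
    out = {i: [j for j in range(k + 1) if j != i] for i in range(k + 1)}
    for v in range(k + 1, n):
        if rng.random() < beta:
            # Uniform choice of k distinct existing vertices.
            out[v] = rng.sample(range(v), k)
        else:
            # Copy all k out-links of a randomly chosen prototype node.
            prototype = rng.randrange(v)
            out[v] = list(out[prototype])
    return out


if __name__ == "__main__":
    g = copying_model(2000, 3, 0.3, random.Random(0))
    indegree = {}
    for targets in g.values():
        for t in targets:
            indegree[t] = indegree.get(t, 0) + 1
    print("largest in-degrees:", sorted(indegree.values())[-5:])
```

Copying is what produces the rich-get-richer effect here: a node with many in-links is more likely to appear in the out-list of a randomly chosen prototype.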

Community guided attachment
Want to model/explain densification in networks
Assume community structure
One expects many within-group friendships and fewer cross-group ones
[Figure: self-similar university community structure, with levels University; Science, Arts; CS, Math, Drama, Music]

Community guided attachment
Assuming a cross-community linking probability controlled by a difficulty constant c, Community Guided Attachment leads to a Densification Power Law with exponent a, where:
–a … densification exponent
–b … community tree branching factor
–c … difficulty constant, 1 ≤ c ≤ b
If c = 1: easy to cross communities
–Then: a=2, quadratic growth of edges – near clique
If c = b: hard to cross communities
–Then: a=1, linear growth of edges – constant out-degree
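
The explicit formulas are not present in this transcript. One form that is consistent with both limiting cases above, and which (as far as I recall) matches the densification paper by Leskovec, Kleinberg and Faloutsos listed in the references, is sketched below. Treat it as an assumption: h is taken here to mean the height of the smallest community-tree subtree containing both nodes.

```latex
% Assumed linking probability between two nodes at tree distance h,
% and the resulting densification exponent:
f(h) = c^{-h}, \qquad a = 2 - \log_b c .

% Checks against the two cases stated on the slide:
c = 1 \;\Rightarrow\; a = 2 - \log_b 1 = 2, \qquad
c = b \;\Rightarrow\; a = 2 - \log_b b = 1 .
```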

Forest Fire Model
Want to model graphs that densify and have shrinking diameters
Intuition:
–How do we meet friends at a party?
–How do we identify references when writing papers?

Forest Fire Model for directed graphs
The model has 2 parameters:
–p … forward burning probability
–r … backward burning probability
The model:
–Each turn a new node v arrives
–It chooses an “ambassador” w uniformly at random
–Flip two geometric coins to determine the number of in- and out-links of w to follow (burn)
–Fire spreads recursively until it dies
–Node v links to all burned nodes
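
The steps above can be sketched as a short simulation. Several details are assumptions of this sketch: the geometric counts are drawn as "number of successes before the first failure", p and r are used directly as the two success probabilities (the published model, as I recall, uses r as a ratio applied to p), and a node is never burned twice.

```python
import random


def geometric(q, rng):
    """Number of successes before the first failure; mean q / (1 - q)."""
    count = 0
    while rng.random() < q:
        count += 1
    return count


def forest_fire(n, p, r, rng):
    """Return {node: set of nodes it links to} for a graph grown to n nodes."""
    assert 0 <= p < 1 and 0 <= r < 1
    out_links = {0: set()}
    in_links = {0: set()}
    for v in range(1, n):
        ambassador = rng.randrange(v)  # uniformly chosen existing node
        burned = {ambassador}
        frontier = [ambassador]
        while frontier:
            w = frontier.pop()
            forward = [u for u in out_links[w] if u not in burned]
            backward = [u for u in in_links[w] if u not in burned]
            rng.shuffle(forward)
            rng.shuffle(backward)
            chosen = forward[:geometric(p, rng)] + backward[:geometric(r, rng)]
            for u in chosen:
                if u not in burned:
                    burned.add(u)
                    frontier.append(u)
        out_links[v] = set(burned)  # v links to every burned node
        in_links[v] = set()
        for u in burned:
            in_links[u].add(v)
    return out_links


if __name__ == "__main__":
    g = forest_fire(2000, 0.35, 0.2, random.Random(0))
    print("edges per node:", sum(len(t) for t in g.values()) / len(g))
```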

Forest Fire Model
Simulation experiments
Forest Fire generates graphs that densify and have shrinking diameter
[Figures: densification, E(t) vs. N(t), with exponent 1.32; diameter vs. N(t)]

Forest Fire Model
Forest Fire also generates graphs with heavy-tailed degree distributions
[Figures: count vs. in-degree; count vs. out-degree]

Forest Fire: Parameter Space
Fix the backward probability r and vary the forward burning probability p
We observe a sharp transition between sparse and clique-like graphs
The sweet spot is very narrow
[Figure labels: sparse graph, clique-like graph, increasing diameter, decreasing diameter, constant diameter]

Kronecker graphs
Want to have a model that can generate a realistic graph:
–Static patterns
 Power-law degree distribution
 Small diameter
 Power-law eigenvalue and eigenvector distribution
–Temporal patterns
 Densification Power Law
 Shrinking/constant diameter
For Kronecker graphs all these properties can actually be proven

Kronecker Product – a Graph
[Figure: adjacency matrix, intermediate stage, adjacency matrix]

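Kronecker Product – Definition
The Kronecker product of an N×M matrix A and a K×L matrix B is the N·K × M·L matrix
$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1M}B \\ \vdots & \ddots & \vdots \\ a_{N1}B & \cdots & a_{NM}B \end{pmatrix}$
We define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices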

Stochastic Kronecker Graphs
Create an N_1 × N_1 probability matrix P_1
Compute the k-th Kronecker power P_k
For each entry p_uv of P_k include an edge (u,v) with probability p_uv
P_1 =
0.5 0.2
0.1 0.3
P_2 (by Kronecker multiplication) =
0.25 0.10 0.10 0.04
0.05 0.15 0.02 0.06
0.05 0.02 0.15 0.06
0.01 0.03 0.03 0.09
Flipping a biased coin for each entry of P_2 gives the instance matrix G_2
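
The three steps can be reproduced in a few lines of plain Python. The sketch below uses the 2×2 initiator 0.5, 0.2, 0.1, 0.3 from the slide; the function names are assumptions, and nested lists stand in for a matrix library to keep the example self-contained.

```python
import random


def kronecker(A, B):
    """Kronecker product of two matrices given as nested lists."""
    # Row (i, k) of the result is row i of A with every entry scaled by row k of B.
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def kronecker_power(P, k):
    """k-th Kronecker power of P (k >= 1)."""
    result = P
    for _ in range(k - 1):
        result = kronecker(result, P)
    return result


def sample_graph(P, rng):
    """Flip one biased coin per entry: edge (u, v) appears with probability P[u][v]."""
    return [[1 if rng.random() < p_uv else 0 for p_uv in row] for row in P]


P1 = [[0.5, 0.2], [0.1, 0.3]]

if __name__ == "__main__":
    P2 = kronecker_power(P1, 2)
    for row in P2:
        print([round(x, 2) for x in row])
    print(sample_graph(P2, random.Random(0)))
```

The dense matrix P_k has N_1^k × N_1^k entries, so this direct approach is only practical for small k.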

Fitting Kronecker to Real Data
Given a graph G and a Kronecker matrix P_1 we can calculate the probability P(G|P_1) that P_1 generated G, where σ is the node labeling
P_1 =
0.5 0.2
0.1 0.3
P_k =
0.25 0.10 0.10 0.04
0.05 0.15 0.02 0.06
0.05 0.02 0.15 0.06
0.01 0.03 0.03 0.09
G =
1 1 0 0
1 1 1 0
0 1 1 1
0 0 1 1
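
For a fixed node labeling σ, the probability of G is a product over all cells of the adjacency matrix: the entry of P_k for every edge, and one minus the entry for every non-edge. The sketch below is a naive evaluation of that product in log space; the function name and the list-based representation are assumptions.

```python
import math


def log_likelihood(G, P, sigma):
    """log P(G | P, sigma) for adjacency matrix G and probability matrix P.

    sigma[u] is the row/column of P assigned to node u of G.
    """
    n = len(G)
    total = 0.0
    for u in range(n):
        for v in range(n):
            p_uv = P[sigma[u]][sigma[v]]
            # Edge present: factor p_uv; edge absent: factor (1 - p_uv).
            total += math.log(p_uv) if G[u][v] else math.log(1.0 - p_uv)
    return total


if __name__ == "__main__":
    P = [[0.5, 0.2], [0.1, 0.3]]
    G = [[1, 1], [0, 0]]
    print(math.exp(log_likelihood(G, P, [0, 1])))  # labeling as given
    print(math.exp(log_likelihood(G, P, [1, 0])))  # nodes swapped
```

The two printed values differ, which illustrates why the choice of labeling matters; the double loop also makes the quadratic cost per evaluation explicit.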

Fitting Kronecker: 2 challenges
Invariance to node labeling σ (there are N! labelings)
Calculating P(G|P_1) takes O(N^2) (since one needs to consider every cell of the adjacency matrix)
[Figure: matrices P_1, P_k and adjacency matrix G; P(G|P_1) compared for node labelings 1 2 3 4 and 2 1 4 3]

Fitting Kronecker: Solutions
Node labeling: can use MCMC sampling to average over (all) node labelings
P(G|P_1) takes O(N^2): real graphs are sparse, so calculate P(G_empty) and then “add” the edges. This takes O(E).
[Figure: probability matrix P, adjacency matrix G, and node labeling σ]

Experiments on real AS graph
[Figure panels: degree distribution, hop plot, network value, adjacency matrix eigenvalues]

Why fit generative models?
Parameters tell us about the structure of a graph
Extrapolation: given a graph today, how will it look in a year?
Sampling: can I get a smaller graph with similar properties?
Anonymization: instead of releasing the real graph (e.g., an email network), we can release a synthetic version of it

Processes taking place on networks

Epidemiological processes
The simplest way to spread a virus over the network; each node is in one of the states:
–S: Susceptible
–I: Infected
–R: Recovered (removed)
SIS model: 2 parameters
–β … virus birth rate
–δ … virus death (recovery) rate
SIR model: once a person is cured, he or she cannot get infected again
[Figure: SIS model; ζ_it depends on β and topology]

Epidemic threshold for SIS model
How infectious does the virus need to be to survive in the network?
First results on power-law networks suggested that any virus will prevail
New result that works for any topology, in terms of a score s:
–For s>1 the virus prevails
–For s<1 the virus dies
λ_1 … largest eigenvalue of the graph adjacency matrix
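
The formula for s is not present in this transcript. The sketch below assumes the commonly cited form s = λ_1 · β / δ, i.e. an epidemic threshold of 1/λ_1 on β/δ; that form, the function names, and the example graphs are assumptions. λ_1 is estimated with power iteration on A + I (the shift avoids oscillation on bipartite graphs); the graph is assumed connected.

```python
def largest_eigenvalue(adj, iterations=1000):
    """Estimate the largest adjacency eigenvalue of a connected undirected graph."""
    x = {u: 1.0 for u in adj}
    estimate = 1.0
    for _ in range(iterations):
        # Multiply by (A + I): own value plus the sum over neighbors.
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in adj}
        estimate = max(y.values())
        x = {u: val / estimate for u, val in y.items()}
    return estimate - 1.0  # undo the +I shift


def survival_score(adj, beta, delta):
    """Assumed form of the score: s = lambda_1 * beta / delta."""
    return largest_eigenvalue(adj) * beta / delta


if __name__ == "__main__":
    # Complete graph on 4 nodes: lambda_1 = 3.
    k4 = {i: [j for j in range(4) if j != i] for i in range(4)}
    for beta in (0.1, 0.2):
        s = survival_score(k4, beta, delta=0.5)
        print(beta, s, "prevails" if s > 1 else "dies out")
```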

Navigation in small-world networks
Milgram’s experiment showed:
–(a) short paths exist in networks
–(b) humans are able to find them
Assume the following setting:
–Nodes of a graph are scattered on a plane
–Given a starting node u, we want to reach a target node v
–A small-world navigation algorithm navigates the network by always moving to the neighbor that is closest (in Manhattan distance) to the target node v

Navigation in small-world networks
Start with a regular lattice:
–Each node connects to its 4 immediate neighbors
–Long-range links are added with a probability that decays with the distance between the points (p(u,v) ~ d^{-α})
It can be shown that only for α=2 is the delivery time poly-log in the number of nodes n
[Figures: network creation; delivery time T < n^β]
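
The lattice construction and the greedy rule can be combined in a small simulation. This sketch assumes an n×n grid, exactly one long-range link per node drawn with probability proportional to d^{-α} in Manhattan distance, and one-directional use of that link; the function names are hypothetical.

```python
import random


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def build_long_links(n, alpha, rng):
    """One long-range contact per node, chosen with probability ~ d^(-alpha)."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    links = {}
    for u in nodes:
        others = [v for v in nodes if v != u]
        weights = [manhattan(u, v) ** -alpha for v in others]
        links[u] = rng.choices(others, weights=weights, k=1)[0]
    return links


def greedy_route(n, links, source, target):
    """Number of hops when always moving to the contact closest to the target."""
    current, steps = source, 0
    while current != target:
        x, y = current
        contacts = [
            (x + dx, y + dy)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < n and 0 <= y + dy < n
        ]
        contacts.append(links[current])
        # A lattice neighbor always gets one step closer, so this terminates.
        current = min(contacts, key=lambda v: manhattan(v, target))
        steps += 1
    return steps


if __name__ == "__main__":
    n = 20
    for alpha in (0.0, 2.0, 4.0):
        links = build_long_links(n, alpha, random.Random(0))
        print(alpha, greedy_route(n, links, (0, 0), (n - 1, n - 1)))
```

On a grid this small the effect of α is weak; the poly-log behaviour at α=2 only becomes visible for much larger n.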

Navigation in a real-world network
Take a social network of 500k bloggers where for each blogger we know their geographical location
Pick two nodes at random and navigate the network greedily by geography
Results:
–13% success rate (vs. 18% for Milgram)
[Figures: distribution of path lengths; friendships vs. distance]

Navigation in real-world network
Geographical distance may not be the right kind of distance
Since the population is non-uniform, let’s use a rank-based friendship distance: we measure the distance d(u,v) by the number of people living closer to v than u does
Then [formula not preserved]
And the proof still works

Some references used to prepare this talk:
–The Structure and Function of Complex Networks, by Mark Newman
–Statistical mechanics of complex networks, by Reka Albert and Albert-Laszlo Barabasi
–Graph Mining: Laws, Generators and Algorithms, by Deepayan Chakrabarti and Christos Faloutsos
–An Introduction to Exponential Random Graph (p*) Models for Social Networks, by Garry Robins, Pip Pattison, Yuval Kalish and Dean Lusher
–Graph Evolution: Densification and Shrinking Diameters, by Jure Leskovec, Jon Kleinberg and Christos Faloutsos
–Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication, by Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg and Christos Faloutsos
–Navigation in a Small World, by Jon Kleinberg
–Geographic routing in social networks, by David Liben-Nowell, Jasmine Novak, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins
–Some plots and slides borrowed from Lada Adamic, Mark Newman, Mark Joseph, Albert Barabasi, Jon Kleinberg, David Liben-Nowell, Sergi Valverde and Ricard Sole

Jure Leskovec Rough random material that did not make it into the presentation

Bow-tie structure of the web (Broder et al., WWW 2000; Dill et al., VLDB 2001)
[Figure: SCC 56M, IN 44M, OUT 44M, TENDRILS 44M, DISC 17M]

Study of 3 websites: a study of three universities’ publicly indexable Web sites

Jure Leskovec Australia In- and out-degree distributions

Jure Leskovec In- and out-degree distributions New Zealand

Jure Leskovec United Kingdom In- and out-degree distributions

We assume this node would like to connect to a centrally located node, i.e., a node whose distance to other nodes is minimized.
d_ij is the Euclidean distance
h_j is some measure of the “centrality” of node j
α is a parameter (a function of the final number n of points) gauging the relative importance of the two objectives

Fabrikant et al. define 3 possible measures of “centrality”:
1. The average number of hops from other nodes
2. The maximum number of hops from another node
3. The number of hops from a fixed center of the tree

Jure Leskovec α is the crux of the theorem! Why? Here are some examples: Fabrikant&al

If α is too low, then the Euclidean distances become unimportant, and the network resembles a star (Fabrikant et al.)

But if α grows at least as fast as √n, where n is the final number of points, then distance becomes too important, and minimum spanning trees with high degree occur, but with exponentially vanishing probability, thus not a power law.
If α is anywhere in between, we have a power law.
Through a rather complex and elaborate proof, Fabrikant et al. prove that this initial assumption will produce a power-law distribution. I’ll save you the math!
