Chapter 2: Graphs
First develop some of the basic ideas behind graph theory, then look at some fundamental applications.
2.1 Basic Definitions
Graphs: Nodes and Edges
A graph is a way to specify relationships among a collection of items. A graph consists of a set of objects, called nodes, with certain pairs of these objects connected by links called edges.

E.g., the graph in Fig. 2.1(a) consists of 4 nodes, labeled A, B, C, D. B is connected to each of the other 3 nodes by edges, and C and D are connected by an edge too. Two nodes are neighbors if they’re connected by an edge. In Fig. 2.1(a), we think of the relationship between the 2 ends of an edge as being symmetric: the edge simply connects them to each other.

But often we want to express asymmetric relationships Define a directed graph to consist of a set of nodes together with a set of directed edges, each a link from one node to another  The direction being important—see Fig. 2.1(b) To emphasize that a graph isn’t directed, call it an undirected graph  But generally a graph is assumed undirected unless otherwise noted
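The two definitions can be sketched with plain adjacency sets. This is a minimal illustration in standard-library Python, not NetworkX code. The undirected edges are those described for Fig. 2.1(a); the directed edges are hypothetical, since the text doesn't list the edges of Fig. 2.1(b), and the helper names are made up for this sketch.

```python
# Undirected graph of Fig. 2.1(a): B is joined to A, C and D; C and D are joined.
undirected_edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]


def build_undirected(edges):
    """Map each node to the set of its neighbors (each edge is symmetric)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def build_directed(edges):
    """Map each node to the set of nodes it points to (direction matters)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set())  # make sure the head of the edge is a key too
    return adj


G = build_undirected(undirected_edges)
# Hypothetical directed edges, for illustration only (not those of Fig. 2.1(b)).
D = build_directed([("A", "B"), ("B", "C"), ("C", "D"), ("D", "B")])

print(sorted(G["B"]))  # the neighbors of B
print("B" in D["A"])   # is there a directed edge A -> B?
print("A" in D["B"])   # is there a directed edge B -> A?
```

In the undirected structure every edge is stored in both directions, so the neighbor relation is symmetric by construction; in the directed one an edge is stored only at its tail.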

Graphs as Models of Networks
Graphs serve as mathematical models of network structures. For a real example, Fig. 2.2 depicts the network structure of the Internet (then the “Arpanet”) in Dec. 1970, when it had only 13 sites. Nodes represent computing hosts, and there’s an edge joining 2 nodes if there’s a direct communication link between them. Ignoring the superimposed US map and the blow-up circles in MA and CA, the rest depicts this 13-node graph using the dots-and-lines style of Fig. 2.1.

For showing the pattern of connections, the actual placement of the nodes is immaterial; all that matters is which nodes link to which  Fig. 2.3 is a different drawing of the same 13-node Arpanet graph

Graphs are useful whenever we want to represent how things are either physically or logically linked to one another in a network structure The 13-node Arpanet is an example of a communication network In Chapter 1, we saw examples from 2 other broad classes of graph structures  Social networks Nodes are people or groups of people Edges represent some kind of social interaction  Information networks Nodes are info resources (e.g., Web pages or documents) Edges represent logical connections such as hyperlinks, citations, or cross-references Fig. 2.4: a few further examples

The depictions of airline and subway systems in (a) and (b) are examples of transportation networks  Nodes are destinations and edges represent direct connections

The prerequisites among college courses in (c) are an example of a dependency network: nodes are tasks and directed edges indicate that one task must be performed before another. The Tank Street Bridge in Brisbane, Australia in (d) is an example of a structural network: joints are nodes and physical linkages are edges.

2.2 Paths and Connectivity Paths A path is a sequence of nodes with the property that each consecutive pair in the sequence is connected by an edge E.g., the sequence of nodes MIT, BBN, RAND, UCLA is a path in the Internet graph from Figs. 2.2 and 2.3  Another path is the sequence CASE, LINCOLN, MIT, UTAH, SRI, UCSB

A path can repeat nodes, e.g., SRI, STAN, UCLA, SRI, UTAH, MIT But most paths we consider won’t do this  A path without repeat nodes is a simple path

Cycles
A particularly important kind of non-simple path is a cycle: a path with at least 3 edges, in which the 1st and last nodes are the same, but otherwise all nodes are distinct. There are many cycles in Fig. 2.3, e.g.,
SRI, STAN, UCLA, SRI (as short as possible: it has 3 edges)
SRI, STAN, UCLA, RAND, BBN, MIT, UTAH, SRI
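The definitions of path, simple path, and cycle can be checked mechanically. The sketch below is standard-library Python with made-up helper names. Its edge list contains only the Arpanet links that the example paths in the text rely on, so it is a partial stand-in for the 13-node graph of Fig. 2.3, not the full graph.

```python
# Only the links used by the example paths; Fig. 2.3 has more edges than this.
EDGES = [
    ("MIT", "BBN"), ("BBN", "RAND"), ("RAND", "UCLA"),
    ("CASE", "LINCOLN"), ("LINCOLN", "MIT"), ("MIT", "UTAH"),
    ("UTAH", "SRI"), ("SRI", "UCSB"),
    ("SRI", "STAN"), ("STAN", "UCLA"), ("UCLA", "SRI"),
]
EDGE_SET = {frozenset(edge) for edge in EDGES}  # undirected: order is irrelevant


def is_path(nodes):
    """Every consecutive pair of nodes must be joined by an edge."""
    return len(nodes) >= 1 and all(
        frozenset((u, v)) in EDGE_SET for u, v in zip(nodes, nodes[1:])
    )


def is_simple_path(nodes):
    """A path that repeats no node."""
    return is_path(nodes) and len(set(nodes)) == len(nodes)


def is_cycle(nodes):
    """A path with at least 3 edges whose first and last nodes coincide
    and whose other nodes are all distinct."""
    inner = nodes[:-1]
    return (
        is_path(nodes)
        and len(nodes) - 1 >= 3
        and nodes[0] == nodes[-1]
        and len(set(inner)) == len(inner)
    )


print(is_simple_path(["MIT", "BBN", "RAND", "UCLA"]))
print(is_simple_path(["SRI", "STAN", "UCLA", "SRI", "UTAH", "MIT"]))
print(is_cycle(["SRI", "STAN", "UCLA", "SRI"]))
```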

Every edge in the 1970 Arpanet belongs to a cycle, so if any single edge were to fail, there would still be a way to get from any node to any other node. More generally, cycles in communication and transportation networks allow for redundancy: they provide for alternate routings. In the social network of friendships, cycles are common.

NetworkX: Cycles
cycle_basis(G) returns a list of cycles that form a basis for the cycles of G. Each cycle is given as a list of nodes forming the cycle; cyclic permutations are not included.
>>> G = nx.barbell_graph(4,2)
>>> nx.cycle_basis(G)
[[1, 3, 0], [2, 3, 0], [8, 7, 6], [9, 7, 6], [8, 9, 6], [1, 2, 0]]
G must be a Graph; it can’t be a DiGraph. A basis for the cycles of a network is a minimal collection of cycles s.t. any cycle in the network can be written as a sum of cycles in the basis. Here summation of cycles is defined as the “exclusive or” of their edges.
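To make the “exclusive or” sum concrete, the sketch below adds two of the basis cycles listed above, [1, 3, 0] and [2, 3, 0]. It uses plain Python sets rather than NetworkX, and cycle_edges is a hypothetical helper written for this illustration.

```python
def cycle_edges(nodes):
    """Edge set of the cycle that visits `nodes` in order and then closes
    back to the first node (the node-list convention used by cycle_basis)."""
    closed = list(nodes) + [nodes[0]]
    return {frozenset(pair) for pair in zip(closed, closed[1:])}


c1 = cycle_edges([1, 3, 0])   # edges 1-3, 3-0, 0-1
c2 = cycle_edges([2, 3, 0])   # edges 2-3, 3-0, 0-2

# Symmetric difference = edge-wise "exclusive or": the shared edge 3-0 cancels.
total = c1 ^ c2

print(sorted(tuple(sorted(edge)) for edge in total))
print(total == cycle_edges([0, 1, 3, 2]))
```

The shared edge 3-0 appears in both cycles and drops out, leaving the 4-edge cycle 0, 1, 3, 2, which is therefore not needed in the basis.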

simple_cycles(DG) returns the simple cycles (elementary circuits) of a directed graph. DG must be a DiGraph. A simple cycle is a closed path where no node appears twice, except that the 1st and last are the same. Two elementary circuits are distinct if they aren’t cyclic permutations of each other.
>>> DG1 = nx.DiGraph([(0,1), (1,3), (3,0), (3,2), (2,0)])
>>> nx.simple_cycles(DG1)
[[0, 1, 3, 0], [0, 1, 3, 2, 0]]

Connectivity A graph is connected if there’s a path between every pair of nodes E.g., the 13-node Arpanet graph is connected We expect most communication and transportation networks to (try to) be connected  Their goal is to move traffic from one node to another But there’s no a priori reason to expect graphs in other settings to be connected  Figs. 2.5 and 2.6 show disconnected graphs

Fig. 2.6 is the collaboration graph of the biological research center Structural Genomics of Pathogenic Protozoa  Nodes represent researchers  There’s an edge between 2 nodes if the researchers co-authored a publication

Components In Fig. 2.5, the graph consists of 3 “pieces”:  one consisting of nodes A and B,  one consisting of nodes C, D, and E, and  one consisting of the rest of the nodes The network in Fig. 2.6 also consists of 3 pieces: one on 3 nodes, one on 4 nodes, and one that’s much larger A connected component (or just component) of a graph is a subset of the nodes s.t. (i) every node in the subset has a path to every other, and (ii) the subset isn't part of some larger set with the property that every node can reach every other
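Conditions (i) and (ii) suggest a direct way to compute components: start from any unvisited node, collect everything reachable from it, and repeat. The sketch below does this with breadth-first search in standard-library Python. The edge list is hypothetical (the text doesn't give the edges of Fig. 2.5); it only mimics the shape of that figure, with pieces of 2, 3, and 4 nodes.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph.

    `adj` maps every node to the set of its neighbors."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Everything reachable from `start` forms one maximal piece.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components


# Hypothetical edges, chosen only to give pieces of 2, 3 and 4 nodes.
edges = [("A", "B"), ("C", "D"), ("D", "E"),
         ("F", "G"), ("G", "H"), ("H", "I"), ("I", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

components = connected_components(adj)
print(sorted(len(c) for c in components))
```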

Dividing a graph into its components is just a first, global way of describing its structure Within a given component, there may be richer internal structure that’s important to our interpretation of the network E.g., in the largest component in Fig. 2.6, there’s a prominent node at the center, and tightly-knit groups linked to this node but not to each other  This component would break into 3 distinct components if this node were removed Analyzing a graph this way (its densely-connected regions and the boundaries between them) is a powerful way of thinking about network structure—cf. Chap. 3

Giant Components Consider the social network of the entire world, with a link between 2 people if they’re friends This global friendship network probably isn’t connected—consider, e.g., un-contacted tribes But the component you inhabit probably contains a significant fraction of the world’s population. This is true for a range of network datasets—large, complex networks often have a giant component  This is a deliberately informal term for a connected component containing a significant fraction of all the nodes

When a network contains a giant component, it almost always contains only one. E.g., if the global friendship network had 2 giant components, all it would take is a meeting between a representative of each to combine them. This in fact happened with the discovery of America, with dramatic consequences.

The notion of giant components is useful for reasoning about networks on much smaller scales as well See the collaboration network in Fig. 2.6 Another example is Fig. 2.7: the romantic relationships in an American high school over an 18-month period Not all edges were present at once The fact that this graph contains such a large component is significant regarding the spread of STDs The researchers noted that, “like social facts, [these structures] are invisible yet consequential macrostructures that arise as the product of individual agency.”

NetworkX: Subgraphs A subgraph of a graph G is a graph  whose vertex set is a subset of that of G, and  whose adjacency relation is a subset of that of G restricted to this subset Graph.subgraph(nbunch) returns the subgraph induced on the nodes in nbunch I.e., the nodes in nbunch and the edges between them The graph, edge or node attributes just point to the original graph  So changes to the node or edge structure won’t be reflected in the original graph But changes to the attributes will To create a subgraph with its own copy of the edge/node attributes use nx.Graph(G.subgraph(nbunch))

If edge attributes are containers, get a deep copy using G.subgraph(nbunch).copy(). For an in-place reduction of a graph to a subgraph, remove the other nodes:
G.remove_nodes_from([n for n in G if n not in set(nbunch)])
The following all have the same description as Graph.subgraph():
DiGraph.subgraph(nbunch)
MultiGraph.subgraph(nbunch)
MultiDiGraph.subgraph(nbunch)

Make a barbell without the handle >>> G = nx.barbell_graph(4,2) >>> G1 = G.subgraph([0,1,2,3,6,7,8,9])

NetworkX: Connected Components
In an undirected graph G, vertices u and v are connected if G contains a path from u to v. A graph is connected if every pair of vertices in it is connected. A connected component is a maximal connected subgraph of G. Each vertex belongs to exactly 1 connected component, as does each edge.
A directed graph is weakly connected if replacing all of its directed edges with undirected edges produces a connected (undirected) graph. It’s unilaterally connected if it contains a directed path from u to v or a directed path from v to u for every pair of vertices u, v. It’s strongly connected if it contains a directed path from u to v and a directed path from v to u for every pair of vertices u, v.

The weakly connected components are the maximal weakly connected subgraphs The strongly connected components are the maximal strongly connected subgraphs The condensation of a directed graph is the graph with each of the strongly connected components contracted into a single node

For a Graph G:
is_connected(G) tests G’s connectivity
number_connected_components(G) returns the number of connected components in G
connected_components(G) returns a list of the connected components of G, each a list of nodes
connected_component_subgraphs(G) returns a list of the connected components of G as subgraphs
node_connected_component(G, n) returns a list of the nodes in the connected component of G containing node n

>>> G1 = nx.complete_graph(3)
>>> G2 = nx.complete_graph(2)
>>> GG = nx.disjoint_union(G1, G2)
>>> GG.edges()
[(0, 1), (0, 2), (1, 2), (3, 4)]
>>> nx.connected_components(GG)
[[0, 1, 2], [3, 4]]
>>> H1, H2 = nx.connected_component_subgraphs(GG)
>>> H1.edges()
[(0, 1), (0, 2), (1, 2)]
>>> H2.edges()
[(3, 4)]
>>> nx.node_connected_component(GG, 2)
[0, 1, 2]

For a DiGraph G:
is_strongly_connected(G) tests G for strong connectivity
number_strongly_connected_components(G) returns the number of strongly connected components in G
strongly_connected_components(G) returns a list of the strongly connected components of G, each a list of nodes
strongly_connected_component_subgraphs(G) returns a list of the strongly connected components of G as subgraphs
condensation(G, scc) returns the condensation of G. scc is a list of strongly connected components (cf. strongly_connected_components()). The resulting graph is a directed acyclic graph. Node labels are the indices of the components in the list of strongly connected components.

is_weakly_connected(G) tests G for weak connectivity
number_weakly_connected_components(G) returns the number of weakly connected components in G
weakly_connected_components(G) returns a list of the weakly connected components of G, each a list of nodes
weakly_connected_component_subgraphs(G) returns a list of the weakly connected components of G as subgraphs

>>> DG = nx.DiGraph([(0,1),(1,2),(2,0),(3,4)])
>>> nx.weakly_connected_components(DG)
[[0, 1, 2], [3, 4]]
>>> scc = nx.strongly_connected_components(DG)
>>> scc
[[0, 1, 2], [4], [3]]
>>> DGS = nx.strongly_connected_component_subgraphs(DG)[0]
>>> DGS.edges()
[(0, 1), (1, 2), (2, 0)]

>>> DGcon = nx.condensation(DG, scc)
>>> DGcon.edges()
[(2, 1)]
>>> DGcon.nodes()
[0, 1, 2]
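For readers who want to see what lies behind strongly connected components and the condensation, here is a sketch in standard-library Python on the same 5-node digraph. It uses Kosaraju's two-pass algorithm, which is an assumption made for this illustration: the text doesn't say which algorithm NetworkX uses, and the component order (hence the condensation's node labels) may differ from the NetworkX output.

```python
def strongly_connected_components(adj):
    """Kosaraju's algorithm. `adj` maps every node to a list of successors."""
    # Pass 1: iterative DFS, recording nodes in order of completion.
    order, seen = [], set()
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:  # no unvisited successor left: node is finished
                order.append(node)
                stack.pop()
    # Build the reversed graph.
    rev = {u: [] for u in adj}
    for u in adj:
        for v in adj[u]:
            rev[v].append(u)
    # Pass 2: DFS on the reversed graph, latest-finished nodes first.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = [], [root]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(sorted(component))
    return components


DG = {0: [1], 1: [2], 2: [0], 3: [4], 4: []}
scc = strongly_connected_components(DG)

# Condensation: one node per component, one edge per pair of linked components.
index = {node: i for i, comp in enumerate(scc) for node in comp}
condensation_edges = {
    (index[u], index[v]) for u in DG for v in DG[u] if index[u] != index[v]
}

print(sorted(scc))
print(condensation_edges)
```

The cycle 0, 1, 2 collapses to a single node, while 3 and 4 stay separate because there is a path from 3 to 4 but none back.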

Cliques
A clique in an undirected graph G = (V, E) is a subset of the vertex set C ⊆ V s.t. every 2 vertices in C are connected by an edge. Equivalently, the subgraph induced by C is complete. Sometimes the term “clique” also refers to the subgraph itself. A maximal clique is a clique that can’t be extended by including 1 more adjacent vertex, i.e., a clique that doesn’t exist exclusively within the vertex set of a larger clique. A maximum clique is a clique of the largest possible size in G. The clique number of G is the number of nodes in a maximum clique of G. Finding a maximum clique is NP-hard (the decision version, whether G has a clique of a given size, is NP-complete).
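The difference between a clique, a maximal clique, and the clique number can be tested by brute force on a small graph. The sketch below is standard-library Python with hypothetical helper names; it builds two 4-node complete graphs joined through a single middle node 4 (the shape of barbell_graph(4,1)). The exhaustive search takes exponential time and is only meant for tiny graphs.

```python
from itertools import combinations


def add_edge(adj, u, v):
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def is_clique(adj, nodes):
    """Every 2 of the given nodes must be joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def is_maximal_clique(adj, nodes):
    """A clique that no further node can be added to."""
    nodes = set(nodes)
    if not is_clique(adj, nodes):
        return False
    return not any(is_clique(adj, nodes | {w}) for w in set(adj) - nodes)


def clique_number(adj):
    """Size of a maximum clique, by exhaustive search (exponential time)."""
    for size in range(len(adj), 0, -1):
        if any(is_clique(adj, c) for c in combinations(adj, size)):
            return size
    return 0


# Two complete graphs on {0,1,2,3} and {5,6,7,8}, linked via node 4.
adj = {}
for block in ([0, 1, 2, 3], [5, 6, 7, 8]):
    for u, v in combinations(block, 2):
        add_edge(adj, u, v)
add_edge(adj, 3, 4)
add_edge(adj, 4, 5)

print(is_maximal_clique(adj, [3, 4]))     # maximal, though not maximum
print(is_maximal_clique(adj, [0, 1, 2]))  # a clique, but node 3 extends it
print(clique_number(adj))
```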

In the following, G may be a Graph, DiGraph, MultiGraph, or MultiDiGraph find_cliques(G) returns a generator of maximal cliques in G as node lists graph_clique_number(G) returns the clique number for G graph_number_of_cliques(G) returns the number of maximal cliques in G

>>> F = nx.barbell_graph(4,1)
>>> for cl in nx.find_cliques(F):
...     print cl
...
[8, 5, 6, 7]
[3, 0, 1, 2]
[3, 4]
[5, 4]
>>> nx.graph_number_of_cliques(F)
4
>>> nx.graph_clique_number(F)
4

2.3 Distance and Breadth-First Search Beyond asking whether 2 nodes are connected by a path, ask how long such a path is The length of a path is the number of edges in the sequence that comprises it  E.g., the path MIT, BBN, RAND, UCLA in Fig. 2.3 has length 3  The path MIT, UTAH has length 1 The distance between 2 nodes is the length of the shortest path between them  E.g., the distance between LINC and SRI is 3 Check that there’s no length-1 or length-2 path between them

Breadth-First Search First declare all of your actual friends to be at distance 1 Then find all of their friends (not counting people already friends of yours), declare these to be at distance 2 Then find all of their friends (not counting people already found at distances 1 and 2), declare these to be at distance 3 Continuing in this way, search in successive layers, each representing the next distance out  Each new layer is built from all those nodes that have not already been discovered in earlier layers, and have an edge to some node in the previous layer Figure 2.8
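The layer-by-layer procedure can be written down almost word for word. This is a minimal sketch in standard-library Python; the friendship graph and its names are hypothetical, and bfs_layers is a helper written for this illustration rather than a NetworkX function.

```python
def bfs_layers(adj, source):
    """Return a list of layers: layers[d] holds the nodes at distance d."""
    layers = [[source]]
    seen = {source}
    while True:
        next_layer = []
        for u in layers[-1]:            # nodes in the previous layer
            for v in adj[u]:
                if v not in seen:       # not discovered in an earlier layer
                    seen.add(v)
                    next_layer.append(v)
        if not next_layer:
            return layers
        layers.append(next_layer)


# Hypothetical friendship graph (each friendship listed in both directions).
friends = {
    "you": ["ann", "bob"],
    "ann": ["you", "cat"],
    "bob": ["you", "cat", "dan"],
    "cat": ["ann", "bob"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}

layers = bfs_layers(friends, "you")
distance = {node: d for d, layer in enumerate(layers) for node in layer}
print(layers)
print(distance["eve"])
```

Because a node is placed in the first layer that reaches it, its layer index is exactly its distance from the source.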

Figure 2.9. How to discover all distances from the node MIT in the 13-node Arpanet graph from Figure 2.3

NetworkX: Depth First Search
Various algorithms giving the result of a depth-first search (DFS) on a graph. The source argument (where the traversal begins) is optional (defaulting to node 0 or whichever is listed first) but generally included.
dfs_edges(G, source) returns a generator that produces edges in a DFS
dfs_tree(G, source) returns a directed tree (a DiGraph) of a DFS
dfs_predecessors(G, source) returns a dictionary of predecessors in a DFS
dfs_successors(G, source) returns a dictionary of successors in a DFS
dfs_preorder_nodes(G, source) returns a generator producing nodes in a DFS pre-ordering
dfs_postorder_nodes(G, source) returns a generator producing nodes in a DFS post-ordering
dfs_labeled_edges(G, source) returns a generator that produces edges in a DFS labeled by direction (‘dir’) type (‘forward’, ‘reverse’, ‘nontree’)

>>> G = nx.krackhardt_kite_graph()
>>> list(nx.dfs_edges(G,0))
[(0, 1), (1, 3), (3, 2), (2, 5), (5, 6), (6, 4), (6, 7), (7, 8), (8, 9)]
>>> list(nx.dfs_edges(G,9))
[(9, 8), (8, 7), (7, 5), (5, 0), (0, 1), (1, 3), (3, 2), (3, 4), (4, 6)]
>>> list(nx.dfs_edges(G))
[(0, 1), (1, 3), (3, 2), (2, 5), (5, 6), (6, 4), (6, 7), (7, 8), (8, 9)]

>>> tree = nx.dfs_tree(G, 9) >>> tree >>> tree.succ {0: {1: {}}, 1: {3: {}}, 2: {}, 3: {2: {}, 4: {}}, 4: {6: {}}, 5: {0: {}}, 6: {}, 7: {5: {}}, 8: {7: {}}, 9: {8: {}}}

>>> nx.dfs_successors(G, 9)
{0: [1], 1: [3], 3: [2, 4], 4: [6], 5: [0], 7: [5], 8: [7], 9: [8]}
>>> nx.dfs_predecessors(G, 9)
{0: 5, 1: 0, 2: 3, 3: 1, 4: 3, 5: 7, 6: 4, 7: 8, 8: 9}
>>> list(nx.dfs_preorder_nodes(G, 9))
[9, 8, 7, 5, 0, 1, 3, 2, 4, 6]
>>> list(nx.dfs_postorder_nodes(G, 9))
[2, 6, 4, 3, 1, 0, 5, 7, 8, 9]

NetworkX: Breadth First Search
Various algorithms that give the result of a breadth-first search (BFS) on a graph. The source argument (where the traversal begins) again is optional (defaulting to the node listed first) but generally included.
bfs_edges(G, source) returns a generator that produces edges in a BFS
bfs_tree(G, source) returns a directed tree of a BFS
bfs_predecessors(G, source) returns a dictionary of predecessors in a BFS
bfs_successors(G, source) returns a dictionary of successors in a BFS

>>> list(nx.bfs_edges(G, 0))
[(0, 1), (0, 2), (0, 3), (0, 5), (1, 4), (1, 6), (5, 7), (7, 8), (8, 9)]
>>> list(nx.bfs_edges(G, 9))
[(9, 8), (8, 7), (7, 5), (7, 6), (5, 0), (5, 2), (5, 3), (6, 1), (6, 4)]
>>> tree = nx.bfs_tree(G, 9)
>>> tree.succ
{0: {}, 1: {}, 2: {}, 3: {}, 4: {}, 5: {0: {}, 2: {}, 3: {}}, 6: {1: {}, 4: {}}, 7: {5: {}, 6: {}}, 8: {7: {}}, 9: {8: {}}}

>>> nx.bfs_predecessors(G, 9)
{0: 5, 1: 6, 2: 5, 3: 5, 4: 6, 5: 7, 6: 7, 7: 8, 8: 9}
>>> nx.bfs_successors(G, 9)
{8: [7], 9: [8], 5: [0, 2, 3], 6: [1, 4], 7: [5, 6]}

NetworkX: Shortest Path
These algorithms work for undirected and directed graphs. They return an arbitrary shortest path when there is more than 1 shortest path between 2 nodes. Those that search for a path between 2 particular nodes raise a NetworkXNoPath exception if there is no such path.
has_path(G, source, target) returns True if G has a path from source to target; otherwise, False is returned

shortest_path (G, source=None, target=None, weight=None) computes shortest paths in G If the source and target are both specified, return a single list of nodes in a shortest path If only the source is specified, return a dictionary keyed by targets with a list of nodes in a shortest path If neither the source nor the target is specified, return path, a dictionary of dictionaries where path[source][target] is the list of nodes in the source-to-target path weight, if None (default), causes every edge to have weight/distance 1  If weight is a string, it’s the edge attribute to use as the edge weight Any edge attribute not present defaults to 1

shortest_path_length( G, source=None, target=None, weight=None ) computes shortest path lengths in G source, target, weight are as with shortest_path() If the source and target are both specified, return a single number for the shortest path If only the source is specified, return a dictionary keyed by targets with the shortest path lengths as values If neither the source nor the target is specified, return length, a dictionary of dictionaries where length[source][target] is the length of a shortest path from source to target

average_shortest_path_length(G, weight=None) returns the average shortest path length over all pairs of distinct nodes of G. weight is as before. The average shortest path length a is
$$a = \sum_{s,t \in V} \frac{d(s,t)}{n(n-1)}$$
where V is the set of nodes in G, d(s, t) is the length of the shortest path from s to t, and n is the number of nodes in G.

>>> import networkx as nx
>>> DG = nx.DiGraph()
>>> DG.add_weighted_edges_from([(0,1,1.0), (1,0,1.0), (1,3,1.0), (3,4,1.0), (4,2,1.0), (2,1,1.0), (1,4,3.0), (4,1,3.0), (5,4,1.0)])
>>> nx.has_path(DG,5,0)
True
>>> nx.has_path(DG,0,5)
False

>>> d = nx.shortest_path(DG)
>>> d
{0: {0: [0], 1: [0, 1], 2: [0, 1, 4, 2], 3: [0, 1, 3], 4: [0, 1, 4]},
 1: {0: [1, 0], 1: [1], 2: [1, 4, 2], 3: [1, 3], 4: [1, 4]},
 2: {0: [2, 1, 0], 1: [2, 1], 2: [2], 3: [2, 1, 3], 4: [2, 1, 4]},
 3: {0: [3, 4, 1, 0], 1: [3, 4, 1], 2: [3, 4, 2], 3: [3], 4: [3, 4]},
 4: {0: [4, 1, 0], 1: [4, 1], 2: [4, 2], 3: [4, 1, 3], 4: [4]},
 5: {0: [5, 4, 1, 0], 1: [5, 4, 1], 2: [5, 4, 2], 3: [5, 4, 1, 3], 4: [5, 4], 5: [5]}}
>>> d[0]
{0: [0], 1: [0, 1], 2: [0, 1, 4, 2], 3: [0, 1, 3], 4: [0, 1, 4]}
>>> dw = nx.shortest_path(DG, weight='weight')
>>> dw[0]
{0: [0], 1: [0, 1], 2: [0, 1, 3, 4, 2], 3: [0, 1, 3], 4: [0, 1, 3, 4]}
>>> nx.shortest_path(DG, source=0, weight='weight')
{0: [0], 1: [0, 1], 2: [0, 1, 3, 4, 2], 3: [0, 1, 3], 4: [0, 1, 3, 4]}
>>> nx.shortest_path(DG, source=0, target=4, weight='weight')
[0, 1, 3, 4]

>>> nx.shortest_path(DG, source=0, target=5)
Traceback (most recent call last):
  …
networkx.exception.NetworkXNoPath: No path between 0 and 5.
>>> dw_len = nx.shortest_path_length(DG, weight='weight')
>>> dw_len[0]
{0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0}
>>> dw_len[0][4]
3.0
>>> nx.shortest_path_length(DG, source=0, weight='weight')
{0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0}
>>> nx.shortest_path_length(DG, source=0, target=4, weight='weight')
3.0

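>>> nx.average_shortest_path_length(DG, weight='weight')
1.9333333333333333
>>> dw_len = nx.shortest_path_length(DG, weight='weight')
>>> count = len_sum = 0
>>> for lens in dw_len.values():
...     count += len(lens)
...     len_sum += sum(lens.values())
...
>>> print count, len_sum
31 58.0
>>> len_sum / count
1.8709677419354838
The two averages differ because count = 31 includes the 6 zero-length entries from each node to itself and leaves out the 5 unreachable pairs (node 5 can’t be reached from nodes 0 to 4), whereas average_shortest_path_length() divides the same sum by n(n-1) = 30, giving 58.0 / 30 ≈ 1.9333.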
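The numbers in this session can be reproduced without NetworkX. The sketch below is standard-library Python: a textbook Dijkstra with a binary heap, run from every node of the same weighted digraph. It is an independent illustration, not NetworkX's implementation, and dijkstra_lengths is a name chosen for this sketch.

```python
import heapq

# The same weighted directed edges (u, v, weight) as in DG above.
EDGES = [(0, 1, 1.0), (1, 0, 1.0), (1, 3, 1.0), (3, 4, 1.0), (4, 2, 1.0),
         (2, 1, 1.0), (1, 4, 3.0), (4, 1, 3.0), (5, 4, 1.0)]


def dijkstra_lengths(adj, source):
    """Shortest weighted distances from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry: a shorter route to u was found later
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


adj = {n: [] for n in range(6)}
for u, v, w in EDGES:
    adj[u].append((v, w))

lengths = {s: dijkstra_lengths(adj, s) for s in adj}
len_sum = sum(sum(d.values()) for d in lengths.values())
count = sum(len(d) for d in lengths.values())
n = len(adj)

print(count, len_sum)           # reachable (s, t) entries and their total length
print(len_sum / (n * (n - 1)))  # sum divided by n(n-1)
print(len_sum / count)          # sum divided by the number of entries
```

Dividing by n(n-1) counts every ordered pair of distinct nodes, reachable or not, which is why it disagrees with the average over the entries actually present in the dictionaries.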

Advanced Shortest Path These are often more specific than the functions listed above and often provide the implementations for those functions Generally return results in the now familiar nested dictionary format A function without ‘dijkstra’ in its name ignores weights and other edge data A function with ‘dijkstra’ in its name by default considers the values of edge ‘ weight ’ attributes  To consider different edge data, set the ‘ weight ’ keyword argument to that attribute All these functions have a keyword argument cutoff  Can be set to stop the search at the given depth  Paths of length greater than the cutoff are ignored

single_source_shortest_path(G, source, cutoff=None) computes shortest paths from source to all nodes reachable from it
single_source_shortest_path_length(G, source, cutoff=None) computes shortest path lengths from source to all reachable nodes
all_pairs_shortest_path(G, cutoff=None) computes shortest paths between all nodes
all_pairs_shortest_path_length(G, cutoff=None) computes the shortest path lengths between all nodes

>>> nx.single_source_shortest_path(DG, 0)
{0: [0], 1: [0, 1], 2: [0, 1, 4, 2], 3: [0, 1, 3], 4: [0, 1, 4]}
>>> nx.single_source_shortest_path(DG, 0, cutoff=2)
{0: [0], 1: [0, 1], 3: [0, 1, 3], 4: [0, 1, 4]}
>>> nx.single_source_shortest_path_length(DG, 0)
{0: 0, 1: 1, 2: 3, 3: 2, 4: 2}
>>> nx.all_pairs_shortest_path_length(DG)
{0: {0: 0, 1: 1, 2: 3, 3: 2, 4: 2},
 1: {0: 1, 1: 0, 2: 2, 3: 1, 4: 1},
 2: {0: 2, 1: 1, 2: 0, 3: 2, 4: 2},
 3: {0: 3, 1: 2, 2: 2, 3: 0, 4: 1},
 4: {0: 2, 1: 1, 2: 1, 3: 2, 4: 0},
 5: {0: 3, 1: 2, 2: 2, 3: 3, 4: 1, 5: 0}}

dijkstra_path(G, source, target, weight='weight') returns the shortest path from source to target in a weighted graph
dijkstra_path_length(G, source, target, weight='weight') returns the shortest path length from source to target in a weighted graph
single_source_dijkstra_path(G, source, cutoff=None, weight='weight') computes the shortest paths between source and all other reachable nodes for a weighted graph
single_source_dijkstra_path_length(G, source, cutoff=None, weight='weight') computes the lengths of the shortest paths between source and all other reachable nodes for a weighted graph
all_pairs_dijkstra_path(G, cutoff=None, weight='weight') computes shortest paths between all nodes in a weighted graph
all_pairs_dijkstra_path_length(G, cutoff=None, weight='weight') computes the lengths of the shortest paths between all nodes in a weighted graph

single_source_dijkstra(G, source, target=None, cutoff=None, weight='weight') computes the shortest paths and their lengths in a weighted graph. Returns a tuple of 2 dictionaries keyed by node: the 1st holds the distances from the source, the 2nd the paths from the source to that node.
>>> nx.dijkstra_path(DG,0,4)
[0, 1, 3, 4]
>>> ls, ps = nx.single_source_dijkstra(DG, 0)
>>> ls
{0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0}
>>> ps
{0: [0], 1: [0, 1], 2: [0, 1, 3, 4, 2], 3: [0, 1, 3], 4: [0, 1, 3, 4]}

floyd_warshall(G, weight='weight') finds all-pairs shortest path lengths using Floyd’s algorithm. Floyd’s algorithm is appropriate for finding shortest paths in dense graphs, or in graphs with negative weights where Dijkstra’s algorithm fails. The algorithm can still fail if there are negative cycles. It has running time O(n^3) and running space O(n^2).
>>> fw = nx.floyd_warshall(DG)
>>> fw[0]
defaultdict(<function <lambda> at 0x01E0E5B0>, {0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: inf})
>>> fw[5]
defaultdict(<function <lambda> at 0x01FC1230>, {0: 4.0, 1: 3.0, 2: 2.0, 3: 4.0, 4: 1.0, 5: 0})
>>> fw[5][0]
4.0
Note: defaultdict is a dict subclass that calls a factory function to supply missing values.
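The O(n^3) time and O(n^2) space come straight from the structure of the algorithm: an n-by-n table of distances, improved by a triple loop over intermediate nodes. The sketch below is a standard-library Python rendering of the textbook algorithm on the same digraph; it uses plain nested dicts instead of the defaultdicts NetworkX returns, and it isn't NetworkX's code.

```python
INF = float("inf")

EDGES = [(0, 1, 1.0), (1, 0, 1.0), (1, 3, 1.0), (3, 4, 1.0), (4, 2, 1.0),
         (2, 1, 1.0), (1, 4, 3.0), (4, 1, 3.0), (5, 4, 1.0)]


def floyd_warshall(nodes, weighted_edges):
    """All-pairs shortest path lengths; assumes there are no negative cycles."""
    # n x n table: 0 on the diagonal, infinity where no edge is known yet.
    dist = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    for u, v, w in weighted_edges:
        dist[u][v] = min(dist[u][v], w)
    # Allow each node k in turn to serve as an intermediate stop.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


fw = floyd_warshall(range(6), EDGES)
print(fw[0])
print(fw[5][0])
```

Unreachable pairs simply keep the value infinity, which matches the inf entry for node 5 in fw[0] above.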

astar_path(G, source, target, heuristic=None, weight='weight') returns a list of nodes in a shortest path between source and target using the A* (“A-star”) algorithm. heuristic is a function to estimate the distance from a node to target. To guarantee a shortest path, this function should never overestimate this distance. The function takes 2 node arguments and must return a number.
>>> G = nx.grid_graph(dim=[3,3])
>>> def dist(a, b):
...     (x1, y1) = a
...     (x2, y2) = b
...     return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
...
>>> nx.astar_path(G,(0,0),(2,2),dist)
[(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
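To show how the heuristic steers the search, here is a small A* sketch in standard-library Python on a 3-by-3 grid with unit-length edges. It isn't NetworkX's implementation, and grid_neighbors, euclidean, and this astar_path are names made up for the sketch. Since several shortest paths tie on this grid, the exact path returned may differ from the NetworkX output above; only its length is determined.

```python
import heapq


def grid_neighbors(node, size=3):
    """4-neighborhood on a size x size grid (an assumption of this sketch)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        cx, cy = x + dx, y + dy
        if 0 <= cx < size and 0 <= cy < size:
            yield (cx, cy)


def euclidean(a, b):
    """Straight-line distance; it never overestimates the grid distance."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def astar_path(source, target, neighbors, heuristic):
    """A* search with unit edge weights; returns a list of nodes or None."""
    # Heap entries: (cost so far + heuristic estimate, cost so far, node).
    heap = [(heuristic(source, target), 0, source)]
    best = {source: 0}
    parent = {source: None}
    while heap:
        _, cost, node = heapq.heappop(heap)
        if node == target:
            path = []
            while node is not None:   # walk the parent links back to source
                path.append(node)
                node = parent[node]
            return path[::-1]
        if cost > best[node]:
            continue                  # stale entry
        for nxt in neighbors(node):
            new_cost = cost + 1
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = node
                estimate = new_cost + heuristic(nxt, target)
                heapq.heappush(heap, (estimate, new_cost, nxt))
    return None


path = astar_path((0, 0), (2, 2), grid_neighbors, euclidean)
print(path)
```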

The Small-World Phenomenon Go back to our thought experiments on the global friendship network The argument explaining why you belong to a giant component asserts something stronger: Not only do you have paths of friends connecting you to a large fraction of the world’s population, but these paths are surprisingly short  E.g., consider a friend from another country (thence to his friends and relatives, etc.) This is the small-world phenomenon: the idea that the world looks “small” when you think of how short a path of friends it takes to get from you to almost anyone else

Also known as the 6 degrees of separation, after the title of a play by John Guare. One line is: “I read somewhere that everybody on this planet is separated by only six other people. Six degrees of separation between us and everyone else on this planet.”

The first experimental study of this notion (and the origin of “6”) was by Stanley Milgram and his colleagues in the 1960s With a small budget, he tested the idea that people are really connected in the global friendship network by short chains of friends Asked 296 randomly chosen “starters” to try forwarding a letter to a “target” person, a stockbroker living in a suburb of Boston  The starters were each given some personal info about the target (including address and occupation)  Asked to forward the letter to someone they knew on a first-name basis, with the same instructions Try to reach the target as quickly as possible  Formed chains of people that closed in on the stockbroker

Figure 2.10: the distribution of path lengths, among the 64 chains that reached the target The median length was 6 It’s striking that so many letters reached their destination and by such short paths

Some caveats about the experiment:
It clearly doesn’t establish a statement as bold as “six degrees of separation between us and everyone else on this planet”. The paths were just to a single, fairly affluent target. Many letters never got there. Attempts to recreate the experiment have been problematic due to lack of participation.
We can ask how useful these short paths really are. Milgram himself, in his original paper: if we think of each person as the center of their own social “world,” then “six short steps” becomes “six worlds apart”, which makes 6 sound like a much larger number.

Still, the overall conclusion has been accepted in a broad sense: Social networks tend to have very short paths between essentially arbitrary pairs of people The existence of all these short paths has substantial consequences for the potential speed with which info, diseases, and other kinds of contagion can spread  Cf. also the potential access that the social network provides to opportunities and to people with very different characteristics from our own See Chapter 20, a more detailed study of the small-world phenomenon and its consequences

Instant Messaging, Paul Erdös, and Kevin Bacon That social networks generally are “small worlds” has been increasingly confirmed in settings where we do have full data on the network structure Milgram resorted to an experiment where letters served as “tracers” through a global friendship network  He had no hope of fully mapping the network on his own But where the full graph structure is known, we load it into a computer and perform breadth-first search to determine what typical distances look like

One of the largest such computational studies was by Leskovec and Horvitz Analyzed the 240 million active user accounts on Microsoft Instant Messenger  They were employed by Microsoft at the time, had access to a complete snapshot of the system for the month under study Built a graph where each node corresponds to a user  There’s an edge between two users if they engaged in a two-way conversation at any point during a month-long observation period The graph had a giant component containing almost all of the nodes The distances within this giant component were very small  An estimated average distance of 6.6  An estimated median of 7

Figure 2.11: The distribution of distances averaged over a random sample of 1000 users. Breadth-first search was done separately from each of these 1000 users, and the results from these 1000 nodes were combined to produce the plot in Figure 2.11. The graph was so large that doing breadth-first search from every single node would have taken an astronomical amount of time. Producing plots like this efficiently for massive graphs is an interesting research topic in itself.

Figure 2.11 approximates what Milgram was after: the distribution of how far apart we all are in the full global friendship network Reconciling the structure of such massive datasets with the underlying networks we’re trying to measure is an issue arising here and many times later Here we’re still some distance from Milgram's goal  We only track people who are technologically-endowed enough to have access to instant messaging  Rather than basing the graph on who is truly friends with whom, we observe only who talks to whom during an observation period

Turning to a smaller scale (on the order of 10^5 rather than 10^8 people), researchers have also discovered very short paths in the collaboration networks within professional communities. E.g., in mathematics, Erdös (who published c. 1500 papers) is a central figure in the collaborative structure of the field. Define a collaboration graph (as in Figure 2.6) with nodes for mathematicians and edges connecting pairs who have jointly authored a paper.

Figure 2.12: A small hand-drawn piece of the collaboration graph, with paths leading to Paul Erdös A mathematician's Erdös number is the distance from them to Erdös  Most mathematicians have Erdös numbers of at most 4 or 5  Extending the collaboration graph to co-authorship across all the sciences, most scientists in other fields have Erdös numbers only slightly (if at all) larger Einstein (2), Fermi (3), Chomsky (4), Pauling (4), Crick (5), Watson (6)

Three students at Albright College in PA around 1994 adapted the idea of Erdös numbers to the collaboration graph of movie actors  Nodes are performers  An edge connects 2 performers if they've appeared together in a movie A performer's Bacon number is their distance in this graph to Kevin Bacon  Using cast lists from the Internet Movie Database (IMDB), compute Bacon numbers for all performers via breadth-first search
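The computation has two steps: turn cast lists into a co-appearance graph, then run breadth-first search from Kevin Bacon. A minimal sketch in standard-library Python follows. The films and performer names other than Kevin Bacon are placeholders invented for the example, not IMDB data, and the helper names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Placeholder cast lists (hypothetical; not real IMDB data).
casts = {
    "Film A": ["Kevin Bacon", "Performer 1", "Performer 2"],
    "Film B": ["Performer 2", "Performer 3"],
    "Film C": ["Performer 3", "Performer 4"],
    "Film D": ["Performer 5"],
}


def costar_graph(casts):
    """Nodes are performers; an edge joins 2 performers who share a film."""
    adj = {}
    for cast in casts.values():
        for performer in cast:
            adj.setdefault(performer, set())
        for a, b in combinations(cast, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def bfs_distances(adj, source):
    """Distance from `source` to every node in its component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


bacon = bfs_distances(costar_graph(casts), "Kevin Bacon")
print(bacon)
```

A performer outside Bacon's component, such as the placeholder "Performer 5", gets no entry at all, which corresponds to an undefined (infinite) Bacon number.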

The ave. Bacon number, over all performers in the IMDB, is c. 2.9, and it’s hard to find one that’s larger than 5. One movie enthusiast tried to come up with the largest Bacon number and found an obscure 1928 Soviet pirate film, Plenniki Morya, starring P. Savin, who has a Bacon number of 7, with a supporting cast of 8 who appeared nowhere else.

NetworkX: Minimum Spanning Tree

Given a connected, undirected graph G, a spanning tree of G is a subgraph that is a tree and connects all the vertices. A single graph may have several spanning trees. The weight of a spanning tree is the sum of the weights of the edges in that spanning tree. A minimum spanning tree (MST) is a spanning tree whose weight is less than or equal to the weight of every other spanning tree. More generally, any undirected graph (not necessarily connected) has a minimum spanning forest: the union of MSTs for its connected components.

minimum_spanning_tree(G, weight='weight') returns an MST or, if the graph isn’t connected, a minimum spanning forest of G. G must be a Graph (not a DiGraph, MultiGraph, …); a Graph is returned. weight is the edge-data key to use for the weight (default 'weight'); if the edges don’t have a weight attribute, a default weight of 1 is used. Uses Kruskal’s algorithm.

minimum_spanning_edges(G, weight='weight', data=True) returns a generator that produces the edges in the MST. Edges are 3-tuples (u, v, w), where w is the edge-data dictionary; if the keyword argument data is False, edges are just (u, v).

>>> import networkx as nx
>>> import matplotlib.pyplot as plt
>>> G = nx.watts_strogatz_graph(30, 10, 0.2)
>>> nx.draw_graphviz(G, prog='sfdp', node_color='w')
>>> plt.show()
>>> mst = nx.minimum_spanning_tree(G)
>>> nx.draw_graphviz(mst, prog='sfdp', node_color='w')
>>> plt.show()

>>> ee = nx.minimum_spanning_edges(G)
>>> for (u, v, w) in ee:
...     if u == 0 or v == 0:
...         print(u, v, w)
...
0 3 {}
0 4 {}
0 5 {}
0 7 {}
0 16 {}
0 26 {}
0 27 {}
0 28 {}
0 29 {}
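The MST functions described above are said to use Kruskal's algorithm. As a minimal sketch of that idea (not NetworkX's actual implementation), the following pure-Python version sorts edges by weight and keeps an edge only when it joins two different components, tracked with a simple union-find structure. The function name `kruskal` and the sample graph are hypothetical.

```python
def kruskal(nodes, weighted_edges):
    """Return the edges of a minimum spanning forest as (u, v, weight) tuples."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the component representative,
        # shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    forest = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two components: keep it
            parent[root_u] = root_v
            forest.append((u, v, w))
    return forest

# Hypothetical weighted graph, for illustration only.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("c", "d", 4), ("b", "d", 5)]

mst = kruskal(nodes, edges)
total_weight = sum(w for _, _, w in mst)
print(mst, total_weight)  # [('a', 'b', 1), ('b', 'c', 2), ('c', 'd', 4)] 7
```

If the input graph is not connected, the same loop yields one tree per component, i.e., a minimum spanning forest.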

NetworkX: Distance Measures

First define the graph-theoretic distance-related concepts, then give the relevant NetworkX functions. The distance between 2 vertices (nodes) in a graph is the number of edges in a shortest path connecting them. It is also called the geodesic distance: the length of a graph geodesic between those 2 vertices. A graph geodesic is a shortest path between 2 nodes; there may be several for a given pair of nodes. If there is no path connecting the 2 vertices, the distance is defined as infinite.

The eccentricity of a vertex v is the greatest geodesic distance between v and any other vertex, i.e., how far v is from the node most distant from it in the graph. The radius of a graph is the minimum eccentricity of any vertex in the graph. The diameter of a graph is the maximum eccentricity of any vertex in the graph, i.e., the greatest distance between any pair of vertices. A central vertex in a graph of radius r is one whose eccentricity is r, i.e., a vertex that achieves the radius. A peripheral vertex in a graph of diameter d is one that is distance d from some other vertex, i.e., a vertex that achieves the diameter.

Python Functions for Distance Measures

eccentricity(G) returns the eccentricities of G
radius(G) returns the radius of G
diameter(G) returns the diameter of G
center(G) returns the set of central vertices (nodes) of G
periphery(G) returns the set of peripheral nodes of G

>>> G = nx.barbell_graph(4,2)
>>> nx.eccentricity(G)
{0: 5, 1: 5, 2: 5, 3: 4, 4: 3, 5: 3, 6: 4, 7: 5, 8: 5, 9: 5}
>>> nx.diameter(G)
5
>>> nx.periphery(G)
[0, 1, 2, 7, 8, 9]
>>> nx.radius(G)
3
>>> nx.center(G)
[4, 5]
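The values in the barbell session above can be recomputed directly from the definitions. The sketch below is a pure-Python illustration, not the NetworkX implementation: it runs a breadth-first search from every vertex and takes the largest distance as that vertex's eccentricity. The hand-built edge list assumes the barbell graph consists of two 4-cliques (nodes 0-3 and 6-9) joined by the path 3-4-5-6.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Shortest-path edge counts from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def eccentricities(adj):
    """Eccentricity of every vertex; assumes the graph is connected."""
    return {v: max(bfs_distances(adj, v).values()) for v in adj}

# Assumed barbell layout: cliques on 0-3 and 6-9, connected via 3-4-5-6.
edges = list(combinations(range(4), 2)) + list(combinations(range(6, 10), 2))
edges += [(3, 4), (4, 5), (5, 6)]
adj = {v: set() for v in range(10)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

ecc = eccentricities(adj)
radius = min(ecc.values())            # smallest eccentricity
diameter = max(ecc.values())          # largest eccentricity
center = sorted(v for v in ecc if ecc[v] == radius)
periphery = sorted(v for v in ecc if ecc[v] == diameter)
print(radius, diameter, center, periphery)
```

Under that assumed layout, the results agree with the session: radius 3, diameter 5, center [4, 5], and periphery [0, 1, 2, 7, 8, 9].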

>>> DG1 = nx.DiGraph([(0,1), (1,3), (3,0), (3,2), (2,0)])
>>> nx.eccentricity(DG1)
{0: 3, 1: 2, 2: 3, 3: 2}
>>> nx.diameter(DG1)
3
>>> nx.periphery(DG1)
[0, 2]
>>> nx.radius(DG1)
2
>>> nx.center(DG1)
[1, 3]

NetworkX: Directed Acyclic Graphs

The following work only for a DiGraph. A directed acyclic graph (DAG) is a directed graph with no cycles. A topological sort is a (generally non-unique) permutation of the nodes of a DAG such that an edge from u to v implies that u appears before v.

is_directed_acyclic_graph(DG) returns True if DG is a DAG or False if not.

topological_sort(DG, nbunch=None) returns a list of nodes in topological sort order. nbunch is an optional container of nodes; only those nodes are sorted. If DG isn’t a DAG, no topological sort exists, and a NetworkXUnfeasible exception is raised.

>>> DG = nx.DiGraph([(0,2), (0,3), (1,2), (1,4), (2,3), (2,4)])
>>> nx.is_directed_acyclic_graph(DG)
True
>>> nx.topological_sort(DG)
[1, 0, 2, 4, 3]
>>> nx.topological_sort(DG, [4,2,3])
[2, 3, 4]
>>> DG.add_edge(3,1)
>>> nx.is_directed_acyclic_graph(DG)
False
>>> nx.topological_sort(DG)
Traceback (most recent call last):
  …
networkx.exception.NetworkXUnfeasible: Graph contains a cycle.
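To make the idea of a topological sort concrete, here is a minimal sketch using Kahn's algorithm, one standard technique that is not necessarily the one NetworkX uses. It repeatedly emits a node with no remaining incoming edges and reports a cycle if some nodes can never be emitted. The function name `topological_order` is hypothetical.

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm; raises ValueError if the graph has a cycle."""
    nodes = list(nodes)
    successors = {n: [] for n in nodes}
    in_degree = {n: 0 for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1

    # Start with every node that has no incoming edge.
    ready = deque(n for n in nodes if in_degree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1           # "remove" the edge node -> nxt
            if in_degree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(nodes):          # leftover nodes lie on a cycle
        raise ValueError("graph contains a cycle")
    return order

# Same edges as the DiGraph in the session above.
edges = [(0, 2), (0, 3), (1, 2), (1, 4), (2, 3), (2, 4)]
order = topological_order(range(5), edges)
print(order)  # [0, 1, 2, 3, 4]
```

This order differs from the one in the session, which is expected: a topological sort is not unique, and any order that places u before v for every edge (u, v) is valid.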

NetworkX: Reversing a DiGraph

DiGraph.reverse(copy=True) returns the reverse of the graph: a graph with the same nodes and edges but with the edge directions reversed. If copy is True, a new DiGraph holding the reversed edges is returned; if copy is False, the reverse graph is created using the original graph (changing it in place). MultiDiGraph.reverse(copy=True) has the same description.

>>> DG = nx.DiGraph([(0,1), (1,2)])
>>> DG1 = DG.reverse()

2.4 Network Datasets: An Overview

The increasing availability of large, detailed network datasets has led to an explosion of research on large-scale networks in recent years. Now think more systematically about where people get the data for such research. There are several reasons we might study a particular network dataset. We may care about the actual domain it comes from, so fine-grained details of the data are potentially as interesting as the broad picture. Or we’re using the dataset as a proxy for a related network that may be impossible to measure; e.g., the Microsoft IM graph from Figure 2.11 gave us info about distances in a social network of a scale and character that begins to approximate the global friendship network.

Or we’re looking for network properties common across many different domains, so finding a similar effect in unrelated settings can suggest that it has a certain universal nature. All 3 motivations are often at work simultaneously, to varying degrees. E.g., consider the analysis of the Microsoft IM graph: it gave insight into the global friendship network; at a more specific level, the researchers were also interested in the dynamics of instant messaging in particular; at a more general level, the result of the IM graph analysis fit into the broader framework of small-world phenomena that spans many domains.

To study a social network on 20 people, we can interview them all and ask them who their friends are. But to study the interactions among 20,000 people, we need to be more opportunistic in where we look for data: we can't just go collect everything by hand, and must think about settings where the data has in some essential way already been measured for us. Now consider some of the main sources of large-scale network data used for research. The resulting list is not exhaustive, and the categories aren’t truly distinct: a single dataset can exhibit characteristics from several.

Collaboration Graphs

Collaboration graphs record who works with whom in a specific setting, e.g., co-authorships among scientists or co-appearances by actors. An example extensively studied by sociologists is the graph on highly placed people in the corporate world: an edge joins 2 people if they’ve served together on the board of directors of the same Fortune 500 company. The on-line world provides new instances: the Wikipedia collaboration graph (connecting 2 Wikipedia editors if they've ever edited the same article) and the World-of-Warcraft collaboration graph (connecting 2 W-o-W users if they've ever taken part together in the same raid or other activity).

Sometimes a collaboration graph is studied to learn about the specific domain it comes from; e.g., sociologists who study the business world are interested in the relationships among companies at the director level, as expressed via co-membership on boards. In contrast, people other than research scientists are also interested in scientific co-authorship networks, because these form detailed, pre-digested snapshots of a rich form of social interaction that unfolds over a long period of time. With on-line bibliographic records, one can often track the patterns of collaboration within a field across a century or more, and thereby extrapolate how the social structure of collaboration may work across a range of harder-to-measure settings as well.

Who-Talks-to-Whom Graphs

The Microsoft IM graph is a snapshot of a large community engaged in several billion conversations during a month; it captures the “who-talks-to-whom” structure of the community. Similar datasets have been constructed from the e-mail logs within a company or a university, and from records of phone calls. The latter let researchers study the structure of call graphs, where each node is a phone number and there’s an edge between 2 numbers if they engaged in a phone call over a given observation period.
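As a small illustration of how a call graph might be assembled from raw records, the sketch below keeps only calls that fall inside an observation period and joins the two numbers involved with an undirected edge. The record format, the phone numbers, and the function name `call_graph` are all hypothetical.

```python
from datetime import datetime

# Hypothetical call records: (caller, callee, timestamp).
calls = [
    ("555-0101", "555-0102", datetime(2024, 1, 3)),
    ("555-0102", "555-0101", datetime(2024, 1, 9)),   # same pair, other direction
    ("555-0103", "555-0101", datetime(2024, 2, 14)),  # outside the January window
]

def call_graph(records, start, end):
    """Undirected edges between numbers with at least one call in [start, end)."""
    edges = set()
    for caller, callee, when in records:
        if start <= when < end and caller != callee:
            # frozenset ignores direction, so repeated or reversed calls
            # between the same pair collapse into one edge.
            edges.add(frozenset((caller, callee)))
    return edges

january = call_graph(calls, datetime(2024, 1, 1), datetime(2024, 2, 1))
print(len(january))  # 1
```

Changing the observation window changes the graph, which is one reason results from such datasets depend on the period chosen.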

One can also use the fact that mobile phones with short-range wireless technology can detect other similar devices nearby: equip subjects with such devices, study the traces they record, and thereby build “face-to-face” graphs that record physical proximity. A node is a person carrying one of the mobile devices, and there’s an edge joining 2 people if they were detected to be in close physical proximity over the observation period.

In these datasets, the nodes generally represent customers, employees, or students of the organization that maintains the data, and they have strong expectations of privacy, so the research is generally restricted in specific ways to protect privacy. Such privacy considerations have also become an issue where companies try to use this type of data for marketing or governments try to use it for intelligence-gathering purposes.

Economic network measurements recording the “who-transacts-with-whom” structure of a market or financial community have been used to study the ways in which different levels of access to market participants lead to different levels of market power and different prices for goods. This motivates more mathematical investigations of how a network structure limiting access between buyers and sellers affects outcomes (cf. Chaps. 10-12).

Information Linkage Graphs

Snapshots of the Web are central examples of network datasets: nodes are Web pages, and directed edges represent links from one page to another. Of particular interest (beyond the info in the documents) are the social and economic structures that stand behind the info: hundreds of millions of personal pages on social-networking and blogging sites, and hundreds of millions more representing companies and governmental organizations engineering their external images.

Because of the scale of the full Web, just manipulating the data effectively is a research challenge in itself. So a lot of network research is done on interesting, well-defined subsets of the Web, including the linkages among bloggers, pages on Wikipedia, pages on social-networking sites such as Facebook or MySpace, and discussions and product reviews on shopping sites.

Since the early 20th century (well before the Web), citation analysis has studied the network structure of citations among scientific papers or patents, which lets us track the evolution of science. Citation networks remain popular in social research for the same reason that scientific co-authorship graphs are: they’re very clean datasets that span decades.

Technological Networks

Don’t think of the Web as primarily a technological network: it’s really a projection onto a technological backdrop of ideas, info, and social and economic structure created by humans. But there’s been a convergence of social and technological networks, and a lot of interesting network data comes from the more overtly technological end of the spectrum. There, nodes represent physical devices and edges represent physical connections between them. Examples include the interconnections among computers on the Internet and among generating stations in a power grid.

Even such physical networks are ultimately also economic networks: they represent the interactions among the competing organizations, companies, regulatory bodies, and other economic entities that shape them.

On the Internet, we have a two-level view of the network. At the lowest level, nodes are individual routers and computers, and an edge means that 2 devices are physically connected. At a higher level, these nodes are grouped into little “nation-states” termed autonomous systems (ASs), each controlled by a different Internet service provider (ISP). The who-transacts-with-whom graph on the ASs is the AS graph, which represents the data transfer agreements these ISPs make with each other.

Networks in the Natural World

Network research has special interest in several different types of biological networks. Look at 3 examples at 3 different scales, from the population level down to the molecular level. Food webs represent the who-eats-whom relationships among species in an ecosystem: there’s a node for each species, and a directed edge from node A to node B indicates that members of A consume members of B. Seeing the structure of a food web as a graph helps us reason about issues such as cascading extinctions: if certain species become extinct, species relying on them for food also risk extinction, and these extinctions can propagate through the food web.
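The propagation of extinctions through a food web can be sketched as a simple fixed-point computation. The example below makes a strong simplifying assumption that is not from the text: a consumer goes extinct only once every species it eats is gone. The food web and the function name `cascade` are hypothetical.

```python
def cascade(eats, removed):
    """Species lost when `removed` die out, assuming a consumer goes
    extinct once all of its food sources are extinct (a simplification)."""
    extinct = set(removed)
    changed = True
    while changed:                      # repeat until no new extinctions occur
        changed = False
        for species, prey in eats.items():
            # prey <= extinct: every food source of this species is gone
            if species not in extinct and prey and prey <= extinct:
                extinct.add(species)
                changed = True
    return extinct

# Hypothetical food web: consumer -> set of species it eats
# (a directed edge from the consumer to each of those species).
eats = {
    "grass": set(),
    "algae": set(),
    "rabbit": {"grass"},
    "fox": {"rabbit"},
    "fish": {"algae"},
    "heron": {"fish", "rabbit"},
}

lost = cascade(eats, {"grass"})
print(sorted(lost))  # ['fox', 'grass', 'rabbit']
```

In this toy web the heron survives the loss of grass because it still has fish to eat, which shows how alternative food sources can stop a cascade.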

In the structure of neural connections within an organism's brain, nodes are neurons and an edge represents a connection between 2 neurons. The global brain architecture for the simple organism C. elegans (a 1 mm roundworm), with 302 nodes and c. 7000 edges, has been completely mapped. But detailed network pictures for brains of higher organisms are far beyond the state of the art. Still, significant insight has been gained by studying the structure of specific modules within a complex brain and understanding how they interrelate.

There are many ways to define the set of networks that make up a cell’s metabolism, but roughly: nodes are compounds that play a role in a metabolic process, and edges represent chemical interactions among them. The hope is that analysis of these networks can shed light on the complex reaction pathways and regulatory feedback loops that take place inside a cell and suggest network-centric attacks on pathogens that disrupt a cell’s metabolism.
