Computing Classic Closeness Centrality, at Scale Edith Cohen Joint with: Thomas Pajor, Daniel Delling, Renato Werneck Microsoft Research.

Presentation transcript:

Very Large Graphs
• Model many types of relations and interactions (edges) between entities (nodes)
  • Call detail, exchanges, Web links, Social Networks (friend, follow, like), Commercial transactions,…
• Need for scalable analytics:
  • Centralities/Influence (power/importance/coverage of a node or a set of nodes): ranking, viral marketing,…
  • Similarities/Communities (how tightly related are 2 or more nodes): Recommendations, Advertising, Prediction

Centrality
• Centrality of a node measures its importance. Applications: ranking, scoring, characterizing network properties.
• Several structural centrality definitions:
  • Betweenness: effectiveness in connecting pairs of nodes
  • Degree: activity level
  • Eigenvalue: reputation
  • Closeness: ability to reach/influence others

Closeness Centrality
Classic Closeness Centrality [Bavelas 1950, Beauchamp 1965, Sabidussi 1966]: (inverse of) the average distance to all other nodes.
The maximum-centrality node is the 1-median.
An importance measure of a node that is a function of the distances from the node to all other nodes.

Computing Closeness Centrality

Exact, but does not scale for many nodes on large graphs.
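As a point of reference, the exact computation can be sketched as one breadth-first search per node. This is a minimal illustration, assuming an unweighted, undirected, connected graph stored as an adjacency dictionary. The names `bfs_distances` and `exact_closeness` are hypothetical and do not come from the talk.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def exact_closeness(adj):
    """Classic closeness: inverse of the average distance to all other nodes.

    Runs one BFS per node, which is what makes the exact approach
    expensive when the graph has many nodes.
    """
    n = len(adj)
    result = {}
    for v in adj:
        total = sum(bfs_distances(adj, v).values())
        result[v] = (n - 1) / total if total > 0 else 0.0
    return result


# Illustrative input: the path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
centrality = exact_closeness(path)
```

On the path example the two inner nodes have the highest closeness, which matches the 1-median interpretation of the most central node.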

Goals

Algorithmic Overview
• Approach I: Sampling
  • Properties: good for “close” distances (concentrated around mean)
• Approach II: Pivoting
  • Properties: good for “far” distances (heavy tail)
• Hybrid: Best of all worlds

Approach I: Sampling [EW 2001, OCL 2008, Indyk 1999, Thorup 2001]
Computation
Estimation

Sampling

Sampling: Properties
Works well! The sample average captures the population average.

Sampling: Properties Heavy tail -- sample average has high variance
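The sampling approach can be sketched as follows: run a search from a small uniform sample of nodes and, for every node, average its distances to the sampled nodes. This is an illustration under assumptions (unweighted, undirected, connected graph; hypothetical function names), not the authors' implementation.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def sampled_average_distance(adj, sample_size, rng):
    """Estimate each node's average distance from sample_size searches.

    Because the graph is undirected, d(s, v) = d(v, s), so one search
    from a sampled node s contributes one sampled distance to every v.
    """
    nodes = list(adj)
    sample = rng.sample(nodes, sample_size)
    totals = {v: 0 for v in nodes}
    for s in sample:
        for v, d in bfs_distances(adj, s).items():
            totals[v] += d
    return {v: totals[v] / sample_size for v in nodes}


# Illustrative input: the path 0 - 1 - 2 - 3, sampling 2 of 4 nodes.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
estimates = sampled_average_distance(path, 2, random.Random(0))
```

When a few distances are much larger than the rest, whether those few land in the sample changes the average a lot, which is the high-variance behaviour noted on the slide.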

Approach II: Pivoting Computation Estimation

Pivoting

Inherit centrality of pivot (closest sampled node)
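A minimal sketch of the pivoting idea, assuming an unweighted, undirected, connected graph: each node takes the exact average distance of its closest sampled node. Function names are hypothetical. By the triangle inequality, the inherited average differs from the node's true average by at most the distance to the pivot.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def pivot_average_distance(adj, sample):
    """Each node inherits the exact average distance of its pivot."""
    n = len(adj)
    dist = {s: bfs_distances(adj, s) for s in sample}
    exact_avg = {s: sum(dist[s].values()) / n for s in sample}
    estimates = {}
    for v in adj:
        # The pivot is the sampled node closest to v.
        pivot = min(sample, key=lambda s: dist[s][v])
        estimates[v] = exact_avg[pivot]
    return estimates


# Illustrative input: the path 0 - 1 - 2 - 3 with end nodes sampled.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
pivot_estimates = pivot_average_distance(path, [0, 3])
```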

Pivoting: properties

WHP

Pivoting: properties
Bounded relative error for any instance! A property we could not obtain with sampling.

Pivoting vs. Sampling
But neither gives us a small relative error!

Hybrid Estimator
How to partition close/far?

Hybrid

Close nodes Far nodes

Close nodes

Far nodes
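The details of the close/far treatment did not survive in this transcript, so the following is only one plausible reading, written as a sketch under stated assumptions. Nodes far from the pivot contribute the pivot's distance to them, and nodes close to the pivot are estimated by scaling up the distances to the sampled nodes among them. The threshold rule (`threshold_factor` times the pivot distance) and all names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def hybrid_distance_sum(adj, sample, v, threshold_factor=2.0):
    """Estimate the sum of distances from v (undirected, connected graph)."""
    dist = {s: bfs_distances(adj, s) for s in sample}
    pivot = min(sample, key=lambda s: dist[s][v])
    # Assumed rule: "close" means within threshold of the pivot.
    threshold = threshold_factor * dist[pivot][v]
    close = [u for u in adj if dist[pivot][u] <= threshold]
    far = [u for u in adj if dist[pivot][u] > threshold]
    # Far nodes: substitute the pivot's distance for v's distance.
    far_sum = sum(dist[pivot][u] for u in far)
    # Close nodes: scale up distances from v to the sampled close nodes.
    # The pivot is always close to itself, so this list is never empty.
    close_sampled = [s for s in sample if dist[pivot][s] <= threshold]
    close_sum = len(close) / len(close_sampled) * sum(
        dist[s][v] for s in close_sampled
    )
    return far_sum + close_sum


# Illustrative input: the path 0 - 1 - 2 - 3 with node 1 sampled.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
estimate_for_sampled_node = hybrid_distance_sum(path, [1], 1)
```

In this sketch a sampled node is its own pivot at distance zero, so every other node counts as far and the estimate equals the exact distance sum.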

Analysis

Analysis (worst case)

Analysis
What about the guarantees (want confidence intervals)?

Adaptive Error Estimation
Idea: use the information we have on the actual distance distribution to obtain tighter confidence bounds for our estimate than the worst-case bounds.
• Close nodes: estimate population variance from samples.
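Estimating the population variance from the sampled distances can be illustrated with a generic textbook interval. The normal approximation and the default `z = 1.96` (roughly 95% coverage) are assumptions made for this sketch, and the function name is hypothetical; the talk does not state which bound it uses.

```python
import math
import statistics


def mean_with_confidence(sampled_distances, z=1.96):
    """Sample mean with a normal-approximation confidence interval."""
    count = len(sampled_distances)
    mean = statistics.mean(sampled_distances)
    # Unbiased (n - 1) estimate of the population variance.
    variance = statistics.variance(sampled_distances)
    half_width = z * math.sqrt(variance / count)
    return mean, mean - half_width, mean + half_width


# Illustrative sampled distances for one node.
mean, low, high = mean_with_confidence([1, 2, 3, 4, 5])
```

Tightly concentrated distances give a narrow interval, which is how an adaptive bound can be tighter than a worst-case one.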

Extension: Adaptive Error Minimization
For a given sample size (computation investment) and a given node, we can consider many thresholds for partitioning into close/far nodes.
• We can compute an adaptive error estimate for each threshold (based on what we know about the distance distribution).
• Use the estimate with the smallest estimated error.

Efficiency

Scalability: Using +O(1)/node memory

Hybrid slightly slower, but more accurate than sampling or pivoting

Directed graphs
• Sampling works (same properties) when the graph is strongly connected.
• Pivoting breaks, even with strong connectivity. Hybrid therefore also breaks.
• When the graph is not strongly connected, basic sampling also breaks: we may not have enough samples from each reachability set.
(Classic Closeness) Centrality is defined as (inverse of) the average distance to reachable (outbound distances) or reaching (inbound distances) nodes only.
We design a new sampling algorithm…

…Directed graphs
(Classic Closeness) Centrality is defined as (inverse of) the average distance to reachable (outbound distances) or reaching (inbound distances) nodes only.
• Process nodes u in random permutation order
• Run Dijkstra from u, prune at nodes already visited k times
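The two steps above can be sketched for an unweighted directed graph, with breadth-first search standing in for Dijkstra. Searching the reversed graph, so that each node collects samples from the set it can reach (outbound distances), is an assumption of this sketch, as are the function and variable names.

```python
import random
from collections import deque


def reachability_samples(out_adj, k, rng):
    """Give each node up to k sampled reachable nodes with distances."""
    # Reverse the edges: a search from u then visits nodes that reach u.
    in_adj = {v: [] for v in out_adj}
    for v, targets in out_adj.items():
        for w in targets:
            in_adj[w].append(v)

    order = list(out_adj)
    rng.shuffle(order)  # random permutation order
    samples = {v: [] for v in out_adj}
    for u in order:
        if len(samples[u]) >= k:
            continue  # u itself is already visited k times
        dist = {u: 0}
        queue = deque([u])
        while queue:
            x = queue.popleft()
            samples[x].append((u, dist[x]))  # d(x, u) in the original graph
            for y in in_adj[x]:
                # Prune at nodes that already hold k samples.
                if y not in dist and len(samples[y]) < k:
                    dist[y] = dist[x] + 1
                    queue.append(y)
    return samples


# Illustrative input: the directed chain 0 -> 1 -> 2 -> 3.
chain = {0: [1], 1: [2], 2: [3], 3: []}
chain_samples = reachability_samples(chain, 2, random.Random(0))
```

Every node on a shortest path from v to u reaches a subset of what v reaches, so it cannot be saturated before v is. Pruning therefore never blocks a shortest path to a node that still needs samples, and the recorded distances stay exact.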

Directed graphs: Reachability-sketch-based sampling is orders of magnitude faster, with only a small error.

Extension: Metric Spaces
Application: centrality with respect to round-trip distances in directed, strongly connected graphs:
• Perform both a forward and a backward Dijkstra from each sampled node.
• Compute round-trip distances, sort them, and apply the estimator to them.
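The round-trip computation from one sampled node can be sketched as a forward search on the graph plus a backward search on its reverse. The sketch assumes an unweighted, strongly connected directed graph and uses breadth-first search in place of Dijkstra; all names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def reverse_graph(out_adj):
    """Flip every edge so that a search follows edges backwards."""
    in_adj = {v: [] for v in out_adj}
    for v, targets in out_adj.items():
        for w in targets:
            in_adj[w].append(v)
    return in_adj


def roundtrip_distances(out_adj, source):
    """Sorted d(source, v) + d(v, source) over all nodes v.

    Assumes strong connectivity; otherwise some node would be
    missing from one of the two searches.
    """
    forward = bfs_distances(out_adj, source)
    backward = bfs_distances(reverse_graph(out_adj), source)
    return sorted(forward[v] + backward[v] for v in out_adj)


# Illustrative input: the directed cycle 0 -> 1 -> 2 -> 0.
cycle = {0: [1], 1: [2], 2: [0]}
roundtrips = roundtrip_distances(cycle, 0)
```

Round-trip distance is symmetric and satisfies the triangle inequality, which is what lets an estimator designed for metric spaces be applied to it.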

Extension: Node weights
Weighted centrality: nodes are heterogeneous. Some are more important, or more related to a topic. Weighted centrality emphasizes the more important nodes.
A variant of Hybrid with the same strong guarantees uses a weighted (VAROPT) sample of nodes instead of a uniform one.

Closeness Centrality
• Classic (penalize for far nodes)
• Distance-decay (reward for close nodes)
Different techniques required: All-Distances Sketches [C ’94] work for approximating distance-decay but not classic.

Summary

Thank you!