1 Link-Trace Sampling for Social Networks: Advances and Applications Maciej Kurant (UC Irvine) Join work with: Minas Gjoka (UC Irvine), Athina Markopoulou.

Slides:



Advertisements
Similar presentations
Mobile Communication Networks Vahid Mirjalili Department of Mechanical Engineering Department of Biochemistry & Molecular Biology.
Advertisements

Module B-4: Processing ICT survey data TRAINING COURSE ON THE PRODUCTION OF STATISTICS ON THE INFORMATION ECONOMY Module B-4 Processing ICT Survey data.
Benchmarking traversal operations over graph databases Marek Ciglan 1, Alex Averbuch 2 and Ladialav Hluchý 1 1 Institute of Informatics, Slovak Academy.
Analysis and Modeling of Social Networks Foudalis Ilias.
Sampling.
1 2.5K-Graphs: from Sampling to Generation Minas Gjoka, Maciej Kurant ‡, Athina Markopoulou UC Irvine, ETZH ‡
Practical Recommendations on Crawling Online Social Networks
Construction of Simple Graphs with a Target Joint Degree Matrix and Beyond Minas Gjoka, Balint Tillman, Athina Markopoulou University of California, Irvine.
Autocorrelation and Linkage Cause Bias in Evaluation of Relational Learners David Jensen and Jennifer Neville.
4. PREFERENTIAL ATTACHMENT The rich gets richer. Empirical evidences Many large networks are scale free The degree distribution has a power-law behavior.
1 Walking on a Graph with a Magnifying Glass Stratified Sampling via Weighted Random Walks Maciej Kurant Minas Gjoka, Carter T. Butts, Athina Markopoulou.
Random Walk on Graph t=0 Random Walk Start from a given node at time 0
Maciej Kurant (EPFL / UCI) Joint work with: Athina Markopoulou (UCI),
Sampling.
Sampling from Large Graphs. Motivation Our purpose is to analyze and model social networks –An online social network graph is composed of millions of.
Fundamentals of Sampling Method
Searching in Unstructured Networks Joining Theory with P-P2P.
Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,
Sunbelt 2009statnet Development Team ERGM introduction 1 Exponential Random Graph Models Statnet Development Team Mark Handcock (UW) Martina.
Sampling a web subgraph Paraskevas V. Lekeas Proceedings of the 5 th Algorithms, Scientific Computing, Modeling and Simulation (ASCOMS), Web conference,
Chapter Outline  Populations and Sampling Frames  Types of Sampling Designs  Multistage Cluster Sampling  Probability Sampling in Review.
QUIZ CHAPTER Seven Psy302 Quantitative Methods. 1. A distribution of all sample means or sample variances that could be obtained in samples of a given.
Models of Influence in Online Social Networks
Multigraph Sampling of Online Social Networks Minas Gjoka, Carter Butts, Maciej Kurant, Athina Markopoulou 1Multigraph sampling.
COLLECTING QUANTITATIVE DATA: Sampling and Data collection
Modeling Relationship Strength in Online Social Networks Rongjing Xiang: Purdue University Jennifer Neville: Purdue University Monica Rogati: LinkedIn.
Popularity versus Similarity in Growing Networks Fragiskos Papadopoulos Cyprus University of Technology M. Kitsak, M. Á. Serrano, M. Boguñá, and Dmitri.
Complex network geometry and navigation Dmitri Krioukov CAIDA/UCSD F. Papadopoulos, M. Kitsak, kc claffy, A. Vahdat M. Á. Serrano, M. Boguñá UCSD, December.
Using Transactional Information to Predict Link Strength in Online Social Networks Indika Kahanda and Jennifer Neville Purdue University.
1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),
Data Analysis in YouTube. Introduction Social network + a video sharing media – Potential environment to propagate an influence. Friendship network and.
© 2003 Prentice-Hall, Inc.Chap 7-1 Basic Business Statistics (9 th Edition) Chapter 7 Sampling Distributions.
WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS junction.
Poking Facebook: Characterization of OSN Applications Minas Gjoka, Michael Sirivianos, Athina Markopoulou, Xiaowei Yang University of California, Irvine.
Network Characterization via Random Walks B. Ribeiro, D. Towsley UMass-Amherst.
To Blog or Not to Blog: Characterizing and Predicting Retention in Community Blogs Imrul Kayes 1, Xiang Zuo 1, Da Wang 2, Jacob Chakareski 3 1 University.
Hung X. Nguyen and Matthew Roughan The University of Adelaide, Australia SAIL: Statistically Accurate Internet Loss Measurements.
Understanding Crowds’ Migration on the Web Yong Wang Komal Pal Aleksandar Kuzmanovic Northwestern University
How far removed are you? Scalable Privacy-Preserving Estimation of Social Path Length with Social PaL Marcin Nagy joint work with Thanh Bui, Emiliano De.
Chapter 7 The Logic Of Sampling. Observation and Sampling Polls and other forms of social research rest on observations. The task of researchers is.
Challenges and Opportunities Posed by Power Laws in Network Analysis Bruno Ribeiro UMass Amherst MURI REVIEW MEETING Berkeley, 26 th Oct 2011.
PREDIcT: Towards Predicting the Runtime of Iterative Analytics Adrian Popescu 1, Andrey Balmin 2, Vuk Ercegovac 3, Anastasia Ailamaki
5-4-1 Unit 4: Sampling approaches After completing this unit you should be able to: Outline the purpose of sampling Understand key theoretical.
Chap 7-1 Basic Business Statistics (10 th Edition) Chapter 7 Sampling Distributions.
Bruno Ribeiro Don Towsley University of Massachusetts Amherst IMC 2010 Melbourne, Australia.
BCS547 Neural Decoding. Population Code Tuning CurvesPattern of activity (r) Direction (deg) Activity
Sampling Techniques for Large, Dynamic Graphs Daniel Stutzbach – University of Oregon Reza Rejaie – University of Oregon Nick Duffield – AT&T Labs—Research.
Most of contents are provided by the website Graph Essentials TJTSD66: Advanced Topics in Social Media.
Statistics and Quantitative Analysis U4320 Segment 5: Sampling and inference Prof. Sharyn O’Halloran.
Data Collection & Sampling Dr. Guerette. Gathering Data Three ways a researcher collects data: Three ways a researcher collects data: By asking questions.
A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi.
Statistical Analysis Quantitative research is first and foremost a logical rather than a mathematical (i.e., statistical) operation Statistics represent.
Community-enhanced De-anonymization of Online Social Networks Shirin Nilizadeh, Apu Kapadia, Yong-Yeol Ahn Indiana University Bloomington CCS 2014.
Minas Gjoka, Emily Smith, Carter T. Butts
Methods of HRDU (PDU) population estimates Bruno Sopko, Ph.D. 11 th June 2014.
1 of 26Visit UMT online at Prentice Hall 2003 Chapter 7, STAT125Basic Business Statistics STATISTICS FOR MANAGERS University of Management.
Sampling technique  It is a procedure where we select a group of subjects (a sample) for study from a larger group (a population)
Topics Semester I Descriptive statistics Time series Semester II Sampling Statistical Inference: Estimation, Hypothesis testing Relationships, casual models.
Fast Parallel Algorithms for Edge-Switching to Achieve a Target Visit Rate in Heterogeneous Graphs Maleq Khan September 9, 2014 Joint work with: Hasanuzzaman.
1 Coarse-Grained Topology Estimation via Graph Sampling Maciej Kurant 1 Minas Gjoka 2 Yan Wang 2 Zack W. Almquist 2 Carter T. Butts 2 Athina Markopoulou.
Basic Business Statistics (8th Edition)
BroadNets 2004, October 25-29, San Jose
Empirical analysis of Chinese airport network as a complex weighted network Methodology Section Presented by Di Li.
4 Sampling.
An Online CNN Poll.
Chapter 8: Weighting adjustment
Sampling.
Model generalization Brief summary of methods
Statistical Inference
Network Models Michael Goodrich Some slides adapted from:
Presentation transcript:

1 Link-Trace Sampling for Social Networks: Advances and Applications Maciej Kurant (UC Irvine) Join work with: Minas Gjoka (UC Irvine), Athina Markopoulou (UC Irvine), Carter T. Butts (UC Irvine), Patrick Thiran (EPFL). Presented at Sunbelt Social Networks Conference February 08-13, 2011.

2 (over 15% of world’s population, and over 50% of world’s Internet users !) Online Social Networks (OSNs) > 1 b illion users October million2 200 million9 130 million million43 75 million10 75 million29 Size Traffic

Facebook: 500+M users 130 friends each (on average) 8 bytes (64 bits) per user ID The raw connectivity data, with no attributes: 500 x 130 x 8B = 520 GB This is neither feasible nor practical. Solution: Sampling! To get this data, one would have to download: 260 TB of HTML data!

Sampling Topology? What:

Sampling Topology? Nodes? What: Directly? How:

Topology? Nodes? What: Directly? Exploration? How: Sampling

E.g., Random Walk (RW) Topology? Nodes? What: Directly? Exploration? How: Sampling

8 q k - observed node degree distribution p k - real node degree distribution A walk in Facebook

9 Metropolis-Hastings Random Walk (MHRW): DAAC… … C C D D M M J J N N A A B B I I E E K K F F L L H H G G How to get an unbiased sample? S =

10 Metropolis-Hastings Random Walk (MHRW): DAAC… … C C D D M M J J N N A A B B I I E E K K F F L L H H G G 10 Re-Weighted Random Walk (RWRW): Introduced in [Volz and Heckathorn 2008] in the context of Respondent Driven Sampling Now apply the Hansen-Hurwitz estimator: How to get an unbiased sample? S =

11 Metropolis-Hastings Random Walk (MHRW):Re-Weighted Random Walk (RWRW): Facebook results

12 MHRW or RWRW ? ~3.0

13 RWRW > MHRW (RWRW converges 1.5 to 6 times faster) But MHRW is easier to use, because it does not require reweighting. MHRW or RWRW ? [1] Minas Gjoka, Maciej Kurant, Carter T. Butts and Athina Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM 2010.

RW extensions 1) Multigraph sampling

C C D D M M J J N N A A B B I I E E K K F F L L H H G G Friends C C D D M M J J N N A A B B I I E E K K F F L L H H G G Events C C D D M M J J N N A A B B I I E E K K F F L L H H G G Groups E.g., in LastFM

C C D D M M J J N N A A B B I I E E K K F F L L H H G G Friends C C D D M M J J N N A A B B I I E E K K F F L L H H G G Events C C D D M M J J N N A A B B I I E E K K F F L L H H G G Groups E.g., in LastFM

J J C C D D M M N N A A B B I I E E G * = Friends + Events + Groups ( G * is a multigraph ) F F L L H H G G K K 17 Multigraph sampling [2] Minas Gjoka, Carter T. Butts, Maciej Kurant, Athina Markopoulou, “Multigraph Sampling of Online Social Networks”, arXiv:

RW extensions 2) Stratified Weighted RW

Not all nodes are equal irrelevant important (equally) important Node categories: Stratification. Node weight is proportional to its sampling probability under Weighted Independence Sampler (WIS)

Not all nodes are equal But graph exploration techniques have to follow the links! We have to trade between fast convergence and ideal (WIS) node sampling probabilities Enforcing WIS weights may lead to slow (or no) convergence irrelevant important (equally) important Node categories:

Measurement objective E.g., compare the size of red and green categories.

Measurement objective Category weights optimal under WIS E.g., compare the size of red and green categories. Theory of stratification

Measurement objective Category weights optimal under WIS Modified category weights Limit the weight of tiny categories (to avoid “black holes”) Allocate small weight to irrelevant node categories Controlled by two intuitive and robust parameters E.g., compare the size of red and green categories.

Measurement objective Category weights optimal under WIS Modified category weights Edge weights in G Target edge weights 20 = 22 = 4 = Resolve conflicts: arithmetic mean, geometric mean, max, … E.g., compare the size of red and green categories.

Measurement objective Category weights optimal under WIS Modified category weights Edge weights in G WRW sample E.g., compare the size of red and green categories.

Measurement objective Category weights optimal under WIS Modified category weights Edge weights in G WRW sample Final result Hansen-Hurwitz estimator E.g., compare the size of red and green categories.

Stratified Weighted Random Walk (S-WRW) Measurement objective Category weights optimal under WIS Modified category weights Edge weights in G WRW sample Final result E.g., compare the size of red and green categories.

28 Colleges in Facebook versions of S-WRW Random Walk (RW) 3.5% of Facebook users are declare memberships in colleges S-WRW collects times more samples per college than RW This difference is larger for small colleges – stratification works! RW needs times more samples to achieve the same error! [3] Maciej Kurant, Minas Gjoka, Carter T. Butts and Athina Markopoulou, “Walking on a Graph with a Magnifying Glass”, to appear in SIGMETRICS 2011.

Part 2: What do we learn from our samples?

What can we learn from datasets? Node properties: Community membership information Privacy settings Names … Local topology properties: Node degree distribution Assortativity Clustering coefficient …

31 Probability that a user changes the default privacy settings PA = What can we learn from datasets? Example: Privacy Awareness in Facebook

32 number of sampled nodes total number of nodes (estimated) number of nodes sampled in B nodes sampled in A number of nodes sampled in A number of edges between node a and community B From a randomly sampled set of nodes we infer a valid topology! What can we learn from datasets? Coarse-grained topology A B Pr[ a random node in A and a random node in B are connected ]

33 US Universities

34 US Universities

Country-to-country FB graph Some observations: – Clusters with strong ties in Middle East and South Asia – Inwardness of the US – Many strong and outwards edges from Australia and New Zealand

36 Egypt Saudi Arabia United Arab Emirates Lebanon Jordan Israel Strong clusters among middle-eastern countries

Part 3: Sampling without repetitions:

Exploration without repetitions

Examples: RDS (Respondent-Driven Sampling) Snowball sampling BFS (Breadth-First Search) DFS (Depth-First Search) Forest Fire …

41 pkpk qkqk Why?

42 Graph model RG(p k ) Random graph RG(p k ) with a given node degree distribution p k

43 Graph traversals on RG(p k ): MHRW, RWRW - real average node degree - real average squared node degree. Solution (very briefly)

44 Graph traversals on RG(p k ): MHRW, RWRW - real average node degree - real average squared node degree. Solution (very briefly) RDS expected bias corrected

Solution (very briefly) 45 - real average node degree - real average squared node degree. Graph traversals on RG(p k ): For small sample size (for f→0), BFS has the same bias as RW. (observed in our Facebook measurements) This bias monotonically decreases with f. We found analytically the shape of this curve. MHRW, RWRW For large sample size (for f→1), BFS becomes unbiased. RDS expected bias corrected

46 What if the graph is not random? Current RDS procedure

Summary

C C D D M M J J N N A A B B I I E E K K F F L L H H G G C C D D M M J J N N A A B B I I E E K K F F L L H H G G C C D D M M J J N N A A B B I I E E K K F F L L H H G G J J C C D D M M N N A A B B I I E E F F L L G G K K H H Multigraph sampling [2]Stratified WRW [3] Random Walks References [1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM [2] M. Gjoka, C. T. Butts, M. Kurant and A. Markopoulou, “Multigraph Sampling of Online Social Networks”, arXiv: [3] M. Kurant, M. Gjoka, C. T. Butts and A. Markopoulou, “Walking on a Graph with a Magnifying Glass”, to appear in SIGMETRICS [4] M. Kurant, A. Markopoulou and P. Thiran, “On the bias of BFS (Breadth First Search)”, ITC 22, [5] M. Kurant, M. Gjoka, C. T. Butts and A. Markopoulou, “Estimating coarse-grained graphs of OSNs”, in preparation. [6] Facebook data: [7] Python code for BFS correction: RWRW > MHRW [1] The first unbiased sample of Facebook nodes [1,6] Convergence diagnostics [1]

J J C C D D M M N N A A B B I I E E F F L L G G K K H H References [1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM [2] M. Gjoka, C. T. Butts, M. Kurant and A. Markopoulou, “Multigraph Sampling of Online Social Networks”, arXiv: [3] M. Kurant, M. Gjoka, C. T. Butts and A. Markopoulou, “Walking on a Graph with a Magnifying Glass”, to appear in SIGMETRICS [4] M. Kurant, A. Markopoulou and P. Thiran, “On the bias of BFS (Breadth First Search)”, ITC 22, [5] M. Kurant, M. Gjoka, C. T. Butts and A. Markopoulou, “Estimating coarse-grained graphs of OSNs”, in preparation. [6] Facebook data: [7] Python code for BFS correction: Multigraph sampling [2]Stratified WRW [3] Graph traversals on RG(p k ): MHRW, RWRW [4,7] Random Walks RWRW > MHRW [1] The first unbiased sample of Facebook nodes [1,6] Convergence diagnostics [1] Traversals (no repetitions) RDS

J J C C D D M M N N A A B B I I E E F F L L G G K K H H References [1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM [2] M. Gjoka, C. T. Butts, M. Kurant and A. Markopoulou, “Multigraph Sampling of Online Social Networks”, arXiv: [3] M. Kurant, M. Gjoka, C. T. Butts and A. Markopoulou, “Walking on a Graph with a Magnifying Glass”, to appear in SIGMETRICS [4] M. Kurant, A. Markopoulou and P. Thiran, “On the bias of BFS (Breadth First Search)”, ITC 22, [5] M. Kurant, M. Gjoka, C. T. Butts and A. Markopoulou, “Estimating coarse-grained graphs of OSNs”, in preparation. [6] Facebook data: [7] Python code for BFS correction: Multigraph sampling [2]Stratified WRW [3] Graph traversals on RG(p k ): MHRW, RWRW A B [3,5] [4,7] Thank you! Random Walks Coarse-grained topologies RWRW > MHRW [1] The first unbiased sample of Facebook nodes [1,6] Convergence diagnostics [1] Traversals (no repetitions) RDS