1 Partitioning Social Networks for Fast Retrieval of Time-dependent Queries
Mindi Yuan, David Stein, Berenice Carrasco, Joana Trindade, Yi Lu
University of Illinois at Urbana-Champaign
Presenter: Sameh Elnikety, Microsoft Research
Email questions, praise and complaints to Prof. Yi Lu

2 Online Social Networking (OSN)
OSNs differ significantly from traditional web applications
– Handle highly personalized content: retrieved pages differ from user to user and also vary over short time intervals when new activities occur
– Highly interconnected data: immediate neighborhood and extended neighborhood (friend’s friend)
– Community: strong community structure among end users, albeit varying with time
Scaling OSNs is a problem
– Interconnected nature
– Astounding growth rate: Twitter grew by 10 times in one month and was forced to redesign and re-implement its architecture several times

3 Existing Approaches
Conventional vertical scaling
– Upgrade existing hardware
– Replicate fully on each machine
Problems
– Expensive because of the cost of high-performance servers
– Sometimes infeasible: Facebook requires multiple hundreds of terabytes of memory
Horizontal scaling
– Use a larger number of cheap commodity servers
– Partition the load among them
– Good for stateless traditional web applications
Problematic with a data back-end layer that maintains state
– OSN data cannot be partitioned into disjoint components
– Leads to costly inter-server communication

4 Inter-server communication
Distributed queries have been shown to reduce performance compared to local queries [Pujol et al. 2010][Curino et al. 2010]
Curino et al. reported that local queries double throughput
Question to consider: how far can we reduce inter-server communication, and at what cost?

5 Overview
1. Problem
2. Previous work
3. Our approach
4. Evaluation

6 Two extremes of the spectrum
Cassandra
– Random partitioning with distributed hashing
– Consistency is easy to maintain since only one replica is available
– Inter-server traffic increases at retrieval
– Slows down responses
– Leads to the “multi-get” hole problem seen at, e.g., Facebook, where increasing the number of servers actually reduces throughput
Replicating all users’ data on multiple servers
– Eliminates inter-server traffic but increases replication overhead
– Impossible to hold all data in memory on each server
– Delay in writing due to updates of all replicas
– Network traffic for maintaining consistency across replicas
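
To illustrate why random partitioning raises inter-server traffic at retrieval, the following minimal sketch counts how many servers a single multi-get has to contact as the cluster grows. It assumes a plain modulo-of-CRC32 placement and a hypothetical request for 50 friends' records; this is a simplification for illustration, not Cassandra's actual partitioner, and the key names are made up.

```python
import zlib


def server_for(key: str, n_servers: int) -> int:
    """Map a key to a server with a stable hash (CRC32 modulo, illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % n_servers


def servers_touched(keys, n_servers):
    """Return the set of servers that one multi-get over `keys` must contact."""
    return {server_for(k, n_servers) for k in keys}


# Hypothetical request: rendering one user's page needs data from 50 friends.
friend_keys = [f"user:{i}" for i in range(50)]

for n in (1, 4, 16, 64):
    contacted = len(servers_touched(friend_keys, n))
    print(f"{n} servers in cluster -> {contacted} contacted by one request")
```

With random placement, adding servers tends to spread one request over more machines rather than less, which is the effect behind the “multi-get” hole mentioned above.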

7 Little Engines
The little engine(s) that could: scaling online social networks. Pujol et al., SIGCOMM 2010.
Assumption
– Most operations are based on the data of a user and her neighbors
– The assumption is debatable: true for Twitter data, not true for Facebook data, where a two-hop network is accessed for most operations (expanded on next slide)
Guarantees that for all users in an OSN, their direct neighbors’ data are co-located on the same server
– Application developers can assume local semantics
– Scalability achieved using servers with low memory and network I/O
Drawback
– Popular data replicated on all servers
– Extension to two-hop networks will require a much larger number of replicas

8 Two-hop network
SPAR, proposed in Little Engines, considers the one-hop network
– Mainly focuses on Twitter-type models: a user follows all content from a group of senders
– SPAR was tested on Facebook messages, but they are still modeled as a user following all his friends, rather than the real-world scenario where a user can access related activities of friends’ friends
Two-hop networks
– For instance, Joana can access all activities between Jona and Nandana
[Figure: example network with users Adarsh, Jona, Nandana, Joana, and Naseer]

9 A Caveat
A two-hop network can be transformed into a one-hop network if each message is saved with both the sender and the receiver
However
– Data storage is immediately inflated by a factor of 2
– Popular receivers will experience a large number of writes
– Replicas only worsen the situation: popular receivers are likely to have more neighbors and hence more replicas, which entail an even larger number of writes to maintain consistency
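
The storage and write costs of this transformation can be seen in a small sketch. It assumes messages are (sender, receiver, text) tuples with distinct sender and receiver, and uses made-up sample data; the function names are hypothetical and only meant to make the factor of 2 and the write hot spot concrete.

```python
from collections import defaultdict


def store_with_sender_only(messages):
    """Keep one copy of each message, stored under its sender."""
    store = defaultdict(list)
    for sender, receiver, text in messages:
        store[sender].append((sender, receiver, text))
    return store


def store_with_both(messages):
    """Keep two copies: one under the sender and one under the receiver."""
    store = defaultdict(list)
    for sender, receiver, text in messages:
        store[sender].append((sender, receiver, text))
        store[receiver].append((sender, receiver, text))
    return store


def total_copies(store):
    """Total number of stored message copies across all users."""
    return sum(len(copies) for copies in store.values())


# Made-up sample: three users all post to one popular receiver.
messages = [
    ("Adarsh", "Nandana", "hi"),
    ("Jona", "Nandana", "hello"),
    ("Naseer", "Nandana", "hey"),
]

single = store_with_sender_only(messages)
double = store_with_both(messages)
print(total_copies(single), total_copies(double))  # storage doubles
print(len(double["Nandana"]))                      # writes landing on the popular receiver
```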

10 Overview
1. Problem
2. Previous work
3. Our approach
4. Evaluation

11 Our Approach
Goal: have most requests handled locally with far fewer replicas than the approach in “Little Engines”
– To have an approach that scales with two-hop networks
– To reduce the cost of maintaining consistency among replicas
Main idea
– Instead of partitioning the social network, which contains all users who could potentially send a message, we focus on the activity network, which contains the users actually sending messages
– Partitioning is based on actual activities in the network so that most requests can be served locally, which can be achieved with far fewer replicas than a strong locality guarantee requires

12 Activity Network 1/2
Contains far fewer edges than the social network
– More amenable to partitioning and replication
In a two-hop neighborhood, even if a user does not participate in much activity with his direct neighbors, there can still be abundant activity in the two-hop neighborhood
Facebook data for Dec 2006, New Orleans network
– While most user pairs have only 1 post, the top 8% of users receive > 100 posts in their two-hop neighborhood

13 Activity Network 2/2
There is strong time correlation among wall posts between the same pairs of users
– This motivates a local, dynamic partitioning algorithm that adapts to the temporal correlation
Computation of auto-correlation
– Gather all messages in the period, m(1), …, m(k)
– Compute the auto-correlation function
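
The exact auto-correlation formula used on the slide is not preserved in this transcript. As a minimal sketch, the code below applies the standard sample auto-correlation to a series of per-interval message counts for one user pair; treating m(1), …, m(k) as such counts is an assumption, and the sample series are made up.

```python
def autocorrelation(series, lag):
    """Standard sample auto-correlation of `series` at the given lag.

    Returns a value in [-1, 1]; 0.0 is returned for a constant series,
    where the quantity is undefined.
    """
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# Made-up counts of wall posts per interval between one pair of users.
bursty = [0, 0, 5, 4, 3, 0, 0, 0, 6, 5, 2, 0]   # activity arrives in bursts
alternating = [1, 0] * 6                        # activity flips every interval

print(autocorrelation(bursty, 1))       # positive: busy intervals follow busy ones
print(autocorrelation(alternating, 1))  # negative: a busy interval predicts a quiet one
```

A clearly positive value at small lags is the kind of temporal correlation that a local, adaptive scheme can exploit.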

14 Activity Prediction Graph (APG)
The example has 6 users in the network
There is a message node between each pair of user nodes
Each edge is assigned a weight that reflects
– Read frequency of a user (assume data collected)
– The likelihood of retrieving a message from this particular edge when the most recent N messages are retrieved from the two-hop neighborhood
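
The slide does not say how the two factors are combined into one weight. The sketch below is one possible reading, under loudly stated assumptions: a message node is a pair of users, it is linked to every user who can read it (the two endpoints and their direct friends), and the weight is the reader's read frequency multiplied by the pair's share of the past messages visible to that reader. The function name `build_apg` and all sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_apg(friends, messages, read_freq):
    """Build weighted (reader, message_node) edges from past activity.

    friends:   dict user -> set of direct friends
    messages:  list of (sender, receiver) pairs from the past period
    read_freq: dict user -> number of reads per period
    """
    # One message node per unordered user pair, with its past message count.
    pair_counts = Counter(frozenset(m) for m in messages)

    # Assumption: a pair's messages are visible to both endpoints and to
    # the direct friends of either endpoint.
    visible = defaultdict(dict)
    for pair, count in pair_counts.items():
        readers = set(pair)
        for user in pair:
            readers |= friends.get(user, set())
        for reader in readers:
            visible[reader][pair] = count

    # Assumed weight: read frequency times the pair's share of visible messages.
    edges = {}
    for reader, pairs in visible.items():
        total = sum(pairs.values())
        for pair, count in pairs.items():
            edges[(reader, pair)] = read_freq.get(reader, 0.0) * count / total
    return edges


friends = {"Joana": {"Jona"}, "Jona": {"Joana", "Nandana"}, "Nandana": {"Jona"}}
messages = [("Jona", "Nandana"), ("Nandana", "Jona"), ("Joana", "Jona")]
read_freq = {"Joana": 10.0, "Jona": 4.0, "Nandana": 1.0}

apg = build_apg(friends, messages, read_freq)
print(apg[("Joana", frozenset({"Jona", "Nandana"}))])
```

A graph partitioner can then try to keep heavy edges inside one partition, so that frequent readers sit next to the message nodes they are most likely to retrieve.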

15 Partitioning Algorithm
Periodic partitioning
– Use K-Metis to repartition the graph at fixed time epochs
– In between epochs, new users and new interactions are added to the graph to minimize partitioning cost
Local adaptive partitioning
– Takes advantage of strong time correlation between messages
– Triggered when a retrieval results in remote accesses
– Keeps movements of nodes across partitions small
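
The slide gives the trigger and the goal of local adaptive partitioning but not the move rule. The following sketch is therefore an assumption, not the authors' algorithm: when a retrieval touches remote partitions, it pulls the remote message nodes into the reader's partition, bounded by a per-partition `capacity` and a `max_moves` limit per trigger (both hypothetical parameters) so that node movement stays small.

```python
from collections import Counter


def adapt_on_retrieval(assignment, reader, retrieved, capacity, max_moves=2):
    """Move remotely stored nodes of one retrieval toward the reader.

    assignment: dict node -> partition id (updated in place)
    reader:     the user node issuing the retrieval
    retrieved:  message nodes returned by the retrieval, most recent first
    Returns the list of nodes that were moved.
    """
    home = assignment[reader]
    load = Counter(assignment.values())
    moved = []
    for node in retrieved:
        if len(moved) >= max_moves:
            break                      # keep movement per trigger small
        if assignment[node] == home:
            continue                   # already local, nothing to do
        if load[home] >= capacity:
            break                      # reader's partition is full
        load[assignment[node]] -= 1
        assignment[node] = home
        load[home] += 1
        moved.append(node)
    return moved


assignment = {"Joana": 0, "m1": 0, "m2": 1, "m3": 1, "m4": 2}
moved = adapt_on_retrieval(assignment, "Joana", ["m1", "m2", "m3", "m4"], capacity=4)
print(moved, assignment)
```

Because recent messages between a pair predict further messages between the same pair, pulling them local after one remote access is likely to make the next retrieval local as well.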

16 Overview
1. Problem
2. Previous work
3. Our approach
4. Evaluation

17 Evaluation Scenario
Each retrieval involves the most recent 6 messages in a two-hop neighborhood
We chose 6 because the available dataset is relatively small
– A total of 13948 messages with 8640 active users
We consider the following algorithms
– Random hashing based on sender (hash_p1)
– Random hashing based on sender-receiver pair (hash_p2)
– Periodic partitioning
– Local adaptive partitioning
– Retrospective algorithm: instead of using the APG, which uses past data to predict the activity for the current month, use the actual activity in the month. It is the best one can do given the limitations of the partitioning algorithm
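
The two hashing baselines and the locality metric reported in the results can be sketched as follows. The names hash_p1 and hash_p2 come from the slide; the CRC32-modulo placement, the ordered sender-receiver key, and the helper `proportion_within` are assumptions made for illustration.

```python
import zlib


def _stable_hash(key: str, n_partitions: int) -> int:
    """Deterministic hash placement (CRC32 modulo, illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def hash_p1(message, n_partitions):
    """Place a message by hashing its sender."""
    sender, _receiver = message
    return _stable_hash(sender, n_partitions)


def hash_p2(message, n_partitions):
    """Place a message by hashing its (ordered) sender-receiver pair."""
    sender, receiver = message
    return _stable_hash(f"{sender}|{receiver}", n_partitions)


def proportion_within(retrievals, place, n_partitions, max_partitions=1):
    """Share of retrievals whose messages span at most `max_partitions` partitions."""
    if not retrievals:
        return 0.0
    hits = sum(
        1
        for msgs in retrievals
        if len({place(m, n_partitions) for m in msgs}) <= max_partitions
    )
    return hits / len(retrievals)


# Made-up retrieval: the 6 most recent messages all come from one sender.
same_sender = [[("a", f"r{i}") for i in range(6)]]
print(proportion_within(same_sender, hash_p1, 8))                    # sender hashing keeps them together
print(proportion_within(same_sender, hash_p2, 8, max_partitions=3))  # pair hashing may scatter them
```

Calling `proportion_within` with `max_partitions=1` and `max_partitions=3` corresponds to the two proportions plotted on the results slides.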

18 Results
Proportion of retrievals that access only one partition for all 6 messages
The periodic algorithm performs much better than random hashing
It is close to the retrospective algorithm (retro), so the APG does provide relatively accurate predictions

19 Results
Proportion of retrievals that access at most 3 partitions for all 6 messages
Most retrievals touch at most 3 partitions

20 Results
The local adaptive algorithm further improves over periodic partitioning
At least 6.4 times more local queries than random hashing
Without causing large movements of nodes as in the periodic case

21 Results
We also compare periodic partitioning on the APG with periodic partitioning on a graph that uses weights from the one-hop neighborhood only
The retrievals are performed in the two-hop neighborhood
There is clearly an advantage in considering activities in the two-hop neighborhood when constructing the graph
The advantage becomes more significant as the number of partitions increases

22 Conclusion
We proposed a dynamic, adaptive algorithm to predict activities on an OSN and partition the network for fast retrieval of data
The algorithm scales when information needs to be retrieved from the two-hop neighborhood
Further work is done to improve data locality for most retrievals with a small number of replicas

