
1 The Little Engine(s) That Could: Scaling Online Social Networks
Josep M. Pujol, Vijay Erramilli, Georgos Siganos, Xiaoyuan Yang, Nikos Laoutaris, Parminder Chhabra, Pablo Rodriguez
Telefonica Research. From SIGCOMM 2010.

2 Introduction
Online Social Networks (OSNs) differ from traditional web applications in two ways:
o They handle highly personalized content.
o They deal with highly interconnected data, due to the strong community structure among their end users.
Scaling OSNs is necessary. E.g., Twitter grew by 1382% between Feb 2008 and Feb 2009.
Scaling solutions are either "vertical" or "horizontal":
o Vertical scaling: upgrade existing hardware to more powerful servers. (quickly becomes infeasible)
o Horizontal scaling: add cheap commodity servers. (works well only on stateless, independent data)

3 Introduction
OSN data is not independent, and it is hard to partition it into disjoint sets:
o Most operations in an OSN are based on the data of a user and her neighbors, and most users belong to more than one community.
Random partitioning is applied in many current OSNs.
o It produces heavy inter-server traffic.
Replicating every user's data on all servers (full replication):
o Eliminates inter-server traffic.
o Has many side effects. (storage overhead, query time, update traffic)

4 Outline
Introduction
SPAR And Problem Statement
SPAR On The Fly
Measurement
SPAR System Architecture And Evaluation
Related Work And Conclusions

5 SPAR
The main contribution of this paper: a Social Partitioning And Replication middleware for social applications.
What does SPAR do?
1. Solves the Designer's Dilemma for early-stage OSNs. (adding features vs. building a highly scalable system)
2. Avoids performance bottlenecks in established OSNs.
3. Minimizes the effect of provider lock-in.
How does SPAR do it?
o "Local semantics"
o Joint partitioning and replication.

6 SPAR
(a) Full replication.
(b) Partitioning using a DHT (random partitioning).
(c) Random partitioning (DHT) with replication of the neighbors.
(d) SPAR: socially aware partitioning and replication of the neighbors.

7 Problem Statement
Requirements:
1. Maintain local semantics.
2. Balance loads.
3. Be resilient to machine failures.
4. Be amenable to online operations.
5. Be stable.
6. Minimize replication overhead.
Replica: a copy of a user's data.
o The master replica serves reads/writes.
o Slave replicas provide redundancy and "local semantics".
Formulation:
o The solution can be formulated as an optimization problem that minimizes the number of required replicas, cast as an integer linear program.
o Symbols:
1. G(V,E): a social graph G with node set V and edge set E.
2. N = |V|, the number of users. M, the number of servers.

8 Problem Statement
Formulation
o Symbols:
3. p_ij, a binary decision variable that becomes 1 if and only if the primary (master) of user i is assigned to partition j, 1 ≤ j ≤ M.
4. r_ij, a similar decision variable for a replica of user i assigned to partition j.
5. ε_ii′ = 1 if {i,i′} ∈ E, capturing the friendship relationships.
6. K, the number of slaves per user required for redundancy.
o Constraints (see the reconstruction below):
1) Ensure one master copy.
2) Ensure local semantics.
3) Ensure load balance.
4) Ensure redundancy demand.
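The constraint equations themselves did not survive the slide transcript; the following is a hedged LaTeX reconstruction of the ILP from the symbols above. The exact form of the load-balance constraint (an even split of masters) is my assumption and may differ in detail from the paper's.

```latex
\begin{aligned}
\min_{p,\,r} \quad & \sum_{i=1}^{N} \sum_{j=1}^{M} r_{ij} \\
\text{s.t.} \quad
& \sum_{j=1}^{M} p_{ij} = 1 && \forall i && \text{(one master copy)} \\
& \varepsilon_{ii'}\, p_{ij} \le p_{i'j} + r_{i'j} && \forall i, i', j && \text{(local semantics)} \\
& \sum_{i=1}^{N} p_{ij} \le \left\lceil N/M \right\rceil && \forall j && \text{(load balance)} \\
& \sum_{j=1}^{M} r_{ij} \ge K && \forall i && \text{(redundancy)}
\end{aligned}
```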

9 Problem Statement
Graph partitioning algorithms and modularity optimization algorithms could be applied to this problem, however:
o Most graph partitioning algorithms are not incremental. (offline)
o Algorithms based on community detection are known to be very sensitive to input conditions. (not stable)
o Does reducing the number of inter-partition edges directly reduce the number of replicas? Not necessarily. E.g.:
o Partitions P1 and P2 produce 3 inter-partition edges but require 5 nodes to be replicated (e to i); partitions P3 and P4 produce 4 inter-partition edges but only 4 nodes need to be replicated (c to f).

10 Outline
Introduction
SPAR And Problem Statement
SPAR On The Fly
Measurement
SPAR System Architecture And Evaluation
Related Work And Conclusions

11 SPAR On The Fly
The online heuristic runs whenever the social network or the set of servers changes, on six events:
o Node addition, node removal, edge addition, edge removal, server addition, server removal.
Edge addition: when a new edge is created between nodes u and v (see the sketch after this list),
1. Check whether both masters are already co-located with each other, or each with a slave of the other; if not, continue.
2. Calculate the number of replicas that would be generated for each of the three possible configurations: 1) no movement of masters; 2) the master of u goes to the partition containing the master of v; 3) the opposite.
3. Choose the configuration that yields the smallest number of replicas, subject to the constraint of load-balancing the masters.
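Below is a minimal, hedged sketch of the edge-addition heuristic. The class and method names are mine, not from the paper, and the bookkeeping is simplified: it counts only the slaves created for local semantics, ignores K-redundancy and slave garbage collection, and assumes users were first registered via add_user.

```python
from collections import defaultdict

class SparSketch:
    """Toy model of SPAR's state; names and cost model are simplifications."""

    def __init__(self, n_servers):
        self.master = {}                 # user -> server holding its master
        self.slaves = defaultdict(set)   # user -> servers holding a slave
        self.friends = defaultdict(set)  # user -> neighbors
        self.load = {s: 0 for s in range(n_servers)}  # masters per server

    def add_user(self, u):
        # Node addition: new masters go to the least-loaded server.
        s = min(self.load, key=self.load.get)
        self.master[u] = s
        self.load[s] += 1

    def copies(self, u):
        return self.slaves[u] | {self.master[u]}

    def cost_if_master_at(self, u, target):
        """Slaves created if u's master sat on `target`: every neighbor
        needs a copy on `target`, and u needs a copy on every server
        hosting a neighbor's master."""
        u_copies = self.slaves[u] | {target}
        for_neighbors = sum(1 for v in self.friends[u]
                            if target not in self.copies(v))
        for_u = len({self.master[v] for v in self.friends[u]} - u_copies)
        return for_neighbors + for_u

    def on_edge_added(self, u, v):
        self.friends[u].add(v)
        self.friends[v].add(u)
        mu, mv = self.master[u], self.master[v]
        # Step 1: done if masters are co-located, or each master already
        # sits next to a copy of the other user.
        if mu == mv or (mv in self.copies(u) and mu in self.copies(v)):
            return
        # Step 2: replicas generated by each of the three configurations.
        options = sorted([
            (self.cost_if_master_at(u, mu), u, mu),  # no movement
            (self.cost_if_master_at(u, mv), u, mv),  # u joins v's server
            (self.cost_if_master_at(v, mu), v, mu),  # v joins u's server
        ])
        # Step 3: cheapest configuration that keeps masters balanced.
        min_load = min(self.load.values())
        for cost, node, server in options:
            if server == self.master[node] or self.load[server] <= min_load:
                self.place(node, server)
                return

    def place(self, node, server):
        """Move node's master to `server` (keeping the old master as a
        slave) and create the slaves that local semantics require."""
        old = self.master[node]
        if server != old:
            self.load[old] -= 1
            self.load[server] += 1
            self.master[node] = server
            self.slaves[node].discard(server)
            self.slaves[node].add(old)
        for v in self.friends[node]:
            if server not in self.copies(v):
                self.slaves[v].add(server)
            if self.master[v] not in self.copies(node):
                self.slaves[node].add(self.master[v])
```

On the next slide's example, moving node 6's master to M1 would win step 2 but be rejected by the load-balance check in step 3.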

12 SPAR On The Fly
Edge addition, three configurations.
Moving node 6 to M1 would minimize the total number of replicas; however, such a configuration violates the load-balancing condition.

13 SPAR On The Fly
Server addition, two choices: 1) force a redistribution of masters from the other servers to the new one, or 2) do nothing and let new arrivals fill it.
o In the first choice, the algorithm (sketched below):
1. Selects the least-replicated masters from the existing M servers and moves them to the new server M+1.
2. For each moved master, ensures there is a slave replica in the original server.
3. Replays the edge-creation events for a fraction of the moved masters' edges.
Server removal: the algorithm
1. Reallocates the master nodes hosted on that server evenly across the remaining M-1 servers.
o Highly connected nodes, with potentially many replicas to be moved, get to choose their destination server first.
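A hedged sketch of server addition (choice 1 above), reusing the SparSketch class from the previous example. The "least-replicated masters first" rule is from the slide; the fair-share target and the 50% default replay fraction are illustrative assumptions, not values from the paper.

```python
import random

def add_server(state, new_server, recent_edges, replay_fraction=0.5):
    """Force-redistribute masters onto a newly added server."""
    state.load[new_server] = 0
    fair_share = len(state.master) // len(state.load)  # ~N / (M+1)
    # 1. Masters with the fewest slaves are the cheapest to move.
    movable = sorted(state.master, key=lambda u: len(state.slaves[u]))
    for u in movable[:fair_share]:
        # 2. place() keeps a slave on the original server for each
        #    moved master.
        state.place(u, new_server)
    # 3. Replay a fraction of the edge-creation events so the heuristic
    #    can re-optimize placements around the moved masters.
    sample = random.sample(recent_edges,
                           int(replay_fraction * len(recent_edges)))
    for u, v in sample:
        state.on_edge_added(u, v)
```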

14 Outline
Introduction
SPAR And Problem Statement
SPAR On The Fly
Measurement
SPAR System Architecture And Evaluation
Related Work And Conclusions

15 Measurement
Datasets
o Twitter: crawled between Nov 25 and Dec 4, 2008; comprises 2,408,534 nodes, 48,776,888 edges, and 12M tweets.
o Facebook: the New Orleans Facebook network, collected between Dec 2008 and Jan 2009; includes nodes, friendship links, and wall posts; comprises 59,297 nodes and 477,993 edges.
o Orkut: collected between Oct 3 and Nov 11, 2006; comprises 3,073,441 nodes and 223,534,301 edges.
Evaluation Procedure
o Input: a social graph, the number of desired partitions M, and the desired minimum number of replicas per user profile K.
o The partitions are produced by executing each of the algorithms on the input.
o For the offline algorithms, since local semantics are required, replicas are added as necessary.
o An edge-creation trace is generated for each dataset.

16 Measurement
Metrics: replication overhead r_o
o The number of slave replicas that need to be created per user to guarantee local semantics while observing the K-redundancy condition.
o SPAR naturally achieves a low replication overhead, since this is its optimization objective; the competing algorithms optimize for minimal inter-server edges instead.
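In terms of the decision variables from slide 8, and consistent with the per-user averages quoted on the following slides (e.g., 3.69), replication overhead can be written as the average number of slaves per user; treating it as a per-user average is my reading of the slide:

```latex
r_o = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} r_{ij}
```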

17 Measurement
Comparison of Random vs. SPAR for K = 2.
Dynamic operations:
1. Load balance of SPAR
o Twitter dataset with 128 servers and K = 2.
o Average replication overhead = 3.69; 75.8% of users have 3 replicas.
o Aggregate read load of servers: the coefficient of variation (COV) of masters among servers is 0.0019.
o The COV of writes per server is 0.37.
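COV here is the standard coefficient of variation, i.e. the dispersion of the per-server load normalized by its mean, so a value of 0.0019 means the per-server master counts are nearly identical:

```latex
\mathrm{COV} = \frac{\sigma}{\mu}
\qquad \text{(standard deviation of the per-server load over its mean)}
```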

18 Measurement
2. Actions taken by SPAR when an edge-creation event occurs. (K = 2 and 16 servers)
3. The movement cost imposed on the system.

19 Measurement
Adding servers:
1. Wait for new arrivals:
o Go from 16 to 32 servers by adding one server every 150K new users.
=> overhead = 2.78, vs. 2.74 had the system started with 32 servers.
=> COV of the number of masters among servers = 0.004.
2. Re-distribute existing masters:
o Double the number of servers at once, force a re-distribution of the masters, and replay 50% of the edge creations. => overhead = 2.82, 2% higher than starting with 32 servers.
Removing a server:
o Overhead increases only marginally, from 2.74 to 2.87.
o Replaying the edge creations brings it down to 2.77.

20 Outline
Introduction
SPAR And Problem Statement
SPAR On The Fly
Measurement
SPAR System Architecture And Evaluation
Related Work And Conclusions

21 SPAR System Architecture And Evaluation
The only interaction between the application and SPAR is through the middleware.
Four components of SPAR: the Directory Service (DS), the Local Directory Service (LDS), the Partition Manager (PM), and the Replication Manager (RM).

22 SPAR System Architecture And Evaluation
1. Directory Service (DS): handles data distribution; implemented as a DHT.
o Look-up table from a user's key to the server hosting her master.
o Look-up table from a user to her replica servers.
2. Local Directory Service (LDS): contains a partial view of the DS.
3. Partition Manager (PM): runs the SPAR algorithm and performs the following functions:
o Maps a user's key to her replicas, whether master or slaves.
o Re-distributes replicas and schedules their movement.
o Handles adding and removing servers.
o It is the only component that can update the DS.
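A hedged sketch of the two DS look-up tables named above, kept as plain dicts for illustration; in the paper the DS itself is implemented as a DHT. All names here (key_of, master_of, replicas_of, route_read, route_write) are mine, not from the paper.

```python
import hashlib

def key_of(user_id: str) -> str:
    # Hashing the user id yields the key under which the DS stores her.
    return hashlib.sha1(user_id.encode()).hexdigest()

master_of: dict[str, str] = {}         # key -> server hosting the master
replicas_of: dict[str, set[str]] = {}  # key -> servers hosting the slaves

def route_read(user_id: str) -> str:
    """Reads go to the master's server; local semantics guarantee that
    the neighbors' data is already there."""
    return master_of[key_of(user_id)]

def route_write(user_id: str) -> tuple[str, set[str]]:
    """Writes hit the master; the Replication Manager then propagates
    the update to the slave servers."""
    k = key_of(user_id)
    return master_of[k], replicas_of.get(k, set())
```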

23 SPAR System Architecture And Evaluation
4. Ensures data consistency while moving and writing.
5. Heart-beat system to monitor server failures. (sketched below)
o Permanent failure: treated as a server removal.
o Transient failure: handled without triggering the re-creation of neighbors, temporarily sacrificing some local data semantics.
SPAR Evaluation
o On two data stores: MySQL (RDBMS) and Cassandra (key-value).
o Uses an open-source Twitter clone called Statusnet.
o Statusnet's canonical operation: retrieving the last 20 updates for a given user.
o Testbed: a cluster of 16 low-end commodity servers.
1. Evaluation with Cassandra
o Vanilla Cassandra uses random partitioning.
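A hedged sketch of the heart-beat classification named on this slide. The timeout thresholds and function name are assumptions for illustration, not values from the paper.

```python
import time

TRANSIENT_TIMEOUT = 5.0    # seconds of silence before a failure is suspected
PERMANENT_TIMEOUT = 300.0  # seconds of silence before it is deemed permanent

def check_server(last_heartbeat: float, now: float | None = None) -> str:
    """Classify one server as 'ok', 'transient', or 'permanent'."""
    now = time.time() if now is None else now
    silence = now - last_heartbeat
    if silence < TRANSIENT_TIMEOUT:
        return "ok"         # server is alive
    if silence < PERMANENT_TIMEOUT:
        # Tolerated without recreating the neighbors' replicas; some
        # local semantics may be sacrificed until the server returns.
        return "transient"
    # Treated like a server removal: masters are reassigned.
    return "permanent"
```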

24 SPAR System Architecture And Evaluation
Response time of SPAR vs. vanilla Cassandra.
Multi-get requests increase the network load under random partitioning.

25 SPAR System Architecture And Evaluation
2. Evaluation with MySQL
Uses Tsung and two servers to emulate users.
Comparison to full replication:
o Full replication: 113ms at 16 req/s, 151ms at 150 req/s, 245ms at 320 req/s. (95th percentile of response time)
o SPAR: 150ms at 2500 req/s. (99th percentile)
Adding writes:
o Full replication performs very poorly.
o With 16 updates/s: 260ms at 50 read req/s and 380ms at 200 read req/s. (95th percentile)
o Median response time = 2ms.

26 Outline
Introduction
SPAR And Problem Statement
SPAR On The Fly
Measurement
SPAR System Architecture And Evaluation
Related Work And Conclusions

27 Related Work
This is the first work to address the problem of scaling the data back-end of OSNs. Other related areas:
Scaling out:
o Amazon EC2 and Google App Engine. (stateless)
Key-value stores:
o Many OSNs today rely on DHTs and key-value stores. SPAR improves performance over key-value stores.
Distributed file systems and databases:
o Ficus and Coda are distributed file systems that replicate files for high availability.
Distributed RDBMS: MySQL Cluster and Bayou.
o SPAR doesn't distribute data, but maintains it locally via replication.

28 Conclusions
User data in OSNs is highly interconnected and hard to subject to a clean partition.
SPAR ensures "local semantics":
o All relevant data of a user's direct neighbors is co-located on the same server hosting the user.
"Local semantics" enables:
o Queries to be resolved locally on a server, breaking the dependency between users.
o Transparent scaling of the OSN at a low cost.
o A multifold increase in throughput (requests per second), since network I/O is avoided.
Validated SPAR:
o Replication overhead is low.
o It deals gracefully with the dynamics of an OSN.
o A Twitter-like application was implemented and used to evaluate SPAR.

