
1 PRACTI Replication: Towards a Unified Theory of Replication. Nalini Belaramani, Mike Dahlin, Lei Gao, Amol Nayate, Arun Venkataramani, Praveen Yalagandula, Jiandan Zheng. Department of Computer Sciences, University of Texas at Austin. 2nd August 2006.

2 Replication Systems Galore

3 Server Replication: Bayou [Terry et al. 95]. All servers hold the full set of data. Nodes exchange the updates made since the previous synchronization. Any server can exchange updates with any other server. Eventually, the nodes agree on the order of updates to the data. [Figure: reads and writes served at any replica]
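The anti-entropy exchange above can be sketched in a few lines of Python. This is a toy model with an invented Node class, not Bayou's actual implementation: every node holds the full log, any pair of nodes can synchronize, and applying writes in a global (timestamp, origin) order makes replicas converge on the same final values.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.log = {}  # (timestamp, origin) -> (key, value) written

    def write(self, ts, key, value):
        self.log[(ts, self.name)] = (key, value)

    def anti_entropy(self, peer):
        # Exchange updates made since the previous synchronization:
        # each side adopts the log entries it has not yet seen.
        for stamp, update in peer.log.items():
            self.log.setdefault(stamp, update)
        for stamp, update in self.log.items():
            peer.log.setdefault(stamp, update)

    def state(self):
        # Apply writes in (timestamp, origin) order, so any two nodes
        # that have seen the same writes agree on the final values.
        data = {}
        for _, (key, value) in sorted(self.log.items()):
            data[key] = value
        return data

a, b, c = Node("A"), Node("B"), Node("C")
a.write(1, "x", "from-A")
c.write(2, "x", "from-C")
a.anti_entropy(b)   # A <-> B
b.anti_entropy(c)   # B <-> C: C's write reaches B, A's write reaches C
a.anti_entropy(b)   # second round: C's write reaches A
```

Note that convergence needed a second round of gossip: updates spread pairwise, which is exactly why any-to-any exchange (topology independence) matters.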

4 Client-Server Model: Coda [Kistler et al. 92]. Data is cached on the client machine. Callbacks are established for notification of changes. Clients can get updates only from the server. [Figure: client reads and writes object A; the server notifies cachers when A is modified]
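The callback mechanism can be sketched as follows (hypothetical Server/Client classes, not the real Coda code): the server remembers which clients cache each object, and when one client writes, every other caching client's callback is "broken" by an invalidation.

```python
class Server:
    def __init__(self):
        self.data = {}
        self.callbacks = {}  # obj -> set of clients caching it

    def fetch(self, client, obj):
        # Client caches the object; a callback is established.
        self.callbacks.setdefault(obj, set()).add(client)
        return self.data.get(obj)

    def store(self, writer, obj, value):
        self.data[obj] = value
        # Break callbacks: every *other* caching client is notified.
        for client in self.callbacks.pop(obj, set()):
            if client is not writer:
                client.invalidate(obj)

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, obj):
        if obj not in self.cache:          # miss: go to the server
            self.cache[obj] = self.server.fetch(self, obj)
        return self.cache[obj]

    def write(self, obj, value):
        self.cache[obj] = value
        self.server.store(self, obj, value)

    def invalidate(self, obj):
        self.cache.pop(obj, None)          # callback broken: drop copy

srv = Server()
c1, c2 = Client(srv), Client(srv)
c1.write("A", "v1")
c2.read("A")        # c2 caches A; callback established
c1.write("A", "v2") # server breaks c2's callback
```

After the second write, c2's cached copy is gone, and its next read must go back to the server, which is the limitation the later slides contrast against.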

5 File System for PlanetLab. Data is replicated on geographically distributed nodes. Updates need to be propagated from node to node. Strong consistency must be maintained, depending on the application. Some file systems assume complete connectivity among nodes.

6 Personal File System. Data lives at multiple locations: desktop, server, laptop, PDA, a colleague's laptop. Desirable properties: download updates only for the data I want; no need to connect to the server for every update; some consistency guarantee.

7 See a Similarity? They are all data replication systems: data is replicated on multiple nodes. They differ in how much data is replicated at each node, who each node talks to, and what consistency is guaranteed. There are many existing replication systems: 14 systems in SOSP/OSDI in the last 10 years. A new application or new domain -> build a system from scratch. Need characteristics from different systems -> build a system from scratch.

8 Motivation. What if we had a toolkit that supports the mechanisms required for replication systems? Mix and match mechanisms to build a system for your requirements; pay only for what you need. We would then have a way to build better replication systems, and a better way to build replication systems.

9 Our Work. Three properties characterize replication systems: PR (Partial Replication), AC (Arbitrary Consistency), TI (Topology Independence). Mechanisms to support these properties: the PRACTI prototype subsumes existing replication systems and offers better trade-offs. Policy elegantly characterized: policy as topology; concise declarative rules + configuration parameters.

10 Grand Challenge. How can I convince you? Better trade-offs: build the 14 OSDI/SOSP systems on the prototype, with fewer than 1,000 lines of code each. [Table: synchronization time (s) at Office/Home/Hotel/Plane for PRACTI vs. client-server vs. full replication]

11 Outline. PRACTI taxonomy. Achieving PRACTI: the PRACTI prototype; evaluation. Making policy easier: building on PRACTI; policy as topology. Ongoing and future work.

12 PRACTI Taxonomy: Characterizing Replication Systems

13 PRACTI Taxonomy. Topology Independence: any node can communicate with any other node. Arbitrary Consistency: support the consistency requirements of the application. Partial Replication: replicate any subset of data to any node.

14 PRACTI Taxonomy. Topology Independence: any node can communicate with any other node. Arbitrary Consistency: support the consistency requirements of the application. Partial Replication: replicate any subset of data to any node. Existing systems cover only pairs of these properties: hierarchy / client-server (e.g. Coda, Hier-AFS); DHT (e.g. CFS, PAST); object replication (e.g. Ficus, Pangaea); server replication (e.g. Bayou, TACT).

15 PRACTI Taxonomy. The same taxonomy, with PRACTI at the center: it provides Topology Independence, Arbitrary Consistency, and Partial Replication simultaneously, unlike the prior systems (Coda, Hier-AFS; CFS, PAST; Ficus, Pangaea; Bayou, TACT), each of which gives up one property.

16 Why is PRACTI Hard? [Figure: a timeline of writes to project/module/a and project/module/b propagating to nodes that replicate different subsets (a, b, z); later reads of module A and module B must still observe a consistent ordering of the writes]

17 Achieving PRACTI: PRACTI Prototype and Evaluation

18 Step 1: Peer-to-Peer Log Exchange

19 Peer-to-Peer Log Exchange [Petersen et al. 97]. Nodes perform log exchanges to propagate updates; the order of updates is maintained. [Figure: Node A and Node B, each with a log and a checkpoint, exchanging writes]

20 Peer-to-Peer Log Exchange. Log exchanges propagate updates. TI: pairwise exchange with any peer. AC: careful ordering of updates in the logs; the prefix property gives causal/eventual consistency and supports a broad range of consistency levels [Yu and Vahdat 2002]. But no PR: all nodes store all data and see all updates. [Figure: four nodes exchanging full logs]
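The prefix property above can be sketched with a toy exchange function (hypothetical, not the prototype's code): logs are kept per origin node, and a receiver always extends its copy of an origin's write sequence by a suffix, never out of order.

```python
def exchange(receiver, sender):
    # receiver/sender map an origin node to the ordered list of its
    # writes. Each origin's writes are appended in order past what the
    # receiver already holds, so the receiver's copy is always a prefix
    # of the origin's sequence -- the basis for causal and eventual
    # consistency under pairwise exchange.
    for origin, writes in sender.items():
        seen = receiver.setdefault(origin, [])
        seen.extend(writes[len(seen):])
    return receiver

# TI: any pair of nodes may exchange logs, in any order.
n1 = {"A": ["a1", "a2"]}
n2 = {"B": ["b1"]}
exchange(n2, n1)
exchange(n1, n2)
```

The PR cost is visible here too: every node ends up holding every origin's full write sequence.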

21 Step 2: Separation of Metadata and Data Paths

22 Separate Data and Metadata Paths. Log exchange carries ordered streams of metadata (invalidations): Invalidation = <objId, time>; Write = <objId, time, body>. All nodes (logically) see all invalidations. Checkpoints track which objects are VALID. Nodes receive only the bodies they are interested in. [Figure: Node A sends Node B the invalidation stream plus only the bodies B subscribes to; a read of foo fetches its body on demand]

23 Separate Data and Metadata Paths. With separate data and metadata paths: TI: pairwise exchange with any peer. AC: careful ordering of updates in the logs. Partial PR: partial replication of bodies, but still full replication of invalidations. [Figure: invalidation streams reach all four nodes; bodies flow only where subscribed]
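A minimal sketch of this split, with an invented Replica class: every node applies every invalidation (metadata path), but it stores bodies only for objects in its subscription set (data path), marking everything else INVALID.

```python
class Replica:
    def __init__(self, interest):
        self.interest = interest   # set of objects this node replicates
        self.store = {}            # obj -> ("VALID", body) or ("INVALID", None)

    def recv_inval(self, obj):
        # Metadata path: all nodes see all invalidations, so the node
        # always knows its copy of obj (if any) is out of date.
        self.store[obj] = ("INVALID", None)

    def recv_body(self, obj, body):
        # Data path: bodies flow only to subscribed nodes.
        if obj in self.interest:
            self.store[obj] = ("VALID", body)

n = Replica(interest={"foo"})
for obj, body in [("foo", "f1"), ("bar", "b1")]:
    n.recv_inval(obj)
    n.recv_body(obj, body)
```

The node stays safe for consistency (it knows "bar" is stale) while paying bandwidth and storage only for "foo".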

24 Step 3: Summarize Unneeded Metadata

25 Summarize Unneeded Metadata. An imprecise invalidation is a summary of a group of invalidations: "one or more objects in objectSet were modified between startTime and endTime." It is a conservative summary: objectSet may be a superset of the actual targets. It is a compact encoding of a large number of invalidations.
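The summarization step can be sketched directly from that definition (a hypothetical encoding, not the prototype's wire format): collapse a batch of precise <object, time> invalidations into one conservative <objectSet, start, end> record.

```python
def summarize(invals):
    # invals: list of precise invalidations as (object, time) pairs.
    # The result covers every input: "one or more objects in objectSet
    # were modified between start and end". objectSet may be widened
    # further (e.g. to a directory subtree) and remain correct, since
    # the summary is allowed to be a superset of the actual targets.
    objs = {obj for obj, _ in invals}
    times = [t for _, t in invals]
    return {"objectSet": objs, "start": min(times), "end": max(times)}

precise = [("proj/a", 3), ("proj/b", 5), ("proj/a", 9)]
imprecise = summarize(precise)
```

Three log entries become one fixed-size record; for thousands of invalidations to objects a node does not care about, the saving is what makes partial replication of metadata practical.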

26 Summarize Unneeded Metadata (2). Imprecise invalidations, e.g. II = <objectSet, startTime, endTime>, act as placeholders in the log and the checkpoint. The receiver knows that it is missing information, and it blocks operations that depend on the missing information until a subscription fills it in. [Figure: Node B holds an imprecise invalidation; a read of foo blocks until B subscribes for the missing precise invalidations]

27 Summarize Unneeded Metadata (3). With summarized metadata: TI: pairwise exchange with any peer. AC: careful ordering of updates in the logs. PR: partial replication of bodies and partial replication of invalidations. [Figure: imprecise invalidations stand in for the streams a node does not need]

28 Summary of Approach. Three key ideas: peer-to-peer log exchange; separation of data and metadata paths; summarizing unneeded metadata. [Figure: four nodes exchanging full logs]

29 Summary of Approach (continued). [Figure: the same four nodes, now with separate invalidation and body streams]

30 Summary of Approach (continued). [Figure: imprecise invalidations summarize the streams a node does not need]

31 Why is this Better? How to evaluate? Compare with: AC+TI server replication (e.g., Bayou, TACT); PR+AC client-server (e.g., Coda, NFS); PR+TI object replication (e.g., Ficus, Pangaea). Key question: does the system provide significant advantages? Prototype benchmarking: Java + Berkeley DB.

32 PRACTI vs. Client-Server vs. Full Replication. Setup (nodes connected over the Internet; the laptop is at a hotel):

Node          | Storage | Dirty Data | Wireless | Internet
Office server | 1 TB    | 100 MB     | 10 Mb/s  | 100 Mb/s
Home desktop  | 10 GB   | 10 MB      | 10 Mb/s  | 1 Mb/s
Laptop        | 10 GB   | 10 MB      | 10 Mb/s  | 50 Kb/s (hotel)
Palmtop       | 100 MB  | 100 KB     | 1 Mb/s   | N/A

33 Synchronization Time: Palmtop and Laptop. Client-server (e.g., Coda): limited by the network to the server; not an attractive solution. Full replication (e.g., Bayou): limited by the fraction of shared data; not a feasible solution. PRACTI: up to an order of magnitude better; it does what you want. [Table: synchronization time (s) at Office/Home/Hotel/Plane for PRACTI vs. client-server vs. full replication]

34 Making Policy Easier: Building on PRACTI; Policy as Topology

35 PRACTI as a Toolkit. The PRACTI prototype provides all three properties, subsumes existing replication systems, and gives you the mechanisms. Implement policy over PRACTI to obtain different systems: Bayou, Coda, a PlanetLab FS, a personal FS, ... (policy) all layered on the PRACTI prototype (mechanism).

36 System Overview. The PRACTI core implements the mechanisms: the local interface (Read(), Write(), Delete()) and the inval and body streams carrying requests to and from remote cores. The controller implements the policy, and talks to the core through the controller interface via asynchronous message passing (requests and events). [Figure: core with local interface and streams; controller attached via the controller interface]

37 PRACTI Basics. Subscription streams: two types of streams, inval streams and body streams. Every stream is associated with a subscription set. Received invals and bodies are forwarded to the appropriate outgoing streams. The controller implements the policy: whom to establish subscriptions to, what to do on a read miss, whom to send updates to.

38 Controller Interface. Notification of key events: stream begin/end, invalidation arrival, body arrival, local read miss, became precise, became imprecise. Directs communication among cores: subscribe to an inval or body stream; request a demand read for a body. Local housekeeping: log garbage collection, cache replacement.
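The shape of this interface can be sketched as an event-driven policy object. The method and class names below are hypothetical, modeled on the events and requests listed above rather than taken from the prototype: the core notifies the controller of events, and the controller responds by directing communication.

```python
class Controller:
    """Policy: reacts to core events, issues requests back to the core."""
    def __init__(self, core):
        self.core = core

    def on_local_read_miss(self, obj):
        # Policy decision: where to fetch a missing body from.
        self.core.demand_read(self.pick_source(obj), obj)

    def on_inval_arrival(self, obj):
        pass  # e.g. trigger a prefetch or cache replacement

    def pick_source(self, obj):
        return "server"  # trivial client-server policy

class RecordingCore:
    """Stand-in for the PRACTI core; records the controller's requests."""
    def __init__(self):
        self.requests = []
    def demand_read(self, node, obj):
        self.requests.append((node, obj))

core = RecordingCore()
ctrl = Controller(core)
ctrl.on_local_read_miss("foo")
```

Swapping in a different pick_source (nearest peer, hoard server, random neighbor) changes the system's behavior without touching the mechanisms, which is the point of the split.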

39 Not That Simple Yet. We still need to take care of stream management, timeouts, etc. In some systems, the arrival of a body or inval requires special processing; read misses occur and need to be dealt with; the replication set may be based on priorities or access patterns. Policy currently means 39 methods to do the magic. Can we make it easier?

40 Policy as Topology: Characterizing Policy Elegantly

41 Policy & Topology. Overlay topology question: among all the nodes I could be connected to, with whom do I communicate? Replication policy questions: if data is not available locally, whom do I contact? If data is locally updated, to whom do I send updates and invalidates? From whom do I prefetch? These reduce roughly to: topology + replication set + consistency semantics + configuration parameters.

42 Policy Revisited. Policy separates into several dimensions. Propagation of updates -> topology: if there are updates, or if I have a local read miss, whom do I contact? Consistency requirements -> local interface: can we read stale/invalid data, and how stale? Replication of data -> config file: what subset of data does each node hold? Other policy essentials -> config file: how long is the timeout? How many times to retry? How often do I GC logs? How much storage do I have? How are conflicts resolved?
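The config-file dimensions above can be sketched as a plain dictionary. The keys are hypothetical, chosen to mirror the questions in the slide rather than taken from the prototype's actual configuration format:

```python
# Sketch of the non-topology, non-consistency policy knobs as simple
# configuration parameters (hypothetical keys).
policy_config = {
    "replication_set": "/*",     # what subset of data this node holds
    "timeout_ms": 5000,          # how long before a request times out
    "retries": 3,                # how many times to retry
    "gc_interval_s": 3600,       # how often to garbage-collect the log
    "storage_limit_mb": 10_000,  # how much storage this node may use
}

def lookup(key):
    return policy_config[key]
```

Everything here is inert data; only the topology rules and the local-interface consistency checks need executable logic, which is why the rest of the talk focuses on expressing topology in Overlog.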

43 Bayou Policy. Propagation of updates: when connected to a neighbor, exchange updates for everything -> establish an update subscription from the neighbor for "/*". Replication: full replication. Local interface: reads return only precise, valid objects. On read miss: should not happen.

44 How to Specify Topology? In concise rules.

45 Overlog/P2. Overlog [Loo et al. 05] is a declarative routing language based on Datalog: expressive and compact rules specify topology. Relational data model: tuples, stored in tables or transient. Rules fire on combinations of tuples and conditions; a fired rule generates a tuple. Inter-node access is through remote table access or tuples. Basic syntax: ':-' separates a rule head from its body, '@' is the location specifier, and '_' is a wildcard. P2 is the runtime system for Overlog: it parses Overlog and sets up the data flows between nodes.

46 Overlog 101. Ping every neighbor periodically to find live neighbors.
/* Tables */
neighbor(X, Y)
liveNeighbor(X, Y)
/* generate ping event every PING_PERIOD seconds */
pg0 pingEvent(@X, E) :- periodic(@X, E, PING_PERIOD).
/* generate a ping request */
pg1 pingReq(@Y, X) :- pingEvent(@X, E), neighbor(@X, Y).
/* send reply to ping request */
pg2 pingReply(@X, Y) :- pingReq(@Y, X).
/* add to live neighbor table */
pg3 liveNeighbor(@X, Y) :- pingReply(@X, Y).

47 PRACTI & P2 Overview. A wrapper handles the conversion between Overlog tuples and PRACTI requests and events, and takes care of reconnections and time-outs. [Figure: on each node, the Overlog/P2 data flows sit above the wrapper, which talks to the PRACTI core through the controller interface; local reads and writes use the local interface, and streams connect the cores]

48 PRACTI & Overlog. Implement policy with Overlog rules. Overlog tuples/tables invoke mechanisms in PRACTI: addInvalSubscription / removeInvalSubscription, addBodySubscription / removeBodySubscription, demandRead. PRACTI events become Overlog tuples: localRead / localWrite / localReadMiss, recvInval. Example policy: subscribe to invalidates for /* from all neighbors. Overlog rule: addInvalSubscription(@X, N, SS) :- neighbor(@X, N), SS := "/*".

49 Bayou in Overlog. Bayou policy: Replication: full replication. Local interface: reads return only precise, valid objects. On read miss: should not happen. Propagation of updates: when connected to a neighbor, exchange updates (anti-entropy) -> establish an update subscription from the neighbor for "/*". In Overlog:
subscriptionSet("localhost:5000", "/*")
addInvalSubscription(@X, Y, SS) :- neighbor(@X, Y), subscriptionSet(@X, SS).

50 Coda in Overlog. Policy for Coda (single server). Replication: the server holds all data; a client holds its hoard set plus whatever is currently being accessed. Local interface (client): reads return only precise, valid objects (blocks otherwise); writes go to locally valid objects (otherwise a conflict). Read miss: get the object from the server and establish a callback; a callback is an inval subscription for the object. Propagation of updates: the client sends updates to the server; the server breaks the callback for all other clients holding the object by removing the object from their inval subscription streams. Hoarding: periodically fetch all invalid objects and establish callbacks on them.

51 Coda in Overlog (2). Client, on read miss. Get the object from the server:
demandRead(@S, Obj, Offset, Length) :- localReadMiss(@X, Obj, Offset, Length), server(@X, S), serverAlive(@X, V), V == 1.
Establish a callback:
addCallback(@S, Obj) :- localReadMiss(@X, Obj, _, _), server(@X, S), serverAlive(@X, V), V == 1.
Set up subscriptions for updates:
addInvalSubscription(@X, S, Obj) :- localReadMiss(@X, Obj, _, _), server(@X, S), serverAlive(@X, V), V == 1.
addBodySubscription(@X, S, Obj) :- localReadMiss(@X, Obj, _, _), server(@X, S), serverAlive(@X, V), V == 1.
Server, on receiving an update from a client. Break callbacks for the other clients:
removeInvalSubscription(@C2, Obj) :- recvUpdate(@S, C1, Obj, _, _, _, _, _), callback(@S, C2, Obj), C1 != C2.

52 Grand Challenge. How can I convince you? Better trade-offs: build the 14 OSDI/SOSP systems on the prototype. Experience so far: Bayou, 1 rule + 10 config parameters; Coda, 13 rules + 10 config parameters. [Table: synchronization time (s) at Office/Home/Hotel/Plane for PRACTI vs. client-server vs. full replication]

53 Overlog/P2: Not Quite Perfect. No guarantee of atomicity or of ordering among rules. It is difficult to specify access-based policies, and difficult to specify policies that store information in the replicated object itself.

54 Ongoing and Future Work

55 Ongoing and Future Work. To make the dream a reality: Overlog + PRACTI integration; the 14 OSDI/SOSP systems; an NFS interface; new systems (a personal file system, an enterprise file system); scalability to large numbers of nodes; security; ...

56 Conclusions. We identified three properties that classify existing replication systems. A way to build better replication systems: the first replication architecture to provide all three properties; it subsumes existing systems and exposes new points in the design space. A better way to build replication systems: policy is elegantly characterized as topology plus configuration parameters, and can be written as concise rules + config parameters.

57 Thank You. Towards a unified replication architecture.

60 Why is this Better? It subsumes existing systems: client-server, server replication, object replication, P2P, quorums, ... It exposes new points in the design space, makes it easier to build new systems, and builds better systems.

