1 Spanner: Google’s scalable, multi-version, globally-distributed and synchronously-replicated database
Presented by Alon Adler – based on OSDI ’12 (USENIX Association)

2 Why was Spanner born? Google already had BigTable and Megastore.
Why not BigTable? It cannot handle complex, evolving schemas. It offers only eventual consistency across datacenters. Its transactional scope is limited to a single row. Why not Megastore? Poor write performance.

3 So, what is Spanner? At a high level of abstraction, it is a database that shards data across many sets of Paxos state machines in datacenters spread all over the world. Spanner is designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows. Spanner maintains multiple replicas of each piece of data. Replication is used for global availability: applications can use Spanner for high availability even in the face of wide-area natural disasters.

4 So, what is Spanner? Spanner supports general-purpose ACID transactions: Atomicity, Consistency, Isolation, Durability. Sometimes the eventual consistency of BigTable isn’t good enough. Spanner also provides a SQL-based query language, which gives applications the ability to handle complex, evolving schemas.

5 A Spanner deployment is called a “universe”.
universemaster – monitors the status of all zones.
placement driver – moves data between zones.
location proxies – used by clients to locate the spanservers that hold the data they need.
zonemaster – assigns data to spanservers.
A zone contains thousands of spanservers.

6 Spanserver Software Stack
Tables are sharded across rows into tablets (as in BigTable). A tablet implements a mapping (key:string, timestamp:int64) -> string. Each spanserver is responsible for between 100 and 1,000 tablets. A Paxos state machine on top of each tablet supports synchronous replication. A sketch of the multi-version mapping follows.
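A minimal sketch of this multi-version mapping in Python (illustrative only, not Spanner’s actual data structure; the Tablet class and its methods are assumptions):

```python
# Sketch of the tablet mapping: (key: string, timestamp: int64) -> string.
import bisect
from collections import defaultdict

class Tablet:
    def __init__(self):
        # For each key, keep (timestamp, value) pairs sorted by timestamp.
        self._versions = defaultdict(list)

    def write(self, key: str, timestamp: int, value: str) -> None:
        bisect.insort(self._versions[key], (timestamp, value))

    def read(self, key: str, timestamp: int):
        # Return the newest value written at or before `timestamp`.
        versions = self._versions[key]
        i = bisect.bisect_right(versions, (timestamp, chr(0x10FFFF)))
        return versions[i - 1][1] if i > 0 else None

# Usage: reads see the value as of the requested timestamp.
t = Tablet()
t.write("user:1", 5, "alice")
t.write("user:1", 9, "alice_v2")
assert t.read("user:1", 7) == "alice"
assert t.read("user:1", 9) == "alice_v2"
```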

7 Paxos State Machine Paxos state machines are used to implement a consistently replicated bag of mappings. The key-value mapping state of each replica is stored in its corresponding tablet. Writes must initiate the Paxos protocol at the leader. The set of replicas is collectively a Paxos group, and each replica can be located in a different datacenter.

8 Spanner’s Features As a globally-distributed database, Spanner provides several interesting features. Applications can specify constraints to control which datacenters contain which data: how far data is from its users (to control read latency); how far replicas are from each other (to control write latency); and how many replicas are maintained (to control durability, availability, and read performance). In addition, Spanner has two features that are difficult to implement in a distributed database: externally-consistent reads and writes, and globally-consistent reads across the database at a timestamp.

9 Spanner’s Features Why are externally-consistent reads and writes, and globally-consistent reads at a timestamp, difficult to implement in a distributed database? Because there is no global “wall clock” shared by all machines.

10 So, what can we do? A global “wall-clock” time would give external consistency: commit order respects global wall-time order. So we transform the problem into two requirements on timestamps: timestamp order respects global wall-time order, and timestamp order == commit order.

11 Assigning timestamps to RW transactions
Transactions that write use two-phase locking (2PL). Each transaction T is assigned a timestamp s, and data written by T is timestamped with s. The timestamp is assigned while locks are held.
[Timeline figure: T acquires locks, picks s = now() while holding them, then releases locks.]

12 Timestamp Invariants Timestamp order == commit order.
Timestamp order respects global wall-time order.
[Timeline figure: overlapping transactions T3 and T4 illustrate both invariants.]

13 TrueTime API The key enabler of these properties (previous slide) is a new TrueTime API and its implementation. The API exposes clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides. The implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks).

14 TrueTime TrueTime provides “global wall-clock time” with bounded uncertainty. TT.now() returns an interval [earliest, latest] of width 2*ε that is guaranteed to contain the true absolute time.
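A minimal sketch of this interface, assuming a hypothetical clock_error_bound() that reports the local clock’s current uncertainty ε (the real implementation derives ε from GPS receivers and atomic clocks):

```python
# Sketch of TT.now(): an interval that contains the true absolute time.
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class TTInterval:
    earliest: float  # guaranteed not after the true absolute time
    latest: float    # guaranteed not before the true absolute time

def clock_error_bound() -> float:
    # Hypothetical stand-in for the machine's current uncertainty epsilon.
    return 0.004  # assume 4 ms, within the <10 ms bound cited above

def tt_now() -> TTInterval:
    now = time.time()          # local clock reading
    eps = clock_error_bound()  # current uncertainty
    # The returned interval has width 2*epsilon and contains true time.
    return TTInterval(earliest=now - eps, latest=now + eps)
```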

15 Timestamps and TrueTime
While holding locks, the leader picks s = TT.now().latest. It then performs a commit wait: it waits until TT.now().earliest > s before releasing locks, so that s is guaranteed to be in the past by the time the commit is visible. Each side of the uncertainty interval contributes an expected wait of about ε, so the commit wait averages roughly 2*ε.
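A sketch of this commit-wait rule on top of the tt_now() sketch from the previous slide (with ε = 4 ms the loop waits roughly 2ε):

```python
# Sketch: pick the commit timestamp while locks are held, then wait
# until s has certainly passed on every clock before releasing locks.
import time

def pick_timestamp_and_commit_wait() -> float:
    s = tt_now().latest              # assigned while locks are held
    while tt_now().earliest <= s:    # commit wait: s is not yet surely past
        time.sleep(0.0005)
    return s                         # now safe to commit and release locks
```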

16 Operations Spanner supports:
Read-write transactions, read-only transactions, and snapshot reads. A read-only transaction must be pre-declared as not having any writes. Reads in a read-only transaction execute at a system-chosen timestamp without locking, so that incoming writes are not blocked. A snapshot read is a read in the past that executes without locking; the client can either specify a timestamp or provide an upper bound on staleness.

17 Reads within read-write transactions
Writes that occur in a transaction are buffered at the client until commit; as a result, reads in a transaction do not see the effects of the transaction’s own writes. The client issues reads to the leader replica of the appropriate group, which acquires read locks and then reads the most recent data. While a client transaction remains open, it sends keep-alive messages. When the client has completed all reads and buffered all writes, the write protocol begins, as sketched below.
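A client-side sketch of this behavior; the `leader` stub and its read/commit methods are hypothetical:

```python
# Sketch of client-side write buffering in a read-write transaction.
class ReadWriteTransaction:
    def __init__(self, leader):
        self._leader = leader
        self._buffered = {}  # writes buffered at the client until commit

    def write(self, key, value):
        self._buffered[key] = value  # not sent to the server yet

    def read(self, key):
        # Reads go to the leader, which takes read locks; they deliberately
        # do NOT see this transaction's own buffered writes.
        return self._leader.read(key)

    def commit(self):
        # All reads done, all writes buffered: the write protocol begins.
        return self._leader.commit(self._buffered)
```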

18 RW transactions involving one Paxos group
[Timeline figure: T acquires locks; the leader picks s, starts consensus, achieves consensus, and notifies the slaves; once the commit wait is done, locks are released.]

19 RW transactions involving more than one Paxos group – 2PC protocol
[Timeline figure: each participant (TP1, TP2) acquires locks, logs a prepare record, computes its prepare timestamp s, and sends it to the coordinator (TC). The coordinator acquires locks, computes the overall commit timestamp s, waits out the commit wait, logs the commit, notifies the participants of s, and all groups release their locks.]
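A sketch of how the coordinator might pick the overall commit timestamp under the constraints in the paper; the function name and numeric values are illustrative:

```python
# Sketch: the coordinator's choice of the overall commit timestamp s.
def overall_commit_timestamp(prepare_timestamps, coordinator_tt_latest,
                             last_assigned):
    # s must be >= every participant's prepare timestamp, >= TT.now().latest
    # at the coordinator, and greater than any timestamp the coordinator's
    # leader has previously assigned (the tiny bump stands in for "strictly
    # greater" on Spanner's integer timestamps).
    return max(max(prepare_timestamps),
               coordinator_tt_latest,
               last_assigned + 1e-6)

# Matches the example on the next slide: prepares at sC=6 and sP=8
# yield an overall commit timestamp s=8.
assert overall_commit_timestamp([6.0, 8.0], 7.5, 5.0) == 8.0
```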

20 Example
TC coordinates “Remove X from my friend list” (prepare sC=6) while TP executes “Remove myself from X’s friend list” (prepare sP=8); both commit at the overall timestamp s=8. T2, the risky post P, commits later at s=15.

Time          <8     8     15
My friends    [X]    []    []
My posts                   [P]
X’s friends   [me]   []    []

21 Serving Reads at a Timestamp
Each replica maintains a safe time t_safe. A replica can satisfy a read at timestamp t if t <= t_safe. t_safe = min(t_safe^Paxos, t_safe^TM). t_safe^Paxos is the timestamp of the highest-applied Paxos write. t_safe^TM is much harder: it is infinity if there is no pending 2PC transaction, and otherwise min_i(s_i,g^prepare) - 1 over the transactions i prepared at group g. Thus t_safe is the maximum timestamp at which reads are safe, as sketched below.
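A sketch of these rules, with hypothetical accessors and float timestamps standing in for Spanner’s integer ones (so “- 1” becomes a small decrement):

```python
# Sketch of the safe-time computation at a replica.
import math

def t_safe(t_paxos_highest_applied, prepared_txn_timestamps):
    t_paxos = t_paxos_highest_applied       # highest-applied Paxos write
    if not prepared_txn_timestamps:         # no pending 2PC transaction
        t_tm = math.inf
    else:
        # A prepared-but-uncommitted transaction may still commit at any
        # timestamp >= its prepare timestamp, so reads must stay below it.
        t_tm = min(prepared_txn_timestamps) - 1e-6
    return min(t_paxos, t_tm)

def can_serve_read(t, t_paxos, prepared):
    # A replica can satisfy a read at timestamp t iff t <= t_safe.
    return t <= t_safe(t_paxos, prepared)
```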

22 Read-Only transactions
A read-only transaction executes in two phases: assign a timestamp S_read, then execute the reads as snapshot reads at S_read. The snapshot reads can execute at any replica that is sufficiently up-to-date. The simple assignment S_read = TT.now().latest preserves external consistency, but such a timestamp may require the data reads at S_read to block if t_safe has not advanced sufficiently. To reduce the chances of blocking, Spanner should assign the oldest timestamp that preserves external consistency.

23 Read-Only transactions
Assigning such a timestamp generally requires a negotiation phase between all of the Paxos groups involved in the read; as a result, Spanner requires a scope expression that summarizes the keys that will be read. If the scope’s values are served by a single Paxos group: the client issues the read-only transaction to the group leader, the leader assigns S_read = LastTS() (the timestamp of the last committed write at that group), and the read executes at any up-to-date replica. If the scope’s values are served by multiple Paxos groups: S_read = TT.now().latest (which may wait for safe time to advance), avoiding the negotiation round. Both cases are sketched below.
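A sketch of the two cases, reusing the tt_now() sketch from slide 14; last_ts() is a hypothetical stub for LastTS():

```python
# Sketch: assigning S_read for a read-only transaction.
def assign_s_read(groups):
    if len(groups) == 1:
        # Single group: the leader serves LastTS(), the timestamp of the
        # last committed write at that group -- no TrueTime wait needed.
        return next(iter(groups)).last_ts()
    # Multiple groups: skip negotiation and use TT.now().latest, which may
    # make the reads block until t_safe advances past it.
    return tt_now().latest
```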

24 Benchmarks Setup: 50 Paxos groups, 2500 buckets, 4KB reads or writes, datacenters 1ms apart.
Latency remains mostly constant as the number of replicas increases because Paxos executes in parallel at a group’s replicas.

25 Benchmarks All leaders explicitly placed in zone Z1.
Red line – killing a non-leader server has no effect on read throughput. Green line – “leader-soft” kill, giving the leaders time to hand off leadership. Blue line – “leader-hard” kill, with no warning for the leaders.

26 Questions? Thanks!

