Presentation is loading. Please wait.

Presentation is loading. Please wait.

+ Massively Parallel/Distributed Data Storage Systems S. Sudarshan IIT Bombay Derived from an earlier talk by S. Sudarshan, presented at the MSRI Summer.

Similar presentations


Presentation on theme: "+ Massively Parallel/Distributed Data Storage Systems S. Sudarshan IIT Bombay Derived from an earlier talk by S. Sudarshan, presented at the MSRI Summer."— Presentation transcript:

1 + Massively Parallel/Distributed Data Storage Systems S. Sudarshan IIT Bombay Derived from an earlier talk by S. Sudarshan, presented at the MSRI Summer School on Distributed Systems, May/June 2012

2 + 2 Why Distributed Data Storage a.k.a. Cloud Storage Explosion of social media sites (Facebook, Twitter) with large data needs Explosion of storage needs in large web sites such as Google, Yahoo 100s of millions of users Many applications with petabytes of data Much of the data is not files Very high demands on Scalability Availability IBM ICARE Winter School on Big Data, Oct. 2012

3 + 3 Why Distributed Data Storage a.k.a. Cloud Storage Step 0 (prehistory): Distributed database systems with tens of nodes Step 1: Distributed file systems with 1000s of nodes Millions of Large objects (100s of megabytes) Step 2: Distributed data storage systems with 1000s of nodes 100s of billions of smaller (kilobyte to megabyte) objects Step 3 (recent and future work): Distributed database systems with 1000s of nodes IBM ICARE Winter School on Big Data, Oct. 2012

4 + 4 Examples of Types of Big Data Large objects video, large images, web logs Typically write once, read many times, no updates Distributed file systems Transactional data from Web-based applications E.g. social network (Facebook/Twitter) updates, friend lists, likes, … (at least metadata) Billions to trillions of objects, distributed data storage systems Indices E.g. Web search indices with inverted list per word In earlier days, no updates, rebuild periodically Today: frequent updates (e.g. Google Percolator) IBM ICARE Winter School on Big Data, Oct. 2012

5 + 5 Why not use Parallel Databases? Parallel databases have been around since 1980s Most parallel databases were designed for decision support not OLTP Were designed for scales of 10s to 100s of processors Single machine failures are common and are handled But usually require jobs to be rerun in case of failure during exection Do not consider network partitions, and data distribution Demands on distributed data storage systems Scalability to thousands to tens of thousands of nodes Geographic distribution for lower latency, and higher availability IBM ICARE Winter School on Big Data, Oct. 2012

6 + 6 Basics: Parallel/Distributed Data Storage Replication System maintains multiple copies of data, stored in different nodes/sites, for faster retrieval and fault tolerance. Data partitioning Relation is divided into several partitions stored in distinct nodes/sites Replication and partitioning are combined Relation is divided into multiple partition; system maintains several identical replicas of each such partition. IBM ICARE Winter School on Big Data, Oct. 2012

7 + 7 Basics: Data Replication Advantages of Replication Availability: failure of site containing relation r does not result in unavailability of r is replicas exist. Parallelism: queries on r may be processed by several nodes in parallel. Reduced data transfer: relation r is available locally at each site containing a replica of r. Cost of Replication Increased cost of updates: each replica of relation r must be updated. Special concurrency control and atomic commit mechanisms to ensure replicas stay in sync IBM ICARE Winter School on Big Data, Oct. 2012

8 + 8 Basics: Data Transparency Data transparency: Degree to which system user may remain unaware of the details of how and where the data items are stored in a distributed system Consider transparency issues in relation to: Fragmentation transparency Replication transparency Location transparency IBM ICARE Winter School on Big Data, Oct. 2012

9 + 9 Basics: Naming of Data Items Naming of items: desiderata 1. Every data item must have a system-wide unique name. 2. It should be possible to find the location of data items efficiently. 3. It should be possible to change the location of data items transparently. 4. Data item creation/naming should not be centralized Implementations: Global directory Used in file systems Partition name space Each partition under control of one node Used for data storage systems IBM ICARE Winter School on Big Data, Oct. 2012

10 + 10 Build-it-Yourself Parallel Data Storage: a.k.a. Sharding Sharding Divide data amongst many cheap databases (MySQL/PostgreSQL) Manage parallel access in the application Partition tables map keys to nodes Application decides where to route storage or lookup requests Scales well for both reads and writes Limitations Not transparent application needs to be partition-aware AND application needs to deal with replication (Not a true parallel database, since parallel queries and transactions spanning nodes are not supported) IBM ICARE Winter School on Big Data, Oct. 2012

11 + 11 Parallel/Distributed Key-Value Data Stores Distributed key-value data storage systems allow key- value pairs to be stored (and retrieved on key) in a massively parallel system E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon Dynamo,.. Partitioning, replication, high availability etc mostly transparent to application Are the responsibility of the data storage system These are NOT full-fledged database systems A.k.a. NO-SQL systems Focus of this talk IBM ICARE Winter School on Big Data, Oct. 2012

12 + 12 Typical Data Storage Access API Basic API access: get(key) -- Extract the value given a key put(key, value) -- Create or update the value given its key delete(key) -- Remove the key and its associated value execute(key, operation, parameters) -- Invoke an operation to the value (given its key) which is a special data structure (e.g. List, Set, Map.... Etc) Extensions to add version numbering, etc IBM ICARE Winter School on Big Data, Oct. 2012

13 + 13 What is NoSQL? Stands for No-SQL or Not Only SQL?? Class of non-relational data storage systems E.g. BigTable, Dynamo, PNUTS/Sherpa,.. Synonymous with distributed data storage systems We dont like the term NoSQL IBM ICARE Winter School on Big Data, Oct. 2012

14 + 14 Data Storage Systems vs. Databases Distributed data storage systems do not support many relational features No join operations (except within partition) No referential integrity constraints across partitions No ACID transactions (across nodes) No support for SQL or query optimization But usually do provide flexible schema and other features Structured objects e.g. using JSON Multiple versions of data items, IBM ICARE Winter School on Big Data, Oct. 2012

15 + 15 Querying Static Big Data Large data sets broken into multiple files Static append-only data E.g. new files added to dataset each day No updates to existing data Map-reduce framework for massively parallel querying Not the focus of this talk. We focus on: Transactional data which is subject to updates Very large number of transactions Each of which reads/writes small amounts of data I.e. online transaction processing (OLTP) workloads IBM ICARE Winter School on Big Data, Oct. 2012

16 + Talk Outline Background: Distributed Transactions Concurrency Control/Replication Consistency Schemes Distributed File Systems Parallel/Distributed Data Storage Systems Basics Architecture Bigtable (Google) PNUTS/Sherpa (Yahoo) Megastore (Google) CAP Theorem: availability vs. consistency Basics Dynamo (Amazon) IBM ICARE Winter School on Big Data, Oct. 2012

17 + Background: Distributed Transactions Slides in this section are from Database System Concepts, 6 th Edition, by Silberschatz, Korth and Sudarshan, McGraw Hill, 2010 IBM ICARE Winter School on Big Data, Oct. 2012

18 + 18 Distributed Transactions Transaction may access data at several sites. Each site has a local transaction manager responsible for: Maintaining a log for recovery purposes Participating in coordinating the concurrent execution of the transactions executing at that site. Each site has a transaction coordinator, which is responsible for: Starting the execution of transactions that originate at the site. Distributing subtransactions at appropriate sites for execution. Coordinating the termination of each transaction that originates at the site, which may result in the transaction being committed at all sites or aborted at all sites. IBM ICARE Winter School on Big Data, Oct. 2012

19 + 19 Transaction System Architecture IBM ICARE Winter School on Big Data, Oct. 2012

20 + 20 System Failure Modes Failures unique to distributed systems: Failure of a site. Loss of massages Handled by network transmission control protocols such as TCP-IP Failure of a communication link Handled by network protocols, by routing messages via alternative links Network partition A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them Note: a subsystem may consist of a single node Network partitioning and site failures are generally indistinguishable. IBM ICARE Winter School on Big Data, Oct. 2012

21 + 21 Commit Protocols Commit protocols are used to ensure atomicity across sites a transaction which executes at multiple sites must either be committed at all the sites, or aborted at all the sites. not acceptable to have a transaction committed at one site and aborted at another The two-phase commit (2PC) protocol is widely used The three-phase commit (3PC) protocol is more complicated and more expensive, but avoids some drawbacks of two-phase commit protocol. IBM ICARE Winter School on Big Data, Oct. 2012

22 + 22 Two Phase Commit Protocol (2PC) Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached. The protocol involves all the local sites at which the transaction executed Let T be a transaction initiated at site S i, and let the transaction coordinator at S i be C i IBM ICARE Winter School on Big Data, Oct. 2012

23 + 23 Phase 1: Obtaining a Decision Coordinator asks all participants to prepare to commit transaction T i. C i adds the records to the log and forces log to stable storage sends prepare T messages to all sites at which T executed Upon receiving message, transaction manager at site determines if it can commit the transaction if not, add a record to the log and send abort T message to C i if the transaction can be committed, then: add the record to the log force all records for T to stable storage send ready T message to C i IBM ICARE Winter School on Big Data, Oct. 2012

24 + 24 Phase 2: Recording the Decision T can be committed of C i received a ready T message from all the participating sites: otherwise T must be aborted. Coordinator adds a decision record, or, to the log and forces record onto stable storage. Once the record stable storage it is irrevocable (even if failures occur) Coordinator sends a message to each participant informing it of the decision (commit or abort) Participants take appropriate action locally. IBM ICARE Winter School on Big Data, Oct. 2012

25 + 25 Three Phase Commit (3PC) Blocking problem in 2PC: if coordinator is disconnected from participant, participant which had sent a ready message may be in a blocked state Cannot figure out whether to commit or abort Partial solution: Three phase commit Phase 1: Obtaining Preliminary Decision: Identical to 2PC Phase 1. Every site is ready to commit if instructed to do so Phase 2 of 2PC is split into 2 phases, Phase 2 and Phase 3 of 3PC In phase 2 coordinator makes a decision as in 2PC (called the pre-commit decision) and records it in multiple (at least K) sites In phase 3, coordinator sends commit/abort message to all participating sites Under 3PC, knowledge of pre-commit decision can be used to commit despite coordinator failure Avoids blocking problem as long as < K sites fail Drawbacks: higher overhead, and assumptions may not be satisfied in practice IBM ICARE Winter School on Big Data, Oct. 2012

26 + 26 Distributed Transactions via Persistent Messaging Notion of a single transaction spanning multiple sites is inappropriate for many applications E.g. transaction crossing an organizational boundary Latency of waiting for commit from remote site Alternative models carry out transactions by sending messages Code to handle messages must be carefully designed to ensure atomicity and durability properties for updates Isolation cannot be guaranteed, in that intermediate stages are visible, but code must ensure no inconsistent states result due to concurrency Persistent messaging systems are systems that provide transactional properties to messages Messages are guaranteed to be delivered exactly once Will discuss implementation techniques later IBM ICARE Winter School on Big Data, Oct. 2012

27 + 27 Persistent Messaging Example: funds transfer between two banks Two phase commit would have the potential to block updates on the accounts involved in funds transfer Alternative solution: Debit money from source account and send a message to other site Site receives message and credits destination account Messaging has long been used for distributed transactions (even before computers were invented!) Atomicity issue once transaction sending a message is committed, message must guaranteed to be delivered Guarantee as long as destination site is up and reachable, code to handle undeliverable messages must also be available e.g. credit money back to source account. If sending transaction aborts, message must not be sent IBM ICARE Winter School on Big Data, Oct. 2012

28 + 28 Error Conditions with Persistent Messaging Code to handle messages has to take care of variety of failure situations (even assuming guaranteed message delivery) E.g. if destination account does not exist, failure message must be sent back to source site When failure message is received from destination site, or destination site itself does not exist, money must be deposited back in source account Problem if source account has been closed get humans to take care of problem User code executing transaction processing using 2PC does not have to deal with such failures There are many situations where extra effort of error handling is worth the benefit of absence of blocking E.g. pretty much all transactions across organizations IBM ICARE Winter School on Big Data, Oct. 2012

29 + 29 Persistent Messaging and Workflows Workflows provide a general model of transactional processing involving multiple sites and possibly human processing of certain steps E.g. when a bank receives a loan application, it may need to Contact external credit-checking agencies Get approvals of one or more managers and then respond to the loan application Persistent messaging forms the underlying infrastructure for workflows in a distributed environment IBM ICARE Winter School on Big Data, Oct. 2012

30 + 30 Managing Replicated Data Issues: All replicas should have the same value updates performed at all replicas But what if a replica is not available (disconnected, or failed)? What if different transactions update different replicas concurrently? Need some form of distributed concurrency control IBM ICARE Winter School on Big Data, Oct. 2012

31 + 31 Primary Copy Choose one replica of data item to be the primary copy. Site containing the replica is called the primary site for that data item Different data items can have different primary sites For concurrency control: when a transaction needs to lock a data item Q, it requests a lock at the primary site of Q. Implicitly gets lock on all replicas of the data item Benefit Concurrency control for replicated data handled similarly to unreplicated data - simple implementation. Drawback If the primary site of Q fails, Q is inaccessible even though other sites containing a replica may be accessible. IBM ICARE Winter School on Big Data, Oct. 2012

32 + 32 Primary Copy Primary copy scheme for performing updates: Update at primary, updates subsequently replicated to other copies Updates to a single item are serialized at the primary copy IBM ICARE Winter School on Big Data, Oct. 2012

33 + 33 Majority Protocol for Locking Local lock manager at each site administers lock and unlock requests for data items stored at that site. When a transaction wishes to lock an unreplicated data item Q residing at site S i, a message is sent to S i s lock manager. If Q is locked in an incompatible mode, then the request is delayed until it can be granted. When the lock request can be granted, the lock manager sends a message back to the initiator indicating that the lock request has been granted. IBM ICARE Winter School on Big Data, Oct. 2012

34 + 34 Majority Protocol for Locking If Q is replicated at n sites, then a lock request message is sent to more than half of the n sites in which Q is stored. The transaction does not operate on Q until it has obtained a lock on a majority of the replicas of Q. When writing the data item, transaction performs writes on all replicas. Benefit Can be used even when some sites are unavailable details on how handle writes in the presence of site failure later Drawback Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1) messages for handling unlock requests. Potential for deadlock even with single item - e.g., each of 3 transactions may have locks on 1/3rd of the replicas of a data. IBM ICARE Winter School on Big Data, Oct. 2012

35 + 35 Majority Protocol for Accessing Replicated Items The majority protocol for updating replicas of an item Each replica of each item has a version number which is updated when the replica is updated, as outlined below A lock request is sent to at least ½ the sites at which item replicas are stored and operation continues only when a lock is obtained on a majority of the sites Read operations look at all replicas locked, and read the value from the replica with largest version number May write this value and version number back to replicas with lower version numbers (no need to obtain locks on all replicas for this task) IBM ICARE Winter School on Big Data, Oct. 2012

36 + 36 Majority Protocol for Accessing Replicated Items Majority protocol (Cont.) Write operations find highest version number like reads, and set new version number to old highest version + 1 Writes are then performed on all locked replicas and version number on these replicas is set to new version number With 2 phase commit OR distributed consensus protocol such as Paxos Failures (network and site) cause no problems as long as Sites at commit contain a majority of replicas of any updated data items During reads a majority of replicas are available to find version numbers Note: reads are guaranteed to see latest version of data item Reintegration is trivial: nothing needs to be done IBM ICARE Winter School on Big Data, Oct. 2012

37 + 37 Biased Protocol for Locking Local lock manager at each site as in majority protocol, however, requests for shared locks are handled differently than requests for exclusive locks. Shared locks. When a transaction needs to lock data item Q, it simply requests a lock on Q from the lock manager at one site containing a replica of Q. Exclusive locks. When transaction needs to lock data item Q, it requests a lock on Q from the lock manager at all sites containing a replica of Q. Advantage - imposes less overhead on read operations. Disadvantage - additional overhead on writes IBM ICARE Winter School on Big Data, Oct. 2012

38 + 38 Quorum Consensus Protocol A generalization of both majority and biased protocols Each site is assigned a weight. Let S be the total of all site weights Choose two values read quorum Q R and write quorum Q W Such that Q R + Q W > S and 2 * Q W > S Quorums can be chosen (and S computed) separately for each item Quorum consensus protocol for locking Each read must lock enough replicas that the sum of the site weights is >= Q R Each write must lock enough replicas that the sum of the site weights is >= Q RW IBM ICARE Winter School on Big Data, Oct. 2012

39 + 39 Quorum Consensus Protocol Quorum consensus protocol for accessing replicated data Each read must access enough replicas that the sum of the site weights is >= Q R Guaranteed to see at least one site which participated in last write Each write must update enough replicas that the sum of the site weights is >= Q w Of two concurrent writes, only one can succeed Biased protocol is a special case of quorum consensus Allows reads to read any one replica but updates require all replicas to be available at commit time (called read one write all) IBM ICARE Winter School on Big Data, Oct. 2012

40 + 40 Read One Write All (Available) Read one write all available (ignoring failed sites) is attractive, but incorrect If failed link may come back up, without a disconnected site ever being aware that it was disconnected The site then has old values, and a read from that site would return an incorrect value If site was aware of failure reintegration could have been performed, but no way to guarantee this With network partitioning, sites in each partition may update same item concurrently believing sites in other partitions have all failed IBM ICARE Winter School on Big Data, Oct. 2012

41 + 41 Replication with Weak Consistency Many systems support replication of data with weak degrees of consistency (I.e., without a guarantee of serializabiliy) i.e. Q R + Q W <= S or 2*Q W < S Usually only when not enough sites are available to ensure quorum Tradeoff of consistency versus availability Many systems support lazy propagation where updates are transmitted after transaction commits Allows updates to occur even if some sites are disconnected from the network, but at the cost of consistency What to do if there are inconsistent concurrent updates? How to detect? How to reconcile? IBM ICARE Winter School on Big Data, Oct. 2012

42 + 42 Availability High availability: time for which system is not fully usable should be extremely low (e.g % availability) Robustness: ability of system to function spite of failures of components Failures are more likely in large distributed systems To be robust, a distributed system must either Detect failures Reconfigure the system so computation may continue Recovery/reintegration when a site or link is repaired OR follow protocols that guarantee consistency in spite of failures. Failure detection: distinguishing link failure from site failure is hard (partial) solution: have multiple links, multiple link failure is likely a site failure IBM ICARE Winter School on Big Data, Oct. 2012

43 + 43 Reconfiguration Reconfiguration: Abort all transactions that were active at a failed site Making them wait could interfere with other transactions since they may hold locks on other sites However, in case only some replicas of a data item failed, it may be possible to continue transactions that had accessed data at a failed site (more on this later) If replicated data items were at failed site, update system catalog to remove them from the list of replicas. This should be reversed when failed site recovers, but additional care needs to be taken to bring values up to date If a failed site was a central server for some subsystem, an election must be held to determine the new server E.g. name server, concurrency coordinator, global deadlock detector IBM ICARE Winter School on Big Data, Oct. 2012

44 + 44 Reconfiguration (Cont.) Since network partition may not be distinguishable from site failure, the following situations must be avoided Two or more central servers elected in distinct partitions More than one partition updates a replicated data item Updates must be able to continue even if some sites are down Solution: majority based approach Alternative of read one write all available is tantalizing but causes problems IBM ICARE Winter School on Big Data, Oct. 2012

45 + 45 Site Reintegration When failed site recovers, it must catch up with all updates that it missed while it was down Problem: updates may be happening to items whose replica is stored at the site while the site is recovering Solution 1: halt all updates on system while reintegrating a site Unacceptable disruption Solution 2: lock all replicas of all data items at the site, update to latest version, then release locks Other solutions with better concurrency also available IBM ICARE Winter School on Big Data, Oct. 2012

46 + 46 Coordinator Selection Backup coordinators site which maintains enough information locally to assume the role of coordinator if the actual coordinator fails executes the same algorithms and maintains the same internal state information as the actual coordinator fails executes state information as the actual coordinator allows fast recovery from coordinator failure but involves overhead during normal processing. Election algorithms used to elect a new coordinator in case of failures Special case of distributed consensus IBM ICARE Winter School on Big Data, Oct. 2012

47 + 47 Distributed Consensus From Lamports Paxos Made Simple: Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are: Only a value that has been proposed may be chosen, Only a single value is chosen, and A process never learns that a value has been chosen unless it actually has been Paxos: a family of protocols for distributed consensus IBM ICARE Winter School on Big Data, Oct. 2012

48 + 48 Paxos: Overview Three kinds of participants (site can play more than one role) Proposer: proposes a value Acceptor: accepts (or rejects) a proposal, following a protocol Consensus is reached when a majority of acceptors have accepted a proposal Learner: finds what value (if any) was accepted by a majority of acceptors Acceptor generally informs each member of a set of learners IBM ICARE Winter School on Big Data, Oct. 2012

49 + 49 Paxos Made Simple Phase 1 (a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors Number has to be chosen in some unique way (b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest- numbered proposal (if any) that it has accepted Else it rejects proposal If proposer receives IBM ICARE Winter School on Big Data, Oct. 2012

50 + 50 Paxos Made Simple Phase 2 (a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals. Else its proposal has been rejected (b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n. At end of phase, it is possible that a majority have agreed on a value Learners receive messages from acceptors to figure this out IBM ICARE Winter School on Big Data, Oct. 2012

51 + 51 Paxos Details Possible lack of progress: All proposals may get rejected No party gets a majority For a coalition government? No. Instead, elect a coordinator (distinguished proposer) who is the only one who can make proposals Elections in Iraq under Saddam? Many more details under cover Many variants of Paxos optimized for different scenarios NOT the focus of this talk IBM ICARE Winter School on Big Data, Oct. 2012

52 + Distributed File Systems IBM ICARE Winter School on Big Data, Oct. 2012

53 + 53 Distributed File Systems Google File System (GFS) Hadoop File System (HDFS) And older ones like CODA, Basic architecture: Master: responsible for metadata Chunk servers: responsible for reading and writing large chunks of data Chunks replicated on 3 machines, master responsible for managing replicas Replication is within a single data center IBM ICARE Winter School on Big Data, Oct. 2012

54 Secondary NameNode Client HDFS Architecture NameNode DataNodes 1. Send filename 2. Get back BlckId, DataNodes o 3.Read data NameNode : Maps a file to a file-id and list of MapNodes DataNode : Maps a block-id to a physical location on disk IBM ICARE Winter School on Big Data, Oct. 2012

55 + 55 Google File System (OSDI 04) Inspiration for HDFS Master: responsible for all metadata Chunk servers: responsible for reading and writing large chunks of data Chunks replicated on 3 machines Master responsible for ensuring replicas exist IBM ICARE Winter School on Big Data, Oct. 2012

56 + 56 Limitations of GFS/HDFS Central master becomes bottleneck Keep directory/inode information in memory to avoid IO Memory size limits number of files File system directory overheads per file Not appropriate for storing very large number of objects File systems do not provide consistency guarantees File systems cache blocks locally Ideal for write-once and and append only data Can be used as underlying storage for a data storage system E.g. BigTable uses GFS underneath IBM ICARE Winter School on Big Data, Oct. 2012

57 + Parallel/Distributed Data Storage Systems IBM ICARE Winter School on Big Data, Oct. 2012

58 + 58 Typical Data Storage Access API Basic API access: get(key) -- Extract the value given a key put(key, value) -- Create or update the value given its key delete(key) -- Remove the key and its associated value execute(key, operation, parameters) -- Invoke an operation to the value (given its key) which is a special data structure (e.g. List, Set, Map.... Etc) Extensions to add version numbering, etc IBM ICARE Winter School on Big Data, Oct. 2012

59 + 59 Data Types in Data Storage Systems Uninterpreted key/value or the big hash table. Amazon S3 (Dynamo) Flexible schema Ordered keys, semi-structured data BigTable [Google], Cassandra [Facebook/Apache], Hbase [Apache, similar to BigTable] Unordered keys, JSON Sherpa/Pnuts [Yahoo], CouchDB [Apache] Document/Textual data (with JSON variant) MongoDB [10gen, open source] IBM ICARE Winter School on Big Data, Oct. 2012

60 + 60 E.g. of Flexible Data Model ColumnFamily: Inventory Details KeyValue A A B NameValue Memory inventoryQty 3G Ipad (new) 16GB 512MB 100 No name NameValue Memory inventoryQty Screen ASUS Laptop 3GB 9 15 inch name NameValue All Terrain Vehicle Wheels inventoryQty Power10HP 3 5 IBM ICARE Winter School on Big Data, Oct. 2012

61 + 61 Bigtable: Data model triple for key lookup (point and range) E.g. prefix lookup by com.cnn.* in example below insert and delete API Allows inserting and deleting versions for any specific cell Arbitrary columns on a row-by-row basis Partly column-oriented physical store: all columns in a column family are stored together IBM ICARE Winter School on Big Data, Oct. 2012

62 + Architecture of distributed storage systems There are many more distributed data storage systems, we will not do a full survey, but just cover some representatives Some of those we dont cover: Cassandra, Hbase, CouchDB, MongoDB Some we cover later: Dynamo Bigtable PNUTS/Sherpa Megastore IBM ICARE Winter School on Big Data, Oct. 2012

63 + Bigtable Massively parallel data storage system Supports key-value storage and lookup Incorporates flexible schema Designed to work within a single data center Does not address distributed consistency issues, etc Built on top of Chubby distributed fault tolerant lock service Google File System (GFS) IBM ICARE Winter School on Big Data, Oct. 2012

64 + 64 Distributing Load Across Servers Table split into multiple tablets Tablet servers manage tablets, multiple tablets per server. Each tablet is MB Each tablet controlled by only one server Tablet server splits tablets that get too big Master responsible for load balancing and fault tolerance All data and logs stored in GFS Leverage GFS replication/fault tolerance Data can be accessed if required from any node to aid in recovery IBM ICARE Winter School on Big Data, Oct. 2012

65 + 65 Chubby (OSDI06) {lock/file/name} service 5 replicas, tolerant to failures of individual replicas need a majority vote to be active Uses Paxos algorithm for distributed consensus Coarse-grained locks can store small amount of data in a lock Used as fault tolerant directory In Bigtable Chubby is used to locate master node Detect master and tablet server failures using lock leases More on this shortly Choose new master/tablet server in a safe manner using lock service IBM ICARE Winter School on Big Data, Oct. 2012

66 + 66 How To Layer Storage on Top of File System File system supports large objects, but with Significant per-object overhead, and Limits on number of objects No guarantees of consistency if updates/reads are made at different nodes How to store smallobject store on top of file system? Aggregate multiple small objects into a file (SSTable) SSTable: Immutable sorted file of key-value pairs Chunks of data plus an index Index is of block ranges, not values Index 64K block SSTable IBM ICARE Winter School on Big Data, Oct. 2012

67 + 67 Tablets: Partitioning Data across Nodes Tablet: Unit of partitioning Contains some range of rows of the table Built out of multiple SSTables Key range of SSTables can overlap Lookups access all SSTables One of the SSTables is in memory (Memtable) Inserts done on memtable Index 64K block SSTable Index 64K block SSTable Tablet Start:aardvarkEnd:apple IBM ICARE Winter School on Big Data, Oct. 2012

68 + 68 Immutability SSTables are immutable simplifies caching, sharing across GFS etc no need for concurrency control SSTables of a tablet recorded in METADATA table Garbage collection of SSTables done by master On tablet split, split tables can start off quickly on shared SSTables, splitting them lazily Only memtable has reads and updates concurrent copy on write rows, allow concurrent read/write IBM ICARE Winter School on Big Data, Oct. 2012

69 + 69 Editing a table Mutations are logged, then applied to an in-memory memtable May contain deletion entries to handle updates Lookups match up relevant deletion entries to filter out logically deleted data Periodic compaction merges multiple SSTables and removes logically deleted entries SSTable Tablet apple_two_E boat Insert Delete Insert Delete Insert Memtable tablet log GFS Memory IBM ICARE Winter School on Big Data, Oct. 2012

70 + 70 Table Multiple tablets make up the table Tablet ranges do not overlap Directory to provide key-range to tablet mapping Each tablet assigned to one server (tablet server) But data is in GFS, replication handled by GFS If master is reassigned, new master simply fetches data from GFS All updates/lookups on tablet routed through tablet server Primary copy scheme IBM ICARE Winter School on Big Data, Oct. 2012

71 + 71 Finding a tablet Directory Entries: Key: table id + end row, Data: location Cached at clients, which may detect data to be incorrect in which case, lookup on hierarchy performed Also prefetched (for range queries) Identifies tablet, server found separately via master IBM ICARE Winter School on Big Data, Oct. 2012

72 + 72 Masters Tasks Use Chubby to monitor health of tablet servers, restart failed servers Tablet server registers itself by getting a lock in a specific directory chubby Chubby gives lease on lock, must be renewed periodically Server loses lock if it gets disconnected Master monitors this directory to find which servers exist/are alive If server not contactable/has lost lock, master grabs lock and reassigns tablets GFS replicates data. Prefer to start tablet server on same machine that the data is already at IBM ICARE Winter School on Big Data, Oct. 2012

73 + 73 Masters Tasks (Cont) When (new) master starts grabs master lock on chubby Ensures only one master at a time Finds live servers (scan chubby directory) Communicates with servers to find assigned tablets Scans metadata table to find all tablets Keeps track of unassigned tablets, assigns them Metadata root from chubby, other metadata tablets assigned before scanning. IBM ICARE Winter School on Big Data, Oct. 2012

74 + 74 Metadata Management Master handles table creation, and merging of tablet Tablet servers directly update metadata on tablet split, then notify master lost notification may be detected lazily by master IBM ICARE Winter School on Big Data, Oct. 2012

75 + 75 Table SSTables can be shared by more than one tablet to provide quick tablet split SSTable Tablet aardvark apple Tablet apple_two_E boat IBM ICARE Winter School on Big Data, Oct. 2012

76 + 76 Compactions Minor compaction – convert the memtable into an SSTable Reduce memory usage Reduce log traffic on restart Merging compaction Reduce number of SSTables Good place to apply policy keep only N versions Major compaction Merging compaction that results in only one SSTable No deletion records, only live data IBM ICARE Winter School on Big Data, Oct. 2012

77 + 77 Locality Groups Group column families together into an SSTable Avoid mingling data, e.g. page contents and page metadata Can keep some groups all in memory Can compress locality groups Bloom Filters on SSTables in a locality group bitmap on keyvalue hash, used to overestimate which records exist avoid searching SSTable if bit not set Tablet movement Major compaction (with concurrent updates) Minor compaction (to catch up with updates) without any concurrent updates Load on new server without requiring any recovery action IBM ICARE Winter School on Big Data, Oct. 2012

78 + 78 Log Handling Commit log is per server, not per tablet (why?) complicates tablet movement when server fails, tablets divided among multiple servers can cause heavy scan load by each such server optimization to avoid multiple separate scans: sort log by (table, rowname, LSN), so logs for a tablet are clustered, then distribute GFS delay spikes can mess up log write (time critical) solution: two separate logs, one active at a time can have duplicates between these two IBM ICARE Winter School on Big Data, Oct. 2012

79 + 79 Transactions in Bigtable Bigtable supports only single-row transactions atomic read-modify-write sequences on data stored under a single row key No transactions across sites Limitations are many E.g. secondary indices cannot be supported consistently PNUTS provides different approach based on persistent messaging IBM ICARE Winter School on Big Data, Oct. 2012

80 + Yahoo PNUTS (a.k.a. Sherpa) Goals similar to Bigtable, but some significant architectural differences Not layered on distributed file system, handles distribution and replication by itself Master node takes care of above tasks Master node is replicated using master-slave backup Assumption: joint failure of master node and replica unlikely Similar to GFS assumption about master Can use existing storage systems such as MySQL or BerlekeyDB underneath Focus on geographic replication IBM ICARE Winter School on Big Data, Oct. 2012

81 Data-path components Storage units Routers Tablet controller REST API Clients Message Broker Detailed architecture This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

82 Storage units Routers Tablet controller REST API Clients Local region Remote regions YMB Detailed architecture This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

83 + 83 Yahoo Message Bus Distributed publish-subscribe service Guarantees delivery once a message is published Logging at site where message is published, and at other sites when received Guarantees messages published to a particular cluster will be delivered in same order at all other clusters Record updates are published to YMB by master copy All replicas subscribe to the updates, and get them in same order for a particular record This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

84 + 84 Managing Concurrency Managing concurrency is now the programmers job Updates are sent lazily to replicas Update replica 1, read replica 2 can give old value Replication is no longer transparent To deal with this, PNUTS provides version numbering Reads can specify version number Write can be conditional on version number Application programmer has to manage concurrency With great scale comes great responsibility … IBM ICARE Winter School on Big Data, Oct. 2012

85 + 85 Consistency model Goal: make it easier for applications to reason about updates and cope with asynchrony What happens to a record with primary key Brian? Time Record inserted Update Delete Time v. 1 v. 2 v. 3v. 4 v. 5 v. 7 Generation 1 v. 6 v. 8 Update This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

86 + 86 Consistency model Time v. 1 v. 2 v. 3v. 4 v. 5 v. 7 Generation 1 v. 6 v. 8 Current version Stale version Read This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

87 + 87 Consistency model Time v. 1 v. 2 v. 3v. 4 v. 5 v. 7 Generation 1 v. 6 v. 8 Read up-to-date Current version Stale version This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

88 + 88 Consistency model Time v. 1 v. 2 v. 3v. 4 v. 5 v. 7 Generation 1 v. 6 v. 8 Read v.6 Current version Stale version Read-critical(required version): This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

89 + 89 Consistency model Time v. 1 v. 2 v. 3v. 4 v. 5 v. 7 Generation 1 v. 6 v. 8 Write Current version Stale version This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

90 + 90 Consistency model Time v. 1 v. 2 v. 3v. 4 v. 5 v. 7 Generation 1 v. 6 v. 8 Write if = v.7 ERROR Current version Stale version Test-and-set-write(required version) This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

91 + 91 Consistency model Time v. 1 v. 2 v. 3v. 4 v. 5 v. 7 Generation 1 v. 6 v. 8 Write if = v.7 ERROR Current version Stale version Mechanism: per record mastership This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

92 + 92 Record and Tablet Mastership Data in PNUTS is replicated across sites Hidden field in each record stores which copy is the master copy updates can be submitted to any copy forwarded to master, applied in order received by master Record also contains origin of last few updates Mastership can be changed by current master, based on this information Mastership change is simply a record update Tablets mastership Required to ensure primary key consistency Can be different from record mastership This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

93 + 93 Other Features Per record transactions Copying a tablet (on failure, for e.g.) Request copy Publish checkpoint message Get copy of tablet as of when checkpoint is received Apply later updates Tablet split Has to be coordinated across all copies This slide from Brian Coopers talk at VLDB 2008 IBM ICARE Winter School on Big Data, Oct. 2012

94 + Asynchronous View Maintenance in PNUTS Queries involving joins/aggregations are very expensive Alternative: materialized view that are maintained on each update Small (constant) amount of work per update Asynchronous view maintenance To avoid imposing latency on updater Readers may get out of date view Readers can ask for view state after a certain update has been applied P. Agrawal et al, SIGMOD 2009 IBM ICARE Winter School on Big Data, Oct. 2012

95 + 95 Asynchronous View Maintenance in PNUTS Two kinds of views: Remote view Basically just a repartitioning of data on different key Local view Can aggregate local tuples Tuples of each table are partitioned on some partitioning key Maintained synchronously with updates Can layer views to express more complex queries No join views But can co-locate tuples from two relations based on join key And can compute join locally on demand IBM ICARE Winter School on Big Data, Oct. 2012

96 + 96 Asynchronous View Maintenance in PNUTS (from Agrawal, SIGMOD 09) IBM ICARE Winter School on Big Data, Oct. 2012

97 + 97 Remote View Maintenance (from Agrawal, SIGMOD 09) Clients Query Routers Storage Servers Log Managers API a.disk write in log b.cache write in storage c.return to user d.flush to disk e.remote view maintenance f.view log message(s) g.view update(s) in storage a bd Remote Maintainer f e g g IBM ICARE Winter School on Big Data, Oct. 2012

98 + 98 Aggregates by scatter-gather (from Agrawal, SIGMOD 09) 1DaveTV… 2JackGPS… 7DaveGPS… 8TimTV… 4AlexDVD… 5JackTV… Number of TV reviews Reviews(review-id,user,item,text) GPS1 TV1 Avoids expensive scans. Simple materialized views. Scatter gather DVD1 TV1 GPS1 TV1 Local views IBM ICARE Winter School on Big Data, Oct. 2012

99 + Local Aggregates over Repartitioned Data (from Agrawal, SIGMOD 09) 1DaveTV… 2JackGPS… 7DaveGPS… 8TimTV… 4AlexDVD… 5JackTV… Reviews(review-id,user,item,text) ByItem(item,review- id,user,text) DVD4Alex… TV1Dave… TV5Jack TV8Tim… GPS2Jack… GPS7Dave… DVD1 GPS2 TV3 Number of TV reviews IBM ICARE Winter School on Big Data, Oct. 2012

100 + 100 Consistency Model for Views Base consistency model: timelines of all records are independent Multiple records of the views are connected to base records. Information in a view record vr comes from a base record r vr is dependent on r while r is incident on vr Indexes, selections, equi-joins: one to one Group by/aggregates: Many to one Goal: to ensure view record read is consistent with some update which has taken place earlier Request specifying a version blocks until update is reflected in the view How to implement? IBM ICARE Winter School on Big Data, Oct. 2012

101 + 101 Reading View Records Based on Versions LVT is always up-to-date wrt local replica of the underlying base table Version request mapped to underlying base table For RVTs: Single record reads for one-to-one views: Each view record has single incident base record View records can use the version no. of the base records Consistency guarantees are inherited from the base table reads: ReadAny(vr), ReadCritical(vr, v'), ReadLatest(vr) IBM ICARE Winter School on Big Data, Oct. 2012

102 + 102 Reading View Records Based on Versions Single record reads for many-to-one views: Multiple br are incident on single vr ReadAny: reads any version of base records. ReadCritical: Needs specific versions for certain subsets of base records and ReadAny for all other base records. For e.g. When user updates some tuples and then reads back, he would like to see the latest version of the updated records and relatively stale other records may be acceptable. Base record versions are available in RVT/base table on which view is defined ReadLatest: Accesses base table, high cost unavoidable IBM ICARE Winter School on Big Data, Oct. 2012

103 + Googles Megastore Layer on top of Bigtable Provides geographic replication Key feature: notion of entity group Provides full ACID properties for transactions which access only a single entity group Plus support for 2PC/persistent messaging for transactions that span entity groups Use of Paxos for distributed consensus For atomic commit across replicas IBM ICARE Winter School on Big Data, Oct. 2012

104 + 104 Megastore: Motivation Many applications need transactional support beyond individual rows Entity group: set of related rows E.g: records of a single user, or of a group ACID transactions are supported within an entity group Some transactions need to cross entity groups E.g. bidirectional relationships Use 2PC or persistent messaging Data structures spanning multiple entity groups E.g. secondary indices IBM ICARE Winter School on Big Data, Oct. 2012

105 + 105 Entity Group Example (Baker et al., CIDR 11) CREATE TABLE User { required int64 user_id; required string name; } PRIMARY KEY(user_id), ENTITY GROUP ROOT; CREATE TABLE Photo { required int64 user_id; required int32 photo_id; required int64 time; required string full_url; optional string thumbnail_url; repeated string tag; } PRIMARY KEY(user_id, photo_id), IN TABLE User, ENTITY GROUP KEY(user_id) REFERENCES User; CREATE GLOBAL INDEX PhotosByTag ON Photo(tag) STORING (thumbnail_url); IBM ICARE Winter School on Big Data, Oct. 2012

106 + 106 Megastore Architecture (from Baker et al, CIDR 11) IBM ICARE Winter School on Big Data, Oct. 2012

107 + 107 Megastore: Operations across Entity Groups (from Baker et al, CIDR 11) IBM ICARE Winter School on Big Data, Oct. 2012

108 + 108 Multiversioning in Megastore Megastore implements multiversion concurrency control (MVCC) within an entity group when mutations within a transaction are applied, the values are written at the timestamp of their transaction. Readers use the timestamp of the last fully applied trans- action to avoid seeing partial updates Snapshot isolation Updates are applied after commit Current read applies all committed updates, and returns current value Readers and writers dont block each other IBM ICARE Winter School on Big Data, Oct. 2012

109 + 109 Megastore Replication Architecture (Baker et al. CIDR 11) IBM ICARE Winter School on Big Data, Oct. 2012

110 + 110 Managing Replicated Data in Megastore Options Master/slave replication: need to detect failure of master, and initiate transfer of mastership to new site Not an issue within a data center (e.g. GFS or Bigtable) Consensus protocols such as Paxos need to transfer mastership Megastore decided to go with Paxos for managing data update, not just for deciding master/coordinator Application can control placement of entity-group replicas Usually 3 to 4 data centers near place where entity group is accessed IBM ICARE Winter School on Big Data, Oct. 2012

111 + 111 Paxos in Megastore Key idea: replicate a write-ahead log over a group of symmetric peers (Optimized version of) Paxos used to append log records to log Each log append blocks on acknowledgments from a majority of replicas replicas in the minority catch up as they are able Paxos inherent fault tolerance eliminates the need for a distinguished failed state One log per entity group Stored in Bigtable row for root of entity group All logs of a transaction written at once Sequence number for an entity group (log position) read at beginning of txn, and add 1 to get new sequence number Concurrent transactions on same entity group will generate same sequence number only one succeeds in Paxos commit phase IBM ICARE Winter School on Big Data, Oct. 2012

112 + 112 Paxos in Megastore Paxos optimized for normal (non-failure) case Local reads at any up-to-date replica How? Coordinator at each data center tracks set of entity groups for which Paxos could update all replicas If Paxos failed on some replica, writes must inform coordinator Local txn checks with coordinator, and then does local read if coordinator confirms no failures Single-roundtrip writes Leader chosen for each entity group Writes sent to leader if successful all replicas honor leaders choice If leader not alive, fall back to normal Paxos Gives benefits of primary-copy mechanism under non-failure condition IBM ICARE Winter School on Big Data, Oct. 2012

113 + 113 Timeline of Write Algorithm (Baker et al. CIDR 11) IBM ICARE Winter School on Big Data, Oct. 2012

114 + 114 Performance of Megastore Optimized Paxos found to perform well Most reads are local Thanks to coordinator optimization, and The fact that most writes succeed on all replicas IBM ICARE Winter School on Big Data, Oct. 2012

115 + 115 Other (recent) stuff from Google 1: Spanner Descendant of Bigtable, Successor to Megastore Differences: hierarchical directories instead of rows, fine-grained replication; replication configuration at the per-directory level Globally distributed Replicas spread across country to survive regional disasters Synchronous cross-datacenter replication Transparent sharding, data movement General transactions Multiple reads followed by a single atomic write Local or cross-machine (using 2PC) Snapshot reads IBM ICARE Winter School on Big Data, Oct. 2012

116 + 116 Other (recent) stuff from Google 2: F1 Scalable Relation Storage F1: Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business (Shute et al. SIGMOD 2012) Hybrid combining Scalability of Bigtable Usability and functionality of SQL databases Key Ideas Scalability: Auto-sharded storage Availability & Consistency Synchronous replication, consistent indices High commit latency can be hidden Hierarchical schema and data storage IBM ICARE Winter School on Big Data, Oct. 2012

117 + 117 Queries and Transactions in F1 Preferred transaction structure One read phase: No serial reads Read in batches Read asynchronously in parallel Buffer writes in client, send as one RPC Client library with lightweight Object Relational Mapping IBM ICARE Winter School on Big Data, Oct. 2012

118 + 118 Other Work on Cloud Transaction Processing Efficient transaction processing on cloud storage Unbundling Transaction Services in the Cloud (Lomet et al, CIDR 09, CIDR 2011) Partition storage engine into two components Transactional component (TC): transactional recovery and concurrency control Data component (DC): tables, indexes, storage, and cache Focus on supporting ACID properties while minimizing latency due to round trups IBM ICARE Winter School on Big Data, Oct. 2012

119 + G-Store: A Scalable Data Store for Transactional Multi key Access in the Cloud Das et al. SoCC 10 IBM ICARE Winter School on Big Data, Oct. 2012

120 + 120 Dynamic Entity Groups Megastore entity groups are static Work very well for a variety of scenarios But there are some scenarios where static nature is limiting E.g. Casino scenario from Gstore paper, where set of users get together for a game, during which transaction guarantees are required across users Basic idea: items need not be statically grouped, grouping can be changed dynamically But grouping must be non-overlapping so that a data item is in only one group at a point in time IBM ICARE Winter School on Big Data, Oct. 2012

121 + 121 Gstore: Key contributions Key Group abstraction defines a unit for on-demand transactional access over a dynamically selected set of keys. Key Groups can be dynamically created and dissolved Application can select arbitrarily set of keys to form a Key Group. KeyGrouping protocol collocates ownership of all the keys in a group at a single node, thereby allowing efficient multi key access. IBM ICARE Winter School on Big Data, Oct. 2012

122 + 122 Group Formation Protocol IBM ICARE Winter School on Big Data, Oct Site (leader) initiates group formation 2.Leader invites followers to join group (J) 3.Followers accept (JA), 4.All group operations happen at site of Leader 5.Leader deletes group, freeing followers

123 + 123 Dealing with failures Protocol issues due to site failures, unreliable messaging Delayed/Lost/Duplicated messages, etc Details in Das et al paper. IBM ICARE Winter School on Big Data, Oct. 2012

124 + CAP Theorem: Consistency vs. Availability Basic principles Amazon Dynamo IBM ICARE Winter School on Big Data, Oct. 2012

125 + 125 What is Consistency? Consistency in Databases (ACID): Database has a set of integrity constraints A consistent database state is one where all integrity constraints are satisfied Each transaction run individually on a consistent database state must leave the database in a consistent state Consistency in distributed systems with replication Strong consistency: a schedule with read and write operations on an object should give results and final state equivalent to some schedule on a single copy of the object, with order of operations from a single site preserved Weak consistency (several forms) IBM ICARE Winter School on Big Data, Oct. 2012

126 + 126 Availability Traditionally, availability of centralized server For distributed systems, availability of system to process requests For large system, at almost any point in time theres a good chance that a node is down or even Network partitioning Distributed consensus algorithms will block during partitions to ensure consistency Many applications require continued operation even during a network partition Even at cost of consistency IBM ICARE Winter School on Big Data, Oct. 2012

127 + 127 Brewers CAP Theorem Three properties of a system Consistency (all copies have same value) Availability (system can run even if parts have failed) Via replication Partitions (network can break into two or more parts, each with active systems that cant talk to other parts) Brewers CAP Theorem: You can have at most two of these three properties for any system Very large systems will partition at some point Choose one of consistency or availablity Traditional database choose consistency Most Web applications choose availability Except for specific parts such as order processing IBM ICARE Winter School on Big Data, Oct. 2012

128 + 128 Replication with Weak Consistency Many systems support replication of data with weak degrees of consistency (I.e., without a guarantee of serializabiliy) i.e. Q R + Q W <= S or 2*Q W < S Usually only when not enough sites are available to ensure quorum But sometimes to allow fast local reads Tradeoff of consistency versus availability or latency Key issues: Reads may get old versions Writes may occur in parallel, leading to inconsistent versions Question: how to detect, and how to resolve Will see in detail shortly IBM ICARE Winter School on Big Data, Oct. 2012

129 + 129 Eventual Consistency When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID Soft state: copies of a data item may be inconsistent Eventually Consistent – copies becomes consistent at some later time if there are no more updates to that data item IBM ICARE Winter School on Big Data, Oct. 2012

130 + 130 Availability vs Latency Abadis classification system: PACELC CAP theorem only matters when there is a partition Even if partitions are rare, applications may trade off consistency for latency E.g. PNUTS allows inconsistent reads to reduce latency Critical for many applications But update protocol (via master) ensures consistency over availability Thus Abadi asks two questions: If there is Partitioning, how does system tradeoff Availability for Consistency Else how does system trade off Latency for Consistency E.g. Megastore: PC/EC PNUTS: PC/EL Dynamo (by default): PA/EL IBM ICARE Winter School on Big Data, Oct. 2012

131 + Amazon Dynamo Distributed data storage system supporting very high availability Even at cost of consistency E.g. motivation from Amazon: Web users should always be able to add items to their cart Even if they are connected to an app server which is now in a minority partition Data should be synchronized with majority partition eventually Inconsistency may be visible (briefly) to users preferable to losing a customer! DynamoDB: part of Amazon Web Service, can subscribe and use over the Web IBM ICARE Winter School on Big Data, Oct. 2012

132 + 132 Dynamo: Basics Provides a key-value store with basic get/put interface Data values entirely uninterpreted by system Unlike Bigtable, PNUTS, Megastore, etc. Underlying storage based on DHTs using consistent hashing with virtual processors Replication (N-ary) Data stored in node to which key is mapped, as well as N-1 consecutive successors in ring Replication at level of key range (virtual node) Put call may return before data has been stored on all replicas Reduces latency, at risk of consistency Programmer can control degree of consistency (Q R, Q W and S) per instance (relation) IBM ICARE Winter School on Big Data, Oct. 2012

133 + 133 Dynamo: Versioning of Data Data items are versioned Each update creates a new immutable version In absence of failure, there is a single latest version But with failures, version can diverge Need to detect, and fix such situations Key idea: vector clock identifies each data version Set of (node, counter) pairs Define a partial order across versions Read operation may find incomparable versions All such versions returned by read operation Up to application to reconcile multiple versions IBM ICARE Winter School on Big Data, Oct. 2012

134 + 134 Vector Clocks for Versioning Examples of vector clocks ([Sx,1]): data item created by site Sx ([Sx,2]): data item created by site Sx, and later updated ([Sx,2],[Sy,1]): data item updated twice by site Sx and once by Sy Update by a site Sx increments the counter for Sx, but leaves counters from other sites unchanged ([Sx,4],[Sy,1]) newer than ([Sx,3],[Sy,1]) But ([Sx,2],[Sy,1]) incomparable with ([Sx,1],[Sy,2]) A practical issue: Vector clock can become long Updates handled by one of the sites containing a replica Timestamps used to prune out very old members of vector update presumed to have reached all replicas IBM ICARE Winter School on Big Data, Oct Vector clocks have been around a long while, not invented by Dynamo

135 + 135 Example of Vector Clock (from DeCandia et al., SOSP 2007) IBM ICARE Winter School on Big Data, Oct Item created by site Sx Item concurrently updated by site Sy and Sz (usually due to network partitioning) 1.Read at Sx returns two incomparable versions 2.Application merges versions and writes new version Item updated by site Sx

136 + 136 Performing Put/Get Operations Get/put requests handled by a coordinator (one of the nodes containing a replica of the item) Upon receiving a put() request for a key the coordinator generates the vector clock for the new version and writes the new version locally The coordinator then sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes. If at least Q W -1 nodes respond then the write is considered successful. For a get() request the coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, Waits for Q R responses before returning the result to the client. Returns all causally unrelated (incomparable) versions Application should reconcile divergent versions and write back a reconciled version superseding the current versions IBM ICARE Winter School on Big Data, Oct. 2012

137 + 137 How to Reconcile Inconsistent Versions? Reconciliation is application specific E.g. two sites concurrent insert items to cart Merge adds both items to the final cart state E.g. S1 adds item A, S2 deletes item B Merge adds item A, but deleted item B resurfaces Cannot distinguish S2 deletes B from S1 add B Problem: operations are inferred from states of divergent versions Better solution (not supported in Dynamo) keep track of history of operations IBM ICARE Winter School on Big Data, Oct. 2012

138 + 138 Dealing with Failure Hinted handoff: If replica is not available at time of write, write performed on another (live) node, Write operations will not fail due to temporary node unavailability Risk of divergence of data value due to above To detect divergence Each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts Nodes exchange roots of Merkle trees with other replicas to detect any divergence IBM ICARE Winter School on Big Data, Oct. 2012

139 + 139 Consistency Models for Applications Read-your-writes if a process has performed a write, a subsequent read will reflect the earlier write operation Session consistency Read-your-writes in the context of a session, where application connects to storage system Monotonic consistency For reads: later reads never return older version than earlier reads For writes: serializes writes by a single proces Minimum requirement IBM ICARE Winter School on Big Data, Oct. 2012

140 + 140 Implementing Consistency Models Sticky sessions: all operations from a session on a data item go to the same node Get operations can specify a version vector Result of get guaranteed to be at least as new as specified version vector IBM ICARE Winter School on Big Data, Oct. 2012

141 + Conclusions, Future Work, and References IBM ICARE Winter School on Big Data, Oct. 2012

142 + 142 Conclusions New generation of massively parallel/distributed data storage systems developed to address issues of scale, availability and latency in extremely large scale Web applications Revisiting old issues in terms of replication, consistency and distributed consensus Some new technical ideas, but key contributions are in terms of engineering at scale Developers need to be aware of Tradeoffs between consistency, availability and latency How to choose between these for different parts of their applications And how to enforce their choice on the data storage platform Good old days of database ensuring ACID properties are gone! But life is a lot more interesting now IBM ICARE Winter School on Big Data, Oct. 2012

143 + 143 Future Work We are at the beginning of a new wave Programmers expected to open the hood and get hands dirty Programming model to deal with different consistency levels for different parts of an application Hide consistency issues as much as possible from programmer Need better support for logical operations to allow resolving of conflicting updates At system level At programmer level, to reason about application consistency E.g. Relationship between Consistency And Logical Monotonicity (CALM principle): Monotonic programs guarantee eventual consistency under any interleaving of delivery and computation. Tools for testing/verification IBM ICARE Winter School on Big Data, Oct. 2012

144 + 144 Some References D. Abadi, Consistency Tradeoffs in Modern Distributed Database System Design, IEEE Computer, Feb 2012 P. Agrawal et al, Asynchronous View Maintenance for VLSD Databases, SIGMOD 2009 P. Alvaro et al., Consistency Analysis in Bloom: a CALM and Collected Approach, CIDR 2011 J. Baker et al, Megastore: Providing Scalable, Highly Available Storage for Interactive Services, CIDR 2011 F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 2006 G. DeCandia et al., Dynamo: Amazons Highly Available Key-value Store, SOSP 2007 S. Das et al, G-Store: A Scalable Data Store for Transactional Multi key Access in the Cloud, SoCC, June 2012 S. Finkelstein et al., Transactional Intent, CIDR 2011 S. Gilbert and N. Lynch, Perspectives on the CAP Theorem, IEEE Computer, Feb 2012 L. Lamport, Paxos Made Simple ACM SIGACT News (Distributed Computing Column) 32, 4, 2011 W. Vogels, Eventual Consistency, CACM Jan 2008 IBM ICARE Winter School on Big Data, Oct. 2012

145 + END OF TALK

146 + 146 Figure IBM ICARE Winter School on Big Data, Oct. 2012

147 + 147 Figure IBM ICARE Winter School on Big Data, Oct. 2012

148 + 148 Figure IBM ICARE Winter School on Big Data, Oct. 2012

149 + 149 Bully Algorithm If site S i sends a request that is not answered by the coordinator within a time interval T, assume that the coordinator has failed S i tries to elect itself as the new coordinator. S i sends an election message to every site with a higher identification number, S i then waits for any of these processes to answer within T. If no response within T, assume that all sites with number greater than i have failed, S i elects itself the new coordinator. If answer is received S i begins time interval T, waiting to receive a message that a site with a higher identification number has been elected. IBM ICARE Winter School on Big Data, Oct. 2012

150 + 150 Bully Algorithm (Cont.) If no message is sent within T, assume the site with a higher number has failed; S i restarts the algorithm. After a failed site recovers, it immediately begins execution of the same algorithm. If there are no active sites with higher numbers, the recovered site forces all processes with lower numbers to let it become the coordinator site, even if there is a currently active coordinator with a lower number. IBM ICARE Winter School on Big Data, Oct. 2012

151 + Background: Concurrency Control and Replication in Distributed Systems IBM ICARE Winter School on Big Data, Oct. 2012

152 + 152 Concurrency Control Modify concurrency control schemes for use in distributed environment. We assume that each site participates in the execution of a commit protocol to ensure global transaction atomicity. We assume all replicas of any item are updated Will see how to relax this in case of site failures later IBM ICARE Winter School on Big Data, Oct. 2012

153 + 153 Single-Lock-Manager Approach System maintains a single lock manager that resides in a single chosen site, say S i When a transaction needs to lock a data item, it sends a lock request to S i and lock manager determines whether the lock can be granted immediately If yes, lock manager sends a message to the site which initiated the request If no, request is delayed until it can be granted, at which time a message is sent to the initiating site IBM ICARE Winter School on Big Data, Oct. 2012

154 + 154 Single-Lock-Manager Approach (Cont.) The transaction can read the data item from any one of the sites at which a replica of the data item resides. Writes must be performed on all replicas of a data item Advantages of scheme: Simple implementation Simple deadlock handling Disadvantages of scheme are: Bottleneck: lock manager site becomes a bottleneck Vulnerability: system is vulnerable to lock manager site failure. IBM ICARE Winter School on Big Data, Oct. 2012

155 + 155 Distributed Lock Manager In this approach, functionality of locking is implemented by lock managers at each site Lock managers control access to local data items But special protocols may be used for replicas Advantage: work is distributed and can be made robust to failures Disadvantage: deadlock detection is more complicated Lock managers cooperate for deadlock detection More on this later Several variants of this approach Primary copy Majority protocol Biased protocol Quorum consensus IBM ICARE Winter School on Big Data, Oct. 2012

156 + Distributed Commit Protocols IBM ICARE Winter School on Big Data, Oct. 2012

157 + 157 Two Phase Commit Protocol (2PC) Assumes fail-stop model – failed sites simply stop working, and do not cause any other harm, such as sending incorrect messages to other sites. Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached. The protocol involves all the local sites at which the transaction executed Let T be a transaction initiated at site S i, and let the transaction coordinator at S i be C i IBM ICARE Winter School on Big Data, Oct. 2012

158 + 158 Phase 1: Obtaining a Decision Coordinator asks all participants to prepare to commit transaction T i. C i adds the records to the log and forces log to stable storage sends prepare T messages to all sites at which T executed Upon receiving message, transaction manager at site determines if it can commit the transaction if not, add a record to the log and send abort T message to C i if the transaction can be committed, then: add the record to the log force all records for T to stable storage send ready T message to C i IBM ICARE Winter School on Big Data, Oct. 2012

159 + 159 Phase 2: Recording the Decision T can be committed of C i received a ready T message from all the participating sites: otherwise T must be aborted. Coordinator adds a decision record, or, to the log and forces record onto stable storage. Once the record stable storage it is irrevocable (even if failures occur) Coordinator sends a message to each participant informing it of the decision (commit or abort) Participants take appropriate action locally. IBM ICARE Winter School on Big Data, Oct. 2012

160 + 160 Handling of Failures - Site Failure When site S k recovers, it examines its log to determine the fate of transactions active at the time of the failure. Log contain record: site executes redo (T) Log contains record: site executes undo (T) Log contains record: site must consult C i to determine the fate of T. If T committed, redo (T) If T aborted, undo (T) The log contains no control records concerning T implies that S k failed before responding to the prepare T message from C i since the failure of S k precludes the sending of such a response C i must abort T S k must execute undo (T) IBM ICARE Winter School on Big Data, Oct. 2012

161 + 161 Handling of Failures- Coordinator Failure If coordinator fails while the commit protocol for T is executing then participating sites must decide on Ts fate: 1. If an active site contains a record in its log, then T must be committed. 2. If an active site contains an record in its log, then T must be aborted. 3. If some active participating site does not contain a record in its log, then the failed coordinator C i cannot have decided to commit T. Can therefore abort T. 4. If none of the above cases holds, then all active sites must have a record in their logs, but no additional control records (such as of ). In this case active sites must wait for C i to recover, to find decision. Blocking problem: active sites may have to wait for failed coordinator to recover. IBM ICARE Winter School on Big Data, Oct. 2012

162 + 162 Handling of Failures - Network Partition If the coordinator and all its participants remain in one partition, the failure has no effect on the commit protocol. If the coordinator and its participants belong to several partitions: Sites that are not in the partition containing the coordinator think the coordinator has failed, and execute the protocol to deal with failure of the coordinator. No harm results, but sites may still have to wait for decision from coordinator. The coordinator and the sites are in the same partition as the coordinator think that the sites in the other partition have failed, and follow the usual commit protocol. Again, no harm results IBM ICARE Winter School on Big Data, Oct. 2012

163 + 163 Recovery and Concurrency Control In-doubt transactions have a, but neither a, nor an log record. The recovering site must determine the commit-abort status of such transactions by contacting other sites; this can slow and potentially block recovery. Recovery algorithms can note lock information in the log. Instead of, write out L = list of locks held by T when the log is written (read locks can be omitted). For every in-doubt transaction T, all the locks noted in the log record are reacquired. After lock reacquisition, transaction processing can resume; the commit or rollback of in-doubt transactions is performed concurrently with the execution of new transactions. IBM ICARE Winter School on Big Data, Oct. 2012

164 + 164 Three Phase Commit (3PC) Assumptions: No network partitioning At any point, at least one site must be up. At most K sites (participants as well as coordinator) can fail Phase 1: Obtaining Preliminary Decision: Identical to 2PC Phase 1. Every site is ready to commit if instructed to do so Phase 2 of 2PC is split into 2 phases, Phase 2 and Phase 3 of 3PC In phase 2 coordinator makes a decision as in 2PC (called the pre-commit decision) and records it in multiple (at least K) sites In phase 3, coordinator sends commit/abort message to all participating sites, Under 3PC, knowledge of pre-commit decision can be used to commit despite coordinator failure Avoids blocking problem as long as < K sites fail Drawbacks: higher overheads assumptions may not be satisfied in practice IBM ICARE Winter School on Big Data, Oct. 2012

165 + 165 Timestamping Timestamp based concurrency-control protocols can be used in distributed systems Each transaction must be given a unique timestamp Main problem: how to generate a timestamp in a distributed fashion Each site generates a unique local timestamp using either a logical counter or the local clock. Global unique timestamp is obtained by concatenating the unique local timestamp with the unique identifier. IBM ICARE Winter School on Big Data, Oct. 2012

166 + 166 Timestamping (Cont.) A site with a slow clock will assign smaller timestamps Still logically correct: serializability not affected But: disadvantages transactions To fix this problem Define within each site S i a logical clock (LC i ), which generates the unique local timestamp Require that S i advance its logical clock whenever a request is received from a transaction Ti with timestamp and x is greater that the current value of LC i. In this case, site S i advances its logical clock to the value x + 1. IBM ICARE Winter School on Big Data, Oct. 2012

167 + 167 Comparison with Remote Backup Remote backup (hot spare) systems (Section 17.10) are also designed to provide high availability Remote backup systems are simpler and have lower overhead All actions performed at a single site, and only log records shipped No need for distributed concurrency control, or 2 phase commit Using distributed databases with replicas of data items can provide higher availability by having multiple (> 2) replicas and using the majority protocol Also avoid failure detection and switchover time associated with remote backup systems IBM ICARE Winter School on Big Data, Oct. 2012


Download ppt "+ Massively Parallel/Distributed Data Storage Systems S. Sudarshan IIT Bombay Derived from an earlier talk by S. Sudarshan, presented at the MSRI Summer."

Similar presentations


Ads by Google