Cloud Computing and Scalable Data Management


Cloud Computing and Scalable Data Management Jiaheng Lu and Sai Wu Renmin University of China, National University of Singapore APWeb’2011 Tutorial

Outline
Part 1: Cloud computing; Map/Reduce, Bigtable and PNUTS
Part 2: CAP Theorem and Datalog; Data indexing in the clouds; Conclusion and open issues
APWeb 2011

Cloud Computing

Why do we use cloud computing?

Why do we use cloud computing? Case 1: Writing a file. Save it locally: if the computer goes down, the file is lost. Save it in the cloud: files are always stored remotely and never lost.

Why do we use cloud computing? Case 2: Use IE --- download, install, use. Use QQ --- download, install, use. Use C++ --- download, install, use. …… Instead, get the service from the cloud.

What is cloud and cloud computing? Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.

Characteristics of cloud computing
Virtual: software, databases, Web servers, operating systems, storage, and networking are provided as virtual servers.
On demand: add and remove processors, memory, network bandwidth, and storage as needed.

Types of cloud service
SaaS: Software as a Service
PaaS: Platform as a Service
IaaS: Infrastructure as a Service

Outline
Part 1: Cloud computing; Map/Reduce, Bigtable and PNUTS
Part 2: CAP Theorem and Datalog; Data indexing in the clouds; Conclusion and open issues
APWeb 2011

Introduction to MapReduce

MapReduce Programming Model
Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.
Users implement an interface with two primary methods:
1. Map: (key1, val1) → (key2, val2)
2. Reduce: (key2, [val2]) → [val3]

Map operation
Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs, e.g. (doc-id, doc-content).
Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.

Reduce operation
On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer.
It can be visualized as an aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.

Pseudo-code
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
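For concreteness, here is a minimal runnable sketch of the same word count in Python, simulating the map, shuffle, and reduce phases in a single process (the dictionary-based shuffle and all function names are illustrative, not Hadoop's API):

from collections import defaultdict

def map_fn(doc_id, contents):
    # Map: emit (word, 1) for every word in the document.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum the partial counts for one word.
    return (word, sum(counts))

def run_mapreduce(documents):
    grouped = defaultdict(list)
    # Shuffle: group intermediate values by key (done by the framework
    # across machines in a real deployment).
    for doc_id, contents in documents.items():
        for key, value in map_fn(doc_id, contents):
            grouped[key].append(value)
    # Reduce: one call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
print(run_mapreduce(docs))  # {'the': 2, 'quick': 1, 'brown': 1, ...}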

MapReduce: Execution overview

MapReduce: Example

Research works
MapReduce is slower than parallel databases by a factor of 3.1 to 6.5 [1][2].
By adopting a hybrid architecture, the performance of MapReduce can approach that of parallel databases [3].
MapReduce is an efficient, flexible tool [4].
Numerous discussions in the MapReduce community…
[1] A comparison of approaches to large-scale data analysis. SIGMOD 2009.
[2] MapReduce and parallel DBMSs: friends or foes? CACM 2010.
[3] HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. VLDB 2009.
[4] MapReduce: a flexible data processing tool. CACM 2010.
The first two papers speculate about contributing factors but offer no further analysis. [3] raises the question of whether MapReduce itself can achieve the same performance. No benchmark data supports the claims in [4].

An in-depth study (1)
A performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism [5]. Four factors are examined: scheduling, I/O modes, record parsing, and indexing.
[5] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu: The Performance of MapReduce: An In-depth Study. PVLDB 3(1): 472-483 (2010)

An in-depth study (2)
By carefully tuning these factors, the overall performance of Hadoop can be comparable to that of parallel database systems. It is therefore possible to build a cloud data processing system that is both elastically scalable and efficient.
[5] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu: The Performance of MapReduce: An In-depth Study. PVLDB 3(1): 472-483 (2010)

Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database
Christopher Yang, Christine Yen, Ceryen Tan, Samuel Madden. ICDE 2010: 657-668

Problem proposed
Problem: node failures in a distributed database system. Faults are common on large clusters of machines.
Existing solution: abort and (possibly) restart the query. This is a reasonable approach for short OLTP-style queries, but it wastes too much work for long-running analytical (OLAP) warehouse queries.

MapReduce-style fault tolerance for a SQL database
Break up a SQL query (or program) into smaller, parallelizable subqueries.
Adopt MapReduce's load-balancing strategy of greedy assignment of work.

Osprey A middleware implementation of MapReduce-style fault tolerance for a SQL database

SQL procedure
Query → Query Transformer → subqueries → partition work queues (PWQ A, PWQ B, PWQ C) → Scheduler → Execution → Result Merger → Result
(PWQ: partition work queue)
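A toy Python sketch of this pipeline, assuming a per-partition rewrite of the query (transform, execute, and merge here are hypothetical stand-ins, not Osprey's actual code): subqueries sit on a shared work queue and idle workers greedily pull the next one.

import threading
from queue import Queue, Empty

N_PARTITIONS = 6  # hypothetical partition count

def transform(query):
    # Query Transformer: one subquery per horizontal partition.
    return [f"{query} /* partition {p} */" for p in range(N_PARTITIONS)]

def execute(subquery):
    # Stand-in for running one subquery against one partition.
    return [subquery]

def merge(parts):
    # Result Merger: concatenate the per-partition results.
    return [row for part in parts for row in part]

def run(query, n_workers=3):
    work = Queue()
    for sq in transform(query):
        work.put(sq)
    results = []
    def worker():
        # Greedy assignment: each worker pulls the next subquery as soon
        # as it is free; a failed subquery would simply be re-queued.
        while True:
            try:
                sq = work.get_nowait()
            except Empty:
                return
            results.append(execute(sq))
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return merge(results)

print(len(run("SELECT COUNT(*) FROM lineitem")))  # 6 partial results merged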

MapReduce Online Evaluation Platform

MapReduce Online Evaluation: site structure
User: Register, Login, Update Info
Problem: Scan Problem, Submit Solution, Submission History
Theory test: Test Result
Help: FAQs, Hadoop Quick Start
School of Information, Renmin University of China

Cloud data management

Four new principles in Cloud-based data management

New principle in cloud data management (1)
Partition everything and key-value storage 切分万物以治之
First normal form cannot be satisfied.
Example: GFS is built from hundreds or thousands of machines assembled from inexpensive commodity parts and accessed by a comparable number of client machines.
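A minimal sketch of hash-based key-value partitioning (the partition count and key format are hypothetical):

import hashlib

N_PARTITIONS = 16  # hypothetical; real systems size this per cluster

def partition_for(key: str) -> int:
    # Hash the record key to choose its partition, spreading data and
    # load evenly across machines (no cross-row schema, no joins).
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_PARTITIONS

print(partition_for("user:alice"))  # e.g. 9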

New principle in cloud data management (2)
Embrace inconsistency 容不同乃成大同
ACID properties are not satisfied.

New principle in cloud data management (3)
Back up everything with three copies 狡兔三窟方高枕
Guarantee 99.999999% safety.

New principle in cloud data management (4)
Scalable and high performance 运筹沧海量兼容

Cloud data management
切分万物以治之 Partition everything
容不同乃成大同 Embrace inconsistency
狡兔三窟方高枕 Back up data with three copies
运筹沧海量兼容 Scalable and high performance

BigTable: A Distributed Storage System for Structured Data

Introduction
BigTable is a distributed storage system for managing structured data.
Designed to scale to a very large size: petabytes of data across thousands of servers.
Used for many Google projects: web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, …
A flexible, high-performance solution for all of Google’s products.

Why not just use a commercial DB?
Scale is too large for most commercial databases.
Even if it weren’t, the cost would be very high.
Building internally means the system can be applied across many projects at low incremental cost.
Low-level storage optimizations help performance significantly, and are much harder to do when running on top of a database layer.

Goals
Want asynchronous processes to be continuously updating different pieces of data, with access to the most current data at any time.
Need to support:
Very high read/write rates (millions of ops per second)
Efficient scans over all or interesting subsets of data
Efficient joins of large one-to-one and one-to-many datasets
Often want to examine data changes over time, e.g. the contents of a web page over multiple crawls.

BigTable
Distributed multi-level map
Fault-tolerant, persistent
Scalable: thousands of servers; terabytes of in-memory data; petabytes of disk-based data; millions of reads/writes per second; efficient scans
Self-managing: servers can be added/removed dynamically and adjust to load imbalance

Basic Data Model
A BigTable is a sparse, distributed, persistent multi-dimensional sorted map:
(row, column, timestamp) -> cell contents
A good match for most Google applications.
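A toy in-memory Python model of this map (illustrative only; real Bigtable persists tablets as SSTables, and the class and method names here are made up):

import time
from collections import defaultdict

class SparseTable:
    # Toy model of Bigtable's map: (row, column, timestamp) -> contents.
    def __init__(self):
        self.cells = defaultdict(list)  # (row, column) -> [(ts, value), ...]

    def put(self, row, column, value, ts=None):
        ts = time.time() if ts is None else ts
        versions = self.cells[(row, column)]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row, column, k=1):
        # "Return most recent K values" lookup.
        return self.cells[(row, column)][:k]

t = SparseTable()
t.put("com.cnn.www", "contents:", "<html>v1</html>", ts=1)
t.put("com.cnn.www", "contents:", "<html>v2</html>", ts=2)
print(t.get("com.cnn.www", "contents:"))  # [(2, '<html>v2</html>')]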

WebTable Example
Want to keep a copy of a large collection of web pages and related information.
Use URLs as row keys, and various aspects of the web page as column names.
Store the contents of web pages in the contents: column under the timestamps when they were fetched.

Rows
The row name is an arbitrary string.
Access to data in a row is atomic.
Row creation is implicit upon storing data.
Rows are ordered lexicographically, so rows close together lexicographically usually sit on one or a small number of machines.

Rows (cont.)
Reads of short row ranges are efficient and typically require communication with a small number of machines.
Applications can exploit this property by selecting row keys that give good locality for data access.
Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu VS edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
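A tiny illustration of the second naming scheme (a hypothetical helper, not part of Bigtable's API):

def locality_key(host):
    # Reverse the hostname's components so pages from the same domain
    # sort adjacently and land on the same (or neighboring) tablets.
    return ".".join(reversed(host.split(".")))

print(locality_key("math.gatech.edu"))  # edu.gatech.math
print(locality_key("phys.gatech.edu"))  # edu.gatech.phys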

Columns
Columns have a two-level name structure: family:optional_qualifier.
Column family: the unit of access control; has associated type information.
The qualifier gives unbounded columns: additional levels of indexing, if desired.

Timestamps
Used to store different versions of data in a cell.
New writes default to the current time, but timestamps for writes can also be set explicitly by clients.
Lookup options: “return most recent K values”; “return all values in timestamp range (or all values)”.
Column families can be marked with attributes: “only retain most recent K values in a cell”; “keep values until they are older than K seconds”.

Implementation – Three Major Components
A library linked into every client.
One master server, responsible for: assigning tablets to tablet servers; detecting the addition and expiration of tablet servers; balancing tablet-server load; garbage collection.
Many tablet servers. Each tablet server handles read and write requests to its tablets and splits tablets that have grown too large.

Tablets
Large tables are broken into tablets at row boundaries; a tablet holds a contiguous range of rows.
Clients can often choose row keys to achieve locality.
Aim for ~100MB to 200MB of data per tablet; each serving machine is responsible for ~100 tablets.
Fast recovery: 100 machines each pick up 1 tablet from a failed machine.
Fine-grained load balancing: migrate tablets away from an overloaded machine; the master makes load-balancing decisions.

Refinements: Locality Groups
Multiple column families can be grouped into a locality group; a separate SSTable is created for each locality group in each tablet.
Segregating column families that are not typically accessed together enables more efficient reads.
In WebTable, page metadata can be in one group and the contents of the page in another group.

Refinements: Compression
Many opportunities for compression: similar values in the same row/column at different timestamps; similar values in different columns; similar values across adjacent rows.
Two-pass custom compression scheme. First pass: compress long common strings across a large window. Second pass: look for repetitions in a small window.
Speed is emphasized, but space reduction is still good (10-to-1).

Refinements: Bloom Filters
A read operation has to read from disk when the desired SSTable isn’t in memory.
Reduce the number of accesses by specifying a Bloom filter, which allows us to ask whether an SSTable might contain data for a specified row/column pair.
A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations: most lookups for non-existent rows or columns never need to touch disk.
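A minimal Bloom filter sketch in Python, showing why a negative answer is definite while a positive answer only means "might contain" (m and k are illustrative parameters, not Bigtable's):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k   # m bits, k hash functions
        self.bits = 0           # bit array packed into one int

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False: definitely absent (skip the disk seek).
        # True: possibly present (must read the SSTable).
        return all((self.bits >> p) & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/contents:")
print(bf.might_contain("com.cnn.www/contents:"))  # True
print(bf.might_contain("missing-row/anchor:"))    # almost certainly False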

PNUTS / SHERPA: To Help You Scale Your Mountains of Data
A project in Y!R focused on a long-range problem, with origins in earlier work at Wisconsin.
Basis for the Goldrush hack, which won the recent Local hack competition, and could contribute to the creation/refinement of Y! Local content and Next Gen Search.

Yahoo! Serving Storage Problem
Small records: 100KB or less
Structured records: lots of fields, evolving
Extreme data scale: tens of TB
Extreme request scale: tens of thousands of requests/sec
Low latency globally: 20+ datacenters worldwide
High availability: outages cost millions of dollars
Variable usage patterns: as applications and users change

What is PNUTS/Sherpa?
CREATE TABLE Parts (ID VARCHAR, StockNumber INT, Status VARCHAR, …)
Structured, flexible schema; geographic replication; parallel database; hosted, managed infrastructure.
(The original slide shows the Parts records replicated across several regions.)

What Will It Become?
All of the above, plus indexes and views.

Technology Elements
Applications
PNUTS API / Tabular API
PNUTS: query planning and execution; index maintenance
Distributed infrastructure for tabular data: data partitioning; update consistency; replication
YDOT FS (ordered tables); YDHT FS (hash tables)
Tribble: pub/sub messaging
Zookeeper: consistency service
YCA: authorization

Data Manipulation
Per-record operations: Get, Set, Delete
Multi-record operations: Multiget, Scan, Getrange

Tablets—Hash Table (columns: Name, Description, Price)
Tablet [0x0000, 0x2AF3): Grape (Grapes are good to eat, $12); Lime (Limes are green, $9); Apple (Apple is wisdom, $1); Strawberry (Strawberry shortcake, $900)
Tablet [0x2AF3, 0x911F): Orange (Arrgh! Don’t get scurvy!, $2); Avocado (But at what price?, $3); Lemon (How much did you pay for this lemon?, $1); Tomato (Is this a vegetable?, $14)
Tablet [0x911F, 0xFFFF]: Banana (The perfect fruit, $2); Kiwi (New Zealand, $8)

Tablets—Ordered Table (columns: Name, Description, Price)
Tablet [A, H): Apple (Apple is wisdom, $1); Avocado (But at what price?, $3); Banana (The perfect fruit, $2); Grape (Grapes are good to eat, $12)
Tablet [H, Q): Kiwi (New Zealand, $8); Lemon (How much did you pay for this lemon?, $1); Lime (Limes are green, $9); Orange (Arrgh! Don’t get scurvy!, $2)
Tablet [Q, Z]: Strawberry (Strawberry shortcake, $900); Tomato (Is this a vegetable?, $14)
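A toy router in Python contrasting the two layouts, with tablet boundaries taken from the two examples above (illustrative, not PNUTS code):

import bisect
import hashlib

HASH_UPPER = [0x2AF3, 0x911F, 0x10000]  # exclusive upper bound per hash tablet
ORDERED_LOWER = ["A", "H", "Q"]         # inclusive lower bound per ordered tablet

def hash_tablet(key: str) -> int:
    # Hash layout: load spreads evenly, but key order is lost.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:2], "big")
    return bisect.bisect_right(HASH_UPPER, h)

def ordered_tablet(key: str) -> int:
    # Ordered layout: lexicographic ranges, so range scans stay local.
    return bisect.bisect_right(ORDERED_LOWER, key) - 1

print(hash_tablet("Lime"), ordered_tablet("Lime"))  # e.g. 0 1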

Flexible Schema
A listings table where all rows share the core columns (Posted date, Listing id, Item, Price):
6/1/07, 424252, Couch, $570; 763245, Bike, $86; 6/3/07, 211242, Car, $1123; 6/5/07, 421133, Lamp, $15.
Individual rows can also carry extra columns, e.g. Condition (Good, Fair) or Color (Red), without a schema change.

Detailed Architecture
Local region: Clients → REST API → Routers → Storage units, with a Tablet Controller managing tablet placement and Tribble carrying updates to the remote regions.

Tablet Splitting and Balancing
Each storage unit has many tablets (horizontal partitions of the table), and a storage unit may become a hotspot.
Tablets may grow over time; overfull tablets split.
Load is shed by moving tablets to other servers.

QUERY PROCESSING

Accessing Data
(1) The client sends Get(key k) to a router. (2) The router forwards Get(key k) to the storage unit (SU) holding k. (3) The SU returns the record for key k. (4) The router returns it to the client.

Bulk Read
(1) The client sends the key set {k1, k2, …, kn} to a scatter/gather server. (2) The scatter/gather server issues parallel Gets (Get k1, Get k2, Get k3, …) to the storage units holding the keys and assembles the results.

Range Queries in YDOT
Clustered, ordered retrieval of records.
The router maps key intervals to storage units (in the example: keys up to Canteloupe on storage unit 1, Canteloupe…Lime on storage unit 3, Lime…Strawberry on storage unit 2). A range query such as Grapefruit…Pear is therefore split at the tablet boundaries (Grapefruit…Lime, Lime…Pear) and sent only to the storage units whose tablets intersect it.

Updates
(1) The client sends a write of key k to a router. (2) The router forwards the write to the storage unit that masters key k. (3–5) That storage unit publishes the write to the message brokers, which commit it and return a sequence number for key k. (6–8) The client receives SUCCESS, while the brokers asynchronously deliver the write to the replica storage units.

SHERPA IN CONTEXT

Types of Record Stores: query expressiveness
From simple to feature rich: S3 (object retrieval) → PNUTS (retrieval from a single table of objects/records) → Oracle (SQL).

Types of Record Stores: consistency model
From best effort to strong guarantees: S3 (eventual consistency) → PNUTS (timeline consistency, object-centric) → Oracle (ACID, program-centric).

Types of Record Stores: data model
From flexibility and schema evolution to optimization for fixed schemas: CouchDB → PNUTS → Oracle. Object-centric consistency at the flexible end; consistency that spans objects at the fixed-schema end.

Types of Record Stores: elasticity (ability to add resources on demand)
From inelastic to elastic: Oracle (limited, via data distribution) → PNUTS → S3 (VLSD: very large scale distribution/replication).

Application Design Space
Two axes: records vs. files, and “get a few things” vs. “scan everything”.
Point lookups over records: Sherpa, YMDB, MySQL, Oracle, BigTable; point lookups over files: MObStor, Filer; scans over records: Everest, BigTable; scans over files: Hadoop.

Alternatives Matrix
Dimensions compared: structured access, global low latency, SQL/ACID, consistency model, updates, availability, operability, elasticity.
Systems compared: Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, Cassandra.

Outline
Part 1: Cloud computing; Map/Reduce, Bigtable and PNUTS
Part 2: CAP Theorem and Datalog; Data indexing in the clouds; Conclusion and open issues
APWeb 2011

The CAP Theorem
Consistency, Availability, Partition tolerance.
(PODC: the ACM Symposium on Principles of Distributed Computing.)

The CAP Theorem: Consistency
Once a writer has written, all readers will see that write.

The CAP Theorem: Availability
The system is available during software and hardware upgrades and node failures.

The CAP Theorem: Partition tolerance
The system can continue to operate in the presence of network partitions.

The CAP Theorem
Theorem: you can have at most two of these three properties (Consistency, Availability, Partition tolerance) for any shared-data system.

Consistency
Two kinds of consistency:
Strong consistency: ACID (Atomicity, Consistency, Isolation, Durability)
Weak consistency: BASE (Basically Available, Soft-state, Eventual consistency)
At least one replica in the cluster is up to date; the other replicas will eventually become consistent.

A traditional RDBMS: LOCK, ACID, SAFETY, TRANSACTION, 3NF

Datalog
Main expressive advantage: recursive queries.
More convenient for analysis: papers look better.
Without recursion but with negation, it is equivalent in power to relational algebra.
Has affected real practice: e.g., recursion in SQL3, magic-sets transformations.

Datalog Example
Datalog program:
parent(bill,mary).
parent(mary,john).
ancestor(X,Y) :- parent(X,Y).
ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
Query: ?- ancestor(bill,X)
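A naive bottom-up (fixpoint) evaluation of this program, sketched in Python for illustration:

def ancestors(parent_facts):
    # ancestor(X,Y) :- parent(X,Y).
    ancestor = set(parent_facts)
    changed = True
    while changed:  # iterate until no new facts are derived (fixpoint)
        changed = False
        # ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
        for (x, z) in parent_facts:
            for (z2, y) in list(ancestor):
                if z == z2 and (x, y) not in ancestor:
                    ancestor.add((x, y))
                    changed = True
    return ancestor

facts = {("bill", "mary"), ("mary", "john")}
print(sorted(y for (x, y) in ancestors(facts) if x == "bill"))
# ['john', 'mary']  -- the answers to ?- ancestor(bill,X)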

Joseph’s Conjecture (1) CONJECTURE 1. Consistency And Logical Monotonicity (CALM). A program has an eventually consistent, coordination-free execution strategy if and only if it is expressible in (monotonic) Datalog.

Joseph’s Conjecture (2) CONJECTURE 2. Causality Required Only for Non-monotonicity (CRON). Program semantics require causal message ordering if and only if the messages participate in non-monotonic derivations.

Joseph’s Conjecture (3) CONJECTURE 3. The minimum number of Dedalus timesteps required to evaluate a program on a given input data set is equivalent to the program’s Coordination Complexity.

Joseph’s Conjecture (4) CONJECTURE 4. Any Dedalus program P can be rewritten into an equivalent temporally-minimized program P’ such that each inductive or asynchronous rule of P’ is necessary: converting that rule to a deductive rule would result in a program with no unique minimal model.

Circumstance has presented a rare opportunity—call it an imperative—for the database community to take its place in the sun, and help create a new environment for parallel and distributed computation to flourish. -- Joseph M. Hellerstein (UC Berkeley)

Open questions and conclusion

Open Questions
What is the right consistency model?
What is the right programming model?
Whether and how to make use of caching?
How to balance functionality and scale?
What are the right cloud abstractions?
Cloud inter-operability
VLDB 2010 Tutorial

Concluding
Data management for cloud computing poses fundamental challenges to database researchers: scalability, reliability, data consistency, and elasticity.
The database community needs to be involved; maintaining the status quo will only marginalize our role.
VLDB 2010 Tutorial

New Textbook: “Distributed Systems and Cloud Computing” 《分布式系统与云计算》
分布式系统概述 (Introduction to Distributed Systems)
分布式云计算技术综述 (Survey of Distributed Cloud Computing Techniques)
分布式云计算平台 (Cloud-based Platforms)
分布式云计算程序开发 (Cloud-based Programming)

Further Reading F. Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. G. DeCandia et al. Dynamo: Amazon’s highly available key-value store. In SOSP, 2007. S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In Proc. SOSP, 2003. D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422–469, 2000.

Further Reading
A. Silberstein, B. F. Cooper, U. Srivastava, E. Vee, R. Yerneni, R. Ramakrishnan. Efficient Bulk Insertion into a Distributed Ordered Table. In SIGMOD, 2008.
B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, R. Yerneni. PNUTS: Yahoo!’s Hosted Data Serving Platform. In VLDB, 2008.
P. Agrawal, A. Silberstein, B. F. Cooper, U. Srivastava, R. Ramakrishnan. Asynchronous View Maintenance for VLSD Databases. In SIGMOD, 2009.
B. F. Cooper, R. Ramakrishnan, U. Srivastava. Cloud Storage Design in a PNUTShell. In Beautiful Data, O’Reilly Media, 2009.


Thanks 谢谢!