Cloud Computing and Scalable Data Management

Cloud Computing and Scalable Data Management
Jiaheng Lu and Sai Wu Renmin Universtiy of China , National University of Singapore APWeb’2011 Tutorial

Map/Reduce, Bigtable and PNUT CAP Theorem and datalog
Outline Cloud computing Map/Reduce, Bigtable and PNUT CAP Theorem and datalog Data indexing in the clouds Conclusion and open issues Part 1 Part 2 APWeb 2011

Cloud computing CLOUD COMPUTING

Why we use cloud computing?

Case 1: Write a file Save Computer down, file is lost Files are always stored in cloud, never lost

Case 2: Use IE --- download, install, use Use QQ --- download, install, use Use C download, install, use …… Get the serve from the cloud

What is cloud and cloud computing?
Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a serve over the Internet. Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.

Characteristics of cloud computing
Virtual. software, databases, Web servers, operating systems, storage and networking as virtual servers. On demand. add and subtract processors, memory, network bandwidth, storage.

Infrastructure as a Service
Types of cloud service SaaS Software as a Service PaaS Platform as a Service IaaS Infrastructure as a Service

Introduction to MapReduce
14

MapReduce Programming Model
Inspired from map and reduce operations commonly used in functional programming languages like Lisp. Users implement interface of two primary methods: 1. Map: (key1, val1) → (key2, val2) 2. Reduce: (key2, [val2]) → [val3] 15

Map operation Map, a pure function, written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs. e.g. (doc—id, doc-content) Draw an analogy to SQL, map can be visualized as group-by clause of an aggregate query. 16

Reduce operation On completion of map phase, all the intermediate values for a given output key are combined together into a list and given to a reducer. Can be visualized as aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute. 17

Pseudo-code for each word w in input_value: EmitIntermediate(w, "1");
map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, "1"); reduce(String output_key, Iterator intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += ParseInt(v); Emit(AsString(result)); 18

MapReduce: Execution overview
19

MapReduce: Example 20

Research works MapReduce is slower than Parallel Databases by a factor 3.1 to 6.5 [1][2] By adopting an hybrid architecture, the performance of MapReduce can approach to Parallel Databases [3] MapReduce is an efficient tool [4] Numerous discussions in MapReduce community … [1] A comparison of approaches to large-scale data analysis. SIGMOD 2009 [2] MapReduce and Parallel DBMSs: friends or foes? CACM 2010 [3] HadoopDB: an architectural hybrid of MapReduce and DBMS techonologies for analytical workloads. VLDB 2009 [4] MapReduce: a flexible data processing tool. CACM 2010 First two papers speculate some factors but without further analysis. [3] yields an interesting question whether MapReduce itself can achieve the same performance There is no benchmark data support the claims in [4]

An in-depth study (1) Scheduling I/O modes Record Parsing Indexing
A performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. [5] [5] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu: The Performance of MapReduce: An In-depth Study. PVLDB 3(1): (2010) Scheduling I/O modes Record Parsing Indexing

An in-depth study (2) By carefully tuning these factors, the overall performance of Hadoop can be comparable to that of parallel database systems. Possible to build a cloud data processing system that is both elastically scalable and efficient [5] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu: The Performance of MapReduce: An In-depth Study. PVLDB 3(1): (2010)

Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database Christopher Yang, Christine Yen, Ceryen Tan, Samuel Madden: Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. ICDE 2010:

Problem proposed Problem: Node failures on distributed database system
Faults may be common on large clusters of machines Existing solution: Aborting and (possibly) restarting the query A reasonable approach for short OLTP-style queries But it’s time-wasting for analytical (OLAP) warehouse queries

MapReduce-style fault tolerance for a SQL database
Break up a SQL query (or program) into smaller, parallelizable subqueries Adapt the load balancing strategy of greedy assignment of work

Osprey A middleware implementation of MapReduce-style fault tolerance for a SQL database

SQL procedure Query Query Transformer Subquery Subquery Subquery
PWQ A PWQ B PWQ C PWQ :partition work queue Scheduler Execution Result Merger Result

Mapreduce Online Evaluation Plateform

Mapreduce OnlineEvaluation
Construction Mapreduce OnlineEvaluation User Register Login Update Info Problem Scan Problem Submit Solution Submission History Theory test Test Result Help FAQs Hadoop Quick Start School Of Information Renmin University Of China 2017/3/27

Cloud data management

Four new principles in Cloud-based data management

New principle in cloud dta management（1）
Partition Everything and key-value storage 切分万物以治之 1st normal form cannot be satisfied 1. GFS built from hundreds/thousands of machines built from inexpensive commodity parts and accessed by a comparable # of client machines

New principle in cloud dta management （2）
Embrace Inconsistency 容不同乃成大同 ACID properties are not satisfied 1. GFS built from hundreds/thousands of machines built from inexpensive commodity parts and accessed by a comparable # of client machines

Backup everything with three copies 狡兔三窟方高枕 Guarantee % safety 1. GFS built from hundreds/thousands of machines built from inexpensive commodity parts and accessed by a comparable # of client machines

Scalable and high performance 运筹沧海量兼容 1. GFS built from hundreds/thousands of machines built from inexpensive commodity parts and accessed by a comparable # of client machines

Cloud data management 切分万物以治之 Partition Everything 容不同乃成大同
Embrace Inconsistency 狡兔三窟方高枕 Backup data with three copies 运筹沧海量兼容 Scalable and high performance 1. GFS built from hundreds/thousands of machines built from inexpensive commodity parts and accessed by a comparable # of client machines

BigTable: A Distributed Storage System for Structured Data
38

Introduction BigTable is a distributed storage system for managing structured data. Designed to scale to a very large size Petabytes of data across thousands of servers Used for many Google projects Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, … Flexible, high-performance solution for all of Google’s products 39

Why not just use commercial DB?
Scale is too large for most commercial databases Even if it weren’t, cost would be very high Building internally means system can be applied across many projects for low incremental cost Low-level storage optimizations help performance significantly Much harder to do when running on top of a database layer 40

Goals Want asynchronous processes to be continuously updating different pieces of data Want access to most current data at any time Need to support: Very high read/write rates (millions of ops per second) Efficient scans over all or interesting subsets of data Efficient joins of large one-to-one and one-to-many datasets Often want to examine data changes over time E.g. Contents of a web page over multiple crawls 41

BigTable Distributed multi-level map Fault-tolerant, persistent
Scalable Thousands of servers Terabytes of in-memory data Petabyte of disk-based data Millions of reads/writes per second, efficient scans Self-managing Servers can be added/removed dynamically Servers adjust to load imbalance 42

(row, column, timestamp) -> cell contents
Basic Data Model A BigTable is a sparse, distributed persistent multi-dimensional sorted map (row, column, timestamp) -> cell contents Good match for most Google applications 43

WebTable Example Want to keep copy of a large collection of web pages and related information Use URLs as row keys Various aspects of web page as column names Store contents of web pages in the contents: column under the timestamps when they were fetched. 44

Rows Name is an arbitrary string Rows ordered lexicographically
Access to data in a row is atomic Row creation is implicit upon storing data Rows ordered lexicographically Rows close together lexicographically usually on one or a small number of machines 45

Rows (cont.) Reads of short row ranges are efficient and typically require communication with a small number of machines. Can exploit this property by selecting row keys so they get good locality for data access. Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu VS edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys 46

Columns Columns have two-level name structure: Column family
family:optional_qualifier Column family Unit of access control Has associated type information Qualifier gives unbounded columns Additional levels of indexing, if desired 47

Timestamps Used to store different versions of data in a cell
New writes default to current time, but timestamps for writes can also be set explicitly by clients Lookup options: “Return most recent K values” “Return all values in timestamp range (or all values)” Column families can be marked w/ attributes: “Only retain most recent K values in a cell” “Keep values until they are older than K seconds” 48

Implementation – Three Major Components
Library linked into every client One master server Responsible for: Assigning tablets to tablet servers Detecting addition and expiration of tablet servers Balancing tablet-server load Garbage collection Many tablet servers Tablet servers handle read and write requests to its table Splits tablets that have grown too large 49

Tablets Large tables broken into tablets at row boundaries
Tablet holds contiguous range of rows Clients can often choose row keys to achieve locality Aim for ~100MB to 200MB of data per tablet Serving machine responsible for ~100 tablets Fast recovery: 100 machines each pick up 1 tablet for failed machine Fine-grained load balancing: Migrate tablets away from overloaded machine Master makes load-balancing decisions 50

Refinements: Locality Groups
Can group multiple column families into a locality group Separate SSTable is created for each locality group in each tablet. Segregating columns families that are not typically accessed together enables more efficient reads. In WebTable, page metadata can be in one group and contents of the page in another group. 51

Refinements: Compression
Many opportunities for compression Similar values in the same row/column at different timestamps Similar values in different columns Similar values across adjacent rows Two-pass custom compressions scheme First pass: compress long common strings across a large window Second pass: look for repetitions in small window Speed emphasized, but good space reduction (10-to-1) 52

Refinements: Bloom Filters
Read operation has to read from disk when desired SSTable isn’t in memory Reduce number of accesses by specifying a Bloom filter. Allows us ask if an SSTable might contain data for a specified row/column pair. Small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations Use implies that most lookups for non-existent rows or columns do not need to touch disk 53

PNUTS / SHERPA To Help You Scale Your Mountains of Data
A project in Y!R focused on a long-range problem, origins in earlier work at Wisconsin. Basis for the Goldrush hack, which won the recent Local hack competition, and could contribute to creation/refinement of Y! Local content and Next Gen Search.

Yahoo! Serving Storage Problem
Small records – 100KB or less Structured records – lots of fields, evolving Extreme data scale - Tens of TB Extreme request scale - Tens of thousands of requests/sec Low latency globally datacenters worldwide High Availability - outages cost $millions Variable usage patterns - as applications and users change 55

What is PNUTS/Sherpa? CREATE TABLE Parts ( ID VARCHAR,
StockNumber INT, Status VARCHAR … ) E C A E B W C W D E F E A E B W C W D E E C F E Structured, flexible schema E C A E B W C W D E F E Geographic replication Parallel database Hosted, managed infrastructure 56

What Will It Become? Indexes and views CREATE TABLE Parts (
A E B W C W D E F E E C A E B W C W D E F E Indexes and views CREATE TABLE Parts ( ID VARCHAR, StockNumber INT, Status VARCHAR … ) E C A E B W C W D E F E Geographic replication Parallel database Structured, flexible schema Hosted, managed infrastructure

What Will It Become? Indexes and views A 42342 E B 42521 W A 42342 E
C W D E F E E C A E B W C W D E F E Indexes and views E C A E B W C W D E F E

Technology Elements Applications PNUTS API Tabular API PNUTS
Query planning and execution Index maintenance Distributed infrastructure for tabular data Data partitioning Update consistency Replication YCA: Authorization YDOT FS Ordered tables YDHT FS Hash tables Tribble Pub/sub messaging Zookeeper Consistency service 59 59

Data Manipulation Per-record operations Multi-record operations Get
Set Delete Multi-record operations Multiget Scan Getrange 60

Tablets—Hash Table 0x0000 0x2AF3 0x911F 0xFFFF Name Description Price
Grape Grapes are good to eat $12 Lime Limes are green $9 Apple Apple is wisdom $1 Strawberry Strawberry shortcake $900 0x2AF3 Orange Arrgh! Don’t get scurvy! $2 Avocado But at what price? $3 Lemon How much did you pay for this lemon? $1 Tomato Is this a vegetable? $14 0x911F Banana The perfect fruit $2 Kiwi New Zealand $8 0xFFFF 61

Tablets—Ordered Table
Name Description Price A Apple Apple is wisdom $1 Avocado But at what price? $3 Banana The perfect fruit $2 Grape Grapes are good to eat $12 H Kiwi New Zealand $8 Lemon How much did you pay for this lemon? $1 Lime Limes are green $9 Orange Arrgh! Don’t get scurvy! $2 Q Strawberry Strawberry shortcake $900 Tomato Is this a vegetable? $14 Z 62

Flexible Schema Posted date Listing id Item Price 6/1/07 424252 Couch
$570 763245 Bike $86 6/3/07 211242 Car $1123 6/5/07 421133 Lamp $15 Condition Good Fair Color Red

Detailed Architecture
Local region Remote regions Clients REST API Routers Tribble Tablet Controller Storage units 64

Tablet Splitting and Balancing
Each storage unit has many tablets (horizontal partitions of the table) Storage unit may become a hotspot Storage unit Tablet Overfull tablets split Tablets may grow over time Shed load by moving tablets to other servers 65

QUERY PROCESSING 66

Accessing Data SU SU SU Get key k Get key k Record for key k 1 4 2 3
67

Bulk Read SU SU SU {k1, k2, … kn} Get k1 Get k2 Get k3 1 2 68 Scatter/
gather server SU SU SU 68

Range Queries in YDOT Clustered, ordered retrieval of records
Storage unit 1 Canteloupe Storage unit 3 Lime Storage unit 2 Strawberry Router Apple Avocado Banana Blueberry Storage unit 1 Canteloupe Storage unit 3 Lime Storage unit 2 Strawberry Grapefruit…Lime? Grapefruit…Pear? Lime…Pear? Canteloupe Grape Kiwi Lemon Lime Mango Orange Storage unit 1 Storage unit 2 Storage unit 3 Strawberry Tomato Watermelon Apple Avocado Banana Blueberry Strawberry Tomato Watermelon Lime Mango Orange Canteloupe Grape Kiwi Lemon

Updates SU SU SU Write key k Sequence # for key k Write key k
8 1 Sequence # for key k Write key k Routers Message brokers 2 Write key k 3 Write key k 7 4 Sequence # for key k 5 SUCCESS SU SU SU 6 Write key k 70

SHERPA IN CONTEXT 71

Retrieval from single table of objects/records
Types of Record Stores Query expressiveness S3 PNUTS Oracle Simple Feature rich Object retrieval Retrieval from single table of objects/records SQL

Consistency model Types of Record Stores S3 PNUTS Oracle Best effort
Strong guarantees Eventual consistency Timeline consistency ACID Program centric consistency Object-centric consistency

Data model Types of Record Stores PNUTS CouchDB Oracle Flexibility,
Schema evolution Optimized for Fixed schemas Object-centric consistency Consistency spans objects

Elasticity (ability to add resources on demand)
Types of Record Stores Elasticity (ability to add resources on demand) PNUTS S3 Oracle Inelastic Elastic Limited (via data distribution) VLSD (Very Large Scale Distribution /Replication)

Application Design Space
Get a few things Sherpa MObStor YMDB MySQL Oracle Filer BigTable Scan everything Everest Hadoop Records Files 76

Alternatives Matrix Consistency model Global low Structured SQL/ACID
latency Structured access Consistency model SQL/ACID Operability Availability Updates Elastic Sherpa Y! UDB MySQL Oracle HDFS BigTable Dynamo Cassandra 77

The CAP Theorem Availability Consistency Partition tolerance
PODC:ACM Symposium on Principles of Distributed Computing(PODC)International Conference 79

The CAP Theorem Once a writer has written, all readers will see that write Availability Consistency Partition tolerance

The CAP Theorem System is available during software and hardware upgrades and node failures. Availability Consistency Partition tolerance

The CAP Theorem A system can continue to operate in the presence of a network partitions. Availability Consistency Partition tolerance

The CAP Theorem Theorem: You can have at most two of these properties for any shared-data system Availability Consistency Partition tolerance

Consistency Two kinds of consistency:
strong consistency – ACID(Atomicity Consistency Isolation Durability) weak consistency – BASE(Basically Available Soft-state Eventual consistency ) Cluster里面至少有一个replica是最新的，其他的replica最终会达到一致

A tailor RDBMS LOCK ACID SAFETY TRANSACTION 3NF

Datalog Main expressive advantage: recursive queries.
More convenient for analysis: papers look better. Without recursion but with negation it is equivalent in power to relational algebra Has affected real practice: (e.g., recursion in SQL3, magic sets transformations).

Datalog Example Datalog program: parent(bill,mary). parent(mary,john).
ancestor(X,Y) :- parent(X,Y). ancestor(X,Y) :- parent(X,Z),ancestor(Z,Y). ?- ancestor(bill,X)

Joseph’s Conjecture(1)
CONJECTURE 1. Consistency And Logical Monotonicity (CALM). A program has an eventually consistent, coordination-free execution strategy if and only if it is expressible in (monotonic) Datalog.

Joseph’s Conjecture (2)
CONJECTURE 2. Causality Required Only for Non-monotonicity (CRON). Program semantics require causal message ordering if and only if the messages participate in non-monotonic derivations.

CONJECTURE 3. The minimum number of Dedalus timesteps required to evaluate a program on a given input data set is equivalent to the program’s Coordination Complexity.

CONJECTURE 4. Any Dedalus program P can be rewritten into an equivalent temporally-minimized program P’ such that each inductive or asynchronous rule of P’ is necessary: converting that rule to a deductive rule would result in a program with no unique minimal model.

Circumstance has presented a rare opportunity—call it an imperative—for the database community to take its place in the sun, and help create a new environment for parallel and distributed computation to flourish. ------Joseph M. Hellerstein (UC Berkeley)

Open questions and conclusion

Open Questions What is the right consistency model?
What is the right programming model? Whether and how to make use of caching? How to balance functionality and scale? What are the right cloud abstractions? Cloud inter-operatability VLDB 2010 Tutorial

Concluding Data Management for Cloud Computing poses a fundamental challenge to database researchers: Scalability Reliability Data Consistency Elasticity Database community needs to be involved – maintaining status quo will only marginalize our role. VLDB 2010 Tutorial

New Textbook “Distributed System and Cloud computing”《分布式系统与云计算》
分布式系统概述 (Introduction to Distributed System) 分布式云计算技术综述 ( Distributed Computing) 分布式云计算平台 (Cloud-based platform) 分布式云计算程序开发 (Cloud-based programming) 96

Further Reading F. Chang et al.
Bigtable: A distributed storage system for structured data. In OSDI, 2006. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. G. DeCandia et al. Dynamo: Amazon’s highly available key-value store. In SOSP, 2007. S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In Proc. SOSP, 2003. D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422–469, 2000.

Further Reading Efficient Bulk Insertion into a Distributed Ordered Table (SIGMOD 2008) Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan PNUTS: Yahoo!'s Hosted Data Serving Platform (VLDB 2008) Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni Asynchronous View Maintenance for VLSD Databases, Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava and Raghu Ramakrishnan SIGMOD 2009 Cloud Storage Design in a PNUTShell Brian F. Cooper, Raghu Ramakrishnan, and Utkarsh Srivastava Beautiful Data, O’Reilly Media, 2009

Further Reading F. Chang et al.
Bigtable: A distributed storage system for structured data. In OSDI, 2006. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. G. DeCandia et al. Dynamo: Amazon’s highly available key-value store. In SOSP, 2007. S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In Proc. SOSP, 2003. D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422–469, 2000.

Thanks 谢谢！

Cloud Computing and Scalable Data Management

Similar presentations

Presentation on theme: "Cloud Computing and Scalable Data Management"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Cloud Computing and Scalable Data Management

Similar presentations

Presentation on theme: "Cloud Computing and Scalable Data Management"— Presentation transcript:

Similar presentations

About project

Feedback