
1 APWeb2011 Tutorial. Jiaheng Lu and Sai Wu, Renmin University of China and National University of Singapore. Cloud Computing and Scalable Data Management

2 Outline (APWeb 2011). Part 1: Cloud computing; Map/Reduce, Bigtable, and PNUTS. Part 2: CAP theorem and Datalog; data indexing in the clouds; conclusion and open issues.

3 Cloud computing

4 (figure)

5 Why do we use cloud computing?

6 Case 1: Writing a file. Saved locally: if the computer goes down, the file is lost. Saved in the cloud: files are always stored in the cloud and are never lost.

7 Why do we use cloud computing? Case 2: Use IE: download, install, use. Use QQ: download, install, use. Use C++: download, install, use. ... With cloud computing, you get the service directly from the cloud instead.

8 What are the cloud and cloud computing? Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.

9 Characteristics of cloud computing. Virtual: software, databases, Web servers, operating systems, storage, and networking are provided as virtual servers. On demand: add and subtract processors, memory, network bandwidth, and storage as needed.

10 Types of cloud service: IaaS (Infrastructure as a Service), PaaS (Platform as a Service), SaaS (Software as a Service).

11 (figure)

12 (figure)

13 Outline (APWeb 2011). Part 1: Cloud computing; Map/Reduce, Bigtable, and PNUTS. Part 2: CAP theorem and Datalog; data indexing in the clouds; conclusion and open issues.

14 Introduction to MapReduce

15 MapReduce Programming Model. Inspired by the map and reduce operations commonly used in functional programming languages like Lisp. Users implement an interface of two primary methods: 1. Map: (key1, val1) -> (key2, val2); 2. Reduce: (key2, [val2]) -> [val3].

16 Map operation. Map, a pure function written by the user, takes an input key/value pair, e.g. (docid, doc-content), and produces a set of intermediate key/value pairs. Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.

17 Reduce operation. On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer. It can be visualized as an aggregate function (e.g., average) computed over all the rows with the same group-by attribute.

18 Pseudo-code (word count):

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // output_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));
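A minimal runnable version of the same word count in Python, simulating the map, shuffle, and reduce phases in a single process (illustrative only; a real Hadoop job would run map and reduce tasks across a cluster):

    from collections import defaultdict

    def map_fn(doc_id, contents):
        # Map: emit (word, 1) for every word in the document.
        for word in contents.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Reduce: sum all counts collected for one word.
        return word, sum(counts)

    def run_job(documents):
        # Shuffle: group intermediate values by key, as the framework would.
        groups = defaultdict(list)
        for doc_id, contents in documents.items():
            for key, value in map_fn(doc_id, contents):
                groups[key].append(value)
        return dict(reduce_fn(k, vs) for k, vs in groups.items())

    print(run_job({"d1": "the quick fox", "d2": "the lazy dog"}))
    # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}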

19 MapReduce: Execution overview

20 MapReduce: Example

21 Research works. MapReduce is slower than parallel databases by a factor of 3.1 to 6.5 [1][2]. By adopting a hybrid architecture, the performance of MapReduce can approach that of parallel databases [3]. MapReduce is an efficient, flexible tool [4]. Numerous discussions in the MapReduce community ... [1] A comparison of approaches to large-scale data analysis. SIGMOD 2009. [2] MapReduce and parallel DBMSs: friends or foes? CACM 2010. [3] HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. VLDB 2009. [4] MapReduce: a flexible data processing tool. CACM 2010.

22 An in-depth study (1). A performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism, examining four factors: scheduling, I/O modes, record parsing, and indexing [5]. [5] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu: The Performance of MapReduce: An In-depth Study. PVLDB 3(1): 472-483 (2010).

23 An in-depth study (2). By carefully tuning these factors, the overall performance of Hadoop can be made comparable to that of parallel database systems, so it is possible to build a cloud data processing system that is both elastically scalable and efficient [5]. [5] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu: The Performance of MapReduce: An In-depth Study. PVLDB 3(1): 472-483 (2010).

24 Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database. Christopher Yang, Christine Yen, Ceryen Tan, Samuel Madden: Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. ICDE 2010: 657-668.

25 Problem statement. Problem: node failures in a distributed database system; faults may be common on large clusters of machines. Existing solution: abort and (possibly) restart the query. That is a reasonable approach for short OLTP-style queries, but it wastes time for analytical (OLAP) warehouse queries.

26 MapReduce-style fault tolerance for a SQL database. Break up a SQL query (or program) into smaller, parallelizable subqueries, and adopt a load-balancing strategy of greedy assignment of work.

27 Osprey: a middleware implementation of MapReduce-style fault tolerance for a SQL database.

28 Execution flow: a SQL query goes to a Query Transformer, which splits it into subqueries placed on partition work queues (PWQ A, PWQ B, PWQ C); a Scheduler dispatches the subqueries for execution, and a Result Merger combines the partial results into the final result. (PWQ: partition work queue.)
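A hedged Python sketch of the greedy work-assignment idea behind the partition work queues (the names pwq, execute_subquery, and workers are illustrative, not Osprey's actual API): idle workers repeatedly pull the next pending subquery, and a subquery whose worker fails is re-enqueued and retried elsewhere.

    import queue
    import threading

    def run_partitioned_query(subqueries, workers, execute_subquery):
        pwq = queue.Queue()                  # partition work queue
        for sq in subqueries:
            pwq.put(sq)
        results, lock = [], threading.Lock()

        def worker_loop(node):
            while True:
                try:
                    sq = pwq.get_nowait()    # greedy: grab the next pending subquery
                except queue.Empty:
                    return
                try:
                    r = execute_subquery(node, sq)
                    with lock:
                        results.append(r)
                except Exception:
                    pwq.put(sq)              # node failed: leave the subquery for another worker

        threads = [threading.Thread(target=worker_loop, args=(n,)) for n in workers]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results                       # the caller merges these partial results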

29 MapReduce Online Evaluation Platform

30 Construction of the MapReduce Online Evaluation platform. User: register, login, update info. Problems: scan problems, submit solutions, view submission history. Theory test: test results. Help: FAQs, Hadoop quick start. School of Information, Renmin University of China.

31 Cloud data management

32 Four new principles in cloud-based data management

33 New principle in cloud data management 1: Partition everything, with key-value storage. The 1st normal form cannot always be satisfied.

34 New principle in cloud data management 2: Embrace inconsistency. The ACID properties are not satisfied.

35 New principle in cloud data management 3: Back up everything with three copies to guarantee 99.999999% safety.

36 New principle in cloud data management 4: Be scalable and deliver high performance.

37 Cloud data management: partition everything, embrace inconsistency, back up data with three copies, and provide scalability and high performance.

38 BigTable: A Distributed Storage System for Structured Data

39 Introduction. BigTable is a distributed storage system for managing structured data. Designed to scale to a very large size: petabytes of data across thousands of servers. Used for many Google projects: web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, ... A flexible, high-performance solution for all of Google's products.

40 Why not just use a commercial DB? The scale is too large for most commercial databases, and even if it weren't, the cost would be very high. Building internally means the system can be applied across many projects for low incremental cost. Low-level storage optimizations help performance significantly, and they are much harder to do when running on top of a database layer.

41 Goals. Want asynchronous processes to be continuously updating different pieces of data, with access to the most current data at any time. Need to support: very high read/write rates (millions of ops per second); efficient scans over all or interesting subsets of data; efficient joins of large one-to-one and one-to-many datasets. Often want to examine data changes over time, e.g., the contents of a web page over multiple crawls.

42 BigTable: a distributed multi-level map. Fault-tolerant and persistent. Scalable: thousands of servers, terabytes of in-memory data, petabytes of disk-based data, millions of reads/writes per second, efficient scans. Self-managing: servers can be added and removed dynamically, and servers adjust to load imbalance.

43 Basic Data Model. A BigTable is a sparse, distributed, persistent multi-dimensional sorted map: (row, column, timestamp) -> cell contents. A good match for most Google applications.

44 WebTable Example. We want to keep a copy of a large collection of web pages and related information. Use URLs as row keys and various aspects of each web page as column names; store the contents of web pages in the contents: column under the timestamps when they were fetched.

45 Rows. A row name is an arbitrary string. Access to data in a row is atomic, and row creation is implicit upon storing data. Rows are ordered lexicographically, so rows close together lexicographically are usually on one or a small number of machines.

46 Rows (cont.) Reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting row keys that give good locality for data access. Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu vs. edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys.
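A tiny Python illustration of that row-key choice (assuming plain hostnames as keys):

    def domain_key(hostname):
        # Reverse the domain components so pages from the same domain
        # sort together and therefore land on the same tablets.
        return ".".join(reversed(hostname.split(".")))

    hosts = ["math.gatech.edu", "phys.uga.edu", "phys.gatech.edu", "math.uga.edu"]
    print(sorted(domain_key(h) for h in hosts))
    # ['edu.gatech.math', 'edu.gatech.phys', 'edu.uga.math', 'edu.uga.phys']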

47 Columns. Columns have a two-level name structure: family:optional_qualifier. The column family is the unit of access control and has associated type information. Qualifiers give unbounded columns, providing additional levels of indexing if desired.

48 Timestamps. Used to store different versions of data in a cell. New writes default to the current time, but timestamps for writes can also be set explicitly by clients. Lookup options: return the most recent K values, or return all values in a timestamp range (or all values). Column families can be marked with attributes: only retain the most recent K values in a cell, or keep values until they are older than K seconds.
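A minimal in-memory sketch of the (row, column, timestamp) -> value model, including the most-recent-K lookup (a toy for illustration, not Bigtable's implementation):

    import time
    from collections import defaultdict

    class TinyTable:
        def __init__(self):
            # row -> column ("family:qualifier") -> list of (timestamp, value)
            self.cells = defaultdict(lambda: defaultdict(list))

        def put(self, row, column, value, ts=None):
            ts = time.time() if ts is None else ts       # default: current time
            self.cells[row][column].append((ts, value))
            self.cells[row][column].sort(reverse=True)   # newest version first

        def get(self, row, column, k=1):
            # Return the most recent k versions of the cell.
            return self.cells[row][column][:k]

    t = TinyTable()
    t.put("com.cnn.www", "contents:", "<html>v1", ts=1)
    t.put("com.cnn.www", "contents:", "<html>v2", ts=2)
    print(t.get("com.cnn.www", "contents:"))  # [(2, '<html>v2')]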

49 Implementation: three major components. A library linked into every client. One master server, responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection. Many tablet servers, each handling read and write requests to the tablets it serves and splitting tablets that have grown too large.

50 Tablets. Large tables are broken into tablets at row boundaries; a tablet holds a contiguous range of rows, and clients can often choose row keys to achieve locality. Aim for ~100MB to 200MB of data per tablet, with each serving machine responsible for ~100 tablets. This gives fast recovery (100 machines each pick up 1 tablet from a failed machine) and fine-grained load balancing (migrate tablets away from an overloaded machine; the master makes load-balancing decisions).

51 Refinements: Locality Groups. Multiple column families can be grouped into a locality group, and a separate SSTable is created for each locality group in each tablet. Segregating column families that are not typically accessed together enables more efficient reads; in WebTable, page metadata can be in one group and the contents of the page in another group.

52 Refinements: Compression. Many opportunities for compression: similar values in the same row/column at different timestamps, similar values in different columns, and similar values across adjacent rows. A two-pass custom compression scheme: the first pass compresses long common strings across a large window, and the second pass looks for repetitions in a small window. Speed is emphasized, but the space reduction is still good (10-to-1).

53 Refinements: Bloom Filters. A read operation has to go to disk when the desired SSTable isn't in memory. Reduce the number of accesses by specifying a Bloom filter, which lets us ask whether an SSTable might contain data for a specified row/column pair. A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations; in practice, most lookups for non-existent rows or columns do not need to touch disk.
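A hedged Python sketch of such a filter (a toy, not Bigtable's implementation): if might_contain returns False, the SSTable definitely holds no data for that row/column pair and the disk seek can be skipped; True means "possibly present".

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=3):
            self.m, self.k = m, k            # m bits, k hash functions
            self.bits = bytearray(m // 8)

        def _positions(self, key):
            # Derive k bit positions from independent salted hashes.
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            # False -> definitely absent; True -> possibly present.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    bf = BloomFilter()
    bf.add(("com.cnn.www", "contents:"))
    print(bf.might_contain(("com.cnn.www", "contents:")))  # True
    print(bf.might_contain(("com.nyt.www", "anchor:")))    # almost surely False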

54 PNUTS / SHERPA: To Help You Scale Your Mountains of Data

55 Yahoo! Serving Storage Problem. Small records: 100KB or less. Structured records: lots of fields, evolving. Extreme data scale: tens of TB. Extreme request scale: tens of thousands of requests/sec. Low latency globally: 20+ datacenters worldwide. High availability: outages cost millions of dollars. Variable usage patterns as applications and users change.

56 What is PNUTS/Sherpa? A parallel database with geographic replication, a structured and flexible schema, and hosted, managed infrastructure. Example table: CREATE TABLE Parts ( ID VARCHAR, StockNumber INT, Status VARCHAR, ... ); rows such as (A, 42342, E), (B, 42521, W), (C, 66354, W), (D, 12352, E), (E, 75656, C), (F, 15677, E) are partitioned and replicated across data centers.

57 What Will It Become? The same parallel, geographically replicated database with a structured, flexible schema and hosted, managed infrastructure, plus indexes and views.

58 What Will It Become? Indexes and views, maintained across the replicated copies.

59 Technology Elements. Applications talk to the system through the PNUTS API / Tabular API. PNUTS proper: query planning and execution, index maintenance. Distributed infrastructure for tabular data: data partitioning, update consistency, replication. Underneath: YDOT FS (ordered tables), YDHT FS (hash tables), Tribble (pub/sub messaging), Zookeeper (consistency service), YCA (authorization).

60 Data Manipulation. Per-record operations: get, set, delete. Multi-record operations: multiget, scan, getrange.
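A toy in-memory stand-in for this operation set in Python (the method names mirror the list above for illustration; the real PNUTS client API is internal to Yahoo! and differs):

    class TinyClient:
        def __init__(self):
            self.tables = {}

        # Per-record operations
        def set(self, table, key, value):
            self.tables.setdefault(table, {})[key] = value

        def get(self, table, key):
            return self.tables.get(table, {}).get(key)

        def delete(self, table, key):
            self.tables.get(table, {}).pop(key, None)

        # Multi-record operations
        def multiget(self, table, keys):
            return [self.get(table, k) for k in keys]

        def scan(self, table):
            yield from self.tables.get(table, {}).items()

        def getrange(self, table, start, end):
            # Range retrieval makes sense on ordered (YDOT-style) tables.
            return sorted((k, v) for k, v in self.tables.get(table, {}).items()
                          if start <= k <= end)

    c = TinyClient()
    c.set("Parts", "A", {"StockNumber": 42342, "Status": "E"})
    c.set("Parts", "B", {"StockNumber": 42521, "Status": "W"})
    print(c.getrange("Parts", "A", "B"))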

61 Tablets: Hash Table. In a hash table, row keys (Apple, Lemon, Grape, Orange, Lime, Strawberry, Kiwi, Avocado, Tomato, Banana, each with Name, Description, and Price fields) are hashed into the tablet space (0x0000 to 0xFFFF), so lexically adjacent names are scattered across tablets.

62 Tablets: Ordered Table. In an ordered table, the same rows are stored in sorted key order (A to Z), so adjacent names (Apple, Avocado, Banana, ...) stay clustered in the same tablet.

63 Flexible Schema. Base columns: Posted date, Listing id, Item, Price. Rows: (6/1/07, 424252, Couch, $570), (6/1/07, 763245, Bike, $86), (6/3/07, 211242, Car, $1123), (6/5/07, 421133, Lamp, $15). New columns such as Color (e.g., Red) and Condition (e.g., Good, Fair) can be added at any time and populated only for the records that need them.

64 Detailed Architecture. Clients issue requests through a REST API to routers; a tablet controller manages the mapping of tablets to storage units; storage units hold the data in the local region; Tribble (pub/sub messaging) propagates updates to the remote regions.

65 Tablet Splitting and Balancing. Each storage unit has many tablets (horizontal partitions of the table). Tablets may grow over time, and overfull tablets split. A storage unit may become a hotspot; shed load by moving tablets to other servers.
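A hedged Python sketch of both mechanisms (the data layout and thresholds are invented for illustration; tablets are modeled as {row_key: record} dicts):

    def tablet_bytes(tablet):
        # Rough size estimate for a tablet stored as {row_key: record}.
        return sum(len(k) + len(str(v)) for k, v in tablet.items())

    def maybe_split(tablet, max_bytes=200 * 2**20):
        # Split an overfull tablet at its median row key, so each half
        # still covers a contiguous range of the sorted row space.
        if tablet_bytes(tablet) <= max_bytes:
            return [tablet]
        rows = sorted(tablet)
        mid = rows[len(rows) // 2]
        return [{r: tablet[r] for r in rows if r < mid},
                {r: tablet[r] for r in rows if r >= mid}]

    def shed_load(units):
        # Move one tablet from the most loaded storage unit to the least
        # loaded one (units: {unit_name: [tablets]}).
        hot = max(units, key=lambda u: len(units[u]))
        cold = min(units, key=lambda u: len(units[u]))
        if len(units[hot]) - len(units[cold]) > 1:
            units[cold].append(units[hot].pop())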

66 QUERY PROCESSING

67 Accessing Data. (1) The client sends "get key k" to a router; (2) the router forwards the request to the storage unit (SU) holding k; (3) the SU returns the record for key k; (4) the router passes it back to the client.

68 Bulk Read. (1) The client sends a key set {k1, k2, ... kn} to a scatter/gather server; (2) the scatter/gather server issues parallel gets (get k1, get k2, get k3, ...) to the storage units that hold the keys and gathers the results.
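A sketch of that scatter/gather step in Python, assuming caller-supplied lookup_unit (key -> storage unit) and get_from_unit (unit, keys -> records) functions:

    from concurrent.futures import ThreadPoolExecutor

    def bulk_read(keys, lookup_unit, get_from_unit):
        # Scatter: group the requested keys by owning storage unit.
        by_unit = {}
        for k in keys:
            by_unit.setdefault(lookup_unit(k), []).append(k)
        # Issue the per-unit gets in parallel, then gather the results.
        records = {}
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(get_from_unit, unit, ks)
                       for unit, ks in by_unit.items()]
            for f in futures:
                records.update(f.result())
        return records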

69 Range Queries in YDOT: clustered, ordered retrieval of records. The router holds an interval mapping over the sorted key space (Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon): keys below Canteloupe go to storage unit 1, Canteloupe up to Lime to storage unit 3, Lime up to Strawberry to storage unit 2, and Strawberry and above to storage unit 1. A range query such as Grapefruit...Pear is split into Grapefruit...Lime (storage unit 3) and Lime...Pear (storage unit 2).
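The router's lookup can be a simple sorted interval map; a Python sketch using the boundaries from this example (unit names and keys taken from the slide, the implementation is illustrative):

    import bisect

    boundaries = ["", "Canteloupe", "Lime", "Strawberry"]   # range start keys
    units      = ["SU1", "SU3", "SU2", "SU1"]               # owner of each range

    def units_for_range(start, end):
        # Split the range [start, end) into per-storage-unit subranges.
        i = bisect.bisect_right(boundaries, start) - 1
        pieces = []
        while i < len(boundaries) and boundaries[i] < end:
            lo = max(start, boundaries[i])
            hi = end if i + 1 == len(boundaries) else min(end, boundaries[i + 1])
            if lo < hi:
                pieces.append((units[i], lo, hi))
            i += 1
        return pieces

    print(units_for_range("Grapefruit", "Pear"))
    # [('SU3', 'Grapefruit', 'Lime'), ('SU2', 'Lime', 'Pear')]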

70 Updates. A write for key k flows from the client through a router to the storage unit holding the record's master copy; the storage unit commits the update by publishing it to the message brokers, which return SUCCESS and asynchronously propagate the write to the other replicas; the sequence number for key k is returned to the client.

71 SHERPA IN CONTEXT

72 Types of Record Stores: query expressiveness. The spectrum runs from simple to feature rich: S3 (object retrieval), PNUTS (retrieval from a single table of objects/records), Oracle (SQL).

73 Types of Record Stores: consistency model. From best effort to strong guarantees: S3 (eventual consistency), PNUTS (timeline consistency, object-centric), Oracle (ACID, program-centric).

74 Types of Record Stores: data model. From flexibility and schema evolution (CouchDB, with object-centric consistency) through PNUTS to schemas optimized for fixed structure (Oracle, where consistency spans objects).

75 Types of Record Stores: elasticity (the ability to add resources on demand). From inelastic to elastic: Oracle (limited, via data distribution), then PNUTS and S3 (VLSD: very large scale distribution/replication).

76 Application Design Space. Systems are plotted along two axes, records vs. files and "get a few things" vs. "scan everything": Sherpa, MObStor, Everest, Hadoop, YMDB, MySQL, Filer, Oracle, BigTable.

77 Alternatives Matrix. Systems (Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, Cassandra) are compared along: elasticity, operability, global low latency, availability, structured access, updates, consistency model, and SQL/ACID.

78 Outline (APWeb 2011). Part 1: Cloud computing; Map/Reduce, Bigtable, and PNUTS. Part 2: CAP theorem and Datalog; data indexing in the clouds; conclusion and open issues.

79 The CAP Theorem: Consistency, Availability, Partition tolerance.

80 The CAP Theorem: Consistency. Once a writer has written, all readers will see that write.

81 The CAP Theorem: Availability. The system is available during software and hardware upgrades and node failures.

82 The CAP Theorem: Partition tolerance. The system can continue to operate in the presence of network partitions.

83 The CAP Theorem. Theorem: you can have at most two of these three properties for any shared-data system.

84 Consistency. Two kinds of consistency: strong consistency, i.e., ACID (Atomicity, Consistency, Isolation, Durability); and weak consistency, i.e., BASE (Basically Available, Soft-state, Eventual consistency).

85 (Figure: a tailor.) The traditional RDBMS toolkit: 3NF, transactions, locks, ACID, safety.

86 Datalog. Main expressive advantage: recursive queries. More convenient for analysis: papers look better. Without recursion but with negation, it is equivalent in power to relational algebra. It has affected real practice (e.g., recursion in SQL3, magic-sets transformations).

87 Datalog Example. Datalog program:

    parent(bill, mary).
    parent(mary, john).
    ancestor(X,Y) :- parent(X,Y).
    ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).

Query: ?- ancestor(bill, X)
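The recursive query can be evaluated bottom-up with a simple fixpoint loop; a Python sketch of naive Datalog evaluation for this program:

    parent = {("bill", "mary"), ("mary", "john")}

    def ancestors(parent):
        # ancestor(X,Y) :- parent(X,Y).
        anc = set(parent)
        # ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
        # Apply the recursive rule until no new facts appear (a fixpoint).
        while True:
            new = {(x, y) for (x, z) in parent for (z2, y) in anc if z == z2}
            if new <= anc:
                return anc
            anc |= new

    print(sorted(y for (x, y) in ancestors(parent) if x == "bill"))
    # ['john', 'mary']  (the answers to ?- ancestor(bill, X))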

88 Joseph's Conjecture (1). CONJECTURE 1. Consistency And Logical Monotonicity (CALM). A program has an eventually consistent, coordination-free execution strategy if and only if it is expressible in (monotonic) Datalog.

89 Joseph's Conjecture (2). CONJECTURE 2. Causality Required Only for Non-monotonicity (CRON). Program semantics require causal message ordering if and only if the messages participate in non-monotonic derivations.

90 Joseph's Conjecture (3). CONJECTURE 3. The minimum number of Dedalus timesteps required to evaluate a program on a given input data set is equivalent to the program's Coordination Complexity.

91 Joseph's Conjecture (4). CONJECTURE 4. Any Dedalus program P can be rewritten into an equivalent temporally-minimized program P' such that each inductive or asynchronous rule of P' is necessary: converting that rule to a deductive rule would result in a program with no unique minimal model.

92 "Circumstance has presented a rare opportunity -- call it an imperative -- for the database community to take its place in the sun, and help create a new environment for parallel and distributed computation to flourish." -- Joseph M. Hellerstein (UC Berkeley)

93 Open questions and conclusion

94 Open Questions. What is the right consistency model? What is the right programming model? Whether and how should caching be used? How do we balance functionality and scale? What are the right cloud abstractions? What about cloud inter-operability? (VLDB 2010 Tutorial)

95 Concluding remarks. Data management for cloud computing poses a fundamental challenge to database researchers: scalability, reliability, data consistency, and elasticity. The database community needs to be involved; maintaining the status quo will only marginalize our role. (VLDB 2010 Tutorial)

96 New Textbook: Distributed Systems and Cloud Computing. Chapters: Introduction to Distributed Systems; Distributed Computing; Cloud-based Platforms; Cloud-based Programming.

97 Further Reading. F. Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. G. DeCandia et al. Dynamo: Amazon's highly available key-value store. In SOSP, 2007. S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SOSP, 2003. D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422–469, 2000.

98 Further Reading. A. Silberstein, B. Cooper, U. Srivastava, E. Vee, R. Yerneni, and R. Ramakrishnan. Efficient bulk insertion into a distributed ordered table. In SIGMOD, 2008. B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. In VLDB, 2008. P. Agrawal, A. Silberstein, B. F. Cooper, U. Srivastava, and R. Ramakrishnan. Asynchronous view maintenance for VLSD databases. In SIGMOD, 2009. B. F. Cooper, R. Ramakrishnan, and U. Srivastava. Cloud storage design in a PNUTShell. In Beautiful Data, O'Reilly Media, 2009.


100 Thanks

