C-Store: Data Management in the Cloud Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun 5, 2009.

1 C-Store: Data Management in the Cloud Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun 5, 2009

2 What is the Cloud?
- A definition from Wikipedia
  - Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.
- Infrastructure as a service (e.g., Amazon EC2)
  - Allows customers to rent computers (virtual machines) on which to run their own computer applications.
- Platform as a service
- Software as a service

3 Amazon EC2 (Elastic Compute Cloud)
- EC2 uses Xen virtualization.
- Each virtual machine, called an "instance", functions as a virtual private server in one of three sizes:
  - small, large, or extra large.
- Amazon.com sizes instances based on "EC2 Compute Units":
  - the equivalent CPU capacity of physical hardware.
- One EC2 Compute Unit is pegged to the clock speed (in GHz) of a 2007 Opteron or 2007 Xeon processor.

4 Pricing
- Amazon charges customers in two primary ways:
  - Hourly charge per virtual machine
  - Data transfer charge
- Amazon advertising describes the pricing scheme as "you pay for resources you consume".
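The two charges on this slide can be combined into a simple bill estimate. The sketch below is only an illustration: the function name `monthly_cost` and every rate in it are hypothetical placeholders, not Amazon's actual prices.

```python
def monthly_cost(instances, hours, hourly_rate, gb_transferred, rate_per_gb):
    """Return (compute, transfer, total) charges in dollars.

    compute  = hourly charge per virtual machine, summed over all machines
    transfer = data transfer charge
    """
    compute = instances * hours * hourly_rate
    transfer = gb_transferred * rate_per_gb
    return compute, transfer, compute + transfer


# Hypothetical workload and rates, chosen only to make the arithmetic visible:
# 4 instances running 720 hours each, 500 GB transferred.
compute, transfer, total = monthly_cost(
    instances=4, hours=720, hourly_rate=0.10,
    gb_transferred=500, rate_per_gb=0.10,
)
print(f"compute=${compute:.2f} transfer=${transfer:.2f} total=${total:.2f}")
```

Because both terms scale with usage, shutting down idle instances lowers the bill directly, which is the sense of "you pay for resources you consume".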

5 Advantage of Public Cloud
- Public clouds are hosted by large infrastructure companies such as
  - Amazon, Google, Yahoo, Microsoft, Sun
  - These companies can afford huge clouds.
- For many companies, especially start-ups and medium-sized businesses, setting up a private cloud can be too expensive:
  - Hardware cost
  - Software cost
  - Personnel cost for maintaining the system

6 Cloud Characteristics
- Computing power is elastic, but only if the workload is parallelizable.
  - Computing power comes from a shared-nothing architecture.
- Data is stored at an untrusted host.
  - A possible solution is encrypting the data.
- Data is replicated, often across large geographic distances.
  - Replication provides data availability and durability.
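Why elasticity depends on parallelizability can be illustrated with Amdahl's law. The slides do not mention Amdahl's law; it is brought in here as a standard model, and the function name `amdahl_speedup` is made up for this sketch.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Speedup predicted by Amdahl's law: 1 / ((1 - p) + p / n).

    parallel_fraction (p): share of the work that can run in parallel (0..1)
    nodes (n): number of machines rented
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Compare a half-parallel, a mostly parallel, and a fully parallel workload.
for p in (0.5, 0.95, 1.0):
    speedups = [round(amdahl_speedup(p, n), 2) for n in (1, 10, 100)]
    print(f"p={p}: speedup on 1/10/100 nodes = {speedups}")
```

Under this model a workload that is only half parallelizable never runs twice as fast, no matter how many instances are added, so renting more machines buys very little.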

7 Transactional Data Management (OLTP)
- Typically does not use a shared-nothing architecture.
  - OLTP systems are usually less than 1TB in size.
- It is hard to maintain ACID guarantees in the face of data replication over large geographic distances.
  - Google’s Bigtable implements a replicated shared-nothing database by weakening the “A” in ACID.
  - The H-Store project is still at the vision stage.
- There are big risks in storing transactional data on an untrusted host.
  - Transactional data includes details at the lowest granularity.

8 First Conclusion
- Transactional data management applications are not well suited for deployment in the cloud.

9 Analytical Data Management (DW)
- Workloads tend to be read-mostly (or read-only), with occasional batch inserts.
- Shared-nothing architecture is a good match.
  - The ever-increasing amount of data is the primary driver for choosing shared-nothing.
  - Large scans, multidimensional aggregations, and star schema joins for analytical workloads are easy to parallelize on a shared-nothing system.
  - Infrequent writes eliminate the need for complex distributed locking and commit protocols.
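A minimal sketch of why aggregations parallelize well on shared-nothing systems: each node aggregates only its own partition, and a coordinator merges the small partial results. The data, the round-robin placement, and the function names below are all hypothetical, and the "nodes" are simulated as plain lists in one process.

```python
from collections import defaultdict


def partition_rows(rows, num_nodes):
    """Spread rows across nodes round-robin (a stand-in for real data placement)."""
    partitions = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        partitions[i % num_nodes].append(row)
    return partitions


def local_aggregate(partition):
    """Runs independently on one node: SUM(amount) GROUP BY region."""
    totals = defaultdict(int)
    for region, amount in partition:
        totals[region] += amount
    return dict(totals)


def merge(partials):
    """Coordinator step: add up the per-node partial sums."""
    result = defaultdict(int)
    for partial in partials:
        for region, subtotal in partial.items():
            result[region] += subtotal
    return dict(result)


sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8)]
partials = [local_aggregate(p) for p in partition_rows(sales, num_nodes=3)]
totals = merge(partials)
print(totals)
```

No node reads another node's rows and nothing is written, so the sketch needs no locking or commit protocol, which mirrors the last bullet on this slide.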

10 Analytical Data Management (DW): continued
- ACID guarantees are typically not needed.
  - Snapshot isolation is usually enough.
- Particularly sensitive data can often be left out of the analysis.
  - Less granular versions of the data are usually used for analysis.

11 Second Conclusion
- Analytical data management applications are well suited for deployment in the cloud.

12 Vertica (C-Store) for the Cloud

13 Cloud DBMS Wish List
- Efficiency
- Fault tolerance
  - If a query must restart each time a node fails, then long, complex queries are difficult to complete.
- Ability to run in a heterogeneous environment
  - Should prevent the slowest node from having a disproportionate effect on total query performance.
- Ability to operate on encrypted data
- Ability to interface with business intelligence products
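The fault tolerance item can be made concrete with a back-of-the-envelope model. It assumes, purely for illustration, that each node fails independently with the same probability during one run of the query and that any failure forces the whole query to restart. The function names and the failure probability are hypothetical.

```python
def completion_probability(nodes, failure_prob):
    """Chance that no node fails during one run of the query."""
    return (1.0 - failure_prob) ** nodes


def expected_attempts(nodes, failure_prob):
    """Mean number of runs until one finishes (geometric distribution)."""
    p = completion_probability(nodes, failure_prob)
    if p == 0.0:
        raise ValueError("the query can never complete under this model")
    return 1.0 / p


# Hypothetical 1% chance that a given node fails while the query runs.
for n in (10, 100, 1000):
    print(n, "nodes:",
          f"P(complete)={completion_probability(n, 0.01):.4f}",
          f"expected attempts={expected_attempts(n, 0.01):.1f}")
```

Under these assumptions the chance of a clean run shrinks geometrically with cluster size, which is why restart-on-failure becomes impractical for long queries on large clusters.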

14 MapReduce vs. Parallel DBMS (1)
- Efficiency
  - MapReduce is good for brute-force scans over unstructured data such as text documents.
  - Parallel DBMS is good for selective access to structured data.
- Fault tolerance
  - MapReduce treats it as a high priority.
  - Most parallel DBMSs restart a query upon a failure.
- Ability to run in a heterogeneous environment
  - MapReduce does well.
  - Parallel DBMSs are generally designed to run in a homogeneous environment.
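For readers who have not seen the programming model, the brute-force scan over text documents can be sketched as a single-process word count. This is a toy illustration of the map, shuffle, and reduce steps, not the API of any particular MapReduce implementation; all function names are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: scan one document and emit a (word, 1) pair per word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


docs = ["the cloud is elastic", "The cloud is shared"]
print(word_count(docs))
```

Each call to `map_phase` touches one document and depends on no other, so in a real cluster a failed or slow map task can be rerun elsewhere without restarting the whole job. The scan also reads every word of every document, which suits unstructured text but is wasteful when a query needs only a few selected rows.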

15 MapReduce vs. Parallel DBMS (2)
- Ability to operate on encrypted data
  - Neither has the native ability to operate on encrypted data.
- Ability to interface with business intelligence products
  - MapReduce is not intended for interfacing with BI products.
  - Parallel DBMS supports BI products well.

16 A Call for a Hybrid Solution
- Bring together ideas from MapReduce and parallel DBMS.
- The hybrid solution should combine
  - the fault tolerance, heterogeneous cluster support, and out-of-the-box ease of use of MapReduce
  - with the efficiency, performance, and tool pluggability of shared-nothing parallel DBMS.

17 References
1. Abadi, Daniel J. Data Management in the Cloud: Limitations and Opportunities. In IEEE Data Engineering Bulletin, 2009.
2. Vertica Company. Getting Started with Vertica Analytic Database for the Cloud.

