Presentation on theme: "C-Store: Data Management in the Cloud Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun 5, 2009."— Presentation transcript:
C-Store: Data Management in the Cloud Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun 5, 2009
What is the Cloud? A definition from Wikipedia Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.scalablevirtualizedas a service Platform as a service (e.g, Amazon EC2) allows customers to rent computers (virtual machines) on which to run their own computer applications. Software as a service infrastructure as a service
Amazon EC2 (Elastic Compute Cloud) EC2 uses Xen virtualization.Xenvirtualization Each virtual machine, called an "instance", functions as a virtual private server in one of three sizes:virtual private server small, large or extra large. Amazon.com sizes instances based on "EC2 Compute Units" the equivalent CPU capacity of physical hardware. One EC2 Compute Unit equals GHz 2007 Opteron or 2007 Xeon processor. OpteronXeon
Pricing Amazon charges customers in two primary ways: Hourly charge per virtual machine Data transfer charge Amazon advertising describes the pricing scheme as "you pay for resources you consume".
Advantage of Public Cloud Public clouds are hosted by large infrastructure companies such as Amazon, Google, Yahoo, Microsoft, Sun Can afford huge cloud. For many companies, especially for start-ups and medium-sized business), setting up a private cloud can be too expensive hardware cost Software cost Personnel cost for maintaining the system
Cloud Characteristics Computing power is elastic, but noly if workload is parallelizable. Computing power comes from shared-nothing architecture. Data is stored at an un-trusted host. A possible solution is encrypting data. Data is replicated, often across large geographic distance. To provide data availability and durability.
Transactional Data Management (OLTP) Typically does not use a shared-nothing architecture. OLTP systems are usually less than 1TB in size. It is hard to maintain ACID guarantees in the face of data replication over large geographic distances. Google’s Bigtable implements a replicated shared-nothing database, by weaking “A” from ACID. The H-Store project still remains in vision stage. There are big risks in storing transactional data on an un-trusted host. Transactional data include details at the lowest granularity.
First Conclusion Transactional data management applications are not well suited for deployment in the cloud.
Analytical Data Management (DW) Tend to be read-mostly (read-only), with occasional batch inserts. Shared-nothing architecture is a good match. The ever increasing amount of data is the primary driver for choosing shared-nothing. Large scans, multidimensional aggregations, and star schema joins for analytical workload are easy to parallelize on shared-nothing system. Infrequent writes eliminates the need for complex distributed locking and commit protocols.
Analytical Data Management (DW): continued ACID guarantees are typically not needed. Snapshot isolation is usually enough. Particularly sensitive data can often be left out of the analysis. Less granular versions of the data are usually used for analysis.
Second Conclusion Analytical Data Management applications are well-suited for deployment in the cloud.
Vertica (C-Store) for the Cloud
Cloud DBMS Wish List Efficiency Fault Tolerance If a query must restart each time a node fails, then long, complex queries are difficult to complete. Ability to run in a heterogeneous environment. Should prevent the slowest node from making a disproportionate affect on total query performance. Ability to operate on encrypted data. Ability to interface with business intelligence products.
MapReduce vs. Parallel DBMS (1) Efficiency MapReduce is good for brute-force scan over unstructured data such as text documents. Parallel DBMS is good for selective access of structured data. Fault Tolerance MapReduce takes it as a high priority. Most parallel DBMS restart a query upon a faiure. Ability to run in a heterogeneous environment. MapReduce does well. Parallel DBMS are generally designed to run in a homogeneous environment.
MapReduce vs. Parallel DBMS (2) Ability to operate on encrypted data. Neither has the native ability to operate on encrypted data. Ability to interface with business intelligence products. MapReduce is not intended for interfacing with BI products. Parallel DBMS supports BI products well.
A Call for A Hybrid Solution Bring together ideas from MapReduce and Parallel DBMS. The hybrid solution should combine Fault tolerance, heterogeneous cluster, and ease of use out-of-the-box capabilities of MapReduce With the efficiency, performance, and tool plugability of shared-nothing parallel DBMS.
References 1. Abadi, Daniel J. Data Management in the Cloud: Limitations and Opportunities. In IEEE Data Engineering Bulletin, 2009.Data Management in the Cloud: Limitations and Opportunities. 2. Vertica Company. Getting Started with Vertica Analytic Database for the Cloud