1 2009-08-25 Cloud-based Data Management: Challenges & Opportunities.

1 1 2009-08-25 Cloud-based Data Management: Challenges & Opportunities

2 2 Research experience and interests National University of Singapore (PhD): XML query processing and XML keyword search. University of California, Irvine (Postdoc): approximate string processing; data integration and data cleaning. Renmin University of China: cloud data management; XML data management.

3 3 Outline Motivation: cloud data management. Database future and challenges: large-scale data management and transaction processing; cloud-based data indexing and query optimization. Recent research work: an efficient multi-dimensional index for cloud data management (CIKM Workshop CloudDB 2009).

4 4 Motivation: Internet Chatter

5 5 BLOG Wisdom If you want vast, on-demand scalability, you need a non-relational database, since scalability requirements can change very quickly and can grow very rapidly. That is difficult to manage with a single in-house RDBMS server. RDBMSs scale well when limited to a single node, but scaling across multiple server nodes brings overwhelming complexity.

6 6 Current State Most enterprise solutions are based on RDBMS technology. Significant Operational Challenges: Provisioning for Peak Demand Resource under-utilization Capacity planning: too many variables Storage management: a massive challenge System upgrades: extremely time-consuming

7 7 Internet Search Data Analytics: A Case Study Data analytics: Parsed web logs ingested into an RDBMS store. Hourly and daily summarization for custom reporting. Operational nightmare: Keeping the live reporting system on at all costs and at all times. Timely completion of hourly summarization. Constant tension between ad-hoc workload and reporting workload. Data-driven feedback to live products. Temporal depth of detailed data

8 8 Internet Search Data Analytics: A Case Study Various solutions explored: Data Warehousing appliance for fast summarization. Parallel RDBMS technology for fast ad-hoc queries. Business Intelligence Products (Data Cubes) for fast and intuitive reporting and analysis. None of the solutions completely satisfactory: Plans to migrate low-level data to file-based system to overcome Database scalability bottlenecks

9 9 Paradigm Shift in Computing

10 10 WEB is replacing the Desktop

11 11 What is Cloud Computing? Old idea: Software as a service (SaaS) Def: delivering applications over the internet Recently: [Hardware, infrastructure, Platform] as a service Poorly defined so we avoid all X as a service Utility Computing: pay-as-you-go computing Illusion of infinite resources No up-front cost Fine-grained billing (e.g. hourly)

12 12 Why Now? Experience with very large datacenters Unprecedented economies of scale Other factors Pervasive broadband internet Pay-as-you-go billing model

13 13 Cloud Computing Spectrum Instruction Set VM (Amazon EC2, 3Tera) Framework VM Google AppEngine,

14 14 Cloud Killer Apps Mobile and web applications Extensions of desktop software Matlab, Mathematica Batch processing/MapReduce

15 15 Economics of Cloud Users Pay by use instead of provisioning for peak

16 16 Economics of Cloud Users Risk of over-provisioning: underutilization

17 17 Economics of Cloud Users Heavy penalty for under-provisioning

18 18 Economics of Cloud Providers 5-7X economies of scale [Hamilton 2008] Extra benefits Amazon: utilize off-peak capacity Microsoft: sell .NET tools Google: reuse existing infrastructure

19 19 Engineering Definition Providing services on virtual machines allocated on top of a large physical machine pool.

20 20 Business Definition A method to address scalability and availability concerns for large scale applications.

21 21 Data Management in the Cloud?

22 22 Cloud Computing Implications on DBMSs Where do Databases fit in this paradigm? Generational reality: Started with 50 servers on Amazon EC2 Growth of 25,000 users/hour Need to scale to 3,500 servers in 2 days. Many similar stories: RightScale Joyent …

23 23 Clouded Data? Reality Number 1: Unlimited processing assumption. Interactive page views are served by issuing a large number of SQL queries against MySQL, yet sub-millisecond object retrieval is still expected. Reality Number 2: Why can't the database tier be replicated in the same way as the Web server and app server can? These are the major challenges for data management in the cloud.

24 24 The Vision R&D Challenges at the macro level: Where and how does the DBMS fit into this model. R&D Challenges at micro level: Specific technology components that must be developed to enable the migration of enterprise data into the clouds.

25 25 Data and Networks: Attempt Distributed Database (1980s): Idealized view: unified access to distributed data. Prohibitively expensive: global synchronization. Remained a laboratory prototype. Associated technology widely in use: 2PC (two-phase commit)

26 26 Data and Networks: Attempt

27 27 Data and Networks: Pragmatics

28 28 Database on S3: SIGMOD08 Amazon's Simple Storage Service (S3): Updates may not preserve initiation order. No force writes. Eventual guarantee. Proposed solution: Pending Update Queue; checkpoint protocol to ensure consistent ordering. ACID: only Atomicity + Durability

29 29 Unbundling Txns in the Cloud Research results: CIDR09 proposal to unbundle transaction management for cloud infrastructures. Attempts to refit the DBMS engine onto cloud storage and computing

30 30 Analytical Processing

31 31 Architectural and System Impacts Current state: MapReduce Paradigm for data analysis What is missing: Auxiliary structures and indexes for associative access to data (i.e., attribute-based access) Caveat: inherent inconsistency and approximation Future projection: Eventual merger of databases (ODSs) and data warehouses as we have learned to use and implement them.

32 32 Underlying Principles: CIDR2009 Business data may not always reflect the state of the world or the business: Inherent lack of perfect information Secondary data need not be updated with primary data: Inherent latency Transactions/Events may temporarily violate integrity constraints: Referential integrity may need to be compromised

33 33 Data Security & Privacy Data privacy remains a show-stopper in the context of database outsourcing. Encryption-based solutions are too expensive and are projected to remain so for the foreseeable future: Private Information Retrieval (Sion2008). Other approaches: information-theoretic approaches that use data partitioning for security (Emekci2007); hardware-based solutions for information security

34 34 Self management and self tuning in cloud-based data management Self management and self tuning Query optimization on thousands of nodes

35 35 Remarks Data management for cloud computing poses a fundamental challenge to database researchers: scalability, reliability, data consistency. Radically different approaches and solutions are warranted to overcome this challenge: we need to understand the nature of new applications

36 36 References P. Helland, Life Beyond Distributed Transactions: An Apostate's Opinion, CIDR07. M. Brantner, D. Florescu, D. Graf, D. Kossmann, T. Kraska, Building a Database on S3, SIGMOD08. D. Lomet, A. Fekete, G. Weikum, M. Zwilling, Unbundling Transaction Services in the Cloud, CIDR09. S. Finkelstein, R. Brendle, D. Jacobs, Principles for Inconsistency, CIDR09. VLDB Database School (China) 2009 l2009English.htm

37 37 CIKM workshop CloudDB09

38 38 INTRODUCTION MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE Extended Nodes partition Node partition Cost Estimation Strategy EVALUATION

39 39 Google File System Yahoo PNUTS

40 40 BigTable HBase How to query on other attributes besides primary key?

41 41 S. Wu and K.-L. Wu, An indexing framework for efficient retrieval on the cloud, IEEE Data Eng. Bull., vol. 32, pp. 75–82, 2009. H.-C. Yang and D. S. Parker, Traverse: Simplified indexing on large map-reduce-merge clusters, in Proceedings of DASFAA 2009, Brisbane, Australia, April 2009, pp. 308–322. M. K. Aguilera, W. Golab, and M. A. Shah, A practical scalable distributed B-tree, in Proceedings of VLDB08, Auckland, New Zealand, August 2008, pp. 598–609.

42 42 INTRODUCTION MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE Extended Nodes partition Node partition Cost Estimation Strategy EVALUATION

44 44 An R-tree is a tree data structure that is similar to a B-tree, but is used for spatial access methods, that is, for indexing multi-dimensional data.

45 45 kd-tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space.
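
As a rough illustration of the definition above, the following Python sketch builds a kd-tree by cycling through the axes and splitting at the median point, then answers an axis-aligned range query. The names `KDNode`, `build_kdtree` and `range_search` are made up for this example. The slides do not say how the KD-tree is implemented, so this is one common textbook construction, not the authors' code.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class KDNode:
    point: Point                      # splitting point stored at this node
    axis: int                         # dimension used to split at this level
    left: Optional["KDNode"] = None   # points with coordinate <= point[axis]
    right: Optional["KDNode"] = None  # points with coordinate >= point[axis]


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[KDNode]:
    """Split on axis (depth mod k) at the median point, then recurse."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(
        point=ordered[mid],
        axis=axis,
        left=build_kdtree(ordered[:mid], depth + 1),
        right=build_kdtree(ordered[mid + 1:], depth + 1),
    )


def range_search(node: Optional[KDNode], lo: Point, hi: Point,
                 found: Optional[List[Point]] = None) -> List[Point]:
    """Collect every point p with lo[i] <= p[i] <= hi[i] in all dimensions."""
    if found is None:
        found = []
    if node is None:
        return found
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        found.append(node.point)
    # Only descend into a side that can still intersect the query box.
    if lo[node.axis] <= node.point[node.axis]:
        range_search(node.left, lo, hi, found)
    if hi[node.axis] >= node.point[node.axis]:
        range_search(node.right, lo, hi, found)
    return found


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
hits = sorted(range_search(tree, (3, 1), (8, 5)))
```

The pruning test on `node.axis` is what lets a range query skip whole subtrees, which is the property a multi-dimensional index needs for queries on non-key attributes.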

46 46 Master Slave range 0 2000, 500~1200 range 800 3500, 300~1300 range 6300 7000, 599~1400 range 2000 40000, 3400~8900 range 6800 9000, 3400~8900
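
One possible reading of the ranges on this slide is that each slave publishes a two-dimensional node cube (one interval per indexed attribute) and the master forwards a query only to the slaves whose cube overlaps it. The Python sketch below assumes that reading. It reuses the first three ranges from the slide, but the slave names and the functions `overlaps` and `route` are hypothetical, and a linear scan stands in for whatever spatial index (for example an R-tree) the master would actually use.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]
Cube = List[Interval]  # one (low, high) interval per indexed attribute


def overlaps(a: Cube, b: Cube) -> bool:
    """Two cubes overlap when their intervals overlap in every dimension."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))


def route(query: Cube, node_cubes: Dict[str, List[Cube]]) -> List[str]:
    """Return the slaves that hold at least one cube overlapping the query."""
    return [slave for slave, cubes in node_cubes.items()
            if any(overlaps(query, cube) for cube in cubes)]


# Ranges taken from the slide, interpreted as (attribute 1, attribute 2).
# The slave names are invented for this example.
node_cubes = {
    "slave1": [[(0, 2000), (500, 1200)]],
    "slave2": [[(800, 3500), (300, 1300)]],
    "slave3": [[(6300, 7000), (599, 1400)]],
}

targets = route([(1000, 1500), (600, 700)], node_cubes)
no_targets = route([(6500, 6600), (0, 100)], node_cubes)
```

Under this reading, a query that misses every cube is never forwarded, so the slaves only do local index lookups for queries that might match their data.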

47 47 INTRODUCTION MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE Extended Nodes partition Node partition Cost Estimation Strategy EVALUATION

48 48 Nodes partition for data summary Random cutting: Pick several random values on the attribute and cut at those points. The random method may perform very well, but it may also perform poorly. Equal cutting: Cut the attribute into several equal intervals. This method is relatively stable, since no extreme case can occur. Clustering-based cutting: Cluster the values on the attribute and cut between clusters. This method is expected to perform better, but its time cost is also clearly higher: the time complexity of a clustering algorithm is typically O(n log n) or higher.
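
The three cutting strategies can be sketched in Python for a single attribute as follows. The function names are invented for this example, and the clustering variant is a deliberately simple stand-in: it cuts at the midpoints of the widest gaps between sorted values, which in one dimension behaves like a basic single-linkage clustering. The slides do not say which clustering algorithm the authors used.

```python
import random
from typing import List, Optional, Sequence


def random_cuts(lo: float, hi: float, k: int,
                rng: Optional[random.Random] = None) -> List[float]:
    """Random cutting: k cut points drawn uniformly from [lo, hi]."""
    rng = rng or random.Random()
    return sorted(rng.uniform(lo, hi) for _ in range(k))


def equal_cuts(lo: float, hi: float, k: int) -> List[float]:
    """Equal cutting: k cut points that split [lo, hi] into k + 1 equal intervals."""
    step = (hi - lo) / (k + 1)
    return [lo + step * i for i in range(1, k + 1)]


def gap_cuts(values: Sequence[float], k: int) -> List[float]:
    """Clustering-style cutting (assumed stand-in): cut in the k widest gaps.

    Sorting dominates the cost, so this runs in O(n log n).
    """
    distinct = sorted(set(values))
    gaps = sorted(((b - a, (a + b) / 2) for a, b in zip(distinct, distinct[1:])),
                  reverse=True)
    return sorted(mid for _, mid in gaps[:k])


values = [1, 2, 3, 50, 51, 52, 100, 101]
by_random = random_cuts(0, 101, 2, random.Random(7))
by_equal = equal_cuts(0, 100, 3)
by_gaps = gap_cuts(values, 2)
```

On the sample values, equal cutting ignores where the data actually lies, while the gap-based cuts land between the three visible groups. That is the trade-off the slide describes: better-fitting intervals in exchange for a sort over the data.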

49 49 Nodes partition: Random cutting, Equal cutting, Clustering-based cutting

51 51 Dynamic maintenance of indexes Update of node cube: Why? The data distribution in the node cube may change greatly, leaving the cube sparse or very uneven. How? Reorganize the nodes partition. When? A two-phase approach: after each update, compute the minimal ΔT before the next update; when ΔT expires, check whether an update is needed.

52 52 Dynamic maintenance of indexes Basic idea: benefit > cost. The volume of a node cube is defined as the number of combinations of records that can be made out of the cube. It can be calculated as the product of the lengths of all the intervals. We denote the volume of a cube by v. For the cube {[1, 11], [2, 5]}, the volume is (11-1)*(5-2) = 30.

53 53 Dynamic maintenance of indexes Assumption: The number of queries forwarded to each slave node is proportional to the total volume of all the node cubes of that slave node.
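
The volume definition and the proportionality assumption can be checked with a few lines of Python. This is only a sketch: `cube_volume` and `expected_query_share` are names chosen for this example, the slave names are invented, and the second cube is made-up data used to show the ratio.

```python
from math import prod
from typing import Dict, List, Tuple

Cube = List[Tuple[float, float]]  # one (low, high) interval per attribute


def cube_volume(cube: Cube) -> float:
    """Volume of a node cube: the product of the lengths of its intervals."""
    return prod(hi - lo for lo, hi in cube)


def expected_query_share(node_cubes: Dict[str, List[Cube]]) -> Dict[str, float]:
    """Fraction of queries each slave is assumed to receive.

    Follows the stated assumption that the share is proportional to the
    total volume of the slave's node cubes.
    """
    totals = {slave: sum(cube_volume(c) for c in cubes)
              for slave, cubes in node_cubes.items()}
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("total volume must be positive")
    return {slave: total / grand_total for slave, total in totals.items()}


v = cube_volume([(1, 11), (2, 5)])  # the cube {[1, 11], [2, 5]} from the slides
shares = expected_query_share({
    "slaveA": [[(1, 11), (2, 5)]],   # volume 30
    "slaveB": [[(0, 10), (0, 1)]],   # volume 10 (made-up cube)
})
```

With these inputs the first slave is expected to see three quarters of the queries, which is why shrinking a cube's volume is treated as the benefit of an index update.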

54 54 Dynamic maintenance of indexes benefit = (Δv/v) * nq * ΔT, where Δv is the decrease in volume after the update and nq is the number of queries this node must process before the update. cost = mt/qt, where mt is the time cost of the last update and qt is the time needed to process one query. benefit > cost => ΔT > (mt * v)/(qt * Δv * nq)
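
To make the inequality concrete, the sketch below computes the smallest ΔT for which an update pays off, using benefit = (Δv/v) * nq * ΔT and cost = mt/qt as given on the slide. It assumes nq is a query rate per unit time, which is what makes the units of benefit and cost agree. The function names and the sample numbers are invented for this example.

```python
def min_update_interval(mt: float, qt: float, v: float,
                        dv: float, nq: float) -> float:
    """Smallest delta-T at which benefit equals cost.

    mt: time cost of the last update
    qt: time needed to process one query
    v:  current volume of the node cube
    dv: decrease in volume expected from the update
    nq: queries the node processes per unit time (assumed to be a rate)
    """
    if min(mt, qt, v, dv, nq) <= 0:
        raise ValueError("all parameters must be positive")
    return (mt * v) / (qt * dv * nq)


def worth_updating(delta_t: float, mt: float, qt: float, v: float,
                   dv: float, nq: float) -> bool:
    """Evaluate benefit > cost directly for a given delta-T."""
    benefit = (dv / v) * nq * delta_t
    cost = mt / qt
    return benefit > cost


# Made-up numbers: a 2 s update, 0.25 s per query, volume shrinking from 30 by 10,
# and 4 queries per second arriving at this node.
threshold = min_update_interval(mt=2.0, qt=0.25, v=30.0, dv=10.0, nq=4.0)
before = worth_updating(5.0, mt=2.0, qt=0.25, v=30.0, dv=10.0, nq=4.0)
after = worth_updating(7.0, mt=2.0, qt=0.25, v=30.0, dv=10.0, nq=4.0)
```

With these numbers the threshold is 6 time units: waiting 5 units does not justify the update, while waiting 7 does. This corresponds to the first phase of the two-phase approach; the check performed once ΔT expires is a separate step.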

55 55 Dynamic maintenance of indexes After ΔT expires, check whether an update is needed. This check involves the following: record update frequency, expected benefit ratio, and performance requirement. We leave this as future work.

56 56 6 machines (1 master, 5 slaves): 100~1000 nodes. Each machine had a 2.33 GHz Intel Core2 Quad CPU, 4 GB of main memory, and a 320 GB disk. Machines ran Ubuntu 9.04 Server OS.

58 58 Result Cover Rate: one ten thousandth

59 59 In this paper we presented a series of approaches for building an efficient multi-dimensional index on a cloud platform. We used a combination of R-tree and KD-tree as the index structure. We developed the node partition technique to reduce query processing cost on the cloud platform. To keep the index efficient, we proposed a cost-estimation-based approach for index updates.

60 60 Future work: better node partition algorithms; improve the estimation-based approach; consider multiple replicas of data

62 62 Result Cover Rate: one thousandth 1 ~ 2

63 63 Result Cover Rate: one thousandth 4 ~ 5
