Index for Cloud Data Management
Lab of Web And Mobile Data Management (WAMDM)
Youzhong MA


1 Index for Cloud Data Management
Lab of Web And Mobile Data Management (WAMDM)
Youzhong MA

2 Outline
- Motivating Applications
- Existing Technologies
- Conclusions & Future work

3 Motivating Application
Cloud System: Big Data in a Private Cloud
Table: Product
Query: select sum(number) from Product where product.name = 'beer' and product.price <= 10$ and product.price >= 5$
Queries on multiple attributes and on non-rowkey columns are quite common!

4 Motivating Application: Mobile Coupon Distribution
[Figure: a Mobile Coupon Distributer sends coupons to users based on their current locations and a distribution policy (area, # of coupons).]

5 Motivating Application: Mobile Coupon Distribution
[Figure: many users reporting their current locations; coupons are sent according to a distribution policy (area, # of coupons).]
- Large amounts of Data
- High Throughput
- System Scalability
- Multi-Dimensional Query
- Nearest Neighbors Query
- Efficient Complex Queries
125,000,000 subscribers in Japan

6 Outline
- Motivating Applications
- Existing Technologies
- Conclusions & Future work

7 Existing Technologies
[Figure: technologies compared on two axes, Multi-dimensional Queries and Scalability. Shown: Relational DBs; Spatial DBs (commercial products, but expensive; open source products); Key-Value Stores; and "What We Want", at a reasonable price.]

8 Solutions overview
Rowkey, Single Dimensional Index: [BigTable, HBase] [Point Query, Range Query]
Non-rowkey, Single Dimensional Index: [Aguilera PVLDB’08] [S. Wu Data Eng’09] [S. Wu PVLDB’10]
Non-rowkey, Multiple Dimensional Index: [X. Zhang CloudDB’09] [J. Wang SIGMOD’10] [G. Chen VLDB’11] [Y. Zou NPC’10] [Shoji Nishimura MDM’11]
Figure labels: Local Index + Global Index, NEC, CAS

9 Efficient B-tree Based Indexing for Cloud Data Processing S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. PVLDB'10

10 Efficient B-tree Based Indexing for Cloud Data Processing
- Motivation
  - Design a scalable, high-throughput indexing scheme that supports efficient queries over huge volumes of data in the cloud
  - Keep maintenance cost low while still supporting parallel search

11 System Architecture ① Local Index ② BATON overlay network ③ publish

12 Challenges
- How to select the local B+-tree nodes to publish to the global index?
- How to organize the global index?
- How to maximize the throughput?

13 Selecting local B+-tree nodes
- Cost modeling
  - Query cost
    1. Routing cost: search in the global index
    2. Local search cost: search in the local index
  - Update cost
- Cost parameters: the cost of sending an index message, and the cost of a random I/O

14 Adaptive indexing strategy
- Index expand
- Index collapse
[Figure: local index]

15 BATON: Balanced Tree Overlay Network
- A distributed tree structure for P2P systems
- Supports range search

16 Index Construction
- Assign a range to each node
- For each node n
  - The ranges in its left sub-tree are less than the range of n
  - The ranges in its right sub-tree are larger than the range of n
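The range-assignment rule on this slide can be illustrated with a small sketch. It is not BATON itself: the real overlay also keeps sideways routing tables and balances itself as peers join and leave. Here a balanced binary tree is built up front, an assumed integer key domain [lo, hi) is cut into equal slices, and the slices are handed out in in-order, which gives every node the left-smaller, right-larger property. The names `Node`, `build`, `assign_ranges` and `lookup` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    lo: int = 0                      # inclusive lower bound of the node's range
    hi: int = 0                      # exclusive upper bound of the node's range
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(n: int) -> Optional[Node]:
    """Build a balanced binary tree with n nodes (ranges not yet assigned)."""
    if n == 0:
        return None
    left_size = (n - 1) // 2
    return Node(left=build(left_size), right=build(n - 1 - left_size))


def in_order(node: Optional[Node]) -> List[Node]:
    if node is None:
        return []
    return in_order(node.left) + [node] + in_order(node.right)


def assign_ranges(root: Node, lo: int, hi: int) -> None:
    """Give the i-th node in in-order the i-th equal slice of [lo, hi)."""
    nodes = in_order(root)
    n = len(nodes)
    for i, node in enumerate(nodes):
        node.lo = lo + (hi - lo) * i // n
        node.hi = lo + (hi - lo) * (i + 1) // n


def lookup(root: Node, key: int) -> Tuple[Optional[Node], int]:
    """Walk parent-child links only; return the owning node and the hop count."""
    node, hops = root, 0
    while node is not None:
        if key < node.lo:
            node = node.left
        elif key >= node.hi:
            node = node.right
        else:
            return node, hops
        hops += 1
    return None, hops
```

With seven nodes over the domain [0, 70), the root owns [30, 40), and a lookup for key 5 descends left twice before reaching the node that owns [0, 10).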

17 Publish local B+-tree node to BATON

18 Maximizing the throughput
- Eventual consistency model
- Lazy update
  - If an update does not affect the key range of a local B+-tree, the stale index does not affect the correctness of query processing.
- Eager update
  - Updates in the left-most and right-most nodes
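A minimal sketch of the lazy/eager decision, under simplifying assumptions: a plain Python list stands in for the local B+-tree, and the global index is reduced to one published (min, max) key range per local index. The class `LocalIndex` and its fields are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class LocalIndex:
    """Stands in for one node's local B+-tree plus its global-index entry."""
    keys: List[int] = field(default_factory=list)
    published: Optional[Tuple[int, int]] = None  # range known to the global index
    messages_sent: int = 0                       # index messages sent so far

    def insert(self, key: int) -> str:
        self.keys.append(key)
        if self.published is not None:
            lo, hi = self.published
            if lo <= key <= hi:
                # Lazy: the published range still covers every local key,
                # so queries routed through the global index stay correct.
                return "lazy"
            self.published = (min(lo, key), max(hi, key))
        else:
            self.published = (key, key)
        # Eager: the key range changed, so the global index is told at once.
        self.messages_sent += 1
        return "eager"
```

Inserting 10, 20, 15 and 5 in that order yields eager, eager, lazy, eager: only the inserts that move a boundary of the key range cost an index message.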

19 Pros and cons
- Pros
  - Supports efficient point queries and range queries on non-rowkey columns
  - Proposes an adaptive indexing strategy based on a cost model of overlay routing
- Cons
  - Cannot support multi-dimensional queries

20 Multi-dimensional index [X.Zhang CloudDB’09]

21 Multi-dimensional index [J.Wang SIGMOD’10] [G.Chen VLDB’11]

22 MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services Shoji Nishimura, Sudipto Das. MDM'11

23 Contributions
- Uses linearization to implement a scalable multi-dimensional index structure layered over a range-partitioned key-value store
- Implements a K-d tree and a Quad tree using this design

24 Ordered Key-Value Stores
[Figure: an index (key00, key11, …, keynn) pointing to buckets of key-value pairs (key00/value00 … key0X/value0X, key11/value11 … key1Y/value1Y, …, keynn/valuenn), sorted by key.]
- Good at 1-D range queries
- But our target is multi-dimensional (longitude, latitude, time)…

25 Naïve Solution: Linearization
[Figure: the same index and buckets of key-value pairs as on the previous slide.]
- Projects n-D space to 1-D space
- Apply a Z-ordering curve…
- Simple, but problematic…
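Z-ordering linearization can be sketched as bitwise interleaving of the coordinates. The bit order used here (x bit first, most significant bit first) is an assumption, chosen because it reproduces the corner values shown later in the deck; the function names are hypothetical.

```python
from typing import Tuple


def interleave(x: int, y: int, bits: int) -> int:
    """Map a 2-D point to its 1-D Z-order key by interleaving bits (x first)."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def deinterleave(z: int, bits: int) -> Tuple[int, int]:
    """Inverse of interleave: recover (x, y) from a Z-order key."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


# Sorting by the Z-order key is what lets an ordered key-value store hold
# 2-D points in a single key dimension.
points = [(3, 3), (0, 1), (2, 0), (1, 0)]
ordered = sorted(points, key=lambda p: interleave(p[0], p[1], bits=2))
```

With 2 bits per dimension, the point (10, 00) in binary maps to key 1000 and (11, 11) maps to 1111.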

26 Problem: False positive scans
- MD-query on linearized space
  - Translate the MD-query to a linearized range query. Ex.: query from 2 to 9.
  - Scan the queried linearized range.
  - Filter out points outside the queried area. Ex.: the blue-hatched area (4 to 7).
- Requires the boundary information of the original space
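The slide's numbers can be reproduced with a short sketch, assuming a 4 x 4 grid (2 bits per dimension), x-bit-first interleaving, and a query rectangle with corners (1, 0) and (2, 1). The grid size, bit order and rectangle are assumptions picked so that the linearized range comes out as 2 to 9; the function names are hypothetical.

```python
from typing import List, Tuple


def interleave(x: int, y: int, bits: int) -> int:
    """Z-order key of (x, y), x bit first (assumed bit order)."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def deinterleave(z: int, bits: int) -> Tuple[int, int]:
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def z_range_scan(xmin: int, ymin: int, xmax: int, ymax: int,
                 bits: int) -> Tuple[List[int], List[int]]:
    """Scan the whole linearized range, then split hits from false positives."""
    lo = interleave(xmin, ymin, bits)   # smallest key inside the rectangle
    hi = interleave(xmax, ymax, bits)   # largest key inside the rectangle
    hits, false_positives = [], []
    for z in range(lo, hi + 1):
        x, y = deinterleave(z, bits)
        if xmin <= x <= xmax and ymin <= y <= ymax:
            hits.append(z)
        else:
            false_positives.append(z)
    return hits, false_positives


hits, wasted = z_range_scan(1, 0, 2, 1, bits=2)
```

Under these assumptions the scan covers keys 2 through 9, of which 2, 3, 8 and 9 lie in the queried area and 4 through 7 are read only to be discarded.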

27 Build a Multi-dimensional Index Layer on top of an Ordered Key-Value Store
[Figure: MD-HBase places a multi-dimensional index layer above the single-dimensional index of an ordered key-value store, e.g. BigTable, HBase, …]

28 Space Partition by the K-d tree
[Figure: the binary Z-ordering space, built by bitwise interleaving, next to the space partitioned by the K-d tree.]
How do we represent these subspaces?

29 Key Idea: The longest common prefix naming scheme
- Subspaces are represented as the longest common prefix of their keys!
- Remarkable property: preserves the boundary information of the original space
- Example: subspace 1***
  - Left-bottom corner (* → 0): (10, 00)
  - Right-top corner (* → 1): (11, 11)
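The boundary reconstruction in the 1*** example can be sketched in a few lines: replace every * with 0 to get the left-bottom corner, with 1 to get the right-top corner, then de-interleave each key. The sketch assumes 2 bits per dimension and x-bit-first interleaving, which matches the corners given on the slide; `prefix_bounds` is a hypothetical name.

```python
from typing import Tuple

Point = Tuple[int, int]


def deinterleave(z: int, bits: int) -> Point:
    """Recover (x, y) from a Z-order key whose bits alternate x, y, x, y, ..."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix: str, bits: int) -> Tuple[Point, Point]:
    """Return (left-bottom corner, right-top corner) of a subspace name."""
    lowest_key = int(prefix.replace("*", "0"), 2)
    highest_key = int(prefix.replace("*", "1"), 2)
    return deinterleave(lowest_key, bits), deinterleave(highest_key, bits)


low, high = prefix_bounds("1***", bits=2)
# Format the corners as 2-bit binary strings, the notation used on the slide.
low_bin = tuple(format(c, "02b") for c in low)
high_bin = tuple(format(c, "02b") for c in high)
```

Here `low_bin` is ('10', '00') and `high_bin` is ('11', '11'), the same corners as on the slide, so the subspace name alone is enough to recover its bounding box.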

30 Build an index with the longest common prefix of keys
[Figure: index entries 000*, 001*, 01**, 1***, each pointing to its bucket.]
Buckets are allocated per subspace

31 Multi-dimensional Range Query
- Scan on the index (subspaces 000*, 001*, 01**, 10**, 11**)
- Filter (subspace pruning): reconstruct the boundary information and check whether each subspace intersects the queried area
- Scan the subspaces that survive pruning
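Subspace pruning can be sketched as a rectangle-intersection test over the reconstructed boundaries. The bucket names are the ones on the slide; the query rectangle (corners (1, 0) and (2, 1)), the 2-bit grid and the x-bit-first interleaving are assumptions, and all function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Box = Tuple[Point, Point]


def deinterleave(z: int, bits: int) -> Point:
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix: str, bits: int) -> Box:
    """Corners of a subspace: '*' -> 0 for the low key, '*' -> 1 for the high key."""
    low = int(prefix.replace("*", "0"), 2)
    high = int(prefix.replace("*", "1"), 2)
    return deinterleave(low, bits), deinterleave(high, bits)


def intersects(box: Box, query: Box) -> bool:
    """True when two axis-aligned boxes (inclusive corners) overlap."""
    (bx0, by0), (bx1, by1) = box
    (qx0, qy0), (qx1, qy1) = query
    return bx0 <= qx1 and qx0 <= bx1 and by0 <= qy1 and qy0 <= by1


def prune(buckets: List[str], query: Box, bits: int) -> List[str]:
    """Keep only the buckets whose subspace overlaps the queried area."""
    return [p for p in buckets if intersects(prefix_bounds(p, bits), query)]


buckets = ["000*", "001*", "01**", "10**", "11**"]
to_scan = prune(buckets, query=((1, 0), (2, 1)), bits=2)
```

For this query only 001* and 10** need to be scanned. Compared with scanning the full linearized range, the buckets covering keys 4 to 7 are skipped without being read.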

32 Variations of Storage Layer
- Table Share Model
  - Uses a single table and maintains bucket boundaries
  - Most space-efficient
- Table per Bucket Model
  - Allocates a table per bucket
  - Most flexible mapping: one-to-one, one-to-many, many-to-one
  - Bucket split is expensive: all points are copied to the new buckets
- Region per Bucket Model
  - Allocates a region per bucket
  - Most efficient bucket split
  - Requires modification of HBase

33 Experimental Results: Multi-dimensional Range Query
- Dataset: 400,000,000 points
- Queries: select objects within MD ranges, varying the selectivity
- Cluster size: 16 nodes
- MD-HBase responds 10 to 100 times faster than the others, and its response time is proportional to selectivity.

34 Experimental Results: Insert
- Dataset: spatially skewed data
- MD-HBase shows good scalability without significant overhead.

35 Conclusions
- Designed a scalable multi-dimensional data store.
  - Maps multiple dimensions to a single dimension
  - Key idea: indexing the longest common prefix of keys
- Demonstrated scalable insert throughput and excellent query performance.
  - Range query: 10 to 100 times faster than existing technologies.
  - Insert: 220K inserts/sec on a 16-node cluster without overhead

36 CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries Y. Zou, J. Liu, S. Wang. NPC’10

37 Introduction
- Motivation
  - Building indexes in DOTs to support multi-dimensional range queries
  - High performance, low space overhead, high reliability
- DOT
  - Distributed Ordered Table
  - BigTable, HBase
- Observations
  - DOTs usually keep 3 to 5 replicas
  - The number of indexes is usually less than 5
  - Random reads are significantly slower than scans

38 Basic idea: Complemental Clustering Index
- CCIT: converts slow random reads to fast sequential scans
- CCT: for fast data recovery
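A toy sketch of why a clustering index table turns random reads into a sequential scan. It assumes the index table is ordered by (indexed value, rowkey) and stores the full row, so a range query on the indexed column reads one contiguous slice and never goes back to the main table. In-memory Python structures stand in for HBase tables, the sample rows are made up for illustration, and the key layout and function names are assumptions rather than details from the paper. The CCT and recovery path are not modelled.

```python
import bisect
from typing import Dict, List, Tuple

Row = Dict[str, object]

# Main table, ordered by rowkey (made-up sample rows).
rows: Dict[str, Row] = {
    "r1": {"name": "beer", "price": 7},
    "r2": {"name": "wine", "price": 12},
    "r3": {"name": "beer", "price": 5},
    "r4": {"name": "cola", "price": 2},
}


def build_ccit(table: Dict[str, Row], column: str) -> List[Tuple[Tuple[object, str], Row]]:
    """Index table ordered by (column value, rowkey); each entry holds the full row."""
    entries = [((row[column], rowkey), row) for rowkey, row in table.items()]
    return sorted(entries, key=lambda entry: entry[0])


def range_scan(ccit: List[Tuple[Tuple[object, str], Row]], lo, hi) -> List[Row]:
    """Return rows with lo <= value <= hi as one contiguous slice of the index table."""
    values = [key[0] for key, _ in ccit]
    start = bisect.bisect_left(values, lo)
    end = bisect.bisect_right(values, hi)
    return [row for _, row in ccit[start:end]]


price_index = build_ccit(rows, "price")
matches = range_scan(price_index, 5, 10)
```

The price paid for the contiguous read is storage: the full row is duplicated in every index table, which is the space overhead the deck discusses a few slides later.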

39 Challenges
- Performance
- Reliability
- Space overhead

40 Performance
- HBase
- 16 nodes
- 90 million records
Query optimization based on the region-to-server mapping information

41 Reliability: Fault tolerance
- Get other index values from CCTs
- Query the CCITs to recover data
- Replicate CCTs

42 Space overhead
- N: the number of index columns
- X-axis: ratio of record length to index column length
- Y-axis: overhead ratio

43 Conclusions
- Proposed CCIndex to support multi-dimensional range queries in DOTs
- Not suitable for more than 5 index columns
- Write operations are slower than on the original table

44 Outline
- Motivating Applications
- Existing Technologies
- Conclusions & Future work

45 Conclusions
- Index for non-rowkey columns in cloud data management systems
- Solutions
  - Local index + global index
  - Linearization
  - Secondary index
- Key issues
  - Index reliability
  - Query result correctness
  - Index maintenance
  - …

46 Future work
- Study the architecture of HDFS and HBase in detail
- Test the existing index solutions in the cloud
- Index framework and index structure

47 References
- M. K. Aguilera, W. Golab, and M. A. Shah. A practical scalable distributed b-tree. PVLDB, 1(1):598–609, 2008.
- Y. Zou, J. Liu, S. Wang. CCIndex: a Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries. NPC’10.
- S. Wu and K.-L. Wu. An indexing framework for efficient retrieval on the cloud. IEEE Data Eng. Bull., vol. 32, pp. 75–82, 2009.
- J. Wang, S. Wu, H. Gao, J. Li, and B. C. Ooi. Indexing multi-dimensional data in a cloud system. In SIGMOD, 2010.
- S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. Efficient b-tree based indexing for cloud data processing. PVLDB, 3(1):1207–1218, 2010.
- X. Zhang, J. Ai, Z. Wang, J. Lu, and X. Meng. An efficient multidimensional index for cloud data management. In CloudDB, 2009, pp. 17–24.
- Shoji Nishimura, Sudipto Das. MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. MDM 2011.

48 Thank you

