Index for Cloud Data Management. Lab of Web And Mobile Data Management (WAMDM). Youzhong MA.

Outline
- Motivating Applications
- Existing Technologies
- Conclusions & Future Work

Motivating Application: Big Data in a Private Cloud
[Figure: a cloud system holding the table Product.]
- Example query: select sum(number) from Product where product.name = 'beer' and product.price <= 10 and product.price >= 5 (prices in dollars)
- Queries on multiple attributes and on non-rowkey columns are quite common!

Motivating Application: Mobile Coupon Distribution
[Figure: a mobile coupon distributor sends coupons to users according to their current locations and a distribution policy (area, # of coupons).]

Motivating Application: Mobile Coupon Distribution
[Figure: many users reporting their current locations; distribution policy (area, # of coupons); coupon.]
- Large amounts of data
- High throughput
- System scalability
- Multi-dimensional query
- Nearest neighbors query
- Efficient complex queries
- 125,000,000 subscribers in Japan

Existing Technologies
[Figure: technologies compared on two axes, multi-dimensional queries and scalability. Labels: Relational DBs; Spatial DBs; Open source products; Commercial products (but expensive); Key-Value Stores; What We Want (at a reasonable price).]

Solutions: overview
- Rowkey, single-dimensional index: [BigTable, HBase] [Point Query, Range Query]
- Non-rowkey, single-dimensional index: [Aguilera PVLDB’08] [S. Wu Data Eng’09] [S. Wu PVLDB’10]
- Non-rowkey, multi-dimensional index: [X. Zhang CloudDB’09] [J. Wang SIGMOD’10] [G. Chen VLDB’11] [Y. Zou NPC’10] [Shoji Nishimura MDM’11]
- Figure annotations: Local Index + Global Index; NEC; CAS

Efficient B-tree Based Indexing for Cloud Data Processing. S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. PVLDB'10

Efficient B-tree Based Indexing for Cloud Data Processing
- Motivation
  - Design a scalable, high-throughput indexing scheme that supports efficient queries over huge volumes of data in the cloud
  - Keep the maintenance cost low while also supporting parallel search

System Architecture
[Figure labels: ① Local index ② BATON overlay network ③ Publish]

Challenges
- How to select the local B+-tree nodes to publish in the global index?
- How to organize the global index?
- How to maximize the throughput?

Selecting local B+-tree nodes
- Cost modeling
  - Query cost: (1) routing cost, for the search in the global index; (2) local search cost, for the search in the local index
  - Update cost
  - Cost parameters: the cost of sending an index message and the cost of a random I/O

Adaptive indexing strategy
- Index expand
- Index collapse
[Figure: local index.]

BATON: Balanced Tree Overlay Network
- A distributed tree structure for P2P systems
- Supports range search

Index Construction
- Assign a range to each node
- For each node n
  - The range of its left sub-tree is less than that of n
  - The range of its right sub-tree is larger than that of n
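The range-assignment rule can be sketched with a small in-memory tree. This is only a centralized toy under stated assumptions: the names `Node`, `build`, and `range_search` are hypothetical, ranges are half-open `[lo, hi)`, and none of BATON's peer-to-peer routing or balancing is modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    lo: int                      # inclusive lower bound of this node's range
    hi: int                      # exclusive upper bound of this node's range
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(ranges: List[Tuple[int, int]]) -> Optional[Node]:
    """Build a balanced tree from sorted, non-overlapping ranges.

    Every range in the left sub-tree is below the node's range and every
    range in the right sub-tree is above it.
    """
    if not ranges:
        return None
    mid = len(ranges) // 2
    lo, hi = ranges[mid]
    return Node(lo, hi, build(ranges[:mid]), build(ranges[mid + 1:]))


def range_search(node: Optional[Node], q_lo: int, q_hi: int) -> List[Tuple[int, int]]:
    """Return the ranges of all nodes that overlap the query [q_lo, q_hi)."""
    if node is None:
        return []
    found: List[Tuple[int, int]] = []
    if q_lo < node.lo:                       # query may reach the left sub-tree
        found += range_search(node.left, q_lo, q_hi)
    if q_lo < node.hi and node.lo < q_hi:    # query overlaps this node
        found.append((node.lo, node.hi))
    if q_hi > node.hi:                       # query may reach the right sub-tree
        found += range_search(node.right, q_lo, q_hi)
    return found


root = build([(i, i + 10) for i in range(0, 70, 10)])
print(range_search(root, 15, 35))
```

Because the ordering invariant holds at every node, a range search only descends into sub-trees that can overlap the query.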

Publish local B+-tree node to BATON

Maximizing the throughput
- Eventually consistent model
- Lazy update
  - If an update does not affect the key range of a local B+-tree, the stale index does not affect the correctness of query processing.
- Eager update
  - Updates in the left-most and right-most nodes
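One way to read the lazy/eager distinction is as a check on whether an insert changes the key range that the global index already records for a local B+-tree. The sketch below assumes exactly that reading; `PublishedEntry` and `classify_insert` are hypothetical names, not part of the system described in the paper.

```python
from dataclasses import dataclass


@dataclass
class PublishedEntry:
    """Key range [lo, hi] of a local B+-tree as recorded in the global index."""
    lo: int
    hi: int


def classify_insert(entry: PublishedEntry, key: int) -> str:
    """Decide how the global index should learn about an inserted key.

    A key inside the published range leaves the range unchanged, so the
    global entry can be refreshed lazily. A key outside the range lands in
    the left-most or right-most leaf and widens the range, so the global
    entry is updated eagerly.
    """
    if entry.lo <= key <= entry.hi:
        return "lazy"
    entry.lo = min(entry.lo, key)
    entry.hi = max(entry.hi, key)
    return "eager"


entry = PublishedEntry(lo=10, hi=50)
print(classify_insert(entry, 30))   # inside the published range
print(classify_insert(entry, 60))   # widens the published range
```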

Pros and cons
- Pros
  - Supports efficient point queries and range queries on non-rowkey columns
  - Proposes an adaptive indexing strategy based on a cost model of overlay routing
- Cons
  - Cannot support multi-dimensional queries

Multi-dimensional index [X.Zhang CloudDB’09]

Multi-dimensional index [J.Wang SIGMOD’10] [G.Chen VLDB’11]

MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Shoji Nishimura, Sudipto Das. MDM'11

Contributions
- Uses linearization to implement a scalable multi-dimensional index structure layered over a range-partitioned key-value store
- Implements a K-d tree and a Quad tree on top of this design

Ordered Key-Value Stores
[Figure: an index over keys (key00, key11, ..., keynn) pointing to buckets of key-value pairs, sorted by key.]
- Good at 1-D range queries
- But our target is multi-dimensional (longitude, latitude, time)…

Naïve Solution: Linearization
- Projects n-D space to 1-D space
- Apply a Z-ordering curve…
- Simple, but problematic…
Z-order values on a 4 x 4 grid:
 5  7 13 15
 4  6 12 14
 1  3  9 11
 0  2  8 10

Problem: False positive scans
- MD-query on the linearized space
  1. Translate the MD-query into a linearized range query (ex. query from 2 to 9).
  2. Scan the queried linearized range.
  3. Filter out the points that lie outside the queried area (ex. the blue-hatched area, 4 to 7).
- The filtering step requires the boundary information of the original space.
[Figure: the 4 x 4 Z-order grid with the query range 2 to 9 marked.]
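The false-positive problem can be reproduced in a few lines. The sketch below assumes a 2-D grid with 2 bits per dimension and an interleaving that puts the x bit before the y bit, which matches the 4 x 4 grid of Z-order values in the slides; `z_encode`, `z_decode`, and `linearized_scan` are illustrative names only.

```python
from typing import List, Tuple


def z_encode(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y (x bit first) into one Z-order key."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_decode(z: int, bits: int = 2) -> Tuple[int, int]:
    """Split a Z-order key back into its (x, y) coordinates."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def linearized_scan(x_lo: int, x_hi: int, y_lo: int, y_hi: int,
                    bits: int = 2) -> Tuple[List[int], List[int]]:
    """Scan the 1-D key range of a rectangle and split keys into hits and false positives."""
    z_lo = z_encode(x_lo, y_lo, bits)
    z_hi = z_encode(x_hi, y_hi, bits)
    hits, false_positives = [], []
    for z in range(z_lo, z_hi + 1):
        x, y = z_decode(z, bits)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits.append(z)
        else:
            false_positives.append(z)
    return hits, false_positives


# The rectangle whose corners map to keys 2 and 9.
hits, false_positives = linearized_scan(1, 2, 0, 1)
print(hits)             # [2, 3, 8, 9]
print(false_positives)  # [4, 5, 6, 7]
```

Half of the scanned keys fall outside the rectangle, which is the cost the slide refers to.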

Build a Multi-dimensional Index Layer on top of an Ordered Key-Value Store
[Figure: MD-HBase adds a multi-dimensional index on top of the single-dimensional index of an ordered key-value store (ex. BigTable, HBase, …).]

Space Partition by the K-d Tree
Binary Z-ordering space (keys formed by bitwise interleaving; axes labeled 00, 01, 10, 11):
0101 0111 1101 1111
0100 0110 1100 1110
0001 0011 1001 1011
0000 0010 1000 1010
[Figure: the same space partitioned by the K-d tree.]
How do we represent these subspaces?

Key Idea: The longest common prefix naming scheme
- Subspaces are represented as the longest common prefix of their keys (ex. 000*, 1***).
- Remarkable property: the name preserves the boundary information of the original space.
  - Ex. 1***: the left-bottom corner is 1000 (*→0), i.e. (10, 00); the right-top corner is 1111 (*→1), i.e. (11, 11).
[Figure: the 4 x 4 binary Z-order grid with subspaces 000* and 1*** marked.]
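Recovering a subspace's corners from its prefix name takes two substitutions and a de-interleave. The sketch below assumes 2 bits per dimension and x-bit-first interleaving, as in the 1*** example; `z_decode` and `prefix_bounds` are hypothetical helper names.

```python
from typing import Tuple

Point = Tuple[int, int]


def z_decode(z: int, bits: int = 2) -> Point:
    """Split a Z-order key (x bit first) back into (x, y)."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix: str, bits: int = 2) -> Tuple[Point, Point]:
    """Return (left-bottom, right-top) corners of the subspace named by a prefix.

    Replacing every '*' with 0 gives the smallest key in the subspace and
    replacing every '*' with 1 gives the largest one.
    """
    lowest = int(prefix.replace("*", "0"), 2)
    highest = int(prefix.replace("*", "1"), 2)
    return z_decode(lowest, bits), z_decode(highest, bits)


print(prefix_bounds("1***"))  # ((2, 0), (3, 3)), i.e. (10, 00) and (11, 11)
print(prefix_bounds("000*"))  # ((0, 0), (0, 1))
```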

Build an index with the longest common prefix of keys
- Index entries: 000*, 001*, 01**, 1***
- Buckets are allocated per subspace
[Figure: the 4 x 4 binary Z-order grid partitioned into the four subspaces.]

Multi-dimensional Range Query
- Subspaces in the index: 000*, 001*, 01**, 10**, 11**
- Scan 0010-1001 on the index
- Filter (subspace pruning): reconstruct the boundary info of each subspace and check whether it intersects the queried area; 001* and 10** remain
- Scan the buckets of the remaining subspaces
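Subspace pruning can be sketched as a rectangle-intersection test on the corners reconstructed from each prefix. The code assumes 2 bits per dimension, x-bit-first interleaving, and the five subspaces listed on this slide; all function names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def z_decode(z: int, bits: int = 2) -> Point:
    """Split a Z-order key (x bit first) back into (x, y)."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix: str, bits: int = 2) -> Tuple[Point, Point]:
    """Corners of a subspace: '*' -> 0 for left-bottom, '*' -> 1 for right-top."""
    lowest = int(prefix.replace("*", "0"), 2)
    highest = int(prefix.replace("*", "1"), 2)
    return z_decode(lowest, bits), z_decode(highest, bits)


def intersects(lo: Point, hi: Point, q_lo: Point, q_hi: Point) -> bool:
    """True when rectangle [lo, hi] overlaps rectangle [q_lo, q_hi] in both dimensions."""
    return (lo[0] <= q_hi[0] and q_lo[0] <= hi[0]
            and lo[1] <= q_hi[1] and q_lo[1] <= hi[1])


def prune(prefixes: List[str], q_lo: Point, q_hi: Point, bits: int = 2) -> List[str]:
    """Keep only the subspaces whose reconstructed bounds intersect the query."""
    return [p for p in prefixes
            if intersects(*prefix_bounds(p, bits), q_lo, q_hi)]


subspaces = ["000*", "001*", "01**", "10**", "11**"]
# Query rectangle with corners at keys 0010 and 1001, i.e. (1, 0) and (2, 1).
print(prune(subspaces, (1, 0), (2, 1)))  # ['001*', '10**']
```

Only the buckets of the surviving subspaces need to be scanned, so the keys 4 to 7 that a plain linearized scan would read are skipped entirely.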

Variations of Storage Layer
- Table Share Model
  - Use a single table and maintain the bucket boundaries
  - Most space-efficient
- Table per Bucket Model
  - Allocate a table per bucket
  - Most flexible mapping
    - One-to-one, one-to-many, many-to-one
  - Bucket split is expensive
    - All points are copied to the new buckets
- Region per Bucket Model
  - Allocate a region per bucket
  - Most efficient bucket split
  - Requires modification of HBase

Experimental Results: Multi-dimensional Range Query
- Dataset: 400,000,000 points
- Queries: select objects within MD ranges, with varying selectivity
- Cluster size: 16 nodes
- MD-HBase responds 10 to 100 times faster than the others, and its response time is proportional to the selectivity.

Experimental Results: Insert
- Dataset: spatially skewed data
- MD-HBase shows good scalability without significant overhead.

Conclusions
- Designed a scalable multi-dimensional data store.
  - Maps multiple dimensions to a single dimension
  - Key idea: indexing the longest common prefix of keys
- Demonstrated scalable insert throughput and excellent query performance.
  - Range query: 10 to 100 times faster than existing technologies
  - Insert: 220K inserts/sec on a 16-node cluster without overhead

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries. Y. Zou, J. Liu, S. Wang. NPC’10

Introduction
- Motivation
  - Build indexes in DOTs to support multi-dimensional range queries
  - High performance, low space overhead, high reliability
- DOT
  - Distributed Ordered Table
  - BigTable, HBase
- Observations
  - DOTs usually keep 3 to 5 replicas
  - The number of indexes is usually less than 5
  - Random reads are significantly slower than scans

Basic idea: Complemental Clustering Index
- CCIT: converts slow random reads into fast sequential scans
- CCT: for fast data recovery
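The benefit of a clustering index table can be illustrated with a sorted in-memory list. This sketch assumes that a CCIT keeps whole rows ordered by the indexed column, so a range query becomes one contiguous scan instead of one random read per matching rowkey; `build_ccit` and `range_scan` are hypothetical names, and the CCT recovery path is not modeled.

```python
import bisect
from typing import Dict, List, Tuple

Row = Dict[str, object]
Entry = Tuple[object, str, Row]   # (index column value, rowkey, full row)


def build_ccit(table: Dict[str, Row], column: str) -> List[Entry]:
    """Reorganize a rowkey-ordered table into entries ordered by (column value, rowkey)."""
    entries = [(row[column], rowkey, row) for rowkey, row in table.items()]
    return sorted(entries, key=lambda e: (e[0], e[1]))


def range_scan(ccit: List[Entry], lo, hi) -> List[Tuple[str, Row]]:
    """Return rows whose indexed value lies in [lo, hi] as one contiguous slice."""
    values = [e[0] for e in ccit]
    start = bisect.bisect_left(values, lo)
    end = bisect.bisect_right(values, hi)
    return [(rowkey, row) for _, rowkey, row in ccit[start:end]]


product = {
    "p1": {"name": "beer", "price": 6, "number": 10},
    "p2": {"name": "beer", "price": 12, "number": 4},
    "p3": {"name": "beer", "price": 9, "number": 7},
    "p4": {"name": "wine", "price": 8, "number": 3},
}
by_price = build_ccit(product, "price")
matches = range_scan(by_price, 5, 10)
print(sum(row["number"] for _, row in matches if row["name"] == "beer"))
```

Since each entry carries the full row, the remaining predicate on `name` is checked during the scan without going back to the original table.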

Challenges
- Performance
- Reliability
- Space overhead

Performance
- HBase 0.20.1
- 16 nodes
- 90 million records
- Query optimization based on the region-to-server mapping information

Reliability: Fault tolerance
- Get the other index values from the CCTs
- Query the CCITs to recover data
- Replicate the CCTs

Space overhead
- N: the number of index columns
- X-axis: ratio of the record length to the length of the index columns
- Y-axis: overhead ratio

Conclusions
- Proposed CCIndex to support multi-dimensional range queries in DOTs
- Not suitable for more than 5 index columns
- Write operations are slower than on the original table

Conclusions
- Indexes for non-rowkey columns in cloud data management systems
- Solutions
  - Local index + global index
  - Linearization
  - Secondary index
- Key issues
  - Index reliability
  - Query result correctness
  - Index maintenance
  - …

Future work
- Study the architecture of HDFS and HBase in detail
- Test the existing index solutions in the cloud
- Index framework and index structure

References
- M. K. Aguilera, W. Golab, and M. A. Shah. A practical scalable distributed B-tree. PVLDB, 1(1):598–609, 2008.
- Y. Zou, J. Liu, and S. Wang. CCIndex: A complemental clustering index on distributed ordered tables for multi-dimensional range queries. In NPC, 2010.
- S. Wu and K.-L. Wu. An indexing framework for efficient retrieval on the cloud. IEEE Data Eng. Bull., 32:75–82, 2009.
- J. Wang, S. Wu, H. Gao, J. Li, and B. C. Ooi. Indexing multi-dimensional data in a cloud system. In SIGMOD, 2010.
- S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. Efficient B-tree based indexing for cloud data processing. PVLDB, 3(1):1207–1218, 2010.
- X. Zhang, J. Ai, Z. Wang, J. Lu, and X. Meng. An efficient multi-dimensional index for cloud data management. In CloudDB, pp. 17–24, 2009.
- S. Nishimura and S. Das. MD-HBase: A scalable multi-dimensional data infrastructure for location aware services. In MDM, 2011.

Thank you
