Published by Alisa Hazle. Modified about 1 year ago.

1
Index for Cloud Data Management
Lab of Web and Mobile Data Management (WAMDM)
Youzhong Ma

2
Outline
- Motivating Applications
- Existing Technologies
- Conclusions & Future Work

3
Motivating Application
- Big data in a private cloud system; table: Product
- Example query (prices in dollars): select sum(number) from Product where product.name = 'beer' and product.price <= 10 and product.price >= 5
- Queries over multiple attributes and non-rowkey columns are quite common!

4
Motivating Application: Mobile Coupon Distribution
[Figure labels: Mobile Coupon Distributor; Current Location; Coupon; Distribution Policy (Area, # of coupons)]

5
Motivating Application: Mobile Coupon Distribution
[Figure labels: Current Location; Coupon; Distribution Policy (Area, # of coupons)]
- 125,000,000 subscribers in Japan
- Large amounts of data
- High throughput
- System scalability
- Multi-dimensional query
- Nearest neighbors query
- Efficient complex queries

6
Outline
- Motivating Applications
- Existing Technologies
- Conclusions & Future Work

7
Existing Technologies
[Figure: technologies compared on multi-dimensional queries and scalability]
- Relational DBs, Spatial DBs
- Commercial products (but expensive); open source products
- Key-Value Stores
- What we want, at a reasonable price

8
Solutions Overview
- Rowkey, single-dimensional index: [BigTable, HBase] (point query, range query)
- Non-rowkey, single-dimensional index: [Aguilera PVLDB’08], [S. Wu Data Eng’09], [S. Wu PVLDB’10]
- Multi-dimensional index: [X. Zhang CloudDB’09], [J. Wang SIGMOD’10], [G. Chen VLDB’11], [Y. Zou NPC’10], [Shoji Nishimura MDM’11]
- Other labels on the slide: Local Index + Global Index; NEC; CAS

9
Efficient B-tree Based Indexing for Cloud Data Processing
S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. PVLDB'10

10
Efficient B-tree Based Indexing for Cloud Data Processing
Motivation
- Design a scalable, high-throughput indexing scheme that supports efficient queries over huge volumes of data in the cloud
- Keep maintenance cost low while still supporting parallel search

11
System Architecture
- ① Local index
- ② BATON overlay network
- ③ Publish

12
Challenges
- How to select the local B+-tree nodes to publish in the global index?
- How to organize the global index?
- How to maximize the throughput?

13
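Selecting local B+-tree nodes
Cost modeling
- Query cost
  1. Routing cost (search in the global index): [formula not preserved]
  2. Local search cost (search in the local index): [formula not preserved]
- Update cost: [formula not preserved]
- Symbols defined on the slide (not preserved): cost of sending an index message; cost of a random I/O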

14
Adaptive indexing strategy
- Index expand
- Index collapse
[Figure: local index]

15
BATON: Balanced Tree Overlay Network
- A distributed tree structure for P2P systems
- Supports range search

16
Index Construction
- Assign a range to each node
- For each node n:
  - The range of its left sub-tree is less than that of n
  - The range of its right sub-tree is larger than that of n
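The range assignment described above can be illustrated with a minimal sketch. It assumes a complete binary tree held in a single process and hands out contiguous half-open key ranges in in-order sequence, which yields the left-smaller, right-larger property. The names `Node`, `build`, `assign_ranges`, and `find` are hypothetical, and routing from the root is a simplification; this is not BATON's actual protocol.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    key_range: Optional[Tuple[int, int]] = None  # half-open [lo, hi)


def build(height: int) -> Optional[Node]:
    """Build a complete binary tree of the given height (hypothetical helper)."""
    if height == 0:
        return None
    return Node(build(height - 1), build(height - 1))


def in_order(node: Optional[Node]) -> List[Node]:
    """Return the nodes in left, node, right order."""
    if node is None:
        return []
    return in_order(node.left) + [node] + in_order(node.right)


def assign_ranges(root: Node, lo: int, hi: int) -> None:
    """Split [lo, hi) into contiguous slices and assign them in in-order sequence."""
    nodes = in_order(root)
    width = (hi - lo) / len(nodes)
    for i, node in enumerate(nodes):
        node.key_range = (lo + round(i * width), lo + round((i + 1) * width))


def find(root: Node, key: int) -> Node:
    """Walk down from the root to the node whose range contains the key."""
    node = root
    while node is not None:
        lo, hi = node.key_range
        if key < lo:
            node = node.left      # everything in the left sub-tree is smaller
        elif key >= hi:
            node = node.right     # everything in the right sub-tree is larger
        else:
            return node
    raise KeyError(key)


root = build(3)                   # 7 nodes
assign_ranges(root, 0, 70)        # each node gets a slice of width 10
```

With seven nodes and the key space [0, 70), the root ends up in the middle of the in-order sequence, so its sub-trees cover the smaller and larger keys respectively.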

17
Publish local B+-tree nodes to BATON

18
Maximizing the throughput
- Eventual consistency model
- Lazy update: if an update does not affect the key range of a local B+-tree, the stale index does not affect the correctness of query processing
- Eager update: used for updates in the left-most and right-most nodes, which can change the key range
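The lazy versus eager decision can be sketched as follows. This is a simplified illustration, not the paper's implementation: the hypothetical `LocalTree` class stands in for a local B+-tree by keeping only a sorted key list, and the "published range" stands in for what the global index knows about it. An insert that leaves the minimum and maximum keys unchanged is deferred; one that changes them is pushed immediately.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LocalTree:
    """Stand-in for a local B+-tree (assumption: only its sorted keys matter here)."""
    keys: List[int]
    published_range: Tuple[int, int] = field(init=False)
    pending_lazy: int = 0   # updates whose global-index refresh is deferred
    eager_pushes: int = 0   # updates pushed to the global index right away

    def __post_init__(self) -> None:
        self.keys.sort()
        self.published_range = (self.keys[0], self.keys[-1])

    def insert(self, key: int) -> str:
        """Insert a key and report which update path was taken."""
        bisect.insort(self.keys, key)
        new_range = (self.keys[0], self.keys[-1])
        if new_range != self.published_range:
            # The key landed at the left-most or right-most end:
            # the published range is now wrong, so refresh it eagerly.
            self.published_range = new_range
            self.eager_pushes += 1
            return "eager"
        # The range is unchanged, so a stale global entry still routes correctly.
        self.pending_lazy += 1
        return "lazy"


tree = LocalTree([10, 20, 30])
first = tree.insert(25)   # inside the published range
second = tree.insert(5)   # extends the range on the left
```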

19
Pros and cons
- Pros
  - Supports efficient point queries and range queries on non-rowkey attributes
  - Proposes an adaptive indexing strategy based on a cost model of overlay routing
- Cons
  - Cannot support multi-dimensional queries

20
Multi-dimensional index [X.Zhang CloudDB’09]

21
Multi-dimensional index [J.Wang SIGMOD’10] [G.Chen VLDB’11]

22
MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services
Shoji Nishimura, Sudipto Das. MDM'11

23
Contributions
- Uses linearization to implement a scalable multi-dimensional index structure layered over a range-partitioned key-value store
- Implements a K-d tree and a Quad tree using this design

24
Ordered Key-Value Stores
[Figure: an index (key00, key11, ..., keynn) over buckets; each bucket holds key-value pairs such as key00/value00, key01/value01, ..., key0X/value0X]
- Sorted by key
- Good at 1-D range queries
- But our target is multi-dimensional (longitude, latitude, time)...

25
Naïve Solution: Linearization
[Figure: the same index and buckets as on the previous slide]
- Projects n-D space to 1-D space
- Apply a Z-ordering curve...
- Simple, but problematic...

26
Problem: False positive scans
MD-query on linearized space:
1. Translate the MD-query to a linearized range query. Ex.: query from 2 to 9.
2. Scan the queried linearized range.
3. Filter out points outside the queried area. Ex.: blue-hatched area (4 to 7).
Filtering requires the boundary information of the original space.
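The false-positive problem can be reproduced with a small sketch. It assumes a 4 x 4 grid with 2 bits per dimension and a Z-value built by interleaving bits with the x bit first; the axis order and the query rectangle (x in [1, 2], y in [0, 1]) are assumptions chosen so that the linearized range is 2 to 9, matching the example above. The function names are hypothetical.

```python
from typing import List, Tuple


def z_encode(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y (x bit first) into one Z-value."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_decode(z: int, bits: int = 2) -> Tuple[int, int]:
    """Invert z_encode: split a Z-value back into (x, y)."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def range_query(x_lo: int, x_hi: int, y_lo: int, y_hi: int,
                bits: int = 2) -> Tuple[Tuple[int, int], List[int], List[int]]:
    """Translate a rectangle to a Z-range, scan it, and filter by the rectangle."""
    z_lo = z_encode(x_lo, y_lo, bits)   # the lower-left corner has the smallest Z-value
    z_hi = z_encode(x_hi, y_hi, bits)   # the upper-right corner has the largest
    hits: List[int] = []
    false_positives: List[int] = []
    for z in range(z_lo, z_hi + 1):     # scan the whole linearized range
        x, y = z_decode(z, bits)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits.append(z)
        else:
            false_positives.append(z)   # scanned, but outside the queried area
    return (z_lo, z_hi), hits, false_positives


z_range, hits, false_positives = range_query(1, 2, 0, 1)
```

Under these assumptions the scan covers Z-values 2 to 9, of which only 2, 3, 8, and 9 lie inside the rectangle; 4 to 7 are read and then discarded.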

27
MD-HBase
- Build a multi-dimensional index layer on top of an ordered key-value store
[Figure layers: Multi-Dimensional Index; Single Dimensional Index; Ordered Key-Value Store (e.g., BigTable, HBase, ...)]

28
Space Partition by the K-d Tree
[Figure: binary Z-ordering space (bitwise interleaving) next to the space partitioned by the K-d tree]
- How do we represent these subspaces?

29
Key Idea: The longest common prefix naming scheme
- Subspaces are represented as the longest common prefix of keys!
- Remarkable property: preserves the boundary information of the original space
- Example: subspace 1***
  - Left-bottom corner (* → 0): (10, 00)
  - Right-top corner (* → 1): (11, 11)
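The boundary reconstruction for a prefix can be sketched in a few lines. The sketch assumes 4-bit keys formed by interleaving two 2-bit coordinates with the first coordinate's bit first, which is consistent with the 1*** example above; `z_decode` and `prefix_bounds` are hypothetical names.

```python
from typing import Tuple

Point = Tuple[int, int]


def z_decode(z: int, bits: int) -> Point:
    """Split an interleaved key back into its two coordinates."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix: str) -> Tuple[Point, Point]:
    """Return (left-bottom corner, right-top corner) of the subspace named by a prefix."""
    bits = len(prefix) // 2                      # bits per dimension
    lowest = int(prefix.replace("*", "0"), 2)    # * -> 0 gives the smallest key
    highest = int(prefix.replace("*", "1"), 2)   # * -> 1 gives the largest key
    return z_decode(lowest, bits), z_decode(highest, bits)


left_bottom, right_top = prefix_bounds("1***")
```

For 1*** this gives (2, 0) and (3, 3), that is (10, 00) and (11, 11) in binary.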

30
Build an index with the longest common prefix of keys
[Figure: index over the subspaces 000*, 001*, 01**, 1***; one bucket is allocated per subspace]

31
Multi-dimensional Range Query
1. Scan on the index
2. Filter (subspace pruning): reconstruct the boundary information of each subspace and check whether it intersects the queried area
3. Scan the remaining buckets
[Figure: index entries 000*, 001*, 01**, 10**, 11**]
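The scan, filter, scan pipeline can be sketched as below. The sketch assumes an in-memory dictionary that maps each prefix to its bucket of points, 4-bit interleaved keys (first coordinate's bit first), and a made-up query rectangle and data set; none of these come from the slides, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]
Box = Tuple[Point, Point]   # (left-bottom corner, right-top corner), inclusive


def z_decode(z: int, bits: int) -> Point:
    """Split an interleaved key back into its two coordinates."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix: str) -> Box:
    """Reconstruct a subspace's boundary from its longest-common-prefix name."""
    bits = len(prefix) // 2
    return (z_decode(int(prefix.replace("*", "0"), 2), bits),
            z_decode(int(prefix.replace("*", "1"), 2), bits))


def intersects(a: Box, b: Box) -> bool:
    """Axis-aligned rectangle overlap test on inclusive bounds."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def md_range_query(buckets: Dict[str, List[Point]],
                   query: Box) -> Tuple[List[str], List[Point]]:
    """Prune subspaces via the index, then scan only the surviving buckets."""
    (qx0, qy0), (qx1, qy1) = query
    scanned: List[str] = []
    results: List[Point] = []
    for prefix in sorted(buckets):                 # scan on the index
        if not intersects(prefix_bounds(prefix), query):
            continue                               # subspace pruned, bucket never read
        scanned.append(prefix)
        for x, y in buckets[prefix]:               # scan the bucket and filter points
            if qx0 <= x <= qx1 and qy0 <= y <= qy1:
                results.append((x, y))
    return scanned, results


# Illustrative data (an assumption): a few points per subspace.
buckets = {
    "000*": [(0, 0), (0, 1)],
    "001*": [(1, 0), (1, 1)],
    "01**": [(0, 2), (1, 3)],
    "10**": [(2, 0), (3, 1)],
    "11**": [(2, 2), (3, 3)],
}
scanned, results = md_range_query(buckets, ((1, 0), (2, 1)))
```

In this example only two of the five buckets are read, because the other three subspaces do not overlap the queried rectangle.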

32
Variations of Storage Layer
- Table Share Model
  - Uses a single table; maintains bucket boundaries
  - Most space efficient
- Table per Bucket Model
  - Allocates a table per bucket
  - Most flexible mapping: one-to-one, one-to-many, many-to-one
  - Bucket split is expensive: all points are copied to the new buckets
- Region per Bucket Model
  - Allocates a region per bucket
  - Most efficient bucket split
  - Requires modification of HBase

33
Experimental Results: Multi-dimensional Range Query
- Dataset: 400,000,000 points
- Queries: select objects within MD ranges, with varying selectivity
- Cluster size: 16 nodes
- MD-HBase responds 10~100 times faster than the others, and its response time is proportional to selectivity.

34
Experimental Results: Insert
- Dataset: spatially skewed data
- MD-HBase shows good scalability without significant overhead.

35
Conclusions
- Designed a scalable multi-dimensional data store
  - Maps multiple dimensions to a single dimension
  - Key idea: indexing the longest common prefix of keys
- Demonstrated scalable insert throughput and excellent query performance
  - Range query: 10~100 times faster than existing technologies (per the experimental results)
  - Insert: 220K inserts/sec on a 16-node cluster without overhead

36
CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries
Y. Zou, J. Liu, S. Wang. NPC’10

37
Introduction
- Motivation
  - Build indexes in DOTs to support multi-dimensional range queries
  - High performance, low space overhead, high reliability
- DOT: Distributed Ordered Table
  - e.g., BigTable, HBase
- Observations
  - DOTs usually keep 3 to 5 replicas
  - The number of indexes is usually less than 5
  - Random read is significantly slower than scan

38
Basic idea: Complemental Clustering Index
- CCIT: converts slow random reads into fast sequential scans
- CCT: for fast data recovery
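Why a clustering index turns random reads into a scan can be shown with a small sketch. It assumes that the index table stores full rows ordered by (index value, rowkey), so a range query on the indexed column reads one contiguous run of entries and never looks rows up in the base table. The in-memory structures, the sample Product rows, and the function names are all hypothetical; this is not CCIndex's actual table layout.

```python
import bisect
from typing import Dict, List, Tuple

Row = Dict[str, int]
IndexEntry = Tuple[Tuple[int, str], Row]   # ((index value, rowkey), full row)


def build_clustered_index(table: Dict[str, Row], column: str) -> List[IndexEntry]:
    """Copy every row into a list ordered by (column value, rowkey)."""
    entries = [((row[column], rowkey), row) for rowkey, row in table.items()]
    return sorted(entries, key=lambda entry: entry[0])


def range_scan(index: List[IndexEntry], lo: int, hi: int) -> List[Tuple[str, Row]]:
    """Return rows with lo <= value <= hi using one sequential pass over the index."""
    keys = [key for key, _ in index]
    start = bisect.bisect_left(keys, (lo, ""))   # first entry with value >= lo
    out: List[Tuple[str, Row]] = []
    for (value, rowkey), row in index[start:]:
        if value > hi:
            break                                 # past the range: stop scanning
        out.append((rowkey, row))                 # the row is already here, no base-table read
    return out


# Illustrative base table keyed by rowkey (made-up data).
product = {
    "r1": {"price": 7, "number": 3},
    "r2": {"price": 12, "number": 1},
    "r3": {"price": 5, "number": 4},
    "r4": {"price": 10, "number": 2},
}
price_index = build_clustered_index(product, "price")
matches = range_scan(price_index, 5, 10)
total = sum(row["number"] for _, row in matches)
```

The trade-off is the one the deck raises later: every indexed column stores another copy of the rows, which costs space and slows writes.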

39
Challenges
- Performance
- Reliability
- Space overhead

40
Performance
- Setup: HBase, 16 nodes, 90 million records
- Query optimization based on the region-to-server mapping information

41
Reliability: Fault tolerance
- Get other index values from CCTs
- Query the CCITs to recover data
- Replicate CCTs

42
Space overhead
[Figure: overhead ratio plot]
- N: the number of index columns
- X-axis: ratio of record length to index column length
- Y-axis: overhead ratio

43
Conclusions
- Proposed CCIndex to support multi-dimensional range queries in DOTs
- Not suitable for more than 5 index columns
- Write operations are slower than on the original table

44
Outline
- Motivating Applications
- Existing Technologies
- Conclusions & Future Work

45
Conclusions
- Index for non-rowkey attributes in cloud data management systems
- Solutions
  - Local index + global index
  - Linearization
  - Secondary index
- Key issues
  - Index reliability
  - Query result correctness
  - Index maintenance
  - …

46
Future work
- Study the architecture of HDFS and HBase in detail
- Test the existing index solutions in the cloud
- Index framework and index structure

47
References
- M. K. Aguilera, W. Golab, and M. A. Shah. A practical scalable distributed B-tree. PVLDB, 1(1):598–609, 2008.
- Y. Zou, J. Liu, and S. Wang. CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries. NPC, 2010.
- S. Wu and K.-L. Wu. An indexing framework for efficient retrieval on the cloud. IEEE Data Eng. Bull., vol. 32, pp. 75–82, 2009.
- J. Wang, S. Wu, H. Gao, J. Li, and B. C. Ooi. Indexing multi-dimensional data in a cloud system. SIGMOD, 2010.
- S. Wu, D. Jiang, B. C. Ooi, and K.-L. Wu. Efficient B-tree based indexing for cloud data processing. PVLDB, 3(1):1207–1218, 2010.
- X. Zhang, J. Ai, Z. Wang, J. Lu, and X. Meng. An efficient multidimensional index for cloud data management. CloudDB, 2009, pp. 17–24.
- Shoji Nishimura, Sudipto Das. MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. MDM, 2011.

48
Thank you
