Database Systems Laboratory The Pyramid-Technique: Towards Breaking the Curse of Dimensionality Stefan Berchtold, Christian Böhm, and Hans-Peter Kriegel.

Presentation transcript:

Database Systems Laboratory The Pyramid-Technique: Towards Breaking the Curse of Dimensionality Stefan Berchtold, Christian Böhm, and Hans-Peter Kriegel ACM SIGMOD Conference 1998 pages May Sun Woo Kim

Slide 2: Contents
- Introduction
- Analysis of Balanced Splits
- The Pyramid-Technique
- Query Processing
- Analysis of the Pyramid-Technique
- The Extended Pyramid-Technique
- Experimental Evaluation

Slide 3: Introduction
- A variety of new database applications has been developed:
  - Data warehousing: requires a multidimensional view of the data
  - Multimedia: uses some kind of feature vectors
- Consequence: the database system has to support query processing on large amounts of high-dimensional data

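Slide 4: Analysis of Balanced Splits
- Performance degeneration: in high dimensions the data space cannot be split in every dimension
- For a uniformly distributed data set, splitting once in each dimension yields $2^d$ data pages:
  - 1-dimensional data space: $2^1 = 2$ data pages
  - 20-dimensional data space: $2^{20} \approx 1{,}000{,}000$ data pages
- A similar property holds for range queries: a hypercube-shaped query with selectivity $s$ has side length $q = \sqrt[d]{s}$, so $q \approx 0.63$ for $d = 20$ and $s = 0.01\% = 0.0001$:
  - d = 1: q = 0.0001
  - d = 2: q = 0.01
  - d = 3: q ≈ 0.046
  - …
  - d = 20: q ≈ 0.63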
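The two numbers on this slide can be checked with a few lines of C++. This is only a sketch: the function names are made up for illustration, and the side-length relation $q = s^{1/d}$ is assumed because it is the one that reproduces the slide's values for d = 2 and d = 20.

```cpp
#include <cassert>
#include <cmath>

// Number of data pages when each of the d dimensions is split exactly once:
// every split doubles the page count, giving 2^d.
double pages_after_one_split_per_dimension(int d) {
    assert(d >= 1);
    return std::ldexp(1.0, d);  // 1.0 * 2^d, exact for the sizes used here
}

// Side length q of a hypercube-shaped range query that covers a fraction s
// (the selectivity) of the unit data space in d dimensions: q = s^(1/d).
double query_side_length(double s, int d) {
    assert(d >= 1 && s > 0.0 && s <= 1.0);
    return std::pow(s, 1.0 / d);
}
```

With `s = 0.0001`, `query_side_length` grows from 0.01 at d = 2 to roughly 0.63 at d = 20: a query that selects only 0.01% of the data still spans almost two thirds of every dimension.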

Slide 5: Analysis of Balanced Splits
- Expected number of data page accesses (figure)

Slide 6: The Pyramid-Technique
Basic idea:
- Transform each d-dimensional point into a 1-dimensional value
- Store and access the values using a B+-tree
- Store the d-dimensional point together with its 1-dimensional key

Slide 7: Data Space Partitioning
- The Pyramid-Technique splits the data space into 2d pyramids
  - Top: the center point (0.5, 0.5, …, 0.5)
  - Base: a (d-1)-dimensional surface of the data space
- Each pyramid is divided into several partitions
- Figure (2-dimensional example): 4 pyramids meet at the center point (0.5, 0.5); each has a (d-1)-dimensional base and is cut into partitions

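Slide 8: Data Space Partitioning
Definition 1 (Pyramid of a point v): a d-dimensional point $v$ lies in pyramid $p_i$, where
$$j_{max} = \left(j \mid \forall k,\ 0 \le j, k < d,\ j \ne k:\ |0.5 - v_j| \ge |0.5 - v_k|\right)$$
$$i = \begin{cases} j_{max} & \text{if } v_{j_{max}} < 0.5 \\ j_{max} + d & \text{if } v_{j_{max}} \ge 0.5 \end{cases}$$
Examples (d = 2, pyramids p_0 to p_3, dimensions d_0 and d_1):
- v = (0.6, 0.2): |0.5 - v_1| = 0.3 ≥ |0.5 - v_0| = 0.1, so j_max = 1; v_jmax = 0.2 < 0.5, so i = 1 and v is in p_1
- v_2 = (0.8, 0.5): |0.5 - v_0| = 0.3 ≥ |0.5 - v_1| = 0.0, so j_max = 0; v_jmax = 0.8 ≥ 0.5, so i = j_max + d = 0 + 2 = 2 and v_2 is in p_2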
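Definition 1 can be sketched in C++ as follows. The function name `pyramid_of` is hypothetical, and resolving ties in favor of the lowest dimension is an assumption made here, since the slide does not say how ties are broken.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the number i (0 <= i < 2d) of the pyramid that contains point v,
// where v lies in the unit data space [0,1]^d.
std::size_t pyramid_of(const std::vector<double>& v) {
    assert(!v.empty());
    const std::size_t d = v.size();
    // j_max: the dimension in which v is farthest from the center coordinate 0.5.
    // Assumption: on ties the lowest dimension wins.
    std::size_t j_max = 0;
    for (std::size_t j = 1; j < d; ++j) {
        if (std::fabs(0.5 - v[j]) > std::fabs(0.5 - v[j_max])) j_max = j;
    }
    // Below the center: pyramid j_max. At or above the center: pyramid j_max + d.
    return (v[j_max] < 0.5) ? j_max : j_max + d;
}
```

For the two points on the slide, `pyramid_of({0.6, 0.2})` yields pyramid 1 and `pyramid_of({0.8, 0.5})` yields pyramid 2.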

Slide 9: Data Space Partitioning
Definition 2 (Height of a point v): for a point $v$ in pyramid $p_i$, $h_v = |0.5 - v_{i \bmod d}|$
Definition 3 (Pyramid value of a point v): $pv_v = i + h_v$
Example for v = (0.6, 0.2):
- Def. 1: i = 1 (v lies in pyramid p_1)
- Def. 2: h_v = |0.5 - 0.2| = 0.3
- Def. 3: pv_v = 1 + 0.3 = 1.3
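Definitions 2 and 3 reduce to one line each. The sketch below takes the pyramid number i from Definition 1 as an input; the function name `pyramid_value` is an illustrative choice, not something named on the slides.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pyramid value of point v, given the number i of the pyramid that contains it.
double pyramid_value(const std::vector<double>& v, std::size_t i) {
    const std::size_t d = v.size();
    assert(d > 0 && i < 2 * d);
    const double height = std::fabs(0.5 - v[i % d]);  // Definition 2: h_v
    return static_cast<double>(i) + height;           // Definition 3: pv_v = i + h_v
}
```

Because a height never exceeds 0.5, the values of pyramid i all fall into the interval [i, i + 0.5], so the key ranges of different pyramids never overlap.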

Slide 10: Index Creation
Insert:
- Determine the pyramid value pv_v of the point v
- Insert v into the B+-tree using pv_v as the key
- Store the d-dimensional point v together with pv_v in the corresponding data page of the B+-tree
Update and delete can be done analogously.
Resulting data pages of the B+-tree:
- All points on a page belong to the same pyramid
- Each page covers an interval given by the minimum and maximum pyramid value stored in it
Figure: a partition with its min and max values; M = (0.6, 0.2)

Slide 11: Query Processing
Point query:
- Compute the pyramid value pv_q of the query point q
- Query the B+-tree using pv_q
- Obtain the set of points sharing pv_q
- Determine whether this set contains q
Figure example: q = (0.2, 0.3), pv_q = 0.3, M = (0.2, 0.7)
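Index creation and the point query can be sketched together. This is a minimal illustration and not the authors' implementation: `std::multimap` is used as a stand-in for the B+-tree (an assumption, chosen because it is ordered and allows duplicate keys), and `PyramidIndex`, `candidates` and `contains` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

using Point = std::vector<double>;

// Pyramid value of v (Definitions 1 to 3): pyramid number i plus height h_v.
double pyramid_value(const Point& v) {
    assert(!v.empty());
    const std::size_t d = v.size();
    std::size_t j_max = 0;  // dimension farthest from the center coordinate 0.5
    for (std::size_t j = 1; j < d; ++j) {
        if (std::fabs(0.5 - v[j]) > std::fabs(0.5 - v[j_max])) j_max = j;
    }
    const std::size_t i = (v[j_max] < 0.5) ? j_max : j_max + d;
    return static_cast<double>(i) + std::fabs(0.5 - v[j_max]);
}

// Assumption: std::multimap stands in for the B+-tree.
struct PyramidIndex {
    std::multimap<double, Point> tree;  // key: pyramid value, payload: full point

    // Insert: compute the key and store the point together with it.
    void insert(const Point& v) { tree.emplace(pyramid_value(v), v); }

    // Number of stored points that share the pyramid value of q.
    std::size_t candidates(const Point& q) const {
        return tree.count(pyramid_value(q));
    }

    // Point query: fetch all entries with key pv_q, then compare the full points.
    bool contains(const Point& q) const {
        const auto range = tree.equal_range(pyramid_value(q));
        for (auto it = range.first; it != range.second; ++it) {
            if (it->second == q) return true;
        }
        return false;
    }
};
```

The final comparison is needed because different points can share a key: (0.2, 0.7) and (0.2, 0.3) both have pyramid value 0.3, so a lookup for q = (0.2, 0.3) in an index that holds only (0.2, 0.7) returns one candidate that is then rejected.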

Slide 12: Query Processing
Range query:

    Point_Set PyrTree::range_query(range q)
    {
        Point_Set res;
        for (i = 0; i < 2 * d; i++) {                        // visit each of the 2d pyramids
            if (intersect(p[i], q)) {                        // Lemma 1
                determine_range(p[i], q, h_low, h_high);     // Lemma 2
                cs = btree_query(i + h_low, i + h_high);     // candidates by pyramid value
                for (c = cs.first; !cs.end; c = cs.next)     // walk the candidate set
                    if (inside(q, c))                        // drop candidates outside q
                        res.add(c);
            }
        }
        return res;
    }

Figure example: Q = [0.1, 0.6] × [0.1, 0.3]; for pyramid i = 0 the scanned interval is h_low = 0.2, h_high = 0.4.
Points shown in the figure (pyramids i = 0 to i = 3): (0.1, 0.3), (0.5, 0.3), (0.3, 0.2), (0.5, 0.1), (0.2, 0.7), (0.3, 0.4), (0.1, 0.5)

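Slide 13: Query Processing
Lemma 1 (Intersection of a Pyramid and a Rectangle): with the query rectangle shifted so that the center point becomes the origin ($\hat{q}_j = q_j - 0.5$), a pyramid $p_i$ with $i < d$ is intersected by the query if and only if
$$\forall j,\ 0 \le j < d,\ j \ne i:\quad \hat{q}_{i_{min}} \le -MIN(\hat{q}_j)$$
where $MIN(r) = 0$ if $r_{min} \le 0 \le r_{max}$, and $MIN(r) = \min(|r_{min}|, |r_{max}|)$ otherwise.
Example: q = [0.4, 0.9] × [0.1, 0.3] becomes q̂ = [-0.1, 0.4] × [-0.4, -0.2]
- i = 0: q̂_0min = -0.1 and -MIN(q̂_1) = -0.2; since -0.1 > -0.2 the condition is false, so p_0 is not intersected
- i = 1: q̂_1min = -0.4 and -MIN(q̂_0) = 0; since -0.4 < 0 the condition is true, so p_1 is intersected
The lemma is stated only for i < d; pyramids with i ≥ d are handled analogously.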
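A possible C++ rendering of the Lemma 1 test is shown below. The names `Interval`, `Range`, `min_abs` and `intersects` are hypothetical. The slide only covers i < d; handling i ≥ d by mirroring the query through the center point is one reading of "analogously" and should be treated as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Interval { double lo, hi; };   // query extent in one dimension
using Range = std::vector<Interval>;  // query rectangle, one Interval per dimension

// MIN(r): smallest absolute value inside r (0 if r contains 0).
double min_abs(const Interval& r) {
    if (r.lo <= 0.0 && 0.0 <= r.hi) return 0.0;
    return std::min(std::fabs(r.lo), std::fabs(r.hi));
}

// Lemma 1: does query q (given in original [0,1] coordinates) intersect pyramid i?
bool intersects(const Range& q, std::size_t i) {
    const std::size_t d = q.size();
    assert(d > 0 && i < 2 * d);
    // Shift the center point to the origin. Assumption: for i >= d the query is
    // mirrored through the center, which maps pyramid i onto pyramid i - d.
    Range c(d);
    for (std::size_t j = 0; j < d; ++j) {
        const double lo = q[j].lo - 0.5;
        const double hi = q[j].hi - 0.5;
        c[j] = (i < d) ? Interval{lo, hi} : Interval{-hi, -lo};
    }
    const std::size_t k = i % d;
    for (std::size_t j = 0; j < d; ++j) {
        // The condition q_k_min <= -MIN(q_j) must hold for every j != k.
        if (j != k && c[k].lo > -min_abs(c[j])) return false;
    }
    return true;
}
```

For the slide's query [0.4, 0.9] × [0.1, 0.3] this reports no intersection with pyramid 0 and an intersection with pyramid 1, matching the worked example.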

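Slide 14: Query Processing
Lemma 2 (Interval of Intersection of Query and Pyramid): for a pyramid $p_i$ ($i < d$) that is intersected by the query $\hat{q}$ (the query shifted so that the center point is the origin), the heights to scan are $[h_{low}, h_{high}]$, where $MIN(r)$ and $MAX(r)$ denote the smallest and largest absolute value inside interval $r$.
Case 1 (the query contains the center point, i.e. $\hat{q}_{j_{min}} \le 0 \le \hat{q}_{j_{max}}$ for all j):
$$h_{low} = 0, \qquad h_{high} = MAX(\hat{q}_i)$$
Case 2 (otherwise):
$$h_{low} = \min_{0 \le j < d,\ j \ne i} \bar{q}_{j_{min}}, \qquad h_{high} = MAX(\hat{q}_i)$$
$$\bar{q}_{j_{min}} = \begin{cases} \max(MIN(\hat{q}_i), MIN(\hat{q}_j)) & \text{if } MAX(\hat{q}_j) \ge MIN(\hat{q}_i) \\ MIN(\hat{q}_i) & \text{otherwise} \end{cases}$$
Examples from the figures (d = 2):
- Case 1: i = 0: h_low = 0, h_high = 0.3; i = 1: h_low = 0, h_high = 0.1
- Case 2: i = 0: h_low = 0.2, h_high = 0.4; i = 1: h_low = 0.2, h_high = 0.3
The lemma is stated only for i < d.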
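The interval computation of Lemma 2 might look like this in C++. The transcript lost the formulas on this slide, so the sketch assumes the two-case rule as given in the original paper (Case 1: h_low = 0; Case 2: h_low is the minimum over j ≠ i of a per-dimension bound; h_high = MAX(q̂_i) in both cases). All type and function names are hypothetical, and the query is expected to be already shifted so that the center point is the origin.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Interval { double lo, hi; };
using Range = std::vector<Interval>;
struct Heights { double low, high; };

// MIN(r) and MAX(r): smallest and largest absolute value inside interval r.
double min_abs(const Interval& r) {
    if (r.lo <= 0.0 && 0.0 <= r.hi) return 0.0;
    return std::min(std::fabs(r.lo), std::fabs(r.hi));
}
double max_abs(const Interval& r) {
    return std::max(std::fabs(r.lo), std::fabs(r.hi));
}

// Lemma 2 for a pyramid i < d that intersects the shifted query q_hat.
Heights determine_range(const Range& q_hat, std::size_t i) {
    const std::size_t d = q_hat.size();
    assert(i < d);
    const double h_high = max_abs(q_hat[i]);
    // Case 1: the query contains the center point in every dimension.
    bool contains_center = true;
    for (const Interval& r : q_hat) {
        if (!(r.lo <= 0.0 && 0.0 <= r.hi)) contains_center = false;
    }
    if (contains_center) return {0.0, h_high};
    // Case 2: minimum over all other dimensions j of the bound q_bar_j.
    const double min_i = min_abs(q_hat[i]);
    double h_low = min_i;  // kept as-is when d == 1
    bool first = true;
    for (std::size_t j = 0; j < d; ++j) {
        if (j == i) continue;
        const double bar = (max_abs(q_hat[j]) >= min_i)
                               ? std::max(min_i, min_abs(q_hat[j]))
                               : min_i;
        h_low = first ? bar : std::min(h_low, bar);
        first = false;
    }
    return {h_low, h_high};
}
```

The slide does not state which queries its figures use. The shifted queries [-0.3, 0.2] × [-0.1, 0.1] (Case 1) and [-0.4, -0.2] × [-0.3, 0.1] (Case 2) are picked here only because they reproduce the four intervals listed in the examples.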

Slide 15: Analysis of the Pyramid-Technique
- Required number of page accesses for the 2d pyramids
- The analysis does not reveal any performance degeneration

Slide 16: The Extended Pyramid-Technique
- Basic idea
- Conditions
- Figure labels: (0.1, 0.y), (0.x, 0.1), (0.5, 0.5)
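The transcript lost the formulas of this slide. As a hedged illustration only: the original paper, as far as I recall it, moves the median mp_i of the data in each dimension to the center with the mapping t_i(x) = x^(-1/log2(mp_i)), which satisfies t(0) = 0, t(1) = 1 and t(mp_i) = 0.5. The sketch below assumes that mapping; the function name `to_centered` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed per-dimension transformation of the extended technique:
// maps coordinate x in [0,1] so that the data median lands on 0.5.
double to_centered(double x, double median) {
    assert(0.0 <= x && x <= 1.0);
    assert(0.0 < median && median < 1.0);
    // log2(median) is negative, so the exponent is positive and the mapping is
    // monotonically increasing with t(0) = 0 and t(1) = 1.
    return std::pow(x, -1.0 / std::log2(median));
}
```

With a median of 0.1, the coordinate 0.1 is mapped to 0.5, which matches the figure labels where points with a coordinate of 0.1 are related to the center (0.5, 0.5). A median of exactly 0.5 leaves every coordinate unchanged.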

Slide 17: The Extended Pyramid-Technique
Performance:
- Speed-up: about 10 to 40%
- The loss of performance is not too high compared to the high speed-up factors over other index structures

Slide 18: Experimental Evaluation Using Synthetic Data
- Performance behavior over database size (16 dimensions)
- Performance behavior over data space dimension (1,000,000 objects)

Slide 19: Experimental Evaluation Using Real Data Sets
- Query processing on text data (16 dimensions)
- Query processing on warehousing data (13 attributes)

Slide 20: Summary
- For almost hypercube-shaped queries:
  - Outperforms any competitive technique, including the linear scan
  - Holds even for skewed, clustered and categorical data
- For queries having a bad selectivity:
  - Outperforms competitive index structures
  - However, a linear scan is faster (figure values: Pyramid-Technique 2.6, sequential scan 2.48)