Spatial Indexing and Visualizing Large Multi-dimensional Databases I. Csabai, M. Trencséni, L. Dobos, G. Herczegh, P. Józsa, N. Purger Eötvös University,

Presentation transcript:

Spatial Indexing and Visualizing Large Multi-dimensional Databases I. Csabai, M. Trencséni, L. Dobos, G. Herczegh, P. Józsa, N. Purger Eötvös University, Budapest T.Budavári, A. Szalay Johns Hopkins University, Baltimore

Telegraph Message
FROM: Natural Scientists
TO: DB Community
URGENT! We have a lot of data, and are still collecting … STOP
The data is complex … STOP
We want to do complex stuff with it … STOP
We want to interactively visualize it … STOP
Files are not good enough for us … STOP
Current DBMS are not designed for us … STOP
Please help! … SOS!

Doing Science with Elephants
E = mc²

The data
- 5 years of Sloan Digital Sky Survey data (120 Mpixel camera)
- Public archive: SkyServer (SQL Server; A. Szalay, J. Gray)
- Large: 3 TB, 270M objects
- Multi-dimensional: 300 parameters/object
- Indexes exist only for key values (1D) and sky coordinates (2D)
- Spatial …
- Upcoming surveys (Pan-STARRS, 1.4 Gpixel camera) will produce the same amount of data in 1 week

The magnitude space (u, g, r, i, z)
270 million points in 5+ dimensions
- Multidimensional point data
- Highly non-uniform distribution
- Outliers

The questions astronomers ask

SkyServer log; a query from the 12 million:

petroMag_i > 17.5 and (petroMag_r > 15.5 or petroR50_r > 2)
and (petroMag_r > 0 and g > 0 and r > 0 and i > 0)
and ( (petroMag_r-extinction_r) -0.2)
and ( (petroMag_r - extinction_r * LOG10(2 * * petroR50_r * petroR50_r)) < 24.2) )
or ( (petroMag_r - extinction_r < 19.5)
and ( (dered_r - dered_i - (dered_g - dered_r)/ ) > ( * (dered_g - dered_r)) )
and ( (dered_g - dered_r) > ( * (dered_r - dered_i)) ) )
and ( (petroMag_r - extinction_r * LOG10(2 * * petroR50_r * petroR50_r) ) < 23.3 ) )

- Star/galaxy separation, quasar target selection: combination of inequalities → multi-dimensional polyhedron query
- Drop outliers, search for rare objects → point density estimation
- Find similar galaxies → k-nearest neighbor search
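The "combination of inequalities" on this slide can be read geometrically: when every condition is linear in the magnitudes, the selected region is a convex polyhedron, and a point passes if it satisfies each half-space test. The following is a minimal Python sketch of that idea only. The `HalfSpace` class and the three colour cuts are hypothetical and are not taken from the SkyServer query; real queries also contain `or` branches (unions of polyhedra) and non-linear terms such as `LOG10`, which this sketch does not cover.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HalfSpace:
    """One linear inequality: sum(a[i] * x[i]) <= b."""
    a: Sequence[float]
    b: float

    def contains(self, x: Sequence[float]) -> bool:
        return sum(ai * xi for ai, xi in zip(self.a, x)) <= self.b


def in_polyhedron(x, half_spaces):
    """A point lies in a convex polyhedron iff it satisfies every inequality."""
    return all(h.contains(x) for h in half_spaces)


# Hypothetical cuts on (u, g, r, i, z) magnitudes, for illustration only
# (strict and non-strict inequalities are not distinguished here):
#   r < 19.5      ->   r       <= 19.5
#   g - r > 0.4   ->  -g + r   <= -0.4
#   r - i < 0.5   ->   r - i   <= 0.5
CUTS = [
    HalfSpace((0, 0, 1, 0, 0), 19.5),
    HalfSpace((0, -1, 1, 0, 0), -0.4),
    HalfSpace((0, 0, 1, -1, 0), 0.5),
]

points = [
    (20.1, 18.9, 18.2, 17.9, 17.7),  # satisfies all three cuts
    (20.1, 18.4, 18.2, 17.9, 17.7),  # g - r = 0.2, fails the second cut
]
selected = [p for p in points if in_polyhedron(p, CUTS)]
```

Writing each cut as a row of coefficients keeps the query in a form that an index can test against whole cells rather than single points.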

The goal

TRADITIONAL APPROACH: flat files, Fortran, C code
+ Complex manipulation of data
- Sequential, slow access

SQL DATABASES: Oracle, MS SQL Server, PostgreSQL …
+ Organized, efficient data access
- Hard to implement complex algorithms
- Multi-dimensional support (OLAP) is limited to categorical data

MULTI-DIMENSIONAL INDEXING: B-tree, R-tree, k-d tree, BSP-tree …
+ Many for low D, some for higher D
+ Fast, tuned for various problems
- Implemented mostly as in-memory algorithms, may be suboptimal in databases

VISUALIZATION: tools using OpenGL, DirectX
+ Fast
- Work from files; some tools access a database, but not interactively

INTEGRATE: use for astronomical data mining and for fast interactive visualization

Implemented indexing techniques
- MS SQL Server 2005, .NET, C#: CLR support, run complex procedural code inside the RDBMS
- Quad-tree (32-tree)
  - Build (SQL, 1h)
  - Range search, k nearest neighbor, visualization support (SQL)
  - Large query time variation in 5D with non-uniform data
- Balanced k-d tree
  - Build: T-SQL (12h)
  - Range search, k nearest neighbor (C#)
  - Local polynomial regression (C#)
- Voronoi tessellation
  - Limited number of random seeds (build: points 1h, insertion: 270M points 12h)
  - Density estimation, NN-search
  - C# wrapper for Qhull

Usage: Geometric queries
- First run the query against the index
- Classify each cell as fully covered, fully outside, or intersected
- Run detailed SQL only on intersected cells (fully covered cells are returned whole, fully outside cells are skipped)
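The three-way cell classification can be sketched as follows, assuming axis-aligned box cells and a convex polyhedron given as `(a, b)` pairs meaning `sum(a[i] * x[i]) <= b`. All names are hypothetical and the sketch is in Python rather than the SQL/C# used in the talk. The "intersected" verdict is conservative: a box may be reported as intersected even when it misses the polyhedron, which costs only an unnecessary detailed check.

```python
from itertools import product

INSIDE, OUTSIDE, PARTIAL = "inside", "outside", "intersected"


def satisfies(x, half_spaces):
    return all(sum(ai * xi for ai, xi in zip(a, x)) <= b for a, b in half_spaces)


def classify_box(lo, hi, half_spaces):
    """Classify the axis-aligned cell [lo, hi] against a convex polyhedron."""
    corners = list(product(*zip(lo, hi)))
    all_inside = True
    for a, b in half_spaces:
        ok = [sum(ai * ci for ai, ci in zip(a, c)) <= b for c in corners]
        if not any(ok):
            # Every corner violates this inequality, so the whole cell does.
            return OUTSIDE
        if not all(ok):
            all_inside = False
    return INSIDE if all_inside else PARTIAL


def run_query(cells, half_spaces):
    """cells is a list of (lo, hi, points); returns the matching points."""
    result = []
    for lo, hi, pts in cells:
        kind = classify_box(lo, hi, half_spaces)
        if kind == INSIDE:
            result.extend(pts)  # no per-point test needed
        elif kind == PARTIAL:
            result.extend(p for p in pts if satisfies(p, half_spaces))
    return result


# 2D illustration: the triangle x >= 0, y >= 0, x + y <= 3.
TRIANGLE = [((1, 1), 3), ((-1, 0), 0), ((0, -1), 0)]
cells = [
    ((0, 0), (1, 1), [(0.5, 0.5)]),
    ((1, 1), (2, 2), [(1.2, 1.2), (1.9, 1.9)]),
    ((3, 3), (4, 4), [(3.5, 3.5)]),
]
hits = run_query(cells, TRIANGLE)
```

Only the middle cell needs the per-point test here; the first is accepted and the last rejected from the corner tests alone.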

Usage: Non-parametric estimation
- Template fitting
- Nearest neighbor + polynomial fit

    foreach (Galaxy g in UnknownSet) {
        // reference galaxies closest to g in color space
        var neighbors = NearestNeighbors(g, ReferenceSet);
        // local fit of redshift as a polynomial of the colors
        var polynomCoeffs = FitPolynomial(neighbors.Colors, neighbors.Redshift);
        g.Redshift = Estimate(g.Colors, polynomCoeffs);
    }

- SDSS can measure redshift for 1M galaxies (reference set), but not for the remaining 269M (unknown set)
- k-d tree based nearest neighbor search
- Polynomial regression implemented in C#, runs as CLR code in SQL Server
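The nearest-neighbor plus polynomial-fit loop can be tried out end to end with a small Python sketch. Several things here are assumptions made for brevity: the neighbor search is brute force rather than k-d tree based, the polynomial is first degree, the normal equations are solved with a hand-written Gaussian elimination instead of LAPACK, and the reference set is synthetic. The function names are hypothetical.

```python
import heapq


def nearest_neighbors(x, reference, k):
    """Brute-force k-NN; reference is a list of (colors, redshift) pairs."""
    def dist2(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], x))
    return heapq.nsmallest(k, reference, key=dist2)


def solve(A, y):
    """Solve A c = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (M[i][n] - tail) / M[i][i]
    return coeffs


def fit_linear(neighbors):
    """Least-squares fit z ~ c0 + c1*x1 + ... via the normal equations."""
    X = [[1.0] + list(colors) for colors, _ in neighbors]
    z = [redshift for _, redshift in neighbors]
    n = len(X[0])
    AtA = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    Atz = [sum(row[i] * zi for row, zi in zip(X, z)) for i in range(n)]
    return solve(AtA, Atz)


def estimate_redshift(x, reference, k=8):
    coeffs = fit_linear(nearest_neighbors(x, reference, k))
    return coeffs[0] + sum(c * xi for c, xi in zip(coeffs[1:], x))


# Synthetic reference set on a grid, with an exactly linear relation
# z = 0.1 + 0.2*c1 + 0.05*c2 (illustrative values, not SDSS data).
reference = [((c1 / 4, c2 / 4), 0.1 + 0.2 * (c1 / 4) + 0.05 * (c2 / 4))
             for c1 in range(5) for c2 in range(5)]
z = estimate_redshift((0.4, 0.6), reference)
```

Because the synthetic relation is exactly linear, the local fit recovers it up to rounding; with real colors the fit is only a local approximation, and the neighbors must not be degenerate (for example collinear) or the normal equations become singular.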

Usage: Search for similar spectra
- PCA: AMD-optimized LAPACK routines called from SQL Server
- Dimension reduced from 3000 to 5
- k-d tree based nearest neighbor search
- Matching with simulated spectra, for which all the physical parameters are known, would give estimates of the age, chemical composition, etc. of galaxies
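To make the dimension-reduction step concrete, here is a small Python sketch of PCA. It is an assumption-laden stand-in: the talk calls LAPACK routines, whereas this sketch uses power iteration with deflation, which is slower and less robust but fits in a few lines. The function names are hypothetical, and the code assumes the data has at least `n_components` directions of non-zero variance.

```python
import math
import random


def top_components(rows, n_components, iterations=200, seed=0):
    """Return (mean, components): the leading principal directions of rows."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iterations):
            # Deflation: remove the directions that were already found.
            for c in components:
                dot = sum(a * b for a, b in zip(v, c))
                v = [a - dot * b for a, b in zip(v, c)]
            # Apply X^T X to v without forming the d x d covariance matrix.
            proj = [sum(a * b for a, b in zip(r, v)) for r in centered]
            v = [sum(p * r[j] for p, r in zip(proj, centered)) for j in range(d)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        components.append(v)
    return mean, components


def project(row, mean, components):
    """Coordinates of one row in the reduced space."""
    centered = [a - m for a, m in zip(row, mean)]
    return [sum(a * b for a, b in zip(centered, c)) for c in components]


# Toy "spectra" that vary along a single direction (1, 2, -1).
spectra = [(t, 2 * t, -t) for t in range(-5, 6)]
mean, comps = top_components(spectra, n_components=1)
reduced = [project(s, mean, comps) for s in spectra]
```

The reduced vectors are what a nearest-neighbor index would then be built on. The sign of each component is arbitrary, so only distances in the reduced space are meaningful.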

Adaptive Visualizer
- Uses managed DirectX
- Visualizes more data than fits into memory
- Towards graphical SQL: mouse actions are converted to queries and passed to SQL Server
- LOD, zoom in and out; 270M points
- Voronoi, k-d tree visualization
- Brush select, click-connect to SkyServer
- Select nearest neighbors
- Multi-resolution density maps
- Multidim: quickly change axes
- Interact with other Virtual Observatory data

Visualizer Demo

The Tools
- MS SQL Server 2005
  - OODB vs. RDBMS
  - SDSS SkyServer uses SQL Server
  - SQL Server 2005 CLR support: run complex procedural code inside the DB
  - Drawback: no support for vector data
- C# + native SQL
- VS 2005, rapid prototyping
- Managed DirectX
- Web Services support for Virtual Observatories

Why is magnitude space interesting?
[Diagram: GALAXY (elliptical, spiral) → LIGHT]
- Spectrum: 1M objects, 3000-dimensional point data; PCA reduces it to 3-10 dimensions
- Broadband filters → magnitude space: 270M objects, 5-dimensional point data
- Both carry information on REDSHIFT and PHYSICAL PARAMETERS (age, dust, chemical composition)

Spatial indexing
- Similar to SkyServer HTM indexing … but in 5 dimensions

Quad-trees
- 32-tree in 5D
- No need to store the structure
- Number of nodes grows exponentially with the number of dimensions
- Breaks down in high dimensions or if the data is highly non-uniformly distributed
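One way to see why the structure need not be stored: in a regular 2^d-way subdivision, the cell that contains a point follows from its coordinates alone, one bit per axis per level (5 bits, hence 32 children, in 5D). The Python sketch below illustrates this under the assumption of a fixed bounding box; the function names and the integer key layout are hypothetical and not taken from the talk.

```python
def cell_path(point, lo, hi, depth):
    """Child indices (0 .. 2**d - 1) from the root down to `depth`."""
    lo, hi = list(lo), list(hi)
    path = []
    for _ in range(depth):
        child = 0
        for axis, x in enumerate(point):
            mid = (lo[axis] + hi[axis]) / 2
            if x >= mid:
                child |= 1 << axis  # upper half along this axis
                lo[axis] = mid
            else:
                hi[axis] = mid
        path.append(child)
    return path


def cell_key(path, dims):
    """Pack a path into one integer: a leading 1 bit, then dims bits per level."""
    key = 1
    for child in path:
        key = (key << dims) | child
    return key


# A 5D point in the unit hypercube, located two levels deep.
point = (0.1, 0.9, 0.5, 0.3, 0.7)
path = cell_path(point, (0,) * 5, (1,) * 5, depth=2)
key = cell_key(path, dims=5)
parent_key = key >> 5  # dropping the last 5 bits gives the parent cell
```

A key of this kind could be kept in an ordinary indexed integer column, so that all points of a cell share a common key prefix.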

K-d trees
- Only one cut in each level
- Store bounding boxes
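Both points of this slide, one cut per level and stored bounding boxes, can be shown in a short in-memory Python sketch. It assumes median splits that cycle through the axes and a small leaf size; the talk's version is built in T-SQL and searched from C#, so every name here is hypothetical. The stored boxes are what allows the search to skip subtrees that cannot hold a closer point.

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    lo: tuple                       # bounding box of all points below this node
    hi: tuple
    points: Optional[list] = None   # set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points, depth=0, leaf_size=4):
    dims = len(points[0])
    lo = tuple(min(p[a] for p in points) for a in range(dims))
    hi = tuple(max(p[a] for p in points) for a in range(dims))
    if len(points) <= leaf_size:
        return Node(lo, hi, points=list(points))
    axis = depth % dims             # one cut per level, cycling through the axes
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2         # median split keeps the tree balanced
    return Node(lo, hi,
                left=build(ordered[:mid], depth + 1, leaf_size),
                right=build(ordered[mid:], depth + 1, leaf_size))


def box_dist2(q, lo, hi):
    """Squared distance from q to the nearest point of the box [lo, hi]."""
    return sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi))


def knn(root, q, k):
    """Return the k points nearest to q, closest first."""
    best = []                       # max-heap via negated squared distances

    def visit(node):
        if len(best) == k and box_dist2(q, node.lo, node.hi) >= -best[0][0]:
            return                  # this box cannot contain a closer point
        if node.points is not None:
            for p in node.points:
                d2 = sum((a - b) ** 2 for a, b in zip(p, q))
                if len(best) < k:
                    heapq.heappush(best, (-d2, p))
                elif d2 < -best[0][0]:
                    heapq.heapreplace(best, (-d2, p))
            return
        for child in sorted((node.left, node.right),
                            key=lambda n: box_dist2(q, n.lo, n.hi)):
            visit(child)

    visit(root)
    return [p for _, p in sorted(best, reverse=True)]
```

A typical call is `knn(build(points), query, k)`, where `points` is a list of equal-length tuples. Visiting the nearer child first tightens the distance bound early, so more of the far subtree is pruned.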

Voronoi tessellation
- Each point of a cell is closer to the cell's seed than to any other seed
- The solution space for NN search
- More spherical cells; 50 neighbors, 1000 vertices
- Density estimation, clustering
- Complex code, computationally intensive in higher dimensions
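Since a Voronoi cell is defined by its nearest seed, deciding which cell a point belongs to does not require the cell geometry. The Python sketch below uses that to count points per cell for a limited number of random seeds. It is only a rough proxy for density estimation: a proper density also needs the cell volumes, which require the tessellation itself (the talk uses Qhull for that) and are omitted here. The names and the synthetic data are hypothetical.

```python
import random
from collections import Counter


def nearest_seed(p, seeds):
    """Index of the seed whose Voronoi cell contains p."""
    return min(range(len(seeds)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, seeds[i])))


def cell_counts(points, seeds):
    """Number of points that fall into each seed's Voronoi cell."""
    return Counter(nearest_seed(p, seeds) for p in points)


# Synthetic 2D data; the seeds are a random sample of the points themselves.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
seeds = rng.sample(points, 20)
counts = cell_counts(points, seeds)
```

Sampling the seeds from the data places more, and therefore smaller, cells where the points are dense, which suits a highly non-uniform distribution.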

Complex code in SQL/CLR
- Spectrum Services: composite, continuum and line fit, convolving filters and spectra, dereddening
- Non-parametric estimation
  - Find k-nearest neighbors
  - Polynomial fit (AMD-optimized LAPACK)
- DR5: photometric redshift
- Garching DR4: 'photometric' D_n(4000), Hδ_A, age, mass