January 23, 2016 María Nieto-Santisteban – AISRP 2003 / Pittsburgh
High-Speed Access for an NVO Data Grid Node

Presentation transcript:

High-Speed Access for an NVO Data Grid Node
María A. Nieto-Santisteban, Aniruddha R. Thakar, Alex Szalay, Tanu Malik, The Johns Hopkins University
Jim Gray, Microsoft Research
Sloan Digital Sky Survey

Sloan Digital Sky Survey
- Digital map in 5 spectral bands covering ¼ of the sky.
- Will obtain 40 TB of raw pixel data.
- Photometric catalog with more than 200 million objects.
- Spectra of 1 million objects.
- Data Release One (DR1): 6 TB of images, 200 k spectra.

The SkyServer Database
- Processed data is stored in a relational database, SkyServer.
- Allows fast exploration and analysis of the data (data mining).
- DR1 database (Best + Target): ~1 TB; 6 TB for the final release.
- Heavily indexed to speed up access.
- Short queries can run interactively.
- Long queries (> 1 hour) require a custom Batch Query System.
- DBMS: Microsoft SQL Server 2000.

SkyServer Database Schema
[Figure labels: Meta, Spectro, Photo]

SkyServer and the NVO

Partitioning and Parallelization Goals
Speed up query execution:
- Queries looking at different parts of the sky are distributed among servers.
- Queries covering wide areas are executed in parallel by different servers. (Sequential scans fall naturally in this range.)
- Neighborhood queries are "isolated" and processed in parallel (gravitational lenses and galaxy clusters).
Speed up cross-match requests from other NVO data nodes.

Partitioning Facts
- Partitioning works well if tables in the database are naturally divisible into similar partitions where most of the rows accessed by any SQL statement can be placed on the same member server.
- Partitioning is most effective if the tables in the database can be partitioned symmetrically. (Not exactly our case.)
- Related data should be placed on the same member server so most SQL statements routed to a member will require minimum data from other servers.
- Data should be partitioned uniformly across the member servers.
Source: "Designing Partitions", Microsoft SQL Server Books Online.

Partitioning Strategy: Two-Step Process
1. Distribute data homogeneously among servers.
   - Each server has roughly the same number of objects.
   - Objects inside a server are spatially related.
   - Balances the workload among servers.
   - Queries are redirected to the server holding the data.
2. (Re)define zones inside each server dynamically.
   - Zones are defined according to some search radius to solve specific problems: finding galaxy clusters, gravitational lenses, etc.
   - Facilitates cross-match queries from other NVO data nodes.

Mapping the Sphere into Zones
- Each Zone is a declination stripe of height h.
- In principle, h can be any number. In practice, 30 arcsec (DR1 => 8000 zones).
- South-pole zone = Zone 0.
- Each object belongs to one Zone: ZoneID = floor((dec + 90) / h)
- Each server holds N "contiguous" Zones. N is determined by the number of objects that each Zone contains and the number of servers in the cluster.
  - Not all servers contain the same number of zones.
  - Not all servers cover the same declination range.
- Straightforward mapping between queries and servers.
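The zone formula above can be sketched in a few lines of Python. This is only an illustration: the function name `zone_id` and the choice to pass the zone height in arcseconds are assumptions, not part of the SkyServer code.

```python
import math


def zone_id(dec_deg: float, zone_height_arcsec: float = 30.0) -> int:
    """Return the zone number for a declination given in degrees.

    Implements ZoneID = floor((dec + 90) / h), with h supplied in
    arcseconds (an assumption made here so that the 30-arcsec default
    divides without rounding surprises).
    """
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("declination must be within [-90, 90] degrees")
    # Shift so the south pole maps to 0, convert to arcseconds, divide by h.
    return math.floor((dec_deg + 90.0) * 3600.0 / zone_height_arcsec)


if __name__ == "__main__":
    print(zone_id(-90.0))  # south pole falls in Zone 0
    print(zone_id(0.0))    # celestial equator
```

Because the zone number is a pure function of declination, a query's declination range translates directly into a range of zone numbers, which is what makes the query-to-server mapping straightforward.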

Cone Searches using Zones
ConeSearch (ra, dec, r), where (ra, dec) is the cone center and r the search radius:
- Need to search only the zones between $\mathrm{minZone} = \lfloor (dec + 90 - r) / h \rfloor$ and $\mathrm{maxZone} = \lceil (dec + 90 + r) / h \rceil$.
- Restrict the search on dec to $dec_{obj} \in [\,dec - r,\ dec + r\,]$.
- Restrict the search on ra to $ra_{obj} \in [\,ra - r/(\cos|dec| + \epsilon),\ ra + r/(\cos|dec| + \epsilon)\,]$, with $\epsilon = 10^{-6}$.
- Filter on distance: $\sqrt{(cx - x)^2 + (cy - y)^2 + (cz - z)^2} < r$, where (cx, cy, cz) is the unit vector of the cone center and (x, y, z) that of the object.
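The filtering steps of the cone search can be sketched as follows. This is a minimal in-memory illustration, not the SkyServer implementation: the function names, the tuple layout of `objects`, and the conversion of the chord length to an angle in degrees before comparing with r are all assumptions made here. Wrap-around at ra = 0/360 is not handled.

```python
import math

EPSILON = 1e-6  # keeps the ra window finite near the poles


def unit_vector(ra_deg, dec_deg):
    """Cartesian unit vector (x, y, z) for a sky position in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def cone_search(objects, ra, dec, r, h):
    """Return ids of objects within r degrees of (ra, dec).

    objects: iterable of (obj_id, ra_deg, dec_deg); h: zone height in degrees.
    """
    min_zone = math.floor((dec + 90.0 - r) / h)
    max_zone = math.ceil((dec + 90.0 + r) / h)
    ra_window = r / (math.cos(math.radians(abs(dec))) + EPSILON)
    cx, cy, cz = unit_vector(ra, dec)
    hits = []
    for obj_id, o_ra, o_dec in objects:
        zone = math.floor((o_dec + 90.0) / h)
        if not min_zone <= zone <= max_zone:      # zone restriction
            continue
        if not dec - r <= o_dec <= dec + r:       # quick filter on dec
            continue
        if not ra - ra_window <= o_ra <= ra + ra_window:  # quick filter on ra
            continue
        x, y, z = unit_vector(o_ra, o_dec)
        chord = math.sqrt((cx - x) ** 2 + (cy - y) ** 2 + (cz - z) ** 2)
        # Assumption: convert the chord to an angle so that the comparison
        # with r (in degrees) uses consistent units.
        angle = math.degrees(2.0 * math.asin(min(1.0, chord / 2.0)))
        if angle < r:                             # careful filter on distance
            hits.append(obj_id)
    return hits
```

In a database the zone, dec, and ra restrictions would be index range scans; the loop here only mirrors the order in which the filters narrow the candidate set.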

Margins & Buffers
To improve queries around RA = 0 (or 360):
- Duplicating objects inside RA = [-1, 0) and RA = (360, 361] facilitates searches.
- Objects in the margin area are marked as Margin.
To guarantee that neighborhood searches can be fully satisfied inside a single server, some zones are replicated:
- Each server adds 2 extra buffer regions of height R_M.
- R_M is the maximum neighboring distance we assume (1 degree).
- Objects in buffers are "marked" as Visitors to the server.
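The margin duplication around RA = 0/360 can be illustrated with a short Python sketch. The dictionary layout, the `margin` flag name, and the function name `add_margins` are hypothetical; the sketch also copies an object sitting exactly at ra = 0 to ra = 360, a small simplification of the open interval given above.

```python
def add_margins(objects, margin_deg=1.0):
    """Return the objects plus wrapped copies near ra = 0 / 360.

    objects: list of dicts with keys 'obj_id', 'ra' (in [0, 360)) and 'dec'.
    Originals get margin=False; wrapped copies get margin=True.
    """
    result = [dict(obj, margin=False) for obj in objects]
    for obj in objects:
        if obj["ra"] >= 360.0 - margin_deg:
            # Just below ra = 360: add a copy on the negative side, [-1, 0).
            result.append(dict(obj, ra=obj["ra"] - 360.0, margin=True))
        if obj["ra"] <= margin_deg:
            # Just above ra = 0: add a copy past 360, up to 361.
            result.append(dict(obj, ra=obj["ra"] + 360.0, margin=True))
    return result
```

With the copies in place, a search window such as ra in [359.8, 360.3] is a single contiguous range and needs no special wrap-around logic.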

Zoning Performance
[Plot: time vs. search radius for cone searches]
- Zone-based searches are 7x faster than using external calls to the HTM functions.
- Any small zone height is adequate; h of 4 arcminutes is near-optimal.

Partitioning Process
Generate partitions (n_servers, n_buffers):
1. Calculate the number of objects included in each Zone, $N_z$.
2. Compute the accumulated distribution of objects, A, for each Zone: $A_{Z_i} = \sum_{j=1}^{i} N_{z_j}$.
3. Assign 100/n_servers % of the objects to each server.
4. Add the buffer zones to each server.
5. Add margin objects.
Output:
- ServerZones (ServerId, ZoneID, objID, ra, dec, x, y, z, wrap, native, …). Indexed by ZoneID and objID for fast access!
- Servers (ServerID, nObj, minZoneID, maxZoneID, overlapMinZoneID, overlapMaxZoneID, minDEC, maxDEC)
Transfer data from the main server to the cluster members.
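Steps 1 to 3 of the partitioning process can be sketched as below. This is one possible way to turn the accumulated distribution into contiguous zone ranges, written for illustration only; the function name and return shape are assumptions, and buffer zones and margin objects (steps 4 and 5) are left out.

```python
def assign_zones_to_servers(zone_counts, n_servers):
    """Split zones into contiguous ranges holding similar object counts.

    zone_counts: number of objects per zone, indexed by ZoneID.
    Returns {server_id: (min_zone_id, max_zone_id)}.
    """
    total = sum(zone_counts)
    if n_servers < 1 or total == 0:
        raise ValueError("need at least one server and one object")
    ranges = {}
    accumulated = 0  # objects in all zones before the current one
    for zone, count in enumerate(zone_counts):
        # A zone goes to the server whose share of the accumulated
        # distribution contains the zone's first object.
        server = min(n_servers - 1, accumulated * n_servers // total)
        first_zone, _ = ranges.get(server, (zone, zone))
        ranges[server] = (first_zone, zone)
        accumulated += count
    return ranges
```

Since whole zones are assigned, servers end up with different numbers of zones and different declination ranges whenever the object density varies across the sky.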

Data Transfer: Main Server - Nodes
1. Replicate the database schema on each server to maintain relationships between tables.
2. Member servers pull data from the main server using the ServerZones table.
   - Can be done in parallel.
   - Easier for the transaction manager.
3. Asymmetric partitioning:
   - Replicate most of the tables on each server. This is easier and faster.
   - Partition PhotoObjAll and SpecObjAll … maybe others.
4. Rebuild indexes.

Routing Rules Definition
- Determine where to send a query.
- Need a parser to capture region requests like POINT, CIRCLE, REC, POLY, etc. (Reuse the parser from SkyQuery.)
- Once we have a declination (or x, y, z): server = SELECT ServerID FROM Servers WHERE ( obj.dec BETWEEN minDEC AND maxDEC )
- Queries without positional constraints mean full table scans and have to be sent to all nodes to be processed in parallel.
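The routing rule can be mimicked in Python over an in-memory copy of the Servers table. The tuple layout and the function name `route_query` are hypothetical; the inclusive comparison follows the BETWEEN in the query above.

```python
def route_query(servers, dec=None):
    """Return the ids of the servers that should receive a query.

    servers: list of (server_id, min_dec, max_dec) rows.
    dec: declination of the request in degrees, or None when the query
         has no positional constraint.
    """
    if dec is None:
        # No position: a full scan, sent to every node to run in parallel.
        return [server_id for server_id, _, _ in servers]
    # BETWEEN is inclusive on both ends, so a declination on a shared
    # boundary matches both neighboring servers.
    return [server_id for server_id, lo, hi in servers if lo <= dec <= hi]
```

For example, with three servers covering [-90, -10], [-10, 25], and [25, 90], a request at dec = 0 goes only to the second server, while a query with no position goes to all three.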

Building Neighborhoods
- Computing neighborhoods is computationally intensive.
- Nested loop where, for each object, all neighbors inside some radius are computed.
For each zone offset in {-1, 0, 1} (the offset and tolerance symbols did not survive in this transcript; @offset, @alpha, and @r stand in for them):
    insert neighbors                                         -- insert one zone's neighbors
    select o1.objID as objID,                                -- object pairs
           o2.objID as NeighborObjID
    from zone o1 join zone o2                                -- join 2 zones
      on o1.zoneID + @offset = o2.zoneID                     -- using zone number and ra
     and o2.ra between o1.ra - @alpha and o1.ra + @alpha     -- points near ra
    where                                                    -- elided margin logic
      and o2.dec between o1.dec - @r and o1.dec + @r         -- quick filter on dec
      and sqrt(power(o1.x - o2.x, 2) + power(o1.y - o2.y, 2) + power(o1.z - o2.z, 2)) < @r   -- careful filter on distance
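The same zone join can be sketched in memory with Python. This is a simplified illustration under stated assumptions: the zone height is set equal to the radius, the quick ra and dec filters and the margin logic are omitted for brevity (only the zone join and the distance test remain), and the function names are hypothetical.

```python
import math
from collections import defaultdict


def unit_vector(ra_deg, dec_deg):
    """Cartesian unit vector (x, y, z) for a sky position in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def build_neighbors(objects, r):
    """Return sorted (obj_id, neighbor_id) pairs closer than r degrees.

    objects: list of (obj_id, ra_deg, dec_deg). Zone height equals r, so
    every neighbor lies in the same zone or an adjacent one.
    """
    zones = defaultdict(list)
    for obj_id, ra, dec in objects:
        zones[math.floor((dec + 90.0) / r)].append((obj_id, unit_vector(ra, dec)))

    pairs = []
    for zone, members in zones.items():
        for offset in (-1, 0, 1):              # southern, same, northern zone
            others = zones.get(zone + offset, [])
            for id1, v1 in members:
                for id2, v2 in others:
                    if id1 == id2:
                        continue
                    chord = math.dist(v1, v2)
                    # Assumption: compare as an angle in degrees, not a chord.
                    angle = math.degrees(2.0 * math.asin(min(1.0, chord / 2.0)))
                    if angle < r:
                        pairs.append((id1, id2))
    return sorted(pairs)
```

Each pair appears in both directions because the three offsets are applied symmetrically; the nested loops stand in for what an indexed join on zone number and ra would do in the database.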

Building Neighborhood Performance
Results for Personal SkyServer (154k rows):
- Build Zone table: s
- Join to Zone -1: s, generated 128,469 rows
- Join to Zone 0: s, generated 389,157 rows
- Join to Zone 1: s, generated 126,104 rows
- Add mirror rows: s, Total = 1,287,460 rows
- Create the index: s
- Total time: s
For DR1, computing the neighbor table (30") took 2 days instead of 2 weeks.
The overall improvement has been 32x faster than using external calls to the HTM functions.
Can be done in parallel!

Neighborhoods: Best Performance
Building neighborhoods performs best when the zone height is equal to the radius of the neighborhood.
- Zones shorter than the radius imply joins with two or more northern zones and two or more southern zones.
- Tall zones require many more pairs, and the work rises quadratically.

Neighborhoods: Best Performance
Building neighborhoods performs best when the zone height is equal to the radius of the neighborhood.
- The center zone then requires just a join with the northern and southern zones: a box of 3r x 2r.

Zones and Cross-Match
[Figure labels: 2MASS, SDSS, GALEX]
- Applying a zoning approach to other surveys makes JOINs faster.

Work to do
- Do the actual partitioning of SkyServer. Test it and measure performance. So far we have "played" with MySkyServer, a 1.3 GB subset of DR1.
- Test with the galaxy-cluster-finding algorithm to compare against the current grid approach.
- Connect our databases to do the high computational processing on the GRID.

After all... Why High-Speed Access?
- To allow interactive exploration and visualization of the data and make new discoveries!
- Any time left for the Image Cutout demo? It takes 5 minutes.