G. Fekete, JHU Efficient search indices for geospatial data in a relational database Gyorgy (George) Fekete Dept. Physics and Astronomy Johns Hopkins University.


1 G. Fekete, JHU Efficient search indices for geospatial data in a relational database Gyorgy (George) Fekete Dept. Physics and Astronomy Johns Hopkins University

2 G. Fekete, JHU Acknowledgements Alex Szalay –NVO, SDSS, iVDGL,... Jim Gray –Databases, SQL Server Ani Thakar, Tamas Budavari –SDSS pipeline, Geometric libraries

3 G. Fekete, JHU Motivation Growth of volume of data –terabytes per day Increasing importance of databases in managing science data Data mining: potential for new discoveries Cross matching between multiple surveys Much of this data is distributed on a sphere –astronomy and earth science –great interest in a universal, computer-friendly index on the sphere

4 G. Fekete, JHU Astronomy Data “old days” –astronomers took photos. Since the 1960s –they began to digitize. New instruments are digital (100s of GB/night) Detectors are following Moore’s law. Data avalanche: doubles every 2 years

5 G. Fekete, JHU Astronomy Data Astronomers have a few Petabytes now. Data volume and ownership –doubles every 2 years. –Data is public after 2 years. –So, 50% of the data is public. –Some have private access to 5% more data. But... –How do I get at that 50% of the data?

6 G. Fekete, JHU New Astronomy Data “Avalanche” –the flood of Terabytes of data –present techniques of handling these data do not scale well with data volume Systematic data exploration –will have a central role –statistical analysis of the “typical” objects –automated search for the “rare” events Digital archives of the sky –will be the main access to data

7 G. Fekete, JHU Data Intensive Science Data avalanche in astronomy and other sciences –old file-based solutions do not cut it –old data silos don’t work –old programming models don’t work We have some new tricks! Astronomy and Earth-Science –methods presented here deal with the topology and the geometry of the sphere

8 G. Fekete, JHU One Of These Tricks: Map regions of the sphere to unique identifiers that can be used as references to those areas –elementary spherical geometry to identify a region –multi-resolution –compactly describe areas at arbitrary granularity

9 G. Fekete, JHU Support Spatial Searches Typical queries –What is near this point? –What objects are in this area? –What areas overlap this area?

10 G. Fekete, JHU Design Considerations Has to –work with a relational database –represent areas of interest precisely –be scalable –be coordinate system neutral –maintain consistency with the topology of the sphere Approach: –precise mathematical description of regions –methods for covering a region with an optimal set of discrete descriptors (trixels) –covermap of trixels used for accelerated query

11 G. Fekete, JHU Components Region descriptions (continuous part) –region, convex, halfspace –API and a text language to describe –XML for inter-service, inter-application object transfer Hierarchical Triangular Mesh (discrete part) –trixels –covermaps Database –extend the DB server engine with spatial access methods –implementing coarse filtering with table valued functions

12 G. Fekete, JHU Continuous Part: A Region Region –is the union of convexes Convex –is intersection of halfspaces Halfspace –simple search cone –circle

13 G. Fekete, JHU Examples of Convexes Disk, Circle, Search cone,... Spherical Polygon –yes, it is actually a convex (adj.) convex (n.) Band Lat/Lon (or Ra/Dec) rectangle anything else...

14 G. Fekete, JHU Halfspace Cutting plane makes two halfspaces Oriented plane makes one well defined halfspace

15 G. Fekete, JHU Halfspace Completely defined by (directed) plane normal and distance along the normal D = cos (cone halfangle) h = (x, y, z, D)
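The halfspace definition above can be sketched in a few lines of Python. This is only an illustration: the function name `halfspace_from_cone` is hypothetical, and the latitude/longitude to unit-vector convention (z axis through the north pole) is an assumption that the slide does not state.

```python
import math

def halfspace_from_cone(lat_deg, lon_deg, half_angle_deg):
    """Return h = (x, y, z, D) for a search cone centered at (lat, lon).

    Assumed convention (not from the slides): z points to the north pole,
    x points to lat = 0, lon = 0.
    """
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Unit normal of the cutting plane, pointing at the cone center.
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    # Distance of the plane from the origin along the normal.
    d = math.cos(math.radians(half_angle_deg))
    return (x, y, z, d)

# A cone of half-angle 90 degrees around the north pole is the northern hemisphere.
north = halfspace_from_cone(90.0, 0.0, 90.0)
# A cone of half-angle 0 shrinks to a single point, so D = 1.
point_cone = halfspace_from_cone(39.3, -76.6, 0.0)
```

A smaller half-angle pushes D toward 1 and the plane toward the surface of the sphere; D = 0 gives a great circle.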

16 G. Fekete, JHU Point Inclusion In Region
Halfspace: h = (x, y, z, D)
P is inside the halfspace: P · (x, y, z) > D
Q is outside the halfspace: Q · (x, y, z) < D
A point is inside a convex if and only if it is inside all of its halfspaces.
A point is inside a region if and only if it is inside at least one of its convexes.
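The inclusion rules on this slide translate almost directly into code. The sketch below assumes a region is stored as a list of convexes and a convex as a list of `(x, y, z, D)` tuples; that layout and the function names are illustrative choices, not the library's API.

```python
def in_halfspace(p, h):
    """True if unit vector p satisfies p . (x, y, z) > D."""
    x, y, z, d = h
    return p[0] * x + p[1] * y + p[2] * z > d

def in_convex(p, convex):
    """A convex is an intersection: p must be inside every halfspace."""
    return all(in_halfspace(p, h) for h in convex)

def in_region(p, region):
    """A region is a union: p must be inside at least one convex."""
    return any(in_convex(p, convex) for convex in region)

# Two polar caps (cones of half-angle 60 degrees, so D = 0.5) as one region.
north_cap = [(0.0, 0.0, 1.0, 0.5)]
south_cap = [(0.0, 0.0, -1.0, 0.5)]
caps = [north_cap, south_cap]

north_pole = (0.0, 0.0, 1.0)
on_equator = (1.0, 0.0, 0.0)
```

The strict `>` follows this slide; the SQL later in the talk uses `>= 0`, so how boundary points are treated is a choice to make consistently.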

17 G. Fekete, JHU Band: Two Halfspaces

18 G. Fekete, JHU Rectangle: Four Halfspaces
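As a rough illustration of how a latitude/longitude rectangle decomposes into four halfspaces, the sketch below builds two latitude caps (which on their own form a band) and two great-circle halfspaces for the longitude limits. The function names, the coordinate convention (z through the north pole) and the restriction to rectangles narrower than 180 degrees in longitude are assumptions of this sketch.

```python
import math

def unit_vector(lat_deg, lon_deg):
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def latlon_rect(lat_min, lat_max, lon_min, lon_max):
    """Four halfspaces (x, y, z, D) whose intersection is the rectangle.

    Only valid when the longitude span is less than 180 degrees.
    """
    lo, hi = math.radians(lon_min), math.radians(lon_max)
    return [
        (0.0, 0.0, 1.0, math.sin(math.radians(lat_min))),    # north of lat_min
        (0.0, 0.0, -1.0, -math.sin(math.radians(lat_max))),  # south of lat_max
        (-math.sin(lo), math.cos(lo), 0.0, 0.0),             # east of lon_min
        (math.sin(hi), -math.cos(hi), 0.0, 0.0),             # west of lon_max
    ]

def inside(p, convex):
    # Inside a convex means inside every one of its halfspaces.
    return all(p[0] * x + p[1] * y + p[2] * z > d for x, y, z, d in convex)

# Latitude 39 to 42, longitude -125 to -120.
rect = latlon_rect(39.0, 42.0, -125.0, -120.0)
```

The first two entries alone describe a band; the last two have D = 0 because meridians lie on great circles.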

19 G. Fekete, JHU Disconnected Components Intersecting halfspaces can produce multiple connected components Anything you can think of can be expressed as a union of convexes

20 G. Fekete, JHU Discrete Part: The HT Mesh

21 G. Fekete, JHU Triangle Subdivision Scheme Each trixel can be named, e.g. S123222102 HTMId: depth-limited trixels are represented as 64-bit integers

22 G. Fekete, JHU HTMId Coherence
level 3: 1023
level 4: 4092 - 4095
level 5: 16368 - 16383
level 20: 17575006175232 - 17592186044415
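The coherence property can be checked with a short sketch. It assumes the commonly described HTM encoding, in which the leading letter contributes the bits 11 (N) or 10 (S) and every following digit appends two bits. The slides do not spell this encoding out, although it reproduces the numbers shown above; the function names are illustrative.

```python
def htm_id(name):
    """Convert a trixel name such as 'N3333' to its integer id (assumed encoding)."""
    if len(name) < 2 or name[0] not in "NS":
        raise ValueError("expected N or S followed by digits 0-3")
    value = 3 if name[0] == "N" else 2
    for ch in name[1:]:
        if ch not in "0123":
            raise ValueError("trixel digits must be 0-3")
        value = (value << 2) | int(ch)  # each level appends two bits
    return value

def htm_level(htm):
    """Level 0 ids occupy 4 bits; each deeper level adds 2 bits."""
    return (htm.bit_length() - 4) // 2

def descendant_range(htm, depth):
    """Inclusive id range of all descendants of a trixel at the given depth."""
    shift = 2 * (depth - htm_level(htm))
    if shift < 0:
        raise ValueError("depth is shallower than the trixel's own level")
    return (htm << shift, ((htm + 1) << shift) - 1)

root = htm_id("N3333")  # a level 3 trixel
level4 = descendant_range(root, 4)
level20 = descendant_range(root, 20)
```

Because all descendants of a trixel share its bit prefix, they occupy one contiguous interval of integers, which is what lets an ordinary B-tree index serve spatial range lookups.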

23 G. Fekete, JHU Covermap Of Circle A covermap is a set of trixels that covers a region

24 G. Fekete, JHU Covermap Of California
15277198671872 - 15278272413695
15298673508352 - 15300820991999
15301089427456 - 15302968475647
......
15384572854272 - 15384841289727
44 trixels, but only 13 ranges
Use covermaps and HtmIDs to coarse filter...
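One plausible way for 44 trixels to collapse into 13 ranges is to merge id intervals that touch or overlap. The sketch below shows that merging step only; it is not the library's covermap algorithm, and the sample intervals are the four level 4 children of trixel 1023 plus one unrelated interval chosen for illustration.

```python
def coalesce(ranges):
    """Merge inclusive (start, end) id ranges that overlap or are adjacent."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Touches or overlaps the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

# Four sibling trixels, each as a one-id range, plus a separate interval.
trixels = [(4094, 4094), (4092, 4092), (4095, 4095), (4093, 4093), (5000, 5003)]
ranges = coalesce(trixels)
```

Fewer ranges means fewer index seeks in the database join, which is why the range count matters more than the trixel count.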

25 G. Fekete, JHU Database Part
1. From table of objects, consider only those whose key values are in the covermap
2. Of those that passed, perform calculation to complete query
3. Return result in table

26 G. Fekete, JHU Coarse and Fine Filtering In Queries Coarse filter: use covermaps. Fine filter: use precise calculations.
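The two-stage filter can be mimicked outside a database with a sorted list standing in for the index. This is a minimal sketch under stated assumptions: `objects` is a list of `(htm_id, payload)` pairs sorted by id, the covermap ranges are inclusive and do not overlap, and `exact_test` is whatever precise geometric check the query needs. All names and the toy data are hypothetical.

```python
import bisect

def coarse_filter(sorted_keys, covermap):
    """Return positions of keys that fall inside any covermap range."""
    hits = []
    for start, end in covermap:
        lo = bisect.bisect_left(sorted_keys, start)   # first key >= start
        hi = bisect.bisect_right(sorted_keys, end)    # one past last key <= end
        hits.extend(range(lo, hi))
    return hits

def search(objects, covermap, exact_test):
    """Coarse filter by key range, then fine filter with the exact test."""
    keys = [key for key, _ in objects]
    candidates = [objects[i] for i in coarse_filter(keys, covermap)]
    return [obj for obj in candidates if exact_test(obj)]

# Toy data: ids and payloads are invented for illustration.
objects = [(10, "a"), (12, "b"), (20, "c"), (31, "d")]
covermap = [(11, 20), (30, 40)]
result = search(objects, covermap, lambda obj: obj[1] != "c")
```

Only the candidates that survive the range lookup reach `exact_test`, so the expensive computation runs on a small subset of the table.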

27 G. Fekete, JHU Usage of Tables and Index Keys Create a function that generates keys that cluster related data together –if objects A and B are nearby, then the keys for A and B should also be nearby in the index space –HtmID Create a table-valued function that returns –list of key ranges (the covermap) containing all the pertinent values –covermap

28 G. Fekete, JHU Caveats You cannot always get every key to be near all its neighbors –keys are sorted in one dimension –relatives are near in two-dimensional space But we can come close –The ratio of false positives to correct answers is a measure of how well you are doing.

29 G. Fekete, JHU USGS Dataset Experiment 18,000 stream gauges 23,000 places

30 G. Fekete, JHU Sample Covermap
select * from fHtmCoverCircleLatLon(39.3, -76.6, 100)
HtmIDStart      HtmIDEnd
--------------  --------------
14023336656896  14024141963263
14024410398720  14025215705087
14025484140544  14027363188735

31 G. Fekete, JHU Places Within 100 Miles Of Baltimore
select ObjID
from SpatialIndex
join fHtmCoverCircleLatLon(39.3, -76.6, 100)
  on HtmID between HtmIDStart and HtmIDEnd
where Type = 'P'
  and dbo.fDistanceLatLon(39.3, -76.6, Lat, Lon) < 100
go
Number of rows in cover join (coarse filter): 2223
Number of rows that are within 100 n. miles (after the fine filter): 1122
Number of places in DB: 22993
Time with covermap: 35
Time without covermap: 100

32 G. Fekete, JHU California As A Region
declare @californiaRegion varchar(max)
set @californiaRegion = 'REGION '
  + 'rect latlon 39 -125 '     -- northwest corner
  + '42 -120 '                 -- center of Lake Tahoe
  + 'chull latlon 39 -124 '    -- Pt. Arena
  + '39 -120 '                 -- Lake Tahoe
  + '35 -114.6 '               -- start Colorado River
  + '34.3 -114.1 '             -- Lake Havasu
  + '32.74 -114.5 '            -- Yuma
  + '32.53 -117.1 '            -- San Diego
  + '33.2 -119.5 '             -- San Nicolas Is.
  + '34 -120.5 '               -- San Miguel Is.
  + '34.57 -120.65 '           -- Pt. Arguello
  + '36.3 -121.9 '             -- Pt. Sur
  + '36.6 -122.0 '             -- Monterey
  + '38 -123.03 '              -- Pt. Reyes

33 G. Fekete, JHU California Cities
select PlaceName
from Place
where HtmID in
  (select distinct SI.objID
   from fHtmCoverRegion(@californiaRegion)
   loop join SpatialIndex SI
     on SI.HtmID between HtmIdStart and HtmIdEnd
     and SI.type = 'P'
   join place P on SI.objID = P.HtmID
   cross join fHtmRegionToTable(@californiaRegion) Poly
   group by SI.objID, Poly.convexID
   having min(SI.x*Poly.x + SI.y*Poly.y + SI.z*Poly.z - Poly.d) >= 0)
OPTION (FORCE ORDER)
This is a popular query, so we can include it as a stored procedure. See Point Inclusion.

34 G. Fekete, JHU Point Inclusion With SQL
Halfspace: h = (x, y, z, D)
P is inside the halfspace: P · (x, y, z) > D, that is, P · (x, y, z) - D > 0
In SQL: min(SI.x*Poly.x + SI.y*Poly.y + SI.z*Poly.z - Poly.d) >= 0

35 G. Fekete, JHU Covermap Of California
15277198671872 - 15278272413695
15298673508352 - 15300820991999
15301089427456 - 15302968475647
......
15384572854272 - 15384841289727
44 trixels, but only 13 ranges
Use covermaps and HtmIDs to coarse filter...

36 G. Fekete, JHU DB Function For Region Search
select PlaceName
from Place
where HtmID in
  (select ObjID from fHtmRegionObjects(@californiaRegion, 'P'))
Number of rows in cover join (coarse filter): 981
Number of rows that are within region: 885
Number of places in DB: 22993
Time with covermap: 110
Time without covermap: 1210
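The Caveats slide proposes the ratio of false positives to correct answers as a quality measure. Applying it to the row counts reported for the Baltimore and California queries gives the figures below. The helper names are illustrative, and the slides do not state the time units, so the speedup is only a unitless ratio.

```python
def false_positive_ratio(coarse_rows, fine_rows):
    """False positives per correct answer after the coarse filter."""
    return (coarse_rows - fine_rows) / fine_rows

def speedup(time_without, time_with):
    """How many times faster the query runs with the covermap."""
    return time_without / time_with

# Row counts and timings as reported on the two query slides.
baltimore_fp = false_positive_ratio(2223, 1122)
california_fp = false_positive_ratio(981, 885)
baltimore_speedup = speedup(100, 35)
california_speedup = speedup(1210, 110)

print(f"Baltimore:  {baltimore_fp:.2f} false positives per hit, "
      f"{baltimore_speedup:.1f}x faster")
print(f"California: {california_fp:.2f} false positives per hit, "
      f"{california_speedup:.1f}x faster")
```

On these numbers the California covermap passes far fewer false positives per correct answer than the Baltimore circle cover, and it also shows the larger speedup.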

37 G. Fekete, JHU SDSS Digital map in 5 spectral bands covering ¼ of the sky. Will obtain 40 TB of raw pixel data. Photometric catalog with more than 200 million objects. Spectra of ~ 1 million objects. Data Release 3 – DR3: 150 M images, 480 k spectra.

38 G. Fekete, JHU Ambitious Survey Info content > US Library of Congress Before SDSS: total number of galaxies with measured parameters ~ 100k After SDSS, we will have detailed parameters for over 100 Million galaxies!!

39 G. Fekete, JHU SDSS Processing Pipeline Processed data ingested into a relational DBMS Allows fast exploration and analysis - Data Mining Heavily indexed to speed up access – HTM + DB Indices Short queries can run interactively. Long queries (> 1 hr) require a custom Batch Query System.

40 G. Fekete, JHU SDSS Data Access Data Archive Server (DAS) –FITS files (raw data) –Images, spectra, corrected frames, atlas images, binned images, masks –Online form-based access at www.sdss.org –Rsync and wget file retrieval Catalog Archive Server (CAS) –Science parameters extracted to catalogs –Stuffed into relational DBMS (SQL Server) –Online access via SkyServer at http://cas.sdss.org/, http://skyserver.sdss.org

41 G. Fekete, JHU Conclusion HTM methods provide a way to filter data so that the expensive geometrical computations needed to satisfy a query are performed on only a small subset of the original data. The HTM is on its way to becoming one of the de facto standards for representing spatial information in astronomical catalogs, especially for large-scale surveys.

