László Dobos, Tamás Budavári, Alex Szalay, István Csabai. Eötvös University / JHU. Aug. 25-26, 2008. IDIES Inaugural Symposium, Baltimore.


1 László Dobos, Tamás Budavári, Alex Szalay, István Csabai. Eötvös University / JHU. Aug. 25-26, 2008. IDIES Inaugural Symposium, Baltimore.

2 Cross-match problem in astronomy
- Astronomical catalogs in the TB range, O(100M) detections per catalog
- Geographically distributed:
  - reliable, lightweight transfer protocol needed
  - should benefit from co-located datasets
- Goals:
  - find the same object in every catalog
  - find drop-outs (requires complete description of footprints)
  - on-demand: do it quickly (< 5 min)
- Matching primarily based on celestial coordinates
  - astrometric error
  - error can vary from object to object
- Additional match criteria: size, color, etc.
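Matching on celestial coordinates comes down to comparing the angular separation of two detections with a search radius. The following is a minimal Python sketch, not the authors' implementation; the function names and the use of the haversine formula are assumptions made for illustration.

```python
import math


def angular_separation_arcsec(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Great-circle separation in arcseconds between two positions in degrees."""
    ra1, dec1, ra2, dec2 = (
        math.radians(v) for v in (ra1_deg, dec1_deg, ra2_deg, dec2_deg)
    )
    # Haversine form: numerically stable for the very small angles
    # that are typical of cross-matching.
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def within_radius(ra1_deg, dec1_deg, ra2_deg, dec2_deg, radius_arcsec):
    """True if the two positions lie within radius_arcsec of each other."""
    sep = angular_separation_arcsec(ra1_deg, dec1_deg, ra2_deg, dec2_deg)
    return sep <= radius_arcsec
```

Because the astrometric error can vary from object to object, a real matcher would presumably derive `radius_arcsec` from the errors of the pair being compared rather than use one global value.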

3 Cross-match problem in astronomy
- The math: Bayesian model selection [Budavári & Szalay 2008, "Probabilistic Cross-Identification of Astronomical Sources"]
  - First step: cut on distance
  - Including additional match criteria is easy and natural
  - Tested on simulations [Heinis et al. 2009]
- The problems:
  - one-to-one matching of objects is expensive
  - trigonometric computations
  - IO intensive if the dataset is big: always have to keep the right subset of data in memory
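To make the Bayesian model selection step concrete, here is a hedged Python sketch of the two-catalog Bayes factor in the form usually quoted from Budavári & Szalay (2008) for circular Gaussian astrometric errors: B = 2 / (σ1² + σ2²) · exp(−ψ² / (2(σ1² + σ2²))), with the separation ψ and the errors σ in radians. Treat the exact expression as an assumption to be checked against the paper; the function name and the arcsecond units are choices made here for illustration.

```python
import math

ARCSEC_IN_RAD = math.pi / (180.0 * 3600.0)  # one arcsecond in radians


def bayes_factor(sep_arcsec, sigma1_arcsec, sigma2_arcsec):
    """Two-catalog Bayes factor: same source versus two unrelated sources.

    Assumes circular Gaussian positional errors and small angles.
    """
    # Combined positional variance of the pair, in radians squared.
    var = (sigma1_arcsec * ARCSEC_IN_RAD) ** 2 + (sigma2_arcsec * ARCSEC_IN_RAD) ** 2
    psi = sep_arcsec * ARCSEC_IN_RAD
    return 2.0 / var * math.exp(-psi ** 2 / (2.0 * var))


def accept_match(sep_arcsec, sigma1_arcsec, sigma2_arcsec, threshold=1e3):
    """Keep a candidate pair only if its Bayes factor exceeds the threshold."""
    return bayes_factor(sep_arcsec, sigma1_arcsec, sigma2_arcsec) > threshold
```

With errors of 0.1 and 0.5 arcsec and a threshold of 1e3, a pair at zero separation is accepted and a pair 4 arcsec apart is rejected. The Bayes factor falls monotonically with separation, which is why a plain distance cut works as a cheap first step.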

4 Hardware and data layout
- JHU Graywulf cluster:
  - Dell PowerEdge 2950 + Dell PowerVault MD1000, 2 × PERC 5/E RAID controllers
  - 1.2-1.4 GB/sec nominal IO bandwidth, InfiniBand
  - 2 × 4-core Xeon, 8-32 GB RAM
  - 5-20 machines partially assigned to the cross-match engine
- Catalogs are mirrored on every node
- User catalogs uploaded to / located at a dedicated node
- Remote data sources (via various protocols)
- Queries are partitioned and executed in parallel on every machine

5 Xmatch definition language
A cross-match query:

    SELECT s.objId AS SobjID, s.ra, s.dec, g.ra, g.dec, j_m
    FROM SDSS:PhotoObjAll AS s
    CROSS JOIN GALEX:PhotoObjAll AS g
    XMATCH BAYESIAN AS x
        MUST s ON Point(s.ra, s.dec), 0.1
        MUST g ON Point(g.ra, g.dec), 0.5
        HAVING x.BF > 1e3
    WHERE s.type = 3
        AND s.ra BETWEEN 200 AND 210 AND s.dec BETWEEN -2 AND 2
        AND g.ra BETWEEN 200 AND 210 AND g.dec BETWEEN -2 AND 2

A partitioned query:

    SELECT s.ObjID
    FROM SDSS:PhotoObjAll s
    PARTITION ON Ra
    WHERE Ra BETWEEN 200 AND 210 AND Dec BETWEEN -5 AND 5

6 Query Execution 1
- Parse:
  - proprietary SQL parser written from scratch
  - covers ~80% of SQL Server's SELECT statement grammar
  - extensions can be added easily by changing the BNF grammar
- Job assignment (to be implemented):
  - determine sets of co-located catalogs using a central registry
  - send part of the cross-match job to a remote service
  - return only the cross-matched result, not full raw datasets
  - merge result sets at any node
- Partition:
  - cross-match queries: on right ascension
  - simple queries: on the specified column
  - partitions determined based on a histogram: a histogram query is executed on a subsample to get metrics
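The histogram-based partitioning step can be sketched as follows. This is an illustrative Python version under stated assumptions, not the system's actual logic: `partition_bounds`, its parameters, and the equal-row-count target are all hypothetical. It bins a subsample of the partitioning column and places boundaries where the running count crosses each partition's quota.

```python
def partition_bounds(sample, lo, hi, n_partitions, n_bins=100):
    """Split [lo, hi] into at most n_partitions ranges of similar row count.

    `sample` is a subsample of the partitioning column (e.g. right ascension).
    Returns the list of boundaries, starting with lo and ending with hi.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in sample:
        if lo <= value <= hi:
            # Clamp so that value == hi falls into the last bin.
            counts[min(int((value - lo) / width), n_bins - 1)] += 1

    total = sum(counts)
    if total == 0:
        # No information from the subsample: fall back to equal-width ranges.
        return [lo + (hi - lo) * k / n_partitions for k in range(n_partitions + 1)]

    bounds = [lo]
    running = 0
    # The last bin is skipped because its right edge is hi, added below.
    for i in range(n_bins - 1):
        running += counts[i]
        quota = total * len(bounds) / n_partitions
        if len(bounds) < n_partitions and running >= quota:
            bounds.append(lo + (i + 1) * width)
    bounds.append(hi)
    return bounds
```

For a sample spread evenly over right ascension 200 to 210, four partitions come out close to 2.5 degrees wide each. For a sample concentrated in a small range, the boundaries crowd into that range so that each worker still receives a similar number of rows.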

7 Query Execution 2
- Cache:
  - cache remote datasets
  - copy myDB tables to worker nodes
  - can benefit from filters defined in the query
- Execute:
  - construct T-SQL queries
  - execute T-SQL queries on nodes in parallel
  - automatically retry on failure
- Merge:
  - merge result sets
  - benefit from clever partitioning: no duplicates
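The execute, retry, and merge steps above can be sketched in Python as below. The slides describe a Windows Workflow Foundation implementation running T-SQL on cluster nodes; this sketch swaps in a thread pool and a plain callable purely for illustration, and `run_with_retry`, `execute_partitions`, and `max_attempts` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, partition, max_attempts=3):
    """Run task(partition), retrying on any exception up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(partition)
        except Exception:
            if attempt == max_attempts:
                raise  # give up: propagate the last failure


def execute_partitions(task, partitions, max_workers=4):
    """Run one task per partition in parallel and concatenate the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `partitions`.
        chunks = list(pool.map(lambda p: run_with_retry(task, p), partitions))
    merged = []
    for chunk in chunks:
        merged.extend(chunk)
    return merged
```

If the partitions are half-open ranges `[lo, hi)` that tile the query region, every row belongs to exactly one partition. The merge is then a plain concatenation with no de-duplication pass, which is one plausible reading of "clever partitioning: no duplicates".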

8 Applied technologies
- Relational Database Management System: SQL Server 2008
  - CLR integration with parallel execution support
- Windows Workflow Foundation:
  - coordinates the complex execution workflow
  - transactions help keep the system consistent
  - parallel execution support
- SMO (SQL Server Management Objects):
  - easy access to the database schema

9 Zone algorithm
- Zone algorithm:
  - Pure T-SQL: can leverage the query optimizer of SQL Server
  - Divide the sphere into zones
  - ZoneID: very simple hash on declination
  - Indexes built on ZoneID and right ascension allow very quick pre-filtering of match candidates
  - very well parallelized on multi-core machines
- [Gray, Szalay & Nieto-Santisteban 2006, "The Zones Algorithm for Finding Points-Near-a-Point or Cross-Matching Spatial Datasets"]
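The zone idea can be illustrated outside SQL with a small in-memory Python sketch. It is not the T-SQL from the cited paper: the zone hash `floor((dec + 90) / height)` follows the usual description of the algorithm, while the function names, the tuple layout, and the simplified right-ascension window (no wraparound at 0/360 degrees, approximate near the poles) are assumptions made here.

```python
import bisect
import math
from collections import defaultdict


def zone_id(dec_deg, height_deg):
    """Hash a declination into a horizontal stripe (zone) of the sky."""
    return int(math.floor((dec_deg + 90.0) / height_deg))


def build_zones(catalog, height_deg):
    """Index (obj_id, ra, dec) rows by zone, sorted by right ascension."""
    zones = defaultdict(list)
    for obj_id, ra, dec in catalog:
        zones[zone_id(dec, height_deg)].append((ra, dec, obj_id))
    for rows in zones.values():
        rows.sort()
    return zones


def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def neighbors(zones, height_deg, ra, dec, radius_deg):
    """Return ids of indexed objects within radius_deg of (ra, dec)."""
    # Widen the RA window by 1/cos(dec); ignores wraparound at RA 0/360.
    cos_dec = math.cos(math.radians(min(90.0, abs(dec) + radius_deg)))
    ra_half = 180.0 if cos_dec < 1e-9 else min(180.0, radius_deg / cos_dec)
    found = []
    first = zone_id(dec - radius_deg, height_deg)
    last = zone_id(dec + radius_deg, height_deg)
    for z in range(first, last + 1):
        rows = zones.get(z, [])
        # Coarse filter: binary search on RA inside the zone.
        start = bisect.bisect_left(rows, (ra - ra_half,))
        for r_ra, r_dec, obj_id in rows[start:]:
            if r_ra > ra + ra_half:
                break
            # Fine filter: exact angular distance test.
            if separation_deg(ra, dec, r_ra, r_dec) <= radius_deg:
                found.append(obj_id)
    return found
```

The expensive trigonometric test only runs on the few rows that survive the zone and right-ascension range filters. In the database setting those filters correspond to the index on ZoneID and right ascension.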

10 Summary and future work
- On-demand cross-matching is feasible
- Parser and partitioning logic built for handling cross-match job descriptions
- Workflow built for executing partitioned jobs
- New technologies allow rapid development of complex workflows and high-performance data warehouses
- Future work:
  - Develop GUI
  - Install and publish system
  - Add support for remote datasets
  - Add support to benefit from co-located datasets

