Aggregating Metadata from Multiple Archives: a Non-VO Approach. Stephen Gwyn, Canadian Astronomy Data Centre.

Presentation transcript:

1 Aggregating Metadata from Multiple Archives: a Non-VO Approach. Stephen Gwyn, Canadian Astronomy Data Centre (CADC)

2
- Astronomy is using more and more archival data
- More than 50% of HST papers are archival
- Similar trends for other telescopes
- Harder for solar system astronomy
SSOIS (Solar System Object Image Search) allows users to search for images of moving targets

5 Initially, only data from CFHT/MegaCam was searched

6 Next, data from external telescope archives was added: NEAT, CFHT, Subaru, ESO, Gemini, AAT, SDSS, NOAO, ING, HST, WISE

7 Next, data from external telescope archives was added (slide label: CADC)

8 Scraping external archives:
For each image, we need:
- position (RA, Dec)
- field of view
- MJD of mid-exposure
- filter
- exposure time
- target name
- URL to data
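The per-image fields listed above could be held in a small record type. This is only a sketch: the class name, field names, and units are assumptions made for illustration, not SSOIS's actual schema, and the field of view is simplified to a single width. The helper assumes an archive that reports the start of exposure rather than the midpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    """One harvested image, holding the seven items listed on the slide."""
    ra_deg: float      # right ascension of the image centre, degrees
    dec_deg: float     # declination of the image centre, degrees
    fov_deg: float     # field of view, simplified here to one width in degrees
    mjd_mid: float     # MJD of mid-exposure
    filter_name: str   # filter name as reported by the archive
    exptime_s: float   # exposure time, seconds
    target: str        # target name
    url: str           # link back to the data


def mid_exposure_mjd(mjd_start: float, exptime_s: float) -> float:
    """Shift a start-of-exposure MJD to mid-exposure (86400 seconds per day)."""
    return mjd_start + 0.5 * exptime_s / 86400.0
```

A frozen dataclass keeps each harvested row immutable once it has been parsed, which makes it harder to corrupt a record during later quality control.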

10 There is a variety of data archive interfaces...

11 Scraping external archives:
- In an ideal world: one query to get all metadata
- In real life: row limits
- As the archives are updated, they need to be re-scraped periodically
- Programmatic retrieval is required

12 Use SIAP?
Advantages:
- A single tool can scrape multiple archives
Disadvantages:
- Not all archives have an SIAP interface
- Many SIAP services do not conform to the VO standard
- Not all SIAP services contain all the necessary metadata
- Most archives have at least one heavily observed patch of sky: hit the row limit again
- SIAP services vary in ability for positional queries
  - maximum search area
  - search is circle or box
  - may require 10^5 queries: may be perceived as a DoS attack
Far better off scraping by day/night/MJD:
- Almost all telescopes take <10000 observations per 24 hours
- Can re-scrape with fewer queries
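Scraping by date under a row limit can be sketched as follows. This is an illustration, not the SSOIS code: `fetch` is a hypothetical callable standing in for one archive query, assumed to return the rows in a half-open MJD window and to truncate silently at the row limit. Halving any window that comes back full is one possible way to cope with a busy night.

```python
from typing import Callable, List, Tuple

Row = dict
# fetch(mjd_start, mjd_end) is assumed to return rows with
# mjd_start <= MJD < mjd_end, truncated at the archive's row limit.
Fetch = Callable[[float, float], List[Row]]


def harvest(fetch: Fetch, mjd_start: float, mjd_end: float,
            row_limit: int, min_span: float = 1.0 / 1440.0) -> List[Row]:
    """Collect rows in [mjd_start, mjd_end), halving windows that hit the limit."""
    rows = fetch(mjd_start, mjd_end)
    span = mjd_end - mjd_start
    # A result shorter than the limit is complete; stop splitting below
    # min_span (one minute by default) to avoid unbounded recursion.
    if len(rows) < row_limit or span <= min_span:
        return rows
    mid = mjd_start + span / 2.0
    return (harvest(fetch, mjd_start, mid, row_limit, min_span)
            + harvest(fetch, mid, mjd_end, row_limit, min_span))


def nightly_windows(mjd_first: int, mjd_last: int) -> List[Tuple[float, float]]:
    """One 24-hour window per integer MJD, for periodic re-scraping."""
    return [(float(d), float(d + 1)) for d in range(mjd_first, mjd_last + 1)]
```

Because the windows are half-open and share their midpoint, each observation falls in exactly one window, so splitting does not produce duplicate rows. Re-scraping then only needs the windows since the last harvest.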

13 Scraping by RA/Dec

14 Scraping by Date

15 Older archive interfaces:
- Query page + simple CGI result page
- View source on the query page
- Get form inputs
- Issue repeated queries to the CGI result page using GET or POST with wget/curl/scripting API
- Easy
Example: http://astronomydata.edu/query?ra=12.87&dec=13.52&mjd=57323
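Once the form inputs are known, the repeated GET queries reduce to building URLs. The sketch below reuses the illustrative address and the `ra`, `dec`, and `mjd` parameter names from the slide's example URL; a real archive will have its own form fields, and the function names here are made up for the example.

```python
from typing import List
from urllib.parse import urlencode

BASE_URL = "http://astronomydata.edu/query"  # illustrative address from the slide


def query_url(ra: float, dec: float, mjd: int) -> str:
    """Build one GET query against the CGI result page."""
    return BASE_URL + "?" + urlencode({"ra": ra, "dec": dec, "mjd": mjd})


def nightly_urls(ra: float, dec: float, mjd_first: int, mjd_last: int) -> List[str]:
    """One query URL per night, ready to hand to wget, curl, or urllib."""
    return [query_url(ra, dec, mjd) for mjd in range(mjd_first, mjd_last + 1)]
```

For example, `query_url(12.87, 13.52, 57323)` reproduces the URL shown on the slide.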

16 Newer archive interfaces:
- AJAX/HTML5/etc. page
- Download the JavaScript and run it through a de-obfuscator
- Locate the relevant XMLHttpRequest
- Determine if cookies are necessary
- Issue repeated queries to the XMLHttpRequest URLs
- Much harder
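After the XMLHttpRequest has been located, replaying it might look like the sketch below, using only the Python standard library. The JSON body and the `X-Requested-With` header are assumptions: many AJAX back ends expect them, but each archive has to be checked against what its own JavaScript actually sends, and any endpoint URL passed in is hypothetical.

```python
import json
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, OpenerDirector, Request, build_opener


def make_session() -> OpenerDirector:
    """Opener that stores cookies and resends them on later requests."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


def xhr_request(url: str, payload: dict) -> Request:
    """Build (but do not send) a POST that mimics a page's XMLHttpRequest."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",    # assumption: JSON request body
        "X-Requested-With": "XMLHttpRequest",  # assumption: common AJAX marker
    }
    return Request(url, data=body, headers=headers, method="POST")
```

If cookies turn out to be necessary, the query page can be opened once with the session from `make_session()` and each request from `xhr_request()` then sent through the same session's `open()` method.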

17 Easiest of all... http://smoka.nao.ac.jp/status/obslog/SUP_2007.txt

18 A script to get all Subaru/SuprimeCam metadata...
#!/bin/bash
# Download the yearly SuprimeCam observation logs, 1999 through 2014.
for year in {1999..2014}; do
    wget "http://smoka.nao.ac.jp/status/obslog/SUP_${year}.txt"
done

19 The second easiest: CADC's Advanced Search

22 The second easiest: CADC's Advanced Search
http://www1.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/tap/sync?LANG=ADQL&REQUEST=doQuery&QUERY=SELECT%20Observation.observationURI%20AS%20%22Preview%22%2C%20Observation.collection%20AS%20%22Collection%22%2C%20Observation.observationID%20AS%20%22Obs.%20ID%22%2C%20COORD1(CENTROID(Plane.position_bounds))%20AS%20%22RA%20(J2000.0)%22%2C%20COORD2(CENTROID(Plane.position_bounds))%20AS%20%22Dec.%20(J2000.0)%22%2C%20Plane.time_bounds_cval1%20AS%20%22Start%20Date%22%2C%20Observation.instrument_name%20AS%20%22Instrument%22%2C%20Plane.time_exposure%20AS%20%22Int.%20Time%22%2C%20Observation.target_name%20AS%20%22Target%20Name%22%2C%20Plane.energy_bandpassName%20AS%20%22Filter%22%2C%20Plane.calibrationLevel%20AS%20%22Cal.%20Lev.%22%2C%20Observation.type%20AS%20%22Obs.%20Type%22%2C%20Plane.energy_bounds_cval1%20AS%20%22Min.%20Wavelength%22%2C%20Plane.energy_bounds_cval2%20AS%20%22Max.%20Wavelength%22%2C%20Observation.proposal_id%20AS%20%22Proposal%20ID%22%2C%20Observation.proposal_pi%20AS%20%22P.I.%20Name%22%2C%20Plane.productID%20AS%20%22Product%20ID%22%2C%20Plane.dataRelease%20AS%20%22Data%20Release%22%2C%20AREA(Plane.position_bounds)%20AS%20%22Field%20of%20View%22%2C%20Plane.position_sampleSize%20AS%20%22Pixel%20Scale%22%2C%20Plane.dataProductType%20AS%20%22Data%20Type%22%2C%20Plane.position_timeDependent%20AS%20%22Moving%20Target%22%2C%20Plane.provenance_name%20AS%20%22Provenance%20Name%22%2C%20Plane.provenance_keywords%20AS%20%22Provenance%20Keywords%22%2C%20Observation.intent%20AS%20%22Intent%22%2C%20Observation.target_type%20AS%20%22Target%20Type%22%2C%20Observation.target_standard%20AS%20%22Target%20Standard%22%2C%20Plane.metaRelease%20AS%20%22Meta%20Release%22%2C%20Observation.sequenceNumber%20AS%20%22Sequence%20Number%22%2C%20Observation.algorithm_name%20AS%20%22Algorithm%20Name%22%2C%20Observation.proposal_title%20AS%20%22Proposal%20Title%22%2C%20Observation.proposal_keywords%20AS%20%22Proposal%20Keywords%22%2C%20Observation.proposal_project%20AS%20%22Proposal%20Project%22%2C%20Plane.position_bounds%20AS%20%22Polygon%22%2C%20Plane.energy_emBand%20AS%20%22Band%22%2C%20Plane.provenance_reference%20AS%20%22Prov.%20Reference%22%2C%20Plane.provenance_version%20AS%20%22Prov.%20Version%22%2C%20Plane.provenance_project%20AS%20%22Prov.%20Project%22%2C%20Plane.provenance_producer%20AS%20%22Prov.%20Producer%22%2C%20Plane.provenance_runID%20AS%20%22Prov.%20Run%20ID%22%2C%20Plane.provenance_lastExecuted%20AS%20%22Prov.%20Last%20Executed%22%2C%20Plane.provenance_inputs%20AS%20%22Prov.%20Inputs%22%2C%20Plane.energy_restwav%20AS%20%22Rest-frame%20Spectral%20Coverage%22%2C%20Plane.planeID%20AS%20%22planeID%22%2C%20isDownloadable(Plane.planeURI)%20AS%20%22DOWNLOADABLE%22%2C%20Plane.planeURI%20AS%20%22CAOM%20Plane%20URI%22%2C%20Observation.instrument_keywords%20AS%20%22Instrument%20Keywords%22%2C%20Plane.energy_transition_species%20AS%20%22Molecule%22%2C%20Plane.energy_transition_transition%20AS%20%22Transition%22%2C%20Plane.position_resolution%20AS%20%22IQ%22%20FROM%20caom2.Plane%20AS%20Plane%20JOIN%20caom2.Observation%20AS%20Observation%20ON%20Plane.obsID%20%3D%20Observation.obsID%20WHERE%20%20(%20Observation.instrument_name%20%3D%20%27MegaPrime%27%20AND%20Observation.collection%20%3D%20%27CFHT%27%20)&FORMAT=tsv

23 The second easiest: CADC's Advanced Search
SELECT
    Observation.observationURI AS "Preview",
    Observation.collection AS "Collection",
    Observation.observationID AS "Obs. ID",
    COORD1(CENTROID(Plane.position_bounds)) AS "RA (J2000.0)",
    COORD2(CENTROID(Plane.position_bounds)) AS "Dec. (J2000.0)",
    Plane.time_bounds_cval1 AS "Start Date",
    Observation.instrument_name AS "Instrument",
    Plane.time_exposure AS "Int. Time",
    Observation.target_name AS "Target Name",
    Plane.energy_bandpassName AS "Filter",
    Plane.calibrationLevel AS "Cal. Lev.",
    Observation.type AS "Obs. Type",
    Plane.energy_bounds_cval1 AS "Min. Wavelength",
    Plane.energy_bounds_cval2 AS "Max. Wavelength",
    Observation.proposal_id AS "Proposal ID",
    Observation.proposal_pi AS "P.I. Name",
    Plane.productID AS "Product ID",
    Plane.dataRelease AS "Data Release",
    AREA(Plane.position_bounds) AS "Field of View",
    Plane.position_sampleSize AS "Pixel Scale",
    Plane.dataProductType AS "Data Type",
    Plane.position_timeDependent AS "Moving Target",
    Plane.provenance_name AS "Provenance Name",
    Observation.intent AS "Intent",
    Observation.target_type AS "Target Type",
    Observation.target_standard AS "Target Standard",
    Observation.sequenceNumber AS "Sequence Number",
    Observation.algorithm_name AS "Algorithm Name",
    Observation.proposal_title AS "Proposal Title",
    Observation.proposal_keywords AS "Proposal Keywords",
    Plane.energy_emBand AS "Band",
    Plane.provenance_version AS "Prov. Version",
    Plane.provenance_project AS "Prov. Project",
    Plane.provenance_runID AS "Prov. Run ID",
    Plane.provenance_lastExecuted AS "Prov. Last Executed",
    Plane.energy_restwav AS "Rest-frame Spectral Coverage",
    isDownloadable(Plane.planeURI) AS "DOWNLOADABLE",
    Plane.planeURI AS "CAOM Plane URI",
    Observation.instrument_keywords AS "Instrument Keywords",
    Plane.energy_transition_species AS "Molecule",
    Plane.energy_transition_transition AS "Transition",
    Plane.position_resolution AS "IQ"
FROM caom2.Plane AS Plane
JOIN caom2.Observation AS Observation ON Plane.obsID = Observation.obsID
WHERE ( Observation.collection = 'CFHT' )
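The encoded URL and the readable query are two forms of the same thing, so the URL can be generated from the ADQL text instead of being copied from the browser. The sketch below uses the endpoint and the `LANG`, `REQUEST`, `QUERY`, and `FORMAT` parameters exactly as they appear on the slides; the shortened column list is a subset chosen for illustration, and whether that host name is still served is not something the slides confirm.

```python
from urllib.parse import quote, urlencode

# Synchronous TAP endpoint as given on the slide.
TAP_SYNC = "http://www1.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/tap/sync"


def tap_url(adql: str, fmt: str = "tsv") -> str:
    """Percent-encode an ADQL query into a synchronous TAP request URL."""
    params = {"LANG": "ADQL", "REQUEST": "doQuery", "QUERY": adql, "FORMAT": fmt}
    # quote_via=quote encodes spaces as %20, matching the slide's URL.
    return TAP_SYNC + "?" + urlencode(params, quote_via=quote)


# Shortened query built from columns that appear in the slide's SELECT list.
QUERY = (
    "SELECT Observation.observationID, Plane.time_bounds_cval1, Plane.time_exposure "
    "FROM caom2.Plane AS Plane JOIN caom2.Observation AS Observation "
    "ON Plane.obsID = Observation.obsID "
    "WHERE Observation.collection = 'CFHT'"
)
```

`tap_url(QUERY)` yields a URL of the same shape as the one on the previous slide. It encodes parentheses and commas more aggressively than the browser did, which is equivalent as far as URL decoding is concerned.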

24 The other hard part:
- Parsing downloaded metadata
- Which observations are images?
- Quality control
  - Is the MJD right?
  - Are coordinates 2000.0 or 1950.0?
- Sorting out filters:
  - remove narrow band filter data
  - remove bad filters
  - remove grism data
  - maybe homogenize filter names (B vs Bj vs Bjohnson vs Johnson B vs...)
- Telescope footprint not typically part of the metadata
- Work out links back to original images
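Two of these checks lend themselves to a short sketch: homogenizing filter names and testing whether an MJD is plausible. The alias table below is hypothetical and simply takes the spellings from the slide; whether variants such as Bj should really be merged with B is a judgment call for each archive. The first-light date used in the plausibility check has to be supplied per instrument.

```python
from datetime import date, timedelta

MJD_EPOCH = date(1858, 11, 17)  # MJD 0 corresponds to 1858-11-17

# Hypothetical alias table using the spellings listed on the slide.
FILTER_ALIASES = {
    "b": "B",
    "bj": "B",
    "bjohnson": "B",
    "johnson b": "B",
}


def normalize_filter(name: str) -> str:
    """Map known spellings to one name; unknown names pass through stripped."""
    key = " ".join(name.strip().lower().split())
    return FILTER_ALIASES.get(key, name.strip())


def mjd_to_date(mjd: float) -> date:
    """Calendar date on which the given (non-negative) MJD falls."""
    return MJD_EPOCH + timedelta(days=int(mjd))


def mjd_is_plausible(mjd: float, first_light: date, today: date) -> bool:
    """True if the MJD lies between the instrument's first light and today."""
    return first_light <= mjd_to_date(mjd) <= today
```

A range check like this catches gross errors such as a misplaced decimal point, but it will not notice an MJD that is off by a few hours, for example a start time reported as mid-exposure.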

25 SSOIS saves the Earth....

26 Summary:
- SSOIS allows multi-archive searches for moving objects
- Metadata is harvested from external archives
- Lessons learned:
  - SIAP is not useful for metadata harvesting
  - multiple queries by time, not by position
  - older interfaces are easier to scrape
  - parsing metadata is often harder than retrieving it
