Association techniques for the Virtual Observatory Bob Mann.


1 Association techniques for the Virtual Observatory Bob Mann

2 Why associations are crucial to the Virtual Observatory
- The essence of the VO is database federation
  - Usually DBs of independent origin
  - No links between entries in different DBs
- Such links needed for prototypical VO query
  - e.g. “give me all galaxies in region A of the sky with an optical/X-ray flux ratio greater than X which are not detected in the radio to a limiting flux of Y”
[Figure labels: Optical, X-ray, Radio]

3 Why you might think associations are easy to make
- Natural spatial indexing to astro databases
  - Plus uncertainties on positions, in general
- Just perform matching by proximity
  - Simple-ish methods for doing this [Clive]
- Some practical issues for distributed case
  - Data volumes: think about transfers & performance
  - Metadata for interoperability
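As a minimal sketch of the "matching by proximity" idea on this slide: for each source in one catalogue, take the nearest entry of the other catalogue within a fixed search radius. The function names, the brute-force loop, and the toy coordinates are assumptions made for illustration; a real service would use a spatial index rather than comparing every pair.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def match_by_proximity(cat_a, cat_b, radius_deg):
    """For each (ra, dec) in cat_b, return the index of the nearest cat_a
    entry within radius_deg, or None if there is no such entry."""
    matches = []
    for ra_b, dec_b in cat_b:
        best_index, best_sep = None, radius_deg
        for i, (ra_a, dec_a) in enumerate(cat_a):
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_index, best_sep = i, sep
        matches.append(best_index)
    return matches


# Toy catalogues (hypothetical positions, in degrees).
cat_a = [(10.0, 20.0), (10.001, 20.0), (50.0, -5.0)]
cat_b = [(10.0009, 20.0), (200.0, 0.0)]
matches = match_by_proximity(cat_a, cat_b, radius_deg=2.0 / 3600.0)
```

The brute-force loop costs N_A × N_B comparisons, which is exactly why the data-volume and performance issues listed above matter for large catalogues.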

4 SkyQuery: www.skyquery.net
- Restriction to SQL Server databases & .NET
- Requires special facilities at data centres? [Greg]
- Matching by proximity alone

5 Matching by proximity is not always adequate
- Need astrophysical information to know which of the red objects is the most likely counterpart to the cyan source

6 General Case
- Database A:
  - Positions: $(\mathrm{RA}_i, \mathrm{Dec}_i)$ for $i = 1, \dots, N_A$
  - Pos. uncerts: $(\sigma_{\mathrm{RA},i}, \sigma_{\mathrm{Dec},i})$ or $(\sigma_{X,i}, \sigma_{Y,i})$ or $\sigma_i$ or $\sigma$
  - Other attributes $A_{ij}$ for $j = 1, \dots, M_A$
- Ditto for Database B
- $(N_A, N_B)$ may be up to $\sim 10^9$
- $(M_A, M_B)$ may be $\sim 10^2$
  - <10 likely to be used in association procedure

7 General Requirements
- Users can readily assess whether associations are suitable for their analysis
  - Transparency of method used
  - Figure of merit for each association
- User-supplied association methods(?)
- Performance: pre-computation vs. on-the-fly
- Incorporating astrophysical prior knowledge, but not biasing associations unduly
  - Often new classes of source involved

8 Likelihood Ratio technique(s)
- Likelihood Ratio, $\mathrm{LR}_{ij}$, for association of $i$th entry of DB A and $j$th of B defined to be
  $$\mathrm{LR}_{ij} = \frac{\text{prob. that } A_i \text{ is true counterpart of } B_j}{\text{prob. that } A_i \text{ is not true counterpart of } B_j}$$
- Choose $i$ that maximises $\mathrm{LR}_{ij}$

9 LR example
- A is an optical catalogue, with magnitudes $m$ and negligible positional errors
- Gaussian positional uncertainty, $e(x, y)$, for B
- Then, $\mathrm{LR}_{ij} = n_{A,\mathrm{ID}}(m_i)\, e(x_j, y_j) / n_A(m_i)$
- Problems:
  - Might not know form of $n_{A,\mathrm{ID}}(m_i)$
  - Might have several populations in B
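The LR example on this slide can be sketched in a few lines. The assumptions here are loud ones: the positional error is taken to be a circular Gaussian with a single width `sigma`, the magnitude distributions `n_id` and `n_all` are toy callables invented for the demonstration, and the offsets and magnitudes are made up. It is meant only to show how a bright but more distant candidate can beat a faint nearer one.

```python
import math


def gaussian_error(dx, dy, sigma):
    """Circular 2-D Gaussian positional error density e(x, y), per unit area."""
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)


def likelihood_ratio(m, dx, dy, sigma, n_id, n_all):
    """LR = n_ID(m) * e(x, y) / n_A(m) for one candidate."""
    return n_id(m) * gaussian_error(dx, dy, sigma) / n_all(m)


def best_candidate(candidates, sigma, n_id, n_all):
    """candidates: non-empty list of (magnitude, dx, dy) offsets from the B source.
    Returns (index, LR) of the candidate with the highest likelihood ratio."""
    scored = [likelihood_ratio(m, dx, dy, sigma, n_id, n_all) for m, dx, dy in candidates]
    best = max(range(len(scored)), key=scored.__getitem__)
    return best, scored[best]


# Toy distributions (hypothetical): true counterparts are mostly brighter
# than m = 20, while background counts rise steeply towards faint magnitudes.
n_id = lambda m: 1.0 if m < 20 else 0.1
n_all = lambda m: 10 ** (0.3 * (m - 18))

# A faint object 0.5 units away and a bright object 1.5 units away (sigma = 1).
candidates = [(22.0, 0.5, 0.0), (19.0, 1.5, 0.0)]
best_index, best_lr = best_candidate(candidates, sigma=1.0, n_id=n_id, n_all=n_all)
```

With these toy inputs the brighter, more distant candidate wins, which a purely positional match would not have chosen.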

10 If $n_{A,\mathrm{ID}}(m_i)$ is not known
- Estimate it:
  - Compare $n_A(m)$ around source positions with $n_A(m)$ for full database A
- Learn it:
  - Use EM algorithm to learn form of $n_{A,\mathrm{ID}}(m_i)$ [Emma Taylor PhD thesis]
- Circumvent it:
  - Set $n_{A,\mathrm{ID}}(m_i) = \text{const.}$ and normalise $\mathrm{LR}_{ij}$ using randomly-located fictitious sources
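One possible reading of the "Estimate it" option is a background subtraction: histogram the magnitudes of A entries found near the B source positions, then subtract the counts expected by chance from the surface density of the full database. The sketch below assumes that reading; the function name, the argument names, and the toy numbers are all hypothetical.

```python
def estimate_n_id(mags_near_sources, mags_full, search_area, full_area,
                  n_sources, bin_edges):
    """Background-subtracted magnitude histogram of likely counterparts.

    search_area is the area searched around EACH source; full_area is the
    area covered by the full database A (same units)."""
    def histogram(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            for k in range(len(counts)):
                if bin_edges[k] <= v < bin_edges[k + 1]:
                    counts[k] += 1
                    break
        return counts

    total = histogram(mags_near_sources)
    full = histogram(mags_full)
    # Chance objects expected per bin: surface density x total searched area.
    expected = [c / full_area * search_area * n_sources for c in full]
    # Clip at zero so noise cannot produce negative counterpart counts.
    return [max(0.0, t - e) for t, e in zip(total, expected)]


# Toy numbers (hypothetical): 10 sources, unit search area each.
bin_edges = [18.0, 20.0, 22.0]
mags_full = [19.0] * 100 + [21.0] * 300
mags_near = [19.0] * 6 + [21.0] * 3
excess = estimate_n_id(mags_near, mags_full, search_area=1.0,
                       full_area=1000.0, n_sources=10, bin_edges=bin_edges)
```

In this toy case the bright bin shows an excess over the chance expectation while the faint bin does not, so the estimated counterpart distribution is concentrated at bright magnitudes.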

11 But…
- All of these methods require statistics on A
  - e.g. $n_A(m)$
  - …or histogram of any other attribute(s)
- The more complicated the physical model (e.g. multiple source populations in B), the more complicated the statistics that are needed
- Not insurmountable problem: just lots of count(*) queries
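To illustrate the "lots of count(*) queries" remark, here is a sketch that builds a magnitude histogram with a single grouped COUNT(*) query against an in-memory SQLite database. The table name `catalogue_a`, the column `mag`, and the bin width are assumptions for the example, and magnitudes are assumed non-negative because the integer cast truncates towards zero.

```python
import sqlite3


def magnitude_histogram(conn, bin_width):
    """Return {bin lower edge: count} for the mag column of catalogue_a."""
    rows = conn.execute(
        "SELECT CAST(mag / ? AS INTEGER) AS mag_bin, COUNT(*) "
        "FROM catalogue_a GROUP BY mag_bin ORDER BY mag_bin",
        (bin_width,),
    ).fetchall()
    return {bin_index * bin_width: count for bin_index, count in rows}


# Toy catalogue (hypothetical magnitudes) held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalogue_a (mag REAL)")
conn.executemany(
    "INSERT INTO catalogue_a (mag) VALUES (?)",
    [(18.1,), (18.4,), (18.6,), (19.2,), (19.3,), (19.9,)],
)
hist = magnitude_histogram(conn, bin_width=0.5)
```

Pushing the binning into the query means only one row per bin is transferred, which matters when the catalogue itself holds up to ~10^9 entries.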

12 Pre-computing cross-neighbours
- LR chooses between a few candidates usually
- Pre-compute & store cross-neighbours
  - At least for the few, very large DBs
- Can then allow many probabilistic models to be used following the initial proximity cut
[Diagram: databases A, B and C, with CrossNeighbours (B,C) and CrossNeighbours (C,B) tables linking B and C]
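A sketch of the pre-computation idea: store, once, every pair within a generous proximity cut, then let different scoring models choose among the stored candidates without repeating the spatial search. The function names, the dictionary layout, and the toy scoring rule are assumptions for illustration, not a description of any existing cross-neighbour table.

```python
import math
from collections import defaultdict


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def precompute_cross_neighbours(cat_b, cat_c, radius_arcsec):
    """Map each B index to a list of (C index, separation) within the radius."""
    neighbours = defaultdict(list)
    for j, (ra_b, dec_b) in enumerate(cat_b):
        for k, (ra_c, dec_c) in enumerate(cat_c):
            sep = separation_arcsec(ra_b, dec_b, ra_c, dec_c)
            if sep <= radius_arcsec:
                neighbours[j].append((k, sep))
    return dict(neighbours)


def associate(neighbours, score):
    """For each B entry, pick the stored neighbour maximising score(c_index, sep)."""
    return {j: max(cands, key=lambda c: score(*c))[0]
            for j, cands in neighbours.items()}


# Toy data (hypothetical positions in degrees, magnitudes for catalogue C).
cat_b = [(10.0, 20.0)]
cat_c = [(10.0002, 20.0), (10.0005, 20.0), (11.0, 20.0)]
mags_c = [23.0, 18.0, 17.0]

neighbours = precompute_cross_neighbours(cat_b, cat_c, radius_arcsec=5.0)
nearest = associate(neighbours, lambda k, sep: -sep)
# Toy model that also favours brighter candidates (weights are arbitrary).
weighted = associate(neighbours, lambda k, sep: -sep - 0.5 * mags_c[k])
```

The expensive pair search runs once; `associate` can then be called with as many user-supplied scoring functions as needed.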

13 Distributed Association Service?
- c.f. Distributed Annotation Server
  - Allows third-party annotation in bio DBs
  - “inferred function of this gene is junk”
  - Can be included in queries (somehow)
  - Select whatever from BioDB where not “function is junk”
  - Some sort of join between BioDB and the Distributed Annotation Server

14 Distributed Association Service (2)
- Is something like this needed in the VO?
  - Easier than adding extra columns to tables
- What would it contain:
  - References to original databases
  - “entry N in DB A is entry M in DB B”
  - Descriptions of methods used
  - Links to literature references…ADS/CDS
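The contents listed on this slide suggest a simple record structure. The sketch below is purely hypothetical: the class names, field names, and the in-memory store are invented to show what "entry N in DB A is entry M in DB B", plus method and literature links, might look like as data. The figure-of-merit field echoes the requirement stated earlier in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationRecord:
    """One claimed association between entries of two databases."""
    database_a: str
    entry_a: int
    database_b: str
    entry_b: int
    method: str               # description of the association method used
    figure_of_merit: float    # e.g. a likelihood ratio or reliability
    references: tuple = ()    # literature references, as free-text strings


class AssociationStore:
    """Minimal in-memory stand-in for an association service."""

    def __init__(self):
        self._records = []

    def add(self, record):
        self._records.append(record)

    def lookup(self, database, entry):
        """All records that mention the given entry on either side."""
        key = (database, entry)
        return [r for r in self._records
                if (r.database_a, r.entry_a) == key
                or (r.database_b, r.entry_b) == key]


# Hypothetical example entries.
store = AssociationStore()
store.add(AssociationRecord("OpticalDB", 42, "XrayDB", 7,
                            method="likelihood ratio", figure_of_merit=12.5))
store.add(AssociationRecord("OpticalDB", 43, "RadioDB", 99,
                            method="nearest neighbour", figure_of_merit=0.8))
found = store.lookup("XrayDB", 7)
```

Keeping associations in a separate store, rather than as extra columns, lets several competing associations for the same entry coexist, each with its own method description.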

15 Associations in the VO
- Basically, something like Greg’s picture…
  - Start with a large dose of SkyQuery
  - Add possibility of running user-defined algorithms on dataset from proximity cut
  - Pre-compute cross-neighbours for big DBs
  - Distributed Association Service to record matches made?…and methods used?

