Association techniques for the Virtual Observatory Bob Mann.



Why associations are crucial to the Virtual Observatory
- The essence of the VO is database federation
  - Usually DBs of independent origin
  - No links between entries in different DBs
- Such links are needed for the prototypical VO query
  - e.g. “give me all galaxies in region A of the sky with an optical/X-ray flux ratio greater than X which are not detected in the radio to a limiting flux of Y”
[Figure panels: Optical, X-ray, Radio]

Why you might think associations are easy to make
- Natural spatial indexing to astro databases
  - Plus uncertainties on positions, in general
- Just perform matching by proximity
  - Simple-ish methods for doing this [Clive]
- Some practical issues for the distributed case
  - Data volumes: think about transfers & performance
  - Metadata for interoperability
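Matching by proximity can be sketched as follows. This is a minimal brute-force illustration, not the method of any particular service; the function names `angular_separation_deg` and `match_by_proximity`, the list-of-tuples catalogue layout, and the search radius are all assumptions made for the example. It compares every pair of sources, so it would not scale to very large catalogues, where a spatial index would be needed.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def match_by_proximity(cat_a, cat_b, radius_deg):
    """For each (ra, dec) in cat_b, return the index of the nearest cat_a
    source within radius_deg, or None if there is no such source."""
    matches = []
    for ra_b, dec_b in cat_b:
        best_index, best_sep = None, radius_deg
        for i, (ra_a, dec_a) in enumerate(cat_a):
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_index, best_sep = i, sep
        matches.append(best_index)
    return matches
```

A call such as `match_by_proximity(optical, xray, 2.0 / 3600.0)` would use a 2 arcsecond search radius; in practice the radius would be chosen from the positional uncertainties of the two catalogues.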

SkyQuery:
- Restriction to SQLServer databases & .Net
- Requires special facilities at data centres? [Greg]
- Matching by proximity alone

Matching by proximity is not always adequate
- Need astrophysical information to know which of the red objects is the most likely counterpart to the cyan source

General Case
- Database A:
  - Positions: $(\mathrm{RA}_i, \mathrm{Dec}_i)$ for $i = 1, \dots, N_A$
  - Pos. uncerts: $(\sigma_{\mathrm{RA},i}, \sigma_{\mathrm{Dec},i})$ or $(\sigma_{X,i}, \sigma_{Y,i})$ or $\sigma_i$ or $\sigma$
  - Other attributes $A_{ij}$ for $j = 1, \dots, M_A$
- Ditto for Database B
- $(N_A, N_B)$ may be up to $\sim 10^9$
- $(M_A, M_B)$ may be $\sim 10^2$
  - <10 likely to be used in association procedure

General Requirements
- Users can readily assess whether associations are suitable for their analysis
  - Transparency of method used
  - Figure of merit for each association
- User-supplied association methods(?)
- Performance: pre-computation vs. on-the-fly
- Incorporating astrophysical prior knowledge, but not biasing associations unduly
  - Often new classes of source involved

Likelihood Ratio technique(s)
- Likelihood Ratio, $LR_{ij}$, for association of ith entry of DB A and jth of B defined to be

  $$LR_{ij} = \frac{\text{prob. that } A_i \text{ is true counterpart of } B_j}{\text{prob. that } A_i \text{ is not true counterpart of } B_j}$$

- Choose i that maximises $LR_{ij}$

LR example
- A is an optical catalogue, with magnitudes $m$ and negligible positional errors
- Gaussian positional uncertainty, $e(x,y)$, for B
- Then, $LR_{ij} = n_{A,\mathrm{ID}}(m_i)\, e(x_j, y_j) / n_A(m_i)$
- Problems:
  - Might not know form of $n_{A,\mathrm{ID}}(m_i)$
  - Might have several populations in B
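The LR example above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the positional error is taken to be a circular 2D Gaussian with a single width `sigma`, the offsets `dx`, `dy` between each optical candidate and the B source are assumed to be in the same units as `sigma`, and the magnitude distributions `n_id` (for true counterparts) and `n_all` (for the whole catalogue A) are hypothetical callables supplied by the user.

```python
import math


def gaussian_error(dx, dy, sigma):
    """Circular 2D Gaussian positional error density e(x, y) at offset (dx, dy)."""
    r2 = dx * dx + dy * dy
    return math.exp(-r2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)


def likelihood_ratio(mag, dx, dy, sigma, n_id, n_all):
    """LR = n_id(m) * e(x, y) / n_all(m) for a single candidate."""
    return n_id(mag) * gaussian_error(dx, dy, sigma) / n_all(mag)


def best_counterpart(candidates, sigma, n_id, n_all):
    """candidates is a non-empty list of (mag, dx, dy) tuples.
    Returns (index, LR) of the candidate with the highest likelihood ratio."""
    scores = [likelihood_ratio(m, dx, dy, sigma, n_id, n_all)
              for m, dx, dy in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

With a background density that rises steeply towards faint magnitudes, a bright candidate somewhat further from the source can score higher than a faint one that is closer, which is the situation where matching by proximity alone would pick the wrong counterpart.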

If $n_{A,\mathrm{ID}}(m_i)$ is not known
- Estimate it:
  - Compare $n_A(m)$ around source positions with $n_A(m)$ for full database A
- Learn it:
  - Use EM algorithm to learn form of $n_{A,\mathrm{ID}}(m_i)$ [Emma Taylor PhD thesis]
- Circumvent it:
  - Set $n_{A,\mathrm{ID}}(m_i) = \mathrm{const.}$ and normalise $LR_{ij}$ using randomly-located fictitious sources
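One way to read the "estimate it" option is as a background subtraction: count the A sources per magnitude bin inside the search regions around the B sources, then subtract the number expected by chance from the full catalogue, scaled by area. The sketch below assumes that interpretation; the function names, the area-scaling step, and the clipping of negative excesses to zero are choices made for the illustration rather than details given in the slides.

```python
from bisect import bisect_right


def histogram(values, bin_edges):
    """Count values in half-open bins [edge_k, edge_k+1); out-of-range values are ignored."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        k = bisect_right(bin_edges, v) - 1
        if 0 <= k < len(counts):
            counts[k] += 1
    return counts


def estimate_n_id(mags_near, mags_full, area_near, area_full, bin_edges):
    """Excess counts per magnitude bin near the sources, relative to the
    background expected from the full catalogue scaled to the searched area."""
    near = histogram(mags_near, bin_edges)
    full = histogram(mags_full, bin_edges)
    scale = area_near / area_full
    # Negative excesses are statistical noise; clip them to zero.
    return [max(0.0, n - f * scale) for n, f in zip(near, full)]
```

The returned excess counts would still need normalising before use as $n_{A,\mathrm{ID}}(m)$ in the likelihood ratio.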

But…
- All of these methods require statistics on A
  - e.g. $n_A(m)$
  - …or histogram of any other attribute(s)
- The more complicated the physical model (e.g. multiple source populations in B), the more complicated the statistics that are needed
- Not insurmountable problem: just lots of count(*) queries
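As an illustration of the kind of count(*) query involved, the sketch below builds a magnitude histogram for catalogue A with a single grouped query. It uses Python's standard `sqlite3` module purely as a stand-in for whatever DBMS a data centre runs; the table name `catalogue_a` and column name `mag` are hypothetical.

```python
import sqlite3


def magnitude_histogram(conn, bin_width):
    """Return {bin_lower_edge: count} for catalogue_a.mag using one grouped
    count(*) query. Assumes positive magnitudes, because CAST(... AS INTEGER)
    truncates toward zero rather than rounding down."""
    rows = conn.execute(
        "SELECT CAST(mag / ? AS INTEGER) AS mag_bin, COUNT(*) "
        "FROM catalogue_a GROUP BY mag_bin ORDER BY mag_bin",
        (float(bin_width),),
    ).fetchall()
    return {mag_bin * bin_width: count for mag_bin, count in rows}


# Small in-memory demonstration table (made-up magnitudes).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE catalogue_a (mag REAL)")
demo.executemany(
    "INSERT INTO catalogue_a VALUES (?)",
    [(m,) for m in (18.2, 18.7, 19.1, 19.5, 19.9, 20.3)],
)
demo_histogram = magnitude_histogram(demo, 1.0)
```

A histogram over several attributes would follow the same pattern with more terms in the GROUP BY clause, which is where the number of queries and bins starts to grow.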

Pre-computing cross-neighbours
- LR usually chooses between a few candidates
- Pre-compute & store cross-neighbours
  - At least for the few, very large DBs
- Can then allow many probabilistic models to be used following the initial proximity cut
[Diagram: databases A, B and C, with CrossNeighbours(B,C) and CrossNeighbours(C,B) tables]
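A pre-computed cross-neighbour table can be thought of as a list of (entry in B, entry in C, separation) rows for every pair closer than some generous radius. The sketch below builds such a table with a simple grid index. It assumes positions have already been projected to planar coordinates (for example tangent-plane offsets in arcseconds over a small field), which is a simplification; the function name and the row layout are hypothetical.

```python
import math
from collections import defaultdict


def cross_neighbours(cat_b, cat_c, radius):
    """Return sorted (b_index, c_index, separation) rows for all pairs of
    planar positions closer than radius, using a grid of cell size radius."""
    grid = defaultdict(list)
    for j, (x, y) in enumerate(cat_c):
        grid[(math.floor(x / radius), math.floor(y / radius))].append(j)

    rows = []
    for i, (x, y) in enumerate(cat_b):
        cx, cy = math.floor(x / radius), math.floor(y / radius)
        # Any neighbour within radius must lie in the 3x3 block of cells around (cx, cy).
        for gx in (cx - 1, cx, cx + 1):
            for gy in (cy - 1, cy, cy + 1):
                for j in grid.get((gx, gy), ()):
                    sep = math.hypot(x - cat_c[j][0], y - cat_c[j][1])
                    if sep <= radius:
                        rows.append((i, j, sep))
    rows.sort()
    return rows
```

Once stored, the rows for a given source are the short candidate list that any likelihood-ratio or user-supplied model can then rank, without repeating the spatial search.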

Distributed Association Service?
- c.f. Distributed Annotation Server
  - Allows third-party annotation in bio DBs
  - “inferred function of this gene is junk”
  - Can be included in queries (somehow)
  - Select whatever from BioDB where not “function is junk”
  - Some sort of join between BioDB and the Distributed Annotation Server

Distributed Association Service (2)
- Is something like this needed in the VO?
  - Easier than adding extra columns to tables
- What would it contain:
  - References to original databases
  - “entry N in DB A is entry M in DB B”
  - Descriptions of methods used
  - Links to literature references…ADS/CDS
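The contents listed above suggest a simple record structure. The sketch below is one hypothetical way to hold such records in memory; the class names, field names, and the inclusion of a figure-of-merit field are assumptions made for illustration and do not describe any existing VO service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One statement of the form 'entry N in DB A is entry M in DB B'."""
    db_a: str               # reference to the first original database
    entry_a: int
    db_b: str               # reference to the second original database
    entry_b: int
    method: str             # description of the association method used
    figure_of_merit: float  # e.g. a likelihood ratio (assumed field)
    references: tuple = ()  # literature references, e.g. ADS identifiers


class AssociationStore:
    """Minimal in-memory stand-in for an association service."""

    def __init__(self):
        self._records = []

    def add(self, record):
        self._records.append(record)

    def counterparts(self, db, entry):
        """All recorded associations that involve the given database entry."""
        return [
            r for r in self._records
            if (r.db_a, r.entry_a) == (db, entry)
            or (r.db_b, r.entry_b) == (db, entry)
        ]
```

Keeping the method description and figure of merit on every record is what would let users judge whether a stored association is suitable for their own analysis.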

Associations in the VO
- Basically, something like Greg’s picture…
  - Start with a large dose of SkyQuery
  - Add possibility of running user-defined algorithms on dataset from proximity cut
  - Pre-compute cross-neighbours for big DBs
  - Distributed Association Service to record matches made?…and methods used?