Capturing and Organizing Scientific Annotations
Greg Riccardi
Florida State University
riccardi@cs.fsu.edu
Riccardi: Workshop on Data Management, March 17, 2004
What is an Annotation?
- An assertion of a relationship among objects
  - Someone claims that several objects are connected by a relationship and gives evidence of the connection
  - Includes record of author and date of assertion
- Objects are often datasets with provenance
- Annotations often assert quality characteristics of data objects
- Crucial social components
  - Attribution, confidence, and validity
  - Ontologies and compliance with standards
  - Establishment of object naming strategy
  - Security policies
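The definition above can be sketched as a small record type. This is only an illustration, not a schema from the talk: the class name `Annotation`, its field names, and the object names in the example are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class Annotation:
    """An asserted relationship among named objects, with its evidence."""
    relationship: str          # the claimed connection, e.g. "same_object"
    objects: Tuple[str, ...]   # names of the objects being related
    evidence: str              # what the author offers in support
    author: str                # attribution
    asserted_on: date          # date of the assertion


# Hypothetical example: two catalog entries claimed to be the same object.
note = Annotation(
    relationship="same_object",
    objects=("catalogA:obj-17", "catalogB:obj-402"),
    evidence="positions agree within measurement error",
    author="A. Scientist",
    asserted_on=date(2004, 3, 17),
)
```

Making the record immutable reflects the idea that an assertion, once made and attributed, is a historical fact; a later correction would be a new annotation rather than an edit.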
Example from SkyServer
[Diagram labels: "These objects are the same"; Telescope and catalog info; SkyQuery dataset (two); Analysis; Query string (two)]
Types and Importance of Annotations
- Three types of annotations
  - Systematic
  - Semi-structured
  - Ad hoc
- Annotations are of primary importance in data semantics and analysis
  - Record of semantics of data
  - Record of people's opinions about data
- We need tools to make annotations easy to create, organize, understand, and search
Systematic Annotations
- Collected automatically
- Anticipated and organized
- Factual
- Experimental metadata
  - See example of Jefferson Lab run log
  - A run log entry asserts a relationship between the metadata and the raw data
  - The run number identifies each object
    - Rows in runBegin table, runEnd table, runFiles table, runComment table
- Object identification is much more difficult in most cases
  - As noted in earlier talks
- Experimental metadata is not always collected or curated properly
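A minimal sketch of how a run number can tie the four tables together into one run log entry. The table names come from the slide; the column names, the sample rows, and the run number 1001 are assumptions made for illustration and do not describe the actual Jefferson Lab schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column names below are hypothetical.
    CREATE TABLE runBegin   (run INTEGER PRIMARY KEY, started TEXT);
    CREATE TABLE runEnd     (run INTEGER PRIMARY KEY, ended TEXT);
    CREATE TABLE runFiles   (run INTEGER, path TEXT);
    CREATE TABLE runComment (run INTEGER, comment TEXT);

    INSERT INTO runBegin   VALUES (1001, '2004-03-17T08:00');
    INSERT INTO runEnd     VALUES (1001, '2004-03-17T09:30');
    INSERT INTO runFiles   VALUES (1001, 'raw/run1001.A00');
    INSERT INTO runFiles   VALUES (1001, 'raw/run1001.A01');
    INSERT INTO runComment VALUES (1001, 'beam stable');
""")


def run_log_entry(run):
    """Assemble the metadata and raw-data file names for one run number."""
    started, ended = conn.execute(
        "SELECT b.started, e.ended FROM runBegin b "
        "JOIN runEnd e ON e.run = b.run WHERE b.run = ?", (run,)
    ).fetchone()
    files = [r[0] for r in conn.execute(
        "SELECT path FROM runFiles WHERE run = ? ORDER BY path", (run,))]
    comments = [r[0] for r in conn.execute(
        "SELECT comment FROM runComment WHERE run = ?", (run,))]
    return {"run": run, "started": started, "ended": ended,
            "files": files, "comments": comments}


entry = run_log_entry(1001)
```

The run number works as an object name here only because the experiment assigns it; the slide's point is that most data objects have no such ready-made identifier.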
Systematic Provenance Annotations
- Derivation provenance
  - Record of computational creation of data
  - Must be collected by computations directly
- Query provenance
  - In SkyQuery, user submits query and results dataset is retained in MyDB
  - The query must be retained to record semantics of dataset
- GGF Database Access and Integration Services (DAIS) Working Group
  - Developing standards for representing queries on databases and other data stores
  - Provides a data access recipe that can be used to fetch a particular dataset
- Morphbank images of scanning electron micrographs
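Query provenance can be illustrated by keeping the query text together with the dataset it produced. The sketch below is not SkyQuery's or MyDB's mechanism; the `StoredDataset` type, the `run_and_retain` helper, and the `catalog` table with its columns are hypothetical stand-ins.

```python
import hashlib
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredDataset:
    rows: tuple      # the retained result rows
    query: str       # the exact query text that produced them
    checksum: str    # digest of the rows, to detect later changes


def run_and_retain(conn, query):
    """Execute a query and keep its text together with its result."""
    rows = tuple(conn.execute(query).fetchall())
    digest = hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()
    return StoredDataset(rows=rows, query=query, checksum=digest)


# Hypothetical catalog table standing in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (obj_id INTEGER, magnitude REAL)")
conn.executemany("INSERT INTO catalog VALUES (?, ?)",
                 [(1, 14.2), (2, 17.9), (3, 15.1)])

dataset = run_and_retain(
    conn, "SELECT obj_id FROM catalog WHERE magnitude < 16 ORDER BY obj_id")
```

With the query retained, the dataset carries its own meaning ("objects brighter than magnitude 16"), and rerunning the query later gives a way to check whether the underlying store has changed.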
Semi-Structured Annotations
- Anticipated and organized
- Collected mostly by hand
- Experimental logbook from Jefferson Lab
Jefferson Lab Logbook
- Run and log daily summaries
- Standard logbook entry
  - Many standard (expected) fields
  - Comment field filled with ad hoc annotations
    - “ADB crate”
    - “voltage”
- Complaint about logbook usage
  - Suggested strategy for creating logbook entries
- Automatically generated logbook entry
  - Post processing software creates database entries directly
  - Image tags point to files on some computer
Semi-Structured Annotations
- Anticipated and organized
- Collected mostly by hand
- Experimental logbook from Jefferson Lab
  - Logbook entry has specific fields
    - Run id, subject, author, entry_type, system
  - Entry has an ad hoc field
    - Searching comment field requires interpretation of words [Ontologies?]
- Search page for logbook
  - Based on predefined structure
  - Created and used by experts
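The contrast between the structured fields and the ad hoc comment field can be sketched as follows. The field names follow the slide; the `LogEntry` class, the `search` helper, and the sample entries are assumptions for illustration, not the Jefferson Lab implementation.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    run_id: int
    subject: str
    author: str
    entry_type: str
    system: str
    comment: str = ""   # the ad hoc, free-text field


def search(entries, comment_word=None, **fields):
    """Match structured fields exactly; match the comment by keyword only."""
    hits = []
    for e in entries:
        if any(getattr(e, name) != value for name, value in fields.items()):
            continue
        if comment_word and comment_word.lower() not in e.comment.lower():
            continue
        hits.append(e)
    return hits


# Hypothetical entries for illustration.
log = [
    LogEntry(1001, "HV trip", "shift crew", "problem", "detector",
             "voltage dropped on two channels"),
    LogEntry(1001, "Run summary", "shift crew", "summary", "daq",
             "no issues"),
    LogEntry(1002, "Crate reset", "expert", "problem", "daq",
             "power-cycled crate; Voltage readback normal"),
]

problems = search(log, entry_type="problem")
voltage_notes = search(log, comment_word="voltage")
```

The structured search is exact and cheap, while the keyword search returns both a voltage fault and a note that the voltage is normal. Telling those apart is the interpretation problem the slide raises.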
Ad Hoc Annotations
- Assert a connection between arbitrary objects
- Example from morphology
Morphology Publication Example
Ad Hoc Annotations
- Assert a connection between arbitrary objects
- Example from morphology
- Searching is difficult
  - Ambiguous and inefficient
- Google is a search engine for ad hoc annotations
  - Not based on organized ontology
  - Not based on document structure
Annotating Data Quality
- Suppose that someone finds an error in a SkyQuery dataset
- Create an ad hoc annotation
  - “Objects X, Y, Z in data catalog D are incorrectly identified”
- Include annotation in any query?
  - We don’t know how to carry quality annotations into the query results
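To make the open question concrete, here is one naive possibility: attach each quality annotation to the result rows that name a flagged object. This is an assumption-laden sketch, not a proposed solution. The record layout and the `annotate_results` helper are hypothetical, and the approach says nothing about derived results such as aggregates or joins, where no single row names the flagged object.

```python
# Hypothetical ad hoc annotation: objects flagged within one catalog.
annotations = [
    {"catalog": "D", "objects": {"X", "Y", "Z"},
     "text": "incorrectly identified", "author": "A. Scientist"},
]


def annotate_results(catalog, rows, annotations):
    """Pair each result row with the annotations that name its object."""
    out = []
    for row in rows:
        notes = [a["text"] for a in annotations
                 if a["catalog"] == catalog and row["object"] in a["objects"]]
        out.append({**row, "quality_notes": notes})
    return out


# Hypothetical query result drawn from catalog D.
rows = [{"object": "W", "magnitude": 15.0},
        {"object": "X", "magnitude": 16.3}]
flagged = annotate_results("D", rows, annotations)
```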
Organizing Annotations
- Need to find ways to structure ad hoc annotations
- When structure emerges, capture it
  - Create specific schemas
  - Create specific interfaces for collection, display and search
- Main goal is to make it easy enough for scientists
  - They must see advantages to the extra work of structuring their thoughts and conforming to ontologies
Querying the Annotation Activity
- Publish/Subscribe database strategies
  - Publish the history of updates
  - Subscribe to queries on the history
- Suppose you are the curator of a SkyQuery database
  - Someone claims that the object catalog is wrong
  - You should be informed
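The publish/subscribe idea can be sketched in a few lines: updates are appended to a published history, and each subscription is a standing query evaluated against every new update. The `AnnotationHistory` class, the dictionary keys, and the catalog names are assumptions for illustration; a real system would sit on a database rather than in-memory lists.

```python
class AnnotationHistory:
    """Publishes each new annotation to subscribers whose query matches."""

    def __init__(self):
        self._history = []        # published record of all updates
        self._subscriptions = []  # (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        """Register a standing query (predicate) over future updates."""
        self._subscriptions.append((predicate, callback))

    def publish(self, annotation):
        """Record an update and notify every matching subscriber."""
        self._history.append(annotation)
        for predicate, callback in self._subscriptions:
            if predicate(annotation):
                callback(annotation)


history = AnnotationHistory()
curator_inbox = []

# The curator's standing query: any error claim about catalog "D".
history.subscribe(
    lambda a: a["target"] == "catalog:D" and a["kind"] == "error_claim",
    curator_inbox.append,
)

history.publish({"target": "catalog:D", "kind": "error_claim",
                 "text": "Objects X, Y, Z are incorrectly identified"})
history.publish({"target": "catalog:E", "kind": "comment",
                 "text": "looks fine"})
```

After both updates are published, only the error claim about catalog D reaches the curator's inbox, which is the "you should be informed" behavior described above.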
Example of Annotation Query
[Diagram labels: "These objects are the same"; Telescope and catalog info; SkyQuery dataset (two); Analysis; Curator; Query string (two)]
Challenges of Ad Hoc Annotations
- Establishing globally unique, persistent data object names
- Optimizing searches
- Result semantics
- Ontologies
- Capturing structure of frequent annotation styles
- Providing user interfaces to define semi-structured annotations
Annotations Technology: SAM
- Scientific Annotation Middleware
- Jim Myers and Al Geist
- EMSL Electronic Notebook
Annotations Technology: Amaya
- Annotations of HTML and XML documents
- Project includes browser and document editor
- Text annotations attached to XHTML, XML, MathML and SVG
  - http://www.w3.org/amaya
- Annotea collaborative annotation technology
  - http://www.w3.org/2001/Annotea/
References
- SkyQuery and SkyServer
  - http://www.skyquery.org/
  - http://cas.sdss.org/dr2/en/tools/chart/navi.asp
- Jefferson Lab Logbooks
  - Home page: http://clasweb.jlab.org/clasonline/
  - Today’s runs: http://clasweb.jlab.org/clasonline/servlet/prodruninfo?action=today
  - Today’s Logbook entries: http://clasweb.jlab.org/clasonline/servlet/prodloginfo?action=today
  - Run detail page: http://clasweb.jlab.org/clasonline/servlet/prodruninfo?action=detail&run=42331
  - Logbook entry: http://clasweb.jlab.org/clasonline/servlet/newloginfo?action=logentry&entryId=17082
- Morphbank: Johan Liljeblad & Fredrik Ronquist
  - http://www.morphbank.net/
  - http://www.csit.fsu.edu/~ronquist/papers/SystEnt1998.pdf
- Scientific Annotation Middleware
  - http://collaboratory.emsl.pnl.gov/
- W3C Amaya XML Annotation project
  - http://www.w3.org/Amaya/
  - http://www.w3.org/2001/Annotea/