Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos. Moorea Biocode Project, supported by the Gordon and Betty Moore Foundation.


Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos
Moorea Biocode Project, supported by the Gordon and Betty Moore Foundation
Presentation by John Deck, University of California at Berkeley

Outline
- Part 1: Background on Moorea Biocode Project
- Part 2: bioValidator: field based data validation
- Part 3: A case study in handling taxonomic names in a field based client application

Part 1: Background on Moorea Biocode Project

Moorea Biocode: The Collecting

Moorea Biocode: Sorting Specimens

Moorea Biocode: Tissue Sampling

Moorea Biocode LIMS: Binning, Trimming, & Assembly of Sequence Data

Challenges Facing the Moorea Biocode Project IT Team
- Multiple taxa, with a different team for each group.
- Different cultures and workflows for each team.
- Everyone is in a hurry, and non-technical biologists are entering the data.
- Specimens (and metadata) are ultimately owned by multiple host institutions.
- Multiple labs process genetic data (with different equipment, processes, and workflows).
- Final taxonomic determination is made by the lab and/or host institution (often much later than the collecting event).
- No internet, or bad internet, in the field.
- *Need to associate photos and standardized higher taxonomy in the field (before accession into any database).

Field Based System Requirements
- Spreadsheets for data entry
- Extensible validation rules (each project or sub-project has its own requirements)
- Match specimen data to photos
- Tag photos and load to external system (e.g. Flickr)
- Query multiple taxonomic authorities (each TaxonTeam selects its own authority)
- Update the online database periodically
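The "extensible validation rules" requirement can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ValidationRule` interface, the two sample rules, and the column names are hypothetical and are not taken from bioValidator itself. The idea shown is only that each project assembles its own list of rules and runs it over every spreadsheet row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical rule interface: returns null when the row passes, otherwise a message.
interface ValidationRule {
    String check(Map<String, String> row);
}

// Rule: the named column must be present and non-blank.
class RequiredField implements ValidationRule {
    private final String column;

    RequiredField(String column) {
        this.column = column;
    }

    public String check(Map<String, String> row) {
        String value = row.get(column);
        return (value == null || value.trim().isEmpty()) ? column + " is required" : null;
    }
}

// Rule: the named column, when filled in, must be a number within [min, max].
class NumericRange implements ValidationRule {
    private final String column;
    private final double min;
    private final double max;

    NumericRange(String column, double min, double max) {
        this.column = column;
        this.min = min;
        this.max = max;
    }

    public String check(Map<String, String> row) {
        String value = row.get(column);
        if (value == null || value.trim().isEmpty()) {
            return null; // blank cells are left to RequiredField
        }
        try {
            double number = Double.parseDouble(value.trim());
            // Written as a negation so that NaN is rejected as well.
            return !(number >= min && number <= max) ? column + " out of range: " + value : null;
        } catch (NumberFormatException e) {
            return column + " is not a number: " + value;
        }
    }
}

// Runs one project's rule list over a spreadsheet row and collects every failure.
class RowValidator {
    private final List<ValidationRule> rules = new ArrayList<>();

    RowValidator add(ValidationRule rule) {
        rules.add(rule);
        return this;
    }

    List<String> validate(Map<String, String> row) {
        List<String> errors = new ArrayList<>();
        for (ValidationRule rule : rules) {
            String message = rule.check(row);
            if (message != null) {
                errors.add(message);
            }
        }
        return errors;
    }
}
```

A sub-project with different requirements would build its own `RowValidator` with a different set of rules, without changing the validator itself.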

Part 2: bioValidator: a Field Based Data Validation Tool
- Validate data using extensible validation rules
- Search multiple taxonomies built in Lucene
- Specimen to photo matching
- Upload to Flickr using machine tags
- No internet required
- Java based
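The slides do not say how specimens are matched to photos or which machine tags are written, so the following is only a sketch under stated assumptions. It assumes a photo belongs to a specimen when its file name starts with the specimen ID, and it builds tags in the general `namespace:predicate=value` form that Flickr machine tags use. The class name, the file naming scheme, and the example namespace are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PhotoTagger {

    // Assumption: a photo belongs to a specimen when its file name starts with the
    // specimen ID followed by '_' or '.', e.g. "MBIO12_dorsal.jpg" or "MBIO12.jpg".
    static Map<String, List<String>> matchPhotos(List<String> specimenIds, List<String> fileNames) {
        Map<String, List<String>> matches = new LinkedHashMap<>();
        for (String id : specimenIds) {
            List<String> photos = new ArrayList<>();
            for (String file : fileNames) {
                if (file.startsWith(id + "_") || file.startsWith(id + ".")) {
                    photos.add(file);
                }
            }
            matches.put(id, photos);
        }
        return matches;
    }

    // Builds a machine tag of the form namespace:predicate=value.
    // Which namespaces and predicates to use is a project decision, not shown in the slides.
    static String machineTag(String namespace, String predicate, String value) {
        return namespace + ":" + predicate + "=" + value;
    }
}
```

Requiring a separator after the ID keeps `MBIO1` from claiming photos that belong to `MBIO12`. Specimens with an empty photo list can be reported back to the collector before anything is uploaded.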

Part 3: A case study in handling taxonomic names in a field based client application
Uniform Lucene indices, cached and searchable offline: GenBank, ITIS, WoRMS

Why Lucene?
- Java-based, cross platform
- Indexes can be delivered to client apps (can run offline)
- Ability to build a standardized interface to multiple taxonomies
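A standardized interface to multiple taxonomies might look like the sketch below. This is an assumption about the design, not the project's actual code: `TaxonomyAuthority` and `AuthorityRegistry` are hypothetical names, and in a real client each authority would be backed by a locally cached index rather than the in-memory stand-ins used here. The point is that the caller asks the same question of every authority, and each team is routed to the authority it selected.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical common interface: every authority answers the same question,
// returning rank -> name for the matched taxon (empty if nothing matches).
interface TaxonomyAuthority {
    Map<String, String> lookup(String rank, String taxonName);
}

// Routes each team's lookups to the authority that team selected.
class AuthorityRegistry {
    private final Map<String, TaxonomyAuthority> byTeam = new HashMap<>();

    void assign(String team, TaxonomyAuthority authority) {
        byTeam.put(team, authority);
    }

    Map<String, String> lookup(String team, String rank, String taxonName) {
        TaxonomyAuthority authority = byTeam.get(team);
        if (authority == null) {
            throw new IllegalArgumentException("No authority assigned to team: " + team);
        }
        return authority.lookup(rank, taxonName);
    }
}
```

Because the interface has a single method, a test or a prototype can supply an authority as a lambda, and a cached-index implementation can be swapped in later without touching the callers.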

Higher Taxonomic Name Handling in the Field
- Initial spreadsheet: just assign the lowest taxon name and lowest taxon level.
- bioValidator: suggest a higher taxonomy based on the name and level provided.
- Revised spreadsheet: update with the suggested higher taxonomic hierarchy.
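The last step of this workflow, revising a spreadsheet row from a suggested hierarchy, can be sketched as follows. The column names (`lowest_taxon_level` and one column per rank), the fixed rank list, and the policy of never overwriting a value the collector already typed are all assumptions made for illustration. The suggestion map stands in for whatever the taxonomy search returned.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class HigherTaxonomyFiller {

    // Hypothetical rank columns, ordered from highest to lowest.
    static final List<String> RANKS =
            Arrays.asList("kingdom", "phylum", "class", "order", "family", "genus", "species");

    // Returns a revised copy of the row in which blank rank columns, from kingdom
    // down to the row's lowest taxon level, are filled from the suggestion.
    // Values already entered by the collector are never overwritten.
    static Map<String, String> revise(Map<String, String> row, Map<String, String> suggestion) {
        Map<String, String> revised = new LinkedHashMap<>(row);
        String level = row.get("lowest_taxon_level");
        int lowest = RANKS.indexOf(level == null ? "" : level.trim().toLowerCase(Locale.ROOT));
        if (lowest < 0) {
            return revised; // unknown level: leave the row untouched
        }
        for (int i = 0; i <= lowest; i++) {
            String rank = RANKS.get(i);
            String suggested = suggestion.get(rank);
            String current = revised.get(rank);
            boolean blank = current == null || current.trim().isEmpty();
            if (suggested != null && blank) {
                revised.put(rank, suggested);
            }
        }
        return revised;
    }
}
```

Returning a copy rather than editing in place lets the tool show the collector the original and revised rows side by side before anything is saved.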

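Lucene Indexer Implementation for Taxonomy

Taxonomic Concept     Lucene Class
Unit                  Document
Rank                  Field
Taxonomic Database    IndexWriter

Example of Lucene Index built on ITIS:

    String sql = "SELECT tsn FROM taxonomic_units";
    // … obtain resultset …
    while (resultset.next()) {
        // One Lucene Document per taxonomic unit
        Document doc = new Document();
        // itisRanks is a class that abstracts the ITIS schema
        itisRanks ir = new itisRanks(resultset.getString("tsn"));
        while (ir.next()) {
            // One Field per rank; stored so that a search can read the value back.
            // (Field constructor arguments vary by Lucene version; this is the 3.x form.)
            doc.add(new Field(ir.rank, ir.name, Field.Store.YES, Field.Index.ANALYZED));
        }
        // writer is an open IndexWriter instance
        writer.addDocument(doc);
    }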
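The slide leaves out how the ranks of one taxonomic unit are gathered before they become fields of a document. A plausible approach, offered here only as an assumption, is to walk parent pointers from the unit up to the root. The sketch below uses plain Java collections in place of Lucene and a database; `TaxonRow` and `HierarchyFlattener` are hypothetical names, and the in-memory table stands in for a taxonomy table in which each row points to its parent.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for one row of a taxonomy table with a parent pointer.
class TaxonRow {
    final int id;
    final Integer parentId; // null at the root
    final String rank;
    final String name;

    TaxonRow(int id, Integer parentId, String rank, String name) {
        this.id = id;
        this.parentId = parentId;
        this.rank = rank;
        this.name = name;
    }
}

class HierarchyFlattener {

    // Builds the "document" for one unit: a rank -> name map covering the unit
    // and all of its ancestors, listed from the unit itself up to the root.
    static Map<String, String> flatten(int id, Map<Integer, TaxonRow> table) {
        Map<String, String> doc = new LinkedHashMap<>();
        Set<Integer> seen = new HashSet<>(); // guards against cycles in bad data
        Integer current = id;
        while (current != null && seen.add(current)) {
            TaxonRow row = table.get(current);
            if (row == null) {
                break; // dangling parent pointer: stop with what we have
            }
            doc.put(row.rank, row.name);
            current = row.parentId;
        }
        return doc;
    }
}
```

Flattening at index time is what lets the client answer "what is the family of this genus?" with a single document lookup and no database connection in the field.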

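Lucene Search Implementation for Taxonomy

Example of (a simplified!) Lucene Search:

    // Assumes searcher (an open IndexSearcher), analyzer, taxonLevels and MAX_HITS
    // are fields initialized elsewhere.
    public static Hashtable<String, String> searchIndex(String taxonLevel, String taxonName)
            throws Exception {
        Hashtable<String, String> map = new Hashtable<String, String>();
        // Construct query: look for taxonName in the field named after the rank
        Query query = new QueryParser(taxonLevel, analyzer).parse(taxonName);
        // Possible multiple matches
        TopDocs hits = searcher.search(query, MAX_HITS);
        // Loop through each taxonomic unit
        for (int taxonUnit = 0; taxonUnit < hits.scoreDocs.length; taxonUnit++) {
            Document doc = searcher.doc(hits.scoreDocs[taxonUnit].doc);
            // Loop through each rank to assign to map
            for (int rank = 0; rank < taxonLevels.getNumLevels(); rank++) {
                String level = taxonLevels.getLevel(rank);
                String value = doc.get(level);
                // Populate a simple table with taxon ranks & values
                // (Hashtable rejects nulls, so skip ranks this unit does not have)
                if (value != null) {
                    map.put(level, value);
                }
            }
        }
        // Simplification: with several matches, later units overwrite earlier ones
        return map;
    }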
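The "possible multiple matches" case matters in practice because the same name can be valid in more than one group (homonyms). The sketch below shows the search idea with plain Java collections instead of Lucene, returning every matching hierarchy rather than a single merged table. `TaxonIndex` is a hypothetical class, and the exact, case-insensitive matching is an assumption; a Lucene analyzer may tokenize and normalize names differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// A tiny inverted index: (rank, name) -> every document containing that pair.
class TaxonIndex {
    private final Map<String, List<Map<String, String>>> postings = new HashMap<>();

    // Case-insensitive key combining the rank (field name) and the taxon name.
    private static String key(String rank, String name) {
        return rank.trim().toLowerCase(Locale.ROOT) + "|" + name.trim().toLowerCase(Locale.ROOT);
    }

    // Indexes one document (rank -> name) under each of its rank/name pairs.
    void add(Map<String, String> doc) {
        for (Map.Entry<String, String> field : doc.entrySet()) {
            postings.computeIfAbsent(key(field.getKey(), field.getValue()), k -> new ArrayList<>())
                    .add(doc);
        }
    }

    // Returns every document whose given rank holds the given name (possibly none).
    List<Map<String, String>> search(String rank, String name) {
        return postings.getOrDefault(key(rank, name), Collections.emptyList());
    }
}
```

For example, the genus name Morus is used both for mulberries (family Moraceae) and for gannets (family Sulidae), so a search on that genus would return two hierarchies, and the client would need to ask the user which one was meant.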

Further Work
- Standardization in validation protocols (expand on CRIA work). As we push the envelope in field-based data collection, this will become more of an issue.
- Network of Lucene indexes for taxonomies?
- GUID implementation in spreadsheets?
- How to track and update data as it changes in dependent systems (LIMS systems, GenBank, BOLD, CalPhotos). See BiSciCol Grant (NSF).

More Information
John Deck
Moorea Biocode Project –
bioValidator –
bioTaxonomy (Lucene index/search) –