Data, Metadata, and Ontology in Ecology Matthew B. Jones National Center for Ecological Analysis and Synthesis (NCEAS) University of California Santa Barbara.


1 Data, Metadata, and Ontology in Ecology
Matthew B. Jones
National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara
and many major collaborators: Mark Schildhauer, Josh Madin, Jing Tao, Chad Berkley, Dan Higgins, Peter McCartney, Chris Jones, Shawn Bowers, Bertram Ludaescher, and others
April 24, 2007

2 Scaling-up Synthesis
More than 400 projects at NCEAS
  – have produced over 1000 publications that synthesize and re-use existing data
  – massive investment in compiling, integrating, and analyzing data
Building a custom database for each project is not logistically feasible
Instead, need loosely-coupled systems that accommodate heterogeneity

3 Dilemma: no unified model
No single database suffices
  – Data warehouses use federated schemas
      any data that does not fit is not captured
      original data transformed to fit federation
  – this is a form of data integration for one purpose
  – Numerous data warehouses exist
      not extensible for all data
      VegBank, ClimDB, GenBank, PDB, etc.

4 Data Collections
Metadata-based data collections
  – Loosely-coupled metadata and data collections
  – No constraints on data schemas
  – Data discovery based on metadata
  – Dynamic data loading and query based on metadata descriptions

5 What is EML?
Ecological Metadata Language: A … modular, extensible, comprehensive
  – Identity and Discovery Information
  – Coverage: Space, Time, Taxa
  – Methods
  – Logical Data Model
  – Physical Data Format
  – Access and Distribution

6 EML: Selected relationships
[Timeline figure, 1990 to 2009, placing EML releases (EML 1.0.0, EML 1.3.0, EML 1.4.x, EML 2.0.0, EML 2.0.1) alongside related efforts: FGDC created, CSDGM 1.0, Michener ’97 paper, ESA FLED Report, NBII BDP, ISO 19115, Dublin Core, XML 1.0, OBOE]

7 A simple EML example
eml
  packageId: sbclter.316.18
  system: knb
  dataset
    title: Kelp Forest Community Dynamics: Benthic Fish
    creator
      individualName
        surName: Reed
    contact
      individualName
        surName: Evans
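A minimal sketch of how the fields on this slide might look as XML and be read back with Python's standard library. The element names come from the slide; pairing Reed with creator and Evans with contact, and leaving out XML namespaces, are assumptions made for illustration, so this is not a schema-valid EML document.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for an EML record (not schema-valid EML).
# Assumption: Reed is the creator and Evans is the contact.
EML_SNIPPET = """
<eml packageId="sbclter.316.18" system="knb">
  <dataset>
    <title>Kelp Forest Community Dynamics: Benthic Fish</title>
    <creator><individualName><surName>Reed</surName></individualName></creator>
    <contact><individualName><surName>Evans</surName></individualName></contact>
  </dataset>
</eml>
"""


def summarize(xml_text):
    """Pull the identifying fields out of the simplified record."""
    root = ET.fromstring(xml_text.strip())
    dataset = root.find("dataset")
    return {
        "packageId": root.get("packageId"),
        "system": root.get("system"),
        "title": dataset.findtext("title"),
        "creator": dataset.findtext("creator/individualName/surName"),
        "contact": dataset.findtext("contact/individualName/surName"),
    }


summary = summarize(EML_SNIPPET)
print(summary)
```

A discovery tool could index dictionaries like `summary` to search by title or creator without opening the data files themselves.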

8 Data Discovery Geographic, Temporal, and Taxonomic coverage

9 Logical Model: Attribute structure
Describes data tables and their variables/attributes
A typical data table with 10 attributes
  – some metadata are likely apparent, others ambiguous
  – missing value code is present
  – definitions need to be explicit, as well as data typing

YEAR  MONTH  DATE        SITE  TRANSECT  SECTION  SP_CODE  SIZE  OBS_CODE  NOTES
2001  8      2001-08-22  ABUR  1         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     11    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     10    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     14    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     7     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     19    06        .
2001  8      2001-08-22  ABUR  1         21-40    COTT     5     06        .
2001  8      2001-08-22  ABUR  2         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  2         21-40    NF       0     06        .
2001  8      2001-08-27  AHND  1         0-20     NF       0     03        .

Callouts on the table: Species Codes, Value bounds, Date Format, Code definitions
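A sketch of why explicit typing and a declared missing value code matter, using the ten column names and the first data row from this slide. The `Attribute` class, the choice of parser for each column, and treating "." as the missing value code are illustrative assumptions, not EML's actual structures.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Callable

MISSING_VALUE_CODE = "."  # assumption: "." marks a missing value, as in NOTES


def parse_date(text):
    """Parse the YYYY-MM-DD date format used in the DATE column."""
    return datetime.strptime(text, "%Y-%m-%d").date()


@dataclass
class Attribute:
    name: str
    parse: Callable[[str], object]  # converts the raw text to a typed value


# Hypothetical attribute descriptions; codes stay strings so "06" keeps its zero.
ATTRIBUTES = [
    Attribute("YEAR", int),
    Attribute("MONTH", int),
    Attribute("DATE", parse_date),
    Attribute("SITE", str),
    Attribute("TRANSECT", int),
    Attribute("SECTION", str),
    Attribute("SP_CODE", str),
    Attribute("SIZE", int),
    Attribute("OBS_CODE", str),
    Attribute("NOTES", str),
]


def parse_row(line):
    """Turn one whitespace-delimited row into a typed record."""
    fields = line.split()
    if len(fields) != len(ATTRIBUTES):
        raise ValueError(f"expected {len(ATTRIBUTES)} fields, got {len(fields)}")
    record = {}
    for attribute, raw in zip(ATTRIBUTES, fields):
        record[attribute.name] = (
            None if raw == MISSING_VALUE_CODE else attribute.parse(raw)
        )
    return record


row = parse_row("2001 8 2001-08-22 ABUR 1 0-20 CLIN 5 06 .")
print(row)
```

Without the attribute descriptions, a generic loader would likely read `06` as the number 6 and `.` as text, which is the kind of ambiguity the slide points at.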

10 EML Measurement Scale
Textual
  – Nominal: categories (e.g., Male, Female)
  – Ordinal: ordered categories (e.g., Low, Medium, High)
Numeric
  – Interval: equidistant on number scale (e.g., 3 Celsius)
  – Ratio: equidistant on number scale, meaningful ratio (e.g., 5 meter)
Dates
  – Datetime: points on calendar timescale (e.g., 6-Oct-2004)

11 Logical Model: unit Dictionary
Consistent assignment of measurement units
  – Quantitative definitions in terms of SI units
  – ‘unitType’ expresses dimensionality
      time, length, mass, energy are all ‘unitType’s
      second, meter, gram, pound, joule are all ‘unit’s
Example: UnitType Mass; Unit kilogram = gram x1000
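A small sketch of the idea behind a unit dictionary: each unit carries its unitType and a multiplier to the SI unit of that type, so conversions can be computed and dimensional mismatches caught. The `Unit` class, the `to_si` field name, and the `convert` helper are hypothetical; the multipliers are the standard definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    unit_type: str  # dimensionality, e.g. "mass" or "length"
    to_si: float    # multiplier from this unit to the SI unit of its type


# Hypothetical in-memory unit dictionary.
UNITS = {
    "kilogram": Unit("kilogram", "mass", 1.0),
    "gram": Unit("gram", "mass", 0.001),
    "pound": Unit("pound", "mass", 0.45359237),
    "meter": Unit("meter", "length", 1.0),
    "second": Unit("second", "time", 1.0),
}


def convert(value, from_unit, to_unit):
    """Convert between two units of the same unitType via the SI unit."""
    source, target = UNITS[from_unit], UNITS[to_unit]
    if source.unit_type != target.unit_type:
        raise ValueError(
            f"cannot convert {source.unit_type} ({from_unit}) "
            f"to {target.unit_type} ({to_unit})"
        )
    return value * source.to_si / target.to_si


grams = convert(2.0, "kilogram", "gram")
print(grams)
```

Asking for `convert(1.0, "gram", "meter")` raises a `ValueError`, which is the check that a shared unitType makes possible when integrating data sets.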

12 Collating metadata
Most scientists know all of this information about their data
  – EML simply provides a standardized format for recording the information
Enables data exchange across organizations and software systems

13 Knowledge Network for Biocomplexity (KNB)
Building a community data network
  – Simplified data sharing
  – Immediate change tracking
  – Redundant backup
  – Data maintained by individuals
  – Access controlled by individuals
[Network diagram: KNB 1, KNB II, PISCO, GCE, LTER, NCEAS, ESA, OBFS, AND... (26)]

14 EML-described data in the KNB
[Figure: Data Packages in the KNB. Cumulative count (axis 0 to 12000) by year, 2002 to 2006]

15 Kepler: dynamic data loading
Kepler supports dynamic data loading:
  – Data sources are discovered via metadata queries
  – EML metadata allows arbitrary schemas to be loaded into an embedded database
  – Data queries can be performed before data flows downstream
[Workflow figure: Data source from EcoGrid (metadata-driven ingestion), feeding an R processing script]
R processing script: res <- lm(BARO ~ T_AIR); res; plot(T_AIR, BARO); abline(res)
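The metadata-driven loading described on this slide can be sketched as follows: the table schema is built at run time from attribute descriptions instead of being fixed in advance, and a query filters the data before it moves downstream. This is not Kepler's implementation; SQLite stands in for the embedded database, and the attribute list, table name, and data values are made up for illustration (only the column names T_AIR and BARO come from the slide's R script).

```python
import sqlite3

# Hypothetical attribute descriptions, as they might be read from metadata.
ATTRIBUTES = [("SITE", "TEXT"), ("T_AIR", "REAL"), ("BARO", "REAL")]

# Made-up rows, for illustration only.
ROWS = [
    ("A", 12.5, 1015.0),
    ("A", 16.0, 1011.5),
    ("B", 18.5, 1009.0),
]


def load_table(conn, table, attributes, rows):
    """Create a table from attribute descriptions and load the rows into it."""
    columns = ", ".join(f'"{name}" {sql_type}' for name, sql_type in attributes)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in attributes)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)


conn = sqlite3.connect(":memory:")
load_table(conn, "weather", ATTRIBUTES, ROWS)

# Query before the data flows downstream: keep only the warmer readings.
warm = conn.execute(
    "SELECT T_AIR, BARO FROM weather WHERE T_AIR > ? ORDER BY T_AIR", (15.0,)
).fetchall()
print(warm)
```

Because `load_table` takes the schema as data, the same function can ingest any table whose metadata lists its attributes and types.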

16 Importance of semantics
So far we’ve dealt only with the logical data model
  – any semantics in EML are expressed in natural language
The computer doesn’t really understand:
  – what is being measured
  – how measurements relate to one another
  – how semantics map to logical structure
Analysis depends on understanding the semantic and contextual relationships among data measurements
  – e.g., density measured within subplot

17 Observation ontology (OBOE)
Goal: semantically describe the structure of scientific observation and measurement as found in a data set
  – Entities represent real-world objects or concepts that can be measured.
  – Observations are made about particular entities.
  – Every measurement has a characteristic, which defines the property of the entity being measured.
  – Observations can provide context for other observations.
Provide extension points for loading specialized domain ontologies
(slide from J. Madin)
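The four statements on this slide can be sketched as a small data model. The class and field names below mirror the slide's vocabulary but are an illustrative simplification, not the OBOE ontology itself, and the plot and organism values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """A real-world object or concept that can be measured."""
    name: str


@dataclass
class Characteristic:
    """The property of an entity that a measurement records."""
    name: str


@dataclass
class Measurement:
    characteristic: Characteristic
    value: object


@dataclass
class Observation:
    """An assertion that an entity was observed, optionally within a context."""
    entity: Entity
    measurements: List[Measurement] = field(default_factory=list)
    context: Optional["Observation"] = None


# Hypothetical example: an organism's height observed within a plot.
plot_obs = Observation(
    entity=Entity("Plot"),
    measurements=[Measurement(Characteristic("Label"), "1")],
)
organism_obs = Observation(
    entity=Entity("Organism"),
    measurements=[Measurement(Characteristic("Height"), 12.2)],
    context=plot_obs,
)

print(organism_obs.context.entity.name)
```

Keeping context as a link between observations, instead of a column naming convention, is what lets software answer questions such as which plot a height measurement belongs to.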

18 Semantic annotation
Relational data lacks critical semantic information
  – no way for computer to determine that “Ht.” represents a “height” measurement
  – no way for computer to determine if Plot is nested within Site or vice-versa
  – no way for computer to determine if the Temp applies to Site or Plot or Species
Mapping between data and the ontology via semantic annotation
[Diagram: Data set linked to Observation Ontology]
(slide from J. Madin)

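19 [Annotated data table]

Date   Site       Plot  Species  Height
10/12  Hendricks  1     AHYA     12.2
10/12  Hendricks  1     AHYA     11.0
10/12  Hendricks  1     AHYA     9.7
…      …          …     …        …

[Diagram: columns are annotated with an Entity (Time, Space, Area, Organism) and a Characteristic (Date, Location, Name, Height, TaxonomicName, Label); entities are linked by hasContext]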

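20 [Annotated data table]

Tree  Plot  Species  Count
A     1     AHYA     3
A     2     AHYA     2
A     3     AHYA     8
…     …     …        …

[Diagram: columns are annotated with an Entity (Organism, Space, Organism, Area) and a Characteristic (Label, Abundance, TaxonomicName, Replicate); entities are linked by hasContext; panel labels A, B, C]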

21 Observation ontology slide from J. Madin Extension points

22 Observation
A high-level assertion that a thing was observed

23 Entity
All things (concrete and conceptual) that are observable

24 Entity extension
An extension point for domain-specific terms

25 Context
Asserts a “containment” relationship between entities

26 Context
Context is transitive
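Transitivity means that if an organism is observed within a plot and the plot lies within a site, the organism is also within the site. A minimal sketch that derives all implied contexts from the directly asserted ones; the `DIRECT_CONTEXT` mapping and the entity names are hypothetical.

```python
# Directly asserted containment: each entity maps to the entities it is within.
DIRECT_CONTEXT = {
    "Organism": ["Plot"],
    "Plot": ["Site"],
}


def all_contexts(direct, entity):
    """Return every entity that contains `entity`, directly or transitively."""
    found = []
    pending = list(direct.get(entity, []))
    while pending:
        current = pending.pop()
        if current in found:
            continue  # already visited; also guards against cycles
        found.append(current)
        pending.extend(direct.get(current, []))
    return found


contexts = all_contexts(DIRECT_CONTEXT, "Organism")
print(contexts)
```

Only the two direct links are stored; the Organism-within-Site relationship is inferred, which is what lets a query at the site level pick up measurements recorded at the organism level.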

27 Measurement
Observations are composed of measurements, which relate measurable characteristics to the entity being observed

28 Characteristic

29 Summary
EML captures critical metadata
OBOE adds critical semantic descriptions
Data discovery and integration tools can be built that leverage metadata and ontologies
Metadata and ontologies permit:
  – Loosely-coupled systems
  – Schema independence in data systems
  – Semantic data integration
  – Capturing data as collected, rather than only as derived products

30 Vegetation Schema Questions
Vegetation schema
  – Exchange standard or federation?
Can we accommodate all data that is collected in vegetation plots?
  – or just a transformed subset
XML? RDF? OWL? other?
Should a vegetation schema link to other evolving community standards?
  – EML?
  – OBOE?

31 Questions?
http://www.nceas.ucsb.edu/ecoinformatics/
http://knb.ecoinformatics.org/
http://seek.ecoinformatics.org/
http://kepler-project.org/

32 Acknowledgements
Knowledge Representation Working Group
  – Mark Schildhauer, Matt Jones (NCEAS)
  – Shawn Bowers, Bertram Ludaescher, Dave Thau (UCD)
  – Deana Pennington (UNM)
  – Serguei Krivov, Ferdinando Villa (UVM)
  – Corinna Gries, Peter McCartney (ASU)
  – Rich Williams (Microsoft)

33 Acknowledgments
This material is based upon work supported by:
  – The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.
  – The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
  – The Andrew W. Mellon Foundation.
Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence

