1 Future Database Needs SC 32 Study Period February 5, 2007 Bruce Bargmeyer, Lawrence Berkeley National Laboratory University of California Tel: +1 510-495-2905.


1 1 Future Database Needs SC 32 Study Period February 5, 2007 Bruce Bargmeyer, Lawrence Berkeley National Laboratory, University of California Tel: +1 510-495-2905 bebargmeyer@lbl.gov JTC1 SC32N1633

2 2 Topics
- Study period purpose
- New challenges
- A brief tutorial on semantics and semantic computing
- Where XMDR fits
  - Semantic computing technologies
  - Traditional data administration
- Some limitations of current relational technologies
- Some input from other sources

3 Future Database Needs Study Period
- A one-year study period to identify and understand case studies related to this area.
- Bring together a small group of experts in a meeting on “Case Studies on new Database Standards Requirements”.
- The workshop would provide input to existing SC32 projects and may provide background material for new proposals for upgrades or for new work within SC32 in time for the 2007 SC32 Plenary.
Source: Document 32N1451

4 4 The Internet Revolution A world wide web of diverse content: The information glut is nothing new. The access to it is astonishing.

5 5 Challenge: Find and process non-explicit data
For example: patient data on drugs contains brand names (e.g., Tylenol, Anacin-3, Datril, …); however, we want to study patients taking analgesic agents.
Figure labels: Analgesic Agent, Non-Narcotic Analgesic, Acetaminophen, Nonsteroidal Antiinflammatory Drug, Analgesic and Antipyretic, Datril, Anacin-3, Tylenol.

6 6 Challenge: Specify and compute across relations, e.g., within a food web in an Arctic ecosystem
An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)

7 7 Challenge: Combine Data, Metadata & Concept Systems

Data:
ID   Date       Temp   Hg
A    06-09-13   4.4    4
B    06-09-13   9.3    2
X    06-09-13   6.7    78

Metadata:
Name   Datatype   Definition                      Units
ID     text       Monitoring Station Identifier   not applicable
Date   date       Date                            yy-mm-dd
Temp   number     Temperature (to 0.1 degree C)   degrees Celsius
Hg     number     Mercury contamination           micrograms per liter

Concept system: Contamination has the narrower concepts Biological, Chemical, and Radioactive; Chemical has the narrower concepts lead, cadmium, and mercury.

Inference search query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003”

8 8 Challenge: Use data from systems that record the same facts with different terms
- Reduce the human toil of drawing information together and performing analysis -> shift to computer processing.

9 9 Challenge: Use data from systems that record the same facts with different terms
Figure: registries and repositories whose coverage overlaps in common content: OASIS/ebXML Registries, ISO 11179 Registries, Ontological Registries, CASE Tool Repositories, UDDI Registries, Software Component Registries, Database Catalogs, Dublin Core Registries.
The same item (example: Country Identifier) appears under different terms: Data Element, XML Tag, Term Hierarchy, Attribute, Business Specification, Table Column, Business Object.

10 10 Same Fact, Different Terms

Data Element Concept: Name: Country Identifiers; Context; Definition; Unique ID: 5769; Conceptual Domain; Maintenance Org.; Steward; Classification; Registration Authority; Others

Data Elements (attributes shown: Name; Context; Definition; Unique ID: 4572; Value Domain; Maintenance Org.; Steward; Classification; Registration Authority; Others):

ISO 3166 English Name   ISO 3166 French Name   2-Alpha Code   3-Alpha Code   3-Numeric Code
Algeria                 L`Algérie              DZ             DZA            012
Belgium                 Belgique               BE             BEL            056
China                   Chine                  CN             CHN            156
Denmark                 Danemark               DK             DNK            208
Egypt                   Egypte                 EG             EGY            818
France                  La France              FR             FRA            250
...                     ...                    ...            ...            ...
Zimbabwe                Zimbabwe               ZW             ZWE            716

11 11 Challenge: Draw information together from a broad range of studies, databases, reports, etc.

12 12 Challenge: Gain Common Understanding of meaning between Data Creators and Data Users
Figure: data creators and users (EEA, USGS, DoD, EPA, others) exchange text and data through information systems. Each system labels the same topics with its own terms (environ, agriculture, climate, human health, industry, tourism, soil, water, air, or Spanish equivalents such as ambiente, agricultura, industria, turismo, tierra, agua) and its own numeric codes.
Goal: a common interpretation of what the data represents.

13 13 Challenge: Drawing Together Dispersed Data
Figure: data creators and users (EEA, USGS, DoD, EPA, others) exchange text and data through information systems. Each system labels the same topics with its own terms (environ, agriculture, climate, human health, industry, tourism, soil, water, air, or Spanish equivalents such as ambiente, agricultura, industria, turismo, tierra, agua) and its own numeric codes.
Goal: a common interpretation of what the data represents.

14 14 Semantic Computing
- We are laying the foundation to make a quantum leap toward a substantially new way of computing: Semantic Computing
- How can we make use of semantic computing?
- What do organizations need to do to prepare for and stimulate semantic computing?

15 15 Coming: A Semantic Revolution
- Searching and ranking
- Pattern analysis
- Knowledge discovery
- Question answering
- Reasoning
- Semi-automated decision making

16 16 The Nub of It
- Processing that takes “meaning” into account
- Processing based on the relations between things, not just computing about the things themselves
- Computing that takes people out of the processing, reducing the human toil
  - Data access, extraction, mapping, translation, formatting, validation, inferencing, …
- Delivering higher-level results that are more helpful for the user’s thought and action

17 17 Semantics Challenges
- Managing, harmonizing, and vetting semantics is essential to enable enterprise semantic computing
- Managing, harmonizing, and vetting semantics is important for traditional data management
  - In the past we just covered the basics
- Enabling “community intelligence” through efforts similar to Wikipedia, Wiktionary, Flickr

18 18 A Brief Tutorial on Semantics
- What is meaning?
- What are concepts?
- What are relations?
- What are concept systems?
- What is “reasoning”?

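19 19 Meaning: The Semiotic Triangle
Figure: a triangle with corners Thought or Reference (Concept), Symbol, and Referent. The symbol symbolises the thought or reference, the thought or reference refers to the referent, and the symbol stands for the referent. Example labels: “Rose”, “ClipArt”.
Source: C. K. Ogden and I. A. Richards, The Meaning of Meaning.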

20 20 Semiotic Triangle: Concepts, Definitions and Signs
Figure labels: CONCEPT, Definition, Sign, Referent, “Rose”, “ClipArt”; edges: Symbolizes, Refers To, Stands For.

21 21 Semiotic Triangle: Concepts, Definitions, Signs, & Designations
Figure labels: CONCEPT, Definition, Sign, Designation, Referent, “Rose”, “ClipArt”; edges: Symbolizes, Refers To, Stands For.

22 22 Forms of Definitions
Figure labels: CONCEPT, Definition, Sign, Referent, “Rose”, “ClipArt”; edges: Symbolizes, Refers To, Stands For.
Definition: define by
- Essence & Differentia
- Relations
- Axioms

23 23 Definition of Concept - Rose: Dictionary - Essence & Differentia
1. any of the wild or cultivated, usually prickly-stemmed, pinnate-leaved, showy-flowered shrubs of the genus Rosa. Cf. rose family.
2. any of various related or similar plants.
3. the flower of any such shrub, of a red, pink, white, or yellow color.
Source: Random House Webster’s Unabridged Dictionary (2003)

24 24 Definitions in the EPA Environmental Data Registry
Mailing Address: http://www.epa/gov/edr/sw/AdministeredItem#MailingAddress
  The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box
State USPS Code: http://www.epa/gov/edr/sw/AdministeredItem#StateUSPSCode
  The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada
Mailing Address State Name: http://www.epa/gov/edr/sw/AdministeredItem#StateName
  The name of the state where mail is delivered

25 25 Definition of Concept - Rose: Relations to Other Concepts
Figure labels: CONCEPT, Referent, “Rose”, “ClipArt”; edges: Refers To, Symbolizes, Stands For; related concepts: Love, Romance, Marriage.

26 26 SNOMED – Terms Defined by Relations

27 27 Definition of Concept - Rose: Defined by Axioms in OWL
Figure labels: CONCEPT, Referent, “Rose”, “ClipArt”; edges: Refers To, Symbolizes, Stands For; axioms: rdfs:subClassOf, owl:equivalentClass, owl:disjointWith.

28 28 Class Axiom (Definitions): Class Description is Building Block of Class Axiom
- A class description is the term used in this document (and in the OWL Semantics and Abstract Syntax) for the basic building blocks of class axioms (informally called class definitions in the Overview and Guide documents). A class description describes an OWL class, either by a class name or by specifying the class extension of an unnamed anonymous class.
- OWL distinguishes six types of class descriptions:
  1. a class identifier (a URI reference)
  2. an exhaustive enumeration of individuals that together form the instances of a class
  3. a property restriction
  4. the intersection of two or more class descriptions
  5. the union of two or more class descriptions
  6. the complement of a class description
- The first type is special in the sense that it describes a class through a class name (syntactically represented as a URI reference). The other five types of class descriptions describe an anonymous class by placing constraints on the class extension.
- Class descriptions of types 2-6 describe, respectively, a class that contains exactly the enumerated individuals (2nd type), a class of all individuals which satisfy a particular property restriction (3rd type), or a class that satisfies boolean combinations of class descriptions (4th, 5th and 6th type). Intersection, union and complement can be respectively seen as the logical AND, OR and NOT operators. The four latter types of class descriptions lead to nested class descriptions and can thus in theory lead to arbitrarily complex class descriptions. In practice, the level of nesting is usually limited.
Source: W3C OWL Web Ontology Language Reference.

29 29 Class Descriptions -> Class Axiom
- Class descriptions form the building blocks for defining classes through class axioms. The simplest form of a class axiom is a class description of type 1. It just states the existence of a class, using owl:Class with a class identifier.
- For example, the following class axiom declares the URI reference #Human to be the name of an OWL class: <owl:Class rdf:ID="Human"/>
  - This is correct OWL, but does not tell us very much about the class Human. Class axioms typically contain additional components that state necessary and/or sufficient characteristics of a class. OWL contains three language constructs for combining class descriptions into class axioms:
- rdfs:subClassOf allows one to say that the class extension of a class description is a subset of the class extension of another class description.
- owl:equivalentClass allows one to say that a class description has exactly the same class extension as another class description.
- owl:disjointWith allows one to say that the class extension of a class description has no members in common with the class extension of another class description.

30 30 Computable Meaning
Figure labels: CONCEPT, Referent, “Rose”, “ClipArt”; edges: Refers To, Symbolizes, Stands For; axioms: rdfs:subClassOf, owl:equivalentClass, owl:disjointWith.
If “rose” is owl:disjointWith “daffodil”, then a computer can determine that an assertion (e.g., in a knowledge base) is invalid if it states that a rose is also a daffodil.
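The disjointness check described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not an OWL reasoner: the class names, the individuals flower1 and flower2, and the function find_violations are all hypothetical, and the knowledge base is just a list of (individual, class) pairs.

```python
# Hypothetical mini knowledge base; names are illustrative, not from any real ontology.
# Each frozenset is one owl:disjointWith-style axiom between two classes.
DISJOINT = {frozenset({"Rose", "Daffodil"})}


def find_violations(assertions):
    """Return (individual, classes) pairs where an individual is asserted
    to belong to two classes that were declared disjoint."""
    classes_of = {}
    for individual, cls in assertions:
        classes_of.setdefault(individual, set()).add(cls)
    violations = []
    for individual, classes in sorted(classes_of.items()):
        for pair in DISJOINT:
            if pair <= classes:  # both disjoint classes asserted for one individual
                violations.append((individual, tuple(sorted(pair))))
    return violations


assertions = [("flower1", "Rose"), ("flower1", "Daffodil"), ("flower2", "Rose")]
print(find_violations(assertions))
```

Only flower1 is reported, because it is the only individual asserted to be both a Rose and a Daffodil.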

31 31 What are Relations?
Figure: Merced Lake, Fletcher Creek, and Merced River are each linked to WaterBody by the relation isA.
Concepts and relations can be represented as nodes and edges in formal graph structures, e.g., “is-a” hierarchies.

32 32 Concept Systems have Nodes and may have Relations
Figure: a graph with nodes labeled A, a, b, c, d, 1, and 2. Nodes represent concepts; lines (arcs) represent relations.
Concept systems are concepts and the relations between them. Concept systems can be represented and queried as graphs.

33 33 A More Complex Concept Graph
Figure: concept lattice of inland water features. Attribute nodes: Linear, Large, Non-linear, Large linear, Small linear, Small non-linear, Deep, Natural, Artificial, Flowing, Shallow, Stagnant. Feature nodes: River, Stream, Canal, Reservoir, Lake, Marsh, Pond.
Source: Paulo Santos and Brandon Bennett, “Supervaluation Semantics for an Inland Water Feature Ontology”, http://ijcai.org/papers/1187.pdf#search=%22terminology%20water%20ontology%22

34 34 Types of Concept System Graph Structures
- Trees
- Partially Ordered Trees
- Ordered Trees
- Faceted Classifications
- Directed Acyclic Graphs
- Partially Ordered Graphs
- Lattices
- Bipartite Graphs
- Directed Graphs
- Cliques
- Compound Graphs

35 35 Types of Concept System Graph Structures
Figure: example diagrams labeled Tree, Partial Order Tree, Ordered Tree, Faceted Classification, Directed Acyclic Graph, Partial Order Graph, Powerset of 3 element set, Bipartite Graph, Clique, Compound Graph.

36 36 Graph Taxonomy Directed Graph Directed Acyclic Graph Graph Undirected Graph Bipartite Graph Partial Order Graph Faceted Classification Clique Partial Order Tree Tree Lattice Ordered Tree Note: not all bipartite graphs are undirected.

37 37 What Kind of Relations are There? Lots!
Relationship class: A particular type of connection existing between people related to or having dealings with each other.
- acquaintanceOf - A person having more than slight or superficial knowledge of this person but short of friendship.
- ambivalentOf - A person towards whom this person has mixed feelings or emotions.
- ancestorOf - A person who is a descendant of this person.
- antagonistOf - A person who opposes and contends against this person.
- apprenticeTo - A person to whom this person serves as a trusted counselor or teacher.
- childOf - A person who was given birth to or nurtured and raised by this person.
- closeFriendOf - A person who shares a close mutual friendship with this person.
- collaboratesWith - A person who works towards a common goal with this person.
- …

38 38 Example of relations in a food web in an Arctic ecosystem
An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)

39 39 Ontologies are a type of Concept System
- Ontology: explicit formal specifications of the terms in the domain and relations among them (Gruber 1993)
- An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them.
- Why would someone want to develop an ontology? Some of the reasons are:
  - To share common understanding of the structure of information among people or software agents
  - To enable reuse of domain knowledge
  - To make domain assumptions explicit
  - To separate domain knowledge from the operational knowledge
  - To analyze domain knowledge
Source: http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology101-noy-mcguinness.html

40 40 What is Reasoning? Inference
Figure: Polio and Smallpox is-a Infectious Disease; Diabetes and Heart disease is-a Chronic Disease; Infectious Disease and Chronic Disease is-a Disease. A separate arrow style signifies an inferred is-a relationship.

41 41 Reasoning: Taxonomies & partonomies can be used to support inference queries
Figure: Oakland and Berkeley are part-of Alameda County; Santa Clara and San Jose are part-of Santa Clara County; the counties are part-of California.
E.g., if a database contains information on events by city, we could query that database for events that happened in a particular county or state, even though the event data does not contain explicit state or county codes.
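A minimal Python sketch of this kind of partonomy query follows. The part-of edges come from the slide; the event records and the function names is_part_of and events_in are made up for illustration, and part-of is treated as transitive.

```python
# Part-of edges as shown on the slide (child -> parent).
PART_OF = {
    "Oakland": "Alameda County",
    "Berkeley": "Alameda County",
    "Santa Clara": "Santa Clara County",
    "San Jose": "Santa Clara County",
    "Alameda County": "California",
    "Santa Clara County": "California",
}


def is_part_of(place, region):
    """Follow part-of edges upward; True if region is reached (transitivity)."""
    while place in PART_OF:
        place = PART_OF[place]
        if place == region:
            return True
    return False


def events_in(region, events):
    """Events whose city is the region itself or lies inside it."""
    return [name for name, city in events if city == region or is_part_of(city, region)]


# Hypothetical event data keyed by city only: no county or state codes stored.
events = [("spill", "Oakland"), ("fire", "San Jose"), ("flood", "Berkeley")]
print(events_in("Alameda County", events))
print(events_in("California", events))
```

The event table never mentions a county or state; those answers come entirely from walking the part-of edges.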

42 42 Reasoning: Relationship metadata can be used to infer non-explicit data
Figure labels: Analgesic Agent, Non-Narcotic Analgesic, Acetaminophen, Nonsteroidal Antiinflammatory Drug, Analgesic and Antipyretic, Datril, Anacin-3, Tylenol.
For example:
(1) patient data on drugs currently being taken contains brand names (e.g., Tylenol, Anacin-3, Datril, …);
(2) a concept system connects different drug types and names with one another (via is-a, part-of, etc. relationships);
(3) so patient data can be linked and searched by inferred terms like “acetaminophen” and “analgesic” as well as by the trade names explicitly stored as text strings in the database.
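The three steps above can be sketched as follows. The node names come from the slide, but the arrangement of the is-a edges is an assumption about the figure, and the patient records and function names are hypothetical.

```python
# Assumed is-a edges (child -> parent); the exact shape of the figure is a guess.
IS_A = {
    "Tylenol": "Acetaminophen",
    "Anacin-3": "Acetaminophen",
    "Datril": "Acetaminophen",
    "Acetaminophen": "Analgesic and Antipyretic",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Non-Narcotic Analgesic": "Analgesic Agent",
}


def ancestors(term):
    """All broader concepts reachable from term through is-a edges."""
    result = []
    while term in IS_A:
        term = IS_A[term]
        result.append(term)
    return result


def patients_taking(concept, records):
    """Patients whose recorded drug is the concept or a narrower concept."""
    return sorted({p for p, drug in records if drug == concept or concept in ancestors(drug)})


# Hypothetical patient records that store brand names only.
records = [("p1", "Tylenol"), ("p2", "Datril"), ("p3", "Lipitor")]
print(patients_taking("Analgesic Agent", records))
```

Patient p3 is not returned because the recorded brand has no path to Analgesic Agent in this small concept system.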

43 43 Reasoning: Least Common Ancestor Query
Figure labels: Analgesic Agent, Non-Narcotic Analgesic, Analgesic and Antipyretic, Acetaminophen, Nonsteroidal Antiinflammatory Drug, Opioid, Opiate, Morphine Sulfate, Codeine Phosphate.
What is the least common ancestor concept in the NCI Thesaurus for Acetaminophen and Morphine Sulfate? (answer = Analgesic Agent)
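A least common ancestor query over a single-parent hierarchy can be sketched like this. The parent links are an assumed reading of the figure (they are not copied from the NCI Thesaurus itself), and the function names are illustrative.

```python
# Assumed parent links (child -> parent) for the concepts named on the slide.
PARENT = {
    "Acetaminophen": "Analgesic and Antipyretic",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Non-Narcotic Analgesic": "Analgesic Agent",
    "Morphine Sulfate": "Opiate",
    "Codeine Phosphate": "Opiate",
    "Opiate": "Opioid",
    "Opioid": "Analgesic Agent",
}


def path_to_root(term):
    """The term followed by each of its ancestors, nearest first."""
    path = [term]
    while term in PARENT:
        term = PARENT[term]
        path.append(term)
    return path


def least_common_ancestor(a, b):
    """First concept on b's upward path that also lies on a's upward path."""
    seen = set(path_to_root(a))
    for node in path_to_root(b):
        if node in seen:
            return node
    return None


print(least_common_ancestor("Acetaminophen", "Morphine Sulfate"))
```

Under these assumed links the query returns Analgesic Agent, matching the answer given on the slide. A concept system with multiple parents per concept would need a graph traversal instead of a single upward path.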

44 44 Reasoning: Example “sibling” queries: concepts that share a common ancestor
- Environmental
  - "Siblings" of Wetland (in NASA SWEET ontology)
- Health
  - Siblings of ERK1 finds all 700+ other kinase enzymes
  - Siblings of Novastatin finds all other statins
- 11179 Metadata
  - Sibling values in an enumerated value domain
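As a small sketch of the last case, sibling values in an enumerated value domain are simply the other permissible values of the same domain. The two domains below reuse the ISO 3166 codes shown elsewhere in this deck; the dictionary layout and the function sibling_values are assumptions for illustration.

```python
# Enumerated value domains, using the ISO 3166 sample codes from this deck.
VALUE_DOMAINS = {
    "ISO 3166 2-Alpha Code": ["DZ", "BE", "CN", "DK", "EG", "FR", "ZW"],
    "ISO 3166 3-Alpha Code": ["DZA", "BEL", "CHN", "DNK", "EGY", "FRA", "ZWE"],
}


def sibling_values(value):
    """Return the domain containing value and the other values in that domain."""
    for domain, values in VALUE_DOMAINS.items():
        if value in values:
            return domain, [v for v in values if v != value]
    return None, []


print(sibling_values("DK"))
```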

45 45 Reasoning: More complex “sibling” queries: concepts with multiple ancestors
- Health
  - Find all the siblings of Breast Neoplasm
- Environmental
  - Find all chemicals that are a
    - carcinogen (cause cancer) and
    - toxin (are poisonous) and
    - teratogenic (cause birth defects)
Figure labels: site neoplasms, breast disorders, Breast neoplasm, Respiratory System neoplasm, Non-Neoplastic Breast Disorder, Eye neoplasm.
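When a concept has more than one parent, a sibling is any concept that shares at least one of those parents. The sketch below assumes one plausible reading of the figure (Breast neoplasm sits under both site neoplasms and breast disorders); the parent sets and the function siblings are illustrative.

```python
# Assumed parent sets; a concept may have several parents (a DAG, not a tree).
PARENTS = {
    "Breast neoplasm": {"site neoplasms", "breast disorders"},
    "Respiratory System neoplasm": {"site neoplasms"},
    "Eye neoplasm": {"site neoplasms"},
    "Non-Neoplastic Breast Disorder": {"breast disorders"},
}


def siblings(concept):
    """Concepts sharing at least one parent with the given concept."""
    own = PARENTS.get(concept, set())
    return sorted(c for c, ps in PARENTS.items() if c != concept and ps & own)


print(siblings("Breast neoplasm"))
print(siblings("Eye neoplasm"))
```

Breast neoplasm picks up siblings through both parents, while Eye neoplasm only sees the concepts under site neoplasms.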

46 46 End of Tutorial about concept systems What are the “Database Language” challenges?

47 Samples of Eco & Bio Graph Data
- Partonomies are used in biological settings most often to represent common topological relationships of gross anatomy in multi-cellular organisms. They are also useful in sub-cellular anatomy, and possibly in describing protein complexes. They consist of part-of relationships (in contrast to the is-a relationships of taxonomies). Part-of relationships are represented by directed edges and are transitive. Partonomies are directed acyclic graphs.
- Data provenance relationships are used to record the source and derivation of data. Here, some nodes represent either individual "facts" or "datasets" and other nodes represent "data sources" (either labs or individuals). Edges between "datasets" and "data sources" indicate "contributed by". Other edges, between datasets (or facts), indicate "derived from" (e.g., via inference or computation). Data provenance graphs are usually directed acyclic graphs.
- Taxonomies of proteins, chemical compounds, organisms, etc. These taxonomies (classification systems) are usually represented as directed acyclic graphs (partial orders or lattices). They are used when querying the pathways databases. Common queries are subsumption tests between two terms or concepts, i.e., is one concept a subset or instance of another? Note that some phylogenetic tree computations generate unrooted, i.e., undirected, trees.
- Metabolic pathways: chemical reactions used for energy production, synthesis of proteins, carbohydrates, etc. Note that these graphs are usually cyclic.
- Signaling pathways: chemical reactions for information transmission and processing. Often these reactions involve small numbers of molecules. Graph structure is similar to metabolic pathways.

48 Eco & Bio Graph Data (Continued)
- Contact graphs are used to indicate that two atoms in a protein are within a small (fixed) distance of one another (typically 3-5 Angstroms). Contact graphs are used to represent 3D protein structures. These are undirected, possibly cyclic, graphs. "Alignments" of contact graphs (of protein structures) are used as a measure of the similarity of 3D protein structures.
- Gene clusterings derived from microarray data are conventionally represented as rooted trees (i.e., directed edges).
- Bibliographic citations can be represented by means of directed graphs, in which the edges represent cite relationships. While one would expect these graphs to be acyclic, they actually do contain cycles (due to the availability of preprints, self-citation, etc.).
- Hypertext, e.g., the World Wide Web, includes links which can be modeled as directed edges. While the actual WWW is distributed among millions of sites, various web indexing systems (Inktomi, Google, etc.) crawl the web and assemble single-site databases of these links to support retrieval. Systems such as Google utilize the link structure (i.e., the graph structure) to help identify relevant web pages. Such systems are of direct interest for biomedical applications, e.g., data mining the web. Note that these databases are often very large.
- Gene regulatory networks: genes, gene regulatory sequences, and signaling proteins, which control the activation or suppression of gene expression in the organism. Note that these graphs are usually cyclic.
- Protein interaction networks: typically, undirected graphs are used to record pairs of proteins which are experimentally observed to interact with each other. Protein complexes can be represented as partonomies.

49 49 Metadata Registries & Database Technologies – Which Does What?
Traditional Data Registries (11179 Edition 2)
- Register metadata which describes data in databases, applications, XML Schemas, data models, flat files, paper
- Assist in harmonizing, standardizing, and vetting metadata
- Assist data engineering
- Provide a source of well-formed data designs for system designers
- Record reporting requirements
- Assist data generation, by describing the meaning of data entry fields and the potential valid values
- Register provenance information that can be provided to end users of data
- Assist with information discovery by pointing to systems where particular data is maintained.

50 50 Traditional MDR: Manage Code Sets
Figure: the ISO 3166 country code sets (English Name, French Name, 2-Alpha Code, 3-Alpha Code, 3-Numeric Code; e.g., Denmark, Danemark, DK, DNK, 208) registered as Data Elements (Unique ID: 4572, with Name, Context, Definition, Value Domain, Maintenance Org., Steward, Classification, Registration Authority, Others) under the Data Element Concept “Country Identifiers” (Unique ID: 5769, with Conceptual Domain).

51 51 What Can XMDR Do?
Support a new generation of semantic computing
- Concept system management
- Harmonizing and vetting concept systems
- Linkage of concept systems to data
- Interrelation of multiple concept systems
- Grounding ontologies and RDF in agreed upon semantics
- Reasoning across XMDR content (concept systems and metadata)
- Provision of Semantic Services

52 52 We are trying to manage semantics in an increasingly complex content space
- Structured data
- Semi-structured data
- Unstructured data
- Text
- Pictographic
- Graphics
- Multimedia
- Voice
- Video

53 53 Case Study
- Combining Concept Systems, Data, and Metadata to answer queries.

54 54 Linking Concepts: Text Document
Title 40--Protection of Environment
CHAPTER I--ENVIRONMENTAL PROTECTION AGENCY
PART 141--NATIONAL PRIMARY DRINKING WATER REGULATIONS
§ 141.62, 40 CFR Ch. I (7–1–02 Edition)

§ 141.62 Maximum contaminant levels for inorganic contaminants.
(a) [Reserved]
(b) The maximum contaminant levels for inorganic contaminants specified in paragraphs (b)(2)–(6), (b)(10), and (b)(11)–(16) of this section apply to community water systems and non-transient, non-community water systems. The maximum contaminant level specified in paragraph (b)(1) of this section only applies to community water systems. The maximum contaminant levels specified in (b)(7), (b)(8), and (b)(9) of this section apply to community water systems; non-transient, non-community water systems; and transient non-community water systems.

Contaminant      MCL (mg/l)
(1) Fluoride     4.0
(2) Asbestos     7 Million Fibers/liter (longer than 10 μm)
(3) Barium       2
(4) Cadmium      0.005
(5) Chromium     0.1
(6) Mercury      0.002
(7) Nitrate      10 (as Nitrogen)

55 55 Thesaurus Concept System (From GEMET)
Term: Chemical Contamination
Definition: The addition or presence of chemicals to, or in, another substance to such a degree as to render it unfit for its intended purpose.
Broader Term: contamination
Narrower Terms: cadmium contamination, lead contamination, mercury contamination
Related Terms: chemical pollutant, chemical pollution
Deutsch: Chemische Verunreinigung
English (US): chemical contamination
Español: contaminación química
Source: General Multi-Lingual Environmental Thesaurus (GEMET)

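56 56 Concept System (Thesaurus)
Figure: Contamination has the narrower concepts Biological, Chemical, and Radioactive; Chemical has the narrower concepts cadmium, lead, and mercury, and the related terms chemical pollutant and chemical pollution.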

57 57 Chemicals in EPA Environmental Data Registry
Screenshot of the Environmental Data Registry. Values shown per field:
Name: Acalypha ostryifolia; Mercury; Mercury, bis(acetato-.kappa.O)(benzenamine)-; Mercury, (acetato-.kappa.O)phenyl-, mixt. with phenylmercuric propionate
Type: Biological Organism; Chemical
CAS Number: 7439-97-6; 63549-47-3; No CAS Number
TSN: 28189
ICTV: (no value shown)
EPA ID: E17113275; E965269

58 58 Data

Monitoring Stations:
Name   Latitude   Longitude   Location
A      41.45 N    125.99 W    Merced Lake
B      43.23 N    120.50 W    Merced River
X      39.45 N    118.12 W    Fletcher Creek

Measurements:
ID   Date         Temp   Hg
A    2006-09-13   4.4    4
B    2006-09-13   9.3    2
X    2006-09-15   5.2    3
X    2006-09-13   6.7    78

Map: stations A, B, and X located on Merced Lake, Merced River, and Fletcher Creek.

59 59 Metadata

System                Data Element   Definition                         Units                   Precision
Measurements          ID             Monitoring Station Identifier      not applicable
Measurements          Date           Date sample was collected          not applicable
Measurements          Temp           Temperature                        degrees Celsius         0.1
Measurements          Hg             Mercury contamination              micrograms per liter    0.004
Monitoring Stations   Name           Monitoring Station Identifier
Monitoring Stations   Latitude       Latitude where sample was taken
Monitoring Stations   Longitude      Longitude where sample was taken
Monitoring Stations   Location       Body of water monitored
Contaminants          Contaminant    Name of contaminant
Contaminants          Threshold      Acceptable threshold value

Contaminants:
Contaminant   Threshold
mercury       5
lead          42?
cadmium       250?

60 60 Relations among Inland Bodies of Water
Figure: two graphs over the nodes Fletcher Creek, Merced Lake, and Merced River. One uses only the relation “feeds into”; the other uses both “feeds into” and its inverse “fed from”.

61 61 Combining Data, Metadata & Concept Systems

Data:
ID   Date       Temp   Hg
A    06-09-13   4.4    4
B    06-09-13   9.3    2
X    06-09-13   6.7    78

Metadata:
Name   Datatype   Definition                      Units
ID     text       Monitoring Station Identifier   not applicable
Date   date       Date                            yy-mm-dd
Temp   number     Temperature (to 0.1 degree C)   degrees Celsius
Hg     number     Mercury contamination           micrograms per liter

Concept system: Contamination has the narrower concepts Biological, Chemical, and Radioactive; Chemical has the narrower concepts lead, cadmium, and mercury.

Inference search query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 2 parts per billion between December 2001 and March 2003”
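A rough Python sketch of how such a query could combine the three sources is given below. Several things are assumptions made only for illustration: the flow direction (Fletcher Creek feeds Merced Lake, which feeds Merced River), the split of the fused measurement values into Temp and Hg columns, the treatment of micrograms per liter as comparable to parts per billion, and a 2006 date window chosen to match the sample rows rather than the 2001-2003 window in the slide's query.

```python
from datetime import date

# Relations among water bodies (assumed direction: upstream -> downstream).
FEEDS_INTO = {"Fletcher Creek": "Merced Lake", "Merced Lake": "Merced River"}
# Data: station locations and measurements (value split is an assumed reading).
STATION_LOCATION = {"A": "Merced Lake", "B": "Merced River", "X": "Fletcher Creek"}
MEASUREMENTS = [
    {"ID": "A", "Date": date(2006, 9, 13), "Temp": 4.4, "Hg": 4.0},
    {"ID": "B", "Date": date(2006, 9, 13), "Temp": 9.3, "Hg": 2.0},
    {"ID": "X", "Date": date(2006, 9, 13), "Temp": 6.7, "Hg": 78.0},
]
# Metadata: which contaminant each measurement column records.
COLUMN_MEASURES = {"Hg": "mercury"}
# Concept system: narrower concepts of each contamination type.
NARROWER = {"chemical": {"lead", "cadmium", "mercury"}}


def downstream_of(body):
    """Water bodies reached by following feeds-into edges from body."""
    found = []
    while body in FEEDS_INTO:
        body = FEEDS_INTO[body]
        found.append(body)
    return found


def contaminated_downstream(source, concept, threshold, start, end):
    """Downstream bodies with any reading of the concept above threshold."""
    columns = [c for c, what in COLUMN_MEASURES.items() if what in NARROWER[concept]]
    targets = set(downstream_of(source))
    hits = set()
    for row in MEASUREMENTS:
        body = STATION_LOCATION[row["ID"]]
        if body in targets and start <= row["Date"] <= end:
            if any(row[c] > threshold for c in columns):
                hits.add(body)
    return sorted(hits)


print(contaminated_downstream("Fletcher Creek", "chemical", 2.0,
                              date(2006, 1, 1), date(2006, 12, 31)))
```

The query text never mentions mercury or the Hg column: the concept system maps "chemical" to mercury, the metadata maps mercury to the Hg column, and the relation graph supplies "downstream".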

62 62 Example – Environmental Text Corpus
- Idea: Develop an environmental research corpus that could attract R&D efforts. Include the reports and other material from over $1b of EPA-sponsored research.
  - Prepare the corpus and make it available
    - Research results from years of ORD R&D
  - Publish associated metadata and concept systems in XMDR
  - Use open source software for EPA testing

63 63 Information Extraction & Semantic Computing
Figure: a pipeline from an Extraction Engine (Segment, Classify, Associate, Normalize, Deduplicate) through analysis steps (Discover patterns, Select models, Fit parameters, Inference, Report results) to Actionable Information and Decision Support, with an 11179-3 (E3) XMDR alongside.

64 64 Extraction Engines
- Find concepts and relations between concepts in text, tables, data, audio, video, …
- Produce databases (relational tables, graph structures) and other output
- Functions:
  - Segment: find text snippets (boundaries important)
  - Classify: determine the database field for a text segment
  - Associate: determine which text segments belong together
  - Normalize: put information into standard form
  - Deduplicate: collapse redundant information
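The last two functions can be illustrated with a very small sketch. The synonym table and the function names normalize and deduplicate are hypothetical; a real extraction engine would draw its standard forms from registered code sets rather than a hard-coded dictionary.

```python
# Hypothetical mapping from variant mentions to a standard form.
SYNONYMS = {"hg": "mercury", "pb": "lead", "cd": "cadmium"}


def normalize(mention):
    """Put a mention into standard form: trim, lowercase, map known variants."""
    key = mention.strip().lower()
    return SYNONYMS.get(key, key)


def deduplicate(mentions):
    """Collapse mentions that normalize to the same form, keeping first-seen order."""
    seen = set()
    result = []
    for mention in mentions:
        standard = normalize(mention)
        if standard not in seen:
            seen.add(standard)
            result.append(standard)
    return result


print(deduplicate(["Hg", "mercury", " Mercury ", "Pb", "lead"]))
```

Five extracted mentions collapse to two standard terms, mercury and lead.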

65 65 Metadata Registries are Useful
Registered semantics
- For “training” extraction engines
- The “Normalize” function can make use of standard code sets that have mappings between representation forms.
- The “Classify” function can interact with pre-established concept systems.
Provenance
- High precision for proper nouns, less precision (e.g., 70%) for other concepts -> impacts downstream processing. Need to track precision.

66 66 Normalize – Need Registered and Mapped Concepts/Code Sets
Figure: the ISO 3166 country code sets (English Name, French Name, 2-Alpha Code, 3-Alpha Code, 3-Numeric Code; e.g., Denmark, Danemark, DK, DNK, 208) registered as Data Elements (Unique ID: 4572, with Name, Context, Definition, Value Domain, Maintenance Org., Steward, Classification, Registration Authority, Others) under the Data Element Concept “Country Identifiers” (Unique ID: 5769, with Conceptual Domain).

67 Challenge for Database Languages
- The extraction database can contain graphs with more than a billion nodes.
  - Types of queries that can be done
  - Query performance
  - Linkage of “extract database” concepts and relations to the same concepts and relations in traditional databases

68 68 Example – 11179-3 (E3) Support Semantic Web Applications
The address state code is “AB”. This can be expressed as a directed graph, e.g., an RDF statement:

         Address    State Code   AB
Graph:   Node       Edge         Node
RDF:     Subject    Predicate    Object

XMDR may be used to “ground” the semantics of an RDF statement.

69 69 Example: Grounding RDF nodes and relations: URIs Reference a Metadata Registry
@prefix dbA: “http:/www.epa.gov/databaseA”
@prefix ai: “http://www.epa.gov/edr/sw/AdministeredItem#”
Figure labels: nodes dbA:e0139, dbA:ma344, “AB”^^ai:StateCode; edges ai:MailingAddress, ai:StateUSPSCode.
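One way to picture grounding is to expand each prefixed predicate to its full URI and look that URI up in the registry. The sketch below assumes the figure links dbA:e0139 to dbA:ma344 via ai:MailingAddress and dbA:ma344 to the literal via ai:StateUSPSCode; it also assumes the registry is keyed by the ai: prefix URI. The prefix strings are copied verbatim from the slide, the definitions are those shown in the EPA Environmental Data Registry slides, and the function names are illustrative. No RDF library is used.

```python
# Prefix declarations copied verbatim from the slide.
PREFIXES = {
    "dbA": "http:/www.epa.gov/databaseA",
    "ai": "http://www.epa.gov/edr/sw/AdministeredItem#",
}

# Registry entries keyed by full URI (keying by the ai: prefix is an assumption).
REGISTRY = {
    PREFIXES["ai"] + "MailingAddress":
        "The exact address where a mail piece is intended to be delivered, "
        "including urban-style address, rural route, and PO Box",
    PREFIXES["ai"] + "StateUSPSCode":
        "The U.S. Postal Service (USPS) abbreviation that represents a state "
        "or state equivalent for the U.S. or Canada",
}

# Assumed arrangement of the figure as (subject, predicate, object) triples.
TRIPLES = [
    ("dbA:e0139", "ai:MailingAddress", "dbA:ma344"),
    ("dbA:ma344", "ai:StateUSPSCode", '"AB"^^ai:StateCode'),
]


def expand(qname):
    """Expand a prefixed name such as ai:StateUSPSCode to a full URI."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local


def grounded_definition(predicate):
    """Look up the registered definition behind a predicate, if any."""
    return REGISTRY.get(expand(predicate))


for subject, predicate, obj in TRIPLES:
    print(predicate, "->", grounded_definition(predicate))
```

Each edge in the graph now resolves to a registered, agreed-upon definition instead of a bare label.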

70 70 Definitions in the EPA Environmental Data Registry Mailing Address: http://www.epa.gov/edr/sw/AdministeredItem#MailingAddress The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box. State USPS Code: http://www.epa.gov/edr/sw/AdministeredItem#StateUSPSCode The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada. Mailing Address State Name: http://www.epa.gov/edr/sw/AdministeredItem#StateName The name of the state where mail is delivered.
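
The grounding idea on the last two slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the registry is a hypothetical in-memory dictionary rather than a real XMDR service, the helper names `expand` and `ground` are invented here, and the arrangement of the two triples is one reading of the slide's diagram.

```python
# Namespace from the slide's "ai" prefix declaration.
AI = "http://www.epa.gov/edr/sw/AdministeredItem#"
PREFIXES = {"ai": AI}

# Hypothetical stand-in for a metadata registry: URI -> registered definition.
REGISTRY = {
    AI + "MailingAddress": (
        "The exact address where a mail piece is intended to be delivered, "
        "including urban-style address, rural route, and PO Box"
    ),
    AI + "StateUSPSCode": (
        "The U.S. Postal Service (USPS) abbreviation that represents a state "
        "or state equivalent for the U.S. or Canada"
    ),
    AI + "StateName": "The name of the state where mail is delivered",
}

# Triples as (subject, predicate, object); this layout is an assumed reading
# of the slide's graph.
TRIPLES = [
    ("dbA:e0139", "ai:MailingAddress", "dbA:ma344"),
    ("dbA:ma344", "ai:StateUSPSCode", '"AB"^^ai:StateCode'),
]


def expand(name, prefixes):
    """Expand a prefixed name such as 'ai:StateName' into a full URI."""
    prefix, _, local = name.partition(":")
    return prefixes[prefix] + local


def ground(triple, prefixes, registry):
    """Return the registered definition of the triple's predicate, or None."""
    _, predicate, _ = triple
    return registry.get(expand(predicate, prefixes))


for triple in TRIPLES:
    print(triple[1], "->", ground(triple, PREFIXES, REGISTRY))
```

A predicate whose URI is not registered comes back as `None`, which is the case a registry-backed application would need to flag.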

71 71 Use data from systems that record the same facts with different terms F Avoid a combinatorial explosion of data content, description, and metadata arrangements for information access, exchange, and presentation.

72 72 Ontologies for Data Mapping. Diagram nodes: Concept, Geographic Area, Geographic Sub-Area, Country, Country Identifier, Country Name, Country Code, Short Name, ISO 3166 2-Character Code, ISO 3166 3-Character Code, Long Name, Distributor Country Name, Mailing Address Country Name, ISO 3166 3-Numeric Code, FIPS Code. Ontologies can help to capture and express semantics.

73 73 Example: Content Mapping Service F Collect data from many sources – files contain data that has the same facts represented by different terms. E.g., one system responds with Danemark, DK, another with DNK, another with 208; map all to Denmark. F XMDR could accept XML files with the data from different code sets and return a result mapped to a single code set.
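
A minimal sketch of such a mapping, assuming the registered code sets are available as simple rows. The country values come from the ISO 3166 excerpt shown earlier in this presentation; the function names and the in-memory index are illustrative assumptions, not an XMDR interface.

```python
# (English name, French name, 2-alpha, 3-alpha, 3-numeric)
COUNTRIES = [
    ("Algeria", "L'Algérie", "DZ", "DZA", "012"),
    ("Belgium", "Belgique", "BE", "BEL", "056"),
    ("China", "Chine", "CN", "CHN", "156"),
    ("Denmark", "Danemark", "DK", "DNK", "208"),
    ("Egypt", "Egypte", "EG", "EGY", "818"),
    ("France", "La France", "FR", "FRA", "250"),
    ("Zimbabwe", "Zimbabwe", "ZW", "ZWE", "716"),
]
FIELDS = ("english", "french", "alpha2", "alpha3", "numeric")


def build_index(rows):
    """Map every known representation (case-insensitive) to its full row."""
    index = {}
    for row in rows:
        for value in row:
            index[value.casefold()] = row
    return index


INDEX = build_index(COUNTRIES)


def normalize(value, target="english"):
    """Map a value from any registered code set to the target code set.

    Returns None when the value is not in any registered code set.
    """
    row = INDEX.get(str(value).strip().casefold())
    if row is None:
        return None
    return row[FIELDS.index(target)]


responses = ["Danemark", "DK", "DNK", "208"]
print([normalize(r) for r in responses])
```

The same index serves any target code set, for example `normalize("DK", target="numeric")`, so adding a code set means adding one column rather than one mapping per pair of code sets.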

74 74 Actions to Manage Enterprise Semantics F Define data, concepts, and relations F Harmonize and vet data and concept systems F Ground semantics for RDF, concept systems, ontologies F Provide semantics services

75 75 Challenge: Concept System Store. Diagram: Metadata Registry, Concept System, Thesaurus, Themes, Data Standards, Ontology, GEMET, Structured Metadata, Users. Concept systems: Keywords, Controlled Vocabularies, Thesauri, Taxonomies, Ontologies, Axiomatized Ontologies (essentially graphs: node-relation-node + axioms).

76 76 Challenge: Management of Concept Systems. Diagram: Metadata Registry, Concept System, Thesaurus, Themes, Data Standards, Ontology, GEMET, Structured Metadata, Users. Concept system: Registration, Harmonization, Standardization, Acceptance (vetting), Mapping (correspondences).

77 77 Challenge: Life Cycle Management. Diagram: Metadata Registry, Concept System, Thesaurus, Themes, Data Standards, Ontology, GEMET, Structured Metadata, Users. Life cycle management: Data and Concept systems (ontologies).

78 78 Challenge: Grounding Semantics. Diagram: Metadata Registry, Concept System, Thesaurus, Themes, Data Standards, Ontology, GEMET, Structured Metadata, Users, Registries. Semantic Web: RDF Triples: Subject (node URI), Verb (relation URI), Object (node URI); Ontologies.

79 Some Limitations of Relational Technologies & SQL F Limited graph computations u Weak graph query language F Limited object computations u Weak object query language F Inadequate linkage of metadata to data (underspecified “catalog”) u CASE tools also disable, rather than enable, data administration & semantics management 79

80 Limitations (Cont.) F Limited linkage of concept system (graphs) to data (relational, graph, object) 80
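
The kind of linkage meant here can be illustrated with the analgesic example from earlier in this presentation: the data records brand names, while the question is asked at the level of a broader concept. The sketch below is a hypothetical illustration only. The is-a hierarchy is an assumed reading of that slide, and the patient rows (including "Insulin") are invented sample data.

```python
# Concept system as child -> parent ("is-a") edges; an assumed hierarchy.
IS_A = {
    "Tylenol": "Acetaminophen",
    "Anacin-3": "Acetaminophen",
    "Datril": "Acetaminophen",
    "Acetaminophen": "Non-Narcotic Analgesic",
    "Non-Narcotic Analgesic": "Analgesic Agent",
}

# Relational-style data: (patient id, drug as recorded). Invented rows.
PATIENT_DRUGS = [
    ("p1", "Tylenol"),
    ("p2", "Datril"),
    ("p3", "Insulin"),
]


def is_a(term, concept, parents):
    """Walk up the is-a edges from term; True if concept is reached.

    Assumes the hierarchy has no cycles.
    """
    while term is not None:
        if term == concept:
            return True
        term = parents.get(term)
    return False


def patients_taking(concept, rows, parents):
    """Patients whose recorded drug falls under the given concept."""
    return sorted({pid for pid, drug in rows if is_a(drug, concept, parents)})


print(patients_taking("Analgesic Agent", PATIENT_DRUGS, IS_A))
```

The graph traversal and the row filter live in application code here, which is the gap the slide points at: the link between the concept graph and the data is not something the database itself knows about.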

81 Some Input From WG 2 and XMDR F Look at recent work on a graph query language by David Silberberg of Johns Hopkins University Applied Physics Lab. 81

82 Input from WG 2 and XMDR F David Jensen, of the University of Massachusetts Amherst ( http://kdl.cs.umass.edu/people/jensen/ ), has been developing a very interesting Proximity system and in the process has worked with complex patterns in very large data sets, including alternative query languages and database technologies ( http://kdl.cs.umass.edu/proximity/index.html ). QGRAPH is a new visual language for querying and updating graph databases. A key feature of QGRAPH is that the user can draw a query consisting of vertices and edges with specified relations between their attributes. The response will be the collection of all subgraphs of the database that have the desired pattern. 82
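
To make "all subgraphs that have the desired pattern" concrete, here is a small backtracking matcher in Python. It is not QGRAPH or the Proximity API; it is a generic sketch of subgraph pattern matching, and the tiny food-web data, the function name, and the pattern format are all assumptions made for illustration.

```python
def match_pattern(vertices, edges, pattern_vertices, pattern_edges):
    """Return every assignment of pattern variables to distinct graph
    vertices that satisfies all vertex conditions and all pattern edges.

    vertices: dict of vertex id -> attribute dict
    edges: set of directed (source id, target id) pairs
    pattern_vertices: dict of variable -> predicate over an attribute dict
    pattern_edges: list of directed (variable, variable) pairs
    """
    variables = list(pattern_vertices)
    results = []

    def consistent(assignment):
        # Check only pattern edges whose endpoints are both assigned so far.
        for a, b in pattern_edges:
            if a in assignment and b in assignment:
                if (assignment[a], assignment[b]) not in edges:
                    return False
        return True

    def extend(i, assignment):
        if i == len(variables):
            results.append(dict(assignment))
            return
        var = variables[i]
        for vid, attrs in vertices.items():
            if vid in assignment.values():
                continue  # each graph vertex is used at most once per match
            if not pattern_vertices[var](attrs):
                continue
            assignment[var] = vid
            if consistent(assignment):
                extend(i + 1, assignment)
            del assignment[var]

    extend(0, {})
    return results


# Invented sample data: an edge points from food source to consumer.
VERTICES = {
    "algae": {"kind": "producer"},
    "krill": {"kind": "consumer"},
    "cod": {"kind": "consumer"},
    "seal": {"kind": "consumer"},
}
EDGES = {("algae", "krill"), ("krill", "cod"), ("cod", "seal")}

# Pattern: a producer eaten by a consumer that is eaten by another consumer.
matches = match_pattern(
    VERTICES,
    EDGES,
    {
        "x": lambda a: a["kind"] == "producer",
        "y": lambda a: a["kind"] == "consumer",
        "z": lambda a: a["kind"] == "consumer",
    },
    [("x", "y"), ("y", "z")],
)
print(matches)
```

Exhaustive backtracking like this is exponential in the worst case, which is one reason query performance on very large graphs is listed among the challenges in this presentation.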

83 Input from WG 2 and XMDR F Query languages are necessary to extract useful information from massive data sets. Moreover, annotated corpora require thousands of hours of manual annotation to create, revise and maintain. Query languages are also useful during this process. For example, queries can be used to find parse errors or to transform annotations into different schemes. However, existing query languages suffer from several problems. u First, updates are not supported, as query languages focus on the needs of linguists searching for syntactic constructions. u Second, their relationship to existing database query languages is poorly understood, making it difficult to apply standard database indexing and query optimization techniques. As a consequence they do not scale well. u Finally, linguistic annotations have both a sequential and a hierarchical organization. Query languages must support queries that refer to both of these types of structure simultaneously. Such hybrid queries should have a concise syntax. The interplay between these factors has resulted in a variety of mutually inconsistent approaches. Source: Catherine Lai and Steven Bird, Department of Computer Science and Software Engineering, University of Melbourne, Victoria 3010, Australia 83

84 Input from WG 2 and XMDR F Try to keep an eye on companies that are grappling with advanced database, knowledge management, information extraction, and analysis requirements, such as Metamatrix, I2, NetViz, Top Quadrant, OntologyWorks, Franz, Cogito, or Objectivity, with new ones cropping up very often. F Check out the EU sites given the large investments being made there in areas of interest. For example, KAON. F Watch the outcome of an NSF funded project on querying linguistic databases, including annotated corpora ( http://projects.ldc.upenn.edu/QLDB/ ). Steven Bird at U. Melbourne is one of the principals on that project. 84

85 Input from WG 2 and XMDR F Need for graph query languages that go beyond RDF and XML F Frank Olken: Make SQL a strongly typed language with respect to measurement dimensionality. F Performance: project graph-structured queries against graph-structured data. The query can be expressed in SQL only with great difficulty. Complex objects. Model gets complex. Putting Humpty Dumpty together again at query time. F Political problem in government: vendors are on board, so it is hard to pursue other technologies. F Object systems. OMG working on it? (OQL?). Java has an ugly layer that maps into the relational system. Franz has SPARQL built on top of a graph store. 85
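
What "strongly typed with respect to measurement dimensionality" could mean is sketched below in Python rather than SQL. This is an assumption-laden illustration, not a proposal from the slide: the `Quantity` class, the three base dimensions, and the sample values are invented here. The point is only that adding a length to a time is rejected as a type error, while dividing them yields a derived dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A value tagged with exponents of (length, mass, time)."""

    value: float
    dims: tuple

    def __add__(self, other):
        # Addition is only defined for identical dimensions.
        if self.dims != other.dims:
            raise TypeError(
                f"cannot add dimensions {self.dims} and {other.dims}"
            )
        return Quantity(self.value + other.value, self.dims)

    def __mul__(self, other):
        # Multiplication adds dimension exponents.
        dims = tuple(a + b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value * other.value, dims)

    def __truediv__(self, other):
        # Division subtracts dimension exponents.
        dims = tuple(a - b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value / other.value, dims)


LENGTH = (1, 0, 0)
TIME = (0, 0, 1)

distance = Quantity(100.0, LENGTH)
duration = Quantity(20.0, TIME)
speed = distance / duration  # dimensions become (1, 0, -1)
print(speed)

try:
    distance + duration
except TypeError as err:
    print("rejected:", err)
```

A database language with this property would raise the same kind of error at query compile time, for example when a query sums a column of lengths with a column of durations.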

86 Input from WG 2 and XMDR F Link Mining Applications: Progress and Challenges - Ted E. Senator Link mining is a fairly new research area that lies at the intersection of link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining. However, and perhaps more important, it also represents an important and essential set of techniques for constructing useful applications of data mining in a wide variety of real and important domains, especially those involving complex event detection from highly structured data. Imagine a complete “link mining toolkit.” What would such a toolkit look like? 86

87 Input from WG 2 and XMDR Link Mining Applications: Progress and Challenges - Ted E. Senator F Most important, it would require a language that enabled the natural representation of entities and links. Such a language would also allow for the representation of pattern templates and for specifying matches between the templates and their instantiations. F The language would have to accept an arbitrary database schema as input, with a specified mapping between relations in the database and fundamental link types in the language. F It would have to compile into efficient and rapidly executable database queries. F It would need to be able to represent grouped entities and multiple abstraction hierarchies and reason at all levels. F It would have to enable the creation of new schema elements in the database to represent newly discovered concepts. 87

88 Input from WG 2 and XMDR Link Mining Applications: Progress and Challenges - Ted E. Senator F It would need to represent both pattern templates and pattern instances, and to have a mechanism for tracking matches between the two. F It would have to have constructs for representing fundamental relationships such as part-of, is-a, and connected-to (the most generic link relationship), as well as perhaps other high-level link types such as temporal relationships (e.g., before, after, during, overlapping, etc.), geo-spatial relationships, organizational relationships, trust relationships, and activities and events. F The toolkit would include at least one and possibly many pattern matchers. It would require tools for creating and editing patterns. It would have to include visualizations for many different types of structured data. F It would need mechanisms for handling uncertainty and confidence. F It would have to track the dependence of any conclusion (e.g., pattern match or discovered pattern) back to the underlying data, and perhaps incorporate backtracking so the impact of data corrections could be detected. 88

89 Input from WG 2 and XMDR Link Mining Applications: Progress and Challenges - Ted E. Senator F It would need configuration management tools to track the history of discovered and matched patterns. F It would need workflow mechanisms to support multiple users in an organizational structure. F It would need mechanisms for ingesting domain-specific knowledge. F It would have to be able to deal with multiple data types including text and imagery. F And it would have to be able to rapidly incorporate new link mining techniques as they are developed. F Finally, it would need to include mechanisms for maximum privacy protection. 89

90 Where to Progress Semantics Management? F SC 32 in WG 2 and WG 3 as extensions to ongoing work or as New Work Items F W3C as XQuery, SPARQL, Semantic Web Deployment WG (RDF vocabularies, SKOS) F OMG as extensions to the MOF F … 90

91 91 Thanks & Acknowledgements F John McCarthy F Karlo Berket F Kevin Keck F Frank Olken F Harold Solbrig F L8 and SC 32/WG 2 Standards Committees F Major XMDR Project Sponsors and Collaborators u U.S. Environmental Protection Agency u Department of Defense u National Cancer Institute u U.S. Geological Survey u Mayo Clinic u Apelon

