Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS Eduard Hovy Information Sciences Institute University of Southern California.


1 Using Ontologies to Enable Access to Multiple Heterogeneous Databases CARDGIS Eduard Hovy Information Sciences Institute University of Southern California (in collaboration with Columbia University)

2 Context: CARDGIS Project
Goal: enable access to multiple, heterogeneous Federal agency data sources through a single interface using standardized nomenclature, while accounting for semantic variability.
Sources:
–Energy Information Administration (quarterly CD-ROM).
–Bureau of Labor Statistics (http://stats.bls.gov).
–Census Bureau (CD-ROM for 1992 data).
–California Energy Commission (weekly data at http://energy.ca.gov).

3 System Architecture
Sources
Integrated Ontology
–global terminology
–source descriptions
–integration axioms
User Interface (user phase: compose query)
–ontology browser
–query constructor
Ontology Construction (construction phase: deploy DBs, extend ontology)
–DB analysis
–text analysis
Query Processor (access phase: create DB query, retrieve data)
–reformulation
–cost optimization
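The reformulation step of the Query Processor can be pictured with a minimal sketch. It assumes that source descriptions map one global ontology term to source-specific fields. The slide does not describe the actual data structures, so the class `SourceField`, the table and column names, and the function `reformulate` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceField:
    """One source-specific field that realizes a global ontology term."""
    source: str
    table: str
    column: str


# Hypothetical source descriptions: global term -> fields in each source.
# The table and column names are invented for illustration only.
SOURCE_DESCRIPTIONS = {
    "gasoline_price": [
        SourceField("EIA", "retail_prices", "gas_price"),
        SourceField("CEC", "weekly_fuel", "gasoline_usd_per_gal"),
    ],
    "hourly_wage": [
        SourceField("BLS", "earnings", "avg_hourly_earnings"),
    ],
}


def reformulate(global_term):
    """Rewrite a query on a global term into one query per source."""
    fields = SOURCE_DESCRIPTIONS.get(global_term, [])
    return [(f.source, f"SELECT {f.column} FROM {f.table}") for f in fields]


for source, query in reformulate("gasoline_price"):
    print(source, "->", query)
```

A real query processor would also apply the integration axioms and cost optimization listed on the slide; this sketch covers only the term-to-source mapping.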

4 So What is an Ontology?
Desiderata:
–‘anchor points’ for terminology variants (salary, income…),
–wide coverage,
–some degree of taxonomic organization for inference/program behavior control.
Terminological (not domain) ontology.

5 ISI’s SENSUS Ontology
Taxonomy with multiple superclass links.
Approx. 90,000 items.
Top level: Penman Upper Model (ISI).
Body: WordNet (Princeton), rearranged.
Used at ISI for machine translation, text summarization, database access.
http://vigor.isi.edu:8002/sensus2/

6 Three Ways of Building Ontologies
1. Combine existing knowledge resources: ontology alignment.
2. Learn from texts and Web: extract word families for thousands of concepts.
3. Parse dictionary definitions: extract information and place into ontology.

7 1. Cross-Ontology Alignment
Why create a new ontology? Merge and reuse existing ones!
Problem: automatically find corresponding concepts.
1. Text Matches
–concept names (cognates; reward for delimiter confluence...)
–textual definitions (string matching, demorphing, stop words...) [Knight & Luk 94, Dalianis & Hovy 98]
2. Hierarchy Matches
–shared superconcepts, to filter ambiguity [Knight & Luk 94]
–semantic distance [Agirre et al. 94]
3. Data Item and Form Matches
–inter-concept relations [Ageno et al. 94; Rigau & Agirre 95]
–slot-filler restrictions [Okumura & Hovy 94]
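The text-match heuristics can be illustrated with a rough sketch. It is not the algorithm of the cited papers: it assumes a simple character-similarity score for concept names and a stop-word-filtered token overlap for definitions, combined with equal weights. The stop-word list, the weights, the threshold, and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Small illustrative stop-word list (an assumption, not the original list).
STOP_WORDS = {"a", "an", "the", "of", "or", "and", "to", "in", "for"}


def normalize(name):
    """Lowercase a concept name and treat delimiters as spaces."""
    return name.lower().replace("_", " ").replace("-", " ")


def name_score(name_a, name_b):
    """Character-level similarity of two concept names, in [0, 1]."""
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()


def definition_score(def_a, def_b):
    """Jaccard overlap of definition tokens after stop-word removal."""
    tokens_a = {w for w in def_a.lower().split() if w not in STOP_WORDS}
    tokens_b = {w for w in def_b.lower().split() if w not in STOP_WORDS}
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def suggest_links(onto_a, onto_b, threshold=0.5):
    """Score every concept pair; each ontology is a dict name -> definition."""
    suggestions = []
    for name_a, def_a in onto_a.items():
        for name_b, def_b in onto_b.items():
            score = 0.5 * name_score(name_a, name_b) \
                + 0.5 * definition_score(def_a, def_b)
            if score >= threshold:
                suggestions.append((name_a, name_b, round(score, 3)))
    return sorted(suggestions, key=lambda item: -item[2])


print(suggest_links(
    {"Motor-Vehicle": "a vehicle with a motor"},
    {"motor_vehicle": "vehicle driven by a motor", "fruit": "edible plant part"},
))
```

Scoring all pairs is quadratic in ontology size, which is why hierarchy matches such as shared superconcepts are useful as a filter on top of text matches.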

8 Cross-Ontology Alignment Results
Ontologies:
–SENSUS Upper Model (350)
–CYC top region (2400) [Lenat; Lehmann 96]
–MIKROKOSMOS (4790 concepts) [Mahesh 96]
–SENSUS top region (6768)
Recall (how many links were missed?): difficult to count! … 32.4 million pairs
Precision (how many suggested links are correct?):
–0.252 (strict)
–0.517 (lenient)
After 5 runs: 883 suggestions (= 13% of SENSUS candidates)
–correct: 244 (= 3.6%)
–near miss: 256 (= 3.8%)
–wrong: 383 (= 5.6%)
(1996, 1997)
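The counts on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes that the percentages are taken relative to the 6768 SENSUS top-region concepts, and that the 32.4 million pairs are the product of the MIKROKOSMOS and SENSUS top-region sizes. Both readings are inferred from the numbers and not stated on the slide.

```python
# Figures copied from the slide.
correct, near_miss, wrong = 244, 256, 383
sensus_top = 6768
mikrokosmos = 4790

# Assumption: the suggestion total is the sum of the three outcome counts.
suggestions = correct + near_miss + wrong

# Assumption: candidate pairs = every MIKROKOSMOS concept x every SENSUS one.
candidate_pairs = mikrokosmos * sensus_top

# Assumption: each percentage is relative to the SENSUS top-region size.
shares = {
    "suggestions": 100 * suggestions / sensus_top,
    "correct": 100 * correct / sensus_top,
    "near miss": 100 * near_miss / sensus_top,
    "wrong": 100 * wrong / sensus_top,
}

print(suggestions, candidate_pairs)
for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```

Under these assumptions the three outcome counts add up to the 883 suggestions, and the shares match the slide's percentages to within rounding.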

9 2. The Websucker
Corpus
–Training set WSJ 1987: 16,137 texts (32 topics).
–Test set WSJ 1988: 12,906 texts (31 topics).
–Texts indexed into categories by humans.
Signature data
–300 terms each, using tf.idf.
–Word forms: single words, demorphed words, multi-word phrases.
How many terms in signatures?
–5, 10, 15, …, 300 terms.
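A topic signature of the kind described here can be sketched as the top-k terms of a topic ranked by tf.idf. The slide does not give the exact weighting, so this sketch assumes that tf is the term count over all texts of a topic and that idf is the log of the number of topics divided by the number of topics containing the term. The function name and the whitespace tokenization are assumptions as well.

```python
import math
from collections import Counter


def topic_signatures(docs_by_topic, k=300):
    """Return, per topic, the k highest-scoring (term, tf.idf) pairs."""
    # Term frequency: count of each word over all texts of a topic.
    tf = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_by_topic.items()
    }
    # Document frequency: number of topics in which each word occurs.
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())

    n_topics = len(tf)
    signatures = {}
    for topic, counts in tf.items():
        scored = {w: c * math.log(n_topics / df[w]) for w, c in counts.items()}
        # Sort by descending score, then alphabetically for stable ties.
        ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
        signatures[topic] = ranked[:k]
    return signatures


toy_corpus = {
    "oil": ["oil prices rise", "oil exports"],
    "banking": ["bank prices rise"],
}
for topic, signature in topic_signatures(toy_corpus, k=2).items():
    print(topic, signature)
```

Varying `k` from 5 to 300 in steps of 5 corresponds to the signature-size question raised on the slide.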

10 Pollution on the Web
Cleanup: try various methods: tf.idf, χ², Latent Semantic Analysis...
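As one illustration of the cleanup methods named here, the standard chi-square statistic for a 2×2 term-by-topic contingency table can be computed as below. The slide does not say how the statistic was applied, so the function name and the choice of counts are assumptions.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/topic contingency table.

    a: texts in the topic that contain the term
    b: texts outside the topic that contain the term
    c: texts in the topic that lack the term
    d: texts outside the topic that lack the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a margin is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denominator


# A term spread evenly across topics carries no signal ...
print(chi_square(10, 10, 10, 10))
# ... while a term that occurs only inside the topic scores highly.
print(chi_square(10, 0, 0, 10))
```

Terms with low scores are candidates for removal from a signature, since they occur about as often outside the topic as inside it.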

11 3. Dictionary Extraction
Step 1: find unencumbered dictionary (Webster 1913).
Step 2: reformat and then parse entries (http://www.isi.edu/natural-language/dpp/).
Step 3: identify individual propositions and their heads.
Step 4: convert prepositions to semantic relations (EM algorithm).
Step 5: place entries into ontology (not yet done).
Example parse:
Babel n 2 [ SENT [ NP OR [ NP A/DT place/NN ] [ NP scene/NN ] ] [ PP of/IN [ NP AND [ NP noise/NN ] [ NP confusion/NN ] ] ] ] ;/: [ SENT [ NP a/DT confused/JJ mixture/NN ] [ PP of/IN [ NP sounds/NNS ] ],/, as/IN [ PP of/IN [ NP languages/NNS ] ] ]./.

12 Disambiguating Extracted Info.
Identify propositions and their parts:
Impression: “A communicating [of a mold or trait] [by an external force or influence]”
Reflection: “The return [of light or sound waves] [by or as if by a mirror]”
by = AGENT or PATH?
–communication by force; return by mirror; return by road
of = OWNER or NUMBER-PART or SOURCE or …?
–the car of Joe; 1 of 15 people smoke; man of La Mancha
Apply EM algorithm to disambiguate.
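The slides do not specify the probabilistic model behind the EM step, so the following is only a generic sketch. It assumes a small mixture model in which each preposition has a prior over candidate relations and each relation has a distribution over the object of the preposition. The preposition "via" and its single PATH reading are hypothetical additions: without some unambiguous evidence, the uniform starting point would never separate AGENT from PATH.

```python
from collections import defaultdict


def em_relations(observations, candidates, iterations=20, eps=1e-6):
    """Estimate P(relation | prep) and P(object | relation) with EM."""
    objects = sorted({obj for _, obj in observations})
    relations = sorted({r for rels in candidates.values() for r in rels})
    # Uniform initialization of both parameter tables.
    prior = {p: {r: 1.0 / len(rels) for r in rels}
             for p, rels in candidates.items()}
    emit = {r: {o: 1.0 / len(objects) for o in objects} for r in relations}

    for _ in range(iterations):
        prior_counts = {p: defaultdict(float) for p in candidates}
        emit_counts = {r: defaultdict(float) for r in relations}
        # E-step: distribute each observation over its candidate relations.
        for prep, obj in observations:
            weights = {r: prior[prep][r] * emit[r][obj]
                       for r in candidates[prep]}
            total = sum(weights.values())
            for r, weight in weights.items():
                prior_counts[prep][r] += weight / total
                emit_counts[r][obj] += weight / total
        # M-step: renormalize the expected counts (eps avoids zeros).
        for p, rels in candidates.items():
            z = sum(prior_counts[p][r] + eps for r in rels)
            prior[p] = {r: (prior_counts[p][r] + eps) / z for r in rels}
        for r in relations:
            z = sum(emit_counts[r][o] + eps for o in objects)
            emit[r] = {o: (emit_counts[r][o] + eps) / z for o in objects}
    return prior, emit


def best_relation(prep, obj, prior, emit):
    """Most probable relation for one preposition/object pair."""
    return max(prior[prep], key=lambda r: prior[prep][r] * emit[r][obj])


# "by" examples come from the slide; "via" is a hypothetical seed.
observations = [("by", "force"), ("by", "mirror"), ("by", "road"),
                ("via", "road"), ("via", "road")]
candidates = {"by": ["AGENT", "PATH"], "via": ["PATH"]}
prior, emit = em_relations(observations, candidates)
for prep, obj in [("by", "force"), ("by", "road")]:
    print(prep, obj, "->", best_relation(prep, obj, prior, emit))
```

The unambiguous "via road" observations pull "road" toward PATH, which in turn pushes "by force" and "by mirror" toward AGENT over the iterations.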

13 Dictionary Extraction Results
Ambiguity reduction:
Readings  Instances
60        1
48        1
36        1
24        1
18        7
12        8
10        2
6         764
5         12
4         20
3         108
2         310
1         902
Evaluation for sentence #1: "As a prefix to english words."
0.000000000621871299: NIL relation<abst PHRASAL speech_act
Score: 1/1 = 1
Evaluation for sentence #13: "Gives up to underwriters."
0.000000041080864587: create,make NIL RECIPIENT capitalist<so
0.000000038652300894: transmit_thou NIL RECIPIENT capitalist<so
Score: 1/2 = 0.5
Evaluation for sentence #14: "Gives all claim to the property."
0.000000002594561718: emit,utter human_action PHRASAL possessn>tr
0.000000002564569212: chnge_pos human_action PHRASAL possessn>tr
0.000000002451809783: create,make human_action PHRASAL possessn>tr
0.000000002368122454: cogitate human_action PHRASAL possessn>tr
0.000000002366411877: utilize human_action PHRASAL possessn>tr
0.000000002307022303: transmit_thou human_act PHRASAL possessn>tr
0.000000002177555675: transfer>comm human_act PHRASAL possessn>tr
0.000000002049017956: chnge>go_mad human_act PHRASA possessn>tr
Score: 1/8 = 0.125
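The scores shown on this slide (1/1, 1/2, 1/8) are consistent with a reciprocal of the number of readings listed for a sentence. The sketch below assumes that reading of the score and also totals the readings histogram. The names `READINGS_HISTOGRAM` and `reading_score` are introduced here for illustration.

```python
from fractions import Fraction

# Readings -> number of instances, copied from the slide's table.
READINGS_HISTOGRAM = {
    60: 1, 48: 1, 36: 1, 24: 1, 18: 7, 12: 8, 10: 2,
    6: 764, 5: 12, 4: 20, 3: 108, 2: 310, 1: 902,
}


def reading_score(num_readings):
    """Assumed score: 1 divided by the number of candidate readings."""
    if num_readings < 1:
        raise ValueError("a sentence needs at least one reading")
    return Fraction(1, num_readings)


total_instances = sum(READINGS_HISTOGRAM.values())
single_reading_share = READINGS_HISTOGRAM[1] / total_instances

print(total_instances)
print(f"{single_reading_share:.1%} of instances have a single reading")
for n in (1, 2, 8):
    print(n, "readings ->", float(reading_score(n)))
```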

14 The Future: Terminology Standard?
Reasons for terminology standardization:
1. Non-duplication
–similar domain models built for many applications
2. Consistency
–across experts within domain, and across domains
3. Efficient model building
–complex: many decisions required simultaneously
ANSI Ad Hoc Group on Ontology Standards (NCITS): draw together ontology work worldwide
–IBM (Santa Teresa), Stanford, ISI, CYC, TextWise, EDR, CSLI, NMSU, Lawrence Livermore, OnTek, Government...
–Meetings: 3/96, 9/96, 3/97, 11/97, 1/98, (6/98)…

15 Questions?

