AAAI 2002 WS1 Peppering knowledge sources with SALT Deryle Lonsdale, Yihong Ding, David W. Embley, Alan Melby Brigham Young University


1 AAAI 2002 WS1 Peppering knowledge sources with SALT Deryle Lonsdale, Yihong Ding, David W. Embley, Alan Melby Brigham Young University lonz@byu.edu, {ding,embley}@cs.byu.edu, akm@byu.edu (Boosting conceptual content for ontology generation)

2 AAAI 2002 WS 2 Acknowledgements Co-authors (Embley, Ding) EU Fifth Framework IST/HLT 3.4.1 NSF Information and Intelligent Systems grant IIS-0083127 Gerhard Budin (Eurodicautom data) Sergei Nirenburg (Mikrokosmos ontology)

3 AAAI 2002 WS 3 Outline Termbases and lexicons: (re)use(s) The SALT and TIDIE projects Data modeling and data resources Termbase conversion Ontology generation Results and evaluation Conclusions

4 AAAI 2002 WS 4 Termbases Terminology databases for humans in multilingual documentation industry Several models, formats; often concept-oriented in nature Termium, Eurodicautom, etc.

5 AAAI 2002 WS 5 Lexicons NLP applications: IR, MT, NLU, speech understanding Widely varying data formats Description at various levels of linguistic theory

6 AAAI 2002 WS 6 Sharing resources Integration is the trend Lexicons (OLIF for MT system lexicons) Termbases (MARTIF for human termbases) Lexicons and termbases Needed: principled data-modeling approach Wide variety of information to be treated Wide range of formats currently in use

7 AAAI 2002 WS 7 The SALT project SALT: Standards-based Access service to multilingual Lexicons and Terminologies (www.ttt.org/salt). International cooperation, standards for coding and interchange of linguistic data, and the combining of technologies. Several partners (BYU TRG, KSU, etc.). Data-modeling approach that addresses the problem of interchange among diverse collections of such data, including their ontological substructure.

8 AAAI 2002 WS 8 The SALT approach Goal: provide 1) Modularity: differentiate core structure vs. data category specification; 2) Coherence: use a meta-model; 3) Flexibility: support interoperable alternative representations. Modular meta-model approach. Implemented in various settings. Ongoing refinement: model’s coverage.

9 AAAI 2002 WS 9 The TIDIE project TIDIE: Target-based Independent-of-Document Information Extraction (www.deg.byu.edu). Ontology-based data extraction. Conceptual modeling of real-world applications. Narrow, data-rich domains. Leverage (or build) custom ontologies for target-based extraction.

10 AAAI 2002 WS 10 Information exchange [Diagram: Source and Target; leverage Information Extraction to do Schema Matching]

11 AAAI 2002 WS 11 Information Extraction Examine/retrieve information from documents to fill in a user-supplied template. Requires some user-oriented specification of information. Our approach: finding, extracting, structuring, and synthesizing information is easier given a conceptual-model-based ontology.

12 AAAI 2002 WS 12 Extracting pertinent information from documents

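13 AAAI 2002 WS 13 A Conceptual Modeling Solution [Diagram: conceptual model centered on Car, connected by "has" and "is for" relationships to Year, Price, Make, Mileage, Model, Feature, and PhoneNr (which has Extension), with participation constraints such as 0..1, 0..*, and 1..*]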

14 AAAI 2002 WS 14 Car-Ads Ontology
  Car [->object];
  Car [0..1] has Year [1..*];
  Car [0..1] has Make [1..*];
  Car [0..1] has Model [1..*];
  Car [0..1] has Mileage [1..*];
  Car [0..*] has Feature [1..*];
  Car [0..1] has Price [1..*];
  PhoneNr [1..*] is for Car [0..*];
  PhoneNr [0..1] has Extension [1..*];
  Year matches [4] constant
    { extract "\d{2}";
      context "([^\$\d]|^)[4-9]\d,[^\d]";
      substitute "^" -> "19"; },
  …
  End;
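The Year data frame above can be read as a three-step rule: find a context match, extract two digits from it, and prepend a century. The following is a minimal Python sketch of that reading. The two regular expressions are copied from the slide, but the function name `extract_year` and the interpretation of `substitute "^" -> "19"` as "prepend 19" are assumptions, not the authors' implementation.

```python
import re

# Patterns copied from the Year data frame on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_year(ad_text):
    """Return a four-digit year string, or None when no context match exists.

    Hypothetical helper: the slide does not show how the rule is executed.
    """
    context_match = CONTEXT.search(ad_text)
    if context_match is None:
        return None
    # The context guarantees two adjacent digits, so this search succeeds.
    two_digits = EXTRACT.search(context_match.group(0)).group(0)
    # Assumed meaning of: substitute "^" -> "19" (prepend the century).
    return "19" + two_digits
```

Under this reading, a two-digit number preceded by a dollar sign is rejected by the context pattern, which keeps prices from being mistaken for years.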

15 AAAI 2002 WS 15 Recognition and Extraction
  Car   Year  Make    Model      Mileage  Price  PhoneNr
  0001  1989  Subaru  SW                  $1900  (363)835-8597
  0002  1998          Elandra                    (336)526-5444
  0003  1994  HONDA   ACCORD EX  100K            (336)526-1081
  Car   Feature
  0001  Auto
  0001  AC
  0002  Black
  0002  4 door
  0002  tinted windows
  0002  Auto
  0002  pb
  0002  ps
  0002  cruise
  0002  am/fm
  0002  cassette stero
  0002  a/c
  0003  Auto
  0003  jade green
  0003  gold

16 AAAI 2002 WS 16 Lexical resources for data modeling Information extraction also requires knowledge representations with terminological and conceptual content. Knowledge sources for an extraction ontology must: be of a general nature; contain meaningful relationships; already exist in machine-readable form; have a straightforward conversion into XML. This paper: create and leverage a large-scale termbase that has some ontological structure, is reformatted according to the SALT standard, and is converted into μK-compliant XML for use by the ontology generator.

17 AAAI 2002 WS 17 Eurodicautom Well-known, widely-used termbase > 1 million concept entries Wide range of topics Entries are multilingual Entry information: sources cited, input/approval dates, … Single-word terms (e.g. “generator”) or multi-word expressions (e.g. “black humus”) Entries each have Lenoch subject-area code Hierarchical representation for classifying terms (and by extension their related concepts)

18 AAAI 2002 WS 18 Partial Eurodicautom entry
  %CM AG4 CH6 GO6
  %DA
  %VE lavmosetørv
  %RF A.Klougart
  %EN
  %VE black humus
  %RF CILF,Dict.Agriculture,ACCT,1977
  %IT
  %VE humus nero
  %RF BTB
  %ES
  %VE humus negro
  %RF CILF,Dict.Agriculture,ACCT,1977
  %SV
  %VE sumpjord
  %RF Mats Olsson,SLU(1997)
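To make the field structure of the entry concrete, here is a small Python sketch of a parser for it. It assumes one field per line, that `%CM` carries the subject codes, `%VE` a term, `%RF` the term's reference, and that any other `%XX` marker opens a language section; these readings are inferred from the sample alone, and `parse_entry` is a hypothetical name rather than part of any Eurodicautom or SALT tool.

```python
SAMPLE_ENTRY = """%CM AG4 CH6 GO6
%DA
%VE lavmosetørv
%RF A.Klougart
%EN
%VE black humus
%RF CILF,Dict.Agriculture,ACCT,1977"""


def parse_entry(text):
    """Parse field-marked lines into subject codes and per-language terms."""
    entry = {"codes": [], "terms": {}}
    language = None
    for line in text.splitlines():
        marker, _, value = line.strip().partition(" ")
        if marker == "%CM":
            entry["codes"] = value.split()
        elif marker == "%VE":
            entry["terms"].setdefault(language, []).append(
                {"term": value, "source": None}
            )
        elif marker == "%RF":
            # Attach the reference to the most recent term, if there is one.
            terms = entry["terms"].get(language)
            if terms:
                terms[-1]["source"] = value
        elif marker.startswith("%"):
            # Assumption: any other marker is a language code such as %DA or %EN.
            language = marker[1:]
    return entry


parsed = parse_entry(SAMPLE_ENTRY)
```

Here `parsed["codes"]` holds the three Lenoch codes and `parsed["terms"]["EN"]` holds the English term with its source.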

19 AAAI 2002 WS 19 Sample Lenoch codes
  AD      Public Administration - Private Administration - Offices
  AD1     general aspects of the subject field
  AD2     public and private organisations
  AD3     publications & documentary search
  AD31    documentation and information systems
  AD4     administrative staff
  AD5     public procurement
  AD51    expropriation in the public interest
  TEH     testing methods
  TEH1    general aspects of testing methods
  TEH2    non-destructive testing
  TEH21   chemical tests
  TEH22   photometrical testing
  TEH221  X-ray spectrometrical testing

20 AAAI 2002 WS 20 Converting the termbase Use several thousand English terms and their subject codes. In the sample entry, the %CM line lists three Lenoch codes: AG4 (representing the subclass AGRONOMY), CH6 (representing ANALYTICAL-CHEMISTRY), and GO6 (representing GEOMORPHOLOGY). Convert termbase entries via the SALT-developed TBX termbase exchange framework (an XML-based refinement of MARTIF). Convert to the μK XML format used by the ontology engine. Result: TBX-mediated conversion from native Eurodicautom terms to the final XML-specified ontology (μK). Lenoch codes are re-interpreted as typical hierarchical relations (e.g. IS-A and SUBCLASS).
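One way to picture the re-interpretation of Lenoch codes as hierarchical relations is the Python sketch below. It assumes, based only on the sample codes shown earlier (AD, AD3, AD31; TEH, TEH2, TEH22, TEH221), that a code's parent is obtained by dropping its final digit and that an all-letter code is a top-level subject area. The function names are hypothetical and this is not the SALT/TBX conversion itself.

```python
def lenoch_parent(code):
    """Return the parent code, or None for a top-level code such as "AD"."""
    return code[:-1] if code[-1].isdigit() else None


def ancestor_chain(code):
    """List the code followed by its ancestors, most specific first."""
    chain = [code]
    while (parent := lenoch_parent(chain[-1])) is not None:
        chain.append(parent)
    return chain


def is_a_relations(codes):
    """Derive (child, "IS-A", parent) triples for a list of Lenoch codes."""
    relations = set()
    for code in codes:
        chain = ancestor_chain(code)
        for child, parent in zip(chain, chain[1:]):
            relations.add((child, "IS-A", parent))
    return sorted(relations)
```

For the codes on the %CM line of the sample entry, `is_a_relations(["AG4", "CH6", "GO6"])` yields one IS-A triple per code; the inverse SUBCLASS relation could be obtained by swapping the two ends of each triple.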

21 AAAI 2002 WS 21 Conversion process [Diagram: Eurodicautom (native) → Eurodicautom (TBX) → Eurodicautom (μK); labels: Lenoch, SALT]

22 AAAI 2002 WS 22 Eurodicautom-TBX encoding sample Eurodicautom entry DXLTdv04.xml BTB DAG77 4 souto fullForm BTB-DAG77-63 V.Correia,Engº Agrónomo,PDR Vale do Lima minifúndio fullForm BTB-DAG77-63 V.Correia,Engº Agrónomo,PDR Vale do Lima

23 AAAI 2002 WS 23 Derived XML ontology xenobiotic substances SUBCLASSES VALUE/FACET> hazardous raw materials 0 physical nuisances SUBCLASSES VALUE/FACET> ambient light 0 financial statistics IS-A VALUE/FACET> economic statistics 0 ….

24 AAAI 2002 WS 24 Ontology generation Goal: specify an ontology for information extraction purposes Problem: complex, tedious, costly Ideally: automatically generate schemas, ontologies Source: natural-language text, tables, etc.

25 AAAI 2002 WS 25 Ontology generation overview

26 AAAI 2002 WS 26 Knowledge sources Mikrokosmos (μK) ontology: about 5,000 hierarchically arranged concepts; fairly high connectivity (about 14 inter-concept links per node); fairly general content, inheritance of properties. Data frame library: regular-expression templates for matching structured low-level lexical items (e.g. measurements, dates, currency expressions, and phone numbers); provide information for conceptual matching via inheritance. Lexicons (e.g. onomastica, WordNet synsets). Domain-specific training documents.
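As an illustration of what a data frame library might look like, the Python sketch below maps concept names to regular-expression templates. The three patterns and the names `DATA_FRAMES` and `match_data_frames` are hypothetical examples written for this note; they are not taken from the authors' library.

```python
import re

# Hypothetical templates for a few structured, low-level lexical items.
DATA_FRAMES = {
    "PhoneNr": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
    "Price": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "Area": re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?\s*sq\.\s*km\b", re.IGNORECASE),
}


def match_data_frames(text):
    """Return {concept: [matched strings]} for every template that fires."""
    return {
        concept: pattern.findall(text)
        for concept, pattern in DATA_FRAMES.items()
        if pattern.search(text)
    }
```

A concept whose template fires on a training document becomes a candidate for selection, which is how such templates can feed conceptual matching.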

27 AAAI 2002 WS 27 Knowledge integration

28 AAAI 2002 WS 28 Methodology Preprocess input knowledge sources: Integrate: map lexicon content and data frame templates to nodes in the merged ontology Extract: match information from training documents collection Parse, tokenize, regularize lexical content Generate the ontology: four-stage generation process concept selection relationship retrieval constraint discovery refinement of the output ontology

29 AAAI 2002 WS 29 Processing input documents

30 AAAI 2002 WS 30 Concept selection Finding which subset of the ontology’s concepts is of interest to a user Concepts are selected via string matches between textual content and the ontological data. Three different selection heuristics concept-name matching concept-value matching data-frame pattern matching String matching plus: word synonym matching: WordNet synonym sets multi-word term matching: bag-of-words (CAPITAL-CITY is considered a synonym of capital and city)
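The concept-name heuristic with bag-of-words and synonym matching could be sketched in Python as follows. The tiny `SYNONYMS` table is a hypothetical stand-in for WordNet synonym sets, and the rule "a concept is selected if any of its component words, or a synonym of one, occurs in the text" is one plausible reading of the slide rather than the authors' exact procedure.

```python
import re

# Hypothetical stand-in for WordNet synonym sets.
SYNONYMS = {"nation": {"country"}}


def concept_words(concept_name):
    """Split a concept name such as CAPITAL-CITY into a bag of words."""
    return set(concept_name.lower().split("-"))


def select_concepts(text, concept_names, synonyms=SYNONYMS):
    """Return the concept names whose words (or synonyms) occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    selected = []
    for name in concept_names:
        candidates = set()
        for word in concept_words(name):
            candidates.add(word)
            candidates |= synonyms.get(word, set())
        if candidates & tokens:
            selected.append(name)
    return selected
```

With this sketch, CAPITAL-CITY is selected for a text that contains only the word "Capital", which mirrors the slide's remark that the concept is treated as a synonym of both capital and city.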

31 AAAI 2002 WS 31 Concept selection algorithm
  PROCEDURE ConceptSelection(Tdoc, Kbase)
    SourceDoc = Parse(Tdoc);
    PrimarySelectedConceptsList = MikroSelection(M-Ontology);
    SecondarySelectedConceptsList = DataFrameSelection(DF-Library);
    ConflictHandling();
    SelectedSubgraphGeneration();

32 AAAI 2002 WS 32 Basic Selection Strategy Select from Mikrokosmos Ontology Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

33 AAAI 2002 WS 33 Basic Selection Strategy Select from Mikrokosmos Ontology concept names and their synonyms Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

34 AAAI 2002 WS 34 Basic Selection Strategy Select from Mikrokosmos Ontology concept names and their synonyms concept values and their synonyms Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

35 AAAI 2002 WS 35 Basic Selection Strategy Select from Mikrokosmos Ontology concept names and their synonyms concept values and their synonyms Select from Data Frame Libraries Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

36 AAAI 2002 WS 36 Basic Selection Strategy Select from Mikrokosmos Ontology concept names and their synonyms concept values and their synonyms Select from Data Frame Libraries extract result based on the data frames Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

37 AAAI 2002 WS 37 Concept conflict resolution Arrive at an internally consistent set of selected concepts. Two levels of resolution Document-level resolution Knowledge-source resolution Criteria: lexical occurrence, proximity, length and distribution of words and terms Preferences from among knowledge sources specifying matches Other default strategies

38 AAAI 2002 WS 38 Document-Level Conflict Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital --Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population:17.7 million. Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

39 AAAI 2002 WS 39 Concept-Level Conflict Afghanistan smaller than Texas. Area : 648,000 sq. km. Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population : 17.7 million. Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

40 AAAI 2002 WS 40 Relationship retrieval Ontology structure: directed graph whose nodes are concepts. Conceptual relationship: all paths connecting concepts generated at a given stage. Theoretical solution: find all the paths in the graph (NP-complete). When multiple paths exist, take the shortest path between two concepts (cf. the μK Onto-Search algorithm). Dijkstra’s (polynomial) algorithm computes the most salient relationships between concepts. A distance threshold on path length prunes weak relationships. Construct schemas, or linked conceptual configurations, from the relationships posited in the previous step. Primary concept selected (or posited): the one with highest connectivity. Cardinalities are inferred from observed relationships.
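The shortest-path step with a distance threshold can be sketched with a standard Dijkstra implementation in Python. The toy graph, its node names, and the unit edge weights below are hypothetical; the slides do not say how the authors weight links in the ontology or which threshold they use.

```python
import heapq


def shortest_paths(graph, source, max_distance):
    """Dijkstra over {node: {neighbor: weight}} with non-negative weights.

    Nodes farther than max_distance from the source are pruned.
    """
    distances = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate <= max_distance and candidate < distances.get(
                neighbor, float("inf")
            ):
                distances[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return distances


# Hypothetical fragment of a concept graph with unit-weight links.
TOY_GRAPH = {
    "CAPITAL-CITY": {"CITY": 1},
    "CITY": {"NATION": 1, "PLACE": 1},
    "PLACE": {"OBJECT": 1},
    "NATION": {},
}

reachable = shortest_paths(TOY_GRAPH, "CAPITAL-CITY", max_distance=2)
```

With a threshold of 2, OBJECT (three links away) is dropped, which is the pruning of weak relationships that the slide describes.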

41 AAAI 2002 WS 41 Participation Constraints Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital—Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. CapitalCity [1:1] IsA.CITY.PartOf Nation [1:1]

42 AAAI 2002 WS 42 Participation Constraints (2) Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities --Kandahar Mazar-e-Sharif Konduz Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. City [1:1] PartOf Nation [1:*]

43 AAAI 2002 WS 43 Refining results Output ontology: may require hand-crafting; can be done in a text editor (flat ASCII ontology). Considerable expertise required: markup syntax, specification of conceptual relations, familiarity with regular-expression writing. Possible solution: ontology editors for typical end-users. With rich enough knowledge sources and a good set of training documents, however, we believe that the generation of extraction ontologies can be fully automatic.

44 AAAI 2002 WS 44 Testing the system Input: various U.S. Department of Energy abstracts. Knowledge base: μK ontology; Energy sub-hierarchy of Eurodicautom terms (300).

45 AAAI 2002 WS 45 Sample application document The trend in supply and demand of fuel and the fuels for electric power generation, iron manufacturing and transportation were reviewed from the literature published in Japan and abroad in 1986. FY 1986 was a turning point in the supply and demand of energy and also a serious year for them because the world crude oil price dropped drastically and the exchange rate of yen rose rapidly since the end of 1985 in Japan as well. The fuel consumption for steam power generation in FY 1986 shows the negative growth for two successive years as much as 98.1%, or 65,730,000 kl in heavy oil equivalent, to that in the previous year. The total energy consumption in the iron and steel industry in 1986 was 586 trillion kcal (626 trillion kcal in the previous year). The total sales amount of fuel in 1986 was 184,040,000 kl showing a 1.5% increase from that in the previous year. The concept Best Mix was proposed as the ideal way in the energy industry. (21 figs, 2 tabs, 29 refs)

46 AAAI 2002 WS 46 Sample output -- energy2 Information Ontology
  energy2 [-> object];
  energy2 [0:*] has Alloy [1:*];
  energy2 [0:*] has Consumption [1:*];
  energy2 [0:*] has CrudeOil [1:*];
  energy2 [0:*] has ForProfitCorporation [1:*];
  energy2 [0:*] has FossilRawMaterials [1:*];
  energy2 [0:*] has Gas [1:*];
  energy2 [0:*] has Increase [1:*];
  energy2 [0:*] has LinseedOil [1:*];
  energy2 [0:*] has MetallicSolidElement [1:*];
  energy2 [0:*] has Ores [1:*];
  energy2 [0:*] has Produce [1:*];
  energy2 [0:*] has RawMaterials [1:*];
  energy2 [0:*] has RawMaterialsSupply [1:*];
  Alloy [0:*] MadeOf.SOLIDELEMENT.Subclasses MetallicSolidElement [0:*];
  Alloy [0:*] IsA.METAL.StateOfMatter.SOLID.Subclasses CrudeOil [0:*];
  Alloy [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Produce [0:*];
  AmountAttribute [0:*] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT Consumption [0:*] IsA.FINANCIALEVENT.Agent Human [0:*];
  ControlEvent [0:*] IsA.SOCIALEVENT.Agent Human [0:*];
  ControlEvent [0:*] IsA.SOCIALEVENT.Location.PLACE.Subclasses Nation [0:*];
  CountryName [0:*] NameOf Nation [0:*];
  CountryName [0:*] IsA.REPRESENTATIONALOBJECT.OwnedBy Human [0:*];
  CrudeOil [0:*] IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation [0:*];
  CrudeOil [0:*] IsA.PHYSICALOBJECT.OwnedBy Human [0:*];
  CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.GROW.Subclasses GrowAnimate [0:*];
  CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Increase [0:*];
  CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Combine [0:*];
  CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Display [0:*];
  CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Produce [0:*];
  Custom [0:*] IsA.ABSTRACTOBJECT.ThemeOf.MENTALEVENT.Subclasses AddUp [0:*];
  Display [0:*] IsA.PHYSICALEVENT.Theme.PHYSICALOBJECT.Subclasses Gas [0:*];
  Display [0:*] IsA.PHYSICALEVENT.Theme.PHYSICALOBJECT.OwnedBy Human [0:*];
  ForProfitCorporation [0:*] OwnedBy Human [0:*];
  ForProfitCorporation [0:*] IsA.CORPORATION.HasNationality Nation [0:*];
  Gas [0:*] IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation [0:*];
  Gas [0:*] IsA.PHYSICALOBJECT.ThemeOf.GROW.Subclasses GrowAnimate [0:*];
  LinseedOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Increase [0:*];

47 AAAI 2002 WS 47 Evaluation Several dozen relationships are generated Correct: relationship is posited between the concept CRUDE-OIL and the action PRODUCE; the role is Theme, meaning that one can PRODUCE CRUDE-OIL Incorrect: relationship between GAS and GROW Precision: relatively low (around 75%) due to high number of matches Recall: better (around 90%) Note: it’s easier for a human to refine the system’s output by rejecting spurious relationships (i.e. deleting false positives) than to specify relationships that the system has missed.
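For reference, precision and recall over the generated relationships follow the usual definitions, sketched here in Python. The counts in the example call are made up purely to reproduce the approximate rates quoted on the slide (around 75% and 90%); the slides do not report the underlying counts.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a set of proposed relationships."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Illustrative, invented counts: 36 relationships proposed, 27 of them correct,
# and 3 correct relationships missed by the system.
precision, recall = precision_recall(
    true_positives=27, false_positives=9, false_negatives=3
)
```

Because false positives can be deleted by inspection while false negatives must be written from scratch, a profile with higher recall than precision fits the slide's point that refining the output is easier than extending it.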

48 AAAI 2002 WS 48 How to improve results Less general, more focused ontologies. Richer ontological structure. More types of hierarchical relationships (beyond IS-A and its inverse, SUBCLASSES). Deeper hierarchies (maximum 4 in Lenoch). Note: TBX supports several data types for conceptual encoding.

49 AAAI 2002 WS 49 Related work Lexical chaining in NLP: extracting and associating chains of word-based relationships from text; relating words and terms to resources like WordNet. Widely used in text categorization, automatic summarization, and topic detection and tracking. Our contributions: integrating disparate knowledge sources for similar tasks; discovering and generating a compatible set of ontological relationships.

50 AAAI 2002 WS 50 Conclusions The knowledge acquisition bottleneck impacts ontology construction for information extraction. Terminographers and lexicographers codify information that can be advantageous for work in semantic-based processing. Integrating these two disparate areas, it is possible to leverage large-scale terminological and conceptual information with relationship-rich semantic resources in order to reformulate, match, and merge retrieved information of interest to a user. Possible future applications: Knowledge-focused personal agents Customized search, filtering, and extraction tools Individually tailored views of data via integration, organization, and summarization Lots of work still to be done…

