L-ISA Learning Domain Specific ISA relations from the WEB Alessandra Potrich and Emanuele Pianta Fondazione Bruno Kessler - IRST Trento, Italy LREC 2008.

1 L-ISA Learning Domain Specific ISA relations from the WEB Alessandra Potrich and Emanuele Pianta Fondazione Bruno Kessler - IRST Trento, Italy LREC 2008 Marrakech, 31 May 2008

2 Overview Learning ISA relations in the patent processing domain (the PatExpert Project) The L-ISA algorithm Evaluation Future Work

3 Ontology Learning/Population Ontology Learning: acquisition of new concepts and relations between them –e.g., a device is an artifact Ontology Population: acquisition of factual knowledge about specific instances –e.g. Einstein is an instance of a scientist –e.g. Einstein was born in 1879

4 PATExpert Funded by the European Union Aim: improving patent retrieval, summarization, paraphrasing, classification and valuing through shallow and deep semantic analysis Main semantic analysis task: recognizing occurrences of KB concepts and relations Proof of concept on two domains –Optical Recording –Machine Tools Focus of the presentation: Ontology Learning in the Optical Recording domain

5 Optical Recording Domain Ontology (ORDO) Based on the OWL formalism Built in three stages –200 manually crafted concepts: starting from a list of the most frequent terms in a reference corpus –Pro-ISA: ontology learning algorithm based on projection of WordNet fragments onto ORDO –L-ISA: ontology learning algorithm based on acquisition of isa templates from the Web

6 Patent Concept Annotation Given a target word: –disambiguate it, by assigning a WN synset whose domain is compatible with the optical recording domain (exploiting WORDNET-DOMAINS ) –If the synset is linked to an ORDO concept annotate the target word with the ORDO concept –Otherwise: apply Pro-ISA –Otherwise: apply L-ISA
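The fallback order on this slide (direct link, then Pro-ISA, then L-ISA) can be sketched as a simple cascade. This is only an illustration: the function name `annotate` and the toy lookup tables are hypothetical stand-ins for the real components, which the slides do not specify.

```python
from typing import Callable, Optional

# Hypothetical type: a strategy maps a target term to a label, or None if it fails.
Linker = Callable[[str], Optional[str]]

def annotate(term: str, direct: Linker, pro_isa: Linker, l_isa: Linker) -> Optional[str]:
    """Try each strategy in the order given on the slide; return the first result."""
    for strategy in (direct, pro_isa, l_isa):
        label = strategy(term)
        if label is not None:
            return label
    return None

# Toy lookups standing in for the real components (assumed data).
direct = {"CD": "ordo:cd"}.get
pro_isa = {"cross-talk": "auto_ordo:crosstalk"}.get
l_isa = {"photodetector": "electronic device"}.get

print(annotate("photodetector", direct, pro_isa, l_isa))
```

Each later strategy is consulted only when the earlier ones return nothing, which mirrors the "Otherwise" steps above.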

7 Choosing the right sense Senses for the word “ CD ” : 1. cadmium, Cd, atomic_number_48 (CHEMISTRY) 2. candle, candela, cd, standard candle (PHYSICS) 3. certificate of deposit, CD (MONEY) 4. compact disk, compact disc, CD (COMPUTER, MUSIC)
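A minimal sketch of domain-based sense selection for the "CD" example. The sense list is taken from the slide; the set of domains treated as compatible with optical recording (`COMPATIBLE`) is an assumption made for illustration, as is the function name `pick_sense`.

```python
# Senses of "CD" with their WORDNET-DOMAINS labels, as listed on the slide.
SENSES_CD = [
    ("cadmium", {"CHEMISTRY"}),
    ("candela", {"PHYSICS"}),
    ("certificate of deposit", {"MONEY"}),
    ("compact disc", {"COMPUTER", "MUSIC"}),
]

# Assumption: domains considered compatible with optical recording.
COMPATIBLE = {"COMPUTER", "MUSIC"}

def pick_sense(senses, compatible):
    """Return the first sense sharing at least one domain with `compatible`."""
    for gloss, domains in senses:
        if domains & compatible:
            return gloss
    return None

print(pick_sense(SENSES_CD, COMPATIBLE))
```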

8 CD {compact_disk, compact_disc, CD}ordo:cd lemma synset KB-concept Direct Concept Annotation

9 Pro-ISA 1: Looking for a WN-to-ORDO link {event} {crosstalk, XT} -{cross_talk, cross-talk, crosstalk_amount} {noise, interference, disturbance} {trouble} {happening, occurrence, occurrent, natural_event} sumo:Process cross-talk Lemma:

10 Pro-ISA 2: Projecting ISA chains (WN -> ORDO) {event} {crosstalk, XT} -{cross_talk, cross-talk, crosstalk_amount} {noise, interference, disturbance} {trouble} {happening, occurrence, occurrent, natural_event} sumo:Process cross-talk auto_ordo:crosstalk auto_ordo:noise auto_ordo:trouble auto_ordo:happening
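The projection step can be sketched as a walk up the WordNet hypernym chain until a synset with a known KB link is found, creating an `auto_ordo:` concept for every synset passed on the way. The function `project_chain` and its data layout are hypothetical; only the example chain and labels come from the slide.

```python
def project_chain(chain, kb_links):
    """chain: synset labels from the target upward.
    kb_links: synset label -> existing KB concept.
    Returns (child, parent) ISA pairs for the newly created concepts."""
    new_concepts = []
    for synset in chain:
        if synset in kb_links:
            # Anchor found: each new concept's parent is the next one up,
            # and the topmost new concept attaches to the KB concept.
            parents = new_concepts[1:] + [kb_links[synset]]
            return list(zip(new_concepts, parents))
        new_concepts.append("auto_ordo:" + synset)
    return []  # no linked synset anywhere in the chain

chain = ["crosstalk", "noise", "trouble", "happening", "event"]
for child, parent in project_chain(chain, {"event": "sumo:Process"}):
    print(child, "isa", parent)
```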

11 From Pro-ISA to L-ISA In 15% of cases, the target word is not in WordNet, so Pro-ISA cannot be applied Then try and exploit the WEB Why not the patent corpus itself? –Isa relations are not frequent in restricted corpora –Patents often contain concept definitions with local scope –We don’t want idiosyncratic concept definitions, but common, shared definitions.

12 Learning ISA relations from a corpus … … by exploiting linguistic patterns expressing the ISA relation (Hearst, 1992; Hearst, 1998; Mititelu, 2006) Many patterns have been presented in the literature, but –Few evaluations of pattern reliability (except Snow 2006) –Even fewer task-oriented evaluations in domain-specific, concrete application scenarios. This paper: an attempt to provide both kinds of evaluation in a real-world, challenging scenario such as patent semantic analysis.

13 Lexico-Syntactic Patterns Patterns reported in the literature: NP1 isa-phrase NP2 (NP1, NP2: syntactic noun phrases; isa-phrase: sequence of tokens) In our case we are looking for the hypernym of a specific target term: –Term-NP isa-phrase Hyper-NP –Hyper-NP isa-phrase Term-NP

14 L-ISA Google (or any other web engine) does not allow for searching lexico-syntactic patterns… So, we proceed in three steps –Snippet acquisition from Google –Lexico-syntactic filtering –Semantic filtering

15 L-ISA: Snippet Acquisition Suppose we cannot link the term “photodetector” to any ORDO concept. We want to exploit the following lexico-syntactic pattern: “is an” Submit to Google the following string query: “photodetector is an” Keep the first 100 snippets (at most), e.g. “... upper frequencies, the PIN waveguide photodetector is an attractive device, since it is possible to reduce transit time without..” Transform HTML snippets into plain text.
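A sketch of the local parts of this step, using only the Python standard library: building the quoted query string, capping the result list at 100, and stripping HTML markup from a snippet. The actual call to the search engine is left out; the helper names (`build_query`, `keep_first`, `snippet_to_text`) are hypothetical.

```python
from html.parser import HTMLParser

def build_query(term: str, isa_phrase: str) -> str:
    """Quoted phrase query, e.g. "photodetector is an"."""
    return f'"{term} {isa_phrase}"'

def keep_first(snippets, limit=100):
    """Keep at most `limit` snippets, in the order returned."""
    return snippets[:limit]

class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def snippet_to_text(html_snippet: str) -> str:
    parser = _TextExtractor()
    parser.feed(html_snippet)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join("".join(parser.parts).split())

print(build_query("photodetector", "is an"))
print(snippet_to_text("the PIN waveguide <b>photodetector is an</b>  attractive device"))
```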

16 L-ISA: Lexico-syntactic Filtering Annotate snippets with TextPro (PoS, lemma, chunk) Recognize isa-phrase in the annotated snippets
            token          PoS  lemma          chunk
TERM-NP     the            AT0  the            B-NP
            PIN            NN1  pin            I-NP
            waveguide      NN1  waveguide      I-NP
            photodetector  NN1  photodetector  I-NP
isa-phrase  is             VBZ  be             B-VP
            an             AT0  an             B-NP
HYPER-NP    attractive     AJ0  attractive     I-NP
            device         NN1  device         I-NP

17 L-ISA: Lexico-syntactic Filtering Filter out TERM-NP: –if the target term is modified (e.g. “PIN waveguide photodetector” above) –if it looks like a proper name (e.g. an uppercase letter in the middle of a sentence). Keep HYPER-NP: –only if it fits one of a restricted number of PoS patterns: (N | AN | NN | NNN | ANN | XNN | R Vpastpart AXN)
TERM-NP            HYPER-NP
the photodetector  analog signal
A photodetector    apparatus
photodetector      effective monitor
The photodetector  electric device
a photodetector    electronic device
photodetector      object
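The two filters can be sketched as follows. Several points are assumptions rather than the authors' implementation: PoS tags are collapsed to one letter by prefix (NN* to N, AJ* to A, anything else to X), the last pattern on the slide ("R Vpastpart AXN") is omitted because its exact shape is not clear from the transcript, and the proper-name check is left out. The function names are hypothetical.

```python
# Patterns from the slide, minus the last one (see the note above).
ALLOWED = {"N", "AN", "NN", "NNN", "ANN", "XNN"}

def coarse(tag: str) -> str:
    """Assumed mapping from a fine-grained tag to a one-letter class."""
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("AJ"):
        return "A"
    return "X"

def keep_hyper_np(tagged) -> bool:
    """tagged: list of (token, PoS) pairs for the candidate HYPER-NP."""
    shape = "".join(coarse(tag) for _, tag in tagged)
    return shape in ALLOWED

def term_np_ok(tokens, target: str) -> bool:
    """Reject a TERM-NP in which the target term carries extra modifiers."""
    words = [t.lower() for t in tokens if t.lower() not in {"a", "an", "the"}]
    return words == target.lower().split()

print(keep_hyper_np([("attractive", "AJ0"), ("device", "NN1")]))
print(term_np_ok(["the", "PIN", "waveguide", "photodetector"], "photodetector"))
```

Under these assumptions the snippet from the previous slide passes the HYPER-NP check but is discarded by the TERM-NP check, since "photodetector" is modified by "PIN waveguide".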

18 Semantic Filtering Keep only those HYPER-NPs compatible with the Optical Recording domain, by checking –whether the HYPER-NP is already a label in one of the known ontologies (SUMO, ORDO, AUTO-ORDO) –whether it is present in a WordNet synset with a WORDNET-DOMAIN label compatible with the Optical Recording domain.
Candidate hypernyms for photodetector (columns: HYPER-NP, IN KB, IN WN, DOM. COMPAT):
analog signal
apparatus: yes, midlow
effective monitor
electric device
electronic device: yes
object: yes, midlow
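The two checks on this slide amount to a disjunction: keep a candidate if it is a known ontology label, or if one of its WordNet domains is compatible with the target domain. A minimal sketch, with hypothetical names and toy data (the label set, domain table and compatible-domain set below are assumptions, not the project's resources):

```python
def semantically_compatible(hyper, kb_labels, wn_domains, compatible) -> bool:
    """kb_labels: labels found in the known ontologies.
    wn_domains: candidate -> set of WORDNET-DOMAINS labels.
    compatible: domains accepted for the target domain."""
    if hyper in kb_labels:
        return True
    return bool(wn_domains.get(hyper, set()) & compatible)

# Toy data, assumed for illustration only.
kb_labels = {"apparatus", "electronic device", "object"}
wn_domains = {"analog signal": {"ELECTRONICS"}, "effective monitor": set()}
compatible = {"ELECTRONICS", "COMPUTER"}

candidates = ["analog signal", "apparatus", "effective monitor", "electronic device"]
kept = [c for c in candidates
        if semantically_compatible(c, kb_labels, wn_domains, compatible)]
print(kept)
```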

19 Candidate Selection Candidates are weighed according to –Frequency and Reliability of patterns where the hypernym occurs –Variety of patterns –Belonging to specific ontologies (manual ORDO, AUTO-ORDO or SUMO, in decreasing preference order)
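The slides name the weighting criteria but not the formula, so the following is only one possible way to combine them: summing pattern reliability over occurrences (which covers frequency and reliability together), adding a bonus per distinct pattern (variety), and adding an ontology bonus that decreases from ORDO to AUTO-ORDO to SUMO. All numeric weights, the pattern names and the function name `rank` are hypothetical.

```python
from collections import defaultdict

# Hypothetical bonuses reflecting the decreasing preference order on the slide.
ONTOLOGY_BONUS = {"ORDO": 3.0, "AUTO-ORDO": 2.0, "SUMO": 1.0}

def rank(occurrences, reliability, ontology_of, variety_weight=0.5):
    """occurrences: (hypernym, pattern) pairs, one per matching snippet.
    reliability: pattern -> score in [0, 1].
    ontology_of: hypernym -> ontology name, if the hypernym is a known label."""
    score = defaultdict(float)
    patterns = defaultdict(set)
    for hyper, pattern in occurrences:
        score[hyper] += reliability.get(pattern, 0.0)  # frequency x reliability
        patterns[hyper].add(pattern)
    for hyper in score:
        score[hyper] += variety_weight * len(patterns[hyper])  # variety of patterns
        score[hyper] += ONTOLOGY_BONUS.get(ontology_of.get(hyper), 0.0)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(score, key=lambda h: (-score[h], h))

occurrences = [("electronic device", "is a"), ("electronic device", "such as"), ("object", "is a")]
print(rank(occurrences, {"is a": 0.6, "such as": 0.8},
           {"electronic device": "ORDO", "object": "SUMO"}))
```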

20 Evaluating the Reliability of ISA Patterns Assessment of the reliability of the patterns reported in the literature as predictors of the isa relation –Around 80 templates –On three target terms: “groove”, “photodetector” and “magnetic head”. –Google returned around 9,000 snippets –Only snippets passing lexico-syntactic filtering were manually evaluated (about 1,450) –Guideline: try to interpret the intentions of the author (does he/she really intend to say that X is a subclass of Y, beyond inappropriate phrasing, and even if you know that it is not true?) –Results of this evaluation are exploited in weighting the hypernym candidates
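One natural way to turn such manual judgments into a per-template reliability score is the fraction of evaluated snippets judged correct. The slides do not state the exact measure used, so this is an assumption, and the function name and sample judgments are made up for illustration.

```python
from collections import Counter

def pattern_reliability(judgments):
    """judgments: (pattern, is_correct) pairs, one per manually evaluated snippet.
    Returns pattern -> fraction of its snippets judged to express a true ISA relation."""
    total, correct = Counter(), Counter()
    for pattern, is_correct in judgments:
        total[pattern] += 1
        if is_correct:
            correct[pattern] += 1
    return {p: correct[p] / total[p] for p in total}

# Made-up judgments, not data from the evaluation.
sample = [("is a", True), ("is a", False), ("such as", True)]
print(pattern_reliability(sample))
```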

21 Evaluating the L-ISA accuracy Measuring the accuracy of the L-ISA algorithm in finding the hypernym of a given domain concept Most frequent 100 terms that we were not able to link to the ORDO ontology using the Pro-ISA learning strategy Including “wrong” target terms (because of errors of the linguistic processors, e.g. a past participle instead of a noun) Accuracy: 78.6%

22 Future Work Extend evaluation set Inter-coder agreement Use Machine Learning to optimize the weights associated with templates

