Presentation on theme: "Semi-automatic compound nouns annotation for data integration systems Tuesday, 23 June 2009 SEBD 2009 Sonia Bergamaschi Serena Sorrentino www.dbgroup.unimo.it."— Presentation transcript:
1Semi-automatic compound nouns annotation for data integration systems Tuesday, 23 June 2009 SEBD 2009Sonia BergamaschiSerena SorrentinoDipartimento di Ingegneria dell’InformazioneUniversità di Modena e Reggio Emilia, via Vignolese 905, Modena
2The ProblemData integration systems: produce a comprehensive global schema successfully integrating data from heterogeneous structured and semi-structured data sourcesStarting from the “meanings” associated to schema elements it is possible to discover mappings among the elements of different schemataLexical Annotation :the explicit inclusion of the “meaning“ of a data source element (i.e. class/attribute name) w.r.t. a thesaurus (WordNet (WN) in our case)Automatic Lexical Annotation becomes crucial as a starting point for mappingdiscoveryProblem :many schemata names are non-dictionary words (compound nouns, acronyms, abbreviations etc.) i.e. not be present in the lexical resourcein this work, we will concentrate only on non-dictionary Compound Nouns (CNs)the result of lexical annotation is strongly affected by the presence of these non-dictionary CNs in the schemaThe focus of data integration systems is on producing a comprehensive global schema successfully integrating data from heterogeneous structured and semi-structured data sources. Therefore, it is important to deal with labels of schemata , i.e. to understand the “meaning" behind the names denoting schemata elements.Lexical Annotation becomes, thus, crucial to understand the meaning of schemata. Lexical annotation is the explicit inclusion of the “meaning“ of a data source element (i.e. class/attribute name) w.r.t. a thesaurus (WordNet (WN) in our case).Problem :a lexical resource does not cover with the same detail different domains of knowledge and many domain dependent terms, say non-dictionary words, may not be present in it;non-dictionary words include compound nouns, acronyms etc. In this work, we will concentrate only on non-dictionary Compound Nouns (CNs).
3Proposed Solution & Motivation In some approaches the constituents of a CN are treated as single words. E.g. the CN “teacher_judgment" is split into two tokens (“teacher" and “judgment") and its relatedness to other sources element is calculated as an average relatedness between each token and the other elementsA large set of relationships among different schemata is discovered, including a great amount of false positive relationshipsWe propose a semi-automatic method for the lexical annotation of non-dictionary CNsStarting from our previous works on lexical annotation of structured and semi-structured data sources, we propose a semi-automatic method for the lexical annotation of non-dictionary CNs by creating a new WN meaning.Our method is implemented in the MOMIS (Mediator envirOnment for Multiple Information Sources) system. However, it may be applied in general in the context of schema mapping discovery, ontology merging and data integration system.
4Compound Noun annotation Compound Noun (CN): a word composed of more than one words called CN constituentsIn order to perform semi-automatic CNs annotation a method for their interpretation has to be devisedThe interpretation of a CN is the task of determining the semantic relationships among the constituents of a CNCNs can be divided in four categories: endocentric, exocentric, copulative and appositional and to consider only endocentric CNsEndocentric CN: consists of a head (i.e. the part that contains the basic meaning of the whole CN) and modifiers, which restricts this meaning. A CN exhibits a modifier-head structure with a sequence of nouns composed of a head noun and one or more modifiers where the head noun occurs always after the modifiersEndocentric CN: consists of a head (i.e. the categorical part that contains the basic meaning of the whole CN) and modifiers, which restrict this meaning. A CN exhibits a modifier-head structure with a sequence of nouns composed of a head noun and one or more modifiers where the head noun occurs always after the modifiers.
5Compound Noun annotation Our restriction is motivated by different elements:the the vast majority of schemata CNs fall in the endocentric categoryendocentric CNs are the most common type of CNs in Englishexocentric and copulative CNs, which are represented by a unique word, are often present in a dictionary (e.g. “loudmouth”, “sleepwalk”, etc.)appositional compound are not very common in English and less likely used as element of a schema (e.g.“sweet-sour”)Our method can be summed up into four main steps:CN constituents disambiguationredundant constituents identification and pruningCN interpretation via semantic relationshipscreation of a new WN meaning for a CN
6CN constituents disambiguation & pruning Compound Noun syntactic analysis: syntactic analysis of CN constituents, performed by a parserDisambiguating head and modifier: by applying our CWSD (Combined Word Sense Disambiguation) algorithm, each word is automatically mapped into its corresponding WordNet 2.0 synsetsRedundant constituents identification and pruningRedundant words: words that do not contribute new information, i.e. derived from the schema or from the lexical resourceE.g. the attribute “company_address” of the class “company”: “company” is not considered as the relationship holding among a class and its attributes is implicit in the schemathe syntactic analysis is performedby a parser (in our case the Charniak parser ), toidentify the part of speech of each constituent. If theCN does not fall under the endocentric syntactic structure(noun-noun or adjective-noun), it is ignored.
7CN interpretation via semantic relationships Our goal is to select, among a set of predefined semantic relationships, the one that best capture the relation between the head and the modifier9 possible semantic relationship: CAUSE, HAVE, MAKE, IN, FOR, ABOUT, USE, BE, FROM (Levi’s semantic relationships set)the semantic relationship between the head and the modifier of a CN is the same holding between their top level WN nouns in the WN hierarchyThe top level concepts of the WNhierarchy are the 25 uniquebeginners for WN English nounsdefined by Miller
8CN interpretation via semantic relationships To each couple of unique beginners we associate the relationship from the Levi's set that best describes their combined meaningFor example, we interpret the CN “teacher judgment“ by the MAKE relationship as “teacher" is an hyponym of “person" and “judgment" is an hyponym of “act“ and for the couple (person, act) of unique beginners we choose the relationship MAKEMAKEPerson#1Act#2…hyponymhyponymEducator#1…hyponymMAKETeacher#1Judgment#2
9Creation of a new WN meaning for a CN (a) Gloss definition: we create the gloss to be associated to a CN, starting from the relationship associated to a CN and exploiting the glosses of the CN constituentsTeacher #1 GlossA person whose occupation is teaching.judgment #2 GlossThe act to judging or assessing a person or situation or event.++Modifier MAKE Headew WN meaning for a CN; it can be divided in two prevalent steps:Gloss definition: during this step we create the gloss to be associated to a CN, starting from the relationship associated to a CN and exploiting the glosses of the CN constituents.Inclusion of the new CN meaning in WN: the insertion of a new CN meaning in the WN hierarchy implies the definition of its relationships with the other WN meanings. As the concepts denoted by a CN are a subset of the concepts denoted by the head we assume that a CN inherits most of its semantic from its head. Starting from this consideration, we can infer that the CN is related, in the WN hierarchy, with its head by an hyponym relationship. Moreover, we represent the CN semantics related to its modifier by inserting a generic relationship RT (Related term), corresponding to WN relationships as member meronym, part meronym etc. Moreover, we use the WNEditor tool to create/manage the new meaning and to set relationships between it and WN onesTeacher_judgment Gloss:A person whose occupation is teaching make the act to judging or assessing a person or situation or event.
10Creation of a new WN meaning for a CN (b) Inclusion of the new CN meaning in WN: as the concept denoted by a CN is a subset of the concept denoted by the head we create an hyponym relationship between the new CN meaning and its head meaning a generic relationship RT (Related term), corresponding to WN relationships as member meronym, part meronym etc. between the CN meaning and its modifier we use the WNEditor tool to create/manage the new meaning and to set new relationships between it and WN meaningsTeacher_judgment#1judgment#2Teacher#1SYNSETβWNEditorhypernym/hyponymRelated ToTeacher_judgment#1SYNSETµ
12Evaluation: Experimental Result CNs annotation extends the automatic annotation tool within the MOMIS systemEvaluation over a real data sources environment: three sources of an application scenario of the NeP4B project (491 schema elements) which contain a lot of CNs (about 50%).Without CNs annotation, CWSD obtains a very low recall value. Our method increases the recall without significantly worsening precision. However, the recall value is not very high: presence of a lot of acronym terms.A CN has been considered correctly annotated if the Levi's relationship selected manually by the user is the same returned by our method
13ConclusionThe experimental results showed the effectiveness of our method which significantly improves the result of the lexical annotation processOur method may be applied in general in the context of mapping discovery, ontology merging and data integration systemFuture work will be devoted to investigate on the role of the set of semantic relationships chosen for the CNs interpretation processWe will extend the tool with a component which deals with acronyms and abbreviations expansion (to appear at 28th International Conference on Conceptual Modeling, ER 2009)