Label Normalization and Lexical Annotation for Schema and Ontology Matching
Serena Sorrentino
International Doctorate School in Information and Communication Technologies, Università degli Studi di Modena e Reggio Emilia
XXIII Cycle, Computer Engineering and Science
Advisor: Prof. Sonia Bergamaschi
Co-Advisor: Prof. Sanda Harabagiu

Outline
- Overview: Schema Matching; Lexical Annotation; The MOMIS Data Integration System; Open Problems and Contributions
- Semi-Automatic Lexical Annotation
- Schema Label Normalization
- Uncertainty in Automatic Annotation
- Conclusion & Future Work

Schema Matching - Definition
Schema matching is the task of finding the semantic correspondences (mappings) between the elements of two schemata.
Input:
- Schema information: element names, data types, constraints, ...
- Instance information: used to characterize the content and semantics of schema elements
- Auxiliary information: dictionaries, thesauri, user input, ...
Output:
- Match result: a set of mapping elements, each of which specifies that certain elements of S1 are mapped to certain elements of S2

Lexical Annotation for Schema Matching
Lexical annotation of schema labels is the explicit assignment of meanings with respect to a reference lexical thesaurus (WordNet in our case).
DBGroup approach: starting from the "hidden" meanings associated with schema labels (i.e., class and attribute names, also called terms), it is possible to discover lexical relationships among schema elements.
Lexical relationships (inter-schema knowledge):
- SYN (Synonym-of): between two synonymous terms
- BT (Broader Term): between two terms where the first generalizes the second (the opposite is NT, Narrower Term)
- RT (Related Term): between two terms that are generally used together in the same context
[S. Bergamaschi, S. Castano, M. Vincini, D. Beneventano. Semantic integration of heterogeneous information sources. DKE Journal, 2001]
Schema-derived relationships (intra-schema knowledge):
- BT/NT: from ISA relationships, and from a Foreign Key (FK) in relational sources when it is a Primary Key in both the original and the referenced relation
- RT: from nested elements in XML files and from FKs in relational sources

Lexical Annotation - Example
[Figure: the labels Customer and Client are annotated with WordNet senses (Customer#1; Client#1, Client#2, Client#3). Senses that belong to the same synset yield the relationship Customer SYN Client. Other WordNet links shown: hypernym, hyponym, meronym, holonym.]
Lexical relationship discovery:
- SYN: synonym in WordNet
- BT/NT: hypernym/hyponym relationship in WordNet
- RT: meronym relationship (part of) or sibling in WordNet
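The discovery rules above can be sketched in a few lines of Python. This is only an illustration on a hand-made toy graph, not the MOMIS implementation: the synset identifiers, the `HYPERNYM` and `MERONYM` tables, and the function names are all assumptions, and the hypernym check follows the chain transitively, which the slide does not specify.

```python
# Toy stand-in for WordNet; every synset id and link below is made up for illustration.
HYPERNYM = {  # child synset -> its direct hypernym
    "customer#1": "consumer#1",
    "consumer#1": "person#1",
    "worker#1": "person#1",
}
MERONYM = {("wheel#1", "car#1")}  # (part, whole) pairs


def ancestors(synset):
    """Return every hypernym reachable from a synset."""
    found = set()
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        found.add(synset)
    return found


def lexical_relationship(a, b):
    """Relationship between two annotated labels, given their chosen synsets."""
    if a == b:
        return "SYN"  # same synset
    if a in ancestors(b):
        return "BT"   # a generalizes b
    if b in ancestors(a):
        return "NT"   # a is narrower than b
    part_of = (a, b) in MERONYM or (b, a) in MERONYM
    siblings = a in HYPERNYM and HYPERNYM.get(a) == HYPERNYM.get(b)
    if part_of or siblings:
        return "RT"
    return None


# Both labels annotated with the same synset, as in the Customer/Client example.
annotation = {"Customer": "customer#1", "Client": "customer#1"}
print(lexical_relationship(annotation["Customer"], annotation["Client"]))  # SYN
```

The relationship is computed between the chosen senses, not between the label strings, which is why the quality of the annotation step directly bounds the quality of the discovered relationships.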

The MOMIS Data Integration System
The MOMIS system (Mediator EnvirOnment for Multiple Information Sources) is an I3 framework designed for the integration of structured and semi-structured data sources.
Main phases: Wrapping, Lexical Annotation, Common Thesaurus Generation, Global Schema Generation.
[Figure: MOMIS architecture. Local schemata 1..N (e.g. an RDB) are obtained by wrapping; manual and automatic lexical annotation associate schema elements with synsets; the Common Thesaurus is generated from lexical, schema-derived, inferred, and user-supplied relationships; global schema generation (clusters generation) produces the global classes and the mapping tables.]

Open Problems and Contributions: Automatic Lexical Annotation
[Figure: example schemata S1 and S2 with labels such as CLIENT, CLIENT_ID, NAME, ADDRESS, COUNTRY, CITY, STREET_ADDRESS, PURCHASE_ORDER, PO_ID, PRODUCT_CODE, QTY, TSP_INFO, INVOCE_NR, PRICE]
- Manual annotation is a tedious task that does not scale: we need a method to perform automatic or semi-automatic annotation
- Non-dictionary words, i.e., Compound Nouns (CNs), abbreviations, and acronyms: schema labels need to be normalized
- Fully automatic annotation (i.e., "on-the-fly") is intrinsically uncertain: uncertain annotations need to be dealt with

Word Sense Disambiguation for Semi-Automatic Lexical Annotation
WSD (Word Sense Disambiguation) is the ability to identify the meanings of words in context by means of a computational technique [R. Navigli, Word sense disambiguation: A survey. ACM Comput. Surv., 2009]
The semi-automatic CWSD (Combined Word Sense Disambiguation) method:
- associates one or more WordNet meanings with each label
- combines two WSD algorithms:
  - SD (Structural Disambiguation), which exploits the schema-derived relationships
  - WND (WordNet Domains Disambiguation), which exploits WordNet Domains [B. Magnini, et al., The role of domain information in Word Sense Disambiguation, Journal of Natural Language Engineering, 2002]

The CWSD method
[Figure: CWSD workflow. Automatic wrapping extracts the class and attribute names and the schema-derived relationships from the sources; the integration designer selects the relevant domains; the SD and WND algorithms are combined in CWSD to produce the annotated schemata and the lexical relationships, which feed the Common Thesaurus.]

Experimental Evaluation
We evaluated CWSD on a real data set: three levels of a subtree of the Yahoo and Google directories ("society and culture" and "society", respectively).
- SD: recall 0.08, precision 0.97, F-measure 0.15
- WND: recall 0.67, precision 0.70, F-measure 0.68
- CWSD: 0.74
Publications related to CWSD:
- S. Bergamaschi, L. Po, S. Sorrentino. Automatic Annotation in Data Integration Systems. OTM Workshops 2007
- S. Bergamaschi, L. Po, A. Sala, S. Sorrentino. Data source annotation in data integration systems. DBISP2P 2007

Schema Label Normalization
Schema label normalization is the reduction of each label to some standardized form that can be easily recognized.
In our case: the process of abbreviation expansion and CN (Compound Noun) annotation.
[Figure: (a) relationships discovered without schema normalization; (b) relationships discovered with schema normalization, e.g. PO SYN PurchaseOrder. Legend: right relationship, false negative relationship, false positive relationship.]

The Schema Label Normalization method
We propose a semi-automatic schema label normalization method composed of three phases:
1. Label Preprocessing
   - selecting the labels to be normalized
   - tokenizing labels into separate words
   - identifying abbreviations and CNs among the tokenized words
2. Abbreviation Expansion (see Maciej Gawinecki's presentation)
3. CN Annotation
   - interpreting CNs
   - creating new WordNet entries and meanings for the CNs
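A minimal Python sketch of the tokenization and abbreviation expansion steps is given below. It is not the NORMS implementation: the splitting rules (underscores, camelCase, digits), the `ABBREVIATIONS` dictionary, and the function names are assumptions chosen for illustration, and a real system would select the expansion from context or from external resources rather than from a fixed table.

```python
import re

# Hypothetical user-supplied expansions; real expansions would come from
# dictionaries, the schema context, or designer input.
ABBREVIATIONS = {
    "po": ["purchase", "order"],
    "qty": ["quantity"],
    "nr": ["number"],
}


def tokenize(label):
    """Split a schema label on separators, camelCase boundaries, and digits."""
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", label):
        # Uppercase runs (acronyms), capitalized or lowercase words, digit runs.
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
    return [t.lower() for t in tokens]


def normalize(label, abbreviations=ABBREVIATIONS):
    """Tokenize a label and expand the tokens found in the abbreviation table."""
    words = []
    for token in tokenize(label):
        words.extend(abbreviations.get(token, [token]))
    return words


print(tokenize("PurchaseOrder"))  # ['purchase', 'order']
print(normalize("PO_ID"))         # ['purchase', 'order', 'id']
```

After normalization, "PO" and "PurchaseOrder" reduce to the same word sequence, which is what allows the SYN relationship between them to be discovered.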

CN Annotation
Compound Noun (CN): a term composed of two or more words, called constituents.
Endocentric CNs consist of a head (i.e., the part that contains the basic meaning of the CN) and modifiers, which restrict this meaning, e.g. "delivery company".
Our method can be summed up in four main steps.

CN constituent disambiguation & pruning
1. CN constituent disambiguation
   - head and modifiers disambiguation: by applying CWSD
2. Redundant constituent identification and pruning
   - redundant words: words that do not contribute new information, i.e., information that can already be derived from the schema or from the lexical thesaurus
   - e.g., for the attribute "company address" of the class "company", the constituent "company" is not considered, because the relationship holding between a class and its attributes is implicit in the schema
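The schema-derived case of the pruning step can be sketched as follows. This is a simplification under stated assumptions: it compares plain word strings rather than disambiguated senses, it ignores redundancy derived from the lexical thesaurus, and the fallback of keeping the original constituents when everything would be pruned is my own choice, not something stated in the slides.

```python
def prune_redundant(attribute_words, class_words):
    """Drop attribute constituents that merely repeat the owning class name."""
    class_set = set(class_words)
    kept = [w for w in attribute_words if w not in class_set]
    # Assumed fallback: never prune a label down to nothing.
    return kept or list(attribute_words)


print(prune_redundant(["company", "address"], ["company"]))  # ['address']
```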

CN interpretation via semantic relationships
3. CN interpretation: selecting, from a set of predefined semantic relationships, the one that best captures the relationship between the head and the modifier. In our case the set consists of Levi's nine relationships (CAUSE, HAVE, MAKE, IN, FOR, ABOUT, USE, BE, FROM) [Levi, J. N., The Syntax and Semantics of Complex Nominals. Academic Press, 1978].
Intuition: the semantic relationship between head and modifier is the same as the one holding between their unique beginners (i.e., the 25 top concepts in the WordNet noun hierarchy), so we manually select the correct Levi's relationship only for each pair of unique beginners.
[Figure: Company#1 descends, via Institution#1, from the unique beginner Group#1; Delivery#1 descends, via Transport#1, from the unique beginner Act#2; the relationship MAKE, selected for this pair of unique beginners, is used to interpret the CN.]
Why Levi's relationships?
- they are suitable for interpreting pairs of unique beginners
- a detailed, fine-grained interpretation is not required in our context
- they can be used during the CN gloss definition
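The intuition above amounts to a table lookup keyed by pairs of unique beginners. The Python sketch below illustrates it on the "delivery company" example; the hypernym chains are shortened toy data (the intermediate synsets elided in the figure are left out), and the table `LEVI_BY_BEGINNERS` and the function names are hypothetical.

```python
# Shortened toy hypernym chains for the running example (assumed, not full WordNet).
HYPERNYM = {
    "company#1": "institution#1",
    "institution#1": "group#1",
    "delivery#1": "transport#1",
    "transport#1": "act#2",
}

# Manually filled once per (head beginner, modifier beginner) pair.
LEVI_BY_BEGINNERS = {
    ("group#1", "act#2"): "MAKE",
}


def unique_beginner(synset):
    """Climb the hypernym chain up to its top concept."""
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
    return synset


def interpret_cn(head, modifier):
    """Return the Levi relationship chosen for the constituents' unique beginners."""
    key = (unique_beginner(head), unique_beginner(modifier))
    return LEVI_BY_BEGINNERS.get(key)


print(interpret_cn("company#1", "delivery#1"))  # MAKE
```

Because only pairs of the 25 unique beginners need a manual choice, the manual effort is bounded by the size of this table and is independent of the number of CNs in the schemata.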

Creation of a new WN meaning for a CN
4.a Gloss definition: head gloss + relationship + modifier gloss
- Company#1 (head) gloss: "an institution created to conduct business"
- Delivery#1 (modifier) gloss: "the act of delivering or distributing something"
- Relationship: MAKE
- Delivery_Company gloss: "an institution created to conduct business make the act of delivering or distributing something"
4.b Inclusion of the new CN meaning in WN
[Figure: the new meaning Delivery_Company#1 is added to WN and linked to the synsets of its constituents, Company#1 and Delivery#1, through a hypernym/hyponym relationship and a related-term relationship.]
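Step 4 can be sketched as a small function that builds the new entry. The gloss composition follows the slide's example directly; the shape of the returned entry, and in particular the reading that the head supplies the hypernym link and the modifier the related-term link, is an assumption made for illustration.

```python
def compose_gloss(head_gloss, relationship, modifier_gloss):
    """Join the head gloss and the modifier gloss through the relationship name."""
    return f"{head_gloss} {relationship.lower()} {modifier_gloss}"


def new_cn_entry(name, head, modifier, relationship, glosses):
    """Build a dictionary describing the new CN meaning (hypothetical structure)."""
    return {
        "synset": f"{name}#1",
        "gloss": compose_gloss(glosses[head], relationship, glosses[modifier]),
        "hypernym": head,       # assumed: the CN is a kind of its head
        "related_to": modifier  # assumed: the CN is related to its modifier
    }


glosses = {
    "Company#1": "an institution created to conduct business",
    "Delivery#1": "the act of delivering or distributing something",
}
entry = new_cn_entry("Delivery_Company", "Company#1", "Delivery#1", "MAKE", glosses)
print(entry["gloss"])
```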

Experimental Evaluation
Evaluation over five different data sets (including relational and XML schemata).
Evaluating the lexical annotation process:
                                                Precision  Recall  F-Measure
Lexical Annotation without Normalization        0.78       0.36    0.49
Lexical Annotation with Normalization           0.71       0.66    0.68
Evaluating the discovered lexical relationships:
                                                Precision  Recall  F-Measure
Relationships discovered without Normalization  0.52       0.47    0.49
Relationships discovered with Normalization     0.79       0.75    0.77
Publications related to Schema Label Normalization:
- S. Sorrentino, S. Bergamaschi, M. Gawinecki, L. Po, Schema Label Normalization for Improving Schema Matching, DKE Journal, 2010
- S. Sorrentino, S. Bergamaschi, M. Gawinecki, L. Po, Schema Label Normalization for Improving Schema Matching, ER 2009

Uncertainty in Automatic Annotation
In automatic lexical annotation, uncertainty is assessed in terms of probability.
The PWSD (Probabilistic Word Sense Disambiguation) algorithm:
- automatically associates one or more WordNet meanings with schema labels
- automatically assigns to each annotation a probability value that indicates the reliability of the annotation itself
- is based on a probabilistic combination of different WSD algorithms
- uses the Dempster-Shafer theory [Shafer, G., A Mathematical Theory of Evidence, Princeton 1976] to combine the results of the different WSD algorithms

Example
[Figure: PWSD example. The schema labels (e.g. Source1.Book, Source2.Brochure, Source2.Book Heading) are annotated by N WSD algorithms, each with its own confidence (WSD algorithm 1: 70%, WSD algorithm 2: 60%, WSD algorithm 3: 50%, ...). For each label, a table records which candidate meanings (label#1, label#2, label#3, ...) each algorithm selected. The outputs are combined with the Dempster-Shafer theory into annotations with probability values; annotations shown: book#1, book#3, brochure#1, heading#2; probability values shown: 0.65, 0.17, 0.60, 0.48.]

Probabilistic Lexical Relationships
Starting from the probabilistic annotations, PWSD derives a set of probabilistic lexical relationships between schema elements.
[Figure: relationships obtained with WordNet First Sense compared with those obtained with PWSD; probability values shown: 0.42, 0.38, 0.40, 0.57, 0.56, 0.39, 0.62, 0.51, 0.78, 0.64, 0.23.]

Experimental Results
Evaluation on 2 relational schemata of the Amalgam integration benchmark and 3 ontologies from the OAEI'06 benchmark.
Evaluating the lexical annotation process:
WSD method           Precision  Recall  F-Measure
WordNet First Sense  0.75       0.54    0.63
PWSD*                0.63       0.73    0.68
* Threshold = 0.2
Evaluating the discovered lexical relationships:
WSD method           Precision  Recall  F-Measure
WordNet First Sense  0.80       0.65    0.72
PWSD*                0.80       0.71    0.75
* Threshold = 0.15
Publications related to PWSD:
- L. Po, S. Sorrentino, Automatic generation of probabilistic relationships for improving schema matching, Information Systems Journal, 2011
- L. Po, S. Sorrentino, S. Bergamaschi, D. Beneventano, Lexical knowledge extraction: an effective approach to schema and ontology matching, ECKM 2009

NORMS and ALA
- The Schema Label Normalization functionalities have been implemented in a tool called NORMS (NORMalizer of Schemata), which allows the designer to refine the normalized labels by correcting potential errors [S. Sorrentino, S. Bergamaschi, M. Gawinecki, NORMS: an automatic tool to perform schema label normalization, ICDE 2011]
- CWSD and PWSD have been implemented in a tool called ALA (Automatic Lexical Annotator), which has been integrated within the MOMIS system [S. Bergamaschi, L. Po, S. Sorrentino, A. Corni, Dealing with Uncertainty in Lexical Annotation, ERPD 2009]

Conclusion
- Automatic and semi-automatic methods to perform label normalization and lexical annotation have been presented: CWSD, Schema Label Normalization, PWSD
- Automatic methods to extract (probabilistic) lexical relationships have been proposed, and their effectiveness in improving schema matching has been shown
- All the methods have been implemented in the context of the MOMIS Data Integration System; however, they can be applied to schema and ontology matching in general

Future Work
New research lines:
- inclusion and integration of other knowledge resources for automatic lexical annotation:
  - domain-specific resources, such as domain ontologies and domain thesauri, to address the problem of domain-specific terms in schemata (e.g., the biomedical term "aromatase", an enzyme involved in the production of estrogen)
  - generic resources: Wikipedia, dictionaries, etc.
- inclusion of instance-information extraction techniques to improve the automatic annotation and relationship discovery processes and to solve the problem of non-informative schema labels
  - a promising direction is the use of RELEVANT [S. Bergamaschi, C. Sartori, F. Guerra, M. Orsini, Extracting Relevant Attribute Values for Improved Search. IEEE Internet Computing 2007], a tool that extracts metadata about the relevant instance values of an attribute and adds them to the schema

Publications
Journals:
- Po, L. and Sorrentino, S. (2011). Automatic generation of probabilistic relationships for improving schema matching. Information Systems Journal, Special Issue on Semantic Integration of Data, Multimedia, and Services, 36(2):192-208.
- Sorrentino, S., Bergamaschi, S., Gawinecki, M., and Po, L. (2010). Schema label normalization for improving schema matching. DKE Journal, 69(12):1254-1273.
International Conferences and Workshops:
- ICDE 2011: Sorrentino, S., Bergamaschi, S., and Gawinecki, M. (2011). NORMS: an automatic tool to perform schema label normalization. In Press, Accepted Manuscript (Demo Paper), IEEE International Conference on Data Engineering, ICDE 2011, April 11-16, Hannover.
- ER 2009: Sorrentino, S., Bergamaschi, S., Gawinecki, M., and Po, L. (2009). Schema normalization for improving schema matching. In Proceedings of the 28th International Conference on Conceptual Modeling, ER 2009, Gramado, Brazil, 9-12 November, pages 280-293.
- IEEE NLP-KE: Beneventano, D., Bergamaschi, S., and Sorrentino, S. (2009). Extending WordNet with compound nouns for semi-automatic annotation in data integration systems. In Proceedings of the IEEE NLP-KE Conference, Dalian, China, 24-27 September 2009.
- ER 2009 Poster and Demonstrations: Bergamaschi, S., Po, L., Sorrentino, S., and Corni, A. (2009). Dealing with Uncertainty in Lexical Annotation. Revista de Informática Teórica e Aplicada, RITA, ER 2009 Poster and Demonstrations Session, 16(2):93-96.

Publications
- IEEE ICSC 2009: Beneventano, D., Orsini, M., Po, L., Antonio, S., and Sorrentino, S. (2009). An ontology-based data integration system for data and multimedia sources. In Proceedings of the Third International Conference on Semantic Computing, IEEE ICSC 2009, Berkeley, CA, USA, September 14-16, pages 606-611. IEEE Computer Society.
- ISDSI 2009: Beneventano, D., Orsini, M., Po, L., and Sorrentino, S. (2009). The MOMIS-STASIS approach for Ontology-Based Data Integration. In Proceedings of the 1st International Workshop on Interoperability through Semantic Data and Service Integration, ISDSI 2009, Camogli (GE), Italy, June 25.
- ECKM 2009: Po, L., Sorrentino, S., Bergamaschi, S., and Beneventano, D. (2009). Lexical knowledge extraction: an effective approach to schema and ontology matching. Proceedings of the European Conference on Knowledge Management, ECKM 2009, 3-4 September, Vicenza.
- DBISP2P: Bergamaschi, S., Po, L., Sala, A., and Sorrentino, S. (2007). Data source annotation in data integration systems. In Proceedings of the Fifth International Workshop on Databases, Information Systems and Peer-to-Peer Computing, DBISP2P, at the 33rd International Conference on Very Large Data Bases (VLDB 2007), University of Vienna, Austria, September 24.
- OTM Workshops: Bergamaschi, S., Po, L., and Sorrentino, S. (2007). Automatic Annotation in Data Integration Systems. In Proceedings of the OTM Workshops, Portugal, November 27-28.

Publications
National Conferences:
- ITAIS: S. Bergamaschi, L. Po, S. Sorrentino, A. Corni, "Uncertainty in data integration systems: automatic generation of probabilistic relationships", VI Conference of the Italian Chapter of AIS, ITAIS 2009, Costa Smeralda, Italy, October 2-3, 2009.
- SEBD: S. Bergamaschi, S. Sorrentino, "Semi-automatic compound nouns annotation for data integration systems", Proceedings of the 17th Italian Symposium on Advanced Database Systems, SEBD 2009, Camogli (Genova), Italy, 21-24 June 2009.
- SEBD: S. Bergamaschi, L. Po, and S. Sorrentino, "Automatic annotation for mapping discovery in data integration systems", Proceedings of the Sixteenth Italian Symposium on Advanced Database Systems, SEBD 2008, Mondello (Palermo), Italy, 22-25 June 2008 (pp. 334-341).
Book Chapters:
- Bergamaschi, S., Beneventano, D., Po, L., Sorrentino, S. (2011). Automatic Schema Mapping through Normalization and Annotation. In Press, in Second Search Computing Workshop: Challenges and Directions, 2010, LNCS State-of-the-Art Survey.
- Bergamaschi S., Po L., Sorrentino S., Corni A. "Uncertainty in data integration systems: automatic generation of probabilistic relationships", to appear in Management of the Interconnected World (A. D'Atri, M. De Marco, A. Maria Braccini, F. Cariddu eds.), Springer, ISBN/ISSN: 978-3-7908-2403-2, 2010.

Projects
- NeP4B - Networked Peers for Business, MIUR funded research project, FIRB 2005 (2006-2009) (http://www.dbgroup.unimo.it/nep4b)
- STASIS - SofTware for Ambient Semantic Interoperable Services, Project FP6-2005-IST-5-034980 (2006-2008) (http://www.dbgroup.unimo.it/stasis/)
- "Searching for a needle in mountains of data!", project funded by the Fondazione Cassa di Risparmio di Modena within the Bando di Ricerca Internazionale (2008-2010) (http://www.dbgroup.unimo.it/keymantic)

Thanks for your attention!

Evaluation Measures
- TP: True Positive
- FP: False Positive
- FN: False Negative
- TN: True Negative
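The formulas on this slide did not survive in the transcript. Assuming the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure as the harmonic mean of the two), the measures can be computed as in the Python sketch below; the function names are mine. The F-measure values reported in the evaluation tables of this presentation are consistent with this definition.

```python
def precision(tp, fp):
    """Fraction of returned items that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of correct items that are returned."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Reproduces the F-measure of the normalized-relationship row (P = 0.79, R = 0.75).
print(round(f_measure(0.79, 0.75), 2))  # 0.77
```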

Unique beginners
The top-level concepts of the WordNet hierarchy are the 25 unique beginners (e.g., act, animal, artifact, etc.) for WordNet English nouns, defined in [Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K., WordNet: An on-line lexical database. International Journal of Lexicography, 1990]

Levi's relationships set
M = Modifier, H = Head
[Levi, J. N., The Syntax and Semantics of Complex Nominals. Academic Press, 1978]

Dempster-Shafer theory
The Dempster-Shafer theory is a mathematical theory of evidence. It allows evidence from different sources to be combined:
- using this theory, for each algorithm we assign a probability mass function m(·) to the set of all possible meanings for the term under consideration
- the mass functions of the WSD algorithms are combined by using Dempster's rule of combination
- in the end, to obtain the probability assigned to each meaning, the belief mass assigned to a set of meanings is split among the meanings in the set
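The three steps above can be sketched in Python as follows. This is an illustration, not the PWSD code: each algorithm's confidence is modelled as mass on the set of meanings it selects, with the remainder on the whole frame, and the final split is assumed to be uniform over the members of each set (the slide does not say how the mass is split). The meanings `book#1`..`book#3` and the confidences are example values.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions (frozenset -> mass)."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + x * y
        else:
            conflict += x * y  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


def split(mass):
    """Assumed uniform split of each set's mass among its meanings."""
    probs = {}
    for meanings, value in mass.items():
        for meaning in meanings:
            probs[meaning] = probs.get(meaning, 0.0) + value / len(meanings)
    return probs


frame = frozenset({"book#1", "book#2", "book#3"})
# Algorithm 1 (70% confidence) selects book#1; algorithm 2 (60%) selects book#1 and book#3.
m1 = {frozenset({"book#1"}): 0.7, frame: 0.3}
m2 = {frozenset({"book#1", "book#3"}): 0.6, frame: 0.4}

probs = split(combine(m1, m2))
print({k: round(v, 2) for k, v in sorted(probs.items())})
# {'book#1': 0.83, 'book#2': 0.04, 'book#3': 0.13}
```

Putting the unassigned mass on the whole frame is what lets an algorithm express "I do not know" without ruling out any meaning, so a meaning no algorithm selected still receives a small non-zero probability.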
