KD2R: a Key Discovery method for semantic Reference Reconciliation. Danai Symeonidou, Nathalie Pernelle and Fatiha Saïs, LRI (University Paris-Sud), WOD’2013.


1 KD2R: a Key Discovery method for semantic Reference Reconciliation
Danai Symeonidou, Nathalie Pernelle and Fatiha Saïs
LRI (University Paris-Sud)
WOD’2013, June 3rd

2 Data Linking
More and more heterogeneous RDF sources are available, and links can be asserted between them.
▫ sameAs is one of the most important types of links: it combines information given in different data sources.
▫ LOD: the number of already existing links is very small.
How can links be created automatically?
Danai Symeonidou, WOD’2013

3 Data Linking Problem
Three person descriptions (P1, P2, P3) spread over Dataset1 and Dataset2:
▫ FirstName: George, LastName: Thomson, SSN: , Job: Artist
▫ FirstName: George, LastName: Thomson, SSN: , Job: Professor
▫ FirstName: George, LastName: Thomson, SSN: , Age: 45

4 Data Linking Problem
Three person descriptions (P1, P2, P3) spread over Dataset1 and Dataset2, now with SameAs links:
▫ FirstName: George, LastName: Thomson, SSN: , Job: Artist
▫ FirstName: George, LastName: Thomson, SSN: , Job: Professor
▫ FirstName: George, LastName: Thomson, SSN: , Age: 45


6 Data Linking with or without key constraints
No knowledge given about the properties:
▫ All the properties have the same importance.
Knowledge given by an expert:
▫ Specific expert rules [Arasu et al.’09, Low et al.’01, Volz et al.’09 (Silk)]
  Example: max(jaro(phone-number; phone-number); jaro-winkler(SSN; SSN)) > 0.88
▫ Key constraints [Saïs, Pernelle and Rousset’09]
  Example: hasKey(Museum (museumName) (museumAddress))
OWL2 key for a class expression: a combination of (inverse) properties that uniquely identifies an entity
▫ hasKey( CE ( OPE1 ... OPEm ) ( DPE1 ... DPEn ) )
Example: hasKey(Museum (museumName) (museumAddress)) expresses:
Museum(x1) ∧ Museum(x2) ∧ museumName(x1, y) ∧ museumName(x2, y) ∧ museumAddress(x1, w) ∧ museumAddress(x2, w) ⇒ sameAs(x1, x2)
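The key rule on this slide can be illustrated with a small sketch. It assumes instances are stored as dictionaries mapping each property to a set of values; the museum records and the helper name `same_as_pairs` are hypothetical and only serve the illustration. Two instances are linked when they share at least one value for every property of the key.

```python
from itertools import combinations

def same_as_pairs(instances, key):
    """Return pairs of instance ids that share a value for every key property."""
    pairs = []
    for (id1, d1), (id2, d2) in combinations(sorted(instances.items()), 2):
        # A missing property yields an empty set, so the pair is not linked.
        if all(d1.get(p, set()) & d2.get(p, set()) for p in key):
            pairs.append((id1, id2))
    return pairs

# Hypothetical data, not taken from the slides.
museums = {
    "m1": {"museumName": {"Louvre"}, "museumAddress": {"Rue de Rivoli, Paris"}},
    "m2": {"museumName": {"Louvre"}, "museumAddress": {"Rue de Rivoli, Paris"}},
    "m3": {"museumName": {"Louvre"}, "museumAddress": {"Abu Dhabi"}},
}

links = same_as_pairs(museums, ["museumName", "museumAddress"])
print(links)
```

Only m1 and m2 agree on both key properties, so m3 stays unlinked even though it shares the name.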

7 Key Discovery Problem
Problem: when data sources contain numerous data and/or complex ontologies:
▫ Some keys are not obvious to find.
▫ Erroneous keys can be given by the expert.
Aim: automatic discovery of a complete set of keys from data.
Naïve automatic way to discover keys: examine all the possible combinations of properties.
▫ Example: given an instance described by 15 properties the number of candidate keys is =
▫ For each candidate key, all the instances of the data have to be scanned.
Objective: find keys efficiently by:
▫ Reducing the combinations
▫ Partially scanning the data
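The size of the naive search space can be checked with a short enumeration. The sketch below assumes that every non-empty subset of the properties counts as a candidate key; the property names p1 to p15 are placeholders.

```python
from itertools import combinations

def candidate_keys(properties):
    """Yield every non-empty combination of properties."""
    for size in range(1, len(properties) + 1):
        yield from combinations(properties, size)

properties = [f"p{i}" for i in range(1, 16)]  # 15 placeholder properties
count = sum(1 for _ in candidate_keys(properties))
print(count)  # 2**15 - 1 = 32767 under the assumption above
```

Since each of these candidates would require a scan of all instances, the cost grows exponentially with the number of properties, which is what motivates reducing the combinations.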

8 Key Discovery Problem
RDF data sources (conforming to an OWL 2 ontology).
Mappings between classes and properties of the different ontologies.
The open world assumption holds (incomplete data) and multivalued properties may exist.
How to discover keys when we do not know whether:
▫ i1 =?= i2 =?= i3 =?= i4
▫ hasFriend(i1, i4), hasFriend(i2, i3) … ??
▫ firstName(i1, Elodie) … ?

id   lastName   firstName   hasFriend
i1   Tompson    Manuel      i2, i3
i2   Tompson    Maria
i3   David      George      i2, i4
i4   Solgar     Michel

9 Key Discovery Problem: Assumptions
Unique Name Assumption (UNA): two different URIs refer to distinct entities (data sources generated from relational databases, Yago)
▫ i1 <> i2 <> i3 <> i4
Two literals that are syntactically different are semantically different
▫ e.g. “Napoleon Bonaparte” <> “Napoleon”

10 Key Discovery: Heuristics
Heuristic 1 - Pessimistic:
▫ Not instantiated property → all the values are possible
  Example: hasFriend(i2, i3), hasFriend(i4, i2) are possible.
▫ Instantiated property → only given values are considered
  Example: not hasFriend(i1, i4)
Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}
Undetermined keys: {hasFriend, lastName}

id   lastName   firstName   hasFriend
i1   Tompson    Manuel      i2, i3
i2   Tompson    Maria
i3   David      George      i2, i4
i4   Solgar     Michel

11 Key Discovery: Heuristics
Heuristic 2 - Optimistic:
▫ Not instantiated property → the value is not one of the already existing ones
  Example: not hasFriend(i2, i3), not hasFriend(i2, i1), not hasFriend(i2, i4).
▫ Instantiated property → only given values are considered
  Example: not hasFriend(i1, i4)
Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}, {hasFriend, lastName}

id   lastName   firstName   hasFriend
i1   Tompson    Manuel      i2, i3
i2   Tompson    Maria
i3   David      George      i2, i4
i4   Solgar     Michel
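The difference between the pessimistic and the optimistic heuristic can be reproduced on the example table with a brute-force pairwise check. This is only a sketch of the semantics as read from the slides, not the KD2R algorithm itself (which avoids scanning all pairs); the function names `agreement` and `classify` are hypothetical.

```python
from itertools import combinations

# Example table from the slides; a missing property is not instantiated.
DATA = {
    "i1": {"lastName": {"Tompson"}, "firstName": {"Manuel"}, "hasFriend": {"i2", "i3"}},
    "i2": {"lastName": {"Tompson"}, "firstName": {"Maria"}},
    "i3": {"lastName": {"David"}, "firstName": {"George"}, "hasFriend": {"i2", "i4"}},
    "i4": {"lastName": {"Solgar"}, "firstName": {"Michel"}},
}

def agreement(a, b, prop):
    """'yes' if both instances share a value, 'maybe' if one has none, else 'no'."""
    va, vb = a.get(prop), b.get(prop)
    if not va or not vb:
        return "maybe"
    return "yes" if va & vb else "no"

def classify(data, props, heuristic):
    """Classify a property set as 'non key', 'key' or 'undetermined'."""
    status = "key"
    for a, b in combinations(data.values(), 2):
        results = {agreement(a, b, p) for p in props}
        if results == {"yes"}:
            return "non key"  # two instances certainly agree on all properties
        # Pessimistic: a missing value could be anything, so the pair may collide.
        if heuristic == "pessimistic" and "no" not in results:
            status = "undetermined"
    return status

for props in (["lastName"], ["firstName"], ["hasFriend", "lastName"]):
    print(props, classify(DATA, props, "pessimistic"), classify(DATA, props, "optimistic"))
```

Under this reading, {hasFriend, lastName} is undetermined for the pessimistic heuristic because i1 and i2 share a last name and i2 has no hasFriend value, while the optimistic heuristic treats the missing value as different and accepts the set as a key.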

12 KD2R approach
Topological sort of the classes (subsumption)
Key Finder
▫ Discover non keys
  Ex: {lastName}, {hasFriend}
▫ Derive keys using non keys
  Ex: {firstName}, {lastName, firstName}, {firstName, hasFriend}, {hasFriend, lastName}
Key Merge
▫ Cartesian product of minimal key sets in S1, S2
  Ex: K_s1 = {firstName}, K_s2 = {hasFriend}, K_s1-s2 = {firstName, hasFriend}
Technical report available: https://www.lri.fr/~bibli/Rapports-internes/2013/RR1559.pdf
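The two derivation steps can be sketched as follows. The sketch assumes that keys are derived by taking the complement of each non key and combining one property from each complement (a Cartesian product), which reproduces the four keys listed on the slide; the exact procedure is described in the technical report. The function names are hypothetical.

```python
from itertools import product

def derive_keys(properties, non_keys):
    """Pick one property from the complement of every non key."""
    complements = [frozenset(properties) - frozenset(nk) for nk in non_keys]
    return {frozenset(choice) for choice in product(*complements)}

def minimal(key_sets):
    """Keep only the keys that have no proper subset among the keys."""
    return {k for k in key_sets if not any(other < k for other in key_sets)}

def merge_keys(keys_s1, keys_s2):
    """Key Merge: Cartesian product of the minimal key sets of two sources."""
    return minimal({k1 | k2 for k1, k2 in product(keys_s1, keys_s2)})

properties = ["lastName", "firstName", "hasFriend"]
derived = derive_keys(properties, [{"lastName"}, {"hasFriend"}])
merged = merge_keys({frozenset({"firstName"})}, {frozenset({"hasFriend"})})
print(sorted(sorted(k) for k in derived))
print(sorted(sorted(k) for k in merged))
```

A merged key is valid in both sources because it contains a key of each of them; the `minimal` filter then removes redundant supersets.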

13 KD2R approach: Key Finder
Computation of maximal non keys and undetermined keys
▫ Represent data in a prefix-tree (a compact representation of the data of one class)
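A prefix tree over the instances of one class could look like the following simplified sketch. It assumes one tree level per property, nested dictionaries as nodes, and sets of instance ids at the last level; the actual structure used by KD2R is described in the technical report and may differ. The function name `build_prefix_tree` is hypothetical.

```python
from itertools import product

def build_prefix_tree(data, properties):
    """Build a nested-dict prefix tree with one level per property.

    Multivalued properties insert the instance along every value;
    a non-instantiated property is represented by None.
    """
    root = {}
    for inst_id, description in data.items():
        value_lists = [sorted(description.get(p, {None}), key=str) for p in properties]
        for path in product(*value_lists):
            node = root
            for value in path[:-1]:
                node = node.setdefault(value, {})
            # The last level maps a value to the ids of the instances reaching it.
            node.setdefault(path[-1], set()).add(inst_id)
    return root

data = {
    "i1": {"lastName": {"Tompson"}, "firstName": {"Manuel"}},
    "i2": {"lastName": {"Tompson"}, "firstName": {"Maria"}},
    "i3": {"lastName": {"David"}, "firstName": {"George"}},
    "i4": {"lastName": {"Solgar"}, "firstName": {"Michel"}},
}
tree = build_prefix_tree(data, ["lastName", "firstName"])
print(tree["Tompson"])
```

Instances i1 and i2 share the prefix "Tompson", so the tree stores that value once and makes the shared value of lastName directly visible.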

14 Validation of approach
Datasets where KD2R has been tested:

Datasets                    RDF file        #instances   Optimistic / Pessimistic
OAEI Restaurants Dataset    Restaurant1     339          Yes
                            Restaurant2     1390         Yes
OAEI Persons Dataset        Person11        1000         Yes
                            Person12        1000         Yes
                            Person21        1200         Yes
DBpedia Dataset (*)         Person          763644       Yes / No
                            NaturalPlace    78400        Yes / No
                            BodyOfWater     34008        Yes / No
                            Lake            33348        Yes / No
googleFusion Dataset        G_Restaurant    372813       Yes
ChefMoz Dataset             C_Restaurant    1047         Yes

(*) properties instantiated in at least 80% of the data

15 Demo
Ontologies
▫ Data conforming to one ontology
RDF data
▫ DBpedia NaturalPlace dataset (78400 instances)
▫ OAEI Person dataset (2000 instances)
Data linking
▫ Link data using LN2R
▫ Measure quality of linking using: recall, precision, f-measure
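The three quality measures can be computed from a set of discovered links and a reference set of correct links. The sketch below uses the standard definitions and treats links as unordered pairs; the link data and the function name `linking_quality` are hypothetical and unrelated to the demo results.

```python
def linking_quality(found_links, reference_links):
    """Return (precision, recall, f_measure) for unordered sameAs links."""
    found = {frozenset(link) for link in found_links}
    reference = {frozenset(link) for link in reference_links}
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical links: two of the three found links are correct,
# and two of the four reference links are found.
found = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
reference = [("b1", "a1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")]
precision, recall, f_measure = linking_quality(found, reference)
print(precision, recall, f_measure)
```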

16 QUESTIONS???

17 THANK YOU!!!

