1
Reference Reconciliation in Complex Information Spaces Xin (Luna) Dong, Alon Halevy, Jayant Madhavan @ Sigmod 2005 University of Washington
2
Semex : Personal Information Management System MentionedIn(315) AuthorOfArticles(52) RecipientOfEmails(8547) SenderOfEmails(7595) Homepage(1)
3
Semex : Personal Information Management System Email Contacts(1145) Co-authors(24)
4
Semex : Personal Information Management System Authors FromFile CitedBy Cites(33) PublishedIn Article: Reference Reconciliation in Complex Information Spaces
5
Semex: Personal Information Management System Xin (Luna) Dong xin dong 董欣 xinluna dong luna dongxin x. dong Lab-#dong xin dong xin luna Names Emails
6
Semex Without Deduplication Search results for luna luna dong SenderOfEmails(3043) RecipientOfEmails(2445) MentionedIn(94) 23 persons
7
Semex Without Deduplication Search results for luna Xin (Luna) Dong AuthorOfArticles(49) MentionedIn(20) 23 persons
8
Semex Without Deduplication A Platform for Personal Information Management and Integration
9
Semex Without Deduplication 9 Persons: dong xin xin dong
10
Semex NEEDS Deduplication (Reference Reconciliation)
12
Complex Information Space Example – An Abstract View of Personal Information
Article:
  a1 = (“Distributed Query Processing”, “169-180”, {p1, p2, p3}, c1)
  a2 = (“Distributed query processing”, “169-180”, {p4, p5, p6}, c2)
Venue:
  c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2 = (“ACM SIGMOD”, “1978”, null)
Person:
  p1 = (“Robert S. Epstein”, null)
  p2 = (“Michael Stonebraker”, null)
  p3 = (“Eugene Wong”, null)
  p4 = (“Epstein, R.S.”, null)
  p5 = (“Stonebraker, M.”, null)
  p6 = (“Wong, E.”, null)
13
Complex Information Space Example – An Abstract View of Personal Information
Article:
  a1 = (“Distributed Query Processing”, “169-180”, {p1, p2, p3}, c1)
  a2 = (“Distributed query processing”, “169-180”, {p4, p5, p6}, c2)
Venue:
  c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2 = (“ACM SIGMOD”, “1978”, null)
Person:
  p1 = (“Robert S. Epstein”, null)
  p2 = (“Michael Stonebraker”, null)
  p3 = (“Eugene Wong”, null)
  p4 = (“Epstein, R.S.”, null)
  p5 = (“Stonebraker, M.”, null)
  p6 = (“Wong, E.”, null)
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”)
  p8 = (null, “stonebraker@csail.mit.edu”)
  p9 = (“mike”, “stonebraker@csail.mit.edu”)
Legend: Class, Reference, Atomic Attribute, Association Attribute
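A minimal sketch of how the references on this slide could be held in code. The `Reference` class and the attribute names (`title`, `pages`, `authoredBy`, `publishedIn`, `name`, `email`, and so on) are assumptions for illustration; the slide gives only positional tuples.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Reference:
    """One reference: a class label plus atomic and association attributes."""
    ref_id: str
    cls: str
    # Atomic attributes hold plain values; None stands for the slide's "null".
    atomic: Dict[str, Optional[str]] = field(default_factory=dict)
    # Association attributes hold ids of other references (multi-valued).
    assoc: Dict[str, Set[str]] = field(default_factory=dict)


# Attribute names are hypothetical; the values are taken from the slide.
c2 = Reference("c2", "Venue",
               {"name": "ACM SIGMOD", "year": "1978", "location": None})
p5 = Reference("p5", "Person", {"name": "Stonebraker, M.", "email": None})
a2 = Reference(
    "a2", "Article",
    atomic={"title": "Distributed query processing", "pages": "169-180"},
    assoc={"authoredBy": {"p4", "p5", "p6"}, "publishedIn": {"c2"}},
)
```

Keeping association attributes as sets of reference ids makes the links between instances explicit without nesting one object inside another.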
14
Other Complex Information Spaces Citation portals, e.g., Citeseer, Cora Online product catalogs in E-commerce
15
Real-World Objects
Article:
  a1 = (“Distributed Query Processing”, “169-180”, {p1, p2, p3}, c1)
  a2 = (“Distributed query processing”, “169-180”, {p4, p5, p6}, c2)
Venue:
  c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2 = (“ACM SIGMOD”, “1978”, null)
Person:
  p1 = (“Robert S. Epstein”, null)
  p2 = (“Michael Stonebraker”, null)
  p3 = (“Eugene Wong”, null)
  p4 = (“Epstein, R.S.”, null)
  p5 = (“Stonebraker, M.”, null)
  p6 = (“Wong, E.”, null)
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”)
  p8 = (null, “stonebraker@csail.mit.edu”)
  p9 = (“mike”, “stonebraker@csail.mit.edu”)
16
Reference Reconciliation
Input: A set of references R
Output: A partitioning over R, such that
  Each partition refers to a single real-world object (high precision)
  Different partitions refer to different objects (high recall)
17
Related Work
A very active area of research in databases, data mining, and AI
Most current approaches assume matching tuples from a single database table
Traditional approaches (surveyed in [Cohen et al., 2003]):
  Step I. Compare attributes
  Step II. Combine attribute similarities to decide tuple match/non-match
  Step III. Compute transitive closures to get partitions
New approaches explore the relationship between reconciliation decisions using probability models [Russell et al., 2002] [Domingos et al., 2004]
Harder for complex information spaces
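The three traditional steps can be sketched as follows. This is an illustrative sketch only: the string measure (`difflib.SequenceMatcher`), the plain averaging of attribute similarities, and the 0.8 threshold are assumptions, not choices taken from the surveyed systems.

```python
from difflib import SequenceMatcher
from itertools import combinations


def attr_sim(a, b):
    """Step I: compare two attribute values; None means 'cannot compare'."""
    if a is None or b is None:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuple_match(r1, r2, threshold=0.8):
    """Step II: combine attribute similarities (here: a plain average)."""
    sims = [s for s in (attr_sim(x, y) for x, y in zip(r1, r2)) if s is not None]
    return bool(sims) and sum(sims) / len(sims) >= threshold


def transitive_closure(n, matches):
    """Step III: union-find over matched pairs gives the partitions."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in matches:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def reconcile(records, threshold=0.8):
    matches = [(i, j) for i, j in combinations(range(len(records)), 2)
               if tuple_match(records[i], records[j], threshold)]
    return transitive_closure(len(records), matches)


records = [
    ("Eugene Wong", "eugene@berkeley.edu"),
    ("eugene wong", None),
    ("Michael Stonebraker", None),
]
partitions = reconcile(records)  # indices of records grouped together
```

Each record is compared with every other record of the same table, which is exactly the single-table assumption the slide points out.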
18
Challenges in Complex Information Spaces
Article:
  a1 = (“Distributed Query Processing”, “169-180”, {p1, p2, p3}, c1)
  a2 = (“Distributed query processing”, “169-180”, {p4, p5, p6}, c2)
Venue:
  c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2 = (“ACM SIGMOD”, “1978”, null)
Person:
  p1 = (“Robert S. Epstein”, null)
  p2 = (“Michael Stonebraker”, null)
  p3 = (“Eugene Wong”, null)
  p4 = (“Epstein, R.S.”, null)
  p5 = (“Stonebraker, M.”, null)
  p6 = (“Wong, E.”, null)
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”)
  p8 = (null, “stonebraker@csail.mit.edu”)
  p9 = (“mike”, “stonebraker@csail.mit.edu”)
Challenges:
  1. Multiple Classes
  2. Limited Information
  3. Multi-value Attributes
19
Intuition Complex information spaces can be considered as networks of instances and associations between the instances Key: exploit the network, specifically, the clues hidden in the associations
20
Outline Introduction and problem definition Reconciliation algorithm Experimental results Conclusions
21
Framework: Dependency Graph
References:
  p2 = (“Michael Stonebraker”, null, {p1, p3})
  p3 = (“Eugene Wong”, null, {p1, p2})
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”, {p8})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p9 = (“mike”, “stonebraker@csail.mit.edu”, null)
Nodes: (p2, p8); (p3, p7); (“Michael Stonebraker”, “stonebraker@”); (p1, p7); (“Michael Stonebraker”, p7); (p1, “stonebraker@csail.mit.edu”); (p3, “stonebraker@csail.mit.edu”)
Legend: Reference Similarity, Attribute Similarity
Labels: Compare contacts, Cross-attr similarity
22
Framework: Dependency Graph
References:
  p2 = (“Michael Stonebraker”, null, {p1, p3})
  p3 = (“Eugene Wong”, null, {p1, p2})
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”, {p8})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p9 = (“mike”, “stonebraker@csail.mit.edu”, null)
Reference-similarity nodes: (p2, p8); (p3, p7)
Attribute-similarity nodes: (“Michael Stonebraker”, “stonebraker@”)
Labels: Compare contacts, Cross-attr similarity
23
Framework: Dependency Graph
References:
  p2 = (“Michael Stonebraker”, null, {p1, p3})
  p3 = (“Eugene Wong”, null, {p1, p2})
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”, {p8})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p9 = (“mike”, “stonebraker@csail.mit.edu”, null)
Reference-similarity nodes: (p8, p9); (p2, p8); (p2, p9); (p3, p7)
Attribute-similarity nodes: (“Michael Stonebraker”, “mike”); (“Michael Stonebraker”, “stonebraker@”); (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”); (“Eugene Wong”, “Eugene Wong”)
24
Exploit the Dependency Graph
Reference-similarity nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2)
Attribute-similarity nodes: (“Distributed…”, “Distributed …”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)
25
Dependency Graph Example II
Reference-similarity nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2)
Attribute-similarity nodes: (“Distributed…”, “Distributed …”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)
Label: Compare authored papers
26
Strategy I. Consider Richer Evidence
Cross-attribute similarity: name and email
  p5 = (“Stonebraker, M.”, null)
  p8 = (null, “stonebraker@csail.mit.edu”)
Context Information I: contact list
  p5 = (“Stonebraker, M.”, null, {p4, p6})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p6 = p7
Context Information II: authored articles
  p2 = (“Michael Stonebraker”, null)
  p5 = (“Stonebraker, M.”, null)
  p2 and p5 authored the same article
27
Considering Only Attribute-wise Similarities Cannot Merge Persons Well
Person references: 24076; real-world persons (gold standard): 1750
Chart values: 1409, 3159
28
Considering Richer Evidence Improves the Recall
Person references: 24076; real-world persons: 1750
Chart values: 1409, 346
30
Exploit the Dependency Graph
Reference-similarity nodes: (a1, a2); (p1, p4); (p2, p5); (p3, p6); (c1, c2)
Attribute-similarity nodes: (“Distributed…”, “Distributed …”); (“169-180”, “169-180”); (“Robert S. Epstein”, “Epstein, R.S.”); (“Michael Stonebraker”, “Stonebraker, M.”); (“Eugene Wong”, “Wong, E.”); (“ACM …”, “ACM SIGMOD”); (“1978”, “1978”)
Legend: Reconciled, Similar
35
Strategy II. Propagate Information between Reconciliation Decisions
After changing the similarity score of one node, re-compute the similarity scores of its neighbors.
This process converges if:
  The similarity score is monotone in the similarity values of the neighbors
  Neighbor similarities are re-computed only if the similarity increase is not too small
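The propagation loop described on this slide can be sketched with a work queue. This is a minimal sketch under assumptions: the function name `propagate`, the `epsilon` cutoff value, the node names, and the averaging score function in the example are all hypothetical.

```python
from collections import deque


def propagate(neighbors, score_fn, scores, epsilon=0.01):
    """Re-compute neighbor scores whenever a node's score rises noticeably.

    neighbors: node -> list of adjacent nodes in the dependency graph
    score_fn:  (node, {neighbor: score}) -> proposed score for node
    scores:    node -> initial similarity score in [0, 1]
    """
    scores = dict(scores)
    queue = deque(scores)
    queued = set(scores)
    while queue:
        node = queue.popleft()
        queued.discard(node)
        proposed = score_fn(node, {m: scores[m] for m in neighbors.get(node, ())})
        # Scores never decrease and never exceed 1, which bounds the work.
        new = min(1.0, max(scores[node], proposed))
        increase = new - scores[node]
        scores[node] = new
        # Only a large enough increase triggers re-computation of neighbors.
        if increase > epsilon:
            for m in neighbors.get(node, ()):
                if m not in queued:
                    queue.append(m)
                    queued.add(m)
    return scores


# Toy graph: two attribute nodes with fixed scores, two reference nodes.
neighbors = {
    "title": ["a1a2"],
    "name": ["p2p5"],
    "a1a2": ["title", "p2p5"],
    "p2p5": ["name", "a1a2"],
}
initial = {"title": 1.0, "name": 0.6, "a1a2": 0.0, "p2p5": 0.0}
ref_nodes = {"a1a2", "p2p5"}


def mean_of_neighbors(node, neighbor_scores):
    if node not in ref_nodes:  # attribute nodes keep their initial score
        return initial[node]
    return sum(neighbor_scores.values()) / len(neighbor_scores)


result = propagate(neighbors, mean_of_neighbors, initial)
```

In the toy graph the article pair and the person pair pull each other upward until the increase per step drops below `epsilon`, at which point the queue drains and the loop stops.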
36
Propagating Information between Reconciliation Decisions Further Improves Recall
Person references: 24076; real-world persons: 1750
37
Strategy III. Enrich References in Reconciliation
Enrich knowledge of a real-world object for later reconciliation
Naïve approach: construct graph, compute similarity, compute transitive closure
Problems:
  Dependency-graph construction is expensive
  Reference enrichment does not take effect until the next pass
Solution: instant enrichment by adding neighbors in the dependency graph
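One way to read "instant enrichment by adding neighbors" is sketched below: once two references are reconciled, a node that involved the absorbed reference hands its neighbors to the corresponding node for the surviving reference and is then dropped. The graph representation (a dict of neighbor sets) and the function name are assumptions for illustration.

```python
def enrich_by_adding_neighbors(graph, keep, drop):
    """Move every neighbor of node `drop` onto node `keep`, then remove `drop`.

    graph: node -> set of neighboring nodes (undirected, stored both ways)
    """
    for nbr in graph.pop(drop, set()):
        graph[nbr].discard(drop)
        if nbr != keep:
            graph[keep].add(nbr)
            graph[nbr].add(keep)


# After p8 and p9 are reconciled, (p2, p9) is folded into (p2, p8).
name_email = ("Michael Stonebraker", "stonebraker@")
name_name = ("Michael Stonebraker", "mike")
graph = {
    ("p2", "p8"): {name_email},
    ("p2", "p9"): {name_name},
    name_email: {("p2", "p8")},
    name_name: {("p2", "p9")},
}
enrich_by_adding_neighbors(graph, keep=("p2", "p8"), drop=("p2", "p9"))
```

After the call, the (p2, p8) node sees both the name-to-email evidence and the name-to-name evidence, without rebuilding the graph in a second pass.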
38
Enrich References by Adding Neighbors
References:
  p2 = (“Michael Stonebraker”, null, {p1, p3})
  p3 = (“Eugene Wong”, null, {p1, p2})
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”, {p8})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p9 = (“mike”, “stonebraker@csail.mit.edu”, null)
Reference-similarity nodes: (p8, p9); (p2, p8); (p2, p9); (p3, p7)
Attribute-similarity nodes: (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”); (“Michael Stonebraker”, “mike”); (“Michael Stonebraker”, “stonebraker@”)
Legend: Reconciled, Similar
41
Enrich References by Adding Neighbors
References:
  p2 = (“Michael Stonebraker”, null, {p1, p3})
  p3 = (“Eugene Wong”, null, {p1, p2})
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”, {p8})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p9 = (“mike”, “stonebraker@csail.mit.edu”, null)
Reference-similarity nodes: (p8, p9); (p2, p8); (p3, p7)
Attribute-similarity nodes: (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”); (“Michael Stonebraker”, “mike”); (“Michael Stonebraker”, “stonebraker@”)
Legend: Reconciled, Similar
43
Reference Enrichment Improves Recall More than Information Propagation
Person references: 24076; real-world persons: 1750
44
Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall
Person references: 24076; real-world persons: 1750
Chart values: 1409, 125, 346
45
Outline Introduction and problem definition Reconciliation algorithm Experimental results Conclusions
46
Experiment Settings
Datasets:
  Four personal datasets
  Cora dataset for citations
  Use the same parameters and thresholds for all datasets
Measures:
  Precision and recall, F-measure
    Precision: the percentage of correctly reconciled reference pairs over all reconciled reference pairs
    Recall: the percentage of correctly reconciled reference pairs over pairs of references that refer to the same real-world object
  Diversity and dispersion
    Diversity: for every result partition, how many real-world objects are included; ideally 1 (related to precision)
    Dispersion: for every real-world object, how many result partitions include it; ideally 1 (related to recall)
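The measures defined on this slide can be computed from a result partitioning and a gold-standard partitioning, as in the sketch below. Function names are hypothetical, and averaging diversity over result partitions and dispersion over real-world objects is an assumption; the slide defines the per-partition and per-object counts but not how they are aggregated.

```python
from itertools import combinations


def same_partition_pairs(partitions):
    """All unordered pairs of references that share a partition."""
    return {frozenset(p) for part in partitions for p in combinations(part, 2)}


def precision_recall_f(result, gold):
    found, truth = same_partition_pairs(result), same_partition_pairs(gold)
    correct = len(found & truth)
    precision = correct / len(found) if found else 1.0
    recall = correct / len(truth) if truth else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def diversity_dispersion(result, gold):
    """Both partitionings must cover the same set of references."""
    object_of = {ref: i for i, part in enumerate(gold) for ref in part}
    partition_of = {ref: i for i, part in enumerate(result) for ref in part}
    diversity = sum(len({object_of[r] for r in part}) for part in result) / len(result)
    dispersion = sum(len({partition_of[r] for r in part}) for part in gold) / len(gold)
    return diversity, dispersion


gold = [["p2", "p5", "p8"], ["p3", "p6"]]
result = [["p2", "p5"], ["p8"], ["p3", "p6"]]
precision, recall, f_measure = precision_recall_f(result, gold)
diversity, dispersion = diversity_dispersion(result, gold)
```

In this example no partition mixes two persons, so precision and diversity are perfect, while the split of one person across two partitions lowers recall and raises dispersion.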
47
Recall Results on One Personal Dataset
Person references: 24076; real-world persons: 1750
Chart values: 1409, 125, 346
48
Results Considering All Occurrences of Person Instances
Datasets (#per/#ref): A (1750/24076), B (1989/36359), C (1570/15160), D (1518/17199), Avg
Attr-wise Matching
  Prec/Recall: 0.999/0.741, 0.974/0.998, 0.999/0.967, 0.894/0.998, 0.967/0.926
  F: 0.851, 0.986, 0.983, 0.943, 0.946
  #Par: 3159, 2154, 1660, 1579
Dependency Graph
  Prec/Recall: 0.999/0.999, 0.982/0.987, 0.999/0.920, 0.995/0.976
  F: 0.999, 0.985, 0.958, 0.986
  #Par: 1873, 2068, 1596, 1546
Both precision and recall increase compared with attr-wise matching.
49
Results Considering Only Distinct Person References
Dataset          Attr-wise Matching           Dependency Graph
(#per/#dist-ref) Prec/Recall  F      #Par     Prec/Recall  F      #Par
A (1750/3114)    0.995/0.509  0.673  3159     0.982/0.947  0.964  1873
B (1989/3211)    0.81/0.803   0.806  2154     0.958/0.891  0.923  2068
C (1570/2430)    0.987/0.782  0.873  1660     0.814/0.925  0.867  1596
D (1518/2188)    0.694/0.837  0.759  1579     0.942/0.737  0.827  1546
Avg              0.872/0.733  0.778           0.924/0.875  0.895
Precision and recall increase substantially compared with attr-wise matching.
50
Diversity and Dispersion Are Very Close to 1
Values are Diversity/Dispersion.
Dataset (#per/#ref)  Attr-wise Matching  Dependency Graph
A (1750/24076)       1.18/1.003          1.047/1.003
B (1989/36359)       1.067/1.01          1.039/1.008
C (1570/15160)       1.053/1.003         1.03/1.017
D (1518/17199)       1.041/1.004         1.023/1.005
Avg                  1.085/1.005         1.035/1.008
51
Our Algorithm Equals or Outperforms Attr-wise Matching in All Classes
Classes: Person, Article, Venue
Attr-wise Matching
  Precision: 0.967, 0.997, 0.935
  Recall: 0.926, 0.977, 0.790
Dependency Graph
  Precision: 0.995, 0.999, 0.987
  Recall: 0.976, 0.937
52
Results on Cora Dataset Are Competitive with Other Reported Results
Results reported in other record linkage papers:
  Precision/Recall = 0.990/0.925 [Cohen et al., 2002]
  Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]
  F-measure = 0.867 [Bilenko and Mooney, 2003]
Class    Attr-wise Matching    Dependency Graph
         Prec/Recall  F-msre   Prec/Recall  F-msre
Article  0.985/0.913  0.948    0.985/0.924  0.954
Person   0.994/0.985  0.989    1/0.987      0.993
Venue    0.982/0.362  0.529    0.837/0.714  0.771
53
Conclusions
Contributions: dependency-graph-based reconciliation algorithm
  Exploit rich evidence
  Propagate information between reconciliation decisions
  Enrich references during reconciliation
Extended work:
  Propagate negative information through the dependency graph
54
Reference Reconciliation in Complex Information Spaces Xin (Luna) Dong, Alon Halevy, Jayant Madhavan @ Sigmod 2005 http://data.cs.washington.edu/semex
55
Strategy IV. Enforce Constraints
Problem: [figure: P1, P2, P3]
Solution: propagate negative information (constraints)
Non-merge node: the two elements are guaranteed to be different and should never be merged
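A simplified sketch of what a non-merge constraint buys when partitions are formed: a candidate merge is skipped if it would place a non-merge pair in the same partition. This is not the propagation algorithm from the talk; the function name, the greedy processing order of `candidate_pairs`, and the choice of (p2, p9) as the constrained pair are assumptions for illustration.

```python
def merge_with_constraints(refs, candidate_pairs, non_merge):
    """Greedily merge candidate pairs unless a non-merge pair would be joined.

    refs:            iterable of reference ids
    candidate_pairs: pairs judged similar, in the order they should be tried
    non_merge:       pairs that are guaranteed to be different objects
    """
    cluster = {r: {r} for r in refs}
    blocked = {frozenset(p) for p in non_merge}
    for a, b in candidate_pairs:
        ca, cb = cluster[a], cluster[b]
        if ca is cb:
            continue  # already in the same partition
        if any(frozenset((x, y)) in blocked for x in ca for y in cb):
            continue  # merging would violate a constraint
        merged = ca | cb
        for r in merged:
            cluster[r] = merged
    unique = {id(c): c for c in cluster.values()}
    return sorted(sorted(c) for c in unique.values())


refs = ["p2", "p8", "p9"]
candidates = [("p8", "p9"), ("p2", "p8")]
constrained = merge_with_constraints(refs, candidates, non_merge=[("p2", "p9")])
unconstrained = merge_with_constraints(refs, candidates, non_merge=[])
```

Without the constraint, plain transitive closure would pull p2 into the same partition as p9 by way of p8; with it, that merge is refused.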
56
Enforce Constraints by Propagating Negative Information
References:
  p2 = (“Michael Stonebraker”, null, {p1, p3})
  p3 = (“Eugene Wong”, null, {p1, p2})
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”, {p8})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p9 = (“matt”, “stonebraker@csail.mit.edu”, null)
Reference-similarity nodes: (p2, p8); (p2, p9); (p3, p7); (p8, p9)
Attribute-similarity nodes: (“Michael Stonebraker”, “matt”); (“Michael Stonebraker”, “stonebraker@”); (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”)
57
Enforce Constraints by Propagating Negative Information
References:
  p2 = (“Michael Stonebraker”, null, {p1, p3})
  p3 = (“Eugene Wong”, null, {p1, p2})
  p7 = (“Eugene Wong”, “eugene@berkeley.edu”, {p8})
  p8 = (null, “stonebraker@csail.mit.edu”, {p7})
  p9 = (“matt”, “stonebraker@csail.mit.edu”, null)
Reference-similarity nodes: (p2, p8); (p2, p9); (p3, p7); (p8, p9)
Attribute-similarity nodes: (“Michael Stonebraker”, “matt”); (“Michael Stonebraker”, “stonebraker@”); (“stonebraker@csail.mit.edu”, “stonebraker@csail.mit.edu”)
Legend: Reconciled, Similar, Non-merge
Label: Constraint
61
Enforcing Constraints Improves Precision
Method         Precision  #(Entities reconciled with others incorrectly)
Constraint     0.999      13
No Constraint  0.947      61
62
Similarity Computation
Similarity function for node N: s(N)
  Input: similarity scores of N’s neighbors
  Output: similarity score of N, ranging from 0 to 1
The similarity function can be defined by applying domain knowledge, learning from training data, resorting to global knowledge, etc.
S = S_rv + S_sb + S_wb
  S_rv: from real-valued neighbors (decision-tree shape)
  S_sb: from strong-boolean-valued neighbors
  S_wb: from weak-boolean-valued neighbors
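The additive form S = S_rv + S_sb + S_wb can be sketched as below. Only the three-term sum and the 0-to-1 output range come from the slide. Everything else is a placeholder assumption: the plain average used for S_rv stands in for the decision-tree-shaped function, and the per-neighbor boosts `beta` and `gamma` are hypothetical constants.

```python
def similarity(real_valued, strong_boolean, weak_boolean, beta=0.3, gamma=0.05):
    """Combine neighbor evidence into one score in [0, 1].

    real_valued:    similarity scores of real-valued neighbors
    strong_boolean: True/False per strong-boolean neighbor (reconciled or not)
    weak_boolean:   True/False per weak-boolean neighbor (reconciled or not)
    beta, gamma:    hypothetical boosts per reconciled neighbor
    """
    # Placeholder for S_rv: a plain average instead of a decision tree.
    s_rv = sum(real_valued) / len(real_valued) if real_valued else 0.0
    s_sb = beta * sum(1 for reconciled in strong_boolean if reconciled)
    s_wb = gamma * sum(1 for reconciled in weak_boolean if reconciled)
    return min(1.0, s_rv + s_sb + s_wb)


capped = similarity([0.6, 0.8], [True], [True, False])
weak_only = similarity([0.5], [], [True])
empty = similarity([], [], [])
```

Because every term is non-decreasing in its inputs, this placeholder is monotone in the neighbor scores, which is the property the propagation step needs to converge.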
63
Framework: Dependency Graph
Definition:
  For every pair of references A and B: a node representing their similarity
  For every attribute of A and attribute of B: a node representing attribute similarity
  An edge between the attr-sim node and the ref-sim node, representing the dependency between the similarities
  Each node is associated with a similarity score between 0 and 1
Construction: include only nodes whose two elements have the potential to be similar
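A minimal sketch of the construction rule above: one node per reference pair, one node per attribute-value pair, an edge between them, and only pairs that pass a cheap candidate test are kept. The node tagging (`"ref"`, `"attr"`), the dict-of-sets graph, and the token-overlap test `could_match` are assumptions for illustration.

```python
from itertools import combinations


def tokens(value):
    """Lower-cased word tokens, splitting on spaces and on @ . , characters."""
    for ch in "@.,":
        value = value.replace(ch, " ")
    return set(value.lower().split())


def could_match(attr_a, value_a, attr_b, value_b):
    """Hypothetical candidate test: the two values share at least one token."""
    return bool(tokens(value_a) & tokens(value_b))


def build_dependency_graph(refs, candidate_test=could_match):
    """refs: reference id -> {attribute name: value or None}."""
    graph = {}
    for (id_a, a), (id_b, b) in combinations(refs.items(), 2):
        attr_nodes = [
            ("attr", va, vb)
            for ka, va in a.items() if va is not None
            for kb, vb in b.items() if vb is not None
            if candidate_test(ka, va, kb, vb)
        ]
        if not attr_nodes:
            continue  # no evidence that the pair could be similar
        ref_node = ("ref", id_a, id_b)
        graph.setdefault(ref_node, set())
        for node in attr_nodes:
            graph[ref_node].add(node)
            graph.setdefault(node, set()).add(ref_node)
    return graph


refs = {
    "p2": {"name": "Michael Stonebraker", "email": None},
    "p8": {"name": None, "email": "stonebraker@csail.mit.edu"},
    "p3": {"name": "Eugene Wong", "email": None},
}
graph = build_dependency_graph(refs)
```

Here only the pair (p2, p8) survives the candidate test, through the cross-attribute comparison of a name with an email address, so the graph stays far smaller than the full set of reference pairs.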