Entity Profiling with Varying Source Reliabilities


1 Entity Profiling with Varying Source Reliabilities
Furong Li, Mong Li Lee, Wynne Hsu {furongli, leeml, comp.nus.edu.sg National University of Singapore

2 Outline Motivation Proposed Method Performance Study Conclusion
The presentation has four parts; we start with the motivation.

3 Entities in Multiple Sources
A real-world entity's information is, more often than not, published by multiple data sources. Here we see a restaurant's own webpage, its Urbanspoon entry, and its Foursquare page. Our first observation is that different data sources provide various representations of the same name. Various name representations

4 Entities in Multiple Sources
Second, a source may contain erroneous attribute values: here the phone numbers differ, and one of them is wrong. Various name representations Erroneous attribute values

5 Entities in Multiple Sources
Third, a source may publish incomplete information: this page lists no opening hours. Various name representations Erroneous attribute values Incomplete information

6 Entities in Multiple Sources
Furthermore, there are ambiguous references: we can see a page with a very similar name that actually refers to a different entity. Various name representations Erroneous attribute values Incomplete information Ambiguous references

7 What We Want Entity Profiling Various name representations
Erroneous attribute values Incomplete information Ambiguous references Given information published by different data sources with these properties, we want a complete and accurate picture of each real-world entity, obtained by fusing the information above. We call this task entity profiling.

8 The Problem Involves Two Tasks
Record linkage [Getoor et al, VLDB’12], [Negahban et al, CIKM’12] Truth discovery [Li et al, VLDB’13], [Yin et al, KDD’07] (Figure: records for “Frank” from LocalEats, FourSquare, and Urbanspoon.) First, record linkage: link real-world entities to the records published by the various sources. Second, truth discovery: find the true values among the data provided by different sources. You may realize that the second task relies on the output of the first. But such a pipeline solution may not work well: erroneous values may prevent correct linkages, and an incomplete picture of the data limits the effectiveness of truth discovery.

9 State-of-the-art Method
[Guo et al, VLDB’10]: assumes a set of (soft) uniqueness constraints and transforms records into attribute-value pairs. Limitations: it makes wrong associations when the percentage of erroneous values increases; it is computationally expensive; and the uniqueness constraints limit its generality. We therefore consider how to solve the two tasks simultaneously.

10 Outline Motivation Proposed Method Performance Study Conclusion

11 A Motivating Example matching -> profiles
Suppose we want to collate information on researchers in computer science. We could first obtain a list of researchers from the ACM website. Since the information in these reference records is limited, we would then look at other data sources, such as university home pages, object-level search engines, and LinkedIn, to harvest more information.

12 A Motivating Example matching -> profiles

13 The Example Tells Us The data sources are not equally reliable across different attributes. We introduce a reliability matrix 𝑀[𝑠,𝑎], where M[s,a] is the reliability of source s on attribute a; this lowers the impact of erroneous values on matching decisions. Moreover, rectifying errors in attribute values provides additional evidence for linking records, so we interleave the processes of record linkage and error correction.
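The reliability matrix M[s,a] can be estimated from correctness judgments on (source, attribute) observations. A minimal sketch in Python, assuming a simple accuracy-ratio estimator; the function name, observation format, and example data are illustrative, not taken from the paper:

```python
from collections import defaultdict

def estimate_reliability(observations, sources):
    """observations: iterable of (source, attribute, is_correct) triples.
    Returns M[s][a] = fraction of correct values source s gave for attribute a."""
    counts = defaultdict(lambda: [0, 0])  # (source, attr) -> [correct, total]
    for src, attr, ok in observations:
        counts[(src, attr)][1] += 1
        if ok:
            counts[(src, attr)][0] += 1
    M = {s: {} for s in sources}
    for (s, a), (good, total) in counts.items():
        M[s][a] = good / total
    return M

# Toy example: src1 gets the phone right 2 times out of 3, src2 is always right.
obs = [("src1", "phone", True), ("src1", "phone", True),
       ("src1", "phone", False), ("src2", "phone", True)]
M = estimate_reliability(obs, ["src1", "src2"])
```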

14 The Proposed Two-phase Method
The first phase, confidence-based matching, links each input record to one or more reference records. Based on these links, we obtain an initial assessment of the reliability of the sources. The second phase, adaptive matching, iteratively determines the correct attribute values and eliminates unlikely matches for the input records. This iterative process is exactly how we interleave record linkage and truth discovery.

15 Confidence Based Matching
Bootstrap the framework with a small set of confident matches; these serve two purposes. Initialize the reliability matrix based on the confident matches; for attributes not contained in the reference records, set the entry to the average of the source's row.
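The initialization with row-average fill for attributes absent from the reference records can be sketched as follows; the function and parameter names are assumed for illustration, not the paper's:

```python
def init_reliability(confident_accuracy, ref_attrs, all_attrs):
    """confident_accuracy: {source: {attr: accuracy measured on confident matches}}.
    Attributes outside ref_attrs get the average of the source's known row."""
    M = {}
    for src, row in confident_accuracy.items():
        known = {a: row[a] for a in ref_attrs if a in row}
        avg = sum(known.values()) / len(known)
        M[src] = {a: known.get(a, avg) for a in all_attrs}
    return M

# "hours" is absent from the reference records, so it gets the row average.
M = init_reliability({"src1": {"name": 0.9, "phone": 0.5}},
                     ref_attrs=["name", "phone"],
                     all_attrs=["name", "phone", "hours"])
```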

16 Confidence Based Matching
Given the confident matches, we want to distinguish sources. Reliable: {src1, src2, src3}, which produced records r5 and r4. Unreliable: {src4, src5}, which produced records r7, r8, r10.

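The reliable/unreliable split in the src1..src5 example can be sketched with a simple accuracy threshold on the confident matches; the threshold value and all names are assumptions:

```python
def partition_sources(accuracy, threshold=0.5):
    """accuracy: {source: accuracy on confident matches}.
    Sources at or above the threshold are deemed reliable."""
    reliable = {s for s, acc in accuracy.items() if acc >= threshold}
    return reliable, set(accuracy) - reliable

# Toy accuracies chosen to reproduce the slides' split.
reliable, unreliable = partition_sources(
    {"src1": 0.9, "src2": 0.8, "src3": 0.7, "src4": 0.2, "src5": 0.3})
```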

19 Adaptive Matching Cluster signatures <- current truth
Update the reliability matrix, then refine clusters by pruning the most dissimilar records. First, for each cluster $c$ obtained in the previous phase, we build its signature $H_c$ based on the accuracy of the attribute values of the records in $c$. The reliability matrix is updated accordingly. Then each record $r$ is compared with the signatures of the clusters it is associated with, and it is pruned from the cluster it is most dissimilar to. These steps repeat until the clusters no longer change. We also discount records belonging to multiple clusters, which lowers the impact of erroneous values on our matching decisions.
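The refinement step can be sketched as below. This uses a reliability-weighted majority vote as the cluster signature and a naive attribute-overlap similarity; the paper's actual scoring is richer, and all names here are illustrative:

```python
def cluster_signature(cluster, M):
    """cluster: list of (source, {attr: value}). For each attribute, pick the
    value whose supporting sources have the highest summed reliability."""
    votes = {}
    for src, rec in cluster:
        for a, v in rec.items():
            votes.setdefault(a, {}).setdefault(v, 0.0)
            votes[a][v] += M[src].get(a, 0.5)  # default for unseen attributes
    return {a: max(vals, key=vals.get) for a, vals in votes.items()}

def similarity(rec, sig):
    """Fraction of shared attributes on which the record agrees with the signature."""
    shared = set(rec) & set(sig)
    return sum(rec[a] == sig[a] for a in shared) / max(len(shared), 1)

def refine(cluster, M, threshold=0.5):
    """One refinement pass: prune the most dissimilar record if it falls
    below the threshold; callers repeat until the cluster stops changing."""
    sig = cluster_signature(cluster, M)
    worst = min(cluster, key=lambda sr: similarity(sr[1], sig))
    if similarity(worst[1], sig) < threshold:
        cluster = [c for c in cluster if c is not worst]
    return cluster, sig

M = {"src1": {"phone": 0.9}, "src2": {"phone": 0.8}, "src3": {"phone": 0.3}}
cluster = [("src1", {"phone": "123"}), ("src2", {"phone": "123"}),
           ("src3", {"phone": "999"})]
refined, sig = refine(cluster, M)
```

Here the two reliable sources outvote the unreliable one, so the signature keeps "123" and the src3 record is pruned.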

20 Outline Motivation Proposed Method Performance Study Conclusion

21 Comparative Methods PIPELINE MATCH [3] COMET
PIPELINE: record linkage [1] followed by truth discovery [2]. MATCH [3]: the state-of-the-art method. COMET: the proposed method.
[1] Negahban et al. Scaling multiple-source entity resolution using statistically efficient transfer learning. In CIKM, 2012.
[2] Yin et al. TruthFinder. In KDD, 2007.
[3] Guo et al. Record linkage with uniqueness constraints and erroneous values. In VLDB, 2010.

22 Results on Restaurant Dataset
The evaluation has two parts: the quality of the record matching and the quality of the constructed profiles.

23 %err: percentage of erroneous values
%ambi: percentage of records with abbreviated names. We vary these two variables; our method significantly improves recall.

24 %err: percentage of erroneous values
%ambi: percentage of records with abbreviated names

25 Scalability Experiments

26 Outline Motivation Proposed Method Performance Study Conclusion

27 Conclusion We address the problem of building entity profiles by collating data records from multiple sources in the presence of erroneous values. Our method interleaves record linkage with truth discovery, models varying source reliabilities, and reduces the impact of erroneous values on matching decisions.

28 Thanks! Q & A

