Entity Profiling with Varying Source Reliabilities

Slides:

Advertisements

Similar presentations

String Similarity Measures and Joins with Synonyms

Advertisements

An Adaptive Learning Method for Target Tracking across Multiple Cameras Kuan-Wen Chen, Chih-Chuan Lai, Yi-Ping Hung, Chu-Song Chen National Taiwan University.

A Unified Framework for Context Assisted Face Clustering

ICDE 2014 LinkSCAN*: Overlapping Community Detection Using the Link-Space Transformation Sungsu Lim †, Seungwoo Ryu ‡, Sejeong Kwon§, Kyomin Jung ¶, and.

CrowdER - Crowdsourcing Entity Resolution

Xin Luna Dong AT&T Labs-Research Joint work w. Laure Berti-Equille, Yifan Hu, Divesh

Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David.

+ Multi-label Classification using Adaptive Neighborhoods Tanwistha Saha, Huzefa Rangwala and Carlotta Domeniconi Department of Computer Science George.

1 Efficient Record Linkage in Large Data Sets Liang Jin, Chen Li, Sharad Mehrotra University of California, Irvine DASFAA, Kyoto, Japan, March 2003.

Declarative Analysis of Noisy Information Networks Walaa Eldin Moustafa Galileo Namata Amol Deshpande Lise Getoor University of Maryland.

Spectral graph reduction for image and streaming video segmentation Fabio Galasso 1 Margret Keuper 2 Thomas Brox 2 Bernt Schiele 1 1 Max Planck Institute.

Large-Scale Entity-Based Online Social Network Profile Linkage.

Tighter and Convex Maximum Margin Clustering Yu-Feng Li (LAMDA, Nanjing University, China) Ivor W. Tsang.

Online Data Fusion School of Computing National University of Singapore AT&T Shannon Research Labs Xuan Liu, Xin Luna Dong, Beng Chin Ooi, Divesh Srivastava.

Proactive Learning: Cost- Sensitive Active Learning with Multiple Imperfect Oracles Pinar Donmez and Jaime Carbonell Pinar Donmez and Jaime Carbonell Language.

NetSci07 May 24, 2007 Entity Resolution in Network Data Lise Getoor University of Maryland, College Park.

Unsupervised Feature Selection for Multi-Cluster Data Deng Cai et al, KDD 2010 Presenter: Yunchao Gong Dept. Computer Science, UNC Chapel Hill.

Non-Negative Tensor Factorization with RESCAL Denis Krompaß 1, Maximilian Nickel 1, Xueyan Jiang 1 and Volker Tresp 1,2 1 Department of Computer Science.

Efficient Processing of Top-k Spatial Keyword Queries João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg 1 SSTD 2011.

Time-dependent Similarity Measure of Queries Using Historical Click- through Data Qiankun Zhao*, Steven C. H. Hoi*, Tie-Yan Liu, et al. Presented by: Tie-Yan.

SAC’06 April 23-27, 2006, Dijon, France Towards Value Disclosure Analysis in Modeling General Databases Xintao Wu UNC Charlotte Songtao Guo UNC Charlotte.

1 Unsupervised Modeling of Object Categories Using Link Analysis Techniques Gunhee Kim Christos Faloutsos Martial Hebert Gunhee Kim Christos Faloutsos.

Multi-view Exploratory Learning for AKBC Problems Bhavana Dalvi and William W. Cohen School Of Computer Science, Carnegie Mellon University Motivation.

CBLOCK: An Automatic Blocking Mechanism for Large-Scale Deduplication Tasks Ashwin Machanavajjhala Duke University with Anish Das Sarma, Ankur Jain, Philip.

© 2013 IBM Corporation Efficient Multi-stage Image Classification for Mobile Sensing in Urban Environments Presented by Shashank Mujumdar IBM Research,

Large-Scale Cost-sensitive Online Social Network Profile Linkage.

K.U.Leuven Department of Computer Science Predicting gene functions using hierarchical multi-label decision tree ensembles Celine Vens, Leander Schietgat,

Cluster based fact finders Manish Gupta, Yizhou Sun, Jiawei Han Feb 10, 2011.

How Significant Is the Effect of Faults Interaction on Coverage Based Fault Localizations? Xiaozhen Xue Advanced Empirical Software Testing Group Department.

KDD 2012, Beijing, China Community Discovery and Profiling with Social Messages Wenjun Zhou Hongxia Jin Yan Liu.

Entity Resolution for Big Data Lise Getoor University of Maryland College Park, MD Ashwin Machanavajjhala Duke University Durham, NC

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant.

Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.

Online Data Fusion School of Computing National University of Singapore AT&T Shannon Research Labs Xuan Liu, Xin Luna Dong, Beng Chin Ooi, Divesh Srivastava.

Mining High Utility Itemset in Big Data

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.

Ground Truth Free Evaluation of Segment Based Maps Rolf Lakaemper Temple University, Philadelphia,PA,USA.

BEHAVIORAL TARGETING IN ON-LINE ADVERTISING: AN EMPIRICAL STUDY AUTHORS: JOANNA JAWORSKA MARCIN SYDOW IN DEFENSE: XILING SUN & ARINDAM PAUL.

Mingyang Zhu, Huaijiang Sun, Zhigang Deng Quaternion Space Sparse Decomposition for Motion Compression and Retrieval SCA 2012.

Entity Profiling with Varying Source Reliabilities Date: 2015/02/26 Author:Furong Li,Mong Li Lee,Wynne Hsu Source: KDD ’14 Advisor: Dr. Jia-Ling Koh Speaker:

Guided Learning for Role Discovery (GLRD) Presented by Rui Liu Gilpin, Sean, Tina Eliassi-Rad, and Ian Davidson. "Guided learning for role discovery (glrd):

A Fuzzy k-Modes Algorithm for Clustering Categorical Data

Dual Transfer Learning Mingsheng Long 1,2, Jianmin Wang 2, Guiguang Ding 2 Wei Cheng, Xiang Zhang, and Wei Wang 1 Department of Computer Science and Technology.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 2007.SIGIR.8 New Event Detection Based on Indexing-tree.

Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.

Marin Silic, Goran Delac and Sinisa Srbljic Prediction of Atomic Web Services Reliability Based on K-means Clustering Consumer Computing Laboratory Faculty.

A Multiresolution Symbolic Representation of Time Series Vasileios Megalooikonomou Qiang Wang Guo Li Christos Faloutsos Presented by Rui Li.

Scalable and Coordinated Scheduling for Cloud-Scale computing

Information Integration 15 th Meeting Course Name: Business Intelligence Year: 2009.

Rule-Based Method for Entity Resolution IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING JANUARY 2015.

Efficient Discovery of XML Data Redundancies Cong Yu and H. V. Jagadish University of Michigan, Ann Arbor - VLDB 2006, Seoul, Korea September 12 th, 2006.

Arizona State University1 Fast Mining of a Network of Coevolving Time Series Wei FanHanghang TongPing JiYongjie Cai.

Big Data Quality Panel Norman Paton University of Manchester.

Lecture 4: Data Integration and Cleaning CMPT 733, SPRING 2016 JIANNAN WANG.

Warren Shen, Xin Li, AnHai Doan Database & AI Groups University of Illinois, Urbana Constraint-Based Entity Matching.

Image Retrieval and Ranking using L.S.I and Cross View Learning Sumit Kumar Vivek Gupta

Click to edit Present’s Name AP-Tree: Efficiently Support Continuous Spatial-Keyword Queries Over Stream Xiang Wang 1*, Ying Zhang 2, Wenjie Zhang 1, Xuemin.

Collective Network Linkage across Heterogeneous Social Platforms

Preference Query Evaluation Over Expensive Attributes

Community-based User Recommendation in Uni-Directional Social Networks

Knowledge Base Completion

Record Linkage with Uniqueness Constraints and Erroneous Values

Efficient Record Linkage in Large Data Sets

Adaptive Data Refinement for Parallel Dynamic Programming Applications

Resource Allocation for Distributed Streaming Applications

Category-Sensitive Question Routing in Community Question Answering

Review 1+3= 4 7+3= = 5 7+4= = = 6 7+6= = = 7+7+7=

CS639: Data Management for Data Science

An Efficient Partition Based Method for Exact Set Similarity Joins

Improving Cross-lingual Entity Alignment via Optimal Transport

Presentation transcript:

Entity Profiling with Varying Source Reliabilities Furong Li, Mong Li Lee, Wynne Hsu {furongli, leeml, whsu} @ comp.nus.edu.sg National University of Singapore

Outline Motivation Proposed Method Performance Study Conclusion The presentation contains four parts and we start from motivation

Entities in Multiple Sources a real world entity's information is, more often than not, published by multiple data sources. Here we see a webpage of a restaurant, and its urbanspoon entry, and the foursquare page. Our first observation is, different data sources provide Various name representations Various name representations

Entities in Multiple Sources The source may contain erroneous attribute values The phone numbers are different and one of them is wrong Various name representations Erroneous attribute values

Entities in Multiple Sources No opening hours Third, a source may publish Incomplete information. Various name representations Erroneous attribute values Incomplete information

Entities in Multiple Sources Furthermore, there are Ambiguous references. We can see a page with very similar name, but actually refers to another entity Various name representations Erroneous attribute values Incomplete information Ambiguous references

What We Want Entity Profiling Various name representations Erroneous attribute values Incomplete information Ambiguous references Given the information published by different data sources, which has these properties We want to have a complete and accurate picture of a real world entity by fusing the information above And we call it entity profiling. Entity Profiling

The Problem Involves Two Tasks Record linkage [Getoor et al, VLDB’12], [Negahban et al, CIKM’12] Truth discovery [Li et al, VLDB’13], [Yin et al, KDD’07] 512-494-6916 (LocalEats) 512-494-6894 (FourSquare) (Urbanspoon) Frank First record linkage: link the real world entities and the records published by various sources Second truth discovery: Find true values from data provided by different sources Now you may realize that the second relies on the output of the first one. But such a pipeline solution may not work well The erroneous values may prevent correct linkages Incomplete picture of the data limits the effectiveness of truth discovery

State-of-the-art Method [Guo et al, VLDB’10] Assume a set of (soft) uniqueness constraints Transform records into attribute value pairs Limitations Make wrong associations when the percentage of erroneous values increases Computationally expensive The uniqueness constraint limits its generality So we consider how to simultaneously solve the two tasks The state of the art method…

Outline Motivation Proposed Method Performance Study Conclusion

A Motivating Example matching -> profiles √ √ ? X Suppose we want to collate information on researchers in Computer Science. We could first obtain a list of researchers from acm website. Since the information in these reference records is limited, we would look at other data sources, such as university home pages, object-level search engines, LinkedIn, to harvest more information. ? √ √ X

A Motivating Example matching -> profiles ? √ √ X

The Example Tells Us The data sources are not equally reliable among different attributes Introduce a reliability matrix 𝑀[𝑠,𝑎] Lower the impact of erroneous values on matching decisions Rectifying errors in attribute values provides additional evidence for linking records Interleave the processes of record linkage and error correction M[s,a] reliability of each source on each attribute

The Proposed Two-phase Method X X The first phase is a confidence based matching that links each input record to one or more reference records. Based on this, we obtain an initial assessment of the reliability of the sources. Then the second phase, adaptive matching, iteratively determine the correct attribute values and eliminate unlikely matches for the input records. This iterative process is exactly how we interleave record linkage and truth discovery.

Confidence Based Matching Bootstrap the framework with a small set of confident matches Initialize reliability matrix based on confident matches X Two proposes For attributes that are not contained in reference records, set to the average of the row

Confidence Based Matching Distinguish sources Reliable: 𝑠𝑟𝑐 1 , 𝑠𝑟𝑐 2 , 𝑠𝑟𝑐 3 𝑟 5 , 𝑟 4 Unreliable: { 𝑠𝑟𝑐 4 , 𝑠𝑟𝑐 5 } 𝑟 7 , 𝑟 8 , 𝑟 10 Given the confident matches, we want to Distinguish sources

Confidence Based Matching Distinguish sources Reliable: 𝑠𝑟𝑐 1 , 𝑠𝑟𝑐 2 , 𝑠𝑟𝑐 3 𝑟 5 , 𝑟 4 Unreliable: { 𝑠𝑟𝑐 4 , 𝑠𝑟𝑐 5 } 𝑟 7 , 𝑟 8 , 𝑟 10

Confidence Based Matching Distinguish sources Reliable: 𝑠𝑟𝑐 1 , 𝑠𝑟𝑐 2 , 𝑠𝑟𝑐 3 𝑟 5 , 𝑟 4 Unreliable: { 𝑠𝑟𝑐 4 , 𝑠𝑟𝑐 5 } 𝑟 7 , 𝑟 8 , 𝑟 10

Adaptive Matching X X Cluster signatures <- current truth Update reliability matrix Refine clusters by pruning the most dissimilar records First, for each cluster $c$ obtained in the previous phase, we build its signature $H_c$ based on the accuracy of the attribute values of the records in $c$. The reliability matrix is updated subsequently. Then each record $r$ is compared with the signatures of the clusters it is associated with, and pruned from the cluster that it is the most dissimilar to. The above steps are repeated until there is no change to the clusters. Discount records belonging to multiple clusters Lower the impact of erroneous values on our matching decision

Outline Motivation Proposed Method Performance Study Conclusion

Comparative Methods PIPELINE MATCH [3] COMET Record linkage [1] + Truth discovery [2] MATCH [3] State-of-the-art method COMET The proposed method Negahban et al. Scaling multiple-source entity resolution using statistically efficient transfer learning. In CIKM, 2012 Yin et al. TruthFinder. In KDD, 2007 Guo et al. Record linkage with uniqueness constraints and erroneous values. VLDB, 2010

Results on Restaurant Dataset The evaluation contains two parts: how good is the record matching and the quality of the constructed profiles.

%err: percentage of erroneous values %ambi: percentage of records with abbreviated names We vary two variables We can significantly improve the recall

%err: percentage of erroneous values %ambi: percentage of records with abbreviated names

Scalability Experiments

Outline Motivation Proposed Method Performance Study Conclusion

Conclusion Address the problem of building entity profiles by collating data records from multiple sources in the presence of erroneous values Interleave record linkage with truth discovery Varying source reliabilities Reduce the impact of erroneous values on matching decisions

Thanks! Q & A furongli@comp.nus.edu.sg