Exploiting Relationships for Domain-Independent Data Cleaning Dmitri V. Kalashnikov Sharad Mehrotra Stella Chen Computer Science Department University.

Presentation transcript:

Exploiting Relationships for Domain-Independent Data Cleaning Dmitri V. Kalashnikov Sharad Mehrotra Stella Chen Computer Science Department University of California, Irvine (RESCUE) Copyright(c) by Dmitri V. Kalashnikov, 2005 SIAM Data Mining Conference, 2005

2 Talk Overview
Examples
  – motivating data cleaning (DC)
  – motivating analysis of relationships for DC
Reference disambiguation
  – one of the DC problems
  – the problem this work addresses
Proposed approach
  – RelDC (Relationship-based Data Cleaning)
  – employs analysis of relationships for DC
  – the main contribution
Experiments

3 Why do we need “Data Cleaning”?
An actual excerpt from a person’s CV
  – sanitized for privacy
  – quite common in CVs, etc.
  – this particular person argues he is good because his work is well-cited
  – but there is a problem with using the CiteSeer ranking
  – in general, it is not valid (in CVs)
  – let’s see why...
“... In June 2004, I was listed as the 1000th most cited author in computer science (of 100,000 authors) by CiteSeer, available at

4 What is the problem in the example?
Suspicious entries
  – Let us go to the DBLP website, which stores bibliographic entries of many CS authors
  – Let us check who “A. Gupta” and “L. Zhang” are
[Figure panels: “CiteSeer: the top-k most cited authors”; “DBLP”]

5 Comparing raw and cleaned CiteSeer
[Tables: “CiteSeer top-k” and “Cleaned CiteSeer top-k”; columns: Rank, Author, Location, # citations]
1 (100.00%) douglas
2 (100.00%) rakesh
3 (100.00%) hector
4 (100.00%) sally
5 (100.00%) jennifer
6 (100.00%) david
6 (100.00%) thomas
7 (100.00%) rajeev
8 (100.00%) willy
9 (100.00%) van
10 (100.00%) rajeev
11 (100.00%) john
12 (100.00%) joseph
13 (100.00%) andrew
14 (100.00%) peter
15 (100.00%) serge

6 What is the lesson?
  – data should be cleaned first
  – e.g., determine the (unique) real authors of publications
  – solving such challenges is not always “easy”
  – that explains the large body of work on data cleaning
  – note: CiteSeer is aware of the problem with its ranking
  – there are more issues with CiteSeer, many not related to data cleaning
“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.

7 High-level view of the problem

8 Traditional Domain-Independent DC Methods

9 What is “Reference Disambiguation”?
Author table (clean):
  A1, ‘Dave White’, ‘Intel’
  A2, ‘Don White’, ‘CMU’
  A3, ‘Susan Grey’, ‘MIT’
  A4, ‘John Black’, ‘MIT’
  A5, ‘Joe Brown’, unknown
  A6, ‘Liz Pink’, unknown
Publication table (to be cleaned):
  P1, ‘Databases...’, ‘John Black’, ‘Don White’
  P2, ‘Multimedia...’, ‘Sue Grey’, ‘D. White’
  P3, ‘Title3...’, ‘Dave White’
  P4, ‘Title5...’, ‘Don White’, ‘Joe Brown’
  P5, ‘Title6...’, ‘Joe Brown’, ‘Liz Pink’
  P6, ‘Title7...’, ‘Liz Pink’, ‘D. White’
Analysis (‘D. White’ in P2, our approach):
  1. ‘Don White’ has a paper (P1) with ‘John Black’ of MIT
  2. ‘Dave White’ is not connected to MIT in any way
  3. ‘Sue Grey’, a coauthor of P2, is also at MIT
Thus: ‘D. White’ in P2 is probably Don (since we know he collaborates with MIT people).
Analysis (‘D. White’ in P6, our approach):
  1. ‘Don White’ has a paper (P4) with Joe Brown; Joe has a paper (P5) with Liz Pink; Liz Pink is a coauthor of P6
  2. ‘Dave White’ does not have papers with Joe or Liz
Thus: ‘D. White’ in P6 is probably Don (since co-author networks often form clusters).

10 Attributed Relational Graph (ARG)
View the dataset as a graph
  – nodes for entities: papers, authors, organizations (e.g., P2, Susan, MIT)
  – edges for relationships: “writes”, “affiliated with” (e.g., Susan → P2, “writes”)
“Choice” nodes
  – for uncertain relationships
  – mutual exclusion
  – “1” and “2” in the figure
Analysis can be viewed as
  – application of the “Context AP” to this graph
  – defined next...
Q: How come domain-independent?
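The graph view can be made concrete with a small sketch that loads the toy author and publication tables from the reference-disambiguation slide into an adjacency structure. This is only an illustration, not the RelDC implementation: the node naming (`choice:P2`), the edge labels "reference" and "option", and the assumption that ‘Sue Grey’ has already been matched to ‘Susan Grey’ are all hypothetical choices made here.

```python
from collections import defaultdict

# Clean author table: name -> affiliation (None where the slide says "unknown").
AUTHORS = {
    "Dave White": "Intel",
    "Don White": "CMU",
    "Susan Grey": "MIT",
    "John Black": "MIT",
    "Joe Brown": None,
    "Liz Pink": None,
}

# Author references that match exactly one entity.
CERTAIN_REFS = {
    "P1": ["John Black", "Don White"],
    "P2": ["Susan Grey"],  # assumption: 'Sue Grey' is already matched to Susan Grey
    "P3": ["Dave White"],
    "P4": ["Don White", "Joe Brown"],
    "P5": ["Joe Brown", "Liz Pink"],
    "P6": ["Liz Pink"],
}

# References that match several entities: the 'D. White' in P2 and in P6.
UNCERTAIN_REFS = {
    "P2": ["Dave White", "Don White"],
    "P6": ["Dave White", "Don White"],
}


def build_arg():
    """Return an undirected graph as {node: {neighbor: edge_label}}."""
    graph = defaultdict(dict)

    def add_edge(u, v, label):
        graph[u][v] = label
        graph[v][u] = label

    for author, org in AUTHORS.items():
        if org is not None:
            add_edge(author, org, "affiliated with")
    for paper, authors in CERTAIN_REFS.items():
        for author in authors:
            add_edge(author, paper, "writes")
    for paper, options in UNCERTAIN_REFS.items():
        choice = f"choice:{paper}"  # hypothetical name for the "choice" node
        add_edge(paper, choice, "reference")
        for option in options:
            add_edge(choice, option, "option")
    return graph


graph = build_arg()
```

Nothing in the construction is specific to publications: entities become nodes and relationships become labeled edges, which is one way to read the slide's question about domain independence.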

11 Context Attraction Principle (CAP)
In designing the RelDC approach
  – our goal was to use CAP as an axiom
  – then solve the problem formally, without heuristics
Context Attraction Principle (CAP): if
  – reference r, made in the context of entity x, refers to an entity y_j
  – but the description provided by r matches multiple entities y_1, …, y_j, …, y_N,
then
  – x and y_j are likely to be more strongly connected to each other via chains of relationships
  – than x and y_k (k = 1, 2, …, N; k ≠ j).
[Figure: reference “J. Smith” in publication P1; entities John E. Smith (SSN = 123), Joe A. Smith, Jane Smith]

12 Analyzing paths: linking entities and contexts
D. White is a reference
  – in the context of P2, P6
  – can link P2, P6 to Don
  – cannot link P2, P6 to Dave
  – more complex paths in general
Analysis (‘D. White’ in P2): path P2→Don
  1. ‘Don White’ has a paper (P1) with ‘John Black’ of MIT
  2. ‘Dave White’ is not connected to MIT in any way
  3. ‘Sue Grey’, a coauthor of P2, is also at MIT
Thus: ‘D. White’ is probably Don White
Analysis (‘D. White’ in P6): path P6→Don
  1. ‘Don White’ has a paper (P4) with Joe Brown; Joe has a paper (P5) with Liz Pink; Liz Pink is a coauthor of P6
  2. ‘Dave White’ does not have papers with Joe or Liz
Thus: ‘D. White’ is probably Don White
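The two analyses can be checked mechanically with a breadth-first search over the certain relationships of the toy dataset. The sketch below is an illustration under stated assumptions: the edge list is transcribed from the example tables, ‘Sue Grey’ is assumed to be ‘Susan Grey’, and `shortest_path` is a hypothetical helper, not part of RelDC (which analyzes all short paths, not just the shortest one).

```python
from collections import deque

# Certain relationships only: affiliations and unambiguous authorships.
EDGES = [
    ("Dave White", "Intel"), ("Don White", "CMU"),
    ("Susan Grey", "MIT"), ("John Black", "MIT"),
    ("P1", "John Black"), ("P1", "Don White"),
    ("P2", "Susan Grey"),  # assumption: 'Sue Grey' is Susan Grey
    ("P3", "Dave White"),
    ("P4", "Don White"), ("P4", "Joe Brown"),
    ("P5", "Joe Brown"), ("P5", "Liz Pink"),
    ("P6", "Liz Pink"),
]


def shortest_path(edges, source, target):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in adjacency.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


p2_to_don = shortest_path(EDGES, "P2", "Don White")
p2_to_dave = shortest_path(EDGES, "P2", "Dave White")
p6_to_don = shortest_path(EDGES, "P6", "Don White")
p6_to_dave = shortest_path(EDGES, "P6", "Dave White")
```

On this data both contexts reach Don White (P2 via MIT and P1, P6 via the co-author chain through P5 and P4), while neither reaches Dave White, which matches the slide's conclusion.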

13 Questions to answer 1. Does the CAP principle hold over real datasets? That is, if we disambiguate references based on it, will the references be correctly disambiguated? 2. Can we design a generic solution to exploiting relationships for disambiguation?

14 Problem formalization
Notation | Meaning
X = {x_1, x_2, ..., x_N} | the set of all entities in the database
x_i.r_k | the k-th reference of entity x_i (e.g., the name of the k-th author of paper x_i, such as ‘J. Smith’)
a reference | a description of an object, multiple attributes
d[x_i.r_k] | the “answer” for x_i.r_k: the real entity x_i.r_k refers to (unknown, the goal is to find it); e.g., the true k-th author of paper x_i
CS[x_i.r_k] | the “choice set” for x_i.r_k: the set of all entities matching the description provided by x_i.r_k; e.g., ‘John A. Smith’, ‘Jane B. Smith’, ...
y_1, y_2, ..., y_N | the “options” for x_i.r_k: the elements of CS[x_i.r_k]
v[x_i] | the node in the graph for entity x_i

15 Handling References: Linking
Entity-Relationship Graph: RelDC views the dataset as a graph
  – undirected
  – nodes for entities; nodes do not have weights
  – edges for relationships; edges have weights
  – a weight is a real number in [0,1]: the confidence that the relationship exists
Handling references (references correspond to relationships)
if |CS[x_i.r_k]| = 1 then
  – we know the answer d[x_i.r_k]
  – link x_i and d[x_i.r_k] directly, w = 1
else
  – the answer is uncertain for x_i.r_k
  – create a “choice” node, link it
  – “option-weights”: w_1 + ... + w_N = 1
  – option-weights are variables
[Figure: reference “J. Smith” in P1, with options “Jane Smith” and “John Smith”]
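The linking rule can be sketched as a small function over a weighted edge map. This is a minimal illustration, not the authors' code: the `choice:` naming, the weight 1 on the edge from the entity to its choice node, and the uniform starting value 1/N for the option-weights are assumptions made here (the slide only says the option-weights are variables that sum to 1).

```python
def link_reference(edges, entity, ref_id, choice_set):
    """Add edges for one reference to `edges`, a dict {(u, v): weight}.

    Returns the name of the created choice node, or None when the
    reference is certain and no choice node is needed.
    """
    if len(choice_set) == 1:
        # Only one matching entity: the answer is known, link directly.
        (answer,) = choice_set
        edges[(entity, answer)] = 1.0
        return None
    # Several matching entities: route the reference through a choice node.
    choice = f"choice:{entity}.{ref_id}"  # hypothetical naming scheme
    edges[(entity, choice)] = 1.0          # assumption: this edge is certain
    initial = 1.0 / len(choice_set)        # assumption: uniform starting weights
    for option in choice_set:
        edges[(choice, option)] = initial
    return choice


edges = {}
certain = link_reference(edges, "P1", "r1", ["John Smith"])
uncertain = link_reference(edges, "P1", "r2", ["John Smith", "Jane Smith"])
```

After these two calls, `edges` holds one direct edge of weight 1 for the certain reference and, for the uncertain one, a choice node with two option edges of weight 0.5 each.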

16 Objective of Reference Disambiguation
Definition: To resolve a reference x_i.r_k means
  – to pick one y_j from CS[x_i.r_k] as d[x_i.r_k].
Graph interpretation
  – among w_1, w_2, ..., w_N, assign w_j = 1 to one w_j
  – means y_j is chosen as the answer d[x_i.r_k]
Definition: Reference x_i.r_k is resolved correctly if the chosen y_j = d[x_i.r_k].
Definition: Reference x_i.r_k is unresolved or uncertain if not yet resolved...
Goal: Resolve all uncertain references as correctly as possible.

17 Alternative Goal
Alternative goal
  – for each reference x_i.r_k
  – assign option-weights w_1, ..., w_N
  – but in [0,1], not binary as before
  – w_j reflects the degree of confidence that y_j = d[x_i.r_k]
  – w_1 + ... + w_N = 1
Mapping the alternative goal to the original
  – use an interpretation procedure
  – pick y_j with the max w_j as the answer for x_i.r_k
  – a final step
RelDC deals with the alternative goal!
  – the bulk of the discussion is on computing those option-weights
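The interpretation procedure amounts to an argmax over the option-weights of one reference. A minimal sketch follows; the function name `interpret`, the sanity check on the weights, and the tie-breaking behaviour (the first option with the maximum weight wins) are assumptions of this example, since the slides do not say how ties are handled.

```python
import math


def interpret(option_weights):
    """Pick the option with the largest weight for a single reference.

    `option_weights` maps each option y_j to its weight w_j; the weights
    are expected to lie in [0, 1] and sum to 1.
    """
    if not option_weights:
        raise ValueError("empty choice set")
    if not math.isclose(sum(option_weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("option-weights must sum to 1")
    # max() returns the first maximal key, so ties go to the earliest option.
    return max(option_weights, key=option_weights.get)


answer = interpret({"John A. Smith": 0.7, "Jane B. Smith": 0.3})
```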

18 Formalizing the CAP
CAP
  – is based on “connection strength”
  – c(u,v) for entities u and v
  – measures how strongly u and v are connected to each other via relationships
  – e.g., c(u,v) > c(u,z) in the figure
  – will formalize c(u,v) later
Context Attraction Principle (CAP): if c(x_i, y_j) ≥ c(x_i, y_k) then w_j ≥ w_k (most of the time)
We use proportionality: c(x_i, y_j) ∙ w_k = c(x_i, y_k) ∙ w_j
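A short worked derivation shows what the proportionality rule implies for a single reference. It assumes the option-weights sum to 1 (as stated for the alternative goal) and that the connection strengths are non-negative with a positive sum; the shorthand c_j = c(x_i, y_j) is introduced here only for brevity.

```latex
\begin{align*}
c_j\, w_k &= c_k\, w_j && \text{for all } j, k \in \{1, \dots, N\} \\
\sum_{k=1}^{N} c_j\, w_k &= \sum_{k=1}^{N} c_k\, w_j && \text{(sum both sides over } k\text{)} \\
c_j &= w_j \sum_{k=1}^{N} c_k && \text{(using } \textstyle\sum_{k} w_k = 1\text{)} \\
w_j &= \frac{c_j}{\sum_{k=1}^{N} c_k}
\end{align*}
```

Because each c_j may itself depend on the option-weights of other references, the last line is an equation to be solved rather than a closed-form answer.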

19 RelDC approach
Input: the ARG for the dataset
1. Computing connection strengths
  – for each unresolved reference x_i.r_k
  – determine equations for all (i.e., N) c(x_i, y_j)’s
  – c(x_i, y_j) = g_ij(w)
  – a function of other option-weights
2. Determining equations for option-weights
  – use CAP to relate all w_j’s and connection strengths
  – since c(x_i, y_j) = g_ij(w), hence w_ij = f_ij(w)
3. Computing option-weights
  – solve the system of equations from Step 2
4. Resolving references
  – use the interpretation procedure to resolve weights

20 Computing connection strength (Step 1)
Computation of c(u,v) consists of two phases
  – Phase 1: Discover connections
    – all L-short simple paths between u and v
    – this phase is the bottleneck
    – optimizations exist, not covered in SDM05
  – Phase 2: Measure the strength
    – in the discovered connections
    – many c(u,v) models exist
    – we use the random-walks-in-graphs model
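Phase 1 can be sketched as a depth-first enumeration of simple paths with a length cap. The sketch assumes that “L-short” means at most L edges, which is how the experimental setup describes L (a path-length limit); the function name and the adjacency-list format are choices made for this example, and none of the optimizations mentioned on the slide are attempted.

```python
def l_short_simple_paths(adjacency, u, v, max_len):
    """Enumerate all simple paths from u to v that use at most max_len edges.

    `adjacency` maps each node to an iterable of its neighbors.
    """
    paths = []

    def dfs(node, path):
        if node == v:
            paths.append(list(path))
            return
        if len(path) - 1 == max_len:
            return  # edge budget exhausted without reaching v
        for neighbor in adjacency.get(node, ()):
            if neighbor not in path:  # simple path: no repeated nodes
                path.append(neighbor)
                dfs(neighbor, path)
                path.pop()

    dfs(u, [u])
    return paths


toy = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c"],
}
short_paths = l_short_simple_paths(toy, "a", "d", 2)
longer_paths = l_short_simple_paths(toy, "a", "d", 3)
```

Raising the limit from 2 to 3 on this toy graph doubles the number of discovered paths, which hints at why path discovery is the bottleneck on real data.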

21 Measuring connection strength
Note:
  – c(u,v) returns an equation
  – because paths can go via various option-edges
  – c_uv = c(u,v) = g_uv(w)
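To illustrate why c(u,v) depends on option-weights, here is a simplified random-walk model. It is an assumption of this sketch, not necessarily the exact model used in RelDC: the strength of a path is taken to be the probability that a walk starting at u follows it, where each step picks among edges to not-yet-visited nodes in proportion to edge weight, and c(u,v) is the sum over the discovered paths.

```python
def path_probability(weights, path):
    """Probability that a self-avoiding random walk follows `path`.

    `weights` maps each node to {neighbor: edge_weight}.
    """
    probability = 1.0
    visited = set()
    for current, following in zip(path, path[1:]):
        visited.add(current)
        # Only edges leading to unvisited nodes compete at this step.
        candidates = {n: w for n, w in weights[current].items() if n not in visited}
        total = sum(candidates.values())
        if total == 0 or following not in candidates:
            return 0.0
        probability *= candidates[following] / total
    return probability


def connection_strength(weights, paths):
    """Sum of path probabilities over the given u-v paths."""
    return sum(path_probability(weights, p) for p in paths)


# Toy graph: the edges b-v and b-z play the role of option-edges (weights 0.5).
toy_weights = {
    "u": {"a": 1.0, "b": 1.0},
    "a": {"u": 1.0, "v": 1.0},
    "b": {"u": 1.0, "v": 0.5, "z": 0.5},
    "v": {"a": 1.0, "b": 0.5},
    "z": {"b": 0.5},
}
strength = connection_strength(toy_weights, [["u", "a", "v"], ["u", "b", "v"]])
```

Changing the weight on the b-v option-edge changes the result, so with symbolic option-weights the same computation yields an equation in w rather than a number.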

22 Equations for option-weights (Step 2) CAP (proportionality): System (over-constrained): Add slack:

23 Solving the system (Steps 3 and 4)
Step 3: Solve the system of equations
  1. use a math solver, or
  2. an iterative method (approximate solution), or
  3. a bounding-interval-based method (tech. report)
Step 4: Interpret option-weights
  – to determine the answer for each reference
  – pick y_j with the largest weight as the answer
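The iterative option in Step 3 can be sketched as a fixed-point loop: compute strengths from the current weights, renormalize them into new weights, and repeat until the weights stop changing. Everything concrete below is hypothetical: `toy_strengths` is an invented stand-in for the functions g_ij(w), the update uses the plain proportionality rule without the slack variables mentioned earlier, and convergence is not guaranteed in general.

```python
def normalize(strengths):
    """Turn connection strengths into option-weights that sum to 1."""
    total = sum(strengths.values())
    if total == 0:
        # Assumption: with no connection evidence, fall back to uniform weights.
        return {option: 1.0 / len(strengths) for option in strengths}
    return {option: c / total for option, c in strengths.items()}


def solve_iteratively(strength_fn, weights, max_iter=100, tol=1e-12):
    """Fixed-point iteration on {reference: {option: weight}}."""
    for _ in range(max_iter):
        strengths = strength_fn(weights)
        updated = {ref: normalize(cs) for ref, cs in strengths.items()}
        delta = max(
            abs(updated[ref][opt] - weights[ref][opt])
            for ref in updated for opt in updated[ref]
        )
        weights = updated
        if delta < tol:
            break
    return weights


def toy_strengths(w):
    """Hypothetical g_ij(w): each reference's strength to 'Don' grows with
    the other reference's confidence in 'Don'."""
    return {
        "r1": {"Don": 0.2 + 0.6 * w["r2"]["Don"], "Dave": 0.2},
        "r2": {"Don": 0.2 + 0.6 * w["r1"]["Don"], "Dave": 0.2},
    }


start = {ref: {"Don": 0.5, "Dave": 0.5} for ref in ("r1", "r2")}
solution = solve_iteratively(toy_strengths, start)
```

For this toy system the weight of "Don" settles at the positive root of 3w² − w − 1 = 0, about 0.768, for both references, so Step 4 would pick "Don" in each case.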

24 Experimental Setup
Parameters
  – when looking for L-short simple paths, L = 7
  – L is the path-length limit
RealPub dataset:
  – CiteSeer + HPSearch
  – publications (255K)
  – authors (176K)
  – organizations (13K)
  – departments (25K)
  – ground truth is not known
  – accuracy...
SynPub datasets:
  – many datasets of two types
  – emulation of RealPub
  – publications (5K)
  – authors (1K)
  – organizations (25K)
  – departments (125K)
  – ground truth is known
RealMov:
  – movies (12K)
  – people (22K): actors, directors, producers
  – studios (1K): producing, distributing

25 Sample Publication Data CiteSeer: publication records HPSearch: author records

26 Efficiency and Long paths Non-exponential cost Longer paths do help

27 Accuracy on SynPub

28 Sample Movies Data

29 Accuracy on RealMov References to DirectorsReferences to Studios

30 Summary
DC and the “Garbage in, garbage out” principle
Our main contributions
  – showing that analyzing relationships can help DC
  – an approach that achieves that
RelDC
  – developed in Aug 2003
  – domain-independent data cleaning framework
  – not about cleaning CiteSeer
  – uses relationships for data cleaning
Ongoing work
  – “learning” the importance of relationships from data

31 Contact Information RelDC project (RESCUE) Dmitri V. Kalashnikov (contact author) Sharad Mehrotra Zhaoqi Chen

32 Summary
DC and the “Garbage in, garbage out” principle
Analyzing relationships can help data cleaning
RelDC
  – developed in Aug 2003
  – domain-independent data cleaning framework
  – not about cleaning CiteSeer
  – uses relationships for data cleaning
  – employs CAP as an axiom
  – converts the problem to an optimization problem
  – can disambiguate different types of references at once (in theory, not tested yet)