Exploiting Relationships for Object Consolidation. Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra. Computer Science Department, University of California, Irvine.

Presentation transcript:

Exploiting Relationships for Object Consolidation
Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine (RESCUE)
ACM IQIS 2005. Work supported by NSF IIS grants.

2 Talk Overview
– Motivation
– Object consolidation problem
– Proposed approach
  – RelDC: relationship-based data cleaning
  – relationship analysis and graph partitioning
– Experiments

3 Why do we need "Data Cleaning"?
Jane Smith (fresh Ph.D.): "Hi, my name is Jane Smith. I'd like to apply for a faculty position at your university."
Tom (recruiter): "OK, let me check something quickly... Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?"
(Cartoon: Tom looks up "Jane Smith" and finds two publication lists and a high CiteSeer rank.)

4 What is the problem?
Names often do not uniquely identify people.
(Figure: CiteSeer's list of the top-k most cited authors, alongside DBLP.)

5 Comparing raw and cleaned CiteSeer
(Table: CiteSeer top-k vs. cleaned CiteSeer top-k, with columns Rank, Author, Location, and # citations. The raw ranking lists authors by first name only: douglas, rakesh, hector, sally, jennifer, david, thomas, rajeev, willy, van, rajeev, john, joseph, andrew, peter, serge.)

6 Object Consolidation Problem
Cluster the representations that correspond to the same real-world object/entity.
Two instances of the problem: the real-world objects are known, or unknown.
(Figure: representations r1, r2, ..., rN in the database mapped to real objects o1, o2, ..., oM.)

7 RelDC Approach
Exploit relationships among objects to disambiguate when the traditional approach, clustering based on feature similarity, does not work.
(Figure: feature-based methods leave representations f1, ..., f4 ambiguous between entities X and Y; relationship analysis over the ARG resolves the "?" links.)
RelDC framework: relationship-based data cleaning that combines features, context, and relationship analysis.

8 Attributed Relational Graph (ARG)
View the database as an ARG.
Nodes:
– one per cluster of representations (if already resolved by the feature-based approach)
– one per representation (for the "tough" cases)
Edges:
– regular edges correspond to relationships between entities
– similarity edges are created using feature-based methods on representations
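As a concrete illustration (not the authors' code), a minimal sketch of such an ARG in Python with networkx; the node names, edge types, and similarity value are invented for the example:

```python
import networkx as nx

G = nx.Graph()

# One node per resolved cluster or entity, one per unresolved
# ("tough") representation.
G.add_node("P1", kind="paper")
G.add_node("JSmith?", kind="author")      # unresolved representation
G.add_node("JaneSmith", kind="author")    # already-resolved cluster
G.add_node("MIT", kind="organization")

# Regular edges encode relationships between entities; similarity
# edges come from the feature-based step and carry its score.
G.add_edge("P1", "JSmith?", etype="writes")
G.add_edge("JaneSmith", "MIT", etype="affiliated_with")
G.add_edge("JSmith?", "JaneSmith", etype="similarity", sim=0.8)
```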

9 Context Attraction Principle (CAP)
Who is "J. Smith": Jane or John?
The CAP says a representation is more likely to refer to the entity to which it is more strongly connected via relationships in the ARG.

10 Questions to Answer
1. Does the CAP hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?
2. Can we design a generic strategy that exploits the CAP for consolidation?

11 Consolidation Algorithm
1. Construct the ARG and identify all virtual clusters (VCSs); use FBS (feature-based similarity) when constructing the ARG.
2. Choose a VCS and compute the connection strength between its nodes, for each pair of representations connected via a similarity edge.
3. Partition the VCS with a graph partitioning algorithm, based on connection strength; after partitioning, adjust the ARG accordingly and go to Step 2 if more potential clusters exist.
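A high-level sketch of this loop in Python; every helper here (construct_arg, similarity_edges, connection_strength, partition_vcs, adjust_arg) is a hypothetical stand-in for the corresponding step above, not an API from the paper:

```python
def consolidate(database):
    # Step 1: build the ARG with FBS and collect the virtual clusters.
    graph, vcs_list = construct_arg(database)
    while vcs_list:
        vcs = vcs_list.pop()
        # Step 2: connection strength for each similarity-linked pair.
        strengths = {(u, v): connection_strength(graph, u, v)
                     for (u, v) in similarity_edges(vcs)}
        # Step 3: partition on connection strength, adjust the ARG,
        # and requeue any VCSs that may still need splitting.
        clusters = partition_vcs(vcs, strengths)
        vcs_list.extend(adjust_arg(graph, vcs, clusters))
    return graph
```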

12 Connection Strength c(u,v)
Models for c(u,v):
– many possibilities: diffusion kernels, random walks, etc.
– none is fully adequate; they cannot learn similarity from data
Diffusion kernels:
– τ(x,y) = τ1(x,y): the "base similarity", via direct links (of length 1)
– τk(x,y): the "indirect similarity", via links of length k
– B: the base similarity matrix, where B_xy = B^1_xy = τ1(x,y)
– B^k: the indirect similarity matrix
– K: the total similarity matrix, or "kernel"
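One common instantiation, sketched in Python/NumPy below: K as a damped sum of powers of the base similarity matrix, truncated at path length L. The damping factor lam and the truncation scheme are assumptions for illustration; the slides say only that paths are limited to length L and that the full K is never materialized in the actual implementation:

```python
import numpy as np

def total_similarity(B, L=7, lam=0.5):
    """Truncated diffusion-kernel-style similarity K = sum_k lam^k B^k."""
    K = np.zeros_like(B, dtype=float)
    Bk = np.eye(B.shape[0])
    for k in range(1, L + 1):
        Bk = Bk @ B            # B^k: similarity via paths of length k
        K += (lam ** k) * Bk   # damp the weaker, longer connections
    return K
```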

13 Connection Strength c(u,v) (cont.)
Instantiating the parameters:
– determining τ(x,y) for regular edges: edge types T1, ..., Tn have weights w1, ..., wn; get the type Ti of a given edge and assign its weight wi as the base similarity, τ(x,y) = wi
– handling similarity edges: τ(x,y) is assigned a value proportional to the similarity (a heuristic)
– an approach to learn τ(x,y) from data is ongoing work
Implementation:
– we do not compute the whole matrix K; we compute one c(u,v) at a time
– we limit path lengths by L
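A minimal sketch of such a τ(x,y), assuming the networkx edge attributes from the ARG example above; the weight table and the similarity scale factor are placeholder values of the kind an analyst might assign:

```python
# Hypothetical per-type weights w_i chosen by the analyst.
TYPE_WEIGHTS = {"writes": 1.0, "affiliated_with": 0.5}

def tau(G, x, y, sim_scale=0.3):
    """Base similarity of edge (x, y) in the ARG."""
    data = G.edges[x, y]
    if data["etype"] == "similarity":
        # Heuristic: proportional to the FBS similarity score.
        return sim_scale * data["sim"]
    return TYPE_WEIGHTS[data["etype"]]
```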

14 Consolidation via Partitioning
Observations:
– each VCS contains representations of at least one object
– if a representation is in a VCS, then the rest of the representations of the same object are in it too
Partitioning, two cases:
– k, the number of entities in the VCS, is known: use any partitioning algorithm that maximizes intra-cluster and minimizes inter-cluster connection strength; we use the normalized cut of [Shi, Malik 2000]
– k is unknown: split into two just to see the cut, compare the cut against a threshold, decide "to split" or "not to split", and iterate
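For the two-way case, a compact sketch of a Shi-Malik style normalized cut on a connection-strength matrix W: threshold the generalized eigenvector with the second-smallest eigenvalue of (D - W)y = λDy. This is simplified; it assumes every node has positive degree and ignores tie-breaking and k > 2:

```python
import numpy as np
from scipy.linalg import eigh

def two_way_ncut(W):
    """Split nodes into two clusters by a normalized-cut bisection."""
    d = W.sum(axis=1)
    D = np.diag(d)
    # Generalized eigenproblem (D - W) y = lam * D y; eigh returns
    # eigenvalues in ascending order.
    vals, vecs = eigh(D - W, D)
    fiedler = vecs[:, 1]                  # second-smallest eigenvector
    return fiedler > np.median(fiedler)   # boolean cluster assignment
```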

15 Measuring Quality of Outcome
– dispersion: for an entity, into how many clusters its representations are placed; ideal is 1
– diversity: for a cluster, how many distinct entities it covers; ideal is 1
– entity uncertainty: for an entity with m representations, of which m1 go to cluster C1, ..., mn go to Cn, the entropy is H = -Σi (mi/m) log(mi/m)
– cluster uncertainty: if a cluster consists of m1 representations of entity E1, ..., mn of En, its entropy is defined the same way
– the ideal entropy is zero
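A sketch of these measures in Python, assuming two hypothetical dictionaries: truth[r], the real entity behind representation r, and assigned[r], the cluster r was placed in:

```python
from collections import Counter
from math import log

def entropy(counts):
    """H = -sum (m_i/m) log(m_i/m); zero when all mass is in one bin."""
    m = sum(counts)
    return -sum(c / m * log(c / m) for c in counts if c)

def dispersion(entity, truth, assigned):
    """Number of clusters an entity's representations land in (ideal 1)."""
    return len({assigned[r] for r in truth if truth[r] == entity})

def diversity(cluster, truth, assigned):
    """Number of distinct entities a cluster covers (ideal 1)."""
    return len({truth[r] for r in assigned if assigned[r] == cluster})

def cluster_entropy(cluster, truth, assigned):
    """Cluster uncertainty H(C); entity uncertainty H(E) is symmetric."""
    members = Counter(truth[r] for r in assigned if assigned[r] == cluster)
    return entropy(members.values())
```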

16 Experimental Setup
Parameters:
– L-short simple paths with L = 7, where L is the path-length limit
Note: the algorithm is applied to the "tough cases", after FBS has already successfully consolidated many entries.
RealMov dataset:
– movies (12K)
– people (22K): actors, directors, producers
– studios (1K): producing, distributing
Introducing uncertainty:
– d1, d2, ..., dn are director entities; pick a fraction d1, d2, ..., dm
– group the entries in groups of size k, e.g. groups of two: {d1,d2}, ..., {d9,d10}
– make all representations within a group indiscernible by FBS
Baseline 1:
– one cluster per VCS, regardless; equivalent to using only FBS, so it has ideal dispersion and H(E)!
Baseline 2:
– knows the grouping statistics, guesses the number of entities in a VCS, and randomly assigns representations to clusters

17 Sample Movies Data

18 The Effect of L on Quality
(Plots: cluster entropy and diversity; entity entropy and dispersion.)

19 Effect of Threshold and Scalability

20 Summary
RelDC:
– a domain-independent data cleaning framework that uses relationships for data cleaning
– reference disambiguation [SDM'05]
– object consolidation [IQIS'05]
Ongoing work:
– "learning" the importance of relationships from data
– exploiting relationships among entities for other data cleaning problems

21 Contact Information
RelDC project (RESCUE)
Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra

22 extra slides…

23 What is the lesson?
– data should be cleaned first, e.g., determine the (unique) real authors of publications
– solving such challenges is not always "easy", which explains the large body of work on data cleaning
– note: CiteSeer is aware of the problem with its ranking, and there are more issues with CiteSeer, many of them not related to data cleaning
"Garbage in, garbage out" principle: making decisions based on bad data can lead to wrong results.

24 Object Consolidation Notation
– O = {o1, ..., o|O|}: the set of entities (unknown in general)
– X = {x1, ..., x|X|}: the set of representations
– d[xi]: the entity xi refers to (unknown in general)
– C[xi]: all representations that refer to d[xi], the "group set" (unknown in general; the goal is to find it for each xi)
– S[xi]: all representations that can be xi, the "consolidation set" (determined by FBS)
– we assume C[xi] ⊆ S[xi]

25 Object Consolidation Problem
Let O = {o1, ..., o|O|} be the set of entities (unknown in general) and let X = {x1, ..., x|X|} be the set of representations. The goal is to map each xi to its corresponding entity oj in O; equivalently, to find the group set C[xi] of each xi. As before, d[xi] is the entity xi refers to, C[xi] is the set of all representations referring to d[xi], S[xi] is the consolidation set determined by FBS, and we assume C[xi] ⊆ S[xi].
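In display form, the notation amounts to the following (a reconstruction consistent with the definitions above; the subset relation restores a symbol lost in the transcript):

```latex
\[
  d : X \to O, \qquad
  C[x_i] = \{\, x_j \in X : d[x_j] = d[x_i] \,\}, \qquad
  C[x_i] \subseteq S[x_i]
\]
```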

26 RelDC Framework

27 Connection Strength
Computation of c(u,v) proceeds in two phases.
Phase 1: discover connections
– find all L-short simple paths between u and v
– this is the bottleneck; optimizations exist but are not covered in IQIS'05
Phase 2: measure the strength of the discovered connections
– many c(u,v) models exist; we use a model similar to diffusion kernels
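Phase 1 maps directly onto a depth-limited simple-path search; a minimal sketch with networkx, without the optimizations the slide alludes to:

```python
import networkx as nx

def l_short_simple_paths(G, u, v, L=7):
    """All simple paths between u and v with at most L edges."""
    # networkx's generator already enforces simplicity and the cutoff.
    return list(nx.all_simple_paths(G, source=u, target=v, cutoff=L))
```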

28 Our c(u,v) Model
Our c(u,v) model:
– regular edges have types T1, ..., Tn with weights w1, ..., wn; get the type Ti of a given edge and assign its weight wi as the base similarity, τ(x,y) = wi
– for paths with similarity edges such weights might not exist, so heuristics are used
Our model and diffusion kernels are virtually identical, but:
– we do not compute the whole matrix K; we compute one c(u,v) at a time
– we limit path lengths by L
– τ(x,y) is unknown in general; the analyst assigns the weights (learning them from data is ongoing work)
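Phase 2, under an assumed model that mirrors the truncated kernel sketched earlier: score each discovered path by the product of its edges' base similarities, damp by path length, and sum over paths. This reuses the tau and l_short_simple_paths helpers from the sketches above and is an illustration, not the paper's exact formula:

```python
def connection_strength(G, u, v, L=7, lam=0.5):
    """c(u, v) as a length-damped sum over all L-short simple paths."""
    total = 0.0
    for path in l_short_simple_paths(G, u, v, L):
        w = 1.0
        for x, y in zip(path, path[1:]):
            w *= tau(G, x, y)              # base similarity per edge
        total += (lam ** (len(path) - 1)) * w
    return total
```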