Exploiting Relationships for Object Consolidation. Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra. Computer Science Department, University of California, Irvine.

Presentation transcript:

Exploiting Relationships for Object Consolidation
Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine (RESCUE)
ACM IQIS 2005
Copyright (c) Dmitri V. Kalashnikov, 2005. Work supported by NSF Grants IIS and IIS.

2 Talk Overview
Examples
– motivating data cleaning (DC)
– motivating analysis of relationships for DC
Object consolidation
– one of the DC problems
– the one this work addresses
Proposed approach
– RelDC framework
– relationship analysis and graph partitioning
Experiments

3 Why do we need “Data Cleaning”?
Jane Smith (fresh Ph.D.): “Hi, my name is Jane Smith. I’d like to apply for a faculty position at your university.”
Tom (recruiter): “OK, let me check something quickly…” [looks up her publications and CiteSeer rank]
Tom: “Wow! Unbelievable! You must be a really hard worker! I am sure we will accept a candidate like that!”

4 Suspicious entries
– Let’s go to the DBLP website, which stores bibliographic entries of many CS authors
– Let’s check two people: “A. Gupta” and “L. Zhang”
– What is the problem?
(CiteSeer: the top-k most cited authors, vs. DBLP)

5 Comparing raw and cleaned CiteSeer
[Table: the CiteSeer top-k vs. the cleaned CiteSeer top-k, with columns Rank, Author, Location, # citations. The recoverable entries are the ranks and author first names: 1 douglas, 2 rakesh, 3 hector, 4 sally, 5 jennifer, 6 david, 6 thomas, 7 rajeev, 8 willy, 9 van, 10 rajeev, 11 john, 12 joseph, 13 andrew, 14 peter, 15 serge; the location and citation-count values did not survive extraction.]

6 What is the lesson?
“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.
– data should be cleaned first
– e.g., determine the (unique) real authors of publications
– solving such challenges is not always “easy”, which explains the large body of work on data cleaning
– note: CiteSeer is aware of the problem with its ranking; there are more issues with CiteSeer, many not related to data cleaning

7 RelDC Framework

8 Object Consolidation
Notation
– O = {o1, ..., o|O|}: the set of entities (unknown in general)
– X = {x1, ..., x|X|}: the set of representations
– d[xi]: the entity xi refers to (unknown in general)
– C[xi]: all representations that refer to d[xi], the “group set” (unknown in general; the goal is to find it for each xi)
– S[xi]: all representations that can be xi, the “consolidation set” (determined by FBS); we assume C[xi] ⊆ S[xi]

9 Attributed Relational Graph (ARG)
ARG in RelDC
Nodes
– one per cluster of representations
– one per representation (for “tough” cases)
Edges
– regular
– similarity

10 Context Attraction Principle (CAP)
Take a guess: who is “J. Smith” – Jane? John?

11 Questions to Answer
1. Does the CAP principle hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?
2. Can we design a generic solution that exploits relationships for disambiguation?

12 Consolidation Algorithm
1. Construct the ARG and identify all VCSs
– use FBS in constructing the ARG
2. Choose a VCS and compute c(u,v) for each pair of representations connected via a similarity edge
3. Partition the VCS
– use a graph partitioning algorithm; partitioning is based on the c(u,v) values
– after partitioning, adjust the ARG accordingly
– go to Step 2 if more VCSs exist
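Steps 2 and 3 can be sketched end to end on a single VCS. The snippet below is a toy stand-in that groups representations by thresholding precomputed c(u,v) values with union-find connected components, rather than the graph partitioning algorithm the slides actually use; all names and the threshold are illustrative.

```python
# Toy stand-in for steps 2-3: cluster representations in one VCS by
# linking every pair whose connection strength c(u,v) clears a threshold.
# Connected components replace the real graph-partitioning step.

def consolidate_vcs(reps, c, threshold):
    """reps: representation ids; c: dict (a, b) -> connection strength."""
    parent = {r: r for r in reps}

    def find(r):                      # union-find with path compression
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for (a, b), strength in c.items():
        if strength >= threshold:     # strong connection => same entity
            parent[find(a)] = find(b)

    clusters = {}
    for r in reps:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(members) for members in clusters.values())
```

For example, with c = {('x1','x2'): 0.9, ('x2','x3'): 0.1} and threshold 0.5, x1 and x2 are consolidated while x3 stays separate.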

13 Connection Strength
Computation of c(u,v)
Phase 1: discover connections
– all L-short simple paths between u and v
– the bottleneck; optimizations exist, not covered in IQIS’05
Phase 2: measure the strength of the discovered connections
– many c(u,v) models exist
– we use a model similar to diffusion kernels
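Phase 1 can be sketched as a bounded depth-first search; a minimal illustration, where the adjacency-dict graph representation is an assumption for the sketch:

```python
# Phase 1 sketch: enumerate all simple paths (no repeated nodes) between
# u and v that use at most L edges, via depth-first search with pruning.

def l_short_simple_paths(graph, u, v, L):
    """graph: dict node -> iterable of neighbors. Returns lists of nodes."""
    paths = []

    def dfs(node, path):
        if len(path) - 1 > L:          # path already uses more than L edges
            return
        if node == v:                  # reached the target: record the path
            paths.append(list(path))
            return
        for nxt in graph.get(node, ()):
            if nxt not in path:        # keep the path simple
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(u, [u])
    return paths
```

This exhaustive enumeration is exactly the bottleneck the slide mentions, which is why the full paper applies optimizations.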

14 Existing c(u,v) Models
Models for c(u,v)
– many exist: diffusion kernels, random walks, etc.
– none is fully adequate: they cannot learn similarity from data
Diffusion kernels
– σ(x,y) = σ1(x,y): “base similarity”, via direct links (of length 1)
– σk(x,y): “indirect similarity”, via links of length k
– B, where Bxy = B1xy = σ1(x,y): the base similarity matrix
– B^k: the indirect similarity matrix
– K: the total similarity matrix, or “kernel”
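The slide leaves the kernel itself implicit; one standard von-Neumann-style formulation built from the base similarity matrix B (the damping factor λ is an assumption here, not taken from the slide) is:

```latex
\sigma_k(x,y) = \bigl(B^k\bigr)_{xy}, \qquad
K = \sum_{k=1}^{\infty} \lambda^k B^k = (I - \lambda B)^{-1} - I,
\qquad 0 < \lambda < 1/\rho(B),
```

so longer (weaker) connections are damped geometrically, and the series converges whenever λ is below the reciprocal of the spectral radius ρ(B).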

15 Our c(u,v) Model
Our c(u,v) model
– regular edges have types T1, ..., Tn
– types T1, ..., Tn have weights w1, ..., wn
– σ(x,y) = wi: get the type of a given edge and assign its weight as the base similarity
– paths with similarity edges might not exist; use heuristics
Our model vs. diffusion kernels
– virtually identical, but...
– we do not compute the whole matrix K; we compute one c(u,v) at a time
– we limit path lengths by L
– σ(x,y) is unknown in general: the analyst assigns the weights; learning them from data is ongoing work
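A minimal sketch of a connection-strength measure in this spirit: each regular edge carries a type, each type a weight used as base similarity, and each discovered path contributes the product of its edge weights damped by λ^length. The damping scheme, type names, and helper signatures are illustrative assumptions, not the paper's exact model.

```python
# Sketch: aggregate one c(u,v) value from already-discovered paths.
# edge_type maps an undirected edge (as a frozenset of its endpoints)
# to its type; type_weight maps a type to its base-similarity weight w_i.

def connection_strength(paths, edge_type, type_weight, lam=0.5):
    total = 0.0
    for path in paths:
        edges = list(zip(path, path[1:]))
        contrib = lam ** len(edges)          # damp longer paths
        for e in edges:
            contrib *= type_weight[edge_type[frozenset(e)]]
        total += contrib                     # every path adds evidence
    return total
```

For a single length-2 path with edge weights 1.0 and 0.5 and λ = 0.5, this yields 0.5² · 1.0 · 0.5 = 0.125.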

16 Consolidation via Partitioning
Observations
– each VCS contains representations of at least 1 object
– if a representation is in a VCS, then the rest of the representations of the same object are in it too
Partitioning: two cases
– k, the number of entities in the VCS, is known
– when k is known, use any partitioning algorithm: maximize inside-connections, minimize outside-connections; we use normalized cut [Shi, Malik 2000]
– when k is unknown: split into two just to see the cut, compare the cut against a threshold, and decide “to split” or “not to split”
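The split-or-not test can be illustrated with the normalized-cut value of a candidate bipartition [Shi, Malik 2000]; a small sketch, where the threshold value and matrix layout are assumptions:

```python
# Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V),
# computed over a symmetric weight matrix W (here, a list of lists).

def ncut_value(W, A, B):
    cut = sum(W[i][j] for i in A for j in B)
    assoc_A = sum(W[i][j] for i in A for j in range(len(W)))
    assoc_B = sum(W[i][j] for i in B for j in range(len(W)))
    return cut / assoc_A + cut / assoc_B

def should_split(W, A, B, threshold):
    """Accept the split only when the normalized cut is cheap enough."""
    return ncut_value(W, A, B) < threshold
```

Two tightly connected pairs joined by one weak edge give a small Ncut, so the split is accepted; a single well-connected cluster gives a large Ncut and stays whole.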

17 Measuring Quality of Outcome
Existing measures
– dispersion [DMKD’04]: for an entity, into how many clusters its representations are clustered; ideal is 1
– diversity: for a cluster, how many distinct entities it covers; ideal is 1
– easy, with clear semantics, but they have problems (see figure)
Entropy
– for an entity: if, out of m representations, m1 go to cluster C1, ..., mn to Cn, its entropy is computed from the fractions mi/m
– for a cluster consisting of m1 representations of E1, ..., mn of En, the cluster entropy is defined the same way
– ideal entropy is zero
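The entropy measure described above, whether applied to an entity or to a cluster, reduces to the Shannon entropy of the membership fractions; a minimal sketch (the original slide's exact formula was a figure and is not recoverable, so this is the standard definition):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of membership counts.

    counts: how many representations fall in each cluster (for an entity),
    or how many belong to each entity (for a cluster). Zero means ideal:
    everything ended up in a single cluster / a single entity.
    """
    total = sum(counts)
    return -sum((m / total) * math.log2(m / total) for m in counts if m > 0)
```

An entity whose five representations land in one cluster scores 0; a 2-2 split scores 1 bit, and the score grows as the representations scatter.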

18 Experimental Setup
Parameters
– L-short simple paths, L = 7 (L is the path-length limit)
Note
– the algorithm is applied to the “tough cases” left after FBS has already successfully consolidated many entries
RealMov dataset
– movies (12K)
– people (22K): actors, directors, producers
– studios (1K): producing, distributing
Uncertainty
– d1, d2, ..., dn are director entities
– pick a fraction, e.g. d1, d2, ..., d10
– group them, e.g. in pairs: {d1,d2}, ..., {d9,d10}
– make all representations of d1 and d2 indiscernible by FBS, and so on
Baseline 1
– one cluster per VCS, regardless
– dumb? ... but ideal dispersion and H(E)
Baseline 2
– knows the grouping statistics
– guesses the number of entities in each VCS
– randomly assigns representations to clusters

19 Sample Movies Data

20 The Effect of L on Quality
[Figures: cluster entropy & diversity, and entity entropy & dispersion, each as a function of L]

21 Effect of Threshold and Scalability

22 Summary
RelDC
– developed in Aug 2003 (for reference disambiguation)
– a domain-independent data cleaning framework
– uses relationships for data cleaning
– reference disambiguation [SDM’05]
– object consolidation [IQIS’05]
Ongoing work
– “learning” the importance of relationships from data

23 Contact Information
RelDC project (RESCUE)
Zhaoqi Chen
Dmitri V. Kalashnikov
Sharad Mehrotra