1 Collective Relational Clustering
Indrajit Bhattacharya, Assistant Professor, Department of CSA, Indian Institute of Science

2 Relational Data
- Recent abundance of relational ('non-iid') data
  - Internet
  - Social networks
  - Citations in scientific literature
  - Biological networks
  - Telecommunication networks
  - Customer shopping patterns
  - ...
- Various applications
  - Web mining
  - Online advertising and recommender systems
  - Bioinformatics
  - Citation analysis
  - Epidemiology
  - Text analysis
  - ...

3 Clustering for Relational Data
- A lot of research in Statistical Relational Learning over the last decade
  - Series of focused workshops at premier conferences
  - Confluence of different research areas
- Recent focus on unsupervised learning from relational data
  - Regular papers at premier conferences
  - Recent book: Relational Data Clustering: Models, Algorithms, and Applications, Bo Long, Zhongfei Zhang, Philip S. Yu, CRC Press 2009

4 Traditional vs Relational Clustering
- Traditional clustering focuses on 'flat' data
  - Clusters based on features of individual objects
- Relational clustering additionally considers relations
  - Heterogeneous relations across objects of different types
  - Homogeneous relations across objects of the same type
- Naïve solution: flatten the data, then cluster
  - Loses relational and structural information
  - No influence propagation across relational chains
  - Cannot discover interaction patterns across clusters
- Collective relational clustering instead clusters the different data objects jointly

5 Early Instances of Relational Clustering
- Graph partitioning problem
  - Single-type homogeneous relational data
- Co-clustering problem
  - Bi-type heterogeneous relational data
- General relational clustering considers multi-type data with both heterogeneous and homogeneous relationships

6 Talk Outline
- Introduction
- Motivating Application: Entity Resolution over Heterogeneous Relational Data
- The Relational Clustering Problem
- Quick Survey of Relational Clustering Approaches
- Probabilistic Model for Structured Relations
- Probabilistic Model for Heterogeneous Relations
- Future Directions

7 Talk Outline
- Introduction
- Motivating Application: Entity Resolution over Heterogeneous Relational Data
- The Relational Clustering Problem
- Quick Survey of Relational Clustering Approaches
- Probabilistic Model for Structured Relations
- Probabilistic Model for Heterogeneous Relations
- Future Directions

8 Application: Entity Resolution
[Figure: web data on "Stephen Johnson"]

9 Application: Entity Resolution
[Figure: distinct "Stephen Johnson" entities: Ind. Researcher, Professor, Media Presenter, Movie Director, Photographer, Administrator]

10 Application: Entity Resolution
- Data contains references to real-world entities
  - Structured entities (people, products, institutions, ...)
  - Topics / concepts (comp science, movies, politics, ...)
- Aim: consolidate (cluster) according to entities
  - Entity resolution: map structured references to entities
  - Sense disambiguation: group words according to senses
  - Topic discovery: group words according to topics or concepts

11 Relationships for Entity Resolution
[Figure: documents mentioning the movie-director and photographer "Stephen Johnson" entities]
- Each document or structured record is a (co-occurrence) relation between references to persons, places, organizations, concepts, etc.

12 Relational Network Among Entities
[Figure: network linking two "Stephen Johnson" entities through related entities: Alfred Aho, Jeffrey Ullman, Bell Labs, Comp. Sc., Prog. Lang.; Mark Cross, Chris Walshaw, Univ of Greenwich, HPC; Photography, Ansel Adams; Cinema Direction, Peter Gabriel; White House, EPA, George W. Bush, Government; Media, Music, BBC, Entertainment; Leeds University]

13 Using the Network for Clustering
- Given the network, find the assignment of data items or references to these entities
  - Collective cluster assignment
- Find a "nice" network of entities with regularities in the relational structure
  - Researchers collaborate with colleagues on similar topics
  - People send emails to colleagues and friends

14 Collective Cluster Assignment: Example
[Figure: references from two research contexts assigned collectively to clusters 1-5 and 11-15: "Stephen Johnson" / "S Johnson" / "SC Jonshon", "Alfred Aho" / "A Aho" / "A V Aho", "Jeffrey Ullman" / "J. Ullman" / "J D Ullman", "Bell Labs" / "AT&T Bell", "code generation", "grammar", "expression tree"; and "Stephen Johnson" / "Steve Johnson" / "S Johnson" / "S P Johnson", "Mark Cross" / "M Cross", "Chris Walshaw" / "Chris Walsaw" / "C Walshaw", "U. Greenwich" / "U. of GWich", "Parallelization", "Structured Mesh", "Code generation"]
- Example citation text: "... To find a minimal match cost, dynamic programming, approach of [A Aho and S Johnson, 76], is used. ..."

15 Regularity in a Cluster Network
[Figure: two alternative clusterings of the references "S. Johnson", "Stephen C. Johnson", "M. G. Everett", "M. Everett", "Alfred V. Aho", "A. Aho", and the resulting cluster-cluster networks]
- Clustering 1 has better separation of attributes
- Clustering 2 has fewer cluster-cluster relations

16 Collective Relational Clustering
- Goal: given relations among data items, assign the items to clusters such that the relational neighborhoods of the clusters show regularities (in addition to attribute similarities within clusters)
- Challenges:
  - Collective / joint clustering decisions over relational neighborhoods
  - Defining regularity in relational neighborhoods
  - Searching over relational networks

17 Talk Outline
- Introduction
- Motivating Application: Entity Resolution over Heterogeneous Relational Data
- The Relational Clustering Problem
- Quick Survey of Relational Clustering Approaches
- Probabilistic Model for Structured Relations
- Probabilistic Model for Heterogeneous Relations
- Future Directions

18 Relational Clustering: Different Approaches
- Greedy agglomerative algorithms
  - Bhattacharya et al '04; Dong et al '05
- Information-theoretic methods
  - Mutual information (Dhillon et al '03)
  - Information bottleneck (Slonim & Tishby '03)
  - Bregman divergence (Merugu et al '04, '06)
- Matrix factorization techniques
  - SVD, BVD (Long et al '05, '06)
- Graph cuts
  - Min cut, ratio cut, normalized cut (Dhillon '01)

19 Relational Clustering: Probabilistic Approaches
- Models for co-clustering
  - Taskar et al '01; Hofmann et al '98
- Infinite Relational Model (Kemp et al '06)
- Mixed-Membership Relational Clustering model (Long et al '06)
- Topic model extensions
  - Correlated Topic Models (Blei et al '06)
  - Grouped cluster model (Bhattacharya et al '06)
  - Gaussian Process Topic Models (Agovic & Banerjee '10)
- Markov Logic Networks (Kok & Domingos '08)
- Model for mixed relational data (Bhattacharya et al '08)

20 Talk Outline
- Introduction
- Motivating Application: Entity Resolution over Heterogeneous Relational Data
- The Relational Clustering Problem
- Quick Survey of Relational Clustering Approaches
- Probabilistic Model for Structured Relations
- Probabilistic Model for Heterogeneous Relations
- Future Directions

21 Modeling Groups of Entities
[Figure: two author groups. Bell Labs Group: Alfred V Aho, Jeffrey D Ullman, Ravi Sethi, Stephen C Johnson. Parallel Processing Research Group: Mark Cross, Chris Walshaw, Kevin McManus, Stephen P Johnson, Martin Everett]
- P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson
- P2: C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
- P3: C. Walshaw, M. Cross, M. G. Everett
- P4: Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman
- P5: A. Aho, S. Johnson, J. Ullman
- P6: A. Aho, R. Sethi, J. Ullman

22 LDA-Group Model
[Plate diagram: variables r, a, z with mixtures θ, Φ, V over plates of size R, A, T, and Dirichlet hyperparameters α, β]
- Entity label a and group label z for each reference r
- θ: mixture of groups for each co-occurrence
- Φ_z: multinomial for choosing entity a for each group z
- V_a: multinomial for choosing reference r from entity a
- Dirichlet priors with hyperparameters α and β

23 LDA-Group Model
[Same plate diagram, annotated with an example: the group "Bell Labs" generates the entity "Stephen P Johnson" (generate document), which generates the reference "S. Johnson" (generate names)]
- Entity label a and group label z for each reference r
- θ: mixture of groups for each co-occurrence
- Φ_z: multinomial for choosing entity a for each group z
- V_a: multinomial for choosing reference r from entity a
- Dirichlet priors with hyperparameters α and β

24 Inference Using Gibbs Sampling
- Approximate inference with Gibbs sampling
  - Find the conditional distribution for any reference given the current groups and entities of all other references
  - Sample from the conditional distribution
  - Repeat over all references until convergence
- Assumes the number of groups and entities is known
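The per-reference resampling step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `scores(i, e, assignments)` helper is a hypothetical stand-in for the model's unnormalized conditional P(a_i = e | everything else).

```python
import random

def gibbs_pass(assignments, scores, rng=random.Random(0)):
    """One Gibbs sweep: resample each reference's entity label from its
    conditional distribution, given the current labels of all others.

    assignments: list of current entity labels, one per reference
    scores(i, e, assignments): unnormalized conditional weight of
        assigning reference i to entity e (hypothetical helper)
    """
    entities = sorted(set(assignments))
    for i in range(len(assignments)):
        weights = [scores(i, e, assignments) for e in entities]
        total = sum(weights)
        # sample from the normalized conditional distribution
        r, cum = rng.random(), 0.0
        for e, w in zip(entities, weights):
            cum += w / total
            if r < cum:
                assignments[i] = e
                break
    return assignments
```

In the full sampler this sweep would be repeated until the assignments stabilize, as the slide describes.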

25 Non-Parametric Entity Resolution
- Number of entities is not a parameter
  - Allow the number of entities to grow with the data
- For each reference, choose any existing entity or a new entity a_new
- The hidden name for a new entity prefers all observed references equally
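A common way to let the number of entities grow with the data is a Chinese-restaurant-process-style choice between existing entities and a new one. The sketch below illustrates that general idea under that assumption; the concentration parameter `alpha` and size-proportional weights are conventions of CRP-style models, not necessarily the exact prior in this model.

```python
import random

def choose_entity(counts, alpha, rng):
    """Pick an existing entity with weight proportional to its current
    size, or a brand-new entity with weight alpha (CRP-style sketch).

    counts: dict entity -> number of references currently assigned to it
    alpha:  weight reserved for creating a new entity
    """
    entities = list(counts)
    weights = [counts[e] for e in entities] + [alpha]
    total = sum(weights)
    r, cum = rng.random() * total, 0.0
    for e, w in zip(entities + ["<new>"], weights):
        cum += w
        if r < cum:
            return e
    return "<new>"  # guard against float rounding
```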

26 Faster Inference: Split-Merge Sampling
- The naive strategy reassigns data items individually
- Alternative: allow clusters to merge or split
- For cluster a_i, find conditional probabilities for:
  1. Merging with an existing cluster a_j
  2. Splitting back into the last merged clusters
  3. Remaining unchanged
- Sample the next state for a_i from this distribution
- O(ng + e) time per iteration, compared to O(ng + ne)

27 ER: Evaluation Datasets
- CiteSeer
  - 1,504 citations to machine learning papers (Lawrence et al.)
  - 2,892 references to 1,165 author entities
- arXiv
  - 29,555 publications from High Energy Physics (KDD Cup '03)
  - 58,515 references to 9,200 authors
- Elsevier BioBase
  - 156,156 biology papers (IBM KDD Challenge '05)
  - 831,991 author references
  - Keywords, topic classifications, language, country and affiliation of the corresponding author, etc.

28 ER: Experimental Evaluation
- LDA-ER outperforms the baselines on all datasets
  - A: assigns the same entity to references whose attribute similarity is above a threshold
  - A*: transitive closure over the decisions in A
- The baselines require the threshold as a parameter
  - Reported: best achievable performance over all thresholds
- LDA-ER does not require a similarity threshold

29 ER: Trends in Semi-Synthetic Data
- Bigger improvement with:
  - a bigger percentage of ambiguous references
  - more references per co-occurrence
  - more neighbors per entity

30 Talk Outline
- Introduction
- Motivating Application: Entity Resolution over Heterogeneous Relational Data
- The Relational Clustering Problem
- Quick Survey of Relational Clustering Approaches
- Probabilistic Model for Structured Relations
- Probabilistic Model for Heterogeneous Relations
- Future Directions

31 Entity Resolution over a Document Collection
- In a document collection, which names refer to the same entities?
[Example reviews:]
- "Harrison Ford is a resourceful person who stay out of reach to the marshal. David Toohy has written some interesting plots and chases"
- "When it comes to create a universe George Lucas is undisputed leader. Harrison Ford has done justice and special effects are superb. Lucas script seemed funny enough."
- "It was a fairly good movie with couple of laughs. There was not much story but Ford was good."
- "Harrison Ford the adventurer is it in yet another quest. To find his father who is in search of the Holy Grail. George Lucas has done a wonderful job."

32 Jointly Modeling the Textual Content
- Words are indicative of the concept entities
- Concept entities are related to person entities
[Same example reviews as the previous slide]

33 Relational Clustering over Structured and Unstructured Data
- Document words belong to two categories
  - References to structured entities
  - References to (unstructured) concept entities
- Collectively determine clusters for both types of entities
- Relational patterns over the two types of entities
- Simplifications for learning:
  - Observed domain of entities with structured attributes
  - Observed relationships between domain entities and categories, for constructing relational neighborhoods

34 Generative Model for Documents from Structured Entities
[Plate diagram: variables c, t, e, a, w with plate counts n, m, N]
- Generate N reviews one by one
- First choose a genre, say Action
- Choose an Action movie, say Indiana Jones
- Generate n mentions for the movie
  - Choose a movie attribute, say Actor
  - Get the attribute value, say Harrison Ford
  - Generate a mention of the attribute value: "Harrison Ford", "Ford"
- Generate m Action words: adventurer, quest, justice, ...
- P(t): prior over genres
- P(e | t): movies for genre
- P(w | t): words for genre
- P(c): prior over movie attributes
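The generative story above can be sketched as a small sampler. Uniform choices stand in for the learned multinomials P(t), P(e | t), P(w | t), and P(c); all toy inputs here are hypothetical illustrations, not the model's fitted distributions.

```python
import random

def generate_review(genres, movies, words, attrs, n_mentions, n_words, rng):
    """Sketch of the slide's generative process for one review.

    genres: list of genre names
    movies: dict genre -> list of (title, {attr: value}) pairs
    words:  dict genre -> list of genre-indicative words
    attrs:  list of movie attribute names (e.g. 'actor', 'director')
    """
    t = rng.choice(genres)                    # P(t): choose a genre
    title, attr_vals = rng.choice(movies[t])  # P(e | t): choose a movie
    mentions = []
    for _ in range(n_mentions):
        c = rng.choice(attrs)                 # P(c): choose an attribute
        mentions.append(attr_vals[c])         # mention its value
    doc_words = [rng.choice(words[t]) for _ in range(n_words)]  # P(w | t)
    return t, title, mentions, doc_words
```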

35 Entity Identification: Evaluation
- Movie reviews
  - 12,500 reviews: first 10 reviews for the top 50 movies in each of 25 genres
- Structured movie database from IMDB
  - 26,250 movies: top 1,250 movies from 25 genres + 25,000 others
  - Movie table with 7 columns, but no movie-name column
  - Genre + top 2 actors, actresses, directors, writers
- Entity identification baseline
  - Aggregates similarity over all mentions to score an entity for a document
  - Does not use the unstructured words in the document
- Document classification baseline
  - SVM-Light with default parameters
  - Uses all words in the document, including structured mentions

36 Ent-Id: Experimental Results on IMDB
- Improvement in entity-identification accuracy
- Significant drop in entropy over entity choices
- The baseline catches up with the joint model only when 35% of documents are provided for training

37 Ent-Id: Results on Semi-Synthetic Data
- The joint model outperforms the baseline for large overlap between genres
  - 80% training data for the baseline, none for the joint model
- Entity identification improves from 38% to 60% for medium overlap, and to 70% when words clearly indicate the genre

38 Future Directions
- Handling uncertain relations
  - Coupling with information extraction
- Modeling the cluster network
- Regularization for networks
- Scalable inference mechanisms
- Incorporating domain knowledge and user interaction
  - Semi-supervision
  - Active learning

39 References
- A. Agovic and A. Banerjee, Gaussian Process Topic Models, UAI 2010
- S. Kok and P. Domingos, Extracting Semantic Networks from Text via Relational Clustering, ECML 2008
- I. Bhattacharya, S. Godbole, and S. Joshi, Structured Entity Identification and Document Categorization: Two Tasks with One Joint Model, SIGKDD 2008
- I. Bhattacharya and L. Getoor, Collective Entity Resolution in Relational Data, ACM TKDD, March 2007
- A. Banerjee, S. Basu, and S. Merugu, Multi-Way Clustering on Relation Graphs, SIAM SDM 2007
- B. Long, M. Zhang, and P. S. Yu, A Probabilistic Framework for Relational Clustering, SIGKDD 2007
- D. Zhou, J. Huang, and B. Schoelkopf, Learning with Hypergraphs: Clustering, Classification, and Embedding, NIPS 2007
- B. Long, M. Zhang, X. Wu, and P. S. Yu, Spectral Clustering for Multi-type Relational Data, ICML 2006
- I. Bhattacharya and L. Getoor, A Latent Dirichlet Model for Unsupervised Entity Resolution, SIAM SDM 2006
- X. Dong, A. Halevy, and J. Madhavan, Reference Reconciliation in Complex Information Spaces, SIGMOD 2005
- I. Bhattacharya and L. Getoor, Iterative Record Linkage for Cleaning and Integration, SIGMOD DMKD 2004
- B. Taskar, E. Segal, and D. Koller, Probabilistic Classification and Clustering in Relational Data, IJCAI 2001

40 Backup Slides

41 Entity Resolution from Structured Relations
- P1: "JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines", C. Walshaw, M. Cross, M. G. Everett, S. Johnson
- P2: "Partitioning Mapping of Unstructured Meshes to Parallel Machine Topologies", C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
- P3: "Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm", C. Walshaw, M. Cross, M. G. Everett
- P4: "Code Generation for Machines with Multiregister Operations", Alfred V. Aho, Stephen C. Johnson, Jeffrey D. Ullman
- P5: "Deterministic Parsing of Ambiguous Grammars", A. Aho, S. Johnson, J. Ullman
- P6: "Compilers: Principles, Techniques, and Tools", A. Aho, R. Sethi, J. Ullman
[Figure: entity network linking Stephen Johnson, Alfred Aho, Jeffrey Ullman, Bell Labs, Prog. Lang.; and a second Stephen Johnson, Mark Cross, Chris Walshaw, Univ of Greenwich, HPC]

42 LDA-ER Generative Process: Illustration
For each paper p:
1. Choose θ_p
2. For each author:
   - Sample a group z from θ_p
   - Sample an entity a from Φ_z
   - Sample a reference r from V_a
[Figure: for paper P5, θ_P5 = [p(G1)=0.1, p(G2)=0.9]; sampling z=G2 and a=Aho from Φ_G2 (a distribution over the entities Walshaw, Johnson1, McManus, Cross, Everett, Ullman, Aho, Sethi, Johnson2), then r="A.Aho" from V_A; similarly "J.Ullman" from V_U and "S.Johnson" from V_J2; V_J1 places its mass on "Stephen P Johnson", V_J2 on "S C Johnson", "Stephen C Johnson", "S Johnson"]
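The three sampling steps above can be sketched directly. The `draw` helper and the toy probability tables are hypothetical illustrations; the slide's θ_p, Φ_z, and V_a are represented as plain dictionaries.

```python
import random

def generate_paper(theta, phi, V, n_authors, rng):
    """Sketch of the LDA-ER generative process for one paper:
    per author slot, sample group z ~ theta, entity a ~ phi[z],
    and an observed reference string r ~ V[a].

    theta: dict group -> probability (the paper's group mixture)
    phi:   dict group -> dict entity -> probability
    V:     dict entity -> dict reference-string -> probability
    """
    def draw(dist):
        r, cum = rng.random(), 0.0
        for k, p in dist.items():
            cum += p
            if r < cum:
                return k
        return k  # guard against float rounding

    refs = []
    for _ in range(n_authors):
        z = draw(theta)          # group label
        a = draw(phi[z])         # entity label
        refs.append(draw(V[a]))  # observed reference
    return refs
```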

43 Generating References from Entities
- Entities are not directly observed
  1. Hidden attribute for each entity
  2. Similarity measure for pairs of attributes
- A distribution over attributes for each entity
[Figure: an entity with hidden attribute "Stephen C Johnson" generating the references "S C Johnson", "Stephen C Johnson", "S Johnson", alongside the entities "Alfred Aho" and "M. Cross"]

44 ER: Performance for Specific Names
- Significantly larger improvements for 'ambiguous' names

45 Simplifying the Problem: Entity Identification
- Assume a database of entities is available
  - IMDB movie database
  - DBLP, PubMed paper databases
  - Customer databases in companies

46 Entity Identification: Still Difficult
- Not enough information to disambiguate
- Noise in entity mentions
[Same example reviews, each with candidate movie matches from the database marked "?":]
- American Graffiti: Harrison Ford, George Lucas
- Indiana Jones and the Last Crusade: Harrison Ford, George Lucas
- Star Wars: Return of the Jedi: Harrison Ford, George Lucas
- Fugitive: Harrison Ford, David Twohy

47 The Intuition
- Categorization and entity identification help each other
- A classifier predicts additional attributes from the document for use in entity identification
  - Classifiers for Genre, Rating, Country of the movie, ...
- Entity identification creates labeled data for training the classifier
  - Reviews tagged with movies -> labeled with Genre, Rating, etc.

48 Problem Formulation
[Figure: an example review linked to entities E, columns C, and a type column T]
- Structured mentions are derived from column values
- Unstructured words are determined by the type value
- Each document has an unobserved central entity
- Problem: find the central entity for each document and categorize the documents according to type values

49 Formalizing the Intuition
- Traditional entity identification only considers structured mentions as evidence
- Here, words suggest type values, and entities relevant for those types get priority

50 Formalizing the Intuition
- Traditional entity identification only considers structured mentions as evidence
- Traditional document categorization only considers words as evidence
- Here, words suggest type values, and entities relevant for those types get priority
- Mentions suggest entities, and type values relevant for those entities get priority

51 Unsupervised EM for Inference
- Infer the hidden entity and type value for each document from the observed words and references
- Initialize posteriors using entity references only
- Restrict the assignment space for tractability

52 Objective Function
- Minimize, over cluster pairs (c_i, c_j):
    w_A * sim_A(c_i, c_j) + w_R * I(c_i, c_j)
  where sim_A is the similarity of attributes, w_A and w_R are the weights for attributes and relations, and I(c_i, c_j) = 1 iff a relational edge exists between c_i and c_j (a common cluster neighborhood)
- Greedy agglomerative clustering step: merge the cluster pair with the maximum reduction in objective-function value
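The pairwise score that drives the greedy merges combines attribute similarity with relational evidence. A minimal sketch, assuming a linear weighted combination (the `sim_attr` and `sim_rel` callables are hypothetical stand-ins for the attribute-similarity measure and the relational-edge indicator):

```python
def combined_sim(ci, cj, sim_attr, sim_rel, alpha=0.5):
    """Weighted combination of attribute and relational similarity for a
    cluster pair, in the spirit of the slide's objective.

    sim_attr(ci, cj): attribute similarity in [0, 1]
    sim_rel(ci, cj):  1.0 iff a relational edge links the two clusters
    alpha: relative weight of the relational term (an assumption here)
    """
    return (1 - alpha) * sim_attr(ci, cj) + alpha * sim_rel(ci, cj)
```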

53 Collective Relational Clustering Algorithm
1. Find similar references using 'blocking'
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert them into a priority queue
4. Repeat until the priority queue is empty:
5.   Find the 'closest' cluster pair
6.   Stop if its similarity is below the threshold
7.   Merge it to create a new cluster
8.   Update similarities for 'related' clusters
- O(nk log n) algorithm with an efficient implementation
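Steps 3 through 8 above can be sketched with a standard heap-based agglomerative loop. This is a simplified illustration, not the paper's efficient implementation: blocking and bootstrapping are omitted, and the incremental "related clusters" update of step 8 is replaced by a crude full re-scoring after each merge.

```python
import heapq
import itertools

def agglomerate(items, sim, threshold):
    """Merge the closest cluster pair (by a caller-supplied similarity)
    until the best remaining pair falls below the threshold.

    items: initial singleton cluster ids (hashable)
    sim(c1, c2): similarity of two clusters (frozensets of ids)
    """
    clusters = {frozenset([x]) for x in items}
    tie = itertools.count()  # tiebreaker so the heap never compares frozensets
    heap = []

    def push_all_pairs():
        cl = list(clusters)
        for i in range(len(cl)):
            for j in range(i + 1, len(cl)):
                # negate similarity: heapq is a min-heap but we want the max
                heapq.heappush(heap, (-sim(cl[i], cl[j]), next(tie), cl[i], cl[j]))

    push_all_pairs()
    while heap:
        neg_s, _, c1, c2 = heapq.heappop(heap)
        if c1 not in clusters or c2 not in clusters:
            continue  # stale entry: one side was already merged away
        if -neg_s < threshold:
            break  # closest remaining pair is not similar enough (step 6)
        clusters -= {c1, c2}
        clusters.add(c1 | c2)  # merge into a new cluster (step 7)
        heap.clear()
        push_all_pairs()  # crude stand-in for the incremental update (step 8)
    return clusters
```

With lazy invalidation of stale heap entries and truly incremental updates, this is the shape of loop that gives the O(nk log n) bound the slide mentions.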

