The Canopies Algorithm from “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching” by Andrew McCallum, Kamal Nigam, and Lyle H. Ungar.


The Canopies Algorithm from “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching” Andrew McCallum, Kamal Nigam, Lyle H. Ungar Presented by Danny Wyatt

Record Linkage Methods
 As classification [Fellegi & Sunter]
  - Data point is a pair of records
  - Each pair is classified as “match” or “not match”
  - Post-process with transitive closure
 As clustering
  - Data point is an individual record
  - All records in a cluster are considered a match
  - No transitive closure needed if clusters do not overlap

Motivation
 Either way, n² pairwise evaluations must be performed
 Evaluations can be expensive
  - Many features to compare
  - Costly metrics (e.g. string edit distance)
 Non-matches far outnumber matches
 Can we quickly eliminate obvious non-matches to focus effort?

Canopies
 A fast, cheap comparison groups the data into overlapping “canopies”
 The expensive comparison for full clustering is performed only for pairs in the same canopy
 No loss in accuracy if the canopy property holds: “For every traditional cluster, there exists a canopy such that all elements of the cluster are in the canopy”

Creating Canopies
 Define two thresholds: tight T1 and loose T2 (T1 < T2)
 Put all records into a set S
 While S is not empty:
  - Remove any record r from S and create a canopy centered at r
  - For each other record r_i, compute the cheap distance d from r to r_i
  - If d < T2, place r_i in r’s canopy
  - If d < T1, remove r_i from S
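The loop above can be sketched in Python. This is a minimal sketch, not the paper’s implementation: the function and parameter names (`make_canopies`, `cheap_dist`, `t_tight`, `t_loose`) and the deterministic choice of center are illustrative assumptions.

```python
def make_canopies(records, cheap_dist, t_tight, t_loose):
    """Group records into overlapping canopies.

    t_tight is T1 and t_loose is T2 (t_tight < t_loose): points within
    t_loose of a center join its canopy; points within t_tight can no
    longer start a canopy of their own.
    """
    remaining = set(range(len(records)))  # the set S of unprocessed records
    canopies = []
    while remaining:
        # Remove any record r from S and start a canopy centered at it
        # (min() is a deterministic stand-in for "any")
        center = min(remaining)
        remaining.remove(center)
        canopy = [center]
        for i in sorted(remaining):
            d = cheap_dist(records[center], records[i])
            if d < t_loose:            # d < T2: place r_i in r's canopy
                canopy.append(i)
            if d < t_tight:            # d < T1: remove r_i from S
                remaining.discard(i)
        canopies.append(canopy)
    return canopies
```

Note that a record removed from S can still appear in later canopies only if it was added to them before removal, which is how canopies come to overlap when T1 < T2.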

Creating Canopies
 Points can be in more than one canopy
 Points within the tight threshold will not start a new canopy
 The final number of canopies depends on the threshold values and the distance metric
 Experimental validation suggests that T1 and T2 should be equal

Canopies and GAC
 Greedy Agglomerative Clustering:
  - Build a fully connected graph with a node for each data point
  - Edge weights are the computed distances
  - Run Kruskal’s MST algorithm, stopping when you have a forest of k trees
  - Each tree is a cluster
 With canopies:
  - Only create edges between points in the same canopy
  - Run as before
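The canopy-restricted GAC can be sketched as Kruskal’s algorithm over only the within-canopy edges. This is an illustrative sketch under assumed names (`canopy_gac`, `dist`); the paper’s implementation details may differ.

```python
def canopy_gac(points, dist, canopies, k):
    """Greedy agglomerative clustering via Kruskal's MST algorithm,
    with candidate edges restricted to pairs sharing a canopy."""
    # Collect only the pairs that co-occur in some canopy
    pairs = set()
    for canopy in canopies:
        for x in range(len(canopy)):
            for y in range(x + 1, len(canopy)):
                a, b = canopy[x], canopy[y]
                pairs.add((min(a, b), max(a, b)))
    # Sort edges by (expensive) distance, cheapest first
    edges = sorted((dist(points[a], points[b]), a, b) for a, b in pairs)

    # Union-find; stop merging once only k trees remain
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    trees = len(points)
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            trees -= 1
            if trees == k:
                break
    # Each remaining tree is a cluster
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

Because edges are only materialized for co-canopy pairs, the expensive metric is never evaluated on pairs the cheap metric already ruled out.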

EM Clustering
 Create k cluster prototypes c_1 … c_k
 Until convergence:
  - Compute the distance from each record to each prototype (O(kn))
  - Use these distances to compute the probability of each prototype given the data
  - Move the prototypes to maximize their probabilities
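A minimal one-dimensional sketch of this loop, assuming Gaussian responsibilities with a fixed width `sigma` (the function name and fixed-variance simplification are illustrative, not the paper’s exact model):

```python
import math

def em_prototypes(data, prototypes, n_iters=25, sigma=1.0):
    """E-step: soft-assign each point to each prototype via Gaussian
    responsibilities (the O(kn) distance step). M-step: move each
    prototype to the responsibility-weighted mean of the data."""
    for _ in range(n_iters):
        # E-step: P(prototype | point) from distances
        resp = []
        for x in data:
            w = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in prototypes]
            z = sum(w)
            resp.append([wi / z for wi in w])
        # M-step: move prototypes to maximize their (weighted) likelihood
        prototypes = [
            sum(r[j] * x for r, x in zip(resp, data)) / sum(r[j] for r in resp)
            for j in range(len(prototypes))
        ]
    return prototypes
```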

Canopies and EM Clustering
 Method 1: distances from a prototype to data points are computed only within the canopies containing that prototype; note that prototypes can move across canopies
 Method 2: same as Method 1, but also use all canopy centers to account for data points outside those canopies
 Method 3: same as Method 1, but dynamically create and destroy prototypes using existing techniques

Complexity
 n: number of data points
 c: number of canopies
 f: average number of canopies covering a data point
 Thus, expect fn/c data points per canopy
 Total expensive distance comparisons: each point is compared against the ~fn/c points in each of its f canopies, i.e. roughly n · f · (fn/c) = f²n²/c, versus n² without canopies
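A quick arithmetic check of that estimate (the function name and example figures are illustrative, not taken from the paper’s experiments):

```python
def canopy_comparisons(n, c, f):
    """Expected expensive-metric comparisons with canopies: each of n
    points lies in f canopies of about f*n/c points each, so the total
    is roughly n * f * (f*n/c) = f**2 * n**2 / c."""
    return f ** 2 * n ** 2 / c

def speedup(n, c, f):
    """Reduction factor versus the n**2 all-pairs approach: c / f**2."""
    return n ** 2 / canopy_comparisons(n, c, f)
```

For example, with n = 10,000 points split across c = 100 canopies and f = 2 canopies per point on average, canopies cut the comparison count by a factor of c/f² = 25.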

Reference Matching Results
 Labeled subset of the Cora data: 1916 citations to 121 distinct papers
 Cheap metric: based on shared words between citations; an inverted index makes finding candidates fast
 Expensive metric: customized string edit distance between extracted author, title, date, and venue fields
 GAC for the final clustering
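The cheap shared-word metric and its inverted index can be sketched as follows. The Jaccard weighting here is a stand-in assumption; the paper’s exact word-overlap scoring may differ.

```python
from collections import defaultdict

def build_inverted_index(citations):
    """Map each word to the set of citation ids containing it, so that
    candidates sharing any word are found without an all-pairs scan."""
    index = defaultdict(set)
    for i, text in enumerate(citations):
        for word in set(text.lower().split()):
            index[word].add(i)
    return index

def cheap_distance(a, b):
    """Cheap word-overlap distance: 1 - Jaccard similarity of the two
    citations' word sets (0 = identical vocabulary, 1 = disjoint)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(wa & wb) / len(wa | wb)
```

Looking up a citation’s words in the index yields exactly the records with nonzero overlap, which is what makes the canopy-creation step fast.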

Reference Matching Results
[Results table: Method | F1 | Error | Precision | Recall | Minutes, comparing Canopies, Complete GAC, Author/Year, and no clustering; the cell values were not preserved in this transcript.]

Discussion
 How do the cheap and expensive distance metrics interact? The cheap metric should ensure the canopy property, maximize the number of canopies, and minimize canopy overlap
 Probabilistic extraction and probabilistic clustering: how do the two interact?
 Canopies and classification-based linkage: only evaluate pairs of records that fall in the same canopy