Presentation is loading. Please wait.

Presentation is loading. Please wait.

The Canopies Algorithm from “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching” Andrew McCallum, Kamal Nigam, Lyle.

Similar presentations


Presentation on theme: "The Canopies Algorithm from “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching” Andrew McCallum, Kamal Nigam, Lyle."— Presentation transcript:

1 The Canopies Algorithm from “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching” Andrew McCallum, Kamal Nigam, Lyle H. Unger Presented by Danny Wyatt

2 Record Linkage Methods  As classification [Felligi & Sunter] Data point is a pair of records Each pair is classified as “match” or “not match” Post-process with transitive closure  As clustering Data point is an individual record All records in a cluster are considered a match No transitive closure if no cluster overlap

3 Motivation  Either way, n 2 such evaluations must be performed  Evaluations can be expensive Many features to compare Costly metrics (e.g. string edit distance)  Non-matches far outnumber matches  Can we quickly eliminate obvious non- matches to focus effort?

4 Canopies  A fast comparison groups the data into overlapping “canopies”  The expensive comparison for full clustering is only performed for pairs in the same canopy  No loss in accuracy if: “For every traditional cluster, there exists a canopy such that all elements of the cluster are in the canopy”

5 Creating Canopies  Define two thresholds Tight: T 1 Loose: T 2  Put all records into a set S  While S is not empty Remove any record r from S and create a canopy centered at r For each other record r i, compute cheap distance d from r to r i If d < T 2, place r i in r’s canopy If d < T 1, remove r i from S

6 Creating Canopies  Points can be in more than one canopy  Points within the tight threshold will not start a new canopy  Final number of canopies depends on threshold values and distance metric  Experimental validation suggests that T 1 and T 2 should be equal

7 Canopies and GAC  Greedy Agglomerative Clustering Make fully connected graph with a node for each data point Edge weights are computed distances Run Kruskal’s MST algorithm, stopping when you have a forest of k trees Each tree is a cluster  With Canopies Only create edges between points in the same canopy Run as before

8 EM Clustering  Create k cluster prototypes c 1 …c k  Until convergence Compute distance from each record to each prototype ( O(kn) ) Use that distance to compute probability of each prototype given the data Move the prototypes to maximize their probabilities

9 Canopies and EM Clustering  Method 1 Distances from prototype to data points only computed within a canopies containing the prototype Note that prototypes can cross canopies  Method 2 Same as one, but also use all canopy centers to account for outside data points  Method 3 Same as 1, but dynamically create and destroy prototypes using existing techniques

10 Complexity  n : number of data points  c : number of canopies  f : average number of canopies covering a data point  Thus, expect fn/c data points per canopy  Total distance comparisons needed becomes

11 Reference Matching Results  Labeled subset of Cora data 1916 citations to 121 distinct papers  Cheap metric Based on shared words in citations Inverted index makes finding that fast  Expensive metric Customized string edit distance between extracted author, title, date, and venue fields  GAC for final clustering

12 Reference Matching Results MethodF1ErrorPrecisionRecallMinutes Canopies0.8380.75%0.7350.9767.65 Complete GAC 0.8350.76%0.7370.965134.09 Author/Year0.6971.60%0.5590.9260.03 none1.99%1.0000.0000.00

13 Discussion  How do cheap and expensive distance metrics interact? Ensure the canopies property Maximize number of canopies Minimize overlap  Probabilistic extraction, probabilistic clustering How do the two interact?  Canopies and classification-based linkage Only calculate pair data points for records in the same canopy


Download ppt "The Canopies Algorithm from “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching” Andrew McCallum, Kamal Nigam, Lyle."

Similar presentations


Ads by Google