1 Interactive Deduplication using Active Learning Sunita Sarawagi and Anuradha Bhamidipaty Presented by Doug Downey

2 Active Learning for de-duplication
De-duplication systems try to learn a function f : D × D → {duplicate, non-duplicate}, where D is the data set.
–f is learned using a labeled training set Lp of record pairs.
–Normally, D is large, so many possible sets Lp exist, and choosing a representative and useful Lp is hard.
Instead of a fixed set Lp, in Active Learning the learner interactively chooses pairs from D × D to be labeled and added to Lp.

3 The ALIAS de-duplicator
Input:
–Set Dp of pairs of data records represented as feature vectors (features might include edit distance, soundex, etc.).
–Initial set Lp of some elements of Dp labeled as duplicates or non-duplicates.
Set T = Lp.
Loop until user satisfaction:
–Train classifier C using T.
–Use C to choose a set S of instances from Dp for labeling.
–Get labels for S from the user, and set T = T ∪ S.
A sketch of this loop follows below.
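A minimal sketch of the loop in Python; train_classifier, select_uncertain, and get_user_labels are hypothetical placeholders standing in for ALIAS's actual components, not its real API:

    # Hedged sketch of the ALIAS active-learning loop described above.
    # train_classifier, select_uncertain, and get_user_labels are
    # hypothetical helpers, not part of the real system.
    def alias_loop(D_p, L_p, rounds):
        """D_p: feature vectors for record pairs; L_p: labeled pairs."""
        T = list(L_p)
        for _ in range(rounds):            # stands in for "until user satisfaction"
            C = train_classifier(T)        # train classifier C on T
            S = select_uncertain(C, D_p)   # instances C is least sure about
            labels = get_user_labels(S)    # ask the user to label S
            T.extend(zip(S, labels))       # T = T ∪ S
        return train_classifier(T)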

4 The ALIAS de-duplicator

5 Active Learning
How do we choose the set S of instances to label? Idea: choose the most uncertain instances.
We’re given that +’s and –’s can be separated by some point, and assume that the probability of – or + varies linearly between a labeled – example r and a labeled + example b.
The midpoint m between r and b is:
–maximally uncertain,
–also the point that reduces our “confusion region” the most.
–So choose m!
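A tiny sketch of that one-dimensional picture, under the stated linearity assumption (the function names are illustrative, not from the paper):

    # A labeled "–" at r, a labeled "+" at b; assume P(+ | x) rises
    # linearly from 0 at r to 1 at b. Uncertainty min(p, 1 - p) then
    # peaks where p = 0.5, i.e., at the midpoint m.
    def p_plus(x, r, b):
        return (x - r) / (b - r)     # linear interpolation of P(+ | x)

    def most_uncertain_point(r, b):
        return (r + b) / 2.0         # p_plus(m, r, b) == 0.5 here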

6 Measuring Uncertainty with Committees
Train a committee of several slightly different versions of a classifier.
Uncertainty(x) ∝ entropy of the committee’s predictions on x.
Form committees by:
–Randomizing model parameters
–Partitioning training data
–Partitioning attributes
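A minimal sketch of committee-vote entropy as the uncertainty score; it assumes each committee member exposes a predict(x) method returning a class label (an illustrative interface, not ALIAS's):

    import math
    from collections import Counter

    # Uncertainty(x) as the entropy of the committee's votes on x.
    # How the members were formed (randomized parameters, data or
    # attribute partitions) does not matter to this measure.
    def vote_entropy(committee, x):
        votes = Counter(clf.predict(x) for clf in committee)
        total = sum(votes.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in votes.values())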

7 Methods for Forming Committees

8 Committee Size

9 Representativeness of an Instance
We need informative instances, not just uncertain ones.
Solution: sample n of the kn most uncertain instances, weighted by uncertainty.
–k = 1 → no sampling
–kn = all data → full sampling
Why not use information gain?
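A sketch of that partial-sampling rule (names are illustrative; for simplicity it samples with replacement, which the paper's variant need not do):

    import random

    # Keep the kn most uncertain candidates, then draw n of them with
    # probability proportional to uncertainty. k = 1 degenerates to
    # plain top-n selection; a very large k approaches full sampling.
    def sample_representative(instances, uncertainty, n, k):
        pool = sorted(instances, key=uncertainty, reverse=True)[:k * n]
        weights = [uncertainty(x) for x in pool]
        return random.choices(pool, weights=weights, k=n)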

10 Sampling for Representativeness

11 Evaluation – Different Classifiers
Decision Trees & Naïve Bayes:
–Committees of 5 via parameter randomization.
SVMs:
–Uncertainty = distance from the separator.
Setup: start with one duplicate and one non-duplicate, add one new training example each round (n = 1), with partial sampling (k = 5).
Similarity functions: 3-gram match, % overlapping words, approximate edit distance, special handling of numbers/nulls.
Data sets:
–Bibliography: 32,131 citation pairs from Citeseer, 0.5% duplicates.
–Address: 44,850 pairs, 0.25% duplicates.
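A self-contained sketch of pairwise similarity features in the spirit of those listed (exact definitions in ALIAS may differ; the special number/null handling is omitted):

    # Three illustrative features for a pair of strings: 3-gram overlap,
    # fraction of shared words, and Levenshtein edit distance.
    def ngrams(s, n=3):
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def edit_distance(a, b):
        # classic dynamic-programming Levenshtein distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def pair_features(a, b):
        ga, gb = ngrams(a), ngrams(b)
        wa, wb = set(a.split()), set(b.split())
        return [len(ga & gb) / max(1, len(ga | gb)),   # 3-gram overlap
                len(wa & wb) / max(1, len(wa | wb)),   # word overlap
                edit_distance(a, b)]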

12 Evaluation – Different Classifiers

13

14 Value of Active Learning

15

16 Example Decision Tree

17 Conclusions
Active Learning improves performance over random selection.
–It uses two orders of magnitude less training data.
–Note: the gain is not due just to the change in the +/– mix of training examples.
In these experiments, Decision Trees outperformed SVMs and Naïve Bayes.

