# Efficient classification for metric data

Lee-Ad Gottlieb (Weizmann Institute), Aryeh Kontorovich (Ben Gurion U.), Robert Krauthgamer (Weizmann Institute)

## Presentation transcript


Slide 2: Classification problem

- Probabilistic concept learning: $S$ is a set of $n$ examples $(x, y)$ drawn from $X \times \{-1, 1\}$ according to some unknown probability distribution $P$.
- The learner produces a hypothesis $h: X \to \{-1, 1\}$.
- A good hypothesis (classifier) minimizes the generalization error $P\{(x, y) : h(x) \neq y\}$.
- A popular solution uses kernels: data are represented as vectors, and kernels take the dot product of vectors.

Slide 3: Finite metric space

- $(X, d)$ is a metric space if $X$ is a set of points and $d$ is a distance function that is nonnegative, symmetric, and satisfies the triangle inequality.
- Classification for metric data? Problem: there is no vector representation, hence no notion of dot product, so kernels cannot be used.
- What can be done in this setting?

[Figure: three cities (Haifa, Jerusalem, Tel-Aviv) with pairwise distances 151 km, 95 km, and 62 km.]
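A minimal sketch of checking the three metric properties on a finite point set. The assignment of the three distances to specific city pairs is an assumption made for illustration, and the names `KM`, `city_distance`, and `is_metric` are hypothetical.

```python
from itertools import permutations

# Assumed assignment of the slide's three distances to city pairs (illustrative only).
KM = {
    frozenset(("Haifa", "Jerusalem")): 151,
    frozenset(("Haifa", "Tel-Aviv")): 95,
    frozenset(("Tel-Aviv", "Jerusalem")): 62,
}


def city_distance(a, b):
    """Distance in km between two cities; zero from a city to itself."""
    return 0 if a == b else KM[frozenset((a, b))]


def is_metric(points, d):
    """Check nonnegativity, symmetry, and the triangle inequality on a finite set."""
    for x in points:
        if d(x, x) != 0:  # standard convention: a point is at distance 0 from itself
            return False
    for x, y in permutations(points, 2):
        if d(x, y) < 0 or d(x, y) != d(y, x):
            return False
    for x, y, z in permutations(points, 3):
        if d(x, z) > d(x, y) + d(y, z):
            return False
    return True


CITIES = ["Haifa", "Jerusalem", "Tel-Aviv"]
print(is_metric(CITIES, city_distance))
```

Nothing here uses coordinates or dot products; the distance function alone is the input, which is exactly the setting the slide describes.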

Slide 4: Preliminary definition

- The Lipschitz constant $L$ of a function $f: X \to \mathbb{R}$ is the smallest value that satisfies, for all points $x_i, x_j$ in $X$, $L \ge |f(x_i) - f(x_j)| / d(x_i, x_j)$.
- Consider a hypothesis consistent with all of $S$. Its Lipschitz constant is determined by the closest pair of differently labeled points: $L \ge 2 / d(x_i, x_j)$ for all $x_i$ in $S^-$, $x_j$ in $S^+$.
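As a small illustration of the definition, the sketch below computes the Lipschitz constant of a function known only on a finite sample and checks that, for labels in {-1, +1}, it equals 2 divided by the smallest distance between oppositely labeled points. The function name `lipschitz_constant` and the points on the real line are assumptions chosen for the example.

```python
from itertools import combinations


def lipschitz_constant(points, values, dist):
    """Smallest L with |f(x_i) - f(x_j)| <= L * d(x_i, x_j) over all sample pairs."""
    return max(
        abs(values[i] - values[j]) / dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    )


def line_distance(a, b):
    """Illustrative metric: absolute difference on the real line."""
    return abs(a - b)


xs = [0.0, 1.0, 3.0, 4.0]
ys = [-1, -1, 1, 1]

L = lipschitz_constant(xs, ys, line_distance)

# Closest pair of differently labeled points: 1.0 (label -1) and 3.0 (label +1).
closest = min(
    line_distance(a, b)
    for a, ya in zip(xs, ys)
    for b, yb in zip(xs, ys)
    if ya != yb
)
print(L, 2 / closest)
```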

Slide 5: Classification for metric data

- A powerful framework for this problem was introduced by von Luxburg & Bousquet (vLB, JMLR '04).
- The natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.
- Given the classifier $h$, the problem of evaluating $h$ at new points in $X$ reduces to the problem of finding a Lipschitz function consistent with $h$. This is the Lipschitz extension problem, a classic problem in analysis.
- For example, $f(x) = \min_i \left[ y_i + 2\, d(x, x_i) / d(S^+, S^-) \right]$ over all $(x_i, y_i)$ in $S$.
- Function evaluation reduces to exact nearest neighbor search (NNS), assuming zero training error.
- This gives strong theoretical motivation for the NNS classification heuristic.
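The example extension can be sketched directly from its formula. The code below is a brute-force illustration, not the authors' implementation; the names `lipschitz_extension` and `via_nearest_neighbors` and the sample on the real line are assumptions. The second function shows why evaluation only needs the distance from the query to its nearest positive and nearest negative sample point.

```python
def line_distance(a, b):
    """Illustrative metric: absolute difference on the real line."""
    return abs(a - b)


def margin(sample, dist):
    """d(S+, S-): smallest distance between oppositely labeled sample points."""
    return min(dist(p, q) for p, yp in sample for q, yq in sample if yp != yq)


def lipschitz_extension(x, sample, dist):
    """f(x) = min_i [ y_i + 2 d(x, x_i) / d(S+, S-) ], evaluated by brute force."""
    m = margin(sample, dist)
    return min(y + 2 * dist(x, p) / m for p, y in sample)


def via_nearest_neighbors(x, sample, dist):
    """Same value, computed from the nearest negative and nearest positive point only."""
    m = margin(sample, dist)
    d_neg = min(dist(x, p) for p, y in sample if y == -1)
    d_pos = min(dist(x, p) for p, y in sample if y == 1)
    return min(-1 + 2 * d_neg / m, 1 + 2 * d_pos / m)


sample = [(0.0, -1), (1.0, -1), (3.0, 1), (4.0, 1)]

for query in (0.0, 1.5, 2.5, 3.0):
    print(query, lipschitz_extension(query, sample, line_distance))
```

On the sample points themselves the function reproduces the labels, which is the consistency requirement stated on the slide.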

Slide 6: Two new directions

The framework of vLB leaves open two further questions:

- Efficient evaluation of the classifier $h$ on $X$. In an arbitrary metric space, exact NNS requires $\Theta(n)$ time. Can we do better?
- Bias-variance tradeoff. Which sample points in $S$ should $h$ ignore?

[Figure: a query point q among sample points labeled -1 and +1.]

Slide 7: Doubling dimension

- Definition: the ball $B(x, r)$ is the set of all points within distance $r$ from $x$.
- The doubling constant $\lambda$ (of a metric $M$) is the minimum value $\lambda$ such that every ball can be covered by $\lambda$ balls of half the radius. First used by [Ass-83], algorithmically by [Cla-97].
- The doubling dimension is $\dim(M) = \log \lambda(M)$ [GKL-03].
- A metric is doubling if its doubling dimension is constant.
- Packing property of doubling spaces: a set with diameter $D$ and minimum inter-point distance $a$ contains at most $(D/a)^{O(\log \lambda)}$ points.

[Figure: illustration of the doubling constant, annotated with the value 7.]
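The packing property follows from applying the doubling property repeatedly. The derivation below is a sketch under stated assumptions: $\lambda$ denotes the doubling constant, logarithms are taken base 2, $D/a \ge 2$, and the particular choice of $k$ is made here for convenience rather than taken from the slides.

```latex
% S has diameter D and minimum inter-point distance a; fix any point x in S.
\begin{align*}
S &\subseteq B(x, D)
  && \text{(diameter } D\text{)} \\
B(x, D) &\text{ is covered by } \lambda^{k} \text{ balls of radius } D/2^{k}
  && \text{(doubling property applied } k \text{ times)} \\
k = \lfloor \log_2 (D/a) \rfloor + 2
  &\;\Longrightarrow\; 2 \cdot \frac{D}{2^{k}} < a
  && \text{(each small ball holds at most one point of } S\text{)} \\
|S| \le \lambda^{k}
  &\le \lambda^{2} \, \lambda^{\log_2 (D/a)}
   = \lambda^{2} \, (D/a)^{\log_2 \lambda}
   = (D/a)^{O(\log \lambda)}
\end{align*}
```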

Slide 8: Application I

- We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension.
- vLB provided similar bounds using covering numbers and Rademacher averages.
- Fat-shattering analysis:
  - If an $L$-Lipschitz function shatters a set, then the inter-point distance in that set is at least $2/L$.
  - By the packing property, the set has at most $(DL)^{O(\log \lambda)}$ points.
  - So the fat-shattering dimension is low.

Slide 9: Application I

- Theorem: For any $f$ that classifies a sample of size $n$ correctly, we have with probability at least $1 - \delta$

  $$P\{(x, y) : \operatorname{sgn}(f(x)) \neq y\} \le \frac{2}{n}\left(d \log(34en/d) \log(578n) + \log(4/\delta)\right).$$

- Likewise, if $f$ is correct on all but $k$ examples, we have with probability at least $1 - \delta$

  $$P\{(x, y) : \operatorname{sgn}(f(x)) \neq y\} \le \frac{k}{n} + \sqrt{\frac{2}{n}\left(d \ln(34en/d) \log_2(578n) + \ln(4/\delta)\right)}.$$

- In both cases, $d \le \lceil 8LD \rceil^{\log \lambda + 1}$.
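To get a feel for the second bound, it can be evaluated numerically. The sketch below is an illustration only: the function names are hypothetical, the doubling dimension is taken as a base-2 logarithm of the doubling constant (an assumption), the upper estimate of $d$ is plugged in directly, and the result is capped at 1 because a probability bound above 1 carries no information.

```python
import math


def fat_dimension_bound(L, D, doubling_constant):
    """Upper estimate ceil(8 L D) ** (log2(lambda) + 1); base-2 log is an assumption."""
    return math.ceil(8 * L * D) ** (math.log2(doubling_constant) + 1)


def agnostic_bound(n, k, L, D, doubling_constant, delta):
    """k/n + sqrt( 2/n * ( d ln(34en/d) log2(578n) + ln(4/delta) ) ), capped at 1."""
    d = fat_dimension_bound(L, D, doubling_constant)
    if d >= n:
        # Sketch-level guard: treat the bound as vacuous when d is not small relative to n.
        return 1.0
    inner = d * math.log(34 * math.e * n / d) * math.log2(578 * n) + math.log(4 / delta)
    return min(1.0, k / n + math.sqrt(2 / n * inner))


# Illustrative parameters: a million examples, unit diameter, doubling constant 2.
smooth = agnostic_bound(n=10**6, k=0, L=1.0, D=1.0, doubling_constant=2, delta=0.1)
rough = agnostic_bound(n=10**6, k=0, L=2.0, D=1.0, doubling_constant=2, delta=0.1)
print(smooth, rough)
```

A larger Lipschitz constant increases $d$ and therefore the bound, while each ignored sample point adds $1/n$; this is the tension the bias-variance slides address.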

Slide 10: Application II

- Evaluation of $h$ for new points in $X$.
- Lipschitz extension function: $f(x) = \min_i \left[ y_i + 2\, d(x, x_i) / d(S^+, S^-) \right]$.
- This requires exact nearest neighbor search, which can be expensive!
- New tool: $(1+\epsilon)$-approximate nearest neighbor search, in $\lambda^{O(1)} \log n + \lambda^{O(-\log \epsilon)}$ time [KL-04, HM-05, BKL-06, CG-06].
- If we evaluate $f(x)$ using an approximate NNS, we can show that the result agrees with (the sign of) at least one of
  - $g(x) = (1+\epsilon) f(x) + \epsilon$
  - $h(x) = (1+\epsilon) f(x) - \epsilon$
- Note that $g(x) \ge f(x) \ge h(x)$.
- $g(x)$ and $h(x)$ have Lipschitz constant $(1+\epsilon)L$, so they and the approximate function generalize well.
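The two properties claimed for the bracketing functions can be checked in a few lines. This is a sketch that assumes $g(x) = (1+\epsilon) f(x) + \epsilon$ and $h(x) = (1+\epsilon) f(x) - \epsilon$, that $f$ is $L$-Lipschitz, and additionally that $-1 \le f(x) \le 1$ (for instance after truncation); the last assumption is not stated on the slide.

```latex
\begin{align*}
|g(x) - g(x')| &= (1+\epsilon)\,|f(x) - f(x')| \le (1+\epsilon)\, L \, d(x, x') \\
|h(x) - h(x')| &= (1+\epsilon)\,|f(x) - f(x')| \le (1+\epsilon)\, L \, d(x, x') \\
g(x) - f(x) &= \epsilon\,\bigl(f(x) + 1\bigr) \ge 0 \\
f(x) - h(x) &= \epsilon\,\bigl(1 - f(x)\bigr) \ge 0
\end{align*}
```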

Slide 11: Bias-variance tradeoff

- Which sample points in $S$ should $h$ ignore?
- If $f$ is correct on all but $k$ examples, we have with probability at least $1 - \delta$

  $$P\{(x, y) : \operatorname{sgn}(f(x)) \neq y\} \le \frac{k}{n} + \sqrt{\frac{2}{n}\left(d \ln(34en/d) \log_2(578n) + \ln(4/\delta)\right)},$$

  where $d \le \lceil 8LD \rceil^{\log \lambda + 1}$.



Slide 14: Bias-variance tradeoff

Algorithm:

- Fix a target Lipschitz constant $L$, out of $O(n^2)$ possibilities.
- Locate all pairs of points from $S^+$ and $S^-$ whose distance is less than $2/L$. At least one point of each such pair has to be taken as an error.
- Goal: remove as few points as possible.
- This is minimum vertex cover:
  - NP-complete in general.
  - Admits a 2-approximation in $O(E)$ time.
- Here it is minimum vertex cover on a bipartite graph:
  - Equivalent to maximum matching (König's theorem).
  - Admits an exact solution in $O(n^{2.376})$ randomized time.
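The reduction can be sketched as follows, taking the conflict threshold to be $2/L$, the separation an $L$-Lipschitz classifier with labels in {-1, +1} requires. This is an illustration under assumptions, not the authors' code: the exact count uses a simple augmenting-path matching (Kuhn's algorithm) rather than the $O(n^{2.376})$ randomized algorithm the slide refers to, and all function names and the sample points are hypothetical.

```python
from itertools import product


def conflict_edges(pos, neg, dist, L):
    """Index pairs (i, j) of oppositely labeled points closer than 2 / L."""
    return [
        (i, j)
        for i, j in product(range(len(pos)), range(len(neg)))
        if dist(pos[i], neg[j]) < 2.0 / L
    ]


def greedy_cover(edges):
    """2-approximate vertex cover: both endpoints of a greedily built maximal matching."""
    used_pos, used_neg = set(), set()
    for i, j in edges:
        if i not in used_pos and j not in used_neg:
            used_pos.add(i)
            used_neg.add(j)
    return used_pos, used_neg


def max_matching_size(edges, n_pos):
    """Maximum bipartite matching size; by Konig's theorem, the minimum cover size."""
    adj = [[] for _ in range(n_pos)]
    for i, j in edges:
        adj[i].append(j)
    match_of_neg = {}  # negative index -> positive index currently matched to it

    def try_augment(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in match_of_neg or try_augment(match_of_neg[j], seen):
                match_of_neg[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(n_pos))


def line_distance(a, b):
    """Illustrative metric: absolute difference on the real line."""
    return abs(a - b)


pos = [0.0, 1.0, 5.0]
neg = [0.5, 1.2, 9.0]
edges = conflict_edges(pos, neg, line_distance, L=2.0)
exact = max_matching_size(edges, len(pos))
cover_pos, cover_neg = greedy_cover(edges)
print(edges, exact, len(cover_pos) + len(cover_neg))
```

In this example the exact minimum is 2 removed points while the greedy cover removes 4, which is the worst case allowed by the factor-2 guarantee.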

Slide 15: Bias-variance tradeoff

Algorithm:

- For each of $O(n^2)$ values of $L$:
  - Run the matching algorithm to find the minimum error.
  - Evaluate the generalization bound for this value of $L$.
- Total: $O(n^{4.376})$ randomized time.

Better algorithm:

- Binary search over the $O(n^2)$ values of $L$. For each value:
  - Run the matching algorithm:
    - Find the minimum error in $O(n^{2.376} \log n)$ randomized time.
    - Evaluate the generalization bound for this value of $L$.
  - Run the greedy 2-approximation:
    - Approximate the minimum error in $O(n^2 \log n)$ time.
    - Evaluate the approximate generalization bound for this value of $L$.
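A minimal sketch of the outer loop, under simplifying assumptions: it sweeps all candidate values of $L$ linearly instead of using binary search, it uses the greedy 2-approximation instead of exact matching, a pair counts as a conflict when its distance is below $2/L$, and it stops at reporting the (approximate) number of removed points per candidate rather than evaluating the generalization bound. The name `tradeoff_curve` and the sample points are hypothetical.

```python
def line_distance(a, b):
    """Illustrative metric: absolute difference on the real line."""
    return abs(a - b)


def tradeoff_curve(pos, neg, dist):
    """For each candidate L = 2 / t, greedily count points removed to resolve conflicts.

    Candidates come from the distinct distances t between oppositely labeled
    points, so there are at most |pos| * |neg| of them.
    """
    cross = [
        (dist(p, q), i, j)
        for i, p in enumerate(pos)
        for j, q in enumerate(neg)
    ]
    curve = []
    # Large thresholds t correspond to small (smooth) Lipschitz constants L.
    for t in sorted({d for d, _, _ in cross}, reverse=True):
        used_pos, used_neg = set(), set()
        for d, i, j in cross:
            # A pair closer than t = 2 / L conflicts with Lipschitz constant L.
            if d < t and i not in used_pos and j not in used_neg:
                used_pos.add(i)
                used_neg.add(j)
        curve.append((2.0 / t, len(used_pos) + len(used_neg)))
    return curve


pos = [0.0, 1.0, 5.0]
neg = [0.5, 1.2, 9.0]
curve = tradeoff_curve(pos, neg, line_distance)
for L, removed in curve:
    print(round(L, 3), removed)
```

Each `(L, removed)` pair would then be plugged into the generalization bound as $L$ and (an approximation of) $k$, and the pair with the smallest bound chosen.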

Slide 16: Conclusion

Results:

- Generalization bounds for Lipschitz classifiers in doubling spaces.
- Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS.
- Efficient calculation of the bias-variance tradeoff.

Continuing research:

- Similar results for continuous labels.
