Part II: Practical Implementations
Modeling the Classes: Stochastic Discrimination
Algorithm for Training an SD Classifier
- Generate a projectable weak model
- Evaluate the model w.r.t. the training set; check enrichment
- Check uniformity w.r.t. the existing collection
- Add to the discriminant
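The loop above can be sketched in code. This is a minimal, self-contained illustration, not the published algorithm: the weak models are random axis-aligned rectangles, and `is_enriched` / `improves_uniformity` are simplified stand-ins for the enrichment and uniformity checks (here, uniformity is checked only within class 1, the class the models are enriched for).

```python
import random

def make_rect_model(rng, lo=0.0, hi=20.0):
    """A projectable weak model: membership in a random axis-aligned rectangle."""
    x0, x1 = sorted(rng.uniform(lo, hi) for _ in range(2))
    y0, y1 = sorted(rng.uniform(lo, hi) for _ in range(2))
    return lambda p: x0 <= p[0] <= x1 and y0 <= p[1] <= y1

def is_enriched(model, points, labels):
    """Enrichment: the model covers a larger fraction of class 1 than of class 2."""
    c1 = [model(p) for p, c in zip(points, labels) if c == 1]
    c2 = [model(p) for p, c in zip(points, labels) if c == 2]
    return sum(c1) / len(c1) > sum(c2) / len(c2)

def improves_uniformity(model, points, labels, coverage):
    """Uniformity (simplified, within class 1): the model must cover at least
    one class-1 point that the current collection covers least."""
    c1 = [(p, n) for p, c, n in zip(points, labels, coverage) if c == 1]
    least = min(n for _, n in c1)
    return any(model(p) for p, n in c1 if n == least)

def train_sd(points, labels, n_models, seed=0):
    rng = random.Random(seed)
    discriminant, coverage = [], [0] * len(points)
    while len(discriminant) < n_models:
        m = make_rect_model(rng)                  # generate a projectable weak model
        if not is_enriched(m, points, labels):    # check enrichment w.r.t. training set
            continue
        if not improves_uniformity(m, points, labels, coverage):
            continue                              # check uniformity w.r.t. the collection
        discriminant.append(m)                    # add to the discriminant
        coverage = [n + m(p) for n, p in zip(coverage, points)]
    return discriminant

# toy usage: two well-separated classes
pts = [(4, 4), (5, 5), (6, 6), (14, 14), (15, 15), (16, 16)]
cls = [1, 1, 1, 2, 2, 2]
ensemble = train_sd(pts, cls, n_models=50, seed=1)
```

The rejection-sampling structure (generate, test, keep or discard) is the essential shape; real implementations differ in how the weak models are produced and rated.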
Dealing with Data Geometry: SD in Practice
A 2D Example, Adapted from [Kleinberg, PAMI, May 2000]
An “r=1/2” random subset in the feature space that covers half of all the points
Watch how many such subsets cover a particular point, say (2,17)
As successive subsets are generated, record whether each covers (2,17) and update the fraction Y of covering models: it is in 0/1 models (Y = 0.0), then 1/2 (0.5), 2/3 (0.67), 3/4 (0.75), 4/5 (0.8), 5/6 (0.83), 5/7 (0.71), 6/8 (0.75), 7/9 (0.78), 8/10 (0.8), 8/11 (0.73), 8/12 (0.67), ...
Fraction of “r=1/2” random subsets covering the point (2,17), as more such subsets are generated
Fractions of “r=1/2” random subsets covering several selected points, as more such subsets are generated
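The behavior in these figures is the law of large numbers at work: each “r=1/2” subset covers any fixed point with probability 1/2, so the covering fraction for that point tends to 0.5. A quick simulation over a 20×20 grid (illustrative; the grid size and the point (2,17) follow the running example):

```python
import random

rng = random.Random(42)
grid = [(x, y) for x in range(20) for y in range(20)]
point = (2, 17)

covered = 0
for n in range(1, 5001):
    # an "r = 1/2" random subset: half of the 400 grid points, chosen uniformly
    subset = set(rng.sample(grid, len(grid) // 2))
    covered += point in subset
    if n in (10, 100, 1000, 5000):
        print(n, round(covered / n, 3))   # the fraction drifts toward 0.5
```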
Distribution of model coverage for all points in space, as the collection grows: snapshots at 100, 200, 300, 400, 500, 1000, 2000, and 5000 models
Introducing enrichment: for any discrimination to happen, the models must have some difference in coverage for the different classes.
Enforcing enrichment (adding in a bias): require each subset to cover more points of one class than of the other. [Figure: the class distribution, and a biased (enriched) weak model]
Distribution of model coverage for points in each class, as the number of enriched weak models grows: snapshots at 100, 200, 300, 400, 500, 1000, 2000, and 5000 models
Error rate decreases as the number of models increases. Decision rule: if Y < 0.5, then class 2; else class 1
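With a collection enriched for class 1, this rule is a simple threshold on the covering fraction Y. A sketch (the `models` here are arbitrary membership predicates, standing in for trained weak models):

```python
def discriminant(point, models):
    """Y: the fraction of the weak models (enriched for class 1) covering the point."""
    return sum(1 for m in models if m(point)) / len(models)

def classify(point, models, threshold=0.5):
    """Decision rule from the slide: Y < 0.5 -> class 2, otherwise class 1."""
    return 2 if discriminant(point, models) < threshold else 1

# toy usage with three hand-made membership predicates
models = [lambda p: p[0] < 10, lambda p: p[0] < 12, lambda p: p[0] < 30]
print(classify((5, 5), models))    # Y = 3/3 -> class 1
print(classify((25, 5), models))   # Y = 1/3 -> class 2
```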
Sparse Training Data: incomplete knowledge about the class distributions. [Figures: training set and test set]
Distribution of model coverage for points in each class, shown for both the training set and the test set, as the number of enriched weak models grows: snapshots at 100, 200, 300, 400, 500, 1000, 2000, and 5000 models. At 5000 models there is no discrimination on the test set!
Models of this type, when enriched for the training set, are not necessarily enriched for the test set. [Figures: training set and test set, with a random model covering 50% of the space]
Introducing projectability: maintain local continuity of class interpretations; neighboring points of the same class should share similar model coverage.
Allow some local continuity in model membership, so that the interpretation of a training point can generalize to its immediate neighborhood. [Figure: the class distribution, and a projectable weak model]
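The contrast can be made concrete (an illustrative sketch): a model defined as an arbitrary set of training points carries no information about a point's neighbors, while a model defined as a contiguous region of the feature space does.

```python
import random

rng = random.Random(0)
train = [(x, y) for x in range(0, 20, 2) for y in range(0, 20, 2)]

# Non-projectable: membership in an arbitrary random half of the training points
members = set(rng.sample(train, len(train) // 2))
def point_model(p):
    return p in members            # meaningless for anything off the training grid

# Projectable: membership in a random axis-aligned rectangle (a contiguous region)
x0, x1 = sorted(rng.uniform(0, 20) for _ in range(2))
y0, y1 = sorted(rng.uniform(0, 20) for _ in range(2))
def rect_model(p):
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

# A test point just next to the training point (4, 6):
print(point_model((4.1, 6.1)))   # always False: the point set cannot generalize
print(rect_model((4.1, 6.1)))    # meaningful: follows the rectangle's region
```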
Distribution of model coverage for points in each class (training set and test set), as the number of enriched, projectable weak models grows: snapshots at 100, 300, 400, 500, 1000, 2000, and 5000 models
Promoting uniformity: all points in the same class should be equally likely to be covered by a model of each particular rating. Retain models that cover the points least covered by the current collection.
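A minimal sketch of this retention rule (illustrative; rectangles again stand in for weak models, and the rectangle bounds are drawn slightly outside the grid so that every point, including the corners, can be covered):

```python
import random

rng = random.Random(1)
points = [(x, y) for x in range(10) for y in range(10)]
coverage = {p: 0 for p in points}
kept = []

def rand_rect():
    x0, x1 = sorted(rng.uniform(-1, 11) for _ in range(2))
    y0, y1 = sorted(rng.uniform(-1, 11) for _ in range(2))
    return lambda p: x0 <= p[0] <= x1 and y0 <= p[1] <= y1

while len(kept) < 200:
    m = rand_rect()
    least = min(coverage.values())
    # Retain the model only if it covers a point the collection covers least
    if any(m(p) for p, n in coverage.items() if n == least):
        kept.append(m)
        for p in points:
            coverage[p] += m(p)

# With the filter in place, no point is left uncovered as the collection grows
print(min(coverage.values()), max(coverage.values()))
```

While any point is still uncovered, only models reaching an uncovered point are retained, so the minimum coverage is pulled up as the collection grows; without the filter, rarely covered points can lag behind indefinitely.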
Distribution of model coverage for points in each class (training set and test set), with 100, 1000, and 5000 enriched, projectable, uniform weak models, and two further snapshots as the collection grows
The 3 Necessary Conditions
- Enrichment: discriminating power
- Uniformity: complementary information
- Projectability: generalization power
Extensions and Comparisons
Alternative Discriminants [Berlind 1994]
- Different discriminants for N-class problems
- An additional condition on symmetry
- Approximate uniformity
- A hierarchy of indiscernibility
Estimates of Classification Accuracies [Chen 1997]: statistical estimates of classification accuracy under weaker conditions:
- Approximate uniformity
- Approximate indiscernibility
Multi-class Problems: for n classes, define n discriminants Y_i, one for each class i versus the others; classify an unknown point to the class i for which the computed Y_i is the largest.
[Ho & Kleinberg, ICPR 1996]
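A sketch of this rule (illustrative; each per-class discriminant Y_i is taken to be the covering fraction over an ensemble enriched for class i):

```python
def classify_multiclass(point, ensembles):
    """ensembles maps each class to the weak models enriched for that class.
    Compute Y_i for every class i and return the argmax."""
    scores = {
        cls: sum(1 for m in models if m(point)) / len(models)
        for cls, models in ensembles.items()
    }
    return max(scores, key=scores.get)

# toy usage: two classes separated along the first feature
ensembles = {
    "left":  [lambda p: p[0] < 10, lambda p: p[0] < 8],
    "right": [lambda p: p[0] >= 10, lambda p: p[0] > 12],
}
print(classify_multiclass((3, 0), ensembles))    # -> left
print(classify_multiclass((15, 0), ensembles))   # -> right
```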
Open Problems
- Algorithm for uniformity enforcement: deterministic methods?
- Desirable form of weak models: fewer, more sophisticated classifiers?
- Other ways to address the 3-way trade-off: enrichment / uniformity / projectability
Random Decision Forest [Ho 1995, 1998]
- A structured way to create models: fully split a tree, use its leaves as the models
- Perfect enrichment and uniformity for the training set
- Promotes projectability by subspace projection
Compact Distribution Maps [Ho & Baird 1993, 1997]: another structured way to create models. Start with projectable models obtained by coarse quantization of the feature value range; then seek enrichment and uniformity. [Figure: signatures of 2 types of events, and measurements from a new observation; axes: signal index vs. signal level]
SD & Other Ensemble Methods
- Ensemble learning via boosting: a sequential way to promote uniformity of coverage by the ensemble's elements
- XCS (a genetic algorithm): a way to create, filter, and use stochastic models that are regions in feature space
XCS Classifier System [Wilson 1995]: a recent focus of the GA community, with good performance; combines reinforcement learning with genetic algorithms. The model is a set of rules, e.g. "if (shape=square and number>10) then class=red", "if (shape=circle and number<5) then class=yellow". [Diagram: the environment supplies an input and a reward; reinforcement learning updates the rule set, which outputs a class; the genetic algorithm searches for new rules]
Multiple Classifier Systems: Examples in Word Image Recognition
Complementary Strengths of Classifiers: the case for classifier combination, also known as decision fusion, mixture of experts, or committee decision making. [Figure: rank of the true class, out of a lexicon of 1091 words, assigned by 10 classifiers to each of 20 images]
Classifier Combination Methods
- Decision optimization: find a consensus among a given set of classifiers
- Coverage optimization: create a set of classifiers that works best with a given decision combination function
Decision Optimization: develop classifiers with expert knowledge, then try to make the best use of their decisions via majority/plurality vote, sum/product rules, probabilistic methods, Bayesian methods, rank/confidence-score combination, etc. The joint capability of the classifiers sets an intrinsic limit on the combined accuracy; there is no way to handle the blind spots.
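Plurality voting, the simplest rule on this list, can be sketched as (illustrative):

```python
from collections import Counter

def plurality_vote(decisions):
    """Combine classifier decisions by plurality; ties go to the earliest-seen label."""
    return Counter(decisions).most_common(1)[0][0]

# three classifiers vote on one input
print(plurality_vote(["cat", "dog", "cat"]))   # -> cat
```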
Difficulties in Decision Optimization
- Reliability versus overall accuracy
- Fixed or trainable combination function
- Simple models or combinatorial estimates
- How to model complementary behavior
Coverage Optimization: fix a decision combination function, then generate classifiers automatically and systematically via training-set sub-sampling (stacking, bagging, boosting), subspace projection (RSM), superclass/subclass decomposition (ECOC), random perturbation of training processes, noise injection, etc. Enough classifiers are needed to cover all blind spots (how many are enough?). What else is critical?
Difficulties in Coverage Optimization
- What kind of differences to introduce: subsamples? subspaces? super/subclasses? training parameters? model geometry?
- The 3-way trade-off: discrimination + diversity + generalization
- Effects of the form of the component classifiers
Dilemmas and Paradoxes in Classifier Combination
- Weaken individuals for a stronger whole?
- Sacrifice known samples for unseen cases?
- Seek agreements or differences?
Stochastic Discrimination: a mathematical theory that relates several key concepts in pattern recognition:
- Discriminative power ... enrichment
- Complementary information ... uniformity
- Generalization power ... projectability
It offers a way to describe the complementary behavior of classifiers, and guidelines for designing multiple classifier systems (classifier ensembles).