
Robustness through Prior Knowledge: Using Explanation-Based Learning to Distinguish Handwritten Chinese Characters. Gerald DeJong, Computer Science, University of Illinois at Urbana-Champaign (mrebl@uiuc.edu), with Qiang Sun, Shiau Hong Lim, and Li-Lun Wang.

Challenges of Noisy Unstructured Text Data. Noise – working with real input: bottom-up limitations; some true noise, some self-induced variability; more reliant on prior structure. Lack of structure – problem complexity: top-down limitations; highly structured means little variability; more reliant on the input (noisy or otherwise).

Noise. True noise: missing information, extra information; random / Normal(?). Induced noise: imperfect representation; pixelization, staircasing; extra or missing blobs or pixels. Variability: unmodeled / approximated world dynamics; ignored parameters / covariates; not random, though it is convenient to pretend it is true noise…

Structured vs. Unstructured. Relatively unstructured: "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation…" Very structured: Name: Ishmael. Finances: Low. Problem: Bored, Spleen. Date: Recent? With more structure comes less induced noise.

Unstructured: Deal with the Noise. With structure → a programming problem; without structure → a learning problem: learn signal from noise via training examples. Each training example contains little information. Is there enough information? That is task dependent; the difficulty lies in the subtlety of the required processing. Two statistical NLP question types: "How large is Brazil?" versus "Will the Fed raise interest rates?" The second requires integrating lots of partial evidence.

Machine Learning as an Empirically Guided Search through a Hypothesis Space. [Figure: an example space X containing a training set Z of positive and negative examples, and the hypothesis space H that the learner searches.]

What Makes a Learning Problem Hard? The expressiveness of the hypothesis space H. With a large / diverse / complex H, more bad hypotheses can masquerade as good ones, so more training examples are required for the desired confidence. We want high confidence that a learner will produce a good approximation of the true concept. Cost: more information → more training examples.

Explanation-Based Learning: Information Beyond Training Examples. Utilize existing domain knowledge. Treat training examples as illustrations of a deeper pattern. Explain how the assigned class label may arise from an example's properties; the explanations suggest the deeper patterns. Calibrate and confirm them using other training examples.

Two Kinds of Prior Knowledge. Solution Knowledge is directly relevant to a specific classification task and can be readily used to bias a learning system, but it requires the expert to already know the solution and to possess expertise about the machine learner and its bias space. Domain Knowledge is more abstract and not tied to any particular classification task ("The same pen will leave similar-width strokes."). It is only indirectly helpful for telling a "3" from a "6", easy for human experts to articulate, and difficult to express in a statistical learner's bias vocabulary. The goal of this research is to incorporate domain knowledge into a statistical learner.

Solution vs. Domain Knowledge. [Figure: handwritten "3" and "8".] The right half of the image carries little information; the left half carries much more. Solution knowledge: "pay attention to the left half." Domain knowledge: prior idealized stroke representations; conjecture where the differential information lies, then calibrate and verify with the training data. EBL: derive solution knowledge by using domain knowledge interacting with training examples.

The Explanation-Based Learning Approach: transform Domain Knowledge into Solution Knowledge. Conjecture explanations for some training labels using Domain Knowledge. Evaluate explanation quality using the rest of the training set. Assemble statistically confirmed explanations into Solution Knowledge. Adjust the statistical learner's bias to reflect the new Solution Knowledge. This mirrors how humans use stroke knowledge: an interaction between examples and knowledge. In our approach, domain knowledge is used in an explanation-based-learning fashion to build explanations; those explanations are then used to bias the inductive learner. How to build explanations using domain knowledge, and how to use explanations as a bias, are illustrated in the next few slides.

SVM Background (Support Vector Machines). Generic: few parameters to manipulate. Both linear and nonlinear: linear in a high-dimensional dot-product space, nonlinear in the input feature space. Expressiveness: nonlinear. Cost: linear (plus a convex optimization). Two cute nuggets: the large margin (prefer low capacity / reduce overfitting) and the kernel function (the kernel "trick": compact, efficient, expressive).
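As a concrete illustration (a minimal sketch with scikit-learn, not the authors' setup), a generic SVM with a cubic polynomial kernel can be trained on small digit images in a few lines:

```python
# A minimal sketch (not the authors' code): a generic SVM with a cubic
# polynomial kernel on small digit images, using scikit-learn.
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()   # 8x8 gray-scale digits, flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# kernel="poly", degree=3 is the cubic kernel discussed later in the talk.
clf = svm.SVC(kernel="poly", degree=3, coef0=1.0, C=1.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)
```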

Handwritten Digits: an ML success story(?). Pixel input, e.g. 32 × 32 × 8 bits, so x has 1024 dimensions with 256 values each. Multi-class classifiers: ten one-vs-all index classifiers, four Boolean encoders, all pairs with voting, … Generic ANNs work poorly; generic SVMs work better; specially designed ANNs work well (below 0.5% overall error: LeCun et al., 1998; Simard et al., 2003, convolutional networks with elastic distortions). These introduce feature levels, often with desirable invariance properties; manipulating the bias amounts to programming. We are interested in generic solutions.

Class Information. Let x be the vector of image pixels: x = (x1, x2, x3, …, x1024). The class information is distributed: there is no single crucial input pixel; a class c is determined by relations among many pixels. x is sufficient: given the input x, the label is not ambiguous (at least to people), so the class entropy given the input is nearly zero, Entropy(c | x) ≈ 0. The separator is a function of the input pixels, and it must be nonlinear: interactions / relations among pixels determine the class assignment.

What's the Best Separating Hyperplane? [Figure sequence: several candidate hyperplanes separating the positive and negative training points.]

What's the Best Separating Hyperplane? The maximum-margin one. [Figure: the margin m, the support vectors that lie on its boundary, and the radius r of the smallest enclosing sphere.] Novikoff (1963): using the radius r of the smallest enclosing sphere, capacity is related to (r/m)².
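For reference, a standard statement of the quantities above (my wording of well-known results, not text from the slide):

```latex
\[
  m \;=\; \min_i \frac{y_i\,\bigl(w \cdot x_i + b\bigr)}{\lVert w \rVert},
  \qquad
  \text{perceptron mistakes (Novikoff, 1963)} \;\le\; \left(\frac{r}{m}\right)^{2},
\]
% so large-margin separators on data of radius r have effective capacity ~ (r/m)^2.
```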

Kernel Methods. Map to a new, higher-dimensional space; the dimensionality can be very high, even infinite. Kernel functions introduce the high dimensionality, yet the computation is independent of it: the kernel is defined with the dot product of input image vectors (information about the cosine between image vectors). A kernel function defines a distance metric over the space of example images. For points that are not linearly separable: soft margins, margin distributions, …

SVMs for Digit Images. K(x, y) = (x · y)³ or (x · y + 1)³: take the dot product, a scalar, and cube it. Consider how this works: before, 32² features (about 10³); now, roughly (32²)³ features (about 10⁹). Each new feature is a monomial, a correlation among three pixels. The VC dimension of linear separators grows with the number of dimensions, so is there an overfitting problem? Not if the margin is large; monitor the number of support vectors.
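A quick numerical check of the claim above (my own illustration): the cubic kernel value equals an ordinary dot product over all degree-3 pixel monomials, without ever constructing them.

```python
# Illustration: (x . y)**3 equals the dot product of explicit degree-3 monomial
# feature vectors, so the SVM implicitly works in a ~d**3-dimensional space.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # tiny dimension so the explicit map fits in memory
x, y = rng.normal(size=d), rng.normal(size=d)

def phi(v):
    # explicit feature map: all products v_i * v_j * v_k (d**3 monomials)
    return np.einsum("i,j,k->ijk", v, v, v).ravel()

kernel_value = np.dot(x, y) ** 3        # computed in the d-dimensional input space
explicit_dot = np.dot(phi(x), phi(y))   # computed in the d**3-dimensional feature space
print(np.isclose(kernel_value, explicit_dot))   # True
```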

Mercer's Condition / Representer Theorem. Mercer's condition: the kernel matrix is positive semidefinite. Representer theorem: the desired hyperplane can be represented as a linear weighted sum of distances to the support vectors, where the kernel defines the distance metric. The hypothesis space is thus represented efficiently by using some of the training examples, the support vectors.
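In the standard notation (a well-known form, stated here for completeness), the resulting classifier is

```latex
\[
  f(x) \;=\; \operatorname{sign}\!\Bigl(\;\sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, K(x_i, x) \;+\; b\Bigr),
  \qquad \alpha_i \ge 0,
\]
% a weighted sum of kernel evaluations against the support vectors only.
```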

Distinguishing Handwritten Sevens vs. Twos and Eights. Handwritten 32 × 32 gray-scale pixel images. [Figure: sample twos, eights, and sevens.] The input feature space is inappropriate: the classes are not linearly separable, or are separable only with poor expected behavior (small margins), because the classification information is not expressible as a linear combination of the input features. So map the inputs to a high-dimensional space with many more features (nonlinear combinations of the inputs), where they become linearly separable.

Mercer Kernels. Usually one starts with a kernel rather than with features: (s · x)^d, the homogeneous polynomials; (s · x + 1)^d, the complete polynomials; exp(−‖s − x‖² / 2σ²), the Gaussian / RBF kernel. Kernels are closed under useful operations (for example, sums, positive scalar multiples, adding a positive constant, and products of kernels are again kernels). The Gaussian kernel subtracts the support vector from the unknown input and dots the resulting difference vector with itself to get its squared length.
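The three kernels listed above, written out as code (a small sketch; sigma and d are illustrative choices, not values from the talk):

```python
# The Mercer kernels from the slide: s is a stored example, x an input vector.
import numpy as np

def poly_homogeneous(s, x, d=3):
    return np.dot(s, x) ** d                          # (s . x)^d

def poly_complete(s, x, d=3):
    return (np.dot(s, x) + 1.0) ** d                  # (s . x + 1)^d

def gaussian_rbf(s, x, sigma=1.0):
    diff = s - x                                      # subtract, then dot with itself
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```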

Problems with SVMs, and with statistical learning generally. Little information comes from each training example, and the signal must show through the noise, so many training examples are needed: thousands for handwritten digits. Much information is ignored (a weak bias vocabulary). Compare with humans: given a novel simple shape of similar complexity, they master it with several tens (perhaps a hundred) training examples, with an exceedingly small non-fatigue error rate. Chinese characters are much more difficult than digits.

Two Related Classification Problems. The first problem is to distinguish between handwritten threes and sixes. [Table: SVMs reach 1.2% error from 60,000 examples; humans reach negligible error from fewer than 100 examples.] People are better, but with enough training SVMs achieve very respectable results, and they require very little hand-tuning aside from selecting the kernel function. The SVM error results are based on the MNIST data set.

Two Related Classification Problems: a fixed permutation over pixels. The second problem: a single random permutation is constructed and applied to every example in the data set. How well can people learn to perform this classification task? They can't; people find it impossible. What about Support Vector Machines? No difference at all: in computing a dot product the order of summation is irrelevant, so the task is no different from the SVM's point of view.
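A two-line check of that claim (my own illustration, not from the talk): applying one fixed permutation to both images leaves every dot product, and hence every dot-product kernel value, unchanged.

```python
# A fixed pixel permutation does not change dot products, so dot-product kernels
# (and the SVMs built on them) see the permuted problem as identical.
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=1024), rng.normal(size=1024)
perm = rng.permutation(1024)            # one fixed permutation applied to every image

print(np.isclose(np.dot(x, y), np.dot(x[perm], y[perm])))            # True
print(np.isclose(np.dot(x, y) ** 3, np.dot(x[perm], y[perm]) ** 3))  # cubic kernel: True
```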

Two Related Classification Problems: a fixed permutation over pixels. [Table: on the original pixels, SVMs reach 1.2% error from 60,000 examples and humans negligible error from fewer than 100; on the permuted pixels, SVMs still reach 1.2% error from 60,000 examples, while humans are near 50% error and the example count is not applicable.] To an SVM these are the same problem. Apparently the SVM ignores information crucial to people.

Strokes Make the Difference: explanatory hidden features. Humans know that strokes mediate between pixels and class labels. Statistical machine learners find the pattern using pixel-level inputs alone, without knowing about strokes. What can this example tell us? Statistical learning algorithms are advanced enough to extract complex patterns from data, but simple prior knowledge (e.g., the existence of strokes) may help to find the relevant patterns faster and more accurately. Inventing latent features is hard for statistics.

Domain Knowledge. What can we say about strokes? Within an image they are written by the same person using the same writing instrument; they are made by a succession of simple pen movements; they give rise to the pixels. That is much information (suppose it did not hold). It is not easily captured in the native bias vocabulary (it is not solution knowledge), and knowledge about strokes is imperfect, so building a bottom-up stroke extractor is error-prone. Other things we can say about strokes: within an image the user exhibits the same left/right handedness and the same care and neatness, and feels the same time pressure; simple pen movements suggest smooth pen trajectories and low information content at the level of strokes; stroke order may leave image artifacts. Often the raw input features do not possess a similarly low-information relationship to the classification label. A kernel function, a vocabulary for decision-tree splits, or the initial weights of a neural net cannot easily encode this knowledge of strokes. Re-formulating the test inputs (bottom up) into a stroke feature space is problematic; line finding is notoriously difficult, with important lines being missed and many specious lines added.

Primary Domain: Distinguishing Handwritten Chinese Characters. More complex than digits or Western characters (64 × 63 pixels). Thousands of different characters → few training examples available for each (200 labeled images for us). Domain knowledge includes an ideal prototype stroke representation for each character.

Handwritten Chinese Characters. We selected ten characters in three classes, yielding forty-five pairwise classification problems; classification difficulty varies significantly by problem. We used handwritten Chinese image examples from the ETL9B database created by the Electrotechnical Laboratory of Japan; it contains 200 samples for each character, and all images are binary, of size 64 by 63. Summary: generate new stroke-interaction features by comparing the stroke representations of the relevant prototype characters (e.g., two characters where one has continuous legs and the other has gaps before the base). An explanation is an assembly of the pixel-level correlates of these new features; the pixel correlates are the image regions where the stroke-interaction features are statistically to be found according to the training set. As we will see next, we can reliably know both pixel and stroke representations for training examples, essentially because the label provides guidance about what strokes to look for.

Hough Transform. An old (but good) idea: map image points <x, y> to line parameters <m, b>, given y = mx + b. The Hough transform makes a poor general-purpose line detector, BUT explaining is easy and reliable, because the class label determines the ideal prototype stroke representation: we know the lines, their approximate parameters, and their geometric constraints, so we find (or hallucinate) the Hough peaks that optimize the fit.
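A bare-bones sketch of the idea (illustrative only; the bin ranges are my assumptions, and practical detectors usually prefer the <rho, theta> parameterization to handle vertical lines):

```python
# Line Hough transform in the <m, b> parameterization from the slide: each edge
# pixel (x, y) votes for every line y = m*x + b that passes through it.
import numpy as np

def hough_mb(points, m_bins=np.linspace(-5, 5, 101), b_bins=np.linspace(-100, 100, 201)):
    acc = np.zeros((len(m_bins), len(b_bins)))
    for x, y in points:
        for i, m in enumerate(m_bins):
            b = y - m * x                       # parameters consistent with this pixel
            j = np.searchsorted(b_bins, b)
            if 0 <= j < len(b_bins):
                acc[i, j] += 1
    return acc                                   # peaks ~ lines supported by many pixels

# With prototype stroke knowledge, one would search for peaks only near the
# expected (m, b) values rather than taking the globally strongest peaks.
```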

Feature Kernel Functions. Design special-purpose kernel functions: adapt the "distance" metric to fit the task, emphasizing the pixels expected to carry high information content.
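One hypothetical way such a kernel could look (a sketch under my own assumptions, not necessarily the authors' construction): weight each pixel by the importance the explanation assigns to it, then apply the usual cubic kernel to the re-weighted images.

```python
# Hypothetical feature kernel: a cubic kernel over explanation-weighted pixels.
import numpy as np

def feature_kernel(x, y, pixel_weights, degree=3):
    """pixel_weights: nonnegative importance per pixel, derived from explanations."""
    w = np.sqrt(pixel_weights)        # split the weight evenly between both arguments
    return np.dot(w * x, w * y) ** degree
```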

Explaining Chinese Characters A pixel is judged to be informative if it is likely to be part of an informative stroke feature. Stroke features are informative if they are distinctive between the ideal prototype characters. Interaction between training examples and the prior domain knowledge is crucial.

Constructing Explanations (example characters: 五 and 互). From domain knowledge, the top and bottom horizontal strokes are unlikely to be informative. Explanation: apply a linear Hough transformation to identify lines in the image, and associate pixels in the image with strokes. Prototype stroke representations greatly aid in identifying the pixel-stroke correspondence in training examples (but not test examples). High-information pixels correspond to distinctive stroke-level features. [Qiang's notes: knowing how many strokes are in the image, whether they are horizontal or vertical, and their relative positions makes the linear Hough transformation much more robust.]

What is an Explanation for the Feature Kernel Function Approach? An account of where the class information is expected to be found within the input image pixels: uniform emphasis over the disk containing 90% of the probability mass of the fitted Gaussian.
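For concreteness, here is how such a 90%-mass disk could be computed for an isotropic 2-D Gaussian fitted to a set of informative pixel coordinates (my own illustration of the geometry, not the authors' code):

```python
# Radius of the disk holding `mass` of an isotropic 2-D Gaussian fitted to pixels:
# ||p - mu||^2 / sigma^2 follows a chi-square distribution with 2 degrees of freedom.
import numpy as np
from scipy.stats import chi2

def mass_disk(pixels, mass=0.90):
    pixels = np.asarray(pixels, dtype=float)        # (row, col) coordinates
    center = pixels.mean(axis=0)
    sigma = np.sqrt(((pixels - center) ** 2).sum(axis=1).mean() / 2.0)  # isotropic std
    radius = sigma * np.sqrt(chi2.ppf(mass, df=2))  # about 2.15 * sigma for 90%
    return center, radius
```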

Experiments: feature kernel function vs. a conventional cubic-polynomial SVM. FKF reaches similar performance with nearly an order of magnitude less training data. Performance by problem (scatter plot for the 45 problems): all problems improve, and FKF never hurts. The lower slope suggests the hardest problems are helped most.

Experiments: feature kernel function vs. a conventional cubic-polynomial SVM. Learning curves by problem difficulty (as judged by SVM accuracy): A) hardest third, B) middle third, C) easiest third.

Experiments: feature kernel function vs. a conventional cubic-polynomial SVM. For each problem at full training, FKF always uses fewer support vectors. The interaction between prior knowledge and training examples is crucial.

Explanation-Augmented Support Vector Machine (EA-SVM): another approach. The previous approach adapted the kernel function; EA-SVM instead alters the SVM algorithm and uses a standard kernel function. Explanations are integrated directly as a bias.

EA-SVM: What is an Explanation? An explanation is a generalization of a training example, a proposed equivalence class of examples. The same explanation implies the same label for the same reason, and such examples should be treated the same by the classifier; for an SVM, examples with the same explanation should have the same margin. A perfect explanation is a hyperplane to which the classifier should be parallel. Explanations are not perfect, so prefer a decision surface that is more nearly parallel to confirmed explanations: penalize non-parallelness. Explanations reflect information from both data and knowledge.

Formalizing the Constraints Mathematically. Let an explanation justify the label for a given example x using only a subset e of its features. The explained example v is defined by keeping the features in e and replacing every other feature with the special symbol '*', which indicates that the feature does not participate in the inner product evaluation; with numerical features one can simply use the value zero. The constraint can then be expressed as w · x = w · v, or equally w · (x − v) = 0. Geometrically, this requires the classifier hyperplane to be parallel to the direction x − v.
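Restating the same constraint in display form (reconstructed from the surrounding definitions):

```latex
\[
  v_j \;=\;
  \begin{cases}
    x_j & j \in e,\\[2pt]
    {*}\ \ (\text{treated as } 0) & j \notin e,
  \end{cases}
  \qquad
  w \cdot x \;=\; w \cdot v
  \;\;\Longleftrightarrow\;\;
  w \cdot (x - v) \;=\; 0 .
\]
```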

EA-SVMs: Explanation-Augmented Support Vector Machines Incorporate high quality explanations into a conventional SVM Classifier reflects information from both examples and domain knowledge. Optimal classifier blends: Maximal conventional margin to training examples Maximally parallel to high quality explanations We use soft constraints for each. Similar analyses using two sets of slack variables. Linear blending via cross validation.

The EA-SVM Optimization Problem. With perfect knowledge, we require the margin of each example, w · x_i, to equal the margin of its explanation, w · v_i (recall that v_i is x_i with the irrelevant pixels replaced by zeros). With imperfect knowledge, the differences between these margins become slack: introduce new positive slack variables δ_i and minimize their sum, blended with the original large-margin objective through K. The confidence parameter K is determined by cross-validation; it blends the empirical and explanation information.
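A hedged sketch of the resulting program, written directly from the description above (hard-margin form, without the conventional soft-margin slack variables the talk omits for simplicity; the exact formulation in the paper may differ):

```latex
\[
\begin{aligned}
  \min_{w,\,b,\,\delta}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \;+\; K \sum_{i} \delta_i \\
  \text{s.t.}\quad & y_i\,\bigl(w \cdot x_i + b\bigr) \;\ge\; 1,\\
                   & \bigl|\, w \cdot (x_i - v_i) \,\bigr| \;\le\; \delta_i,
                   \qquad \delta_i \;\ge\; 0 .
\end{aligned}
\]
```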

Solutions for EA-SVM. The slide shows the form of the solution with perfect knowledge and with imperfect knowledge; in the imperfect-knowledge case the new Lagrange variables λ_i are bounded by K. When the confidence parameter K goes to infinity, the imperfect-knowledge solution reduces to the perfect-knowledge one; when K and the λ_i are 0, the problem ignores the explanations and reduces to a standard SVM. For simplicity we do not show the conventional soft-margin classifier; there is no difficulty, just messiness with two sets of slack variables. The treatment and justification of our soft parallelism constraints follows the standard soft-margin treatment: mapping to a higher dimension where the constraints are satisfied, and so on.

Formal Analysis: Why EA-SVM Works. The EA-SVM algorithm minimizes an error bound h whose expression contains two interesting symbols. R_v is the radius of the ball that contains all the explained examples; since the explanation process typically removes (often many) dimensions, we expect R_v < R, the radius for the original examples. D is the penalty a separator <u, b> incurs for violating the parallelism constraints imposed by the explanations; D is determined by cross-validation to minimize h.

A Simple Prediction: a closer look at h. With perfect knowledge, D = 0 and all of the information is in the explanations; the examples are irrelevant, so R_v appears in the bound but R does not. Without knowledge, the bound reduces to the standard dependence on R. EA-SVM has the most to offer when the ratio R_v / R is small, which means explanations use few important features to justify the label: intuitively, the learning problem is difficult but the domain knowledge is informative.

Experiment 1: Does Explanation-Augmentation Help? Results for 45 classifiers on pairs of Chinese characters. Below the line means EA-SVM makes fewer errors than SVM.

Experiment 2: Difficult Problems Benefit More. The left chart groups tasks into two categories, easy and hard; EA-SVM never loses, but explanations help most in the hard category. The right chart shows the pattern holds generally (it is not due to getting lucky on a few outlier problems that happen to be labeled as difficult). We tried several alternative definitions of "problem difficulty"; all yield very similar results, and a non-parametric Kendall's tau test shows highly significant agreement between the SVM and EA-SVM on which classification tasks are "more difficult" according to error rates. Details are in the paper. EA-SVM vs. SVM: on easy tasks they are similar; on difficult tasks EA-SVM wins at all training levels. Task difficulty is highly correlated with the improvement of EA-SVM over the conventional SVM.

Exp 3: Robustness and the Effect of Knowledge Quality. A) explanations as described; B) randomly select which pixels are labeled as informative; C) the reverse of A. Clearly the good explanations help. In B, the slight improvement on medium-difficulty problems is likely due to stumbling upon a few explanations with more informative than uninformative pixels and the cross-validation procedure recognizing this situation. In C there is much less room for lucky explanations. The cross-validation blending makes it unlikely that poor explanations will adversely affect the learned classifier. EA-SVM benefits from good knowledge, and is not hurt by incorrect knowledge.

Exp 4: Additional (Non-image) Domains. Protein explanations: only known motif sequences are important for a protein's categorization. Text explanations: only words related to the category label are important. ROC (protein) and F1 (text) scores show the EA-SVM improvement. A) Classify proteins into super-families based on their amino-acid sequences. The domain knowledge is a database of motifs, conserved sequences that have been experimentally determined to be important for a protein's functionality. We use the Structural Classification of Proteins (SCOP), a database of known 3D protein structures, as the data set; it contains 54 super-families and 7329 example protein sequences. We adopt the same test and training set splits and the same mismatch kernel function as Leslie and Kuang (2003); see the text. B) Classify text articles into categories. We use the Reuters-21578 data set with the Modified Apte ("ModApte") split, which yields a corpus of 9603 training documents and 3299 test documents. Domain knowledge comes from WordNet: words that are one step away in WordNet from synonyms of the topic words are taken as informative about the topic.

Previous Work on Incorporating Knowledge into SVMs (Solution Knowledge). Incorporating transformation invariance into SVMs: virtual support vectors (Schölkopf, 1996), invariant kernel functions (Schölkopf, 2002), jittered SVMs (DeCoste & Schölkopf, 2002), tangent propagation (Simard, 1992, 1998). Locally improved kernel functions exploit the spatial-locality property (Schölkopf, 1998); convolutional networks (LeCun et al., 1998; Simard et al., 2003). Knowledge-based SVMs and kernels incorporate prior rules (Fung, Mangasarian & Shavlik, 2002, 2003; Mangasarian, Shavlik & Wild, 2004). Extracting high-level character features from the pixel representation (Teow, 2000; Shi, 2003; Kadir, 2004; …): not systematic, specific to the given task, no theoretical results.

Conclusion. Inductive learning algorithms can benefit from domain knowledge; this work illustrates a novel direction of using knowledge by combining EBL ideas into a statistical learner. With Domain Knowledge, the expert need not also be an expert in the learning algorithm. The EBL components are extremely simple; more can be done. The role of Domain Knowledge rather than Solution Knowledge demands further study; this is an important and little-explored direction. Next step: the IJCAI-07 poster "Explanation-Based Feature Construction" (Shiau Hong Lim).