Universal Learning Machines (ULM) Włodzisław Duch and Tomasz Maszczyk Department of Informatics, Nicolaus Copernicus University, Toruń, Poland ICONIP 2009, Bangkok
Plan: Meta-learning; Learning from others; ULM algorithm; Types of features; Illustrative results; Conclusions. Learn from others, not only from your own mistakes! Then you will always have a free lunch!
Meta-learning Meta-learning means different things to different people. Some use “meta” for learning many models (e.g. in Weka) and ranking them; for arcing, boosting, bagging, or other ways of creating an ensemble; or for optimizing the parameters used to integrate models. Here meta-learning means learning how to learn. Goal: replace the experts who search for the best models by running many experiments. There is no free lunch, but why cook it yourself? The search space of models is too large to explore exhaustively, so the system architecture should be designed to support knowledge-based search. One direction towards universal learning: use any method you like, but take the best from your competition! Best what? The best fragments of models and combinations of features.
CI and “no free lunch” The “no free lunch” theorem: no single system can reach the best results for all possible distributions of data. Decision trees and rule-based systems: best for data with logical structure; they require sharp decision borders and fail on problems where linear discrimination provides an accurate solution. SVM in kernelized form: works well when complex topology is required, but may miss simple solutions that rule-based systems find, and fails when sharp decision borders are needed or on complex Boolean problems. The key to general intelligence: specific information filters that make learning possible, and chunking mechanisms that combine partial results into higher-level mental representations. More attention should be paid to the generation of useful features.
ULM idea ULM is composed of two main modules: feature constructors and simple classifiers. In machine learning, features are typically used to calculate linear combinations of feature values or distances (dissimilarities), possibly scaled (scaling includes selection). Is this sufficient? No: non-linear functions of features carry information that cannot be easily recovered by CI methods. Kernel approaches find linear solutions in the kernel space, implicitly adding new features based on similarity K(X, SV) to support vectors (SV). => Create a potentially useful, redundant set of features. How? Learn what other models do well!
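The two-module idea can be illustrated with a minimal sketch: a feature constructor that appends explicit kernel features, followed by a deliberately simple classifier. The Gaussian kernel, the nearest-class-mean classifier, the reference vectors, and all function names are assumptions chosen for illustration; the talk does not specify them.

```python
import math

def gaussian_kernel(x, ref, gamma=1.0):
    """K(X, R) = exp(-gamma * ||X - R||^2); the kernel choice is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, ref)))

def add_kernel_features(X, refs, gamma=1.0):
    """Feature constructor: append one explicit feature K(X, R_i) per reference vector."""
    return [list(x) + [gaussian_kernel(x, r, gamma) for r in refs] for x in X]

def fit_class_means(Z, y):
    """Simple classifier (stand-in): store the mean vector of each class."""
    means = {}
    for label in set(y):
        rows = [z for z, t in zip(Z, y) if t == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def predict(means, z):
    """Assign z to the class whose mean is nearest (squared Euclidean distance)."""
    return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(z, means[c])))

# XOR-like toy data (made up): both class means coincide in the original
# space, so nearest-mean cannot separate the classes without new features.
X = [(0, 0), (1, 1), (0, 1), (1, 0)]
y = [0, 0, 1, 1]
refs = [(0, 0), (1, 1)]          # hypothetical reference vectors
Z = add_kernel_features(X, refs, gamma=2.0)
means = fit_class_means(Z, y)
preds = [predict(means, z) for z in Z]
print(preds)
```

On this toy problem the two added kernel features are enough for the simple classifier to recover the labels, which is the point of building a redundant feature set first and keeping the classifier simple.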
Binary features B1: Binary, unrestricted projections; MAP classifiers, $p(C|b)$; $2N_C$ regions; complexity $O(1)$. B2: Binary, restricted by other binary features; complexes $b_1 \wedge b_2 \wedge \dots \wedge b_k$; complexity $O(2^k)$. B3: Binary, restricted by distance; $b \wedge r_1 \in [r_1^-, r_1^+] \wedge \dots \wedge r_k \in [r_k^-, r_k^+]$, separately for each value of $b$. Example: for $b = 1$ and $r_1 \in [0, 1]$, take vectors only from this slice. N1: Nominal, like binary. [Figure: axes $r_1$ and $b$.]
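The three binary feature types can be sketched as small predicates, with a counting-based MAP rule for p(C|b). This is one possible reading of the slide, not the authors' implementation; the function names, the thresholded form of B1, and the toy data are assumptions.

```python
from collections import Counter, defaultdict

def b1(x, feature, threshold):
    """B1 (unrestricted): 1 if x[feature] >= threshold, else 0."""
    return int(x[feature] >= threshold)

def b2(bits):
    """B2: conjunction b_1 AND b_2 AND ... AND b_k of binary features."""
    return int(all(bits))

def b3(b, r_values, intervals):
    """B3: b restricted to the slice where every r_i lies in [lo_i, hi_i]."""
    inside = all(lo <= r <= hi for r, (lo, hi) in zip(r_values, intervals))
    return int(b == 1 and inside)

def map_classifier(bits, labels):
    """Estimate p(C|b) by counting; return the most probable class per b value."""
    counts = defaultdict(Counter)
    for b, c in zip(bits, labels):
        counts[b][c] += 1
    return {b: cnt.most_common(1)[0][0] for b, cnt in counts.items()}

# Toy data (made up): one binary feature separates the two classes.
X = [(0.2, 0.1), (0.4, 0.9), (0.7, 0.5), (0.9, 0.3)]
labels = ["A", "A", "B", "B"]
bits = [b1(x, 0, 0.5) for x in X]
rule = map_classifier(bits, labels)
print(bits, rule)
```

The MAP rule here is a lookup table with one entry per value of b, which is why a single unrestricted binary feature has constant complexity.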
Real features R1: Line; original features $x_i$, with sigmoidal transformation $\sigma(x_i)$ for contrast enhancement; search for 1D patterns (k-sep intervals). R2: Line; like R1 but restricted by other features, for example $z_i = x_i(X)$ only for $|x_j| < t_j$. R3: Line; $z_i = x_i(X)$ like R2 but restricted by distance. R4: Line; linear combinations $z = W \cdot X$ optimized by projection pursuit (PCA, ICA, QPC, ...). P1: Prototypes; weighted distance functions or specialized kernels $z_i = K(X, X_i)$. M1: Motifs; based on correlations between elements rather than input values.
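A few of these real-valued features can be sketched in plain Python. The sketch below is illustrative only: the sigmoid parameters, the use of None to mark vectors outside the restricted region, and all names are assumptions. For R4 it uses the direction joining the two class means as a simple stand-in, not projection pursuit (PCA, ICA, or QPC).

```python
import math

def r1_sigmoid(x_i, center=0.0, slope=1.0):
    """R1: sigmoidal contrast enhancement, sigma(slope * (x_i - center))."""
    return 1.0 / (1.0 + math.exp(-slope * (x_i - center)))

def r2_restricted(x, i, j, t_j):
    """R2: z_i = x_i only where |x_j| < t_j; None marks 'outside the region'."""
    return x[i] if abs(x[j]) < t_j else None

def class_mean(rows):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def r4_mean_direction(X, y, class_a, class_b):
    """R4 stand-in: unit vector W joining the two class means.
    Assumes the means differ (otherwise the norm is zero)."""
    m_a = class_mean([x for x, t in zip(X, y) if t == class_a])
    m_b = class_mean([x for x, t in zip(X, y) if t == class_b])
    w = [b - a for a, b in zip(m_a, m_b)]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]

def project(w, x):
    """z = W . X"""
    return sum(a * b for a, b in zip(w, x))

# Toy data (made up).
X = [(0, 0), (1, 0), (3, 4), (4, 4)]
y = [0, 0, 1, 1]
w = r4_mean_direction(X, y, 0, 1)
z = [project(w, x) for x in X]
print(w, z)
```

After projection the two classes occupy disjoint intervals on the line, so a single threshold on z separates them in this toy case.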
Datasets
Dataset | #Features | #Samples | #Samples per class
Australian | 15 | 690 | 383 no, 307 yes
Appendicitis | 7 | 106 | 85 C1, 21 C2
Heart | 13 | 303 | 164 absence, 139 presence
Diabetes | 8 | 768 | 268 C1, 500 C2
Wisconsin | 9 | 699 | 458 benign, 241 malignant
Hypothyroid | 21 | 3772 | 93 C1, 191 C2, 3488 C3
B1 Features
Dataset | B1 features
Australian | F8 < 0.5 | F8 ≥ 0.5 ∧ F9 ≥ 0.5
Appendicitis | F7 ≥ 7520.5 | F7 < 7520.5 ∧ F4 < 12
Heart | F13 < 4.5 ∧ F12 < 0.5 | F13 ≥ 4.5 ∧ F3 ≥ 3.5
Diabetes | F2 < 123.5 | F2 ≥ 143.5
Wisconsin | F2 < 2.5 | F2 ≥ 4.5
Hypothyroid | F17 < 0.00605 | F17 ≥ 0.00605 ∧ F21 < 0.06472
Examples of B1 features taken from segments of decision trees. Other features that frequently proved useful on this data: P1 prototypes and contrast-enhanced R1 features. Used in various learning systems, these features greatly simplify the models and increase their accuracy.
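As a sketch of how such tree-derived conditions could be turned into binary inputs for another learner, the table can be encoded as conjunctions of threshold tests. The thresholds come from the table; the encoding, the function names, the assumption that F8 means the eighth input feature (1-based), and the sample vectors are all illustrative.

```python
import operator

OPS = {"<": operator.lt, ">=": operator.ge}

# Each B1 feature is a conjunction of (feature number, operator, threshold).
B1_FEATURES = {
    "Australian": [[(8, "<", 0.5)], [(8, ">=", 0.5), (9, ">=", 0.5)]],
    "Appendicitis": [[(7, ">=", 7520.5)], [(7, "<", 7520.5), (4, "<", 12)]],
    "Heart": [[(13, "<", 4.5), (12, "<", 0.5)], [(13, ">=", 4.5), (3, ">=", 3.5)]],
    "Diabetes": [[(2, "<", 123.5)], [(2, ">=", 143.5)]],
    "Wisconsin": [[(2, "<", 2.5)], [(2, ">=", 4.5)]],
    "Hypothyroid": [[(17, "<", 0.00605)], [(17, ">=", 0.00605), (21, "<", 0.06472)]],
}

def b1_value(x, conjunction):
    """Return 1 if every condition holds; F-numbers are assumed 1-based."""
    return int(all(OPS[op](x[f - 1], t) for f, op, t in conjunction))

def b1_vector(x, dataset):
    """Binary feature vector for one sample of the named dataset."""
    return [b1_value(x, conj) for conj in B1_FEATURES[dataset]]

# Made-up Diabetes-like samples (8 features); only F2 matters here.
low = [0, 100.0, 0, 0, 0, 0, 0, 0]
mid = [0, 130.0, 0, 0, 0, 0, 0, 0]
high = [0, 150.0, 0, 0, 0, 0, 0, 0]
print(b1_vector(low, "Diabetes"), b1_vector(mid, "Diabetes"), b1_vector(high, "Diabetes"))
```

Samples falling between the two thresholds, like the middle one, activate neither feature and are left for other features or the downstream classifier to resolve.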
Conclusions Systematic exploration of feature transformations allows for the discovery of simple models that more sophisticated learning systems may miss; results always improve and models simplify! Some benchmark problems turned out to be rather trivial and were solved with a single binary feature, one constrained nominal feature, or one new feature constructed as a projection on the line connecting the means of two classes. Kernel-based features offer an attractive alternative to current kernel-based SVM approaches, offering multiresolution and adaptive regularization possibilities, combined with LDA or SVNT. Analysis of images, multimedia streams or biosequences may require even more sophisticated ways of constructing features, starting from the available input features. Learn from others, not only from your own errors!