Presentation transcript:

Meta-Learning: towards universal learning paradigms Włodzisław Duch Department of Informatics, Nicolaus Copernicus University, Toruń, Poland Google: W. Duch UoT 7/2010

Toruń

Norbert Tomek Marek Krzysztof

Copernicus Nicolaus Copernicus: born in 1473

DI NCU Projects Google W Duch => List of projects, papers Computational intelligence (CI), main themes: Foundations of computational intelligence: transformation-based learning, k-separability, learning hard Boolean problems. Meta-learning, or learning how to learn. Understanding of data: prototype-based rules, visualization. Novel learning: projection pursuit networks, QPC (Quality of Projected Clusters), search-based neural training, transfer learning or learning from others (ULM), aRPM, SFM... Similarity-based framework for meta-learning, heterogeneous systems, new transfer functions for neural networks. Feature selection, extraction, creation.

DI NCU Projects Neurocognitive Informatics projects. Computational creativity, insight, intuition, consciousness. Neurocognitive approach to language, word games. Medical information retrieval, analysis, visualization. Global analysis of EEG, visualization of high-D trajectories. Brain stem models and consciousness in artificial systems. Autism, a comprehensive theory. Imagery agnosia, especially imagery amusia. Infants: observation, guided development. A test-bed for integration of different Humanized Interface Technologies (HIT), with the Singapore C2I Center. Free will, neural determinism and social consequences.

CI definition Computational Intelligence. An International Journal (1984) + 10 other journals with "Computational Intelligence" in the title. D. Poole, A. Mackworth & R. Goebel, Computational Intelligence - A Logical Approach (OUP 1998), a GOFAI book on logic and reasoning. CI should: be problem-oriented, not method-oriented; cover all that the CI community is doing now and is likely to do in the future; include AI – they also think they are CI... CI: the science of solving (effectively) non-algorithmizable problems. A problem-oriented definition, firmly anchored in computer science/engineering. AI: focused on problems requiring higher-level cognition; the rest of CI is more focused on problems related to perception/action/control.

Data mining packages No free lunch => provide different types of tools for knowledge discovery: decision trees, neural, neurofuzzy, similarity-based, SVM, committees, tools for visualization of data. Support the process of knowledge discovery/model building and evaluation, organizing it into projects. Many other interesting DM packages of this sort exist: Weka, Yale, Orange, Knime... packages on the-data-mine.com list! GhostMiner, data mining tools from our lab + Fujitsu: separate the process of model building (hackers) and knowledge discovery from model use (lamers) => GM Developer & Analyzer. We are building Intemi, a completely new tool for meta-learning.

Principles: information compression Neural information processing in perception and cognition: information compression, or algorithmic complexity. In computing: minimum length (message, description) encoding. Wolff (2006): all cognition and computation is compression! Analysis and production of natural language, fuzzy pattern recognition, probabilistic reasoning and unsupervised inductive learning. He talks about multiple alignment, unification and search, but so far has only models for sequential data and 1D alignment. Information compression, encoding new information in terms of old, has been used to define a measure of syntactic and semantic information (Duch, Jankowski 1994); it is based on the size of the minimal graph representing a given data structure or knowledge-base specification, and thus goes beyond alignment.

Graphs of consistent concepts Brains learn new concepts in terms of old ones; use a large semantic network and add new concepts by linking them to known ones. Disambiguate concepts by spreading activation and selecting those that are consistent with already active subnetworks.

Similarity-based framework (Dis)similarity: more general than feature-based description, no need for vector spaces (structured objects), more general than the fuzzy approach (F-rules are reduced to P-rules); includes nearest-neighbor algorithms, MLPs, RBFs, separable function networks, SVMs, kernel methods and many others! Similarity-Based Methods (SBMs) are organized in a framework: p(C_i|X;M) posterior classification probability or y(X;M) approximators; models M are parameterized in increasingly sophisticated ways. A systematic search (greedy, beam, evolutionary) in the space of all SBM models is used to select the optimal combination of parameters and procedures, opening different types of optimization channels and trying to discover the appropriate bias for a given problem. Results: several candidate models are created; even a very limited version gives the best results in 7 out of 12 Statlog problems.

SBM framework components Pre-processing: objects O => features X, or (dis)similarities D(O,O'). Calculation of similarity between features d(x_i, y_i) and objects D(X,Y). Reference (or prototype) vector R selection/creation/optimization. Weighted influence of reference vectors G(D(R_i, X)), i=1..k. Functions/procedures to estimate p(C|X;M) or y(X;M). Cost functions E[D_T;M] and model selection/validation procedures. Optimization procedures for the whole model M_a. Search control procedures to create more complex models M_{a+1}. Creation of ensembles of (local, competent) models. M = {X(O), d(·,·), D(·,·), k, G(D), {R}, {p_i(R)}, E[·], K(·), S(·,·)}, where S(C_i,C_j) is a matrix evaluating similarity of the classes; a vector of observed probabilities p_i(X) may be used instead of hard labels. The kNN model: p(C_i|X;kNN) = p(C_i|X; k, D(·), {D_T}); the RBF model: p(C_i|X;RBF) = p(C_i|X; D(·), G(D), {R}); MLP, SVM and many other models may all be "re-discovered".
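A minimal sketch may help to fix ideas: one point in this model space is a reference-vector classifier p(C|X;M) in which the distance d(·,·), the weighting function G(D) and the neighborhood size k are exchangeable components. The Canberra distance and Gaussian weighting below are illustrative assumptions, not the framework's fixed choices.

```python
import numpy as np

def canberra(x, y):
    # d(x_i, y_i) summed over features; guard against 0/0
    denom = np.abs(x) + np.abs(y)
    denom = np.where(denom == 0, 1.0, denom)
    return np.sum(np.abs(x - y) / denom)

def gaussian_weight(d, sigma=1.0):
    # G(D(R_i, X)): influence of a reference vector decays with distance
    return np.exp(-(d / sigma) ** 2)

def sbm_posterior(x, refs, labels, n_classes, k=5, dist=canberra, G=gaussian_weight):
    """Estimate p(C|X; M) from the k nearest reference (prototype) vectors R_i."""
    d = np.array([dist(x, r) for r in refs])
    nn = np.argsort(d)[:k]
    votes = np.zeros(n_classes)
    for i in nn:
        votes[labels[i]] += G(d[i])
    return votes / votes.sum()

# toy usage with random data
rng = np.random.default_rng(0)
refs = rng.normal(size=(20, 4))
labels = rng.integers(0, 2, size=20)
print(sbm_posterior(rng.normal(size=4), refs, labels, n_classes=2))
```

Swapping dist, G or k, or optimizing the reference set {R}, moves to a different model in the same family; that is exactly the space the meta-learning search explores.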

Meta-learning in SBM scheme Start from kNN, k=1, all data & features, Euclidean distance; end with a model that is a novel combination of procedures and parameterizations.
k-NN: 67.5/76.6%
+ d(x,y) Canberra distance: 89.9/90.7%
+ feature weights s_i = (0,0,1,0,1,1): 71.6/64.4%
+ feature selection: 67.5/76.6%
+ optimization of k: 67.5/76.6%
+ d(x,y) Canberra + s_i = (1,0,1,0.6,0.9,1): 74.6/72.9%
+ d(x,y) Canberra + selection or optimized k: 89.9/90.7%
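A progression of this kind can be reproduced in spirit by a simple greedy search over component choices. The sketch below uses scikit-learn's KNeighborsClassifier as a stand-in for the SBM model (an assumption of convenience, not the original implementation) and keeps, at each step, the single change of distance metric or k that most improves cross-validated accuracy.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def greedy_sbm_search(X, y, steps=3):
    # start from plain kNN: k=1, Euclidean distance, all features
    best = {"metric": "euclidean", "n_neighbors": 1}
    best_acc = cross_val_score(KNeighborsClassifier(**best), X, y, cv=5).mean()
    history = [(dict(best), best_acc)]
    for _ in range(steps):
        # candidate single-component modifications of the current best model
        candidates = ([{**best, "metric": m} for m in ("canberra", "manhattan")] +
                      [{**best, "n_neighbors": k} for k in (3, 5, 7, 9)])
        accs = [cross_val_score(KNeighborsClassifier(**c), X, y, cv=5).mean()
                for c in candidates]
        i = int(np.argmax(accs))
        if accs[i] <= best_acc:
            break                      # no single change improves the model
        best, best_acc = candidates[i], accs[i]
        history.append((dict(best), best_acc))
    return history
```

Calling greedy_sbm_search(X, y) on a dataset returns the history of (configuration, accuracy) pairs, mirroring the kind of progression listed above.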

Selecting Support Vectors Active learning: if a vector's contribution to the parameter change is negligible, remove it from the training set. If the difference is sufficiently small, the pattern X will have negligible influence on the training process and may be removed from training. Conclusion: select vectors with e_W(X) > e_min for training. Two problems: possible oscillations and strong influence of outliers. Solution: adjust e_min dynamically to avoid oscillations; also remove vectors with e_W(X) > e_max = 1 - e_min.
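A hedged sketch of this selection rule, assuming only that the model exposes class probabilities (the predict_proba interface is an assumption for illustration): compute the current error e_W(X) for every training vector, drop vectors that are already learned (error below eps_min) and likely outliers (error above e_max = 1 - eps_min), and repeat each epoch while adjusting eps_min.

```python
import numpy as np

def select_training_vectors(model, X, Y_onehot, eps_min=0.01):
    """Keep only vectors whose current error e_W(X) lies in (eps_min, 1 - eps_min)."""
    eps_max = 1.0 - eps_min
    # e_W(X): largest absolute output error over classes for each training vector
    err = np.max(np.abs(Y_onehot - model.predict_proba(X)), axis=1)
    keep = (err > eps_min) & (err < eps_max)   # drop learned vectors and outliers
    return X[keep], Y_onehot[keep]
```

In practice this would be called between training epochs, with eps_min increased slowly to avoid the oscillations mentioned above.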

SVNT XOR solution

How much can we learn? Linearly separable or almost separable problems are relatively simple – deform or add dimensions to make data separable. How to define “slightly non-separable”? There is only separable and the vast realm of the rest.

Neurons learning complex logic Boolean functions are difficult to learn: n bits but 2^n nodes => combinatorial complexity; similarity is not useful, since for parity all nearest neighbors are from the wrong class. MLP networks have difficulty learning functions that are highly non-separable. Projection on W = (1,1,...,1) gives clusters with 0, 1, 2, ... n bits; the solution requires abstract imagination + easy categorization. Examples of 2-4D parity problems. Neural logic can solve it without counting; find a good point of view.

Boolean functions n=2: 16 functions, 14 separable, 2 not separable (XOR and its negation). n=3: 256 functions, 104 separable (41%), 152 not separable. n=4: 64K=65536 functions, only 1882 separable (~3%). n=5: ~4G functions, but << 1% separable... bad news! Existing methods may learn some non-separable functions, but most functions cannot be learned! Example: the n-bit parity problem; many papers in top journals. No off-the-shelf systems are able to solve such problems. For all parity problems SVM is below the base rate! Such problems are solved only by special neural architectures or special classifiers – if the type of function is known. But parity is still trivial... solved by a single neuron with a periodic transfer function of the sum of all bits.
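A few lines of Python make the parity argument concrete: projecting all n-bit strings on W = (1,1,...,1) groups them by their bit count, and a single periodic function of that projection separates odd from even, which no single linear threshold can do. This is an illustration of the argument, not code from any of the cited systems.

```python
import itertools
import numpy as np

n = 4
X = np.array(list(itertools.product([0, 1], repeat=n)))   # all n-bit strings
parity = X.sum(axis=1) % 2                  # target: 1 for an odd number of 1-bits
z = X @ np.ones(n)                          # projection on W = (1,...,1): values 0..n
pred = (np.cos(np.pi * z) < 0).astype(int)  # cos(pi*z) = (-1)^z, negative for odd z
print(bool(np.all(pred == parity)))         # True: one periodic node solves parity
```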

Example: aRPM Almost Random Projection Machine (with Hebbian learning): generate random combinations of inputs (line projections) z(X) = W·X; find and isolate pure clusters h(X) = G(z(X)); estimate the relevance of h(X), e.g. MI(h(X),C), and leave only good nodes; continue until each vector activates at least k nodes. Count how many nodes vote for each class and plot.
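A hedged sketch of this procedure: random projections are drawn, an interval on a projected line is kept only if it isolates an (almost) pure cluster of one class, and prediction counts the votes of such nodes. The purity threshold, window construction and simple vote counting below are illustrative assumptions rather than the original aRPM implementation.

```python
import numpy as np

def arpm_fit(X, y, n_proj=200, min_support=10, purity=0.95, seed=None):
    rng = np.random.default_rng(seed)
    nodes = []
    for _ in range(n_proj):
        w = rng.normal(size=X.shape[1])        # random projection z(X) = W.X
        z = X @ w
        order = np.argsort(z)
        # scan contiguous windows on the projected line for pure clusters
        for start in range(0, len(z) - min_support, min_support):
            idx = order[start:start + min_support]
            labels, counts = np.unique(y[idx], return_counts=True)
            if counts.max() / counts.sum() >= purity:
                nodes.append((w, z[idx].min(), z[idx].max(), labels[counts.argmax()]))
    return nodes

def arpm_predict(nodes, X, n_classes):
    votes = np.zeros((len(X), n_classes))
    for w, lo, hi, c in nodes:
        z = X @ w
        votes[(z >= lo) & (z <= hi), c] += 1   # each pure-cluster node votes
    return votes.argmax(axis=1)
```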

Learning from others… Learn to transfer interesting features created by different systems, e.g. prototypes, combinations of features with thresholds… See our talk with Tomasz Maszczyk on Universal Learning Machines. Examples of features generated:
B1: Binary – unrestricted projections.
B2: Binary – restricted by other binary features; complexes b_1 ∧ b_2 ∧ … ∧ b_k.
B3: Binary – restricted by distance.
R1: Line – original real features r_i; non-linear thresholds for "contrast enhancement" σ(r_i − b_i); intervals (k-sep).
R4: Line – restricted by distance, original feature; thresholds; intervals (k-sep); more general 1D patterns.
P1: Prototypes: general q-separability, weighted distance functions or specialized kernels.
M1: Motifs, based on correlations between elements rather than input values.

B1 Features Examples of B1 features taken from segments of decision trees:
Australian: F8 < 0.5; F8 ≥ 0.5 ∧ F9 ≥ 0.5
Appendicitis: F7 ≥ …; F7 < … ∧ F4 < 12
Heart: F13 < 4.5 ∧ F12 < 0.5; F13 ≥ 4.5 ∧ F3 ≥ 3.5
Diabetes: F2 < 123.5; F2 ≥ …
Wisconsin: F2 < 2.5; F2 ≥ 4.5
Hypothyroid: F17 < …; F17 ≥ … ∧ F21 < …
These features, used in various learning systems, greatly simplify their models and increase their accuracy.
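Such B1 features can be harvested automatically from the splits of a shallow decision tree. The sketch below uses scikit-learn's DecisionTreeClassifier as a stand-in for the SSV trees used in our work (an assumption for illustration only).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def b1_features(X, y, max_depth=2):
    """Harvest univariate splits (feature, threshold) from a shallow tree and
    return them as new binary features b = [x_f < threshold]."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, y)
    t = tree.tree_
    # internal nodes have children_left != -1; each gives one split
    splits = [(int(t.feature[i]), float(t.threshold[i]))
              for i in range(t.node_count) if t.children_left[i] != -1]
    B = np.column_stack([(X[:, f] < thr).astype(int) for f, thr in splits])
    return B, splits
```

The returned columns of B can then be appended to the original data and fed to SVM, NB or any other learner, as in the results table below.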

Dataset / Classifier: SVM (#SV), SSV (#leaves), NB; accuracy % ± std:
Australian: SVM 84.9±5.6 (203), SSV 84.9±3.9 (4), NB 80.3±3.8
Australian ULM: SVM 86.8±5.3 (166), SSV 87.1±2.5 (4), NB 85.5±3.4; features: B1(2)+P1(3) (SVM), B1(2)+R1(1)+P1(3) (SSV), B1(2) (NB)
Appendicitis: SVM 87.8±8.7 (31), SSV 88.0±7.4 (4), NB 86.7±6.6
Appendicitis ULM: SVM 91.4±8.2 (18), SSV 91.7±6.7 (3), NB 91.4±8.2; features: B1(2)
Heart: SVM 82.1±6.7 (101), SSV 76.8±9.6 (6), NB 84.2±6.1
Heart ULM: SVM 83.4±3.5 (98), SSV 79.2±6.3 (6), NB 84.5±6.8; features: Data+R1(3), Data+B1(2)
Diabetes: SVM 77.0±4.9 (361), SSV 73.6±3.4 (4), NB 75.3±4.7
Diabetes ULM: SVM 78.5±3.6 (338), SSV 75.0±3.3 (3), NB 76.5±2.9; features: Data+R1(3)+P1(4) (SVM), B1(2) (SSV), Data+B1(2) (NB)
Wisconsin: SVM 96.6±1.6 (46), SSV 95.2±1.5 (8), NB 96.0±1.5
Wisconsin ULM: SVM 97.2±1.8 (45), SSV 97.4±1.6 (2), NB 97.2±2.0; features: Data+R1(1)+P1(4), R1(1)
Hypothyroid: SVM 94.1±0.6 (918), SSV 99.7±0.5 (12), NB 41.3±8.3
Hypothyroid ULM: SVM 99.5±0.4 (80), SSV 99.6±0.4 (8), NB 98.1±0.7; features: Data+B1(2)

Meta-learning Meta-learning means different things to different people. Some will call "meta" any learning of many models (e.g. Weka), ranking them, arcing, boosting, bagging, or creating an ensemble in many ways, with optimization of parameters to integrate models. Stacking: learn new models on the errors of the previous ones. Landmarking: characterize many datasets and remember which method worked best on each dataset; compare a new dataset to the reference ones; define various measures (not easy) and use similarity-based methods. Regression models created for each algorithm, on parameters that describe the data, to predict their expected accuracy. Goal: rank potentially useful algorithms. Rather limited success…
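A toy version of the landmarking idea: describe each dataset by a few cheap meta-features plus the accuracies of simple "landmark" learners, then rank algorithms for a new dataset by looking at the most similar reference dataset. The particular meta-features and landmark learners below are illustrative assumptions, not a recommended set.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def meta_features(X, y):
    # cheap dataset descriptors + "landmark" accuracies of two simple learners
    landmarks = [GaussianNB(), DecisionTreeClassifier(max_depth=1)]
    scores = [cross_val_score(m, X, y, cv=3).mean() for m in landmarks]
    return np.array([X.shape[0], X.shape[1], len(np.unique(y))] + scores)

def rank_algorithms(new_Xy, reference):
    """reference: list of (meta_feature_vector, {algorithm: accuracy}) pairs."""
    mf = meta_features(*new_Xy)
    nearest = min(reference, key=lambda r: np.linalg.norm(r[0] - mf))
    return sorted(nearest[1], key=nearest[1].get, reverse=True)
```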

Real meta-learning! Meta-learning: learning how to learn, replacing experts who search for the best models by running a lot of experiments. The space of models is too large to explore exhaustively, so design the system architecture to support knowledge-based search. Abstract view, uniform I/O, uniform results management. Directed acyclic graphs (DAGs) of boxes representing scheme placeholders and particular models, interconnected through I/O. A configuration level for meta-schemes, expanded at the runtime level. An exercise in software engineering for data mining!
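The configuration level can be pictured with a small sketch: a DAG of "boxes" that are either scheme placeholders or concrete machines, connected through uniform I/O and expanded into runnable configurations at run time. All names below are illustrative; this is not the Intemi API.

```python
from dataclasses import dataclass, field
from itertools import product

@dataclass
class Box:
    name: str
    options: list                                  # machines that can fill this placeholder
    inputs: list = field(default_factory=list)     # names of upstream boxes (DAG edges)

def expand(boxes):
    """Enumerate concrete configurations of a meta-scheme (Cartesian product of options)."""
    names = [b.name for b in boxes]
    for combo in product(*[b.options for b in boxes]):
        yield dict(zip(names, combo))

scheme = [Box("transformer", ["standardize", "pca"]),
          Box("classifier", ["kNN", "SVM", "SSV tree"], inputs=["transformer"])]
for config in expand(scheme):
    print(config)   # each dict is one runnable configuration of the scheme
```

A real system would not enumerate blindly but would use meta-knowledge to decide which configurations are worth testing, and in what order.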

Advanced meta-learning Extracting meta-rules, describing interesting search directions. Finding the correlations occurring among different items in the most accurate results, identifying different machine (algorithmic) structures with similar behavior in an area of the model space. Depositing the knowledge gained in a reusable meta-knowledge repository (for meta-learning experience exchange between different meta-learners). A uniform representation of meta-knowledge, extending expert knowledge and adjusting prior knowledge according to the performed tests. Finding new successful complex structures and converting them into meta-schemes (which we call meta-abstraction) by replacing proper substructures with placeholders. Beyond transformations & feature spaces: actively search for information. Intemi software (N. Jankowski and K. Grąbczewski) incorporating these ideas and more is coming "soon"...

Meta-learning architecture Inside the meta-parameter search, a repeater machine composed of distribution and test schemes is placed.

Complexities on vowel data

Simple machines on vowel data Left: final ranking, gray bar=accuracy, small bars: memory, time & total complexity, middle numbers = process id (models in previous table).

Complex machines on vowel data Left: final ranking, gray bar=accuracy, small bars: memory, time & total complexity, middle numbers = process id (models in previous table).

Thyroid example 32-51: ParamSearch [SVMClassifier [KernelProvider]]; 28-30: kNN; 31: NBC

Summary Challenging data cannot be handled with existing DM tools. The similarity-based framework enables meta-learning as search in the model space; heterogeneous systems add fine granularity. No off-the-shelf classifiers are able to learn difficult Boolean functions. Visualization of hidden neurons shows that perfect but non-separable solutions are frequently found despite base-rate outputs. Linear separability is not the best goal of learning; other targets that allow for easy handling of final non-linearities should be defined. k-separability defines complexity classes for non-separable data. Transformation-based learning shows the need for a component-based approach to DM and discovery of the simplest models. Meta-learning replaces data miners, automatically creating new optimal learning methods on demand. Is this the final word in data mining? The future will tell.

Thank you for lending your ears... Google: W. Duch => Papers & presentations; Norbert: KIS: => On-line publications. Book: Meta-learning in Computational Intelligence (2010).