Least-Mean-Square Training of Cluster-Weighted Modeling
National Taiwan University, Department of Computer Science and Information Engineering

Outline: Introduction of CWM; Least-Mean-Square Training of CWM; Experiments; Summary; Future work; Q&A

Cluster-Weighted Modeling (CWM) CWM is a supervised learning model based on joint probability density estimation of a set of input and output (target) data. The joint probability is expanded into clusters, each of which describes a local subspace well. Each local Gaussian expert can have its own local function (constant, linear, or quadratic). The global (nonlinear) model is constructed by combining all the local models. The resulting model has transparent local structures and meaningful parameters.
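
(The slides' equations did not survive extraction. A sketch of the standard CWM joint-density expansion referenced here, with notation assumed, following Gershenfeld's formulation:)

```latex
p(y, \mathbf{x}) = \sum_{m=1}^{M} p(y, \mathbf{x} \mid c_m)\, p(c_m)
                 = \sum_{m=1}^{M} p(y \mid \mathbf{x}, c_m)\, p(\mathbf{x} \mid c_m)\, p(c_m),
\quad
p(\mathbf{x} \mid c_m) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m),
\quad
p(y \mid \mathbf{x}, c_m) = \mathcal{N}\!\big(y;\, f_m(\mathbf{x}; \boldsymbol{\beta}_m),\, \sigma_{y,m}^2\big)
```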

Architecture (figure: block diagram of the CWM architecture)

Prediction calculation Conditional forecast: the expected output given the input. Conditional error (output uncertainty): the expected output covariance given the input.
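
(The corresponding formulas were dropped from the transcript; a sketch under the decomposition above, with h_m denoting the input-space responsibility of cluster m:)

```latex
\hat{y}(\mathbf{x}) = \langle y \mid \mathbf{x} \rangle
  = \sum_{m=1}^{M} f_m(\mathbf{x}; \boldsymbol{\beta}_m)\, h_m(\mathbf{x}),
\qquad
h_m(\mathbf{x}) = \frac{p(\mathbf{x} \mid c_m)\, p(c_m)}{\sum_{k=1}^{M} p(\mathbf{x} \mid c_k)\, p(c_k)}

\sigma_y^2(\mathbf{x}) = \big\langle (y - \hat{y}(\mathbf{x}))^2 \mid \mathbf{x} \big\rangle
  = \sum_{m=1}^{M} \big( \sigma_{y,m}^2 + f_m(\mathbf{x}; \boldsymbol{\beta}_m)^2 \big)\, h_m(\mathbf{x}) \;-\; \hat{y}(\mathbf{x})^2
```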

Training (EM Algorithm) Objective function: log-likelihood function. Initialization: cluster means by k-means; variances set to the maximal range of each dimension; prior probabilities p(c_m) = 1/M, where M is the predetermined number of clusters. E-step: evaluate the posterior probability of each cluster. M-step: update the cluster means and the prior probabilities.
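
(A sketch of the E-step and of these two M-step updates, again with notation assumed:)

```latex
\text{E-step:}\quad
p(c_m \mid \mathbf{x}_n, y_n) =
  \frac{p(y_n \mid \mathbf{x}_n, c_m)\, p(\mathbf{x}_n \mid c_m)\, p(c_m)}
       {\sum_{k=1}^{M} p(y_n \mid \mathbf{x}_n, c_k)\, p(\mathbf{x}_n \mid c_k)\, p(c_k)}

\text{M-step:}\quad
p(c_m) \leftarrow \frac{1}{N} \sum_{n=1}^{N} p(c_m \mid \mathbf{x}_n, y_n),
\qquad
\boldsymbol{\mu}_m \leftarrow
  \frac{\sum_{n=1}^{N} \mathbf{x}_n\, p(c_m \mid \mathbf{x}_n, y_n)}
       {\sum_{n=1}^{N} p(c_m \mid \mathbf{x}_n, y_n)}
```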

M-step (cont.) Define the cluster-weighted expectation. Update the cluster-weighted covariance matrices. Update the cluster parameters that maximize the data likelihood. Update the output covariance matrices.
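
(A sketch of these quantities, with the cluster-weighted expectation written in the usual form; the expert parameters β_m follow from a cluster-weighted least-squares condition:)

```latex
\langle \theta \rangle_m =
  \frac{\sum_{n=1}^{N} \theta(\mathbf{x}_n, y_n)\, p(c_m \mid \mathbf{x}_n, y_n)}
       {\sum_{n=1}^{N} p(c_m \mid \mathbf{x}_n, y_n)}

\boldsymbol{\Sigma}_m \leftarrow
  \big\langle (\mathbf{x} - \boldsymbol{\mu}_m)(\mathbf{x} - \boldsymbol{\mu}_m)^{\mathsf T} \big\rangle_m,
\qquad
\sigma_{y,m}^2 \leftarrow
  \big\langle \big(y - f_m(\mathbf{x}; \boldsymbol{\beta}_m)\big)^2 \big\rangle_m,
\qquad
\Big\langle \big(y - f_m(\mathbf{x}; \boldsymbol{\beta}_m)\big)\,
  \frac{\partial f_m}{\partial \boldsymbol{\beta}_m} \Big\rangle_m = 0
```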

Least-Mean-Square Training of CWM To train CWM's model parameters from a least-squares perspective. Minimizing the squared error function of CWM's training result to find another solution with better accuracy. To find another solution when CWM is trapped in a local minimum. Applying supervised selection of cluster centers instead of an unsupervised method.

LMS Learning Algorithm The instantaneous error produced by sample n is given below; the prediction formula is the conditional forecast defined earlier. A softmax function is used to constrain the prior probabilities to values between 0 and 1 that sum to 1.
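
(These equations were lost in extraction; a plausible form, assuming the prediction ŷ(x) and responsibilities h_m(x) defined earlier, and unconstrained softmax parameters γ_m:)

```latex
e_n = \tfrac{1}{2}\big(y_n - \hat{y}(\mathbf{x}_n)\big)^2,
\qquad
\hat{y}(\mathbf{x}_n) = \sum_{m=1}^{M} f_m(\mathbf{x}_n; \boldsymbol{\beta}_m)\, h_m(\mathbf{x}_n),
\qquad
p(c_m) = \frac{\exp(\gamma_m)}{\sum_{k=1}^{M} \exp(\gamma_k)}
```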

LMS Learning Algorithm (cont.) The derivation of the gradients:
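
(The slide's gradient equations did not survive extraction. As an illustration only, chain-rule gradients of e_n for this kind of model take the following form, shown here for the expert parameters β_m and the cluster means μ_m:)

```latex
\frac{\partial e_n}{\partial \boldsymbol{\beta}_m}
  = -\big(y_n - \hat{y}(\mathbf{x}_n)\big)\, h_m(\mathbf{x}_n)\,
    \frac{\partial f_m(\mathbf{x}_n; \boldsymbol{\beta}_m)}{\partial \boldsymbol{\beta}_m}

\frac{\partial e_n}{\partial \boldsymbol{\mu}_m}
  = -\big(y_n - \hat{y}(\mathbf{x}_n)\big)
    \big(f_m(\mathbf{x}_n; \boldsymbol{\beta}_m) - \hat{y}(\mathbf{x}_n)\big)\,
    h_m(\mathbf{x}_n)\, \boldsymbol{\Sigma}_m^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_m)
```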

LMS CWM Learning Algorithm Initialization: initialize the parameters using CWM's training result. Iterate until convergence: for n = 1 to N, estimate the error, estimate the gradients, and update the parameters.
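
(A minimal, self-contained sketch of this loop in Python/NumPy, with hypothetical names and simplifications — local linear experts, isotropic Gaussians, fixed variances and priors, random rather than CWM-based initialization; the thesis' own demos are MATLAB scripts.)

```python
import numpy as np

rng = np.random.default_rng(0)

def responsibilities(x, mu, var, prior):
    """h_m(x) = p(c_m) N(x; mu_m, var_m I) / sum_k p(c_k) N(x; mu_k, var_k I)."""
    d2 = np.sum((x - mu) ** 2, axis=1)             # squared distance to each center
    w = prior * np.exp(-0.5 * d2 / var) / var ** (x.size / 2)
    return w / w.sum()

def predict(x, mu, var, prior, A, b):
    h = responsibilities(x, mu, var, prior)
    f = A @ x + b                                   # each expert's local prediction
    return f @ h, f, h

# toy 1-D regression data: y = sin(x) + noise
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)

M, eta = 4, 0.05                                    # clusters, learning rate
mu = rng.uniform(-3, 3, size=(M, 1))                # cluster centers
var = np.full(M, 1.0)                               # cluster variances (fixed here)
prior = np.full(M, 1.0 / M)                         # cluster priors (fixed here)
A = np.zeros((M, 1))                                # linear expert slopes
b = np.zeros(M)                                     # linear expert offsets

for epoch in range(50):
    for n in rng.permutation(len(X)):               # one stochastic pass per epoch
        x, t = X[n], y[n]
        yhat, f, h = predict(x, mu, var, prior, A, b)
        err = t - yhat
        # gradient-descent updates from the chain-rule gradients above
        A += eta * err * h[:, None] * x[None, :]    # d e / d a_m = -err * h_m * x
        b += eta * err * h                          # d e / d b_m = -err * h_m
        mu += eta * err * ((f - yhat) * h)[:, None] * (x - mu) / var[:, None]

print("training MSE:",
      np.mean([(y[n] - predict(X[n], mu, var, prior, A, b)[0]) ** 2
               for n in range(len(X))]))
```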

Simple Demo cwm1d, cwmprdemo, cwm2d, lms1d

Experiments A simple sine function. LMS-CWM has a better interpolation result.

Mackey-Glass Chaotic Time Series Prediction 1000 data points: the first 500 points are taken as the training set and the last 500 points as the test set. Single-step prediction. Input: [s(t), s(t-6), s(t-12), s(t-18)]. Output: s(t+85). Local linear model. Number of clusters: 30.
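
(A short sketch of how the input/output pairs described above could be built, assuming a hypothetical array `s` holding the 1000-sample series; the train/test split shown is approximate, since embedding shortens the usable range.)

```python
import numpy as np

def make_pairs(s, lags=(0, 6, 12, 18), horizon=85):
    """Build inputs [s(t), s(t-6), s(t-12), s(t-18)] and targets s(t+85)."""
    t0 = max(lags)                                   # first usable index
    t_end = len(s) - horizon                         # last usable index (exclusive)
    X = np.array([[s[t - l] for l in lags] for t in range(t0, t_end)])
    y = np.asarray(s)[t0 + horizon:t_end + horizon]
    return X, y

# s = ...                               # the 1000-point Mackey-Glass series
# X, y = make_pairs(s)
# X_train, y_train = X[:500], y[:500]   # roughly the first half for training
# X_test,  y_test  = X[500:], y[500:]   # remainder for testing
```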

Results (1): prediction plots for CWM and LMS-CWM.

Results (2): learning curves (MSE) for CWM and LMS-CWM on the training and test sets.

Local Minima The initial locations of four clusters, and the resulting centers' locations after each training session of CWM and LMS-CWM.

Summary An LMS learning method for CWM is presented. It may lose the benefits of data density estimation and data characterization, but it provides an alternative training option. Parameters can be trained by EM and LMS alternately, combining the advantages of both. LMS-CWM learning can be viewed as a refinement of CWM when prediction accuracy is the main concern.

Future work Regularization. Comparison between different models (from theoretical and performance points of view).

Q&A Thank You!