Dimensionality Reduction in Unsupervised Learning of Conditional Gaussian Networks. Authors: Peña, J.M., Lozano, J.A., Larrañaga, P., and Inza, I.


Dimensionality Reduction in Unsupervised Learning of Conditional Gaussian Networks. Authors: Peña, J.M., Lozano, J.A., Larrañaga, P., and Inza, I. In IEEE Trans. on PAMI, 23(6). Summarized by Kyu-Baek Hwang.

Abstract
- Feature selection for unsupervised learning of conditional Gaussian networks
  - What does unsupervised learning of Bayesian networks mean?
  - Which features are good for the learning task?
- Assessment of the relevance of each feature for the learning process
  - How is the cutting threshold determined?
- Accelerate learning while still obtaining reasonable models
  - Two artificial datasets
  - Two benchmark datasets from the UCI repository

Unsupervised Learning of Conditional Gaussian Networks
- Data clustering: learning a probabilistic graphical model from unlabeled data
- Cluster membership: modeled as a hidden variable
- Conditional Gaussian networks
  - The cluster variable is an ancestor of all the other variables.
  - The joint probability distribution over all the other variables, given the cluster membership, is multivariate Gaussian.
- Feature selection in classification vs. feature selection in clustering
  - Eventually all features are considered, in order to describe the domain.

Conditional Gaussian Distribution
- Data clustering: X = (Y, C) = (Y_1, ..., Y_n, C)
- Conditional Gaussian distribution
  - The pdf of Y given C = c is the multivariate normal f(y | c) = N(y; μ_c, Σ_c), whenever p(c) = p(C = c) > 0.
  - Each covariance matrix Σ_c is positive definite.
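A minimal Python sketch of this definition, assuming SciPy's multivariate normal; the parameter containers p_c, mu and sigma are illustrative names, not the paper's notation.

# Evaluate p(c) * f(y | c) for one value c of the hidden cluster variable,
# where Y | C = c is multivariate normal with positive definite Sigma_c.
import numpy as np
from scipy.stats import multivariate_normal

def conditional_gaussian_density(y, c, p_c, mu, sigma):
    """p(c) * f(y | c) for cluster value c.

    p_c   : dict mapping cluster value -> prior probability p(C = c)
    mu    : dict mapping cluster value -> mean vector mu_c
    sigma : dict mapping cluster value -> covariance matrix Sigma_c
    """
    if p_c[c] <= 0.0:          # f(y | c) is only defined when p(c) > 0
        return 0.0
    return p_c[c] * multivariate_normal.pdf(y, mean=mu[c], cov=sigma[c])

# Toy usage with two clusters over two predictive variables
p_c = {0: 0.5, 1: 0.5}
mu = {0: np.zeros(2), 1: np.array([4.0, 4.0])}
sigma = {0: np.eye(2), 1: 2.0 * np.eye(2)}
print(conditional_gaussian_density(np.array([0.1, -0.2]), 0, p_c, mu, sigma))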

Conditional Gaussian Networks
- Factorization of the conditional Gaussian distribution
  - The conditional independencies among the variables are encoded by the network structure s.
  - Each node carries a local probability distribution, so that f(y | c) factorizes into the product of the local conditionals f(y_i | pa(y_i), c).
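A small illustrative sketch of such a factorization, under the common assumption that each local distribution is a linear Gaussian in its continuous parents; the structure/parameter representation and all names here are hypothetical, not taken from the paper.

# log f(y | c) = sum_i log f(y_i | pa(y_i), c), each factor a linear Gaussian.
import numpy as np
from scipy.stats import norm

def cgn_log_density(y, c, structure, params):
    """structure : dict node -> list of parent nodes (indices into y)
    params    : dict (node, c) -> (m, b, v): mean m + b . y[parents],
                regression weights b, conditional variance v
    """
    logp = 0.0
    for i, parents in structure.items():
        m, b, v = params[(i, c)]
        mean_i = m + np.dot(b, y[parents]) if parents else m
        logp += norm.logpdf(y[i], loc=mean_i, scale=np.sqrt(v))
    return logp

# Toy TANB-like structure: node 1 has node 0 as its single predictive parent
structure = {0: [], 1: [0]}
params = {(0, 0): (0.0, np.array([]), 1.0), (1, 0): (1.0, np.array([0.5]), 0.5)}
print(cgn_log_density(np.array([0.2, 1.3]), 0, structure, params))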

An Example of CGNs
(figure omitted: an example network over the cluster variable C and the predictive attributes)

Learning CGNs from Data
- Incomplete dataset d
- Structural EM algorithm
(figure omitted: the observed and hidden parts of the dataset over N instances)

Structural EM Algorithm
- The structure search is guided by an expected score: the score of each candidate structure is averaged over completions of the hidden cluster variable under the current model.
- Relaxed version: the expected complete data are scored directly.
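As a rough illustration of how the expected score drives the iteration, here is a structural-EM-style skeleton. The three callables (e_step, search_structures, expected_score) are hypothetical placeholders, not the authors' implementation.

# Skeleton of a structural-EM-style loop. A real implementation would compute
# expected sufficient statistics over the hidden cluster variable (E-step) and
# search structures by the expected score (structural M-step).
def structural_em(data, initial_model, e_step, search_structures, expected_score,
                  max_iters=20, tol=1e-4):
    model, best = initial_model, float("-inf")
    for _ in range(max_iters):
        ess = e_step(model, data)                # expectation over the hidden C
        candidate = search_structures(ess)       # score-based structure search
        score = expected_score(candidate, ess)   # e.g. log marginal likelihood
        if score - best < tol:                   # relaxed convergence criterion
            break
        model, best = candidate, score
    return model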

Scoring Metrics for the Structural Search
- The log marginal likelihood of the expected complete data

Feature Selection
- Large databases
  - Many instances and many attributes, so dimensionality reduction is required.
- Select features based on some criterion.
  - The criterion depends on the purpose of learning: learning speed, accurate predictions, or the comprehensibility of the learned models.
- Exhaustive search over the 2^n feature subsets is infeasible, so non-exhaustive search is used:
  - Sequential selection (forward or backward), sketched below
  - Evolutionary, population-based, randomized search based on EDAs (estimation of distribution algorithms)
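For concreteness, a minimal sketch of sequential forward selection; score_fn is a stand-in for whichever selection criterion is chosen and is not tied to the paper.

# Greedy forward selection: repeatedly add the single feature that improves
# the criterion the most, stopping when no candidate improves it.
def forward_selection(all_features, score_fn, max_features=None):
    selected, best_score = [], float("-inf")
    remaining = list(all_features)
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:
            break                      # no candidate improves the criterion
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected

# Toy usage: the "score" simply prefers subsets of size 3
print(forward_selection(range(5), lambda s: -abs(len(s) - 3)))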

Wrapper and Filter Approaches
- Wrapper
  - Feature subsets are tailored to the performance function of the learning process, e.g. predictive accuracy on a test data set.
- Filter
  - Based on the intrinsic properties of the data set, e.g. the correlation between the class label and each attribute (supervised learning).
- Two problems in unsupervised learning
  - Absence of the class label: a different criterion for feature selection is needed.
  - No standard, accepted performance task: multiple predictive accuracy or class prediction is used instead.

Feature Selection in Learning CGNs
- Data analysis (clustering) aims at description, not prediction, so all features are eventually needed for the description.
- CGN learning with many features is time-consuming, which motivates a three-stage scheme (sketched below):
  - Preprocessing: feature selection
  - Learning the CGN on the selected features
  - Postprocessing: the remaining features are added back as conditionally independent given the cluster membership.
- The goal is to measure relevance so as to obtain
  - fast learning time, and
  - accuracy, measured as the log likelihood of the test data.
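A self-contained toy sketch of this three-stage scheme; the learner and the parent-dictionary structure representation are stand-ins, not the paper's code.

# Keep only features whose relevance exceeds the threshold, learn on the
# reduced set, then re-attach discarded features as children of the cluster
# variable C alone (conditionally independent of everything else given C).
def reduce_learn_reattach(relevance, threshold, learn_cgn):
    relevant = [f for f, r in relevance.items() if r >= threshold]
    discarded = [f for f in relevance if f not in relevant]
    structure = learn_cgn(relevant)      # stand-in for the structural EM learner
    for f in discarded:
        structure[f] = ["C"]             # postprocessing step
    return structure

# Toy usage with a trivial "learner" that returns a naive-Bayes-like structure
relevance = {"Y1": 9.2, "Y2": 7.5, "Y3": 0.3}
print(reduce_learn_reattach(relevance, 3.84, lambda feats: {f: ["C"] for f in feats}))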

Relevance
- Features that exhibit low correlation with the rest of the features can be considered irrelevant for the learning process: they are (approximately) conditionally independent of the others given the cluster membership.
- This is a first attempt at relevance-based feature selection in the continuous domain.

Relevance Measure
- The relevance measure is built from edge exclusion tests: the null hypothesis is that the edge between Y_i and Y_j can be excluded, i.e. that the corresponding element of the inverse variance matrix is zero.
- The test is based on r²_{ij|rest}, the squared sample partial correlation of Y_i and Y_j given the remaining variables, computed from the maximum likelihood estimates (MLEs) of the elements of the inverse variance matrix.
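Assuming the classical edge exclusion test for graphical Gaussian models (likelihood-ratio statistic -N·log(1 - r²_{ij|rest}), with partial correlations obtained from the inverted sample covariance matrix), the computation could be sketched as follows; this is a generic sketch, not the authors' exact relevance score.

# Pairwise edge exclusion statistics from the estimated precision matrix:
# r_ij|rest = -w_ij / sqrt(w_ii * w_jj), statistic = -N * log(1 - r^2).
import numpy as np

def edge_exclusion_statistics(Y):
    """Matrix of -N*log(1 - r_ij|rest^2) for data Y of shape (N, n)."""
    N, n = Y.shape
    W = np.linalg.inv(np.cov(Y, rowvar=False))   # inverse sample covariance (~ MLE precision)
    d = np.sqrt(np.diag(W))
    partial_corr = -W / np.outer(d, d)           # r_ij|rest for i != j
    np.fill_diagonal(partial_corr, 0.0)
    return -N * np.log(1.0 - partial_corr ** 2)

# Toy usage: three Gaussian features, two of them strongly correlated
rng = np.random.default_rng(0)
Y = rng.multivariate_normal([0, 0, 0], [[1, .8, 0], [.8, 1, 0], [0, 0, 1]], size=500)
print(np.round(edge_exclusion_statistics(Y), 2))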

Graphical Gaussian Models (1/2)

Graphical Gaussian Models (2/2)

Relevance Threshold
- The distribution of the test statistic is expressed in terms of G(x), the pdf of a χ²₁ random variable.
- A 5 percent test is used; solving the resulting equation for the threshold is treated as an optimization problem (sketched below).
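A hedged sketch of the thresholding step: with the plain asymptotic χ²₁ null distribution, the 5 percent threshold solves F(t) = 0.95 and comes out near 3.84; if the paper uses a refined expression for the null distribution, its cdf would simply replace the one passed in below.

# Find the 5 percent threshold of the edge exclusion statistic numerically.
from scipy.optimize import brentq
from scipy.stats import chi2

def threshold_5pct(cdf=lambda x: chi2.cdf(x, df=1)):
    # Solve cdf(t) = 0.95 for t on a bracket that contains the solution.
    return brentq(lambda t: cdf(t) - 0.95, 1e-6, 50.0)

print(threshold_5pct())   # ~3.84 for the plain chi-square(1) case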

Learning Scheme

Experimental Settings
- Model specifications
  - Tree augmented Naïve Bayes (TANB) models: each predictive attribute may have, at most, one other predictive attribute as a parent (in addition to the cluster variable C).
(figure omitted: an example TANB structure)

Data Sets
- Synthetic data sets (4000 training : 1000 test)
  - TANB model with 25 (15:14[-1, 1]) attributes, (0, 4, 8), 1; C: uniform, (0, 1)
  - TANB model with 30 (15:14[-1, 1]) attributes, (0, 4, 8), 2; C: uniform, (0, 5)
- Waveform (artificial data, 4000 training : 1000 test)
  - 3 clusters, 40 attributes; the last 19 are noise attributes
- Pima
  - 768 cases (700 training : 68 test), 8 attributes

Performance Criteria
- The log marginal likelihood of the training data
- Multiple predictive accuracy: a probabilistic variant of the standard multiple predictive accuracy
- Runtime
  - 10 independent runs for the synthetic data sets and the waveform data
  - 50 independent runs for the Pima data
  - Measured on a Pentium 366 machine
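As an illustration of a likelihood-based criterion (in the spirit of the "log likelihood for the test data" mentioned earlier, not the paper's exact definition of multiple predictive accuracy), here is a sketch of the held-out log likelihood of the cluster mixture; the parameter names are the same illustrative ones used above.

# Average log of the mixture density sum_c p(c) f(y | c) over test instances.
import numpy as np
from scipy.stats import multivariate_normal

def heldout_log_likelihood(Y_test, p_c, mu, sigma):
    dens = np.zeros(len(Y_test))
    for c in p_c:
        dens += p_c[c] * multivariate_normal.pdf(Y_test, mean=mu[c], cov=sigma[c])
    return np.mean(np.log(dens))

# Toy usage with the two-cluster parameters from the earlier sketch
p_c = {0: 0.5, 1: 0.5}
mu = {0: np.zeros(2), 1: np.array([4.0, 4.0])}
sigma = {0: np.eye(2), 1: 2.0 * np.eye(2)}
Y_test = np.array([[0.1, -0.2], [3.9, 4.1]])
print(heldout_log_likelihood(Y_test, p_c, mu, sigma))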

Relevance Ranking

Likelihood Plots for Synthetic Data

Likelihood Plots for Real Data

Runtime

Automatic Dimensionality Reduction

Conclusions and Future Work
- Relevance assessment for feature selection in unsupervised learning in the continuous domain
- Reasonable learning performance
- Future work: extension to the categorical domain, the redundant feature problem, relaxation of the model structure, and more realistic data sets