Graphical Models - Learning -

Presentation transcript:

Graphical Models - Learning - Structure Learning
Wolfram Burgard, Luc De Raedt, Kristian Kersting, Bernhard Nebel
Albert-Ludwigs University Freiburg, Germany
Advanced I WS 06/07

Based on: J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, U.C. Berkeley, April 1998; G. J. McLachlan, T. Krishnan, "The EM Algorithm and Extensions", John Wiley & Sons, Inc., 1997; D. Koller, course CS-228 handouts, Stanford University, 2001; N. Friedman & D. Koller's NIPS'99 tutorial.

[Figure: the ALARM monitoring Bayesian network, with nodes such as PCWP, CO, HRBP, HREKG, HRSAT, HR, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, VENTTUBE, MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, MINVOL, PVSAT, FIO2, PRESS, INSUFFANESTH, TPR, LVFAILURE, STROKEVOLUME, LVEDVOLUME, HYPOVOLEMIA, CVP, BP]

Learning With Bayesian Networks

Fixed structure (arcs known, e.g. A -> B): parameter estimation
- data observed fully: easiest problem, reduces to counting
- data observed partially: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks

Fixed variables, unknown arcs (A ? B): structure learning
- data observed fully: selection of arcs; new domain with no domain expert; data mining
- data observed partially: encompasses the difficult parameter-estimation subproblem; "only" Structural EM is known

Hidden variables (A, B plus a hidden H): scientific discovery

Why Struggle for Accurate Structure?

[Figure: the true network over Earthquake, Burglary, Alarm Set, Sound, next to a version with an arc missing and a version with an arc added]

Missing an arc:
- wrong assumptions about the domain structure
- cannot be compensated for by fitting parameters

Adding an arc:
- wrong assumptions about the domain structure
- increases the number of parameters to be estimated

Unknown Structure, (In)complete Data

Complete data, e.g. E, B, A = <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ...
- the network structure is not specified: the learner needs to select the arcs and estimate the parameters
- the data does not contain missing values

[Figure: the learning algorithm maps the data to a network E -> A <- B together with the CPT P(A | E, B)]

Incomplete data, e.g. E, B, A = <Y,?,N>, <Y,N,?>, <N,N,Y>, <N,Y,Y>, <?,Y,Y>, ...
- the network structure is not specified
- the data contains missing values: assignments to the missing values need to be considered

Score-based Learning

Define a scoring function that evaluates how well a structure matches the data, e.g. E, B, A = <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ...

[Figure: several candidate structures over E, B, A, each assigned a score]

Search for a structure that maximizes the score.

Structure Search as Optimization

Input:
- training data
- scoring function
- set of possible structures

Output: a network that maximizes the score

Heuristic Search

Define a search space:
- search states are possible structures
- operators make small changes to a structure

Traverse the space looking for high-scoring structures. Search techniques:
- greedy hill-climbing
- best-first search
- simulated annealing
- ...

Theorem: finding the maximal scoring structure with at most k parents per node is NP-hard for k > 1.
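A minimal greedy hill-climbing sketch in Python (illustrative only, not the lecture's code). The helpers score(structure, data) and neighbours(structure) are hypothetical: the first evaluates a candidate structure against the data, the second enumerates all structures reachable by one small change.

def hill_climb(initial_structure, data, score, neighbours):
    # Greedy hill-climbing: move to the best neighbour until nothing improves.
    current = initial_structure
    current_score = score(current, data)
    while True:
        # Evaluate every structure that is one local change away.
        candidates = [(score(n, data), n) for n in neighbours(current)]
        if not candidates:
            return current
        best_score, best = max(candidates, key=lambda pair: pair[0])
        if best_score <= current_score:  # no modification improves the score
            return current
        current, current_score = best, best_score

Wrapped with random restarts or a tabu list, the same loop turns into the escape strategies mentioned further below.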

Typically: Local Search

Start with a given network: the empty network, the best tree, or a random network.

At each iteration:
- evaluate all possible changes
- apply the change with the best score
- stop when no modification improves the score

[Figure: example network over S, C, E, D with candidate operations such as add C -> D, reverse C -> E, delete C -> E]
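To make the add / reverse / delete operators concrete, here is a small Python sketch (an illustration under the assumption that a structure is represented as a set of directed edges over a list of node names); every candidate is filtered through an acyclicity check so that only valid DAGs are kept. A small wrapper, e.g. lambda s: neighbours(s, nodes), lets it feed the hill_climb sketch above.

from itertools import permutations

def is_acyclic(edges, nodes):
    # Depth-first search for a cycle in the directed graph.
    children = {v: [w for (u, w) in edges if u == v] for v in nodes}
    visiting, done = set(), set()

    def has_cycle(v):
        visiting.add(v)
        for w in children[v]:
            if w in visiting or (w not in done and has_cycle(w)):
                return True
        visiting.discard(v)
        done.add(v)
        return False

    return not any(has_cycle(v) for v in nodes if v not in done)

def neighbours(edges, nodes):
    # All structures reachable by adding, deleting, or reversing a single arc.
    edges = set(edges)
    candidates = []
    for u, v in permutations(nodes, 2):
        if (u, v) not in edges and (v, u) not in edges:
            candidates.append(edges | {(u, v)})              # add u -> v
    for u, v in edges:
        candidates.append(edges - {(u, v)})                  # delete u -> v
        candidates.append((edges - {(u, v)}) | {(v, u)})     # reverse u -> v
    return [e for e in candidates if is_acyclic(e, nodes)]   # keep DAGs only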

Typically: Local Search

If the data is complete: to update the score after a local change, only the (counting-based) scores of the families that changed need to be recomputed.

If the data is incomplete: to update the score after a local change, the parameter estimation algorithm has to be re-run.
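With complete data the score decomposes over families, so this bookkeeping is cheap. A possible sketch (the helper family_score(child, parents, data) is hypothetical, e.g. the counting-based score shown with the BIC example further below):

def rescore_after_change(family_scores, parents, change, data, family_score):
    # family_scores: dict child -> cached score of the family (child | its parents)
    # parents:       dict child -> tuple of parents in the NEW structure
    # change:        ("add" | "delete" | "reverse", u, v) applied to the arc u -> v
    op, u, v = change
    affected = {v} if op in ("add", "delete") else {u, v}  # reversing touches both families
    updated = dict(family_scores)
    for child in affected:                                  # re-count only these families
        updated[child] = family_score(child, parents[child], data)
    return updated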

Local Search in Practice

Local search can get stuck in:
- local maxima: all one-edge changes reduce the score
- plateaux: some one-edge changes leave the score unchanged

Standard heuristics can escape both:
- random restarts
- tabu search
- simulated annealing

Local Search in Practice

Using the log-likelihood (LL) as the score, adding arcs always helps: the maximum score is attained by the fully connected network. Overfitting: a bad idea...

Minimum Description Length: learning as data compression; the score combines DL(Model) and DL(Data | Model).

Other scores: BIC (Bayesian Information Criterion), Bayesian score (BDe).
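As an illustration of a counting-based, penalized score, here is a possible BIC score for a single family in Python (a sketch, not the lecture's code; the data layout as a list of dicts and the arity dictionary are assumptions made for the example).

import math
from collections import Counter

def bic_family_score(data, child, parents, arity):
    # data:  list of dicts mapping variable name -> observed value (complete data)
    # arity: dict mapping variable name -> number of possible values
    joint = Counter(tuple(row[p] for p in parents) + (row[child],) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximum-likelihood log-likelihood of the family, computed from counts.
    loglik = sum(n * math.log(n / marg[key[:-1]]) for key, n in joint.items())
    # Number of free parameters in the CPT P(child | parents).
    n_params = (arity[child] - 1) * math.prod(arity[p] for p in parents)
    return loglik - 0.5 * n_params * math.log(len(data))  # complexity penalty

Summing this quantity over all families gives the BIC score of a structure; an MDL score has the same likelihood-plus-penalty shape, read as description lengths.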

Local Search in Practice

Perform EM for each candidate graph: for every structure G1, G2, G3, ..., Gn, parametric optimization (EM) finds a local maximum in parameter space.

This is computationally expensive:
- parameter optimization via EM is non-trivial
- EM needs to be performed for all candidate structures
- time is spent even on poor candidates
- in practice, only a few candidates are considered

Structural EM [Friedman et al. 98]

Recall: with complete data, the decomposition of the score makes search efficient.

Idea: instead of optimizing the real score, find a decomposable alternative score such that maximizing the new score yields an improvement in the real score.

Structural EM [Friedman et al. 98]

Idea: use the current model to help evaluate new structures.

Outline: perform search in (structure, parameters) space. At each iteration, use the current model to find either:
- better scoring parameters: a "parametric" EM step, or
- a better scoring structure: a "structural" EM step

Structural EM [Friedman et al. 98]

[Figure: the training data over X1, X2, X3, H, Y1, Y2, Y3 is completed with the current model to compute expected counts such as N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H), and, for an alternative structure, N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H); these counts are used to score and parameterize candidate structures, and the procedure is reiterated]

Structure Learning: Incomplete Data

EM algorithm (iterate until convergence):
- Data with missing values over S, X, D, C, B, e.g. <?,0,1,0,1>, <1,1,?,0,1>, <0,0,0,?,?>, <?,?,0,?,1>, ...
- Expectation: use the current model for inference, e.g. P(S | X=0, D=1, C=0, B=1), to fill in the expected counts (a completed data set).
- Maximization: re-estimate the parameters from the expected counts.
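A minimal, self-contained EM example in Python (purely illustrative, not the lecture's code): a two-node network X -> Y with binary variables, where X is missing (None) in some rows. The E-step computes the posterior P(X | Y) under the current parameters to obtain expected counts; the M-step re-estimates the parameters from those counts. The tiny data set is made up.

def em_x_to_y(rows, iterations=50):
    p_x = 0.5                      # current estimate of P(X = 1)
    p_y = {0: 0.5, 1: 0.5}         # current estimate of P(Y = 1 | X = x)
    for _ in range(iterations):
        # E-step: expected counts N[x] and N[x, y].
        n_x = {0: 0.0, 1: 0.0}
        n_xy = {(x, y): 0.0 for x in (0, 1) for y in (0, 1)}
        for x, y in rows:
            if x is None:
                # Posterior P(X = xv | Y = y) by Bayes' rule under the current model.
                joint = {xv: (p_x if xv else 1 - p_x) *
                             (p_y[xv] if y else 1 - p_y[xv]) for xv in (0, 1)}
                z = joint[0] + joint[1]
                weights = {xv: joint[xv] / z for xv in (0, 1)}
            else:
                weights = {x: 1.0, 1 - x: 0.0}
            for xv, w in weights.items():
                n_x[xv] += w
                n_xy[(xv, y)] += w
        # M-step: re-estimate the parameters from the expected counts.
        p_x = n_x[1] / (n_x[0] + n_x[1])
        p_y = {xv: n_xy[(xv, 1)] / n_x[xv] for xv in (0, 1)}
    return p_x, p_y

# Example: X observed in some rows, missing (None) in others.
data = [(1, 1), (1, 1), (0, 0), (None, 1), (None, 0), (0, 1)]
print(em_x_to_y(data))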

Structure Learning: Incomplete Data

SEM algorithm (iterate until convergence):
- Data with missing values over S, X, D, C, B, as above.
- Expectation: use the current model for inference, e.g. P(S | X=0, D=1, C=0, B=1), to fill in the expected counts.
- Maximization: re-estimate the parameters (parametric step) and search for a better structure (structural step) based on the expected counts.
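At the level of the slide above, the SEM loop can be sketched as follows (a sketch under assumptions: expected_counts, best_structure_given_counts, and parameters_from_counts are hypothetical stand-ins for inference over the current model, decomposable-score structure search, and counting-based estimation).

def structural_em(structure, params, data, expected_counts,
                  best_structure_given_counts, parameters_from_counts,
                  iterations=10):
    for _ in range(iterations):
        # E-step: complete the data using the CURRENT (structure, parameters).
        counts = expected_counts(structure, params, data)
        # Structural step: search for a better structure; the decomposable
        # score can be evaluated directly from the expected counts.
        structure = best_structure_given_counts(counts, structure)
        # Parametric step: re-estimate the parameters for the chosen structure.
        params = parameters_from_counts(structure, counts)
    return structure, params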

Structure Learning: Summary

- Expert knowledge + learning from data.
- Structure learning involves parameter estimation (e.g. EM).
- Optimization with score functions: likelihood + complexity penalty = MDL.
- Local traversal of the space of possible structures: add, reverse, or delete (single) arcs.
- Speed-up: Structural EM, which scores candidates w.r.t. the current best model.