© 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved.
Learning I: Excerpts from Tutorial at:

Bayesian Networks
Qualitative part: a directed acyclic graph (DAG) encoding statistical independence statements (causality!)
- Nodes: random variables of interest (with exhaustive and mutually exclusive states)
- Edges: direct (causal) influence
Quantitative part: local probability models, i.e. a set of conditional probability distributions.
[Figure: the Earthquake / Burglary / Alarm / Radio / Call network, with the table P(A | E, B).]
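In code, the factorization a Bayesian network encodes might look like the following sketch (not part of the original tutorial; the graph shape follows the alarm example, but all numbers are illustrative):

```python
# Hypothetical encoding of the alarm network: each node lists its
# parents, and each CPT maps a tuple of parent values to
# P(node = True | parents).  All probabilities below are made up.
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "Radio": ["Earthquake"], "Call": ["Alarm"],
}
cpt = {
    "Burglary": {(): 0.01},
    "Earthquake": {(): 0.02},
    "Alarm": {(True, True): 0.95, (True, False): 0.94,
              (False, True): 0.29, (False, False): 0.001},
    "Radio": {(True,): 0.9, (False,): 0.0},
    "Call": {(True,): 0.9, (False,): 0.05},
}

def joint_prob(assignment):
    """Chain-rule factorization: P(x1..xn) = prod_i P(xi | Pa_i)."""
    p = 1.0
    for var, pa in parents.items():
        p_true = cpt[var][tuple(assignment[q] for q in pa)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p
```

The quantitative part is exactly the `cpt` tables; the qualitative part is the `parents` graph, which determines which table rows each node needs.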

Learning Bayesian Networks (reminder)
An Inducer takes data plus prior information and produces a network: a structure and its conditional probability tables.
[Figure: Data + prior information -> Inducer -> network over E, R, B, A, C with CPTs such as P(A | E, B).]

The Learning Problem

Learning Problem
Given samples of (E, B, A), the Inducer fills in the unknown "?" entries of the conditional probability table P(A | E, B) for the network E, B -> A.
[Figure: a CPT with "?" entries is completed by the Inducer from the data.]

Learning Parameters for the Burglary Story
Given i.i.d. samples and the network factorization over E, B, A, C, the likelihood decomposes by family: we have 4 independent estimation problems, one per node given its parents.
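With complete data, each of those independent problems reduces to frequency counts. A minimal sketch for the family P(A | E, B) (the sample data below is made up for illustration):

```python
from collections import Counter

# Maximum-likelihood estimation of P(A=1 | E, B) from complete
# i.i.d. samples (E, B, A): N(e, b, a=1) / N(e, b).
samples = [
    (0, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1),
    (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 0, 0),
]

def mle_cpt(samples):
    """Count parent configurations and how often A=1 occurs in each."""
    n_eb = Counter((e, b) for e, b, _ in samples)
    n_eba = Counter((e, b) for e, b, a in samples if a == 1)
    return {eb: n_eba[eb] / n for eb, n in n_eb.items()}
```

Because the likelihood factorizes, the same counting can be run separately for each family without any interaction between them.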

Incomplete Data
Data is often incomplete: some variables of interest are not assigned a value.
This phenomenon arises when we have:
- Missing values
- Hidden variables

Missing Values
Examples:
- Survey data
- Medical records: not all patients undergo all possible tests

Missing Values (cont.)
Complicating issue: the fact that a value is missing might be indicative of its value.
- The patient did not undergo an X-ray since she complained about fever, not about broken bones...
To learn from incomplete data we need the following assumption:
- Missing at Random (MAR): the probability that the value of Xi is missing is independent of its actual value, given the other observed values.

Hidden (Latent) Variables
- We attempt to learn a model with variables we never observe; in this case, MAR always holds.
- Why should we care about unobserved variables? They can drastically reduce the number of parameters.
[Figure: the network X1, X2, X3 -> H -> Y1, Y2, Y3 has 17 parameters; the same distribution with H marginalized out needs 59.]
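The 17-versus-59 comparison can be checked mechanically. A sketch (the structures follow the slide's figure; all variables are assumed binary):

```python
def n_params(parents, card):
    """Independent parameters of a discrete Bayes net: each node
    contributes (card - 1) * prod(parent cardinalities)."""
    total = 0
    for v, pa in parents.items():
        rows = 1
        for p in pa:
            rows *= card[p]
        total += (card[v] - 1) * rows
    return total

# With hidden H mediating between X1..X3 and Y1..Y3:
with_h = {"X1": [], "X2": [], "X3": [],
          "H": ["X1", "X2", "X3"],
          "Y1": ["H"], "Y2": ["H"], "Y3": ["H"]}
# Marginalizing H out couples each Y to all the Xs and to the earlier Ys:
without_h = {"X1": [], "X2": [], "X3": [],
             "Y1": ["X1", "X2", "X3"],
             "Y2": ["X1", "X2", "X3", "Y1"],
             "Y3": ["X1", "X2", "X3", "Y1", "Y2"]}
card = {v: 2 for v in list(with_h) + list(without_h)}
```

Here `n_params(with_h, card)` gives 17 and `n_params(without_h, card)` gives 59, matching the slide.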

Learning Parameters from Incomplete Data (cont.)
- In the presence of incomplete data, the likelihood can have multiple global maxima.
- Example: in the network H -> Y we can rename the values of the hidden variable H; if H has two values, the likelihood has two global maxima.
- Similarly, local maxima are also replicated.
- With many hidden variables this becomes a serious problem.

Gradient Ascent
Main result: requires computing P(xi, Pai | o[m], Θ) for all i, m.
- Pros:
  - Flexible
  - Closely related to methods in neural network training
- Cons:
  - Need to project the gradient onto the space of legal parameters
  - To get reasonable convergence we need to combine it with "smart" optimization techniques
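As a toy stand-in for the analytic gradient above, the following sketch runs gradient ascent on an incomplete-data log-likelihood using finite differences (everything here is illustrative: the model is a two-component coin mixture with a fixed 0.5/0.5 hidden variable, and the data is made up). Note the clipping step, which is exactly the "project onto the space of legal parameters" issue from the Cons list:

```python
import math

# Hidden H ~ 0.5/0.5; observed Y | H=h ~ Bernoulli(theta[h]).
# Only Y is seen, so H is marginalized out of the likelihood.
ys = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

def loglik(theta):
    ll = 0.0
    for y in ys:
        p = sum(0.5 * (t if y else 1 - t) for t in theta)
        ll += math.log(p)
    return ll

def grad_ascent(theta, lr=0.05, steps=200, eps=1e-4):
    theta = list(theta)
    for _ in range(steps):
        # Finite-difference approximation of the gradient.
        g = []
        for i in range(len(theta)):
            bumped = list(theta)
            bumped[i] += 1e-6
            g.append((loglik(bumped) - loglik(theta)) / 1e-6)
        # Project onto the space of legal parameters: clip to (0, 1).
        theta = [min(1 - eps, max(eps, t + lr * gi))
                 for t, gi in zip(theta, g)]
    return theta
```

In this model only the sum theta[0] + theta[1] is identified (an instance of the multiple-maxima problem from the previous slide), and the ascent drives that sum toward twice the sample mean.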

Expectation Maximization (EM)
A general-purpose method for learning from incomplete data.
Intuition:
- If we had access to complete counts, we could estimate the parameters directly.
- However, missing values do not allow us to perform these counts.
- So we "complete" the counts using the current parameter assignment: for example, a sample with X = T and Y missing contributes P(Y = H | X = T, Θ) to the expected count N(X = T, Y = H).
[Figure: a data table over X, Y, Z with missing entries, the current model, and the resulting expected counts N(X, Y).]
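The count-completion idea can be sketched directly (a hypothetical two-variable example; the data and the current parameters below are made up):

```python
# Replace each missing value by its expected contribution under the
# current model P(Y | X, theta), as on the slide.
data = [("H", "H"), ("T", "?"), ("H", "H"), ("T", "T"), ("H", "?")]
p_y_h_given_x = {"H": 0.3, "T": 0.4}   # current P(Y=H | X), illustrative

def expected_counts(data, p):
    """Expected N(X, Y): an observed pair counts 1; a missing Y adds
    P(Y=H | X) to (X, H) and 1 - P(Y=H | X) to (X, T)."""
    n = {}
    def add(x, y, w):
        n[(x, y)] = n.get((x, y), 0.0) + w
    for x, y in data:
        if y == "?":
            add(x, "H", p[x])
            add(x, "T", 1 - p[x])
        else:
            add(x, y, 1.0)
    return n
```

The expected counts always sum to the number of samples, so they can be plugged into the complete-data estimators unchanged.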

EM (cont.)
- Start from an initial network (G, Θ0) and the training data.
- E-step: compute expected counts, e.g. N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H).
- M-step: reparameterize to obtain the updated network (G, Θ1).
- Reiterate.

EM (cont.)
Formal guarantees:
- L(Θ1 : D) >= L(Θ0 : D): each iteration improves the likelihood.
- If Θ1 = Θ0, then Θ0 is a stationary point of L(Θ : D); usually this means a local maximum.
Main cost:
- Computing the expected counts in the E-step requires a computation pass over each instance in the training set.
- These are exactly the same computations as for gradient ascent!

Example: EM in Clustering
Consider the clustering model: a hidden Cluster variable C with observed children X1, ..., Xn.
- E-step: compute P(C[m] | X1[m], ..., Xn[m], Θ). This corresponds to a "soft" assignment of instances to clusters, from which we compute the expected statistics.
- M-step: re-estimate P(Xi | C) and P(C).
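A compact sketch of this E-step/M-step loop for binary features (not the tutorial's code; initialization and iteration counts are arbitrary choices):

```python
import math
import random

def loglik(data, pc, px):
    """Observed-data log-likelihood of the naive Bayes mixture."""
    ll = 0.0
    for x in data:
        p = 0.0
        for c in range(len(pc)):
            q = pc[c]
            for i, xi in enumerate(x):
                q *= px[c][i] if xi else 1 - px[c][i]
            p += q
        ll += math.log(p)
    return ll

def em_cluster(data, n_clusters=2, iters=20, seed=0):
    rng = random.Random(seed)
    n_feat = len(data[0])
    pc = [1.0 / n_clusters] * n_clusters                 # P(C)
    px = [[rng.uniform(0.25, 0.75) for _ in range(n_feat)]
          for _ in range(n_clusters)]                    # P(Xi=1 | C)
    for _ in range(iters):
        # E-step: "soft" assignment P(C = c | x[m]) for every instance.
        resp = []
        for x in data:
            w = []
            for c in range(n_clusters):
                q = pc[c]
                for i, xi in enumerate(x):
                    q *= px[c][i] if xi else 1 - px[c][i]
                w.append(q)
            z = sum(w)
            resp.append([wi / z for wi in w])
        # M-step: re-estimate P(C) and P(Xi | C) from expected counts.
        for c in range(n_clusters):
            nc = sum(r[c] for r in resp)
            pc[c] = nc / len(data)
            px[c] = [sum(r[c] * x[i] for r, x in zip(resp, data)) / nc
                     for i in range(n_feat)]
    return pc, px
```

Running it on clearly separated data shows the formal guarantee from the previous slide: the observed-data log-likelihood never decreases across iterations.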

EM in Practice
Initial parameters:
- Random parameter settings
- "Best" guess from another source
Stopping criteria:
- Small change in the likelihood of the data
- Small change in parameter values
Avoiding bad local maxima:
- Multiple restarts
- Early "pruning" of unpromising ones
Speed-up:
- Various methods to accelerate convergence

Why Struggle for Accurate Structure?
Adding an arc:
- Increases the number of parameters to be fitted
- Implies wrong assumptions about causality and domain structure
Missing an arc:
- Cannot be compensated for by accurate fitting of parameters
- Also misses causality and domain structure
[Figure: three variants of the Earthquake / Burglary / Alarm Set / Sound network: the true structure, one with an added arc, and one with a missing arc.]

Minimum Description Length (cont.)
Computing the description length of the data, we get three terms: the bits to encode the graph G, the bits to encode its parameters Θ_G, and the bits to encode D using (G, Θ_G). Minimizing this total description length is equivalent to maximizing the corresponding score.
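One common instantiation of this trade-off is a BIC-style score (the slide's exact encoding may differ; this is a sketch, with samples given as tuples ordered like the keys of `parents`, and smaller scores meaning better):

```python
import math
from collections import Counter

def mdl_score(samples, parents, card):
    """(log N / 2) * #params  minus  the log-likelihood of the data
    under the MLE parameters: model bits plus data bits."""
    n = len(samples)
    names = list(parents)
    dim, ll = 0, 0.0
    for v, pa in parents.items():
        vi = names.index(v)
        pis = [names.index(p) for p in pa]
        rows = 1
        for p in pa:
            rows *= card[p]
        dim += (card[v] - 1) * rows
        # MLE log-likelihood contribution of this family via counts.
        joint = Counter((tuple(s[i] for i in pis), s[vi]) for s in samples)
        marg = Counter(tuple(s[i] for i in pis) for s in samples)
        for (pa_val, x), c in joint.items():
            ll += c * math.log(c / marg[pa_val])
    return (math.log(n) / 2) * dim - ll
```

On data where X and Y are empirically independent, the extra edge buys no likelihood but costs description length, so the empty graph scores better; this is the sense in which MDL penalizes needless structure.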

Heuristic Search
We address the problem by using heuristic search.
- Define a search space: nodes are possible structures; edges denote adjacency of structures.
- Traverse this space looking for high-scoring structures.
Search techniques:
- Greedy hill-climbing
- Best-first search
- Simulated annealing
- ...

Heuristic Search (cont.)
Typical operations on a structure over S, C, E, D:
- Add C -> D
- Reverse C -> E
- Remove C -> E
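Generating that neighborhood, with an acyclicity filter so every neighbor is still a DAG, might look like this sketch (edges are sets of (parent, child) tuples; the topological-sort acyclicity check is one standard choice):

```python
def is_acyclic(nodes, edges):
    """Topological-sort check: repeatedly remove parent-satisfied nodes."""
    pa = {v: {u for u, w in edges if w == v} for v in nodes}
    done = set()
    while len(done) < len(nodes):
        free = [v for v in nodes if v not in done and pa[v] <= done]
        if not free:
            return False          # some nodes are stuck on a cycle
        done.update(free)
    return True

def neighbors(nodes, edges):
    """All structures one edge-addition, -removal, or -reversal away."""
    out = []
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            if (u, v) in edges:
                out.append(edges - {(u, v)})                # remove u -> v
                out.append((edges - {(u, v)}) | {(v, u)})   # reverse u -> v
            elif (v, u) not in edges:
                out.append(edges | {(u, v)})                # add u -> v
    return [e for e in out if is_acyclic(nodes, e)]
```

For a single edge A -> B over three nodes this yields six neighbors: one removal, one reversal, and four additions.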

Exploiting Decomposability in Local Search
Caching: to update the score after a local change, we only need to re-score the families that were changed in the last move.

Greedy Hill-Climbing
The simplest heuristic local search:
- Start with a given network: the empty network, the best tree, or a random network.
- At each iteration:
  - Evaluate all possible changes
  - Apply the change that leads to the best improvement in score
  - Reiterate
- Stop when no modification improves the score.
Each step requires evaluating approximately n new changes.
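The loop itself is generic in the state space and the score; a minimal sketch (the integer toy problem at the end is only to exercise it, not part of the tutorial):

```python
def greedy_hill_climb(state, moves, score):
    """moves(state) yields neighbor states; score is higher-is-better.
    Apply the best improving move until none improves the score."""
    current, best = state, score(state)
    while True:
        scored = [(score(s), s) for s in moves(current)]
        if not scored:
            return current
        top, cand = max(scored, key=lambda t: t[0])
        if top <= best:
            return current          # local maximum or plateau
        current, best = cand, top

# Toy usage: maximize -(x - 7)^2 over the integers with +/-1 moves.
result = greedy_hill_climb(
    0,
    lambda x: [x - 1, x + 1],
    lambda x: -(x - 7) ** 2,
)
```

For structure learning, `moves` would be the add/remove/reverse neighborhood and `score` a decomposable score such as MDL, with caching as described above.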

Greedy Hill-Climbing (cont.)
Greedy hill-climbing can get stuck in:
- Local maxima: all one-edge changes reduce the score.
- Plateaus: some one-edge changes leave the score unchanged.
Both occur in the search space.

Greedy Hill-Climbing (cont.)
To avoid these problems, we can use:
- TABU search:
  - Keep a list of the K most recently visited structures.
  - Apply the best move that does not lead to a structure in the list.
  - This escapes plateaus and local maxima whose "basin" is smaller than K structures.
- Random restarts:
  - Once stuck, apply some fixed number of random edge changes and restart the search.
  - This can escape from the basin of one maximum to another.

Other Local Search Heuristics
- Stochastic first-ascent hill-climbing:
  - Evaluate possible changes at random.
  - Apply the first one that leads "uphill".
  - Stop after a fixed number of "unsuccessful" attempts to change the current candidate.
- Simulated annealing:
  - Similar idea, but also apply "downhill" changes with a probability that is proportional to the change in score.
  - Use a temperature to control the amount of random downhill steps.
  - Slowly "cool" the temperature to reach a regime of strictly uphill moves.
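The simulated-annealing acceptance rule can be sketched in a few lines (the toy landscape, the exponential acceptance form, and the geometric cooling schedule are illustrative choices, not prescribed by the slide):

```python
import math
import random

def anneal(state, propose, score, t0=2.0, cool=0.95, steps=500, seed=0):
    """Accept every uphill move; accept a downhill move with probability
    exp(delta / t), which shrinks with the score drop and the temperature."""
    rng = random.Random(seed)
    t, current, s = t0, state, score(state)
    for _ in range(steps):
        cand = propose(current, rng)
        delta = score(cand) - s
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, s = cand, s + delta
        t *= cool                  # slowly "cool" the temperature
    return current

# Toy usage: maximize -|x - 5| over the integers with random +/-1 moves.
best = anneal(0, lambda x, r: x + r.choice([-1, 1]),
              lambda x: -abs(x - 5))
```

Early on (large t) the walk explores, including downhill steps; as t decays, the rule approaches strict hill-climbing and the state settles near the maximum.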

Examples
- Predicting heart disease: features are cholesterol, chest pain, angina, age, etc.; class is {present, absent}.
- Finding lemons in cars: features are make, brand, miles per gallon, acceleration, etc.; class is {normal, lemon}.
- Digit recognition: features are a matrix of pixel descriptors; class is {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}.
- Speech recognition: features are signal characteristics and a language model; class is {pause/hesitation, retraction}.

Some Applications
- Biostatistics: Medical Research Council (BUGS)
- Data analysis: NASA (AutoClass)
- Collaborative filtering: Microsoft (MSBN)
- Fraud detection: AT&T
- Classification: SRI (TAN-BLT)
- Speech recognition: UC Berkeley