1 Naïve Bayes Models for Probability Estimation
Daniel Lowd, University of Washington
(Joint work with Pedro Domingos)

2 One-Slide Summary
Using an ordinary naïve Bayes model:
1. One can do general purpose probability estimation and inference…
2. With excellent accuracy…
3. In linear time.
In contrast, Bayesian network inference is worst-case exponential time.

3 Outline
Background
–General probability estimation
–Naïve Bayes and Bayesian networks
Naïve Bayes Estimation (NBE)
Experiments
–Methodology
–Results
Conclusion

4 Outline
Background
–General probability estimation
–Naïve Bayes and Bayesian networks
Naïve Bayes Estimation (NBE)
Experiments
–Methodology
–Results
Conclusion

5 General Purpose Probability Estimation
Want to efficiently:
–Learn a joint probability distribution from data
–Infer marginal and conditional distributions from the learned model
Many applications
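In symbols (notation mine, not from the slides): letting Q denote the query variables, E = e the observed evidence, and H all remaining variables, the two inference tasks are

P(Q) = \sum_{h} P(Q, H = h),
\qquad
P(Q \mid E = e) = \frac{P(Q, E = e)}{P(E = e)}

so conditional queries reduce to ratios of marginals.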

6 State of the Art
Learn a Bayesian network from data
–Structure learning, parameter estimation
Answer conditional queries
–Exact inference: #P-complete
–Gibbs sampling: slow
–Belief propagation: may not converge; approximation may be bad

7 Naïve Bayes
Bayesian network with structure that allows linear time exact inference
All variables independent given C
–In our application, C is hidden
Classification
–C represents the instance’s class
Clustering
–C represents the instance’s cluster
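The factorization behind these bullets, written out in standard notation (not shown on the slide):

P(C, X_1, \ldots, X_n) = P(C) \prod_{i=1}^{n} P(X_i \mid C),
\qquad
P(X_1, \ldots, X_n) = \sum_{c} P(C = c) \prod_{i=1}^{n} P(X_i \mid C = c)

With C hidden, the second form is a mixture model, which is what the clustering view on the next slide uses.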

8 Naïve Bayes Clustering
Model can be learned from data using expectation maximization (EM)
[Figure: naïve Bayes network with hidden cluster variable C as the parent of the observed movie variables Shrek, E.T., Ray, Gigi, …]

9 Inference Example
[Figure: the same naïve Bayes network over C, Shrek, E.T., Ray, …, Gigi]
Want to determine a conditional probability, which is equivalent to a ratio of joint marginals.
The problem therefore reduces to computing marginal probabilities such as Pr(Shrek, E.T.).

10 How to Find Pr(Shrek,ET) 1. Sum out C and all other movies, Ray to Gigi.

11 How to Find Pr(Shrek,ET) 2. Apply naïve Bayes assumption.

12 How to Find Pr(Shrek,ET) 3. Push probabilities in front of summation.

13 How to Find Pr(Shrek,ET) 4. Simplify: any variable not in the query (Ray, …, Gigi) can be ignored!
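Putting the four steps together, here is a reconstruction of the equations omitted from the transcript (S and E abbreviate Shrek and E.T.; r through g range over the remaining movies Ray, …, Gigi):

P(S, E) = \sum_{c} \sum_{r, \ldots, g} P(c, S, E, r, \ldots, g)
        = \sum_{c} \sum_{r, \ldots, g} P(c)\, P(S \mid c)\, P(E \mid c)\, P(r \mid c) \cdots P(g \mid c)
        = \sum_{c} P(c)\, P(S \mid c)\, P(E \mid c) \Big( \sum_{r} P(r \mid c) \Big) \cdots \Big( \sum_{g} P(g \mid c) \Big)
        = \sum_{c} P(c)\, P(S \mid c)\, P(E \mid c)

Each inner sum equals 1, so the cost is linear in the number of clusters and query variables.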

14 Outline
Background
–General probability estimation
–Naïve Bayes and Bayesian networks
Naïve Bayes Estimation (NBE)
Experiments
–Methodology
–Results
Conclusion

15 Naïve Bayes Estimation (NBE)
If the cluster variable C were observed, learning the parameters would be easy.
Since it is hidden, we iterate two steps:
–Use the current model to “fill in” C for each example
–Use the filled-in values to adjust the model parameters
This is the Expectation Maximization (EM) algorithm (Dempster et al., 1977).

16 Naïve Bayes Estimation (NBE)
repeat
    Add k clusters, initialized with training examples
    repeat
        E-step: Assign examples to clusters
        M-step: Re-estimate model parameters
        Every 5 iterations, prune low-weight clusters
    until convergence (according to validation set)
    k = 2k
until convergence (according to validation set)
Execute E-step and M-step twice more, including validation set
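A minimal Python sketch of the inner EM loop for a naïve Bayes mixture over binary variables (illustrative only: the function and variable names are mine, and the cluster-addition, pruning, and k-doubling schedule above is omitted):

import numpy as np

def em_naive_bayes_mixture(X, k, n_iters=100, alpha=1.0, seed=0):
    """EM for a mixture of naive Bayes models over binary variables.

    X: (n_examples, n_vars) array of 0/1 values; k: number of clusters.
    Returns mixture weights pi (k,) and conditionals theta (k, n_vars),
    where theta[c, i] = P(X_i = 1 | C = c).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                  # P(C = c)
    theta = rng.uniform(0.25, 0.75, (k, d))   # P(X_i = 1 | C = c)

    for _ in range(n_iters):
        # E-step: responsibilities r[j, c] = P(C = c | x_j) under current parameters
        log_p = (np.log(pi)[None, :]
                 + X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T)
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from soft counts, with smoothing alpha
        nc = r.sum(axis=0)                          # expected cluster sizes
        pi = (nc + alpha) / (n + k * alpha)
        theta = (r.T @ X + alpha) / (nc[:, None] + 2 * alpha)

    return pi, theta

def log_marginal(pi, theta, query):
    """log P(query), where query maps variable index -> 0/1.
    Non-query variables are summed out, which is free under naive Bayes."""
    log_p = np.log(pi).copy()
    for i, v in query.items():
        log_p += np.log(theta[:, i] if v else 1 - theta[:, i])
    m = log_p.max()
    return m + np.log(np.exp(log_p - m).sum())

For example, pi, theta = em_naive_bayes_mixture(X, k=20) followed by log_marginal(pi, theta, {0: 1, 3: 0}) returns log Pr(X_0 = 1, X_3 = 0), summing out all other variables for free, exactly as in the derivation above.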

17 Speed and Power
Running time: O(#EM iters x #clusters x #examples x #vars)
Representational power:
–In the limit, NBE can represent any probability distribution
–From finite data, NBE never learns more clusters than training examples

18 Related Work
AutoClass – naïve Bayes clustering (Cheeseman et al., 1988)
Naïve Bayes clustering applied to collaborative filtering (Breese et al., 1998)
Mixture of Trees – efficient alternative to Bayesian networks (Meila and Jordan, 2000)

19 Outline
Background
–General probability estimation
–Naïve Bayes and Bayesian networks
Naïve Bayes Estimation (NBE)
Experiments
–Methodology
–Results
Conclusion

20 Experiments
Compare NBE to Bayesian networks (WinMine Toolkit by Max Chickering)
50 widely varied datasets
–47 from UCI repository
–5 to 1,648 variables
–57 to 67,507 examples
Metrics
–Learning time
–Accuracy (log likelihood)
–Speed/accuracy of marginal/conditional queries

21 Learning Time
[Scatter plot comparing learning times, with regions labeled “NBE slower” and “NBE faster”]

22 Overall Accuracy
[Scatter plot of log likelihood, NBE vs. WinMine, with regions labeled “NBE worse” and “NBE better”]

23 Query Scenarios
[Table of marginal and conditional query scenarios]
* See paper for multiple-variable conditional results

24 Inference Details
NBE: Exact inference
Bayesian networks:
–Gibbs sampling, 3 configurations:
    1 chain, 1,000 sampling iterations
    10 chains, 1,000 sampling iterations per chain
    10 chains, 10,000 sampling iterations per chain
–Belief propagation, when possible

25 Marginal Query Accuracy
[Table: number of datasets (out of 50) on which NBE wins, by number of query variables, vs. Gibbs sampling with 1 chain/1k samples, 10 chains/1k samples, and 10 chains/10k samples]

26 Detailed Accuracy Comparison
[Scatter plot with regions labeled “NBE worse” and “NBE better”]

27 Conditional Query Accuracy
[Table: number of datasets (out of 50) on which NBE wins, by number of hidden variables, vs. Gibbs sampling (1 chain/1k samples, 10 chains/1k samples, 10 chains/10k samples) and belief propagation]

28 Detailed Accuracy Comparison
[Scatter plot with regions labeled “NBE worse” and “NBE better”]

29 Marginal Query Speed
[Bar chart of NBE speedups over Gibbs sampling, ranging from thousands to millions of times faster]

30 Conditional Query Speed
[Bar chart of NBE speedups over Gibbs sampling, several orders of magnitude]

31 Summary of Results
Marginal queries
–NBE at least as accurate as Gibbs sampling
–NBE thousands, even millions of times faster
Conditional queries
–Easy for Gibbs: few hidden variables
–NBE almost as accurate as Gibbs
–NBE still several orders of magnitude faster
–Belief propagation often failed or ran slowly

32 Conclusion
Compared to Bayesian networks, NBE offers:
–Similar learning time
–Similar accuracy
–Exponentially faster inference
Try it yourself:
–Download an open-source reference implementation from: