CSCI 5822 Probabilistic Models of Human and Machine Learning

CSCI 5822 Probabilistic Models of Human and Machine Learning
Mike Mozer
Department of Computer Science and Institute of Cognitive Science
University of Colorado at Boulder

Learning In Bayesian Networks

General Learning Problem
Set of random variables X = {X1, X2, X3, X4, …}
Training set D = {X(1), X(2), …, X(N)}
Each observation specifies values of a subset of the variables, e.g.:
  X(1) = {x1, x2, ?, x4, …}
  X(2) = {x1, x2, x3, x4, …}
  X(3) = {?, x2, x3, x4, …}
Goal: estimate conditional joint distributions, e.g., P(X1, X3 | X2, X4)
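As a minimal illustration (not from the slides; the variable names and values are made up), a partially observed training set of this kind might be represented with None marking unobserved values:

```python
# A toy training set D = {X(1), ..., X(N)} over four variables.
# None marks a value that was not observed in that example.
D = [
    {"X1": 1, "X2": 0, "X3": None, "X4": 1},   # X3 unobserved
    {"X1": 0, "X2": 1, "X3": 1,    "X4": 0},   # fully observed
    {"X1": None, "X2": 1, "X3": 0, "X4": 1},   # X1 unobserved
]

# The fully observed examples are the easy case treated on the next slides.
complete_cases = [x for x in D if None not in x.values()]
print(len(complete_cases))  # 1
```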

Classes Of Graphical Model Learning Problems
1. Network structure known; all variables observed
2. Network structure known; some missing data (or latent variables)
3. Network structure not known; all variables observed
4. Network structure not known; some missing data (or latent variables)
Cases 1 and 2: next few classes. Cases 3 and 4: going to skip (not too relevant for papers we'll read; see optional readings for more info).

If Network Structure Is Known, The Problem Involves Learning Distributions
The distributions are characterized by parameters 𝝝.
Goal: infer the 𝝝 that best explains the data D.
Approaches:
  Bayesian methods
  Maximum a posteriori
  Maximum likelihood
  Max. pseudo-likelihood
  Moment matching (1st moment, but also higher moments)

Learning CPDs When All Variables Are Observed And Network Structure Is Known
Maximum likelihood estimation is trivial.
Goal: infer the entries in the CPTs from the data: an INFERENCE PROBLEM.
[Slide figure: a three-node network with X and Y as parents of Z, tables for P(X), P(Y), and P(Z | X, Y) whose entries are unknown ("?"), and a small table of fully observed training data over X, Y, Z.]
Trivial if all you want to do is get the ML estimate from the data, but why not do the Bayesian thing?
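To make the "trivial" maximum likelihood step concrete, here is a minimal counting sketch, assuming a fully observed binary data set over the slide's network X → Z ← Y; the data values and function names are illustrative, not from the slides:

```python
from collections import Counter

# Fully observed binary training data for the network X -> Z <- Y.
data = [
    {"X": 1, "Y": 0, "Z": 1},
    {"X": 1, "Y": 1, "Z": 1},
    {"X": 0, "Y": 1, "Z": 0},
    {"X": 0, "Y": 0, "Z": 0},
    {"X": 1, "Y": 0, "Z": 0},
]

# ML estimate of a root node's distribution: a relative frequency.
def p_root(var):
    counts = Counter(d[var] for d in data)
    n = len(data)
    return {value: c / n for value, c in counts.items()}

# ML estimate of P(Z=1 | X=x, Y=y): relative frequency within each parent configuration.
def p_z_given_xy(x, y):
    rows = [d for d in data if d["X"] == x and d["Y"] == y]
    if not rows:
        return None  # unseen parent configuration: ML estimate is undefined
    return sum(d["Z"] for d in rows) / len(rows)

print(p_root("X"))         # {1: 0.6, 0: 0.4}
print(p_z_given_xy(1, 0))  # 0.5
```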

Recasting Learning As Bayesian Inference
We've already used Bayesian inference in probabilistic models to compute posteriors on latent (a.k.a. hidden, unobservable) variables from data.
  E.g., Weiss model: direction of motion
  E.g., Gaussian mixture model: to which cluster does each data point belong?
Why not treat unknown parameters in the same way?
  E.g., Gaussian mixture parameters
  E.g., entries in conditional probability tables

Recasting Learning As Inference
Suppose you have a coin with an unknown bias, θ ≡ P(head). You flip the coin multiple times and observe the outcomes. From the observations you can infer the bias of the coin: this is learning, and it is also inference.
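As a concrete sketch of this inference (assuming a conjugate Beta prior, which the later slides adopt; the flip data and hyperparameters here are made up):

```python
from scipy.stats import beta

# Observed flips: 1 = head, 0 = tail.
flips = [1, 1, 0, 1, 0, 1, 1]
n_heads = sum(flips)
n_tails = len(flips) - n_heads

# Beta(a, b) prior on theta = P(head); a = b = 1 is a uniform prior.
a, b = 1.0, 1.0

# Conjugacy: the posterior over theta is again a Beta distribution.
posterior = beta(a + n_heads, b + n_tails)

print("posterior mean of theta:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```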

Treating Conditional Probabilities As Latent Variables
Graphical model probabilities (priors, conditional distributions) can also be cast as random variables, e.g., in a Gaussian mixture model.
Remove the knowledge "built into" the links (conditional distributions) and into the nodes (prior distributions), and create new random variables to represent that knowledge: Hierarchical Bayesian Inference.
[Slide figure: the mixture model drawn several ways, with nodes z, x, λ, and 𝛼 showing the parameters promoted to random variables.]

Slides stolen from David Heckerman tutorial.
I'm going to flip a coin X, with unknown bias. Depending on whether it comes up heads or tails, I will flip one of two other coins, each with an unknown bias; call the selected one Y.
X: outcome of the first flip
Y: outcome of the second flip

The two cases are different training examples (observed values of X and Y): X1 = head, X2 = tail.
[Slide figure: the network replicated for training example 1 and training example 2.]

Parameters might not be independent, but the scenario I painted makes them independent.
[Slide figure: the parameter nodes over training example 1 and training example 2.]

General Approach: Bayesian Learning of Probabilities in a Bayes Net
If the network structure S^h is known and there is no missing data…
We can express the joint distribution over variables X in terms of the model parameter vector θ_s (the product below runs over nodes, indexed by i).
Given a random sample D = {X(1), X(2), …, X(N)}, compute the posterior distribution p(θ_s | D, S^h).
This is a probabilistic formulation of all supervised and unsupervised learning problems.
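The two expressions the slide refers to, written out in the standard notation of the Heckerman tutorial (pa_i denotes the parents of node X_i):

```latex
% Joint distribution factored over nodes (i indexes nodes)
p(\mathbf{x} \mid \boldsymbol{\theta}_s, S^h) \;=\; \prod_i p(x_i \mid \mathrm{pa}_i, \boldsymbol{\theta}_i, S^h)

% Bayes' rule for the parameter posterior
p(\boldsymbol{\theta}_s \mid D, S^h) \;=\;
  \frac{p(D \mid \boldsymbol{\theta}_s, S^h)\, p(\boldsymbol{\theta}_s \mid S^h)}{p(D \mid S^h)}
```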

Computing Parameter Posteriors E.g., net structure X→Y

Computing Parameter Posteriors
Given complete data (all X and Y observed) and no direct dependencies among parameters, the parameter posterior factorizes across parameters: parameter independence (see the factorization below; i indexes nodes, j indexes configurations of a node's parents).
Explanation: given complete data, each set of parameters is d-separated from every other set of parameters in the graph. For example, θ_X and θ_Y|X are d-separated because X is known. (Why are θ_Y|X and θ_Y|¬X d-separated?)
[Slide figure: the network θ_X → X → Y ← θ_Y|X, illustrating the d-separation argument.]
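The factorization referred to as parameter independence, in the same notation:

```latex
p(\boldsymbol{\theta}_s \mid D, S^h) \;=\; \prod_i \prod_j p(\boldsymbol{\theta}_{ij} \mid D, S^h)
```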

Posterior Predictive Distribution
Given the parameter posteriors, what is the prediction for the next observation X(N+1)? (See the formulas below: the first factor in the integrand, P(X | θ_s), is what we talked about the past three classes; the second, p(θ_s | D), is what we just discussed.)
How can this be used for unsupervised and supervised learning?
What does the formula look like if we're doing approximate inference by sampling θ_s?
Note: P(X | θ, D) = P(X | θ); given the parameters, the data provide no further information about the next observation.
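Written out, with the Monte Carlo version answering the sampling question (M samples θ_s^(m) drawn from the parameter posterior):

```latex
% Posterior predictive: average the model's prediction over the parameter posterior
P(X^{(N+1)} \mid D, S^h) \;=\; \int P(X^{(N+1)} \mid \boldsymbol{\theta}_s, S^h)\,
                                     p(\boldsymbol{\theta}_s \mid D, S^h)\, d\boldsymbol{\theta}_s

% Monte Carlo approximation with samples \theta_s^{(m)} \sim p(\theta_s \mid D, S^h)
P(X^{(N+1)} \mid D, S^h) \;\approx\; \frac{1}{M} \sum_{m=1}^{M}
    P\!\left(X^{(N+1)} \mid \boldsymbol{\theta}_s^{(m)}, S^h\right)
```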

Prediction Directly From Data
In many cases, a prediction can be made without explicitly computing posteriors over the parameters. E.g., in the coin toss example from an earlier class, the posterior distribution over θ and the prediction of the next coin outcome both have closed forms (reproduced below).
Note that we've integrated out θ altogether, so the parameters are never explicitly represented.
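The elided formulas are the standard Beta–Bernoulli results, assuming a Beta(α_h, α_t) prior and N_h heads and N_t tails observed:

```latex
% Posterior over the coin's bias
p(\theta \mid D) \;=\; \mathrm{Beta}(\theta \mid \alpha_h + N_h,\; \alpha_t + N_t)

% Prediction of the next flip, with theta integrated out
P(X^{(N+1)} = \text{head} \mid D) \;=\; \frac{\alpha_h + N_h}{\alpha_h + \alpha_t + N_h + N_t}
```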

Terminology Digression
Bernoulli : Binomial :: Categorical : Multinomial
  Bernoulli: a single coin flip
  Binomial: N flips of a coin (e.g., an exam score)
  Categorical: a single roll of a die
  Multinomial: N rolls of a die (e.g., the number of students in each major)

Generalizing To Categorical RVs In Bayes Nets
Notation change: variable Xi is discrete, with values xi^1, ..., xi^ri.
  i: index of the categorical RV
  j: index over configurations of the parents of node i
  k: index over values of node i
Unrestricted distribution: one parameter per probability (see below).
[Slide figure: node Xi with parents Xa and Xb.]
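In this notation the "one parameter per probability" parameterization reads:

```latex
P(X_i = x_i^k \mid \mathrm{pa}_i = \mathrm{pa}_i^j,\; \boldsymbol{\theta}_i, S^h) \;=\; \theta_{ijk},
\qquad \sum_{k=1}^{r_i} \theta_{ijk} = 1
```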

Prediction Directly From Data: Categorical Random Variables
The prior distribution, the posterior distribution, and the posterior predictive distribution all have closed forms (reproduced below). This is a preview of topic modeling.
  i: index over nodes
  j: index over configurations of the parents of node i
  k: index over values of node i
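The standard Dirichlet–categorical forms of these three distributions, with α_ijk the prior pseudo-counts, N_ijk the observed counts, α_ij = Σ_k α_ijk, and N_ij = Σ_k N_ijk:

```latex
% Prior
p(\boldsymbol{\theta}_{ij}) \;=\; \mathrm{Dir}(\boldsymbol{\theta}_{ij} \mid \alpha_{ij1}, \ldots, \alpha_{ij r_i})

% Posterior
p(\boldsymbol{\theta}_{ij} \mid D) \;=\; \mathrm{Dir}(\boldsymbol{\theta}_{ij} \mid \alpha_{ij1} + N_{ij1}, \ldots, \alpha_{ij r_i} + N_{ij r_i})

% Posterior predictive
P(X_i = x_i^k \mid \mathrm{pa}_i^j, D) \;=\; \frac{\alpha_{ijk} + N_{ijk}}{\alpha_{ij} + N_{ij}}
```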

Other Easy Cases
Members of the exponential family (see Barber text 8.5)
Linear regression with Gaussian noise (see Barber text 18.1)
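As a sketch of the second case (not covered on the slide itself): linear regression with Gaussian noise and a zero-mean isotropic Gaussian prior on the weights has a closed-form Gaussian posterior. A minimal numpy version, with the noise variance sigma2 and prior variance tau2 treated as known and all values illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X w_true + Gaussian noise
N, d = 50, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25                       # known noise variance
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=N)

tau2 = 10.0                         # prior variance: w ~ N(0, tau2 * I)

# Conjugate Gaussian posterior over the weights:
#   covariance S = (X^T X / sigma2 + I / tau2)^{-1},  mean m = S X^T y / sigma2
S = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
m = S @ X.T @ y / sigma2

print("posterior mean of w:", m)
print("posterior std of w: ", np.sqrt(np.diag(S)))
```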