General Graphical Model Learning Schema


General Graphical Model Learning Schema (after Kimmig et al. 2014)
Initialize graph G := empty.
While not converged do:
  1. Generate candidate graphs.
  2. For each candidate graph C, learn parameters θ_C that maximize score(C, θ, dataset).
  3. G := argmax_C score(C, θ_C, dataset).
  4. Check the convergence criterion.
For relational learning, the score function is a relational score.
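
As a rough illustration, here is a minimal Python sketch of this generic search loop. The names generate_candidates, learn_parameters, and score are hypothetical placeholders for the model-specific components (e.g. a relational score), not part of the original schema.

```python
# Minimal sketch of the generic score-based structure learning loop above.
# generate_candidates, learn_parameters, and score are hypothetical
# placeholders for the model-specific components (e.g. a relational score).

def learn_structure(dataset, generate_candidates, learn_parameters, score,
                    max_iterations=100):
    graph = None                          # G := empty graph
    best_score = float("-inf")
    for _ in range(max_iterations):       # while not converged
        improved = False
        for candidate in generate_candidates(graph):
            theta = learn_parameters(candidate, dataset)      # maximize score over parameters
            candidate_score = score(candidate, theta, dataset)
            if candidate_score > best_score:
                graph, best_score, improved = candidate, candidate_score, True
        if not improved:                  # convergence: no candidate improves the score
            break
    return graph
```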

Tutorial on Learning Bayesian Networks for Complex Relational Data
Section 3: Parameter Learning

Overview: Upgrading Parameter Learning
- Extend learning concepts and algorithms designed for i.i.d. data to relational data. This is called upgrading i.i.d. learning (Van Laer and De Raedt).
- Score/objective function: the random selection likelihood.
- Algorithm: the fast Möbius transform.

Van Laer, W. & De Raedt, L. (2001), 'How to upgrade propositional learners to first-order logic: A case study', in Relational Data Mining, Springer.

Likelihood Function for IID Data

Score-based Learning for IID Data
- Most Bayesian network learning methods are based on a score function.
- The score function measures how well the network fits the observed data.
- Key component: the likelihood function, which measures how likely each datapoint is according to the Bayesian network; intuitively, how well the model explains each datapoint.
[Figure: a data table and a Bayesian network map to a log-likelihood, e.g. -3.5.]

The Bayes Net Likelihood Function for IID Data
- For each row, compute the log-likelihood of the attribute values in the row.
- Log-likelihood for the table = sum of the log-likelihoods of the rows.
- The relational likelihood introduced later generalizes this i.i.d. case, which has only one first-order variable; it uses random selection semantics and the instantiation principle.

IID Example
Toy data table:

Title     | Drama | Action | Horror
Fargo     | T     | T      | F
Kill_Bill | F     | T      | F

In this toy data table, Action is always true and Horror is always false.

Bayes net nodes Action(Movie), Drama(Movie), Horror(Movie), with conditional probabilities:
P(Action(M.) = T) = 1
P(Drama(M.) = T | Action(M.) = T) = 1/2
P(Horror(M.) = F | ...) = 1

Row likelihoods (P_B is the joint probability from the Bayes net, i.e. the product of the conditional probabilities from the CP tables):

Title     | P_B               | ln(P_B)
Fargo     | 1 x 1/2 x 1 = 1/2 | -0.69
Kill_Bill | 1 x 1/2 x 1 = 1/2 | -0.69

Total log-likelihood score for the table = -1.38.
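
To make the arithmetic concrete, here is a small Python sketch (an illustration, not from the tutorial) that recomputes the table log-likelihood from the CPT values above:

```python
import math

# Illustrative sketch: recompute the toy table log-likelihood from the CPT
# values on the slide. Both movies have Action = T and Horror = F, so each
# row's joint probability is 1 x 1/2 x 1 = 1/2.

rows = [
    {"Title": "Fargo",     "Drama": True,  "Action": True, "Horror": False},
    {"Title": "Kill_Bill", "Drama": False, "Action": True, "Horror": False},
]

def row_probability(row):
    p_action = 1.0 if row["Action"] else 0.0       # P(Action(M.) = T) = 1
    p_drama = 0.5                                  # P(Drama(M.) = T | Action = T) = 1/2, so either value has prob. 1/2
    p_horror = 1.0 if not row["Horror"] else 0.0   # P(Horror(M.) = F | ...) = 1
    return p_action * p_drama * p_horror

print([round(math.log(row_probability(r)), 2) for r in rows])   # [-0.69, -0.69]
log_likelihood = sum(math.log(row_probability(r)) for r in rows)
print(round(log_likelihood, 2))   # about -1.39; the slide reports -1.38 by summing the rounded row values
```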

Likelihood Function for Relational Data

Wanted: A Likelihood Score for Relational Data
[Figure: a relational database and a Bayesian network map to a log-likelihood, e.g. -3.5.]
Problems:
- Multiple tables.
- Dependent data points.
Note: the likelihood score is not necessarily normalized; likelihood function = likelihood score / normalization constant (partition function).

The Random Selection Likelihood Score
1. Randomly select a grounding/instantiation for all first-order variables in the first-order Bayesian network.
2. Compute the log-likelihood of the attribute values of the selected grounding.
3. Log-likelihood score = expected log-likelihood of a random grounding.
This generalizes the IID log-likelihood, but without the independence assumption.

Schulte, O. (2011), 'A tractable pseudo-likelihood function for Bayes Nets applied to relational data', in SIAM SDM, pp. 462-473.

Example
Bayes net: gender(A) → ActsIn(A,M), with
P(g(A) = M) = 1/2
P(ActsIn(A,M) = T | g(A) = M) = 1/4
P(ActsIn(A,M) = T | g(A) = W) = 2/4

Random selection over all 8 actor-movie groundings (each selected with probability 1/8):

Prob | A             | M         | gender(A) | ActsIn(A,M) | P_B | ln(P_B)
1/8  | Brad_Pitt     | Fargo     | M         | F           | 3/8 | -0.98
1/8  | Brad_Pitt     | Kill_Bill | M         | F           | 3/8 | -0.98
1/8  | Lucy_Liu      | Fargo     | W         | F           | 2/8 | -1.39
1/8  | Lucy_Liu      | Kill_Bill | W         | T           | 2/8 | -1.39
1/8  | Steve_Buscemi | Fargo     | M         | T           | 1/8 | -2.08
1/8  | Steve_Buscemi | Kill_Bill | M         | F           | 3/8 | -0.98
1/8  | Uma_Thurman   | Fargo     | W         | F           | 2/8 | -1.39
1/8  | Uma_Thurman   | Kill_Bill | W         | T           | 2/8 | -1.39

Geometric mean of P_B = 0.27; arithmetic mean of ln(P_B) = -1.32.
Data + Bayesian network → random selection likelihood value.
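
A small Python sketch of the same computation (illustrative only; the facts about which actor appears in which movie are reconstructed from the example above):

```python
import math
from itertools import product

# Illustrative sketch of the random selection log-likelihood for the
# actor/movie example. The gender and ActsIn facts are reconstructed from
# the slide; the parameters are the CPT entries shown there.

actors = {"Brad_Pitt": "M", "Steve_Buscemi": "M", "Lucy_Liu": "W", "Uma_Thurman": "W"}
movies = ["Fargo", "Kill_Bill"]
acts_in = {("Steve_Buscemi", "Fargo"), ("Lucy_Liu", "Kill_Bill"), ("Uma_Thurman", "Kill_Bill")}

P_gender = {"M": 0.5, "W": 0.5}
P_acts_given_gender = {"M": 1 / 4, "W": 2 / 4}   # P(ActsIn(A,M) = T | g(A))

def grounding_log_likelihood(actor, movie):
    g = actors[actor]
    p_acts = P_acts_given_gender[g]
    p = P_gender[g] * (p_acts if (actor, movie) in acts_in else 1 - p_acts)
    return math.log(p)

groundings = list(product(actors, movies))            # all 8 actor-movie pairs
score = sum(grounding_log_likelihood(a, m) for a, m in groundings) / len(groundings)
print(round(score, 2))            # -1.32: expected log-likelihood of a random grounding
print(round(math.exp(score), 2))  # 0.27: the corresponding geometric mean probability
```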

Observed Frequencies Maximize the Random Selection Likelihood
Proposition: The random selection log-likelihood score is maximized by setting the Bayesian network parameters to the observed conditional frequencies.
For the Bayes net gender(A) → ActsIn(A,M), these are:
P(g(A) = M) = 1/2
P(ActsIn(A,M) = T | g(A) = M) = 1/4
P(ActsIn(A,M) = T | g(A) = W) = 2/4
To compute the first conditional probability: there are 4 actor-movie pairs where the actor is male (Brad Pitt x 2 + Steve Buscemi x 2). Of those 4, there is only one where the actor appears in the movie (Buscemi in Fargo).

Schulte, O. (2011), 'A tractable pseudo-likelihood function for Bayes Nets applied to relational data', in SIAM SDM, pp. 462-473.
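
A matching Python sketch (again illustrative, reusing the toy facts from the previous example) that recovers these maximizing parameters as observed conditional frequencies over the actor-movie groundings:

```python
from itertools import product

# Illustrative sketch: compute the maximizing parameter values as observed
# conditional frequencies over all actor-movie groundings (same toy facts
# as in the earlier example).

actors = {"Brad_Pitt": "M", "Steve_Buscemi": "M", "Lucy_Liu": "W", "Uma_Thurman": "W"}
movies = ["Fargo", "Kill_Bill"]
acts_in = {("Steve_Buscemi", "Fargo"), ("Lucy_Liu", "Kill_Bill"), ("Uma_Thurman", "Kill_Bill")}

pairs = list(product(actors, movies))

# P(g(A) = M): fraction of actors who are male
p_male = sum(1 for a in actors if actors[a] == "M") / len(actors)

# P(ActsIn(A,M) = T | g(A) = g): conditional frequency over actor-movie pairs
def p_acts_given(gender):
    relevant = [(a, m) for a, m in pairs if actors[a] == gender]
    return sum(1 for pair in relevant if pair in acts_in) / len(relevant)

print(p_male)             # 0.5
print(p_acts_given("M"))  # 0.25 (1 of the 4 male pairs: Buscemi in Fargo)
print(p_acts_given("W"))  # 0.5  (2 of the 4 female pairs)
```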

Computing Maximum Likelihood Parameter Values
The parameter values that maximize the random selection likelihood.

Computing Relational Frequencies
- Need to compute a contingency table of instantiation counts, for example over g(A), ActsIn(A,M), action(M):

g(A) | ActsIn(A,M) | action(M) | count
M    | F           | T         | 3
M    | T           | T         | 1
W    | F           | T         | 2
W    | T           | T         | 2

- Well researched when all relationships are specified as true:
  - SQL COUNT(*)
  - Virtual join
  - Partition function reduction
- Parametrized polynomial complexity in the number of first-order variables.

Vardi, M. Y. (1995), 'On the Complexity of Bounded-Variable Queries', in PODS, ACM Press, pp. 266-276.
Yin, X.; Han, J.; Yang, J. & Yu, P. S. (2004), 'CrossMine: Efficient Classification Across Multiple Database Relations', in ICDE.
Venugopal, D.; Sarkhel, S. & Gogate, V. (2015), 'Just Count the Satisfied Groundings: Scalable Local-Search and Sampling Based Inference in MLNs', in AAAI, pp. 3606-3612.
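
For illustration, a naive plain-Python sketch that builds such a contingency table by enumerating all groundings. The relationship facts are carried over from the earlier toy example, and every movie is assumed to be an action movie so the counts line up with the table above; real systems obtain these counts with SQL COUNT(*) or virtual-join techniques rather than by enumeration.

```python
from collections import Counter
from itertools import product

# Illustrative sketch of a contingency table of instantiation counts.
# Toy facts from the earlier example; action(M) = T for both movies is an
# assumption made so the counts match the table on the slide.

actors = {"Brad_Pitt": "M", "Steve_Buscemi": "M", "Lucy_Liu": "W", "Uma_Thurman": "W"}
movies = {"Fargo": True, "Kill_Bill": True}   # action(M) attribute
acts_in = {("Steve_Buscemi", "Fargo"), ("Lucy_Liu", "Kill_Bill"), ("Uma_Thurman", "Kill_Bill")}

# Count each combination of g(A), ActsIn(A,M), action(M) over all groundings.
contingency = Counter(
    (actors[a], (a, m) in acts_in, movies[m])
    for a, m in product(actors, movies)
)
for key, count in sorted(contingency.items()):
    print(key, count)   # e.g. ('M', False, True) 3
```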

Single Relation Case
- For a single relation, compute P_D(R = F) using the 1-minus trick (Getoor et al. 2003).
- Example: P_D(HasRated(User,Movie) = T) = 4.27%, so P_D(HasRated(User,Movie) = F) = 95.73%.
- How to generalize to multiple relations, e.g. P_D(ActsIn(Actor,Movie) = F, HasRated(User,Movie) = F)?

Getoor, L.; Friedman, N.; Koller, D. & Taskar, B. (2003), 'Learning probabilistic models of link structure', Journal of Machine Learning Research 3, 679-707.

The Möbius Extension Theorem for Negated Relations
For two link types R1 and R2:
- Joint probabilities p1, ..., p4: one for each true/false combination of R1 and R2 (both false, only R2 true, only R1 true, both true).
- Möbius parameters q1, ..., q4: probabilities of positive conjunctions only: q1 = P(nothing specified), q2 = P(R2 = T), q3 = P(R1 = T), q4 = P(R1 = T, R2 = T).
The joint probabilities and the Möbius parameters determine each other.

The Fast Inverse Möbius Transform
Example with R1 = ActsIn(A,M) and R2 = HasRated(U,M), ignoring attribute conditions; the numbers are made up. "*" means: nothing specified. J.P. = joint probability.

Initial table with no false relationships (Möbius parameters):
R1 | R2 | J.P.
T  | T  | 0.2
*  | T  | 0.3
T  | *  | 0.4
*  | *  | 1.0

After transforming R1 (replace * by F, subtracting the corresponding T entry):
R1 | R2 | J.P.
T  | T  | 0.2
F  | T  | 0.1
T  | *  | 0.4
F  | *  | 0.6

After transforming R2 (table with joint probabilities):
R1 | R2 | J.P.
T  | T  | 0.2
F  | T  | 0.1
T  | F  | 0.2
F  | F  | 0.5

Exercise: trace the method.

Kennes, R. & Smets, P. (1990), 'Computational aspects of the Möbius transformation', in UAI, pp. 401-416.
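
A short Python sketch of this transform for the two-relation example (the table entries are the made-up numbers from the slide; None plays the role of "*"):

```python
# Illustrative sketch of the inverse Möbius transform for two relations.
# Keys are (R1, R2); None means "nothing specified" ("*" on the slide).

table = {
    (True, True): 0.2,
    (None, True): 0.3,
    (True, None): 0.4,
    (None, None): 1.0,
}

def transform_relation(table, index):
    """Replace '*' (None) by False at position `index` via p(F, .) = p(*, .) - p(T, .)."""
    new_table = {}
    for key, prob in table.items():
        if key[index] is None:
            true_key = key[:index] + (True,) + key[index + 1:]
            false_key = key[:index] + (False,) + key[index + 1:]
            new_table[false_key] = prob - table[true_key]
        else:
            new_table[key] = prob
    return new_table

for i in range(2):   # one pass per relation
    table = transform_relation(table, i)

for key, prob in sorted(table.items()):
    print(key, round(prob, 2))   # (T,T)=0.2, (F,T)=0.1, (T,F)=0.2, (F,F)=0.5
```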

Parameter Learning Time
- Fast inverse Möbius transform (IMT) vs. constructing complement tables using SQL.
- [Table of learning times in seconds, omitted.]
- The Möbius transform is much faster: 15-200 times.

Using Presence and Absence of Relationships
- Find correlations between links/relationships, not just between attributes given links.
- Example question: if a user performs a web search for an item, is it likely that the user watches a movie about the item?
- Example of a Weka-interesting association rule on the Financial benchmark dataset: statement_frequency(Account) = monthly ⇒ HasLoan(Account, Loan) = true.

Qian, Z.; Schulte, O. & Sun, Y. (2014), 'Computing Multi-Relational Sufficient Statistics for Large Databases', in CIKM, pp. 1249-1258.

Summary
- Random selection semantics ⇒ random selection log-likelihood.
- The maximizing values for the random selection log-likelihood are the observed empirical frequencies. This generalizes the maximum likelihood result for IID data.
- Fast Möbius transform: computes database frequencies for conjunctive formulas involving any number of negative relationships.
- Enables link analysis: modelling probabilistic associations that involve the presence or absence of relationships.