Learning First-Order Probabilistic Models with Combining Rules. Sriraam Natarajan, Prasad Tadepalli, Eric Altendorf, Thomas G. Dietterich, Alan Fern, Angelo Restificar.

Learning First-Order Probabilistic Models with Combining Rules
Sriraam Natarajan, Prasad Tadepalli, Eric Altendorf, Thomas G. Dietterich, Alan Fern, Angelo Restificar
School of EECS, Oregon State University

First-order Probabilistic Models
■ Combine the expressiveness of first-order logic with the uncertainty modeling of graphical models.
■ Several formalisms already exist: Probabilistic Relational Models (PRMs), Bayesian Logic Programs (BLPs), Stochastic Logic Programs (SLPs), Relational Bayesian Networks (RBNs), Probabilistic Logic Programs (PLPs), and others.
■ Parameter sharing and quantification allow compact representation.
■ Example: “The project’s difficulty and the project team’s competence influence the project’s success.”

Multiple Parents Problem
■ Often multiple objects are related to an object by the same relationship:
■ One’s friend’s drinking habits influence one’s own.
■ A student’s GPA depends on the grades in the courses he takes.
■ The size of a mosquito population depends on the temperature and the rainfall each day since the last freeze.
■ The target variable in each of these statements has multiple influents (“parents” in Bayes net jargon).

Multiple Parents for Population
[Figure: Bayes net in which Population has parents Rain1, Temp1, Rain2, Temp2, Rain3, Temp3]
■ Variable number of parents
■ Large number of parents
■ Need for compact parameterization

Solution 1: Aggregators
[Figure: Rain1, Rain2, Rain3 feed a deterministic AverageRain node, Temp1, Temp2, Temp3 feed a deterministic AverageTemp node, and the two averages are stochastic parents of Population]
■ Problem: does not take into account the interaction between the related parents Rain and Temp

Solution 2: Combining Rules
[Figure: each (Rain_i, Temp_i) pair produces its own distribution Population_i over the population]
■ The three distributions share parameters
■ The three distributions are combined into one final distribution

Outline
■ First-order Conditional Influence Language
■ Learning the parameters of Combining Rules
■ Experiments and Results

■ First-order Conditional Influence Language
■ Learning the parameters of Combining Rules
■ Experiments and Results

First-order Conditional Influence Language (FOCIL)
■ The task and role of a document influence its folder:
if {task(t), doc(d), role(d,r,t)} then r.id, t.id Qinf d.folder
■ The folder of the source of a document influences the folder of the document:
if {doc(d1), doc(d2), source(d1,d2)} then d1.folder Qinf d2.folder
■ The difficulty of a course and the intelligence of a student influence his/her GPA:
if {student(s), course(c), takes(s,c)} then s.IQ, c.difficulty Qinf s.gpa

Relationship to Other Formalisms
■ Shares many of the same properties as other statistical relational models.
■ Generalizes path expressions in probabilistic relational models to arbitrary conjunctions of literals.
■ Unlike BLPs, explicitly distinguishes between conditions, which do not allow uncertainty, and influents, which do.
■ Monotonicity relationships can be specified:
if {person(p)} then p.age Q+ p.height

Combining Multiple Instances of a Single Statement
If {task(t), doc(d), role(d,r,t)} then t.id, r.id Qinf (Mean) d.folder
[Figure: two instances (t1.id, r1.id) and (t2.id, r2.id) each produce a distribution over d.folder; the Mean combining rule merges them into a single distribution for d.folder]

A Different FOCIL Statement for the Same Target Variable
If {doc(s), doc(d), source(s,d)} then s.folder Qinf (Mean) d.folder
[Figure: the folders of source documents s1 and s2 each produce a distribution over d.folder, merged by the Mean combining rule]

Combining Multiple Statements
Weighted Mean {
If {task(t), doc(d), role(d,r,t)} then t.id, r.id Qinf (Mean) d.folder
If {doc(s), doc(d), source(s,d)} then s.folder Qinf (Mean) d.folder
}
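A minimal numerical sketch of this combination (not the authors' implementation): each rule averages the distributions contributed by its instances (the Mean combining rule), and the applicable rules are then merged with a weighted mean. The folder distributions and weights below are made-up illustration values.

```python
import numpy as np

def combine(rule_instance_dists, rule_weights):
    """rule_instance_dists: one array per applicable rule, of shape
    (num_instances, num_folders); rule_weights: one weight per rule,
    renormalized over the rules that actually fire for this document."""
    w = np.asarray(rule_weights, dtype=float)
    w = w / w.sum()                                     # weighted mean over rules
    per_rule = np.stack([d.mean(axis=0)                 # Mean combining rule per rule
                         for d in rule_instance_dists])
    return w @ per_rule                                 # final distribution over folders

# Hypothetical example with 3 folders: the task/role rule fires twice,
# the source rule fires once.
task_role = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3]])
source = np.array([[0.1, 0.1, 0.8]])
print(combine([task_role, source], rule_weights=[0.5, 0.5]))   # [0.25 0.25 0.5]
```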

“Unrolled” Network for Folder Prediction
[Figure: the task/role rule combines (t1.id, r1.id) and (t2.id, r2.id) with a Mean node, the source rule combines s1.folder and s2.folder with a Mean node, and a Weighted Mean node merges the two rule outputs into the final distribution for d.folder]

■ First-order Conditional Influence Language
■ Learning the parameters of Combining Rules
■ Experiments and Results

General Unrolled Network
[Figure: rule 1 has m1 instances, each with inputs X^1_{i,1} ... X^1_{i,k} feeding a shared conditional distribution θ1; rule 2 has m2 instances with inputs X^2_{i,1} ... X^2_{i,k} feeding θ2. Each rule's instance distributions are combined by a Mean node, and the per-rule results are combined by a Weighted Mean node to give the target Y]

Gradient Descent for Squared Error
■ Squared error over the training examples: E = ½ Σ_e Σ_y (I[y = y_e] − P(y | e))², where P(y | e) is the weighted mean, over the applicable rules, of each rule's mean instance distribution.
■ Gradient descent on E updates the shared conditional-distribution parameters and the rule weights.
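A minimal sketch of this objective for a single example, assuming the weighted-mean prediction described above. Only the gradient with respect to the rule weights is shown (the shared conditional-distribution parameters are learned by a similar update in the deck, omitted here), and the clip-and-renormalize projection back onto the simplex is an assumption of this sketch, not something taken from the slides.

```python
import numpy as np

def squared_error_and_weight_grad(per_rule_dists, weights, true_value):
    """per_rule_dists: array (num_rules, num_values), each row a rule's mean
    instance distribution; weights: array (num_rules,) summing to 1."""
    p = weights @ per_rule_dists                  # weighted-mean prediction
    target = np.zeros_like(p)
    target[true_value] = 1.0
    err = 0.5 * np.sum((target - p) ** 2)         # squared error for this example
    # dE/dw_r = -sum_y (target_y - p_y) * P_r(y)
    grad = -(per_rule_dists @ (target - p))
    return err, grad

def weight_step(per_rule_dists, weights, true_value, lr=0.1):
    _, grad = squared_error_and_weight_grad(per_rule_dists, weights, true_value)
    w = np.clip(weights - lr * grad, 1e-6, None)
    return w / w.sum()                            # project back onto the simplex
```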

Gradient Descent for Loglikelihood
■ Loglikelihood: LL = Σ_e log P(y_e | e), where P(y_e | e) is again the weighted mean of the per-rule mean distributions.
■ Gradient ascent on LL is the alternative to the squared-error objective.
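The same kind of sketch for the loglikelihood objective: since log P(y_e | e) is maximized, the weight update is a gradient ascent step. As before, only the weight gradient is shown and the simplex projection is an assumption of the sketch.

```python
import numpy as np

def loglik_and_weight_grad(per_rule_dists, weights, true_value):
    """Loglikelihood of the observed value and its gradient w.r.t. the weights."""
    p = weights @ per_rule_dists                  # weighted-mean prediction
    ll = np.log(p[true_value])
    # d/dw_r log p_{y_e} = P_r(y_e) / p_{y_e}
    grad = per_rule_dists[:, true_value] / p[true_value]
    return ll, grad

def weight_step_ll(per_rule_dists, weights, true_value, lr=0.1):
    _, grad = loglik_and_weight_grad(per_rule_dists, weights, true_value)
    w = np.clip(weights + lr * grad, 1e-6, None)  # ascent: we maximize loglikelihood
    return w / w.sum()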

Learning the Weights
■ The rule weights are learned alongside the conditional-distribution parameters, either by gradient descent on the mean squared error or by gradient ascent on the loglikelihood.

Expectation-Maximization
[Figure: the general unrolled network with combining weights w1, w2 at the Weighted Mean node and uniform weights 1/m1, 1/m2 at each rule's Mean node]

EM Learning
■ Expectation step: compute the responsibility of each instance of each rule for the observed value.
■ Maximization step: compute the maximum-likelihood parameters using the responsibilities as the counts; the rule weights are re-estimated over the n examples in which two or more rules are instantiated.
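A minimal sketch of these two steps under the generative reading of the combining rules (pick a rule with probability w_r, pick one of its m_r instances uniformly, generate the value from that instance's conditional distribution). How the shared conditional distributions are represented and re-estimated is left abstract here; only the weight update described on the slide is shown.

```python
import numpy as np

def e_step(instance_probs, weights):
    """instance_probs: one array per applicable rule, holding P_r(y_e | x_i)
    for each instance i of that rule; returns responsibilities, same shape."""
    joint = [weights[r] * probs / len(probs)       # w_r * (1/m_r) * P_r(y_e | x_i)
             for r, probs in enumerate(instance_probs)]
    z = sum(j.sum() for j in joint)                # likelihood of y_e under the mixture
    return [j / z for j in joint]

def m_step_weights(responsibilities_per_example):
    """responsibilities_per_example: e_step outputs for the n examples in which
    two or more rules are instantiated; responsibilities act as counts."""
    n = len(responsibilities_per_example)
    num_rules = len(responsibilities_per_example[0])
    w = np.zeros(num_rules)
    for resp in responsibilities_per_example:
        for r in range(num_rules):
            w[r] += resp[r].sum()
    return w / n                                   # each example's responsibilities sum to 1
```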

■ First-order Conditional Influence Language
■ Learning the parameters of Combining Rules
■ Experiments and Results

Experimental Setup
■ 500 documents, 6 tasks, 2 roles, 11 folders
■ Each document typically has 1-2 task-role pairs
■ 25% of documents have a source folder
■ 10-fold cross-validation
■ Model:
Weighted Mean {
If {task(t), doc(d), role(d,r,t)} then t.id, r.id Qinf (Mean) d.folder
If {doc(s), doc(d), source(s,d)} then s.folder Qinf (Mean) d.folder
}

Folder Prediction Task
■ Metric: mean reciprocal rank, MRR = (Σ_i n_i / i) / (Σ_i n_i), where n_i is the number of times the true folder was ranked in position i.
■ Propositional baselines: decision trees and naïve Bayes, with features given by the number of occurrences of each task-role pair and the source document's folder.
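A small helper matching the metric as described on the slide; the example counts are hypothetical.

```python
def mean_reciprocal_rank(rank_counts):
    """rank_counts[i] = number of test documents whose true folder was
    ranked in position i (1-based)."""
    total = sum(rank_counts.values())
    return sum(n / rank for rank, n in rank_counts.items()) / total

# Hypothetical example: 80 documents ranked 1st, 15 ranked 2nd, 5 ranked 3rd.
print(mean_reciprocal_rank({1: 80, 2: 15, 3: 5}))   # approx. 0.892
```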

[Table: mean reciprocal rank (MRR) for EM, GD-MS, GD-LL, J48, and naïve Bayes on the folder prediction task]

Learning the Weights: Results
■ Original dataset: the second (source) rule gets more weight, meaning it is more predictive when both rules are applicable.
■ Modified dataset: the folder names of all the sources were randomized, making the second rule ineffective, and its learned weight decreases accordingly.
Learned weights ⟨rule 1, rule 2⟩:
Original data set:  EM ⟨0.15, 0.85⟩   GD-MS ⟨0.22, 0.78⟩   GD-LL ⟨0.05, 0.95⟩
Modified data set:  EM ⟨0.9, 0.1⟩     GD-MS ⟨0.84, 0.16⟩   GD-LL ⟨1, 0⟩

Lessons from Real-world Data
■ The propositional learners are almost as good as the first-order learners in this domain.
■ The number of parents is 1-2 in this domain, and about ¾ of the time only one rule is applicable, so ranking the probabilities is easy.
■ Accurate modeling of the probabilities is still needed for making predictions that combine with other predictions and for cost-sensitive decision making.

Synthetic Data Set
■ 2 rules with 2 inputs each: w_rule1 = 0.1, w_rule2 = 0.9
■ Probability that an example matches a rule = 0.5
■ If an example matches a rule, the number of instances is …
■ Performance metric: average absolute error in the predicted probability

Synthetic Data Set - Results

Synthetic Data Set: GD-MS

Synthetic Data Set: GD-LL

Synthetic Data Set: EM

Conclusions
■ Introduced a general instance of the multiple parents problem in first-order probabilistic languages.
■ Gradient descent and EM successfully learn the parameters of the conditional distributions as well as the parameters of the combining rules (the weights).
■ First-order methods significantly outperform propositional methods in modeling the distributions when the number of parents is ≥ 3.

Future Work
■ Extend these results to more general classes of combining rules.
■ Develop efficient inference algorithms with combining rules.
■ Develop compelling applications.
■ Combining rules and aggregators: can they both be understood as instances of causal independence?