Discriminative Training of Markov Logic Networks

Presentation transcript:

Discriminative Training of Markov Logic Networks Parag Singla & Pedro Domingos

Outline Motivation Review of MLNs Discriminative Training Experiments Link Prediction Object Identification Conclusion and Future Work

Markov Logic Networks (MLNs)
AI systems must be able to learn, reason logically, and handle uncertainty. Markov Logic Networks [Richardson and Domingos, 2004] are an effective way to combine first-order logic and probability: Markov networks are used as the underlying representation, and features are specified using arbitrary formulas in finite first-order logic.

Training of MLNs: Generative Approach
Optimize the joint distribution of all the variables; parameters are learnt independently of any specific inference task. Maximum-likelihood (ML) training requires computing the gradient, which involves inference: too slow! Pseudo-likelihood (PL) is an easy-to-compute alternative, but it is suboptimal: it ignores non-local interactions between variables. Both ML and PL are generative training approaches.

Training of MLNs: Discriminative Approach
There is no need to optimize the joint distribution of all the variables. Instead, optimize the conditional likelihood (CL) of the non-evidence variables given the evidence variables. Parameters are learnt for a specific inference task, which in general tends to do better than generative training.

Why is Discriminative Better?
Generative: parameters are not optimized for the specific inference task, and all dependencies in the data must be modeled, so learning can become complicated. Example of generative models: MRFs.
Discriminative: parameters are optimized for the specific inference task, and dependencies between evidence variables need not be modeled, which makes the learning task easier. Example of discriminative models: CRFs [Lafferty, McCallum & Pereira, 2001].

Outline Motivation Review of MLNs Discriminative Training Experiments Link Prediction Object Identification Conclusion and Future Work

Markov Logic Networks
A Markov Logic Network (MLN) is a set of pairs (F, w), where F is a formula in first-order logic and w is a real number. Together with a finite set of constants, it defines a Markov network with:
- one node for each grounding of each predicate in the MLN;
- one feature for each grounding of each formula F in the MLN, with the corresponding weight w.
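To make the grounding step concrete, here is a minimal Python sketch (the clause, predicates, and constants are illustrative, not from the talk) that enumerates the groundings of one first-order clause over a finite set of constants; each grounding becomes one feature carrying the clause's weight:

```python
from itertools import product

constants = ["Anna", "Bob"]

# Hypothetical clause Smokes(x) => Cancer(x), in clausal form
# (!Smokes(x) v Cancer(x)): a list of (negated?, predicate, argument vars).
template = [(True, "Smokes", ("x",)), (False, "Cancer", ("x",))]

def ground_clause(variables, template, constants):
    """Yield one ground clause (= one feature) per binding of constants to vars."""
    for binding in product(constants, repeat=len(variables)):
        env = dict(zip(variables, binding))
        yield [(neg, pred, tuple(env[v] for v in args))
               for neg, pred, args in template]

for g in ground_clause(["x"], template, constants):
    print(g)  # e.g. [(True, 'Smokes', ('Anna',)), (False, 'Cancer', ('Anna',))]
```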

Likelihood

P(X = x) = (1/Z) exp( Σ_j w_j f_j(x) ) = (1/Z) exp( Σ_i w_i n_i(x) )

The first sum iterates over all ground clauses, with f_j(x) = 1 if the jth ground clause is true and 0 otherwise; the second iterates over all MLN clauses, with n_i(x) the number of true groundings of the ith clause.
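For a toy network, this formula can be checked by brute force, enumerating all 2^n worlds to compute the partition function Z. A sketch, assuming ground clauses are represented as (weight, [(atom_index, negated), ...]) pairs with ground atoms indexed by integers:

```python
import math
from itertools import product

def sum_weights(ground_clauses, x):
    """Sum of w_j * f_j(x): weights of ground clauses satisfied by world x."""
    return sum(w for w, lits in ground_clauses
               if any(x[a] != neg for a, neg in lits))

def likelihood(ground_clauses, x, n_atoms):
    """P(X = x) with Z computed by enumerating all 2^n_atoms worlds.
    Feasible only for toy networks; computing Z exactly is #P-hard in general."""
    Z = sum(math.exp(sum_weights(ground_clauses, list(world)))
            for world in product([False, True], repeat=n_atoms))
    return math.exp(sum_weights(ground_clauses, x)) / Z
```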

Gradient of Log-Likelihood

∂/∂w_i log P_w(x) = n_i(x) − E_w[n_i(x)]

The first term is the feature count according to the data (the number of true groundings of the formula in the DB); the second is the feature count according to the model, which requires inference (slow!).

Pseudo-Likelihood [Besag, 1975]

PL(x) = Π_l P_w(X_l = x_l | MB_x(X_l))

The likelihood of each ground atom given its Markov blanket in the data. It does not require inference at each step, and is optimized using L-BFGS [Liu & Nocedal, 1989].

Gradient of Pseudo-Log-Likelihood

∂/∂w_i log PL_w(x) = Σ_l [ n_i(x) − P_w(X_l = 0 | MB_x(X_l)) · nsat_i(X_l = 0) − P_w(X_l = 1 | MB_x(X_l)) · nsat_i(X_l = 1) ]

where nsat_i(X_l = v) is the number of satisfied groundings of clause i in the training data when X_l takes value v. Most terms are not affected by changes in the weights; after the initial setup, each iteration takes O(# ground predicates × # first-order clauses).
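A brute-force Python sketch of this gradient for a toy network (it recomputes counts from scratch each time, whereas a real implementation caches which groundings each atom can affect; clause representation as in the likelihood sketch above):

```python
import math

def n_true(groundings, x):
    """Number of groundings satisfied by assignment x."""
    return sum(any(x[a] != neg for a, neg in g) for g in groundings)

def pll_gradient(clauses, weights, x):
    """clauses[i]: list of groundings of first-order clause i, each a list
    of (atom_index, negated). x: list of 0/1 values, one per ground atom."""
    grad = [0.0] * len(clauses)
    nx = [n_true(c, x) for c in clauses]           # n_i(x), same for every l
    for l in range(len(x)):
        x0 = x[:l] + [0] + x[l + 1:]               # force X_l = 0
        x1 = x[:l] + [1] + x[l + 1:]               # force X_l = 1
        n0 = [n_true(c, x0) for c in clauses]      # nsat_i(X_l = 0)
        n1 = [n_true(c, x1) for c in clauses]      # nsat_i(X_l = 1)
        s = sum(w * (a - b) for w, a, b in zip(weights, n1, n0))
        p1 = 1.0 / (1.0 + math.exp(-s))            # P(X_l = 1 | Markov blanket)
        p0 = 1.0 - p1
        for i in range(len(clauses)):
            grad[i] += nx[i] - p0 * n0[i] - p1 * n1[i]
    return grad
```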

Outline Motivation Review of MLNs Discriminative Training Experiments Link Prediction Object Identification Conclusion and Future Work

Conditional Likelihood (CL)

P(y | x) = (1/Z_x) exp( Σ_i w_i n_i(x, y) )

Here x are the evidence variables, y the non-evidence (query) variables, and Z_x normalizes over all possible configurations of the non-evidence variables. The sum iterates over all MLN clauses with at least one grounding containing query variables.
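This too can be checked by brute force on a toy network, now normalizing only over the non-evidence atoms (same clause representation as the earlier sketches; query atoms are assumed to be among the non-evidence atoms):

```python
import math
from itertools import product

def conditional_likelihood(ground_clauses, evidence, query, n_atoms):
    """P(query | evidence), where evidence and query are dicts
    {atom_index: bool}. Enumerates all non-evidence configurations."""
    free = [a for a in range(n_atoms) if a not in evidence]
    def expscore(assign):
        x = dict(evidence)
        x.update(assign)
        return math.exp(sum(w for w, lits in ground_clauses
                            if any(x[a] != neg for a, neg in lits)))
    Zx = num = 0.0
    for world in product([False, True], repeat=len(free)):
        assign = dict(zip(free, world))
        s = expscore(assign)
        Zx += s                                   # normalizer Z_x
        if all(assign[a] == v for a, v in query.items()):
            num += s
    return num / Zx
```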

Derivative of Log CL

∂/∂w_i log P_w(y | x) = n_i(x, y) − E_w[n_i(x, y)]

The first term is the number of true groundings (involving query variables) of the formula in the DB; the second term requires inference, as before (slow!).

Derivative of Log CL (approximation)
Approximate the expected count by the MAP count:

E_w[n_i(x, y)] ≈ n_i(x, y*), where y* = argmax_y P_w(y | x) is the MAP state.

Approximating the Expected Count
Use the voted perceptron algorithm [Collins, 2002]: approximate the expected count by the count in the most likely (MAP) state. This was used successfully for linear-chain Markov networks, where the MAP state is found using Viterbi.

Voted Perceptron Algorithm
- Initialize w_i = 0.
- For t = 1 to T: find the MAP configuration according to the current weights, then update w_{i,t} = w_{i,t−1} + η · (training count − MAP count).
- Return w_i = Σ_t w_{i,t} / T; averaging the weights over iterations avoids over-fitting.
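A minimal Python sketch of this loop, assuming a user-supplied `map_state` oracle (e.g. MaxWalkSAT, introduced below) and precomputed count functions; all names here are illustrative:

```python
def voted_perceptron(n_clauses, true_counts, map_state, count_fn,
                     T=100, eta=0.01):
    """Averaged perceptron for MLN weight learning (sketch).
    true_counts[i]: # true groundings of clause i in the training DB.
    map_state(w): MAP assignment of the query atoms under weights w.
    count_fn(i, y): # true groundings of clause i in assignment y."""
    w = [0.0] * n_clauses
    w_sum = [0.0] * n_clauses
    for _ in range(T):
        y_map = map_state(w)                       # e.g. via MaxWalkSAT
        for i in range(n_clauses):
            w[i] += eta * (true_counts[i] - count_fn(i, y_map))
            w_sum[i] += w[i]
    return [s / T for s in w_sum]                  # averaged weights
```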

Generalizing the Voted Perceptron
Finding the MAP configuration is NP-hard in the general case, but it reduces to a weighted satisfiability (MaxSAT) problem: given a formula in clausal form, e.g. (x1 ∨ x3 ∨ x5) ∧ … ∧ (x5 ∨ x7 ∨ x50), with clause i having weight w_i, find the assignment maximizing the sum of the weights of the satisfied clauses: argmax_x Σ_i w_i · 1[x satisfies clause i].

MaxWalkSAT [Kautz, Selman & Jiang, 1997]
Assumes clauses with positive weights. Mixes greedy search with random walks:
- Start with some configuration of the variables.
- Randomly pick an unsatisfied clause.
- With probability p, flip the literal in the clause that gives the maximum gain; with probability 1 − p, flip a random literal in the clause.
- Repeat for a pre-decided number of flips, storing the best configuration seen.
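A compact Python sketch of the procedure, using the (weight, [(atom_index, negated), ...]) clause representation from the earlier sketches (a real implementation maintains incremental satisfaction counts instead of rescoring after every flip):

```python
import random

def weighted_score(clauses, x):
    """Sum of the weights of satisfied clauses."""
    return sum(w for w, lits in clauses
               if any(x[a] != neg for a, neg in lits))

def maxwalksat(clauses, n_atoms, max_flips=10000, p=0.5, seed=0):
    rng = random.Random(seed)
    x = [rng.random() < 0.5 for _ in range(n_atoms)]   # random start
    best, best_score = list(x), weighted_score(clauses, x)
    for _ in range(max_flips):
        unsat = [c for c in clauses
                 if not any(x[a] != neg for a, neg in c[1])]
        if not unsat:
            break
        _, lits = rng.choice(unsat)                    # random unsatisfied clause
        if rng.random() < p:
            def gain(a):                               # score if atom a flipped
                x[a] = not x[a]
                s = weighted_score(clauses, x)
                x[a] = not x[a]
                return s
            atom = max((a for a, _ in lits), key=gain) # greedy: max-gain literal
        else:
            atom = rng.choice(lits)[0]                 # noise: random literal
        x[atom] = not x[atom]
        s = weighted_score(clauses, x)
        if s > best_score:
            best, best_score = list(x), s
    return best, best_score
```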

Handling the Negative Weights
MLNs allow formulas with negative weights, but MaxWalkSAT assumes positive clause weights. A formula with negative weight w can be replaced by its negation with weight −w in the ground Markov network:
(x1 ∨ x3 ∨ x5) with weight w  ⇒  ¬(x1 ∨ x3 ∨ x5) with weight −w  ⇒  (¬x1 ∧ ¬x3 ∧ ¬x5) with weight −w,
which splits into the unit clauses ¬x1, ¬x3, ¬x5, each with weight −w/3.
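A small helper capturing this transformation, in the same clause representation as the MaxWalkSAT sketch above:

```python
def make_weights_positive(clauses):
    """Replace each negative-weight clause by the negations of its literals
    as unit clauses, each carrying weight -w / (number of literals)."""
    out = []
    for w, lits in clauses:
        if w >= 0:
            out.append((w, lits))
        else:
            share = -w / len(lits)
            for a, neg in lits:
                out.append((share, [(a, not neg)]))  # e.g. !x1 with weight -w/3
    return out
```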

Weight Initialization and Learning Rate
Weights are initialized to the log odds of each clause being true in the data: w_i = log(p_i / (1 − p_i)), where p_i is the fraction of groundings of clause i that are true. The learning rate is determined using a validation set, with η ∝ 1 / (# ground predicates).

Outline Motivation Review of MLNs Discriminative Training Experiments Link Prediction Object Identification Conclusion and Future Work

Link Prediction
UW-CSE database, as used by Richardson & Domingos [2004]: a database of people, courses, and publications at UW-CSE.
- 22 predicates, e.g. Student(P), Professor(P), AdvisedBy(P1,P2)
- 1158 constants divided into 10 types
- 4,055,575 ground atoms, 3212 of them true
- 94 hand-coded rules stating various regularities, e.g. Student(P) => !Professor(P)
Task: predict AdvisedBy in the absence of information about the predicates Professor and Student.

Systems Compared: MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN

Results on Link Prediction
[Results charts are not reproduced in this transcript.]

Outline Motivation Review of MLNs Discriminative Training Experiments Link Prediction Object Identification Conclusion and Future Work

Object Identification
Given a database of records referring to objects in the real world, with each record represented by a set of attribute values, we want to find out which records refer to the same object. Example: a paper may have more than one reference in a bibliography database.

Why is it Important?
Data cleaning and integration is the first step in the KDD process: merging data from multiple sources results in duplicates. Entity resolution is extremely important for doing any sort of data mining, and the state of the art is far from what is required. For example, CiteSeer has 30 different entries for the AI textbook by Russell and Norvig.

Standard Approach [Fellegi & Sunter, 1969]
- Look at each pair of records independently.
- Calculate a similarity score for each attribute-value pair based on some metric.
- Combine these into an overall similarity score.
- Merge the records whose similarity is above a threshold.
- Take a transitive closure.
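A minimal Python sketch of this pairwise pipeline (the string-similarity metric and the 0.8 threshold are illustrative choices, not from the talk); union-find provides the transitive closure:

```python
from difflib import SequenceMatcher
from itertools import combinations

def sim(a, b):
    """Stand-in attribute similarity; real systems use TF-IDF, edit distance, etc."""
    return SequenceMatcher(None, a, b).ratio()

def match_records(records, threshold=0.8):
    """records: {record_id: {attribute: value}}, all with the same attributes.
    Returns clusters of record ids believed to denote the same object."""
    parent = {r: r for r in records}
    def find(r):                                   # union-find with path halving
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r
    for r1, r2 in combinations(records, 2):
        fields = records[r1].keys()
        score = sum(sim(records[r1][f], records[r2][f]) for f in fields) / len(fields)
        if score >= threshold:
            parent[find(r1)] = find(r2)            # merge: transitive closure
    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())
```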

An Example: Subset of a Bibliography Relation

Record | Title                            | Author          | Venue
B1     | Object Identification using MLNs | Linda Stewart   | KDD 2004
B2     | Object Identification using MLNs | Linda Stewart   | SIGKDD 10
B3     | Learning Boolean Formulas        | Bill Johnson    | KDD 2004
B4     | Learning of Boolean Formulas     | William Johnson | SIGKDD 10

Graphical Representation in the Standard Model
Each candidate pair is a record-pair node (b1=b2?, b3=b4?) connected to its own evidence nodes for title, author, and venue similarity, e.g. Sim(KDD 2004, SIGKDD 10), Sim(Linda Stewart, Linda Stewart), Sim(Bill Johnson, William Johnson).

What’s Missing?
If from b1=b2 you infer that “KDD 2004” is the same as “SIGKDD 10”, how can you use that to help figure out whether b3=b4? In the standard model the two record-pair nodes share no structure, so this inference cannot propagate.

Collective Model: Basic Idea
Perform simultaneous inference for all the candidate pairs, and facilitate the flow of information through shared attribute values.

Representation in the Standard Model
In the standard model, the same evidence, e.g. Sim(KDD 2004, SIGKDD 10), appears as a separate node under b1=b2? and under b3=b4?: there is no sharing of nodes.

Merging the Evidence Nodes
Let b1=b2? and b3=b4? share a single evidence node Sim(KDD 2004, SIGKDD 10). This still does not solve the problem. Why? Evidence nodes are fixed, so no inferred information can flow through them.

Introducing Information Nodes
The full representation in the collective model adds information nodes between each record-pair node and its evidence: b1.T=b2.T? and b3.T=b4.T? for titles, b1.V=b2.V? and b3.V=b4.V? for venues, b1.A=b2.A? and b3.A=b4.A? for authors. Because b1 and b3 share the venue string “KDD 2004” and b2 and b4 share “SIGKDD 10”, the two venue information nodes denote the same proposition, connecting the two record pairs.

Flow of Information
[Animation in the original slides: evidence for b1=b2 raises belief in the shared venue information node, i.e. that “KDD 2004” and “SIGKDD 10” denote the same venue, which in turn raises belief in b3=b4.]

MLN Predicates for De-Duplicating Citation Databases
- Whether two bib entries are the same: SameBib(b1,b2)
- Whether two field values are the same: SameAuthor(a1,a2), SameTitle(t1,t2), SameVenue(v1,v2)
- Whether the cosine-based TF-IDF score of two field values lies in a particular range (0, 0-.2, .2-.4, etc.): 6 predicates for each field. E.g. AuthorTFIDF.8(a1,a2) is true if the TF-IDF similarity score of a1 and a2 is in the range (.6, .8].
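A sketch of how such evidence predicates could be computed offline, using scikit-learn for the cosine TF-IDF scores (the bucket boundaries follow the slide; the exact predicate-naming scheme is assumed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (low, high] buckets; names assumed to follow the AuthorTFIDF.8 pattern.
BUCKETS = [(0.0, 0.2, ".2"), (0.2, 0.4, ".4"), (0.4, 0.6, ".6"),
           (0.6, 0.8, ".8"), (0.8, 1.01, "1")]

def tfidf_bucket_atoms(values, field="Author"):
    """Yield (predicate, i, j) evidence atoms for all pairs of field values."""
    sims = cosine_similarity(TfidfVectorizer().fit_transform(values))
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            s = sims[i, j]
            if s == 0.0:                          # the "score is exactly 0" bucket
                yield (field + "TFIDF0", i, j)
                continue
            for low, high, name in BUCKETS:
                if low < s <= high:
                    yield (field + "TFIDF" + name, i, j)
                    break
```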

MLN Rules for De-Duplicating Citation Databases
- Singleton predicates: !SameBib(b1,b2)
- Two fields are the same => the corresponding bib entries are the same: Author(b1,a1) ∧ Author(b2,a2) ∧ SameAuthor(a1,a2) => SameBib(b1,b2)
- Two papers are the same => the corresponding fields are the same: Author(b1,a1) ∧ Author(b2,a2) ∧ SameBib(b1,b2) => SameAuthor(a1,a2)
- High similarity score => two fields are the same: AuthorTFIDF.8(a1,a2) => SameAuthor(a1,a2)
- Transitive closure (currently not incorporated): SameBib(b1,b2) ∧ SameBib(b2,b3) => SameBib(b1,b3)
In total: 25 first-order predicates, 46 first-order clauses.

Cora Database
A cleaned-up version of McCallum’s Cora database: 1295 citations to 132 different Computer Science research papers, each citation described by author, venue, and title fields. 401,552 ground atoms; 82,026 true ground atoms (tuples). Task: predict SameBib, SameAuthor, SameVenue.

Systems Compared: MLN(VP), MLN(ML), MLN(PL), KB, CL, NB, BN

Results on Cora: Predicting Citation Matches
[Results charts are not reproduced in this transcript.]

Results on Cora: Predicting Author Matches
[Results charts are not reproduced in this transcript.]

Results on Cora: Predicting Venue Matches
[Results charts are not reproduced in this transcript.]

Outline Motivation Review of MLNs Discriminative Training Experiments Link Prediction Object Identification Conclusion and Future Work

Conclusions
Markov Logic Networks are a powerful way of combining logic and probability. MLNs can be discriminatively trained using a voted perceptron algorithm, and discriminatively trained MLNs perform better than purely logical approaches, purely probabilistic approaches, and generatively trained MLNs.

Future Work
Discriminative learning of MLN structure; max-margin-style training of MLNs; extensions of MaxWalkSAT; further applications to link prediction, object identification, and possibly other areas.