Online Max-Margin Weight Learning for Markov Logic Networks. Tuyen N. Huynh and Raymond J. Mooney. Machine Learning Group, Department of Computer Science, The University of Texas at Austin. SDM 2011, April 29, 2011.

Motivation
Two example tasks involving complex structured data:
- Citation segmentation: "D. McDermott and J. Doyle. Non-monotonic Reasoning I. Artificial Intelligence, 13: 41-72,"
- Semantic role labeling: "[A0 He] [AM-MOD would] [AM-NEG n't] [V accept] [A1 anything of value] from [A2 those he was writing about]"

Motivation (cont.)
- Markov Logic Networks (MLNs) [Richardson & Domingos, 2006] are an elegant and powerful formalism for handling such complex structured data.
- Existing weight learning methods for MLNs operate in the batch setting:
  - Need to run inference over all the training examples in each iteration
  - Usually take a few hundred iterations to converge
  - May not fit all the training examples in main memory
  => They do not scale to problems with a large number of examples.
- Previous work applied an existing online algorithm to learn weights for MLNs, but did not compare it to other algorithms.
=> Introduce a new online weight learning algorithm and extensively compare it to existing methods.

Outline
- Motivation
- Background
  - Markov Logic Networks
  - Primal-dual framework for online learning
- New online learning algorithm for max-margin structured prediction
- Experimental Evaluation
- Summary

Markov Logic Networks [Richardson & Domingos, 2006]
- A set of weighted first-order formulas
  - A larger weight indicates a stronger belief that the formula should hold.
  - The formulas are called the structure of the MLN.
- MLNs are templates for constructing Markov networks for a given set of constants
MLN Example: Friends & Smokers
*Slide from [Domingos, 2007]
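The weighted formulas themselves appeared only as an image in the original slide. The following is the standard Friends & Smokers MLN from Richardson & Domingos (2006); the weights shown are the illustrative values used there and are reproduced here as an assumption:

    1.5   forall x:    Smokes(x) => Cancer(x)
    1.1   forall x,y:  Friends(x,y) => (Smokes(x) <=> Smokes(y))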

Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)
Ground atoms of the resulting Markov network: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
*Slide from [Domingos, 2007]

Probability of a possible world
A possible world x (a truth assignment to all ground atoms) has probability

    P(X = x) = (1/Z) exp( sum_i w_i n_i(x) )

where w_i is the weight of formula i, n_i(x) is the number of true groundings of formula i in x, and Z is the normalization constant.
A possible world becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.

Max-margin weight learning for MLNs [Huynh & Mooney, 2009]
- Maximize the separation margin: the log of the ratio of the probability of the correct label to the probability of the closest incorrect one
- Formulated as a 1-slack structural SVM [Joachims et al., 2009]
- Solved with the cutting-plane method [Tsochantaridis et al., 2004] and an approximate inference algorithm based on linear programming
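A sketch of the margin in symbols (notation assumed; since the MLN distribution is log-linear, the partition function cancels in the ratio):

    gamma(x, y; w) = log [ P(y|x) / P(y_hat|x) ] = w . n(x, y) - w . n(x, y_hat),
    where y_hat = argmax_{y' != y} w . n(x, y') is the closest incorrect label.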

Online learning
The goal is to minimize regret: the difference between the accumulative loss of the online learner and the accumulative loss of the best batch learner chosen in hindsight.
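A sketch of the standard definition (notation assumed, with loss l_t at round t over T rounds):

    Regret(T) = sum_{t=1..T} l_t(w_t) - min_w sum_{t=1..T} l_t(w)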

Primal-dual framework for online learning [Shalev-Shwartz et al., 2006]
- A recent, general framework for deriving low-regret online algorithms
- Rewrite the regret bound as an optimization problem (the primal problem), then consider its dual problem
- Derive a condition that guarantees an increase in the dual objective at each step
  => Incremental-Dual-Ascent (IDA) algorithms, e.g., subgradient methods [Zinkevich, 2003]

Primal-dual framework for online learning (cont.)
- We propose a new class of IDA algorithms called Coordinate-Dual-Ascent (CDA) algorithms:
  - The CDA update rule only optimizes the dual w.r.t. the last dual variable (the current example)
  - The update rule has a closed-form solution
- A CDA algorithm has the same cost per step as subgradient methods but increases the dual objective more at each step => better accuracy

Steps for deriving a new CDA algorithm
1. Define the regularization and loss functions
2. Find the conjugate functions
3. Derive a closed-form solution for the CDA update rule
CDA algorithm for max-margin structured prediction

Max-margin structured prediction
For MLNs, the joint feature vector of an input x and a label y is the vector of true-grounding counts n(x, y).
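A sketch of the resulting prediction rule (notation assumed):

    h_w(x) = argmax_{y in Y} w . n(x, y)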

1. Define the regularization and loss functions
The label loss function Delta(y, y') measures how much a predicted label y' differs from the true label y.

1. Define the regularization and loss functions (cont.)
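The concrete functions appeared only as formulas in the original slides. A sketch under assumed notation (the exact losses behind the CDA-PL and CDA-ML variants compared later differ in how the competing label is chosen) is:

    f(w)   = (1/2) ||w||_2^2                                           (regularization)
    l_t(w) = max_y [ Delta(y_t, y) - w . ( n(x_t, y_t) - n(x_t, y) ) ]  (margin-rescaled structured hinge loss)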

2. Find the conjugate functions
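For reference, the (Fenchel) conjugate of a function f is defined as:

    f*(mu) = sup_w [ <w, mu> - f(w) ]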

2. Find the conjugate functions (cont.)
Conjugate of the regularization function: for f(w) = (1/2)||w||_2^2, the conjugate is f*(mu) = (1/2)||mu||_2^2.

2. Find the conjugate functions (cont.)

3. Closed-form solution for the CDA update rule
CDA's learning rate combines the learning rate of the subgradient method with the loss incurred at each step.
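As an illustration only (this is not the paper's exact closed-form update; the step-size rule, function names, and the use of loss-augmented inference below are assumptions), the following sketch shows an online max-margin update whose step size caps a subgradient-style rate by a loss-dependent, passive-aggressive-style rate:

    import numpy as np

    def online_max_margin(examples, n_features, joint_features,
                          loss_augmented_argmax, label_loss, sigma=1.0):
        """Generic online max-margin learner over MLN-style feature counts n(x, y)."""
        w = np.zeros(n_features)
        for t, (x, y) in enumerate(examples, start=1):
            # Find the most violating label under the current weights (approximate inference).
            y_hat = loss_augmented_argmax(w, x, y)
            # Feature difference between the true label and the predicted one.
            g = joint_features(x, y) - joint_features(x, y_hat)
            # Margin-rescaled hinge loss suffered at this step.
            loss = label_loss(y, y_hat) - w.dot(g)
            if loss > 0:
                # Cap the subgradient-style rate 1/(sigma*t) by a PA-style rate loss/||g||^2.
                eta = min(1.0 / (sigma * t), loss / (g.dot(g) + 1e-12))
                w = w + eta * g
        return w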

Experiments

Experimental Evaluation
- Citation segmentation on the CiteSeer dataset
- Search query disambiguation on a dataset obtained from Microsoft
- Semantic role labeling on the noisy CoNLL 2005 dataset

Citation segmentation
- CiteSeer dataset [Lawrence et al., 1999] [Poon & Domingos, 2007]
- 1,563 citations, divided into 4 research topics
- Task: segment each citation into 3 fields: Author, Title, Venue
- Used the MLN for the isolated segmentation model from [Poon & Domingos, 2007]

Experimental setup  4-fold cross-validation  Systems compared:  MM: the max-margin weight learner for MLNs in batch setting [Huynh & Mooney, 2009]  1-best MIRA [Crammer et al., 2005]  Subgradient  CDA CDA-PL CDA-ML  Metric:  F 1, harmonic mean of the precision and recall 26

Average F1 on CiteSeer

Average training time in minutes

Search query disambiguation
- Used the dataset created by Mihalkova & Mooney [2009]
- Thousands of search sessions in which ambiguous queries were issued: 4,618 sessions for training, 11,234 sessions for testing
- Goal: disambiguate each search query based on previous related search sessions
- Noisy dataset, since the true labels are based on which results were clicked by users
- Used the 3 MLNs proposed in [Mihalkova & Mooney, 2009]

Experimental setup  Systems compared:  Contrastive Divergence (CD) [Hinton 2002] used in [Mihalkova & Mooney, 2009]  1-best MIRA  Subgradient  CDA CDA-PL CDA-ML  Metric:  Mean Average Precision (MAP): how close the relevant results are to the top of the rankings 30

MAP scores on Microsoft query search

Semantic role labeling
- CoNLL 2005 shared task dataset [Carreras & Marques, 2005]
- Task: for each target verb in a sentence, find and label all of its semantic components
- 90,750 training examples; 5,267 test examples
- Noisy-label experiment:
  - Motivated by noisy labeled data obtained from crowdsourcing services such as Amazon Mechanical Turk
  - Simple noise model: at p percent noise, each argument of a verb is swapped with another argument of that verb with probability p (see the sketch below)
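A minimal sketch of one way to implement this noise model (the exact swapping procedure is not specified in the slides, so the details below are assumptions; p is given as a fraction rather than a percentage):

    import random

    def add_argument_noise(verb_args, p, rng=random):
        """verb_args: list of argument labels/spans for one target verb."""
        noisy = list(verb_args)
        for i in range(len(noisy)):
            # With probability p, swap this argument with another argument of the same verb.
            if len(noisy) > 1 and rng.random() < p:
                j = rng.randrange(len(noisy))
                while j == i:
                    j = rng.randrange(len(noisy))
                noisy[i], noisy[j] = noisy[j], noisy[i]
        return noisy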

Experimental setup  Used the MLN developed in [Riedel, 2007]  Systems compared:  1-best MIRA  Subgradient  CDA-ML  Metric:  F 1 of the predicted arguments [Carreras & Marques, 2005] 33

F1 scores on CoNLL 2005

Summary
- Derived CDA algorithms for max-margin structured prediction
- They have the same computational cost as existing online algorithms but increase the dual objective more at each step
- Experimental results on several real-world problems show that the new algorithms generally achieve better accuracy and also have more consistent performance

Thank you! Questions?