Efficient Decomposed Learning for Structured Prediction
Rajhans Samdani and Dan Roth, University of Illinois at Urbana-Champaign

Structured Prediction
- Predict y = {y_1, y_2, …, y_n} ∈ Y given input x.
- Features: φ(x, y); weight parameters: w.
- Inference: argmax_{y ∈ Y} f(x, y) = w · φ(x, y).
- Learning: estimate w.

Global Learning (GL)
- Structural SVMs learn by minimizing a regularized structured hinge loss, with global (loss-augmented) inference as an intermediate step.
- Global inference is slow ⇒ global learning is time-consuming.

Decomposed Learning (DecL)
- Reduce inference to a neighborhood around the gold labeling y^j (a minimal code sketch of this neighborhood-restricted update follows the experiments block below).
- Small neighborhoods ⇒ efficient learning.
- We show, theoretically and experimentally, that decomposed learning with small neighborhoods can be identical to Global Learning (GL).

Theoretical Results: Decompositions that Yield Exactness
- W* = { w | f(x^j, y^j; w) ≥ f(x^j, y; w) + Δ(y^j, y), ∀ y ∈ Y, ∀ (x^j, y^j) in the training data }.
- W_decl = { w | f(x^j, y^j; w) ≥ f(x^j, y; w) + Δ(y^j, y), ∀ y ∈ nbr(y^j), ∀ (x^j, y^j) in the training data }.
- Exactness: DecL is exact if it has the same set of separating weights as GL, i.e. W_decl = W*.
- Exactness with finite data is much more useful than asymptotic consistency.
- Main Theorem: DecL is exact if for every w ∈ W* there exists ε > 0 such that for all w′ ∈ B(w, ε) and all (x^j, y^j) ∈ D the following holds: if there exists y ∈ Y with f(x^j, y; w′) + Δ(y^j, y) > f(x^j, y^j; w′), then there exists y′ ∈ nbr(y^j) with f(x^j, y′; w′) + Δ(y^j, y′) > f(x^j, y^j; w′).

Exactness for Special Cases
- Pairwise Markov network over a graph with edges E. Assume domain knowledge on W*: we know that for any separating w, each pairwise potential φ_{i,k}(·; w) is either submodular, φ_{i,k}(0,0) + φ_{i,k}(1,1) > φ_{i,k}(0,1) + φ_{i,k}(1,0), or supermodular, φ_{i,k}(0,0) + φ_{i,k}(1,1) < φ_{i,k}(0,1) + φ_{i,k}(1,0).
- [Figure: edge set E with the submodular, sub(φ), and supermodular, sup(φ), edges E_j marked.]
- Theorem: the S_pair decomposition, consisting of the connected components of E_j, yields exactness.
- Linear scoring function structured via constraints on Y: for simple constraints it is possible to show exactness for decompositions with set sizes independent of n.
- Theorem: if Y is specified by k OR constraints, then DecL-(k+1) is exact.
- As an example consequence, when Y is specified by k horn clauses, y_{1,1} ∧ y_{1,2} ∧ … ∧ y_{1,r} → y_{1,r+1}, …, y_{k,1} ∧ y_{k,2} ∧ … ∧ y_{k,r} → y_{k,r+1}, decompositions with set size (k+1), i.e. independent of the number of variables r in the constraints, yield exactness.

Baseline: Local Learning (LL)
- Approximations to GL that ignore certain structural interactions so that the remaining structure becomes easy to learn, e.g. ignoring global constraints or pairwise interactions in a Markov network.
- Another baseline: LL+C, where we learn the pieces independently and apply full structural inference (e.g. the constraints, if available) at test time.
- Fast, but oblivious to rich structural information.

Experiments
- Synthetic data: a random linear scoring function with random constraints.
- Information extraction: given a citation, extract author, book title, title, etc.; given advertisement text, extract features, size, neighborhood, etc. Constraints such as: 'title' tokens are likely to appear together in a single block; a paper should have at most one 'title'. Domain knowledge: the HMM transition matrix is diagonal-heavy, a generalization of submodular pairwise potentials.
- Multi-label document classification: experiments on Reuters data; documents carry multiple labels (corn, crude, earn, grain, interest, …), modeled as a pairwise Markov network over a complete graph over the labels, with singleton and pairwise components.
- [Plots: accuracy and F1 score vs. training time (hours), and average Hamming loss vs. number of training examples, comparing Local Learning (LL) baselines, Global Learning (GL), DecL-2, DecL-3, and DecL-1 (a.k.a. pseudomax).]
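To make the neighborhood-restricted update concrete, here is a minimal Python sketch of one decomposed-learning step: loss-augmented inference is run only over nbr(y^j), the labelings that agree with the gold labeling except on one set of the decomposition. The perceptron-style subgradient update, the Hamming loss, the toy feature function phi, and all names below are illustrative assumptions, not the authors' implementation (the paper minimizes a structural SVM objective).

```python
import itertools
import numpy as np

def hamming_loss(y_gold, y):
    """Delta(y_gold, y): number of output variables on which the labelings disagree."""
    return sum(a != b for a, b in zip(y_gold, y))

def neighborhood(y_gold, decomposition, num_labels):
    """nbr(y_gold): labelings that agree with the gold labeling everywhere
    except (possibly) on one set s of the decomposition."""
    yield tuple(y_gold)
    for s in decomposition:
        for vals in itertools.product(range(num_labels), repeat=len(s)):
            y = list(y_gold)
            for i, v in zip(s, vals):
                y[i] = v
            yield tuple(y)

def decl_step(w, phi, x, y_gold, decomposition, num_labels, lr=0.1):
    """One loss-augmented, perceptron-style update: the argmax is restricted
    to the small neighborhood instead of the full output space Y."""
    score = lambda y: float(np.dot(w, phi(x, y)))
    y_hat = max(neighborhood(y_gold, decomposition, num_labels),
                key=lambda y: score(y) + hamming_loss(y_gold, y))
    if score(y_hat) + hamming_loss(y_gold, y_hat) > score(tuple(y_gold)):
        w = w + lr * (phi(x, list(y_gold)) - phi(x, list(y_hat)))
    return w

# Toy usage: 4 binary output variables with simple per-variable features.
def phi(x, y):
    n = len(y)
    return np.array([x[i] * y[i] for i in range(n)] +
                    [(1.0 - x[i]) * y[i] for i in range(n)])

x, y_gold = [0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]
w = np.zeros(8)
decomposition = [[0, 1], [2, 3]]   # two size-2 sets (DecL-2 would use all C(4,2) = 6 pairs)
for _ in range(25):
    w = decl_step(w, phi, x, y_gold, decomposition, num_labels=2)
```

With a richer scoring function, only the inner loss-augmented argmax changes; the point of the sketch is that it ranges over the neighborhood rather than over all of Y.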
DecL: Learning via Decompositions
- Learn by varying a subset of the output variables while fixing the remaining variables to their gold labels in y^j.
- [Figure: an output with six variables y_1, …, y_6; in each copy a different subset is varied while the rest stay fixed to their gold labels.]
- A decomposition is a collection of different (non-inclusive, possibly overlapping) sets of variables over which we perform the argmax: S_j = { s_1, …, s_l | ∀ i, s_i ⊆ {1, …, n}; ∀ i ≠ k, s_i ⊄ s_k }.
- Learning with the decomposition in which all subsets of size k are considered: DecL-k.
- In practice, decompositions based on domain knowledge group highly coupled variables together.
- Intuition behind exactness: for weights immediately outside W*, global inseparability ⇒ DecL inseparability.

Supported by the Army Research Laboratory (ARL), the Defense Advanced Research Projects Agency (DARPA), and the Office of Naval Research (ONR).
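As a small illustration of why restricting the argmax to neighborhoods is cheap, the sketch below (an assumption-laden illustration, not from the paper) builds the DecL-k decomposition as all size-k subsets and counts how many candidate labelings DecL-k scores per example compared with the exponential number global learning must search.

```python
import itertools
from math import comb

def decl_k_decomposition(n, k):
    """The DecL-k decomposition: all size-k subsets of the n output variables.
    Sets of equal size never contain one another, so the non-inclusion
    condition in the definition of S_j holds automatically."""
    return [list(s) for s in itertools.combinations(range(n), k)]

def decl_neighborhood_size(n, k, num_labels=2):
    """Candidate labelings DecL-k scores per example: the gold labeling plus
    every relabeling of each size-k set, versus |Y| = num_labels**n for GL."""
    return 1 + comb(n, k) * (num_labels ** k - 1)

# For n = 20 binary variables: GL searches 2**20 = 1,048,576 labelings,
# while DecL-2 scores only 1 + C(20,2) * 3 = 571 labelings per example.
print(decl_neighborhood_size(20, 2))    # 571
print(len(decl_k_decomposition(6, 2)))  # 15 sets for six output variables, as in the figure above
```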