Efficient Decomposed Learning for Structured Prediction
Rajhans Samdani and Dan Roth, University of Illinois at Urbana-Champaign

Structured Prediction
- Predict y = {y_1, y_2, …, y_n} ∈ Y given input x
- Features: φ(x, y); weight parameters: w
- Inference: argmax_{y ∈ Y} f(x, y) = w · φ(x, y)
- Learning: estimate w

Global Learning (GL)
- Structural SVMs learn by minimizing a structured large-margin objective, with global inference as an intermediate step
- Global inference is slow ⇒ global learning is time consuming

Baseline: Local Learning (LL)
- Approximations to GL that ignore certain structural interactions so that the remaining structure becomes easy to learn, e.g. ignoring global constraints or pairwise interactions in a Markov network
- Another baseline, LL+C: learn the pieces independently and apply full structural inference (e.g. constraints, if available) at test time
- Fast, but oblivious to rich structural information

DecL: Learning via Decompositions
- Reduce the inference performed during learning to a neighborhood nbr(y^j) around the gold output y^j: learn by varying a subset of the output variables while fixing the remaining variables to their gold labels in y^j
[figure: copies of variables y_1 … y_6 with different subsets varied and the rest fixed to their gold labels]
- A decomposition is a collection of different (non-inclusive, possibly overlapping) sets of variables over which we perform the argmax: S^j = {s_1, …, s_l | ∀ i, s_i ⊆ {1, …, n}; ∀ i, k, s_i ⊄ s_k}
- DecL-k: learning with the decomposition in which all subsets of size k are considered
- In practice, decompositions based on domain knowledge group highly coupled variables together
- Small neighborhoods ⇒ efficient learning
- We show theoretically and experimentally that decomposed learning with small neighborhoods can be identical to Global Learning (GL); a toy sketch contrasting the two search spaces follows this section
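
To make the contrast between global and decomposed learning concrete, here is a minimal illustrative sketch, not the authors' implementation: it replaces the structural SVM objective from the poster with a simple update that fires whenever the loss-augmented argmax differs from the gold output, uses brute-force enumeration over a handful of binary output variables, and ignores constraints on Y. Names such as phi, decl_neighborhood, and learn are hypothetical. The only difference between GL and DecL-k below is the candidate set handed to the loss-augmented argmax: all of Y versus nbr(y^j).

```python
import itertools
import numpy as np


def phi(x, y):
    """Toy joint feature map: per-variable features x_i * y_i plus
    agreement indicators for consecutive output variables."""
    y = np.asarray(y, dtype=float)
    singles = x * y
    pairs = np.array([float(y[i] == y[i + 1]) for i in range(len(y) - 1)])
    return np.concatenate([singles, pairs])


def hamming(y_gold, y):
    """Hamming loss Delta(y_gold, y)."""
    return float(sum(a != b for a, b in zip(y_gold, y)))


def full_space(n):
    """Global learning searches all 2^n binary assignments."""
    return list(itertools.product((0, 1), repeat=n))


def decl_neighborhood(y_gold, k):
    """DecL-k neighborhood nbr(y_gold): vary every subset of at most k
    output variables while fixing the rest to their gold labels."""
    n = len(y_gold)
    nbr = {tuple(y_gold)}
    for size in range(1, k + 1):
        for subset in itertools.combinations(range(n), size):
            for vals in itertools.product((0, 1), repeat=size):
                y = list(y_gold)
                for idx, v in zip(subset, vals):
                    y[idx] = v
                nbr.add(tuple(y))
    return sorted(nbr)


def loss_augmented_argmax(w, x, y_gold, candidates):
    """argmax over the candidate set of w . phi(x, y) + Delta(y_gold, y)."""
    return max(candidates, key=lambda y: w @ phi(x, y) + hamming(y_gold, y))


def learn(data, n, k=None, epochs=20, lr=0.1):
    """Simple margin-violation updates: k=None searches all of Y (GL),
    otherwise only the DecL-k neighborhood of each gold output."""
    w = np.zeros(2 * n - 1)
    for _ in range(epochs):
        for x, y_gold in data:
            cands = full_space(n) if k is None else decl_neighborhood(y_gold, k)
            y_hat = loss_augmented_argmax(w, x, y_gold, cands)
            if hamming(y_gold, y_hat) > 0:
                w += lr * (phi(x, y_gold) - phi(x, y_hat))
    return w


# Tiny usage example with n = 4 binary output variables.
rng = np.random.default_rng(0)
data = [(rng.normal(size=4), (1, 1, 0, 0)), (rng.normal(size=4), (0, 0, 1, 1))]
w_global = learn(data, n=4)        # GL: argmax over all of Y
w_decl = learn(data, n=4, k=2)     # DecL-2: argmax over nbr(y_gold)
```

Since |nbr(y^j)| grows roughly like n^k while |Y| grows like 2^n, the DecL update scores far fewer candidates per example, which is why small neighborhoods make learning efficient.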
Theoretical Results: Decompositions which Yield Exactness
- W* = {w | f(x^j, y^j; w) ≥ f(x^j, y; w) + Δ(y^j, y), ∀ y ∈ Y, for every (x^j, y^j) in the training data}
- W_decl = {w | f(x^j, y^j; w) ≥ f(x^j, y; w) + Δ(y^j, y), ∀ y ∈ nbr(y^j), for every (x^j, y^j) in the training data}
- Exactness: DecL is exact if it has the same set of separating weights as GL, i.e. W_decl = W*
- Exactness with finite data is much more useful than asymptotic consistency
- Main Theorem: DecL is exact if ∀ w ∈ W*, ∃ ε > 0 such that ∀ w′ ∈ B(w, ε) and ∀ (x^j, y^j) ∈ D: if ∃ y ∈ Y with f(x^j, y; w′) + Δ(y^j, y) > f(x^j, y^j; w′), then ∃ y′ ∈ nbr(y^j) with f(x^j, y′; w′) + Δ(y^j, y′) > f(x^j, y^j; w′)
- In other words, for weights immediately outside W*, global inseparability ⇒ DecL inseparability

Exactness for Special Cases
Pairwise Markov network over a graph with edges E:
- Assume domain knowledge on W*: we know that for a separating w, each pairwise potential φ_{i,k}(·; w) is either submodular, φ_{i,k}(0,0) + φ_{i,k}(1,1) > φ_{i,k}(0,1) + φ_{i,k}(1,0), or supermodular, φ_{i,k}(0,0) + φ_{i,k}(1,1) < φ_{i,k}(0,1) + φ_{i,k}(1,0) (a toy sub/supermodularity check appears after the Experiments section)
- Theorem: the decomposition S_pair consisting of the connected components of E^j yields exactness
[figure: edge set E and the subset E^j of edges, marked sub(φ)/sup(φ), with gold labels 0/1]

Linear scoring function structured via constraints on Y:
- For simple constraints, it is possible to show exactness for decompositions with set sizes independent of n
- Theorem: if Y is specified by k OR constraints, then DecL-(k+1) is exact
- As an example consequence, when Y is specified by k horn clauses y_{1,1} ∧ y_{1,2} ∧ … ∧ y_{1,r} → y_{1,r+1}, …, y_{k,1} ∧ y_{k,2} ∧ … ∧ y_{k,r} → y_{k,r+1}, decompositions with set size k+1, i.e. independent of the number r of variables in each constraint, yield exactness (a toy encoding of such constraints also appears after the Experiments section)

Experiments
- Synthetic data: random linear scoring function with random constraints; evaluated by average Hamming loss as the number of training examples grows
- Information extraction: given a citation, extract author, book title, title, etc.; given ad text, extract features, size, neighborhood, etc. Constraints such as: 'title' tokens are likely to appear together in a single block, and a paper should have at most one 'title'. Domain knowledge: the HMM transition matrix is diagonal-heavy (a generalization of submodular pairwise potentials). Reported as accuracy vs. training time (hours)
- Multi-label document classification: experiments on Reuters data; documents carry multiple labels (corn, crude, earn, grain, interest, …); modeled as a pairwise Markov network over a complete graph on the labels, with singleton and pairwise components. Reported as F1 score vs. training time (hours)
- Systems compared: Local Learning (LL) baselines, Global Learning (GL), and Decomposed Learning DecL-2 and DecL-3; DecL-1 is also known as Pseudomax
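
The sub/supermodularity condition from the pairwise Markov network case above is easy to state in code. The following helper is a hypothetical illustration (the name classify_pairwise_potential is ours, not from the paper): it classifies a single 2x2 pairwise potential table according to the poster's definition.

```python
def classify_pairwise_potential(table):
    """Classify a pairwise potential phi_{i,k} given as a dict
    {(0, 0): v, (0, 1): v, (1, 0): v, (1, 1): v}, using the poster's convention:
    submodular   if phi(0,0) + phi(1,1) > phi(0,1) + phi(1,0)
    supermodular if phi(0,0) + phi(1,1) < phi(0,1) + phi(1,0)."""
    diag = table[(0, 0)] + table[(1, 1)]
    off = table[(0, 1)] + table[(1, 0)]
    if diag > off:
        return "submodular"
    if diag < off:
        return "supermodular"
    return "neither (tie)"


# Example: a potential that rewards the two labels agreeing is submodular here.
print(classify_pairwise_potential({(0, 0): 1.0, (0, 1): 0.0,
                                   (1, 0): 0.0, (1, 1): 1.0}))
```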

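Likewise, here is a hedged sketch of an output space Y "specified by k horn clauses", encoded as a feasibility filter over binary assignments; the helpers horn_clause and feasible_space are illustrative only and not part of the paper.

```python
import itertools


def horn_clause(body, head):
    """Constraint y_{i1} ∧ … ∧ y_{ir} → y_head over a 0/1 assignment y,
    equivalently the OR constraint (¬y_{i1} ∨ … ∨ ¬y_{ir} ∨ y_head)."""
    def holds(y):
        return (not all(y[i] for i in body)) or bool(y[head])
    return holds


def feasible_space(n, constraints):
    """Enumerate Y: all length-n 0/1 assignments satisfying every constraint."""
    return [y for y in itertools.product((0, 1), repeat=n)
            if all(c(y) for c in constraints)]


# k = 2 horn clauses over n = 4 variables: y0 ∧ y1 → y2 and y1 ∧ y2 → y3.
constraints = [horn_clause([0, 1], 2), horn_clause([1, 2], 3)]
Y = feasible_space(4, constraints)
print(len(Y), "of", 2 ** 4, "assignments are feasible")
```

By the theorem above, with these k = 2 constraints DecL-3 (set size k + 1 = 3) already has the same separating weights as global learning, no matter how many variables each clause mentions.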

Supported by the Army Research Laboratory (ARL), the Defense Advanced Research Projects Agency (DARPA), and the Office of Naval Research (ONR).
