Advanced ML: Inference
Lecture 8: Inference
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Some slides are adapted from Vivek Srikumar's course on Structured Prediction.
So far, what we have learned: thinking about structures. A structure is a graph, a collection of parts that are labeled jointly, i.e., a collection of interdependent decisions.
Next: prediction, the step that sets structured prediction apart from binary/multiclass classification.
The bigger picture
The goal of structured prediction: predicting a graph.
- Modeling: defining probability distributions over the random variables; involves making independence assumptions.
- Inference: the computational step that actually constructs the output; also called decoding.
- Learning: creating the functions that score predictions (e.g., learning model parameters).
Computational issues
- Model definition: What are the parts of the output? What are the inter-dependencies? What background knowledge about the domain can we use?
- Training: How do we train the model? How difficult is data annotation? Can we learn semi-supervised or with indirect supervision?
- Inference: deriving the probability of one or more random variables based on the model.
What is inference?
An overview of what we have seen before: inference as combinatorial optimization.
Different views of inference:
- Integer programming
- Graph algorithms: sum-product, max-sum
- Heuristics for inference: LP relaxation, sampling
Remember sequence prediction
Goal: find the most probable / highest-scoring state sequence, argmax_y score(y) = argmax_y w^T φ(x, y). Computationally, this is discrete optimization.
The naive algorithm: enumerate all sequences, score each one, and pick the max. A terrible idea! We can do better because the scores decompose over edges.
The Viterbi algorithm: Recurrence
Goal: find argmax_y w^T φ(x, y) over sequences y = (y1, y2, ..., yn).
Idea:
1. If I know the best score of every sequence over y1 to yn-1 (for each value of yn-1), then I can decide yn easily.
2. Recurse to get the scores up to yn-1.
Recurrence: score_1(s) = score_local_1(s, START), and
score_i(s) = max_{y_{i-1}} [ score_{i-1}(y_{i-1}) + score_local_i(y_{i-1}, s) ].
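To make the recurrence concrete, here is a minimal Viterbi sketch for a first-order sequence model with additive edge scores; the emission and transition score arrays stand in for the decomposed w^T φ and are hypothetical.

```python
import numpy as np

def viterbi(emission, transition):
    """MAP sequence for a first-order chain model.

    emission:   (n, L) array; emission[i, s] is the local score of label s at position i.
    transition: (L, L) array; transition[s, t] is the score of moving from label s to t.
    Returns the highest-scoring label sequence and its score.
    """
    n, L = emission.shape
    score = np.full((n, L), -np.inf)
    back = np.zeros((n, L), dtype=int)

    score[0] = emission[0]                     # score_1(s) = score_local_1(s, START)
    for i in range(1, n):
        for s in range(L):
            # score_i(s) = max_{y_{i-1}} [score_{i-1}(y_{i-1}) + score_local_i(y_{i-1}, s)]
            cand = score[i - 1] + transition[:, s] + emission[i, s]
            back[i, s] = int(np.argmax(cand))
            score[i, s] = cand[back[i, s]]

    # Backtrack from the best final label.
    y = [int(np.argmax(score[-1]))]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1], float(score[-1].max())

# Tiny usage example with made-up scores: 4 positions, 3 labels.
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```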
Inference questions
In this class, we mostly use inference to mean: "What is the highest-scoring assignment to the output random variables for a given input?" This is Maximum A Posteriori (MAP) inference (if the score is probabilistic).
Other inference questions:
- What is the highest-scoring assignment to some of the output variables given the input?
- Sampling from the posterior distribution over y.
- Loss-augmented inference: which structure most violates the margin for a given scoring function?
- Computing marginal probabilities over y.
MAP inference is discrete optimization
MAP inference is a combinatorial problem. Its computational complexity depends on the size of the input and on the factorization of the scores; more complex factors generally lead to more expensive inference.
A generally bad strategy in all but the simplest cases: "enumerate all possible structures and pick the highest-scoring one."
MAP inference is search
We want the highest-scoring structure (a graph): argmax_y w^T φ(x, y). Without assumptions, no algorithm can find the max without considering every possible structure.
How can we solve this computational problem? Exploit the structure of the search space and of the cost function, that is, exploit the decomposition of the scoring function. Usually, stronger assumptions lead to easier inference (e.g., consider 10 independent random variables, each of which can be maximized separately).
Approaches for inference
Exact vs. approximate inference: should the maximization be performed exactly, or is a close-to-highest-scoring structure good enough?
- Exact: search, dynamic programming, integer linear programming, ...
- Heuristic (approximate inference): Gibbs sampling, belief propagation, beam search, linear programming relaxations, ...
Randomized vs. deterministic (relevant for approximate inference): if I run the inference program twice, will I get the same answer?
Coming up
- Formulating general inference as integer linear programs, and variants of this idea.
- Graph algorithms, dynamic programming, greedy search. We have seen the Viterbi algorithm, which uses a cleverly defined ordering to decompose the output into a sequence of decisions; we will talk about general algorithms: max-product and sum-product.
- Heuristics for inference: sampling (Gibbs sampling), approximate graph search (beam search), LP relaxation.
Inference: Integer Linear Programs
The big picture
MAP inference is combinatorial optimization, and combinatorial optimization problems can be written as integer linear programs (ILPs). The conversion is not always trivial, but it allows injecting "knowledge" in the form of constraints.
Different ways of solving ILPs:
- Commercial solvers: CPLEX, Gurobi, etc.
- Specialized solvers if you know something about your problem: Lagrangian relaxation, amortized inference, etc.
- Relax to a linear program and hope for the best.
Integer linear programming is NP-hard in general; there is no free lunch.
Detour: Linear programming
Minimize a linear objective function subject to a finite number of linear constraints (equalities or inequalities). Very widely applicable: operations research, micro-economics, management.
Historical note: developed during World War II to reduce army costs; "programming" here is not the same as computer programming.
Example: The diet problem
A student wants to spend as little money on food as possible while getting a sufficient amount of vitamin Z and nutrient X. Her options are:

Item                  Cost/100g   Vitamin Z   Nutrient X
Carrots               2           4           0.4
Sunflower seeds       6           10
Double cheeseburger   0.3         0.01

How should she spend her money to get at least 5 units of vitamin Z and 3 units of nutrient X?
Let c, s, and d denote how much of each item is purchased. The linear program: minimize the total cost, subject to getting at least 5 units of vitamin Z, at least 3 units of nutrient X, and the amounts purchased being non-negative.
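As a sketch of how such an LP is solved in practice, here is the diet problem written with scipy.optimize.linprog. The cost and nutrient coefficients below are illustrative placeholders (the table above is only partially filled in), so take the structure, not the numbers, from this example.

```python
from scipy.optimize import linprog

# Decision variables: amounts (in 100g units) of carrots, sunflower seeds, cheeseburger.
# The coefficients are illustrative placeholders, not the lecture's exact table.
cost       = [2.0, 6.0, 1.0]        # objective: minimize cost^T [c, s, d]
vitamin_z  = [4.0, 10.0, 0.3]       # must total at least 5
nutrient_x = [0.4, 3.0, 0.01]       # must total at least 3

# linprog minimizes c^T x subject to A_ub @ x <= b_ub, so ">=" constraints are negated.
A_ub = [[-v for v in vitamin_z],
        [-n for n in nutrient_x]]
b_ub = [-5.0, -3.0]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3, method="highs")
print(res.x, res.fun)               # optimal purchase amounts and the minimum cost
```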
Linear programming
In general, a linear program has the form: minimize c^T x subject to A x ≤ b (and possibly equality constraints).
This is a continuous optimization problem, and yet there are only a finite set of candidate solutions: the constraint matrix defines a polytope, and only the vertices (or faces) of the polytope can be solutions. For example, if the feasible region over (x1, x2, x3) is the triangle cut out by a1 x1 + a2 x2 + a3 x3 = b in the non-negative octant, then for any objective c^T x the maximum is attained at one of the three vertices.
Linear programs can be solved in polynomial time.

Geometry of linear programming
The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed). The objective defines a cost for every point in the space. Even though all points in the region are allowed, the vertices maximize/minimize the cost.
Integer linear programming
In general, an integer linear program has the same form as a linear program, with the added requirement that the variables are integers: minimize c^T x subject to A x ≤ b, x ∈ Z^n.
Geometry: the constraint matrix still defines a polytope and the objective a cost for every point, but only integer points are allowed. Solving integer linear programs in general can be NP-hard! LP relaxation: drop the integer constraints and hope for the best.

0-1 integer linear programming
An instance of integer linear programming in which every variable is binary: x ∈ {0, 1}^n. Still NP-hard.
Geometry: we are only considering points that are vertices of the Boolean hypercube, and the constraints prohibit certain vertices (only points within the feasible region are allowed). The solution can be an interior point of the constraint set defined by A x ≤ b; it need not be a vertex of that polytope.
Back to structured prediction
Recall that we are solving argmax_y w^T φ(x, y). The goal is to produce a graph, and the set of possible values that y can take is finite but large.
General idea: frame the argmax problem as a 0-1 integer linear program. This allows the addition of arbitrary constraints.
Thinking in ILPs
Let's start with multi-class classification: argmax_{y ∈ {A,B,C}} w^T φ(x, y) = argmax_{y ∈ {A,B,C}} score(y).
Introduce a 0-1 decision variable for each label:
- zA = 1 if the output is A, 0 otherwise
- zB = 1 if the output is B, 0 otherwise
- zC = 1 if the output is C, 0 otherwise
Maximize the score, zA·score(A) + zB·score(B) + zC·score(C), subject to picking exactly one label: zA + zB + zC = 1 with zA, zB, zC ∈ {0, 1}.
We have taken a trivial problem (finding the highest-scoring element of a list) and converted it into a representation that is NP-hard in the worst case! Lesson: don't solve multiclass classification with an ILP solver.
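As a concrete (and deliberately overkill) sketch, the multiclass ILP above can be written with the PuLP modeling library; the label scores here are made-up numbers.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

scores = {"A": 1.3, "B": 0.2, "C": 2.5}                       # made-up label scores

prob = LpProblem("multiclass_as_ilp", LpMaximize)
z = {y: LpVariable(f"z_{y}", cat="Binary") for y in scores}   # one indicator per label

prob += lpSum(scores[y] * z[y] for y in scores)               # maximize the score
prob += lpSum(z.values()) == 1                                # pick exactly one label

prob.solve()
print([y for y in scores if value(z[y]) == 1])                # -> ['C']
```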
ILP for a general conditional model
Suppose each yi can be A, B, or C, and the score decomposes over the parts of a factor graph, e.g.
  max_y w^T φ(x1, y1) + w^T φ(y1, y2, y3) + w^T φ(x3, y2, y3) + w^T φ(x1, x2, y2).
Introduce one 0-1 decision variable for each part being assigned a particular labeling:
- unary variables such as z1A, z1B, z1C and z2A, z2B, z2C;
- higher-order variables such as z13AA, z13AB, ..., z13CC and z23AA, z23AB, ..., z23CC.
Each of these decision variables is associated with a score. Not all decisions can exist together; e.g., z13AB implies z1A and z3B.
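A sketch of this encoding with PuLP for a single pairwise part over (y1, y3): the pairwise indicators must agree with the unary indicators, which is the linear form of "z13AB implies z1A and z3B". The scores are made-up placeholders.

```python
import itertools, random
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

labels = ["A", "B", "C"]
random.seed(0)
unary_score = {(i, a): random.uniform(-1, 1) for i in (1, 3) for a in labels}    # made up
pair_score = {(a, b): random.uniform(-1, 1) for a, b in itertools.product(labels, repeat=2)}

prob = LpProblem("pairwise_part", LpMaximize)
z = {(i, a): LpVariable(f"z{i}{a}", cat="Binary") for i in (1, 3) for a in labels}
zp = {(a, b): LpVariable(f"z13{a}{b}", cat="Binary") for a, b in itertools.product(labels, repeat=2)}

# Objective: unary scores plus the score of the pairwise part.
prob += (lpSum(unary_score[i, a] * z[i, a] for i, a in z)
         + lpSum(pair_score[a, b] * zp[a, b] for a, b in zp))

# Each variable takes exactly one label; exactly one pairwise assignment is active.
prob += lpSum(z[1, a] for a in labels) == 1
prob += lpSum(z[3, a] for a in labels) == 1
prob += lpSum(zp.values()) == 1

# Consistency: z13ab implies z1a and z3b, written as linear inequalities.
for a, b in zp:
    prob += zp[a, b] <= z[1, a]
    prob += zp[a, b] <= z[3, b]

prob.solve()
print({i: a for (i, a) in z if value(z[i, a]) == 1})
```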
Writing constraints as linear inequalities
- Exactly one of z1A, z1B, z1C is true: z1A + z1B + z1C = 1
- At least m of z1A, z1B, z1C are true: z1A + z1B + z1C ≥ m
- At most m of z1A, z1B, z1C are true: z1A + z1B + z1C ≤ m
- Implication zi → zj: convert to the disjunction ¬zi ∨ zj (at least one of "not zi" or zj), i.e., (1 − zi) + zj ≥ 1
Integer linear programming for inference
It is easy to add additional knowledge: specify it as Boolean formulas, e.g., "if y1 is an A, then y2 or y3 should be a B or C", or "no more than two A's are allowed in the output".
Many inference problems have "standard" mappings to ILPs: sequences, parsing, dependency parsing.
The encoding of the problem makes a difference in solving time: the mechanical encoding may not be efficient to solve, and in general more complex constraints make solving harder.
Exercise: Sequence labeling
Goal: find argmax_y w^T φ(x, y) for a sequence y = (y1, y2, ..., yn). How can this be written as an ILP?
ILP for inference: Remarks
Many combinatorial optimization problems can be written as ILPs, even the "easy"/polynomial ones; given an ILP, checking whether it represents a polynomial problem is intractable in general.
ILPs are a general language for thinking about combinatorial optimization, and the representation allows us to make general statements about inference.
Off-the-shelf solvers for ILPs (Gurobi, CPLEX) are quite good, but use an off-the-shelf solver only if you can't solve your inference problem otherwise.
Inference: Graph Algorithms (Belief Propagation)
Variable elimination (motivation)
Remember: we have a collection of inference variables that need to be assigned, y = (y1, y2, ...).
General algorithm:
1. Fix an ordering of the variables, say (y1, y2, ...).
2. Iteratively: find the best value for yi given the values of the previous neighbors.
3. Use back pointers to recover the final answer.
Viterbi is an instance of max-product variable elimination.
Variable elimination example (max-sum)
For a chain y1, y2, ..., yn with local scores score_local_i(y_{i-1}, y_i), define
  score_1(s) = score_local_1(s, START)
  score_i(s) = max_{y_{i-1}} [ score_{i-1}(y_{i-1}) + score_local_i(y_{i-1}, s) ].
First eliminate y1 (computing score_2 from score_1), then eliminate y2, then y3, and so on. After eliminating y_{n-1}, we have all the information needed to make a decision for yn, and back pointers recover the rest of the sequence.
Two types of inference problems
Marginals: find the marginal probability P(y_i | x), summing the joint distribution over the remaining variables.
Maximizer (MAP): find argmax_y P(y | x).
Both are probability computations carried out in different domains (sum vs. max).
Belief Propagation
BP provides the exact solution when there are no loops in the graph (Viterbi is a special case); otherwise, "loopy" BP provides an approximate solution.
We use sum-product BP as the running example, where we want to compute the partition function Z of the factorized distribution P(y | x) = (1/Z) ∏_parts f_part(y_part, x).
Intuition
An iterative process in which neighboring variables "pass messages" to each other: "I (variable x3) think that you (variable x2) belong in these states with various likelihoods...". After enough iterations, the conversation is likely to converge to a consensus that determines the marginal probabilities of all the variables.
Message
The message from node i to node j is a function m_ij(x_j). A message is not a probability: it may not sum to 1. A high value of m_ij(x_j) means that node i "believes" the marginal value P(x_j) to be high.
Beliefs
Estimated marginal probabilities are called beliefs. The algorithm: update messages until convergence, then calculate the beliefs.
Message update
To update the message from i to j, consider all messages flowing into i (except the one coming from j):
  m_ij(x_j) = Σ_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ Nbd(i) \ j} m_ki(x_i),
where g_i is the local (emission) score of node i and f_ij is the pairwise score. The belief at node i is proportional to g_i(x_i) ∏_{k ∈ Nbd(i)} m_ki(x_i).
Sum-product vs. max-product
The standard BP we just described is sum-product, used to estimate marginals. A variant called max-product (or max-sum in log space) is used to estimate the MAP assignment.
Max-product
The message update is the same as before, except that the sum is replaced by a max:
  m_ij(x_j) = max_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ Nbd(i) \ j} m_ki(x_i).
The beliefs then estimate the most likely states.
Example (sum-product)
Consider a small tree of six binary variables (states A and B): a root connected to two leaves and to an internal node, which in turn has two leaves of its own; so there are 2^6 possible assignments. For simplicity, every node has the same emission scores (A: 2, B: 1) and every edge the same transition scores (A→A: 1, A→B: 2, B→A: 3, B→B: 4).
- Message from a leaf to its parent: A: 1×2 + 3×1 = 5, B: 2×2 + 4×1 = 8.
- At the internal node, combining its emission with its two leaf messages: A: 2×(5×5) = 50, B: 1×(8×8) = 64. Verify the A entry by enumerating its subtree: (A,(A,A)) = 2×(1×2)×(1×2) = 8, (A,(A,B)) = 2×(1×2)×(3×1) = 12, (A,(B,A)) = 12, (A,(B,B)) = 2×(3×1)×(3×1) = 18, and 8 + 12 + 12 + 18 = 50 = 2×(2+3)×(2+3).
- Message from the internal node to the root: A: 1×50 + 3×64 = 242, B: 2×50 + 4×64 = 356.
- Belief at the root (its emission times all incoming messages): A: 2×242×5×5 = 12,100, B: 1×356×8×8 = 22,784.
- Partition function: Z = 12,100 + 22,784 = 34,884.
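A small script to check this computation: it runs a sum-product upward pass and compares the result with brute-force enumeration. The tree layout is the one reconstructed above from the slide's numbers, so treat it as an assumption of this sketch.

```python
import itertools

states = ["A", "B"]
emit = {"A": 2.0, "B": 1.0}                                  # same emission at every node
trans = {("A", "A"): 1.0, ("A", "B"): 2.0,                   # trans[(child_state, parent_state)]
         ("B", "A"): 3.0, ("B", "B"): 4.0}

# Assumed tree: node 0 is the root with children [1, 2, 3]; node 1 has children [4, 5].
children = {0: [1, 2, 3], 1: [4, 5], 2: [], 3: [], 4: [], 5: []}

def upward_message(node):
    """Sum-product message from `node` to its parent, as a dict over the parent's state."""
    msg = {}
    for parent_state in states:
        total = 0.0
        for s in states:
            inner = emit[s] * trans[(s, parent_state)]
            for c in children[node]:
                inner *= upward_message(c)[s]
            total += inner
        msg[parent_state] = total
    return msg

# Belief at the root = its emission times the product of incoming messages; Z = sum over states.
belief = {s: emit[s] for s in states}
for c in children[0]:
    m = upward_message(c)
    for s in states:
        belief[s] *= m[s]
Z = sum(belief.values())

# Brute force: sum over all 2^6 assignments of the product of all emission and edge factors.
def assignment_score(y):
    score = 1.0
    for s in y:
        score *= emit[s]
    for parent, kids in children.items():
        for c in kids:
            score *= trans[(y[c], y[parent])]
    return score

Z_brute = sum(assignment_score(y) for y in itertools.product(states, repeat=6))
print(belief, Z, Z_brute)   # belief ≈ {'A': 12100, 'B': 22784}; Z = Z_brute = 34884
```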
Example (max-product)
Same tree and scores as before, but messages now use the max-product update
  m_ij^new(x_j) = max_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ Nbd(i) \ j} m_ki^old(x_i).
- Message from a leaf to its parent: A: max(1×2, 3×1) = 3, B: max(2×2, 4×1) = 4.
- At the internal node: A: 2×(3×3) = 18, B: 1×(4×4) = 16. (Compare with the enumeration above: the best completion of its subtree with the internal node fixed to A is (A,(B,B)), with score 18.)
- Message from the internal node to the root: A: max(1×18, 3×16) = 48, B: max(2×18, 4×16) = 64.
- Max-product belief at the root: A: 2×48×3×3 = 864, B: 1×64×4×4 = 1,024.
So the highest-scoring assignment has score 1,024, and following the back pointers recorded with each max recovers a maximizing assignment.
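The same script adapted to max-product: the sum becomes a max, and back pointers are kept for decoding. The tree layout is again the reconstruction assumed above, and ties mean the decoded maximizer need not be unique.

```python
states = ["A", "B"]
emit = {"A": 2.0, "B": 1.0}
trans = {("A", "A"): 1.0, ("A", "B"): 2.0, ("B", "A"): 3.0, ("B", "B"): 4.0}
children = {0: [1, 2, 3], 1: [4, 5], 2: [], 3: [], 4: [], 5: []}   # assumed tree

def max_message(node):
    """Max-product message to the parent, plus back pointers for decoding."""
    msg, back = {}, {}
    for parent_state in states:
        best_val, best_s = -1.0, None
        for s in states:
            val = emit[s] * trans[(s, parent_state)]
            for c in children[node]:
                val *= max_message(c)[0][s]
            if val > best_val:
                best_val, best_s = val, s
        msg[parent_state], back[parent_state] = best_val, best_s
    return msg, back

# Max-product belief at the root, and the best root label.
root_belief = {s: emit[s] for s in states}
for c in children[0]:
    m, _ = max_message(c)
    for s in states:
        root_belief[s] *= m[s]
best_root = max(root_belief, key=root_belief.get)

def decode(node, parent_state, labels):
    """Follow back pointers downward to recover one maximizing assignment."""
    labels[node] = max_message(node)[1][parent_state]
    for c in children[node]:
        decode(c, labels[node], labels)

labels = {0: best_root}
for c in children[0]:
    decode(c, best_root, labels)
print(root_belief, labels)   # root_belief ≈ {'A': 864, 'B': 1024}; one assignment scoring 1024
```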
Inference: Graph Algorithms (General Search)
Inference as search: General setting
Predicting a graph as a sequence of decisions. General data structures:
- State: encodes a partial structure.
- Transitions: move from one partial structure to another.
- Start state. End state: we have a full structure (there may be more than one end state).
Each transition is scored with the learned model. Goal: find an end state that has the highest total score.
Example
Suppose each y can be one of A, B, or C for the factor graph over (y1, y2, y3).
- State: triples (y1, y2, y3), each possibly unknown, e.g., (A, -, -), (-, A, A), (-, -, -), ...
- Transition: fill in one of the unknowns.
- Start state: (-, -, -). End state: all three y's are assigned.
The search tree expands (-, -, -) into (A, -, -), (B, -, -), (C, -, -), then (A, A, -), ..., (C, C, -), down to the complete states (A, A, A), ..., (C, C, C).
Note: here we have assumed an ordering (y1, y2, y3). How do the transitions get scored?
Graph search algorithms
Breadth-first / depth-first search: keep a stack / queue / priority queue of "open" states.
The good: guaranteed to be correct, since it explores every option.
The bad: it explores every option, which can be slow for any non-trivial graph.
Greedy search
At each state, choose the highest-scoring next transition; keep only one state in memory, the current state.
What is the problem? Local decisions may override the global optimum, and the full search space is not explored.
Greedy algorithms can still give the true optimum for special classes of problems, e.g., maximum-spanning-tree algorithms are greedy.
Example
In the search space above, greedy search keeps only one path through the tree, e.g., (-, -, -) → (A, -, -) → (A, B, -) → (A, B, C), instead of expanding every child.
Example (greedy)
Using the same emission scores (A: 2, B: 1) and transition scores (A→A: 1, A→B: 2, B→A: 3, B→B: 4) as in the belief-propagation example, the slide steps through the greedy computation on the graph, reaching intermediate values A: 2×2×2 = 8 vs. B: 2×2×1×2×2 = 16 and final values A: 2×3×16×2×2 = 384 vs. B: 1×16×4×2×2×2×2 = 1,024.
Beam search: A compromise
Keep a size-limited priority queue of states, called the beam, sorted by the total score of each state. At each step, explore all transitions from the states in the beam, add the results to the beam, and trim it back to size.
The good: explores more than greedy search. The bad: a good state might fall out of the beam. In general, easy to implement and very popular, but with no guarantees.
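A minimal beam-search sketch for sequence labeling, assuming the score decomposes into per-position emission and transition scores (the same decomposition used for Viterbi); with beam_size = 1 it reduces to greedy search.

```python
def beam_search(emission, transition, labels, beam_size=3):
    """emission[i][y]: score of label y at position i; transition[(y_prev, y)]: edge score.
    Returns (best_sequence, best_score) among the hypotheses kept in the beam."""
    beam = [((), 0.0)]                                   # (partial sequence, total score)
    for i in range(len(emission)):
        candidates = []
        for seq, score in beam:                          # expand every hypothesis in the beam
            for y in labels:
                s = score + emission[i][y]
                if seq:
                    s += transition[(seq[-1], y)]
                candidates.append((seq + (y,), s))
        # Keep only the highest-scoring hypotheses (trim the beam).
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0]

# Usage with made-up scores for a 3-position, 2-label problem.
labels = ["A", "B"]
emission = [{"A": 1.0, "B": 0.5}, {"A": 0.2, "B": 0.9}, {"A": 0.4, "B": 0.3}]
transition = {("A", "A"): 0.1, ("A", "B"): 0.7, ("B", "A"): 0.0, ("B", "B"): 0.2}
print(beam_search(emission, transition, labels, beam_size=2))
```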
Example (beam = 3)
At each step, calculate scores for all expansions but ignore hypotheses that were already removed from the beam, then keep only the best three. (Credit: Graham Neubig)
Structured prediction approaches based on search
Learning-to-search approaches assume the complex decision is incrementally constructed by a sequence of decisions (e.g., DAgger, SEARN, transition-based methods) and learn how to make the decision at each branch.
Example: Dependency Parsing
Dependency parsing identifies relations between words, e.g., in "I ate a cake with a fork". We use dependency parsing as the running example for how the learning-to-search (L2S) approach works.
Learning to search (L2S) approaches
Step 1: define a search space and features. Example: transition-based dependency parsing [Nivre03, NIPS16]: maintain a buffer and a stack, make predictions from left to right, and use three (or four) types of actions: Shift, Reduce-Left, Reduce-Right. (Credit: Google research blog)
Learning to search approaches: Shift-Reduce parser
Maintain a buffer and a stack and make predictions from left to right with three (or four) types of actions: Shift, Reduce-Left, Reduce-Right. For example, starting from the buffer "I ate a cake", a Shift moves "I" onto the stack; a subsequent Reduce-Left can then attach "I" as a dependent of "ate" and remove it from the stack, leaving "ate a cake" to be processed.
Learning to search (L2S) approaches
1. Define a search space and features.
2. Construct a reference policy (Ref) based on the gold labels.
3. Learn a policy that imitates Ref.
Learning to search is related to reinforcement learning, but the difference is that here we have the true annotations. In the training phase they are used to build a reference policy that guides the traversal of the search space: for example, the reference policy knows which actions generate the correct parse tree. We then learn a policy that imitates this reference policy, so that at test time, without the reference, the learned policy can still make the right decisions. Existing results show that, under some conditions and with enough data, the learned policy can be as good as the reference policy.
Policies
A policy maps observations to actions: π(o) = a, where the observation o can include the input x, the timestep t, the partial trajectory τ, or anything else. So far this approach sounds like reinforcement learning; but wait, reinforcement learning is hard.
Imitation learning for joint prediction
Challenges: there are a combinatorial number of search states, and how does a sub-decision affect the final decision?
Credit Assignment Problem
When something goes wrong, which decision should be blamed?
Imitation learning for joint prediction
SEARN [Daumé, Langford & Marcu], DAgger [Ross, Gordon & Bagnell], AggreVaTe [Ross & Bagnell], LOLS [Chang, Krishnamurthy, Agarwal, Daumé & Langford].
Learning a Policy [ICML 15, Ross+15]
Roll in with the current policy to reach a state "?", try each one-step deviation at that state, and roll out to the end to measure the loss of each deviation (e.g., loss = 0, 0.2, 0.8). At the "?" state we then construct a cost-sensitive multi-class example (?, [0, .2, .8]).
Example: Sequence Labeling
- Receive input: x = "the monster ate the sandwich", with gold labels y = Dt Nn Vb Dt Nn.
- Make a sequence of predictions (roll-in): ŷ = Dt Dt Dt Dt Dt.
- Pick a timestep (here, the word "monster") and try all perturbations there, rolling out the rest:
  ŷ_Dt = Dt Dt Vb Dt Nn, loss = 1; ŷ_Nn = Dt Nn Vb Dt Nn, loss = 0; ŷ_Vb = Dt Vb Vb Dt Nn, loss = 1.
- Compute the losses and construct the cost-sensitive example ({w=monster, p=Dt, ...}, [1, 0, 1]).
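A sketch of how such a cost-sensitive example can be constructed, assuming Hamming loss and a reference-policy rollout that simply copies the gold labels after the deviation; the feature extraction is a placeholder.

```python
def hamming(pred, gold):
    return sum(p != g for p, g in zip(pred, gold))

def one_step_deviation_example(x, gold, rollin, t, label_set):
    """Build a cost-sensitive example at timestep t (0-indexed).

    rollin: labels predicted by the current policy; positions before t are kept.
    The rollout here copies the gold labels after t (a reference-policy rollout).
    Returns (features, costs) where costs[i] is the loss of choosing label_set[i] at t.
    """
    costs = []
    for a in label_set:
        full = rollin[:t] + [a] + gold[t + 1:]        # deviation at t, reference rollout after
        costs.append(hamming(full, gold))
    features = {"w": x[t], "p": rollin[t - 1] if t > 0 else "<s>"}   # placeholder features
    return features, costs

x = "the monster ate the sandwich".split()
gold = ["Dt", "Nn", "Vb", "Dt", "Nn"]
rollin = ["Dt", "Dt", "Dt", "Dt", "Dt"]               # roll-in by the current (bad) policy
print(one_step_deviation_example(x, gold, rollin, 1, ["Dt", "Nn", "Vb"]))
# -> ({'w': 'monster', 'p': 'Dt'}, [1, 0, 1])
```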
Learning to search approaches: Credit Assignment Compiler [NIPS16]
Write the decoder, providing some side information for training; the library translates this piece of program, together with data, into the update rules of the model. Applied to dependency parsing, named entity recognition, relation extraction, POS tagging, and more. Implementation: Vowpal Wabbit.
The key challenge is that representing a search space is not intuitive: usually you have to define state machines, so people often implement heuristic update rules instead of algorithms with good guarantees, which leads to suboptimal performance. The proposed programming abstraction lets the user represent the search space as a piece of program, and the library works as a compiler that translates this program, plus data, into the model's update rules (analogous to FACTORIE for probabilistic models).
Approximate Inference (Inference by Sampling)
Inference by sampling
Monte Carlo methods are a large class of algorithms with origins in physics. Basic idea: repeatedly sample from a distribution and compute aggregate statistics from the samples, e.g., the marginal distribution. Useful when we have many, many interacting variables.
Why sampling works
Suppose we have some probability distribution P(z), which might be a cumbersome function, and we want to answer questions about it, e.g., what is its mean or the expectation of some function under it? Approximate the answer with samples {z1, z2, ..., zn} drawn from the distribution. Theory (Chernoff-Hoeffding style bounds) tells us that this is a good estimator.
Key idea: rejection sampling
Draw candidates from a simple proposal distribution that covers the target and keep each one only with a probability proportional to how well it matches the target. Rejection sampling works well when the number of variables is small.
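A minimal rejection-sampling sketch for an unnormalized target score over a small discrete space, assuming a known bound M ≥ max_y score(y); the score function is illustrative, not from the slides.

```python
import random

def rejection_sample(score, domain, M, n_samples=10000):
    """Draw samples from p(y) ∝ score(y) by proposing uniformly over `domain`
    and accepting y with probability score(y) / M, where M >= max_y score(y)."""
    samples = []
    while len(samples) < n_samples:
        y = random.choice(domain)              # proposal: uniform over the domain
        if random.random() < score(y) / M:     # accept with probability score(y) / M
            samples.append(y)
    return samples

# Usage: a made-up unnormalized score over three structures.
domain = ["y1", "y2", "y3"]
score = lambda y: {"y1": 0.4, "y2": 1.0, "y3": 2.6}[y]
samples = rejection_sample(score, domain, M=2.6)
print({y: samples.count(y) / len(samples) for y in domain})   # ≈ normalized scores
```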
The Markov Chain Monte Carlo revolution
Goal: generate samples from a distribution P(y | x) that may be intractable to sample from directly.
Idea: draw examples in a way that, in the long run, their distribution is close to P(y | x). Formally, construct a Markov chain of structures whose stationary distribution converges to P; this is an iterative process that constructs examples. Initially the samples might not come from the target distribution, but after a long enough time they come from a distribution that is increasingly close to P.
A detour. Recall: Markov chains
A collection of random variables y0, y1, y2, ..., yt forms a Markov chain if the i-th state depends only on the previous one. For a chain over the states {A, B, C, D, E, F} with given transition probabilities, example trajectories are A → B → C → D → E → F and F → A → A → E → F → B → C.
Temporal dynamics of a Markov chain
What is the probability that the chain is in state z at time t+1? It is the sum, over all states z', of the probability of being in z' at time t multiplied by the transition probability from z' to z:
  P(y_{t+1} = z) = Σ_{z'} P(y_t = z') P(z' → z).
Exercise: suppose a Markov chain with the transition probabilities shown on the slide starts at A. What is the distribution over states after two steps?
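A small sketch of the computation the exercise asks for, using a hypothetical 3-state transition matrix (the slide's actual probabilities live in a figure and are not reproduced here): the distribution after two steps is the start distribution pushed through the transition matrix twice.

```python
import numpy as np

# Hypothetical transition matrix over states (A, B, C); row i holds P(next state | state i).
P = np.array([[0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8],
              [0.9, 0.0, 0.1]])

start = np.array([1.0, 0.0, 0.0])      # the chain starts at A
after_two_steps = start @ P @ P        # P(y_2 = z) = sum_{z'} P(y_1 = z') P(z' -> z)
print(after_two_steps)                 # distribution over (A, B, C) after two steps
```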
Stationary distributions
Informally, if the set of states is {A, B, C, D, E, F}, a stationary distribution is a distribution π over the states such that after a transition, the distribution over the states is still π.
How do we get to a stationary distribution? In a regular Markov chain there is a non-zero probability of getting from any state to any other in a finite number of steps. If the transition matrix is regular, just run the chain for a long time: its steady-state behavior is the stationary distribution.
Markov Chain Monte Carlo for inference
Back to inference. Design a Markov chain such that every state is a structure and the stationary distribution of the chain is the probability distribution we care about, P(y | x).
How to do inference? Run the Markov chain for a long time, until we think it has reached its steady state, then let the chain wander around the space and collect samples. We now have samples from P(y | x).
MCMC for inference
Run the chain; after many steps, record the current structure as a sample, then keep running and recording, accumulating counts over structures. With sufficient samples, we can answer inference questions, e.g., estimate marginals from the counts or calculate the partition function (just sum over the samples).
MCMC algorithms
The Metropolis-Hastings algorithm, and Gibbs sampling (an instance of the Metropolis-Hastings algorithm); many variants exist. Remember: we are sampling from an exponentially large state space, the set of all possible assignments to the random variables.
Metropolis-Hastings [Metropolis, Rosenbluth, Rosenbluth, Teller & Teller 1953; Hastings 1970]
- Proposal distribution q(y → y'): proposes changes to the state, and could propose large changes.
- Acceptance probability α: should the proposal be accepted or not? If yes, move to the proposed state; otherwise remain in the previous state.
Metropolis-Hastings Algorithm
The distribution we care about is P(y | x).
- Start with an initial guess y0.
- Loop for t = 1, 2, ..., N:
  - Propose the next state y' by sampling from q(yt → y').
  - Calculate the acceptance probability α = min(1, [P(y' | x) q(y' → yt)] / [P(yt | x) q(yt → y')]).
  - With probability α, accept the proposal: if accepted, yt+1 ← y', else yt+1 ← yt.
- Return {y0, y1, ..., yN}.
Note that we do not need to compute the partition function: α depends on P only through a ratio, in which Z cancels. Idea: when run for enough iterations, the target distribution is invariant under these updates.
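A minimal Metropolis-Hastings sketch for an unnormalized score over label sequences, assuming a symmetric proposal that resamples one position uniformly (so the acceptance ratio reduces to exp(score(y') - score(y)) and the partition function never appears). The scoring function is a placeholder.

```python
import math, random

def metropolis_hastings(score, init, labels, n_steps=10000):
    """Sample label sequences y with probability proportional to exp(score(y)).

    Proposal: pick a position uniformly and resample its label uniformly (symmetric),
    so the acceptance probability is min(1, exp(score(y') - score(y))).
    """
    y = list(init)
    samples = []
    for _ in range(n_steps):
        y_new = list(y)
        i = random.randrange(len(y))
        y_new[i] = random.choice(labels)
        alpha = min(1.0, math.exp(score(y_new) - score(y)))
        if random.random() < alpha:            # accept the proposal
            y = y_new
        samples.append(tuple(y))
    return samples

# Placeholder scorer: rewards adjacent positions that agree (an Ising-like chain score).
def score(y):
    return sum(1.0 for a, b in zip(y, y[1:]) if a == b)

samples = metropolis_hastings(score, init=["A"] * 5, labels=["A", "B"], n_steps=20000)
print(max(set(samples), key=samples.count))    # most frequent sample, e.g. all-A or all-B
```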
Proposal functions for Metropolis-Hastings
The proposal is a design choice. Different possibilities: make only local changes to the factor graph (but then the chain might not explore widely), or make big jumps in the state space (but then the chain might move very slowly). The proposal does not have to depend on the size of the graph.
Gibbs Sampling
Start with an initial guess y = (y1, y2, ..., yn). Loop several times: for i = 1 to n, sample yi from P(yi | y1, ..., yi-1, yi+1, ..., yn, x); after each full pass we have a complete sample.
Gibbs sampling is a specific instance of the Metropolis-Hastings algorithm in which no proposal needs to be designed. The ordering of the variables is arbitrary.
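A sketch of Gibbs sampling for the same kind of chain-structured score as in the Metropolis-Hastings example: each conditional P(yi | y_-i, x) involves only the factors touching position i, so it can be normalized locally. The scores are made-up placeholders.

```python
import math, random

def gibbs(emission, transition, labels, n_sweeps=2000):
    """Gibbs sampler for P(y) ∝ exp(sum_i emission[i][y_i] + sum_i transition[(y_{i-1}, y_i)])."""
    n = len(emission)
    y = [random.choice(labels) for _ in range(n)]
    samples = []
    for _ in range(n_sweeps):
        for i in range(n):                               # resample one variable at a time
            weights = []
            for lab in labels:
                s = emission[i][lab]
                if i > 0:
                    s += transition[(y[i - 1], lab)]
                if i < n - 1:
                    s += transition[(lab, y[i + 1])]     # only the factors touching i matter
                weights.append(math.exp(s))
            y[i] = random.choices(labels, weights=weights)[0]
        samples.append(tuple(y))
    return samples

# Usage with made-up local scores for a 4-position, 2-label chain.
labels = ["A", "B"]
emission = [{"A": 0.5, "B": 0.0}, {"A": 0.0, "B": 0.3}, {"A": 0.2, "B": 0.2}, {"A": 0.0, "B": 0.6}]
transition = {("A", "A"): 0.4, ("A", "B"): 0.0, ("B", "A"): 0.0, ("B", "B"): 0.4}
samples = gibbs(emission, transition, labels)
print(max(set(samples), key=samples.count))              # most frequent complete assignment
```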
MAP inference with MCMC
So far we have only seen how to collect samples. Marginal inference with samples is easy: compute the marginal probabilities from the samples. For MAP inference, find the sample with the highest probability; to help convergence to the maximum, the acceptance condition is modified using a temperature parameter T that increases with every step, similar to simulated annealing.
Summary of MCMC methods
A different approach to inference, with no guarantee of exactness. General idea: set up a Markov chain whose stationary distribution is the probability distribution we care about, then run the chain, collect samples, and aggregate. Metropolis-Hastings, Gibbs sampling, and many, many variants abound. Useful when exact inference is intractable; typically low memory costs, and Gibbs sampling only makes local changes.
Inference
What is inference? The prediction step; more broadly, an aggregation operation on the space of outputs for an example: max, expectation, sample, sum.
Different flavors: MAP, marginal, loss-augmented. Many algorithms and solution strategies; one size doesn't fit all.
Next steps: How can we take advantage of domain knowledge in inference? How can we deal with making predictions about latent variables for which we don't have data?