Advanced ML: Inference
Lecture 8: Inference
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Some slides are adapted from Vivek Srikumar's course on Structured Prediction.
So far, what we have learned: thinking about structures. A structure is a graph, a collection of parts that are labeled jointly, i.e., a collection of interdependent decisions.
Next: prediction, the step that sets structured prediction apart from binary/multiclass classification.
The bigger picture
The goal of structured prediction: predicting a graph.
- Modeling: defining probability distributions over the random variables; involves making independence assumptions.
- Inference: the computational step that actually constructs the output; also called decoding.
- Learning: creating the functions that score predictions (e.g., learning model parameters).
Computational issues
- Model definition: What are the parts of the output? What are the inter-dependencies? What background knowledge about the domain can we use?
- Training: How do we train the model? How difficult is data annotation? Can we learn semi-supervised or with indirect supervision?
- Inference: deriving the probability of one or more random variables based on the model.
What is inference?
An overview of what we have seen before: inference as combinatorial optimization.
Different views of inference:
- Integer programming
- Graph algorithms: sum-product, max-sum
- Heuristics for inference: LP relaxation, sampling
Remember sequence prediction
Goal: find the most probable / highest-scoring state sequence, argmax_y score(y) = argmax_y w^T φ(x, y). Computationally, this is discrete optimization.
The naive algorithm: enumerate all sequences, score each one, and pick the max. A terrible idea! We can do better because the scores decompose over edges.
The Viterbi algorithm: Recurrence
Goal: find argmax_y w^T φ(x, y) over sequences y = (y1, y2, ..., yn).
Idea:
1. If I know the best score of every sequence over y1 to yn-1 (for each value of yn-1), then I can decide yn easily.
2. Recurse to get the scores up to yn-1.
Recurrence: score_1(s) = score_local_1(s, START), and
score_i(s) = max_{y_{i-1}} [ score_{i-1}(y_{i-1}) + score_local_i(y_{i-1}, s) ].
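To make the recurrence concrete, here is a minimal Viterbi sketch for a first-order sequence model with additive edge scores; the emission and transition score arrays stand in for the decomposed w^T φ and are hypothetical.

```python
import numpy as np

def viterbi(emission, transition):
    """MAP sequence for a first-order chain model.

    emission:   (n, L) array; emission[i, s] is the local score of label s at position i.
    transition: (L, L) array; transition[s, t] is the score of moving from label s to t.
    Returns the highest-scoring label sequence and its score.
    """
    n, L = emission.shape
    score = np.full((n, L), -np.inf)
    back = np.zeros((n, L), dtype=int)

    score[0] = emission[0]                     # score_1(s) = score_local_1(s, START)
    for i in range(1, n):
        for s in range(L):
            # score_i(s) = max_{y_{i-1}} [score_{i-1}(y_{i-1}) + score_local_i(y_{i-1}, s)]
            cand = score[i - 1] + transition[:, s] + emission[i, s]
            back[i, s] = int(np.argmax(cand))
            score[i, s] = cand[back[i, s]]

    # Backtrack from the best final label.
    y = [int(np.argmax(score[-1]))]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1], float(score[-1].max())

# Tiny usage example with made-up scores: 4 positions, 3 labels.
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```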
Inference questions
In this class, we mostly use inference to mean: "What is the highest-scoring assignment to the output random variables for a given input?" This is Maximum A Posteriori (MAP) inference (if the score is probabilistic).
Other inference questions:
- What is the highest-scoring assignment to some of the output variables given the input?
- Sampling from the posterior distribution over y.
- Loss-augmented inference: which structure most violates the margin for a given scoring function?
- Computing marginal probabilities over y.
MAP inference is discrete optimization
MAP inference is a combinatorial problem. Its computational complexity depends on the size of the input and on the factorization of the scores; more complex factors generally lead to more expensive inference.
A generally bad strategy in all but the simplest cases: "enumerate all possible structures and pick the highest-scoring one."
MAP inference is search
We want the highest-scoring structure (a graph): argmax_y w^T φ(x, y). Without assumptions, no algorithm can find the max without considering every possible structure.
How can we solve this computational problem? Exploit the structure of the search space and of the cost function, that is, exploit the decomposition of the scoring function. Usually, stronger assumptions lead to easier inference (e.g., consider 10 independent random variables, each of which can be maximized separately).
Approaches for inference
Exact vs. approximate inference: should the maximization be performed exactly, or is a close-to-highest-scoring structure good enough?
- Exact: search, dynamic programming, integer linear programming, ...
- Heuristic (approximate inference): Gibbs sampling, belief propagation, beam search, linear programming relaxations, ...
Randomized vs. deterministic (relevant for approximate inference): if I run the inference program twice, will I get the same answer?
Coming up
- Formulating general inference as integer linear programs, and variants of this idea.
- Graph algorithms, dynamic programming, greedy search. We have seen the Viterbi algorithm, which uses a cleverly defined ordering to decompose the output into a sequence of decisions; we will talk about general algorithms: max-product and sum-product.
- Heuristics for inference: sampling (Gibbs sampling), approximate graph search (beam search), LP relaxation.
Inference: Integer Linear Programs
The big picture
MAP inference is combinatorial optimization, and combinatorial optimization problems can be written as integer linear programs (ILPs). The conversion is not always trivial, but it allows injecting "knowledge" in the form of constraints.
Different ways of solving ILPs:
- Commercial solvers: CPLEX, Gurobi, etc.
- Specialized solvers if you know something about your problem: Lagrangian relaxation, amortized inference, etc.
- Relax to a linear program and hope for the best.
Integer linear programming is NP-hard in general; there is no free lunch.
Detour: Linear programming
Minimize a linear objective function subject to a finite number of linear constraints (equalities or inequalities). Very widely applicable: operations research, micro-economics, management.
Historical note: developed during World War II to reduce army costs; "programming" here is not the same as computer programming.
Example: The diet problem
A student wants to spend as little money on food as possible while getting a sufficient amount of vitamin Z and nutrient X. Her options are:

Item                  Cost/100g   Vitamin Z   Nutrient X
Carrots               2           4           0.4
Sunflower seeds       6           10
Double cheeseburger   0.3         0.01

How should she spend her money to get at least 5 units of vitamin Z and 3 units of nutrient X?
Let c, s, and d denote how much of each item is purchased. The linear program: minimize the total cost, subject to getting at least 5 units of vitamin Z, at least 3 units of nutrient X, and the amounts purchased being non-negative.
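As a sketch of how such an LP is solved in practice, here is the diet problem written with scipy.optimize.linprog. The cost and nutrient coefficients below are illustrative placeholders (the table above is only partially filled in), so take the structure, not the numbers, from this example.

```python
from scipy.optimize import linprog

# Decision variables: amounts (in 100g units) of carrots, sunflower seeds, cheeseburger.
# The coefficients are illustrative placeholders, not the lecture's exact table.
cost       = [2.0, 6.0, 1.0]        # objective: minimize cost^T [c, s, d]
vitamin_z  = [4.0, 10.0, 0.3]       # must total at least 5
nutrient_x = [0.4, 3.0, 0.01]       # must total at least 3

# linprog minimizes c^T x subject to A_ub @ x <= b_ub, so ">=" constraints are negated.
A_ub = [[-v for v in vitamin_z],
        [-n for n in nutrient_x]]
b_ub = [-5.0, -3.0]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3, method="highs")
print(res.x, res.fun)               # optimal purchase amounts and the minimum cost
```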
Linear programming
In general, a linear program has the form: minimize c^T x subject to A x ≤ b (and possibly equality constraints).
This is a continuous optimization problem, and yet there are only a finite set of candidate solutions: the constraint matrix defines a polytope, and only the vertices (or faces) of the polytope can be solutions. For example, if the feasible region over (x1, x2, x3) is the triangle cut out by a1 x1 + a2 x2 + a3 x3 = b in the non-negative octant, then for any objective c^T x the maximum is attained at one of the three vertices.
Linear programs can be solved in polynomial time.

Geometry of linear programming
The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed). The objective defines a cost for every point in the space. Even though all points in the region are allowed, the vertices maximize/minimize the cost.
Integer linear programming
In general, an integer linear program has the same form as a linear program, with the added requirement that the variables are integers: minimize c^T x subject to A x ≤ b, x ∈ Z^n.
Geometry: the constraint matrix still defines a polytope and the objective a cost for every point, but only integer points are allowed. Solving integer linear programs in general can be NP-hard! LP relaxation: drop the integer constraints and hope for the best.

0-1 integer linear programming
An instance of integer linear programming in which every variable is binary: x ∈ {0, 1}^n. Still NP-hard.
Geometry: we are only considering points that are vertices of the Boolean hypercube, and the constraints prohibit certain vertices (only points within the feasible region are allowed). The solution can be an interior point of the constraint set defined by A x ≤ b; it need not be a vertex of that polytope.
Back to structured prediction
Recall that we are solving argmax_y w^T φ(x, y). The goal is to produce a graph, and the set of possible values that y can take is finite but large.
General idea: frame the argmax problem as a 0-1 integer linear program. This allows the addition of arbitrary constraints.
Thinking in ILPs
Let's start with multi-class classification: argmax_{y ∈ {A,B,C}} w^T φ(x, y) = argmax_{y ∈ {A,B,C}} score(y).
Introduce a 0-1 decision variable for each label:
- zA = 1 if the output is A, 0 otherwise
- zB = 1 if the output is B, 0 otherwise
- zC = 1 if the output is C, 0 otherwise
Maximize the score, zA·score(A) + zB·score(B) + zC·score(C), subject to picking exactly one label: zA + zB + zC = 1 with zA, zB, zC ∈ {0, 1}.
We have taken a trivial problem (finding the highest-scoring element of a list) and converted it into a representation that is NP-hard in the worst case! Lesson: don't solve multiclass classification with an ILP solver.
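As a concrete (and deliberately overkill) sketch, the multiclass ILP above can be written with the PuLP modeling library; the label scores here are made-up numbers.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

scores = {"A": 1.3, "B": 0.2, "C": 2.5}                       # made-up label scores

prob = LpProblem("multiclass_as_ilp", LpMaximize)
z = {y: LpVariable(f"z_{y}", cat="Binary") for y in scores}   # one indicator per label

prob += lpSum(scores[y] * z[y] for y in scores)               # maximize the score
prob += lpSum(z.values()) == 1                                # pick exactly one label

prob.solve()
print([y for y in scores if value(z[y]) == 1])                # -> ['C']
```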
ILP for a general conditional model
Suppose each yi can be A, B, or C, and the score decomposes over the parts of a factor graph, e.g.
  max_y w^T φ(x1, y1) + w^T φ(y1, y2, y3) + w^T φ(x3, y2, y3) + w^T φ(x1, x2, y2).
Introduce one 0-1 decision variable for each part being assigned a particular labeling:
- unary variables such as z1A, z1B, z1C and z2A, z2B, z2C;
- higher-order variables such as z13AA, z13AB, ..., z13CC and z23AA, z23AB, ..., z23CC.
Each of these decision variables is associated with a score. Not all decisions can exist together; e.g., z13AB implies z1A and z3B.
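A sketch of this encoding with PuLP for a single pairwise part over (y1, y3): the pairwise indicators must agree with the unary indicators, which is the linear form of "z13AB implies z1A and z3B". The scores are made-up placeholders.

```python
import itertools, random
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

labels = ["A", "B", "C"]
random.seed(0)
unary_score = {(i, a): random.uniform(-1, 1) for i in (1, 3) for a in labels}    # made up
pair_score = {(a, b): random.uniform(-1, 1) for a, b in itertools.product(labels, repeat=2)}

prob = LpProblem("pairwise_part", LpMaximize)
z = {(i, a): LpVariable(f"z{i}{a}", cat="Binary") for i in (1, 3) for a in labels}
zp = {(a, b): LpVariable(f"z13{a}{b}", cat="Binary") for a, b in itertools.product(labels, repeat=2)}

# Objective: unary scores plus the score of the pairwise part.
prob += (lpSum(unary_score[i, a] * z[i, a] for i, a in z)
         + lpSum(pair_score[a, b] * zp[a, b] for a, b in zp))

# Each variable takes exactly one label; exactly one pairwise assignment is active.
prob += lpSum(z[1, a] for a in labels) == 1
prob += lpSum(z[3, a] for a in labels) == 1
prob += lpSum(zp.values()) == 1

# Consistency: z13ab implies z1a and z3b, written as linear inequalities.
for a, b in zp:
    prob += zp[a, b] <= z[1, a]
    prob += zp[a, b] <= z[3, b]

prob.solve()
print({i: a for (i, a) in z if value(z[i, a]) == 1})
```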
Writing constraints as linear inequalities
- Exactly one of z1A, z1B, z1C is true: z1A + z1B + z1C = 1
- At least m of z1A, z1B, z1C are true: z1A + z1B + z1C ≥ m
- At most m of z1A, z1B, z1C are true: z1A + z1B + z1C ≤ m
- Implication zi → zj: convert to the disjunction ¬zi ∨ zj (at least one of "not zi" or zj), i.e., (1 − zi) + zj ≥ 1
Integer linear programming for inference
It is easy to add additional knowledge: specify it as Boolean formulas, e.g., "if y1 is an A, then y2 or y3 should be a B or C", or "no more than two A's are allowed in the output".
Many inference problems have "standard" mappings to ILPs: sequences, parsing, dependency parsing.
The encoding of the problem makes a difference in solving time: the mechanical encoding may not be efficient to solve, and in general more complex constraints make solving harder.
Exercise: Sequence labeling
Goal: find argmax_y w^T φ(x, y) for a sequence y = (y1, y2, ..., yn). How can this be written as an ILP?
ILP for inference: Remarks
Many combinatorial optimization problems can be written as ILPs, even the "easy"/polynomial ones; given an ILP, checking whether it represents a polynomial problem is intractable in general.
ILPs are a general language for thinking about combinatorial optimization, and the representation allows us to make general statements about inference.
Off-the-shelf solvers for ILPs (Gurobi, CPLEX) are quite good, but use an off-the-shelf solver only if you can't solve your inference problem otherwise.
Inference: Graph Algorithms (Belief Propagation)
Variable elimination (motivation)
Remember: we have a collection of inference variables that need to be assigned, y = (y1, y2, ...).
General algorithm:
1. Fix an ordering of the variables, say (y1, y2, ...).
2. Iteratively: find the best value for yi given the values of the previous neighbors.
3. Use back pointers to recover the final answer.
Viterbi is an instance of max-product variable elimination.
Variable elimination example (max-sum)
For a chain y1, y2, ..., yn with local scores score_local_i(y_{i-1}, y_i), define
  score_1(s) = score_local_1(s, START)
  score_i(s) = max_{y_{i-1}} [ score_{i-1}(y_{i-1}) + score_local_i(y_{i-1}, s) ].
First eliminate y1 (computing score_2 from score_1), then eliminate y2, then y3, and so on. After eliminating y_{n-1}, we have all the information needed to make a decision for yn, and back pointers recover the rest of the sequence.
Two types of inference problems
Marginals: find the marginal probability P(y_i | x), summing the joint distribution over the remaining variables.
Maximizer (MAP): find argmax_y P(y | x).
Both are probability computations carried out in different domains (sum vs. max).
Belief Propagation
BP provides the exact solution when there are no loops in the graph (Viterbi is a special case); otherwise, "loopy" BP provides an approximate solution.
We use sum-product BP as the running example, where we want to compute the partition function Z of the factorized distribution P(y | x) = (1/Z) ∏_parts f_part(y_part, x).
Intuition
An iterative process in which neighboring variables "pass messages" to each other: "I (variable x3) think that you (variable x2) belong in these states with various likelihoods...". After enough iterations, the conversation is likely to converge to a consensus that determines the marginal probabilities of all the variables.
Message
The message from node i to node j is a function m_ij(x_j). A message is not a probability: it may not sum to 1. A high value of m_ij(x_j) means that node i "believes" the marginal value P(x_j) to be high.
Beliefs
Estimated marginal probabilities are called beliefs. The algorithm: update messages until convergence, then calculate the beliefs.
Message update
To update the message from i to j, consider all messages flowing into i (except the one coming from j):
  m_ij(x_j) = Σ_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ Nbd(i) \ j} m_ki(x_i),
where g_i is the local (emission) score of node i and f_ij is the pairwise score. The belief at node i is proportional to g_i(x_i) ∏_{k ∈ Nbd(i)} m_ki(x_i).
Sum-product vs. max-product
The standard BP we just described is sum-product, used to estimate marginals. A variant called max-product (or max-sum in log space) is used to estimate the MAP assignment.
Max-product
The message update is the same as before, except that the sum is replaced by a max:
  m_ij(x_j) = max_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ Nbd(i) \ j} m_ki(x_i).
The beliefs then estimate the most likely states.
Example (sum-product)
Consider a small tree of six binary variables (states A and B): a root connected to two leaves and to an internal node, which in turn has two leaves of its own; so there are 2^6 possible assignments. For simplicity, every node has the same emission scores (A: 2, B: 1) and every edge the same transition scores (A→A: 1, A→B: 2, B→A: 3, B→B: 4).
- Message from a leaf to its parent: A: 1×2 + 3×1 = 5, B: 2×2 + 4×1 = 8.
- At the internal node, combining its emission with its two leaf messages: A: 2×(5×5) = 50, B: 1×(8×8) = 64. Verify the A entry by enumerating its subtree: (A,(A,A)) = 2×(1×2)×(1×2) = 8, (A,(A,B)) = 2×(1×2)×(3×1) = 12, (A,(B,A)) = 12, (A,(B,B)) = 2×(3×1)×(3×1) = 18, and 8 + 12 + 12 + 18 = 50 = 2×(2+3)×(2+3).
- Message from the internal node to the root: A: 1×50 + 3×64 = 242, B: 2×50 + 4×64 = 356.
- Belief at the root (its emission times all incoming messages): A: 2×242×5×5 = 12,100, B: 1×356×8×8 = 22,784.
- Partition function: Z = 12,100 + 22,784 = 34,884.
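A small script to check this computation: it runs a sum-product upward pass and compares the result with brute-force enumeration. The tree layout is the one reconstructed above from the slide's numbers, so treat it as an assumption of this sketch.

```python
import itertools

states = ["A", "B"]
emit = {"A": 2.0, "B": 1.0}                                  # same emission at every node
trans = {("A", "A"): 1.0, ("A", "B"): 2.0,                   # trans[(child_state, parent_state)]
         ("B", "A"): 3.0, ("B", "B"): 4.0}

# Assumed tree: node 0 is the root with children [1, 2, 3]; node 1 has children [4, 5].
children = {0: [1, 2, 3], 1: [4, 5], 2: [], 3: [], 4: [], 5: []}

def upward_message(node):
    """Sum-product message from `node` to its parent, as a dict over the parent's state."""
    msg = {}
    for parent_state in states:
        total = 0.0
        for s in states:
            inner = emit[s] * trans[(s, parent_state)]
            for c in children[node]:
                inner *= upward_message(c)[s]
            total += inner
        msg[parent_state] = total
    return msg

# Belief at the root = its emission times the product of incoming messages; Z = sum over states.
belief = {s: emit[s] for s in states}
for c in children[0]:
    m = upward_message(c)
    for s in states:
        belief[s] *= m[s]
Z = sum(belief.values())

# Brute force: sum over all 2^6 assignments of the product of all emission and edge factors.
def assignment_score(y):
    score = 1.0
    for s in y:
        score *= emit[s]
    for parent, kids in children.items():
        for c in kids:
            score *= trans[(y[c], y[parent])]
    return score

Z_brute = sum(assignment_score(y) for y in itertools.product(states, repeat=6))
print(belief, Z, Z_brute)   # belief ≈ {'A': 12100, 'B': 22784}; Z = Z_brute = 34884
```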
Example (max-product)
Same tree and scores as before, but messages now use the max-product update
  m_ij^new(x_j) = max_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ Nbd(i) \ j} m_ki^old(x_i).
- Message from a leaf to its parent: A: max(1×2, 3×1) = 3, B: max(2×2, 4×1) = 4.
- At the internal node: A: 2×(3×3) = 18, B: 1×(4×4) = 16. (Compare with the enumeration above: the best completion of its subtree with the internal node fixed to A is (A,(B,B)), with score 18.)
- Message from the internal node to the root: A: max(1×18, 3×16) = 48, B: max(2×18, 4×16) = 64.
- Max-product belief at the root: A: 2×48×3×3 = 864, B: 1×64×4×4 = 1,024.
So the highest-scoring assignment has score 1,024, and following the back pointers recorded with each max recovers a maximizing assignment.
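The same script adapted to max-product: the sum becomes a max, and back pointers are kept for decoding. The tree layout is again the reconstruction assumed above, and ties mean the decoded maximizer need not be unique.

```python
states = ["A", "B"]
emit = {"A": 2.0, "B": 1.0}
trans = {("A", "A"): 1.0, ("A", "B"): 2.0, ("B", "A"): 3.0, ("B", "B"): 4.0}
children = {0: [1, 2, 3], 1: [4, 5], 2: [], 3: [], 4: [], 5: []}   # assumed tree

def max_message(node):
    """Max-product message to the parent, plus back pointers for decoding."""
    msg, back = {}, {}
    for parent_state in states:
        best_val, best_s = -1.0, None
        for s in states:
            val = emit[s] * trans[(s, parent_state)]
            for c in children[node]:
                val *= max_message(c)[0][s]
            if val > best_val:
                best_val, best_s = val, s
        msg[parent_state], back[parent_state] = best_val, best_s
    return msg, back

# Max-product belief at the root, and the best root label.
root_belief = {s: emit[s] for s in states}
for c in children[0]:
    m, _ = max_message(c)
    for s in states:
        root_belief[s] *= m[s]
best_root = max(root_belief, key=root_belief.get)

def decode(node, parent_state, labels):
    """Follow back pointers downward to recover one maximizing assignment."""
    labels[node] = max_message(node)[1][parent_state]
    for c in children[node]:
        decode(c, labels[node], labels)

labels = {0: best_root}
for c in children[0]:
    decode(c, best_root, labels)
print(root_belief, labels)   # root_belief ≈ {'A': 864, 'B': 1024}; one assignment scoring 1024
```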
Inference: Graph Algorithms (General Search)
Inference as search: General setting
Predicting a graph as a sequence of decisions. General data structures:
- State: encodes a partial structure.
- Transitions: move from one partial structure to another.
- Start state. End state: we have a full structure (there may be more than one end state).
Each transition is scored with the learned model. Goal: find an end state that has the highest total score.
Example
Suppose each y can be one of A, B, or C for the factor graph over (y1, y2, y3).
- State: triples (y1, y2, y3), each possibly unknown, e.g., (A, -, -), (-, A, A), (-, -, -), ...
- Transition: fill in one of the unknowns.
- Start state: (-, -, -). End state: all three y's are assigned.
The search tree expands (-, -, -) into (A, -, -), (B, -, -), (C, -, -), then (A, A, -), ..., (C, C, -), down to the complete states (A, A, A), ..., (C, C, C).
Note: here we have assumed an ordering (y1, y2, y3). How do the transitions get scored?
Graph search algorithms
Breadth-first / depth-first search: keep a stack / queue / priority queue of "open" states.
The good: guaranteed to be correct, since it explores every option.
The bad: it explores every option, which can be slow for any non-trivial graph.
Greedy search
At each state, choose the highest-scoring next transition; keep only one state in memory, the current state.
What is the problem? Local decisions may override the global optimum, and the full search space is not explored.
Greedy algorithms can still give the true optimum for special classes of problems, e.g., maximum-spanning-tree algorithms are greedy.
Example
In the search space above, greedy search keeps only one path through the tree, e.g., (-, -, -) → (A, -, -) → (A, B, -) → (A, B, C), instead of expanding every child.
Example (greedy)
Using the same emission scores (A: 2, B: 1) and transition scores (A→A: 1, A→B: 2, B→A: 3, B→B: 4) as in the belief-propagation example, the slide steps through the greedy computation on the graph, reaching intermediate values A: 2×2×2 = 8 vs. B: 2×2×1×2×2 = 16 and final values A: 2×3×16×2×2 = 384 vs. B: 1×16×4×2×2×2×2 = 1,024.
Beam search: A compromise
Keep a size-limited priority queue of states, called the beam, sorted by the total score of each state. At each step, explore all transitions from the states in the beam, add the results to the beam, and trim it back to size.
The good: explores more than greedy search. The bad: a good state might fall out of the beam. In general, easy to implement and very popular, but with no guarantees.
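A minimal beam-search sketch for sequence labeling, assuming the score decomposes into per-position emission and transition scores (the same decomposition used for Viterbi); with beam_size = 1 it reduces to greedy search.

```python
def beam_search(emission, transition, labels, beam_size=3):
    """emission[i][y]: score of label y at position i; transition[(y_prev, y)]: edge score.
    Returns (best_sequence, best_score) among the hypotheses kept in the beam."""
    beam = [((), 0.0)]                                   # (partial sequence, total score)
    for i in range(len(emission)):
        candidates = []
        for seq, score in beam:                          # expand every hypothesis in the beam
            for y in labels:
                s = score + emission[i][y]
                if seq:
                    s += transition[(seq[-1], y)]
                candidates.append((seq + (y,), s))
        # Keep only the highest-scoring hypotheses (trim the beam).
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0]

# Usage with made-up scores for a 3-position, 2-label problem.
labels = ["A", "B"]
emission = [{"A": 1.0, "B": 0.5}, {"A": 0.2, "B": 0.9}, {"A": 0.4, "B": 0.3}]
transition = {("A", "A"): 0.1, ("A", "B"): 0.7, ("B", "A"): 0.0, ("B", "B"): 0.2}
print(beam_search(emission, transition, labels, beam_size=2))
```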
Example (beam = 3)
At each step, calculate scores for all expansions but ignore hypotheses that were already removed from the beam, then keep only the best three. (Credit: Graham Neubig)
Structured prediction approaches based on search
Learning-to-search approaches assume the complex decision is incrementally constructed by a sequence of decisions (e.g., DAgger, SEARN, transition-based methods) and learn how to make the decision at each branch.
Example: Dependency Parsing
Dependency parsing identifies relations between words, e.g., in "I ate a cake with a fork". We use dependency parsing as the running example for how the learning-to-search (L2S) approach works.
Learning to search (L2S) approaches
Step 1: define a search space and features. Example: transition-based dependency parsing [Nivre03, NIPS16]: maintain a buffer and a stack, make predictions from left to right, and use three (or four) types of actions: Shift, Reduce-Left, Reduce-Right. (Credit: Google research blog)
Learning to search approaches: Shift-Reduce parser
Maintain a buffer and a stack and make predictions from left to right with three (or four) types of actions: Shift, Reduce-Left, Reduce-Right. For example, starting from the buffer "I ate a cake", a Shift moves "I" onto the stack; a subsequent Reduce-Left can then attach "I" as a dependent of "ate" and remove it from the stack, leaving "ate a cake" to be processed.
Learning to search (L2S) approaches
1. Define a search space and features.
2. Construct a reference policy (Ref) based on the gold labels.
3. Learn a policy that imitates Ref.
Learning to search is related to reinforcement learning, but the difference is that here we have the true annotations. In the training phase they are used to build a reference policy that guides the traversal of the search space: for example, the reference policy knows which actions generate the correct parse tree. We then learn a policy that imitates this reference policy, so that at test time, without the reference, the learned policy can still make the right decisions. Existing results show that, under some conditions and with enough data, the learned policy can be as good as the reference policy.
Policies
A policy maps observations to actions: π(o) = a, where the observation o can include the input x, the timestep t, the partial trajectory τ, or anything else. So far this approach sounds like reinforcement learning; but wait, reinforcement learning is hard.
Imitation learning for joint prediction
Challenges: there are a combinatorial number of search states, and how does a sub-decision affect the final decision?
Credit Assignment Problem
When something goes wrong, which decision should be blamed?
Imitation learning for joint prediction
SEARN [Daumé, Langford & Marcu], DAgger [Ross, Gordon & Bagnell], AggreVaTe [Ross & Bagnell], LOLS [Chang, Krishnamurthy, Agarwal, Daumé & Langford].
Learning a Policy [ICML 15, Ross+15]
Roll in with the current policy to reach a state "?", try each one-step deviation at that state, and roll out to the end to measure the loss of each deviation (e.g., loss = 0, 0.2, 0.8). At the "?" state we then construct a cost-sensitive multi-class example (?, [0, .2, .8]).
Example: Sequence Labeling
- Receive input: x = "the monster ate the sandwich", with gold labels y = Dt Nn Vb Dt Nn.
- Make a sequence of predictions (roll-in): ŷ = Dt Dt Dt Dt Dt.
- Pick a timestep (here, the word "monster") and try all perturbations there, rolling out the rest:
  ŷ_Dt = Dt Dt Vb Dt Nn, loss = 1; ŷ_Nn = Dt Nn Vb Dt Nn, loss = 0; ŷ_Vb = Dt Vb Vb Dt Nn, loss = 1.
- Compute the losses and construct the cost-sensitive example ({w=monster, p=Dt, ...}, [1, 0, 1]).
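A sketch of how such a cost-sensitive example can be constructed, assuming Hamming loss and a reference-policy rollout that simply copies the gold labels after the deviation; the feature extraction is a placeholder.

```python
def hamming(pred, gold):
    return sum(p != g for p, g in zip(pred, gold))

def one_step_deviation_example(x, gold, rollin, t, label_set):
    """Build a cost-sensitive example at timestep t (0-indexed).

    rollin: labels predicted by the current policy; positions before t are kept.
    The rollout here copies the gold labels after t (a reference-policy rollout).
    Returns (features, costs) where costs[i] is the loss of choosing label_set[i] at t.
    """
    costs = []
    for a in label_set:
        full = rollin[:t] + [a] + gold[t + 1:]        # deviation at t, reference rollout after
        costs.append(hamming(full, gold))
    features = {"w": x[t], "p": rollin[t - 1] if t > 0 else "<s>"}   # placeholder features
    return features, costs

x = "the monster ate the sandwich".split()
gold = ["Dt", "Nn", "Vb", "Dt", "Nn"]
rollin = ["Dt", "Dt", "Dt", "Dt", "Dt"]               # roll-in by the current (bad) policy
print(one_step_deviation_example(x, gold, rollin, 1, ["Dt", "Nn", "Vb"]))
# -> ({'w': 'monster', 'p': 'Dt'}, [1, 0, 1])
```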
Learning to search approaches: Credit Assignment Compiler [NIPS16]
Write the decoder, providing some side information for training; the library translates this piece of program, together with data, into the update rules of the model. Applied to dependency parsing, named entity recognition, relation extraction, POS tagging, and more. Implementation: Vowpal Wabbit.
The key challenge is that representing a search space is not intuitive: usually you have to define state machines, so people often implement heuristic update rules instead of algorithms with good guarantees, which leads to suboptimal performance. The proposed programming abstraction lets the user represent the search space as a piece of program, and the library works as a compiler that translates this program, plus data, into the model's update rules (analogous to FACTORIE for probabilistic models).
Approximate Inference (Inference by Sampling)
Inference by sampling
Monte Carlo methods are a large class of algorithms with origins in physics. Basic idea: repeatedly sample from a distribution and compute aggregate statistics from the samples, e.g., the marginal distribution. Useful when we have many, many interacting variables.
Why sampling works
Suppose we have some probability distribution P(z), which might be a cumbersome function, and we want to answer questions about it, e.g., what is its mean or the expectation of some function under it? Approximate the answer with samples {z1, z2, ..., zn} drawn from the distribution. Theory (Chernoff-Hoeffding style bounds) tells us that this is a good estimator.
Key idea: rejection sampling
Draw candidates from a simple proposal distribution that covers the target and keep each one only with a probability proportional to how well it matches the target. Rejection sampling works well when the number of variables is small.
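A minimal rejection-sampling sketch for an unnormalized target score over a small discrete space, assuming a known bound M ≥ max_y score(y); the score function is illustrative, not from the slides.

```python
import random

def rejection_sample(score, domain, M, n_samples=10000):
    """Draw samples from p(y) ∝ score(y) by proposing uniformly over `domain`
    and accepting y with probability score(y) / M, where M >= max_y score(y)."""
    samples = []
    while len(samples) < n_samples:
        y = random.choice(domain)              # proposal: uniform over the domain
        if random.random() < score(y) / M:     # accept with probability score(y) / M
            samples.append(y)
    return samples

# Usage: a made-up unnormalized score over three structures.
domain = ["y1", "y2", "y3"]
score = lambda y: {"y1": 0.4, "y2": 1.0, "y3": 2.6}[y]
samples = rejection_sample(score, domain, M=2.6)
print({y: samples.count(y) / len(samples) for y in domain})   # ≈ normalized scores
```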
The Markov Chain Monte Carlo revolution
Goal: generate samples from a distribution P(y | x) that may be intractable to sample from directly.
Idea: draw examples in a way that, in the long run, their distribution is close to P(y | x). Formally, construct a Markov chain of structures whose stationary distribution converges to P; this is an iterative process that constructs examples. Initially the samples might not come from the target distribution, but after a long enough time they come from a distribution that is increasingly close to P.
A detour. Recall: Markov chains
A collection of random variables y0, y1, y2, ..., yt forms a Markov chain if the i-th state depends only on the previous one. For a chain over the states {A, B, C, D, E, F} with given transition probabilities, example trajectories are A → B → C → D → E → F and F → A → A → E → F → B → C.
Temporal dynamics of a Markov chain
What is the probability that the chain is in state z at time t+1? It is the sum, over all states z', of the probability of being in z' at time t multiplied by the transition probability from z' to z:
  P(y_{t+1} = z) = Σ_{z'} P(y_t = z') P(z' → z).
Exercise: suppose a Markov chain with the transition probabilities shown on the slide starts at A. What is the distribution over states after two steps?
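A small sketch of the computation the exercise asks for, using a hypothetical 3-state transition matrix (the slide's actual probabilities live in a figure and are not reproduced here): the distribution after two steps is the start distribution pushed through the transition matrix twice.

```python
import numpy as np

# Hypothetical transition matrix over states (A, B, C); row i holds P(next state | state i).
P = np.array([[0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8],
              [0.9, 0.0, 0.1]])

start = np.array([1.0, 0.0, 0.0])      # the chain starts at A
after_two_steps = start @ P @ P        # P(y_2 = z) = sum_{z'} P(y_1 = z') P(z' -> z)
print(after_two_steps)                 # distribution over (A, B, C) after two steps
```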
Stationary distributions
Informally, if the set of states is {A, B, C, D, E, F}, a stationary distribution is a distribution π over the states such that after a transition, the distribution over the states is still π.
How do we get to a stationary distribution? In a regular Markov chain there is a non-zero probability of getting from any state to any other in a finite number of steps. If the transition matrix is regular, just run the chain for a long time: its steady-state behavior is the stationary distribution.
Markov Chain Monte Carlo for inference
Back to inference. Design a Markov chain such that every state is a structure and the stationary distribution of the chain is the probability distribution we care about, P(y | x).
How to do inference? Run the Markov chain for a long time, until we think it has reached its steady state, then let the chain wander around the space and collect samples. We now have samples from P(y | x).
MCMC for inference
Run the chain; after many steps, record the current structure as a sample, then keep running and recording, accumulating counts over structures. With sufficient samples, we can answer inference questions, e.g., estimate marginals from the counts or calculate the partition function (just sum over the samples).
MCMC algorithms
The Metropolis-Hastings algorithm, and Gibbs sampling (an instance of the Metropolis-Hastings algorithm); many variants exist. Remember: we are sampling from an exponentially large state space, the set of all possible assignments to the random variables.
Metropolis-Hastings [Metropolis, Rosenbluth, Rosenbluth, Teller & Teller 1953; Hastings 1970]
- Proposal distribution q(y → y'): proposes changes to the state, and could propose large changes.
- Acceptance probability α: should the proposal be accepted or not? If yes, move to the proposed state; otherwise remain in the previous state.
Metropolis-Hastings Algorithm
The distribution we care about is P(y | x).
- Start with an initial guess y0.
- Loop for t = 1, 2, ..., N:
  - Propose the next state y' by sampling from q(yt → y').
  - Calculate the acceptance probability α = min(1, [P(y' | x) q(y' → yt)] / [P(yt | x) q(yt → y')]).
  - With probability α, accept the proposal: if accepted, yt+1 ← y', else yt+1 ← yt.
- Return {y0, y1, ..., yN}.
Note that we do not need to compute the partition function: α depends on P only through a ratio, in which Z cancels. Idea: when run for enough iterations, the target distribution is invariant under these updates.
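A minimal Metropolis-Hastings sketch for an unnormalized score over label sequences, assuming a symmetric proposal that resamples one position uniformly (so the acceptance ratio reduces to exp(score(y') - score(y)) and the partition function never appears). The scoring function is a placeholder.

```python
import math, random

def metropolis_hastings(score, init, labels, n_steps=10000):
    """Sample label sequences y with probability proportional to exp(score(y)).

    Proposal: pick a position uniformly and resample its label uniformly (symmetric),
    so the acceptance probability is min(1, exp(score(y') - score(y))).
    """
    y = list(init)
    samples = []
    for _ in range(n_steps):
        y_new = list(y)
        i = random.randrange(len(y))
        y_new[i] = random.choice(labels)
        alpha = min(1.0, math.exp(score(y_new) - score(y)))
        if random.random() < alpha:            # accept the proposal
            y = y_new
        samples.append(tuple(y))
    return samples

# Placeholder scorer: rewards adjacent positions that agree (an Ising-like chain score).
def score(y):
    return sum(1.0 for a, b in zip(y, y[1:]) if a == b)

samples = metropolis_hastings(score, init=["A"] * 5, labels=["A", "B"], n_steps=20000)
print(max(set(samples), key=samples.count))    # most frequent sample, e.g. all-A or all-B
```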
Proposal functions for Metropolis-Hastings
The proposal is a design choice. Different possibilities: make only local changes to the factor graph (but then the chain might not explore widely), or make big jumps in the state space (but then the chain might move very slowly). The proposal does not have to depend on the size of the graph.
Gibbs Sampling
Start with an initial guess y = (y1, y2, ..., yn). Loop several times: for i = 1 to n, sample yi from P(yi | y1, ..., yi-1, yi+1, ..., yn, x); after each full pass we have a complete sample.
Gibbs sampling is a specific instance of the Metropolis-Hastings algorithm in which no proposal needs to be designed. The ordering of the variables is arbitrary.
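A sketch of Gibbs sampling for the same kind of chain-structured score as in the Metropolis-Hastings example: each conditional P(yi | y_-i, x) involves only the factors touching position i, so it can be normalized locally. The scores are made-up placeholders.

```python
import math, random

def gibbs(emission, transition, labels, n_sweeps=2000):
    """Gibbs sampler for P(y) ∝ exp(sum_i emission[i][y_i] + sum_i transition[(y_{i-1}, y_i)])."""
    n = len(emission)
    y = [random.choice(labels) for _ in range(n)]
    samples = []
    for _ in range(n_sweeps):
        for i in range(n):                               # resample one variable at a time
            weights = []
            for lab in labels:
                s = emission[i][lab]
                if i > 0:
                    s += transition[(y[i - 1], lab)]
                if i < n - 1:
                    s += transition[(lab, y[i + 1])]     # only the factors touching i matter
                weights.append(math.exp(s))
            y[i] = random.choices(labels, weights=weights)[0]
        samples.append(tuple(y))
    return samples

# Usage with made-up local scores for a 4-position, 2-label chain.
labels = ["A", "B"]
emission = [{"A": 0.5, "B": 0.0}, {"A": 0.0, "B": 0.3}, {"A": 0.2, "B": 0.2}, {"A": 0.0, "B": 0.6}]
transition = {("A", "A"): 0.4, ("A", "B"): 0.0, ("B", "A"): 0.0, ("B", "B"): 0.4}
samples = gibbs(emission, transition, labels)
print(max(set(samples), key=samples.count))              # most frequent complete assignment
```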
MAP inference with MCMC
So far we have only seen how to collect samples. Marginal inference with samples is easy: compute the marginal probabilities from the samples. For MAP inference, find the sample with the highest probability; to help convergence to the maximum, the acceptance condition is modified using a temperature parameter T that increases with every step, similar to simulated annealing.
Summary of MCMC methods
A different approach to inference, with no guarantee of exactness. General idea: set up a Markov chain whose stationary distribution is the probability distribution we care about, then run the chain, collect samples, and aggregate. Metropolis-Hastings, Gibbs sampling, and many, many variants abound. Useful when exact inference is intractable; typically low memory costs, and Gibbs sampling only makes local changes.
Inference
What is inference? The prediction step; more broadly, an aggregation operation on the space of outputs for an example: max, expectation, sample, sum.
Different flavors: MAP, marginal, loss-augmented. Many algorithms and solution strategies; one size doesn't fit all.
Next steps: How can we take advantage of domain knowledge in inference? How can we deal with making predictions about latent variables for which we don't have data?