1 Lecture 8: Inference
Kai-Wei Chang, CS @ University of Virginia (kw@kwchang.net). Some slides are adapted from Vivek Srikumar's course on Structured Prediction.

2 Advanced ML: Inference
So far, what we have learned: thinking about structures, i.e., a graph, a collection of parts that are labeled jointly, a collection of decisions. Next: prediction, the step that sets structured prediction apart from binary/multiclass classification. [Figure: two copies of an example graph over nodes A-G.]

3 Advanced ML: Inference
The bigger picture. The goal of structured prediction is predicting a graph. Modeling: defining probability distributions over the random variables; this involves making independence assumptions. Inference: the computational step that actually constructs the output, also called decoding. Learning: creating functions that score predictions (e.g., learning model parameters).

4 Computational issues: model definition (what are the parts of the output? what are the inter-dependencies?), how to train the model, and how to do inference. Other considerations: background knowledge about the domain, data annotation difficulty, and semi-supervised or indirect supervision. Inference: deriving the probability of one or more random variables based on the model.

5 Advanced ML: Inference
What is inference? An overview of what we have seen before: inference as combinatorial optimization; different views of inference, namely integer programming and graph algorithms (sum-product, max-sum); and heuristics for inference, such as LP relaxation and sampling.

6 Remember sequence prediction
Goal: find the most probable/highest-scoring state sequence, $\arg\max_y \mathrm{score}(y) = \arg\max_y w^T \phi(x, y)$. Computationally, this is discrete optimization. The naive algorithm: enumerate all sequences, score each one, and pick the max. Terrible idea! We can do better, because the scores decompose over edges.

7 The Viterbi algorithm: Recurrence
Goal: find $\arg\max_y w^T \phi(x, y)$ for $y = (y_1, y_2, \dots, y_n)$. Idea: 1. If I know the best score of every sequence over $y_1, \dots, y_{n-1}$, then I can decide $y_n$ easily. 2. Recurse to get the score up to $y_{n-1}$.
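The recurrence translates directly into code. Below is a minimal Python sketch of Viterbi for a score that decomposes over edges; the local_score function, label set, and chain length are hypothetical placeholders rather than the lecture's actual model.

```python
# Minimal Viterbi sketch for a chain-structured score:
# score(y) = sum_i local_score(i, y[i-1], y[i]).
def viterbi(n, labels, local_score):
    """local_score(i, prev, cur) -> float; prev is None at position 0."""
    best = {y: local_score(0, None, y) for y in labels}   # best score of a prefix ending in y
    back = []                                             # back pointers, one dict per position
    for i in range(1, n):
        new_best, ptr = {}, {}
        for y in labels:
            # best previous label for current label y
            prev = max(labels, key=lambda p: best[p] + local_score(i, p, y))
            new_best[y] = best[prev] + local_score(i, prev, y)
            ptr[y] = prev
        best, back = new_best, back + [ptr]
    # recover the argmax sequence by following the back pointers
    last = max(labels, key=lambda y: best[y])
    seq = [last]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq)), best[last]

# e.g. viterbi(4, ["A", "B"], lambda i, p, y: 1.0 if y == "A" else 0.0)
```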

10 Advanced ML: Inference
Inference questions. In this class, we mostly use inference to mean: "What is the highest-scoring assignment to the output random variables for a given input?" This is Maximum A Posteriori (MAP) inference (if the score is probabilistic). Other inference questions: What is the highest-scoring assignment to some of the output variables given the input? Sample from the posterior distribution over Y. Loss-augmented inference: which structure most violates the margin for a given scoring function? Computing marginal probabilities over Y.

11 MAP inference is discrete optimization
MAP inference is a combinatorial problem. Its computational complexity depends on the size of the input and on the factorization of the scores; more complex factors generally lead to more expensive inference. A generally bad strategy, in all but the simplest cases: "Enumerate all possible structures and pick the highest-scoring one." [Figure: a list of candidate structures with scores such as 0.4, -10, 41.3.]

15 MAP inference is search
We want the highest-scoring structure, $\arg\max_y w^T \phi(x, y)$. Without assumptions, no algorithm can find the max without considering every possible structure. How can we solve this computational problem? Exploit the structure of the search space and the cost function, that is, exploit the decomposition of the scoring function. Usually, stronger assumptions lead to easier inference (e.g., consider 10 independent random variables).

16 Approaches for inference
Exact vs. approximate inference: should the maximization be performed exactly, or is a close-to-highest-scoring structure good enough? Exact: search, dynamic programming, integer linear programming, … Heuristic (called approximate inference): Gibbs sampling, belief propagation, beam search, linear programming relaxations, … Randomized vs. deterministic (relevant for approximate inference): if I run the inference program twice, will I get the same answer?

17 Advanced ML: Inference
Coming up: (1) formulating general inference as integer linear programs, and variants of this idea; (2) graph algorithms, dynamic programming, greedy search: we have seen the Viterbi algorithm, which uses a cleverly defined ordering to decompose the output into a sequence of decisions, and we will discuss a general algorithm, max-product/sum-product; (3) heuristics for inference: sampling (Gibbs sampling), approximate graph search (beam search), LP relaxation.

18 Inference: Integer Linear Programs
Advanced ML: Inference

19 Advanced ML: Inference
The big picture. MAP inference is combinatorial optimization. Combinatorial optimization problems can be written as integer linear programs (ILPs); the conversion is not always trivial, but it allows injecting "knowledge" in the form of constraints. Different ways of solving ILPs: commercial solvers (CPLEX, Gurobi, etc.); specialized solvers if you know something about your problem (Lagrangian relaxation, amortized inference, etc.); or relax to a linear program and hope for the best. Integer linear programs are NP-hard in general: no free lunch.

20 Detour: Linear programming
Minimizing a linear objective function subject to a finite number of linear constraints (equality or inequality). Very widely applicable: operations research, micro-economics, management. Historical note: developed during World War II to reduce army costs; "programming" here is not the same as computer programming.

21 Example: The diet problem
A student wants to spend as little money on food as possible while getting sufficient amounts of vitamin Z and nutrient X. Her options (columns: Cost/100g, Vitamin Z, Nutrient X): Carrots: 2, 4, 0.4. Sunflower seeds: 6, 10. Double cheeseburger: 0.3, 0.01. How should she spend her money to get at least 5 units of vitamin Z and 3 units of nutrient X?

22 Example: The diet problem
Let c, s and d denote how much of each item is purchased. Minimize the total cost, subject to: at least 5 units of vitamin Z, at least 3 units of nutrient X, and the number of units purchased is not negative.
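For concreteness, this kind of LP can be handed to an off-the-shelf solver. The sketch below uses scipy.optimize.linprog; since the slide's table is incomplete, the cost and nutrition numbers are hypothetical stand-ins chosen only to show the formulation.

```python
# A minimal LP sketch of the diet problem (all numbers are assumed values).
from scipy.optimize import linprog

costs = [2.0, 6.0, 1.0]            # cost per 100g of carrots, seeds, cheeseburger (assumed)
vitamin_z = [4.0, 10.0, 0.3]       # vitamin Z per 100g (assumed)
nutrient_x = [0.4, 2.0, 0.01]      # nutrient X per 100g (assumed)

# linprog minimizes c^T x subject to A_ub @ x <= b_ub, so ">= 5" becomes "-row <= -5".
A_ub = [[-v for v in vitamin_z], [-n for n in nutrient_x]]
b_ub = [-5.0, -3.0]

res = linprog(c=costs, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(res.x, res.fun)              # optimal amounts and the minimum cost
```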

27 Advanced ML: Inference
Linear programming, in general: minimize a linear objective $c^T x$ subject to linear constraints $Ax \le b$.

28 Linear programming
In general, this is a continuous optimization problem, and yet there is only a finite set of possible solutions. For example, the plane $a_1 x_1 + a_2 x_2 + a_3 x_3 = b$ restricted to non-negative $(x_1, x_2, x_3)$ is a triangle. Suppose we had to maximize any $c^T x$ on this region: the three vertices of the triangle are the only possible solutions!

36 Advanced ML: Inference
In general, this is a continuous optimization problem, and yet there is only a finite set of possible solutions: the constraint matrix defines a polytope, and only the vertices or faces of the polytope can be solutions.

37 Geometry of linear programming
The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed). The objective defines a cost for every point in the space (darker is higher in the figure). Even though all points in the region are allowed, the vertices maximize/minimize the cost.

41 Advanced ML: Inference
In general, this is a continuous optimization problem, and yet there is only a finite set of possible solutions: the constraint matrix defines a polytope, and only the vertices or faces of the polytope can be solutions. Linear programs can be solved in polynomial time. Questions?

42 Integer linear programming
In general: the same form as a linear program, minimize $c^T x$ subject to $Ax \le b$, with the additional constraint that the variables are integers.

43 Geometry of integer linear programming
The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed). The objective defines a cost for every point in the space. Only integer points are allowed.

44 Integer linear programming
In general, solving integer linear programs can be NP-hard! LP relaxation: drop the integer constraints and hope for the best.

45 0-1 integer linear programming
In general: an instance of integer linear programs in which each variable is constrained to be 0 or 1. Still NP-hard. Geometry: we are only considering points that are vertices of the Boolean hypercube.

46 0-1 integer linear programming
An instance of integer linear programs; still NP-hard. Geometry: we are only considering points that are vertices of the Boolean hypercube. Constraints prohibit certain vertices, e.g., only points within the region defined by $Ax \le b$ are allowed; the LP solution can be an interior point of that constraint set. Questions?

47 Back to structured prediction
Recall that we are solving $\arg\max_y w^T \phi(x, y)$. The goal is to produce a graph; the set of possible values that y can take is finite, but large. General idea: frame the argmax problem as a 0-1 integer linear program, which allows the addition of arbitrary constraints.

48 Thinking in ILPs
Let's start with multi-class classification: $\arg\max_{y \in \{A,B,C\}} \phi(x, y) = \arg\max_{y \in \{A,B,C\}} \mathrm{score}(y)$. Introduce a decision variable for each label: $z_A = 1$ if the output is A, 0 otherwise; $z_B$ and $z_C$ likewise. Maximize the score, subject to picking exactly one label.

51 Thinking in ILPs
We have taken a trivial problem (finding the highest-scoring element of a list) and converted it into a representation that is NP-hard in the worst case! Lesson: don't solve multiclass classification with an ILP solver.
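To make the encoding concrete, here is a minimal sketch of this multiclass ILP in PuLP (an off-the-shelf modeling library, assumed to be installed), with made-up scores. As the slide says, this only illustrates the encoding; it is not a recommended way to do multiclass classification.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, value

scores = {"A": 1.3, "B": -0.2, "C": 2.1}          # hypothetical label scores

prob = LpProblem("multiclass", LpMaximize)
z = {y: LpVariable(f"z_{y}", cat=LpBinary) for y in scores}

prob += lpSum(scores[y] * z[y] for y in scores)   # objective: maximize the score
prob += lpSum(z.values()) == 1                    # constraint: pick exactly one label

prob.solve()
print([y for y in scores if value(z[y]) == 1])    # -> ['C'] for these scores
```

Additional constraints (implications, "at most m of these", and so on) are added the same way, as extra linear inequalities over the z variables.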

52 ILP for a general conditional model
Suppose each $y_i$ can be A, B, or C. Our goal: $\max_y\, W^T\phi(x_1, y_1) + W^T\phi(y_1, y_2, y_3) + W^T\phi(x_3, y_2, y_3) + W^T\phi(x_1, x_2, y_2)$. Introduce one decision variable for each part being assigned labels: unary variables such as $z_{1A}, z_{1B}, z_{1C}$ and $z_{2A}, z_{2B}, z_{2C}$, and factor variables such as $z_{13AA}, z_{13AB}, \dots, z_{13CC}$ and $z_{23AA}, \dots, z_{23CC}$. Each of these decision variables is associated with a score. Not all decisions can exist together, e.g., $z_{13AB}$ implies $z_{1A}$ and $z_{3B}$. Questions?

56 Writing constraints as linear inequalities
Exactly one of $z_{1A}, z_{1B}, z_{1C}$ is true: $z_{1A} + z_{1B} + z_{1C} = 1$. At least m of them are true: $z_{1A} + z_{1B} + z_{1C} \ge m$. At most m of them are true: $z_{1A} + z_{1B} + z_{1C} \le m$. Implication $z_i \rightarrow z_j$: convert to the disjunction $\lnot z_i \lor z_j$ (at least one of "not $z_i$" or $z_j$), i.e., $1 - z_i + z_j \ge 1$.

58 Integer linear programming for inference
It is easy to add additional knowledge: specify it as Boolean formulas. Examples: "If y1 is an A, then y2 or y3 should be a B or C"; "No more than two A's are allowed in the output." Many inference problems have "standard" mappings to ILPs (sequences, parsing, dependency parsing). The encoding of the problem makes a difference in solving time: the mechanical encoding may not be efficient to solve, and generally, more complex constraints make solving harder.

59 Exercise: Sequence labeling
Goal: find $\arg\max_y W^T\phi(x, y)$ for $y = (y_1, y_2, \dots, y_n)$. How can this be written as an ILP?

60 ILP for inference: Remarks
Many combinatorial optimization problems can be written as an ILP, even the "easy"/polynomial ones; given an ILP, checking whether it represents a polynomial problem is intractable in general. ILPs are a general language for thinking about combinatorial optimization: the representation allows us to make general statements about inference. Off-the-shelf solvers for ILPs are quite good (Gurobi, CPLEX), but use an off-the-shelf solver only if you can't solve your inference problem otherwise.

61 Advanced ML: Inference
Coming up: (1) formulating general inference as integer linear programs, and variants of this idea; (2) graph algorithms, dynamic programming, greedy search: we have seen the Viterbi algorithm, which uses a cleverly defined ordering to decompose the output into a sequence of decisions, and we will discuss a general algorithm, max-product/sum-product; (3) heuristics for inference: sampling (Gibbs sampling), approximate graph search (beam search), LP relaxation.

62 Inference: Graph algorithms Belief Propagation
Advanced ML: Inference

63 Variable elimination (motivation)
Remember: we have a collection of inference variables that need to be assigned, $y = (y_1, y_2, \dots)$. General algorithm: first fix an ordering of the variables, say $(y_1, y_2, \dots)$; then iteratively find the best value for $y_i$ given the values of the previous neighbors, and use back pointers to recover the final answer. Viterbi is an instance of max-product variable elimination.

66 Variable elimination example (max-sum)
Eliminate the variables in order: first $y_1$, then $y_2$, then $y_3$, and so on; after eliminating $y_{n-1}$ we have all the information needed to make a decision for $y_n$. The recurrence: $\mathrm{Score}_1(s) = \mathrm{score\_local}_1(s, \mathrm{START})$ and $\mathrm{Score}_i(s) = \max_{y_{i-1}} [\mathrm{Score}_{i-1}(y_{i-1}) + \mathrm{score\_local}_i(y_{i-1}, s)]$. [Figure: a chain $y_1, \dots, y_n$ over labels A-D with example local scores.] Questions?

70 Two types of inference problems
Marginals: find the marginal probability of individual output variables. Maximizer: find the highest-scoring (MAP) assignment. [Figure: probabilities in different domains.]

71 Advanced ML: Inference
Belief propagation (BP) provides the exact solution when there are no loops in the graph; Viterbi is a special case. Otherwise, "loopy" BP provides an approximate solution. We use sum-product BP as the running example, where we want to compute the partition function Z of the model's distribution.

72 Advanced ML: Inference
Intuition: an iterative process in which neighboring variables "pass messages" to each other: "I (variable x3) think that you (variable x2) belong in these states with various likelihoods…" After enough iterations, the conversation is likely to converge to a consensus that determines the marginal probabilities of all the variables.

73 Advanced ML: Inference
The message from node i to node j is $m_{ij}(x_j)$. A message is not a probability: it may not sum to 1. A high value of $m_{ij}(x_j)$ means that node i "believes" the marginal value $P(x_j)$ to be high.

74 Advanced ML: Inference
Beliefs: the estimated marginal probabilities are called beliefs. Algorithm: update messages until convergence, then calculate the beliefs.

75 Message update
To update the message from i to j, consider all messages flowing into i, except the one coming from j. For sum-product, the update is $m_{ij}^{new}(x_j) = \sum_{x_i} f_{ij}(x_i, x_j)\, g_i(x_i) \prod_{k \in N(i)\setminus j} m_{ki}^{old}(x_i)$, where $f_{ij}$ is the pairwise score, $g_i$ the unary score, and $N(i)$ the neighbors of i.
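A minimal sketch of this update on a tiny pairwise model follows; the 3-node chain, the iteration count, and the reuse of the A/B score tables from the worked example later in the lecture are illustrative assumptions. On a tree, the resulting beliefs are exact unnormalized marginals.

```python
import math
from collections import defaultdict

labels = ["A", "B"]
nodes = [0, 1, 2]
edges = [(0, 1), (2, 1)]                               # pairwise factors, oriented toward node 1
g = {i: {"A": 2.0, "B": 1.0} for i in nodes}           # unary scores (assumed)
f = {("A", "A"): 1.0, ("A", "B"): 2.0, ("B", "A"): 3.0, ("B", "B"): 4.0}

nbrs = defaultdict(list)
for i, j in edges:
    nbrs[i].append(j)
    nbrs[j].append(i)

def factor(i, j, xi, xj):
    # pairwise score of edge {i, j}, looked up in the edge's defined orientation
    return f[(xi, xj)] if (i, j) in edges else f[(xj, xi)]

msg = {(i, j): {y: 1.0 for y in labels} for i in nodes for j in nbrs[i]}
for _ in range(5):                                     # synchronous rounds; a tree needs few
    msg = {(i, j): {xj: sum(factor(i, j, xi, xj) * g[i][xi] *
                            math.prod(msg[(k, i)][xi] for k in nbrs[i] if k != j)
                            for xi in labels)
                    for xj in labels}
           for (i, j) in msg}

beliefs = {i: {xi: g[i][xi] * math.prod(msg[(k, i)][xi] for k in nbrs[i])
               for xi in labels} for i in nodes}
print(beliefs[1])    # middle node: {'A': 50.0, 'B': 64.0}, as in the worked example below
```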

78 Sum-product vs. max-product
The standard BP we just described is sum-product, used to estimate marginals. A variant called max-product (or max-sum in log space) is used to estimate the MAP assignment.

79 Advanced ML: Inference
Max-product: the message update is the same as before, except that the sum is replaced by a max. The beliefs then estimate the most likely states.

80 Advanced ML: Inference
Recap Advanced ML: Inference

81 Example (Sum-Product)
How many possible assignments are there? For the sake of simplicity, let's assume all the transition and emission scores are the same: emission scores A = 2, B = 1 at every node, and transition scores A->A = 1, A->B = 2, B->A = 3, B->B = 4.

82 Example (Sum-Product)
Each leaf sends a message to its neighbor: to state A, 1×2 + 3×1 = 5; to state B, 2×2 + 4×1 = 8. So every leaf message is (A: 5, B: 8).

85 Example (Sum-Product)
A node that has received two leaf messages has unnormalized belief A: 2×(5×5) = 50, B: 1×(8×8) = 64. Let's verify the value 50 by enumerating its subtree: (A,(A,A)) = 2×(1×2)×(1×2) = 8; (A,(A,B)) = 2×(1×2)×(3×1) = 12; (A,(B,A)) = 2×(3×1)×(1×2) = 12; (A,(B,B)) = 2×(3×1)×(3×1) = 18; these sum to 50 = 2×(2+3)×(2+3).

89 Example (Sum-Product)
The message this node passes on is A: 1×50 + 3×64 = 242, B: 2×50 + 4×64 = 356.

90 Example (Sum-Product)
At the other node, which receives this message plus two more leaf messages, the belief is A: 2×242×5×5 = 12,100 and B: 1×356×8×8 = 22,784. Summing gives the partition function Z = 12,100 + 22,784 = 34,884.
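The whole computation can be checked by brute force. The sketch below assumes the graph reconstructed from the numbers above, a 6-node tree with two internal nodes that each have two leaf neighbors, and enumerates all 2^6 assignments.

```python
from itertools import product

emit = {"A": 2.0, "B": 1.0}
trans = {("A", "A"): 1.0, ("A", "B"): 2.0, ("B", "A"): 3.0, ("B", "B"): 4.0}
# nodes 0 and 1 are the internal nodes; 2,3 hang off node 0 and 4,5 hang off node 1
edges = [(0, 1), (2, 0), (3, 0), (4, 1), (5, 1)]

Z = 0.0
for y in product("AB", repeat=6):
    score = 1.0
    for i in range(6):
        score *= emit[y[i]]            # emission score of every node
    for i, j in edges:
        score *= trans[(y[i], y[j])]   # transition score of every edge
    Z += score
print(Z)                               # 34884.0, matching the message-passing result
```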

92 Example (max-product)
Same model as before. The max-product message update is $m_{ij}^{new}(x_j) = \max_{x_i} f_{ij}(x_i, x_j)\, g_i(x_i) \prod_{k \in N(i)\setminus j} m_{ki}^{old}(x_i)$.

93 Example (max-product)
Each leaf sends a message: to A, max(1×2, 3×1) = 3; to B, max(2×2, 4×1) = 4. So every leaf message is (A: 3, B: 4).

96 Example (max-product)
A node with two leaf messages has max-belief A: 2×(3×3) = 18, B: 1×(4×4) = 16. Verify over its subtree: (A,(A,A)) = 8, (A,(A,B)) = 12, (A,(B,A)) = 12, (A,(B,B)) = 18; the max is 18 = 2×3×3.

100 Example (max-product)
The message it passes on is A: max(1×18, 3×16) = 48, B: max(2×18, 4×16) = 64.

101 Example (max-product)
At the other node, the max-belief is A: 2×48×3×3 = 864 and B: 1×64×4×4 = 1,024, so the MAP score is 1,024. Keeping back pointers (the maximizing choices, shown in parentheses on the slides, e.g. A: 48 (B), B: 64 (B); leaves A: 3 (B), B: 4 (A)) lets us read off the MAP assignment itself.

105 Inference: Graph algorithms General Search
Advanced ML: Inference

106 Inference as search: General setting
Predicting a graph as a sequence of decisions. General data structures: a state encodes a partial structure; transitions move from one partial structure to another; a start state; end states, where we have a full structure (there may be more than one end state). Each transition is scored with the learned model. Goal: find an end state that has the highest total score. Questions?

107 Example
Suppose each y can be one of A, B, or C. State: triples (y1, y2, y3), each possibly unknown, e.g. (A, -, -), (-, A, A), (-, -, -). Transition: fill in one of the unknowns. Start state: (-, -, -). End state: all three y's are assigned. The search tree goes from (-,-,-) to (A,-,-), (B,-,-), (C,-,-), then (A,A,-), …, (C,C,-), down to complete assignments (A,A,A), …, (C,C,C). Note that here we have assumed an ordering (y1, y2, y3). How do the transitions get scored? Questions?

111 Graph search algorithms
Breadth/depth-first search: keep a stack/queue/priority queue of "open" states. The good: guaranteed to be correct, since it explores every option. The bad: it explores every option, so it can be slow for any non-trivial graph.

112 Advanced ML: Inference
Greedy search: at each state, choose the highest-scoring next transition, keeping only one state in memory (the current state). What is the problem? Local decisions may override the global optimum, and the full search space is not explored. Greedy algorithms can give the true optimum for special classes of problems, e.g., maximum-spanning-tree algorithms are greedy. Questions?

113 Advanced ML: Inference
The same example as before: each y can be one of A, B, or C, with states (y1, y2, y3) filled in one at a time under a fixed ordering. Greedy search expands only the best child at each step, e.g. (-,-,-) to (A,-,-) to (A,B,-) to (A,B,C), rather than the whole tree. Questions?

114 Advanced ML: Inference
Example (greedy): the same model as the sum/max-product examples (emission A = 2, B = 1; transitions A->A 1, A->B 2, B->A 3, B->B 4).

115 Example (greedy)
Greedy choices yield scores such as A: 2×2×2 = 8 versus B: 2×2×1×2×2 = 16 at an intermediate step, and A: 2×3×16×2×2 = 384 versus B: 1×16×4×2×2×2×2 = 1,024 for complete structures.

116 Beam search: A compromise
Keep a size-limited priority queue of states, called the beam, sorted by the total score of the state. At each step: explore all transitions from the current states, add them all to the beam, and trim it back to size. The good: it explores more than greedy search. The bad: a good state might fall out of the beam. In general, easy to implement and very popular, but with no guarantees. A sketch is given below. Questions?
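A minimal beam-search sketch over the incremental state space described earlier; the labels, chain length, and scoring function are hypothetical placeholders, and setting beam_size=1 recovers greedy search.

```python
import heapq

def beam_search(n, labels, score_step, beam_size=3):
    """score_step(prefix, label) -> incremental score of appending `label`."""
    beam = [(0.0, ())]                                   # (total score, partial assignment)
    for _ in range(n):
        candidates = []
        for total, prefix in beam:
            for y in labels:
                candidates.append((total + score_step(prefix, y), prefix + (y,)))
        # keep only the top `beam_size` partial structures
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])                 # best complete structure found

# e.g. beam_search(3, ["A", "B", "C"], lambda prefix, y: 1.0 if y == "A" else 0.0)
```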

117 Advanced ML: Inference
Example beam = 3 Credit: Graham Neubig Advanced ML: Inference

118 Advanced ML: Inference
Calculate score, but ignore removed hypotheses Advanced ML: Inference

119 Advanced ML: Inference
Keep only best three Advanced ML: Inference

120 Structured prediction approaches based on search
Learning-to-search approaches assume the complex decision is incrementally constructed by a sequence of decisions (e.g., DAgger, SEARN, transition-based methods) and learn how to make decisions at each branch.

121 Example: Dependency Parsing
Example: dependency parsing, i.e., identifying relations between the words of a sentence such as "I ate a cake with a fork." To introduce how the learning-to-search (L2S) approach works, we use dependency parsing as a running example.

122 Learning to search (L2S) approaches
Learning to search (L2S): define a search space and features. Example: dependency parsing [Nivre03, NIPS16]. Maintain a buffer and a stack, make predictions from left to right, and use three (four) types of actions: Shift, Reduce-Left, Reduce-Right. We first define the search space in this way. Credit: Google research blog.

123 Learning to search approaches Shift-Reduce parser
Shift-Reduce parser: maintain a buffer and a stack, make predictions from left to right, with three (four) types of actions: Shift, Reduce-Left, Reduce-Right. For example, starting from the buffer "I ate a cake": Shift moves "I" onto the stack, and Reduce-Left then attaches "I" as a left dependent of "ate". A small sketch follows.
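Below is a minimal sketch of such a transition system (an arc-standard style variant; the exact action semantics in the lecture's figures may differ slightly), with a hand-written action sequence standing in for a learned policy.

```python
def parse(words, actions):
    stack, buffer, arcs = [], list(words), []
    for act in actions:
        if act == "shift":
            stack.append(buffer.pop(0))               # move the next word onto the stack
        elif act == "reduce_left":
            dep, head = stack[-2], stack[-1]          # second-top becomes a left dependent of top
            arcs.append((head, dep))
            del stack[-2]
        elif act == "reduce_right":
            head, dep = stack[-2], stack[-1]          # top becomes a right dependent of second-top
            arcs.append((head, dep))
            del stack[-1]
    return arcs

actions = ["shift", "shift", "reduce_left",           # I <- ate
           "shift", "shift", "reduce_left",           # a <- cake
           "reduce_right"]                            # ate -> cake
print(parse("I ate a cake".split(), actions))
# [('ate', 'I'), ('cake', 'a'), ('ate', 'cake')]
```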

124 Learning to search (L2S) approaches
Learning to search (L2S): define a search space and features; construct a reference policy (Ref) based on the gold label; learn a policy that imitates Ref. The L2S approach is related to reinforcement learning, but the difference is that here we have the true annotations. Therefore, in the training phase we can use them to generate a reference policy that guides the traversal of the search space; for example, the reference policy knows how to take actions that generate the correct parse tree. We then learn a policy that imitates this reference policy, so that at test time, without the reference policy, the learned policy can still make the right decisions. Existing results show that, under some conditions and with enough data, the learned policy can be as good as the reference policy.

125 Advanced ML: Inference
Policies: a policy maps observations to actions, policy(obs) = a, where the observation can include the input x, the timestep t, the partial trajectory τ, or anything else. So far this approach sounds like reinforcement learning. Wait, reinforcement learning is hard!

126 Imitation learning for joint prediction
Challenges: there is a combinatorial number of search states, and how does a sub-decision affect the final decision?

127 Credit Assignment Problem
When something goes wrong, which decision should be blamed?

128 Imitation learning for joint prediction
SEARN [Langford, Daumé & Marcu]; DAgger [Ross, Gordon & Bagnell]; AggreVaTe [Ross & Bagnell]; LOLS [Chang, Krishnamurthy, Agarwal, Daumé, Langford].

129 Learning a Policy[ICML 15, Ross+15]
At “?” state, we construct a cost-sensitive multi-class example (?, [0, .2, .8]) E ? E loss=0 loss=.2 loss=.8 rollin rollout one-step deviations Advanced ML: Inference

130 Example: Sequence Labeling
Receive input: x = "the monster ate the sandwich", with gold labels y = Dt Nn Vb Dt Nn. Make a sequence of predictions: ŷ = Dt Dt Dt Dt Dt. Pick a timestep (here the second word) and try all perturbations there: ŷ(Dt) = Dt Dt Vb Dt Nn with loss 1; ŷ(Nn) = Dt Nn Vb Dt Nn with loss 0; ŷ(Vb) = Dt Vb Vb Dt Nn with loss 1. Compute losses and construct the cost-sensitive example ({w=monster, p=Dt, …}, [1, 0, 1]).

131 Learning to search approaches: Credit Assignment Compiler [NIPS16]
Credit assignment compiler [NIPS16]: write the decoder, providing some side information for training; the library translates this program, together with data, into the update rules of the model. Applied to dependency parsing, named entity recognition, relation extraction, POS tagging, and more. Implementation: Vowpal Wabbit. The key challenge is that representing the search space is not intuitive: usually you have to define state machines, so people often do not implement the right algorithms with good guarantees and instead train the model heuristically, implementing simple update rules in their code, which often leads to suboptimal performance. To fix this, we proposed a programming abstraction that lets the user represent the search space as a piece of program; the library then works as a compiler that translates this program, with data, into the update rules of the model. This idea is analogous to FACTORIE for probabilistic models.

132 Approximate Inference Inference by sampling
Advanced ML: Inference

133 Inference by sampling Basic idea:
Monte Carlo methods: a large class of algorithms with origins in physics. Basic idea: repeatedly sample from a distribution and compute aggregate statistics from the samples, e.g., the marginal distribution. Useful when we have many, many interacting variables.

134 Why sampling works Suppose we have some probability distribution P(z)
It might be a cumbersome function, and we want to answer questions about this distribution: for example, what is the mean? Approximate the answer with samples {z1, z2, …, zn} from the distribution, e.g., approximate the expectation by the sample average. Theory (Chernoff-Hoeffding style bounds) tells us that this is a good estimator.

135 Key idea – rejection sampling
[Figure: rejection sampling illustration.] Rejection sampling works well when the number of variables is small.
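As a concrete illustration (not from the slides), here is a minimal rejection-sampling sketch: draw from a simple proposal q and accept each draw with probability p(z)/(M q(z)), where M bounds p/q; the accepted samples then follow the target p.

```python
import random

def rejection_sample(target_pdf, proposal_sample, proposal_pdf, M, n):
    samples = []
    while len(samples) < n:
        z = proposal_sample()
        # accept z with probability p(z) / (M * q(z))
        if random.random() < target_pdf(z) / (M * proposal_pdf(z)):
            samples.append(z)
    return samples

# e.g. a triangular target p(z) = 2z on [0, 1], a uniform proposal, envelope M = 2
tri = rejection_sample(target_pdf=lambda z: 2 * z,
                       proposal_sample=random.random,
                       proposal_pdf=lambda z: 1.0,
                       M=2.0, n=1000)
print(sum(tri) / len(tri))   # roughly 2/3, the mean of p(z) = 2z
```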

137 The Markov Chain Monte Carlo revolution
Goal: generate samples from a distribution P(y|x); the target distribution could be intractable to sample from directly. Idea: draw examples in a way that, in the long run, their distribution is close to P(y|x). Formally: construct a Markov chain of structures whose stationary distribution converges to P. This is an iterative process that constructs examples: initially the samples might not come from the target distribution, but after a long enough time they come from a distribution that is increasingly close to P.

139 A detour: recall Markov chains
A collection of random variables y0, y1, y2, …, yt forms a Markov chain if the i-th state depends only on the previous one. [Figure: a six-state chain over A-F with transition probabilities such as 0.1, 0.8, 0.9.] Example state sequences: A → B → C → D → E → F, and F → A → A → E → F → B → C.

141 Temporal dynamics of a Markov chain
What is the probability that the chain is in state z at time t+1? It is obtained from the time-t distribution and the transition probabilities: $P_{t+1}(z) = \sum_{z'} P_t(z')\, P(z' \to z)$.

142 Temporal dynamics of a Markov chain
What is the probability that the chain is in state z at time t+1? [Figure: the six-state chain from before.] Exercise: suppose a Markov chain with these transition probabilities starts at A. What is the distribution over states after two steps?
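The two-step distribution is just two applications of the update above. Since the figure's transition probabilities are not fully recoverable here, the sketch below uses a made-up 3-state transition matrix purely to show the computation.

```python
import numpy as np

# Hypothetical transition matrix: row i is the distribution over next states from state i.
T = np.array([[0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8],
              [0.9, 0.0, 0.1]])

p0 = np.array([1.0, 0.0, 0.0])                 # start deterministically in the first state
p2 = p0 @ np.linalg.matrix_power(T, 2)         # p_{t+1} = p_t T, applied twice
print(p2)                                      # distribution over states after two steps
```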

143 Stationary distributions
Informally, if the set of states is {A, B, C, D, E, F}, a stationary distribution is a distribution π over the states such that after a transition, the distribution over the states is still π. How do we get to a stationary distribution? A regular Markov chain has a non-zero probability of getting from any state to any other in a finite number of steps; if the transition matrix is regular, just run the chain for a long time, and the steady-state behavior is the stationary distribution.

145 Markov Chain Monte Carlo for inference
Back to inference: Markov Chain Monte Carlo for inference. Design a Markov chain such that every state is a structure and the stationary distribution of the chain is the probability distribution we care about, P(y|x). How to do inference? Run the Markov chain for a long time, until we think it has reached its steady state; let the chain wander around the space and collect samples; we then have samples from P(y|x).

146 MCMC for inference
[Figure: the chain wanders through the space of structures; after many steps we record how often each structure is visited, e.g. counts 1, 1, 3.] With sufficient samples, we can answer inference questions like calculating the partition function (just sum over the samples).

154 MCMC algorithms Metropolis-Hastings algorithm Gibbs sampling
Metropolis-Hastings algorithm; Gibbs sampling, an instance of the Metropolis-Hastings algorithm. Many variants exist. Remember: we are sampling from an exponential state space, all possible assignments to the random variables.

155 Metropolis-Hastings Proposal distribution q(y → y’)
[Metropolis, Rosenbluth, Rosenbluth, Teller & Teller 1953; Hastings 1970]. Proposal distribution q(y → y'): proposes changes to the state, and could propose large changes. Acceptance probability α: decides whether the proposal is accepted; if yes, move to the proposed state, else remain in the previous state.

156 Metropolis-Hastings Algorithm
The distribution we care about is P(y|x). Start with an initial guess y0. Loop for t = 1, 2, …, N: propose the next state y' by sampling from q(yt → y'); calculate the acceptance probability α; with probability α, accept the proposal; if accepted, yt+1 ← y', else yt+1 ← yt. Return {y0, y1, …, yN}. Note that we don't need to compute the partition function (why?). The idea is that, when run for enough iterations, the target distribution is invariant under this chain.
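A minimal sketch of this loop, using a symmetric proposal so that the acceptance ratio involves only unnormalized scores (which is why the partition function is never needed); the target distribution and proposal here are toy assumptions.

```python
import math, random

def metropolis_hastings(score, propose, y0, n_steps):
    y, samples = y0, []
    for _ in range(n_steps):
        y_new = propose(y)
        alpha = min(1.0, score(y_new) / score(y))      # acceptance probability (symmetric proposal)
        if random.random() < alpha:
            y = y_new                                  # accept; otherwise keep the old state
        samples.append(y)
    return samples

def score(y):                                          # unnormalized target: prefers vectors with many 1s
    return math.exp(sum(y))

def propose(y):                                        # flip one randomly chosen coordinate
    i = random.randrange(len(y))
    return y[:i] + (1 - y[i],) + y[i + 1:]

samples = metropolis_hastings(score, propose, (0,) * 5, 5000)
print(sum(sum(y) for y in samples[1000:]) / 4000.0)    # average count of 1s, biased above 2.5
```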

161 Proposal functions for Metropolis
Proposal functions for Metropolis-Hastings are a design choice, with different possibilities: only make local changes to the factor graph (but then the chain might not explore widely), or make big jumps in the state space (but then the chain might move very slowly). The proposal doesn't have to depend on the size of the graph.

162 Gibbs Sampling
Start with an initial guess y = (y1, y2, …, yn). Loop several times: for i = 1 to n, sample yi from P(yi | y1, …, yi-1, yi+1, …, yn, x); after each full pass we have a complete sample. This is a specific instance of the Metropolis-Hastings algorithm in which no proposal needs to be designed, and the ordering of the variables is arbitrary.
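As a minimal sketch (not the lecture's example), here is Gibbs sampling for a small chain-structured binary model; the point is that each conditional P(yi | rest) involves only the factors touching yi, so it is cheap to sample from.

```python
import math, random

# Chain model with unnormalized score exp(sum_i theta * y_i * y_{i+1}), y_i in {-1, +1}.
# The model and theta are illustrative assumptions.
def gibbs(n, theta, n_sweeps):
    y = [random.choice([-1, 1]) for _ in range(n)]
    samples = []
    for _ in range(n_sweeps):
        for i in range(n):                             # resample each variable in turn
            field = theta * (y[i - 1] if i > 0 else 0) + theta * (y[i + 1] if i < n - 1 else 0)
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))   # exact P(y_i = +1 | neighbors)
            y[i] = 1 if random.random() < p_plus else -1
        samples.append(list(y))                        # one complete sample per sweep
    return samples

samples = gibbs(n=10, theta=0.5, n_sweeps=2000)
agree = sum(s[0] == s[1] for s in samples[500:]) / len(samples[500:])
print(agree)   # neighboring variables tend to agree when theta > 0
```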

163 MAP inference with MCMC
So far we have only seen how to collect samples. Marginal inference with samples is easy: compute the marginal probabilities from the samples. For MAP inference, find the sample with the highest probability. To help convergence to the maximum, the acceptance condition is modified with a temperature parameter T that increases with every step, similar to simulated annealing.

164 Summary of MCMC methods
A different approach to inference, with no guarantee of exactness. General idea: set up a Markov chain whose stationary distribution is the probability distribution we care about; run the chain, collect samples, and aggregate. Metropolis-Hastings, Gibbs sampling, and many, many variants abound! Useful when exact inference is intractable; typically low memory cost, with only local changes in Gibbs sampling. Questions?

165 Inference What is inference? The prediction step
More broadly, inference is an aggregation operation over the space of outputs for an example: max, expectation, sample, sum. Different flavors: MAP, marginal, loss-augmented. Many algorithms and solution strategies; one size doesn't fit all. Next steps: how can we take advantage of domain knowledge in inference, and how can we deal with making predictions about latent variables for which we don't have data? Questions?

