
1 Carnegie Mellon Thesis Defense Joseph K. Bradley Learning Large-Scale Conditional Random Fields Committee Carlos Guestrin (U. of Washington, Chair) Tom Mitchell John Lafferty (U. of Chicago) Andrew McCallum (U. of Massachusetts at Amherst) 1 / 18 / 2013

2 Modeling Distributions. Goal: Model distribution P(X) over random variables X. E.g.: model the life of a grad student. X1: losing sleep? X2: deadline? X3: sick? X4: losing hair? X5: overeating? X6: loud roommate? X7: taking classes? X8: cold weather? X9: exercising? X10: gaining weight? X11: single?

3 Modeling Distributions. Goal: Model distribution P(X) over random variables X. E.g.: model the life of a grad student, answering queries such as P(losing sleep, overeating | deadline, taking classes). X1: losing sleep? X2: deadline? X3: sick? X4: losing hair? X5: overeating? X6: loud roommate? X7: taking classes? X8: cold weather? X9: exercising? X10: gaining weight? X11: single?

4 Markov Random Fields (MRFs). Goal: Model distribution P(X) over random variables X. E.g.: model the life of a grad student. X1: losing sleep? X2: deadline? X3: sick? X4: losing hair? X5: overeating? X6: loud roommate? X7: taking classes? X8: cold weather? X9: exercising? X10: gaining weight? X11: single?

5 Markov Random Fields (MRFs). Goal: Model distribution P(X) over random variables X. [Figure: graphical structure over X1-X10, with factors (parameters).]

6 Conditional Random Fields (CRFs) (Lafferty et al., 2001). MRFs model P(X); CRFs model P(Y|X). CRFs do not model P(X) and have simpler structure (over Y only). [Figure: graph over Y1-Y5 with evidence variables X1-X6.]

7 MRFs & CRFs. Benefits: a principled statistical and computational framework; a large body of literature. Applications: natural language processing (e.g., Lafferty et al., 2001), vision (e.g., Tappen et al., 2007), activity recognition (e.g., Vail et al., 2007), medical applications (e.g., Schmidt et al., 2008), and more.

8 Challenges. Goal: Given data, learn CRF structure and parameters. Many learning methods require inference, i.e., answering queries P(A|B): NP-hard in general (Srebro, 2003). Learning is a big structured optimization problem: NP-hard to approximate (Roth, 1996). Approximations often lack strong guarantees. [Figure: CRF over Y1-Y5 with evidence X.]

9 Thesis Statement. CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems. We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

10 Outline. Parameter Learning: learning without intractable inference; scaling core methods. Structure Learning: learning tractable structures. Parallel Regression: multicore sparse regression; parallel scaling. [Diagram: parameter and structure learning are solved via parallel regression.]


12 Log-linear MRFs. Goal: Model distribution P(X) over random variables X as P(X) proportional to exp(theta . phi(X)), with parameters theta and features phi. All results generalize to CRFs. [Figure: graph over X1-X10.]
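The log-linear form can be made concrete by brute-force enumeration, which is feasible only for tiny models. Everything below (the three binary variables, the two agreement features, and the weights) is an illustrative sketch, not a model from the thesis:

```python
import itertools
import math

# Toy log-linear MRF over 3 binary variables (illustrative): two
# pairwise "agreement" features, one weight per feature.
theta = [0.8, -0.3]

def phi(x):
    # Feature vector: do adjacent variables agree?
    return [float(x[0] == x[1]), float(x[1] == x[2])]

def unnorm(x):
    # Unnormalized probability exp(theta . phi(x)).
    return math.exp(sum(t * f for t, f in zip(theta, phi(x))))

states = list(itertools.product([0, 1], repeat=3))
Z = sum(unnorm(x) for x in states)          # partition function Z(theta)
P = {x: unnorm(x) / Z for x in states}      # P(x) = exp(theta . phi(x)) / Z
```

The point of the sketch is that Z sums over all 2^d states, which is exactly the computation that becomes intractable for large models.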

13 Parameter Learning: MLE. Parameter learning: given structure phi and samples from P_theta*(X), learn parameters theta. Traditional method: maximum-likelihood estimation (MLE). Minimize the objective (loss): the negative mean log-likelihood, -(1/n) sum_i log P_theta(x^(i)). Gold standard: MLE is (optimally) statistically efficient.


15 Parameter Learning: MLE. MLE requires inference, which is provably hard for general MRFs (Roth, 1996). Inference makes learning hard. Can we learn without intractable inference?

16 Parameter Learning: MLE. Inference makes learning hard. Can we learn without intractable inference? One option: approximate inference & objectives. Many works: Hinton (2002), Sutton & McCallum (2005), Wainwright (2006), and others. Many lack strong theory; almost no guarantees exist for general MRFs or CRFs.

17 Our Solution (Bradley, Guestrin, 2012). Max likelihood estimation (MLE): optimal sample complexity, high computational complexity, difficult parallel optimization. Max pseudolikelihood estimation (MPLE): high sample complexity, low computational complexity, easy parallel optimization. PAC learnability for many MRFs!


19 Our Solution (Bradley, Guestrin, 2012). MLE: optimal sample complexity, high computational complexity, difficult to parallelize. MPLE: high sample complexity, low computational complexity, easy to parallelize. Max composite likelihood estimation (MCLE): low computational complexity, easy to parallelize; choose the MCLE structure to optimize the trade-offs.

20 Deriving Pseudolikelihood (MPLE). MLE objective: log P_theta(x), which is hard to compute. So replace it! [Figure: graph over X1-X5.]

21 Deriving Pseudolikelihood (MPLE). MLE: maximize log P_theta(x). MPLE (Besag, 1975): replace log P_theta(x) with sum_i log P_theta(x_i | x_rest), estimating each conditional via regression. Tractable inference!

22 Pseudolikelihood (MPLE) (Besag, 1975): maximize sum_i log P_theta(x_i | x_rest). Pros: no intractable inference; a consistent estimator. Cons: less statistically efficient than MLE (Liang & Jordan, 2008); no PAC bounds. (PAC = Probably Approximately Correct; Valiant, 1984.)
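A minimal sketch of the pseudolikelihood objective, using a made-up pairwise Ising-style model (the couplings J and the example configurations are illustrative, not from the thesis). Each conditional P(x_i | x_rest) normalizes over only the two values of x_i, so no partition function appears:

```python
import math

# Pairwise Ising-style model with variables x_i in {-1, +1};
# couplings J are illustrative.
J = {(0, 1): 0.8, (1, 2): -0.3}

def neighbor_field(x, i):
    # sum_j J_ij * x_j over edges touching variable i
    s = 0.0
    for (a, b), w in J.items():
        if a == i:
            s += w * x[b]
        elif b == i:
            s += w * x[a]
    return s

def log_pseudolikelihood(x):
    # sum_i log P(x_i | x_rest): each conditional normalizes over just
    # the two values of x_i, so inference is trivially tractable.
    total = 0.0
    for i in range(len(x)):
        f = neighbor_field(x, i)
        total += x[i] * f - math.log(math.exp(f) + math.exp(-f))
    return total
```

Configurations that respect the couplings (agreeing on the positive edge, disagreeing on the negative one) get higher pseudolikelihood, e.g. (1, 1, -1) scores above (1, 1, 1) here.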

23 Sample Complexity: MLE. Our Theorem: a bound on n, the number of training examples needed to achieve L1 parameter error epsilon with failure probability delta. The bound grows with the number of parameters (length of theta) and shrinks as Lambda_min grows, where Lambda_min is the minimum eigenvalue of the Hessian of the loss at theta*. Recall: MLE requires intractable inference.

24 Sample Complexity: MPLE. Our Theorem: an analogous bound on n, the number of training examples needed for L1 parameter error epsilon with failure probability delta, growing with the number of parameters (length of theta) and shrinking as Lambda_min grows, where Lambda_min = min_i [minimum eigenvalue of the Hessian of component i at theta*]. Recall: MPLE has tractable inference. PAC learnability for many MRFs!

25 Sample Complexity: MPLE. Our Theorem bounds n (# training examples needed): PAC learnability for many MRFs! Related work: Ravikumar et al. (2010): regression of Y_i ~ X with Ising models; the basis of our theory. Liang & Jordan (2008): asymptotic analysis of MLE and MPLE; our bounds match theirs. Abbeel et al. (2006): the only previous method with PAC bounds for high-treewidth MRFs; their method is very similar to MPLE, and we extend their work with an extension to CRFs, algorithmic improvements, and analysis.

26 Trade-offs: MLE & MPLE. Our Theorem (bound on n, the # of training examples needed) exposes a sample-complexity vs. computational-complexity trade-off. MLE: larger Lambda_min, hence lower sample complexity but higher computational complexity. MPLE: smaller Lambda_min, hence higher sample complexity but lower computational complexity.

27 Trade-offs: MPLE. Joint optimization for MPLE: lower sample complexity. Disjoint optimization for MPLE: two estimates of each shared parameter, which are averaged; data-parallel. A sample-complexity vs. parallelism trade-off.

28 Synthetic CRFs. Structures: chains, stars, grids. Factors: random or associative; factor strength = strength of variable interactions.

29 Predictive Power of Bounds. Prediction: errors should be ordered MLE < MPLE < MPLE-disjoint. [Plot: L1 parameter error epsilon vs. # training examples for MLE, MPLE, and MPLE-disjoint (lower is better). Factors: random, fixed strength; length-4 chains.]

30 Predictive Power of Bounds. [Plot: MLE & MPLE sample complexity, predicted vs. actual epsilon. Factors: random; length-6 chains; 10,000 training examples.]

31 Failure Modes of MPLE. How do Lambda_min(MLE) and Lambda_min(MPLE), and hence sample complexity, vary across models? We examine model diameter, factor strength, and node degree.

32 Lambda_min: Model Diameter. [Plot: Lambda_min ratio MLE/MPLE (higher = MLE better) vs. model diameter. Chains; associative factors, fixed strength.] Relative MPLE performance is independent of diameter in chains. (Same for random factors.)

33 Lambda_min: Factor Strength. [Plot: Lambda_min ratio MLE/MPLE (higher = MLE better) vs. factor strength. Length-8 chains; associative factors.] MPLE performs poorly with strong factors. (Same for random factors, and for star & grid models.)

34 Lambda_min: Node Degree. [Plot: Lambda_min ratio MLE/MPLE (higher = MLE better) vs. node degree. Stars; associative factors, fixed strength.] MPLE performs poorly with high-degree nodes. (Same for random factors.)

35 Failure Modes of MPLE. How do Lambda_min(MLE) and Lambda_min(MPLE), and hence sample complexity, vary with model diameter, factor strength, and node degree? We can often fix these failure modes!

36 Composite Likelihood (MCLE). MLE: estimate P(Y) all at once.

37 Composite Likelihood (MCLE). MLE: estimate P(Y) all at once. MPLE: estimate each P(Y_i | Y_rest) separately.

38 Composite Likelihood (MCLE). MLE: estimate P(Y) all at once. MPLE: estimate each P(Y_i | Y_rest) separately. Something in between? Composite likelihood (MCLE) (Lindsay, 1988): estimate P(Y_Ai | Y_rest) separately for variable blocks A_i.
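A sketch of the block conditionals that composite likelihood sums, on a made-up Ising-style chain (the couplings, the block choice, and the example configuration are all illustrative). Singleton blocks recover MPLE; a single block over all variables would recover the exact log-likelihood:

```python
import math
from itertools import product

# Pairwise Ising-style chain over 4 variables in {-1, +1};
# couplings J are illustrative.
J = {(0, 1): 0.8, (1, 2): -0.3, (2, 3): 0.5}

def energy(x):
    return sum(w * x[a] * x[b] for (a, b), w in J.items())

def log_block_conditional(x, block):
    # log P(x_block | x_rest): normalize over the block's joint states
    # only, so the sum has 2^|block| terms, not 2^d.
    x = list(x)
    num = math.exp(energy(x))
    Z = 0.0
    for vals in product([-1, 1], repeat=len(block)):
        for i, v in zip(block, vals):
            x[i] = v
        Z += math.exp(energy(x))
    return math.log(num / Z)

x = (1, -1, 1, 1)
# MPLE = singleton blocks; an MCLE choice = node-disjoint blocks
# covering the graph, here {0,1} and {2,3}.
mple = sum(log_block_conditional(x, [i]) for i in range(4))
mcle = log_block_conditional(x, [0, 1]) + log_block_conditional(x, [2, 3])
```

Larger blocks condition on less and so are statistically closer to MLE, at the price of a bigger per-block normalization; that is the trade-off the following slides tune.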

39 Composite Likelihood (MCLE). MCLE class: node-disjoint subgraphs which cover the graph. MCLE generalizes MLE and MPLE, with analogous objective, sample complexity, and joint & disjoint optimization.

40 Composite Likelihood (MCLE). MCLE class: node-disjoint subgraphs which cover the graph, e.g., combs. Guidelines: use trees (tractable inference); follow the structure of P(X); cover star structures; cover strong factors; choose large components. MCLE generalizes MLE and MPLE, with analogous objective, sample complexity, and joint & disjoint optimization.

41 Structured MCLE on a Grid. [Plots: log-loss ratio (other/MLE) and training time (sec) vs. grid size |X| for MLE, MPLE, and MCLE with combs (lower is better). Grid; associative factors; 10,000 training examples; Gibbs sampling.] MCLE (combs) lowers sample complexity without increasing computation, by tailoring the MCLE structure to the model structure. Also in the thesis: tailoring to correlations in the data.

42 Summary: Parameter Learning. Likelihood (MLE): optimal sample complexity, high computational complexity, difficult to parallelize. Pseudolikelihood (MPLE): high sample complexity, low computational complexity, easy to parallelize. Composite likelihood (MCLE): low computational complexity, easy to parallelize. Contributions: finite sample complexity bounds for general MRFs and CRFs; PAC learnability for certain classes; empirical analysis; guidelines for choosing MCLE structures (tailor to model and data).

43 Outline. Parameter Learning: learning without intractable inference; scaling core methods. Structure Learning: learning tractable structures. Parallel Regression: multicore sparse regression; parallel scaling. [Diagram: parameter and structure learning are solved via parallel regression.]

44 CRF Structure Learning. Example: Y1: losing sleep? Y2: losing hair? Y3: sick? X1: loud roommate? X2: taking classes? X3: deadline? Structure learning: choose the Y_C, i.e., learn conditional independence. Evidence selection: choose the X_D, i.e., select the X relevant to each Y_C.

45 Related Work. Previous work (structure learning? tractable inference? evidence selection?): Torralba et al. (2004), Boosted Random Fields: yes; no; yes. Schmidt et al. (2008), block-L1 regularized pseudolikelihood: yes; no; no. Shahaf et al. (2009), edge weights + low-treewidth model: yes; yes; no. Most similar to our work: Shahaf et al. (2009); they focus on selecting treewidth-k structures, while we focus on the choice of edge weight.

46 Tree CRFs with Local Evidence (Bradley, Guestrin, 2010). Goal: given data and local evidence (the X_i relevant to each Y_i), learn a tree CRF structure via a scalable method, with fast inference at test time.

47 Chow-Liu for MRFs (Chow & Liu, 1968). Algorithm: weight each edge (i, j) with the mutual information I(Y_i; Y_j) = sum_{y_i, y_j} P(y_i, y_j) log [ P(y_i, y_j) / (P(y_i) P(y_j)) ].

48 Chow-Liu for MRFs (Chow & Liu, 1968). Algorithm: weight each edge with mutual information, then choose the max-weight spanning tree. Chow-Liu finds a max-likelihood structure.
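The two steps above can be sketched end to end: estimate pairwise mutual information from data, then build a maximum-weight spanning tree (greedy Kruskal with union-find). The tiny dataset below is illustrative, not from the thesis:

```python
import math
from collections import Counter
from itertools import combinations

# Toy data: three binary variables; columns 0 and 1 are perfectly
# correlated, column 2 is only weakly related (illustrative).
data = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
n, d = len(data), len(data[0])

def mutual_info(i, j):
    # Empirical (plug-in) mutual information between columns i and j.
    pij = Counter((x[i], x[j]) for x in data)
    pi = Counter(x[i] for x in data)
    pj = Counter(x[j] for x in data)
    return sum((c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

# Sort candidate edges by weight, descending.
edges = sorted(((mutual_info(i, j), i, j)
                for i, j in combinations(range(d), 2)), reverse=True)

# Kruskal's algorithm with union-find (path halving).
parent = list(range(d))
def find(u):
    while parent[u] != u:
        parent[u] = parent[parent[u]]
        u = parent[u]
    return u

tree = []
for w, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        tree.append((i, j))
```

On this data the perfectly correlated pair (0, 1) carries weight log 2 and is always selected first; the tree then has d - 1 = 2 edges.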

49 Chow-Liu for CRFs? Algorithm: weight each possible edge, then choose the max-weight spanning tree. But what edge weight? It must be efficient to compute. Global Conditional Mutual Information (CMI): w(i, j) = I(Y_i; Y_j | X). Pro: finds a max-likelihood structure (with enough data). Con: intractable for large |X|.

50 Generalized Edge Weights. Generalizing Global CMI: Local Linear Entropy Scores (LLES), w(i, j) = a linear combination of entropies over Y_i, Y_j, X_i, X_j. Theorem: no LLES can recover all tree CRFs (even with non-trivial parameters and exact entropies).

51 Heuristic Edge Weights. Global CMI: recovers the true tree; w(i, j) not tractable to compute (Shahaf et al., 2009). Local CMI: lower-bounds the likelihood gain; tractable; fails with strong Y_i-X_i potentials. Decomposable Conditional Influence (DCI): exact likelihood gain for some edges; tractable; best empirically.

52 Synthetic Tests. [Plot: fraction of edges recovered vs. # training examples for DCI, Global CMI, Local CMI, Schmidt et al., and the true CRF (higher is better). Trees with associative factors; |Y| = 40; 1000 test samples; error bars: 2 std. errors.]

53 Synthetic Tests. [Plot: running time (seconds) vs. # training examples for DCI, Global CMI, Local CMI, and Schmidt et al. Trees with associative factors; |Y| = 40; 1000 test samples; error bars: 2 std. errors.]

54 fMRI Tests. X: fMRI voxels (500); Y: semantic features (218); task: predict Y from X. (Application & data from Palatucci et al., 2009; image from http://en.wikipedia.org/wiki/File:FMRI.jpg.) [Plot: the disconnected model (Palatucci et al., 2009) vs. DCI variants 1 and 2 (higher is better).]

55 Summary: Structure Learning. Analyzed generalizing Chow-Liu to CRFs: compute edge weights, then take a max-weight spanning tree. Proposed a class of edge weights, Local Linear Entropy Scores; negative result: they are insufficient for recovering trees. Discovered useful heuristic edge weights: Local CMI and DCI. Promising empirical results on synthetic & fMRI data.

56 Outline. Parameter Learning (pseudolikelihood, canonical parameterization): regress each variable on its neighbors, P(X_i | X_rest). Structure Learning (generalized Chow-Liu): compute edge weights via P(Y_i, Y_j | X_ij). Both are solved via Parallel Regression: multicore sparse regression; parallel scaling.

57 Sparse (L1) Regression (Bradley, Kyrola, Bickson, Guestrin, 2011). Goal: predict y from x, given samples. Lasso (Tibshirani, 1996) objective: min_w ||Xw - y||^2 + lambda ||w||_1, which biases towards sparse solutions. Useful in the high-dimensional setting (# features >> # examples). We consider Lasso and sparse logistic regression.

58 Parallelizing Lasso. Many Lasso optimization algorithms exist: gradient descent, interior point, stochastic gradient, shrinkage, hard/soft thresholding, and coordinate descent (a.k.a. Shooting; Fu, 1998), one of the fastest (Yuan et al., 2010). Parallel optimization options: matrix-vector ops (e.g., interior point): not great empirically; stochastic gradient (e.g., Zinkevich et al., 2010): best for many samples, not large d; Shooting: inherently sequential? Our approach, Shotgun: parallel coordinate descent for L1 regression; a simple algorithm with an elegant analysis.

59 Shooting: Sequential SCD. Stochastic Coordinate Descent (SCD): while not converged, choose a random coordinate j and update w_j (a closed-form minimization).
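A self-contained sketch of the Shooting update, assuming the Lasso objective in the form min_w 0.5 ||Xw - y||^2 + lambda ||w||_1; the toy design matrix and labels are made up. Each coordinate update is a closed-form soft-threshold step:

```python
import random

# Toy problem (illustrative): 3 samples, 2 features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
lam = 0.1
d = len(X[0])
w = [0.0] * d
random.seed(0)

def soft(a, b):
    # Soft-threshold operator: shrink a towards zero by b.
    return (a - b) if a > b else (a + b) if a < -b else 0.0

for _ in range(2000):
    j = random.randrange(d)               # choose a random coordinate
    # Residual with coordinate j's contribution removed.
    r = [y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j)
         for i in range(len(y))]
    aj = sum(X[i][j] ** 2 for i in range(len(y)))    # curvature
    cj = sum(X[i][j] * r[i] for i in range(len(y)))  # correlation
    w[j] = soft(cj, lam) / aj             # closed-form minimization in w_j
```

On this well-conditioned toy problem the iterates settle at the Lasso optimum (roughly w = [0.967, 1.967]).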

60 Shotgun: Parallel SCD. Shotgun algorithm (parallel SCD): while not converged, on each of P processors choose a random coordinate j and update w_j (same update as Shooting). Nice case: uncorrelated features. Bad case: correlated features. Is SCD inherently sequential?
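Shotgun's parallel rounds can be simulated sequentially: P coordinate updates are computed from the same (stale) weight vector and then applied together. The toy problem below is illustrative, and here P equals the number of features:

```python
# Toy problem (illustrative): same shape as a tiny Lasso instance.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
lam, d, n = 0.1, 2, 3
w = [0.0] * d

def soft(a, b):
    # Soft-threshold operator.
    return (a - b) if a > b else (a + b) if a < -b else 0.0

def shoot(j, w):
    # Shooting's closed-form update for coordinate j, given weights w.
    r = [y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j)
         for i in range(n)]
    aj = sum(X[i][j] ** 2 for i in range(n))
    cj = sum(X[i][j] * r[i] for i in range(n))
    return soft(cj, lam) / aj

for _ in range(100):
    # One Shotgun round: P = d updates computed from the SAME stale w,
    # then applied together (on real hardware, one per core).
    new = [shoot(j, w) for j in range(d)]
    w = new
```

With correlated features these simultaneous updates can conflict, which is exactly the limit on P that the theory slides below quantify; on this mildly correlated toy problem the rounds still converge to the same optimum as sequential Shooting.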

61 Shotgun: Theory. Convergence Theorem: assume the number of parallel updates P is at most a limit determined by rho, the spectral radius of X^T X; then the expected final objective approaches the optimal objective, with the required number of iterations shrinking linearly in P. This generalizes bounds for Shooting (Shalev-Shwartz & Tewari, 2009).

62 Shotgun: Theory. Convergence Theorem (final minus optimal objective after the assumed number of iterations), where rho = spectral radius of X^T X. Nice case, uncorrelated features: rho = 1, allowing up to P = d parallel updates. Bad case, correlated features: rho can be as large as d, allowing only P = 1 at worst.

63 Shotgun: Theory. The Convergence Theorem predicts linear speedups, up to a threshold P_max. Experiments match our theory! [Plots: iterations T vs. parallel updates P for Mug32_singlepixcam (P_max = 79) and SparcoProblem7 (P_max = 284).]

64 Lasso Experiments. Compared many algorithms: interior point (L1_LS), shrinkage (FPC_AS, SpaRSA), projected gradient (GPSR_BB), iterative hard thresholding (Hard_IO); also ran GLMNET, LARS, SMIDAS. 35 datasets; lambda = 0.5 and 10; Shooting vs. Shotgun with P = 8 (multicore). [Plots: results on Sparco (van den Berg et al., 2009), single-pixel camera / sparse compressed imaging, and large sparse datasets.] Shotgun proves the most scalable & robust.

65 Shotgun: Speedup. [Plot: speedup vs. # cores, aggregated over all tests: optimal, Lasso iteration speedup, Lasso time speedup, logistic regression time speedup.] Lasso time speedups are not so great, but we are doing fewer iterations! Explanation: the memory wall (Wulf & McKee, 1995): the memory bus gets flooded. Logistic regression uses more FLOPS per datum, so the extra computation hides memory latency, giving better speedups on average.

66 Summary: Parallel Regression. Shotgun: parallel coordinate descent on multicore; decompose the computation by coordinate updates, trading a little extra computation for a lot of parallelism. Analysis: near-linear speedups, up to a problem-dependent limit. Extensive experiments (37 datasets, 7 other methods): our theory predicts empirical behavior well, and Shotgun is one of the most scalable methods.

67 Recall: Thesis Statement. We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization. Parameter learning: structured composite likelihood (MLE, MPLE, MCLE). Structure learning: generalized Chow-Liu. Parallel regression: Shotgun, parallel coordinate descent. The decompositions use model structure & locality; the trade-offs use model- and data-specific methods.

68 Future Work: Unified System. Parameter learning (structured MCLE): automatically choose the MCLE structure & parallelization strategy to optimize trade-offs, tailored to model & data. Structure learning (L1 structure learning, learning trees): use structured MCLE? Learn trees for parameter estimators? Parallel regression (Shotgun on multicore, distributed): handle limited communication in the distributed setting; handle complex objectives (e.g., MCLE).

69 Summary. We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization. Parameter learning (structured composite likelihood): finite sample complexity bounds; empirical analysis; guidelines for choosing MCLE structures (tailor to model and data); analyzed the canonical parameterization of Abbeel et al. (2006). Structure learning (generalizing Chow-Liu to CRFs): proposed the class of Local Linear Entropy Scores edge weights (insufficient for recovering trees); discovered useful heuristic edge weights, Local CMI and DCI; promising empirical results on synthetic & fMRI data. Parallel regression (Shotgun: parallel coordinate descent on multicore): analysis showing near-linear speedups up to a problem-dependent limit; extensive experiments (37 datasets, 7 other methods); our theory predicts empirical behavior well, and Shotgun is one of the most scalable methods. Thank you!

