
1 Latent Tree Models Part III: Learning Algorithms Nevin L. Zhang Dept. of Computer Science & Engineering The Hong Kong Univ. of Sci. & Tech. http://www.cse.ust.hk/~lzhang AAAI 2014 Tutorial

2 AAAI 2014 Tutorial Nevin L. Zhang HKUST2 To Determine 1. Number of latent variables 2. Cardinality of each latent variable 3. Model structure 4. Probability distributions Learning Latent Tree Models Model selection: 1, 2, 3 Parameter estimation: 4

3 AAAI 2014 Tutorial Nevin L. Zhang HKUST3  Run interactive program “LightBulbIllustration.jar”  Illustrate the possibility of inferring latent variables and latent structures from observed co-occurrence patterns. Light Bulb Illustration

4 AAAI 2014 Tutorial Nevin L. Zhang HKUST4 Part III: Learning Algorithms  Introduction  Search-based algorithms  Algorithms based on variable clustering  Distance-based algorithms  Empirical comparisons  Spectral methods for parameter estimation

5 AAAI 2014 Tutorial Nevin L. Zhang HKUST5  A search algorithm explores the space of regular models guided by a scoring function:  Start with an initial model  Iterate until model score ceases to increase  Modify the current model in various ways to generate a list of candidate models.  Evaluate the candidate models using the scoring function.  Pick the best candidate model  What scoring function to use? How do we evaluate candidate models?  This is the model selection problem. Search Algorithms

6 AAAI 2014 Tutorial Nevin L. Zhang HKUST6  Bayesian score: posterior probability P(m|D) P(m|D) = P(m)P(D|m) / P(D) = P(m)∫ P(D|m, θ) P(θ |m) dθ / P(D)  BIC Score: Large sample approximation of the Bayesian score BIC(m|D) = log P(D|m, θ*) – (d/2) log N  d : number of free parameters; N : sample size.  θ*: MLE of θ, estimated using the EM algorithm.  Likelihood term of BIC: Measures how well the model fits the data.  Second term: Penalty for model complexity.  The use of the BIC score indicates that we are looking for a model that fits the data well and, at the same time, is not overly complex. Model Selection Criteria
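
As a concrete illustration of the scoring function, here is a minimal sketch of the BIC computation in Python; the function name and the example numbers are illustrative, and the maximized log-likelihoods would in practice come from running EM on each candidate model.

    import math

    def bic_score(loglik: float, num_free_params: int, sample_size: int) -> float:
        """BIC(m|D) = log P(D|m, theta*) - (d/2) * log N."""
        return loglik - 0.5 * num_free_params * math.log(sample_size)

    # Hypothetical comparison of two candidate models; loglik values would come
    # from EM runs, num_free_params from the model structure and cardinalities.
    bic_m1 = bic_score(loglik=-12345.6, num_free_params=40, sample_size=5000)
    bic_m2 = bic_score(loglik=-12300.2, num_free_params=95, sample_size=5000)
    best = "m1" if bic_m1 >= bic_m2 else "m2"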

7 AAAI 2014 Tutorial Nevin L. Zhang HKUST7  AIC (Akaike, 1974): AIC(m|D) = log P(D|m, θ*) – d  Holdout likelihood  Data => Training set, validation set.  Model parameters estimated based on the training set.  Quality of model is measured using likelihood on the validation set.  Cross validation: too expensive Model Selection Criteria

8 AAAI 2014 Tutorial Nevin L. Zhang HKUST8 Search Algorithms  Double hill climbing (DHC), (Zhang 2002, 2004)  7 manifest variables.  Single hill climbing (SHC), (Zhang and Kocka 2004)  12 manifest variables  Heuristic SHC (HSHC), (Zhang and Kocka 2004)  50 manifest variables  EAST, (Chen et al 2011)  100+ manifest variables

9 AAAI 2014 Tutorial Nevin L. Zhang HKUST9  Two search procedures  One for model structure  One for cardinalities of latent variables.  Very inefficient. Tested only on data sets with 7 or fewer variables. (Zhang 2004)  DHC tested on synthetic and real-world data sets, using BIC, AIC, and holdout likelihood as the scoring function.  Best models found when BIC was used.  So subsequent work is based on BIC. Double Hill Climbing (DHC)

10 AAAI 2014 Tutorial Nevin L. Zhang HKUST10  Determines both model structure and cardinalities of latent variables using a single search procedure.  Uses five search operators  Node Introduction (NI)  Node Deletion (ND)  Node Relocation (NR)  State Introduction (SI)  State Deletion (SD) Single Hill Climbing (SHC)

11 AAAI 2014 Tutorial Nevin L. Zhang HKUST11  NI involves a latent variable Y and some of its neighbors  It introduces a new node Y’ to mediate Y and the neighbors.  The cardinality of Y’ is set at |Y|  Example:  Y2 introduced to mediate Y1 and its neighbors X1 and X2  The cardinality of Y2 is set at |Y1| Node Introduction (NI)

12 AAAI 2014 Tutorial Nevin L. Zhang HKUST12  NR involves a latent variable Y, a neighbor Z of Y, and another neighbor Y’ of Y that is also a latent variable.  It relocates Z from Y to Y’.  Example:  X3 is relocated from Y1 to Y2 Node Relocation (NR)

13 AAAI 2014 Tutorial Nevin L. Zhang HKUST13  ND involves a latent variable Y and a neighbor Y’ of Y that is also a latent variable.  It removes Y and reconnects the other neighbors of Y to Y’.  Example:  Y2 is removed, with its other neighbors reconnected to Y1. Node Deletion

14 AAAI 2014 Tutorial Nevin L. Zhang HKUST14  State introduction (SI)  Increase the number of states of a latent variable by 1  State deletion (SD)  Reduce the number of states of a latent variable by 1. State Introduction/Deletion

15 AAAI 2014 Tutorial Nevin L. Zhang HKUST15 Single Hill Climbing (SHC)  Start with an initial model (LCM)  At each step:  Construct all possible candidate models using NI, ND, NR, SI and SD  Evaluate them one by one  Pick the best one  Still inefficient  Tested on data with no more than 12 variables.  Reason  Too many candidate models  Too expensive to run EM on all of them

16 AAAI 2014 Tutorial Nevin L. Zhang HKUST16  Scale up SHC  Idea 1: Restrict NI to involve only two neighbors of the latent variable it operates on The EAST Algorithm

17  How do we then go from the model on the left to the model on the right under this restriction?  First apply NI, and then NR  Each NI operation is followed by NR operations to compensate for the restriction on NI. Reachability NI NR

18 AAAI 2014 Tutorial Nevin L. Zhang HKUST18 Idea 2: Reducing Number of Candidate Models  Do not use ALL the operators at once.  How?  BIC: BIC(m|D) = log P(D|m, θ*) – (d/2) log N  Improve the two terms alternately  NI and SI improve the likelihood term:  Let m’ be obtained from m using NI or SI  Then, m’ includes m, and hence has higher maximized likelihood: log P(D|m’, θ’*) >= log P(D|m, θ*)  SD and ND reduce the penalty term.

19 AAAI 2014 Tutorial Nevin L. Zhang HKUST19 The EAST Algorithm (Chen et al. AIJ 2011) 1. Start with a simple initial model 2. Repeat until model score ceases to improve  Expansion:  Search with node introduction (NI), and state introduction (SI)  Each NI operation is followed by NR operations to compensate for the restriction on NI. (See Slide 17)  Adjustment:  Search with NR  Simplification:  Search with node deletion (ND), and state deletion (SD) EAST: Expansion, Adjustment, Simplification until Termination

20 Idea 3: Parameter Value Inheritance  m : current model;  m’ : candidate model generated by applying a search operator on m.  The two models share many parameters  m : (θ1, θ2); m’ : (θ1, λ2)  When evaluating m’, inherit the values of the shared parameters θ1 from m, and estimate only the new parameters λ2: λ2* = arg max_λ2 log P(D|m’, θ1, λ2)

21 AAAI 2014 Tutorial Nevin L. Zhang HKUST21 Avoid Local Optimum at the Expansion Phase  NI: Increases structure complexity.  SI: Increases variable complexity.  Key Issue at the expansion phase:  Tradeoff between structure complexity and variable complexity

22 AAAI 2014 Tutorial Nevin L. Zhang HKUST22  NI and SI are of different granularities  p = 100  SI: 101 more parameters  NI: 2 more parameters  Huge disparity in granularity  Penalty term in BIC insufficient to handle  SI always preferred initially,  Quick increase in variable complexity  Leading to local optimum in model score Operation Granularity

23 AAAI 2014 Tutorial Nevin L. Zhang HKUST23  EAST does not use BIC when choosing between candidate models produced by NI and SI.  Instead, it uses the cost-effectiveness principle  That is, select the candidate model with the highest improvement ratio: the increase in model score per unit increase in model complexity.  The denominator is larger for operations that increase the number of model parameters more.  Can be justified using the Likelihood Ratio Test (LRT)  It picks the candidate model that gives the strongest evidence to reject the null model (the current model) according to the LRT. Dealing with Operation Granularity
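
A minimal sketch of the cost-effectiveness principle, assuming each candidate model is summarized by its BIC score and number of free parameters; the names and numbers are illustrative, not taken from the tutorial.

    def improvement_ratio(score_new, score_old, d_new, d_old):
        """Increase in model score per unit increase in model complexity."""
        return (score_new - score_old) / (d_new - d_old)

    # Hypothetical candidates produced by NI and SI from the current model.
    current = {"bic": -9800.0, "d": 40}
    candidates = [
        {"op": "NI", "bic": -9795.0, "d": 42},   # small increase in parameters
        {"op": "SI", "bic": -9700.0, "d": 141},  # large increase in parameters
    ]
    best = max(candidates, key=lambda c: improvement_ratio(c["bic"], current["bic"],
                                                           c["d"], current["d"]))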

24  The alternative model includes the null model, and hence fits the data better than the null model.  Whether it fits significantly better is determined by the p-value of the likelihood-ratio statistic D (twice the difference in maximized log-likelihood), which approximately follows a chi-squared distribution with d2 – d1 degrees of freedom. Likelihood Ratio Test (LRT) Wikipedia

25  Required D value for given p-value increases roughly linearly with d2-d1  The ratio D/(d2-d1) closely related to p-value  It is a measure of the strength of evidence in favor of the alternative model. Likelihood Ratio Test (LRT)

26 AAAI 2014 Tutorial Nevin L. Zhang HKUST26 Likelihood Ratio Test & Improvement Ratio  The improvement ratio can be written as [BIC(m2|D) – BIC(m1|D)] / (d2 – d1) = ½ · D / (d2 – d1) – ½ · log N  The second term is constant  The first term is exactly ½ · D / (d2 – d1)  Loosely speaking, the cost-effectiveness principle picks the candidate model that gives the strongest evidence to reject the null model (the current model) according to LRT.

27 Search Process on Danish Beer Data

28  EAST used on medical survey data  A few dozen variables  Hundreds to thousands of observations  Model quality important  (Xu et al 2013)

29 AAAI 2014 Tutorial Nevin L. Zhang HKUST29 Part III: Learning Algorithms  Introduction  Search-based algorithms  Algorithms based on variable clustering  Distance-based algorithms  Empirical comparisons  Spectral methods for parameter estimation

30 AAAI 2014 Tutorial Nevin L. Zhang HKUST30  Key Idea  Group variables into clusters  Introduce a latent variable for each cluster  For discrete variables, mutual information is used as similarity measure  Algorithms  BIN-G: Harmeling & Williams, PAMI 2011  Bridged-islands (BI) algorithm: Liu et al. MLJ, 2013 Algorithm based on Variable Clustering

31 AAAI 2014 Tutorial Nevin L. Zhang HKUST31  Learns binary tree models  L ← all observed variables  Loop  Remove from L the pair of variables with the highest mutual information  Introduce a new latent variable  Add the new latent variable to L The BIN-G Algorithm
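
Below is a minimal sketch of this greedy loop in Python. The helper names mutual_information() and learn_lcm() stand for routines described on the next slide (empirical MI estimation, and LCM learning with EM and BIC); they are placeholders, not actual APIs, and the new latent variable is assumed to become the parent of the removed pair.

    from itertools import combinations

    def bin_g(observed_vars, mutual_information, learn_lcm):
        """Greedy BIN-G skeleton: repeatedly merge the highest-MI pair under a new latent."""
        L = set(observed_vars)
        subtrees = {v: v for v in L}          # each variable starts as its own subtree
        latent_count = 0
        while len(L) > 1:
            x, y = max(combinations(L, 2), key=lambda p: mutual_information(*p))
            latent_count += 1
            h = "H%d" % latent_count
            learn_lcm(h, x, y)                # choose |h| and parameters by BIC (next slide)
            subtrees[h] = (h, subtrees.pop(x), subtrees.pop(y))
            L -= {x, y}
            L.add(h)                          # h is then treated as observed via imputation
        return subtrees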

32 AAAI 2014 Tutorial Nevin L. Zhang HKUST32  Learn LCM: Cardinality of new latent variable and parameters  Let |H1|=1 and increase it gradually until termination  At each step, run EM to optimize model parameters and calculate the BIC score  Return the LCM with the highest BIC score  Determine MI between the new latent variable and others  Convert the new latent variable into an observed variable via imputation (hard assignment)  Then calculate MI(H1; X3), MI(H1; X4) Two Issues NOTE: if some latent variables have cardinality 1, they can be removed from the model, resulting in a forest instead of a tree.

33 Result of BIN-G on subset of 20 Newsgroups Dataset

34 AAAI 2014 Tutorial Nevin L. Zhang HKUST34  Learns non-binary trees.  Partitions all observed variables into clusters, with some clusters having >2 variables  Introduces a latent variable for each variable cluster  Links up the latent variables to get a global tree model  The result is a flat latent tree model in the sense that each latent variable is directly connected to at least one observed variable. The BI Algorithm

35 AAAI 2014 Tutorial Nevin L. Zhang HKUST35  Identify a cluster of variables such that,  Variables in the cluster are closely correlated, and  The correlations can be properly modeled using a latent variable.  Uni-Dimensional (UD) cluster  Remove the cluster and repeat the process.  Eventually obtain a partition of the observed variables. BI Step 1: Partition the Observed Variables

36 AAAI 2014 Tutorial Nevin L. Zhang HKUST36  Sketch of algorithm for identifying the first variable cluster  L ← all observed variables  S ← the pair of variables with the highest mutual information  Loop  X ← the variable in L with the highest MI with S  S ← S ∪ {X}, L ← L \ {X}  Perform the uni-dimensionality test on S  If the test fails, stop the loop and pick the first cluster of variables. Obtaining First Variable Cluster
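
A minimal sketch of this cluster-growing loop, under the assumption that mi_pair() and mi_set() estimate mutual information between two variables and between a variable and a set, and ud_test() implements the UD-test of the next slide. These helper names, and cluster_containing(), are hypothetical placeholders.

    from itertools import combinations

    def first_cluster(observed_vars, mi_pair, mi_set, ud_test, cluster_containing):
        L = set(observed_vars)
        x1, x2 = max(combinations(L, 2), key=lambda p: mi_pair(*p))
        S = [x1, x2]
        L -= {x1, x2}
        while L:
            x = max(L, key=lambda v: mi_set(v, S))   # variable most correlated with S
            S.append(x)
            L.discard(x)
            if not ud_test(S):
                # UD-test failed: the two-latent-variable model m2 is significantly
                # better; return the cluster of m2 that contains the seed pair x1, x2.
                return cluster_containing(S, x1, x2)
        return S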

37 AAAI 2014 Tutorial Nevin L. Zhang HKUST37  Test whether the correlations among variables in a set S can be properly modeled using a single latent variable  Example: S={X1, X2, X3, X4, X5}  Learn two models  m1: Best LCM, i.e., LTM with one latent variable  m2: Best LTM with two latent variables  Can be done using EAST  The UD-test passes if and only if BIC(m2|D) – BIC(m1|D) does not exceed a preset threshold Uni-Dimensionality (UD) Test If the use of two latent variables does not give a significantly better model, then the use of one latent variable is appropriate.

38 AAAI 2014 Tutorial Nevin L. Zhang HKUST38  Bayes factor: K = P(D|M2) / P(D|M1)  Unlike a likelihood-ratio test,  Models do not need to be nested  Strength of evidence in favor of M2 depends on the value of K Bayes Factor Wikipedia

39 AAAI 2014 Tutorial Nevin L. Zhang HKUST39  The statistic U = 2 [BIC(m2|D) – BIC(m1|D)] is a large-sample approximation of 2 log K  Strength of evidence in favor of two latent variables depends on U  In the UD-test, the threshold on U is usually set so that only strong evidence (on the standard Bayes-factor scale) leads to the two-latent-variable conclusion  Conclude a single latent variable if there is no strong evidence for >1 latent variables Bayes Factor and UD-Test Wikipedia

40 AAAI 2014 Tutorial Nevin L. Zhang HKUST40  Initially, S={X1, X2}  X3, X4 added to S, and UD-test passes  Next add X5  S = {X1, X2, X3, X4, X5},  m2 is significantly better than m1  UD-test fails  m2 gives two uni-dimensional (UD) clusters  The first variable cluster is: {X1, X2, X4}  Picked because it contains the initial variables X1 and X2 UD-Test and Variable Cluster

41 AAAI 2014 Tutorial Nevin L. Zhang HKUST41  Introduce a latent variable for each UD variable cluster.  Optimize the cardinalities of the latent variables and the parameters BI Step 2: Latent Variable Introduction

42 AAAI 2014 Tutorial Nevin L. Zhang HKUST42  Bridging the “islands” using Chow-Liu’s Algorithm (1968)  Estimate the joint of each pair of latent variables Y and Y’ by aggregating their posteriors over the data, where m and m’ are the LCMs that contain Y and Y’ respectively.  Calculate MI(Y;Y’)  Find the maximum spanning tree for the MI values (see the sketch below) BI Step 3: Link up Latent Variables
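
The linking step reduces to a maximum spanning tree over the latent variables with pairwise MI as edge weights. Below is a minimal Kruskal-style sketch; pairwise_mi() is a placeholder for the MI estimates computed from the pairwise joints described above.

    def maximum_spanning_tree(latents, pairwise_mi):
        """Return edges of a maximum spanning tree over `latents` (Kruskal's algorithm)."""
        edges = sorted(((pairwise_mi(y1, y2), y1, y2)
                        for i, y1 in enumerate(latents)
                        for y2 in latents[i + 1:]), reverse=True)
        parent = {y: y for y in latents}
        def find(y):                               # union-find with path compression
            while parent[y] != y:
                parent[y] = parent[parent[y]]
                y = parent[y]
            return y
        tree = []
        for w, y1, y2 in edges:                    # greedily add the heaviest edges
            r1, r2 = find(y1), find(y2)
            if r1 != r2:
                parent[r1] = r2
                tree.append((y1, y2, w))
        return tree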

43 AAAI 2014 Tutorial Nevin L. Zhang HKUST43  Improvement based on global considerations  Run EM to optimize the parameters of the whole model  For each latent variable Y and each observed variable X, calculate:  Re-estimate MI(Y; X) based on the above distribution  Let Y* be the latent variable Y with the highest MI(Y; X)  If Y* is not currently the neighbor of X, make it so. BI Step 4: Global Adjustment

44 Result of BI on subset of 20 Newsgroups Dataset

45 AAAI 2014 Tutorial Nevin L. Zhang HKUST45 Part III: Learning Algorithms  Introduction  Search-based algorithms  Algorithms based on variable clustering  Distance-based algorithms  Empirical comparisons  Spectral methods for parameter estimation

46 AAAI 2014 Tutorial Nevin L. Zhang HKUST46  Define a distance between variables that is additive over trees  Estimate distances between observed variables from data  Infer model structure from those distance estimates  Assumptions:  Latent variables have equal cardinality, and it is known.  In some cases, it equals the cardinality of observed variables.  Or, all variables are continuous.  Focus on two algorithms  Recursive grouping (Choi et al, JMLR 2011)  Neighbor Joining (Saitou & Nei, 1987, Studier and Keppler, 1988) Distance-Based Algorithms Slides based on Choi et al, 2010: www.ece.nus.edu.sg/stfpage/vtan/latentTree_slides.pdf

47 AAAI 2014 Tutorial Nevin L. Zhang HKUST47  Information distance between two discrete variables X i and X j (Lake 1994): d ij = – log ( |det P XiXj| / sqrt( det P XiXi · det P XjXj ) ), where P XiXj is the joint probability matrix of X i and X j, and P XiXi, P XjXj are the diagonal matrices of their marginal probabilities. Information Distance  When both variables are binary, this reduces to d ij = – log |ρ ij|, where ρ ij is the correlation coefficient between X i and X j.
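
A small numerical sketch of this distance, assuming the determinant form reconstructed above (equal cardinalities, so the joint matrix is square); the example numbers are made up.

    import numpy as np

    def information_distance(joint):
        """joint[a, b] = P(X_i = a, X_j = b); returns -log |det J| / sqrt(det M_i det M_j)."""
        joint = np.asarray(joint, dtype=float)
        det_marg_i = np.prod(joint.sum(axis=1))   # det of diagonal marginal matrix of X_i
        det_marg_j = np.prod(joint.sum(axis=0))   # det of diagonal marginal matrix of X_j
        return -np.log(abs(np.linalg.det(joint)) / np.sqrt(det_marg_i * det_marg_j))

    # For two binary variables this equals -log |correlation|.
    d = information_distance([[0.35, 0.15],
                              [0.10, 0.40]])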

48 AAAI 2014 Tutorial Nevin L. Zhang HKUST48  For any two nodes i and k in the tree and any node j on the path between them: d ik = d ij + d jk  (Erdos, Szekely, Steel, & Warnow, 1999) Additivity of Information Distance on Trees

49 AAAI 2014 Tutorial Nevin L. Zhang HKUST49 Testing Node Relationships  If j is a leaf and i is its parent, then for every other node k: d ik – d jk = – d ij  This implies the difference d ik – d jk is a constant.  It does not change with k.  The equality is not true  if j is not a leaf, or  if i is not the parent of j

50 AAAI 2014 Tutorial Nevin L. Zhang HKUST50 Testing Node Relationships  If i and j are leaves with the same parent (siblings), then for every other node k the difference d ik – d jk is a constant.  It does not change with k.  It is strictly between – d ij and d ij  This property allows us to determine leaf nodes that are siblings

51 AAAI 2014 Tutorial Nevin L. Zhang HKUST51  RG is an algorithm that determines model structure using the two properties mentioned earlier.  Explain RG with an example  Assume data generated by the following model  Data contain no information about latent variables  Task: Recover model structure The Recursive Grouping (RG) Algorithm

52 AAAI 2014 Tutorial Nevin L. Zhang HKUST52  Step 1: Estimate from data the information distance between each pair of observed variables.  Step 2: Identify (leaf, parent-of-leaf) and (leaf-sibling) pairs  For each pair i, j  If d ik – d jk = c (a constant) for all k ≠ i, j  If c = – d ij, then j is a leaf and i is its parent  If – d ij < c < d ij, then i and j are leaves and siblings Recursive Grouping
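
A minimal sketch of the Step-2 test on an estimated distance matrix. The tolerance handles estimation noise and is an illustrative choice; real implementations use more careful thresholding.

    def classify_pair(D, i, j, nodes, tol=1e-6):
        """D[i][j]: estimated information distance; returns the detected relationship of (i, j)."""
        others = [k for k in nodes if k not in (i, j)]
        diffs = [D[i][k] - D[j][k] for k in others]
        if not diffs or max(diffs) - min(diffs) > tol:
            return "neither"                        # difference is not constant in k
        c = diffs[0]
        if abs(c + D[i][j]) <= tol:
            return "j is a leaf and i is its parent"
        if abs(c - D[i][j]) <= tol:
            return "i is a leaf and j is its parent"
        if -D[i][j] < c < D[i][j]:
            return "i and j are leaf siblings"
        return "neither"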

53 AAAI 2014 Tutorial Nevin L. Zhang HKUST53  Step 3: Introduce a hidden parent node for each sibling group without a parent. Recursive Grouping  NOTE: No need to determine model parameters here.

54 AAAI 2014 Tutorial Nevin L. Zhang HKUST54  Step 4. Compute the information distance for new hidden nodes. Recursive Grouping

55 AAAI 2014 Tutorial Nevin L. Zhang HKUST55  Step 5. Remove the identified child nodes and repeat Steps 2-4.  Parameters of the final model can be determined using EM if needed. Recursive Grouping

56 AAAI 2014 Tutorial Nevin L. Zhang HKUST56  Making Recursive Grouping more efficient  Step 1: Construct Chow-Liu tree over observed variables only based on information distance, i.e. maximum spanning tree CL Recursive Grouping (CLRG)

57 AAAI 2014 Tutorial Nevin L. Zhang HKUST57  Step 2: Select an internal node and its neighbors, and apply the recursive-grouping (RG) algorithm. (Much cheaper)  Step 3: Replace the output of RG with the sub-tree spanning the neighborhood. CL Recursive Grouping (CLRG)

58 AAAI 2014 Tutorial Nevin L. Zhang HKUST58  Repeat Steps 2-3 until all internal nodes are operated on.  Theorem: Both RG and CLRG are consistent CL Recursive Grouping (CLRG)

59 Result of CLRG on subset of 20 Newsgroups Dataset

60 AAAI 2014 Tutorial Nevin L. Zhang HKUST60  Choi et al 2011 assume all observed and latent variables have equal cardinality  So that the information distance can be defined.  Assumption relaxed by (Song et al, 2011; Wang et al 2013):  Latent variables can have fewer states than observed,  But they still need to have equal cardinality among themselves.  New definition of information distance based on the pseudo-determinant: d ij = – log ( ∏ s=1..k σ s (P xixj) / sqrt( ∏ s=1..k σ s (P xixi) · ∏ s=1..k σ s (P xjxj) ) ), where σ s (A) denotes the s-th largest singular value of matrix A and k is the rank of the joint probability matrix P xy. An Extension

61 AAAI 2014 Tutorial Nevin L. Zhang HKUST61  Gaussian distributions: d ij = – log |ρ ij|, where ρ ij is the correlation coefficient between X i and X j  Non-Gaussian distributions (Song et al, NIPS 2011): distances obtained via kernel embeddings Information Distance for Continuous Variables

62 AAAI 2014 Tutorial Nevin L. Zhang HKUST62 Neighbor Joining  Another method to infer model structure using tree-additive distances  Originally developed for phylogenetic trees (Saitou & Nei, 1987)  Key Observations:  Closest pair might not be siblings  d AB smallest, but those two leaves are not neighbors

63 AAAI 2014 Tutorial Nevin L. Zhang HKUST63 Neighbor Joining  Let L be the set of leaf nodes  Define: r i = 1/(|L| – 2) Σ k in L d ik and D ij = d ij – ( r i + r j )  Theorem  A pair of leaves with minimal D ij are siblings

64 AAAI 2014 Tutorial Nevin L. Zhang HKUST64 Neighbor Joining  Initialization  Define T to be the set of leaf nodes  Make the list L of active nodes = T  Loop until |L|=2  Find two nodes i and j for which D ij is minimal  Combine them to form a new node k with  d km = 1/2 ( d im + d jm – d ij ) for all m in L  d ik = 1/2 ( d ij + r i – r j )  d jk = d ij – d ik  Remove i and j from L and add node k  The result is a binary tree. (See the sketch below.)
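
Below is a minimal sketch of the neighbor-joining loop just described, operating on a dictionary of pairwise distances. The internal node names it creates (Y1, Y2, …) are made up for illustration.

    def neighbor_joining(leaves, dist):
        """leaves: list of leaf names; dist[(i, j)]: tree-additive distance estimates."""
        d = {frozenset(p): v for p, v in dist.items()}
        get = lambda i, j: 0.0 if i == j else d[frozenset((i, j))]
        active, edges, new_id = list(leaves), [], 0
        while len(active) > 2:
            r = {i: sum(get(i, k) for k in active) / (len(active) - 2) for i in active}
            # pick the pair minimizing D_ij = d_ij - (r_i + r_j)
            i, j = min(((a, b) for a in active for b in active if a != b),
                       key=lambda p: get(*p) - (r[p[0]] + r[p[1]]))
            new_id += 1
            k = "Y%d" % new_id
            d_ik = 0.5 * (get(i, j) + r[i] - r[j])
            edges += [(i, k, d_ik), (j, k, get(i, j) - d_ik)]
            for m in active:
                if m not in (i, j):
                    d[frozenset((k, m))] = 0.5 * (get(i, m) + get(j, m) - get(i, j))
            active = [m for m in active if m not in (i, j)] + [k]
        if len(active) == 2:
            edges.append((active[0], active[1], get(active[0], active[1])))
        return edges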

65 AAAI 2014 Tutorial Nevin L. Zhang HKUST65 Example

66 AAAI 2014 Tutorial Nevin L. Zhang HKUST66  Making Neighbor Joining more efficient  Step 1: Construct Chow-Liu tree over observed variables only based on information distance, i.e. maximum spanning tree  Step 2: Select an internal node and its neighbors, and apply the NJ algorithm. (Much cheaper)  Step 3: Replace the output of NJ with the sub-tree spanning the neighborhood.  Repeat Steps 2-3 until all internal nodes are operated on. CL Neighbor Joining (CLNJ)

67 Result of CLNJ on subset of 20 Newsgroups Dataset

68 AAAI 2014 Tutorial Nevin L. Zhang HKUST68  Originally developed for phylogenetic tree reconstruction (John et al, Journal of Algorithms, 2003)  Idea:  First determine the structures among quartets (groups of 4) of observed variables  Then combine the results to obtain a global model of all the observed variables.  If two nodes are not siblings in a quartet model, they cannot be siblings in the global model.  For general LTMs: Chen et al, PGM 2006, Anandkumar et al, NIPS 2011, Mossel et al, 2011. Quartet-Based Algorithms

69 AAAI 2014 Tutorial Nevin L. Zhang HKUST69 Part III: Learning Algorithms  Introduction  Search-based algorithms  Algorithms based on variable clustering  Distance-based algorithms  Empirical comparisons  Spectral methods for parameter estimation

70 AAAI 2014 Tutorial Nevin L. Zhang HKUST70  Algorithms compared  Search-Based Algorithm:  EAST (Chen et al, AIJ 2011)  Variable Clustering-Based Algorithms  BIN (Harmeling & Williams, PAMI 2011)  BI (Liu et al. MLJ, 2013)  Distance-Based Algorithms  CLRG (Choi et al, JMLR 2011)  CLNJ (Saitou & Nei, 1987)  Data  Synthetic data  Real-world data  Measurements  Running time  Model quality Empirical Comparisons

71 AAAI 2014 Tutorial Nevin L. Zhang HKUST71  4-complete model (M4C):  Every latent node has 4 neighbors  All variables are binary  Parameter values randomly generated such that the normalized MI between each pair of neighbors is between 0.05 and 0.75. Generative Models

72 AAAI 2014 Tutorial Nevin L. Zhang HKUST72  M4CF: Obtained from M4C  More variables added so that each latent node has 3 observed neighbors.  It is a flat model.  Other models and the total number of observed variables Generative Models

73 AAAI 2014 Tutorial Nevin L. Zhang HKUST73  Synthetic Data  Training: 5,000; Testing: 5,000  No information on latent variables  Evaluation Criteria: Distribution  m 0 : generative model; m : learned model  Empirical KL divergence on testing data: the average of log P(d|m 0) – log P(d|m) over the test cases. The smaller the better.  The second term is the hold-out likelihood of m. The larger the better. Synthetic Data and Evaluation Criteria
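
A tiny sketch of this criterion, assuming per-case log-likelihoods under the generative model m0 and the learned model m have already been computed on the test set.

    import numpy as np

    def empirical_kl(loglik_m0, loglik_m):
        """Average of log P(d|m0) - log P(d|m) over test cases; smaller is better."""
        return float(np.mean(np.asarray(loglik_m0) - np.asarray(loglik_m)))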

74 Evaluation Criteria: Structure  Robinson-Foulds (RF) distance between the tree structures. Example: For the two models on the right  m0  Y2-Y1: X1X2X3 | X4X5X6X7  Y1-Y3: X1X2X3X4 | X5X6X7  X1-Y2: X1 | X2X3X4X5X6X7 ..  m  Y2-Y1: X1X2 | X3X4X5X6X7  Y1-Y3: X1X2X3X4 | X5X6X7  X1-Y2: X1 | X2X3X4X5X6X7 ..  d RF = (1 + 1)/2 = 1 Not defined for forests

75 AAAI 2014 Tutorial Nevin L. Zhang HKUST75  EAST was too slow on data sets with more than 100 attributes.  CLRG was the fastest, followed by CLNJ, BIN and BI. Running Times (Seconds)

76 AAAI 2014 Tutorial Nevin L. Zhang HKUST76  Flat generative models  Non-flat generative models  EAST found best models on data with dozens of attributes  BI found best models on data with hundreds of attributes.  BIN is the worst. (No RF values because it produces forests, not trees) Model Quality

77 AAAI 2014 Tutorial Nevin L. Zhang HKUST77  Make latent variables have different cardinalities in the generative models  3 for those at levels 1 and 3  2 for those at level 2.  All algorithms perform worse than before  EAST and BI still found the best models.  CLRG and CLNJ are especially bad on M7CF1.  They assume all latent variables have equal cardinality. When latent variables have different cardinalities …

78 AAAI 2014 Tutorial Nevin L. Zhang HKUST78  Data  Evaluation criteria:  BIC score on training set: measures of model fit  Loglikelihood on testing set (hold-out likelihood): measures how well learned model predict unseen data. Real-World Data Sets

79 AAAI 2014 Tutorial Nevin L. Zhang HKUST79  CLNJ and CLRG not applicable to Coil-42 and Alarm because different attributes have different cardinalities.  EAST did not finish on News-100 and WebKB within 60 hours  CLRG was the fastest, followed by CLNJ, BIN and BI. Running Times (seconds)

80 AAAI 2014 Tutorial Nevin L. Zhang HKUST80  EAST and BI found best models. BIN found the worst.  Structures of models obtained on News-100 by BIN, BI, CLRG, and CLNJ are shown earlier.  BI introduced fewer latent variables. The model is more “compact”. Model Quality

81 AAAI 2014 Tutorial Nevin L. Zhang HKUST81  EAST found the best models on data sets it could manage  With < 100 observed variables, hundreds to thousands of observations  BI found the best models on data sets with hundreds of observed variables, and was slower than BIN, CLRG, and CLNJ.  BIN found the worst models.  CLRG and CLNJ are not applicable when observed variables do NOT have equal cardinality. Summary

82 AAAI 2014 Tutorial Nevin L. Zhang HKUST82 Part III: Learning Algorithms  Introduction  Search-based algorithms  Algorithms based on variable clustering  Distance-based algorithms  Empirical comparisons  Spectral methods for parameter estimation

83 AAAI 2014 Tutorial Nevin L. Zhang HKUST83  The Expectation-Maximization (EM) algorithm (Dempster et al 1977)  Start with an initial guess  Iterate until termination  Improve the current parameter values by maximizing the expected complete-data log-likelihood  Can be trapped at local maxima  Spectral methods (Anandkumar et al., 2012)  Get empirical distributions of 2 or 3 observed variables  Relate them to model parameters  Determine model parameters accordingly  Need large sample size for robust estimates. Parameter Estimation

84 AAAI 2014 Tutorial Nevin L. Zhang HKUST84  An MRF  Undirected graph  Potentials  Non-negative functions  Associated with edges and hyper-edges  To eliminate a variable X in an MRF  Multiply all potentials involving X  Obtain a new potential by summing X out of the product  Ex: Elimination of B and E : Markov Random Field over Graphs

85 AAAI 2014 Tutorial Nevin L. Zhang HKUST85  Lower-case letters denote values of variables  E.g. a, b are values for A and B  Use φ(a, b) to denote the value of a potential over A and B at A=a, B=b  Use Φ AB to denote the matrix [ φ(a, b) ]  The value at the a-th row and b-th column is φ(a, b)  Elimination can then be rewritten as matrix multiplication Matrix Representation of Potentials and Elimination

86 AAAI 2014 Tutorial Nevin L. Zhang HKUST86  Use φ(b) to denote the value of a potential over a single variable B at B=b  Use Φ B to denote the column vector whose b-th row is φ(b)  Similarly define the notation for the other variables and potentials  Then, the elimination equation can be rewritten in matrix-vector form Matrix Representation of Potentials and Elimination

87 AAAI 2014 Tutorial Nevin L. Zhang HKUST87  Equality (1) (Anandkumar et al., 2012)  H: latent; A, B, C: Observed  All variables have equal cardinality  For a given value b of B: P A|H diag( P(B=b|H=1), P(B=b|H=2), … ) P A|H^-1 = P A,b,C ( P A,C )^-1, where P A,b,C is the matrix [ P(A=a, B=b, C=c) ] and P A,C is the matrix [ P(A=a, C=c) ]  Notes  The matrices on the right hand side can be estimated from data  The eigenvalues of the matrix on the left are model parameters: P(B=b|H=1), P(B=b|H=2),… Spectral Method for Parameter Estimation

88 AAAI 2014 Tutorial Nevin L. Zhang HKUST88  Parameter estimation:  Get the empirical distributions P(A, B, C) and P(A, C) from data  For each value b of B, form the matrix product on the right-hand side of Equation (1)  Find the eigenvalues of the matrix. (Spectral method)  They are the elements of P b|H, i.e., P(B=b|H=1), P(B=b|H=2),…  Notes  Similarly, we can estimate the other parameters  The technique can be used on LTMs with >1 latent variable and where observed variables have higher cardinality than latent variables. Spectral Method for Parameter Estimation
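
A minimal numerical sketch of this procedure, assuming A, B, C, and H all have the same cardinality and that the empirical tables P(A, B, C) and P(A, C) are given as numpy arrays. Mixing the slices over b and diagonalizing once is one standard way to keep the hidden-state ordering consistent across values of b; it is an implementation choice here, not something stated on the slides.

    import numpy as np

    def spectral_conditional_B(P_abc, P_ac):
        """Estimate P(B=b|H=h) (up to a relabeling of H's states) from empirical joints.

        P_abc[a, b, c] = P(A=a, B=b, C=c);  P_ac[a, c] = P(A=a, C=c).
        """
        inv_ac = np.linalg.inv(P_ac)
        # A random mixture over b has distinct eigenvalues with high probability, and its
        # eigenvectors (columns of P_{A|H}) simultaneously diagonalize every slice.
        weights = np.random.rand(P_abc.shape[1])
        M_mix = np.einsum('abc,b->ac', P_abc, weights) @ inv_ac
        _, V = np.linalg.eig(M_mix)
        V_inv = np.linalg.inv(V)
        est = []
        for b in range(P_abc.shape[1]):
            M_b = P_abc[:, b, :] @ inv_ac              # Equation (1): similar to diag(P(b|H))
            est.append(np.diag(V_inv @ M_b @ V).real)  # its eigenvalues are P(B=b|H=h)
        return np.array(est)                           # est[b, h] approximates P(B=b|H=h)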

89 AAAI 2014 Tutorial Nevin L. Zhang HKUST89  Start with a generalized MRF  Some potentials might have negative values  Eliminate C, and then H’  Further eliminating H, we get P(A, B, A’).  For a value b of B, Derivation of Equation (1)

90 AAAI 2014 Tutorial Nevin L. Zhang HKUST90  Start with the MRF  Eliminate H and H’, and let B=b  Next, eliminate C.  Putting the pieces together, we get Equation (1) Derivation of Equation (1)

91 AAAI 2014 Tutorial Nevin L. Zhang HKUST91  In general, we cannot determine P(A, B, C, D) from P(A, B, D), P(B, D), P(B, C, D).  Possible in LTMs (Parikh et al., 2011)  Can be used to estimate joint probabilities without explicit parameter estimation  Get empirical distributions of 2 or 3 observed variables from data.  Calculate the joint probability of a particular value assignment of ALL observed variables from them using the relationship. Another Equation of Similar Flavor

92 AAAI 2014 Tutorial Nevin L. Zhang HKUST92 Matrix Representation of Potentials  Observed: A, B, C, D  Latent: H, G  All variables have equal cardinality,

93 AAAI 2014 Tutorial Nevin L. Zhang HKUST93  P(A, B, C, D) not changed during transformations Transformations

94 AAAI 2014 Tutorial Nevin L. Zhang HKUST94 Transformations  Eliminate: H’, G’, and {H, G}  From the last MRF, we get:  The joint of the 4 observed variables is determined from joints of 3 and 2 observed variables

95 AAAI 2014 Tutorial Nevin L. Zhang HKUST95  The technique can be used  On trees with >4 observed variables.  When observed variables have higher cardinality than latent variables. Notes

96 AAAI 2014 Tutorial Nevin L. Zhang HKUST96  An equation that relates model parameters to joint distributions of 2 or 3 observed variables  An equation that relates joint distributions of 4 or more observed variables to joint distributions of 2 or 3 observed variables.  Need large sample size for robust results Summary

