Latent Tree Models Part II: Definition and Properties


1 Latent Tree Models Part II: Definition and Properties
AAAI 2014 Tutorial Latent Tree Models Part II: Definition and Properties Nevin L. Zhang Dept. of Computer Science & Engineering The Hong Kong Univ. of Sci. & Tech. Now we move on to Part II of the tutorial. Here I will give the precise definition of latent tree models, discuss their properties, and explain how they are related to other models in the literature.

2 Part II: Concept and Properties
Latent Tree Models Definition Relationship with finite mixture models Relationship with phylogenetic trees Basic Properties

3 Basic Latent Tree Models (LTM)
Bayesian network All variables are discrete Structure is a rooted tree Leaf nodes are observed (manifest variables) Internal nodes are not observed (latent variables) Parameters: P(Y1), P(Y2|Y1), P(X1|Y2), P(X2|Y2), … Semantics: P(X1, X2, …, Y1, Y2, …) = P(Y1) P(Y2|Y1) P(X1|Y2) P(X2|Y2) ⋯ The basic latent tree model is a BN where all variables are discrete, the structure is a rooted tree, the leaf nodes are observed and are sometimes called manifest variables, and the internal nodes are not observed and are called latent variables. Probability parameters for the model include a marginal distribution for the root node and a conditional distribution for each non-root node given its parent. The product of all those distributions is a joint distribution over all the variables. In early publications, latent tree models were called hierarchical latent class models, because they generalize latent class models, a class of finite mixture models for discrete data. Also known as hierarchical latent class (HLC) models (Zhang, JMLR 2004).

4 Joint Distribution over Observed Variables
Marginalizing out the latent variables in P(X1, …, Xn, Y1, …, Ym), we get a joint distribution over the observed variables: P(X1, …, Xn) = Σ_{Y1, …, Ym} P(X1, …, Xn, Y1, …, Ym). In comparison with Bayesian networks without latent variables, LTM: Is computationally very simple to work with. Can represent complex relationships among manifest variables. What does the structure look like without the latent variables? An LTM represents a joint distribution over the observed and the latent variables. If we marginalize out the latent variables, we get a distribution over the observed variables only. So, an LTM can also be said to represent a joint distribution over the observed variables. To represent a joint distribution over observed variables, we could instead use a Bayesian network without latent variables. In comparison, the advantages of LTMs are: on one hand, an LTM is computationally simple to work with because of the tree structure; on the other hand, it can represent complex relationships among the observed variables. To see this, imagine the relationships among the observed variables in this model without using latent variables: we would need a complete model where every pair of variables is connected by an edge. These two characteristics make LTMs a very interesting class of models. A brute-force illustration of the marginalization is sketched below.
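To make the marginalization concrete, here is a minimal sketch in Python. The chain structure Y1 → Y2 → {X1, X2} and the binary CPT values are made up for illustration, not taken from the tutorial; it sums out the latent variables by brute force, whereas real implementations would exploit the tree structure with message passing.

```python
# A minimal sketch (hypothetical structure and parameters): brute-force
# marginalization of the latent variables in a tiny LTM with binary
# variables and structure Y1 -> Y2 -> {X1, X2}.
import itertools
import numpy as np

p_y1 = np.array([0.6, 0.4])                     # P(Y1)
p_y2_given_y1 = np.array([[0.7, 0.3],           # P(Y2 | Y1=0)
                          [0.2, 0.8]])          # P(Y2 | Y1=1)
p_x1_given_y2 = np.array([[0.9, 0.1],           # P(X1 | Y2)
                          [0.3, 0.7]])
p_x2_given_y2 = np.array([[0.8, 0.2],           # P(X2 | Y2)
                          [0.1, 0.9]])

def p_obs(x1, x2):
    """P(X1=x1, X2=x2) = sum over Y1, Y2 of the full joint."""
    total = 0.0
    for y1, y2 in itertools.product(range(2), repeat=2):
        total += (p_y1[y1] * p_y2_given_y1[y1, y2]
                  * p_x1_given_y2[y2, x1] * p_x2_given_y2[y2, x2])
    return total

# The marginal over the observed variables sums to 1.
assert abs(sum(p_obs(a, b) for a in range(2) for b in range(2)) - 1.0) < 1e-12
```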

5 Pouch Latent Tree Models (PLTM)
An extension of the basic LTM Rooted tree Internal nodes represent discrete latent variables Each leaf node consists of one or more continuous observed variables, called a pouch. (Poon et al. ICML 2010) Pouch latent tree models are an extension of the basic LTM. The model is still a rooted tree, and the internal nodes still represent discrete latent variables. However, each leaf node consists of one or more continuous observed variables. Because it might contain multiple observed variables, it is called a pouch. For each pouch, there is a conditional Gaussian distribution for all its observed variables given the parent. For the pouch at the bottom left corner, for example, we have a Gaussian distribution for X1 and X2 given the parent Y2.
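As an illustration of the pouch idea, here is a minimal sketch of evaluating the conditional Gaussian density of the pouch {X1, X2} given the state of its discrete parent Y2. The means and covariances are hypothetical, not from the tutorial.

```python
# A minimal sketch (hypothetical parameters): the density of the pouch
# {X1, X2} given its discrete parent Y2 is a conditional Gaussian, with
# one mean vector and covariance matrix per parent state.
import numpy as np
from scipy.stats import multivariate_normal

pouch_params = {
    0: {"mean": np.array([0.0, 0.0]), "cov": np.eye(2)},        # Y2 = 0
    1: {"mean": np.array([2.0, -1.0]), "cov": 0.5 * np.eye(2)}, # Y2 = 1
}

def pouch_density(x, y2_state):
    """p(X1, X2 | Y2 = y2_state) for the bottom-left pouch."""
    p = pouch_params[y2_state]
    return multivariate_normal.pdf(x, mean=p["mean"], cov=p["cov"])

print(pouch_density(np.array([0.1, -0.2]), y2_state=0))
```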

6 More General Latent Variable Tree Models
Some internal nodes can be observed Internal nodes can be continuous Forest Primary focus of this tutorial: the basic LTM (Choi et al. JMLR 2011) In the literature, there are more general latent variable tree models. In some cases, some internal nodes can be observed, as in this example taken from Choi et al. (JMLR 2011). Here the nodes SLB and UTX are observed. In addition, the internal nodes might be continuous, and the overall structure might be a forest instead of a tree. In this tutorial, our primary focus is on the basic LTM, although we will touch on the other models here and there.

7 Part II: Concept and Properties
Latent Tree Models Definition Relationship with finite mixture models Relationship with phylogenetic trees Basic Properties Latent tree models are closely related to finite mixture models. We will see how in the next few minutes.

8 Finite Mixture Models (FMM)
Gaussian Mixture Models (GMM): Continuous attributes p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k) Graphical model The most common finite mixture model is the Gaussian mixture model. It is for continuous data. It assumes that the data consist of K clusters. The distribution in each cluster is a normal distribution with mean vector μ_k and covariance matrix Σ_k. The distribution of the whole data set is a mixture of those Gaussian distributions with mixing coefficients π_k. The red figure is a simple graphical representation of GMM. The latent variable Z represents the K clusters, while X represents the vector of attributes. If we spell out all the attributes, we get the graphical model on the right.
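For concreteness, here is a minimal sketch of the GMM density p(x) = Σ_k π_k N(x | μ_k, Σ_k), with hypothetical parameters for K = 2 clusters in two dimensions.

```python
# A minimal sketch of the GMM density (parameters are made up for
# illustration): a weighted sum of Gaussian component densities.
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.3, 0.7])                          # mixing coefficients
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # cluster means
sigma = [np.eye(2), 2.0 * np.eye(2)]               # cluster covariances

def gmm_density(x):
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=sigma[k])
               for k in range(len(pi)))

print(gmm_density(np.array([1.0, 1.0])))
```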

9 Finite Mixture Models (FMM)
GMM with independence assumptions Block diagonal covariance matrix Graphical Model Sometimes, independence assumptions are made; that is, the covariance matrix is assumed to have a block diagonal form. In this example, X3 is assumed to be independent of X1 and X2. To represent the independence graphically, we can have two pouches, one containing X1 and X2 and the other containing X3. To put it another way, the figure on the bottom shows a GMM with an independence assumption.

10 Finite Mixture Models Latent class models (LCM): Discrete attributes
Distribution for cluster k: product multinomial distribution P(X1, …, Xn | Z = k) = Π_{i=1}^{n} P(Xi | Z = k) All FMMs: one latent variable, yielding one partition of data Graphical Model The counterpart of GMM for discrete data is the latent class model. It assumes that the data consist of K clusters. Within each cluster, the attributes are mutually independent. So, the distribution for each cluster is a product multinomial distribution. The distribution of the whole data set is a mixture of those product multinomial distributions. In all finite mixture models, there is only one latent variable, and only one partition of the data is produced.
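A minimal sketch of the latent class model likelihood follows, assuming K = 2 clusters and three binary attributes; the parameter values are made up for illustration. Each cluster's distribution is a product of per-attribute distributions (Bernoullis in the binary case).

```python
# A minimal sketch (hypothetical parameters): a latent class model with
# K = 2 clusters and three binary attributes. Within a cluster the
# attributes are independent, so each cluster density is a product of
# per-attribute multinomials.
import numpy as np

pi = np.array([0.4, 0.6])          # P(Z = k)
# theta[k, i] = P(X_i = 1 | Z = k)
theta = np.array([[0.9, 0.8, 0.1],
                  [0.2, 0.3, 0.7]])

def lcm_prob(x):
    """P(x) = sum_k pi_k * prod_i P(x_i | Z = k)."""
    x = np.asarray(x)
    per_cluster = np.prod(theta**x * (1 - theta)**(1 - x), axis=1)
    return float(pi @ per_cluster)

print(lcm_prob([1, 1, 0]))
```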

11 From FMMs to LTMs Start with several GMMs,
Each based on a distinct subset of attributes Each partitions the data from a certain perspective. Different partitions are independent of each other Link them up to form a tree model Get pouch LTM Considers different perspectives in a single model Multiple partitions of data that are correlated. Conceptually, a latent tree model can be viewed as a collection of finite mixture models that are linked up to form one single model. For example, we can start with these 3 GMMs. The first two have independence assumptions, while the third one does not. They are based on different attributes. Each can produce a partition of the data, so three partitions of the data are obtained. Different partitions focus on different aspects of the data, and they are independent of each other. Now, we can link them up to form one single model. Here an extra latent variable is introduced. In general, this is not necessary. This bigger model produces multiple partitions of the data, just as the collection of smaller models shown above does. The only difference is that now the different partitions are related.

12 From FMMs to LTMs Start with several LCMs, Each based on a distinct subset of attributes Each partitions the data from a certain perspective. Different partitions are independent of each other Link them up to form a tree model Get LTM Considers different perspectives in a single model Multiple partitions of data that are correlated. Here we have three latent class models, each based on a distinct subset of attributes. They can give us 3 independent partitions of the data, each focusing on one aspect of the data. The partitions are independent of each other. If we link up the three models, we get this overall model. It produces three related partitions of the data. In summary, an LTM can be viewed as a collection of FMMs, with their latent variables linked up to form a tree structure. In this sense, LTM is an extension of LCM, and pouch LTM is an extension of GMM. Summary: An LTM can be viewed as a collection of FMMs, with their latent variables linked up to form a tree structure.

13 Part II: Concept and Properties
Latent Tree Models Definition Relationship with finite mixture models Relationship with phylogenetic trees Basic Properties Latent tree models are also closely related to phylogenetic trees.

14 Phylogenetic trees TAXA (sequences) identify species
Edge lengths represent evolution time Usually, bifurcating tree topology Durbin, et al. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press. Phylogenetic trees depict the evolutionary history of species. Each species is represented by a sequence, called a taxon. The edge lengths represent evolution time. Usually, phylogenetic trees are binary, meaning that one species evolves into two different species at a time.

15 Probabilistic Models of Evolution
Two assumptions There are only substitutions, no insertions/deletions (aligned) One-to-one correspondence between sites in different sequences Each site evolves independently and identically P(x | y, t) = Π_{i=1}^{m} P(x(i) | y(i), t), where m is the sequence length P(x(i) | y(i), t): Jukes-Cantor (character evolution) model [1969] with rate of substitution α In probabilistic models of evolution, it is typically assumed that evolution happens only through substitutions of characters. There are no insertions or deletions of characters. It is further assumed that each site evolves independently. Under those assumptions, the probability that a sequence y evolves into another sequence x in time t is the product of the site evolution probabilities. Here, P(x(i) | y(i), t) is the probability that the character at site i of sequence y evolves into the character at site i of x in time t. In the Jukes-Cantor character evolution model, it is given by a 4x4 matrix whose diagonal entries are 1/4 + (3/4)e^{-4αt} and whose off-diagonal entries are 1/4 - (1/4)e^{-4αt}, where α is the rate of substitution. We see that when t = 0, the matrix is a diagonal matrix with 1 on the diagonal, which indicates no evolution. When t goes to infinity, all the cells in the matrix go to 1/4.
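Here is a minimal sketch of the Jukes-Cantor transition matrix in Python, using the parameterization above (as in Durbin et al.); the two print statements check the t = 0 and t → infinity limits mentioned in the text.

```python
# A minimal sketch of the Jukes-Cantor transition matrix P(x(i) | y(i), t)
# over the four nucleotides, with substitution rate alpha. Diagonal entries
# are 1/4 + 3/4 * exp(-4*alpha*t); off-diagonal entries are
# 1/4 - 1/4 * exp(-4*alpha*t).
import numpy as np

def jukes_cantor(t, alpha=1.0):
    same = 0.25 + 0.75 * np.exp(-4 * alpha * t)
    diff = 0.25 - 0.25 * np.exp(-4 * alpha * t)
    return np.full((4, 4), diff) + (same - diff) * np.eye(4)

print(jukes_cantor(0.0))    # identity matrix: no evolution at t = 0
print(jukes_cantor(100.0))  # all entries approach 1/4 as t grows
```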

16 Phylogenetic Trees are Special LTMs
When we focus on one site, phylogenetic trees are special latent tree models The structure is a binary tree The variables share the same state space. Each conditional distribution is characterized by only one parameter, namely the length of the corresponding edge Because different sites evolve independently and in identical manners, we can focus on the evolution of one site. When we do that, a phylogenetic tree becomes a latent tree where: the structure is binary, all observed and latent variables share the same state space (A, C, G, T), and each conditional distribution is characterized by one parameter, the length of the corresponding edge. So, latent tree models can be viewed as a generalization of phylogenetic trees, where: a node can have more than two children, different variables may have different state spaces, and conditional distributions are multinomial.

17 Hidden Markov Models Hidden Markov models are also special latent tree models All latent variables share the same state space. All observed variables share the same state space. P(yt | st) and P(st+1 | st) are the same for different t's. Finally, hidden Markov models are also special latent tree models: when an HMM is unrolled over time, the hidden states form a chain of latent variables, and each observation is a leaf attached to its state.
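To show the parameter tying, here is a minimal sketch (the tables are hypothetical) where the same transition and emission distributions are reused at every time step; the likelihood is computed with the standard forward recursion on the chain.

```python
# A minimal sketch (hypothetical parameters): an HMM unrolled for T steps
# is a latent tree whose internal nodes s_1, ..., s_T form a chain and
# whose leaves y_1, ..., y_T hang off the chain; the same transition and
# emission tables are reused at every t.
import numpy as np

p_s1 = np.array([0.5, 0.5])            # P(s_1)
trans = np.array([[0.9, 0.1],          # P(s_{t+1} | s_t), shared over t
                  [0.2, 0.8]])
emit = np.array([[0.7, 0.3],           # P(y_t | s_t), shared over t
                 [0.1, 0.9]])

def hmm_likelihood(ys):
    """P(y_1, ..., y_T) via the forward recursion on the chain."""
    alpha = p_s1 * emit[:, ys[0]]
    for y in ys[1:]:
        alpha = (alpha @ trans) * emit[:, y]
    return float(alpha.sum())

print(hmm_likelihood([0, 1, 1, 0]))
```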

18 Part II: Concept and Basic Properties
Latent Tree Models Definition Relationship with finite mixture models Relationship with phylogenetic trees Basic Properties I have now finished discussing the concept of latent tree models, and how they are related to various other models. Next, I will discuss some basic properties of latent tree models.

19 Two Concepts of Models So far, a model consists of
Observed and latent variables Connections among the variables Probability values For the rest of Part II, a model consists of Observed and latent variables Connections among the variables Probability parameters To do that, it would be helpful to distinguish between two different concepts of models. So far in this tutorial, our view of a model is that it consists of some observed and latent variables, connections among the variables, and probability values. For the next part, we need to take the view that a model consists of some observed and latent variables, connections among the variables, and probability parameters, rather than probability values.

20 Model Inclusion Let me first introduce the notion of model inclusion. Suppose we have two latent tree models m and m' with the same observed variables. By setting the probability parameters to different values, each model represents different distributions over the observed variables. So, each model corresponds to a set of joint distributions over the observed variables. We say that the model m includes the model m' if the collection of joint distributions that m can represent is a superset of the collection of distributions that m' can represent. In other words, for each vector of parameter values θ' for m', we can always find a vector of parameter values θ for m such that the two models give the same distribution over the observed variables.

21 Model Equivalence If m includes m’ and vice versa, then they are marginally equivalent. If they also have the same number of free parameters, then they are equivalent. It is not possible to distinguish between equivalent models based on data. If m includes m’ and m’ also includes m, then the two model can represent exactly the same collection of distributions over the observed variables. In the case, we say that they are marginally equivalent. If they also have the same number of free parameters, then we say that they are equivalent. It is not possible to distinguish between equivalent models based on data.

22 Root Walking Root walking is an operation that we can apply to a latent tree. It changes the root of the model. To be more specific, it makes a neighbor of the current root the new root.

23 Root Walking Example Root walks to Y2; Root walks to Y3
As an example, suppose we start from the model on the top. The root is Y1. We can let the root walk from Y1 to Y2. Then we get the model on the left where the root is Y2. On the other hand, if we let the root walk from Y1 to Y3, then we get the model on the right, where the root is Y3.
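Numerically, root walking is just a re-factorization of the same joint distribution via Bayes' rule. A minimal sketch with made-up parameters for the edge between Y1 and Y2:

```python
# A minimal sketch (hypothetical parameters): root walking from Y1 to Y2
# replaces P(Y1)P(Y2 | Y1) with P(Y2)P(Y1 | Y2) via Bayes' rule, so the
# joint distribution over the edge is unchanged.
import numpy as np

p_y1 = np.array([0.6, 0.4])
p_y2_given_y1 = np.array([[0.7, 0.3],
                          [0.2, 0.8]])

joint = p_y1[:, None] * p_y2_given_y1       # P(Y1, Y2)
p_y2 = joint.sum(axis=0)                    # new root marginal P(Y2)
p_y1_given_y2 = joint / p_y2                # new conditional P(Y1 | Y2)

# The re-rooted factorization reproduces the same joint distribution.
assert np.allclose(p_y2[None, :] * p_y1_given_y2, joint)
```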

24 Root Walking Theorem: Root walking leads to equivalent latent tree models. (Zhang, JMLR 2004) Special case of covered arc reversal in general Bayesian networks, Chickering, D. M. (1995). A transformational characterization of equivalent Bayesian network structures. UAI. It has been shown that root walking leads to equivalent latent tree models. In our example, all three models are equivalent. The result is a special case of a more general result for Bayesian networks called covered arc reversal.

25 Implication Edge orientations in latent tree models are not identifiable. Technically, better to start with alternative definition of LTM: A latent tree model (LTM) is a Markov random field over an undirected tree, or tree-structured Markov network where variables at leaf nodes are observed and variables at internal nodes are hidden. The implication of the theorem is that edge orientations in latent tree models cannot be determined from data. Given this fact, a technically cleaner way is to define LTM as follows: a latent tree model is a Markov random field over an undirected tree, or a tree-structured Markov network, where the leaf are observed and the internal nodes are not.

26 Implication For technical convenience, we often root an LTM at one of its latent nodes and regard it as a directed graphical model. Rooting the model at different latent nodes lead to equivalent directed models. This is why we introduced LTM as directed models. For convenience, we often root LTM at one of its latent nodes and regard it as a directed model. The choice of root does not matter, as difference choice lead to equivalent models.

27 Regularity |X|: Cardinality of variable X, i.e., the number of states.
The next issue is regularity. An LTM is regular if each latent variable does not have too many states w.r.t. its neighbors. To be specific, for a latent variable Z with neighbors Z1, …, Zk, the cardinality of Z should be no greater than the product of the cardinalities of all its neighbors divided by the maximum of the neighbor cardinalities: |Z| ≤ (|Z1| × ⋯ × |Zk|) / max_i |Zi|. The inequality is strict when Z has only two neighbors. A sketch of this check appears below.
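A minimal sketch of the regularity check just described; the function name and example cardinalities are hypothetical.

```python
# A minimal sketch of the regularity condition: a latent variable Z with
# neighbor cardinalities c_1, ..., c_k is regular when
# |Z| <= (prod_i c_i) / max_i c_i, with strict inequality when k = 2.
from math import prod

def is_regular(card_z, neighbor_cards):
    bound = prod(neighbor_cards) / max(neighbor_cards)
    if len(neighbor_cards) == 2:
        return card_z < bound
    return card_z <= bound

print(is_regular(3, [2, 2, 3]))  # True: 3 <= (2*2*3)/3 = 4
print(is_regular(2, [2, 2]))     # False: needs |Z| < (2*2)/2 = 2
```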

28 Regularity Can focus on regular models only
Irregular models can be made regular Regularized models are better than irregular models Theorem: The set of all regular models for a given set of observed variables is finite. (Zhang, JMLR 2004) It has been shown that an irregular model can be reduced to another model that is marginally equivalent and has fewer free parameters. As such, we can focus on regular models only. It has also been shown that, for a given set of observed variables, there are only finitely many regular models.

