Learning Tree Structures

1 Learning Tree Structures

2 Which Pt will be closest to P?
If we measure a distribution P, what is the tree-dependent distribution Pt that best approximates P?
Search space: all possible trees.
Goal: from all possible trees, find the one closest to P.
Distance measurement: Kullback–Leibler divergence.
Operators/Procedure: how to move through the search space.

3 Problem definition
X1, …, Xn are random variables; P is unknown.
Given independent samples x1, …, xs drawn from distribution P, estimate P.
Possible solution – a best tree: P(x) = Π P(xi | xj(i)), where xj(i) is the parent of xi in some tree.
This requires r(r−1) parameters for each of the n−1 link matrices plus r−1 parameters for the root node, (n−1)r(r−1) + (r−1) in total (r = number of values per variable).
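For example, with n = 4 variables of r = 3 values each, the tree needs 3·3·2 + 2 = 20 parameters, compared with 3^4 − 1 = 80 for the unrestricted joint distribution.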

4 Kullback–Leibler Divergence
For probability distributions P and Q of a discrete random variable, the K–L divergence of Q from P is defined to be
D_KL(P‖Q) = Σ_x P(x) log( P(x) / Q(x) ).
Rewriting the definition of Kullback–Leibler divergence yields
D_KL(P‖Q) = H(P, Q) − H(P),
where H(P, Q) = −Σ_x P(x) log Q(x) is called the cross entropy of P and Q, and H(P) is the entropy of P.
It is a nonnegative measure: D_KL(P‖Q) ≥ 0, with equality iff P = Q.
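As a quick illustration, here is a minimal NumPy sketch (the function names are ours, and base-2 logs are used to match the entropy slide below) checking that the two forms agree:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x P(x) log2(P(x)/Q(x)); assumes Q > 0 wherever P > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with P(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

p, q = [0.5, 0.5], [0.9, 0.1]
# D_KL(P||Q) = H(P, Q) - H(P), and H(P) = H(P, P)
assert abs(kl_divergence(p, q) - (cross_entropy(p, q) - cross_entropy(p, p))) < 1e-12
```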

5 Entropy is a measure of uncertainty
Fair coin: H(½, ½) = −½ log2(½) − ½ log2(½) = 1 bit (i.e., we need 1 bit to convey the outcome of a coin flip).
Biased coin: H(1/100, 99/100) = −1/100 log2(1/100) − 99/100 log2(99/100) ≈ 0.08 bits.
As P(heads) → 1, the information in the actual outcome → 0.
H(0, 1) = H(1, 0) = 0 bits, i.e., no uncertainty left in the source (with the convention 0 · log2(0) = 0).
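A tiny sketch (the helper is our own) reproducing these numbers:

```python
import math

def entropy(*probs):
    """Shannon entropy in bits; 0 * log2(0) is treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy(0.5, 0.5))    # 1.0 bit  (fair coin)
print(entropy(0.01, 0.99))  # ~0.081 bits (biased coin)
print(entropy(0.0, 1.0))    # 0.0 bits (no uncertainty)
```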

6 A Two-Phase Optimization Task
Phase 1 – assign probabilities: for a given tree t, which conditional probabilities Pt(x|y) yield the best approximation of P?
Phase 2 – vary the structure: let t range over all possible spanning trees.
Goal: among all trees with their probabilities assigned, find the one closest to P in terms of KL divergence.

7 What probabilities to assign?
Pt(x|y) = P(x|y)
Theorem 1 (Chow–Liu 1968; see Pearl's textbook, Section 8.2.1): Given a fixed tree t, setting the probabilities along the branches of t to coincide with the conditional probabilities computed from P yields the distribution Pt that minimizes the KL divergence from P.

8 How to vary over all trees? How to move in the search space?
Maximum weight spanning tree.
Theorem 2 (Chow–Liu 1968; see Pearl's textbook, Section 8.2.1): The Kullback–Leibler divergence is minimized across all trees by a maximum weight spanning tree, where the weight on an edge (x, y) is the mutual information I(x; y).
1. Create the maximum weight spanning tree t.
2. Project P onto it (the best way you can, per Theorem 1).
3. Done!

9 Mutual information
Mutual information is a measure of dependence between two variables:
I(X; Y) = Σ_{x,y} P(x, y) log( P(x, y) / (P(x) P(y)) ).
Mutual information is nonnegative, I(X; Y) ≥ 0 (zero exactly when X and Y are independent), and symmetric: I(X; Y) = I(Y; X).
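A small sketch (our own helper, in bits) computing it from a 2-D joint probability table:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits, given a 2-D table with joint[x, y] = P(x, y)."""
    joint = np.asarray(joint, float)
    px = joint.sum(axis=1, keepdims=True)  # marginal P(x), column vector
    py = joint.sum(axis=0, keepdims=True)  # marginal P(y), row vector
    mask = joint > 0                       # 0 * log(...) contributes 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

print(mutual_information(np.outer([0.5, 0.5], [0.5, 0.5])))  # 0.0: independent
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))          # 1.0: Y copies X
```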

10 The algorithm
1. Find the maximum weight spanning tree, with edge weights given by the mutual information I(Xi; Xj) estimated from the samples.
2. Compute Pt: select an arbitrary root node and set Pt(x) = Π P(xi | xj(i)), the product of the conditional probabilities along the branches (the root contributes its marginal).
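Putting the two steps together, a minimal end-to-end sketch (the array layout and all names are our own assumptions; it favors clarity over the O(n²(m + log n)) bookkeeping discussed later):

```python
import numpy as np
from itertools import combinations

def chow_liu(samples):
    """Learn a Chow-Liu tree from an (s, n) integer array of samples.
    Returns directed (parent, child) edges, rooted arbitrarily at node 0."""
    s, n = samples.shape

    def mi(i, j):
        # empirical mutual information (bits) between columns i and j
        joint = np.zeros((samples[:, i].max() + 1, samples[:, j].max() + 1))
        for a, b in zip(samples[:, i], samples[:, j]):
            joint[a, b] += 1
        joint /= s
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        mask = joint > 0
        return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

    # Kruskal's algorithm on edges sorted by decreasing weight yields
    # a maximum weight spanning tree.
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = []
    for u, v in sorted(combinations(range(n), 2), key=lambda e: -mi(*e)):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
            if len(tree) == n - 1:
                break

    # Orient the undirected tree away from an arbitrary root (node 0).
    adj = {u: [] for u in range(n)}
    for u, v in tree:
        adj[u].append(v)
        adj[v].append(u)
    directed, seen, stack = [], {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                directed.append((u, v))
                stack.append(v)
    return directed
```

The conditional tables Pt(xi | xj(i)) are then read off the sample counts along each returned edge.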

11 Illustration of CL-Tree Learning
Four variables A, B, C, D. For each pair, the (flattened) pairwise joint distribution and the resulting mutual information weight:

Pair  Joint distribution          Mutual information
AB    (0.56, 0.11, 0.02, 0.31)    0.3126
AC    (0.51, 0.17, 0.17, 0.15)    0.0229
AD    (0.53, 0.15, 0.19, 0.13)    0.0172
BC    (0.44, 0.14, 0.23, 0.19)    0.0230
BD    (0.46, 0.12, 0.26, 0.16)    0.0183
CD    (0.64, 0.04, 0.08, 0.24)    0.2603

The maximum weight spanning tree keeps the three heaviest edges that avoid a cycle: AB (0.3126), CD (0.2603), and BC (0.0230), joining A–B–C–D.

12 Theorem 1: Forcing the probabilities along the branches of the tree t to coincide with those computed from P gives the best t-dependent approximation of P.

13 Theorem 1 (proof): Writing Pt(x) = Π Pt(xi | xj(i)),
D(P, Pt) = Σ_x P(x) log P(x) − Σ_x P(x) log Pt(x),
so minimizing D(P, Pt) means choosing the branch probabilities Pt(xi | xj) to maximize Σ_x P(x) log Pt(x).

14 Theorem 1 (proof, cont.): Gibbs' inequality
For any probability distributions P and P′ over the same domain,
Σ_x P(x) log P′(x) ≤ Σ_x P(x) log P(x),
with equality if and only if P′(x) = P(x) for all x.

15 Theorem 1 (proof, cont.): By Gibbs' inequality, each branch term Σ P(xi, xj) log Pt(xi | xj) is maximized when Pt(xi | xj) = P(xi | xj), so the whole expression is maximal when the branch probabilities coincide with those computed from P. Q.E.D.

16 Theorem 2: The distance measure (Kullback–Leibler) is minimized by assigning the best distribution on a maximum weight spanning tree, where the weight on the branch (x, y) is the mutual information I(x; y).
Proof: From Theorem 1, the best assignment for a fixed tree is Pt(xi | xj(i)) = P(xi | xj(i)). After this assignment, and by Bayes' rule P(xi | xj) = P(xi, xj) / P(xj):
log Pt(x) = Σ_i log [ P(xi, xj(i)) / (P(xi) P(xj(i))) ] + Σ_i log P(xi).

17 Theorem 2 (cont.): Taking expectations of both sides under P,
Σ_x P(x) log Pt(x) = Σ_i I(Xi; Xj(i)) − Σ_i H(Xi),
and therefore
D(P, Pt) = −Σ_i I(Xi; Xj(i)) + Σ_i H(Xi) − H(X1, …, Xn).

18 Theorem 2 (cont.): In
D(P, Pt) = −Σ_i I(Xi; Xj(i)) + Σ_i H(Xi) − H(X1, …, Xn),
the second and third terms are independent of the tree t, and D(P, Pt) is nonnegative (Gibbs' inequality). Thus, minimizing the distance D(P, Pt) is equivalent to maximizing the sum of branch weights Σ_i I(Xi; Xj(i)). Q.E.D.

19 Chow–Liu Results
If the distribution P is tree-structured, Chow–Liu finds a CORRECT tree-structured distribution.
If the distribution P is NOT tree-structured, Chow–Liu finds the tree-structured distribution Q that minimizes the KL divergence: argmin_Q D_KL(P‖Q).
Even though there are 2^Ω(n log n) possible trees, Chow–Liu finds a BEST one in polynomial time, O(n²(m + log n)) for n variables and m samples: O(m) per pair for the pairwise statistics, plus O(n² log n) for the spanning tree.

20 References
Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons, NY.
Chow, C. K. and Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3), 462–467.
See also the summary in Section 8.2.1 of Pearl's textbook (Probabilistic Reasoning in Intelligent Systems).

21 Scores for General DAGs
For a general DAG G the same decomposition applies:
D_KL(P, PG) = −Σ_i I(Xi; Pa_i^G) + Σ_i H(Xi) − H(X1, …, Xn),
where Pa_i^G denotes the parents of Xi in G; minimizing it is equivalent to maximizing the log-likelihood.
Problem – overfitting: the "best solution" can always be chosen as the complete graph. It fits the data best, but generalizes poorly.
Solution: penalized Bayesian scores such as BIC or MDL.
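As an illustration of such a penalized score, a minimal sketch (the helper and its argument names are our own; sign and penalty conventions vary across texts):

```python
import math

def bic_score(log_likelihood, num_free_params, num_samples):
    """BIC/MDL-style structure score: log-likelihood minus a complexity
    penalty that grows with model size and the log of the data size."""
    return log_likelihood - 0.5 * math.log(num_samples) * num_free_params
```

Under this score, the complete graph's extra parameters must buy enough additional likelihood to offset the penalty, which discourages overfitting.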

