# Convex Point Estimation using Undirected Bayesian Transfer Hierarchies Gal Elidan Ben Packer Geremy Heitz Daphne Koller Stanford AI Lab.

## Presentation on theme: "Convex Point Estimation using Undirected Bayesian Transfer Hierarchies Gal Elidan Ben Packer Geremy Heitz Daphne Koller Stanford AI Lab."— Presentation transcript:

Convex Point Estimation using Undirected Bayesian Transfer Hierarchies Gal Elidan Ben Packer Geremy Heitz Daphne Koller Stanford AI Lab

Motivation Problem: With few instances, learned models aren’t robust MEAN Principal Components Task: Shape modeling std +1 std -1 std +1 std -1 MEAN

Transfer Learning MEAN Principal Components Shape is stabilized, but doesn’t look like an elephant Can we use rhinos to help elephants? std +1 std -1 std +1 std -1 MEAN

Hierarchical Bayes  root  Elephant  Rhino P(  Elephant |  root ) P(  Rhino |  root ) P(Data Elephant |  Elephant ) P(Data Rhino |  Rhino ) P(  root )

Goals Transfer between related classes Range of settings, tasks Probabilistic motivation Multilevel, complex hierarchies Simple, efficient computation Automatically learn what to transfer

Hierarchical Bayes  root  Elephant  Rhino Compute full posterior P( £ | D ) P( £ c | £ root ) must be conjugate to P( D | £ c ) Problem: Often can’t perform full Bayesian computations

Approx.: Point estimation  root  Elephant  Rhino Empirical Bayes Point estimation Other approximations: Posterior as normal, sampling, etc. Best parameters are good enough; don’t need full distribution

More Issues: Multiple Levels Conjugate priors usually can’t be extended to multiple levels (e.g., Dirichlet, inverse-Wishart) Exception: Thibeaux and Jordan (‘05)

More Issues: Restrictive Priors Example: inverse-Wishart Pseudocount restriction  >= d If d is large, N is small, signal from prior overwhelms data We show experiments with N=3, d=20  A  A  B  B Normal-Inverse- Wishart parameters Gaussian parameters N = # samples, d = dimension Pseudocounts

Alternative: Shrinkage McCallum et al. (‘98) 1. Compute maximum likelihood at each node 2. “Shrink” each node toward its parent Linear combination of  and  parent Uses cross-validation of      parent Pros: Simple to compute Handles multiple levels Cons: Naive heuristic for transfer Averaging not always appropriate  parent

Undirected HB Reformulation Defines an undirected Markov random field model over  D Probabilistic Abstraction Hierarchies (Segal et al. ‘01)

F data : Encourage parameters to explain data Undirected Probabilistic Model  root F data Divergence  : high  Elephant  Rhino  : low Divergence: Encourage parameters to be similar to parents Divergence

Purpose of Reformulation Easy to specify F data can be likelihood, classification, or other objective Divergence can be L1-distance, L2-distance,  -insensitive loss, KL divergence, etc. No conjugacy or proper prior restrictions Easy to optimize Convex over  if F data is convex and Divergence is concave

Task: Categorize Documents Bag-of-words model F data : Multinomial log likelihood (regularized) µ i represents frequency of word i Divergence: L2 norm Application: Text categorization Newsgroup20 Dataset

Baselines 1. Maximum likelihood at each node (no hierarchy) 2. Cross-validate regularization (no hierarchy) 3. Shrinkage (McCallum et al. ’98, with hierarchy) AA BB  parent     parent

Can It Handle Multiple Levels? 75150225300375 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 Classification Rate Total Number of Training Instances Newsgroup Topic Classification Max Likelihood (No regularization) Regularized Max Likelihood Shrinkage Undirected HB

Task: Learn shape (Density estimation – test likelihood) Instances represented by 60 x-y coordinates of landmarks on outline Divergence: L2 norm over mean and variance Application: Shape Modeling Mean landmark location Covariance over landmarks Regularization MEAN Principal Components Mammals Dataset (Fink, ’05)

Does Hierarchy Help? 6102030 -350 -300 -250 -200 -150 -100 -50 0 50 Total Number of Training Instances Delta log-loss / instance Mammal Pairs Regularized Max Likelihood Bison-Rhino Elephant-Bison Elephant-Rhino Giraffe-Bison Giraffe-Elephant Giraffe-Rhino Llama-Bison Llama-Elephant Llama-Giraffe Llama-Rhino Unregularized max likelihood, shrinkage: Much worse, not shown

Transfer Not all parameters deserve equal sharing

Split  into subcomponents µ i with weights  Allows for different strengths for different subcomponents, child-parent pairs Degrees of Transfer How do we estimate all these parameters?

Learning Degrees of Transfer Bootstrap approach Hyper-prior approach Bootstrap approach If and have a consistent relationship, want to encourage them to be similar Define Want to estimate variance of  across all possible datasets Select random subsets of data Let ¸ be empirical variance of  If and have a consistent relationship (low variance), ¸ strongly encourages similarity

Degrees of Transfer Bootstrap approach If we use an L2 norm for our Div terms: Resembles a product of Gaussian priors If we were to fix the value of empirical Bayes “Undirected empirical Bayes estimation” L2 norm Degree of Transfer } Gaussian Prior

Learning Degrees of Transfer Bootstrap approach If and have a consistent relationship, want to encourage them to be similar Hyper-prior approach Bayesian idea: Put prior on ¸ Add ¸ as parameter to optimization along with £ Concretely: inverse-Gamma prior (forced to be positive) Prior on Degree of Transfer If likelihood is concave, entire objective is convex!

Do Degrees of Transfer Help? 6102030 -15 -10 -5 0 5 10 15 Total Number of Training Instances Delta log-loss / instance Mammal Pairs Bison-Rhino Elephant-Bison Elephant-Rhino Giraffe-Bison Giraffe-Elephant Giraffe-Rhino Llama-Bison Llama-Elephant Llama-Giraffe Llama-Rhino Regularized Max Likelihood Bison-Rhino Hyperprior

Do Degrees of Transfer Help? Const: No Degrees of Transfer Hyperprior, Bootstrap: Use Degrees of Transfer 6102030 -15 -10 -5 0 5 10 Delta log-loss / instance Total Number of Training Instances Bison-Rhino RegML Const Bootstrap Hyperprior  root

Do Degrees of Transfer Help? 6102030 0 1 2 3 4 5 6 7 8 Const: >100 bits worse Delta log-loss / instance Total Number of Training Instances RegML Bootstrap Hyperprior Llama-Rhino  root

Degrees of Transfer 1/ Stronger transfer Weaker transfer  root 05101520253035404550 0 2 4 6 8 10 12 14 16 18 20 Distribution of DOT coefficients using Hyperprior

Summary Transfer between related classes Range of settings, tasks Probabilistic motivation Multilevel, complex hierarchies Simple, efficient computation Refined transfer of components

Future Work Non-tree hierarchies (multiple inheritance) Block degrees of transfer Structure learning General undirected model doesn’t require tree structure Part discovery Gene Ontology (GO) network WordNet Hierarchy

Download ppt "Convex Point Estimation using Undirected Bayesian Transfer Hierarchies Gal Elidan Ben Packer Geremy Heitz Daphne Koller Stanford AI Lab."

Similar presentations