Download presentation

Presentation is loading. Please wait.

Published byAlexander Strong Modified over 3 years ago

1
Convex Point Estimation using Undirected Bayesian Transfer Hierarchies Gal Elidan Ben Packer Geremy Heitz Daphne Koller Stanford AI Lab

2
Motivation Problem: With few instances, learned models aren’t robust MEAN Principal Components Task: Shape modeling std +1 std -1 std +1 std -1 MEAN

3
Transfer Learning MEAN Principal Components Shape is stabilized, but doesn’t look like an elephant Can we use rhinos to help elephants? std +1 std -1 std +1 std -1 MEAN

4
Hierarchical Bayes root Elephant Rhino P( Elephant | root ) P( Rhino | root ) P(Data Elephant | Elephant ) P(Data Rhino | Rhino ) P( root )

5
Goals Transfer between related classes Range of settings, tasks Probabilistic motivation Multilevel, complex hierarchies Simple, efficient computation Automatically learn what to transfer

6
Hierarchical Bayes root Elephant Rhino Compute full posterior P( £ | D ) P( £ c | £ root ) must be conjugate to P( D | £ c ) Problem: Often can’t perform full Bayesian computations

7
Approx.: Point estimation root Elephant Rhino Empirical Bayes Point estimation Other approximations: Posterior as normal, sampling, etc. Best parameters are good enough; don’t need full distribution

8
More Issues: Multiple Levels Conjugate priors usually can’t be extended to multiple levels (e.g., Dirichlet, inverse-Wishart) Exception: Thibeaux and Jordan (‘05)

9
More Issues: Restrictive Priors Example: inverse-Wishart Pseudocount restriction >= d If d is large, N is small, signal from prior overwhelms data We show experiments with N=3, d=20 A A B B Normal-Inverse- Wishart parameters Gaussian parameters N = # samples, d = dimension Pseudocounts

10
Alternative: Shrinkage McCallum et al. (‘98) 1. Compute maximum likelihood at each node 2. “Shrink” each node toward its parent Linear combination of and parent Uses cross-validation of parent Pros: Simple to compute Handles multiple levels Cons: Naive heuristic for transfer Averaging not always appropriate parent

11
Undirected HB Reformulation Defines an undirected Markov random field model over D Probabilistic Abstraction Hierarchies (Segal et al. ‘01)

12
F data : Encourage parameters to explain data Undirected Probabilistic Model root F data Divergence : high Elephant Rhino : low Divergence: Encourage parameters to be similar to parents Divergence

13
Purpose of Reformulation Easy to specify F data can be likelihood, classification, or other objective Divergence can be L1-distance, L2-distance, -insensitive loss, KL divergence, etc. No conjugacy or proper prior restrictions Easy to optimize Convex over if F data is convex and Divergence is concave

14
Task: Categorize Documents Bag-of-words model F data : Multinomial log likelihood (regularized) µ i represents frequency of word i Divergence: L2 norm Application: Text categorization Newsgroup20 Dataset

15
Baselines 1. Maximum likelihood at each node (no hierarchy) 2. Cross-validate regularization (no hierarchy) 3. Shrinkage (McCallum et al. ’98, with hierarchy) AA BB parent parent

16
Can It Handle Multiple Levels? 75150225300375 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 Classification Rate Total Number of Training Instances Newsgroup Topic Classification Max Likelihood (No regularization) Regularized Max Likelihood Shrinkage Undirected HB

17
Task: Learn shape (Density estimation – test likelihood) Instances represented by 60 x-y coordinates of landmarks on outline Divergence: L2 norm over mean and variance Application: Shape Modeling Mean landmark location Covariance over landmarks Regularization MEAN Principal Components Mammals Dataset (Fink, ’05)

18
Does Hierarchy Help? 6102030 -350 -300 -250 -200 -150 -100 -50 0 50 Total Number of Training Instances Delta log-loss / instance Mammal Pairs Regularized Max Likelihood Bison-Rhino Elephant-Bison Elephant-Rhino Giraffe-Bison Giraffe-Elephant Giraffe-Rhino Llama-Bison Llama-Elephant Llama-Giraffe Llama-Rhino Unregularized max likelihood, shrinkage: Much worse, not shown

19
Transfer Not all parameters deserve equal sharing

20
Split into subcomponents µ i with weights Allows for different strengths for different subcomponents, child-parent pairs Degrees of Transfer How do we estimate all these parameters?

21
Learning Degrees of Transfer Bootstrap approach Hyper-prior approach Bootstrap approach If and have a consistent relationship, want to encourage them to be similar Define Want to estimate variance of across all possible datasets Select random subsets of data Let ¸ be empirical variance of If and have a consistent relationship (low variance), ¸ strongly encourages similarity

22
Degrees of Transfer Bootstrap approach If we use an L2 norm for our Div terms: Resembles a product of Gaussian priors If we were to fix the value of empirical Bayes “Undirected empirical Bayes estimation” L2 norm Degree of Transfer } Gaussian Prior

23
Learning Degrees of Transfer Bootstrap approach If and have a consistent relationship, want to encourage them to be similar Hyper-prior approach Bayesian idea: Put prior on ¸ Add ¸ as parameter to optimization along with £ Concretely: inverse-Gamma prior (forced to be positive) Prior on Degree of Transfer If likelihood is concave, entire objective is convex!

24
Do Degrees of Transfer Help? 6102030 -15 -10 -5 0 5 10 15 Total Number of Training Instances Delta log-loss / instance Mammal Pairs Bison-Rhino Elephant-Bison Elephant-Rhino Giraffe-Bison Giraffe-Elephant Giraffe-Rhino Llama-Bison Llama-Elephant Llama-Giraffe Llama-Rhino Regularized Max Likelihood Bison-Rhino Hyperprior

25
Do Degrees of Transfer Help? Const: No Degrees of Transfer Hyperprior, Bootstrap: Use Degrees of Transfer 6102030 -15 -10 -5 0 5 10 Delta log-loss / instance Total Number of Training Instances Bison-Rhino RegML Const Bootstrap Hyperprior root

26
Do Degrees of Transfer Help? 6102030 0 1 2 3 4 5 6 7 8 Const: >100 bits worse Delta log-loss / instance Total Number of Training Instances RegML Bootstrap Hyperprior Llama-Rhino root

27
Degrees of Transfer 1/ Stronger transfer Weaker transfer root 05101520253035404550 0 2 4 6 8 10 12 14 16 18 20 Distribution of DOT coefficients using Hyperprior

28
Summary Transfer between related classes Range of settings, tasks Probabilistic motivation Multilevel, complex hierarchies Simple, efficient computation Refined transfer of components

29
Future Work Non-tree hierarchies (multiple inheritance) Block degrees of transfer Structure learning General undirected model doesn’t require tree structure Part discovery Gene Ontology (GO) network WordNet Hierarchy

Similar presentations

OK

Predictive Automatic Relevance Determination by Expectation Propagation Y. Qi T.P. Minka R.W. Picard Z. Ghahramani.

Predictive Automatic Relevance Determination by Expectation Propagation Y. Qi T.P. Minka R.W. Picard Z. Ghahramani.

© 2018 SlidePlayer.com Inc.

All rights reserved.

To ensure the functioning of the site, we use **cookies**. We share information about your activities on the site with our partners and Google partners: social networks and companies engaged in advertising and web analytics. For more information, see the Privacy Policy and Google Privacy & Terms.
Your consent to our cookies if you continue to use this website.

Ads by Google

Ppt on non biodegradable waste example Ppt on polymerase chain reaction Ppt on fauna of italy Ppt on condition monitoring Ppt on pre-ignition procedure Ppt on email etiquettes presentation pro Ppt on carl friedrich gauss contributions Download ppt on algebraic expressions and identities for class 8 Ppt on central limit theorem statistics Dsp ppt on dft testing