Published by Ramiro Deere. Modified over 2 years ago.

1
Weight Annealing: Data Perturbation for Escaping Local Maxima in Learning
Gal Elidan, Matan Ninio, Nir Friedman (Hebrew University)
Dale Schuurmans (University of Waterloo)

2
The Learning Problem
Example settings: density estimation, classification, logistic regression.
Learning task: search for the hypothesis that maximizes the score on the weighted data.
[Diagram: DATA plus weights, mapped to a Hypothesis.]
Optimization is hard! We typically resort to local optimization methods: gradient ascent, greedy hill-climbing, EM.

3
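Escaping local maxima
Local methods converge to one of many local optima.
Common remedies: TABU search, random restarts, simulated annealing.
These methods work by perturbing the steps taken during the local search.
[Figure: Score as a function of the hypothesis h, with the search marked "stuck here" at a local maximum.]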

4
Weight Perturbation
Our idea: perturbation of instance weights.
Do until convergence: perturb the instance weights, then use the optimizer as a black box.
To maximize the original goal, diminish the magnitude of the perturbation.
Benefits:
Generality: applies to a wide variety of learning scenarios.
Modularity: the search is unchanged.
Effectiveness: allows global changes.
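The loop on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `weight_annealing`, `optimize`, `perturb`, and the schedule `temps` are hypothetical, uniform original weights are assumed, and the toy optimizer is only there so the sketch runs.

```python
import random


def weight_annealing(data, optimize, perturb, init_h, temps):
    """Perturb weights, re-run a black-box optimizer, and cool down.

    optimize(data, weights, h) -> improved hypothesis (treated as a black box)
    perturb(original_weights, temp) -> perturbed weights
    temps: decreasing temperatures (a hypothetical schedule)
    """
    original = [1.0] * len(data)  # assumption: uniform original weights
    h = init_h
    for temp in temps:
        weights = perturb(original, temp)  # larger temp, larger perturbation
        h = optimize(data, weights, h)     # the search itself is unchanged
    # Final pass on the unperturbed weights, i.e. the original goal.
    return optimize(data, original, h)


# Toy stand-ins so the sketch runs: the "hypothesis" is a weighted mean.
def toy_optimize(data, weights, h):
    return sum(w * x for w, x in zip(weights, data)) / sum(weights)


def toy_perturb(original, temp, rng=random.Random(0)):
    return [w * (1.0 + temp * rng.random()) for w in original]


result = weight_annealing([1.0, 2.0, 3.0, 4.0], toy_optimize, toy_perturb,
                          init_h=0.0, temps=[1.0, 0.5, 0.1])
```

The optimizer is passed in as a plain function, which mirrors the modularity point: the reweighting wrapper needs no knowledge of how the search works.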

5
Weight Perturbation
[Diagram: DATA with weights W feeds LOCAL SEARCH, which produces a hypothesis h and its Score; REWEIGHT then updates W.]

6
Weight Perturbation
Our idea: perturbation of instance weights.
Perturbing the weights puts stronger emphasis on a subset of the instances.
This allows the learning procedure to escape local maxima.
[Diagram: DATA with original weights W and with perturbed weights W.]
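One way to see why reweighting changes the search landscape is to look at a weighted score directly. The sketch below assumes the score has the form sum_i w_i log P(x_i | h) and uses a one-parameter Bernoulli model purely for illustration; neither the model nor the function names come from the slides.

```python
import math


def weighted_log_likelihood(p, data, weights):
    """Score(h, D, w) = sum_i w_i * log P(x_i | p) for a Bernoulli model."""
    return sum(w * math.log(p if x == 1 else 1.0 - p)
               for x, w in zip(data, weights))


def weighted_mle(data, weights):
    """Maximizer of the weighted score: the weighted fraction of ones."""
    return sum(w * x for x, w in zip(data, weights)) / sum(weights)


data = [1, 1, 1, 0]
uniform = [1.0, 1.0, 1.0, 1.0]
perturbed = [1.0, 1.0, 1.0, 3.0]  # emphasize the last instance

p_uniform = weighted_mle(data, uniform)      # 0.75
p_perturbed = weighted_mle(data, perturbed)  # 0.5
```

Emphasizing a single instance moves the optimum from 0.75 to 0.5, so a hypothesis that was a maximum under the original weights need not be one under the perturbed weights.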

7
Iterative Procedure
[Diagram: DATA with weights W feeds LOCAL SEARCH, which produces a hypothesis h and its Score; REWEIGHT then updates W.]
Benefits:
Generality: applies to a wide variety of learning scenarios.
Modularity: the search is unchanged.
Effectiveness: allows global changes.

8
Iterative Procedure
Two methods for reweighting:
Random: sampling random weights.
Adversarial: directed reweighting.
To maximize the original goal, slowly diminish the magnitude of the perturbations.

9
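Random Reweighting
Weights are sampled from a distribution P(W) whose mean is the original weight W* and whose variance is tied to the temperature.
When hot, the model can go almost anywhere and local maxima are bypassed.
When cold, the search fine-tunes to find the optimum with respect to the original data.
[Figure: P(W) plotted against distance from the original W, with successive samples W_t, W_{t+1}, W_{t+2} around W*.]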
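A sampler with these two properties (mean equal to the original weight, spread controlled by temperature) might look like the sketch below. The choice of a Gamma distribution is an assumption made here for illustration; the slide does not say which distribution the authors use, and `random_reweight` is a hypothetical name.

```python
import random


def random_reweight(original, temp, rng=random):
    """Sample weights with mean equal to the original weight.

    Assumption: Gamma(shape=w/temp, scale=temp), which has
    mean w and variance w * temp, so hotter means noisier.
    """
    if temp <= 0:
        return list(original)  # fully cold: return the original weights
    return [rng.gammavariate(w / temp, temp) for w in original]


rng = random.Random(0)
hot = random_reweight([1.0] * 5, temp=2.0, rng=rng)    # widely scattered
cold = random_reweight([1.0] * 5, temp=0.01, rng=rng)  # close to 1.0
```

The Gamma family keeps every sampled weight positive, which avoids having to clip negative weights as a Gaussian perturbation would require.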

10
Adversarial Reweighting
Idea: challenge the model by increasing the weight w of bad (low-scoring) instances.
Challenge the model by emphasizing bad samples (minimize the score using W).
Converge towards the original distribution by constraining the distance from W*.
This is a min-max game between the re-weighting and the optimizer (Kivinen & Warmuth).
[Figure: weight vectors W_t and W_{t+1} shown relative to the original W*.]
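A rough sketch of one adversarial step is given below. It is an exponentiated update in the spirit of the Kivinen & Warmuth reference, but the exact form is an assumption: the names `adversarial_reweight`, `eta`, and `pull`, the geometric mix with the original weights, and the renormalization are all illustrative choices, not the authors' update rule.

```python
import math


def adversarial_reweight(weights, original, scores, eta, pull):
    """One exponentiated update that up-weights low-scoring instances.

    scores[i]: per-instance score under the current hypothesis (higher is better)
    eta: learning rate (hypothetical name)
    pull: in [0, 1], how strongly weights are drawn back toward the original
    """
    raw = []
    for w, w0, s in zip(weights, original, scores):
        anchor = (w0 ** pull) * (w ** (1.0 - pull))  # geometric mix with original
        raw.append(anchor * math.exp(-eta * s))      # low score -> larger weight
    scale = sum(original) / sum(raw)                 # keep total weight fixed
    return [r * scale for r in raw]


original = [1.0, 1.0, 1.0]
scores = [-1.0, -5.0, -2.0]  # instance 1 is fit worst
new_w = adversarial_reweight(original, original, scores, eta=0.5, pull=0.5)
```

Shrinking `eta` toward zero while keeping `pull` positive draws the weights back to the original distribution, which is one way to realize the "converge towards W*" requirement.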

11
Learning Bayesian Networks
A Bayesian network (BN) is a compact representation of a joint distribution.
Learning a BN is a density estimation problem.
Learning task: find the structure and parameters that maximize the score on the weighted DATA.
[Figure: the Alarm network, with node labels such as HR, BP, CVP, SAO2, HYPOVOLEMIA, and LVFAILURE.]

12
Structure Search Results
Super-exponential combinatorial search space.
Search uses local operations: add, remove, or reverse an edge.
Optimize the Bayesian Dirichlet score (BDe).
Alarm network: 37 variables, 1000 samples.
[Figure: log-loss per instance on test data versus iterations (HOT to COLD) for BASELINE, Random annealing, and Adversary, with the TRUE STRUCTURE as reference.]
With similar running time: Random is superior to random restarts, and a single Adversary run competes with Random.
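The local operators named on this slide (add, remove, or reverse an edge) can be sketched as a neighbor generator over directed acyclic graphs. This is an illustrative sketch only: the function names are hypothetical, edges are `(parent, child)` pairs, and scoring with BDe is left out.

```python
def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return visited == len(nodes)


def neighbors(nodes, edges):
    """All structures one local operation away from the current DAG."""
    edges = frozenset(edges)
    result = []
    for a in nodes:
        for b in nodes:
            if a == b:
                continue
            if (a, b) in edges:
                removed = edges - {(a, b)}
                result.append(removed)                # remove edge a -> b
                reversed_ = removed | {(b, a)}
                if is_acyclic(nodes, reversed_):
                    result.append(reversed_)          # reverse edge a -> b
            elif (b, a) not in edges:
                added = edges | {(a, b)}
                if is_acyclic(nodes, added):
                    result.append(added)              # add edge a -> b
    return result


nodes = ["A", "B", "C"]
candidates = neighbors(nodes, {("A", "B"), ("B", "C")})
```

A greedy hill-climber would score each candidate and move to the best one; weight annealing leaves that step untouched and only changes the instance weights the score is computed with.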

13
Search with missing values
Missing values introduce many local maxima.
EM combines search and parameter estimation (SEM).
Alarm network: 37 variables, 1000 samples.
[Figure: percent of runs at least this good versus log-loss per instance on test data, for RANDOM and ADVERSARY, with BASELINE and GENERATING MODEL marked.]
With similar running time: over 90% of Random runs are better than normal SEM (the baseline), and the Adversary run is best.
The distance to the true generating model is halved!

14
Real-life datasets
6 real-life examples with and without missing values: Stock, Soybean, Rosetta, Audio, Soy-M, Promoter.
[Figure: log-loss per instance on test data for each dataset, comparing BASELINE, Adversary, and 20-80% Random. The per-dataset Variables and Samples counts are not preserved.]
With similar running time: Adversary is efficient and preferable; Random takes longer for inferior results.

15
Learning Sequence Motifs
DNA promoter sequences:
ATCTAGCTGAGAATGCACACTGATCGAGCCC
CACCATATTCTTCGGACTGCGCTATATAGAC
TGCAACTAGTAGAGCTCTGCTAGAAACATTA
CTAAGCTCTATGACTGCCGATTGCGCCGTTT
GGGCGTCTGAGCTCTTTGCTCTTGACTTCCG
CTTATTGATATTATCTCTCTTGCTCGTGACT
GCTTTATTGTGGGGGGGACTGCTGATTATGC
TGCTCATAGGAGAGACTGCGAGAGTCGTCGT
AGGACTGCGTCGTCGTGATGATGCTGCTGAT
CGATCGGACTGCCTAGCTAGTAGATCGATGT
GACTGCAGAAGAGAGAGGGTTTTTTCGCGCC
GCCCCGCGCGACTGCTCGAGAGGAAGTATAT
ATGACTGCGCGCGCCGCGCGCCGGACTGCTT
TATCCAGCTGATGCATGCATGCTAGTAGACT
GCCTAGTCAGCTGCGATCGACTCGTAGCATG
CATCGACTGCAGTCGATCGATGCTAGTTATT
GGATGCGACTGAACTCGTAGCTGTAGTTATT
Represent the motif using a Position Specific Scoring Matrix over the letters A, C, G, T.
The score is highly non-linear: optimization is hard!
(Segal et al., RECOMB 2002)

16
Real-life Motifs Results
Construct a PSSM: find the PSSM parameters that maximize the score.
Experiments on 9 transcription factors (motifs): ACE2, FKH1, FKH2, MBP1, MCM1, NDD1, SWI4, SWI5, SWI.
PSSM: 4 letters x 20 positions, 550 samples.
[Figure: log-loss on test data per motif, comparing BASELINE, Adversary, and 20-80% Random.]
With similar running time: both methods are better than standard ascent; Adversary is efficient and best 6/9 times.
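For readers unfamiliar with PSSMs, the sketch below shows the data structure in its simplest form: one letter distribution per position, estimated from weighted, already-aligned sites. This is a deliberate simplification and an assumption of this example; the model on the slides is optimized over unaligned promoter sequences with a non-linear score, and the function names and toy sites here are hypothetical.

```python
import math

ALPHABET = "ACGT"


def estimate_pssm(sites, weights, pseudocount=1.0):
    """Per-position letter distribution from equal-length sites.

    weights[i] is the instance weight of sites[i]; reweighting the
    instances changes the estimated matrix.
    """
    length = len(sites[0])
    pssm = []
    for pos in range(length):
        counts = {a: pseudocount for a in ALPHABET}
        for site, w in zip(sites, weights):
            counts[site[pos]] += w
        total = sum(counts.values())
        pssm.append({a: c / total for a, c in counts.items()})
    return pssm


def log_score(pssm, window):
    """Log-probability of a window under the position-wise model."""
    return sum(math.log(column[ch]) for column, ch in zip(pssm, window))


sites = ["ACGT", "ACGT", "ACGA"]  # toy aligned sites
pssm = estimate_pssm(sites, [1.0, 1.0, 1.0])
```

With 4 letters and 20 positions, the matrix on the slide would have 20 such columns; the toy example uses 4 positions to stay readable.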

17
Simulated annealing
Simulated annealing: allow bad moves with some probability P(move), determined by a function f(temp, ...) of the temperature.
Wasteful propose, evaluate, reject cycle.
Needs a long time to escape local maxima.
WORSE than the baseline on Bayesian networks!
[Figure: Score as a function of the hypothesis h.]
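The acceptance rule for bad moves is commonly the Metropolis criterion, sketched below. The slide's own f is not preserved, so using exp(delta / temp) here is an assumption based on the standard formulation, and `accept_move` is a hypothetical name.

```python
import math
import random


def accept_move(delta_score, temp, rng=random):
    """Metropolis criterion for a maximization problem.

    delta_score: score(new) - score(current)
    Improving moves are always accepted; worsening moves are accepted
    with probability exp(delta_score / temp).
    """
    if delta_score >= 0:
        return True
    if temp <= 0:
        return False  # fully cold: behave like plain hill-climbing
    return rng.random() < math.exp(delta_score / temp)


rng = random.Random(0)
hot_accepts = sum(accept_move(-1.0, 10.0, rng) for _ in range(1000))
cold_accepts = sum(accept_move(-1.0, 0.1, rng) for _ in range(1000))
```

Every rejected proposal still costs a score evaluation, which is the "propose, evaluate, reject" waste the slide refers to.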

18
Summary and Future Work
A general method applicable to a variety of learning scenarios: decision trees, clustering, phylogenetic trees, TSP…
Promising empirical results: they approach the achievable maximum.
The BIG challenge: THEORETICAL INSIGHTS.

19
Adversary vs. Boosting
Output: Adversary produces a single hypothesis; Boosting produces an ensemble.
Weights: Adversary converges to the original distribution; Boosting diverges from the original distribution.
Learning: in Adversary, h_{t+1} depends on h_t; in Boosting, h_{t+1} depends only on w_{t+1}.
The same comparison is true of Random vs. Bagging/Bootstrap.

20
Other annealing methods
Simulated annealing: allow bad moves with some probability P(move), determined by a function f(temp, ...) of the temperature. Not good on Bayesian networks!
Deterministic annealing: change the scenery by changing the family of h, from simple hypotheses to complex hypotheses. Is not naturally applicable!
[Figures: Score as a function of the hypothesis h for each method.]

21
Intuition to Adversary
What happens before and after re-weighting?
HOT: start here. Escaping a local max. is easy!
COLD: start here, finish here. Escaping the global max. is hard!
[Figures: two Score landscapes, HOT and COLD, with "start here", "stuck here", and "finish here" markers.]

22
Escaping local maxima
Local methods converge to one of many local optima.
Existing remedies: TABU search, random restarts, simulated annealing.
Standard annealing: allow bad moves, with P(move) determined by a function f(temp, ...) of the temperature.
Our idea: anneal with perturbation of instance weights instead of search perturbation.
Weight annealing: a change in scenery, controlled by temp.
[Figures: Score as a function of W for each approach, with a "start here" marker.]

23
Adversarial Update Equation
[Equation not preserved in this transcript; it includes a learning-rate parameter.]
Use an exponential update in the right direction (Kivinen & Warmuth, 1997).
The result is an IMPLICIT EQUATION.
Good example: low weight. Bad example: high weight.
