
1 Weight Annealing: Data Perturbation for Escaping Local Maxima in Learning
Gal Elidan, Matan Ninio, Nir Friedman (Hebrew University); Dale Schuurmans (University of Waterloo)

2 The Learning Problem
Given DATA with instance weights, the learning task is to search for a hypothesis that maximizes a score. Optimization is hard! We typically resort to local optimization methods: gradient ascent, greedy hill-climbing, EM. Example tasks: density estimation, classification, logistic regression.
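As a concrete illustration (the slide's formulas did not survive the transcript, so this is a standard weighted form rather than a verbatim reconstruction), the density-estimation variant of the objective is a weighted log-likelihood:

$$\mathrm{Score}(h : \mathcal{D}, w) \;=\; \sum_{i} w_i \,\log P(x_i \mid h)$$

Classification and logistic regression are analogous, with each instance's loss term multiplied by its weight $w_i$; uniform weights recover the usual unweighted objective.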

3 Escaping Local Maxima
Local methods converge to one of many local optima (illustrated by a score-vs-hypothesis curve with the search stuck at a local peak). Standard escapes: TABU search, random restarts, simulated annealing. All of these work by perturbing the steps taken during the local search.

4 Weight Perturbation
Our idea: perturb the instance weights.
Do until convergence:
- Perturb the instance weights
- Use the optimizer as a black box on the reweighted data
To maximize the original goal, diminish the magnitude of the perturbation over time.
Benefits:
- Generality: applies to a wide variety of learning scenarios
- Modularity: the search procedure is unchanged
- Effectiveness: allows global changes
(A minimal code sketch of this loop appears below.)
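A minimal sketch of the weight-annealing loop, assuming a black-box `optimize(data, weights, init)` local optimizer, a `perturb(weights, temperature)` routine, and a `score` function; all of these names are hypothetical placeholders, not identifiers from the talk:

```python
import numpy as np

def weight_annealing(data, optimize, perturb, score,
                     t_init=1.0, cooling=0.9, n_iters=40):
    """Generic weight-annealing loop (sketch).

    optimize(data, weights, init) -> hypothesis: black-box local optimizer.
    perturb(weights, temperature) -> perturbed weights (random or adversarial).
    score(hypothesis, data, weights) -> weighted score, used to track progress.
    """
    n = len(data)
    w_orig = np.ones(n) / n                 # original (uniform) instance weights
    h = optimize(data, w_orig, init=None)
    best_h, best_score = h, score(h, data, w_orig)

    temp = t_init
    for _ in range(n_iters):
        w = perturb(w_orig, temp)           # emphasize a subset of the instances
        h = optimize(data, w, init=h)       # the search itself is unchanged
        s = score(h, data, w_orig)          # always evaluate on the original weights
        if s > best_score:
            best_h, best_score = h, s
        temp *= cooling                     # diminish the perturbation magnitude
    return best_h
```

The optimizer is treated purely as a black box, which is what gives the method its modularity: any existing local search can be dropped in unchanged.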

5 Weight Perturbation
[Diagram: weighted DATA → LOCAL SEARCH → Hypothesis → REWEIGHT → back to the data; the score landscape over hypotheses h changes with the weights W.]

6 Weight Perturbation
Our idea: perturbing the instance weights puts stronger emphasis on a subset of the instances, which allows the learning procedure to escape local maxima.

7 Iterative Procedure
[Diagram: weighted DATA → LOCAL SEARCH → Hypothesis → REWEIGHT → repeat.]
Benefits:
- Generality: a wide variety of learning scenarios
- Modularity: the search procedure is unchanged
- Effectiveness: allows global changes

8 Iterative Procedure
Two methods for reweighting:
- Random: sampling random weights
- Adversarial: directed reweighting
To maximize the original goal, slowly diminish the magnitude of the perturbations.

9 Random Reweighting
Sample weights from a distribution P(W) whose mean is the original weight and whose variance is proportional to the temperature. When hot, the model can "go" almost anywhere and local maxima are bypassed; when cold, the search fine-tunes to find the optimum with respect to the original data. [Figure: successive weight vectors W_t, W_t+1, W_t+2 shrink in distance from the original weights W* as the temperature drops from hot to cold.]
(A small sampler sketch appears below.)
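One plausible implementation of random reweighting; the transcript does not specify the exact noise distribution, so the Gaussian choice here is an assumption made only for illustration:

```python
import numpy as np

def random_reweight(w_orig, temperature, rng=None):
    """Sample perturbed instance weights centered on the original weights.

    The spread of the noise scales with the temperature, so hot iterations
    explore far from the original weights and cold iterations stay close.
    """
    rng = rng or np.random.default_rng()
    noise = rng.normal(loc=0.0, scale=temperature, size=w_orig.shape)  # assumed Gaussian noise
    w = np.clip(w_orig * (1.0 + noise), a_min=1e-12, a_max=None)       # keep weights positive
    return w * (w_orig.sum() / w.sum())                                # preserve the total weight
```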

10 Adversarial Reweighting
Idea: challenge the model by increasing the weight of "bad" (low-scoring) instances, i.e., choose W to minimize the score while the optimizer maximizes it: a min-max game between the re-weighting and the optimizer. Converge towards the original distribution by constraining the distance of W_t from the original weights W* (an exponential update in the style of Kivinen & Warmuth) as the temperature drops from hot to cold.

11 Learning Bayesian Networks
A Bayesian network (BN) is a compact representation of a joint distribution; learning a BN from weighted DATA is a density estimation problem. [Figure: the Alarm network, with nodes such as HR, BP, SAO2, CATECHOL, VENTLUNG, INTUBATION, and others.] Learning task: find the structure and parameters that maximize the score.
(A sketch of weighted parameter estimation appears below.)
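With instance weights, the sufficient statistics of a discrete BN simply become weighted counts. A minimal sketch for estimating one node's conditional probability table with Dirichlet (BDe-style) pseudo-counts; the function and argument names are hypothetical, not from the talk:

```python
import numpy as np

def weighted_cpt(data, weights, child, parents, card, alpha=1.0):
    """Estimate P(child | parents) from weighted data (sketch).

    data:    integer array, one row per instance, one column per variable.
    weights: per-instance weights (the annealed weights w_i).
    card:    dict mapping variable index -> number of discrete states.
    alpha:   Dirichlet pseudo-count, as used in BDe-style scoring.
    """
    n_parent_cfgs = int(np.prod([card[p] for p in parents])) if parents else 1
    counts = np.full((n_parent_cfgs, card[child]), alpha, dtype=float)
    for row, w in zip(data, weights):
        cfg = 0
        for p in parents:                    # encode the parent configuration
            cfg = cfg * card[p] + row[p]
        counts[cfg, row[child]] += w         # weighted sufficient statistics
    return counts / counts.sum(axis=1, keepdims=True)
```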

12 Structure Search Results
Alarm network: 37 variables, 1000 samples. The combinatorial search space is super-exponential; the search uses local ops (add/remove/reverse edge) and optimizes the Bayesian Dirichlet score (BDe). [Plot: test log-loss per instance vs. iterations for the baseline, random annealing, and the adversary, with the true structure as a reference line; the temperature is annealed from hot to cold.] With similar running time: Random is superior to random re-starts, and a single Adversary run competes with Random.

13 Search with Missing Values
Alarm network: 37 variables, 1000 samples. Missing values introduce many local maxima; structural EM (SEM) combines structure search and parameter estimation. [Plot: test log-loss per instance vs. the percentage of runs at least that good, comparing Adversary, Random, the SEM baseline, and the generating model.] With similar running time: over 90% of the Random runs are better than standard SEM, the Adversary run is best, and the distance to the true generating model is halved!

14 Real-Life Datasets
Six real-life examples, with and without missing values: Stock, Soybean, Rosetta, Audio, Soy-M, Promoter (variables/samples as listed: 36/446, 30/300, 70/200, 36/546, 13/100). [Plot: test log-loss per instance relative to the BASELINE, showing the Adversary run and the 20-80% range of Random runs for each dataset.] With similar running time: the Adversary is efficient and preferable; Random takes longer for inferior results.

15 Learning Sequence Motifs
DNA promoter sequences, e.g.:
ATCTAGCTGAGAATGCACACTGATCGAGCCCCACCATATTCTTCGGACTGCGCTATATAGACTGCAACTAGTAGAGCTCTGCTAGAAACATTACTAAGCTCTATGACTGCCGATTGCGCCGTTTGGGCGTCTGAGCTCTTTGCTCTTGACTTCCGCTTATTGATATTATCTCTCTTGCTCGTGACTGCTTTATTGTGGGGGGGACTGCTGATTATGCTGCTCATAGGAGAGACTGCGAGAGTCGTCGTAGGACTGCGTCGTCGTGATGATGCTGCTGATCGATCGGACTGCCTAGCTAGTAGATCGATGTGACTGCAGAAGAGAGAGGGTTTTTTCGCGCCGCCCCGCGCGACTGCTCGAGAGGAAGTATATATGACTGCGCGCGCCGCGCGCCGGACTGCTTTATCCAGCTGATGCATGCATGCTAGTAGACTGCCTAGTCAGCTGCGATCGACTCGTAGCATGCATCGACTGCAGTCGATCGATGCTAGTTATTGGATGCGACTGAACTCGTAGCTGTAGTTATT
Represent the shared signal using a motif, a Position Specific Scoring Matrix (PSSM): one probability distribution over the letters A, C, G, T per motif position (the slide shows an example matrix in which a different letter dominates each column, with entries such as A 0.97, C 0.99, G 0.8, T 0.98). (Segal et al., RECOMB 2002.) The score is highly non-linear, so optimization is hard!
(A PSSM scoring sketch appears below.)
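To make the representation concrete, here is a minimal sketch (hypothetical, not the authors' code) of scoring a sequence window under a PSSM as a sum of per-position log-probabilities, which is the kind of term the motif score aggregates over candidate placements:

```python
import numpy as np

LETTERS = {"A": 0, "C": 1, "G": 2, "T": 3}

def pssm_log_score(window, pssm):
    """Log-probability of a sequence window under a PSSM.

    window: string of length L over {A, C, G, T}.
    pssm:   array of shape (L, 4); pssm[j] is the letter distribution
            at motif position j.
    """
    assert len(window) == pssm.shape[0]
    return sum(np.log(pssm[j, LETTERS[c]]) for j, c in enumerate(window))

def best_window_score(sequence, pssm):
    """Best-scoring placement of the motif within one promoter sequence."""
    L = pssm.shape[0]
    return max(pssm_log_score(sequence[i:i + L], pssm)
               for i in range(len(sequence) - L + 1))
```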

16 Real-Life Motif Results
PSSM: 4 letters x 20 positions, 550 samples. Construct the PSSM by finding the parameters that maximize the score; experiments on 9 transcription factors (motifs): ACE2, FKH1, FKH2, MBP1, MCM1, NDD1, SWI4, SWI5, SWI6. [Plot: test log-loss relative to the BASELINE, showing the Adversary run and the 20-80% range of Random runs for each motif.] With similar running time: both methods are better than standard ascent, and the Adversary is efficient and best in 6 of 9 cases.

17 Simulated Annealing
Simulated annealing allows "bad" moves with some probability, P(move) ∝ f(temperature, Δscore). It suffers from a wasteful propose-evaluate-reject cycle, needs a long time to escape local maxima, and is WORSE than the baseline on Bayesian networks!
(A minimal acceptance-step sketch appears below.)
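For reference, a minimal sketch of the standard Metropolis-style acceptance rule being contrasted here; this is the generic textbook form, not necessarily the exact variant the authors ran:

```python
import math
import random

def accept_move(delta_score, temperature):
    """Metropolis acceptance rule: always take improving moves, and take
    worsening moves with probability exp(delta_score / temperature)."""
    if delta_score >= 0:
        return True
    return random.random() < math.exp(delta_score / temperature)
```

Every proposed move must be evaluated before it can be rejected, which is the wasteful propose-evaluate-reject cycle the slide refers to.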

18 Summary and Future Work
A general method applicable to a variety of learning scenarios: decision trees, clustering, phylogenetic trees, TSP… Promising empirical results that approach the "achievable" maximum. The BIG challenge: THEORETICAL INSIGHTS.

19 Adversary ≠ Boosting
- Output: the Adversary produces a single hypothesis; Boosting produces an ensemble.
- Weights: Adversary weights converge to the original distribution; Boosting weights diverge from it.
- Learning: for the Adversary, h_t+1 depends on h_t; for Boosting, h_t+1 depends only on w_t+1.
The same comparison holds for Random vs. Bagging/Bootstrap.

20 Other Annealing Methods
- Simulated annealing: allow "bad" moves with some probability, P(move) ∝ f(temperature, Δscore). Not good on Bayesian networks!
- Deterministic annealing: change the scenery by changing the family of hypotheses, from simple to complex. Not naturally applicable here!

21 Intuition for the Adversary
What happens before and after re-weighting? [Two score-landscape plots, hot and cold: a search that starts at a local maximum escapes after re-weighting, while a search that starts at the global maximum stays there.] Escaping a local maximum is easy; escaping the global maximum is hard!

22 Escaping Local Maxima (recap)
Local methods converge to one of many local optima; TABU search, random restarts, and simulated annealing all perturb the search itself. Our idea: anneal with a perturbation of the instance weights instead of a search perturbation. [Side-by-side plots: standard annealing allows "bad" moves with P(move) ∝ f(temperature, Δscore); weight annealing changes the scenery of the score landscape, with the amount of change proportional to the temperature.]

23 Adversarial Update Equation
The adversarial weights are defined by an implicit equation; we use an exponential update in the right direction (Kivinen & Warmuth, 1997), where η is the learning rate: a good (high-scoring) example gets low weight, a bad (low-scoring) example gets high weight.
(A hedged sketch of such an update appears below.)
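The exact equation on the slide is not recoverable from this transcript. The following is a plausible exponentiated-gradient-style update consistent with the description (high-scoring instances down-weighted, low-scoring instances up-weighted, learning rate η controlling the distance from the original weights); it is an assumption for illustration, not the authors' exact formula:

```python
import numpy as np

def adversarial_reweight(w_orig, per_instance_scores, eta):
    """Exponentiated-gradient-style adversarial update (sketch).

    Instances that score well under the current hypothesis get a smaller
    multiplicative factor (low weight); poorly-scoring instances get a larger
    one (high weight). eta controls how far the result moves from w_orig.
    """
    w = w_orig * np.exp(-eta * per_instance_scores)  # bad (low-score) examples grow
    return w * (w_orig.sum() / w.sum())              # renormalize to the original total
```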

