
1
Weight Annealing: Data Perturbation for Escaping Local Maxima in Learning
Gal Elidan, Matan Ninio, Nir Friedman (Hebrew University); Dale Schuurmans (University of Waterloo)

2
**The Learning Problem**

Given DATA with instance weights, the learning task is to search for a hypothesis that maximizes a score. Optimization is hard! We typically resort to local optimization methods: gradient ascent, greedy hill-climbing, EM. Example tasks: density estimation, classification, logistic regression.

3
Escaping local maxima: local methods converge to (one of many) local optima. TABU search, random restarts, and simulated annealing all work by step perturbation during the local search. [Figure: score landscape with the search stuck at a local maximum h]

4
**Weight Perturbation**

Our idea: perturb the instance weights instead.

Do until convergence:
- Perturb the instance weights
- Use the optimizer as a black box

To maximize the original goal, diminish the magnitude of the perturbation over time.

Benefits:
- Generality: applies to a wide variety of learning scenarios
- Modularity: the search procedure is unchanged
- Effectiveness: allows global changes
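The loop above can be sketched in a few lines. `local_search` and `score` stand for whatever learner and objective are plugged in; the function names, the uniform noise, and the decay schedule are our assumptions, not the paper's:

```python
import random

def weight_anneal(data, local_search, score, n_iters=20, t0=1.0, decay=0.8):
    """Weight-annealing loop: perturb instance weights, reuse the local
    optimizer as a black box, and shrink the perturbation magnitude so the
    final hypothesis is tuned to the original (uniform) weights."""
    n = len(data)
    weights = [1.0 / n] * n          # original instance weights
    temp = t0
    best_h, best_score = None, float("-inf")
    for _ in range(n_iters):
        # Perturb: sample weights around the originals; the temperature
        # controls how far they may stray.
        perturbed = [w * random.uniform(1 - temp, 1 + temp) for w in weights]
        total = sum(perturbed)
        perturbed = [w / total for w in perturbed]
        # Black-box local search on the reweighted problem.
        h = local_search(data, perturbed)
        # Track the best hypothesis under the ORIGINAL weights.
        s = score(h, data, weights)
        if s > best_score:
            best_h, best_score = h, s
        temp *= decay                # diminish perturbation magnitude
    return best_h
```

As a toy usage, the "learner" can be a weighted mean and the score a negative weighted squared error; the loop then recovers the unweighted optimum as the temperature cools.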

5
Weight Perturbation [Figure: the iterative loop: DATA (with weights W) → LOCAL SEARCH → Hypothesis h (with Score) → REWEIGHT → back to DATA]

6
Weight Perturbation. Our idea: perturbing the instance weights puts stronger emphasis on a subset of the instances and allows the learning procedure to escape local maxima. [Figure: DATA with weights W, before and after perturbation]

7
**Iterative Procedure**

[Figure: DATA (weights W) → LOCAL SEARCH → Hypothesis (Score) → REWEIGHT loop]

Benefits:
- Generality: applies to a wide variety of learning scenarios
- Modularity: the search procedure is unchanged
- Effectiveness: allows global changes

8
**Iterative Procedure: Two Methods for Reweighting**

Random: sample random weights.
Adversarial: directed reweighting.

To maximize the original goal, slowly diminish the magnitude of the perturbations.

9
**Random Reweighting**

The perturbed weights are sampled with mean equal to the original weight; the temperature sets the variance of P(W). When hot, the model can "go" almost anywhere and local maxima are bypassed. When cold, the search fine-tunes to find an optimum with respect to the original data. [Figure: P(W) vs. distance from the original W, for hot and cold temperatures; iterates W_t, W_t+1, W_t+2 approach W*]
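One plausible instantiation of this sampler, as a sketch under our own assumptions (multiplicative log-normal noise, with the temperature as the noise scale; the paper does not prescribe this exact form):

```python
import math, random

def random_reweight(orig_weights, temp):
    """Sample perturbed weights centred on the originals.
    Hot (large temp): high variance, weights may differ wildly.
    Cold (temp -> 0): weights collapse back to the originals."""
    noisy = [w * math.exp(random.gauss(0.0, temp)) for w in orig_weights]
    total = sum(noisy)
    return [w / total for w in noisy]
```

At `temp = 0` the sampler returns the original weights exactly, which matches the cold end of the annealing schedule.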

10
**Adversarial Reweighting**

Idea: challenge the model by increasing the weight of "bad" (low-scoring) instances, i.e. choose W to minimize the score. This is a min-max game between the reweighting step and the optimizer. Convergence toward the original distribution is enforced by constraining the distance from W* (exponentiated update, Kivinen & Warmuth). [Figure: iterates W_t, W_t+1 moving toward W* as the temperature goes from hot to cold]

11
**Learning Bayesian Networks**

A Bayesian network (BN) is a compact representation of a joint distribution; learning a BN is a density estimation problem. The learning task: given DATA and instance weights, find the structure and parameters that maximize the score. [Figure: the Alarm network, with nodes such as PCWP, CO, HRBP, HREKG, CATECHOL, SAO2, VENTLUNG, …]
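On each reweighted dataset, the black-box learner estimates parameters from weighted instances. A minimal sketch of weighted estimation for a single conditional probability table, with a Dirichlet pseudo-count in the spirit of the BDe score (function and variable names are ours):

```python
from collections import defaultdict

def weighted_cpt(data, weights, child, parents, alpha=1.0):
    """Estimate P(child | parents) from weighted instances.
    Each instance is a dict {variable: value}; `alpha` is a
    Dirichlet pseudo-count added to every observed value."""
    counts = defaultdict(lambda: defaultdict(float))
    for inst, w in zip(data, weights):
        key = tuple(inst[p] for p in parents)
        counts[key][inst[child]] += w
    cpt = {}
    for key, dist in counts.items():
        vals = sorted(dist)
        total = sum(dist.values()) + alpha * len(vals)
        cpt[key] = {v: (dist[v] + alpha) / total for v in vals}
    return cpt
```

Because the counts are weighted, the same routine serves both the original data (uniform weights) and every perturbed version of it.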

12
**Structure Search Results**

The combinatorial search space is super-exponential. Search uses local ops: add/remove/reverse edge. We optimize the Bayesian Dirichlet (BDe) score. With similar running time: Random annealing is superior to random restarts, and a single Adversary run competes with Random. [Figure: test log-loss per instance vs. iteration for Baseline, Random annealing, and Adversary, with the true structure as reference; runs proceed from hot to cold. Alarm network: 37 variables, 1000 samples.]

13
**Search with Missing Values**

Missing values introduce many local maxima; Structural EM (SEM) combines structure search and parameter estimation. With similar running time: over 90% of Random runs are better than standard SEM, the Adversary run is best, and the distance to the true generating model is halved! [Figure: test log-loss per instance vs. percent of runs at least this good, with Baseline and generating-model reference lines. Alarm network: 37 variables, 1000 samples.]

14
**Real-life Datasets**

Six real-life examples, with and without missing values. With similar running time: Adversary is efficient and preferable, while Random takes longer for inferior results. [Figure: test log-loss per instance for the 20-80% Random range and Adversary vs. Baseline, on the Stock, Soybean, Rosetta, Audio, Soy-M, and Promoter datasets; the number of variables and samples per dataset (36/446, 30/300, 70/200, 36/546, 13/100, …) is listed under each.]

15
**Learning Sequence Motifs**

DNA promoter sequences, e.g.:

ATCTAGCTGAGAATGCACACTGATCGAGCCCCACCATATTCTTCGGACTGCGCTATATAGACTGCAACTAGTAGAGCTCTGCTAGAAACATTACTAAGCTCTATGACTGCCGATTGCGCCGTTTGGGCGTCTGAGCTCTTTGCTCTTGACTTCCGCTTATTGATATTATCTCTCTTGCTCGTGACTGCTTTATTGTGGGGGGGACTGCTGATTATGCTGCTCATAGGAGAGACTGCGAGAGTCGTCGTAGGACTGCGTCGTCGTGATGATGCTGCTGATCGATCGGACTGCCTAGCTAGTAGATCGATGTGACTGCAGAAGAGAGAGGGTTTTTTCGCGCCGCCCCGCGCGACTGCTCGAGAGGAAGTATATATGACTGCGCGCGCCGCGCGCCGGACTGCTTTATCCAGCTGATGCATGCATGCTAGTAGACTGCCTAGTCAGCTGCGATCGACTCGTAGCATGCATCGACTGCAGTCGATCGATGCTAGTTATTGGATGCGACTGAACTCGTAGCTGTAGTTATT

Represent the motif using a Position-Specific Scoring Matrix (PSSM):

A 0.97 0.02 …
C 0.01 0.99 0.2 …
G 0.1 0.8 …
T 0.03 0.98 …

(Segal et al., RECOMB 2002.) The score is highly non-linear: optimization is hard!
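Scoring candidate motif sites under a PSSM can be sketched as follows (helper names are ours; the log-likelihood of a window is simply the sum of per-position log-probabilities):

```python
import math

def pssm_score(window, pssm):
    """Log-probability of a sequence window under a PSSM, where
    pssm[i] maps each letter to its probability at position i."""
    return sum(math.log(pssm[i][c]) for i, c in enumerate(window))

def best_site(sequence, pssm):
    """Slide the PSSM along the sequence and return the score and
    start position of the best-matching window."""
    k = len(pssm)
    return max((pssm_score(sequence[i:i + k], pssm), i)
               for i in range(len(sequence) - k + 1))
```

Optimizing the PSSM entries themselves, with the site positions unknown, is what makes the overall score highly non-linear.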

16
**Real-life Motifs Results**

Construct a PSSM: find the parameters that maximize the score. Experiments on 9 transcription factors (motifs): ACE2, FKH1, FKH2, MBP1, MCM1, NDD1, SWI4, SWI5, SWI6. With similar running time: both methods are better than standard ascent, and Adversary is efficient and best 6/9 times. [Figure: test log-loss per motif for the 20-80% Random range and Adversary vs. Baseline. PSSM: 4 letters x 20 positions, 550 samples.]

17
**Simulated Annealing**

Simulated annealing allows "bad" moves with some probability P(move) = f(temp, Δscore). [Figure: score landscape for hypothesis h.] It suffers a wasteful propose, evaluate, reject cycle and needs a long time to escape local maxima. It is WORSE than the baseline on Bayesian networks!
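For contrast with weight annealing, the acceptance rule the slide refers to is the standard Metropolis criterion; a minimal sketch:

```python
import math, random

def accept_move(delta_score, temp):
    """Metropolis criterion: always accept improvements; accept a
    'bad' move (delta_score < 0) with probability exp(delta/temp)."""
    if delta_score >= 0:
        return True
    return random.random() < math.exp(delta_score / temp)
```

Every rejected proposal still costs a full score evaluation, which is the wasteful cycle criticized above.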

18
**Summary and Future Work**

A general method applicable to a variety of learning scenarios: decision trees, clustering, phylogenetic trees, TSP, … Promising empirical results that approach the "achievable" maximum. The BIG challenge: THEORETICAL INSIGHTS.

19
**Adversary ≠ Boosting**

|          | Adversary | Boosting |
|----------|-----------|----------|
| Output   | Single hypothesis | An ensemble |
| Weights  | Converge to the original distribution | Diverge from the original distribution |
| Learning | h_t+1 depends on h_t | h_t+1 depends only on w_t+1 |

The same comparison holds for Random vs. Bagging/Bootstrap.

20
**Other annealing methods**

Simulated annealing allows "bad" moves with some probability P(move) = f(temp, Δscore); it is not good on Bayesian networks! Deterministic annealing changes the scenery by changing the family of h, from simple to complex hypotheses; it is not naturally applicable here! [Figure: score landscapes for a simple vs. a complex hypothesis family]

21
**Intuition for the Adversary**

What happens before and after re-weighting? When hot, the search starts at one point of the reweighted score surface and can leave the basin it would otherwise be stuck in; when cold, it finishes on the original surface. Escaping a local maximum is easy! Escaping the global maximum is hard! [Figure: score curves before and after reweighting (HOT vs. COLD), marking where the search starts, where it would get stuck, and where it finishes]

22
Escaping local maxima: local methods converge to (one of many) local optima; TABU search, random restarts, and simulated annealing perturb the search steps. Our idea: anneal with perturbation of the instance weights instead of search perturbation. Standard annealing allows "bad" moves with probability P(move) = f(temp, Δscore); weight annealing changes the scenery as the temperature drops. [Figure: score landscape under standard annealing vs. the reweighted landscape for weights W]

23
**Adversarial Update Equation**

The weights are defined by an IMPLICIT EQUATION, solved using an exponential update in the right direction (Kivinen & Warmuth, 1997):

w_i(t+1) ∝ w_i* · exp(−η · score(x_i | h_t))

where η is the learning rate. A good (high-scoring) example gets low weight; a bad (low-scoring) example gets high weight.
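Such an exponentiated update can be sketched as follows (the argument names, the per-instance `scores`, and the exact normalization are our assumptions):

```python
import math

def adversarial_reweight(orig_weights, scores, eta):
    """Exponentiated update: instances the current hypothesis scores
    badly (low score) get increased weight; as eta -> 0 the weights
    converge back to the originals."""
    new = [w * math.exp(-eta * s) for w, s in zip(orig_weights, scores)]
    total = sum(new)
    return [w / total for w in new]
```

Anchoring the update to the original weights `orig_weights` rather than the previous iterate is what keeps the adversary constrained near W*.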
