
1 Lab 1 Getting started with Basic Learning Machines and the Overfitting Problem

2 Lab 1 Polynomial regression

3 Matlab: POLY_GUI

The code implements the ridge regression algorithm:

w = argmin_w Σ_i (1 - y_i f(x_i))^2 + γ ||w||^2

f(x) = w_1 x + w_2 x^2 + … + w_n x^n = w x^T, where x = [x, x^2, …, x^n]

w^T = X^+ Y, with X^+ = X^T (X X^T + γI)^-1 = (X^T X + γI)^-1 X^T and X = [x(1); x(2); …; x(p)] (a matrix of size (p, n)).

The leave-one-out error (LOO) is obtained with the PRESS statistic (Predicted REsidual Sums of Squares):

LOO error = (1/p) Σ_k [ r_k / (1 - (X X^+)_kk) ]^2
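For concreteness, here is a minimal Matlab sketch of these formulas (this is not the poly_gui source; X, Y, and the value of gamma are assumptions of this example):

% Ridge regression and the PRESS leave-one-out estimate.
% Assumes X is the (p, n) matrix of monomial features and Y the (p, 1) target vector.
gamma = 0.1;                                 % regularization strength (example value)
n = size(X, 2);
w = (X'*X + gamma*eye(n)) \ (X'*Y);          % w = (X'X + gamma I)^-1 X'Y = X+ Y
H = X * ((X'*X + gamma*eye(n)) \ X');        % "hat" matrix X X+
r = Y - X*w;                                 % training residuals
loo = mean((r ./ (1 - diag(H))).^2);         % (1/p) sum_k [r_k/(1-(XX+)_kk)]^2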

4 Matlab: POLY_GUI

5 At the prompt, type: poly_gui;

Vary the parameters. Refrain from hitting "CV". Explain what happens in the following situations:
– Sample number << target degree (small noise)
– Large noise, small sample number
– Target degree << model degree

Why is the LOO error sometimes larger than the training and test errors? Are there local minima in the LOO error? Is the LOO error flat near the optimum? Propose ways of getting a better solution.

6 CLOP Data Objects

The poly_gui emulates CLOP objects of type "data":

X = rand(10,5)
Y = rand(10,1)
D = data(X,Y)    % constructor
methods(D)
get_x(D)
get_y(D)
plot(D);

7 CLOP Model Objects

poly_ridge is a "model" object:

P = poly_ridge; h = plot(P);
D = gene(P); plot(D, h);
[resu, P] = train(P, D);
mse(resu)
Dt = gene(P);
[tresu, P] = test(P, Dt);
mse(tresu)
plot(P, h);

8 Lab 1 Support Vector Machines

9 Support Vector Classifier (Boser, Guyon, Vapnik, 1992)

[Figure: two classes of points in the (x_1, x_2) plane, x = [x_1, x_2], with the decision boundary f(x) = 0 separating the regions f(x) > 0 and f(x) < 0.]

f(x) = Σ_{k ∈ SV} α_k y_k k(x, x_k)

10 Matlab: SVC_GUI

At the prompt, type: svc_gui;

The code implements the Support Vector Machine algorithm with the kernel:

k(s, t) = (1 + s·t)^q exp(-γ ||s - t||^2)

The regularization is similar to ridge regression:
– Hinge loss: L(x_i) = max(0, 1 - y_i f(x_i))
– Empirical risk: Σ_i L(x_i)
– w = argmin (1/C) ||w||^2 + Σ_i L(x_i), where the (1/C)||w||^2 term is the shrinkage.
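For illustration, a small Matlab sketch of this kernel and objective (assumed, not the svc_gui source):

q = 2; gamma = 0.5; C = 10;                            % example hyperparameter values
k = @(s,t) (1 + s*t')^q * exp(-gamma*norm(s-t)^2);     % kernel for row vectors s, t
hinge = @(Y,F) max(0, 1 - Y.*F);                       % hinge loss L(x_i)
obj = @(w,F,Y) (1/C)*(w*w') + sum(hinge(Y,F));         % (1/C)||w||^2 + sum_i L(x_i)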

11 Lab 1 More loss functions…

12 Loss Functions

[Figure: the losses below plotted against the functional margin z = y f(x); the decision boundary is at z = 0 and the margin at z = 1; points with z > 0 are well classified, points with z < 0 are misclassified.]

– 0/1 loss
– square loss: (1 - z)^2
– SVC loss, γ = 1: max(0, 1 - z)
– SVC loss, γ = 2: max(0, 1 - z)^2
– logistic loss: log(1 + e^-z)
– Adaboost loss: e^-z
– Perceptron loss: max(0, -z)
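The figure can be reproduced with a short Matlab sketch:

z = linspace(-2, 3, 300);            % functional margin z = y f(x)
plot(z, double(z <= 0)); hold on;    % 0/1 loss
plot(z, (1 - z).^2);                 % square loss
plot(z, max(0, 1 - z));              % SVC loss, gamma = 1
plot(z, max(0, 1 - z).^2);           % SVC loss, gamma = 2
plot(z, log(1 + exp(-z)));           % logistic loss
plot(z, exp(-z));                    % Adaboost loss
plot(z, max(0, -z));                 % Perceptron loss
xlabel('z = y f(x)'); ylabel('L(y, f(x))');
legend('0/1', 'square', 'SVC \gamma=1', 'SVC \gamma=2', 'logistic', 'Adaboost', 'Perceptron');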

13 Exercise: Gradient Descent

Linear discriminant: f(x) = Σ_j w_j x_j
Functional margin: z = y f(x), y = ±1

Compute ∂z/∂w_j. Derive the learning rules Δw_j = -η ∂L/∂w_j corresponding to the following loss functions (a worked sketch for the square loss follows this list):
– square loss: (1 - z)^2
– SVC loss: max(0, 1 - z)
– logistic loss: log(1 + e^-z)
– Adaboost loss: e^-z
– Perceptron loss: max(0, -z)
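As a sketch of what is expected, here is the worked case of the square loss; the other losses follow the same pattern (X, Y, and the learning rate value are assumptions of this example):

% Square loss L = (1 - z)^2 with z = y*(w*x'):
% dz/dw_j = y*x_j, hence dL/dw_j = -2*(1 - z)*y*x_j
% and the rule Delta w_j = -eta*dL/dw_j = eta*2*(1 - z)*y*x_j.
eta = 0.01;                          % learning rate (example value)
w = zeros(1, size(X, 2));            % linear discriminant weights
for epoch = 1:100
    for i = 1:size(X, 1)
        x = X(i,:); y = Y(i);
        z = y*(w*x');                % functional margin
        w = w + eta*2*(1 - z)*y*x;   % stochastic gradient step
    end
end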

14 Exercise: Dual Algorithms

From the Δw_j, derive the update Δw of the weight vector w = Σ_i α_i x_i.
From Δw, derive the updates Δα_i of the dual algorithms.

15 Summary

Modern ML algorithms optimize a penalized risk functional:

R[f] = Σ_i L(y_i, f(x_i)) + γ ||w||^2

(an empirical risk term plus a regularization term, as in the ridge regression and SVC examples above).

16 Lab 2 Getting started with CLOP

17 Lab 2 CLOP tutorial

18 What is CLOP?

CLOP = Challenge Learning Object Package, based on the Spider package developed at the Max Planck Institute. It rests on two basic abstractions:
–Data objects
–Model objects

Put the CLOP directory in your path. At the prompt, type: use_spider_clop;
If you have used poly_gui before, type: clear classes

19 CLOP Data Objects

At the Matlab prompt:

addpath( );
use_spider_clop;
X = rand(10,8);
Y = [1 1 1 1 1 -1 -1 -1 -1 -1]';
D = data(X,Y);     % constructor
[p,n] = get_dim(D)
get_x(D)
get_y(D)

20 CLOP Model Objects

D is the data object previously defined.

model = kridge;    % constructor
[resu, model] = train(model, D);
resu, model.W, model.b0
Yhat = D.X*model.W' + model.b0
testD = data(rand(3,8), [-1 -1 1]');
tresu = test(model, testD);
balanced_errate(tresu.X, tresu.Y)

21 Hyperparameters and Chains

A model often has hyperparameters:

default(kridge)
hyper = {'degree=3', 'shrinkage=0.1'};
model = kridge(hyper);

Models can be chained:

model = chain({standardize, kridge(hyper)});
[resu, model] = train(model, D);
tresu = test(model, testD);
balanced_errate(tresu.X, tresu.Y)

22 Hyper-parameters

Kernel methods (kridge and svc):
k(x, y) = (coef0 + x·y)^degree exp(-gamma ||x - y||^2)
k_ij = k(x_i, x_j); k_ii ← k_ii + shrinkage

Naïve Bayes (naive): none.
Neural network (neural): units, shrinkage, maxiter.
Random Forest (rf, Windows only): mtry.
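A sketch of how these hyperparameters enter the kernel matrix, following the formulas above (X is assumed to be a (p, n) data matrix; this is not the CLOP source):

coef0 = 1; degree = 3; gamma = 0.1; shrinkage = 0.1;   % example values
p  = size(X, 1);
G  = X*X';                                             % pairwise dot products
sq = sum(X.^2, 2);                                     % squared norms ||x_i||^2
D2 = repmat(sq, 1, p) + repmat(sq', p, 1) - 2*G;       % squared distances ||x_i - x_j||^2
K  = (coef0 + G).^degree .* exp(-gamma*D2);            % k_ij = k(x_i, x_j)
K  = K + shrinkage*eye(p);                             % k_ii <- k_ii + shrinkage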

23 Exercise

Here are some of the pattern recognition CLOP objects:
@rf @naive @svc @neural @gentleboost @lssvm @gkridge @kridge @klogistic @logitboost

Try at the prompt: example(neural)
Try other pattern recognition objects.
Try different sets of hyperparameters, e.g., example(svc({'gamma=1', 'shrinkage=0.001'}))
Remember: use default(method) to get the HP.

24 Lab 2 Example: Digit Recognition. A subset of the MNIST data of LeCun and Cortes, used for the NIPS 2003 challenge.

25 data(X, Y)

% Go to the Gisette directory:
cd('GISETTE')
% Load "validation" data:
Xt = load('gisette_valid.data');
Yt = load('gisette_valid.labels');
% Create a data object and examine it:
Dt = data(Xt, Yt);
browse(Dt, 2);
% Load "training" data (longer):
X = load('gisette_train.data');
Y = load('gisette_train.labels');
[p, n] = get_dim(Dt);
D = train(subsample(['p_max=' num2str(p)]), data(X, Y));
clear X Y Xt Yt
% Save for later use:
save('gisette', 'D', 'Dt');

26 model(hyperparam)

% Define some hyperparameters:
hyper = {'degree=3', 'shrinkage=0.1'};
% Create a kernel ridge regression model:
model = kridge(hyper);
% Train it and test it:
[resu, Model] = train(model, D);
tresu = test(Model, Dt);
% Visualize the results (misclassified test examples):
roc(tresu);
idx = find(tresu.X.*tresu.Y < 0);
browse(get(Dt, idx), 2);

27 Exercise

Here are some pattern recognition CLOP objects:
@rf @naive @gentleboost @svc @neural @logitboost @kridge @lssvm @klogistic

Instantiate a model with some hyperparameters (use default(method) to get the HP).
Vary the HP and the number of training examples (hint: use get(D, 1:n) to restrict the data to n examples).

28 chain({model1, model2,…}) and ensemble({model1, model2,…})

% Combine preprocessing and kernel ridge regression:
my_prepro = normalize;
model = chain({my_prepro, kridge(hyper)});
% Combine replicas of a base learner:
for k = 1:10
    base_model{k} = neural;
end
model = ensemble(base_model);

29 Exercise

Here are some preprocessing CLOP objects: @normalize @standardize @fourier

Chain a preprocessing and a model, e.g.:
model = chain({fourier, kridge('degree=3')});
my_classif = svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.1'});
model = chain({normalize, my_classif});

Train, test, and visualize the results. Hint: you can browse the preprocessed data:
browse(train(standardize, D), 2);

30 Summary

% After creating your complex model, training takes just one command:
model = ensemble({chain({standardize, kridge(hyper)}), chain({normalize, naive})});
[resu, Model] = train(model, D);
% After training your complex model, testing also takes just one command:
tresu = test(Model, Dt);
% You can use a "cv" object to perform cross-validation:
cv_model = cv(model);
[resu, Model] = train(cv_model, D);
roc(resu);

31 Lab 3 Getting started with Feature Selection

32 POLY_GUI again…

clear classes
poly_gui;

Check the "Multiplicative updates" (MU) box. Play with the parameters. Try CV. Compare with no MU.

33 Lab 3 Exploring feature selection methods

34 Re-load the GISETTE data

% Start CLOP:
clear classes
use_spider_clop;
% Go to the Gisette directory:
cd('GISETTE')
load('gisette');

35 Visualization

1) Create a heatmap of the data matrix or a subset:
show(D);
show(get(D, 1:10, 1:2:500));

2) Look at individual patterns:
browse(D);
browse(D, 2);    % for 2d data
% Display feature positions:
browse(D, 2, [212, 463, 429, 239]);

3) Make a scatter plot of a few features:
scatter(D, [212, 463, 429, 239]);

36 Example

my_classif = svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
model = chain({normalize, s2n('f_max=100'), my_classif});
[resu, Model] = train(model, D);
tresu = test(Model, Dt);
roc(tresu);
% Show the misclassified examples first:
[s, idx] = sort(tresu.X.*tresu.Y);
browse(get(Dt, idx), 2, Model{2});

37 Some Filters in CLOP

Univariate:
@s2n (signal-to-noise ratio)
@Ttest (T statistic; similar to s2n)
@Pearson (uses Matlab corrcoef; gives the same results as Ttest if the classes are balanced)
@aucfs (ranksum test)

Multivariate:
@relief (no elimination of redundancy)
@gs (Gram-Schmidt orthogonalization; selects complementary features)
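As a sketch, the signal-to-noise criterion behind @s2n is commonly computed as follows (assumed, not the CLOP source; X is the data matrix, Y the ±1 labels):

mu_p = mean(X(Y > 0, :), 1);   mu_n = mean(X(Y < 0, :), 1);   % class means per feature
sd_p = std(X(Y > 0, :), 0, 1); sd_n = std(X(Y < 0, :), 0, 1); % class std devs per feature
s2n_score = abs(mu_p - mu_n) ./ (sd_p + sd_n);                % signal-to-noise ratio
[srt, ranking] = sort(s2n_score, 'descend');                  % best features first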

38 Exercise

Change the feature selection algorithm. Visualize the features. What can you say about the various methods? Which one gives the best results for 2, 10, and 100 features? Can you improve the results by changing the preprocessing? (Hint: try @pc_extract.)

39 Lab 3 Feature significance

40 T-test

[Figure: the class-conditional distributions P(X_i|Y=-1) and P(X_i|Y=1) of feature x_i, with means μ- and μ+ and standard deviations σ- and σ+.]

The classes are assumed normally distributed, with equal variance σ^2, unknown and estimated from the data as σ^2_within.
Null hypothesis H0: μ+ = μ-.
T statistic: if H0 is true,
t = (μ+ - μ-) / (σ_within √(1/m+ + 1/m-)) ~ Student with m+ + m- - 2 d.f.
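As a sketch, the statistic for one feature i can be computed as follows (the feature index i is assumed; tcdf, used for the p-value, assumes the Statistics Toolbox):

xp = X(Y > 0, i);  xn = X(Y < 0, i);                       % feature i values per class
mp = numel(xp);    mn = numel(xn);
s2w = ((mp-1)*var(xp) + (mn-1)*var(xn)) / (mp + mn - 2);   % pooled sigma^2_within
t = (mean(xp) - mean(xn)) / sqrt(s2w*(1/mp + 1/mn));       % Student, mp+mn-2 d.f.
pval = 2*(1 - tcdf(abs(t), mp + mn - 2));                  % two-sided p-value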

41 Evaluation of pval and FDR

Ttest object:
– computes pval analytically
– FDR ≈ pval × n/n_sc

probe object:
– takes any feature ranking object as an argument (e.g., s2n, relief, Ttest)
– pval ≈ n_sp/n_p
– FDR ≈ pval × n/n_sc

(n = total number of features, n_sc = number of selected candidate features, n_sp = number of selected probes, n_p = total number of probes. A sketch of the probe method follows.)
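A minimal sketch of the probe idea (assumed, not the CLOP probe source): random "probe" features, built by shuffling the rows of X so that their scores reflect chance, are ranked together with the real features; the probes' positions in the ranking then estimate pval and FDR. The scoring helper score_fun is hypothetical (e.g., an s2n score per column):

n  = size(X, 2);                     % number of real features
Xp = X(randperm(size(X, 1)), :);     % probes: same columns, shuffled rows
s  = score_fun([X Xp], Y);           % one score per column (hypothetical helper)
[srt, order] = sort(s, 'descend');
is_probe = (order > n);              % probe positions in the ranking
n_sp = cumsum(is_probe);             % selected probes at each rank
n_sc = max(cumsum(~is_probe), 1);    % selected real candidates at each rank
pval = n_sp / n;                     % pval ~ n_sp / n_p (here n_p = n)
fdr  = pval * n ./ n_sc;             % FDR ~ pval * n / n_sc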

42 Analytic vs. probe

[Figure: FDR as a function of feature rank (0 to 5000), comparing the analytic and probe estimates.]

43 Example

[resu, FS] = train(Ttest, D);
[resu, PFS] = train(probe(Ttest), D);
figure('Name', 'pvalue');
plot(get_pval(FS, 1), 'r');
hold on; plot(get_pval(PFS, 1));
figure('Name', 'FDR');
plot(get_fdr(FS, 1), 'r');
hold on; plot(get_fdr(PFS, 1));

44 Exercise

What could explain the differences between the pvalue and FDR obtained with the analytic and the probe method?
Replace Ttest with chain({rmconst('w_min=0'), Ttest}) and recompute the pvalue and FDR curves. What do you notice?
Choose an optimum number fnum of features based on pvalue or FDR. Visualize with browse(D, 2, FS, fnum);
Create a model with fnum features. Is fnum optimal? Do you get something better with CV?

45 Lab 3 Local feature selection

46 Exercise

Consider the 1-nearest-neighbor algorithm. We define the following score:

[Formula from the slide: a score comparing, for each pattern x_k, its distance to x_s(k) with its distance to x_d(k).]

where s(k) (resp. d(k)) is the index of the nearest neighbor of x_k belonging to the same class (resp. a different class) as x_k.

47 Exercise

1. Motivate the choice of such a cost function to approximate the generalization error (qualitative answer).
2. How would you derive an embedded method to perform feature selection for the 1-nearest-neighbor algorithm using this functional?
3. Motivate your choice (what makes your method an 'embedded method' rather than a 'wrapper' method?).

48 Relief

[Figure: a pattern with its nearest hit (nearest neighbor of the same class), at distance D_hit, and its nearest miss (nearest neighbor of the other class), at distance D_miss.]

Relief = ⟨D_miss⟩/⟨D_hit⟩ (averaged over all patterns)
Local_Relief = D_miss/D_hit (computed per pattern)
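A sketch of the quantities on this slide for a single pattern x_k (assumed; Euclidean distances):

d = sqrt(sum((X - repmat(X(k,:), size(X,1), 1)).^2, 2));   % distances from x_k
d(k) = inf;                          % exclude x_k itself
D_hit  = min(d(Y == Y(k)));          % distance to the nearest hit
D_miss = min(d(Y ~= Y(k)));          % distance to the nearest miss
local_relief = D_miss / D_hit;       % per-pattern (local) score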

49 Exercise

[resu, FS] = train(relief, D);
browse(D, 2, FS, 20);
[resu, LFS] = train(local_relief, D);
browse(D, 2, LFS, 20);

Propose a modification of the nearest-neighbor algorithm that uses features relevant to individual patterns (like those provided by "local_relief"). Do you expect such an algorithm to perform better than the non-local version using "relief"?

50 Epilogue: becoming a pro and playing with other datasets

51 Some CLOP objects

–Basic learning machines
–Feature selection, pre- and post-processing
–Compound models

52 http://clopinet.com/challenges/

Challenges in:
–Feature selection
–Performance prediction
–Model selection
–Causality

Large datasets.

53 NIPS 2003 Feature Selection Challenge

Dataset  | Size    | Type           | Features | Training | Validation | Test
Arcene   | 8.7 MB  | Dense          | 10000    | 100      | 100        | 700
Gisette  | 22.5 MB | Dense          | 5000     | 6000     | 1000       | 6500
Dexter   | 0.9 MB  | Sparse integer | 20000    | 300      | 300        | 2000
Dorothea | 4.7 MB  | Sparse binary  | 100000   | 800      | 350        | 800
Madelon  | 2.9 MB  | Dense          | 500      | 2000     | 600        | 1800

Class taught at ETH, Zurich, winter 2005. Task of the students: a baseline method is provided, with performance BER0 and n0 features. Get BER < BER0, or BER = BER0 with n < n0. Extra credit for beating the best challenge entry.

Baseline models and best challenge results:

ARCENE: best BER = 11.9 ± 1.2%, n0 = 1100 (11%), BER0 = 14.7%
my_svc = svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=0.1'});
my_model = chain({standardize, s2n('f_max=1100'), normalize, my_svc})

GISETTE: best BER = 1.26 ± 0.14%, n0 = 1000 (20%), BER0 = 1.80%
my_classif = svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
my_model = chain({normalize, s2n('f_max=1000'), my_classif});

DEXTER: best BER = 3.30 ± 0.40%, n0 = 300 (1.5%), BER0 = 5%
my_classif = svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
my_model = chain({s2n('f_max=300'), normalize, my_classif})

DOROTHEA: best BER = 8.54 ± 0.99%, n0 = 1000 (1%), BER0 = 12.37%
my_model = chain({TP('f_max=1000'), naive, bias});

MADELON: best BER = 6.22 ± 0.57%, n0 = 20 (4%), BER0 = 7.33%
my_classif = svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});
my_model = chain({probe(relief, {'p_num=2000', 'pval_max=0'}), standardize, my_classif})

[Slide illustration: sample data from each task, e.g. a DEXTER text excerpt: "NEW YORK, October 2, 2001 – Instinet Group Incorporated (Nasdaq: INET), the world's largest electronic agency securities broker, today announced tha…"]

Reference: Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Isabelle Guyon, Jiwen Li, Theodor Mader, Patrick A. Pletscher, Georg Schneider and Markus Uhr. Pattern Recognition Letters, Volume 28, Issue 12, 1 September 2007, Pages 1438-1444.

54 NIPS 2006 Model Selection Game

Dataset | Domain              | Features | Training | Validation | Test
ADA     | Marketing           | 48       | 4147     | 415        | 41471
GINA    | Digit recognition   | 970      | 3153     | 315        | 31532
HIVA    | Drug discovery      | 1617     | 3845     | 384        | 38449
NOVA    | Text classification | 16969    | 1754     | 175        | 17537
SYLVA   | Ecology             | 216      | 13086    | 1309       | 130857

First place: Juha Reunanen, cross-indexing-7. CLOP models selected:
ADA: 2*{sns,std,norm,gentleboost(neural),bias}; 2*{std,norm,gentleboost(kridge),bias}; 1*{rf,bias}
GINA: 6*{std,gs,svc(degree=1)}; 3*{std,svc(degree=2)}
HIVA: 3*{norm,svc(degree=1),bias}
NOVA: 5*{norm,gentleboost(kridge),bias}
SYLVA: 4*{std,norm,gentleboost(neural),bias}; 4*{std,neural}; 1*{rf,bias}

Second place: Hugo Jair Escalante Balderas, BRun2311062. CLOP models selected:
ADA: {sns, std, norm, neural(units=5), bias}
GINA: {norm, svc(degree=5, shrinkage=0.01), bias}
HIVA: {std, norm, gentleboost(kridge), bias}
NOVA: {norm, gentleboost(neural), bias}
SYLVA: {std, norm, neural(units=1), bias}
Note: the entry Boosting_1_001_x900 gave better results, but was older.

sns = shift'n'scale, std = standardize, norm = normalize (some details of the hyperparameters are not shown).

[Slide illustration: a sample NOVA text, a newsgroup post about goalie masks: "Tom Barrasso wore a great mask, one time, last season…"]

References:
– PSMS for Neural Networks. H. Jair Escalante, Manuel Montes y Gómez, and Luis Enrique Sucar. Proc. IJCNN07, Orlando, FL, Aug. 2007.
– Model Selection and Assessment Using Cross-indexing. Juha Reunanen.

