
1 CSCI B609: β€œFoundations of Data Science”
Lecture 13/14: Gradient Descent, Boosting and Learning from Experts. Grigory Yaroslavtsev.

2 Constrained Convex Optimization
Non-convex optimization is NP-hard: $\sum_i (x_i - x_i^2) = 0 \Leftrightarrow \forall i:\ x_i \in \{0,1\}$. Knapsack: minimize $\sum_i c_i x_i$ subject to $\sum_i w_i x_i \le W$. Convex optimization can often be solved by the ellipsoid algorithm in $\mathrm{poly}(n)$ time, but this is too slow.

3 Convex multivariate functions
Convexity (first-order): $\forall x, y \in \mathbb{R}^n$: $f(x) \ge f(y) + (x - y) \cdot \nabla f(y)$. Equivalently: $\forall x, y \in \mathbb{R}^n$, $0 \le \lambda \le 1$: $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y)$. If higher derivatives exist: $f(x) = f(y) + \nabla f(y) \cdot (x - y) + \frac{1}{2}(x-y)^T \nabla^2 f(y) (x-y) + \dots$, where $\nabla^2 f(x)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ is the Hessian matrix. $f$ is convex iff its Hessian is positive semidefinite, i.e. $y^T \nabla^2 f(x)\, y \ge 0$ for all $x, y$.
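
The Hessian criterion can be checked numerically. A minimal sketch, assuming NumPy and a finite-difference approximation (the test function and step size h are illustrative):

```python
import numpy as np

def hessian_fd(f, x, h=1e-4):
    """Finite-difference estimate of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i], np.eye(n)[j]
            H[i, j] = (f(x + h*e_i + h*e_j) - f(x + h*e_i)
                       - f(x + h*e_j) + f(x)) / h**2
    return H

# Example: f(x) = log(sum_i exp(x_i)) is convex, so the Hessian should be p.s.d.
f = lambda x: np.log(np.sum(np.exp(x)))
H = hessian_fd(f, np.array([0.3, -1.2, 2.0]))
print(np.linalg.eigvalsh(H) >= -1e-6)  # all True => p.s.d. at this point
```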

4 Examples of convex functions
The $\ell_p$-norm is convex for $1 \le p \le \infty$: $\|\lambda x + (1-\lambda)y\|_p \le \|\lambda x\|_p + \|(1-\lambda)y\|_p = \lambda\|x\|_p + (1-\lambda)\|y\|_p$. $f(x) = \log(e^{x_1} + e^{x_2} + \dots + e^{x_n})$ is convex and satisfies $\max(x_1, \dots, x_n) \le f(x) \le \max(x_1, \dots, x_n) + \log n$. $f(x) = x^T A x$ where $A$ is a p.s.d. matrix: $\nabla^2 f = 2A \succeq 0$. Examples of constrained convex optimization: (linear equations with a p.s.d. matrix) minimize $\frac{1}{2}x^T A x - b^T x$ (the solution satisfies $Ax = b$); (least squares regression) minimize $\|Ax - b\|_2^2 = x^T A^T A x - 2(Ax)^T b + b^T b$.
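
A quick numerical sanity check of the log-sum-exp bounds above, assuming NumPy (the test vector is arbitrary):

```python
import numpy as np

x = np.array([1.0, -3.0, 2.5, 0.7])
lse = np.log(np.sum(np.exp(x)))                    # f(x) = log(e^{x_1} + ... + e^{x_n})
print(x.max() <= lse <= x.max() + np.log(len(x)))  # True: max <= f(x) <= max + log n
```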

5 Constrained Convex Optimization
General formulation for a convex function $f$ and a convex set $K$: minimize $f(x)$ subject to $x \in K$. Example (SVMs): data $X_1, \dots, X_N \in \mathbb{R}^n$ labeled by $y_1, \dots, y_N \in \{-1, 1\}$ (spam / non-spam). Find a linear model $W$: $W \cdot X_i \ge 1 \Rightarrow X_i$ is spam, $W \cdot X_i \le -1 \Rightarrow X_i$ is non-spam, i.e. $\forall i:\ 1 - y_i\, W \cdot X_i \le 0$. More robust version: minimize $\sum_i \mathrm{Loss}(1 - y_i\, W \cdot X_i) + \lambda \|W\|_2^2$, e.g. with the hinge loss $\mathrm{Loss}(t) = \max(0, t)$. Another regularizer: $\lambda \|W\|_1$ (favors sparse solutions).
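
To make the objective concrete, here is a minimal NumPy sketch that evaluates the regularized hinge-loss objective for a given weight vector W (the data, weight vector, and the choice lam=0.1 are illustrative, not from the slides):

```python
import numpy as np

def svm_objective(W, X, y, lam=0.1):
    """sum_i max(0, 1 - y_i * W.X_i) + lam * ||W||_2^2 (hinge loss + L2 regularizer)."""
    margins = 1 - y * (X @ W)            # 1 - y_i * (W . X_i)
    return np.sum(np.maximum(0.0, margins)) + lam * np.dot(W, W)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # N = 100 examples in R^5
y = np.sign(rng.normal(size=100))        # labels in {-1, +1}
print(svm_objective(rng.normal(size=5), X, y))
```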

6 Gradient Descent for Constrained Convex Optimization
(Projection): given $x \notin K$, map it to $y \in K$ with $y = \mathrm{argmin}_{z \in K} \|z - x\|_2$. Easy to compute for the unit $\ell_2$ ball: $y = x / \|x\|_2$. Let $\|\nabla f(x)\|_2 \le G$, $\max_{x,y \in K} \|x - y\|_2 \le D$, and let $T = \frac{4 D^2 G^2}{\epsilon^2}$. Gradient descent (gradient + projection oracles):
Let $\eta = D / (G\sqrt{T})$.
Repeat for $i = 0, \dots, T$:
$\quad y^{(i+1)} = x^{(i)} - \eta \nabla f(x^{(i)})$
$\quad x^{(i+1)} =$ projection of $y^{(i+1)}$ onto $K$
Output $z = \frac{1}{T}\sum_i x^{(i)}$.
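
A minimal sketch of this loop, assuming NumPy and taking K to be the unit l2 ball so that projection is just rescaling (the quadratic test function and the bounds D, G in the example are illustrative):

```python
import numpy as np

def project_l2_ball(x):
    """Projection onto the unit l2 ball: rescale only if ||x||_2 > 1."""
    norm = np.linalg.norm(x)
    return x / norm if norm > 1 else x

def projected_gd(grad, x0, D, G, eps):
    T = int(np.ceil(4 * D**2 * G**2 / eps**2))   # T = 4 D^2 G^2 / eps^2
    eta = D / (G * np.sqrt(T))                   # eta = D / (G sqrt(T))
    x = x0
    avg = np.zeros_like(x0)
    for _ in range(T):
        y = x - eta * grad(x)                    # gradient step
        x = project_l2_ball(y)                   # projection step
        avg += x
    return avg / T                               # z = (1/T) sum_i x^(i)

# Example: minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball
c = np.array([2.0, 0.0])
z = projected_gd(lambda x: 2 * (x - c), np.zeros(2), D=2.0, G=6.0, eps=0.1)
print(z)   # close to the boundary point (1, 0)
```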

7 Gradient Descent for Constrained Convex Optimization
π‘₯ 𝑖+1 βˆ’ π‘₯ βˆ— ≀ 𝑦 𝑖+1 βˆ’ π‘₯ βˆ— = π‘₯ 𝑖 βˆ’ π‘₯ βˆ— βˆ’πœ‚π›»π‘“ π‘₯ 𝑖 = π‘₯ 𝑖 βˆ’ π‘₯ βˆ— πœ‚ 2 𝛻𝑓 π‘₯ 𝑖 βˆ’2πœ‚π›»π‘“ π‘₯ 𝑖 β‹… π‘₯ 𝑖 βˆ’ π‘₯ βˆ— Using definition of 𝐺: 𝛻𝑓 π‘₯ 𝑖 β‹… π‘₯ 𝑖 βˆ’ π‘₯ βˆ— ≀ 1 2πœ‚ π‘₯ 𝑖 βˆ’ π‘₯ βˆ— βˆ’ π‘₯ 𝑖+1 βˆ’ π‘₯ βˆ— πœ‚ 2 𝐺 2 𝑓 π‘₯ 𝑖 βˆ’π‘“ π‘₯ βˆ— ≀ 1 2πœ‚ π‘₯ 𝑖 βˆ’ π‘₯ βˆ— βˆ’ π‘₯ 𝑖+1 βˆ’ π‘₯ βˆ— πœ‚ 2 𝐺 2 Sum over 𝑖=1,…, 𝑇: 𝑖=1 𝑇 𝑓 π‘₯ 𝑖 βˆ’π‘“ π‘₯ βˆ— ≀ 1 2πœ‚ π‘₯ 0 βˆ’ π‘₯ βˆ— βˆ’ π‘₯ 𝑇 βˆ’ π‘₯ βˆ— π‘‡πœ‚ 2 𝐺 2

8 Gradient Descent for Constrained Convex Optimization
$\sum_{i=1}^{T}\left(f(x^{(i)}) - f(x^*)\right) \le \frac{1}{2\eta}\left(\|x^{(0)} - x^*\|^2 - \|x^{(T)} - x^*\|^2\right) + \frac{T\eta}{2} G^2 \le \frac{D^2}{2\eta} + \frac{T\eta}{2} G^2$. Since $f\!\left(\frac{1}{T}\sum_i x^{(i)}\right) \le \frac{1}{T}\sum_i f(x^{(i)})$: $f\!\left(\frac{1}{T}\sum_i x^{(i)}\right) - f(x^*) \le \frac{D^2}{2\eta T} + \frac{\eta}{2} G^2$. Set $\eta = \frac{D}{G\sqrt{T}}$ $\Rightarrow$ RHS $\le \frac{DG}{\sqrt{T}} \le \epsilon$.
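
Spelling out the last step (a worked check of the stated choices of $\eta$ and $T$, not on the original slide):

$$\frac{D^2}{2\eta T} + \frac{\eta G^2}{2} = \frac{DG}{2\sqrt{T}} + \frac{DG}{2\sqrt{T}} = \frac{DG}{\sqrt{T}}, \qquad T = \frac{4D^2G^2}{\epsilon^2} \;\Rightarrow\; \frac{DG}{\sqrt{T}} = \frac{\epsilon}{2} \le \epsilon.$$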

9 Online Gradient Descent
Gradient descent works in a more general case: replace $f$ by a sequence of convex functions $f_1, f_2, \dots, f_T$. At step $i$ we must output $x^{(i)} \in K$ before seeing $f_i$. Let $x^*$ be the minimizer of $\sum_i f_i(x)$ over $K$. The goal is to minimize the regret $\sum_i \left(f_i(x^{(i)}) - f_i(x^*)\right)$. The same analysis as before works in the online case.
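
A compact sketch of the online protocol, assuming NumPy; the gradient callbacks, step size, and the interval K = [-1, 1] in the toy usage are illustrative:

```python
import numpy as np

def online_gd(grads, x0, eta, project):
    """Online gradient descent: play x^(i), then step on the gradient of f_i at x^(i)."""
    x, plays = x0, []
    for grad_i in grads:                  # gradients of f_1, ..., f_T, revealed one by one
        plays.append(x)
        x = project(x - eta * grad_i(x))  # gradient step on the just-revealed function
    return plays

# Toy usage: f_i(x) = (x - c_i)^2 with shifting targets c_i, K = [-1, 1]
cs = [0.5, -0.3, 0.9, 0.1]
plays = online_gd([lambda x, c=c: 2 * (x - c) for c in cs],
                  x0=np.array(0.0), eta=0.1,
                  project=lambda x: np.clip(x, -1.0, 1.0))
print(plays)
```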

10 Stochastic Gradient Descent
(Expected gradient oracle): returns $g$ such that $\mathbb{E}[g] = \nabla f(x)$. Example: for the SVM objective, pick one term of the loss function at random. Let $g^{(i)}$ be the gradient returned at step $i$, and let $f_i(x) = (g^{(i)})^T x$ be the (linear) function used in the $i$-th step of OGD. Let $z = \frac{1}{T}\sum_i x^{(i)}$ and let $x^*$ be the minimizer of $f$ over $K$.
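
To make the oracle concrete for the SVM example, here is a sketch assuming NumPy: each step uses the gradient of one uniformly random term of the averaged regularized hinge loss, which is an unbiased estimate of the full gradient. The data, step size, and the omission of a projection step (K is all of R^d here) are illustrative simplifications:

```python
import numpy as np

def sgd_svm(X, y, steps=10_000, eta=0.01, lam=0.1, seed=0):
    """SGD on the averaged regularized hinge loss, one random term per step."""
    rng = np.random.default_rng(seed)
    n_samples, dim = X.shape
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(steps):
        i = rng.integers(n_samples)              # pick one term of the loss at random
        if 1 - y[i] * (X[i] @ w) > 0:            # hinge loss active for example i
            g = -y[i] * X[i] + 2 * lam * w       # gradient of that term (+ regularizer)
        else:
            g = 2 * lam * w
        w -= eta * g                             # E[g] = gradient of the averaged objective
        w_sum += w
    return w_sum / steps                         # averaged iterate z = (1/T) sum_i w^(i)

# Toy usage with linearly separable data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
print(sgd_svm(X, y))
```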

11 Stochastic Gradient Descent
Thm. $\mathbb{E}[f(z)] \le f(x^*) + \frac{DG}{\sqrt{T}}$, where $G$ is an upper bound on the norm of any gradient output by the oracle. Proof (expectations are over the oracle's randomness):
$f(z) - f(x^*) \le \frac{1}{T}\sum_i \left(f(x^{(i)}) - f(x^*)\right)$ (convexity)
$\le \frac{1}{T}\sum_i \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)$ (convexity)
$= \frac{1}{T}\sum_i \mathbb{E}\!\left[(g^{(i)})^T (x^{(i)} - x^*)\right]$ (gradient oracle)
$= \frac{1}{T}\sum_i \mathbb{E}\!\left[f_i(x^{(i)}) - f_i(x^*)\right] = \frac{1}{T}\,\mathbb{E}\!\left[\sum_i \left(f_i(x^{(i)}) - f_i(x^*)\right)\right]$.
The expectation is the regret of OGD, which is at most $DG\sqrt{T}$; dividing by $T$ gives $\frac{DG}{\sqrt{T}} \le \epsilon$.

12 VC-dim of combinations of concepts
For π’Œ concepts β„Ž 1 ,…, β„Ž π’Œ + a Boolean function 𝑓: π‘π‘œπ‘š 𝑏 𝑓 β„Ž 1 ,… β„Ž π’Œ ={π‘₯βˆˆπ‘‹:𝑓 β„Ž 1 π‘₯ ,… β„Ž π’Œ π‘₯ =1} Ex: 𝐻 = lin. separators, 𝑓=𝐴𝑁𝐷 / 𝑓=π‘€π‘Žπ‘—π‘œπ‘Ÿπ‘–π‘‘π‘¦ For a concept class 𝐻 + a Boolean function 𝑓: 𝐢𝑂𝑀 𝐡 𝑓,π’Œ 𝐻 ={π‘π‘œπ‘š 𝑏 𝑓 β„Ž 1 ,… β„Ž π’Œ : β„Ž 𝑖 ∈𝐻} Lem. If 𝑉𝐢-dim(𝐻)=𝒅 then for any 𝑓: 𝑉𝐢-dim 𝐢𝑂𝑀 𝐡 𝑓,π’Œ 𝐻 ≀𝑂(π’Œπ’… log(π’Œπ’…))

13 VC-dim of combinations of concepts
Lem. If VC-dim$(H) = d$ then for any $f$: VC-dim$(\mathrm{COMB}_{f,k}(H)) \le O(kd \log(kd))$. Proof: let $n = $ VC-dim$(\mathrm{COMB}_{f,k}(H))$, so there exists a set $S$ of $n$ points shattered by $\mathrm{COMB}_{f,k}(H)$. By Sauer's lemma there are at most $n^d$ ways of labeling $S$ by $H$. Each labeling of $S$ in $\mathrm{COMB}_{f,k}(H)$ is determined by $k$ labelings of $S$ by $H$, giving at most $(n^d)^k = n^{kd}$ labelings. Since $S$ is shattered, $2^n \le n^{kd} \Rightarrow n \le kd \log_2 n \Rightarrow n = O(kd \log(kd))$.
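
The last implication can be spelled out (a worked step using the elementary inequality $\log_2 n \le \frac{n}{2kd} + \log_2(2kd)$, valid for all $n > 0$):

$$n \le kd \log_2 n \le \frac{n}{2} + kd\log_2(2kd) \;\Rightarrow\; n \le 2kd\log_2(2kd) = O(kd\log(kd)).$$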

14 Back to the batch setting
Classification problem. Instance space $X$: $\{0,1\}^d$ or $\mathbb{R}^d$ (feature vectors). Classification: come up with a mapping $X \to \{0,1\}$. Formalization: assume there is a probability distribution $D$ over $X$, and let $c^*$ be the "target concept" (the set $c^* \subseteq X$ of positive instances). Given labeled i.i.d. samples from $D$, produce a hypothesis $h \subseteq X$. Goal: have $h$ agree with $c^*$ over the distribution $D$, i.e. minimize $\mathrm{err}_D(h) = \Pr_D[h\, \Delta\, c^*]$, the "true" or "generalization" error.

15 Boosting
Strong learner: succeeds with probability $\ge 1 - \epsilon$. Weak learner: succeeds with probability $\ge \frac{1}{2} + \gamma$. Boosting (informal): a weak learner that works under any distribution $\Rightarrow$ a strong learner. Idea: run the weak learner $A$ on sample $S$ under reweightings that focus on misclassified examples.

16 Boosting (cont.)
$H$ = class of hypotheses produced by $A$. Apply the majority rule to $h_1, \dots, h_{t_0} \in H$: VC-dim $\le O\!\left(t_0\, \text{VC-dim}(H) \log\left(t_0\, \text{VC-dim}(H)\right)\right)$. Algorithm:
Given $S = (x_1, \dots, x_n)$, set $w_i = 1$ in $w = (w_1, \dots, w_n)$.
For $t = 1, \dots, t_0$: call the weak learner on $(S, w)$ $\Rightarrow$ hypothesis $h_t$; for each misclassified $x_i$, multiply $w_i$ by $\alpha = \left(\tfrac{1}{2} + \gamma\right)/\left(\tfrac{1}{2} - \gamma\right)$.
Output: $\mathrm{MAJ}(h_1, \dots, h_{t_0})$.
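
A minimal sketch of this loop, assuming NumPy, labels in {-1, +1}, and a weak_learner(S, y, w) callable that returns a hypothesis h mapping an example to a label; all names are illustrative:

```python
import numpy as np

def boost(S, y, weak_learner, gamma, t0):
    """Boosting by reweighting: multiply weights of misclassified points by alpha."""
    n = len(S)
    w = np.ones(n)                                   # w_i = 1 initially
    alpha = (0.5 + gamma) / (0.5 - gamma)            # alpha = (1/2 + gamma) / (1/2 - gamma)
    hyps = []
    for _ in range(t0):
        h = weak_learner(S, y, w)                    # gamma-weak learner on (S, w)
        hyps.append(h)
        preds = np.array([h(x) for x in S])
        w[preds != y] *= alpha                       # upweight misclassified examples
    def majority(x):                                 # final classifier MAJ(h_1, ..., h_t0)
        votes = sum(h(x) for h in hyps)
        return 1 if votes >= 0 else -1
    return majority
```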

17 Boosting: analysis
Def ($\gamma$-weak learner on a sample): for labeled examples $x_i$ weighted by $w_i$, the learner returns a hypothesis whose correctly classified examples have total weight $\ge \left(\frac{1}{2} + \gamma\right)\sum_{i=1}^n w_i$. Thm. If $A$ is a $\gamma$-weak learner on $S$, then for $t_0 = O\!\left(\frac{1}{\gamma^2}\log n\right)$ boosting achieves 0 error on $S$. Proof: let $m$ = # mistakes of the final classifier. Each mistaken example was misclassified by $\ge \frac{t_0}{2}$ of the weak hypotheses $\Rightarrow$ its final weight is $\ge \alpha^{t_0/2}$, so the total weight is $\ge m\, \alpha^{t_0/2}$. Let $W(t)$ be the total weight at step $t$. Then $W(t+1) \le \left(\alpha\left(\tfrac{1}{2} - \gamma\right) + \left(\tfrac{1}{2} + \gamma\right)\right) W(t) = (1 + 2\gamma)\, W(t)$.

18 Boosting: analysis (cont.)
π‘Š 0 =π‘›β‡’π‘Š 𝒕 𝟎 ≀𝑛 1+2𝛾 𝒕 𝟎 π‘š 𝛼 𝒕 𝟎 /2 β‰€π‘Š 𝒕 𝟎 ≀𝑛 1+2𝛾 𝒕 𝟎 𝛼=( 𝛾)/( 1 2 βˆ’π›Ύ) =(1+2𝛾)/(1βˆ’2𝛾) π‘šβ‰€π‘› 1βˆ’2𝛾 𝒕 𝟎 / 𝛾 𝒕 𝟎 /2 = 𝑛 1βˆ’4 𝛾 2 𝒕 𝟎 /2 1βˆ’π‘₯≀ 𝑒 βˆ’π‘₯ β‡’ π‘šβ‰€π‘› 𝑒 βˆ’2 𝛾 2 𝒕 𝟎 β‡’ 𝒕 𝟎 =𝑂 1 𝛾 2 log 𝑛 Comments: Applies even if the weak learners are adversarial VC-dim bounds ⇒𝑛= 𝑂 1 πœ– π‘‰πΆβˆ’dim⁑(𝐻) 𝛾 2

