Backpropagation and Neural Nets


1 Backpropagation and Neural Nets
EECS 442 – Prof. David Fouhey, Winter 2019, University of Michigan

2 So Far: Linear Models
$L(\mathbf{w}) = \lambda \|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$
Example: find the w minimizing squared error over the data. Each datapoint is represented by some vector x. We can find the optimal w with a ~10-line derivation.

3 Last Class
$L(\mathbf{w}) = \lambda \|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i; \mathbf{w}))$
What about an arbitrary loss function L? What about an arbitrary parametric function f? Solution: take the gradient and do gradient descent: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha \nabla_{\mathbf{w}} L(f(\mathbf{w}_i))$. What if L(f(w)) is complicated? Today!

4 Taking the Gradient – Review
$f(x) = (-x+3)^2$. Decompose: $f = q^2$, $q = r + 3$, $r = -x$, so $\frac{\partial f}{\partial q} = 2q$, $\frac{\partial q}{\partial r} = 1$, $\frac{\partial r}{\partial x} = -1$.
Chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial r} \frac{\partial r}{\partial x} = 2q \cdot 1 \cdot (-1) = -2(-x+3) = 2x-6$

5 Supplemental Reading Lectures can only introduce you to a topic
You will solidify your knowledge by doing. I highly recommend working through everything in the Stanford CS231n resources. These slides follow the general examples with a few modifications; the primary difference is that I define local variables n, m per block.

6 Let's Do This Another Way
Suppose we have a box representing a function f. This box does two things:
Forward: given forward input n, compute f(n).
Backward: given backward input g, return g·(∂f/∂n).
[Diagram: n enters box f and f(n) comes out; the backward input g flows back out as g(∂f/∂n).]
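A minimal sketch of this box idea in Python; the class names and structure are illustrative, not from any particular library. Each box stores whatever forward input it needs so the backward pass can multiply the incoming gradient by the local derivative evaluated there.

    class Negate:                 # f(n) = -n
        def forward(self, n):
            return -n
        def backward(self, g):
            return g * -1.0       # local gradient is -1

    class AddConst:               # f(n) = n + c
        def __init__(self, c):
            self.c = c
        def forward(self, n):
            return n + self.c
        def backward(self, g):
            return g * 1.0        # local gradient is 1

    class Square:                 # f(n) = n^2
        def forward(self, n):
            self.n = n            # remember the forward input
            return n * n
        def backward(self, g):
            return g * 2 * self.n # local gradient is 2n

    # f(x) = (-x+3)^2 as a chain of boxes, evaluated at x = 2
    neg, add3, sq = Negate(), AddConst(3.0), Square()
    x = 2.0
    out = sq.forward(add3.forward(neg.forward(x)))       # forward: 1.0
    dx = neg.backward(add3.backward(sq.backward(1.0)))   # backward: 2x - 6 = -2.0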

7 Let's Do This Another Way
$f(x) = (-x+3)^2$ as a chain of boxes: x → [-n] → -x → [n+3] → -x+3 → [n^2] → (-x+3)^2.
Backward through n^2, starting from gradient 1 at the output: $\frac{\partial}{\partial n} n^2 = 2n = 2(-x+3) = -2x+6$, so send back $(-2x+6) \cdot 1$.

8 Let's Do This Another Way
Backward through n+3: $\frac{\partial}{\partial n}(n+3) = 1$, so send back $1 \cdot (-2x+6) = -2x+6$.

9 Let's Do This Another Way
Backward through -n: $\frac{\partial}{\partial n}(-n) = -1$, so send back $-1 \cdot (-2x+6) = 2x-6$.

10 Let's Do This Another Way
The gradient arriving at x is $2x-6$, the same answer as the chain-rule derivation.

11 Two Inputs
Given two inputs, just have two input/output wires. Forward: the same. Backward: the same; send gradients with respect to each variable.
[Diagram: n and m enter box f, which outputs f(n,m); backward input g returns g(∂f/∂n) on the n wire and g(∂f/∂m) on the m wire.]

12 f(x,y,z) = (x+y)z
[Diagram: x and y enter an [n+m] block producing x+y; x+y and z enter an [n*m] block producing (x+y)z.]
Example Credit: Karpathy and Fei-Fei

13 f(x,y,z) = (x+y)z
Multiplication swaps inputs and multiplies the gradient. Backward through n*m with incoming gradient 1: $\frac{\partial}{\partial n}(nm) = m$, so the x+y wire gets $z \cdot 1$; $\frac{\partial}{\partial m}(nm) = n$, so the z wire gets $(x+y) \cdot 1$.
Example Credit: Karpathy and Fei-Fei

14 f(x,y,z) = (x+y)z
Addition sends the gradient through unchanged. Backward through n+m with incoming gradient $z \cdot 1$: $\frac{\partial}{\partial n}(n+m) = 1$ and $\frac{\partial}{\partial m}(n+m) = 1$, so both x and y get $1 \cdot z \cdot 1$.
Example Credit: Karpathy and Fei-Fei

15 f(x,y,z) = (x+y)z
Putting it together: $\frac{\partial (x+y)z}{\partial x} = z$, $\frac{\partial (x+y)z}{\partial y} = z$, $\frac{\partial (x+y)z}{\partial z} = x+y$.
Example Credit: Karpathy and Fei-Fei

16 Once More, With Numbers!

17 f(x,y,z) = (x+y)z
Forward pass with x=1, y=4, z=10: the n+m block outputs 5, the n*m block outputs 50.
Example Credit: Karpathy and Fei-Fei

18 f(x,y,z) = (x+y)z
Backward through n*m with incoming gradient 1: $\frac{\partial}{\partial n}(nm) = m \rightarrow 10 \cdot 1 = 10$ on the x+y wire; $\frac{\partial}{\partial m}(nm) = n \rightarrow 5 \cdot 1 = 5$ on the z wire.
Example Credit: Karpathy and Fei-Fei

19 f(x,y,z) = (x+y)z
Backward through n+m with incoming gradient 10: $\frac{\partial}{\partial n}(n+m) = 1 \rightarrow 1 \cdot 10 \cdot 1 = 10$ on the x wire, and $\frac{\partial}{\partial m}(n+m) = 1 \rightarrow 1 \cdot 10 \cdot 1 = 10$ on the y wire. Final gradients: x gets 10, y gets 10, z gets 5.
Example Credit: Karpathy and Fei-Fei
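A quick sketch in plain Python checking the same forward and backward arithmetic; the variable names simply mirror the wires in the diagram.

    # forward pass with x=1, y=4, z=10
    x, y, z = 1.0, 4.0, 10.0
    q = x + y            # n+m block: 5
    f = q * z            # n*m block: 50

    # backward pass, starting from df/df = 1
    df = 1.0
    dq = z * df          # multiply swaps inputs: 10
    dz = q * df          # 5
    dx = 1.0 * dq        # addition passes the gradient through: 10
    dy = 1.0 * dq        # 10
    print(dx, dy, dz)    # 10.0 10.0 5.0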

20 Think You've Got It?
$L(w) = (w-6)^2$
We want to fit a model w that will just equal 6. World's most basic linear model / neural net: no inputs, just a constant output.

21 I'll Need a Few Volunteers
$L(w) = (w-6)^2$, computed as an n-6 block followed by an n^2 block.
Job #1 (n-6): Forward: compute n-6. Backward: multiply the incoming gradient by 1.
Job #2 (n^2): Forward: compute n^2. Backward: multiply the incoming gradient by 2n.
Job #3: Backward: write down 1.

22 Preemptively
The diagrams look complex, but that's only because we're covering the details together.

23 Something More Complex
$f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$
[Diagram: w0·x0 and w1·x1 are computed by n*m blocks, summed with w2 by n+m blocks, then passed through n*-1 → e^n → n+1 → 1/n.]
Example Credit: Karpathy and Fei-Fei

24 𝑓 π’˜,𝒙 = 𝑒 βˆ’( 𝑀 0 π‘₯ 0 + 𝑀 1 π‘₯ 1 + 𝑀 2 ) w0 2 * -2 x0 -1 + 4 w1 -3 * 6 + 1 -1 0.37 1.37 0.73 *-1 en +1 n-1 x1 -2 w2 -3

25 Local gradients used below, labeled a-f: (a) $\frac{\partial}{\partial n}(m+n)=1$, (b) $\frac{\partial}{\partial n}(mn)=m$, (c) $\frac{\partial}{\partial n}e^n=e^n$, (d) $\frac{\partial}{\partial n}n^{-1}=-n^{-2}$, (e) $\frac{\partial}{\partial n}(an)=a$, (f) $\frac{\partial}{\partial n}(c+n)=1$.
Example Credit: Karpathy and Fei-Fei

26 Start the backward pass: write gradient 1 at the output (forward value 0.73).
Example Credit: Karpathy and Fei-Fei

27 Backward through the 1/n block: local gradient $-n^{-2}$ at n = 1.37, times incoming gradient 1: $-(1.37)^{-2} \cdot 1 = -0.53$. Where does the 1.37 come from? It is the value that flowed into the 1/n block during the forward pass.
Example Credit: Karpathy and Fei-Fei

28 Backward through the +1 block: local gradient 1, so $1 \cdot (-0.53) = -0.53$.
Example Credit: Karpathy and Fei-Fei

29 Backward through the e^n block: local gradient $e^n$ at n = -1, so $e^{-1} \cdot (-0.53) = -0.2$.
Example Credit: Karpathy and Fei-Fei

30 Backward through the *-1 block: local gradient -1, so $-1 \cdot (-0.2) = 0.2$.
Example Credit: Karpathy and Fei-Fei

31 Backward through the + block that adds w2: local gradient 1 on each input, so $1 \cdot 0.2 = 0.2$ gets sent back in both directions.
Example Credit: Karpathy and Fei-Fei

32 The 0.2 therefore lands on w2 and on the output of the earlier + block.
Example Credit: Karpathy and Fei-Fei

33 The earlier + block also passes 0.2 to both of its inputs. Backward through the w0·x0 multiply: w0 gets $x_0 \cdot 0.2 = -1 \cdot 0.2 = -0.2$ and x0 gets $w_0 \cdot 0.2 = 2 \cdot 0.2 = 0.4$.
Example Credit: Karpathy and Fei-Fei

34 Backward through the w1·x1 multiply: w1 gets $x_1 \cdot 0.2 = -2 \cdot 0.2 = -0.4$ and x1 gets $w_1 \cdot 0.2 = -3 \cdot 0.2 = -0.6$.
Example Credit: Karpathy and Fei-Fei

35 All the gradients are now filled in: w0: -0.2, x0: 0.4, w1: -0.4, x1: -0.6, w2: 0.2.
Example Credit: Karpathy and Fei-Fei

36 PHEW!
$f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$
Final gradients: w0: -0.2, x0: 0.4, w1: -0.4, x1: -0.6, w2: 0.2.
Example Credit: Karpathy and Fei-Fei

37 Summary
Each block computes the backward input (g) times the local gradient (∂f/∂xi) at the evaluation point.
[Diagram: inputs x1, ..., xn enter f, which outputs f(x1, ..., xn); backward input g returns g(∂f/∂xi) on each input wire.]

38 Multiple Outputs Flowing Back
Gradients from different backward passes sum up: if xi feeds K outputs f1, ..., fK with backward inputs g1, ..., gK, the gradient on xi is $\sum_{j=1}^{K} g_j \frac{\partial f_j}{\partial x_i}$.

39 Multiple Outputs Flowing Back
𝑓 π‘₯ = βˆ’π‘₯+3 2 (-x+3)2 x -x -x+3 -n n+3 m*n -x+3 -x+3 x-3 1 x x -n n+3

40 Multiple Outputs Flowing Back
𝑓 π‘₯ = βˆ’π‘₯+3 2 x = π‘₯βˆ’3 + π‘₯βˆ’3 πœ•π‘“ πœ•π‘₯ x-3 x =2π‘₯βˆ’6 x x-3

41 Does It Have To Be So Painful?
𝑓 π’˜,𝒙 = 𝑒 βˆ’( 𝑀 0 π‘₯ 0 + 𝑀 1 π‘₯ 1 + 𝑀 2 ) w0 n*m x0 n+m w1 𝜎 𝑛 = 𝑒 βˆ’π‘› n+m n*-1 en n+1 1/x x1 w2 Example Credit: Karpathy and Fei-Fei

42 Does It Have To Be So Painful?
$\sigma(n) = \frac{1}{1+e^{-n}}$
$\frac{\partial}{\partial n}\sigma(n) = \frac{e^{-n}}{(1+e^{-n})^2} = \frac{1+e^{-n}-1}{1+e^{-n}} \cdot \frac{1}{1+e^{-n}} = (1-\sigma(n))\,\sigma(n)$
For the curious, line 1 to line 2 is the chain rule through the blocks: $\frac{\partial}{\partial n}\sigma(n) = -(1+e^{-n})^{-2} \cdot 1 \cdot e^{-n} \cdot (-1)$, i.e. the derivatives of 1/n, n+1, e^n, and -n multiplied together.
Example Credit: Karpathy and Fei-Fei
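A short numeric check that the identity $\frac{\partial}{\partial n}\sigma(n) = (1-\sigma(n))\,\sigma(n)$ holds, using a finite difference at an arbitrary point:

    import math

    def sigmoid(n):
        return 1.0 / (1.0 + math.exp(-n))

    n = 0.7
    analytic = (1 - sigmoid(n)) * sigmoid(n)
    h = 1e-5
    numeric = (sigmoid(n + h) - sigmoid(n - h)) / (2 * h)
    print(analytic, numeric)   # the two values agree to ~1e-10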

43 Does It Have To Be So Painful?
𝑓 π’˜,𝒙 = 𝑒 βˆ’( 𝑀 0 π‘₯ 0 + 𝑀 1 π‘₯ 1 + 𝑀 2 ) w0 n*m x0 n+m w1 n+m Οƒ(n) x1 w2 𝜎 𝑛 = 𝑒 βˆ’π‘› πœ•πœŽ(𝑛) πœ•π‘› =(1βˆ’πœŽ 𝑛 )𝜎(𝑛) Example Credit: Karpathy and Fei-Fei

44 Does It Have To Be So Painful?
We can compute the gradient for any function. Pick your functions carefully: existing code is usually structured into sensible blocks.

45 Building Blocks
A neuron takes signals from other cells, processes them, and sends output to other cells.
Neuron diagram credit: Karpathy and Fei-Fei

46 Artificial Neuron
A weighted sum of other neuron outputs, plus a bias, passed through an activation function: compute $\sum_i w_i x_i + b$, then apply the activation to get $f\left(\sum_i w_i x_i + b\right)$.

47 Artificial Neuron
We can differentiate the whole thing, e.g., dNeuron/dx1. What can we now do?
[Diagram: w1*x1, w2*x2, w3*x3, and b feed a sum Σ, followed by an activation f.]

48 Artificial Neuron
Each artificial neuron is a linear model + an activation function f. We can find the w, b that minimize a loss function with gradient descent.

49 Artificial Neurons
Connect neurons to make a more complex function; use backprop to compute the gradient.

50 What's The Activation Function
Sigmoid: $s(x) = \frac{1}{1+e^{-x}}$. Nice interpretation; squashes things to (0,1); gradients are near zero when the neuron's input is very high or very low.

51 What's The Activation Function
ReLU (Rectified Linear Unit): $\max(0, x)$. Constant gradient; converges ~6x faster; if the neuron's input is negative, the gradient is zero. Be careful!

52 What's The Activation Function
Leaky ReLU: $f(x) = x$ for $x \ge 0$ and $0.01x$ for $x < 0$. Like ReLU, but allows a small gradient for negative values.
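For reference, a NumPy sketch of the three activations and the local gradients they contribute in a backward pass (the function names are just illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return (1.0 - s) * s                 # near zero when |x| is large

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        return (x > 0).astype(float)         # zero gradient for negative inputs

    def leaky_relu(x, slope=0.01):
        return np.where(x >= 0, x, slope * x)

    def leaky_relu_grad(x, slope=0.01):
        return np.where(x >= 0, 1.0, slope)  # small but nonzero for negative inputs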

53 Setting Up A Neural Net
[Diagram: an Input layer (x1, x2), a Hidden layer (h1-h4), and an Output layer (y1-y3).]

54 Setting Up A Neural Net
[Diagram: an Input layer (x1, x2), Hidden layer 1 (a1-a4), Hidden layer 2 (h1-h4), and an Output layer (y1-y3).]

55 Fully Connected Network
Each neuron connects to each neuron in the previous layer.
[Diagram: x1, x2 → a1-a4 → h1-h4, with every pair of adjacent-layer neurons connected.]

56 Fully Connected Network
Notation: $\mathbf{a}$ = all layer-a values; $\mathbf{w}_i, b_i$ = neuron i's weights and bias; $f$ = activation function. Then $h_i = f(\mathbf{w}_i^T \mathbf{a} + b_i)$. How do we do all the neurons at once?

57 Fully Connected Network
With the same notation, stack the weights into a matrix: $\mathbf{h} = f(\mathbf{W}\mathbf{a} + \mathbf{b})$, i.e.
$\begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \end{bmatrix} = f\left( \begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \mathbf{w}_3^T \\ \mathbf{w}_4^T \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} \right)$
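In code, "all the neurons at once" is a single matrix multiply; a NumPy sketch with a hypothetical 4-neuron layer and sigmoid activation:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    a = rng.normal(size=4)         # previous layer's outputs a1..a4
    W = rng.normal(size=(4, 4))    # row i holds neuron i's weights w_i^T
    b = rng.normal(size=4)         # biases b1..b4

    h = sigmoid(W @ a + b)         # h_i = f(w_i^T a + b_i), computed all at once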

58 Fully Connected Network
Define a new block, the "Linear Layer" (OK, technically it's affine): it takes input n and parameters W, b and computes $L(\mathbf{n}) = \mathbf{W}\mathbf{n} + \mathbf{b}$. You can get the gradient with respect to all of its inputs (do this on your own; useful trick: the answer has to come out as something you can write as a matrix multiply).
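Working that exercise out gives the backward rules sketched below for a single input vector; this is a hedged sketch rather than the slides' own notation (here W is m×n, n_in has length n, and the incoming gradient g has length m).

    import numpy as np

    def linear_forward(W, b, n_in):
        # L(n) = W n + b
        return W @ n_in + b

    def linear_backward(W, n_in, g):
        # g is the gradient flowing back into the layer's output
        d_n = W.T @ g              # gradient w.r.t. the input n
        d_W = np.outer(g, n_in)    # gradient w.r.t. the weights W
        d_b = g                    # gradient w.r.t. the bias b
        return d_n, d_W, d_b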

59 Fully Connected Network
[Diagram: x passes through three Linear layers (each with its own W, b), each followed by an activation f(n), producing the a and h layers.]

60 Fully Connected Network
What happens if we remove the activation functions?
[Diagram: x passes through three Linear layers (W, b) with no activations in between.]
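A short sketch of why that question matters: without the activations, the stack of linear layers collapses into a single affine map, so the network gains no expressive power (the random matrices below are just placeholders).

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=2)
    W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
    W2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)
    W3, b3 = rng.normal(size=(3, 4)), rng.normal(size=3)

    # three stacked linear layers, no activation functions
    out_stacked = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

    # one equivalent affine layer
    W = W3 @ W2 @ W1
    b = W3 @ W2 @ b1 + W3 @ b2 + b3
    out_single = W @ x + b

    print(np.allclose(out_stacked, out_single))   # True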

61 Demo Time

62

