1
Backpropagation and Neural Nets
EECS 442 - Prof. David Fouhey, Winter 2019, University of Michigan
2
So Far: Linear Models

$$L(W) = \lambda \|W\|_2^2 + \sum_{i=1}^{n} \left(y_i - W^T x_i\right)^2$$

Example: find w minimizing squared error over the data. Each datapoint is represented by some vector x. Can find the optimal w with a ~10-line derivation.
3
Last Class

$$L(W) = \lambda \|W\|_2^2 + \sum_{i=1}^{n} L\left(y_i, f(x_i; W)\right)$$

What about an arbitrary loss function L? What about an arbitrary parametric function f? Solution: take the gradient, do gradient descent:

$$w_{i+1} = w_i - \alpha \nabla_w L(f(w_i))$$

What if L(f(w)) is complicated? Today!
4
Taking the Gradient - Review
$$f(x) = (-x+3)^2$$

Decompose: $f = q^2$, $q = r + 3$, $r = -x$.

$$\frac{df}{dq} = 2q, \qquad \frac{dq}{dr} = 1, \qquad \frac{dr}{dx} = -1$$

Chain rule:

$$\frac{df}{dx} = \frac{df}{dq}\frac{dq}{dr}\frac{dr}{dx} = 2q \cdot 1 \cdot (-1) = -2(-x+3) = 2x-6$$
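Not on the original slide: a quick Python sketch checking this chain-rule result against a finite-difference estimate (the function names are my own).

```python
# Numeric check of df/dx = 2x - 6 for f(x) = (-x + 3)**2,
# comparing the chain-rule result with a central finite difference.
def f(x):
    return (-x + 3) ** 2

def analytic_grad(x):
    # df/dq * dq/dr * dr/dx = 2q * 1 * (-1), with q = -x + 3
    return 2 * (-x + 3) * 1 * (-1)

def numeric_grad(x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in [-2.0, 0.0, 1.5, 4.0]:
    print(x, analytic_grad(x), numeric_grad(x))  # both should equal 2x - 6
```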
5
Supplemental Reading

Lectures can only introduce you to a topic; you will solidify your knowledge by doing. I highly recommend working through everything in the Stanford CS231N resources. These slides follow the general examples with a few modifications; the primary difference is that I define local variables n, m per block.
6
Let's Do This Another Way
Suppose we have a box representing a function f. This box does two things.
Forward: given forward input n, compute f(n).
Backwards: given backwards input g, return g·(df/dn).
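Not from the slides: a minimal Python sketch of such a box, assuming a simple class with forward/backward methods (the names Box and Square are illustrative).

```python
class Box:
    """A function block: forward computes f(n); backward returns g * df/dn."""
    def forward(self, n):
        raise NotImplementedError
    def backward(self, g):
        raise NotImplementedError

class Square(Box):               # f(n) = n**2
    def forward(self, n):
        self.n = n               # cache the forward input for the backward pass
        return n ** 2
    def backward(self, g):
        return g * 2 * self.n    # g * df/dn

sq = Square()
print(sq.forward(3.0))   # 9.0
print(sq.backward(1.0))  # 6.0 = 1 * 2n
```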
7
Let's Do This Another Way

$f(x) = (-x+3)^2$ as a chain of blocks: x → [-n] → [n+3] → [n²] → output.

Backward pass, step 1: write down 1 at the output. The n² block multiplies by its local gradient, $\frac{d}{dn} n^2 = 2n = 2(-x+3) = -2x+6$, and passes $1 \cdot (-2x+6) = -2x+6$ backward.
8
Let's Do This Another Way

Step 2: the n+3 block has local gradient $\frac{d}{dn}(n+3) = 1$, so it passes $(-2x+6) \cdot 1 = -2x+6$ backward.
9
Let's Do This Another Way

Step 3: the -n block has local gradient $\frac{d}{dn}(-n) = -1$, so it passes $(-2x+6) \cdot (-1) = 2x-6$ backward to x.
10
Let's Do This Another Way

Putting it together, the gradients along the chain are 1, $-2x+6$, $-2x+6$, and finally $\frac{df}{dx} = 2x-6$ at x, matching the chain-rule result.
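Not from the slides: a sketch chaining three such boxes for $f(x) = (-x+3)^2$ and running exactly this backward pass; the class names are my own.

```python
# Chain of blocks for f(x) = (-x + 3)**2: x -> [-n] -> [n + 3] -> [n**2].
# Each block caches its forward input and multiplies the incoming gradient
# by its local derivative on the way back.
class Negate:
    def forward(self, n):  self.n = n; return -n
    def backward(self, g): return g * -1.0

class AddThree:
    def forward(self, n):  self.n = n; return n + 3.0
    def backward(self, g): return g * 1.0

class Square:
    def forward(self, n):  self.n = n; return n ** 2
    def backward(self, g): return g * 2.0 * self.n

blocks = [Negate(), AddThree(), Square()]
x = 5.0
out = x
for b in blocks:
    out = b.forward(out)       # forward pass: 5 -> -5 -> -2 -> 4
g = 1.0
for b in reversed(blocks):
    g = b.backward(g)          # backward pass: 1 -> -4 -> -4 -> 4
print(out, g)                  # 4.0 and 2*x - 6 = 4.0
```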
11
Two Inputs

Given two inputs, just have two input/output wires. Forward: the same. Backwards: the same, send gradients with respect to each variable: g·(∂f/∂n) back along the n wire and g·(∂f/∂m) back along the m wire.
12
f(x,y,z) = (x+y)z as a graph of two blocks: x and y feed an [n+m] block producing x+y; that result and z feed an [n·m] block producing (x+y)z.
Example Credit: Karpathy and Fei-Fei
13
f(x,y,z) = (x+y)z

Backward through the multiply block, starting with gradient 1 at the output: $\frac{\partial}{\partial n}(nm) = m$, so the (x+y) wire receives $z \cdot 1$; $\frac{\partial}{\partial m}(nm) = n$, so the z wire receives $(x+y) \cdot 1$. Multiplication swaps inputs, multiplies gradient.

Example Credit: Karpathy and Fei-Fei
14
f(x,y,z) = (x+y)z

Backward through the addition block: $\frac{\partial}{\partial n}(n+m) = 1$ and $\frac{\partial}{\partial m}(n+m) = 1$, so x and y each receive $1 \cdot z \cdot 1$. Addition sends the gradient through unchanged.

Example Credit: Karpathy and Fei-Fei
15
f(x,y,z) = (x+y)z

Final gradients: $\frac{\partial f}{\partial x} = z$, $\frac{\partial f}{\partial y} = z$, $\frac{\partial f}{\partial z} = x+y$.

Example Credit: Karpathy and Fei-Fei
16
Once More, With Numbers!
17
f(x,y,z) = (x+y)z with x = 1, y = 4, z = 10. Forward: x+y = 5 through the [n+m] block, then (x+y)z = 50 through the [n·m] block.

Example Credit: Karpathy and Fei-Fei
18
f(x,y,z) = (x+y)z

Backward through the multiply block: $\frac{\partial}{\partial n}(nm) = m$ gives $10 \cdot 1 = 10$ on the (x+y) wire; $\frac{\partial}{\partial m}(nm) = n$ gives $5 \cdot 1 = 5$ on the z wire.

Example Credit: Karpathy and Fei-Fei
19
f(x,y,z) = (x+y)z

Backward through the addition block: $\frac{\partial}{\partial n}(n+m) = 1$, so x and y each receive $1 \cdot 10 \cdot 1 = 10$. Final gradients: 10 for x, 10 for y, 5 for z.

Example Credit: Karpathy and Fei-Fei
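Not on the slide: the same forward and backward computation written out as a few lines of Python, with the same numbers.

```python
# f(x, y, z) = (x + y) * z evaluated at x = 1, y = 4, z = 10.
x, y, z = 1.0, 4.0, 10.0

# forward
s = x + y          # 5
f = s * z          # 50

# backward, starting with df/df = 1
df = 1.0
ds = z * df        # multiply block: swap inputs -> 10
dz = s * df        # -> 5
dx = 1.0 * ds      # addition block passes the gradient through -> 10
dy = 1.0 * ds      # -> 10
print(f, dx, dy, dz)   # 50.0 10.0 10.0 5.0
```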
20
Think You've Got It?

$$L(w) = (w-6)^2$$

We want to fit a model w that will just equal 6. World's most basic linear model / neural net: no inputs, just a constant output.
21
I'll Need a Few Volunteers
$L(w) = (w-6)^2$ as a chain: w → [n-6] → [n²] → L.

Job #1 (n-6): Forward: compute n-6. Backwards: multiply the incoming gradient by 1.
Job #2 (n²): Forward: compute n². Backwards: multiply the incoming gradient by 2n.
Job #3: Backwards: write down 1 at the output.
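Not part of the slides: a sketch of gradient descent on this loss using the two blocks' backward rules; the learning rate and step count are arbitrary choices for illustration.

```python
# Gradient descent on L(w) = (w - 6)**2 using the block backward rules.
w = 0.0
lr = 0.1                       # learning rate (illustrative choice)
for step in range(50):
    # forward: w -> n - 6 -> n**2
    a = w - 6.0
    loss = a ** 2
    # backward: start with 1, multiply by 2n at the square block, then by 1
    grad = 1.0 * (2.0 * a) * 1.0
    w -= lr * grad
print(w, loss)                 # w approaches 6, loss approaches 0
```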
22
Preemptively: the diagrams look complex, but that's because we're covering the details together.
23
Something More Complex
$$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$

Graph: [w0, x0] → [n·m]; [w1, x1] → [n·m]; the two products → [n+m]; plus w2 → [n+m]; then [n·(-1)] → [eⁿ] → [n+1] → [1/n].

Example Credit: Karpathy and Fei-Fei
24
Forward pass with w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3: w0·x0 = -2 and w1·x1 = 6; their sum is 4; adding w2 gives 1; multiplying by -1 gives -1; eⁿ gives 0.37; adding 1 gives 1.37; 1/n gives 0.73.
25
$$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$

Local gradients used in the backward pass: (a) $\frac{d}{dn}(n+m) = 1$, (b) $\frac{d}{dn}(nm) = m$, (c) $\frac{d}{dn} e^n = e^n$, (d) $\frac{d}{dn} n^{-1} = -n^{-2}$, (e) $\frac{d}{dn}(an) = a$, (f) $\frac{d}{dn}(n+a) = 1$.

Example Credit: Karpathy and Fei-Fei
26
The backward pass starts by writing down gradient 1 at the output of the 1/n block.

Example Credit: Karpathy and Fei-Fei
27
Through the 1/n block: $-\frac{1}{n^2} \cdot 1 = -\frac{1}{1.37^2} \cdot 1 \approx -0.53$. Where does 1.37 come from? It is the value at that block's input, cached during the forward pass.

Example Credit: Karpathy and Fei-Fei
28
Through the n+1 block: $1 \cdot (-0.53) = -0.53$.

Example Credit: Karpathy and Fei-Fei
29
Through the eⁿ block: $e^n \cdot g = e^{-1} \cdot (-0.53) \approx -0.2$.

Example Credit: Karpathy and Fei-Fei
30
Through the ×(-1) block: $-1 \cdot (-0.2) = 0.2$.

Example Credit: Karpathy and Fei-Fei
31
Through the addition with w2: $1 \cdot 0.2 = 0.2$, and it gets sent back both directions.

Example Credit: Karpathy and Fei-Fei
32
So w2 receives gradient 0.2, and 0.2 also flows back into the addition that combines the two products.

Example Credit: Karpathy and Fei-Fei
33
That addition passes 0.2 to both product wires. Through the w0·x0 multiply: w0 receives $x_0 \cdot 0.2 = -1 \cdot 0.2 = -0.2$, and x0 receives $w_0 \cdot 0.2 = 2 \cdot 0.2 = 0.4$.

Example Credit: Karpathy and Fei-Fei
34
Through the w1·x1 multiply: w1 receives $x_1 \cdot 0.2 = -2 \cdot 0.2 = -0.4$, and x1 receives $w_1 \cdot 0.2 = -3 \cdot 0.2 = -0.6$.

Example Credit: Karpathy and Fei-Fei
35
All gradients: $\frac{\partial f}{\partial w_0} = -0.2$, $\frac{\partial f}{\partial x_0} = 0.4$, $\frac{\partial f}{\partial w_1} = -0.4$, $\frac{\partial f}{\partial x_1} = -0.6$, $\frac{\partial f}{\partial w_2} = 0.2$.

Example Credit: Karpathy and Fei-Fei
36
PHEW! That is the complete backward pass for $f(w,x) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$, one block at a time.

Example Credit: Karpathy and Fei-Fei
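Not on the slides: a sketch of this circuit's forward and backward pass in Python with the same numbers (the slide values are rounded).

```python
import math

# Forward/backward for f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# forward pass, keeping every intermediate value
a = w0 * x0            # -2
b = w1 * x1            #  6
c = a + b              #  4
d = c + w2             #  1
e = -1.0 * d           # -1
f = math.exp(e)        #  0.37
g = f + 1.0            #  1.37
out = 1.0 / g          #  0.73

# backward pass, multiplying by each block's local gradient
dout = 1.0
dg = -1.0 / g**2 * dout      # 1/n block  -> about -0.53
df = 1.0 * dg                # n+1 block  -> about -0.53
de = math.exp(e) * df        # e^n block  -> about -0.2
dd = -1.0 * de               # *(-1)      -> about  0.2
dc = 1.0 * dd                # additions pass gradients through
dw2 = 1.0 * dd
da = 1.0 * dc
db = 1.0 * dc
dw0, dx0 = x0 * da, w0 * da  # multiply blocks swap inputs
dw1, dx1 = x1 * db, w1 * db
print(round(dw0, 1), round(dx0, 1), round(dw1, 1),
      round(dx1, 1), round(dw2, 1))   # -0.2 0.4 -0.4 -0.6 0.2
```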
37
Summary

Each block computes backwards output = g × local gradient ($\partial f / \partial x_i$) at the evaluation point from the forward pass, for every input $x_i$ of $f(x_1, \ldots, x_n)$.
38
Multiple Outputs Flowing Back

Gradients arriving from different backwards passes sum up: if the block's output feeds $f_1, \ldots, f_K$, then input $x_i$ receives

$$\sum_{j=1}^{K} g_j \frac{\partial f_j}{\partial x_i}$$
39
Multiple Outputs Flowing Back

$f(x) = (-x+3)^2$ built with a multiply instead of a square: x feeds two identical branches of [-n] → [n+3], and the two (-x+3) wires feed an [m·n] block. Backward: the multiply sends each wire the other wire's value, $-x+3$; each branch's [n+3] passes it through and [-n] negates it, so $x-3$ arrives at x from both branches.
40
Multiple Outputs Flowing Back

The two gradients arriving at x sum:

$$\frac{df}{dx} = (x-3) + (x-3) = 2x-6$$
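Not from the slides: a small Python sketch of this fan-out version, showing the two branch gradients being summed at x.

```python
# f(x) = (-x + 3)**2 written as a product of two identical branches.
# The gradients flowing back along both branches are summed at x.
def forward_backward(x):
    # forward: each branch computes -x + 3, then the branches are multiplied
    a = -x + 3.0
    b = -x + 3.0
    f = a * b
    # backward from f with gradient 1
    da = b * 1.0                      # multiply block swaps inputs
    db = a * 1.0
    dx = (-1.0 * da) + (-1.0 * db)    # each branch negates; contributions sum at x
    return f, dx

print(forward_backward(5.0))   # (4.0, 4.0); 2*5 - 6 = 4
```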
41
Does It Have To Be So Painful?
$f(w,x) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$. The tail of the graph, [×(-1)] → [eⁿ] → [n+1] → [1/n], is exactly the sigmoid $\sigma(n) = \frac{1}{1+e^{-n}}$, so it can be treated as a single block.

Example Credit: Karpathy and Fei-Fei
42
Does It Have To Be So Painful?
$$\sigma(n) = \frac{1}{1+e^{-n}}$$

$$\frac{d}{dn}\sigma(n) = -(1+e^{-n})^{-2} \cdot 1 \cdot e^{-n} \cdot (-1) = \frac{e^{-n}}{(1+e^{-n})^2}$$

(chain rule through the 1/n, n+1, eⁿ, and -n blocks)

$$= \frac{(1+e^{-n}) - 1}{1+e^{-n}} \cdot \frac{1}{1+e^{-n}} = \left(1-\sigma(n)\right)\sigma(n)$$

The algebraic rewriting in the last step is for the curious.

Example Credit: Karpathy and Fei-Fei
43
Does It Have To Be So Painful?
$$f(w,x) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$

Same graph with the sigmoid collapsed into one block: [w0, x0] → [n·m], [w1, x1] → [n·m], the products and w2 combined by [n+m] blocks, then a single [σ(n)] block with

$$\sigma(n) = \frac{1}{1+e^{-n}}, \qquad \frac{d\sigma(n)}{dn} = \left(1-\sigma(n)\right)\sigma(n)$$

Example Credit: Karpathy and Fei-Fei
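Not on the slide: a sketch of the sigmoid as a single forward/backward block, assuming it caches its output so the backward pass can reuse it.

```python
import math

# Sigmoid as one block: the backward pass uses d/dn sigma(n) = (1 - sigma) * sigma.
class Sigmoid:
    def forward(self, n):
        self.out = 1.0 / (1.0 + math.exp(-n))
        return self.out
    def backward(self, g):
        return g * (1.0 - self.out) * self.out

sig = Sigmoid()
print(sig.forward(1.0))    # 0.731..., the 0.73 from the worked example
print(sig.backward(1.0))   # 0.196..., the 0.2 from the worked example
```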
44
Does It Have To Be So Painful?
Can compute a backward rule for any function. Pick your functions carefully: existing code is usually structured into sensible blocks.
45
Building Blocks

A neuron takes signals from other cells, processes them, and sends the result out: input from other cells, output to other cells.

Neuron diagram credit: Karpathy and Fei-Fei
46
Artificial Neuron

A weighted sum of other neuron outputs plus a bias, passed through an activation function f:

$$f\left(\sum_i w_i x_i + b\right)$$
47
Artificial Neuron

Can differentiate the whole thing, e.g., dNeuron/dx1. What can we now do? Diagram: w1·x1, w2·x2, w3·x3, and b feed a sum Σ, which feeds the activation f.
48
Artificial Neuron

Each artificial neuron is a linear model plus an activation function f. Can find w, b that minimize a loss function with gradient descent.
49
Artificial Neurons

Connect neurons to make a more complex function; use backprop to compute the gradient.
50
What's The Activation Function?

Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$. Nice interpretation; squashes things to (0, 1). Gradients are near zero when the neuron's output is very high or very low.
51
What's The Activation Function?

ReLU (Rectified Linear Unit): $\max(0, x)$. Constant gradient; converges ~6x faster. If the neuron is negative, the gradient is zero. Be careful!
52
What's The Activation Function?

Leaky ReLU: $f(x) = x$ for $x \ge 0$, $0.01x$ for $x < 0$. Like ReLU, but allows some small gradient for negative values.
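Not from the slides: a sketch of these three activations and their local gradients as plain scalar Python functions.

```python
import math

# Each function returns (value, local gradient d/dx) for a scalar input.
def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s, s * (1.0 - s)

def relu(x):
    return max(0.0, x), (1.0 if x > 0 else 0.0)

def leaky_relu(x, slope=0.01):
    if x >= 0:
        return x, 1.0
    return slope * x, slope

for fn in (sigmoid, relu, leaky_relu):
    print(fn.__name__, fn(2.0), fn(-2.0))
```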
53
Setting Up A Neural Net

Input layer (x1, x2), hidden layer (h1, h2, h3, h4), output layer (y1, y2, y3).
54
Setting Up A Neural Net

Input layer (x1, x2), two hidden layers (a1-a4 and h1-h4), output layer (y1, y2, y3).
55
Fully Connected Network
Each neuron connects to each neuron in the previous layer.
56
Fully Connected Network
Notation: $a$ is the vector of all layer-a values, $w_i, b_i$ are neuron i's weights and bias, and $f$ is the activation function. Each neuron computes

$$h_i = f(w_i^T a + b_i)$$

How do we do all the neurons at once?
57
Fully Connected Network
Stack the per-neuron weights as the rows of a matrix W and the biases into a vector b:

$$h = f(Wa + b) = f\left(\begin{bmatrix} w_1^T \\ w_2^T \\ w_3^T \\ w_4^T \end{bmatrix} a + \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}\right)$$
58
Fully Connected Network
Define a new block, the "Linear Layer" (OK, technically it's affine):

$$L(n) = Wn + b$$

Can get the gradient with respect to all the inputs (do it on your own; useful trick: the shapes have to allow the matrix multiply).
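Not on the slide: a sketch of a linear (affine) layer with forward and backward passes in NumPy; the gradient expressions are the standard ones for $L(n) = Wn + b$, and the initialization is an arbitrary choice.

```python
import numpy as np

class Linear:
    """Affine block L(n) = W n + b with a backward pass for n, W, and b."""
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(out_dim, in_dim) * 0.1
        self.b = np.zeros(out_dim)
    def forward(self, n):
        self.n = n                      # cache the input for backward
        return self.W @ n + self.b
    def backward(self, g):
        self.dW = np.outer(g, self.n)   # dL/dW = g n^T
        self.db = g                     # dL/db
        return self.W.T @ g             # gradient with respect to the input n

layer = Linear(2, 4)
h = layer.forward(np.array([1.0, -2.0]))
dn = layer.backward(np.ones(4))         # pretend upstream gradient of all ones
print(h.shape, dn.shape)                # (4,) (2,)
```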
59
Fully Connected Network
The network as blocks: x → [L: W, b] → [f(n)] → [L: W, b] → [f(n)] → [L: W, b] → [f(n)].
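Not from the slides: a sketch stacking an affine map and an activation three times, mirroring the diagram; the layer sizes and the ReLU choice are illustrative.

```python
import numpy as np

def relu(n):
    return np.maximum(0.0, n)

# A small fully connected network: Linear -> ReLU, repeated three times.
# Layer sizes (2 -> 4 -> 4 -> 3) mirror the input, hidden, output layers above.
rng = np.random.default_rng(0)
params = [(rng.normal(scale=0.1, size=(4, 2)), np.zeros(4)),
          (rng.normal(scale=0.1, size=(4, 4)), np.zeros(4)),
          (rng.normal(scale=0.1, size=(3, 4)), np.zeros(3))]

def forward(x):
    h = x
    for W, b in params:
        h = relu(W @ h + b)     # linear layer followed by the activation
    return h

print(forward(np.array([1.0, -2.0])))
```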
60
Fully Connected Network
What happens if we remove the activation functions, leaving only the chain x → [L] → [L] → [L]?
61
Demo Time