
1 Chap. 10 Models

2 data analysis objectives

3 Regression

4 Optimization: Linear Regression (Least Squares)
Curve fitting with a line: w = (w0, w1), h(x, w) = w0 + w1*x
Given a dataset (x1, y1), (x2, y2), ..., (xN, yN)
Find w that minimizes the mean squared error: E(w) = (1/N) Σ_n (yn - h(xn, w))^2
In matrix form (dropping the 1/N factor): E(w) = (y - Xw)^T (y - Xw)
Setting the gradient to zero: X^T (y - Xw) = 0  =>  X^T X w = X^T y
Closed-form solution: w = (X^T X)^(-1) X^T y
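A minimal NumPy sketch of this closed-form fit; the arrays x and y below are made-up illustration data, not from the slides:

import numpy as np

# Made-up 1-D dataset for illustration
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Design matrix with a leading column of ones for the intercept w0
X = np.column_stack([np.ones_like(x), x])

# Normal equations: X^T X w = X^T y  =>  w = (X^T X)^(-1) X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ w                       # h(x, w) = w0 + w1 * x
print("w0, w1:", w)
print("MSE:", np.mean((y - y_hat) ** 2))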

5 Optimization: Linear Regression (Least Squares)
For a straight-line fit:
w1 = Σ(xi - E[x])(yi - E[y]) / Σ(xi - E[x])^2
w0 = E[y] - w1 * E[x]
Equivalently, w1 = correlation(x, y) * std(y) / std(x)
If x and y are perfectly correlated, a one-standard-deviation increase in x corresponds to a one-standard-deviation increase in y.
If they are anti-correlated, an increase in x corresponds to a decrease in y.
If they are uncorrelated, changes in x do not affect the prediction.
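A quick check, reusing the same made-up arrays, that the covariance formula and the correlation form give the same slope:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Slope and intercept from the covariance formula
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# The same slope via correlation(x, y) * std(y) / std(x)
r = np.corrcoef(x, y)[0, 1]
w1_alt = r * y.std() / x.std()

print(w1, w1_alt)   # both slope estimates agree
print(w0)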

6 Polynomial Regression
Curve fitting with a polynomial of degree m: w = (w0, w1, ..., wm), h(x, w) = w0 + w1*x + w2*x^2 + ... + wm*x^m
Given a dataset (x1, y1), (x2, y2), ..., (xN, yN)
Find w that minimizes the mean squared error: E(w) = (1/N) Σ_n (h(xn, w) - yn)^2
In matrix form y = Xw + e, with E(w) = (1/N)(Xw - y)^T (Xw - y)
Gradient: ∇E(w) = (2/N) X^T (Xw - y) = 0  =>  X^T X w = X^T y
Solution: w = (X^T X)^(-1) X^T y, where (X^T X)^(-1) X^T is the pseudo-inverse of X
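A minimal sketch of the pseudo-inverse solution for a degree-m polynomial; the data, the degree m = 3, and the noise level are all made up for illustration:

import numpy as np

# Made-up noisy samples of a smooth curve (illustration only)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)

m = 3                                     # polynomial degree
X = np.vander(x, m + 1, increasing=True)  # columns [1, x, x^2, ..., x^m]

# Pseudo-inverse solution: w = pinv(X) @ y, equal to (X^T X)^(-1) X^T y here
w = np.linalg.pinv(X) @ y

y_hat = X @ w
print("coefficients w0..wm:", w)
print("training MSE:", np.mean((y - y_hat) ** 2))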

7 Linear Classifier

8 Linear classifier
Given training data (petal length, sepal length, Iris versicolor or Iris virginica)
For new data (petal length, sepal length), predict whether it is versicolor or virginica.
Given a dataset (x11, x12), (x21, x22), ..., (xN1, xN2) with labels (y1, y2, ..., yN), assume a model: h(xi, w) = w1*xi1 + w2*xi2
Error in prediction: εi = yi - h(xi, w)

9 Linear classifier
Cost function J: J(w) = (1/2) Σ_n (yn - h(xn, w))^2
Update the weights iteratively: w(t+1) = w(t) + Δw
First-order expansion: J(w + Δw) ≈ J(w) + ∇J(w)^T Δw, so use Δw = -η ∇J(w)
In NumPy (X is the N x d feature matrix, y the targets):

import numpy as np

eta = 0.01                            # learning rate
W = np.zeros(X.shape[1] + 1)          # W[0] is the bias weight
costs = []
n_iter = 50
for i in range(n_iter):
    output = np.dot(X, W[1:]) + W[0]
    error = y - output
    W[1:] += eta * np.dot(X.T, error)     # gradient step on the weights
    W[0] += eta * error.sum()             # gradient step on the bias
    cost = (error ** 2).sum() / 2.0
    costs.append(cost)

10 Gradient Descent

11 Classifier
Linear classifier: (X, y)
Dataset (x11, x12), (x21, x22), ..., (xN1, xN2) with labels (y1, y2, ..., yN); assume a model h(xi, w) = w1*xi1 + w2*xi2
Error in prediction: εi = yi - h(xi, w)
When the input is d-dimensional, ((xi1, xi2, ..., xid), yi):
h(xi, w) = w1*xi1 + w2*xi2 + ... + wd*xid = Σ_j wj*xij
εi = yi - h(xi, w)
Nonlinear, multi-output case, ((xi1, xi2, ..., xid), (yi1, yi2, ..., yik, ...)):
hk(xi, w) = w1k*xi1 + w2k*xi2 + ... + wdk*xid = Σ_j wjk*xij
εik = yik - g(hk(xi, w))

12 Single Neuron (Perceptron)
Single output y
The connection from input i to the neuron has a positive or negative weight wi
Total input: s = Σ_i wi * xi
Output: y = φ(s) = +1 if s > 0, -1 otherwise

13 Perceptron Example (j=1)
Input x = (x1, x2, ..., xd), e.g. customer profile info (income, debt, employment years, ...)
Output = +1 if (Σ_{i=1..d} wi*xi - threshold) > 0, -1 otherwise
Use w0 for the threshold term (with x0 = 1), so the hypothesis is
φ(x) = sign(Σ_{i=1..d} wi*xi + w0) = sign(Σ_{i=0..d} wi*xi) = sign(w^T x)
Given a dataset with inputs x = (1, x1, x2, ..., xd): (x1, y1), ..., (xN, yN)
Learning algorithm: for any misclassified point, i.e. sign(w^T xn) != yn, update the weights w <- w + yn*xn (sketched below)
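A hedged sketch of this perceptron learning algorithm in NumPy; the toy data and labels are made up (they happen to match the four points of the following slides, relabeled as +1/-1), and max_epochs is an arbitrary safety cap:

import numpy as np

def perceptron_learning(X, y, max_epochs=100):
    # X is N x d, y has entries +1/-1; a column of ones is prepended
    # so that w[0] plays the role of the threshold weight w0.
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        updated = False
        for xn, yn in zip(Xb, y):
            if np.sign(w @ xn) != yn:      # misclassified: sign(w^T xn) != yn
                w = w + yn * xn            # update rule from the slide
                updated = True
        if not updated:                    # converged: all points correct
            break
    return w

# Made-up linearly separable toy data
X = np.array([[0.0, 0.5], [1.0, 1.0], [1.0, 0.5], [0.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_learning(X, y)
print("learned weights (w0, w1, w2):", w)
print("predictions:", np.sign(np.column_stack([np.ones(len(X)), X]) @ w))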

14 Perceptron Example
Two inputs, one output, trained on the data below
Total input: s = w1*x1 + w2*x2 + w0 (w0 is the bias); assume a step function g(s) = 1 if s > 0, 0 otherwise
Training data (x1, x2) → y:
(0, 1/2) → 1
(1, 1) → 1
(1, 1/2) → 0
(0, 0) → 0
These give the constraints:
w2/2 + w0 > 0
w1 + w2 + w0 > 0
w1 + w2/2 + w0 < 0
w0 < 0

15 Perceptron Example
Visualize: a separating line is x2 = 1/4 + (1/2)*x1
One choice of weights on that line is w0 = -1/4, w1 = -1/2, w2 = 1, which satisfies all four constraints:
w2/2 + w0 > 0
w1 + w2 + w0 > 0
w1 + w2/2 + w0 < 0
w0 < 0
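A small numerical check (a sketch only) that these weights satisfy the four constraints and classify the four training points from the previous slide:

import numpy as np

w0, w1, w2 = -0.25, -0.5, 1.0    # weights for the boundary x2 = 1/4 + x1/2

# The four constraints
print(w2 / 2 + w0 > 0)           # (0, 1/2) -> 1
print(w1 + w2 + w0 > 0)          # (1, 1)   -> 1
print(w1 + w2 / 2 + w0 < 0)      # (1, 1/2) -> 0
print(w0 < 0)                    # (0, 0)   -> 0

# Equivalent check with a step activation on each training point
X = np.array([[0.0, 0.5], [1.0, 1.0], [1.0, 0.5], [0.0, 0.0]])
y = np.array([1, 1, 0, 0])
s = w0 + X @ np.array([w1, w2])
print(np.all((s > 0).astype(int) == y))   # True: all four points classified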

16 Perceptron Example
Given a dataset with inputs x = (1, x1, x2, ..., xd): (x1, y1), ..., (xN, yN)
Learning algorithm: update the weights w <- w + yn*xn for every misclassified point, i.e. whenever sign(w^T xn) != yn
A batch version of this update in NumPy (X is the N x d feature matrix, y the +1/-1 labels):

import numpy as np

eta = 0.1                              # learning rate
W = np.zeros(X.shape[1] + 1)           # W[0] is the bias weight w0
costs = []
n_iter = 50
for i in range(n_iter):
    output = np.dot(X, W[1:]) + W[0]
    error = y - np.where(output >= 0.0, 1, -1)   # 0 if correct, +/-2 if wrong
    W[1:] += eta * np.dot(X.T, error)            # accumulates yn*xn over misclassified points
    W[0] += eta * error.sum()
    cost = (error ** 2).sum() / 2.0              # proportional to the number of misclassifications
    costs.append(cost)

17 (figure slide)

18 Logistic Regression
Total input: sj = Σ_i wij*xi; Output: yj = Θ(sj), where Θ(s) = 1/(1 + e^(-s))
The sigmoid Θ(s) can be interpreted as a probability
Example: input x = (cholesterol level, age, weight, ...); signal s = w^T x; output Θ(s) = probability of a heart attack

19 Error Measure
For each (x, y), y is generated with probability f(x)
A plausible error measure is based on likelihood: if h = f, how likely are we to get y from x?
p(y|x) = h(x) for y = +1, and 1 - h(x) for y = -1
Substitute h(x) = Θ(s) = Θ(w^T x)
Since Θ(-s) = 1 - Θ(s), p(y|x) = Θ(y*w^T x)

20 Error Measure
Likelihood of observing (x1, y1), ..., (xN, yN):
Π_{n=1..N} P(yn|xn) = Π_{n=1..N} Θ(yn*w^T xn)
Maximize (1/N) Σ_{n=1..N} ln Θ(yn*w^T xn)
Equivalently, minimize -(1/N) Σ_{n=1..N} ln Θ(yn*w^T xn) = (1/N) Σ_{n=1..N} ln(1/Θ(yn*w^T xn)) = (1/N) Σ_{n=1..N} ln(1 + exp(-yn*w^T xn))
Error measure: E(w) = (1/N) Σ_{n=1..N} ln(1 + exp(-yn*w^T xn))
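A minimal NumPy sketch of this error measure; logaddexp is used for numerical stability, and the X, y, w values below are made up:

import numpy as np

def logistic_error(w, X, y):
    # E(w) = (1/N) * sum_n ln(1 + exp(-yn * w^T xn)), with yn in {-1, +1};
    # X is N x d (a constant first column can carry the bias), y has length N.
    margins = y * (X @ w)
    # np.logaddexp(0, -m) computes ln(1 + exp(-m)) without overflow
    return np.mean(np.logaddexp(0.0, -margins))

# Made-up example values
X = np.array([[1.0, 0.2, 1.5],
              [1.0, -1.0, 0.3],
              [1.0, 0.8, -0.7]])
y = np.array([1, -1, 1])
w = np.zeros(3)
print(logistic_error(w, X, y))   # ln(2) when w = 0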

21 Minimizing Error Measure – Gradient Descent
E(w) = (1/N) Σ_{n=1..N} ln(1 + exp(-yn*w^T xn))
Start with w(0); take a step of fixed size η in the direction of a unit vector v
ΔE = E(w(1)) - E(w(0)) = E(w(0) + ηv) - E(w(0)) = η ∇E(w(0))^T v + O(η^2) ≥ -η ||∇E(w(0))||
Since v is a unit vector, the bound is attained by v = -∇E(w(0)) / ||∇E(w(0))||

22 Implementation η affects the performance
Instead of Δw = ηv = -η ∇E(w(0)) / ||∇E(w(0))||, use Δw = -η ∇E(w(0)) (η: learning rate), so the step size scales with the gradient magnitude
Rule of thumb: η = 0.1

23 Logistic regression algorithm
Initialize the weights at t = 0 to w(0)
For t = 0, 1, 2, ...:
compute the gradient ∇E = -(1/N) Σ_{n=1..N} yn*xn / (1 + exp(yn*w(t)^T xn))
update the weights: w(t+1) = w(t) - η ∇E
repeat until ||Δw|| < threshold (a sketch of this loop follows below)
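A hedged sketch of this loop in NumPy; the learning rate, tolerance, iteration cap, and the toy data are all illustrative choices, not from the slides:

import numpy as np

def logistic_regression_gd(X, y, eta=0.1, tol=1e-5, max_iter=10000):
    # Batch gradient descent for logistic regression with y in {-1, +1}.
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        margins = y * (X @ w)
        # gradient: -(1/N) * sum_n yn*xn / (1 + exp(yn * w^T xn))
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)
        step = -eta * grad
        w += step
        if np.linalg.norm(step) < tol:    # stop when the update is tiny
            break
    return w

# Made-up data: the first column of ones acts as the bias term
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.standard_normal((200, 2))])
y = np.where(X[:, 1] + 0.5 * X[:, 2] > 0, 1, -1)
print("learned weights:", logistic_regression_gd(X, y))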

24 Gradient Descent
Minimize the error measure E(w) = (1/N) Σ_{n=1..N} e(h(xn), yn)
= (1/N) Σ_{n=1..N} ln(1 + exp(-yn*w^T xn)) for logistic regression
Applying the update to ALL samples at once (batch) can be expensive
Stochastic GD: pick one (xn, yn) at a time and apply GD to e(h(xn), yn)
On average the step points in the batch direction: E_n[-∇e(h(xn), yn)] = -(1/N) Σ_{n=1..N} ∇e(h(xn), yn) = -∇E
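A per-sample (stochastic) variant of the same update, again only a sketch; the epoch count, shuffling, and learning rate are arbitrary choices:

import numpy as np

def logistic_regression_sgd(X, y, eta=0.1, n_epochs=50, seed=0):
    # Stochastic gradient descent: one (xn, yn) per update, y in {-1, +1}.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(y)):      # visit samples in random order
            xn, yn = X[n], y[n]
            # per-sample gradient of ln(1 + exp(-yn * w^T xn))
            grad_n = -yn * xn / (1.0 + np.exp(yn * (w @ xn)))
            w -= eta * grad_n
    return w

# e.g. w = logistic_regression_sgd(X, y), with X, y as in the batch sketch above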

25 Neural Networks

26 Neural Networks
Simulate the human nervous system: neurons and synapses
A neuron puts out a real number between 0 and 1
A perceptron is good for linear classification, but cannot handle a pattern like the one shown, which is not linearly separable (figure)

27 Combining perceptrons
Combine h1 and h2 using logical OR and AND; x1 and x2 can be +1 or -1

28 Multilayer perceptrons
3 layers

29 Neural Network
Multilayer perceptrons are limited to combinations of linear separations
The number of weights can grow too large as layers increase
Soften the hard threshold to a logistic (sigmoid) activation; each Θ can be different

30 NN parameters Assume that each Θ is identical
Θ(s) = tanh(s) = (e^s - e^(-s)) / (e^s + e^(-s))
Weights w_ij(l): l is the layer, i the input node, j the output node
Relationship: xj(l) = Θ(sj(l)) = Θ( Σ_{i=0..d(l-1)} w_ij(l) * xi(l-1) )
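A minimal forward-pass sketch matching this relationship, with tanh at every layer; the layer sizes and random weights are made up:

import numpy as np

def forward_pass(x, weights):
    # weights[l] has shape (1 + d(l-1)) x d(l): row 0 multiplies the constant
    # bias unit x0 = 1, matching xj(l) = tanh(sum_i w_ij(l) * xi(l-1)).
    activations = [np.asarray(x, dtype=float)]
    for W in weights:
        x_prev = np.concatenate([[1.0], activations[-1]])  # prepend x0 = 1
        s = W.T @ x_prev                                    # sj(l)
        activations.append(np.tanh(s))                      # xj(l)
    return activations

# Made-up network: 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(3)
weights = [rng.standard_normal((3, 3)), rng.standard_normal((4, 1))]
acts = forward_pass(np.array([0.5, -1.0]), weights)
print([a.shape for a in acts])   # layer-by-layer activation shapes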

31 How NN works
All the weights {w_ij(l)} determine the model h(x)
The error on one sample (xn, yn) is e(h(xn), yn) = e(w)
Compute the gradient ∇e(w): the partial derivatives ∂e(w) / ∂w_ij(l)

32 Computing ∇e(w)
Use the chain rule: ∂e(w) / ∂w_ij(l) = [∂e(w) / ∂sj(l)] * [∂sj(l) / ∂w_ij(l)]
∂sj(l) / ∂w_ij(l) = xi(l-1)
∂e(w) / ∂sj(l) = δj(l)
δj(l) can be computed recursively
The gradient separates into the product of xi(l-1) and δj(l)

33 δ for the final layer
δj(l) = ∂e(w) / ∂sj(l)
For the final layer, j = 1 and l = L: δ1(L) = ∂e(w) / ∂s1(L)
e(w) = e(h(xn), yn) = e(x1(L), yn) = (x1(L) - yn)^2 = (Θ(s1(L)) - yn)^2
so δ1(L) = 2 (x1(L) - yn) Θ'(s1(L)), with Θ'(s) = 1 - Θ^2(s) for tanh (if the last layer is taken as linear instead, δ1(L) = 2 (x1(L) - yn))

34 Backpropagation of δ
δi(l-1) = ∂e(w) / ∂si(l-1)
= Σ_{j=1..d(l)} [∂e(w) / ∂sj(l)] * [∂sj(l) / ∂xi(l-1)] * [∂xi(l-1) / ∂si(l-1)]
∂xi(l-1) / ∂si(l-1) = Θ'(si(l-1))
∂sj(l) / ∂xi(l-1) = w_ij(l)
∂e(w) / ∂sj(l) = δj(l)
For tanh: δi(l-1) = [1 - (xi(l-1))^2] * Σ_{j=1..d(l)} w_ij(l) * δj(l)
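A sketch of the full backward pass built from this recursion and the δ of the previous slide, for a small tanh network with squared error; the layer sizes, weights, and input are made up:

import numpy as np

def backprop(x, y, weights):
    # Returns the gradients d e / d w_ij(l), one array per layer, for a tanh
    # network with squared error e = (x1(L) - y)^2.
    activations = [np.asarray(x, dtype=float)]
    inputs = []                                   # [1, x(l-1)] for each layer
    for W in weights:
        x_prev = np.concatenate([[1.0], activations[-1]])
        inputs.append(x_prev)
        activations.append(np.tanh(W.T @ x_prev))

    # delta at the final layer: 2 (x1(L) - y) * (1 - x1(L)^2)
    out = activations[-1]
    delta = 2.0 * (out - y) * (1.0 - out ** 2)

    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        # d e / d w_ij(l) = xi(l-1) * delta_j(l)  (outer product, bias row included)
        grads[l] = np.outer(inputs[l], delta)
        if l > 0:
            # delta_i(l-1) = (1 - xi(l-1)^2) * sum_j w_ij(l) delta_j(l)
            back = weights[l][1:, :] @ delta      # drop the bias row
            delta = (1.0 - activations[l] ** 2) * back
    return grads

# Made-up 2 -> 3 -> 1 network, one training point
rng = np.random.default_rng(4)
weights = [rng.standard_normal((3, 3)), rng.standard_normal((4, 1))]
grads = backprop(np.array([0.5, -1.0]), 1.0, weights)
print([g.shape for g in grads])   # gradient arrays match the weight shapes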

35 NN Application
Protein structure prediction by PROF
Input layer: a sliding 15-residue window; predict the secondary structure of the central residue; each residue uses 20 input nodes
Hidden layer: connected to ALL input and output nodes

36 Convolutional NN Character Recognition
Divide the picture into a 16x16 grid of pixels: x = (1, x1, x2, ..., x256), which requires computing w = (w0, w1, w2, ..., w256)
Alternatively, use intensity and symmetry as two features: x = (1, x1, x2)

37 Conventional NN
Every node in layer l is connected to every node in layer (l+1)
28x28 pixel images map onto 784 input nodes

38 Convolutional NN
The input layer is kept in 2D
Each neuron in the next layer is connected to only a subset (a local patch) of input neurons
If each hidden neuron sees a 5x5 patch of the input, a 28x28 input gives 24x24 neurons in the first hidden layer

39 Convolutional NN

40 Convolutional NN Convolutional layer
Every hidden neuron shares the same 5x5 weights and bias
Hidden neuron (j, k): s = Σ_{l=0..4} Σ_{m=0..4} w_{l,m} * a_{j+l, k+m} + b, similar to a convolution; output Θ(s)
The shared 5x5 weights act as a filter that extracts a feature from the input layer
The map from the input layer to the hidden layer is called a feature map; a layer typically has multiple feature maps
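A sketch of one feature map computed exactly as above, with a sigmoid output; the 28x28 input and the 5x5 kernel values are made up:

import numpy as np

def feature_map(a, w, b):
    # One feature map with a shared kernel w and bias b:
    # s(j, k) = sum_{l,m} w[l, m] * a[j+l, k+m] + b, output = sigmoid(s)
    H, W = a.shape
    kh, kw = w.shape
    s = np.empty((H - kh + 1, W - kw + 1))
    for j in range(s.shape[0]):
        for k in range(s.shape[1]):
            s[j, k] = np.sum(w * a[j:j + kh, k:k + kw]) + b
    return 1.0 / (1.0 + np.exp(-s))      # sigmoid activation

# Made-up 28x28 "image" and one 5x5 kernel
rng = np.random.default_rng(5)
a = rng.random((28, 28))
w = rng.standard_normal((5, 5))
hidden = feature_map(a, w, b=0.1)
print(hidden.shape)    # (24, 24), matching the slide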

41 Convolutional NN
Some learned feature maps (dark = high weight)

42 Convolutional NN
Pooling layer: condenses the convolutional layer, e.g. over 2x2 blocks
A 24x24 feature map pools down to a 12x12 pooling layer
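A sketch of 2x2 pooling; the slide does not say which pooling operation is used, so max pooling is assumed here as a common choice:

import numpy as np

def max_pool_2x2(a):
    # Condense a feature map by taking the max over non-overlapping 2x2 blocks
    H, W = a.shape
    return a.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

pooled = max_pool_2x2(np.arange(24 * 24, dtype=float).reshape(24, 24))
print(pooled.shape)    # (12, 12): a 24x24 feature map pools down to 12x12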

