1
Chap. 10 Models
2
Data analysis objectives
3
Regression
4
Optimization: Linear Regression (Least Squares)
Curve fitting with a line: w = (w_0, w_1) for h(x, w) = w_0 + w_1 x
Given a dataset (x_1, y_1), (x_2, y_2), …, (x_N, y_N)
Find the w that minimizes the mean squared (least-squares) error:
E(w) = (1/N) Σ_{i=1}^{N} (y_i − h(x_i, w))^2
In matrix form: E(w) = (y − Xw)^T (y − Xw)
Setting the gradient to zero: X^T (y − Xw) = 0 ⇒ X^T X w = X^T y ⇒ w = (X^T X)^{-1} X^T y
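As a quick illustration (not from the slides), this closed-form solution can be computed in NumPy; np.linalg.lstsq solves the normal equations X^T X w = X^T y without forming the inverse explicitly, and the toy data below are made up:

import numpy as np

# toy data: y ≈ 2 + 3x plus noise
x = np.linspace(0, 1, 20)
y = 2 + 3 * x + 0.1 * np.random.randn(20)

X = np.column_stack([np.ones_like(x), x])   # design matrix with a column of ones for w_0
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves X^T X w = X^T y
print(w)                                    # approximately [2, 3]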
5
Optimization: Linear Regression (Least Squares)
For a straight-line fit:
w_1 = Σ_i (x_i − E[x])(y_i − E[y]) / Σ_i (x_i − E[x])^2
w_0 = E[y] − w_1 E[x]
Equivalently, w_1 = correlation(x, y) × standard_dev(y) / standard_dev(x)
If x and y are perfectly correlated, a one-standard-deviation increase in x predicts a one-standard-deviation increase in y
If they are anti-correlated, an increase in x predicts a decrease in y
If they are uncorrelated, changes in x do not affect the prediction
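A minimal sketch of these two formulas on made-up data, using NumPy's corrcoef, std, and mean (the data values are illustrative only):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

w1 = np.corrcoef(x, y)[0, 1] * np.std(y) / np.std(x)  # slope = correlation * sd(y)/sd(x)
w0 = np.mean(y) - w1 * np.mean(x)                     # intercept = E[y] - w1*E[x]
print(w0, w1)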
6
Polynomial Regression
Curve fitting with a polynomial of degree m: w = (w_0, w_1, …, w_m) for h(x, w) = w_0 + w_1 x + w_2 x^2 + … + w_m x^m
Given a dataset (x_1, y_1), (x_2, y_2), …, (x_N, y_N)
Find the w that minimizes the mean squared (least-squares) error:
E(w) = (1/N) Σ_{i=1}^{N} (h(x_i, w) − y_i)^2
In matrix form y = Xw + e, so E(w) = (1/N) ||Xw − y||^2
∇E(w) = (2/N) X^T (Xw − y) = 0 ⇒ X^T X w = X^T y ⇒ w = (X^T X)^{-1} X^T y
Pseudo-inverse: (X^T X)^{-1} X^T
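A short sketch of the pseudo-inverse solution for a degree-m polynomial; np.vander builds the design matrix and np.linalg.pinv plays the role of (X^T X)^{-1} X^T (the helper name poly_fit and the toy data are illustrative):

import numpy as np

def poly_fit(x, y, m):
    """Least-squares fit of a degree-m polynomial; returns w = (w_0, ..., w_m)."""
    X = np.vander(x, m + 1, increasing=True)   # columns: 1, x, x^2, ..., x^m
    return np.linalg.pinv(X) @ y               # pseudo-inverse: (X^T X)^{-1} X^T y

x = np.linspace(-1, 1, 30)
y = 1 - 2 * x + 0.5 * x**2 + 0.05 * np.random.randn(30)
print(poly_fit(x, y, 2))                       # approximately [1, -2, 0.5]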
7
Linear Classifier
8
Linear classifier: given training data
(petal length, sepal length, Iris versicolor or Iris virginica)
For new data (petal length, sepal length), predict versicolor or virginica
Given a dataset (x_11, x_12), (x_21, x_22), …, (x_N1, x_N2) and labels (y_1, y_2, …, y_N), assume a model: h(x_i, w) = w_1 x_i1 + w_2 x_i2
Error in prediction: ε_i = y_i − h(x_i, w)
9
Linear classifier cost function: J(w) = (1/2) Σ_{i=1}^{N} (y_i − h(x_i, w))^2
w_t = w_{t-1} + Δw, where J(w + Δw) ≈ J(w) + ∇J(w)^T Δw
Use Δw = −η ∇J(w)

import numpy as np

# X, y: training data from above
eta = 0.1                              # learning rate
W = np.zeros(X.shape[1] + 1)           # weights; W[0] is the bias
costs = []
n_iter = 50
for i in range(n_iter):
    output = np.dot(X, W[1:]) + W[0]   # linear prediction
    error = y - output                 # prediction error
    W[1:] += eta * np.dot(X.T, error)  # weight update: -eta * dJ/dW
    W[0] += eta * error.sum()          # bias update
    cost = (error**2).sum() / 2.0      # J(w)
    costs.append(cost)
10
Gradient Descent
11
Classifier: a linear classifier on (X, y)
Given (x_11, x_12), (x_21, x_22), …, (x_N1, x_N2) and (y_1, y_2, …, y_N), assume a model: h(x_i, w) = w_1 x_i1 + w_2 x_i2
Error in prediction: ε_i = y_i − h(x_i, w)
d-dimensional input: ((x_i1, x_i2, …, x_id), y_i)
h(x_i, w) = w_1 x_i1 + w_2 x_i2 + … + w_d x_id = Σ_j w_j x_ij
ε_i = y_i − h(x_i, w)
Nonlinear, multiple outputs: ((x_i1, x_i2, …, x_id), (y_1, y_2, …, y_k, …))
h_k(x_i, w) = w_1k x_i1 + w_2k x_i2 + … + w_dk x_id = Σ_j w_jk x_ij
ε_i = y_k − g(h_k(x_i, w))
12
Single Neuron (Perceptron)
Single output y
The connection from each input to the neuron has a positive or negative weight w_i
Total input: x_0 = Σ_i w_i x_i
Output: y = φ(x_0) + e, where φ(x_0) = +1 if x_0 > 0 and −1 otherwise
13
Perceptron Example (j=1)
Input x = (x_1, x_2, …, x_d), e.g. a customer profile (income, debt, years employed, …)
Output = +1 if (Σ_{i=1}^{d} w_i x_i − threshold) > 0, and −1 otherwise
Use w_0 for the threshold, so the hypothesis is φ(x) = sign(Σ_{i=1}^{d} w_i x_i + w_0) = sign(Σ_{i=0}^{d} w_i x_i) = sign(w^T x)
Given a dataset with input x = (1, x_1, x_2, …, x_d): (x_1, y_1), …, (x_N, y_N)
Learning algorithm (sketched below): update the weights w ← w + y_n x_n for each misclassified point, i.e. whenever sign(w^T x_n) ≠ y_n
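A minimal sketch of this learning rule, assuming labels y_n in {+1, −1} and an X whose rows already start with the constant 1 (the function name and the max_passes cap are illustrative, not from the slide):

import numpy as np

def perceptron(X, y, max_passes=100):
    """Repeatedly apply w <- w + y_n x_n to misclassified points."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        misclassified = False
        for xn, yn in zip(X, y):
            if np.sign(w @ xn) != yn:      # sign(w^T x_n) != y_n
                w += yn * xn               # update rule from the slide
                misclassified = True
        if not misclassified:              # converged (data linearly separable)
            break
    return w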
14
Perceptron Example: two inputs, one output, trained on the data below
Total input: x_j = w_1 x_1 + w_2 x_2 + w_0 (w_0 is the bias)
Assume a step function for g(x_j)
(x_1, x_2) → y
(0, 1/2) → 1
(1, 1) → 1
(1, 1/2) → 0
(0, 0) → 0
The training data imply:
w_2/2 + w_0 > 0
w_1 + w_2 + w_0 > 0
w_1 + w_2/2 + w_0 < 0
w_0 < 0
15
Perceptron Example: visualization. One separating line is x_2 = 1/4 + (1/2) x_1
A consistent choice of weights is w_0 = −1/4, w_1 = −1/2, w_2 = 1; the boundary w_1 x_1 + w_2 x_2 + w_0 = 0 is exactly that line, and all four constraints hold:
w_2/2 + w_0 > 0
w_1 + w_2 + w_0 > 0
w_1 + w_2/2 + w_0 < 0
w_0 < 0
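A quick check, not part of the original slide, that these weights reproduce the four training labels (predictions are mapped back to the 0/1 labels used above):

import numpy as np

w = np.array([-0.25, -0.5, 1.0])                                # (w_0, w_1, w_2)
X = np.array([[1, 0, 0.5], [1, 1, 1], [1, 1, 0.5], [1, 0, 0]])  # leading 1 for the bias
labels = np.where(X @ w > 0, 1, 0)                              # step function
print(labels)                                                   # [1 1 0 0]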
16
Perceptron Example: given a dataset with input x = (1, x_1, x_2, …, x_d)
(x_1, y_1), …, (x_N, y_N)
Learning algorithm: update the weights w ← w + y_n x_n for misclassified points where sign(w^T x_n) ≠ y_n

import numpy as np

# X, y: training data
eta = 0.1                                        # learning rate
W = np.zeros(X.shape[1] + 1)                     # weights; W[0] is the bias
costs = []
n_iter = 50
for i in range(n_iter):
    output = np.dot(X, W[1:]) + W[0]             # raw linear signal
    error = y - np.where(output >= 0.0, 1, -1)   # label minus thresholded prediction
    W[1:] += eta * np.dot(X.T, error)            # batch weight update
    W[0] += eta * error.sum()                    # bias update
    cost = (error**2).sum() / 2.0
    costs.append(cost)
18
Logistic Regression: total input x_j = Σ_i w_ij x_i, output y_j = Θ(x_j)
Θ(x) = 1/(1 + e^{−x}), the sigmoid function, which can be interpreted as a probability
Example input x: cholesterol level, age, weight, …
Signal s = w^T x
Output Θ(s): probability of a heart attack
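A small sketch of the sigmoid output; the weight values and the patient input below are hypothetical, chosen only to illustrate Θ(w^T x) as a probability:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))          # Θ(s) = 1 / (1 + e^{-s})

w = np.array([-6.0, 0.02, 0.05, 0.01])       # hypothetical weights (bias first)
x = np.array([1.0, 190.0, 55.0, 80.0])       # 1, cholesterol, age, weight
print(sigmoid(w @ x))                        # interpreted as P(heart attack | x)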
19
Error Measure: for each (x, y), y is generated with probability f(x)
A plausible error measure is based on likelihood: if h = f, how likely is it to get y from x?
p(y|x) = h(x) for y = +1, and 1 − h(x) for y = −1
Substitute h(x) = Θ(s) = Θ(w^T x)
Since Θ(−s) = 1 − Θ(s), p(y|x) = Θ(y w^T x)
20
Error Measure: the likelihood of observing (x_1, y_1), …, (x_N, y_N) is
Π_{n=1}^{N} P(y_n|x_n) = Π_{n=1}^{N} Θ(y_n w^T x_n)
Maximize (1/N) Σ_{n=1}^{N} ln Θ(y_n w^T x_n)
Equivalently, minimize −(1/N) Σ_{n=1}^{N} ln Θ(y_n w^T x_n) = (1/N) Σ_{n=1}^{N} ln(1/Θ(y_n w^T x_n)) = (1/N) Σ_{n=1}^{N} ln(1 + exp(−y_n w^T x_n))
Error measure: E(w) = (1/N) Σ_{n=1}^{N} ln(1 + exp(−y_n w^T x_n))
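This error measure maps directly to code; the sketch below assumes y in {+1, −1} and an X with a leading column of ones, and uses np.logaddexp(0, t) to compute ln(1 + e^t) stably:

import numpy as np

def cross_entropy_error(w, X, y):
    """E(w) = (1/N) sum_n ln(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, -1.0])
print(cross_entropy_error(np.array([0.0, 1.0]), X, y))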
21
Minimizing Error Measure – Gradient Descent
E(w) = (1/N) Σ_{n=1}^{N} ln(1 + exp(−y_n w^T x_n))
Start with w(0)
Use a fixed step size η along a unit vector v
ΔE = E(w(1)) − E(w(0)) = E(w(0) + ηv) − E(w(0)) = η ∇E(w(0))^T v + O(η^2) ≥ −η ||∇E(w(0))||
Since v is a unit vector, the best (steepest-descent) direction is v = −∇E(w(0)) / ||∇E(w(0))||
22
Implementation: η affects performance
Instead of Δw = ηv = −η ∇E(w(0)) / ||∇E(w(0))||, use Δw = −η ∇E(w(0)) (η: learning rate)
Rule of thumb: η = 0.1
23
Logistic regression algorithm
Initialize the weights at t = 0 to w(0)
For t = 1, 2, …:
compute the gradient ∇E = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + exp(y_n w(t)^T x_n))
update the weights: w(t+1) = w(t) − η ∇E
repeat until ||Δw|| < threshold
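A sketch of this loop in NumPy, assuming labels in {+1, −1} and a leading 1 in each x_n; eta, the threshold, and max_iter are illustrative defaults:

import numpy as np

def logistic_regression(X, y, eta=0.1, threshold=1e-5, max_iter=10000):
    w = np.zeros(X.shape[1])                                  # w(0)
    for _ in range(max_iter):
        # gradient: -(1/N) sum_n y_n x_n / (1 + exp(y_n w^T x_n))
        grad = -np.mean((y / (1.0 + np.exp(y * (X @ w))))[:, None] * X, axis=0)
        step = -eta * grad                                    # w(t+1) = w(t) - eta * grad
        w += step
        if np.linalg.norm(step) < threshold:                  # stop when |Δw| is small
            break
    return w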
24
Gradient Descent: minimize the error measure E(w) = (1/N) Σ_{n=1}^{N} e(h(x_n), y_n)
= (1/N) Σ_{n=1}^{N} ln(1 + exp(−y_n w^T x_n)) for logistic regression
It is expensive to apply GD to ALL samples at once (batch)
Stochastic GD: pick one (x_n, y_n) at a time and apply GD to e(h(x_n), y_n)
On average, E_n[−∇e(h(x_n), y_n)] = −(1/N) Σ_{n=1}^{N} ∇e(h(x_n), y_n) = −∇E, the same direction as batch GD
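A stochastic variant as a sketch: one randomly chosen (x_n, y_n) per update, so that the expected step direction matches −∇E (the function name and defaults are illustrative):

import numpy as np

def logistic_sgd(X, y, eta=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):                          # one sample at a time
            grad_n = -y[n] * X[n] / (1.0 + np.exp(y[n] * (X[n] @ w)))
            w -= eta * grad_n                                      # SGD step on e(h(x_n), y_n)
    return w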
25
Neural Networks
26
Neural Networks simulate the human nervous system: neurons and synapses
A neuron outputs a real number between 0 and 1
A perceptron is good for linear classification
It cannot handle a case like the one shown on the slide (not linearly separable):
27
Combining perceptrons
Combine h_1 and h_2 with logical OR and AND (a weight sketch follows below)
x_1 and x_2 can be +1 or −1
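One standard choice of weights for OR and AND perceptrons over ±1 inputs, given as a sketch; the slide's own weights may differ:

import numpy as np

def OR(x1, x2):
    return np.sign(1.5 + x1 + x2)     # +1 unless both inputs are -1

def AND(x1, x2):
    return np.sign(-1.5 + x1 + x2)    # +1 only if both inputs are +1

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, OR(a, b), AND(a, b))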
28
Multilayer perceptrons
3 layers
29
Neural Network: multilayer perceptrons are limited to combinations of linear separations
They can have too many weights as the number of layers increases
Soften the hard threshold to a logistic (sigmoid) function
Each Θ can be different
30
NN parameters: assume that each Θ is identical
tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s})
Weights w_ij^(l), where l is the layer, i the input node, and j the output node
Relationship: x_j^(l) = Θ(s_j^(l)) = Θ(Σ_{i=0}^{d(l−1)} w_ij^(l) x_i^(l−1))
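A forward-propagation sketch of this relationship, with one weight matrix per layer, tanh as Θ, and the bias node x_0 = 1 prepended at every layer; the layer sizes and random weights are illustrative:

import numpy as np

def forward(x, weights):
    """weights[l] has shape (1 + d(l-1), d(l)); x_j^(l) = tanh(sum_i w_ij^(l) x_i^(l-1))."""
    for W in weights:
        x = np.concatenate(([1.0], x))   # prepend the bias node x_0 = 1
        x = np.tanh(x @ W)               # s_j^(l), then Θ(s_j^(l))
    return x

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(5, 1))]   # 2 -> 4 -> 1 network
print(forward(np.array([0.5, -0.2]), weights))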
31
How NN works: all the weights {w_ij^(l)} determine the model h(x)
The error on one sample (x_n, y_n) is e(h(x_n), y_n) = e(w)
Compute the gradient ∇e(w), i.e. the partial derivatives ∂e(w)/∂w_ij^(l)
32
Computing ∇e(w): use the chain rule
∂e(w)/∂w_ij^(l) = [∂e(w)/∂s_j^(l)] × [∂s_j^(l)/∂w_ij^(l)]
∂s_j^(l)/∂w_ij^(l) = x_i^(l−1)
∂e(w)/∂s_j^(l) = δ_j^(l)
δ_j^(l) can be computed recursively
The derivative separates into a product of x_i^(l−1) and δ_j^(l)
33
δ for the final layer: δ_j^(l) = ∂e(w)/∂s_j^(l)
For the final layer, j = 1 and l = L: δ_1^(L) = ∂e(w)/∂s_1^(L)
e(w) = e(h(x_n), y_n) = e(x_1^(L), y_n) = (x_1^(L) − y_n)^2 (if the last layer is linear, x_1^(L) = s_1^(L))
= (Θ(s_1^(L)) − y_n)^2 otherwise
Θ'(s) = 1 − Θ^2(s) for tanh
34
Backpropagation of δ: δ_i^(l−1) = ∂e(w)/∂s_i^(l−1)
= Σ_{j=1}^{d(l)} [∂e(w)/∂s_j^(l)] × [∂s_j^(l)/∂x_i^(l−1)] × [∂x_i^(l−1)/∂s_i^(l−1)]
∂x_i^(l−1)/∂s_i^(l−1) = Θ'(s_i^(l−1))
∂s_j^(l)/∂x_i^(l−1) = w_ij^(l)
∂e(w)/∂s_j^(l) = δ_j^(l)
So δ_i^(l−1) = [1 − (x_i^(l−1))^2] Σ_{j=1}^{d(l)} w_ij^(l) δ_j^(l)
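A compact sketch that combines the forward pass, the final-layer δ, and this backpropagation rule for a tanh network with squared error; layer sizes and data are illustrative, and the weight gradient used is ∂e/∂w_ij^(l) = x_i^(l−1) δ_j^(l):

import numpy as np

def backprop(x, y, weights):
    """Return gradients de/dW[l] for a tanh network with squared error."""
    xs = [np.concatenate(([1.0], x))]                  # x^(0) with bias node
    for W in weights:                                  # forward pass, storing each x^(l)
        out = np.tanh(xs[-1] @ W)
        xs.append(np.concatenate(([1.0], out)))
    x_L = xs[-1][1:]                                   # network output (drop bias entry)
    delta = 2 * (x_L - y) * (1 - x_L**2)               # δ^(L), using Θ' = 1 - Θ^2
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = np.outer(xs[l], delta)              # de/dw_ij^(l) = x_i^(l-1) δ_j^(l)
        if l > 0:
            # δ_i^(l-1) = (1 - (x_i^(l-1))^2) Σ_j w_ij^(l) δ_j^(l); skip the bias row
            x_prev = xs[l][1:]
            delta = (1 - x_prev**2) * (weights[l][1:] @ delta)
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(5, 1))]   # 2 -> 4 -> 1 network
grads = backprop(np.array([0.5, -0.2]), np.array([1.0]), weights)
print([g.shape for g in grads])                        # [(3, 4), (5, 1)]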
35
NN Application: protein structure prediction by PROF
Input layer: a sliding 15-residue window; the network predicts the secondary structure of the central residue; each residue uses 20 input nodes
Hidden layer: connected to ALL input and output nodes
36
Convolutional NN: Character Recognition
Divide a picture into 16×16 pixels, giving x = (1, x_1, x_2, …, x_256)
Then we need to compute w = (w_0, w_1, w_2, …, w_256)
Alternatively, use intensity and symmetry as features: x = (1, x_1, x_2), as sketched below
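A sketch of how the two features might be computed from a 16×16 image; this particular definition of intensity (mean pixel value) and symmetry (negative left-right asymmetry) is an assumption, not taken from the slide:

import numpy as np

def features(img):
    """img: 16x16 array of pixel intensities; returns (1, intensity, symmetry)."""
    intensity = img.mean()                             # average pixel value
    symmetry = -np.abs(img - np.fliplr(img)).mean()    # penalize left-right asymmetry
    return np.array([1.0, intensity, symmetry])

print(features(np.zeros((16, 16))))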
37
Conventional NN
Every node in layer l is connected to every node in layer (l+1)
28×28 pixel images map onto 784 input nodes
38
Convolutional NN: the input layer is kept in 2D
Each neuron in the next layer is connected to a subset of the input neurons
If a 5×5 patch of input neurons feeds one neuron in the first hidden layer, the 28×28 input gives 24×24 neurons in the next layer
39
Convolutional NN
40
Convolutional NN: convolutional layer
Every 5×5 window shares the same bias and weights
For hidden neuron (j, k): s = Σ_{l=0}^{4} Σ_{m=0}^{4} w_{l,m} a_{j+l,k+m} + b, similar to a convolution
Output: Θ(s)
The 5×5 filter extracts features from the input layer
The map from the input layer to the hidden layer is called a feature map
There can be multiple feature maps
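A sketch of one feature map computed exactly this way (28×28 input, shared 5×5 weights and bias, 24×24 output); using tanh for Θ is an assumption carried over from the earlier slides:

import numpy as np

def feature_map(a, w, b):
    """a: 28x28 input, w: 5x5 shared weights, b: shared bias -> 24x24 feature map."""
    H = a.shape[0] - w.shape[0] + 1                    # 28 - 5 + 1 = 24
    Wd = a.shape[1] - w.shape[1] + 1
    s = np.empty((H, Wd))
    for j in range(H):
        for k in range(Wd):
            # s_{j,k} = sum_{l,m} w_{l,m} a_{j+l,k+m} + b
            s[j, k] = np.sum(w * a[j:j+5, k:k+5]) + b
    return np.tanh(s)                                  # Θ(s)

rng = np.random.default_rng(0)
print(feature_map(rng.normal(size=(28, 28)), rng.normal(size=(5, 5)), 0.1).shape)  # (24, 24)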
41
Convolutional NN: some learned feature maps (dark = high weight)
42
Convolutional NN: a pooling layer condenses the convolutional layer, e.g. by 2×2 blocks
With 2×2 pooling of a 24×24 feature map, the pooling layer is 12×12
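A max-pooling sketch of this condensation step; max pooling is one common choice, and the slide does not specify which pooling operation is used:

import numpy as np

def max_pool_2x2(fm):
    """2x2 max pooling: a 24x24 feature map becomes 12x12."""
    H, W = fm.shape
    return fm.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

print(max_pool_2x2(np.zeros((24, 24))).shape)   # (12, 12)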