Published by Diana Lee. Modified about 1 year ago.

1
MORE CLASSIFIERS

2
AGENDA
Key concepts for all classifiers
Precision vs. recall
Biased sample sets
Linear classifiers
Intro to neural networks

3
RECAP: DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples.
[Figure: a decision tree with tests x1>=20, x2>=10, and x2>=15 and T/F leaves, shown with a plot on axes x1 and x2]

4
BEYOND ERROR RATES

5
BEYOND ERROR RATE
Predicting security risk: predicting “low risk” for a terrorist is far worse than predicting “high risk” for an innocent bystander (but maybe not for 5 million of them).
Searching for images: returning irrelevant images is worse than omitting relevant ones.

6
BIASED SAMPLE SETS
Often there are orders of magnitude more negative examples than positive ones.
E.g., all images of Kris on Facebook: if I classify all images as “not Kris”, I’ll have >99.99% accuracy.
Examples of Kris should count much more than non-Kris!
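The accuracy claim above can be checked with a few lines of arithmetic. This is a minimal sketch: the image count and the number of Kris images are hypothetical values chosen for illustration, not figures from the slide.

```python
# Hypothetical counts, chosen only to illustrate a heavily imbalanced dataset.
num_images = 1_000_000
num_kris = 50  # positive examples

# Labels: 1 = "Kris", 0 = "not Kris".
labels = [1] * num_kris + [0] * (num_images - num_kris)

# A trivial classifier that labels every image "not Kris".
predictions = [0] * num_images

# Accuracy counts every example equally, so the rare positives barely matter.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / num_images

# Number of Kris images the classifier actually finds.
found = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)

print(f"accuracy = {accuracy:.4%}")
print(f"Kris images found = {found} of {num_kris}")
```

The classifier scores very high accuracy while finding none of the images it was built to find, which is why error rate alone is a poor measure here.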

7
FALSE POSITIVES
[Figure: true decision boundary and learned decision boundary, plotted on axes x1 and x2]

8
FALSE POSITIVES
New query: an example incorrectly predicted to be positive.
[Figure: true decision boundary and learned decision boundary, plotted on axes x1 and x2, with the new query marked]

9
FALSE NEGATIVES
New query: an example incorrectly predicted to be negative.
[Figure: true decision boundary and learned decision boundary, plotted on axes x1 and x2, with the new query marked]

10
PRECISION VS. RECALL
Precision = # of relevant documents retrieved / # of total documents retrieved
Recall = # of relevant documents retrieved / # of total relevant documents
Both are numbers between 0 and 1.

11
PRECISION VS. RECALL
Precision = # of true positives / (# of true positives + # of false positives)
Recall = # of true positives / (# of true positives + # of false negatives)
A precise classifier is selective.
A classifier with high recall is inclusive.
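The two definitions translate directly into code. The sketch below uses small made-up label lists; returning 0.0 when a denominator is zero is a convention assumed here, not something the slides specify.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for binary labels and predictions (1 = positive)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    # Assumed convention: report 0.0 when the ratio is undefined.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


# Hypothetical data: 3 positives, 5 negatives.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
selective = [1, 0, 0, 0, 0, 0, 0, 0]  # predicts positive rarely
inclusive = [1, 1, 1, 1, 1, 1, 0, 0]  # predicts positive liberally

print("selective:", precision_recall(labels, selective))
print("inclusive:", precision_recall(labels, inclusive))
```

The selective classifier makes no false positives but misses two positives; the inclusive one finds every positive at the cost of three false alarms.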

12
REDUCING FALSE POSITIVE RATE
[Figure: true decision boundary and learned decision boundary, plotted on axes x1 and x2]

13
REDUCING FALSE NEGATIVE RATE
[Figure: true decision boundary and learned decision boundary, plotted on axes x1 and x2]

14
PRECISION-RECALL CURVES
Measure precision vs. recall as the decision boundary is tuned.
[Figure: precision plotted against recall, with curves labeled "Perfect classifier" and "Actual performance"]

15
PRECISION-RECALL CURVES
Measure precision vs. recall as the decision boundary is tuned.
[Figure: precision plotted against recall, with points labeled "Penalize false negatives", "Equal weight", and "Penalize false positives"]

16
PRECISION-RECALL CURVES
Measure precision vs. recall as the decision boundary is tuned.
[Figure: precision plotted against recall]

17
PRECISION-RECALL CURVES
Measure precision vs. recall as the decision boundary is tuned.
[Figure: precision plotted against recall, with the direction of "Better learning performance" indicated]

18
OPTION 1: CLASSIFICATION THRESHOLDS
Many learning algorithms (e.g., probabilistic models, linear models) give a real-valued output v(x) that needs thresholding for classification:
v(x) > threshold => positive label given to x
v(x) < threshold => negative label given to x
May want to tune the threshold to get fewer false positives or fewer false negatives.
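A short sketch of threshold tuning follows. The scores and labels are invented for illustration, and treating a score exactly equal to the threshold as negative is an assumption made here.

```python
# Hypothetical real-valued outputs v(x) and their true labels (1 = positive).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0]


def classify(scores, threshold):
    """Label an example positive when its score exceeds the threshold."""
    # Assumption: a score exactly at the threshold is labeled negative.
    return [1 if v > threshold else 0 for v in scores]


def error_counts(labels, predictions):
    """Return (false positives, false negatives)."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return fp, fn


# Raising the threshold trades false positives for false negatives.
for threshold in (0.2, 0.5, 0.75):
    fp, fn = error_counts(labels, classify(scores, threshold))
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Computing precision and recall at each threshold in such a sweep gives the points of a precision-recall curve.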

19
OPTION 2: WEIGHTED DATASETS
Weighted datasets: attach a weight w to each example to indicate how important it is.
Instead of counting “# of errors”, count “sum of weights of errors”.
Or construct a resampled dataset D’ where each example is duplicated proportionally to its w.
As the relative weight of positive vs. negative examples is tuned from 0 to 1, the precision-recall curve is traced out.
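Both variants can be sketched in a few lines. The weights below (10 for a positive, 1 for a negative) are hypothetical, and the resampling helper assumes integer weights so that duplication is exact.

```python
def weighted_error(labels, predictions, weights):
    """Sum the weights of the misclassified examples."""
    return sum(w for y, p, w in zip(labels, predictions, weights) if y != p)


def resample(examples, weights):
    """Build D' by repeating each example w times (assumes integer weights)."""
    resampled = []
    for example, w in zip(examples, weights):
        resampled.extend([example] * w)
    return resampled


# Hypothetical data: one positive among four negatives.
labels = [1, 0, 0, 0, 0]
weights = [10 if y == 1 else 1 for y in labels]

all_negative = [0, 0, 0, 0, 0]     # misses the one positive
one_false_alarm = [1, 1, 0, 0, 0]  # finds it, with one false positive

# Both classifiers make exactly one error, but the weighted cost differs.
print(weighted_error(labels, all_negative, weights))
print(weighted_error(labels, one_false_alarm, weights))

# The resampled dataset contains the positive example 10 times.
resampled_labels = resample(labels, weights)
print(len(resampled_labels), sum(resampled_labels))
```

Counting plain errors on the resampled dataset gives the same totals as the weighted count on the original one.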

20
LINEAR CLASSIFIERS: MOTIVATION
A decision tree produces axis-aligned decision boundaries.
Can we accurately classify data like this?
[Figure: example data plotted on axes x1 and x2]

21
PLANE GEOMETRY
Any line in 2D can be expressed as the set of solutions (x,y) to the equation ax+by+c=0 (an implicit surface).
ax+by+c > 0 is one side of the line.
ax+by+c < 0 is the other.
ax+by+c = 0 is the line itself.
[Figure: a line on axes x and y, annotated with a and b]

22
PLANE GEOMETRY
In 3D, a plane can be expressed as the set of solutions (x,y,z) to the equation ax+by+cz+d=0.
ax+by+cz+d > 0 is one side of the plane.
ax+by+cz+d < 0 is the other side.
ax+by+cz+d = 0 is the plane itself.
[Figure: a plane on axes x, y, and z, annotated with a, b, and c]

23
LINEAR CLASSIFIER
In d dimensions, $c_0 + c_1 x_1 + \dots + c_d x_d = 0$ is a hyperplane.
Idea:
Use $c_0 + c_1 x_1 + \dots + c_d x_d > 0$ to denote positive classifications.
Use $c_0 + c_1 x_1 + \dots + c_d x_d < 0$ to denote negative classifications.
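The classification rule amounts to evaluating the hyperplane expression and checking its sign. In this sketch the coefficient vector is a made-up example, and labeling points that lie exactly on the hyperplane as negative is an assumption, since the slide leaves that case open.

```python
def linear_score(c, x):
    """Evaluate c0 + c1*x1 + ... + cd*xd for c = [c0, ..., cd] and x = [x1, ..., xd]."""
    return c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))


def classify(c, x):
    """Return 1 on the positive side of the hyperplane, 0 otherwise."""
    # Assumption: points exactly on the hyperplane are labeled negative.
    return 1 if linear_score(c, x) > 0 else 0


# Hypothetical 2D example: the line x1 + x2 - 1 = 0, which is not axis-aligned.
c = [-1.0, 1.0, 1.0]

print(classify(c, [1.0, 1.0]))    # above the line
print(classify(c, [0.0, 0.0]))    # below the line
print(classify(c, [0.25, 0.25]))  # below the line
```

A single diagonal boundary like this one would take many axis-aligned splits for a decision tree to approximate.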

24
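PERCEPTRON
[Figure: a unit with inputs $x_1, \dots, x_n$, weights $w_i$, activation g, and output y]
$y = f(x, w) = g\left(\sum_{i=1}^{n} w_i x_i\right)$
[Figure: the line $w_1 x_1 + w_2 x_2 = 0$ on axes x1 and x2, and a plot of g(u) against u]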

25
A SINGLE PERCEPTRON CAN LEARN
A disjunction of boolean literals: $x_1 \lor x_2 \lor x_3$
Majority function
[Figure: a unit with inputs $x_1, \dots, x_n$, weights $w_i$, activation g, and output y]

26
A SINGLE PERCEPTRON CAN LEARN
A disjunction of boolean literals: $x_1 \lor x_2 \lor x_3$
Majority function
XOR?
[Figure: a unit with inputs $x_1, \dots, x_n$, weights $w_i$, activation g, and output y]
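These three cases can be tried out directly. The sketch below assumes a step-function perceptron with an explicit threshold (the slides do not show one) and uses hand-picked weights. The XOR search covers only a small grid of weights, so it illustrates the difficulty rather than proving it.

```python
from itertools import product


def perceptron(weights, threshold, x):
    """Step-function unit: output 1 when the weighted sum exceeds the threshold."""
    u = sum(w * xi for w, xi in zip(weights, x))
    return 1 if u > threshold else 0


inputs3 = list(product([0, 1], repeat=3))

# Disjunction x1 OR x2 OR x3: fire when at least one input is 1.
or_ok = all(perceptron([1, 1, 1], 0.5, x) == int(any(x)) for x in inputs3)

# Majority of three inputs: fire when at least two inputs are 1.
majority_ok = all(perceptron([1, 1, 1], 1.5, x) == int(sum(x) >= 2) for x in inputs3)

# XOR on two inputs: try every (w1, w2, threshold) on a grid from -4 to 4 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
inputs2 = list(product([0, 1], repeat=2))
xor_solutions = [
    (w1, w2, t)
    for w1, w2, t in product(grid, repeat=3)
    if all(perceptron([w1, w2], t, x) == (x[0] ^ x[1]) for x in inputs2)
]

print("OR representable:", or_ok)
print("majority representable:", majority_ok)
print("XOR settings found on the grid:", len(xor_solutions))
```

No setting on the grid works for XOR because its positive and negative examples cannot be separated by a single line.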

27
PERCEPTRON LEARNING RULE
$\theta \leftarrow \theta + x^{(i)} \left( y^{(i)} - g(\theta^T x^{(i)}) \right)$
(g outputs either 0 or 1; y is either 0 or 1)
If the output is correct, the weights are unchanged.
If g is 0 but y is 1, $x^{(i)}$ is added to $\theta$, which increases $\theta^T x^{(i)}$.
If g is 1 but y is 0, $x^{(i)}$ is subtracted from $\theta$, which decreases $\theta^T x^{(i)}$.
Converges if the data is linearly separable, but oscillates otherwise.
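Here is a minimal sketch of the update rule applied to a small linearly separable problem (boolean OR). The constant first feature standing in for a bias term, the epoch cap, and the choice g(u) = 1 when u > 0 are assumptions made for this example.

```python
def g(u):
    """Step activation: 1 when u is positive, otherwise 0 (assumed form of g)."""
    return 1 if u > 0 else 0


def dot(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))


def train_perceptron(examples, num_features, max_epochs=100):
    """Apply theta <- theta + x * (y - g(theta . x)) until an epoch has no mistakes."""
    theta = [0.0] * num_features
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            error = y - g(dot(theta, x))  # 0 if correct, +1 or -1 otherwise
            if error != 0:
                mistakes += 1
                theta = [t + error * xi for t, xi in zip(theta, x)]
        if mistakes == 0:
            break  # every example is classified correctly
    return theta


# Boolean OR; the leading 1 in each input is an assumed constant bias feature.
examples = [
    ([1, 0, 0], 0),
    ([1, 0, 1], 1),
    ([1, 1, 0], 1),
    ([1, 1, 1], 1),
]

theta = train_perceptron(examples, num_features=3)
predictions = [g(dot(theta, x)) for x, _ in examples]
print("theta =", theta)
print("predictions =", predictions)
```

On data that is not linearly separable, such as XOR, the loop would run until `max_epochs` without ever completing a mistake-free pass.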

28
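PERCEPTRON
[Figure: a unit with inputs $x_1, \dots, x_n$, weights $w_i$, activation g, and output y, marked with a "?"; beside it, a plot of g(u) against u]
$y = f(x, w) = g\left(\sum_{i=1}^{n} w_i x_i\right)$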

29
UNIT (NEURON)
[Figure: a unit with inputs $x_1, \dots, x_n$, weights $w_i$, activation g, and output y]
$y = g\left(\sum_{i=1}^{n} w_i x_i\right)$
$g(u) = \frac{1}{1 + \exp(-u)}$
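The unit can be written as two small functions. This is a sketch; the weights and inputs in the usage lines are arbitrary illustrative values.

```python
import math


def sigmoid(u):
    """Smooth activation g(u) = 1 / (1 + exp(-u)), with values strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-u))


def unit_output(weights, x):
    """Output of one unit: the sigmoid of the weighted sum of its inputs."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


# Arbitrary example values.
print(sigmoid(0.0))                           # midpoint of the curve
print(unit_output([1.0, -1.0], [2.0, 2.0]))   # weighted sum is 0
print(unit_output([2.0, 1.0], [3.0, 4.0]))    # large positive sum, output near 1
```

Unlike a hard step, this activation is differentiable everywhere, which is what makes gradient-based training possible.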

30
NEURAL NETWORK
Network of interconnected neurons
Acyclic (feed-forward) vs. recurrent networks
[Figure: two units, each with inputs $x_1, \dots, x_n$, weights $w_i$, activation g, and output y]

31
TWO-LAYER FEED-FORWARD NEURAL NETWORK
[Figure: inputs, hidden layer, and output layer, with weights $w_{1j}$ and $w_{2k}$]

32
NETWORKS WITH HIDDEN LAYERS
Can represent XORs and other nonlinear functions.
Common neuron types: soft perceptron (sigmoid), radial basis functions, linear, …
As the number of hidden units increases, so does the network’s capacity to learn functions with more nonlinear features.
How to train hidden layers?
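One way to see that a hidden layer is enough for XOR is to wire one up by hand. The sketch below uses step units and hand-chosen weights, which are assumptions made for clarity; a trained network with sigmoid units would arrive at its own weights.

```python
def step(u):
    """Hard-threshold activation: 1 when u is positive, otherwise 0."""
    return 1 if u > 0 else 0


def xor_network(x1, x2):
    """Two hidden units (OR and AND) feeding one output unit."""
    h_or = step(x1 + x2 - 0.5)    # fires when at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only when both inputs are 1
    # Output fires when OR is on and AND is off, which is exactly XOR.
    return step(h_or - h_and - 0.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_network(x1, x2))
```

Each hidden unit draws one line in the input plane, and the output unit combines the two half-planes into a region no single line could carve out.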

33
BACKPROPAGATION (PRINCIPLE)
Treat the problem as one of minimizing errors between the example label and the network output, given the example and network weights as input:
$\mathrm{Error}(x_i, y_i, w) = (y_i - f(x_i, w))^2$
Sum this error term over all examples:
$E(w) = \sum_i \mathrm{Error}(x_i, y_i, w) = \sum_i (y_i - f(x_i, w))^2$
Minimize errors using an optimization algorithm; stochastic gradient descent is typically used.
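As a worked example of what the optimizer needs, the gradient of one error term can be derived for the simplest case. The derivation below assumes that f is a single sigmoid unit, $f(x_i, w) = g\left(\sum_k w_k x_{ik}\right)$ with $g(u) = 1/(1 + \exp(-u))$, and writes $x_{ij}$ for attribute j of example $x_i$ (notation introduced here). A network with hidden layers applies the same chain rule layer by layer.

```latex
\begin{align*}
\frac{\partial}{\partial w_j} \mathrm{Error}(x_i, y_i, w)
  &= \frac{\partial}{\partial w_j} \bigl(y_i - f(x_i, w)\bigr)^2 \\
  &= -2 \bigl(y_i - f(x_i, w)\bigr) \frac{\partial f(x_i, w)}{\partial w_j} \\
  &= -2 \bigl(y_i - f(x_i, w)\bigr)\, f(x_i, w) \bigl(1 - f(x_i, w)\bigr)\, x_{ij},
\end{align*}
% using g'(u) = g(u) (1 - g(u)) for the sigmoid.
```

A gradient descent step with step size $\eta$ then moves each weight against this gradient: $w_j \leftarrow w_j + 2 \eta \, (y_i - f(x_i, w)) \, f(x_i, w) (1 - f(x_i, w)) \, x_{ij}$.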

43
STOCHASTIC GRADIENT DESCENT
For each example $(x_i, y_i)$, take a gradient descent step to reduce the error for $(x_i, y_i)$ only.

44
STOCHASTIC GRADIENT DESCENT
Objective function values (measured over all examples) settle into a local minimum over time.
Step size must be reduced over time, e.g., O(1/t).
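The procedure can be sketched for a single sigmoid unit trained on squared error. Everything specific here is an assumption for illustration: the OR dataset with a constant bias feature, cycling through examples in a fixed order, and the schedule `eta0 / (1 + decay * t)` as one possible O(1/t) step size.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def predict(w, x):
    """Output of a single sigmoid unit."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def total_error(w, examples):
    """Objective E(w): squared error summed over all examples."""
    return sum((y - predict(w, x)) ** 2 for x, y in examples)


def sgd(examples, num_features, num_steps=2000, eta0=1.0, decay=0.01):
    """One gradient step per example, with a step size that shrinks like 1/t."""
    w = [0.0] * num_features
    for t in range(num_steps):
        x, y = examples[t % len(examples)]  # visit examples in a fixed cycle
        f = predict(w, x)
        # Gradient of (y - f)^2 for a sigmoid unit: -2 (y - f) f (1 - f) x_j.
        grad = [-2.0 * (y - f) * f * (1.0 - f) * xj for xj in x]
        eta = eta0 / (1.0 + decay * t)      # assumed O(1/t) schedule
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w


# Boolean OR; the leading 1.0 in each input is an assumed constant bias feature.
examples = [
    ([1.0, 0.0, 0.0], 0.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 0.0], 1.0),
    ([1.0, 1.0, 1.0], 1.0),
]

initial_error = total_error([0.0, 0.0, 0.0], examples)
w = sgd(examples, num_features=3)
final_error = total_error(w, examples)
print(f"E(w) before: {initial_error:.3f}, after: {final_error:.3f}")
```

Each step looks at one example only, so individual updates can raise the total error even though the trend over many steps is downward; shrinking the step size damps that noise.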

45
NEURAL NETWORKS: PROS AND CONS
Pros:
Bioinspiration is nifty
Can represent a wide variety of decision boundaries
Complexity is easily tunable (number of hidden nodes, topology)
Easily extendable to regression tasks
Cons:
Haven’t gotten close to unlocking the power of the human (or cat) brain
Complex boundaries need lots of data
Slow training
Mostly lukewarm feelings in mainstream ML (although the “deep learning” variant is en vogue now)

46
NEXT CLASS
Another guest lecture
