CS 188: Artificial Intelligence


CS 188: Artificial Intelligence
Deep Learning
Instructor: Anca Dragan, University of California, Berkeley
[These slides created by Dan Klein, Pieter Abbeel, Anca Dragan, Josh Hug for CS188 Intro to AI at UC Berkeley. All CS188 materials available at http://ai.berkeley.edu. Please retain proper attribution, including the reference to ai.berkeley.edu. Thanks!]

Last Time: Linear Classifiers
Inputs are feature values f1, f2, f3, ...; each feature has a weight w1, w2, w3, ...; the sum is the activation: activation_w(f) = Σ_i w_i f_i. If the activation is positive, output +1; if negative, output -1.
[Diagram: features f1, f2, f3 are multiplied by weights w1, w2, w3, summed (Σ), and passed through a >0? threshold.]
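A minimal sketch of this decision rule; the weights and feature values below are made up for illustration, not taken from the course:

```python
# Linear classifier: weighted sum of features (the activation), thresholded at zero.
# Weights and feature values are made-up placeholders.

def classify(weights, features):
    activation = sum(w * f for w, f in zip(weights, features))
    return +1 if activation > 0 else -1

w = [0.5, -1.0, 2.0]      # one weight per feature: w1, w2, w3
f = [1.0, 0.3, 0.2]       # feature values f1, f2, f3
print(classify(w, f))      # activation = 0.5 - 0.3 + 0.4 = 0.6 > 0, so output +1
```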

Non-Linearity

Non-Linear Separators
Data that is linearly separable works out great for linear decision rules. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space, e.g. x → (x, x²), where it becomes linearly separable?
[Figure: 1-D data that no threshold on x can separate becomes separable after the mapping x → (x, x²).]
This and the next slide adapted from Ray Mooney, UT.
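A tiny sketch of that mapping on made-up 1-D data: no single threshold on x separates the two classes, but after lifting each point to (x, x²) a simple rule on the second coordinate does.

```python
# Made-up 1-D data: the positives sit between the negatives, so no threshold on
# x alone separates them.  Lifting x to (x, x**2) makes them separable, e.g. by
# the linear rule "second coordinate < 4".
positives = [-1.0, 0.0, 1.5]
negatives = [-3.0, 2.5, 4.0]

def lift(x):
    return (x, x ** 2)

for x in positives + negatives:
    label = "+" if x in positives else "-"
    print(label, lift(x), "predicted +:", lift(x)[1] < 4)
```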

Computer Vision

Object Detection

Manual Feature Design

Features and Generalization [Dalal and Triggs, 2005]

Features and Generalization
Hand-designed descriptors: HoG, SIFT, Daisy; deformable part models.
[Figure: an input image and its HoG representation.]

Manual Feature Design → Deep Learning
Manual feature design requires domain-specific expertise and domain-specific effort. What if we could learn the features, too? That is the idea behind deep learning.

Perceptron
[Diagram: features f1, f2, f3 are multiplied by weights w1, w2, w3, summed (Σ), and passed through a >0? threshold.]

Two-Layer Perceptron Network
[Diagram: features f1, f2, f3 feed three hidden units; each hidden unit computes a weighted sum (weights w_ij) followed by a >0? threshold, and the hidden outputs are combined by a final weighted sum and >0? threshold to produce the output.]

N-Layer Perceptron Network
[Diagram: features f1, f2, f3 feed a sequence of layers; every unit computes a weighted sum (Σ) of the previous layer's outputs followed by a >0? threshold, and the last layer produces the output.]
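A sketch of the forward pass through such a network of threshold units; the layer sizes and weight values are arbitrary placeholders, not the ones in the figure.

```python
# Forward pass through a multi-layer perceptron network: every unit computes a
# weighted sum of the previous layer's outputs and thresholds it at zero.
# Layer sizes and weights are placeholders.

def layer_forward(weights, inputs):
    # weights holds one row of incoming weights per unit in this layer
    return [1.0 if sum(w * x for w, x in zip(row, inputs)) > 0 else 0.0
            for row in weights]

def network_forward(layers, features):
    activations = features
    for weights in layers:
        activations = layer_forward(weights, activations)
    return activations

layers = [
    [[0.4, -0.6, 0.2], [-0.1, 0.5, 0.3]],   # hidden layer: 2 units over 3 features
    [[1.0, -1.0]],                          # output layer: 1 unit over 2 hidden outputs
]
print(network_forward(layers, [1.0, 2.0, 0.5]))
```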

Local Search
Simple, general idea: start wherever; repeat: move to the best neighboring state; if no neighbors are better than the current state, quit. Here, neighbors = small perturbations of w.
Properties: plateaus and local optima. How do we escape plateaus and find a good local optimum? How do we deal with very large parameter vectors?
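A sketch of this local-search loop on a weight vector, using small random perturbations as the neighbors; the objective, step size, and neighbor count are placeholders.

```python
import random

# Local search over w: repeatedly try small random perturbations ("neighbors")
# and keep the best one; quit when no neighbor improves the objective.

def local_search(objective, w, step=0.1, num_neighbors=20, max_iters=1000):
    for _ in range(max_iters):
        neighbors = [[wi + random.uniform(-step, step) for wi in w]
                     for _ in range(num_neighbors)]
        best = max(neighbors, key=objective)
        if objective(best) <= objective(w):
            return w            # no neighbor is better: stop (possibly a plateau)
        w = best
    return w

# Toy objective with its maximum at w = (1, -2).
objective = lambda w: -((w[0] - 1) ** 2 + (w[1] + 2) ** 2)
print(local_search(objective, [0.0, 0.0]))
```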

Perceptron: Accuracy via Zero-One Loss
[Diagram: the perceptron from before.] Objective: classification accuracy on the training data.
Issue: accuracy has many plateaus, so how do we measure incremental progress?

Soft-Max
Score for y = +1: w · f(x). Score for y = -1: -w · f(x).
Probability of label: P(y = +1 | f(x); w) = e^{w·f(x)} / (e^{w·f(x)} + e^{-w·f(x)}).
Objective: choose w to maximize the likelihood of the training labels, or equivalently its log: ll(w) = Σ_i log P(y^(i) | f(x^(i)); w).
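A sketch of these quantities for the binary case, assuming the two scores are +w·f(x) and -w·f(x) as above; the weights and the training pairs are made up.

```python
import math

# Binary soft-max: scores +w.f(x) and -w.f(x) turned into a probability, and
# the log-likelihood summed over (made-up) training examples.

def prob_plus(w, f):
    s = sum(wi * fi for wi, fi in zip(w, f))          # score for y = +1
    return math.exp(s) / (math.exp(s) + math.exp(-s))

def log_likelihood(w, data):
    ll = 0.0
    for f, y in data:
        p = prob_plus(w, f)
        ll += math.log(p if y == +1 else 1.0 - p)
    return ll

data = [([1.0, 0.5], +1), ([-0.5, 1.0], -1)]          # (features, label) pairs
print(log_likelihood([0.3, -0.2], data))
```

Unlike zero-one loss, this objective changes smoothly as w changes, which is what makes gradient-based training possible.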

Two-Layer Neural Network
[Diagram: the same two-layer architecture; features f1, f2, f3 feed hidden units (weighted sum plus >0? threshold), whose outputs feed a final weighted sum that produces the score.]

N-Layer Neural Network
[Diagram: the same N-layer architecture; every hidden unit computes a weighted sum followed by a >0? threshold, and the final layer outputs a score rather than a thresholded value.]

Our Status
Our objective ll(w) changes smoothly with changes in w and doesn't suffer from the same plateaus as the perceptron network. Challenge: how do we find a good w, i.e., how do we solve max_w ll(w)? Equivalently: min_w -ll(w).

1-D Optimization
Could evaluate g(w0 + h) and g(w0 - h), then step in the best direction. Or, evaluate the derivative dg(w0)/dw ≈ (g(w0 + h) - g(w0 - h)) / (2h), which tells us which direction to step in.
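A quick illustration of both options on a made-up 1-D function g:

```python
# 1-D optimization step: either compare g(w0 + h) and g(w0 - h) directly, or
# estimate the derivative with a finite difference and step in its direction.
# The function g and the numbers are made up for illustration.

g = lambda w: -(w - 3.0) ** 2           # toy objective with its maximum at w = 3
w0, h = 1.0, 1e-4

better_direction = +1 if g(w0 + h) > g(w0 - h) else -1
derivative = (g(w0 + h) - g(w0 - h)) / (2 * h)

print(better_direction)                 # +1: moving right improves g
print(derivative)                       # ~4.0: positive, so step to the right
```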

2-D Optimization Source: Thomas Jungblut’s Blog

Steepest Descent
Idea: start somewhere; repeat: take a step in the steepest descent direction.
[Figure source: Mathworks]

Steepest Direction
The steepest direction is the direction of the gradient: ∇g = (∂g/∂w1, ∂g/∂w2, ..., ∂g/∂wn).

Optimization Procedure: Gradient Descent
Init: pick an initial w.
For i = 1, 2, …: w ← w - α ∇g(w)
α is the learning rate, a tweaking parameter that needs to be chosen carefully. How? Try multiple choices; a crude rule of thumb is that each update should change w by about 0.1-1%.
Important detail we haven't discussed (yet): how do we compute ∇g(w)?
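A sketch of the procedure on a toy objective; the objective, learning rate, and iteration count are placeholders rather than course-specified values.

```python
# Gradient descent: repeat w <- w - alpha * grad(w).  The toy objective is
# g(w) = (w0 - 1)^2 + (w1 + 2)^2, whose minimum is at (1, -2).

def grad(w):
    return [2 * (w[0] - 1), 2 * (w[1] + 2)]   # gradient of the toy objective

alpha = 0.1                    # learning rate: has to be chosen carefully
w = [0.0, 0.0]                 # init
for i in range(100):
    w = [wi - alpha * gi for wi, gi in zip(w, grad(w))]
print(w)                       # close to [1, -2]
```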

Computing Gradients
The gradient of the full objective is the sum of the gradients of its per-example terms. How do we compute the gradient of one term at the current weights?

N-Layer Neural Network
[Diagram: the same N-layer network as before, repeated for reference.]

Represent as Computational Graph
[Diagram: inputs x and W are multiplied (*) to produce scores s; s feeds a hinge loss; the loss is added (+) to a regularization term R to give the total loss L.]

Worked example (several slides): a small computational graph with inputs x = -2, y = 5, z = -4. Want: the gradient of the output with respect to each of x, y, and z. Chain rule: working backward from the output, each gradient is the node's local gradient times the gradient flowing in from above; a code sketch follows below.
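The input values match the classic worked example f(x, y, z) = (x + y) · z; assuming that is the function on these slides, the forward and backward passes look like this:

```python
# Backprop on f(x, y, z) = (x + y) * z with x = -2, y = 5, z = -4.
# (The function itself is an assumption; only the input values appear in the text.)

x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y                 # q = 3
f = q * z                 # f = -12

# Backward pass: apply the chain rule from the output back to the inputs.
df_df = 1.0
df_dq = z * df_df         # multiply gate: local gradient w.r.t. q is z  -> -4
df_dz = q * df_df         # multiply gate: local gradient w.r.t. z is q  ->  3
df_dx = 1.0 * df_dq       # add gate: local gradient is 1, passes it on  -> -4
df_dy = 1.0 * df_dq       #                                              -> -4

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```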

[Diagram: a single node f in the graph. During the forward pass it receives activations from its inputs and sends its own activation forward. During the backward pass it receives the gradient of the loss with respect to its output, multiplies it by its "local gradient" with respect to each input, and sends the resulting gradients back to its inputs.]
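One way to make this picture concrete: each gate caches its input activations on the forward pass and, on the backward pass, multiplies the gradient arriving from above by its local gradients. A minimal sketch; the class name and interface are illustrative, not from the slides.

```python
# A single gate in the graph: forward() remembers its activations, backward()
# returns [local gradient] * [gradient from above] for each input.

class MultiplyGate:
    def forward(self, a, b):
        self.a, self.b = a, b          # cache activations for the backward pass
        return a * b

    def backward(self, grad_out):
        # local gradients of a*b: d/da = b, d/db = a
        return self.b * grad_out, self.a * grad_out

gate = MultiplyGate()
print(gate.forward(3.0, -4.0))         # forward: -12.0
print(gate.backward(1.0))              # backward: (-4.0, 3.0)
```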

Another example (several slides): a graph whose inputs w0, x0, w1, x1, w2 pass through multiply, add, and other simple gates. Working backward, each gradient is [local gradient] × [its gradient from above]: through a *(-1) gate, (-1) × (-0.20) = 0.20; an add gate passes the gradient to both inputs, [1] × [0.2] = 0.2 each; at the first multiply gate, the gradient for x0 is [2] × [0.2] = 0.4 (its local gradient is w0 = 2) and the gradient for w0 is [-1] × [0.2] = -0.2 (its local gradient is x0 = -1).

Sigmoid function: σ(x) = 1 / (1 + e^{-x}), with derivative dσ/dx = (1 - σ(x)) σ(x). The chain of simple gates above computes exactly this, so it can be treated as a single sigmoid gate.
Sigmoid gate: with output 0.73, the local gradient is (0.73) × (1 - 0.73) ≈ 0.20, so an upstream gradient of 1.0 becomes ≈ 0.20 flowing into the gate's input.
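These numbers match the standard example f(w, x) = σ(w0·x0 + w1·x1 + w2); only w0 = 2, x0 = -1, and the output 0.73 appear in the text above, so the remaining values in this sketch are assumptions chosen to reproduce that output.

```python
import math

# Assumed example: f(w, x) = sigmoid(w0*x0 + w1*x1 + w2).
# w0 = 2, x0 = -1 and the output 0.73 come from the slide fragments; the values
# w1 = -3, x1 = -2, w2 = -3 are assumptions consistent with that output.

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

s = w0 * x0 + w1 * x1 + w2              # 1.0
sigma = 1.0 / (1.0 + math.exp(-s))      # ~0.73

# Treat the whole sigmoid as one gate: its local gradient is sigma * (1 - sigma).
dsigma_ds = sigma * (1.0 - sigma)       # ~0.20

# Continue backprop through the first multiply gate (upstream gradient is 1.0):
dx0 = w0 * dsigma_ds                    # ~0.39 (the slides round this to 0.4)
dw0 = x0 * dsigma_ds                    # ~-0.2
print(round(sigma, 2), round(dsigma_ds, 2), round(dx0, 2), round(dw0, 2))
```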

Gradients add at branches: when a value feeds into more than one downstream node, the gradients flowing back along the branches are summed (+) at that value.

Mini-Batches and Stochastic Gradient Descent
Typical objective: max_w ll(w) = max_w (1/N) Σ_i log P(y^(i) | x^(i); w), the average log-likelihood of the label given the input, which can be estimated from a mini-batch of examples 1…k.
Mini-batch gradient descent: compute the gradient on a mini-batch (and cycle over mini-batches: 1…k, k+1…2k, …; make sure to randomize the permutation of the data!).
Stochastic gradient descent: k = 1.
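A sketch of the mini-batch loop; the model, its per-example gradient, the data, and all hyperparameters are placeholders.

```python
import random

# Mini-batch gradient descent: shuffle the data, walk over it k examples at a
# time, and update w with the average gradient of each mini-batch.  With k = 1
# this is stochastic gradient descent.

def grad_one_term(w, example):
    x, y = example
    return [(w[0] * x - y) * x]        # placeholder: least squares on y ~ w*x

def minibatch_sgd(w, data, k=2, alpha=0.1, epochs=10):
    for _ in range(epochs):
        random.shuffle(data)           # randomize the permutation of the data!
        for start in range(0, len(data), k):
            batch = data[start:start + k]
            g = [sum(grad_one_term(w, ex)[j] for ex in batch) / len(batch)
                 for j in range(len(w))]
            w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]   # made-up data, roughly y = 2x
print(minibatch_sgd([0.0], data))                          # w close to [2.0]
```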