CS 188: Artificial Intelligence


CS 188: Artificial Intelligence
Deep Learning
Instructor: Anca Dragan, University of California, Berkeley
[These slides created by Dan Klein, Pieter Abbeel, Anca Dragan, Josh Hug for CS188 Intro to AI at UC Berkeley. All CS188 materials available at http://ai.berkeley.edu. Please retain proper attribution, including the reference to ai.berkeley.edu. Thanks!]

Last Time: Linear Classifiers
Inputs are feature values f1, f2, f3, ...; each feature has a weight w1, w2, w3, ...; the sum is the activation: activation_w(f) = Σ_i w_i f_i. If the activation is positive, output +1; if negative, output -1.
[Diagram: features f1, f2, f3 are multiplied by weights w1, w2, w3, summed (Σ), and passed through a >0? threshold.]
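A minimal sketch of this decision rule; the weights and feature values below are made up for illustration, not taken from the course:

```python
# Linear classifier: weighted sum of features (the activation), thresholded at zero.
# Weights and feature values are made-up placeholders.

def classify(weights, features):
    activation = sum(w * f for w, f in zip(weights, features))
    return +1 if activation > 0 else -1

w = [0.5, -1.0, 2.0]      # one weight per feature: w1, w2, w3
f = [1.0, 0.3, 0.2]       # feature values f1, f2, f3
print(classify(w, f))      # activation = 0.5 - 0.3 + 0.4 = 0.6 > 0, so output +1
```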

Non-Linearity

Non-Linear Separators
Data that is linearly separable works out great for linear decision rules. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space, e.g. x → (x, x²), where it becomes linearly separable?
[Figure: 1-D data that no threshold on x can separate becomes separable after the mapping x → (x, x²).]
This and the next slide adapted from Ray Mooney, UT.
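A tiny sketch of that mapping on made-up 1-D data: no single threshold on x separates the two classes, but after lifting each point to (x, x²) a simple rule on the second coordinate does.

```python
# Made-up 1-D data: the positives sit between the negatives, so no threshold on
# x alone separates them.  Lifting x to (x, x**2) makes them separable, e.g. by
# the linear rule "second coordinate < 4".
positives = [-1.0, 0.0, 1.5]
negatives = [-3.0, 2.5, 4.0]

def lift(x):
    return (x, x ** 2)

for x in positives + negatives:
    label = "+" if x in positives else "-"
    print(label, lift(x), "predicted +:", lift(x)[1] < 4)
```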

Computer Vision

Object Detection

Manual Feature Design

Features and Generalization [Dalal and Triggs, 2005]

Features and Generalization
Hand-designed descriptors: HoG, SIFT, Daisy; deformable part models.
[Figure: an input image and its HoG representation.]

Manual Feature Design → Deep Learning
Manual feature design requires domain-specific expertise and domain-specific effort. What if we could learn the features, too? That is the idea behind deep learning.

Perceptron
[Diagram: features f1, f2, f3 are multiplied by weights w1, w2, w3, summed (Σ), and passed through a >0? threshold.]

Two-Layer Perceptron Network
[Diagram: features f1, f2, f3 feed three hidden units; each hidden unit computes a weighted sum (weights w_ij) followed by a >0? threshold, and the hidden outputs are combined by a final weighted sum and >0? threshold to produce the output.]

N-Layer Perceptron Network
[Diagram: features f1, f2, f3 feed a sequence of layers; every unit computes a weighted sum (Σ) of the previous layer's outputs followed by a >0? threshold, and the last layer produces the output.]
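A sketch of the forward pass through such a network of threshold units; the layer sizes and weight values are arbitrary placeholders, not the ones in the figure.

```python
# Forward pass through a multi-layer perceptron network: every unit computes a
# weighted sum of the previous layer's outputs and thresholds it at zero.
# Layer sizes and weights are placeholders.

def layer_forward(weights, inputs):
    # weights holds one row of incoming weights per unit in this layer
    return [1.0 if sum(w * x for w, x in zip(row, inputs)) > 0 else 0.0
            for row in weights]

def network_forward(layers, features):
    activations = features
    for weights in layers:
        activations = layer_forward(weights, activations)
    return activations

layers = [
    [[0.4, -0.6, 0.2], [-0.1, 0.5, 0.3]],   # hidden layer: 2 units over 3 features
    [[1.0, -1.0]],                          # output layer: 1 unit over 2 hidden outputs
]
print(network_forward(layers, [1.0, 2.0, 0.5]))
```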

Local Search
Simple, general idea: start wherever; repeat: move to the best neighboring state; if no neighbors are better than the current state, quit. Here, neighbors = small perturbations of w.
Properties: plateaus and local optima. How do we escape plateaus and find a good local optimum? How do we deal with very large parameter vectors?
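A sketch of this local-search loop on a weight vector, using small random perturbations as the neighbors; the objective, step size, and neighbor count are placeholders.

```python
import random

# Local search over w: repeatedly try small random perturbations ("neighbors")
# and keep the best one; quit when no neighbor improves the objective.

def local_search(objective, w, step=0.1, num_neighbors=20, max_iters=1000):
    for _ in range(max_iters):
        neighbors = [[wi + random.uniform(-step, step) for wi in w]
                     for _ in range(num_neighbors)]
        best = max(neighbors, key=objective)
        if objective(best) <= objective(w):
            return w            # no neighbor is better: stop (possibly a plateau)
        w = best
    return w

# Toy objective with its maximum at w = (1, -2).
objective = lambda w: -((w[0] - 1) ** 2 + (w[1] + 2) ** 2)
print(local_search(objective, [0.0, 0.0]))
```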

Perceptron: Accuracy via Zero-One Loss
[Diagram: the perceptron from before.] Objective: classification accuracy on the training data.
Issue: accuracy has many plateaus, so how do we measure incremental progress?

Soft-Max
Score for y = +1: w · f(x). Score for y = -1: -w · f(x).
Probability of label: P(y = +1 | f(x); w) = e^{w·f(x)} / (e^{w·f(x)} + e^{-w·f(x)}).
Objective: choose w to maximize the likelihood of the training labels, or equivalently its log: ll(w) = Σ_i log P(y^(i) | f(x^(i)); w).
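A sketch of these quantities for the binary case, assuming the two scores are +w·f(x) and -w·f(x) as above; the weights and the training pairs are made up.

```python
import math

# Binary soft-max: scores +w.f(x) and -w.f(x) turned into a probability, and
# the log-likelihood summed over (made-up) training examples.

def prob_plus(w, f):
    s = sum(wi * fi for wi, fi in zip(w, f))          # score for y = +1
    return math.exp(s) / (math.exp(s) + math.exp(-s))

def log_likelihood(w, data):
    ll = 0.0
    for f, y in data:
        p = prob_plus(w, f)
        ll += math.log(p if y == +1 else 1.0 - p)
    return ll

data = [([1.0, 0.5], +1), ([-0.5, 1.0], -1)]          # (features, label) pairs
print(log_likelihood([0.3, -0.2], data))
```

Unlike zero-one loss, this objective changes smoothly as w changes, which is what makes gradient-based training possible.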

Two-Layer Neural Network
[Diagram: the same two-layer architecture; features f1, f2, f3 feed hidden units (weighted sum plus >0? threshold), whose outputs feed a final weighted sum that produces the score.]

N-Layer Neural Network
[Diagram: the same N-layer architecture; every hidden unit computes a weighted sum followed by a >0? threshold, and the final layer outputs a score rather than a thresholded value.]

Our Status
Our objective ll(w) changes smoothly with changes in w and doesn't suffer from the same plateaus as the perceptron network. Challenge: how do we find a good w, i.e., how do we solve max_w ll(w)? Equivalently: min_w -ll(w).

1-D Optimization
Could evaluate g(w0 + h) and g(w0 - h), then step in the best direction. Or, evaluate the derivative dg(w0)/dw ≈ (g(w0 + h) - g(w0 - h)) / (2h), which tells us which direction to step in.
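A quick illustration of both options on a made-up 1-D function g:

```python
# 1-D optimization step: either compare g(w0 + h) and g(w0 - h) directly, or
# estimate the derivative with a finite difference and step in its direction.
# The function g and the numbers are made up for illustration.

g = lambda w: -(w - 3.0) ** 2           # toy objective with its maximum at w = 3
w0, h = 1.0, 1e-4

better_direction = +1 if g(w0 + h) > g(w0 - h) else -1
derivative = (g(w0 + h) - g(w0 - h)) / (2 * h)

print(better_direction)                 # +1: moving right improves g
print(derivative)                       # ~4.0: positive, so step to the right
```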

2-D Optimization Source: Thomas Jungblut’s Blog

Steepest Descent
Idea: start somewhere; repeat: take a step in the steepest descent direction.
[Figure source: Mathworks]

Steepest Direction
The steepest direction is the direction of the gradient: ∇g = (∂g/∂w1, ∂g/∂w2, ..., ∂g/∂wn).

Optimization Procedure: Gradient Descent
Init: pick an initial w.
For i = 1, 2, …: w ← w - α ∇g(w)
α is the learning rate, a tweaking parameter that needs to be chosen carefully. How? Try multiple choices; a crude rule of thumb is that each update should change w by about 0.1-1%.
Important detail we haven't discussed (yet): how do we compute ∇g(w)?
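A sketch of the procedure on a toy objective; the objective, learning rate, and iteration count are placeholders rather than course-specified values.

```python
# Gradient descent: repeat w <- w - alpha * grad(w).  The toy objective is
# g(w) = (w0 - 1)^2 + (w1 + 2)^2, whose minimum is at (1, -2).

def grad(w):
    return [2 * (w[0] - 1), 2 * (w[1] + 2)]   # gradient of the toy objective

alpha = 0.1                    # learning rate: has to be chosen carefully
w = [0.0, 0.0]                 # init
for i in range(100):
    w = [wi - alpha * gi for wi, gi in zip(w, grad(w))]
print(w)                       # close to [1, -2]
```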

Computing Gradients
The gradient of the full objective is the sum of the gradients of its per-example terms. How do we compute the gradient of one term at the current weights?

N-Layer Neural Network
[Diagram: the same N-layer network as before, repeated for reference.]

Represent as Computational Graph
[Diagram: inputs x and W are multiplied (*) to produce scores s; s feeds a hinge loss; the loss is added (+) to a regularization term R to give the total loss L.]

Worked example (several slides): a small computational graph with inputs x = -2, y = 5, z = -4. Want: the gradient of the output with respect to each of x, y, and z. Chain rule: working backward from the output, each gradient is the node's local gradient times the gradient flowing in from above; a code sketch follows below.
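The input values match the classic worked example f(x, y, z) = (x + y) · z; assuming that is the function on these slides, the forward and backward passes look like this:

```python
# Backprop on f(x, y, z) = (x + y) * z with x = -2, y = 5, z = -4.
# (The function itself is an assumption; only the input values appear in the text.)

x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y                 # q = 3
f = q * z                 # f = -12

# Backward pass: apply the chain rule from the output back to the inputs.
df_df = 1.0
df_dq = z * df_df         # multiply gate: local gradient w.r.t. q is z  -> -4
df_dz = q * df_df         # multiply gate: local gradient w.r.t. z is q  ->  3
df_dx = 1.0 * df_dq       # add gate: local gradient is 1, passes it on  -> -4
df_dy = 1.0 * df_dq       #                                              -> -4

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```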

[Diagram: a single node f in the graph. During the forward pass it receives activations from its inputs and sends its own activation forward. During the backward pass it receives the gradient of the loss with respect to its output, multiplies it by its "local gradient" with respect to each input, and sends the resulting gradients back to its inputs.]
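One way to make this picture concrete: each gate caches its input activations on the forward pass and, on the backward pass, multiplies the gradient arriving from above by its local gradients. A minimal sketch; the class name and interface are illustrative, not from the slides.

```python
# A single gate in the graph: forward() remembers its activations, backward()
# returns [local gradient] * [gradient from above] for each input.

class MultiplyGate:
    def forward(self, a, b):
        self.a, self.b = a, b          # cache activations for the backward pass
        return a * b

    def backward(self, grad_out):
        # local gradients of a*b: d/da = b, d/db = a
        return self.b * grad_out, self.a * grad_out

gate = MultiplyGate()
print(gate.forward(3.0, -4.0))         # forward: -12.0
print(gate.backward(1.0))              # backward: (-4.0, 3.0)
```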

Another example (several slides): a graph whose inputs w0, x0, w1, x1, w2 pass through multiply, add, and other simple gates. Working backward, each gradient is [local gradient] × [its gradient from above]: through a *(-1) gate, (-1) × (-0.20) = 0.20; an add gate passes the gradient to both inputs, [1] × [0.2] = 0.2 each; at the first multiply gate, the gradient for x0 is [2] × [0.2] = 0.4 (its local gradient is w0 = 2) and the gradient for w0 is [-1] × [0.2] = -0.2 (its local gradient is x0 = -1).

Sigmoid function: σ(x) = 1 / (1 + e^{-x}), with derivative dσ/dx = (1 - σ(x)) σ(x). The chain of simple gates above computes exactly this, so it can be treated as a single sigmoid gate.
Sigmoid gate: with output 0.73, the local gradient is (0.73) × (1 - 0.73) ≈ 0.20, so an upstream gradient of 1.0 becomes ≈ 0.20 flowing into the gate's input.
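These numbers match the standard example f(w, x) = σ(w0·x0 + w1·x1 + w2); only w0 = 2, x0 = -1, and the output 0.73 appear in the text above, so the remaining values in this sketch are assumptions chosen to reproduce that output.

```python
import math

# Assumed example: f(w, x) = sigmoid(w0*x0 + w1*x1 + w2).
# w0 = 2, x0 = -1 and the output 0.73 come from the slide fragments; the values
# w1 = -3, x1 = -2, w2 = -3 are assumptions consistent with that output.

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

s = w0 * x0 + w1 * x1 + w2              # 1.0
sigma = 1.0 / (1.0 + math.exp(-s))      # ~0.73

# Treat the whole sigmoid as one gate: its local gradient is sigma * (1 - sigma).
dsigma_ds = sigma * (1.0 - sigma)       # ~0.20

# Continue backprop through the first multiply gate (upstream gradient is 1.0):
dx0 = w0 * dsigma_ds                    # ~0.39 (the slides round this to 0.4)
dw0 = x0 * dsigma_ds                    # ~-0.2
print(round(sigma, 2), round(dsigma_ds, 2), round(dx0, 2), round(dw0, 2))
```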

Gradients add at branches: when a value feeds into more than one downstream node, the gradients flowing back along the branches are summed (+) at that value.

Mini-Batches and Stochastic Gradient Descent
Typical objective: max_w ll(w) = max_w (1/N) Σ_i log P(y^(i) | x^(i); w), the average log-likelihood of the label given the input, which can be estimated from a mini-batch of examples 1…k.
Mini-batch gradient descent: compute the gradient on a mini-batch (and cycle over mini-batches: 1…k, k+1…2k, …; make sure to randomize the permutation of the data!).
Stochastic gradient descent: k = 1.
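A sketch of the mini-batch loop; the model, its per-example gradient, the data, and all hyperparameters are placeholders.

```python
import random

# Mini-batch gradient descent: shuffle the data, walk over it k examples at a
# time, and update w with the average gradient of each mini-batch.  With k = 1
# this is stochastic gradient descent.

def grad_one_term(w, example):
    x, y = example
    return [(w[0] * x - y) * x]        # placeholder: least squares on y ~ w*x

def minibatch_sgd(w, data, k=2, alpha=0.1, epochs=10):
    for _ in range(epochs):
        random.shuffle(data)           # randomize the permutation of the data!
        for start in range(0, len(data), k):
            batch = data[start:start + k]
            g = [sum(grad_one_term(w, ex)[j] for ex in batch) / len(batch)
                 for j in range(len(w))]
            w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]   # made-up data, roughly y = 2x
print(minibatch_sgd([0.0], data))                          # w close to [2.0]
```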