1
Artificial Intelligence, Lecture 2
Dr. Bo Yuan, Professor
Department of Computer Science and Engineering, Shanghai Jiao Tong University
boyuan@sjtu.edu.cn

2
Review of Lecture One
Overview of AI
– Knowledge-based rules in logic (expert systems, automata, …): symbolism in logic
– Kernel-based heuristics (neural networks, SVMs, …): connectionism for nonlinearity
– Learning and inference (Bayesian, Markovian, …): sparse sampling for convergence
– Interactive and stochastic computing (uncertainty, heterogeneity): to overcome the limits of the Turing machine
Course Content
– Focus mainly on learning and inference
– Discuss current problems and research efforts
– Perception and behavior (vision, robotics, NLP, bionics, …) not included
Exam
– Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)
– Course materials

3
Today’s Content
Overview of machine learning
Linear regression
– Gradient descent
– Least-squares fit
– Stochastic gradient descent
– The normal equation
Applications

5
Basic Terminologies
x = input variables/features
y = output variable/target variable
(x, y) = a training example; the i-th training example = (x^(i), y^(i))
m = number of training examples (i = 1, …, m)
n = number of input variables/features (j = 0, …, n)
h(x) = hypothesis/function/model that outputs the predicted value for a given input x
θ = parameters/weights, which parameterize the mapping from x to its predicted value, giving h_θ(x)
We define x_0 = 1 (the intercept term), which allows a matrix representation:
h_θ(x) = θ^T x = Σ_{j=0}^{n} θ_j x_j
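As a minimal sketch of these definitions (the data and parameter values below are made-up assumptions, not from the lecture), the hypothesis with the intercept term x_0 = 1 can be written as:

```python
import numpy as np

# Hypothetical training data: m = 3 examples, n = 1 feature.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

# Prepend x_0 = 1 so theta[0] acts as the intercept.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

theta = np.array([0.0, 2.0])  # theta_0 (intercept), theta_1 (slope)

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x."""
    return x @ theta

print(h(theta, X_aug))  # predicted values for all m training examples
```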

8
Gradient Descent
The cost function is defined as:
J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Using the matrix of training samples, gradient descent is based on the partial derivatives of J with respect to θ_j:
∂J(θ)/∂θ_j = Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)
The (batch) algorithm is therefore:
Loop {
  θ_j := θ_j − α Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j)
}
There is an alternative way to iterate, called stochastic gradient descent, which updates the parameters using one training example at a time:
Loop {
  for i = 1 to m {
    θ_j := θ_j − α (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j)
  }
}
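The batch loop above can be sketched in a few lines of NumPy (the dataset, learning rate α, and iteration count are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2x, with x_0 = 1 prepended.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = np.zeros(2)
alpha = 0.05  # learning rate (assumed small enough for convergence)

for _ in range(2000):
    residual = X @ theta - y          # h_theta(x^(i)) - y^(i) for all i
    theta -= alpha * X.T @ residual   # batch update, for every j at once

print(theta)  # approaches [1, 2]
```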

9
Normal Equation
An explicit way to directly obtain θ without iteration:
θ = (X^T X)^(−1) X^T y

10
The Optimization Problem by the Normal Equation
We set the derivatives of J(θ) to zero, ∇_θ J(θ) = X^T X θ − X^T y = 0, and obtain the normal equations:
X^T X θ = X^T y,  hence  θ = (X^T X)^(−1) X^T y
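The closed form can be checked with the same made-up dataset used earlier (solving the linear system is preferred over forming the explicit inverse for numerical stability):

```python
import numpy as np

# Hypothetical data: y = 1 + 2x exactly, with x_0 = 1 prepended.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Normal equations: X^T X theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # exact fit: [1, 2]
```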

11
Today’s Content
Linear Regression
– Locally weighted regression (an adaptive method)
Probabilistic Interpretation
– Maximum likelihood estimation vs. least squares (Gaussian distribution)
Classification by Logistic Regression
– LMS updating
– A perceptron-based learning algorithm

12
Linear Regression
1. Number of features
2. Over-fitting and under-fitting issues
3. Feature selection problem (to be covered later)
4. Adaptivity issues
Some definitions:
– Parametric learning: a fixed set of parameters θ, with n constant
– Non-parametric learning: the number of parameters grows (linearly) with m
Locally weighted regression (Loess/Lowess regression) is non-parametric:
– Each training example is weighted by a bell-shaped function of its distance from the query point (bell-shaped, but not a Gaussian density)
– Every prediction requires fitting over the entire training set for the given query input, which raises the computational complexity.
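A minimal sketch of locally weighted regression under these definitions, using the common bell-shaped weighting w^(i) = exp(−(x^(i) − x)² / (2τ²)); the bandwidth tau and the dataset are illustrative assumptions:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query by solving the weighted normal equations
    with bell-shaped weights centered on the query point."""
    # Weights are near 1 for training points close to x_query.
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

# Hypothetical data, exactly linear, so LWR should recover the line.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(lwr_predict(1.5, X, y))  # close to 1 + 2 * 1.5 = 4
```

Note that the whole training set is consulted for each query, which is exactly the computational cost the slide warns about.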

13
Extension of Linear Regression
The linear model can be built on different basis functions of the raw input x:
– Linear additive (straight line): x_1 = 1, x_2 = x
– Polynomial: x_1 = 1, x_2 = x, …, x_n = x^(n−1)
– Chebyshev orthogonal polynomials: x_1 = 1, x_2 = x, …, x_n = 2x·x_(n−1) − x_(n−2)
– Fourier trigonometric polynomial: x_1 = 0.5, followed by sines and cosines of different frequencies of x
– Pairwise interactions: the linear terms + x_k1·x_k2 (k = 1, …, N)
– …
The central question underlying these representations is whether the resulting optimization problem for θ is convex. Since each model is still linear in the parameters θ, the least-squares problem remains convex.
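For instance (a sketch with made-up data), a polynomial basis turns a nonlinear fit in x into an ordinary linear least-squares problem in θ:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2   # hypothetical quadratic target

# Polynomial basis: columns [1, x, x^2]; the model stays linear in theta.
X = np.vander(x, N=3, increasing=True)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers [1, 2, 3]
```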

14
Probabilistic Interpretation
Why ordinary least squares (OLS)? Why not other power terms?
– Assume y^(i) = θ^T x^(i) + ε^(i), where ε^(i) is random noise
– Assume ε^(i) ~ N(0, σ²); the PDF of the Gaussian is p(ε) = (1/√(2πσ²)) exp(−ε²/(2σ²))
– This implies that p(y^(i) | x^(i); θ) = (1/√(2πσ²)) exp(−(y^(i) − θ^T x^(i))²/(2σ²))
– Or: y^(i) | x^(i); θ ~ N(θ^T x^(i), σ²)
Why Gaussian for the random noise? The central limit theorem: the sum of many small independent effects tends toward a Gaussian.

15
Maximum Likelihood (updated)
Consider the training data as stochastic, and assume the ε^(i) are i.i.d. (independently and identically distributed).
– The likelihood of the parameters is L(θ) = p(y | X; θ) = Π_{i=1}^{m} p(y^(i) | x^(i); θ): the probability of y given x, parameterized by θ
What is maximum likelihood estimation (MLE)?
– Choose the parameters θ that maximize L(θ), so as to make the observed training data as probable as possible
– Note: L(θ) is a function of the parameters, and its value is the probability of the data.

16
The Equivalence of MLE and OLS
Maximizing the log-likelihood
ℓ(θ) = log L(θ) = m·log(1/√(2πσ²)) − (1/σ²)·(1/2) Σ_{i=1}^{m} (y^(i) − θ^T x^(i))²
is equivalent to minimizing
(1/2) Σ_{i=1}^{m} (y^(i) − θ^T x^(i))² = J(θ)!
So under the Gaussian noise assumption, MLE and ordinary least squares yield the same θ.

17
Sigmoid (Logistic) Function
g(z) = 1 / (1 + e^(−z)), giving the classification hypothesis h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x)).
Other functions that smoothly increase from 0 to 1 can also be found, but for a couple of good reasons (we will see next time, with generalized linear models) the choice of the logistic function is a natural one.

18
Recall the LMS-style update (note the positive sign for ascent rather than the negative sign for descent). Let's work with just one training example (x, y) and derive the gradient ascent rule on the log-likelihood:
θ_j := θ_j + α (y − h_θ(x)) x_j
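A minimal sketch of this one-example-at-a-time gradient ascent rule on a made-up, linearly separable dataset (the learning rate and epoch count are assumptions):

```python
import numpy as np

def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D data with intercept term: class 1 iff x > 2.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
alpha = 0.1

for _ in range(5000):
    for x_i, y_i in zip(X, y):
        # Gradient ascent on the log-likelihood, one example at a time.
        theta += alpha * (y_i - g(x_i @ theta)) * x_i

preds = (g(X @ theta) >= 0.5).astype(float)
print(preds)  # matches y on this separable data
```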

19
One Useful Property of the Logistic Function
g′(z) = g(z) (1 − g(z))
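This identity can be verified numerically; here is a quick sketch comparing the closed form against a central finite difference:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
analytic = g(z) * (1 - g(z))                     # g'(z) = g(z)(1 - g(z))
eps = 1e-6
numeric = (g(z + eps) - g(z - eps)) / (2 * eps)  # central difference
print(np.max(np.abs(analytic - numeric)))        # tiny discrepancy
```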

20
Identical to Least Squares Again?
The update rule θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i) looks identical to the LMS rule for linear regression, but it is not the same algorithm: here h_θ(x^(i)) is the nonlinear logistic function of θ^T x^(i).
