# Artificial Intelligence Lecture 2 Dr. Bo Yuan, Professor Department of Computer Science and Engineering Shanghai Jiaotong University

Artificial Intelligence Lecture 2 Dr. Bo Yuan, Professor Department of Computer Science and Engineering Shanghai Jiaotong University boyuan@sjtu.edu.cn

## Review of Lecture One

Overview of AI
- Knowledge-based rules in logic (expert systems, automata, …): symbolism in logic
- Kernel-based heuristics (neural networks, SVMs, …): connectionism for nonlinearity
- Learning and inference (Bayesian, Markovian, …): sparse sampling for convergence
- Interactive and stochastic computing (uncertainty, heterogeneity): overcoming the limits of the Turing machine

Course Content
- Focus mainly on learning and inference
- Discuss current problems and research efforts
- Perception and behavior (vision, robotics, NLP, bionics, …) not included

Exam
- Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)
- Course materials

## Today's Content

- Overview of machine learning
- Linear regression
  - Gradient descent
  - Least-squares fit
  - Stochastic gradient descent
  - The normal equation
- Applications

## Basic Terminology

- $x$ = input variables/features
- $y$ = output variables/target variables
- $(x, y)$ = a training example; the $i$-th training example is $(x^{(i)}, y^{(i)})$
- $m$ = number of training examples ($i = 1, \ldots, m$)
- $n$ = number of input variables/features ($j = 0, \ldots, n$)
- $h(x)$ = hypothesis/function/model that outputs the predicted value for a given input $x$
- $\theta$ = parameters/weights, which parameterize the mapping from $X$ to its predicted value

We define $x_0 = 1$ (the intercept term), which allows a compact matrix representation:

$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$$

## Gradient Descent

The cost function is defined as:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Using the matrix of training samples, gradient descent is based on the partial derivatives of $J$ with respect to $\theta$:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

The (batch) algorithm is therefore:

Loop {
  $\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$  (for every $j$)
}

There is an alternative that iterates one example at a time, called stochastic gradient descent:

For $i = 1$ to $m$ {
  $\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$  (for every $j$)
}
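The two update rules above can be sketched in plain Python on a made-up one-dimensional data set (function names such as `batch_gd` and the data are illustrative, not from the lecture):

```python
# Batch and stochastic gradient descent for linear regression on a tiny
# synthetic data set that follows y = 1 + 2x exactly (x0 = 1 is prepended).

def h(theta, x):
    """Hypothesis h_theta(x) = sum_j theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))

def batch_gd(X, y, alpha=0.05, iters=2000):
    """Batch rule: each step sums the error over ALL m examples."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        grads = [sum((h(theta, X[i]) - y[i]) * X[i][j] for i in range(len(X)))
                 for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta

def stochastic_gd(X, y, alpha=0.05, epochs=200):
    """Stochastic rule: update theta after every single example."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - h(theta, xi)
            theta = [t + alpha * err * xj for t, xj in zip(theta, xi)]
    return theta

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = batch_gd(X, y)  # converges near [1.0, 2.0]
```

On this exactly linear data both variants converge to the same parameters; in general, the stochastic rule trades per-step accuracy for much cheaper updates.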

## Normal Equation

An explicit, closed-form way to obtain $\theta$ directly, without iteration.

## The Optimization Problem by the Normal Equation

We set the derivatives $\nabla_\theta J(\theta)$ to zero and obtain the normal equations:

$$X^T X \theta = X^T y \quad \Longrightarrow \quad \theta = (X^T X)^{-1} X^T y$$
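A minimal NumPy sketch of the closed-form solution on made-up data (solving the linear system is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

# Normal-equation fit: solve X^T X theta = X^T y in closed form.
# Synthetic data following y = 1 + 2x, with x0 = 1 as the first column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = np.linalg.solve(X.T @ X, X.T @ y)  # recovers [1.0, 2.0] here
```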

## Today's Content

- Linear regression
  - Locally weighted regression (an adaptive method)
- Probabilistic interpretation
  - Maximum likelihood estimation vs. least squares (Gaussian distribution)
- Classification by logistic regression
  - LMS updating
  - A perceptron-based learning algorithm

## Linear Regression

1. Number of features
2. Over-fitting and under-fitting issues
3. Feature selection problem (to be covered later)
4. Adaptivity issue

Some definitions:

- Parametric learning: a fixed set of parameters $\theta$, with $n$ constant
- Non-parametric learning: the number of parameters grows with $m$ (here, linearly)

Locally weighted regression (LOESS/LOWESS regression) is non-parametric:

- A bell-shaped weighting (not a Gaussian density), e.g. $w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)$
- Every prediction requires fitting on the entire training set for the given query input, which drives up the computational complexity
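A sketch of locally weighted regression using the bell-shaped weighting above; `lwr_predict` and the synthetic quadratic data are illustrative, not from the lecture:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.2):
    """Predict at x_query by solving the weighted normal equation
    (X^T W X) theta = X^T W y, with bell-shaped weights
    w_i = exp(-(x_i - x_query)^2 / (2 tau^2)).
    The whole training set is reused for every query (non-parametric)."""
    d = X[:, 1] - x_query                   # distance on the raw feature
    w = np.exp(-d**2 / (2.0 * tau**2))      # bell-shaped weights
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta @ np.array([1.0, x_query])  # local line at the query point

# Synthetic nonlinear data: y = x^2 on a grid, with an x0 = 1 column.
X = np.array([[1.0, x] for x in np.linspace(0.0, 3.0, 31)])
y = X[:, 1] ** 2
pred = lwr_predict(X, y, 1.5)  # close to 1.5^2 = 2.25
```

The bandwidth $\tau$ controls the bias-variance trade-off: a small $\tau$ tracks local curvature closely, a large $\tau$ approaches a single global line.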

## Extensions of Linear Regression

- Linear additive (straight line): $x_1 = 1$, $x_2 = x$
- Polynomial: $x_1 = 1$, $x_2 = x$, …, $x_n = x^{n-1}$
- Chebyshev orthogonal polynomial: $x_1 = 1$, $x_2 = x$, …, $x_n = 2x \, x_{n-1} - x_{n-2}$
- Fourier trigonometric polynomial: $x_1 = 0.5$, followed by sines and cosines of different frequencies of $x$
- Pairwise interaction: linear terms plus products $x_{k_1} x_{k_2}$ ($k_1, k_2 = 1, \ldots, N$)
- …

The central question underlying these representations is whether the optimization problem for $\theta$ remains convex. Because each model is still linear in $\theta$, the least-squares objective stays convex.
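Each basis above reduces to ordinary least squares on a transformed design matrix. A small NumPy sketch with the polynomial basis (the target coefficients are made up for illustration):

```python
import numpy as np

# Polynomial basis expansion: the features are nonlinear in x, but the
# model is still linear in theta, so the problem remains convex.
x = np.linspace(-2.0, 2.0, 15)
y = 1.0 - x + 0.5 * x**2               # synthetic target y = 1 - x + 0.5 x^2

X = np.vander(x, 3, increasing=True)   # columns: 1, x, x^2
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the data are generated exactly by the model, least squares recovers the coefficients $(1, -1, 0.5)$.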

## Probabilistic Interpretation

Why ordinary least squares (OLS)? Why not other powers of the error?

- Assume $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$, where $\epsilon^{(i)}$ is random noise
- The PDF of a Gaussian is $p(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\epsilon^2}{2\sigma^2} \right)$
- With $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, this implies $p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$
- Equivalently, $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$

Why Gaussian for the random noise? The central limit theorem: the sum of many small independent effects tends toward a Gaussian.

## Maximum Likelihood (updated)

Consider the training data as stochastic, and assume the $\epsilon^{(i)}$ are i.i.d. (independently, identically distributed).

- The likelihood of the parameters is $L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)$, the probability of the data $y$ given $X$, parameterized by $\theta$

What is maximum likelihood estimation (MLE)?

- Choose the parameters $\theta$ that maximize $L(\theta)$, so as to make the training data as probable as possible
- $L(\theta)$ is a function of the parameters; its value is the probability of the data

## The Equivalence of MLE and OLS

Maximizing $L(\theta)$ is equivalent to maximizing the log-likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$$

Maximizing $\ell(\theta)$ therefore amounts to minimizing $\frac{1}{2} \sum_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2 = J(\theta)$: under Gaussian noise, MLE is exactly ordinary least squares.

## Sigmoid (Logistic) Function

$$g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

Other functions that smoothly increase from 0 to 1 could also be used, but for a couple of good reasons (we will see next time, with Generalized Linear Models) the choice of the logistic function is a natural one.

Recall the update rule (note the positive sign rather than negative, since we now maximize the log-likelihood):

$$\theta_j := \theta_j + \alpha \frac{\partial \ell(\theta)}{\partial \theta_j}$$

Let's work with just one training example $(x, y)$ and derive the gradient ascent rule:

$$\ell(\theta) = y \log h_\theta(x) + (1 - y) \log\left( 1 - h_\theta(x) \right)$$

$$\frac{\partial \ell(\theta)}{\partial \theta_j} = \left( y - h_\theta(x) \right) x_j \quad \Longrightarrow \quad \theta_j := \theta_j + \alpha \left( y - h_\theta(x) \right) x_j$$
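The one-example ascent rule can be sketched in plain Python; the toy data and the name `logistic_ascent` are made up for illustration:

```python
import math

def g(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_ascent(X, y, alpha=0.1, epochs=500):
    """Stochastic gradient ASCENT on the log-likelihood:
    theta_j := theta_j + alpha * (y - h_theta(x)) * x_j  (note the + sign)."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            hx = g(sum(t * xj for t, xj in zip(theta, xi)))
            theta = [t + alpha * (yi - hx) * xj for t, xj in zip(theta, xi)]
    return theta

# Toy separable data: label 1 when x > 2; x0 = 1 is the intercept feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0.0, 0.0, 1.0, 1.0]
theta = logistic_ascent(X, y)
```

After training, $h_\theta(x) = g(\theta^T x)$ falls below 0.5 for the negative examples and above 0.5 for the positive ones, i.e. the decision boundary lands between $x = 1$ and $x = 3$.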

## One Useful Property of the Logistic Function

$$g'(z) = \frac{d}{dz} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = g(z) \left( 1 - g(z) \right)$$

This identity is what makes the gradient of the log-likelihood so simple.

## Identical to Least Squares Again?

The resulting update rule,

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)},$$

looks identical to the LMS rule for linear regression, but it is not the same algorithm: $h_\theta(x^{(i)})$ is now the nonlinear sigmoid of $\theta^T x^{(i)}$. This is not a coincidence; the deeper reason emerges with Generalized Linear Models (next lecture).
