2010 Winter School on Machine Learning and Vision Sponsored by Canadian Institute for Advanced Research and Microsoft Research India With additional support from Indian Institute of Science, Bangalore and The University of Toronto, Canada

Agenda Saturday Jan 9 – Sunday Jan 10: Preparatory Lectures Monday Jan 11 – Saturday Jan 16: Tutorials and Research Lectures Sunday Jan 17: Discussion and closing

Speakers William Freeman, MIT Brendan Frey, University of Toronto Yann LeCun, New York University Jitendra Malik, UC Berkeley Bruno Olshaussen, UC Berkeley B Ravindran, IIT Madras Sunita Sarawagi, IIT Bombay Manik Varma, MSR India Martin Wainwright, UC Berkeley Yair Weiss, Hebrew University Richard Zemel, University of Toronto

Winter School Organization Co-Chairs: Brendan Frey, University of Toronto; Manik Varma, Microsoft Research India. Local Organization: KR Ramakrishnan, IISc, Bangalore; B Ravindran, IIT, Madras; Sunita Sarawagi, IIT, Bombay. CIFAR and MSRI: Dr P Anandan, Managing Director, MSRI; Michael Hunter, Research Officer, CIFAR; Vidya Natampally, Director Strategy, MSRI; Dr Sue Schenk, Programs Director, CIFAR; Ashwani Sharma, Manager Research, MSRI; Dr Mel Silverman, VP Research, CIFAR

The Canadian Institute for Advanced Research (CIFAR) Objective: To fund networks of internationally leading researchers, and their students and postdoctoral fellows. Programs: Neural computation and perception (vision); Genetic networks; Cosmology and gravitation; Nanotechnology; Successful societies; … Track record: 13 Nobel prizes (8 current)

Neural Computation and Perception (Vision) Goal: Develop computational models for human-spectrum vision. Members: Geoff Hinton, Director, Toronto; Yoshua Bengio, Montreal; Michael Black, Brown; David Fleet, Toronto; Nando De Freitas, UBC; Bill Freeman*, MIT; Brendan Frey*, Toronto; Yann LeCun*, NYU; David Lowe, UBC; David MacKay, U Cambridge; Bruno Olshaussen*, Berkeley; Sam Roweis, NYU; Nikolaus Troje, Queens; Martin Wainwright*, Berkeley; Yair Weiss*, Hebrew Univ; Hugh Wilson, York Univ; Rich Zemel*, Toronto; …

Introduction to Machine Learning Brendan J. Frey, University of Toronto

Textbook: Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. To avoid cluttering slides with citations, I'll cite sources only when the material is not presented in the textbook.

Analyzing video How can we develop algorithms that will Track objects? Recognize objects? Segment objects? Denoise the video? Determine the state (e.g., gait) of each object? …and do all this in 24 hours?

Handwritten digit clustering and recognition How can we develop algorithms that will Automatically cluster these images? Use a training set of labeled images to learn to classify new images? Discover how to account for variability in writing style?

Document analysis How can we develop algorithms that will Produce a summary of the document? Find similar documents? Predict document layouts that are suitable for different readers?

Bioinformatics How can we develop algorithms that will Identify regions of DNA that have high levels of transcriptional activity in specific tissues? Find start sites and stop sites of genes, by looking for common patterns of activity? Find out-of-place activity patterns and label their DNA regions as being non-functional? (Figure: DNA activity, from low to high, across mouse tissues by position in DNA.)

The machine learning algorithm development pipeline Problem statement ("Given training vectors x_1,…,x_N and targets t_1,…,t_N, find…") → Mathematical description of a cost function (e.g. E(w), L(·), p(x|w)) → Mathematical description of how to minimize the cost function (e.g. setting ∂E/∂w_i = 0) → Implementation (e.g. update rules such as r(i,k) = s(i,k) - max_j { s(i,j) + a(i,j) }, …)

Tracking using hand-labeled coordinates To track the man in the striped shirt, we could 1. Hand-label his horizontal position in some frames 2. Extract a feature, such as the location of a sinusoidal (stripe) pattern in a horizontal scan line 3. Relate the real-valued feature to the true labeled position (Figure: pixel intensity across the horizontal scan line, giving the feature x, and a scatter plot of feature x against the hand-labeled horizontal coordinate t.)

Tracking using hand-labeled coordinates How do we develop an algorithm that relates our input feature x to the hand-labeled target t? (Figures: feature x plotted against the hand-labeled horizontal coordinate t.)

Regression: Problem set-up Input: x, Target: t, Training data: (x_1,t_1),…,(x_N,t_N). t is assumed to be a noisy measurement of an unknown (ground truth) function applied to x. Here, x is the feature extracted from the video frame and t is the horizontal position of the object.

Example: Polynomial curve fitting y(x,w) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M Regression: learn the parameters w = (w_0, w_1, …, w_M)
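
A minimal curve-fitting sketch (not from the lecture; this uses Python/NumPy, and names such as fit_polynomial are illustrative): build the matrix of powers of x and solve the least-squares problem.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of y(x,w) = w_0 + w_1 x + ... + w_M x^M."""
    X = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2, ..., x^M
    w, *_ = np.linalg.lstsq(X, t, rcond=None)  # minimizes the sum of squared errors
    return w

# Toy data in the spirit of the textbook's running example: a noisy sinusoid
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)
w = fit_polynomial(x, t, M=3)
print(w)                                       # learned coefficients w_0 ... w_3
```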

Linear regression The form y(x,w) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M is linear in the w's. Instead of x, x^2, …, x^M, we can generally use basis functions: y(x,w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … + w_M φ_M(x)

Multi-input linear regression y(x,w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … + w_M φ_M(x). x and φ_1(),…,φ_M() are known, so the task of learning w doesn't change if x is replaced with a vector of inputs x: y(x,w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … + w_M φ_M(x). Now, each φ_m(x) maps a vector to a real number. A special case is linear regression for a linear model: φ_m(x) = x_m. Example: x = an entire scan line.
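
The same machinery works for any choice of basis functions. As an illustration (my own sketch, not the lecture's code; the Gaussian-bump basis and its width s are assumptions), here is a design matrix built from basis functions laid out along a 1-D input:

```python
import numpy as np

def gaussian_design_matrix(x, centers, s=0.1):
    """Phi[n, m] = exp(-(x_n - mu_m)^2 / (2 s^2)), plus a leading column of ones for w_0."""
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

x = np.linspace(0, 1, 50)
centers = np.linspace(0, 1, 9)           # 9 basis functions laid out along the 1-D input
Phi = gaussian_design_matrix(x, centers)
print(Phi.shape)                          # (50, 10): one column per basis function plus the bias
```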

Multi-input linear regression If we like, we can create a set of basis functions and lay them out in the D-dimensional input space (illustrated in 1-D and 2-D). Problem: the curse of dimensionality.

The curse of dimensionality Distributing bins or basis functions uniformly in the input space may work in one dimension, but the number needed grows exponentially with the number of input dimensions, making this approach useless in higher dimensions.
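
A quick back-of-the-envelope illustration of the point (the numbers are illustrative, not from the lecture): with K bins per input dimension, covering the input space uniformly needs K^D cells, which grows exponentially with D.

```python
K = 10                    # bins per input dimension (illustrative)
for D in (1, 2, 3, 10):
    print(D, K ** D)      # 10, 100, 1000, 10000000000 cells to fill with data
```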

Objective of regression: Minimize the error E(w) = ½ Σ_n ( t_n - y(x_n,w) )^2. This is called the Sum of Squared Error, or SSE. Other forms: Mean Squared Error, MSE = (1/N) Σ_n ( t_n - y(x_n,w) )^2; Root Mean Squared Error, E_RMS = sqrt( (1/N) Σ_n ( t_n - y(x_n,w) )^2 )
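
A small sketch of the three error measures for a vector of targets t and predictions y (for example y = Phi @ w from a fitted model; the function name is illustrative):

```python
import numpy as np

def error_measures(t, y):
    """Sum of squared error, mean squared error, and root mean squared error."""
    sse = 0.5 * np.sum((t - y) ** 2)
    mse = np.mean((t - y) ** 2)
    rmse = np.sqrt(mse)
    return sse, mse, rmse
```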

How the observed error propagates back to the parameters E(w) = ½ Σ_n ( t_n - Σ_m w_m φ_m(x_n) )^2. The rate of change of E w.r.t. w_m is ∂E(w)/∂w_m = - Σ_n ( t_n - y(x_n,w) ) φ_m(x_n). The influence of input φ_m(x_n) on E(w) is given by weighting the error for each training case by φ_m(x_n).

Gradient-based algorithms
Gradient descent
– Initially, set w to small random values
– Repeat until it's time to stop:
For m = 0…M: Δ_m ← - Σ_n ( t_n - y(x_n,w) ) φ_m(x_n), or Δ_m ← ( E(w_0,…,w_m+ε,…,w_M) - E(w_0,…,w_m,…,w_M) ) / ε, where ε is tiny (a finite-difference approximation to ∂E(w)/∂w_m)
For m = 0…M: w_m ← w_m - η Δ_m, where η is the learning rate
Off-the-shelf conjugate gradients optimizer: you provide a function that, given w, returns E(w) and ∂E/∂w_0,…,∂E/∂w_M (a total of M+2 numbers)
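
A minimal gradient-descent sketch for the sum-of-squares error, given a design matrix Phi with Phi[n, m] = φ_m(x_n); the learning rate eta, the iteration count, and the initialization scale are illustrative choices, not values from the lecture.

```python
import numpy as np

def gradient_descent(Phi, t, eta=0.01, iters=5000):
    """Minimize E(w) = 0.5 * sum_n (t_n - Phi[n] @ w)^2 by gradient descent."""
    rng = np.random.default_rng(0)
    w = 0.01 * rng.standard_normal(Phi.shape[1])   # small random initial weights
    for _ in range(iters):
        grad = -Phi.T @ (t - Phi @ w)              # dE/dw_m = -sum_n (t_n - y_n) phi_m(x_n)
        w -= eta * grad                            # step against the gradient
    return w
```

If eta is too large the updates diverge, and if it is too small convergence is slow, which is one reason an off-the-shelf conjugate-gradient optimizer is often preferable.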

An exact algorithm for linear regression y(x,w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … + w_M φ_M(x). Evaluate the basis functions for the training cases x_1,…,x_N and put them in a design matrix Φ with entries Φ_nm = φ_m(x_n), where we define φ_0(x) = 1 (to account for w_0). Now, the vector of predictions is y = Φw and the error is E = (t - Φw)^T (t - Φw) = t^T t - 2 t^T Φw + w^T Φ^T Φ w. Setting ∂E/∂w = 0 gives -2 Φ^T t + 2 Φ^T Φ w = 0, so the solution is w = (Φ^T Φ)^(-1) Φ^T t, which is easy to compute in MATLAB.
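
The closed-form solution in a couple of lines (a sketch that solves the normal equations directly; in practice a pseudo-inverse or lstsq is numerically safer than forming Φ^T Φ explicitly):

```python
import numpy as np

def least_squares(Phi, t):
    """Solve the normal equations Phi^T Phi w = Phi^T t for the weight vector w."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
```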

Over-fitting After learning, collect test data and measure its error Over-fitting the training data leads to large test error

If M is fixed, say at M = 9, collecting more training data helps… (figure: the fit for N = 10)

Model selection using validation data Collect additional validation data (or set aside some training data for this purpose) Perform regression with a range of values of M and use the validation data to pick M Here, we could choose M = 7
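
A sketch of that selection loop, reusing the illustrative fit_polynomial helper from earlier; the train/validation split and the candidate range for M are assumptions.

```python
import numpy as np

def select_order(x_train, t_train, x_val, t_val, max_M=9):
    """Return the polynomial order M with the lowest validation RMSE."""
    best_M, best_rmse = 0, np.inf
    for M in range(max_M + 1):
        w = fit_polynomial(x_train, t_train, M)
        y_val = np.vander(x_val, M + 1, increasing=True) @ w
        rmse = np.sqrt(np.mean((t_val - y_val) ** 2))
        if rmse < best_rmse:
            best_M, best_rmse = M, rmse
    return best_M
```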

Regularization using weight penalties (aka shrinkage, ridge regression, weight decay) To prevent over-fitting, we can penalize large weights: E(w) = ½ Σ_n ( t_n - y(x_n,w) )^2 + (λ/2) Σ_m w_m^2. Now, over-fitting depends on the value of λ
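
The penalized error also has a closed-form minimizer, w = (λI + Φ^T Φ)^(-1) Φ^T t. A minimal sketch (note: in practice the bias w_0 is often left unpenalized, which this simple version does not do):

```python
import numpy as np

def ridge_regression(Phi, t, lam=1e-3):
    """Minimize 0.5 * sum_n (t_n - Phi[n] @ w)^2 + (lam / 2) * sum_m w_m^2."""
    n_params = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(n_params) + Phi.T @ Phi, Phi.T @ t)
```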

Comparison of model selection and ridge regression/weight decay

Using validation data to regularize tracking (Figures: training data, validation data, and the entire data set, each plotting the feature x against the hand-labeled horizontal coordinate t; the selected model has M = 5.)

Validation when data is limited S-fold cross validation – Partition the data into S sets – For M = 1,2,…: For s = 1…S: train on all data except the s-th set, measure the error on the s-th set; add the errors to get the cross-validation error for M – Pick the M with the lowest cross-validation error. Leave-one-out cross validation – Use when data is sparse – Same as S-fold cross validation, with S = N
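
An S-fold cross-validation sketch for choosing M, again assuming the illustrative fit_polynomial helper from earlier; setting S = len(x) gives leave-one-out cross validation.

```python
import numpy as np

def cross_validation_rmse(x, t, M, S=5):
    """Average held-out RMSE over S folds for a degree-M polynomial."""
    folds = np.array_split(np.arange(len(x)), S)
    errs = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        w = fit_polynomial(x[train], t[train], M)
        y = np.vander(x[held_out], M + 1, increasing=True) @ w
        errs.append(np.sqrt(np.mean((t[held_out] - y) ** 2)))
    return np.mean(errs)   # pick the M with the lowest value
```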

Questions?

How are we doing on the pass sequence? This fit is pretty good, but… The red line doesn't reveal different levels of uncertainty in predictions. Cross validation reduced the training data, so the red line isn't as accurate as it should be. Choosing a particular M and w seems wrong – we should hedge our bets.