CMPS 142/242 Review Section Fall 2011 Adapted from Lecture Slides.

Introduction
- Why and what of machine learning: learn from past experience to optimize a performance criterion
- Difference between supervised and unsupervised learning

Supervised Learning Terminology
- Domain: the set of all possible x vectors
- Hypothesis/Concept: a Boolean function on the domain
- Target: the concept to be learned
- Hypothesis class/space: the set of hypotheses (concepts) that can be output by a given learning algorithm
- Version space: all concepts in the hypothesis space consistent with the training set
- Noisy data

Supervised Learning Terminology (cont.)
- Inductive bias
- Overfitting/Underfitting
- Feature selection

Bayesian Learning
- Maximum Likelihood
- Maximum a Posteriori
- Mean a Posteriori
- Generative/Discriminative Models
- Bayes optimal prediction minimizes risk: the risk of predicting t on x is ∑_{t'} L(t, t') P(t' | x) (see the sketch below)
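
A minimal Python sketch (not from the course materials) of the Bayes optimal decision rule above, assuming a finite label set, a known loss matrix L, and a known posterior P(t' | x); the function name is illustrative.

```python
import numpy as np

def bayes_optimal_prediction(loss, posterior):
    """Return the label t minimizing the risk sum_{t'} L(t, t') P(t' | x).

    loss[t, t2] is the loss of predicting t when the true label is t2;
    posterior[t2] is P(t' = t2 | x).
    """
    expected_risk = loss @ posterior          # expected loss of each candidate prediction t
    return int(np.argmin(expected_risk))

# With 0/1 loss the rule reduces to the MAP prediction.
loss_01 = 1.0 - np.eye(3)
posterior = np.array([0.2, 0.5, 0.3])
print(bayes_optimal_prediction(loss_01, posterior))   # -> 1
```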

Instance Based Learning
- K-nearest neighbor (a minimal sketch follows this list)
- Edited NN: reduce memory and computation by storing only "important" points
- Instance-based density estimation: histogram method
- Smoothing models
  - Regressogram
  - Running mean smoother
  - Kernel smoother
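
A minimal k-nearest-neighbor sketch (illustrative only, assuming Euclidean distance and a majority vote):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the majority label among the k training points nearest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)       # Euclidean distance to every stored example
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))   # -> 1
```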

Decision trees
- Impurity measures (see the sketch below)
  - Gini index: 2p(1-p)
  - Entropy: -p lg p - (1-p) lg(1-p)
  - Error rate: 1 - max(p, 1-p)
  - Generalized entropy for multiple classes
- Avoiding overfitting
  - Pre-pruning
  - Post-pruning
- Random Forests
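
The three binary impurity measures above as a small sketch (illustrative function names, binary case only):

```python
import numpy as np

def gini(p):
    """Gini index 2p(1-p) for a node whose positive-class fraction is p."""
    return 2.0 * p * (1.0 - p)

def entropy(p):
    """Binary entropy -p lg p - (1-p) lg(1-p), with the convention 0 lg 0 = 0."""
    return -sum(q * np.log2(q) for q in (p, 1.0 - p) if q > 0.0)

def error_rate(p):
    """Misclassification impurity 1 - max(p, 1-p)."""
    return 1.0 - max(p, 1.0 - p)

for p in (0.5, 0.9):
    # All three are maximal at p = 0.5 and fall toward 0 as the node becomes pure.
    print(p, gini(p), entropy(p), error_rate(p))
```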

Naïve Bayes
- Naïve independence assumption: P(x | t) = ∏_j P(x_j | t)
- Predict the label t maximizing P(t) ∏_j P(x_j | t) (see the sketch below)
- Numeric features: use a Gaussian or other density
- Attributes for text classification? Bag-of-words model
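
A minimal Naïve Bayes prediction sketch, assuming binary (bag-of-words style) features with a given probability table; the parameter names and numbers are illustrative, not from the course code:

```python
import numpy as np

def naive_bayes_predict(prior, theta, x):
    """Predict the class t maximizing log P(t) + sum_j log P(x_j | t).

    prior[t]    = P(t)
    theta[t, j] = P(x_j = 1 | t) for binary features
    x           = binary feature vector
    """
    log_post = np.log(prior)
    log_post = log_post + x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T
    return int(np.argmax(log_post))

prior = np.array([0.6, 0.4])
theta = np.array([[0.8, 0.1, 0.3],    # P(word j present | class 0)
                  [0.2, 0.7, 0.6]])   # P(word j present | class 1)
print(naive_bayes_predict(prior, theta, np.array([0, 1, 1])))   # -> 1
```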

Linear Regression
- Basis functions
- From maximum likelihood to least squares (eq. 3.11–3.12)
- Maximum likelihood weight vector (eq. 3.15)
- Sequential learning / stochastic gradient descent (see the sketch below)
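
A sequential (stochastic gradient descent) least-squares sketch, assuming the LMS update w ← w + η (t_n − w·φ_n) φ_n and a made-up 1-D dataset; the result is compared against the batch maximum likelihood solution:

```python
import numpy as np

def lms_sgd(Phi, t, eta=0.05, epochs=100):
    """Sequential learning for least squares: w <- w + eta * (t_n - w.phi_n) * phi_n."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_n, t_n in zip(Phi, t):
            w += eta * (t_n - w @ phi_n) * phi_n
    return w

# Toy data from t = 1 + 2x plus noise, with a constant basis function prepended.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(50)
Phi = np.column_stack([np.ones_like(x), x])
print(lms_sgd(Phi, t))                           # roughly [1, 2]
print(np.linalg.lstsq(Phi, t, rcond=None)[0])    # batch maximum likelihood weights (cf. eq. 3.15)
```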

Linear Regression (cont.)
- Regularized least squares (see the sketch below)
- Multiple outputs: same basis functions for all components of the target vector
- The bias-variance decomposition
- Predictive distribution
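
Regularized least squares in closed form, as a small sketch assuming an L2 (ridge) penalty λ‖w‖²; the example data are made up:

```python
import numpy as np

def ridge_weights(Phi, t, lam=0.1):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^(-1) Phi^T t."""
    d = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)

Phi = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [1.0, 1.5]])   # constant + identity basis
t = np.array([1.1, 2.0, 2.9, 4.1])
print(ridge_weights(Phi, t, lam=0.0))   # ordinary least squares
print(ridge_weights(Phi, t, lam=1.0))   # weights shrink toward zero as lambda grows
```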

Linear Classification
- Linear threshold: w·x = w_1 x_1 + w_2 x_2 + … + w_n x_n ≥ w_0
- Multi-class
  - Learn a linear function y_k for each class k: y_k(x) = w_k^T x + w_{k,0}
  - Predict class k if y_k(x) > y_j(x) for all other j
- Perceptron: if (w·x_i) t_i ≤ 0 then mistake: w ← w + η t_i x_i (see the sketch below)
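
A minimal perceptron sketch implementing the mistake-driven update above, assuming ±1 labels and a bias absorbed as a constant feature:

```python
import numpy as np

def perceptron(X, t, eta=1.0, epochs=10):
    """On a mistake ((w.x_i) t_i <= 0), update w <- w + eta * t_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            if (w @ x_i) * t_i <= 0:
                w += eta * t_i * x_i
    return w

# The last column is a constant 1, so its weight plays the role of the threshold -w_0.
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
t = np.array([1, 1, -1, -1])
w = perceptron(X, t)
print(np.sign(X @ w))   # matches t on this separable toy set
```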

Linear Discriminant Analysis
- For each class y:
  - Estimate P(x | y) with a Gaussian (same covariance matrix for all classes)
  - Estimate the prior P(y) as the fraction of training data with class y
- Predict the y maximizing P(y) P(x | y), or equivalently log P(y) + log P(x | y) (see the sketch below)
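
A minimal sketch of this generative classifier, assuming Gaussian class-conditionals with a shared (pooled) covariance; terms common to all classes are dropped from the score, and the data are made up:

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class priors, per-class means, and one pooled covariance matrix."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    centered = np.vstack([X[y == c] - means[c] for c in classes])
    cov_inv = np.linalg.inv(centered.T @ centered / len(X))   # shared covariance, inverted once
    return classes, priors, means, cov_inv

def lda_predict(model, x):
    """Predict the y maximizing log P(y) + log P(x | y); shared-covariance constants cancel."""
    classes, priors, means, cov_inv = model
    def score(c):
        d = x - means[c]
        return np.log(priors[c]) - 0.5 * d @ cov_inv @ d
    return max(classes, key=score)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)), rng.normal([2, 2], 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(lda_predict(lda_fit(X, y), np.array([1.8, 2.1])))   # -> 1
```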

Logistic Regression
- Logistic regression gives a distribution on labels: p(t=1 | x, w) = 1/(1 + e^{-w·x})
- Use gradient descent to learn w (see the sketch below)
- w·x equals the log odds: log( p(t=1 | w, x) / p(t=0 | w, x) )
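
A gradient-descent sketch for logistic regression, assuming 0/1 targets and a bias absorbed as a constant feature; the gradient of the negative log-likelihood is X^T(σ(Xw) − t):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_regression_gd(X, t, eta=0.1, steps=2000):
    """Gradient descent on the negative log-likelihood; gradient is X^T (sigmoid(Xw) - t)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * X.T @ (sigmoid(X @ w) - t) / len(X)
    return w

X = np.array([[0.5, 1.0], [1.5, 1.0], [-0.5, 1.0], [-1.5, 1.0]])   # second column is a bias feature
t = np.array([1.0, 1.0, 0.0, 0.0])
w = logistic_regression_gd(X, t)
print(sigmoid(X @ w))   # p(t=1 | x, w): high for the first two examples, low for the last two
```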

Artificial Neural Networks
- Activation: a_j = ∑_i w_{j,i} z_i (see the forward-pass sketch below)
- Node j outputs z_j = f_j(a_j); common choices of f are tanh and the logistic sigmoid
- Backpropagation is used to learn the weights
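
A forward-pass sketch for a one-hidden-layer network with tanh hidden units and a logistic sigmoid output; the weight shapes and values are made up for illustration, and backpropagation is not shown:

```python
import numpy as np

def forward(x, W1, W2):
    """Compute a_j = sum_i w_{j,i} z_i at each layer, then apply the nonlinearity."""
    a_hidden = W1 @ x                     # hidden activations
    z_hidden = np.tanh(a_hidden)          # hidden outputs z_j = tanh(a_j)
    a_out = W2 @ z_hidden                 # output activation
    return 1.0 / (1.0 + np.exp(-a_out))   # logistic sigmoid output

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))          # 3 inputs -> 4 hidden units
W2 = rng.standard_normal((1, 4))          # 4 hidden units -> 1 output
print(forward(np.array([0.2, -0.1, 0.5]), W1, W2))
```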

Support Vector Machines
- Pick the linear threshold hypothesis with the biggest margin
- Find w, b, γ such that:
  - w·x_i + b ≥ γ when y_i = +1
  - w·x_i + b ≤ -γ when y_i = -1
  - and γ as big as possible
- Scaling issue: fix by setting γ = 1 and finding the shortest w:
  min_{w,b} ||w||^2 subject to y_i(w·x_i + b) ≥ 1 for all examples (x_i, y_i)
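
Not the slide's exact hard-margin QP (that needs a quadratic-programming solver): below is a sketch of a commonly used soft-margin variant trained by subgradient descent on the hinge loss, with made-up toy data.

```python
import numpy as np

def svm_subgradient(X, y, lam=0.01, eta=0.01, epochs=200):
    """Subgradient descent on lam * ||w||^2 + mean(max(0, 1 - y_i (w.x_i + b)))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) < 1:            # margin violated: hinge term is active
                w += eta * (y_i * x_i - 2 * lam * w)
                b += eta * y_i
            else:                                  # only the regularizer contributes
                w -= eta * 2 * lam * w
    return w, b

X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-2.5, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = svm_subgradient(X, y)
print(np.sign(X @ w + b))   # should match y on this separable toy set
```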

Kernel Functions
- Predictions (and finding the a_i's) depend only on dot products
- Can use any dot-product-like function K(x, x')
- K(x, z) is a kernel function if K(x, z) = φ(x)·φ(z) for some feature map φ (see the sketch below)
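
A small check of the kernel condition K(x, z) = φ(x)·φ(z) for the quadratic kernel K(x, z) = (x·z)², whose feature map has entries x_i x_j (an illustrative example, not tied to the SVM code above):

```python
import numpy as np

def poly2_kernel(x, z):
    """Quadratic kernel K(x, z) = (x.z)^2."""
    return (x @ z) ** 2

def poly2_features(x):
    """Explicit feature map phi(x) with entries x_i * x_j, so phi(x).phi(z) = (x.z)^2."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, -1.0])
z = np.array([0.5, -1.0, 3.0])
print(poly2_kernel(x, z))                      # kernel value computed directly
print(poly2_features(x) @ poly2_features(z))   # same value via the explicit feature map
```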

Variants of SVM
- Multiclass
- Regression: f(x) = w^T x + w_0
- One-Class

Clustering
- Hierarchical: creates a tree of clusterings
  - Agglomerative (bottom up: merge the "closest" clusters)
  - Divisive (top down, less common)
- Partitional: one set of clusters is created, usually with the number of clusters supplied by the user
- Clusters can be overlapping (soft) or non-overlapping (hard)

Partitional Algorithms
- K-Means (see the sketch below)
- Gaussian mixtures (EM)
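
A plain k-means sketch (alternate assignment and re-averaging steps), assuming Euclidean distance and initial centers drawn at random from the data; cluster numbering is arbitrary and the data are made up:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternate between assigning points to the nearest center and re-averaging the centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # converged: assignments no longer change
            break
        centers = new_centers
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (20, 2)), rng.normal(3.0, 0.2, (20, 2))])
centers, assign = kmeans(X, k=2)
print(np.round(centers, 2))   # one center near (0, 0), the other near (3, 3)
```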

Hidden Markov Models
- Three basic problems of HMMs:
  - Evaluation: given the parameters and outputs, calculate P(outputs | parameters) (dynamic programming; see the forward-algorithm sketch below)
  - State sequence: given the parameters and outputs, find the state sequence Q* maximizing the probability of generating the outputs (dynamic programming)
  - Learning: given a set of observation sequences, find the parameters maximizing the likelihood of the observation sequences (EM algorithm)
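
A sketch of the evaluation problem via the forward dynamic program, assuming discrete observations and made-up parameter matrices (the Viterbi state-sequence and Baum-Welch/EM learning steps are not shown):

```python
import numpy as np

def forward_algorithm(pi, A, B, obs):
    """Evaluation: compute P(outputs | parameters) with the forward recursion.

    pi[i]   = P(q_1 = i)                  initial state distribution
    A[i, j] = P(q_{t+1} = j | q_t = i)    transition probabilities
    B[i, o] = P(o_t = o | q_t = i)        emission probabilities
    obs     = observed symbol sequence (integers)
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate one step, then weight by the emission
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_algorithm(pi, A, B, [0, 1, 0]))   # likelihood of the observation sequence
```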