Machine Learning Lecture 1: Intro + Decision Trees. Moshe Koppel. Slides adapted from Tom Mitchell and from Dan Roth.

Administrative Stuff
- Textbook: Machine Learning by Tom Mitchell (optional)
- Most slides adapted from Mitchell
- Slides will be posted (possibly only after lecture)
- Grade: 50% final exam; 50% HW (mostly final)

What’s it all about?
- Very loosely: we have lots of data and wish to automatically learn concept definitions in order to determine whether new examples belong to the concept or not.

Supervised Learning
- Given: examples (x, f(x)) of some unknown function f
- Find: a good approximation of f
- x provides some representation of the input
  - The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important)
  - x ∈ {0,1}^n or x ∈ ℝ^n
- The target function (label)
  - f(x) ∈ {-1,+1}: binary classification
  - f(x) ∈ {1, 2, 3, ..., k-1}: multi-class classification
  - f(x) ∈ ℝ: regression
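
To make the setup concrete, here is a minimal Python sketch of the supervised-learning framing. The tiny dataset and the deliberately trivial "majority label" learner are illustrative assumptions, not part of the lecture: the point is only that training data is a set of (x, f(x)) pairs and a learner outputs some hypothesis h we can apply to new inputs.

```python
# Minimal sketch of the supervised-learning setup.
# The data and the learner are illustrative, not from the slides.
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[Tuple[int, ...], int]   # (x, f(x)) with x in {0,1}^n, label in {-1,+1}

# Hypothetical training examples (x, f(x)).
train: List[Example] = [
    ((0, 0, 1), +1),
    ((1, 0, 1), +1),
    ((0, 1, 0), -1),
    ((1, 1, 0), -1),
]

def learn_majority(examples: List[Example]) -> Callable[[Tuple[int, ...]], int]:
    """A deliberately trivial learner: always predict the most common label."""
    majority = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: majority

h = learn_majority(train)              # h is our approximation of f
print([h(x) for x, _ in train])        # predictions on the training inputs
```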

Supervised Learning: Examples
- Disease diagnosis
  - x: properties of patient (symptoms, lab tests)
  - f: disease (or maybe: recommended therapy)
- Part-of-speech tagging
  - x: an English sentence (e.g., "The can will rust")
  - f: the part of speech of a word in the sentence
- Face recognition
  - x: bitmap picture of person's face
  - f: name of the person (or maybe: a property of the person)
- Automatic steering
  - x: bitmap picture of road surface in front of car
  - f: degrees to turn the steering wheel
Many problems that do not seem like classification problems can be decomposed into classification problems, e.g., Semantic Role Labeling.

A Learning Problem
y = f(x1, x2, x3, x4): an unknown function of four Boolean inputs x1, x2, x3, x4. [The slide shows a table of training examples with columns x1, x2, x3, x4, y.] Can you learn this function? What is it?

Hypothesis Space
- Complete ignorance: there are 2^16 = 65,536 possible Boolean functions over four input features. We can't figure out which one is correct until we've seen every possible input-output pair.
- After seven examples we still have 2^9 possibilities for f.
- Is learning possible? [The slide repeats the example table, with the outputs of the nine unseen input combinations marked "?".]
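
The counting argument is easy to check by brute force. In the sketch below the seven observed examples are hypothetical stand-ins (the slide's actual table values are not reproduced here); each candidate hypothesis is a full truth table over the 16 possible inputs, and every unseen input contributes one free bit, so 2^(16-7) = 512 hypotheses remain consistent.

```python
# Counting Boolean functions over 4 inputs consistent with a few observations.
from itertools import product

inputs = list(product([0, 1], repeat=4))          # all 16 possible inputs

# Hypothetical observations (x1, x2, x3, x4) -> y; stand-ins for the slide's table.
observed = {
    (0, 0, 1, 0): 0, (0, 1, 0, 0): 0, (0, 0, 1, 1): 1, (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0, (1, 1, 0, 0): 0, (1, 0, 1, 0): 0,
}

consistent = 0
for bits in product([0, 1], repeat=16):           # 2^16 candidate truth tables
    h = dict(zip(inputs, bits))
    if all(h[x] == y for x, y in observed.items()):
        consistent += 1

print(consistent)   # 2^(16 - 7) = 512: one free bit per unseen input
```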

General strategies for Machine Learning
- Develop representation languages for expressing concepts
  - Serve to limit the expressivity of the target models
  - E.g., functional representations (n-of-m), grammars, stochastic models
- Develop flexible hypothesis spaces
  - Nested collections of hypotheses: decision trees, neural networks
  - Hypothesis spaces of flexible size
In either case:
- Develop algorithms for finding a hypothesis in our hypothesis space that fits the data
- And hope that it will generalize well

Terminology
- Target function (concept): the true function f: X → {...labels...}
- Concept: a Boolean function. Examples for which f(x) = 1 are positive examples; those for which f(x) = 0 are negative examples (instances).
- Hypothesis: a proposed function h, believed to be similar to f. The output of our learning algorithm.
- Hypothesis space: the space of all hypotheses that can, in principle, be output by the learning algorithm.
- Classifier: a discrete-valued function produced by the learning algorithm. The possible values of f, {1, 2, ..., K}, are the classes or class labels. (In most algorithms the classifier will actually return a real-valued function that we'll have to interpret.)
- Training examples: a set of examples of the form {(x, f(x))}

Representation Step: What’s Good?
- Learning problem: find a function that best separates the data
  - What function? What’s best? (How to find it?)
- A possibility: define the learning problem to be: find a (linear) function that best separates the data
  - Linear = linear in the instance space
  - x = data representation; w = the classifier
  - y = sgn{w^T x}

Expressivity
f(x) = sgn{x · w − θ} = sgn{Σ_{i=1..n} w_i x_i − θ}
- Many functions are linear:
  - Conjunctions:
    - y = x1 ∧ x3 ∧ x5
    - y = sgn{1·x1 + 1·x3 + 1·x5 − 3}
  - At least m of n:
    - y = at least 2 of {x1, x3, x5}
    - y = sgn{1·x1 + 1·x3 + 1·x5 − 2}
- Many functions are not:
  - XOR: y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)
  - Non-trivial DNF: y = (x1 ∧ x2) ∨ (x3 ∧ x4)
  - But these can be made linear (probabilistic classifiers as well)
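
A quick brute-force check of the two linear cases on the slide. This sketch assumes the convention sgn(0) = +1 (not stated on the slide) and simply verifies, over all Boolean settings of x1, x3, x5, that the conjunction and the "at least 2 of 3" function agree with their threshold forms.

```python
# Verifying that a conjunction and an "at least m of n" function are
# linear threshold functions, by exhaustive check over {0,1}^3.
from itertools import product

def sgn(z: float) -> int:
    return 1 if z >= 0 else -1   # convention assumed here: sgn(0) = +1

for x1, x3, x5 in product([0, 1], repeat=3):
    conj = 1 if (x1 and x3 and x5) else -1          # y = x1 ∧ x3 ∧ x5
    conj_lin = sgn(1*x1 + 1*x3 + 1*x5 - 3)          # sgn{x1 + x3 + x5 - 3}
    atleast2 = 1 if (x1 + x3 + x5) >= 2 else -1     # at least 2 of {x1, x3, x5}
    atleast2_lin = sgn(1*x1 + 1*x3 + 1*x5 - 2)      # sgn{x1 + x3 + x5 - 2}
    assert conj == conj_lin and atleast2 == atleast2_lin

print("Both functions match their linear-threshold forms on all inputs.")
```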

Exclusive-OR (XOR)
- y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)
- In general: a parity function
  - xi ∈ {0,1}
  - f(x1, x2, ..., xn) = 1 iff Σ xi is even
- This function is not linearly separable. [The slide plots the four labeled points in the (x1, x2) plane.]
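
A small illustration (not a proof) of the non-separability claim: search a coarse grid of weights w1, w2 and thresholds θ and confirm that no linear threshold function sgn{w1·x1 + w2·x2 − θ} labels all four parity points correctly. The grid and the sgn(0) = +1 convention are assumptions of this sketch.

```python
# Brute-force demonstration that two-bit parity is not linearly separable.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [1, -1, -1, 1]            # +1 iff x1 + x2 is even (the slide's parity convention)

grid = [i / 10 - 3 for i in range(61)]   # -3.0, -2.9, ..., 3.0

def predicts_correctly(w1: float, w2: float, th: float) -> bool:
    for (x1, x2), y in zip(points, labels):
        pred = 1 if (w1 * x1 + w2 * x2 - th) >= 0 else -1   # sgn(0) = +1
        if pred != y:
            return False
    return True

separable = any(predicts_correctly(w1, w2, th)
                for w1 in grid for w2 in grid for th in grid)
print("linearly separable on this grid?", separable)   # expected: False
```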

A General Framework for Learning
- Goal: predict an unobserved output value y ∈ Y based on an observed input vector x ∈ X
- Estimate a functional relationship y ~ f(x) from a set {(x, y)_i}, i = 1, ..., n
- Most relevant - classification: y ∈ {0,1} (or y ∈ {1, 2, ..., k})
  - (But within the same framework we can also talk about regression, y ∈ ℝ)
- What do we want f(x) to satisfy?
  - We want to minimize the Loss (Risk): L(f(·)) = E_{X,Y}([f(x) ≠ y])
  - Where E_{X,Y} denotes the expectation with respect to the true distribution. Simply: the number of mistakes. [...] is an indicator function.
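
On a finite sample the expectation is replaced by an average, giving the empirical risk. A minimal sketch, with a made-up sample and a made-up classifier, just to show the indicator-loss average:

```python
# Empirical risk: the average of the indicator loss [f(x) != y] over a sample.
# The sample and the classifier below are illustrative only.
sample = [((0, 1), 1), ((1, 1), 1), ((1, 0), 0), ((0, 0), 1)]

def f(x):
    # A hypothetical classifier: predict the first coordinate.
    return x[0]

empirical_risk = sum(1 if f(x) != y else 0 for x, y in sample) / len(sample)
print(empirical_risk)   # fraction of mistakes on the sample
```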

Learning as an Optimization Problem
- A loss function L(f(x), y) measures the penalty incurred by a classifier f on example (x, y).
- There are many different loss functions one could define:
  - Misclassification error: L(f(x), y) = 0 if f(x) = y; 1 otherwise
  - Squared loss: L(f(x), y) = (f(x) − y)^2
  - Input-dependent loss: L(f(x), y) = 0 if f(x) = y; c(x) otherwise
- A continuous convex loss function allows a simpler optimization algorithm. [The slide shows a plot of the loss L as a function of f(x) − y.]
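
The three losses named on the slide written out directly; the cost function c(x) and the example values are made up for illustration.

```python
# The three loss functions from the slide.
def misclassification_loss(fx, y):
    return 0 if fx == y else 1

def squared_loss(fx, y):
    return (fx - y) ** 2

def input_dependent_loss(fx, y, x, c=lambda x: 2.0 if x[0] == 1 else 0.5):
    # c(x) is a hypothetical per-input cost, used only when f(x) != y.
    return 0 if fx == y else c(x)

x, y, fx = (1, 0), 1, 0                   # one made-up example and prediction
print(misclassification_loss(fx, y),      # 1
      squared_loss(fx, y),                # 1
      input_dependent_loss(fx, y, x))     # 2.0 (since x[0] == 1)
```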

Summary: Key Issues in Machine Learning
- Modeling
  - How to formulate application problems as machine learning problems? How to represent the data?
  - Learning protocols (where are the data and labels coming from?)
- Representation
  - What are good hypothesis spaces?
  - Any rigorous way to find these? Any general approach?
- Algorithms
  - What are good algorithms?
  - How do we define success?
  - Generalization vs. overfitting
  - The computational problem