Introduction to Data Science Lecture 7 Machine Learning Overview

1 Introduction to Data Science Lecture 7 Machine Learning Overview
CS 194 Spring 2014 Michael Franklin Dan Bruckner, Evan Sparks, Shivaram Venkataraman

2 What is it? “Machine learning systems automatically learn programs from data” P. Domingos, CACM 10/12

3 Some Examples Classification Regression
Learned attribute is categorical (spam vs. ham) Input: vectors of “feature values” (discrete or continuous) Output: a single discrete value (a.k.a. the “class”) Regression Learned attribute is numeric Fitting a curve to data; that curve can then be used to predict outcomes
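The regression idea above can be sketched in a few lines: fit a line to data by least squares, then use the fitted curve to predict a new outcome. This is a minimal pure-Python illustration; the data and function name are made up for the example, not from the lecture.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.0, 8.1]    # roughly y = 2x
a, b = fit_line(xs, ys)
prediction = a * 5.0 + b     # predict the outcome for a new input x = 5
```

Classification works the same way end to end, except the predicted value is a discrete class instead of a number.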

4 Classification Example
Learn a function that predicts, given the weather, whether someone will play golf. From Bill Howe’s Coursera class: Introduction to Data Science

5 Three Components of ML Algorithms
Representation Language for the classifier What the input looks like Evaluation (scoring function) How to tell good models from bad ones Optimization How to search among the possible models to find the highest-scoring one P. Domingos, “A Few Useful Things to Know About Machine Learning”, CACM Oct 2012.

6 Some Examples of Them P. Domingos, “A Few Useful Things to Know About Machine Learning”, CACM Oct 2012.

7 Terminology Supervised Learning Unsupervised Learning
Given examples of inputs and outputs (i.e., labeled data) Learn the relationship between them Unsupervised Learning Inputs but no outputs (unlabeled data) Learn the latent labels e.g., clustering, dimension reduction You get to do both in HW 2 (see the rest of today’s reading chapter for K-means)
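As a taste of the unsupervised side, here is a minimal 1-D K-means sketch: repeatedly assign each point to its nearest center, then move each center to the mean of its points. The data and helper name are illustrative only; see the assigned reading for the real treatment.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: assign each point to the nearest center,
    then recompute each center as the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # empty clusters keep their old center
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# two obvious clusters, around 0 and around 10
data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
centers = kmeans_1d(data, k=2)
```

No labels are given anywhere; the algorithm discovers the two groups on its own, which is exactly the "learn the latent labels" idea.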

8 Supervised Learning Cat Dog ???

9 Generalization is the Goal
Pick a subset of your data as the training set. Train your model on that. Then test it using the held-back data (i.e., the test set). Most important rule: Don’t Test on your Training Data (it’s easy to predict the examples you’ve already seen!)
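The split described above can be sketched as: shuffle the data, hold back a fraction for testing, and train on the rest. The function name and fraction here are illustrative choices, not from the lecture.

```python
import random

def train_test_split(data, test_frac=0.25, seed=42):
    """Shuffle a copy of the data, then hold back test_frac for testing."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    return items[n_test:], items[:n_test]   # (train set, test set)

data = list(range(100))
train, test = train_test_split(data)
# train and test are disjoint and together cover all the data
```

Evaluating only on `test` keeps the measurement honest: the model is scored on examples it never saw during training.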

10 Overfitting Low error on training data, but High error on test data
Example: your classifier is 100% accurate on the training data but only 50% accurate on the test data, when it could have been 75% accurate on each

11 Cross-Validation Holding back data reduces amount of data available for training. Alternative: Randomly divide training data into multiple subsets hold out each one while training on the rest average the results for evaluation

12 Feature Engineering Constructing features of the raw data on which to learn After gathering, integrating and cleaning, this is the next step when do we get to run our learning algorithm? Often domain-specific Requires trial and error

13 Unsupervised Learning
“Deep Learning” from Google’s Brain project

14 Plan for Rest of This Evening
We’ll focus on Supervised Learning. In particular, Shivaram will cover Linear Regression in some detail, followed by an R-based lab. Finally, some HW2 programming tips based on what we saw in HW1 (Dan). Announcement: Midterm: Thursday April 17, 6pm; Kroeber rm 160
