Sharing Some Valuable Points of Machine Learning: Notes of MLSS 2014. Reporter: Xinliang Zhu. June 27, 2014
Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning
Introduction What is Machine Learning? Machine Learning Process – Collect input-output examples from the experts – Learn a function to map from the input to the output Traditional Software Process – Interview the experts – Create an algorithm that automates their process
Introduction Some Concepts of statistical ML – Supervised Learning – Semi-supervised Learning – Weakly-supervised Learning – Unsupervised Learning – Reinforcement Learning
Introduction Three essential parts of statistical ML – Model (a probability distribution or a discriminant function with unresolved parameters) – Strategy (a loss function and risk function) – Algorithm (the optimization method, e.g. stochastic gradient descent) Statistical ML = Model + Strategy + Algorithm -- Hang Li
Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning
Supervised Learning Given: training examples (x_i, f(x_i)) for some unknown function f. Find: a good approximation to f. Examples – Traffic sign recognition: x is a picture of a traffic sign, f(x) is the type of traffic sign – Spam detection: x is an email message, f(x) is spam or not spam
Supervised Learning Training examples are drawn independently at random according to an unknown probability distribution P(x, y). The learning algorithm analyzes the examples and produces a classifier f. Given a new data point (x, y) drawn from P, the classifier is given x and predicts y, and the loss is then measured. Goal of the learning algorithm: find the f that minimizes the expected loss.
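For reference, the goal just stated can be written compactly; this is standard notation rather than a formula taken from the slide:

```latex
R(f) \;=\; \mathbb{E}_{(x,y)\sim P}\big[\,L(f(x),\,y)\,\big] \;=\; \int L(f(x),\,y)\,\mathrm{d}P(x,y),
\qquad
f^{*} \;=\; \arg\min_{f} R(f)
```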
The Main Approaches to Machine Learning Learn a classifier: a function f. Learn a conditional distribution: P(y|x). Learn the joint probability distribution: P(x, y). One example of each approach – Learn a classifier: the Perceptron algorithm – Learn a conditional distribution: logistic regression – Learn the joint distribution: linear discriminant analysis
Linear Threshold Units We assume that each feature x_j and each weight w_j is a real number. We will study three different algorithms for learning linear threshold units: – Perceptron: learns a function – Logistic regression: learns a conditional distribution – Linear discriminant analysis: learns a joint distribution
A canonical representation Given a training example of the form (⟨x_1, x_2, x_3, x_4⟩, y), transform it to (⟨1, x_1, x_2, x_3, x_4⟩, y), i.e. prepend a constant feature x_0 = 1. The parameter vector will then be (w_0, w_1, w_2, w_3, w_4). We will call the unthresholded hypothesis u(x, w) = w^T x. Each hypothesis can be written h(x) = sgn(u(x, w)). Our goal is to find w.
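A minimal sketch of the canonical representation and the thresholded hypothesis, assuming NumPy and the added constant feature x_0 = 1; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def to_canonical(x):
    """Prepend the constant feature x0 = 1 so the bias w0 is part of w."""
    return np.concatenate(([1.0], x))

def u(x, w):
    """Unthresholded hypothesis u(x, w) = w^T x."""
    return np.dot(w, x)

def h(x, w):
    """Thresholded hypothesis h(x) = sgn(u(x, w)), mapped to {-1, +1}."""
    return 1 if u(x, w) >= 0 else -1

# Example: a 4-feature input and a 5-dimensional weight vector (w0..w4)
x = to_canonical(np.array([0.5, -1.2, 3.0, 0.7]))
w = np.array([0.1, 0.4, -0.3, 0.2, 0.0])
print(h(x, w))
```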
Geometrical View Consider three labeled training examples in the plane of the form ((x_1, x_2), y), e.g. ((2.0, 2.0), -1). We want a classifier, a separating line, that labels the +1 and -1 examples correctly.
The Unthresholded Discriminant Function is a Hyperplane The function u(x, w) = w^T x is linear in x, and the decision boundary u(x, w) = 0 is a hyperplane.
Machine Learning == Optimization Given: – A set of N training examples {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} – A loss function L Find: – The weight vector w that minimizes the expected loss on the training data
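Spelled out in standard notation (the symbol J anticipates the later slide "Minimizing J by Gradient Descent Search"; the exact formula from the original slide is not preserved here), the training objective is:

```latex
J(\mathbf{w}) \;=\; \frac{1}{N}\sum_{i=1}^{N} L\big(f(\mathbf{x}_i;\mathbf{w}),\, y_i\big),
\qquad
\mathbf{w}^{*} \;=\; \arg\min_{\mathbf{w}} J(\mathbf{w})
```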
Step-wise Constant Loss Function The 0/1 loss is piecewise constant in w, so its derivative is either 0 or undefined, which makes direct gradient-based optimization impossible.
Approximating the Expected Loss by a Smooth Function Simplify the optimization problem by replacing the original objective function with a surrogate loss function. Hinge loss: when y = 1, the loss is max(0, 1 - w^T x); more generally, for y ∈ {-1, +1}, it is max(0, 1 - y·w^T x).
Minimizing J by Gradient Descent Search
Batch Perceptron Algorithm
Online Perceptron Algorithm This is also called stochastic gradient descent, because the overall gradient is approximated by the gradient from each individual example.
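A minimal sketch of the online (stochastic-gradient) perceptron described above, assuming labels y ∈ {-1, +1} and inputs already in the canonical form with a leading constant feature; the learning rate and epoch count are illustrative choices, not taken from the slides:

```python
import numpy as np

def online_perceptron(X, y, epochs=10, eta=1.0):
    """Online perceptron: update w on each misclassified example.

    X: (n_examples, n_features) array with a leading constant feature of 1.
    y: array of labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:      # mistake (or on the boundary)
                w += eta * yi * xi           # gradient step for this example
    return w

# Tiny linearly separable example (first column is the constant feature)
X = np.array([[1.0, 2.0, 2.0], [1.0, -1.0, -1.5], [1.0, 3.0, 1.0], [1.0, -2.0, -0.5]])
y = np.array([+1, -1, +1, -1])
w = online_perceptron(X, y)
print(np.sign(X @ w))   # should reproduce y
```

The batch variant on the previous slide differs only in that it sums the per-example updates over the whole training set before changing w.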
Logistic Regression Learn the conditional probability P(y|x). Let p_y(x; w) be our estimate of P(y|x), where w is a vector of adjustable parameters. Assume only two classes, y = 0 and y = 1, and model p_1(x; w) = exp(w^T x) / (1 + exp(w^T x)). It is easy to show that this is equivalent to log[ p_1(x; w) / (1 - p_1(x; w)) ] = w^T x. In other words, the log odds of class 1 is a linear function of x.
The reason for choosing the exp function A linear function has a range from negative infinity to positive infinity, and we need to force the outputs to be positive and to sum to 1 in order to be a probability.
Choosing the Loss Function For probabilistic models, we use the log loss: L(w; x, y) = -log P(y | x; w).
Compare with 0/1 Loss
Maximum Likelihood Fitting To minimize the log loss, we should maximize the likelihood of the data, which is ∏_i P(y_i | x_i; w). It is easier to work with the log likelihood: ∑_i log P(y_i | x_i; w).
Maximizing the Log Likelihood via Gradient Ascent This is similar to the gradient descent procedure used for the perceptron, but we first need to rewrite the log likelihood in terms of p_1(x_i; w) before differentiating.
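A minimal sketch of fitting logistic regression by gradient ascent on the log likelihood, assuming labels y ∈ {0, 1}; the gradient ∑_i (y_i - p_1(x_i; w)) x_i is the standard result the slide alludes to, and the step size and iteration count are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Maximize the log likelihood sum_i log P(y_i | x_i; w) by gradient ascent.

    X: (n, d) array with a leading constant feature of 1 (bias term w0).
    y: array of labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p1 = sigmoid(X @ w)          # p_1(x_i; w) for every example
        gradient = X.T @ (y - p1)    # gradient of the log likelihood w.r.t. w
        w += eta * gradient / len(y)
    return w

X = np.array([[1.0, 0.5], [1.0, 2.5], [1.0, -1.0], [1.0, 3.0]])
y = np.array([0, 1, 0, 1])
w = fit_logistic_regression(X, y)
print(sigmoid(X @ w))    # estimated P(y=1 | x) for each training example
```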
Logistic Regression Implements a Linear Discriminant Function In the two-class 0/1-loss case, we should predict y = 1 if P(y=1|x; w) > 0.5, i.e. if P(y=1|x; w) / P(y=0|x; w) > 1. Taking the log of both sides gives log odds > 0, or equivalently w^T x > 0.
The Joint Probability Approach: Linear Discriminant Analysis Learn P(x, y). This is called the generative approach, because we can think of P(x, y) as a model of how the data is generated. – For example, we can factor the joint distribution into the form P(x, y) = P(y) P(x|y) – Generative story: draw y ~ P(y) to choose a class, then draw x ~ P(x|y) to generate the features – This can be represented as a probabilistic graphical model
Linear Discriminant Analysis P(y) is a discrete multinomial distribution. For LDA, we assume that P(x|y=k) is a multivariate normal distribution with class mean μ_k and a covariance matrix Σ shared by all classes.
The LDA Model Linear discriminant analysis assumes that the joint distribution has the form P(x, y=k) = π_k · N(x; μ_k, Σ), where π_k = P(y=k) and N(·; μ_k, Σ) is a multivariate Gaussian density.
Fitting the LDA Model The parameters can be estimated in closed form from the training data: π_k is the fraction of examples in class k, μ_k is the mean of the examples in class k, and Σ is the pooled covariance of the examples around their class means.
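A minimal sketch of fitting the two-class LDA model as described above (class priors, class means, and a shared covariance estimated directly from the data, with no gradient descent); the function names and the pooled-covariance estimator are standard choices, not copied from the slides:

```python
import numpy as np

def fit_lda(X, y):
    """Estimate the LDA parameters pi_k, mu_k and the shared covariance Sigma."""
    classes = np.unique(y)
    n, d = X.shape
    pi, mu = {}, {}
    sigma = np.zeros((d, d))
    for k in classes:
        Xk = X[y == k]
        pi[k] = len(Xk) / n                      # class prior P(y = k)
        mu[k] = Xk.mean(axis=0)                  # class mean
        sigma += (Xk - mu[k]).T @ (Xk - mu[k])   # pooled scatter
    sigma /= n
    return pi, mu, sigma

def lda_ltu_weights(pi, mu, sigma):
    """For classes {0, 1}, LDA induces a linear threshold unit: predict 1 when w^T x + b > 0."""
    inv = np.linalg.inv(sigma)
    w = inv @ (mu[1] - mu[0])
    b = -0.5 * (mu[1] @ inv @ mu[1] - mu[0] @ inv @ mu[0]) + np.log(pi[1] / pi[0])
    return w, b
```

The second function makes explicit why the next slide says "LDA learns an LTU": under the shared-covariance Gaussian assumption, the log odds log P(y=1|x) / P(y=0|x) is a linear function of x.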
LDA learns an LTU
Two Geometric Views of LDA View 1: Mahalanobis Distance
Two Geometric Views of LDA View 2: Most Informative Low-Dimensional Projection
Comparing Perceptron, Logistic Regression, and LDA Statistical efficiency: if the generative model P(x, y) is correct, then LDA usually gives the highest accuracy, particularly when the amount of training data is small. If the model is correct, LDA requires roughly 30% less data than logistic regression in theory. Computational efficiency: generative models are typically the easiest to learn; the LDA parameters can be computed directly from the data without using gradient descent.
Comparing Perceptron, Logistic Regression, and LDA Vapnik's principle – If your goal is to minimize the 0/1 loss, then you should do that directly, rather than first solving a harder problem (probability estimation) – This is what the perceptron does – Other methods that follow this principle: SVMs, decision trees, neural networks
Comparing Perceptron, Logistic Regression, and LDA Robustness to model assumptions: the generative model usually performs poorly when its assumptions are violated; logistic regression is more robust to model assumptions, and the perceptron is even more robust. Robustness to missing values and noise: in many applications, some of the features x_ij may be missing or corrupted in some of the training examples; generative models typically provide a better way of handling this than non-generative models.
Thank you!
Sharing Some Valuable Points of Machine Learning: Notes of MLSS 2014. Reporter: Xinliang Zhu. July 4, 2014
Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning
Weakly-supervised Learning First, answers to the questions left over from the last lecture.
When does A^T B = B^T A? Given: (AB)^T = B^T A^T, so (A^T B)^T = B^T A. Therefore A^T B = B^T A holds exactly when (A^T B)^T = A^T B, i.e. when A^T B is a symmetric matrix.
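A quick numerical check of this claim, using NumPy; the matrices are arbitrary examples, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((3, 2))

# In general A^T B and B^T A differ (they are transposes of each other).
print(np.allclose(A.T @ B, B.T @ A))   # typically False

# When A^T B happens to be symmetric the two sides agree;
# the simplest case is B = A, since A^T A is always symmetric.
B = A
print(np.allclose(A.T @ B, B.T @ A))   # True
```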
Weakly-supervised Learning Why weakly-supervised learning?
Weakly-supervised Learning
Collecting full annotations for all the images and videos in a large dataset is an onerous and expensive task. By contrast, the largest datasets that provide only image-level labels contain millions of images. Instead of relying on small, fully supervised datasets for complex visual tasks, weakly-supervised learning allows us to use these large, inexpensive datasets.
Multiple Instance Learning
Multiple Instance Learning (discriminant)
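A minimal sketch of the standard multiple-instance assumption that typically underlies the discriminant formulation: a bag is positive if at least one of its instances is positive, so the bag score is the maximum instance score. The linear instance model and the names here are illustrative, not from the slides:

```python
import numpy as np

def bag_score(instances, w):
    """Score a bag under a linear instance-level model.

    instances: (n_instances, n_features) array for one bag.
    Under the standard MIL assumption, the bag score is the max instance score.
    """
    return np.max(instances @ w)

def bag_label(instances, w, threshold=0.0):
    """A bag is labeled positive iff some instance scores above the threshold."""
    return 1 if bag_score(instances, w) > threshold else 0
```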
Multiple Instance Learning Algorithms Learning Axis-Parallel Concepts (Dietterich et al., 1997) Diverse Density (DD) (Maron and Lozano-Perez, 1998) EM-DD (Zhang and Goldman, 2001) Citation kNN (Wang and Zucker, 2000) SVM for multi-instance learning (Andrews et al., 2002)
Multiple Instance Learning Applications Drug activity prediction Content-based image retrieval and classification Text categorization
Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning
Clustering Change detection
Clustering Given unlabeled data, clustering is the process of grouping a set of objects into classes of similar objects.
Issues For Clustering Representation for clustering – similarity/distance How many clusters? – Fixed a priori? – Completely data driven?
Hard vs. Soft Clustering Hard clustering: Each document belongs to exactly one cluster – More common and easier to do Soft clustering: A document can belong to more than one cluster. – You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes – You can only do that with a soft clustering approach.
Clustering Algorithms Flat algorithms – Usually start with a random (partial) partitioning – Refine it iteratively – Examples: K-means clustering, model-based clustering Hierarchical algorithms – Bottom-up (agglomerative) – Top-down (divisive)
K-Means Assumes data are real-valued vectors. Clusters are based on centroids (a.k.a. the center of gravity or mean) of the points in a cluster c: μ(c) = (1/|c|) Σ_{x ∈ c} x. Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means Algorithm Select K random points {s_1, s_2, …, s_K} as seeds. Until clustering converges (or another stopping criterion is met): – For each point d_i: assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal – Then update the seeds to the centroid of each cluster: for each cluster c_j, s_j = μ(c_j)
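A minimal NumPy sketch of the algorithm above, assuming Euclidean distance; initialization by sampling K data points and the fixed iteration budget are illustrative choices:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # K random seeds
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assign each point to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return centroids, labels
```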
How Many Clusters? Either the number of clusters K is given, or finding the "right" number of clusters is part of the problem.
Change Detection Goal: given two sets of samples, we want to compare the probability distributions behind them. Two approaches: – Distributional change detection – Structural change detection
Distributional Change Detection Goal: detect a change in the probability distributions behind two sets of samples by estimating a divergence between them.
Examples ROI detection in images:
Examples Event detection in movies
Examples Event detection from Twitter
Distances and Divergences
Kullback-Leibler Divergence
f-Divergences
Pearson Divergence
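The formulas on these three slides did not survive extraction; for reference, the standard definitions are (p and q are the two densities being compared):

```latex
% Kullback-Leibler divergence
\mathrm{KL}(p \,\|\, q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,\mathrm{d}x

% f-divergence, for a convex f with f(1) = 0
D_f(p \,\|\, q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)\mathrm{d}x

% Pearson (chi-squared) divergence: the f-divergence with f(t) = (t - 1)^2
% (sometimes written with an extra factor of 1/2)
\mathrm{PE}(p \,\|\, q) = \int q(x)\left(\frac{p(x)}{q(x)} - 1\right)^{2}\mathrm{d}x
```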
Estimate Densities Maximum likelihood estimation Bayes estimation Kernel density estimation Nearest-neighbor density estimation
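A minimal sketch of one of the listed estimators, kernel density estimation with a Gaussian kernel for one-dimensional data; the bandwidth is a free parameter and the value used here is illustrative:

```python
import numpy as np

def gaussian_kde(samples, x, bandwidth=0.5):
    """Kernel density estimate p_hat(x) = (1/n) * sum_i K_h(x - x_i)."""
    samples = np.asarray(samples, dtype=float)
    z = (x - samples) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean()

samples = np.array([0.1, 0.3, 0.2, 1.5, 1.7])
print(gaussian_kde(samples, 0.25))   # density estimate at x = 0.25
```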
Thank you!