Sharing Some Valuable Points of Machine Learning: Notes of MLSS 2014. Reporter: Xinliang Zhu. June 27, 2014
Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning
Introduction What is Machine Learning? Machine Learning Process – Collect input-output examples from the experts – Learn a function to map from the input to the output Traditional Software Process – Interview the experts – Create an algorithm that automates their process
Introduction Some Concepts of statistical ML – Supervised Learning – Semi-supervised Learning – Weakly-supervised Learning – Unsupervised Learning – Reinforcement Learning
Introduction Three essential parts of statistical ML – Model (a probability distribution or a discriminant function with unresolved parameters) – Strategy (a loss function and risk function) – Algorithm (the optimization method, e.g. stochastic gradient descent) Statistical ML = Model + Strategy + Algorithm -- Hang Li
Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning
Supervised Learning Given: training examples (x_i, f(x_i)) for some unknown function f. Find: a good approximation to f. Examples – Traffic sign recognition: x is a picture of a traffic sign, f(x) is the type of traffic sign – Spam detection: x is an email message, f(x) is spam or not spam
Supervised Learning Training examples are drawn independently at random according to an unknown probability distribution P(x, y). The learning algorithm analyzes the examples and produces a classifier f. Given a new data point (x, y) drawn from P, the classifier is given x and predicts y, and the loss is then measured. Goal of the learning algorithm: find the f that minimizes the expected loss.
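For reference, the goal just stated can be written compactly; this is standard notation rather than a formula taken from the slide:

```latex
R(f) \;=\; \mathbb{E}_{(x,y)\sim P}\big[\,L(f(x),\,y)\,\big] \;=\; \int L(f(x),\,y)\,\mathrm{d}P(x,y),
\qquad
f^{*} \;=\; \arg\min_{f} R(f)
```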
The Main Approaches to Machine Learning Learn a classifier: a function f. Learn a conditional distribution: P(y|x). Learn the joint probability distribution: P(x, y). One example of each approach – Learn a classifier: the Perceptron algorithm – Learn a conditional distribution: logistic regression – Learn the joint distribution: linear discriminant analysis
Linear Threshold Units We assume that each feature x_j and each weight w_j is a real number. We will study three different algorithms for learning linear threshold units: – Perceptron: learns a function – Logistic regression: learns a conditional distribution – Linear discriminant analysis: learns a joint distribution
A canonical representation Given a training example of the form (⟨x_1, x_2, x_3, x_4⟩, y), transform it to (⟨1, x_1, x_2, x_3, x_4⟩, y), i.e. prepend a constant feature x_0 = 1. The parameter vector will then be (w_0, w_1, w_2, w_3, w_4). We will call the unthresholded hypothesis u(x, w) = w^T x. Each hypothesis can be written h(x) = sgn(u(x, w)). Our goal is to find w.
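A minimal sketch of the canonical representation and the thresholded hypothesis, assuming NumPy and the added constant feature x_0 = 1; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def to_canonical(x):
    """Prepend the constant feature x0 = 1 so the bias w0 is part of w."""
    return np.concatenate(([1.0], x))

def u(x, w):
    """Unthresholded hypothesis u(x, w) = w^T x."""
    return np.dot(w, x)

def h(x, w):
    """Thresholded hypothesis h(x) = sgn(u(x, w)), mapped to {-1, +1}."""
    return 1 if u(x, w) >= 0 else -1

# Example: a 4-feature input and a 5-dimensional weight vector (w0..w4)
x = to_canonical(np.array([0.5, -1.2, 3.0, 0.7]))
w = np.array([0.1, 0.4, -0.3, 0.2, 0.0])
print(h(x, w))
```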
Geometrical View Consider three labeled training examples in the plane of the form ((x_1, x_2), y), e.g. ((2.0, 2.0), -1). We want a classifier, a separating line, that labels the +1 and -1 examples correctly.
The Unthresholded Discriminant Function is a Hyperplane The function u(x, w) = w^T x is linear in x, and the decision boundary u(x, w) = 0 is a hyperplane.
Machine Learning == Optimization Given: – A set of N training examples {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} – A loss function L Find: – The weight vector w that minimizes the expected loss on the training data
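Spelled out in standard notation (the symbol J anticipates the later slide "Minimizing J by Gradient Descent Search"; the exact formula from the original slide is not preserved here), the training objective is:

```latex
J(\mathbf{w}) \;=\; \frac{1}{N}\sum_{i=1}^{N} L\big(f(\mathbf{x}_i;\mathbf{w}),\, y_i\big),
\qquad
\mathbf{w}^{*} \;=\; \arg\min_{\mathbf{w}} J(\mathbf{w})
```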
Step-wise Constant Loss Function The 0/1 loss is piecewise constant in w, so its derivative is either 0 or undefined, which makes direct gradient-based optimization impossible.
Approximating the Expected Loss by a Smooth Function Simplify the optimization problem by replacing the original objective function with a surrogate loss function. Hinge loss: when y = 1, the loss is max(0, 1 - w^T x); more generally, for y ∈ {-1, +1}, it is max(0, 1 - y·w^T x).
Minimizing J by Gradient Descent Search
Batch Perceptron Algorithm
Online Perceptron Algorithm This is also called stochastic gradient descent, because the overall gradient is approximated by the gradient from each individual example.
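A minimal sketch of the online (stochastic-gradient) perceptron described above, assuming labels y ∈ {-1, +1} and inputs already in the canonical form with a leading constant feature; the learning rate and epoch count are illustrative choices, not taken from the slides:

```python
import numpy as np

def online_perceptron(X, y, epochs=10, eta=1.0):
    """Online perceptron: update w on each misclassified example.

    X: (n_examples, n_features) array with a leading constant feature of 1.
    y: array of labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:      # mistake (or on the boundary)
                w += eta * yi * xi           # gradient step for this example
    return w

# Tiny linearly separable example (first column is the constant feature)
X = np.array([[1.0, 2.0, 2.0], [1.0, -1.0, -1.5], [1.0, 3.0, 1.0], [1.0, -2.0, -0.5]])
y = np.array([+1, -1, +1, -1])
w = online_perceptron(X, y)
print(np.sign(X @ w))   # should reproduce y
```

The batch variant on the previous slide differs only in that it sums the per-example updates over the whole training set before changing w.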
Logistic Regression Learn the conditional probability P(y|x). Let p_y(x; w) be our estimate of P(y|x), where w is a vector of adjustable parameters. Assume only two classes, y = 0 and y = 1, and model p_1(x; w) = exp(w^T x) / (1 + exp(w^T x)). It is easy to show that this is equivalent to log[ p_1(x; w) / (1 - p_1(x; w)) ] = w^T x. In other words, the log odds of class 1 is a linear function of x.
The reason for choosing the exp function A linear function has a range from negative infinity to positive infinity, and we need to force the outputs to be positive and to sum to 1 in order to be a probability.
Choosing the Loss Function For probabilistic models, we use the log loss: L(w; x, y) = -log P(y | x; w).
Compare with 0/1 Loss
Maximum Likelihood Fitting To minimize the log loss, we should maximize the likelihood of the data, which is ∏_i P(y_i | x_i; w). It is easier to work with the log likelihood: ∑_i log P(y_i | x_i; w).
Maximizing the Log Likelihood via Gradient Ascent This is similar to the gradient descent procedure used for the perceptron, but we first need to rewrite the log likelihood in terms of p_1(x_i; w) before differentiating.
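A minimal sketch of fitting logistic regression by gradient ascent on the log likelihood, assuming labels y ∈ {0, 1}; the gradient ∑_i (y_i - p_1(x_i; w)) x_i is the standard result the slide alludes to, and the step size and iteration count are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Maximize the log likelihood sum_i log P(y_i | x_i; w) by gradient ascent.

    X: (n, d) array with a leading constant feature of 1 (bias term w0).
    y: array of labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p1 = sigmoid(X @ w)          # p_1(x_i; w) for every example
        gradient = X.T @ (y - p1)    # gradient of the log likelihood w.r.t. w
        w += eta * gradient / len(y)
    return w

X = np.array([[1.0, 0.5], [1.0, 2.5], [1.0, -1.0], [1.0, 3.0]])
y = np.array([0, 1, 0, 1])
w = fit_logistic_regression(X, y)
print(sigmoid(X @ w))    # estimated P(y=1 | x) for each training example
```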
Logistic Regression Implements a Linear Discriminant Function In the two-class 0/1-loss case, we should predict y = 1 if P(y=1|x; w) > 0.5, i.e. if P(y=1|x; w) / P(y=0|x; w) > 1. Taking the log of both sides gives log odds > 0, or equivalently w^T x > 0.
The Joint Probability Approach: Linear Discriminant Analysis Learn P(x, y). This is called the generative approach, because we can think of P(x, y) as a model of how the data is generated. – For example, we can factor the joint distribution into the form P(x, y) = P(y) P(x|y) – Generative story: draw y ~ P(y) to choose a class, then draw x ~ P(x|y) to generate the features – This can be represented as a probabilistic graphical model
Linear Discriminant Analysis P(y) is a discrete multinomial distribution. For LDA, we assume that P(x|y=k) is a multivariate normal distribution with class mean μ_k and a covariance matrix Σ shared by all classes.
The LDA Model Linear discriminant analysis assumes that the joint distribution has the form P(x, y=k) = π_k · N(x; μ_k, Σ), where π_k = P(y=k) and N(·; μ_k, Σ) is a multivariate Gaussian density.
Fitting the LDA Model The parameters can be estimated in closed form from the training data: π_k is the fraction of examples in class k, μ_k is the mean of the examples in class k, and Σ is the pooled covariance of the examples around their class means.
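A minimal sketch of fitting the two-class LDA model as described above (class priors, class means, and a shared covariance estimated directly from the data, with no gradient descent); the function names and the pooled-covariance estimator are standard choices, not copied from the slides:

```python
import numpy as np

def fit_lda(X, y):
    """Estimate the LDA parameters pi_k, mu_k and the shared covariance Sigma."""
    classes = np.unique(y)
    n, d = X.shape
    pi, mu = {}, {}
    sigma = np.zeros((d, d))
    for k in classes:
        Xk = X[y == k]
        pi[k] = len(Xk) / n                      # class prior P(y = k)
        mu[k] = Xk.mean(axis=0)                  # class mean
        sigma += (Xk - mu[k]).T @ (Xk - mu[k])   # pooled scatter
    sigma /= n
    return pi, mu, sigma

def lda_ltu_weights(pi, mu, sigma):
    """For classes {0, 1}, LDA induces a linear threshold unit: predict 1 when w^T x + b > 0."""
    inv = np.linalg.inv(sigma)
    w = inv @ (mu[1] - mu[0])
    b = -0.5 * (mu[1] @ inv @ mu[1] - mu[0] @ inv @ mu[0]) + np.log(pi[1] / pi[0])
    return w, b
```

The second function makes explicit why the next slide says "LDA learns an LTU": under the shared-covariance Gaussian assumption, the log odds log P(y=1|x) / P(y=0|x) is a linear function of x.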
LDA learns an LTU
Two Geometric Views of LDA View 1: Mahalanobis Distance
Two Geometric Views of LDA View 2: Most Informative Low-Dimensional Projection
Comparing Perceptron, Logistic Regression, and LDA Statistical efficiency: if the generative model P(x, y) is correct, then LDA usually gives the highest accuracy, particularly when the amount of training data is small. If the model is correct, LDA requires roughly 30% less data than logistic regression in theory. Computational efficiency: generative models are typically the easiest to learn; the LDA parameters can be computed directly from the data without using gradient descent.
Comparing Perceptron, Logistic Regression, and LDA Vapnik's principle – If your goal is to minimize the 0/1 loss, then you should do that directly, rather than first solving a harder problem (probability estimation) – This is what the perceptron does – Other methods that follow this principle: SVMs, decision trees, neural networks
Comparing Perceptron, Logistic Regression, and LDA Robustness to model assumptions: the generative model usually performs poorly when its assumptions are violated; logistic regression is more robust to model assumptions, and the perceptron is even more robust. Robustness to missing values and noise: in many applications, some of the features x_ij may be missing or corrupted in some of the training examples; generative models typically provide a better way of handling this than non-generative models.
Thank you!
Sharing Some Valuable Points of Machine Learning: Notes of MLSS 2014. Reporter: Xinliang Zhu. July 4, 2014
Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning
Weakly-supervised Learning First, answers to the questions left over from the last lecture.
When does A^T B = B^T A? Given: (AB)^T = B^T A^T, so (A^T B)^T = B^T A. Therefore A^T B = B^T A holds exactly when (A^T B)^T = A^T B, i.e. when A^T B is a symmetric matrix.
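A quick numerical check of this claim, using NumPy; the matrices are arbitrary examples, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((3, 2))

# In general A^T B and B^T A differ (they are transposes of each other).
print(np.allclose(A.T @ B, B.T @ A))   # typically False

# When A^T B happens to be symmetric the two sides agree;
# the simplest case is B = A, since A^T A is always symmetric.
B = A
print(np.allclose(A.T @ B, B.T @ A))   # True
```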
Weakly-supervised Learning Why weakly-supervised learning?
Weakly-supervised Learning
Collecting full annotations for all the images and videos in a large dataset is an onerous and expensive task. By contrast, the largest datasets that provide only image-level labels contain millions of images. Instead of relying on small, fully supervised datasets for complex visual tasks, weakly-supervised learning allows us to use these large, inexpensive datasets.
Multiple Instance Learning
Multiple Instance Learning (discriminant)
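A minimal sketch of the standard multiple-instance assumption that typically underlies the discriminant formulation: a bag is positive if at least one of its instances is positive, so the bag score is the maximum instance score. The linear instance model and the names here are illustrative, not from the slides:

```python
import numpy as np

def bag_score(instances, w):
    """Score a bag under a linear instance-level model.

    instances: (n_instances, n_features) array for one bag.
    Under the standard MIL assumption, the bag score is the max instance score.
    """
    return np.max(instances @ w)

def bag_label(instances, w, threshold=0.0):
    """A bag is labeled positive iff some instance scores above the threshold."""
    return 1 if bag_score(instances, w) > threshold else 0
```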
Multiple Instance Learning Algorithms Learning Axis-Parallel Concepts (Dietterich et al., 1997) Diverse Density (DD) (Maron and Lozano-Perez, 1998) EM-DD (Zhang and Goldman, 2001) Citation kNN (Wang and Zucker, 2000) SVM for multi-instance learning (Andrews et al., 2002)
Multiple Instance Learning Applications Drug activity prediction Content-based image retrieval and classification Text categorization
Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning
Clustering Change detection
Clustering Given unlabeled data, clustering is the process of grouping a set of objects into classes of similar objects.
Issues For Clustering Representation for clustering – similarity/distance How many clusters? – Fixed a priori? – Completely data driven?
Hard vs. Soft Clustering Hard clustering: Each document belongs to exactly one cluster – More common and easier to do Soft clustering: A document can belong to more than one cluster. – You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes – You can only do that with a soft clustering approach.
Clustering Algorithms Flat algorithms – Usually start with a random (partial) partitioning – Refine it iteratively – Examples: K-means clustering, model-based clustering Hierarchical algorithms – Bottom-up (agglomerative) – Top-down (divisive)
K-Means Assumes data are real-valued vectors. Clusters are based on centroids (a.k.a. the center of gravity or mean) of the points in a cluster c: μ(c) = (1/|c|) Σ_{x ∈ c} x. Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means Algorithm Select K random points {s_1, s_2, …, s_K} as seeds. Until clustering converges (or another stopping criterion is met): – For each point d_i: assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal – Then update the seeds to the centroid of each cluster: for each cluster c_j, s_j = μ(c_j)
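A minimal NumPy sketch of the algorithm above, assuming Euclidean distance; initialization by sampling K data points and the fixed iteration budget are illustrative choices:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # K random seeds
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assign each point to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return centroids, labels
```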
How Many Clusters? Either the number of clusters K is given, or finding the "right" number of clusters is part of the problem.
Change Detection Goal: given two sets of samples, we want to compare the probability distributions behind them. Two approaches: – Distributional change detection – Structural change detection
Distributional Change Detection Goal: detect a change in the probability distributions behind two sets of samples by estimating a divergence between them.
Examples ROI detection in images:
Examples Event detection in movies
Examples Event detection from Twitter
Distances and Divergences
Kullback-Leibler Divergence
f-Divergences
Pearson Divergence
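The formulas on these three slides did not survive extraction; for reference, the standard definitions are (p and q are the two densities being compared):

```latex
% Kullback-Leibler divergence
\mathrm{KL}(p \,\|\, q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,\mathrm{d}x

% f-divergence, for a convex f with f(1) = 0
D_f(p \,\|\, q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)\mathrm{d}x

% Pearson (chi-squared) divergence: the f-divergence with f(t) = (t - 1)^2
% (sometimes written with an extra factor of 1/2)
\mathrm{PE}(p \,\|\, q) = \int q(x)\left(\frac{p(x)}{q(x)} - 1\right)^{2}\mathrm{d}x
```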
Estimate Densities Maximum likelihood estimation Bayes estimation Kernel density estimation Nearest-neighbor density estimation
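A minimal sketch of one of the listed estimators, kernel density estimation with a Gaussian kernel for one-dimensional data; the bandwidth is a free parameter and the value used here is illustrative:

```python
import numpy as np

def gaussian_kde(samples, x, bandwidth=0.5):
    """Kernel density estimate p_hat(x) = (1/n) * sum_i K_h(x - x_i)."""
    samples = np.asarray(samples, dtype=float)
    z = (x - samples) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean()

samples = np.array([0.1, 0.3, 0.2, 1.5, 1.7])
print(gaussian_kde(samples, 0.25))   # density estimate at x = 0.25
```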
Thank you!