
1 Sharing Some Valuable Points of Machine Learning Notes of MLSS2014 Reporter: Xinliang Zhu June 27, 2014

2 Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning

3 Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning

4 Introduction What is Machine Learning? – Machine learning process: collect input-output examples from the experts; learn a function to map from the input to the output – Traditional software process: interview the experts; create an algorithm that automates their process

5 Introduction Some Concepts of statistical ML – Supervised Learning – Semi-supervised Learning – Weakly-supervised Learning – Unsupervised Learning – Reinforcement Learning

6 Introduction What is Machine Learning? – Machine learning process: collect input-output examples from the experts; learn a function to map from the input to the output – Traditional software process: interview the experts; create an algorithm that automates their process

7 Introduction 3 essential parts of statistical ML – Model (a probability distribution or discriminant function with unknown parameters) – Strategy (loss function and risk function) – Algorithm (optimization algorithms, stochastic gradient descent, etc.) Statistical ML = Model + Strategy + Algorithm -- Hang Li

8 Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning

9 Supervised Learning Given: training examples (x_i, f(x_i)) for some unknown function f. Find: a good approximation to f. Examples – Traffic sign recognition: x is a picture of a traffic sign, f(x) is the type of traffic sign – Spam detection: x is an email message, f(x) is spam or not spam

10 Supervised Learning Training examples are drawn independently at random according to an unknown probability distribution P(x,y) – The learning algorithm analyzes the examples and produces a classifier f – Given a new data point (x,y) drawn from P, the classifier is given x and predicts ŷ = f(x) – The loss L(ŷ, y) is then measured – Goal of the learning algorithm: find the f that minimizes the expected loss

11 The Main Approaches to Machine Learning – Learn a classifier: a function f – Learn a conditional distribution: P(y|x) – Learn the joint probability distribution: P(x,y) One example of each approach – Learn a classifier: the Perceptron algorithm – Learn a conditional distribution: logistic regression – Learn the joint distribution: linear discriminant analysis

12 Linear Threshold Units We assume that each feature x_j and each weight w_j is a real number. We will study three different algorithms for learning linear threshold units: – Perceptron: function – Logistic Regression: conditional distribution – Linear Discriminant Analysis: joint distribution

13 A canonical representation Given a training example of the form ((x_1, x_2, x_3, x_4), y), transform it to ((1, x_1, x_2, x_3, x_4), y). The parameter vector will then be (w_0, w_1, w_2, w_3, w_4). We will call the unthresholded hypothesis u(x,w) = Σ_j w_j x_j = w^T x. Each hypothesis can be written h(x) = sgn(u(x,w)). Our goal is to find w.
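
As a small illustration of this representation (a hedged sketch; the helper names and the example weight vector are mine, not from the slides):

```python
import numpy as np

def add_bias(x):
    """Map (x_1, ..., x_d) to (1, x_1, ..., x_d) so the threshold w_0 becomes part of w."""
    return np.concatenate(([1.0], x))

def h(x, w):
    """Thresholded hypothesis h(x) = sgn(u(x, w)) with u(x, w) = w^T x."""
    return 1 if w @ add_bias(x) > 0 else -1

# example: 4 features, so w = (w_0, w_1, w_2, w_3, w_4)
w = np.array([-1.0, 0.5, 0.5, 0.0, 0.0])
print(h(np.array([2.0, 1.0, 0.0, 0.0]), w))   # prints 1, since u = -1 + 1.0 + 0.5 > 0
```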

14 Geometrical View Consider three training examples in the plane, e.g. positive examples labeled +1 and the negative example ((2.0, 2.0), -1). We want a linear classifier that separates the two classes (figure omitted).

15 The Unthresholded Discriminant Function is a Hyperplane The equation u(x,w) = w^T x defines a hyperplane decision boundary (the set of points where u(x,w) = 0).

16 Machine Learning == Optimization Given: – A set of N training examples {(x_1,y_1), (x_2,y_2), …, (x_N,y_N)} – A loss function L Find: – The weight vector w that minimizes the average loss on the training data (the empirical risk)
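
Written out (a reconstruction, since the slide's own formula is an image not captured here), the objective is empirical risk minimization:

\[
\hat{w} \;=\; \arg\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} L\big(y_i,\, f(x_i; w)\big)
\]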

17 Step-wise Constant Loss Function The 0/1 loss is piecewise constant, so its derivative is either 0 or infinite (undefined at the jumps), which makes it unusable for gradient-based optimization.

18 Approximating the expected loss by a smooth function Simplify the optimization problem by replacing the original objective function with a surrogate loss function, e.g. the hinge loss (shown below for y = 1).
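
A standard form of the hinge loss (reconstructed; the slide's formula is an image), with labels y in {-1, +1} and u = w^T x:

\[
L_{\text{hinge}}(y, u) = \max(0,\; 1 - y\,u), \qquad \text{so for } y = 1:\quad L_{\text{hinge}} = \max(0,\; 1 - w^\top x)
\]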

19 Minimizing J by Gradient Descent Search

20 Batch Perceptron Algorithm
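
The slide's pseudocode is an image that did not survive the transcript; below is a minimal sketch of a batch perceptron under the conventions above (bias-augmented inputs, labels in {-1, +1}). The function name, learning rate, and epoch count are illustrative choices, not from the slides.

```python
import numpy as np

def batch_perceptron(X, y, lr=0.1, epochs=100):
    """Batch perceptron: one weight update per pass over the data.

    X : (n_samples, n_features) NumPy array, already augmented with a bias column of 1s.
    y : NumPy array of labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        mistakes = margins <= 0              # misclassified points (or on the boundary)
        if not mistakes.any():
            break                            # converged: every example classified correctly
        # sum the (sub)gradient contributions of all mistakes, then take one step
        w += lr * (y[mistakes, None] * X[mistakes]).sum(axis=0)
    return w
```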

21 Online Perceptron Algorithm This is also called stochastic gradient descent, because the overall gradient is approximated by the gradient computed from each individual example.
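
A matching sketch of the online (stochastic) variant, again with illustrative names and defaults: the weights are updated immediately after each misclassified example.

```python
import numpy as np

def online_perceptron(X, y, lr=1.0, epochs=10):
    """Online (stochastic) perceptron: update right after each misclassified example.

    X : (n_samples, n_features) bias-augmented NumPy array; y : labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:    # mistake: the prediction disagrees with the label
                w += lr * yi * xi     # gradient step on this single example
    return w
```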

22 Logistic Regression Learn the conditional probability P(y|x). Let p_y(x; w) be our estimate of P(y|x), where w is a vector of adjustable parameters. Assume only two classes, y = 0 and y = 1; the model and its equivalent log-odds form are given below. In other words, the log odds of class 1 is a linear function of x.
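
The standard two-class logistic model the slide describes (reconstructed; the slide's formula images are not in the transcript):

\[
p_1(x; w) = \frac{1}{1 + \exp(-w^\top x)}, \qquad p_0(x; w) = 1 - p_1(x; w),
\]

which is equivalent to

\[
\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = w^\top x .
\]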

23 The reason for choosing the exp function A linear function has a range from negative infinity to positive infinity, and we need to force the outputs to be positive and sum to 1 in order to form a probability distribution.

24 Choosing the Loss Function For probabilistic models, we use the log loss: the negative log probability the model assigns to the correct label, -log p_y(x; w).

25 Compare with 0/1 Loss

26 Maximum Likelihood Fitting To minimize the log loss, we should maximize the probability the model assigns to the training labels. The likelihood of the data, and the log likelihood that is easier to work with, are given below.
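
The likelihood and log likelihood for this two-class model (reconstructed from the surrounding definitions):

\[
\ell(w) = \prod_{i=1}^{N} p_{y_i}(x_i; w), \qquad
\log \ell(w) = \sum_{i=1}^{N} \Big[\, y_i \log p_1(x_i; w) + (1 - y_i)\log\big(1 - p_1(x_i; w)\big) \Big]
\]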

27 Maximizing the log likelihood via gradient ascent Similar to the gradient descent procedure used for the perceptron, but we need to rewrite the log likelihood in terms of p_1(x_i; w) before differentiating.
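
As a sketch of this procedure (assuming bias-augmented inputs and labels in {0, 1}; the learning rate and iteration count are arbitrary choices of mine), using the standard gradient of the log likelihood, sum_i (y_i - p_1(x_i; w)) x_i:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Fit logistic regression by batch gradient ascent on the log likelihood.

    X : (n_samples, n_features) bias-augmented NumPy array; y : labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p1 = sigmoid(X @ w)          # current estimates p_1(x_i; w)
        grad = X.T @ (y - p1)        # gradient of the log likelihood: sum_i (y_i - p1_i) x_i
        w += lr * grad
    return w
```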

28 Logistic Regression Implements a Linear Discriminant Function In the 2-class 0/1-loss case, we should predict y = 1 if P(y=1|x; w) > 0.5. Taking the log odds of both sides shows that this is equivalent to w^T x > 0.

29 The Joint Probability Approach: Linear Discriminant Analysis Learn P(x,y). This is called the generative approach, because we can think of P(x,y) as a model of how the data is generated – For example, factor the joint distribution into the form P(x,y) = P(y)P(x|y) – Generative story: draw y ~ P(y) (choose a class), then draw x ~ P(x|y) (generate the features of x) – This can be represented as a probabilistic graphical model

30 Linear Discriminant Analysis P(y) is a discrete (multinomial) distribution. For LDA, we assume that P(x|y=k) is a multivariate normal distribution with class mean μ_k and a covariance matrix Σ shared by all classes.

31 The LDA Model Linear discriminant analysis assumes that the joint distribution has the form given below.
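
The joint-distribution form the slide refers to (reconstructed; here π_k = P(y = k) and Σ is the covariance matrix shared by all classes):

\[
P(x, y = k) \;=\; \pi_k \,\mathcal{N}(x \mid \mu_k, \Sigma)
\;=\; \pi_k \,\frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)\Big)
\]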

32 Fitting the LDA Model
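
A minimal sketch of the closed-form fit (maximum-likelihood estimates; the function name is mine, and dividing the pooled scatter by n rather than n - K is one common convention):

```python
import numpy as np

def fit_lda(X, y):
    """Closed-form LDA fit: class priors pi_k, class means mu_k, pooled covariance Sigma."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, d = X.shape
    priors, means = {}, {}
    sigma = np.zeros((d, d))
    for k in classes:
        Xk = X[y == k]
        priors[k] = len(Xk) / n             # pi_k: fraction of examples in class k
        means[k] = Xk.mean(axis=0)          # mu_k: class mean
        centered = Xk - means[k]
        sigma += centered.T @ centered      # accumulate within-class scatter
    sigma /= n                              # pooled covariance (ML estimate)
    return priors, means, sigma
```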

33 LDA learns an LTU
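
For two classes with a shared covariance Σ, the quadratic terms cancel in the log odds, which is why LDA yields a linear threshold unit (a standard derivation, reconstructed here):

\[
\log\frac{P(y=1 \mid x)}{P(y=0 \mid x)}
= x^\top \Sigma^{-1}(\mu_1 - \mu_0)
\;-\; \tfrac{1}{2}\mu_1^\top \Sigma^{-1}\mu_1
\;+\; \tfrac{1}{2}\mu_0^\top \Sigma^{-1}\mu_0
\;+\; \log\frac{\pi_1}{\pi_0}
\;=\; w^\top x + b .
\]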

34

35

36 Two Geometric Views of LDA View 1: Mahalanobis Distance

37 Two Geometric Views of LDA View 2: Most Informative Low-Dimensional Projection

38 Comparing Perceptron, Logistic Regression, and LDA Statistical efficiency: if the generative model P(x,y) is correct, then LDA usually gives the highest accuracy, particularly when the amount of training data is small; in theory, LDA then requires about 30% less data than logistic regression. Computational efficiency: generative models are typically the easiest to learn; the LDA parameters can be computed directly from the data without gradient descent.

39 Comparing Perceptron, Logistic Regression, and LDA Vapnik's Principle – If your goal is to minimize 0/1 loss, you should do that directly, rather than first solving a harder problem (probability estimation) – This is what the Perceptron does – Other algorithms that follow this principle: SVM, decision trees, neural networks

40 Comparing Perceptron, Logistic Regression, and LDA Robustness to model assumptions: the generative model usually performs poorly when its assumptions are violated; logistic regression is more robust to model assumptions, and the Perceptron is even more robust. Robustness to missing values and noise: in many applications, some of the features x_ij may be missing or corrupted in some of the training examples; generative models typically provide a better way of handling this than non-generative models.

41 Thank you!

42 Sharing Some Valuable Points of Machine Learning Notes of MLSS2014 Reporter: Xinliang Zhu July 4, 2014

43 Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning

44 Weakly-supervised Learning Answers to the questions left over from the last lecture

45

46

47

48

49 When is A^T B = B^T A? Given that (AB)^T = B^T A^T, we have (A^T B)^T = B^T A. So A^T B = B^T A holds exactly when (A^T B)^T = A^T B, i.e. when A^T B is a symmetric matrix.

50 Weakly-supervised Learning Why weakly-supervised learning?

51 Weakly-supervised Learning

52 Collecting full annotations for all the images and videos in a large dataset is an onerous and expensive task; the largest dataset that provides only image-level labels consists of millions of images. Instead of relying on small, fully supervised datasets for complex visual tasks, weakly-supervised learning allows us to use large, inexpensive datasets.

53 Multiple Instance Learning

54

55

56

57 Multiple Instance Learning (discriminant)
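
The slide's formulas are not captured in the transcript; as a hedged illustration of the standard MIL assumption (a bag is labeled positive iff at least one of its instances is positive), a linear instance scorer can be turned into a bag-level discriminant by taking the maximum over instances. The function names, the linear scorer, and the example numbers are illustrative assumptions.

```python
import numpy as np

def bag_score(bag, w):
    """Score of a bag (a 2-D array of instance feature vectors) under the standard
    MIL assumption: a bag is positive iff at least one instance is positive, so the
    bag's score is the maximum instance score (here a linear scorer w^T x)."""
    return np.max(bag @ w)

def predict_bag(bag, w):
    return 1 if bag_score(bag, w) > 0 else -1

# illustrative usage: one bag with three instances, a hand-picked weight vector
w = np.array([1.0, -0.5])
bag = np.array([[0.2, 0.9], [1.5, 0.1], [-0.3, 0.4]])
print(predict_bag(bag, w))   # 1: the second instance scores 1.5 - 0.05 = 1.45 > 0
```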

58 Multiple Instance Learning Algorithms Learning Axis-Parallel Concepts (Dietterich et al., 1997) Diverse Density (DD) (Maron and Lozano-Perez, 1998) EM-DD (Zhang and Goldman, 2001) Citation kNN (Wang and Zucker, 2000) SVM for multi-instance learning (Andrews et al., 2002)

59 Multiple Instance Learning Applications Drug activity prediction Content-based image retrieval and classification Text categorization

60 Contents Introduction Supervised Learning Weakly-supervised Learning Unsupervised Learning

61 Clustering Change detection

62 Clustering Unlabeled data Clustering: the process of grouping a set of objects into classes of similar objects

63 Issues For Clustering Representation for clustering – similarity/distance How many clusters? – Fixed a priori? – Completely data driven?

64 Hard vs. Soft Clustering Hard clustering: Each document belongs to exactly one cluster – More common and easier to do Soft clustering: A document can belong to more than one cluster. – You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes – You can only do that with a soft clustering approach.

65 Clustering Algorithms Flat algorithms – Usually start with a random (partial) partitioning – Refine it iteratively, e.g. K-means clustering (or model-based clustering) Hierarchical algorithms – Bottom-up, agglomerative – (Top-down, divisive)

66 K-Means Assumes data are real-valued vectors. Clusters are based on the centroids (aka the center of gravity, or mean) of the points in a cluster c, defined below. Reassignment of instances to clusters is based on the distance to the current cluster centroids.
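
The centroid formula the slide points to (the standard definition, reconstructed):

\[
\mu(c) = \frac{1}{|c|} \sum_{x \in c} x
\]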

67 K-Means Algorithm Select K random points {s_1, s_2, …, s_K} as seeds. Until the clustering converges (or another stopping criterion is met): for each point x_i, assign x_i to the cluster c_j whose seed s_j minimizes dist(x_i, s_j); then update the seeds to the centroid of each cluster, i.e. for each cluster c_j, set s_j = μ(c_j).
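
A minimal sketch of this loop (random seeds, then alternating assignment and centroid updates until the assignments stop changing); the convergence test, iteration cap, and seed handling are my own choices:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: random seeds, then alternate assignment and centroid updates."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # K random points as seeds
    labels = None
    for _ in range(n_iters):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                              # assignments stopped changing
        labels = new_labels
        # update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```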

68 How Many Clusters Either the number of clusters K is given, or finding the "right" number of clusters is part of the problem.

69 Change Detection Goal: given two sets of samples, we want to compare the probability distributions behind them. Two approaches: – Distributional change detection – Structural change detection

70 Distributional Change Detection Goal: detect changes in the probability distributions behind two sets of samples via a divergence measure

71 Examples ROI detection in images:

72 Examples Event detection in movies

73 Examples Event detection from Twitter

74 Distances and Divergences

75 Kullback-Leibler Divergence
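
The definition the slide shows as an image (standard form, for densities p and q):

\[
\mathrm{KL}(p \,\|\, q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx
\]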

76 f-Divergences
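
The general family (standard form); KL is the member with f(t) = t log t:

\[
D_f(p \,\|\, q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx,
\qquad f \text{ convex},\; f(1) = 0 .
\]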

77 Pearson Divergence
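
One common convention for the Pearson divergence, the f-divergence with f(t) = (t - 1)^2 / 2 (some texts omit the 1/2 factor):

\[
\mathrm{PE}(p \,\|\, q) = \frac{1}{2}\int q(x)\left(\frac{p(x)}{q(x)} - 1\right)^{2} dx
\]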

78 Estimate Densities Maximum likelihood estimation Bayes estimation Kernel density estimation Nearest-neighbor density estimation
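
As a hedged sketch of the "estimate the densities, then compare them" route using kernel density estimation: gaussian_kde is SciPy's KDE, while the plug-in KL estimator and the synthetic data below are illustrative choices of mine, not from the slides.

```python
import numpy as np
from scipy.stats import gaussian_kde

def plug_in_kl(samples_p, samples_q):
    """Rough plug-in estimate of KL(p || q) for 1-D samples: fit a kernel density
    estimate to each sample set, then average log(p_hat / q_hat) over the p-samples."""
    p_hat = gaussian_kde(samples_p)
    q_hat = gaussian_kde(samples_q)
    ratio = p_hat(samples_p) / np.maximum(q_hat(samples_p), 1e-12)   # guard against division by zero
    return float(np.mean(np.log(ratio)))

# illustrative usage with synthetic 1-D data: a small shift in the mean
rng = np.random.default_rng(0)
x_before = rng.normal(0.0, 1.0, size=500)   # samples from p
x_after = rng.normal(0.5, 1.0, size=500)    # samples from q
print(plug_in_kl(x_before, x_after))        # positive, and grows as the change gets larger
```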

79 Thank you!

