1. CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek, Spring 2013.

2. Regression Classifiers
- We said earlier that the task of a supervised learning system can be viewed as learning a function which predicts the outcome from the inputs: given a training set of N example pairs (x1, y1), (x2, y2), ..., (xN, yN), where each yj was generated by an unknown function y = f(x), discover a function h that approximates the true function f.
- A large class of supervised learning approaches discover h with regression methods.
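
A minimal sketch of "discover h that approximates f", using NumPy to fit a straight line to invented (x, y) training pairs; the data and the underlying f are made up purely for illustration.

```python
# Fit a hypothesis h to noisy training pairs generated by an unknown f.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)                   # inputs x1 ... xN
y = 3.0 * x + 2.0 + rng.normal(0, 1, 20)     # "unknown" f(x) = 3x + 2, plus noise

w1, w0 = np.polyfit(x, y, deg=1)             # hypothesis h(x) = w1*x + w0
print(f"learned h(x) = {w1:.2f}*x + {w0:.2f}")
print("h(5) =", np.polyval([w1, w0], 5.0))   # predict the outcome for a new input
```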

3. Regression Classifiers
- The basic idea underlying regression is: plot all of the sample points for n-feature examples in an n-dimensional space, then find a plane in that space that best separates the positive examples from the negative examples.
- If you are familiar with the statistical concept of regression for prediction, this is the same idea.

4-7. Linear Classifiers (borrowed heavily from Andrew Moore's tutorials: http://www.cs.cmu.edu/~awm/tutorials, copyright © 2001, 2003, Andrew W. Moore)
- Figures show a two-class dataset (points denoting +1 and -1): how would you classify this data?

8. Linear Classifiers (from Andrew Moore's tutorials)
- Any of these would be fine... but which is best?

9. Measuring Fit
- For prediction, we can measure the error of our prediction by looking at how far off our predicted value is from the actual value: compute the individual errors and sum them.
- Typically we are much more worried about large errors than small ones, so we square the errors.
- This gives us a measure of fit which is the sum of squared errors.
- The best-fit hypothesis can be found analytically (see Equation 18.3 in the text).
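
A small sketch of this measure: the sum of squared errors on toy data, with the best-fit weights obtained analytically via NumPy's least-squares solver (the feature values and outcomes below are invented).

```python
# Sum of squared errors as a measure of fit, plus the analytic least-squares fit.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])  # toy feature rows
y = np.array([5.1, 5.9, 9.2, 13.1])                             # observed outcomes

Xb = np.hstack([np.ones((len(X), 1)), X])       # add an intercept column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)      # minimizes the sum of squared errors

predictions = Xb @ w
sse = np.sum((y - predictions) ** 2)            # individual errors, squared, summed
print("weights:", w, " SSE:", sse)
```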

10. Linear Classifiers
- A linear classifier is just a hypothesis determined by linear regression with a threshold added.
- Rather than a hard threshold, we typically use a logistic function to determine the cutoff, fitted through gradient descent.
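
A minimal sketch of that idea in plain NumPy: a linear model passed through a logistic (sigmoid) function, with the weights fitted by simple batch gradient descent on invented two-feature data.

```python
# Linear classifier with a logistic output instead of a hard threshold.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, 1.0], [1.0, 1.5], [3.0, 3.5], [3.5, 4.0]])  # toy 2-feature points
y = np.array([0, 0, 1, 1])                                      # class labels
Xb = np.hstack([np.ones((len(X), 1)), X])                       # intercept term

w = np.zeros(Xb.shape[1])
lr = 0.1
for _ in range(2000):                    # gradient descent on the logistic loss
    p = sigmoid(Xb @ w)                  # predicted probabilities
    w -= lr * Xb.T @ (p - y) / len(y)    # gradient step

print("weights:", w)
print("P(class 1) for [2.0, 2.5]:", sigmoid(np.array([1.0, 2.0, 2.5]) @ w))
```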

11. More on Regression Models
- So far we have discussed linear models.
- We can add dimensions to the model by including higher-order terms, such as squared or cubed values of the features.
- As with decision trees, we can get overfitting: if we add enough dimensions we can fit almost anything!

12. Support Vector Machines
- A Support Vector Machine (SVM) is a classifier: it uses features of instances to decide which class each instance belongs to.
- It is a supervised machine-learning classifier: training cases are used to calculate parameters for a model which can then be applied to new instances to make a decision.
- It is a binary classifier: it distinguishes between two classes.
- It is currently the most popular off-the-shelf machine learning classifier.

13. Basic Idea Underlying SVMs
- Find a line, or a plane, or a hyperplane, that separates our classes cleanly. This is the same concept as we have seen in regression.
- Do so by finding the greatest margin separating them. This is not the same concept as we have seen in regression. What does it mean?

14. Linear Classifiers (from Andrew Moore's tutorials)
- Any of these would be fine... but which is best?

15. Classifier Margin (from Andrew Moore's tutorials)
- Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

16. Maximum Margin (from Andrew Moore's tutorials)
- The maximum margin linear classifier is the linear classifier with the maximum margin.
- It is called the Linear Support Vector Machine (SVM).

17. Maximum Margin, continued (from Andrew Moore's tutorials)
- The maximum margin linear classifier is the linear classifier with the maximum margin, called the Linear Support Vector Machine (SVM).
- Support vectors are those datapoints that the margin pushes up against.

18. Why Maximum Margin? (from Andrew Moore's tutorials)
- The classifier is f(x, w, b) = sign(w · x - b); the maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM).
- Support vectors are those datapoints that the margin pushes up against.
- 1. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
- 2. Empirically it works very, very well.
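
A hedged sketch of a linear SVM, assuming scikit-learn is available: fit a maximum-margin separator on invented toy points, then read off the weight vector, the support vectors, and the margin width 2/||w||. Note that scikit-learn's decision function is w·x + b rather than the slide's w·x - b.

```python
# Linear SVM on toy data: weights, support vectors, and margin width.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],     # class -1
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)     # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                      # weight vector of the separating hyperplane
b = clf.intercept_[0]
print("f(x) = sign(w.x + b), w =", w, " b =", b)
print("support vectors:", clf.support_vectors_)
print("margin width:", 2.0 / np.linalg.norm(w))
print("prediction for [3, 3]:", clf.predict([[3.0, 3.0]]))
```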

19. Concept Check
- For which of these could we use a basic linear SVM? A: classify the three kinds of iris in the UC Irvine data set? B: classify email into spam and non-spam? C: classify students into likely to pass or not?
- Which of these is the SVM margin? (The figure shows two candidate margins, labeled A and B.)

20. Messy Data
- This is all good so far. But suppose our data aren't that neat:

21. Soft Margins
- Intuitively, it still looks like we can make a decent separation here. We can't draw a clean margin, but we can almost do so if we allow some errors.
- We introduce a slack variable, which measures the degree of misclassification and adds a cost (C) for misclassified instances.
- There is a tradeoff between a wide margin and classification errors: a high cost will give relatively narrow margins; a low cost will give broader margins but misclassify more data.
- How much we want it to cost to misclassify instances depends on our domain -- what we are trying to do.
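
A sketch of the C tradeoff, assuming scikit-learn: fit the same overlapping, invented data with several values of C and compare margin width against training errors.

```python
# The soft-margin cost C: low C -> wider margin, more errors tolerated;
# high C -> narrower margin, fewer training errors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.2, (40, 2)),       # messy, overlapping classes
               rng.normal(2.5, 1.2, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    errors = (clf.predict(X) != y).sum()
    print(f"C={C:<6} margin width={margin:.2f}  training errors={errors}")
```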

22. Only Two Errors, Narrow Margin (figure)

23. Several Errors, Wider Margin (figure)

24. Finding the Margin
- Conceptually similar to the sum of squared errors for regression.
- First we find the maximum margin separator: minimize the error of the points that are closest to the separator line (Formula 18.14 in the text).
- The margin is then the band that touches the nearest points.

25. Evaluating SVMs
- Same as evaluating any other classifier: train on sample data, evaluate on test data (why?).
- Look at classification accuracy, the confusion matrix, and sensitivity and specificity.
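
A sketch of that evaluation loop, assuming scikit-learn and using synthetic data: hold out a test set, then report accuracy, the confusion matrix cells, and sensitivity/specificity computed from them.

```python
# Train on sample data, evaluate on held-out test data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy:   ", accuracy_score(y_test, y_pred))
print("sensitivity:", tp / (tp + fn))     # true positive rate
print("specificity:", tn / (tn + fp))     # true negative rate
```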

26. More on Evaluating SVMs
- Overfitting: a very close fit to the training data which takes advantage of irrelevant variations in instances. Performance on test data will be much lower; it may mean that your training sample isn't representative; in SVMs, it may mean that C is too high.
- Is the SVM actually useful? Compare it to the "majority" classifier.
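
A short sketch of that comparison, assuming scikit-learn: a baseline that always predicts the most frequent class, next to the SVM, on an imbalanced synthetic dataset.

```python
# Compare an SVM against the "majority" baseline.
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
svm = SVC(kernel="linear").fit(X_tr, y_tr)

print("majority-class accuracy:", baseline.score(X_te, y_te))
print("SVM accuracy:           ", svm.score(X_te, y_te))
```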

27. Non-Linearly-Separable Data
- Suppose we can't get a good linear separation of the data? As with regression, allowing non-linearity will give us better modeling of many data sets.
- In SVMs, we do this by using a kernel. A kernel is a function which maps our data into a higher-order feature space where we can find a separating hyperplane.
- Common kernels are polynomial and RBF (radial basis function).

28. Hard 1-Dimensional Dataset (from Andrew Moore's tutorials)
- The figure shows a 1-D dataset around x = 0 that no single threshold separates. What can be done about this?

29-30. Hard 1-Dimensional Dataset, continued (figure-only slides from Andrew Moore's tutorials).

31. SVM Kernel Functions (from Andrew Moore's tutorials)
- K(a, b) = (a · b + 1)^d is an example of an SVM kernel function.
- Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function.
- Most common is the radial-basis-style kernel function: K(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
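
A small sketch of the two kernel functions named above, written directly in NumPy; d is the polynomial degree and sigma the RBF width, and the vectors are arbitrary examples.

```python
# The polynomial and RBF kernel functions, computed directly.
import numpy as np

def polynomial_kernel(a, b, d=2):
    """K(a, b) = (a . b + 1)^d"""
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.5])
print("polynomial kernel:", polynomial_kernel(a, b, d=3))
print("RBF kernel:       ", rbf_kernel(a, b, sigma=1.0))
```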

32-33. (Figure-only slides from Andrew Moore's tutorials.)

34. Kernel Trick
- We don't actually have to compute the complete higher-order function. In Equation 18.14 we only use the dot product, so we replace it with a kernel function.
- This means we can work with much higher dimensions without getting hopeless performance.
- The kernel trick in SVMs refers to all of this: using a kernel function instead of the dot product to give us separation of non-linear data without an impossible performance cost.
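
A small numerical check of why the trick works, assuming 2-D inputs and the degree-2 polynomial kernel from slide 31: the kernel value equals an ordinary dot product in an explicit, larger feature space, but the kernel never has to build that space.

```python
# Degree-2 polynomial kernel vs. an explicit feature map: same number.
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly2_kernel(a, b):
    return (np.dot(a, b) + 1.0) ** 2

a = np.array([1.0, 3.0])
b = np.array([2.0, 0.5])
print("kernel on original space:   ", poly2_kernel(a, b))
print("dot product in mapped space:", np.dot(phi(a), phi(b)))  # identical value
```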

35. Why SVMs?
- Focusing on the instances nearest the margin pays more attention to where the differences are critical.
- SVMs can handle very large feature sets effectively.
- In practice they have been shown to work well in a variety of domains.

36. Summary
- SVMs are a form of supervised classifier.
- The basic SVM is binary and linear, but there are non-linear and multi-class extensions.
- "One key insight and one neat trick" [1]: the key insight is the maximum margin separator; the neat trick is the kernel trick.
- A good method to try first if you have no knowledge about the domain; applicable in a wide variety of domains.
- [1] Artificial Intelligence: A Modern Approach, third edition, Russell and Norvig, 2010, p. 744.

37. Learning by Analogy: Case-Based Reasoning
- Case-based systems are a significant chunk of AI in their own right.
- A case-based system has two major components: a case base and a problem solver.
- The case base contains a growing set of cases, analogous to either a KB or a training set.
- The problem solver has a case retriever and a case reasoner; it may also have a case installer.

38. Case-Based Retrieval
- Cases are described as a set of features.
- Retrieval uses methods such as:
  - Nearest neighbor: compare all features to all cases in the KB and choose the closest match.
  - Indexed: compute and store some indices with each case and retrieve matching indices.
  - Domain-based model clustering: the case base is organized into a domain model; insertion is harder, but retrieval is easier.
- Example: "documents like this one", where the features are the word frequencies in the document.
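
A minimal sketch of nearest-neighbor retrieval for the "documents like this one" example: cases are word-frequency vectors, and the retriever returns the stored case most similar to the query. The tiny vocabulary, case names, and counts are invented for illustration.

```python
# Nearest-neighbor case retrieval over word-frequency features.
import numpy as np

vocabulary = ["printer", "jam", "network", "login", "password"]
case_base = {
    "case-printer-jam": np.array([3, 4, 0, 0, 0], dtype=float),
    "case-vpn-down":    np.array([0, 0, 5, 1, 0], dtype=float),
    "case-locked-out":  np.array([0, 0, 1, 3, 4], dtype=float),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def retrieve(query_vector):
    # compare the query against every stored case, return the closest match
    return max(case_base, key=lambda name: cosine(case_base[name], query_vector))

new_case = np.array([0, 0, 2, 4, 3], dtype=float)   # mostly login/password words
print("most similar stored case:", retrieve(new_case))
```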

39. Case-Based Reasoning Example
- A frequency matrix for diagnosing system problems is a simple case-based example.
- The representation is a matrix of observed symptoms and causes; each case is an entry in a cell of the matrix.
- The critic is the actual outcome of the case; the learner adds an entry to the appropriate cells; the performer matches symptoms and chooses possible causes.

40. Example frequency matrix. Columns are causes (Battery dead, Out of gas, Alternator bad, Battery bad); rows are symptoms, with the cases recorded in that row: Car won't start: cases 1, 2, 3. Car stalls at stoplights: cases 4, 5. Car misfires in rainy weather: (no cases yet). Lights won't come on: cases 1, 2.
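
A sketch of the frequency-matrix idea in code: the learner drops each resolved case into a (symptom, cause) cell, and the performer ranks causes for a new symptom by how many past cases sit in that row. The specific cell assignments below are made up for illustration; the matrix above does not say which cause each case belongs to.

```python
# A toy frequency-matrix case base: learner fills cells, performer diagnoses.
from collections import defaultdict

matrix = defaultdict(list)            # (symptom, cause) -> list of case ids

def learn(case_id, symptom, cause):   # the critic supplies the actual outcome
    matrix[(symptom, cause)].append(case_id)

def diagnose(symptom):                # performer: rank causes by past frequency
    counts = {}
    for (s, cause), cases in matrix.items():
        if s == symptom:
            counts[cause] = counts.get(cause, 0) + len(cases)
    return sorted(counts.items(), key=lambda kv: -kv[1])

learn("case 1", "Car won't start", "Battery dead")       # hypothetical assignments
learn("case 2", "Car won't start", "Out of gas")
learn("case 3", "Car won't start", "Alternator bad")
learn("case 1", "Lights won't come on", "Battery dead")

print(diagnose("Car won't start"))    # possible causes, most frequent first
```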

41. Case-Based Reasoning
- The definition of relevant features is critical: we need the ones which influence outcomes, at the right level of granularity.
- The reasoner can be a complex planning and what-if reasoning system, or a simple query for missing data.
- It only really becomes a "learning" system if there is a case installer as well; then it can grow cumulatively.

42. Neural Nets, the Very Short Version
- A neural net consists of layers of nodes, or neurons, each of which has an activation level.
- Nodes of each layer receive inputs from previous layers; these are combined according to a set of weights.
- If the activation level is reached, the node "fires" and sends inputs to the next level.
- The initial layer is data from cases; the final layer is expected outcomes.
- Learning is accomplished by modifying the weights to reduce the prediction error.

43. Figure 18.19: a neuron. Figure 18.20: a simple network with an input layer, a hidden layer, and an output layer.

44. Neural Nets, continued
- The typical method of modifying the weights is back-propagation: success or failure at the output node is "propagated back" through the nodes which contributed to that output node. It is essentially a form of gradient descent.
- The number of hidden nodes and layers is a complicated decision: too many leads to overfitting. The typical approach is to try several and evaluate.
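
A sketch of the whole loop in miniature: a small 2-3-1 network of sigmoid units trained on XOR by back-propagation in plain NumPy. The layer size, learning rate, and iteration count are arbitrary choices for illustration, not anything prescribed by the slides.

```python
# Tiny feed-forward network trained by back-propagation (batch gradient descent).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # input layer: the cases
y = np.array([[0], [1], [1], [0]], dtype=float)              # expected outcomes (XOR)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden weights
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output weights
lr = 1.0

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                    # forward pass: hidden activations
    out = sigmoid(h @ W2 + b2)                  # forward pass: output activation
    err_out = (out - y) * out * (1 - out)       # error at the output node...
    err_hid = (err_out @ W2.T) * h * (1 - h)    # ...propagated back to hidden nodes
    W2 -= lr * h.T @ err_out
    b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hid
    b1 -= lr * err_hid.sum(axis=0)

print(out.round(2).ravel())    # typically approaches [0, 1, 1, 0]
```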

45. Reinforcement Learning
- If an agent has multiple sequential actions to perform, learning needs a different mode: each action affects the available future actions, feedback may not be available after every action, and the agent has a long-term goal to maximize.
- The agent learns a policy, which maps situations to actions.
- Issues include exploration vs. exploitation, credit assignment, and generalization.
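
A hedged sketch of policy learning on a toy problem not taken from the slides: a five-cell corridor where only the rightmost cell gives a reward. Q-learning with an epsilon-greedy action choice illustrates delayed feedback and the exploration-vs-exploitation tradeoff; the environment and constants are invented.

```python
# Minimal Q-learning on a 5-state corridor; reward only at the right end.
import random

N_STATES, ACTIONS = 5, (-1, +1)          # actions: move left or right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(500):                     # episodes
    s = 0
    while s != N_STATES - 1:
        if random.random() < epsilon:                        # explore
            a = random.choice(ACTIONS)
        else:                                                # exploit
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0           # delayed feedback
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)}
print("learned policy (state -> move):", policy)   # typically prefers +1 (move right)
```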

46. Supervised Learning, Summary
- Learning systems which, given examples and results, learn a model which will yield correct results for future examples.
- Typically used for classification; the most widely applied ML category.
- Depends on getting relevant features and representative examples.
- Evaluated against a separate test sample to check for overfitting.
- Usefulness should be checked against the "majority" classifier.

47. Unsupervised Learning
- Typically used to refer to clustering methods which don't require training cases: there is no prior definition of the goal; the typical aim is "put similar things together".
- Examples: grouping search results, grouping inputs to a customer response system, purchases from a web site, census data about types and costs of housing.
- Combinations of hand-modeled and automatic can work very well: Google News, for instance.
- Still requires a good feature set.

48. Clustering Basics
- Collect examples.
- Compute similarity among examples according to some metric.
- Group examples together such that examples within a cluster are similar and examples in different clusters are different.
- Summarize each cluster.
- Sometimes: assign new instances to the most similar cluster.

49. Clustering Example (figure). Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt

50. Measures of Similarity
- In order to do clustering we need some kind of measure of similarity. This is basically our "critic".
- Usually a vector of values; what they are depends on the domain: for documents, a bag of words or linguistic features; for purchases, cost, purchaser data, and item data; for census data, most of what is collected.
- A common measure is cosine similarity.

51. Cosine Similarity Measurement (based on home.iitk.ac.in/~mfelixor/Files/non-numeric-Clustering-seminar.pp)
- Cosine similarity is a measure of similarity between two vectors, obtained by measuring the cosine of the angle between them.
- The cosine is equal to 1 when the angle is 0, and less than 1 for any other angle.
- As the angle between the vectors decreases, the cosine approaches 1, meaning that the two vectors are getting closer and the similarity of whatever they represent increases.
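
A small NumPy sketch of the behavior described above: cosine similarity is 1 for vectors pointing the same way and shrinks as the angle grows (the example vectors are arbitrary).

```python
# Cosine similarity between two vectors.
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([2.0, 1.0, 0.0])
print(cosine_similarity(a, 3 * a))                       # 1.0: same direction
print(cosine_similarity(a, np.array([1.0, 2.0, 1.0])))   # between 0 and 1
print(cosine_similarity(a, np.array([-1.0, 2.0, 0.0])))  # 0.0: orthogonal vectors
```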

52. Clustering Algorithms
- Hierarchical: bottom-up or top-down.
- Flat: k-means.
- Probabilistic: Expectation Maximization (EM).

53. Hierarchical Agglomerative Clustering (HAC)
- Starts with each instance in a separate cluster and then repeatedly joins the two clusters that are most similar, until there is only one cluster.
- The history of merging forms a binary tree or hierarchy.
- Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt
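
A sketch assuming SciPy is available: linkage() records the history of merges (the binary tree / dendrogram), and fcluster() cuts that tree at a chosen number of clusters. The points are invented, two obvious groups.

```python
# Bottom-up hierarchical agglomerative clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],     # one tight group
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])    # another tight group

Z = linkage(X, method="average")               # repeated merges of the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the hierarchy into 2 clusters
print("cluster labels:", labels)               # e.g. [1 1 1 2 2 2]
```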

54. Dendrogram: Hierarchical Clustering (http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt)
- A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

55. Partitioning (Flat) Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters.
- Given: a set of documents and the number K. Find: a partition into K clusters that optimizes the chosen partitioning criterion.
- Globally optimal: exhaustively enumerate all partitions; usually too expensive. Effective heuristic method: the K-means algorithm.
- http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt

56. K-Means Clustering
- Typically provide the number of desired clusters, k.
- Randomly choose k instances as seeds and form initial clusters based on these seeds.
- Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
- Stop when the clustering converges or after a fixed number of iterations.
- Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt
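
A sketch of the algorithm above in plain NumPy: pick random seeds, assign each point to its nearest centroid, recompute the centroids, and repeat until the centroids stop moving or a fixed iteration limit is hit. The two-blob data are synthetic.

```python
# Minimal k-means, following the steps on the slide.
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # pick k seeds
    for _ in range(n_iter):
        # assign every instance to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
labels, centroids = k_means(X, k=2)
print("centroids:\n", centroids)
```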

57. K-Means Example (K = 2), figure: pick seeds; assign points to clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged! Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt

58. K-Means
- There is a tradeoff between having more clusters (better focus within each cluster) and having too many clusters: overfitting again.
- Results can vary based on random seed selection; some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
- The algorithm is sensitive to outliers: data points that are far from other data points, which could be errors in the data recording or special data points with very different values.
- http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt
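
A sketch of one common way to soften the seed-sensitivity problem, assuming scikit-learn: run k-means from several random initializations (the n_init parameter) and keep the run with the lowest inertia (within-cluster sum of squared distances).

```python
# Multiple restarts reduce the effect of a bad random seed.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

single = KMeans(n_clusters=2, n_init=1, random_state=3).fit(X)
multi = KMeans(n_clusters=2, n_init=10, random_state=3).fit(X)
print("inertia with 1 restart:  ", single.inertia_)
print("inertia with 10 restarts:", multi.inertia_)   # usually at least as good
```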

59. Strengths of K-Means
- Simple: easy to understand and to implement.
- Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are typically small, k-means is considered a linear algorithm.
- K-means is the most popular clustering algorithm. In practice it performs well, especially on text.
- www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt

60. REALLY Unsupervised Learning
- Turn the machine loose to learn on its own. This needs a representation (we still need some idea of what we are trying to learn!), good natural language processing, and a context.
- People don't learn very well unsupervised.
- There is currently some interesting research for instance-level knowledge. Structural or relational knowledge is much harder to acquire -- but we are getting there.

61. More Aspects of Machine Learning
- Machine learning varies by degree of human intervention:
  - Rote: a human builds the KB (Cyc).
  - Human assisted: a human adds knowledge directed by the machine (Animals, Teiresias).
  - Human scored: a human provides training cases (neural nets, ID3, CART).
  - Completely automated: nearest neighbor and other clustering methods.

62. More Aspects of Machine Learning
- Machine learning varies by degree of transparency:
  - Hand-built KBs are by definition clear to humans.
  - Human-aided trees like Animals are also generally clear and meaningful, and could easily be modified by humans.
  - Inferred rules like ID3's are generally understood by humans but may not be intuitively obvious; modifying them by hand may lead to worse results.
  - Systems like SVMs are typically black boxes: you can look at the models, but it's hard to interpret them in any human-meaningful way and essentially impossible to modify them by hand.

63. More Aspects of Machine Learning
- Machine learning varies by the goal of the process:
  - Extend a knowledge base.
  - Improve some kind of decision making, such as guessing an animal or classifying diseases.
  - Improve the overall performance of a program, such as game playing.
  - Organize large amounts of data.
  - Find patterns or "knowledge" not previously known, often in order to take some action.

64. The Web
- Machine learning is one of those fields where the web is changing everything! Three major factors:
  - One problematic aspect of machine learning research is finding enough data. This is NOT an issue on the web!
  - Another problematic aspect is getting a critic; the web offers a lot of opportunities.
  - A third is identifying good practical uses for machine learning; there are lots of online opportunities here.

65. Online Uses for Machine Learning
- Improved search: learn from click-throughs.
- Recommendations: learn from people's opinions and choices.
- Online games: AIs add to the background but can't be too static.
- Better targeting for ads: more learning from click-throughs.
- Customer response centers: clustering and improved retrieval of responses.

66. Summary
- Machine learning is valuable both because we want to understand how humans learn and because it improves computer systems.
- A system may learn representations or actions or both.
- There is a variety of methods, some knowledge-based and some statistical.
- It is currently a very active research area; the web is providing a lot of new opportunities.
- There is still a long way to go.

