
1 Machine Learning
Machine learning explores the study and construction of algorithms that can learn from data. Basic idea: instead of trying to create a very complex program to do X, use a (relatively) simple program that can learn to do X. Example: instead of trying to program a car to drive ( If light(red) && NOT(pedestrian) || speed(X) <= 12 && … ), create a program that watches humans drive, and learns how to drive*. *Currently, self-driving cars do a bit of both.

2 Why Machine Learning I
Why do machine learning instead of just writing an explicit program? It is often much cheaper, faster and more accurate. It may be possible to teach a computer something that we are not sure how to program. For example: we could explicitly write a program to tell if a person is obese: if (weight_kg / (height_m × height_m)) > 30, printf("Obese"). We would find it hard to write a program to tell if a person is sad. However, we could easily obtain 1,000 photographs of sad people / not-sad people, and ask a machine learning algorithm to learn to tell them apart.
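
The explicit-program side of the example above can be sketched in a few lines of Python (the function name is illustrative):

```python
def is_obese(weight_kg, height_m):
    # Explicit, hand-written rule: BMI = weight / height^2, and a BMI
    # above 30 is classed as obese. No learning is involved.
    bmi = weight_kg / (height_m * height_m)
    return bmi > 30
```

A rule this crisp is easy to program; "is this person sad?" has no comparable formula, which is exactly where learning from labeled photographs comes in.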

3 The Classification Problem (informal definition)
Given a collection of annotated data (in this case five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is. Katydid or Grasshopper?

4 The Classification Problem (informal definition)
Given a collection of annotated data (in this case three instances of Canadian coins and three of American), decide what type of coin the unlabeled example is. Canadian or American?

5 Blame Canada

6 For any domain of interest, we can measure features. For the insects: Thorax Length, Abdomen Length, Antennae Length, Mandible Size, Spiracle Diameter, Leg Length, Color {Green, Brown, Gray, Other}, Has Wings?

7 What features can we cheaply measure from coins?
1. Diameter
2. Thickness
3. Weight
4. Electrical Resistance
5. ?
Probably not color or other optical features.

8 The Ideal Case
In the best case, we would find a single feature that would strongly separate the coins. Diameter is clearly such a feature for the simpler case of pennies (nominally 19.05 mm) vs. quarters (nominally 24.26 mm), with a decision threshold between the two clusters.

9 Usage
Once we learn the threshold, we no longer need to keep the data. When an unknown coin comes in, we measure the feature of interest and see which side of the decision threshold it lands on.
IF diameter(unknown_coin) < 22
    coin_type = 'penny'
ELSE
    coin_type = 'quarter'
END
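
The slide's pseudocode translates directly into Python; a minimal sketch, with the threshold value taken from the slide:

```python
def classify_coin(diameter_mm, threshold=22.0):
    # The learned decision threshold is all we keep; the training data
    # itself can be discarded once the threshold is fixed.
    return 'penny' if diameter_mm < threshold else 'quarter'
```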

10 Let us revisit the original problem of classifying Canadian vs. American quarters. Which of our features (if any) are useful?
1. Diameter
2. Thickness
3. Weight
4. Electrical Resistance
I measured these features for 50 Canadian and 50 American quarters…

11 Diameter
Here I have 99% blue on the right side, but the left side is about 50/50 green/blue.

12 Thickness
Here I have all green on the left side, but the right side is about 50/50 green/blue.

13 Weight
The weight feature seems very promising. It is not perfect, but the left side is about 92% blue, and the right side about 92% green.

14 Electrical Resistance
The electrical resistance feature seems promising. Again, it is not perfect, but the left side is about 89% blue, and the right side about 89% green.

15 We can try all possible pairs of features: {Diameter, Thickness}, {Diameter, Weight}, {Diameter, Electrical Resistance}, {Thickness, Weight}, {Thickness, Electrical Resistance}, {Weight, Electrical Resistance}.
Shown here: {Diameter, Thickness}. This combination does not work very well.

16 Shown here: {Diameter, Weight}.

17 For brevity, some combinations are omitted. Let us jump to the last combination…

18 Shown here: {Weight, Electrical Resistance}.

19 We can also try all possible triples of features: {Diameter, Thickness, Weight}, {Diameter, Thickness, Electrical Resistance}, etc.
Shown here: {Diameter, Thickness, Weight}. This combination does not work that well.

20 (Figure: the lattice of all subsets of the four features, 1 Diameter, 2 Thickness, 3 Weight, 4 Electrical Resistance, from the single features through the pairs and triples up to the full set 1,2,3,4.)

21 Given a set of N features, there are 2^N − 1 non-empty feature subsets we can test. In this case, we can test all of them (exhaustive search), but in general this is not possible:
10 features = 1,023 subsets
20 features = 1,048,575 subsets
100 features = 1,267,650,600,228,229,401,496,703,205,375 subsets
We typically resort to greedy search.
Greedy Forward Selection
Initial state: the empty set (no features).
Operators: add a single feature.
Evaluation function: K-fold cross validation.
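
The greedy search just described can be sketched as follows; `evaluate` is a stand-in for whatever scores a feature subset (the slides use K-fold cross validation):

```python
def greedy_forward_selection(features, evaluate):
    # Start from the empty set; on each pass, try adding each unused
    # feature and keep the single addition that most improves the score.
    # Stop when no addition improves the evaluation estimate.
    selected = []
    best_score = evaluate(selected)
    while True:
        best_feature = None
        for f in features:
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score > best_score:
                best_score, best_feature = score, f
        if best_feature is None:
            return selected, best_score
        selected.append(best_feature)
```

Note this explores at most N + (N−1) + … subsets instead of all 2^N − 1, at the cost of possibly missing the best subset (as slide 42 illustrates).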

22 The Default Rate
How accurate can we be if we use no features? The answer is called the default rate: the size of the most common class over the size of the full dataset.
Examples: I want to predict the sex of some pregnant friends' babies. The most common class is 'boy', so I will always say 'boy'. I do just a tiny bit better than random guessing. I want to predict the sex of the nurse that will give me a flu shot next week. The most common class is 'female', so I will say 'female'.
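
The default rate is easy to compute directly from the labels:

```python
from collections import Counter

def default_rate(labels):
    # Accuracy of the zero-feature classifier that always predicts the
    # most common class: majority class size / dataset size.
    majority_size = Counter(labels).most_common(1)[0][1]
    return majority_size / len(labels)
```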

23 Greedy Forward Selection
Initial state: the empty set (no features).
Operators: add a feature.
Evaluation function: K-fold cross validation.
Path so far: {}

24 Greedy Forward Selection, continued. Path so far: {} → {3}

25 Greedy Forward Selection, continued. Path so far: {} → {3} → {3,4}

26 Greedy Forward Selection, continued. Path so far: {} → {3} → {3,4} → {1,3,4}

27 Feature Generation I
Sometimes, instead of (or in addition to) searching for features, we can make new features out of combinations of old features in some way. Recall this "pigeon problem", with features Left Bar and Right Bar… We could not get good results with a linear classifier. Suppose we created a new feature…

28 Feature Generation II
Suppose we created a new feature, called F_new:
F_new = |Right_Bar − Left_Bar|
Now the problem is trivial to solve with a linear classifier on F_new alone.
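
The generated feature is one line of code; the threshold and class names below are illustrative, showing how the problem becomes a 1D linear classification:

```python
def f_new(left_bar, right_bar):
    # Generated feature: absolute difference of the two original features.
    return abs(right_bar - left_bar)

def classify(left_bar, right_bar, threshold=3.0):
    # With F_new, a single threshold (the value here is a hypothetical
    # example) separates the two classes: a 1D linear classifier.
    return 'class A' if f_new(left_bar, right_bar) < threshold else 'class B'
```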

29 Feature Generation III
We actually do feature generation all the time. Consider the problem of classifying underweight, healthy, obese. It is a two-dimensional problem (height and weight) that we can approximately solve with linear classifiers. But we can generate a feature called BMI, the Body-Mass Index:
BMI = weight / height²
This converts the problem into a 1D problem, with thresholds at 18.5 and 24.9.

30 (Figure: Grasshoppers and Katydids plotted by Antenna Length vs. Abdomen Length.)

31 Simple Linear Classifier (R.A. Fisher, 1890–1962)
If a previously unseen instance is above the line, then the class is Katydid; else the class is Grasshopper.

32 We have now seen one classification algorithm, and we are about to see more. How should we compare them?
Predictive accuracy
Speed and scalability: time to construct the model; time to use the model; efficiency in disk-resident databases
Robustness: handling noise, missing values, irrelevant features, and streaming data
Interpretability: understanding and insight provided by the model

33 Predictive Accuracy I
How do we estimate the accuracy of our classifier? We can use K-fold cross validation. We divide the dataset into K equal-sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead. Here, K = 5.
Accuracy = Number of correct classifications / Number of instances in our database
Insect ID | Abdomen Length | Antennae Length | Insect Class
1 | 2.7 | 5.5 | Grasshopper
2 | 8.0 | 9.1 | Katydid
3 | 0.9 | 4.7 | Grasshopper
4 | 1.1 | 3.1 | Grasshopper
5 | 5.4 | 8.5 | Katydid
6 | 2.9 | 1.9 | Grasshopper
7 | 6.1 | 6.6 | Katydid
8 | 0.5 | 1.0 | Grasshopper
9 | 8.3 | 6.6 | Katydid
10 | 8.1 | 4.7 | Katydid
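
A sketch of the K-fold procedure; `train_and_test` is a stand-in for building the classifier on the training indices and returning the number of correct classifications on the held-out fold:

```python
def k_fold_indices(n, k):
    # Divide indices 0..n-1 into k roughly equal-sized sections.
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, train_and_test, k=5):
    # Each fold is left out of training once and used for testing;
    # overall accuracy = total correct classifications / total instances.
    correct = 0
    for fold in k_fold_indices(n, k):
        train = [i for i in range(n) if i not in fold]
        correct += train_and_test(train, fold)
    return correct / n
```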

34 Robustness I
We need to consider what happens when we have:
Noise. For example, a person's age could have been mistyped as 650 instead of 65; how does this affect our classifier? (This is important only for building the classifier; if the instance to be classified is noisy, we can do nothing.)
Missing values. For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis) and not the antennae length (Y-axis). Can we still classify the instance?

35 Robustness II
We need to consider what happens when we have:
Irrelevant features. For example, suppose we want to classify people as either Suitable_Grad_Student or Unsuitable_Grad_Student, and it happens that scoring more than 5 on a particular test is a perfect indicator for this problem… If we also use "hair_length" as a feature, how will this affect our classifier?

36 Nearest Neighbor Classifier (Evelyn Fix, 1904–1965; Joe Hodges, 1922–2000)
If the nearest instance to the previously unseen instance is a Katydid, then the class is Katydid; else the class is Grasshopper. (Instances plotted by Antenna Length vs. Abdomen Length.)
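
A minimal 1-NN sketch with Euclidean distance, using a few rows from the insect table as illustration:

```python
def nearest_neighbor_classify(train, labels, query):
    # The unseen instance takes the class of its single closest training
    # instance. Squared distance preserves the ordering, so no sqrt needed.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(range(len(train)), key=lambda i: dist2(train[i], query))
    return labels[nearest]
```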

37 We can visualize the nearest neighbor algorithm in terms of a decision surface… This division of space is called a Dirichlet tessellation (or Voronoi diagram, or Thiessen regions). Note that we don't actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions "belonging" to each instance.

38 The nearest neighbor algorithm is sensitive to outliers… The solution is to…

39 We can generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm. We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number. (Shown: K = 1 vs. K = 3.)
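
Extending the 1-NN idea to K voting neighbors:

```python
def knn_classify(train, labels, query, k=3):
    # Rank training instances by distance to the query and let the k
    # closest vote; an odd k avoids ties between two classes.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(range(len(train)), key=lambda i: dist2(train[i], query))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)
```

With k > 1, a single outlier among the neighbors is outvoted, which is exactly the mitigation slide 38 alludes to.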

40 The nearest neighbor algorithm is sensitive to irrelevant features…
Suppose the following is true: if an insect's antenna is longer than 5.5, it is a Katydid; otherwise it is a Grasshopper. Using just the antenna length, we get perfect classification on the training data!
Suppose however we add in an irrelevant feature, for example the insect's mass. Using both the antenna length and the insect's mass with the 1-NN algorithm, we get the wrong classification!

41 How do we mitigate the nearest neighbor algorithm's sensitivity to irrelevant features?
Use more training instances.
Ask an expert what features are relevant to the task.
Use statistical tests to try to determine which features are useful.
Search over feature subsets (in the next slide we will see why this is hard).

42 Why searching over feature subsets is hard
Suppose you have the following classification problem, with 100 features, where it happens that Features 1 and 2 (the X and Y below) give perfect classification, but all 98 of the other features are irrelevant… Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 − 1 possible subsets of the features, only one really works.

43 Search strategies over the feature-subset lattice: Forward Selection, Backward Elimination, Bi-directional Search.

44 The nearest neighbor algorithm is sensitive to the units of measurement.
With the X axis measured in centimeters and the Y axis in dollars, the nearest neighbor to the pink unknown instance is red. With the X axis measured in millimeters and the Y axis in dollars, the nearest neighbor to the pink unknown instance is blue.
One solution is to normalize the units to pure numbers. Typically the features are Z-normalized to have a mean of zero and a standard deviation of one:
X' = (X − mean(X)) / std(X)
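
Z-normalization of a single feature, using the population standard deviation:

```python
def z_normalize(values):
    # Rescale a feature to mean 0 and standard deviation 1, so that
    # distances no longer depend on the original units of measurement.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]
```

Multiplying a feature by 10 (e.g. centimeters to millimeters) leaves its Z-normalized values unchanged, which is why this removes the unit sensitivity.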

45 We can speed up the nearest neighbor algorithm by "throwing away" some data. This is called data editing. Note that this can sometimes improve accuracy! One possible approach: delete all instances that are surrounded by members of their own class. We can also speed up classification with indexing.

46 Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean distance; however, this need not be the case. Alternatives include Manhattan (p = 1), Max (p = ∞), Mahalanobis, and Weighted Euclidean distances.
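
The p-norm family covers three of these; a short sketch:

```python
def minkowski(a, b, p):
    # p = 1 gives Manhattan distance, p = 2 Euclidean; as p grows,
    # the value approaches the max (Chebyshev) distance.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    # The p = infinity limit: the largest per-coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))
```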

47 …In fact, we can use the nearest neighbor algorithm with any distance/similarity function. For example, is "Faloutsos" Greek or Irish? We could compare the name "Faloutsos" to a database of names using string edit distance:
edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6
ID | Name | Class
1 | Gunopulos | Greek
2 | Papadopoulos | Greek
3 | Kollios | Greek
4 | Dardanos | Greek
5 | Keogh | Irish
6 | Gough | Irish
7 | Greenhaugh | Irish
8 | Hadleigh | Irish
Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name. Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints, etc…

48 Edit Distance Example
How similar are the names "Peter" and "Piotr"? It is possible to transform any string Q into string C using only substitution, insertion and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.)
Assume the following cost function: Substitution 1 unit, Insertion 1 unit, Deletion 1 unit.
Peter → Piter (substitution: i for e) → Pioter (insertion: o) → Piotr (deletion: e)
So D(Peter, Piotr) = 3.
(Related variants of the name: Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter.)
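
The cheapest transformation the slide leaves open can be found with the standard dynamic-programming recurrence (Levenshtein distance):

```python
def edit_distance(q, c):
    # d[i][j] = cheapest cost of transforming the first i characters of q
    # into the first j characters of c, with unit-cost substitution,
    # insertion and deletion.
    m, n = len(q), len(c)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i characters
    for j in range(n + 1):
        d[0][j] = j          # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n]
```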

49 Dear SIR, I am Mr. John Coleman and my sister is Miss Rose Colemen, we are the children of late Chief Paul Colemen from Sierra Leone. I am writing you in absolute confidence primarily to seek your assistance to transfer our cash of twenty one Million Dollars ($21,000.000.00) now in the custody of a private Security trust firm in Europe the money is in trunk boxes deposited and declared as family valuables by my late father as a matter of fact the company does not know the content as money, although my father made them to under stand that the boxes belongs to his foreign partner. …

50 This mail is probably spam. The original message has been attached along with this report, so you can recognize or block similar unwanted mail in future. See http://spamassassin.org/tag/ for more details.
Content analysis details: (12.20 points, 5 required)
NIGERIAN_SUBJECT2 (1.4 points) Subject is indicative of a Nigerian spam
FROM_ENDS_IN_NUMS (0.7 points) From: ends in numbers
MIME_BOUND_MANY_HEX (2.9 points) Spam tool pattern in MIME boundary
URGENT_BIZ (2.7 points) BODY: Contains urgent matter
US_DOLLARS_3 (1.5 points) BODY: Nigerian scam key phrase ($NN,NNN,NNN.NN)
DEAR_SOMETHING (1.8 points) BODY: Contains 'Dear (something)'
BAYES_30 (1.6 points) BODY: Bayesian classifier says spam probability is 30 to 40% [score: 0.3728]

