
1 Classification Problem

  Class   F1       F2       F3       F4
  0       1.4371   0.4416   0.8416  -0.6591
  1      -0.0276  -0.8036   0.2391   0.7431
  1       0.9239   0.2876  -0.7893  -0.1294
  1      -0.3213   0.4670   1.0125  -0.7545
  1       0.6611   0.0046   0.6234   0.2311
  0       1.9153   0.3261   0.8232   0.9796

  The objects to be classified are flowers. The two classes are:
  1) Was pollinated by a honey-bee
  2) Was not pollinated by a honey-bee
  The biologist measured four features:
  F1: The longitude of the plant
  F2: The latitude of the plant
  F3: The height of the flower from the ground
  F4: The diameter of the flower

2 This plot is F1. [Plot and repeated data table omitted.]

3 This plot is F1 vs F3. [Plot and repeated data table omitted.]

4 This plot is F3 vs F4. [Plot and repeated data table omitted.]

5 This plot is F1 vs F2. [Plot and repeated data table omitted.]

6  F1       F2
   1.4371   0.4416
  -0.0276  -0.8036
   0.9239   0.2876
  -0.3213   0.4670
   0.6611   0.0046
   1.9153   0.3261

7 [Scatter plot of F1 vs F2] Which algorithms would work well on this dataset?

8 [Plot of the F1/F2 data]

9 [Plot of the F1/F2 data]

10 [Plot of the F1/F2 data, with a split at F2 = -1.1] Decision tree so far: F2 < -1.1? Y: Blue class (1); N: ?

11 [Plot of the F1/F2 data, with splits at F2 = -1.1 and F1 = 0.98] Decision tree: F2 < -1.1? Y: Blue class (1); N: test F1 < 0.98

12 [Plot of the F1/F2 data]

13 [Plot of the F1/F2 data]

14 [Plot of the F1/F2 data]

15 (Same data table as slide 1.)

  Problem: You are given a problem with four features, but you are not told which features (if any) are useful for classification. How can you figure out which are useful?

  You could try plotting all pairs and inspecting them visually.
  For four features that is only 6 pairs; for forty features it is 780 pairs.
  However, it might not be a pair of features that is best; it could be a subset of one, two, three, ... features.
  For forty features there are 2^40 - 1 (over a trillion) possible non-empty subsets.
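  A quick check of those counts in Python (standard library only):

from math import comb

print(comb(4, 2))    # pairs from 4 features  -> 6
print(comb(40, 2))   # pairs from 40 features -> 780
print(2**40 - 1)     # non-empty subsets of 40 features -> 1,099,511,627,775 (over a trillion)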

16 [Feature-subset lattice: {1} {2} {3} {4}; {3,4} {2,4} {1,4} {2,3} {1,3} {1,2}; {2,3,4} {1,3,4} {1,2,4} {1,2,3}; {1,2,3,4}] Accuracy labeled on the lattice: 50%

17 [Same feature-subset lattice as slide 16] Accuracies labeled on the lattice: 62%, 61%, 51%, 52%

18 [Same feature-subset lattice as slide 16] Accuracies labeled on the lattice: 60%, 58%, 97%

19 [Same feature-subset lattice as slide 16] Accuracies labeled on the lattice: 87%, 88%

20 [Same feature-subset lattice as slide 16] Accuracy labeled on the lattice: 72%

21 [Same feature-subset lattice as slide 16, plus a plot of cross-validation accuracy (0 to 100) against the number of features used (0 to 4); same data table as slide 1] The selected features: F1 and F2.

22 (Each row is a class label followed by ten feature values; eight rows shown.)

  1  0.5377 -1.3499  0.6715  0.8884 -0.1022 -0.8637 -1.0891 -0.6156  1.4193 -1.1480
  0  1.8339  3.0349 -1.2075 -1.1471 -0.2414  0.0774  0.0326  0.7481  0.2916  0.1049
  1 -2.2588  0.7254  0.7172 -1.0689  0.3192 -1.2141  0.5525 -0.1924  0.1978  0.7223
  1  0.8622 -0.0631  1.6302 -0.8095  0.3129 -1.1135  1.1006  0.8886  1.5877  2.5855
  0  0.3188  0.7147  0.4889 -2.9443 -0.8649 -0.0068  1.5442 -0.7648 -0.8045 -0.6669
  0 -1.3077 -0.2050  1.0347  1.4384 -0.0301  1.5326  0.0859 -1.4023  0.6966  0.1873
  1 -0.4336 -0.1241  0.7269  0.3252 -0.1649 -0.7697 -1.4916 -1.4224  0.8351 -0.0825
  0  0.3426  1.4897 -0.3034 -0.7549  0.6277  0.3714 -0.7423  0.4882 -0.2437 -1.9330

23 How I would solve this problem: ignore the search part for now, and just get the cross-validation working. Write a function that takes the data plus a binary vector indicating which features to use (e.g. [ 1 1 0 0 ]) and returns the one-nearest-neighbor cross-validation accuracy. [Same feature-subset lattice as slide 16.]
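  A minimal sketch of that cross-validation piece in Python with NumPy. The function name and the assumption that column 0 holds the class label (as in the slide-1 table) are mine, for illustration only:

import numpy as np

def loocv_1nn_accuracy(data, mask):
    """Leave-one-out cross-validation accuracy of a 1-nearest-neighbor
    classifier, using only the features where mask == 1.
    data: (n, 1 + d) array, column 0 is the class label (assumed layout)
    mask: length-d binary vector indicating which features to use"""
    labels = data[:, 0]
    feats = data[:, 1:][:, np.asarray(mask, dtype=bool)]
    if feats.shape[1] == 0:              # no features selected
        return 0.0
    correct = 0
    n = len(data)
    for i in range(n):                   # leave out object i
        d = np.sum((feats - feats[i]) ** 2, axis=1)   # squared Euclidean distances
        d[i] = np.inf                    # never match an object to itself
        correct += labels[np.argmin(d)] == labels[i]
    return correct / n

# Example with the six rows from slide 1, using only F1 and F2 ([1 1 0 0]):
toy = np.array([
    [0,  1.4371,  0.4416,  0.8416, -0.6591],
    [1, -0.0276, -0.8036,  0.2391,  0.7431],
    [1,  0.9239,  0.2876, -0.7893, -0.1294],
    [1, -0.3213,  0.4670,  1.0125, -0.7545],
    [1,  0.6611,  0.0046,  0.6234,  0.2311],
    [0,  1.9153,  0.3261,  0.8232,  0.9796],
])
print(loocv_1nn_accuracy(toy, [1, 1, 0, 0]))

  The search part can then be layered on top: loop over candidate feature masks and keep the one with the highest returned accuracy.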

24 Extra Credit Possibility https://archive.ics.uci.edu/ml/datasets.html Also do search on a UCI dataset, and explain why the result is intuitive. Don’t do this, unless the rest of your project is perfect!

25 [Images: School Employees, Simpson's Family] What is a natural grouping among these objects?

26 We can look at the dendrogram to determine the “correct” number of clusters. In this case, the two highly separated subtrees are highly suggestive of two clusters. (Things are rarely this clear cut, unfortunately)

27 Outlier: One potential use of a dendrogram is to detect outliers. The single isolated branch is suggestive of a data point that is very different from all others.
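  As an aside, a minimal sketch of how such a dendrogram can be produced in Python with SciPy; the synthetic data and the choice of average linkage are assumptions for illustration only:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs plus one isolated point (a candidate outlier)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(5, 0.3, (10, 2)),
               [[10.0, 10.0]]])

Z = linkage(X, method="average")     # agglomerative hierarchical clustering
dendrogram(Z)                        # the isolated point joins last, on its own branch
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)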

28 Do Trees Make Sense for non-Biological Objects? The answer is "Yes". There are increasing theoretical and empirical results to suggest that phylogenetic methods work for cultural artifacts.

  [Dendrogram of primates: Chimpanzee, Pygmy Chimp, Human, Gorilla, Orangutan, Sumatran Orangutan, Gibbon. Dendrograms of language families featuring Hellenic, Armenian, and Persian.]

  "Armenian borrowed so many words from Iranian languages that it was at first considered a branch of the Indo-Iranian languages, and was not recognized as an independent branch of the Indo-European languages for many decades"

  Greenhill, Currie & Gray. Does horizontal transmission invalidate cultural phylogenies?
  Collard, Shennan, & Tehrani. Branching, blending, and the evolution of cultural similarities and differences among human populations. "...results show that trees constructed with Bayesian phylogenetic methods are robust to realistic levels of borrowing"

29 Why would we want to use trees for non-biological things? Because trees are powerful in biology:
  – They make predictions: the Pacific Yew produces taxol, which treats some cancers, but it is expensive. Its nearest relative, the European Yew, was also found to produce taxol.
  – They tell us the order of events: which came first, classic geometric spider webs or messy cobwebs?
  – They tell us about "homelands" (where did it come from?), "dates" (when did it happen?), rates of change, and ancestral states.

30 [Dendrogram of beers by Markus Pudenz: http://blog.ranker.com/hierarchical-clustering-ranker-list-beers/#.VWSsZPn2BhF — clusters of mostly Belgian beers, Irish beers, and mostly Californian beers] The beers were clustered based on crowd-sourced, subjective user rankings.

31 [Dendrogram of names: Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter, Peder, Peka, Peadar]
  Pedro (Portuguese/Spanish), Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian alternative), Petr (Czech), Pyotr (Russian)
  Where is Pet'Ka from? Where is Bedros from?

32 How do we know the dates? If we can get dates, even upper/lower bounds, for some events, we can interpolate to the rest of the tree. "Irish/Welsh split: must be before 300 AD. Archaic Irish inscriptions date back to the 5th century AD – divergence must have occurred well before this time." Gray, R.D. and Atkinson, Q.D., Language tree divergence times support the Anatolian theory of Indo-European origin.

33 "Family Tree of Languages Has Roots in Anatolia, Biologists Say", by Nicholas Wade, New York Times, August 23, 2012. Biologists using tools developed for drawing evolutionary family trees say that they have solved a longstanding problem in archaeology: the origin of the Indo-European family of languages. The family includes English and most other European languages, as well as Persian, Hindi and many others. Despite the importance of the languages, specialists have long disagreed about their origin.

34 Partitional Clustering: nonhierarchical; each instance is placed in exactly one of K nonoverlapping clusters. Since only one set of clusters is output, the user normally has to input the desired number of clusters, K.

35 Squared Error / Objective Function [Plot omitted; axes labeled 1 to 10]
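  The formula itself is not captured in the transcript; the standard squared-error objective that k-means minimizes can be written as

  SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

  where C_i is the set of points assigned to cluster i and \mu_i is that cluster's center (mean).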

36 Algorithm k-means
  1. Decide on a value for k.
  2. Initialize the k cluster centers (randomly, if necessary).
  3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
  4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
  5. If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.
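  A compact sketch of these five steps in Python with NumPy; random initialization from the data points and Euclidean distance are the only choices made beyond what the slide states:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's k-means, following the five steps on the slide."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 2
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # step 3: assign each object to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = np.argmin(dists, axis=1)
        if np.array_equal(new_labels, labels):   # step 5: no memberships changed
            break
        labels = new_labels
        # step 4: re-estimate centers as the mean of their members
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

  For real use, scikit-learn's KMeans does the same thing with a better (k-means++) initialization.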

37 K-means Clustering: Step 1. Algorithm: k-means; Distance Metric: Euclidean Distance. [Plot with cluster centers k1, k2, k3]

38 K-means Clustering: Step 2. Algorithm: k-means; Distance Metric: Euclidean Distance. [Plot with cluster centers k1, k2, k3]

39 K-means Clustering: Step 3. Algorithm: k-means; Distance Metric: Euclidean Distance. [Plot with cluster centers k1, k2, k3]

40 K-means Clustering: Step 4. Algorithm: k-means; Distance Metric: Euclidean Distance. [Plot with cluster centers k1, k2, k3]

41 K-means Clustering: Step 5. Algorithm: k-means; Distance Metric: Euclidean Distance. [Plot with cluster centers k1, k2, k3]

42 Comments on the K-Means Method
  Strength
  – Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
  Comment
  – Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
  Weakness
  – Applicable only when the mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance.
  – Unable to handle noisy data and outliers.
  – Not suitable for discovering clusters with non-convex shapes.

43 The K-Medoids Clustering Method
  Find representative objects, called medoids, in clusters.
  PAM (Partitioning Around Medoids, 1987)
  – starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  – PAM works effectively for small data sets, but does not scale well for large data sets

44 EM Algorithm
  Initialize K cluster centers.
  Iterate between two steps:
  – Expectation step: assign points to clusters
  – Maximization step: estimate model parameters
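  The slides do not fix a particular model; as one concrete (assumed) instantiation, here is a minimal EM sketch for a one-dimensional Gaussian mixture, where the E-step softly assigns points to clusters and the M-step re-estimates the parameters:

import numpy as np

def em_gmm_1d(x, k, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture with k components."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)   # initialize K cluster centers
    variances = np.full(k, np.var(x))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = (weights / np.sqrt(2 * np.pi * variances) *
                np.exp(-(x[:, None] - means) ** 2 / (2 * variances)))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the soft assignments
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances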

45

46

47

48 Iteration 1 The cluster means are randomly assigned

49 Iteration 2

50 Iteration 5

51 Iteration 25

52 Nearest Neighbor Clustering (not to be confused with Nearest Neighbor Classification)
  Items are iteratively merged into the existing clusters that are closest.
  Incremental: a threshold, t, is used to determine if items are added to existing clusters or a new cluster is created.
  What happens if the data is streaming…

53 [Plot: clusters 1 and 2, with the threshold t illustrated]

54 New data point arrives… It is within the threshold for cluster 1, so add it to the cluster, and update the cluster center. [Plot: clusters 1 and 2 and the new point, 3]

55 New data point arrives… It is not within the threshold for cluster 1, so create a new cluster, and so on. [Plot: clusters 1 and 2 and new points 3 and 4] The algorithm is highly order-dependent… It is difficult to determine t in advance…
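  A small sketch of this threshold-based incremental clustering (sometimes called the "leader" algorithm); the running-mean update of the cluster center is an assumption chosen for illustration, since the slides only say "update cluster center":

import numpy as np

def threshold_clustering(points, t):
    """Incremental nearest-neighbor clustering: each arriving point joins the
    nearest existing cluster if it is within threshold t, otherwise it starts
    a new cluster. The result depends on the order in which points arrive."""
    centers, counts, labels = [], [], []
    for p in points:
        p = np.asarray(p, dtype=float)
        if centers:
            dists = [np.linalg.norm(p - c) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] <= t:
                # within threshold: add to cluster j and update its center (running mean)
                counts[j] += 1
                centers[j] += (p - centers[j]) / counts[j]
                labels.append(j)
                continue
        # otherwise: create a new cluster centered at this point
        centers.append(p.copy())
        counts.append(1)
        labels.append(len(centers) - 1)
    return labels, centers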

56 How can we tell the right number of clusters? In general, this is an unsolved problem. However, there are many approximate methods. In the next few slides we will see an example. For our example, we will use the familiar katydid/grasshopper dataset. However, in this case we are imagining that we do NOT know the class labels. We are only clustering on the X and Y axis values.

57 When k = 1, the objective function is 873.0. [Plot omitted]

58 When k = 2, the objective function is 173.1. [Plot omitted]

59 When k = 3, the objective function is 133.6. [Plot omitted]

60 [Plot: objective function value (0 to 1000) against k (1 to 6)] We can plot the objective function values for k equals 1 to 6… The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as "knee finding" or "elbow finding". Note that the results are not always as clear-cut as in this toy example.
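  A sketch of that elbow-finding procedure; using scikit-learn's KMeans and its inertia_ attribute (the within-cluster sum of squared errors) is an assumption here, and any k-means implementation would do:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=6):
    """Run k-means for k = 1..k_max and plot the objective (SSE) for each k;
    the abrupt change ('elbow') in the curve suggests the number of clusters."""
    sse = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse.append(km.inertia_)        # within-cluster sum of squared errors
    plt.plot(range(1, k_max + 1), sse, marker="o")
    plt.xlabel("k")
    plt.ylabel("objective function (SSE)")
    plt.show()
    return sse

  Calling elbow_curve(X) on the 2-D katydid/grasshopper points would reproduce a curve like the one on this slide.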

