Classification Problem

Class    F1        F2        F3        F4
0         1.4371    0.4416    0.8416   -0.6591
1        -0.0276   -0.8036    0.2391    0.7431
1         0.9239    0.2876   -0.7893   -0.1294
1        -0.3213    0.4670    …         …

Presentation transcript:

Classification Problem [the Class/F1–F4 data table from the title slide is repeated here]. The objects to be classified are flowers. The two classes are: 1) was pollinated by a honey-bee, 2) was not pollinated by a honey-bee. The biologist measured four features: F1: the longitude of the plant; F2: the latitude of the plant; F3: the height of the flower from the ground; F4: the diameter of the flower.

This plot is F… [scatter plot; the Class/F1–F4 data table is shown alongside]

This plot is F1 vs F… [scatter plot; data table shown alongside]

This plot is F3 vs F4. [Scatter plot; data table shown alongside]

This plot is F1 vs F2. [Scatter plot; data table shown alongside]

[Scatter plot with axes F1 and F…]

Which algorithms would work well on this dataset?

[Decision tree, first split] F2 < -1.1?  Y → Blue class (1).  N → ?

[Decision tree, second split] F2 < -1.1?  Y → Blue class (1).  N → F1 < 0.98?

[Data table of Class, F1–F4] Problem: You are given a problem with four features, but you are not told which features (if any) are useful for classification. How can you figure out which are useful? You could try plotting all pairs and inspecting them visually. For four features that is only 6 pairwise combinations; for forty features it is 780. However, it might not be a pair of features that is best: it could be a subset of one, two, three, … features. For forty features there are 2^40 - 1 subsets (over a trillion) to consider.
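A quick sanity check of that combinatorics, as a throwaway Python snippet (my own illustration, not part of the slides):

```python
# Counting candidate feature subsets: pairs grow quadratically,
# but arbitrary subsets grow exponentially.
from itertools import combinations

def count_nonempty_subsets(n_features):
    # Every feature is either in or out of a subset: 2^n possibilities, minus the empty set.
    return 2 ** n_features - 1

print(len(list(combinations(range(4), 2))))    # 6 pairs for 4 features
print(len(list(combinations(range(40), 2))))   # 780 pairs for 40 features
print(count_nonempty_subsets(40))              # 1,099,511,627,775 -- over a trillion subsets
```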

[A sequence of slides stepping through the lattice of feature subsets {1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, {1,2,3,4}. Cross-validation accuracies are filled in for the candidate subsets evaluated at each step: 50% to start, then 62%, 61%, 51%, 52%, then 60%, 58%, 97%, then 87%, 88%, and finally 72%. The search settles on features F1 and F2.]

[Feature-subset lattice diagram] How I would solve this problem… Build a function that takes the data and a binary vector indicating which features to use, and returns the cross-validation accuracy of a one-nearest-neighbor classifier. Ignore the search part for now; just get the cross-validation working.
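A minimal sketch of that evaluation function (the function and variable names are my own, not from the slides; it uses leave-one-out cross-validation with a one-nearest-neighbor classifier, as described above):

```python
import numpy as np

def one_nn_cv_accuracy(X, y, feature_mask):
    """X: (n, d) feature matrix, y: (n,) class labels, feature_mask: length-d 0/1 vector."""
    Xs = X[:, np.asarray(feature_mask, dtype=bool)]   # keep only the selected features
    correct = 0
    for i in range(len(y)):
        dists = np.linalg.norm(Xs - Xs[i], axis=1)    # distance to every point
        dists[i] = np.inf                             # leave the query point itself out
        correct += (y[np.argmin(dists)] == y[i])      # did the nearest neighbor classify correctly?
    return correct / len(y)
```

The search over feature subsets can then simply call this function with different binary masks and keep the mask with the highest accuracy.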

Extra Credit Possibility: also do the search on a UCI dataset, and explain why the result is intuitive. Don't do this unless the rest of your project is perfect!

What is a natural grouping among these objects? [Image: Simpsons characters, with candidate groupings such as "School Employees" and "Simpson's Family"]

We can look at the dendrogram to determine the “correct” number of clusters. In this case, the two highly separated subtrees are highly suggestive of two clusters. (Things are rarely this clear cut, unfortunately)
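For concreteness, here is one way to build and inspect such a dendrogram; SciPy's hierarchical-clustering tools are my choice here, not something named in the slides:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(20, 2)            # toy data; substitute your own feature matrix
Z = linkage(X, method="average")     # agglomerative clustering with average linkage
dendrogram(Z)
plt.show()                           # look for highly separated subtrees (and isolated branches)
```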

Outlier detection: one potential use of a dendrogram is to detect outliers. A single isolated branch is suggestive of a data point that is very different from all the others.

Do trees make sense for non-biological objects? [Dendrogram of primates: Chimpanzee, Pygmy Chimp, Human, Gorilla, Orangutan, Sumatran Orangutan, Gibbon. Dendrogram of languages: Hellenic, Armenian, Persian.] "Armenian borrowed so many words from Iranian languages that it was at first considered a branch of the Indo-Iranian languages, and was not recognized as an independent branch of the Indo-European languages for many decades." The answer is "Yes". There are increasing theoretical and empirical results to suggest that phylogenetic methods work for cultural artifacts. "Does horizontal transmission invalidate cultural phylogenies?" Greenhill, Currie & Gray. "Branching, blending, and the evolution of cultural similarities and differences among human populations." Collard, Shennan & Tehrani: "...results show that trees constructed with Bayesian phylogenetic methods are robust to realistic levels of borrowing."

Why would we want to use trees for non-biological things? Because trees are powerful in biology. They make predictions – the Pacific Yew produces taxol, which treats some cancers but is expensive; its nearest relative, the European Yew, was also found to produce taxol. They tell us the order of events – which came first, classic geometric spider webs or messy cobwebs? They tell us about "homelands" (where did it come from?), dates (when did it happen?), rates of change, and ancestral states.

[Dendrogram by Markus Pudenz: beers clustered based on crowd-sourced subjective user rankings, with groups of mostly Belgian beers, Irish beers, and mostly Californian beers]

[Dendrogram of variants of the name Peter] Pedro (Portuguese/Spanish), Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian alternative), Petr (Czech), Pyotr (Russian). Where is Pet'Ka from? Where is Bedros from?

How do we know the dates? If we can get dates, even upper/lower bounds, for some events, we can interpolate to the rest of the tree. "Irish/Welsh split: must be before 300 AD. Archaic Irish inscriptions date back to the 5th century AD – divergence must have occurred well before this time." Gray, R.D. and Atkinson, Q.D., "Language-tree divergence times support the Anatolian theory of Indo-European origin."

"Family Tree of Languages Has Roots in Anatolia, Biologists Say", by Nicholas Wade, New York Times, August 23, 2012. "Biologists using tools developed for drawing evolutionary family trees say that they have solved a longstanding problem in archaeology: the origin of the Indo-European family of languages. The family includes English and most other European languages, as well as Persian, Hindi and many others. Despite the importance of the languages, specialists have long disagreed about their origin."

Partitional Clustering Nonhierarchical, each instance is placed in exactly one of K nonoverlapping clusters. Since only one set of clusters is output, the user normally has to input the desired number of clusters K.

Squared Error Objective Function
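The formula itself did not survive this transcript; the standard squared-error objective that the slide title refers to is usually written as follows (my reconstruction):

```latex
\mathrm{SSE}(C_1,\dots,C_k) \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

k-means tries to find the partition C_1, …, C_k that minimizes this sum of squared distances from each point to the center of its cluster.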

Algorithm k-means:
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to step 3.
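A minimal NumPy sketch of these five steps (illustrative only, not the lecture's own code; it assumes Euclidean distance and that no cluster ever becomes empty):

```python
import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 2: initialize centers
    labels = np.full(len(X), -1)
    while True:
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                        # step 3: assign to nearest center
        centers = np.array([X[new_labels == j].mean(axis=0)      # step 4: re-estimate centers
                            for j in range(k)])
        if np.array_equal(new_labels, labels):                   # step 5: no membership changed
            return labels, centers
        labels = new_labels
```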

K-means Clustering: Steps 1–5. Algorithm: k-means; distance metric: Euclidean distance. [A sequence of five scatter plots showing the cluster centers k1, k2, k3 and the point assignments being updated over successive iterations]

Comments on the K-Means Method. Strength: relatively efficient – O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. Note that it often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms. Weaknesses: applicable only when a mean is defined (so what about categorical data?); the number of clusters, k, must be specified in advance; unable to handle noisy data and outliers; not suitable for discovering clusters with non-convex shapes.

The K-Medoids Clustering Method: find representative objects, called medoids, in clusters. PAM (Partitioning Around Medoids, 1987) starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering. PAM works effectively for small data sets, but does not scale well to large data sets.
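A rough sketch of the PAM swap idea (my own illustration, not production code): repeatedly try replacing a medoid with a non-medoid and keep the swap if it lowers the total distance. The nested loops are exactly why PAM does not scale to large data sets.

```python
import numpy as np

def total_distance(X, medoid_idx):
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()          # each point is charged to its nearest medoid

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # initial medoids
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in range(len(X)):
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]  # swap medoid i for point h
                if total_distance(X, candidate) < total_distance(X, medoids):
                    medoids, improved = candidate, True
    return medoids
```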

EM Algorithm: Initialize K cluster centers, then iterate between two steps – Expectation step: assign points to clusters; Maximization step: estimate model parameters.
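A hedged sketch using scikit-learn's GaussianMixture, which runs exactly this E-step / M-step loop; the slides do not prescribe a particular implementation, so this is just one convenient choice:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 2)                    # toy data; replace with your own
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(X)
responsibilities = gmm.predict_proba(X)       # E-step output: soft cluster assignments
print(gmm.means_)                             # M-step output: estimated cluster centers
```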

Iteration 1 The cluster means are randomly assigned

Iteration 2

Iteration 5

Iteration 25

Nearest Neighbor Clustering (not to be confused with nearest neighbor classification): items are iteratively merged into the existing cluster that is closest. The method is incremental: a threshold, t, is used to determine whether an item is added to an existing cluster or a new cluster is created. What happens if the data is streaming?

[Plot: two clusters, labeled 1 and 2, each drawn with a circle of radius t (the threshold) around its center]

[Plot] A new data point arrives… It is within the threshold for cluster 1, so add it to the cluster and update the cluster center.

[Plot] Another new data point arrives… It is not within the threshold for cluster 1, so create a new cluster, and so on. The algorithm is highly order dependent… It is difficult to determine t in advance…
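A minimal sketch of this threshold-based incremental scheme (sometimes called "leader" clustering); the names and the running-mean center update are my own choices, not from the slides:

```python
import numpy as np

def incremental_cluster(points, t):
    centers, counts, assignments = [], [], []
    for x in points:
        x = np.asarray(x, dtype=float)
        if centers:
            dists = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] <= t:                                # within threshold: join cluster j
                counts[j] += 1
                centers[j] += (x - centers[j]) / counts[j]   # update the center (running mean)
                assignments.append(j)
                continue
        centers.append(x.copy())                             # otherwise start a new cluster
        counts.append(1)
        assignments.append(len(centers) - 1)
    return assignments, centers
```

Processing the points in a different order can produce different clusters, which is the order dependence noted above.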

How can we tell the right number of clusters? In general, this is an unsolved problem. However, there are many approximate methods. In the next few slides we will see an example. For our example, we will use the familiar katydid/grasshopper dataset. However, in this case we are imagining that we do NOT know the class labels. We are only clustering on the X and Y axis values.

When k = 1, the objective function is 873.0

When k = 2, the objective function is 173.1

When k = 3, the objective function is 133.6

[Plot of the objective function against k, with axis labels "k" and "Objective Function"] We can plot the objective function values for k equal to 1 to 6… The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as "knee finding" or "elbow finding". Note that the results are not always as clear-cut as in this toy example.
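A hedged sketch of this "elbow finding" procedure: run k-means for k = 1 to 6 and plot the squared-error objective. scikit-learn is my choice here; any k-means implementation that reports the objective (inertia) would do.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)        # toy data; the slides use the katydid/grasshopper features
ks = list(range(1, 7))
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("k")
plt.ylabel("Objective function (SSE)")
plt.show()                        # look for the abrupt flattening (the "elbow") at the right k
```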