Apache Mahout. Mahout Introduction Machine Learning Clustering K-means Canopy Clustering Fuzzy K-Means Conclusion.


1 Apache Mahout

2 Mahout
– Introduction
– Machine Learning
– Clustering
– K-means
– Canopy Clustering
– Fuzzy K-Means
– Conclusion

3 What is Mahout?
– Distributed machine learning libraries
– “Scalable to reasonably large data sets”
– Runs on Hadoop

4 What?
Hadoop brings:
– Map/Reduce API
– HDFS
– In other words, scalability and fault tolerance
Mahout brings:
– Library of machine learning algorithms
– Examples

5 Why Mahout?
Many open source ML libraries either:
– Lack community
– Lack documentation and examples
– Lack scalability
– Lack the Apache License ;-)
– Or are research-oriented

6 Clustering
Unsupervised: find natural groupings of
– Documents
– Search results
– People
– Genetic traits in groups
– Many, many more uses

7 Types
– Supervised: using labeled training data, create a function that predicts the output for unseen inputs
– Unsupervised: using unlabeled data, find structure (such as natural groupings) in the inputs
– Semi-supervised: uses both labeled and unlabeled data

8 Example: Clustering Google News

9 K-means Algorithm
1) Pick a number (k) of cluster centers
2) Assign every element to its nearest cluster center
3) Move each cluster center to the mean of its assigned elements
4) Repeat 2-3 until convergence
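The four steps on slide 9 can be sketched in plain Python. This is a minimal single-machine illustration, not Mahout's Map/Reduce implementation; the function name `kmeans`, the optional `init_centers` argument, and the choice of Euclidean distance are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, init_centers=None, max_iters=100, seed=0):
    """Cluster a list of equal-length tuples into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick k cluster centers (here: given, or sampled from the data).
    if init_centers is not None:
        centers = list(init_centers)
    else:
        centers = rng.sample(points, k)

    assignments = None
    for _ in range(max_iters):
        # Step 2: assign every element to its nearest center (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Step 3: move each center to the mean of its assigned elements.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignments


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2, init_centers=[(0, 0), (10, 10)])
    print(centers, labels)
```

The result depends on the starting centers, which is one reason a cheap pre-clustering pass such as canopy clustering is sometimes used to choose them.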

10 K-means Example
Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses.

11 Invocation using the command line takes the form:

12 Canopy Clustering
Canopy Clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters.
– Define two thresholds: tight (T1) and loose (T2)
– Put all records into a set S
– While S is not empty:
  – Remove any record r from S and create a canopy centered at r
  – For each other record r_i, compute the cheap distance d from r to r_i
  – If d < T2, place r_i in r’s canopy
  – If d < T1, remove r_i from S
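The canopy loop on slide 12 can be sketched as follows. This is an illustrative single-machine version, not Mahout's code. The function name `canopy_clustering` is hypothetical, Euclidean distance stands in for the "cheap distance", and "each other record" is read as "each record still in S", which is an assumption. The thresholds are named by role (`t_tight`, `t_loose`) because the slide calls the tight threshold T1, while Mahout's documentation, as far as I recall, uses the opposite naming (T1 > T2).

```python
import math


def canopy_clustering(points, t_tight, t_loose, distance=math.dist):
    """Return a list of (center, members) canopies (illustrative sketch).

    t_tight < t_loose. A point inside the loose radius joins the canopy;
    a point inside the tight radius is also removed from further consideration.
    """
    remaining = list(points)  # the set S from the slide
    canopies = []
    while remaining:
        center = remaining.pop(0)  # remove any record r from S
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t_loose:
                members.append(p)  # within the loose threshold: join r's canopy
            if d >= t_tight:
                still_remaining.append(p)  # not tightly bound: stays in S
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    data = [(0.0,), (1.0,), (3.0,), (10.0,)]
    for center, members in canopy_clustering(data, t_tight=2.0, t_loose=5.0):
        print(center, members)
```

Because points between the two thresholds stay in S, a record can belong to more than one canopy. That overlap is intended: canopies are a coarse first pass, and their centers can serve as starting centers for K-means.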

13 Canopy Clustering
– Input: SequenceFile (WritableComparable, VectorWritable)
– Invocation using the command line takes the form:

14 Fuzzy K-Means
Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular simple clustering technique. Like K-Means, it works on objects that can be represented in an n-dimensional vector space and for which a distance measure is defined. Unlike K-Means, a point can belong to several clusters at once, each with its own degree of membership. The algorithm is similar to K-Means:
– Initialize k clusters
– Until converged:
  – Compute the probability that a point belongs to a cluster, for every (point, cluster) pair
  – Re-compute the cluster centers using these membership probabilities
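The loop on slide 14 can be sketched with the standard Fuzzy C-Means update rules. The slides do not give the formulas, so the fuzziness exponent `m` (with `m > 1`), the Euclidean distance, the convergence tolerance, and all function names below are assumptions made for illustration; this is not Mahout's implementation.

```python
import math


def memberships(points, centers, m=2.0):
    """Membership of each point in each cluster; every row sums to 1."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point sits exactly on one or more centers: split full
            # membership among those centers and give the others zero.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
            continue
        inv = [d ** -exponent for d in dists]  # closer centers get larger weight
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def update_centers(points, u, k, m=2.0):
    """Each center becomes the mean of all points, weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for j in range(k):
        weights = [row[j] ** m for row in u]
        total = sum(weights)
        centers.append(
            tuple(sum(w * p[d] for w, p in zip(weights, points)) / total
                  for d in range(dim))
        )
    return centers


def fuzzy_kmeans(points, init_centers, m=2.0, tol=1e-6, max_iters=100):
    centers = list(init_centers)
    for _ in range(max_iters):
        u = memberships(points, centers, m)
        new_centers = update_centers(points, u, len(centers), m)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # converged: centers have effectively stopped moving
            break
    return centers, memberships(points, centers, m)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, u = fuzzy_kmeans(data, init_centers=[(0, 0), (10, 10)])
    print(centers)
    print(u)
```

As `m` approaches 1 the memberships approach hard 0/1 assignments and the procedure behaves like ordinary K-means; larger values of `m` give softer, more overlapping clusters.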

15 Fuzzy K-Means Invocation using the command line takes the form:

16 Conclusion
– Mahout did not scale well
– Mahout was not easy to learn
– Mahout was not easily modifiable
For performance and efficiency, it is better to:
– Understand the data set
– Understand data mining
– Understand the methodology

17 Thank you!

