Guided by Ms. Shikha Pachouly, Assistant Professor, Computer Engineering Department, 2/29/2016
Machine Learning
Machine learning is programming computers to optimize a performance criterion using example data or past experience.
Machine Learning Strategies
1) Supervised
2) Unsupervised
Common Use Cases
- Recommend friends/dates/products
- Classify content into predefined groups
- Find similar content based on object properties
- Find associations/patterns in actions/behaviors
- Identify key topics in a large collection of text
- Detect anomalies in output
- Rank search results
Apache Mahout Introduction
- Machine learning library for scalable applications.
- Includes core algorithms for recommendation, clustering and classification, implemented on top of the Hadoop MapReduce model.
- The core libraries are also highly optimized, so non-distributed algorithms perform well too.
- Mahout is distributed under the commercially friendly Apache Software License.
- The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases.
- Currently Mahout mainly supports three use cases:
1) Recommendation mining
2) Clustering
3) Classification
Why Mahout
Many open source ML libraries (PyBrain, Shark, etc.) lack at least one of:
1) community
2) scalability
3) documentation and examples
Most Mahout implementations are MapReduce enabled.
The main goal of Apache Mahout is to be useful to practitioners.
- This means implementations should be easy to use from within Java applications.
- It should be close to trivial to deploy the trained models.
- Scaling to include more, and more diverse, data should be simple.
Recommendations
- Extensive framework for collaborative filtering
- Recommenders: 1) user based 2) item based
- Many different similarity measures, e.g. Cosine, LLR (log-likelihood ratio), Tanimoto, Pearson
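
To make the similarity measures named above concrete, here is a minimal plain-Python sketch of three of them (cosine, Tanimoto and Pearson). It is an illustration only, not Mahout's implementation: the function names and the dict-based rating format are assumptions made for this example, and LLR is left out for brevity.

```python
import math

# Each user is a dict mapping item -> rating (a hypothetical format for this sketch).

def cosine(a, b):
    """Cosine of the angle between two rating vectors; unrated items count as 0."""
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def tanimoto(a, b):
    """Overlap of the two item sets, ignoring the rating values."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0

def pearson(a, b):
    """Pearson correlation computed over the items both users rated."""
    common = a.keys() & b.keys()
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / n
    mean_b = sum(b[i] for i in common) / n
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

alice = {"A": 5, "B": 3, "C": 4}
bob = {"A": 4, "B": 2, "D": 5}
print(cosine(alice, bob), tanimoto(alice, bob), pearson(alice, bob))
```

Tanimoto only looks at which items were rated, so it suits data without preference values; cosine and Pearson use the ratings themselves.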
Algorithms for Recommendation
- User-Based Collaborative Filtering: single machine
- Item-Based Collaborative Filtering: single machine / MapReduce
- Matrix Factorization with Alternating Least Squares: single machine / MapReduce
- Matrix Factorization with Alternating Least Squares on Implicit Feedback: single machine / MapReduce
- Weighted Matrix Factorization, SVD++, Parallel SGD: single machine
User-Based Recommender
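
As a rough illustration of the user-based idea, the sketch below finds the k users most similar to a target user and scores unseen items by a similarity-weighted average of those neighbours' ratings. It is a plain-Python sketch under assumed names (`recommend`, the `ratings` dict) and uses cosine similarity; it does not reproduce Mahout's recommender classes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two item -> rating dicts."""
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(ratings, user, k=2, top_n=3):
    """Return up to top_n (item, estimated rating) pairs for `user`."""
    target = ratings[user]
    # Step 1: pick the k most similar other users (the neighbourhood).
    neighbours = sorted(
        ((cosine(target, other), name)
         for name, other in ratings.items() if name != user),
        reverse=True,
    )[:k]
    # Step 2: accumulate similarity-weighted ratings for items the user has not seen.
    totals, weights = {}, {}
    for sim, name in neighbours:
        if sim <= 0:
            continue
        for item, rating in ratings[name].items():
            if item in target:
                continue
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    # Step 3: rank candidate items by their estimated rating.
    scored = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in scored[:top_n]]

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "carol": {"A": 1, "B": 5, "D": 5},
    "dave": {"C": 2, "D": 1},
}
print(recommend(ratings, "alice"))
```

The neighbourhood size k trades coverage against noise: a larger k can score more items but pulls in less similar users.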
Clustering
Algorithms for Clustering
- K-Means Clustering
- Fuzzy K-Means
- Mean Shift Clustering
- Dirichlet Process Clustering (for topic modelling)
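
For readers new to k-means, the first algorithm in the list above, here is a minimal single-machine sketch. It is not Mahout's MapReduce implementation; the function names are illustrative, and seeding the centroids with the first k points is a simplifying assumption.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_distance(p, centroids[c]))
        clusters[nearest].append(p)
    return clusters

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, k, max_iterations=20):
    # Assumption for this sketch: seed with the first k points.
    centroids = list(points[:k])
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    return centroids, assign(points, centroids)

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Each pass alternates an assignment step and a centroid update, which is the same two-phase structure a distributed version would split into map and reduce work.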
Instead of coding against the clustering algorithms, we can run them on Hadoop infrastructure with commands, e.g.:
Canopy Clustering
    bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
k-Means Clustering
    bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Fuzzy k-Means Clustering
    bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
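
Canopy clustering, the first job listed above, is a cheap single-pass technique. The sketch below shows the general idea in plain Python with two assumed distance thresholds, `t1` (loose) and `t2` (tight, with t1 > t2). The function name and data are hypothetical, and this is not the code the Mahout job runs.

```python
import math

def canopy(points, t1, t2):
    """Group points into possibly overlapping canopies.

    Points within t1 of a center join its canopy; points within t2 are
    also removed from the pool, so they cannot become centers later.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_available = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)
            if d >= t2:
                still_available.append(p)
        remaining = still_available
        canopies.append((center, members))
    return canopies

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
for center, members in canopy(points, t1=3.0, t2=1.5):
    print(center, members)
```

Because a point can sit in more than one canopy, the result is a rough grouping rather than a final partition.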
Classification
Algorithms implemented in Mahout for classification:
- Logistic Regression, trained via SGD: single machine
- Naive Bayes / Complementary Naive Bayes: MapReduce
- Random Forest: MapReduce
- Hidden Markov Models: single machine
- Multilayer Perceptron: single machine
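
To show what "logistic regression trained via SGD" means in practice, here is a minimal plain-Python sketch that updates the weights after every training example. The function names, learning rate, epoch count and toy data are assumptions for illustration; Mahout's own trainer is considerably more elaborate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(samples, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias with one gradient step per training example."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = y - p  # gradient of the log-likelihood w.r.t. the logit
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the estimated probability of class 1 is at least 0.5."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0

# Hypothetical one-feature data: small values are class 0, large values class 1.
samples = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_sgd(samples, labels)
print(predict(weights, bias, [0.5]), predict(weights, bias, [4.0]))
```

Since each update touches a single example, training can stream over the data, which is why this style of training works well on one machine.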
Running Naïve Bayes from Command Line
Three commands:
1) mahout seq2sparse: performs the TF-IDF transformation
2) mahout trainnb: trains the Naïve Bayes model
3) mahout testnb: performs classification and testing
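
The three commands above map onto three conceptual steps: vectorize, train, test. The sketch below mimics them in plain Python on a toy corpus. It does not call Mahout. The function names, the smoothed IDF formula and the Laplace smoothing constant are assumptions chosen for illustration, and Mahout's exact weighting may differ.

```python
import math
from collections import Counter, defaultdict

def fit_idf(docs):
    """Step 1a: inverse document frequency per term (smoothed variant, an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}

def vectorize(doc, idf):
    """Step 1b: sparse TF-IDF vector; terms unseen during fitting are dropped."""
    return {t: c * idf[t] for t, c in Counter(doc).items() if t in idf}

def train_nb(vectors, labels, alpha=1.0):
    """Step 2: multinomial Naive Bayes over TF-IDF weights with Laplace smoothing."""
    vocab = {t for v in vectors for t in v}
    class_counts = Counter(labels)
    term_weight = defaultdict(Counter)
    for v, y in zip(vectors, labels):
        for t, w in v.items():
            term_weight[y][t] += w
    model = {}
    for y, count in class_counts.items():
        total = sum(term_weight[y].values()) + alpha * len(vocab)
        log_prior = math.log(count / len(labels))
        log_lik = {t: math.log((term_weight[y][t] + alpha) / total) for t in vocab}
        model[y] = (log_prior, log_lik)
    return model

def classify(model, vector):
    """Step 3: pick the class with the highest log posterior score."""
    def score(y):
        log_prior, log_lik = model[y]
        return log_prior + sum(w * log_lik[t] for t, w in vector.items() if t in log_lik)
    return max(model, key=score)

# Hypothetical toy corpus.
train_docs = ["cheap pills buy now", "buy cheap watches",
              "meeting agenda for monday", "project meeting notes"]
train_labels = ["spam", "spam", "ham", "ham"]
tokens = [d.split() for d in train_docs]
idf = fit_idf(tokens)
model = train_nb([vectorize(d, idf) for d in tokens], train_labels)
print(classify(model, vectorize("buy cheap pills".split(), idf)))
```

Keeping the IDF table from training and reusing it at test time mirrors how the vectorization step has to stay consistent between training and testing.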
Installation of Mahout
1) Download the tar files of both the apache-mahout and apache-maven projects
2) Extract the tar files into a directory
3) Set the PATH variable for Maven
4) Set the present working directory to Mahout's core folder
5) Compile the project with 'mvn compile'
6) Build the project with 'mvn install'
Mahout vs Weka
Base \ Technologies    Mahout    WEKA
Scalability            More      Less
Algorithms             Less      More
GUI                    No        Yes
License                Apache    GPL
MAHOUT COMMERCIAL USERS
- Adobe: Uses clustering algorithms to increase video consumption by better user targeting.
- Amazon: For its personalization platform.
- AOL: For shopping recommendations.
- Twitter: Uses Mahout's LDA implementation for user interest modeling.
- Yahoo! Mail: Uses Mahout's Frequent Pattern Set Mining.
- Drupal: Uses Mahout to provide open source content recommendation solutions.
- Evolv: Uses Mahout for its Workforce Predictive Analytics platform.
- Foursquare: Uses Mahout for its recommendation engine.
- Idealo: Uses Mahout's recommendation engine.
THANK YOU