Guided By Ms. Shikha Pachouly Assistant Professor Computer Engineering Department 2/29/2016.



Machine Learning
Machine learning is programming computers to optimize a performance criterion using example data or past experience.
Machine Learning Strategies
1) Supervised
2) Unsupervised

Common Use Cases
- Recommend friends/dates/products
- Classify content into predefined groups
- Find similar content based on object properties
- Find associations/patterns in actions/behaviors
- Identify key topics in large collections of text
- Detect anomalies in output
- Rank search results

Apache Mahout Introduction
- Machine learning library for scalable applications.
- Includes core algorithms for recommendation, clustering and classification, implemented on top of the Hadoop MapReduce model.
- The core libraries are also highly optimized, so non-distributed algorithms perform well too.

Mahout is distributed under a commercially friendly Apache Software license.
The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases.
Currently Mahout supports mainly three use cases:
1) Recommendation mining
2) Clustering
3) Classification

Why Mahout
Many open source ML libraries (PyBrain, Shark, etc.) either
1) lack community
2) lack scalability
3) lack documentation and examples
Most Mahout implementations are MapReduce enabled.

The main goal of Apache Mahout is to be useful to practitioners.
- This means implementations should be easy to use from within Java applications.
- It should be close to trivial to deploy the trained models.
- Scaling to include more and more diverse data should be simple.

Recommendations
- Extensive framework for collaborative filtering
- Recommenders: 1) user based 2) item based
- Many different similarity measures, e.g. Cosine, LLR, Tanimoto, Pearson
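To make the similarity measures named above concrete, here is a minimal pure-Python sketch of three of them (cosine, Pearson and Tanimoto). It does not use Mahout's own classes; the function names are hypothetical and chosen only for illustration, and LLR is left out to keep the example short.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant ratings carry no correlation signal
    return cov / math.sqrt(var_a * var_b)


def tanimoto_coefficient(items_a, items_b):
    """Size of the intersection over size of the union of two item sets."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    print(cosine_similarity([1, 2, 3], [2, 4, 6]))
    print(pearson_correlation([1, 2, 3], [3, 2, 1]))
    print(tanimoto_coefficient({"a", "b", "c"}, {"b", "c", "d"}))
```

Cosine and Pearson work on rating values, while Tanimoto only looks at which items two users have in common, so it suits data without explicit ratings.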

Algorithms for Recommendation
- User-Based Collaborative Filtering - single machine
- Item-Based Collaborative Filtering - single machine / MapReduce
- Matrix Factorization with Alternating Least Squares - single machine / MapReduce
- Matrix Factorization with Alternating Least Squares on Implicit Feedback - single machine / MapReduce
- Weighted Matrix Factorization, SVD++, Parallel SGD - single machine

User-Based Recommender
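The idea behind a user-based recommender can be sketched in a few lines of plain Python: find the users most similar to the target user, then score the items they rated that the target user has not seen. This is an illustrative sketch only, not Mahout's implementation; the function names, the sample data and the choice of cosine similarity over co-rated items are all assumptions.

```python
import math


def cosine_on_shared(ratings_a, ratings_b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, k=2, top_n=3):
    """Return up to top_n (item, predicted_rating) pairs for `user`.

    ratings maps user -> {item: rating}. The k most similar users act
    as the neighbourhood; predictions are similarity-weighted averages.
    """
    others = [(cosine_on_shared(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    neighbours = sorted(others, reverse=True)[:k]

    weighted_sum, weight_total = {}, {}
    for sim, other in neighbours:
        if sim <= 0:
            continue  # ignore users with no positive similarity
        for item, rating in ratings[other].items():
            if item in ratings[user]:
                continue  # only recommend unseen items
            weighted_sum[item] = weighted_sum.get(item, 0.0) + sim * rating
            weight_total[item] = weight_total.get(item, 0.0) + sim

    predictions = {item: weighted_sum[item] / weight_total[item]
                   for item in weighted_sum}
    return sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Hypothetical sample data: user -> {item: rating}
    sample_ratings = {
        "alice": {"A": 5, "B": 3},
        "bob": {"A": 5, "B": 3, "C": 4},
        "carol": {"A": 1, "B": 5, "D": 2},
    }
    print(recommend(sample_ratings, "alice", k=1))
```

With a neighbourhood of one, alice's recommendations come only from bob, whose ratings match hers on every co-rated item.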

Clustering

Algorithms for Clustering
- K-Means Clustering
- Fuzzy K-Means
- Mean Shift Clustering
- Dirichlet Process Clustering (for topic modelling)
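As a rough illustration of the first algorithm in this list, the sketch below runs k-means on a single machine in plain Python. It is not Mahout's MapReduce implementation; the function name, the caller-supplied starting centroids and the squared Euclidean distance are assumptions made to keep the example short and deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Cluster points around the given starting centroids.

    points and centroids are sequences of equal-length numeric tuples.
    Returns (centroids, clusters), where clusters[i] holds the points
    assigned to centroids[i] in the last assignment step.
    """
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c))
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)

        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


if __name__ == "__main__":
    data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
    final_centroids, final_clusters = kmeans(data, [(0, 0), (10, 10)])
    print(final_centroids)
    print(final_clusters)
```

Fuzzy k-means follows the same assign-then-update loop but gives each point a degree of membership in every cluster instead of a single hard assignment.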

The clustering algorithms can be run on Hadoop infrastructure from the command line, e.g.
Canopy Clustering:
    bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
k-Means Clustering:
    bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Fuzzy k-Means Clustering:
    bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job

Classification
Algorithms implemented in Mahout for classification:
- Logistic Regression, trained via SGD - single machine
- Naive Bayes / Complementary Naive Bayes - MapReduce
- Random Forest - MapReduce
- Hidden Markov Models - single machine
- Multilayer Perceptron - single machine

Running Naïve Bayes from Command Line
Three commands:
1) mahout seq2sparse: performs the TF/IDF transformations
2) mahout trainnb: trains the Naive Bayes model
3) mahout testnb: performs classification and testing
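The train and test steps above can be illustrated with a small multinomial Naive Bayes classifier in plain Python. This is a simplified sketch, not what the mahout commands run internally: it works on raw term counts with add-one smoothing instead of TF/IDF weights, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Build a model from (label, tokens) pairs.

    Returns (documents per class, term counts per class, vocabulary).
    """
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def classify_nb(model, tokens):
    """Return the label with the highest log posterior for the tokens."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Log prior: share of training documents carrying this label.
        score = math.log(class_docs[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # skip terms never seen during training
            # Add-one (Laplace) smoothed log likelihood of the term.
            score += math.log(
                (term_counts[label][t] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    training = [
        ("spam", ["cheap", "pills", "buy"]),
        ("spam", ["buy", "now", "cheap"]),
        ("ham", ["meeting", "at", "noon"]),
        ("ham", ["lunch", "meeting", "tomorrow"]),
    ]
    model = train_nb(training)
    print(classify_nb(model, ["cheap", "buy"]))
    print(classify_nb(model, ["meeting", "lunch"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.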

Installation of Mahout
1) Download the tar files of both the apache-mahout and apache-maven projects.
2) Extract the tar files into a directory.
3) Set the PATH variable for Maven.
4) Set the present working directory to Mahout's core folder.
5) Compile the project with 'mvn compile'.
6) Build the project with 'mvn install'.

Mahout Vs Weka
Base \ Technologies    Mahout    WEKA
Scalability            More      Less
Algorithms             Less      More
GUI                    No        Yes
License                Apache    GPL

MAHOUT COMMERCIAL USERS
- Adobe: Uses clustering algorithms to increase video consumption by better user targeting.
- Amazon: For its Personalization platform.
- AOL: For shopping recommendations.
- Twitter: Uses Mahout's LDA implementation for user interest modeling.
- Yahoo! Mail: Uses Mahout's Frequent Pattern Set Mining.
- Drupal: Uses Mahout to provide open source content recommendation solutions.
- Evolv: Uses Mahout for its Workforce Predictive Analytics platform.
- Foursquare: Uses Mahout for its recommendation engine.
- Idealo: Uses Mahout's recommendation engine.


THANK YOU