Published by Tamsin Collins. Modified over 9 years ago.
1
CS525: Big Data Analytics
Machine Learning on Hadoop
Fall 2013
Elke A. Rundensteiner
2
Analytics?
--Machine learning, data mining & statistics tools
--Analyze/mine/summarize large datasets
--Extract knowledge from past or streaming data
--Predict trends in future data
3
ML Today
--Internet search clustering
--Social network analysis
--Taxonomy transformations
--Market analytics
--Recommendation systems
--Log analysis & event filtering
--SPAM filtering
--Fraud detection
4
Tools & Algorithms
--Collaborative Filtering
--Clustering Techniques
--Classification Algorithms
--Association Rules
--Frequent Pattern Mining
--Statistical libraries (Regression, SVM, …)
--Others…
5
Common Use Cases
6
Make It Industry Strength: Big Data
Analytics and mining tools:
--Efficient in analyzing/mining data
--Do not scale
Big data management systems:
--Efficient in managing big data
--Do not analyze or mine data
7
Ongoing Research Efforts
--Ricardo (SIGMOD’10): Integrating Hadoop and R using Jaql
--HaLoop (VLDB’10): Supporting iterative processing in Hadoop
8
Some Projects
--Apache Mahout
    Open-source package on Hadoop for data mining and machine learning
--Revolution R (R-Hadoop or Radoop)
    Extensions to the R package to run on Hadoop
9
Apache Mahout
10
--Apache Software Foundation project
--Create scalable machine learning libraries
--Why? Many open source ML libraries either:
    --Lack community
    --Lack documentation
    --Lack scalability
    --Or are research-oriented only
11
Support Machine Learning
12
But Must Scale & Perform
--Be as fast as possible
--Scale to as much data as possible
13
But Must Scale & Perform
--Be as fast as possible given the intrinsic algorithm!
--What is expressible as map-reduce jobs?
--Work in progress...
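To make the "expressible as map-reduce jobs" question concrete, here is a minimal sketch of one k-means iteration written as a map step and a reduce step. It is plain Python that only mimics the map/shuffle/reduce phases in memory; it does not use Hadoop or Mahout, and the function names (map_point, reduce_cluster, kmeans_iteration) are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce: average all points that were assigned to one centroid."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + x for t, x in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    # Shuffle phase (simulated): group mapper output by key.
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its old position.
    return [
        reduce_cluster(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
```

Each iteration is a separate job in this formulation, so an iterative algorithm becomes a chain of map-reduce passes over the same data.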
14
C1: Collaborative Filtering
15
C2: Clustering
--Group similar objects together
--K-Means, Fuzzy K-Means, Density-Based, …
--Different distance measures
    --Manhattan, Euclidean, …
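The choice of distance measure can change which cluster a point lands in. The sketch below defines the Manhattan and Euclidean measures named on the slide and applies each to a single nearest-center assignment. It is an illustration in plain Python, not Mahout's API; nearest_center and the sample data are made up for the example.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance (L2 distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_center(point, centers, distance):
    """Index of the center closest to point under the given measure."""
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))


point = (0.0, 0.0)
centers = [(3.0, 3.0), (0.0, 5.0)]

by_manhattan = nearest_center(point, centers, manhattan)
by_euclidean = nearest_center(point, centers, euclidean)
```

Here the Manhattan distances are 6 and 5, so the second center is nearest, while the Euclidean distances are about 4.24 and 5, so the first center is nearest.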
16
C3: Classification
17
FPM: Frequent Pattern Mining
--Find the itemsets that are frequently sold together
--Very common in market analysis, access pattern analysis, etc.
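As a small illustration of what "frequently sold together" means, the sketch below counts how many baskets contain each itemset and keeps those that reach a minimum support. It is a brute-force count over itemsets of up to max_size items, chosen for readability; it is not FP-Growth or any Mahout routine, and the function name and baskets are invented for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets found in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # dedupe and fix the order inside each tuple
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
result = frequent_itemsets(transactions, min_support=2)
```

With these four baskets, the pair ("beer", "diapers") appears twice and is kept, while ("beer", "bread") appears once and is dropped. The number of candidate itemsets grows quickly with max_size, which is why scalable implementations avoid enumerating them all.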
18
Matrices and Statistics
--Math libraries
    --Vectors, matrices, etc.
--Noise reduction
--Similarity functions
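One common example of a similarity function over vectors is cosine similarity. The slide does not say which functions are meant, so this is only an assumed illustration in plain Python, not a Mahout call.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity((1.0, 2.0), (2.0, 4.0))
orthogonal = cosine_similarity((1.0, 0.0), (0.0, 1.0))
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score 0.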
19
Apache Mahout
http://mahout.apache.org/