CS525: Big Data Analytics Machine Learning on Hadoop Fall 2013 Elke A. Rundensteiner 1.


1 CS525: Big Data Analytics. Machine Learning on Hadoop. Fall 2013. Elke A. Rundensteiner

2 Analytics? Machine learning, data mining & statistics tools. Analyze/mine/summarize large datasets. Extract knowledge from past or streaming data. Predict trends in future data.

3 ML Today: Internet search clustering; Social network analysis; Taxonomy transformations; Market analytics; Recommendation systems; Log analysis & event filtering; SPAM filtering; Fraud detection

4 Tools & Algorithms: Collaborative Filtering; Clustering Techniques; Classification Algorithms; Association Rules; Frequent Pattern Mining; Statistical libraries (Regression, SVM, …); Others…

5 Common Use Cases

6 Make It Industry Strength: Big Data -- Efficient in analyzing/mining data, but do not scale -- Efficient in managing big data, but does not analyze or mine data

7 Ongoing Research Efforts: Ricardo (SIGMOD'10): Integrating Hadoop and R using Jaql. HaLoop (VLDB'10): Supporting iterative processing in Hadoop.

8 Some Projects: Apache Mahout: open-source package on Hadoop for data mining and machine learning. Revolution R (R-Hadoop or Radoop): extensions to the R package to run on Hadoop.

9 Apache Mahout

10 Apache Software Foundation project. Goal: create scalable machine learning libraries. Why? Many open-source ML libraries either lack community, lack documentation, lack scalability, or are research-oriented only.

11 Support Machine Learning

12 But Must Scale & Perform: Be as fast as possible. Scale to as much data as possible.

13 But Must Scale & Perform: Be as fast as possible given the intrinsic algorithm! What is expressible as map-reduce jobs? Work in progress...
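
To make the question "what is expressible as map-reduce jobs?" concrete, here is a minimal sketch of one analytics step (the mean rating per item) written as a map function and a reduce function. It runs in memory and does not use Hadoop or Mahout; the names `run_mapreduce`, `rating_mapper`, `mean_reducer` and the sample ratings are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle -> reduce in memory (no Hadoop involved)."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: group emitted values by key.
            groups[key].append(value)
    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def rating_mapper(record):
    """Emit (item, (rating, 1)) so sums and counts can be combined later."""
    user, item, rating = record
    yield item, (rating, 1)


def mean_reducer(item, values):
    """Add up partial sums and counts, then divide to get the mean."""
    total = sum(rating for rating, _ in values)
    count = sum(n for _, n in values)
    return total / count


# Hypothetical (user, item, rating) records.
ratings = [("u1", "a", 4.0), ("u2", "a", 2.0), ("u1", "b", 5.0)]
item_means = run_mapreduce(ratings, rating_mapper, mean_reducer)
```

The mapper emits a (sum, count) pair instead of the raw rating because such pairs can be added in any order, which is what lets the work be split across machines. Algorithms that need many passes over the data fit this model less naturally.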

14 C1: Collaborative Filtering

15 C2: Clustering. Group similar objects together. Algorithms: K-Means, Fuzzy K-Means, Density-Based, … Different distance measures: Manhattan, Euclidean, …
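
As an illustration of the clustering slide, the following sketch defines the two distance measures it names and performs a single K-Means iteration (assign each point to its nearest centroid, then recompute the centroids). This is plain single-machine Python, not Mahout's implementation; `kmeans_step` and the sample points are assumptions made for the example.

```python
import math


def manhattan(p, q):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Euclidean (L2) distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans_step(points, centroids, distance=euclidean):
    """Run one K-Means iteration and return (new_centroids, clusters)."""
    clusters = [[] for _ in centroids]
    for p in points:
        # Assignment step: pick the index of the closest centroid.
        nearest = min(range(len(centroids)),
                      key=lambda i: distance(p, centroids[i]))
        clusters[nearest].append(p)

    new_centroids = []
    for old, members in zip(centroids, clusters):
        if members:
            # Update step: the new centroid is the coordinate-wise mean.
            dim = len(old)
            new_centroids.append(tuple(
                sum(m[d] for m in members) / len(members) for d in range(dim)
            ))
        else:
            # Keep an empty cluster's centroid where it was.
            new_centroids.append(old)
    return new_centroids, clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, clusters = kmeans_step(points, [(0, 0), (10, 10)])
```

Passing `distance=manhattan` swaps the measure without touching the rest of the step. A full K-Means run would repeat `kmeans_step` until the centroids stop moving.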

16 C3: Classification

17 FPM: Frequent Pattern Mining. Find itemsets that are frequently sold together. Very common in market analysis, access pattern analysis, etc.
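
The idea of "itemsets frequently sold together" can be sketched with a brute-force counter: enumerate the small itemsets in each basket and keep those that appear in at least `min_support` baskets. This is deliberately naive and is not the algorithm a scalable library would use; the function name, the `max_size` limit, and the sample baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets of up to max_size items; keep those meeting min_support."""
    counts = Counter()
    for basket in transactions:
        # De-duplicate and sort so each itemset has one canonical tuple form.
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk"],
]
frequent = frequent_itemsets(baskets, min_support=3)
```

Here bread and milk appear together in three of the four baskets, so the pair meets the support threshold, while every itemset containing butter falls below it. Enumerating all combinations grows quickly with basket size, which is why large datasets call for more careful algorithms.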

18 Matrices and Statistics: Math libraries (vectors, matrices, etc.); Noise reduction; Similarity functions

19 Apache Mahout: http://mahout.apache.org/

