Download presentation

Presentation is loading. Please wait.

1
**Machine Learning on Spark**

Shivaram Venkataraman UC Berkeley

2
Machine learning Computer Science Statistics

3
**Machine learning Spam ﬁlters Click prediction Recommendations**

Search ranking

4
**Machine learning techniques Classification Clustering Regression**

Active learning Collaborative filtering

5
**Implementing Machine Learning**

Machine learning algorithms are Complex, multi-stage Iterative MapReduce/Hadoop unsuitable Need efficient primitives for data sharing

6
**Machine Learning using Spark**

Spark RDDs efficient data sharing In-memory caching accelerates performance Up to 20x faster than Hadoop Easy to use high-level programming interface Express complex algorithms ~100 lines.

7
**Machine learning techniques Classification Clustering Regression**

Active learning Collaborative filtering

8
**K-Means Clustering using Spark**

Focus: Implementation and Performance

9
**Grouping data according to similarity**

Clustering E.g. archaeological dig Clustering applications – applications – image recognition Distance North Grouping data according to similarity Distance East

10
**Grouping data according to similarity**

Clustering E.g. archaeological dig Clustering applications – applications – image recognition Distance North Grouping data according to similarity Distance East

11
**K-Means Algorithm E.g. archaeological dig Distance North Distance East**

Benefits Popular Fast Conceptually straightforward E.g. archaeological dig Clustering applications – applications – image recognition Distance North Distance East

12
**K-Means: preliminaries**

Data: Collection of values Clustering applications – applications – image recognition data = lines.map(line=> parseVector(line)) Feature 2 Feature 1

13
**K-Means: preliminaries**

Dissimilarity: Squared Euclidean distance Clustering applications – applications – image recognition Feature 2 dist = p.squaredDist(q) Feature 1

14
**K-Means: preliminaries**

K = Number of clusters Clustering applications – applications – image recognition Feature 2 Data assignments to clusters S1, S2,. . ., SK Feature 1

15
**K-Means: preliminaries**

K = Number of clusters Clustering applications – applications – image recognition Feature 2 Data assignments to clusters S1, S2,. . ., SK Feature 1

16
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster’s data points. Clustering applications – applications – image recognition Feature 2 Feature 1

17
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster’s data points. Clustering applications – applications – image recognition Feature 2 Feature 1

18
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 Feature 1

19
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 Feature 1

20
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 Feature 1

21
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) Feature 1

22
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) Feature 1

23
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: Assign each cluster center to be the mean of its cluster’s data points. centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) Feature 1

24
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() Feature 1

25
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues( ps => average(ps)) Feature 1

26
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues( ps => average(ps)) Feature 1

27
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues( ps => average(ps)) Feature 1

28
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 while (dist(centers, newCenters) > ɛ) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters =pointsGroup.mapValues( ps => average(ps)) Feature 1

29
**K-Means Algorithm • Initialize K cluster centers**

• Repeat until convergence: centers = data.takeSample( false, K, seed) Clustering applications – applications – image recognition Feature 2 while (dist(centers, newCenters) > ɛ) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters =pointsGroup.mapValues( ps => average(ps)) Feature 1

30
**K-Means Source Feature 2 Feature 1 centers = data.takeSample(**

false, K, seed) while (d > ɛ) { Clustering applications – applications – image recognition closest = data.map(p => (closestPoint(p,centers),p)) Feature 2 pointsGroup = closest.groupByKey() newCenters =pointsGroup.mapValues( ps => average(ps)) d = distance(centers, newCenters) centers = newCenters.map(_) } Feature 1

31
**Ease of use Interactive shell:**

Useful for featurization, pre-processing data Lines of code for K-Means Spark ~ 90 lines – (Part of hands-on tutorial !) Hadoop/Mahout ~ 4 files, > 300 lines

32
Performance Logistic Regression K-Means [Zaharia et. al, NSDI’12]

33
**Conclusion Spark: Framework for cluster computing**

Fast and easy machine learning programs K means clustering using Spark Hands-on exercise this afternoon ! Examples and more:

Similar presentations

OK

UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google

Ppt on bank lending process Ppt on understanding marketing management Convert word doc to ppt online shopping Best ppt on online education Ppt on extinct animals of india Seminar ppt on geothermal energy Free ppt on human resource management Ppt on active magnetic bearing Ppt on commercial use of microorganisms Morphological opening ppt on mac