Machine Learning on Spark

Presentation on theme: "Machine Learning on Spark" — Presentation transcript:

1 Machine Learning on Spark
Shivaram Venkataraman, UC Berkeley

2 Machine learning
Machine learning draws on both computer science and statistics.

3 Machine learning
Example applications: spam filters, click prediction, recommendations, search ranking.

4 Machine learning techniques
Classification, clustering, regression, active learning, collaborative filtering.

5 Implementing Machine Learning
Machine learning algorithms are complex, multi-stage, and iterative. MapReduce/Hadoop is unsuitable for them: we need efficient primitives for data sharing.

6 Machine Learning using Spark
Spark RDDs enable efficient data sharing: in-memory caching accelerates performance, up to 20x faster than Hadoop. An easy-to-use, high-level programming interface lets you express complex algorithms in ~100 lines.
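
A minimal sketch of the data-sharing idea, assuming a SparkContext `sc` and an illustrative input path and record format (one whitespace-separated point per line):

// Minimal sketch; `sc`, the path, and the record format are assumptions.
val lines = sc.textFile("hdfs://.../points.txt")
val data  = lines.map(line => line.split(' ').map(_.toDouble)).cache()
// cache() keeps the parsed points in memory, so each later pass over the RDD
// reads from RAM instead of re-reading and re-parsing the file.
println(data.count())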

7 Machine learning techniques
Classification, clustering, regression, active learning, collaborative filtering.

8 K-Means Clustering using Spark
Focus: Implementation and Performance

9 Grouping data according to similarity
Clustering: grouping data according to similarity. Example: an archaeological dig, with finds plotted by distance east vs. distance north. Other clustering applications include image recognition.

11 K-Means Algorithm
Benefits: popular, fast, and conceptually straightforward. (Running example: the archaeological dig, plotted by distance east vs. distance north.)

12 K-Means: preliminaries
Data: a collection of points, each a vector of feature values (the figures plot Feature 1 vs. Feature 2).
data = lines.map(line => parseVector(line))
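
parseVector is not defined on the slide; a plausible sketch, assuming `lines` is the RDD of input lines and each line holds whitespace-separated numeric features:

// Hypothetical helper: turn one text line into a feature vector.
def parseVector(line: String): Array[Double] =
  line.split(' ').map(_.toDouble)

val data = lines.map(line => parseVector(line)).cache()   // RDD of points, kept in memory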

13 K-Means: preliminaries
Dissimilarity: squared Euclidean distance between points.
dist = p.squaredDist(q)
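
The slide's p.squaredDist(q) comes from the Vector class used in Spark's examples; an equivalent over plain Array[Double] (assuming p and q have the same length) would be:

// Sum of squared coordinate differences between two points.
def squaredDist(p: Array[Double], q: Array[Double]): Double =
  p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum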

14 K-Means: preliminaries
K = number of clusters. The data points are assigned to clusters S1, S2, . . ., SK.

16 K-Means Algorithm
• Initialize K cluster centers.
• Repeat until convergence:
  Assign each data point to the cluster with the closest center.
  Assign each cluster center to be the mean of its cluster’s data points.

18 K-Means Algorithm
• Initialize K cluster centers:
  centers = data.takeSample(false, K, seed)
• Repeat until convergence:
  Assign each data point to the cluster with the closest center.
  Assign each cluster center to be the mean of its cluster’s data points.
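
As a brief sketch of this step: takeSample(false, K, seed) draws K distinct points from the RDD without replacement and returns them as a local array, which serves as the initial centers (the K and seed values below are illustrative):

val K    = 4        // illustrative number of clusters
val seed = 42L      // illustrative RNG seed
// withReplacement = false, so the initial centers are K distinct data points.
val centers: Array[Array[Double]] = data.takeSample(false, K, seed)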

21 K-Means Algorithm
• Initialize K cluster centers:
  centers = data.takeSample(false, K, seed)
• Repeat until convergence:
  Assign each data point to the cluster with the closest center:
  closest = data.map(p => (closestPoint(p,centers),p))
  Assign each cluster center to be the mean of its cluster’s data points.
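
closestPoint is another helper the slides assume; one plausible definition returns the index of the nearest center under squared Euclidean distance, so `closest` becomes an RDD of (clusterIndex, point) pairs:

// Hypothetical helper: index of the center nearest to p (uses squaredDist from above).
def closestPoint(p: Array[Double], centers: Array[Array[Double]]): Int =
  centers.indices.minBy(i => squaredDist(p, centers(i)))

val closest = data.map(p => (closestPoint(p, centers), p))   // RDD[(Int, Array[Double])]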

24 K-Means Algorithm
• Initialize K cluster centers:
  centers = data.takeSample(false, K, seed)
• Repeat until convergence:
  closest = data.map(p => (closestPoint(p,centers),p))
  pointsGroup = closest.groupByKey()

25 K-Means Algorithm
• Initialize K cluster centers:
  centers = data.takeSample(false, K, seed)
• Repeat until convergence:
  closest = data.map(p => (closestPoint(p,centers),p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
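
average is likewise not shown on the slides; a sketch computing the element-wise mean of a cluster's points (assuming each group is non-empty and all points share the same dimension):

// Hypothetical helper: element-wise mean of a group of points.
def average(ps: Iterable[Array[Double]]): Array[Double] = {
  val n = ps.size
  ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
}

val pointsGroup = closest.groupByKey()                       // cluster index -> points in that cluster
val newCenters  = pointsGroup.mapValues(ps => average(ps))   // cluster index -> new center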

28 K-Means Algorithm
• Initialize K cluster centers:
  centers = data.takeSample(false, K, seed)
• Repeat until convergence:
  while (dist(centers, newCenters) > ɛ)
    closest = data.map(p => (closestPoint(p,centers),p))
    pointsGroup = closest.groupByKey()
    newCenters = pointsGroup.mapValues(ps => average(ps))

30 K-Means Source
centers = data.takeSample(false, K, seed)
while (d > ɛ) {
  closest = data.map(p => (closestPoint(p,centers),p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
  d = distance(centers, newCenters)
  centers = newCenters
}
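
Putting the fragments together, here is a self-contained, runnable sketch of the whole program. It follows the slides' structure, but the helper definitions, input path, K, seed, and ɛ are assumptions, and the new centers are collected to the driver so convergence can be checked there:

import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch, not the original source from the tutorial.
object SparkKMeans {
  def parseVector(line: String): Array[Double] = line.split(' ').map(_.toDouble)

  def squaredDist(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  def closestPoint(p: Array[Double], centers: Array[Array[Double]]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  def average(ps: Iterable[Array[Double]]): Array[Double] = {
    val n = ps.size
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
  }

  def main(args: Array[String]): Unit = {
    val sc   = new SparkContext(new SparkConf().setAppName("SparkKMeans"))
    val data = sc.textFile("hdfs://.../points.txt").map(parseVector).cache()

    val K = 4; val seed = 42L; val epsilon = 1e-4   // illustrative values
    var centers = data.takeSample(false, K, seed)
    var d = Double.PositiveInfinity

    while (d > epsilon) {
      val closest     = data.map(p => (closestPoint(p, centers), p))
      val pointsGroup = closest.groupByKey()
      // Collect the K new centers to the driver (assumes no cluster is empty).
      val newCenters  = pointsGroup.mapValues(ps => average(ps)).collectAsMap()
      // Total squared movement of the centers; the loop ends when it falls below epsilon.
      d = centers.indices.map(i => squaredDist(centers(i), newCenters(i))).sum
      centers = centers.indices.map(i => newCenters(i)).toArray
    }

    centers.foreach(c => println(c.mkString(" ")))
    sc.stop()
  }
}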

31 Ease of use
Interactive shell: useful for featurization and pre-processing data.
Lines of code for K-Means: Spark ~90 lines (part of the hands-on tutorial!); Hadoop/Mahout ~4 files, >300 lines.

32 Performance
Performance results for Logistic Regression and K-Means [Zaharia et al., NSDI ’12].

33 Conclusion
Spark: a framework for cluster computing that makes machine learning programs fast and easy to write.
K-means clustering using Spark: hands-on exercise this afternoon!
Examples and more:
