Download presentation

Presentation is loading. Please wait.

Published byDestiny Dennett Modified over 2 years ago

1
Machine Learning on Spark Shivaram Venkataraman UC Berkeley

2
Computer Science Machine learning Statistics

3
Machine learning Spam lters Recommendations Click prediction Search ranking

4
Machine learning techniques Classification Regression Clustering Active learning Collaborative filtering

5
Implementing Machine Learning Machine learning algorithms are -Complex, multi-stage -Iterative MapReduce/Hadoop unsuitable Need efficient primitives for data sharing

6
Spark RDDs efficient data sharing In-memory caching accelerates performance -Up to 20x faster than Hadoop Easy to use high-level programming interface -Express complex algorithms ~100 lines. Machine Learning using Spark

7
Machine learning techniques Classification Regression Clustering Active learning Collaborative filtering

8
K-Means Clustering using Spark Focus: Implementation and Performance

9
Clustering Grouping data according to similarity Distance East Distance North E.g. archaeological dig

10
Clustering Grouping data according to similarity Distance East Distance North E.g. archaeological dig

11
K-Means Algorithm Benefits Popular Fast Conceptually straightforward Distance East Distance North E.g. archaeological dig

12
K-Means: preliminaries Feature 1 Feature 2 Data: Collection of values data = lines.map(line=> parseVector(line))

13
K-Means: preliminaries Feature 1 Feature 2 Dissimilarity: Squared Euclidean distance dist = p.squaredDist(q)

14
K-Means: preliminaries Feature 1 Feature 2 K = Number of clusters Data assignments to clusters S 1, S 2,..., S K

15
K-Means: preliminaries Feature 1 Feature 2 K = Number of clusters Data assignments to clusters S 1, S 2,..., S K

16
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its clusters data points.

17
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its clusters data points.

18
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its clusters data points. centers = data.takeSample( false, K, seed)

19
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its clusters data points. centers = data.takeSample( false, K, seed)

20
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its clusters data points. centers = data.takeSample( false, K, seed)

21
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: Assign each cluster center to be the mean of its clusters data points. centers = data.takeSample( false, K, seed) closest = data.map(p => (closestPoint(p,centers),p))

22
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: Assign each cluster center to be the mean of its clusters data points. centers = data.takeSample( false, K, seed) closest = data.map(p => (closestPoint(p,centers),p))

23
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: Assign each cluster center to be the mean of its clusters data points. centers = data.takeSample( false, K, seed) closest = data.map(p => (closestPoint(p,centers),p))

24
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: centers = data.takeSample( false, K, seed) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey()

25
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: centers = data.takeSample( false, K, seed) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues( ps => average(ps))

26
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: centers = data.takeSample( false, K, seed) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues( ps => average(ps))

27
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: centers = data.takeSample( false, K, seed) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues( ps => average(ps))

28
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: centers = data.takeSample( false, K, seed) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters =pointsGroup.mapValues( ps => average(ps)) while (dist(centers, newCenters) > ɛ)

29
K-Means Algorithm Feature 1 Feature 2 Initialize K cluster centers Repeat until convergence: centers = data.takeSample( false, K, seed) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters =pointsGroup.mapValues( ps => average(ps)) while (dist(centers, newCenters) > ɛ)

30
K-Means Source Feature 1 Feature 2 centers = data.takeSample( false, K, seed) closest = data.map(p => (closestPoint(p,centers),p)) pointsGroup = closest.groupByKey() newCenters =pointsGroup.mapValues( ps => average(ps)) while (d > ɛ) { } d = distance(centers, newCenters) centers = newCenters.map(_)

31
Ease of use Interactive shell: Useful for featurization, pre-processing data Lines of code for K-Means -Spark ~ 90 lines – (Part of hands-on tutorial !) -Hadoop/Mahout ~ 4 files, > 300 lines

32
K-Means Logistic Regression Performance [Zaharia et. al, NSDI12]

33
K means clustering using Spark Hands-on exercise this afternoon ! Examples and more: Spark: Framework for cluster computing Fast and easy machine learning programs Conclusion

34

Similar presentations

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google