
1 Machine Learning on Spark Shivaram Venkataraman UC Berkeley

2 Machine learning sits at the intersection of computer science and statistics.

3 Machine learning applications: spam filters, recommendations, click prediction, search ranking.

4 Machine learning techniques: classification, regression, clustering, active learning, collaborative filtering.

5 Implementing Machine Learning. Machine learning algorithms are: complex and multi-stage; iterative. MapReduce/Hadoop is unsuitable for them. We need efficient primitives for data sharing.

6 Spark: RDDs provide efficient data sharing. In-memory caching accelerates performance: up to 20x faster than Hadoop. Easy-to-use high-level programming interface: complex algorithms can be expressed in ~100 lines. Machine Learning using Spark.

7 Machine learning techniques: classification, regression, clustering, active learning, collaborative filtering.

8 K-Means Clustering using Spark Focus: Implementation and Performance

9 Clustering: grouping data according to similarity. E.g. an archaeological dig (finds plotted by Distance East vs. Distance North).

10 Clustering: grouping data according to similarity. E.g. an archaeological dig (finds plotted by Distance East vs. Distance North).

11 K-Means Algorithm. Benefits: popular, fast, conceptually straightforward. (Same archaeological-dig example.)

12 K-Means: preliminaries. Data: collection of values (points plotted by Feature 1 vs. Feature 2). data = lines.map(line => parseVector(line))
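The one-liner above relies on a parseVector helper defined elsewhere in the tutorial. A minimal plain-Scala sketch of what it might do, assuming one point per line with whitespace-separated numbers (the input format is an assumption):

```scala
// Hypothetical sketch of the tutorial's parseVector helper:
// turn one text line into a numeric feature vector.
// Assumes fields are whitespace-separated doubles.
def parseVector(line: String): Array[Double] =
  line.trim.split("\\s+").map(_.toDouble)
```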

13 K-Means: preliminaries. Dissimilarity: squared Euclidean distance. dist = p.squaredDist(q)
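On the slide, squaredDist is a method on Spark's vector class. As an illustration of what it computes, here is the same dissimilarity written over plain Scala arrays:

```scala
// Squared Euclidean distance between two points of equal dimension:
// sum over coordinates of (p_i - q_i)^2. Plain-Scala illustration of
// the p.squaredDist(q) call used on the slide.
def squaredDist(p: Array[Double], q: Array[Double]): Double =
  p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum
```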

14 K-Means: preliminaries. K = number of clusters. Data assignments to clusters: S1, S2, ..., SK.

15 K-Means: preliminaries. K = number of clusters. Data assignments to clusters: S1, S2, ..., SK.

16 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster's data points.

17 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster's data points.

18 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster's data points. centers = data.takeSample(false, K, seed)

19 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster's data points. centers = data.takeSample(false, K, seed)

20 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: Assign each data point to the cluster with the closest center. Assign each cluster center to be the mean of its cluster's data points. centers = data.takeSample(false, K, seed)

21 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: Assign each cluster center to be the mean of its cluster's data points. centers = data.takeSample(false, K, seed) closest = data.map(p => (closestPoint(p, centers), p))

22 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: Assign each cluster center to be the mean of its cluster's data points. centers = data.takeSample(false, K, seed) closest = data.map(p => (closestPoint(p, centers), p))

23 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: Assign each cluster center to be the mean of its cluster's data points. centers = data.takeSample(false, K, seed) closest = data.map(p => (closestPoint(p, centers), p))
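closestPoint is another helper defined in the tutorial, not a Spark primitive. A hypothetical plain-Scala sketch, inferred from how it is used above (returning the index of the nearest center is an assumption):

```scala
// Hypothetical sketch of the tutorial's closestPoint helper:
// index of the center nearest to p under squared Euclidean distance.
def closestPoint(p: Array[Double], centers: Array[Array[Double]]): Int = {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  // minBy scans all K centers and keeps the index of the smallest distance.
  centers.indices.minBy(i => sqDist(p, centers(i)))
}
```

Using the cluster index as the key is what makes the later groupByKey step collect each cluster's points together.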

24 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: centers = data.takeSample(false, K, seed) closest = data.map(p => (closestPoint(p, centers), p)) pointsGroup = closest.groupByKey()

25 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: centers = data.takeSample(false, K, seed) closest = data.map(p => (closestPoint(p, centers), p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues(ps => average(ps))

26 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: centers = data.takeSample(false, K, seed) closest = data.map(p => (closestPoint(p, centers), p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues(ps => average(ps))

27 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: centers = data.takeSample(false, K, seed) closest = data.map(p => (closestPoint(p, centers), p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues(ps => average(ps))

28 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: centers = data.takeSample(false, K, seed) closest = data.map(p => (closestPoint(p, centers), p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues(ps => average(ps)) while (dist(centers, newCenters) > ɛ)

29 K-Means Algorithm. Initialize K cluster centers. Repeat until convergence: centers = data.takeSample(false, K, seed) closest = data.map(p => (closestPoint(p, centers), p)) pointsGroup = closest.groupByKey() newCenters = pointsGroup.mapValues(ps => average(ps)) while (dist(centers, newCenters) > ɛ)
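Like closestPoint, average is a tutorial helper rather than a Spark primitive. A sketch of the component-wise mean of a cluster's points (name and signature are assumptions based on the mapValues call above):

```scala
// Sketch of the average helper: component-wise mean of a group of
// points, i.e. the new center of one cluster.
def average(ps: Seq[Array[Double]]): Array[Double] = {
  val n = ps.size
  // Sum the vectors coordinate-wise, then divide each coordinate by n.
  ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
}
```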

30 K-Means Source

centers = data.takeSample(false, K, seed)
while (d > ɛ) {
  closest = data.map(p => (closestPoint(p, centers), p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
  d = distance(centers, newCenters)
  centers = newCenters
}
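To make the loop on this slide readable end to end, here is a self-contained, single-machine sketch in plain Scala (no Spark): groupBy stands in for groupByKey, a seeded shuffle stands in for takeSample, and total squared center movement stands in for the slide's distance(centers, newCenters). All stand-ins and the empty-cluster fallback are assumptions, not the tutorial's actual code.

```scala
object LocalKMeans {
  type Point = Array[Double]

  def squaredDist(p: Point, q: Point): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // Index of the nearest center to p.
  def closestPoint(p: Point, centers: Array[Point]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  // Component-wise mean of a cluster's points.
  def average(ps: Seq[Point]): Point = {
    val n = ps.size
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
  }

  // Same loop structure as the slide, on local collections.
  def kmeans(data: Array[Point], k: Int, epsilon: Double, seed: Long): Array[Point] = {
    val rnd = new scala.util.Random(seed)
    // Stand-in for data.takeSample(false, K, seed).
    var centers: Array[Point] = rnd.shuffle(data.toVector).take(k).toArray
    var d = Double.MaxValue
    while (d > epsilon) {
      val closest = data.map(p => (closestPoint(p, centers), p))
      // Stand-in for closest.groupByKey().
      val pointsGroup: Map[Int, Seq[Point]] =
        closest.groupBy(_._1).map { case (i, ps) => (i, ps.toSeq.map(_._2)) }
      val newCenters = centers.indices.map { i =>
        // Keep the old center if a cluster ends up empty (assumption).
        pointsGroup.get(i).map(average).getOrElse(centers(i))
      }.toArray
      // Stand-in for distance(centers, newCenters): total squared movement.
      d = centers.zip(newCenters).map { case (c, n) => squaredDist(c, n) }.sum
      centers = newCenters
    }
    centers
  }
}
```

In real Spark, data would be an RDD and closest/pointsGroup would be distributed; only centers lives at the driver, which is why the slide collects new centers back each iteration.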

31 Ease of use. Interactive shell: useful for featurization and pre-processing of data. Lines of code for K-Means: Spark ~90 lines (part of the hands-on tutorial!); Hadoop/Mahout: 4 files, > 300 lines.

32 Performance: K-Means and Logistic Regression [Zaharia et al., NSDI '12]

33 Conclusion. Spark: a framework for cluster computing that makes machine learning programs fast and easy to write. K-means clustering using Spark: hands-on exercise this afternoon! Examples and more:


