1 Homework 1 Tutorial
COSC6376 Cloud Computing
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

2 Outline
Homework 1
Tutorial based on the Netflix dataset

3 Homework 1: K-means Clustering of Amazon Reviews
Create related product items based on the Amazon review ratings
Understand the K-means and canopy clustering algorithms and their relationship
Implement these algorithms using Apache Spark
Analyze the effect of running these algorithms on a large data set using Amazon Cloud

4 Tutorial based on Netflix Dataset
K-means example using the Netflix dataset
Rating dataset similar to Amazon reviews
Amazon datasets: productid, userid, rating, timestamp, plus other metadata fields and review texts
Netflix dataset: movieid, userid, rating, timestamp

5 Netflix Prize
Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies
Netflix's internal movie rating predictor, Cinematch, was used for recommending movies
$1,000,000 award to those who could improve the prediction by 10% (in terms of root mean squared error)
Winner: BellKor's Pragmatic Chaos
Another team, The Ensemble, produced equally good results but submitted 20 minutes later

7 Competition Cancelled
Researchers demonstrated that individuals can be identified by matching the Netflix data sets with film ratings posted online
Netflix users filed a class action lawsuit against Netflix for privacy violation, citing the Video Privacy Protection Act of 1988

9 Movie Dataset
The data is in the format UserID::MovieID::Rating::Timestamp
1::1193::5::
2::1194::4::
7::1123::1::

10 K-means Clustering
Clustering problem description:
iterate {
  Compute the distance from all points to all k centers
  Assign each point to the nearest k-center
  Compute the average of all points assigned to each k-center
  Replace the k-centers with the new averages
}
Good survey: A. K. Jain et al., Data Clustering: A Review, ACM Computing Surveys, 1999
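As a concrete reference, here is a minimal plain-Python sketch of that loop (NumPy, Euclidean distance; the homework itself targets Spark, and every name below is illustrative rather than required):

import numpy as np

def kmeans(points, k, num_iters=20, seed=0):
    # Sketch of the iterate { ... } loop above; points is an (n, d) array.
    rng = np.random.default_rng(seed)
    # Randomly select k points as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(num_iters):
        # Compute the distance from all points to all k centers
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Assign each point to the nearest center
        labels = dists.argmin(axis=1)
        # Replace each center with the average of its assigned points
        # (an empty cluster would need special handling, omitted here)
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels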

11 K-means Illustration
Randomly select k centroids
Assign each point a cluster label according to its distance to the centroids

12 K-means Illustration
Recluster: recalculate the centroids
Repeat until the cluster labels do not change, or the changes in the centroids are very small

13 Summary of K-means
Determine the value of k
Determine the initial k centroids
Repeat until convergence:
- Determine membership: assign each point to the closest centroid
- Update centroid positions: compute the average of the assigned members

14 The Setting
The dataset is stored in HDFS
We use a MapReduce k-means to get the clustering result
Implement each iteration as one MapReduce job
Pass the k centroids to the Mappers
Map: assign a label to each record according to its distances to the k centroids; emit <cluster id, record>
Reduce: calculate the mean for each cluster, and replace the centroid with the new mean
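One way to phrase a single iteration as map and reduce functions, sketched in plain Python (the array-based record and centroid representations are assumptions for illustration, not the required interface):

import numpy as np

def kmeans_map(record, centroids):
    # Map: label one record with the index of its nearest centroid.
    # Emits <cluster id, record>.
    dists = [np.linalg.norm(record - c) for c in centroids]
    return int(np.argmin(dists)), record

def kmeans_reduce(cluster_id, records):
    # Reduce: the new centroid is the mean of the cluster's records.
    # Emits <cluster id, new centroid>.
    return cluster_id, np.mean(records, axis=0)

A driver would broadcast the current centroids to every Mapper, group the map output by cluster id, run the reducer per group, and repeat until the centroids stop moving.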

15 Complexity
The complexity is pretty high: k * n * O(distance metric) * num(iterations)
Moreover, it can be necessary to send large amounts of data to each Mapper node. Depending on the bandwidth and memory available, this can be infeasible.

16 Furthermore
There are three big ways a data set can be large:
There are a large number of elements in the set.
Each element can have many features.
There can be many clusters to discover.
Conclusion: clustering can be huge, even when you distribute it.

17 Canopy Clustering
A preliminary step to help parallelize the computation.
Clusters the data into overlapping canopies using a super-cheap distance metric.
Efficient
Accurate

19 Canopy Clustering
while there are unmarked points {
  pick a point that is not strongly marked; call it a canopy center
  mark all points within some threshold of it as in its canopy
  strongly mark all points within some tighter threshold
}
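A plain-Python sketch of that loop, assuming a distance function where smaller means closer and two thresholds with t_tight <= t_loose (all names illustrative):

def canopy_centers(points, dist, t_loose, t_tight):
    # Sketch of canopy selection: points within t_loose of a center join
    # its canopy; points within t_tight are "strongly marked" and can no
    # longer become centers. Canopies may overlap.
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)       # pick a point that is not strongly marked
        members, remaining = [], []
        for p in candidates:
            d = dist(center, p)
            if d < t_loose:              # within the loose threshold:
                members.append(p)        # mark as inside this canopy
            if d >= t_tight:             # not strongly marked:
                remaining.append(p)      # may still become a center later
        canopies.append((center, members))
        candidates = remaining
    return canopies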

20 After the Canopy Clustering…
Run K-means clustering as usual.
Treat objects in separate canopies as being at infinite distance.

21 MapReduce Implementation
Problem: efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.
The Distance Metric
The Canopy Metric ($)
The K-Means Metric ($$$)

22 Steps
Get data into a form you can use (MR)
Pick canopy centers (MR)
Assign data points to canopies (MR)
Pick K-means cluster centers
K-means algorithm (MR)
Iterate!

23 Canopy Distance Function
Canopy selection requires a simple distance function
Number of rater IDs in common
Close and far distance thresholds
Close distance threshold: 8 rater IDs in common
Far distance threshold: 2 rater IDs in common
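A sketch of that cheap metric, assuming each movie is represented as a dict mapping userId to rating. Note this metric is a similarity, so "close" means a larger overlap: pairs with 8 or more raters in common count as close (strongly marked), pairs with 2 or more fall inside the canopy.

def raters_in_common(movie_a, movie_b):
    # Cheap canopy metric: how many rater (user) IDs two movies share.
    # movie_a and movie_b are dicts mapping userId -> rating.
    return len(movie_a.keys() & movie_b.keys())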

24 K-means Distance Metric
The set of ratings for a movie given by a set of users can be thought of as a vector:
A = [user1_score, user2_score, ..., userN_score]
To evaluate the distance between two movies A and B, use the similarity metric below:
Similarity(A, B) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))
where the sums run over all A_i and B_i for 0 <= i < n
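A direct transcription of that formula into Python (assuming the two rating vectors are dense, aligned on the same user index, and not all zeros):

import math

def similarity(a, b):
    # Similarity(A, B) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)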

25 Example
Three vectors A, B, and C (their values were shown in a figure on the original slide)
Distance or similarity between A and B:
distance(A, B) = Vector(A) · Vector(B) / (||A|| * ||B||)
Vector(A) · Vector(B) = 1
||A|| * ||B|| = 2 * 2 = 4
So Similarity(A, B) = 1/4 = 0.25

26 Data Massaging
Convert the data into the required format.
In this case, the converted data takes the form <MovieId, List of Users>, i.e., <MovieId, List<userId, rating>>
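A sketch of that conversion, assuming the UserID::MovieID::Rating::Timestamp format from slide 9 (in the homework this would typically be a Spark map followed by a group-by-key):

from collections import defaultdict

def group_ratings_by_movie(lines):
    # Convert UserID::MovieID::Rating::Timestamp lines into
    # <MovieId, List<(userId, rating)>>.
    by_movie = defaultdict(list)
    for line in lines:
        user_id, movie_id, rating, _timestamp = line.strip().split("::")
        by_movie[movie_id].append((user_id, float(rating)))
    return by_movie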

27 Canopy Cluster – Mapper A

28 Threshold Value

35 Reducer
Mapper A: red center; Mapper B: green center

36 Redundant Centers within the Threshold of Each Other.

37 Add a Small Error => Threshold + ξ

38 So far we have found only the canopy centers.
Run another MR job to find the points that belong to each canopy center.
The canopy clusters are ready when the job is completed.
What would it look like?
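A sketch of that second job's Mapper, reusing the raters_in_common metric and the far (loose) threshold from slide 23 (all names illustrative):

def canopy_assign_map(movie, canopy_centers, t_loose=2):
    # Map: emit <canopy id, movie> for every canopy whose center shares
    # at least t_loose rater IDs with this movie. Because canopies
    # overlap, one movie can be emitted under several canopy ids.
    for canopy_id, center in canopy_centers:
        if raters_in_common(movie, center) >= t_loose:
            yield canopy_id, movie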

39 Canopy Cluster – Before MR Job (sparse matrix)

40 Canopy Cluster – After MR job

41 Cells with value 1 are grouped together, and users are moved from their original locations

42 K-Means Clustering
The output of canopy clustering becomes the input of K-means clustering.
Apply the cosine similarity metric to find similar users.
To find cosine similarity, create a vector in the format <UserId, List<Movies>>
<UserId, {m1, m2, m3, m4, m5}>
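One way to build those vectors, sketched as binary rated/not-rated entries over a fixed movie index (matching the table on the next slide; user_movies and all_movies are illustrative names):

def user_vector(user_movies, all_movies):
    # Binary vector over a fixed movie ordering: 1 if rated, else 0.
    return [1 if m in user_movies else 0 for m in all_movies]

With the next slide's data, similarity(user_vector(A), user_vector(C)) works out to 4 / (2 * sqrt(7)) ≈ 0.76, while A and B share no movies and score 0.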

43
User A: Toy Story, Avatar, Jumanji, Heat
User B: GoldenEye, Money Train, Mortal Kombat
User C: Toy Story, Avatar, Jumanji, Heat, GoldenEye, Money Train, Mortal Kombat
As a binary user-by-movie matrix (1 = the user rated the movie):
        Toy Story  Avatar  Jumanji  Heat  GoldenEye  Money Train  Mortal Kombat
User A      1        1        1      1        0           0             0
User B      0        0        0      0        1           1             1
User C      1        1        1      1        1           1             1

44 Find k-neighbors from the same canopy cluster.
Do not take any point from another canopy cluster if you want a small number of neighbors.
# of K-means clusters > # of canopy clusters.
After a couple of MapReduce jobs, the K-means clusters are ready.

45 All Points – Before Clustering

46 Canopy Clustering

47 Canopy Clustering and K-means Clustering

