2 Problem  Predict the opinion each user will have of the different items, and recommend the “best” items to each user.

3 Recommender System  A recommender system applies knowledge discovery techniques to the problem of making personalized recommendations for information, products, or services, usually during a live interaction.

4 Pseudo-Distributed Cluster  We use Hadoop to split the set of users across n machines, copy the input data to each, and then run one Recommender on each machine to process recommendations for a subset of users.

5 Algorithm used

6 Item Based (Sequential)
Input: user preferences for items
Begin
  for every item i that u has no preference for yet
    for every item j that u has a preference for
      compute a similarity s between i and j
      add u's preference for j, weighted by s, to a running average
    end for
  end for
  return the top items, ranked by weighted average
End
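The pseudocode above can be sketched in Python roughly as follows. The slide does not say which similarity measure is used, so cosine similarity over the item rating columns is an assumption here, and the names `cosine_similarity`, `recommend` and the sample `prefs` data are hypothetical.

```python
from math import sqrt


def cosine_similarity(prefs, i, j):
    """Cosine similarity between the rating columns of items i and j (assumed measure)."""
    dot = sum(p[i] * p[j] for p in prefs.values() if i in p and j in p)
    norm_i = sqrt(sum(p[i] ** 2 for p in prefs.values() if i in p))
    norm_j = sqrt(sum(p[j] ** 2 for p in prefs.values() if j in p))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0


def recommend(prefs, user, top_n=3):
    """Rank the items the user has not rated by similarity-weighted average."""
    rated = prefs[user]
    all_items = {item for p in prefs.values() for item in p}
    scores = {}
    for i in all_items - set(rated):        # items u has no preference for yet
        weighted_sum = 0.0
        weight_total = 0.0
        for j, pref in rated.items():       # items u has a preference for
            s = cosine_similarity(prefs, i, j)
            weighted_sum += s * pref        # u's preference for j, weighted by s
            weight_total += s
        if weight_total > 0:
            scores[i] = weighted_sum / weight_total
    # Top items, ranked by weighted average.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical sample data: user -> {item: preference value}
prefs = {
    "alice": {"A": 5.0, "B": 3.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 5.0},
    "carol": {"B": 4.0, "C": 1.0},
}
print(recommend(prefs, "alice"))
```

With this sample data the only candidate for "alice" is item "C", since it is the only item she has not rated.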

7 Co-occurrence matrix (Parallel)  Compute the number of times each pair of items occurs together in some user’s list of preferences.  The more often two items turn up together, the more related or similar they probably are.  Note that the entries in the matrix are not affected by preference values.
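As a small in-memory illustration of this counting, the sketch below tallies each unordered pair of items once per user who has both. The function name `cooccurrence_counts` and the sample data are hypothetical; the preference values are deliberately never read.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(prefs):
    """Count how many users hold a preference for both items of each pair."""
    counts = defaultdict(int)
    for item_prefs in prefs.values():
        # Only the set of items matters; the preference values are ignored.
        for a, b in combinations(sorted(item_prefs), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical sample data: user ID -> {item ID: preference value}
prefs = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5},
    3: {102: 4.0, 103: 1.0},
}
counts = cooccurrence_counts(prefs)
print(counts)  # {(101, 102): 2, (101, 103): 1, (102, 103): 2}
```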

8 Co-occurrence Matrix

9 Computing user vectors  In a data model with n items, user preferences are like a vector over n dimensions, with one dimension for each item. The user’s preference values for items are the values in the vector.  Items that the user expresses no preference for map to a 0 value in the vector. Such a vector is typically quite sparse, consisting mostly of zeroes, because users typically express a preference for only a small subset of all items.
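To make the sparse representation concrete, here is a minimal sketch that stores only the non-zero entries in a dictionary and expands them to a dense vector on demand. The item IDs, the `item_index` mapping and the helper `to_dense` are hypothetical.

```python
def to_dense(sparse_prefs, item_index):
    """Expand a sparse {item: value} mapping into a dense list of length n."""
    dense = [0.0] * len(item_index)
    for item, value in sparse_prefs.items():
        dense[item_index[item]] = value
    return dense


# Hypothetical data model with n = 5 items; each item gets one dimension.
item_index = {101: 0, 102: 1, 103: 2, 104: 3, 105: 4}

# The user has expressed a preference for only two of the five items.
user_prefs = {101: 2.0, 104: 4.5}

print(to_dense(user_prefs, item_index))  # [2.0, 0.0, 0.0, 4.5, 0.0]
```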

10 Producing Recommendation

11 MapReduce
1. Input is assembled in the form of many key-value (K1,V1) pairs, typically as input files on an HDFS instance.
2. A map function is applied to each (K1,V1) pair, which results in zero or more key-value pairs of a different kind (K2,V2). (Mapping)
3. All V2 for each K2 are combined during the shuffle and sort phase.
4. A reduce function is called for each K2 and all its associated V2, which results in zero or more key-value pairs of yet a different kind (K3,V3), output back to HDFS. (Reducing)
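The four steps can be imitated in a single process to see how the key-value pairs flow. The sketch below is not Hadoop code: `run_mapreduce` is a hypothetical in-memory stand-in for the framework, and word counting is used only as a familiar example job.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map, shuffle/sort and reduce over (K1, V1) pairs in memory."""
    # Step 2: apply the map function to every (K1, V1) pair.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))

    # Step 3: shuffle and sort, grouping all V2 values under their K2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Step 4: call the reduce function once per K2 with all its V2 values.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def wc_map(offset, line):
    """(file position, line) -> (word, 1) for each word."""
    return [(word, 1) for word in line.split()]


def wc_reduce(word, counts):
    """(word, [1, 1, ...]) -> (word, total)."""
    return [(word, sum(counts))]


# Step 1: input as (K1, V1) pairs; here K1 is a byte offset, V1 a line of text.
result = run_mapreduce([(0, "a b a"), (6, "b c")], wc_map, wc_reduce)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```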

12 Translating to MapReduce: generating user vectors
1. Input files are treated as (Long,String) pairs by the framework, where the Long key is a position in the file and the String value is the line of the text file.
2. Each line is parsed into a user ID and several item IDs by a map function. The function emits new key-value pairs: a user ID mapped to item ID, for each item ID.
3. The framework collects all item IDs that were mapped to each user ID together.
4. A reduce function constructs a Vector from all item IDs for the user, and outputs the user ID mapped to the user’s preference vector.
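A plain-Python sketch of these four steps is shown below. The input line format `userID: itemID itemID ...` is an assumption (the slide does not give it), the vector is modelled as a dictionary with a value of 1.0 per item, and the function names are hypothetical. The grouping loop stands in for the work the framework does between map and reduce.

```python
from collections import defaultdict


def map_line(offset, line):
    """(file position, text line) -> one (user ID, item ID) pair per item."""
    user, _, items = line.partition(":")
    for item in items.split():
        yield int(user), int(item)


def reduce_user(user_id, item_ids):
    """(user ID, [item IDs]) -> (user ID, sparse preference vector)."""
    return user_id, {item: 1.0 for item in item_ids}


# Assumed input format: "userID: itemID itemID ..."
lines = ["1: 101 102 103", "2: 101 102", "3: 102 103"]

# Steps 1-3: feed (offset, line) pairs to the mapper and group by user ID.
grouped = defaultdict(list)
offset = 0
for line in lines:
    for user, item in map_line(offset, line):
        grouped[user].append(item)
    offset += len(line) + 1  # position of the next line in the file

# Step 4: one reduce call per user ID.
user_vectors = dict(reduce_user(u, items) for u, items in grouped.items())
print(user_vectors[2])  # {101: 1.0, 102: 1.0}
```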

13 Calculating co-occurrence  The next phase of the computation is another MapReduce that uses the output of the first MapReduce to compute co-occurrences.
1. Input is user IDs mapped to Vectors of user preferences, the output of the last MapReduce.
2. The map function determines all co-occurrences from one user’s preferences, and emits one pair of item IDs for each co-occurrence: item ID mapped to item ID. Both mappings, from one item ID to the other and vice versa, are recorded.

14
3. The framework collects, for each item, all co-occurrences mapped from that item.
4. The reducer counts, for each item ID, all co-occurrences that it receives and constructs a new Vector that represents all co-occurrences for one item with a count of the number of times they have co-occurred. These can be used as the rows (or columns) of the co-occurrence matrix.
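The co-occurrence job described in steps 1 to 4 might look like this when simulated in plain Python. The function names and sample user vectors are hypothetical, and the grouping loop again stands in for the framework's shuffle phase. An item is not paired with itself here, which is a simplifying assumption.

```python
from collections import Counter, defaultdict
from itertools import permutations


def map_user_vector(user_id, vector):
    """Emit (item ID, item ID) for every ordered pair in one user's vector."""
    # permutations yields both (a, b) and (b, a), so both mappings are recorded.
    for a, b in permutations(sorted(vector), 2):
        yield a, b


def reduce_item(item_id, co_items):
    """Count the co-occurring items to build one row of the matrix."""
    return item_id, dict(Counter(co_items))


# Step 1: user ID -> preference vector (hypothetical output of the previous job).
user_vectors = {
    1: {101: 1.0, 102: 1.0, 103: 1.0},
    2: {101: 1.0, 102: 1.0},
    3: {102: 1.0, 103: 1.0},
}

# Steps 2-3: map each user vector, then group emitted item IDs by key.
grouped = defaultdict(list)
for user_id, vector in user_vectors.items():
    for a, b in map_user_vector(user_id, vector):
        grouped[a].append(b)

# Step 4: one reduce call per item ID produces that item's row.
matrix = dict(reduce_item(item, others) for item, others in grouped.items())
print(matrix[101])  # {102: 2, 103: 1}
```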

15 Matrix Multiplication algorithm
Begin
  Assign R to be the zero vector
  for each column i in the co-occurrence matrix
    multiply column vector i by the ith element of the user vector
    add this vector to R
  end for
End
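A direct Python rendering of this column-wise multiplication is sketched below, using nested lists for the matrix. The function name `multiply_by_columns` and the sample values are hypothetical. Skipping columns whose user-vector element is zero is an added shortcut, valid because such columns contribute nothing to R.

```python
def multiply_by_columns(cooccurrence, user_vector):
    """Compute R = C * u by accumulating scaled columns of C."""
    n = len(user_vector)
    r = [0.0] * n                       # R starts as the zero vector
    for i in range(n):                  # for each column i of the matrix
        weight = user_vector[i]         # the ith element of the user vector
        if weight == 0:
            continue                    # a zero weight adds nothing to R
        for row in range(n):
            r[row] += cooccurrence[row][i] * weight
    return r


# Hypothetical 3-item co-occurrence matrix (symmetric, diagonal left at 0).
C = [
    [0, 2, 1],
    [2, 0, 2],
    [1, 2, 0],
]
u = [1.0, 1.0, 0.0]                     # the user has preferences for items 0 and 1

print(multiply_by_columns(C, u))        # [2.0, 2.0, 3.0]
```

In this example the third item, the only one the user has no preference for, receives the highest score and would be the one recommended.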




19 System configuration (Implemented on VMware)
Memory: 1 GB
Hard disk: 8 GB
Processors: 1
OS: Ubuntu 10.10 (32-bit)

20 Results
No. of preferences   Sequential (ms)   Parallel (ms)
20                   44                31063
40                   77                34152
100                  130               40012
200                  300               57421

21 Sequential

22 Conclusion  The overhead of initializing the cluster, distributing the data and executable code, and marshalling the results is nontrivial.  The parallel version is therefore expected to give better results when it is used on large data sets with multiple machines, in a cluster or in the cloud.

