Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA (2012239) VAIBHAV JAISWAL (2012249)


Problem  The problem – To predict the opinion the user will have on the different items and be able to recommend the “best” items to each user.

Recommender System  Apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services, usually during a live interaction.

Pseudo-Distributed Cluster  We use Hadoop to split the set of users across n machines, copy the input data to each, and then run one Recommender on each machine to process recommendations for a subset of users.

Algorithm used

Item Based (Sequential)
Input: User preferences for items
Begin
  for every item i that u has no preference for yet
    for every item j that u has a preference for
      compute a similarity s between i and j
      add u's preference for j, weighted by s, to a running average
    end for
  end for
  return the top items, ranked by weighted average
End
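The sequential loop above can be sketched in Python. The `similarity` function and the preference data here are hypothetical placeholders, not part of the original deck; a real system would plug in a measure such as cosine or log-likelihood similarity.

```python
# Sequential item-based recommendation: a minimal sketch of the pseudocode above.
# `prefs` maps item -> preference value for one user u; `similarity(i, j)` is
# a hypothetical item-item similarity function supplied by the caller.

def recommend(prefs, all_items, similarity, top_n=5):
    scores = {}
    for i in all_items:
        if i in prefs:                 # skip items u already has a preference for
            continue
        total, sim_sum = 0.0, 0.0
        for j, pref_j in prefs.items():
            s = similarity(i, j)
            total += s * pref_j        # u's preference for j, weighted by s
            sim_sum += s
        if sim_sum > 0:
            scores[i] = total / sim_sum  # running weighted average
    # top items, ranked by weighted average
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

With a toy similarity table, `recommend({'a': 5.0, 'b': 1.0}, ['a', 'b', 'c', 'd'], sim)` ranks the unseen items `c` and `d` by how strongly they resemble the highly rated item `a`.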

Co-occurrence matrix (Parallel)  It computes the number of times each pair of items occurs together in some user’s list of preferences.  The more often two items turn up together, the more related or similar they probably are.  Note that the entries in the matrix aren’t affected by preference values.

Co-occurrence Matrix

Computing user vectors  In a data model with n items, user preferences are like a vector over n dimensions, with one dimension for each item. The user’s preference values for items are the values in the vector.  Items that the user expresses no preference for map to a 0 value in the vector. Such a vector is typically quite sparse, and mostly zeroes, because users typically express a preference for only a small subset of all items.
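A minimal sketch of the sparse representation described above; the item IDs and preference values are illustrative, not from the deck:

```python
# A user's preferences as a sparse vector over n item dimensions.
# Only rated items are stored; every other dimension is implicitly 0.

def user_vector(ratings, n_items):
    """ratings: dict of item_id -> preference value."""
    sparse = dict(ratings)                                   # sparse form
    dense = [sparse.get(i, 0.0) for i in range(n_items)]     # dense view, mostly zeroes
    return sparse, dense

sparse, dense = user_vector({0: 4.0, 3: 2.5}, n_items=5)
# dense is [4.0, 0.0, 0.0, 2.5, 0.0]: two preferences, three implicit zeroes
```

Storing only the nonzero entries is what makes the later matrix multiplication cheap: columns for unrated items never need to be touched.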

Producing Recommendation

MapReduce
1. Input is assembled in the form of many key-value (K1,V1) pairs, typically as input files on an HDFS instance.
2. A map function is applied to each (K1,V1) pair, which results in zero or more key-value pairs of a different kind (K2,V2). (Mapping)
3. All V2 for each K2 are combined during the shuffle and sort phase.
4. A reduce function is called for each K2 and all its associated V2, which results in zero or more key-value pairs of yet a different kind (K3,V3), output back to HDFS. (Reducing)

Translating to MapReduce: generating user vectors
1. Input files are treated as (Long,String) pairs by the framework, where the Long key is a position in the file and the String value is the line of the text file.
2. Each line is parsed into a user ID and several item IDs by a map function. The function emits new key-value pairs: a user ID mapped to an item ID, for each item ID.
3. The framework collects all item IDs that were mapped to each user ID together.
4. A reduce function constructs a Vector from all item IDs for the user, and outputs the user ID mapped to the user’s preference vector.
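The four steps above can be simulated with plain Python functions. The comma-separated input format and the use of a set as the preference vector are assumptions for illustration; in Hadoop the shuffle step is done by the framework, not user code.

```python
from collections import defaultdict

# Map: parse a "userID,itemID1,itemID2,..." line (a hypothetical input format)
# into (userID, itemID) pairs. `offset` stands in for the Long file position key.
def map_line(offset, line):
    user, *items = line.strip().split(',')
    for item in items:
        yield user, item

# Shuffle: group all item IDs by user ID (the Hadoop framework does this).
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce: build the user's preference vector (here, simply a set of item IDs).
def reduce_user(user, items):
    return user, set(items)

lines = ["u1,i1,i2", "u2,i2,i3"]
pairs = [p for off, ln in enumerate(lines) for p in map_line(off, ln)]
vectors = dict(reduce_user(u, its) for u, its in shuffle(pairs).items())
# vectors maps each user ID to the set of items that user prefers
```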

Calculating co-occurrence  The next phase of the computation is another MapReduce that uses the output of the first MapReduce to compute co-occurrences.
1. Input is user IDs mapped to Vectors of user preferences—the output of the last MapReduce.
2. The map function determines all co-occurrences from one user’s preferences, and emits one pair of item IDs for each co-occurrence—item ID mapped to item ID. Both mappings, from one item ID to the other and vice versa, are recorded.

3. The framework collects, for each item, all co-occurrences mapped from that item.
4. The reducer counts, for each item ID, all co-occurrences that it receives and constructs a new Vector that represents all co-occurrences for one item with a count of the number of times they have co-occurred. These can be used as the rows—or columns—of the co-occurrence matrix.
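The co-occurrence phase can be sketched the same way. The toy user vectors below are illustrative; in the real pipeline they come from the previous MapReduce.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Map: from one user's preferred items, emit both (i, j) and (j, i) for
# every co-occurring pair, as described in step 2 above.
def map_cooccur(user, items):
    for i, j in permutations(items, 2):
        yield i, j

# Reduce: count, per item, how often each other item co-occurred with it.
# The Counter is one row (or column) of the co-occurrence matrix.
def reduce_cooccur(item, others):
    return item, Counter(others)

vectors = {'u1': ['i1', 'i2'], 'u2': ['i1', 'i2', 'i3']}
grouped = defaultdict(list)
for user, items in vectors.items():
    for i, j in map_cooccur(user, items):
        grouped[i].append(j)
rows = dict(reduce_cooccur(i, js) for i, js in grouped.items())
# rows['i1'] counts: i2 co-occurred with i1 twice, i3 once
```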

Matrix Multiplication algorithm
Begin
  Assign R to be the zero vector
  for each column i in the co-occurrence matrix
    multiply column vector i by the ith element of the user vector
    add this vector to R
  end for
End
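A sketch of the column-by-column multiplication above, storing columns sparsely as dicts (the sparse-dict layout and the sample values are assumptions for illustration):

```python
# Multiply the co-occurrence matrix by the user vector, column by column.
# `columns` maps item_id -> {row_item: count}; `user_vec` maps item_id -> preference.
# Only columns for items the user actually rated contribute to R.

def cooccurrence_times_user(columns, user_vec):
    R = {}                                   # R starts as the zero vector
    for i, pref in user_vec.items():         # ith element of the user vector
        for row_item, count in columns.get(i, {}).items():
            # multiply column i by pref, and add the result into R
            R[row_item] = R.get(row_item, 0.0) + count * pref
    return R

cols = {'i1': {'i2': 2, 'i3': 1}, 'i2': {'i1': 2}}
R = cooccurrence_times_user(cols, {'i1': 3.0, 'i2': 1.0})
# R accumulates column i1 scaled by 3.0 plus column i2 scaled by 1.0
```

The entries of R are the recommendation scores; ranking the items the user has not yet rated by their value in R yields the final recommendations.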

System configuration (Implemented on VMware)
 Memory - 1 GB
 Hard disk - 8 GB
 Processors - 1
 OS - Ubuntu 10.10 (32-bit)

Results

No. of preferences | Sequential (ms) | Parallel (ms)

Sequential

Conclusion  The overhead of initializing the cluster, distributing the data and executable code, and marshalling the results is nontrivial.  So the results will be better if it used for computing on large data with multiple machines in cluster or on cloud.