HPML Conference, Lyon, Sept 2018


Performance Comparison of a Parallel Recommender Algorithm across three Hadoop-based Frameworks
Christina P. A. Diedhiou
Supervisor: Dr. Bryan Carpenter
School of Computing

Objectives of the Session
- Aim and objectives of the research
- Overview of Hadoop / MPJ Express
- Overview of recommender systems
- Implementation of ALSWR
- Evaluation and comparison
- Future work

Aims & Objectives of the Research
- Evaluate highly scalable parallel frameworks and algorithms for recommender systems.
- Main interest: a Java Message Passing Interface, MPJ Express, integrated with Hadoop.
- Evaluate MPJ Express, an open-source Java message passing library for parallel computing.
- Use MPJ Express to implement collaborative filtering on large datasets with the ALSWR algorithm.
- Benchmark the performance and measure the parallel speedup.
- Compare our results with other frameworks: Mahout, Spark, Giraph.

Hadoop
- Hadoop: a framework that stores and processes voluminous amounts of data in a reliable, fault-tolerant manner.
- Hadoop 2, released in 2014, introduced YARN:
  - Resource Manager: manages and allocates resources across the cluster.
  - Node Manager: runs on every node and reports to the Resource Manager.
  - Application Master: specific to each job; manages the operations running within containers and ensures enough containers are available.

MPJ Express
- An open-source, MPI-like Java library that lets application developers write and execute parallel applications on multicore processors and compute clusters.
- Since 2015, MPJ Express provides a YARN-based runtime.

Integration of MPJ Express in YARN
mpjrun.sh -yarn -np 2 -dev niodev MPJApp.jar
1) Submit the YARN application.
2) Request container allocation for the Application Master (AM).
3) The AM generates a Container Launch Context (CLC) and allocates a container to each node.
4) Each mpj-yarn-wrapper sends the output and error streams of the program to the MPJYarnClient.

Recommender Systems
- What is a recommender system? Software tools and techniques that provide suggestions to users on items they might want or like.
- Examples of recommender systems: Netflix, Google News, YouTube, Amazon…
- Main families of recommender systems: demographic, collaborative filtering, content-based.

Collaborative Filtering
- Based on users' purchase or decision histories.
- Rationale: two individuals who share the same opinion on one item are likely to have similar tastes on another item.

Alternating Least Squares with Lambda Regularization (ALSWR)
ALSWR is an iterative algorithm: it alternates between fixing one of two matrices and solving for the other, until convergence is reached.
- Step 1: Initialize the item matrix M in a pseudorandom way.
- Step 2: Fix M; solve for the user matrix U by minimizing the objective function (the sum of squared errors).
- Step 3: Fix U; solve for M by minimizing the objective function in the same way.
- Steps 2 and 3 are repeated until a stopping criterion is satisfied.
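To make Steps 1–3 concrete, here is a minimal rank-1 sketch of the ALS-WR update loop. It is illustrative only: the real implementation uses k-dimensional feature vectors and solves a small regularized linear system per user and per item, whereas with a single feature each update collapses to a scalar division. All names and sizes are hypothetical.

```java
// Rank-1 ALS with weighted-lambda regularization (illustrative sketch).
public class AlsRank1 {
    // R[i][j] = rating of item j by user i; 0 means "not rated".
    // Returns { u, v }: user factors and item factors.
    public static double[][] factor(double[][] R, double lambda, int iters) {
        int m = R.length, n = R[0].length;
        double[] u = new double[m];
        double[] v = new double[n];
        java.util.Arrays.fill(v, 1.0);          // Step 1: initialize item factors
        for (int it = 0; it < iters; it++) {
            // Step 2: fix v, solve each user factor by minimizing the
            // regularized squared error over that user's observed ratings.
            for (int i = 0; i < m; i++) {
                double num = 0, den = 0; int cnt = 0;
                for (int j = 0; j < n; j++) {
                    if (R[i][j] != 0) { num += R[i][j] * v[j]; den += v[j] * v[j]; cnt++; }
                }
                if (cnt > 0) u[i] = num / (den + lambda * cnt); // weighted regularization
            }
            // Step 3: fix u, solve each item factor symmetrically.
            for (int j = 0; j < n; j++) {
                double num = 0, den = 0; int cnt = 0;
                for (int i = 0; i < m; i++) {
                    if (R[i][j] != 0) { num += R[i][j] * u[i]; den += u[i] * u[i]; cnt++; }
                }
                if (cnt > 0) v[j] = num / (den + lambda * cnt);
            }
        }
        return new double[][] { u, v };
    }
}
```

After convergence, the predicted rating is simply `u[i] * v[j]`; with small lambda the reconstruction approaches the observed ratings.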

Implementation of ALSWR
- Step 1 uses the locally held rating matrix R, decomposed by rows and columns (users and items).
- Step 2 updates the items (movies); between Steps 1 and 2, all locally computed user elements are gathered and broadcast.
- Step 3 updates the users; between Steps 2 and 3, all locally computed item elements must be gathered together and broadcast to the processing nodes.
- Communication between the nodes of the cluster is established by collective communication.
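The gather-and-broadcast step above can be sketched with an MPJ Express collective (MPJ Express follows the mpiJava-style API). Everything here is illustrative: `localRange`, `nUsers`, and the class name are invented for the sketch, and equal per-process block sizes are assumed; unequal blocks would need `Allgatherv` instead.

```java
import mpi.MPI;

public class GatherUsers {
    // Contiguous block of user rows owned by process `rank` out of `size`.
    // Returns { start, count }.
    public static int[] localRange(int rank, int size, int n) {
        int base = n / size, rem = n % size;
        int start = rank * base + Math.min(rank, rem);
        int count = base + (rank < rem ? 1 : 0);
        return new int[] { start, count };
    }

    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int nUsers = 16;                       // illustrative; assumes size divides nUsers
        int[] range = localRange(rank, size, nUsers);
        double[] local = new double[range[1]]; // user factors updated on this node
        double[] all = new double[nUsers];     // full user-factor vector, on every node

        // ... Step 2: compute this node's block of user updates into `local` ...

        // Gather every node's block and deliver the combined vector to all processes.
        MPI.COMM_WORLD.Allgather(local, 0, range[1], MPI.DOUBLE,
                                 all, 0, range[1], MPI.DOUBLE);
        MPI.Finalize();
    }
}
```

Run under the YARN runtime with, e.g., `mpjrun.sh -yarn -np 4 -dev niodev GatherUsers.jar`.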

Collective Communication
- Allgather: gathers elements from all processes and distributes the combined result to every process.
- Allreduce: sums values over all processes, then distributes the result to all processes.
- Barrier: synchronisation; prevents any process from going beyond the barrier until all processes have reached it.

Experiments
Data collected:
- MovieLens user ratings: 20+ million ratings
- Yahoo! Music user ratings: 717+ million ratings
Hardware:
- Linux cluster with 4 nodes, 16+ cores
Software used:
- MPJ Express, Java, Hadoop 2
Algorithm:
- Alternating Least Squares with Weighted Regularization (ALSWR)
Data storage:
- Hadoop Distributed File System (HDFS)
Method:
- Configure the nodes with Hadoop and YARN
- Add the dataset to HDFS
- Partition the dataset with MapReduce or MPJ code
- Run the ALSWR Java code
Experiments:
- Sequential (1 process) vs. parallel speedup (many processes)
- Comparison with Spark, Mahout, Giraph

Results: MovieLens Data
- Good parallel speedup for MPJ Express and Spark: time decreases as the number of cores increases.
- No variation for Mahout from 4 cores upwards.
- MPJ Express is on average 13.9 times faster than Mahout.
- MPJ Express is on average 1.4 times faster than Spark.

Results: MovieLens Data
- A closer look at MPJ Express and Spark performance and parallel speedup.
- Steady improvement as cores are added, but there is still room for better parallel computation.

Results: Yahoo! Music Data

Time in minutes:

Processes   MPJ Express   Spark
        1       298        417
        2       142        217
        4        84.4      136
        8        45.56      65
       12        33.15      54
       16        28.35      55

- No computation for Mahout, due to the size of the data.
- Better parallel speedup achieved for MPJ Express and Spark.
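Parallel speedup here is the sequential time divided by the parallel time, T1/Tp, and parallel efficiency is speedup divided by the process count. A small helper to check the reported scaling, with timings taken from the table above (class and method names are ours):

```java
// Speedup and efficiency from the Yahoo! Music timings.
public class Speedup {
    // Classic definitions: speedup = T1 / Tp, efficiency = speedup / p.
    public static double speedup(double t1, double tp) { return t1 / tp; }
    public static double efficiency(double t1, double tp, int p) {
        return speedup(t1, tp) / p;
    }
    public static void main(String[] args) {
        // MPJ Express: 298 min on 1 process, 28.35 min on 16 processes.
        System.out.printf("speedup=%.1f, efficiency=%.2f%n",
                speedup(298, 28.35), efficiency(298, 28.35, 16));
    }
}
```

For MPJ Express on 16 processes this gives a speedup of about 10.5, i.e. roughly 66% parallel efficiency.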

Future Work
- Further study of the Spark results
- More experiments with Giraph
- Assess and compare accuracy (RMSE)
- Generate synthetic datasets to reach "social media scale"

Further Readings / References
- Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel collaborative filtering for the Netflix Prize. Lecture Notes in Computer Science, 5034, 337–348. http://doi.org/10.1007/978-3-540-68880-8_32
- Kabiljo, M., & Ilic, A. (2015). Recommending items to more than a billion people. Retrieved from https://www.reddit.com/r/MachineLearning/comments/38d7xu/recommending_items_to_more_than_a_billion_people/
- Aamir Shafi (2014). MPJ Express: an implementation of the Message Passing Interface (MPI) in Java. http://www.powershow.com/view1/154baa-ZDc1Z/MPJ_Express_An_Implementation_of_Message_Passing_Interface_MPI_in_Java_powerpoint_ppt_presentation
- http://mpj-express.org/
- https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Purpose

Contacts
Christina P. A. Diedhiou: christina.diedhiou@port.ac.uk
Dr. Bryan Carpenter: bryan.carpenter@port.ac.uk