Factorbird: a Parameter Server Approach to Distributed Matrix Factorization Sebastian Schelter, Venu Satuluri, Reza Zadeh Distributed Machine Learning and Matrix Computations workshop in conjunction with NIPS 2014

Latent Factor Models
Given a sparse n x m matrix M, return factor matrices U and V of rank k.
Applications:
– Dimensionality reduction
– Recommendation
– Inference
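In symbols, using the common convention that U and V hold one k-dimensional factor per row (the paper's own notation may transpose these), the approximation is:

\[
M \;\approx\; U V^\top, \qquad U \in \mathbb{R}^{n \times k},\; V \in \mathbb{R}^{m \times k},\; k \ll \min(n, m)
\]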

Seem familiar? SVD! So why not just use SVD?

Problems with SVD (Feb 24, 2015 edition)

Revamped loss function
– g: global bias term
– b^U_i: user-specific bias term for user i
– b^V_j: item-specific bias term for item j
– prediction function: p(i, j) = g + b^U_i + b^V_j + u_i^T v_j
– a(i, j): the observed value, analogous to SVD's m_ij (the ground truth)
New loss function:
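A sketch of the kind of weighted, L2-regularized squared-error objective described here (illustrative, not the paper's verbatim equation; w(i, j) are the per-entry weights and λ the regularization constant discussed later in the deck):

\[
\min_{U,\,V,\,b,\,g} \;\; \sum_{(i,j)\ \text{observed}} w(i,j)\,\bigl(p(i,j) - a(i,j)\bigr)^2 \;+\; \lambda \left( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \right)
\]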

Algorithm
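As a sketch of what SGD on the biased model above does for one observed entry (i, j) (per-entry weights omitted; η is the learning rate, λ the regularization constant; not the paper's exact listing):

\[
\begin{aligned}
e &= p(i,j) - a(i,j) \\
g &\leftarrow g - \eta\, e \\
b^U_i &\leftarrow b^U_i - \eta\,(e + \lambda\, b^U_i) \\
b^V_j &\leftarrow b^V_j - \eta\,(e + \lambda\, b^V_j) \\
u_i &\leftarrow u_i - \eta\,(e\, v_j + \lambda\, u_i) \\
v_j &\leftarrow v_j - \eta\,(e\, u_i + \lambda\, v_j)
\end{aligned}
\]

The updates to u_i and v_j should both be computed from the values read before either is overwritten.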

Problems
1. The resulting U and V, for graphs with millions of vertices, still amount to hundreds of gigabytes of floating-point values.
2. SGD is inherently sequential; either locking or multiple passes over the data are required to synchronize concurrent updates.

Problem 1: size of parameters
Solution: Parameter Server architecture

Problem 2: simultaneous writes
Solution: …so what?

Lock-free concurrent updates? Assumptions:
1. f is Lipschitz continuously differentiable
2. f is strongly convex
3. Ω (the maximum hyperedge size of the conflict hypergraph) is small
4. Δ (the maximum fraction of edges that intersect any single variable) is small
5. ρ (the sparsity of the hypergraph, i.e. the fraction of edges that intersect a given edge) is small

Hogwild! Lock-free updates
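Hogwild! (Niu et al.) shows that under the assumptions above, SGD can update shared parameters from many threads without any locking. A minimal Scala sketch of the idea (not Factorbird's code; bias terms and per-entry weights omitted, all names illustrative):

    import java.util.concurrent.ThreadLocalRandom

    object HogwildSketch {
      // One observed cell a(i, j) of the sparse matrix.
      case class Entry(i: Int, j: Int, value: Float)

      // Runs `threads` workers that all update the shared factor arrays u and v
      // in place, without any synchronization, in the spirit of Hogwild!.
      def train(entries: Array[Entry], n: Int, m: Int, k: Int,
                eta: Float, lambda: Float, steps: Int,
                threads: Int): (Array[Array[Float]], Array[Array[Float]]) = {
        val u = Array.fill(n, k)(ThreadLocalRandom.current().nextFloat() * 0.1f)
        val v = Array.fill(m, k)(ThreadLocalRandom.current().nextFloat() * 0.1f)

        val workers = (0 until threads).map { _ =>
          new Thread(() => {
            val rnd = ThreadLocalRandom.current()
            var s = 0
            while (s < steps) {
              val entry = entries(rnd.nextInt(entries.length))
              val ui = u(entry.i)
              val vj = v(entry.j)
              // prediction error for this cell
              var dot = 0f
              var f = 0
              while (f < k) { dot += ui(f) * vj(f); f += 1 }
              val err = dot - entry.value
              // in-place SGD step on the shared rows; no locks are taken,
              // collisions stay rare because the observations are sparse
              f = 0
              while (f < k) {
                val uf = ui(f); val vf = vj(f)
                ui(f) = uf - eta * (err * vf + lambda * uf)
                vj(f) = vf - eta * (err * uf + lambda * vf)
                f += 1
              }
              s += 1
            }
          })
        }
        workers.foreach(_.start())
        workers.foreach(_.join())
        (u, v)
      }
    }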

Factorbird Architecture

Parameter server architecture Open source! –

Factorbird Machinery
– memcached: distributed memory object caching system
– finagle: Twitter’s RPC system
– HDFS: persistent filestore for the data
– Scalding: Scala front-end for Hadoop MapReduce jobs
– Mesos: resource manager for the learner machines
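To make the division of labor concrete, here is a hypothetical sketch of a learner's interaction with a memcached-like key-value parameter store. The ParamStore trait and the key scheme are invented for illustration; they are not Factorbird's or finagle-memcached's actual API, and the paper's placement of U versus V between learners and servers may differ.

    object LearnerSketch {
      // Hypothetical key-value parameter store interface (memcached-like).
      trait ParamStore {
        def get(key: String): Array[Float]
        def put(key: String, value: Array[Float]): Unit
      }

      // One SGD step of a learner: fetch the factor vectors behind entry (i, j),
      // update them locally, and write them back to the shared store.
      def sgdStep(store: ParamStore, i: Int, j: Int, a: Float,
                  eta: Float, lambda: Float): Unit = {
        val ui = store.get(s"u/$i")
        val vj = store.get(s"v/$j")
        val err = ui.zip(vj).map { case (x, y) => x * y }.sum - a
        val uiNew = ui.zip(vj).map { case (uf, vf) => uf - eta * (err * vf + lambda * uf) }
        val vjNew = vj.zip(ui).map { case (vf, uf) => vf - eta * (err * uf + lambda * vf) }
        store.put(s"u/$i", uiNew)
        store.put(s"v/$j", vjNew)
      }
    }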

Factorbird stubs

Model assessment
Matrix factorization is evaluated using RMSE (root-mean-square error).
SGD performance is often a function of the hyperparameters:
– λ: regularization
– η: learning rate
– k: number of latent factors
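For reference, RMSE over a held-out set T of observed entries is:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(i,j) \in T} \bigl(p(i,j) - a(i,j)\bigr)^2}
\]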

[Hyper]parameter grid search
aka “parameter scans”: finding the optimal combination of hyperparameters – Parallelize! (see the sketch below)
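Each combination in the grid is an independent training run, so the scan parallelizes trivially. A hypothetical sketch (trainAndScore stands in for one Factorbird run and is not a real function from the paper or its code):

    object GridSearchSketch {
      // Hypothetical stand-in for one Factorbird run: train a model with the
      // given hyperparameters on the training split and return validation RMSE.
      def trainAndScore(lambda: Double, eta: Double, k: Int): Double =
        scala.util.Random.nextDouble() // replace with a real training run

      def main(args: Array[String]): Unit = {
        val lambdas = Seq(0.001, 0.01, 0.1)
        val etas    = Seq(0.0001, 0.001, 0.01)
        val ks      = Seq(2, 5, 10)

        // Every (lambda, eta, k) combination is independent of the others,
        // so these runs can be launched concurrently (e.g. one per machine).
        val grid = for (l <- lambdas; e <- etas; k <- ks) yield (l, e, k)
        val results = grid.map { case (l, e, k) => ((l, e, k), trainAndScore(l, e, k)) }
        val ((bestL, bestE, bestK), bestRmse) = results.minBy(_._2)
        println(s"best: lambda=$bestL eta=$bestE k=$bestK rmse=$bestRmse")
      }
    }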

Experiments
“RealGraph” – not a dataset, but a framework for creating a graph of user-user interactions on Twitter.
Kamath, Krishna, et al. “RealGraph: User Interaction Prediction at Twitter.” User Engagement Optimization Workshop at KDD, 2014.

Experiments
Data: binarized adjacency matrix of a subset of the Twitter follower graph
– a(i, j) = 1 if user i interacted with user j, 0 otherwise
– all prediction errors weighted equally (w(i, j) = 1)
– 100 million interactions, 440,000 [popular] users

Experiments 80% training, 10% validation, 10% testing

Experiments k = 2 Homophily

Experiments: Scalability of Factorbird
– large RealGraph subset: 229M x 195M matrix (44.6 quadrillion cells), 38.5 billion non-zero entries
– single SGD pass through the training set: ~2.5 hours
– ~40 billion parameters

Important to note As with most (if not all) distributed platforms:

Future work
– Support streaming (user follows)
– Simultaneous factorization
– Fault tolerance
– Reduce network traffic
– s/memcached/custom application/g
– Load balancing

Strengths
– Excellent extension of prior work (Hogwild!, RealGraph)
– Current and [mostly] open technology (Hadoop, Scalding, Mesos, memcached)
– Clear problem, clear solution, clear validation

Weaknesses
Lack of detail, lack of detail, lack of detail:
– How does the number of machines affect runtime?
– What were the performance metrics on the large RealGraph subset?
– What were the properties of the dataset (when was it collected, how were edges determined, what does “popular” mean, etc.)?
– How did other factorization methods perform by comparison?

Questions?

Assignment 1
Code: 65 pts
– 20: NBTrain (counting)
– 20: message passing and sorting
– 20: NBTest (scanning model, accuracy)
– 5: How to run
Q1-Q5: 35 pts

Assignment 2
Code: 70 pts
– 20: MR for counting words
– 15: MR for counting labels
– 20: MR for joining model + test data
– 15: MR for classification
– 5: How to run
Q1-Q2: 30 pts

Assignment 3
Code: 50 pts
– 10: Compute TF
– 10: Compute IDF
– 25: K-means iterations
– 5: How to run
Q1-Q4: 50 pts