Graph Processing
COS 518: Advanced Computer Systems, Lecture 12
Mike Freedman
[Content adapted from K. Jamieson and J. Gonzalez]

Graphs are Everywhere
Examples: Netflix collaborative filtering (users and movies), social networks, Wiki text analysis (docs and words), probabilistic analysis (random variables).
This idea has lots of applications. We might be inferring interests about people in a social network. Collaborative filtering tries to predict each user's movie ratings given a handful of ratings from all users; it is naturally a bipartite graph with users on one side and movies on the other, which is equivalent to a sparse matrix. Text analysis covers tasks like detecting user sentiment in product reviews on a site like Amazon. Probabilistic graphical models relate random variables to one another.
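A tiny illustration (made-up data, not from the lecture) of the point that the users-movies bipartite graph is just a sparse ratings matrix: each observed rating is an edge, and collaborative filtering predicts the missing entries.

```python
# Illustrative data only: the users-movies bipartite graph as a sparse matrix,
# stored as a dict keyed by (user, movie) with observed ratings as values.
ratings = {
    ("alice", "The Matrix"): 5,
    ("alice", "Up"): 3,
    ("bob", "The Matrix"): 4,
    ("carol", "Up"): 2,
}

def rated_by(ratings, user):
    """The user's edges in the bipartite graph: movies they have rated."""
    return {movie: r for (u, movie), r in ratings.items() if u == user}

print(rated_by(ratings, "alice"))  # {'The Matrix': 5, 'Up': 3}
```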

Concrete Examples
Label propagation and PageRank. For a concrete example, let's first look at label propagation in a social network graph.

Label Propagation Algorithm
Social arithmetic: what I like = 50% × what I list on my profile (50% cameras, 50% biking) + 40% × what Sue Ann likes (80% cameras, 20% biking) + 10% × what Carlos likes (30% cameras, 70% biking) = 60% cameras, 40% biking.
Recurrence algorithm: Likes[i] = Σ_{j ∈ Friends[i]} W[i,j] × Likes[j]; iterate until convergence.
Parallelism: compute all Likes[i] in parallel.
Suppose we want to infer someone's hobbies. We can do "social arithmetic" on a social-network graph: once interests are attached to the graph, the probability that someone likes a hobby is the weighted sum of the probabilities that their friends like that hobby, where the weight is some measure of closeness. That update is just one step applied to the entire graph, so we iterate it over all nodes, and we want to perform the computation on all nodes in parallel to speed it up.
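A minimal sketch of this recurrence on a small in-memory graph (not code from the lecture; `weights`, `likes`, and the function names are illustrative). Each sweep recomputes every user's interest vector from their neighbors' vectors, which is exactly the step a graph-parallel framework would run on all vertices at once.

```python
# A minimal sketch of the Likes[i] recurrence, assuming weights[i][j] is the
# closeness of neighbor j to user i (including i itself for the "my profile"
# term), likes[i] maps each hobby to a probability, and weights has an entry
# for every user. Illustrative names only.
def propagate_once(weights, likes):
    new_likes = {}
    for i, nbrs in weights.items():
        scores = {}
        for j, w in nbrs.items():
            for hobby, p in likes[j].items():
                scores[hobby] = scores.get(hobby, 0.0) + w * p
        new_likes[i] = scores
    return new_likes

def propagate(weights, likes, tol=1e-3, max_iters=100):
    # Iterate until convergence; in a graph-parallel system every user's
    # update inside propagate_once would run in parallel.
    for _ in range(max_iters):
        new_likes = propagate_once(weights, likes)
        delta = max(abs(p - likes[i].get(h, 0.0))
                    for i, hs in new_likes.items() for h, p in hs.items())
        likes = new_likes
        if delta < tol:
            break
    return likes

# One step reproduces the slide's social-arithmetic example:
weights = {"me": {"me": 0.5, "sue_ann": 0.4, "carlos": 0.1}}
likes = {"me": {"cameras": 0.5, "biking": 0.5},
         "sue_ann": {"cameras": 0.8, "biking": 0.2},
         "carlos": {"cameras": 0.3, "biking": 0.7}}
print(propagate_once(weights, likes)["me"])  # ~{'cameras': 0.6, 'biking': 0.4}
```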

PageRank Algorithm
Recurrence algorithm: PR[u] = Σ_{v ∈ B_u} PR[v] / L[v]; iterate until convergence.
Parallelism: compute all PR[u] in parallel.
The PageRank value of a page u depends on the PageRank value of each page v in B_u (the set of all pages linking to u), divided by L[v], the number of links from page v.
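A minimal sketch of this recurrence (not code from the lecture; `incoming` and `out_degree` are illustrative names, and production PageRank also adds a damping factor that the slide's formula omits).

```python
# A minimal sketch of PR[u] = sum over v in B_u of PR[v] / L[v], assuming
# incoming[u] lists the pages linking to u (the set B_u) and out_degree[v]
# is L[v]. Real PageRank adds a damping factor, omitted to match the slide.
def pagerank(incoming, out_degree, tol=1e-6, max_iters=100):
    n = len(incoming)
    pr = {u: 1.0 / n for u in incoming}   # start from a uniform rank
    for _ in range(max_iters):
        # Every PR[u] below depends only on the previous iteration's values,
        # so all of them could be computed in parallel.
        new_pr = {u: sum(pr[v] / out_degree[v] for v in incoming[u])
                  for u in incoming}
        delta = max(abs(new_pr[u] - pr[u]) for u in incoming)
        pr = new_pr
        if delta < tol:                   # iterate until convergence
            break
    return pr
```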

Properties of Graph-Parallel Algorithms
Dependency graph, factored computation, iterative computation ("what I like" depends on "what my friends like").
Taking a step back, the big picture is that this is one example of many graph-parallel algorithms with the following properties. First, significant information lives in the dependencies between different pieces of data, and those dependencies are naturally represented as a graph. Second, as a result, computation factors into local neighborhoods and depends directly only on them. Third, because the graph has many chains of connections, the local computation must be iterated, propagating information across the graph.
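One common way graph-parallel frameworks expose these three properties is a per-vertex update function that reads only a vertex's neighborhood and is applied repeatedly over the whole graph. A minimal sketch of that pattern (illustrative and framework-agnostic, not any specific system's API, and assuming scalar vertex values):

```python
# A minimal sketch of the graph-parallel pattern: each vertex's new value is a
# function only of its old value and its neighbors' values (computation factored
# over the dependency graph), applied repeatedly (iterative computation) until
# values stop changing. Illustrative only.
def run_vertex_program(neighbors, values, update, tol=1e-6, max_iters=100):
    """neighbors[v]: list of (u, edge_weight); update(v, old, nbr_vals) -> new value."""
    for _ in range(max_iters):
        # Each call to update() reads only a local neighborhood, so a real
        # framework would schedule these vertex updates in parallel.
        new_values = {v: update(v, values[v],
                                [(values[u], w) for u, w in nbrs])
                      for v, nbrs in neighbors.items()}
        delta = max(abs(new_values[v] - values[v]) for v in neighbors)
        values = new_values
        if delta < tol:
            break
    return values
```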

Map-Reduce for Data-Parallel ML
MapReduce is excellent for large data-parallel tasks: feature extraction, algorithm tuning, basic data processing. But can it handle graph-parallel tasks such as belief propagation, label propagation, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, and Lasso?
That is what these graph-parallel algorithms are all about. So at this point you should be wondering: why not use MapReduce for graph-parallel algorithms?

Problem: Data Dependencies
MapReduce doesn't efficiently express data dependencies: the user must code substantial data transformations, and costly data replication is often required to make rows of data independent.
There are a number of reasons. First, MapReduce has no way of directly representing a dependency graph. To use MapReduce on these problems, the user has to code up substantial transformations so that each piece of data is independent of the others, which often involves replicating data and is costly in terms of storage.

Iterative Algorithms
MapReduce doesn't efficiently express iterative algorithms. [Diagram: data flows through CPUs 1-3 over several iterations, with a barrier after each iteration; a slow processor stalls everyone at the barrier.]
Many ML algorithms (like the social-network example before) iteratively refine their variables during operation. MapReduce provides no mechanism to directly encode iterative computation, so it is forced to wait for all the data from the preceding iteration before computing the next. If one processor is slow, which is the common case, the whole system runs slowly.

MapAbuse: Iterative MapReduce
Only a subset of the data needs computation on each iteration.
Often just a subset of the data needs to be computed on, and that subset changes on each iteration of the algorithm, so this setup using MapReduce does a lot of unnecessary processing.

MapAbuse: Iterative MapReduce
The system is not optimized for iteration: each iteration pays a startup penalty and a disk penalty.
Finally, there is a startup penalty associated with the storage and network delays of reading from and writing to disk on every iteration, so the system isn't optimized for iteration from a performance standpoint.
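To make the "MapAbuse" concrete, here is a minimal sketch (illustrative only, not a real MapReduce framework's API) of PageRank expressed as one full map/shuffle/reduce job per iteration. The comments mark where the per-iteration barrier and the disk round-trip criticized above would appear.

```python
from collections import defaultdict

# Illustrative sketch only: PageRank as one MapReduce job per iteration.
def map_phase(ranks, out_links):
    # Map: each page v emits (destination u, PR[v] / L[v]) for every out-link.
    for v, rank in ranks.items():
        for u in out_links[v]:
            yield u, rank / len(out_links[v])

def reduce_phase(pairs):
    # Reduce: sum the contributions arriving at each destination page.
    new_ranks = defaultdict(float)
    for u, contrib in pairs:
        new_ranks[u] += contrib
    return new_ranks

def pagerank_via_iterated_mapreduce(out_links, iters=10):
    ranks = {v: 1.0 / len(out_links) for v in out_links}
    for _ in range(iters):
        # Global barrier: reduce cannot start until every mapper has finished,
        # and the next iteration cannot start until every reducer has finished,
        # even for pages whose rank has barely changed.
        pairs = list(map_phase(ranks, out_links))
        new_ranks = reduce_phase(pairs)
        # In a real deployment the entire rank vector would be written to disk
        # here and re-read when the next job starts: the startup/disk penalty.
        ranks = {v: new_ranks.get(v, 0.0) for v in out_links}
    return ranks
```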

ML Tasks Beyond Data-Parallelism
Data-parallel (MapReduce): feature extraction, cross-validation, computing sufficient statistics.
Graph-parallel: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), graph analysis (PageRank, triangle counting).
There's a large universe of applications that fit nicely into graph-parallel algorithms, and GraphLab is designed to target these applications.

Next set of readings
Graph processing: why the relationships, sampling, and iteration commonly used in graph processing are not a good fit for MapReduce, and how to take a graph-centric processing perspective.
Machine learning: graph systems solve one type of ML algorithm; what other systems are needed, particularly given the heavy focus on iterative algorithms?

Today's readings
Streaming is about unbounded data sets, not particular execution engines.
Streaming is in fact a strict superset of batch, and the Lambda Architecture is destined for retirement.
Good streaming systems need correctness and tools for reasoning about time.
Understand the differences between event time and processing time, and the challenges they impose.