Is Your Graph Algorithm Eligible for Nondeterministic Execution? Zhiyuan Shao, Lin Hou, Yan Ai, Yu Zhang and Hai Jin. Services Computing Technology and System Lab, Cluster and Grid Computing Lab, Huazhong University of Science and Technology. ICPP'15

Outline Motivation System model Algorithm Convergence Evaluation Conclusion

Motivation
The "big data" era. Loosely coupled data: key-value pairs (Hadoop, Spark, many others). Tightly coupled data: graph data (Pregel, GraphLab, GraphChi, X-Stream, many others).
Graph computing. Execution model: synchronous model (BSP) or asynchronous model. Execution manner: deterministic executions or nondeterministic executions.

Motivation (Cont'd)
Deterministic execution: widely and extensively studied (architecture, OS, scheduling; the set/chromatic schedulers of GraphLab, DIG of Galois, external determinism of GraphChi). Pros: a deterministic execution path (always) leads to deterministic results. Cons: high overhead is introduced to order the tasks (consider a billion-node graph!).
Nondeterministic execution: poorly studied. Pros: high parallelism, high performance! Cons: data-races must (at least) be prevented, and the behavior is undocumented.

Motivation (Cont'd)
Example of the two execution manners (figure taken from the GraphLab paper). Problem: high overhead for defining the execution sequence! Question: what if all these tasks are executed nondeterministically? A1: obviously, the ordering overhead is avoided and parallelism is improved! A2: data-races on edges! But what if we eliminate the data-races?

Motivation (Cont'd)
Objective of this research: study the nondeterministic execution of graph algorithms. Wait... why study that? Graph algorithms are special cases of parallel computing: iterative computing that often obeys the associative law, a+(b+c) = (a+b)+c, and the idempotent law, f(f(x)) = f(x). Potential towards higher performance! Questions: will an algorithm converge under nondeterministic executions? Will the executions lead to deterministic results (i.e., be externally deterministic)?
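To make the two laws concrete, here is a tiny self-contained check (my illustration, not from the talk): a min-based update is both associative and idempotent, which is why re-applying or reordering such updates cannot change the fixed point a computation converges to.

```cpp
#include <algorithm>
#include <cassert>

int main() {
    int a = 5, b = 2, c = 9;
    // Associative law: grouping of updates does not matter.
    assert(std::min(a, std::min(b, c)) == std::min(std::min(a, b), c));
    // Idempotent law: applying the same update twice equals applying it once.
    int once  = std::min(a, b);
    int twice = std::min(std::min(a, b), b);
    assert(once == twice);
    return 0;
}
```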

Outline Motivation System model Algorithm Convergence Evaluation Conclusion

System model
Shared-memory computer: #processors >= 1, graphs loaded in memory, COTS components (nothing special for the hardware or OS).
Synchronous implementation of the asynchronous model: computing is organized as multiple iterations, barriers are enforced between two consecutive iterations, and updates are applied "immediately". Examples: GraphChi, GRACE.
Vertex-centric computing: "Think Like A Vertex"; data-dependences happen on edges. A minimal sketch of this model appears below.
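The following is a minimal sketch of this system model, with illustrative names rather than GraphChi's actual API: a parallel loop over the vertices forms one iteration, the implicit barrier at its end separates consecutive iterations, and updates are written in place so they become visible "immediately" within an iteration.

```cpp
#include <omp.h>
#include <vector>

// Illustrative data layout: per-vertex data D_v plus adjacency lists.
struct Graph {
    int num_vertices;
    std::vector<std::vector<int>> in_neighbors;
    std::vector<double> data;  // D_v for each vertex v
};

template <typename UpdateFn>
void run(Graph& g, int iterations, UpdateFn update) {
    for (int iter = 0; iter < iterations; ++iter) {
        // Tasks within one iteration run in a nondeterministic order.
        #pragma omp parallel for schedule(dynamic)
        for (int v = 0; v < g.num_vertices; ++v)
            update(g, v);  // writes g.data[v] in place, no double buffer
        // The implicit barrier at the end of the parallel region is the
        // barrier the model enforces between consecutive iterations.
    }
}
```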

System model (Cont'd)
Race-free execution. Method 1: architecture support. Method 2: compiler support. Method 3: explicit lock/unlock. These convert data-races into "conflicts"; a locking sketch follows below.
Scheduling: general methods (for example, the static, dynamic, or other scheduling methods of OpenMP). Assumption on scheduling: [figure: vertices u and v with data D_u and D_v share an edge with data D_e; add_schedule(u)]
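A minimal sketch of Method 3 (explicit lock/unlock), using hypothetical names: guarding each edge value D_e with its own mutex turns a concurrent read and write into an ordered "conflict" instead of a data-race.

```cpp
#include <mutex>

// Each edge value D_e carries its own lock.
struct Edge {
    double value;  // D_e
    std::mutex lock;
};

// Writer side: vertex u publishes its new data on the edge (u, v).
void write_edge(Edge& e, double d_u) {
    std::lock_guard<std::mutex> guard(e.lock);
    e.value = d_u;
}

// Reader side: vertex v gathers the edge value. Whether it observes the
// old or the new value depends on which side acquires the lock first --
// that ordering ambiguity is exactly the "conflict" the talk analyzes.
double read_edge(Edge& e) {
    std::lock_guard<std::mutex> guard(e.lock);
    return e.value;
}
```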

Outline Motivation System model Algorithm Convergence Evaluation Conclusion

Algorithm Convergence
Methodology: classify the "conflicts" on edges.
Read-write conflicts. Case 1: read-after-write → read new value → converge. Case 2: write-after-read → read old value → converge?
Write-write conflicts. Case 1: (correct) write after (wrong) write → correct edge values → converge. Case 2: (wrong) write after (correct) write → corrupted edge values → converge?

Algorithm Convergence (Cont'd)
Read-write conflict [figure: vertices u and v with data D_u and D_v sharing edge data D_e]. Case 1: read-after-write — the new value is read; converge. Case 2: write-after-read — the old value is read in this iteration, and the new value is picked up in the next iteration; converge.

Algorithm Convergence (Cont'd)
Sufficient condition 1 for convergence: a chain-to-converge exists. Deduction 1: if algorithm A on graph G converges under synchronous-model execution, A will converge under nondeterministic execution. Deduction 2: if algorithm A on graph G converges under a deterministic scheduler of the asynchronous model, A will converge under nondeterministic execution. Example algorithms that converge: PageRank and many other fixed-point iterative algorithms; a sketch follows below.
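Below is a hedged sketch of nondeterministic PageRank (illustrative, not the paper's GraphChi code; it assumes every vertex has out-degree at least 1 and, for brevity, omits the race-free machinery discussed earlier): vertices are updated in place and in arbitrary order, so an update may read a mix of old and new neighbor ranks, yet by Sufficient Condition 1 the fixed-point iteration still converges.

```cpp
#include <omp.h>
#include <vector>

void pagerank(const std::vector<std::vector<int>>& in_nbrs,
              const std::vector<int>& out_degree,  // assumed >= 1 everywhere
              std::vector<double>& rank, int iterations) {
    const double d = 0.85;
    const int n = static_cast<int>(rank.size());
    for (int iter = 0; iter < iterations; ++iter) {
        #pragma omp parallel for schedule(dynamic)
        for (int v = 0; v < n; ++v) {
            double sum = 0.0;
            for (int u : in_nbrs[v])
                sum += rank[u] / out_degree[u];  // may see old or new value
            rank[v] = (1.0 - d) / n + d * sum;   // applied "immediately"
        }
    }
}
```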

Algorithm Convergence (Cont'd)
Write-write conflicts [figure: vertices u and v with data D_u and D_v sharing edge data D_e]. Case 1: (correct) write after (wrong) write — the edge holds the correct value; converge. Case 2: (wrong) write after (correct) write — the edge value is corrupted; in the next iteration the algorithm either falsely converges on the corrupted value, or corrects the edge value and then converges.

Algorithm Convergence (Cont'd)
Sufficient condition 2 for convergence (needed in order to correct the corrupted edge value): algorithm A on graph G converges under deterministic asynchronous-model execution, and A satisfies the monotonicity property (otherwise it may falsely converge). Algorithms that converge: WCC (Weakly Connected Components) by MLP (Minimal Label Propagation), BFS (Breadth-First Search), and many other graph traversal algorithms; see the sketch below. Algorithms that do not converge: BP (Belief Propagation).
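As an example of the monotonicity property, here is a sketch of WCC by minimal label propagation (illustrative names, not the paper's code): labels only ever decrease, so a value corrupted by a write-write conflict is eventually overwritten by a smaller, correct label.

```cpp
#include <algorithm>
#include <numeric>
#include <omp.h>
#include <vector>

void wcc(const std::vector<std::vector<int>>& nbrs,
         std::vector<int>& label) {
    std::iota(label.begin(), label.end(), 0);  // label[v] = v initially
    bool changed = true;
    while (changed) {
        changed = false;
        // Unordered, in-place updates; race-free machinery omitted for brevity.
        #pragma omp parallel for schedule(dynamic)
        for (int v = 0; v < static_cast<int>(nbrs.size()); ++v) {
            int m = label[v];
            for (int u : nbrs[v]) m = std::min(m, label[u]);
            // Monotone: labels only decrease, so corrupted values get fixed.
            if (m < label[v]) { label[v] = m; changed = true; }
        }
    }
}
```

Because min is also idempotent and associative, re-executing an update after a conflict is harmless, which is why such traversal-style algorithms tolerate nondeterministic execution.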

Outline Motivation System model Algorithm Convergence Evaluation Conclusion

Evaluation
Experiment setup: 2 x 2.6-GHz Intel Xeon E processors (8 cores); 64GB of RAM; GCC version:
Real-world graph data-sets: web-BerkStan, web-Google, soc-LiveJournal1, cage15.
Platform: GraphChi (C++ version 0.2). Algorithms: PageRank, SSSP, WCC, BFS. Avail at:

Evaluation (Cont'd)
Using architecture support achieves the best performance (execution-time reduction of up to 70%). Using explicit locking/unlocking does not achieve the best performance, but it still shows good scalability and sometimes even outperforms the deterministic executions.

Evaluation (Cont’d) difference degree is 3 Result1:{1, 2, 3, 5, 7} Result2:{1, 2, 3, 7, 5} Suffix---- 0, 1, 2, 3, 4 Results are not deterministic (external deterministic) With increased precision (smaller ε), variations in results move to less important pages How about the produced results of PageRank? Measure the difference:

Outline Motivation System model Algorithm Convergence Evaluation Conclusion

Conclusion
Graph algorithms are special cases of parallel computing and do not necessarily need high-overhead deterministic executions! Most of the algorithms can be executed nondeterministically; examples include PageRank, WCC, BFS and many others. Not all of the nondeterministic executions produce deterministic results!
Open problems: more discussion of sufficient conditions for algorithm convergence under nondeterministic execution; more discussion of the variations (nondeterminacy) in results produced by nondeterministic executions (e.g., PageRank); theoretical analysis of the speed of convergence; extending the system model to pure asynchronous computing.

Thank you! Q&A