Distributed Galois
Andrew Lenharth
2/27/2015

Goals
An implementation of the operator formulation for distributed memory
– Ideally forward-compatible where possible
Both a simple programming model and a fast implementation
– Like Galois, may need restrictions or structure for the highest performance

Overview
PGAS (using fat pointers)
Implicit, asynchronous communication
Default execution mode:
– Galois-compatible
– Implicit locking and data movement
– Pluggable schedulers
– Speculative execution
All D-Galois programs are valid Galois programs
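
A fat pointer names an object by an (owning host, local address) pair. A minimal C++ sketch of the idea; the names are illustrative, not the actual D-Galois API:

#include <cstdint>

// Sketch of a PGAS fat pointer (illustrative names, not the real API).
struct FatPointer {
  uint32_t host;       // owning host id
  uintptr_t localAddr; // address valid only on the owning host
};

// Resolves locally when possible; a real runtime would fetch remote
// objects through the directory instead of returning nullptr.
template <typename T>
T* tryResolve(FatPointer fp, uint32_t myHost) {
  if (fp.host == myHost)
    return reinterpret_cast<T*>(fp.localAddr);
  return nullptr;
}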

Support
[Diagram: components of the Galois implementation — User Code, User Context, Graph, Parallel Loop, Contention Manager, Memory Management, Statistics, Topology, Scheduler, Barrier, Termination, etc.]

Support
[Diagram: components of the Distributed Galois implementation — the same Galois components (User Code, User Context, Graph, Parallel Loop, Contention Manager, Memory Management, Statistics, Topology, Scheduler, Barrier, Termination, etc.) plus Network, Directory, and Remote Store.]

Current Status
Working implementation of the baseline
– Asynchronous, speculative

Interesting Problems
Livelock
Asynchronous directory
Abstractions for building data structures
Network hardware
Network software
Remote updates
Scheduling

Solved: Livelock
Source: object state transitions are more complex, are asynchronous, and may require multiple steps (hence are interruptible)
Solution: a scheme that ensures forward progress of at least one host
Alternate: if this happens a lot for your application, coordinated scheduling may be more appropriate (or relaxed consistency)
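
One way such a forward-progress scheme can work (an assumption for illustration; the slide does not spell out the actual mechanism) is a fixed host-priority tiebreak on contended objects:

#include <cstdint>

// Hypothetical tiebreak, not the scheme from the talk: when two hosts
// contend for an object that is mid-transition, the lower host id
// always wins, so host 0 can always finish its multi-step state
// transition and the system as a whole cannot livelock.
bool shouldYieldObject(uint32_t myHost, uint32_t requestingHost) {
  return requestingHost < myHost;
}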

Asynchronous Directory
Source: communication and workers interleave access to the directory (and directly to objects stored in the directory)
Solution: mostly just a pain.

Abstractions for Building Data Structures
Source: distributed data structures are hard (so are shared-memory data structures).
Solution: a set of abstractions
Federated object: a different instance on each host/thread; pointers resolve locally. The federation is bootstrapped by the runtime.
Federated objects don't have any notion of exclusive behavior
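
A minimal sketch of the federated-object idea, assuming a hypothetical runtime-bootstrapped registry (none of these names are the real D-Galois API):

#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical federation registry: the runtime runs create() once on
// every host during bootstrap, so the returned id means "this object's
// local instance" on all hosts. resolve() never communicates, and
// there is no exclusive ownership to negotiate.
template <typename T>
class Federation {
  std::vector<std::unique_ptr<T>> objects; // indexed by federation id
public:
  template <typename... Args>
  std::size_t create(Args&&... args) {
    objects.push_back(std::make_unique<T>(std::forward<Args>(args)...));
    return objects.size() - 1;
  }
  T& resolve(std::size_t id) { return *objects[id]; }
};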

Remote Updates
Directory synchronization is really bad when not needed (essential when needed)
Many algorithms have an update-and-schedule behavior for their neighbors
Treat this behavior as a task type
– Multiple task types per loop
– Quite similar to nested parallelism

Remote Updates – PageRank
Original operator:
  self.value += self.residual
  for n : neighbors
    n.residual += f(self.residual)
    schedule (operator type on) {n}
With update tasks:
  self.value += self.residual
  for n : neighbors
    schedule (update type on) {n, f(self.residual)}
With a new operator:
  self.residual += update
  schedule (operator type on) {self}
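
The same two-task-type formulation written out as a self-contained C++ sketch, with a plain work queue standing in for the Galois scheduler (the field and task encodings are illustrative, not the real interface):

#include <deque>
#include <vector>

struct Node {
  double value = 0.0;
  double residual = 0.0;
  std::vector<int> neighbors;
};

struct Task {
  enum Kind { Operator, Update } kind;
  int node;
  double delta; // payload for Update tasks
};

void run(std::vector<Node>& g, std::deque<Task>& work, double d) {
  const double tol = 1e-9;
  while (!work.empty()) {
    Task t = work.front();
    work.pop_front();
    Node& n = g[t.node];
    if (t.kind == Task::Update) {
      // Update task: apply the delta locally, then reschedule the
      // full operator on the owning node.
      n.residual += t.delta;
      work.push_back({Task::Operator, t.node, 0.0});
    } else if (n.residual > tol) {
      // Main operator: absorb the residual and push deltas to the
      // neighbors as update tasks instead of writing them directly.
      n.value += n.residual;
      if (!n.neighbors.empty()) {
        double delta = d * n.residual / n.neighbors.size();
        for (int m : n.neighbors)
          work.push_back({Task::Update, m, delta});
      }
      n.residual = 0.0;
    }
  }
}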

Scheduling
Source: imagine SSSP using the existing (host-unaware) schedulers on distributed memory
Need a scheduler with a way to anchor work to a data-structure element
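
A sketch of what an anchored schedule could look like, assuming a simple block partition and a hypothetical sendTask hook (neither is the real mechanism):

#include <cstdint>
#include <deque>

struct Task { uint64_t node; };

// Block partition: node ids are assigned to hosts in contiguous
// ranges (an assumption for illustration).
constexpr uint64_t nodesPerHost = 1 << 20;

uint32_t ownerOf(uint64_t node) {
  return static_cast<uint32_t>(node / nodesPerHost);
}

// Anchor each task to the host owning its node: local work stays
// local, remote work is routed as a task rather than moving the data.
void scheduleAnchored(Task t, uint32_t myHost, std::deque<Task>& local,
                      void (*sendTask)(uint32_t host, Task t)) {
  uint32_t dest = ownerOf(t.node);
  if (dest == myHost)
    local.push_back(t);
  else
    sendTask(dest, t);
}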

Network hardware

Networks
Small asynchronous messages are bad for throughput
Scale-free graphs stress throughput
Large messages are bad for latency
Find the optimal point
– Sometimes latency is critical

Nagle's Algorithm
If you don't have a large message, wait a while to gather more data
Bad for latency
Also keeps MPI in its broken behavior range
Also requires O(P) memory for communication (assuming direct pointwise channels)
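
A sketch of this kind of aggregation, with one buffer per peer host (the thresholds and flush callback are illustrative assumptions):

#include <chrono>
#include <cstddef>
#include <vector>

// Per-peer send buffer with a Nagle-style policy: flush when the
// buffer is large enough or has waited too long. One such buffer per
// peer host is exactly the O(P) memory cost noted above.
class SendBuffer {
  std::vector<char> buf;
  std::chrono::steady_clock::time_point oldest;
  static constexpr std::size_t flushBytes = 64 * 1024;
  static constexpr std::chrono::microseconds maxDelay{100};

public:
  void append(const char* msg, std::size_t len,
              void (*flush)(const char*, std::size_t)) {
    if (buf.empty())
      oldest = std::chrono::steady_clock::now();
    buf.insert(buf.end(), msg, msg + len);
    poll(flush); // send immediately if this pushed us past the threshold
  }

  // Also called periodically by the network thread to bound latency.
  void poll(void (*flush)(const char*, std::size_t)) {
    if (buf.empty())
      return;
    bool timedOut = std::chrono::steady_clock::now() - oldest >= maxDelay;
    if (buf.size() >= flushBytes || timedOut) {
      flush(buf.data(), buf.size());
      buf.clear();
    }
  }
};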

Communication pattern

Software Routing
Pros: a single communication channel
– Scales with hosts
– Aggregates all messages
Cons: 2 hops (or more)
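
One common way to realize software routing (given here as an assumption about the design space, not the scheme from the talk) is a 2D layout: arrange the P hosts in a sqrt(P) x sqrt(P) grid and route each message through the host in the sender's row and the receiver's column, cutting per-host channels and buffers from O(P) to O(sqrt(P)) at the cost of the extra hop:

#include <cmath>
#include <cstdint>

// Two-hop routing on a sqrt(P) x sqrt(P) host grid (assumes P is a
// perfect square for simplicity). Each host keeps channels only to
// its own row and column.
uint32_t nextHop(uint32_t me, uint32_t dst, uint32_t P) {
  uint32_t side = static_cast<uint32_t>(std::lround(std::sqrt(double(P))));
  uint32_t hop = (me / side) * side + (dst % side); // my row, dst's column
  return (hop == me) ? dst : hop; // already in dst's column: go direct
}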