X10: An Object-Oriented Approach to Non-uniform Cluster Computing Vijay Saraswat IBM Research.

Presentation transcript:


July 23, 2003 IBM PL Day
Overview
  Introduction and context
  Clustered Computing
  Language model and constructs
  Big picture
  places, atomic, async, finish, clocks, arrays
  Example programs and demo
  Conclusion and Future Work
  Guarantees
  Challenges

Acknowledgements
  X10 Tools: Julian Dolby, Steve Fink, Robert Fuhrer, Matthias Hauswirth, Peter Sweeney, Frank Tip, Mandana Vaziri
  University partners: MIT (StreamIt), Purdue University (X10), UC Berkeley (StreamBit), U. Delaware (Atomic sections), U. Illinois (Fortran plug-in), Vanderbilt University (Productivity metrics), DePaul U (Semantics)
  X10 core team: Philippe Charles, Chris Donawa (IBM Toronto), Kemal Ebcioglu, Christian Grothoff (Purdue), Allan Kielstra (IBM Toronto), Maged Michael, Christoph von Praun, Vivek Sarkar
  Additional contributors to X10 ideas: David Bacon, Bob Blainey, Perry Cheng, Julian Dolby, Guang Gao (U Delaware), Robert O'Callahan, Filip Pizlo (Purdue), Lawrence Rauchwerger (Texas A&M), Mandana Vaziri, Jan Vitek (Purdue), V.T. Rajan, Radha Jagadeesan (DePaul)
  X10 PM+Tools Team Lead: Kemal Ebcioglu, Vivek Sarkar
  PERCS Principal Investigator: Mootaz Elnozahy

Performance and Productivity Challenges
  1) Memory wall: Architectures exhibit severe non-uniformities in bandwidth & latency in memory hierarchy
  Clusters (scale-out), SMP, Multiple cores on a chip, Coprocessors (SPUs), SMTs, SIMD, ILP
  2) Frequency wall: Architectures introduce hierarchical heterogeneous parallelism to compensate for frequency scaling slowdown
  3) Scalability wall: Software will need to deliver ~ way parallelism to utilize peta-scale parallel systems
  [Figure: memory hierarchy. Processor clusters (PEs with L1 caches) share L2 caches, which sit above an L3 cache and memory.]

High Complexity Limits Development Productivity
  One billion transistors in a chip
    1995: entire chip can be accessed in 1 cycle
    2010: only small fraction of chip can be accessed in 1 cycle
  Major sources of complexity for application developer:
    1) Severe non-uniformities in data accesses
    2) Applications must exhibit large degrees of parallelism (up to ~10^5 threads)
  Complexity leads to increases in all phases of HPC Software Lifecycle related to parallel code
  [Figure: HPC Software Lifecycle. Requirements, Written Specification, Algorithm Development, Parallel Specification, Source Code, Development of Parallel Source Code (Design, Code, Test, Port, Scale, Optimize), Input Data, Production Runs of Parallel Code, Maintenance and Porting of Parallel Code.]
  [Figure: memory hierarchy. Processor clusters (PEs with L1 caches) share L2 caches, which sit above an L3 cache and memory.]

PERCS Programming Model/Tools: Overall Architecture
  PERCS = Productive Easy-to-use Reliable Computer Systems
  Integrated Programming Environment: Edit, Compile, Debug, Visualize, Refactor
    Use Eclipse platform (eclipse.org) as foundation for integrating tools
    Morphogenic Software: separation of concerns, separation of roles
  [Figure: layered architecture.
    Source languages: X10 source code; Java+Threads+Conc utils; C/C++/MPI/OpenMP; Fortran/MPI/OpenMP.
    Toolkits: X10 Development Toolkit, Java Development Toolkit, C Development Toolkit, Fortran Development Toolkit, Productivity Metrics, Performance Exploration.
    Components and runtimes: X10 Components and X10 runtime, Java components and Java runtime, C/C++ components and C/C++ runtime, Fortran components and Fortran runtime, connected by a fast extern interface.
    Lower layers: Integrated Concurrency Library (messages, synchronization, threads), Continuous Program Optimization (CPO), PERCS System Software (K42), PERCS System Hardware.]

X10 Design Assumptions
  Support High Productivity (&, possibly U ) High Performance Programmer
  Scalability
    Axiom: Programmer must have explicit language constructs to deal with non-uniformity of access.
    Axiom: Allow specification of a large collection of activities.
    Axiom: A program must use scalable synchronization constructs.
    Axiom: The runtime may implement aggregate operations more efficiently than user-specified iterations with index variables.
    Axiom: The user may know more than the compiler/RTS.
  Productivity
    Axiom: OO provides proven baseline productivity, maintenance, portability benefits.
    Axiom: Design must rule out large classes of errors (Type safe, Memory safe, Pointer safe, Lock safe, Clock safe …)
    Axiom: Design must support incremental introduction of explicit place types/remote operations.
    Axiom: PM must integrate with static tools (Eclipse): flag performance problems, refactor code, detect races.
    Axiom: PM must support automatic static and dynamic optimization (CPO).

The X10 Programming Model
  A program is a collection of places, each containing resident data and a dynamic collection of activities.
  Program may distribute aggregate data (arrays) across places during allocation.
  Program may directly operate only on local data, using atomic blocks.
  Program may spawn multiple (local or remote) activities in parallel.
  Program must use asynchronous operations to access/update remote data.
  Program may detect termination or (repeatedly) detect quiescence of a data-dependent, distributed set of activities.
  Cluster Computing: Common framework for P>=1 (Shared Memory: P=1; MPI: P>1)
  Granularity of place can range from single register file to an entire SMP system
  Constructs: atomic, when; finish, clock; async, {at/for}each; distribution; place
  Formalized in Saraswat, Jagadeesan, "Concurrent Clustered Programming".
  [Figure: places side by side. Each place holds a place-local heap and activities with activity-local storage (heap, stack, control). A partitioned global heap and immutable data span the places. Places exchange outbound and inbound activities and outbound and inbound activity replies.]

async
  async (P) S
    Parent activity creates a new child activity at place P, to execute statement S; returns immediately.
    S may reference final variables in enclosing blocks.
  async PlaceExpressionSingleList_opt Statement
  cf. Cilk's spawn

    double A[D] = …; // Global dist. array
    final int k = …;
    async ( A.distribution[99] ) {
      // Executed at A[99]'s place
      atomic A[99] = k;
    }
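The spawn-and-return behaviour of async can be approximated in plain Java. The following is only a sketch under stated assumptions: each place is modeled as a single-threaded executor, the array is distributed cyclically, and the names `AsyncSketch`, `places` and `owner` are hypothetical. It is not how the X10 runtime is implemented.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical emulation: each "place" is modeled as a single-threaded executor.
final class AsyncSketch {
    static int runDemo() {
        ExecutorService[] places = {
            Executors.newSingleThreadExecutor(),
            Executors.newSingleThreadExecutor()
        };
        AtomicIntegerArray a = new AtomicIntegerArray(100); // stands in for the distributed array A
        final int k = 42;                                   // final, so the child activity may capture it
        int owner = 99 % places.length;                     // assumed cyclic distribution of A
        try {
            // Counterpart of: async (A.distribution[99]) { atomic A[99] = k; }
            // runAsync returns immediately; the body runs on the owning place's thread.
            CompletableFuture<Void> child =
                CompletableFuture.runAsync(() -> a.set(99, k), places[owner]);
            // An X10 async yields no handle; the join here exists only to observe the result.
            child.join();
            return a.get(99);
        } finally {
            for (ExecutorService p : places) {
                p.shutdown();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints 42
    }
}
```

The parent keeps a future only so the demo can read the value back; in X10 the same effect would come from wrapping the async in finish.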

finish
  finish S
    Execute S, but wait until all (transitively) spawned asyncs have terminated.
    Trap all exceptions thrown by spawned activities.
    Throw an (aggregate) exception if any spawned async terminates abruptly.
  Useful for expressing synchronous operations on remote data
    And potentially, ordering information in a weakly consistent memory model
  Statement ::= finish Statement
  Rooted Exception Model
  cf. Cilk's sync

    finish ateach(point [i]:A) A[i] = i;
    finish async(A.distribution[j]) A[j] = 2;
    // All A[i]=i will complete before A[j]=2;
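The two properties listed for finish (waiting for transitively spawned activities, and rooting their exceptions at the parent) can be sketched in Java with a `java.util.concurrent.Phaser`. This is an illustrative emulation, not the X10 runtime: `FinishScope`, its `async` and `await` methods, and the thread pool are all hypothetical names introduced here.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical emulation of "finish S": one Phaser party per live activity.
final class FinishScope {
    private final Phaser phaser = new Phaser(1); // the initial party is the parent activity
    private final Queue<Throwable> errors = new ConcurrentLinkedQueue<>();
    private final ExecutorService pool;

    FinishScope(ExecutorService pool) {
        this.pool = pool;
    }

    // Spawn an activity inside this scope. It may itself call async (transitive spawning):
    // the spawner is still an unarrived party, so the phase cannot advance during register().
    void async(Runnable body) {
        phaser.register();
        pool.execute(() -> {
            try {
                body.run();
            } catch (Throwable t) {
                errors.add(t); // trapped here, rethrown at the root in await()
            } finally {
                phaser.arriveAndDeregister();
            }
        });
    }

    // Block until every activity spawned in this scope has terminated.
    void await() {
        phaser.arriveAndAwaitAdvance();
        if (!errors.isEmpty()) {
            RuntimeException aggregate = new RuntimeException("activities terminated abruptly");
            errors.forEach(aggregate::addSuppressed);
            throw aggregate;
        }
    }

    static int demo() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger done = new AtomicInteger();
        FinishScope f = new FinishScope(pool);
        for (int i = 0; i < 8; i++) {
            f.async(() -> {
                done.incrementAndGet();
                f.async(done::incrementAndGet); // nested spawn, also awaited
            });
        }
        f.await();
        pool.shutdown();
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 16
    }
}
```

Registering the child before it is submitted is the design point: it guarantees the parent cannot pass `await()` while any descendant is still pending.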

atomic
  Atomic blocks are conceptually executed in a single step, while other activities are suspended
  An atomic block may not include
    Blocking operations
    Accesses to data at remote places
    Creation of activities at remote places
  Statement ::= atomic Statement
  MethodModifier ::= atomic

    // push data onto concurrent list-stack
    Node node = new Node(17);
    atomic {
      node.next = head;
      head = node;
    }

    // target defined in lexically enclosing environment.
    public atomic boolean CAS(Object old, Object newVal) {
      if (target.equals(old)) {
        target = newVal;
        return true;
      }
      return false;
    }
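A coarse Java approximation of the two atomic examples is a single lock per place. That choice is an assumption made for illustration; the slides do not say how atomic blocks are implemented, and lock assignment is listed as future work. The names `AtomicSketch` and `placeLock` are hypothetical.

```java
// Hypothetical emulation: every atomic block at a place takes the same lock,
// so blocks appear to execute in a single step relative to one another.
final class AtomicSketch {
    static final class Node {
        final int value;
        Node next;
        Node(int value) {
            this.value = value;
        }
    }

    private final Object placeLock = new Object(); // assumption: one lock per place
    private Node head;
    private Object target = "a";

    // Counterpart of: atomic { node.next = head; head = node; }
    void push(int value) {
        Node node = new Node(value);
        synchronized (placeLock) {
            node.next = head;
            head = node;
        }
    }

    // Counterpart of the atomic CAS method on the slide.
    boolean cas(Object expected, Object update) {
        synchronized (placeLock) {
            if (target.equals(expected)) {
                target = update;
                return true;
            }
            return false;
        }
    }

    int size() {
        synchronized (placeLock) {
            int n = 0;
            for (Node p = head; p != null; p = p.next) {
                n++;
            }
            return n;
        }
    }

    // Several threads push concurrently; no push may be lost.
    static int demo(int threads, int pushesPerThread) {
        AtomicSketch stack = new AtomicSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < pushesPerThread; i++) {
                    stack.push(i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return stack.size();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000)); // prints 4000
    }
}
```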

when
  Activity suspends until a state in which the guard is true; in that state the body is executed atomically.
  Statement ::= WhenStatement
  WhenStatement ::= when ( Expression ) Statement

    class OneBuffer {
      nullable Object datum = null;
      boolean filled = false;

      public void send(Object v) {
        when ( !filled ) {
          this.datum = v;
          this.filled = true;
        }
      }

      public Object receive() {
        when ( filled ) {
          Object v = datum;
          datum = null;
          filled = false;
          return v;
        }
      }
    }
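For comparison, the OneBuffer example maps closely onto a Java monitor: the guard becomes a `while` loop around `wait()`, and the body runs while holding the lock. This is a sketch of the same idea in Java, not a statement about how X10 compiles `when`; the class name `OneBufferSketch` and the `demo` method are hypothetical.

```java
// Java monitor version of the one-place buffer: "when (guard) body" becomes
// "wait until guard holds, then run body under the same lock".
final class OneBufferSketch {
    private Object datum = null;
    private boolean filled = false;

    synchronized void send(Object v) throws InterruptedException {
        while (filled) {       // guard: !filled
            wait();
        }
        datum = v;
        filled = true;
        notifyAll();           // wake receivers whose guard may now hold
    }

    synchronized Object receive() throws InterruptedException {
        while (!filled) {      // guard: filled
            wait();
        }
        Object v = datum;
        datum = null;
        filled = false;
        notifyAll();           // wake senders whose guard may now hold
        return v;
    }

    // One producer sends 1..5; the caller receives and sums them.
    static int demo() {
        OneBufferSketch buf = new OneBufferSketch();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 5; i++) {
                    buf.send(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            int sum = 0;
            for (int i = 0; i < 5; i++) {
                sum += (Integer) buf.receive();
            }
            producer.join();
            return sum;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 15
    }
}
```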

regions, distributions
  Region: a (multi-dimensional) set of indices
  Distribution: a mapping from indices to places
  High level algebraic operations are provided on regions and distributions
  Based on ZPL.

    region R = 0:100;
    region R1 = [0:100, 0:200];
    region RInner = [1:99, 1:199];
    // a local distribution
    distribution D1 = R -> here;
    // a blocked distribution
    distribution D = block(R);
    // union of two distributions
    distribution D = (0:1) -> P0 || (2:N) -> P1;
    distribution DBoundary = D - RInner;
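To make "a mapping from indices to places" concrete, here is a small Java sketch of one possible blocked mapping and of the region difference used for DBoundary. It assumes region bounds are inclusive and that blocks are sized by ceiling division; the slides do not specify how block() divides a region, so `placeOf` is an illustration only, and both method names are hypothetical.

```java
// Hypothetical helpers illustrating a block distribution and a region difference.
final class BlockDistSketch {
    // Place owning index i when the indices lo..hi (inclusive) are cut into
    // `places` contiguous blocks of equal size (the last block may be shorter).
    static int placeOf(int i, int lo, int hi, int places) {
        int n = hi - lo + 1;
        int blockSize = (n + places - 1) / places; // ceiling division
        return (i - lo) / blockSize;
    }

    // Number of points of [0:rows-1, 0:cols-1] that are not in the inner
    // region [1:rows-2, 1:cols-2], i.e. the size of the boundary region.
    static int boundarySize(int rows, int cols) {
        int count = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                boolean inner = r >= 1 && r <= rows - 2 && c >= 1 && c <= cols - 2;
                if (!inner) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Region 0:100 has 101 points; over 4 places each block holds 26 indices.
        System.out.println(placeOf(26, 0, 100, 4));  // prints 1
        // R1 = [0:100, 0:200] minus RInner = [1:99, 1:199].
        System.out.println(boundarySize(101, 201));  // prints 600
    }
}
```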

arrays
  Arrays may be
    Multidimensional
    Distributed
    Value types
    Initialized in parallel:
      int [D] A = new int[D] (point [i,j]) { return N*i+j; };
  Array section
    A [RInner]
  High level parallel array, reduction and span operators
    Highly parallel library implementation
    A-B (array subtraction)
    A.reduce(intArray.add,0)
    A.sum()
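Parallel initialization and reduction have rough single-machine counterparts in the Java standard library. The sketch below flattens a two-dimensional array into one `long[]` and ignores distribution across places entirely; `ArraySketch`, `init` and `sum` are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical single-place counterpart of parallel array initialization and A.sum().
final class ArraySketch {
    // Counterpart of: new int[D] (point [i,j]) { return N*i+j; }
    // The 2-D index (i, j) is recovered from the flat index idx.
    static long[] init(int rows, int cols, int n) {
        long[] a = new long[rows * cols];
        Arrays.parallelSetAll(a, idx -> (long) n * (idx / cols) + (idx % cols));
        return a;
    }

    // Counterpart of: A.reduce(intArray.add, 0)
    static long sum(long[] a) {
        return Arrays.stream(a).parallel().reduce(0L, Long::sum);
    }

    public static void main(String[] args) {
        long[] a = init(2, 3, 10);   // rows: {0, 1, 2} and {10, 11, 12}
        System.out.println(sum(a));  // prints 36
    }
}
```

Addition is associative, which is what allows the runtime to split the reduction across threads in any grouping.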

ateach, foreach
  ateach (point p:A) S
    Creates |region(A)| async statements
    Instance p of statement S is executed at the place where A[p] is located
  foreach (point p:R) S
    Creates |R| async statements in parallel at current place
  Termination of all activities can be ensured using finish.
  ateach ( FormalParam: Expression ) Statement
  foreach ( FormalParam: Expression ) Statement

    public boolean run() {
      distribution D = distribution.factory.block(TABLE_SIZE);
      long[.] table = new long[D] (point [i]) { return i; };
      long[.] RanStarts = new long[distribution.factory.unique()]
          (point [i]) { return starts(i); };
      long[.] SmallTable = new long value[TABLE_SIZE]
          (point [i]) { return i*S_TABLE_INIT; };
      finish ateach (point [i] : RanStarts) {
        long ran = nextRandom(RanStarts[i]);
        for (int count : 1:N_UPDATES_PER_PLACE) {
          int J = f(ran);
          long K = SmallTable[g(ran)];
          async atomic table[J] ^= K;
          ran = nextRandom(ran);
        }
      }
      return table.sum() == EXPECTED_RESULT;
    }
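The pattern "finish foreach ... async atomic table[J] ^= K" can be imitated on one machine with a parallel stream and an `AtomicLongArray`. This is a sketch only: the index and value functions below are arbitrary stand-ins for f, g and nextRandom, which the slide does not define, and `ForeachSketch` is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.stream.IntStream;

// Hypothetical single-place counterpart of: finish foreach (...) async atomic table[J] ^= K;
final class ForeachSketch {
    static AtomicLongArray run(int tableSize, int updates) {
        AtomicLongArray table = new AtomicLongArray(tableSize);
        for (int i = 0; i < tableSize; i++) {
            table.set(i, i); // same initial contents as the slide's table: table[i] = i
        }
        // A parallel forEach returns only after every iteration has completed,
        // which plays the role of finish here.
        IntStream.range(0, updates).parallel().forEach(u -> {
            int j = Math.floorMod(u * 31, tableSize); // stand-in for f(ran)
            long k = u * 7L;                          // stand-in for SmallTable[g(ran)]
            table.accumulateAndGet(j, k, (x, y) -> x ^ y); // atomic table[j] ^= k
        });
        return table;
    }

    public static void main(String[] args) {
        AtomicLongArray first = run(16, 1000);
        AtomicLongArray second = run(16, 1000);
        boolean same = true;
        for (int i = 0; i < 16; i++) {
            same &= first.get(i) == second.get(i);
        }
        System.out.println(same); // prints true
    }
}
```

Because XOR is commutative and associative and each update is applied atomically, the final table does not depend on the order in which iterations run.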

clocks
  Operations
    clock c = new clock();
    c.resume();
      Signals completion of work by activity in this clock phase.
    next;
      Blocks until all clocks it is registered on can advance. Implicitly resumes all clocks.
    c.drop();
      Unregister activity with c.
    async (P) clock (c1,…,cn) S
      (Clocked async): activity is registered on the clocks (c1,…,cn)
  Static Semantics
    An activity may operate only on those clocks it is live on.
    In finish S, S may not contain any top-level clocked asyncs.
  Dynamic Semantics
    A clock c can advance only when all its registered activities have executed c.resume().
  No explicit operation to register a clock.
  Supports over-sampling, hierarchical nesting.
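Java's `java.util.concurrent.Phaser` offers a similar phase-advance discipline and can serve as a rough analogy. In the sketch below, `arriveAndAwaitAdvance()` plays the role of next and `arriveAndDeregister()` the role of c.drop(); a split-phase c.resume() would correspond to `arrive()` followed later by `awaitAdvance(phase)`. The mapping is an analogy drawn here, not something the slides claim, and `ClockSketch` is a hypothetical name.

```java
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical analogy: a Phaser standing in for a single X10 clock.
final class ClockSketch {
    // Each worker writes its slot in phase 1, then checks in phase 2 that
    // every other worker's phase-1 write is visible.
    static boolean demo(int workers) {
        Phaser clock = new Phaser(workers); // all workers registered up front, like clocked asyncs
        int[] phase1 = new int[workers];
        AtomicBoolean ok = new AtomicBoolean(true);
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                phase1[id] = id + 1;            // phase 1 work
                clock.arriveAndAwaitAdvance();  // "next": wait for all registered activities
                for (int v : phase1) {          // phase 2: all phase-1 writes must be visible
                    if (v == 0) {
                        ok.set(false);
                    }
                }
                clock.arriveAndDeregister();    // "c.drop()"
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return ok.get();
    }

    public static void main(String[] args) {
        System.out.println(demo(8)); // prints true
    }
}
```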

Example: SpecJBB

    finish async {
      clock c = new clock();
      Company company = createCompany(...);
      for (int w : 0:wh_num)
        for (int t : 0:term_num)
          async clocked(c) { // a client
            initialize;
            next; //1.
            while (company.mode != STOP) {
              select a transaction;
              think;
              process the transaction;
              if (company.mode == RECORDING)
                record data;
              if (company.mode == RAMP_DOWN) {
                c.resume(); //2.
              }
            }
            gather global data;
          } // a client
      // master activity
      next; //1.
      company.mode = RAMP_UP;
      sleep rampuptime;
      company.mode = RECORDING;
      sleep recordingtime;
      company.mode = RAMP_DOWN;
      next; //2.
      // All clients in RAMP_DOWN
      company.mode = STOP;
    } // finish
    // Simulation completed.
    print results.

Formal semantics (FX10)
  Based on Middleweight Java (MJ)
  Configuration is a tree of located processes
    Tree necessary for finish.
    Clocks formalized using short circuits (PODC 88).
  Bisimulation semantics.
  Basic theorems
    Equational laws
    Clock quiescence is stable.
    Monotonicity of places.
    Deadlock freedom (for language w/out when).
    … Type Safety
    … Memory Safety

Current Status
  We have an operational X implementation
    All programs shown here run.
  Structure
    Translator based on Polyglot (Java compiler framework)
    X10 extensions are modular.
    Uses Jikes parser generator.
  Code metrics (classes+interfaces/LOC)
    Parser: ~45/14K
    Translator: ~112/9K
    RTS: ~190/10K
    Polyglot base: ~517/80K
    Approx 180 test cases.
  Limitations
    Clocked final not yet implemented.
    Type-checking incomplete.
    No type inference.
    Implicit syntax not supported.
  [Figure: translator pipeline. X10 source, X10 Grammar, Parser, AST, Analysis passes, Annotated AST, Code Templates, Code emitter, Target Java, JVM, X10 Multithreaded RTS, Native code, Program output.]
  [Figure: timeline of PEM events. Dates: 09/03, 02/04, 07/04, 02/05, 07/05, 12/05, 06/06. Milestones: PERCS Kickoff, X10 Kickoff, X Spec Draft, X10 Prototype #1, X10 Productivity Study, X10 Prototype #2, Open Source Release?]

Future Work: Implementation
  Type checking/inference
    Clocked types
    Place-aware types
  Consistency management
  Lock assignment for atomic sections
  Data-race detection
  Activity aggregation
    Batch activities into a single thread.
  Message aggregation
    Batch small messages.
  Load-balancing
    Dynamic, adaptive migration of places from one processor to another.
  Continuous optimization
  Efficient implementation of scan/reduce
  Efficient invocation of components in foreign languages
    C, Fortran
  Garbage collection across multiple places
  Welcome University Partners and other collaborators.

Future work: Other topics
  Design/Theory
    Atomic blocks
    Structural study of concurrency and distribution
    Clocked types
    Hierarchical places
    Weak memory model
    Persistence/Fault tolerance
    Database integration
  Tools
    Refactoring language.
  Applications
    Several HPC programs planned currently.
    Also: web-based applications.
  Welcome University Partners and other collaborators.

Backup material

Type system
  Value classes
    May only have final fields.
    May only be subclassed by value classes.
    Instances of value classes can be copied freely between places.
  nullable is a type constructor
    nullable T contains the values of T and null.
  Place types: specify the place at which the data object lives.
  Future work: Include generics and dependent types.

Example: Latch

    public interface future {
      boolean forced();
      Object force();
    }

    public class boxed {
      nullable Object val;
    }

    public class Latch implements future {
      protected boolean forced = false;
      // allocated eagerly so that setValue can store into it
      protected boxed result = new boxed();
      protected nullable exception z = null;

      public atomic boolean setValue(nullable Object val, nullable exception z) {
        if (forced) return false;
        // these assignments happen only once.
        this.result.val = val;
        this.z = z;
        this.forced = true;
        return true;
      }

      public atomic boolean forced() {
        return forced;
      }

      public Object force() {
        when (forced) {
          if (z != null) throw z;
          return result.val;
        }
      }
    }
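The same write-once latch can be sketched as a Java monitor, with the `when (forced)` guard expressed as a wait loop. This is a hand-written Java analogue for comparison, not output of the X10 translator; the class name `LatchSketch` is hypothetical, and a `RuntimeException` stands in for X10's exception type.

```java
// Java analogue of the Latch example: a value that can be set once and
// read by any number of activities, which block until it is set.
final class LatchSketch {
    private boolean forced = false;
    private Object result;
    private RuntimeException error; // stand-in for the slide's exception field z

    // Returns true only for the first caller; later calls leave the value unchanged.
    synchronized boolean setValue(Object val, RuntimeException err) {
        if (forced) {
            return false;
        }
        result = val;
        error = err;
        forced = true;
        notifyAll(); // release every activity blocked in force()
        return true;
    }

    synchronized boolean forced() {
        return forced;
    }

    // Counterpart of: when (forced) { if (z != null) throw z; return result; }
    synchronized Object force() {
        boolean interrupted = false;
        while (!forced) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true; // keep waiting, restore the flag afterwards
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
        if (error != null) {
            throw error;
        }
        return result;
    }

    public static void main(String[] args) {
        LatchSketch latch = new LatchSketch();
        new Thread(() -> latch.setValue("done", null)).start();
        System.out.println(latch.force()); // prints done
    }
}
```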