Software Performance Optimisation Group
Domain-specific interpreters (a nested talk)
Paul Kelly (Imperial College London)
Joint work with Olav Beckmann, Karen Osmond, Tony Field and others
Dagstuhl, January 2006

Domain-specific optimisation
Libraries extend general-purpose languages
Good libraries promote problem-focused code
“Active libraries” apply library-specific optimisations to client code
Client calling context may enable optimisation: fusion, redundancy elimination, incrementalisation, etc.
Client: C a = new C(…); C b = new C(…); … c = a.f(…); … print( b.g(c) );
Library: constructor C(…); f(…) {…} g(…) {…}

Active library technologies
How to deliver “active libraries”?
  Domain-specific compiler?
  Source-to-source transformation?
  Plug-in-based compiler architecture?
  Plug-in-based virtual machine?
  “Domain-specific optimisation components”
  Aspect weaver?
This talk is about an appealingly low-tech solution, which we glorify with a big name: the “Domain-Specific Interpreter”

Domain-specific interpreter
DSI is interposed between client and library
Client: C a = new C(…); C b = new C(…); … c = a.f(…); … print( b.g(c) );
Domain-specific interpreter (DSI): delay execution, build “recipe”; plan optimised execution, execute
Library: constructor C(…); f(…) {…} g(…) {…}
Inject proxy between application and library
Use proxy to capture, delay and optimise the calls

Domain-specific interpreter
DSI is a design pattern
Standard questions:
  When is DSI a good idea?
  When is it applicable?
  How do you implement it (in your favoured language)?
  Show me an example!
Let’s do the example first…

MayaVi
Tool for visualising fluid flows
GUI supports interactive construction of visualisation pipelines
E.g. fluid flow past a heated sphere: temperature isosurface with temperature-shaded streamtubes
I’m going to show you how we dramatically improved MayaVi interactivity
  By parallel execution on SMP
  By parallel execution on Linux cluster
  By caching pre-calculated results
  Without changing a single line of MayaVi or VTK code*
  Without writing a compiler
* Actually we did change a few lines in VTK to fix a problem with Python’s Global Interpreter Lock

MayaVi: Working on partitioned data
Our ocean simulations are generated in parallel
Input data consists of a set of partitions (and an XML index)
Normally, VTK fuses these partitions into one mesh as they are read
Some (in fact many) analyses can operate partition-by-partition
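A partition-by-partition analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not MayaVi or VTK code: partitions are plain Python lists, and `count_above`, `analyse_fused` and `analyse_partitions` are hypothetical names. It only shows that per-partition results can be combined to give the same answer as analysing the fused data.

```python
from concurrent.futures import ThreadPoolExecutor


def count_above(values, threshold):
    """Hypothetical analysis: count the samples that exceed a threshold."""
    return sum(1 for value in values if value > threshold)


def analyse_fused(partitions, threshold):
    """Fuse all partitions into one data set first, then analyse it."""
    fused = [value for partition in partitions for value in partition]
    return count_above(fused, threshold)


def analyse_partitions(partitions, threshold, max_workers=2):
    """Analyse each partition independently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = pool.map(
            lambda part: count_above(part, threshold), partitions
        )
        return sum(partial_counts)


if __name__ == "__main__":
    partitions = [[1, 5, 9], [2, 8], [7]]
    print(analyse_partitions(partitions, threshold=4))
    print(analyse_fused(partitions, threshold=4))
```

Threads are used here only to show the structure. Pure-Python work like this is serialised by the Global Interpreter Lock, so a real speedup needs library code that releases the lock, or separate processes.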

MayaVi: what the DSI has to do
Capture all delayable calls to methods from a DSL through a proxy layer
A force point is a call which requires an immediate result, in this case to render on screen
A recipe is the set of calls between consecutive force points (in parallel)

Implementing a generic DSI proxy in Python

Self-generating proxy module:

    import vtkpython_real
    from vtkdsi import proxyObject

    # Define an empty proxy subclass for every name in the real module.
    for className in dir(vtkpython_real):
        exec(f"class {className}(proxyObject): pass")

Proxy implementation:

    class proxyObject:
        def __getattr__(self, callName):
            # Called only when normal attribute lookup fails.
            return lambda *callArgs: self.proxyCall(callName, callArgs)

        def proxyCall(self, callName, callArgs):
            # if forcepoint: optimise and apply recipe
            # else: add call to the current recipe
            pass

Actually, the real implementation generates dummies for all the methods and members as well as the classes, so when MayaVi reflects on the module to generate the GUI configuration forms it finds the right stuff.

In Python it’s remarkably easy to interpose the proxy
We replace the “vtkPython” wrapper module with a module with dummy definitions of every class in vtkPython
When Python finds no method implementation for these classes, it passes the method name to “__getattr__”, which returns a function that accepts the arguments
This bounces the call to our DSI method “proxyCall”
Which eventually calls the real vtkPython module
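The interposition described above can be tried out in a few lines of Python 3. This is a sketch under stated assumptions rather than the talk's implementation: `FakeLibrary` is a hypothetical stand-in for the real library, `render` is assumed to be the only force point, and the "optimisation" is a toy that drops immediately repeated calls (which is only safe if those calls are idempotent).

```python
class FakeLibrary:
    """Hypothetical stand-in for the real library; it only logs calls."""

    def __init__(self):
        self.log = []

    def set_value(self, value):
        self.log.append(("set_value", value))

    def update(self):
        self.log.append(("update",))

    def render(self):
        self.log.append(("render",))
        return len(self.log)


class ProxyObject:
    """Delays calls to a target object until a force point is reached."""

    FORCE_POINTS = {"render"}  # assumption: only render needs a result now

    def __init__(self, target):
        self._target = target
        self._recipe = []

    def __getattr__(self, call_name):
        # Reached only when normal attribute lookup fails, so every library
        # method lands here. Private names are rejected to avoid recursion.
        if call_name.startswith("_"):
            raise AttributeError(call_name)
        return lambda *call_args: self._proxy_call(call_name, call_args)

    def _proxy_call(self, call_name, call_args):
        if call_name not in self.FORCE_POINTS:
            self._recipe.append((call_name, call_args))  # delay the call
            return None
        # Force point: optimise the recipe, replay it, then make this call.
        recipe, self._recipe = self._recipe, []
        for name, args in self._optimise(recipe):
            getattr(self._target, name)(*args)
        return getattr(self._target, call_name)(*call_args)

    @staticmethod
    def _optimise(recipe):
        # Toy optimisation: drop a call that repeats the previous one exactly.
        optimised = []
        for call in recipe:
            if not optimised or optimised[-1] != call:
                optimised.append(call)
        return optimised


if __name__ == "__main__":
    library = FakeLibrary()
    proxy = ProxyObject(library)
    proxy.set_value(3)
    proxy.set_value(3)   # redundant, removed when the recipe is optimised
    proxy.update()
    print(library.log)   # nothing has reached the library yet
    proxy.render()       # force point
    print(library.log)
```

Delayed calls return `None` here, which is only acceptable for calls whose results the client never inspects.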

How well does it work?
Benchmark: plot isosurfaces for seven pressure values in flow past heated sphere
Each isosurface is several hundred MB
Hardware:
  For SMP: Athlon 1600+, dual SMP, 256 KB L2, 1 GB RAM, Linux 2.4
  For distributed-memory: cluster of four 2 GHz Pentium 4 PCs, 512 KB L2, 1 GB RAM, Linux 2.4

Tiling optimisation yields substantial speedup
Modest further speedup from two-way shared-memory parallel
Parallel execution on a four-processor Linux cluster also offers substantial speedup
Isosurface benchmark: cluster of four 2 GHz Pentium 4 PCs

Further MayaVi DSI optimisations
Caching: check whether results of this recipe (or part thereof) are available in cache
  Multiple frames per second…
Region of Interest (RoI): load from disk only those partitions which intersect a cuboid specified by the user
Level of Detail (LoD): each dataset is stored in full-resolution form but also in a hierarchy of coarsened, decimated versions
Put together… “Google Earth” for global ocean flow
Large space of possible execution plans for each visualisation task. Choose:
  Appropriate parallelisation
  Recalculate or retrieve from (remote, persistent, peer?) cache
  Which intermediate results to save to cache
  Partition size
  Level of detail (e.g. to satisfy response-time budget)
  Whether to decimate surfaces to fit in graphics RAM
  Whether to construct (and cache) index for multiple isosurfaces
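The Region of Interest step can be illustrated with a simple box-intersection test. This is a hedged sketch: the format of the XML index is not given in the talk, so the index is assumed here to be a dictionary mapping a partition name to its axis-aligned bounding box, and `Box`, `intersects` and `partitions_in_roi` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned cuboid given by its minimum and maximum corners."""
    lo: Tuple[float, float, float]
    hi: Tuple[float, float, float]


def intersects(a, b):
    """True if the two cuboids overlap or touch along every axis."""
    return all(a.lo[i] <= b.hi[i] and b.lo[i] <= a.hi[i] for i in range(3))


def partitions_in_roi(index, roi):
    """Names of the partitions whose bounding box intersects the RoI."""
    return [name for name, box in index.items() if intersects(box, roi)]


if __name__ == "__main__":
    # Assumed index contents: three partitions with made-up bounding boxes.
    index = {
        "p0": Box((0, 0, 0), (1, 1, 1)),
        "p1": Box((1, 0, 0), (2, 1, 1)),
        "p2": Box((5, 5, 5), (6, 6, 6)),
    }
    roi = Box((0.5, 0.5, 0.5), (1.5, 0.8, 0.8))
    print(partitions_in_roi(index, roi))  # only these need loading from disk
```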

Back to DSI…
Standard questions: When is DSI a good idea? When is it applicable? How do you implement it? Show me an example!
When is DSI a good idea? When:
  You can’t analyse the client code
  The client code is too complex to analyse statically
  The client composes library code dynamically
  The overheads are small compared to library functions’ execution time

Back to DSI…
When is it applicable? When:
  Execution of library code can be delayed
  All dependencies between client and library code are explicit in library API
  Library data structures are opaque

Back to DSI…
How do you implement it? Interpose proxy:
  Built by hand
  Using generic proxy mechanism based on reflection, as shown in Python
  Using IDL-based parameter marshalling
  Using aspect weaver (but…)

Back to DSI…
Show me an example!
We have used the DSI trick several times. So have lots of other people…
  MayaVi/Python/VTK
  Message fusion and scheduling in parallel programming
  Loop fusion in a matrix/vector library
  Aggregation of Java RMI (correctness issues are tricky)
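To make the loop-fusion example concrete, here is a small delayed-evaluation vector sketch. It is an illustration only and not the authors' matrix/vector library: the `Vec` class and its methods are hypothetical, and it fuses by composing Python closures rather than by generating code. Each operation extends the recipe, and `force` evaluates the whole expression in one loop without intermediate vectors.

```python
class Vec:
    """Delayed vector expression; nothing is computed until force() or dot()."""

    def __init__(self, length, element_at):
        self.length = length
        self._element_at = element_at  # function: index -> element value

    @classmethod
    def from_list(cls, data):
        data = list(data)  # private copy, so later client writes are harmless
        return cls(len(data), lambda i: data[i])

    def __add__(self, other):
        if self.length != other.length:
            raise ValueError("length mismatch")
        # No loop runs here; the addition is only recorded.
        return Vec(self.length,
                   lambda i: self._element_at(i) + other._element_at(i))

    def scale(self, factor):
        return Vec(self.length, lambda i: factor * self._element_at(i))

    def force(self):
        # Force point: one fused loop computes every recorded operation.
        return [self._element_at(i) for i in range(self.length)]

    def dot(self, other):
        # Returns a plain scalar, so it has to compute immediately.
        if self.length != other.length:
            raise ValueError("length mismatch")
        return sum(self._element_at(i) * other._element_at(i)
                   for i in range(self.length))


if __name__ == "__main__":
    a = Vec.from_list([1, 2, 3])
    b = Vec.from_list([10, 20, 30])
    c = (a + b).scale(2)   # builds a recipe, no element is touched yet
    print(c.force())       # single pass over the data
    print(a.dot(b))
```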

What makes DSI hard to implement?
Non-opaque return values
  E.g. vector type is opaque, but dot-product returns a non-opaque scalar
Exceptions
  Delayed execution shifts the point where errors are discovered
Unnecessary force-points
  E.g. property getter methods
Hidden dependencies
  E.g. we can aggregate remote method calls provided none of them results in a call back that can affect the caller JVM
Antidependencies
  Client overwrites operand of delayed call
(Next to Last slide)
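The antidependency problem is easy to reproduce. In the sketch below, `DelayedCalls` is a hypothetical recipe holder: without protection, a client write between the delayed call and the force point changes the result. Copying the operand when the call is recorded is one crude remedy, assumed here purely for illustration; forcing execution before the write would be another.

```python
import copy


class DelayedCalls:
    """Hypothetical recipe of (function, operand) pairs run at a force point."""

    def __init__(self, snapshot_operands):
        self.snapshot_operands = snapshot_operands
        self._recipe = []

    def delay(self, func, operand):
        if self.snapshot_operands:
            # Private copy: later client writes cannot reach the delayed call.
            operand = copy.deepcopy(operand)
        self._recipe.append((func, operand))

    def force(self):
        results = [func(operand) for func, operand in self._recipe]
        self._recipe.clear()
        return results


def run(snapshot_operands):
    vector = [1, 2, 3]
    calls = DelayedCalls(snapshot_operands)
    calls.delay(sum, vector)   # client expects the sum of [1, 2, 3]
    vector[0] = 100            # client overwrites operand of the delayed call
    return calls.force()


if __name__ == "__main__":
    print(run(snapshot_operands=False))  # sees the overwritten operand
    print(run(snapshot_operands=True))   # sees the operand as it was
```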

DSI using AOP
DSI proxying is a form of “around advice”; cf. in AspectJ:

    void around() : dsiCalls() {
        Runnable worker = new Runnable() {
            public void run() { proceed(); }
        };
        Recipe.add(worker);
    }

But to handle methods with return values, we need a more powerful tool than AspectJ. Can the dataflow pointcut mechanism of Masuhara and Kawauchi do it?
See
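A rough Python analogue of this around advice is a decorator that records the call instead of running it. This is only a sketch: `delayed`, `run_recipe`, `scale` and the module-level `recipe` list are hypothetical names. It also shows the return-value difficulty, since the wrapper has nothing meaningful to return at call time.

```python
import functools

recipe = []  # delayed calls, in program order


def delayed(func):
    """Plays the role of around advice: record the call, do not run it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        recipe.append(functools.partial(func, *args, **kwargs))
        # The call has not run, so there is no real result to hand back.
        return None
    return wrapper


def run_recipe():
    """Force point: run every delayed call and return the results in order."""
    results = [call() for call in recipe]
    recipe.clear()
    return results


@delayed
def scale(values, factor):
    return [factor * v for v in values]


if __name__ == "__main__":
    print(scale([1, 2], 3))  # None: the work is only recorded
    print(run_recipe())      # the results appear at the force point
```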

Conclusions/discussion
DSI is not new
  But just keeps popping up, solves tricky problems
DSI programs are program generators
  Type safety of the recipe derives from type safety of the client (so DSI interpreter could be tagless)
  Safety of optimising transformations is another matter…
DSIs can be JITs
  E.g. our C++ matrix/vector library uses a multistage programming library to generate C loops at runtime (and fuse them)
There is a useful catalogue of techniques to enhance DSI applicability, overheads etc.
Last slide

Related stuff…
Lazy evaluation, with reflection
Template metaprogramming: encode recipe in type
Proxy interposition trick is common in dynamically-typed languages:
  Redefining the lookup function in Common Lisp
  The “doesNotUnderstand: hack” in Smalltalk
The idea of converting a call to a message…
  Message-Oriented Programming: The Case for First Class Messages (Dave Thomas, JOT 2004)
Tomasulo-style renaming to prevent antidependences from forcing execution
Compare with explicit recipe construction: workflow systems, command objects, LINQ

Examples of DSIs in action
Kwok Yeung’s delayed-evaluation self-optimising Java RMI
Quinlan and Isaacs A++ library
Olav Beckmann’s DESOBLAS library
Ruenger and Rauber’s rescheduling; MPI (Fabrizio Petrini et al’s bulk-synchronous BCS-MPI), BSP
Thomas Jensen’s Communication Fusion library (aggregates MPI collectives)