1 Software Performance Optimisation Group
Domain-specific interpreters (a nested talk)
Paul Kelly (Imperial College London)
Joint work with Olav Beckmann, Karen Osmond, Tony Field and others
Dagstuhl, January 2006

2 Domain-specific optimisation
- Libraries extend general-purpose languages
- Good libraries promote problem-focused code
- “Active libraries” apply library-specific optimisations to client code
- The client's calling context may enable optimisations: fusion, redundancy elimination, incrementalisation, etc.

Client:

    C a = new C(…);
    C b = new C(…);
    …
    c = a.f(…);
    …
    print( b.g(c) );

Library:

    constructor C(…);
    f(…) {…}
    g(…) {…}

3 Active library technologies
How to deliver “active libraries”?
- Domain-specific compiler?
- Source-to-source transformation?
- Plug-in-based compiler architecture?
- Plug-in-based virtual machine? “Domain-specific optimisation components”
- Aspect weaver?
This talk is about an appealingly low-tech solution, which we glorify with a big name: the “Domain-Specific Interpreter”

5 Domain-specific interpreter
The DSI is interposed between client and library:
- Inject a proxy between application and library
- Use the proxy to capture, delay and optimise the calls
- Delay execution, build a “recipe”
- Plan optimised execution, then execute

Client:

    C a = new C(…);
    C b = new C(…);
    …
    c = a.f(…);
    …
    print( b.g(c) );

Library:

    constructor C(…);
    f(…) {…}
    g(…) {…}

6 Domain-specific interpreter
DSI is a design pattern. Standard questions:
- When is DSI a good idea?
- When is it applicable?
- How do you implement it (in your favoured language)?
- Show me an example!
Let’s do the example first…

10 MayaVi
- Tool for visualising fluid flows
- GUI supports interactive construction of visualisation pipelines
- E.g. fluid flow past a heated sphere: temperature isosurface with temperature-shaded streamtubes
I’m going to show you how we dramatically improved MayaVi interactivity:
- By parallel execution on SMP
- By parallel execution on a Linux cluster
- By caching pre-calculated results
- Without changing a single line of MayaVi or VTK code*
- Without writing a compiler
* Actually we did change a few lines in VTK to fix a problem with Python’s Global Interpreter Lock

12 MayaVi: Working on partitioned data
- Our ocean simulations are generated in parallel
- Input data consists of a set of partitions (and an XML index)
- Normally, VTK fuses these partitions into one mesh as they are read
- Some (in fact many) analyses can operate partition-by-partition
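As a rough illustration of what "partition-by-partition" can mean, the sketch below runs a per-partition analysis on each partition independently and combines the partial results. The function names and the toy data are assumptions made for this example; it does not use VTK. Threads are used only for brevity: whether they give a real speedup depends on the library releasing Python's Global Interpreter Lock.

```python
from concurrent.futures import ThreadPoolExecutor


def count_above(partition, threshold):
    """Per-partition analysis: needs no data from any other partition."""
    return sum(1 for v in partition if v > threshold)


def analyse_partitioned(partitions, threshold, workers=2):
    """Run the analysis on each partition separately, then combine."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda p: count_above(p, threshold), partitions))
    return sum(partial)


# Toy data standing in for a partitioned scalar field (hypothetical values).
partitions = [[0.1, 0.7, 0.9], [0.2, 0.3], [0.8, 0.95, 0.4]]

# What the unpartitioned path would do: fuse first, then analyse.
fused = [v for p in partitions for v in p]
```

Here `analyse_partitioned(partitions, 0.5)` gives the same answer as `count_above(fused, 0.5)` without ever building the fused mesh.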

13 MayaVi: what the DSI has to do
- Capture all delayable calls to methods from a DSL through a proxy layer
- A force point is a call which requires an immediate result; in this case, to render on screen
- A recipe is the set of calls between consecutive force points (in parallel)
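The three ideas on this slide (capture, force point, recipe) can be sketched in a few lines. This is a minimal illustration, not the MayaVi DSI: `Recipe`, `DelayingProxy` and the `Pipeline` stand-in are hypothetical names, and the "optimisation" step is left as a comment.

```python
class Recipe:
    """Ordered list of delayed calls captured between two force points."""

    def __init__(self):
        self.calls = []

    def add(self, target, name, args):
        self.calls.append((target, name, args))

    def execute(self):
        # A real DSI would plan here: fuse, parallelise, consult a cache.
        results = [getattr(t, name)(*args) for t, name, args in self.calls]
        self.calls.clear()
        return results


class DelayingProxy:
    """Forwards calls to `real`, delaying everything except force points."""

    def __init__(self, real, recipe, force_points):
        self._real = real
        self._recipe = recipe
        self._force_points = force_points

    def __getattr__(self, name):
        def call(*args):
            self._recipe.add(self._real, name, args)
            if name in self._force_points:
                return self._recipe.execute()[-1]
            return None  # delayed: no result is available yet
        return call


class Pipeline:
    """Hypothetical stand-in for a visualisation library object."""

    def __init__(self):
        self.log = []

    def set_input(self, path):
        self.log.append(("set_input", path))

    def set_value(self, v):
        self.log.append(("set_value", v))

    def render(self):
        self.log.append(("render",))
        return len(self.log)


real = Pipeline()
recipe = Recipe()
proxy = DelayingProxy(real, recipe, force_points={"render"})
proxy.set_input("ocean.vtu")
proxy.set_value(0.5)
pending_before_render = len(recipe.calls)  # calls captured, none executed
log_before_render = list(real.log)
frame = proxy.render()                     # force point: the recipe runs now
```

Until `render` is called the real object sees nothing, which is what gives the DSI room to rearrange the recipe.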

14 Implementing a generic DSI proxy in Python

Self-generating proxy module:

    import vtkpython_real
    from vtkdsi import proxyObject

    # Define an empty proxy class for every public name in the real module
    for className in dir(vtkpython_real):
        if not className.startswith("_"):
            exec("class " + className + "(proxyObject): pass")

Proxy implementation (module vtkdsi):

    class proxyObject:
        def __getattr__(self, callName):
            # Reached only when normal attribute lookup fails
            return lambda *callArgs: self.proxyCall(callName, callArgs)

        def proxyCall(self, callName, callArgs):
            # if forcepoint: optimise and apply recipe
            # else: add call to the current recipe
            pass

Actually, the real implementation generates dummies for all the methods and members as well as the classes, so when MayaVi reflects on the module to generate the GUI configuration forms it finds the right stuff.

15 In Python it’s remarkably easy to interpose the proxy
- We replace the “vtkPython” wrapper module with a module containing dummy definitions of every class in vtkPython
- When Python finds no method implementation for these classes, it passes the method name to “__getattr__”, which returns a callable that receives the arguments
- This bounces the call to our DSI method “proxyCall”
- Which eventually calls the real vtkPython module
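The module swap described here can be tried out without VTK. The sketch below builds a toy "real" module, generates a proxy module with one dummy class per real class, and installs it under the name the client imports. All names (`toylib`, `toylib_real`, `ProxyObject`, `make_proxy_module`) are made up for this example, and it uses `type()` rather than `exec` to create the dummy classes.

```python
import sys
import types

# Hypothetical stand-in for the real wrapper module (not VTK itself).
real = types.ModuleType("toylib_real")


class Source:
    def value(self):
        return 42


real.Source = Source

calls = []  # record of intercepted calls, for illustration only


class ProxyObject:
    _real_class = None

    def __init__(self, *args):
        self._target = self._real_class(*args)

    def __getattr__(self, name):
        # Reached only when normal lookup finds no attribute called `name`.
        def bounce(*args):
            calls.append((type(self).__name__, name, args))
            return getattr(self._target, name)(*args)
        return bounce


def make_proxy_module(real_module, proxy_name):
    """Build a module with one dummy proxy class per class in real_module."""
    proxy = types.ModuleType(proxy_name)
    for class_name, obj in vars(real_module).items():
        if isinstance(obj, type):
            dummy = type(class_name, (ProxyObject,), {"_real_class": obj})
            setattr(proxy, class_name, dummy)
    return proxy


# Install the proxy under the name the client imports.
sys.modules["toylib"] = make_proxy_module(real, "toylib")

import toylib  # client code is unchanged; it now gets the proxy

answer = toylib.Source().value()
```

The client still writes `toylib.Source().value()`, but every method call now passes through `bounce`, where a DSI could delay it instead of forwarding it immediately.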

16 How well does it work?
Benchmark:
- Plot isosurfaces for seven pressure values in flow past heated sphere
- Each isosurface is several hundred MB
Hardware:
- For SMP: Athlon 1600+, dual SMP, 256 KB L2, 1 GB RAM, Linux 2.4
- For distributed memory: cluster of 4 Pentium 4 2.8 GHz, 512 KB L2, 1 GB RAM, Linux 2.4

18 How well does it work? (results)
- Tiling optimisation yields substantial speedup
- Modest further speedup from two-way shared-memory parallelism
- Parallel execution on a four-processor Linux cluster also offers substantial speedup
(Chart caption: Isosurface benchmark: cluster of four 2GHz Pentium 4 PCs)

21 Further MayaVi DSI optimisations
- Caching: check whether results of this recipe (or part thereof) are available in cache. Multiple frames per second…
- Region of Interest (RoI): load from disk only those partitions which intersect a cuboid specified by the user
- Level of Detail (LoD): each dataset is stored in full-resolution form but also in a hierarchy of coarsened, decimated versions
- Put together… “Google Earth” for global ocean flow
Large space of possible execution plans for each visualisation task. Choose:
- Appropriate parallelisation
- Whether to recalculate or retrieve from (remote, persistent, peer?) cache
- Which intermediate results to save to cache
- Partition size
- Level of detail (e.g. to satisfy response-time budget)
- Whether to decimate surfaces to fit in graphics RAM
- Whether to construct (and cache) index for multiple isosurfaces
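One way to read "results of this recipe (or part thereof)" is as a cache keyed by recipe prefixes. The sketch below is an assumption about how that could look, not the MayaVi implementation: it supposes every stage is a pure function of its input and arguments and that arguments are hashable. The stage names and the in-memory dictionary cache are hypothetical.

```python
def run_recipe(recipe, library, cache):
    """Execute `recipe`, a tuple of (stage name, argument tuple) pairs,
    reusing the result of the longest prefix already held in `cache`.
    Returns (result, number of stages actually executed)."""
    start, value = 0, None
    for n in range(len(recipe), 0, -1):
        if recipe[:n] in cache:
            start, value = n, cache[recipe[:n]]
            break
    executed = 0
    for n in range(start, len(recipe)):
        name, args = recipe[n]
        value = library[name](value, *args)
        cache[recipe[:n + 1]] = value  # save every intermediate result
        executed += 1
    return value, executed


# Hypothetical pure stages standing in for expensive library filters.
library = {
    "load": lambda _prev, n: list(range(n)),
    "threshold": lambda data, t: [v for v in data if v >= t],
    "total": lambda data: sum(data),
}

cache = {}
first = run_recipe((("load", (10,)), ("threshold", (5,)), ("total", ())), library, cache)
second = run_recipe((("load", (10,)), ("threshold", (7,)), ("total", ())), library, cache)
third = run_recipe((("load", (10,)), ("threshold", (7,)), ("total", ())), library, cache)
```

The second recipe reuses the cached `load` stage and the third runs nothing at all. Saving every intermediate result is the simplest policy; the slide's point is that which results to keep is itself a planning decision.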

22 Back to DSI…
Standard questions: When is DSI a good idea? When is it applicable? How do you implement it? Show me an example!
When is DSI a good idea? When:
- You can’t analyse the client code
- The client code is too complex to analyse statically
- The client composes library code dynamically
- The overheads are small compared to library functions’ execution time

23 Back to DSI…
When is it applicable? When:
- Execution of library code can be delayed
- All dependencies between client and library code are explicit in the library API
- Library data structures are opaque

24 Back to DSI…
How do you implement it? Interpose a proxy:
- Built by hand
- Using a generic proxy mechanism based on reflection, as shown in Python
- Using IDL-based parameter marshalling
- Using an aspect weaver (but…)

25 Back to DSI…
Show me an example! We have used the DSI trick several times (so have lots of other people…):
- MayaVi/Python/VTK
- Message fusion and scheduling in parallel programming
- Loop fusion in a matrix/vector library
- Aggregation of Java RMI (correctness issues are tricky)

26 What makes DSI hard to implement?
- Non-opaque return values: e.g. the vector type is opaque, but dot-product returns a non-opaque scalar
- Exceptions: delayed execution shifts the point where errors are discovered
- Unnecessary force points: e.g. property getter methods
- Hidden dependencies: e.g. we can aggregate remote method calls provided none of them results in a callback that can affect the caller JVM
- Antidependencies: the client overwrites an operand of a delayed call
(Next to last slide)
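The antidependency problem is easy to reproduce. In the sketch below a delayed call keeps a reference to its operand, the client then overwrites the operand, and the delayed call sees the new value. Copying the operand at capture time is shown as one crude remedy; it is an assumption for illustration, not a claim about how any of the libraries in this talk handle it. `DelayedSum` is a hypothetical class.

```python
import copy


class DelayedSum:
    """Hypothetical delayed operation: sums its operand when forced."""

    def __init__(self, operand, snapshot):
        # snapshot=True copies the operand when the call is captured,
        # so later writes by the client cannot affect the delayed call.
        self.operand = copy.deepcopy(operand) if snapshot else operand

    def force(self):
        return sum(self.operand)


data = [1, 2, 3]
unsafe = DelayedSum(data, snapshot=False)
safe = DelayedSum(data, snapshot=True)

data[0] = 100  # client overwrites the operand before the force point

unsafe_result = unsafe.force()  # sees the overwritten operand
safe_result = safe.force()      # sees the operand as it was at capture time
```

Eager execution would have returned 6 for both. The alternative to copying is to treat the overwrite as a force point, which is correct but gives up the delay.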

27 DSI using AOP
DSI proxying is a form of “around advice”; cf. in AspectJ:

    void around() : dsiCalls() {
        Runnable worker = new Runnable() {
            public void run() { proceed(); }
        };
        Recipe.add(worker);
    }

But to handle methods with return values, we need a more powerful tool than AspectJ. Can the dataflow pointcut mechanism of Masuhara and Kawauchi do it?
See http://www.cs.chalmers.se/~giese/aop/course4.pdf
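For comparison, the same "capture instead of call" idea can be written in Python with a decorator standing in for around advice. To deal with return values, this sketch hands back a placeholder that forces the recipe when its value is requested. The names `delayed`, `Delayed`, `flush` and `scale` are hypothetical, and the sketch says nothing about what AspectJ or dataflow pointcuts can express.

```python
import functools

recipe = []  # pending delayed calls, in program order


class Delayed:
    """Placeholder for the result of a call that has not run yet."""

    _UNSET = object()

    def __init__(self, thunk):
        self._thunk = thunk
        self._value = Delayed._UNSET

    def value(self):
        # Force point: run everything captured so far, in order.
        flush()
        return self._value


def flush():
    while recipe:
        handle = recipe.pop(0)
        handle._value = handle._thunk()


def delayed(func):
    """Decorator playing the role of around advice: capture, do not call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        handle = Delayed(lambda: func(*args, **kwargs))
        recipe.append(handle)
        return handle
    return wrapper


trace = []


@delayed
def scale(x, k):
    trace.append("scale")
    return x * k


h1 = scale(3, 2)
h2 = scale(5, 10)
ran_before_force = list(trace)  # nothing has executed yet
second = h2.value()             # forces both calls, in program order
first = h1.value()
```

The placeholder works here because the client asks for the value explicitly. A non-opaque return type that the client uses directly, such as a plain number, is exactly the hard case the previous slide lists.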

28 Conclusions/discussion
- DSI is not new, but it just keeps popping up and solves tricky problems
- DSI programs are program generators
  - Type safety of the recipe derives from type safety of the client (so the DSI interpreter could be tagless)
  - Safety of optimising transformations is another matter…
- DSIs can be JITs
  - E.g. our C++ matrix/vector library uses a multistage programming library to generate C loops at runtime (and fuse them)
- There is a useful catalogue of techniques to improve DSI applicability, reduce overheads, etc.
(Last slide)
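The "DSIs can be JITs" point can be illustrated on a small scale in Python: vector arithmetic builds an expression tree (the recipe), and forcing it generates the source of a single fused loop at runtime. This is an analogy only. The library described on the slide is C++ and emits C loops; the `Vec` class below and its use of `exec` are assumptions made for this sketch.

```python
class Vec:
    """Delayed vector expression: arithmetic builds a recipe, not a result."""

    def __init__(self, op, args):
        self.op, self.args = op, args  # op is "leaf", "+" or "*"

    @staticmethod
    def of(values):
        return Vec("leaf", [list(values)])

    def __add__(self, other):
        return Vec("+", [self, other])

    def __mul__(self, other):
        return Vec("*", [self, other])

    def _emit(self, inputs):
        """Return source text for one element; collect leaf data in inputs."""
        if self.op == "leaf":
            inputs.append(self.args[0])
            return f"x{len(inputs) - 1}[i]"
        left, right = (a._emit(inputs) for a in self.args)
        return f"({left} {self.op} {right})"

    def force(self):
        inputs = []
        body = self._emit(inputs)
        params = ", ".join(f"x{k}" for k in range(len(inputs)))
        source = (
            f"def fused({params}):\n"
            f"    return [{body} for i in range(len(x0))]\n"
        )
        namespace = {}
        exec(source, namespace)  # runtime code generation
        return namespace["fused"](*inputs), source


a = Vec.of([1, 2, 3])
b = Vec.of([10, 20, 30])
c = Vec.of([2, 2, 2])
result, source = ((a + b) * c).force()
```

The generated `fused` function contains one loop over `i`, so the intermediate vector `a + b` is never materialised.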

29 Related stuff…
- Lazy evaluation, with reflection
- Template metaprogramming: encode the recipe in the type
- The proxy interposition trick is common in dynamically-typed languages:
  - Redefining the lookup function in Common Lisp
  - The “doesNotUnderstand:” hack in Smalltalk
- The idea of converting a call to a message… “Message-Oriented Programming: The Case for First Class Messages” (Dave Thomas, JOT 2004)
- Tomasulo-style renaming to prevent antidependences from forcing execution
- Compare with explicit recipe construction: workflow systems, command objects, LINQ

30 Examples of DSIs in action
- Kwok Yeung’s delayed-evaluation self-optimising Java RMI
- Quinlan and Isaacs’ A++ library
- Olav Beckmann’s DESOBLAS library
- Ruenger and Rauber’s rescheduling MPI, Fabrizio Petrini et al’s bulk-synchronous BCS-MPI, BSP
- Thomas Jensen’s Communication Fusion library (aggregates MPI collectives)

