Slide 1: A performance model for X10 Apps
David Grove, Olivier Tardieu, David Cunningham, Ben Herta, Igor Peshansky, Vijay Saraswat

Slide 2: Purpose of this talk
Performance model:
- A detailed abstraction of the implementation of the language constructs
- Consequences and limitations of design & implementation decisions
- A guide for how to write well-performing programs in X10
Will try to distinguish between:
- By Design: properties that we believe all implementations must share
- Managed X10: information relating to the JVM-based implementation of X10
- Native X10: information relating to the C++-based implementation of X10
Work in progress: due to resource constraints, each implementation has known limitations.
However, with care it is possible to write well-performing programs in X10, right now.

Slide 3: Implementation Overview
(Diagram: the compilation pipeline)
- X10 Compiler Front-End: X10 Source -> Parsing / Type Check -> X10 AST -> AST Optimizations -> AST Lowering
- C++ Back-End (Native X10): C++ Code Generation -> C++ Source -> C++ Compiler (with the XRC/XRX runtime) -> Native Code, running in the native environment over X10RT
- Java Back-End (Managed X10): Java Code Generation -> Java Source -> Java Compiler (with the XRJ/XRX runtime) -> Bytecode, running on Java VMs and reaching X10RT via JNI

Slide 4: Constraints

public def m1(a2:Any{self!=null}) {
  a2.hashCode(); // null check elided
}
public def m2(a:Any) {
  a.hashCode(); // null check required
  val a2 = a as Any{self!=null}; // requires evaluation of the self!=null expression
  a2.hashCode(); // null check elided
  m1(a); // either implicit cast or type error (depending on compiler option)
}

By design:
- Constraints can enable optimisations
- Constraint checks are not free
- Without -STATIC_CALLS, extra casts will be added

Slide 5: Interfaces
By design:
- X10 interfaces = Java interfaces
- Many possible implementation techniques
Managed X10:
- X10 interfaces implemented with Java interfaces
Native X10:
- Not using multiple virtual inheritance (thunks, extra vtables, constructor overhead)
- Using a custom ITable search implementation
- Not as fast as Managed X10 (could be improved)
Managed & Native X10:
- 'implements' does not cost performance
- Interface calls are slower than virtual calls

Closures
By design:
- Function types are interfaces, so closure dispatch = interface dispatch
- Closure creation = new + shallow copy of the environment
- The compiler will constant-propagate and inline closures in scope
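The lowering described above can be sketched in plain Java. This is an illustration of the design (function type as interface, closure as allocation plus shallow environment copy), not the actual generated code; the names are hypothetical.

```java
// Sketch (hypothetical names): a function type lowered to an interface,
// with closure creation as 'new' + a shallow copy of the captured environment.
interface Fun_int_int {          // models the X10 function type (x:Int)=>Int
    int apply(int x);
}

public class ClosureLowering {
    static Fun_int_int makeAdder(int k) {
        final int captured = k;              // shallow copy of the environment
        return new Fun_int_int() {           // closure creation = new
            public int apply(int x) { return x + captured; }
        };
    }

    public static void main(String[] args) {
        Fun_int_int add5 = makeAdder(5);
        System.out.println(add5.apply(37));  // closure dispatch = interface dispatch
    }
}
```

Because every call goes through the interface, closure invocation pays the interface-dispatch cost noted above unless the compiler can see the closure's definition and inline it.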

Slide 6: Generics
By Design:
- Reified: instantiating types are known at runtime (e.g. for instanceof T)
Native X10:
- C++ templates; performance (and code bloat) is comparable
Managed X10:
- Java generics with extra fields/params to hold the reified type info

Structs
Native X10:
- Non-primitives (e.g. Complex) compiled to a class (no vtable though)
- No indirection, pass by value, copy on assignment
Managed X10:
- Non-primitives compiled to regular classes
- Unsigned integers are NOT primitives (thus slower)
- Indirection, pass by reference, alias on assignment
- Performance same as regular classes

Generics + Structs (e.g. ArrayList[Pair[Double,Double]])
Native X10:
- Elements are contiguous (no indirection), just like C++'s std::vector<pair<double,double>>
Managed X10:
- All elements are boxed (considerably slower than ArrayList[Double])
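The "extra fields to hold reified type info" idea can be illustrated in Java with a runtime type token. This is an analogy for the Managed X10 scheme, not its actual generated code; `Box` and its members are hypothetical.

```java
// Sketch (hypothetical names): Java generics are erased, so a reified scheme
// must pass the instantiating type as an extra runtime value, e.g. a Class token.
import java.util.ArrayList;
import java.util.List;

class Box<T> {
    private final Class<T> elemType;       // extra field holding the reified type
    private final List<T> items = new ArrayList<>();

    Box(Class<T> elemType) { this.elemType = elemType; }

    void add(T item) { items.add(item); }

    // With the token, an "instanceof T" question is answerable at runtime.
    boolean holds(Object o) { return elemType.isInstance(o); }
}

public class ReifiedDemo {
    public static void main(String[] args) {
        Box<String> b = new Box<>(String.class);
        b.add("hi");
        System.out.println(b.holds("also a string")); // true
        System.out.println(b.holds(42));              // false: 42 boxes to Integer
    }
}
```

Threading these tokens through every generic allocation is exactly the "extra fields/params" cost the slide mentions, and boxing in generic containers is why `ArrayList[Pair[Double,Double]]` is slower on Managed X10.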

Slide 7: Garbage Collector
Single place:
Native X10:
- Uses BDWGC
- Not as fast as state-of-the-art JVM GCs
- Avoid allocations if possible (e.g. use structs)
- Explicit Runtime.dealloc() supported (unsafe)
Managed X10:
- Uses the JVM GC

Distributed:
By Design:
- The object graph is distributed
- An object can be kept alive by a remote place via GlobalRef[T] (and friends)
- Remote pointers will incur additional GC / memory-management costs
- Such objects probably take longer to be reclaimed
Native & Managed X10:
- Objects pointed to by GlobalRef[T] are immortalized; use them sparingly
- The @Mortal annotation prevents immortalization, but is unsafe

Slide 8: Milking the Post Compiler
Managed X10:
- Expect standard 'hot code' JVM optimizations to kick in
Native X10:
- g++/xlC do not understand our ITable mechanisms
- Interface calls are currently not inlined or optimized by us either
- Final calls are devirtualised
- Only devirtualised calls are inlined by g++/xlC
Native X10 compilation units:
- Methods defined in a different .cc file will not be inlined
- Methods defined in a .h file will be inlined according to the usual g++/xlC logic
- Generic classes/functions are defined in .h files
- The @Header annotation on a method moves its definition to a .h file
- Manually concatenating .cc files will also enable cross-file optimizations

Exception Performance
By design:
- More relaxed model than Java: allows more re-ordering
- E.g. hoisting a final field access out of a loop moves the NPE earlier
- The rooted exception model complicates finish and work-stealing
Native/Managed X10:
- Exceptional control flow is considered 'the slow path' in most cases

Slide 9: Concurrency

finish {
  for (i in 0..1023) {
    async { … }
  }
}

By Design:
- All parallelism must be explicit (via the async construct)
- Lots of activities => one thread per activity is impractical; implementations must multiplex activities onto a single thread
- Lots of activities => one stack per activity is impractical; implementations must multiplex activity stacks onto worker stacks
- At least 2 scheduling policies are possible
- async spawn memory-management overhead ~= closure creation
- Activities are stealable by other cores => an async spawn is at least a CAS
- Stealing is not free, so balanced loads are better
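The multiplexing described above is close in spirit to Java's fork/join framework (which later slides compare it to). A rough Java analogue of the finish/async loop, illustrative only and not the X10 runtime itself:

```java
// Sketch: approximating X10's finish/async with Java's fork/join framework.
// This is an analogy, not the actual X10 runtime implementation.
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicLong;

public class FinishAsyncDemo {
    public static void main(String[] args) {
        AtomicLong sum = new AtomicLong();
        ForkJoinPool pool = new ForkJoinPool();  // workers with stealable deques
        // Analogue of: finish { for (i in 0..1023) async { sum += i; } }
        pool.invoke(new RecursiveAction() {      // invoke blocks like finish
            protected void compute() {
                RecursiveAction[] tasks = new RecursiveAction[1024];
                for (int i = 0; i < 1024; i++) {
                    final int v = i;
                    tasks[i] = new RecursiveAction() {          // one 'async' each
                        protected void compute() { sum.addAndGet(v); }
                    };
                    tasks[i].fork();  // push onto this worker's deque (the CAS-level spawn cost)
                }
                for (RecursiveAction t : tasks) t.join();
            }
        });
        System.out.println(sum.get());  // 0 + 1 + ... + 1023
    }
}
```

The `fork()` call is the Java counterpart of the "async spawn is at least a CAS" point: each spawn is a cheap deque push, and other workers may steal the task.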

Slide 10: Concurrency (continued)
Native/Managed X10:
- atomic / when use a per-place lock
- Lock/Monitor classes are available for fine-grained locking
Native/Managed X10 scheduling details:
- X10_NTHREADS controls the number of worker threads
- A worker will not preempt its running activity with another activity
- Activities have well-defined yield points (blocking constructs)
- The programmer can yield explicitly with Runtime.probe()
Native/Managed X10 stealing details:
- Each worker has a deque holding unexecuted asyncs
- Spawn an async: push onto the front
- Need more work: pop from the front
- If empty: steal from the end of another worker's deque
- If all deques are empty: spin (keep trying)
- If a deque is well-populated, there is no contention upon steal (the CAS does not fail)
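The deque discipline in the stealing bullets can be demonstrated with a ready-made concurrent deque. This only illustrates the access pattern; the real runtime uses a specialized lock-free deque, not this class.

```java
// Sketch of the deque discipline above: the owning worker treats the front
// as a LIFO stack, while thieves take from the opposite end.
import java.util.concurrent.ConcurrentLinkedDeque;

public class DequeDiscipline {
    public static void main(String[] args) {
        ConcurrentLinkedDeque<String> workerDeque = new ConcurrentLinkedDeque<>();
        // Spawned asyncs: the owner pushes onto the front.
        workerDeque.addFirst("task-1");
        workerDeque.addFirst("task-2");
        workerDeque.addFirst("task-3");
        // Owner needs work: pop from the front -> most recently spawned task.
        System.out.println("owner runs: " + workerDeque.pollFirst());
        // A thief steals from the end -> oldest task, so the owner and the
        // thief operate on opposite ends and rarely contend.
        System.out.println("thief steals: " + workerDeque.pollLast());
    }
}
```

Working opposite ends is why a well-populated deque sees no CAS failures on steal: owner and thief only collide when a single element remains.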

Slide 11: Scheduling for finish
Native/Managed X10:
- Mechanisms similar to Doug Lea's Fork/Join framework
- When activity A blocks, start executing a fresh activity B on the same stack
- Resume A when B terminates (and B's portion of the stack unwinds)

// activity A
finish {
  async {   // activity B starts
    ...     // activity B terminates
  }
}           // A blocks here
...         // rest of activity A

(Diagram: one worker stack growing from 'main' up through A to B.)
- Thus, activities A and B are multiplexed onto 1 stack
- Stacked activities are in finish-nesting order, down to the entry point 'main'
- The local deque always contains the activities which must progress to unblock finish

Slide 12: Other blocking constructs
Native/Managed X10:
- Unfortunately, the fork/join approach does not work with arbitrary synchronization

finish {
  var a:Boolean = false, b:Boolean = false;
  async {                // A pushes B onto the deque here
    atomic { b = true; }
    when (a == true);    // B blocks here
  }
  when (b == true);      // A blocks here, pops the deque, starts B
  atomic { a = true; }
}

Or even more simply:

clocked finish {
  clocked async { next; next; }
  next; next;
}

- The only way to unblock B is to progress A, but A is 'buried' in the stack
- A cannot progress until B has terminated (unwound)
- Solution: when A blocks, start a new thread to run B (abandoning fine-grained concurrency)
- Consequence: fine-grained concurrency can only be synchronized with finish

Slide 13: Distribution
By Design:
- All captured vars in an 'at' must be serialized and transmitted to the destination
- Additionally, all objects reachable from these vars must be transmitted too
- In general, a set of object graphs must be serialized; serialization overhead is thus non-trivial
- The object graph can be pruned by capturing fewer vars or with the @Transient annotation
- In particular, implicitly capturing 'this' is a common mistake:

class C {
  x:Int;
  unrelated:MassiveObject;
  def foo() {
    val x_ = x;
    at (p) {
      ...x...   // actually sugar for this.x
      ...x_...  // much better
    }
  }
}

- Since custom serialization has side-effects, automatically eliding serialization is difficult
- The Array.asyncCopy() API will do a direct array-to-array copy without serialization
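The same pitfall exists in plain Java serialization, which makes it easy to measure. In this sketch (hypothetical class, standing in for the X10 example above), a serializable lambda that reads a field captures the whole enclosing object, while copying the field to a local first keeps the payload small.

```java
// Sketch: a lambda reading an instance field captures 'this' (sugar for
// this.x), dragging the entire object graph into serialization; copying
// the field to a local first prunes the graph to just that value.
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo implements Serializable {
    int x = 7;
    byte[] unrelated = new byte[1_000_000];  // stands in for MassiveObject

    Runnable capturesThis() {
        // Reads the field, so the lambda captures 'this' (and 'unrelated').
        return (Runnable & Serializable) () -> System.out.println(x);
    }

    Runnable capturesCopy() {
        final int x_ = x;                    // copy the field to a local first
        return (Runnable & Serializable) () -> System.out.println(x_);
    }

    static int serializedSize(Object o) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws Exception {
        CaptureDemo c = new CaptureDemo();
        System.out.println(serializedSize(c.capturesThis()) > 1_000_000);  // drags 'unrelated' along
        System.out.println(serializedSize(c.capturesCopy()) < 10_000);     // just the int (plus metadata)
    }
}
```

The X10 `at (p) { ...x... }` case behaves analogously: `x` is sugar for `this.x`, so the whole of `this`, including `unrelated`, crosses the network unless a local copy `x_` is captured instead.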

Slide 14: Distributed finish
By design:
- finish / at / async nesting can be arbitrarily complex
- finish must track remote asyncs
- Extra communication is needed to implement distributed finish

finish {
  async at (p) {
    for (…) {
      async {
        async at (p2) { … }
      }
    }
  }
}

Native/Managed X10:
- At most 1 extra termination message per remote async
- Optimization: local accumulation of termination messages until quiescence
- More details in the paper
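The accumulation optimization can be sketched abstractly: instead of one network message per terminated async, a place tallies terminations locally and reports once when it goes quiescent. This is a simplified single-threaded model of the idea, with hypothetical names, not the X10 runtime's actual protocol.

```java
// Sketch (hypothetical protocol): a place batches async-termination counts
// locally and notifies the finish root only at local quiescence, instead of
// sending one termination message per async.
import java.util.concurrent.atomic.AtomicInteger;

public class TerminationBatching {
    static final AtomicInteger messagesSent = new AtomicInteger();

    static int liveLocalAsyncs = 0;      // asyncs currently running at this place
    static int localTerminations = 0;    // tally not yet reported to the root

    static void spawnLocal() { liveLocalAsyncs++; }

    static void terminateLocal() {
        liveLocalAsyncs--;
        localTerminations++;
        if (liveLocalAsyncs == 0) {      // quiescent: flush one batched message
            messagesSent.incrementAndGet();
            localTerminations = 0;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) spawnLocal();      // 100 asyncs at this place
        for (int i = 0; i < 100; i++) terminateLocal();
        // 100 terminations, but only one message to the finish root:
        System.out.println(messagesSent.get());
    }
}
```

This matches the slide's bound: at most one extra message per remote async in the worst case, and far fewer when many asyncs at one place terminate before quiescence.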

Slide 15: Static variables
By design:
- Static initialization runs at Place 0 before main, inside a finish
- Static fields in places > 0 are initialized by serialization and replication from Place 0
- If static state is large enough, this might cause a noticeable delay

Conclusion
- X10 (the language) has a novel performance model (particularly with respect to concurrent and distributed language constructs)
- Designed for speed, but even an ideal implementation cannot work miracles
- Programmers must understand and work according to the model
- Current X10 implementations have limitations
- The performance model is thus different for each implementation
- This must be taken into account for maximum performance

