Presentation is loading. Please wait.

Presentation is loading. Please wait.

SmartApps: Application Centric Computing with STAPL Lawrence Rauchwerger Parasol Lab, Dept of Computer Science, Texas.

Similar presentations


Presentation on theme: "SmartApps: Application Centric Computing with STAPL Lawrence Rauchwerger Parasol Lab, Dept of Computer Science, Texas."— Presentation transcript:

1 SmartApps: Application Centric Computing with STAPL Lawrence Rauchwerger http://parasol.tamu.edu/~rwerger Parasol Lab, Dept of Computer Science, Texas A&M N. Amato, B. Stroustrup, M. Adams

2 SmartApps Architecture Compiled code + runtime hooks Static STAPL Compiler Augmented with runtime techniques Predictor & Optimizer STAPL Application advanced stages development stage Toolbox Toolbox Get Runtime Information (Sample input, system information, etc.) Execute Application Continuously monitor performance and adapt as necessary Predictor & Optimizer Predictor & Evaluator Adaptive Software Runtime tuning (w/o recompile) Compute Optimal Application and RTS + OS Configuration Recompute Application and/or Reconfigure RTS + OS Configurer Predictor & Evaluator Smart Application Small adaptation (tuning) Large adaptation (failure, phase change) DataBase Adaptive RTS+ OS

3 Outline l SmartApps Concept l STAPL is vehicle we use to develop the SmartApps Framework l STAPL –Overview –High Level Adaptivity in STAPL l Consistency Models l FAST - Framework for Algorithm Selection & Tuning –RTS –Compiler (Pivot)

4 STAPL: Standard Template Adaptive Parallel Library STAPL: A library of parallel, generic constructs based on the C++ Standard Template Library (STL). – Components for Program Development  pAlgorithms, pContainers, Views, pRange – Portability and Optimization  STAPL RTS and Adaptive Remote Method Invocation (ARMI) Communication Library  Framework for Algorithm Selection and Tuning (FAST)

5 Applications Using STAPL l Particle Transport - TAXI l Bioinformatics - Protein Folding l Geophysics - Seismic Ray Tracing l Aerospace - MHD –Seq. “Ctran” code (7K LOC) –STL (1.2K LOC) –STAPL (1.3K LOC)

6 Outline l SmartApps Concept l STAPL is vehicle we use to develop the SmartApps Framework l STAPL –Overview –High Level Adaptivity in STAPL l Consistency Models l FAST - Framework for Algorithm Selection & Tuning –RTS –Compiler (Pivot)

7 Consistency Models Processor Consistency (default and currently supported model) –Accesses from a processor on another’s memory are sequential –Requires in-order processing of RMIs l Limited parallelism Object Consistency (future) –Accesses to different objects can happen out of order –Uncovers fine-grained parallelism l Accesses to different objects are concurrent l Potential gain in scalability –Can be made default for specific computational phases Mixed Consistency (future) –Use Object Consistency on select objects l Selection of objects fit for this model can be: u Elective – the application can specify that an object’s state does not depend on others’ states. u Detected – if it is possible to assert the absence of such dependencies –Use Processor Consistency on the rest

8 Algorithm Selection Problem Description: Given multiple implementations of an abstract operation with a specified execution environment and input data, choose one that maximizes performance l Our Objective: Create general framework for parallel algorithm selection –Applicable to any abstract operation: sorting, matrix multiplication, convex hull, … –Flexible specification of execution environment and input data parameters l Environment: #procs, memory interconnection, cache, thread management, OS policies l Input data: data type, layout, size, other properties (e.g., sortedness of input) –Generic modeling interface – interchange different approaches Use generic machine learning approaches l Our Strategy: FAST: Framework for Algorithms Selection & Tuning –Integrated approach within STAPL –Selection is transparent to end user –Library adaptively choose the best algorithm at run-time from a library of implementation choices

9 FAST Architecture

10 Problem Specification User specifies parameters that may affect performance (e.g., num processors, input size, algorithm specific) and ranges (e.g, input size=100M..200M). User supplies list of candidate implementations.

11 FAST Architecture Metric Collection User provides method to collect data about an implementation execution (e.g., presortedness measure).

12 FAST Architecture Instance Selection FAST selects instances to execute and use when making decision model.

13 FAST Architecture Instance Execution FAST invokes user interface to execute test instances and collect metrics and timings.

14 FAST Architecture Database Insertion FAST inserts results of training instances into database.

15 FAST Architecture Model Generation FAST uses training instances to generate decision model.

16 FAST Architecture Selection Querying FAST provides interface for user application to query generated model.

17 Parallel Sorting: Experimental Results SGI Altix Selection Model SGI Altix Validation Set (V1) – 100% Accuracy N=120M Adaptive Performance Penalty Attributes for Selection Model –Processor Count –Data Type –Input Size –Max Value (impacts radix sort) –Presortedness

18 Parallel Sorting - Altix Relative Performance (V2) l Model obtains 99.7% of the possible performance. l Next best algorithm (sample) provides only 90.4%.

19 FAST: Future Work l Improve and refine algorithm selection framework –Refine sampling for training input selection –Algorithm developer specified models –Online feedback/model refinement (incremental learning) –Machine characterization & microbenchmarks l Expand employment in STAPL. –pContainers - locking strategies, data (re)distribution, consistency model –pAlgorithms - more algorithms, parameter tuning –pRange - work granularity and scheduling policies –ARMI Runtime System - dynamic aggregation.

20 Outline l SmartApps Concept l STAPL is vehicle we use to develop the SmartApps Framework l STAPL –Overview –High Level Adaptivity in STAPL l Consistency Models l FAST - Framework for Algorithm Selection & Tuning –RTS –Compiler (Pivot)

21 RTS – Current state Application Specific Parameters Smart Application STAPL RTS ARMIExecutor K42 User-Level Dispatcher Kernel Scheduler (no custom scheduling, e.g. NPTL) Operating System Memory Manager Advanced stage Experimental stage: multithreading ARMIExecutor Comm. Thread RMI Thread Task Thread Custom scheduling Kernel scheduling

22 ARMI – Current State ARMI: Adaptive Remote Method Invocation –Abstraction of shared-memory and message passing communication layer (MPI, pThreads, OpenMP, mixed, Converse). –Programmer expresses fine-grain parallelism that ARMI adaptively coarsens to balance latency versus overhead. –Support for sync, async, point-to-point and group communication. –Automated (de)serialization of C++ classes. ARMI can be as easy/natural as shared memory and as efficient as message passing.

23 ARMI Communication Primitives Point to Point Communication armi_async - non-blocking: doesn’t wait for request arrival or completion. armi_sync - blocking and non-blocking versions. Collective Operations armi_broadcast, armi_reduce, etc. can adaptively set groups for communication. Synchronization armi_fence, armi_barrier - fence implements distributed termination algorithm to ensure that all requests sent, received, and serviced. armi_wait - blocks until at least at least one request is received and serviced. armi_flush - empties local send buffer, pushing outstanding to remote destinations.

24 RTS Current Work - Multithreading In ARMI l Specialized communication thread dedicated the emission and reception of messages –Reduces latency, in particular on SYNC requests l Specialized threads for the processing of RMIs –Uncovers additional parallelism (RMIs from different sources can be executed concurrently) –Provides a suitable framework for future work on relaxing the consistency model and on the speculative execution of RMIs In the Executor l Specialized threads for the execution of tasks –Concurrently execute ready tasks from the DDG (when all dependencies are satisfied)

25 TAXI Experimental Results l uBGL - 1024 node, 2048 processor unclassified Blue Gene cluster at LLNL. l Results shown for two machine configurations. –Virtual Mode - two independent computation processes per node. –Coprocessor - one processor per node for computation and one for communication.

26 TAXI Results on uBGL

27

28 RTS – The Next Generation Application Specific Parameters Virtualization Layer Provides a unified view of the RTS services Smart Application STAPL RTS implementation ARMI Communicator Executor Smart User-Level Scheduler K42 User-Level Dispatcher Kernel Scheduler (no custom scheduling, e.g. NPTL) Other RTS (E.g., Converse, Marcel …) Unsupported System Best Affinity Supported System Memory Manager M:N Threading Model 1:1 Scheduler Virtualization Synchronization Server (fence, distributed locks …)

29 RTS Virtualization Another level of portability –Allow porting the Smart Application to another runtime system –Achieving the best possible affinity between RTS and OS If another RTS is better for a specific system, use it! –Abstract the complexity of supporting a heterogeneous group of heterogeneous systems Avoid the special-case implementation paradigm –Providing support for multiple architectures is costly l In time and maintenance l In performance (a system supported as a special case will invariably present poorer performance than the principal targets). –Why emulate the wheel ?

30 RTS Threading Models 1:1 threading model : (1 user-level thread mapped onto 1 kernel thread) –Default kernel scheduling –Heavy kernel threads M:N threading model : (M user-level threads mapped onto N kernel threads) –Customized scheduling l Enables scheduler-based optimizations (e.g. priority scheduling, better support for relaxing the consistency model …) –Light user-level threads l Smaller threading cost u Can match N with the number of available hardware threads : no kernel-thread swapping, no preemption, no kernel over-scheduling … u User-level thread scheduling requires no kernel trap l Perfect and free load balancing within the node u User-level threads are cooperatively scheduled on the available kernel threads (they migrate freely).

31 RTS Executor Customized task scheduling –Executor maintains a ready queue (all tasks for which dependencies are satisfied in the DDG) –Order tasks from the ready queue based on a scheduling policy (e.g. round robin, static block or interleaved block scheduling, dynamic scheduling …) –The RTS decides the policy, but the user can also specify it himself –Policies can differ for every pRange Customized load balancing –Implement load balancing strategies (e.g. work stealing) –Allow the user to choose the strategy –K42 : generate a customized work migration manager

32 RTS Synchronization l Efficient implementation of synchronization primitives is crucial –One of the main performance bottlenecks in parallel computing –Common scalability limitation Fence –Efficient implementation using a novel Distributed Termination Detection algorithm Global Distributed Locks –Symmetrical implementation to avoid contention –Support for logically recursive locks (required by the compositional SmartApps framework) Group-based synchronization –Allows efficient usage of ad-hoc computation groups –Semantic equivalent of the global primitives –Scalability requirement for large-scale systems

33 Outline l SmartApps Concept l STAPL is vehicle we use to develop the SmartApps Framework l STAPL –Overview –High Level Adaptivity in STAPL l Consistency Models l FAST - Framework for Algorithm Selection & Tuning –RTS –Compiler (Pivot)

34 The Pivot: Static Analysis of C++ Applications (lead: Bjarne Stroustrup) Compiler IPR XPR Tool 2 Tool 1 C++ source Object code C++ source IDL Tool 4 “information” Tool 3 Specialized representation (e.g. flow graph) Compiler

35 Context for the Pivot l Semantically Enhanced Library (Language) –Enhanced notation through libraries –Restrict semantics through tools l And take advantage of that semantics C++ Domain Specific Library Semantic Restriction s Bell Labs Proverbs Library design is language design Language design is library design

36 Context for the Pivot l Provide the advantages of specialized languages –Without introducing new “special purpose” languages –Without supporting special-purpose language tool chains –Avoiding the 99.?% language death rate l Provide general support for the SELL idea –Not just a specialized tool per application/library –The Pivot fits here C++ Domain Specific Library Semantic Restriction s

37 Current and future work l Complete infrastructure –Complete EDG and GCC interfaces –Represent headers (modularity) directly –Complete type representation in XPR l Initial applications –Style analysis l including type safety and security – Analysis and transformation of STAPL programs l Build alliances l Currently: compiles Firefox l Inserts measurement & control for SmartApps

38 Conclusion l STAPL is a good platform for developing SmartApps –used on several large-scale, complex applications –We plan to release a version soon to friendly users l Intel TBB – The simple STAPL for multicores  University research into commercial product l More info at http://parasol.tamu.edu/http://parasol.tamu.edu/


Download ppt "SmartApps: Application Centric Computing with STAPL Lawrence Rauchwerger Parasol Lab, Dept of Computer Science, Texas."

Similar presentations


Ads by Google