Software Support for Advanced Computing Platforms Ananth Grama Professor, Computer Sciences and Coordinated Systems Lab., Purdue University.

Software Support for Advanced Computing Platforms Ananth Grama Professor, Computer Sciences and Coordinated Systems Lab., Purdue University. ayg@cs.purdue.edu http://www.cs.purdue.edu/pdsl

Building Applications for Next Generation Computing Platforms Emerging trends point to two disruptive technologies: –Architecture innovations from the desktop to scalable systems –Embedded intelligence and ubiquitous processing How do we program these platforms efficiently? Very little of what we have learned over three decades of parallel programming directly applies here.

Evolution of Microprocessor Architectures Chip-Multiprocessor Architectures Scalable Multicore Platforms Heterogeneous Multicore Processors Transactional Memory

Multicore Architectures -- An Overview The Myth: –Multicore processors are designed for speed. The Reality: Multicore processors are motivated by power considerations: –Power is proportional to clock speed –Power is quadratic in V dd –V dd can be reduced as clock speed is reduced –Computation speed is generally sublinear in clock speed

Multicore Architectures -- An Overview Collocate multiple processor cores on a single chip (a special class of chip-multiprocessors) Programming model is typically thread-based Many microprocessors are hardware compatible with existing motherboards (memory performance?) Memory systems vary widely across various vendors (AMD vs. Intel vs. IBM PowerPC/Cell)

Multicore Architectures -- Trends Current generation typically at dual- or quad-core Desktops and mobile dual-core variants available Scalable multicore: AMD and Intel both plan up to 16 cores in the next two years and up to 64 cores in the medium term. Heterogeneous multicore: some of the most commoly used processors today are heterogeneous multicore (network routers, ARM/TI DSPs in cell-phones).

Memory System Architecture Trading off latency and bandwidth (the Cell solution) Programmable caches Transactional Memory

Transactional Memory Overview Addresses problems of correctness of parallel programs as well as performance. Requires hardware support. Mitigates many of the problems associated with locks – composability, granularity, mixing correctness and performance.

Transactional Memory Overview begin_transaction x = x + 1 y = y + x if (x < 10) z = x; else z = y; end_transaction Thread 1 begin_transaction x = x - 1 y = y - x if (x > 10) z = x; else z = y; end_transaction Thread 2 Each thread sees either all, or none of the other threads updates. Basic mechanisms: isolation (conflict detection), versioning (maintain versions), and atomicity (commit or rollback).

Implications for Application Development and Performance Fundamental changes in the entire application stack –Programming paradigms (models of concurrency) –Software support (compilers, OS) –Library support (application kernels) –Runtime systems and performance monitoring (performance bottlenecks and alleviation) –Analysis techniques (scaling to the extreme)

Ongoing work at Purdue / collaborators – A Birds-eye View (Collaborators: Intel -- Compilers, Libraries, UMN -- Analysis Techniques, EPFL -- Programming Paradigms) Programming Models: What are appropriate concurrency abstractions? –When is communication good? –How do we deal with the spectrum of coherence models seamlessly? –How do we use transactions in real programs (I/O and networks are not transactional)

Programming Models: The Mediera Environment –Define domains of identical coherence models. –Build slack into concurrency. –View other cores as intelligent caches. –Use an LRU-type strategy to swap out threads across cores. –Support for algorithmic asynchrony. A number of important issues need to be resolved relating to mixed models -- messaging overhead associated with swapped out threads, resource bounds, livelock, priority inversion.

Library Support Building optimized multicore libraries for important computational kernels (sparse algebra, quantum scale – MD methods) / Intel MKE. Novel algorithms for memory-constrained platforms (excess FLOPS, instead of excess memory accesses). Demonstrated application performance (model reduction, nano- scale modeling). Comprehensive benchmarking of platforms (DARPA/HPCS pilot study) with a view to identifying performance bottlenecks and desirable application characteristics.

Analysis Techniques How do we analyze programs over large number of cores? Isoefficiency metric –Scaling problem size with number of cores to maintain performance. Memory constrained scaling –Quantifying drop in performance with increase in number of cores while operating at peak memory Impact of limited bandwidth –Increasing number of cores implies lower bandwidth at each core

Technical Objective To develop the next generation software environment for scalable chip-multiprocessor systems, along with library support and validating applications.

Software Environments for Embedded Systems Setting of calibration tests

Programming Scalable Systems The traditional approach to distributed programming involves writing “network-enabled” programs for each node – The program encodes distributed system behavior using complex messaging between nodes – This paradigm raises several issues and limitations: Program development is time consuming Programs are error prone and difficult to debug Lack of a distributed behavior specification, which precludes verification Limitations with respect to scalability, heterogeneity and performance

Programming Scalable Systems Macroprogramming entails direct specification of the distributed system behavior in contrast to programming individual nodes Provides: – Seamless support for heterogeneity Uniform programming platform Node capability-aware abstractions Performance scaling – Separating the application from system-level details – Scalability and adaptability with network & load dynamics – Validation of behavioral specification

Technical Objective To develop a second generation operating system suite that facilitates rapid macroprogramming of efficient self- organized distributed applications for scalable embedded systems

Ongoing Work: The CosmOS System Suite for Embedded Environments CosmOS Components: –Programming model, compilation techniques –Device independent node operating system interfaces and implementations –Network operating system

CosmOS Programming Model Macroprogram consists of: Distributed system behavioral specification Constraints associated with mapping behavioral specification to physical system Behavioral Specification – Functional Components (FCs) Represents a specific data processing function Typed input and output interface – Interaction Assignment (IA) Directed graph that specifies data flow through FCs Data source and sinks are (logical) device ports

CosmOS Program Valdiation Statically type-checked interaction assignment The output of a component can be connected to the input of another only if their types match Functional components represent a deterministic data processing function The output sequence depends only on the inputs to the FC Correctness Given input at each source in the IA the outputs at sinks are deterministically known

CosmOS Functional Components Elementary unit of execution – Isolated from the state of the system and other FCs – Uses only stack variables and statically assigned state memory – Asynchronous execution: data flow and control flow handled by cosmOS Static memory – Prevents non-deterministic behavior due to malloc failures – Leads to a lean memory management system in the OS Reusable components – The only interaction is via typed interfaces Dynamically loadable components – Runtime updates possible Average raw_t avg_t

CosmOS Program Specification Sections: – Enumerations – Declarations – Mapping constraints – IA Description

CosmOS Program: An Example %photo : device = PHOTO_SENSOR, out [ raw_t ]; %fs : device = FILE_DUMP, in [ * ]; %avg : { fcid = FCID_AVG, in [ raw_t, avg_t ], out [ avg_t ] }; %thresh : { fcid = FCID_THRESH, in [ raw_t ], out [ raw_t ] }; @ snode = CAP_PHOTO_SENSOR : photo, thresh; @ fast_m = CAP_FAST_CPU : avg; @ server = CAP_FS | CAP_UNIQUE_SERVER : avg, fs; start_ia timer(100)  photo(1); photo(1)  thresh(2,0,500); thresh(2,0)  avg(3,0,10), avg(4,0,100); avg(3,0)  fs(5) |  avg(3,1); avg(4,0)  fs(6) |  avg(4,1); end_ia raw_t T(t) P() Threshold (500) raw_t * Average (10) raw_tavg_t FS * Average (100) raw_tavg_t FS avg_t

CosmOS: Runtime System Average (10) avg_t raw_t * avg_t FS raw_t * avg_t FS raw_t T(t) P() Threshold (500) raw_t Average (100) avg_t

CosmOS: Runtime System Provides a low-footprint execution environment for CosmOS programs Key components – Data flow and control flow – Locking and concurrency – Load conditioning – Routing primitives

CosmOS Node Operating System Updateable User space Static OS Kernel Platform Independent Kernel App FC Services HW Drivers Hardware Abstraction Layer

CosmOS: Current Status Fully functional implementations for Mica2 and POSIX (on Linux) Mica2: Non-preemptive function pointer scheduler Dynamic memory management POSIX: Multi-threading using POSIX threads and underlying scheduler The OS exists as library calls and a single management thread

CosmOS: Current Status Comprehensively evaluated and validated Alpha releases can be freely downloaded from: http://www.cs.purdue.edu/~awan/cosmos/

CosmOS Validation ECN Net Interne t 802.11b Peer-to-Peer FM 433MHz Laser attached via serial port to Stargate computers MICA2 motes with ADXL 202 Currently laser readings can be viewed for from anywhere over the Internet (conditioned on firewall settings) Pilot deployment at BOWEN labs

CosmOS: Ongoing Work Semantics of the CosmOS Programming Model GUI for Interaction Assignment Library of modules Large-scale deployment and scalability studies Application-specific optimizations.

Thank you! For papers and talks on these topics, please visit: http://www.cs.purdue.edu/pdsl

Software Support for Advanced Computing Platforms Ananth Grama Professor, Computer Sciences and Coordinated Systems Lab., Purdue University.

Similar presentations

Presentation on theme: "Software Support for Advanced Computing Platforms Ananth Grama Professor, Computer Sciences and Coordinated Systems Lab., Purdue University."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Software Support for Advanced Computing Platforms Ananth Grama Professor, Computer Sciences and Coordinated Systems Lab., Purdue University.

Similar presentations

Presentation on theme: "Software Support for Advanced Computing Platforms Ananth Grama Professor, Computer Sciences and Coordinated Systems Lab., Purdue University."— Presentation transcript:

Similar presentations

About project

Feedback