Software Support for Advanced Computing Platforms
Ananth Grama
Professor, Computer Sciences and Coordinated Systems Lab., Purdue University

Building Applications for Next Generation Computing Platforms
Emerging trends point to two disruptive technologies:
– Architecture innovations from the desktop to scalable systems
– Embedded intelligence and ubiquitous processing
How do we program these platforms efficiently? Very little of what we have learned over three decades of parallel programming directly applies here.

Evolution of Microprocessor Architectures
– Chip-Multiprocessor Architectures
– Scalable Multicore Platforms
– Heterogeneous Multicore Processors
– Transactional Memory

Multicore Architectures -- An Overview
The Myth:
– Multicore processors are designed for speed.
The Reality: multicore processors are motivated by power considerations:
– Power is proportional to clock speed
– Power is quadratic in Vdd
– Vdd can be reduced as the clock speed is reduced
– Computation speed is generally sublinear in clock speed
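To make the power argument concrete (this is the standard CMOS dynamic-power model, not something spelled out on the slide): with activity factor \alpha, switched capacitance C, supply voltage V_{dd}, and clock frequency f,

P_{\mathrm{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f, \qquad f_{\max} \propto V_{dd} \ \text{(approximately)}.

Because the attainable frequency scales roughly linearly with V_{dd}, lowering the clock allows V_{dd} to drop as well, so power falls roughly as f^{3} while single-thread performance falls at most linearly. This is why several slower cores can match the throughput of one fast core at a fraction of the power.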

Multicore Architectures -- An Overview
– Collocate multiple processor cores on a single chip (a special class of chip-multiprocessors)
– Programming model is typically thread-based
– Many microprocessors are hardware compatible with existing motherboards (memory performance?)
– Memory systems vary widely across vendors (AMD vs. Intel vs. IBM PowerPC/Cell)

Multicore Architectures -- Trends
– Current generation is typically dual- or quad-core; desktop and mobile dual-core variants are available
– Scalable multicore: AMD and Intel both plan up to 16 cores in the next two years and up to 64 cores in the medium term
– Heterogeneous multicore: some of the most commonly used processors today are heterogeneous multicore (network routers, ARM/TI DSPs in cell phones)

Memory System Architecture
– Trading off latency and bandwidth (the Cell solution)
– Programmable caches
– Transactional Memory

Transactional Memory Overview
– Addresses problems of correctness of parallel programs as well as performance
– Requires hardware support
– Mitigates many of the problems associated with locks: composability, granularity, mixing correctness and performance

Transactional Memory Overview

Thread 1:
  begin_transaction
    x = x + 1
    y = y + x
    if (x < 10) z = x; else z = y;
  end_transaction

Thread 2:
  begin_transaction
    x = x - 1
    y = y - x
    if (x > 10) z = x; else z = y;
  end_transaction

Each thread sees either all or none of the other thread's updates.
Basic mechanisms: isolation (conflict detection), versioning (maintaining versions), and atomicity (commit or rollback).
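As an aside (not from the slides): the same two transactions written as a minimal C sketch against GCC's experimental transactional-memory support. The variable names and initial values are illustrative only, and this is just one possible realization of the begin_transaction/end_transaction pseudocode above.

/* Hedged sketch; build with: gcc -fgnu-tm -pthread tm_sketch.c (GCC 4.7 or later). */
#include <pthread.h>
#include <stdio.h>

static int x = 5, y = 0, z = 0;       /* shared state; initial values are illustrative */

static void *thread1(void *arg) {
    (void)arg;
    __transaction_atomic {             /* all-or-nothing with respect to other transactions */
        x = x + 1;
        y = y + x;
        z = (x < 10) ? x : y;
    }
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    __transaction_atomic {
        x = x - 1;
        y = y - x;
        z = (x > 10) ? x : y;
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x=%d y=%d z=%d\n", x, y, z);   /* result matches one of the two serial orders */
    return 0;
}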

Implications for Application Development and Performance
Fundamental changes in the entire application stack:
– Programming paradigms (models of concurrency)
– Software support (compilers, OS)
– Library support (application kernels)
– Runtime systems and performance monitoring (performance bottlenecks and alleviation)
– Analysis techniques (scaling to the extreme)

Ongoing Work at Purdue and with Collaborators – A Bird's-eye View
(Collaborators: Intel -- compilers, libraries; UMN -- analysis techniques; EPFL -- programming paradigms)
Programming Models: What are appropriate concurrency abstractions?
– When is communication good?
– How do we deal with the spectrum of coherence models seamlessly?
– How do we use transactions in real programs (I/O and networks are not transactional)?

Programming Models: The Mediera Environment
– Define domains of identical coherence models.
– Build slack into concurrency.
– View other cores as intelligent caches.
– Use an LRU-type strategy to swap out threads across cores.
– Support for algorithmic asynchrony.
A number of important issues relating to mixed models need to be resolved: messaging overhead associated with swapped-out threads, resource bounds, livelock, and priority inversion.

Library Support
– Building optimized multicore libraries for important computational kernels (sparse algebra, quantum-scale and MD methods), with Intel MKL.
– Novel algorithms for memory-constrained platforms (spend excess FLOPS instead of excess memory accesses).
– Demonstrated application performance (model reduction, nanoscale modeling).
– Comprehensive benchmarking of platforms (DARPA/HPCS pilot study) with a view to identifying performance bottlenecks and desirable application characteristics.

Analysis Techniques
How do we analyze programs over a large number of cores?
– Isoefficiency metric: scaling problem size with the number of cores to maintain performance.
– Memory-constrained scaling: quantifying the drop in performance as the number of cores increases while operating at peak memory.
– Impact of limited bandwidth: increasing the number of cores implies lower bandwidth at each core.
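For reference (the standard definition from the parallel-computing literature, not spelled out on the slide): for problem size W (serial work), parallel runtime T_P on p cores, and total overhead T_o(W,p) = p\,T_P - W, the efficiency is

E = \frac{W}{p\,T_P} = \frac{1}{1 + T_o(W,p)/W},

and the isoefficiency function gives the rate at which W must grow with p to keep E fixed:

W = \frac{E}{1-E}\; T_o(W,p).

A slowly growing isoefficiency function indicates a program that scales well to many cores; memory-constrained scaling asks instead how E degrades when W is capped by the memory available per core.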

Technical Objective To develop the next generation software environment for scalable chip-multiprocessor systems, along with library support and validating applications.

Software Environments for Embedded Systems
(Photo caption: setting of calibration tests)

Programming Scalable Systems
The traditional approach to distributed programming involves writing "network-enabled" programs for each node:
– The program encodes distributed system behavior using complex messaging between nodes
– This paradigm raises several issues and limitations:
  - Program development is time-consuming
  - Programs are error-prone and difficult to debug
  - Lack of a distributed behavior specification, which precludes verification
  - Limitations with respect to scalability, heterogeneity, and performance

Programming Scalable Systems
Macroprogramming entails direct specification of the distributed system behavior, in contrast to programming individual nodes. It provides:
– Seamless support for heterogeneity
  - Uniform programming platform
  - Node capability-aware abstractions
  - Performance scaling
– Separation of the application from system-level details
– Scalability and adaptability with network and load dynamics
– Validation of the behavioral specification

Technical Objective
To develop a second-generation operating system suite that facilitates rapid macroprogramming of efficient self-organized distributed applications for scalable embedded systems.

Ongoing Work: The CosmOS System Suite for Embedded Environments
CosmOS components:
– Programming model and compilation techniques
– Device-independent node operating system interfaces and implementations
– Network operating system

CosmOS Programming Model
A macroprogram consists of:
– Distributed system behavioral specification
– Constraints associated with mapping the behavioral specification to the physical system
Behavioral Specification:
– Functional Components (FCs)
  - Each represents a specific data processing function
  - Typed input and output interfaces
– Interaction Assignment (IA)
  - Directed graph that specifies data flow through FCs
  - Data sources and sinks are (logical) device ports

CosmOS Program Validation
– Statically type-checked interaction assignment: the output of a component can be connected to the input of another only if their types match
– Functional components represent deterministic data processing functions: the output sequence depends only on the inputs to the FC
– Correctness: given the input at each source in the IA, the outputs at the sinks are deterministically known
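As an illustration only (the names and data structures below are hypothetical, not the actual CosmOS compiler internals), the edge-level type check reduces to comparing the declared output type of the producer port with the declared input type of the consumer port, with a wildcard for sinks such as the file dump:

/* Hypothetical sketch of the IA type-check rule; not CosmOS compiler code. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { TYPE_RAW_T, TYPE_AVG_T, TYPE_ANY } port_type_t;  /* TYPE_ANY models the '*' wildcard */

typedef struct { const char *fc_name; port_type_t out_type; } out_port_t;  /* producing port */
typedef struct { const char *fc_name; port_type_t in_type;  } in_port_t;   /* consuming port */

/* An IA edge is well typed if the types match, or the consumer accepts anything. */
static bool edge_well_typed(const out_port_t *src, const in_port_t *dst) {
    return dst->in_type == TYPE_ANY || src->out_type == dst->in_type;
}

int main(void) {
    out_port_t thresh_out = { "thresh", TYPE_RAW_T };
    in_port_t  avg_in     = { "avg",    TYPE_RAW_T };
    in_port_t  fs_in      = { "fs",     TYPE_ANY   };   /* file dump accepts any type */

    printf("thresh -> avg: %s\n", edge_well_typed(&thresh_out, &avg_in) ? "ok" : "type error");
    printf("thresh -> fs : %s\n", edge_well_typed(&thresh_out, &fs_in)  ? "ok" : "type error");
    return 0;
}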

CosmOS Functional Components
– Elementary unit of execution
  - Isolated from the state of the system and other FCs
  - Uses only stack variables and statically assigned state memory
  - Asynchronous execution: data flow and control flow handled by CosmOS
– Static memory
  - Prevents non-deterministic behavior due to malloc failures
  - Leads to a lean memory management system in the OS
– Reusable components: the only interaction is via typed interfaces
– Dynamically loadable components: runtime updates are possible
(Figure: an "Average" FC with raw_t input and avg_t output)
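A minimal sketch of what an averaging FC might look like under these constraints. The handler signature and state layout are hypothetical (the real CosmOS FC interface is not shown in the slides); the raw_t and avg_t types mirror the example program later in the talk. The handler uses only stack variables plus a statically assigned state block, and hands a typed output back to the runtime for routing.

/* Hypothetical FC sketch; illustrates the constraints above, not the actual CosmOS API. */
#include <stdint.h>
#include <stdio.h>

typedef int16_t raw_t;          /* typed input */
typedef int32_t avg_t;          /* typed output */

/* Statically assigned state memory for this FC instance (no malloc). */
typedef struct {
    int32_t  sum;
    uint16_t count;
    uint16_t window;            /* e.g., 10 or 100 samples per average */
} avg_state_t;

static avg_state_t avg_state = { 0, 0, 10 };

/* Invoked asynchronously by the runtime when a raw_t sample arrives on input 0.
 * Returns 1 and fills *out when a window completes, 0 otherwise. */
int avg_fc_handle(raw_t sample, avg_t *out) {
    avg_state.sum += sample;
    if (++avg_state.count == avg_state.window) {
        *out = avg_state.sum / avg_state.window;
        avg_state.sum = 0;
        avg_state.count = 0;
        return 1;               /* runtime forwards *out along the IA edges */
    }
    return 0;
}

int main(void) {                /* stand-in for the runtime's dispatch loop */
    avg_t out;
    for (raw_t s = 1; s <= 20; s++)
        if (avg_fc_handle(s, &out))
            printf("window average: %d\n", (int)out);
    return 0;
}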

CosmOS Program Specification
Sections:
– Enumerations
– Declarations
– Mapping constraints
– IA description

CosmOS Program: An Example

%photo  : device = PHOTO_SENSOR, out [ raw_t ];
%fs     : device = FILE_DUMP, in [ * ];
%avg    : { fcid = FCID_AVG, in [ raw_t, avg_t ], out [ avg_t ] };
%thresh : { fcid = FCID_THRESH, in [ raw_t ], out [ raw_t ] };

snode  = CAP_PHOTO_SENSOR : photo,
fast_m = CAP_FAST_CPU :
server = CAP_FS | CAP_UNIQUE_SERVER : avg, fs;

start_ia
timer(100) -> photo(1);
photo(1) -> thresh(2,0,500);
thresh(2,0) -> avg(3,0,10), avg(4,0,100);
avg(3,0) -> fs(5) | -> avg(3,1);
avg(4,0) -> fs(6) | -> avg(4,1);
end_ia

(Dataflow diagram: the timer T(t) triggers the photo sensor P(); its raw_t output passes through Threshold(500) to Average(10) and Average(100), whose avg_t outputs feed FS sinks.)

CosmOS: Runtime System
(Diagram: the example dataflow graph from the previous slide, with the timer/photo source, Threshold(500), Average(10), Average(100), and FS sinks, instantiated on the runtime system.)

CosmOS: Runtime System
Provides a low-footprint execution environment for CosmOS programs. Key components:
– Data flow and control flow
– Locking and concurrency
– Load conditioning
– Routing primitives

CosmOS Node Operating System
(Diagram: an updateable user space holding application FCs and services sits above a static OS kernel consisting of a platform-independent kernel and HW drivers, layered over a hardware abstraction layer.)

CosmOS: Current Status
Fully functional implementations for Mica2 and POSIX (on Linux):
– Mica2: non-preemptive function-pointer scheduler, dynamic memory management
– POSIX: multi-threading using POSIX threads and the underlying scheduler; the OS exists as library calls and a single management thread
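The Mica2 scheduler is only named on the slide. As a hedged illustration of the general idea (hypothetical names, not the CosmOS implementation), a non-preemptive function-pointer scheduler can be as simple as a ring buffer of task pointers drained by a run-to-completion loop:

/* Illustrative non-preemptive function-pointer scheduler; not the CosmOS/Mica2 code. */
#include <stdio.h>

#define TASK_QUEUE_LEN 8

typedef void (*task_fn)(void);

static task_fn task_queue[TASK_QUEUE_LEN];
static unsigned head = 0, tail = 0;

/* Post a task for later execution; returns 0 on success, -1 if the queue is full.
 * On a real mote this would be callable from interrupt handlers with interrupts masked. */
int scheduler_post(task_fn fn) {
    unsigned next = (tail + 1) % TASK_QUEUE_LEN;
    if (next == head) return -1;        /* queue full */
    task_queue[tail] = fn;
    tail = next;
    return 0;
}

/* Run-to-completion loop: each task runs without preemption until it returns.
 * A real scheduler would sleep when idle; this sketch simply returns once the queue drains. */
void scheduler_run(void) {
    while (head != tail) {
        task_fn fn = task_queue[head];
        head = (head + 1) % TASK_QUEUE_LEN;
        fn();
    }
}

static void sample_task(void) { printf("sampled sensor\n"); }
static void report_task(void) { printf("sent report\n"); }

int main(void) {
    scheduler_post(sample_task);
    scheduler_post(report_task);
    scheduler_run();
    return 0;
}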

CosmOS: Current Status
Comprehensively evaluated and validated. Alpha releases can be freely downloaded from:

CosmOS Validation
Pilot deployment at the BOWEN labs: MICA2 motes with ADXL202 accelerometers and a laser attached via serial port to Stargate computers, networked over FM 433 MHz and peer-to-peer wireless links through the ECN network to the Internet. Laser readings can currently be viewed from anywhere over the Internet (subject to firewall settings).

CosmOS: Ongoing Work
– Semantics of the CosmOS programming model
– GUI for interaction assignment
– Library of modules
– Large-scale deployment and scalability studies
– Application-specific optimizations

Thank you! For papers and talks on these topics, please visit: