Data Center & Large-Scale Systems (updated)
Luis Ceze, Bill Feiereisen, Krishna Kant, Richard Murphy, Onur Mutlu, Anand Sivasubramaniam, Christos Kozyrakis
ACAR Workshop – February 2010

Data Center Realities
- A data center is a supercomputer: 100s of cores per chip, 10s of nodes per rack, 10s of racks per container, 100s of containers per DC (scale sketch below)
- Distributed memory & storage
- Hierarchical network
- Programmed by many, used by billions
- Low cost
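Multiplying out those per-level counts gives a feel for the scale. A quick back-of-the-envelope sketch; the specific values (2 chips per node, 40 nodes per rack, and so on) are assumed midpoints, since the slide gives only orders of magnitude:

```python
# Back-of-the-envelope scale for the data center described above.
# The per-level counts are assumed midpoints of the slide's ranges.
cores_per_chip = 100        # "100s of cores/chip"
chips_per_node = 2          # assumed: typical two-socket server
nodes_per_rack = 40         # "10s of nodes/rack"
racks_per_container = 20    # "10s of racks/container"
containers_per_dc = 100     # "100s of containers/DC"

nodes = nodes_per_rack * racks_per_container * containers_per_dc
cores = cores_per_chip * chips_per_node * nodes
print(f"~{nodes:,} nodes, ~{cores:,} cores")  # ~80,000 nodes, ~16,000,000 cores
```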

“Man on the Moon” Goals (aka our 10-year deliverables)
- $1 and 1 watt per person for DC infrastructure (worked out below)
  - Assuming our whole life is on-line
- Minimize capital and operational expenses
  - From HW architecture to automated DC management
- One application program for all scales
  - From 1 node & 100 users to 1M servers / 1B users
  - What are the programming models, system SW, and HW?
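To make the per-person target concrete: at an assumed world population of ~7B (the slide gives only per-person figures), the goal caps all DC infrastructure at roughly 7 GW and $7B. A small sketch, with an assumed electricity price:

```python
# Rough implications of the "$1 and 1 W per person" goal.
# Population and electricity price are assumptions for illustration.
population = 7e9                 # assumed world population, ca. 2010
total_watts = 1 * population     # 1 W/person  -> 7 GW total
total_capex = 1 * population     # $1/person   -> $7B total

kwh_per_year = total_watts / 1000 * 24 * 365
print(f"{total_watts / 1e9:.0f} GW, ${total_capex / 1e9:.0f}B capex, "
      f"${kwh_per_year * 0.10 / 1e9:.1f}B/yr at an assumed $0.10/kWh")
```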

Research Directions for Computer Architects
The data center as a computer:
1. Chip-level support for DCs
2. DC node architecture
3. DC memory/storage hierarchy
4. DC operating & runtime system
Crosscutting theme: the energy-efficient data center

Energy Efficient Data Center
Goal: minimize DC energy consumption
- Energy efficiency & proportionality (toy model below)
  - >100x improvement in queries/watt
Research questions:
- State-aware scale-down of nodes
- Node- vs. component-level power modes
- Tradeoffs with availability and QoS
- Energy-efficient servers
  - Optimizing cores & memory for requests/watt
- Energy-efficient software & algorithms
- Environmental sustainability of DC HW
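Energy proportionality means power draw tracks utilization. A minimal linear power model (the idle and peak wattages are illustrative assumptions, not data from the talk) shows why non-proportional servers are so wasteful at the low utilizations typical of DCs:

```python
# Minimal linear power model for energy proportionality.
# Wattages are illustrative; servers of this era often idled near 50% of peak.
def power(util, p_idle, p_peak):
    """Idle floor plus a utilization-proportional component."""
    return p_idle + (p_peak - p_idle) * util

util = 0.3                                      # typical DC utilization
legacy = power(util, p_idle=100, p_peak=200)    # non-proportional server
ideal = power(util, p_idle=0, p_peak=200)       # perfectly proportional

peak_qps = 1000                                 # assumed queries/s at util = 1.0
print(f"legacy: {util * peak_qps / legacy:.1f} queries/s/W")  # ~2.3
print(f"ideal:  {util * peak_qps / ideal:.1f} queries/s/W")   # ~5.0
```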

1. Chip-level Support for DCs
How do we optimize many-core chips for DCs?
Research questions — enabling:
- Fast messaging
- Remote memory access (hypothetical interface sketched below)
- Memory scale-down
- Isolation & privacy
- Dynamic languages
- End-to-end monitoring
- Manageability
Crosscutting: the HW/SW interface, virtualization, scalability
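To picture what "fast messaging" and "remote memory access" might look like as chip-level primitives, here is a hypothetical user-level get/put interface. Every name and semantic choice below is invented for illustration; the slides prescribe no API:

```python
# Hypothetical user-level remote-memory interface, in the spirit of the
# "fast messaging / remote memory access" bullets. Everything here is
# invented for illustration.
import struct

class RemoteRegion:
    """Stand-in for a hardware-exposed window onto another node's memory."""
    def __init__(self, node_id, size):
        self.node_id = node_id
        self.mem = bytearray(size)   # simulated remote memory

    def put(self, offset, value):
        # On real HW: a one-sided write, no remote CPU involvement.
        struct.pack_into("<q", self.mem, offset, value)

    def get(self, offset):
        # On real HW: a one-sided, latency-bound read.
        return struct.unpack_from("<q", self.mem, offset)[0]

region = RemoteRegion(node_id=42, size=4096)
region.put(0, 12345)
assert region.get(0) == 12345
```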

2. DC Node Architecture
Current node HW is a direct evolution of the PC. Is this optimal for performance, energy, reliability, and cost?
Research questions:
- Thin vs. thick vs. heterogeneous nodes? (toy comparison below)
  - Implications for the language & management system
  - Balancing throughput & QoS
- What is the DC unit: pizza box vs. rack vs. container?
  - Optimization opportunities at each granularity
- Memory system organization
- Integration of compute and networking
- Role of SSDs, non-volatile memory, photonics?
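A toy model of the thin-vs-thick tradeoff, with every number an assumption chosen for illustration: many small cores can win on requests per watt while losing on per-request latency, which is why the slide pairs this question with QoS:

```python
# Toy thin-vs-thick node comparison. Every number is an assumed
# illustration, not a measurement from the workshop.
servers = {
    #                           req/s   watts  latency (ms)
    "thick (few big cores)":   (10_000,  300,   5),
    "thin (many small cores)": (8_000,   100,  20),
}

for name, (rps, watts, latency_ms) in servers.items():
    print(f"{name}: {rps / watts:.0f} req/s/W, {latency_ms} ms/request")

# thin wins ~2.4x on req/s/W but is 4x slower per request, so the
# right choice depends on the QoS target, as the slide notes.
```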

3. DC Memory/Storage Hierarchies
How do we provide fast, high-bandwidth access to terabytes of data?
Research questions:
- Large-scale memory hierarchies?
  - Local & global HW architecture
  - Processor/memory balance
  - Role of SSDs, non-volatile memory, photonics?
- Locality optimizations
  - Moving data to compute, or compute to data (placement sketch below)
  - Interactions with availability and isolation issues
- APIs and file systems?
  - Consistency model and synchronization
  - Protection, security, …
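"Moving compute to data" is the locality idea behind MapReduce-style schedulers: run a task on a node that already holds its input rather than shipping the input across the network. A minimal placement sketch; the replica map and node names are invented for illustration:

```python
# Minimal sketch of locality-aware placement ("move compute to data").
# The replica map and node names are invented for illustration.
replicas = {                      # block -> nodes holding a copy
    "block-A": {"node1", "node3"},
    "block-B": {"node2"},
}
free_nodes = {"node2", "node3", "node4"}

def place(block):
    """Prefer a free node that already stores the block; otherwise use
    any free node and pay to move the data to the compute."""
    local = replicas[block] & free_nodes
    return local.pop() if local else free_nodes.pop()

print(place("block-A"))  # node3: the only free data-local choice
print(place("block-B"))  # node2: data-local
```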

4. DC OS & Runtime System
What are the OS and runtime system for a DC?
Research questions:
- Scalability, availability, reliability
  - 1M nodes, five-nines (99.999%) availability (worked out below)
- Real-time (re)provisioning of resources
- End-to-end monitoring, introspection, flow control
- Graceful degradation and scale-down
- Multi-tenancy, isolation, and security
- Compliance
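To ground that target: 99.999% availability allows about 5.3 minutes of service downtime per year, while a million nodes fail constantly in the background. A quick calculation (the per-node MTBF is an assumption):

```python
# What five-nines availability means at data-center scale.
minutes_per_year = 365 * 24 * 60
allowed_downtime = (1 - 0.99999) * minutes_per_year
print(f"allowed service downtime: {allowed_downtime:.1f} min/year")  # ~5.3

# With 1M nodes and an assumed per-node MTBF of 3 years, node failures
# are constant background noise the runtime must mask:
nodes, mtbf_years = 1_000_000, 3
print(f"expected node failures: ~{nodes / mtbf_years / 365:.0f}/day")  # ~900
```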

Methodology & Resources
Methodology: back to the future
- Full-system research + prototypes
- Multi-disciplinary projects
  - Architecture, systems, languages, compilers, …
Resources: what's missing
- An experimental data center for systems research
  - Learn from other areas (e.g., GENI & OpenFlow)
- Workloads & large-scale analysis methods
- Collaboration with industry is critical

Funding Programs
- Horizontal programs are still useful
  - Need to educate panels about the needs of the area…
- Larger-scale efforts are needed as well
  - NSF ERC, STC, MARCO centers (e.g., MuSyC), …
- A grand-challenges program
  - Framed around one important application at a time
  - Help address its needs while making technology advances for DCs
  - Provides focus, motivates full-system work, and keeps results relevant
  - Avoids the burden of solving all problems at once
  - Transfer application-specific results to general-purpose DCs
  - Example: DC technology for neuro-engineering systems
    - Data capture & analysis of neural activity to control prosthetics
    - Challenge: a DC that fits in the closet of a biology lab

Roadmap
- Urgent: a 2-year infrastructure program
- Two parallel thrusts:
  - Using existing chips (system-level architecture & SW): 5-year scope, >10x energy-efficiency improvement
  - Using novel chips: 10-year scope, >100x energy-efficiency improvement
- Open questions
  - An NSF Expedition?
  - International collaboration?