Exploring the Design Space of Future CMPs
Authors – Jaehyuk Huh, Doug Burger, and Stephen W. Keckler
Presenter – Sushma Myneni


Agenda
- Motivation & Goals
- Brief background about multicore / CMPs
- Technical details presented in the paper
- Key results and contributions
- Conclusions
- Drawbacks
- How the paper relates to class and my project
- Q&A

Motivation & Goals
Motivation:
- The superscalar paradigm is reaching diminishing returns.
- Wire delays will limit the area of the chip that is useful for a single conventional processing core.
Goals:
- Compare area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should use in-order or out-of-order issue, and how large the per-processor on-chip cache should be.
Related work: Compaq Piranha

Brief Background on CMPs
Metrics to evaluate CMPs:
- Maximizing total chip performance means maximizing job throughput.
Maximizing job throughput involves comparing:
- Processor organization: out-of-order or smaller in-order issue cores
- Cache hierarchy: amount of cache memory per processor
- Off-chip bandwidth: finite bandwidth limits the number of cores that can be placed on the chip
- Application characteristics: applications with different memory access patterns require different CMP designs to attain maximum throughput

Brief Background on CMPs
Chip multiprocessor model:
- Private L1 and L2 caches per processor
- Each L2 cache is directly connected to off-chip DRAM through a set of distributed memory channels
- Shared L2 cache: large cache bandwidth requirements vs. slow global wires

Technical Details – Area Models
- The model expresses all area in terms of CBE, the unit area for one byte of cache.
- Both in-order and out-of-order issue processors were considered, across a range of cache sizes.
- Performance per unit area is compared for a 2-way in-order core (P_IN) and a 4-way out-of-order core (P_OUT).
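The CBE-based accounting can be sketched as below. The core areas used here are hypothetical placeholders chosen only to illustrate the bookkeeping; they are not the paper's measured values.

```python
# Sketch of a CBE-based area model (CBE = the unit area occupied by
# one byte of cache). Core areas are hypothetical, NOT the paper's numbers.

CORE_AREA_CBE = {
    "P_IN": 50_000,    # 2-way in-order core (hypothetical area in CBE)
    "P_OUT": 250_000,  # 4-way out-of-order core (hypothetical area in CBE)
}

def core_plus_cache_area(core_type: str, l2_bytes: int) -> int:
    """Total area (in CBE) of one core plus its private L2 cache."""
    # One byte of cache occupies exactly one CBE by definition.
    return CORE_AREA_CBE[core_type] + l2_bytes

# Example: a 2-way in-order core with a 128 KB private L2.
print(core_plus_cache_area("P_IN", 128 * 1024))  # 181072
```

Expressing everything in CBE lets core logic and cache trade off against each other in a single currency, which is what the paper's design-space sweep relies on.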

Technical Details – I/O Pin Bandwidth
- The number of I/O pins that can be built on a single chip is limited by physical technology and does not scale with transistor counts.
- The number of pins per transistor decreases as technology advances.
- I/O pin speeds have not increased at the same rate as processor clock rates.

Technical Details – Maximizing Throughput
- Performance on server workloads can be defined as the aggregate performance of all the cores on the chip.
- Given the number of cores (N_c) and the performance of each core (P_i), the peak performance (P_cmp) of a server CMP is:
  P_cmp = Σ_{i=1..N_c} P_i
- The performance of an individual core in a CMP depends on application characteristics such as available instruction-level parallelism, cache behavior, and communication overhead among threads.
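The throughput definition above is just a sum over per-core performance; a minimal sketch (with arbitrary per-core numbers) makes the aggregation explicit:

```python
# Peak CMP throughput as defined on the slide: the sum of the
# per-core performance values P_i over all N_c cores.

def cmp_peak_performance(per_core_perf: list[float]) -> float:
    """P_cmp = sum of P_i for i = 1 .. N_c."""
    return sum(per_core_perf)

# Example: four identical cores, each delivering 1.5 (arbitrary units).
print(cmp_peak_performance([1.5, 1.5, 1.5, 1.5]))  # 6.0
```

In practice the P_i values are not independent constants: as the next slide notes, they shift with each application's ILP, cache behavior, and inter-thread communication.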

Technical Details – Application Characteristics
Ten SPEC benchmarks were chosen: mesa, mgrid, equake, gcc, ammp, vpr, parser, perlbmk, art, and mcf.
Taxonomy of applications:
- Processor-bound: applications whose working set can be captured easily in the L2 cache (mesa, mgrid, equake)
- Cache-sensitive: applications whose performance is limited by L2 cache capacity (gcc, ammp, vpr, parser, and perlbmk)
- Bandwidth-bound: applications whose performance is limited strictly by the rate at which data can be moved between processor and DRAM (art, mcf, and sphinx)
Applications are not bound to one class or another; they move among these three domains as processor, cache, and bandwidth capacities are modulated.
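One way to picture the three-way taxonomy is as a decision on two observable quantities. The thresholds below are illustrative assumptions of mine, not values from the paper:

```python
# Hypothetical classifier illustrating the slide's taxonomy.
# Both thresholds are illustrative assumptions, NOT from the paper.

def classify(l2_miss_rate: float, dram_bw_utilization: float) -> str:
    """Place an application in one of the slide's three classes."""
    if dram_bw_utilization > 0.8:
        # Limited strictly by the processor<->DRAM transfer rate.
        return "bandwidth-bound"
    if l2_miss_rate > 0.05:
        # Limited by L2 capacity: more cache keeps helping.
        return "cache-sensitive"
    # Working set fits in L2; performance tracks the core itself.
    return "processor-bound"

print(classify(0.01, 0.1))  # processor-bound
```

The last bullet of the slide is the key caveat: shrinking the cache or the memory channels can push the same benchmark from one class into another.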

Technical Details – Experimental Methodology
- Used the SimpleScalar tool set to model both the in-order (P_IN) and out-of-order (P_OUT) processors.

Results – Effect of varying L2 cache size

Results – Performance Scalability versus channel sharing

Maximizing CMP Throughput
- Combine the area analysis and performance simulations to find which CMP configuration will be most area-efficient in future technologies.
- Fixed chip area: 400 mm²
- Calculate the number of cores and cores per channel that fit in the chip area at different cache sizes.
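The core-count step of the sweep is simple area budgeting; a sketch with hypothetical per-tile areas (the 400 mm² budget is from the slide, the 10 mm² figures are placeholders):

```python
# Sketch of the area-budgeting step: given a fixed chip budget and a
# per-tile area (core + private L2), how many tiles fit?
# The per-tile areas in the example are hypothetical placeholders.

CHIP_AREA_MM2 = 400  # fixed chip budget used in the study

def cores_that_fit(core_area_mm2: float, l2_area_mm2: float) -> int:
    """Number of (core + private L2) tiles that fit in the chip budget."""
    return int(CHIP_AREA_MM2 // (core_area_mm2 + l2_area_mm2))

# Example: a 10 mm^2 core paired with a 10 mm^2 L2 slice.
print(cores_that_fit(10.0, 10.0))  # 20
```

Repeating this for each cache size, then weighting each configuration by simulated per-core performance, yields the throughput-versus-cache-size trade-off reported in the results.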

Results – Application type versus Throughput

Results – Technology Scaling

CMPs for Server Applications
- Most commonly used server workloads: OLTP and DSS
- DSS workloads behave like cache-sensitive applications (1 MB / 2 MB L2 cache)
- OLTP workloads behave like bandwidth-bound applications

Conclusions
- Transistor counts are projected to increase faster than pin counts, which limits the number of cores that can be used in future technologies.
- Out-of-order issue cores are more efficient than in-order issue cores.
- Across workloads, the impact of insufficient bandwidth causes throughput-optimal L2 cache sizes to grow from 128 KB at 100 nm to 1 MB at 50 and 35 nm.
- As technology advances, wire delays may become too high to add more cache per processor.

Drawbacks
- SPEC benchmarks were used, which are not similar to server workloads.
- Power consumption was not considered at all while trying to maximize per-area performance.
- The evaluation assumed signaling speeds increase linearly at 1.5 times the processor clock rate; technology advances in this area may permit a larger number of processors than predicted.

Paper Related to Class and Project
- Relation to class: We have been studying multicore architecture since the beginning of this semester. This paper presented how to design a CMP architecture suited to the application, taking the current technology into consideration.
- Relation to project: My project studies CMP architecture in relation to Mobile Edge Computing devices.

Q & A