
MEMORY SYSTEM CHARACTERIZATION OF COMMERCIAL WORKLOADS
Authors:
- Luiz André Barroso (Google, DEC; worked on Piranha)
- Kourosh Gharachorloo (Compaq, DEC; worked on Dash and Flash)
- Edouard Bugnion (one of the original founders of VMware; also worked on SimOS)
Presented by: David Eitel, March 31, 2010

Types of Commercial Applications
- Online Transaction Processing (OLTP)
- Decision Support Systems (DSS)
- Web Index Search (WIS)
Source: S. Brin and L. Page. “The Anatomy of a Large-Scale Hypertextual Web Search Engine.”

Benchmarks
- Oracle Database Engine
- TPC-B Banking Benchmark for OLTP
- TPC-D Benchmark for DSS (read-only queries)
- AltaVista
Sources:

Monitoring Results
Source: Fig. 4
- OLTP has more complex queries than DSS/AV.
- Low-latency access to non-primary caches is important because the OLTP working set is very large.
- Cache misses for DSS are low; the misses that do occur are on large database tables.
Legend: Icache = instruction cache, Dcache = data cache, Scache = secondary cache, Bcache = board-level cache
Figure annotations: Big CPI! / Lots of Bcache misses / Breakdown of the execution time misses / Sum of single- and dual-issue cycles / Pipeline and address translation related stalls / >75% mem stalls
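The "Big CPI" and ">75% mem stalls" annotations can be related with a simple additive stall model. The sketch below is illustrative only: the function name, the per-level rates, and the latencies are hypothetical and are not taken from Fig. 4, and the model assumes memory stalls do not overlap with each other or with instruction execution.

```python
def cpi_breakdown(base_cpi, serviced_per_instr, latency_cycles):
    """Return (total CPI, fraction of cycles spent in memory stalls).

    serviced_per_instr: references per instruction serviced at each level
    latency_cycles:     stall cycles charged for a reference serviced there
    Assumes stalls simply add up (no overlap), which is a simplification.
    """
    stall_cpi = sum(serviced_per_instr[level] * latency_cycles[level]
                    for level in serviced_per_instr)
    total_cpi = base_cpi + stall_cpi
    return total_cpi, stall_cpi / total_cpi


# Hypothetical inputs chosen for illustration, not measured values.
serviced_per_instr = {"Scache": 0.10, "Bcache": 0.05, "memory": 0.03}
latency_cycles = {"Scache": 10, "Bcache": 30, "memory": 100}

total_cpi, mem_fraction = cpi_breakdown(1.5, serviced_per_instr, latency_cycles)
print(f"CPI = {total_cpi:.2f}, memory stall share = {mem_fraction:.0%}")
```

Even with a modest rate of references reaching the board-level cache and memory, the long latencies at those levels dominate the total, which is the pattern the slide points out for OLTP.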

Simulation Results for OLTP
Source: Fig. 5
Figure labels: Associativity / Cache Size / Data capacity/conflict misses
Legend: INST = instruction execution, CACHE = stalls within cache hierarchy, MEM = memory system stalls
- Idle time increases with bigger caches.
- The I/O latency cannot be hidden with faster processing rates.
- Faster processing rates with a more efficient memory system = more commits ready for the log writer (I/O).
- OLTP benefits from larger Bcaches.
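To see why both cache size and associativity matter for capacity and conflict misses, a toy set-associative LRU cache model is enough. This is a minimal sketch with a synthetic address trace; it is not the simulator used in the paper, and the function name and parameters are assumptions made for illustration.

```python
from collections import OrderedDict


def miss_rate(addresses, cache_bytes, line_bytes, ways):
    """Miss rate of a set-associative cache with LRU replacement."""
    num_sets = cache_bytes // (line_bytes * ways)
    # One OrderedDict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)        # hit: mark most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[line] = True
    return misses / len(addresses)


# Synthetic trace: two addresses 1 KB apart, accessed alternately.
trace = [0, 1024] * 50

print(miss_rate(trace, cache_bytes=1024, line_bytes=64, ways=1))
print(miss_rate(trace, cache_bytes=1024, line_bytes=64, ways=2))
print(miss_rate(trace, cache_bytes=2048, line_bytes=64, ways=1))
```

In this trace the two lines collide in the 1 KB direct-mapped cache, so every access misses. Either doubling the associativity or doubling the size leaves only the two cold misses.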

More Simulation Results (OLTP and DSS)
- DSS works well with current-sized caches because the working sets are small (few misses in on-chip caches).
- Replacement and instruction miss rates are not affected by line size → good for larger cache sizes.
- False sharing increases with cache line size.
- What would be different if increased latency and bandwidth were accounted for when line size increases?
- Are the results NOT valid because size(database) = size(main memory)?
Sources: Fig. 7 and Fig. 8
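The link between line size and false sharing can be shown with a small write-invalidate model. This is a sketch under stated assumptions: infinite caches (so there are no replacement misses), two hypothetical processors that each update only their own variable, and a function name invented for illustration.

```python
def coherence_misses(writes, line_bytes):
    """Count misses caused by another CPU's write to the same cache line.

    writes: list of (cpu, address) pairs in program order.
    Cold misses (first touch of a line by a CPU) are not counted.
    """
    last_writer = {}   # line -> CPU holding the only valid copy
    touched = set()    # (cpu, line) pairs already brought into that CPU's cache
    misses = 0
    for cpu, addr in writes:
        line = addr // line_bytes
        if (cpu, line) in touched and last_writer.get(line) != cpu:
            misses += 1            # our copy was invalidated by the other CPU
        touched.add((cpu, line))
        last_writer[line] = cpu
    return misses


# CPU 0 only writes address 0; CPU 1 only writes address 32.
writes = [(0, 0), (1, 32)] * 50

for line_bytes in (32, 64, 128):
    print(line_bytes, coherence_misses(writes, line_bytes))
```

Because the two processors never touch the same address, every miss counted here is false sharing. With 32-byte lines the variables sit in different lines and there are none; once the line is large enough to hold both, almost every write misses.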

Important Things to Remember
- As the number of processors increases, communication stalls increase (see Fig. 6).
- O/S activity and I/O latencies do not greatly affect the behavior of database engines.
- OLTP has instruction and data locality → helped by off-chip caches.
- DSS and WIS have working sets that fit in memory → sensitive to on-chip caches.
Source:

Discussion Questions
- What are some new commercial applications that have developed since this paper was written?
- How much have the issues in this paper been addressed in recent architecture designs?
- What should we focus on in the “parallel” future to increase performance for commercial applications?
- Could we change commercial workloads to function more like scientific workloads to obtain performance gains?
Source: