Project: Phase 1 Grading
- Default Statistics (40 points): Values and Charts (30 points), Analyses (10 points)
- Branch Predictor Statistics (30 points): Values and Charts (25 points), Analyses (5 points)
- L2 Cache Replacement Statistics (30 points): Values and Charts (30 points)

Default Statistics: Analyses
CPI is affected by:
- the percentage of branches and their predictability
- cache hit rates
- the parallelism inherent in the programs
The CPI of cc and go is higher than the others: both have a larger percentage of hard-to-predict branches.
- cc: 17% branches, about 12% of which are mispredicted
- go: 13% branches, about 20% of which are mispredicted
The CPI of cc is higher than that of go: the L1 miss rate of cc (2.6%) is higher than go's (0.6%).
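The branch numbers above translate directly into a CPI contribution. A minimal sketch, assuming a hypothetical misprediction penalty (the slides do not specify one):

```c
/* Sketch: estimate the CPI added by branch mispredictions.
 * branch_frac and mispred_frac come from the profile numbers above;
 * penalty_cycles is an assumed value, not something the project fixes. */
double mispredict_cpi(double branch_frac, double mispred_frac,
                      double penalty_cycles)
{
    /* fraction of all instructions that are mispredicted branches,
     * times the cycles lost per misprediction */
    return branch_frac * mispred_frac * penalty_cycles;
}
```

With an assumed 10-cycle penalty, cc (17% branches, 12% mispredicted) loses about 0.17 * 0.12 * 10 = 0.204 cycles per instruction to mispredictions, and go about 0.13 * 0.20 * 10 = 0.26, which is why the L1 miss-rate difference is needed to explain cc's higher overall CPI.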

Default Statistics: Analyses
- compress has high miss rates: the execution run is short, so compulsory misses dominate
- The L2 miss rate of anagram is high: there are very few L2 accesses, so compulsory misses dominate
Program-based analyses:
- gcc has a lot of branches
- go has a small memory footprint
- anagram is a simple program
- compress: the input file is only 20 bytes
Note: all are integer programs; CPI < 1 thanks to multiple issue and out-of-order execution.

Branch Predictor: Statistics
- Accuracy: perfect > bimodal > taken = not-taken
- Variation across benchmarks (2 points): go and cc show the greatest variation; they have a significant number of hard-to-predict branches.

L2 Replacement Policies
- No great change in miss rate or CPI (30 points for the values and plots)
- The L1 cache was big, so there were very few L2 accesses
- The associativity of the L2 cache was small
- LRU > FIFO > Random

Distribution (score histogram): 90 – 100 …

Phase 2: Profile-Guided Optimization
Profiling run:
- Run the unoptimized code with sample inputs
- Instrument the code to collect information about the run: call-graph frequencies, basic-block frequencies
Recompile:
- Use the collected information to produce better code
- Inlining
- Place hot code together to improve I-cache behavior
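The instrumentation step above can be sketched with explicit basic-block counters. This is a hand-written stand-in (block numbering, `bb_enter`, and `abs_sum` are our illustrative names; a real profiling compiler inserts the counters automatically):

```c
/* Sketch of basic-block instrumentation, assuming blocks are numbered
 * 0..NUM_BLOCKS-1 at compile time. */
#define NUM_BLOCKS 3
static unsigned long bb_count[NUM_BLOCKS];

/* The instrumented program calls this at the entry of every block. */
static void bb_enter(int id) { bb_count[id]++; }

/* Hypothetical instrumented function: each "block" bumps its counter. */
int abs_sum(const int *a, int n)
{
    int s = 0;
    bb_enter(0);                 /* entry block */
    for (int i = 0; i < n; i++) {
        bb_enter(1);             /* loop body */
        s += a[i] < 0 ? -a[i] : a[i];
    }
    bb_enter(2);                 /* exit block */
    return s;
}
```

After a profiling run, `bb_count[1]` being much larger than the others identifies the loop body as the hot code to place together.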

Phase 2: Compiler Branch Hints
if (error) // not taken
{ … }
The compiler provides hints about whether branches are taken or not taken, using profile information.
In this question:
- Learn to use the simulator as a profiler
- Learn to estimate the benefits of optimizations.
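In GCC and Clang, this kind of hint can also be written by hand with `__builtin_expect`, which tells the compiler the error path is rarely taken so the hot path falls through. A minimal sketch (the `unlikely` macro and `process` function are our illustrative names):

```c
/* Hint that the condition is usually false, so the compiler lays out
 * the hot path as straight-line fall-through code. */
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int value)
{
    if (unlikely(value < 0)) {   /* error path: hinted not taken */
        return -1;
    }
    return value * 2;            /* hot path */
}
```

The hint changes code layout and branch direction, not semantics: `process` returns the same results with or without it.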

Example: a simple loop
1000: …
1004: …
1008: jz …   // mostly not taken
…
1032: jmp 1000
For each branch, mark taken or not taken:
- Taken > 50% of the time: mark taken
- Not taken > 50% of the time: mark not taken
In the example above: 1008: not taken; 1032: not taken; 1064: taken

Instr | Frequency | Not-taken
1008  | …,000     | 65%
1032  | …         | …%
1064  | …,000     | 35%

Profiling Run
For each static branch instruction, collect:
- its execution frequency
- the percentage taken / not taken
Modify the bpred_update function in bpred.c:
- Maintain a data structure for each branch instruction, indexed by instruction address
- Maintain the frequency and taken information
- Dump this information at the end.
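A minimal sketch of the per-branch record described above. The real change lives inside `bpred_update()` in bpred.c; the small direct-mapped table below is a simplified stand-in, and all names here are ours, not SimpleScalar's:

```c
/* One record per static branch, indexed by instruction address. */
#define PROF_SIZE 1024

struct branch_prof {
    unsigned long addr;     /* branch instruction address (0 = empty) */
    unsigned long count;    /* execution frequency */
    unsigned long taken;    /* how often it was taken */
};

static struct branch_prof prof[PROF_SIZE];

/* Called once per executed branch (as bpred_update would be). */
void profile_branch(unsigned long addr, int was_taken)
{
    struct branch_prof *p = &prof[(addr >> 2) % PROF_SIZE];
    if (p->addr != addr) {           /* first use (or collision: evict) */
        p->addr = addr;
        p->count = 0;
        p->taken = 0;
    }
    p->count++;
    if (was_taken)
        p->taken++;
}

/* Analysis step: hint taken iff taken more than 50% of the time. */
int branch_hint_taken(unsigned long addr)
{
    struct branch_prof *p = &prof[(addr >> 2) % PROF_SIZE];
    return p->addr == addr && 2 * p->taken > p->count;
}
```

A real implementation would use a growable hash table rather than evicting on collision, and would dump every populated entry at simulator exit.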

Analysis
From the information collected:
- If a branch is taken > 50% of the time, mark it taken; otherwise, mark it not taken
- Remember the instruction addresses and the hints.

Performance Estimation
For all branches, predict taken / not taken according to the hint:
- You may want to load all the hints into a data structure at the start
- Use a data structure similar to the one used for profiling, indexed by branch instruction address
- Estimate the new CPI
Notes:
- It is sufficient to do this for cc and anagram
- After modifying SimpleScalar, remember to run make!
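The misprediction count under static hints follows directly from the profile: a branch hinted taken mispredicts every time it is actually not taken, and vice versa. A sketch with illustrative numbers (the counts below are made up, not from the project):

```c
/* Per-branch profile plus the static hint assigned to it. */
struct hinted_branch {
    unsigned long count;   /* dynamic executions */
    unsigned long taken;   /* times actually taken */
    int hint_taken;        /* the static hint (1 = taken) */
};

/* Illustrative data: 65/100 taken but hinted taken -> 35 misses;
 * 10/50 taken and hinted not taken -> 10 misses. */
static const struct hinted_branch example_hints[] = {
    { 100, 65, 1 },
    {  50, 10, 0 },
};

/* Total mispredictions if every branch always follows its hint. */
unsigned long static_mispredicts(const struct hinted_branch *b, int n)
{
    unsigned long miss = 0;
    for (int i = 0; i < n; i++)
        miss += b[i].hint_taken ? b[i].count - b[i].taken : b[i].taken;
    return miss;
}
```

Dividing the total by the instruction count and multiplying by the misprediction penalty gives the CPI adjustment to estimate.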

Phase 2: L2 Replacement Policy
LRU policy:
- Works well
- Hardware complexity is high: status bits are needed to track when each block in a set was last accessed, and their number increases with associativity
PLRU (pseudo-LRU) policies:
- Simpler replacement policies that attempt to mimic LRU.

Tree-Based PLRU Policy
- For an n-way cache, there are n-1 binary decision bits
- Consider a 4-way set-associative cache:
  - L0, L1, L2, and L3 are the blocks in the set
  - B0, B1, and B2 are the decision bits
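A minimal sketch of tree-based PLRU for one 4-way set, using the blocks and bits named above. The convention chosen here (one of several equivalent ones): a bit of 0 points left toward the victim, 1 points right, and an access flips the bits on its path to point away from the block just used.

```c
/* Decision bits for one 4-way set: B0 at the root, B1 over L0/L1,
 * B2 over L2/L3. */
struct plru4 {
    int b0, b1, b2;
};

/* Follow the decision bits to the pseudo-LRU victim (0..3). */
int plru_victim(const struct plru4 *s)
{
    if (s->b0 == 0)
        return s->b1 == 0 ? 0 : 1;   /* left subtree: L0 or L1 */
    else
        return s->b2 == 0 ? 2 : 3;   /* right subtree: L2 or L3 */
}

/* On an access to block `way`, point every bit on the path away from it. */
void plru_access(struct plru4 *s, int way)
{
    if (way < 2) {
        s->b0 = 1;            /* victim search goes right next time */
        s->b1 = (way == 0);   /* within L0/L1, point at the other block */
    } else {
        s->b0 = 0;
        s->b2 = (way == 2);
    }
}
```

After touching L0, L1, L2, L3 in order, the bits point at L0 (matching true LRU); but after then re-touching L0, PLRU points at L2 even though L1 is the true LRU block, which is exactly the approximation that keeps the hardware down to n-1 bits.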

Tree-Based PLRU for a 4-Way Cache (tree diagram)

Notes
- Use a 4 KB direct-mapped L1 cache (hopefully this should lead to some L2 accesses!)
- Use a 16-way, 256 KB L2 cache (hopefully enough ways to make a difference!)
- Compare PLRU with LRU, FIFO, and Random
- It is sufficient to do this experiment for cc and anagram.
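SimpleScalar configures caches by set count rather than total size, so the configurations above need converting. A small sketch of the arithmetic, assuming a 32-byte block size (the slides do not fix one):

```c
/* Number of sets for a cache of the given total size:
 * nsets = size / (block_size * associativity). */
int cache_nsets(int size_bytes, int block_bytes, int assoc)
{
    return size_bytes / (block_bytes * assoc);
}
```

With 32-byte blocks, the 4 KB direct-mapped L1 has 4096 / (32 * 1) = 128 sets and the 16-way 256 KB L2 has 262144 / (32 * 16) = 512 sets, which in SimpleScalar's `<name>:<nsets>:<bsize>:<assoc>:<repl>` configuration string would look like `-cache:dl1 dl1:128:32:1:l -cache:dl2 ul2:512:32:16:l`.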

Perfect Memory Disambiguation
Memory disambiguation:
- Techniques employed by the processor to execute loads and stores out of order
- Uses a hardware structure called the load/store queue (LSQ), which tracks the addresses and values of loads and stores
- A load can be issued from the LSQ if there are no prior stores writing to the same address
- If an address is unknown, the load can't be issued
Perfect disambiguation: all addresses are known.
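The issue rule above can be sketched as a scan over the earlier LSQ entries. This is a simplified stand-in for the logic around `lsq_refresh()` in sim-outorder.c; the struct, sentinel, and function names are ours:

```c
#define ADDR_UNKNOWN 0    /* sentinel: store address not yet computed */

struct lsq_entry {
    int is_store;
    unsigned long addr;   /* ADDR_UNKNOWN if not yet resolved */
};

/* Illustrative queue in program order: store, load, unresolved store, load. */
static const struct lsq_entry example_lsq[] = {
    { 1, 0x100 },
    { 0, 0x200 },
    { 1, ADDR_UNKNOWN },
    { 0, 0x300 },
};

/* May the load at index load_idx issue, given program-ordered entries?
 * It may not if any earlier store has an unknown or matching address. */
int load_may_issue(const struct lsq_entry *lsq, int load_idx)
{
    unsigned long laddr = lsq[load_idx].addr;
    for (int i = 0; i < load_idx; i++) {       /* scan earlier entries */
        if (!lsq[i].is_store)
            continue;
        if (lsq[i].addr == ADDR_UNKNOWN)       /* unknown store address */
            return 0;
        if (lsq[i].addr == laddr)              /* conflict with prior store */
            return 0;
    }
    return 1;
}
```

Under perfect disambiguation, the `ADDR_UNKNOWN` case never fires, so only genuine address conflicts stall a load.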

How Are Addresses Known?
Two ways to do this:
- Trace-based: run once and collect and remember all the addresses
- All register values are actually known to the simulator through functional simulation: even though a register is yet to be computed, the simulator knows its value
Look at the lsq_refresh() function in sim-outorder.c.
To give you the flexibility to do it both ways:
- Simulate only a million instructions
- Fast-forward 100 million instructions.

Memory Disambiguation
- Compare CPI with and without perfect disambiguation
- It is sufficient to do this for cc and go
- Use -fastfwd to skip 100 million instructions, then simulate an additional 1 million instructions.