Confessions of a RAMP Heretic: Fast, Full-System, Cycle-Accurate x86/PowerPC/ARM/Sparc Simulators
Derek Chiou, UT Austin. RAMP, 6/15/06.

Presentation transcript:

Slide 1: Confessions of a RAMP Heretic: Fast, Full-System, Cycle-Accurate x86/PowerPC/ARM/Sparc Simulators
Derek Chiou
University of Texas at Austin, Electrical and Computer Engineering
RAMP, 6/15/06

Slide 2: FAST Goals
- Fast: as fast as possible
  - 2-3 orders of magnitude slower than target?
  - Fast enough to run real datasets to completion
  - Interactive?
- Accurate: produce cycle-accurate numbers for modern microprocessors (Pentium M)
- Complete: run unmodified operating systems, applications, ISAs, ...
- Transparent: full visibility, no performance hit
- Inexpensive: need thousands
- Usable: quick changes, use RTL to generate
- I/O: the MOST important part of systems

Slide 3: Functional/Timing Partitioning
- Proven partitioning
  - Asim, SimpleScalar, Timing-First, Memoized, etc.
- Simplifies simulator
- Promotes reuse
- Same performance in software
  - Asim at 10 kHz
  - Most of the time spent in timing model!
- Hardware???
[Diagram: the Functional Model (ISA: instructions, architectural registers, peripheral functionality, ...) feeds an instruction stream to the Timing Model (micro-architecture: fetch, decode, rename, reservation stations, scheduling window, reorder buffer, ...).]

Slide 4: FAST
- Functional model could be
  - Pure software (QEMU, Bochs, Simics, SimNow)
    - Use JIT for performance, very fast
    - No better hardware for executing ISA than processor
    - Can operate under the covers (flush cache, for example)
  - Pure hardware (Hoe et al.)
  - Hybrid (Hoe et al.)
- Timing model: very simple hardware
[Diagram labels: Functional Model (ISA), Inst stream, Timing Model (Micro-architecture), FPGA, Full-System Simulator.]
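The functional/timing split on slides 3 and 4 can be illustrated with a minimal software sketch. Everything here is an assumption made for illustration: the toy three-opcode ISA, the `TraceEntry` record, and the one-cycle load-use stall are not taken from the FAST implementation. The point is only that the timing side consumes opcodes and register names, never data values.

```python
from collections import namedtuple

# Hypothetical trace record handed from the functional model to the timing model.
TraceEntry = namedtuple("TraceEntry", "pc opcode dest srcs")

def functional_model(program, regs, mem):
    """Execute a toy ISA (assumed for this sketch) and yield one trace entry per instruction."""
    for pc, (op, dest, a, b) in enumerate(program):
        if op == "addi":                 # dest = reg[a] + immediate b
            regs[dest] = regs[a] + b
            srcs = (a,)
        elif op == "add":                # dest = reg[a] + reg[b]
            regs[dest] = regs[a] + regs[b]
            srcs = (a, b)
        elif op == "load":               # dest = mem[reg[a] + immediate b]
            regs[dest] = mem.get(regs[a] + b, 0)
            srcs = (a,)
        else:
            raise ValueError(f"unknown opcode {op}")
        yield TraceEntry(pc, op, dest, srcs)

def timing_model(trace, load_use_stall=1):
    """Count cycles for an in-order, single-issue pipeline using only the trace."""
    cycles = 0
    prev = None
    for entry in trace:
        cycles += 1
        # Interlock: an instruction that reads the result of the preceding load stalls.
        if prev is not None and prev.opcode == "load" and prev.dest in entry.srcs:
            cycles += load_use_stall
        prev = entry
    return cycles

program = [
    ("addi", "r1", "r0", 4),
    ("load", "r2", "r1", 0),
    ("add",  "r3", "r2", "r1"),   # depends on the load: one stall cycle
    ("add",  "r4", "r1", "r1"),
]
regs = {"r0": 0, "r1": 0, "r2": 0, "r3": 0, "r4": 0}
mem = {4: 7}
cycles = timing_model(functional_model(program, regs, mem))
```

In this sketch the functional model alone determines `regs`, and the timing model alone determines `cycles`, so either side could be replaced without touching the other.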

Slide 5: What is a FAST Timing Model?
[Figure: pipelined datapath fed by a trace. Labels: Trace, PC, Instruction Memory (addr, inst), IR, GPR File (rr1, rr2, rd1, rd2, wr, wd, we), Immed. Extend, ALU, Data Memory (raddr, waddr, wdata, rdata, re, we), algn, pipeline registers A, B, MD1, Y, MD2, R, Bypass/interlock, I1, I2.]

Slide 6: More Complexity
- Caches/TLBs?
  - Keep tags, pass address (virtual and physical if necessary)
  - Hits and misses are determined, but the data is not needed
- Superscalar (multiple issue)?
  - "Fetch and issue" multiple instructions, assuming they meet boundary constraints
  - Multiple "functional units"
  - Schedulers
  - Reorder buffer/instruction window
- Pipeline control along with instructions
- NO DATAPATH (and only part of control path)!!!!
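The "keep tags, pass address" idea can be sketched as a cache model that stores tags only and reports hit or miss without holding any data. This is a minimal illustration, not the FAST cache model: the class name `TagOnlyCache` and the LRU replacement policy are assumptions, since the slides do not name a policy.

```python
class TagOnlyCache:
    """Set-associative cache timing model that stores tags only (no data)."""

    def __init__(self, size_bytes, ways, line_bytes):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        # Each set holds its resident tags in LRU order (least recent first).
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss; update tags and LRU order."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)          # mark as most recently used
            self.hits += 1
            return True
        if len(tags) == self.ways:
            tags.pop(0)               # evict the least recently used tag (assumed policy)
        tags.append(tag)
        self.misses += 1
        return False

# 32KB, 4-way, 16B lines: one of the configurations listed on slide 15.
cache = TagOnlyCache(32 * 1024, ways=4, line_bytes=16)
first = cache.access(0x0)     # cold miss
second = cache.access(0x4)    # same 16B line, so a hit
third = cache.access(0x10)    # next line, cold miss
```

Because only tags are kept, the storage cost scales with the number of lines rather than with cache capacity, which is what lets a large target cache fit in a small amount of FPGA memory.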

Slide 7: Driving a Timing Model
[Diagram labels: Functional Model, iTLB, iCache, Align & Pick, Decode, Sched, dTLB, dCache, L2 Cache, Memory & I/O timing models.]

Slide 8: Complexity: BP
[Diagram labels: Functional Model, iTLB, iCache, Align & Pick, Decode, Sched, dTLB, dCache, L2 Cache, Memory & I/O timing models.]
- Wrong-path instructions!
- Implement BP in timing model
  - Timing model forces ISA simulator to mis-speculate
  - Rollback, restore
- BP only works in processor if it's fairly accurate
- Degrades to trace driven!
- FAST simulators take advantage of the fact that most of the time the micro-architecture is on the right path
- Most complexity (BP, parallelism) can be handled this way
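A small sketch of how a timing-model branch predictor might detect the points where wrong-path instructions are needed: it compares its own prediction with the outcome reported by the functional model. The 2-bit saturating counter matches the "2b branch prediction" mentioned on slide 15, but the table size, the indexing by `pc % entries`, the initial counter value, and the `(inst_num, pc, taken)` trace format are assumptions for illustration.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (0-1 predict not taken, 2-3 predict taken)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries   # start weakly not-taken (assumed)

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def find_mispredictions(branch_trace, predictor):
    """Return the dynamic instruction numbers where prediction and outcome differ.

    At each of these points the timing model would have to ask the functional
    model for wrong-path instructions and later roll it back.
    """
    mispredicted = []
    for inst_num, pc, taken in branch_trace:
        if predictor.predict(pc) != taken:
            mispredicted.append(inst_num)
        predictor.update(pc, taken)
    return mispredicted

# A loop branch at a hypothetical PC: taken five times, then falls through.
trace = [(n, 0x40, True) for n in range(5)] + [(5, 0x40, False)]
mispredicted = find_mispredictions(trace, TwoBitPredictor())
```

Only two of the six dynamic branches disagree with the predictor here, which is the property the slide relies on: the functional model stays on the right path most of the time and is redirected only at the mismatches.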

Slide 9: Parallelism: Detect Problem & Rollback
[Diagram labels: FM, FM, Memory; TM, TM, Network, Memory Model.]

Slide 10: Functional Model Rollback
- Need to
  - Rollback, force branch
  - Rollback, restore and continue
- How? set_pc(inst_num, pc)
  - Sets a particular dynamic instance of an instruction to the instruction pointed to by PC
  - Sufficient
- Currently implemented with checkpoints
  - ISA state, memory, peripherals
- Works for parallelism too
[Diagram label: BR]
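The following is a hedged sketch of one way `set_pc(inst_num, pc)` could be built on checkpoints. Only the call name and its two arguments come from the slide. The toy ISA, the fixed checkpoint interval, the `forced` table, and the replay loop are assumptions for illustration, and the state here is just a PC and an accumulator rather than full ISA state, memory, and peripherals.

```python
import copy

class FunctionalModel:
    """Toy functional model with periodic checkpoints and set_pc() rollback."""

    def __init__(self, program, interval=2):
        self.program = program                  # pc -> (opcode, argument)
        self.interval = interval                # checkpoint every `interval` instructions
        self.state = {"pc": 0, "acc": 0}
        self.inst_num = 0                       # number of the next dynamic instruction
        self.checkpoints = {0: copy.deepcopy(self.state)}
        self.forced = {}                        # inst_num -> pc forced by the timing model

    def step(self):
        """Execute one instruction and return the PC it was fetched from."""
        if self.inst_num in self.forced:
            self.state["pc"] = self.forced[self.inst_num]
        pc = self.state["pc"]
        op, arg = self.program[pc]
        if op == "add":
            self.state["acc"] += arg
            self.state["pc"] = pc + 1
        elif op == "jmp":
            self.state["pc"] = arg
        else:
            raise ValueError(f"unknown opcode {op}")
        self.inst_num += 1
        if self.inst_num % self.interval == 0:
            self.checkpoints[self.inst_num] = copy.deepcopy(self.state)
        return pc

    def set_pc(self, inst_num, pc):
        """Make dynamic instruction `inst_num` be the instruction at `pc`."""
        # Forced PCs at or after this point belong to the path being discarded.
        self.forced = {n: p for n, p in self.forced.items() if n < inst_num}
        self.forced[inst_num] = pc
        # Restore the latest checkpoint at or before inst_num, then replay up to it.
        base = max(n for n in self.checkpoints if n <= inst_num)
        self.checkpoints = {n: s for n, s in self.checkpoints.items() if n <= base}
        self.state = copy.deepcopy(self.checkpoints[base])
        self.inst_num = base
        while self.inst_num < inst_num:
            self.step()

program = {0: ("add", 1), 1: ("add", 10), 2: ("add", 100), 3: ("add", 1000)}
fm = FunctionalModel(program)
right_path = [fm.step() for _ in range(3)]   # runs PCs 0, 1, 2
fm.set_pc(1, 3)                              # rollback, force instruction 1 to PC 3
wrong_pc = fm.step()
fm.set_pc(1, 1)                              # rollback, restore and continue
restored_pc = fm.step()
```

The same call covers both cases on the slide: forcing a mis-speculated path uses a PC the functional model would not have chosen, and restoring uses the PC it would have.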

Slide 11: RTL to Timing Model
[Figure: the pipelined datapath from slide 5, fed by a trace. Labels: Trace, PC, Instruction Memory, IR, GPR File, Immed. Extend, ALU, Data Memory, algn, pipeline registers A, B, MD1, Y, MD2, R, Bypass/interlock, I1, I2.]
- Timing model perfectly models RTL
- Verification???

Slide 12: Current FAST System

Slide 13: QEMU on Xilinx PowerPC

Slide 14: Status
- x86 functional model boots Linux, targeting Pentium D-like and beyond (Dam Sunwoo)
  - Modified Bochs and QEMU
- Branch-predicted, multi-function-unit, OOO timing model compiles in Bluespec (FAST group)
  - Synthesized for FPGA, 8.5K lines of code, rated Top 5 User!
- Memory, disk models
  - Hope to have network model soon
- Have straight-pipeline 486 model with TLBs and caches
- Preliminary statistics gathered in hardware timing model
- RTL-to-timing model (Nikhil Patil)
- Defining tools for ISA extension and timing model assembly

Slide 15: Timing Model Resources
- OOO, superscalar, 2b branch prediction, five functional units, 32KB DCache
  - [INTERFACE: Fast_if] + [TM: IfcVB (interface between Bluespec and Verilog)/CmdQ/Fetch/Decode/Rename/Execute]: 26% of V2P30 (3593 slices)
  - 22 block RAMs (out of 136)
  - ROB broken right now
- Early configurable cache model (state shouldn't change much)
  - 32KB 4-way set-associative cache with 16B cache lines
    - 165 slices (1% of a 2VP30)
    - 17 block RAMs (12% of a 2VP30)
  - 2MB 4-way set-associative cache with 64B cache lines
    - 140 slices (1% of a 2VP30)
    - 40 block RAMs (29% of a 2VP30)
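As a worked check on the two cache configurations above, the short calculation below derives their geometry and confirms the block RAM percentages against the 136-block-RAM total given on the slide. The 32-bit address width and the helper name `cache_geometry` are assumptions. The tag-store figure counts tag bits only (no valid or replacement bits) and is not meant to explain the reported block RAM counts.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits=32):
    """Derive lines, sets and tag width for a power-of-two set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1    # log2 for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits   # addr_bits=32 is an assumption
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
        "tag_store_bits": lines * tag_bits,      # tags only, no valid/LRU bits
    }

small = cache_geometry(32 * 1024, ways=4, line_bytes=16)
large = cache_geometry(2 * 1024 * 1024, ways=4, line_bytes=64)

# Block RAM shares, using the 136-block-RAM total stated on the slide.
TOTAL_BRAMS = 136
bram_share = {n: int(100 * n / TOTAL_BRAMS) for n in (17, 22, 40)}
```

Under these assumptions the 32KB cache has 512 sets with 19-bit tags and the 2MB cache has 8192 sets with 13-bit tags, and 17 and 40 block RAMs come to 12% and 29% of 136, consistent with the slide.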

Slide 16: Current Performance
- Functional model
  - Up to 500K x86 inst/sec today on V2P30 FPGA
    - Includes rollbacks, assuming 5% mis-speculation
    - Not that optimized
  - 5 MIPS unmodified
  - 10M+ on 3.0GHz Pentium 4
    - DRC box should give this performance
  - PowerPC ISA should be much faster!
    - PowerPC on PowerPC
- Timing model
  - Not bottleneck!

Slide 17: Conclusions
- 1MHz to 100MHz, cycle-accurate, full-system, multiprocessor x86, x86-64, PowerPC, ARM, Sparc simulator
- Leverage extant full-system simulators
- FPGA timing models maximize performance and statistics-gathering capabilities
- Pretty much any timing model seems to fit into a single FPGA (Pentium M in V2P30?)
- Uniprocessor, multiprocessor capable
- Tools can minimize creation/modification effort