Closely-Coupled Timing-Directed Partitioning in HAsim Michael Pellauer † Murali Vijayaraghavan †, Michael Adler ‡, Arvind †, Joel.


Closely-Coupled Timing-Directed Partitioning in HAsim Michael Pellauer † Murali Vijayaraghavan †, Michael Adler ‡, Arvind †, Joel Emer †‡ † MIT CS and AI Lab Computation Structures Group ‡ Intel Corporation VSSAD Group To Appear In: ISPASS 2008

Motivation
We want to simulate target platforms quickly
We also want to construct simulators quickly
Partitioned simulators are a known technique from traditional performance models
Simplifies timing model
Amortize functional model design effort over many models
Functional Partition can be extremely FPGA-optimized
[Figure: Timing Partition and Functional Partition; labels: ISA, Off-chip communication, Micro-architecture, Resource contention, Dependencies, Interaction]

Different Partitioning Schemes
As categorized by Mauer, Hill and Wood (Source: [MAUER 2002], ACM SIGMETRICS)
We believe that a timing-directed solution will ultimately lead to the best performance
Both partitions on the FPGA

Functional Partition in Software Asim
Get Instruction (at a given Address)
Get Dependencies
Get Instruction Results
Read Memory *
Speculatively Write Memory * (locally visible)
Commit or Abort instruction
Write Memory * (globally visible)
* Optional depending on instruction type
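The operation list above can be read as an interface that a timing model drives one step at a time. The following is a minimal Python sketch of that idea, not the Asim or HAsim implementation: the class name, the method names, the three-instruction toy ISA (`addi`, `load`, `store`) and the tuple encoding are all assumptions made for illustration.

```python
class FunctionalPartition:
    """Toy functional partition: holds architectural state, knows nothing about timing."""

    def __init__(self, program):
        self.program = program    # address -> (op, a, b, c); hypothetical encoding
        self.regs = [0] * 8
        self.memory = {}          # globally visible memory
        self.store_buffer = {}    # token -> (address, value); locally visible only
        self.inflight = {}        # token -> per-instruction record
        self.next_token = 0

    def _write_reg(self, rec, reg, value):
        rec["undo"] = (reg, self.regs[reg])   # remember old value for abort
        self.regs[reg] = value

    def get_instruction(self, addr):
        token = self.next_token
        self.next_token += 1
        self.inflight[token] = {"inst": self.program[addr], "undo": None, "ea": None}
        return token, self.program[addr]

    def get_dependencies(self, token):
        op, a, b, _ = self.inflight[token]["inst"]
        # addi/load read register b and write register a; store reads a and b.
        return ([a, b], None) if op == "store" else ([b], a)

    def get_results(self, token):
        rec = self.inflight[token]
        op, a, b, c = rec["inst"]
        value = self.regs[b] + c              # ALU result or effective address
        if op == "addi":
            self._write_reg(rec, a, value)
        else:
            rec["ea"] = value
        return value

    def read_memory(self, token):             # loads only
        rec = self.inflight[token]
        # Prefer the most recently buffered store to the same address.
        pending = [v for (addr, v) in self.store_buffer.values() if addr == rec["ea"]]
        value = pending[-1] if pending else self.memory.get(rec["ea"], 0)
        self._write_reg(rec, rec["inst"][1], value)
        return value

    def speculatively_write_memory(self, token):   # stores only
        rec = self.inflight[token]
        self.store_buffer[token] = (rec["ea"], self.regs[rec["inst"][1]])

    def commit(self, token):
        self.inflight.pop(token)              # result becomes permanent

    def abort(self, token):
        # Simplification: only correct when aborting youngest-first.
        rec = self.inflight.pop(token)
        if rec["undo"] is not None:
            reg, old = rec["undo"]
            self.regs[reg] = old
        self.store_buffer.pop(token, None)

    def write_memory_global(self, token):     # stores only, after commit
        addr, value = self.store_buffer.pop(token)
        self.memory[addr] = value
```

In this sketch the timing model decides when each method is called and how many model cycles it costs; the functional partition only answers what the instruction does.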

Execution in Phases
The Emer Assertion: All data dependencies can be represented via these phases
[Figure: example phase sequences F D X R C, F D X W C W, F D X C, F D X R A, F D X X C W]

Detailed Example: 3 Different Timing Models Executing the same instruction sequence:

Functional Partition in Hardware?
Requirements:
Support these operations in hardware
Allow for out-of-order execution, speculation, rollback
Challenges:
Minimize operation execution times
Pipeline wherever possible
Tradeoff between BRAM/multiport RAMs
Race conditions due to extreme parallelism

Functional Partition As Pipeline
Conveys concept well, but poor performance
[Figure: Timing Model connected to a Functional Partition with stages Token Gen, Fet, Dec, Exe, Mem, LCom, GCom, plus Register State (RegFile) and Memory State]

Implementation: Large Scoreboards in BRAM
Series of tables in BRAM
Store information about each in-flight instruction
Tables are indexed by “token”
Also used by the timing partition to refer to each instruction
New operation “getToken” to allocate a space in the tables
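A software analogy may help picture the token-indexed tables. The sketch below is an assumption-laden illustration in Python, not the BRAM design from the talk: the class name `TokenTables`, the three example tables, the `release` operation and the free-list allocation policy are all hypothetical.

```python
from collections import deque


class TokenTables:
    """Parallel tables, one row per in-flight instruction, all indexed by token."""

    def __init__(self, num_tokens=4):
        self.free = deque(range(num_tokens))     # tokens not currently in flight
        # Hypothetical per-instruction tables; real designs would hold more.
        self.address = [None] * num_tokens
        self.dest_reg = [None] * num_tokens
        self.result = [None] * num_tokens

    def get_token(self):
        """Allocate a row in every table; None means the tables are full."""
        if not self.free:
            return None
        token = self.free.popleft()
        # Clear stale data left behind by the token's previous owner.
        self.address[token] = None
        self.dest_reg[token] = None
        self.result[token] = None
        return token

    def release(self, token):
        """Return a token to the pool once its instruction commits or aborts."""
        self.free.append(token)
```

Because the timing partition holds only the token, it can name an instruction in any later operation without carrying the instruction's state itself.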

Implementing the Operations See paper for details (also extra slides)

Assessment: Three Timing Models
Unpipelined Target
5-Stage Pipeline
MIPS R10K-like out-of-order superscalar

Assessment: Target Performance Targets have idealized memory hierarchy

Assessment: Simulator Performance Some correspondence between target and functional partition is very helpful

Assessment: Reuse and Physical Stats
Where is functionality implemented:
[Table: rows are the Functional Partition and the Unpipelined, 5-Stage, and Out-of-Order designs; columns are IMem, Program Counter, Branch Predictor, Scoreboard/ROB, Reg File, Maptable/Freelist, ALU, DMem, Store Buffer, Snapshots/Rollback. Only an "N/A" entry for Unpipelined and one for 5-Stage survive.]
FPGA usage (Virtex IIPro 70, using ISE 8.1i), values listed in the order Unpipelined | 5-Stage | Out-of-Order:
FPGA Slices: 6599 (20%) | 9220 (28%) | 22,873 (69%)
Block RAMs: 18 (5%) | 25 (7%)
Clock Speed: 98.8 MHz | 96.9 MHz | 95.0 MHz
Average FMR Simulation Rate: 2.4 MHz | 14 MHz | 6 MHz
Average Simulator IPS: 2.4 MIPS | 5.1 MIPS | 4.7 MIPS
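The clock speed, simulation rate and IPS figures on this slide can be related by simple arithmetic. The Python sketch below assumes that the simulation rate is the number of model cycles simulated per second, that FMR (the FPGA-to-Model Cycle Ratio) is FPGA clock divided by simulation rate, and that simulator IPS is simulation rate times target IPC. These relationships are a reading of the slide, and the inputs are the rounded averages shown on it, so the derived values are approximate.

```python
# (FPGA clock in MHz, simulation rate in MHz, simulator MIPS), as listed on the slide.
designs = {
    "Unpipelined": (98.8, 2.4, 2.4),
    "5-Stage": (96.9, 14.0, 5.1),
    "Out-of-Order": (95.0, 6.0, 4.7),
}


def derived(clock_mhz, sim_rate_mhz, sim_mips):
    """Return (FMR, target IPC) under the assumptions stated above."""
    fmr = clock_mhz / sim_rate_mhz   # FPGA cycles spent per model cycle
    ipc = sim_mips / sim_rate_mhz    # instructions per model cycle
    return fmr, ipc


for name, row in designs.items():
    fmr, ipc = derived(*row)
    print(f"{name}: FMR ~ {fmr:.1f}, target IPC ~ {ipc:.2f}")
```

Under these assumptions the unpipelined target needs by far the most FPGA cycles per model cycle, which matches the later caveat about correspondence between the timing model and the functional partition.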

Future Work: Simulating Multicores
Scheme 1: Duplicate both partitions
Scheme 2: Cluster Timing Partitions
[Figure, first diagram: Timing Models A, B, C, D; Func Reg + Datapath blocks; Functional Memory State; label "Interaction occurs here"]
[Figure, second diagram: Timing Models A, B, C, D; Functional Reg State + Datapath; Functional Memory State; label "Interaction still occurs here"]
Use a context ID to reference all state lookups
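The context-ID idea can be sketched in a few lines: one shared datapath serves several simulated cores, and every state lookup carries the ID of the core it belongs to. The Python below is illustrative only; the class and method names are hypothetical, and it assumes register state is private to each context while memory is shared among them.

```python
class MultiplexedFunctionalState:
    """One functional state block shared by several timing models (contexts)."""

    def __init__(self, num_contexts, num_regs=32):
        # Register state is replicated per context and selected by context ID.
        self.regs = [[0] * num_regs for _ in range(num_contexts)]
        # Assumption: memory is shared, so this is where cores can interact.
        self.memory = {}

    def read_reg(self, ctx, reg):
        return self.regs[ctx][reg]

    def write_reg(self, ctx, reg, value):
        self.regs[ctx][reg] = value

    def load(self, ctx, addr):
        # ctx is accepted for a uniform interface; shared memory ignores it here.
        return self.memory.get(addr, 0)

    def store(self, ctx, addr, value):
        self.memory[addr] = value
```

The design choice illustrated is that adding a core costs one more row of state rather than one more copy of the datapath logic.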

Future Work: Simulating Multicores
Scheme 3: Perform multiplexing of timing models themselves
Leverage HAsim A-Ports in Timing Model
Out of scope of today’s talk
[Figure: Timing Models A, B, C, D; Functional Reg State + Datapath; Functional Memory State; label "Interaction still occurs here"]
Use a context ID to reference all state lookups

Future Work: Unifying with the UT-FAST model
UT-FAST is Functional-First
This can be unified into Timing-Directed
Just do “execute-at-fetch”
[Figure: Func Partition and Timing Partition on the FPGA; Emulator (functional emulator running in software); labels: execution stream, resteer]

Summary
Described a scheme for closely-coupled timing-directed partitioning
Both partitions are suitable for on-FPGA implementation
Demonstrated such a scheme’s benefits:
Very Good Reuse, Very Good Area/Clock Speed
Good FPGA-to-Model Cycle Ratio
Caveat: Assuming some correspondence between timing model and functional partitions (recall the unpipelined target)
We plan to extend this using contexts for hardware multiplexing [Chung 07]
Future: rare complex operations (such as syscalls) could be done in software using virtual channels

Questions?

Extra Slides

Functional Partition Fetch

Functional Partition Decode

Functional Partition Execute

Functional Partition Back End

Timing Model: Unpipelined

5-Stage Pipeline Timing Model

Out-Of-Order Superscalar Timing Model