Precomputation-based Prefetching, by James Schatz and Bashar Gharaibeh

Outline
Introduction
Implementations of Precomputation
Some Examples of Precomputation
Results of Precomputation Tests
Summary

Introduction
Why precomputation?
Designed to improve single-thread performance on a multithreaded system
Utilizes idle hardware to improve cache hit rates
Useful in programs with unpredictable access patterns

Precomputation
Allows programs to run faster
What are the key causes of delay?
Waiting for input values
Waiting for memory
Poor speculation

Solving the Delay Issue
Since most programs are slowed by waiting for data, prefetching this data would speed execution.

Problems with Prefetching
Small instruction window
Needs to predict branches
Has limited resources on hand
Does not solve the problem of pointer chains

Solution
Expand the instruction window!
Normally done by increasing instruction-level parallelism (ILP)
Increasing ILP means increasing the sizes of hardware structures such as the register file, the issue queues, and the reorder buffer
This is not an ideal solution for people working with fixed structure sizes, so another approach is necessary

Precomputation Solution
The instruction window can be expanded by executing instructions in a separate thread that assists the main thread, testing for cached data and evaluating branches ahead of the main thread.
Since this work is executed before it normally would be, it is referred to as being "precomputed".

Adding Precomputation to Your CPU
How can precomputation be included? There are several methods for using multiple threads:
A secondary thread runs ahead of the main thread, under software control
When the main thread stalls on an instruction, a secondary thread executes; this method is hardware controlled
A mixture of hardware and software control
Each method has certain advantages

Implementation of the Design
Software-Controlled Precomputation
Hardware-Controlled Precomputation

Software-Controlled Precomputation

The Basics
Allows the compiler to insert helper threads into code that is likely to incur cache misses
Launches precomputation threads based on the programmer's knowledge, cache-miss profiling, and compiler locality analysis

Running the Threads
When the code calls for a precomputation thread to be created, check for idle hardware
If no hardware is idle, drop the request
Otherwise, start a precomputation thread at the given PC

Applications of Software Precomputation
Applies to programs with irregular access patterns that are typically difficult for prefetching
Usually those involving pointers, hash tables, and indirect array references

Fixing Pointer Chains
A big problem with prefetching is that of pointer chains
A pointer chain is a structure where the address of the next node is not known until the current load finishes
Single chains can be resolved by using jump-pointer prefetching
Jump pointers become too complex when multiple chains must be resolved
Running a helper thread for each chain allows multiple chains to be resolved quickly

Using Precomputation on a Linked List
A single helper thread is used because there are enough nodes present for precomputation to mask the latency, as sketched below.
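
The sketch below illustrates the idea in plain software, using an ordinary POSIX thread in place of an idle SMT context; the node type, payload field, and sum_list function are assumptions for illustration, not code from the slides.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical list node; 'payload' stands in for real per-node data. */
typedef struct node { struct node *next; long payload; } node;

/* Helper thread body: walk the chain doing nothing but the loads.
   Each p->next load pulls one node into the cache, so the helper
   absorbs the pointer-chasing latency instead of the main thread. */
static void *precompute_walk(void *head) {
    volatile node *p = head;        /* volatile keeps the loads alive */
    while (p != NULL)
        p = p->next;
    return NULL;
}

/* Main thread: launch the helper, then do the real per-node work,
   ideally hitting in the cache because the helper ran ahead. */
long sum_list(node *head) {
    pthread_t helper;
    pthread_create(&helper, NULL, precompute_walk, head);
    long sum = 0;
    for (node *p = head; p != NULL; p = p->next)
        sum += p->payload * p->payload;   /* stand-in for real work */
    pthread_join(helper, NULL);
    return sum;
}
```

On the hardware the slides describe, the helper would run in a spare SMT context (via the PreExecute primitives shown later) rather than an OS thread, avoiding thread-creation overhead.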

More Complicated Uses for Precomputation
Hashing is the most difficult challenge for prefetching, for two reasons:
Good hash algorithms are fairly random, so regular prefetching is hard
Good hash algorithms use short chains, so jump-pointer prefetching will not work
Precomputation allows N hash lookups to run at the same time, reducing memory stalls (see the sketch below)
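
One way to get this overlap in plain software is to batch independent lookups and issue all of their bucket fetches before probing, as sketched below; the table layout, hash function, and lookup_batch are assumptions for illustration (__builtin_prefetch is the GCC/Clang prefetch intrinsic).

```c
#include <stddef.h>

/* Illustrative chained hash table; none of this comes from the slides. */
typedef struct entry { struct entry *next; unsigned key; int val; } entry;

#define NBUCKETS 4096
extern entry *buckets[NBUCKETS];

static inline unsigned hash(unsigned key) {
    return (key * 2654435761u) % NBUCKETS;   /* multiplicative hashing */
}

/* Overlap n independent lookups: first issue a prefetch for every
   bucket's first entry, then probe. The randomly scattered misses are
   serviced in parallel instead of one at a time. */
void lookup_batch(const unsigned *keys, int *out, int n) {
    for (int i = 0; i < n; i++)
        __builtin_prefetch(buckets[hash(keys[i])], 0 /* read */, 1);
    for (int i = 0; i < n; i++) {
        out[i] = -1;                          /* -1 means "not found" */
        for (entry *e = buckets[hash(keys[i])]; e != NULL; e = e->next)
            if (e->key == keys[i]) { out[i] = e->val; break; }
    }
}
```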

Support for Software-Based Precomputation
In order to utilize software-based precomputation, it is necessary to add a few new instructions to the existing processor:
Thread_ID = PreExecute_Start(Start_PC, Max_Insts): Requests an idle context to start pre-execution at Start_PC and stop once Max_Insts instructions have been executed. Thread_ID holds either the identity of the pre-execution thread or -1 if there is no idle context. This instruction has effect only if executed by the main thread.
PreExecute_Stop(): The thread that executes this instruction terminates itself if it is a pre-execution thread; no effect otherwise.
PreExecute_Cancel(Thread_ID): Terminates the pre-execution thread with identity Thread_ID. This instruction has effect only if executed by the main thread.
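
A sketch of how these primitives might be used, assuming C-callable wrappers for the new instructions; prefetch_loop, do_main_work, touch_future_data, and the instruction budget are hypothetical, and passing a function address as Start_PC is a simplification.

```c
/* Assumed C-callable wrappers for the proposed instructions. */
int  PreExecute_Start(void *start_pc, int max_insts);
void PreExecute_Stop(void);
void PreExecute_Cancel(int thread_id);

void do_main_work(void);        /* hypothetical main-thread computation */
void touch_future_data(void);   /* hypothetical: loads the data the main
                                   thread will soon need */

#define MAX_HELPER_INSTS 10000  /* illustrative instruction budget */

void prefetch_loop(void) {
    touch_future_data();        /* warm the cache for the main thread */
    PreExecute_Stop();          /* helper terminates itself */
}

void process(void) {
    /* Request an idle context; -1 means none was free, in which case
       the request is simply dropped and execution proceeds normally. */
    int tid = PreExecute_Start((void *)prefetch_loop, MAX_HELPER_INSTS);

    do_main_work();             /* main thread continues immediately */

    if (tid != -1)
        PreExecute_Cancel(tid); /* finished first: reclaim the context */
}
```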

Hardware-Controlled Precomputation

The Basics
Allocates a set portion of the available registers to precomputation threads
Runs a secondary helper thread when the primary thread is stalled

Integration in Hardware
In order to execute the secondary (future) thread, additional structures are needed within the hardware
These are the future IFQ, the future rename table, and the Preg status table
The processor must also have a PC for each thread, both initially the same

Updating the Hardware at Runtime
The future IFQ is loaded with instructions fetched by the future thread
The future rename table receives a copy of each instruction mapping made in the primary rename table
For each instruction dispatched by the future thread, an entry is added to the Preg status table, which keeps track of the registers assigned to the future thread
Other fields in the Preg status table indicate whether the register can be reused by the future thread (see the sketch below)
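
For concreteness, one Preg status table entry might look like the struct below; the field choice is inferred from the description and is an assumption, not the slides' actual design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative Preg status table entry: tracks one physical register
   allocated by the future thread. Fields are inferred from the text. */
typedef struct {
    uint16_t preg;         /* physical register assigned to the future thread */
    uint64_t seq;          /* sequence number of the mapping instruction */
    bool     value_ready;  /* the result has been computed */
    bool     reusable;     /* future thread may time out and reclaim it */
} preg_status_entry;
```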

Importance of Register Reuse
By allowing the future thread to time out and reuse registers, the future thread can run more efficiently
With the timeout protocols, it is also possible to reallocate resources from the future thread to the primary thread, ensuring the primary thread's priority

Resuming Activity in the Primary Thread
It would be wasteful to run the same instructions twice if the data is still available, so many hardware-based precomputation schemes allow results to be passed on from the Preg status table and future rename table
If an instruction exists in the tables with the appropriate sequence number, its result is handed to the primary thread and removed from the future-thread tables (see the sketch below)
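
Reusing the entry layout sketched earlier, the handoff check might look like the following; this is illustrative pseudologic, not the actual hardware circuit.

```c
/* When the primary thread reaches an instruction, check whether the
   future thread already executed it: a matching sequence number with a
   ready value lets the primary thread adopt the register instead of
   re-executing. Builds on the preg_status_entry sketch above. */
bool try_reuse(uint64_t seq, preg_status_entry *tbl, int n,
               uint16_t *preg_out) {
    for (int i = 0; i < n; i++) {
        if (tbl[i].seq == seq && tbl[i].value_ready) {
            *preg_out = tbl[i].preg;  /* hand the register to the primary thread */
            tbl[i].reusable = false;  /* remove it from the future thread's pool */
            return true;
        }
    }
    return false;
}
```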

Recovering from Branch Mispredictions
Once it begins execution, only the future thread accesses the branch predictor
Instead of its normal operation, the branch predictor gives its information to the future thread, which then conveys outcomes to the primary thread via a FIFO queue (see the sketch below)
These predictions are updated by the future thread as branches are resolved, so the primary thread does not have to go down mispredicted paths
On detecting a misprediction, the future thread rolls back to a checkpoint at the state of the misprediction; the checkpoint records the sequence number of the mapping, not the mapping itself
Anything after this checkpoint can be overwritten, and is flagged as such
Given the opportunistic nature of the future thread, its misprediction penalty does not play a major role in its performance
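
The prediction handoff can be pictured as a small ring buffer between the two threads, as in the sketch below; the capacity and fields are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_SLOTS 64   /* illustrative capacity */

/* Outcome FIFO: the future thread enqueues each branch direction as it
   resolves it; the primary thread dequeues directions instead of
   consulting the predictor, so it never follows a path the future
   thread already proved wrong. */
typedef struct {
    uint64_t pc[FIFO_SLOTS];     /* branch PC */
    bool     taken[FIFO_SLOTS];  /* resolved direction */
    int      head, tail;
} branch_fifo;

/* Future-thread side: returns false when full (producer must wait). */
bool fifo_push(branch_fifo *f, uint64_t pc, bool taken) {
    int next = (f->tail + 1) % FIFO_SLOTS;
    if (next == f->head) return false;
    f->pc[f->tail] = pc;
    f->taken[f->tail] = taken;
    f->tail = next;
    return true;
}

/* Primary-thread side: returns false when empty (fall back to the
   ordinary branch predictor). */
bool fifo_pop(branch_fifo *f, uint64_t *pc, bool *taken) {
    if (f->head == f->tail) return false;
    *pc = f->pc[f->head];
    *taken = f->taken[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    return true;
}
```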

Results of Precomputation Tests

Summary
Precomputation utilizes secondary threads to improve speed
Can be implemented as hardware based or software based
Generally runs programs more than 25% faster than normal execution