RISC Instruction Pipelines and Register Windows


11.3 - RISC Instruction Pipelines and Register Windows By: Andy Le CS147 – Dr. Sin-Min Lee San Jose State University, Fall 2003

Outline Quick Review of RISC Instruction Pipelines Register Windowing and Renaming Practical Perspective

What is RISC? RISC is an acronym for Reduced Instruction Set Computer. Its counterpart is CISC, the Complex Instruction Set Computer. Both design philosophies aim to improve system performance.

What is RISC? (continued) CISC and RISC differ in the complexity of their instruction sets. The reduced number and complexity of instructions in RISC processors is the basis for their improved performance. For example, the smaller instruction set allows a designer to implement a hardwired control unit, which runs at a higher clock rate than an equivalent microsequenced control unit.

11.3.1 Instruction Pipeline A pipeline is like an assembly line in which many products are being worked on simultaneously, each at a different station. In a RISC processor, one instruction is executed while the next is being decoded and its operands loaded, while the following instruction is being fetched, all at the same time.

Instruction Pipeline (continued) By overlapping these operations, the CPU executes 1 instruction per clock cycle even though each instruction requires 3 cycles for fetch, decode, and execute.

Instruction Pipeline Stages Recall the assembly line example. An instruction pipeline processes an instruction the same way an assembly line assembles a product (e.g. a car). A typical 4-stage pipeline: Stage 1: Fetch the instruction from memory. Stage 2: Decode the instruction and fetch any required operands. Stage 3: Execute the instruction. Stage 4: Store the results. All stages process instructions simultaneously, so after a delay to fill the pipeline, the CPU completes one instruction per clock cycle.
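The overlap described above can be sketched as a small simulation (hypothetical code, not from the text): instruction i occupies pipeline stage s during clock cycle i + s, so n instructions finish in n + k − 1 cycles.

```python
# Hypothetical sketch of a 4-stage pipeline: instruction i occupies
# stage s during clock cycle i + s.
STAGES = ["fetch", "decode", "execute", "store"]

def pipeline_schedule(n_instructions):
    """Map each clock cycle to the (instruction, stage) pairs active in it."""
    schedule = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            schedule.setdefault(i + s, []).append((i, stage))
    return schedule

sched = pipeline_schedule(4)
print(len(sched))      # 7 cycles total: n + k - 1 = 4 + 4 - 1
print(len(sched[3]))   # 4 -- once filled, all four stages are busy
```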

Instruction Pipeline Stages (continued) The number of pipeline stages varies between CPUs. The first RISC computer, the IBM 801, used a 4-stage instruction pipeline; the RISC II processor used 3 stages; MIPS uses a 5-stage pipeline.

Instruction Pipeline (Advantages of) We could use several complete control units to process instructions, but a single pipelined control unit offers advantages. It reduces hardware requirements: each stage performs only a portion of the instruction processing, so no stage needs to incorporate a complete hardware control unit. Each stage needs only the hardware associated with its specific task.

Instruction Pipelines (Example of an advantage) The instruction-fetch stage only needs to read an instruction from memory; it does not need the hardware used to decode or execute instructions. Similarly, a stage that decodes instructions does not access memory; the instruction-fetch stage has already loaded the instruction from memory into an instruction register for use by the decode stage.

Instruction Pipelines (Example of an advantage) Another advantage of instruction pipelines is the reduced complexity of the memory interface. If each stage had a complete control unit, all stages could access memory, causing memory access conflicts that the CPU would have to resolve. In general practice, RISC CPUs have their memory partitioned into instruction and data modules.

Instruction Pipeline (Example of an advantage) The instruction-fetch stage reads only from the instruction memory module; at all other times it deals with data in registers. The execute-instruction stage accesses memory only when reading from the data memory module. Custom-designed memory can be configured to allow simultaneous access to different locations, avoiding memory access conflicts between the execute-instruction and store-results stages of the pipeline.

Pipeline Clock Rate The clock rate of the pipeline, and thus of the CPU, is limited by its slowest stage. Example: a 4-stage pipeline with stage delays of 20 ns, 20 ns, 100 ns, and 40 ns. The clock period must be at least 100 ns to accommodate the delay of the 3rd stage, giving a maximum clock rate of 10 MHz. The pipeline achieves maximum performance when all stages have the same delay.
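A quick check of the arithmetic above, as a sketch: the clock period is the maximum stage delay, and the clock rate follows from it.

```python
# Sketch of the clock-rate calculation: the period must cover the
# slowest stage (delays in nanoseconds, from the example above).
stage_delays_ns = [20, 20, 100, 40]

period_ns = max(stage_delays_ns)   # 100 ns, set by the execute stage
clock_mhz = 1000 / period_ns       # ns -> MHz: 1000 / period
print(period_ns, clock_mhz)        # 100 10.0
```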

Pipeline Clock Rate To improve pipeline performance, CPU designers use different numbers of stages in their instruction pipelines. In some CPUs the execute stage is divided into 2 sequential stages.

Pipeline Clock Rate Recall the 4-stage pipeline example (20, 20, 100, 40 ns). Assume the 100 ns execute-instruction stage can be divided into 2 sequential stages of 50 ns each. The pipeline then has one additional stage, for a total of 5 stages with delays of 20, 20, 50, 50, and 40 ns. The clock period is reduced to 50 ns, and the clock frequency doubles to 20 MHz.

Pipeline Clock Rate The same logic can sometimes be used in reverse to reduce the number of stages in a pipeline. In the 5-stage pipeline, we can combine stages 1 and 2 (20 ns each) into a single stage with a delay of 40 ns. The pipeline then has 4 stages with delays of 40, 50, 50, and 40 ns. Combining the first 2 stages slightly reduces the hardware complexity without reducing the clock frequency, and having one fewer stage allows the pipeline to fill and produce results one clock cycle earlier.
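The split and merge transformations from the last two slides can be checked with the same arithmetic (a sketch; the delay lists are the ones given in the text):

```python
# Sketch comparing the three pipeline organizations discussed above
# (stage delays in nanoseconds, taken from the text).
def clock_mhz(delays_ns):
    """Maximum clock rate for a pipeline with the given stage delays."""
    return 1000 / max(delays_ns)

original = [20, 20, 100, 40]        # 4 stages
split    = [20, 20, 50, 50, 40]     # execute stage divided in two
merged   = [40, 50, 50, 40]         # first two stages combined

print(clock_mhz(original))  # 10.0
print(clock_mhz(split))     # 20.0
print(clock_mhz(merged))    # 20.0 -- same rate with one fewer stage
```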

Pipeline Clock Rate The pipeline offers a significant performance improvement over a non-pipelined CPU. The speedup is the ratio of the time needed to process n instructions using a non-pipelined control unit to the time needed using a pipelined control unit.

Speedup Ratio The speedup ratio (Sn) is given by this formula: Sn = (n × T1) / ((n + k − 1) × Tk), where n = number of instructions, T1 = time needed to process 1 instruction (non-pipelined), k = number of stages in the pipeline, and Tk = clock period of the pipeline.

Speedup Ratio Example Let T1 = 180 ns (time needed to process 1 instruction without a pipeline), k = 4 (stages in the pipeline), and Tk = 50 ns (clock period of the pipeline). Applying the formula: Sn = (n × 180) / ((n + 4 − 1) × 50). In the steady state (n → ∞), the maximum speedup is Sn = 180 / 50 = 3.6.
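The formula can be expressed directly (a sketch using the values from the text):

```python
# Speedup of a pipelined over a non-pipelined control unit,
# Sn = (n * T1) / ((n + k - 1) * Tk), with the example's values as defaults.
def speedup(n, t1_ns=180, k=4, tk_ns=50):
    """Speedup ratio for n instructions."""
    return (n * t1_ns) / ((n + k - 1) * tk_ns)

print(round(speedup(10), 2))     # 2.77 -- the fill delay still matters
print(round(speedup(10**6), 2))  # 3.6  -- approaches T1 / Tk for large n
```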

Speedup Ratio Example In reality, the speedup would be slightly less than this, for 2 reasons. First, the formula's limit does not account for the first few cycles needed to fill the pipeline. Second, the 180 ns includes the time needed for the latches at the end of each stage; in a non-pipelined CPU these latches and their associated delays do not exist, so the actual time needed to process an instruction would be slightly less than 180 ns.

Pipeline Problems Pipelines also cause and encounter problems. One problem is memory access. A pipeline must fetch an instruction from memory every clock cycle, but main memory is usually not fast enough to supply data at that rate. RISC processors must therefore include cache memory that is at least as fast as the pipeline; if the CPU cannot fetch 1 instruction per clock cycle, it cannot execute 1 instruction per clock cycle. The cache must separate instructions and data to prevent memory conflicts between different stages of the pipeline.

Pipeline Problems A 2nd problem is caused by branch statements. When a branch is taken, the instructions already in the pipeline should not be executed: they are the instructions located sequentially after the branch statement, not the instructions at the branch target. There is not much the pipeline itself can do about this; an optimizing compiler is needed to reorder instructions to avoid the problem.

11.3.2 Register Windowing and Renaming The reduced hardware requirements of RISC processors leave additional space available on the chip for the system designer. RISC CPUs generally use this space for a large number of registers (occasionally more than 100). The CPU can access data in registers more quickly than data in memory, so having more registers makes more data available faster. More registers also reduce the number of memory references, especially when calling and returning from subroutines.

Register Windowing and Renaming With so many registers, the RISC processor may not be able to access all of them at any given time. Most RISC CPUs have some global registers that are always accessible. The remaining registers are windowed, so that only a subset of the registers is accessible at any specific time.

Register Windows To understand how register windows work, consider the windowing scheme used by the Sun SPARC processor. The processor can access 32 registers at any given time. (SPARC instruction formats always use 5 bits to select a source or destination register, which can therefore take any of 32 values.) Of these 32 registers, 8 are global registers that are always accessible. The remaining 24 registers are contained in the current register window.

Register Window As Figure 11.4 shows, the register windows overlap; in the SPARC CPU the overlap is 8 registers. Note that the organization of the windows is circular, not linear: the last window overlaps with the first. For example, the last 8 registers of window 1 are also the first 8 registers of window 2, and the last 8 registers of window 2 are also the first 8 registers of window 3. The middle 8 registers of window 2 are local: they are not shared with any other window.

Register Window The RISC CPU must keep track of which window is active and which windows contain valid data. A window pointer register holds the number of the currently active window, and a window mask register contains 1 bit per window denoting which windows contain valid data.

Register Window Register windows provide their greatest benefit when the CPU calls a subroutine. During the call, the register window is moved down 1 window position. In the SPARC CPU, if window 1 is active and the CPU calls a subroutine, the processor activates window 2 by updating the window pointer and window mask registers. The CPU can pass parameters to the subroutine via the registers that overlap both windows, instead of through memory, which saves considerable time. The subroutine can use the same registers to return results to the calling routine.

Register Window Example Example: a CPU with 48 registers, organized as 4 windows of 16 registers each with an overlap of 4 registers between windows. Initially the CPU is running a program using the 1st register window (Figure 11.5a). It must call a subroutine and pass it 3 parameters. The CPU stores these parameters in 3 of the 4 overlapping registers and calls the subroutine; Figure 11.5b shows the state of the CPU while the subroutine is active. The subroutine can directly access the parameters. When the subroutine has calculated its result, it stores the result in an overlapping register and returns, deactivating the 2nd window; the CPU again works with the 1st window (Figure 11.5c) and can directly access the result.
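The 48-register layout above can be sketched with an invented mapping function (the register numbering here is illustrative, not from the text): each window advances by 12 physical registers, so 4 registers are shared between adjacent windows and the layout wraps around circularly.

```python
# Hypothetical register numbering for the example above: 4 windows of
# 16 registers with a 4-register overlap, arranged circularly over 48
# physical registers. (The mapping function is invented for illustration.)
NUM_REGS, WIN_SIZE, OVERLAP = 48, 16, 4

def physical_reg(window, logical):
    """Map logical register 0..15 of a window to a physical register."""
    base = window * (WIN_SIZE - OVERLAP)   # each window advances by 12
    return (base + logical) % NUM_REGS

# The caller's last 4 registers are the callee's first 4, so parameters
# stored there before the call are directly visible after it:
print(physical_reg(0, 12) == physical_reg(1, 0))  # True
# Circular layout: the last window wraps around into the first.
print(physical_reg(3, 12) == physical_reg(0, 0))  # True
```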

Register Windows Most RISC processors that use register windowing have about 8 windows, which is sufficient for the vast majority of programs: outside of recursive subroutines, it is unusual for a program to go beyond 8 levels of calls. However, it is possible for the depth of calls to exceed the number of windows, and the CPU must be able to handle this situation.

Register Renaming More recent processors may use register renaming to add flexibility to the idea of register windowing. A processor that uses register renaming can select any registers to create its working register “window.” The CPU uses pointers to keep track of which registers are active and which physical register corresponds to each logical register.

Register Renaming Unlike register windowing, in which only specific groups of physical registers are active at any given time, register renaming allows any group of physical registers to be active. The Practical Perspective looks at register windowing and register renaming in real-world CPUs.
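A minimal sketch of the renaming idea (all names invented, and the bookkeeping is greatly simplified): a rename table maps each logical register to whichever physical register currently backs it, so any free physical register can join the working set.

```python
# Minimal sketch of register renaming (illustrative, not a real CPU's
# scheme): a rename table maps logical registers to physical registers,
# so any physical register can back any logical register.
rename_table = {}                  # logical name -> physical register
free_physical = list(range(8))     # pool of free physical registers

def rename(logical):
    """Give `logical` a fresh physical register for a new write."""
    rename_table[logical] = free_physical.pop(0)
    return rename_table[logical]

rename("r1")                # first write to r1 -> physical register 0
rename("r1")                # second write -> a different physical register
print(rename_table["r1"])   # 1
```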

Practical Perspective Both register windowing and register renaming are used in modern processors. In addition to SPARC processors, the RISC II processor uses register windowing. The RISC II processor has 138 registers; 10 are global and the rest are windowed. Each of its 8 windows contains 22 registers: 6 common input, 10 local, and 6 common output. The UltraSPARC III processor also uses register windowing. The MIPS R10000, on the other hand, uses register renaming. Intel's Itanium processor takes a unique approach: it has 128 integer registers, 32 of which are global. The remaining 96 registers are windowed, but the window size is not fixed; it can vary depending on the needs of the program being executed.

Happy Thanksgiving Day!!!