Published by Lee Potter. Modified over 9 years ago.
1
Superscalar Architectures
Jason Moore and Habib Ammari
March 25th, 2004
CSE 8383: Advanced Computer Architecture
Instructor: Prof. Hesham El-Rewini
2
Outline
- Introduction and Motivations
- Overview of Superscalar Architectures
- Asynchronous Superscalar Architecture Design
- Fault Tolerant Superscalar Design
- Comparison Study of Superscalar Architectures
- Summary and Conclusions
3
Introduction and Review
- A superscalar processor is a multiple-issue processor that issues a varying number of instructions per cycle, whether statically or dynamically scheduled
- Hazard detection is done in hardware
- Execution
  - static scheduling: in-order execution
  - dynamic scheduling: out-of-order execution
4
Introduction and Review (cont'd)
5
Asynchronous Superscalar Design
Why?
- In a clocked design, every result has to wait for the time taken by the longest stage
- How often is the critical path actually taken?
Issues with asynchronous design
- Increased probability of timing faults
- Loss of the predictability that a clock provides
6
Asynchronous Superscalar Design (cont'd)
- Instructions are sent to execution units
- Depends on the compiler to group dependent instructions together
- Instruction compounding can also be done dynamically by using a look-up table
7
Data Forwarding
- A data forwarding request is sent to the next instruction in the compound
- The request gains access to the write port
- Once the request is received:
  - if all other operands are available, data forwarding can occur
  - if the instruction is still waiting on other operands, data forwarding cannot occur
- An acknowledgement or cancellation signal is sent back to the forwarding unit
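The request/reply decision described on this slide can be sketched in a few lines of Python. This is only an illustration of the rule "acknowledge if every other operand is already available, otherwise cancel"; the names `Reply` and `handle_forward_request` and the dictionary of operand states are assumptions made for the example, not part of the original design.

```python
from enum import Enum


class Reply(Enum):
    """Signal sent back to the forwarding unit (hypothetical names)."""
    ACK = "acknowledge"
    CANCEL = "cancel"


def handle_forward_request(operands_ready, forwarded):
    """Decide how the receiving instruction answers a forwarding request.

    operands_ready maps operand names to True/False (already available or not);
    forwarded is the operand that the request offers to deliver.
    """
    others_ready = all(
        ready for name, ready in operands_ready.items() if name != forwarded
    )
    # Forwarding can occur only if nothing else is still outstanding.
    return Reply.ACK if others_ready else Reply.CANCEL
```

For example, `handle_forward_request({"a": False, "b": True}, "a")` acknowledges, because `b` is already available and `a` is the operand being forwarded.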
8
Performance Results
9
Good Idea?
What we have gained
- slight speedup
- no clock distribution problem
The cost
- more circuitry
- longer design times
- increased probability of timing faults
10
Fault Tolerant Superscalar Design
Why?
- As chip feature sizes shrink toward 0.1 micron, transient faults will increase
- Asynchronous designs such as the one I discussed earlier are prone to such faults
- Fault tolerance can be added to the superscalar design at low cost
11
Types of Fault Tolerant Techniques
- Error detection / error correction
- Duplication of the system
- Re-executing each program
12
Needed Changes
- The reorder buffer (ROB) is modified so that all statements are executed twice
- This change will not require any more entries in the ROB
- No more functional units will be required
- What if the 2 results do not match?
13
Basic Idea of How This Works
- An extra bit can be added to each ROB entry to indicate whether the statement is waiting to be run for the first time or the second time
- Existing functional units can be used without significant slowdown, since utilization is not at 100% due to hazards
- The 2 results are compared; if they agree, the statement simply follows the regular superscalar commit algorithm
14
Basic Idea of How This Works (cont'd)
Two options when the 2 results do not match
- Run the statement a third time and take the result on which 2 of the 3 runs agree
- Re-issue the statement
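A minimal Python sketch of the execute-twice-and-compare idea, with the third run and 2-of-3 vote as the recovery option. It models only the decision logic, not the ROB hardware; `execute_redundantly`, the callable `op`, and the helper `make_op` used to simulate a transient fault are hypothetical names introduced for illustration.

```python
from collections import Counter


def execute_redundantly(op):
    """Run op twice; on a mismatch run it a third time and take the 2-of-3 majority.

    op is a zero-argument callable standing in for one statement's execution.
    """
    results = [op(), op()]
    if results[0] == results[1]:
        return results[0]  # results agree: commit as usual
    results.append(op())  # results disagree: third execution
    value, votes = Counter(results).most_common(1)[0]
    if votes >= 2:
        return value  # two of the three runs agree
    raise RuntimeError("no two results agree; the statement must be re-issued")


def make_op(outcomes):
    """Test helper: an op that returns the given outcomes one call at a time."""
    it = iter(outcomes)
    return lambda: next(it)
```

With `make_op([7, 9, 7])` standing in for a run where the second execution is hit by a transient fault, `execute_redundantly` performs a third run and returns 7.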
15
Performance
- If the number of ALUs is >= 2, then there is significant performance lost by the new system
16
Study of a Superscalar Architecture: Ultrascalar
- Problem: circuits with quadratic delay, $\Theta(n^2)$, where n is the issue width
- Study of a superscalar processor architecture: Ultrascalar
- Performance study: VLSI complexities (gate delays, area)
17
Study of a Superscalar Architecture: Ultrascalar (cont'd)
Superscalar architecture implementation
- Execution stations (ALUs and controllers)
- Parallel-prefix tree circuits (interconnection network)
- An interleaved cache connected to the execution stations
- Mechanism to communicate register values between execution stations
Main goal
- Design circuits to have at most linear delays
18
Study of a Superscalar Architecture: Ultrascalar (cont'd)
Performance metrics: three parameters
- L (number of logical registers): the registers seen by the programmer (defined by the ISA), fewer than the real registers employed by the processor implementor
- n (issue width of the processor): number of instructions executed per clock cycle
- M: bandwidth provided to memory
- Memory bandwidth: $M(n) = O(n)$
19
Study of a Superscalar Architecture: Ultrascalar (cont'd)
Ultrascalar design
- The entire logical register file, with ready bits, is passed to every outstanding instruction
- The Ultrascalar datapath considered here has 8 execution stations (responsible for decoding and executing instructions using the data in their register files)
- Execution station (ES) "classification" (oldest ES and all younger ones to its right), t: time
20
Study of a Superscalar Architecture: Ultrascalar (cont'd)

Instruction sequence    Execution station
R3 = R1 / R2            ES 6
R0 = R0 + R3            ES 7
R1 = R5 + R6            ES 0
R1 = R0 + R1            ES 1
R2 = R5 * R6            ES 2
R2 = R2 + R4            ES 3
R0 = R5 - R6            ES 4
R4 = R0 + R7            ES 5

- Communication between ESs goes through rings of MUXes
- L rings of MUXes (one for each logical register defined by the ISA)
21
Study of a Superscalar Architecture: Ultrascalar (cont'd)
22
Study of a Superscalar Architecture: Ultrascalar (cont'd)
- A (logical register's value, ready bit) pair is carried by the MUXes to successive ESs, and a new pair is inserted whenever an ES updates the ring's register
- Ready bit: indicates whether an instruction has computed the register's value
- The registers of each ring are initialized by the oldest ES
- An ES becomes the oldest one on the next clock cycle if it is holding the oldest unfinished instruction
- ES internal structure: ALU, register file, instruction decode logic, and control logic
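To make the ring behaviour concrete, the following Python sketch walks the ring from the oldest ES to the youngest and records, for every ES, which station supplies each of its source registers. It uses the eight-instruction sequence from slide 20 with ES 6 as the oldest station. It models only who-feeds-whom, not ready bits or timing; `ring_suppliers`, `PROGRAM`, and the `"initial"` marker (a value initialized by the oldest ES) are names assumed for the example.

```python
def ring_suppliers(program, oldest):
    """Return, per execution station, the supplier of each source register.

    program[i] = (dest, src_a, src_b) is the instruction held by station i.
    A supplier is a station index, or "initial" if no older outstanding
    instruction writes that register.
    """
    n = len(program)
    last_writer = {}  # register -> station whose pair is currently on that ring
    suppliers = {}
    for k in range(n):
        es = (oldest + k) % n  # walk the ring from oldest to youngest
        dest, src_a, src_b = program[es]
        suppliers[es] = tuple(last_writer.get(r, "initial") for r in (src_a, src_b))
        last_writer[dest] = es  # this station's MUX inserts its own pair
    return suppliers


# Instruction sequence from slide 20, indexed by execution station.
PROGRAM = [
    ("R1", "R5", "R6"),  # ES 0: R1 = R5 + R6
    ("R1", "R0", "R1"),  # ES 1: R1 = R0 + R1
    ("R2", "R5", "R6"),  # ES 2: R2 = R5 * R6
    ("R2", "R2", "R4"),  # ES 3: R2 = R2 + R4
    ("R0", "R5", "R6"),  # ES 4: R0 = R5 - R6
    ("R4", "R0", "R7"),  # ES 5: R4 = R0 + R7
    ("R3", "R1", "R2"),  # ES 6: R3 = R1 / R2 (oldest)
    ("R0", "R0", "R3"),  # ES 7: R0 = R0 + R3
]
```

Calling `ring_suppliers(PROGRAM, oldest=6)` shows, for instance, that ES 1 takes R0 from ES 7 and R1 from ES 0, while ES 5 takes R0 from ES 4 rather than from ES 7, because ES 4 is the nearest older writer on the R0 ring.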
23
Study of a Superscalar Architecture: Ultrascalar (cont'd)
24
Study of a Superscalar Architecture: Ultrascalar (cont'd)
25
Study of a Superscalar Architecture: Ultrascalar (cont'd)
- Scalability issue: a newly computed value must propagate through the entire ring of multiplexers in one clock cycle
- Number of multiplexers in the ring = number of outstanding instructions (n), so the gate delay is linear: $O(n)$
- Goal: reduce the clock cycle
- Replace each ring of multiplexers with a cyclic, segmented, parallel prefix (CSPP) circuit
- The circuit's gate delay is logarithmic (tree structure): $O(\log_2 n)$
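The logarithmic depth can be illustrated in software with a cyclic, segmented scan. The sketch below is an assumption-laden model, not the circuit from the slides: it uses a doubling (Kogge-Stone style) scan rather than the tree layout, and the name `cspp_incoming` and the `(value, drives)` encoding are invented for the example. Each station either drives the ring with its own value or passes along what it receives, and the combining rule "the nearest driving station wins" is associative, which is what lets the work be done in about log2(n) rounds instead of n sequential MUX hops.

```python
def cspp_incoming(outputs):
    """Cyclic segmented prefix with a "nearest preceding driver wins" operator.

    outputs[i] = (value, drives): station i drives the ring with value if
    drives is True, otherwise it passes through whatever it receives.
    Returns (incoming, rounds): the value arriving at each station and the
    number of doubling rounds used.
    """
    n = len(outputs)
    if not any(drives for _, drives in outputs):
        raise ValueError("at least one station (the oldest) must drive the ring")
    window = list(outputs)  # window[i] summarises a run of stations ending at i
    span, rounds = 1, 0
    while span < n:
        # Combine each window with the one ending span positions earlier.
        window = [
            window[i] if window[i][1] else window[(i - span) % n]
            for i in range(n)
        ]
        span *= 2
        rounds += 1
    # A station receives the summary of everything up to its predecessor.
    incoming = [window[(i - 1) % n][0] for i in range(n)]
    return incoming, rounds
```

As a usage example, take a ring for one register with 8 stations in which station 6 (the oldest) drives the initial value and stations 7 and 4 write the register: stations 0 through 4 receive station 7's value, station 5 receives station 4's value, and the scan takes 3 rounds. The value that wraps around to the oldest station is ignored in this model.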
26
Study of a Superscalar Architecture: Ultrascalar (cont'd)
27
Study of a Superscalar Architecture: Ultrascalar (cont'd)
Analysis of Ultrascalar
- Objective: determine the Ultrascalar datapath's wire delay and area
- Wire delay and area = f(VLSI layout)
- 2-dimensional VLSI layout: 16 ESs connected together and to memory
- 2 types of nodes: P (propagates the value of one logical register) and M (routes a number of memory accesses)
- X(n): the side length of an n-station Ultrascalar layout
28
Study of a Superscalar Architecture: Ultrascalar (cont'd)
29
Study of a Superscalar Architecture: Ultrascalar (cont'd)
- X(n) = 2X(n/4) + width of the wires needed to connect the four n/4-wide Ultrascalars
- $\Theta(L)$ wires to connect the registers
- $\Theta(M(n))$ wires to provide memory bandwidth out of a subtree of n ESs
- A 1-station-wide Ultrascalar has width $\Theta(L)$

$$
X(n) =
\begin{cases}
\Theta(L) + \Theta(M(n)) + 2X(n/4) & \text{if } n > 1 \\
\Theta(L) + \Theta(M(1)) = \Theta(L) & \text{otherwise}
\end{cases}
$$
30
Study of a Superscalar Architecture: Ultrascalar (cont'd)

$$
X(n) =
\begin{cases}
\Theta(n^{1/2} L) & \text{if } M(n) = O(n^{1/2-\varepsilon}) \text{ for } \varepsilon > 0 \quad (1) \\
\Theta(n^{1/2} (L + \log n)) & \text{if } M(n) = \Theta(n^{1/2}) \quad (2) \\
\Theta(n^{1/2} L + M(n)) & \text{if } M(n) = \Omega(n^{1/2+\varepsilon}) \text{ for } \varepsilon > 0 \quad (3)
\end{cases}
$$

- For L = constant: cases 1 and 3 are optimal; case 2 is near optimal (optimal to within a factor of log n)
- Case 1: n ESs require a chip that is $\Omega(n^{1/2})$ on a side
- Case 2: similar to case 1
- Case 3: external memory bandwidth of M(n) requires a side length of $\Omega(M(n))$
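As a sanity check on the three cases, the side-length recurrence can be evaluated numerically. The sketch below sets every hidden constant to 1 and restricts n to powers of 4, both of which are simplifying assumptions; `side_length` is a hypothetical helper name. Under those assumptions the recurrence unrolls to the closed forms given in the comments, which match the three asymptotic cases.

```python
import math


def side_length(n, L, M):
    """Evaluate X(n) = L + M(n) + 2*X(n/4) for n > 1 and X(1) = L.

    All asymptotic constants are taken to be 1, and n must be a power of 4.
    M is a function giving the memory bandwidth M(n).
    """
    if n <= 1:
        return L
    return L + M(n) + 2 * side_length(n // 4, L, M)


# Closed forms for n = 4**k with unit constants:
#   M(n) = 0        -> X(n) = (2*sqrt(n) - 1) * L                  (case 1)
#   M(n) = sqrt(n)  -> X(n) = (2*sqrt(n) - 1) * L + k * sqrt(n)    (case 2)
#   M(n) = n        -> X(n) = 2*n + (2*L - 2) * sqrt(n) - L        (case 3)
no_memory = side_length(64, 32, lambda n: 0)
sqrt_memory = side_length(16, 1, lambda n: math.isqrt(n))
linear_memory = side_length(16, 1, lambda n: n)
```

In case 2 the extra `k * sqrt(n)` term, with k = log4(n), is the source of the log n factor; in case 3 the `2*n` term shows the memory wires dominating the side length.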
31
Summary and Conclusions
- Asynchronous superscalar design makes slight speed gains at the added cost of more circuitry, longer design time, and more faults
- The superscalar architecture can be modified inexpensively to be fault tolerant
- Study of a superscalar architecture: Ultrascalar
  - Notion of complexity
  - Performance analysis w.r.t. quantitative metrics
32
Useful Pointers
- D.K. Arvind and Robert D. Mullins, "A Fully Asynchronous Superscalar Architecture", in M. Moonen and F. Catthoor, editors, Proc. of the Third Int. Workshop on Algorithms and Parallel VLSI Architectures, pages 203-215, Elsevier Science Publishers, Aug. 1994.
- D. Henry, B. Kuszmaul, and V. Viswanath, "The Ultrascalar Processor: An Asymptotically Scalable Superscalar Microarchitecture", The Twentieth Anniversary Conference on Advanced Research in VLSI (ARVLSI'99), Atlanta, Georgia, March 21-24, 1999.
- B. Kuszmaul, D. Henry, and G. Loh, "A Comparison of Scalable Superscalar Processors", Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, Saint Malo, France, June 1999.
- Avi Mendelson and Neeraj Suri, "Designing High-Performance & Reliable Superscalar Architectures: The Out of Order Reliable Superscalar (O3RS) Approach", DSN 2000, pages 473-481, June 2000.
- S. Wallace and N. Bagherzadeh, "Performance Issues of a Superscalar Microprocessor", 23rd International Conference on Parallel Processing, August 1994.