
Introduction
MAMAS – Computer Architecture 234267
Dr. Lihu Rappoport, 10/2004
Some of the slides were taken from: (1) Avi Mendelson (2) Randi Katz (3) Patterson

General Course Information
Grade
  – 20% Exercise (mandatory)
  – 80% Final exam
Textbooks
  – Computer Architecture: A Quantitative Approach, Hennessy & Patterson
Other course information: www.cs.technion.ac.il/~cs234267
  – Foils will be on the web several days before the class

Class Focus
CPU
  – Introduction: performance, instruction set (RISC vs. CISC)
  – Pipeline, hazards
  – Out-of-order and speculative execution
  – Branch prediction
Memory Hierarchy
  – Cache
  – Main memory
  – Virtual memory
Advanced Topics
PC Architecture
  – Motherboard & chipset, DRAM, I/O, disk, peripherals

Computer System Structure
[Figure: block diagram of a PC. Labeled components: CPU, cache, CPU bus, North Bridge, memory bus, memory, AGP, graphics adapter, PCI, LAN adapter, LAN, sound card, speakers, South Bridge, USB controller, IDE controller, hard disk, CD-ROM, SIO, parallel port, serial port, floppy drive, keyboard, mouse]

Technology Trends
          Capacity         Speed
  Logic   2× in 3 years    2× in 3 years
  DRAM    4× in 3 years    1.4× in 10 years
  Disk    2× in 3 years    1.4× in 10 years

CPU Performance Trends
Logic speed: 2× per 3 years
Logic capacity: 2× per 3 years
Leads to computing capacity: 4× per 3 years
  – If we could keep all the transistors busy all the time
  – Actual: 3.3× per 3 years

Performance Observations
Performance is doubled every 18-24 months (Gordon Moore)
  – Can it last forever?
      – So far it has
      – Power is becoming a major issue
The “Spiral” principle (A. Grove):
  – New software is written for the fastest available processor
  – In practice, it takes the next generation of processors to run this software (reasonably)
  – It seems that the spiral is slowing down

Architecture & Microarchitecture
Architecture: the collection of features of a processor (or a system) as they are seen by the “user”
  – Examples: instruction set, addressing modes, data width
Microarchitecture (μArch): the collection of features, or the way of implementation, of a processor (or a system) that do not affect the user
  – Examples: cache size and structure, number of execution units, number of execution pipelines
Timing is considered μArch (though it is user visible!)
Processors with the same architecture may have different μArch

Compatibility
Backward compatibility
  – New hardware can run existing software
  – Example: Pentium® 4 can run software originally written for Pentium® III, Pentium® II, Pentium®, 486, 386, 286
Forward compatibility
  – New software can run on existing hardware
  – Example: new software written with MMX™ must still run on older Pentium® processors which do not support MMX™
  – Less important than backward compatibility
New ideas: architecture independent
  – JIT (just-in-time compiler): Java and .NET
  – Binary translation

Comparing Performance
MIPS (millions of instructions per second):
$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Time} \times 10^{6}} = \frac{\text{Clock Rate} \times \text{Instructions/clock}}{10^{6}}$$
  – Relevant when comparing systems with the same instruction set
  – Provides a system-level performance figure that includes all the performance parameters, such as compiler, memory, CPU, etc.
  – NOT a good performance metric for comparing:
      – Machines with different instruction sets
      – Different applications running on the same hardware
MFLOP/s (millions of floating point operations per second):
$$\text{MFLOP/s} = \frac{\text{FP Operations}}{\text{Time} \times 10^{6}}$$
  – Machine dependent
  – Specifically for floating point applications
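The two forms of the MIPS formula can be checked against each other with a short Python sketch. The function names and the run parameters (3 billion instructions, 2 seconds, 3 GHz) are hypothetical, chosen only for illustration.

```python
def mips(instruction_count, time_seconds):
    """Millions of instructions per second, from a measured run."""
    return instruction_count / (time_seconds * 1e6)


def mips_from_clock(clock_rate_hz, instructions_per_clock):
    """The same metric, derived from clock rate and instructions per clock."""
    return clock_rate_hz * instructions_per_clock / 1e6


# Hypothetical run: 3 billion instructions in 2 seconds on a 3 GHz clock.
ic, t, clock = 3_000_000_000, 2.0, 3.0e9
ipc = ic / (t * clock)              # 0.5 instructions per clock
print(mips(ic, t))                  # 1500.0
print(mips_from_clock(clock, ipc))  # 1500.0
```

Both routes give the same number because Instruction Count / Time equals Clock Rate × Instructions/clock.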

Benchmarks – Programs for Evaluating Processor Performance
Toy benchmarks
  – 10-100 line programs
  – e.g., sieve, puzzle, quicksort
Synthetic benchmarks
  – Attempt to match average frequencies of real workloads
  – e.g., Whetstone, Dhrystone
Real programs
  – e.g., gcc, spice
SPEC: System Performance Evaluation Cooperative
  – SPECint (8 integer programs)
  – SPECfp (10 floating point programs)

CPI – Cycles Per Instruction
The CPU is synchronous: it works according to a clock signal
  – Clock cycle is measured in nsec ($10^{-9}$ of a second)
  – Clock rate (= 1/clock cycle) is measured in GHz ($10^{9}$ cycles/sec)
CPI – Cycles Per Instruction
  – Average #cycles per instruction (in a given program):
$$\text{CPI} = \frac{\text{\#cycles required to execute the program}}{\text{\#instructions executed in the program}}$$
  – IPC (= 1/CPI): instructions per cycle
$\text{CPI}_i$ – #cycles to execute a given type of instruction
  – e.g., $\text{CPI}_{add} = 1$, $\text{CPI}_{mul} = 3$
  – Independent of the program

CPI (cont.)
Calculating the CPI of a program
  – $IC_i$: #times an instruction of type $i$ is executed in the program
  – $IC$: #instructions executed in the program, $IC = \sum_i IC_i$
  – $F_i$: relative frequency of instructions of type $i$: $F_i = IC_i / IC$
  – #cycles required to execute the program:
$$\#cyc = \sum_i \text{CPI}_i \times IC_i$$
  – CPI:
$$\text{CPI} = \frac{\#cyc}{IC} = \sum_i \text{CPI}_i \times \frac{IC_i}{IC} = \sum_i \text{CPI}_i \times F_i$$
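As a minimal sketch of this calculation, the Python function below sums cycles over instruction types and divides by the instruction count. The program mix (800 adds, 200 multiplies) is hypothetical; the per-type CPI values reuse the add = 1, mul = 3 example.

```python
def program_cpi(mix):
    """mix maps instruction type -> (IC_i, CPI_i); returns (#cycles, CPI)."""
    ic = sum(count for count, _ in mix.values())
    cycles = sum(count * cpi for count, cpi in mix.values())
    return cycles, cycles / ic


# Hypothetical program: 800 adds (CPI 1) and 200 multiplies (CPI 3).
mix = {"add": (800, 1), "mul": (200, 3)}
cycles, cpi = program_cpi(mix)
print(cycles, cpi)  # 1400 1.4
```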

CPU Time
CPU Time: the time required by the CPU to execute a given program
$$\text{CPU Time} = \text{clock cycle} \times \#cyc = \text{clock cycle} \times \text{CPI} \times IC$$
Our goal: minimize CPU Time
  – Minimize clock cycle: more MHz (process, circuit, μArch)
  – Minimize CPI: μArch (e.g., more execution units)
  – Minimize IC: architecture (e.g., MMX™ technology)
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$
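The CPU Time and Speedup formulas translate directly into code. In this sketch the machine parameters (1 ns clock cycle, CPI of 1.5, one billion instructions) and the enhancement (CPI reduced to 1.2) are hypothetical values picked for illustration.

```python
def cpu_time(clock_cycle_s, cpi, ic):
    """CPU Time = clock cycle x CPI x IC, in seconds."""
    return clock_cycle_s * cpi * ic


def speedup(time_without, time_with):
    """Speedup(E) = ExTime without E / ExTime with E."""
    return time_without / time_with


# Hypothetical machine: 1 ns clock cycle, CPI 1.5, one billion instructions.
base = cpu_time(1e-9, 1.5, 1_000_000_000)
# Hypothetical enhancement: a microarchitecture change lowers CPI to 1.2.
enhanced = cpu_time(1e-9, 1.2, 1_000_000_000)
print(f"{base:.2f} s, {enhanced:.2f} s, speedup {speedup(base, enhanced):.2f}")
# 1.50 s, 1.20 s, speedup 1.25
```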

Amdahl’s Law
Suppose that enhancement E accelerates a fraction F of the task ($\text{Fraction}_{enhanced}$) by a factor S ($\text{Speedup}_{enhanced}$), and the remainder of the task is unaffected. Then:
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$
$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

Amdahl’s Law: Example
Floating point instructions are improved to run at 2×, but only 10% of executed instructions are FP
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times (0.9 + 0.1 / 2) = 0.95 \times \text{ExTime}_{old}$$
$$\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$
Corollary: Make The Common Case Fast
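Amdahl's Law is easy to experiment with in code. The sketch below uses a hypothetical function name, reproduces the floating point example, and then shows the upper bound when the enhanced fraction becomes infinitely fast.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the task is accelerated."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


# The example above: 10% of the work runs 2x faster.
print(f"{amdahl_speedup(0.1, 2):.3f}")             # 1.053
# Even an infinitely fast FP unit cannot beat 1 / 0.9.
print(f"{amdahl_speedup(0.1, float('inf')):.3f}")  # 1.111
```

The second line illustrates the corollary: the untouched 90% bounds the overall gain, so effort is better spent on the common case.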

ISA Design

Instruction Set Design
[Figure: the instruction set drawn as the interface between software and hardware]
The ISA is what the user and the compiler see
The ISA is what the hardware needs to implement

Why Is the ISA Important?
Code size
  – Long instructions may take more time to be fetched
  – Larger code requires larger memory (important in small devices, e.g., cell phones)
Number of instructions (IC)
  – Reducing IC reduces execution time (assuming the same CPI and frequency)
Code “simplicity”
  – Simple HW implementation, which leads to higher frequency and lower power
  – Code optimization can be applied more effectively to “simple code”

Architectural Consideration Example
Displacement address size
  – 1% of addresses > 16 bits
  – 12-16 bits of displacement needed
[Figure: histogram of displacement sizes. X axis: address bits (0 to 15); Y axis: share of displacements (0% to 30%); series: Int. Avg. and FP Avg.]

CISC Processors
CISC – Complex Instruction Set Computer
  – The idea: a high-level machine language
Characteristics
  – Many instruction types, with many addressing modes
  – Some of the instructions are complex:
      – Execute complex tasks
      – Require many cycles
  – ALU operations directly on memory
      – Only a few registers, in many cases not orthogonal
  – Variable length instructions
      – Common instructions get short codes, which saves code length
Example: x86

Top 10 x86 Instructions
  Rank   Instruction               % of total executed
  1      load                      22%
  2      conditional branch        20%
  3      compare                   16%
  4      store                     12%
  5      add                       8%
  6      and                       6%
  7      sub                       5%
  8      move register-register    4%
  9      call                      1%
  10     return                    1%
         Total                     96%
Simple instructions dominate instruction frequency

CISC Drawbacks
Implementing complex instructions and complex addressing modes
  – Complicates the processor
  – Slows down the simple, common instructions
  – Contradicts Make The Common Case Fast
Compilers don’t use complex instructions / indexing methods
Variable length instructions are a real pain in the neck
  – Difficult to decode several instructions in parallel
      – As long as an instruction is not decoded, its length is unknown
      – It is unknown where the instruction ends
      – It is unknown where the next instruction starts
  – An instruction may span more than a single cache line
  – An instruction may span more than a single page
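The decoding problem can be illustrated with a toy model. The encoding below is hypothetical (the first byte of each instruction holds its total length in bytes) and is not the x86 format; it only shows why each instruction boundary depends on decoding the previous instruction, whereas fixed-length boundaries are known in advance.

```python
def starts_variable(code):
    """Instruction start offsets when byte 0 of each instruction is its length."""
    starts, pc = [], 0
    while pc < len(code):
        length = code[pc]       # known only after looking at this instruction
        if length == 0:
            raise ValueError("zero-length instruction")
        starts.append(pc)
        pc += length            # the next start depends on this decode step
    return starts


def starts_fixed(code, width=4):
    """With fixed-length instructions every start offset is known up front."""
    return list(range(0, len(code), width))


# Hypothetical stream: instructions of length 1, 3, 2 and 1 bytes.
stream = bytes([1, 3, 0xAA, 0xBB, 2, 0xCC, 1])
print(starts_variable(stream))   # [0, 1, 4, 6]
print(starts_fixed(bytes(16)))   # [0, 4, 8, 12]
```

The loop in `starts_variable` is inherently sequential, while the fixed-width offsets could all be handed to parallel decoders at once.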

RISC Processors
RISC – Reduced Instruction Set Computer
  – The idea: simple instructions enable fast hardware
Characteristics
  – A small instruction set, with only a few instruction formats
  – Simple instructions
      – Execute simple tasks
      – Most of them require a single cycle (with pipeline)
  – A few indexing methods
  – ALU operations on registers only
      – Memory is accessed using Load and Store instructions only
      – Many orthogonal registers
      – Three-address machine: Add dst, src1, src2
  – Fixed length instructions
Examples: MIPS™, Sparc™, Alpha™, PowerPC™

RISC Processors (Cont.)
Simple architecture ⇒ simple micro-architecture ⇒
  – Simple, small and fast control logic
  – Simpler to design and validate
  – Room for large on-die caches
  – Shorter time-to-market
Using a smart compiler
  – Better pipeline usage
  – Better register allocation
Existing RISC processors are not “pure” RISC
  – e.g., they support division, which takes many cycles

Compilers and ISA
Ease of compilation
  – Orthogonality:
      – No special registers
      – Few special cases
      – All operand modes available with any data type or instruction type
  – Regularity:
      – No overloading of the meanings of instruction fields
  – Streamlined
      – Resource needs easily determined
Register assignment is critical too
  – Easier if there are lots of registers

CISC Is Dominant
The x86 architecture, which is a CISC architecture, dominates the processor market
  – A vast amount of existing software
  – Intel, AMD, Microsoft and others benefit from this
      – Intel and AMD invest a lot of money in making high performance x86 processors, despite the architectural disadvantage
      – Current x86 processors give the best cost/performance
  – CISC processors use μArch ideas from the RISC world
  – Starting with Pentium® II and K6®, x86 processors translate CISC instructions into RISC-like operations internally
      – The inside core looks much like that of a RISC processor

Special Purpose Architecture
Optimize for a specific set of applications
Controllers
  – Small code and data footprint
  – Low energy
DSP
  – ADD + Multiply, Loop
  – Data prefetching
  – “Expose pipe” to allow massive optimizations
Software-specific extensions
  – Extend an existing architecture to accelerate execution of specific applications
  – e.g., MMX™, SSE

MMX™ Technology
Multi-Media-eXtension
57 new instructions added to x86
SIMD: Single Instruction Multiple Data
64-bit packed integer (4×16 or 8×8 or 2×32)
Introduced on Pentium® with MMX™ and Pentium® II in ’97
Added 8 new 64-bit registers (MM0 – MM7)
[Figure: packed add of two 64-bit operands: (a0, a1, a2, a3) + (b0, b1, b2, b3) = (a0+b0, a1+b1, a2+b2, a3+b3)]
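The packed add in the figure can be modeled in a few lines of Python. This is a sketch of the 4×16 case only; the lane ordering (lane 0 in the low bits) and the choice of wraparound rather than saturating arithmetic are assumptions made for illustration, and the helper names are hypothetical.

```python
LANES, WIDTH = 4, 16
MASK = (1 << WIDTH) - 1


def pack(values):
    """Pack four 16-bit values into one 64-bit integer, lane 0 in the low bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (i * WIDTH)
    return word


def unpack(word):
    """Split a 64-bit integer back into its four 16-bit lanes."""
    return [(word >> (i * WIDTH)) & MASK for i in range(LANES)]


def packed_add(a, b):
    """Lane-wise add with wraparound; carries never cross a lane boundary."""
    return pack([(x + y) & MASK for x, y in zip(unpack(a), unpack(b))])


a = pack([1, 2, 3, 0xFFFF])
b = pack([10, 20, 30, 1])
print(unpack(packed_add(a, b)))  # [11, 22, 33, 0]
```

The last lane wraps from 0xFFFF + 1 to 0 without disturbing its neighbor, which is the difference between a packed add and an ordinary 64-bit add.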

SSE™
Streaming SIMD Extensions
128-bit packed / scalar single precision FP (4×32)
Introduced on Pentium® III in ’99
8 new 128-bit registers (XMM0 – XMM7)
Accelerates graphics, video, speech, image, photo processing, encryption, financial, engineering and scientific applications
[Figure, packed (128 bits): (x0, x1, x2, x3) + (y0, y1, y2, y3) = (x0+y0, x1+y1, x2+y2, x3+y3)]
[Figure, scalar (128 bits): (x0, x1, x2, x3) + (y0, y1, y2, y3) = (x0+y0, y1, y2, y3)]
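The difference between the packed and scalar forms can be sketched with plain Python lists standing in for the four 32-bit lanes. The function names are hypothetical, and Python floats are double precision, so this models only the lane behavior drawn on the slide, not single precision rounding.

```python
def add_packed(x, y):
    """All four lanes are added in one operation."""
    return [a + b for a, b in zip(x, y)]


def add_scalar(x, y):
    """Only lane 0 is added; lanes 1..3 pass through from y, as drawn above."""
    return [x[0] + y[0]] + list(y[1:])


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(add_packed(x, y))  # [11.0, 22.0, 33.0, 44.0]
print(add_scalar(x, y))  # [11.0, 20.0, 30.0, 40.0]
```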

SSE2™
FP instructions in packed double precision
  – 2×64 bits; may replace x87
Enlarging MMX from 64 bits to 128 bits
  – 128-bit packed / scalar integer: 16×8, 8×16, 4×32, 2×64, or 1×128 bits
Type conversion and cache control instructions (e.g., Cache Line Flush)
Uses the same 128-bit XMM registers as SSE
Introduced on Pentium® 4 in ’00
[Figure, double precision FP (2×64, 128 bits): (p0, p1) + (q0, q1) = (p0+q0, p1+q1)]
[Figure, 128-bit packed integer: (a0, ..., a7) + (b0, ..., b7) = (a0+b0, ..., a7+b7)]

Virtual Machines (Java)
Machine-independent ISA
  – Can be run on different architectures
  – Each architecture has an emulation (virtual machine) that forms a “system within the system”
The code can be compiled to native code “on the fly”
  – This process is called JIT: Just-In-Time
.NET allows combining code in different formats
  – e.g., different programming languages
Pros
  – Portability, flexibility
Cons
  – Efficiency
  – The JIT can apply only very basic optimizations

Backup

RISC and Amdahl’s Law (Example)
Compared to the CISC architecture:
  – 10% of the static code, which executes 90% of the dynamic instructions, has the same CPI
  – 90% of the static code, which is only 10% of the dynamic instructions, has its CPI increased by 60%
  – The number of instructions executed increases by 50%
  – The speed of the processor is doubled
      – This was true at the time the RISC processors were invented
We get (here $\text{Fraction}_{enhanced} = 0.1$ and $\text{Speedup}_{enhanced} = 1/1.6$, i.e., a slowdown):
$$\frac{\text{CPI}_{new}}{\text{CPI}_{old}} = (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} = 0.9 + 0.1 \times 1.6 = 1.06$$
And then (where clock denotes the clock cycle time):
$$\text{Speedup}_{overall} = \frac{\text{CPU Time}_{old}}{\text{CPU Time}_{new}} = \frac{\text{clock}_{old}}{\text{clock}_{new}} \times \frac{\text{CPI}_{old}}{\text{CPI}_{new}} \times \frac{IC_{old}}{IC_{new}} = \frac{2}{1.06 \times 1.5} = 1.26$$
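The arithmetic of this example can be reproduced in a few lines. The variable names are hypothetical; the numbers are the ones given in the example.

```python
# Ratio CPI_new / CPI_old: 90% of dynamic instructions keep their CPI,
# the other 10% take 1.6x as many cycles.
cpi_ratio = 0.9 * 1.0 + 0.1 * 1.6

ic_ratio = 1.5      # IC_new / IC_old: 50% more instructions executed
clock_ratio = 2.0   # clock_old / clock_new: cycle time is halved

speedup = clock_ratio / (cpi_ratio * ic_ratio)
print(f"{cpi_ratio:.2f} {speedup:.2f}")  # 1.06 1.26
```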

MIPS Architecture
$2^{32}$ bytes of memory
32 × 32-bit GPRs (R0 = 0)
  Instruction          Meaning
  add  R1, R2, R3      R1 = R2 + R3
  addi R1, R0, 4       R1 = R0 + 4
  sub  R1, R2, R3      R1 = R2 – R3
  lw   R1, 100(R2)     R1 = Memory[R2+100]
  sw   R1, 100(R2)     Memory[R2+100] = R1
  beq  R4, R5, Label   if R4 = R5 then PC = Label
  j    Label           PC = Label
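The meaning column can be turned into a tiny interpreter. This is a simplified sketch, not a MIPS simulator: the tuple format `(op, a, b, c)` is a hypothetical convention, memory is modeled as a dictionary of words keyed by address, and `beq` and `j` are left out to keep it short.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def run(program):
    """Execute (op, a, b, c) tuples; returns (registers, memory)."""
    regs = [0] * 32
    memory = {}
    for op, a, b, c in program:
        if op == "add":       # add Ra, Rb, Rc
            regs[a] = (regs[b] + regs[c]) & MASK32
        elif op == "addi":    # addi Ra, Rb, c
            regs[a] = (regs[b] + c) & MASK32
        elif op == "sub":     # sub Ra, Rb, Rc
            regs[a] = (regs[b] - regs[c]) & MASK32
        elif op == "lw":      # lw Ra, c(Rb)
            regs[a] = memory.get(regs[b] + c, 0)
        elif op == "sw":      # sw Ra, c(Rb)
            memory[regs[b] + c] = regs[a]
        else:
            raise ValueError(f"unsupported instruction: {op}")
        regs[0] = 0           # R0 always reads as zero
    return regs, memory


regs, memory = run([
    ("addi", 2, 0, 4),    # R2 = 4
    ("addi", 3, 0, 6),    # R3 = 6
    ("add", 1, 2, 3),     # R1 = 10
    ("sw", 1, 2, 100),    # Memory[104] = 10
    ("lw", 4, 2, 100),    # R4 = 10
    ("sub", 5, 4, 2),     # R5 = 6
])
print(regs[1], regs[4], regs[5], memory)  # 10 10 6 {104: 10}
```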

MIPS Instruction Formats
R-type (register instructions)
  Bits    31-26   25-21   20-16   15-11   10-6    5-0
  Field   op      rs      rt      rd      shamt   funct
  Width   6       5       5       5       5       6
I-type (load, store, branch, instructions with immediate data)
  Bits    31-26   25-21   20-16   15-0
  Field   op      rs      rt      immediate
  Width   6       5       5       16
J-type (jump)
  Bits    31-26   25-0
  Field   op      target address
  Width   6       26
Fields
  – op: operation of the instruction
  – rs, rt, rd: the source and destination register specifiers
  – shamt: shift amount
  – funct: selects the variant of the operation in the “op” field
  – address / immediate: address offset or immediate value
  – target address: target address of the jump instruction
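The three layouts map onto simple shift-and-or encoders. The sketch below uses hypothetical function names; the opcode and funct values (add: op 0, funct 0x20; lw: op 0x23; j: op 2) come from the standard MIPS32 encoding and do not appear on the slide.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """R-type: op | rs | rt | rd | shamt | funct (6, 5, 5, 5, 5, 6 bits)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """I-type: op | rs | rt | immediate (6, 5, 5, 16 bits)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(op, target):
    """J-type: op | target address (6, 26 bits)."""
    return (op << 26) | (target & 0x3FFFFFF)


# add R1, R2, R3  (rd = 1, rs = 2, rt = 3)
print(f"{encode_r(0, 2, 3, 1, 0, 0x20):08x}")  # 00430820
# lw R1, 100(R2)  (rt = 1, rs = 2, immediate = 100)
print(f"{encode_i(0x23, 2, 1, 100):08x}")      # 8c410064
# j with a 26-bit target field of 0x100
print(f"{encode_j(2, 0x100):08x}")             # 08000100
```

Every encoding is exactly 32 bits, which is what makes the fields easy to extract in hardware at fixed positions.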

The Memory Space
Each memory location is 8 bits = 1 byte wide
Each memory location has an address
  – The address space has $2^{32}$ bytes, so an address is 32 bits wide
Memory stores both instructions and data
  – Each instruction is 32 bits wide, so it is stored in 4 consecutive bytes in memory
  – Various data types have different widths
[Figure: memory drawn as a column of 1-byte locations with addresses 00000000, 00000001, 00000002, ... through FFFFFFFE, FFFFFFFF]

MIPS ISA

Example: Calculating CPI
Assumptions
  – ALU operations are executed only on registers
  – Memory is accessed only by load/store
  – Instruction mix:
      Op_i     Freq_i   CPI_i
      ALU      43%      1
      Load     21%      2
      Store    12%      2
      Branch   24%      2
Assume that for 25% of the ALU operations
  – One of the source operands was read from memory a cycle earlier, using a load
  – This data is used only by the current ALU operation

Example (cont.)
Is it worthwhile to add register/memory ALU operations?
  – One source operand in memory
  – One source operand in a register
  – CPI of 2
Assume that this causes the branch CPI to increase to 3
Exec Time = IC × CPI × Clk_cyc

Example Solution
  Op       Freq_i   CPI_i   Freq_i × CPI_i
  ALU      0.43     1       0.43
  Load     0.21     2       0.42
  Store    0.12     2       0.24
  Branch   0.24     2       0.48
  Total                     1.57
$$\text{CPI}_{old} = \sum_i \text{Freq}_i \times \text{CPI}_i = 1.57$$

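Example Solution (cont.)
  Op        Freq_i                CPI_i   Freq_i × CPI_i
  ALU       0.43 – 0.25 × 0.43    1       0.32
  Load      0.21 – 0.25 × 0.43    2       0.20
  Store     0.12                  2       0.24
  Branch    0.24                  3       0.72
  Reg/Mem   0.25 × 0.43           2       0.22
  Total     0.89                          1.7
Each new register/memory operation replaces one load and one ALU operation, so the instruction count drops to 0.89 of the original.
$$\text{CPI}_{new} = \frac{1.7}{0.89} = 1.91$$
$$\text{Exec Time}_{new} = IC_{new} \times \text{CPI}_{new} \times \text{Clk\_cyc}_{new} = (0.89 \times IC_{old}) \times \left(\frac{1.91}{1.57} \times \text{CPI}_{old}\right) \times \text{Clk\_cyc}_{old} = 1.08 \times \text{Exec Time}_{old}$$
Execution time grows by about 8%, so under these assumptions adding the register/memory operations is not worthwhile.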
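The whole example can be re-derived numerically. The sketch below uses hypothetical names and the frequencies and CPI values from the example; frequencies are kept relative to the old instruction count, so the new mix sums to 0.89 rather than 1.

```python
old = {"ALU": (0.43, 1), "Load": (0.21, 2), "Store": (0.12, 2), "Branch": (0.24, 2)}

# 25% of the ALU operations merge with their preceding load into one Reg/Mem op.
moved = 0.25 * 0.43
new = {
    "ALU": (0.43 - moved, 1),
    "Load": (0.21 - moved, 2),
    "Store": (0.12, 2),
    "Branch": (0.24, 3),     # assumed side effect: branch CPI rises to 3
    "Reg/Mem": (moved, 2),
}


def totals(mix):
    """Return (relative instruction count, relative cycle count) for a mix."""
    freq = sum(f for f, _ in mix.values())
    cycles = sum(f * cpi for f, cpi in mix.values())
    return freq, cycles


old_freq, old_cycles = totals(old)
new_freq, new_cycles = totals(new)
cpi_old = old_cycles / old_freq
cpi_new = new_cycles / new_freq
time_ratio = new_cycles / old_cycles   # same clock cycle before and after
print(f"{cpi_old:.2f} {cpi_new:.2f} {time_ratio:.2f}")  # 1.57 1.91 1.08
```

Because the clock cycle is unchanged, the execution time ratio is simply the ratio of total cycles, 1.7 / 1.57.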

IA (x86) Evolution

Hardware Technology
                  1980          1990            2000
  Memory chips    64 K Byte     4 M Byte        256 M-1 G Byte
  CPU speed       1-2 MHz       20-40 MHz       400-1000 MHz
  Hard disks      40 M Byte     1 G Byte        20 G Byte
  LAN (switch)    2-10 M bps    100 M bps       1 G bps
  Busses          2-20 MB/sec   40-400 MB/sec   1 GB/sec

x86 Architecture Extensions
  1978: The Intel 8086 is announced (16-bit architecture)
  1980: The 8087 floating point coprocessor is added
  1982: The 80286 increases the address space to 24 bits and adds instructions
  1985: The 80386 extends to 32 bits and adds new addressing modes
  1989-1995: The 80486, Pentium®, and Pentium® Pro add a few instructions (mostly designed for higher performance)
  1997: Pentium® II and Pentium® with MMX™ add MMX™
  1999: Pentium® III adds SSE; AMD K6 adds 3DNow!
  2000: Pentium® 4 adds SSE2
  2003: AMD Athlon adds AMD64
