Chapter 2 Computer Evolution and Performance

Contents
- A Brief History of Computers
- Designing for Performance
- Pentium and PowerPC Evolution
- Performance Evaluation

A brief history of computers

ENIAC
- Electronic Numerical Integrator And Computer
- John Mauchly and John Presper Eckert
- Trajectory tables for weapons
- Started 1943, finished 1946
  - Too late for the war effort
- Used until 1955
- Decimal (not binary)
- 20 accumulators of 10 digits
- Programmed manually by switches
- 18,000 vacuum tubes
- 30 tons
- 1,500 square feet
- 140 kW power consumption
- 5,000 additions per second

von Neumann/Turing
- Stored Program concept
- Main memory storing programs and data
- ALU operating on binary data
- Control unit interpreting instructions from memory and executing them
- Input and output equipment operated by the control unit
- Princeton Institute for Advanced Studies
  - IAS
- Completed 1952

von Neumann Machine
[Figure: structure of the von Neumann machine, showing Main Memory, Arithmetic and Logic Unit, Program Control Unit, and Input/Output Equipment.]
- If a program could be represented in a form suitable for storing in memory, the programming process could be facilitated
- A computer could get its instructions from memory, and a program could be set or altered by setting the values of a portion of memory

IAS Memory Formats
- 1000 x 40 bit words
  - Binary number
  - 2 x 20 bit instructions
- (a) Number word: sign bit (bit 0) followed by the value (bits 1-39)
- (b) Instruction word: left instruction (bits 0-19) and right instruction (bits 20-39)
- Each instruction consisting of
  - An 8-bit opcode (bits 0-7 or 20-27)
  - A 12-bit address designating one of the words in memory (bits 8-19 or 28-39)
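The instruction-word layout can be made concrete with a small sketch. It assumes bit 0 is the most significant bit of the 40-bit word, so the left instruction occupies the upper 20 bits. The names `Instruction`, `decode_word`, and `encode_word` are illustrative, and the opcode values in the example are arbitrary rather than real IAS opcodes.

```python
from typing import NamedTuple, Tuple

WORD_BITS = 40          # one IAS memory word
INSTRUCTION_BITS = 20   # two instructions per word
ADDRESS_BITS = 12       # address field of each instruction


class Instruction(NamedTuple):
    opcode: int   # 8-bit operation code
    address: int  # 12-bit memory address


def split_instruction(bits: int) -> Instruction:
    """Split a 20-bit instruction into its opcode and address fields."""
    return Instruction(opcode=bits >> ADDRESS_BITS,
                       address=bits & ((1 << ADDRESS_BITS) - 1))


def decode_word(word: int) -> Tuple[Instruction, Instruction]:
    """Return the (left, right) instructions packed into a 40-bit word."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("an IAS word must fit in 40 bits")
    left = word >> INSTRUCTION_BITS
    right = word & ((1 << INSTRUCTION_BITS) - 1)
    return split_instruction(left), split_instruction(right)


def encode_word(left: Instruction, right: Instruction) -> int:
    """Pack two instructions into one 40-bit word (inverse of decode_word)."""
    def pack(instr: Instruction) -> int:
        return (instr.opcode << ADDRESS_BITS) | instr.address
    return (pack(left) << INSTRUCTION_BITS) | pack(right)


# Arbitrary example values, not real IAS opcodes.
word = encode_word(Instruction(0x01, 0x1F4), Instruction(0x05, 0x1F5))
left, right = decode_word(word)
print(left, right)
```

Shifting and masking mirrors what the control unit does when it routes the opcode bits to the IR and the address bits to the MAR.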

IAS Registers
- Memory Buffer Register (MBR)
  - Contains a word to be stored in memory, or is used to receive a word from memory
- Memory Address Register (MAR)
  - Specifies the address in memory of the word to be written from or read into the MBR
- Instruction Register (IR)
  - Contains the 8-bit opcode of the instruction being executed
- Instruction Buffer Register (IBR)
  - Holds temporarily the right-hand instruction from a word in memory
- Program Counter (PC)
  - Contains the address of the next instruction pair to be fetched from memory
- Accumulator (AC) and Multiplier Quotient (MQ)
  - Hold temporarily the operands and results of ALU operations

Structure of IAS
[Figure: structure of the IAS computer. Components shown: Central Processing Unit, containing the Arithmetic and Logic Unit (Accumulator, MQ, arithmetic and logic circuits, MBR) and the Program Control Unit (IBR, IR, PC, MAR, control circuits); Main Memory, exchanging instructions and data with the CPU and receiving addresses; Input/Output Equipment.]

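Partial Flowchart of IAS
[Figure: partial flowchart of IAS operation, divided into a fetch cycle and an execution cycle.]
- Fetch cycle
  - Start: is the next instruction in the IBR?
  - Yes: no memory access required
  - No: MAR ← PC, then MBR ← M(MAR), then test "Left instruction required?"
  - Register transfers shown: IR ← IBR(0:7), MAR ← IBR(8:19); IR ← IBR(20:27), MAR ← IBR(28:39); IBR ← MBR(20:39), IR ← MBR(0:7), MAR ← MBR(8:19)
  - PC ← PC + 1
- Decode instruction in IR
- Execution cycle (examples)
  - AC ← M(X): MBR ← M(MAR), then AC ← MBR
  - Go to M(X, 0:19): PC ← MAR
  - If AC ≥ 0 then go to M(X, 0:19): test "Is AC ≥ 0?"
  - AC ← AC + M(X)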

The IAS Instruction Set
- Data transfer
  - Move data between memory and ALU registers or between two ALU registers
- Unconditional branch
  - Instructions normally execute in sequence; a branch instruction changes this sequence
- Conditional branch
  - The branch can be made dependent on a condition, thus allowing decision points
- Arithmetic
  - Operations performed by the ALU
- Address modify
  - Permits addresses to be computed in the ALU and then inserted into instructions stored in memory

Commercial Computers
- 1947 - Eckert-Mauchly Computer Corporation
- UNIVAC I (Universal Automatic Computer)
- US Bureau of Census 1950 calculations
- Became part of Sperry-Rand Corporation
- Late 1950s - UNIVAC II
  - Faster
  - More memory

IBM
- Had helped build the Mark I
- Punched-card processing equipment
- The 701
  - IBM’s first stored program computer
  - Scientific calculations
- The 702
  - Business applications
- Led to the 700/7000 series

Computer Generations
Generation | Technology                          | Typical Speed (operations per second)
1          | Vacuum tube                         | 40,000
2          | Transistor                          | 200,000
3          | Small- and medium-scale integration | 1,000,000
4          | Large-scale integration             | 10,000,000
5          | Very-large-scale integration        | 100,000,000

Transistors
- Replaced vacuum tubes
- Smaller
- Cheaper
- Less heat dissipation
- Solid state device
- Made from silicon (sand)
- Invented 1947 at Bell Labs
- William Shockley et al.

Transistor Based Computers
- Second generation machines
- NCR & RCA produced small transistor machines
- IBM 7000
- Digital Equipment Corporation (DEC)
  - Produced the PDP-1

IBM 700/7000 Series
Model Number | First Delivery | CPU Technology | Memory Technology   | Cycle Time (µs) | Memory Size (K)
701          | 1952           | Vacuum tubes   | Electrostatic tubes | 30              | 2-4
704          | 1955           | Vacuum tubes   | Core                | 12              | 4-32
709          | 1958           | Vacuum tubes   | Core                | 12              | 32
7090         | 1960           | Transistor     | Core                | 2.18            | 32
7094 I       | 1962           | Transistor     | Core                | 2               | 32
7094 II      | 1964           | Transistor     | Core                | 1.4             | 32

IBM 700/7000 Series (continued)
Model Number | Number of Opcodes | Number of Index Registers | Hardwired Floating Point | I/O Overlap (Channels) | Instruction Fetch Overlap | Speed (relative to 701)
701          | 24                | 0                         | No                       | No                     | No                        | 1
704          | 80                | 3                         | Yes                      | No                     | No                        | 2.5
709          | 140               | 3                         | Yes                      | Yes                    | No                        | 4
7090         | 169               | 3                         | Yes                      | Yes                    | No                        | 25
7094 I       | 185               | 7                         | Yes (double precision)   | Yes                    | Yes                       | 30
7094 II      | 185               | 7                         | Yes (double precision)   | Yes                    | Yes                       | 50

An IBM 7094 Configuration
[Figure: an IBM 7094 configuration. The CPU and memory connect through a multiplexor to data channels, which attach to the peripherals: mag tape units, card punch, line printer, card reader, drum, disk, hypertapes, and teleprocessing equipment.]

The IBM 7094
- The most important point is the use of data channels
  - A data channel is an independent I/O module with its own processor and its own instruction set
- Another new feature is the multiplexor, which is the central termination point for the data channels, the CPU, and memory

Microelectronics
- Literally “small electronics”
- A computer is made up of gates, memory cells and interconnections
- These can be manufactured on a semiconductor
  - e.g. a silicon wafer

Microelectronics
- Data storage
  - Provided by memory cells
- Data processing
  - Provided by gates
- Data movement
  - The paths between components are used to move data from memory to memory and from memory through gates to memory
- Control
  - The paths between components can carry control signals
  - The memory cell will store the bit on its input lead when the WRITE control signal is ON and will place that bit on its output lead when the READ control signal is ON

Wafer, Chip, and Gate
[Figure: relationship among wafer, chip, packaged chip, and gate, illustrating small-scale integration (SSI).]

Generations of Computer
- Vacuum tube
- Transistor
- Small scale integration
  - Up to 100 devices on a chip
- Medium scale integration - to 1971
  - 100-3,000 devices on a chip
- Large scale integration
  - 3,000-100,000 devices on a chip
- Very large scale integration - to date
  - 100,000-100,000,000 devices on a chip
- Ultra large scale integration
  - Over 100,000,000 devices on a chip

Moore’s Law
- Increased density of components on chip
- Gordon Moore - cofounder of Intel
- Number of transistors on a chip will double every year
- Since the 1970s development has slowed a little
  - Number of transistors doubles every 18 months
- Cost of a chip has remained almost unchanged
- Higher packing density means shorter electrical paths, giving higher performance
- Smaller size gives increased flexibility
- Reduced power and cooling requirements
- Fewer interconnections increase reliability
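As a rough illustration of what the two doubling rates imply, the sketch below projects a transistor count forward under a fixed doubling period. The function name and the starting count of 1,000 are hypothetical, and real growth is of course not this regular.

```python
def projected_transistors(initial_count: float,
                          years: float,
                          doubling_period_years: float = 1.5) -> float:
    """Project a transistor count assuming one doubling per period.

    The default period of 1.5 years is the 18-month figure quoted above;
    pass 1.0 for the original "double every year" form.
    """
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    return initial_count * 2 ** (years / doubling_period_years)


# Hypothetical starting point of 1,000 transistors.
print(projected_transistors(1_000, years=3))                             # 18-month doubling
print(projected_transistors(1_000, years=3, doubling_period_years=1.0))  # yearly doubling
```

Over the same three years, yearly doubling gives three doublings while the 18-month rate gives two, so the gap between the two forms widens quickly.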

Growth in CPU Transistor Count

IBM 360 series
- 1964
- Replaced (& not compatible with) 7000 series
- First planned “family” of computers
  - Similar or identical instruction sets
  - Similar or identical O/S
  - Increasing speed
  - Increasing number of I/O ports (i.e. more terminals)
  - Increased memory size
  - Increased cost
- Multiplexed switch structure

Key Characteristics of 360 Family
- Many of its features have become standard on other large computers
Characteristic                              | Model 30 | Model 40 | Model 50 | Model 65 | Model 75
Maximum memory size (bytes)                 | 64K      | 256K     | 256K     | 512K     | 512K
Data rate from memory (Mbytes/s)            | 0.5      | 0.8      | 2.0      | 8.0      | 16.0
Processor cycle time (µs)                   | 1.0      | 0.625    | 0.5      | 0.25     | 0.2
Relative speed                              | 1        | 3.5      | 10       | 21       | 50
Maximum number of data channels             | 3        | 3        | 4        | 6        | 6
Maximum data rate on one channel (Kbytes/s) | 250      | 400      | 800      | 1250     | 1250

DEC PDP-8
- 1964
- First minicomputer (after miniskirt!)
- Did not need an air conditioned room
- Small enough to sit on a lab bench
- $16,000
  - Compared with $100k+ for an IBM 360
- Embedded applications & OEM
- Later models of the PDP-8 used a bus structure that is now virtually universal for minicomputers and microcomputers

PDP-8/E Block Diagram
- Highly flexible architecture allowing modules to be plugged into the bus to create various configurations

Semiconductor Memory
- The first application of integrated circuit technology to computers was construction of the processor
- The same technology was also used to construct memories
- 1970
- Fairchild
- Size of a single core
  - i.e. 1 bit of magnetic core storage
- Holds 256 bits
- Non-destructive read
- Much faster than core
- Capacity approximately doubles each year

Evolution of Intel Microprocessors

Design for performance

Microprocessor Speed
- In memory chips, the relentless pursuit of speed has quadrupled the capacity of DRAM every few years
- Pipelining
- On board cache
- On board L1 & L2 cache
- Branch prediction
- Data flow analysis
- Speculative execution

Evolution of DRAM / Processor Characteristics

Performance Mismatch
- Processor speed increased
- Memory capacity increased
- Memory speed lags behind processor speed

Performance Balance
- The interface between processor and main memory is the most crucial pathway in the entire computer
- It is responsible for carrying a constant flow of program instructions and data between memory chips and the processor

Trends in DRAM use

Performance Balance
- On average, the number of DRAMs per system is going down
- The solid black lines in the figure show that, for a fixed-sized memory, the number of DRAMs needed is declining
- The shaded bands show that, for a particular type of system, main memory size has slowly increased while the number of DRAMs has declined

Solutions
- Increase number of bits retrieved at one time
  - Make DRAM “wider” rather than “deeper”
- Change DRAM interface
  - Cache
- Reduce frequency of memory access
  - More complex cache and cache on chip
- Increase interconnection bandwidth
  - High speed buses
  - Hierarchy of buses

Performance Balance
- Two constantly evolving factors to be coped with
  - The rate at which performance is changing in the various technology areas differs greatly from one type of element to another
  - New applications and new peripheral devices constantly change the nature of the demand on the system in terms of typical instruction profile and data access patterns

Pentium and PowerPC evolution

Intel
- Pentium - results of design effort on CISCs
- 1971 - 4004
  - First microprocessor
  - All CPU components on a single chip
  - 4 bit
- Followed in 1972 by 8008
  - 8 bit
  - Both designed for specific applications
- 8080
  - Intel’s first general purpose microprocessor
- 8086
  - 16 bit, instruction cache, or queue
- 80286
  - Addressing a 16-Mbyte memory

Intel
- 80386
  - 32 bit, multitasking
- 80486
  - Built-in math coprocessor
- Pentium
  - Superscalar techniques
- Pentium Pro
- Pentium II
  - Intel MMX technology
- Pentium III
  - Additional floating-point instructions
- Merced
  - 64-bit organization

PowerPC
- RISC systems
- PowerPC Processor Summary

Performance evaluation

Two Notions of Performance
Plane            | DC to Paris | Speed    | Passengers | Throughput (pmph)
Boeing 747       | 6.5 hours   | 610 mph  | 470        | 286,700
BAC/Sud Concorde | 3 hours     | 1350 mph | 132        | 178,200
- Which has higher performance?
- Time to do the task (Execution Time)
  - Execution time, response time, latency
- Tasks per day, hour, week, sec, ns, ... (Performance)
  - Throughput, bandwidth
- Response time and throughput often are in opposition

To Assess Performance
- Response Time
  - Time to complete a task
- Throughput
  - Total amount of work done per unit time
- Execution Time (CPU Time)
  - User CPU time: time spent in the program
  - System CPU time: time spent in the OS
- Elapsed Time
  - Execution time + time of I/O and time sharing

Criteria of Performance
- Execution time seems to measure the power of the CPU
- Elapsed time measures the performance of the whole system, including OS and I/O
- Users are interested in elapsed time
- Sales people are interested in the highest performance number that can be quoted
- Performance analysts are interested in both execution time and elapsed time

Definitions
- Performance is in units of things-per-second
  - Bigger is better
- If we are primarily concerned with response time
  $\text{performance}(X) = \dfrac{1}{\text{execution\_time}(X)}$
- "X is n times faster than Y" means
  $n = \dfrac{\text{performance}(X)}{\text{performance}(Y)} = \dfrac{\text{execution\_time}(Y)}{\text{execution\_time}(X)}$

Example
- Time of Concorde vs. Boeing 747?
  - Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
- Throughput of Concorde vs. Boeing 747?
  - Concorde is 178,200 pmph / 286,700 pmph = 0.62 times as fast
  - Boeing is 286,700 pmph / 178,200 pmph = 1.6 times faster
- Boeing is 1.6 times (60%) faster in terms of throughput
- Concorde is 2.2 times (120%) faster in terms of flying time
- We will focus primarily on execution time for a single job
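The arithmetic in this example can be checked with a few lines of Python. The figures are the ones from the plane comparison; the dictionary layout and function names are illustrative choices.

```python
# Figures from the DC-to-Paris comparison.
PLANES = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde":   {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def throughput_pmph(plane: dict) -> float:
    """Throughput in passenger-miles per hour: passengers times speed."""
    return plane["passengers"] * plane["mph"]


def times_faster(performance_x: float, performance_y: float) -> float:
    """n such that 'X is n times faster than Y' (ratio of performances)."""
    return performance_x / performance_y


def percent_faster(n: float) -> float:
    """Convert 'n times faster' into a percentage improvement."""
    return (n - 1.0) * 100.0


boeing, concorde = PLANES["Boeing 747"], PLANES["Concorde"]

# Response time view: performance is 1 / execution time, so the ratio of
# performances is the inverse ratio of flying times.
by_time = times_faster(1 / concorde["hours"], 1 / boeing["hours"])

# Throughput view: the Boeing comes out ahead.
by_throughput = times_faster(throughput_pmph(boeing), throughput_pmph(concorde))

print(f"Concorde is {by_time:.1f} times faster in flying time "
      f"({percent_faster(round(by_time, 1)):.0f}% faster)")
print(f"Boeing is {by_throughput:.1f} times faster in throughput "
      f"({percent_faster(round(by_throughput, 1)):.0f}% faster)")
```

Note that "n times faster" corresponds to an improvement of (n - 1) × 100%, which is why a factor of 2.2 is a 120% gain rather than 220%.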

Basis of Evaluation
Workload                    | Pros                                                  | Cons
Actual target workload      | Representative                                        | Very specific; non-portable; difficult to run or measure; hard to identify cause
Full application benchmarks | Portable; widely used; improvements useful in reality | Less representative
Small kernel benchmarks     | Easy to run, early in design cycle                    | Easy to fool
Microbenchmarks             | Identify peak capability and potential bottlenecks    | Peak may be a long way from application performance

MIPS
- Millions of Instructions (Executed) Per Second
- Often used measure of performance
- Native MIPS
  $\text{MIPS} = \dfrac{\text{instruction count}}{\text{execution time} \times 10^6}$
  $= \dfrac{\text{instruction count}}{\text{CPU clocks} \times \text{cycle time} \times 10^6}$
  $= \dfrac{\text{instruction count} \times \text{clock rate}}{\text{CPU clocks} \times 10^6}$
  $= \dfrac{\text{instruction count} \times \text{clock rate}}{\text{instruction count} \times \text{CPI} \times 10^6}$
  $= \dfrac{\text{clock rate}}{\text{CPI} \times 10^6}$
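A short sketch shows that the first and last forms of the native MIPS expression agree. The workload numbers (200 million instructions, CPI of 2, 100 MHz clock) are made up for illustration.

```python
import math


def mips_from_time(instruction_count: float, execution_time_s: float) -> float:
    """Native MIPS as instruction count / (execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def mips_from_cpi(clock_rate_hz: float, cpi: float) -> float:
    """Native MIPS as clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# Hypothetical program and machine.
instruction_count = 200e6
cpi = 2.0
clock_rate_hz = 100e6

cpu_clocks = instruction_count * cpi            # total cycles executed
execution_time_s = cpu_clocks / clock_rate_hz   # cycles x cycle time

a = mips_from_time(instruction_count, execution_time_s)
b = mips_from_cpi(clock_rate_hz, cpi)
print(a, b, math.isclose(a, b))
```

Because the instruction count cancels, the rating depends only on clock rate and CPI, and CPI itself changes with the program's instruction mix.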

MIPS
- Meaningless information
  - Run a program and time it
  - Count the number of executed instructions to get the MIPS rating
- Problems
  - Cannot compare different computers with different instruction sets
  - Varies between programs executed on the same computer
- Peak MIPS
  - This is what many manufacturers provide
  - Usually neglecting to mention ‘peak’

Relative MIPS
- Call the VAX 11/780 a 1 MIPS machine (not true)
  $\text{Relative MIPS} = \dfrac{\text{CPU time of VAX 11/780}}{\text{CPU time of machine A}} \times \text{MIPS of VAX 11/780} = \dfrac{\text{CPU time of VAX 11/780}}{\text{CPU time of machine A}}$
- Makes MIPS rating more independent of benchmark programs
- Advantage of relative MIPS is small

FLOPS
- Million Floating Point Instructions Per Second
- Used for engineering and scientific applications where floating point operations account for a high fraction of all executed instructions
- Problems
  - Program dependent
    - Many programs do not use floating point operations
  - Machine dependent
    - Depends on relative mixture of integer and floating point operations
    - Depends on relative mixture of cheap (+, -) and expensive (×) floating point operations
- Normalized FLOPS (relative FLOPS)
- Peak FLOPS

SPEC Marks
- System Performance Evaluation Cooperative
- Non-profit group initially founded by APOLLO, HP, MIPSCO, and SUN
- Now includes many more like IBM, DEC, AT&T, MOTOROLA, etc.
- Measures the ratio of execution time on the target machine to that on a VAX 11/780
- Summarizes performance by taking the geometric mean of the ratios
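The summarizing step can be sketched as follows. The per-benchmark ratios are invented numbers, and `geometric_mean` is an illustrative helper rather than anything published by SPEC.

```python
import math
from typing import Sequence


def geometric_mean(ratios: Sequence[float]) -> float:
    """Geometric mean of positive ratios: the n-th root of their product.

    Computed through logarithms so that long lists do not overflow.
    """
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical execution-time ratios for four benchmarks.
ratios = [2.0, 8.0, 4.0, 4.0]
print(geometric_mean(ratios))
```

One reason for choosing the geometric mean is that the ranking of two machines does not change when a different reference machine is used to form the ratios, which is not true of the arithmetic mean.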

SPEC95
- Eighteen application benchmarks (with inputs) reflecting a technical computing workload
- Eight integer
  - go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
- Ten floating-point intensive
  - tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
- Must run with standard compiler flags
  - Eliminates special undocumented incantations that may not even generate working code for real programs

Metrics of performance
[Figure: levels of a computer system paired with the performance metrics used at each level.]
- Application: answers per month, useful operations per second
- Programming language, compiler, ISA: (millions) of instructions per second (MIPS); (millions) of (F.P.) operations per second (MFLOP/s)
- Datapath, control: megabytes per second
- Function units, transistors, wires, pins: cycles per second (clock rate)
- Each metric has a place and a purpose, and each can be misused

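Aspects of CPU Performance
$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
- The three terms correspond to instruction count, CPI, and clock rate
- Factors that influence them: program, compiler, instruction set architecture, organization, technology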

CPU Time
$\text{CPU time} = (\text{instruction count}) \times (\text{CPI}) \times (\text{clock cycle}) = \text{number of instructions} \times \dfrac{\text{cycles}}{\text{instruction}} \times \dfrac{\text{seconds}}{\text{cycle}}$
- Clock rate
  - Depends on technology and organization
- CPI (Cycles Per Instruction)
  - Depends on organization and instruction set
- Instruction count
  - Depends on compiler and instruction set

If CPI is not uniform across all instructions
$\text{CPU cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times I_i)$
- $n$: number of instructions in the instruction set
- $\text{CPI}_i$: CPI for instruction $i$
- $I_i$: number of times instruction $i$ occurs in a program
$\text{CPU time} = \sum_{i=1}^{n} (\text{CPI}_i \times I_i \times \text{clock cycle})$
$\text{CPI} = \dfrac{\sum_{i=1}^{n} (\text{CPI}_i \times I_i)}{\text{number of executed instructions}}$
- It assumes that a given instruction always takes the same number of cycles to execute
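These sums translate directly into code. In the sketch below each entry of `mix` is a pair (CPI_i, I_i); the three instruction classes, their counts, and the 10 ns clock cycle are hypothetical.

```python
import math
from typing import Sequence, Tuple

Mix = Sequence[Tuple[float, int]]  # (cycles per instruction, times executed)


def cpu_cycles(mix: Mix) -> float:
    """Total cycles: sum over instruction types of CPI_i x I_i."""
    return sum(cpi_i * count_i for cpi_i, count_i in mix)


def average_cpi(mix: Mix) -> float:
    """Overall CPI: total cycles / number of executed instructions."""
    executed = sum(count_i for _, count_i in mix)
    return cpu_cycles(mix) / executed


def cpu_time(mix: Mix, clock_cycle_s: float) -> float:
    """CPU time: total cycles x clock cycle time."""
    return cpu_cycles(mix) * clock_cycle_s


# Hypothetical program: three instruction types.
mix = [(1, 500), (2, 300), (4, 200)]
print(cpu_cycles(mix))       # total cycles
print(average_cpi(mix))      # cycles per executed instruction
print(cpu_time(mix, 10e-9))  # seconds with a 10 ns clock cycle
```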

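Aspects of CPU Performance
$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
Factor       | Instr. count | CPI | Clock rate
Program      | X            |     |
Compiler     | X            |     |
Instr. Set   | X            | X   |
Organization |              | X   | X
Technology   |              |     | X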

CPI
- Average cycles per instruction
  $\text{CPI} = \dfrac{\text{CPU time} \times \text{clock rate}}{\text{instruction count}} = \dfrac{\text{clock cycles}}{\text{instruction count}}$
  $\text{CPU time} = \text{clock cycle time} \times \sum_{i=1}^{n} (\text{CPI}_i \times I_i)$
  $\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i, \quad \text{where } F_i = \dfrac{I_i}{\text{instruction count}}$ ("instruction frequency")
- Invest resources where time is spent!

Example of RISC
Base Machine (Reg / Reg), typical mix
Op     | Freq | Cycles | CPI(i) | % Time
ALU    | 50%  |        |        |
Load   | 20%  |        |        |
Store  | 10%  |        |        |
Branch | 20%  |        |        |
Total  |      |        | 2.2    |
- How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- How does this compare with using branch prediction to shave a cycle off the branch time?
- What if two ALU instructions could be executed at once?
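One way to work through the three questions is sketched below. The instruction frequencies are the ones given for the base machine, but the per-class cycle counts (ALU 1, load 5, store 3, branch 2) are assumptions chosen only because they reproduce the stated overall CPI of 2.2. The sketch also assumes that instruction count and clock rate stay fixed, and models dual ALU issue as halving the effective ALU cycle count.

```python
import math

# op: (frequency from the slide, ASSUMED cycles per instruction)
BASE_MIX = {
    "ALU":    (0.50, 1.0),
    "Load":   (0.20, 5.0),
    "Store":  (0.10, 3.0),
    "Branch": (0.20, 2.0),
}


def average_cpi(mix: dict) -> float:
    """CPI = sum over instruction classes of frequency x cycles."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix: dict, op: str, cycles: float) -> dict:
    """Copy of the mix with one class's cycle count replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed


def speedup(old_cpi: float, new_cpi: float) -> float:
    """With instruction count and clock rate fixed, speedup = old CPI / new CPI."""
    return old_cpi / new_cpi


base = average_cpi(BASE_MIX)
better_cache = average_cpi(with_cycles(BASE_MIX, "Load", 2.0))
branch_pred = average_cpi(with_cycles(BASE_MIX, "Branch", 1.0))
dual_alu = average_cpi(with_cycles(BASE_MIX, "ALU", 0.5))

print(f"base CPI            {base:.2f}")
print(f"load in 2 cycles    {better_cache:.2f}  speedup {speedup(base, better_cache):.3f}")
print(f"branch 1 cycle less {branch_pred:.2f}  speedup {speedup(base, branch_pred):.3f}")
print(f"two ALU ops at once {dual_alu:.2f}  speedup {speedup(base, dual_alu):.3f}")
```

Under these assumed cycle counts the load improvement wins, since loads account for the largest share of execution time.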

Amdahl's Law
- Speedup due to enhancement E:
  $\text{Speedup}(E) = \dfrac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \dfrac{\text{Performance w/ } E}{\text{Performance w/o } E}$
- Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected; then
  $\text{ExTime}(\text{with } E) = \left((1 - F) + \dfrac{F}{S}\right) \times \text{ExTime}(\text{without } E)$
  $\text{Speedup}(\text{with } E) = \dfrac{1}{(1 - F) + F/S}$
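The speedup expression is easy to evaluate numerically. The function name and the example values of F and S below are illustrative.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction F of the task is accelerated by S.

    Implements 1 / ((1 - F) + F / S).
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("F must lie between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("S must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical case: half of the task runs twice as fast.
print(amdahl_speedup(0.5, 2.0))

# Even an enormous S cannot push the speedup past 1 / (1 - F).
print(amdahl_speedup(0.5, 1e9))
```

The second call illustrates the practical message of the law: the unenhanced fraction bounds the achievable speedup, here at a factor of 2.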

Cost
- Traditionally ignored by textbooks because of rapid change
- Driven by learning curve: manufacturing costs decrease with time
- Understanding learning curve effects on yield is key to cost projection
- Yield
  - Fraction of manufactured items that survive the testing procedure
- Testing and packaging
  - Big factors in lowering costs

Cost
- Cost of chips
  $\text{Cost} = \dfrac{\text{manufacture} + \text{testing} + \text{packaging}}{\text{final yield}}$
  $\text{Cost of die} = \dfrac{\text{cost of wafer}}{\text{dies per wafer} \times \text{die yield}}$
  - Wafer yield = dies / wafer
- Cost vs. price
  - Component cost: 15~33%
  - Direct cost: 6~8%
  - Gross margin: 34~39%
  - Average discount: 25~40%
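A minimal sketch of the chip cost calculation follows. It assumes that the manufacturing term is the cost of one good die, and every dollar figure and yield in the example is hypothetical.

```python
import math


def die_cost(wafer_cost: float, dies_per_wafer: int, die_yield: float) -> float:
    """Cost of one good die: wafer cost / (dies per wafer x die yield)."""
    if dies_per_wafer <= 0 or not 0.0 < die_yield <= 1.0:
        raise ValueError("need a positive die count and a yield in (0, 1]")
    return wafer_cost / (dies_per_wafer * die_yield)


def chip_cost(manufacture: float, testing: float, packaging: float,
              final_yield: float) -> float:
    """Cost of a finished chip: (manufacture + testing + packaging) / final yield."""
    if not 0.0 < final_yield <= 1.0:
        raise ValueError("final yield must lie in (0, 1]")
    return (manufacture + testing + packaging) / final_yield


# Hypothetical numbers: a 1,000-dollar wafer with 100 dies, half of them good.
die = die_cost(wafer_cost=1000.0, dies_per_wafer=100, die_yield=0.5)
chip = chip_cost(manufacture=die, testing=2.0, packaging=3.0, final_yield=0.8)
print(die, chip)
```

Both yields appear in a denominator, so a drop in either one raises the cost of every chip that does survive.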
