Presentation on theme: "Computer Architecture and Organization"— Presentation transcript:
1 Computer Architecture and Organization Computer Evolution and Performance
2 Electronic Numerical Integrator And Computer ENIAC - backgroundElectronic Numerical Integrator And ComputerJohn Presper Eckert and John MauchlyUniversity of PennsylvaniaTrajectory tables for weaponsStarted 1943Finished 1946Too late for war effortUsed until 1955
3 ENIAC - detailsDecimal (not binary)20 accumulators of 10 digitsProgrammed manually by switches18,000 vacuum tubes30 tons15,000 square feet140 kW power consumption5,000 additions per second
4 Stored Program concept Main memory storing programs and data von Neumann/TuringStored Program conceptMain memory storing programs and dataALU operating on binary dataControl unit interpreting instructions from memory and executingInput and output equipment operated by control unitPrinceton Institute for Advanced StudiesIASCompleted 1952
6 Set of registers (storage in CPU) IAS - details1000 x 40 bit wordsBinary number2 x 20 bit instructionsSet of registers (storage in CPU)Memory Buffer RegisterMemory Address RegisterInstruction RegisterInstruction Buffer RegisterProgram CounterAccumulatorMultiplier Quotient
8 1947 - Eckert-Mauchly Computer Corporation Commercial ComputersEckert-Mauchly Computer CorporationUNIVAC I (Universal Automatic Computer)US Bureau of Census 1950 calculationsBecame part of Sperry-Rand CorporationLate 1950s - UNIVAC IIFasterMore memory
9 Punched-card processing equipment 1953 - the 701 IBMPunched-card processing equipmentthe 701IBM’s first stored program computerScientific calculationsthe 702Business applicationsLead to 700/7000 series
10 TransistorsReplaced vacuum tubesSmallerCheaperLess heat dissipationSolid State deviceMade from Silicon (Sand)Invented 1947 at Bell LabsWilliam Shockley et al.
11 Transistor Based Computers Second generation machinesNCR & RCA produced small transistor machinesIBM 7000DECProduced PDP-1
12 MicroelectronicsLiterally - “small electronics”A computer is made up of gates, memory cells and interconnectionsThese can be manufactured on a semiconductore.g. silicon wafer
13 Generations of Computer Vacuum tubeTransistorSmall scale integration onUp to 100 devices on a chipMedium scale integration - to 1971100-3,000 devices on a chipLarge scale integration3, ,000 devices on a chipVery large scale integration100, ,000,000 devices on a chipUltra large scale integration –Over 100,000,000 devices on a chip
14 Moore’s Law Increased density of components on chip Gordon Moore – co-founder of IntelNumber of transistors on a chip will double every yearSince 1970’s development has slowed a littleNumber of transistors doubles every 18 monthsCost of a chip has remained almost unchangedHigher packing density means shorter electrical paths, giving higher performanceSmaller size gives increased flexibilityReduced power and cooling requirementsFewer interconnections increases reliability
16 Replaced (& not compatible with) 7000 series IBM 360 series1964Replaced (& not compatible with) 7000 seriesFirst planned “family” of computersSimilar or identical instruction setsSimilar or identical O/SIncreasing speedIncreasing number of I/O ports (i.e. more terminals)Increased memory sizeIncreased costMultiplexed switch structure
17 First minicomputer (after miniskirt!) DEC PDP-81964First minicomputer (after miniskirt!)Did not need air conditioned roomSmall enough to sit on a lab bench$16,000$100k+ for IBM 360Embedded applications and OEMBUS STRUCTURE - Omnibus
19 Capacity approximately doubles each year Semiconductor Memory1970FairchildSize of a single corei.e. 1 bit of magnetic core storageHolds 256 bitsNon-destructive readMuch faster than coreCapacity approximately doubles each year
20 IntelFirst microprocessorAll CPU components on a single chip4 bitFollowed in 1972 by 80088 bitBoth designed for specific applicationsIntel’s first general purpose microprocessor
24 Increase number of bits retrieved at one time SolutionsIncrease number of bits retrieved at one timeMake DRAM “wider” rather than “deeper”Change DRAM interfaceCacheReduce frequency of memory accessMore complex cache and cache on chipIncrease interconnection bandwidthHigh speed busesHierarchy of buses
25 Peripherals with intensive I/O demands Large data throughput demands I/O DevicesPeripherals with intensive I/O demandsLarge data throughput demandsProcessors can handle thisProblem moving dataSolutions:CachingBufferingHigher-speed interconnection busesMore elaborate bus structuresMultiple-processor configurations
27 Key is BalanceProcessor componentsMain memoryI/O devicesInterconnection structures
28 Improvements in Chip Organization and Architecture Increase hardware speed of processorFundamentally due to shrinking logic gate sizeMore gates, packed more tightly, increasing clock ratePropagation time for signals reducedIncrease size and speed of cachesDedicating part of processor chipCache access times drop significantlyChange processor organization and architectureIncrease effective speed of executionParallelism
29 Problems with Clock Speed and Logic Density PowerPower density increases with density of logic and clock speedDissipating heatRC delaySpeed at which electrons flow limited by resistance and capacitance of metal wires connecting themDelay increases as RC product increasesWire interconnects thinner, increasing resistanceWires closer together, increasing capacitanceMemory latencyMemory speeds lag processor speedsSolution:More emphasis on organizational and architectural approaches
31 Increased Cache Capacity Typically two or three levels of cache between processor and main memoryChip density increasedMore cache memory on chipFaster cache accessPentium chip devoted about 10% of chip area to cachePentium 4 devotes about 50%
32 More Complex Execution Logic Enable parallel execution of instructionsPipeline works like assembly lineDifferent stages of execution of different instructions at same time along pipelineSuperscalar allows multiple pipelines within single processorInstructions that do not depend on one another can be executed in parallel
33 Internal organization of processors complex Diminishing ReturnsInternal organization of processors complexCan get a great deal of parallelismFurther significant increases likely to be relatively modestBenefits from cache are reaching limitIncreasing clock rate runs into power dissipation problemSome fundamental physical limits are being reached
34 New Approach – Multiple Cores Multiple processors on single chipLarge shared cacheWithin a processor, increase in performance proportional to square root of increase in complexityIf software can use multiple processors, doubling number of processors almost doubles performanceSo, use two simpler processors on the chip rather than one more complex processorWith two processors, larger caches are justifiedPower consumption of memory logic less than processing logicExample: IBM POWER4Two cores based on PowerPC
36 Pentium Evolution8080first general purpose microprocessor8 bit data pathUsed in first personal computer – Altair8086much more powerful16 bitinstruction cache, prefetch few instructions8088 (8 bit external bus) used in first IBM PC8028616 Mbyte memory addressableup from 1Mb8038632 bitSupport for multitasking
37 Pentium Evolution 80486 Pentium Pentium Pro sophisticated powerful cache and instruction pipeliningbuilt in maths co-processorPentiumSuperscalarMultiple instructions executed in parallelPentium ProIncreased superscalar organizationAggressive register renamingbranch predictiondata flow analysisspeculative execution
38 Pentium Evolution Pentium II Pentium III Pentium 4 Itanium Itanium 2 MMX technologygraphics, video & audio processingPentium IIIAdditional floating point instructions for 3D graphicsPentium 4Note Arabic rather than Roman numeralsFurther floating point and multimedia enhancementsItanium64 bitsee chapter 15Itanium 2Hardware enhancements to increase speedSee Intel web pages for detailed information on processors
39 Pentium Evolution Core Core 2 First x86 with dual coreCore 264 bit architectureCore 2 Quad – 3GHz – 820 million transistorsFour processors on chipx86 architecture dominant outside embedded systemsOrganization and technology changed dramaticallyInstruction set architecture evolved with backwards compatibility~1 instruction per month added500 instructions availableSee Intel web pages for detailed information on processors
40 PowerPC 1975, 801 minicomputer project (IBM) RISC Berkeley RISC I processor1986, IBM commercial RISC workstation product, RT PC.Not commercial successMany rivals with comparable or better performance1990, IBM RISC System/6000RISC-like superscalar machinePOWER architectureIBM alliance with Motorola (68000 microprocessors), and Apple, (used in Macintosh)Result is PowerPC architectureDerived from the POWER architectureSuperscalar RISCApple MacintoshEmbedded chip applications
41 PowerPC Family 601: 603: 604: 620: Quickly to market. 32-bit machine Low-end desktop and portable32-bitComparable performance with 601Lower cost and more efficient implementation604:Desktop and low-end servers32-bit machineMuch more advanced superscalar designGreater performance620:High-end servers64-bit architecture
42 PowerPC Family 740/750: G4: G5: Also known as G3 Two levels of cache on chipG4:Increases parallelism and internal speedG5:Improvements in parallelism and internal speed64-bit organization
43 Embedded Systems Requirements Different sizesDifferent constraints, optimization, reuseDifferent requirementsSafety, reliability, real-time, flexibility, legislationLifespanEnvironmental conditionsStatic v dynamic loadsSlow to fast speedsComputation v I/O intensiveDescrete event v continuous dynamics
45 Designed by ARM Inc., Cambridge, England Licensed to manufacturers ARM EvolutionDesigned by ARM Inc., Cambridge, EnglandLicensed to manufacturersHigh speed, small die, low power consumptionPDAs, hand held games, phonesE.g. iPod, iPhoneAcorn produced ARM1 & ARM2 in 1985 and ARM3 in 1989Acorn, VLSI and Apple Computer founded ARM Ltd.
46 ARM Systems Categories Embedded real timeApplication platformLinux, Palm OS, Symbian OS, Windows mobileSecure applications
47 Performance Assessment Clock Speed Key parametersPerformance, cost, size, security, reliability, power consumptionSystem clock speedIn Hz or multiples ofClock rate, clock cycle, clock tick, cycle timeSignals in CPU take time to settle down to 1 or 0Signals may change at different speedsOperations need to be synchronisedInstruction execution in discrete stepsFetch, decode, load and store, arithmetic or logicalUsually require multiple clock cycles per instructionPipelining gives simultaneous execution of instructionsSo, clock speed is not the whole story
49 Instruction Execution Rate Millions of instructions per second (MIPS)Millions of floating point instructions per second (MFLOPS)Heavily dependent on instruction set, compiler design, processor implementation, cache & memory hierarchy
50 Benchmarks Programs designed to test performance Written in high level languagePortableRepresents style of taskSystems, numerical, commercialEasily measuredWidely distributedE.g. System Performance Evaluation Corporation (SPEC)CPU2006 for computation bound17 floating point programs in C, C++, Fortran12 integer programs in C, C++3 million lines of codeSpeed and rate metricsSingle task and throughput
51 SPEC Speed Metric Single task Base runtime defined for each benchmark using reference machineResults are reported as ratio of reference time to system run timeTrefi execution time for benchmark i on reference machineTsuti execution time of benchmark i on test systemOverall performance calculated by averaging ratios for all 12 integer benchmarksUse geometric meanAppropriate for normalized numbers such as ratios
52 SPEC Rate MetricMeasures throughput or rate of a machine carrying out a number of tasksMultiple copies of benchmarks run simultaneouslyTypically, same as number of processorsRatio is calculated as follows:Trefi reference execution time for benchmark iN number of copies run simultaneouslyTsuti elapsed time from start of execution of program on all N processors until completion of all copies of programAgain, a geometric mean is calculated
53 Potential speed up of program using multiple processors Amdahl’s LawGene Amdahl [AMDA67]Potential speed up of program using multiple processorsConcluded that:Code needs to be parallelizableSpeed up is bound, giving diminishing returns for more processorsTask dependentServers gain by maintaining multiple connections on multiple processorsDatabases can be split into parallel tasks
54 Amdahl’s Law Formula For program running on single processor Fraction f of code infinitely parallelizable with no scheduling overheadFraction (1-f) of code inherently serialT is total execution time for program on single processorN is number of processors that fully exploit parralle portions of codeConclusionsf small, parallel processors has little effectN ->∞, speedup bound by 1/(1 – f)Diminishing returns for using more processors
55 Computer Performance Measures Example 1:A program runs on computer A in 10 seconds. A has a 4 GHz clock rate. Design a computer B that runs the same program in 6 seconds. Constraint is that a faster design is possible but will require 1.2 times as many clock cycles as A. What is B’s clock rate?
56 Computer Performance Measures Example 2:Given are two computers with different instruction sets: B’s clock rate is 3 times that of A’s; a program on B requires twice as many instructions as one on A to do the same task. However, B’s CPI rate is 2, whereas A’s CPI rate is 3. Which machine does a job faster and by how much?
57 Computer Performance Measures Example 3:Machine A has twice the MIPS rate of machine B but requires 50% more instructions. Which is faster on a given task?
58 Computer Performance Measures Example 4:Machine A’s clock rate is 500 MHz, Machine B is 250 MHz. CPI for A is 2, CPI for B is 1.2. Which is faster on a common program (meaning the same instruction set)?