Presentation on theme: "Computer Architecture and Organization" - Presentation transcript:
1 Computer Architecture and Organization: Computer Evolution and Performance
2 ENIAC - background: Electronic Numerical Integrator And Computer. John Presper Eckert and John Mauchly, University of Pennsylvania. Built to compute trajectory tables for weapons. Started 1943, finished 1946 - too late for the war effort. Used until 1955.
3 ENIAC - details: Decimal (not binary); 20 accumulators of 10 digits; programmed manually by switches; 18,000 vacuum tubes; 30 tons; 15,000 square feet; 140 kW power consumption; 5,000 additions per second.
4 von Neumann/Turing: Stored program concept - main memory storing programs and data; ALU operating on binary data; control unit interpreting instructions from memory and executing them; input and output equipment operated by the control unit. Princeton Institute for Advanced Studies (IAS) machine, completed 1952.
6 IAS - details: 1000 x 40 bit words; each word holds a binary number or 2 x 20 bit instructions. Set of registers (storage in CPU): Memory Buffer Register, Memory Address Register, Instruction Register, Instruction Buffer Register, Program Counter, Accumulator, Multiplier Quotient.
8 Commercial Computers: 1947 - Eckert-Mauchly Computer Corporation. UNIVAC I (Universal Automatic Computer), used for the US Bureau of Census 1950 calculations. Became part of Sperry-Rand Corporation. Late 1950s - UNIVAC II: faster, more memory.
9 IBM: Punched-card processing equipment. 1953 - the 701, IBM's first stored-program computer, for scientific calculations. The 702 - business applications. Led to the 700/7000 series.
10 Transistors: Replaced vacuum tubes - smaller, cheaper, less heat dissipation. Solid-state device made from silicon (sand). Invented 1947 at Bell Labs by William Shockley et al.
11 Transistor-Based Computers: Second generation machines. NCR and RCA produced small transistor machines. IBM 7000 series. DEC produced the PDP-1.
12 Microelectronics: Literally "small electronics". A computer is made up of gates, memory cells and interconnections; these can be manufactured on a semiconductor, e.g. a silicon wafer.
13 Generations of Computer: Vacuum tube; transistor; small scale integration - up to 100 devices on a chip; medium scale integration (to 1971) - 100-3,000 devices on a chip; large scale integration - 3,000-100,000 devices on a chip; very large scale integration - 100,000-100,000,000 devices on a chip; ultra large scale integration - over 100,000,000 devices on a chip.
14 Moore's Law: Increased density of components on chip. Gordon Moore - co-founder of Intel. Number of transistors on a chip will double every year. Since the 1970s development has slowed a little: the number of transistors doubles every 18 months. Cost of a chip has remained almost unchanged. Higher packing density means shorter electrical paths, giving higher performance. Smaller size gives increased flexibility. Reduced power and cooling requirements. Fewer interconnections increase reliability.
16 IBM 360 series: 1964. Replaced (and not compatible with) the 7000 series. First planned "family" of computers: similar or identical instruction sets, similar or identical O/S, increasing speed, increasing number of I/O ports (i.e. more terminals), increased memory size, increased cost. Multiplexed switch structure.
17 DEC PDP-8: 1964. First minicomputer (after the miniskirt!). Did not need an air-conditioned room; small enough to sit on a lab bench. $16,000, versus $100k+ for an IBM 360. Embedded applications and OEM. Bus structure - Omnibus.
19 Semiconductor Memory: 1970, Fairchild. The size of a single core (i.e. 1 bit of magnetic core storage) but holds 256 bits. Non-destructive read. Much faster than core. Capacity approximately doubles each year.
20 Intel: The 4004 - first microprocessor, all CPU components on a single chip, 4 bit. Followed in 1972 by the 8008, 8 bit. Both designed for specific applications. The 8080 - Intel's first general purpose microprocessor.
24 Solutions: Increase the number of bits retrieved at one time - make DRAM "wider" rather than "deeper". Change the DRAM interface - cache. Reduce frequency of memory access - more complex cache and cache on chip. Increase interconnection bandwidth - high speed buses, hierarchy of buses.
25 I/O Devices: Peripherals with intensive I/O demands; large data throughput demands. Processors can handle this - the problem is moving the data. Solutions: caching, buffering, higher-speed interconnection buses, more elaborate bus structures, multiple-processor configurations.
27 Key is Balance: Processor components, main memory, I/O devices, interconnection structures.
28 Improvements in Chip Organization and Architecture: Increase hardware speed of the processor - fundamentally due to shrinking logic gate size; more gates, packed more tightly, increasing clock rate; propagation time for signals reduced. Increase size and speed of caches - dedicating part of the processor chip; cache access times drop significantly. Change processor organization and architecture - increase effective speed of execution; parallelism.
29 Problems with Clock Speed and Logic Density: Power - power density increases with density of logic and clock speed, making heat harder to dissipate. RC delay - the speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting them; delay increases as the RC product increases; wire interconnects get thinner, increasing resistance, and wires get closer together, increasing capacitance. Memory latency - memory speeds lag processor speeds. Solution: more emphasis on organizational and architectural approaches.
31 Increased Cache Capacity: Typically two or three levels of cache between processor and main memory. Chip density has increased - more cache memory on chip, faster cache access. The Pentium chip devoted about 10% of chip area to cache; the Pentium 4 devotes about 50%.
32 More Complex Execution Logic: Enable parallel execution of instructions. A pipeline works like an assembly line - different stages of execution of different instructions proceed at the same time along the pipeline. Superscalar allows multiple pipelines within a single processor - instructions that do not depend on one another can be executed in parallel.
33 Diminishing Returns: The internal organization of processors is already complex and can extract a great deal of parallelism, so further significant increases are likely to be relatively modest. Benefits from cache are reaching a limit. Increasing clock rate runs into the power dissipation problem. Some fundamental physical limits are being reached.
34 New Approach - Multiple Cores: Multiple processors on a single chip, with a large shared cache. Within a processor, the increase in performance is roughly proportional to the square root of the increase in complexity; if software can use multiple processors, doubling the number of processors almost doubles performance. So, use two simpler processors on the chip rather than one more complex processor (see the sketch below). With two processors, larger caches are justified - power consumption of memory logic is less than that of processing logic. Example: IBM POWER4, two cores based on PowerPC.
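A rough illustration of that trade-off: the square-root rule on the slide suggests doubling a single core's complexity buys only about 1.4x, while two simpler cores can approach 2x when the software parallelizes well. Both figures below are idealized assumptions, not measurements of any real chip.

```python
import math

# Illustrative comparison, assuming the slide's square-root rule and
# perfectly parallelizable software (both are idealizations).
base_perf = 1.0
complex_core = base_perf * math.sqrt(2)  # one core with 2x the complexity: ~1.41x
dual_simple = base_perf * 2              # two simple cores, ideal parallel speedup: ~2x
print(f"2x-complex single core: {complex_core:.2f}x")
print(f"two simple cores:       {dual_simple:.2f}x")
```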
36 Pentium Evolution: 8080 - first general purpose microprocessor, 8 bit data path, used in the first personal computer (Altair). 8086 - much more powerful, 16 bit, instruction cache, prefetches a few instructions; the 8088 (8 bit external bus) was used in the first IBM PC. 80286 - 16 Mbyte of memory addressable, up from 1 Mbyte. 80386 - 32 bit, support for multitasking.
37 Pentium Evolution (continued): 80486 - sophisticated, powerful cache and instruction pipelining, built-in maths co-processor. Pentium - superscalar, multiple instructions executed in parallel. Pentium Pro - increased superscalar organization, aggressive register renaming, branch prediction, data flow analysis, speculative execution.
38 Pentium Evolution (continued): Pentium II - MMX technology for graphics, video and audio processing. Pentium III - additional floating point instructions for 3D graphics. Pentium 4 - note Arabic rather than Roman numerals; further floating point and multimedia enhancements. Itanium - 64 bit (see chapter 15). Itanium 2 - hardware enhancements to increase speed. See Intel web pages for detailed information on processors.
39 Pentium Evolution (continued): Core - first x86 with dual core. Core 2 - 64 bit architecture. Core 2 Quad - 3 GHz, 820 million transistors, four processors on chip. The x86 architecture is dominant outside embedded systems; organization and technology have changed dramatically, but the instruction set architecture has evolved with backwards compatibility - roughly 1 instruction per month added, about 500 instructions available. See Intel web pages for detailed information on processors.
40 PowerPC: 1975 - the 801 minicomputer project (IBM), RISC; Berkeley RISC I processor. 1986 - IBM's commercial RISC workstation product, the RT PC; not a commercial success - many rivals with comparable or better performance. 1990 - IBM RISC System/6000, a RISC-like superscalar machine, the POWER architecture. IBM alliance with Motorola (68000 microprocessors) and Apple (which used them in the Macintosh); the result is the PowerPC architecture - derived from the POWER architecture, superscalar RISC, used in the Apple Macintosh and in embedded chip applications.
41 PowerPC Family: 601 - quickly to market, 32-bit machine. 603 - low-end desktop and portable, 32-bit, comparable performance with the 601, lower cost and a more efficient implementation. 604 - desktop and low-end servers, 32-bit machine, much more advanced superscalar design, greater performance. 620 - high-end servers, 64-bit architecture.
42 PowerPC Family (continued): 740/750 - also known as G3, two levels of cache on chip. G4 - increases parallelism and internal speed. G5 - improvements in parallelism and internal speed, 64-bit organization.
43 Embedded Systems Requirements: Different sizes; different constraints, optimization, reuse. Different requirements: safety, reliability, real-time operation, flexibility, legislation; lifespan; environmental conditions; static v dynamic loads; slow to fast speeds; computation v I/O intensive; discrete event v continuous dynamics.
45 ARM Evolution: Designed by ARM Inc., Cambridge, England; licensed to manufacturers. High speed, small die, low power consumption. Used in PDAs, hand-held games, phones - e.g. iPod, iPhone. Acorn produced the ARM1 and ARM2 in 1985 and the ARM3 in 1989; Acorn, VLSI and Apple Computer founded ARM Ltd.
46 ARM Systems Categories: Embedded real time; application platform (Linux, Palm OS, Symbian OS, Windows Mobile); secure applications.
47 Performance Assessment - Clock Speed: Key parameters - performance, cost, size, security, reliability, power consumption. System clock speed, in Hz or multiples thereof; clock rate, clock cycle, clock tick, cycle time. Signals in the CPU take time to settle down to 1 or 0, and signals may change at different speeds, so operations need to be synchronised. Instruction execution proceeds in discrete steps - fetch, decode, load and store, arithmetic or logical - and usually requires multiple clock cycles per instruction. Pipelining gives simultaneous execution of instructions. So, clock speed is not the whole story (see the sketch below).
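To make that last point concrete, here is a minimal sketch using the standard relationship between instruction count Ic, average cycles per instruction (CPI) and clock rate f. The two machine configurations and the function name are invented for illustration.

```python
def exec_time(instr_count, cpi, clock_hz):
    """Program execution time T = Ic * CPI / f."""
    return instr_count * cpi / clock_hz

# Hypothetical machines: a higher clock does not guarantee a faster run
# if the average CPI is worse.
print(exec_time(1e9, 1.0, 2e9))  # 2 GHz, CPI 1.0 -> 0.5 s
print(exec_time(1e9, 2.5, 4e9))  # 4 GHz, CPI 2.5 -> 0.625 s
```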
49 Instruction Execution Rate: Millions of instructions per second (MIPS); millions of floating point operations per second (MFLOPS). Heavily dependent on instruction set, compiler design, processor implementation, and cache and memory hierarchy (a sketch follows).
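A small follow-on to the previous sketch, assuming the same T = Ic * CPI / f relationship: the MIPS rate works out to f / (CPI * 10^6), so it moves with CPI (and hence with the instruction set, compiler and implementation), not with clock rate alone. The numbers are illustrative.

```python
def mips_rate(clock_hz, cpi):
    """MIPS = Ic / (T * 1e6) = f / (CPI * 1e6)."""
    return clock_hz / (cpi * 1e6)

print(mips_rate(2e9, 1.0))  # -> 2000 MIPS
print(mips_rate(4e9, 2.5))  # -> 1600 MIPS, despite the higher clock
```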
50 Benchmarks: Programs designed to test performance. Written in a high level language, so portable; represent a style of task - systems, numerical, commercial; easily measured; widely distributed. E.g. the Standard Performance Evaluation Corporation (SPEC): CPU2006 for computation-bound work - 17 floating point programs in C, C++ and Fortran, 12 integer programs in C and C++, about 3 million lines of code. Speed and rate metrics - single task and throughput.
51 SPEC Speed Metric: Measures the time to complete a single task. A base runtime is defined for each benchmark using a reference machine; results are reported as the ratio of reference time to system run time, r_i = Tref_i / Tsut_i, where Tref_i is the execution time of benchmark i on the reference machine and Tsut_i is the execution time of benchmark i on the system under test. Overall performance is calculated by averaging the ratios for all 12 integer benchmarks, using the geometric mean - appropriate for normalized numbers such as ratios (see the sketch below).
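A minimal sketch of that averaging step. The benchmark times and function name are made up, and real SPEC runs involve more rules (multiple iterations, base vs peak tuning) than shown here.

```python
import math

def spec_speed_metric(ref_times, sut_times):
    """Geometric mean of the per-benchmark ratios Tref_i / Tsut_i."""
    ratios = [t_ref / t_sut for t_ref, t_sut in zip(ref_times, sut_times)]
    return math.prod(ratios) ** (1.0 / len(ratios))

# Three illustrative benchmarks: (reference time, test-system time) in seconds.
print(spec_speed_metric([9000, 6000, 12000], [500, 400, 1000]))  # -> about 14.8
```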
52 SPEC Rate Metric: Measures the throughput or rate of a machine carrying out a number of tasks. Multiple copies of the benchmarks are run simultaneously - typically the same number as there are processors. The ratio is calculated as r_i = N x Tref_i / Tsut_i, where Tref_i is the reference execution time for benchmark i, N is the number of copies run simultaneously, and Tsut_i is the elapsed time from the start of execution of the program on all N processors until completion of all copies of the program. Again, a geometric mean of the ratios is calculated.
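A small continuation of the sketch above, computing the per-benchmark rate ratio from the quantities just defined; the values are again invented for illustration, and the overall metric would be the geometric mean of such ratios as before.

```python
def spec_rate_ratio(n_copies, t_ref, t_elapsed):
    """Per-benchmark SPEC rate ratio: N * Tref_i / Tsut_i."""
    return n_copies * t_ref / t_elapsed

# e.g. 4 copies, 1000 s reference time, all copies finished after 500 s.
print(spec_rate_ratio(4, 1000.0, 500.0))  # -> 8.0
```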
53 Amdahl's Law: Gene Amdahl [AMDA67]. Potential speed-up of a program using multiple processors. Concluded that the code needs to be parallelizable, and that the speed-up is bounded, giving diminishing returns for more processors. Task dependent: servers gain by maintaining multiple connections on multiple processors; databases can be split into parallel tasks.
54 Amdahl's Law Formula: For a program running on a single processor - a fraction f of the code is infinitely parallelizable with no scheduling overhead, and the fraction (1-f) is inherently serial. T is the total execution time of the program on a single processor; N is the number of processors that fully exploit the parallel portions of the code. Speedup = T / (T(1-f) + Tf/N) = 1 / ((1-f) + f/N). Conclusions: when f is small, parallel processors have little effect; as N -> infinity, the speedup is bounded by 1/(1-f); diminishing returns for using more processors (a short sketch follows).
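A small sketch of the formula above - a direct transcription of the idealized law, not a model of any real machine; f = 0.9 is just an example value.

```python
def amdahl_speedup(f, n):
    """Speedup = 1 / ((1 - f) + f / n) for a parallelizable fraction f on n processors."""
    return 1.0 / ((1.0 - f) + f / n)

# With f = 0.9, even infinitely many processors cannot beat 1 / (1 - 0.9) = 10x.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```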
55 Computer Performance Measures - Example 1: A program runs on computer A in 10 seconds. A has a 4 GHz clock rate. Design a computer B that runs the same program in 6 seconds. The constraint is that a faster design is possible but will require 1.2 times as many clock cycles as A. What is B's clock rate? (A worked sketch follows.)
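One way to work this, using only the numbers stated in the example:

```python
cycles_A = 10 * 4e9          # 10 s at 4 GHz -> 40e9 clock cycles on A
cycles_B = 1.2 * cycles_A    # B needs 1.2 times as many cycles -> 48e9
clock_B = cycles_B / 6       # must finish in 6 s
print(clock_B / 1e9, "GHz")  # -> 8.0 GHz
```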
56 Computer Performance Measures - Example 2: Given are two computers with different instruction sets. B's clock rate is 3 times that of A's; a program on B requires twice as many instructions as one on A to do the same task. However, B's CPI is 2, whereas A's CPI is 3. Which machine does the job faster, and by how much? (A worked sketch follows.)
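A worked sketch using T = Ic * CPI / f, with A's instruction count and clock rate normalized to 1 (arbitrary units, as only the ratio matters):

```python
I, f = 1.0, 1.0                  # machine A's instruction count and clock rate
T_A = I * 3 / f                  # A: CPI = 3
T_B = (2 * I) * 2 / (3 * f)      # B: twice the instructions, CPI = 2, 3x the clock
print(T_A / T_B)                 # -> 2.25, so B is 2.25 times faster
```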
57 Computer Performance Measures - Example 3: Machine A has twice the MIPS rate of machine B but requires 50% more instructions. Which is faster on a given task? (A worked sketch follows.)
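A worked sketch using execution time proportional to Ic / MIPS, with B's values normalized to 1 (arbitrary units):

```python
MIPS_B, I_B = 1.0, 1.0                    # machine B, in arbitrary units
MIPS_A, I_A = 2 * MIPS_B, 1.5 * I_B       # A: twice the MIPS, 50% more instructions
T_A = I_A / MIPS_A                        # proportional to execution time
T_B = I_B / MIPS_B
print(T_B / T_A)                          # -> about 1.33, so A is roughly 1.33 times faster
```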
58 Computer Performance Measures - Example 4: Machine A's clock rate is 500 MHz, machine B's is 250 MHz. CPI for A is 2, CPI for B is 1.2. Which is faster on a common program (meaning the same instruction set and instruction count)? (A worked sketch follows.)
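A worked sketch: since the instruction count is the same on both machines, it cancels, and only the time per instruction (CPI / clock rate) matters.

```python
t_A = 2 / 500e6      # seconds per instruction on A -> 4.0 ns
t_B = 1.2 / 250e6    # seconds per instruction on B -> 4.8 ns
print(t_B / t_A)     # -> about 1.2, so A is roughly 1.2 times faster
```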