1
ADVANCED COMPUTER ARCHITECTURE
Iran University of Science and Technology, Computer Faculty. ADVANCED COMPUTER ARCHITECTURE: Parallelism, Scalability, Programmability. Dr Mahmoud Fathy
2
What is Computer Architecture?
3
Forces on Computer Architecture
4
A Take on Moore’s Law
5
A Take on Moore’s Law
Moore’s Law (1965): the number of transistors per square inch doubles every year.
Reality: the number of transistors per square inch has doubled roughly every 18 months.
CPU speed increases about 54% per year.
DRAM capacity increases about 80% per year (quadrupling every 3 years).

Relative Performance | Technology         | Year
1                    | Vacuum Tube        | 1951
35                   | Transistor         | 1965
900                  | Integrated Circuit | 1975
2,400,000            | VLSI               | 1995
6
Processor Performance
7
Clever Architecture Design
8
Processor-Memory Performance Gap
9
Technology Trends vs. Power Dissipation
10
“Hot” Computers: The Importance of Low-Power Processor Design
11
Computer Food Chain
12
Computer Engineering Methodology
13
Measurement and Evaluation
14
Measurement and Evaluation (cont’d)
Three components of computer architecture evaluation:
- Simulators
- Benchmarks
- Evaluation metrics (performance, cost, power)
15
A Computer Architecture Simulator
16
A Taxonomy of Simulator Tools
17
Functional vs. Performance Simulators
18
Execution-Driven vs. Trace-Driven Simulation
19
Computer Performance
History of computer performance measures:
- Execution time of a single instruction (such as addition)
- Instruction mix
- MIPS
- MFLOPS (introduced with supercomputers)
- Real programs (difficult to run across different operating systems)
- Toy programs
- 1988: SPEC (the Systems Performance Evaluation Cooperative) was established by Sun, MIPS, DEC & Apollo
20
SPEC History
- SPEC 89: CPU-intensive (6 floating-point + 4 integer programs)
- SPEC 92 (SPECint, SPECfp): deleted programs such as Matrix300 from SPEC 89
- SPEC 95
- SPEC 2000 (12 integer programs, CINT2000; 14 floating-point, CFP2000)
- SPECviewperf (3D rendering)
- SPECapc: Pro/Engineer, SolidWorks (3D CAD), Graphic V15 (aircraft design)
21
Programs to Evaluate Processor Performance
22
Benchmarks
23
Performance & Measuring
The execution time of a program is the main measure of computer performance.
A machine X is n% faster than machine Y if:
Execution Time(Y) / Execution Time(X) = 1 + n/100
24
Performance & Measuring
Example: Machine X executes a program in 10 seconds and machine Y executes the same program in 15 seconds. Machine X is 50% faster than machine Y, since 15/10 = 1.5 = 1 + 50/100.
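A minimal sketch of this comparison in code (the helper name is ours, not from the slides):

```python
def percent_faster(time_x: float, time_y: float) -> float:
    """n such that machine X (time_x) is n% faster than machine Y (time_y)."""
    return (time_y / time_x - 1.0) * 100.0

print(percent_faster(10, 15))  # 50.0 -> X is 50% faster than Y
```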
25
Performance & Measuring (Amdahl’s Law)
Speedup = Ts / Tp, where:
Ts = the sequential execution time of the program
P = the degree of parallelism
If a fraction f of the execution time can be improved (run in parallel on P processors), then:
Speedup = Ts / ((1 - f)·Ts + (f/P)·Ts) = 1 / ((1 - f) + f/P)
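The same law as a small Python helper, under the definitions above (f is the fraction of execution time that benefits):

```python
def amdahl_speedup(f: float, p: float) -> float:
    """Overall speedup when a fraction f of execution time is sped up p times."""
    return 1.0 / ((1.0 - f) + f / p)

print(amdahl_speedup(0.4, 10))  # ~1.56, the example on the next slide
```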
26
Performance & Measuring (Amdahl’s Law)
Example: Assume the processing power of one part of a system has been increased 10 times, but that part accounts for only 40% of the total execution time. What is the speedup?
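A plausible working, using Amdahl’s Law and assuming the remaining 60% of the time is unaffected: Speedup = 1 / (0.6 + 0.4/10) = 1/0.64 ≈ 1.56.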
27
Performance & Measuring (Amdahl’s Law)
Example: The processing power of a CPU has been increased 5 times, but the cost of the new CPU is also 5 times higher. The CPU accounts for 50% of the program’s execution time, and the CPU is 1/3 of the whole computer’s cost. Is this upgrade reasonable from a cost/performance standpoint?
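One plausible working, assuming the non-CPU half of the time and the non-CPU 2/3 of the cost are unchanged: Speedup = 1 / (0.5 + 0.5/5) = 1/0.6 ≈ 1.67, while the new system cost is 2/3 + (1/3 × 5) = 7/3 ≈ 2.33 times the old. Cost grows faster than performance, so on a pure cost/performance basis the upgrade is hard to justify.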
28
Performance & Measuring
Timing in Linux. The `time` command reports: user CPU time, system CPU time, elapsed (execution) time, and the CPU utilization percentage (CPU time / execution time).
CPU Time = CPU Clock Cycles of the Program / Clock Rate
CPU Time = CPU Clock Cycles × Clock Period
CPU Time = CPI (Clocks Per Instruction) × Number of Instructions × Clock Period
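For concreteness, the formulas above as a small helper (names are illustrative):

```python
def cpu_time(instr_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU Time = Instruction Count x CPI / Clock Rate."""
    return instr_count * cpi / clock_rate_hz

# e.g., 7 million instructions at CPI 1.43 on a 100 MHz clock:
print(cpu_time(7e6, 1.43, 100e6))  # ~0.1 s
```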
29
Performance & Measuring
RISC processor: small CPI, large instruction count, short clock period.
CISC processor: large CPI, small instruction count, long clock period.
MIPS = Number of Instructions / (Execution Time × 10^6)
MIPS = Clock Rate / (CPI × 10^6)
30
Performance & Measuring
MIPS is not a good performance metric. For example, the Intel i860 at 50 MHz has 100 MFLOPS and 150 MOPS of processing power, while the R3000 (a MIPS-family processor) has only 16 MFLOPS and 33 MOPS, yet it runs the SPEC programs 15% faster than the i860.
Example (showing that MIPS is not a good metric for performance evaluation): a computer has three instruction types with different CPIs:
Instruction Type | CPI
A                | 1
B                | 2
C                | 3
31
Performance & Measuring
The compiler designer has two choices for translating a high-level-language function, with the following instruction counts per type:
Choice | A | B | C
1      | 2 | 1 | 2
2      | 4 | 1 | 1
What is the CPI of each choice?
CPI1 = 10 cycles / 5 instructions = 2
CPI2 = 9 cycles / 6 instructions = 1.5
Now the compiler designer has two choices for translating a whole program (instruction counts in millions, on a 100 MHz machine):
Choice | A  | B | C
1      | 5  | 1 | 1
2      | 10 | 1 | 1
What is the MIPS rate and execution time of each sequence?
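A short sketch that reproduces these numbers (the instruction counts and the 100 MHz clock are the values inferred above, not explicit in the transcribed slide):

```python
# CPI table from the slide; counts in millions of instructions.
cpi = {"A": 1, "B": 2, "C": 3}
CLOCK_HZ = 100e6  # assumed 100 MHz, consistent with the results on slide 47

def evaluate(counts_millions):
    """Return (average CPI, MIPS rating, execution time in seconds)."""
    instructions = sum(counts_millions.values()) * 1e6
    cycles = sum(n * 1e6 * cpi[t] for t, n in counts_millions.items())
    avg_cpi = cycles / instructions
    return avg_cpi, CLOCK_HZ / (avg_cpi * 1e6), cycles / CLOCK_HZ

print(evaluate({"A": 5, "B": 1, "C": 1}))   # CPI 1.43, ~70 MIPS, 0.10 s
print(evaluate({"A": 10, "B": 1, "C": 1}))  # CPI 1.25,  80 MIPS, 0.15 s
```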
32
Performance & Measuring
Example: to show how adding a new instruction to a computer affects system performance. The instruction mix of the computer is as follows:
Operation | Frequency | CPI
ALU       | 43%       | 1
Load      | 21%       | 2
Store     | 12%       | 2
Branch    | 24%       | 2
Assume that 25% of ALU operations use a loaded operand just once, i.e., the operand is not used by any later instruction. We now add a new register/memory instruction type, an ADD that needs two cycles to execute; this change causes branch instructions to execute in 3 cycles. Is the new machine faster than the old one?
33
Performance & Measuring
Old CPI = 0.43×1 + 0.21×2 + 0.12×2 + 0.24×2 = 1.57
Old CPU time = 1.57 × Old_Instruction_Count × Clock_Period
New instruction count = (1 − 0.25×0.43) × Old_Instruction_Count = 0.893 × Old_Instruction_Count
New CPI = 1.908
New CPU time = (0.893 × Old_Instruction_Count) × 1.908 × Clock_Period = 1.70 × Old_Instruction_Count × Clock_Period
Since 1.70 > 1.57, the old machine is faster.
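The same computation as a sketch, using the mix frequencies restored on the previous slide:

```python
# Old vs. new machine for the instruction-mix example: {type: (frequency, CPI)}.
mix = {"ALU": (0.43, 1), "Load": (0.21, 2), "Store": (0.12, 2), "Branch": (0.24, 2)}

old_cpi = sum(f * c for f, c in mix.values())                # 1.57 cycles per old instr.
replaced = 0.25 * mix["ALU"][0]                              # fused ALU+Load pairs: 0.1075
new_count = 1.0 - replaced                                   # 0.893 x old instruction count
new_cycles = (old_cpi
              - replaced * (mix["ALU"][1] + mix["Load"][1])  # drop the ALU and Load cycles
              + replaced * 2                                 # add reg/mem ADD at 2 cycles
              + mix["Branch"][0] * 1)                        # branches go from 2 to 3 cycles
print(old_cpi, new_cycles, new_cycles / new_count)           # 1.57, ~1.70, new CPI ~1.908
```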
34
Performance & Measuring
An analogy: comparing raw speed (Boeing 747 vs. Concorde, in mph) does not by itself say which aircraft is “better”; the same holds for MIPS ratings of computers.
Relative MIPS = (execution time of reference machine / execution time of machine X) × MIPS of the reference machine
Weighted megaflops, with operations weighted by cost: ADD, SUB, MUL = 1; DIV, SQRT = 4; EXP, SIN = 8.
Dhrystone benchmark results (in KDhrystones/s) were quoted for the VAX 11/780, Intel i860, Sun, Cray X-MP, and VAX (values not transcribed).
35
Performance & Measuring
Whetstone benchmark: a FORTRAN floating-point program, measured in KWhetstones/s (the quoted DEC 11-series figure was not transcribed).
TP1, a database benchmark, is measured in TPS: e.g., Sequent at 140 TPS (the VAX figure was not transcribed).
Benchmarks for intelligent (AI) computers are measured in KLIPS (kilo logic inferences per second); at roughly 100 instructions per logic inference, 400 KLIPS ≈ 40 MIPS.
37
Performance & Measuring
Example: two programs A & B are run on two machines X & Y. Which machine is faster?
Program         | Time on X | Time on Y | Time(X)/Time(Y)
A               | 1         | 10        | 0.1
B               | 1000      | 100       | 10
Arithmetic mean | 500.5     | 55        | 5.05
Geometric mean  | 31.6      | 31.6      | 1
The arithmetic mean of normalized execution times depends on which machine is used as the reference, while the geometric mean gives a consistent answer.
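A quick check of the two means in Python (values from the table):

```python
from statistics import geometric_mean

time_x = {"A": 1.0, "B": 1000.0}
time_y = {"A": 10.0, "B": 100.0}

ratios = [time_x[p] / time_y[p] for p in time_x]  # X's times normalized to Y
print(sum(ratios) / len(ratios))                  # 5.05 -- depends on the reference
print(geometric_mean(ratios))                     # 1.0  -- reference-independent
```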
38
Performance & Measuring
Example: execution times (in seconds) of two programs on machines A, B, and C:
Program   | A    | B   | C
Program 1 | 1    | 10  | 20
Program 2 | 1000 | 100 | 20
Total     | 1001 | 110 | 40
For N programs, the weighted execution time is Σ wᵢ × Tᵢ; a weight vector that normalizes (equalizes) the execution times on machine A or B changes which machine appears fastest.
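A sketch of how the choice of weights changes the ranking (the weight vectors here are illustrative):

```python
# Weighted execution time: sum(w_i * T_i) over the N programs.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

def weighted_time(machine, weights):
    return sum(w * t for w, t in zip(weights, times[machine]))

# Equal weights favor C; weighting program 1 heavily favors A.
for w in ([0.5, 0.5], [0.999, 0.001]):
    print(w, {m: round(weighted_time(m, w), 2) for m in times})
```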
39-46
Performance & Measuring
(Figure and worked-example slides; content not transcribed.)
47
Performance & Measuring
MIPS = Clock Rate / (CPI × 10^6)
CPI1 = 1.43, CPI2 = 1.25
MIPS1 ≈ 70, MIPS2 = 80, so MIPS2 > MIPS1
CPU Time = Number of Instructions × CPI / Clock Rate
CPU Time1 = 0.1 s, CPU Time2 = 0.15 s
The program from compiler 1 is faster than the one from compiler 2, yet it has the lower MIPS rating. So MIPS is not a good metric for performance evaluation.
48
Parallel Computer Models
Why Parallel Processing?
Fig.: The exponential growth of microprocessor performance, known as Moore’s Law, shown over the past two decades (extrapolated).
49
Parallel Computer Models
The Semiconductor Technology Roadmap (from the 2001 edition of the roadmap [Alla02]):
Calendar year     | 2001 | 2004 | 2007 | 2010 | 2013 | 2016
Half-pitch (nm)   | 140  | 90   | 65   | 45   | 32   | 22
Clock freq. (GHz) | 2    | 4    | 7    | 12   | 20   | 30
Wiring levels     | 8    | 9    | 10   |      |      |
Power supply (V)  | 1.1  | 1.0  | 0.8  | 0.7  | 0.6  | 0.5
Max. power (W)    | 130  | 160  | 190  | 220  | 250  | 290
Factors contributing to the validity of Moore’s law: denser circuits; architectural improvements.
Measures of processor performance:
- Instructions per second (MIPS, GIPS, TIPS, PIPS)
- Floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS)
- Running time on benchmark suites
50
Parallel Computer Models
Why High-Performance Computing?
- Higher speed (solve problems faster): important when there are “hard” or “soft” deadlines; e.g., 24-hour weather forecast
- Higher throughput (solve more problems): important when there are many similar tasks to perform; e.g., transaction processing
- Higher computational power (solve larger problems): e.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy
Categories of supercomputers:
- Uniprocessor; a.k.a. vector machine
- Multiprocessor; centralized or distributed shared memory
- Multicomputer; communicating via message passing
- Massively parallel processor (MPP; 1K or more processors)
51
Parallel Computer Models
The Speed-of-Light Argument
The speed of light is about 30 cm/ns, and signals travel at a fraction of it (say, 1/3), i.e., about 10 cm/ns. If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance is limited to 10 GIPS. This limitation is eased by continued miniaturization and architectural methods such as cache memory; however, a fundamental limit does exist. How does parallel processing help? Wouldn’t multiple processors need to communicate via signals as well?
52
Parallel Computer Models
The Quest for Higher Performance: the top three supercomputers in 2005 (IEEE Spectrum, Feb. 2005).
1. IBM Blue Gene/L (LLNL, California): material science, nuclear stockpile simulation. 32,768 processors, 8 TB memory, 28 TB disk storage. Linux + custom OS. 71 TFLOPS, $100 M. Dual-processor PowerPC chips (10-15 W each); full system: 130k processors, 360 TFLOPS (est.).
2. SGI Columbia (NASA Ames, California): aerospace/space simulation, climate research. 10,240 processors, 20 TB memory, 440 TB disk storage. Linux. 52 TFLOPS, $50 M. 20 Altix systems (512 Itanium 2 each) linked by InfiniBand.
3. NEC Earth Simulator (Earth Simulator Center, Yokohama): atmospheric, oceanic, and earth sciences. 5,120 processors, 10 TB memory, 700 TB disk storage. Unix. 36 TFLOPS*, $400 M? Built of custom vector microprocessors; volume 50x and power 14x that of the IBM machine.
53
Parallel Computer Models
Supercomputer Performance Growth
The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).
54
One Reason for Sublinear Speedup: Communication Overhead
Trade-off between communication time and computation time in the data-parallel realization.
55
Another Reason for Sublinear Speedup: Input/Output Overhead
Effect of a constant I/O time on the data-parallel realization.
56
Trends in High-Technology Development
Development of some technical fields into $1B businesses and the roles played by government research and industrial R&D over time (IEEE Computer, early 90s?).
57
Trends in Hi-Tech Development (2003)
58
Status of Computing Power (circa 2000)
GFLOPS on desktop: Apple Macintosh with G4 processor.
TFLOPS in supercomputer center: 1152-processor IBM RS/6000 SP (switch-based network); Cray T3E (torus-connected).
PFLOPS on drawing board: 1M-processor IBM Blue Gene (2005?).
- 32 processors/chip, 64 chips/board, 8 boards/tower, 64 towers
- Processor: 8 threads, on-chip memory, no data cache
- Chip: defect-tolerant, row/column rings in a 6 x 6 array
- Board: 8 x 8 chip grid organized as a 4 x 4 x 4 cube
- Tower: boards linked to 4 neighbors in adjacent towers
- System: 32 x 32 x 32 cube of chips, 1.5 MW (water-cooled)
59
Parallel Computer Models
Parallel Processing on Single-Processor Computers
1. Use of multiple functional units
2. Parallelism and pipelining inside a CPU
3. Overlapping I/O and CPU operations
4. Balancing subsystem bandwidths:
   4-1. CPU bandwidth (high)
   4-2. Memory bandwidth (lower)
   4-3. I/O bandwidth (lowest)
5. Memory hierarchy:
   5-1. Registers
   5-2. Cache memory
   5-3. Main memory
   5-4. Secondary memory
6. Multiprogramming and time sharing
60
Types of Parallelism: A Taxonomy
The Flynn-Johnson classification of computer systems.
61
Parallel Computer Models
Flynn’s classification of computer architectures.
62
Parallel Computer Models
Flynn’s classification of computer architectures (cont’d).
63
Parallel Computer Models
64
SIMD versus MIMD Architectures
Most early parallel machines had SIMD designs:
- Attractive: simple skeleton processors (PEs)
- Eventually, many processors per chip
- High development cost for custom chips
- MSIMD and SPMD variants
Most modern parallel machines have MIMD designs:
- COTS components (CPU chips and switches)
- MPP: massively or moderately parallel?
- Tightly coupled versus loosely coupled
- Explicit message passing versus shared memory
- Network-based NOWs and COWs (networks/clusters of workstations)
- Grid computing; vision: plug into wall outlets for computing power
SIMD timeline (1960-2010): ILLIAC IV, Goodyear MPP, DAP, TMC CM-2, MasPar MP-1, Clearspeed array coprocessor.
65
Global versus Distributed Memory
Fig.: A parallel processor with global memory. Interconnect options: crossbar (complex, expensive), bus(es) (a bottleneck), or MIN (multistage interconnection network).
66
Removing the Processor-to-Memory Bottleneck
A parallel processor with global memory and processor caches. Challenge: cache coherence.
67
Distributed Shared Memory
Some terminology:
- NUMA: nonuniform memory access (distributed shared memory)
- UMA: uniform memory access (global shared memory)
- COMA: cache-only memory architecture
68
Parallel Computer Models
Interconnection networks: fully connected, hypercube, mesh, ring, cube, star, ...
69
Parallel Computer Models
Multiprocessors and Multicomputers
Shared-Memory Multiprocessors: The UMA (Uniform Memory Access) Model
In a UMA multiprocessor, the physical memory is uniformly shared by all the processors: every processor has equal access time to all memory words, which is why it is called uniform memory access. Multiprocessors are called tightly coupled systems due to the high degree of resource sharing.
70
Parallel Computer Models
Symmetric Multiprocessors
When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor; in such a case, all processors are equally capable of running the executive programs. In an asymmetric multiprocessor, only one processor or a subset of processors is executive-capable; the remaining processors, which have no I/O capability, are called attached processors.
71
Parallel Computer Models
The NUMA (Nonuniform Memory Access) Model
A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.
72
Parallel Computer Models
Besides distributed memories, globally shared memory can be added to a multiprocessor system. In this case there are three memory-access patterns: local memory access (the fastest), global memory access, and remote memory access (the slowest), as illustrated in the figure.
73
Parallel Computer Models
The COMA Model
A multiprocessor using cache-only memory follows the COMA model, depicted in the following figure. The COMA model is a special case of a NUMA machine in which the distributed main memories are converted to caches; there is no memory hierarchy at each processor node. Besides the UMA, NUMA, and COMA models specified above, other variations exist for multiprocessors. For example, a cache-coherent nonuniform memory access (CC-NUMA) model can be specified with distributed shared memory and cache directories.
74
Parallel Computer Models
Representative Multiprocessors
Several commercial multiprocessors are summarized in the following table:
75
Parallel Computer Models
Distributed-Memory Multicomputers
A distributed-memory multicomputer system is modeled in the following figure. The system consists of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals. The message-passing network provides point-to-point static connections among the nodes. All local memories are private and accessible only by local processors; for this reason, traditional multicomputers have been called no-remote-memory-access (NORMA) machines.
76
Parallel Computer Models
Representative Multicomputers
Three message-passing multicomputers are summarized in the following table. With distributed processor/memory nodes, these machines are better at achieving scalable performance. However, message passing imposes a burden on programmers, who must distribute the computations and data sets over the nodes and establish communication among them.
77
Parallel Computer Models
A Taxonomy of MIMD Computers
Parallel computers appear in either SIMD or MIMD configurations. SIMDs appeal more to special-purpose applications: it is clear that SIMDs are not size-scalable, and unclear whether large SIMDs are generation-scalable. The fact that the CM-5 has an MIMD architecture, moving away from the SIMD architecture of the CM-2, may shed some light on the architectural trend. Furthermore, the boundary between multiprocessors and multicomputers has become blurred in recent years; eventually, the distinction may vanish.
78
Parallel Computer Models
Multivector and SIMD Computers
Here we introduce supercomputers and parallel processors for vector processing and data parallelism. We classify supercomputers either as pipelined vector machines, using a few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data parallelism.
Vector Supercomputers
A vector computer is often built on top of a scalar processor, as shown in the following figure: the vector processor is attached to the scalar processor as an optional feature. Program and data are first loaded into main memory through a host computer. All instructions are first decoded by the scalar control unit; if the decoded instruction is a scalar operation or a program control operation, it is directly executed by the scalar processor using the scalar functional pipelines.
79
Parallel Computer Models
Representative Supercomputers
Over a dozen pipelined vector computers have been manufactured, ranging from workstations to mini- and supercomputers.
80
Parallel Computer Models
SIMD Supercomputers
Recall the abstract model of SIMD computers: a single instruction stream operating on multiple data streams. An operational model of SIMD computers is presented in the following figure.
81
Parallel Computer Models
SIMD Machine Model
An operational model of an SIMD computer is specified by a 5-tuple M = (N, C, I, M, R), where:
- N is the number of processing elements (PEs) in the machine. For example, the Illiac IV has 64 PEs and the Connection Machine CM-2 uses 65,536 PEs.
- C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
- I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include arithmetic, logic, data routing, masking, and other local operations executed by each active PE over data within that PE.
- M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
- R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.
One can describe a particular SIMD machine architecture by specifying this 5-tuple.
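As an illustration only, the 5-tuple as a Python data structure; all field and instruction names below are hypothetical, with values loosely following the CM-2 description above:

```python
from dataclasses import dataclass

@dataclass
class SIMDMachine:
    """The 5-tuple M = (N, C, I, M, R) describing an SIMD machine."""
    n_pes: int                    # N: number of processing elements
    cu_instructions: frozenset    # C: instructions the control unit executes itself
    pe_instructions: frozenset    # I: instructions broadcast to all PEs
    masking_schemes: frozenset    # M: schemes enabling/disabling PE subsets
    routing_functions: frozenset  # R: inter-PE data-routing patterns

cm2 = SIMDMachine(65536,
                  frozenset({"scalar-op", "branch"}),
                  frozenset({"add", "and", "send"}),
                  frozenset({"context-flag"}),
                  frozenset({"hypercube", "NEWS-grid"}))
```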
82
Parallel Computer Models
Representative SIMD Computers
Three SIMD supercomputers are summarized in the following table. The number of PEs in these systems ranges from 4096 in the DAP610 to 16,384 in the MasPar MP-1 and 65,536 in the CM-2. Both the CM-2 and the DAP610 are fine-grain, bit-slice SIMD computers with attached floating-point accelerators for blocks of PEs.
83
Parallel Computer Models
Architectural Development Tracks
The architectures of most existing computers follow certain development tracks, and understanding the features of each track provides insight for new architectural development. We look into six tracks, to be studied in later chapters; the tracks are distinguished by similarity in computational models and technological bases.
Multiple-Processor Track: generally speaking, a multiple-processor system can be either a shared-memory multiprocessor or a distributed-memory multicomputer.
Message-Passing Track: the Cosmic Cube pioneered the development of message-passing multicomputers.
84
Parallel Computer Models
Shared-Memory Track
The figure shows a track of multiprocessor development employing a single address space in the entire system.
85
Parallel Computer Models
Multivector Track
These are traditional vector supercomputers. The CDC 7600 was the first vector dual-processor system.
86
Parallel Computer Models
SIMD Track
The Illiac IV pioneered the construction of SIMD computers, even though the array processor concept can be traced back to the early 1960s.
87
Parallel Computer Models
Multithreaded and Dataflow Tracks
These are two research tracks that have mainly been experimented with in laboratories.
Multithreading Track: the multithreading idea was pioneered by Burton Smith (1978) in the HEP system, which extended the concept of scoreboarding of multiple functional units in the CDC 6600.
88
Parallel Computer Models
The Dataflow Track
The key idea is to use a dataflow mechanism, instead of the control-flow mechanism of von Neumann machines, to direct the program flow. Fine-grain, instruction-level parallelism is exploited in dataflow computers.