
Slide 1: Network Performance Model

[Figure: sender-to-receiver timeline showing Sender Overhead (processor busy), Time of Flight, Transmission time (size ÷ bandwidth), and Receiver Overhead (processor busy); Transport Latency spans flight plus transmission, Total Latency spans the whole exchange.]

Total Latency = per-access cost + Size × per-byte cost
  per access = Sender Overhead + Receiver Overhead + Time of Flight
             = (5 to 200 µsec) + (5 to 200 µsec) + 0.1 µsec
  Size × per byte = Size ÷ Bandwidth (here, Size ÷ 100 MByte/s)
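The model is linear in message size. A minimal sketch in Python; the defaults are the slide's example figures (overheads in the 5-200 µs range, a 100 MByte/s link), not measurements:

```python
def total_latency_us(size_bytes,
                     sender_overhead_us=50.0,       # slide gives 5 to 200 us
                     receiver_overhead_us=50.0,     # slide gives 5 to 200 us
                     time_of_flight_us=0.1,
                     bandwidth_bytes_per_s=100e6):  # 100 MByte/s link
    """Total latency = per-access cost + size x per-byte cost."""
    per_access = sender_overhead_us + receiver_overhead_us + time_of_flight_us
    transmission_us = size_bytes / bandwidth_bytes_per_s * 1e6
    return per_access + transmission_us

# Small messages are overhead-dominated: a 1 KB send pays ~100 us of
# overhead but only ~10 us of transmission time.
print(total_latency_us(1024))  # ~110.3 us
```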

Slide 2: Network History/Limits
- TCP/UDP/IP protocols for WAN/LAN in 1980s
- Lightweight protocols for LAN in 1990s
- Limit is standards and efficient SW protocols
  - 10 Mbit Ethernet in 1978 (shared)
  - 100 Mbit Ethernet in 1995 (shared, switched)
  - 1000 Mbit Ethernet in 1998 (switched)
  - FDDI; ATM Forum for scalable LAN (still meeting)
- Internal I/O bus limits delivered BW
  - 32-bit, 33 MHz PCI bus = 1 Gbit/sec
  - future: 64-bit, 66 MHz PCI bus = 4 Gbit/sec
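The PCI figures follow from width × clock rate. A quick check, assuming the simple one-transfer-per-clock model the slide implies:

```python
def bus_gbit_per_s(width_bits, clock_mhz):
    # Peak bandwidth = bus width x clock rate, one transfer per clock.
    return width_bits * clock_mhz * 1e6 / 1e9

print(bus_gbit_per_s(32, 33))  # ~1.06, the "1 Gbit/sec" PCI bus
print(bus_gbit_per_s(64, 66))  # ~4.22, the future "4 Gbit/sec" PCI bus
```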

Slide 3: Network Summary
- Fast serial lines and switches offer high bandwidth and low latency over reasonable distances
- Protocol software development and standards-committee bandwidth limit the rate of innovation
  - Ethernet forever?
- The internal I/O bus interface to the network is the bottleneck for delivered bandwidth and latency

Slide 4: Memory History/Trends/State of the Art
- DRAM: main memory of all computers
  - Commodity chip industry: no company has >20% share
  - Packaged in SIMMs or DIMMs (e.g., 16 DRAMs/SIMM)
- State of the art: $152 for a 128 MB DIMM (16 64-Mbit DRAMs), 10 ns × 64 b (800 MB/sec)
- Capacity: 4X/3 yrs (60%/yr.)
  - Moore's Law
- MB/$: +25%/yr.
- Latency: -7%/yr.; Bandwidth: +20%/yr. (so far)

source: www.pricewatch.com, 5/21/98
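These rates compound. A small illustration of how the slide's figures play out (the DIMM bandwidth line also checks out: 64 bits every 10 ns is 800 MB/s):

```python
print(8 / 10e-9 / 1e6)  # 64 bits per 10 ns = 800.0 MB/s, as on the slide

def compound(rate_per_yr, years):
    """Growth factor after `years` at a constant annual rate."""
    return (1 + rate_per_yr) ** years

print(compound(0.60, 3))    # ~4.1x: the "4X/3 yrs" capacity growth
print(compound(-0.07, 10))  # ~0.48x: latency barely halves in a decade
```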

Slide 5: Memory Innovations/Limits
- High-bandwidth interfaces and packages
  - RAMBUS DRAM: 800-1600 MByte/sec per chip
- Latency limited by memory controller, bus, multiple chips, driving pins
- More application bandwidth => more cache misses

Memory latency = per-access cost + block size × per-byte cost
               = per access + block size ÷ (DRAM BW × width)
               = 150 ns + 30 ns

  - Called Amdahl's Law: law of diminishing returns

[Figure: processor and cache connected over a bus to multiple DRAM chips.]
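This has the same per-access-plus-per-byte shape as the network model. A sketch reproducing the slide's 150 ns + 30 ns split; the block size and delivered bandwidth below are assumptions chosen to make the numbers work out (0.8 bytes/ns matches the low end of the RAMBUS figure above):

```python
def memory_latency_ns(block_bytes,
                      per_access_ns=150.0,   # slide's per-access cost
                      bw_bytes_per_ns=0.8):  # assumed: ~800 MB/s path
    # Memory latency = per-access cost + block size / delivered bandwidth.
    return per_access_ns + block_bytes / bw_bytes_per_ns

# An assumed 24-byte transfer reproduces the slide's 150 ns + 30 ns:
print(memory_latency_ns(24))  # 180.0 ns
```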

Slide 6: Memory Summary
- DRAM: rapid improvements in capacity, MB/$, and bandwidth; slow improvement in latency
- The processor-memory interface (cache + memory bus) is the bottleneck for delivered bandwidth
  - As with networks, the memory "protocol" is a major overhead

Slide 7: Processor Trends/History
- Microprocessor: main CPU of "all" computers
  - <1986: +35%/yr. performance increase (2X/2.3 yrs)
  - >=1987 (RISC): +60%/yr. performance increase (2X/1.5 yrs)
- Cost fixed at ~$500/chip; power whatever can be cooled
- History of innovations to sustain 2X/1.5 yrs (do they work on TPC?)
  - Multilevel caches (helps clocks/instruction)
  - Pipelining (helps seconds/clock, i.e., clock rate)
  - Out-of-order execution (helps clocks/instruction)
  - Superscalar (helps clocks/instruction)

CPU time = Seconds/Program = (Instructions/Program) × (Clocks/Instruction) × (Seconds/Clock)
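The equation factors directly into code. A minimal sketch showing why the factoring matters: a clock-rate gain can be cancelled by a CPI loss (the counts below are illustrative):

```python
def cpu_time_s(instructions, cpi, clock_hz):
    # CPU time = Instructions/Program x Clocks/Instruction x Seconds/Clock
    return instructions * cpi / clock_hz

# Same 10^9-instruction program: doubling the clock buys nothing if
# the clocks-per-instruction double too.
print(cpu_time_s(1e9, cpi=1.0, clock_hz=300e6))  # ~3.33 s
print(cpu_time_s(1e9, cpi=2.0, clock_hz=600e6))  # ~3.33 s
```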

Slide 8: State of the Art: Alpha 21264
- 15M transistors
- Two 64 KB caches on chip; 16 MB L2 cache off chip
- Clock 600 MHz (fastest Cray supercomputer: T90 at 2.2 nsec)
- 90 watts
- Superscalar: fetches up to 6 instructions/clock cycle, retires up to 4 instructions/clock cycle
- Out-of-order execution

Slide 9: Processor Limit: DRAM Gap
- Alpha 21264 full cache miss, measured in instructions executed: 180 ns ÷ 1.7 ns = 108 clks, × 4 = 432 instructions
- Caches in Pentium Pro: 64% of area, 88% of transistors
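Restating that arithmetic with the 600 MHz clock and 4-wide retire from the previous slide:

```python
clock_ns = 1 / 0.6   # 600 MHz -> ~1.67 ns per clock
miss_ns = 180.0      # full cache miss, from the slide
stall_clocks = round(miss_ns / clock_ns)
print(stall_clocks)      # 108 clocks stalled per miss
print(stall_clocks * 4)  # 432 instruction slots lost at 4 per clock
```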

Slide 10: Processor Limits for TPC-C

Pentium Pro                                         SPECint95   TPC-C
- Multilevel caches: miss rate, 1 MB L2 cache          0.5%       5%
- Superscalar (2-3 instr. retired/clock): % clks       40%        10%
- Out-of-order execution: speedup                      2.0X       1.4X
- Clocks per instruction (CPI)                         0.8        3.4
- % of peak performance                                40%        10%

sources:
Kim Keeton, Dave Patterson, Y. Q. He, R. C. Raphael, and Walter Baker, "Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads," Proc. 25th Int'l Symp. on Computer Architecture, June 1998 (www.cs.berkeley.edu/~kkeeton/Papers/papers.html).
D. Bhandarkar and J. Ding, "Performance Characterization of the Pentium Pro Processor," Proc. 3rd Int'l Symp. on High-Performance Computer Architecture, Feb. 1997, pp. 288-297.
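One way to read the last two rows (an interpretation, not stated on the slide): if peak is roughly 3 instructions retired per clock, the ideal CPI is about 1/3, and % of peak falls out as ideal CPI over measured CPI:

```python
ideal_cpi = 1 / 3  # assumes ~3 instructions retired per clock at peak
for workload, cpi in [("SPECint95", 0.8), ("TPC-C", 3.4)]:
    print(workload, f"{ideal_cpi / cpi:.0%}")  # ~42% and ~10% of peak
```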

Slide 11: Processor Innovations/Limits
- Low-cost, low-power embedded processors
  - Lots of competition, innovation
  - Integer perf. of an embedded proc. ~1/2 that of a desktop processor
  - StrongARM 110: 233 MHz, 268 MIPS, 0.36 W typ., $49
- Very Long Instruction Word (Intel/HP IA-64 "Merced")
  - Multiple ops/instruction; compiler controls parallelism
- Consolidation of desktop industry? Innovation?
  [Figure: desktop instruction sets (PowerPC, PA-RISC, MIPS, Alpha, SPARC, x86) funneling toward IA-64.]

Slide 12: Processor Summary
- SPEC performance doubling every 18 months
  - Growing CPU-DRAM performance gap & tax
  - Running out of ideas, competition? Back to 2X/2.3 yrs?
- Processor tricks not as useful for transactions?
  - Clock rate increase compensated by CPI increase?
  - When >100 MIPS on TPC-C?
- Cost fixed at ~$500/chip; power whatever can be cooled
- Embedded processors promising
  - 1/10 cost, 1/100 power, 1/2 integer performance?

Slide 13: Systems: History, Trends, Innovations
- Cost/performance leaders come from the PC industry
- Transaction processing and file service based on Symmetric Multiprocessor (SMP) servers
  - 4-64 processors
  - Shared-memory addressing
- Decision support based on SMPs and clusters (shared nothing)
- Clusters of low-cost, small SMPs getting popular

Slide 14: State of the Art System: PC
- $1140 OEM
- One 266 MHz Pentium II
- 64 MB DRAM
- Two UltraDMA EIDE disks, 3.1 GB each
- 100 Mbit Ethernet interface
- (PennySort winner)

source: www.research.microsoft.com/research/barc/SortBenchmark/PennySort.ps

Slide 15: State of the Art SMP: Sun E10000

[Figure: 16 boards, each with processors, memory, and an Xbar bridge, joined by a data crossbar switch and 4 address buses; bus bridges fan out to SCSI disk strings.]

- TPC-D, Oracle 8, 3/98
  - SMP: 64 × 336 MHz CPUs, 64 GB DRAM, 668 disks (5.5 TB)
  - Disks, shelves       $2,128k
  - Boards, enclosures   $1,187k
  - CPUs                   $912k
  - DRAM                   $768k
  - Power                   $96k
  - Cables, I/O             $69k
  - HW total             $5,161k

source: www.tpc.org
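Tallying that bill of materials makes the recurring theme explicit: disks and packaging, not CPUs, dominate hardware cost. A quick check with the slide's $K figures:

```python
e10000_k = {"disks/shelves": 2128, "boards/encl.": 1187, "CPUs": 912,
            "DRAM": 768, "power": 96, "cables/I/O": 69}  # $K, from the slide
total = sum(e10000_k.values())
print(total)  # 5160, vs. the slide's $5,161k (rounding)
print(f"disks/shelves: {e10000_k['disks/shelves'] / total:.0%} of HW cost")  # ~41%
```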

Slide 16: State of the Art Cluster: NCR WorldMark

[Figure: 32 nodes on a BYNET switched network; each node pairs 4 processors and memory with PCI bus bridges fanning out to SCSI disk strings.]

- TPC-D, TD V2, 10/97
  - 32 nodes × (4 × 200 MHz CPUs, 1 GB DRAM, 41 disks) = 128 CPUs, 32 GB DRAM, 1,312 disks (5.4 TB)
  - CPUs, DRAM, enclosures, boards, power   $5,360k
  - Disks + controllers                     $2,164k
  - Disk shelves                              $674k
  - Cables                                    $126k
  - Console                                    $16k
  - HW total                                $8,340k

source: www.tpc.org

Slide 17: State of the Art Cluster: Tandem/Compaq SMP
- ServerNet switched network
- Rack-mounted equipment
- SMP node: 4 Pentium Pros, 3 GB DRAM, 3 disks (6 nodes/rack)
- 10 disk shelves/rack @ 7 disks/shelf
- Total: 6 SMPs (24 CPUs, 18 GB DRAM), 402 disks (2.7 TB)
- TPC-C, Oracle 8, 4/98
  - CPUs                  $191k
  - DRAM                  $122k
  - Disks + controllers   $425k
  - Disk shelves           $94k
  - Networking             $76k
  - Racks                  $15k
  - HW total              $926k

