Hardware Technology Trends and Database Opportunities
David A. Patterson and Kimberly K. Keeton


1 Hardware Technology Trends and Database Opportunities
David A. Patterson and Kimberly K. Keeton
EECS, University of California, Berkeley, CA

2 Outline
- Review of five technologies: disk, network, memory, processor, systems
  - Description / history / performance model
  - State of the art / trends / limits / innovation
  - Following precedent: 2 digressions
- Common themes across technologies
  - Performance: per access (latency) + per byte (bandwidth)
  - Fast: capacity, BW, cost; slow: latency, interfaces
  - Moore's Law affecting all chips in the system
- Technologies leading to a database opportunity?
  - Hardware & software alternative to today
  - Back-of-the-envelope comparison: scan, sort, hash-join

3 Disk Description / History
- 1973: 1.7 Mbit/sq. in, 140 MBytes
- 1979: 7.7 Mbit/sq. in, 2,300 MBytes
- source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces"
[Diagram: disk anatomy showing sector, track, cylinder, head, platter, arm, embedded processor (ECC, SCSI), track buffer]

4 Disk History
- 1989: 63 Mbit/sq. in, 60,000 MBytes
- 1997: 1450 Mbit/sq. in, 2300 MBytes
- 1997: 3090 Mbit/sq. in, 8100 MBytes
- source: New York Times, 2/23/98, page C3

5 Performance Model / Trends
- Capacity: +60%/year (2X / 1.5 yrs)
- Transfer rate (BW): +40%/year (2X / 2.0 yrs)
- Rotation + seek time: -8%/year (1/2 in 10 yrs)
- MB/$: > 60%/year (2X / < 1.5 yrs); fewer chips + areal density
- Latency = Queuing Time + Controller Time + Seek Time + Rotation Time + Size / Bandwidth (per-access terms plus a per-byte term)
- source: Ed Grochowski, 1996, "IBM leadership in disk drive technology"

6 Chips / 3.5 inch Disk: 1993 vs. 12 chips; 2 chips (head, misc) in 200x?

7 State of the Art: Seagate Cheetah 18
- 6962 cylinders, 12 platters
- 18.2 GB, 3.5 inch disk
- 1 MB track buffer (+ 4 MB optional expansion)
- 19 watts
- 0.15 ms controller time
- avg. seek = 6 ms (seek 1 track = 1 ms)
- 1/2 rotation = 3 ms
- 21 to 15 MB/s media rate (=> 16 to 11 MB/s delivered; ~75% after ECC, gaps, ...)
- $1647, or 11 MB/$ (9¢/MB)
- Latency = Queuing Time + Controller Time + Seek Time + Rotation Time + Size / Bandwidth
- source: 5/21/98
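As a rough illustration (not from the original talk), the latency model above can be evaluated with the Cheetah's published figures; the 8 KB and 1 MB request sizes below are arbitrary assumptions:

    # Back-of-the-envelope disk latency, using the model on slides 5 and 7:
    #   latency = queuing + controller + seek + rotation + size / bandwidth
    # Cheetah-18 parameters are taken from the slide; request sizes are assumptions.
    def disk_latency_ms(size_bytes, controller_ms=0.15, avg_seek_ms=6.0,
                        half_rotation_ms=3.0, bandwidth_mb_s=15.0, queuing_ms=0.0):
        transfer_ms = size_bytes / (bandwidth_mb_s * 1e6) * 1e3
        return queuing_ms + controller_ms + avg_seek_ms + half_rotation_ms + transfer_ms

    # An 8 KB random read is dominated by the per-access terms (~9.7 ms);
    # a 1 MB transfer shifts the balance toward the per-byte term (~79 ms).
    print("8 KB random read: %.2f ms" % disk_latency_ms(8 * 1024))
    print("1 MB read:        %.2f ms" % disk_latency_ms(2 ** 20))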

8 Disk Limit: I/O Buses
[Diagram: CPU and memory on the memory bus, bridged through an internal I/O bus (PCI) and an external I/O bus (SCSI) to controllers and up to 15 disks]
- Multiple copies of data, SW layers
- Bus rate vs. disk rate
  - SCSI: Ultra2 (40 MHz), Wide (16 bit): 80 MByte/s
  - FC-AL: 1 Gbit/s = 125 MByte/s (single disk in 2002)
- Cannot use 100% of bus
  - Queuing theory (< 70%)
  - Command overhead (effective size = size x 1.2)
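A small sketch (my own) of how little of a bus's peak rate survives the two effects just listed, using the < 70% utilization and 1.2x command-overhead factors from the slide:

    # Effective bus bandwidth after queuing-theory utilization (<70%) and
    # command overhead (each request effectively 1.2x its size), per slide 8.
    def effective_bus_mb_s(peak_mb_s, utilization=0.70, overhead_factor=1.2):
        return peak_mb_s * utilization / overhead_factor

    print(effective_bus_mb_s(80.0))    # Ultra2 Wide SCSI: ~47 MB/s delivered
    print(effective_bus_mb_s(125.0))   # FC-AL:            ~73 MB/s delivered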

9 Disk Challenges / Innovations
- Cost, SCSI vs. EIDE:
  - $275: IBM 4.3 GB, UltraWide SCSI (40 MB/s): 16 MB/$
  - $176: IBM 4.3 GB, DMA/EIDE (17 MB/s): 24 MB/$
  - Competition, interface cost, manufacturing learning curve?
- Rising disk intelligence
  - SCSI-3, SSA, FC-AL, SMART
  - Moore's Law for embedded processors, too

10 Disk Limit
- Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
- Slow improvement in seek, rotation (8%/yr)
- Time to read whole disk:
  Year  | Sequentially | Randomly
  ...   | ... minutes  | 6 hours
  ...   | ... minutes  | 1 week
- Dynamically change data layout to reduce seek, rotation delay? Leverage space vs. spindles?
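To make the sequential-vs.-random gap concrete, here is a rough calculation (mine, not the authors'), reusing the Cheetah-class figures from slide 7 and assuming 8 KB random accesses:

    # Whole-disk read time, sequential vs. random.
    # Disk size, bandwidth, and per-access times are the slide-7 Cheetah figures;
    # the 8 KB random-access granularity is an assumption.
    disk_bytes   = 18.2e9
    media_mb_s   = 15.0                          # delivered sequential bandwidth (low end)
    per_access_s = (0.15 + 6.0 + 3.0) / 1e3      # controller + avg seek + half rotation
    random_io    = 8 * 1024

    seq_hours  = disk_bytes / (media_mb_s * 1e6) / 3600
    rand_hours = (disk_bytes / random_io) * (per_access_s + random_io / (media_mb_s * 1e6)) / 3600
    # Roughly 0.3 hours (about 20 minutes) sequentially vs. about 6 hours randomly.
    print("sequential: %.1f hours, random: %.1f hours" % (seq_hours, rand_hours))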

11 Disk Summary
- Continued advance in capacity, cost/bit, BW; slow improvement in seek, rotation
- External I/O bus a bottleneck to transfer rate and cost? => move to fast serial lines (FC-AL)?
- What to do with the increasing speed of the embedded processor inside the disk?

12 Network Description / Innovations
- Shared media vs. switched: switching lets many pairs communicate at the same time
- Aggregate BW in a switched network is many times that of shared media
  - Point-to-point is faster, but only a single destination; simpler interface
  - Serial line: 1 - 5 Gbit/sec
- Moore's Law for switches, too
  - 1 chip: 32 x 32 switch, 1.5 Gbit/sec links => 48 Gbit/sec aggregate bandwidth (AMCC S2025)

13 Network Performance Model
[Diagram: sender overhead, time of flight, transmission time (size ÷ bandwidth), receiver overhead; transport latency spans flight plus transmission; the processor is busy during sender and receiver overhead]
- Total Latency = per access + Size x per byte
  - per access = Sender Overhead + Receiver Overhead + Time of Flight (5 to 200 µsec per side, plus flight time)
  - per byte = Size ÷ 100 MByte/s
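A small sketch (mine) of the per-access + per-byte model; the 50 µs per-side overheads and 1 µs time of flight are arbitrary values within the ranges quoted above:

    # Network latency model from slide 13:
    #   total = sender_overhead + time_of_flight + size/bandwidth + receiver_overhead
    # The 50 us overheads and 1 us flight time are assumptions for illustration;
    # the slide quotes 5-200 us per side and ~100 MByte/s of link bandwidth.
    def net_latency_us(size_bytes, sender_us=50.0, receiver_us=50.0,
                       flight_us=1.0, bandwidth_mb_s=100.0):
        transmit_us = size_bytes / (bandwidth_mb_s * 1e6) * 1e6
        return sender_us + flight_us + transmit_us + receiver_us

    print(net_latency_us(100))      # small message: overhead-dominated (~102 us)
    print(net_latency_us(1000000))  # 1 MB message: the per-byte term dominates (~10.1 ms)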

14 Network History / Limits
- TCP/UDP/IP protocols for WAN/LAN in the 1980s
- Lightweight protocols for LAN in the 1990s
- Limit is standards and efficient SW protocols
  - 10 Mbit Ethernet in 1978 (shared)
  - 100 Mbit Ethernet in 1995 (shared, switched)
  - 1000 Mbit Ethernet in 1998 (switched)
  - FDDI; ATM Forum for scalable LAN (still meeting)
- Internal I/O bus limits delivered BW
  - 32-bit, 33 MHz PCI bus = 1 Gbit/sec
  - future: 64-bit, 66 MHz PCI bus = 4 Gbit/sec

15 Network Summary
- Fast serial lines and switches offer high bandwidth, low latency over reasonable distances
- Protocol software development and standards-committee bandwidth limit the rate of innovation
  - Ethernet forever?
- Internal I/O bus interface to the network is the bottleneck to delivered bandwidth and latency

16 Memory History / Trends / State of the Art
- DRAM: main memory of all computers
  - Commodity chip industry: no company > 20% share
  - Packaged in SIMM or DIMM (e.g., 16 DRAMs/SIMM)
- State of the art: $152, 128 MB DIMM (16 64-Mbit DRAMs), 10 ns x 64b (800 MB/sec)
- Capacity: 4X / 3 yrs (60%/yr); Moore's Law
- MB/$: +25%/yr
- Latency: -7%/year; Bandwidth: +20%/yr (so far)
- source: 5/21/98

17 Memory Innovations / Limits
- High-bandwidth interfaces, packages
  - RAMBUS DRAM: 800 - 1600 MByte/sec per chip
- Latency limited by memory controller, bus, multiple chips, driving pins
- More application bandwidth => more cache misses
  - Miss cost = per access + block size x per byte = memory latency + size / (DRAM BW x width) = 150 ns + 30 ns
  - Called Amdahl's Law: law of diminishing returns
[Diagram: processor with cache on a bus to multiple DRAM chips]
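A rough sketch (mine) of the per-access + per-byte miss cost above; the 64-byte block size is an assumption, and the 800 MB/s figure is the DIMM bandwidth from slide 16 (the slide's own example works out to 150 ns + 30 ns with a wider path):

    # Cache-miss cost model from slide 17:
    #   miss time = per-access latency + block_size x per-byte time
    # 150 ns latency and 800 MB/s are the slides' figures; block sizes are assumptions.
    def miss_time_ns(block_bytes=64, latency_ns=150.0, dram_bw_bytes_per_s=800e6):
        return latency_ns + block_bytes / dram_bw_bytes_per_s * 1e9

    print(miss_time_ns(64))    # ~230 ns: the fixed latency dominates a 64 B block
    print(miss_time_ns(256))   # ~470 ns: bigger blocks amortize latency, with diminishing returns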

18 Memory Summary
- DRAM: rapid improvements in capacity, MB/$, bandwidth; slow improvement in latency
- Processor-memory interface (cache + memory bus) is the bottleneck to delivered bandwidth
  - Like the network, the memory "protocol" is a major overhead

19 Processor Trends / History
- Microprocessor: main CPU of "all" computers
  - < 1986: +35%/yr performance increase (2X / 2.3 yr)
  - > 1987 (RISC): +60%/yr performance increase (2X / 1.5 yr)
- Cost fixed at ~$500/chip, power whatever can be cooled
- History of innovations to reach 2X / 1.5 yr (works on TPC?)
  - Multilevel caches (helps clocks / instruction)
  - Pipelining (helps seconds / clock, i.e., clock rate)
  - Out-of-order execution (helps clocks / instruction)
  - Superscalar (helps clocks / instruction)
- CPU time = Seconds/Program = (Instructions/Program) x (Clocks/Instruction) x (Seconds/Clock)
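A tiny sketch (mine) of the CPU-time equation above, with invented workload numbers purely for illustration; it also previews why the higher CPI of transaction workloads (slide 29) hurts so much:

    # "Iron law" of CPU performance from slide 19:
    #   CPU time = instructions/program x clocks/instruction (CPI) x seconds/clock
    # All three inputs below are invented example values, not measurements.
    def cpu_time_s(instructions, cpi, clock_hz):
        return instructions * cpi / clock_hz

    # The same 1e9-instruction program on a 600 MHz processor:
    print(cpu_time_s(1e9, cpi=1.0, clock_hz=600e6))  # ~1.7 s at a SPEC-like CPI
    print(cpu_time_s(1e9, cpi=3.0, clock_hz=600e6))  # ~5.0 s at an OLTP-like CPI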

20 Pipelining is Natural!
- Laundry example: Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away
- Washer takes 30 minutes
- Dryer takes 30 minutes
- "Folder" takes 30 minutes
- "Stasher" takes 30 minutes to put clothes into drawers

21 Sequential Laundry
- Sequential laundry takes 8 hours for 4 loads
[Timeline figure: loads A-D run back to back from 6 PM to 2 AM in 30-minute stages]

22 Pipelined Laundry: Start Work ASAP
- Pipelined laundry takes 3.5 hours for 4 loads!
[Timeline figure: loads A-D overlap, one 30-minute stage apart]
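The 8-hour and 3.5-hour figures fall out of a simple formula; a sketch (mine) of the general case:

    # Pipelining arithmetic behind slides 21-22:
    #   sequential: loads x stages x stage_time
    #   pipelined (no hazards): (stages + loads - 1) x stage_time
    def laundry_hours(loads=4, stages=4, stage_minutes=30):
        sequential = loads * stages * stage_minutes
        pipelined  = (stages + loads - 1) * stage_minutes
        return sequential / 60, pipelined / 60

    print(laundry_hours())   # (8.0, 3.5): the slides' 4 loads through 4 stages of 30 min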

23 Pipeline Hazard: Stall
- A depends on D; stall since the folder is tied up
[Timeline figure: loads A-F with a bubble where the dependent load waits]

24 Out-of-Order Laundry: Don't Wait
- A depends on D; the rest continue; need more resources to allow out-of-order execution
[Timeline figure: loads A-F reordered around the bubble]

25 Superscalar Laundry: Parallel per Stage
- More resources; can the HW match the mix of parallel tasks?
[Timeline figure: two loads per stage (light, dark, and very dirty clothing)]

26 Superscalar Laundry: Mismatch Mix
- Task mix underutilizes the extra resources
[Timeline figure: only light and dark loads A-D, leaving the duplicated stages partly idle]

27 State of the Art: Alpha
- 15M transistors
- Two 64 KB caches on chip; 16 MB L2 cache off chip
- Clock 600 MHz (fastest Cray supercomputer: T-series, clock in nsec)
- 90 watts
- Superscalar: fetches up to 6 instructions/clock cycle, retires up to 4 instructions/clock cycle
- Out-of-order execution

28 Processor Limit: DRAM Gap
- Alpha full cache miss, measured in instructions executed: 180 ns / 1.7 ns = 108 clocks, x 4 instructions/clock = 432 instructions
- Caches in Pentium Pro: 64% of area, 88% of transistors
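The same arithmetic as a two-liner (mine), using the slide's 180 ns miss, the 600 MHz (~1.7 ns) clock, and 4-wide issue:

    # Slide 28's arithmetic: a full cache miss costs miss_ns / cycle_ns clocks,
    # times the issue width in lost instruction slots.
    miss_ns, cycle_ns, issue_width = 180.0, 1.0 / 0.600, 4   # 600 MHz => ~1.67 ns/clock
    lost_clocks = miss_ns / cycle_ns
    print(round(lost_clocks), round(lost_clocks) * issue_width)   # ~108 clocks, ~432 instructions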

29 Processor Limits for TPC-C (Pentium Pro)
                                                  SPECint95   TPC-C
  Multilevel caches: miss rate, 1 MB L2 cache        0.5%        5%
  Superscalar (2-3 instr. retired/clock): % clks      40%       10%
  Out-of-order execution speedup                     2.0X      1.4X
  Clocks per instruction                              ...       ...
  % of peak performance                               40%       10%
- sources: Kim Keeton, Dave Patterson, Y. Q. He, R. C. Raphael, and Walter Baker, "Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads," Proc. 25th Int'l Symp. on Computer Architecture, June 1998; D. Bhandarkar and J. Ding, "Performance Characterization of the Pentium Pro Processor," Proc. 3rd Int'l Symp. on High-Performance Computer Architecture, Feb. 1997.

30 Processor Innovations / Limits
- Low-cost, low-power embedded processors
  - Lots of competition, innovation
  - Integer performance of an embedded processor ~ 1/2 of a desktop processor
  - StrongARM 110: 233 MHz, 268 MIPS, 0.36 W typical, $49
- Very Long Instruction Word (Intel/HP IA-64 "Merced")
  - Multiple ops per instruction; compiler controls parallelism
- Consolidation of the desktop industry? Innovation?
[Figure: PowerPC, PA-RISC, MIPS, Alpha, SPARC, x86, IA-64]

31 Processor Summary
- SPEC performance doubling every 18 months
  - Growing CPU-DRAM performance gap and "tax"
  - Running out of ideas, competition? Back to 2X / 2.3 yrs?
- Processor tricks not as useful for transactions?
  - Clock rate increases compensated by CPI increases?
  - When > 100 MIPS on TPC-C?
- Cost fixed at ~$500/chip; power whatever can be cooled
- Embedded processors promising
  - 1/10 cost, 1/100 power, 1/2 integer performance?

32 Systems: History, Trends, Innovations
- Cost/performance leaders come from the PC industry
- Transaction processing and file service based on Symmetric Multiprocessor (SMP) servers
  - Multiple processors, shared-memory addressing
- Decision support based on SMP and cluster (shared-nothing) systems
- Clusters of low-cost, small SMPs getting popular

33 State of the Art System: PC
- $1140 OEM
- Pentium II
- 64 MB DRAM
- 2 UltraDMA EIDE disks, 3.1 GB each
- 100 Mbit Ethernet interface
- (PennySort winner)

34 State of the Art SMP: Sun E10000
[Diagram: 16 boards, each with processors, memory, and an Xbar bridge, connected by a data crossbar switch and 4 address buses; bus bridges lead out to strings of SCSI disks]
- TPC-D, Oracle 8, 3/98
  - SMP CPUs, 64 GB DRAM, 668 disks (5.5 TB)
  - Disks, shelves: $2,128k
  - Boards, enclosures: $1,187k
  - CPUs: $912k
  - DRAM: $768k
  - Power: $96k
  - Cables, I/O: $69k
  - HW total: $5,161k

35 State of the Art Cluster: NCR WorldMark
[Diagram: 32 nodes, each a bus-bridged SMP with processors, memory, and PCI SCSI strings, connected by the BYNET switched network]
- TPC-D, Teradata V2, 10/97
  - 32 nodes x 4 CPUs, 1 GB DRAM, 41 disks per node (128 CPUs, 32 GB, 1312 disks, 5.4 TB total)
  - CPUs, DRAM, enclosures, boards, power: $5,360k
  - Disks + controllers: $2,164k
  - Disk shelves: $674k
  - Cables: $126k
  - Console: $16k
  - HW total: $8,340k

36 State of the Art Cluster: Tandem/Compaq SMP
- ServerNet switched network
- Rack-mounted equipment
- SMP nodes: 4 Pentium Pros, 3 GB DRAM, 3 disks each (6 nodes/rack)
- 10 disk racks, 7 disks/shelf
- Total: 6 SMPs (24 CPUs, 18 GB DRAM), 402 disks (2.7 TB)
- TPC-C, Oracle 8, 4/98
  - CPUs: $191k
  - DRAM: $122k
  - Disks + controllers: $425k
  - Disk shelves: $94k
  - Networking: $76k
  - Racks: $15k
  - HW total: $926k

37 Berkeley Cluster: Zoom Project
- 3 TB storage system
  - GB disks, Pentium Pro PCs, 100 Mbit switched Ethernet
  - System cost is a small delta (~30%) over raw disk cost
- Application: San Francisco Fine Arts Museum server
  - 70,000 art images online
  - Zoom in 32X; try it yourself! (e.g., the statue image)

38 User Decision Support Demand vs. Processor Speed
[Chart: database demand grows 2X every 9-12 months ("Greg's Law") while CPU speed grows 2X every 18 months ("Moore's Law"), opening a database-processor performance gap]

39 Outline
- Technology: disk, network, memory, processor, systems
  - Description / performance models
  - History / state of the art / trends
  - Limits / innovations
- Technology leading to a new database opportunity?
  - Common themes across the 5 technologies
  - Hardware & software alternative to today
  - Benchmarks

40 Review: Do Technology Trends Help?
- Desktop processor: + SPEC performance; - TPC-C performance, - CPU-memory performance gap
- Embedded processor: + cost/performance, + already inside the disk; - controllers everywhere
               Disk   Memory   Network
  Capacity      +       +        ...
  Bandwidth     +       +        +
  Latency       -       -        -
  Interface     -       -        -

41 IRAM: "Intelligent RAM"
- Microprocessor & DRAM on a single chip:
  - On-chip memory latency 5-10X better, bandwidth improved even more
  - Serial I/O 5-10X vs. buses
  - Improve energy efficiency 2X-4X (no off-chip bus)
  - Reduce the number of controllers
  - Smaller board area/volume
[Diagram: conventional processor + L2$ + bus + DRAM + I/O controllers vs. a single chip combining processor and DRAM]

42 "Intelligent Disk" (IDISK): Scalable Decision Support?
- Low-cost, low-power processor & memory included in the disk at little extra cost (e.g., Seagate's optional track buffer)
- Scalable processing AND communication as the number of disks increases
[Diagram: many IRAM-equipped disks interconnected through crossbar switches]

43 IDISK Cluster
- 8 disks, 8 CPUs, DRAM per shelf
- 15 shelves/rack = 120 disks/rack
- 1312 disks / 120 = 11 racks
- Connect 4 disks per ring
- 1312 / 4 = 328 Gbit links
- 328 / 16 => 36 32x32 switches
- HW, assembly cost: ~$1.5 M
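A back-of-the-envelope sketch (mine) of the packaging arithmetic above; rounding partial racks up is an assumption:

    import math

    # IDISK cluster sizing from slide 43 (1312 disks, as in the NCR WorldMark config).
    disks            = 1312
    disks_per_shelf  = 8
    shelves_per_rack = 15
    disks_per_ring   = 4

    disks_per_rack = disks_per_shelf * shelves_per_rack    # 120 disks per rack
    racks          = math.ceil(disks / disks_per_rack)     # 11 racks
    gbit_links     = disks // disks_per_ring               # 328 rings / Gbit links
    print(disks_per_rack, racks, gbit_links)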

44 Cluster IDISK Software Models
1) Shared-nothing database (e.g., IBM, Informix, NCR Teradata, Tandem)
2) Hybrid SMP database: front end running the query optimizer, applets downloaded into IDISKs
3) Start with personal database code developed for portable PCs and PDAs (e.g., M/S Access, M/S SQL Server, Oracle Lite, Sybase SQL Anywhere), then augment with new communication software

45 Back-of-the-Envelope Benchmarks
- All configurations have ~300 disks
- Equivalent speeds for central and disk processors
- Benchmarks: scan, sort, hash-join

46 Scan
- Scan 6 billion 145 B rows (TPC-D lineitem table)
- Embarrassingly parallel task; limited by the number of processors
- IDISK speedup: NCR: 2.4X; Compaq: 12.6X

47 MinuteSort
- External sorting: data starts and ends on disk
- MinuteSort: how much can we sort in a minute?
  - Benchmark designed by Nyberg, et al., SIGMOD '94
  - Current record: 8.4 GB on 95 UltraSPARC I's w/ Myrinet [NOWSort: Arpaci-Dusseau '97]
- Sorting review:
  - One-pass sort: data sorted = memory size
  - Two-pass sort:
    - Memory needed proportional to sq. rt. of the data sorted (data sorted grows with the square of memory size)
    - Disk I/O requirements: 2x that of a one-pass sort
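A rough sketch (mine, with invented per-node parameters) of how the one-pass vs. two-pass distinction bounds a MinuteSort result: one pass reads and writes the data once, two passes do so twice, so memory size and disk bandwidth jointly cap what fits in 60 seconds.

    # MinuteSort capacity estimate per slide 47's sorting review.
    # The memory and bandwidth figures passed in below are illustrative assumptions.
    def minute_sort_gb(memory_gb, disk_mb_s, seconds=60):
        io_gb    = disk_mb_s * seconds / 1000.0   # total disk-traffic budget in GB
        one_pass = min(memory_gb, io_gb / 2)      # read + write the data once, memory-limited
        two_pass = io_gb / 4                      # read + write the data twice (runs + merge)
        return max(one_pass, two_pass)

    print(minute_sort_gb(1.0, 15.0))    # ~0.45 GB: disk bandwidth, not memory, is the limit
    print(minute_sort_gb(0.05, 15.0))   # ~0.22 GB: two passes beat a memory-starved one-pass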

48 MinuteSort (results)
- IDISK sorts 2.5X - 13X more than the clusters
- IDISK sort is limited by disk B/W
- Cluster sorts are limited by network B/W

49 Hash-Join
- Hybrid hash join
  - R: 71k rows x 145 B
  - S: 200k rows x 165 B
  - TPC-D lineitem, part tables
- Clusters benefit from one-pass algorithms
- IDISK benefits from more processors, faster network
- IDISK speedups: NCR: 1.2X; Compaq: 5.9X
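For readers unfamiliar with the algorithm being benchmarked, a minimal sketch (mine) of the build/probe core of a hash join; a real hybrid hash join additionally partitions R and S to disk when the build side exceeds memory, which is omitted here:

    # In-memory hash join: build a hash table on the smaller relation R,
    # then probe it with each row of the larger relation S.
    def hash_join(r_rows, s_rows, r_key, s_key):
        build = {}
        for r in r_rows:                          # build phase over the smaller input
            build.setdefault(r[r_key], []).append(r)
        for s in s_rows:                          # probe phase over the larger input
            for r in build.get(s[s_key], []):
                yield {**r, **s}                  # emit the joined row

    # Toy rows standing in for TPC-D part (R) and lineitem (S):
    parts     = [{"p_partkey": 1, "p_name": "widget"}]
    lineitems = [{"l_partkey": 1, "l_qty": 5}, {"l_partkey": 2, "l_qty": 7}]
    print(list(hash_join(parts, lineitems, "p_partkey", "l_partkey")))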

50 Other Uses for IDISK
- Software RAID
- Backup accelerator
  - High-speed network connecting to tapes
  - Compression to reduce data sent and saved
- Performance monitor
  - Seek analysis, related accesses, hot data
- Disk data-movement accelerator
  - Optimize layout without using the host CPU, buses

51 IDISK App: Network-Attached Web, File Serving
- Snap!Server: plug in Ethernet 10/100 & power cable, turn on
- 32-bit CPU, flash memory, compact multitasking OS, SW updates from the Web
- Network protocols: TCP/IP, IPX, NetBEUI, and HTTP (Unix, Novell, M/S, Web)
- 1 or 2 EIDE disks
- 6 GB for $950, 12 GB for $1727 (7 MB/$, 14¢/MB)

52 Related Work
[Chart positioning projects by application complexity and per-disk resources: CMU "Active Disks" (small functions, e.g., scan), UCSB "Active Disks" (medium functions, e.g., image processing), UCB "Intelligent Disks" (general purpose); axes: more disks vs. more memory, CPU speed, and network per disk]
- source: Eric Riedel, Garth Gibson, Christos Faloutsos, CMU, VLDB '98; Anurag Acharya et al., UCSB T.R.

53 IDISK Summary
- IDISK less expensive by 2X to 10X, faster by 2X to 12X?
  - Need more realistic simulation, experiments
- IDISK scales better as the number of disks increases, as needed by Greg's Law
- Fewer layers of firmware and buses, less controller overhead between processor and data
- IDISK not limited to database apps: RAID, backup, network-attached storage, ...
- Near a strategic inflection point?

54 Messages from an Architect to the Database Community
- Architects want to study databases; why are they ignored?
  - Need company OK before publishing! (the "DeWitt" clause)
  - DB industry and researchers should fix this if they want better processors
  - SIGMOD/PODS join FCRC?
- Disk performance opportunity: minimize seek and rotational latency, utilize space vs. spindles
- Think about smaller-footprint databases: PDAs, IDISKs, ...
  - Is legacy code a reason to avoid virtually all innovations???
  - Need a more flexible/new code base?

55 Acknowledgments
- Thanks for feedback on this talk from M/S BARC (Jim Gray, Catharine van Ingen, Tom Barclay, Joe Barrera, Gordon Bell, Jim Gemmell, Don Slutz) and the IRAM group (Krste Asanovic, James Beck, Aaron Brown, Ben Gribstad, Richard Fromm, Joe Gebis, Jason Golbus, Kimberly Keeton, Christoforos Kozyrakis, John Kubiatowicz, David Martin, David Oppenheimer, Stelianos Perissakis, Steve Pope, Randi Thomas, Noah Treuhaft, and Katherine Yelick)
- Thanks for research support: DARPA, California MICRO, Hitachi, IBM, Intel, LG Semicon, Microsoft, Neomagic, SGI/Cray, Sun Microsystems, TI

56 Questions? Contact us if you’re interested:

57 1980s != 1990s
- 1980s database machines:
  - Scan only
  - Limited communication between disks
  - Custom hardware
  - Custom OS
  - Invent new algorithms
  - Only for databases
- 1990s IDISK:
  - Whole database code
  - High-speed communication between disks
  - Optional intelligence added to a standard disk (e.g., Cheetah track buffer)
  - Commodity OS
  - 20 years of development
  - Useful for WWW, file servers, backup

58 Stonebraker’s Warning “The history of DBMS research is littered with innumerable proposals to construct hardware database machines to provide high performance operations. In general these have been proposed by hardware types with a clever solution in search of a problem on which it might work.” Readings in Database Systems (second edition), edited by Michael Stonebraker, p.603

59 Grove’s Warning “...a strategic inflection point is a time in the life of a business when its fundamentals are about to change.... Let's not mince words: A strategic inflection point can be deadly when unattended to. Companies that begin a decline as a result of its changes rarely recover their previous greatness.” Only the Paranoid Survive, Andrew S. Grove, 1996

60 Clusters of PCs?
- 10 PCs/rack = 20 disks/rack
- 1312 disks / 20 = 66 racks, 660 PCs
- 660 / 16 = ~42 100-Mbit Ethernet switches + 9 1-Gbit switches
- 72 racks / 4 = 18 UPS
- Floor space (aisles between racks to access and repair PCs): 72 / 8 x 120 = ~1100 sq. ft.
- HW, assembly cost: ~$2M
- Quality of equipment? Repair? System administration?

61 Today's Situation: Microprocessor
  MIPS MPUs               R5000       R10000        10k/5k
  Clock rate              200 MHz     195 MHz       1.0x
  On-chip caches          32K/32K     32K/32K       1.0x
  Instructions/cycle      1 (+ FP)    4             4.0x
  Pipe stages             ...         ...           ...x
  Model                   In-order    Out-of-order  ---
  Die size (mm^2)         ...         ...           ...x
    without cache, TLB    ...         ...           ...x
  Development (man-yr)    ...         ...           ...x
  SPECint_base            ...         ...           ...x

62 Potential Energy Efficiency: 2X-4X
- Case study of StrongARM memory hierarchy vs. IRAM memory hierarchy
  - Cell-size advantages => much larger cache => fewer off-chip references => up to 2X-4X energy efficiency for memory
  - Less energy per bit access for DRAM
- Memory cell area ratio per process (P6, Alpha 21164, StrongARM): cache/logic : SRAM/SRAM : DRAM/DRAM = 20-50 : 8-11 : 1

63 Today's Situation: DRAM
[Chart: DRAM industry revenue (figures of $16B and $7B shown); Intel: 30%/year growth since 1987, with ~1/3 of income as profit]

64 MBits per Square Inch: DRAM as % of Disk over Time
[Chart comparing DRAM vs. disk areal density over time: 0.2 vs. 1.7 Mb/sq. in., 9 vs. 22 Mb/sq. in., 470 vs. ... Mb/sq. in.; y-axis 0% to 40%]
- source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces"

65 What About I/O?
- Current system architectures have limitations
- I/O bus performance lags other components
- Parallel I/O bus performance is scaled by increasing clock speed and/or bus width
  - E.g., 32-bit PCI: ~50 pins; 64-bit PCI: ~90 pins
  - Greater number of pins => greater packaging costs
- Are there alternatives to parallel I/O buses for IRAM?

66 Serial I/O and IRAM
- Communication advances: fast (Gbit/s) serial I/O lines [YankHorowitz96], [DallyPoulton96]
  - Serial lines require 1-2 pins per unidirectional link
  - Access to standardized I/O devices
    - Fiber Channel-Arbitrated Loop (FC-AL) disks
    - Gbit/s Ethernet networks
- Serial I/O lines are a natural match for IRAM
- Benefits
  - Serial lines provide high I/O bandwidth for I/O-intensive applications
  - I/O BW incrementally scalable by adding more lines
    - Number of pins required still lower than a parallel bus