1 CyberBricks: The future of Database And Storage Engines Jim Gray
2 Outline What storage things are coming from Microsoft? TerraServer: a 1 TB DB on the Web Storage Metrics: Kaps, Maps, Gaps, Scans The future of storage: ActiveDisks
3 New Storage Software From Microsoft SQL Server 7.0: » Simplicity: Auto-most-things » Scalability on Win95 to Enterprise » Data warehousing: built-in OLAP, VLDB NT 5: » Better volume management (from Veritas) » HSM architecture » Intellimirror » Active directory for transparency
4 “Hydra” Server Dedicated Windows terminal Existing Desktop PC MS-DOS, UNIX, Mac clients Net PC Thin Client Support TSO comes to NT Lower Per-Client cost Huge centralized data stores.
5 Windows NT 5.0 Intelli-Mirror ™ Files and settings mirrored on client and server Great for mobile users Facilitates roaming Easy to replace PCs Optimizes network performance Means HUGE data stores
6 Outline What storage things are coming from Microsoft? TerraServer: a 1 TB DB on the Web Storage Metrics: Kaps, Maps, Gaps, Scans The future of storage: ActiveDisks
7 Microsoft TerraServer: Scaleup to Big Databases Build a 1 TB SQL Server database Data must be » 1 TB » Unencumbered » Interesting to everyone everywhere » And not offensive to anyone anywhere Loaded » 1.5 M place names from Encarta World Atlas » 3 M Sq Km from USGS (1 meter resolution) » 1 M Sq Km from Russian Space agency (2 m) On the web (world’s largest atlas) Sell images with commerce server.
8 Microsoft TerraServer Background Earth is 500 Tera-meters square » USA is 10 tm TM 2 land in 70ºN to 70ºS We have pictures of 6% of it » 3 tsm from USGS » 2 tsm from Russian Space Agency Compress 5:1 (JPEG) to 1.5 TB. Slice into 10 KB chunks Store chunks in DB Navigate with » Encarta™ Atlas globe gazetteer » StreetsPlus™ in the USA 40x60 km² jump image 20x30 km² browse image 10x15 km² thumbnail 1.8x1.2 km² tile Someday » multi-spectral image » of everywhere » once a day / hour
9 USGS Digital Ortho Quads (DOQ) US Geologic Survey 4 Tera Bytes Most data not yet published Based on a CRADA » Microsoft TerraServer makes data available. USGS “DOQ” 1x1 meter 4 TB Continental US New Data Coming
10 Russian Space Agency (SovInformSputnik) SPIN-2 (Aerial Images is Worldwide Distributor) 1.5 Meter Geo Rectified imagery of (almost) anywhere Almost equal-area projection De-classified satellite photos (from 200 KM), More data coming (1 m) Selling imagery on Internet. Putting 2 tm² onto Microsoft TerraServer. SPIN-2
11 Microsoft.com/ Demo SPIN-2 Microsoft BackOffice
12 Demo navigate by coverage map to White House Download image buy imagery from USGS navigate by name to Venice buy SPIN2 image & Kodak photo Pop out to Expedia street map of Venice Mention that DB will double in next 18 months (2x USGS, 2X SPIN2)
13 Hardware [Diagram components: 1 TB database server (Alpha Server 8400, 8 x 440 MHz Alpha CPUs, 10 GB DRAM); Enterprise Storage Array of 324 StorageWorks disks in shelves of 48 9 GB drives; 10-drive STK 9710 DLT tape library (STC Timber Wolf DLT7000); 100 Mbps Ethernet switch; DS3; site servers; Internet map server; SPIN-2 web servers]
14 The Microsoft TerraServer Hardware Compaq AlphaServer 8400 8x400Mhz Alpha cpus 10 GB DRAM GB StorageWorks Disks » 3 TB raw, 2.4 TB of RAID5 STK 9710 tape robot (~14 TB) WindowsNT 4 EE, SQL Server 7.0
15 TerraServer Web Site Software [Diagram components: web client (browser, HTML, Java viewer); the Internet; Internet Information Server 4.0; Active Server Pages; MTS; image delivery application; image server; SQL Server 7; TerraServer DB; Terra-Server stored procedures; Microsoft Automap ActiveX Server; Automap Server; image provider site(s); Microsoft Site Server EE]
16 System Management & Maintenance Backup and Recovery » STK 9710 Tape robot » Legato NetWorker™ » SQL Server 7 Backup & Restore » Clocked at 80 MBps (peak) (~ 200 GB/hr) SQL Server Enterprise Mgr » DBA Maintenance » SQL Performance Monitor
17 Microsoft TerraServer File Group Layout Convert 324 disks to 28 RAID5 sets plus 28 spare drives Make 4 WinNT volumes (RAID 50) 595 GB per volume Build 30 20GB files on each volume DB is File Group of 120 files E: F: G: H:
18 Image Delivery and Load Incremental load of 4 more TB in next 18 months 10: ImgCutter 20: Partition 30: ThumbImg 40: BrowseImg 45: JumpImg 50: TileImg 55: Meta Data 60: Tile Meta 70: Img Meta 80: Update Place... [Diagram components: DLT tape (“tar”, NT Backup); \Drop’N’; ImgCutter; \Images; DoJob; Wait 4 Load; LoadMgr; LoadMgr DB; 100 Mbit EtherSwitch; Alpha Server 4100 (two); ESA; Enterprise Storage Array; STK DLT tape library]
19 Technical Challenge Key idea Problem: Geo-Spatial Search without geo-spatial access methods (just standard SQL Server). Solution: » Geo-spatial search key: Divide earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y); Z-transform X & Y into single Z value, build B-tree on Z; Adjacent images stored next to each other » Search Method: Latitude and Longitude => X, Y, then Z; Select on matching Z value
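A minimal sketch of the search key described on this slide, in Python. It assumes that "Z-transform" means bit interleaving of the X and Y cell numbers (a Z-order, or Morton, code). The function names, the grid origin at 180ºW / 90ºS, and the 16-bit width are illustrative assumptions, not TerraServer's actual scheme.

```python
def cell_index(lat_deg, lon_deg):
    """Map a latitude/longitude pair to integer grid cell numbers (x, y).

    Assumption: cells are counted from 180W and 90S so both numbers are
    non-negative; the slide only gives the cell size.
    """
    x = int((lon_deg + 180.0) * 48)  # cells of 1/48th degree longitude
    y = int((lat_deg + 90.0) * 96)   # cells of 1/96th degree latitude
    return x, y


def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one integer key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z


def search_key(lat_deg, lon_deg):
    """Latitude and longitude => X, Y, then Z."""
    return z_value(*cell_index(lat_deg, lon_deg))
```

An ordinary B-tree index on the resulting integer column can then serve the lookup, because cells that are close on the ground tend to receive nearby key values.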
20 Some Tera-Byte Databases The Web: 1 TB of HTML TerraServer: 1 TB of images Several other 1 TB (file) servers Hotmail: 7 TB of email Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked EOS/DIS (picture of planet each week) » 15 PB by 2007 Federal Clearing house: images of checks » 15 PB by 2006 (7 year history) Nuclear Stockpile Stewardship Program » 10 Exabytes (???!!) [Scale figure: Kilo Mega Giga Tera Peta Exa Zetta Yotta]
21 Info Capture You can record everything you see or hear or read. What would you do with it? How would you organize & analyze it? » Video: 8 PB per lifetime (10GBph) » Audio: 30 TB (10KBps) » Read or write: 8 GB (words) See: / ksg.html [Scale figure: Kilo Mega Giga Tera Peta Exa Zetta Yotta, with labels: Library of Congress (text), A novel, A letter, All Disks, All Tapes, A Movie, LoC (image)]
22 Kilo Mega Giga Tera Peta Exa Zetta Yotta A novel A letter Library of Congress (text) All Disks All Tapes A Movie LoC (image) All Photos LoC (sound + cinema) All Information!
23 Michael Lesk’s Points Soon everything can be recorded and kept Most data will never be seen by humans Precious Resource: Human attention Auto-Summarization Auto-Search will be a key enabling technology.
24 Outline What storage things are coming from Microsoft? TerraServer: a 1 TB DB on the Web Storage Metrics: Kaps, Maps, Gaps, Scans The future of storage: ActiveDisks
25 Storage Latency: How Far Away is the Data? » Registers: My Head (1 min) » On Chip Cache: This Room » On Board Cache: This Campus (10 min) » Memory: Sacramento (1.5 hr) » Disk: Pluto (2 Years) » Tape /Optical Robot: Andromeda (2,000 Years)
26 DataFlow Programming Prefetch & Postwrite Hide Latency Can't wait for the data to arrive (2,000 years!) Need a memory that gets the data in advance ( 100MB/S) Solution: Pipeline data to/from the processor Pipe data from source (tape, disc, ram...) to cpu cache
27 MetaMessage: Technology Ratios Are Important If everything gets faster&cheaper at the same rate THEN nothing really changes. Things getting MUCH BETTER: » communication speed & cost 1,000x » processor speed & cost 100x » storage size & cost 100x Things staying about the same » speed of light (more or less constant) » people (10x more expensive) » storage speed (only 10x better)
29 Storage Ratios Changed in Last 20 Years MediaPrice: 4000X, Bandwidth 10X, Access/s 10X DRAM:DISK $/MB: 100:1 -> 25:1 TAPE:DISK $/GB: 100:1 -> 5:1
30 Storage Ratios Changed 4,000x lower media price Capacity: 100X, Bandwidth 10X, Access/s 10X DRAM:DISK $/MB: 100:1 -> 25:1 TAPE:DISK $/GB: 100:1 -> 5:1 DRAM/disk media price ratio » :1 » :1 » :1 » today ~.15$pMB disk 5 $pMB dram
31 Disk Access Time Access time = SeekTime (6 ms, 5%/y) + RotateTime (3 ms, 5%/y) + ReadTime (1 ms, 25%/y) Other useful facts: » Power rises more than size³ (so small is indeed beautiful) » Small devices are more rugged » Small devices can use plastics (forces are much smaller) e.g. bugs fall without breaking anything
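A small worked version of the access-time sum on this slide. It assumes that each "%/y" figure is a yearly reduction that compounds, which the slide does not state; the function name is hypothetical.

```python
def access_time_ms(years_from_now=0):
    """Seek + rotate + read time in milliseconds after some years.

    Assumption: each component shrinks by its quoted percentage per year,
    compounded annually.
    """
    seek = 6.0 * (1 - 0.05) ** years_from_now    # 6 ms, 5%/y
    rotate = 3.0 * (1 - 0.05) ** years_from_now  # 3 ms, 5%/y
    read = 1.0 * (1 - 0.25) ** years_from_now    # 1 ms, 25%/y
    return seek + rotate + read


today = access_time_ms(0)         # 6 + 3 + 1 ms
in_ten_years = access_time_ms(10)
```

Under that assumption the total only roughly halves in ten years, because the mechanical seek and rotate terms dominate and improve slowly.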
32 Standard Storage Metrics Capacity: » RAM: MB and $/MB: today at 100 MB & 1$/MB » Disk: GB and $/GB: today at 10 GB and 50$/GB » Tape: TB and $/TB: today at .1 TB and 10$/GB (nearline) Access time (latency) » RAM: 100 ns » Disk: 10 ms » Tape: 30 second pick, 30 second position Transfer rate » RAM: 1 GB/s » Disk: 5 MB/s (arrays can go to 1 GB/s) » Tape: 3 MB/s (not clear that striping works)
33 New Storage Metrics: Kaps, Maps, Gaps, SCANs Kaps: How many kilobyte objects served per second » the file server, transaction processing metric Maps: How many megabyte objects served per second » the Mosaic metric Gaps: How many gigabyte objects served per hour » the video & EOSDIS metric SCANS: How many scans of all the data per day » the data mining and utility metric And: $/Kaps, $/Maps, $/Gaps, $/SCAN
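A rough sketch of how the four metrics could be estimated for one device. It assumes a simple model in which each object costs one access latency plus its transfer time and requests are served one at a time; that model and the function names are assumptions, not part of the talk. The example figures are the disk numbers quoted under Standard Storage Metrics (10 ms, 5 MB/s, 10 GB).

```python
KB, MB, GB = 1e3, 1e6, 1e9


def objects_per_second(latency_s, bytes_per_s, object_bytes):
    """Objects served per second when each costs latency plus transfer."""
    return 1.0 / (latency_s + object_bytes / bytes_per_s)


def storage_metrics(latency_s, bytes_per_s, capacity_bytes):
    """Return Kaps, Maps, Gaps (per hour) and SCANS (per day) for a device."""
    return {
        "kaps": objects_per_second(latency_s, bytes_per_s, KB),
        "maps": objects_per_second(latency_s, bytes_per_s, MB),
        "gaps": 3600 * objects_per_second(latency_s, bytes_per_s, GB),
        "scans": 86400 / (capacity_bytes / bytes_per_s),
    }


# Single disk: 10 ms access, 5 MB/s transfer, 10 GB capacity.
disk = storage_metrics(0.010, 5 * MB, 10 * GB)
```

In this model Kaps is dominated by latency while Maps, Gaps, and SCANS are dominated by transfer rate, which is why the larger-object metrics reward bandwidth and parallelism.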
34 How To Get Lots of Maps, Gaps, SCANS Parallelism: use many little devices in parallel. At 10 MB/s: 1.2 days to scan 1 TB. 1,000 x parallel: 100 seconds/scan. Parallelism: divide a big problem into many smaller ones to be solved in parallel.
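The two scan times on this slide are consistent with a data set of about 1 TB, which is an inference from the numbers. A minimal sketch of the arithmetic, assuming decimal units and perfectly even striping across devices:

```python
TB = 1e12
MB = 1e6


def scan_seconds(data_bytes, bytes_per_second, devices=1):
    """Time to read all the data once, spread evenly over the devices."""
    return data_bytes / (bytes_per_second * devices)


serial = scan_seconds(TB, 10 * MB)                   # one 10 MB/s device
serial_days = serial / 86400                         # a little over a day
parallel = scan_seconds(TB, 10 * MB, devices=1000)   # 1,000 devices
```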
35 Tape & Optical: Beware of the Media Myth Optical is cheap: 200 $/platter 2 GB/platter => 100$/GB (5x cheaper than disc) Tape is cheap:100 $/tape 40 GB/tape => 2.5 $/GB (100x cheaper than disc).
36 Tape & Optical Reality: Media is 10% of System Cost Tape needs a robot (10 k$... 3 m$ ) tapes (at 40GB each) => 20$/GB $/GB (1x…10x cheaper than disc) Optical needs a robot (50 k$ ) 100 platters = 200GB ( TODAY ) => 250 $/GB ( more expensive than disc ) Robots have poor access times Not good for Library of Congress (25TB) Data motel: data checks in but it never checks out!
37 The Access Time Myth The Myth: seek or pick time dominates The reality: (1) Queuing dominates (2) Transfer dominates BLOBs (3) Disk seeks often short Implication: many cheap servers better than one fast expensive server » shorter queues » parallel transfer » lower cost/access and cost/byte This is obvious for disk & tape arrays
38 My Solution to Tertiary Storage Tape Farms, Not Mainframe Silos Scan in 12 hours. many independent tape robots (like a disc farm) 10K$ robot 10 tapes 400 GB 6 MB/s 25$/GB 30 Maps 15 Gaps 2 Scans 100 robots 40TB 25$/GB 3K Maps 1.5K Gaps 2 Scans 1M$
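The capacity and cost figures for the tape farm scale linearly with the number of robots. A small sketch of that arithmetic, using the per-robot numbers from the slide (10 K$, 10 tapes, 400 GB); the function name is hypothetical and throughput is not modeled:

```python
def tape_farm(robots, robot_cost=10_000, tapes_per_robot=10, gb_per_tape=40):
    """Return (capacity in GB, cost in $, $ per GB) for a farm of robots."""
    capacity_gb = robots * tapes_per_robot * gb_per_tape
    cost = robots * robot_cost
    return capacity_gb, cost, cost / capacity_gb


one_robot = tape_farm(1)     # 400 GB for 10 K$
farm = tape_farm(100)        # 40 TB for 1 M$
```

The price per gigabyte stays the same as the farm grows, while the number of independent readers grows with it.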
The Metrics: Disk and Tape Farms Win [Chart: 1000x disc farm, STK tape robot (6,000 tapes, 8 readers), and 100x DLT tape farm compared on GB/K$, Kaps, Maps, and SCANS/day] Data Motel: Data checks in, but it never checks out
41 Storage Ratios Impact on Software Gone from 512 B pages to 8192 B pages (will go to 64 KB pages in 2006) Treat disks as tape: » Increased use of sequential access » Use disks for backup copies Use tape for » VERY COLD data or » Offsite Archive » Data interchange
42 Summary Storage accesses are the bottleneck Accesses are getting larger (Maps, Gaps, SCANS) Capacity and cost are improving BUT Latencies and bandwidth are not improving much SO Use parallel access (disk and tape farms) Use sequential access (scans)
43 The Memory Hierarchy Measuring & Modeling Sequential IO Where is the bottleneck? How does it scale with » SMP, RAID, new interconnects Goals: balanced bottlenecks Low overhead Scale many processors (10s) Scale many disks (100s) [Diagram labels: App address space, Mem bus, File cache, Memory, PCI, Adapter, SCSI, Controller]
44 Sequential IO: your mileage will vary » 40 MB/sec: Advertised UW SCSI » 35r-23w MB/sec: Actual disk transfer » 29r-17w MB/sec: 64 KB request (NTFS) » 9 MB/sec: Single disk media » 3 MB/sec: 2 KB request (SQL Server) Measuring hardware & software Looking for software fixes. Aiming for “out of the box” 1/2 power point: 50% of peak power “out of the box”
45 PAP (peak advertised Performance) vs RAP (real application performance) Goal: RAP = PAP / 2 (the half-power point) System Bus 422 MBps 7.2 MB/s 133 MBps 7.2 MB/s MBps 7.2 MB/s SCSI File System Buffers Application Data Disk PCI 40 MBps 7.2 MB/s
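The half-power goal can be written as a one-line check. In this sketch the function names are hypothetical, and pairing the advertised 40 MB/sec UltraWide SCSI rate with the 29 read / 17 write MB/sec measured at 64 KB requests is an illustrative choice rather than a comparison made in the talk.

```python
def half_power_point(pap_mbps):
    """Half of the peak advertised performance."""
    return pap_mbps / 2.0


def meets_goal(rap_mbps, pap_mbps):
    """True when real application performance reaches PAP / 2."""
    return rap_mbps >= half_power_point(pap_mbps)


reads_ok = meets_goal(29, 40)    # 64 KB reads against advertised SCSI rate
writes_ok = meets_goal(17, 40)   # 64 KB writes against advertised SCSI rate
```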
46 The Best Case: Temp File, NO IO Temp file Read / Write File System Cache Program uses small (in cpu cache) buffer. So, write/read time is bus move time (3x better than copy) Paradox: fastest way to move data is to write then read it. This hardware is limited to 150 MBps per processor
47 Bottleneck Analysis Drawn to linear scale Theoretical Bus Bandwidth 422MBps = 66 Mhz x 64 bits Memory Read/Write ~150 MBps MemCopy ~50 MBps Disk R/W ~9MBps
48 3 Stripes and You're Out! 3 disks can saturate adapter Similar story with UltraWide CPU time goes down with request size Ftdisk (striping is cheap)
49 Parallel SCSI Busses Help Second SCSI bus nearly doubles read and WCE throughput Write needs deeper buffers Experiment is unbuffered (3-deep + WCE)
50 File System Buffering & Stripes (UltraWide Drives) FS buffering helps small reads FS buffered writes peak at 12MBps 3-deep async helps Write peaks at 20 MBps Read peaks at 30 MBps
51 PAP vs RAP Reads are easy, writes are hard Async write can match WCE. 422 MBps 142MBps 133 MBps 72 MBps MBps 9 MBps SCSI File System Application Data PCI SCSI Disks 40 MBps 31 MBps
54 Penny Sort Ground Rules How much can you sort for a penny? » Hardware and Software cost » Depreciated over 3 years » 1M$ system gets about 1 second, » 1K$ system gets about 1,000 seconds. » Time (seconds) = 946,080 / SystemPrice ($) Input and output are disk resident Input is » 100-byte records (random data) » key is first 10 bytes. Must create output file and fill with sorted version of input file. Daytona (product) and Indy (special) categories
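Working the depreciation rule through gives the time budget directly: three years is 94,608,000 seconds, and a system priced in dollars costs 100 times that many pennies. A minimal sketch, assuming 365-day years; the function name is hypothetical.

```python
SECONDS_IN_3_YEARS = 3 * 365 * 24 * 60 * 60  # 94,608,000


def seconds_per_penny(system_price_dollars):
    """Seconds of machine time one penny buys, depreciating over 3 years."""
    pennies = system_price_dollars * 100
    return SECONDS_IN_3_YEARS / pennies


big = seconds_per_penny(1_000_000)   # 1M$ system: about 1 second
small = seconds_per_penny(1_000)     # 1K$ system: about 1,000 seconds
```

Both results agree with the 1M$ and 1K$ examples on the slide.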
55 PennySort Hardware » 266 Mhz Intel PPro » 64 MB SDRAM (10ns) » Dual Fujitsu DMA 3.2GB EIDE Software » NT workstation 4.3 » NT 5 sort Performance » sort 15 M 100-byte records (~1.5 GB) » Disk to disk » elapsed time 820 sec cpu time = 404 sec
56 Cluster Sort Conceptual Model Multiple Data Sources Multiple Data Destinations Multiple nodes Disks -> Sockets -> Disk -> Disk B AAA BBB CCC A AAA BBB CCC C AAA BBB CCC BBB AAA CCC BBB AAA CCC
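A toy, single-process sketch of the conceptual model: every source sends each record to the destination that owns its key range, and each destination then sorts what it received. Range partitioning by splitter keys is an assumption (the slide does not say how records are routed), and in-memory lists stand in for the disks and sockets.

```python
import bisect


def cluster_sort(sources, splitters):
    """Route records from many sources to many destinations, then sort.

    `splitters` is a sorted list of boundary keys; destination i receives
    the keys that fall between splitter i-1 and splitter i.
    """
    destinations = [[] for _ in range(len(splitters) + 1)]
    for records in sources:               # each source node's input
        for record in records:            # "send" to the owning node
            owner = bisect.bisect_right(splitters, record)
            destinations[owner].append(record)
    return [sorted(received) for received in destinations]


nodes = cluster_sort(
    [["CCC", "AAA", "BBB"], ["BBB", "CCC", "AAA"], ["AAA", "BBB", "CCC"]],
    splitters=["B", "C"],
)
```

Reading the destinations in order yields the fully sorted output, so no final merge across nodes is needed.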
57 Cluster Install & Execute If this is to be used by others, it must be: Easy to install Easy to execute Installations of distributed systems take time and can be tedious. (AM2, GluGuard) Parallel Remote execution is non-trivial. (GLUnix, LSF) How do we keep this “simple” and “built-in” to NTClusterSort ?
58 Remote Install RegConnectRegistry() RegCreateKeyEx() Add Registry entry to each remote node.
60 Outline What storage things are coming from Microsoft? TerraServer: a 1 TB DB on the Web Storage Metrics: Kaps, Maps, Gaps, Scans The future of storage: ActiveDisks
61 Crazy Disk Ideas Disk Farm on a card: surface mount disks Disk (magnetic store) on a chip: (micro machines in Silicon) NT and BackOffice in the disk controller (a processor with 100MB dram) ASIC
62 Remember Your Roots
63 Year 2002 Disks Big disk (10 $/GB) » 3” » 100 GB » 150 kaps (k accesses per second) » 20 MBps sequential Small disk (20 $/GB) » 3” » 4 GB » 100 kaps » 10 MBps sequential Both running Windows NT™ 7.0? (see below for why)
64 The Disk Farm On a Card The 1 TB disc card An array of discs Can be used as 100 discs 1 striped disc 10 Fault Tolerant discs....etc LOTS of accesses/second bandwidth 14" Life is cheap, it's the accessories that cost ya. Processors are cheap, it’s the peripherals that cost ya (a 10k$ disc card).
65 Put Everything in Future (Disk) Controllers (it’s not “if”, it’s “when?”) Acknowledgements : Dave Patterson explained this to me a year ago Kim Keeton Erik Riedel Catharine Van Ingen Helped me sharpen these arguments
66 Technology Drivers: Disks Disks on track 100x in 10 years 2 TB 3.5” drive Shrink to 1” is 200GB Disk replaces tape? Disk is super computer! Kilo Mega Giga Tera Peta Exa Zetta Yotta
67 Data Gravity Processing Moves to Transducers (moves to data sources & sinks) Move Processing to data sources Move to where the power (and sheet metal) is Processor in » Modem » Display » Microphones (speech recognition) & cameras (vision) » Storage: Data storage and analysis
68 It’s Already True of Printers Peripheral = CyberBrick You buy a printer You get a » several network interfaces » A Postscript engine cpu, memory, software, a spooler (soon) » and… a print engine.
69 Functionally Specialized Cards Storage Network Display M MB DRAM P mips processor ASIC Today: P=50 mips M= 2 MB In a few years P= 200 mips M= 64 MB
70 Tera Byte Backplane TODAY » Disk controller is 10 mips risc engine with 2MB DRAM » NIC is similar power SOON » Will become 100 mips systems with 100 MB DRAM. They are nodes in a federation (can run Oracle on NT in disk controller). Advantages » Uniform programming model » Great tools » Security » economics (CyberBricks) » Move computation to data (minimize traffic) All Device Controllers will be Cray 1’s Central Processor & Memory
71 Basic Argument for x-Disks Future disk controller is a super-computer. » 1 bips processor » 128 MB dram » 100 GB disk plus one arm Connects to SAN via high-level protocols » RPC, HTTP, DCOM, Kerberos, Directory Services,…. » Commands are RPCs » Management, security,…. » Services file/web/db/… requests » Managed by general-purpose OS with good dev environment Apps in disk saves data movement » need programming environment in controller
72 The Slippery Slope If you add function to server Then you add more function to server Function gravitates to data. Nothing = Sector Server Everything = App Server Something = Fixed App Server
73 Why Not a Sector Server? (let’s get physical!) Good idea, that’s what we have today. But » cache added for performance » Sector remap added for fault tolerance » error reporting and diagnostics added » SCSI commands (reserve,..) are growing » Sharing problematic (space mgmt, security,…) Slipping down the slope to a 2-D block server
74 Why Not a 1-D Block Server? Put A LITTLE on the Disk Server Tried and true design » HSC - VAX cluster » EMC » IBM Sysplex (3980?) But look inside » Has a cache » Has space management » Has error reporting & management » Has RAID 0, 1, 2, 3, 4, 5, 10, 50,… » Has locking » Has remote replication » Has an OS » Security is problematic » Low-level interface moves too many bytes
75 Why Not a 2-D Block Server? Put A LITTLE on the Disk Server Tried and true design » Cedar -> NFS » file server, cache, space,.. » Open file is many fewer msgs Grows to have » Directories + Naming » Authentication + access control » RAID 0, 1, 2, 3, 4, 5, 10, 50,… » Locking » Backup/restore/admin » Cooperative caching with client File Servers are a BIG hit: NetWare™ » SNAP! is my favorite today
76 Why Not a File Server? Put a Little on the Disk Server Tried and true design » Auspex, NetApp,... » Netware Yes, but look at NetWare » File interface gives you app invocation interface » Became an app server Mail, DB, Web,…. » Netware had a primitive OS Hard to program, so optimized wrong thing
77 Why Not Everything? Allow Everything on Disk Server (thin clients) Tried and true design » Mainframes, Minis,... » Web servers,… » Encapsulates data » Minimizes data moves » Scalable It is where everyone ends up. All the arguments against are short-term.
78 The Slippery Slope If you add function to server Then you add more function to server Function gravitates to data. Nothing = Sector Server Everything = App Server Something = Fixed App Server
79 Disk = Node has magnetic storage (100 GB?) has processor & DRAM has SAN attachment has execution environment OS Kernel SAN driver Disk driver File System RPC,... Services DBMS Applications
80 Technology Drivers: System on a Chip Integrate Processing with memory on chip » chip is 75% memory now » 1MB cache >> 1960 supercomputers » 256 Mb memory chip is 32 MB! » IRAM, CRAM, PIM,… projects abound Integrate Networking with processing on chip » system bus is a kind of network » ATM, FiberChannel, Ethernet,.. Logic on chip. » Direct IO (no intermediate bus) Functionally specialized cards shrink to a chip.
81 How Do They Talk to Each Other? Each node has an OS Each node has local resources: A federation. Each node does not completely trust the others. Nodes use RPC to talk to each other » CORBA? DCOM? IIOP? RMI? » One or all of the above. Huge leverage in high-level interfaces. Same old distributed system story. [Diagram: two stacks of Applications, RPC?, streams, and datagrams over VIAL/VIPL and wire(s)]
82 Technology Drivers: What if Networking Was as Cheap As Disk IO? TCP/IP » Unix/NT 100% cpu at 40 MBps Disk » Unix/NT 8% cpu at 40 MBps Why the Difference? » Host Bus Adapter does SCSI packetizing, checksum,… flow control, DMA » Host does TCP/IP packetizing, checksum,… flow control, small buffers
83 Technology Drivers: The Promise of SAN/VIA:10x in 2 years Today: » wires are 10 MBps (100 Mbps Ethernet) » ~20 MBps tcp/ip saturates 2 cpus » round-trip latency is ~300 us In the lab » Wires are 10x faster Myrinet, Gbps Ethernet, ServerNet,… » Fast user-level communication tcp/ip ~ 100 MBps 10% of each processor round-trip latency is 15 us
84 SAN: Standard Interconnect Gbps Ethernet: 110 MBps PCI: 70 MBps UW Scsi: 40 MBps FW scsi: 20 MBps scsi: 5 MBps LAN faster than memory bus? 1 GBps links in lab. 100$ port cost soon Port is computer RIP FDDI RIP ATM RIP SCI RIP SCSI RIP FC RIP ?
85 Technology Drivers: GBps Ethernet replaces SCSI Why I love SCSI » It's fast (30MBps (ultra) to 100 MBps (ultra3)) » The protocol uses little processor power Why I hate SCSI » Wires must be short » Cables are pricey » pins bend
86 Technology Drivers Plug & Play Software RPC is standardizing: (DCOM, IIOP, HTTP) » Gives huge TOOL LEVERAGE » Solves the hard problems for you: naming, security, directory service, operations,... Commoditized programming environments » FreeBSD, Linux, Solaris,…+ tools » NetWare + tools » WinCE, WinNT,…+ tools » JavaOS + tools Apps gravitate to data. General purpose OS on controller runs apps.
87 Basic Argument for x-Disks Future disk controller is a super-computer. » 1 bips processor » 128 MB dram » 100 GB disk plus one arm Connects to SAN via high-level protocols » RPC, HTTP, DCOM, Kerberos, Directory Services,…. » Commands are RPCs » management, security,…. » Services file/web/db/… requests » Managed by general-purpose OS with good dev environment Move apps to disk to save data movement » need programming environment in controller
88 Outline What storage things are coming from Microsoft? TerraServer: a 1 TB DB on the Web Storage Metrics: Kaps, Maps, Gaps, Scans The future of storage: ActiveDisks Papers and Talks at