1 Three Talks Scalability Terminology –Gray (with help from Devlin, Laing, Spix) What Windows is doing re this –Laing The M$ PetaByte (as time allows) –Gray
2 Terminology for Scalability Bill Devlin, Jim Gray, Bill Laing, George Spix. Paper at: ftp://ftp.research.microsoft.com/pub/tr/tr doc Farms of servers: –Clones: identical, Scalability + availability –Partitions: Scalability –Packs: Partition availability via fail-over GeoPlex –for disaster tolerance. [Diagram: taxonomy tree with nodes Farm, Clone (Shared Nothing, Shared Disk), Partition, Pack (Shared Nothing, Active-Active, Active-Passive), GeoPlex]
3 Unpredictable Growth The TerraServer Story: –Expected 5 M hits per day –Got 50 M hits on day 1 –Peak at 20 M hpd on a hot day –Average 5 M hpd over last 2 years Most of us cannot predict demand –Must be able to deal with NO demand –Must be able to deal with HUGE demand
4 Web Services Requirements Scalability : Need to be able to add capacity –New processing –New storage –New networking Availability : Need continuous service –Online change of all components (hardware and software) –Multiple service sites –Multiple network providers Agility : Need great tools –Manage the system –Change the application several times per year. –Add new services several times per year.
5 Premise: Each Site is a Farm Buy computing by the slice (brick): –Rack of servers + disks. –Functionally specialized servers Grow by adding slices –Spread data and computation to new slices Two styles: –Clones: anonymous servers –Parts+Packs: Partitions fail over within a pack In both cases, GeoPlex remote farm for disaster recovery
6 Scalable Systems ScaleUp: grow by adding components to a single system. ScaleOut: grow by adding more systems. [Diagram: Scale UP, Scale OUT]
7 ScaleUp and ScaleOut Everyone does both. Choices –Size of a brick –Clones or partitions –Size of a pack Whose software? –ScaleUp and ScaleOut both have a large software component 1M$/slice –IBM S390? –Sun E 10,000? 100 K$/slice –Wintel 8x 10 K$/slice –Wintel 4x 1 K$/slice –Wintel 1x
8 Clones: Availability+Scalability Some applications are –Read-mostly –Low consistency requirements –Modest storage requirement (less than 1TB) Examples: –HTML web servers (IP sprayer/sieve + replication) –LDAP servers (replication via gossip) Replicate app at all nodes (clones) Load Balance: –Spray & Sieve: requests across nodes. –Route: requests across nodes. Grow: adding clones Fault tolerance: stop sending to that clone.
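The spray, grow, and fault-tolerance bullets on this slide can be illustrated with a minimal sketch. It assumes a simple round-robin policy; the `Sprayer` class and the clone names are hypothetical, not taken from the talk or from any Microsoft product.

```python
class Sprayer:
    """Hypothetical round-robin sprayer over interchangeable clones."""

    def __init__(self, clones):
        self.clones = list(clones)       # every clone can serve every request
        self.healthy = set(self.clones)  # clones currently receiving traffic
        self._turn = 0

    def add_clone(self, clone):
        # Grow: adding a clone adds capacity, no data movement needed.
        self.clones.append(clone)
        self.healthy.add(clone)

    def mark_down(self, clone):
        # Fault tolerance: simply stop sending to that clone.
        self.healthy.discard(clone)

    def mark_up(self, clone):
        if clone in self.clones:
            self.healthy.add(clone)

    def route(self):
        # Try each clone at most once per request, skipping failed ones.
        for _ in range(len(self.clones)):
            clone = self.clones[self._turn % len(self.clones)]
            self._turn += 1
            if clone in self.healthy:
                return clone
        raise RuntimeError("no healthy clones")


sprayer = Sprayer(["web1", "web2", "web3"])
first_round = [sprayer.route() for _ in range(3)]
sprayer.mark_down("web2")
after_failure = [sprayer.route() for _ in range(4)]
```

Because clones hold no unique state, the sprayer needs no knowledge of the request itself; that is what distinguishes it from the partition routing discussed later in the talk.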
9 Two Clone Geometries Shared-Nothing: exact replicas Shared-Disk (state stored in server) [Diagram: Shared Nothing Clones, Shared Disk Clones] If clones have any state: make it disposable. Manage clones by reboot; failing that, replace. One person can manage thousands of clones.
10 Clone Requirements Automatic replication (if they have any state) –Applications (and system software) –Data Automatic request routing –Spray or sieve Management: –Who is up? –Update management & propagation –Application monitoring. Clones are very easy to manage: –Rule of thumb: 100s of clones per admin.
11 Partitions for Scalability Clones are not appropriate for some apps. –Stateful apps do not replicate well –High update rates do not replicate well Examples: –Databases –Read/write file server… –Cache managers –Chat Partition state among servers Partitioning: –must be transparent to client. –split & merge partitions online
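A minimal sketch of the partitioning requirements above, assuming range partitioning over string keys (for example mailbox names). The `PartitionMap` class and the node names are hypothetical; the talk does not prescribe a particular scheme.

```python
import bisect


class PartitionMap:
    """Hypothetical range partition map: servers[i] owns keys in
    [bounds[i], bounds[i+1]); clients only ever call lookup()."""

    def __init__(self, first_server):
        self.bounds = [""]            # sorted lower bounds; "" covers all keys
        self.servers = [first_server]

    def lookup(self, key):
        # Transparent to the client: the map, not the caller, knows the layout.
        i = bisect.bisect_right(self.bounds, key) - 1
        return self.servers[i]

    def split(self, at_key, new_server):
        # Online split: keys >= at_key (up to the next bound) move to new_server.
        i = bisect.bisect_left(self.bounds, at_key)
        if i < len(self.bounds) and self.bounds[i] == at_key:
            raise ValueError("partition boundary already exists")
        self.bounds.insert(i, at_key)
        self.servers.insert(i, new_server)

    def merge(self, at_key):
        # Online merge: the partition starting at at_key folds into its left neighbor.
        i = self.bounds.index(at_key)
        if i == 0:
            raise ValueError("cannot merge away the first partition")
        del self.bounds[i]
        del self.servers[i]


pmap = PartitionMap("node0")
pmap.split("m", "node1")   # mailboxes "m".."z" now live on node1
```

A real system would also have to move the data and drain in-flight requests during a split or merge; the sketch only shows the routing metadata.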
12 Packs for Availability Each partition may fail (independent of others) Partitions migrate to new node via fail-over –Fail-over in seconds Pack: the nodes supporting a partition –VMS Cluster, Tandem, SP2 HACMP,.. –IBM Sysplex –WinNT MSCS (wolfpack) Partitions typically grow in packs. ActiveActive: all nodes provide service ActivePassive: hot standby is idle Cluster-In-A-Box now commodity
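The fail-over behavior of a pack can be sketched as follows. This is an illustrative model only, assuming partitions move to the least-loaded surviving node; the `Pack` class is hypothetical and is not the MSCS API. The example mirrors the 3-partition, 4-node (3 active, 1 passive) layout mentioned for TerraServer later in the talk.

```python
class Pack:
    """Hypothetical pack: a group of nodes that host partitions and
    take over each other's partitions on failure."""

    def __init__(self, nodes, partitions):
        self.up = list(nodes)
        # Place partitions round-robin; with more nodes than partitions the
        # leftover nodes stay idle (active-passive hot standby).
        self.owner = {p: self.up[i % len(self.up)]
                      for i, p in enumerate(partitions)}

    def fail(self, node):
        self.up.remove(node)
        if not self.up:
            raise RuntimeError("entire pack is down")
        for part, host in list(self.owner.items()):
            if host == node:
                # Count partitions per surviving node, then fail over
                # to the least-loaded survivor.
                loads = {n: 0 for n in self.up}
                for h in self.owner.values():
                    if h in loads:
                        loads[h] += 1
                self.owner[part] = min(self.up, key=loads.get)


pack = Pack(["n1", "n2", "n3", "spare"], ["p1", "p2", "p3"])
pack.fail("n2")   # p2 migrates to the idle standby
```

Once the standby is in use, a further failure spreads load across the survivors, which is effectively active-active operation.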
14 Parts+Packs Requirements Automatic partitioning (in dbms, mail, files,…) –Location transparent –Partition split/merge –Grow without limits (100x10TB) –Application-centric request routing Simple fail-over model –Partition migration is transparent –MSCS-like model for services Management: –Automatic partition management (split/merge) –Who is up? –Application monitoring.
15 GeoPlex: Farm Pairs Two farms (or more) State (your mailbox, bank account) stored at both farms Changes from one sent to other When one farm fails other provides service Masks –Hardware/Software faults –Operations tasks (reorganize, upgrade, move) –Environmental faults (power fail, earthquake, fire)
16 Directory Fail-Over Load Balancing Routes request to right farm –Farm can be clone or partition At farm, routes request to right service At service routes request to –Any clone –Correct partition. Routes around failures.
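The three routing levels on this slide (farm, then service, then clone or partition) can be sketched as one lookup. Everything here is an assumption for illustration: the `DIRECTORY` layout, the node names, and the use of a CRC32 hash to pick a partition are hypothetical, and fail-over within a pack is left out.

```python
import zlib

# Hypothetical directory: farm -> service -> kind and nodes.
DIRECTORY = {
    "west": {
        "web":  {"kind": "clone",     "nodes": ["west-web1", "west-web2"]},
        "mail": {"kind": "partition", "nodes": ["west-mail0", "west-mail1"]},
    },
    "east": {
        "web":  {"kind": "clone",     "nodes": ["east-web1", "east-web2"]},
        "mail": {"kind": "partition", "nodes": ["east-mail0", "east-mail1"]},
    },
}


def route(directory, farm_order, service, key, down=frozenset()):
    """Return (farm, node) for a request, routing around failed farms and nodes."""
    for farm in farm_order:
        if farm in down:
            continue                      # route around a failed farm
        svc = directory[farm][service]
        nodes = svc["nodes"]
        if svc["kind"] == "partition":
            # Correct partition: the key alone decides which node owns the data.
            candidates = [nodes[zlib.crc32(key.encode()) % len(nodes)]]
        else:
            candidates = nodes            # any clone will do
        for node in candidates:
            if node not in down:
                return farm, node
    raise RuntimeError("no farm can serve this request")


order = ["west", "east"]                  # prefer the nearer farm
normal = route(DIRECTORY, order, "web", "user1")
clone_failed = route(DIRECTORY, order, "web", "user1", down={"west-web1"})
farm_failed = route(DIRECTORY, order, "web", "user1", down={"west"})
```

Falling back to the second farm only works because a GeoPlex keeps state at both farms, as the preceding slide describes.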
17 Availability Well-managed nodes: mask some hardware failures. Well-managed packs & clones: mask hardware failures and operations tasks (e.g. software upgrades); mask some software failures. Well-managed GeoPlex: masks site failures (power, network, fire, move,…); masks some operations failures.
18 Cluster Scale Out Scenarios The FARM: Clones and Packs of Partitions [Diagram: Web Clients; Load Balance; Cloned Front Ends (firewall, sprayer, web server); Cloned Packed file servers (Web File Store A, Web File Store B, replication); SQL Temp State; Packed Partitions with Database Transparency (SQL Database: SQL Partition 1, SQL Partition 2, SQL Partition 3)]
19 Some Examples: TerraServer: –6 IIS clone front-ends (wlbs) –3-partition 4-pack backend: 3 active 1 passive –Partition by theme and geography (longitude) –1/3 sysadmin Hotmail: –1000 IIS clone HTTP login –3400 IIS clone HTTP front door – clones for ad rotator, in/out bound… –115 partition backend (partition by mailbox) –Cisco local director for load balancing –50 sysadmin Google: (Inktomi is similar but smaller) –700 clone spider –300 clone indexer –5-node geoplex (full replica) –1,000 clones/farm do search –100 clones/farm for http –10 sysadmin See Challenges to Building Scalable Services: A Survey of Microsoft's Internet Services, Steven Levi and Galen Hunt
20 Acronyms RACS: Reliable Arrays of Cloned Servers RAPS: Reliable Arrays of Partitioned and Packed Servers (the first P is silent).
21 Emissaries and Fiefdoms Emissaries are stateless (nearly) Emissaries are easy to clone. Fiefdoms are stateful Fiefdoms get partitioned.
22 Summary Terminology for scalability Farms of servers: –Clones: identical, Scalability + availability –Partitions: Scalability –Packs: Partition availability via fail-over GeoPlex for disaster tolerance. References: Architectural Blueprint for Large eSites, Bill Laing. Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS, Bill Devlin, Jim Gray, Bill Laing, George Spix, MS-TR ftp://ftp.research.microsoft.com/pub/tr/tr doc [Diagram: taxonomy tree with nodes Farm, Clone (Shared Nothing, Shared Disk), Partition, Pack (Shared Nothing, Active-Active, Active-Passive), GeoPlex]
23 Three Talks Scalability Terminology –Gray (with help from Devlin, Laing, Spix) What Windows is doing re this –Laing The M$ PetaByte (as time allows) –Gray
24 What Windows is Doing Continued architecture and analysis work AppCenter, BizTalk, SQL, SQL Service Broker, ISA,… all key to Clones/Partitions Exchange is an archetype –Front ends, directory, partitioned, packs, transparent mobility. NLB (clones) and MSCS (Packs) High Performance Technical Computing Appliances and hardware trends Management of this kind of system Still need good ideas on….
25 Architecture and Design work Produced an architectural Blueprint for large eSites published on MSDN – Creating and testing instances of the architecture –Team led by Per Vonge Neilsen –Actually building and testing examples of the architecture with partners. (sometimes known as MICE) Built a scalability Megalab run by Robert Barnes –1000 node cyber wall, 315 1U Compaq DL360s, 32 8ways, 7000 disks
27 Clones and Packs aka Clustering Integrated the NLB and MSCS teams –Both focused on scalability and availability –NLB for Clones –MSCS for Partitions/Packs Vision is a single communications and group membership infrastructure and a set of management tools for Clones, Partitions, and Packs Unify management for clones/partitions at BOTH: OS and app level (e.g. IIS, Biztalk, AppCenter, Yukon, Exchange…)
28 Clustering in Whistler Server Microsoft Cluster Server –Much improved setup and installation –4 node support in Advanced server Kerberos support for Virtual Servers Password change without restarting cluster service 8 node support in Datacenter SAN enhancements (Device reset not bus reset for disk arbitration, Shared disk and boot disk on same bus) Quorum of nodes supported (no shared disk needed) Network Load Balancer –New NLB manager Bi-Directional affinity for ISA as a Proxy/Firewall Virtual cluster support (Different port rules for each IP addr) Dual NIC support
29 Geoclusters AKA - Geographically dispersed (Packs) –Essentially the nodes and storage are replicated at 2 sites, disks are remotely mirrored Being deployed today; helping vendors get them certified; we still need better tools Working with –EMC, Compaq, NSISoftware, StorageApps Log shipping (SQL) and extended VLANs (IIS) are also solutions
30 High Performance Computing Last year (CY2000) This work is a part of server scale-out efforts (BLaing) Web site and HPC Tech Preview CD late last year –A W2000 Beowulf equivalent w/ 3rd-party tools Better than the competition –10-25% faster than Linux on SMPs (2, 4 & 8 ways) –More reliable than SP2 (!) –Better performance & integration w/ IBM periphs (!) But it lacks MPP debugger, tools, evangelism, reputation See../windows2000/hpc Also \\jcbach\public\cornell* This year (CY2001) Partner w/ Cornell/MPI-Soft/+ –Unix to W2000 projects –Evangelism of commercial HPC (start w/ financial svcs) –Showcase environment & apps (EBC support) –First Itanium FP play-offs –BIG tools integration / beta Dell & Compaq offer web HPC buy and support experience (buy capacity by-the-slice) Beowulf-on-W2000 book by Tom Sterling (author of Beowulf on Linux) Gain on Sun in the list Address the win-by-default assumption for Linux in HPC No vendor has succeeded in bringing MPP to non-sci/eng venues & $$$… we will.
31 Appliances and Hardware Trends The appliances team under TomPh is focused on dramatically simplifying the user experience of installing these kinds of devices –Working with OEMs to adopt WindowsXP Ultradense servers are on the horizon –100s of servers per rack –Manage the rack as one Infiniband and 10 Gbps Ethernet change things.
32 Operations and Management Great research work done in MSR on this topic –The Mega services paper by Levi and Hunt –The follow on BIG project developed the ideas of Scale Invariant Service Descriptions with automated monitoring and deployment of servers. Building on that work in Windows Server group AppCenter doing similar things at app level
33 Still Need Good Ideas on… Automatic partitioning Stateful load balancing Unified management of clones/partitions at both app and OS level
34 Three Talks Scalability Terminology –Gray (with help from Devlin, Laing, Spix) What Windows is doing re this –Laing The M$ PetaByte (as time allows) –Gray
35 We're building Petabyte Stores Soon everything can be recorded and indexed Hotmail 100TB now MSN 100TB now List price is 800M$/PB (including FC switches & brains) Must Geoplex it. Can we get it for 1M$/PB? Personal 1TB stores for 1k$ [Scale figure: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta; landmarks: A Photo, A Book, A Movie, All LoC books (words), All Books MultiMedia, Everything Recorded!] Small prefixes: 3 milli, 6 micro, 9 nano, 12 pico, 15 femto, 18 atto, 21 zepto, 24 yocto
36 Building a Petabyte Store EMC ~ 500k$/TB = 500 M$/PB plus FC switches plus… 800 M$/PB TPC-C SANs (Dell 18GB/…) 62 M$/PB Dell local SCSI, 3ware 20 M$/PB Do it yourself: 5 M$/PB A billion here, a billion there, soon you're talking about real money!
GB, 2k$ (now) 6M$ / PB 4x80 GB IDE (2 hot-pluggable) –(1,000$) SCSI-IDE bridge –200k$ Box –500 MHz cpu –256 MB SRAM –Fan, power, Enet –500$ Ethernet Switch: –150$/port Or 8 disks/box 640 GB for ~3K$ (or 300 GB RAID)
38 Hot Swap Drives for Archive or Data Interchange 25 MBps write (so can write N x 80 GB in 3 hours) 80 GB/overnite = ~N x $/nite Compare to 1$/GB via Internet
39 A Storage Brick 2 x 80GB disks 500 MHz cpu (Intel/AMD/ARM) 256MB ram 2 eNet RJ45 Fan(s) Current disk form factor 30 watt 600$ (?) Per rack (48U - 3U/module - 16 units/U): 400 disks, 200 Whistler nodes, 32 TB, 100 Billion Instructions Per Second, 120 K$/rack, 4 M$/PB. Per Petabyte (33 racks): 4 M$, 3 TeraOps (6,600 nodes), 13 k disk arms (1/2 TBps IO)
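The rack and petabyte totals on this slide follow from the per-brick numbers. The sketch below reproduces the arithmetic under two labeled assumptions that are not on this slide: one instruction per clock cycle, and 35 MBps per disk (the figure quoted on the later Disk vs Tape slide). The variable names are illustrative.

```python
import math

# Per-brick figures from the slide.
DISKS_PER_NODE = 2
DISK_GB = 80
NODE_MHZ = 500
NODE_COST_USD = 600          # the slide marks this price with "(?)"
NODES_PER_RACK = 200

# Assumptions not stated on this slide.
INSTR_PER_CYCLE = 1          # assumed: one instruction per clock
DISK_MBPS = 35               # assumed: taken from the Disk vs Tape slide

rack_tb = NODES_PER_RACK * DISKS_PER_NODE * DISK_GB / 1000       # 32 TB
rack_bips = NODES_PER_RACK * NODE_MHZ * INSTR_PER_CYCLE / 1000   # 100 BIPS
rack_cost = NODES_PER_RACK * NODE_COST_USD                       # 120 K$

min_racks = math.ceil(1000 / rack_tb)   # plain division gives 32 racks
SLIDE_RACKS = 33                        # the slide's figure, used below

pb_nodes = SLIDE_RACKS * NODES_PER_RACK                  # 6,600 nodes
pb_disks = pb_nodes * DISKS_PER_NODE                     # ~13 k disk arms
pb_cost = SLIDE_RACKS * rack_cost                        # ~4 M$
pb_teraops = pb_nodes * NODE_MHZ * INSTR_PER_CYCLE / 1e6 # ~3 TeraOps
pb_io_gbps = pb_disks * DISK_MBPS / 1000                 # ~1/2 TBps

print(rack_tb, rack_bips, rack_cost, min_racks)
print(pb_nodes, pb_disks, pb_cost, pb_teraops, pb_io_gbps)
```

The slide's 33 racks is slightly more than the 32 that a decimal petabyte strictly requires; the sketch keeps the slide's number so that the downstream totals match.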
40 What Software Do The Bricks Run? Each node has an OS Each node has local resources: A federation. Each node does not completely trust the others. Nodes use RPC to talk to each other –COM+ SOAP, BizTalk Huge leverage in high-level interfaces. Same old distributed system story. [Diagram: two nodes, each labeled CLR, streams, datagrams, RPC?, Applications, connected by Infiniband / Gbps Ethernet]
41 Storage Rack in 2 years? 300 arms 50TB (160 GB/arm) 24 racks 48 storage processors 2x6+1 in rack Disks = 2.5 GBps IO Controllers = 1.2 GBps IO Ports 500 MBps IO My suggestion: move the processors into the storage racks.
42 Auto Manage Storage 1980 rule of thumb: –A DataAdmin per 10GB, SysAdmin per mips 2000 rule of thumb –A DataAdmin per 5TB –SysAdmin per 100 clones (varies with app). Problem: –5TB is 60k$ today, 10k$ in a few years. –Admin cost >> storage cost??? Challenge: –Automate ALL storage admin tasks
43 It's Hard to Archive a Petabyte It takes a LONG time to restore it. At 1GBps it takes 12 days! Store it in two (or more) places online (on disk?). A geo-plex Scrub it continuously (look for errors) On failure, –use other copy until failure repaired, –refresh lost copy from safe copy. Can organize the two copies differently (e.g.: one by time, one by space)
44 Call To Action Let's work together to make storage bricks –Low cost –High function NAS (network attached storage) not SAN (storage area network) Ship NT8/CLR/IIS/SQL/Exchange/… with every disk drive
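The 12-day figure is a straightforward division, sketched here with decimal units (1 PB = 10^15 bytes, 1 GBps = 10^9 bytes per second). The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def restore_days(store_bytes, rate_bytes_per_s):
    """Days needed to stream a store of the given size at a sustained rate."""
    return store_bytes / rate_bytes_per_s / SECONDS_PER_DAY


days_at_1gbps = restore_days(1e15, 1e9)   # about 11.6 days, i.e. the slide's "12 days"
print(round(days_at_1gbps, 1))
```

This is a lower bound: it assumes the restore sustains the full rate with no seek, verification, or media-handling overhead, which is part of the case for keeping a second online copy.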
45 Three Talks Scalability Terminology –Gray (with help from Devlin, Laing, Spix) What Windows is doing re this –Laing The M$ PetaByte (as time allows) –Gray
48 Disk vs Tape Disk –80 GB –35 MBps – 5 ms seek time – 3 ms rotate latency – 3$/GB for drive 2$/GB for ctlrs/cabinet –4 TB/rack –1 hour scan Tape –40 GB –10 MBps –10 sec pick time – second seek time –2$/GB for media 8$/GB for drive+library –10 TB/rack –1 week scan The price advantage of tape is gone, and the performance advantage of disk is growing At 10K$/TB, disk is competitive with nearline tape. Guesstimates CERN: 200 TB 3480 tapes 2 col = 50GB Rack = 1 TB =12 drives
49 What's a Balanced System? [Diagram: System Bus, PCI Bus]
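The scan times on this slide can be approximated from capacity and transfer rate. The sketch below ignores pick and seek times, and the number of tape drives per rack is an assumed parameter, since the slide does not state it.

```python
def scan_hours(capacity_gb, rate_mbps):
    """Hours to read one device end to end at its sequential rate."""
    return capacity_gb * 1000 / rate_mbps / 3600


def rack_scan_days(rack_tb, rate_mbps, parallel_drives):
    """Days to scan a whole rack when parallel_drives devices read at once."""
    return rack_tb * 1e6 / (rate_mbps * parallel_drives) / 86_400


disk_hours = scan_hours(80, 35)        # ~0.63 h: the slide's "1 hour scan"
tape_hours = scan_hours(40, 10)        # ~1.1 h per cartridge

# Every disk has its own arm, so a 4 TB rack (50 disks) scans in parallel.
disk_rack_days = rack_scan_days(4, 35, parallel_drives=50)

# Tapes share drives. Drive counts here are assumptions, not slide data.
tape_rack_days_1 = rack_scan_days(10, 10, parallel_drives=1)   # ~11.6 days
tape_rack_days_2 = rack_scan_days(10, 10, parallel_drives=2)   # ~5.8 days
```

With one or two drives per 10 TB rack the result brackets the slide's "1 week scan", which suggests the gap comes from the ratio of readers to media rather than from raw transfer rate alone.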
50 Data on Disk Can Move to RAM in 8 years 100:1 6 years Today: 3/5/2001 Disk: 300 GB per K$ RAM: 1 GB per K$ pc133 ecc sdram
51 5-year Tech Trends 256 way NUMA? Huge main memories: now: 500MB - 64GB memories then: 10GB - 1TB memories Huge disks now: 5-50 GB 3.5 disks then: GB disks Petabyte storage farms –(that you can't back up or restore). Disks >> tapes –Small disks: One platter one inch 10GB SAN convergence 1 GBps point to point is easy 1 GB RAM chips MAD at 50 Gbpsi Drives shrink one quantum 10 GBps SANs are ubiquitous 500 mips cpus for 10$ 5 bips cpus at high end
52 The Absurd? Consequences Further segregate processing from storage Poor locality Much useless data movement Amdahl's laws: bus: 10 B/ips io: 1 b/ips [Diagram: Processors ~ 1 Tips; RAM Memory ~ 1 TB; Disks ~ 100TB; 100 GBps; 10 TBps]
53 Drives shrink (1.8 inch, 1 inch) 150 kaps for 500 GB is VERY cold data 3 GB/platter today, 30 GB/platter in 5 years. Most disks are ½ full TPC benchmarks use 9GB drives (need arms or bandwidth). One solution: smaller form factor –More arms per GB –More arms per rack –More arms per Watt
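The bandwidth figures on this slide follow from the two ratios it quotes (10 bytes of bus traffic and 1 bit of IO per instruction per second). A small sketch of that arithmetic, with an illustrative function name:

```python
def amdahl_bandwidth(ips):
    """Bus and IO bandwidth (bytes/s) implied by the slide's ratios:
    bus: 10 B per instruction/s, io: 1 bit per instruction/s."""
    bus_bytes_per_s = 10 * ips
    io_bytes_per_s = ips / 8          # 1 bit = 1/8 byte
    return bus_bytes_per_s, io_bytes_per_s


bus, io = amdahl_bandwidth(1e12)      # ~1 Tips of aggregate processing
print(bus / 1e12, "TBps bus")         # 10 TBps
print(io / 1e9, "GBps io")            # 125 GBps, which the slide rounds to 100 GBps
```

Moving that much data between separate processor and storage pools is the "absurd" consequence the slide points to, and it motivates placing processors next to the disks.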
54 Tera Byte Backplane TODAY –Disk controller is 10 mips risc engine with 2MB DRAM –NIC is similar power SOON –Will become 100 mips systems with 100 MB DRAM. They are nodes in a federation (can run Oracle on NT in disk controller). Advantages –Uniform programming model –Great tools –Security –Economics (cyberbricks) –Move computation to data (minimize traffic) All Device Controllers will be Super-Computers Central Processor & Memory
55 Crazy Disk Ideas Disk Farm on a card: surface mount disks Disk (magnetic store) on a chip: (micro machines in Silicon) Full Apps (e.g. SAP, Exchange/Notes,..) in the disk controller (a processor with 128 MB dram) ASIC The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail Clayton M. Christensen.ISBN:
56 Functionally Specialized Cards Storage Network Display M MB DRAM P mips processor ASIC Today: P=50 mips M= 2 MB In a few years P= 200 mips M= 64 MB