Presentation on theme: "Three Talks: Scalability Terminology; What Windows is doing re this" — Presentation transcript:
1 Three Talks
- Scalability Terminology (Gray, with help from Devlin, Laing, Spix)
- What Windows is doing re this (Laing)
- The M$ PetaByte, as time allows (Gray)
2 Terminology for Scaleability
- Bill Devlin, Jim Gray, Bill Laing, George Spix; paper at ftp://ftp.research.microsoft.com/pub/tr/tr-99-85.doc
- Farms of servers:
  - Clones: identical; scaleability + availability
  - Partitions: scaleability
  - Packs: partition availability via fail-over
  - GeoPlex: for disaster tolerance
- Taxonomy (from the diagram): Farm; Clone (Shared Nothing, Shared Disk); Partition; Pack (Active-Active, Active-Passive); GeoPlex
3 Unpredictable Growth
- The TerraServer story: expected 5 M hits per day; got 50 M hits on day 1; peak of 20 M hits/day on a "hot" day; average 5 M hits/day over the last 2 years
- Most of us cannot predict demand
- Must be able to deal with NO demand
- Must be able to deal with HUGE demand
4 Web Services Requirements
- Scalability: need to be able to add capacity: new processing, new storage, new networking
- Availability: need continuous service: online change of all components (hardware and software), multiple service sites, multiple network providers
- Agility: need great tools: manage the system, change the application several times per year, add new services several times per year
5 Premise: Each Site is a Farm
- Buy computing by the slice (brick): rack of servers + disks, functionally specialized servers
- Grow by adding slices; spread data and computation to new slices
- Two styles:
  - Clones: anonymous servers
  - Parts+Packs: partitions fail over within a pack
- In both cases, a GeoPlex remote farm for disaster recovery
6 Scaleable Systems: Scale UP and Scale OUT
- Scale UP: grow by adding components to a single system
- Scale OUT: grow by adding more systems
7 Scale UP and Scale OUT
- Everyone does both. Choices:
  - Size of a brick: 1 M$/slice (IBM S390? Sun E10000?), 100 k$/slice (Wintel 8x), 10 k$/slice (Wintel 4x), 1 k$/slice (Wintel 1x)
  - Clones or partitions
  - Size of a pack
  - Whose software? (scale-up and scale-out both have a large software component)
8 Clones: Availability + Scalability
- Some applications are: read-mostly, low consistency requirements, modest storage requirement (less than 1 TB)
- Examples: HTML web servers (IP sprayer/sieve + replication), LDAP servers (replication via gossip)
- Replicate the app at all nodes (clones)
- Load balance: spray & sieve requests across nodes; route requests across nodes
- Grow by adding clones
- Fault tolerance: stop sending to that clone
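The spray/sieve idea is simple enough to sketch. Below is a minimal, hypothetical load balancer (the Sprayer class and its names are not from the talk) that round-robins requests across identical clones, grows by appending a clone, and tolerates faults by simply not sending to a clone that is marked down.

```python
import itertools

class Sprayer:
    """Round-robin 'spray' of requests across identical clones.

    A failed clone is taken out of rotation; growing the farm is just
    add_clone() -- no state migration, because clones are identical.
    """
    def __init__(self, clones):
        self.clones = list(clones)          # addresses of identical replicas
        self._cycle = itertools.cycle(range(len(self.clones)))
        self.down = set()                   # clones we stopped sending to

    def add_clone(self, addr):
        """Grow by adding a clone: no data to move, it serves immediately."""
        self.clones.append(addr)
        self._cycle = itertools.cycle(range(len(self.clones)))

    def mark_down(self, addr):
        """Fault tolerance: simply stop sending to the failed clone."""
        self.down.add(addr)

    def route(self, request):
        """Pick the next live clone for this request."""
        for _ in range(len(self.clones)):
            addr = self.clones[next(self._cycle)]
            if addr not in self.down:
                return addr, request
        raise RuntimeError("no live clones")

# Usage: three web-server clones, one fails, traffic keeps flowing.
lb = Sprayer(["web1:80", "web2:80", "web3:80"])
lb.mark_down("web2:80")
print([lb.route(f"GET /page{i}")[0] for i in range(4)])
```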
9 Two Clone Geometries
- Shared-Nothing: exact replicas
- Shared-Disk (state stored in server)
- If clones have any state: make it disposable
- Manage clones by reboot; failing that, replace
- One person can manage thousands of clones
10 Clone Requirements
- Automatic replication (if they have any state): applications (and system software), data
- Automatic request routing: spray or sieve
- Management: who is up?, update management & propagation, application monitoring
- Clones are very easy to manage; rule of thumb: 100s of clones per admin
11 Partitions for Scalability
- Clones are not appropriate for some apps: stateful apps do not replicate well; high update rates do not replicate well
- Examples: databases, read/write file servers, cache managers, chat
- Partition state among servers
- Partitioning must be transparent to the client; split & merge partitions online
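As a toy illustration of partitioned state (not the talk's design), here is a range-partition map whose lookup keeps partitioning transparent to the caller and whose split operation moves part of the key range to a new server online. PartitionMap, split, and the server names are all hypothetical.

```python
import bisect

class PartitionMap:
    """Range partitioning of keys across servers, with online split.

    Clients call lookup(); they never see where the boundaries are,
    so splitting a hot partition is transparent to them.
    """
    def __init__(self):
        # One partition covering the whole key space, on one server.
        self.bounds = []          # sorted split points (upper-exclusive)
        self.servers = ["db1"]    # servers[i] owns bounds[i-1] <= key < bounds[i]

    def lookup(self, key):
        """Which server holds this key?  (Transparent to the caller.)"""
        return self.servers[bisect.bisect_right(self.bounds, key)]

    def split(self, at_key, new_server):
        """Online split: the range at and above at_key moves to new_server."""
        i = bisect.bisect_right(self.bounds, at_key)
        self.bounds.insert(i, at_key)
        self.servers.insert(i + 1, new_server)

pm = PartitionMap()
print(pm.lookup("mailbox:alice"))                         # db1
pm.split("mailbox:m", "db2")                              # split the hot range
print(pm.lookup("mailbox:alice"), pm.lookup("mailbox:zoe"))  # db1 db2
```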
12 Packs for Availability
- Each partition may fail (independently of the others)
- Partitions migrate to a new node via fail-over; fail-over in seconds
- Pack: the nodes supporting a partition (VMS Cluster, Tandem, SP2 HACMP, IBM Sysplex™, WinNT MSCS "Wolfpack")
- Partitions typically grow in packs
- Active-Active: all nodes provide service; Active-Passive: the hot standby is idle
- Cluster-in-a-box is now a commodity
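A minimal active-passive pack could look like the sketch below, assuming a Pack object that tracks which node currently serves a partition and re-points it to a surviving standby on failure; all names are made up for illustration.

```python
class Pack:
    """A pack: the small set of nodes that can host a partition.

    Active-passive flavor: one node serves, the others stand by; on a
    failure the partition 'migrates' by re-pointing to a standby.
    """
    def __init__(self, partition, nodes):
        self.partition = partition
        self.nodes = list(nodes)    # e.g. ["nodeA", "nodeB"]
        self.active = self.nodes[0] # the rest are hot standbys

    def fail_over(self, failed_node):
        """Fail-over 'in seconds': pick a survivor as the new active node."""
        if failed_node == self.active:
            survivors = [n for n in self.nodes if n != failed_node]
            if not survivors:
                raise RuntimeError(f"pack for {self.partition} is dead")
            self.active = survivors[0]
        self.nodes = [n for n in self.nodes if n != failed_node]

pack = Pack("mailboxes[a-f]", ["nodeA", "nodeB"])
print(pack.active)        # nodeA serves the partition
pack.fail_over("nodeA")   # nodeA dies; the partition migrates
print(pack.active)        # nodeB now serves it
```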
14 Parts+Packs Requirements
- Automatic partitioning (in DBMS, mail, files, …): location transparent, partition split/merge, grow without limits (100x10TB)
- Application-centric request routing
- Simple fail-over model: partition migration is transparent, MSCS-like model for services
- Management: automatic partition management (split/merge), who is up?, application monitoring
15 GeoPlex: Farm Pairs
- Two farms (or more)
- State (your mailbox, bank account) stored at both farms
- Changes from one are sent to the other
- When one farm fails, the other provides service
- Masks: hardware/software faults; operations tasks (reorganize, upgrade, move); environmental faults (power failure, earthquake, fire)
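The GeoPlex idea, reduced to a sketch: two farms hold the same state, and every change applied at one is shipped to the other, so either can serve if its peer fails. The Farm class below is hypothetical, and a real GeoPlex would ship changes asynchronously via logs rather than call the peer inline.

```python
class Farm:
    """One farm of a GeoPlex pair: applies local updates and ships them."""
    def __init__(self, name):
        self.name = name
        self.state = {}      # e.g. mailbox -> contents
        self.peer = None     # the other farm in the pair

    def update(self, key, value):
        self.state[key] = value
        if self.peer is not None:
            self.peer.apply_remote(key, value)   # ship the change to the peer

    def apply_remote(self, key, value):
        self.state[key] = value

# Two farms holding the same state; either can take over if the other fails.
east, west = Farm("east"), Farm("west")
east.peer, west.peer = west, east
east.update("mailbox:alice", ["hello"])
print(west.state)   # the change arrived at the other farm
```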
16 Directory, Fail-Over, Load Balancing
- Routes the request to the right farm (the farm can be clone or partition)
- At the farm, routes the request to the right service
- At the service, routes the request to any clone or to the correct partition
- Routes around failures
17 Availability
- Well-managed nodes: mask some hardware failures
- Well-managed packs & clones: mask hardware failures, operations tasks (e.g. software upgrades), and some software failures
- Well-managed GeoPlex: masks site failures (power, network, fire, move, …) and some operations failures
- Each step adds "nines" of availability (9 … 99999)
18 Cluster Scale Out Scenarios
- The FARM: clones and packs of partitions
- Web clients, through load balancing, reach cloned front ends (firewall, sprayer, web server)
- Cloned, packed file servers: Web File Store A and Web File Store B, kept in sync by replication
- Packed partitions give database transparency: SQL Database = SQL Partition 1, 2, 3, plus SQL temp state
19 Some Examples
- TerraServer: 6 IIS clone front ends (WLBS); 3-partition, 4-pack backend (3 active, 1 passive); partition by theme and geography (longitude); 1/3 sysadmin
- Hotmail: 1000 IIS clones for HTTP login; 3400 IIS clones for the HTTP front door; clones for ad rotator, in/outbound, …; 115-partition backend (partition by mailbox); Cisco LocalDirector for load balancing; 50 sysadmins
- Google (Inktomi is similar but smaller): 700 clone spiders; 300 clone indexers; 5-node geoplex (full replica); 1,000 clones/farm do search; 100 clones/farm for HTTP; 10 sysadmins
- See "Challenges to Building Scalable Services: A Survey of Microsoft's Internet Services", Steven Levi and Galen Hunt
20 Acronyms
- RACS: Reliable Arrays of Cloned Servers
- RAPS: Reliable Arrays of Partitioned and Packed Servers (the first P is silent)
21 Emissaries and Fiefdoms
- Emissaries are (nearly) stateless; emissaries are easy to clone
- Fiefdoms are stateful; fiefdoms get partitioned
22 Summary
- Terminology for scaleability: farms of servers
  - Clones: identical; scaleability + availability
  - Partitions: scaleability
  - Packs: partition availability via fail-over
  - GeoPlex for disaster tolerance
- Taxonomy: Farm; Clone (Shared Nothing, Shared Disk); Partition; Pack (Active-Active, Active-Passive); GeoPlex
- "Architectural Blueprint for Large eSites", Bill Laing
- "Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS", Bill Devlin, Jim Gray, Bill Laing, George Spix, MS-TR-99-85, ftp://ftp.research.microsoft.com/pub/tr/tr-99-85.doc
23 Three Talks
- Scalability Terminology (Gray, with help from Devlin, Laing, Spix)
- What Windows is doing re this (Laing)
- The M$ PetaByte, as time allows (Gray)
24 What Windows is Doing
- Continued architecture and analysis work
- AppCenter, BizTalk, SQL, SQL Service Broker, ISA, … are all key to clones/partitions
- Exchange is an archetype: front ends, directory, partitioned, packs, transparent mobility
- NLB (clones) and MSCS (packs)
- High-performance technical computing
- Appliances and hardware trends
- Management of these kinds of systems
- Still need good ideas on …
25 Architecture and Design Work
- Produced an architectural blueprint for large eSites, published on MSDN
- Creating and testing instances of the architecture; team led by Per Vonge Neilsen
- Actually building and testing examples of the architecture with partners (sometimes known as MICE)
- Built a scalability "Megalab" run by Robert Barnes: 1000-node cyber wall, 1U Compaq DL360s, multi-way SMP servers, 7000 disks
27 Clones and Packs, aka Clustering
- Integrated the NLB and MSCS teams; both are focused on scalability and availability
- NLB for clones; MSCS for partitions/packs
- Vision: a single communications and group-membership infrastructure and a set of management tools for clones, partitions, and packs
- Unify management for clones/partitions at BOTH the OS and the app level (e.g. IIS, BizTalk, AppCenter, Yukon, Exchange, …)
28 Clustering in Whistler Server
- Microsoft Cluster Server:
  - Much improved setup and installation
  - 4-node support in Advanced Server; 8-node support in Datacenter
  - Kerberos support for virtual servers
  - Password change without restarting the cluster service
  - SAN enhancements (device reset, not bus reset, for disk arbitration; shared disk and boot disk on the same bus)
  - Quorum of nodes supported (no shared disk needed)
- Network Load Balancer:
  - New NLB manager
  - Bi-directional affinity for ISA as a proxy/firewall
  - Virtual cluster support (different port rules for each IP address)
  - Dual-NIC support
29 Geoclusters, aka Geographically Dispersed Packs
- Essentially, the nodes and storage are replicated at 2 sites; disks are remotely mirrored
- Being deployed today; we are helping vendors get certified; we still need better tools
- Working with EMC, Compaq, NSI Software, StorageApps
- Log shipping (SQL) and extended VLANs (IIS) are also solutions
30 High Performance Computing
- Last year (CY2000):
  - This work is part of the server scale-out efforts (BLaing)
  - Web site and HPC Tech Preview CD late last year: a W2000 "Beowulf" equivalent with 3rd-party tools
  - Better than the competition: 10-25% faster than Linux on SMPs (2, 4 & 8 ways); more reliable than SP2 (!); better performance & integration with IBM peripherals (!)
  - But it lacks an MPP debugger, tools, evangelism, reputation
  - See ../windows2000/hpc and \\jcbach\public\cornell*
- This year (CY2001):
  - Partner with Cornell/MPI-Soft/+ on Unix-to-W2000 projects
  - Evangelism of commercial HPC (starting with financial services)
  - Showcase environment & apps (EBC support)
  - First Itanium FP "play-offs"
  - BIG tools integration / beta
  - Dell & Compaq offer a web HPC buy-and-support experience (buy capacity by the slice)
  - Beowulf-on-W2000 book by Tom Sterling (author of Beowulf on Linux)
  - Gain on Sun in the list
  - Address the win-by-default assumption for Linux in HPC
  - No vendor has succeeded in bringing MPP to non-sci/eng venues & $$$ … we will
31 Appliances and Hardware Trends
- The appliances team under TomPh is focused on dramatically simplifying the user experience of installing these kinds of devices
- Working with OEMs to adopt WindowsXP
- Ultradense servers are on the horizon: 100s of servers per rack, manage the rack as one
- Infiniband and 10 Gbps Ethernet change things
32 Operations and Management
- Great research work done in MSR on this topic: the megaservices paper by Levi and Hunt
- The follow-on BIG project developed the ideas of scale-invariant service descriptions with automated monitoring and deployment of servers
- Building on that work in the Windows Server group
- AppCenter is doing similar things at the app level
33 Still Need Good Ideas On …
- Automatic partitioning
- Stateful load balancing
- Unified management of clones/partitions at both the app and OS level
34 Three Talks
- Scalability Terminology (Gray, with help from Devlin, Laing, Spix)
- What Windows is doing re this (Laing)
- The M$ PetaByte, as time allows (Gray)
35 We're Building Petabyte Stores
- Scale ladder: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta (and downward: 3 milli, 6 micro, 9 nano, 12 pico, 15 femto, 18 atto, 21 zepto, 24 yocto)
- Reference points on the ladder: a book, a photo, a movie, all LoC books (words), all books multimedia, everything recorded
- Soon everything can be recorded and indexed
- Hotmail: 100 TB now; MSN: 100 TB now
- List price is 800 M$/PB (including FC switches & brains); must GeoPlex it
- Can we get it for 1 M$/PB?
- Personal 1 TB stores for 1 k$
36 Building a Petabyte Store
- EMC: ~500 k$/TB = ~500 M$/PB, plus FC switches, plus …
- TPC-C SANs (Dell 18GB/…): … M$/PB
- Dell local SCSI, 3ware: … M$/PB
- Do it yourself: … M$/PB
- A billion here, a billion there; soon you're talking about real money!
37 320 GB for 2 k$ (now): 6 M$/PB
- 4 x 80 GB IDE (2 hot-pluggable): 1,000$
- SCSI-IDE bridge: 200$
- Box (500 MHz CPU, 256 MB RAM, fan, power, Enet): 500$
- Ethernet switch: 150$/port
- Or 8 disks/box: 640 GB for ~3 k$ (or 300 GB RAID)
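A quick sanity check of the slide's arithmetic, assuming the SCSI-IDE bridge really costs about 200$ so the parts add up to the quoted ~2 k$; the ~6 M$/PB figure then follows directly:

```python
# Rough check of the slide's numbers (assumes the bridge is ~200$, not 200 k$).
parts = {"4 x 80 GB IDE": 1000, "SCSI-IDE bridge": 200,
         "box (cpu, RAM, fan, power, Enet)": 500, "Ethernet switch port": 150}
box_cost = sum(parts.values())          # ~1,850$ -> "2 k$" on the slide
box_tb = 4 * 0.080                      # 320 GB = 0.32 TB
dollars_per_pb = box_cost / box_tb * 1000
print(f"{box_cost}$ per box, ~{dollars_per_pb/1e6:.1f} M$/PB")   # ~5.8 M$/PB
```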
38 Hot Swap Drives for Archive or Data Interchange
- 25 MBps write (so can write N x 80 GB in 3 hours)
- 80 GB overnight = ~N x 2 MB/second, at 19.95$/nite
- Compare to 1$/GB via the Internet
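The "ship the drive" economics can be recomputed directly; the 12-hour transit time below is an assumption, not a figure from the slide:

```python
# Effective bandwidth and cost of shipping an 80 GB drive overnight.
gb = 80
overnight_seconds = 12 * 3600                  # assumed ~12 hours door to door
mb_per_second = gb * 1000 / overnight_seconds  # ~1.9 MB/s per drive shipped
cost_per_gb_shipping = 19.95 / gb              # ~0.25 $/GB
print(f"~{mb_per_second:.1f} MB/s effective, {cost_per_gb_shipping:.2f} $/GB "
      f"vs ~1 $/GB via the Internet")
```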
39 A Storage Brick
- 2 x 80 GB disks, 500 MHz CPU (Intel/AMD/ARM), 256 MB RAM, 2 Enet RJ45 ports, fan(s)
- Current disk form factor, 30 watts, 600$ (?)
- Per rack (48U, 3U/module, 16 units/U): 400 disks, 200 Whistler nodes, 32 TB, 100 billion instructions per second, 120 k$/rack, 4 M$/PB
- Per petabyte (33 racks): 4 M$, 3 TeraOps (6,600 nodes), 13 k disk arms (1/2 TBps of IO)
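The per-rack and per-petabyte figures follow from the brick numbers; a small check, with all values taken from the slide:

```python
# The slide's rack -> petabyte arithmetic, recomputed.
disks_per_rack, gb_per_disk = 400, 80
nodes_per_rack, mips_per_node = 200, 500
rack_tb = disks_per_rack * gb_per_disk / 1000               # 32 TB per rack
racks_per_pb = 1000 / rack_tb                                # ~31; slide uses 33
print(rack_tb, "TB/rack;", round(racks_per_pb), "racks/PB")
print(33 * 120_000 / 1e6, "M$/PB at 120 k$/rack")            # ~4 M$
print(33 * nodes_per_rack, "nodes;",
      33 * nodes_per_rack * mips_per_node / 1e6, "TeraOps")  # 6,600 nodes, ~3.3
print(33 * disks_per_rack, "disk arms")                      # ~13 k arms
```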
40 What Software Do the Bricks Run?
- Each node has an OS; each node has local resources: a federation
- Each node does not completely trust the others
- Nodes use RPC to talk to each other: COM+, SOAP, BizTalk
- Huge leverage in high-level interfaces
- Same old distributed-system story
- Diagram: on each node, applications over RPC / streams / datagrams over the CLR; nodes connected by Infiniband / Gbps Ethernet
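To make "nodes talk only via RPC" concrete, here is a tiny sketch using Python's standard XML-RPC modules in place of the COM+/SOAP/BizTalk plumbing the slide names; the put/get interface and the port number are invented for illustration.

```python
# Each brick runs its own OS and services; peers talk only through RPC,
# never shared memory.  Standard-library XML-RPC stands in for SOAP/COM+.
from xmlrpc.server import SimpleXMLRPCServer
import threading, xmlrpc.client

def serve_brick(port):
    srv = SimpleXMLRPCServer(("localhost", port), logRequests=False)
    store = {}   # this brick's local resources
    srv.register_function(lambda k, v: store.update({k: v}) or True, "put")
    srv.register_function(lambda k: store.get(k, ""), "get")
    threading.Thread(target=srv.serve_forever, daemon=True).start()

serve_brick(8001)                                       # one "storage brick"
brick = xmlrpc.client.ServerProxy("http://localhost:8001")
brick.put("photo:42", "...bytes...")                    # high-level interface
print(brick.get("photo:42"))
```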
41 Storage Rack in 2 Years?
- 300 arms, 50 TB (160 GB/arm)
- 24 racks, 48 storage processors, 2x6+1 in a rack
- Disks = 2.5 GBps of IO; controllers = 1.2 GBps of IO; ports: … MBps of IO
- My suggestion: move the processors into the storage racks.
42 Auto Manage Storage
- 1980 rule of thumb: a DataAdmin per 10 GB, a SysAdmin per MIPS
- 2000 rule of thumb: a DataAdmin per 5 TB, a SysAdmin per 100 clones (varies with app)
- Problem: 5 TB is 60 k$ today, 10 k$ in a few years; admin cost >> storage cost???
- Challenge: automate ALL storage admin tasks
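The "admin cost >> storage cost" worry is easy to quantify; the loaded admin salary below is an assumed figure, everything else comes from the slide:

```python
# Why storage admin becomes the dominant cost (loaded salary is assumed).
admin_salary = 100_000               # assumed fully loaded $/year, not on the slide
tb_per_admin = 5                     # 2000 rule of thumb
hw_today, hw_soon = 60_000, 10_000   # price of those 5 TB: today vs. in a few years
print("admin  $/TB/yr:", admin_salary / tb_per_admin)        # 20,000
print("hardware $/TB :", hw_today / tb_per_admin, "today,",
      hw_soon / tb_per_admin, "soon")                         # 12,000 -> 2,000
```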
43 It's Hard to Archive a Petabyte
- It takes a LONG time to restore it: at 1 GBps it takes 12 days!
- Store it in two (or more) places online (on disk?): a geo-plex
- Scrub it continuously (look for errors)
- On failure: use the other copy until the failure is repaired, then refresh the lost copy from the safe copy
- Can organize the two copies differently (e.g. one by time, one by space)
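The restore-time claim checks out: one petabyte at a sustained gigabyte per second is indeed about 12 days.

```python
# How long to copy or restore a petabyte at a given sustained rate.
petabyte = 1e15                       # bytes
for rate in (1e9, 10e9):              # 1 GBps and 10 GBps
    days = petabyte / rate / 86400
    print(f"{rate/1e9:.0f} GBps -> {days:.1f} days")   # ~11.6 days at 1 GBps
```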
44 Call To Action
- Let's work together to make storage bricks: low cost, high function
- NAS (network attached storage), not SAN (storage area network)
- Ship NT8/CLR/IIS/SQL/Exchange/… with every disk drive
45 Three Talks
- Scalability Terminology (Gray, with help from Devlin, Laing, Spix)
- What Windows is doing re this (Laing)
- The M$ PetaByte, as time allows (Gray)
48 Disk vs Tape
- Disk: 80 GB, 35 MBps, 5 ms seek time, 3 ms rotate latency, 3$/GB for the drive, 2$/GB for controllers/cabinet, 4 TB/rack, 1-hour scan
- Tape: 40 GB, 10 MBps, 10 sec pick time, … second seek time, 2$/GB for media, 8$/GB for drive+library, 10 TB/rack, 1-week scan
- Guestimates. CERN: 200 TB, 3480 tapes, 2 col = 50 GB, rack = 1 TB = 12 drives
- The price advantage of tape is gone, and the performance advantage of disk is growing
- At 10 k$/TB, disk is competitive with nearline tape.
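The scan-time contrast can be recomputed from the per-drive numbers; the drive counts per rack below are inferred (4 TB / 80 GB for disk, roughly a dozen drives for the tape library, following the CERN guestimate) rather than stated outright:

```python
# Time to scan a full rack, reading all drives in parallel.
def scan_hours(capacity_tb, drives, mbps_per_drive):
    seconds = capacity_tb * 1e6 / (drives * mbps_per_drive)   # TB -> MB
    return seconds / 3600

# Disk rack: 4 TB across ~50 x 80 GB drives at 35 MBps each -> well under an hour.
print(f"disk rack: {scan_hours(4, 50, 35):.1f} h")
# Tape rack: 10 TB but only ~12 drives at 10 MBps -> ~a day of pure streaming,
# and closer to a week once pick, mount, and seek overheads are included.
print(f"tape rack: {scan_hours(10, 12, 10):.0f} h of pure streaming")
```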
50 Data on Disk Can Move to RAM in 8 Years
- Today (3/5/2001): disk is 300 GB per k$; RAM is 1 GB per k$ (PC133 ECC SDRAM)
- Price ratio: roughly 100:1 to 300:1, closing in about 6 to 8 years
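The title's "8 years" follows if you assume RAM price per GB keeps halving roughly every year; with an 18-month halving it is closer to 12 years. A back-of-envelope check, with the halving period as an explicit assumption:

```python
import math

# When does RAM reach today's disk price per GB?  Halving period is an assumption.
disk_gb_per_kdollar, ram_gb_per_kdollar = 300, 1     # slide figures, 3/5/2001
ratio = disk_gb_per_kdollar / ram_gb_per_kdollar     # ~300:1
for halving_years in (1.0, 1.5):                     # assumed price-halving rates
    years = math.log2(ratio) * halving_years
    print(f"halve every {halving_years} yr -> ~{years:.0f} years")
```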
51 5-Year Tech Trends
- 256-way NUMA?
- Huge main memories: now 500 MB - 64 GB; then 10 GB - 1 TB
- Huge disks: now 5-50 GB 3.5" disks; then … GB disks
- Petabyte storage farms (that you can't back up or restore)
- Disks >> tapes
- "Small" disks: one platter, one inch, 10 GB
- SAN convergence: 1 GBps point-to-point is easy
- 1 GB RAM chips
- MAD (magnetic areal density) at 50 Gbpsi
- Drives shrink one quantum
- 10 GBps SANs are ubiquitous
- 500 MIPS CPUs for 10$
- 5 BIPS CPUs at the high end
52 The Absurd? Consequences
- Further segregate processing from storage: poor locality, much useless data movement
- Amdahl's laws: bus: 10 B/ips; IO: 1 b/ips
- Diagram: processors (~1 Tips), a 10 TBps bus to ~1 TB of RAM memory, and 100 GBps of IO to ~100 TB of disks
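The diagram's bandwidths are just Amdahl's balance rules applied to a tera-ops processor pool; a one-line check:

```python
# Amdahl's balance rules (10 bytes of bus and 1 bit of IO per instruction/second)
# applied to ~1 Tips of processing.
ips = 1e12
bus_bytes_per_sec = 10 * ips            # 10 B/ips -> 10 TBps to RAM
io_bytes_per_sec = 1 * ips / 8          # 1 b/ips  -> ~125 GBps (~100 GBps) to disk
print(f"bus: {bus_bytes_per_sec/1e12:.0f} TBps, IO: {io_bytes_per_sec/1e9:.0f} GBps")
```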
53 Drives Shrink (1.8", 1")
- 150 kaps for 500 GB is VERY cold data
- 3 GB/platter today, 30 GB/platter in 5 years
- Most disks are 1/2 full
- TPC benchmarks use 9 GB drives (they need arms or bandwidth)
- One solution: a smaller form factor: more arms per GB, more arms per rack, more arms per watt
54 All Device Controllers will be Super-Computers
- TODAY: a disk controller is a 10 MIPS RISC engine with 2 MB DRAM; a NIC has similar power
- SOON: they will become 100 MIPS systems with 100 MB DRAM
- They are nodes in a federation (you can run Oracle on NT in the disk controller)
- Advantages: uniform programming model, great tools, security, economics (cyberbricks), move computation to data (minimize traffic)
- Diagram: central processor & memory, terabyte backplane
55 Crazy Disk Ideas
- Disk farm on a card: surface-mount disks
- Disk (magnetic store) on a chip: micro-machines in silicon
- Full apps (e.g. SAP, Exchange/Notes, …) in the disk controller (a processor with 128 MB DRAM) plus an ASIC
- The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail, Clayton M. Christensen
56 Functionally Specialized Cards
- A card = P MIPS processor + M MB DRAM + ASIC, specialized for storage, network, or display
- Today: P = 50 MIPS, M = 2 MB
- In a few years: P = 200 MIPS, M = 64 MB