Presentation on theme: "Three Talks Scalability Terminology What Windows is doing re this"— Presentation transcript:
1 Three Talks
Scalability Terminology: Gray (with help from Devlin, Laing, Spix)
What Windows is doing re this: Laing
The M$ PetaByte (as time allows): Gray
2 Terminology for Scalability
Bill Devlin, Jim Gray, Bill Laing, George Spix
Paper at: ftp://ftp.research.microsoft.com/pub/tr/tr doc
Farms of servers:
  Clones: identical; scalability + availability
  Partitions: scalability
  Packs: partition availability via fail-over
  GeoPlex: for disaster tolerance
(Taxonomy diagram: Farm → Clone (shared-nothing or shared-disk) and Partition → Pack (active-active or active-passive), plus GeoPlex.)
3 Unpredictable Growth
The TerraServer story:
  Expected 5 M hits per day
  Got 50 M hits on day 1
  Peak at 20 M hits/day on a “hot” day
  Average 5 M hits/day over the last 2 years
Most of us cannot predict demand:
  Must be able to deal with NO demand
  Must be able to deal with HUGE demand
4 Web Services Requirements
Scalability: need to be able to add capacity
  New processing
  New storage
  New networking
Availability: need continuous service
  Online change of all components (hardware and software)
  Multiple service sites
  Multiple network providers
Agility: need great tools
  Manage the system
  Change the application several times per year
  Add new services several times per year
5 Premise: Each Site is a Farm
Buy computing by the slice (brick): a rack of servers + disks.
Functionally specialized servers.
Grow by adding slices: spread data and computation to new slices.
Two styles:
  Clones: anonymous servers
  Parts + Packs: partitions fail over within a pack
In both cases, a GeoPlex remote farm for disaster recovery.
6 Scaleable Systems: Scale UP and Scale OUT
Scale UP: grow by adding components to a single system.
Scale OUT: grow by adding more systems.
7 Scale UP and Scale OUT
Everyone does both. Choices:
  Whose software?
  Size of a brick
  Clones or partitions
  Size of a pack
Scale-up and scale-out both have a large software component.
Brick price points:
  1 M$/slice: IBM S390? Sun E10000?
  100 K$/slice: Wintel 8X
  10 K$/slice: Wintel 4X
  1 K$/slice: Wintel 1X
8 Clones: Availability + Scalability
Some applications are:
  Read-mostly
  Low consistency requirements
  Modest storage requirement (less than 1 TB)
Examples:
  HTML web servers (IP sprayer/sieve + replication)
  LDAP servers (replication via gossip)
Replicate the app at all nodes (clones).
Load balance: spray & sieve requests across nodes; route requests across nodes.
Grow by adding clones.
Fault tolerance: stop sending to a failed clone.
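The spray/route scheme above can be sketched in a few lines. This is a minimal illustration under stated assumptions (the class name, clone names, and the in-memory clone list are hypothetical; real sprayers such as WLBS or Cisco LocalDirector do this at the network layer):

```python
class Sprayer:
    """Toy round-robin sprayer: any clone can serve any request."""

    def __init__(self, clones):
        self.clones = list(clones)   # clones are identical replicas
        self.next = 0

    def add_clone(self, clone):
        # Growth: just add a clone; no data movement, since clones are replicas.
        self.clones.append(clone)

    def mark_down(self, clone):
        # Fault tolerance: simply stop sending requests to a failed clone.
        self.clones.remove(clone)

    def route(self, request):
        # Spray: pick the next clone in round-robin order.
        clone = self.clones[self.next % len(self.clones)]
        self.next += 1
        return clone
```

Growing capacity is just `add_clone`, and failure handling is just `mark_down`, which is why clones are so cheap to manage.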
9 Two Clone Geometries
Shared-nothing: exact replicas.
Shared-disk: state stored on the shared disk.
If clones have any state, make it disposable.
Manage clones by reboot; failing that, replace.
One person can manage thousands of clones.
10 Clone Requirements
Automatic replication (if they have any state):
  Applications (and system software)
  Data
Automatic request routing: spray or sieve.
Management:
  Who is up?
  Update management & propagation
  Application monitoring
Clones are very easy to manage; rule of thumb: 100's of clones per admin.
11 Partitions for Scalability
Clones are not appropriate for some apps:
  Stateful apps do not replicate well
  High update rates do not replicate well
Examples:
  Databases
  Read/write file servers
  Cache managers
  Chat
Partition state among servers.
Partitioning:
  must be transparent to the client
  split & merge partitions online
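For contrast with clones, partitioned routing sends each key to the one server that owns its state. A minimal hash-placement sketch (the function name and the use of MD5 are assumptions; real systems favor range or directory maps precisely so partitions can split and merge online without rehashing everything):

```python
import hashlib

def partition_of(key: str, n_partitions: int) -> int:
    """Map a key (e.g. a mailbox name) to one of n partitions.

    Every router must compute the same answer, so the key is hashed
    deterministically rather than with Python's per-process hash().
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions
```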
12 Packs for Availability
Each partition may fail (independently of the others).
Partitions migrate to a new node via fail-over; fail-over in seconds.
Pack: the set of nodes supporting a partition.
  VMS Cluster, Tandem, SP2 HACMP, IBM Sysplex™, WinNT MSCS (Wolfpack)
Partitions typically grow in packs:
  Active-Active: all nodes provide service
  Active-Passive: hot standby is idle
Cluster-in-a-box is now a commodity.
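The pack idea can be sketched as a toy active-passive pack. Class and node names are hypothetical; real packs (MSCS, Sysplex) also arbitrate shared disks and use quorum protocols:

```python
class Pack:
    """Toy active-passive pack: one node serves the partition,
    the rest are hot standbys ready to take it over."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.active = self.nodes[0]      # the node currently serving the partition

    def fail(self, node):
        # Fail-over takes seconds because only ownership moves, not the data:
        # the partition's storage is reachable from every node in the pack.
        self.nodes.remove(node)
        if node == self.active:
            self.active = self.nodes[0]  # a standby takes over
        return self.active
```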
14 Parts+Packs Requirements
Automatic partitioning (in DBMS, mail, files, ...):
  Location transparent
  Partition split/merge
  Grow without limits (100 x 10 TB)
Application-centric request routing.
Simple fail-over model:
  Partition migration is transparent
  MSCS-like model for services
Management:
  Automatic partition management (split/merge)
  Who is up?
  Application monitoring
15 GeoPlex: Farm Pairs
Two farms (or more).
State (your mailbox, bank account) stored at both farms.
Changes from one are sent to the other.
When one farm fails, the other provides service.
Masks:
  Hardware/software faults
  Operations tasks (reorganize, upgrade, move)
  Environmental faults (power failure, earthquake, fire)
16 Directory, Fail-Over, Load Balancing
Routes the request to the right farm (a farm can be a clone or a partition).
At the farm, routes the request to the right service.
At the service, routes the request to:
  any clone, or
  the correct partition.
Routes around failures.
17 Availability: the five 9's (99.999%)
Well-managed nodes mask some hardware failures.
Well-managed packs & clones mask hardware failures, operations tasks (e.g. software upgrades), and some software failures.
A well-managed GeoPlex masks site failures (power, network, fire, move, ...) and some operations failures.
18 Cluster Scale-Out Scenarios
The FARM: clones and packs of partitions.
Packed partitions give database transparency.
(Diagram: Web clients → load balancer → cloned front ends (firewall, sprayer, web server) with SQL temp state → SQL database packed as SQL Partitions 1, 2, 3 → Web File Stores A and B on cloned, packed file servers, kept in sync by replication.)
19 Some Examples
TerraServer:
  6 IIS clone front ends (WLBS)
  3-partition 4-pack backend: 3 active, 1 passive
  Partition by theme and geography (longitude)
  1/3 sysadmin
Hotmail:
  1000 IIS clones for HTTP login
  3400 IIS clones for the HTTP front door
  Clones for ad rotator, in/outbound mail, ...
  115-partition backend (partitioned by mailbox)
  Cisco LocalDirector for load balancing
  50 sysadmins
Google (Inktomi is similar but smaller):
  700 clone spiders
  300 clone indexers
  5-node GeoPlex (full replica)
  1,000 clones/farm do search
  100 clones/farm for HTTP
  10 sysadmins
See "Challenges to Building Scalable Services: A Survey of Microsoft's Internet Services", Steven Levi and Galen Hunt.
20 Acronyms
RACS: Reliable Arrays of Cloned Servers.
RAPS: Reliable Arrays of Partitioned and Packed Servers (the first P is silent).
21 Emissaries and Fiefdoms
Emissaries are (nearly) stateless; emissaries are easy to clone.
Fiefdoms are stateful; fiefdoms get partitioned.
22 Summary
Terminology for scalability. Farms of servers:
  Clones: identical; scalability + availability
  Partitions: scalability
  Packs: partition availability via fail-over
  GeoPlex: for disaster tolerance
References:
  "Architectural Blueprint for Large eSites", Bill Laing
  "Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS", Bill Devlin, Jim Gray, Bill Laing, George Spix, MS-TR-99-85, ftp://ftp.research.microsoft.com/pub/tr/tr doc
23 Three Talks
Scalability Terminology: Gray (with help from Devlin, Laing, Spix)
What Windows is doing re this: Laing
The M$ PetaByte (as time allows): Gray
24 What Windows is Doing
Continued architecture and analysis work.
AppCenter, BizTalk, SQL, SQL Service Broker, ISA, ... are all key to clones/partitions.
Exchange is an archetype: front ends, directory, partitioned, packs, transparent mobility.
NLB (clones) and MSCS (packs).
High-performance technical computing.
Appliances and hardware trends.
Management of these kinds of systems.
Still need good ideas on ...
25 Architecture and Design Work
Produced an architectural blueprint for large eSites, published on MSDN.
Creating and testing instances of the architecture:
  Team led by Per Vonge Neilsen
  Actually building and testing examples of the architecture with partners (sometimes known as MICE)
Built a scalability "Megalab" run by Robert Barnes: 1000-node cyber wall, Compaq DL360s, 7000 disks.
27 Clones and Packs, aka Clustering
Integrated the NLB and MSCS teams; both focused on scalability and availability:
  NLB for clones
  MSCS for partitions/packs
Vision: a single communications and group-membership infrastructure and a set of management tools for clones, partitions, and packs.
Unify management for clones/partitions at BOTH the OS and app level (e.g. IIS, BizTalk, AppCenter, Yukon, Exchange, ...).
28 Clustering in Whistler Server
Microsoft Cluster Server:
  Much improved setup and installation
  4-node support in Advanced Server
  Kerberos support for Virtual Servers
  Password change without restarting the cluster service
  8-node support in Datacenter
  SAN enhancements (device reset, not bus reset, for disk arbitration; shared disk and boot disk on the same bus)
  Quorum of nodes supported (no shared disk needed)
Network Load Balancer:
  New NLB manager
  Bi-directional affinity for ISA as a proxy/firewall
  Virtual cluster support (different port rules for each IP address)
  Dual NIC support
29 Geoclusters, aka Geographically Dispersed Packs
Essentially the nodes and storage are replicated at 2 sites; disks are remotely mirrored.
Being deployed today; helping vendors get certified; we still need better tools.
Working with EMC, Compaq, NSISoftware, StorageApps.
Log shipping (SQL) and extended VLANs (IIS) are also solutions.
30 High Performance Computing
Last year (CY2000):
  This work is part of the server scale-out efforts (BLaing)
  Web site and HPC Tech Preview CD late last year
  A W2000 "Beowulf" equivalent with 3rd-party tools
  Better than the competition:
    10-25% faster than Linux on SMPs (2, 4 & 8 ways)
    More reliable than SP2 (!)
    Better performance & integration with IBM peripherals (!)
  But it lacks an MPP debugger, tools, evangelism, reputation
  See ../windows2000/hpc and \\jcbach\public\cornell*
This year (CY2001):
  Partner with Cornell/MPI-Soft/+ on Unix-to-W2000 projects
  Evangelism of commercial HPC (starting with financial services)
  Showcase environment & apps (EBC support)
  First Itanium FP "play-offs"
  BIG tools integration / beta
  Dell & Compaq offer a web HPC buy-and-support experience (buy capacity by the slice)
  Beowulf-on-W2000 book by Tom Sterling (author of Beowulf on Linux)
  Gain on Sun in the list
  Address the win-by-default assumption for Linux in HPC
No vendor has succeeded in bringing MPP to non-sci/eng venues & $$$ ... we will.
31 Appliances and Hardware Trends
The appliances team under TomPh is focused on dramatically simplifying the user experience of installing these kinds of devices.
Working with OEMs to adopt Windows XP.
Ultradense servers are on the horizon:
  100s of servers per rack
  Manage the rack as one
InfiniBand and 10 Gbps Ethernet change things.
32 Operations and Management
Great research work done in MSR on this topic:
  The Megaservices paper by Levi and Hunt
  The follow-on BIG project developed the ideas of scale-invariant service descriptions, with automated monitoring and deployment of servers
Building on that work in the Windows Server group.
AppCenter is doing similar things at the app level.
33 Still Need Good Ideas on ...
Automatic partitioning.
Stateful load balancing.
Unified management of clones/partitions at both the app and OS level.
34 Three Talks
Scalability Terminology: Gray (with help from Devlin, Laing, Spix)
What Windows is doing re this: Laing
The M$ PetaByte (as time allows): Gray
35 We're Building Petabyte Stores
(Chart: the scale ladder kilo, mega, giga, tera, peta, exa, zetta, yotta, with reference points from a photo, a book, and a movie up through all LoC books (words), all books multimedia, and everything recorded. Small prefixes: 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico, 10^-15 femto, 10^-18 atto, 10^-21 zepto, 10^-24 yocto.)
Soon everything can be recorded and indexed.
Hotmail: 100 TB now. MSN: 100 TB now.
List price is 800 M$/PB (including FC switches & brains). Must GeoPlex it.
Can we get it for 1 M$/PB?
Personal 1 TB stores for 1 k$.
36 Building a Petabyte Store
EMC: ~500 k$/TB (= 500 M$/PB) plus FC switches plus ... M$/PB
TPC-C SANs (Dell 18GB/...): ... M$/PB
Dell local SCSI, 3ware: ... M$/PB
Do it yourself: ... M$/PB
A billion here, a billion there; soon you're talking about real money!
37 320 GB for 2 k$ (now): 6 M$/PB
4 x 80 GB IDE drives (2 hot-pluggable): 1,000$
SCSI-IDE bridge: 200$
Box (500 MHz CPU, 256 MB RAM, fan, power, Enet): 500$
Ethernet switch: 150$/port
Or 8 disks/box: 640 GB for ~3 k$ (or 300 GB RAID)
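The slide's ~6 M$/PB figure follows directly from the per-box numbers. A sketch of the arithmetic (assuming decimal units and one switch port per box):

```python
# Cost of a petabyte built from the slide's 320 GB, ~2 k$ boxes.
box_capacity_gb = 320                       # 4 x 80 GB IDE drives
box_cost = 1000 + 200 + 500 + 150           # drives + bridge + box + switch port
boxes_per_pb = 1_000_000 / box_capacity_gb  # decimal PB = 1,000,000 GB
cost_per_pb = boxes_per_pb * box_cost
print(round(cost_per_pb / 1e6, 1), "M$/PB")  # 5.8 M$/PB, i.e. ~6 M$/PB
```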
38 Hot-Swap Drives for Archive or Data Interchange
25 MBps write, so N x 80 GB can be written in 3 hours.
80 GB per overnight shipment = ~N x 2 MB/second, at 19.95$/night.
Compare to 1$/GB via the Internet.
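The "~2 MB/second" figure is just shipped capacity divided by shipping time; a sketch, assuming "overnight" means 12 hours (an assumption, not stated on the slide):

```python
# "Sneakernet" bandwidth of an overnight-shipped 80 GB drive.
bytes_shipped = 80e9
overnight_s = 12 * 3600
mbps = bytes_shipped / overnight_s / 1e6
print(round(mbps, 1), "MB/s per drive")  # 1.9 MB/s, ~2 MB/s as on the slide
print(round(19.95 / 80, 2), "$/GB")      # 0.25 $/GB vs 1 $/GB via the Internet
```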
39 A Storage Brick
2 x 80 GB disks, 500 MHz CPU (Intel/AMD/ARM), 256 MB RAM, 2 eNet RJ45, fan(s).
Current disk form factor, 30 watts, 600$ (?) per brick.
Per rack (48U, 3U/module, 16 units/U): 400 disks, 200 Whistler nodes, 32 TB, 100 billion instructions per second, 120 k$/rack, 4 M$/PB.
Per petabyte (33 racks): 4 M$, 3 TeraOps (6,600 nodes), 13k disk arms (1/2 TBps IO).
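The rack and petabyte figures above follow from the per-brick numbers; a sketch of the arithmetic:

```python
# The storage-brick rack arithmetic from the slide.
bricks_per_rack = 200            # 200 Whistler nodes per rack
disks_per_brick = 2              # 2 x 80 GB disks each
disk_gb = 80
brick_cost = 600                 # the slide's 600$ (?) per brick
rack_tb = bricks_per_rack * disks_per_brick * disk_gb / 1000
rack_cost = bricks_per_rack * brick_cost
racks_per_pb = 33                # the slide's figure (a bit over the bare 1000/32)
print(rack_tb, "TB/rack")                              # 32.0 TB/rack
print(rack_cost // 1000, "k$/rack")                    # 120 k$/rack
print(round(racks_per_pb * rack_cost / 1e6), "M$/PB")  # 4 M$/PB
```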
40 What Software Do the Bricks Run?
Each node has an OS.
Each node has local resources: a federation.
Each node does not completely trust the others.
Nodes use RPC to talk to each other: COM+, SOAP, BizTalk.
Huge leverage in high-level interfaces.
Same old distributed-system story.
(Diagram: on each node, applications over the CLR over RPC / streams / datagrams, connected by InfiniBand / Gbps Ethernet.)
41 Storage Rack in 2 Years?
300 arms, 50 TB (160 GB/arm), 24 racks, 48 storage processors (2x6+1 per rack).
Disks = 2.5 GBps IO; controllers = 1.2 GBps IO; ports = ... MBps IO.
My suggestion: move the processors into the storage racks.
42 Auto-Manage Storage
1980 rule of thumb: a DataAdmin per 10 GB, a SysAdmin per MIPS.
2000 rule of thumb: a DataAdmin per 5 TB, a SysAdmin per 100 clones (varies with app).
Problem: 5 TB is 60 k$ today, 10 k$ in a few years. Admin cost >> storage cost???
Challenge: automate ALL storage admin tasks.
43 It’s Hard to Archive a Petabyte
It takes a LONG time to restore it: at 1 GBps it takes 12 days!
Store it in two (or more) places online (on disk?): a GeoPlex.
Scrub it continuously (look for errors).
On failure: use the other copy until the failure is repaired; refresh the lost copy from the safe copy.
Can organize the two copies differently (e.g. one by time, one by space).
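The "12 days" figure is straightforward to check (decimal units assumed):

```python
# Why archiving a petabyte is hard: restore time at 1 GBps.
petabyte = 1e15          # bytes (decimal)
rate = 1e9               # 1 GBps
days = petabyte / rate / 86400
print(round(days, 1), "days")   # 11.6 days, i.e. ~12 days as on the slide
```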
44 Call to Action
Let's work together to make storage bricks:
  Low cost
  High function
  NAS (network-attached storage), not SAN (storage area network)
Ship NT8/CLR/IIS/SQL/Exchange/... with every disk drive.
45 Three Talks
Scalability Terminology: Gray (with help from Devlin, Laing, Spix)
What Windows is doing re this: Laing
The M$ PetaByte (as time allows): Gray
47 Storage Capacity Beating Moore's Law
3 k$/TB today (raw disk).
48 Disk vs Tape
Disk: 80 GB, 35 MBps, 5 ms seek time, 3 ms rotate latency; 3$/GB for the drive, 2$/GB for controllers/cabinet; 4 TB/rack; 1-hour scan.
Tape: 40 GB, 10 MBps, 10-second pick time, ...-second seek time; 2$/GB for media, 8$/GB for drive+library; 10 TB/rack; 1-week scan.
(Guesstimates. CERN: 200 TB, 3480 tapes; 2 col = 50 GB; rack = 1 TB = 12 drives.)
The price advantage of tape is gone, and the performance advantage of disk is growing.
At 10 k$/TB, disk is competitive with nearline tape.
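The scan-time contrast falls out of the bandwidth numbers; a sketch, under stated assumptions (every disk arm streams in parallel, so a disk rack scans in the time one drive takes; the tape rack is read through a single streaming drive):

```python
# Scan-time arithmetic behind "1 hour scan" vs "1 week scan".
disk_scan_h = (80e9 / 35e6) / 3600      # one 80 GB disk at 35 MBps
tape_scan_d = (10e12 / 10e6) / 86400    # a 10 TB tape rack at 10 MBps
print(round(disk_scan_h, 1), "hours")   # 0.6 hours: the "1 hour scan"
print(round(tape_scan_d, 1), "days")    # 11.6 days: the "1 week scan"
```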
50 Data on Disk Can Move to RAM in 8 Years
Today (3/5/2001):
  Disk: 300 GB per k$
  RAM: 1 GB per k$ (PC133 ECC SDRAM)
Price ratio: 300:1 today; 100:1 in 6 years.
51 5-Year Tech Trends
256-way NUMA?
Huge main memories: now 500 MB - 64 GB; then 10 GB - 1 TB.
Huge disks: now 5-50 GB 3.5" disks; then ... GB disks.
Petabyte storage farms (that you can't back up or restore).
Disks >> tapes.
"Small" disks: one platter, one inch, 10 GB.
SAN convergence: 1 GBps point-to-point is easy; 10 GBps SANs are ubiquitous.
1 GB RAM chips.
MAD at 50 Gbpsi (magnetic areal density, gigabits per square inch).
Drives shrink one quantum.
500 MIPS CPUs for 10$; 5 BIPS CPUs at the high end.
52 The Absurd? Consequences
Further segregating processing from storage gives:
  Poor locality
  Much useless data movement
Amdahl's laws: bus = 10 B/ips; IO = 1 b/ips.
(Diagram: ~1 Tips of processors with ~1 TB of RAM over a 10 TBps memory bus, 100 GBps of IO, and ~100 TB of disks.)
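The diagram's bandwidth figures are just Amdahl's balanced-system ratios applied to a ~1 Tips machine; a sketch:

```python
# Amdahl's balanced-system rules applied to the diagram's ~1 Tips machine.
ips = 1e12                  # ~1 Tips of aggregate processing
bus_bytes_per_s = 10 * ips  # Amdahl: 10 bytes of memory bandwidth per instruction/s
io_bits_per_s = 1 * ips     # Amdahl: 1 bit of IO per instruction/s
print(bus_bytes_per_s / 1e12, "TBps memory bus")  # 10.0 TBps, as in the diagram
print(io_bits_per_s / 8 / 1e9, "GBps of IO")      # 125.0 GBps, ~the diagram's 100 GBps
```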
53 Drives Shrink (1.8", 1")
150 kaps for 500 GB is VERY cold data.
3 GB/platter today, 30 GB/platter in 5 years.
Most disks are half full.
TPC benchmarks use 9 GB drives (they need arms or bandwidth).
One solution: a smaller form factor:
  More arms per GB
  More arms per rack
  More arms per watt
54 All Device Controllers Will Be Supercomputers
Today: the disk controller is a 10-MIPS RISC engine with 2 MB DRAM; a NIC is similar in power.
Soon: they will become 100-MIPS systems with 100 MB DRAM.
They are nodes in a federation (you can run Oracle on NT in the disk controller).
Advantages:
  Uniform programming model
  Great tools
  Security
  Economics (CyberBricks)
  Move computation to data (minimize traffic)
(Diagram: central processor & memory with a terabyte backplane.)
55 Crazy Disk Ideas
Disk farm on a card: surface-mount disks.
Disk (magnetic store) on a chip: micromachines in silicon.
Full apps (e.g. SAP, Exchange/Notes, ...) in the disk controller (a processor with 128 MB DRAM plus an ASIC).
"The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail", Clayton M. Christensen. ISBN:
56 Functionally Specialized Cards
Storage, network, and display cards: a P-MIPS processor, M MB of DRAM, and an ASIC.
Today: P = 50 MIPS, M = 2 MB.
In a few years: P = 200 MIPS, M = 64 MB.