Three Talks Scalability Terminology What Windows is doing re this

Three Talks
Scalability Terminology: Gray (with help from Devlin, Laing, Spix)
What Windows is doing re this: Laing
The M$ PetaByte (as time allows): Gray

Terminology for Scalability
Bill Devlin, Jim Gray, Bill Laing, George Spix
Paper at: ftp://ftp.research.microsoft.com/pub/tr/tr doc
Farms of servers:
  Clones: identical; scalability + availability
  Partitions: scalability
  Packs: partition availability via fail-over
GeoPlex for disaster tolerance.
[Diagram: a Farm contains Clones (Shared-Nothing or Shared-Disk) and Partitions grouped into Packs (Active-Active or Active-Passive); farms pair up as a GeoPlex]

Unpredictable Growth
The TerraServer story:
  Expected 5 M hits per day
  Got 50 M hits on day 1
  Peak at 20 M hpd on a "hot" day
  Average 5 M hpd over last 2 years
Most of us cannot predict demand
  Must be able to deal with NO demand
  Must be able to deal with HUGE demand

Web Services Requirements
Scalability: need to be able to add capacity
  New processing
  New storage
  New networking
Availability: need continuous service
  Online change of all components (hardware and software)
  Multiple service sites
  Multiple network providers
Agility: need great tools
  Manage the system
  Change the application several times per year.
  Add new services several times per year.

Premise: Each Site is a Farm
Buy computing by the slice (brick):
  Rack of servers + disks.
  Functionally specialized servers
Grow by adding slices
  Spread data and computation to new slices
Two styles:
  Clones: anonymous servers
  Parts+Packs: partitions fail over within a pack
In both cases, GeoPlex remote farm for disaster recovery

Scalable Systems
Scale UP: grow by adding components to a single system.
Scale OUT: grow by adding more systems.

Scale UP and Scale OUT
Everyone does both.
Choices:
  Size of a brick
  Clones or partitions
  Size of a pack
  Whose software?
Scale-up and scale-out both have a large software component.
Brick sizes:
  1 M$/slice: IBM S390? Sun E 10,000?
  100 K$/slice: Wintel 8x
  10 K$/slice: Wintel 4x
  1 K$/slice: Wintel 1x

Clones: Availability+Scalability
Some applications are:
  Read-mostly
  Low consistency requirements
  Modest storage requirement (less than 1 TB)
Examples:
  HTML web servers (IP sprayer/sieve + replication)
  LDAP servers (replication via gossip)
Replicate app at all nodes (clones)
Load balance:
  Spray & sieve: requests across nodes.
  Route: requests across nodes.
Grow: adding clones
Fault tolerance: stop sending to that clone.
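The clone behaviour on this slide (spray requests across identical nodes, grow by adding clones, tolerate faults by no longer sending to a failed clone) can be sketched in a few lines. This is only an illustration: the `ClonePool` class and its method names are hypothetical, and round-robin is assumed as the spraying policy since the slide does not name one.

```python
class ClonePool:
    """Round-robin 'sprayer' over interchangeable clones (hypothetical sketch)."""

    def __init__(self, clones):
        self.clones = list(clones)
        self.healthy = set(self.clones)
        self._next = 0

    def add_clone(self, clone):
        # Grow: a new clone simply joins the rotation.
        self.clones.append(clone)
        self.healthy.add(clone)

    def mark_down(self, clone):
        # Fault tolerance: stop sending requests to a failed clone.
        self.healthy.discard(clone)

    def route(self):
        # Spray: hand each request to the next healthy clone in turn.
        for _ in range(len(self.clones)):
            clone = self.clones[self._next % len(self.clones)]
            self._next += 1
            if clone in self.healthy:
                return clone
        raise RuntimeError("no healthy clones available")
```

Because clones hold no state that matters, any healthy clone can serve any request, which is why the router needs no knowledge of the request itself.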

Two Clone Geometries
Shared-Nothing: exact replicas
Shared-Disk (state stored in server)
[Diagram: Shared-Nothing Clones; Shared-Disk Clones]
If clones have any state: make it disposable.
Manage clones by reboot; failing that, replace.
One person can manage thousands of clones.

Clone Requirements
Automatic replication (if they have any state)
  Applications (and system software)
  Data
Automatic request routing
  Spray or sieve
Management:
  Who is up?
  Update management & propagation
  Application monitoring.
Clones are very easy to manage:
  Rule of thumb: 100's of clones per admin.

Partitions for Scalability
Clones are not appropriate for some apps.
  Stateful apps do not replicate well
  High update rates do not replicate well
Examples:
  Databases
  Read/write file server…
  Cache managers
  Chat
Partition state among servers
Partitioning:
  Must be transparent to client.
  Split & merge partitions online

Packs for Availability
Each partition may fail (independent of others)
Partitions migrate to new node via fail-over
  Fail-over in seconds
Pack: the nodes supporting a partition
  VMS Cluster, Tandem, SP2 HACMP,..
  IBM Sysplex™
  WinNT MSCS (wolfpack)
Partitions typically grow in packs.
ActiveActive: all nodes provide service
ActivePassive: hot standby is idle
Cluster-In-A-Box now commodity
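A minimal sketch of the active-passive case: a pack is the set of nodes that can host one partition, and when the active node fails the partition migrates to a surviving node. The `Pack` class, its method names, and the "first surviving node wins" policy are assumptions made for illustration; they are not how MSCS or any of the listed products decide.

```python
class Pack:
    """Nodes that can host one partition; one is active at a time (sketch)."""

    def __init__(self, partition, nodes):
        if not nodes:
            raise ValueError("a pack needs at least one node")
        self.partition = partition
        self.nodes = list(nodes)
        self.up = set(self.nodes)
        self.active = self.nodes[0]   # the others are hot standbys

    def node_failed(self, node):
        self.up.discard(node)
        if node == self.active:
            self._fail_over()

    def node_repaired(self, node):
        # A repaired node rejoins as a standby, or takes over if nobody is active.
        self.up.add(node)
        if self.active is None:
            self.active = node

    def _fail_over(self):
        # Migrate the partition to the first surviving node, if any.
        survivors = [n for n in self.nodes if n in self.up]
        self.active = survivors[0] if survivors else None
```

In an active-active pack every node would serve some partition all the time; the same fail-over step then moves only the failed node's partitions onto its neighbours.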

Partitions and Packs
Partitions: scalability
Packed partitions: scalability + availability

Parts+Packs Requirements
Automatic partitioning (in dbms, mail, files,…)
  Location transparent
  Partition split/merge
  Grow without limits (100x10TB)
  Application-centric request routing
Simple fail-over model
  Partition migration is transparent
  MSCS-like model for services
Management:
  Automatic partition management (split/merge)
  Who is up?
  Application monitoring.
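Location-transparent routing with online split and merge can be sketched as a small routing table. The sketch assumes range partitioning on a string key (for example a mailbox name); the `PartitionMap` class and its methods are hypothetical names, and moving the data itself during a split is left out.

```python
import bisect


class PartitionMap:
    """Range-partitioned routing table: key -> node (hypothetical sketch)."""

    def __init__(self, node):
        # Start with one partition covering every key.
        # bounds[i] is the lowest key owned by partition i.
        self.bounds = [""]
        self.nodes = [node]

    def lookup(self, key):
        # Clients only call lookup(); they never see partition boundaries.
        i = bisect.bisect_right(self.bounds, key) - 1
        return self.nodes[i]

    def split(self, at_key, new_node):
        # Keys >= at_key (up to the next boundary) move to new_node.
        if at_key in self.bounds:
            raise ValueError("a partition already starts at this key")
        i = bisect.bisect_right(self.bounds, at_key)
        self.bounds.insert(i, at_key)
        self.nodes.insert(i, new_node)

    def merge(self, at_key):
        # Fold the partition starting at at_key back into its left neighbour.
        i = self.bounds.index(at_key)
        if i == 0:
            raise ValueError("cannot merge the first partition")
        del self.bounds[i]
        del self.nodes[i]
```

Since callers only ever see `lookup`, a split or merge changes where a key lives without changing how a client asks for it, which is the transparency the slide requires.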

GeoPlex: Farm Pairs Two farms (or more)
State (your mailbox, bank account) stored at both farms
Changes from one sent to other
When one farm fails, other provides service
Masks:
  Hardware/software faults
  Operations tasks (reorganize, upgrade, move)
  Environmental faults (power fail, earthquake, fire)

Directory Fail-Over Load Balancing
Routes request to right farm
  Farm can be clone or partition
At farm, routes request to right service
At service, routes request to:
  Any clone
  Correct partition.
Routes around failures.

Availability
Well-managed nodes
  Masks some hardware failures
Well-managed packs & clones
  Masks hardware failures
  Masks operations tasks (e.g. software upgrades)
  Masks some software failures
Well-managed GeoPlex
  Masks site failures (power, network, fire, move,…)
  Masks some operations failures
[Figure: availability scale shown as a row of nines (9 9 9 9 9)]

Cluster Scale Out Scenarios
The FARM: Clones and Packs of Partitions
[Diagram: web clients reach load-balanced cloned front ends (firewall, sprayer, web server); behind them sit cloned packed file servers (Web File Store A and Web File Store B, with replication), SQL temp state, and packed partitions of the SQL database (SQL Partition 1, 2, 3) with database transparency]

Some Examples:
TerraServer:
  6 IIS clone front-ends (wlbs)
  3-partition 4-pack backend: 3 active, 1 passive
  Partition by theme and geography (longitude)
  1/3 sysadmin
Hotmail:
  1000 IIS clone HTTP login
  3400 IIS clone HTTP front door
  Clones for ad rotator, in/out bound…
  115 partition backend (partition by mailbox)
  Cisco local director for load balancing
  50 sysadmin
Google: (Inktomi is similar but smaller)
  700 clone spider
  300 clone indexer
  5-node geoplex (full replica)
  1,000 clones/farm do search
  100 clones/farm for http
  10 sysadmin
See Challenges to Building Scalable Services: A Survey of Microsoft's Internet Services, Steven Levi and Galen Hunt

Acronyms
RACS: Reliable Arrays of Cloned Servers
RAPS: Reliable Arrays of Partitioned and Packed Servers (the first P is silent).

Emissaries and Fiefdoms
Emissaries are stateless (nearly)
  Emissaries are easy to clone.
Fiefdoms are stateful
  Fiefdoms get partitioned.

Summary
Terminology for scalability
Farms of servers:
  Clones: identical; scalability + availability
  Partitions: scalability
  Packs: partition availability via fail-over
GeoPlex for disaster tolerance.
[Diagram: Farm; Clone (Shared-Nothing, Shared-Disk); Partition; Pack (Active-Active, Active-Passive); GeoPlex]
Architectural Blueprint for Large eSites, Bill Laing
Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS, Bill Devlin, Jim Gray, Bill Laing, George Spix, MS-TR-99-85, ftp://ftp.research.microsoft.com/pub/tr/tr doc

What Windows is Doing
Continued architecture and analysis work
AppCenter, BizTalk, SQL, SQL Service Broker, ISA,… all key to Clones/Partitions
Exchange is an archetype
  Front ends, directory, partitioned, packs, transparent mobility.
NLB (clones) and MSCS (packs)
High Performance Technical Computing
Appliances and hardware trends
Management of these kinds of systems
Still need good ideas on….

Architecture and Design work
Produced an architectural Blueprint for large eSites, published on MSDN
Creating and testing instances of the architecture
  Team led by Per Vonge Neilsen
  Actually building and testing examples of the architecture with partners (sometimes known as MICE)
Built a scalability "Megalab" run by Robert Barnes
  1000 node cyber wall, U Compaq DL360s, ways, 7000 disks

Clones and Packs aka Clustering
Integrated the NLB and MSCS teams
  Both focused on scalability and availability
  NLB for Clones
  MSCS for Partitions/Packs
Vision is a single communications and group membership infrastructure and a set of management tools for Clones, Partitions, and Packs
Unify management for clones/partitions at BOTH: OS and app level (e.g. IIS, BizTalk, AppCenter, Yukon, Exchange…)

Clustering in Whistler Server
Microsoft Cluster Server
  Much improved setup and installation
  4 node support in Advanced Server
  Kerberos support for Virtual Servers
  Password change without restarting cluster service
  8 node support in Datacenter
  SAN enhancements (device reset, not bus reset, for disk arbitration; shared disk and boot disk on same bus)
  Quorum of nodes supported (no shared disk needed)
Network Load Balancer
  New NLB manager
  Bi-directional affinity for ISA as a proxy/firewall
  Virtual cluster support (different port rules for each IP addr)
  Dual NIC support

Geoclusters AKA - Geographically dispersed (Packs)
Essentially the nodes and storage are replicated at 2 sites; disks are remotely mirrored
Being deployed today; helping vendors get them certified; we still need better tools
Working with EMC, Compaq, NSISoftware, StorageApps
Log shipping (SQL) and extended VLANs (IIS) are also solutions

High Performance Computing
Last year (CY2000):
  This work is a part of server scale-out efforts (BLaing)
  Web site and HPC Tech Preview CD late last year
  A W2000 "Beowulf" equivalent w/ 3rd-party tools
  Better than the competition:
    10-25% faster than Linux on SMPs (2, 4 & 8 ways)
    More reliable than SP2 (!)
    Better performance & integration w/ IBM periphs (!)
  But it lacks MPP debugger, tools, evangelism, reputation
  See ../windows2000/hpc
  Also \\jcbach\public\cornell*
This year (CY2001):
  Partner w/ Cornell/MPI-Soft/+
  Unix to W2000 projects
  Evangelism of commercial HPC (start w/ financial svcs)
  Showcase environment & apps (EBC support)
  First Itanium FP "play-offs"
  BIG tools integration / beta
  Dell & Compaq offer web HPC buy and support experience (buy capacity by-the-slice)
  Beowulf-on-W2000 book by Tom Sterling (author of Beowulf on Linux)
  Gain on Sun in the list
  Address the win-by-default assumption for Linux in HPC
No vendor has succeeded in bringing MPP to non-sci/eng venues & $$$… we will.

Appliances and Hardware Trends
The appliances team under TomPh is focused on dramatically simplifying the user experience of installing these kinds of devices
  Working with OEMs to adopt Windows XP
Ultradense servers are on the horizon
  100s of servers per rack
  Manage the rack as one
Infiniband and 10 Gbps Ethernet change things.

Operations and Management
Great research work done in MSR on this topic
  The Mega services paper by Levi and Hunt
  The follow-on BIG project developed the ideas of Scale Invariant Service Descriptions, with automated monitoring and deployment of servers.
Building on that work in Windows Server group
AppCenter doing similar things at app level

Still Need Good Ideas on…
Automatic partitioning
Stateful load balancing
Unified management of clones/partitions at both app and OS level

We're building Petabyte Stores
Soon everything can be recorded and indexed
Hotmail: 100 TB now
MSN: 100 TB now
List price is 800 M$/PB (including FC switches & brains)
Must GeoPlex it.
Can we get it for 1 M$/PB?
Personal 1 TB stores for 1 k$
[Figure: storage scale with prefixes Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, marking A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, and Everything Recorded; small-scale prefixes: 3 milli, 6 micro, 9 nano, 12 pico, 15 femto, 18 atto, 21 zepto, 24 yocto]

Building a Petabyte Store
EMC ~ 500k$/TB = M$/PB
  plus FC switches plus… M$/PB
TPC-C SANs (Dell 18GB/…) M$/PB
Dell local SCSI, 3ware M$/PB
Do it yourself: M$/PB
A billion here, a billion there, soon you're talking about real money!

320 GB, 2k$ (now): 6 M$/PB
  4 x 80 GB IDE (2 hot pluggable) (1,000$)
  SCSI-IDE bridge 200k$
  Box: 500 MHz cpu, 256 MB SRAM, fan, power, Enet (500$)
  Ethernet switch: 150$/port
Or 8 disks/box: 640 GB for ~3K$ (or 300 GB RAID)

Hot Swap Drives for Archive or Data Interchange
25 MBps write (so can write N x 80 GB in 3 hours)
80 GB/overnite = ~N x 2 MB/second @ 19.95$/nite
Compare to 1$/GB via Internet

A Storage Brick
2 x 80 GB disks
500 MHz cpu (Intel / AMD / ARM)
256 MB RAM
2 eNet RJ45
Fan(s)
Current disk form factor
30 watt
600$ (?)
Per rack (48U - 3U/module - 16 units/U):
  400 disks, 200 Whistler nodes
  32 TB
  100 Billion Instructions Per Second
  120 K$/rack, 4 M$/PB
Per petabyte (33 racks):
  4 M$
  3 TeraOps (6,600 nodes)
  13 k disk arms (1/2 TBps IO)
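The per-rack and per-petabyte figures on this slide follow from the per-brick numbers, as the arithmetic below shows. Two inputs are assumptions rather than slide content: one instruction per clock at 500 MHz (to turn MHz into mips), and a 35 MBps per-disk transfer rate borrowed from the Disk vs Tape slide. The 600$ brick price is the one the slide itself marks with "(?)".

```python
# Figures taken from the slide unless marked as an assumption.
DISK_GB = 80
DISKS_PER_NODE = 2
NODE_COST_USD = 600      # marked "(?)" on the slide
NODE_MIPS = 500          # assumption: about one instruction per clock at 500 MHz
NODES_PER_RACK = 200
RACKS_PER_PB = 33        # slide's figure; 1000 TB / 32 TB is about 31.25
DISK_MBPS = 35           # assumption: per-disk rate from the Disk vs Tape slide


def rack():
    """Capacity, cost, and compute of one rack of bricks."""
    disks = NODES_PER_RACK * DISKS_PER_NODE
    return {
        "disks": disks,
        "tb": disks * DISK_GB / 1000,
        "cost_usd": NODES_PER_RACK * NODE_COST_USD,
        "bips": NODES_PER_RACK * NODE_MIPS / 1000,
    }


def petabyte():
    """The same quantities scaled to the slide's 33-rack petabyte."""
    r = rack()
    disks = r["disks"] * RACKS_PER_PB
    return {
        "nodes": NODES_PER_RACK * RACKS_PER_PB,
        "disk_arms": disks,
        "cost_musd": r["cost_usd"] * RACKS_PER_PB / 1e6,
        "teraops": r["bips"] * RACKS_PER_PB / 1000,
        "io_tbps": disks * DISK_MBPS / 1e6,
    }


if __name__ == "__main__":
    print(rack())
    print(petabyte())
```

Under these assumptions the results land close to the slide's rounded values: about 4 M$, about 3 TeraOps, about 13 k disk arms, and roughly half a terabyte per second of aggregate disk bandwidth.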

What Software Do The Bricks Run?
Each node has an OS
Each node has local resources: a federation.
Each node does not completely trust the others.
Nodes use RPC to talk to each other
  COM+, SOAP, BizTalk
Huge leverage in high-level interfaces.
Same old distributed system story.
[Diagram: two nodes, each running Applications on the CLR, connected by RPC, streams, and datagrams over Infiniband / Gbps Ethernet]

Storage Rack in 2 years?
300 arms
50 TB (160 GB/arm)
24 racks
48 storage processors
2x6+1 in rack
Disks = 2.5 GBps IO
Controllers = 1.2 GBps IO
Ports MBps IO
My suggestion: move the processors into the storage racks.

Auto Manage Storage
1980 rule of thumb:
  A DataAdmin per 10 GB, SysAdmin per mips
2000 rule of thumb:
  A DataAdmin per 5 TB
  SysAdmin per 100 clones (varies with app).
Problem:
  5 TB is 60 k$ today, 10 k$ in a few years.
  Admin cost >> storage cost???
Challenge:
  Automate ALL storage admin tasks

It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
  At 1 GBps it takes 12 days!
Store it in two (or more) places online (on disk?).
  A geo-plex
Scrub it continuously (look for errors)
On failure:
  Use other copy until failure repaired,
  Refresh lost copy from safe copy.
Can organize the two copies differently (e.g.: one by time, one by space)
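The 12-day figure is plain division, shown here as a worked check. Decimal units are assumed (1 PB = 10^15 bytes, 1 GBps = 10^9 bytes per second); `transfer_days` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(num_bytes, bytes_per_second):
    """Days needed to move num_bytes at a sustained rate."""
    return num_bytes / bytes_per_second / SECONDS_PER_DAY


PETABYTE = 10**15   # decimal petabyte (assumption)
ONE_GBPS = 10**9    # 1 GBps = 10^9 bytes per second (assumption)

restore_days = transfer_days(PETABYTE, ONE_GBPS)

if __name__ == "__main__":
    print(f"{restore_days:.1f} days")
```

A petabyte at 1 GBps is a million seconds, a little under 12 days, and that assumes the full rate is sustained for the whole restore.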

Call To Action
Let's work together to make storage bricks
  Low cost
  High function
NAS (network attached storage), not SAN (storage area network)
Ship NT8/CLR/IIS/SQL/Exchange/… with every disk drive

Cheap Storage Disks are getting cheap:
3 k$/TB disks (12 x 80 GB disks at 250$ each)

Storage capacity beating Moore’s law
3 k$/TB today (raw disk)

Disk vs Tape
Disk:
  80 GB
  35 MBps
  5 ms seek time
  3 ms rotate latency
  3$/GB for drive
  2$/GB for ctlrs/cabinet
  4 TB/rack
  1 hour scan
Tape:
  40 GB
  10 MBps
  10 sec pick time
  second seek time
  2$/GB for media
  8$/GB for drive+library
  10 TB/rack
  1 week scan
Guesstimates:
  Cern: 200 TB
  3480 tapes
  2 col = 50GB
  Rack = 1 TB = 12 drives
The price advantage of tape is gone, and the performance advantage of disk is growing.
At 10K$/TB, disk is competitive with nearline tape.
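The scan times can be checked from capacity and transfer rate. The sketch below assumes every disk in a rack scans in parallel (each has its own arm), so a rack of disks takes as long as one disk, while a tape rack shares a small number of drives across all its cartridges. The slide does not say how many tape drives it assumes; the value of 2 used here is purely an assumption, and `scan_hours` is a hypothetical helper.

```python
def scan_hours(capacity_gb, mbps_per_device, devices=1):
    """Hours to read capacity_gb end to end with `devices` readers in parallel."""
    seconds = capacity_gb * 1000 / (mbps_per_device * devices)
    return seconds / 3600


# One 80 GB disk at 35 MBps; a full rack of disks scans in the same time.
disk_hours = scan_hours(80, 35)

# A 10 TB tape rack at 10 MBps per drive, assuming 2 drives share the rack.
tape_rack_days = scan_hours(10_000, 10, devices=2) / 24

if __name__ == "__main__":
    print(f"disk: {disk_hours:.2f} hours, tape rack: {tape_rack_days:.1f} days")
```

With these inputs the disk scan comes out under an hour and the tape rack at several days, the same orders of magnitude as the slide's "1 hour" and "1 week".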

What’s a Balanced System?
[Diagram labels: System Bus, PCI Bus]

Data on Disk Can Move to RAM in 8 years
Today: 3/5/2001
Disk: 300 GB per K$
RAM: 1 GB per K$ (pc133 ecc sdram)
[Figure labels: 100:1, 6 years]

5-year Tech Trends
256 way NUMA?
Huge main memories:
  now: 500 MB - 64 GB memories
  then: 10 GB - 1 TB memories
Huge disks:
  now: 5-50 GB 3.5" disks
  then: GB disks
Petabyte storage farms (that you can't back up or restore).
Disks >> tapes
  "Small" disks: one platter, one inch, 10 GB
SAN convergence
  1 GBps point to point is easy
1 GB RAM chips
MAD at 50 Gbpsi
Drives shrink one quantum
10 GBps SANs are ubiquitous
500 mips cpus for 10$
5 bips cpus at high end

The Absurd? Consequences
Further segregate processing from storage
Poor locality
Much useless data movement
Amdahl's laws: bus: 10 B/ips, io: 1 b/ips
[Diagram: Processors ~ 1 Tips; RAM Memory ~ 1 TB; Disks ~ 100 TB; links labeled 10 TBps and 100 GBps]
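The bandwidth labels in the diagram appear to follow from applying the two quoted ratios to a 1 Tips processor complex. The sketch below reads "10 B/ips" as 10 bytes of bus traffic per instruction per second and "1 b/ips" as 1 bit of I/O per instruction per second; that reading, and the function name, are assumptions.

```python
def amdahl_bandwidths(ips):
    """Return (bus, io) bandwidth in bytes/second for a given instruction rate."""
    bus_bytes_per_s = 10 * ips    # bus: 10 bytes per instruction/second
    io_bytes_per_s = ips / 8      # io: 1 bit per instruction/second
    return bus_bytes_per_s, io_bytes_per_s


bus, io = amdahl_bandwidths(10**12)   # ~1 Tips of aggregate processing

if __name__ == "__main__":
    print(f"bus: {bus / 1e12:.0f} TBps, io: {io / 1e9:.0f} GBps")
```

For 1 Tips this gives 10 TBps of bus bandwidth and 125 GBps of I/O, which the slide rounds to 100 GBps. Moving that much data between separate processor and storage complexes is the "absurd" consequence the title refers to.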

Drives shrink (1.8", 1")
150 kaps for 500 GB is VERY cold data
3 GB/platter today, 30 GB/platter in 5 years.
Most disks are ½ full
TPC benchmarks use 9 GB drives (need arms or bandwidth).
One solution: smaller form factor
  More arms per GB
  More arms per rack
  More arms per Watt

All Device Controllers will be Super-Computers
TODAY
  Disk controller is 10 mips RISC engine with 2 MB DRAM
  NIC is similar power
SOON
  Will become 100 mips systems with 100 MB DRAM.
  They are nodes in a federation (can run Oracle on NT in disk controller).
Advantages
  Uniform programming model
  Great tools
  Security
  Economics (cyberbricks)
  Move computation to data (minimize traffic)
[Diagram: Central Processor & Memory; Tera Byte Backplane]

Crazy Disk Ideas
Disk Farm on a card: surface mount disks
Disk (magnetic store) on a chip: (micro machines in silicon)
Full apps (e.g. SAP, Exchange/Notes,..) in the disk controller (a processor with 128 MB DRAM)
[Diagram label: ASIC]
The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail, Clayton M. Christensen. ISBN:

Functionally Specialized Cards
Storage, Network, Display
[Diagram: each card has an ASIC, a P mips processor, and M MB DRAM]
Today: P = 50 mips, M = 2 MB
In a few years: P = 200 mips, M = 64 MB
