Berkeley RAD Lab Technical Vision Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica RADS Retreat, June 2005.


1 Berkeley RAD Lab Technical Vision Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica RADS Retreat, June 2005

2 Outline Overall Vision Internet Services Vision (ServRADS) Network Vision (NetRADS) Internet Services Network architecture Principles and Summary

3 Overarching Mantra Enable a faster pace of network service innovation through new distributed system architectures that reduce operations cost by 2-3 orders of magnitude The Challenge: Software systems: Too much information => make sense of it through statistical learning & control theory Network systems: Too little information => exploit better observation and monitoring in the network infrastructure to drive management processes

4 In practice this means … Single person can write, deploy, operate the next-generation IT business (“the Fortune 1 million”)  Do for Internet apps what Web did for individual publishing  Gray’s challenge: planetary-scale distributed system operated by a single part-time operator Goal: programmers focus on functionality; put the *ility in the platform  Could be built on utility computing, giving access to distributed physical resources  Integrated approach to network and server/service management Requires 100x-1000x reduction in TCO from today’s levels

5 What things are like today World-scale services created and operated by expert teams  “Google-sized organization” to create a Google  Amazon’s book browsing, designed by programmers, is cumbersome  Browsing for housewares, designed by domain experts on mature infrastructure, more usable We don’t know what the next “killer app” will be!  NOW project didn’t predict Internet search as a “killer app” for NOWs If we succeed, the next killer Internet app will be written, deployed, operated, at Google-like scales, by a single programmer

6 Focusing on lowering cost of ownership Standard way to account for “where the money goes” in operating a deployed distributed application Definition independent of who is operating the app  Operators per byte of storage or per CPU? No, doesn’t scale with technology changes  Operators per end-user served? (This is the figure of merit for e-tailers)  Operators per geographic region served?  Operators per $ spent on capital cost?  Operators per $ of revenue?

7 Outline Overall Vision Internet Services Vision (ServRADS) Network Vision (NetRADS) Internet Services Network architecture Principles and Summary

8 Enabling Technologies for Reducing TCO in ServRADS Past successes  microrebooting: Fast recovery makes false positives tolerable  Pinpoint: using SLT to detect and localize fine-grain failures  visualization+SLT to help operators & earn their trust Elements of technical vision  SLT and machine learning  Operator-centric visualization  Control theory  “Open source” failures database (sanitized, open failures & forensics repository)

9 Example scenarios Helping operators make sense of instrumentation  Using ML techniques to localize failures (P. Bodik, E. Kiciman)  Using automatically-induced statistical models to identify likely causes of performance problems (S. Zhang, I. Cohen et al.)  Combining SLT with visualization for cross-checking problem reports and rapidly spotting potential problems visually Facilitating self-tuning/configuration  Using control theory to improve performance of a distributed streaming database (W. Xu)  Service placement in wide-area distributed system (D. Oppenheimer)  Microreboots (G. Candea) and microreplacement (S. Kawamoto) as low-cost prevention/repair strategies If false positive cost can be kept low, automate. Otherwise, help operator do her job.
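The closing rule on this slide (automate when false positives are cheap, otherwise assist the operator) can be read as an expected-cost comparison. The sketch below is one possible reading, not the authors' method; the function name `should_automate`, its parameters, and the cost figures in the usage note are all hypothetical.

```python
def should_automate(p_failure, cost_false_positive, cost_missed_failure):
    """Return True when acting automatically has lower expected cost than waiting.

    p_failure: detector's estimated probability that the alarm is a real failure.
    cost_false_positive: cost of recovering a healthy component (assumed unit: seconds).
    cost_missed_failure: cost of leaving a real failure alone until an operator acts.
    """
    if not 0.0 <= p_failure <= 1.0:
        raise ValueError("p_failure must be a probability")
    # Acting only hurts when the alarm was false.
    expected_cost_act = (1.0 - p_failure) * cost_false_positive
    # Waiting only hurts when the alarm was real.
    expected_cost_wait = p_failure * cost_missed_failure
    return expected_cost_act < expected_cost_wait
```

With a cheap recovery such as a microreboot (say 1 second of lost work), even a weak alarm with `p_failure=0.3` justifies acting automatically; with an expensive recovery (say 300 seconds), the same alarm would be handed to the operator instead.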

10 Services example: combining viz + SLT

11 Reduce TCO via Planetary-scale Abstractions Inspiration: narrowly-focused planetary-scale abstractions whose design & implementation...  scale well: understand distributed scheduling, locality, symptoms of wide-area failures  monitorable and controllable (using SLT & linear CT)  retain precisely-quantifiable and “acceptable” semantics under partial-failure conditions Examples of existing “narrow but powerful” services  MapReduce in Google understands data locality Can easily imagine a “lossy” MapReduce, like online aggregation  queues/messaging in Yahoo, Amazon, others  User information database in Yahoo  Instrumentation collection & analysis services using Telegraph-CQ
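To make the "lossy MapReduce, like online aggregation" remark concrete, here is a minimal sketch of the online-aggregation idea: report a running estimate with an error bound as partial results arrive, so a caller can stop early or tolerate missing partitions. It uses Welford's running-variance update and a normal-approximation interval; the function name `online_mean` and the `z` parameter are assumptions for illustration and are not taken from the talk or from MapReduce itself.

```python
import math


def online_mean(values, z=1.96):
    """Yield (count, running_mean, half_width) after each value.

    half_width is z times the standard error of the mean, so
    running_mean +/- half_width is an approximate confidence interval.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            sample_variance = m2 / (n - 1)
            half_width = z * math.sqrt(sample_variance / n)
        else:
            half_width = float("inf")  # no spread estimate from one value
        yield n, mean, half_width
```

A caller could consume the generator and stop as soon as `half_width` drops below a tolerance, which gives quantifiable semantics even when some inputs never arrive.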

12 Outline Overall Vision Internet Services Vision (ServRADS) Network Vision (NetRADS) Internet Services Network architecture Principles and Summary

13 RADS Network Problem Internet routing has proven to be robust But …  Poor visibility: hard to determine health of the network  Routing policy interactions defeat propagation of useful diagnostic info: difficult to identify root cause problems  Slow reaction times to connectivity failures; operator intervention (across admin domains) increases cost of ownership Key observation: network service failures attributed to unexpected traffic surge patterns  Approach: identify and protect “good” traffic during surge Mechanism deployed in network edge:  It’s where the servers and clients are located  Greatest need for lowering management costs  Administrative scope and responsibility is well-defined

14 iBoxes: New network element for Observe, Analyze, Act Enterprise Network Architecture Inspection-and-Action Boxes: Deep multiprotocol packet inspection No routing; observation & marking Policing points: drop, fence, block

15 Network-Level Observe-Analyze-Act Observe  Packet, path, protocol, service invocation statistical collection and sampling: frequencies, latencies, completion rates  Construct the collection infrastructure Analyze  Determine correlations among observations  “Normal” model discovery + anomaly detection  Exploit SLT Act  Experiment to test correlations  Prioritize and throttle  Mark and annotate  Control theory? Distributed analyses and actions
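The Analyze step ("normal" model discovery plus anomaly detection) could be sketched, under strong simplifying assumptions, as a baseline mean and standard deviation with a z-score threshold. The talk does not say which SLT technique is used; the function name `find_anomalies`, the threshold of 3, and the single-metric model are illustrative choices only.

```python
from statistics import mean, stdev


def find_anomalies(baseline, observed, threshold=3.0):
    """Return indices of observed samples that deviate from the baseline model.

    baseline: samples collected during normal operation (at least two values).
    observed: new samples to check, e.g. DNS requests per second.
    threshold: how many baseline standard deviations count as anomalous.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A constant baseline: anything different is flagged.
        return [i for i, x in enumerate(observed) if x != mu]
    return [i for i, x in enumerate(observed) if abs(x - mu) / sigma > threshold]
```

In practice the "normal" model would likely be multivariate and time-dependent; this sketch only shows the observe, model, flag shape of the loop.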

16 Network Layer Mechanism: Annotations Enhance network visibility: disseminate observations, communicate actions, provide in-band network management actions, iBox-to-iBox communications iBoxes label packets at annotation layer but do not rewrite packet contents Annotations stack, must be removed from packets before delivery to A-layer unaware end nodes [Figure: protocol stack: Phy, Link, Network, Annotation, Transport, Session, Presentation, Application]
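The annotation behavior described on this slide (labels stack, payload is never rewritten, labels are stripped before delivery to annotation-unaware end nodes) can be sketched as a small data structure. The `Packet` class, the function names, and the label strings are hypothetical; the slide does not define a wire format.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    # Stack of (ibox_id, label) pairs; the last element is the top of the stack.
    annotations: list = field(default_factory=list)


def annotate(packet, ibox_id, label):
    """Push a label onto the packet's annotation stack without touching the payload."""
    packet.annotations.append((ibox_id, label))
    return packet


def strip_for_delivery(packet, receiver_is_annotation_aware):
    """Remove all annotations before handing the packet to an unaware end node."""
    if not receiver_is_annotation_aware:
        packet.annotations.clear()
    return packet
```

For example, an iBox at the server edge might push a "suspect-surge" label and a downstream iBox might push "throttle" on top of it; the last iBox before an unaware host would call `strip_for_delivery(packet, False)`.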

17 Scenario: Traffic Surge Inhibiting Network Services [Figure: network diagram; labels: Distribution Tier, Internet Edge, Access Edge, Server Edge, Spam Appliance, Primary & Secondary DNS Servers, Mail Server] DNS Server swamped by excessive request traffic  Observe: DNS timeouts, Web access traffic slowed, but also higher than normal mail delivery latency implying busy server edge (correlation between Mail Server and DNS Server utilization?)  Root Cause: High DNS request rates generated by Spam Appliance triggered by mail surge
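The slide asks whether Mail Server and DNS Server utilization are correlated. One way to check, sketched here under assumptions, is Pearson's correlation coefficient over paired utilization samples taken at the same instants. The function name `pearson` and any sample values are illustrative; the talk does not specify the statistic used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near 1 for mail and DNS utilization would support, but not prove, the hypothesis that the mail surge drives the DNS load; the experiment step on a later slide is what tests causation.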

18 Scenario How Diagnosed?  I-S detects high link utilization but abnormally high DNS traffic  Stats from I-I: high mail traffic, low outgoing web traffic, in traffic high but link utilization not high  Stats from I-A: lower web traffic, no unusual mail origination  Problem localized to Server edge, but visibility limited: RADS can help [Figure: network diagram; labels: Distribution Tier, Internet Edge, Access Edge, Server Edge, Spam Appliance, Primary & Secondary DNS Servers, Mail Server]

19 Scenario Possible Action Responses  Experiment: Redirect local DNS requests to Secondary DNS server: if these complete, can infer the server is the problem, not the network  Throttle: Due to MS-DNS correlation, block/slow email traffic at Server Edge: should expect reduced DNS server utilization [Figure: network diagram; labels: Distribution Tier, Internet Edge, Access Edge, Server Edge, Spam Appliance, Primary & Secondary DNS Servers, Mail Server]
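The slide says only "block/slow email traffic" and does not name a mechanism. A token bucket is one common way to slow a traffic class, so the sketch below uses it as a stand-in; the class name `TokenBucket`, its parameters, and the explicit `now` argument (used to keep the example deterministic) are all assumptions.

```python
class TokenBucket:
    """Admit at most `rate_per_sec` messages per second, with bursts up to `burst`."""

    def __init__(self, rate_per_sec, burst):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous call, in seconds

    def allow(self, now):
        """Return True if a message arriving at time `now` may pass."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

An iBox at the Server Edge could apply such a limiter to mail traffic only and then watch whether DNS server utilization falls, which is the expected outcome stated on the slide.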

20 Outline Overall Vision Internet Services Vision (ServRADS) Network Vision (NetRADS) Internet Services Network architecture Principles and Summary

21 Embodying principles in a prototype Platform architecture and prototype to enable rapid innovation in network services by non-experts  automatically accommodates scaling, provisioning, failure management  multi-datacenter (geoplexed)  observable networks connecting datacenters  potentially planetary scale  runs with minimal operator oversight Prototype keeps various research projects focused on common goal and allows ongoing testing Participation in standards processes to promote “best practices” in platform as open standards

22 Reliable Adaptive Distributed Systems [Figure: architecture diagram; labels: Client, Server, Distributed Middleware, SLT Services, Internet, IP Network, Router, Edge Network, iBox, Operator, User, Commodity Internet, Application-Specific Overlay Network, Prototype Applications, Programming Abstractions For Roll-back and wide-area distributed computations, Crash-only services + Observation Infrastructure for System SLT, Checkable Protocols, Fast Detection & Route Recovery, Observation Infrastructure for network SLT]

23 Generic iBox Architecture [Figure: block diagram; labels: Input Ports, Output Ports, Interconnection Fabric, Buffers, “Tag” Mem, Classification Processor (CP), Action Processor (AP), Rules & Programs]
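The split between a Classification Processor and an Action Processor driven by "Rules & Programs" could work roughly as follows: classification assigns a tag to each packet, and a separate table maps tags to actions. This is a sketch of that division of labor only; the rule format, the tag names, the action names, and the dictionary-based packet are all hypothetical.

```python
def classify(packet, rules):
    """Classification step: return the tag of the first rule whose predicate matches."""
    for predicate, tag in rules:
        if predicate(packet):
            return tag
    return "default"


# Hypothetical tag-to-action table; the slides mention drop, fence, block, mark, throttle.
ACTIONS = {
    "default": "forward",
    "dns-surge": "throttle",
    "blocked-source": "drop",
}


def process(packet, rules, actions=ACTIONS):
    """Action step: look up what to do with the packet's tag (forward if unknown)."""
    tag = classify(packet, rules)
    return tag, actions.get(tag, "forward")
```

A usage example with two illustrative rules:

```python
rules = [
    (lambda p: p["src"] == "10.0.0.66", "blocked-source"),
    (lambda p: p["proto"] == "dns" and p["rate"] > 1000, "dns-surge"),
]
print(process({"src": "10.0.0.5", "proto": "dns", "rate": 5000}, rules))
```

Keeping the tag table separate from the rules means policy (what to do) can change without touching classification (what it is).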

24 Possible architecture of a rack [Figure: dataflow diagram; labels: Datacenter boundary; From other datacenters; To other datacenters; Visualization; T-CQ engine; Preprocessed data; Sanitized data; SLT algo.; app. server & application, e.g. J2EE; High-level sensor data; High-level effectors; Control loops; Externally-induced failures, workload changes, etc.; Microrecovery actions; Syndrome identification]

25 Outline Overall Vision Internet Services Vision (ServRADS) Network Vision (NetRADS) Internet Services Network architecture Principles and Summary

26 ServRADS: Observations & Summary SLT algorithms make sense of large amounts of data  Classification, outlier/anomaly detection, clustering, etc. Viz helps operator use “visual pattern recognition” to quickly spot problems and cross-check SLT models  Enables operator expertise to be quickly brought to bear  Builds operators’ trust in statistical/machine learning models Challenge  Fundamental challenges associated with applying SLT to problem determination (coming up next session)  Unifying many techniques into a coherent approach - prototype platform as unifying artifact Idea: capture best practices in TCO-optimized, planetary-scale abstractions

27 NetRADS: Observations & Summary COPS: Paradigm for (more) automatically protecting critical resources when network is under stress  Checkable protocols: visible semantics  Observe network behavior: good (easy), bad (hard), suspicious  Protect services: throttle, redirect Network management major contributor to TCO NetRADS built on:  iBoxes: pervasive infrastructure for observation and action at the network level  Annotation Layer: for marking, control, inter-iBox communications  Integration with Internet service approach for service/server-level visibility and integrated management

