
1 Slide 1 ISTORE: A Platform for Scalable, Available, Maintainable Storage-Intensive Applications Aaron Brown, David Oppenheimer, Jim Beck, Rich Martin, Randi Thomas, David Patterson, and Kathy Yelick Computer Science Division University of California, Berkeley http://iram.cs.berkeley.edu/istore/

2 Slide 2 ISTORE Philosophy: SAM The ISTORE project is researching techniques for bringing scalability, availability, and maintainability (SAM) to large server systems ISTORE vision: a self-testing HW/SW platform that automatically reacts to situations requiring an administrative response –brings self-maintenance to applications and storage ISTORE target: high-end servers for data-intensive infrastructure services –single-purpose systems managing large amounts of data for large numbers of active network users –e.g. TBs of data, 10,000s of requests/sec, millions of users

3 Slide 3 Motivation: Service Demands Emergence of a true information infrastructure –today: e-commerce, online database services, online backup, search engines, and web servers –tomorrow: more of above (with ever-growing datasets), plus thin-client/PDA infrastructure support –these services have different needs than traditional fault-tolerant services (ATMs, telephone switches,...) »rapid software evolution »unpredictable, wildly fluctuating demand and user base »often must incorporate low-cost, off-the-shelf HW and SW components

4 Slide 4 Service Demands (2) Infrastructure users expect “always-on” service and constant quality of service –infrastructure must provide scalable fault-tolerance and performance-tolerance »to a rapidly growing and evolving application base –failures and slowdowns have major business impact »e.g., recent eBay, E*Trade, and Schwab outages

5 Slide 5 The Need for 24x7 Availability Today’s widely deployed systems can’t provide 24x7 fault- and performance-tolerance –they rely on manual administration »static data and application partitioning »human detection of and response to most anomalous behaviors and changes in system environment –human administrators are too expensive, too slow, too prone to mistakes »Jim Gray reports 42% of Tandem failures due to administrator error (in 1985) Tomorrow’s ever-growing infrastructure systems need to be self-maintaining –self-maintaining systems anticipate problems and handle them as they arise, automatically

6 Slide 6 Self-Maintaining Systems Self-maintaining systems require: –a robust platform that provides online self-testing of its hardware and software –easy incremental scalability when existing resources stop providing desired quality of service –rapid detection of anomalous behavior and changes in system environment »failures, load spikes, changing access patterns,... –fast and flexible reaction to detected conditions –flexible specification of conditions that trigger adaptation Systems deployed on the ISTORE platform will be self-maintaining

7 Slide 7 Target Application Model Scalable applications for data storage and access –e.g., bottom (data) tier of three-tier systems Desired properties: –ability to manage replicated/distributed state »including distribution of workload across replicas –ability to create and destroy replicas on the fly –persistence model that can tolerate node failure without loss of data »logging of writes, soft-state, etc. –ability to migrate service between nodes »e.g., checkpoint and restore, or kill and restart –built-in application self-testing

8 Slide 8 Target Application Model (2) What existing application architectures come close to fitting this model? –parallel shared-nothing DBMSs »IBM DB2, Teradata, Tandem SQL/MX –distributed server applications »Lotus Notes/Domino »traditional distributed filesystems/fileservers –cluster-aware applications (with small mods?) »LARD cluster web server (Rice) »Microsoft Cluster Server Phase 2 (?) What doesn’t fit? –simple 2-node “hot standby” failover clusters »Microsoft Cluster Server Phase 1

9 Slide 9 The ISTORE Approach Divides self-maintenance into two components: 1) reactive self-maintenance: dynamic reaction to exceptional system events »self-diagnosing, self-monitoring hardware »software monitoring and problem detection »automatic reaction to detected problems 2) proactive self-maintenance: continuous online self-testing and self-analysis »automatic characterization of system components »in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they’re used

10 Slide 10 Reactive Self-Maintenance ISTORE defines a layered system model for monitoring and reaction: [diagram: self-monitoring hardware → SW monitoring → problem detection → coordination of reaction → reaction mechanisms; the first four layers are provided by the ISTORE runtime system and driven by policies, while the reaction mechanisms are provided by the application] The ISTORE API defines the interface between the runtime system and application reaction mechanisms; policies define the system’s monitoring, detection, and reaction behavior

11 Slide 11 Hardware architecture: plug-and-play intelligent devices with integrated self-monitoring, diagnostics, and fault injection hardware –intelligence used to collect and filter monitoring data –diagnostics and fault injection enhance robustness –networked to create a scalable shared-nothing cluster [diagram: each Intelligent Disk “Brick” pairs a disk with a CPU, memory, a diagnostic processor, and redundant NICs; x64 bricks plug into an Intelligent Chassis providing scalable redundant switching, power, and environmental monitoring] Self-monitoring hardware

12 Slide 12 ISTORE-II Hardware Vision System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk Target for +5-7 years: 1999 IBM MicroDrive: –1.7” x 1.4” x 0.2” (43 mm x 36 mm x 5 mm) –340 MB, 5400 RPM, 5 MB/s, 15 ms seek 2006 MicroDrive? –9 GB, 50 MB/s (1.6X/yr capacity, 1.4X/yr BW)
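
A quick arithmetic check of the 2006 projection, as a sketch only: it assumes the stated per-year growth rates compound over the seven years from 1999 to 2006 and is not data from the slide.

    # Extrapolate the 1999 IBM MicroDrive using the slide's growth rates.
    capacity_1999_mb = 340        # MB
    bandwidth_1999 = 5            # MB/s
    years = 2006 - 1999           # 7 years

    capacity_2006_gb = capacity_1999_mb * 1.6 ** years / 1000   # 1.6X/yr capacity
    bandwidth_2006 = bandwidth_1999 * 1.4 ** years               # 1.4X/yr bandwidth

    print(f"2006 capacity  ~ {capacity_2006_gb:.1f} GB")    # ~9.1 GB (target: 9 GB)
    print(f"2006 bandwidth ~ {bandwidth_2006:.0f} MB/s")    # ~53 MB/s (target: 50 MB/s)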

13 Slide 13 2006 ISTORE ISTORE node –Add 20% pad to MicroDrive size for packaging, connectors –Then double thickness to add IRAM –2.0” x 1.7” x 0.5” (51 mm x 43 mm x 13 mm) Crossbar switches growing by Moore’s Law –2x/1.5 yrs → 4X transistors/3 yrs –Crossbars grow by N² → 2X switch/3 yrs –16 x 16 in 1999 → 64 x 64 in 2005 ISTORE rack (19” x 33” x 84”) (480 mm x 840 mm x 2130 mm) –1 tray (3” high) → 16 x 32 → 512 ISTORE nodes –20 trays+switches+UPS → 10,240 ISTORE nodes(!)
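
The node counts and crossbar growth on this slide follow from simple arithmetic; the sketch below just re-derives them (no ISTORE code is implied).

    # Nodes per tray and per rack.
    nodes_per_tray = 16 * 32          # 16 x 32 grid of 2.0" x 1.7" nodes in a 3" tray
    print(nodes_per_tray)             # 512
    print(20 * nodes_per_tray)        # 10,240 nodes in a 20-tray rack

    # Crossbar ports: 4X transistors every 3 years, and an N x N crossbar needs
    # roughly N^2 switch points, so the port count N doubles every 3 years.
    ports, year = 16, 1999
    while year < 2005:
        ports, year = ports * 2, year + 3
    print(year, ports)                # 2005, 64  (i.e., a 64 x 64 crossbar)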

14 Slide 14 Each node includes extra diagnostic support –diagnostic processor: independent hardware running monitoring and control software »monitors hardware and environmental state not normally visible to system software »control: reboot/power-cycle the main CPU, inject simulated faults (power, bus transients, memory errors, network interface failure,...) –separate “diagnostic network” connects the diagnostic processors of each brick »provides an independent network path to the diagnostic CPU that works even when the brick CPU is powered off or has failed Self-monitoring hardware

15 Slide 15 Software collects and filters monitoring data –hardware monitors device “health”, environmental conditions, and indicators that software is working »some information processed locally to provide fail-fast behavior when higher-level software is deemed potentially untrustworthy »most information passed on to software monitoring –software monitoring layer also collects higher-level performance data, access patterns, app. heartbeats SW monitoring

16 Slide 16 The data is collected in a virtual “database” –desired monitoring data is selected and aggregated by specifying “views” over the database »database schema + views hide differences in monitoring implementation on heterogeneous HW and SW Running example –If ambient temperature of a shelf is rising significantly faster than that of other shelves, »reduce power consumption on those nodes, then »if necessary, migrate non-redundant data replicas off some nodes on that shelf and shut them down –view: for each shelf, average temperature across all temperature sensors on that shelf SW monitoring
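
To make the "view" idea concrete, here is a minimal sketch of the running example's per-shelf temperature view in Python. The function name and record layout are assumptions for illustration; ISTORE's actual view mechanism is the database-style specification described above.

    from collections import defaultdict

    def shelf_avg_temperature(readings):
        """View from the running example: for each shelf, the average over all
        temperature sensors on that shelf.

        `readings` is assumed to be an iterable of (shelf_id, sensor_id, temp)
        tuples drawn from the monitoring database; the common schema hides
        differences between heterogeneous hardware and software monitors."""
        totals = defaultdict(lambda: [0.0, 0])
        for shelf_id, _sensor_id, temp in readings:
            totals[shelf_id][0] += temp
            totals[shelf_id][1] += 1
        return {shelf: s / n for shelf, (s, n) in totals.items()}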

17 Slide 17 Conditions requiring administrative response are detected by observing values and/or patterns in the monitoring data –triggers specify these patterns and invoke appropriate adaptation algorithms »input to a trigger is a view of the monitoring data »views and triggers can be specified separately to allow easy selection of the desired reaction algorithm and easy redefinition of the conditions that invoke a particular reaction Running example –trigger: change in temperature of one shelf > 0 and more than twice the change in temperature of any other shelf, averaged over a one-minute period Problem detection
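
The running example's trigger could be expressed as a predicate over two one-minute-apart evaluations of the shelf-temperature view. The sketch below is an illustrative assumption about how such a trigger might look, not ISTORE's actual trigger language.

    def overheating_shelf_trigger(view_prev, view_now):
        """Fire when one shelf's one-minute temperature change is positive and
        more than twice the change of every other shelf.

        view_prev, view_now: {shelf_id: avg_temp} snapshots taken one minute apart.
        Returns the offending shelf, or None if the trigger does not fire."""
        deltas = {shelf: view_now[shelf] - view_prev[shelf] for shelf in view_now}
        for shelf, delta in deltas.items():
            others = [d for s, d in deltas.items() if s != shelf]
            if delta > 0 and others and delta > 2 * max(others):
                return shelf
        return None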

18 Slide 18 Adaptation algorithms coordinate application-level reaction mechanisms –adaptation algorithms define a sequence of operations that address the anomaly detected by the associated trigger –adaptation algorithms call application-implemented mechanisms via a standard API »but are independent of application mechanism details Running example: coordination of reaction 1) identify nodes with non-redundant data 2) invoke application mechanism to migrate that data off n of those nodes 3) reduce power consumption by those n nodes 4) install trigger to monitor temperature change and shut down nodes if power reduction is ineffective Coordination of reaction
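
Sketched in Python, the running example's adaptation algorithm might look roughly like the following. The `api` object and all of its method names are hypothetical stand-ins for the ISTORE API and the application's reaction mechanisms.

    def cool_down_shelf(api, shelf, n):
        """Hypothetical adaptation algorithm for the overheating-shelf trigger."""
        # 1) identify nodes on the shelf that hold non-redundant data
        at_risk = [node for node in api.nodes_on_shelf(shelf)
                   if api.has_non_redundant_data(node)]
        victims = at_risk[:n]

        # 2) invoke the application's mechanism (via the ISTORE API) to migrate
        #    that data off the chosen nodes
        for node in victims:
            api.app.migrate_data(src=node)

        # 3) reduce power consumption on those nodes
        for node in victims:
            api.set_power_state(node, "low")

        # 4) install a follow-up trigger: shut the nodes down if the shelf's
        #    temperature keeps rising despite the power reduction
        api.install_trigger(view="shelf_avg_temperature",
                            condition=lambda deltas: deltas.get(shelf, 0) > 0,
                            action=lambda: api.shutdown_nodes(victims))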

19 Slide 19 ISTORE expects reaction mechanisms to be implemented by the application –these reaction mechanisms are application-specific »e.g., moving data requires knowledge of data semantics, consistency policies,... –a research goal of ISTORE is to provide a standard API to these mechanisms »initially, try to leverage and extend existing mechanisms to avoid wholesale rewriting of applications; many data-intensive applications already support functionality similar to the needed mechanisms »eventually, generalize and extend API to encompass mechanisms and needs of future applications Reaction mechanisms

20 Slide 20 Programmer or administrator specifies policies to control the system’s adaptive behavior –the policy compiler turns a high-level declarative specification of desired behavior into the appropriate: »adaptation algorithms (that invoke application mechanisms through the ISTORE API) »triggers (to invoke the adaptation algorithms when the appropriate conditions are detected) »views (that enable monitoring needed by the triggers) Running example –policy: if ambient temperature of a shelf is rising significantly faster than that of other shelves, reduce power and prepare to shut down nodes Policies
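
As a rough illustration of what a "high-level declarative specification" might contain, the sketch below writes the running example's policy as a plain data structure; the field names and this policy "language" are assumptions, and the comments note what the policy compiler would generate from each part.

    overheating_policy = {
        "name": "shelf-overheating",
        # compiled into a view: average temperature per shelf
        "observe": "avg temperature across all sensors, grouped by shelf",
        # compiled into a trigger over that view
        "when": "one shelf warms > 0 and > 2x faster than any other shelf, "
                "averaged over one minute",
        # compiled into an adaptation algorithm that calls application
        # mechanisms through the ISTORE API
        "react": ["migrate non-redundant data off affected nodes",
                  "reduce power on those nodes",
                  "shut nodes down if the temperature keeps rising"],
    }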

21 Slide 21 Summary: Layered System Model The layered system model for monitoring and reaction provides reactive self-maintenance [diagram, repeated from Slide 10: self-monitoring hardware → SW monitoring → problem detection → coordination of reaction → reaction mechanisms; the runtime-provided layers are driven by policies and connect to the application’s reaction mechanisms through the ISTORE API] Self-maintenance in ISTORE also consists of proactive, continuous self-testing and analysis

22 Slide 22 The ISTORE Approach Divides self-maintenance into two components: 1) reactive self-maintenance: dynamic reaction to exceptional system events »self-diagnosing, self-monitoring hardware »software monitoring and problem detection »automatic reaction to detected problems 2) proactive self-maintenance: continuous online self-testing and self-analysis »in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they’re used »automatic characterization of system components

23 Slide 23 Continuous Online Self-Testing Self-maintaining systems should automatically carry out preventative maintenance –need aggressive in situ component testing via »fault injection: triggering hardware and software error handling paths to verify their integrity/existence »stress testing: pushing HW/SW components past normal operating parameters »scrubbing: periodic restoration of potentially “decaying” hardware or software state ISTORE periodically isolates nodes from the system and performs extensive self-tests –nodes can be easily isolated due to ISTORE’s built-in redundancy »even in a deployed, running system

24 Slide 24 Self-Testing: Hardware The goal of hardware self-testing is to detect flaky components and preserve data integrity Examples: –fault injection: power cycle disk to check for stiction –stress testing: run disk controller at 100% utilization to test behavior under load –scrubbing: read all disk sectors and rewrite any that suffer soft errors; “fire” disk if too many errors
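
A minimal sketch of the scrubbing example, assuming a hypothetical disk object whose read path recovers data through retries/ECC and reports soft errors; none of these names come from ISTORE.

    def scrub_disk(disk, max_soft_errors=100):
        """Read every sector, rewrite sectors that returned soft errors, and
        retire ("fire") the disk if it produced too many of them."""
        soft_errors = 0
        for sector in range(disk.num_sectors):
            data, had_soft_error = disk.read_sector(sector)
            if had_soft_error:
                soft_errors += 1
                disk.write_sector(sector, data)   # refresh the decaying sector
        if soft_errors > max_soft_errors:
            disk.retire()                          # too flaky to keep in service
        return soft_errors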

25 Slide 25 Self-Testing: Software Software self-testing proactively identifies weaknesses in software before they cause a visible failure –helps prevent failure due to bugs that only appear in certain hardware/software configurations –helps identify bugs that occur when software is driven into an untested state only reachable in a live system »e.g., long uptimes, heavy load, unexpected requests Examples –fault injection (includes HW- and SW-induced faults that the SW is expected to handle): SCSI parity error, invalid return codes from operating system –stress testing: heavy load, pathological requests –scrubbing: restart/reboot long-running software

26 Slide 26 Online Self-Analysis Self-maintaining systems require knowledge of their components’ dynamic runtime behavior –current “plug-and-play” hardware approaches are not sufficient »need more than just discovery of new devices’ functional capabilities and supported APIs –also need dynamic component characterization

27 Slide 27 Characterizing HW/SW Behavior An ISTORE may contain black-box components –heterogeneous hardware devices –application-supplied reaction mechanisms whose implementations are hidden To select and tune adaptation algorithms, the ISTORE system needs to understand the behavior of these components –in the context of a complex, live system –examples: »characterize performance of disks in system, use that data to select destination disks for replica creation »isolate two nodes, invoke replication from one to the other, monitor actions taken by application (e.g., how long it takes, how much data is moved)
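
The first characterization example could look something like this sketch: measure each candidate disk brick in the live system, then prefer the fastest ones as replica destinations. The `api` methods are hypothetical.

    def pick_replica_destinations(api, candidates, num_replicas, test_mb=64):
        """Characterize candidate bricks and return the best replica targets."""
        measured = []
        for node in candidates:
            bw = api.measure_write_bandwidth(node, megabytes=test_mb)  # live measurement
            measured.append((bw, node))
        measured.sort(key=lambda pair: pair[0], reverse=True)   # fastest bricks first
        return [node for _bw, node in measured[:num_replicas]]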

28 Slide 28 Support for Application Self-tuning ISTORE’s characterization mechanisms can also help applications tune themselves –current systems require manual tuning to meet scalability and performance goals »especially true for shared-nothing systems in which computational and storage resources aren’t pooled –possible research direction is to expose characterization information to application via an extension of the ISTORE API –this would allow “aware” applications to automatically adapt their behavior based on system conditions

29 Slide 29 ISTORE API The ISTORE API defines interfaces for –adaptation algorithms to invoke application reaction mechanisms »e.g., migrate data, replicate data, checkpoint, shutdown,... –applications to provide hints to the runtime system so it can optimize adaptation algorithms & data storage »e.g., application tags data whose unavailability can be temporarily tolerated –runtime system to invoke application self-testing and fault injection, and for application to report results –runtime system to inform application about current state of system, hardware capabilities,...
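
One plausible shape for the application-facing half of this API is an abstract interface the application implements; the sketch below is an assumption about names and signatures, not the actual ISTORE API.

    from abc import ABC, abstractmethod

    class IStoreApplication(ABC):
        """Reaction mechanisms and hooks an application exposes to the runtime."""

        @abstractmethod
        def migrate_data(self, src_node, dst_node): ...

        @abstractmethod
        def replicate_data(self, data_id, dst_node): ...

        @abstractmethod
        def checkpoint(self, node): ...

        @abstractmethod
        def shutdown(self, node): ...

        @abstractmethod
        def run_self_test(self, node):
            """Invoked by the runtime during proactive self-testing; the
            application reports its results back through the return value."""

        def tolerates_unavailability(self, data_id):
            """Hint to the runtime: data tagged True may be made temporarily
            unavailable while the runtime optimizes adaptation or placement."""
            return False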

30 Slide 30 Summary ISTORE focuses on Scalability, Availability, and Maintainability for emerging data- intensive network applications ISTORE provides a platform for deploying self-maintaining systems that are up 24x7 ISTORE will achieve self-maintenance via: –hardware platform with integrated diagnostic support –reactive self-maintenance: a layered, policy-driven runtime system that provides a framework for monitoring and reaction –proactive self-maintenance: support for continuous on-line self-testing and component characterization –and a standard API for interfacing applications to the runtime system

