Slide 1: Resiliency-Aware Data Management
Matthias Boehm (1), Wolfgang Lehner (1), Christof Fetzer (2)
TU Dresden; (1) Database Technology Group, (2) Systems Engineering Group
August 30, 2011

Slide 2: Motivation: Increasing Error Rates
Increasing component error rates:
- Decreasing feature sizes (new technology generations)
- Reduced supply voltage
- Static (hard) vs. dynamic (soft) errors
- ~8% increase in error rate per technology generation [Borkar05]
- 25,000 to 70,000 FIT/Mbit [Schroeder09]
Increasing system error rates:
- Increasing scale: number of components (cores, transistors) and memory capacities
- Example: even with a fixed error rate per component, the probability that at least one component fails grows with the component count, e.g., P = 0.039 (see the worked example below)
[Figure: CPU and memory components exposed to cosmic radiation (~95% neutrons); P(at least one component fails) rises with system scale]
=> Errors and error-prone behavior will become the normal case.
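The slide's failure probability follows from independent per-component errors: P(at least one of n components fails) = 1 - (1 - p)^n. A minimal sketch in Python; the per-component probability and component count below are assumed for illustration, since the slide states only the result:

```python
def p_any_failure(p_component: float, n_components: int) -> float:
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_component) ** n_components

# Hypothetical values: 40 components at p = 0.001 reproduce ~0.039.
print(p_any_failure(0.001, 40))  # ~0.0392
```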

Slide 3: Motivation: Resiliency Costs
- Implicit (silent) vs. explicit (detected/corrected) errors
- State of the art: error detection and correction at the HW/OS level
State of the art, resilient memory:
- ECC, parity bits, memory scrubbing, full data redundancy
- Extended Hamming ECC (7+1,4), with code sizes (8,4), (16,11), (32,26), (64,57)
[Figure: parity layout over data bits d1..d4 with parity bits p1..p3 and overall parity bit P]
State of the art, resilient computing:
- Computation redundancy
- Double modular redundancy (DMR): execute task A twice (A, A') and compare the results
- Triple modular redundancy (TMR): execute task A three times (A, A', A'') and vote (see the sketch below)
=> Such resiliency mechanisms cause "resiliency costs".
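For concreteness, a minimal sketch of DMR-style detection and TMR-style voting; it re-executes a task sequentially, whereas real deployments run the replicas on independent hardware:

```python
from collections import Counter

def dmr(task, *args):
    """Double modular redundancy: detects (but cannot correct) an error."""
    a, b = task(*args), task(*args)
    if a != b:
        raise RuntimeError("DMR mismatch: error detected")
    return a

def tmr(task, *args):
    """Triple modular redundancy: majority voting masks one faulty run.
    Results must be hashable so they can be counted for the vote."""
    results = [task(*args) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("TMR failure: no majority among the replicas")
    return value
```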

Slide 4: Motivation: Resiliency Costs (2)
Resiliency cost categories:
- Performance overhead (throughput, latency)
- Memory overhead
- Energy consumption
- Monetary HW costs
Resiliency at the OS level:
- Memory overhead (capacity, bandwidth)
- Computation overhead
- Energy consumption (increased time)
Resiliency at the HW level:
- Monetary HW costs (chipset, ECC RAM)
- Energy consumption (time, chip space)
- Computation overhead
[Figure: stack of data management, OS/middleware, and HW infrastructure; CPU with ECC-protected L3 cache and memory controller attached to ECC RAM]
=> Increasing error rates imply increasing resiliency costs!

Slide 5: Vision of Resiliency-Aware Data Management (section divider)

Slide 6: Vision Overview
Problem of the state of the art:
- Resiliency-awareness at the HW/OS level is general-purpose
- Increasing error rates lead to increasing resiliency costs
Key observations:
- Applications have different resiliency requirements (mission-critical queries vs. nice-to-have analytics)
- The data management layer has context knowledge
Resiliency-aware data management:
- Exploit the context knowledge of query processing and data storage
- Efficiency (reduced resiliency costs) and effectiveness (error detection/correction)
[Figure: layered architecture with data system, access system, and storage system on top of OS/middleware and HW infrastructure; queries Q_i, updates U_i, and input streams arrive at the top; configuration flows down, HW/OS primitives are exposed upward]

Slide 7: Resilient Database Challenges
- C1: Resilient query processing
- C2: Resilient data storage
- C3: Resiliency-aware optimization

Slide 8: C1: Resilient Query Processing
Challenge:
- Problem: missing or invalid tuples (explicit/implicit errors)
- Goal: reliable query results via error correction and error-tolerant algorithms
Example (advanced analytics):
- Q: Ψ_{k=365}( γ( σ_{a<107}( R ⋈ S ⋈ T ⋈ U )))
- Computation redundancy: execute the query plan and a redundant guard/check plan, then cross-check (see the sketch below)
- Aspects: check-plan scheduling, operator semantics, intermediate results
[Figure: two plan trees over R, S, T, U with selection σ_{a<107}, aggregation γ, and forecasting Ψ_{k=365}; the check plan guards the full plan]
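A minimal sketch of such a guarded execution, assuming plans are callables over the same input and result bags are compared via an order-insensitive checksum; the interface and names are hypothetical, not from the paper:

```python
import hashlib
import pickle

def bag_checksum(rows):
    """Order-insensitive checksum over a bag of result tuples."""
    return sum(int.from_bytes(hashlib.sha256(pickle.dumps(r)).digest()[:8],
                              "big") for r in rows) % (1 << 64)

def run_guarded(plan, guard_plan, data):
    """Execute a plan and a redundant guard plan; a checksum mismatch
    signals a possible silent error and triggers escalation."""
    result = plan(data)
    if bag_checksum(result) != bag_checksum(guard_plan(data)):
        raise RuntimeError("guard plan mismatch: possible silent error")
    return result
```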

Slide 9: C1: Resilient Query Processing (2)
Example (advanced analytics, continued):
- Forecasting setup: AR(2) model, MSE loss, L-BFGS-B optimizer, C40 energy demand data (see the sketch below)
- Error injection: P(error) = 0.01, corrupted values in [0, max], N = 100
- Aspects: approximate query results, error-tolerant algorithms, error-proportional overhead
[Figure: experimental results on the impact of injected errors on forecast accuracy]
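For reference, a minimal sketch of the forecasting setup named on the slide, fitting an AR(2) model by minimizing the MSE with L-BFGS-B; the energy demand data and the error injection are not reproduced here:

```python
import numpy as np
from scipy.optimize import minimize

def fit_ar2(series: np.ndarray) -> np.ndarray:
    """Fit y_t = c + a1*y_{t-1} + a2*y_{t-2} by minimizing the MSE."""
    y, y1, y2 = series[2:], series[1:-1], series[:-2]

    def mse(params):
        c, a1, a2 = params
        return np.mean((y - (c + a1 * y1 + a2 * y2)) ** 2)

    return minimize(mse, x0=np.zeros(3), method="L-BFGS-B").x
```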

Slide 10: C2: Resilient Data Storage
Challenge:
- Problem: data loss and data corruption (explicit/implicit errors)
- Goal: data stability via data redundancy and error correction
Example (data partitioning):
- Table R(a, b, c)
- Data redundancy via synopses and replicas (tables R and R' with synopses S_R and S_R')
- Time-based and on-the-fly error detection and correction (see the sketch below)
Optimization:
- Exploit the multiple replicas via complementary layouts
- E.g., different sort orders, partitioning schemes, compression schemes
- Aspects: test scheduling, multiple replicas, workload characteristics
[Figure: replicas of table R(a, b, c) with complementary column orders (abc vs. acb), each guarded by a synopsis]
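A minimal sketch of replica-based resilient storage; the class and its interface are hypothetical. Two replicas of R with complementary sort orders are each guarded by a checksum synopsis, and periodic scrubbing repairs a corrupted replica from the healthy one:

```python
import zlib

class ReplicatedTable:
    """Two replicas with complementary layouts plus checksum synopses."""

    def __init__(self, rows):
        # Complementary layouts: one replica sorted on column a, one on c.
        self.replicas = [sorted(rows, key=lambda r: r[0]),
                         sorted(rows, key=lambda r: r[2])]
        self.synopses = [self._checksum(r) for r in self.replicas]

    @staticmethod
    def _checksum(replica):
        # Order-insensitive synopsis over the row set.
        return zlib.crc32(repr(sorted(replica)).encode())

    def scrub(self):
        """Time-based error detection and correction."""
        ok = [self._checksum(r) == s
              for r, s in zip(self.replicas, self.synopses)]
        for i, healthy in enumerate(ok):
            if not healthy:
                if not ok[1 - i]:
                    raise RuntimeError("both replicas corrupted")
                # Repair from the healthy replica, restoring this layout.
                key = (lambda r: r[0]) if i == 0 else (lambda r: r[2])
                self.replicas[i] = sorted(self.replicas[1 - i], key=key)

# Usage with made-up rows: inject a silent corruption, then scrub repairs it.
t = ReplicatedTable([(1, "x", 9), (2, "y", 5), (3, "z", 7)])
t.replicas[0][1] = (2, "y", 999)
t.scrub()
```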

Slide 11: C3: Resiliency-Aware Optimization
Challenge:
- Problem: large search space of query processing and data storage configurations, HW heterogeneity
- Goal: multi-objective optimization over performance, accuracy, energy, and resiliency
Example (dynamic frequency/voltage scaling, DFS/DVS):
1) Choose the frequency level
2) Select the voltage scheme
3) Optimize the voltage
- E.g., decreased frequency/voltage saves energy at the cost of performance and, for voltage, a higher error rate (see the sketch below)
[Table: qualitative effects (+/-) of DFS and DVS on performance, energy, errors, and accuracy; the underlying trade-off is convex]
- Aspects: multi-objective, global, architecture-aware optimization
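A minimal sketch of scalarized multi-objective selection over discrete DVFS levels; the level estimates and the interface are made up for illustration:

```python
def choose_dvfs(levels, max_error_rate, w_perf=1.0, w_energy=1.0):
    """Pick a (frequency, voltage) level by weighted-sum scalarization,
    subject to an upper bound on the tolerated error rate."""
    feasible = [l for l in levels if l["error_rate"] <= max_error_rate]
    if not feasible:
        raise ValueError("no level satisfies the error-rate bound")
    return min(feasible,
               key=lambda l: w_perf * l["runtime"] + w_energy * l["energy"])

# Hypothetical level estimates (lower voltage: less energy, more errors):
levels = [
    {"freq_ghz": 2.4, "volt": 1.20, "runtime": 1.0, "energy": 1.0, "error_rate": 1e-9},
    {"freq_ghz": 1.8, "volt": 1.05, "runtime": 1.3, "energy": 0.6, "error_rate": 1e-7},
    {"freq_ghz": 1.2, "volt": 0.90, "runtime": 1.9, "energy": 0.5, "error_rate": 1e-5},
]
print(choose_dvfs(levels, max_error_rate=1e-6))
```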

Slide 12: Conclusion
Problem of the state of the art:
- General-purpose resiliency mechanisms at the HW/OS level
- Increasing error rates lead to increasing resiliency costs
Summary:
- Vision of "resiliency-aware data management"
- Challenge C1: resilient query processing
- Challenge C2: resilient data storage
- Challenge C3: resiliency-aware optimization
- Research directions and more in the paper!
Conclusion and new opportunities:
- Resiliency-aware data management can reduce resiliency costs
- Research opportunity: reconsidering many DB aspects w.r.t. resiliency
- Collaboration opportunity: interdisciplinary research field (HW, OS, systems, DB)

Slide 13: Choose your Resiliency Level!

Slide 14: Resiliency-Aware Data Management (closing slide; title and authors repeated)

Slide 15: Background and Related Work (backup section divider)

Slide 16: Background and Related Work
Taxonomy:
- Faults (technology defects), errors (system-internal), failures (system-external)
Static vs. dynamic errors (memory/computation):
- Static (hard, permanent): e.g., static variability, aging
- Dynamic (soft, transient): e.g., cosmic radiation, dynamic variability
Implicit vs. explicit errors:
- Implicit: silent errors, covered by general-purpose techniques (ECC, etc.)
- Explicit: detected or corrected errors
Related work at the DB level:
- Error-aware frameworks (e.g., MapReduce/Hadoop): general-purpose techniques
- Recovery processing and replication [Upadhyaya11]: reacting to explicit errors
- Implicit errors [Graefe09, Borisov11, Simitsis10]: specific data management aspects
=> Holistic resilient data management.

Slide 17: Choose your Resiliency Level!

Slide 18: TX Level vs. Resiliency Level
Similarities:
- Different application requirements on integrity (TX: physical and operational integrity; resiliency: physical integrity)
- Ensuring integrity incurs cost overheads
- Context knowledge can be exploited to reduce these costs (TX: transaction scheduling, i.e., logical serialization; resiliency: the challenges and use cases above)
Differences:
- Configuration granularity: different TX levels can be handled concurrently, whereas configuring HW parameters can globally influence all queries running on that HW component
- Scope: TX ensures integrity for the running query or transaction (assumption: the DB is transformed from one consistent state to another by transactions only), whereas resiliency covers both computation and data integrity