Advanced Space Computing with System-Level Fault Tolerance Grzegorz Cieslewski, Adam Jacobs, Chris Conger, Alan D. George ECE Dept., University of Florida.

Advanced Space Computing with System-Level Fault Tolerance Grzegorz Cieslewski, Adam Jacobs, Chris Conger, Alan D. George ECE Dept., University of Florida NSF CHREC Center

2 Outline Overview NASA Dependable Multiprocessor Reconfigurable Fault Tolerance (RFT) Space Applications Novel Computing Platforms RapidIO Conclusions

3 Overview What is advanced space computing?  New concepts, methods, and technologies to enable and deploy high-performance computing in space – for an increasing variety of missions and applications Why is advanced space computing vital?  On-board data processing Downlink bandwidth to Earth is extremely limited Sensor data rates, resolutions, and modes are dramatically increasing Remote data processing from Earth is no longer viable Must process sensor data where it is captured, then downlink results  On-board autonomous processing & control Remote control from Earth is often not viable Propagation delays and bandwidth limits are insurmountable Space vehicles and space-delivered vehicles require autonomy Autonomy requires high-speed computing for decision-making Why is it difficult to achieve?  Cannot simply strap a Cray to a rocket! Hazardous radiation environment in space Platforms with limited power, weight, size, cooling, etc. Traditional space processing technologies (RadHard) are severely limited  Potential for long mission times with diverse set of needs Need powerful yet adaptive technologies Must ensure high levels of reliability and availability

4 Taxonomy of Fault Tolerance First, let us define various possible modes/methods of providing fault tolerance (FT)  Many other options beyond simply throwing triple-modular redundancy (TMR) at the problem  Software FT vs. hardware FT concepts largely similar, differences only at implementation level  Radiation-hardening not listed, falls under “prevention” as opposed to detection or correction Detect Correct or Mask Fault-Tolerant HLL (e.g. MPI) FT-HLL Concurrent Error Detection CED Self-Checking Pairs SCP Algorithm-Based Fault-Tolerance ABFT Error Correction Codes ECC N-Version Programming NVP Byzantine Resilience BR Checkpointing & Roll-back CR Software-Implemented Fault Tolerance SIFT N-Modular Redundancy NMR Temporal and spatial variants possible for many techniques Most of these FT modes are currently being used at UF

5 NASA/Honeywell/UF Project 1 st Space Supercomputer  Funded by NASA NMP  In-situ sensor processing  Autonomous control  Speedups of 100  to 1000   First fault-tolerant, parallel, reconfigurable computer for space Infrastructure for fault-tolerant, high-speed computing in space  Robust system services  Fault-tolerant MPI services  Application services  FPGA services  Standard design framework  Transparent API to resources for earth & space scientists NASA Dependable Multiprocessor (DM)

6 Dependable Multiprocessor DM System Architecture  Dual system controllers Redundant radiation-hardened PPC boards Monitor data processors’ health and communicate with spacecraft  Data processing engines High-performance, low-power COTS SBCs running Linux PowerPC with AltiVec capabilities Optional FPGA co-processor for additional performance Scalable to 20 data processing nodes  Redundant Interconnect Dual GigE connections Automatically switch networks when error is detected DM Middleware (DMM)  FT System Services Manages status and health of multiple concurrent jobs  FT Embedded MPI (FEMPI) Lightweight subset of MPI Allows fault recovery without restarting an entire parallel application  Application & FPGA Services Commonly used libraries such as ATLAS, FFTW, GSL Simplified, generic API for FPGA usage through USURP *  High-Availability Middleware Framework used to enable health monitoring of cluster * USURP is a standardized interface specification for RC platforms, developed by researchers at UF

7 DMM Components Mission Manager (MM)  Controls high-level job deployment  Facilitates replication of lower-level jobs Spatial or temporal replication Automatically compares and validates outputs  Monitors real-time deadlines  Enables roll-forward / roll-back when faults occur Job Manager (JM)  Controls low-level job deployment and scheduling across system FT Manager (FTM)  Manages low-level system faults (node crash, job crash) JM Agent (JMA)  Deploys and monitors programs on given node  Provides application “heartbeat” to system controller Mass Data Store (MDS)  Provides reliable centralized data services  Enables reliable checkpointing

8 Algorithm-Based Fault Tolerance Commonly refers to matrix coding method that is preserved through certain linear algebra operations  Matrix and vector multiply Discrete Fourier Transform Discrete Wavelet Transform  Matrix decomposition: C = AB (LU, QR, Cholesky)  Matrix inversion Used to detect errors in these operations, and in certain cases allows for error correction ABFT algorithms integrate with DM through Application Services API An improved method of using ABFT on the 2D-FFT and SAR has been researched at UF  Uses Hamming encoding  Low overhead due to ABFT Important aspects of ABFT currently under investigation at UF  Round-off analysis  Coverage analysis  Code types  Encoding and Decoding strategies  Overhead Fault-tolerant Partial Transform Computation Flow of Fault-tolerant 2D-FFT Experimental Overhead of Fault-tolerant RDP vs. a Fault-intolerant Version

9 Source Code Transformations Most science applications are inherently non-fault-tolerant  Requires SIFT framework to improve reliability Possible to immunize programs against most errors by transforming application source code  Less overhead  More control over FT techniques  Compiler-independent  Integrates with DM system through Application Services API Custom source-to-source (S2S) transformation tool is currently under development at UF  Accepts C source files as inputs  Generates fault tolerant versions  Uses fine-grain NMR-type of approach to provide improved reliability and dependability  Provides means of control flow checking (CFC) through software  Minimizes number of undetected errors Transformation options to be supported by the tool  Variable replication  Function replication  Memory duplication / memory checking  Synchronization intervals  Condition evaluation Post-evaluation verification Evaluation using replicated variables  Block protection

10 Reconfigurable Fault Tolerance GOAL – Research how to take advantage of reconfigurable nature of FPGAs, to provide dynamically-adaptive fault tolerance in RC systems  Leverage partial reconfiguration (PR) where advantageous  Explore virtual architectures to enable PR and reconfigurable fault tolerance (RFT) MOTIVATION – Why go with fixed/static FT, when performance & reliability can be tuned as needed?  Environmentally-aware & adaptive computing is wave of future  Achieving power savings and/or performance improvement, without sacrificing reliability CHALLENGES – limitations in concepts and tools, open-ended problem requires innovative solutions  Conventional methods typically based upon radiation- hardened components and/or fault masking via chip-level TMR  Highly-custom nature of FPGA architectures in different systems and apps makes defining a common approach to FT difficult Satellite orbits, passing through the Van Allen radiation belt

11 Reconfigurable FT Virtual Architecture for RFT  Novel concept of adaptable component-level protection (ACP)  Common components within VA: Adaptable protection frame – largely module/design-independent (see figure above) Error Status Register (ESR) for system-level error tracking/handling Re-synchronization controller or interfaces, for state saving and restoration Configuration controller, two options:  Internal configuration through ICAP  External configuration controller  Benefits of internal protection: Early error detection and handling = faster recovery Redundancy can be changed into parallelism PR can be leveraged to provide uninterrupted operation of non-failed components  Challenges of internal protection: Impossible to eliminate single points of failure, may still need higher-level (external) detection and handling Stronger possibility of fault/error going unnoticed Single-event functional interrupts (SEFI) are major concern AB B 2× parallel, SCP A no parallel, TMR B A D C 4× parallel, single BLANKBLANK BLANKBLANK no parallel, SCP “sockets” for modules Adaptable Component- level Protection VA concept diagram FPGA

12 Space Applications Synthetic Aperture Radar (SAR)  Used to form high-resolution images of Earth’s surface from moving platform in space  Patch-based processing with significant amount of overlap between patch boundaries Parallelizable on multiple levels of granularity, possible without need for any inter-processor communication (one patch per node) 2-dimensional data set, can range in size from several hundred Megabytes to Gigabytes Data set not significantly reduced through course of application Highly amenable to ABFT; possible application for the Dependable Multiprocessor project

13 Space Applications Hyperspectral Imaging (HSI)  Uses traditional beamforming techniques to perform coarse-grained classification on hyperspectral images  Adjustable to enable real-time processing Mostly embarrassingly parallel, exception being weight computation (shown in red below) 3-dimensional data set, reduced through course of application Auto-correlation sample matrix (ACSM) calculation and beamforming (detection) amenable to ABFT  Suggest NMR for weight computation (weight) Parallel and multi-FPGA decompositions explored

14 Space Applications Cosmic Ray Elimination  Uses image processing techniques to remove artifacts caused by cosmic rays  Image shows pre- and post-processed versions of a Hubble Telescope observation Images are highly parallelizable, with minimal communication necessary Main computation: median filtering  Fault-tolerant median filter developed Other portions of algorithm replicated by hand or S2S translator Other aerospace-related application kernels  Space-Time Adaptive Processing (STAP)  Ground Moving Target Indicator (GMTI)  Airborne LIDAR  Digital Down Conversion (DDC)  PDF Estimation

15 Novel Computing Platforms Fixed multi-core (FMC) devices  Cell Heterogeneous, vector compute engine, 3.2 GHz clock rate, ~70 W max. power consumption  GPU Homogeneous, many (e.g. 100+) stream processors, ~1.5 GHz clock rate, ~120 W max. power consumption Reconfigurable multi-core (RMC) devices  Field-Programmable Object Array (FPOA) Heterogeneous, coarse-grained processing elements, 1 GHz clock rate, ~35 W max power consumption  Field-Programmable Gate Array (FPGA) Heterogeneous, fine-grained processing elements, max. clock rate ~500 MHz, achievable clock rate varies, ~30 W max. power consumption  Tilera Homogeneous, coarse-grained processing elements (64 32-bit MIPS-like processors on-chip), ~750 MHz clock rate, ~30 W max. power consumption  Element CXi Heterogeneous, coarse-grained processing elements, 200 MHz clock rate, ~1 W max. power consumption Cell processor block diagram - http://www.research.ibm.com/journal/rd/494/kahle.html FPOA architecture - http://www.mathstar.com/Architecture.php

16 RC: Vital Technology for Space HPEC devices featured here; similar results vs. 65nm Xeon, 90nm GPU, etc. (see RSSI’08). Results excerpted from pending presentation from CHREC-UF site for HPEC’08 Workshop. Versatility in space missions (adapts as needs demand) Fixed archs. burdened with fixed choices, limited tradeoffs Performance in space missions (speed, power, size, etc.) e.g. Computational density per Watt (CDW) device metric  FPGAs far exceed FMC devices (CPU, Cell, GPU, etc.) Parallel Operations – scales up to max. # of adds and mults (# of adds = # of mults) possible Achievable Frequency – lowest frequency after PAR of DSP & logic-only impls. of add & mult comp. cores [FPGA] Power – scales linearly with resource util; max. power reduced by ratio of achievable freq. to max. freq. [FPGA] Parallel Operations – scales up to max. # of adds and mults (# of adds = # of mults) possible Achievable Frequency – lowest frequency after PAR of DSP & logic-only impls. of add & mult comp. cores [FPGA] Power – scales linearly with resource util; max. power reduced by ratio of achievable freq. to max. freq. [FPGA]

17 RapidIO High-speed embedded system interconnect, replacement for bus-based backplanes  Parallel and serial variants, serial is wave of future  Multiple programming models Research with RapidIO at UF  Simulative research studying capability of RapidIO-based computing platforms to support space-based radar (SBR) processing  Custom testbed designed and built, for verification of simulation models & experimentation with RapidIO & FPGAs Experimental logic analyzer measurements Visualization of simulated GMTI application progress

18 Conclusions Fault tolerance for space should be more than RadHard components & spatial TMR designs  Fixed worst-case designs extremely limited in perf/Watt  Instead, many FT methods & modes can be exploited  Adaptive systems that react to environmental changes  COTS featured inside critical performance path  RadHard for FT management, outside critical perf. path UF active on many space-related FT issues  NASA Dependable Multiprocessor, CHREC RFT F4-08  Modes: SIFT, ABFT, S2S, RFT, FEMPI, CR, CED, etc.  Devices: PPC/AV, FPGA, FPOA, Tilera, ElementCXi, etc.  Space apps: HSI, SAR, LIDAR, GMTI, CRE, et al.

19 2009 IEEE Aerospace Conference Track 7.12 Dependable Software for High Performance Embedded Computing Platforms  Transient error detection and recovery techniques  Compiler-based fault-tolerant techniques  Algorithm-based fault-tolerant techniques  Tools and techniques for designing reliable software  SIFT management frameworks  Software dependability analysis  Adaptive fault-tolerant techniques  FT applications Track Chairs  Richard LindermanRichard.Linderman@rl.af.mil  Grzegorz Cieslewskicieslewski@hcs.ufl.edu Dates  Abstract Submissions Due: July 1st, 2008  Paper Submissions Due: November 2nd, 2008

Advanced Space Computing with System-Level Fault Tolerance Grzegorz Cieslewski, Adam Jacobs, Chris Conger, Alan D. George ECE Dept., University of Florida.

Similar presentations

Presentation on theme: "Advanced Space Computing with System-Level Fault Tolerance Grzegorz Cieslewski, Adam Jacobs, Chris Conger, Alan D. George ECE Dept., University of Florida."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Advanced Space Computing with System-Level Fault Tolerance Grzegorz Cieslewski, Adam Jacobs, Chris Conger, Alan D. George ECE Dept., University of Florida.

Similar presentations

Presentation on theme: "Advanced Space Computing with System-Level Fault Tolerance Grzegorz Cieslewski, Adam Jacobs, Chris Conger, Alan D. George ECE Dept., University of Florida."— Presentation transcript:

Similar presentations

About project

Feedback