
The RTES Project – BTeV, and Beyond

Michael J. Haney 1, Shikha Ahuja 2, Ted Bapty 2, Harry Cheung 3, Zbigniew Kalbarczyk 4, Akhilesh Khanna 4, Jim Kowalkowski 3, Derek Messie 5, Daniel Mossé 6, Sandeep Neema 2, Steve Nordstrom 2, Jae Oh 5, Paul Sheldon 7, Shweta Shetty 2, Dmitri Volper 5, Long Wang 4, Di Yao 2

1 High Energy Physics, University of Illinois, 1110 W. Green Street, Urbana, IL USA
2 Institute for Software Integrated Systems, Vanderbilt University, Nashville, TN USA
3 Fermi National Accelerator Laboratory, Batavia, IL USA
4 Electrical and Computer Science, University of Illinois, Urbana, IL USA
5 Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY USA
6 Computer Science, University of Pittsburgh, Pittsburgh, PA USA
7 Physics and Astronomy Department, Vanderbilt University, Nashville, TN USA

M. Haney; RT 2005 The RTES Project - BTeV, and Beyond

Outline
Real Time Embedded System Project
  – BTeV => RTES
Prototypes
  – SuperComputing 2003
  – Demo System 2004
Beyond BTeV

BTeV - High Energy Physics
Input: 500 GB/s (2.5 MHz)
Level 1 processing: 190 μs
  – rate of 396 ns
  – 528 “8 GHz” G5 CPUs (factor of 50 event reduction)
  – high performance interconnects
Level 2/3 processing: ms (factor of 10+2 event reduction)
  – 1536 “12 GHz” CPUs
  – commodity networking
Output: 200 MB/s (4 kHz) = 1-2 Petabytes/year
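The rates quoted on this slide can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check, not part of the RTES software: the constant and function names are made up for illustration, and reading the 1-2 Petabytes/year figure as the result of running for only part of the year is an assumption, not something the slide states.

```python
# Back-of-the-envelope check of the numbers quoted on the slide.
# All names here are illustrative; only the numeric inputs come from the slide.
INPUT_BYTES_PER_S = 500e9        # Input: 500 GB/s
INPUT_RATE_HZ = 2.5e6            # Input: 2.5 MHz
L1_LATENCY_S = 190e-6            # Level 1 processing: 190 microseconds
CROSSING_PERIOD_S = 396e-9       # "rate of 396 ns"
L1_REDUCTION = 50                # Level 1: factor of 50 event reduction
OUTPUT_BYTES_PER_S = 200e6       # Output: 200 MB/s
OUTPUT_RATE_HZ = 4e3             # Output: 4 kHz
SECONDS_PER_YEAR = 365 * 24 * 3600


def summarize():
    """Return the quantities implied by the slide's figures."""
    l1_output_rate_hz = INPUT_RATE_HZ / L1_REDUCTION
    return {
        # Average event size entering Level 1
        "input_event_bytes": INPUT_BYTES_PER_S / INPUT_RATE_HZ,
        # A 396 ns period corresponds to roughly 2.5 MHz
        "crossing_rate_hz": 1.0 / CROSSING_PERIOD_S,
        # Events being processed at once during the Level 1 latency
        "events_in_flight_l1": L1_LATENCY_S / CROSSING_PERIOD_S,
        # Rate handed from Level 1 to Level 2/3
        "l1_output_rate_hz": l1_output_rate_hz,
        # Reduction Level 2/3 must supply to reach the output rate
        "l23_reduction": l1_output_rate_hz / OUTPUT_RATE_HZ,
        # Average event size written out
        "output_event_bytes": OUTPUT_BYTES_PER_S / OUTPUT_RATE_HZ,
        # Volume if the output ran continuously all year (assumption)
        "petabytes_per_year_full_duty": OUTPUT_BYTES_PER_S * SECONDS_PER_YEAR / 1e15,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value:,.1f}")
```

Under these inputs the events average about 200 kB on input and 50 kB on output, roughly 480 events are in flight within Level 1 at any moment, and Level 2/3 has to reduce the rate by a factor of 12.5, which matches the quoted "10+2". Continuous output at 200 MB/s would come to about 6.3 Petabytes/year, so the quoted 1-2 Petabytes/year would correspond to writing data for roughly 16-32% of the year.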

BTeV’s Need
“Given the very complex nature of this system where thousands of events are simultaneously and asynchronously cooking, issues of data integrity, robustness, and monitoring are critically important and have the capacity to cripple a design if not dealt with at the outset… BTeV [needs to] supply the necessary level of “self-awareness” in the trigger system.”
  – [June 2000 Project Review]

thus, RTES
The Real Time Embedded System Group
  – University of Illinois
  – University of Pittsburgh
  – Syracuse University
  – Vanderbilt University (PI)
  – Fermilab
Physicists and Computer Scientists/Electrical Engineers with expertise in
  – High performance, real-time system software and hardware,
  – Reliability and fault tolerance,
  – System specification, generation, and modeling tools.
NSF ITR grant ACI

The RTES Solution
Model Integrated Computing
  – Graphical representation of complex system, with modeling (simulation) resources
ARMORs
  – To protect Linux processes
      And sub processors
VLAs
  – To monitor/mitigate at every level
      embedded, supervisory Linux, Linux trigger farm, etc.

Modeling Environment: GME*
  – Fault handling
  – Process dataflow
  – HW Configuration
* GME is an Open-Source, Meta-configurable, multi-aspect graphical modeling tool

ARMOR: Adaptive Reconfigurable Mobile Objects of Reliability
Heartbeat ARMOR
  – Detects and recovers FTM failures
Fault Tolerant Manager
  – Highest ranking manager in the system
Daemons
  – Detect ARMOR crash and hang failures
ARMOR processes
  – Provide a hierarchy of error detection and recovery.
  – ARMORs are protected through checkpointing and internal self-checking.
Execution ARMOR
  – Oversees application process (e.g. the various Trigger Supervisor/Monitors)
[Diagram labels: Daemon; Fault Tolerant Manager (FTM); Daemon; Heartbeat ARMOR; Daemon; Exec ARMOR; App Process; network]
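The heartbeat idea on this slide (one element watches another and recovers it when it goes silent) can be illustrated with a small sketch. This is not the ARMOR implementation or its API: the class `HeartbeatMonitor`, the function `recover`, the timeout value, and the process names are all hypothetical, chosen only to show timeout-based crash/hang detection followed by a restart.

```python
import time


class HeartbeatMonitor:
    """Records the last heartbeat time per process and flags silent ones.

    Hypothetical illustration of timeout-based detection; not ARMOR code.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}

    def beat(self, name):
        """Record a heartbeat from the named process."""
        self.last_seen[name] = self.clock()

    def suspects(self):
        """Return names whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


def recover(monitor, restart):
    """Call restart(name) for each suspect, then reset its heartbeat timer."""
    restarted = []
    for name in monitor.suspects():
        restart(name)
        monitor.beat(name)
        restarted.append(name)
    return restarted


if __name__ == "__main__":
    fake_now = [0.0]                             # simulated clock, in seconds
    mon = HeartbeatMonitor(timeout_s=2.0, clock=lambda: fake_now[0])
    mon.beat("FTM")
    mon.beat("exec_armor")
    fake_now[0] = 3.0                            # 3 s pass
    mon.beat("exec_armor")                       # only exec_armor is still alive
    print(recover(mon, restart=lambda name: None))   # prints ['FTM']
```

A silent process cannot be told apart from a hung one by this check alone, which is one reason the slide pairs detection with checkpointing and internal self-checking.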

Very Lightweight Agents
Minimal footprint
Platform independence
  – Employable everywhere in the system!
Monitors hardware and software
Handles fault detection & communications with higher level entities
[Diagram labels: Physics Application; Hardware; OS Kernel (Linux); VLA; L2/L3 Manager Nodes (Linux); Physics Application; Level 2/3 Farm Nodes (Linux); Network API]
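To make the agent's role concrete, here is a minimal sketch of a monitor that polls a few probes, compares each reading with a limit, and reports violations to a higher-level entity. It is an assumption-laden illustration, not the RTES VLA: the class name, the probe names, the limits, and the message format are all invented for this example.

```python
class VeryLightweightAgent:
    """Polls probes and reports out-of-limit readings upward.

    Hypothetical sketch; the real VLA interfaces are not shown in the slides.
    """

    def __init__(self, node, probes, limits, report):
        self.node = node        # identifier of the node being monitored
        self.probes = probes    # metric name -> zero-argument callable
        self.limits = limits    # metric name -> maximum allowed value
        self.report = report    # callable that forwards a fault message

    def poll(self):
        """Read every probe once and report each value above its limit."""
        faults = []
        for metric, probe in self.probes.items():
            value = probe()
            limit = self.limits[metric]
            if value > limit:
                message = {
                    "node": self.node,
                    "metric": metric,
                    "value": value,
                    "limit": limit,
                }
                self.report(message)
                faults.append(message)
        return faults


if __name__ == "__main__":
    inbox = []   # stands in for the higher-level manager
    agent = VeryLightweightAgent(
        node="worker-1.1",
        probes={"temperature_c": lambda: 81.0, "queue_depth": lambda: 12},
        limits={"temperature_c": 75.0, "queue_depth": 100},
        report=inbox.append,
    )
    agent.poll()
    print(inbox)
```

Keeping the probes and the reporting channel as plain callables is one way to keep the agent itself small and independent of the platform it runs on.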

RTES view of the BTeV L1 Trigger

SC2003 Prototype
[Diagram labels: Gateway PC - Windows OS; DATA; DSP - BIOS; Physics Application; Physics Application; Very Light Monitor Agent; TCP/IP; PC - Linux OS; EPICS Graphical Display System; TCP/IP; COMMANDS; ARMOR Microkernel; Recovery Policy; Msg Parser; Local Manager; ARMOR DSP Interface Daemon; ARMOR Microkernel; Recovery Policy; Msg Parser; Local Manager; ARMOR EPICS Interface Daemon]

EPICS GUI

Independent Review
Following SuperComputing 2003, a software review was conducted
  – GME needs to coherently address multiple, differing domains
      System modeling, messaging, fault mitigation, Run Control function, GUI, other
  – ARMORs need to be easily customized
      Via GME
  – Overall packaging and version control - vital

Domain-specific languages
GME models, metamodels, and interpreters for
  – system description, messaging, state machine (run control, ARMOR), GUI
Each language generates appropriate artifacts
  – C++, Python, Matlab M-files, Elvin config files, etc.
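The pattern described here, one model and several generated artifacts, can be sketched in a few lines. The example below does not use GME or its interpreters: the model is a plain dictionary, and the state names, event names, and generator functions are hypothetical. It only shows how a single run-control state-machine description might yield both a C++ declaration and a Python transition table.

```python
# A hypothetical run-control model; in RTES this would come from a GME model.
RUN_CONTROL_MODEL = {
    "name": "RunControl",
    "states": ["Idle", "Configured", "Running"],
    "transitions": [
        ("Idle", "configure", "Configured"),
        ("Configured", "start", "Running"),
        ("Running", "stop", "Configured"),
    ],
}


def generate_cpp_enum(model):
    """Emit a C++ enum class listing the model's states (a text artifact)."""
    body = ",\n".join(f"    {state}" for state in model["states"])
    return f"enum class {model['name']}State {{\n{body}\n}};\n"


def generate_transition_table(model):
    """Emit a lookup table: (current state, event) -> next state."""
    return {(src, event): dst for src, event, dst in model["transitions"]}


def step(table, state, event):
    """Apply one event; events with no transition leave the state unchanged."""
    return table.get((state, event), state)


if __name__ == "__main__":
    print(generate_cpp_enum(RUN_CONTROL_MODEL))
    table = generate_transition_table(RUN_CONTROL_MODEL)
    state = "Idle"
    for event in ["configure", "start", "stop"]:
        state = step(table, state, event)
    print(state)   # prints Configured
```

Because both artifacts are derived from the same description, a change to the model propagates to every generated target, which is the main argument for generating code from models rather than maintaining each target by hand.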

Versioning/Build System
[Diagram labels: Run Tree; System Executables; Build Tree; UDM Translators; Canonical XML models; Domain Models; Metamodels; Language Specification; Domain Artifacts; Compiler/Linker; Translator Source Files; Models; Language Specification; Object; Source; Artifacts; OUT; IN; OUT; IN]

Demo System L2/3 Trigger
[Diagram labels: FTM; Global Mgr; Heartbeat/Source node; Regional Mgr 1; Worker 1.1; HB; Exec ARMOR; Filter 1; Filter 2; Event Builder; Worker 1.2; Regional Mgr 2; Exec ARMOR; Worker 2.1; Elvin Router; GUI; Region 1; Elvin msg; ARMOR msg; Exec ARMOR; Event Source]

Demo System 2004
[Diagram labels: Iron; Ganglia; public; private; laptop: Matlab, Elvin; laptop: Matlab, Elvin; Boulder: Elvin; Global: RC, ARMOR; Regional: RC, ARMOR; Worker: RC, VLA, ARMOR, FilterApp; Worker: RC, VLA, ARMOR, FilterApp; …; Regional: RC, ARMOR; Worker: RC, VLA, ARMOR, FilterApp; Worker: RC, VLA, ARMOR, FilterApp; …; Regional: RC, ARMOR; Worker: RC, VLA, ARMOR, FilterApp; …; DataSource: file reader]

Matlab GUI

Beyond BTeV - CMS
GME modeling for XDAQ
  – System descriptions, state machines, messaging…
  – Work in progress
Fault tolerance for HLT
  – ARMORs and VLAs
      Being discussed
Balancing CMS needs and RTES goals
  – Adding value, without requiring changes

Beyond BTeV - LQCD
Lattice Gauge Theory Computation
  – farm at Fermilab
Single-point sensitivities
  – Single process fault can compromise entire farm computation
  – Checkpointed; can be restarted, but…
ARMORs and VLAs
  – Batch/autonomous protection
      No operator
  – Dynamic mix of protection requirements
      Not a (quasi)static L2/3 Trigger

Beyond BTeV - Grid, Other
Grid Projects
  – Load balancing and network studies
      Nodes-in-farm => farms-in-grid
      Resource driven, deadline driven, other
  – Extension of studies done for BTeV/partitioning
Other - Dark Energy Survey (astro camera)
  – “Simple” system (few nodes)
  – Not hard real-time (can reacquire image)
      But it will be a good case-study for the “cost” of incorporating RTES (GME, ARMORs, VLAs)

Conclusions
The RTES project developed two prototypes (L1, and L2/3) for BTeV
  – Demonstrated at conferences
RTES is now applying its design-time modeling and runtime middleware to several high performance heterogeneous embedded application environments
